
AI Meets PostgreSQL – The pgvector Revolution in Text Search

PostgreSQL has been a top relational database for a long time. Now, in the AI age, we also need to store vector data and run fast similarity searches over it. That’s where pgvector comes in: an extension for PostgreSQL that lets it store and search vector data efficiently.

What is vector data?

Imagine you have a map, but instead of cities and streets, it shows your favorite things: movies, music, and funny cat videos. Each thing is like a dot on the map, and if the dots are close, it means those things are similar. For example, a comedy movie dot might be close to another comedy, but not so much to a sci-fi movie. In the digital world, vectors are like dots on a map representing information, like text or images. By comparing how close these dots are, we can tell how similar things are, even if they use different words or look a bit different. It’s like a secret language for computers to understand meanings beyond just words or pixels – kind of like magic, but with math! We represent vector data as lists of floating-point numbers.

Example: [0.5, 1.2, -0.3, 4.7] represents a 4-dimensional vector.
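
As a quick sketch of how such a value looks inside the database (assuming pgvector is already installed and the extension created, as shown later in this post), you can cast a literal to the vector type:

SELECT '[0.5, 1.2, -0.3, 4.7]'::vector;
-- Returns the value as a pgvector "vector" with 4 dimensions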

Other commonly used vector databases

  • Pinecone
  • Faiss
  • Vespa
  • Milvus

pgvector – Revolution in Text Search

Think of traditional text search as a detective carefully going through documents, word by word, to find what matches. pgvector is more like an AI agent. It stores text, images, or sounds as high-dimensional vectors, numerical fingerprints that capture the core meaning of the data. As mentioned above, if two vectors are close in this space, their meanings are a lot alike. That’s where cosine distance helps: it looks at the angle between two vectors, and if the angle is small, the vectors are very similar.
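
As a minimal sketch of the idea (assuming the vector extension is installed), pgvector exposes cosine distance through the <=> operator, which you can try directly on vector literals:

-- Same direction: cosine distance 0 (maximum similarity)
SELECT '[1, 2, 3]'::vector <=> '[2, 4, 6]'::vector AS cosine_distance;

-- Opposite direction: cosine distance 2 (minimum similarity)
SELECT '[1, 2, 3]'::vector <=> '[-1, -2, -3]'::vector AS cosine_distance;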

Installing pgvector on Ubuntu

sudo apt install postgresql-15-pgvector

Installing pgvector on CentOS

sudo yum install pgvector_15

Note: Here, 15 is the PostgreSQL server version; change it to match your installation.

Follow the official README for other installation methods, e.g., Docker:

https://github.com/pgvector/pgvector/blob/master/README.md

What problem did pgvector resolve?

pgvector set out to solve several key problems in PostgreSQL related to search and data representation:

Limited text and data representation: Traditional PostgreSQL stores data in simple formats like text, numbers, and dates. This lack of flexible representation makes it difficult to capture the true meaning of, and relationships between, complex data points like text documents, images, or scientific data. pgvector introduces multi-dimensional vectors as a data type, allowing a richer representation of data based on its inherent qualities and relationships. In addition, pgvector’s vector-based approach, along with indexing techniques like HNSW and IVFFlat, enables faster and more efficient search over large datasets, providing scalability for real-time applications and big data environments.

Inefficient similarity search: Traditional text searches in PostgreSQL rely on keyword matching or pattern recognition, which can be slow when looking for similar documents or content and cannot search on the meaning of the data. pgvector enables efficient similarity search based on distance metrics like cosine distance, allowing you to find items with similar meanings even if they don’t share the same words or patterns (a minimal query sketch follows below).

Lack of integration with machine learning: pgvector bridges the gap between PostgreSQL and machine learning by enabling seamless integration with vector embeddings generated by machine learning models. This allows you to leverage the power of machine learning for tasks like document retrieval, image search, and recommendation systems directly within your PostgreSQL database.
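
To make this concrete, here is a minimal sketch of a meaning-based query, assuming a hypothetical documents table with a title column and an embedding vector column (a full, step-by-step walkthrough appears later in this post):

-- Return the five documents whose embeddings are closest in meaning to the query embedding
SELECT title
FROM documents
ORDER BY embedding <=> '[0.12, 0.87, 0.45]'::vector  -- illustrative 3-dimensional query embedding
LIMIT 5;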


Exact Nearest Neighbor Search vs Approximate Nearest Neighbor Search

pgvector supports two primary types of search: Exact Nearest Neighbor Search (ENN) and Approximate Nearest Neighbor Search (ANN). Both focus on finding similar vectors in a high-dimensional space, but they differ in their approach and offer different advantages and disadvantages (a short SQL sketch follows the comparison below).

Exact Nearest Neighbor Search

Definition: Finds the vector(s) in the dataset that are truly closest to the query vector.
Algorithm: Employs a brute-force approach, comparing the query vector to every vector in the database.
Accuracy: Guarantees finding the true nearest neighbor, ensuring the highest possible precision.
Performance: Requires comparing the query vector to all available vectors, leading to potentially slow computation and high resource usage, especially for large datasets.
Common Use Cases: When absolute accuracy is crucial, such as in critical financial calculations or scientific analysis.

Approximate Nearest Neighbor Search

Definition: Finds vectors that are approximately similar to the query vector within a certain tolerance.
Algorithm: Leverages indexing techniques to identify candidate vectors without comparing them all, significantly reducing search space.
Accuracy: Provides an approximate answer, potentially missing the true nearest neighbor but often returning close alternatives.
Performance: Offers significantly faster search times and lower resource consumption compared to ENN, making it suitable for large datasets and real-time applications.
Common Use Cases: When real-time response and scalability are prioritized over absolute accuracy, such as in recommender systems, image/video retrieval, and natural language processing applications.

Choosing between ENN and ANN depends on your specific needs:
Accuracy: If you need to find the absolute closest match with utmost precision, ENN is the choice.
Performance: For real-time applications and large datasets, ANN offers faster speeds and resource efficiency.
Data Size: The larger the dataset, the more significant the performance advantage of ANN becomes.
Tolerance for Inaccuracy: Consider how acceptable a close-but-not-exact match is for your application.

Additional Factors

Distance Metric: Both ENN and ANN can work with various distance metrics like L2, inner product, and cosine distance.
Indexing Techniques: Different ANN algorithms like HNSW and IVF offer varying performance and accuracy trade-offs.
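
In pgvector, the difference is easy to see in practice: without an index, a distance-ordered query performs an exact (brute-force) scan, while creating an index switches it to approximate search. A minimal sketch, assuming an items table whose embedding column was declared with a fixed dimension such as vector(3):

-- Exact nearest neighbor search: no index, every row is compared to the query vector
SELECT id FROM items ORDER BY embedding <-> '[1, 2, 3]'::vector LIMIT 5;

-- Approximate nearest neighbor search: an HNSW index trades a little recall for speed
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
SELECT id FROM items ORDER BY embedding <-> '[1, 2, 3]'::vector LIMIT 5;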

L2 distance, inner product, and cosine distance explained with a simpler example

Imagine two apples in a fruit basket

L2 distance

This tells you how far apart the apples are in the basket, like measuring the straight-line distance between them. The bigger the distance, the less similar they are. This is used when you need to know the exact difference between things, like distances on a map or differences in prices.
In pgvector, exact nearest neighbor search can use L2 distance directly to find the vectors closest to the query vector.

Inner product

This is like comparing the taste and texture of the apples. A high inner product means they’re both sweet and crispy, while a low one means they’re different (maybe one is sour and the other is soft). This is used when you want to compare how similar things are in terms of specific qualities, like comparing documents with similar topics or music with similar rhythms.

Cosine distance

Let’s assume that each apple has a tiny arrow on top, pointing in a certain direction. Now, if you want to know how much these apples are alike, you can check the angle between their arrows. This is used when you want to know how aligned things are in a general sense, like finding similar pictures or recommending relevant products based on your past purchases.

NOTE: pgvector provides built-in operators for all three metrics: <-> for L2 distance, <#> for (negative) inner product, and <=> for cosine distance.
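
A quick sketch of the three operators on vector literals (assuming the vector extension is installed):

SELECT '[1, 2, 3]'::vector <-> '[4, 5, 6]'::vector AS l2_distance;             -- straight-line distance
SELECT '[1, 2, 3]'::vector <#> '[4, 5, 6]'::vector AS negative_inner_product;  -- <#> returns the negative inner product
SELECT '[1, 2, 3]'::vector <=> '[4, 5, 6]'::vector AS cosine_distance;         -- 1 minus cosine similarity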

Indexes in pgvector

pgvector provides two types of indexes, IVFFlat and HNSW. Both are Approximate Nearest Neighbor (ANN) indexes used to speed up searching for similar vectors, and choosing the right one depends on several factors. Here is a breakdown of their key differences (index creation statements follow the table):

Feature      | IVFFlat                                                           | HNSW
Algorithm    | Divides the vector space into partitions                          | Builds a graph-like structure of similar vectors
Build Time   | Significantly faster                                              | Significantly slower
Memory Usage | Less memory                                                       | More memory
Accuracy     | Generally less accurate, especially for high-dimensional vectors  | Generally better accuracy, especially for high-dimensional vectors
Performance  | Faster for large datasets and simple queries                      | Good for moderate-sized datasets
Adaptability | May not perform well for all vector distributions                 | Handles diverse vector distributions well
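
Both index types are created with standard CREATE INDEX statements. A minimal sketch, assuming an items table whose embedding column was declared with a fixed dimension (e.g. vector(384)) and that cosine distance is used for queries:

-- IVFFlat: clusters the vectors into lists; fast to build, smaller in memory
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- HNSW: builds a layered graph; slower to build and larger, but usually more accurate
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);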

Understanding IVFFlat and HNSW with real-world example

Imagine you live in a city and need to find a similar restaurant to the one you love (a “nearest neighbor”).

IVFFlat is like having a map with labeled neighborhoods

Divides the city into different areas based on general characteristics.
Can tell you quickly which neighborhoods might have similar restaurants.
Leads you to restaurants within those neighborhoods that might be close, but not necessarily the most similar in every aspect.
Faster to get started with the map (faster index build).
Requires less memory to store the map.

HNSW is like having a helpful local guide

Knows the city well and understands how restaurants are connected (similar flavors, vibes, etc.).
Can quickly lead you to restaurants close to your favorite in terms of overall experience, even if they’re not identical.
Takes more time for the guide to learn the city (building the index).
Requires the guide to remember more information (higher memory usage).

Choosing the right method depends on your priorities

Accuracy: HNSW is like the guide and prioritizes finding the restaurant most similar to your favorite, even if it takes longer.
Speed: IVFFlat is like the map and focuses on finding a similar restaurant quickly, even if it’s not perfect.

HNSW works better for complex city layouts (high-dimensional data), whereas IVFFlat is better for very large cities (large datasets).

Process of Data Embedding

To use vector data, we first need to generate it. The process of creating vector data is called embedding. It is like a secret language that bridges the gap between the world of words, images, or sounds and the numerical form computers work with.

Gathering Data: Think about a shelf full of books, an art gallery with lots of art, or a bunch of melodies playing together. When we start embedding, it means we’re collecting the basic stuff we want to show as vectors.

Choosing an Embedding Model: There are different embedding models available for specific types of information. Some common ones are:

  • Hugging Face sentence-transformers models (e.g. all-MiniLM-L6-v2) or text-embedding-ada-002 from OpenAI for text
  • VGG16 or ResNet50 for images
  • Audio2Vec for sounds

Training the Model: This is where the magic happens! The model analyzes the data, identifying patterns, relationships, and hidden meanings. It maps words, images, or sounds to their corresponding vectors, crafting a unique numerical representation for each one.

Generating Vectors: Once the model has learned the language of the data, it can transform new inputs into vectors on demand.

HuggingFace or OpenAI Embedding Models – What to Choose?

Choosing between Hugging Face and OpenAI for your embedding needs depends on several factors:

Feature       | Hugging Face                                          | OpenAI
Focus         | Breadth and flexibility                               | High-performance embedding
Models        | Wide range of NLP models for various tasks            | Specialized embedding models
Cost          | Mostly free, some costs for specific models or APIs   | Paid access through plans
Ease of use   | Beginner-friendly with tools and community support    | API or Hugging Face model hub access
Customization | Open-source tools offer more control                  | Less control over model customization

Ultimately, the best choice depends on your specific needs, budget, and technical expertise. Consider carefully what you prioritize and compare the available options before making your decision.

NOTE: We will cover data embedding via different models in upcoming blogs

Utilizing pgvector: A Practical Illustration

Imagine a scenario where we store various words as vectors in a database. We then provide different words to the database and ask it to find the most similar ones.

Words we will store in the database: Apple, Banana, Cat, and Dog

Words we will search for: Mango and Horse

Create extension: Once pgvector is installed, you first need to create the required extension

CREATE EXTENSION vector;

Create a table: To store vector data

CREATE TABLE items (id bigserial PRIMARY KEY, name varchar(100), embedding vector);

Generate Embedding

We are going to use Hugging Face to generate our embeddings.

Install required Python packages

pip3 install sentence-transformers
pip3 install langchain

Python3 code to generate the embeddings

from langchain.embeddings import HuggingFaceEmbeddings
# Load the all-MiniLM-L6-v2 sentence-transformers model
embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')
# Print the embedding for the word "Apple": a list of floating-point numbers
print(embeddings.embed_query("Apple"))

NOTE: The model used here to generate embeddings is all-MiniLM-L6-v2.

Running the script prints the embedding for the word Apple as a long list of floating-point numbers.

Insert data into the database

INSERT INTO items (name, embedding) VALUES ('Apple', '<Your_Apple_Embedding>');

NOTE: Replace <Your_Apple_Embedding> with the embedding generated in the previous step

Similarly, we insert the embeddings for the remaining words: Banana, Cat, and Dog.

After inserting all your keywords, we can examine the number of rows in the items table.

select count(*) from items; 
count 
------- 
4 
(1 row)

Vector search

To find a word, we will once again create the embedding for the word and utilize the generated embedding to search for the word within the database.

Insert the generated Mango embedding into the query below to get the results

SELECT name, 1-(embedding <=> '<Mango_embedding>') as cosine_similarity FROM items ORDER BY cosine_similarity DESC LIMIT 2;

Breakdown of the SELECT Query Structure

SELECT name: This selects the name of the item from the items table.

1-(embedding <=> '<Mango_embedding>') as cosine_similarity: This expression calculates the cosine similarity between each item’s embedding and the supplied Mango embedding, and aliases the result as the cosine_similarity column.

embedding <=> '<Mango_embedding>': The <=> operator calculates the cosine distance between two vectors. Here it compares each item’s embedding (stored in the embedding column) to the placeholder for the Mango embedding.

1 - ... : The cosine distance is subtracted from 1 to convert it into cosine similarity, so that higher values represent greater similarity.

ORDER BY cosine_similarity DESC: This sorts the results in descending order based on the calculated cosine similarity, putting the items most similar to “Mango” at the top.

LIMIT 2: This restricts the output to only the top 2 most similar items, based on the calculated cosine similarity.

Finally, by running the above query we get the result as below

name  | cosine_similarity 
--------+-------------------- 
Banana | 0.5249995745718354 
Apple  | 0.4006202280151715
(2 rows)

Insert the generated Horse embedding into the query below to get the results

SELECT name, 1-(embedding <=> '<Horse_embedding>') as cosine_similarity FROM items ORDER BY cosine_similarity DESC LIMIT 2;
name |  cosine_similarity 
------+--------------------- 
Dog  |  0.5349170288394121 
Cat  | 0.41092748143620783
(2 rows)

Results Analysis

Mango and Horse are not directly present in the table: if they were, a perfect match (cosine similarity of 1) would have appeared at the top of the results.

Banana and Apple have moderate semantic similarity to Mango

  • Banana (0.5249995745718354) is the most similar, but not an exact match.
  • Apple (0.4006202280151715) has a lower similarity, suggesting a weaker relationship.

Dog and Cat have moderate semantic similarity to Horse

  • Dog (0.5349170288394121) is the most similar, but not an exact match.
  • Cat (0.41092748143620783) has a lower similarity, suggesting a weaker relationship.

It is worth noting that cosine similarity ranges from -1 to 1; with typical text embeddings, the values you see usually fall between 0 and 1.

1 signifies an exact match, meaning the items are identical in terms of the representation used for comparison.
0 indicates no similarity, meaning the vectors are orthogonal and the items have no overlap in their representation.

Values in between represent varying degrees of similarity.

In forthcoming blog posts, we will explore how to use LLMs alongside PostgreSQL and pgvector to address real-world scenarios effectively.
