Using Database Vector Search to get Better Results from LLMs and RAG than with Traditional Search

• Jason Ladd

What are the Advantages of Vector Databases over Traditional Databases like SQL or NoSQL?

Pretty much all software developers are familiar with the concept of databases. Essentially, they store rows of data. In a SQL database, each row has exactly the same number and types of columns. NoSQL offers a bit more flexibility in that each row can differ from the others (these are sometimes referred to as 'document store' databases), but both kinds are basically the same in that they store concrete data and let you retrieve parts of it by sending requests called 'queries'. The thing about these queries is that they essentially do exact or fuzzy matching on values in the database. For example, “Show me all red cars in the database” would translate into SQL as something like:

SELECT * FROM cars WHERE color = 'red';

Of course, this is very useful when you're looking for exact information. Another example would be querying for a specific user record by their id number:

SELECT username, email FROM users WHERE user_id = 123 LIMIT 1;

This is very useful in the traditional sense of storing and retrieving data, but if we introduce LLMs into the mix, these types of queries can be pretty limiting. That's because we might want to give AIs the ability to search through large amounts of documents in a way that doesn't rely on exact string or number matches. For example, your company might have some documentation that describes what the return policy for your products is. With a SQL database, you could do a simple pattern-based search using something like this:

SELECT * FROM company_policy_documentation WHERE title LIKE '%return policy%';

What the above query does is simply look for any rows in the table where the title column contains the substring “return policy”. To break down what's happening in the syntax: the % symbol matches any substring (or nothing at all), and _ matches exactly one character. So here are a few other common variations on LIKE queries.

WHERE title LIKE 'return policy%'   -- starts with "return policy"
WHERE title LIKE '%return policy'   -- ends with "return policy"
WHERE title LIKE '_return policy%'   -- any single char, then "return policy", then anything
WHERE title ILIKE '%return policy%' -- case-insensitive version (Postgres-specific)
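
To make the % and _ semantics concrete, here's a small Go sketch of a LIKE-style matcher. This is a hypothetical helper for illustration, not how Postgres implements pattern matching internally; it just mirrors the wildcard rules described above using the classic dynamic-programming approach:

```go
package main

import "fmt"

// likeMatch implements SQL LIKE semantics over ASCII strings:
// '%' matches any run of characters (including none), '_' matches
// exactly one character, everything else matches literally.
func likeMatch(s, p string) bool {
	// dp[i][j] answers: does s[:i] match pattern p[:j]?
	dp := make([][]bool, len(s)+1)
	for i := range dp {
		dp[i] = make([]bool, len(p)+1)
	}
	dp[0][0] = true // empty string matches empty pattern
	// a leading run of '%' can match the empty string
	for j := 1; j <= len(p); j++ {
		if p[j-1] == '%' {
			dp[0][j] = dp[0][j-1]
		}
	}
	for i := 1; i <= len(s); i++ {
		for j := 1; j <= len(p); j++ {
			switch p[j-1] {
			case '%':
				// '%' matches nothing (dp[i][j-1]) or one more char (dp[i-1][j])
				dp[i][j] = dp[i][j-1] || dp[i-1][j]
			case '_':
				dp[i][j] = dp[i-1][j-1]
			default:
				dp[i][j] = dp[i-1][j-1] && s[i-1] == p[j-1]
			}
		}
	}
	return dp[len(s)][len(p)]
}

func main() {
	fmt.Println(likeMatch("our return policy", "%return policy%")) // true
	fmt.Println(likeMatch("return policy", "return policy%"))      // true
	fmt.Println(likeMatch("Return Policy", "%return policy%"))     // false: LIKE is case-sensitive
}
```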

One important thing to mention about these, though: although they all work for finding the substring, the only one that can take advantage of any indexing done on the table is the first one, WHERE title LIKE 'return policy%'. A prefix pattern maps onto a range scan of the Balanced Tree (B-Tree) index that might exist on the column. The others can't use it, because every relevant column's value still has to be scanned to see if there's a match somewhere inside it.
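
To see why a prefix pattern can use an index while a '%needle%' pattern can't, here's a toy Go sketch. A B-Tree keeps its keys in sorted order, so a prefix search is just a binary search to the first candidate followed by a short forward scan. The titles slice and helper below are made up for illustration:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// prefixMatches finds all strings in a sorted slice that start with
// prefix, the way a B-Tree index scan would: binary-search to the
// first candidate, then scan forward while the prefix still holds.
func prefixMatches(sorted []string, prefix string) []string {
	start := sort.SearchStrings(sorted, prefix)
	var out []string
	for i := start; i < len(sorted) && strings.HasPrefix(sorted[i], prefix); i++ {
		out = append(out, sorted[i])
	}
	return out
}

func main() {
	titles := []string{ // already in sorted order, like B-Tree keys
		"holiday schedule",
		"return policy",
		"return policy exceptions",
		"shipping rates",
	}
	fmt.Println(prefixMatches(titles, "return policy"))
	// A '%return policy%' search can't do this trick: rows containing
	// the substring aren't contiguous in sorted order, so every row
	// would have to be scanned.
}
```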

Because this isn't very efficient at large scale, you can use pg_trgm to build a trigram index that greatly improves the efficiency of the query, like this:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_documentation_title_trgm ON company_policy_documentation USING GIN (title gin_trgm_ops);

SELECT * FROM company_policy_documentation WHERE title LIKE '%return policy%';

Now your LIKE queries will be much faster because they'll be able to take advantage of the index. It's also worth noting that LIKE is case sensitive, so if you need case-insensitive queries, you'll need to use ILIKE instead.

But let's be honest, that still isn't very flexible. You still wouldn't get matches for typos, spelling variations, and abbreviations. There's yet another option, though: the similarity() function. Once you've enabled the pg_trgm extension like we did in the step above, you also have access to the similarity() function. You can use it like this:

-- Returns rows where title is "close enough" to the query string
SELECT * FROM company_policy_documentation WHERE similarity(title, 'return polisy') > 0.3;

-- Or using the % operator (same idea; it uses the pg_trgm.similarity_threshold setting)
SELECT * FROM company_policy_documentation WHERE title % 'return polisy';

The similarity() function returns a score between 0 and 1 for how close your query is to a result. The % operator compares that score against a configurable threshold, which defaults to 0.3 (in other words, at least a 30% match). You can tune it like this: SET pg_trgm.similarity_threshold = 0.3;
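
If you're curious what that score actually measures, here's a simplified Go re-implementation of trigram similarity. The real pg_trgm pads and splits per word and handles non-ASCII text, so treat this as a sketch of the idea rather than an exact match for Postgres's numbers:

```go
package main

import (
	"fmt"
	"strings"
)

// trigrams extracts the set of 3-character substrings from s, padded
// roughly the way pg_trgm does: two spaces in front, one behind.
// Byte-based indexing, so ASCII input only.
func trigrams(s string) map[string]bool {
	padded := "  " + strings.ToLower(s) + " "
	set := make(map[string]bool)
	for i := 0; i+3 <= len(padded); i++ {
		set[padded[i:i+3]] = true
	}
	return set
}

// similarity mimics pg_trgm's score: shared trigrams divided by the
// total number of distinct trigrams across both strings.
func similarity(a, b string) float64 {
	ta, tb := trigrams(a), trigrams(b)
	shared := 0
	for t := range ta {
		if tb[t] {
			shared++
		}
	}
	union := len(ta) + len(tb) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

func main() {
	// the typo barely changes the trigram set, so the score stays high
	fmt.Printf("%.2f\n", similarity("return policy", "return polisy"))
	// an unrelated string shares almost no trigrams
	fmt.Printf("%.2f\n", similarity("return policy", "store hours"))
}
```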

You can also use Levenshtein distance, which counts the number of single-character edits between your search string and the actual match, for example:

CREATE EXTENSION fuzzystrmatch;

SELECT * FROM company_policy_documentation WHERE levenshtein(title, 'return polisy') <= 2;
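
Under the hood, levenshtein() computes an edit distance. Here's a compact Go version of the classic dynamic-programming algorithm; it's byte-based (ASCII only) and a sketch of the idea rather than Postgres's exact implementation:

```go
package main

import "fmt"

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein counts the minimum number of single-character edits
// (insertions, deletions, substitutions) needed to turn a into b.
func levenshtein(a, b string) int {
	// prev holds the previous row of the dynamic-programming table:
	// turning a[:0] into b[:j] takes j insertions
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr := make([]int, len(b)+1)
		curr[0] = i // turning a[:i] into "" takes i deletions
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0 // characters match, no substitution needed
			}
			curr[j] = min3(
				prev[j]+1,      // delete from a
				curr[j-1]+1,    // insert into a
				prev[j-1]+cost, // substitute (or match)
			)
		}
		prev = curr
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(levenshtein("return policy", "return polisy")) // 1: one substitution
	fmt.Println(levenshtein("kitten", "sitting"))              // 3
}
```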

All these options can be very fitting for specific use cases, but they're still very limited in that they're based solely on specific strings and their variations. What if there was a way to search based on the meaning of the query instead of its actual string representation? For example, what if instead of searching for “return policy”, someone could ask, “What are your rules on refunds?” As you can see, the two queries don't share any of the same substrings or any similar variations, so none of the above techniques would work. But that's where Vector Databases come in; they actually allow you to do that. Let's look at how that's possible.

Vector Embeddings

Now, let's look at how we can update our database to not only be able to retrieve rows based on string matches, but also the 'meaning' of a query. What we want to be able to do is create a representation of the data in each row that captures the 'concept' of the row so we can search on that. This would be impossible with traditional computing but using Machine Learning, it's actually pretty feasible. Essentially we need two things here.

  1. An additional column in the table that holds a vector representation of the row's data.
  2. A Machine Learning model that allows us to extract a vector representation of that data, and of our queries.

If you're not familiar with vectors, a vector is just a list of numbers. At a basic level, we're just comparing patterns in those lists of numbers. We're not even gonna concern ourselves with how or why the phrases are similar. We'll just let an embedding model create the vectors, then let Postgres measure the distances between those vectors and the vectors of our search queries. Here's a general example of what I mean:

"return policy"         → [0.21, -0.83, 0.14, 0.67, ...]
"rules on refunds"      → [0.19, -0.79, 0.18, 0.71, ...]  ← close
"store hours"           → [-0.54, 0.32, -0.71, 0.09, ...] ← far away

In this example, we're not gonna build the Machine Learning model ourselves; we'll just use one from OpenAI. And another cool thing is that Postgres is already able to store vectors and query on them by distance, so we don't need to go learn any new database or anything like that. We just need to use the vector extension (pgvector) and add a vector column for the embeddings, like so:

CREATE EXTENSION vector;

ALTER TABLE company_policy_documentation ADD COLUMN embedding vector(1536);

-- HNSW index on the embedding column
CREATE INDEX ON company_policy_documentation USING hnsw (embedding vector_cosine_ops);

In addition to enabling the extension and adding the column, let's also add an index using HNSW (Hierarchical Navigable Small World) so that we don't have to compare against every single vector in the database. HNSW builds a multi-layer graph over the vectors that supports efficient Approximate Nearest Neighbor (ANN) search.

The distance metric we're using here is cosine similarity, which measures the angle between two vectors. If that's still not clear, think of it this way: take a pair of points in 2-D space, (x1, y1) and (x2, y2). You could measure the distance between them using Euclidean distance (the straight-line distance), or you could think of them as vectors that start at the origin and emanate out in some direction. In that case, there's an angle between them. If the cosine of that angle is close to 1, the two vectors point roughly in the same direction; that's cosine similarity. It's generally used when you only care about the directions the vectors point in, not their magnitudes. And when making these angle measurements, it doesn't matter how many components a vector has. Could be 2, could be 2,000, and it still works the same!
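
Here's what that measurement looks like in code, using the toy 4-dimensional vectors from the earlier example (real embeddings have 1,536 dimensions, but the math is identical):

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between two
// vectors: 1.0 means same direction, 0 means perpendicular,
// -1.0 means opposite. Vector magnitude is deliberately ignored.
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	returnPolicy := []float64{0.21, -0.83, 0.14, 0.67}
	rulesOnRefunds := []float64{0.19, -0.79, 0.18, 0.71}
	storeHours := []float64{-0.54, 0.32, -0.71, 0.09}

	fmt.Printf("%.3f\n", cosineSimilarity(returnPolicy, rulesOnRefunds)) // close to 1: similar meaning
	fmt.Printf("%.3f\n", cosineSimilarity(returnPolicy, storeHours))     // far from 1: unrelated
}
```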

This would allow us to do queries like this:

-- Query: find 10 documentation articles most semantically similar to a query embedding
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM company_policy_documentation
ORDER BY embedding <=> $1
LIMIT 10;

To explain what's going on in this query: <=> is the cosine distance operator, so it lets us compare our 'embedding' column with the query embedding being passed in as $1. Since <=> returns cosine distance (smaller means closer), we subtract it from 1 so we can return a 'similarity' column on each of our results, where a higher floating-point value means a closer match.

But like I mentioned earlier, we need a Machine Learning model that creates the embeddings for us. So here's an example of creating those embeddings on our app's back end by making a call to OpenAI's API:

func getEmbedding(ctx context.Context, text string) ([]float32, error) {
    // call your embedding provider - OpenAI, Cohere, local model, etc.
    resp, err := openaiClient.CreateEmbedding(ctx, openai.EmbeddingRequest{
        Model: openai.AdaEmbeddingV2,
        Input: []string{text},
    })
    if err != nil {
        return nil, err
    }
    return resp.Data[0].Embedding, nil
}

// At index time (errors ignored here for brevity)
embedding, _ := getEmbedding(ctx, documentation.Title + " " + documentation.Content)
db.Exec("UPDATE company_policy_documentation SET embedding = $1 WHERE id = $2", pgvector.NewVector(embedding), documentation.ID)

// At query time
queryVec, _ := getEmbedding(ctx, searchQuery)
rows, _ := db.Query(`
    SELECT id, title, 1 - (embedding <=> $1) AS similarity
    FROM company_policy_documentation ORDER BY embedding <=> $1 LIMIT 20
`, pgvector.NewVector(queryVec))

That's pretty much it! So now, if you have an AI agent doing customer service, and that agent gets a question about the company's return policy, it can use RAG (Retrieval Augmented Generation) to come up with an answer based directly on your company's policy, without having to have been pre-trained on it. Because it doesn't have to match strings exactly, you can get results based on the meaning of what the customer is looking for.

One more important thing to consider, though: you can potentially burn through a lot of API credits when making calls to get the embeddings. That can drive up cost unnecessarily in cases where one of the fuzzy searches we looked at above would have been good enough. A good strategy is a hybrid approach: try the fuzzy searches (or an indexed search engine like Elasticsearch) first to see if you get a hit, and fall back on vector search if you don't.
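
One way to sketch that hybrid strategy in Go: the function parameters here are hypothetical stand-ins for your actual fuzzy-search and vector-search calls.

```go
package main

import "fmt"

// hybridSearch tries the cheap fuzzy search first and only falls back
// to the (paid, per-API-call) embedding-based vector search when the
// fuzzy search finds nothing.
func hybridSearch(query string, fuzzySearch, vectorSearch func(string) []string) []string {
	if results := fuzzySearch(query); len(results) > 0 {
		return results
	}
	return vectorSearch(query)
}

func main() {
	// stub implementations standing in for real database calls
	fuzzy := func(q string) []string {
		if q == "return policy" {
			return []string{"Return Policy"} // substring/trigram hit
		}
		return nil // no fuzzy hit
	}
	vector := func(q string) []string {
		return []string{"Return Policy"} // semantic match via embeddings
	}

	fmt.Println(hybridSearch("return policy", fuzzy, vector))                   // fuzzy hit, no embedding API call
	fmt.Println(hybridSearch("what are your rules on refunds?", fuzzy, vector)) // falls back to vector search
}
```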
