Jeremy Maslanko

Similarity Search Explained

14 January 2024

Large Language Models (LLMs) have exploded in popularity over the last year. As a result, I have encountered many new "AI Practitioners": people who do not always have the strongest understanding of some of the technical details behind these models and their various applications. This is not to dismiss any one individual as non-technical. A software engineer who builds a Retrieval Augmented Generation (RAG) application is certainly a technical person, even if they don't understand all of the math behind it. But if you are new to the world of AI/ML and want a better understanding of some of the math that makes these tools so powerful, continue reading.

In this post, we will look at a few of the similarity measures used for retrieval in vector databases. I mentioned the RAG model above, but what exactly is it? And why is it so commonly used? The basic flow is as follows: a user types a question, that question is compared against a database of documents, and the document in that database most similar to the question is returned. That document is then combined with the original question, and the complete text is run through the LLM. In fact, I would argue that the similarity search is the most important part! Sure, the LLMs we are using are very impressive, but in this setup they are largely rephrasing the words that we enter into the prompt. The hard part, actually answering the question, comes from the document returned by the vector database.
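
To make that flow concrete, here is a minimal sketch of a RAG-style lookup. All of the names here (embed, vector_db, llm, most_similar) are hypothetical placeholders rather than any particular library's API; they just mirror the steps described above.

def answer_question(question, embed, vector_db, llm):
    '''Hypothetical RAG flow: embed the question, retrieve the most
    similar document, and have the LLM answer using that context.'''
    query_vec = embed(question)                   # text -> embedding vector
    document = vector_db.most_similar(query_vec)  # retrieval via similarity search
    prompt = f"Context:\n{document}\n\nQuestion: {question}"
    return llm(prompt)                            # LLM answers using the retrieved context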

What makes this such a great application for companies to implement is the ease with which you can now build a custom chat application on domain-specific knowledge. There is no training of a transformer from scratch on some large corpus of private data, or even any fine-tuning. Simply load the documents into a vector database, query it with some given text, and there you go, you (hopefully) have the answer you were looking for! There are plenty of tutorials on how to implement these solutions, so I will leave that for others to explain.

What's a vector embedding?

Before diving into the various approaches for calculating similarity, a brief overview of what vector embeddings are is helpful. In simple terms, an embedding is a numerical representation of text. There are a variety of ways to come up with the values that represent each word or token, but that is not important right now. Libraries like LangChain and HuggingFace allow you to easily convert your text into numbers, and once that is done, you can load them into a database.
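
As a quick illustration, here is one way to produce embeddings with the sentence-transformers library from the HuggingFace ecosystem. The specific model name is just an illustrative choice, not the only option.

from sentence_transformers import SentenceTransformer

# Load a small, general-purpose embedding model (example choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Each piece of text becomes a fixed-length vector of floats.
embeddings = model.encode([
    "How do I reset my password?",
    "Password resets are handled on the account settings page.",
])
print(embeddings.shape)  # (2, 384) for this particular model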

So, how do they work?

The document that is returned is the one with the smallest distance to the question that was asked. There are several ways to compute this distance. As a reminder, when I refer to the documents, I am referring to their numerical representations (i.e. their embeddings).

There are three main ways to compute the distance and they are as follows:

  • Euclidean Distance
  • Manhattan Distance
  • Cosine Similarity

Euclidean Distance

Below is the formula for the Euclidean distance:

\(d = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}\)

It may look familiar from a middle school math class. If we were comparing two 2-dimensional vectors, this formula would be the Pythagorean theorem used to find the length of the hypotenuse of a right triangle. When the vectors have more than 2 dimensions, we are calculating the same thing, just in a higher-dimensional space.
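
As a quick worked example, take two made-up 2-dimensional vectors \(x = (1, 2)\) and \(y = (4, 6)\):

\(d = \sqrt{(1-4)^2 + (2-6)^2} = \sqrt{9 + 16} = \sqrt{25} = 5\)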

Manhattan Distance

If we are in a city, we can think of the Euclidean distance as the direct distance between two points, as if we were able to walk straight through buildings. The Manhattan distance essentially takes the buildings into account and assumes we can move in only two directions (if we were dealing with two-dimensional vectors). For this reason, the Manhattan distance is also called the Taxicab distance. The formula is as follows:

\(d = \sum_{i=1}^{n} |x_i - y_i|\)

As we can see, the Manhattan distance is just the sum of the absolute difference between each element of the two vectors.
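
Using the same example vectors \(x = (1, 2)\) and \(y = (4, 6)\):

\(d = |1-4| + |2-6| = 3 + 4 = 7\)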

Cosine Similarity

The last measurement we will discuss is the Cosine Similarity. Unlike the two distances above, this is a similarity: values closer to 1 mean the two vectors point in nearly the same direction. The metric focuses on the angle between two vectors and not their magnitude. That is beneficial when the magnitudes of two vectors are very different, which can inflate the Euclidean distance and make them look dissimilar. If the magnitude does not matter for our data, the Cosine Similarity may show that the two are truly very similar.

\(cos(\theta) = \frac{\sum_{i=1}^{n} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{n}(x_i)^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i)^2}}\)

In the numerator, we multiply each pair of elements from the two vectors and sum the results (this is the dot product). The denominator has a bit more to it: here we multiply the L2 norms of the two vectors. The L2 norm is calculated by taking the square root of the sum of the squares of each element in the vector.
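
Continuing with the example vectors \(x = (1, 2)\) and \(y = (4, 6)\):

\(cos(\theta) = \frac{1 \cdot 4 + 2 \cdot 6}{\sqrt{1^2 + 2^2} \cdot \sqrt{4^2 + 6^2}} = \frac{16}{\sqrt{5} \cdot \sqrt{52}} \approx 0.992\)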

Python Implementation

Now that we have an understanding of the formulas, we can implement these in Python.

import numpy as np

class VectorSimilarity():
    '''
    Class for vector similarity search.

    - x, y: equal length numpy arrays
    '''

    def __init__(self, x: np.ndarray, y: np.ndarray) -> None:

        self.x = x
        self.y = y
        self._validate_params()

    def _validate_params(self):
        # Both inputs must be numpy arrays of the same length.
        if not (isinstance(self.x, np.ndarray) and isinstance(self.y, np.ndarray)):
            raise TypeError('Inputs must be numpy ndarrays!')
        if len(self.x) != len(self.y):
            raise ValueError('Input arrays must be equal length!')

    def euclidean(self):
        '''Returns the euclidean distance between the two vectors'''
        return np.sqrt(np.sum(np.square((self.x - self.y))))

    def manhattan(self):
        '''Returns the manhattan distance between the two vectors'''
        return np.sum(np.absolute(self.x - self.y))

    def cosine(self):
        '''Returns the cosine similarity between the two vectors'''

        num = np.sum(self.x*self.y)
        denom = np.sqrt(np.sum(np.square(self.x)))*np.sqrt(np.sum(np.square(self.y)))
        cos_theta = num/denom

        return cos_theta
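
A quick usage example, using the same vectors from the worked examples above:

x = np.array([1, 2])
y = np.array([4, 6])

sim = VectorSimilarity(x, y)
print(sim.euclidean())  # 5.0
print(sim.manhattan())  # 7
print(sim.cosine())     # ~0.992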

Hopefully you now have a better understanding of some of the similarity metrics used in vector retrieval. This functionality is at the core of RAG models when using LLMs on domain-specific documentation!