Jeremy Maslanko

Similarity Search Explained

Jeremy Maslanko — Sun, 14 Jan 2024 12:00:00 -0400

Large Language Models (LLMs) have exploded in popularity over the last year. As a result, I have come across many new "AI practitioners": people who do not always have the strongest understanding of the technical details behind these models and their various applications. This is not to say that any one individual is not technical. A software engineer who builds a Retrieval Augmented Generation (RAG) model is certainly a technical person, even if they don't understand all of the math behind it. But if you are new to the world of AI/ML and are looking for a better understanding of some of the math that makes these tools so powerful, continue reading.

In this post, we will look at a few of the similarity measures that are used for retrieval in vector databases. I mentioned the RAG model above, but what exactly is it? And why is it so commonly used? The basics of the model are as follows: a user types a question, that question is compared to a database of documents, and the document in that database that is most similar to the question is returned. You then combine that document with the original question and run the complete text through the LLM. In fact, I would say that the similarity search is the most important part! Sure, the LLMs that we are using are very impressive, but all we are really doing is having them jumble up the words that we enter into the prompt. The hard part, answering the question, comes from the document that was returned from the vector database.

What makes this such a great application for companies to implement is the ease with which you can now build a custom chat application on domain-specific knowledge. There is no training of a transformer from scratch on some large corpus of private data, or even any fine-tuning. Simply load the documents into a vector database, query it based on some given text, and there you go, you have the answer you were hopefully looking for! There are plenty of tutorials on how to implement these solutions, so I will leave that for others to explain.

What's a vector embedding?

Before diving into the various approaches for calculating similarity, a brief overview of vector embeddings would be helpful. In simple terms, an embedding is a numerical representation of text. There are a variety of ways to come up with the values that represent each word or token, but that is not important right now. Libraries like LangChain and HuggingFace allow you to easily convert your text into numbers. And once that is done, you can load them into a database.
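
To make the idea of "text as numbers" concrete, here is a deliberately simple sketch. It is not how LangChain or HuggingFace models produce embeddings (those are learned, dense vectors); it just counts words against a tiny vocabulary. The `VOCAB` list and the `embed` function are made up for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary; real embedding models learn dense vectors instead.
VOCAB = ["cat", "dog", "sat", "mat", "ran"]

def embed(text: str) -> np.ndarray:
    """Count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return np.array([words.count(term) for term in VOCAB], dtype=float)

print(embed("The cat sat on the mat"))  # [1. 0. 1. 1. 0.]
```

Every piece of text becomes a vector of the same length, which is what lets us compare any two of them with the measures discussed in this post.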

So, how do they work?

The document that is returned is the one with the smallest distance to the question that was asked (or, when a similarity measure is used, the highest similarity). There are several ways to compute this distance between documents. As a reminder, when I refer to the documents, I am referring to the numerical representation of these documents (i.e. their embeddings).

There are three main ways to compute the distance and they are as follows:

Euclidean Distance
Manhattan Distance
Cosine Similarity

Euclidean Distance

Below is the formula for the Euclidean distance:

\(d = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}\)

It may look familiar from a middle school math class. If we were comparing two 2-dimensional vectors, this formula would be the same as using the Pythagorean theorem to find the length of the hypotenuse of a right triangle. When we have vectors with more than 2 dimensions, we are calculating the same thing, just in a multi-dimensional space.
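
As a quick check of the Pythagorean connection, here is a small worked example with two made-up 2-dimensional vectors. The differences along each axis are 3 and 4, so the distance should be the hypotenuse of a 3-4-5 triangle.

```python
import numpy as np

# Two made-up points; the differences along each axis are 3 and 4.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Direct translation of the formula: square the differences, sum, take the root.
d = np.sqrt(np.sum((x - y) ** 2))
print(d)  # 5.0

# NumPy's built-in L2 norm of the difference gives the same value.
print(np.linalg.norm(x - y))  # 5.0
```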

Manhattan Distance

If we are in a city, we can think of the Euclidean distance as the direct distance between two points, as if we were able to walk through buildings. The Manhattan distance essentially takes the buildings into account and assumes we can only travel along the street grid, one axis at a time (north-south or east-west, if we are dealing with two-dimensional vectors). For this reason, the Manhattan distance is also called the Taxicab distance. The formula is as follows:

\(d = \sum_{i=1}^{n} |x_i - y_i|\)

As we can see, the Manhattan distance is just the sum of the absolute differences between corresponding elements of the two vectors.
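
Here is the same kind of worked example for the Manhattan distance, again with two made-up vectors. Walking 3 blocks one way and 4 blocks the other gives a total of 7, which is longer than the straight-line distance of 5 between the same points.

```python
import numpy as np

# Two made-up points; the differences along each axis are 3 and 4.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Sum of the absolute differences, element by element.
d = np.sum(np.abs(x - y))
print(d)  # 7.0

# Equivalent: the L1 norm of the difference.
print(np.linalg.norm(x - y, ord=1))  # 7.0
```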

Cosine Similarity

The last measure we will discuss is the Cosine Similarity. Unlike the two distances above, it is a similarity, so larger values mean the vectors are more alike. This metric focuses on the angle between two vectors and not their magnitudes. This is beneficial when the magnitudes of two vectors differ, which can inflate the Euclidean distance and make the vectors look dissimilar. If we do not care about magnitude for our data, the Cosine Similarity may show that the two are in fact very similar.

\(\cos(\theta) = \frac{\sum_{i=1}^{n} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{n}(x_i)^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i)^2}}\)

In the numerator, we simply multiply each element pair of the two vectors and then sum the results (this is the dot product). The denominator has a bit more to it. Here we are multiplying the L2 norms of the two vectors. The L2 norm is calculated by taking the square root of the sum of the squares of each element in the vector.
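
To illustrate the point about magnitude, the sketch below compares a made-up vector with a scaled copy of itself and with a perpendicular vector. The `cosine_similarity` helper is just an illustrative name for the formula above.

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Dot product divided by the product of the L2 norms."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 3.0])
y = 10 * x                      # same direction, ten times the magnitude
z = np.array([-2.0, 1.0, 0.0])  # perpendicular to x (dot product is 0)

print(np.linalg.norm(x - y))    # Euclidean distance is large (about 33.7)
print(cosine_similarity(x, y))  # approximately 1.0
print(cosine_similarity(x, z))  # 0.0
```

The Euclidean distance treats `x` and `y` as far apart, while the cosine similarity treats them as pointing in the same direction.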

Python Implementation

Now that we have an understanding of the formulas, we can implement these in Python.

import numpy as np

class VectorSimilarity():
    '''
    Class for vector similarity search.

    - x, y: equal length numpy arrays
    '''

    def __init__(self, x: np.ndarray, y: np.ndarray) -> None:

        self.x = x
        self.y = y
        self._validate_params()

    def _validate_params(self):
        # Both inputs must be numpy arrays with identical shapes
        if not (isinstance(self.x, np.ndarray) and isinstance(self.y, np.ndarray)):
            raise TypeError('Inputs must be numpy ndarrays!')
        if self.x.shape != self.y.shape:
            raise ValueError('Input arrays must have the same shape!')

    def euclidean(self):
        '''Returns the euclidean distance between the two vectors'''
        return np.sqrt(np.sum(np.square((self.x - self.y))))

    def manhattan(self):
        '''Returns the manhattan distance between the two vectors'''
        return np.sum(np.absolute(self.x - self.y))

    def cosine(self):
        '''Returns the cosine similarity between the two vectors'''

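        # Numerator: dot product of the two vectors
        num = np.sum(self.x*self.y)
        # Denominator: product of the L2 norms of the two vectors
        denom = np.sqrt(np.sum(np.square(self.x)))*np.sqrt(np.sum(np.square(self.y)))
        # Note: the result is nan if either vector is all zeros
        cos_theta = num/denom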

        return cos_theta
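
As a rough sketch of how these metrics drive retrieval, the snippet below picks the closest document for a query under each measure. It is a toy illustration, not how any particular vector database works: the `top_match` function and the 3-dimensional embeddings are invented for the example, and real systems use approximate indexes over much larger vectors.

```python
import numpy as np

def top_match(query: np.ndarray, docs: np.ndarray, metric: str = "cosine") -> int:
    """Return the row index of the document embedding closest to the query."""
    if metric == "euclidean":
        return int(np.argmin(np.sqrt(np.sum((docs - query) ** 2, axis=1))))
    if metric == "manhattan":
        return int(np.argmin(np.sum(np.abs(docs - query), axis=1)))
    if metric == "cosine":
        sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
        return int(np.argmax(sims))  # similarity: larger is closer
    raise ValueError(f"Unknown metric: {metric}")

# Made-up 3-dimensional embeddings, one row per document.
docs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.2, 0.9],
])
query = np.array([0.1, 0.9, 0.1])

for metric in ("euclidean", "manhattan", "cosine"):
    print(metric, top_match(query, docs, metric))
```

Note that the two distances are minimized while the cosine similarity is maximized. On this toy data all three agree on the second document (index 1), but on real data they can disagree, especially when vector magnitudes vary.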

Hopefully you now have a better understanding of some of the similarity metrics used in vector retrieval. This functionality is at the core of RAG models when using LLMs on domain-specific documentation!

I'm Sick of LLMs

Jeremy Maslanko — Sat, 16 Dec 2023 12:00:00 -0400

A few hours after Google announced their new family of language models named Gemini, I was asked what I thought about them. I gave an honest answer: I hadn't really looked into them. I saw a headline that they had been released, and shrugged it off. Why? Well, I'm kind of sick of them.

When I say this, it is not meant to discredit or downplay how impressive the models are. The ability to ask a question about linear algebra, follow it with a prompt to write a poem in the tone of a pirate, and confidently get a response that is generally accurate is incredible. I remember, less than two years ago, messing around with BERT and its variants and being disappointed by responses that were nonsensical.

But with that said, I've become a bit unimpressed with new models. Most of them tell you that they are some percentage point better than other models on a variety of different metrics. And while that is progress, it's just not that exciting to me. On top of that, the architectures of the models are largely the same, which means the differences stem mostly from variations in training data. And if your model is only performing better because it has ___ billion parameters more than the last, great. I am happy that you and your team have access to more compute.

What has been impressive to me are the smaller models. In fact, I think most of the productivity gains we see will be from these smaller models. Retrieval Augmented Generation (RAG) models are one of the more common ways to use LLMs on domain-specific data. The magic sauce for those is not the LLM, but the similarity search returning the context. As long as the LLM is able to produce coherent text, all it needs to do is mix up the input to provide the answer. In simple experiments, I have found that 3-bit quantized models are more than sufficient, even compared to 8-bit or full base models. Perhaps I should run an actual experiment; I am a data scientist after all.

Maybe I'm not sick of LLMs. Maybe it's just that I am sick of people talking about "AI" as if it's some magical new invention. AI is not some sentient thing; it's a vague way to refer to complex applied math. Or maybe I am just a jaded data scientist sick of hearing people talk about AI who couldn't tell you how to multiply two vectors or explain what a p-value is. It's probably the latter.

My First Post

Jeremy Maslanko — Wed, 13 Sep 2023 00:00:00 -0400

Welcome to my blog! I am very excited to begin this project as I expect to be able to learn a lot from it. My hope is that you will be able to pick something up along the way, whether it is a better understanding of a specific topic or an appreciation for a different viewpoint.

As I am sure you may have noticed, this isn't the prettiest site at the moment. I wanted to build this blog from scratch, so I have some work to do. But at the same time, my motivation to work on formatting HTML/CSS remains low.

I have no set plans on how often I will post. My hope is that I will find a blend of short, quick posts that are easier to churn out mixed in with some longer posts that are more thoughtful and take more time. The topics will vary from tech to math to finance and business. These are all topics that I find very interesting. Additionally, I am sure I will write about other things that are not part of those broader topics. That said, I do hope to post with some sort of regularity.

If I write about anything and make an error, please reach out so that I can correct it. My intention will never be to lead anyone astray. Feel free to reach out if you would like to discuss anything or talk about something I have written about. Just as I hope you can learn something from my posts, I am sure there is more that I can learn from you.

And with that, I can't think of anything else to share at this moment.