Understanding Cosine Similarity and Word Embeddings

Spencer Porter
4 min read · Aug 28, 2023


Image from Levi (@Levikul09 on Twitter)

Want to keep up with the latest AI research and need a more streamlined approach? Textlayer AI is the first purpose-built research platform for developers that gives you free access to personalized recommendations, easy-to-read summaries, and full chat with implementation support.

In the realm of Natural Language Processing (NLP) and AI, understanding text similarity is a critical component of various applications, from chatbots to recommendation systems. One of the most effective ways to quantify this similarity is through a mathematical concept known as Cosine Similarity. My hope here is to provide an intuitive way to approach cosine similarity, rather than a strictly mathematical one.

What is Cosine Similarity?

To get the textbook answer out of the way:

Cosine Similarity is a metric used to determine the cosine of the angle between two non-zero vectors in a multi-dimensional space. It is a measure of orientation and not magnitude, ranging from -1 to 1. In the context of text similarity, this metric provides a robust way to gauge the similarity between two sets of text data.

Mathematical Definition: Cosine Similarity is calculated as the dot product of two vectors divided by the product of their magnitudes.

Simply put, and in the context of NLP — it’s a measure of how similar the ideas and concepts represented in two pieces of text are.
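
To make the definition concrete, here is a minimal sketch of the calculation in plain NumPy; the two vectors are made-up toy values standing in for embeddings, not output from any real model:

import numpy as np

# Two toy vectors standing in for word embeddings
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.5])

# Dot product divided by the product of the magnitudes
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine Similarity: {cosine:.3f}")  # ~1.0, since the vectors point in nearly the same direction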

A Closer Look at Word Embeddings

Word embeddings are essentially vectors that capture the semantic essence of words. But what does that mean in practical terms? Imagine you have a vast library, and each book in that library is about a single word. The ‘location’ of each book in this imaginary library is determined by its content, or meaning. Words that are semantically similar would be located near each other. In computational terms, this ‘location’ is what a word embedding captures.
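
To make the library analogy concrete, here is a small sketch that loads a set of pretrained GloVe vectors through gensim's downloader (my choice for illustration; any pretrained static embeddings would show the same effect) and asks which words sit closest to a given word:

import gensim.downloader as api

# Download a small set of pretrained GloVe word vectors the first time this runs
vectors = api.load("glove-wiki-gigaword-50")

# Words whose vectors sit nearest to "book" in the embedding space
for word, similarity in vectors.most_similar("book", topn=5):
    print(f"{word}: {similarity:.3f}")

The neighbors come back with their cosine similarities to "book", which is exactly the "nearby books in the library" picture above.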

The Mechanics of Word Embeddings

Word embeddings are generated through neural networks trained on large text corpora. The network learns the contextual usage of words, effectively capturing not just the obvious synonyms but also the nuanced relationships between words. For example, the embeddings for “bank” in the context of a river and “bank” in a financial context would be different, even though the word is the same.

Technical Insight: The embedding itself is built from what the network learns, typically a row of a learned embedding matrix or a hidden-layer activation, often passed through additional transformations to produce the final vector.
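
As a rough illustration of that context dependence, here is a sketch using Hugging Face's transformers library with bert-base-uncased (my choice of model, purely for illustration) to pull out the vector for "bank" in two different sentences:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_for(sentence, word):
    # Run the sentence through BERT and return the hidden state for `word`
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

river_bank = embedding_for("he sat on the bank of the river", "bank")
money_bank = embedding_for("she deposited the check at the bank", "bank")

# Same word, different contexts, different vectors: the similarity comes out below 1
similarity = torch.nn.functional.cosine_similarity(river_bank, money_bank, dim=0).item()
print(f"Cosine Similarity between the two 'bank' embeddings: {similarity:.3f}")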

The Intuitive Appeal of Cosine Similarity

Cosine Similarity measures the cosine of the angle between two vectors. If two vectors point in the same direction, the angle between them is zero and the cosine is 1. If they are orthogonal, meaning they share no ‘directionality,’ the cosine is 0; if they point in opposite directions, the cosine is -1.

In the context of word embeddings, think of each vector as an arrow pointing from the origin to the ‘location’ of a word in our imaginary library. Words that are semantically similar will have vectors pointing in similar directions, resulting in a higher Cosine Similarity.
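
A quick sketch of that intuition with toy 2-D vectors: scaling a vector leaves its direction, and therefore its cosine similarity, unchanged, while orthogonal vectors score zero and opposite vectors score -1:

import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([3.0, 4.0])

print(cosine(v, 10 * v))                 # 1.0  -> same direction, magnitude ignored
print(cosine(v, np.array([-4.0, 3.0])))  # 0.0  -> orthogonal, no shared direction
print(cosine(v, -v))                     # -1.0 -> opposite directions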

Evaluating LLMs Using Cosine Similarity

When working with Large Language Models (LLMs), one common task is to evaluate how well the model understands and generates text based on a given input. Here, Cosine Similarity can be invaluable.

Example Use-Case

Suppose you have trained an LLM to generate product descriptions. You could evaluate its performance by comparing the Cosine Similarity between the generated description and a human-written description for the same product. A high Cosine Similarity would indicate that the model has generated a description closely aligned with human expectations. The sketch below assumes the OpenAI Python SDK for the embeddings and a small NumPy helper for the similarity; any embedding model would work the same way.

import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_embedding(text):
    # Embed a piece of text with OpenAI's embeddings endpoint
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a, b):
    # Dot product divided by the product of the magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sample human-written description
human_description = "This laptop has a sleek design, high-resolution display, and long battery life."

# Sample LLM-generated description
llm_description = "Featuring a slim design, this laptop offers a vibrant screen and extended battery duration."

# Generate embeddings for both descriptions
embeddings_human = generate_embedding(human_description)
embeddings_llm = generate_embedding(llm_description)

# Compute Cosine Similarity
similarity_score = cosine_similarity(embeddings_human, embeddings_llm)
print(f"Cosine Similarity between human-written and LLM-generated descriptions: {similarity_score:.3f}")

A high similarity score would indicate that the LLM has effectively captured the essence of the product, while a low score might suggest areas for improvement.

Understanding the mechanics of word embeddings and the intuitive nature of Cosine Similarity can significantly enhance your ability to develop and evaluate text-based machine learning models. These tools not only offer a robust framework for quantifying text similarity but also provide actionable insights into the performance of your models.

Thank you for reading, and if you’d like to keep up with all the newest Data Science and ML papers, be sure to get your free account at Textlayer AI.
