Lesson 3 - Measuring Similarity and Distance

Welcome to Measuring Similarity and Distance

In the previous lessons you learned how to turn a sentence into an embedding: a list of numbers that captures its meaning. That is only half the story. A single embedding is hard to interpret on its own; the power comes from comparing two embeddings and asking how close their meanings are. In this lesson you will learn the standard tool for that comparison: cosine similarity. You will build it up from its definition, compute it by hand with numpy so nothing feels like magic, and then reach for the one-line helper that sentence-transformers ships for production use.

By the end of this lesson, you will be able to:

Explain cosine similarity as the angle between two vectors and read its range from -1 to 1.
Compute cosine similarity by hand with numpy and with sentence_transformers.util.cos_sim.
Explain why L2-normalized embeddings let you skip the division and just take the dot product.
Relate cosine similarity to cosine distance and Euclidean distance, and say why direction is the right thing to measure for text.

Let’s start with the intuition behind the number, then make it concrete with real embeddings.

The Intuition: Comparing Direction, Not Length

Every embedding is a point in a high-dimensional space, which means it is also an arrow pointing from the origin to that point. When two sentences mean similar things, the model places their arrows so they point in nearly the same direction. When they mean unrelated things, the arrows point off at wide angles to each other.

Cosine similarity measures exactly that: the cosine of the angle between two vectors. If the arrows point in the same direction, the angle is 0 degrees and the cosine is 1. If they are perpendicular — completely unrelated — the angle is 90 degrees and the cosine is 0. If they point in opposite directions, the angle is 180 degrees and the cosine is -1. So the score lives in a tidy range:

~1 means same direction, nearly the same meaning.
~0 means unrelated.
~-1 means opposite direction.

The key word is direction. Cosine similarity deliberately ignores how long each arrow is and looks only at where it points. We will see in a moment why that is the right choice for comparing text.

Here is the formula. For two vectors $a$ and $b$, cosine similarity is the dot product divided by the product of their lengths (their L2 norms):

$$\text{cosine}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\ \sqrt{\sum_{i} b_i^2}}$$

The numerator, the dot product, grows when the two vectors share large values in the same positions. Dividing by both lengths strips out magnitude, leaving a pure measure of alignment. Now let’s compute it.
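Before moving to real embeddings, it can help to watch the formula hit its landmark values on vectors small enough to draw. The sketch below is only an illustration: the 2D vectors and the helper name `cosine_similarity` are made up for this check and are not part of the lesson's data.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product divided by the product of the two L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hand-picked 2D vectors (illustrative only, not real embeddings).
right = np.array([1.0, 0.0])
up = np.array([0.0, 1.0])
left = np.array([-1.0, 0.0])
diagonal = np.array([1.0, 1.0])

same = cosine_similarity(right, right)          # angle of 0 degrees
perpendicular = cosine_similarity(right, up)    # angle of 90 degrees
opposite = cosine_similarity(right, left)       # angle of 180 degrees
partial = cosine_similarity(right, diagonal)    # angle of 45 degrees

# Recover the angle itself from the cosine (clip guards against rounding).
angle_deg = float(np.degrees(np.arccos(np.clip(partial, -1.0, 1.0))))

print(same, perpendicular, opposite)
print(round(partial, 4), round(angle_deg, 1))
```

The first three scores land on the endpoints and midpoint of the range, and the 45-degree pair falls in between, which is where most real sentence pairs end up.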

Computing Cosine Similarity by Hand with numpy

We will embed three short sentences. The first two are different ways of describing the same support problem; the third is about something entirely unrelated. If cosine similarity works the way we expect, the first pair should score high and any pair involving the third should score low.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials and can't sign in.",
    "What's the weather like in Paris today?",
]

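# Encode all three sentences; the result is a numpy array with one row per sentence.
embeddings = model.encode(sentences)
print(embeddings.shape)


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))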


print("cos_sim(s0, s1):", round(cosine(embeddings[0], embeddings[1]), 4))
print("cos_sim(s0, s2):", round(cosine(embeddings[0], embeddings[2]), 4))
print("cos_sim(s1, s2):", round(cosine(embeddings[1], embeddings[2]), 4))

(3, 384)
cos_sim(s0, s1): 0.6797
cos_sim(s0, s2): 0.0181
cos_sim(s1, s2): 0.0608

The numbers confirm the intuition. The two password sentences score 0.6797, a strong signal that the model sees them as closely related even though they share almost no words — one says “reset my password,” the other says “forgot my login credentials.” That is the whole point of embeddings: they match meaning, not vocabulary.

Meanwhile the Paris weather sentence sits near zero against both support sentences (0.0181 and 0.0608), which is exactly what we want for unrelated text. Notice that cosine similarity does not need to hit a perfect 1 to be useful; what matters is that related pairs score clearly higher than unrelated ones. That gap is what semantic search will exploit later.

The Shortcut: Normalized Vectors Make It a Dot Product

The model we are using, all-MiniLM-L6-v2, returns vectors that are already L2-normalized: each one has a length of 1.0, up to floating-point rounding. Let’s confirm that directly.

for i, vector in enumerate(embeddings):
    print(f"norm of s{i}:", round(float(np.linalg.norm(vector)), 4))

# Because every vector has length 1, the denominator in the cosine
# formula is 1 * 1 = 1, so cosine similarity reduces to the dot product.
print("dot(s0, s1):", round(float(np.dot(embeddings[0], embeddings[1])), 4))

norm of s0: 1.0
norm of s1: 1.0
norm of s2: 1.0
dot(s0, s1): 0.6797

Every vector has a norm of 1.0, so the denominator $\lVert a \rVert \, \lVert b \rVert$ is just $1 \times 1 = 1$. The whole formula collapses to the numerator alone. That is why the raw dot product of s0 and s1 comes out to 0.6797, identical to the cosine similarity we computed earlier. With normalized vectors you can skip the division entirely, which matters at scale: a plain dot product is cheaper to compute, and it is the operation many vector databases lean on when they search millions of embeddings.

Why normalized vectors turn cosine into a dot product

Cosine similarity divides the dot product by both vector lengths to cancel out magnitude. When a model L2-normalizes its output, every vector already has length 1, so that division does nothing and the cosine similarity equals the plain dot product. Many sentence-transformers models, including all-MiniLM-L6-v2, normalize by default — but not all do. If you switch models, check whether the output is normalized before assuming the dot product is safe; if it is not, either normalize the vectors yourself or stick with the full cosine formula.
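One way to run that check and, if needed, normalize the vectors yourself is sketched below with plain numpy. The helper names `is_l2_normalized` and `l2_normalize` and the sample vectors are hypothetical, chosen only for illustration; the sketch also assumes one embedding per row and no all-zero rows.

```python
import numpy as np


def is_l2_normalized(vectors, tol=1e-6):
    """Return True when every row has an L2 norm of 1 within tolerance."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.allclose(norms, 1.0, atol=tol))


def l2_normalize(vectors):
    """Scale each row to unit length; assumes no row is all zeros."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms


# Made-up, unnormalized vectors standing in for another model's output.
raw = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]])

unit = l2_normalize(raw)

# On unit vectors the dot product is the cosine similarity.
dot_after = float(np.dot(unit[0], unit[2]))
full_cosine = float(
    np.dot(raw[0], raw[2]) / (np.linalg.norm(raw[0]) * np.linalg.norm(raw[2]))
)

print(is_l2_normalized(raw), is_l2_normalized(unit))
print(round(dot_after, 4), round(full_cosine, 4))
```

The two printed scores agree: normalizing first and taking the dot product gives the same number as the full formula on the raw vectors. With sentence-transformers you may also be able to pass `normalize_embeddings=True` to `model.encode`, depending on your installed version, instead of normalizing afterwards.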

The Built-In Helper and the Distance Connection

Computing cosine similarity by hand is great for understanding, but in real code you will reach for the helper that sentence-transformers provides: util.cos_sim. It handles batches efficiently and does the right thing whether or not your vectors are normalized.

from sentence_transformers import util

score = util.cos_sim(embeddings[0], embeddings[1])
print(type(score))
print(score)
print("as a float:", round(score.item(), 4))

<class 'torch.Tensor'>
tensor([[0.6797]])
as a float: 0.6797

Two things to notice. First, util.cos_sim returns a PyTorch tensor, not a plain Python number. Here it is a 2D tensor, tensor([[0.6797]]), because the function is built to compare batches and gives you back a matrix of scores. When you have a single pair, call .item() to pull out the value, or float(...) to convert it. Second, the value is 0.6797, matching our by-hand computation to four decimal places.
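The matrix-shaped result is easier to picture with a small numpy sketch of batch cosine similarity. This is not the library's implementation, only an illustration that produces the same shape of result under the assumption that each row is one embedding; the function name `pairwise_cosine` and the vectors are made up.

```python
import numpy as np


def pairwise_cosine(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Row i, column j holds the similarity of a[i] and b[j].
    return a_unit @ b_unit.T


# Illustrative vectors: two "queries" and three "documents".
queries = np.array([[1.0, 0.0], [0.0, 2.0]])
documents = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, -3.0]])

scores = pairwise_cosine(queries, documents)
single = pairwise_cosine(queries[0], documents[0])

print(scores.shape)   # one row per query, one column per document
print(single.shape)   # a single pair still comes back as a 1x1 matrix
print(float(single[0, 0]))
```

Treating a single pair as a one-by-one batch is what makes the extra brackets appear, and it is why you index or unwrap the result before using it as a plain number.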

It helps to relate cosine similarity to two notions of distance. Cosine distance is simply one minus cosine similarity:

$$\text{cosine\_distance}(a, b) = 1 - \text{cosine}(a, b)$$

So our password pair, with similarity 0.6797, has a cosine distance of about 0.3203. High similarity means low distance, and that flip is useful because many search libraries are written in terms of “find the nearest” (smallest distance) rather than “find the most similar” (largest similarity).
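The flip from similarity to distance can be checked with the two scores already printed in this lesson (s0 against s1, and s0 against s2). This is a minimal sketch; the variable names and labels are only for illustration.

```python
import numpy as np

# Similarity scores from the lesson's output: s0 against s1 and s0 against s2.
similarities = np.array([0.6797, 0.0181])
labels = ["s1", "s2"]

# Cosine distance flips the scale: high similarity becomes low distance.
distances = 1.0 - similarities

# "Most similar" (largest similarity) and "nearest" (smallest distance)
# pick the same item.
most_similar = labels[int(np.argmax(similarities))]
nearest = labels[int(np.argmin(distances))]

print(np.round(distances, 4))
print(most_similar, nearest)
```

Either way of asking the question selects s1, so switching a library from "largest similarity" to "smallest cosine distance" does not change the ranking.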

The other notion is Euclidean distance — the straight-line gap between the two points, which numpy computes as np.linalg.norm(a - b). For the password pair it is about 0.80, and for the unrelated password-and-weather pair about 1.40: closer meaning gives a shorter line. There is a neat fact behind this: when vectors are normalized, Euclidean distance and cosine similarity rank pairs in the same order, so for this model they always agree on which sentence is the closest match.
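The reason the two measures agree for normalized vectors is a short identity. Expanding the squared distance gives $\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b$, and when both lengths are 1 this becomes $2 - 2\,\text{cosine}(a, b)$, so a higher similarity always means a shorter distance. The sketch below plugs in the similarities reported earlier and cross-checks the identity on random unit vectors; the helper name `euclidean_from_cosine` and the random vectors are assumptions for illustration, not model output.

```python
import numpy as np


def euclidean_from_cosine(similarity):
    """Euclidean distance implied by a cosine similarity, unit vectors only."""
    return float(np.sqrt(2.0 - 2.0 * similarity))


# Plug in the similarities reported earlier in the lesson.
password_pair = euclidean_from_cosine(0.6797)
weather_pair = euclidean_from_cosine(0.0181)

# Cross-check the identity on random unit vectors (seeded for repeatability).
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 384))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)
direct = float(np.linalg.norm(a - b))
via_cosine = euclidean_from_cosine(float(np.dot(a, b)))

print(round(password_pair, 2), round(weather_pair, 2))
print(abs(direct - via_cosine) < 1e-9)
```

The identity only holds for unit-length vectors, so it should not be applied to embeddings from a model that does not normalize its output.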

Why prefer cosine (direction) over raw Euclidean distance for text? Because direction captures meaning while length often captures things you do not care about, like sentence length or word count. Two passages can express the same idea at very different lengths; embeddings of different magnitudes would be pushed apart by Euclidean distance even when their directions — their meanings — line up. Cosine similarity ignores that magnitude and focuses on the alignment, which is why it is the default choice for comparing text embeddings.
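A small numeric case makes the difference visible. The vectors below are made up and deliberately unnormalized, since the effect cannot show up with this lesson's model, whose outputs all have length 1. Treat it as a sketch of the argument, not as real embedding data.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors: `long_same` points the same way as `short` but is 10x longer;
# `near_other` sits close to `short` in space but points in a different direction.
short = np.array([1.0, 2.0, 2.0])
long_same = 10.0 * short
near_other = np.array([2.0, 1.0, 2.0])

cos_same = cosine(short, long_same)
cos_other = cosine(short, near_other)
dist_same = float(np.linalg.norm(short - long_same))
dist_other = float(np.linalg.norm(short - near_other))

# Cosine ranks the same-direction vector as the closer match;
# Euclidean distance ranks it as the farther one.
print(round(cos_same, 4), round(cos_other, 4))
print(round(dist_same, 4), round(dist_other, 4))
```

The two metrics disagree here only because the lengths differ; once every vector is scaled to length 1, they rank pairs identically.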

Practice Exercises

Exercise 1

Embed a fourth sentence, "My account is locked after too many login attempts.", and compute its cosine similarity against all three original sentences using your cosine function. Which existing sentence is it most similar to, and does that match your expectation?

Hint

Add the new sentence to a list, call model.encode on it, then loop over the three original embeddings calling cosine(new_vector, original_vector). The locked-account sentence is about sign-in trouble, so expect it to score highest against the login-credentials sentence.

Exercise 2

Verify the normalization shortcut yourself: take any two of the original embeddings and print both cosine(a, b) and float(np.dot(a, b)). Confirm they are equal, then explain in one sentence why.

Hint

The two values match because each embedding has an L2 norm of 1.0, so the denominator of the cosine formula is 1 and the division has no effect. You can double-check the norms with np.linalg.norm.

Exercise 3

For the password pair (s0 and s1), compute the cosine distance as 1 - cosine(s0, s1) and the Euclidean distance as np.linalg.norm(s0 - s1). Then repeat for the unrelated pair (s0 and s2). Confirm that the more similar pair has both the smaller cosine distance and the smaller Euclidean distance.

Hint

Higher cosine similarity always means lower cosine distance. For normalized vectors, Euclidean distance moves in the same direction, so the related pair will be closer on both measures — around 0.32 cosine distance and 0.80 Euclidean for the password pair.

Summary

You learned how to collapse two embeddings into a single number that says how alike their meanings are. Cosine similarity measures the angle between the vectors, runs from -1 to 1, and reports roughly 1 for the same meaning and roughly 0 for unrelated text. You computed it by hand with numpy, saw that all-MiniLM-L6-v2 returns L2-normalized vectors so the formula collapses to a plain dot product, and used util.cos_sim as the production-ready helper. Finally you connected similarity to cosine distance (1 minus similarity) and Euclidean distance, and saw why direction is the right signal for text.

Key Concepts

Cosine similarity: the cosine of the angle between two vectors; 1 means same direction, 0 means unrelated, -1 means opposite.
The formula: dot product divided by the product of the two vector lengths (L2 norms).
Normalized vectors: all-MiniLM-L6-v2 outputs vectors of length 1.0, so cosine similarity equals the dot product, which is cheaper to compute and is what many vector databases rely on.
util.cos_sim: the sentence-transformers helper; it returns a PyTorch tensor, so use .item() or float(...) for a single pair.
Cosine distance: 1 minus cosine similarity; high similarity means low distance.
Direction over magnitude: cosine ignores length, so it compares meaning rather than sentence size, which is why it is preferred over raw Euclidean distance for text.

Why This Matters

Similarity scoring is the engine of semantic search, recommendation, deduplication, and retrieval-augmented generation. Every one of those systems boils down to embedding some text, embedding a query, and ranking by similarity. Understanding that the score is just an angle — and that normalized vectors let you compute it as a single dot product — means you can read what a vector database is doing under the hood, debug surprising rankings, and choose the right metric instead of trusting a black box.

Next Steps

Continue to Lesson 4 - Guided Project: Semantic Search

Back to Module Overview

Continue Building Your Skills

You can now turn any two pieces of text into a meaningful similarity score and explain exactly what that number represents. In the next lesson you will put it to work, building a small semantic search engine that embeds a collection of documents, embeds a query, and ranks results by cosine similarity from start to finish.
