How do you measure whether two things mean the same thing? Every search engine, recommendation system, and language model answers this question — with a number.
You've just asked a search engine: "How do neural networks learn?" It has a database of 10 million documents. It can't read all of them for you — it needs to instantly rank which ones are most relevant to your question.
Somewhere in that database is a document titled "Gradient descent and backpropagation explained." That document is highly relevant. There's also one about "Fishing nets for tuna." That one is not. But how does the search engine know?
The trick: turn every sentence into a list of numbers — a vector. Then measure how "alike" two vectors are. Vectors that land close together represent sentences that mean similar things. Vectors far apart mean different things.
A vector is just a row of numbers. Your sentence "How do neural networks learn?" might become the vector [0.2, 0.8, 0.1, 0.5, 0.0, 0.9, ...] — hundreds or thousands of numbers, each capturing some aspect of meaning. The search engine then computes how similar your query vector is to every document vector.
We're going to build intuition for six families of metrics, understand when each one works and when it fails, and then watch them all compete in an interactive playground. By the end, you'll have a practical rule of thumb for which metric to reach for in a given situation.
Here are three sentences embedded as 2D vectors (simplified for illustration). The orange dot is your query. The teal dot is a relevant document. The red dot is an irrelevant document. Click a metric to see how it scores each pair; larger means more similar.
You learned Euclidean distance in middle school — it's the straight-line distance between two points. For two points on a number line, the distance is just the absolute difference. For two points in 2D space, it's the Pythagorean theorem. For vectors of any size, it's the same idea extended to more dimensions.
Let's be concrete. Say you have two vectors:
In general, for two n-dimensional vectors a and b: d(a, b) = √((a₁ − b₁)² + (a₂ − b₂)² + … + (aₙ − bₙ)²).
This is also called L2 distance (the "2" refers to the exponent: differences are squared, summed, and then square-rooted, as we'll see in Chapter 4). It's a distance, not a similarity: smaller values mean more similar. If you want a similarity score, you can convert: sim = 1 / (1 + distance).
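Here is a minimal Python sketch of the distance and the conversion to a similarity score. The function names and the two example vectors are made up for illustration; they are not taken from the original example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_to_similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)

# Hypothetical vectors, chosen so the arithmetic is easy to check by hand.
a = [1.0, 2.0]
b = [4.0, 6.0]

d = euclidean_distance(a, b)     # sqrt(3^2 + 4^2) = 5.0
s = distance_to_similarity(d)    # 1 / (1 + 5), roughly 0.167
print(d, s)
```

Identical vectors give a distance of 0 and therefore a similarity of exactly 1; the similarity approaches 0 as the vectors move apart.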
Consider two documents about cats. Document A is 1000 words long and mentions "cat" 50 times. Document B is 10 words long and mentions "cat" 5 times. Their word-count vectors are very different in magnitude, but they're saying the same thing about cats. Euclidean distance would call them "far apart" even though they mean the same thing.
Worse, in high dimensions (like the 768-dimensional vectors that language models produce), Euclidean distances between random vectors tend to concentrate around the same value, so everything looks roughly equally distant from everything else. This is one face of the curse of dimensionality, and it makes raw Euclidean distance much less informative for text embeddings.
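You can see this concentration effect with a small experiment. The sketch below assumes points drawn uniformly at random from the unit hypercube, which is a simplification (real embeddings are not uniform), and `relative_contrast` is a hypothetical helper name. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the dimension grows: near and far neighbors
# become harder to tell apart.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

In two dimensions the farthest point is many times farther away than the nearest one; in a thousand dimensions the gap is a small fraction of the nearest distance.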
Drag the orange and teal points. Watch the distance update. Then scale one vector up with the slider — notice how the distance grows even if the direction stays the same.
Remember that problem from Chapter 1? A short cat article and a long cat article look "far" in Euclidean space because one vector is much larger. The fix: stop caring about how long the vectors are. Only care about which direction they point.
Cosine similarity measures the angle between two vectors. If they point in the same direction, the angle is 0° — cosine similarity is 1.0 (perfect match). If they're perpendicular, the angle is 90° — cosine similarity is 0.0 (completely unrelated). If they point opposite ways, the angle is 180° — cosine similarity is -1.0 (opposite meaning).
Let's compare two word-count vectors for the word pair ("cat", "dog"):
The division by magnitudes is the key operation. It's called normalization. After you divide by the length, both vectors have length 1 — they lie on a unit circle (or unit sphere in higher dimensions). Now only angle matters.
Same Euclidean distance as many pairs, but cosine correctly says "completely unrelated." For semantic search, cosine is almost always the right default.
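To make the normalization step concrete, here is a small Python sketch of cosine similarity. The ("cat", "dog") count vectors are hypothetical stand-ins, not the article's original example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical ("cat", "dog") word counts.
short_cat_doc = [5, 1]
long_cat_doc = [50, 10]   # same proportions, ten times longer
cat_only = [3, 0]
dog_only = [0, 4]

print(cosine_similarity(short_cat_doc, long_cat_doc))  # roughly 1.0: same direction
print(cosine_similarity(cat_only, dog_only))           # 0.0: perpendicular
```

The short and long cat documents are far apart in Euclidean terms but point in the same direction, so their cosine similarity is 1 up to floating-point rounding.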
Drag the endpoints of the two vectors. The angle θ between them determines cosine similarity. Notice: scaling a vector (making it longer) doesn't change the angle or the cosine similarity.
Look at the cosine similarity formula again: cos(θ) = (a · b) / (|a| × |b|).
The dot product is just the numerator of cosine similarity, before any division by the magnitudes. For two vectors a and b: a · b = a₁b₁ + a₂b₂ + … + aₙbₙ.
The dot product combines two things: the angle between vectors (cosine) AND the magnitudes of both vectors. Algebraically: a · b = |a| × |b| × cos(θ). So the dot product is large when vectors point in the same direction AND when they're long.
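The identity a · b = |a| × |b| × cos(θ) is easy to check numerically. The sketch below uses two hypothetical vectors 45° apart and scales one of them; the helper names are illustrative.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Vector length (L2 norm)."""
    return math.sqrt(dot(a, a))

# Two hypothetical vectors 45 degrees apart.
a = [1.0, 0.0]
b = [1.0, 1.0]
cos_theta = dot(a, b) / (norm(a) * norm(b))

# Scaling b by 3 triples the dot product but leaves the cosine unchanged.
b_scaled = [3.0 * x for x in b]
cos_scaled = dot(a, b_scaled) / (norm(a) * norm(b_scaled))

print(dot(a, b), dot(a, b_scaled))   # 1.0 3.0
print(cos_theta, cos_scaled)         # both roughly 0.707
```

The dot product responds to both direction and length, while the cosine only responds to direction.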
In a transformer's self-attention mechanism, every token produces both a query vector and a key vector. The attention score from one token to another is the dot product of the first token's query with the second token's key. Why not cosine similarity?
Three reasons: (1) Speed — no need to compute and divide by magnitudes across millions of token pairs per second. (2) Expressiveness — the model can learn to boost important tokens by making their key vectors longer, encoding confidence in the magnitude. (3) Convenience — the attention weights are then passed through softmax anyway, which normalizes them, so raw magnitude sensitivity is partially handled downstream.
The division by √d_k in scaled dot-product attention, softmax(QKᵀ / √d_k), is a pragmatic fix: as the key dimension d_k grows, dot products tend to grow in magnitude and push the softmax into regions with near-zero gradients. Dividing by √d_k keeps the values in a reasonable range.
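As a rough illustration, here is a pure-Python sketch of scaled dot-product attention weights for a single query. The query, the keys, and the function names are hypothetical, and the sketch stops at the weights rather than applying them to value vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """softmax(q . k / sqrt(d_k)) over all keys, for one query."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    return softmax(scores)

# Hypothetical 4-dimensional query and keys.
query = [1.0, 0.0, 1.0, 0.0]
keys = [
    [1.0, 0.0, 1.0, 0.0],   # aligned with the query
    [0.0, 1.0, 0.0, 1.0],   # orthogonal to the query
    [2.0, 0.0, 2.0, 0.0],   # aligned and twice as long
]
weights = attention_weights(query, keys)
print(weights)  # the longer aligned key receives the largest weight
```

The third key points the same way as the first but is twice as long, so it gets a larger share of attention. Cosine similarity would throw that magnitude signal away.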
Two vectors with a fixed angle of 45°. Watch how dot product changes as you scale them, while cosine similarity stays constant.
Imagine you're navigating a city built on a grid — like Manhattan. You can't cut diagonally through buildings. To get from one corner to another, you walk some blocks east and some blocks north. The total number of blocks you walk is the Manhattan distance (also called L1 distance).
Manhattan and Euclidean distance measure the same thing, how different two vectors are, but through different geometric lenses. Euclidean squares the differences before summing, so large differences dominate. Manhattan sums the absolute differences, so every unit of difference counts the same no matter which dimension it comes from.
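Both L1 and L2 are special cases of the Minkowski distance with parameter p: d_p(a, b) = (|a₁ − b₁|^p + |a₂ − b₂|^p + … + |aₙ − bₙ|^p)^(1/p). Setting p = 1 gives Manhattan distance, and p = 2 gives Euclidean distance.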
In sparse, high-dimensional data — like word frequency vectors where most entries are 0 — L1 often works better than L2. Here's why: L2 squares each difference, so a single large outlier dimension dominates the distance. L1 sums raw differences, giving each dimension equal weight.
Concrete example: two document vectors of 10,000 words. They differ on one word — "extraordinary" — which appears 0 times in A and 10 times in B. L2 distance contributed by this one word: √100 = 10. L1 contribution: 10. Now add 100 small differences of 1 across other words. L2 total: √(100 + 100) = √200 ≈ 14.1. L1 total: 10 + 100 = 110. L2 is dominated by the outlier; L1 is more balanced.
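The arithmetic in this example can be reproduced with a short Minkowski distance function. The sketch below builds two made-up 10,000-entry count vectors matching the description; the variable names are illustrative.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance: p=1 is Manhattan (L1), p=2 is Euclidean (L2)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Two documents over a 10,000-word vocabulary that differ on 101 words:
# one outlier word differs by 10, and 100 other words differ by 1 each.
n_words = 10_000
doc_a = [0] * n_words
doc_b = [0] * n_words
doc_b[0] = 10                  # the outlier word
for i in range(1, 101):
    doc_b[i] = 1               # 100 small differences

l1 = minkowski_distance(doc_a, doc_b, p=1)   # 10 + 100 = 110
l2 = minkowski_distance(doc_a, doc_b, p=2)   # sqrt(100 + 100), roughly 14.14

# Share of each total that comes from the single outlier word.
outlier_share_l1 = 10 / 110          # about 9% of the L1 sum
outlier_share_l2 = 10**2 / 200       # 50% of the L2 sum of squares
print(l1, l2, outlier_share_l1, outlier_share_l2)
```

One word out of 101 accounts for half of the squared sum behind L2, but only about a tenth of the L1 total.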
Two 8-dimensional vectors. Use the slider to set one dimension to a large outlier value. Watch how L2 (Euclidean) reacts more strongly than L1 (Manhattan).
All the metrics so far work on continuous vectors — lists of real numbers. But sometimes your data is more naturally represented as a set. A document might be a set of unique words. A playlist might be a set of song IDs. A shopping basket might be a set of products. For sets, the natural question is: how much do they overlap?
Jaccard similarity answers this by comparing the intersection (what they share) to the union (everything they contain between them): J(A, B) = |A ∩ B| / |A ∪ B|.
Jaccard always falls in [0, 1]. Two identical sets have Jaccard = 1. Two disjoint sets have Jaccard = 0. It's completely insensitive to frequency — whether a word appears once or 100 times, it only counts as "present" or "absent."
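In Python, sets make this nearly a one-liner. The sketch below uses two made-up sentences; treating two empty sets as identical is an assumption, since the ratio is otherwise undefined.

```python
def jaccard_similarity(items_a, items_b):
    """Size of the intersection divided by size of the union."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(union)

# Hypothetical documents, reduced to sets of unique words.
doc_a = "the cat sat on the mat".split()
doc_b = "the cat lay on the rug".split()

# Shared: {the, cat, on} = 3 words; union = 7 words, so 3/7.
score = jaccard_similarity(doc_a, doc_b)
print(score)
```

Note that "the" appears twice in each sentence but counts only once, which is the frequency insensitivity described above.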
A neat trick for text: instead of comparing word sets, compare shingle sets. A k-shingle is a contiguous sequence of k characters (or k words). The sentence "the cat sat" with 2-word shingles produces: {"the cat", "cat sat"}. Shingles capture local word order, so "the cat" and "cat the" produce different shingle sets.
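A word-level shingler takes only a few lines. This is a minimal sketch with a hypothetical function name; it lowercases and splits on whitespace, and real pipelines usually do more careful tokenization.

```python
def word_shingles(text, k=2):
    """Set of all contiguous k-word sequences in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

print(word_shingles("the cat sat"))   # {'the cat', 'cat sat'} (set order may vary)
print(word_shingles("the cat") == word_shingles("cat the"))   # False
```

A text shorter than k words produces an empty set, so very short inputs may need a smaller k or character-level shingles.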
Computing exact Jaccard for billions of document pairs is expensive: you'd have to compute the union and intersection for every pair. MinHash is a randomized shortcut. For each set, hash every element with a random hash function and keep only the minimum hash value. The probability that two sets have the same minimum hash equals their Jaccard similarity. Repeat with many independent hash functions and take the fraction of them on which the two minimums agree: that fraction is an unbiased estimate of Jaccard, obtained without ever explicitly computing intersections.
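Here is a minimal MinHash sketch. As an assumption for illustration, it simulates a family of random hash functions by salting SHA-1 with a seed; production systems typically use faster hash families. All names are hypothetical, and the sets must be non-empty.

```python
import hashlib

def _seeded_hash(seed, item):
    """One member of a hash family: SHA-1 of the seed plus the item."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=256):
    """For each hash function, keep the minimum hash over the set."""
    return [min(_seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two minimums agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Two hypothetical sets with true Jaccard = 50 / 150 = 1/3.
set_a = set(range(0, 100))
set_b = set(range(50, 150))

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)
estimate = estimate_jaccard(sig_a, sig_b)
print(estimate)  # should land near 0.33; more hashes tighten the estimate
```

Each signature is a fixed 256 numbers regardless of set size, so comparing two documents costs the same whether they contain ten shingles or ten million.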
Toggle words in each document set. Watch Jaccard similarity update as the overlap changes. The Venn diagram shows the intersection and union.
All the metrics so far are fixed formulas. They work on raw vectors — whatever features you hand them. But what if your raw features aren't the right representation? What if "similar" means something task-specific that no geometric formula can capture without training?
The solution: learn the metric. Train a neural network (an encoder) to map inputs into an embedding space where your chosen metric (usually cosine or dot product) correctly reflects your notion of similarity. The network learns to compress the inputs into vectors that are close for similar items and far for dissimilar ones.
The simplest learned metric uses contrastive loss. Given a positive pair (two similar things) and a negative pair (two dissimilar things), the loss pulls positives together and pushes negatives apart:
For positive pairs (y=1): minimize the distance. For negative pairs (y=0): only penalize if they're closer than margin m — you don't care how far apart they are beyond m.
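One common way to write this loss is L = y × d² + (1 − y) × max(0, m − d)², where d is the Euclidean distance between the two embeddings. The squared form and the default margin below are assumptions, since variants differ; the sketch computes the loss for a single pair.

```python
import math

def contrastive_loss(emb_1, emb_2, y, margin=1.0):
    """Pairwise contrastive loss; y=1 for a positive pair, y=0 for a negative."""
    d = math.dist(emb_1, emb_2)
    if y == 1:
        return d ** 2                      # pull positives together
    return max(0.0, margin - d) ** 2       # push negatives out to the margin

# Hypothetical 2D embeddings.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1))   # 25.0: positives far apart
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0))   # 0.0: negatives beyond margin
print(contrastive_loss([0.0, 0.0], [0.5, 0.0], y=0))   # 0.25: negatives too close
```

The middle case is the important one: once a negative pair is farther apart than the margin, it contributes no loss and no gradient.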
Contrastive loss requires carefully balancing positives and negatives. Triplet loss simplifies this with three items: an anchor a, a positive p (similar to anchor), and a negative n (dissimilar). The loss forces anchor to be closer to positive than to negative, by at least a margin:
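A common form is L = max(0, d(a, p) − d(a, n) + m). The sketch below assumes Euclidean distance for d and a hypothetical default margin of 0.5, and computes the loss for a single triplet.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the negative is at least `margin` farther than the positive."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical 2D embeddings.
anchor = [0.0, 0.0]
print(triplet_loss(anchor, [1.0, 0.0], [3.0, 0.0]))   # 0.0: already well separated
print(triplet_loss(anchor, [2.0, 0.0], [1.0, 0.0]))   # 1.5: negative is closer
```

Only the relative ordering matters here: the loss does not care about absolute distances, just that the positive beats the negative by the margin.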
InfoNCE loss (used in CLIP, SimCLR, and most modern contrastive systems) scales this idea to entire batches. For each anchor in a batch of N items, one item is the positive and N-1 are negatives. The loss asks the model to identify the positive among all the negatives — like an N-way classification:
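Written as a classification loss, this is L = −log(exp(s_p / τ) / Σⱼ exp(sⱼ / τ)), where sⱼ is the similarity between the anchor and candidate j, and s_p is the similarity to the positive. The temperature τ and the choice of cosine similarity are assumptions in this sketch, which handles one anchor rather than a full batch.

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce_loss(anchor, candidates, positive_index, temperature=0.1):
    """Cross-entropy of picking the positive among all candidates."""
    logits = [_cosine(anchor, c) / temperature for c in candidates]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[positive_index]

# Hypothetical 2D embeddings: one anchor and three candidates.
anchor = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

good = info_nce_loss(anchor, candidates, positive_index=0)  # positive is nearest
bad = info_nce_loss(anchor, candidates, positive_index=2)   # positive is opposite
print(good, bad)  # the loss is much lower when the positive is the closest item
```

If every candidate is equally similar to the anchor, the loss equals log(N), the cost of guessing uniformly among N options.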
The anchor is pulled toward the positive and pushed from the negative. Click "Step" to apply one gradient update. The embeddings learn a metric.
Everything at once. Two vectors in 2D. Every metric computed live. Drag the points. Toggle normalization. Watch how each metric responds differently to rotation vs. scaling.
Drag the orange and teal vector endpoints. All metrics update live.
Rotate both vectors by the same angle. Cosine similarity, Euclidean distance, and dot product should all stay constant, because rotation changes neither the vectors' lengths nor the angle between them. Manhattan distance is the exception: it depends on the coordinate axes, so it generally changes under rotation. Confirm your intuition.
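You can also run the rotation experiment in code. The sketch below rotates two hypothetical, deliberately non-unit vectors by 30° and compares each metric before and after; the helper names are illustrative.

```python
import math

def rotate(v, degrees):
    """Rotate a 2D vector counter-clockwise by the given angle."""
    t = math.radians(degrees)
    x, y = v
    return [x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = [3.0, 1.0], [1.0, 2.0]            # hypothetical, not unit length
ra, rb = rotate(a, 30), rotate(b, 30)    # same rotation applied to both

# Print each metric before and after the rotation.
for name, metric in [("dot", dot), ("cosine", cosine),
                     ("euclidean", math.dist), ("manhattan", manhattan)]:
    print(name, round(metric(a, b), 4), round(metric(ra, rb), 4))
```

Dot product, cosine, and Euclidean distance come out the same before and after, up to rounding. Manhattan distance changes because it is measured along the axes, and the rotation moves the vectors relative to them.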
You now have six tools. Here's how to pick the right one for your problem.
| Metric | Range | Magnitude-sensitive? | Best for | Watch out for |
|---|---|---|---|---|
| Euclidean (L2) | [0, ∞) | Yes | Low-dim, normalized data, pixel similarity | High dimensions, different scales |
| Cosine | [-1, 1] | No | Text embeddings, semantic search, NLP | Zero vectors (undefined); short texts |
| Dot Product | (−∞, ∞) | Yes | Unit-norm embeddings, transformer attention | Un-normalized vectors; explosive values |
| Manhattan (L1) | [0, ∞) | Yes | Sparse high-dim data, robust to outliers | Harder to index; less geometric intuition |
| Jaccard | [0, 1] | No (binary) | Sets, binary data, document overlap | Frequency information lost |
| Learned (InfoNCE) | Task-dependent | Depends on the base metric | Any data with task-specific similarity | Requires paired training data (labeled or augmented); compute-heavy training |
Click a task to see the recommended metric and why.
"The purpose of computing is insight, not numbers." — Richard Hamming