Similarity Metrics — From Absolute Zero to Mastery

Chapter 0: Why Do We Need a Number?

You've just asked a search engine: "How do neural networks learn?" It has a database of 10 million documents. It can't read all of them for you — it needs to instantly rank which ones are most relevant to your question.

Somewhere in that database is a document titled "Gradient descent and backpropagation explained." That document is highly relevant. There's also one about "Fishing nets for tuna." That one is not. But how does the search engine know?

The trick: turn every sentence into a list of numbers — a vector. Then measure how "alike" two vectors are. Vectors that land close together represent sentences that mean similar things. Vectors far apart mean different things.

The core question: Given two vectors, what single number best captures "how similar are these?" That number is called a similarity metric (or its opposite, a distance metric). Every metric you'll learn in this lesson is a different answer to this question.

A vector is just a row of numbers. Your sentence "How do neural networks learn?" might become the vector [0.2, 0.8, 0.1, 0.5, 0.0, 0.9, ...] — hundreds or thousands of numbers, each capturing some aspect of meaning. The search engine then computes how similar your query vector is to every document vector.

We're going to build intuition for six families of metrics, understand when each one works and when it fails, and then watch them all compete in an interactive playground. By the end, you'll know exactly which metric to reach for in any situation.

Two Sentences, Two Vectors

Here are two sentences embedded as 2D vectors (simplified for illustration). The orange dot is your query. The teal dot is a relevant document. The red dot is an irrelevant document. Click a metric to see how it scores each pair — larger = more similar.

A similarity metric takes two vectors and returns a number. What should that number be for two identical vectors?

Zero: identical things have no distance between them, so similarity is 0
Negative: identical things cancel out
It depends on the metric: for cosine similarity it's 1, for Euclidean distance it's 0

Chapter 1: Euclidean Distance

You learned Euclidean distance in middle school — it's the straight-line distance between two points. For two points on a number line, the distance is just the absolute difference. For two points in 2D space, it's the Pythagorean theorem. For vectors of any size, it's the same idea extended to more dimensions.

Let's be concrete. Say you have two vectors:

Example vectors:
a = [1, 3]
b = [4, 7]

Euclidean distance = √( (4-1)² + (7-3)² )
                      = √( 3² + 4² )
                      = √( 9 + 16 )
                      = √25 = 5.0

In general, for two n-dimensional vectors a and b:

d(a, b) = √( ∑_i (a_i − b_i)² )

This is also called L2 distance (the "2" is the exponent applied to each difference: it is the p = 2 member of the Minkowski family covered in Chapter 4). It's a distance, not a similarity, so smaller values mean more similar. If you want a similarity score in (0, 1], one common conversion is sim = 1 / (1 + distance).
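The formula and the distance-to-similarity conversion can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names are made up for this example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_to_similarity(d):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)

d = euclidean_distance([1, 3], [4, 7])
print(d)                          # 5.0
print(distance_to_similarity(d))  # about 0.167
```

Note that 1 / (1 + distance) is only one possible conversion; any decreasing function of distance would preserve the ranking.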

When Euclidean works well: Low-dimensional spaces (2D, 3D). Data where magnitude matters — if a document mentions "python" 20 times vs 2 times, the difference is meaningful. Pixel-level image comparison.

When Euclidean breaks: Magnitudes Lie

Consider two documents about cats. Document A is 1000 words long and mentions "cat" 50 times. Document B is 100 words long and mentions "cat" 5 times. Both devote the same share of their words to cats, so their word-count vectors point in the same direction but differ tenfold in magnitude. Euclidean distance calls them "far apart" even though they are about the same thing.

Worse, in high dimensions (like the 768-dimensional vectors that many language models produce), Euclidean distances between random vectors tend to concentrate: the nearest and the farthest neighbor end up at almost the same distance. This is one face of the curse of dimensionality, and it makes raw Euclidean distance much less informative for unnormalized text embeddings.
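The concentration effect is easy to observe empirically. The sketch below is an illustration on uniformly random points, not on real embeddings, and the function name and parameters are invented for this example. It measures the relative contrast (farthest minus nearest, divided by nearest) from one random query to 200 random points as the dimension grows.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the dimension grows: neighbors become hard to tell apart.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

Real embeddings are far from uniform noise, so the effect is usually milder in practice, but the trend is the same.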

The magnitude problem: Euclidean distance conflates "different direction" with "different scale." A short document and a long document about the same topic will be Euclidean-distant even though they're semantically close. Cosine similarity (Chapter 2) fixes this.

Euclidean Distance Explorer

Drag the orange and teal points. Watch the distance update. Then scale one vector up with the slider — notice how the distance grows even if the direction stays the same.

Vectors a = [1, 0] and b = [3, 0] point in the same direction. What is their Euclidean distance?

2: the difference in magnitude along the x-axis
0: same direction means distance zero
√10: Pythagorean theorem

Chapter 2: Cosine Similarity

Remember that problem from Chapter 1? A short cat article and a long cat article look "far" in Euclidean space because one vector is much larger. The fix: stop caring about how long the vectors are. Only care about which direction they point.

Cosine similarity measures the angle between two vectors. If they point in the same direction, the angle is 0° — cosine similarity is 1.0 (perfect match). If they're perpendicular, the angle is 90° — cosine similarity is 0.0 (completely unrelated). If they point opposite ways, the angle is 180° — cosine similarity is -1.0 (opposite meaning).

Cosine Similarity Formula:
cos_sim(a, b) = (a · b) / (|a| × |b|)

where:
a · b = ∑_i a_i × b_i (the dot product)
|a| = √(∑_i a_i²) (the magnitude/length)

Worked Example with Real Numbers

Let's compare two word-count vectors for the word pair ("cat", "dog"):

Worked Example:
a = [4, 1]   (document A: mostly about cats)
b = [8, 2]   (document B: same ratio, but longer)

Step 1 — Dot product:
  a · b = (4×8) + (1×2) = 32 + 2 = 34

Step 2 — Magnitudes:
  |a| = √(4² + 1²) = √17 ≈ 4.123
  |b| = √(8² + 2²) = √68 ≈ 8.246

Step 3 — Divide:
  cos_sim = 34 / (4.123 × 8.246) = 34 / 34.0 = 1.000

Perfect similarity! B is just a scaled version of A — same direction.

The division by magnitudes is the key operation. It's called normalization. After you divide by the length, both vectors have length 1 — they lie on a unit circle (or unit sphere in higher dimensions). Now only angle matters.

Range: Cosine similarity always falls in [-1, 1]. For vectors whose entries are all non-negative (like word counts), it falls in [0, 1]; learned embeddings usually contain negative entries and can use the full range. A value of 1 means identical direction, 0 means perpendicular (no overlap in meaning), and -1 means opposite.

A Second Example: Direction vs. Magnitude

Different meanings, similar magnitudes:
a = [3, 0]   (all about cats)
b = [0, 3]   (all about dogs)

cos_sim = (3×0 + 0×3) / (√9 × √9)
         = 0 / (3 × 3) = 0.000

Euclidean distance = √( (3-0)² + (0-3)² ) = √18 ≈ 4.24

Plenty of other pairs also sit 4.24 apart, so the Euclidean number alone does not reveal that these two documents share nothing. Cosine says it directly: "completely unrelated." For semantic search, cosine is usually the right default.
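Both worked examples above can be reproduced with a short sketch. The helper names are illustrative, and the zero-vector check is a choice made for this example (cosine similarity is undefined when either vector has length 0).

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Vector length (magnitude)."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (na * nb)

# Same direction, different length: cosine is 1 (up to floating-point rounding),
# while Euclidean distance reports a gap of about 4.123.
print(cosine_similarity([4, 1], [8, 2]))
print(math.dist([4, 1], [8, 2]))

# Perpendicular vectors: cosine is exactly 0.
print(cosine_similarity([3, 0], [0, 3]))  # 0.0
```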

Cosine Similarity — Angle Visualization

Drag the endpoints of the two vectors. The angle θ between them determines cosine similarity. Notice: scaling a vector (making it longer) doesn't change the angle or the cosine similarity.

Vectors a = [1, 0] and b = [0, 1] are perpendicular. What is their cosine similarity?

1.0: they have the same magnitude
0.0: perpendicular vectors have zero dot product
-1.0: opposite directions

Chapter 3: Dot Product

Look at the cosine similarity formula again:

cos_sim(a, b) = (a · b) / (|a| × |b|)

The dot product is just the numerator of cosine similarity — the division hasn't happened yet. For two vectors a and b:

Dot Product:
a · b = ∑_i a_i × b_i

For a = [2, 3, 1] and b = [4, 0, 5]:
a · b = (2×4) + (3×0) + (1×5) = 8 + 0 + 5 = 13

The dot product combines two things: the angle between vectors (cosine) AND the magnitudes of both vectors. Algebraically: a · b = |a| × |b| × cos(θ). So the dot product is large when vectors point in the same direction AND when they're long.

Dot product vs. cosine: If your vectors are normalized (length = 1), the dot product and cosine similarity are identical, because |a| = |b| = 1 and the denominator disappears. Many modern embedding models (OpenAI's text-embedding-ada-002, for example) return unit-norm vectors, so you can use the fast dot product instead of computing cosine.
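A quick numerical check of this equivalence, reusing the vectors from the dot-product example above. The helper names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    """Scale a vector to length 1."""
    length = math.sqrt(dot(a, a))
    if length == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in a]

a = [2, 3, 1]
b = [4, 0, 5]

raw_dot = dot(a, b)
cosine = raw_dot / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
unit_dot = dot(normalize(a), normalize(b))

print(raw_dot)                         # 13
print(math.isclose(cosine, unit_dot))  # True
```

Normalizing once at indexing time and using a plain dot product at query time avoids recomputing the magnitudes for every comparison.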

Why Transformers Use Dot Product for Attention

In a transformer's self-attention mechanism, every token produces both a query vector and a key vector. The attention score from token i to token j is the dot product of i's query with j's key. Why not cosine similarity?

Three reasons: (1) Speed — no need to compute and divide by magnitudes across millions of token pairs per second. (2) Expressiveness — the model can learn to boost important tokens by making their key vectors longer, encoding confidence in the magnitude. (3) Convenience — the attention weights are then passed through softmax anyway, which normalizes them, so raw magnitude sensitivity is partially handled downstream.

Scaled Dot-Product Attention:
Attention(Q, K, V) = softmax( Q K^T / √d_k ) × V

Q K^T = matrix of all query–key dot products
√d_k = scaling factor to prevent vanishing gradients

The division by √d_k is a pragmatic fix: in high dimensions, dot products grow large and push the softmax into regions with near-zero gradients. Dividing by the square root of the dimension keeps the values in a reasonable range.
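The formula can be traced by hand on tiny inputs. The sketch below is a deliberately simplified, single-head version in plain Python: it assumes Q, K, and V are given directly as lists of vectors, and it omits the learned projection matrices, masking, and batching that a real transformer layer has.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K: lists of d_k-dimensional vectors; V: one value vector per key."""
    scale = math.sqrt(len(K[0]))
    outputs = []
    for q in Q:
        # Query-key dot products, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
# The query matches the first key better, so the first value gets more weight.
print(scaled_dot_product_attention(Q, K, V))
```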

Dot Product vs. Cosine — The Magnitude Effect

Two vectors with a fixed angle of 45°. Watch how dot product changes as you scale them, while cosine similarity stays constant.

Two unit-norm vectors (both with length = 1) have a dot product of 0.87. What is their cosine similarity?

0.87: for unit vectors, dot product equals cosine similarity
0.87 / 2 = 0.435: you still divide by both magnitudes
You can't compute cosine similarity from the dot product alone

Chapter 4: Manhattan & Minkowski

Imagine you're navigating a city built on a grid — like Manhattan. You can't cut diagonally through buildings. To get from one corner to another, you walk some blocks east and some blocks north. The total number of blocks you walk is the Manhattan distance (also called L1 distance).

Manhattan Distance (L1):
d_L1(a, b) = ∑_i |a_i − b_i|

For a = [1, 4, 2] and b = [4, 1, 5]:
d_L1 = |1−4| + |4−1| + |2−5|
= 3 + 3 + 3 = 9

Compare to Euclidean: √(9 + 9 + 9) = √27 ≈ 5.20

Both metrics measure the same thing, how different two vectors are, but through a different geometric lens. Euclidean squares the differences before summing, so large differences dominate. Manhattan adds the absolute differences as they are, so each one counts in proportion to its size.

The Lp Family (Minkowski Distance)

Both L1 and L2 are special cases of the Minkowski distance with parameter p:

Minkowski Distance:
d_p(a, b) = ( ∑_i |a_i − b_i|^p )^(1/p)

p = 1 → Manhattan (L1)
p = 2 → Euclidean (L2)
p → ∞ → Chebyshev (max of absolute differences)
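One function can cover the whole family. The sketch below is a minimal illustration (the function name is invented here); it treats p = infinity as the Chebyshev limit and reuses the vectors from the Manhattan example above.

```python
import math

def minkowski_distance(a, b, p):
    """L_p distance; p = math.inf gives the Chebyshev (max) distance."""
    if p < 1:
        raise ValueError("p must be >= 1 for a valid distance")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

a, b = [1, 4, 2], [4, 1, 5]
print(minkowski_distance(a, b, 1))         # 9.0  (Manhattan)
print(minkowski_distance(a, b, 2))         # about 5.196 (Euclidean)
print(minkowski_distance(a, b, math.inf))  # 3    (Chebyshev)
```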

When Does L1 Beat L2?

In sparse, high-dimensional data — like word frequency vectors where most entries are 0 — L1 often works better than L2. Here's why: L2 squares each difference, so a single large outlier dimension dominates the distance. L1 sums raw differences, giving each dimension equal weight.

Concrete example: two document vectors over a 10,000-word vocabulary. One word, "extraordinary," appears 0 times in A and 10 times in B. Taken alone, that word contributes 10 to either distance (L2: √100 = 10; L1: 10). Now add 100 small differences of 1 across other words. L2 total: √(100 + 100) = √200 ≈ 14.1, so the single outlier supplies half of the squared distance. L1 total: 10 + 100 = 110, so the outlier supplies only 10/110 ≈ 9%. L2 is dominated by the outlier; L1 is more balanced.
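The arithmetic in this example can be checked directly. The sketch below builds the list of per-word differences (one outlier of 10 plus one hundred differences of 1) and reports how much of each total the outlier accounts for; the variable names are illustrative.

```python
import math

diffs = [10] + [1] * 100  # one outlier word plus 100 small differences

l1_total = sum(abs(d) for d in diffs)
squared_total = sum(d * d for d in diffs)
l2_total = math.sqrt(squared_total)

outlier_share_l1 = 10 / l1_total              # share of the L1 sum
outlier_share_l2 = (10 * 10) / squared_total  # share of the squared L2 sum

print(l1_total)                    # 110
print(round(l2_total, 1))          # 14.1
print(round(outlier_share_l1, 3))  # 0.091
print(outlier_share_l2)            # 0.5
```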

Curse of dimensionality: As the number of dimensions grows, the ratio of farthest to nearest neighbor distance approaches 1 under L2, so everything looks equally far. L1 tends to degrade more gracefully. Cosine similarity removes the magnitude part of the problem by comparing directions only.

L1 vs L2 — The Outlier Effect

Two 8-dimensional vectors. Use the slider to set one dimension to a large outlier value. Watch how L2 (Euclidean) reacts more strongly than L1 (Manhattan).

Vectors a = [0, 0, 0, 10] and b = [0, 0, 0, 0]. The difference is entirely in one dimension. Which statement is correct?

L1 distance = 100, L2 distance = 10
L1 = L2 = 10: single dimension, no difference
L1 distance = 10, L2 distance = 10: they agree when only one dimension differs

Chapter 5: Jaccard & Set-Based Metrics

All the metrics so far work on continuous vectors — lists of real numbers. But sometimes your data is more naturally represented as a set. A document might be a set of unique words. A playlist might be a set of song IDs. A shopping basket might be a set of products. For sets, the natural question is: how much do they overlap?

Jaccard similarity answers this by comparing the intersection (what they share) to the union (everything they contain between them):

Jaccard Similarity:
J(A, B) = |A ∩ B| / |A ∪ B|

A = {cat, dog, fish}
B = {dog, fish, bird}

A ∩ B = {dog, fish}, so |A ∩ B| = 2
A ∪ B = {cat, dog, fish, bird}, so |A ∪ B| = 4

J(A, B) = 2/4 = 0.5

Jaccard always falls in [0, 1]. Two identical sets have Jaccard = 1. Two disjoint sets have Jaccard = 0. It's completely insensitive to frequency — whether a word appears once or 100 times, it only counts as "present" or "absent."
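With Python's built-in set type, the definition is nearly a one-liner. The sketch below is illustrative; returning 1.0 for two empty sets is a convention chosen here, since the formula itself gives 0/0 in that case.

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for any two iterables, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention for two empty sets (assumption of this sketch)
    return len(a & b) / len(a | b)

print(jaccard_similarity({"cat", "dog", "fish"}, {"dog", "fish", "bird"}))  # 0.5

# Frequency is ignored: repeating "cat" does not change the result.
print(jaccard_similarity(["cat", "cat", "dog"], ["cat", "dog"]))  # 1.0
```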

Shingling for Text

A neat trick for text: instead of comparing word sets, compare shingle sets. A k-shingle is a contiguous sequence of k characters (or k words). The sentence "the cat sat" with 2-word shingles produces: {"the cat", "cat sat"}. Shingles capture local word order, so "the cat" and "cat the" produce different shingle sets.
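A minimal shingling sketch, assuming whitespace tokenization for the word variant (real pipelines usually also lowercase and strip punctuation). The function names are illustrative.

```python
def word_shingles(text, k=2):
    """Set of all contiguous k-word sequences in the text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def char_shingles(text, k=3):
    """Set of all contiguous k-character sequences in the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(sorted(word_shingles("the cat sat")))  # ['cat sat', 'the cat']

# Word order matters: these two shingle sets share nothing.
print(word_shingles("the cat") & word_shingles("cat the"))  # set()
```

The resulting shingle sets can be passed straight into a Jaccard computation.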

MinHash — Approximate Jaccard at Scale

Computing exact Jaccard for billions of document pairs is expensive: you'd have to compute the union and intersection for every pair. MinHash is a randomized shortcut. For each set, apply a random hash function and keep only the minimum hash value. The probability that two sets have the same minimum hash equals their Jaccard similarity. Apply many hash functions and average — you get an unbiased estimate of Jaccard without ever explicitly computing intersections.

MinHash Key Insight:
P(min_h(A) = min_h(B)) = |A ∩ B| / |A ∪ B| = J(A, B)

Use k hash functions → k MinHash values → fraction that agree ≈ J(A, B)
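The idea can be sketched as follows. This is a toy illustration, not a production implementation: it stands in for "k random hash functions" by salting SHA-1 with a seed, which is an assumption of this example, and real systems use much cheaper hash families and add locality-sensitive hashing on top.

```python
import hashlib

def _seeded_hash(seed, item):
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=256):
    """For each hash function, keep the minimum hash over the (non-empty) set."""
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

sig_a = minhash_signature({"cat", "dog", "fish"})
sig_b = minhash_signature({"dog", "fish", "bird"})
print(estimate_jaccard(sig_a, sig_b))  # an estimate of the true value 0.5
print(estimate_jaccard(sig_a, sig_a))  # 1.0
```

More hash functions give a tighter estimate at the cost of longer signatures; the standard error shrinks roughly with the square root of their number.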

Why MinHash matters: Duplicate detection on the web (which pages are near-copies?), plagiarism detection, and large-scale recommendation (users with similar listening histories) all use MinHash. Comparing two short, fixed-length signatures is far cheaper than intersecting the original sets, which makes Jaccard estimation practical across millions of pairs.

Jaccard Similarity — Set Overlap

Toggle words in each document set. Watch Jaccard similarity update as the overlap changes. The Venn diagram shows the intersection and union.

Sets A = {1, 2, 3} and B = {3, 4, 5}. What is the Jaccard similarity?

1/3 ≈ 0.33: one shared element out of the three in A
1/5 = 0.20: intersection is {3}, union is {1, 2, 3, 4, 5}
2/3 ≈ 0.67: majority of A is in B

Chapter 6: Learned Metrics

All the metrics so far are fixed formulas. They work on raw vectors — whatever features you hand them. But what if your raw features aren't the right representation? What if "similar" means something task-specific that no geometric formula can capture without training?

The solution: learn the metric. Train a neural network (an encoder) to map inputs into an embedding space where your chosen metric (usually cosine or dot product) correctly reflects your notion of similarity. The network learns to compress the inputs into vectors that are close for similar items and far for dissimilar ones.

Contrastive Loss

The simplest learned metric uses contrastive loss. Given a positive pair (two similar things) and a negative pair (two dissimilar things), the loss pulls positives together and pushes negatives apart:

Contrastive Loss (Hinge):
L = y × d(a, b)² + (1 − y) × max(0, m − d(a, b))²

y = 1 for similar pairs, y = 0 for dissimilar pairs
m = margin (how far apart dissimilar pairs should be)
d = Euclidean distance between embeddings

For positive pairs (y=1): minimize the distance. For negative pairs (y=0): only penalize if they're closer than margin m — you don't care how far apart they are beyond m.
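A per-pair version of this loss might look like the sketch below. It uses the common formulation in which the hinge term for negative pairs is squared, like the positive term; the function name and default margin are choices made for this example, and a real training loop would compute this on batches inside an autodiff framework.

```python
import math

def contrastive_loss(a, b, y, margin=1.0):
    """Loss for one pair of embeddings; y = 1 for similar, y = 0 for dissimilar."""
    d = math.dist(a, b)  # Euclidean distance between the embeddings
    if y == 1:
        return d ** 2                     # pull positives together
    return max(0.0, margin - d) ** 2      # push negatives out to the margin

print(contrastive_loss([0, 0], [3, 4], y=1))    # 25.0: positives far apart are penalized
print(contrastive_loss([0, 0], [3, 4], y=0))    # 0.0: negatives beyond the margin cost nothing
print(contrastive_loss([0, 0], [0.5, 0], y=0))  # 0.25: negatives inside the margin are penalized
```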

Triplet Loss

Contrastive loss requires carefully balancing positives and negatives. Triplet loss simplifies this with three items: an anchor a, a positive p (similar to anchor), and a negative n (dissimilar). The loss forces anchor to be closer to positive than to negative, by at least a margin:

Triplet Loss:
L = max(0, d(a, p) − d(a, n) + margin)

Goal: d(anchor, positive) + margin ≤ d(anchor, negative)
i.e., the positive must be at least `margin` closer to the anchor than the negative is
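For a single triplet the loss is one line. The sketch below assumes Euclidean distance between embeddings, and the function name and example points are illustrative.

```python
import math

def triplet_loss(anchor, positive, negative, margin):
    """max(0, d(a, p) - d(a, n) + margin) with Euclidean distance."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)

anchor, positive = [0.0, 0.0], [1.0, 0.0]

# Negative is 2.0 away, positive 1.0 away: the gap of 1.0 exceeds the margin.
print(triplet_loss(anchor, positive, [0.0, 2.0], margin=0.5))   # 0.0

# Negative is only 1.25 away: the gap of 0.25 falls short of the margin by 0.25.
print(triplet_loss(anchor, positive, [0.0, 1.25], margin=0.5))  # 0.25
```

Triplets that already satisfy the margin produce zero loss and zero gradient, which is why practical systems spend effort on mining "hard" triplets.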

InfoNCE (the Modern Standard)

InfoNCE loss (used in CLIP, SimCLR, and most modern contrastive systems) scales this idea to entire batches. For each anchor in a batch of N items, one item is the positive and N-1 are negatives. The loss asks the model to identify the positive among all the negatives — like an N-way classification:

InfoNCE Loss:
L = −log[ exp(sim(a, p)/τ) / ∑_{k=1..N} exp(sim(a, k)/τ) ]

τ = temperature (controls how sharply to separate positives from negatives)
sim = cosine similarity or dot product
N = number of candidates per anchor, typically the batch size (a larger batch means more negatives and a harder task, which often yields better representations)

Temperature τ: Low temperature (τ < 0.1) makes the distribution spiky, so the model strongly differentiates positive from negative. High temperature (τ > 1) makes it soft. Some training recipes start warm and cool down; CLIP instead treats the temperature as a learnable parameter, initialized to 0.07.
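For one anchor, InfoNCE is a softmax cross-entropy over similarity scores. The sketch below assumes the similarities to all N candidates have already been computed and that `positive_index` marks the true match; both names are invented for this example. It uses the log-sum-exp trick for numerical stability.

```python
import math

def info_nce_loss(sims, positive_index, temperature=0.07):
    """-log softmax(sims / temperature)[positive_index] for a single anchor."""
    logits = [s / temperature for s in sims]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[positive_index]

sims = [0.9, 0.1, 0.2]  # candidate 0 is the positive and has the highest similarity

# A low temperature sharpens the softmax, so a correct ranking yields a smaller loss.
print(info_nce_loss(sims, 0, temperature=0.07))
print(info_nce_loss(sims, 0, temperature=1.0))

# With no preference among 4 candidates, the loss equals log(4).
print(info_nce_loss([0.5] * 4, 0, temperature=1.0))
```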

Triplet Loss — Watch Points Organize

The anchor is pulled toward the positive and pushed from the negative. Click "Step" to apply one gradient update. The embeddings learn a metric.

In triplet loss with margin = 0.5, the anchor is 0.3 away from positive and 0.9 away from negative. Is there any loss?

Yes: the positive is too close to the anchor
No: the negative is already more than 0.5 farther than the positive (0.9 - 0.3 = 0.6 > 0.5)
Yes: any non-zero distance to positive contributes loss

Chapter 7: Interactive Metric Playground

Everything at once. Two vectors in 2D. Every metric computed live. Drag the points. Toggle normalization. Watch how each metric responds differently to rotation vs. scaling.

The experiment: (1) Scale one vector up and watch dot product grow while cosine stays flat. (2) Rotate both vectors together and watch cosine, dot product, and Euclidean distance stay constant while Manhattan distance shifts. (3) Toggle normalization and watch dot product and cosine converge.

Metric Playground — All Metrics Side by Side

Drag the orange and teal vector endpoints. All metrics update live.

Rotation Experiment

Rotate both vectors by the same angle. Cosine similarity, dot product, and Euclidean distance should all stay constant: rotation changes neither the angle between the vectors, nor their lengths, nor the distance between their endpoints. Manhattan distance is the exception, because it depends on how the vectors sit relative to the axes. Confirm your intuition.
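Without the interactive widget, the same experiment can be run numerically. The sketch below rotates two example 2D vectors (chosen arbitrarily for this illustration) by 40° and compares each metric before and after.

```python
import math

def rotate(v, degrees):
    """Rotate a 2D vector counter-clockwise about the origin."""
    t = math.radians(degrees)
    x, y = v
    return [x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = [3.0, 1.0], [1.0, 2.0]
ra, rb = rotate(a, 40), rotate(b, 40)

print(math.isclose(dot(a, b), dot(ra, rb)))              # True: dot product unchanged
print(math.isclose(math.dist(a, b), math.dist(ra, rb)))  # True: Euclidean unchanged
print(manhattan(a, b), round(manhattan(ra, rb), 3))      # Manhattan changes
```

Cosine similarity is the dot product divided by the two lengths, all of which are unchanged, so it is rotation-invariant as well.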

Chapter 8: Choosing the Right Metric

You now have six tools. Here's how to pick the right one for your problem.

Rule 1 — Ask about your data type first. If your data is sets (documents as word bags, user histories as item sets), start with Jaccard. If your data is continuous vectors (embeddings, image features, audio representations), move to the next question.

Rule 2 — Are your vectors normalized? If yes (all vectors have length 1), dot product and cosine similarity are identical. Use dot product — it's faster. If no, magnitude differences matter: use cosine similarity if magnitude is noise (text search), use Euclidean or dot product if magnitude is signal (recommendation with confidence scores).

Rule 3 — What's your index? Production similarity search runs on approximate nearest neighbor (ANN) indexes. FAISS and ScaNN support inner product (dot product) and Euclidean natively. Cosine can be reduced to inner product by normalizing vectors first. If you're using an ANN library, check which metric it supports and normalize accordingly.
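Normalizing first has a further consequence worth knowing: for unit vectors, squared Euclidean distance equals 2 − 2 × (dot product), so an inner-product index and a Euclidean index return the same ranking. The sketch below checks this on three made-up document vectors; it uses brute-force sorting in place of a real ANN index.

```python
import math

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = normalize([0.2, 0.8, 0.1])
docs = {
    "doc_a": normalize([0.1, 0.9, 0.0]),
    "doc_b": normalize([0.9, 0.1, 0.3]),
    "doc_c": normalize([0.3, 0.5, 0.4]),
}

# Rank by largest inner product, then by smallest Euclidean distance.
by_inner_product = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
by_euclidean = sorted(docs, key=lambda name: math.dist(query, docs[name]))

print(by_inner_product)                  # ['doc_a', 'doc_c', 'doc_b']
print(by_inner_product == by_euclidean)  # True
```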

Rule 4 — High-dimensional sparse data? Prefer L1 over L2. The squaring in L2 amplifies outlier dimensions and degrades in high-dimensional sparse spaces. For text with TF-IDF features, Manhattan distance or cosine are both good choices.

Metric	Range	Magnitude-sensitive?	Best for	Watch out for
Euclidean (L2)	[0, ∞)	Yes	Low-dim, normalized data, pixel similarity	High dimensions, different scales
Cosine	[-1, 1]	No	Text embeddings, semantic search, NLP	Zero vectors (undefined); short texts
Dot Product	(−∞, ∞)	Yes	Unit-norm embeddings, transformer attention	Un-normalized vectors; explosive values
Manhattan (L1)	[0, ∞)	Yes	Sparse high-dim data, robust to outliers	Harder to index; less geometric intuition
Jaccard	[0, 1]	No (binary)	Sets, binary data, document overlap	Frequency information lost
Learned (InfoNCE)	Task-dependent	Trained	Any data with task-specific similarity	Requires positive pairs (labeled or generated by augmentation); compute-heavy training

A Decision Tree

Data type? (What form does your input take?)
  Sets / binary → Jaccard (or MinHash for scale)
  Continuous vectors → Are vectors normalized (length = 1)?
    Yes → Dot product
    No → Is magnitude meaningful?
      Magnitude = noise → Cosine (e.g. semantic search)
      Magnitude = signal → Euclidean or dot product

The practitioner's default: Normalize your embeddings to unit norm, then use dot product (= cosine for unit vectors). This is a common convention among embedding API providers such as OpenAI and Cohere, though it is worth confirming in your provider's documentation. It's fast, interpretable, and works with any ANN index that supports inner product.
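The decision tree above can be written down as a small function. This is only a restatement of the chapter's rules of thumb, not a library API: the function name, parameter names, and return strings are all invented for this sketch.

```python
def choose_metric(data_type, normalized=False, magnitude_is_signal=False):
    """Return the metric suggested by the decision tree.

    data_type: "sets" (sets or binary data) or "vectors" (continuous vectors).
    """
    if data_type == "sets":
        return "jaccard"  # or MinHash when exact Jaccard is too expensive
    if data_type != "vectors":
        raise ValueError("data_type must be 'sets' or 'vectors'")
    if normalized:
        return "dot_product"  # equals cosine for unit vectors
    if magnitude_is_signal:
        return "euclidean_or_dot_product"
    return "cosine"

print(choose_metric("sets"))                                  # jaccard
print(choose_metric("vectors", normalized=True))              # dot_product
print(choose_metric("vectors", normalized=False))             # cosine
print(choose_metric("vectors", magnitude_is_signal=True))     # euclidean_or_dot_product
```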

Which Metric for Which Task?

You're building a semantic search engine. Your embedding model outputs 384-dimensional vectors. Vectors are NOT pre-normalized. Which metric should you default to?

Manhattan distance: safer in high dimensions
Raw dot product: fastest option
Cosine similarity (or normalize then dot product): direction matters, magnitude is noise

Continue learning: These metrics are the foundation for Contrastive Learning & CLIP, Attention & Transformers (dot product attention), and vector databases like Pinecone and Weaviate.

"The purpose of computing is insight, not numbers." — Richard Hamming