Loss Functions — From MSE to InfoNCE

Chapter 0: The Scoring Problem

You've built a neural network. You feed it an image of a cat. It outputs three numbers: [2.1, 0.8, -0.3], one for each class — cat, dog, bird. Those numbers say the network thinks "cat" is most likely. Good.

But here's the question nobody asks early enough: how do you tell the network it was right? And more importantly — when it outputs [0.5, 1.9, 0.3] for the same cat image and guesses "dog" — how do you tell it how wrong it was, and in which direction to fix itself?

You need a single number. A score. Low when the network is right, high when it's wrong. This number flows backward through every layer, nudging every weight. Get the scoring function wrong, and the network learns the wrong thing — or learns nothing at all.

That scoring function is called a loss function. It's the most important design choice you make when training a neural network. More important than the architecture. More important than the optimizer. Because the loss defines what the network is actually trying to do.

The loss function IS the task. A network minimizing cross-entropy learns to output class probabilities. A network minimizing triplet loss learns to organize an embedding space. Change the loss, and you change what the network learns, even with identical architecture and data.

In this lesson, we'll build every major loss function from scratch. We'll start with the simplest idea — squared error — and discover why it fails for classification. We'll derive cross-entropy from information theory. We'll see why contrastive losses revolutionized representation learning. And we'll end with InfoNCE, the loss that powers CLIP, SimCLR, and most modern self-supervised learning.

But first, let's see the problem with our own eyes.

The Loss Signal

A model predicts probabilities for three classes. The true label is Cat. Drag the slider to change the model's confidence in "Cat" and watch what happens to different loss functions.

[Interactive slider: P(Cat) = 0.70]

Notice something striking. When the model is very confident and correct (P(Cat) near 1.0), both losses are near zero — all is well. But when the model is confidently wrong (P(Cat) near 0.0), cross-entropy explodes toward infinity while MSE stays calmly bounded. Cross-entropy screams at confident mistakes. MSE merely shrugs.

That difference matters enormously. A loss that screams at confident mistakes produces large gradients — strong learning signals — exactly when the model needs correction most. A loss that shrugs produces tiny gradients, and the model barely updates. This is why cross-entropy dominates classification. But to understand why, we need to build up from the beginning.
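
The contrast between the two losses can be reproduced with a few lines of NumPy. This is a minimal sketch, not the widget's own code: the helper name `losses_for_confidence` is hypothetical, and it assumes the probability not assigned to "Cat" is split evenly between the other two classes.

```python
import numpy as np

def losses_for_confidence(p_cat):
    """Return (MSE, cross-entropy) for a 3-class prediction whose true class is 'cat'.

    Assumption: the remaining probability mass is split evenly between the
    other two classes.
    """
    probs = np.array([p_cat, (1 - p_cat) / 2, (1 - p_cat) / 2])
    target = np.array([1.0, 0.0, 0.0])
    mse = np.mean((target - probs) ** 2)   # bounded, however wrong the model is
    ce = -np.log(probs[0])                 # grows without bound as p_cat -> 0
    return mse, ce

for p_cat in [0.99, 0.70, 0.30, 0.01, 0.0001]:
    mse, ce = losses_for_confidence(p_cat)
    print(f"P(Cat)={p_cat:<7} MSE={mse:.3f}  CE={ce:.3f}")
```

As P(Cat) shrinks, the MSE column levels off while the cross-entropy column keeps climbing.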

Why do we need a loss function at all?

(a) To make the network run faster
(b) To produce a single number that measures how wrong the network is, so gradients can flow backward and update weights
(c) To decide which architecture to use

Chapter 1: Mean Squared Error — The Simplest Ruler

The most natural way to measure "how wrong" is to ask: how far is my prediction from the truth? If I predicted a house costs $300,000 and it actually costs $350,000, the error is $50,000. Square it to punish big errors more than small ones. Average over all examples. Done.

That's Mean Squared Error (MSE). For a single prediction:

L = (y - ŷ)²

Where y is the true value and ŷ (y-hat) is the prediction. For a batch of N examples, average them:

MSE = (1/N) · ∑_{i=1}^{N} (y_i - ŷ_i)²

Hand Calculation: House Prices

Suppose we predict three house prices (in thousands):

House	True y	Predicted ŷ	Error	Squared
A	350	300	50	2,500
B	200	210	-10	100
C	500	480	20	400

MSE = (2500 + 100 + 400) / 3 = 3000 / 3 = 1000.

House A dominates the loss — its error is 5× larger than B's, but its squared error is 25× larger. That's the key property of squaring: it amplifies large errors disproportionately. A single outlier can hijack the entire loss.

Why square? Three reasons. (1) It makes all errors positive — a prediction that's too high and one that's too low both increase the loss. (2) It's differentiable everywhere, giving smooth gradients. (3) Under the hood, minimizing MSE is equivalent to maximum likelihood estimation when errors are Gaussian-distributed. The math and the statistics agree.
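
Reason (3) can be made concrete with a short derivation. This is a sketch under one stated assumption: each target equals the model's prediction plus independent Gaussian noise with a fixed variance σ² (σ is a symbol introduced here for illustration; the lesson does not use it elsewhere).

```latex
\begin{aligned}
y_i &= \hat{y}_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
-\log p(y_1, \dots, y_N)
  &= -\sum_{i=1}^{N} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{(y_i - \hat{y}_i)^2}{2\sigma^2} \right) \right] \\
  &= \frac{N}{2}\log(2\pi\sigma^2)
     + \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
\end{aligned}
```

The first term does not depend on the predictions, so maximizing the likelihood is the same as minimizing the sum of squared errors, which is N times the MSE.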

The Gradient of MSE

Training requires derivatives. The gradient of MSE with respect to the prediction ŷ is:

∂L/∂ŷ = -2(y - ŷ) = 2(ŷ - y)

This is beautifully simple. The gradient is proportional to the error. Big error → big gradient → big update. Small error → small gradient → small update. The learning signal is directly proportional to "how wrong you are." For regression, this is exactly what we want.

From Scratch in Python

python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean Squared Error from scratch."""
    errors = y_true - y_pred
    squared = errors ** 2
    return np.mean(squared)

def mse_gradient(y_true, y_pred):
    """Gradient of MSE w.r.t. predictions."""
    n = len(y_true)
    return 2 * (y_pred - y_true) / n

# Example
y = np.array([350, 200, 500])
yhat = np.array([300, 210, 480])
print(mse_loss(y, yhat))   # 1000.0
print(mse_gradient(y, yhat)) # [-33.33, 6.67, -13.33]

Why MSE Fails for Classification

Now let's try MSE on a classification problem. Suppose the true label is "cat" = class 0. We encode this as a one-hot vector: [1, 0, 0]. The model outputs probabilities [0.2, 0.5, 0.3].

MSE = ((1-0.2)² + (0-0.5)² + (0-0.3)²) / 3 = (0.64 + 0.25 + 0.09) / 3 = 0.327.

Now imagine the model outputs [0.01, 0.98, 0.01] — it's 98% sure it's a dog. Catastrophically wrong.

MSE = ((1-0.01)² + (0-0.98)² + (0-0.01)²) / 3 = (0.9801 + 0.9604 + 0.0001) / 3 = 0.647.

The loss only went from 0.327 to 0.647 — roughly doubled. But the model went from "mildly confused" to "completely wrong with high confidence." MSE doesn't punish confident mistakes nearly enough. The gradient is proportional to the error, which stays bounded between 0 and 1 for probabilities. The signal is too weak.

Common mistake: Using MSE for classification. It "works" in the sense that loss goes down, but training is painfully slow: once the output passes through a sigmoid or softmax, the gradient with respect to the logits is tiny whenever the predicted probability sits near 0 or 1. The loss landscape has flat regions where the model gets stuck. Cross-entropy (Chapter 3) fixes this because its loss grows without bound for confidently wrong predictions.

MSE vs Classification: The Flat Gradient Problem

Binary classification: true label is 1. Drag the prediction and compare MSE loss and its gradient. Notice how the gradient flattens near 0 and 1.

[Interactive slider: Prediction ŷ = 0.30]

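These flat regions appear when the prediction ŷ comes out of a sigmoid: the MSE gradient with respect to the logit carries a factor ŷ(1 - ŷ), which vanishes as ŷ approaches 0 or 1. That is exactly where we need the strongest learning signal, when the model is confidently wrong, and MSE gives the weakest signal there. This fundamental mismatch is why we need a different loss for classification.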
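
The flat-gradient effect can be checked numerically. The sketch below assumes the prediction is produced by a sigmoid applied to a logit z, and compares the gradient of MSE with the gradient of binary cross-entropy (covered in Chapter 3), both taken with respect to z. The function names are illustrative, not from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse_wrt_logit(z, y):
    """d/dz of (y - sigmoid(z))^2, via the chain rule."""
    p = sigmoid(z)
    return 2 * (p - y) * p * (1 - p)   # extra p(1-p) factor vanishes when saturated

def grad_bce_wrt_logit(z, y):
    """d/dz of binary cross-entropy on sigmoid(z); simplifies to p - y."""
    return sigmoid(z) - y

y = 1.0  # true label
for z in [-8.0, -2.0, 0.0, 2.0]:
    print(f"z={z:+.1f}  p={sigmoid(z):.4f}  "
          f"MSE grad={grad_mse_wrt_logit(z, y):+.5f}  "
          f"BCE grad={grad_bce_wrt_logit(z, y):+.5f}")
```

At z = -8 the model is confidently wrong, yet the MSE gradient is close to zero while the cross-entropy gradient is close to -1.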

Why does MSE fail for classification?

(a) MSE can't handle more than two classes
(b) MSE always gives zero loss
(c) MSE's gradients are too weak when the model is confidently wrong — it doesn't punish confident mistakes enough

Chapter 2: Softmax — From Scores to Probabilities

Before we can compute a classification loss, we need to convert the network's raw outputs into probabilities. A network's final layer outputs numbers like [2.1, 0.8, -0.3]. These are called logits — raw, unconstrained scores. They can be any real number: negative, large, small.

Probabilities must satisfy two constraints: (1) every value is between 0 and 1, and (2) they all sum to 1. How do we transform arbitrary logits into something that meets both constraints?

The Naive Approach (and Why It Fails)

First idea: just divide each logit by the sum. Logits [2.1, 0.8, -0.3], sum = 2.6. Probabilities: [0.81, 0.31, -0.12]. Negative! That's not a probability. And even if all logits were positive, this doesn't amplify differences — the largest logit barely dominates.

The Exponential Trick

Second idea: exponentiate first, then normalize. The exponential function e^x is always positive, solving the negativity problem. And it amplifies differences: e^2.1 is much larger than e^0.8, which is much larger than e^-0.3.

softmax(z_i) = e^{z_i} / ∑_j e^{z_j}

That's softmax. Exponentiate each logit, then divide by the sum of all exponentials. The result is always positive and always sums to 1. Perfect probabilities.

Hand Calculation

Logits: z = [2.1, 0.8, -0.3]

Step 1 — exponentiate each:

e^2.1 = 8.166, e^0.8 = 2.226, e^-0.3 = 0.741

Step 2 — sum the exponentials:

8.166 + 2.226 + 0.741 = 11.133

Step 3 — divide each by the sum:

P(cat) = 8.166 / 11.133 = 0.733

P(dog) = 2.226 / 11.133 = 0.200

P(bird) = 0.741 / 11.133 = 0.067

Check: 0.733 + 0.200 + 0.067 = 1.000. ✓ The highest logit (2.1) got the highest probability (73.3%). The negative logit (-0.3) got the smallest (6.7%).

Temperature: Sharpness Control

What if we divide the logits by a number T before exponentiating?

softmax(z_i ; T) = e^{z_i/T} / ∑_j e^{z_j/T}

T is called the temperature. When T is large (high temperature), the exponentials are all closer to 1, so probabilities become nearly uniform — the model hedges. When T is small (low temperature), differences are amplified — the model becomes more confident. At T → 0, softmax becomes argmax: all probability on the largest logit.

Temperature analogy: Think of T as the "chill factor." High temperature = model is relaxed, spreads probability evenly, doesn't commit. Low temperature = model is intense, puts nearly all probability on its top pick. T=1 is the standard — no scaling applied.

Numerical Stability: The Log-Sum-Exp Trick

There's a trap. If a logit is 1000, then e^1000 overflows to infinity. If it's -1000, e^-1000 underflows to zero. Real networks produce extreme logits all the time.

The fix: subtract the maximum logit before exponentiating. If z = [1000, 999, 998], compute z' = [0, -1, -2] by subtracting 1000. Now e^0 = 1, e^-1 = 0.368, e^-2 = 0.135, so nothing overflows. Subtracting the maximum multiplies the numerator and the denominator by the same factor, which cancels in the ratio, so the result is identical.

python
def softmax(z, T=1.0):
    """Numerically stable softmax with temperature."""
    z = z / T
    z_max = np.max(z)          # subtract max for stability
    exps = np.exp(z - z_max)   # no overflow now
    return exps / np.sum(exps)

# Test
logits = np.array([2.1, 0.8, -0.3])
print(softmax(logits))       # [0.734, 0.200, 0.067]
print(softmax(logits, T=0.5)) # [0.924, 0.069, 0.008] — sharper
print(softmax(logits, T=5.0)) # [0.418, 0.323, 0.259] — flatter

Softmax Temperature Explorer

Three logits: [2.1, 0.8, -0.3]. Adjust the temperature and watch the probability distribution sharpen or flatten. Low T → confident. High T → uniform.

[Interactive slider: Temperature T = 1.0]

Common mistake: Forgetting the log-sum-exp trick. Raw softmax with large logits produces NaN (infinity divided by infinity). Always subtract the max first. Every production framework does this internally, but if you're implementing from scratch, you must do it yourself.
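
To see the failure directly, here is a small sketch comparing a softmax without max subtraction against the stable version. The names `naive_softmax` and `stable_softmax` are illustrative; the NumPy warnings are silenced only so the NaN result is easy to read.

```python
import numpy as np

def naive_softmax(z):
    """Softmax without max subtraction; overflows for large logits."""
    exps = np.exp(z)
    return exps / np.sum(exps)

def stable_softmax(z):
    """Softmax with the maximum logit subtracted first."""
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

big_logits = np.array([1000.0, 999.0, 998.0])

# exp(1000) overflows to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(big_logits)
stable = stable_softmax(big_logits)

print(naive)                 # every entry is nan
print(np.round(stable, 3))   # a valid probability distribution
```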

What happens to softmax output as temperature T approaches zero?

(a) All probability concentrates on the largest logit (approaches argmax)
(b) All probabilities become equal (uniform distribution)
(c) All probabilities become zero

Chapter 3: Cross-Entropy — The Standard

We now have probabilities from softmax. We need a loss function that (1) is zero when the predicted probability for the correct class is 1.0, (2) is infinite when it's 0.0, and (3) produces strong gradients for confidently wrong predictions. MSE fails tests (2) and (3). What works?

The answer comes from an unexpected place: information theory. To build up to cross-entropy, we need one concept: surprise.

Surprise: How Shocked Are You?

Imagine you're predicting tomorrow's weather. If your model says "99% chance of sun" and it rains — you're very surprised. If it says "50% chance of rain" and it rains — you're only mildly surprised. Surprise is inversely related to probability.

We define the information content (surprise) of an event with probability p as:

surprise(p) = -log(p)

Why logarithm? Two reasons. First, it turns multiplication into addition: the surprise of two independent events happening is the sum of their individual surprises. Second, it gives the right shape. Using the natural logarithm (as this lesson does throughout, so values are in nats): -log(1.0) = 0 (no surprise when certain), -log(0.5) = 0.693 (moderate surprise), -log(0.01) = 4.605 (very surprised), and -log(p) → ∞ as p → 0 (infinitely surprised by the impossible).

Hand Calculation: Surprise Values

Probability p	-log(p)	Interpretation
1.00	0.000	No surprise — you knew it would happen
0.90	0.105	Barely surprised
0.50	0.693	Coin flip — moderate surprise
0.10	2.303	Quite surprised
0.01	4.605	Very surprised — thought it was almost impossible
→ 0	→ ∞	Infinitely surprised — model said impossible, yet it happened

From Surprise to Cross-Entropy

Entropy is the expected surprise under the true distribution. If the true distribution is p, entropy is:

H(p) = -∑_i p_i · log(p_i)

This measures the inherent uncertainty in the data. With the natural log, a fair coin has entropy 0.693 nats. A loaded coin (99% heads) has entropy 0.056 nats, so it is very predictable.
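
The two coin values can be reproduced with a short helper. This is a minimal sketch using the natural logarithm; the function name `entropy` is chosen for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; outcomes with p_i = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log(p[nonzero]))

print(f"fair coin:   {entropy([0.5, 0.5]):.3f}")
print(f"loaded coin: {entropy([0.99, 0.01]):.3f}")
```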

Cross-entropy measures the expected surprise when using a wrong distribution q to predict events that actually follow distribution p:

H(p, q) = -∑_i p_i · log(q_i)

For classification with one-hot labels, p is [1, 0, 0, ...] — all mass on the true class. This simplifies beautifully. If the true class is k:

H(p, q) = -log(q_k)

The entire loss is just the negative log-probability of the correct class. That's it. All the information theory collapses to one logarithm.

Name aliases: This is also called Negative Log-Likelihood (NLL). In PyTorch, nn.CrossEntropyLoss = log-softmax + NLL fused together and takes raw logits, while nn.NLLLoss expects you to apply log-softmax yourself first. Same math, different API split.

Hand Calculation: Cross-Entropy Loss

Model outputs probabilities [0.733, 0.200, 0.067] for [cat, dog, bird]. True label: cat (class 0).

CE = -log(0.733) = 0.311

Now the model is wrong — it outputs [0.067, 0.733, 0.200]. True label still cat.

CE = -log(0.067) = 2.703

And catastrophically wrong: [0.01, 0.98, 0.01].

CE = -log(0.01) = 4.605

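Compare with MSE from Chapter 1. On the catastrophic prediction [0.01, 0.98, 0.01], MSE was 0.647, already close to its ceiling of 2/3 for three classes, so no prediction can score much worse. Cross-entropy on the same prediction is 4.605, about 15× the 0.311 of the correct prediction above, and it keeps growing without bound as the probability of the true class shrinks (at q_k = 0.0001 it is 9.210). Cross-entropy screams at confident mistakes.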

The Gradient: Why It's Perfect

The gradient of cross-entropy loss after softmax has an astonishingly clean form:

∂L/∂z_i = q_i - p_i

Where z_i is the logit, q_i is the softmax output, and p_i is the true label (0 or 1). For the correct class: gradient = q_k - 1. When the model is confident and correct (q_k ≈ 1), gradient ≈ 0 — no update needed. When the model is confidently wrong (q_k ≈ 0), gradient ≈ -1 — maximum update. The gradient is exactly proportional to the error, just like MSE was for regression, but now in probability space where it matters.

python
def cross_entropy_loss(y_true_idx, probs):
    """Cross-entropy for one-hot labels.
    y_true_idx: integer index of correct class
    probs: softmax probabilities
    """
    return -np.log(probs[y_true_idx] + 1e-15)  # epsilon for stability

def cross_entropy_gradient(y_true_idx, probs):
    """Gradient w.r.t. logits (after softmax)."""
    grad = probs.copy()
    grad[y_true_idx] -= 1    # q_i - p_i, where p_k = 1
    return grad

# Example
probs = np.array([0.733, 0.200, 0.067])
print(cross_entropy_loss(0, probs))      # 0.311
print(cross_entropy_gradient(0, probs))  # [-0.267, 0.200, 0.067]

The beauty of softmax + cross-entropy: Despite being derived from information theory and involving exponentials and logarithms, the final gradient is just predicted minus true. The log and exp cancel each other out. This isn't a coincidence — it's a deep mathematical consequence of exponential families.
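
The "predicted minus true" result can be verified numerically rather than taken on faith. The sketch below compares the analytic gradient q - p against central finite differences of the loss with respect to each logit. The helper names are illustrative.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def ce_from_logits(z, true_idx):
    """Cross-entropy of softmax(z) against a one-hot label."""
    return -np.log(softmax(z)[true_idx])

def numerical_gradient(z, true_idx, h=1e-6):
    """Central finite differences of the loss w.r.t. each logit."""
    grad = np.zeros_like(z)
    for i in range(len(z)):
        step = np.zeros_like(z)
        step[i] = h
        grad[i] = (ce_from_logits(z + step, true_idx)
                   - ce_from_logits(z - step, true_idx)) / (2 * h)
    return grad

z = np.array([2.1, 0.8, -0.3])
analytic = softmax(z)
analytic[0] -= 1.0                      # q - p, with the true class at index 0
numeric = numerical_gradient(z, true_idx=0)

print(np.round(analytic, 4))
print(np.round(numeric, 4))
```

The two printed vectors should agree to within finite-difference error.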

Binary Cross-Entropy (BCE)

For two classes, we can simplify. Let y ∈ {0, 1} be the true label and p be the predicted probability of class 1:

BCE = -[ y · log(p) + (1-y) · log(1-p) ]

When y=1: BCE = -log(p). When y=0: BCE = -log(1-p). This is the loss used in logistic regression, binary classification, and each output of a multi-label classifier.
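
A from-scratch version of the formula, as a minimal sketch: the clipping constant `eps` is an implementation choice made here to keep the logarithm finite, and the labels and probabilities are illustrative.

```python
import numpy as np

def bce_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy over a batch.

    eps clips predictions away from exactly 0 and 1 so log() stays finite.
    """
    p = np.clip(p_pred, eps, 1 - eps)
    per_example = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.mean(per_example)

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.3])   # third example is a fairly confident miss
print(f"{bce_loss(y, p):.3f}")
```

Most of the batch loss comes from the third example, where the model assigned only 0.3 to the true class.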

Cross-Entropy vs MSE Loss Curves

True label is class 1 (y=1). As predicted probability p varies from 0 to 1, compare the loss curves and their gradients. Notice how CE gradient grows without bound as p→0.

[Interactive slider: Prediction p = 0.50]

Common mistake: Getting the softmax step wrong. When implementing from scratch, never apply CE directly to raw logits: they aren't probabilities and can be negative or greater than 1, so softmax first, then CE. With framework losses the trap runs the other way: fused functions such as PyTorch's nn.CrossEntropyLoss expect raw logits and apply log-softmax internally, so adding your own softmax in front applies it twice. Frameworks fuse the two steps for numerical stability (computing log-softmax directly avoids the log(exp(...)) roundtrip).
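
The fused computation can be sketched in NumPy. This is an illustration of the idea, not any framework's actual implementation: it computes log-softmax directly from the logits using max subtraction, so the probability of the true class never has to be formed and then logged.

```python
import numpy as np

def cross_entropy_from_logits(logits, true_idx):
    """Softmax cross-entropy computed directly from raw logits.

    Uses log softmax(z)_k = (z_k - max z) - log(sum_j exp(z_j - max z)).
    """
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[true_idx]

# Ordinary logits: agrees with softmax followed by -log.
print(f"{cross_entropy_from_logits(np.array([2.1, 0.8, -0.3]), 0):.3f}")

# Extreme logits: the softmax probability of class 0 underflows to 0,
# so a separate softmax-then-log would return inf. The fused form stays finite.
extreme = np.array([-1000.0, 1000.0, 0.0])
print(cross_entropy_from_logits(extreme, 0))
```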

For a one-hot classification problem, what does cross-entropy loss simplify to?

(a) The sum of all predicted probabilities
(b) Negative log of the predicted probability for the correct class: -log(q_k)
(c) The squared difference between predicted and true probabilities

Chapter 4: KL Divergence — Distance Between Distributions

Cross-entropy told us "how surprised are we when using model q to predict reality p." But we also want to know: how much extra surprise does q cause compared to the best possible model (p itself)?

That "extra surprise" is called Kullback-Leibler divergence, or KL divergence:

D_KL(p ‖ q) = H(p, q) - H(p) = ∑_i p_i · log(p_i / q_i)

Cross-entropy minus entropy. The entropy H(p) is fixed — it's a property of the data, not the model. So minimizing cross-entropy IS minimizing KL divergence. They lead to the same gradient, the same optimal model. KL divergence just removes the constant so the minimum is exactly zero.
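
The identity D_KL(p ‖ q) = H(p, q) - H(p) is easy to confirm numerically. A minimal sketch with two illustrative distributions over three outcomes:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (illustrative values)
q = np.array([0.5, 0.3, 0.2])   # model distribution (illustrative values)

entropy = -np.sum(p * np.log(p))          # H(p)
cross_entropy = -np.sum(p * np.log(q))    # H(p, q)
kl = np.sum(p * np.log(p / q))            # D_KL(p || q)

print(f"H(p)          = {entropy:.4f}")
print(f"H(p, q)       = {cross_entropy:.4f}")
print(f"H(p,q) - H(p) = {cross_entropy - entropy:.4f}")
print(f"KL(p || q)    = {kl:.4f}")
```

The last two lines print the same number. Since H(p) does not involve q, any change to q moves H(p, q) and the KL divergence by the same amount.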

Properties of KL Divergence

Non-negative: D_KL(p ‖ q) ≥ 0 always. It equals zero only when p = q exactly.

Not symmetric: D_KL(p ‖ q) ≠ D_KL(q ‖ p) in general. This matters enormously. "How well does q approximate p?" is a different question from "How well does p approximate q?"

Hand Calculation

True distribution p = [0.7, 0.2, 0.1]. Model distribution q = [0.5, 0.3, 0.2].

D_KL(p ‖ q) = 0.7 × log(0.7/0.5) + 0.2 × log(0.2/0.3) + 0.1 × log(0.1/0.2)

= 0.7 × log(1.4) + 0.2 × log(0.667) + 0.1 × log(0.5)

= 0.7 × 0.336 + 0.2 × (-0.405) + 0.1 × (-0.693)

= 0.235 + (-0.081) + (-0.069)

= 0.085

Now the other direction — D_KL(q ‖ p):

= 0.5 × log(0.5/0.7) + 0.3 × log(0.3/0.2) + 0.2 × log(0.2/0.1)

= 0.5 × (-0.336) + 0.3 × 0.405 + 0.2 × 0.693

= -0.168 + 0.122 + 0.139

= 0.092

Different! 0.085 vs 0.092. The asymmetry is small here because p and q are similar, but it can be enormous when the distributions differ significantly.

Forward vs Reverse KL

The two directions have profoundly different behaviors when approximating a complex distribution with a simpler one:

Forward KL, D_KL(p ‖ q): Also called "mean-seeking" or "mass-covering" (for an exponential-family q it reduces to moment matching). When p is nonzero but q is near zero, the log ratio explodes → huge penalty. So q must cover everywhere p has mass. Result: q spreads out to cover all modes of p, even if it puts probability where p doesn't. This is the direction minimized by maximum-likelihood training, which is why minimizing cross-entropy belongs here.

Reverse KL, D_KL(q ‖ p): Also called "mode-seeking." Where p is nonzero but q is near zero, the penalty is weighted by q (which is small there), so it's mild: q is allowed to ignore parts of p. But where q has mass and p doesn't, q × log(q/0) → ∞. So q avoids places where p is zero. Result: q locks onto one mode of p and ignores others. This is the direction minimized in variational inference (maximizing the ELBO) and used as the KL penalty that keeps the policy near a reference model in RLHF.

Analogy: Forward KL is a cautious photographer who takes a wide-angle shot to make sure nothing is missed (but includes some empty sky). Reverse KL is a portrait photographer who zooms in on one face perfectly (but misses everyone else in the room).
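
The mean-seeking versus mode-seeking behavior shows up in a small experiment. The sketch below is an illustration under stated assumptions: the target p is a made-up bimodal mixture on a discretized grid, q is restricted to a single Gaussian, and the best q for each KL direction is found by brute-force grid search rather than gradient descent. All names and parameter ranges are hypothetical.

```python
import numpy as np

def gaussian_on_grid(x, mu, sigma):
    """Gaussian evaluated on a grid and normalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

def kl(a, b, eps=1e-12):
    """Discrete KL(a || b); eps guards against log(0) and division by zero."""
    return np.sum(a * np.log((a + eps) / (b + eps)))

x = np.linspace(-10, 10, 2001)
# Bimodal target: equal mixture of two narrow Gaussians at -3 and +3.
p = 0.5 * gaussian_on_grid(x, -3.0, 0.5) + 0.5 * gaussian_on_grid(x, 3.0, 0.5)

best_fwd = best_rev = None   # each holds (kl value, mu, sigma)
for mu in np.linspace(-4, 4, 33):
    for sigma in np.linspace(0.3, 4.0, 38):
        q = gaussian_on_grid(x, mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print(f"forward KL best q: mu={best_fwd[1]:+.2f}, sigma={best_fwd[2]:.2f}")
print(f"reverse KL best q: mu={best_rev[1]:+.2f}, sigma={best_rev[2]:.2f}")
```

The forward-KL winner sits between the two modes with a wide spread (the wide-angle shot), while the reverse-KL winner is a narrow Gaussian parked on one of the modes (the portrait).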

python
def kl_divergence(p, q):
    """KL(p || q) from scratch."""
    # Only sum where p > 0 (0 * log(0/q) = 0 by convention)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(f"KL(p||q) = {kl_divergence(p, q):.4f}")  # 0.0851
print(f"KL(q||p) = {kl_divergence(q, p):.4f}")  # 0.0920

KL Divergence Explorer

Two distributions over 3 outcomes. Adjust q to match p. Watch both KL directions — they're not the same! Try to make both zero simultaneously.

[Interactive sliders: q₁ = 0.50, q₂ = 0.30]

Common mistake: Treating KL divergence as a "distance metric." It's not — it's asymmetric (A→B ≠ B→A) and doesn't satisfy the triangle inequality. When papers say "KL distance," they're being sloppy. It's a divergence, not a distance.

Why is minimizing cross-entropy equivalent to minimizing KL divergence?

(a) Because KL = cross-entropy minus entropy, and entropy is a constant (property of the data, not the model) — so they share the same gradient
(b) Because KL divergence is always zero
(c) Because cross-entropy and KL divergence are the same formula

Chapter 5: Regression Losses — Beyond MSE

We've spent three chapters on classification. Let's go back to regression — predicting continuous numbers — because MSE isn't the only option, and sometimes it's not even the best one.

The weakness of MSE is its squared term: a single outlier with error 100 contributes 10,000 to the loss, drowning out hundreds of good predictions with error 1 (contributing 1 each). When your data has outliers, MSE chases them obsessively.

Mean Absolute Error (MAE / L1 Loss)

MAE = (1/N) · ∑_i |y_i - ŷ_i|

No squaring — just the absolute value of each error. An outlier with error 100 contributes 100, not 10,000. MAE is robust to outliers.

But MAE has its own problem: the gradient is always ±1 regardless of the error magnitude. Whether you're off by 100 or by 0.001, the gradient magnitude is the same. Near the minimum, the model oscillates instead of settling smoothly. And at exactly zero error, the absolute value isn't differentiable — there's a sharp corner.

Hand Calculation: MSE vs MAE

Five predictions, one outlier:

True	Pred	Error	|Error|	Error²
10	11	-1	1	1
20	19	1	1	1
15	14	1	1	1
12	13	-1	1	1
100	50	50	50	2500

MAE = (1 + 1 + 1 + 1 + 50) / 5 = 54/5 = 10.8

MSE = (1 + 1 + 1 + 1 + 2500) / 5 = 2504/5 = 500.8

The outlier contributes 50 out of 54 to MAE (93%). It contributes 2500 out of 2504 to MSE (99.8%). MSE is almost entirely determined by the outlier. MAE feels the outlier but isn't dominated by it.

Huber Loss: The Best of Both

What if we want MAE's robustness for large errors but MSE's smoothness for small errors? That's Huber loss:

L_δ(e) = { ½e² if |e| ≤ δ ; δ|e| - ½δ² if |e| > δ }

Below threshold δ: quadratic (smooth, like MSE). Above threshold δ: linear (robust, like MAE). The transition is smooth — the function and its derivative are continuous at δ.

The parameter δ controls where you switch from "small error" to "large error" behavior. δ = 1.0 is a common default. Small δ means almost everything is treated as a "large error" (more like MAE). Large δ means almost everything is "small" (more like MSE).
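
The effect of δ is easiest to see in the gradient. A minimal sketch, with an illustrative function name: the derivative of Huber loss with respect to the error e is e itself inside the quadratic zone and δ·sign(e) outside it, which is the same as clipping e to [-δ, δ].

```python
import numpy as np

def huber_gradient(y_true, y_pred, delta=1.0):
    """Per-example derivative of Huber loss w.r.t. the prediction.

    With e = y_true - y_pred, dL/de = clip(e, -delta, delta); the chain rule
    through e = y - y_hat flips the sign. Not averaged over the batch.
    """
    error = y_true - y_pred
    return -np.clip(error, -delta, delta)

y = np.array([10.0, 20.0, 100.0])
yhat = np.array([10.5, 19.0, 50.0])
print(huber_gradient(y, yhat))   # small error passes through; the outlier is capped
```

The outlier with error 50 produces a gradient of magnitude δ, where MSE would produce one 50 times larger (before averaging).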

python
def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * abs_error - 0.5 * delta ** 2
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

y = np.array([10, 20, 15, 12, 100])
yhat = np.array([11, 19, 14, 13, 50])
print(f"MSE:   {mse_loss(y, yhat):.1f}")    # 500.8
print(f"MAE:   {mae_loss(y, yhat):.1f}")    # 10.8
print(f"Huber: {huber_loss(y, yhat):.1f}")  # 10.3

When to use which: MSE when outliers are rare and you want precise predictions. MAE when outliers are common and you want median-like behavior. Huber when you want the smooth optimization of MSE for small errors but don't want outliers to hijack training. Object detection (Smooth L1) uses Huber with δ=1.

Name Aliases You'll See in the Wild

The same losses go by different names in different frameworks. L2 loss = MSE. L1 loss = MAE. Smooth L1 = Huber with δ=1 (PyTorch's SmoothL1Loss). Don't be thrown off — they're the same functions.

Weighted MSE

Standard MSE treats every sample equally. But sometimes some samples matter more. Weighted MSE assigns a weight w_i to each example:

Weighted MSE = (1/N) · ∑_i w_i · (y_i - ŷ_i)²

Use cases: (1) imbalanced regression targets: give rare target values higher weight. (2) temporal data: weight recent observations more than old ones. (3) heteroscedastic noise: if some measurements are noisier, downweight them. The weights encode your prior about which errors matter most.

python
def weighted_mse(y_true, y_pred, weights):
    """MSE with per-sample weights."""
    return np.mean(weights * (y_true - y_pred) ** 2)

# Weight recent data 3× more than old data
w = np.array([1, 1, 2, 3, 3])
print(weighted_mse(y, yhat, w))  # emphasizes recent errors

Regression Loss Comparison

Fitting a line through points. Drag the red outlier up/down and switch between loss functions. Watch how MSE chases the outlier while MAE and Huber resist.

[Interactive slider: Outlier Y = 3.0]

Common mistake: Using MAE everywhere because it's "more robust." MAE's gradient magnitude stays constant no matter how small the error, so with a fixed learning rate the model tends to oscillate around the optimum instead of settling into it. If you don't have outliers, MSE's smooth quadratic minimum is better. Know your data before choosing.

What does Huber loss do differently from MSE and MAE?

(a) It uses a cubic function for all errors
(b) It ignores outliers completely
(c) It's quadratic for small errors (smooth like MSE) and linear for large errors (robust like MAE)

Chapter 6: Contrastive & Triplet Loss — Learning Similarity

Every loss function we've seen so far answers the question "what class is this?" or "what number should I predict?" But there's a fundamentally different question: "which things are similar to each other?"

Face recognition doesn't classify faces into a fixed set of people — there are billions of possible identities. Instead, it maps each face to a point in a high-dimensional embedding space, where similar faces land close together and different faces land far apart. No explicit class labels needed. Just structure.

To train this kind of network, we need a loss that cares about relative distances, not absolute class assignments.

Cosine Embedding Loss

Before we get to contrastive loss, a simpler similarity-based loss: cosine embedding loss. It measures similarity using the cosine of the angle between two vectors, not Euclidean distance:

L = { 1 - cos(a, b) if same class ; max(0, cos(a, b) - margin) if different class }

Cosine similarity ignores vector magnitude — it only cares about direction. Two vectors pointing the same way have cosine = 1, perpendicular = 0, opposite = -1. This is useful when the absolute scale of embeddings doesn't matter (e.g., sentence embeddings where a longer sentence shouldn't be "more similar").

python
def cosine_embedding_loss(a, b, label, margin=0.0):
    """label=1 for similar, -1 for dissimilar."""
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    if label == 1:
        return 1 - cos_sim
    else:
        return max(0, cos_sim - margin)

Contrastive Loss (2006)

Take two inputs. Are they similar (positive pair) or dissimilar (negative pair)? The network maps each to a point in embedding space. Compute their distance d. The loss is:

L = y · d² + (1-y) · max(0, m - d)²

Where y=1 for similar pairs, y=0 for dissimilar pairs, d is the Euclidean distance between embeddings, and m is the margin — the minimum distance we want between dissimilar items.

For similar pairs (y=1): loss = d². Pull them together: the closer, the better.

For dissimilar pairs (y=0): loss = max(0, m-d)². Push them apart, but only until they're at least m units apart. Beyond that, no penalty. We don't need negative pairs to be infinitely far away, just far enough.

Hand Calculation

Margin m = 2.0. Two examples:

Positive pair (same person's faces): embeddings at [1.0, 0.5] and [1.3, 0.7].
Distance d = √((1.3-1.0)² + (0.7-0.5)²) = √(0.09 + 0.04) = √0.13 = 0.361.
Loss = 1 × 0.361² = 0.130. Penalty for not being close enough.

Negative pair (different people): embeddings at [1.0, 0.5] and [1.5, 0.8].
Distance d = √(0.25 + 0.09) = √0.34 = 0.583.
Loss = max(0, 2.0 - 0.583)² = 1.417² = 2.008. Big penalty — they're way too close.

Negative pair (already far apart): embeddings at [1.0, 0.5] and [4.0, 3.5].
Distance d = √(9.0 + 9.0) = √18.0 = 4.243.
Loss = max(0, 2.0 - 4.243)² = max(0, -2.243)² = 0.000. Already past the margin — no penalty.
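
The three hand calculations can be checked in code. A minimal sketch that applies the contrastive loss formula to the same embeddings, with margin m = 2.0:

```python
import numpy as np

def contrastive_loss(emb1, emb2, label, margin=2.0):
    """label=1 for a similar pair, 0 for a dissimilar pair."""
    d = np.linalg.norm(np.asarray(emb1) - np.asarray(emb2))
    return label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2

anchor = [1.0, 0.5]
cases = [
    ("positive pair",        [1.3, 0.7], 1),
    ("negative, too close",  [1.5, 0.8], 0),
    ("negative, far enough", [4.0, 3.5], 0),
]
for name, other, label in cases:
    print(f"{name}: {contrastive_loss(anchor, other, label):.3f}")
```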

Triplet Loss (2015, FaceNet)

Contrastive loss works with pairs. Triplet loss uses three items at once: an anchor (reference image), a positive (same class as anchor), and a negative (different class).

L = max(0, d(a, p) - d(a, n) + m)

In words: the anchor-to-positive distance should be smaller than the anchor-to-negative distance, by at least margin m. If it already is, loss is zero. If not, the loss is how much the constraint is violated.

Why triplets over pairs? Because triplets encode relative ordering directly. "Is A closer to B than to C?" is often more natural than "are A and B similar?" with a hard cutoff.

The Hard Mining Problem

There's a catch with both losses. Most negatives are easy — a picture of a cat is obviously different from a picture of a truck. The loss for easy negatives is zero (they're already past the margin). The network learns nothing from them.

What matters is hard negatives: different items that the network currently thinks are similar. A Persian cat vs a Himalayan cat. Two faces that look alike but are different people. These violations are where all the learning signal lives.

In practice, you mine hard negatives: within each batch, find the triplets where d(a,p) is largest (worst positives) and d(a,n) is smallest (hardest negatives). Only train on those. Otherwise, most of your batch is wasted on zero-loss examples.
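
One common way to do this is often called batch-hard mining: for every anchor in the batch, pick its farthest positive and its closest negative. The sketch below is one possible NumPy version under stated assumptions, not a reference implementation: the function name is illustrative, and it assumes each anchor has at least one other same-label example and at least one different-label example in the batch.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Triplet loss using the hardest positive and hardest negative per anchor."""
    # (N, N) matrix of pairwise Euclidean distances.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)

    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos_mask = same & not_self      # same label, excluding the anchor itself
    neg_mask = ~same                # different label

    # Farthest positive and closest negative for each anchor.
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)
    return np.mean(np.maximum(0.0, hardest_pos - hardest_neg + margin))

# Toy batch: four 2-D embeddings on a line, two per class.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [3.0, 0.0]])
lab = np.array([0, 0, 1, 1])
print(batch_hard_triplet_loss(emb, lab))
```

In the toy batch, the third point (class 1) sits next to the class 0 points and far from its own classmate, so it dominates the loss; that is the kind of example mining is meant to surface.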

python
def contrastive_loss(emb1, emb2, label, margin=2.0):
    """label=1 for similar, 0 for dissimilar."""
    d = np.linalg.norm(emb1 - emb2)
    pos_loss = label * d ** 2
    neg_loss = (1 - label) * max(0, margin - d) ** 2
    return pos_loss + neg_loss

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss: push positive closer than negative."""
    dp = np.linalg.norm(anchor - positive)
    dn = np.linalg.norm(anchor - negative)
    return max(0, dp - dn + margin)

Key insight: Contrastive and triplet losses don't learn class boundaries — they learn a geometry. The embedding space organizes itself so that similarity in the real world maps to proximity in the space. This is why they generalize to classes never seen during training: the geometry transfers.

Embedding Space: Contrastive Forces

Drag points to see how contrastive loss creates forces. Orange pairs should attract. Teal pairs should repel (until past the margin circle). Click Step to apply one gradient update.

[Interactive slider: Margin = 2.0]

Common mistake: Choosing the margin blindly. Too small → negatives cluster right at the boundary, fragile. Too large → the network has to push everything impossibly far apart and can't converge. Margin should be calibrated to the embedding space scale. Normalize embeddings to the unit sphere first, then margins of 0.2–0.5 work well.

In triplet loss, what are the three elements and what does the loss enforce?

(a) Input, label, prediction — enforces correct classification
(b) Anchor, positive, negative — enforces that the anchor-positive distance is smaller than the anchor-negative distance by at least a margin
(c) Query, key, value — enforces attention alignment

Chapter 7: InfoNCE — The Modern Workhorse

Triplet loss has a fundamental limitation: it only considers one negative at a time. The gradient signal from one "this is different" comparison is noisy. What if we could compare the positive against many negatives simultaneously, getting a richer, more stable signal?

That's the insight behind InfoNCE, a loss that takes its name from Noise-Contrastive Estimation (NCE) and powers CLIP, SimCLR, MoCo, and much of modern contrastive self-supervised learning. The same loss is used for contrastive code search and audio-text alignment.

The Setup

You have an anchor (say, an image of a dog). You have one positive (another view of the same dog, or the text "a photo of a golden retriever"). And you have K negatives (other images or texts from the batch that don't match).

Each item is embedded into a vector. We measure similarity using the dot product (or cosine similarity). The question InfoNCE asks is: can the model pick out the positive from a lineup of K+1 candidates?

The Formula

L_InfoNCE = -log( e^{sim(a,p)/τ} / ( e^{sim(a,p)/τ} + ∑_{k=1}^{K} e^{sim(a,n_k)/τ} ) )

Look at the structure: this is softmax cross-entropy! The positive gets one "logit" (sim(a,p)/τ), each negative gets one logit (sim(a,n_k)/τ), and we compute cross-entropy with the positive as the correct class.

InfoNCE turns similarity learning into a (K+1)-way classification problem: "which of these K+1 items is the true match?" The answer is always index 0 (the positive). The loss is the negative log-probability of choosing correctly.

Temperature τ: The Sharpness Dial

The temperature τ (tau) plays the same role as in softmax: it scales the similarities before they are exponentiated. Low τ makes the softmax sharper, so small similarity gaps become large probability gaps and the model must be very precise about which item is the positive. High τ makes it softer, so near-matches get partial credit.

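In practice, τ is set well below the standard τ=1: CLIP initializes a learnable temperature at 0.07, and SimCLR reports good results around τ = 0.1. This forces the model to learn fine-grained distinctions. If τ is too low, a few hard negatives dominate the gradient and training becomes unstable. If it is too high, the softmax stays nearly flat and the model doesn't discriminate enough.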
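A quick way to get a feel for the sharpness dial is to hold the similarities fixed and vary only τ. The sketch below uses illustrative similarity values and a hypothetical helper `softmax_with_temperature`. It is not tied to any particular library.

```python
import numpy as np

def softmax_with_temperature(sims, tau):
    """Softmax over similarities divided by tau (max-subtracted for stability)."""
    z = np.asarray(sims, dtype=float) / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Index 0 is the positive, the rest are negatives (illustrative similarities)
sims = [0.9, 0.1, -0.5, 0.3]

for tau in [1.0, 0.5, 0.1]:
    probs = softmax_with_temperature(sims, tau)
    print(f"tau={tau:<4} P(positive)={probs[0]:.3f}  loss={-np.log(probs[0]):.3f}")
```

The similarities never change, yet the probability assigned to the positive rises as τ falls: the same embedding gap counts for more at low temperature.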

Hand Calculation

Anchor embedding a = [1, 0]. Positive p = [0.9, 0.1]. Three negatives: n₁ = [0.1, 0.8], n₂ = [-0.5, 0.5], n₃ = [0.3, -0.7]. Temperature τ = 0.5.

Step 1 — compute dot-product similarities:

sim(a, p) = 1×0.9 + 0×0.1 = 0.90

sim(a, n₁) = 1×0.1 + 0×0.8 = 0.10

sim(a, n₂) = 1×(-0.5) + 0×0.5 = -0.50

sim(a, n₃) = 1×0.3 + 0×(-0.7) = 0.30

Step 2 — divide by τ and exponentiate:

e^{0.90/0.5} = e^{1.8} = 6.050

e^{0.10/0.5} = e^{0.2} = 1.221

e^{-0.50/0.5} = e^{-1.0} = 0.368

e^{0.30/0.5} = e^{0.6} = 1.822

Step 3 — softmax probability for the positive:

P(positive) = 6.050 / (6.050 + 1.221 + 0.368 + 1.822) = 6.050 / 9.461 = 0.6394

Step 4 — loss:

L = -log(0.6394) = 0.447 (carry the fourth digit of P into the log — rounding it to 0.639 first would give 0.448)

The model gives 63.9% probability to the correct match. Not perfect — the loss pushes it to increase sim(a,p) and decrease sim(a,n_k).
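The direction of that push can be read off the gradient. Because the loss is softmax cross-entropy over the logits sim/τ, its derivative with respect to each similarity is (P_i - y_i)/τ, where y_i is 1 for the positive and 0 for the negatives. The sketch below evaluates this for the numbers in the hand calculation. The variable names are chosen for illustration only.

```python
import numpy as np

tau = 0.5
sims = np.array([0.90, 0.10, -0.50, 0.30])   # [positive, n1, n2, n3] from Step 1

logits = sims / tau
probs = np.exp(logits - logits.max())
probs /= probs.sum()

one_hot = np.array([1.0, 0.0, 0.0, 0.0])     # the positive is the "correct class"
grad = (probs - one_hot) / tau               # dL/dsim for each candidate

for name, g in zip(["p", "n1", "n2", "n3"], grad):
    print(f"dL/dsim(a,{name}) = {g:+.3f}")
```

The gradient is negative for the positive and positive for every negative, so a gradient descent step raises sim(a,p) and lowers each sim(a,n_k). The negative with the highest similarity (n₃) receives the largest push.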

Why More Negatives Help

With K=3 negatives, the "random guess" accuracy is 1/4 = 25%. With K=255 (a lineup of 256 candidates), random accuracy is 1/256 ≈ 0.4%. More negatives make the task harder, forcing the model to learn finer distinctions. There is also a theoretical reason: InfoNCE lower-bounds the mutual information between anchor and positive, and the bound gets tighter with more negatives.

This is why contrastive learning uses enormous batch sizes. SimCLR trained with batches of up to 8192. MoCo maintained a queue of 65,536 negatives. CLIP used a batch size of 32,768. In these setups, more negatives generally meant better learned representations.
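One way to see how K changes the task is to look at the loss of a model that has learned nothing. If every candidate receives the same similarity, the softmax is uniform and the loss equals log(K+1). The bound mentioned above is usually written as I(anchor; positive) ≥ log(K+1) - L, so the largest value it can certify grows with K. The helper names below are assumptions for this sketch.

```python
import numpy as np

def info_nce_from_sims(sims, tau=0.1):
    """InfoNCE given a vector of similarities; index 0 is the positive."""
    logits = np.asarray(sims, dtype=float) / tau
    m = logits.max()
    return -logits[0] + m + np.log(np.exp(logits - m).sum())

def chance_level_loss(num_negatives):
    """Loss when all K+1 candidates are scored identically: log(K+1)."""
    return np.log(num_negatives + 1)

for K in [3, 255, 4095, 65535]:
    untrained = info_nce_from_sims(np.zeros(K + 1))   # identical similarities
    print(f"K={K:>6}  chance accuracy={1 / (K + 1):.5f}  "
          f"chance loss={chance_level_loss(K):.3f}  untrained loss={untrained:.3f}")
```

A loss value is therefore only meaningful relative to log(K+1): the same number can be near-chance for a small K and excellent for a large one.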

python
import numpy as np

def info_nce_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss from scratch.
    anchor: (D,) embedding
    positive: (D,) embedding
    negatives: (K, D) embeddings
    """
    # Similarities
    pos_sim = np.dot(anchor, positive) / tau
    neg_sims = negatives @ anchor / tau   # (K,)

    # Log-sum-exp trick for stability
    all_sims = np.concatenate([[pos_sim], neg_sims])
    max_sim = np.max(all_sims)
    log_sum_exp = max_sim + np.log(np.sum(np.exp(all_sims - max_sim)))

    # Loss = -log(softmax of positive)
    loss = -pos_sim + log_sum_exp
    return loss

# Example
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
negs = np.array([[0.1, 0.8], [-0.5, 0.5], [0.3, -0.7]])
print(f"Loss: {info_nce_loss(a, p, negs, tau=0.5):.3f}")  # 0.447

InfoNCE = softmax cross-entropy in disguise. The positive is the "correct class," negatives are "wrong classes," similarities are "logits," and temperature scales them. Everything you know about cross-entropy — its gradients, its behavior, why it works — transfers directly to InfoNCE. This is not a coincidence. It was designed this way.
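The equivalence can be checked directly by writing InfoNCE as a call to a generic cross-entropy routine. This is a sketch: `softmax_cross_entropy` and `info_nce_via_cross_entropy` are names invented here, and the inputs are the vectors from the hand calculation.

```python
import numpy as np

def softmax_cross_entropy(logits, target_index):
    """Generic cross-entropy: -log softmax(logits)[target_index]."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target_index]

def info_nce_via_cross_entropy(anchor, positive, negatives, tau=0.1):
    """InfoNCE expressed as cross-entropy over K+1 similarity logits."""
    candidates = np.vstack([positive, negatives])   # row 0 is the positive
    logits = candidates @ anchor / tau              # (K+1,) similarity logits
    return softmax_cross_entropy(logits, target_index=0)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
negs = np.array([[0.1, 0.8], [-0.5, 0.5], [0.3, -0.7]])
loss = info_nce_via_cross_entropy(a, p, negs, tau=0.5)
print(f"{loss:.3f}")  # 0.447, matching the hand calculation
```

Nothing in `softmax_cross_entropy` knows about embeddings or similarity. The contrastive behavior comes entirely from how the logits are built.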

InfoNCE: Pick the Match (Showcase)

The anchor must find its positive among negatives. Adjust temperature and number of negatives. Watch the softmax probabilities and loss change. Low τ → sharper discrimination. More negatives → harder task.

Common mistake: Using too few negatives and thinking the representations are good. With K=4 negatives, random accuracy is already 20%, so a model at 80% accuracy looks great but may have learned only coarse distinctions. With K=1000, the random baseline drops to about 0.1%, and reaching high accuracy there demands far more discriminative representations. Always report accuracy relative to the random baseline 1/(K+1).

Why is InfoNCE essentially the same as softmax cross-entropy?

Because it treats similarities as logits, the positive as the correct class, and computes -log of the softmax probability for the positive, which is exactly cross-entropy over a (K+1)-way classification

Because both losses use the exponential function

Because both losses produce the same numerical value on all inputs

Chapter 8: Choosing Your Loss — The Practical Guide

We've built the core loss functions from scratch. Now the practical question: which one do you use? The answer depends on your task, your data, and what you want the network to learn.

The Decision Tree

What is your task? Classification, regression, or similarity?

Classification → Cross-Entropy. Multi-class: softmax + CE. Binary: sigmoid + BCE. Multi-label: independent sigmoid + BCE per label.

Regression → MSE / Huber. Clean data: MSE. Outliers present: Huber. Median prediction wanted: MAE.

Similarity / Retrieval → InfoNCE. Few negatives available: triplet loss. Large batches possible: InfoNCE. Cross-modal alignment: symmetric InfoNCE (CLIP-style).
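The symmetric, CLIP-style variant in the last branch can be sketched on a batch of matched pairs. Each row of one modality treats every other row of the other modality as a negative, and the loss is averaged over both directions. This is a NumPy illustration with random stand-in embeddings, not CLIP's actual code. The function names and τ value are assumptions.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(emb_a, emb_b, tau=0.07):
    """Symmetric InfoNCE over N matched pairs.
    emb_a, emb_b: (N, D) arrays; row i of emb_a matches row i of emb_b.
    """
    # Cosine similarity: L2-normalize each row, then take dot products
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = emb_a @ emb_b.T / tau          # (N, N); true matches sit on the diagonal

    idx = np.arange(len(logits))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()  # each a picks its b
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()  # each b picks its a
    return 0.5 * (loss_a_to_b + loss_b_to_a)

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16))                        # stand-in "image" embeddings
texts_aligned = images + 0.05 * rng.normal(size=(8, 16)) # nearly matching "text" embeddings
texts_random = rng.normal(size=(8, 16))                  # unrelated "text" embeddings

aligned_loss = symmetric_info_nce(images, texts_aligned)
random_loss = symmetric_info_nce(images, texts_random)
print(f"aligned pairs: {aligned_loss:.3f}")
print(f"random pairs:  {random_loss:.3f}")
```

With a batch of N pairs, every item gets N-1 negatives for free, which is one reason this formulation pairs naturally with large batches.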

Focal Loss: Handling Class Imbalance

In dense object detection, the overwhelming majority of anchor boxes (often more than 99%) contain only background, and very few contain objects. Standard cross-entropy is dominated by the easy background examples: each contributes little learning signal individually, but together they overwhelm the rare object examples.

Focal loss (Lin et al., 2017) down-weights easy examples:

FL = -(1 - p_t)^γ · log(p_t)

Where p_t is the model's predicted probability for the true class. When the model is already confident and correct (p_t ≈ 1), the factor (1-p_t)^γ ≈ 0 — the easy example is down-weighted to nearly zero. When the model struggles (p_t ≈ 0), the factor ≈ 1 — full weight.

The parameter γ controls how aggressively easy examples are down-weighted. γ=0 recovers standard cross-entropy. γ=2 is the standard choice. With γ=2, an example the model gets right with 90% confidence has its loss reduced by 100×.

Hand Calculation: Focal vs Standard CE

Model predicts p_t = 0.9 (confident and correct). γ = 2.

Standard CE: -log(0.9) = 0.105

Focal: -(1-0.9)² × log(0.9) = -0.01 × (-0.105) = 0.00105

The easy example's contribution dropped by 100×.

Now p_t = 0.1 (struggling):

Standard CE: -log(0.1) = 2.303

Focal: -(1-0.1)² × log(0.1) = -0.81 × (-2.303) = 1.865

Only 19% reduction. Hard examples keep most of their weight.

Label Smoothing

Standard cross-entropy with one-hot labels pushes the model toward infinite confidence — the loss only reaches zero when the predicted probability is exactly 1.0. This leads to overconfident predictions that don't calibrate well.

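Label smoothing replaces the hard one-hot [1, 0, 0] with a soft target like [0.933, 0.033, 0.033]. The target is a blend: (1-ε) times the one-hot vector plus ε times the uniform distribution over all classes. Each wrong class ends up with ε/num_classes, and the true class with (1-ε) + ε/num_classes. Typical ε = 0.1, which with three classes gives the example above.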

This prevents the model from becoming infinitely confident and produces better-calibrated probabilities — the model's predicted confidence more closely matches its actual accuracy. It also acts as a regularizer, slightly reducing overfitting.

python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal loss: down-weight easy examples."""
    return -((1 - p_t) ** gamma) * np.log(p_t + 1e-15)

def label_smoothing(y_onehot, num_classes, epsilon=0.1):
    """Smooth hard labels."""
    return y_onehot * (1 - epsilon) + epsilon / num_classes

# Focal loss comparison
for pt in [0.9, 0.5, 0.1]:
    ce = -np.log(pt)
    fl = focal_loss(pt, gamma=2)
    print(f"p_t={pt:.1f}  CE={ce:.3f}  Focal={fl:.3f}  ratio={fl/ce:.3f}")
# p_t=0.9  CE=0.105  Focal=0.001  ratio=0.010
# p_t=0.5  CE=0.693  Focal=0.173  ratio=0.250
# p_t=0.1  CE=2.303  Focal=1.865  ratio=0.810
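To see why smoothed targets discourage infinite confidence, the sketch below sweeps the model's confidence in the true class and compares cross-entropy against the hard and the smoothed target. It uses the same smoothing formula as `label_smoothing` above. The helper names `cross_entropy` and `probs_at` are assumptions for this illustration.

```python
import numpy as np

def cross_entropy(target, probs):
    """Cross-entropy between a (possibly soft) target and predicted probabilities."""
    return -np.sum(target * np.log(probs + 1e-15))

def probs_at(conf):
    """Three-class prediction: `conf` on the true class, the rest split evenly."""
    return np.array([conf, (1 - conf) / 2, (1 - conf) / 2])

num_classes, epsilon = 3, 0.1
hard = np.array([1.0, 0.0, 0.0])
soft = hard * (1 - epsilon) + epsilon / num_classes   # about [0.933, 0.033, 0.033]

for conf in [0.80, 0.9333, 0.99, 0.9999]:
    p = probs_at(conf)
    print(f"conf={conf:.4f}  hard CE={cross_entropy(hard, p):.4f}  "
          f"smoothed CE={cross_entropy(soft, p):.4f}")
```

With the hard target, the loss keeps falling as confidence approaches 1. With the smoothed target, it is lowest when the prediction matches the soft target and rises again beyond that, so extreme confidence is penalized.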

Hinge Loss (SVM Loss)

Before deep learning dominated, hinge loss was the standard for SVMs and margin classifiers:

L = max(0, 1 - y · f(x))

Where y ∈ {-1, +1} and f(x) is the raw score. Predictions with y · f(x) ≥ 1, meaning the score has the correct sign and a magnitude of at least 1, have zero loss. This creates a "margin" of safety: the model must be confident enough, not just barely correct.

Multi-class hinge (SVM loss): for each incorrect class j, penalize if the score for j is within margin of the correct class score:

L = ∑_{j≠y} max(0, s_j - s_y + 1)

Hinge loss doesn't produce probabilities (no softmax). It only cares about margins. Once the correct class wins by a sufficient margin, gradient is zero — it stops learning on that example. This can be good (focuses on hard cases) or bad (doesn't refine already-correct predictions).
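Both hinge variants fit in a few lines. The following is a minimal sketch with illustrative scores. The function names are chosen here and are not from any library.

```python
import numpy as np

def binary_hinge(y, score):
    """Binary hinge loss; y is -1 or +1, score is the raw model output f(x)."""
    return max(0.0, 1.0 - y * score)

def multiclass_hinge(scores, correct, margin=1.0):
    """Multi-class SVM loss: sum of margin violations over the wrong classes."""
    violations = np.maximum(0.0, scores - scores[correct] + margin)
    violations[correct] = 0.0        # the correct class is excluded from the sum
    return violations.sum()

print(f"{binary_hinge(+1, 2.5):.1f}")   # 0.0: correct and beyond the margin
print(f"{binary_hinge(+1, 0.4):.1f}")   # 0.6: correct but inside the margin
print(f"{binary_hinge(-1, 0.4):.1f}")   # 1.4: wrong side of the boundary

scores = np.array([3.2, 2.9, -1.0])     # illustrative class scores, class 0 is correct
print(f"{multiclass_hinge(scores, correct=0):.1f}")   # 0.7: class 1 is within the margin
```

The second binary case shows the margin at work: the prediction is already correct, yet the loss is nonzero until the score clears 1.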

The Cheat Sheet

Loss	Task	Output	Key Property
MSE	Regression	Continuous	Smooth, quadratic, sensitive to outliers
MAE	Regression	Continuous	Robust to outliers, constant gradient
Huber	Regression	Continuous	Quadratic near zero, linear far away
Cross-Entropy	Classification	Probabilities	Strong gradient for confident mistakes
BCE	Binary/Multi-label	Per-class prob	Independent per output
Focal	Imbalanced classif.	Probabilities	Down-weights easy examples
Hinge	Margin classif.	Raw scores	Zero loss beyond margin
Contrastive	Pair similarity	Embeddings	Attract/repel with margin
Triplet	Relative similarity	Embeddings	Relative ordering, needs mining
InfoNCE	Representation learning	Embeddings	Multi-negative, scales with batch size

Loss Function Comparison Dashboard

Binary classification (y=1). Compare how different loss functions behave as predicted probability varies. Toggle each loss on/off.

Common mistake: Picking a loss function by gut feeling instead of matching it to the data and task. The loss IS the objective — if you optimize the wrong thing, the model will dutifully learn the wrong behavior. Garbage in, garbage out applies to loss functions more than anywhere else.

When would you use focal loss instead of standard cross-entropy?

When the model is too slow to train

When you need regression instead of classification

When classes are highly imbalanced and easy examples dominate the gradient: focal loss down-weights easy examples to focus learning on hard cases

Chapter 9: Connections — Where Loss Functions Lead

We've built ten loss functions from scratch, traced their gradients by hand, and seen how each shapes what a neural network learns. Let's zoom out to where these ideas connect to the broader landscape of deep learning.

What We Covered

Chapter	Key Concept	One Sentence
0	Why losses matter	The loss defines what the network learns — change the loss, change the behavior
1	MSE	Squared error: smooth and simple for regression, but punishes outliers and fails for classification
2	Softmax	Exponential normalization turns logits into probabilities with a temperature dial
3	Cross-Entropy	Negative log-probability of the correct class — screams at confident mistakes
4	KL Divergence	Cross-entropy minus entropy: measures extra surprise from using the wrong distribution
5	Regression losses	MAE resists outliers, Huber blends MSE and MAE smoothly
6	Contrastive & Triplet	Learn embedding geometry: similar things close, different things far
7	InfoNCE	Multi-negative contrastive = softmax CE over similarities — scales with batch size
8	Choosing	Match the loss to the task: classification → CE, regression → MSE/Huber, similarity → InfoNCE

Where to Go Next

Backpropagation — we showed what the gradient of each loss is, but how does it flow backward through the network? The Backpropagation lesson traces the chain rule through every layer.

Optimizers — the loss gives us a gradient. The optimizer decides how to use that gradient to update weights. SGD, Adam, AdamW — each handles the gradient differently. (Coming soon in the Optimizers lesson.)

Regularization: losses can include penalty terms (L1, L2) that discourage overfitting, while techniques such as dropout regularize without changing the loss itself. Regularization & Optimization covers these.

Contrastive Learning — InfoNCE is the foundation. CLIP applies it to vision-language alignment. Self-Supervised Learning covers SimCLR, BYOL, and DINO.

RLHF and DPO — reinforcement learning from human feedback uses a special loss (reward model loss, then PPO or DPO) to align language models. RLHF & DPO builds on cross-entropy and KL divergence from this lesson.

Diffusion Models — diffusion training uses a denoising loss that's essentially a weighted MSE between predicted and actual noise. Diffusion Models derives this from the variational bound.

GANs — the generator and discriminator each have their own loss function (adversarial loss), and the choice between BCE, hinge, and Wasserstein losses profoundly affects training stability. GANs covers this.

The meta-lesson: Many major advances in deep learning, from GANs to CLIP to diffusion models, hinge on a loss function innovation. The architecture provides capacity. The data provides experience. But the loss function provides direction. Master losses, and you master the language in which training objectives are expressed.

A friend is building a face recognition system that must handle millions of identities never seen during training. Which loss family should they use?

Cross-entropy with a million output classes

MSE on facial landmarks

Contrastive/InfoNCE: learn an embedding space where similar faces are close, then use nearest-neighbor at test time for any identity