The signal that tells a neural network how wrong it is — and in which direction to improve. From the simplest squared error to the contrastive losses behind CLIP and SimCLR.
You've built a neural network. You feed it an image of a cat. It outputs three numbers: [2.1, 0.8, -0.3], one for each class — cat, dog, bird. Those numbers say the network thinks "cat" is most likely. Good.
But here's the question nobody asks early enough: how do you tell the network it was right? And more importantly — when it outputs [0.5, 1.9, 0.3] for the same cat image and guesses "dog" — how do you tell it how wrong it was, and in which direction to fix itself?
You need a single number. A score. Low when the network is right, high when it's wrong. This number flows backward through every layer, nudging every weight. Get the scoring function wrong, and the network learns the wrong thing — or learns nothing at all.
That scoring function is called a loss function. It's the most important design choice you make when training a neural network. More important than the architecture. More important than the optimizer. Because the loss defines what the network is actually trying to do.
In this lesson, we'll build every major loss function from scratch. We'll start with the simplest idea — squared error — and discover why it fails for classification. We'll derive cross-entropy from information theory. We'll see why contrastive losses revolutionized representation learning. And we'll end with InfoNCE, the loss that powers CLIP, SimCLR, and most modern self-supervised learning.
But first, let's see the problem with our own eyes.
A model predicts probabilities for three classes. The true label is Cat. Drag the slider to change the model's confidence in "Cat" and watch what happens to different loss functions.
Notice something striking. When the model is very confident and correct (P(Cat) near 1.0), both losses are near zero — all is well. But when the model is confidently wrong (P(Cat) near 0.0), cross-entropy explodes toward infinity while MSE stays calmly bounded. Cross-entropy screams at confident mistakes. MSE merely shrugs.
That difference matters enormously. A loss that screams at confident mistakes produces large gradients — strong learning signals — exactly when the model needs correction most. A loss that shrugs produces tiny gradients, and the model barely updates. This is why cross-entropy dominates classification. But to understand why, we need to build up from the beginning.
The most natural way to measure "how wrong" is to ask: how far is my prediction from the truth? If I predicted a house costs $300,000 and it actually costs $350,000, the error is $50,000. Square it to punish big errors more than small ones. Average over all examples. Done.
That's Mean Squared Error (MSE). For a single prediction:

$$L = (y - \hat{y})^2$$

Where y is the true value and ŷ (y-hat) is the prediction. For a batch of N examples, average them:

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Suppose we predict three house prices (in thousands):
| House | True y | Predicted ŷ | Error | Squared |
|---|---|---|---|---|
| A | 350 | 300 | 50 | 2,500 |
| B | 200 | 210 | -10 | 100 |
| C | 500 | 480 | 20 | 400 |
MSE = (2500 + 100 + 400) / 3 = 3000 / 3 = 1000.
House A dominates the loss — its error is 5× larger than B's, but its squared error is 25× larger. That's the key property of squaring: it amplifies large errors disproportionately. A single outlier can hijack the entire loss.
Training requires derivatives. The gradient of MSE with respect to the prediction ŷ is:

$$\frac{\partial\,\text{MSE}}{\partial \hat{y}_i} = \frac{2}{N}\,(\hat{y}_i - y_i)$$
This is beautifully simple. The gradient is proportional to the error. Big error → big gradient → big update. Small error → small gradient → small update. The learning signal is directly proportional to "how wrong you are." For regression, this is exactly what we want.
```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean Squared Error from scratch."""
    errors = y_true - y_pred
    squared = errors ** 2
    return np.mean(squared)

def mse_gradient(y_true, y_pred):
    """Gradient of MSE w.r.t. predictions."""
    n = len(y_true)
    return 2 * (y_pred - y_true) / n

# Example
y = np.array([350, 200, 500])
yhat = np.array([300, 210, 480])
print(mse_loss(y, yhat))      # 1000.0
print(mse_gradient(y, yhat))  # approx. [-33.33, 6.67, -13.33]
```
Now let's try MSE on a classification problem. Suppose the true label is "cat" = class 0. We encode this as a one-hot vector: [1, 0, 0]. The model outputs probabilities [0.2, 0.5, 0.3].
MSE = ((1-0.2)² + (0-0.5)² + (0-0.3)²) / 3 = (0.64 + 0.25 + 0.09) / 3 = 0.327.
Now imagine the model outputs [0.01, 0.98, 0.01] — it's 98% sure it's a dog. Catastrophically wrong.
MSE = ((1-0.01)² + (0-0.98)² + (0-0.01)²) / 3 = (0.9801 + 0.9604 + 0.0001) / 3 = 0.647.
The loss only went from 0.327 to 0.647 — roughly doubled. But the model went from "mildly confused" to "completely wrong with high confidence." MSE doesn't punish confident mistakes nearly enough. The gradient is proportional to the error, which stays bounded between 0 and 1 for probabilities. The signal is too weak.
Binary classification: true label is 1. Drag the prediction and compare MSE loss and its gradient. Notice how the gradient flattens near 0 and 1.
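The flat gradient regions near 0 and 1 are exactly where we need the strongest learning signal: when the model is confidently wrong. The flattening comes from the output nonlinearity. With a sigmoid output p, the chain rule multiplies the MSE gradient 2(p - y) by p(1 - p), which vanishes as p approaches 0 or 1. So MSE gives the weakest signal there. This fundamental mismatch is why we need a different loss for classification.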
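
To make the flat regions concrete, here is a minimal sketch that compares gradients with respect to the logit for a binary classifier. It assumes a sigmoid output layer (an assumption, the demo above does not state the output nonlinearity), and the helper names `mse_grad_wrt_logit` and `bce_grad_wrt_logit` are illustrative. The second helper uses the log-based loss that the cross-entropy chapter introduces later.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_grad_wrt_logit(z, y):
    """d/dz of (p - y)^2 with p = sigmoid(z); the chain rule adds the factor p(1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def bce_grad_wrt_logit(z, y):
    """d/dz of -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z); simplifies to p - y."""
    return sigmoid(z) - y

y = 1.0  # true label
for z in [-8.0, -4.0, 0.0, 4.0]:
    print(f"z={z:+.0f}  p={sigmoid(z):.4f}  "
          f"MSE grad={mse_grad_wrt_logit(z, y):+.5f}  "
          f"log-loss grad={bce_grad_wrt_logit(z, y):+.5f}")
```

At z = -8 the model is confidently wrong (p is close to 0), yet the MSE gradient is nearly zero while the log-loss gradient is close to -1.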
Before we can compute a classification loss, we need to convert the network's raw outputs into probabilities. A network's final layer outputs numbers like [2.1, 0.8, -0.3]. These are called logits — raw, unconstrained scores. They can be any real number: negative, large, small.
Probabilities must satisfy two constraints: (1) every value is between 0 and 1, and (2) they all sum to 1. How do we transform arbitrary logits into something that meets both constraints?
First idea: just divide each logit by the sum. Logits [2.1, 0.8, -0.3], sum = 2.6. Probabilities: [0.81, 0.31, -0.12]. Negative! That's not a probability. And even if all logits were positive, this doesn't amplify differences — the largest logit barely dominates.
Second idea: exponentiate first, then normalize. The exponential function $e^x$ is always positive, solving the negativity problem. And it amplifies differences: $e^{2.1}$ is much larger than $e^{0.8}$, which is much larger than $e^{-0.3}$.

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

That's softmax. Exponentiate each logit, then divide by the sum of all exponentials. The result is always positive and always sums to 1. Perfect probabilities.
Logits: z = [2.1, 0.8, -0.3]
Step 1 — exponentiate each:
$e^{2.1} = 8.166$, $e^{0.8} = 2.226$, $e^{-0.3} = 0.741$
Step 2 — sum the exponentials:
8.166 + 2.226 + 0.741 = 11.133
Step 3 — divide each by the sum:
P(cat) = 8.166 / 11.133 = 0.733
P(dog) = 2.226 / 11.133 = 0.200
P(bird) = 0.741 / 11.133 = 0.067
Check: 0.733 + 0.200 + 0.067 = 1.000. ✓ The highest logit (2.1) got the highest probability (73.3%). The negative logit (-0.3) got the smallest (6.7%).
What if we divide the logits by a number T before exponentiating?

$$\text{softmax}(z; T)_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$
T is called the temperature. When T is large (high temperature), the exponentials are all closer to 1, so probabilities become nearly uniform — the model hedges. When T is small (low temperature), differences are amplified — the model becomes more confident. At T → 0, softmax becomes argmax: all probability on the largest logit.
There's a trap. If a logit is 1000, then $e^{1000}$ overflows to infinity. If it's -1000, $e^{-1000}$ underflows to zero. Real networks produce extreme logits all the time.
The fix: subtract the maximum logit before exponentiating. If z = [1000, 999, 998], compute z' = [0, -1, -2] by subtracting 1000. Now $e^{0} = 1$, $e^{-1} = 0.368$, $e^{-2} = 0.135$, so nothing overflows. The subtraction cancels in the ratio, so the result is identical.
```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature."""
    z = z / T
    z_max = np.max(z)           # subtract max for stability
    exps = np.exp(z - z_max)    # no overflow now
    return exps / np.sum(exps)

# Test
logits = np.array([2.1, 0.8, -0.3])
print(softmax(logits))         # approx. [0.733, 0.200, 0.067]
print(softmax(logits, T=0.5))  # approx. [0.924, 0.069, 0.008], sharper
print(softmax(logits, T=5.0))  # approx. [0.418, 0.323, 0.259], flatter
```
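
The overflow trap is easy to reproduce. The sketch below contrasts a naive softmax with the max-subtracted version on the logits [1000, 999, 998] from the text. The function names `softmax_naive` and `softmax_stable` are illustrative, and NumPy warnings are silenced only so the failure shows up as a value.

```python
import numpy as np

def softmax_naive(z):
    exps = np.exp(z)                # exp(1000) overflows to inf
    return exps / np.sum(exps)      # inf / inf gives nan

def softmax_stable(z):
    exps = np.exp(z - np.max(z))    # largest exponent is now 0
    return exps / np.sum(exps)

z = np.array([1000.0, 999.0, 998.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(z)
stable = softmax_stable(z)

print(naive)   # all nan
print(stable)  # approx. [0.665, 0.245, 0.090]
```

The stable result matches softmax of [2, 1, 0], because only differences between logits matter.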
Three logits: [2.1, 0.8, -0.3]. Adjust the temperature and watch the probability distribution sharpen or flatten. Low T → confident. High T → uniform.
We now have probabilities from softmax. We need a loss function that (1) is zero when the predicted probability for the correct class is 1.0, (2) is infinite when it's 0.0, and (3) produces strong gradients for confidently wrong predictions. MSE fails tests (2) and (3). What works?
The answer comes from an unexpected place: information theory. To build up to cross-entropy, we need one concept: surprise.
Imagine you're predicting tomorrow's weather. If your model says "99% chance of sun" and it rains — you're very surprised. If it says "50% chance of rain" and it rains — you're only mildly surprised. Surprise is inversely related to probability.
We define the information content (surprise) of an event with probability p as:

$$I(p) = -\log p$$

Here log is the natural logarithm, so surprise is measured in nats.
Why logarithm? Two reasons. First, it turns multiplication into addition: the surprise of two independent events happening is the sum of their individual surprises. Second, it gives the right shape: -log(1.0) = 0 (no surprise when certain), -log(0.5) = 0.693 (moderate surprise), -log(0.01) = 4.605 (very surprised), and -log(0) → ∞ (infinitely surprised by the impossible).
| Probability p | -log(p) | Interpretation |
|---|---|---|
| 1.00 | 0.000 | No surprise — you knew it would happen |
| 0.90 | 0.105 | Barely surprised |
| 0.50 | 0.693 | Coin flip — moderate surprise |
| 0.10 | 2.303 | Quite surprised |
| 0.01 | 4.605 | Very surprised — thought it was almost impossible |
| → 0 | → ∞ | Infinitely surprised — model said impossible, yet it happened |
Entropy is the expected surprise under the true distribution. If the true distribution is p, entropy is:

$$H(p) = -\sum_i p_i \log p_i$$
This measures the inherent uncertainty in the data. A fair coin has entropy 0.693. A loaded coin (99% heads) has entropy 0.056 — very predictable.
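
The two coin values can be checked with a few lines. This is a minimal sketch using natural logarithms (nats), and the function name `entropy` is illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; outcomes with p_i = 0 contribute 0 by convention."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log(p[nonzero]))

print(f"{entropy([0.5, 0.5]):.3f}")    # fair coin: 0.693
print(f"{entropy([0.99, 0.01]):.3f}")  # loaded coin: 0.056
```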
Cross-entropy measures the expected surprise when using a wrong distribution q to predict events that actually follow distribution p:

$$H(p, q) = -\sum_i p_i \log q_i$$

For classification with one-hot labels, p is [1, 0, 0, ...], with all mass on the true class. This simplifies beautifully. If the true class is k:

$$H(p, q) = -\log q_k$$
The entire loss is just the negative log-probability of the correct class. That's it. All the information theory collapses to one logarithm.
In PyTorch, nn.CrossEntropyLoss fuses log-softmax and negative log-likelihood (NLL) and takes raw logits, while nn.NLLLoss expects you to apply log-softmax yourself first. Same math, different API split.

Model outputs probabilities [0.733, 0.200, 0.067] for [cat, dog, bird]. True label: cat (class 0).
CE = -log(0.733) = 0.311
Now the model is wrong — it outputs [0.067, 0.733, 0.200]. True label still cat.
CE = -log(0.067) = 2.703
And catastrophically wrong: [0.01, 0.98, 0.01].
CE = -log(0.01) = 4.605
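Compare with MSE from Chapter 1. Between the mildly wrong [0.2, 0.5, 0.3] and the catastrophic [0.01, 0.98, 0.01], MSE rose from 0.327 to 0.647 (about 2×), and for three classes it can never exceed 2/3 ≈ 0.667. On those same two predictions, cross-entropy rises from -log(0.2) = 1.609 to 4.605, and it keeps growing without bound as P(cat) → 0. Measured from the correct prediction above, it went from 0.311 to 4.605 (a 15× increase). Cross-entropy screams at confident mistakes.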
The gradient of cross-entropy loss after softmax has an astonishingly clean form:

$$\frac{\partial L}{\partial z_i} = q_i - p_i$$

Where $z_i$ is the logit, $q_i$ is the softmax output, and $p_i$ is the true label (0 or 1). For the correct class: gradient = $q_k - 1$. When the model is confident and correct ($q_k \approx 1$), gradient ≈ 0, so no update is needed. When the model is confidently wrong ($q_k \approx 0$), gradient ≈ -1, the largest update possible. The gradient is exactly the prediction error, just like MSE was for regression, but now in probability space where it matters.
```python
import numpy as np

def cross_entropy_loss(y_true_idx, probs):
    """Cross-entropy for one-hot labels.

    y_true_idx: integer index of correct class
    probs: softmax probabilities
    """
    return -np.log(probs[y_true_idx] + 1e-15)  # epsilon for stability

def cross_entropy_gradient(y_true_idx, probs):
    """Gradient w.r.t. logits (after softmax)."""
    grad = probs.copy()
    grad[y_true_idx] -= 1  # q_i - p_i, where p_k = 1
    return grad

# Example
probs = np.array([0.733, 0.200, 0.067])
print(cross_entropy_loss(0, probs))      # approx. 0.311
print(cross_entropy_gradient(0, probs))  # [-0.267, 0.200, 0.067]
```
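
One way to convince yourself that the softmax cross-entropy gradient really is q - p is a finite-difference check. The sketch below reuses the lesson's logits [2.1, 0.8, -0.3]. The helper `ce_from_logits` and the step size `eps` are illustrative choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ce_from_logits(z, k):
    """Cross-entropy of the true class k, computed directly from logits."""
    return -np.log(softmax(z)[k])

z = np.array([2.1, 0.8, -0.3])
k = 0  # true class: cat

# Analytic gradient: q - p, where p is one-hot at index k
analytic = softmax(z)
analytic[k] -= 1.0

# Numeric gradient: central differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (ce_from_logits(z + step, k) - ce_from_logits(z - step, k)) / (2 * eps)

print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places.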
For two classes, we can simplify. Let y ∈ {0, 1} be the true label and p be the predicted probability of class 1:

$$\text{BCE} = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right]$$
When y=1: BCE = -log(p). When y=0: BCE = -log(1-p). This is the loss used in logistic regression, binary classification, and each output of a multi-label classifier.
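
A from-scratch version in the same style as the other losses might look like the sketch below. The clipping constant `eps` is an assumption added to keep the logarithm finite, and averaging over outputs is one common convention, not the only one.

```python
import numpy as np

def bce_loss(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy, averaged over all outputs."""
    p = np.clip(p_pred, eps, 1 - eps)  # keep log() finite at p = 0 or p = 1
    return np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Multi-label example: three independent yes/no outputs for one input
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.6])
print(f"{bce_loss(y, p):.3f}")  # mean of -log(0.9), -log(0.8), -log(0.6)
```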
True label is class 1 (y=1). As predicted probability p varies from 0 to 1, compare the loss curves and their gradients. Notice how CE gradient grows without bound as p→0.
Cross-entropy told us "how surprised are we when using model q to predict reality p." But we also want to know: how much extra surprise does q cause compared to the best possible model (p itself)?
That "extra surprise" is called Kullback-Leibler divergence, or KL divergence:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i} = H(p, q) - H(p)$$
Cross-entropy minus entropy. The entropy H(p) is fixed — it's a property of the data, not the model. So minimizing cross-entropy IS minimizing KL divergence. They lead to the same gradient, the same optimal model. KL divergence just removes the constant so the minimum is exactly zero.
Non-negative: $D_{\mathrm{KL}}(p \,\|\, q) \ge 0$ always. It equals zero only when p = q exactly.
Not symmetric: $D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$ in general. This matters enormously. "How well does q approximate p?" is a different question from "How well does p approximate q?"
True distribution p = [0.7, 0.2, 0.1]. Model distribution q = [0.5, 0.3, 0.2].
$D_{\mathrm{KL}}(p \,\|\, q)$ = 0.7 × log(0.7/0.5) + 0.2 × log(0.2/0.3) + 0.1 × log(0.1/0.2)
= 0.7 × log(1.4) + 0.2 × log(0.667) + 0.1 × log(0.5)
= 0.7 × 0.336 + 0.2 × (-0.405) + 0.1 × (-0.693)
= 0.235 + (-0.081) + (-0.069)
= 0.085
Now the other direction, $D_{\mathrm{KL}}(q \,\|\, p)$:
= 0.5 × log(0.5/0.7) + 0.3 × log(0.3/0.2) + 0.2 × log(0.2/0.1)
= 0.5 × (-0.336) + 0.3 × 0.405 + 0.2 × 0.693
= -0.168 + 0.122 + 0.139
≈ 0.092 (the rounded terms sum to 0.093; carrying more digits gives 0.0920)
Different! 0.085 vs 0.092. The asymmetry is small here because p and q are similar, but it can be enormous when the distributions differ significantly.
The two directions have profoundly different behaviors when approximating a complex distribution with a simpler one:
Forward KL, $D_{\mathrm{KL}}(p \,\|\, q)$: Also called "mean-seeking," "mass-covering," or "moment-matching." When p is nonzero but q is near zero, the log ratio explodes → huge penalty. So q must cover everywhere p has mass. Result: q spreads out to cover all modes of p, even if it puts probability where p doesn't. This is the direction minimized by maximum-likelihood training, including the cross-entropy loss above.
Reverse KL, $D_{\mathrm{KL}}(q \,\|\, p)$: Also called "mode-seeking." When p is nonzero but q is near zero, the penalty is weighted by q (which is small there), so it's mild: q can ignore parts of p cheaply. But when q has mass where p doesn't, q × log(q/0) → ∞. So q avoids places where p is zero. Result: q locks onto one mode of p and ignores others. Used in variational inference (the ELBO) and in the KL penalty of policy optimization (PPO, RLHF).
```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) from scratch."""
    # Only sum where p > 0 (0 * log(0/q) = 0 by convention)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(f"KL(p||q) = {kl_divergence(p, q):.4f}")  # 0.0851
print(f"KL(q||p) = {kl_divergence(q, p):.4f}")  # 0.0920
```
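
The mean-seeking versus mode-seeking contrast can be seen numerically by fitting a single Gaussian q to a two-mode target p under each direction. Everything in this sketch is an illustrative assumption: the target (an equal mixture of N(-2, 0.5²) and N(+2, 0.5²)), the integration grid, and the brute-force search over candidate means and widths.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)  # integration grid
dx = x[1] - x[0]

def log_gauss(x, mu, sigma):
    """Log-density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Target p: equal mixture of two narrow Gaussians at -2 and +2
log_p = np.logaddexp(log_gauss(x, -2.0, 0.5), log_gauss(x, 2.0, 0.5)) + np.log(0.5)
p = np.exp(log_p)

best_fwd = (np.inf, None, None)  # (KL value, mu, sigma)
best_rev = (np.inf, None, None)
for mu in np.linspace(-3.0, 3.0, 25):
    for sigma in np.linspace(0.25, 3.0, 12):
        log_q = log_gauss(x, mu, sigma)
        q = np.exp(log_q)
        fwd = np.sum(p * (log_p - log_q)) * dx  # KL(p || q)
        rev = np.sum(q * (log_q - log_p)) * dx  # KL(q || p)
        if fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

_, fwd_mu, fwd_sigma = best_fwd
_, rev_mu, rev_sigma = best_rev
print(f"forward KL picks mu={fwd_mu:.2f}, sigma={fwd_sigma:.2f}")
print(f"reverse KL picks mu={rev_mu:.2f}, sigma={rev_sigma:.2f}")
```

On this grid the forward direction should select a wide Gaussian centred between the two modes, while the reverse direction should select a narrow Gaussian sitting on one of them.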
Two distributions over 3 outcomes. Adjust q to match p. Watch both KL directions — they're not the same! Try to make both zero simultaneously.
We've spent three chapters on classification. Let's go back to regression — predicting continuous numbers — because MSE isn't the only option, and sometimes it's not even the best one.
The weakness of MSE is its squared term: a single outlier with error 100 contributes 10,000 to the loss, drowning out hundreds of good predictions with error 1 (contributing 1 each). When your data has outliers, MSE chases them obsessively.
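Mean Absolute Error (MAE) swaps the square for an absolute value:

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

No squaring, just the absolute value of each error. An outlier with error 100 contributes 100, not 10,000. MAE is robust to outliers.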
But MAE has its own problem: the gradient is always ±1 regardless of the error magnitude. Whether you're off by 100 or by 0.001, the gradient magnitude is the same. Near the minimum, the model oscillates instead of settling smoothly. And at exactly zero error, the absolute value isn't differentiable — there's a sharp corner.
Five predictions, one outlier:
| True | Pred | Error | Abs. error | Error² |
|---|---|---|---|---|
| 10 | 11 | -1 | 1 | 1 |
| 20 | 19 | 1 | 1 | 1 |
| 15 | 14 | 1 | 1 | 1 |
| 12 | 13 | -1 | 1 | 1 |
| 100 | 50 | 50 | 50 | 2500 |
MAE = (1 + 1 + 1 + 1 + 50) / 5 = 54/5 = 10.8
MSE = (1 + 1 + 1 + 1 + 2500) / 5 = 2504/5 = 500.8
The outlier contributes 50 out of 54 to MAE (93%). It contributes 2500 out of 2504 to MSE (99.8%). MSE is almost entirely determined by the outlier. MAE feels the outlier but isn't dominated by it.
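What if we want MAE's robustness for large errors but MSE's smoothness for small errors? That's Huber loss:

$$L_\delta(y, \hat{y}) = \begin{cases} \tfrac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta\,|y - \hat{y}| - \tfrac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$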
Below threshold δ: quadratic (smooth, like MSE). Above threshold δ: linear (robust, like MAE). The transition is smooth — the function and its derivative are continuous at δ.
The parameter δ controls where you switch from "small error" to "large error" behavior. δ = 1.0 is a common default. Small δ means almost everything is treated as a "large error" (more like MAE). Large δ means almost everything is "small" (more like MSE).
```python
import numpy as np

def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * abs_error - 0.5 * delta ** 2
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

y = np.array([10, 20, 15, 12, 100])
yhat = np.array([11, 19, 14, 13, 50])
print(f"MSE:   {mse_loss(y, yhat):.1f}")    # 500.8 (mse_loss from the MSE chapter)
print(f"MAE:   {mae_loss(y, yhat):.1f}")    # 10.8
print(f"Huber: {huber_loss(y, yhat):.1f}")  # 10.3
```
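
The three regression losses differ most clearly in their gradients. The sketch below writes the per-sample gradient with respect to the prediction, using error = ŷ - y. The function names are illustrative, and the MAE gradient at exactly zero error is taken as 0, which is what `np.sign` returns.

```python
import numpy as np

def mse_grad(error):
    """Gradient of error^2: grows linearly with the error."""
    return 2.0 * error

def mae_grad(error):
    """Gradient of |error|: always -1 or +1 (0 at exactly zero error)."""
    return np.sign(error)

def huber_grad(error, delta=1.0):
    """Gradient of Huber loss: the error itself, clipped to [-delta, delta]."""
    return np.clip(error, -delta, delta)

errors = np.array([-50.0, -0.5, 0.001, 0.5, 50.0])
print(mse_grad(errors))    # [-100, -1, 0.002, 1, 100]
print(mae_grad(errors))    # [-1, -1, 1, 1, 1]
print(huber_grad(errors))  # [-1, -0.5, 0.001, 0.5, 1]
```

Huber's gradient behaves like MSE's for small errors (it shrinks toward zero near the minimum) and like MAE's for large ones (an outlier cannot contribute more than δ).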
The same losses go by different names in different frameworks. L2 loss = MSE. L1 loss = MAE. Smooth L1 = Huber with δ=1 (PyTorch's SmoothL1Loss). Don't be thrown off — they're the same functions.
Standard MSE treats every sample equally. But sometimes some samples matter more. Weighted MSE assigns a weight $w_i$ to each example:

$$\text{WMSE} = \frac{1}{N} \sum_{i=1}^{N} w_i \,(y_i - \hat{y}_i)^2$$
Use cases: (1) class-imbalanced regression — give rare targets higher weight. (2) temporal data — weight recent observations more than old ones. (3) heteroscedastic noise — if some measurements are noisier, downweight them. The weights encode your prior about which errors matter most.
```python
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    """MSE with per-sample weights."""
    return np.mean(weights * (y_true - y_pred) ** 2)

# Weight recent data 3× more than old data
# (y and yhat are the five-point arrays from the previous block)
w = np.array([1, 1, 2, 3, 3])
print(weighted_mse(y, yhat, w))  # 1501.4, emphasizes recent errors
```
Fitting a line through points. Drag the red outlier up/down and switch between loss functions. Watch how MSE chases the outlier while MAE and Huber resist.
Every loss function we've seen so far answers the question "what class is this?" or "what number should I predict?" But there's a fundamentally different question: "which things are similar to each other?"
Face recognition doesn't classify faces into a fixed set of people — there are billions of possible identities. Instead, it maps each face to a point in a high-dimensional embedding space, where similar faces land close together and different faces land far apart. No explicit class labels needed. Just structure.
To train this kind of network, we need a loss that cares about relative distances, not absolute class assignments.
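Before we get to contrastive loss, a simpler similarity-based loss: cosine embedding loss. It measures similarity using the cosine of the angle between two vectors, not Euclidean distance:

$$\cos(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}, \qquad L = \begin{cases} 1 - \cos(a, b) & \text{if } y = 1 \\ \max(0,\; \cos(a, b) - m) & \text{if } y = -1 \end{cases}$$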
Cosine similarity ignores vector magnitude — it only cares about direction. Two vectors pointing the same way have cosine = 1, perpendicular = 0, opposite = -1. This is useful when the absolute scale of embeddings doesn't matter (e.g., sentence embeddings where a longer sentence shouldn't be "more similar").
```python
import numpy as np

def cosine_embedding_loss(a, b, label, margin=0.0):
    """label=1 for similar, -1 for dissimilar."""
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    if label == 1:
        return 1 - cos_sim
    else:
        return max(0, cos_sim - margin)
```
Take two inputs. Are they similar (positive pair) or dissimilar (negative pair)? The network maps each to a point in embedding space. Compute their distance d. The loss is:

$$L = y\, d^2 + (1 - y)\, \max(0,\; m - d)^2$$
Where y=1 for similar pairs, y=0 for dissimilar pairs, d is the Euclidean distance between embeddings, and m is the margin — the minimum distance we want between dissimilar items.
For similar pairs (y=1): loss = d². Pull them together: the closer, the better.
For dissimilar pairs (y=0): loss = max(0, m-d)². Push them apart, but only until they're at least m units apart. Beyond that, no penalty. We don't need negative pairs to be infinitely far away, just far enough.
Margin m = 2.0. Two examples:
Positive pair (same person's faces): embeddings at [1.0, 0.5] and [1.3, 0.7].
Distance d = √((1.3-1.0)² + (0.7-0.5)²) = √(0.09 + 0.04) = √0.13 = 0.361.
Loss = 1 × 0.361² = 0.130. Penalty for not being close enough.
Negative pair (different people): embeddings at [1.0, 0.5] and [1.5, 0.8].
Distance d = √(0.25 + 0.09) = √0.34 = 0.583.
Loss = max(0, 2.0 - 0.583)² = 1.417² = 2.008. Big penalty — they're way too close.
Negative pair (already far apart): embeddings at [1.0, 0.5] and [4.0, 3.5].
Distance d = √(9.0 + 9.0) = √18.0 = 4.243.
Loss = max(0, 2.0 - 4.243)² = max(0, -2.243)² = 0.000. Already past the margin — no penalty.
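
The three hand calculations above can be reproduced in a few lines. This sketch restates the contrastive loss with margin 2.0 and runs it on the same embeddings. The pair names are just labels for printing.

```python
import numpy as np

def contrastive_loss(emb1, emb2, label, margin=2.0):
    """label=1 for similar pairs, 0 for dissimilar pairs."""
    d = np.linalg.norm(np.asarray(emb1) - np.asarray(emb2))
    return label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2

anchor = [1.0, 0.5]
pairs = [
    ("positive",        [1.3, 0.7], 1),
    ("negative, close", [1.5, 0.8], 0),
    ("negative, far",   [4.0, 3.5], 0),
]
losses = [contrastive_loss(anchor, other, label) for _, other, label in pairs]
for (name, _, _), loss in zip(pairs, losses):
    print(f"{name}: {loss:.3f}")
# positive: 0.130
# negative, close: 2.008
# negative, far: 0.000
```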
Contrastive loss works with pairs. Triplet loss uses three items at once: an anchor (reference image), a positive (same class as anchor), and a negative (different class).

$$L = \max\big(0,\; d(a, p) - d(a, n) + m\big)$$
In words: the anchor-to-positive distance should be smaller than the anchor-to-negative distance, by at least margin m. If it already is, loss is zero. If not, the loss is how much the constraint is violated.
Why triplets over pairs? Because triplets encode relative ordering directly. "Is A closer to B than to C?" is often more natural than "are A and B similar?" with a hard cutoff.
There's a catch with both losses. Most negatives are easy — a picture of a cat is obviously different from a picture of a truck. The loss for easy negatives is zero (they're already past the margin). The network learns nothing from them.
What matters is hard negatives: different items that the network currently thinks are similar. A Persian cat vs a Himalayan cat. Two faces that look alike but are different people. These violations are where all the learning signal lives.
In practice, you mine hard negatives: within each batch, find the triplets where d(a,p) is largest (worst positives) and d(a,n) is smallest (hardest negatives). Only train on those. Otherwise, most of your batch is wasted on zero-loss examples.
```python
import numpy as np

def contrastive_loss(emb1, emb2, label, margin=2.0):
    """label=1 for similar, 0 for dissimilar."""
    d = np.linalg.norm(emb1 - emb2)
    pos_loss = label * d ** 2
    neg_loss = (1 - label) * max(0, margin - d) ** 2
    return pos_loss + neg_loss

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss: push positive closer than negative."""
    dp = np.linalg.norm(anchor - positive)
    dn = np.linalg.norm(anchor - negative)
    return max(0, dp - dn + margin)
```
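
Hard-negative mining inside a batch can be sketched as follows. This is one possible "batch-hard" strategy under stated assumptions, not a reference implementation: for each anchor it takes the farthest same-label sample and the closest different-label sample, and it skips anchors that lack either. The function name `batch_hard_triplet_loss` is illustrative.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Mean triplet loss using the hardest positive and hardest negative per anchor."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)              # (N, N) pairwise distances

    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos_mask = same & not_self                        # same label, different sample
    neg_mask = ~same                                  # different label

    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)  # largest d(a, p)
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)   # smallest d(a, n)

    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    losses = np.maximum(0.0, hardest_pos - hardest_neg + margin)
    return losses[valid].mean()

emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.2, 0.0], [3.0, 0.0]])
labels = np.array([0, 0, 1, 1])
print(batch_hard_triplet_loss(emb, labels))  # approx. 1.4
```

For the first anchor, the hardest positive is at distance 1.0 and the hardest negative at 0.2, so its loss is 1.0 - 0.2 + 0.3 = 1.1. Well-separated clusters give zero loss, which is the "wasted batch" case the text warns about.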
Drag points to see how contrastive loss creates forces. Orange pairs should attract. Teal pairs should repel (until past the margin circle). Click Step to apply one gradient update.
Triplet loss has a fundamental limitation: it only considers one negative at a time. The gradient signal from one "this is different" comparison is noisy. What if we could compare the positive against many negatives simultaneously, getting a richer, more stable signal?
That's the insight behind InfoNCE (the name comes from Noise-Contrastive Estimation), the loss function that powers CLIP, SimCLR, MoCo, and much of modern self-supervised learning. The same loss also shows up in contrastive code search and audio-text alignment.
You have an anchor (say, an image of a dog). You have one positive (another view of the same dog, or the text "a photo of a golden retriever"). And you have K negatives (other images or texts from the batch that don't match).
Each item is embedded into a vector. We measure similarity using the dot product (or cosine similarity). The question InfoNCE asks is: can the model pick out the positive from a lineup of K+1 candidates?

$$L = -\log \frac{\exp(\text{sim}(a, p)/\tau)}{\exp(\text{sim}(a, p)/\tau) + \sum_{k=1}^{K} \exp(\text{sim}(a, n_k)/\tau)}$$

Look at the structure: this is softmax cross-entropy! The positive gets one "logit" ($\text{sim}(a, p)/\tau$), each negative gets one logit ($\text{sim}(a, n_k)/\tau$), and we compute cross-entropy with the positive as the correct class.
InfoNCE turns similarity learning into a (K+1)-way classification problem: "which of these K+1 items is the true match?" The answer is always index 0 (the positive). The loss is the negative log-probability of choosing correctly.
The temperature τ (tau) plays the same role as in softmax but with a twist. Low τ makes the softmax sharper — the model must be very precise about which item is the positive. High τ makes it softer — near-matches get partial credit.
In practice, τ = 0.07 (CLIP) or τ = 0.1 (SimCLR) — much lower than the standard τ=1. This forces the model to learn fine-grained distinctions. If τ is too low, training becomes unstable. If too high, the model doesn't discriminate enough.
Anchor embedding a = [1, 0]. Positive p = [0.9, 0.1]. Three negatives: n₁ = [0.1, 0.8], n₂ = [-0.5, 0.5], n₃ = [0.3, -0.7]. Temperature τ = 0.5.
Step 1 — compute dot-product similarities:
sim(a, p) = 1×0.9 + 0×0.1 = 0.90
sim(a, n₁) = 1×0.1 + 0×0.8 = 0.10
sim(a, n₂) = 1×(-0.5) + 0×0.5 = -0.50
sim(a, n₃) = 1×0.3 + 0×(-0.7) = 0.30
Step 2 — divide by τ and exponentiate:
$e^{0.90/0.5} = e^{1.8} = 6.050$
$e^{0.10/0.5} = e^{0.2} = 1.221$
$e^{-0.50/0.5} = e^{-1.0} = 0.368$
$e^{0.30/0.5} = e^{0.6} = 1.822$
Step 3 — softmax probability for the positive:
P(positive) = 6.050 / (6.050 + 1.221 + 0.368 + 1.822) = 6.050 / 9.461 = 0.639
Step 4 — loss:
L = -log(0.639) = 0.448
The model gives 63.9% probability to the correct match. Not perfect: the loss pushes it to increase $\text{sim}(a, p)$ and decrease each $\text{sim}(a, n_k)$.
With K=3 negatives, the "random guess" accuracy is 1/4 = 25%. With K=255 (a typical batch in SimCLR), random accuracy is 1/256 = 0.4%. More negatives make the task harder, forcing the model to learn finer distinctions. The theoretical bound tightens: InfoNCE lower-bounds the mutual information between anchor and positive, and the bound gets tighter with more negatives.
This is why contrastive learning uses enormous batch sizes. SimCLR used 8192. MoCo maintained a queue of 65,536 negatives. CLIP used 32,768. The number of negatives directly controls the quality of the learned representations.
```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss from scratch.

    anchor: (D,) embedding
    positive: (D,) embedding
    negatives: (K, D) embeddings
    """
    # Similarities
    pos_sim = np.dot(anchor, positive) / tau
    neg_sims = negatives @ anchor / tau  # (K,)

    # Log-sum-exp trick for stability
    all_sims = np.concatenate([[pos_sim], neg_sims])
    max_sim = np.max(all_sims)
    log_sum_exp = max_sim + np.log(np.sum(np.exp(all_sims - max_sim)))

    # Loss = -log(softmax of positive)
    loss = -pos_sim + log_sum_exp
    return loss

# Example
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
negs = np.array([[0.1, 0.8], [-0.5, 0.5], [0.3, -0.7]])
print(f"Loss: {info_nce_loss(a, p, negs, tau=0.5):.3f}")  # 0.447 (0.448 by hand, from rounded intermediates)
```
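
Large batches matter because every other item in the batch can serve as a negative for free. The sketch below shows that in-batch pattern in simplified form: row i of `positives` is the match for row i of `anchors`, and all other rows are negatives. It is a one-directional illustration with L2-normalized embeddings, not the actual CLIP or SimCLR implementation, and the function name `in_batch_info_nce` is an assumption.

```python
import numpy as np

def in_batch_info_nce(anchors, positives, tau=0.1):
    """InfoNCE where each row's negatives are the other rows of the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    logits = a @ p.T / tau                               # (N, N) cosine similarities / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct "class" for row i is column i (the diagonal)
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
aligned = in_batch_info_nce(x, x + 0.01 * rng.normal(size=x.shape))
mismatched = in_batch_info_nce(x, x[::-1])
print(f"aligned pairs:    {aligned:.3f}")
print(f"mismatched pairs: {mismatched:.3f}")
```

When all embeddings are identical, the loss equals log(N), the chance level for an N-way lineup.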
The anchor must find its positive among negatives. Adjust temperature and number of negatives. Watch the softmax probabilities and loss change. Low τ → sharper discrimination. More negatives → harder task.
We've built seven loss functions from scratch. Now the practical question: which one do you use? The answer depends on your task, your data, and what you want the network to learn.
In object detection, 99% of anchor boxes contain background, 1% contain objects. Standard cross-entropy is dominated by the easy background examples — they contribute little learning signal individually but overwhelm in aggregate.
Focal loss (Lin et al., 2017) down-weights easy examples:

$$\text{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$

Where $p_t$ is the model's predicted probability for the true class. When the model is already confident and correct ($p_t \approx 1$), the factor $(1 - p_t)^{\gamma} \approx 0$, so the easy example is down-weighted to nearly zero. When the model struggles ($p_t \approx 0$), the factor ≈ 1, so the example keeps its full weight.
The parameter γ controls how aggressively easy examples are down-weighted. γ=0 recovers standard cross-entropy. γ=2 is the standard choice. With γ=2, an example the model gets right with 90% confidence has its loss reduced by 100×.
Model predicts pt = 0.9 (confident and correct). γ = 2.
Standard CE: -log(0.9) = 0.105
Focal: -(1-0.9)² × log(0.9) = -0.01 × (-0.105) = 0.00105
The easy example's contribution dropped by 100×.
Now pt = 0.1 (struggling):
Standard CE: -log(0.1) = 2.303
Focal: -(1-0.1)² × log(0.1) = -0.81 × (-2.303) = 1.865
Only 19% reduction. Hard examples keep most of their weight.
Standard cross-entropy with one-hot labels pushes the model toward infinite confidence — the loss only reaches zero when the predicted probability is exactly 1.0. This leads to overconfident predictions that don't calibrate well.
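Label smoothing replaces the hard one-hot [1, 0, 0] with a soft target like [0.933, 0.033, 0.033]. Each target becomes (1-ε)·y + ε/K for K classes: a fraction ε of the probability mass is spread uniformly across all classes, so the true class ends up with 1 - ε + ε/K. Typical ε = 0.1, which gives the values above for K = 3. (A common variant spreads ε over only the K-1 wrong classes, giving [0.9, 0.05, 0.05].)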
This prevents the model from becoming infinitely confident and produces better-calibrated probabilities — the model's predicted confidence more closely matches its actual accuracy. It also acts as a regularizer, slightly reducing overfitting.
```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal loss: down-weight easy examples."""
    return -((1 - p_t) ** gamma) * np.log(p_t + 1e-15)

def label_smoothing(y_onehot, num_classes, epsilon=0.1):
    """Smooth hard labels."""
    return y_onehot * (1 - epsilon) + epsilon / num_classes

# Focal loss comparison
for pt in [0.9, 0.5, 0.1]:
    ce = -np.log(pt)
    fl = focal_loss(pt, gamma=2)
    print(f"p_t={pt:.1f} CE={ce:.3f} Focal={fl:.3f} ratio={fl/ce:.3f}")
# p_t=0.9 CE=0.105 Focal=0.001 ratio=0.010
# p_t=0.5 CE=0.693 Focal=0.173 ratio=0.250
# p_t=0.1 CE=2.303 Focal=1.865 ratio=0.810
```
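
To see why smoothed targets discourage unbounded confidence, the sketch below scales a logit vector that favors the true class and evaluates cross-entropy against hard and smoothed targets. The helper names `log_softmax` and `soft_cross_entropy` and the chosen scales are illustrative assumptions.

```python
import numpy as np

def log_softmax(z):
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

def soft_cross_entropy(target, logits):
    """Cross-entropy against an arbitrary target distribution."""
    return -np.sum(target * log_softmax(logits))

K, eps = 3, 0.1
hard = np.array([1.0, 0.0, 0.0])
smooth = hard * (1 - eps) + eps / K  # [0.933, 0.033, 0.033]

results = {}
for scale in [1.0, 3.0, 10.0]:
    logits = scale * np.array([1.0, 0.0, 0.0])  # larger scale = more confident
    results[scale] = (soft_cross_entropy(hard, logits),
                      soft_cross_entropy(smooth, logits))
    print(f"scale={scale:4.1f}  hard={results[scale][0]:.3f}  smooth={results[scale][1]:.3f}")
```

With hard targets the loss keeps falling as the logits grow, so the model is always rewarded for more confidence. With smoothed targets the loss is lower at scale 3 than at scale 10, so there is a finite sweet spot.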
Before deep learning dominated, hinge loss was the standard for SVMs and margin classifiers:

$$L = \max(0,\; 1 - y \cdot f(x))$$

Where y ∈ {-1, +1} and f(x) is the raw score. Correct predictions with y·f(x) ≥ 1 have zero loss. This creates a "margin" of safety: the model must be confident enough, not just barely correct.
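Multi-class hinge (SVM loss): for each incorrect class j, penalize if the score for j is within margin of the correct class score:

$$L = \sum_{j \ne k} \max(0,\; s_j - s_k + \Delta)$$

Here $s_j$ is the raw score for class j, k is the correct class, and Δ is the margin (commonly 1).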
Hinge loss doesn't produce probabilities (no softmax). It only cares about margins. Once the correct class wins by a sufficient margin, gradient is zero — it stops learning on that example. This can be good (focuses on hard cases) or bad (doesn't refine already-correct predictions).
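
Both hinge variants fit in a few lines. This is a minimal sketch in which the margin of 1.0 is an assumed default and the function names are illustrative. The example scores are the two logit vectors from the opening of the lesson, with cat (class 0) as the true label.

```python
import numpy as np

def hinge_loss(y, score):
    """Binary hinge loss; y is -1 or +1, score is the raw output f(x)."""
    return max(0.0, 1.0 - y * score)

def multiclass_hinge_loss(scores, k, margin=1.0):
    """Sum of margin violations over the incorrect classes; k is the true class."""
    violations = np.maximum(0.0, scores - scores[k] + margin)
    violations[k] = 0.0  # the correct class does not compete with itself
    return violations.sum()

print(hinge_loss(+1, 2.0))   # 0.0, correct and past the margin
print(hinge_loss(+1, 0.5))   # 0.5, correct but inside the margin
print(hinge_loss(-1, 0.5))   # 1.5, wrong side

print(multiclass_hinge_loss(np.array([2.1, 0.8, -0.3]), k=0))  # 0.0
print(multiclass_hinge_loss(np.array([0.5, 1.9, 0.3]), k=0))   # approx. 3.2
```

In the first multi-class case, cat beats every other class by more than the margin, so the loss and its gradient are both zero.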
| Loss | Task | Output | Key Property |
|---|---|---|---|
| MSE | Regression | Continuous | Smooth, quadratic, sensitive to outliers |
| MAE | Regression | Continuous | Robust to outliers, constant gradient |
| Huber | Regression | Continuous | Quadratic near zero, linear far away |
| Cross-Entropy | Classification | Probabilities | Strong gradient for confident mistakes |
| BCE | Binary/Multi-label | Per-class prob | Independent per output |
| Focal | Imbalanced classif. | Probabilities | Down-weights easy examples |
| Hinge | Margin classif. | Raw scores | Zero loss beyond margin |
| Contrastive | Pair similarity | Embeddings | Attract/repel with margin |
| Triplet | Relative similarity | Embeddings | Relative ordering, needs mining |
| InfoNCE | Representation learning | Embeddings | Multi-negative, scales with batch size |
Binary classification (y=1). Compare how different loss functions behave as predicted probability varies. Toggle each loss on/off.
We've built ten loss functions from scratch, traced their gradients by hand, and seen how each shapes what a neural network learns. Let's zoom out to where these ideas connect to the broader landscape of deep learning.
| Chapter | Key Concept | One Sentence |
|---|---|---|
| 0 | Why losses matter | The loss defines what the network learns — change the loss, change the behavior |
| 1 | MSE | Squared error: smooth and simple for regression, but punishes outliers and fails for classification |
| 2 | Softmax | Exponential normalization turns logits into probabilities with a temperature dial |
| 3 | Cross-Entropy | Negative log-probability of the correct class — screams at confident mistakes |
| 4 | KL Divergence | Cross-entropy minus entropy: measures extra surprise from using the wrong distribution |
| 5 | Regression losses | MAE resists outliers, Huber blends MSE and MAE smoothly |
| 6 | Contrastive & Triplet | Learn embedding geometry: similar things close, different things far |
| 7 | InfoNCE | Multi-negative contrastive = softmax CE over similarities — scales with batch size |
| 8 | Choosing | Match the loss to the task: classification → CE, regression → MSE/Huber, similarity → InfoNCE |
Backpropagation — we showed what the gradient of each loss is, but how does it flow backward through the network? The Backpropagation lesson traces the chain rule through every layer.
Optimizers — the loss gives us a gradient. The optimizer decides how to use that gradient to update weights. SGD, Adam, AdamW — each handles the gradient differently. (Coming soon in the Optimizers lesson.)
Regularization — losses can include penalty terms (L1, L2, dropout) that prevent overfitting. Regularization & Optimization covers these.
Contrastive Learning — InfoNCE is the foundation. CLIP applies it to vision-language alignment. Self-Supervised Learning covers SimCLR, BYOL, and DINO.
RLHF and DPO — reinforcement learning from human feedback uses a special loss (reward model loss, then PPO or DPO) to align language models. RLHF & DPO builds on cross-entropy and KL divergence from this lesson.
Diffusion Models — diffusion training uses a denoising loss that's essentially a weighted MSE between predicted and actual noise. Diffusion Models derives this from the variational bound.
GANs — the generator and discriminator each have their own loss function (adversarial loss), and the choice between BCE, hinge, and Wasserstein losses profoundly affects training stability. GANs covers this.