How a network teaches itself to see — with no labels at all — just by deciding which two pictures are secretly the same.
The internet has billions of images. Almost none of them are labeled. Labeling is slow, expensive, and human — ImageNet took years and an army of annotators to tag a million pictures. Meanwhile a model that wants to “understand” images the way a person does seems to need exactly those labels to learn from. That is the bottleneck.
So here is the audacious question contrastive learning asks: can a network learn genuinely useful visual features from raw, unlabeled images — no “cat,” no “dog,” no annotations at all? And the answer, which reshaped computer vision around 2020, is a resounding yes.
The trick is to invent a task the data can grade by itself — a pretext task, a fake job whose answer is free. Contrastive learning's pretext task is beautifully simple: take one image, make two different distorted copies of it, and teach the network that those two copies belong together — while every other image in the batch is a stranger.
Think about what the two views share. We crop, flip, blur, and recolor the same photo of a dog. The pixels are wildly different between the two views. The only thing that stays constant is the content — the dog. So the single strategy that lets the network call both views “the same” is to extract a representation of the content and throw away the nuisance details: position, color, crop. That representation is exactly the useful feature we wanted, and we got it without a single label.
An embedding is just the vector of numbers the network produces for an image — a point in a high-dimensional space. “Together” and “apart” are measured by how aligned two of these vectors are (their cosine similarity). The whole game is geometry: arrange the points so same-image views land near each other and different images spread out.
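To make "how aligned two vectors are" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional embeddings are made up for illustration; real embeddings have hundreds of dimensions and come from a trained network.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example.
dog_view_1 = [0.9, 0.1, 0.2]
dog_view_2 = [0.8, 0.2, 0.1]
car_view = [-0.1, 0.9, -0.4]

same = cosine_similarity(dog_view_1, dog_view_2)
diff = cosine_similarity(dog_view_1, car_view)
print(f"two dog views: {same:.3f}   dog vs car: {diff:.3f}")
```

The two dog views point in nearly the same direction, so their similarity is close to 1, while the dog and car vectors are close to perpendicular.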
The toy below shows embeddings as points on a circle (a stand-in for the unit hypersphere where real embeddings live). Each colored pair is two views of one image. Press Train: positive pairs feel an attractive force pulling them together, and every point repels the others. Watch order emerge from noise — each pair collapses to a tight couple, and the couples spread evenly around the circle. That is contrastive learning, in miniature.
Each color = two augmented views of one image. They should end up close. Different colors should spread apart. Press Train and watch the geometry organize.
Everything in contrastive learning rests on one act: turning a single image into two views — two augmented copies that we declare to be a positive pair. The augmentation pipeline is not a detail; it is the curriculum. What you augment away is what the model learns to ignore.
The standard recipe (from SimCLR) stacks several random transforms: a random crop then resize, a random horizontal flip, random color jitter (brightness, contrast, saturation, hue), random grayscale, and a random Gaussian blur. Each view runs the image through this gauntlet with different random settings, so the two views of one dog look genuinely different to the eye — yet both still depict that dog.
There is a tension. If the two views are too similar (say, two nearly identical crops), the network can match them with a trivial low-level feature — a color histogram, a texture — and never learn anything deep. If the views are too different (crops from opposite corners that share no object), there is no shared content to extract and the task becomes noise.
The sweet spot — sometimes called the InfoMin principle — is views that share only the information you care about (the object identity) and nothing else. Maximize the difficulty of the matching task while keeping the answer well-defined. Good augmentation design is the search for that sweet spot.
```python
import torch
from torchvision import transforms as T

# The augmentation gauntlet: one call, random settings each time.
aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),  # crop a random patch, resize to 224x224
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(23),  # kernel size 23; sigma is drawn at random on each call
    T.ToTensor(),
])

def two_views(image):
    # SAME image → TWO independent passes → a positive pair
    return aug(image), aug(image)  # each is (3, 224, 224)

# A batch of N images becomes 2N views; view i and view i+N are positives.
# `batch` is assumed to be a list of PIL images.
v1, v2 = zip(*[two_views(img) for img in batch])
views = torch.stack(v1 + v2)  # (2N, 3, 224, 224)
```
Trace the data flow: one image enters, two independent random augmentations produce two tensors of shape three-by-224-by-224, and a batch of N images becomes 2N views. The bookkeeping that view i and view i+N are a positive pair (and everything else is a negative) is the entire label signal — and it came for free from the indexing.
Drag the augmentation strength. At low strength the two views are nearly identical (high shared content, but a trivial task — the model learns nothing). At extreme strength the views share almost no content (the task is impossible). Somewhere in the middle is the InfoMin sweet spot: hard but solvable.
Two views of one image as augmentation strength rises. The bar shows shared content (the learnable signal) and task difficulty. Find the middle.
We have a positive pair and a crowd of negatives. We need a single number — a loss — that is small when the positive is close and the negatives are far, and large otherwise. The answer, used by almost every contrastive method, is the InfoNCE loss (also called NT-Xent in SimCLR). It is, at heart, a disguised classification problem.
Here is the reframing. For an anchor view, look at its similarity to every other view in the batch: one positive, many negatives. Now ask the network a multiple-choice question: which of these is your positive partner? Run those similarities through a softmax to turn them into probabilities, and the loss is simply how much probability the model put on the correct (positive) answer. InfoNCE is cross-entropy where the “classes” are “which view is my match.”
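Written as a formula, the multiple-choice view looks like the following. The notation is an assumption introduced here, not taken from the text: $s_{i,k}$ is the cosine similarity between views $i$ and $k$, $p(i)$ is the index of the positive partner of view $i$, and $\tau$ is the temperature that divides the similarities before the softmax.

```latex
\ell_i \;=\; -\log
\frac{\exp\!\left(s_{i,p(i)} / \tau\right)}
     {\sum_{k \neq i} \exp\!\left(s_{i,k} / \tau\right)}
```

The fraction is the softmax probability placed on the positive, and the sum in the denominator runs over every candidate except the anchor itself.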
One anchor, one positive, three negatives. Embeddings are normalized, so similarity is the cosine (between −1 and 1). Suppose:
| candidate | cosine sim | sim / τ (τ=0.2) | exp(·) |
|---|---|---|---|
| positive | 0.90 | 0.90 / 0.2 = 4.50 | e^4.5 = 90.0 |
| negative A | 0.20 | 0.20 / 0.2 = 1.00 | e^1.0 = 2.72 |
| negative B | 0.10 | 0.10 / 0.2 = 0.50 | e^0.5 = 1.65 |
| negative C | 0.30 | 0.30 / 0.2 = 1.50 | e^1.5 = 4.48 |
Add up the exponentials: 90.0 + 2.72 + 1.65 + 4.48 = 98.85. The probability the model assigns to the positive is its share of that total:

$$P(\text{positive}) = \frac{90.0}{98.85} \approx 0.910$$
The loss is the negative natural log of that probability: minus the log of 0.910 is about 0.094 — a small loss, because the model already put 91% of its confidence on the right answer. If the positive's similarity were only 0.30 (same as negative C), its exp would be 4.48, the probability would drop to 4.48 / (4.48+2.72+1.65+4.48) = 0.336, and the loss would jump to about 1.09. The loss screams when the positive isn't clearly the closest. That gradient is what drags it closer.
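The arithmetic in the worked example above can be checked with a few lines of plain Python. This is a sketch for a single anchor only; the function name `info_nce_single` is invented here for illustration.

```python
import math

def info_nce_single(pos_sim, neg_sims, tau):
    """Loss for one anchor: -log of the softmax probability placed on the positive."""
    pos = math.exp(pos_sim / tau)
    negs = sum(math.exp(s / tau) for s in neg_sims)
    prob = pos / (pos + negs)
    return prob, -math.log(prob)

negatives = [0.20, 0.10, 0.30]  # negatives A, B, C from the table

# Case 1: the positive is clearly the closest candidate.
prob_good, loss_good = info_nce_single(0.90, negatives, tau=0.2)

# Case 2: the positive is no closer than negative C.
prob_bad, loss_bad = info_nce_single(0.30, negatives, tau=0.2)

print(f"clear positive: P={prob_good:.3f}  loss={loss_good:.3f}")
print(f"weak positive:  P={prob_bad:.3f}  loss={loss_bad:.3f}")
```

Small differences from the hand calculation come only from rounding the exponentials to two decimals in the table.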
```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    # z1, z2: (N, d), the two views; row i of z1 matches row i of z2
    z1 = F.normalize(z1, dim=1)  # put embeddings on the unit sphere
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)  # (2N, d), all views together
    sim = z @ z.T / tau  # (2N, 2N) all pairwise similarities
    sim.fill_diagonal_(-9e15)  # a view is not its own negative
    N = z1.shape[0]
    # for row i (in 0..N-1) the positive is row i+N, and vice versa
    idx = torch.arange(N, device=z.device)  # keep targets on the same device as z
    targets = torch.cat([idx + N, idx])
    return F.cross_entropy(sim, targets)  # softmax + neg-log of the positive
```
The last line is the punchline: once the similarities are arranged as a matrix and we know which column is the positive for each row, InfoNCE is literally cross_entropy — the same loss used for labeled classification. We turned “learn good features” into “classify which view is your twin,” and standard machinery does the rest.
Drag the positive similarity bar. The softmax probabilities (right) update live, and so does the loss readout. Push the positive up and watch its probability dominate and the loss fall toward zero; pull it down among the negatives and watch the loss explode.
Left bars = similarity of the anchor to its positive (teal) and negatives (gray). Right bars = softmax probabilities. Loss = −log(positive's probability). Drag the positive's similarity.
You met temperature in the last chapter as a divisor before the softmax. It looks innocent. It is not. Temperature is the single most sensitive hyperparameter in contrastive learning, and understanding what it does separates people who can make these methods work from people who can't.
Temperature controls how sharp the softmax is. Divide the similarities by a small temperature and the differences blow up — the softmax becomes peaky, putting almost all probability on the single most-similar candidate. Divide by a large temperature and the similarities get squashed together — the softmax flattens toward uniform, treating all candidates as nearly equal.
Take an anchor whose similarity to its positive is 0.80 and to two negatives is 0.60 (a hard negative) and 0.10 (an easy one). Watch what temperature does to the probability on the positive.
| τ | positive e^(0.8/τ) | hard-neg e^(0.6/τ) | easy-neg e^(0.1/τ) | P(positive) |
|---|---|---|---|---|
| 0.1 (sharp) | e^8.0 = 2981 | e^6.0 = 403 | e^1.0 = 2.7 | 2981 / 3387 = 0.880 |
| 0.5 (soft) | e^1.6 = 4.95 | e^1.2 = 3.32 | e^0.2 = 1.22 | 4.95 / 9.49 = 0.522 |
At the sharp temperature 0.1, the positive grabs 88% of the probability and the easy negative (with its exp of 2.7) is essentially invisible next to the hard negative's 403 — so the gradient pours onto that hard negative. At the soft temperature 0.5, the positive holds only 52%, and the easy negative now matters too. Same embeddings, completely different learning pressure. Notice also: at low temperature the loss is small even though the positive is barely ahead in raw similarity — the temperature manufactures confidence.
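The same comparison can be reproduced in a few lines. This is a minimal sketch using the three similarities from the table; the helper name `softmax_with_temperature` is made up for this example.

```python
import math

def softmax_with_temperature(sims, tau):
    """Divide similarities by tau, then apply a numerically stable softmax."""
    scaled = [s / tau for s in sims]
    top = max(scaled)  # subtracting the max avoids overflow at small tau
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sims = [0.80, 0.60, 0.10]  # positive, hard negative, easy negative

sharp = softmax_with_temperature(sims, tau=0.1)
soft = softmax_with_temperature(sims, tau=0.5)

for name, probs in (("tau=0.1", sharp), ("tau=0.5", soft)):
    p_pos, p_hard, p_easy = probs
    print(f"{name}: positive={p_pos:.3f}  hard={p_hard:.3f}  easy={p_easy:.4f}")
```

At the sharp setting nearly all of the leftover probability sits on the hard negative, while at the soft setting the easy negative takes a visible share.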
Push temperature too low and the model fixates so hard on a few near neighbors that it tries to separate every point from every other point, including semantically similar images that should be close — it shatters the class structure and can become unstable. Push temperature too high and all negatives blur together; the gradient is weak and diffuse, and the representation stops discriminating. In practice the sweet spot is small — SimCLR uses 0.1 to 0.5, MoCo around 0.07. It must be tuned.
Five candidates: one positive and four negatives, one of which is a stubborn hard negative. Sweep temperature. At low temperature the distribution spikes and the hard negative is the only one with any mass besides the positive — that's where the gradient goes. At high temperature everything flattens. The readout tracks how concentrated the distribution is.
Bars = softmax probabilities over candidates at the current temperature. Lower τ = sharper = fixates on the positive and the hardest negative. Higher τ = flat.
Here is one of the most counterintuitive tricks in the whole field, and SimCLR's quiet masterstroke. The contrastive loss is not applied directly to the features you actually want. There is an extra little network, a projection head, bolted on top during training, and after training you delete it and keep what's underneath. Throwing away a part you trained sounds insane. Yet it pays off: in the SimCLR paper's ablation, a linear probe on the layer before the head beat one on the layer after it by more than 10 percentage points.
An image enters the backbone (say a ResNet). The backbone produces a feature vector — call it h, the representation. This is the thing we ultimately care about; it's what a downstream classifier will use. But h does not go into the loss. Instead it passes through the projection head — a small two-layer network — producing a second vector z. The contrastive loss is computed on z, not h.
Think about the gradient. The loss pushes z toward invariance, and that pressure flows backward through the projection head g first. By the time it reaches h, the head has already “used up” much of the invariance requirement on its own weights. So h is trained to be useful for producing an invariant z, without itself being forced all the way to invariance. The head is a shock absorber between “what the loss wants” and “what we keep.” That is why a linear probe (a single linear classifier trained on frozen features) does markedly better on h than on z.
```python
import torch.nn as nn

class SimCLRModel(nn.Module):
    def __init__(self, backbone, dim=2048, proj=128):
        super().__init__()  # required before assigning submodules
        self.f = backbone  # the keeper
        self.g = nn.Sequential(  # the sacrificial head
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, proj))

    def forward(self, x):
        h = self.f(x)  # representation, used at inference
        z = self.g(h)  # projection, used only for the loss
        return h, z

# train on z ...
h, z = model(views)
loss = info_nce(z[:N], z[N:])

# ... but at inference, throw g away and use h
features = model.f(image)  # g is gone; h is what we wanted
```
The widget shows the pipeline. Toggle which vector you extract features from — h (before the head) or z (after). The downstream linear-probe accuracy bar updates. Extracting from h wins, because z has been squeezed dry of everything the augmentations touched.
Click a stage to extract features from it. The bar shows representative downstream linear-probe accuracy. See why h beats z.
Recall the lesson from Chapter 2: more negatives make harder, better multiple-choice questions, and better features. SimCLR gets its negatives from the batch — every other image is a negative. So SimCLR needs enormous batches, thousands of images, which means dozens of expensive accelerators just to hold them in memory. That is a brutal hardware tax. MoCo (Momentum Contrast) asks: can we get thousands of negatives without a thousand-image batch?
The negatives don't all have to come from the current batch. MoCo keeps a queue — a running buffer of embeddings from recent batches, thousands of them. Each step, the current batch's embeddings are pushed onto the front of the queue, and the oldest ones fall off the back. It's a first-in-first-out conveyor belt of negatives. A tiny batch of 256 can now be contrasted against a queue of 65,000 negatives — the negatives are decoupled from the batch size.
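The conveyor belt can be sketched with a bounded `collections.deque`. This is an illustration of the FIFO behavior only, not MoCo's actual implementation: the class name `NegativeQueue` is hypothetical, and strings stand in for embedding vectors.

```python
from collections import deque

class NegativeQueue:
    """Fixed-size FIFO buffer of key embeddings (a hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen discards items from the opposite end once it is full.
        self.keys = deque(maxlen=capacity)

    def enqueue(self, batch_keys):
        for key in batch_keys:
            self.keys.appendleft(key)  # newest at the front; oldest falls off the back

    def negatives(self):
        return list(self.keys)

queue = NegativeQueue(capacity=6)
for step in range(4):
    # Each step contributes a "batch" of 2 keys; labels stand in for embeddings.
    queue.enqueue([f"step{step}_key{i}" for i in range(2)])

current = queue.negatives()
print(current)
```

After four steps of two keys each, the six-slot queue holds only the keys from steps 1 to 3; the step 0 keys have already dropped off. The number of negatives is set by `capacity`, not by the batch size.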
But there's a subtle bug. The queue holds embeddings computed by the network at past steps. The network is changing every step. So the old queued embeddings were produced by a different, now-stale version of the network — comparing today's query against last week's keys is inconsistent, and training becomes unstable.
MoCo's fix is elegant: use two encoders. The query encoder is the normal network, updated by gradient descent. The key encoder — which produces the embeddings that go into the queue — is not trained by gradients. Instead it is a slowly moving average of the query encoder. Each step, the key encoder takes a tiny step toward the query encoder, controlled by a momentum coefficient (typically 0.999). Because it changes so slowly, the keys in the queue stay consistent with each other even though they span many steps.
The key encoder's weights are updated as: new key weights equal momentum times the old key weights, plus (one minus momentum) times the current query weights. With momentum 0.999, that is 99.9% of the old key encoder and just 0.1% of the new query encoder, every step. Suppose a single weight in the query encoder is currently 0.50 and the key encoder's copy is 0.40:

$$\text{new key weight} = 0.999 \times 0.40 + 0.001 \times 0.50 = 0.3996 + 0.0005 = 0.4001$$
The key barely moved — from 0.400 to 0.4001. Over a thousand steps it drifts smoothly toward the query encoder, never jerking. That gentle lag is exactly what keeps the queue's thousands of negatives mutually consistent. Set momentum to 0 (key encoder = query encoder, always fresh) and MoCo's accuracy collapses, because the keys become inconsistent the instant the network updates.
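The update rule described above is a one-liner. The sketch below applies it to a single scalar weight and, as a simplifying assumption, holds the query weight fixed at 0.50 so the drift is easy to see. In real training the query encoder keeps moving too.

```python
def ema_update(key_weight, query_weight, momentum):
    """Move the key weight a small step toward the query weight."""
    return momentum * key_weight + (1.0 - momentum) * query_weight

# One step with the numbers from the text.
one_step = ema_update(0.40, 0.50, momentum=0.999)

# A thousand steps, with the query weight held constant for illustration.
key = 0.40
for _ in range(1000):
    key = ema_update(key, 0.50, momentum=0.999)

# Momentum 0 copies the query weight outright.
no_momentum = ema_update(0.40, 0.50, momentum=0.0)

print(f"after 1 step: {one_step:.4f}")
print(f"after 1000 steps: {key:.4f}")
print(f"momentum 0: {no_momentum:.2f}")
```

Even after a thousand steps the key weight has covered only part of the gap, which is the slow lag that keeps queued keys comparable to one another.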
Press Step to run training. New keys (computed by the momentum encoder) push onto the queue's front; the oldest fall off. The two markers show the query encoder (fast, jumping) and the key encoder (slow, trailing). Crank momentum up and watch the key encoder lag further behind — stable keys. Crank it to zero and the key encoder snaps onto the query encoder, and the queue's older entries become stale and mismatched.
Top: the FIFO queue of negative keys (new at left, oldest at right, about to drop). Bottom: query encoder (fast) vs key encoder (slow EMA). Higher momentum = slower, more consistent keys.
This is the payoff. A real contrastive training run, live, with the embeddings of several images shown as points on the unit circle. Each color is one image with two augmented views — a positive pair that should end up together. The simulator runs actual InfoNCE gradient steps. You control the temperature, the number of images, and — crucially — whether negatives are used at all.
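Two quantities tell you if it's working, and they are the modern way to diagnose contrastive learning: alignment, how close the two views of each positive pair sit, and uniformity, how evenly the embeddings spread around the circle.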
Good contrastive learning achieves both: tight pairs (alignment) that are nonetheless spread out (uniformity). The tension between them is the entire game — and the experiment you must run is removing the negatives.
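Alignment and uniformity can each be turned into a number. The sketch below follows the definitions commonly attributed to Wang and Isola (2020): alignment is the mean squared distance within positive pairs, and uniformity is the log of the mean Gaussian potential over all pairs of points, with `t=2` as an assumed default. Lower is better for both. The point layouts are invented for illustration.

```python
import math
from itertools import combinations

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(pairs):
    """Mean squared distance between the two views of each positive pair."""
    return sum(sq_dist(a, b) for a, b in pairs) / len(pairs)

def uniformity(points, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs of points."""
    vals = [math.exp(-t * sq_dist(a, b)) for a, b in combinations(points, 2)]
    return math.log(sum(vals) / len(vals))

def on_circle(angle_deg):
    rad = math.radians(angle_deg)
    return (math.cos(rad), math.sin(rad))

def flatten(pairs):
    return [point for pair in pairs for point in pair]

# Healthy: three images spread around the circle, each pair 10 degrees wide.
healthy = [(on_circle(c - 5), on_circle(c + 5)) for c in (0, 120, 240)]
# Collapsed: every view of every image lands on the same point.
collapsed = [(on_circle(0), on_circle(0)) for _ in range(3)]

for name, pairs in (("healthy", healthy), ("collapsed", collapsed)):
    print(f"{name}: alignment={alignment(pairs):.3f}  "
          f"uniformity={uniformity(flatten(pairs)):.3f}")
```

The collapsed layout has perfect alignment but the worst possible uniformity, which is why alignment alone cannot tell you the representation is healthy.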
Colors = images, lines join positive pairs. Press Train. Watch pairs pull together (alignment) while colors spread apart (uniformity). Then flip to “positives only” and watch everything collapse to one point.
No quiz here — the simulator is the test. If you can explain why “positives only” collapses and “with negatives” doesn't, you understand the core of contrastive learning.
We just watched, in the simulator, what happens when you remove negatives: everything collapses to a point. Negatives are the repulsive force that holds the embedding space open. So this next result should feel impossible. BYOL (Bootstrap Your Own Latent) learns excellent representations with no negatives at all — only positive pairs — and somehow does not collapse. How?
BYOL uses two networks again, like MoCo: an online network (trained by gradient) and a target network (an EMA of the online one). Both encode a view. The task: the online network must predict the target network's embedding of the other view. That's it — just match your partner's representation. No pushing anything away.
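Naively, this should collapse instantly: the trivial solution is for both networks to output the same constant for every image, and the prediction is perfect. BYOL avoids this with two ingredients working together: a predictor that exists only on the online branch, and a target branch that receives no gradient (a stop-gradient) and changes only through its slow EMA update.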
```python
import torch
import torch.nn.functional as F

# online: encoder f → projector g → predictor q (all trained)
# target: encoder f' → projector g' (EMA, no gradient)
# `online` and `target` are assumed to expose encode/project/predict methods.
def byol_loss(v1, v2, online, target):
    p1 = online.predict(online.project(online.encode(v1)))  # online prediction from view 1
    p2 = online.predict(online.project(online.encode(v2)))  # online prediction from view 2
    with torch.no_grad():  # ← STOP-GRADIENT
        t1 = target.project(target.encode(v1))  # target, frozen this step
        t2 = target.project(target.encode(v2))
    # online predicts the OTHER view's target representation
    return (F.mse_loss(F.normalize(p1, dim=1), F.normalize(t2, dim=1))
            + F.mse_loss(F.normalize(p2, dim=1), F.normalize(t1, dim=1)))

# after each step: target ← m·target + (1-m)·online (EMA, the moving goalpost)
```
The torch.no_grad() around the target is the entire trick. It means the loss can only be reduced by changing the online network to match the target — never by changing the target to be easy to match. That one asymmetry, plus the predictor, is the difference between rich features and a collapsed constant.
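One detail of the loss is worth checking by hand: the squared error between two normalized vectors is a cosine similarity in disguise, since for unit vectors the squared distance equals 2 minus twice the cosine. The sketch below verifies this on two made-up vectors. If the squared error is averaged over dimensions rather than summed, the two sides differ only by a constant factor.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def squared_error(u, v):
    """Sum of squared differences (not averaged over dimensions)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine(u, v):
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

# Hypothetical vectors standing in for an online prediction and a target projection.
p = [3.0, 4.0, 0.0]
t = [1.0, 2.0, 2.0]

lhs = squared_error(normalize(p), normalize(t))
rhs = 2.0 - 2.0 * cosine(p, t)
print(f"squared error of unit vectors: {lhs:.4f}   2 - 2*cos: {rhs:.4f}")
```

So minimizing the loss means maximizing the cosine similarity between the online prediction and the target, with no term that pushes anything apart.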
The diagram shows BYOL's two branches. Toggle stop-gradient, then press Train. With stop-gradient ON, the representation stays diverse and healthy. Turn it OFF (let gradients flow into the target), and watch the diversity bar crash to zero — instant collapse, exactly the failure BYOL was designed to dodge.
Online branch (with predictor) chases the target branch (EMA). Toggle stop-gradient and Train. The diversity bar shows whether the representation stays alive or collapses.
BYOL showed you can avoid collapse without negatives using asymmetry and stop-gradient. DINO (self-distillation with no labels) takes a different, equally clever route — and it produced one of the most striking results in self-supervised vision: a Vision Transformer trained with DINO spontaneously learns to segment objects, its attention maps lighting up on the foreground, with no segmentation labels ever.
DINO frames the pretext task as self-distillation. A student network and a teacher network (an EMA of the student, just like before) each see different views of the image. Each network outputs a probability distribution over a set of K abstract “prototype” dimensions — think of them as learned cluster slots. The student is trained to match the teacher's distribution for the same image, with a stop-gradient on the teacher. Match your teacher's soft assignment over the clusters. No negatives, no contrastive loss — just distribution matching.
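But distribution matching alone collapses, and in two distinct ways: one prototype can swallow every image (one-cluster collapse), or the output can flatten to the same uniform distribution for every image (uniform collapse). DINO needs one fix for each, centering for the first and sharpening for the second, and the interplay is the heart of the method.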
```python
import torch
import torch.nn.functional as F

# student & teacher output K-dim logits over learned prototypes
def teacher_dist(logits, center, tau_t=0.04):
    # center: running mean of teacher outputs (anti one-cluster collapse)
    # tau_t small = sharpening (anti uniform collapse)
    return F.softmax((logits - center) / tau_t, dim=-1)  # peaky AND de-biased

def student_log_dist(logits, tau_s=0.1):
    return F.log_softmax(logits / tau_s, dim=-1)  # softer than teacher

def dino_loss(s_logits, t_logits, center):
    # cross-entropy from teacher to student, averaged over the batch
    t = teacher_dist(t_logits, center).detach()  # stop-gradient on the teacher
    return -(t * student_log_dist(s_logits)).sum(dim=-1).mean()

# after each step:
# center = 0.9 * center + 0.1 * t_logits.mean(0)   # EMA of teacher outputs
# teacher = m * teacher + (1 - m) * student        # EMA weights, stop-grad target
```
The teacher's temperature is lower than the student's — the teacher is the “sharper, more confident” one, and the student is pulled toward it. The center term, updated as a running mean of teacher outputs, is subtracted every step to keep any single prototype from running away with all the assignments. Two scalars (a temperature and a centering EMA) stand between DINO and collapse.
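A tiny numeric sketch shows what each of the two fixes does. The logits are invented for illustration, and a two-image mean stands in for the running center. In actual training the center is an EMA over many batches.

```python
import math

def softmax(logits, tau):
    scaled = [x / tau for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_distribution(logits, center, tau):
    return softmax([l - c for l, c in zip(logits, center)], tau)

def argmax(dist):
    return dist.index(max(dist))

# Hypothetical teacher logits over K=4 prototypes for two different images.
# Prototype 0 carries a large bias that is the same for both images.
logits_a = [5.0, 1.0, 0.5, 0.2]
logits_b = [5.0, 0.3, 1.2, 0.1]
center = [(a + b) / 2 for a, b in zip(logits_a, logits_b)]
zero = [0.0] * 4

# No centering: the shared bias wins for both images (one-cluster collapse).
raw_a = teacher_distribution(logits_a, zero, tau=0.04)
raw_b = teacher_distribution(logits_b, zero, tau=0.04)

# Centering removes the shared bias, so each image picks its own prototype.
cen_a = teacher_distribution(logits_a, center, tau=0.04)
cen_b = teacher_distribution(logits_b, center, tau=0.04)

# Centering without sharpening (very high temperature): nearly uniform output.
flat = teacher_distribution(logits_a, center, tau=100.0)

print("no centering:", argmax(raw_a), argmax(raw_b))
print("centered:    ", argmax(cen_a), argmax(cen_b))
print("no sharpening:", [round(p, 3) for p in flat])
```

Centering alone pushes toward uniform and sharpening alone pushes toward a single winner, so the two are used together.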
The bars show the teacher's output distribution over prototype clusters. Slide sharpening and centering. Turn centering off and watch one cluster swallow everything (one-dimension collapse). Turn sharpening off (high temperature) and watch it flatten to uniform (uniform collapse). Find the balanced regime: a confident peak that isn't always the same cluster.
Bars = teacher distribution over prototype clusters. Centering fights one-cluster collapse; sharpening fights uniform collapse. Balance them.
You now understand the whole family: the pretext task of matching two views, the InfoNCE loss that powers it, the temperature that tunes hard-negative focus, the projection head you train and throw away, MoCo's queue and momentum encoder, and the two negative-free methods — BYOL and DINO — that dodge collapse with clever asymmetries. The single thread running through all of it: arrange the embedding space so that meaning-preserving changes leave the representation unchanged, while keeping the space open.
| Method | Negatives? | How it gets them | Anti-collapse trick |
|---|---|---|---|
| SimCLR | Yes | large batch (all other images) | negatives (explicit repulsion) |
| MoCo | Yes | queue + momentum key encoder | negatives, consistent via EMA |
| BYOL | No | — | predictor + stop-grad + EMA target |
| SimSiam | No | — | predictor + stop-grad (no EMA needed) |
| DINO | No | — | centering + sharpening on EMA teacher |
“What I cannot create, I do not understand.” — and a network that can recreate which two pictures are secretly the same has begun, without a single label, to understand what it sees.