Contrastive Learning — Representations Without Labels

Chapter 0: A World Without Labels

The internet has billions of images. Almost none of them are labeled. Labeling is slow, expensive, and human — ImageNet took years and an army of annotators to tag a million pictures. Meanwhile a model that wants to “understand” images the way a person does seems to need exactly those labels to learn from. That is the bottleneck.

So here is the audacious question contrastive learning asks: can a network learn genuinely useful visual features from raw, unlabeled images — no “cat,” no “dog,” no annotations at all? And the answer, which reshaped computer vision around 2020, is a resounding yes.

The trick is to invent a task the data can grade by itself — a pretext task, a fake job whose answer is free. Contrastive learning's pretext task is beautifully simple: take one image, make two different distorted copies of it, and teach the network that those two copies belong together — while every other image in the batch is a stranger.

The core idea in one line. Pull the two views of the same image together in embedding space, and push views of different images apart. Do that across millions of images and the network is forced to discover what actually makes an image what it is — because that is the only thing that survives the distortions.

Why this forces real understanding

Think about what the two views share. We crop, flip, blur, and recolor the same photo of a dog. The pixels are wildly different between the two views. The only thing that stays constant is the content — the dog. So the single strategy that lets the network call both views “the same” is to extract a representation of the content and throw away the nuisance details: position, color, crop. That representation is exactly the useful feature we wanted, and we got it without a single label.

An embedding is just the vector of numbers the network produces for an image — a point in a high-dimensional space. “Together” and “apart” are measured by how aligned two of these vectors are (their cosine similarity). The whole game is geometry: arrange the points so same-image views land near each other and different images spread out.
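To make "aligned" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and their names are hypothetical, chosen only for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, for illustration only.
dog_view_1 = [0.9, 0.1, 0.2]
dog_view_2 = [0.8, 0.2, 0.1]    # a second view of the same image
car_view = [-0.1, 0.9, -0.3]    # a view of a different image

sim_positive = cosine_similarity(dog_view_1, dog_view_2)
sim_negative = cosine_similarity(dog_view_1, car_view)
print(f"positive pair: {sim_positive:.3f}, negative pair: {sim_negative:.3f}")
```

Because only the angle matters, scaling a vector leaves its cosine similarity unchanged, which is why the embeddings can be treated as points on a unit sphere.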

See it: the pull-together, push-apart force

The toy below shows embeddings as points on a circle (a stand-in for the unit hypersphere where real embeddings live). Each colored pair is two views of one image. Press Train: positive pairs feel an attractive force pulling them together, and every point repels the others. Watch order emerge from noise — each pair collapses to a tight couple, and the couples spread evenly around the circle. That is contrastive learning, in miniature.

Pull Together, Push Apart

Each color = two augmented views of one image. They should end up close. Different colors should spread apart. Press Train and watch the geometry organize.

Common misconception. “Without labels, the network can't know what's 'correct,' so it can't learn anything useful.” It doesn't need external correctness — it manufactures its own supervision from the structure of the data. The label is implicit: “these two came from the same source image.” That free, self-generated signal is what the whole field of self-supervised learning is built on.

What stays constant between two augmented views of the same image, and why does that matter?

The pixel values stay constant, so the network memorizes them
The content stays constant while pixels change, so the only way to match the views is to extract content and discard nuisance details
Nothing stays constant; the network learns from randomness

Chapter 1: Views — Manufacturing Positive Pairs

Everything in contrastive learning rests on one act: turning a single image into two views — two augmented copies that we declare to be a positive pair. The augmentation pipeline is not a detail; it is the curriculum. What you augment away is what the model learns to ignore.

The standard recipe (from SimCLR) stacks several random transforms: a random crop then resize, a random horizontal flip, random color jitter (brightness, contrast, saturation, hue), random grayscale, and a random Gaussian blur. Each view runs the image through this gauntlet with different random settings, so the two views of one dog look genuinely different to the eye — yet both still depict that dog.

You are choosing the invariances. By including color jitter, you tell the model “color doesn't define the object” — so it becomes color-invariant. By cropping aggressively, you teach “a part implies the whole.” The augmentations are how you encode your beliefs about what matters. Remove color jitter and SimCLR's accuracy famously craters, because the model takes a shortcut: it just matches the average color of the two crops instead of learning content.

The shortcut problem and InfoMin

There is a tension. If the two views are too similar (say, two nearly identical crops), the network can match them with a trivial low-level feature — a color histogram, a texture — and never learn anything deep. If the views are too different (crops from opposite corners that share no object), there is no shared content to extract and the task becomes noise.

The sweet spot — sometimes called the InfoMin principle — is views that share only the information you care about (the object identity) and nothing else. Maximize the difficulty of the matching task while keeping the answer well-defined. Good augmentation design is the search for that sweet spot.

From scratch: the two-view pipeline

python
import torch
from torchvision import transforms as T

# the augmentation gauntlet — one call, random settings each time
aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),   # crop a random patch, resize
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.4,0.4,0.4,0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(23)], p=0.5),   # blur a random half of the views, as in SimCLR
    T.ToTensor(),
])

def two_views(image):
    # SAME image → TWO independent passes → a positive pair
    return aug(image), aug(image)        # each is (3, 224, 224)

# a batch of N images becomes 2N views; view i and view i+N are positives
v1, v2 = zip(*[two_views(img) for img in batch])
views = torch.stack(v1 + v2)               # (2N, 3, 224, 224)

Trace the data flow: one image enters, two independent random augmentations produce two tensors of shape (3, 224, 224), and a batch of N images becomes 2N views. The bookkeeping that view i and view i+N are a positive pair (and everything else is a negative) is the entire label signal, and it came for free from the indexing.
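The indexing rule can be sketched on its own, without any images. This is a minimal illustration assuming the stacking order used above (first views in rows 0..N-1, second views in rows N..2N-1); the helper name `positive_index` and the batch size are hypothetical.

```python
def positive_index(i, n_images):
    """Index of the positive partner of view i in a stacked batch of 2 * n_images views."""
    return (i + n_images) % (2 * n_images)

N = 4  # hypothetical batch of 4 images, so 8 views
partners = [positive_index(i, N) for i in range(2 * N)]
print(partners)  # [4, 5, 6, 7, 0, 1, 2, 3]

# For view 0, every view other than itself and its partner is a negative.
negatives_of_0 = [j for j in range(2 * N) if j not in (0, positive_index(0, N))]
print(negatives_of_0)  # [1, 2, 3, 5, 6, 7]
```

Each view therefore has exactly one positive and 2N - 2 negatives, all derived from position in the batch.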

See it: augmentation strength vs. shared content

Drag the augmentation strength. At low strength the two views are nearly identical (high shared content, but a trivial task — the model learns nothing). At extreme strength the views share almost no content (the task is impossible). Somewhere in the middle is the InfoMin sweet spot: hard but solvable.

The Augmentation Sweet Spot

Two views of one image as augmentation strength rises. The bar shows shared content (the learnable signal) and task difficulty. Find the middle.

Common misconception. “More aggressive augmentation is always better.” Past the sweet spot, you destroy the shared content and the positive pair no longer depicts the same thing — you're asking the model to match two unrelated patches, which teaches it nonsense. The art is matching augmentation strength to the dataset, not maximizing it.

Removing color jitter from SimCLR's augmentations sharply hurts learned features. Why?

Color jitter adds training data
Color jitter makes the images prettier
Without it, the model can match the two views by their shared average color (a trivial shortcut) instead of learning content

Chapter 2: InfoNCE — Turning “Pull/Push” Into a Loss

We have a positive pair and a crowd of negatives. We need a single number — a loss — that is small when the positive is close and the negatives are far, and large otherwise. The answer, used by almost every contrastive method, is the InfoNCE loss (also called NT-Xent in SimCLR). It is, at heart, a disguised classification problem.

Here is the reframing. For an anchor view, look at its similarity to every other view in the batch: one positive, many negatives. Now ask the network a multiple-choice question: which of these is your positive partner? Run those similarities through a softmax to turn them into probabilities, and the loss is the negative log of the probability the model put on the correct (positive) answer. InfoNCE is cross-entropy where the “classes” are “which view is my match.”

The whole loss in words. Compute the similarity of the anchor to its positive and to every negative. Divide by a temperature. Softmax. The loss is the negative log of the probability mass that lands on the positive. Minimizing it raises the positive's similarity and lowers the negatives' — exactly the pull-together, push-apart force from Chapter 0, now differentiable.

Worked example: computing InfoNCE by hand

One anchor, one positive, three negatives. Embeddings are normalized, so similarity is the cosine (between −1 and 1). Suppose:

candidate	cosine sim	sim / τ (τ=0.2)	exp(·)
positive	0.90	0.90 / 0.2 = 4.50	e^4.5 = 90.0
negative A	0.20	0.20 / 0.2 = 1.00	e^1.0 = 2.72
negative B	0.10	0.10 / 0.2 = 0.50	e^0.5 = 1.65
negative C	0.30	0.30 / 0.2 = 1.50	e^1.5 = 4.48

Add up the exponentials: 90.0 + 2.72 + 1.65 + 4.48 = 98.85. The probability the model assigns to the positive is its share of that total:

P(positive) = 90.0 / 98.85 = 0.910

The loss is the negative natural log of that probability: minus the log of 0.910 is about 0.094 — a small loss, because the model already put 91% of its confidence on the right answer. If the positive's similarity were only 0.30 (same as negative C), its exp would be 4.48, the probability would drop to 4.48 / (4.48+2.72+1.65+4.48) = 0.336, and the loss would jump to about 1.09. The loss screams when the positive isn't clearly the closest. That gradient is what drags it closer.
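The arithmetic above can be checked with a few lines of plain Python. This is a sketch for a single anchor only; the function name `info_nce_single` is hypothetical and is not part of any library.

```python
import math

def info_nce_single(sim_positive, sim_negatives, tau):
    """Return (P(positive), loss) for one anchor, given its cosine similarities."""
    logits = [sim_positive / tau] + [s / tau for s in sim_negatives]
    exps = [math.exp(logit) for logit in logits]
    p_positive = exps[0] / sum(exps)
    return p_positive, -math.log(p_positive)

negatives = [0.20, 0.10, 0.30]  # negatives A, B, C from the table

p_good, loss_good = info_nce_single(0.90, negatives, tau=0.2)
p_bad, loss_bad = info_nce_single(0.30, negatives, tau=0.2)

print(f"positive at 0.90: P = {p_good:.3f}, loss = {loss_good:.3f}")
print(f"positive at 0.30: P = {p_bad:.3f}, loss = {loss_bad:.3f}")
```

The two calls differ only in the positive's similarity, which isolates how strongly the loss reacts when the positive stops standing out from the negatives.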

From scratch: InfoNCE in a few lines

python
import torch, torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    # z1, z2: (N, d) — the two views; row i of z1 matches row i of z2
    z1 = F.normalize(z1, dim=1)          # put embeddings on the unit sphere
    z2 = F.normalize(z2, dim=1)
    z  = torch.cat([z1, z2], dim=0)        # (2N, d) — all views together
    sim = z @ z.T / tau                    # (2N, 2N) all pairwise similarities
    sim.fill_diagonal_(-9e15)           # a view is not its own negative
    N = z1.shape[0]
    # for row i (in 0..N-1) the positive is row i+N, and vice-versa
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)]).to(z.device)  # keep targets on the same device as sim
    return F.cross_entropy(sim, targets)   # softmax + neg-log of the positive

The last line is the punchline: once the similarities are arranged as a matrix and we know which column is the positive for each row, InfoNCE is literally cross_entropy — the same loss used for labeled classification. We turned “learn good features” into “classify which view is your twin,” and standard machinery does the rest.

See it: similarities → softmax → loss

Drag the positive similarity bar. The softmax probabilities (right) update live, and so does the loss readout. Push the positive up and watch its probability dominate and the loss fall toward zero; pull it down among the negatives and watch the loss explode.

InfoNCE: One Anchor's Multiple-Choice Question

Left bars = similarity of the anchor to its positive (teal) and negatives (gray). Right bars = softmax probabilities. Loss = −log(positive's probability). Drag the positive's similarity.

Common misconception. “More negatives just means more compute.” More negatives also makes the multiple-choice question harder and the learned features better — each negative is another distractor the positive must beat. This is why SimCLR needed batch sizes in the thousands, and why the next chapters are largely about getting many negatives without paying for a giant batch.
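One way to see that more negatives make the question harder: a model that guesses uniformly over K candidates pays a loss of ln K, so the chance-level loss grows with the batch. The sketch below assumes the in-batch setup from the code above, where each anchor faces 2N - 1 candidates; the batch sizes are hypothetical.

```python
import math

def chance_level_loss(n_images):
    """InfoNCE loss of a model that spreads probability uniformly over all candidates."""
    n_candidates = 2 * n_images - 1   # 1 positive + (2N - 2) negatives
    return math.log(n_candidates)

for n in (8, 256, 4096):
    print(f"N = {n:5d}: chance-level loss = {chance_level_loss(n):.3f}")
```

Loss values from runs with different numbers of negatives are therefore not directly comparable: the starting point itself moves with the candidate count.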

InfoNCE is described as “cross-entropy in disguise.” What are the implicit classes?

The object categories like cat and dog
The augmentation types applied
“Which of the candidate views is my positive partner”, with one class per candidate

Chapter 3: Temperature — The Most Important Knob

You met temperature in the last chapter as a divisor before the softmax. It looks innocent. It is not. Temperature is the single most sensitive hyperparameter in contrastive learning, and understanding what it does separates people who can make these methods work from people who can't.

Temperature controls how sharp the softmax is. Divide the similarities by a small temperature and the differences blow up — the softmax becomes peaky, putting almost all probability on the single most-similar candidate. Divide by a large temperature and the similarities get squashed together — the softmax flattens toward uniform, treating all candidates as nearly equal.

Temperature is a hardness focus. A low temperature makes the loss obsess over the hardest negatives — the ones most similar to the anchor, the near-misses. Their large similarities dominate the sharpened softmax, so they get almost all the repulsive gradient. A high temperature spreads the gradient evenly across all negatives, easy and hard alike. Tuning temperature is really tuning “how much should I fixate on my most confusable distractors?”

Worked example: the same scores, two temperatures

Take an anchor whose similarity to its positive is 0.80 and to two negatives is 0.60 (a hard negative) and 0.10 (an easy one). Watch what temperature does to the probability on the positive.

τ	positive e^(0.8/τ)	hard-neg e^(0.6/τ)	easy-neg e^(0.1/τ)	P(positive)
0.1 (sharp)	e^8.0=2981	e^6.0=403	e^1.0=2.7	2981/3387 = 0.880
0.5 (soft)	e^1.6=4.95	e^1.2=3.32	e^0.2=1.22	4.95/9.49 = 0.522

At the sharp temperature 0.1, the positive grabs 88% of the probability and the easy negative (with its exp of 2.7) is essentially invisible next to the hard negative's 403 — so the gradient pours onto that hard negative. At the soft temperature 0.5, the positive holds only 52%, and the easy negative now matters too. Same embeddings, completely different learning pressure. Notice also: at low temperature the loss is small even though the positive is barely ahead in raw similarity — the temperature manufactures confidence.
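The table can be reproduced in plain Python, along with a rough measure of where the repulsive gradient goes. The sketch relies on the fact that the gradient of this loss with respect to a negative's similarity is proportional to that negative's softmax probability, so the hard negative's share of the push is its probability divided by the total probability on negatives. The helper name and the `results` dictionary are hypothetical.

```python
import math

def softmax_with_temperature(sims, tau):
    """Softmax over similarities after dividing each by the temperature."""
    exps = [math.exp(s / tau) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

sims = [0.80, 0.60, 0.10]  # positive, hard negative, easy negative

results = {}
for tau in (0.1, 0.5):
    p_pos, p_hard, p_easy = softmax_with_temperature(sims, tau)
    hard_share = p_hard / (p_hard + p_easy)   # fraction of repulsion aimed at the hard negative
    results[tau] = (p_pos, hard_share)
    print(f"tau = {tau}: P(positive) = {p_pos:.3f}, hard negative's gradient share = {hard_share:.3f}")
```

At the sharp temperature nearly all of the repulsion lands on the hard negative; at the soft temperature a sizable fraction goes to the easy one as well.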

The Goldilocks failure modes

Push temperature too low and the model fixates so hard on a few near neighbors that it tries to separate every point from every other point, including semantically similar images that should be close — it shatters the class structure and can become unstable. Push temperature too high and all negatives blur together; the gradient is weak and diffuse, and the representation stops discriminating. In practice the sweet spot is small — SimCLR uses 0.1 to 0.5, MoCo around 0.07. It must be tuned.

See it: temperature reshaping the softmax

Five candidates: one positive and four negatives, one of which is a stubborn hard negative. Sweep temperature. At low temperature the distribution spikes and the hard negative is the only one with any mass besides the positive — that's where the gradient goes. At high temperature everything flattens. The readout tracks how concentrated the distribution is.

Temperature: From Peaky to Flat

Bars = softmax probabilities over candidates at the current temperature. Lower τ = sharper = fixates on the positive and the hardest negative. Higher τ = flat.

Common misconception. “Temperature just rescales the loss, so it doesn't change what's learned.” It changes the gradient distribution across negatives, which changes the geometry the model converges to. Low temperature builds tight, well-separated clusters but risks over-fragmenting; high temperature builds loose, smooth structure. It is a real architectural choice disguised as a scalar.

Lowering the temperature makes the contrastive loss focus most of its repulsive gradient on which negatives?

The easiest negatives (least similar)
The hardest negatives (most similar to the anchor), because their large similarities dominate the sharpened softmax
All negatives equally

Chapter 4: The Projection Head — and Why We Throw It Away

Here is one of the most counterintuitive tricks in the whole field, and SimCLR's quiet masterstroke. The contrastive loss is not applied directly to the features you actually want. There is an extra little network, a projection head, bolted on top during training, and after training you delete it and keep what's underneath. Throwing away a part you trained sounds insane. Yet in SimCLR's reported ablations, features taken from before the head beat features taken from after it by more than ten percentage points of linear-probe accuracy.

The data flow — trace it carefully

An image enters the backbone (say a ResNet). The backbone produces a feature vector — call it h, the representation. This is the thing we ultimately care about; it's what a downstream classifier will use. But h does not go into the loss. Instead it passes through the projection head — a small two-layer network — producing a second vector z. The contrastive loss is computed on z, not h.

image (augmented view, 3×224×224) → backbone f (ResNet → h, 2048-dim; KEEP THIS) → projection g (MLP → z, 128-dim; loss applied here) → InfoNCE (on z, then discard g)

Why discard the head? The contrastive loss demands invariance: it wants z to be identical for both augmented views, which means z must throw away everything the augmentations changed — color, orientation, crop. But some of that “nuisance” information (color, pose) is actually useful for downstream tasks! If the loss acted directly on h, it would strip that information out of the representation. The projection head acts as a sacrificial buffer: it absorbs the invariance pressure, letting z become invariant while h upstream gets to keep the richer information.

Concept + realization: where the gradient bites

Think about the gradient. The loss pushes z toward invariance, and that pressure flows backward through the projection head g first. By the time it reaches h, the head has already “used up” much of the invariance requirement on its own weights. So h is trained to be useful for producing an invariant z, without itself being forced all the way to invariance. The head is a shock absorber between “what the loss wants” and “what we keep.” That is why a linear probe (a single linear classifier trained on frozen features) does markedly better on h than on z.

python
import torch.nn as nn

class SimCLRModel(nn.Module):
    def __init__(self, backbone, dim=2048, proj=128):
        super().__init__()                      # must run before submodules are assigned
        self.f = backbone                       # the keeper
        self.g = nn.Sequential(                 # the sacrificial head
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, proj))

    def forward(self, x):
        h = self.f(x)            # representation, used at inference
        z = self.g(h)            # projection, used only for the loss
        return h, z

# train on z: views is the (2N, ...) stack, so rows [:N] and [N:] are the two views
h, z = model(views)
loss = info_nce(z[:N], z[N:])
# ... but at inference, throw g away and use h
features = model.f(image)        # g is gone; h is what we wanted

See it: representation vs projection for downstream tasks

The widget shows the pipeline. Toggle which vector you extract features from — h (before the head) or z (after). The downstream linear-probe accuracy bar updates. Extracting from h wins, because z has been squeezed dry of everything the augmentations touched.

Extract Before or After the Head?

Click a stage to extract features from it. The bar shows representative downstream linear-probe accuracy. See why h beats z.

Common misconception. “The projection head is just extra capacity, so keeping it can only help.” The opposite: keeping z hurts, because z is deliberately invariant — it has discarded color, pose, and texture that downstream tasks often need. The head's whole job is to be thrown away. Its value is in what it protects upstream, not in what it outputs.

Why does a downstream classifier work better on h (backbone output) than on z (projection output)?

z is lower-dimensional so it has less compute
The loss forces z to be invariant to augmentations, stripping out information (color, pose) that downstream tasks need; the head absorbs that pressure so h keeps it
h is trained with labels and z is not

Chapter 5: The Negatives Problem — MoCo’s Queue

Recall the lesson from Chapter 2: more negatives make harder, better multiple-choice questions, and better features. SimCLR gets its negatives from the batch — every other image is a negative. So SimCLR needs enormous batches, thousands of images, which means dozens of expensive accelerators just to hold them in memory. That is a brutal hardware tax. MoCo (Momentum Contrast) asks: can we get thousands of negatives without a thousand-image batch?

Idea one: a queue of negatives

The negatives don't all have to come from the current batch. MoCo keeps a queue — a running buffer of embeddings from recent batches, thousands of them. Each step, the current batch's embeddings are pushed onto the front of the queue, and the oldest ones fall off the back. It's a first-in-first-out conveyor belt of negatives. A tiny batch of 256 can now be contrasted against a queue of 65,000 negatives — the negatives are decoupled from the batch size.

Why the queue is almost free. The queued embeddings are just stored vectors — no gradients flow through them, so they cost only memory, not backprop. You get the statistical benefit of 65,000 negatives at the compute cost of a 256-image batch. The queue is a cache of “other stuff” to push away from, refreshed continuously as training proceeds.
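The conveyor belt can be sketched with a bounded deque from Python's standard library. This illustrates only the FIFO bookkeeping, not MoCo's actual implementation: the sizes are tiny and hypothetical, and strings stand in for embedding vectors.

```python
from collections import deque

QUEUE_SIZE = 8   # hypothetical; the example in the text uses about 65,000
BATCH_SIZE = 3   # hypothetical; the example in the text uses 256

queue = deque(maxlen=QUEUE_SIZE)   # a full deque drops items from the opposite end

def enqueue_batch(queue, keys):
    """Push the newest keys onto the front; the oldest fall off the back."""
    for key in keys:
        queue.appendleft(key)

# Stand-in "embeddings": labels instead of real vectors.
for step in range(4):
    keys = [f"step{step}_key{i}" for i in range(BATCH_SIZE)]
    enqueue_batch(queue, keys)

print(len(queue))   # 8
print(queue[0])     # step3_key2
print(queue[-1])    # step1_key1
```

After four steps of three keys each, only the eight most recent keys remain, so the number of negatives is set by the queue length rather than by the batch size.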

Idea two: the momentum encoder

But there's a subtle bug. The queue holds embeddings computed by the network at past steps. The network is changing every step. So the old queued embeddings were produced by a different, now-stale version of the network — comparing today's query against last week's keys is inconsistent, and training becomes unstable.

MoCo's fix is elegant: use two encoders. The query encoder is the normal network, updated by gradient descent. The key encoder — which produces the embeddings that go into the queue — is not trained by gradients. Instead it is a slowly moving average of the query encoder. Each step, the key encoder takes a tiny step toward the query encoder, controlled by a momentum coefficient (typically 0.999). Because it changes so slowly, the keys in the queue stay consistent with each other even though they span many steps.

Worked example: the momentum update

The key encoder's weights are updated as: new key weights equal momentum times the old key weights, plus (one minus momentum) times the current query weights. With momentum 0.999, that is 99.9% of the old key encoder and just 0.1% of the new query encoder, every step. Suppose a single weight in the query encoder is currently 0.50 and the key encoder's copy is 0.40:

key ← 0.999 × 0.40 + 0.001 × 0.50 = 0.3996 + 0.0005 = 0.4001

The key barely moved — from 0.400 to 0.4001. Over a thousand steps it drifts smoothly toward the query encoder, never jerking. That gentle lag is exactly what keeps the queue's thousands of negatives mutually consistent. Set momentum to 0 (key encoder = query encoder, always fresh) and MoCo's accuracy collapses, because the keys become inconsistent the instant the network updates.
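The update rule and its slow drift can be checked in plain Python. The sketch below applies the rule to a single scalar weight and, as a simplifying assumption that does not hold in real training, keeps the query weight fixed at 0.50. The function name `momentum_update` is hypothetical.

```python
def momentum_update(key_w, query_w, m=0.999):
    """Exponential moving average: keep a fraction m of the key, blend in (1 - m) of the query."""
    return m * key_w + (1.0 - m) * query_w

key, query = 0.40, 0.50
one_step = momentum_update(key, query)
print(round(one_step, 4))  # 0.4001

# Count the steps until the key has closed half of the original 0.10 gap,
# assuming the query weight never changes.
steps = 0
while query - key > 0.05:
    key = momentum_update(key, query)
    steps += 1
print(steps)  # 693
```

With momentum 0.999 the key needs several hundred steps to cover half the distance, which is the lag that keeps keys from neighboring steps mutually consistent.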

See it: the queue and the lagging encoder

Press Step to run training. New keys (computed by the momentum encoder) push onto the queue's front; the oldest fall off. The two markers show the query encoder (fast, jumping) and the key encoder (slow, trailing). Crank momentum up and watch the key encoder lag further behind — stable keys. Crank it to zero and the key encoder snaps onto the query encoder, and the queue's older entries become stale and mismatched.

MoCo: A Queue of Negatives + A Momentum Encoder

Top: the FIFO queue of negative keys (new at left, oldest at right, about to drop). Bottom: query encoder (fast) vs key encoder (slow EMA). Higher momentum = slower, more consistent keys.

Common misconception. “The momentum encoder is there to be a better network.” It is usually worse in the moment than the query encoder — it's a stale average. Its job is not quality but consistency: it changes slowly enough that the thousands of keys sitting in the queue, produced across many past steps, still agree with each other. Consistency of negatives, not freshness, is what makes the queue usable.

Why does MoCo update the key encoder as a slow moving average instead of by gradient descent?

Gradient descent is too slow to compute
So the keys stored in the queue across many past steps stay consistent with each other; a fast-changing encoder would make old queued keys stale and mismatched
To save memory in the queue

Chapter 6: The Embedding-Space Simulator

This is the payoff. A real contrastive training run, live, with the embeddings of several images shown as points on the unit circle. Each color is one image with two augmented views — a positive pair that should end up together. The simulator runs actual InfoNCE gradient steps. You control the temperature, the number of images, and — crucially — whether negatives are used at all.

Two quantities tell you if it's working, and they are the modern way to diagnose contrastive learning:

Alignment: the distance between the two views of each image. Lower is better; it means the model maps a thing and its augmentation to the same place.
Uniformity: how evenly the points spread around the circle. Higher is better; it means the model uses the whole space and doesn't cram everything into one spot.

Good contrastive learning achieves both: tight pairs (alignment) that are nonetheless spread out (uniformity). The tension between them is the entire game — and the experiment you must run is removing the negatives.

Live Contrastive Training on the Hypersphere

Colors = images, lines join positive pairs. Press Train. Watch pairs pull together (alignment) while colors spread apart (uniformity). Then flip to “positives only” and watch everything collapse to one point.

The experiment that teaches the most. Switch to “positives only” (remove the repulsion from negatives) and train. Alignment becomes perfect — every pair merges — but every image collapses onto the same point. The embedding is useless: it maps everything to one spot. This is representational collapse, and it is the villain of the next two chapters. Negatives are what hold the space open. The whole frontier of methods like BYOL and DINO is about avoiding collapse without negatives.

Common misconception. “Lower loss always means better representations.” In “positives only” mode the alignment loss goes to zero — a beautiful-looking number — while the representation is totally collapsed and worthless. A low loss achieved by collapse is the classic trap. You must watch uniformity too, not just the loss.
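The two diagnostics from this chapter can be sketched for points on the unit circle. The definitions here are assumptions made for illustration: alignment is the mean squared distance between the two views of each image, and the uniformity score is the negative log of the mean Gaussian potential over all distinct pairs of points, with the sign chosen so that higher is better, matching the convention in this chapter. All function names and the jitter value are hypothetical.

```python
import math

def to_point(angle):
    """Point on the unit circle at the given angle."""
    return (math.cos(angle), math.sin(angle))

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def alignment(pairs):
    """Mean squared distance between the two views of each image (lower is better)."""
    return sum(sq_dist(a, b) for a, b in pairs) / len(pairs)

def uniformity_score(points, t=2.0):
    """Negative log of the mean Gaussian potential over distinct pairs (higher is better)."""
    potentials = [math.exp(-t * sq_dist(points[i], points[j]))
                  for i in range(len(points)) for j in range(i + 1, len(points))]
    return math.log(len(potentials) / sum(potentials))

def make_pairs(angles, jitter):
    """Two views per image, placed slightly either side of the image's angle."""
    return [(to_point(a - jitter), to_point(a + jitter)) for a in angles]

def flatten(pairs):
    return [p for pair in pairs for p in pair]

n_images = 6
healthy = make_pairs([2 * math.pi * k / n_images for k in range(n_images)], jitter=0.02)
collapsed = make_pairs([0.0] * n_images, jitter=0.0)

for name, pairs in (("healthy", healthy), ("collapsed", collapsed)):
    print(f"{name}: alignment = {alignment(pairs):.4f}, "
          f"uniformity = {uniformity_score(flatten(pairs)):.4f}")
```

The collapsed configuration reaches perfect alignment while its uniformity score sits at its minimum, which is why alignment alone cannot certify a good representation.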

No quiz here — the simulator is the test. If you can explain why “positives only” collapses and “with negatives” doesn't, you understand the core of contrastive learning.

Chapter 7: BYOL — Learning Without Negatives

We just watched, in the simulator, what happens when you remove negatives: everything collapses to a point. Negatives are the repulsive force that holds the embedding space open. So this next result should feel impossible. BYOL (Bootstrap Your Own Latent) learns excellent representations with no negatives at all — only positive pairs — and somehow does not collapse. How?

The asymmetry that saves it

BYOL uses two networks again, like MoCo: an online network (trained by gradient) and a target network (an EMA of the online one). Both encode a view. The task: the online network must predict the target network's embedding of the other view. That's it — just match your partner's representation. No pushing anything away.

Naively, this should collapse instantly: the trivial solution is for both networks to output the same constant for every image, and the prediction is perfect. BYOL avoids this with two ingredients working together:

A predictor — a small extra network on the online side only. This makes the two branches asymmetric: the online side has to actively transform its output to match the target, rather than both sides lazily agreeing.
Stop-gradient — no gradient flows through the target network. The online network chases a target it cannot directly move. The target only changes slowly, via the EMA. So the online network is always chasing a slightly-behind, frozen-for-this-step version of itself.

Why it doesn't collapse (the intuition). Collapse requires both networks to agree on a constant. But the target is a stop-gradient EMA — the online network can't drag it to a constant directly; it can only chase it. And the predictor breaks the symmetry so “just output the mean” is no longer the easy optimum. The online network is forced to actually predict structure, and the EMA target slowly absorbs that structure, creating a moving target that keeps the representation alive. Remove the predictor or the stop-gradient, and BYOL collapses immediately — both are load-bearing.

Concept + realization: where the stop-gradient sits

python
import torch
import torch.nn.functional as F

# online: encoder f → projector g → predictor q   (all trained)
# target: encoder f' → projector g'                (EMA, no gradient)
def byol_loss(v1, v2, online, target):
    p1 = online.predict(online.project(online.encode(v1)))   # online prediction from view 1
    p2 = online.predict(online.project(online.encode(v2)))   # online prediction from view 2
    with torch.no_grad():                                # ← STOP-GRADIENT
        t1 = target.project(target.encode(v1))             # target, frozen this step
        t2 = target.project(target.encode(v2))
    # L2-normalize every vector so only direction matters
    p1, p2, t1, t2 = (F.normalize(x, dim=1) for x in (p1, p2, t1, t2))
    # online predicts the OTHER view's target representation
    return F.mse_loss(p1, t2) + F.mse_loss(p2, t1)

# after each step: target ← m·target + (1-m)·online   (EMA, the moving goalpost)

The torch.no_grad() around the target is the entire trick. It means the loss can only be reduced by changing the online network to match the target — never by changing the target to be easy to match. That one asymmetry, plus the predictor, is the difference between rich features and a collapsed constant.
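A side note on the loss itself: for unit-length vectors, the summed squared error equals 2 minus twice the cosine similarity, so the normalized error in the code above is a cosine-similarity objective up to a constant scale. A minimal check in plain Python, with hypothetical vectors and helper names:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_error(a, b):
    """Summed (not averaged) squared difference between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

p = [0.3, -1.2, 0.5]   # hypothetical online prediction
t = [0.1, -0.9, 0.7]   # hypothetical target projection

lhs = squared_error(normalize(p), normalize(t))
rhs = 2.0 - 2.0 * cosine(p, t)
print(abs(lhs - rhs) < 1e-12)  # True
```

Because both sides depend only on direction, rescaling either vector leaves the loss unchanged, so the online network cannot lower it merely by shrinking or inflating its outputs.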

See it: stop-gradient on vs. off

The diagram shows BYOL's two branches. Toggle stop-gradient, then press Train. With stop-gradient ON, the representation stays diverse and healthy. Turn it OFF (let gradients flow into the target), and watch the diversity bar crash to zero — instant collapse, exactly the failure BYOL was designed to dodge.

BYOL: The Stop-Gradient Is Load-Bearing

Online branch (with predictor) chases the target branch (EMA). Toggle stop-gradient and Train. The diversity bar shows whether the representation stays alive or collapses.

Common misconception. “BYOL proves negatives are unnecessary, so MoCo and SimCLR are obsolete.” Not quite — BYOL replaces the explicit repulsion of negatives with an implicit mechanism (asymmetry + stop-gradient + EMA) that prevents collapse. Something still has to stop collapse; BYOL just hides it cleverly. Later work (SimSiam) showed you can even drop the EMA and keep only the stop-gradient and predictor — pinpointing stop-gradient as the true essential ingredient.

BYOL has no negatives. What stops it from collapsing to a constant output?

A very high temperature
The asymmetry of a predictor on the online branch plus a stop-gradient on the EMA target: the online net chases a target it can't directly collapse
A larger batch size

Chapter 8: DINO — Self-Distillation & the Two Collapses

BYOL showed you can avoid collapse without negatives using asymmetry and stop-gradient. DINO (self-distillation with no labels) takes a different, equally clever route — and it produced one of the most striking results in self-supervised vision: a Vision Transformer trained with DINO spontaneously learns to segment objects, its attention maps lighting up on the foreground, with no segmentation labels ever.

The student and teacher

DINO frames the pretext task as self-distillation. A student network and a teacher network (an EMA of the student, just like before) each see different views of the image. Each network outputs a probability distribution over a set of K abstract “prototype” dimensions — think of them as learned cluster slots. The student is trained to match the teacher's distribution for the same image, with a stop-gradient on the teacher. Match your teacher's soft assignment over the clusters. No negatives, no contrastive loss — just distribution matching.

But distribution matching alone collapses, and in two distinct ways. DINO needs one fix for each, and the interplay is the heart of the method.

The two collapses and their two fixes

Collapse to uniform: the easy cheat is for both networks to output a flat distribution (every cluster equally likely) for every image. The fix is sharpening: the teacher uses a low temperature, which makes its distribution peaky. A peaky teacher target won't let the student settle into flatness.
Collapse to one dimension: the opposite cheat is for everything to pile into a single cluster — one prototype wins for every image. The fix is centering: the teacher subtracts a running average of its own outputs before the softmax, which cancels any dimension that is consistently dominating, spreading usage across clusters.

Centering and sharpening are opposites, balanced on purpose. Sharpening pushes toward a confident peak (avoiding uniform collapse). Centering pushes away from any always-winning dimension (avoiding one-cluster collapse). One says “be decisive,” the other says “but don't always pick the same thing.” Apply both to the teacher and they balance into a healthy target: confident per-image, but diverse across the dataset. Drop either one and DINO collapses in the corresponding direction.
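The tug of war can be checked with a few lines of arithmetic. The sketch below is plain Python under stated assumptions: the logits are made-up numbers for 3 images over K=4 prototypes, dimension 0 carries a shared bias, and the center is simply the batch mean rather than a running average. It compares a teacher with no sharpening, with no centering, and with both.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; subtracts the max for numerical stability."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats: log(K) means uniform, 0 means one-hot."""
    return -sum(q * math.log(q) for q in p if q > 0)

def usage(dists):
    """Average distribution over the batch: how much each cluster is used."""
    k = len(dists[0])
    return [sum(d[j] for d in dists) / len(dists) for j in range(k)]

def winners(dists):
    """Index of the most likely cluster for each image."""
    return [d.index(max(d)) for d in dists]

# Hypothetical teacher logits: dimension 0 has a +2 bias shared by every image.
logits = [
    [2.1, 0.5, 0.0, 0.0],
    [2.0, 0.0, 0.6, 0.0],
    [2.0, 0.0, 0.0, 0.7],
]
# Assumption: use the batch mean as the center (DINO keeps a running mean).
center = [sum(row[j] for row in logits) / len(logits) for j in range(4)]
centered = [[x - c for x, c in zip(row, center)] for row in logits]

flat = [softmax(row, tau=10.0) for row in centered]        # centering only
one_cluster = [softmax(row, tau=0.04) for row in logits]   # sharpening only
balanced = [softmax(row, tau=0.04) for row in centered]    # both fixes

print("centering only  entropy:", round(entropy(flat[0]), 3), "of max", round(math.log(4), 3))
print("sharpening only winners:", winners(one_cluster))
print("both            winners:", winners(balanced))
print("both            usage:  ", [round(u, 3) for u in usage(balanced)])
```

With only centering the output is nearly flat, with only sharpening every image picks the biased cluster, and with both each image gets a confident but different winner.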

Concept + realization: the teacher's output

python
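import torch
import torch.nn.functional as F

# student & teacher output K-dim logits over learned prototypes, shape (batch, K)
def teacher_dist(logits, center, tau_t=0.04):
    # center: running mean of teacher outputs (anti one-cluster collapse)
    # tau_t small = sharpening (anti uniform collapse)
    return F.softmax((logits - center) / tau_t, dim=-1)        # peaky AND de-biased

def student_log_dist(logits, tau_s=0.1):
    return F.log_softmax(logits / tau_s, dim=-1)               # softer than teacher, in log space

def dino_loss(s_logits, t_logits, center):
    # cross-entropy between teacher and student; detach() is the stop-gradient
    target = teacher_dist(t_logits, center).detach()
    return -(target * student_log_dist(s_logits)).sum(dim=-1).mean()

# after each step:
@torch.no_grad()
def update_center(center, t_logits, rate=0.1):
    return (1 - rate) * center + rate * t_logits.mean(dim=0)   # EMA of teacher outputs

@torch.no_grad()
def update_teacher(teacher, student, m):
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)                     # EMA weights, stop-grad target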

The teacher's temperature is lower than the student's — the teacher is the “sharper, more confident” one, and the student is pulled toward it. The center term, updated as a running mean of teacher outputs, is subtracted every step to keep any single prototype from running away with all the assignments. Two scalars (a temperature and a centering EMA) stand between DINO and collapse.
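To see the running-mean center do its job, here is a small simulation in plain Python. Everything in it is an assumption made for illustration: the teacher logits are Gaussian noise plus a fixed bias of 3.0 on prototype 0, the batch size is 32, and the EMA rate is 0.1. It tracks the center over 200 steps, then measures how often prototype 0 wins on fresh data with and without centering.

```python
import random

def update_center(center, batch_logits, momentum=0.9):
    """EMA of the per-dimension batch mean of teacher logits."""
    k = len(center)
    n = len(batch_logits)
    batch_mean = [sum(row[j] for row in batch_logits) / n for j in range(k)]
    return [momentum * c + (1 - momentum) * b for c, b in zip(center, batch_mean)]

def sample_batch(bias, size):
    """Hypothetical teacher logits: unit Gaussian noise plus a fixed bias."""
    return [[b + random.gauss(0.0, 1.0) for b in bias] for _ in range(size)]

def win_rate(rows, dim=0):
    """Fraction of rows whose largest logit sits in `dim`."""
    return sum(1 for row in rows if row.index(max(row)) == dim) / len(rows)

random.seed(0)
K = 4
bias = [3.0, 0.0, 0.0, 0.0]   # prototype 0 is favored for every image
center = [0.0] * K

for step in range(200):
    center = update_center(center, sample_batch(bias, 32))

fresh = sample_batch(bias, 1000)
raw_rate = win_rate(fresh)
centered_rate = win_rate([[x - c for x, c in zip(row, center)] for row in fresh])

print("learned center:", [round(c, 2) for c in center])
print("prototype 0 win rate without centering:", raw_rate)
print("prototype 0 win rate with centering:   ", centered_rate)
```

The center converges to the shared bias, so subtracting it removes the head start of prototype 0 and leaves only the per-image signal to decide the winner.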

See it: balancing the two knobs

The bars show the teacher's output distribution over prototype clusters. Slide sharpening and centering. Turn centering off and watch one cluster swallow everything (one-dimension collapse). Turn sharpening off (high temperature) and watch it flatten to uniform (uniform collapse). Find the balanced regime: a confident peak that isn't always the same cluster.

DINO: Centering vs. Sharpening

Bars = teacher distribution over prototype clusters. Centering fights one-cluster collapse; sharpening fights uniform collapse. Balance them.

Common misconception. “Sharpening and centering are both just regularizers doing the same thing.” They pull in opposite directions, and that's the point. Use only sharpening and you collapse to one cluster; use only centering and you collapse to uniform. Their balance — not either alone — is what keeps the distribution healthy. It's a tug of war engineered to have a stable middle.

In DINO, what are centering and sharpening each preventing?

Both prevent overfitting to the training set
Centering prevents uniform collapse; sharpening prevents one-cluster collapse
Sharpening prevents uniform collapse (forces a confident peak); centering prevents one-cluster collapse (cancels any always-dominant cluster)

Chapter 9: Connections & Cheat Sheet

You now understand the whole family: the pretext task of matching two views, the InfoNCE loss that powers it, the temperature that tunes hard-negative focus, the projection head you train and throw away, MoCo's queue and momentum encoder, and the two negative-free methods — BYOL and DINO — that dodge collapse with clever asymmetries. The single thread running through all of it: arrange the embedding space so that meaning-preserving changes leave the representation unchanged, while keeping the space open.

The methods, side by side

Method	Negatives?	How it gets them	Anti-collapse trick
SimCLR	Yes	large batch (all other images)	negatives (explicit repulsion)
MoCo	Yes	queue + momentum key encoder	negatives, consistent via EMA
BYOL	No	—	predictor + stop-grad + EMA target
SimSiam	No	—	predictor + stop-grad (no EMA needed)
DINO	No	—	centering + sharpening on EMA teacher

The cheat sheet

Positive pair: two augmented views of the SAME image

InfoNCE: softmax over similarities; loss = −log(probability on the positive)

Temperature τ: low = sharp = fixate on hard negatives; high = flat = weak gradient

Projection head: train on z = g(h); keep h; discard g at inference

MoCo: queue of negatives (no grad) + key encoder = slow EMA of query encoder

BYOL: online (with predictor) chases stop-grad EMA target; no negatives

DINO: student matches sharpened + centered EMA-teacher distribution

Diagnose: want a low alignment loss (positive pairs sit close together) AND high uniformity (embeddings spread out over the space)
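The last line of the cheat sheet can be turned into two numbers. The sketch below assumes the commonly used definitions from Wang and Isola: alignment is the mean squared distance between normalized positive pairs, and the uniformity loss is the log of the mean of exp(-t * squared distance) over all pairs, with t = 2. Both are losses, so lower is better, and a lower uniformity loss means more spread. The toy 2-D embeddings are made up for illustration.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(views_a, views_b):
    """Mean squared distance between positive pairs (lower = tighter pairs)."""
    return sum(sq_dist(a, b) for a, b in zip(views_a, views_b)) / len(views_a)

def uniformity(embs, t=2.0):
    """log mean exp(-t * squared distance) over all pairs (lower = more spread)."""
    pairs = list(combinations(embs, 2))
    return math.log(sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs) / len(pairs))

# Healthy toy encoder: 4 images spread around the circle, two nearby views each.
spread_a = [normalize(v) for v in [[1, 0.05], [0.05, 1], [-1, 0.05], [0.05, -1]]]
spread_b = [normalize(v) for v in [[1, -0.05], [-0.05, 1], [-1, -0.05], [-0.05, -1]]]

# Collapsed toy encoder: every view maps to the same point.
collapsed = [normalize([1.0, 0.0]) for _ in range(4)]

spread_alignment = alignment(spread_a, spread_b)
spread_uniformity = uniformity(spread_a)
collapsed_alignment = alignment(collapsed, collapsed)
collapsed_uniformity = uniformity(collapsed)

print(f"healthy:   alignment={spread_alignment:.4f}  uniformity={spread_uniformity:.3f}")
print(f"collapsed: alignment={collapsed_alignment:.4f}  uniformity={collapsed_uniformity:.3f}")
```

A collapsed encoder scores perfectly on alignment and as badly as possible on uniformity, which is why both numbers need to be watched together.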

A decision guide

Limited hardware (small batches)?

Yes → MoCo (queue) or a negative-free method (BYOL/DINO/SimSiam).

↓

Want the simplest negative-free recipe?

Yes → SimSiam: just predictor + stop-gradient, no EMA, no queue.

↓

Using a Vision Transformer / want emergent segmentation?

Yes → DINO: its attention maps localize objects for free.

↓

Huge batch budget and want simplicity?

Yes → SimCLR: conceptually cleanest, just needs the negatives.

Where this connects

Contrastive CLIP — the multimodal sibling: instead of two views of one image, the positive pair is an image and its caption. Same InfoNCE, applied across modalities, unlocking zero-shot classification.
Loss Functions — InfoNCE, NT-Xent, and the triplet loss all live here; this lesson is the deep dive on the contrastive family.
Curriculum Learning — hard-negative mining is anti-curriculum applied to the negatives; temperature is a soft version of the same hard-negative emphasis.
Vector Embeddings & Similarity Metrics — the embedding space and cosine similarity that contrastive learning shapes.
Vision Transformers — the backbone DINO made famous for emergent object segmentation.
Data Augmentation — the augmentation pipeline that manufactures every positive pair.
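The CLIP connection in the list above is the easiest one to sketch, because the loss barely changes. Below is a minimal plain-Python version of a symmetric InfoNCE over an image-caption similarity matrix, with matching pairs on the diagonal. The function names, the toy similarity values, and the temperature of 0.07 are assumptions for illustration, not CLIP's actual code.

```python
import math

def cross_entropy_row(logits, target_index):
    """-log softmax(logits)[target_index], computed stably via log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]

def symmetric_info_nce(sim, tau=0.07):
    """sim[i][j] = cosine similarity of image i and caption j; positives on the diagonal."""
    n = len(sim)
    scaled = [[s / tau for s in row] for row in sim]
    # Image -> text: each row should put its probability on its own caption.
    image_to_text = sum(cross_entropy_row(scaled[i], i) for i in range(n)) / n
    # Text -> image: each column should put its probability on its own image.
    columns = [[scaled[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

# Toy similarities: in `matched` the diagonal dominates; in `shuffled`
# the first two captions are swapped, so the diagonal is wrong for them.
matched = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]

matched_loss = symmetric_info_nce(matched)
shuffled_loss = symmetric_info_nce(shuffled)

print(f"matched pairs:  loss={matched_loss:.4f}")
print(f"shuffled pairs: loss={shuffled_loss:.4f}")
```

The only change from the single-modality case is where the positive comes from: the other modality rather than a second augmentation.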

The one thing to remember. Contrastive learning is a clever bargain: it trades the need for human labels for the need to prevent collapse. Every method in this lesson is a different answer to one question — “how do I pull positives together without everything piling into a single point?” Negatives push apart; momentum encoders keep negatives consistent; predictors and stop-gradients break symmetry; centering and sharpening balance the distribution. Master collapse, and you've mastered self-supervised learning.

A colleague trains a negative-free method, drops the stop-gradient “to let more gradient flow,” and the model's accuracy crashes to chance. What happened?

The learning rate was too low
The batch size was too small for negatives
Without the stop-gradient, nothing prevents representational collapse: both branches agree on a constant output, so all images map to the same point

“What I cannot create, I do not understand.” — and a network that can recreate which two pictures are secretly the same has begun, without a single label, to understand what it sees.