Stanford CS 231n · Lecture 12 · Self-Supervised Learning

Learning to See Without Labels

What if you could learn powerful visual representations from billions of unlabeled images — by inventing your own supervision signal from the data itself?

Pretext Tasks · Contrastive Learning · SimCLR & MoCo · MAE & DINO

Chapter 01

The Labeling Bottleneck

Supervised learning has a dirty secret. Those stunning ImageNet results — ResNets, ViTs, everything that kicked off the deep learning revolution — they all depend on 1.28 million hand-labeled images. Every single image was looked at by a human who decided "this is a golden retriever" or "this is a school bus." That took years and cost millions of dollars.

Now consider this: the internet has billions of unlabeled images. Instagram alone gets 100 million photos uploaded per day. YouTube accumulates 500 hours of video every minute. The labeled fraction of all visual data on Earth is a rounding error.

The Scale Argument

ImageNet: 1.28 million labeled images, the product of years of annotation effort. The internet: billions of unlabeled images, growing every second. If we could learn from unlabeled data, we'd have access to 1000× more training signal. That's not a small improvement — it's a paradigm shift.

But the problem goes deeper than scale. Labels are also biased. ImageNet's 1,000 categories reflect choices made by researchers in 2009. There's no "tired face" class, no "cracked sidewalk" class, no "half-eaten sandwich" class. Yet these are exactly the kinds of things a truly general visual system needs to understand. Labels impose a ceiling on what features your network can learn.

And labels are expensive. Medical imaging? You need board-certified radiologists. Satellite imagery? You need domain experts. Self-driving? Thousands of hours of tedious bounding box annotation. For every new domain, you start from scratch.

The Self-Supervised Framework

Self-supervised learning (SSL) sidesteps the labeling bottleneck entirely. The idea: design a task where the supervision signal comes from the data itself — no human labels needed. The network learns useful representations as a byproduct of solving this artificial task.

Definition
Self-Supervised Learning

A learning paradigm where the model creates its own supervision signal from unlabeled data. The model solves a pretext task (a puzzle constructed from the data's structure), and the representations learned for that task transfer to downstream tasks like classification, detection, or segmentation.

The framework has two stages:

The Self-Supervised Pipeline
  1. Pretrain: Train an encoder on a pretext task using unlabeled data. The encoder learns a feature extractor f(·) that maps images to representations.
  2. Transfer: Freeze the encoder (or fine-tune it) and train a small head on top for the actual downstream task (classification, detection, segmentation) using a small labeled dataset.

How Do We Know It Worked?

If we don't have labels during pretraining, how do we evaluate the quality of the learned representations? Four standard protocols:

Protocol | How It Works | What It Tests
Linear probe | Freeze encoder, train a single linear layer on top | Feature quality: are the representations linearly separable?
Fine-tuning | Unfreeze encoder, train end-to-end with small LR | Transfer quality: how well does the network adapt?
k-NN evaluation | Extract features, classify test images by nearest neighbors | Cluster quality: do similar things end up close?
t-SNE / UMAP | Visualize feature space in 2D | Qualitative: do clusters match semantic categories?

The linear probe is the gold standard. If you can slap a single linear layer on frozen features and get 75%+ accuracy on ImageNet, your encoder has learned something genuinely useful — the features are already organized by semantic meaning without ever seeing a label.

Linear Probe vs Fine-Tuning Gap

A method with great fine-tuning but poor linear probe performance might just be learning easily adaptable features, not inherently good features. The linear probe is stricter: it demands that the representation already separates classes. This distinction matters when comparing SSL methods.
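
To make the linear probe protocol concrete, here is a minimal sketch in plain Python. It assumes a hypothetical frozen encoder has already produced 2-D feature vectors (the toy `features` and `labels` below are made up for illustration), and it trains only a single logistic-regression layer on top. The function names are illustrative, not from any library.

```python
import math

# Toy stand-in for frozen encoder outputs (hypothetical data, not from a real model).
features = [(-2.0, -1.0), (-1.5, -2.0), (-1.0, -1.5), (2.0, 1.0), (1.5, 2.0), (1.0, 1.5)]
labels = [0, 0, 0, 1, 1, 1]

def train_linear_probe(feats, labs, lr=0.5, steps=200):
    """Fit one linear layer (weights + bias) with logistic loss; features stay fixed."""
    w, b = [0.0] * len(feats[0]), 0.0
    n = len(feats)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(feats, labs):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi / n
            grad_b += (p - y) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def probe_accuracy(feats, labs, w, b):
    """Fraction of examples whose sign of (w . x + b) matches the label."""
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in feats]
    return sum(int(p == y) for p, y in zip(preds, labs)) / len(labs)

w, b = train_linear_probe(features, labels)
probe_acc = probe_accuracy(features, labels, w, b)
```

Because the toy features are already linearly separable, the probe classifies them perfectly. If the features were entangled, no single linear layer could fix that, which is exactly what the protocol is designed to reveal.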

Chapter 02

Pretext Tasks — Learning by Solving Puzzles

The earliest self-supervised methods worked by a beautifully simple trick: take an image, corrupt or transform it in some known way, and ask the network to predict what you did. The network can't solve these puzzles without understanding the image's content — so understanding is forced as a side effect.

Rotation Prediction (Gidaris et al., 2018)

Take an image. Rotate it by one of four angles: 0°, 90°, 180°, or 270°. Ask the network: which rotation was applied? This is a 4-way classification problem — no labels needed, because you chose the rotation.

Why does this force semantic understanding? To distinguish 0° from 180°, the network must know which way is "up." It needs to understand that trees grow upward, that skies are at the top, that text reads left-to-right. These are semantic features — exactly the kind we want for downstream tasks.
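
As a rough illustration of how the labels come for free, the sketch below builds one rotation-prediction training example from a tiny 2-D grid standing in for an image. The helper names (`rotate90`, `make_rotation_example`) are hypothetical and chosen only for this example.

```python
import random

def rotate90(image):
    """Rotate a 2-D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def make_rotation_example(image, rng):
    """Return (rotated_image, label), where label k in {0, 1, 2, 3} means k * 90 degrees."""
    label = rng.randrange(4)
    rotated = image
    for _ in range(label):
        rotated = rotate90(rotated)
    return rotated, label

image = [[1, 2],
         [3, 4]]
rotated, label = make_rotation_example(image, random.Random(0))
```

The label is known because we chose the rotation ourselves; the network only ever sees `rotated` and has to recover `label`.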

Relative Patch Location (Doersch et al., 2015)

Extract two patches from an image: a center patch and one of its 8 neighbors. Ask the network: where is the second patch relative to the first? Top-left? Bottom-right? This is an 8-way classification problem.

To solve it, the network must learn spatial relationships between object parts. It learns that eyes are above noses, that wheels are below car bodies, that windows are on walls. These part-relationship features are exactly what object detectors need.

Jigsaw Puzzles (Noroozi & Favaro, 2016)

Split an image into a 3×3 grid of patches. Shuffle them. Ask the network to predict which permutation was applied. With 9 patches, there are 9! = 362,880 possible permutations — too many for classification. So in practice, you select a subset of ~100 maximally distinct permutations and classify among those.
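
The sketch below checks the permutation count and shows one plausible way to pick a small set of well-separated permutations. The greedy max-min Hamming-distance selection over a random candidate pool is an assumption made for illustration, not necessarily the procedure used in the paper, and the pool size and subset size (10 instead of ~100) are kept small so it runs quickly.

```python
import math
import random

def hamming(p, q):
    """Number of positions where two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def select_permutations(num_select, pool_size=500, num_patches=9, seed=0):
    """Greedily pick permutations that are far (in Hamming distance) from those already chosen."""
    rng = random.Random(seed)
    pool = set()
    while len(pool) < pool_size:
        perm = list(range(num_patches))
        rng.shuffle(perm)
        pool.add(tuple(perm))
    pool = sorted(pool)
    chosen = [pool.pop(0)]
    while len(chosen) < num_select:
        best = max(pool, key=lambda p: min(hamming(p, c) for c in chosen))
        pool.remove(best)
        chosen.append(best)
    return chosen

def shuffle_patches(patches, perm):
    """Reorder the 9 patches; the index of perm in the chosen set is the class label."""
    return [patches[i] for i in perm]

total_permutations = math.factorial(9)
permutation_set = select_permutations(num_select=10)
shuffled = shuffle_patches(list("abcdefghi"), permutation_set[3])
```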

Inpainting (Pathak et al., 2016)

Mask out the center region of an image. Ask the network to reconstruct the missing pixels. The loss combines L2 reconstruction (mean squared error on pixels) with an adversarial loss (a discriminator tries to distinguish real from generated patches). The adversarial loss encourages sharp, realistic completions rather than blurry averages.

To fill in a missing face region, the network must understand facial structure. To complete a missing building corner, it must understand architectural geometry. Inpainting forces holistic scene understanding.

Colorization (Zhang et al., 2016)

Convert a color image to grayscale. Ask the network to predict the original colors. This is framed as classification (quantize the color space into ~313 bins) rather than regression, because color is inherently ambiguous — a car could be red or blue, but the network must still understand it's a car to color it plausibly.

[Interactive figure: Pretext Tasks Gallery. Each pretext task creates a different supervision signal from the same unlabeled image.]

Do These Features Actually Transfer?

Surprisingly well. Rotation prediction on ImageNet, transferred to Pascal VOC detection, achieves ~73% of the performance of supervised pretraining (~54% vs ~74% mAP). That's without seeing a single label during pretraining.

Method | Pretext Task | VOC Detection mAP | % of Supervised
Random init | None | ~40% | 54%
Rotation | Predict 0/90/180/270 | ~54% | 73%
Jigsaw | Predict patch order | ~56% | 76%
Colorization | Predict colors | ~52% | 70%
Supervised | ImageNet labels | ~74% | 100%

Shortcut Solutions

Networks are lazy — they'll find the easiest way to solve the pretext task, even if it doesn't require semantic understanding. For jigsaw puzzles, the network can match edges rather than understanding content. For rotation, it can detect JPEG artifacts that differ by rotation. For relative patches, chromatic aberration (color fringing near image edges) leaks absolute position information. Designing pretext tasks that resist shortcuts is a major challenge.

Why Puzzles Force Semantics

The key insight: solving these tasks well in general requires high-level understanding, but a network can often solve them cheaply with low-level tricks. The art of pretext task design is closing that gap: removing shortcuts so the network is forced into semantic understanding. This is a large part of why contrastive methods eventually won: they pose the "puzzle" in feature space, where pixel-level shortcuts are much harder to exploit.

Chapter 03

Masked Autoencoders — BERT for Vision

In NLP, BERT revolutionized representation learning with a simple idea: mask out some words in a sentence, predict the missing ones. The model must understand context, grammar, and semantics to fill in the blanks. Masked Autoencoders (MAE, He et al., 2022) bring this idea to images — but with a critical twist.

Why Not Just Mask 15% Like BERT?

In text, each token carries dense semantic information. Masking 15% of words creates a challenging prediction task. But images are spatially redundant: neighboring patches look almost identical. If you mask 15% of an image, the network can reconstruct the missing patches by interpolating from their neighbors — no semantic understanding needed, just texture extrapolation.

MAE's solution: mask 75% of patches. At this masking ratio, there simply aren't enough visible neighbors to interpolate from. The network is forced to understand global scene structure to fill in the gaps. What object is this? What's the scene layout? Where should the edges be?

The Masking Ratio Insight

Images need 5× higher masking than text (75% vs 15%) because images have much higher spatial redundancy. A pixel's value is highly predictable from its neighbors: knowing the pixel at (100, 200) tells you a lot about the pixel at (101, 200). Words in a sentence are not nearly as redundant: "The cat sat on the ___" narrows down the missing word but does not determine it, so even a small masking ratio yields a hard prediction task.

The Architecture

MAE uses a Vision Transformer (ViT) with an asymmetric encoder-decoder design:

MAE Architecture
  1. Patch & Mask: Divide the image into non-overlapping patches (16×16). Randomly select 75% to mask. Keep only the 25% visible patches.
  2. Encode: Feed only the visible patches through a large ViT encoder. This is the key efficiency trick — the encoder processes 4× fewer tokens than a full image.
  3. Decode: Combine encoded visible patches with learnable mask tokens (placeholder vectors at masked positions). Feed through a small, lightweight transformer decoder.
  4. Reconstruct: The decoder outputs pixel predictions for the masked patches only. Loss = MSE between predicted and actual pixels at masked positions.
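
The patch-and-mask step above can be sketched as follows. This is a minimal illustration that works on patch indices only; shuffling the indices and keeping a prefix is an assumption about one simple way to sample the mask, and `random_masking` is a hypothetical helper name.

```python
import random

def random_masking(num_patches, mask_ratio, rng):
    """Split patch indices into (visible, masked) by shuffling and keeping a prefix."""
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_visible = int(num_patches * (1 - mask_ratio))
    visible = sorted(indices[:num_visible])
    masked = sorted(indices[num_visible:])
    return visible, masked

num_patches = (224 // 16) ** 2          # 14 x 14 grid of 16x16 patches
visible, masked = random_masking(num_patches, mask_ratio=0.75, rng=random.Random(0))
```

Only the `visible` indices would be fed to the encoder; the `masked` indices are where the decoder places mask tokens and where the loss is computed.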

The asymmetric design is crucial for efficiency. The heavy encoder (e.g., ViT-Large with 24 layers) only processes 25% of patches. The lightweight decoder (e.g., 8 layers) handles the full set. During pretraining, this gives a 3× speedup and uses less memory. At transfer time, you throw away the decoder entirely — only the encoder matters.

MAE Loss: $\mathcal{L}_{\text{MAE}} = \sum_{i \in \text{masked}} \lVert \hat{x}_i - x_i \rVert^2$

Sum MSE over masked patches only. Visible patches have no loss — they're inputs, not targets.
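
A small sketch of that loss, assuming each patch is flattened into a list of pixel values (the toy patches below are invented for illustration):

```python
def mae_loss(pred_patches, true_patches, masked_indices):
    """Sum of squared pixel errors over masked patches only."""
    total = 0.0
    for i in masked_indices:
        total += sum((p - t) ** 2 for p, t in zip(pred_patches[i], true_patches[i]))
    return total

true_patches = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
# Patch 0 is visible, so its (deliberately large) error is ignored by the loss.
pred_patches = [[9.0, 9.0], [1.0, 0.0], [0.0, 0.5]]
loss = mae_loss(pred_patches, true_patches, masked_indices=[1, 2])
```

Here the loss is 1.0 from patch 1 plus 0.25 from patch 2; the badly predicted visible patch contributes nothing.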

Results & Key Properties

MAE achieves 86.9% top-1 on ImageNet with ViT-Huge (fine-tuned). Competitive with the best contrastive methods, and remarkably simpler — no negative pairs, no momentum encoders, no special augmentations.

But there's a catch: MAE's linear probe performance lags behind contrastive methods. It gets 75.8% with linear probe vs 78.2% for DINO. This suggests MAE learns features that are powerful but need fine-tuning to unlock — they're not as "ready to use" as contrastive features.

Connection to Denoising Autoencoders

MAE is a modern instantiation of a decades-old idea: denoising autoencoders (Vincent et al., 2008). Corrupt the input, reconstruct the original. Masking is a specific corruption strategy. What changed is the architecture (transformers instead of shallow networks) and the masking ratio (75% instead of 30%). The principle is the same: learning robust representations by reconstruction.

Worked Example — MAE Efficiency

A ViT-Large encoder with 24 layers. Image: 224×224 with 16×16 patches, so (224/16)² = 196 patches. Full processing: 24 layers × 196 tokens. MAE with 75% masking: 24 layers × 49 tokens (visible only). That's 4× fewer tokens in the encoder, and at least 4× fewer encoder FLOPs, since self-attention cost grows quadratically with token count. The decoder is only 8 layers processing 196 tokens, much cheaper than a full encoder pass. Total savings: roughly 3× wall-clock speedup.
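
The token arithmetic in this example can be checked in a few lines. The "layers × tokens" product is only a crude proxy for compute (it ignores layer width and the quadratic attention term), so treat it as a sanity check rather than a FLOP count.

```python
image_size, patch_size, mask_ratio = 224, 16, 0.75
encoder_layers = 24

num_patches = (image_size // patch_size) ** 2
num_visible = int(num_patches * (1 - mask_ratio))

# Crude cost proxy: number of (layer, token) pairs the encoder has to process.
full_encoder_cost = encoder_layers * num_patches
mae_encoder_cost = encoder_layers * num_visible
token_reduction = full_encoder_cost / mae_encoder_cost
```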

Chapter 04

Contrastive Learning — The Core Idea

Pretext tasks design puzzles in pixel space: predict rotations, fill in pixels, guess colors. But we don't actually care about pixels — we care about representations. What if we designed the learning signal directly in representation space?

That's contrastive learning. The intuition is almost embarrassingly simple:

The Contrastive Principle

Pull representations of similar things together. Push representations of different things apart. Do this with enough examples, and the embedding space organizes itself into semantically meaningful clusters — without ever seeing a label.

Positive and Negative Pairs

Every contrastive method needs to define what "similar" and "different" mean. The standard approach:

Positive pair: Two different augmentations of the same image. Crop it differently, change the colors, blur it — it's still the same scene, so the representations should be close.

Negative pair: Augmentations of different images. A photo of a cat and a photo of a truck should have distant representations, regardless of how you crop or color them.

Definition
Score Function s(x, y)

A function that measures how "close" two representations are in embedding space. Usually cosine similarity: $s(x, y) = (x \cdot y) / (\lVert x \rVert \, \lVert y \rVert)$. Ranges from −1 (opposite directions) to +1 (identical directions). We want $s(x, x^+)$ to be high and $s(x, x^-)$ to be low, where $x^+$ is a positive and $x^-$ is a negative for the anchor $x$.
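
For reference, cosine similarity takes only a few lines of plain Python (a minimal sketch; real code would typically use a vectorized library and guard against zero-length vectors):

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
opposite = cosine_similarity([1.0, 0.0], [-3.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 5.0])
```

Note that the score depends only on direction, not on vector length: scaling either input leaves it unchanged.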

The InfoNCE Loss

How do we turn "pull positives, push negatives" into a differentiable loss? The InfoNCE loss (Oord et al., 2018) frames it as an N-way classification problem. Given an anchor $x$, one positive $x^+$, and $N-1$ negatives $\{x_1, \dots, x_{N-1}\}$, can we identify which one is the positive?

InfoNCE Loss: $\mathcal{L}_{\text{InfoNCE}} = -\log \dfrac{\exp(s(x, x^+) / \tau)}{\exp(s(x, x^+) / \tau) + \sum_{n=1}^{N-1} \exp(s(x, x_n) / \tau)}$

Let's break this down:

Numerator: exp(s(x, x+) / τ) — the exponentiated similarity between the anchor and its positive. We want this to be large.

Denominator: The numerator plus the sum of exponentiated similarities with all negatives. This normalizes the expression into a probability.

Negative log: We're minimizing the negative log-probability that the positive is identified correctly. This is identical to the cross-entropy loss for a softmax classifier where the correct class is the positive.

Derivation — InfoNCE as N-way Softmax

Define the probability of selecting the positive from N candidates:

$p(\text{positive} = x^+) = \dfrac{\exp(s(x, x^+) / \tau)}{\sum_{j=0}^{N-1} \exp(s(x, x_j) / \tau)}$, where $x_0 = x^+$ and $x_1, \dots, x_{N-1}$ are the negatives.

This is exactly a softmax over N candidates with logits $s(x, x_j) / \tau$. The InfoNCE loss is the cross-entropy: $\mathcal{L} = -\log p(\text{positive} = x^+)$. Minimizing this maximizes the probability of correctly identifying the positive among N candidates.

The Temperature Parameter τ

Temperature τ controls how "peaked" the distribution is. Small τ (e.g., 0.05) makes the softmax very sharp — the loss strongly penalizes even slightly wrong rankings. Large τ (e.g., 1.0) makes it smooth — close negatives are tolerated.

In practice, τ = 0.07 or 0.1 works well. Too small and training is unstable (gradients explode). Too large and the loss becomes too easy (it can't distinguish hard negatives from easy ones).
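
To see the sharpening effect numerically, the sketch below applies a temperature-scaled softmax to three made-up similarities (a positive, a hard negative, and an easy negative) at two temperatures. The similarity values are illustrative assumptions.

```python
import math

def softmax_with_temperature(similarities, tau):
    """Softmax over similarities / tau, computed stably by subtracting the max logit."""
    logits = [s / tau for s in similarities]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Positive first, then a hard negative and an easy negative (made-up values).
sims = [0.9, 0.85, 0.1]
sharp = softmax_with_temperature(sims, tau=0.05)
smooth = softmax_with_temperature(sims, tau=1.0)
```

At τ = 0.05 almost all of the probability mass sits on the positive and the hard negative, while the easy negative is effectively ignored. At τ = 1.0 the three candidates receive much more similar weights, so the loss barely distinguishes hard from easy negatives.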

[Interactive figure: InfoNCE Loss Explorer. The anchor is fixed at the center; moving the positive (green) and negative (red) points changes the InfoNCE loss, and the temperature controls how sharply the loss penalizes close negatives.]

Connection to Mutual Information

The InfoNCE loss yields a lower bound on the mutual information $I(x; x^+)$ between the two views. Specifically: $I(x; x^+) \ge \log(N) - \mathcal{L}_{\text{InfoNCE}}$. Because the loss is non-negative, this bound can never exceed $\log(N)$, so a larger pool of candidates (larger N) raises the ceiling on how much mutual information the bound can certify and generally makes it tighter. This is the usual argument for why more negatives tend to help: they bring the proxy loss closer to the true objective.

Worked Example — InfoNCE with 3 Negatives

Anchor $x$, positive $x^+$, negatives $x_1, x_2, x_3$. Cosine similarities: $s(x, x^+) = 0.9$, $s(x, x_1) = 0.3$, $s(x, x_2) = 0.1$, $s(x, x_3) = -0.2$. With τ = 0.1:

Numerator: exp(0.9/0.1) = exp(9) = 8103

Denominator: exp(9) + exp(3) + exp(1) + exp(−2) = 8103 + 20.1 + 2.72 + 0.14 = 8126

L = −log(8103/8126) = −log(0.9972) = 0.0028 — a very low loss because the positive is much closer than any negative.

Now if $s(x, x_1) = 0.85$ (a "hard negative"): exp(8.5) = 4915. Denominator becomes 8103 + 4915 + 2.72 + 0.14 = 13021. L = −log(8103/13021) = 0.474, much higher! Hard negatives dominate the loss.
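
The numbers in this worked example can be reproduced with a short function. This is a minimal sketch that takes precomputed similarities as input; `info_nce` is a name chosen here for illustration.

```python
import math

def info_nce(sim_pos, sim_negs, tau):
    """-log softmax probability of the positive among positive + negatives."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

easy = info_nce(0.9, [0.3, 0.1, -0.2], tau=0.1)    # approximately 0.0028
hard = info_nce(0.9, [0.85, 0.1, -0.2], tau=0.1)   # approximately 0.474
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow when similarities are divided by a small τ.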

Chapter 05

SimCLR — Simple Contrastive Learning

SimCLR (Chen et al., 2020) showed that you don't need clever architectures or auxiliary tasks to do great contrastive learning. You just need three ingredients: strong augmentations, a projection head, and a big batch.

The Pipeline

Given a batch of N images:

SimCLR Training Step
  1. Augment: For each image $x_i$, apply two random augmentations to get views $\tilde{x}_{2i-1}$ and $\tilde{x}_{2i}$. Now you have 2N augmented views.
  2. Encode: Pass each view through encoder f(·) (a ResNet) to get representation h = f(x̃).
  3. Project: Pass h through a small MLP projection head g(·) to get z = g(h). This is a 2-layer MLP with ReLU.
  4. Compute similarities: Build the 2N × 2N cosine similarity matrix. For each anchor, the one positive is its augmentation partner; the 2(N−1) others are negatives.
  5. Compute InfoNCE loss for every anchor, average over the batch.
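
Steps 4 and 5 above can be sketched in plain Python. This is an illustrative, unoptimized version that assumes the projections are given as a list of 2N vectors in which rows 2i and 2i+1 (0-based) are the two views of image i; the toy vectors are invented for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def simclr_loss(projections, tau=0.1):
    """Average InfoNCE loss over all 2N anchors; each anchor's positive is its partner view."""
    z = [normalize(v) for v in projections]
    total = 0.0
    for i, anchor in enumerate(z):
        partner = i + 1 if i % 2 == 0 else i - 1
        # Similarities to every other view; the anchor itself is excluded.
        logits = {j: sum(a * b for a, b in zip(anchor, other)) / tau
                  for j, other in enumerate(z) if j != i}
        m = max(logits.values())
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        total += log_denominator - logits[partner]
    return total / len(z)

aligned = [[1.0, 0.1], [1.0, -0.1], [-0.1, 1.0], [0.1, 1.0]]     # partners point the same way
mismatched = [[1.0, 0.1], [-0.1, 1.0], [1.0, -0.1], [0.1, 1.0]]  # partners are nearly orthogonal
loss_aligned = simclr_loss(aligned)
loss_mismatched = simclr_loss(mismatched)
```

When partner views already have similar projections the loss is close to zero; when each view is closer to a different image's view than to its own partner, the loss is large.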

Augmentations Matter — A Lot

SimCLR's augmentation pipeline applies the following randomly:

Augmentation | What It Does | Why It Matters
Random crop & resize | Take a random rectangle (8-100% of area), resize to 224×224 | Forces learning of scale-invariant and position-invariant features
Color distortion | Random brightness, contrast, saturation, hue jitter + random grayscale | Prevents the network from using color as a shortcut
Gaussian blur | Random blur with 50% probability | Forces learning of shape over texture
Horizontal flip | Flip left-right with 50% probability | Standard invariance

The ablation is striking: random crop alone gives ~65% linear probe. Adding color distortion jumps to ~75%. The combination of crop + color distortion is worth 10+ percentage points. Without strong augmentations, the network learns trivial color histogram matching instead of semantic features.

Why Color Distortion Is Critical

Two random crops of the same image often share similar color statistics. Without color distortion, the network can match positives just by comparing average color — no semantic understanding needed. Color jitter forces the network to look past color and learn shape, texture, and object structure. This is the single most important design choice in SimCLR.

The Projection Head Mystery

SimCLR found that adding a 2-layer MLP projection head g(·) after the encoder improves linear probe accuracy by 10+ percentage points. The contrastive loss is computed on z = g(h), but the representation used for downstream tasks is h (the encoder output, before the projection).

Why does this help? A common explanation is that the projection head acts as an information bottleneck. The contrastive loss pushes z to be invariant to the augmentations, so z discards information such as color distribution or exact crop position. That information is a nuisance for the contrastive task but can be useful for downstream tasks. Because the loss is applied to z rather than h, the encoder output h can retain more of it: the projection absorbs the task-specific invariances and leaves h a richer, more general representation.

SimCLR Score Function: $\mathrm{sim}(z_i, z_j) = \dfrac{1}{\tau} \cdot \dfrac{z_i \cdot z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$

Cosine similarity divided by temperature τ. Representations are L2-normalized before the dot product.

[Interactive figure: SimCLR Affinity Matrix. The 2N×2N cosine similarity matrix for a mini-batch. Each image produces two views (columns/rows). Positive pairs (same image) are highlighted in green; other off-diagonal entries are negatives, and their number grows with the batch size.]

The Batch Size Problem

SimCLR's negatives all come from the current batch. With batch size N, each anchor has 2(N−1) negatives. More negatives = tighter InfoNCE bound = better representations. The paper reports results with batch sizes of 4096 to 8192, requiring 32-128 TPUs. At batch size 256, accuracy drops by ~5 points.

The GPU Tax

SimCLR with batch size 4096 requires 32 TPU v3 cores for training. Each core has 16GB HBM. That's ~$10,000+ per training run. This isn't "simple" for most researchers. The batch size requirement was SimCLR's Achilles' heel, and it's exactly the constraint that MoCo's queue-based dictionary removes.

Results

SimCLR v1 achieves 69.3% top-1 linear probe on ImageNet with ResNet-50. SimCLR v2 pushes this to 71.7% by using a deeper projection head (3-layer MLP) and a larger encoder (ResNet-50 4× width). With the wider network, fine-tuned performance reaches 76.5%.

Worked Example — SimCLR Negatives Count

Batch size N = 4096 images. After augmentation: 8192 views. For each anchor, there's 1 positive and 8190 negatives. That's an 8191-way classification problem per anchor. With N = 256: only 510 negatives. The loss surface is much less informative: the model can "cheat" by finding a few easy negatives and ignoring the rest.
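
The counts follow directly from the 2N views per batch; a tiny helper (hypothetical name) makes the relationship explicit:

```python
def simclr_negatives(batch_size):
    """Each anchor sees 2N views minus itself and minus its one positive partner."""
    views = 2 * batch_size
    return views - 2

negatives_large = simclr_negatives(4096)
negatives_small = simclr_negatives(256)
ways_large = negatives_large + 1   # negatives plus the single positive
```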

Chapter 06

MoCo — Momentum Contrastive Learning

SimCLR's insight was right: more negatives help. But SimCLR's solution — giant batches — was brute force. MoCo (He et al., 2020) asked a cleverer question: can we decouple the number of negatives from the batch size?

The Dictionary Queue

MoCo maintains a running queue of encoded representations (keys) from recent mini-batches. This queue acts as a large, consistent dictionary of negatives. With a queue of size 65,536, every anchor has 65,536 negatives — regardless of whether the batch size is 256 or 64.

Each training step: encode the current batch to produce queries (from the query encoder) and keys (from the key encoder). The positive pair is (query, key) from the same image. The negatives are all keys in the queue. After each step, enqueue the new keys and dequeue the oldest ones. First-in, first-out.

The Momentum Encoder

Here's the subtle problem: the keys in the queue were encoded by different versions of the encoder at different training steps. If the encoder changes rapidly, old keys in the queue are inconsistent with new ones — they represent a stale, different feature space. This inconsistency hurts training.

MoCo's solution: update the key encoder very slowly via exponential moving average (momentum update):

Momentum Update: $\theta_k \leftarrow m \cdot \theta_k + (1 - m) \cdot \theta_q$

m = 0.999. The key encoder $\theta_k$ changes by only 0.1% per step toward the query encoder $\theta_q$. This keeps keys consistent across time.

Only the query encoder receives gradients from the loss. The key encoder is never trained directly — it evolves slowly via momentum. This ensures that keys from 100 steps ago and keys from the current step live in approximately the same feature space.
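
As a minimal sketch, the momentum update is just an element-wise blend of two parameter lists. Real encoders have millions of parameters stored as tensors; the two-element lists here are placeholders for illustration.

```python
def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter a fraction (1 - m) of the way toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

key_params = [0.0, 1.0]      # placeholder key-encoder parameters
query_params = [1.0, 1.0]    # placeholder query-encoder parameters
key_params = momentum_update(key_params, query_params)
```

After one update the first parameter has moved from 0.0 to about 0.001, that is, 0.1% of the way toward the query encoder's value, while the second parameter, which already agreed, stays at 1.0.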

Why m = 0.999 Works

Think of the key encoder as a slow-moving average of the query encoder. With m = 0.999, after 1000 steps the key encoder has "absorbed" roughly 63% of the query encoder's updates ($1 - 0.999^{1000} \approx 0.63$). The queue, however, only holds ~256 batches worth of keys (65536/256), so the oldest key is just 256 steps old. Over 256 steps the key encoder turns over only $1 - 0.999^{256} \approx 23\%$ of its weight, so the oldest key was encoded by a network that is still largely the same as the current one. Close enough for contrastive learning to work. With m = 0.9, the key encoder is almost completely replaced within a few dozen steps, so the oldest key comes from an effectively different network: too stale.

MoCo Training Loop
  1. Sample a mini-batch of images. Apply two augmentations to each.
  2. Encode queries: $q = f_q(x_1)$ through the query encoder (receives gradients).
  3. Encode keys: $k = f_k(x_2)$ through the key encoder (no gradients).
  4. Compute InfoNCE: positive = (q, k) from same image. Negatives = all keys in the queue.
  5. Backprop through query encoder only.
  6. Momentum update: $\theta_k \leftarrow 0.999 \cdot \theta_k + 0.001 \cdot \theta_q$.
  7. Queue update: enqueue new keys, dequeue oldest keys (FIFO).
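
The FIFO queue in step 7 maps naturally onto a bounded `collections.deque`. The sketch below uses tiny sizes and string placeholders instead of real key vectors, purely to show the enqueue/dequeue behavior; reading the negatives before enqueuing follows the step order in the list above.

```python
from collections import deque

QUEUE_SIZE = 8      # tiny stand-in for MoCo's 65,536
BATCH_SIZE = 2      # tiny stand-in for 256

# With maxlen set, appending to a full deque silently drops the oldest entries.
queue = deque(maxlen=QUEUE_SIZE)

def enqueue_batch(queue, new_keys):
    """Add the newest keys; the oldest ones fall off the front once the queue is full."""
    queue.extend(new_keys)

for step in range(6):
    # Hypothetical keys: in real training these would be key-encoder outputs.
    keys = [f"key_step{step}_{i}" for i in range(BATCH_SIZE)]
    negatives = list(queue)    # negatives for this step are read before enqueuing
    enqueue_batch(queue, keys)
```

After six steps the queue holds only the keys from steps 2 through 5; the keys from steps 0 and 1 have been dequeued.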

MoCo v2: Best of Both Worlds

MoCo v1 used weak augmentations and no projection head. MoCo v2 (Chen et al., 2020) borrowed SimCLR's key innovations — strong augmentations (crop + color distortion) and an MLP projection head — and combined them with MoCo's queue. The result: 71.1% linear probe on ImageNet with ResNet-50, matching SimCLR v2 while using 32× smaller batch size (256 vs 8192).

Method | Batch Size | Negatives | Hardware | Linear Probe
SimCLR v1 | 4096 | 8190 | 32 TPUs | 69.3%
MoCo v1 | 256 | 65536 | 8 GPUs | 60.6%
MoCo v2 | 256 | 65536 | 8 GPUs | 71.1%
SimCLR v2 | 4096 | 8190 | 32 TPUs | 71.7%

Worked Example — Queue Mechanics

Queue size K = 65536, batch size B = 256. Each step adds 256 new keys and removes 256 old ones. The queue holds 65536/256 = 256 batches worth of keys. The oldest keys are from 256 steps ago. With momentum m = 0.999 and 256 updates, the key encoder has drifted by $1 - 0.999^{256} \approx 22.6\%$ from when the oldest keys were encoded. That's small enough for consistent contrastive targets.
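
The drift figures can be verified directly. In this sketch, "drift" means the fraction of the key encoder's weight that has been replaced by query-encoder updates after a given number of momentum steps, which is one simple way to quantify staleness.

```python
def fraction_replaced(m, steps):
    """Share of the key encoder that has turned over after `steps` momentum updates."""
    return 1.0 - m ** steps

queue_size, batch_size = 65536, 256
steps_in_queue = queue_size // batch_size

drift_oldest_key = fraction_replaced(0.999, steps_in_queue)    # about 0.226
drift_1000_steps = fraction_replaced(0.999, 1000)              # about 0.632
drift_low_momentum = fraction_replaced(0.9, steps_in_queue)    # essentially 1.0
```

With m = 0.9 the encoder is effectively fully replaced over the lifetime of the queue, which is why the oldest keys would be too stale.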

Definition
Memory Bank vs Queue

An earlier approach (Wu et al., 2018) stored a memory bank — one representation per image in the entire dataset. Updated only when that image appeared in a batch. Problem: with 1.28M images and 256 batch size, each representation is updated once every 5000 steps — massively stale. MoCo's queue is better because it only keeps recent keys (from the last ~256 steps), ensuring freshness.

Chapter 07

CPC — Contrastive Predictive Coding

SimCLR and MoCo contrast static views: two crops of the same image. But what if your data has sequential structure — audio, video, text? Contrastive Predictive Coding (CPC, Oord et al., 2018) extends contrastive learning to sequences by predicting future representations from current context.

The Setup

Given a sequence of observations $x_1, x_2, \dots, x_T$ (e.g., audio frames, image patches scanned top-to-bottom, or text tokens):

CPC Architecture
  1. Encode: Each observation $x_t$ is encoded independently: $z_t = g_{\text{enc}}(x_t)$. This produces a sequence of local representations.
  2. Aggregate: An autoregressive model (GRU or Transformer) processes $z_1, \dots, z_t$ to produce a context vector: $c_t = g_{\text{ar}}(z_1, \dots, z_t)$. This captures the "story so far."
  3. Predict future: For each future step k = 1, 2, ..., K, use a linear prediction head to predict the future representation: $\hat{z}_{t+k} = W_k c_t$.
  4. Contrastive loss: The prediction $\hat{z}_{t+k}$ should be close to the actual $z_{t+k}$ (positive) and far from the z values of other sequences (negatives).

CPC Score Function: $s_k(c_t, z_{t+k}) = z_{t+k}^\top W_k c_t = z_{t+k}^\top \hat{z}_{t+k}$

A bilinear score: the context $c_t$ predicts the future $z_{t+k}$ through a horizon-dependent linear map $W_k$, and the score is the dot product between that prediction and the candidate. Different prediction heads for different horizons k.

The InfoNCE loss discriminates the true future $z_{t+k}$ from random negatives (z values sampled from other positions or other sequences).
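
A small sketch of the CPC score and loss for a single horizon k, using plain lists as vectors. The matrix `W_k`, the context `c_t`, and the candidate z vectors are made-up values for illustration, and no temperature is used, matching the bilinear score above.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def cpc_score(W_k, c_t, z_candidate):
    """Bilinear score: dot product between the prediction W_k c_t and a candidate z."""
    prediction = matvec(W_k, c_t)
    return sum(p * z for p, z in zip(prediction, z_candidate))

def cpc_info_nce(W_k, c_t, z_positive, z_negatives):
    """InfoNCE over one true future and a list of negative candidates."""
    logits = [cpc_score(W_k, c_t, z) for z in [z_positive] + z_negatives]
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[0]

W_k = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical prediction matrix for one horizon k
c_t = [2.0, 0.0]                 # made-up context vector

loss_good = cpc_info_nce(W_k, c_t, [1.0, 0.0], [[-1.0, 0.0], [0.0, 1.0]])
loss_bad = cpc_info_nce(W_k, c_t, [-1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The loss is low when the true future aligns with the prediction and high when a negative aligns with it better than the true future does.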

Why Predict in Representation Space?

CPC predicts representations, not raw observations. Predicting raw audio samples would require modeling every detail of the waveform. Predicting representations lets the model focus on high-level structure: what word will come next, what note will play next, what object will appear next.

CPC as a World Model

CPC learns to predict the future, not in pixel/sample space, but in a learned representation space. This is exactly what a world model does: compress observations into states and predict how states evolve. CPC's autoregressive model $g_{\text{ar}}$ is a primitive world model. The contrastive loss ensures the representations capture the features that are predictable and useful for prediction, discarding noise.

Applications of CPC

Domain | Observation $x_t$ | Sequence | Result
Audio | 25ms audio frame | Speech waveform | SOTA phone classification; competitive speech recognition
Images | Image patch | Patches scanned top→bottom | Predict bottom patches from top; learns spatial structure
Text | Sentence | Book paragraphs | Predict next sentence; competitive NLI
Video | Video frame | Video clip | Action recognition features

Why Autoregressive Context Matters

Without the autoregressive model, CPC would be predicting $z_{t+k}$ from $z_t$ alone. With the autoregressive model, it predicts from $c_t = g_{\text{ar}}(z_1, \dots, z_t)$, which summarizes all past observations. This is critical for sequences with long-range dependencies: the next word depends not just on the previous word, but on the entire paragraph so far.

Worked Example — CPC for Audio

Audio at 16kHz, encoded into 100Hz representations (one z every 10ms). Context at t = 500 (= 5 seconds). Predict k = 1 to 12 steps ahead (10-120ms). Positives: the actual z at t + k. Negatives: z values from random positions in the same or other audio clips. The model learns that after "cat sat on the," the representation for "mat" is more likely than the representation for "helicopter." It captures semantic and syntactic patterns without transcription labels.

Chapter 08

DINO — Self-Distillation with No Labels

Contrastive methods need negative pairs. What if you didn't? DINO (Caron et al., 2021) takes a radically different approach: self-distillation. It's a teacher-student framework where the teacher is the student — or rather, a slow-moving average of it.

The Setup

DINO creates two networks from the same architecture (typically a ViT):

Student network: Receives local crops (small patches, ~96×96 pixels, covering ~25% of the image). Trained with gradient descent.

Teacher network: Receives global crops (large patches, ~224×224 pixels, covering ~50-100% of the image). Updated via exponential moving average of the student (like MoCo's key encoder). No gradients flow through the teacher.

The loss is not contrastive. It's cross-entropy between output distributions:

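DINO Loss: $\mathcal{L} = -\sum_{x \in \text{global}} \sum_{x' \in \text{local}} p_t(x) \cdot \log p_s(x')$

Cross-entropy between teacher output $p_t$ (from global crop) and student output $p_s$ (from local crop), where the product is summed over the K output dimensions. The student learns to match the teacher's prediction from less information.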

Both networks output a probability distribution over K dimensions (e.g., K = 65536) via a softmax layer. The student must learn to produce the same distribution as the teacher, even though the student sees only a small crop while the teacher sees the whole image.

Preventing Collapse

Without negatives, what prevents the trivial solution where both networks output the same constant vector for every image? Two mechanisms:

Centering: Subtract a running mean from the teacher's output before softmax: $p_t = \mathrm{softmax}((g_t(x) - c) / \tau_t)$, where c is an exponential moving average of the teacher's outputs across the dataset. Centering prevents any single dimension from dominating, making the "collapse to one constant" solution unstable.

Sharpening: Use a low temperature $\tau_t = 0.04$ for the teacher (producing sharp, peaked distributions) and a higher temperature $\tau_s = 0.1$ for the student. The teacher commits strongly to its prediction; the student has to work harder to match this peaked target. A uniform (collapsed) output can never match a sharp target.
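
Centering and sharpening can be sketched together for a single (global, local) pair. The temperatures are the ones quoted above; the center momentum of 0.9 and all the logit values are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def log_softmax(logits, tau):
    """Numerically stable log-softmax of logits / tau."""
    scaled = [l / tau for l in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def dino_loss(teacher_logits, student_logits, center, tau_t=0.04, tau_s=0.1):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    teacher_centered = [g - c for g, c in zip(teacher_logits, center)]
    p_teacher = [math.exp(lp) for lp in log_softmax(teacher_centered, tau_t)]
    log_p_student = log_softmax(student_logits, tau_s)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))

def update_center(center, teacher_logits, momentum=0.9):
    """Exponential moving average of teacher outputs (momentum value is an assumption)."""
    return [momentum * c + (1 - momentum) * g for c, g in zip(center, teacher_logits)]

center = [0.1, 0.1, 0.1]
teacher_logits = [0.5, 0.1, 0.0]
loss_match = dino_loss(teacher_logits, [0.5, 0.1, 0.0], center)
loss_mismatch = dino_loss(teacher_logits, [0.0, 0.1, 0.5], center)
new_center = update_center([0.0, 0.0], [1.0, 2.0])
```

A student whose output peaks on the same dimension as the teacher gets a small loss; one that peaks elsewhere is penalized heavily because the low teacher temperature makes the target nearly one-hot.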

Why Self-Distillation Avoids Collapse

Consider what collapse looks like. There are two ways to collapse: every image maps to the same peaked output (one dimension dominates), or every image maps to the uniform distribution. Centering attacks the first: if the teacher outputs the same vector c for every image, the running mean converges to c, and the centered teacher output becomes softmax((c − c)/τ) = softmax(0), which is uniform, so no single dimension can stay dominant. On its own, though, centering pushes toward the second failure mode, a uniform teacher that provides no training signal. Sharpening counteracts that: the low teacher temperature keeps the teacher's distributions peaked. The two mechanisms pull in opposite directions, and their balance is what keeps training away from either collapsed solution, so the teacher has to produce different outputs for different images.

Emergent Segmentation

The most striking result from DINO: when trained with a ViT, the self-attention maps learn to segment objects without any segmentation supervision. The [CLS] token's attention in the last layer highlights the main object in the scene — cleanly separating foreground from background.

This happens because the global-local crop asymmetry forces the network to understand what the object is (which persists across crops) rather than where it is (which changes between crops). Object identity is precisely what segmentation captures.
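To make the attention-map idea concrete, here is a rough sketch of how the [CLS] token's attention over patches could be read out of one self-attention layer and reshaped into an image-like grid. The random q and k arrays are placeholders for a trained ViT's last-layer queries and keys, and the function name, head count, and patch size are assumptions for illustration.

```python
import numpy as np

def cls_attention_maps(q, k, grid_hw):
    """Per-head attention from the [CLS] query to every patch key.

    q, k: arrays of shape (heads, 1 + num_patches, head_dim), with [CLS] at index 0
    grid_hw: (rows, cols) of the patch grid, e.g. (14, 14) for 224x224 with 16x16 patches
    """
    heads, tokens, head_dim = q.shape
    # Scaled dot product between the [CLS] query and all keys: (heads, 1, tokens)
    scores = q[:, 0:1, :] @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    patch_attn = attn[:, 0, 1:]  # drop the [CLS]-to-[CLS] entry
    return patch_attn.reshape(heads, *grid_hw)

rng = np.random.default_rng(0)
heads, head_dim, grid = 6, 64, (14, 14)
tokens = 1 + grid[0] * grid[1]
q = rng.normal(size=(heads, tokens, head_dim))  # placeholder queries
k = rng.normal(size=(heads, tokens, head_dim))  # placeholder keys
maps = cls_attention_maps(q, k, grid)
print(maps.shape)  # (6, 14, 14)
```

With a trained DINO model, each of these per-head grids could be upsampled to the input resolution and overlaid on the image; with random inputs as here, the maps are only noise.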

DINO vs Contrastive Methods

Contrastive methods (SimCLR, MoCo) define similarity through data augmentation and enforce it via a push-pull loss. DINO achieves the same goal through a completely different mechanism: self-distillation with asymmetric views. The teacher sees more context (global crops) and "knows more"; the student sees less (local crops) and must learn to match the teacher's understanding. No negatives needed.

DINOv2: Scaling Up

DINOv2 (Oquab et al., 2023) combines DINO's self-distillation with several improvements: a curated training dataset (LVD-142M), iBOT's masked image modeling objective, and a ViT-g architecture with 1.1B parameters. The result was, at its release, among the strongest general-purpose visual feature extractors available.

Property | DINO | DINOv2
Architecture | ViT-S/16 to ViT-B/8 | ViT-S/14 to ViT-g/14
Training data | ImageNet-1K (1.28M) | LVD-142M (curated)
Objectives | Self-distillation only | Self-distillation + masked image modeling
Linear probe (IN-1K) | 77.0% (ViT-S/16) | 81.1% (ViT-S/14), 86.5% (ViT-g/14)
Key feature | Emergent segmentation | Universal visual features (depth, segmentation, retrieval)
Worked Example — DINO Crops

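Input image: 224×224. The teacher receives 2 global crops of size 224×224 (each covering 50-100% of the original, resized). The student receives 6 local crops of size 96×96 (covering ~20% each); in the DINO paper the student also processes the 2 global crops, while the teacher sees only the global ones. Forward passes: 2 through the teacher + 8 through the student = 10. The loss is computed for every (teacher view, student view) pair except those where both networks see the same crop: 2 × 6 = 12 (global, local) terms plus 2 (global, other global) terms = 14 cross-entropy terms. The student must extract the "essence" of the image from tiny crops to match the teacher's holistic understanding.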
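The pair counting can be checked by enumeration. The helper below is a hypothetical function written for this example; it counts loss terms under two conventions: the student seeing only local crops, and the student also seeing the global crops (skipping pairs where teacher and student would look at the same crop).

```python
def count_loss_terms(n_global, n_local, student_sees_globals):
    """Number of (teacher view, student view) cross-entropy terms.

    The teacher only ever sees global crops. A pair is skipped when teacher
    and student would be looking at the very same crop.
    """
    teacher_views = [("global", i) for i in range(n_global)]
    student_views = [("local", i) for i in range(n_local)]
    if student_sees_globals:
        student_views += teacher_views
    return sum(1 for t in teacher_views for s in student_views if t != s)

local_only = count_loss_terms(2, 6, student_sees_globals=False)   # 2 * 6
with_globals = count_loss_terms(2, 6, student_sees_globals=True)  # 2 * 8 - 2

# Pixel area of one 96x96 local crop relative to a 224x224 global crop.
area_ratio = (96 * 96) / (224 * 224)
print(local_only, with_globals, round(area_ratio, 3))  # prints: 12 14 0.184
```

The area ratio is a reminder of the compute asymmetry: each local view has under a fifth of the pixels of a global view, which keeps the extra student passes cheap.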

Chapter 09

Showcase — Contrastive Training Simulator

Now let's see contrastive learning in action. The simulation below shows data points in a 2D embedding space. Points are colored by their true class (which the algorithm never sees). Each training step creates augmented pairs, then pulls positives together and pushes negatives apart.

Watch how the clusters form over training. Adjust the temperature τ to see how it affects separation sharpness. Try different augmentation strengths to see how weak augmentation leads to shortcut features.

[Interactive demo: Contrastive Training Simulator. Points are colored by true class (unknown to the algorithm). Each step, random positive pairs are created via augmentation, and the InfoNCE loss drives learning. Watch clusters emerge from random initialization.]
What to Try

Low τ (0.05): Very sharp separation. Clusters form quickly but might not merge well; subgroups within a class stay separate.

High τ (1.0): Soft separation. Everything blends together because the loss is too tolerant of wrong matches.

Sweet spot (~0.1-0.2): Clean clusters with good within-class cohesion.

Low augmentation strength (~0.02): The algorithm fails to learn. Augmented views are too similar to the original, so the network doesn't need to learn invariances.
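For readers without the interactive demo, the loss it optimizes can be sketched directly. The code below is a one-directional simplification of InfoNCE on 2D embeddings; the batch size, the noise level used as "augmentation", and the variable names are assumptions chosen for illustration.

```python
import numpy as np

def info_nce(z_a, z_b, tau):
    """InfoNCE over a batch: z_a[i] and z_b[i] are two views of the same sample.

    Every other row of z_b serves as a negative for z_a[i].
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)  # unit vectors, so dot = cosine
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                              # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())                 # positives sit on the diagonal

rng = np.random.default_rng(0)
anchors = rng.normal(size=(16, 2))                       # 2D embeddings, as in the simulator
positives = anchors + 0.05 * rng.normal(size=(16, 2))    # lightly perturbed second view
shuffled = positives[::-1]                               # misaligned pairs for comparison

for tau in (0.05, 0.2, 1.0):
    aligned_loss = info_nce(anchors, positives, tau)
    shuffled_loss = info_nce(anchors, shuffled, tau)
    print(tau, aligned_loss, shuffled_loss)
```

Dividing by τ rescales every similarity before the softmax, so a small τ makes the loss concentrate on the hardest negatives, while a large τ flattens the distribution and weakens the push-pull signal.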

Chapter 10

Summary & Connections

The Evolution of Self-Supervised Learning

SSL has gone through four phases, each solving a limitation of the previous:

Era | Approach | Key Methods | Limitation Solved | Remaining Problem
2015-2018 | Pretext tasks | Rotation, Jigsaw, Colorization | Learn without labels | Shortcut solutions, weak features
2018-2020 | Contrastive | SimCLR, MoCo, CPC | Learn in feature space, stronger features | Need many negatives, large batches
2021-2022 | Self-distillation | BYOL, DINO, VICReg | No negatives needed, simpler training | Requires careful collapse prevention
2022+ | Masked modeling | MAE, BEiT, data2vec | Simpler objective, efficient training | Needs fine-tuning (weaker linear probe)

Method Comparison

Property | SimCLR | MoCo v2 | MAE | DINO
Loss type | InfoNCE | InfoNCE | MSE reconstruction | Cross-entropy distillation
Needs negatives? | Yes (batch) | Yes (queue) | No | No
Batch size | 4096+ | 256 | 4096 | 1024
Architecture | ResNet | ResNet | ViT | ViT
Linear probe (IN-1K) | 69.3% | 71.1% | 75.8%* | 77.0%
Fine-tune (IN-1K) | 76.5% | ~77% | 86.9%* | 82.8%
Key trick | Strong augmentations | Momentum queue | 75% masking | Centering + sharpening

* MAE uses ViT-Huge; the other columns use ResNet-50 or smaller ViTs, so the columns are not a like-for-like comparison. Numbers are approximate and depend on exact settings.

The Unifying Principle

All SSL Is Self-Supervision

Every method in this lecture creates an artificial supervision signal from the structure of the data itself. Pretext tasks exploit spatial structure (rotation, jigsaw). Contrastive methods exploit augmentation invariance (two crops of the same image should be similar). Masked models exploit reconstruction (visible patches predict masked ones). Self-distillation exploits view asymmetry (local crop should match global understanding). The core insight is always the same: the data knows more than any label set.

Key Principles That Transcend Methods

Universal Design Principles

1. Augmentation matters. Every method benefits from strong, diverse augmentations. Crop + color distortion is the minimum viable set. Without them, networks find trivial shortcuts.

2. Projection heads help. Computing the loss in a separate projection space (discarded at transfer) consistently improves representation quality across SimCLR, MoCo, BYOL, and DINO.

3. Scale helps. More data, bigger encoders, and longer training all improve SSL representations. Unlike supervised learning, SSL has no labeling bottleneck — you can always add more data.

4. Linear probe vs fine-tune is a real distinction. MAE is weaker on linear probe but stronger on fine-tuning than contrastive methods. Choose your evaluation to match your deployment scenario.
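Principle 4 is easier to see in code. A linear probe trains only a linear classifier on top of frozen encoder features, whereas fine-tuning would also update the encoder. The sketch below uses synthetic clusters as a stand-in for encoder outputs; the function name, learning rate, step count, and data are assumptions for illustration.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.01, steps=200):
    """Fit a softmax classifier on frozen features with plain gradient descent.

    Only W and b are updated; the features are treated as fixed inputs,
    which is what distinguishes a linear probe from fine-tuning.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of the mean cross-entropy w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Synthetic stand-in for encoder outputs: three well-separated clusters in 16-D.
rng = np.random.default_rng(0)
centers = rng.normal(scale=3.0, size=(3, 16))
labels = rng.integers(0, 3, size=300)
features = centers[labels] + rng.normal(size=(300, 16))

W, b = train_linear_probe(features, labels, num_classes=3)
accuracy = float(((features @ W + b).argmax(axis=1) == labels).mean())
print(accuracy)
```

A high probe accuracy says the frozen features are already linearly separable; a method like MAE can score lower here yet overtake others once the encoder itself is allowed to adapt.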

Connections to Other Topics

Self-supervised learning connects deeply to several other topics covered in this series:

CLIP extends contrastive learning to vision-language pairs. Instead of two image crops as the positive pair, CLIP uses an image and its text caption. This gives rise to zero-shot classification — the representation space is organized by language, not by a fixed label set. See the contrastive-clip lesson.

Transformers are the backbone architecture for modern SSL. ViT's patch-based design enables MAE's efficient masking strategy and DINO's attention-based segmentation. See the transformer lesson.

Generative models (VAEs, diffusion) are related to masked modeling — both learn by reconstruction. The difference: SSL focuses on the encoder (representation quality), while generative models focus on the decoder (sample quality). See VAE/VQ-VAE and diffusion.
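The masking strategy that ViT's patch design enables can be sketched as a random selection over patch tokens. This is an illustrative version only: the function name, the 16×16 patch size, and the 768-dimensional tokens are assumptions, and the random array stands in for real patch embeddings.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens, in the style of MAE masking.

    patches: array of shape (num_patches, dim)
    Returns the visible patches, their indices, and a boolean mask (True = masked).
    """
    if rng is None:
        rng = np.random.default_rng()
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1.0 - mask_ratio))
    order = rng.permutation(num_patches)
    keep_idx = np.sort(order[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 tokens.
num_patches = (224 // 16) ** 2
patches = np.random.default_rng(0).normal(size=(num_patches, 768))
visible, keep_idx, mask = random_masking(patches, rng=np.random.default_rng(1))
print(visible.shape, int(mask.sum()))  # prints: (49, 768) 147
```

Because only the 49 visible tokens would be passed to the encoder, the encoder processes a quarter of the sequence, which is where the efficiency comes from.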

References

Paper | Year | Key Contribution
Doersch et al., "Unsupervised Visual Representation Learning by Context Prediction" | 2015 | Relative patch location pretext task
Noroozi & Favaro, "Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles" | 2016 | Jigsaw permutation prediction
Pathak et al., "Context Encoders: Feature Learning by Inpainting" | 2016 | Image inpainting as pretext task
Zhang et al., "Colorful Image Colorization" | 2016 | Colorization as pretext task
Gidaris et al., "Unsupervised Representation Learning by Predicting Image Rotations" | 2018 | Rotation prediction
van den Oord et al., "Representation Learning with Contrastive Predictive Coding" | 2018 | InfoNCE loss, CPC framework
He et al., "Momentum Contrast for Unsupervised Visual Representation Learning" | 2020 | MoCo, momentum encoder, queue
Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" | 2020 | SimCLR, projection head, strong augmentations
Caron et al., "Emerging Properties in Self-Supervised Vision Transformers" | 2021 | DINO, self-distillation, emergent segmentation
He et al., "Masked Autoencoders Are Scalable Vision Learners" | 2022 | MAE, 75% masking, asymmetric encoder-decoder
Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision" | 2023 | Scaled DINO, universal visual features
The One Sentence

Self-supervised learning creates its own labels from unlabeled data — via puzzles, augmentation invariance, masking, or self-distillation — and the representations it learns rival or surpass supervised ones.