What if you could learn powerful visual representations from billions of unlabeled images — by inventing your own supervision signal from the data itself?
Supervised learning has a dirty secret. Those stunning ImageNet results — ResNets, ViTs, everything that kicked off the deep learning revolution — they all depend on 1.28 million hand-labeled images. Every single image was looked at by a human who decided "this is a golden retriever" or "this is a school bus." That took years and cost millions of dollars.
Now consider this: the internet has billions of unlabeled images. Instagram alone gets 100 million photos uploaded per day. YouTube accumulates 500 hours of video every minute. The labeled fraction of all visual data on Earth is a rounding error.
ImageNet: 1.28 million labeled images, the product of years of annotation effort. The internet: billions of unlabeled images, growing every second. If we could learn from unlabeled data, we'd have access to 1000× more training signal. That's not a small improvement — it's a paradigm shift.
But the problem goes deeper than scale. Labels are also biased. ImageNet's 1,000 categories reflect choices made by researchers in 2009. There's no "tired face" class, no "cracked sidewalk" class, no "half-eaten sandwich" class. Yet these are exactly the kinds of things a truly general visual system needs to understand. Labels impose a ceiling on what features your network can learn.
And labels are expensive. Medical imaging? You need board-certified radiologists. Satellite imagery? You need domain experts. Self-driving? Thousands of hours of tedious bounding box annotation. For every new domain, you start from scratch.
Self-supervised learning (SSL) sidesteps the labeling bottleneck entirely. The idea: design a task where the supervision signal comes from the data itself — no human labels needed. The network learns useful representations as a byproduct of solving this artificial task.
A learning paradigm where the model creates its own supervision signal from unlabeled data. The model solves a pretext task (a puzzle constructed from the data's structure), and the representations learned for that task transfer to downstream tasks like classification, detection, or segmentation.
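The framework has two stages: first pretrain an encoder on a pretext task using only unlabeled data, then transfer that encoder to a downstream task such as classification, detection, or segmentation.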
If we don't have labels during pretraining, how do we evaluate the quality of the learned representations? Four standard protocols:
| Protocol | How It Works | What It Tests |
|---|---|---|
| Linear probe | Freeze encoder, train a single linear layer on top | Feature quality — are the representations linearly separable? |
| Fine-tuning | Unfreeze encoder, train end-to-end with small LR | Transfer quality — how well does the network adapt? |
| k-NN evaluation | Extract features, classify test images by nearest neighbors | Cluster quality — do similar things end up close? |
| t-SNE / UMAP | Visualize feature space in 2D | Qualitative — do clusters match semantic categories? |
The linear probe is the gold standard. If you can slap a single linear layer on frozen features and get 75%+ accuracy on ImageNet, your encoder has learned something genuinely useful — the features are already organized by semantic meaning without ever seeing a label.
A method with great fine-tuning but poor linear probe performance might just be learning easily adaptable features, not inherently good features. The linear probe is stricter: it demands that the representation already separates classes. This distinction matters when comparing SSL methods.
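As a rough illustration of the k-NN protocol from the table above, the sketch below classifies a query feature by majority vote among its nearest neighbors in a bank of frozen features. The feature vectors, labels, and function names (`cosine`, `knn_predict`) are made up for illustration; a real evaluation would use encoder outputs for the training and test sets.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, bank_feats, bank_labels, k=3):
    """Majority vote among the k most similar features in the bank."""
    ranked = sorted(range(len(bank_feats)),
                    key=lambda i: cosine(query, bank_feats[i]),
                    reverse=True)
    votes = Counter(bank_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical frozen features: two well-separated clusters.
bank_feats = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
              [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
bank_labels = ["cat", "cat", "cat", "truck", "truck", "truck"]

pred_a = knn_predict([0.95, 0.05], bank_feats, bank_labels)
pred_b = knn_predict([0.05, 0.90], bank_feats, bank_labels)
print(pred_a, pred_b)
```

No parameters are trained here, which is why k-NN accuracy is often read as a direct measure of how well the feature space is already clustered.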
The earliest self-supervised methods worked by a beautifully simple trick: take an image, corrupt or transform it in some known way, and ask the network to predict what you did. The network can't solve these puzzles without understanding the image's content — so understanding is forced as a side effect.
Take an image. Rotate it by one of four angles: 0°, 90°, 180°, or 270°. Ask the network: which rotation was applied? This is a 4-way classification problem — no labels needed, because you chose the rotation.
Why does this force semantic understanding? To distinguish 0° from 180°, the network must know which way is "up." It needs to understand that trees grow upward, that skies are at the top, that text reads left-to-right. These are semantic features — exactly the kind we want for downstream tasks.
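A minimal sketch of how rotation-prediction training examples could be generated, assuming an image is represented as a 2D list of pixel values. The helper names `rot90` and `make_rotation_example` are illustrative, not taken from any particular codebase.

```python
import random

def rot90(img):
    """Rotate a 2D list (rows of pixels) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_example(img, rng):
    """Return (rotated_image, label), where label k means k * 90 degrees."""
    label = rng.randrange(4)  # 0, 1, 2, 3 -> 0, 90, 180, 270 degrees
    rotated = img
    for _ in range(label):
        rotated = rot90(rotated)
    return rotated, label

rng = random.Random(0)
image = [[1, 2],
         [3, 4]]
rotated, label = make_rotation_example(image, rng)
print(label, rotated)
```

The label costs nothing to produce because the data pipeline chose the rotation itself, which is the whole point of a pretext task.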
Extract two patches from an image: a center patch and one of its 8 neighbors. Ask the network: where is the second patch relative to the first? Top-left? Bottom-right? This is an 8-way classification problem.
To solve it, the network must learn spatial relationships between object parts. It learns that eyes are above noses, that wheels are below car bodies, that windows are on walls. These part-relationship features are exactly what object detectors need.
Split an image into a 3×3 grid of patches. Shuffle them. Ask the network to predict which permutation was applied. With 9 patches, there are 9! = 362,880 possible permutations — too many for classification. So in practice, you select a subset of ~100 maximally distinct permutations and classify among those.
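One way to pick a small set of well-separated permutations is a greedy search on Hamming distance, sketched below. This is an assumption about how such a subset might be built, not necessarily the procedure of the original jigsaw paper; the candidate pool size, `n_keep`, and the function names are illustrative.

```python
import math
import random

def hamming(p, q):
    """Number of positions at which two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def pick_permutations(n_tiles=9, n_keep=10, n_candidates=500, seed=0):
    """Greedily keep permutations that are far from those already chosen."""
    rng = random.Random(seed)
    candidates = [tuple(rng.sample(range(n_tiles), n_tiles))
                  for _ in range(n_candidates)]
    chosen = [candidates[0]]
    while len(chosen) < n_keep:
        # Add the candidate whose nearest chosen permutation is farthest away.
        best = max(candidates,
                   key=lambda c: min(hamming(c, p) for p in chosen))
        chosen.append(best)
    return chosen

total_permutations = math.factorial(9)
perms = pick_permutations()
print(total_permutations, len(perms))
```

Each chosen permutation becomes one class index, so the network solves an `n_keep`-way classification problem instead of a 362,880-way one.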
Mask out the center region of an image. Ask the network to reconstruct the missing pixels. The loss combines L2 reconstruction (mean squared error on pixels) with an adversarial loss (a discriminator tries to distinguish real from generated patches). The adversarial loss encourages sharp, realistic completions rather than blurry averages.
To fill in a missing face region, the network must understand facial structure. To complete a missing building corner, it must understand architectural geometry. Inpainting forces holistic scene understanding.
Convert a color image to grayscale. Ask the network to predict the original colors. This is framed as classification (quantize the color space into ~313 bins) rather than regression, because color is inherently ambiguous — a car could be red or blue, but the network must still understand it's a car to color it plausibly.
Surprisingly well. Rotation prediction on ImageNet, transferred to Pascal VOC detection, achieves ~73% of the performance of supervised pretraining (~54% vs ~74% mAP). That's without seeing a single label during pretraining.
| Method | Pretext Task | VOC Detection mAP | % of Supervised |
|---|---|---|---|
| Random init | None | ~40% | 54% |
| Rotation | Predict 0/90/180/270 | ~54% | 73% |
| Jigsaw | Predict patch order | ~56% | 76% |
| Colorization | Predict colors | ~52% | 70% |
| Supervised | ImageNet labels | ~74% | 100% |
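The "% of Supervised" column is simply each method's mAP divided by the supervised mAP. A quick check using the approximate values from the table (the dictionary name is illustrative):

```python
# Approximate VOC detection mAP values copied from the table above.
voc_map = {
    "Random init": 40,
    "Rotation": 54,
    "Jigsaw": 56,
    "Colorization": 52,
    "Supervised": 74,
}

supervised = voc_map["Supervised"]
relative = {name: round(100 * value / supervised)
            for name, value in voc_map.items()}

for name, pct in relative.items():
    print(f"{name}: {pct}% of supervised")
```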
Networks are lazy — they'll find the easiest way to solve the pretext task, even if it doesn't require semantic understanding. For jigsaw puzzles, the network can match edges rather than understanding content. For rotation, it can detect JPEG artifacts that differ by rotation. For relative patches, chromatic aberration (color fringing near image edges) leaks absolute position information. Designing pretext tasks that resist shortcuts is a major challenge.
The key insight: these tasks can only be solved perfectly with high-level understanding, but they can be solved cheaply with low-level tricks. The art of pretext task design is closing the gap — making shortcuts impossible so the network is forced into semantic understanding. This is why contrastive methods eventually won: they design the "puzzle" in feature space where pixel-level shortcuts don't exist.
In NLP, BERT revolutionized representation learning with a simple idea: mask out some words in a sentence, predict the missing ones. The model must understand context, grammar, and semantics to fill in the blanks. Masked Autoencoders (MAE, He et al., 2022) bring this idea to images — but with a critical twist.
In text, each token carries dense semantic information. Masking 15% of words creates a challenging prediction task. But images are spatially redundant: neighboring patches look almost identical. If you mask 15% of an image, the network can reconstruct the missing patches by interpolating from their neighbors — no semantic understanding needed, just texture extrapolation.
MAE's solution: mask 75% of patches. At this masking ratio, there simply aren't enough visible neighbors to interpolate from. The network is forced to understand global scene structure to fill in the gaps. What object is this? What's the scene layout? Where should the edges be?
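A minimal sketch of random patch masking at a 75% ratio, assuming a 224×224 image split into 196 patches of 16×16. The function name `random_masking` and the seed are illustrative; this only selects patch indices and does not touch pixels.

```python
import random

def random_masking(num_patches=196, mask_ratio=0.75, seed=0):
    """Split patch indices into a visible set and a masked set."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(order[:num_keep])   # fed to the encoder
    masked = sorted(order[num_keep:])    # reconstructed by the decoder
    return visible, masked

visible, masked = random_masking()
print(len(visible), "visible patches,", len(masked), "masked patches")
```

With only 49 of 196 patches visible, most masked patches have no visible immediate neighbor to interpolate from.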
Images need 5× higher masking than text (75% vs 15%) because images have much higher spatial redundancy. A pixel's value is highly predictable from its neighbors: knowing the pixel at (100, 200) tells you a lot about the pixel at (101, 200). Words in a sentence are not: "The cat sat on the ___" still leaves many plausible completions, and choosing among them requires understanding the sentence.
MAE uses a Vision Transformer (ViT) with an asymmetric encoder-decoder design:
The asymmetric design is crucial for efficiency. The heavy encoder (e.g., ViT-Large with 24 layers) only processes 25% of patches. The lightweight decoder (e.g., 8 layers) handles the full set. During pretraining, this gives a 3× speedup and uses less memory. At transfer time, you throw away the decoder entirely — only the encoder matters.
MAE achieves 86.9% top-1 on ImageNet with ViT-Huge (fine-tuned). Competitive with the best contrastive methods, and remarkably simpler — no negative pairs, no momentum encoders, no special augmentations.
But there's a catch: MAE's linear probe performance lags behind contrastive methods. It gets 75.8% with linear probe vs 78.2% for DINO. This suggests MAE learns features that are powerful but need fine-tuning to unlock — they're not as "ready to use" as contrastive features.
MAE is a modern instantiation of a decades-old idea: denoising autoencoders (Vincent et al., 2008). Corrupt the input, reconstruct the original. Masking is a specific corruption strategy. What changed is the architecture (transformers instead of shallow networks) and the masking ratio (75% instead of 30%). The principle is the same: learning robust representations by reconstruction.
A ViT-Large encoder with 24 layers. Image: 224×224, patch size 16×16 = 196 patches. Full processing: 24 layers × 196 tokens. MAE with 75% masking: 24 layers × 49 tokens (visible only). That's 4× fewer FLOPs in the encoder. The decoder is only 8 layers processing 196 tokens, much cheaper than a full encoder pass. Total savings: roughly 3× wall-clock speedup.
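The token arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch: it counts tokens per layer as a proxy for cost and ignores hidden width, so it does not reproduce the wall-clock figure. The variable names are illustrative.

```python
image_size, patch_size = 224, 16
mask_ratio = 0.75
encoder_layers = 24

num_patches = (image_size // patch_size) ** 2        # 14 x 14 grid
num_visible = int(num_patches * (1 - mask_ratio))    # tokens the encoder sees

# Cost terms that scale linearly with token count (e.g. MLP blocks).
full_token_layers = encoder_layers * num_patches
mae_token_layers = encoder_layers * num_visible
token_ratio = full_token_layers / mae_token_layers

# Self-attention scales quadratically with token count.
attention_ratio = (num_patches / num_visible) ** 2

print(num_patches, num_visible, token_ratio, attention_ratio)
```

The linear terms shrink by 4× and the attention term by 16×, so 4× is a conservative estimate of the encoder savings.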
Pretext tasks design puzzles in pixel space: predict rotations, fill in pixels, guess colors. But we don't actually care about pixels — we care about representations. What if we designed the learning signal directly in representation space?
That's contrastive learning. The intuition is almost embarrassingly simple:
Pull representations of similar things together. Push representations of different things apart. Do this with enough examples, and the embedding space organizes itself into semantically meaningful clusters — without ever seeing a label.
Every contrastive method needs to define what "similar" and "different" mean. The standard approach:
Positive pair: Two different augmentations of the same image. Crop it differently, change the colors, blur it — it's still the same scene, so the representations should be close.
Negative pair: Augmentations of different images. A photo of a cat and a photo of a truck should have distant representations, regardless of how you crop or color them.
A function that measures how "close" two representations are in embedding space. Usually cosine similarity: s(x, y) = (x · y) / (||x|| · ||y||). Ranges from −1 (opposite directions) to +1 (identical directions). We want s(x, x+) to be high and s(x, x−) to be low.
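The cosine similarity defined above, written out as a small function (the name `cosine_similarity` and the example vectors are illustrative):

```python
import math

def cosine_similarity(x, y):
    """s(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Because the vectors are normalized, only direction matters: scaling an embedding does not change its similarity to anything.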
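How do we turn "pull positives, push negatives" into a differentiable loss? The InfoNCE loss (Oord et al., 2018) frames it as an N-way classification problem. Given an anchor x, one positive x+, and N−1 negatives $\{x^-_1, \dots, x^-_{N-1}\}$, can we identify which one is the positive?
$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(s(x, x^+)/\tau)}{\exp(s(x, x^+)/\tau) + \sum_{i=1}^{N-1} \exp(s(x, x^-_i)/\tau)}$$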
Let's break this down:
Numerator: exp(s(x, x+) / τ) — the exponentiated similarity between the anchor and its positive. We want this to be large.
Denominator: The numerator plus the sum of exponentiated similarities with all negatives. This normalizes the expression into a probability.
Negative log: We're minimizing the negative log-probability that the positive is identified correctly. This is identical to the cross-entropy loss for a softmax classifier where the correct class is the positive.
Define the probability of selecting the positive from N candidates:
$$p(\text{positive} = x^+) = \frac{\exp(s(x, x^+)/\tau)}{\sum_{j=0}^{N-1} \exp(s(x, x_j)/\tau)}$$
This is exactly a softmax over N candidates with logits $s(x, x_j)/\tau$. The InfoNCE loss is the cross-entropy: L = −log p(positive = x+). Minimizing this maximizes the probability of correctly identifying the positive among N candidates.
Temperature τ controls how "peaked" the distribution is. Small τ (e.g., 0.05) makes the softmax very sharp — the loss strongly penalizes even slightly wrong rankings. Large τ (e.g., 1.0) makes it smooth — close negatives are tolerated.
In practice, τ = 0.07 or 0.1 works well. Too small and training is unstable (gradients explode). Too large and the loss becomes too easy (it can't distinguish hard negatives from easy ones).
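To see the effect of temperature concretely, the sketch below applies a softmax to the same similarities at two values of τ. The similarity values are made up for illustration (one positive, one hard negative, two easy negatives).

```python
import math

def softmax_with_temperature(similarities, tau):
    """Softmax over similarities / tau, computed stably."""
    scaled = [s / tau for s in similarities]
    peak = max(scaled)  # subtract the max to avoid overflow
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Positive first, then a hard negative and two easy negatives.
sims = [0.9, 0.85, 0.1, -0.2]
sharp = softmax_with_temperature(sims, tau=0.05)
smooth = softmax_with_temperature(sims, tau=1.0)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in smooth])
```

At small τ nearly all probability mass sits on the top two candidates, so the loss is dominated by the hard negative. At τ = 1.0 the distribution is much flatter and the easy and hard negatives are treated almost alike.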
InfoNCE gives a lower bound on the mutual information I(x; x+) between the two views: $I(x; x^+) \geq \log(N) - \mathcal{L}_{\text{InfoNCE}}$. Because the bound can never exceed log(N), a larger N raises that ceiling and lets the bound get tighter. In the limit of infinitely many negatives, minimizing InfoNCE maximizes the mutual information. This is why more negatives generally help: they bring the proxy loss closer to the true objective.
Anchor x, positive x+, negatives $x^-_1, x^-_2, x^-_3$. Cosine similarities: s(x, x+) = 0.9, $s(x, x^-_1) = 0.3$, $s(x, x^-_2) = 0.1$, $s(x, x^-_3) = -0.2$. With τ = 0.1:
Numerator: exp(0.9/0.1) = exp(9) = 8103
Denominator: exp(9) + exp(3) + exp(1) + exp(−2) = 8103 + 20.1 + 2.72 + 0.14 = 8126
L = −log(8103/8126) = −log(0.9972) = 0.0028 — a very low loss because the positive is much closer than any negative.
Now if $s(x, x^-_1) = 0.85$ (a "hard negative"): exp(8.5) = 4915. Denominator becomes 8103 + 4915 + 2.72 + 0.14 = 13021. L = −log(8103/13021) = 0.474, which is much higher! Hard negatives dominate the loss.
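The worked example above can be reproduced with a few lines of Python. This is a minimal sketch for a single anchor; the function name `info_nce` is illustrative, and the log-sum-exp trick is added only for numerical stability.

```python
import math

def info_nce(pos_sim, neg_sims, tau=0.1):
    """-log( exp(pos/tau) / (exp(pos/tau) + sum_i exp(neg_i/tau)) )."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_denominator - logits[0]

easy = info_nce(0.9, [0.3, 0.1, -0.2])    # all negatives are far away
hard = info_nce(0.9, [0.85, 0.1, -0.2])   # one hard negative

print(round(easy, 4), round(hard, 3))
```

Changing a single negative's similarity from 0.3 to 0.85 raises the loss by more than two orders of magnitude, which is the sense in which hard negatives dominate.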
SimCLR (Chen et al., 2020) showed that you don't need clever architectures or auxiliary tasks to do great contrastive learning. You just need three ingredients: strong augmentations, a projection head, and a big batch.
Given a batch of N images:
SimCLR's augmentation pipeline applies the following randomly:
| Augmentation | What It Does | Why It Matters |
|---|---|---|
| Random crop & resize | Take a random rectangle (8-100% of area), resize to 224×224 | Forces learning of scale-invariant and position-invariant features |
| Color distortion | Random brightness, contrast, saturation, hue jitter + random grayscale | Prevents the network from using color as a shortcut |
| Gaussian blur | Random blur with 50% probability | Forces learning of shape over texture |
| Horizontal flip | Flip left-right with 50% probability | Standard invariance |
The ablation is striking: random crop alone gives ~65% linear probe. Adding color distortion jumps to ~75%. The combination of crop + color distortion is worth 10+ percentage points. Without strong augmentations, the network learns trivial color histogram matching instead of semantic features.
Two random crops of the same image often share similar color statistics. Without color distortion, the network can match positives just by comparing average color — no semantic understanding needed. Color jitter forces the network to look past color and learn shape, texture, and object structure. This is the single most important design choice in SimCLR.
SimCLR found that adding a 2-layer MLP projection head g(·) after the encoder improves linear probe accuracy by 10+ percentage points. The contrastive loss is computed on z = g(h), but the representation used for downstream tasks is h (the encoder output, before the projection).
Why does this help? The contrastive loss rewards invariance to the augmentations, so the space it is applied to tends to discard whatever the augmentations change, such as color distribution or exact crop position. Some of that information is still useful for downstream tasks. Because the loss is computed on z, the projection head absorbs this invariance: z filters the information out, while the encoder output h is free to retain it.
SimCLR's negatives all come from the current batch. With batch size N, each anchor has 2(N−1) negatives. More negatives = tighter InfoNCE bound = better representations. The paper reports results with batch sizes of 4096 to 8192, requiring 32-128 TPUs. At batch size 256, accuracy drops by ~5 points.
SimCLR with batch size 4096 requires 32 TPU v3 cores for training. Each core has 16GB HBM. That's ~$10,000+ per training run. This isn't "simple" for most researchers. The batch size requirement was SimCLR's Achilles' heel, and it is exactly the constraint that MoCo's queue-based approach removes.
SimCLR v1 achieves 69.3% top-1 linear probe on ImageNet with ResNet-50. SimCLR v2 pushes this to 71.7% by using a deeper projection head (3-layer MLP) and a larger encoder (ResNet-50 4× width). With the wider network, fine-tuned performance reaches 76.5%.
Batch size N = 4096 images. After augmentation: 8192 views. For each anchor, there's 1 positive and 8190 negatives. That's an 8191-way classification problem per anchor. With N = 256: only 510 negatives. The loss surface is much less informative: the model can "cheat" by finding a few easy negatives and ignoring the rest.
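The bookkeeping behind these counts can be sketched as follows. The layout is an assumption made for illustration: view i and view i + N are taken to be the two augmentations of image i, which is one common convention but not the only one.

```python
def simclr_pairs(batch_size):
    """Return (positive index for each view, negatives per anchor).

    Assumed layout: views 0..N-1 are the first augmentation of each image,
    views N..2N-1 are the second augmentation of the same images.
    """
    num_views = 2 * batch_size
    positive_of = [(i + batch_size) % num_views for i in range(num_views)]
    negatives_per_anchor = num_views - 2  # everything except self and positive
    return positive_of, negatives_per_anchor

positive_of, negatives_large = simclr_pairs(4096)
_, negatives_small = simclr_pairs(256)
print(negatives_large, negatives_small)
```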
SimCLR's insight was right: more negatives help. But SimCLR's solution — giant batches — was brute force. MoCo (He et al., 2020) asked a cleverer question: can we decouple the number of negatives from the batch size?
MoCo maintains a running queue of encoded representations (keys) from recent mini-batches. This queue acts as a large, consistent dictionary of negatives. With a queue of size 65,536, every anchor has 65,536 negatives — regardless of whether the batch size is 256 or 64.
Each training step: encode the current batch to produce queries (from the query encoder) and keys (from the key encoder). The positive pair is (query, key) from the same image. The negatives are all keys in the queue. After each step, enqueue the new keys and dequeue the oldest ones. First-in, first-out.
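The first-in, first-out behavior maps naturally onto a bounded deque. The sketch below is a toy illustration: the class name `KeyQueue` is hypothetical, and the "keys" are plain integers standing in for encoded feature vectors.

```python
from collections import deque

class KeyQueue:
    """Fixed-size FIFO of keys used as negatives."""

    def __init__(self, max_size):
        # A deque with maxlen drops the oldest items automatically.
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)

    def negatives(self):
        return list(self.keys)

queue = KeyQueue(max_size=8)
queue.enqueue([0, 1, 2, 3])    # step 1
queue.enqueue([4, 5, 6, 7])    # step 2: queue is now full
queue.enqueue([8, 9, 10, 11])  # step 3: keys 0..3 are dequeued
print(queue.negatives())
```

The number of negatives is set by `max_size`, not by the batch size, which is the decoupling MoCo is after.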
Here's the subtle problem: the keys in the queue were encoded by different versions of the encoder at different training steps. If the encoder changes rapidly, old keys in the queue are inconsistent with new ones — they represent a stale, different feature space. This inconsistency hurts training.
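MoCo's solution: update the key encoder very slowly via an exponential moving average (momentum update): $\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$, where $\theta_q$ and $\theta_k$ are the parameters of the query and key encoders and m is the momentum coefficient (e.g., m = 0.999).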
Only the query encoder receives gradients from the loss. The key encoder is never trained directly — it evolves slowly via momentum. This ensures that keys from 100 steps ago and keys from the current step live in approximately the same feature space.
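A minimal sketch of the momentum update on flat parameter lists. Real encoders hold tensors rather than Python floats, and the function name `momentum_update` is illustrative.

```python
def momentum_update(key_params, query_params, m=0.999):
    """Move each key parameter a small step toward the query parameter."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

key = [0.0, 0.0]
query = [1.0, -2.0]   # held fixed here to isolate the averaging behavior

one_step = momentum_update(key, query)

many_steps = key
for _ in range(1000):
    many_steps = momentum_update(many_steps, query)

print(one_step, many_steps)
```

With m = 0.999 a single step moves the key parameters only 0.1% of the way toward the query parameters, which is what keeps consecutive keys in nearly the same feature space.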
Think of the key encoder as a slow-moving average of the query encoder. With m = 0.999, after 1000 steps the key encoder has absorbed roughly 63% of the query encoder's updates ($1 - 0.999^{1000} \approx 0.63$). The queue, however, holds only 256 batches of keys (65536/256), so the oldest key is just 256 steps old. Over that span the key encoder keeps $0.999^{256} \approx 77\%$ of the weights it had when that key was encoded, which is close enough for contrastive learning to work. With m = 0.9, that fraction is effectively zero ($0.9^{256} \approx 2 \times 10^{-12}$): the oldest key comes from a completely different network and is too stale.
MoCo v1 used weak augmentations and no projection head. MoCo v2 (Chen et al., 2020) borrowed SimCLR's key innovations — strong augmentations (crop + color distortion) and an MLP projection head — and combined them with MoCo's queue. The result: 71.1% linear probe on ImageNet with ResNet-50, matching SimCLR v2 while using 32× smaller batch size (256 vs 8192).
| Method | Batch Size | Negatives | Hardware | Linear Probe |
|---|---|---|---|---|
| SimCLR v1 | 4096 | 8190 | 32 TPUs | 69.3% |
| MoCo v1 | 256 | 65536 | 8 GPUs | 60.6% |
| MoCo v2 | 256 | 65536 | 8 GPUs | 71.1% |
| SimCLR v2 | 4096 | 8190 | 32 TPUs | 71.7% |
Queue size K = 65536, batch size B = 256. Each step adds 256 new keys and removes 256 old ones. The queue holds 65536/256 = 256 batches worth of keys. The oldest keys are from 256 steps ago. With momentum m = 0.999 and 256 updates, the key encoder has drifted by $1 - 0.999^{256} \approx 22.6\%$ from when the oldest keys were encoded. That's small enough for consistent contrastive targets.
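The staleness arithmetic is easy to verify, and the same few lines show why a smaller momentum fails. Variable names are illustrative.

```python
queue_size = 65536
batch_size = 256

# Age of the oldest key, measured in training steps.
oldest_key_age = queue_size // batch_size

def drift(momentum, steps):
    """Fraction of the key encoder replaced after `steps` momentum updates."""
    return 1.0 - momentum ** steps

drift_slow = drift(0.999, oldest_key_age)
drift_fast = drift(0.9, oldest_key_age)

print(oldest_key_age, round(drift_slow, 3), round(drift_fast, 3))
```

At m = 0.999 about 23% of the encoder has changed by the time a key leaves the queue. At m = 0.9 essentially all of it has.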
An earlier approach (Wu et al., 2018) stored a memory bank — one representation per image in the entire dataset. Updated only when that image appeared in a batch. Problem: with 1.28M images and 256 batch size, each representation is updated once every 5000 steps — massively stale. MoCo's queue is better because it only keeps recent keys (from the last ~256 steps), ensuring freshness.
SimCLR and MoCo contrast static views: two crops of the same image. But what if your data has sequential structure — audio, video, text? Contrastive Predictive Coding (CPC, Oord et al., 2018) extends contrastive learning to sequences by predicting future representations from current context.
Given a sequence of observations $x_1, x_2, \dots, x_T$ (e.g., audio frames, image patches scanned top-to-bottom, or text tokens):
The InfoNCE loss discriminates the true future $z_{t+k}$ from random negatives (z values sampled from other positions or other sequences).
CPC predicts representations, not raw observations. Predicting raw audio samples would require modeling every detail of the waveform. Predicting representations lets the model focus on high-level structure: what word will come next, what note will play next, what object will appear next.
CPC learns to predict the future, not in pixel/sample space but in a learned representation space. This is exactly what a world model does: compress observations into states and predict how states evolve. CPC's autoregressive model $g_{ar}$ is a primitive world model. The contrastive loss ensures the representations capture the features that are predictable and useful for prediction, discarding noise.
| Domain | Observation $x_t$ | Sequence | Result |
|---|---|---|---|
| Audio | 25ms audio frame | Speech waveform | SOTA phone classification; competitive speech recognition |
| Images | Image patch | Patches scanned top→bottom | Predict bottom patches from top; learns spatial structure |
| Text | Sentence | Book paragraphs | Predict next sentence; competitive NLI |
| Video | Video frame | Video clip | Action recognition features |
Without the autoregressive model, CPC would be predicting $z_{t+k}$ from $z_t$ alone. With the autoregressive model, it predicts from $c_t = g_{ar}(z_1, \dots, z_t)$, which summarizes all past observations. This is critical for sequences with long-range dependencies: the next word depends not just on the previous word, but on the entire paragraph so far.
Audio at 16kHz, encoded into 100Hz representations (one z every 10ms). Context at t = 500 (= 5 seconds). Predict k = 1 to 12 steps ahead (10-120ms). Positives: the actual z at t + k. Negatives: z values from random positions in the same or other audio clips. The model learns that after "cat sat on the," the representation for "mat" is more likely than the representation for "helicopter." It captures semantic and syntactic patterns without transcription labels.
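The timing in this example follows directly from the two sampling rates. A short sketch of the conversions, with illustrative variable names:

```python
sample_rate_hz = 16_000   # raw audio
feature_rate_hz = 100     # one representation z per 10 ms

samples_per_step = sample_rate_hz // feature_rate_hz  # raw samples per z
step_ms = 1000 // feature_rate_hz                     # duration of one step

context_index = 500
context_seconds = context_index / feature_rate_hz

# Prediction horizons for k = 1..12 steps ahead, in milliseconds.
horizons_ms = [k * step_ms for k in range(1, 13)]

# Index of the positive target for each horizon.
target_indices = [context_index + k for k in range(1, 13)]

print(samples_per_step, context_seconds, horizons_ms[0], horizons_ms[-1])
```

Each representation summarizes 160 raw samples, so predicting 12 steps ahead means predicting across 1,920 samples without ever modeling the waveform itself.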
Contrastive methods need negative pairs. What if you didn't? DINO (Caron et al., 2021) takes a radically different approach: self-distillation. It's a teacher-student framework where the teacher is the student — or rather, a slow-moving average of it.
DINO creates two networks from the same architecture (typically a ViT):
Student network: Receives local crops (small patches, ~96×96 pixels, covering ~25% of the image). Trained with gradient descent.
Teacher network: Receives global crops (large patches, ~224×224 pixels, covering ~50-100% of the image). Updated via exponential moving average of the student (like MoCo's key encoder). No gradients flow through the teacher.
The loss is not contrastive. It's cross-entropy between output distributions:
Both networks output a probability distribution over K dimensions (e.g., K = 65536) via a softmax layer. The student must learn to produce the same distribution as the teacher, even though the student sees only a small crop while the teacher sees the whole image.
Without negatives, what prevents the trivial solution where both networks output the same constant vector for every image? Two mechanisms:
Centering: Subtract a running mean from the teacher's output before softmax: $p_t = \mathrm{softmax}((g_t(x) - c) / \tau_t)$, where c is an exponential moving average of the teacher's outputs across the dataset. Centering prevents any single dimension from dominating, making the "collapse to one constant" solution unstable.
Sharpening: Use a low temperature $\tau_t = 0.04$ for the teacher (producing sharp, peaked distributions) and a higher temperature $\tau_s = 0.1$ for the student. The teacher commits strongly to its prediction; the student has to work harder to match this peaked target. A uniform (collapsed) output can never match a sharp target.
Consider what collapse looks like: both networks output the same vector c for every image. The cross-entropy loss would be H(softmax(c/τ), softmax(c/τ)) = constant. But with centering, we subtract c from the teacher's output, giving softmax((c − c)/τ) = softmax(0) = uniform distribution. A uniform teacher distribution has maximum entropy — it provides no training signal. The student can't reduce the loss by outputting a constant. To reduce the loss, the teacher must output non-uniform distributions, which requires producing different outputs for different images. Centering makes collapse a saddle point, not a minimum.
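Centering and sharpening can be sketched on toy logits as below. The temperatures match the values quoted above; the center momentum of 0.9, the three-dimensional outputs, and all function names are assumptions made for illustration.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax, computed stably."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_probs(teacher_logits, center, tau_t=0.04):
    """Center the teacher output, then sharpen with a low temperature."""
    return softmax([t - c for t, c in zip(teacher_logits, center)], tau_t)

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student distributions."""
    p_t = teacher_probs(teacher_logits, center, tau_t)
    p_s = softmax(student_logits, tau_s)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))

def update_center(center, teacher_batch, momentum=0.9):
    """Exponential moving average of the teacher's mean output."""
    batch_mean = [sum(col) / len(teacher_batch) for col in zip(*teacher_batch)]
    return [momentum * c + (1 - momentum) * b
            for c, b in zip(center, batch_mean)]

center = [0.0, 0.0, 0.0]
teacher = [2.0, 0.0, -1.0]
agree = dino_loss([2.0, 0.0, -1.0], teacher, center)
disagree = dino_loss([-1.0, 0.0, 2.0], teacher, center)

# Collapse: the teacher outputs exactly the center for every image.
collapsed = teacher_probs([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(round(agree, 4), round(disagree, 2), collapsed)
```

When the teacher output equals the center, the centered logits are all zero and the target becomes uniform, which matches the collapse argument above.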
The most striking result from DINO: when trained with a ViT, the self-attention maps learn to segment objects without any segmentation supervision. The [CLS] token's attention in the last layer highlights the main object in the scene — cleanly separating foreground from background.
This happens because the global-local crop asymmetry forces the network to understand what the object is (which persists across crops) rather than where it is (which changes between crops). Object identity is precisely what segmentation captures.
Contrastive methods (SimCLR, MoCo) define similarity through data augmentation and enforce it via a push-pull loss. DINO achieves the same goal through a completely different mechanism: self-distillation with asymmetric views. The teacher sees more context (global crops) and "knows more"; the student sees less (local crops) and must learn to match the teacher's understanding. No negatives needed.
DINOv2 (Oquab et al., 2023) combines DINO's self-distillation with several improvements: a curated training dataset (LVD-142M), iBOT's masked image modeling objective, and a ViT-g architecture with 1.1B parameters. The result is arguably the best general-purpose visual feature extractor available today.
| Property | DINO | DINOv2 |
|---|---|---|
| Architecture | ViT-S/16 to ViT-B/8 | ViT-S/14 to ViT-g/14 |
| Training data | ImageNet-1K (1.28M) | LVD-142M (curated) |
| Objectives | Self-distillation only | Self-distillation + masked image modeling |
| Linear probe (IN-1K) | 77.0% (ViT-S/16) | 81.1% (ViT-S/14) |
| Key feature | Emergent segmentation | Universal visual features (depth, segmentation, retrieval) |
Input image: 224×224. Teacher receives 2 global crops of size 224×224 (covering 50-100% of the original, resized). Student receives 6 local crops of size 96×96 (covering ~20% each). Total forward passes: 2 through teacher + 6 through student = 8 views. The loss is computed for all (global, local) pairs: 2 × 6 = 12 cross-entropy terms. The student must extract the "essence" of the image from tiny crops to match the teacher's holistic understanding.
Now let's see contrastive learning in action. The simulation below shows data points in a 2D embedding space. Points are colored by their true class (which the algorithm never sees). Each training step creates augmented pairs, then pulls positives together and pushes negatives apart.
Watch how the clusters form over training. Adjust the temperature τ to see how it affects separation sharpness. Try different augmentation strengths to see how weak augmentation leads to shortcut features.
Low τ (0.05): Very sharp separation. Clusters form quickly but might not merge well — subgroups within a class stay separate. High τ (1.0): Soft separation. Everything blends together — too tolerant of wrong matches. Sweet spot (~0.1-0.2): Clean clusters with good within-class cohesion. Try low augmentation strength (~0.02) and watch how the algorithm fails to learn — augmented views are too similar to the original, so the network doesn't need to learn invariances.
SSL has gone through four phases, each solving a limitation of the previous:
| Era | Approach | Key Methods | Limitation Solved | Remaining Problem |
|---|---|---|---|---|
| 2015-2018 | Pretext tasks | Rotation, Jigsaw, Colorization | Learn without labels | Shortcut solutions, weak features |
| 2018-2020 | Contrastive | SimCLR, MoCo, CPC | Learn in feature space, stronger features | Need many negatives, large batches |
| 2021-2022 | Self-distillation | BYOL, DINO, VICReg | No negatives needed, simpler training | Requires careful collapse prevention |
| 2022+ | Masked modeling | MAE, BEiT, data2vec | Simpler objective, efficient training | Needs fine-tuning (weaker linear probe) |
| Property | SimCLR | MoCo v2 | MAE | DINO |
|---|---|---|---|---|
| Loss type | InfoNCE | InfoNCE | MSE recon. | Cross-entropy distill. |
| Needs negatives? | Yes (batch) | Yes (queue) | No | No |
| Batch size | 4096+ | 256 | 4096 | 1024 |
| Architecture | ResNet | ResNet | ViT | ViT |
| Linear probe (IN-1K) | 69.3% | 71.1% | 75.8%* | 77.0% |
| Fine-tune (IN-1K) | 76.5% | ~77% | 86.9%* | 82.8% |
| Key trick | Strong augmentations | Momentum queue | 75% masking | Centering + sharpening |
* MAE uses ViT-Huge; others use ResNet-50 or ViT-Small for fair comparison where noted. Numbers are approximate and depend on exact settings.
Every method in this lecture creates an artificial supervision signal from the structure of the data itself. Pretext tasks exploit spatial structure (rotation, jigsaw). Contrastive methods exploit augmentation invariance (two crops of the same image should be similar). Masked models exploit reconstruction (visible patches predict masked ones). Self-distillation exploits view asymmetry (local crop should match global understanding). The core insight is always the same: the data knows more than any label set.
1. Augmentation matters. Every method benefits from strong, diverse augmentations. Crop + color distortion is the minimum viable set. Without them, networks find trivial shortcuts.
2. Projection heads help. Computing the loss in a separate projection space (discarded at transfer) consistently improves representation quality across SimCLR, MoCo, BYOL, and DINO.
3. Scale helps. More data, bigger encoders, and longer training all improve SSL representations. Unlike supervised learning, SSL has no labeling bottleneck — you can always add more data.
4. Linear probe vs fine-tune is a real distinction. MAE is weaker on linear probe but stronger on fine-tuning than contrastive methods. Choose your evaluation to match your deployment scenario.
Self-supervised learning connects deeply to several other topics covered in this series:
CLIP extends contrastive learning to vision-language pairs. Instead of two image crops as the positive pair, CLIP uses an image and its text caption. This gives rise to zero-shot classification — the representation space is organized by language, not by a fixed label set. See the contrastive-clip lesson.
Transformers are the backbone architecture for modern SSL. ViT's patch-based design enables MAE's efficient masking strategy and DINO's attention-based segmentation. See the transformer lesson.
Generative models (VAEs, diffusion) are related to masked modeling — both learn by reconstruction. The difference: SSL focuses on the encoder (representation quality), while generative models focus on the decoder (sample quality). See VAE/VQ-VAE and diffusion.
| Paper | Year | Key Contribution |
|---|---|---|
| Doersch et al., "Unsupervised Visual Representation Learning by Context Prediction" | 2015 | Relative patch location pretext task |
| Noroozi & Favaro, "Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles" | 2016 | Jigsaw permutation prediction |
| Pathak et al., "Context Encoders: Feature Learning by Inpainting" | 2016 | Image inpainting as pretext task |
| Zhang et al., "Colorful Image Colorization" | 2016 | Colorization as pretext task |
| Gidaris et al., "Unsupervised Representation Learning by Predicting Image Rotations" | 2018 | Rotation prediction |
| Oord et al., "Representation Learning with Contrastive Predictive Coding" | 2018 | InfoNCE loss, CPC framework |
| He et al., "Momentum Contrast for Unsupervised Visual Representation Learning" | 2020 | MoCo, momentum encoder, queue |
| Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" | 2020 | SimCLR, projection head, strong augmentations |
| Caron et al., "Emerging Properties in Self-Supervised Vision Transformers" | 2021 | DINO, self-distillation, emergent segmentation |
| He et al., "Masked Autoencoders Are Scalable Vision Learners" | 2022 | MAE, 75% masking, asymmetric encoder-decoder |
| Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision" | 2023 | Scaled DINO, universal visual features |
Self-supervised learning creates its own labels from unlabeled data — via puzzles, augmentation invariance, masking, or self-distillation — and the representations it learns rival or surpass supervised ones.