Caron, Touvron, Misra, Jégou, Mairal, Bojanowski, Joulin (Facebook AI, 2021)

Emerging Properties in Self-Supervised Vision Transformers

Self-distillation with no labels: a student-teacher framework where Vision Transformers spontaneously learn object segmentation, scene layout, and features so good that a simple k-NN achieves competitive ImageNet accuracy.

Prerequisites: Vision Transformers + Self-supervised learning basics
10 Chapters · 5+ Simulations

Chapter 0: The Problem

By 2021, self-supervised learning had produced spectacular results on images — but almost entirely with convolutional networks. Methods like MoCo, BYOL, and SwAV trained ResNets to produce features that rivaled supervised learning, closing the gap on ImageNet without using a single label.

Meanwhile, Vision Transformers (ViTs) had arrived. They worked well with supervision, but they hadn't shown anything special. They were computationally expensive, required more training data, and their features looked... ordinary. No emergent properties. No surprises.

This was puzzling. In NLP, the magic of Transformers came from self-supervised pretraining — BERT's masked language modeling, GPT's next-token prediction. These self-supervised objectives provided a richer learning signal than "predict a single label per sentence." Could the same be true for vision?

The central question: Does the muted success of Vision Transformers come from training them with supervision? What happens if we train ViTs with self-supervised learning instead? Do new properties emerge that don't appear with supervised training or with CNNs?

The answer turned out to be yes, and dramatically so. Self-supervised ViTs spontaneously learn to segment objects without any segmentation labels. Their attention maps contain explicit information about scene layout. Their features are so well-organized that a simple k-nearest-neighbor classifier (no training at all) achieves competitive accuracy. None of these properties emerge as clearly with supervised ViTs or with self-supervised CNNs.

Why was the performance of Vision Transformers considered "muted" before DINO?

Chapter 1: The Key Insight

DINO stands for self-distillation with no labels. The insight is beautifully simple: take two copies of the same network — call one the student and one the teacher. Show them different augmented views of the same image. Train the student to match the teacher's output. Update the teacher as a slow-moving average of the student.

That's it. No labels. No contrastive pairs. No memory bank. No clustering. Just: "student, match the teacher. Teacher, slowly absorb the student."

The student-teacher dance

Here's how the information flows:

  1. Take an image. Create two different random augmentations (crops, color jitter, blur).
  2. Pass one view through the student network. Pass the other through the teacher network.
  3. Both output a probability distribution over K dimensions (via softmax).
  4. Compute cross-entropy loss between teacher output and student output.
  5. Backpropagate through the student only (stop gradient on teacher).
  6. Update the teacher via exponential moving average (EMA) of the student weights.

Why does this work? The teacher is a smoothed, ensembled version of the student: an exponentially weighted average of all past student checkpoints. This averaging makes the teacher more stable, and typically higher-quality, than the student at any given moment, so the student is always chasing a better version of itself. The asymmetry (EMA teacher + stop gradient) stabilizes training, but it is not enough on its own: avoiding collapse to a trivial solution also relies on the centering and sharpening of the teacher output described in Chapter 2.
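
The EMA step in the list above can be sketched in a few lines. This is a minimal, dependency-free illustration: the "networks" are dictionaries holding a single scalar weight, and the names `ema_update`, `student`, `teacher` and the momentum value 0.9 are hypothetical choices made for readability, not values from the paper.

```python
def ema_update(teacher, student, momentum):
    """Return teacher weights moved a small step toward the student weights."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

# Toy "networks": a single scalar weight each (hypothetical values).
student = {"w": 1.0}
teacher = {"w": 0.0}

for step in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
    print(step, round(teacher["w"], 4))
```

With a momentum close to 1, the teacher moves only a small fraction of the way toward the student at each step, which is why it behaves like an average over many past student states.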

The DINO training loop: student learns from teacher, teacher slowly absorbs student via EMA.

How is the teacher network updated in DINO?

Chapter 2: The DINO Framework

The full DINO framework adds two critical design choices on top of the student-teacher core: a multi-crop augmentation strategy and a centering + sharpening mechanism to avoid collapse.

Multi-crop strategy

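DINO generates multiple views from each image:

  - Two global views: large crops at 224×224 resolution, each covering more than half of the image.
  - Several local views: small crops at 96×96 resolution, each covering less than half of the image.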

The key asymmetry: all views go through the student, but only the global views go through the teacher. This encourages local-to-global correspondences — the student must learn that a small crop of a dog's ear belongs to the same concept as the full image of the dog.
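
To make the asymmetry above concrete, the sketch below enumerates which (teacher view, student view) pairs contribute to the loss. The function name `loss_pairs` and the choice of 6 local views are assumptions for illustration only.

```python
def loss_pairs(n_global, n_local):
    """List (teacher_view, student_view) index pairs used in the loss.

    Views 0..n_global-1 are global; the remaining views are local.
    """
    n_views = n_global + n_local
    return [
        (t, s)
        for t in range(n_global)   # the teacher sees global views only
        for s in range(n_views)    # the student sees every view
        if s != t                  # a view is never compared with itself
    ]

pairs = loss_pairs(n_global=2, n_local=6)
print(len(pairs))  # 2 * (2 + 6 - 1) = 14
```

Every local view appears only on the student side, so each local crop is always scored against a teacher output computed from a global crop.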

Multi-crop: 2 global views (teacher + student) and multiple local views (student only).

Centering and sharpening

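Without careful design, the teacher can collapse, outputting the same distribution for every input (a uniform or single-spike output). DINO prevents this with two complementary operations applied to the teacher's output:

  - Centering: subtract a running mean of the teacher's outputs, so that no single dimension can dominate.
  - Sharpening: use a low softmax temperature, so that the output distribution is peaked rather than flat.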

Centering alone would push toward uniform distributions (maximum entropy). Sharpening alone would push toward one-hot outputs (minimum entropy). Together, they balance each other: the teacher produces peaked but diverse distributions.

Why this matters: Most self-supervised methods need complex mechanisms to avoid collapse — contrastive losses with large batches (SimCLR), memory banks (MoCo), predictor heads (BYOL), or clustering (SwAV). DINO needs only centering + sharpening + EMA. This simplicity is a key contribution.
Why do only global views go through the teacher while all views go through the student?

Chapter 3: Knowledge Distillation Without Labels

Traditional knowledge distillation works like this: train a large "teacher" model with labels, then train a smaller "student" to mimic the teacher's soft outputs. The key innovation of DINO is removing every component that requires labels.

The loss function

Both student and teacher output a K-dimensional vector (K = 65536 in practice). These are converted to probability distributions via temperature-scaled softmax:

$$P_s(x)^{(i)} = \frac{\exp\left(g_{\theta_s}(x)^{(i)} / \tau_s\right)}{\sum_{k=1}^{K} \exp\left(g_{\theta_s}(x)^{(k)} / \tau_s\right)}$$

The student temperature τs = 0.1 (fairly sharp). The teacher temperature τt is warmed up from 0.04 to 0.07 over the first 30 epochs (very sharp — this is the "sharpening").
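
As a rough illustration of these two temperatures, the sketch below implements a temperature-scaled softmax and a warm-up schedule for the teacher temperature. The linear shape of the warm-up, the function names, and the toy logits are assumptions made for this example.

```python
import math

def teacher_temperature(epoch, start=0.04, end=0.07, warmup_epochs=30):
    """Assumed linear warm-up from `start` to `end`, then constant."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2, 0.1, 0.0]  # hypothetical head outputs
p_student = softmax(logits, temperature=0.1)
p_teacher = softmax(logits, temperature=teacher_temperature(0))
print(max(p_student), max(p_teacher))
```

On the same logits, the lower teacher temperature puts more probability on the largest entry than the student temperature does, which is the sharpening effect.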

The loss is a standard cross-entropy between teacher and student distributions:

$$L = H\left(P_t(x), P_s(x')\right) = -\sum_{i=1}^{K} P_t(x)^{(i)} \log P_s(x')^{(i)}$$

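With multi-crop, this becomes a sum over all pairs of (global teacher view, any other student view), where $x_1^g, x_2^g$ are the two global views and $V$ is the set of all views:

$$\min_{\theta_s} \sum_{x \in \{x_1^g,\, x_2^g\}} \; \sum_{x' \in V,\; x' \neq x} H\left(P_t(x), P_s(x')\right)$$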

The teacher output is centered before the softmax: $g_{\theta_t}(x) \leftarrow g_{\theta_t}(x) - c$, where $c$ is the EMA of batch means.

DINO pseudocode

# gs, gt: student and teacher networks (same architecture)
# C: center, a K-dim vector
# tps, tpt: student and teacher temperatures
# l, m: EMA momentum rates for the teacher network and the center
gt.params = gs.params  # teacher starts as a copy of the student
for x in loader:  # x: a minibatch of n images
    x1, x2 = augment(x), augment(x)  # two random views
    s1, s2 = gs(x1), gs(x2)  # student outputs, n-by-K
    t1, t2 = gt(x1), gt(x2)  # teacher outputs, n-by-K
    loss = H(t1, s2)/2 + H(t2, s1)/2  # each view is scored against the other
    loss.backward()  # gradients reach the student only
    update(gs)  # SGD on student only
    gt.params = l*gt.params + (1-l)*gs.params  # EMA teacher update
    C = m*C + (1-m)*cat([t1, t2]).mean(dim=0)  # EMA center update

def H(t, s):
    t = t.detach()  # stop gradient on teacher
    s = softmax(s / tps, dim=1)
    t = softmax((t - C) / tpt, dim=1)  # center + sharpen
    return -(t * log(s)).sum(dim=1).mean()

No labels anywhere. Look at the pseudocode: there is no y, no label, no target_class. The teacher's output IS the target. The system bootstraps its own supervision. The only input is raw images.
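
For readers who want to run the loss on toy numbers, here is a dependency-free sketch of the function H from the pseudocode. It is an illustration under stated assumptions, not the authors' implementation: the names `log_softmax` and `dino_loss` are hypothetical, outputs are plain lists of floats, and since there is no autograd here the stop-gradient is implicit.

```python
import math

def log_softmax(row, temperature):
    """Numerically stable log of a temperature-scaled softmax."""
    scaled = [z / temperature for z in row]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]

def dino_loss(teacher_out, student_out, center, tps=0.1, tpt=0.04):
    """Batch-mean cross-entropy with a centered, sharpened teacher."""
    total = 0.0
    for t_row, s_row in zip(teacher_out, student_out):
        centered = [z - c for z, c in zip(t_row, center)]
        t = [math.exp(v) for v in log_softmax(centered, tpt)]  # teacher probs
        log_s = log_softmax(s_row, tps)                        # student log-probs
        total += -sum(ti * lsi for ti, lsi in zip(t, log_s))
    return total / len(teacher_out)

# Hypothetical 2-sample batch with K = 3 output dimensions.
teacher_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
center = [0.0, 0.0, 0.0]
agree = dino_loss(teacher_out, teacher_out, center)
disagree = dino_loss(teacher_out, [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]], center)
print(agree < disagree)
```

The loss is small when the student ranks the output dimensions the same way as the teacher and large when it does not, with no label involved at any point.
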
What serves as the "label" in DINO's cross-entropy loss?

Chapter 4: Avoiding Collapse

Collapse is the nightmare of self-supervised learning. Without labels to anchor the features, the network can find trivial shortcuts — outputting the exact same representation for every input. The loss goes to zero, but the features are useless.

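Different methods have different anti-collapse mechanisms:

  - SimCLR: contrastive loss with large batches
  - MoCo: contrastive loss with a memory bank and a momentum encoder
  - BYOL: predictor head plus a momentum encoder
  - SwAV: online clustering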

DINO's three mechanisms work together:

1. Momentum teacher (EMA)

The teacher parameters θt are an exponential moving average of the student: θt ← λθt + (1−λ)θs. The momentum λ follows a cosine schedule from 0.996 to 1.0 during training. This means the teacher changes very slowly — it's a smoothed ensemble of many past student states. If the student starts collapsing, the teacher still retains diverse representations from before the collapse began.
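
A cosine schedule of this kind can be written as follows. The exact formula is an assumption chosen to start at 0.996 and end at 1.0; the function name `teacher_momentum` is hypothetical.

```python
import math

def teacher_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule: equals `base` at step 0 and `final` at the last step."""
    progress = step / total_steps
    return final - (final - base) * (1.0 + math.cos(math.pi * progress)) / 2.0

for step in (0, 50, 100):
    print(step, round(teacher_momentum(step, total_steps=100), 4))
```

As the momentum approaches 1.0, each update moves the teacher less, so the teacher is effectively frozen by the end of training.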

2. Centering

The center c is the running mean of the teacher's output over the batch:

$$c \leftarrow m\,c + (1 - m)\,\frac{1}{B}\sum_{i=1}^{B} g_{\theta_t}(x_i)$$

This is subtracted from the teacher output before softmax. Without centering, one output dimension could dominate — the teacher would collapse to a one-hot vector that's the same for every input. Centering prevents this by keeping the mean output at zero.
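
The center update and its effect can be sketched on toy numbers. The function name `update_center`, the rate m = 0.9, and the two-sample batch are assumptions for illustration.

```python
def update_center(center, teacher_batch, m=0.9):
    """EMA of the per-dimension batch mean of raw teacher outputs."""
    batch_size = len(teacher_batch)
    batch_mean = [sum(col) / batch_size for col in zip(*teacher_batch)]
    return [m * c + (1.0 - m) * mu for c, mu in zip(center, batch_mean)]

center = [0.0, 0.0]
teacher_batch = [[4.0, 0.0], [2.0, 0.0]]  # hypothetical: dimension 0 dominates
for _ in range(50):
    center = update_center(center, teacher_batch)

# Subtracting the center removes the offset shared by all samples
# but keeps the differences between samples.
centered = [[z - c for z, c in zip(row, center)] for row in teacher_batch]
print([round(c, 3) for c in center])
print([[round(z, 3) for z in row] for row in centered])
```

After centering, dimension 0 is no longer large for every input: it is positive for one sample and negative for the other, so it can no longer win the softmax regardless of the image.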

3. Sharpening

The teacher uses a very low temperature τt = 0.04–0.07 in its softmax. This makes the output distribution peaked (high confidence). Without sharpening, centering alone would push toward a uniform distribution — the teacher would say "every class is equally likely" for every input. That's also collapse (just in the other direction).

Three output modes: uniform collapse (centering only), dominant dimension (no centering), and healthy behavior (centering + sharpening).

The balancing act: Centering prevents single-dimension dominance but encourages uniform outputs. Sharpening prevents uniform outputs but could encourage dominance. Together they balance: the teacher produces peaked, diverse, centered distributions. This is sufficient to avoid collapse when combined with the momentum teacher.
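
One way to tell the three modes apart is to compare two entropies: the average entropy of each sample's output, and the entropy of the batch-averaged output. The sketch below is a hypothetical diagnostic (the names `entropy` and `diagnose` and the toy batches are assumptions), not something described in the paper.

```python
import math

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def diagnose(batch):
    """Return (mean per-sample entropy, entropy of the batch-mean distribution)."""
    per_sample = sum(entropy(p) for p in batch) / len(batch)
    mean_dist = [sum(col) / len(batch) for col in zip(*batch)]
    return per_sample, entropy(mean_dist)

# Hypothetical teacher outputs for 4 images over K = 4 dimensions.
uniform_collapse = [[0.25] * 4] * 4
dominant_dimension = [[1.0, 0.0, 0.0, 0.0]] * 4
healthy = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

for name, batch in [("uniform", uniform_collapse),
                    ("dominant", dominant_dimension),
                    ("healthy", healthy)]:
    print(name, diagnose(batch))
```

Uniform collapse shows high entropy on both measures, dominant-dimension collapse shows low entropy on both, and the healthy case combines low per-sample entropy (peaked) with high batch-level entropy (diverse).
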
What would happen if DINO used centering but NOT sharpening?

Chapter 5: Emergent Segmentation

This is the headline result — the property that made DINO famous. When you train a ViT with DINO and visualize the self-attention maps of the [CLS] token in the last layer, something remarkable appears: the attention maps spontaneously learn to segment objects.

No segmentation labels. No bounding boxes. No pixel-level supervision of any kind. The model just... learns that objects are things, and attends to them.

How does this work?

In a Vision Transformer, the input image is split into patches (e.g., 8×8 or 16×16 pixels each). A special [CLS] token is prepended to the sequence. Through 12 layers of self-attention, the [CLS] token learns to attend to the patches that are most informative for representing the image.

With supervised training, the [CLS] token attends diffusely: it spreads attention across the image without clear spatial structure. But with DINO's self-supervised training, each attention head in the last layer learns to focus on a semantically meaningful region, such as the object interior, its boundaries, or the background.

The different heads provide complementary views, and together they form a segmentation mask that accurately delineates object boundaries.

Simulated [CLS] token attention from different heads. Each head attends to different semantic regions.

The magic: The [CLS] token was never told "this is a dog" or "these pixels are the boundary." It simply learned, through self-distillation, that attending to object boundaries and semantically coherent regions produces the best representations for matching the teacher's output across different augmented views.

This emergent segmentation is practically useful: you can threshold the attention maps to produce segmentation masks, and these masks are competitive with early unsupervised segmentation methods — all without any training for segmentation.
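
Thresholding an attention map can be sketched as keeping the highest-attention patches until they account for a fixed share of the total attention. The function name `attention_to_mask`, the 0.6 share, and the six-patch toy map are assumptions for illustration; a real map would have one value per image patch.

```python
def attention_to_mask(attention, keep_mass=0.6):
    """Mark the top patches that together hold `keep_mass` of the attention."""
    total = sum(attention)
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    mask = [0] * len(attention)
    covered = 0.0
    for i in order:
        if covered >= keep_mass * total:
            break
        mask[i] = 1
        covered += attention[i]
    return mask

# Hypothetical [CLS] attention over 6 patches for one head.
attn = [0.05, 0.30, 0.25, 0.05, 0.20, 0.15]
print(attention_to_mask(attn))  # [0, 1, 1, 0, 1, 0]
```

Reshaping the resulting 0/1 list back to the patch grid gives a coarse binary mask whose resolution is set by the patch size.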

DINO also showed that the frozen features can be used for video object segmentation: a first-frame mask is propagated from frame to frame by nearest-neighbor matching of patch features, again without any video-specific training.

Why do DINO's attention maps learn object segmentation without segmentation labels?

Chapter 6: k-NN Classification

Here's another surprise from DINO: the learned features are so well-organized that a simple k-nearest-neighbor classifier — with zero training — achieves competitive ImageNet accuracy.

How it works

  1. Freeze the pretrained DINO model. Extract features for all ImageNet training images. Store them.
  2. For a test image, extract its feature, find the k=20 nearest training features (cosine similarity).
  3. The k neighbors vote for a label. That's the prediction.
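
The three steps above can be sketched with the standard library alone. This is a toy illustration with an unweighted majority vote and 2-dimensional hypothetical features; the names `cosine` and `knn_predict` are assumptions, and a real evaluation would use features extracted from the frozen model.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, train_features, train_labels, k=20):
    """Majority vote among the k training features most similar to `query`."""
    ranked = sorted(
        range(len(train_features)),
        key=lambda i: cosine(query, train_features[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical stored features and labels.
train_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
train_labels = ["dog", "dog", "cat", "cat"]
print(knn_predict([0.8, 0.3], train_features, train_labels, k=3))  # dog
```

Nothing is learned at prediction time: the classifier's quality depends entirely on how well the frozen features place similar images near each other.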

No linear probe. No fine-tuning. No hyperparameter search. No data augmentation at test time. Just: "find the closest training images and copy their label."

The results are striking

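With a ViT-S/16 backbone, DINO reaches 74.5% top-1 accuracy with k-NN and 77.0% with a linear probe, a gap of only 2.5 points.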

Compare this with other self-supervised methods on the same ViT-S architecture:

k-NN vs linear probe accuracy for different self-supervised methods on ViT-S. DINO's gap is remarkably small.

What this tells us: When k-NN works almost as well as a linear classifier, it means the feature space has a natural cluster structure — semantically similar images are genuinely close in feature space. This is exactly what you'd want from a "foundation" representation. DINO features don't just encode class-discriminative information; they encode a smooth, well-organized manifold of visual concepts.
Why is k-NN accuracy a better indicator of feature quality than linear probe accuracy?

Chapter 7: Results

DINO's results span ImageNet classification, image retrieval, copy detection, video segmentation, and transfer learning. Here are the highlights.

ImageNet classification

With a ViT-B/8 backbone (85M parameters, 8×8 patches), DINO achieves 80.1% top-1 accuracy with a linear probe and 77.4% with a k-NN classifier.

This beats all previous self-supervised methods, including those using much larger architectures. Importantly, using smaller patches (/8 vs /16) has a bigger impact than using a larger model.

Image retrieval

DINO features excel at image retrieval tasks (Oxford and Paris benchmarks). When pretrained on Google Landmarks v2 instead of ImageNet, DINO ViT-S/16 achieves 51.5 mAP on Revisited Oxford (Medium) — competitive with dedicated retrieval systems.

Copy detection

On the Copydays benchmark, DINO ViT-B/8 achieves 85.5% mAP — outperforming the specialized Multigrain model (82.5%) that was specifically trained for this task.

Video segmentation

Without any video training, DINO features can track objects across video frames by nearest-neighbor matching of patch features between consecutive frames. On the DAVIS-2017 video object segmentation benchmark, DINO achieves competitive results using only frozen features, with no fine-tuning.

ImageNet linear probe accuracy across self-supervised methods and architectures.

The patch size story

One of DINO's most practical findings: reducing patch size from 16×16 to 8×8 dramatically improves results. ViT-S/8 reaches 79.7% linear accuracy, surpassing ViT-B/16 (78.2%) with about 4× fewer parameters. The smaller patches create 4× more tokens, giving the attention maps higher spatial resolution and enabling finer-grained segmentation, at the cost of lower throughput.
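
The token arithmetic behind this is easy to check. The sketch assumes a 224×224 input image; the function name `num_patch_tokens` is hypothetical.

```python
def num_patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side

tokens_16 = num_patch_tokens(224, 16)  # 14 * 14 = 196
tokens_8 = num_patch_tokens(224, 8)    # 28 * 28 = 784
print(tokens_16, tokens_8, tokens_8 // tokens_16)  # 196 784 4
```

Because self-attention compares every token with every other token, 4× more tokens means roughly 16× more token pairs per layer, which is where the extra compute for /8 models comes from.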

Practical note: ViT-S/16 is the sweet spot for speed (1007 im/s) while ViT-B/8 is the accuracy champion (80.1%). Training ViT-S/16 with DINO takes just 2×8 GPUs for 3 days to reach 76.1% — outperforming comparable self-supervised CNNs with significantly less compute.
Which modification had a bigger impact on DINO's performance: using a larger model or using smaller patches?

Chapter 8: What Makes ViTs Special

DINO works with both ViTs and CNNs — it achieves 75.3% linear accuracy with a ResNet-50, matching the state of the art. But the emergent properties are unique to ViTs. Why?

Self-attention provides a natural visualization

In a CNN, there's no direct equivalent of "what is the model attending to." You can compute gradient-based saliency maps (Grad-CAM), but these are post-hoc approximations. In a ViT, self-attention weights are a native part of the architecture — you can directly read off which patches the model considers important.

The [CLS] token as a global aggregator

The [CLS] token is unique to ViTs. It has no spatial position — it's a global summary that must learn to aggregate information from all patches. With self-supervised training, this aggregation becomes spatially structured: different attention heads specialize in different aspects of the scene (object interior, boundaries, background).

Local-to-global reasoning

DINO's multi-crop strategy specifically encourages local-to-global reasoning: the student sees small crops but must match the teacher's output on global views. In a ViT, this means the attention mechanism must learn to relate local patch features to global image-level semantics. CNNs, with their fixed receptive fields, handle this less naturally.

No batch normalization

A subtle but important difference: ViTs don't use batch normalization by default. BN creates implicit communication between samples in a batch, which can provide shortcuts for self-supervised methods (the model can "cheat" by using batch statistics). DINO with ViT is entirely BN-free, making the system cleaner and the learned features more robust.

The deeper lesson: Supervision may actually hurt ViTs by reducing the richness of their representations. Supervised training optimizes for a single label per image — collapsing the rich internal representations to a 1000-way classifier. Self-supervised training preserves this richness because the learning objective (match the teacher) doesn't discard any information.
Why do emergent segmentation properties appear with self-supervised ViTs but not with self-supervised CNNs?

Chapter 9: Connections

Predecessors

MoCo (He et al., 2020): Introduced the momentum encoder for self-supervised learning. DINO adopts this as the EMA teacher but replaces the contrastive loss and memory queue with cross-entropy distillation.

BYOL (Grill et al., 2020): Showed that you can learn without negatives using a predictor head + momentum encoder. DINO simplifies further by removing the predictor head and using centering + sharpening instead.

SimCLR (Chen et al., 2020): The contrastive learning baseline. Required batch sizes of 4096+ for good performance. DINO avoids contrastive losses entirely, working well with standard batch sizes of 1024.

SwAV (Caron et al., 2020): By the same first author. Introduced multi-crop training (which DINO adopts) and online clustering. DINO replaces clustering with cross-entropy distillation.

What DINO enabled

DINOv2 (Oquab et al., 2024): Scaled DINO to ViT-g (1.1B parameters) trained on a curated dataset of 142M images. Combined DINO's self-distillation with iBOT's masked image modeling. Achieved features that transfer to almost any vision task without fine-tuning — a true vision foundation model.

MAE (He et al., 2022): Took the complementary approach: instead of self-distillation, mask 75% of patches and reconstruct them. Different philosophy but validated that self-supervised ViTs produce powerful features.

CLIP (Radford et al., 2021): Trained ViTs with natural language supervision instead of self-supervision. Different from DINO's approach (no text involved) but both showed that ViTs benefit from non-standard training.

Segment Anything (SAM) (Kirillov et al., 2023): DINO demonstrated that self-supervised ViTs could segment objects without labels. SAM took this further with a massive labeled segmentation dataset, but the insight that ViTs naturally understand object boundaries came from DINO.

Foundation Models: DINO was a key proof that self-supervised ViTs can serve as general-purpose visual backbones. This inspired the current wave of vision foundation models (DINOv2, SigLIP, EVA, InternViT) used in VLMs and VLAs.

DINO's legacy: DINO showed that Vision Transformers, when freed from supervised training, reveal properties that convnets don't exhibit. The emergent segmentation result changed how the field thinks about ViTs: they're not just "convnets with attention," they're fundamentally different architectures that organize information spatially in their attention patterns. This insight informs many modern vision foundation models.

Cheat sheet

Core idea: Self-distillation, where a student matches an EMA teacher on different views of the same image
Anti-collapse: Momentum teacher + centering (subtract mean) + sharpening (low τ)
Key result: 80.1% ImageNet linear (ViT-B/8); emergent object segmentation in attention maps
k-NN surprise: 74.5% with zero-training k-NN (ViT-S/16); features have natural cluster structure
Impact: Proved ViTs have unique self-supervised properties → DINOv2, SAM, vision foundation models
How does DINOv2 build on the original DINO?