DINO — Veanors

Chapter 0: The Problem

By 2021, self-supervised learning had produced spectacular results on images — but almost entirely with convolutional networks. Methods like MoCo, BYOL, and SwAV trained ResNets to produce features that rivaled supervised learning, closing the gap on ImageNet without using a single label.

Meanwhile, Vision Transformers (ViTs) had arrived. They worked well with supervision, but they hadn't shown anything special. They were computationally expensive, required more training data, and their features looked... ordinary. No emergent properties. No surprises.

This was puzzling. In NLP, the magic of Transformers came from self-supervised pretraining — BERT's masked language modeling, GPT's next-token prediction. These self-supervised objectives provided a richer learning signal than "predict a single label per sentence." Could the same be true for vision?

The central question: Does the muted success of Vision Transformers come from training them with supervision? What happens if we train ViTs with self-supervised learning instead? Do new properties emerge that don't appear with supervised training or with CNNs?

The answer turned out to be yes — dramatically so. Self-supervised ViTs spontaneously learn to segment objects without any segmentation labels. Their attention maps contain explicit information about scene layout. Their features are so well-organized that a simple k-nearest-neighbor classifier (no training at all) achieves competitive accuracy. None of these properties emerge with supervised ViTs or with self-supervised CNNs.

Why was the performance of Vision Transformers considered "muted" before DINO?

They worked with supervision but showed no unique properties over CNNs: no emergent behaviors, higher compute cost, more data needed
They couldn't match CNN accuracy at all
They were too slow to train

Chapter 1: The Key Insight

DINO stands for self-distillation with no labels. The insight is beautifully simple: take two copies of the same network — call one the student and one the teacher. Show them different augmented views of the same image. Train the student to match the teacher's output. Update the teacher as a slow-moving average of the student.

That's it. No labels. No contrastive pairs. No memory bank. No clustering. Just: "student, match the teacher. Teacher, slowly absorb the student."

The student-teacher dance

Here's how the information flows:

Take an image. Create two different random augmentations (crops, color jitter, blur).
Pass one view through the student network. Pass the other through the teacher network.
Both output a probability distribution over K dimensions (via softmax).
Compute cross-entropy loss between teacher output and student output.
Backpropagate through the student only (stop gradient on teacher).
Update the teacher via exponential moving average (EMA) of the student weights.

Why does this work? The teacher is a smoothed, ensembled version of the student: an exponentially weighted average of all past student checkpoints. This averaging makes the teacher more stable and higher-quality than the student at any given moment, so the student is always chasing a better version of itself. The asymmetry (EMA teacher + stop gradient) stabilizes this chase, but it is not enough on its own to rule out a trivial solution; DINO also centers and sharpens the teacher output, as described in Chapter 2.
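The "exponentially weighted average of past students" claim can be checked with a few lines of arithmetic. This is only an illustrative sketch: the one-parameter "network", the momentum of 0.9, and the name ema_update are assumptions chosen for readability (DINO's momentum is much closer to 1).

```python
def ema_update(teacher, student, momentum):
    """Return momentum * teacher + (1 - momentum) * student, element-wise."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

momentum = 0.9                      # illustrative; DINO uses values close to 1
initial_teacher = [0.0]
students = [[1.0], [2.0], [3.0]]    # hypothetical student weights after 3 steps

teacher = initial_teacher
for student in students:
    teacher = ema_update(teacher, student, momentum)

# Unrolled form: a student that is `age` steps old gets weight
# (1 - momentum) * momentum**age, and the initial teacher gets momentum**n.
n = len(students)
unrolled = momentum ** n * initial_teacher[0] + sum(
    (1.0 - momentum) * momentum ** (n - 1 - i) * s[0] for i, s in enumerate(students)
)
print(teacher[0], unrolled)
```

Both printed values agree (about 0.561), which is the sense in which the teacher "remembers" every past student with geometrically decaying weight.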

The DINO training loop: student learns from teacher, teacher slowly absorbs student via EMA.

How is the teacher network updated in DINO?

Via exponential moving average (EMA) of the student's weights; no gradients flow through the teacher
By backpropagating through both networks jointly
By copying the student weights after each epoch

Chapter 2: The DINO Framework

The full DINO framework adds two critical design choices on top of the student-teacher core: a multi-crop augmentation strategy and a centering + sharpening mechanism to avoid collapse.

Multi-crop strategy

DINO generates multiple views from each image:

2 global views at resolution 224×224, each covering more than 50% of the image area
N local views (typically 6-8) at resolution 96×96, each covering less than 50% of the image

The key asymmetry: all views go through the student, but only the global views go through the teacher. This encourages local-to-global correspondences — the student must learn that a small crop of a dog's ear belongs to the same concept as the full image of the dog.
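A rough sketch of how the views might be generated and routed. The 50% boundary comes from the description above; everything else is an assumption for illustration: the 5% lower bound for local crops, keeping the image's aspect ratio, and the helper names sample_crop and area_fraction. The resize to 224×224 or 96×96 and the color augmentations are left out.

```python
import random

def sample_crop(img_w, img_h, min_frac, max_frac, rng):
    """Sample a crop box (x0, y0, w, h) covering a random fraction of the image area."""
    frac = rng.uniform(min_frac, max_frac)
    w = max(1, int(round(img_w * frac ** 0.5)))   # keeps the image's aspect ratio
    h = max(1, int(round(img_h * frac ** 0.5)))
    x0 = rng.randint(0, img_w - w)
    y0 = rng.randint(0, img_h - h)
    return (x0, y0, w, h)

def area_fraction(box, img_w, img_h):
    return box[2] * box[3] / (img_w * img_h)

rng = random.Random(0)
img_w, img_h, n_local = 640, 480, 6

global_boxes = [sample_crop(img_w, img_h, 0.5, 1.0, rng) for _ in range(2)]
local_boxes = [sample_crop(img_w, img_h, 0.05, 0.5, rng) for _ in range(n_local)]

teacher_inputs = global_boxes                 # teacher sees only the global views
student_inputs = global_boxes + local_boxes   # student sees every view
print(len(teacher_inputs), len(student_inputs))
```

Keeping the global views first in student_inputs makes it easy to skip the pair where both networks see the identical view when the loss is computed.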

Multi-crop: 2 global views (teacher + student) and multiple local views (student only).

Centering and sharpening

Without careful design, the teacher can collapse — outputting the same distribution for every input (a uniform or single-spike output). DINO prevents this with two complementary operations applied to the teacher's output:

Centering: Subtract the running mean of the teacher's outputs. This prevents any single dimension from dominating. Updated via EMA: c ← mc + (1−m) · mean(batch outputs).
Sharpening: Use a low temperature τ_t in the teacher's softmax. This prevents collapse to a uniform distribution.

Centering alone would push toward uniform distributions (maximum entropy). Sharpening alone would push toward one-hot outputs (minimum entropy). Together, they balance each other: the teacher produces peaked but diverse distributions.
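The two operations can be seen on a toy batch. This sketch is static (no training) and only shows what each operation does to a fixed set of teacher logits; the logits, K = 4, the center momentum m = 0.9, and the helper names are all assumptions for illustration.

```python
import math

def softmax(logits, temp):
    z = [v / temp for v in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

def argmax(p):
    return max(range(len(p)), key=lambda i: p[i])

def update_center(center, batch, m):
    """c <- m * c + (1 - m) * mean(batch), one entry per output dimension."""
    mean = [sum(row[k] for row in batch) / len(batch) for k in range(len(center))]
    return [m * c + (1.0 - m) * b for c, b in zip(center, mean)]

def teacher_probs(row, center, temp):
    return softmax([v - c for v, c in zip(row, center)], temp)

# Toy teacher logits: 3 images, K = 4. Dimension 0 has a large offset shared by all images.
batch = [[5.0, 0.3, 0.1, 0.0],
         [5.0, 0.0, 0.4, 0.1],
         [5.0, 0.1, 0.0, 0.5]]

center = [0.0] * 4
for _ in range(200):                 # repeated EMA updates converge to the batch mean
    center = update_center(center, batch, m=0.9)

sharpen_only = [argmax(softmax(row, 0.04)) for row in batch]
center_and_sharpen = [argmax(teacher_probs(row, center, 0.04)) for row in batch]
center_only_entropy = [entropy(teacher_probs(row, center, 1.0)) for row in batch]
both_entropy = [entropy(teacher_probs(row, center, 0.04)) for row in batch]
print(sharpen_only, center_and_sharpen)
```

Without centering, every image picks dimension 0 (the shared offset dominates). With centering at temperature 1 the distributions sit close to the maximum entropy log 4; with centering and a low temperature, each image gets a peaked distribution on a different dimension.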

Why this matters: Most self-supervised methods need complex mechanisms to avoid collapse — contrastive losses with large batches (SimCLR), memory banks (MoCo), predictor heads (BYOL), or clustering (SwAV). DINO needs only centering + sharpening + EMA. This simplicity is a key contribution.

Why do only global views go through the teacher while all views go through the student?

To encourage local-to-global correspondence: the student must learn that a small crop belongs to the same concept as the full image
To reduce computation cost
Because the teacher can only handle high-resolution inputs

Chapter 3: Knowledge Distillation Without Labels

Traditional knowledge distillation works like this: train a large "teacher" model with labels, then train a smaller "student" to mimic the teacher's soft outputs. The key innovation of DINO is removing every component that requires labels.

The loss function

Both student and teacher output a K-dimensional vector (K = 65536 in practice). These are converted to probability distributions via temperature-scaled softmax:

P_s(x)⁽ⁱ⁾ = exp(g_{θ_s}(x)⁽ⁱ⁾ / τ_s) / Σ_{k=1}^{K} exp(g_{θ_s}(x)⁽ᵏ⁾ / τ_s)

The student temperature τ_s = 0.1 (fairly sharp). The teacher temperature τ_t is warmed up from 0.04 to 0.07 over the first 30 epochs (very sharp — this is the "sharpening").
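The teacher temperature warm-up can be written as a small schedule function. The endpoints and the 30-epoch length are the ones quoted above; the linear shape of the ramp and the function name teacher_temperature are assumptions of this sketch.

```python
def teacher_temperature(epoch, warmup_epochs=30, start=0.04, end=0.07):
    """Ramp linearly from `start` to `end` over `warmup_epochs`, then stay at `end`."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs

for epoch in (0, 15, 30, 100):
    print(epoch, round(teacher_temperature(epoch), 4))
```

Starting with the sharpest temperature and relaxing it slightly is a design choice of the schedule; the teacher temperature stays below the student's 0.1 throughout, so the teacher is always the sharper of the two.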

The loss is a standard cross-entropy between teacher and student distributions:

L = − Σ_i P_t(x)⁽ⁱ⁾ log P_s(x')⁽ⁱ⁾

With multi-crop, this becomes a sum over all pairs of (global teacher view, any student view):

min_{θ_s} Σ_{x ∈ globals} Σ_{x' ∈ V, x' ≠ x} H(P_t(x), P_s(x'))

The teacher output is centered before the softmax: g_t(x) ← g_t(x) − c, where c is the EMA of batch means.
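The double sum over views can be spelled out directly. In this sketch the logits are random stand-ins for network outputs, K = 5 is far smaller than in practice, and multi_crop_loss is a hypothetical helper; the global views are assumed to come first in the student's list so that index equality identifies the identical view.

```python
import math
import random

def softmax(logits, temp):
    z = [v / temp for v in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits, temp):
    z = [v / temp for v in logits]
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def multi_crop_loss(teacher_logits, student_logits, center, tpt=0.04, tps=0.1):
    """Sum H(P_t(x), P_s(x')) over global teacher views x and student views x' != x."""
    total, n_terms = 0.0, 0
    for i, t in enumerate(teacher_logits):
        p_t = softmax([v - c for v, c in zip(t, center)], tpt)   # center, then sharpen
        for j, s in enumerate(student_logits):
            if j == i:          # skip the pair where both networks see the same view
                continue
            log_p_s = log_softmax(s, tps)
            total += -sum(pt * lp for pt, lp in zip(p_t, log_p_s))
            n_terms += 1
    return total, n_terms

rng = random.Random(0)
K, n_local = 5, 6
student_logits = [[rng.gauss(0.0, 0.1) for _ in range(K)] for _ in range(2 + n_local)]
teacher_logits = [[rng.gauss(0.0, 0.1) for _ in range(K)] for _ in range(2)]
loss, n_terms = multi_crop_loss(teacher_logits, student_logits, center=[0.0] * K)
print(n_terms)
```

With 2 global and 6 local views the sum has 2 × 7 = 14 terms: each global teacher view is compared with every student view except itself.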

DINO pseudocode

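# gs, gt: student and teacher networks (same architecture)
# C: center, a K-dim running mean of teacher outputs
# tps, tpt: student and teacher softmax temperatures
# l, m: EMA momentum rates for the teacher weights and for the center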
gt.params = gs.params
for x in loader:
    x1, x2 = augment(x), augment(x)
    s1, s2 = gs(x1), gs(x2)
    t1, t2 = gt(x1), gt(x2)
    loss = H(t1, s2)/2 + H(t2, s1)/2
    loss.backward()
    update(gs)  # SGD on student only
    gt.params = l*gt.params + (1-l)*gs.params  # EMA
    C = m*C + (1-m)*cat([t1,t2]).mean(dim=0)

def H(t, s):
    t = t.detach()  # stop gradient on teacher
    s = softmax(s / tps, dim=1)
    t = softmax((t - C) / tpt, dim=1)  # center + sharpen
    return -(t * log(s)).sum(dim=1).mean()

No labels anywhere. Look at the pseudocode: there is no y, no label, no target_class. The teacher's output IS the target. The system bootstraps its own supervision. The only input is raw images.
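To make "the teacher's output IS the target" concrete, here is a dependency-free sketch of H for a single pair of views, with the gradient written out by hand. It is not the training code: the student "network" is reduced to its output logits, and the logits, learning rate, and step count are made-up values.

```python
import math

def softmax(logits, temp):
    z = [v / temp for v in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def H(t_logits, s_logits, center, tpt=0.04, tps=0.1):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    p_t = softmax([t - c for t, c in zip(t_logits, center)], tpt)  # treated as a constant
    p_s = softmax(s_logits, tps)
    loss = -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
    # Gradient w.r.t. the student logits only; nothing flows back to the teacher.
    grad_s = [(ps - pt) / tps for pt, ps in zip(p_t, p_s)]
    return loss, grad_s

teacher_out = [0.10, 0.02, -0.05]   # hypothetical teacher logits for one view
student_out = [0.00, 0.00, 0.00]    # hypothetical student logits for another view
center = [0.0, 0.0, 0.0]

losses = []
for _ in range(50):
    loss, grad = H(teacher_out, student_out, center)
    losses.append(loss)
    student_out = [s - 0.01 * g for s, g in zip(student_out, grad)]  # plain gradient step

print(losses[0], losses[-1])
```

The loss starts at log 3 (a uniform student) and falls as the student's distribution moves toward the teacher's; the teacher logits are never modified by the gradient step, which is the stop-gradient in the pseudocode.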

What serves as the "label" in DINO's cross-entropy loss?

The teacher network's centered and sharpened output distribution; no external labels are used
ImageNet class labels
Cluster assignments from k-means

Chapter 4: Avoiding Collapse

Collapse is the nightmare of self-supervised learning. Without labels to anchor the features, the network can find trivial shortcuts, such as outputting the exact same representation for every input. The training objective can look satisfied (two identical constant outputs always agree), but the features are useless.

Different methods have different anti-collapse mechanisms:

SimCLR: Contrastive loss with negatives (requires large batches of 4096+)
MoCo: Momentum encoder + memory queue of negatives
BYOL: Predictor head + momentum encoder (no negatives!)
SwAV: Online clustering + Sinkhorn normalization
DINO: Momentum teacher + centering + sharpening (no negatives, no predictor, no clustering)

DINO's three mechanisms work together:

1. Momentum teacher (EMA)

The teacher parameters θ_t are an exponential moving average of the student: θ_t ← λθ_t + (1−λ)θ_s. The momentum λ follows a cosine schedule from 0.996 to 1.0 during training. This means the teacher changes very slowly — it's a smoothed ensemble of many past student states. If the student starts collapsing, the teacher still retains diverse representations from before the collapse began.
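The momentum schedule is easy to write down. The endpoints 0.996 and 1.0 are from the text; the exact parameterization (per-step rather than per-epoch, and the name teacher_momentum) is an assumption of this sketch.

```python
import math

def teacher_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule rising from `base` at step 0 to `final` at the last step."""
    progress = step / max(1, total_steps - 1)
    return final - (final - base) * (math.cos(math.pi * progress) + 1.0) / 2.0

total_steps = 1001
for step in (0, 500, 1000):
    print(step, teacher_momentum(step, total_steps))
```

As λ approaches 1 the teacher absorbs less and less of the student, so by the end of training the teacher is effectively frozen.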

2. Centering

The center c is the running mean of the teacher's output over the batch:

c ← m · c + (1 − m) · (1/B) Σ_i g_{θ_t}(x_i)

This is subtracted from the teacher output before softmax. Without centering, one output dimension could dominate — the teacher would collapse to a one-hot vector that's the same for every input. Centering prevents this by keeping the mean output at zero.

3. Sharpening

The teacher uses a very low temperature τ_t = 0.04–0.07 in its softmax. This makes the output distribution peaked (high confidence). Without sharpening, centering alone would push toward a uniform distribution — the teacher would say "every class is equally likely" for every input. That's also collapse (just in the other direction).

Three output modes: uniform collapse (centering only), dominant dimension (no centering), and healthy behavior (centering + sharpening).

The balancing act: Centering prevents single-dimension dominance but encourages uniform outputs. Sharpening prevents uniform outputs but could encourage dominance. Together they balance: the teacher produces peaked, diverse, centered distributions. This is sufficient to avoid collapse when combined with the momentum teacher.

What would happen if DINO used centering but NOT sharpening?

The teacher would collapse to a uniform distribution, outputting equal probability for all dimensions regardless of input
The model would train normally
The model would overfit to the training data

Chapter 5: Emergent Segmentation

This is the headline result — the property that made DINO famous. When you train a ViT with DINO and visualize the self-attention maps of the [CLS] token in the last layer, something remarkable appears: the attention maps spontaneously learn to segment objects.

No segmentation labels. No bounding boxes. No pixel-level supervision of any kind. The model just... learns that objects are things, and attends to them.

How does this work?

In a Vision Transformer, the input image is split into patches (e.g., 8×8 or 16×16 pixels each). A special [CLS] token is prepended to the sequence. Through 12 layers of self-attention, the [CLS] token learns to attend to the patches that are most informative for representing the image.

With supervised training, the [CLS] token attends diffusely — it spreads attention across the image without clear spatial structure. But with DINO's self-supervised training, each attention head in the last layer learns to focus on semantically meaningful regions:

One head might attend to the object's body
Another head might attend to the object's boundary
Another might attend to the background

The different heads provide complementary views, and together they form a segmentation mask that accurately delineates object boundaries.

Simulated [CLS] token attention from different heads. Each head attends to different semantic regions.

The magic: The [CLS] token was never told "this is a dog" or "these pixels are the boundary." It simply learned, through self-distillation, that attending to object boundaries and semantically coherent regions produces the best representations for matching the teacher's output across different augmented views.

This emergent segmentation is practically useful: you can threshold the attention maps to produce segmentation masks, and these masks are competitive with early unsupervised segmentation methods — all without any training for segmentation.
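Thresholding an attention map can be as simple as keeping the strongest patches until a fixed share of the attention mass is covered. The sketch below assumes a 60% share, a tiny 3×3 patch grid with made-up weights, and the helper name mask_from_attention; a real map would come from one head's [CLS] row in the last layer.

```python
def mask_from_attention(attn, grid_w, keep_mass=0.6):
    """Turn one head's [CLS]-to-patch attention into a binary patch mask.

    attn: flat list of non-negative weights, one per patch, in row-major order.
    Keeps the highest-weight patches until their cumulative share reaches `keep_mass`.
    """
    total = sum(attn)
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        if cum >= keep_mass * total:
            break
        keep.add(i)
        cum += attn[i]
    flat = [1 if i in keep else 0 for i in range(len(attn))]
    return [flat[r:r + grid_w] for r in range(0, len(flat), grid_w)]

# Hypothetical 3x3 patch grid with attention concentrated in the middle column.
attn = [0.02, 0.20, 0.02,
        0.03, 0.30, 0.03,
        0.02, 0.35, 0.03]
mask = mask_from_attention(attn, grid_w=3)
for row in mask:
    print(row)
```

The resulting mask lives on the patch grid, so it would need to be upsampled by the patch size to align with the pixels of the input image.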

DINO also showed that the frozen features can be used for video object segmentation: given a mask for the first frame, labels are propagated to later frames by nearest-neighbor matching of patch features, again without any video-specific training.

Why do DINO's attention maps learn object segmentation without segmentation labels?

The self-distillation objective forces the [CLS] token to attend to semantically meaningful regions, since object boundaries provide the most useful information for matching views across augmentations
DINO uses implicit segmentation labels from ImageNet
The multi-crop strategy acts as a form of segmentation supervision

Chapter 6: k-NN Classification

Here's another surprise from DINO: the learned features are so well-organized that a simple k-nearest-neighbor classifier — with zero training — achieves competitive ImageNet accuracy.

How it works

Freeze the pretrained DINO model. Extract features for all ImageNet training images. Store them.
For a test image, extract its feature, find the k=20 nearest training features (cosine similarity).
The k neighbors vote for a label. That's the prediction.

No linear probe. No fine-tuning. No hyperparameter search. No data augmentation at test time. Just: "find the closest training images and copy their label."
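The whole evaluation fits in a short function. This sketch uses an unweighted majority vote and toy 2-D features, both of which are simplifications; the function names and data are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_predict(query, train_feats, train_labels, k=20):
    """Majority vote among the k training features most similar to `query`."""
    sims = sorted(
        ((cosine(query, f), y) for f, y in zip(train_feats, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in sims[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D "features"; real features from a frozen model are high-dimensional.
train_feats = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.0, 0.9], [0.2, 0.8]]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

prediction = knn_predict([0.7, 0.3], train_feats, train_labels, k=3)
print(prediction)
```

Nothing here is fitted to the labels: the labels are only looked up at prediction time, which is why the accuracy reflects the geometry of the feature space itself.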

The results are striking

With a ViT-S/16 backbone:

DINO k-NN: 74.5% top-1 accuracy
DINO linear: 77.0% top-1 accuracy
Gap of only 2.5 percentage points: a classifier with no trained parameters recovers almost everything a trained linear layer does.

Compare this with other self-supervised methods on the same ViT-S architecture:

BYOL k-NN: 66.6% (linear: 71.4%), gap of 4.8 points
MoCov2 k-NN: 64.4% (linear: 72.7%), gap of 8.3 points
SwAV k-NN: 66.3% (linear: 73.5%), gap of 7.2 points

k-NN vs linear probe accuracy for different self-supervised methods on ViT-S. DINO's gap is remarkably small.

What this tells us: When k-NN works almost as well as a linear classifier, it means the feature space has a natural cluster structure — semantically similar images are genuinely close in feature space. This is exactly what you'd want from a "foundation" representation. DINO features don't just encode class-discriminative information; they encode a smooth, well-organized manifold of visual concepts.

Why is k-NN accuracy a better indicator of feature quality than linear probe accuracy?

k-NN has no learnable parameters, so high accuracy means the features already have natural cluster structure, not that a linear layer learned to compensate for poor features
k-NN is faster to evaluate
k-NN uses more neighbors

Chapter 7: Results

DINO's results span ImageNet classification, image retrieval, copy detection, video segmentation, and transfer learning. Here are the highlights.

ImageNet classification

With a ViT-B/8 backbone (85M parameters, 8×8 patches), DINO achieves:

80.1% top-1 linear classification (new SOTA for self-supervised with ViT)
77.4% top-1 k-NN classification

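This beats all previous self-supervised methods, including those using much larger architectures. Importantly, using smaller patches (/8 vs /16) has a bigger impact than using a larger model: at ViT-S, going from /16 to /8 lifts linear accuracy from 77.0% to 79.7%, while going from ViT-S/16 to ViT-B/16 only reaches 78.2%.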

Image retrieval

DINO features excel at image retrieval tasks (Oxford and Paris benchmarks). When pretrained on Google Landmarks v2 instead of ImageNet, DINO ViT-S/16 achieves 51.5 mAP on Revisited Oxford (Medium) — competitive with dedicated retrieval systems.

Copy detection

On the Copydays benchmark, DINO ViT-B/8 achieves 85.5% mAP — outperforming the specialized Multigrain model (82.5%) that was specifically trained for this task.

Video segmentation

Without any video training, DINO features can track objects across video frames by matching patch features between consecutive frames. On the DAVIS-2017 video object segmentation benchmark, DINO achieves competitive results using only frozen features and nearest-neighbor matching.

ImageNet linear probe accuracy across self-supervised methods and architectures.

The patch size story

One of DINO's most practical findings: reducing patch size from 16×16 to 8×8 dramatically improves results. ViT-S/8 reaches 79.7% linear accuracy, surpassing ViT-B/16 (78.2%) with about 4× fewer parameters. The smaller patches create 4× more tokens, giving the attention maps higher spatial resolution and enabling finer-grained segmentation, at the price of more compute per image.
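The "4× more tokens" figure follows from simple arithmetic, sketched here for a 224×224 input. The helper name num_patch_tokens is made up, and the cost ratio only counts the quadratic self-attention term, ignoring the other layers.

```python
def num_patch_tokens(image_size, patch_size):
    """Number of patch tokens for a square image (the [CLS] token is not counted)."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side * per_side

tokens_16 = num_patch_tokens(224, 16)   # 14 patches per side
tokens_8 = num_patch_tokens(224, 8)     # 28 patches per side
ratio = tokens_8 / tokens_16
attention_cost_ratio = ratio ** 2       # self-attention scales quadratically with tokens
print(tokens_16, tokens_8, ratio, attention_cost_ratio)
```

Halving the patch side quadruples the sequence length, so the attention matrices alone grow by roughly 16×, which is why the /8 models are noticeably slower.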

Practical note: ViT-S/16 is the sweet spot for speed (1007 im/s) while ViT-B/8 is the accuracy champion (80.1%). Training ViT-S/16 with DINO on two 8-GPU servers for 3 days reaches 76.1%, outperforming comparable self-supervised CNNs with significantly less compute.

Which modification had a bigger impact on DINO's performance: using a larger model or using smaller patches?

Smaller patches: going from /16 to /8 improved accuracy more than going from ViT-S to ViT-B, because finer patches give higher spatial resolution for attention maps
Using a larger model
Both had equal impact

Chapter 8: What Makes ViTs Special

DINO works with both ViTs and CNNs — it achieves 75.3% linear accuracy with a ResNet-50, matching the state of the art. But the emergent properties are unique to ViTs. Why?

Self-attention provides a natural visualization

In a CNN, there's no direct equivalent of "what is the model attending to." You can compute gradient-based saliency maps (Grad-CAM), but these are post-hoc approximations. In a ViT, self-attention weights are a native part of the architecture — you can directly read off which patches the model considers important.

The [CLS] token as a global aggregator

The [CLS] token is unique to ViTs. It has no spatial position — it's a global summary that must learn to aggregate information from all patches. With self-supervised training, this aggregation becomes spatially structured: different attention heads specialize in different aspects of the scene (object interior, boundaries, background).

Local-to-global reasoning

DINO's multi-crop strategy specifically encourages local-to-global reasoning: the student sees small crops but must match the teacher's output on global views. In a ViT, this means the attention mechanism must learn to relate local patch features to global image-level semantics. CNNs, with their fixed receptive fields, handle this less naturally.

No batch normalization

A subtle but important difference: ViTs don't use batch normalization by default. BN creates implicit communication between samples in a batch, which can provide shortcuts for self-supervised methods (the model can "cheat" by using batch statistics). DINO with ViT is entirely BN-free, making the system cleaner and the learned features more robust.

The deeper lesson: Supervision may actually hurt ViTs by reducing the richness of their representations. Supervised training optimizes for a single label per image — collapsing the rich internal representations to a 1000-way classifier. Self-supervised training preserves this richness because the learning objective (match the teacher) doesn't discard any information.

Why do emergent segmentation properties appear with self-supervised ViTs but not with self-supervised CNNs?

ViTs have explicit self-attention maps that directly show which patches the [CLS] token considers important, whereas CNNs have no equivalent native mechanism for spatial attention visualization
CNNs are less powerful than ViTs
DINO's loss function works differently with CNNs

Chapter 9: Connections

Predecessors

MoCo (He et al., 2020): Introduced the momentum encoder for self-supervised learning. DINO adopts this as the EMA teacher but replaces the contrastive loss and memory queue with cross-entropy distillation.

BYOL (Grill et al., 2020): Showed that you can learn without negatives using a predictor head + momentum encoder. DINO simplifies further by removing the predictor head and using centering + sharpening instead.

SimCLR (Chen et al., 2020): The contrastive learning baseline. Required batch sizes of 4096+ for good performance. DINO avoids contrastive losses entirely, working well with standard batch sizes of 1024.

SwAV (Caron et al., 2020): By the same first author. Introduced multi-crop training (which DINO adopts) and online clustering. DINO replaces clustering with cross-entropy distillation.

What DINO enabled

DINOv2 (Oquab et al., 2024): Scaled DINO to ViT-g (1.1B parameters) trained on a curated dataset of 142M images. Combined DINO's self-distillation with iBOT's masked image modeling. Achieved features that transfer to almost any vision task without fine-tuning — a true vision foundation model.

MAE (He et al., 2022): Took the complementary approach: instead of self-distillation, mask 75% of patches and reconstruct them. Different philosophy but validated that self-supervised ViTs produce powerful features.

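CLIP (Radford et al., 2021): Trained ViTs with natural language supervision instead of self-supervision. CLIP appeared a few months before DINO, so it is a contemporary rather than a descendant. Its approach differs from DINO's (no text is involved in DINO), but both showed that ViTs benefit from training signals other than class labels.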

Segment Anything (SAM) (Kirillov et al., 2023): DINO demonstrated that self-supervised ViTs could segment objects without labels. SAM took a different route, training a promptable ViT-based model on a massive labeled segmentation dataset, but DINO was an early demonstration that ViT representations carry object-boundary information.

Foundation Models: DINO was a key proof that self-supervised ViTs can serve as general-purpose visual backbones. This helped motivate the current wave of vision foundation models (DINOv2, SigLIP, EVA, InternViT) used in vision-language models (VLMs) and vision-language-action models (VLAs).

DINO's legacy: DINO showed that Vision Transformers, when freed from supervised training, reveal properties that convnets don't exhibit. The emergent segmentation result changed how the field thinks about ViTs: they're not just "convnets with attention," they're architectures that organize information spatially in their attention patterns. This insight informs much of the current work on vision foundation models.

Cheat sheet

Core idea

Self-distillation: student matches EMA teacher on different views of same image

Anti-collapse

Momentum teacher + centering (subtract mean) + sharpening (low τ)

Key result

80.1% ImageNet linear (ViT-B/8); emergent object segmentation in attention maps

k-NN surprise

74.5% with zero-training k-NN (ViT-S/16) — features have natural cluster structure

Impact

Proved ViTs have unique self-supervised properties → DINOv2, SAM, vision foundation models

How does DINOv2 build on the original DINO?

It scales DINO's self-distillation to a 1.1B-parameter ViT-g, combines it with masked image modeling (iBOT), and trains on 142M curated images to create a universal vision foundation model
It replaces the ViT with a CNN
It adds supervised labels to DINO's training

Emerging Properties in Self-Supervised Vision Transformers