Self-distillation with no labels: a student-teacher framework where Vision Transformers spontaneously learn object segmentation, scene layout, and features so good that a simple k-NN achieves competitive ImageNet accuracy.
By 2021, self-supervised learning had produced spectacular results on images — but almost entirely with convolutional networks. Methods like MoCo, BYOL, and SwAV trained ResNets to produce features that rivaled supervised learning, closing the gap on ImageNet without using a single label.
Meanwhile, Vision Transformers (ViTs) had arrived. They worked well with supervision, but they hadn't shown anything special. They were computationally expensive, required more training data, and their features looked... ordinary. No emergent properties. No surprises.
This was puzzling. In NLP, the magic of Transformers came from self-supervised pretraining — BERT's masked language modeling, GPT's next-token prediction. These self-supervised objectives provided a richer learning signal than "predict a single label per sentence." Could the same be true for vision?
The answer turned out to be yes, dramatically so. Self-supervised ViTs spontaneously learn to segment objects without any segmentation labels. Their attention maps contain explicit information about scene layout. Their features are so well-organized that a simple k-nearest-neighbor classifier (no training at all) achieves competitive accuracy. These properties do not emerge as clearly with supervised ViTs or with self-supervised CNNs.
DINO stands for self-distillation with no labels. The insight is beautifully simple: take two copies of the same network — call one the student and one the teacher. Show them different augmented views of the same image. Train the student to match the teacher's output. Update the teacher as a slow-moving average of the student.
That's it. No labels. No contrastive pairs. No memory bank. No clustering. Just: "student, match the teacher. Teacher, slowly absorb the student."
Here's how the information flows:
The DINO training loop: student learns from teacher, teacher slowly absorbs student via EMA.
The full DINO framework adds two critical design choices on top of the student-teacher core: a multi-crop augmentation strategy and a centering + sharpening mechanism to avoid collapse.
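DINO generates multiple views from each image: two global views that each cover a large part of the image, and several local views that each cover only a small region.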
The key asymmetry: all views go through the student, but only the global views go through the teacher. This encourages local-to-global correspondences — the student must learn that a small crop of a dog's ear belongs to the same concept as the full image of the dog.
Multi-crop: 2 global views (teacher + student) and multiple local views (student only).
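The view-generation step can be sketched as follows. This is a minimal illustration, not the reference implementation: crops are represented only as boxes, and the helper names, the number of local views (6), and the area ranges (0.4 to 1.0 for global views, 0.05 to 0.4 for local views) are assumptions chosen for the example.

```python
import random

def sample_crop(width, height, scale, rng):
    """Sample a crop box (x0, y0, w, h) covering a random fraction of the image area.

    `scale` is a (min, max) range for the area fraction. The crop keeps the
    image's aspect ratio, which is a simplification made for this sketch.
    """
    frac = rng.uniform(*scale)
    w = max(1, int(round(width * frac ** 0.5)))
    h = max(1, int(round(height * frac ** 0.5)))
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)
    return (x0, y0, w, h)

def multi_crop(width, height, n_local=6, global_scale=(0.4, 1.0),
               local_scale=(0.05, 0.4), seed=0):
    """Return the crop boxes seen by the teacher and by the student."""
    rng = random.Random(seed)
    global_views = [sample_crop(width, height, global_scale, rng) for _ in range(2)]
    local_views = [sample_crop(width, height, local_scale, rng) for _ in range(n_local)]
    return {
        "teacher": global_views,                # global views only
        "student": global_views + local_views,  # every view
    }

views = multi_crop(480, 360)
```

The asymmetry lives entirely in the returned dictionary: the teacher list never contains a local view, so every local crop the student sees has to be matched against a target computed from a global one.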
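Without careful design, the teacher can collapse, outputting the same distribution for every input (either uniform or a single spike). DINO prevents this with two complementary operations applied to the teacher's output: centering, which subtracts a running mean of the teacher's outputs, and sharpening, which applies a low softmax temperature.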
Centering alone would push toward uniform distributions (maximum entropy). Sharpening alone would push toward one-hot outputs (minimum entropy). Together, they balance each other: the teacher produces peaked but diverse distributions.
Traditional knowledge distillation works like this: train a large "teacher" model with labels, then train a smaller "student" to mimic the teacher's soft outputs. The key innovation of DINO is removing every component that requires labels.
Both student and teacher output a K-dimensional vector (K = 65536 in practice). These are converted to probability distributions via temperature-scaled softmax:
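Written out, one consistent form of these softmaxes is given below. The names P_s and P_t for the student and teacher distributions, g_s and g_t for the raw network outputs, and the superscript (i) indexing the K output dimensions are notational assumptions; c is the center that is subtracted from the teacher output.

```latex
P_s(x)^{(i)} = \frac{\exp\!\left(g_s(x)^{(i)} / \tau_s\right)}
                    {\sum_{k=1}^{K} \exp\!\left(g_s(x)^{(k)} / \tau_s\right)},
\qquad
P_t(x)^{(i)} = \frac{\exp\!\left(\left(g_t(x)^{(i)} - c^{(i)}\right) / \tau_t\right)}
                    {\sum_{k=1}^{K} \exp\!\left(\left(g_t(x)^{(k)} - c^{(k)}\right) / \tau_t\right)}
```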
The student temperature is fixed at τs = 0.1. The teacher temperature τt is lower: it is warmed up linearly from 0.04 to 0.07 over the first 30 epochs. Because τt < τs, the teacher's distribution is sharper than the student's, and this is the "sharpening".
The loss is a standard cross-entropy between teacher and student distributions:
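In symbols, assuming P_t and P_s denote the teacher and student distributions over the K output dimensions and θ_s the student parameters, the objective for a single view x can be written as:

```latex
\min_{\theta_s} \; H\!\left(P_t(x),\, P_s(x)\right),
\qquad
H(a, b) = -\sum_{i=1}^{K} a^{(i)} \log b^{(i)}
```

Only the student parameters are optimized; the teacher distribution is treated as a fixed target.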
With multi-crop, this becomes a sum over all pairs of (global teacher view, any student view):
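One way to write this sum, assuming x_1^g and x_2^g denote the two global views, V the set of all views (global and local), H the cross-entropy, and P_t, P_s the teacher and student distributions:

```latex
\min_{\theta_s} \;
\sum_{x \in \{x_1^g,\, x_2^g\}} \;
\sum_{\substack{x' \in V \\ x' \neq x}}
H\!\left(P_t(x),\, P_s(x')\right)
```

The condition x' ≠ x excludes pairs where teacher and student would see the identical view.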
The teacher output is centered before the softmax: gt(x) ← gt(x) − c, where c is the EMA of batch means.
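For a single pair of views, the centered and temperature-scaled cross-entropy can be sketched in plain Python. This is a toy illustration on short lists rather than K = 65536-dimensional tensors; the function names and the 4-dimensional example vectors are hypothetical.

```python
import math

def log_softmax(logits, temperature):
    """Log of a temperature-scaled softmax, computed stably."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_norm for v in scaled]

def dino_pair_loss(teacher_out, student_out, center, tau_t=0.04, tau_s=0.1):
    """Cross-entropy between the teacher and student distributions for one pair."""
    # Teacher side: subtract the center, then apply the low temperature.
    centered = [t - c for t, c in zip(teacher_out, center)]
    p_t = [math.exp(v) for v in log_softmax(centered, tau_t)]
    # Student side: no centering, higher temperature.
    log_p_s = log_softmax(student_out, tau_s)
    return -sum(pt * lps for pt, lps in zip(p_t, log_p_s))

teacher_out = [2.0, 1.0, 0.5, 0.0]
center = [0.0, 0.0, 0.0, 0.0]
loss_agree = dino_pair_loss(teacher_out, [2.0, 1.0, 0.5, 0.0], center)
loss_disagree = dino_pair_loss(teacher_out, [0.0, 0.5, 1.0, 2.0], center)
```

A student that ranks the output dimensions the same way as the teacher gets a much smaller loss than one that ranks them in reverse.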
```python
# gs, gt: student and teacher networks
# C: center (K-dim), tps/tpt: temperatures
# l, m: EMA rates for network and center
gt.params = gs.params
for x in loader:
    x1, x2 = augment(x), augment(x)
    s1, s2 = gs(x1), gs(x2)
    t1, t2 = gt(x1), gt(x2)
    loss = H(t1, s2)/2 + H(t2, s1)/2
    loss.backward()
    update(gs)  # SGD on student only
    gt.params = l*gt.params + (1-l)*gs.params  # EMA
    C = m*C + (1-m)*cat([t1, t2]).mean(dim=0)

def H(t, s):
    t = t.detach()  # stop gradient on teacher
    s = softmax(s / tps, dim=1)
    t = softmax((t - C) / tpt, dim=1)  # center + sharpen
    return -(t * log(s)).sum(dim=1).mean()
```
Notice what never appears in the loop: no y, no label, no target_class. The teacher's output IS the target. The system bootstraps its own supervision. The only input is raw images.

Collapse is the nightmare of self-supervised learning. Without labels to anchor the features, the network can find trivial shortcuts, such as outputting the exact same representation for every input. The loss can be driven to its minimum, but the features are useless.
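Different methods have different anti-collapse mechanisms: contrastive methods such as SimCLR and MoCo rely on negative pairs, BYOL on a predictor head with a momentum encoder, and SwAV on online clustering with balanced assignments.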
DINO's three mechanisms work together: the EMA teacher, centering, and sharpening.
The teacher parameters θt are an exponential moving average of the student: θt ← λθt + (1−λ)θs. The momentum λ follows a cosine schedule from 0.996 to 1.0 during training. This means the teacher changes very slowly — it's a smoothed ensemble of many past student states. If the student starts collapsing, the teacher still retains diverse representations from before the collapse began.
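The schedule and the update rule fit in a few lines. The sketch below assumes a per-step cosine schedule and represents parameters as flat lists of floats; the function names and the step count are hypothetical.

```python
import math

def teacher_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule that moves the momentum from `base` to `final`."""
    progress = step / max(1, total_steps - 1)
    return final - (final - base) * 0.5 * (1.0 + math.cos(math.pi * progress))

def ema_update(teacher_params, student_params, momentum):
    """theta_t <- lambda * theta_t + (1 - lambda) * theta_s, element-wise."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

schedule = [teacher_momentum(i, 100) for i in range(100)]
teacher = ema_update([0.0, 0.0], [1.0, -1.0], schedule[0])
```

With λ = 0.996 a single update moves the teacher only 0.4% of the way toward the student, and as λ approaches 1.0 the teacher effectively freezes.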
The center c is the running mean of the teacher's output over the batch:
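A form consistent with the pseudocode, assuming m is the center's EMA rate, B the number of teacher outputs in the batch, and g_t the teacher network:

```latex
c \;\leftarrow\; m\, c \;+\; (1 - m)\, \frac{1}{B} \sum_{i=1}^{B} g_t(x_i)
```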
This is subtracted from the teacher output before softmax. Without centering, one output dimension could dominate — the teacher would collapse to a one-hot vector that's the same for every input. Centering prevents this by keeping the mean output at zero.
The teacher uses a very low temperature τt = 0.04–0.07 in its softmax. This makes the output distribution peaked (high confidence). Without sharpening, centering alone would push toward a uniform distribution — the teacher would say "every class is equally likely" for every input. That's also collapse (just in the other direction).
Three output modes: uniform collapse (centering only), dominant dimension (no centering), and healthy behavior (centering + sharpening).
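The three output shapes can be reproduced on a toy batch. This is a static illustration rather than a simulation of training dynamics: the batch values are made up, every sample shares a large bias on dimension 0, and a temperature of 10 stands in for "no sharpening" (an assumption made for the example).

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

def argmax(p):
    return max(range(len(p)), key=p.__getitem__)

# Each row is one teacher output: a shared bias on dim 0 plus a per-sample signal.
batch = [[5.0, 0.0, 0.0], [4.0, 1.0, 0.0], [4.0, 0.0, 1.0]]
center = [sum(row[k] for row in batch) / len(batch) for k in range(3)]

def teacher_probs(row, use_center, temperature):
    logits = [v - c for v, c in zip(row, center)] if use_center else row
    return softmax(logits, temperature)

no_centering = [teacher_probs(r, False, 0.05) for r in batch]   # sharpen only
centering_only = [teacher_probs(r, True, 10.0) for r in batch]  # center only
both = [teacher_probs(r, True, 0.05) for r in batch]            # center + sharpen
```

Without centering, every sample puts its mass on the shared dimension. With centering but a high temperature, every output is close to uniform. With both, each sample is confident about a different dimension.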
This is the headline result — the property that made DINO famous. When you train a ViT with DINO and visualize the self-attention maps of the [CLS] token in the last layer, something remarkable appears: the attention maps spontaneously learn to segment objects.
No segmentation labels. No bounding boxes. No pixel-level supervision of any kind. The model just... learns that objects are things, and attends to them.
In a Vision Transformer, the input image is split into patches (e.g., 8×8 or 16×16 pixels each). A special [CLS] token is prepended to the sequence. Through 12 layers of self-attention, the [CLS] token learns to attend to the patches that are most informative for representing the image.
With supervised training, the [CLS] token attends diffusely, spreading attention across the image without clear spatial structure. But with DINO's self-supervised training, each attention head in the last layer learns to focus on semantically meaningful regions.
The different heads provide complementary views, and together they form a segmentation mask that accurately delineates object boundaries.
Simulated [CLS] token attention from different heads. Each head attends to different semantic regions.
This emergent segmentation is practically useful: you can threshold the attention maps to produce segmentation masks, and these masks are competitive with early unsupervised segmentation methods — all without any training for segmentation.
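One simple thresholding rule keeps the highest-attention patches until they account for a fixed share of the total attention. The sketch below assumes a 60% share and a made-up 4×4 grid of [CLS] attention weights; both the rule's parameter and the helper names are illustrative.

```python
def attention_mask(cls_attention, keep_mass=0.6):
    """Return a flat 0/1 mask keeping the top patches that together
    account for `keep_mass` of the total attention."""
    total = sum(cls_attention)
    order = sorted(range(len(cls_attention)),
                   key=lambda i: cls_attention[i], reverse=True)
    mask = [0] * len(cls_attention)
    cumulative = 0.0
    for i in order:
        if cumulative >= keep_mass * total:
            break
        mask[i] = 1
        cumulative += cls_attention[i]
    return mask

def to_grid(flat, side):
    """Reshape a flat list into a list of `side` rows."""
    return [flat[r * side:(r + 1) * side] for r in range(side)]

# Toy attention of the [CLS] token over a 4x4 grid of patches (sums to 1).
attn = [0.02, 0.03, 0.02, 0.01,
        0.03, 0.20, 0.18, 0.02,
        0.02, 0.19, 0.17, 0.02,
        0.01, 0.03, 0.03, 0.02]
mask = to_grid(attention_mask(attn), 4)
```

Here the four central patches carry most of the attention, so the mask selects exactly that 2×2 block. With real attention maps the same rule would be applied per head and the result upsampled to image resolution.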
DINO's frozen features can also be used for video object segmentation: a mask given on the first frame is propagated to later frames by nearest-neighbor matching of patch features, again without any video-specific training.
Here's another surprise from DINO: the learned features are so well-organized that a simple k-nearest-neighbor classifier — with zero training — achieves competitive ImageNet accuracy.
No linear probe. No fine-tuning. No hyperparameter search. No data augmentation at test time. Just: "find the closest training images and copy their label."
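The whole classifier fits in a few lines. The sketch below uses made-up 2-dimensional vectors in place of frozen image features, cosine similarity as the distance, and a plain majority vote with k = 3; these are all simplifying assumptions made for brevity.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn_predict(query, train_feats, train_labels, k=3):
    """Label a query by majority vote among its k most similar training features."""
    ranked = sorted(range(len(train_feats)),
                    key=lambda i: cosine(query, train_feats[i]), reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical frozen features for six training images.
train_feats = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
               [0.1, 1.0], [0.0, 0.9], [0.2, 0.8]]
train_labels = ["dog", "dog", "dog", "cat", "cat", "cat"]
pred = knn_predict([0.95, 0.15], train_feats, train_labels)
```

Nothing here is learned: the only way this classifier can do well is if the feature space already places images of the same class close together.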
With a ViT-S/16 backbone:
Compare this with other self-supervised methods on the same ViT-S architecture:
k-NN vs linear probe accuracy for different self-supervised methods on ViT-S. DINO's gap is remarkably small.
DINO's results span ImageNet classification, image retrieval, copy detection, video segmentation, and transfer learning. Here are the highlights.
With a ViT-B/8 backbone (85M parameters, 8×8 patches), DINO achieves:
This beats all previous self-supervised methods, including those using much larger architectures. Importantly, using smaller patches (/8 vs /16) has a bigger impact than using a larger model.
DINO features excel at image retrieval tasks (Oxford and Paris benchmarks). When pretrained on Google Landmarks v2 instead of ImageNet, DINO ViT-S/16 achieves 51.5 mAP on Revisited Oxford (Medium) — competitive with dedicated retrieval systems.
On the Copydays benchmark, DINO ViT-B/8 achieves 85.5% mAP, outperforming Multigrain (82.5%), a model trained specifically for this task.
Without any video training, DINO features can track objects across video frames by propagating segmentation masks between consecutive frames. On the DAVIS-2017 video object segmentation benchmark, DINO achieves competitive results using only frozen features and nearest-neighbor matching.
ImageNet linear probe accuracy across self-supervised methods and architectures.
One of DINO's most practical findings: reducing patch size from 16×16 to 8×8 dramatically improves results. ViT-S/8 reaches 79.7% linear accuracy, surpassing ViT-B/16 (78.2%) with roughly 4× fewer parameters. The smaller patches create 4× more tokens, giving the attention maps higher spatial resolution and enabling finer-grained segmentation.
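The token arithmetic is easy to check. The sketch below assumes a square 224×224 input (a common choice, not stated above) and counts one extra token for [CLS]; the function name is hypothetical.

```python
def vit_token_count(image_size, patch_size):
    """Number of patch tokens plus one [CLS] token for a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1

tokens_16 = vit_token_count(224, 16)  # 14 * 14 patches + [CLS]
tokens_8 = vit_token_count(224, 8)    # 28 * 28 patches + [CLS]

patch_ratio = (tokens_8 - 1) / (tokens_16 - 1)
# Self-attention compares every token with every other token,
# so its cost grows with the square of the sequence length.
attention_cost_ratio = (tokens_8 / tokens_16) ** 2
```

Halving the patch side gives 4× as many patches, but because self-attention is quadratic in sequence length, the attention cost grows by close to 16×. That is the price of the finer attention maps.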
DINO works with both ViTs and CNNs — it achieves 75.3% linear accuracy with a ResNet-50, matching the state of the art. But the emergent properties are unique to ViTs. Why?
In a CNN, there's no direct equivalent of "what is the model attending to." You can compute gradient-based saliency maps (Grad-CAM), but these are post-hoc approximations. In a ViT, self-attention weights are a native part of the architecture — you can directly read off which patches the model considers important.
The [CLS] token is unique to ViTs. It has no spatial position — it's a global summary that must learn to aggregate information from all patches. With self-supervised training, this aggregation becomes spatially structured: different attention heads specialize in different aspects of the scene (object interior, boundaries, background).
DINO's multi-crop strategy specifically encourages local-to-global reasoning: the student sees small crops but must match the teacher's output on global views. In a ViT, this means the attention mechanism must learn to relate local patch features to global image-level semantics. CNNs, with their fixed receptive fields, handle this less naturally.
A subtle but important difference: ViTs don't use batch normalization by default. BN creates implicit communication between samples in a batch, which can provide shortcuts for self-supervised methods (the model can "cheat" by using batch statistics). DINO with ViT is entirely BN-free, making the system cleaner and the learned features more robust.
MoCo (He et al., 2020): Introduced the momentum encoder for self-supervised learning. DINO adopts this as the EMA teacher but replaces the contrastive loss and memory queue with cross-entropy distillation.
BYOL (Grill et al., 2020): Showed that you can learn without negatives using a predictor head + momentum encoder. DINO simplifies further by removing the predictor head and using centering + sharpening instead.
SimCLR (Chen et al., 2020): The contrastive learning baseline. Required batch sizes of 4096+ for good performance. DINO avoids contrastive losses entirely, working well with standard batch sizes of 1024.
SwAV (Caron et al., 2020): By the same first author. Introduced multi-crop training (which DINO adopts) and online clustering. DINO replaces clustering with cross-entropy distillation.
DINOv2 (Oquab et al., 2024): Scaled DINO to ViT-g (1.1B parameters) trained on a curated dataset of 142M images. Combined DINO's self-distillation with iBOT's masked image modeling. Achieved features that transfer to almost any vision task without fine-tuning — a true vision foundation model.
MAE (He et al., 2022): Took the complementary approach: instead of self-distillation, mask 75% of patches and reconstruct them. Different philosophy but validated that self-supervised ViTs produce powerful features.
CLIP (Radford et al., 2021): Trained ViTs with natural language supervision instead of self-supervision. Different from DINO's approach (no text involved) but both showed that ViTs benefit from non-standard training.
Segment Anything (SAM) (Kirillov et al., 2023): DINO demonstrated that self-supervised ViTs could segment objects without labels. SAM took this further with a massive labeled segmentation dataset, but the insight that ViTs naturally understand object boundaries came from DINO.
Foundation Models: DINO was a key proof that self-supervised ViTs can serve as general-purpose visual backbones. This inspired the current wave of vision foundation models (DINOv2, SigLIP, EVA, InternViT) used in VLMs and VLAs.