Scaling self-supervised ViT training with curated data and combined losses to produce all-purpose visual features that rival CLIP — without any text supervision.
By 2023, the vision community had a clear winner for "general-purpose visual features": CLIP and its open-source sibling OpenCLIP. Train a vision encoder alongside a text encoder on billions of image-caption pairs, and you get features that transfer to almost any downstream task. Simple, effective, dominant.
But CLIP has a fundamental limitation: it requires text supervision. The captions only approximate the rich information in images. Complex spatial relationships, fine-grained textures, pixel-level details — these don't surface easily in a short caption. And you need massive aligned image-text corpora, which are expensive and noisy to collect.
Meanwhile, self-supervised methods like DINO had shown something remarkable. Without any labels at all, DINO's features exhibited emergent properties: the attention maps naturally segmented objects, the patch features encoded semantic parts, the [CLS] token captured high-level semantics. But DINO was trained on ImageNet-1k — just 1.3 million images. Every attempt to scale self-supervised methods to larger, uncurated datasets led to a significant drop in feature quality.
Can self-supervised learning match CLIP-style features if it is trained at scale on curated data? DINOv2 answers yes. The recipe: (1) curate a large, diverse dataset automatically, (2) combine the best self-supervised objectives, and (3) scale to a 1.1B-parameter ViT. The resulting features work as drop-in replacements across vision tasks (classification, segmentation, depth estimation, retrieval) without fine-tuning.
DINOv2's insight is deceptively simple: existing self-supervised methods already work — they just need better data and more compute. The specific recipe combines three ingredients:
Previous attempts to scale self-supervised learning used uncurated web data. The problem? Uncurated data is dominated by a few visual modes (e.g., product photos, memes, screenshots) and contains duplicates and noise. DINOv2 instead builds an automatic pipeline that retrieves diverse, high-quality images from web data using existing curated datasets as "anchors." The result: LVD-142M, a 142-million-image dataset that is both large and diverse.
Rather than inventing a new objective, DINOv2 combines two proven ones: the DINO self-distillation loss on [CLS] tokens (captures image-level semantics) and the iBOT masked image modeling loss on patch tokens (captures local, pixel-level features). Together they produce features that excel at both image-level and pixel-level tasks.
Custom FlashAttention, sequence packing, efficient stochastic depth, FSDP — these engineering contributions make training 2x faster and 3x more memory-efficient than comparable methods, enabling training of a 1.1B-parameter ViT-g.
Most self-supervised learning research trains on ImageNet-1k (1.3M images) or uncurated web scrapes. Neither works well for general-purpose features: ImageNet is too small and domain-specific, while uncurated data is noisy and imbalanced. DINOv2 introduces an automatic curation pipeline that produces LVD-142M — 142 million diverse, high-quality images.
The curation process has three stages:
Step 1: Gather sources. Start with a small set of curated "anchor" datasets: ImageNet-22k, Google Landmarks, and several fine-grained datasets. These define what "good" images look like. Then collect 1.2 billion raw images from publicly available web crawls.
Step 2: Deduplicate. Apply a copy-detection pipeline to remove near-duplicates from the uncurated pool. This increases diversity and prevents overfitting on repeated images. Also remove any images that appear in downstream benchmark test sets.
Step 3: Retrieve. Compute embeddings for all images using a self-supervised ViT-H/16 pretrained on ImageNet-22k. For each image in the curated anchor sets, retrieve the 4 nearest neighbors from the uncurated pool using cosine similarity. This "pulls in" web images that are visually similar to curated ones — expanding coverage while maintaining quality.
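Steps 2 and 3 can be sketched on toy data. The code below is a minimal illustration, not the pipeline DINOv2 actually runs: the greedy cosine-threshold deduplication is a simple stand-in for the copy-detection pipeline, the 0.95 threshold and the function names are assumptions, and random vectors stand in for image embeddings.

```python
import numpy as np

def normalize(x):
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def greedy_dedup(pool_emb, threshold=0.95):
    """Keep an image only if no already-kept image is a near-duplicate.

    `threshold` is a hypothetical cosine-similarity cutoff for illustration.
    """
    p = normalize(pool_emb)
    kept = []
    for i in range(len(p)):
        if not kept or np.max(p[kept] @ p[i]) <= threshold:
            kept.append(i)
    return np.array(kept)

def retrieve_neighbors(anchor_emb, pool_emb, k=4):
    """Indices of the k most cosine-similar pool images for each anchor."""
    sims = normalize(anchor_emb) @ normalize(pool_emb).T   # (n_anchor, n_pool)
    return np.argsort(-sims, axis=1)[:, :k]

rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 32))                 # stand-in for curated images
web = rng.normal(size=(100, 32))                   # stand-in for web images
web[:3] = anchors + 0.01 * rng.normal(size=(3, 32))    # plant close matches
web = np.vstack([web, web[:5] + 1e-4])             # plant 5 near-duplicates

kept = greedy_dedup(web)                           # Step 2: deduplicate
neighbors = retrieve_neighbors(anchors, web[kept], k=4)  # Step 3: retrieve
curated_ids = np.unique(kept[neighbors])           # indices into the original pool
```

A brute-force similarity matrix like this only works for small pools. At the scale of 1.2 billion images, an approximate nearest-neighbor index would be needed instead.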
The result? LVD-142M maintains ImageNet-level performance while dramatically improving on out-of-domain tasks. Training on uncurated data of the same size leads to a significant quality drop — confirming that curation, not just scale, is essential.
DINOv2's training objective combines two complementary self-supervised losses. Let's build up each one, then see how they work together.
Given an image, create multiple crops: two large "global" crops (224x224) and several small "local" crops (96x96). Pass them through a student network and an EMA teacher network. Both output a [CLS] token representation, which is projected through an MLP head into "prototype scores," then normalized with softmax.
The loss is cross-entropy between teacher and student prototype distributions:
$$\mathcal{L}_{\mathrm{DINO}} = -\sum p_t \log p_s$$
where $p_t$ is the teacher's softmax output (after centering) and $p_s$ is the student's. The teacher is built as an exponential moving average (EMA) of the student weights, a form of self-distillation. This loss captures image-level semantics in the [CLS] token.
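To make the image-level loss concrete, here is a minimal NumPy sketch of the cross-entropy between a centered teacher distribution and the student distribution, plus the EMA teacher update. The temperatures (0.1 and 0.04), the momentum value, and the simple mean centering are illustrative assumptions, not settings taken from this article.

```python
import numpy as np

def log_softmax(x, temp):
    """Numerically stable log-softmax over the last axis at a temperature."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy -sum(p_t * log p_s), averaged over the batch."""
    p_t = np.exp(log_softmax(teacher_logits - center, teacher_temp))
    log_p_s = log_softmax(student_logits, student_temp)
    return float(-(p_t * log_p_s).sum(axis=-1).mean())

def ema_update(teacher_w, student_w, momentum=0.99):
    """Teacher weights track the student as an exponential moving average."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 64))          # 8 crops, 64 prototypes
student_logits = rng.normal(size=(8, 64))
center = teacher_logits.mean(axis=0)               # simple mean centering
loss = dino_loss(student_logits, teacher_logits, center)

teacher_w = ema_update(np.ones(4), np.zeros(4), momentum=0.9)
```

In a real training loop no gradient flows through the teacher branch. Only the student is optimized, and the teacher changes solely through the EMA update.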
Randomly mask some input patches in the student's input (but not the teacher's). For each masked position, the student predicts what the teacher sees at that position. The loss is again cross-entropy, but now applied to individual patch tokens:
$$\mathcal{L}_{\mathrm{iBOT}} = -\sum_{i} p_{ti} \log p_{si}$$
where $i$ ranges over the masked patch positions. This forces the student to reconstruct local patch information from context, learning fine-grained, pixel-level features.
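The patch-level loss has the same shape as the image-level one, but it is computed per patch and averaged over masked positions only. The sketch below assumes illustrative temperatures and a 30% mask ratio, and it omits teacher centering for brevity.

```python
import numpy as np

def log_softmax(x, temp):
    """Numerically stable log-softmax over the last axis at a temperature."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ibot_loss(student_patch_logits, teacher_patch_logits, mask,
              student_temp=0.1, teacher_temp=0.04):
    """Patch-level cross-entropy, averaged over masked positions only.

    Both logit arrays have shape (n_patches, n_prototypes); `mask` is a
    boolean array of shape (n_patches,), True where the student's input
    patch was masked.
    """
    p_t = np.exp(log_softmax(teacher_patch_logits, teacher_temp))
    log_p_s = log_softmax(student_patch_logits, student_temp)
    per_patch = -(p_t * log_p_s).sum(axis=-1)      # (n_patches,)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)
n_patches, n_prototypes = 256, 32                  # 16x16 grid of patches
teacher = rng.normal(size=(n_patches, n_prototypes))
student = rng.normal(size=(n_patches, n_prototypes))
mask = rng.random(n_patches) < 0.3                 # hypothetical 30% mask ratio
loss = ibot_loss(student, teacher, mask)
```

Because only masked positions enter the average, changing the student's predictions at visible patches leaves this loss unchanged.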
The combined objective adds the DINO and iBOT losses together with a KoLeo regularizer, $\mathcal{L}_{\mathrm{KoLeo}}$, which encourages features to spread uniformly in the embedding space (preventing mode collapse). At scale, DINOv2 also uses separate MLP heads for DINO and iBOT (unlike the original iBOT, which shared them) and Sinkhorn-Knopp centering instead of simple moving-average centering.
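One way to see what the KoLeo term rewards is to compute it on toy features. The sketch below takes the negative mean log distance from each L2-normalized feature to its nearest neighbor in the batch, so clumped features score high and spread-out features score low. The `eps` constant and the toy data are assumptions for illustration.

```python
import numpy as np

def koleo_loss(features, eps=1e-8):
    """-mean(log(distance to nearest neighbor)) over L2-normalized features."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                # ignore self-distance
    nearest = dists.min(axis=1)
    return float(-np.log(nearest + eps).mean())

rng = np.random.default_rng(0)
spread = rng.normal(size=(64, 16))                 # directions spread over the sphere
clumped = 0.01 * rng.normal(size=(64, 16)) + 1.0   # nearly identical directions

loss_spread = koleo_loss(spread)
loss_clumped = koleo_loss(clumped)                 # larger: the batch is clumped
```

Minimizing this term pushes each feature away from its nearest neighbor, which is what spreads the batch across the embedding space.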
DINOv2 uses the Vision Transformer (ViT) architecture with patch size 14. The team trains four model sizes: ViT-S, ViT-B, ViT-L, and ViT-g.
The ViT-g architecture differs slightly from the one proposed by Zhai et al. (2022). To maximize GPU efficiency with their custom FlashAttention kernel, DINOv2 uses an embedding dimension of 1536 with 24 heads (64 dim/head) rather than 1408 with 16 heads (88 dim/head). Matrix operations are most efficient when the full embedding dimension is a multiple of 256. No difference in final accuracy was observed.
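The divisibility argument is easy to check by hand. This small sketch uses only the numbers quoted above; the helper name `describe_config` is made up for illustration.

```python
def describe_config(embed_dim, n_heads):
    """Return (dim per head, per-head multiple of 64?, embed multiple of 256?)."""
    head_dim, remainder = divmod(embed_dim, n_heads)
    assert remainder == 0, "embedding dimension must divide evenly across heads"
    return head_dim, head_dim % 64 == 0, embed_dim % 256 == 0

dinov2_vit_g = describe_config(1536, 24)   # the configuration DINOv2 uses
zhai_vit_g = describe_config(1408, 16)     # the Zhai et al. (2022) configuration

print("DINOv2 ViT-g:     ", dinov2_vit_g)
print("Zhai et al. ViT-g:", zhai_vit_g)
```

The DINOv2 configuration satisfies both alignment conditions, while the 1408/16 configuration satisfies neither.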
Here's an important design choice: only the ViT-g is trained from scratch on LVD-142M. The smaller models (ViT-S, ViT-B, ViT-L) are distilled from the ViT-g using the same training loop with a few modifications.
This distillation approach achieves better performance than training smaller models from scratch — even for the ViT-L. The ViT-g effectively compresses its learned representations into smaller, more deployable models.
Training a 1.1B parameter ViT on 142M images with self-distillation is an engineering challenge. DINOv2 introduces several optimizations that together make training 2x faster and 3x more memory-efficient than comparable self-supervised methods.
The team implements their own version of FlashAttention optimized for their use case. Efficiency is best when the embedding dimension per head is a multiple of 64 — which motivated the ViT-g architecture choice of 1536 dim / 24 heads = 64 dim/head.
DINO's multi-crop strategy creates sequences of different lengths: global crops (224px, a 16×16 grid of 256 patches at patch size 14) and local crops (nominally 96px; the quoted 49 patches is a 7×7 grid, which at patch size 14 corresponds to 98px). Normally these require separate forward passes. Sequence packing concatenates them into one long sequence with a block-diagonal attention mask that prevents cross-sequence attention. This is mathematically equivalent to separate forward passes but significantly faster on GPU.
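The equivalence claim can be verified numerically. The sketch below runs single-head attention separately on two sequences of 256 and 49 tokens, then once on the packed sequence with a block-diagonal mask, and compares the outputs. It demonstrates correctness only: a real packed kernel skips the masked blocks instead of computing and discarding them, which is where the speedup comes from. The function names and the tiny feature dimension are assumptions.

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention; mask is True where allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def block_diagonal_mask(lengths):
    """Boolean mask allowing attention only within each packed sequence."""
    seq_id = np.repeat(np.arange(len(lengths)), lengths)
    return seq_id[:, None] == seq_id[None, :]

rng = np.random.default_rng(0)
dim = 8
lengths = [256, 49]                                # one global crop, one local crop
seqs = [rng.normal(size=(n, dim)) for n in lengths]

# Separate forward passes (self-attention with q = k = v for brevity).
separate = [attention(s, s, s) for s in seqs]

# One packed forward pass with a block-diagonal mask.
packed = np.concatenate(seqs, axis=0)
out = attention(packed, packed, packed, block_diagonal_mask(lengths))

matches = np.allclose(out, np.concatenate(separate, axis=0))
```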
Standard stochastic depth masks the output of dropped layers but still computes the forward pass for them. DINOv2's version actually skips the computation by shuffling samples along the batch dimension and running the block on only the first (1−d)×B samples. With a drop rate of d=0.4, this saves roughly 40% of compute and memory in those blocks.
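Here is a minimal sketch of the shuffle-and-slice idea for a single residual block. Rescaling the residual by B / n_keep, so that its expected contribution matches the no-drop case, is an assumption of this sketch, as are the function name and the toy block.

```python
import numpy as np

def residual_with_efficient_drop(x, block_fn, drop_rate, rng):
    """Apply x + block_fn(x) to a random subset of the batch only.

    Samples outside the subset pass through unchanged, and block_fn is
    never evaluated on them.
    """
    batch = x.shape[0]
    n_keep = max(int(round(batch * (1.0 - drop_rate))), 1)
    keep = rng.permutation(batch)[:n_keep]         # shuffle, then slice
    residual = block_fn(x[keep])                   # compute on kept samples only
    out = x.copy()
    out[keep] += residual * (batch / n_keep)       # rescale to keep expectation
    return out, keep

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))                       # batch of 10 samples
out, keep = residual_with_efficient_drop(x, lambda h: 0.5 * h, 0.4, rng)
```

With a batch of 10 and d=0.4, the block runs on 6 samples instead of 10, so the savings scale directly with the drop rate.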
Training with AdamW requires 4 copies of the model in float32 (student, teacher, optimizer first moments, optimizer second moments) — 16 GB for 1B parameters. FSDP shards these across GPUs. Communication uses float16 for the backbone (float32 for MLP heads to avoid instability), cutting communication costs by ~50% versus standard DDP.
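The 16 GB figure follows from simple arithmetic, sketched below. The 8-GPU count is a hypothetical example, and the per-GPU number assumes perfectly even sharding while ignoring activations and communication buffers.

```python
def training_state_gb(n_params, n_copies=4, bytes_per_value=4):
    """Memory for float32 copies of the model state, in GB (1e9 bytes)."""
    return n_params * n_copies * bytes_per_value / 1e9

def per_gpu_gb(total_gb, n_gpus):
    """Ideal even sharding; ignores activations and communication buffers."""
    return total_gb / n_gpus

# Student, teacher, and two AdamW moment buffers for 1B parameters.
total = training_state_gb(1_000_000_000)
sharded = per_gpu_gb(total, n_gpus=8)              # 8 GPUs is a hypothetical count
print(f"total: {total:.1f} GB, per GPU: {sharded:.1f} GB")
```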
The gold standard for evaluating frozen features: freeze the backbone, train only a linear classifier on top, and measure accuracy. If a linear probe does well, the features themselves encode the relevant information — no fine-tuning needed.
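As a rough illustration of the protocol, the sketch below fits a softmax linear classifier on fixed feature vectors and evaluates it on held-out samples. Synthetic clusters stand in for frozen [CLS] features, and the learning rate, step count, and function names are assumptions. Nothing here touches a real backbone.

```python
import numpy as np

def train_linear_probe(features, labels, n_classes, lr=0.01, steps=300):
    """Fit a softmax linear classifier on fixed features with gradient descent."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                # d(cross-entropy)/d(logits)
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    return np.argmax(features @ W + b, axis=1)

# Synthetic stand-in for frozen features: three well-separated clusters.
rng = np.random.default_rng(0)
centers = 3.0 * rng.normal(size=(3, 16))
labels = rng.integers(0, 3, size=300)
features = centers[labels] + rng.normal(size=(300, 16))

# The "backbone" output is fixed; only W and b are trained.
W, b = train_linear_probe(features[:200], labels[:200], n_classes=3)
accuracy = float((predict(features[200:], W, b) == labels[200:]).mean())
```

The point of the protocol is that the only trainable parameters are `W` and `b`, so any accuracy has to come from information already present in the features.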
DINOv2 ViT-g achieves 86.5% top-1 accuracy on ImageNet-1k with a simple linear probe.
But ImageNet accuracy alone doesn't prove "general-purpose" features. DINOv2 is evaluated on a wide range of tasks, all using frozen features with minimal task-specific adaptation.
The central claim of DINOv2 is not just that it achieves high accuracy on one benchmark. It's that the same frozen features work as a drop-in replacement for any vision task. No fine-tuning. No task-specific architecture changes. Just freeze the backbone and add a simple head.
The [CLS] token captures global image semantics. A linear probe or k-NN classifier on this single vector handles classification, retrieval, and domain transfer with state-of-the-art results.
The patch tokens capture local spatial information. A linear probe on the spatial grid of patch tokens handles segmentation, depth estimation, and surface normal prediction. This is where DINOv2 especially outshines CLIP — text-supervised models tend to have weak spatial features because captions don't describe pixel-level structure.
One of the most striking results: compute PCA on DINOv2's patch features across different images. The first three principal components, mapped to RGB channels, naturally segment objects and match corresponding parts across images — a dog's head maps to the same color regardless of pose, breed, or background. This emergent property (first observed in DINO) becomes even more pronounced at scale.
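The visualization itself takes only a few lines. The sketch below projects patch features onto their top three principal components and rescales each to [0, 1] for display as RGB. Random vectors stand in for real patch tokens, so the output here has no meaningful structure, and the 384-dimensional feature size and 16×16 grid are illustrative assumptions.

```python
import numpy as np

def pca_rgb(patch_features, grid_hw):
    """Project patch features onto their top 3 principal components and
    rescale each component to [0, 1] so it can be shown as an RGB image."""
    x = patch_features - patch_features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T                            # (n_patches, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / (hi - lo + 1e-12)
    return rgb.reshape(grid_hw[0], grid_hw[1], 3)

rng = np.random.default_rng(0)
patches = rng.normal(size=(16 * 16, 384))          # random stand-in for patch tokens
image = pca_rgb(patches, (16, 16))                 # (16, 16, 3) array in [0, 1]
```

To get colors that are consistent across images, the PCA would be fit once on patch features pooled from all the images, then applied to each image's grid.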
The natural question: how does DINOv2 compare to CLIP/OpenCLIP, the dominant paradigm for general visual features?