Oquab, Darcet, Moutakanni et al. (Meta AI) — 2023

DINOv2: Learning Robust Visual Features

Scaling self-supervised ViT training with curated data and combined losses to produce all-purpose visual features that rival CLIP — without any text supervision.

Prerequisites: Vision Transformers + Self-supervised learning basics
10 chapters, 5+ simulations

Chapter 0: The Problem

By 2023, the vision community had a clear winner for "general-purpose visual features": CLIP and its open-source sibling OpenCLIP. Train a vision encoder alongside a text encoder on billions of image-caption pairs, and you get features that transfer to almost any downstream task. Simple, effective, dominant.

But CLIP has a fundamental limitation: it requires text supervision. The captions only approximate the rich information in images. Complex spatial relationships, fine-grained textures, pixel-level details — these don't surface easily in a short caption. And you need massive aligned image-text corpora, which are expensive and noisy to collect.

Meanwhile, self-supervised methods like DINO had shown something remarkable. Without any labels at all, DINO's features exhibited emergent properties: the attention maps naturally segmented objects, the patch features encoded semantic parts, the [CLS] token captured high-level semantics. But DINO was trained on ImageNet-1k — just 1.3 million images. Every attempt to scale self-supervised methods to larger, uncurated datasets led to a significant drop in feature quality.

The gap: Text-supervised models (CLIP) produce great general features but are bottlenecked by caption quality. Self-supervised models (DINO) learn richer visual representations but haven't scaled beyond small curated datasets. Can we close this gap — matching CLIP's transfer performance using images alone, no text at all?

DINOv2 answers yes. The recipe: (1) curate a large, diverse dataset automatically, (2) combine the best self-supervised objectives, and (3) scale to a 1B-parameter ViT. The resulting features work as drop-in replacements for any vision task — classification, segmentation, depth estimation, retrieval — without fine-tuning.

What is the core limitation of text-supervised models like CLIP that DINOv2 aims to overcome?

Chapter 1: The Key Insight

DINOv2's insight is deceptively simple: existing self-supervised methods already work — they just need better data and more compute. The specific recipe combines three ingredients:

Ingredient 1: Curated data

Previous attempts to scale self-supervised learning used uncurated web data. The problem? Uncurated data is dominated by a few visual modes (e.g., product photos, memes, screenshots) and contains duplicates and noise. DINOv2 instead builds an automatic pipeline that retrieves diverse, high-quality images from web data using existing curated datasets as "anchors." The result: LVD-142M, a 142-million image dataset that's both large and diverse.

Ingredient 2: Combined losses

Rather than inventing a new objective, DINOv2 combines two proven ones: the DINO self-distillation loss on [CLS] tokens (captures image-level semantics) and the iBOT masked image modeling loss on patch tokens (captures local, pixel-level features). Together they produce features that excel at both image-level and pixel-level tasks.

Ingredient 3: Scale + engineering

Custom FlashAttention, sequence packing, efficient stochastic depth, FSDP — these engineering contributions make training 2x faster and 3x more memory-efficient than comparable methods, enabling training of a 1.1B-parameter ViT-g.

The meta-insight: You don't need a clever new loss function. You need (1) the right data, (2) the right combination of existing objectives, and (3) the engineering to actually train at scale. This is a systems paper as much as a methods paper.

What are the two self-supervised losses DINOv2 combines?

Chapter 2: Data Curation

Most self-supervised learning research trains on ImageNet-1k (1.3M images) or uncurated web scrapes. Neither works well for general-purpose features: ImageNet is too small and domain-specific, while uncurated data is noisy and imbalanced. DINOv2 introduces an automatic curation pipeline that produces LVD-142M — 142 million diverse, high-quality images.

The pipeline

The curation process has three stages:

Step 1: Gather sources. Start with a small set of curated "anchor" datasets: ImageNet-22k, Google Landmarks, and several fine-grained datasets. These define what "good" images look like. Then collect 1.2 billion raw images from publicly available web crawls.

Step 2: Deduplicate. Apply a copy-detection pipeline to remove near-duplicates from the uncurated pool. This increases diversity and prevents overfitting on repeated images. Also remove any images that appear in downstream benchmark test sets.

Step 3: Retrieve. Compute embeddings for all images using a self-supervised ViT-H/16 pretrained on ImageNet-22k. For each image in the curated anchor sets, retrieve the 4 nearest neighbors from the uncurated pool using cosine similarity. This "pulls in" web images that are visually similar to curated ones — expanding coverage while maintaining quality.
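
The retrieval step can be sketched in a few lines. This is a minimal illustration, assuming embeddings are already computed and small enough to compare by brute force; the function names and toy 2-D vectors are hypothetical, and a pool of a billion images would need an approximate nearest-neighbor index rather than this exhaustive loop.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def retrieve_neighbors(anchor_embeddings, pool_embeddings, k=4):
    """Return pool indices that are among the k most similar images
    (by cosine similarity) to at least one curated anchor image."""
    selected = set()
    for anchor in anchor_embeddings:
        scored = [(cosine_similarity(anchor, p), idx)
                  for idx, p in enumerate(pool_embeddings)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected.update(idx for _, idx in scored[:k])
    return selected

# Toy data: two curated anchors, five uncurated pool images.
anchors = [[1.0, 0.0], [0.0, 1.0]]
pool = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0], [0.7, 0.7], [0.0, -1.0]]
curated = retrieve_neighbors(anchors, pool, k=2)
```

Pool images that no anchor retrieves (here indices 2 and 4, which point away from both anchors) are simply left out of the curated set.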

Why not just use more curated data? Manual curation doesn't scale. The brilliance of this pipeline is that it uses a small seed of curated data to automatically find similar high-quality images from a massive uncurated pool. Think of it like using Wikipedia to train a language model that then scores web text — except here, the "scoring" is visual similarity via embeddings.

The result? Models trained on LVD-142M stay close to ImageNet-trained performance on ImageNet itself while improving markedly on out-of-domain tasks. Training on uncurated data of the same size leads to a significant quality drop, confirming that curation, not just scale, is essential.

How does DINOv2 build its LVD-142M dataset from uncurated web data?

Chapter 3: The Combined Training Objective

DINOv2's training objective combines two complementary self-supervised losses. Let's build up each one, then see how they work together.

Loss 1: DINO self-distillation (image-level)

Given an image, create multiple crops: two large "global" crops (224x224) and several small "local" crops (96x96). Pass them through a student network and an EMA teacher network. Both output a [CLS] token representation, which is projected through an MLP head into "prototype scores," then normalized with softmax.

The loss is cross-entropy between teacher and student prototype distributions:

$$L_{\mathrm{DINO}} = -\sum p_t \log p_s$$

Where $p_t$ is the teacher's softmax output (after centering) and $p_s$ is the student's. The teacher is built as an exponential moving average (EMA) of the student weights, a form of self-distillation. This loss captures image-level semantics in the [CLS] token.
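
Here is a minimal sketch of that loss and the EMA teacher update on plain Python lists. It assumes the simple mean-centering of the original DINO (not Sinkhorn-Knopp), and the temperatures and momentum are illustrative defaults, not values taken from this text.

```python
import math

def log_softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student prototype distributions.
    Temperatures are assumed values for illustration."""
    # Teacher: subtract the running center, then sharpen with a low temperature.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    p_t = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    log_p_s = log_softmax(student_logits, student_temp)
    return -sum(pt * lps for pt, lps in zip(p_t, log_p_s))

def ema_update(teacher_params, student_params, momentum=0.994):
    """Teacher weights drift slowly toward the student weights."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

teacher_logits = [2.0, 0.0, 0.0]
center = [0.0, 0.0, 0.0]
loss_agree = dino_loss([2.0, 0.0, 0.0], teacher_logits, center)
loss_disagree = dino_loss([0.0, 2.0, 0.0], teacher_logits, center)
```

The loss is small when the student puts its mass on the same prototype as the teacher and large when it does not. No gradient flows through the teacher; it only changes through `ema_update`.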

Loss 2: iBOT masked image modeling (patch-level)

Randomly mask some input patches in the student's input (but not the teacher's). For each masked position, the student predicts what the teacher sees at that position. The loss is again cross-entropy, but now applied to individual patch tokens:

$$L_{\mathrm{iBOT}} = -\sum_{i \in \text{masked}} p_{t,i} \log p_{s,i}$$

This forces the student to reconstruct local patch information from context — learning fine-grained, pixel-level features.
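
A small sketch of the patch-level loss, assuming per-patch prototype logits are already available for both networks. Temperatures and centering are left out to keep the masking logic visible, and the toy logits are made up.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def ibot_loss(student_patch_logits, teacher_patch_logits, masked):
    """Sum of per-patch cross-entropies over masked positions only."""
    loss = 0.0
    for s_logits, t_logits, is_masked in zip(
            student_patch_logits, teacher_patch_logits, masked):
        if not is_masked:
            continue  # visible patches contribute nothing
        log_p_s = log_softmax(s_logits)
        p_t = [math.exp(v) for v in log_softmax(t_logits)]
        loss -= sum(pt * lps for pt, lps in zip(p_t, log_p_s))
    return loss

# Three patches, two prototypes; patches 0 and 2 are masked for the student.
teacher = [[3.0, 0.0], [0.0, 3.0], [3.0, 0.0]]
student = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
masked = [True, False, True]
loss = ibot_loss(student, teacher, masked)
```

Patch 1 is visible to the student, so its (wrong) prediction there is ignored. Implementations commonly divide by the number of masked patches; the sum is kept here to match the formula above.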

The combined objective

$$L_{\mathrm{total}} = L_{\mathrm{DINO}} + \lambda L_{\mathrm{iBOT}} + \alpha L_{\mathrm{KoLeo}}$$

Where $\lambda$ and $\alpha$ are scalar weights and $L_{\mathrm{KoLeo}}$ is a regularizer that encourages features to spread uniformly in the embedding space (preventing mode collapse). At scale, DINOv2 also uses separate MLP heads for DINO and iBOT (unlike the original iBOT which shared them) and Sinkhorn-Knopp centering instead of simple moving-average centering.
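
To make the regularizer concrete, here is a sketch of a KoLeo-style term: it penalizes each feature for sitting too close to its nearest neighbor in the batch. The nearest-neighbor form, the `eps` constant, and the weights in `total_loss` are assumptions for illustration, not values given in this text.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def koleo_loss(features, eps=1e-8):
    """Negative mean log-distance to the nearest other feature in the batch.
    Lower values mean the features are more spread out."""
    xs = [l2_normalize(f) for f in features]
    n = len(xs)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(xs[i], xs[j]) for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n

def total_loss(l_dino, l_ibot, l_koleo, lam=1.0, alpha=0.1):
    # lam and alpha are hypothetical weights chosen for the example.
    return l_dino + lam * l_ibot + alpha * l_koleo

spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [1.0, 0.02]]
loss_spread = koleo_loss(spread)
loss_clustered = koleo_loss(clustered)
```

Clustered features get a much larger penalty than evenly spread ones, which is the pressure that keeps the batch from collapsing onto a few points.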

Why combine both? DINO alone gives excellent image-level features (great for classification, retrieval) but weak pixel-level features. iBOT alone gives strong local features but weaker global representations. Together, you get features that excel at both — classification AND segmentation AND depth estimation from the same frozen backbone.

Why does DINOv2 use separate MLP heads for the DINO and iBOT losses, unlike the original iBOT?

Chapter 4: Architecture and Scaling

DINOv2 uses the Vision Transformer (ViT) architecture with patch size 14. The models come in four sizes:

Model family:
ViT-S/14 — 21M params, 384 dim, 6 heads
ViT-B/14 — 86M params, 768 dim, 12 heads
ViT-L/14 — 300M params, 1024 dim, 16 heads
ViT-g/14 — 1.1B params, 1536 dim, 24 heads

The ViT-g architecture differs slightly from the one proposed by Zhai et al. (2022). To maximize GPU efficiency with their custom FlashAttention kernel, DINOv2 uses an embedding dimension of 1536 with 24 heads (64 dim/head) rather than 1408 with 16 heads (88 dim/head). Matrix operations are most efficient when the full embedding dimension is a multiple of 256. No difference in final accuracy was observed.
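
The divisibility argument is easy to check by hand or with a few lines of code. The helper name below is made up for this example.

```python
def head_report(embed_dim, num_heads):
    """Check the two alignment rules of thumb mentioned above."""
    assert embed_dim % num_heads == 0, "heads must divide the embedding dim"
    head_dim = embed_dim // num_heads
    return {
        "head_dim": head_dim,
        "head_dim_multiple_of_64": head_dim % 64 == 0,
        "embed_dim_multiple_of_256": embed_dim % 256 == 0,
    }

dinov2_g = head_report(1536, 24)  # DINOv2's ViT-g
zhai_g = head_report(1408, 16)    # ViT-g as in Zhai et al. (2022)
```

The 1536/24 configuration satisfies both rules (64 per head, 1536 = 6 x 256), while 1408/16 satisfies neither (88 per head, 1408 = 5.5 x 256).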

Knowledge distillation

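Here's an important design choice: only the ViT-g is trained from scratch on LVD-142M. The smaller models (ViT-S, ViT-B, ViT-L) are distilled from the ViT-g using the same training loop with a few modifications: the ViT-g serves as a frozen teacher, a spare EMA of the student is kept as the final model, masking and stochastic depth are removed, and the iBOT loss is applied to the two global crops.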

This distillation approach achieves better performance than training smaller models from scratch — even for the ViT-L. The ViT-g effectively compresses its learned representations into smaller, more deployable models.

How are the smaller DINOv2 models (ViT-S, ViT-B, ViT-L) produced?

Chapter 5: Training at Scale

Training a 1.1B parameter ViT on 142M images with self-distillation is an engineering challenge. DINOv2 introduces several optimizations that together make training 2x faster and 3x more memory-efficient than comparable self-supervised methods.

Custom FlashAttention

The team implements their own version of FlashAttention optimized for their use case. Efficiency is best when the embedding dimension per head is a multiple of 64 — which motivated the ViT-g architecture choice of 1536 dim / 24 heads = 64 dim/head.

Sequence packing

DINO's multi-crop strategy creates sequences of different lengths: global crops (224px, 16×16 = 256 patches at patch size 14) and local crops (listed as 96px; 49 patches is a 7×7 grid, which corresponds to a 98px crop at patch size 14). Normally these require separate forward passes. Sequence packing concatenates them into one long sequence with a block-diagonal attention mask that prevents cross-sequence attention. This is mathematically equivalent to separate forwards but significantly faster on GPU.
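
A sketch of the packing mask, using nested lists of booleans where `True` means "may attend". It counts patch tokens only ([CLS] and any other extra tokens are ignored), and uses a 98px local crop because 7 x 7 = 49 patches at patch size 14 implies 98 pixels per side; the function names are hypothetical.

```python
def num_patches(crop_size, patch_size=14):
    """Number of patch tokens for a square crop."""
    return (crop_size // patch_size) ** 2

def block_diagonal_mask(seq_lengths):
    """mask[i][j] is True only when tokens i and j come from the same crop."""
    total = sum(seq_lengths)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for length in seq_lengths:
        for i in range(start, start + length):
            for j in range(start, start + length):
                mask[i][j] = True
        start += length
    return mask

global_tokens = num_patches(224)  # 16 x 16 grid
local_tokens = num_patches(98)    # 7 x 7 grid

# Tiny example: one crop of 2 tokens packed with one crop of 1 token.
tiny_mask = block_diagonal_mask([2, 1])
```

Because every off-block entry is `False`, attention inside the packed sequence gives each crop exactly what it would get from its own forward pass.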

Efficient stochastic depth

Standard stochastic depth masks the output of dropped layers, so the forward pass is still computed for them. DINOv2's version actually skips the computation by shuffling samples along the batch dimension and running the block on only the first (1−d)×B samples. With a drop rate of d=0.4, this saves roughly 40% of compute and memory in those blocks.
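
The idea can be sketched on plain lists: pick a random subset of the batch, run the residual branch on that subset only, and leave the other samples untouched. Rescaling the kept residuals by B / kept is an assumption of this sketch (a common way to keep the expected update unchanged); the text above does not specify it.

```python
import random

def efficient_stochastic_depth(batch, residual_fn, drop_rate, rng):
    """Run residual_fn on a random subset of samples instead of masking
    its output afterwards. Returns the new batch and the kept indices."""
    batch_size = len(batch)
    keep = max(round(batch_size * (1.0 - drop_rate)), 1)
    order = list(range(batch_size))
    rng.shuffle(order)            # shuffle along the batch dimension
    kept_indices = order[:keep]   # slice the first (1 - d) * B samples
    scale = batch_size / keep     # assumed rescaling, see lead-in
    out = [list(x) for x in batch]
    for idx in kept_indices:
        residual = residual_fn(batch[idx])  # computed for kept samples only
        out[idx] = [x + scale * r for x, r in zip(batch[idx], residual)]
    return out, kept_indices

calls = []

def toy_block(x):
    calls.append(1)               # count how often the block really runs
    return [1.0 for _ in x]

batch = [[0.0, 0.0] for _ in range(10)]
out, kept = efficient_stochastic_depth(batch, toy_block, 0.4, random.Random(0))
```

With B = 10 and d = 0.4, the toy block runs 6 times instead of 10, which is where the savings come from.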

FSDP (Fully-Sharded Data Parallel)

Training with AdamW requires 4 copies of the model in float32 (student, teacher, optimizer first moments, optimizer second moments) — 16 GB for 1B parameters. FSDP shards these across GPUs. Communication uses float16 for the backbone (float32 for MLP heads to avoid instability), cutting communication costs by ~50% versus standard DDP.
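
The memory figure is simple arithmetic, reproduced here as a sanity check. The GPU count of 8 is an arbitrary example, not a number from the paper.

```python
def training_state_gb(num_params, copies=4, bytes_per_value=4):
    """Memory for student, teacher and two AdamW moment buffers in float32."""
    return num_params * copies * bytes_per_value / 1e9

total_gb = training_state_gb(1_000_000_000)

# With sharding, each GPU holds only its slice of that state.
example_gpus = 8                  # hypothetical cluster size
per_gpu_gb = total_gb / example_gpus

# float16 traffic moves 2 bytes per value instead of 4.
communication_ratio = 2 / 4
```

Four float32 copies of a billion parameters come to 16 GB, and sending float16 instead of float32 halves the bytes on the wire, which matches the roughly 50% saving quoted above.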

The numbers: ViT-g training on LVD-142M takes roughly 22,000 A100 GPU-hours (the paper's reproduction estimate), with a batch size of about 3,000 images. A short high-resolution phase (518x518) at the end of training improves pixel-level performance for segmentation and detection tasks.

How does DINOv2's efficient stochastic depth differ from the standard implementation?

Chapter 6: Linear Probing Results

The gold standard for evaluating frozen features: freeze the backbone, train only a linear classifier on top, and measure accuracy. If a linear probe does well, the features themselves encode the relevant information — no fine-tuning needed.
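
As a rough sketch of what "linear probe" means in practice: the features are fixed numbers, and only a weight matrix and bias are trained with softmax regression. The toy 2-D features, learning rate, and step count below are made up for illustration; real probes run on backbone outputs with a proper optimizer.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits_for(weights, bias, x):
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def train_linear_probe(features, labels, num_classes, lr=0.5, steps=200):
    """Full-batch gradient descent on cross-entropy; features stay frozen."""
    dim, n = len(features[0]), len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(steps):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(logits_for(weights, bias, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err / n
                for d in range(dim):
                    grad_w[c][d] += err * x[d] / n
        for c in range(num_classes):
            bias[c] -= lr * grad_b[c]
            for d in range(dim):
                weights[c][d] -= lr * grad_w[c][d]
    return weights, bias

def predict(weights, bias, x):
    scores = logits_for(weights, bias, x)
    return max(range(len(scores)), key=scores.__getitem__)

# Stand-ins for frozen backbone features of two classes.
frozen_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
weights, bias = train_linear_probe(frozen_features, labels, num_classes=2)
predictions = [predict(weights, bias, x) for x in frozen_features]
```

If the classes are already linearly separable in feature space, as in this toy case, the probe classifies them perfectly; that is exactly the property linear probing measures.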

ImageNet classification

DINOv2 ViT-g achieves 86.5% top-1 accuracy on ImageNet-1k with a simple linear probe. For context:

Beyond classification

But ImageNet accuracy alone doesn't prove "general-purpose" features. DINOv2 is evaluated on a staggering range of tasks, all using frozen features with minimal task-specific adaptation:

The takeaway: A single frozen DINOv2 backbone, with nothing more than a linear layer on top, performs competitively or better than task-specific models across an extraordinarily diverse set of vision benchmarks. These are truly "all-purpose" features.

What does it mean that DINOv2 achieves 86.5% on ImageNet with a linear probe?

Chapter 7: All-Purpose Features

The central claim of DINOv2 is not just that it achieves high accuracy on one benchmark. It's that the same frozen features work as a drop-in replacement for any vision task. No fine-tuning. No task-specific architecture changes. Just freeze the backbone and add a simple head.

Image-level tasks

The [CLS] token captures global image semantics. A linear probe or k-NN classifier on this single vector handles classification, retrieval, and domain transfer with state-of-the-art results.
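
A k-NN classifier on frozen [CLS] vectors needs no training at all. The sketch below assumes cosine similarity and simple majority voting; the 2-D "features" and labels are toy stand-ins.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, train_features, train_labels, k=3):
    """Label a query by majority vote among its k most similar neighbors."""
    ranked = sorted(range(len(train_features)),
                    key=lambda i: cosine(query, train_features[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Stand-ins for frozen [CLS] features of labeled images.
train_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
train_labels = ["cat", "cat", "dog", "dog"]

pred_a = knn_predict([1.0, 0.0], train_features, train_labels)
pred_b = knn_predict([0.0, 1.0], train_features, train_labels)
```

The same similarity ranking, without the vote, is all that image retrieval needs.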

Pixel-level tasks

The patch tokens capture local spatial information. A linear probe on the spatial grid of patch tokens handles segmentation, depth estimation, and surface normal prediction. This is where DINOv2 especially outshines CLIP — text-supervised models tend to have weak spatial features because captions don't describe pixel-level structure.

PCA visualization

One of the most striking results: compute PCA on DINOv2's patch features across different images. The first three principal components, mapped to RGB channels, naturally segment objects and match corresponding parts across images — a dog's head maps to the same color regardless of pose, breed, or background. This emergent property (first observed in DINO) becomes even more pronounced at scale.
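
For intuition, here is how one color channel of such a visualization could be computed: center the patch features, find the first principal component, project each patch onto it, and rescale to [0, 1]. This sketch uses power iteration on toy 3-D features; a real visualization would repeat this for three components (one per RGB channel) on high-dimensional patch tokens, usually with a linear algebra library.

```python
import math

def center(features):
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in features]

def first_principal_component(centered, iterations=100):
    """Power iteration on X^T X without forming the matrix explicitly."""
    dim = len(centered[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        proj = [sum(x[d] * v[d] for d in range(dim)) for x in centered]
        w = [sum(p * x[d] for p, x in zip(proj, centered)) for d in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v

def to_unit_range(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

# Toy "patch features" that vary along a single direction.
patch_features = [[t, 2.0 * t, 0.0] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
centered = center(patch_features)
pc1 = first_principal_component(centered)
channel = to_unit_range([sum(a * b for a, b in zip(x, pc1)) for x in centered])
```

Patches with similar features land at similar positions along the component and therefore get similar colors, which is why matching object parts share a color across images.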

Foundation model behavior: DINOv2's features exhibit the hallmarks of a foundation model: they generalize across tasks, domains, and data distributions without adaptation. Downstream models like Depth Anything (2024) build on DINOv2 as their backbone encoder. This is the vision equivalent of using GPT features for NLP tasks.

Why do DINOv2 features work for pixel-level tasks like segmentation and depth estimation?

Chapter 8: Comparison with CLIP

The natural question: how does DINOv2 compare to CLIP/OpenCLIP, the dominant paradigm for general visual features?

Where DINOv2 wins

Where CLIP wins

The punchline: For any task that doesn't fundamentally require language (classification, segmentation, depth, retrieval, 3D understanding), DINOv2 matches or beats CLIP. Text supervision is not a necessary ingredient for great visual features — it's just one way to get them, and not even the best way for dense prediction.

On what types of tasks does DINOv2 most clearly outperform CLIP?

Chapter 9: Connections

Lineage

DINO (Caron et al., 2021)
The direct ancestor. Self-distillation with EMA teacher on [CLS] tokens. Showed emergent attention-based segmentation. DINOv2 takes the same loss and scales it.
iBOT (Zhou et al., 2022)
Added masked image modeling on patch tokens to DINO. DINOv2 adopts this as the second loss term, enabling strong pixel-level features.
SwAV (Caron et al., 2020)
Introduced Sinkhorn-Knopp centering for prototype assignment. DINOv2 adopts this for more stable training at scale.
MAE (He et al., 2022)
Masked autoencoder approach — reconstructs raw pixels. Features require fine-tuning. DINOv2's iBOT loss works in feature space (distillation targets, not pixels), producing features that work frozen.

Downstream impact

Depth Anything (2024)
Uses DINOv2 as the pretrained backbone encoder for monocular depth estimation. Demonstrates DINOv2 as a true foundation model: plug it in, add a decoder, get state-of-the-art depth.
SAM (Kirillov et al., 2023)
Uses a ViT encoder trained with supervised data. DINOv2's self-supervised features achieve comparable segmentation quality, suggesting supervised pretraining may not be necessary for this task.
CLIP / OpenCLIP
The primary competitor. Text-supervised features that DINOv2 matches or surpasses on tasks that don't require language. Together they define the two paradigms for visual foundation models: text-supervised vs. self-supervised.
Foundation model era
DINOv2 validates that vision can follow the NLP playbook: pretrain a large model on curated data with self-supervised objectives, then use frozen features everywhere. No labels, no text, no fine-tuning needed.

Why is Depth Anything's use of DINOv2 as its backbone encoder significant?