DINOv2 — Veanors

Chapter 0: The Problem

By 2023, the vision community had a clear winner for "general-purpose visual features": CLIP and its open-source sibling OpenCLIP. Train a vision encoder alongside a text encoder on billions of image-caption pairs, and you get features that transfer to almost any downstream task. Simple, effective, dominant.

But CLIP has a fundamental limitation: it requires text supervision. The captions only approximate the rich information in images. Complex spatial relationships, fine-grained textures, pixel-level details — these don't surface easily in a short caption. And you need massive aligned image-text corpora, which are expensive and noisy to collect.

Meanwhile, self-supervised methods like DINO had shown something remarkable. Without any labels at all, DINO's features exhibited emergent properties: the attention maps naturally segmented objects, the patch features encoded semantic parts, and the [CLS] token captured high-level semantics. But DINO was trained on ImageNet-1k, just 1.3 million images, and attempts to scale self-supervised methods to larger, uncurated datasets had led to a significant drop in feature quality.

The gap: Text-supervised models (CLIP) produce great general features but are bottlenecked by caption quality. Self-supervised models (DINO) learn richer visual representations but haven't scaled beyond small curated datasets. Can we close this gap — matching CLIP's transfer performance using images alone, no text at all?

DINOv2 answers yes. The recipe: (1) curate a large, diverse dataset automatically, (2) combine the best self-supervised objectives, and (3) scale to a 1B-parameter ViT. The resulting features work as drop-in replacements for any vision task — classification, segmentation, depth estimation, retrieval — without fine-tuning.

What is the core limitation of text-supervised models like CLIP that DINOv2 aims to overcome?

Captions only approximate image content; complex spatial and pixel-level information doesn't surface through text supervision
CLIP models are too small to be useful
CLIP requires labeled bounding boxes

Chapter 1: The Key Insight

DINOv2's insight is deceptively simple: existing self-supervised methods already work — they just need better data and more compute. The specific recipe combines three ingredients:

Ingredient 1: Curated data

Previous attempts to scale self-supervised learning used uncurated web data. The problem? Uncurated data is dominated by a few visual modes (e.g., product photos, memes, screenshots) and contains duplicates and noise. DINOv2 instead builds an automatic pipeline that retrieves diverse, high-quality images from web data using existing curated datasets as "anchors." The result: LVD-142M, a 142-million image dataset that's both large and diverse.

Ingredient 2: Combined losses

Rather than inventing a new objective, DINOv2 combines two proven ones: the DINO self-distillation loss on [CLS] tokens (captures image-level semantics) and the iBOT masked image modeling loss on patch tokens (captures local, pixel-level features). Together they produce features that excel at both image-level and pixel-level tasks.

Ingredient 3: Scale + engineering

Custom FlashAttention, sequence packing, efficient stochastic depth, and FSDP: together these engineering contributions make training about 2x faster with roughly a third of the memory of comparable methods, enabling training of a 1.1B-parameter ViT-g.

The meta-insight: You don't need a clever new loss function. You need (1) the right data, (2) the right combination of existing objectives, and (3) the engineering to actually train at scale. This is a systems paper as much as a methods paper.

What are the two self-supervised losses DINOv2 combines?

DINO self-distillation loss on [CLS] tokens + iBOT masked image modeling loss on patch tokens
Contrastive loss + reconstruction loss
Cross-entropy loss + triplet loss

Chapter 2: Data Curation

Most self-supervised learning research trains on ImageNet-1k (1.3M images) or uncurated web scrapes. Neither works well for general-purpose features: ImageNet is too small and domain-specific, while uncurated data is noisy and imbalanced. DINOv2 introduces an automatic curation pipeline that produces LVD-142M — 142 million diverse, high-quality images.

The pipeline

The curation process has three stages:

Step 1: Gather sources. Start with a small set of curated "anchor" datasets: ImageNet-22k, Google Landmarks, and several fine-grained datasets. These define what "good" images look like. Then collect 1.2 billion raw images from publicly available web crawls.

Step 2: Deduplicate. Apply a copy-detection pipeline to remove near-duplicates from the uncurated pool. This increases diversity and prevents overfitting on repeated images. Also remove any images that appear in downstream benchmark test sets.

Step 3: Retrieve. Compute embeddings for all images using a self-supervised ViT-H/16 pretrained on ImageNet-22k. For each image in the curated anchor sets, retrieve the 4 nearest neighbors from the uncurated pool using cosine similarity. This "pulls in" web images that are visually similar to curated ones — expanding coverage while maintaining quality.
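The retrieval step can be sketched in a few lines of NumPy. This is a brute-force illustration under stated assumptions, not the pipeline's actual code: the function names (`l2_normalize`, `retrieve_neighbors`) and the random stand-in embeddings are hypothetical, and at the scale of 1.2 billion images an approximate nearest-neighbor index would be needed instead of a dense similarity matrix.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def retrieve_neighbors(anchor_emb, pool_emb, k=4):
    """Return indices of the k most cosine-similar pool images for each anchor."""
    sims = l2_normalize(anchor_emb) @ l2_normalize(pool_emb).T  # (n_anchor, n_pool)
    # argsort is ascending: take the last k columns, then reverse for best-first order
    return np.argsort(sims, axis=1)[:, -k:][:, ::-1]

rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 64))                     # stand-in for uncurated embeddings
anchors = pool[:5] + 0.01 * rng.normal(size=(5, 64))   # anchors close to pool items 0..4
neighbors = retrieve_neighbors(anchors, pool, k=4)
curated_ids = np.unique(neighbors)                     # union of retrieved pool images
```

Taking the union of retrieved indices is what turns per-anchor neighbor lists into a single curated subset.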

Why not just use more curated data? Manual curation doesn't scale. The pipeline instead uses a small seed of curated data to automatically find similar high-quality images in a massive uncurated pool. The approach resembles training a language model on Wikipedia and then using it to score web text, except that here the "scoring" is visual similarity between embeddings.

The result? LVD-142M maintains ImageNet-level performance while dramatically improving on out-of-domain tasks. Training on uncurated data of the same size leads to a significant quality drop — confirming that curation, not just scale, is essential.

How does DINOv2 build its LVD-142M dataset from uncurated web data?

It computes embeddings for all images, then retrieves the nearest neighbors of curated anchor images from the uncurated pool using cosine similarity
Human annotators manually select the best images
It uses CLIP to filter images with matching captions

Chapter 3: The Combined Training Objective

DINOv2's training objective combines two complementary self-supervised losses. Let's build up each one, then see how they work together.

Loss 1: DINO self-distillation (image-level)

Given an image, create multiple crops: two large "global" crops (224x224) and several small "local" crops (98x98, so that the side is a multiple of the patch size 14; the original DINO used 96x96). Pass them through a student network and an EMA teacher network. Both output a [CLS] token representation, which is projected through an MLP head into "prototype scores," then normalized with softmax.

The loss is cross-entropy between teacher and student prototype distributions:

L_DINO = − ∑ p_t log p_s

Where p_t is the teacher's softmax output (after centering) and p_s is the student's. The teacher is built as an exponential moving average (EMA) of the student weights — a form of self-distillation. This loss captures image-level semantics in the [CLS] token.
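As a minimal sketch of this cross-entropy, the NumPy snippet below computes the loss for a batch of prototype scores. The temperature values (0.1 for the student, 0.04 for the teacher) and the simple mean centering are illustrative assumptions, not values taken from this text, and a real implementation would block gradients through the teacher.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between centered teacher and student prototype distributions."""
    p_t = softmax((teacher_logits - center) / t_teacher)  # teacher target
    log_p_s = log_softmax(student_logits / t_student)     # student log-probabilities
    return float(-(p_t * log_p_s).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 32))   # 8 crops, 32 prototypes (toy sizes)
center = teacher_logits.mean(axis=0)        # assumed: plain mean centering
loss_agree = dino_loss(teacher_logits - center, teacher_logits, center)
loss_disagree = dino_loss(-(teacher_logits - center), teacher_logits, center)
```

A student that ranks prototypes the same way as the teacher gets a lower loss than one that ranks them in reverse, which is the behavior the formula is meant to reward.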

Loss 2: iBOT masked image modeling (patch-level)

Randomly mask some input patches in the student's input (but not the teacher's). For each masked position, the student predicts what the teacher sees at that position. The loss is again cross-entropy, but now applied to individual patch tokens:

L_iBOT = − ∑_{i ∈ masked} p_tⁱ log p_sⁱ

This forces the student to reconstruct local patch information from context — learning fine-grained, pixel-level features.
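The patch-level loss differs from the image-level one mainly in where it is applied. The sketch below, with hypothetical names and toy shapes, averages the per-patch cross-entropy over masked positions only. Centering is omitted for brevity and the temperatures are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def ibot_loss(student_patch_logits, teacher_patch_logits, mask,
              t_student=0.1, t_teacher=0.04):
    """Mean cross-entropy over masked patches; inputs are (batch, patches, prototypes)."""
    p_t = softmax(teacher_patch_logits / t_teacher)
    log_p_s = log_softmax(student_patch_logits / t_student)
    per_patch = -(p_t * log_p_s).sum(axis=-1)            # (batch, patches)
    return float((per_patch * mask).sum() / max(mask.sum(), 1))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 16, 32))
student = rng.normal(size=(2, 16, 32))
mask = rng.random((2, 16)) < 0.5                         # True where the student input was masked
loss = ibot_loss(student, teacher, mask)

# Changing the student's output at unmasked positions leaves the loss untouched.
student_changed = student.copy()
student_changed[~mask] = rng.normal(size=student_changed[~mask].shape)
loss_changed = ibot_loss(student_changed, teacher, mask)
```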

The combined objective

L_total = L_DINO + λ L_iBOT + α L_KoLeo

Where L_KoLeo is a regularizer that encourages features to spread uniformly in the embedding space (preventing mode collapse). At scale, DINOv2 also uses separate MLP heads for DINO and iBOT (unlike the original iBOT which shared them) and Sinkhorn-Knopp centering instead of simple moving-average centering.
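To make the regularizer concrete, here is a small sketch of a KoLeo-style term: the negative mean log distance from each L2-normalized feature to its nearest neighbor in the batch. The function names and the weights `lam` and `alpha` are placeholders chosen for illustration, not values stated in this text.

```python
import numpy as np

def koleo_loss(features, eps=1e-8):
    """Penalize small nearest-neighbor distances among L2-normalized features."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(dists, np.inf)                                  # ignore self-distance
    return float(-np.log(dists.min(axis=1) + eps).mean())

def total_loss(l_dino, l_ibot, l_koleo, lam=1.0, alpha=0.1):
    """Weighted sum matching L_total = L_DINO + lam * L_iBOT + alpha * L_KoLeo."""
    return l_dino + lam * l_ibot + alpha * l_koleo

rng = np.random.default_rng(0)
spread = rng.normal(size=(32, 16))                              # well-separated features
collapsed = np.ones((32, 16)) + 1e-3 * rng.normal(size=(32, 16))  # nearly identical features
```

Nearly identical features produce tiny nearest-neighbor distances and therefore a large penalty, so minimizing this term pushes features apart.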

Why combine both? DINO alone gives excellent image-level features (great for classification, retrieval) but weak pixel-level features. iBOT alone gives strong local features but weaker global representations. Together, you get features that excel at both — classification AND segmentation AND depth estimation from the same frozen backbone.

Why does DINOv2 use separate MLP heads for the DINO and iBOT losses, unlike the original iBOT?

At scale, untying the heads leads to better performance: the optimal projection for [CLS] tokens differs from the optimal projection for patch tokens
It reduces memory usage
It makes the code simpler

Chapter 4: Architecture and Scaling

DINOv2 uses the Vision Transformer (ViT) architecture with patch size 14. The team trains four model sizes:

Model family:
ViT-S/14 — 21M params, 384 dim, 6 heads
ViT-B/14 — 86M params, 768 dim, 12 heads
ViT-L/14 — 300M params, 1024 dim, 16 heads
ViT-g/14 — 1.1B params, 1536 dim, 24 heads

The ViT-g architecture differs slightly from the one proposed by Zhai et al. (2022). To maximize GPU efficiency with their custom FlashAttention kernel, DINOv2 uses an embedding dimension of 1536 with 24 heads (64 dim/head) rather than 1408 with 16 heads (88 dim/head). Matrix operations are most efficient when the full embedding dimension is a multiple of 256. No difference in final accuracy was observed.
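The arithmetic behind this choice is easy to check. The helper below is a hypothetical name used only for illustration; it reports the per-head dimension and whether the two alignment conditions mentioned in the text hold.

```python
def head_report(embed_dim, num_heads):
    """Return (dim per head, head dim multiple of 64?, embed dim multiple of 256?)."""
    head_dim = embed_dim // num_heads
    return head_dim, head_dim % 64 == 0, embed_dim % 256 == 0

zhai = head_report(1408, 16)     # ViT-g of Zhai et al. (2022)
dinov2 = head_report(1536, 24)   # DINOv2 ViT-g
```

Only the 1536-dimensional, 24-head configuration satisfies both conditions.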

Knowledge distillation

Here's an important design choice: only the ViT-g is trained from scratch on LVD-142M. The smaller models (ViT-S, ViT-B, ViT-L) are distilled from the ViT-g using the same training loop with modifications:

The ViT-g serves as a frozen teacher (no EMA — just the pretrained ViT-g)
No masking or stochastic depth during distillation
The iBOT loss is applied on both global crops
An EMA of the student is kept as the final model

This distillation approach achieves better performance than training smaller models from scratch — even for the ViT-L. The ViT-g effectively compresses its learned representations into smaller, more deployable models.
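The "EMA of the student" kept during distillation follows the same update rule as the EMA teacher in the main training loop. A minimal sketch, assuming parameters are stored in a dict of arrays and using an illustrative momentum value:

```python
import numpy as np

def ema_update(ema_params, student_params, momentum=0.999):
    """Return new EMA weights: momentum * old + (1 - momentum) * student."""
    return {name: momentum * ema_params[name] + (1.0 - momentum) * student_params[name]
            for name in ema_params}

student = {"w": np.ones(3)}   # toy student weights
ema = {"w": np.zeros(3)}      # EMA copy starts elsewhere to make the drift visible
for _ in range(2):
    ema = ema_update(ema, student, momentum=0.9)
```

With a momentum close to 1 the EMA copy changes slowly, which smooths out step-to-step noise in the student's weights.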

How are the smaller DINOv2 models (ViT-S, ViT-B, ViT-L) produced?

They are distilled from the pretrained ViT-g using the same training loop with a frozen teacher, achieving better results than training from scratch
Each model is trained independently from scratch on LVD-142M
They prune layers from the ViT-g

Chapter 5: Training at Scale

Training a 1.1B-parameter ViT on 142M images with self-distillation is an engineering challenge. DINOv2 introduces several optimizations that together make training about 2x faster while using roughly a third of the memory of comparable self-supervised implementations.

Custom FlashAttention

The team implements their own version of FlashAttention optimized for their use case. Efficiency is best when the embedding dimension per head is a multiple of 64 — which motivated the ViT-g architecture choice of 1536 dim / 24 heads = 64 dim/head.

Sequence packing

DINO's multi-crop strategy creates sequences of different lengths: with patch size 14, a 224px global crop yields 16×16 = 256 patch tokens and a 98px local crop yields 7×7 = 49. Normally these require separate forward passes. Sequence packing concatenates them into one long sequence with a block-diagonal attention mask that prevents attention across sequences. This is mathematically equivalent to separate forward passes but significantly faster on GPU.
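The equivalence claim can be verified numerically. The sketch below uses single-head attention without learned projections (queries, keys, and values are all the raw tokens), which is a simplification for illustration. The helper names are hypothetical.

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attention is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def block_diagonal_mask(lengths):
    """Allow attention only between tokens that belong to the same sequence."""
    seq_ids = np.repeat(np.arange(len(lengths)), lengths)
    return seq_ids[:, None] == seq_ids[None, :]

rng = np.random.default_rng(0)
lengths = [256, 49]                       # patch tokens of one global and one local crop
seqs = [rng.normal(size=(n, 64)) for n in lengths]

separate = np.concatenate([attention(x, x, x) for x in seqs])   # two forward passes
packed_tokens = np.concatenate(seqs)                            # one packed sequence
packed = attention(packed_tokens, packed_tokens, packed_tokens,
                   mask=block_diagonal_mask(lengths))
```

Masked positions receive zero attention weight, so each token's output depends only on its own crop, exactly as in the separate passes.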

Efficient stochastic depth

Standard stochastic depth masks the output of dropped layers — still computing the forward pass. DINOv2's version actually skips the computation by shuffling samples along the batch dimension and slicing only the first (1−d)×B samples. With a drop rate of d=0.4, this saves 40% of compute and memory in those blocks.
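A minimal sketch of the shuffle-and-slice idea follows. The function name is hypothetical, and the rescaling of the residual by B / kept is an assumption added here so that the batch-averaged residual stays the same as without dropping.

```python
import numpy as np

def drop_add_residual(x, block_fn, drop_rate, rng):
    """Run block_fn on a random (1 - drop_rate) fraction of the batch only."""
    batch = x.shape[0]
    keep = max(int(round(batch * (1.0 - drop_rate))), 1)
    kept_idx = rng.permutation(batch)[:keep]      # shuffle, then slice the first `keep`
    residual = block_fn(x[kept_idx])              # the block never sees dropped samples
    out = x.copy()
    out[kept_idx] += residual * (batch / keep)    # assumed rescale to preserve the mean
    return out, kept_idx

rng = np.random.default_rng(0)
x = np.zeros((10, 4))
seen_batch_sizes = []

def toy_block(h):
    """Stand-in for a transformer block; records how many samples it processed."""
    seen_batch_sizes.append(h.shape[0])
    return np.ones_like(h)

out, kept_idx = drop_add_residual(x, toy_block, drop_rate=0.4, rng=rng)
```

With a batch of 10 and d=0.4, the block processes 6 samples, and the other 4 pass through unchanged.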

FSDP (Fully-Sharded Data Parallel)

Training with AdamW requires 4 copies of the model in float32 (student, teacher, optimizer first moments, optimizer second moments) — 16 GB for 1B parameters. FSDP shards these across GPUs. Communication uses float16 for the backbone (float32 for MLP heads to avoid instability), cutting communication costs by ~50% versus standard DDP.
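The memory figure is simple arithmetic, reproduced below. The GPU count is a hypothetical value for illustration, and the estimate covers only the four float32 copies named in the text, not activations or gradients.

```python
def training_state_gb(num_params, copies=4, bytes_per_value=4):
    """Memory in GB (1e9 bytes) for float32 student, teacher, and two AdamW moments."""
    return num_params * copies * bytes_per_value / 1e9

total_gb = training_state_gb(1_000_000_000)   # four float32 copies of 1B parameters
num_gpus = 8                                  # hypothetical number of shards
per_gpu_gb = total_gb / num_gpus              # share held by each GPU under sharding
comm_ratio = 2 / 4                            # float16 bytes / float32 bytes per value
```

The `comm_ratio` of one half is where the roughly 50% communication saving for float16 traffic comes from.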

The numbers: ViT-g training on LVD-142M takes ~7,000 A100 GPU-hours. The final training uses a batch size equivalent to 22,000 images. A short high-resolution phase (518x518) at the end of training improves pixel-level performance for segmentation and detection tasks.

How does DINOv2's efficient stochastic depth differ from the standard implementation?

It skips the computation entirely by shuffling samples and only processing a (1-d) fraction, rather than computing the full forward pass and masking the output
It drops entire layers instead of individual samples
It uses a lower drop rate

Chapter 6: Linear Probing Results

The gold standard for evaluating frozen features: freeze the backbone, train only a linear classifier on top, and measure accuracy. If a linear probe does well, the features themselves encode the relevant information — no fine-tuning needed.
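The protocol can be sketched with plain NumPy. Everything here is a toy under stated assumptions: the "frozen features" are synthetic Gaussian clusters standing in for backbone outputs, and the probe is a softmax classifier trained by gradient descent. Only the linear weights `W` and `b` are ever updated.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 3, 16
class_means = rng.normal(scale=3.0, size=(num_classes, dim))

def fake_frozen_features(n):
    """Synthetic stand-in for features produced by a frozen backbone."""
    labels = rng.integers(0, num_classes, size=n)
    return class_means[labels] + rng.normal(size=(n, dim)), labels

train_x, train_y = fake_frozen_features(300)
test_x, test_y = fake_frozen_features(100)

# Standardize with training statistics; the features themselves are never trained.
mean, std = train_x.mean(axis=0), train_x.std(axis=0)
train_x, test_x = (train_x - mean) / std, (test_x - mean) / std

W, b = np.zeros((dim, num_classes)), np.zeros(num_classes)
onehot = np.eye(num_classes)[train_y]
for _ in range(200):
    logits = train_x @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - onehot) / len(train_x)     # gradient of mean cross-entropy w.r.t. logits
    W -= 0.1 * train_x.T @ grad
    b -= 0.1 * grad.sum(axis=0)

test_accuracy = float(((test_x @ W + b).argmax(axis=1) == test_y).mean())
```

If the features already separate the classes, as in this toy, a single linear layer is enough to reach high accuracy, which is exactly what the probe is designed to measure.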

ImageNet classification

DINOv2 ViT-g achieves 86.5% top-1 accuracy on ImageNet-1k with a simple linear probe.

Beyond classification

But ImageNet accuracy alone doesn't prove "general-purpose" features. DINOv2 is evaluated on a staggering range of tasks, all using frozen features with minimal task-specific adaptation:

Segmentation — ADE20k: linear probe on patch tokens achieves strong results
Depth estimation — NYUd, KITTI: linear probe on frozen features rivals supervised methods
Retrieval — Oxford, Paris landmarks: k-NN on frozen features
Fine-grained classification — iNaturalist, Cars, Food: strong without domain-specific training
Out-of-distribution — ImageNet-A, ImageNet-R, ImageNet-Sketch: robust generalization
Video understanding — temporal consistency from frozen spatial features

The takeaway: A single frozen DINOv2 backbone, with nothing more than a linear layer on top, performs competitively or better than task-specific models across an extraordinarily diverse set of vision benchmarks. These are truly "all-purpose" features.

What does it mean that DINOv2 achieves 86.5% on ImageNet with a linear probe?

The frozen backbone features are so informative that a single linear layer trained on top can classify ImageNet images with 86.5% accuracy, with no fine-tuning of the backbone needed
The model was fine-tuned on ImageNet and got 86.5%
86.5% of the model's parameters were used

Chapter 7: All-Purpose Features

The central claim of DINOv2 is not just that it achieves high accuracy on one benchmark. It's that the same frozen features work as a drop-in replacement for any vision task. No fine-tuning. No task-specific architecture changes. Just freeze the backbone and add a simple head.

Image-level tasks

The [CLS] token captures global image semantics. A linear probe or k-NN classifier on this single vector handles classification, retrieval, and domain transfer with state-of-the-art results.

Pixel-level tasks

The patch tokens capture local spatial information. A linear probe on the spatial grid of patch tokens handles segmentation, depth estimation, and surface normal prediction. This is where DINOv2 especially outshines CLIP — text-supervised models tend to have weak spatial features because captions don't describe pixel-level structure.

PCA visualization

One of the most striking results: compute PCA on DINOv2's patch features across different images. The first three principal components, mapped to RGB channels, naturally segment objects and match corresponding parts across images — a dog's head maps to the same color regardless of pose, breed, or background. This emergent property (first observed in DINO) becomes even more pronounced at scale.
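The visualization itself takes only a few lines. In the sketch below the patch features are random placeholders, so the output will not show real object structure; with actual backbone features, patches from several images would be stacked into one matrix before fitting the PCA. The function name and shapes are assumptions for illustration.

```python
import numpy as np

def pca_to_rgb(patch_features, grid_hw):
    """Project patch features onto their top 3 principal components, scaled to [0, 1]."""
    h, w = grid_hw
    centered = patch_features - patch_features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                              # (num_patches, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)          # per-channel min-max scaling
    return rgb.reshape(h, w, 3)

rng = np.random.default_rng(0)
features = rng.normal(size=(16 * 16, 384))   # placeholder for 256 patch tokens of width 384
rgb = pca_to_rgb(features, (16, 16))         # image-like array, one color per patch
```

Because the same principal directions are applied to every patch, patches with similar features receive similar colors, which is what makes matching parts visible across images.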

Foundation model behavior: DINOv2's features exhibit the hallmarks of a foundation model — they generalize across tasks, domains, and data distributions without adaptation. Downstream models like Depth Anything (2024) use DINOv2 as their backbone encoder, treating it as a frozen feature extractor. This is the vision equivalent of using GPT features for NLP tasks.

Why do DINOv2 features work for pixel-level tasks like segmentation and depth estimation?

The iBOT patch-level loss forces each patch token to encode rich local spatial information, which can be decoded by a simple linear probe into per-pixel predictions
The model outputs pixel-level predictions directly
It uses a separate decoder for each task

Chapter 8: Comparison with CLIP

The natural question: how does DINOv2 compare to CLIP/OpenCLIP, the dominant paradigm for general visual features?

Where DINOv2 wins

Dense prediction tasks — segmentation, depth, surface normals. CLIP's text supervision doesn't encourage learning fine-grained spatial structure. DINOv2's iBOT loss explicitly trains patch-level features.
No text data needed — training uses images only, avoiding the need for expensive image-text corpora.
Fine-grained recognition — DINOv2 captures visual details that captions miss (texture, shape, pose).
Feature quality per FLOP — DINOv2 ViT-L matches or beats OpenCLIP ViT-G (4x larger) on many benchmarks.

Where CLIP wins

Zero-shot text-guided tasks — CLIP can classify images from text descriptions without any training data. DINOv2 has no text encoder, so it needs at least a few examples.
Open-vocabulary detection — tasks that inherently require language understanding benefit from CLIP's text-image alignment.

The punchline: For any task that doesn't fundamentally require language (classification, segmentation, depth, retrieval, 3D understanding), DINOv2 matches or beats CLIP. Text supervision is not a necessary ingredient for great visual features — it's just one way to get them, and not even the best way for dense prediction.

On what types of tasks does DINOv2 most clearly outperform CLIP?

Dense prediction tasks (segmentation, depth estimation) where CLIP's text supervision doesn't produce strong spatial features
Zero-shot text-based classification
Language-guided image generation

Chapter 9: Connections

Lineage

DINO (Caron et al., 2021)

The direct ancestor. Self-distillation with EMA teacher on [CLS] tokens. Showed emergent attention-based segmentation. DINOv2 takes the same loss and scales it.

iBOT (Zhou et al., 2022)

Added masked image modeling on patch tokens to DINO. DINOv2 adopts this as the second loss term, enabling strong pixel-level features.

SwAV (Caron et al., 2020)

Introduced Sinkhorn-Knopp centering for prototype assignment. DINOv2 adopts this for more stable training at scale.

MAE (He et al., 2022)

Masked autoencoder approach — reconstructs raw pixels. Features require fine-tuning. DINOv2's iBOT loss works in feature space (distillation targets, not pixels), producing features that work frozen.

Downstream impact

Depth Anything (2024)

Uses DINOv2 as the frozen backbone encoder for monocular depth estimation. Demonstrates DINOv2 as a true foundation model — plug it in, add a decoder, get state-of-the-art depth.

SAM (Kirillov et al., 2023)

Uses a ViT image encoder, initialized from MAE pretraining and then trained on a large dataset of supervised segmentation masks. DINOv2's self-supervised features achieve comparable segmentation quality, which suggests supervised pretraining may not be strictly necessary for this task.

CLIP / OpenCLIP

The primary competitor. Text-supervised features that DINOv2 matches or surpasses on tasks that don't require language. Together they define the two paradigms for visual foundation models: text-supervised vs. self-supervised.

Foundation model era

DINOv2 validates that vision can follow the NLP playbook: pretrain a large model on curated data with self-supervised objectives, then use frozen features everywhere. No labels, no text, no fine-tuning needed.

Why is Depth Anything's use of DINOv2 as a frozen backbone significant?

It proves DINOv2 is a true foundation model: its frozen features are so rich that a downstream model can achieve state-of-the-art depth estimation without modifying the backbone at all
Depth Anything fine-tunes DINOv2 extensively
DINOv2 was specifically designed for depth tasks

DINOv2: Learning Robust Visual Features