Depth Anything V2

Chapter 0: The Problem

Monocular depth estimation asks a deceptively simple question: given a single 2D image, how far away is everything? No stereo pair, no LiDAR, no structured light. There is just one photo, from which the model must infer the 3D structure of the entire scene.

By 2024, two families of models dominated this space. Discriminative models like Depth Anything V1 used DINOv2 encoders and were fast, lightweight, and robust on complex scenes. Generative models like Marigold built on Stable Diffusion and excelled at fine-grained details — thin chair legs, transparent glass, reflective surfaces.

But neither family was complete. V1 produced coarse, blobby depth maps that missed fine structures. Marigold captured details but was 10× slower, required ~1B parameters, and failed on complex real-world layouts. Why?

The real culprit: noisy training labels

Almost every depth dataset uses one of three methods to collect "ground truth" depth, and all three are fundamentally flawed:

Depth sensors (e.g., NYU Depth V2): fail on transparent and reflective objects. They cannot measure a glass surface reliably, and on a mirror they often return holes or the apparent depth of the reflected scene instead of the mirror’s surface.
Stereo matching (e.g., HRWSI): fails on textureless walls, repetitive patterns, and thin structures. The depth labels are coarse and miss fine boundaries.
Structure-from-Motion (e.g., MegaDepth): fails on moving objects (people, cars) and produces sparse, incomplete depth maps full of outliers.

Models trained on these noisy labels learn to reproduce the noise. V1 trained on 1.5M labeled real images — but the labels themselves were wrong on exactly the cases that matter most: transparency, reflections, thin structures, and fine boundaries.

The core problem: The depth labels we train on are not actually ground truth. Sensors can’t see glass, stereo can’t match textureless walls, and SfM can’t handle moving objects. Every existing depth dataset has systematic blind spots, and models faithfully learn those blind spots. To get better depth, we need fundamentally better labels.

[Interactive figure: Label Noise in Real Depth Data. For each label source, red regions mark areas of incorrect or missing depth that the sensor cannot measure correctly.]

Why does Depth Anything V1 produce poor depth estimates for transparent objects like glass?

Because the depth sensors used to create its training labels cannot measure the correct depth of transparent surfaces, so the model learns these systematic errors
Because the DINOv2 encoder cannot process transparent textures
Because V1 uses too few parameters

Chapter 1: The Key Insight

The paper’s central insight is beautifully simple: synthetic data gives you precision, real data gives you diversity. Use each for what it does best, and combine them through self-training.

Synthetic data: perfect labels, limited scenes

In a rendered 3D environment, depth is not measured — it is known exactly. Every pixel’s distance from the camera is computed directly from the scene geometry. There are no sensor failures, no stereo ambiguities, no SfM artifacts. Even transparent glass, thin mesh structures, tiny leaves, and reflective surfaces get pixel-perfect depth labels.

The catch? Synthetic datasets like Hypersim (indoor rooms) and Virtual KITTI (driving scenes) cover only a handful of pre-defined environments. A model trained purely on synthetic data has never seen a crowded street, a beach, or a dog park. It will fail on anything outside its narrow training distribution.

Real data: noisy labels, infinite variety

The internet has billions of unlabeled photos covering every imaginable scene. You just can’t get accurate depth labels for them. But you can get pseudo-labels from a powerful teacher model.

The three-step recipe

Step 1

Train a large teacher model (DINOv2-Giant) only on synthetic images with pixel-perfect depth labels → learns precise, fine-grained depth

↓

Step 2

Use the teacher to pseudo-label 62M unlabeled real images → high-quality depth labels on diverse real-world scenes

↓

Step 3

Train student models on pseudo-labeled real images → inherits the teacher’s precision AND learns real-world diversity

Why this works: Synthetic data teaches the teacher what precise depth looks like: sharp edges, correct transparency, fine details. The teacher then transfers that precision onto real images where the scenes are diverse but labels were previously unavailable. The student gets the best of both worlds: precision from synthetic training plus diversity from real images, with far less label noise than sensor-based ground truth and a much smaller domain gap than synthetic-only training.
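The three steps above can be sketched as a tiny runnable toy. Everything here is a stand-in chosen for illustration: "images" are flat lists of numbers, and `fit_scale_model` is a hypothetical trainer that fits depth = k × pixel by least squares. Only the data flow (synthetic → teacher → pseudo-labels → student) mirrors the recipe.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[float]   # toy stand-in for an image (flat list of pixel values)
Depth = List[float]   # toy stand-in for a depth map (one value per pixel)
Model = Callable[[Image], Depth]
Dataset = Sequence[Tuple[Image, Depth]]


def fit_scale_model(data: Dataset) -> Model:
    """Hypothetical trainer: fits depth = k * pixel by least squares."""
    num = sum(p * d for img, dep in data for p, d in zip(img, dep))
    den = sum(p * p for img, _ in data for p in img)
    k = num / den
    return lambda img: [k * p for p in img]


def run_recipe(synthetic: Dataset,
               unlabeled_real: Sequence[Image],
               train_fn: Callable[[Dataset], Model]) -> Model:
    # Step 1: the teacher sees only synthetic images with exact labels.
    teacher = train_fn(synthetic)
    # Step 2: the teacher pseudo-labels the unlabeled real images.
    pseudo_labeled = [(img, teacher(img)) for img in unlabeled_real]
    # Step 3: the student is trained on the pseudo-labeled real images only.
    return train_fn(pseudo_labeled)


synthetic = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 0.5], [6.0, 1.0])]
unlabeled_real = [[4.0, 5.0], [0.25, 8.0]]
student = run_recipe(synthetic, unlabeled_real, fit_scale_model)
print(student([1.0, 3.0]))
```

In the real pipeline the teacher and student are different-sized networks and the real images come from a different distribution than the synthetic ones; the toy only shows which data each stage touches.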

This is the entire paper. Everything else (architecture choices, loss functions, benchmark design) supports this core recipe. The insight that you should keep noisy real labels out of training and instead use synthetic-only data for the teacher is the key departure from V1 and most prior discriminative work.

What is the key advantage of synthetic depth data over real depth data?

Synthetic depth labels are computed directly from scene geometry, giving pixel-perfect accuracy: no sensor noise, no stereo failures, correct transparency and reflections
Synthetic images look more realistic
Synthetic datasets are larger

Chapter 2: Synthetic Data Teacher

The first step is training a teacher model that produces the most precise depth predictions possible. The teacher pairs the largest DINOv2 encoder, DINOv2-Giant (about 1.1B parameters for the encoder alone; roughly 1.3B for the full depth model listed in Chapter 5), with a DPT decoder head. It is trained exclusively on synthetic images.

The synthetic training set

Five synthetic datasets totaling ~595K images:

Hypersim

~54K photorealistic indoor scenes with physically-based rendering. Perfect depth for rooms, furniture, glass, mirrors.

Virtual KITTI 2

~21K driving scenes. Perfect depth for roads, cars, trees, signs.

TartanAir

~306K aerial and ground-level scenes. Diverse outdoor environments.

IRS

~103K rendered scenes with interior rooms and objects.

BlendedMVS*

~110K scenes from multi-view stereo reconstruction (semi-synthetic).

The total is small by modern standards — 595K images vs. V1’s 1.5M labeled real images. But every single depth label is exact. No sensor noise, no coarse boundaries, no missing regions.
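As a quick sanity check on the composition, the rounded per-dataset counts listed above can be summed. The numbers below are copied from that list (in thousands of images); nothing else is assumed.

```python
# Rounded image counts (in thousands) as listed above.
synthetic_sets_k = {
    "Hypersim": 54,
    "Virtual KITTI 2": 21,
    "TartanAir": 306,
    "IRS": 103,
    "BlendedMVS": 110,
}

total_k = sum(synthetic_sets_k.values())
shares = {name: round(100 * n / total_k, 1) for name, n in synthetic_sets_k.items()}
largest = max(synthetic_sets_k, key=synthetic_sets_k.get)

print(f"total: ~{total_k}K images")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The rounded counts add up to about 594K, consistent with the ~595K total, and show that TartanAir alone contributes roughly half of the synthetic images.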

Why only DINOv2-Giant works for synthetic-to-real transfer

A critical finding: when you train only on synthetic data and test on real images, most encoders fail catastrophically. The paper tested BEiT-Large, SAM-Large, SynCLR-Large, and all sizes of DINOv2:

BEiT, SAM, SynCLR: severe artifacts, completely unusable predictions on real images
DINOv2-Small/Base/Large: visible domain gap artifacts, poor generalization
DINOv2-Giant: satisfactory results on real images, despite never seeing a real depth label

All DINOv2 variants draw on the same self-supervised pre-training over 142M curated real images, so the pre-training data alone does not explain the difference. What sets DINOv2-Giant apart is capacity: its feature representations are rich enough to bridge the synthetic-to-real domain gap. Smaller models simply don’t have enough representational capacity to retain both “what precise depth looks like” from synthetic data and “what real images look like” from their pre-training.

Training losses

Both loss functions come from MiDaS, but they matter more here because the synthetic labels are precise:

L = L_ssi + λ L_gm

Where:

L_ssi — scale- and shift-invariant loss: handles the fact that different synthetic datasets use different depth scales
L_gm — gradient matching loss: penalizes incorrect depth gradients (edges), which is crucial for preserving fine details. The paper shows L_gm is “super beneficial to depth sharpness when using synthetic images”

Why L_gm matters more with synthetic data: Real data has coarse boundaries anyway, so a gradient loss can’t help much — the labels don’t have sharp edges to supervise. But synthetic data has pixel-perfect edges. The gradient loss can now fully exploit this precision, teaching the model to produce razor-sharp depth boundaries.
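To make the two terms concrete, here is a minimal pure-Python sketch in the spirit of the MiDaS formulation. The specific choices are assumptions for illustration: normalization by median and mean absolute deviation, an absolute-error penalty, four gradient scales, and per-scale averaging. The weight `lam` is a free parameter; no particular value from the paper is implied.

```python
from statistics import median
from typing import List

DepthMap = List[List[float]]


def normalize(d: DepthMap) -> DepthMap:
    """Remove shift (median) and scale (mean absolute deviation) from a map."""
    flat = [v for row in d for v in row]
    shift = median(flat)
    scale = sum(abs(v - shift) for v in flat) / len(flat)
    scale = scale if scale > 0 else 1.0  # guard against constant maps
    return [[(v - shift) / scale for v in row] for row in d]


def residual(pred: DepthMap, target: DepthMap) -> DepthMap:
    """Per-pixel difference after both maps are normalized."""
    p, g = normalize(pred), normalize(target)
    return [[a - b for a, b in zip(rp, rg)] for rp, rg in zip(p, g)]


def ssi_loss(pred: DepthMap, target: DepthMap) -> float:
    """Scale- and shift-invariant loss: mean absolute normalized residual."""
    r = residual(pred, target)
    n = sum(len(row) for row in r)
    return sum(abs(v) for row in r for v in row) / n


def gradient_matching_loss(pred: DepthMap, target: DepthMap, scales: int = 4) -> float:
    """Penalize gradients of the residual at several subsampled scales."""
    r = residual(pred, target)
    total = 0.0
    for k in range(scales):
        step = 2 ** k
        sub = [row[::step] for row in r[::step]]
        h, w = len(sub), len(sub[0])
        gx = sum(abs(sub[y][x + 1] - sub[y][x]) for y in range(h) for x in range(w - 1))
        gy = sum(abs(sub[y + 1][x] - sub[y][x]) for y in range(h - 1) for x in range(w))
        total += (gx + gy) / (h * w)
    return total


def total_loss(pred: DepthMap, target: DepthMap, lam: float) -> float:
    return ssi_loss(pred, target) + lam * gradient_matching_loss(pred, target)


sharp = [[1.0, 1.0, 5.0, 5.0] for _ in range(4)]            # label with a crisp edge
rescaled = [[3.0 * v + 5.0 for v in row] for row in sharp]  # same structure, new scale/shift
blurry = [[1.0, 2.0, 4.0, 5.0] for _ in range(4)]           # edge smeared over two pixels

print(total_loss(rescaled, sharp, lam=0.5))
print(total_loss(blurry, sharp, lam=0.5))
```

A prediction that differs from the label only by scale and shift incurs essentially zero loss, while the blurred edge is penalized by both terms. The gradient term only bites like this because the label itself has a sharp edge to compare against.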

[Interactive figure: Synthetic vs Real Depth Labels. Left: real depth label with coarse edges and missing regions. Right: synthetic depth label with pixel-perfect precision.]

Why can only DINOv2-Giant (not smaller models) successfully transfer from synthetic training to real-world testing?

DINOv2-Giant's massive self-supervised pre-training on 142M diverse real images gives it feature representations that bridge the domain gap; smaller models lack this capacity
DINOv2-Giant uses a special synthetic-aware architecture
Smaller models train too slowly

Chapter 3: Pseudo-Label Pipeline

The teacher is precise but fragile — it was only trained on ~595K synthetic images covering a narrow slice of the visual world. It fails on skies, humans, unusual objects, and many real-world patterns. How do we bridge this gap?

The answer: use the teacher to label a massive collection of unlabeled real images, then train student models on those pseudo-labels. The diversity of the real images compensates for the teacher’s blind spots, while the teacher’s precision provides far better labels than any sensor or stereo system could.

The unlabeled real image pool

Eight large-scale datasets totaling 62 million images:

SA-1B (from SAM) — 11M images of extremely diverse scenes
Open Images V7 — 1.7M crowdsourced photos
BDD100K — 100K driving videos
LSUN — 10 scene categories, millions of images
Objects365 — 2M images with 365 object categories
Plus others — covering indoor, outdoor, aerial, underwater, and more

62M images is roughly 40× the 1.5M labeled images used by V1, and roughly 100× the teacher’s own 595K synthetic images. This is the power of pseudo-labeling: you don’t need human annotators or expensive sensors. You just need a good teacher and internet-scale data.

Noise-aware training

Not all pseudo-labels are perfect. The teacher still makes mistakes, especially on scenes far from its synthetic training distribution. To handle this, V2 borrows a trick from V1: ignore the top 10% highest-loss pixels for each pseudo-labeled image during training. These are the pixels where the pseudo-label is most likely wrong.
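A minimal sketch of this rejection step, assuming the per-pixel losses for one image are already available as a flat list. The function name `trimmed_mean_loss` is hypothetical, and a real implementation would do this on tensors with a mask rather than by sorting a Python list.

```python
from typing import List


def trimmed_mean_loss(pixel_losses: List[float], drop_fraction: float = 0.10) -> float:
    """Average the per-pixel losses after discarding the highest-loss pixels."""
    n_drop = int(len(pixel_losses) * drop_fraction)
    kept = sorted(pixel_losses)[: len(pixel_losses) - n_drop]
    return sum(kept) / len(kept)


# Nine well-labeled pixels and one pixel where the pseudo-label is badly off.
losses = [0.1] * 9 + [5.0]
print(sum(losses) / len(losses))   # plain mean, dominated by the outlier
print(trimmed_mean_loss(losses))   # outlier ignored
```

The trade-off is that some genuinely hard but correctly labeled pixels are also dropped; the fixed 10% threshold accepts that cost in exchange for not fitting the teacher's worst mistakes.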

The pseudo-labeling pipeline at scale: The teacher processes 62M images, producing a dense depth map for every one. These depth maps have far sharper edges and handle transparency more accurately than sensor-based labels. The 10% noise rejection then filters out many of the teacher’s remaining errors. The result: a training set that approaches the precision of synthetic data while covering the diversity of internet-scale real imagery.

[Interactive figure: Pseudo-Label Pipeline. The synthetic-trained teacher produces pseudo-labels for real images, then the noise filter removes the top 10% highest-loss pixels (shown in red).]

Why does V2 ignore the top 10% highest-loss pixels when training on pseudo-labeled data?

Those pixels likely have incorrect pseudo-labels from the teacher's mistakes; ignoring them prevents the student from learning the teacher's errors
To speed up training
Because 10% of pixels are always occluded

Chapter 4: Student Training

With 62M precisely pseudo-labeled real images in hand, we can now train the final student models. This step is where a critical finding emerges: the student should be trained only on pseudo-labeled real images, not on the original synthetic data.

Why drop the synthetic data for students?

This seems counterintuitive. The synthetic data has perfect labels — why not include it? Two reasons:

Domain gap: Mixing synthetic images into training reintroduces the distribution shift that pseudo-labeling was designed to eliminate. The student sees “clean” rendered images alongside real photos and gets confused about which distribution to fit.
Redundancy: The pseudo-labels already encode everything the teacher learned from synthetic data. The teacher’s knowledge is transferred through the labels, not the images. Including the originals adds no new information but adds noise from the domain mismatch.

Knowledge distillation at label level

What V2 does is a form of knowledge distillation, but with an important twist:

Traditional distillation: student learns to match the teacher’s features or logits on the same training data
V2’s approach: student learns from the teacher’s labels on completely different (real) data

This is safer because feature-level distillation can fail when the teacher-student scale gap is large (DINOv2-Giant vs DINOv2-Small is a 50× parameter difference). Label-level distillation on new data works regardless of the architecture gap.

Feature alignment loss

To preserve the rich semantic features from the pre-trained DINOv2 encoders, V2 adds a feature alignment loss (borrowed from V1). This prevents the depth training from destroying the useful representations the encoder learned during self-supervised pre-training.

L_student = L_ssi + λ₁ L_gm + λ₂ L_feat
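The feature term can be illustrated with a small sketch. It assumes a cosine-similarity form for L_feat, comparing the student encoder's per-pixel features with those of a frozen copy of the pre-trained encoder; that form is an assumption here, since the text above does not spell out the formula. The weights `lam1` and `lam2` are free parameters.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def feature_alignment_loss(student_feats: List[Sequence[float]],
                           frozen_feats: List[Sequence[float]]) -> float:
    """1 minus the mean cosine similarity between matching feature vectors."""
    sims = [cosine_similarity(s, f) for s, f in zip(student_feats, frozen_feats)]
    return 1.0 - sum(sims) / len(sims)


def student_loss(l_ssi: float, l_gm: float, l_feat: float,
                 lam1: float, lam2: float) -> float:
    """Combine the three terms as in the formula above."""
    return l_ssi + lam1 * l_gm + lam2 * l_feat


frozen = [[1.0, 0.0], [0.0, 2.0]]
print(feature_alignment_loss(frozen, frozen))                # features unchanged
print(feature_alignment_loss([[0.0, 1.0]], [[1.0, 0.0]]))    # features rotated away
```

The loss is zero while the student's features still point in the same direction as the pre-trained ones and grows as depth training pulls them away, which is the behavior the alignment term is meant to discourage.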

The student training insight: The four released models (DINOv2-S/B/L/G) are all trained on pseudo-labeled real images only. Even DINOv2-Small (25M params) achieves excellent depth estimation because it learns from the Giant teacher’s precise labels on diverse real images — something it could never achieve by training directly on synthetic data or noisy real labels.

Why should students be trained on pseudo-labeled real images ONLY, not on a mix of synthetic + pseudo-labeled?

Including synthetic images reintroduces the domain gap, and the pseudo-labels already encode everything the teacher learned from synthetic data, so the originals add no value
Synthetic images use too much disk space
The loss function can't handle two data types

Chapter 5: Architecture

V2’s architecture is deliberately simple. The paper’s thesis is that data matters more than architecture, so they use a well-proven encoder-decoder design and focus all innovation on the training pipeline.

Encoder: DINOv2

Four scale variants, all based on the Vision Transformer (ViT) architecture pre-trained with DINOv2’s self-supervised method:

ViT-Small

25M params • 60ms latency (V100) • Real-time on edge devices

ViT-Base

97M params • 80ms latency • Good speed/quality tradeoff

ViT-Large

335M params • 213ms latency • High-quality production use

ViT-Giant

1.3B params • ~400ms latency • Maximum quality, serves as teacher

Decoder: DPT (Dense Prediction Transformer)

The decoder is the same DPT head used in MiDaS and V1. It takes multi-scale features from the ViT encoder and progressively fuses them into a dense depth prediction:

Extract features from 4 intermediate ViT layers (evenly spaced)
Project each to a common channel dimension via 1×1 convolution
Progressively upsample and fuse via residual convolution blocks
Final prediction head outputs a single-channel inverse depth map
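The first steps of this list are mostly shape bookkeeping, which a short sketch can make concrete. The assumptions are labeled in the code: a patch size of 14 (as in DINOv2's ViT variants), evenly spaced tap layers computed by a simple formula, and rescaling factors of 4, 2, 1, and 0.5 on the token grid as in the original DPT design. The released model configurations may pick slightly different layers.

```python
import math
from typing import List, Tuple


def tap_layers(depth: int, taps: int = 4) -> List[int]:
    """Evenly spaced 0-based layer indices, ending at the last layer (assumed rule)."""
    return [(i + 1) * depth // taps - 1 for i in range(taps)]


def feature_grids(height: int, width: int, patch: int = 14,
                  factors: Tuple[float, ...] = (4.0, 2.0, 1.0, 0.5)) -> List[Tuple[int, int]]:
    """Spatial size of each tapped feature map after DPT-style rescaling."""
    gh, gw = height // patch, width // patch   # token grid of the ViT
    return [(math.ceil(gh * f), math.ceil(gw * f)) for f in factors]


print(tap_layers(12))            # a 12-layer ViT
print(tap_layers(40))            # a 40-layer ViT
print(feature_grids(518, 518))   # 518 / 14 = 37 tokens per side
```

The decoder then fuses these maps from coarsest to finest, upsampling at each step, so the final head operates on the highest-resolution grid before producing the single-channel output.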

[Interactive figure: DINOv2 Encoder + DPT Decoder Architecture. The encoder extracts multi-scale features from 4 ViT layers; the DPT decoder progressively fuses them into a dense depth map.]

Output format

V2 produces affine-invariant inverse depth: the output is a relative depth map where closer pixels have higher values. It does not directly predict metric (meters) depth — that requires a separate fine-tuning step (Chapter 7). This makes the model robust to varying camera intrinsics and scene scales.
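Because the output is only defined up to scale and shift, comparing it with a reference usually means fitting those two numbers first. The sketch below shows the closed-form least-squares fit on flat lists of values in inverse-depth space. The function name `align_scale_shift` and the sample numbers are illustrative, not taken from the paper.

```python
from typing import List, Tuple


def align_scale_shift(pred: List[float], target: List[float]) -> Tuple[float, float]:
    """Least-squares s, t minimizing sum((s * pred + t - target) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    s = cov / var_p
    t = mean_t - s * mean_p
    return s, t


relative = [0.2, 0.5, 0.9]                       # model output (higher = closer)
reference = [2.0 * v + 1.0 for v in relative]    # same structure, different scale/shift
s, t = align_scale_shift(relative, reference)
aligned = [s * v + t for v in relative]
print(s, t)
```

Two predictions that differ only by such an affine map are treated as equally good, which is why the base model can be trained across datasets with incompatible depth units.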

Architecture simplicity is the point: V2 uses the exact same architecture as V1. The only changes are in the data pipeline. This demonstrates that for monocular depth estimation, the quality and composition of training data matters far more than architectural innovations.

What does the DPT decoder do with the ViT encoder's features?

Extracts features from 4 intermediate ViT layers, projects them to a common dimension, then progressively upsamples and fuses them into a dense depth prediction
Applies a single linear layer to the final ViT output
Uses the ViT's attention maps directly as depth

Chapter 6: Results

V2 achieves state-of-the-art zero-shot monocular depth estimation, matching or beating V1 and Marigold on the benchmarks below while running more than 10× faster than Marigold.

Standard benchmarks

On five standard test sets (KITTI, NYU-D, Sintel, ETH3D, DIODE), V2-Large achieves:

AbsRel on NYU-D: 0.043 (V1-L: 0.043, Marigold: 0.055) — matches V1 on this benchmark, but the paper argues NYU-D is too noisy to differentiate strong models
AbsRel on KITTI: 0.036 (V1-L: 0.076) — 53% reduction in error
δ₁ accuracy on ETH3D: 99.0% (V1-L: 95.3%, Marigold: 86.8%)
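For readers unfamiliar with the two metrics in this list, here is a minimal sketch of their standard definitions: AbsRel is the mean of |pred − gt| / gt, and δ₁ is the fraction of pixels where max(pred/gt, gt/pred) is below 1.25. The sample values are made up for illustration, and the predictions are assumed to be already aligned to the ground truth.

```python
from typing import List


def abs_rel(pred: List[float], gt: List[float]) -> float:
    """Mean absolute relative error (lower is better)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta1(pred: List[float], gt: List[float], threshold: float = 1.25) -> float:
    """Fraction of pixels whose ratio to ground truth is within the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


pred = [1.0, 2.2, 3.0, 8.0]   # illustrative depths in meters
gt = [1.0, 2.0, 4.0, 4.0]
print(abs_rel(pred, gt))
print(delta1(pred, gt))
```

Both metrics take the ground truth at face value, so any systematic error in the reference depth counts against a model that predicts the physically correct value.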

DA-2K: a better benchmark

The authors construct DA-2K, a new benchmark with 2,000 diverse, high-resolution images annotated by humans for pairwise relative depth. This benchmark avoids the sensor noise in existing test sets. On DA-2K:

V2-Giant: 97.1% accuracy
V2-Large: 95.3%
V1-Large: 85.8%
Marigold: 86.8%
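Pairwise relative-depth accuracy of the kind reported above can be computed with a few lines. The annotation format here is hypothetical: each pair lists two pixel coordinates, with the first one marked by the annotator as closer. The prediction is assumed to be inverse depth, so a closer pixel should have the higher value.

```python
from typing import List, Tuple

Point = Tuple[int, int]   # (row, column)


def pairwise_accuracy(inv_depth: List[List[float]],
                      pairs: List[Tuple[Point, Point]]) -> float:
    """Fraction of pairs where the predicted ordering matches the annotation."""
    correct = sum(
        1 for closer, farther in pairs
        if inv_depth[closer[0]][closer[1]] > inv_depth[farther[0]][farther[1]]
    )
    return correct / len(pairs)


prediction = [[0.9, 0.1],
              [0.5, 0.4]]
pairs = [((0, 0), (0, 1)),   # predicted correctly
         ((1, 1), (1, 0)),   # predicted in the wrong order
         ((1, 0), (0, 1)),   # predicted correctly
         ((0, 0), (1, 0))]   # predicted correctly
print(pairwise_accuracy(prediction, pairs))
```

Because the metric only asks which of two points is closer, it needs no sensor measurements and is unaffected by the scale and shift ambiguity of the prediction.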

Speed comparison

On a V100 GPU:

Marigold: 5.2s per image (948M params, requires multi-step denoising)
V2-Large: 213ms per image (335M params, single forward pass)
V2-Small: 60ms per image (25M params, real-time on mobile)

[Figure: V2 vs V1 vs Marigold. Accuracy on the DA-2K benchmark vs inference latency; circle size indicates parameter count. V2 achieves higher accuracy at a fraction of the cost.]

Transparent surface challenge

One of V2’s most dramatic improvements: on the Transparent Surface Challenge, V2 scores 83.6% zero-shot, compared to V1’s 53.5% and MiDaS’s 25.9%. This supports the claim that synthetic data, which has exact depth for transparent objects, addresses a problem that real depth sensors fundamentally cannot.

The benchmark matters: On noisy benchmarks like NYU-D, V2 barely improves over V1 because the “ground truth” labels are themselves wrong. On clean benchmarks like DA-2K and ETH3D, V2’s improvements are massive. This is why the paper also contributes DA-2K — to give the community a benchmark that can actually distinguish good from great depth models.

Why does V2 show only modest improvement over V1 on NYU Depth V2 but massive improvement on ETH3D and DA-2K?

NYU-D's test labels are themselves noisy: a better model can actually score LOWER because it predicts correct depth where the label is wrong. Clean benchmarks reveal V2's true advantage.
V2 only works on outdoor scenes
ETH3D and DA-2K use easier images

Chapter 7: Metric Depth Fine-Tuning

So far, V2 produces relative (affine-invariant) depth: it tells you “pixel A is closer than pixel B” but not “pixel A is 2.3 meters away.” Many applications — robotics, AR, 3D reconstruction — need actual metric depth in meters.

From relative to metric

The conversion requires fine-tuning on datasets with known camera intrinsics and metric depth labels. V2 fine-tunes on indoor (NYU-D, Hypersim, etc.) and outdoor (KITTI, DDAD, etc.) metric depth datasets, creating two specialized metric models:

DA-V2-Metric-Indoor: trained on 8 indoor datasets, outputs metric depth up to ~20m
DA-V2-Metric-Outdoor: trained on 6 outdoor datasets, outputs metric depth up to ~80m

Why V2 is a better starting point for metric depth

V2’s relative depth model produces much better ordinal structure (which pixel is in front of which). Fine-tuning then mainly needs to learn the mapping to absolute scale. Starting from V2’s precise relative depth, the metric fine-tuning converges faster and achieves better results than fine-tuning V1 or training from scratch.

Scale-and-shift invariance: V2’s base model is trained with a scale- and shift-invariant loss, meaning it learns only the structure of depth, not the absolute numbers. Metric fine-tuning then adds the scale (how to convert to meters) and shift (where zero is). This two-stage approach is more robust than trying to learn both structure and scale simultaneously.

Results

V2’s metric models significantly outperform V1’s metric counterparts and even specialized metric depth models like ZoeDepth and Metric3D on most benchmarks. The key advantage: V2’s superior ordinal structure means fewer “depth inversions” where the model thinks a farther object is closer.

What does "affine-invariant" depth mean, and why is it used for the base model?

The model learns only relative depth structure (ordering and proportions), not absolute distances; this allows training on datasets with different scales without conflict, and metric depth can be added later via fine-tuning
The depth is invariant to camera rotation
It means the model works on any image size

Chapter 8: DA-V2 as Foundation

Depth Anything V2 rapidly became a foundational component in 3D vision, adopted by dozens of downstream systems within months of release.

3D reconstruction

DUSt3R (Wang et al., 2024) / MASt3R (Leroy et al., 2024): These 3D reconstruction models can benefit from V2’s precise monocular depth as initialization or regularization. V2’s depth maps provide a strong prior for multi-view stereo, reducing the search space for correspondences.

3D Gaussian Splatting: V2 provides initialization for Gaussian positions. Instead of starting from random point clouds, 3DGS uses V2’s depth maps to place Gaussians at approximately correct depths, dramatically accelerating convergence and improving quality for sparse-view reconstruction.
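Placing points from a depth map comes down to pinhole back-projection. The sketch below assumes metric depth along the camera's z-axis and known intrinsics (`fx`, `fy`, `cx`, `cy`); a relative depth map would first need a scale and shift, for example fitted against sparse reconstructed points. It illustrates the geometry only and is not taken from any particular 3DGS codebase.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def unproject(depth: List[List[float]], fx: float, fy: float,
              cx: float, cy: float) -> List[Point3D]:
    """Back-project every pixel (u, v) with depth z into camera coordinates."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


depth = [[2.0, 2.0],
         [4.0, 4.0]]
points = unproject(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(points[0])    # pixel (u=0, v=0)
print(points[-1])   # pixel (u=1, v=1)
```

Each resulting point is a candidate Gaussian center in the camera frame; transforming by the camera pose would place it in world coordinates.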

Video depth

V2 serves as the per-frame depth backbone for video depth estimation systems. Models like DepthCrafter and Video Depth Anything use V2’s per-frame predictions as input, then add temporal consistency through flow-based alignment or video diffusion. V2’s speed (60ms for Small) makes real-time video depth practical.

Robotics and embodied AI

V2-Small runs at real-time speeds on edge devices, making it useful for:

Obstacle avoidance in autonomous navigation
Grasp planning using depth-informed scene understanding
Scene understanding for vision-language-action models

AI-generated content

Depth-conditioned image and video generation (ControlNet, etc.) benefits from V2’s cleaner depth maps. Sharper depth boundaries lead to more precise control over generated 3D structure.

[Figure: V2's Downstream Ecosystem. V2 serves as a foundational depth backbone for an expanding set of downstream applications across 3D vision.]

Why V2 became foundational: Three properties make V2 the default depth backbone: (1) speed — 10× faster than diffusion-based alternatives, (2) quality — state-of-the-art accuracy across benchmarks, and (3) scale range — from 25M to 1.3B params, fitting any compute budget. When a downstream task needs monocular depth, V2 is the obvious choice.

How does 3D Gaussian Splatting use Depth Anything V2?

V2's depth maps provide initial positions for Gaussians, replacing random initialization with approximately correct depths; this accelerates convergence and improves quality for sparse-view reconstruction
V2 replaces the Gaussian splatting renderer
V2 trains the Gaussians directly

Chapter 9: Connections

What V2 built on

Depth Anything V1 (Yang et al., 2024): The direct predecessor. V1 used DINOv2 encoders and trained on 1.5M labeled real images + 62M unlabeled images. V2 keeps the architecture and unlabeled data pipeline but replaces the labeled real images with synthetic data for the teacher. Same skeleton, fundamentally different training recipe.

MiDaS (Ranftl et al., 2020-2022): Pioneered mixing multiple depth datasets with affine-invariant training. V2 uses MiDaS’s DPT decoder and loss functions (L_ssi + L_gm). MiDaS showed the path; V2 showed that the data composition matters more than the data quantity.

DINOv2 (Oquab et al., 2023): The self-supervised ViT encoder that makes the entire pipeline possible. DINOv2’s features bridge the synthetic-to-real domain gap and provide rich semantic representations that transfer to depth estimation.

Marigold (Ke et al., 2023): Showed that synthetic-only training produces fine-grained depth via Stable Diffusion. V2 takes Marigold’s insight (synthetic data is enough for precision) but achieves it with a discriminative model that’s 10× faster.

What V2 enabled

Video Depth Anything (2025): Extends V2 to temporally consistent video depth. Uses V2’s per-frame predictions and adds a temporal consistency module.

Metric3D V2 (Hu et al., 2024): Uses ideas similar to V2’s synthetic + pseudo-label pipeline for metric depth estimation at scale.

DUSt3R / MASt3R (2024): Multi-view 3D reconstruction that uses V2-quality monocular depth as initialization.

Related approaches

ZoeDepth (Bhat et al., 2023): Combines relative and metric depth in a single model with task-specific heads. V2’s two-stage approach (relative then fine-tune for metric) achieves better results.

UniDepth (Piccinelli et al., 2024): Jointly predicts depth and camera intrinsics. Complementary to V2’s approach.

V2’s lasting contribution: The paper’s deepest insight transcends depth estimation: when real-world labels are fundamentally noisy, train on synthetic data where labels are exact, then use self-training to bridge the domain gap. This “synthetic precision + real diversity” recipe is a general-purpose strategy applicable to any dense prediction task where ground truth is hard to measure.

Cheat sheet

Core recipe

Synthetic-only teacher → pseudo-label 62M real images → student trained on pseudo-labels only

Key numbers

595K synthetic images • 62M pseudo-labeled real images • 25M-1.3B params • 60-400ms latency

Architecture

DINOv2 encoder (ViT-S/B/L/G) + DPT decoder — identical to V1

Breakthrough

Replacing noisy real labels with precise synthetic labels is the single most important change

Impact

Foundation for DUSt3R, 3DGS init, video depth, robotics, AIGC — de facto standard MDE backbone

What general-purpose strategy does Depth Anything V2 demonstrate beyond depth estimation?

When real-world labels are fundamentally noisy, train on synthetic data for precision, then self-train on unlabeled real data for diversity: the "synthetic precision + real diversity" recipe
Always use the largest possible model
Discriminative models are always better than generative ones