Replacing noisy real depth labels with precise synthetic data, then self-training on 62M unlabeled real images to build the most capable monocular depth foundation model.
Monocular depth estimation asks a deceptively simple question: given a single 2D image, how far away is everything? No stereo pair, no LiDAR, no structured light. Just one photo and the model must infer the 3D structure of the entire scene.
By 2024, two families of models dominated this space. Discriminative models like Depth Anything V1 used DINOv2 encoders and were fast, lightweight, and robust on complex scenes. Generative models like Marigold built on Stable Diffusion and excelled at fine-grained details — thin chair legs, transparent glass, reflective surfaces.
But neither family was complete. V1 produced coarse, blobby depth maps that missed fine structures. Marigold captured details but was 10× slower, required ~1B parameters, and failed on complex real-world layouts. Why?
Almost every real depth dataset collects its "ground truth" depth with one of three methods: depth sensors, stereo matching, or structure from motion (SfM). All three are fundamentally flawed.
Models trained on these noisy labels learn to reproduce the noise. V1 trained on 1.5M labeled real images — but the labels themselves were wrong on exactly the cases that matter most: transparency, reflections, thin structures, and fine boundaries.
Where each source type's depth labels break down. Red regions mark areas of incorrect or missing depth. The sensor literally cannot measure these correctly.
The paper’s central insight is beautifully simple: synthetic data gives you precision, real data gives you diversity. Use each for what it does best, and combine them through self-training.
In a rendered 3D environment, depth is not measured — it is known exactly. Every pixel’s distance from the camera is computed directly from the scene geometry. There are no sensor failures, no stereo ambiguities, no SfM artifacts. Even transparent glass, thin mesh structures, tiny leaves, and reflective surfaces get pixel-perfect depth labels.
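To make "known exactly" concrete, here is a minimal sketch that computes a depth map analytically for a toy scene. It assumes a pinhole camera at the origin looking along +z and a single sphere; the function name and parameters are illustrative and are not taken from any rendering engine or from the paper.

```python
import math

def sphere_depth_map(width, height, focal, center, radius):
    """Return a height x width grid of z-depths (None where the ray misses)."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    depth = []
    for v in range(height):
        row = []
        for u in range(width):
            # Ray through pixel (u, v); its z component is 1, so t equals z-depth.
            d = ((u - cx) / focal, (v - cy) / focal, 1.0)
            # Solve |t*d - center|^2 = radius^2 for the nearest intersection.
            qa = sum(x * x for x in d)
            qb = -2.0 * sum(x * c for x, c in zip(d, center))
            qc = sum(c * c for c in center) - radius ** 2
            disc = qb * qb - 4.0 * qa * qc
            if disc < 0:
                row.append(None)  # ray misses the sphere
                continue
            t = (-qb - math.sqrt(disc)) / (2.0 * qa)
            row.append(t if t > 0 else None)
        depth.append(row)
    return depth

# Toy scene: unit sphere 4 units in front of a 5x5 camera.
depth = sphere_depth_map(5, 5, 5.0, (0.0, 0.0, 4.0), 1.0)
```

Every value comes straight from the geometry, so there is nothing for a sensor to get wrong, even at the silhouette of the object.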
The catch? Synthetic datasets like Hypersim (indoor rooms) and Virtual KITTI (driving scenes) cover only a handful of pre-defined environments. A model trained purely on synthetic data has never seen a crowded street, a beach, or a dog park. It will fail on anything outside its narrow training distribution.
The internet has billions of unlabeled photos covering every imaginable scene. You just can’t get accurate depth labels for them. But you can get pseudo-labels from a powerful teacher model.
This recipe is the entire paper: train a teacher on synthetic data only, use it to pseudo-label unlabeled real images, then train students on those pseudo-labels. Everything else (architecture choices, loss functions, benchmark design) supports this core recipe. The key departure from V1 and all prior work is to keep noisy real labels out of training entirely and to train the teacher on synthetic data alone.
The first step is training a teacher model that produces the most precise depth predictions possible. The teacher pairs the largest available DINOv2 encoder, DINOv2-Giant (1.1B parameters), with a DPT decoder head. It is trained exclusively on synthetic images.
Five synthetic datasets totaling ~595K images:
The total is small by modern standards — 595K images vs. V1’s 1.5M labeled real images. But every single depth label is exact. No sensor noise, no coarse boundaries, no missing regions.
A critical finding: when you train only on synthetic data and test on real images, most encoders fail catastrophically. The paper tested BEiT-Large, SAM-Large, SynCLR-Large, and all sizes of DINOv2:
DINOv2-Giant’s self-supervised pre-training on diverse real images gives it feature representations that bridge the synthetic-to-real domain gap. Smaller models simply don’t have enough representational capacity to learn both “what precise depth looks like” from synthetic data and “what real images look like” from their pre-training.
Two loss functions carry over from MiDaS and become critical with synthetic data: the scale- and shift-invariant loss (L_ssi) and the gradient matching loss (L_gm).
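The two MiDaS-style losses can be sketched in plain Python as follows. This is a simplified illustration, assuming the common formulation in which each map is normalized by its median (shift) and mean absolute deviation (scale) before comparison; the function names, the list-of-lists representation, and the `scales` parameter are choices made for this sketch and may differ from the paper's exact implementation.

```python
from statistics import median

def normalize(depth):
    """Remove shift (median) and scale (mean absolute deviation) from a 2D map."""
    flat = [x for row in depth for x in row]
    t = median(flat)
    s = sum(abs(x - t) for x in flat) / len(flat)
    s = s if s > 0 else 1.0  # guard against constant maps
    return [[(x - t) / s for x in row] for row in depth]

def ssi_loss(pred, target):
    """Mean absolute difference between the normalized maps."""
    p, g = normalize(pred), normalize(target)
    n = len(p) * len(p[0])
    return sum(abs(a - b) for rp, rg in zip(p, g) for a, b in zip(rp, rg)) / n

def gradient_matching_loss(pred, target, scales=1):
    """Penalize x/y gradients of the normalized residual at several scales."""
    p, g = normalize(pred), normalize(target)
    resid = [[a - b for a, b in zip(rp, rg)] for rp, rg in zip(p, g)]
    total = 0.0
    for k in range(scales):
        step = 2 ** k  # coarser scales subsample the residual
        r = [row[::step] for row in resid[::step]]
        h, w = len(r), len(r[0])
        gx = sum(abs(r[y][x + 1] - r[y][x]) for y in range(h) for x in range(w - 1))
        gy = sum(abs(r[y + 1][x] - r[y][x]) for y in range(h - 1) for x in range(w))
        total += (gx + gy) / (h * w)
    return total

pred = [[1.0, 2.0], [3.0, 4.0]]
same_up_to_affine = [[12.0, 14.0], [16.0, 18.0]]  # 2 * pred + 10
```

Because both maps are normalized first, a prediction that differs from the label only by a scale and a shift incurs zero loss under both terms. The gradient term is the one that rewards sharp edges, which only helps when the label edges are themselves precise.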
Label quality side by side. Left: real depth label with coarse edges and missing regions. Right: synthetic depth label with pixel-perfect precision. The differences are clearest at object boundaries.
The teacher is precise but fragile — it was only trained on ~595K synthetic images covering a narrow slice of the visual world. It fails on skies, humans, unusual objects, and many real-world patterns. How do we bridge this gap?
The answer: use the teacher to label a massive collection of unlabeled real images, then train student models on those pseudo-labels. The diversity of the real images compensates for the teacher’s blind spots, while the teacher’s precision provides far better labels than any sensor or stereo system could.
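The pseudo-labeling loop described above can be sketched with a deliberately tiny stand-in problem. Everything here is hypothetical: the "images" are scalars, the "teacher" is a fixed function, and the "student" is a two-parameter linear model trained by SGD. The point is only the data flow, in which the student never sees anything except the teacher's outputs on unlabeled inputs.

```python
def pseudo_label(teacher, unlabeled):
    """Pair every unlabeled input with the teacher's prediction."""
    return [(x, teacher(x)) for x in unlabeled]

def train_student(pairs, lr=0.05, epochs=200):
    """Fit y = w * x + b to the pseudo-labels with plain SGD."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in pairs:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def teacher(x):
    # Stand-in for a frozen teacher network.
    return 2.0 * x + 1.0

unlabeled = [i / 10.0 for i in range(-10, 11)]  # stand-in for real images
pairs = pseudo_label(teacher, unlabeled)
w, b = train_student(pairs)
```

In the real pipeline the teacher is the synthetic-trained model, the unlabeled set is the collection of real images, and the student is a full depth network, but the structure is the same.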
Eight large-scale datasets totaling 62 million images:
62M images is roughly 40× the 1.5M labeled images used by V1, and about 100× V2’s own 595K synthetic training set. This is the power of pseudo-labeling: you don’t need human annotators or expensive sensors. You just need a good teacher and internet-scale data.
Not all pseudo-labels are perfect. The teacher still makes mistakes, especially on scenes far from its synthetic training distribution. To handle this, V2 borrows a trick from V1: ignore the top 10% highest-loss pixels for each pseudo-labeled image during training. These are the pixels where the pseudo-label is most likely wrong.
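A minimal sketch of this filter, assuming the per-pixel losses for one image are already available as a 2D list. The function name and the `drop_fraction` parameter are illustrative, not the paper's code.

```python
def filtered_loss(pixel_losses, drop_fraction=0.10):
    """Mean loss after discarding the drop_fraction highest-loss pixels."""
    flat = sorted(x for row in pixel_losses for x in row)
    keep = len(flat) - int(len(flat) * drop_fraction)
    keep = max(keep, 1)  # always keep at least one pixel
    return sum(flat[:keep]) / keep

# Ten pixels, one of which has a suspiciously large loss.
losses = [[0.1, 0.1, 0.1, 0.1, 0.1],
          [0.1, 0.1, 0.1, 0.1, 9.0]]
loss = filtered_loss(losses)  # the 9.0 outlier is ignored
```

The filter is applied per image, so a single badly labeled region cannot dominate the gradient for that sample.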
The synthetic-trained teacher produces pseudo-labels for real images, then the noise filter removes the top 10% highest-loss pixels (shown in red).
With 62M precisely pseudo-labeled real images in hand, we can now train the final student models. This step is where a critical finding emerges: the student should be trained only on pseudo-labeled real images, not on the original synthetic data.
This seems counterintuitive. The synthetic data has perfect labels — why not include it? Two reasons:
What V2 does is a form of knowledge distillation, but with an important twist:
Distilling at the label level is safer because feature-level distillation can fail when the teacher-student scale gap is large (DINOv2-Giant vs. DINOv2-Small is roughly a 50× parameter difference). Label-level distillation on new data works regardless of the architecture gap.
To preserve the rich semantic features from the pre-trained DINOv2 encoders, V2 adds a feature alignment loss (borrowed from V1). This prevents the depth training from destroying the useful representations the encoder learned during self-supervised pre-training.
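One plausible form of such a loss is one minus the mean cosine similarity between the student's features and those of a frozen copy of the pre-trained encoder. The sketch below assumes that form. The optional `margin` argument, which skips tokens that are already well aligned, reflects a tolerance margin that V1 is understood to use; treat it, and all names here, as assumptions rather than the paper's exact definition.

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def feature_alignment_loss(student_feats, frozen_feats, margin=None):
    """1 - mean cosine similarity over tokens; tokens above `margin` are skipped."""
    terms = []
    for s, f in zip(student_feats, frozen_feats):
        c = cosine(s, f)
        if margin is not None and c > margin:
            continue  # already close enough; leave this token free to adapt
        terms.append(1.0 - c)
    return sum(terms) / len(terms) if terms else 0.0

student = [[1.0, 0.0], [1.0, 0.0]]
frozen = [[1.0, 0.0], [0.0, 1.0]]
loss = feature_alignment_loss(student, frozen)
```

The margin matters because semantic features and depth features are not the same thing: forcing every token to match the frozen encoder exactly would fight against the depth objective.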
V2’s architecture is deliberately simple. The paper’s thesis is that data matters more than architecture, so they use a well-proven encoder-decoder design and focus all innovation on the training pipeline.
Four scale variants, all based on the Vision Transformer (ViT) architecture pre-trained with DINOv2’s self-supervised method:
The decoder is the same DPT head used in MiDaS and V1. It takes multi-scale features from the ViT encoder and progressively fuses them into a dense depth prediction:
The encoder extracts multi-scale features from 4 ViT layers. The DPT decoder progressively fuses them into a dense depth map.
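As a rough guide to the feature dimensions involved, the token grid a ViT encoder hands to the decoder follows from the patch size. DINOv2 uses 14×14 patches; the 518×518 input below is an assumed example resolution, and the helper name is hypothetical.

```python
def vit_token_grid(height, width, patch=14):
    """Number of patch tokens along each axis for a ViT with square patches."""
    if height % patch or width % patch:
        raise ValueError("input size must be a multiple of the patch size")
    return height // patch, width // patch

rows, cols = vit_token_grid(518, 518)
num_tokens = rows * cols  # patch tokens only, excluding any class token
```

The decoder's job is to turn this coarse grid back into a depth value per pixel, which is why it fuses features from several encoder layers rather than only the last one.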
V2 produces affine-invariant inverse depth: the output is a relative depth map, defined only up to an unknown scale and shift, in which closer pixels have higher values. It does not directly predict metric depth in meters; that requires a separate fine-tuning step (Chapter 7). Dropping the absolute scale is what makes the model robust to varying camera intrinsics and scene scales.
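To see what "affine-invariant" means in practice, the sketch below recovers the scale and shift that best map a relative prediction onto a reference in the least-squares sense. This is the standard closed-form fit, commonly used when evaluating affine-invariant predictions; the function name and the flat-list inputs are choices made for this example.

```python
def align_scale_shift(pred, target):
    """Least-squares s, t minimizing sum((s * pred + t - target) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var = sum((p - mean_p) ** 2 for p in pred)
    if var == 0:
        return 0.0, mean_t  # constant prediction: only a shift can be fitted
    cov = sum((p - mean_p) * (g - mean_t) for p, g in zip(pred, target))
    s = cov / var
    return s, mean_t - s * mean_p

# Relative inverse depth and a reference that differs by scale 2 and shift 0.5.
pred = [0.1, 0.2, 0.4]
target = [0.7, 0.9, 1.3]
s, t = align_scale_shift(pred, target)
aligned = [s * p + t for p in pred]
```

Two predictions that differ only by such an `s` and `t` are considered equivalent by the model, so the fit has to happen in inverse-depth space before any conversion to distances.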
V2 achieves state-of-the-art zero-shot monocular depth estimation, beating both V1 and Marigold across the board — while being 10× faster than Marigold.
On five standard test sets (KITTI, NYU-D, Sintel, ETH3D, DIODE), V2-Large achieves:
The authors construct DA-2K, a new benchmark with 2,000 diverse, high-resolution images annotated by humans for pairwise relative depth. This benchmark avoids the sensor noise in existing test sets. On DA-2K:
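Scoring a model on pairwise annotations of this kind reduces to checking, for each annotated pair of points, whether the model ranks the closer point correctly. The sketch below assumes a hypothetical annotation format of `((row1, col1), (row2, col2), closer)` with `closer` equal to 1 or 2; the actual DA-2K file format may differ. It also assumes the model outputs inverse depth, so the closer point has the larger value.

```python
def pairwise_accuracy(inv_depth, pairs):
    """Fraction of annotated point pairs whose depth order the model gets right."""
    correct = 0
    for (r1, c1), (r2, c2), closer in pairs:
        # Larger inverse depth means closer to the camera.
        predicted = 1 if inv_depth[r1][c1] > inv_depth[r2][c2] else 2
        correct += predicted == closer
    return correct / len(pairs)

inv_depth = [[0.9, 0.1],
             [0.5, 0.3]]
pairs = [((0, 0), (0, 1), 1),
         ((1, 0), (1, 1), 2),
         ((0, 1), (1, 1), 2)]
accuracy = pairwise_accuracy(inv_depth, pairs)
```

Because only the ordering matters, this metric needs no scale or shift alignment and is unaffected by sensor noise in a reference depth map.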
On a V100 GPU:
Accuracy on the DA-2K benchmark vs inference latency. V2 achieves higher accuracy at a fraction of the cost. Circle size indicates parameter count.
One of V2’s most dramatic improvements: on the Transparent Surface Challenge, V2 scores 83.6% zero-shot, compared to V1’s 53.5% and MiDaS’s 25.9%. This directly supports the claim that synthetic data, which has perfect depth for transparent objects, solves a problem that real depth sensors fundamentally cannot.
So far, V2 produces relative (affine-invariant) depth: it tells you “pixel A is closer than pixel B” but not “pixel A is 2.3 meters away.” Many applications — robotics, AR, 3D reconstruction — need actual metric depth in meters.
The conversion requires fine-tuning on datasets with known camera intrinsics and metric depth labels. V2 fine-tunes on indoor (NYU-D, Hypersim, etc.) and outdoor (KITTI, DDAD, etc.) metric depth datasets, creating two specialized metric models:
V2’s relative depth model produces much better ordinal structure (which pixel is in front of which). Fine-tuning only needs to learn the scale and shift mapping. Starting from V2’s precise relative depth, the metric fine-tuning converges faster and achieves better results than fine-tuning V1 or training from scratch.
V2’s metric models significantly outperform V1’s metric counterparts and even specialized metric depth models like ZoeDepth and Metric3D on most benchmarks. The key advantage: V2’s superior ordinal structure means fewer “depth inversions” where the model thinks a farther object is closer.
Depth Anything V2 rapidly became a foundational component in 3D vision, adopted by dozens of downstream systems within months of release.
DUSt3R (Wang et al., 2024) / MASt3R (Leroy et al., 2024): These 3D reconstruction models benefit from V2’s precise monocular depth as initialization or regularization. V2’s depth maps provide a strong prior for multi-view stereo, reducing the search space for correspondences.
3D Gaussian Splatting: V2 provides initialization for Gaussian positions. Instead of starting from random point clouds, 3DGS uses V2’s depth maps to place Gaussians at approximately correct depths, dramatically accelerating convergence and improving quality for sparse-view reconstruction.
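Placing points from a depth map comes down to pinhole back-projection. The sketch below assumes the depth has already been converted to z-depth in consistent units (V2's raw output is relative inverse depth, so it would first need alignment or a metric model) and that the intrinsics `fx`, `fy`, `cx`, `cy` are known. The function name is hypothetical and is not part of any 3DGS codebase.

```python
def unproject(depth, fx, fy, cx, cy):
    """Back-project a z-depth map into camera-space (X, Y, Z) points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None or z <= 0:
                continue  # skip invalid pixels
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# A 1x2 depth map, both pixels 2 units away, with unit focal length.
points = unproject([[2.0, 2.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Each resulting point is a candidate position for one Gaussian, which gives the optimizer a starting cloud that already lies near the true surfaces.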
V2 serves as the per-frame depth backbone for video depth estimation systems. Models like DepthCrafter and Video Depth Anything use V2’s per-frame predictions as input, then add temporal consistency through flow-based alignment or video diffusion. V2’s speed (60ms for Small) makes real-time video depth practical.
V2-Small runs at real-time speeds on edge devices, making it useful for:
Depth-conditioned image and video generation (ControlNet, etc.) benefits from V2’s cleaner depth maps. Sharper depth boundaries lead to more precise control over generated 3D structure.
V2 serves as a foundational depth backbone for an expanding set of downstream applications across 3D vision.
Depth Anything V1 (Yang et al., 2024): The direct predecessor. V1 used DINOv2 encoders and trained on 1.5M labeled real images + 62M unlabeled images. V2 keeps the architecture and unlabeled data pipeline but replaces the labeled real images with synthetic data for the teacher. Same skeleton, fundamentally different training recipe.
MiDaS (Ranftl et al., 2020-2022): Pioneered mixing multiple depth datasets with affine-invariant training. V2 uses MiDaS’s DPT decoder and loss functions (L_ssi + L_gm). MiDaS showed the path; V2 showed that the data composition matters more than the data quantity.
DINOv2 (Oquab et al., 2023): The self-supervised ViT encoder that makes the entire pipeline possible. DINOv2’s features bridge the synthetic-to-real domain gap and provide rich semantic representations that transfer to depth estimation.
Marigold (Ke et al., 2023): Showed that synthetic-only training produces fine-grained depth via Stable Diffusion. V2 takes Marigold’s insight (synthetic data is enough for precision) but achieves it with a discriminative model that’s 10× faster.
Video Depth Anything (2024): Extends V2 to temporally consistent video depth. Uses V2’s per-frame predictions and adds a temporal consistency module.
Metric3D V2 (Hu et al., 2024): Uses ideas similar to V2’s synthetic + pseudo-label pipeline for metric depth estimation at scale.
DUSt3R / MASt3R (2024): Multi-view 3D reconstruction that uses V2-quality monocular depth as initialization.
ZoeDepth (Bhat et al., 2023): Combines relative and metric depth in a single model with task-specific heads. V2’s two-stage approach (relative then fine-tune for metric) achieves better results.
UniDepth (Piccinelli et al., 2024): Jointly predicts depth and camera intrinsics. Complementary to V2’s approach.