A monocular depth estimation foundation model that exploits 62M unlabeled images through self-training with strong augmentation, producing robust zero-shot relative depth maps across a wide range of scenes and conditions.
You have a single photograph. No stereo pair, no LiDAR, no depth sensor. Just one flat image. Can you figure out how far away every pixel is? This is monocular depth estimation (MDE) — recovering the 3D structure of a scene from a 2D image.
Humans do this effortlessly. We look at a photo and instantly sense that the car is closer than the tree, which is closer than the mountain. We use cues like relative size, occlusion, texture gradients, and perspective. But teaching a neural network to do this requires something we don't have much of: labeled depth data.
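To get ground-truth depth for an image, you need specialized capture or reconstruction: a LiDAR scanner, an RGB-D depth sensor, a calibrated stereo rig, or a multi-view reconstruction such as structure from motion (SfM).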
The result: the largest labeled depth datasets have at most a few hundred thousand images, and they cover narrow domains. NYUv2 has ~50K indoor frames. KITTI has ~93K driving frames. A model trained on these sees kitchens and highways, but has no idea what a mountain trail or underwater reef looks like.
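Here is the core idea of Depth Anything, stated plainly: train a teacher depth model on the labeled images that exist, use it to predict depth for a much larger pool of unlabeled images, then train a student on both the real labels and the teacher's predictions.
This is a form of self-training (also called pseudo-labeling). The recipe has three ingredients: a teacher trained on labeled data, pseudo-labels generated on unlabeled data, and a student trained with strong augmentation.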
In their pilot studies, the authors found that naive self-training — student learns from pseudo-labels without augmentation — failed to improve over the labeled-only baseline. The teacher and student share the same architecture and pretraining (DINOv2), so they make similar predictions on the same data. The student isn't learning anything new.
Strong augmentation breaks this symmetry. When the student sees a color-jittered, blurred, CutMixed version of the image, it can't rely on surface-level visual cues. It's forced to learn deeper, more robust representations to match the teacher's clean-image prediction. The augmentation creates an information gap that the student fills by acquiring extra visual knowledge.
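The asymmetry between what the teacher sees and what the student sees can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: `color_jitter`, `box_blur`, and `make_training_pair` are hypothetical helpers, the box blur is a simple stand-in for Gaussian blur, and `toy_teacher` is a placeholder callable rather than a real depth network.

```python
import numpy as np

def color_jitter(image, rng, strength=0.4):
    """Randomly perturb contrast and brightness of an image in [0, 1]."""
    brightness = 1.0 + rng.uniform(-strength, strength)
    contrast = 1.0 + rng.uniform(-strength, strength)
    mean = image.mean()
    out = (image - mean) * contrast + mean
    return np.clip(out * brightness, 0.0, 1.0)

def box_blur(image, k=3):
    """Average each pixel with its k x k neighborhood (stand-in for Gaussian blur)."""
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(image)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (k * k)

def make_training_pair(image, teacher, rng):
    """Teacher labels the clean image; the student receives the perturbed one."""
    pseudo_label = teacher(image)
    student_input = box_blur(color_jitter(image, rng))
    return student_input, pseudo_label

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(8, 8, 3))
toy_teacher = lambda img: img.mean(axis=-1)  # placeholder, not a depth model
student_input, pseudo_label = make_training_pair(image, toy_teacher, rng)
print(student_input.shape, pseudo_label.shape)
```

The point of the sketch is the data flow: the pseudo-label always comes from the unperturbed image, so the student has to reproduce it from a harder input.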
Both the teacher and student models use the same architecture: a DINOv2 encoder paired with a DPT (Dense Prediction Transformer) decoder. Understanding why DINOv2 matters is key to understanding Depth Anything's success.
DINOv2 is a Vision Transformer pretrained with self-supervised learning on 142M curated images. It was trained without any labels using a combination of self-distillation (DINO) and masked image modeling (iBOT). The result is an encoder with remarkably rich visual features:
A natural question: DINOv2 already has excellent features. Why not just freeze the encoder and only train the decoder? The paper tried this and found that fine-tuning the encoder produces significantly better depth estimates. Here's why:
DINOv2's features are optimized for semantic similarity — different parts of a car look similar in feature space. But for depth estimation, different parts of the same car can have very different depths (the hood is close, the roof is farther). Fine-tuning lets the encoder develop part-level discriminative features while retaining the semantic backbone.
Following MiDaS, Depth Anything uses the DPT (Dense Prediction Transformer) decoder. DPT takes multi-scale features from the ViT encoder (tokens from layers 5, 12, 18, 24 for ViT-L) and progressively upsamples them through fusion modules to produce a full-resolution depth map.
The decoder is randomly initialized and trained with a 10x larger learning rate than the encoder. This asymmetry makes sense: the encoder already has good features from DINOv2 pretraining and only needs gentle adaptation, while the decoder must learn depth-specific fusion from scratch.
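One way to express this asymmetry is to split the parameters into two groups with different learning rates. The sketch below is illustrative only: the helper `build_param_groups`, the `encoder.` name prefix, and the base learning rate of 5e-6 are assumptions made for the example. The returned list of dicts follows the per-group `params`/`lr` layout that PyTorch optimizers accept, but no framework is imported here.

```python
def build_param_groups(named_params, encoder_lr=5e-6, decoder_lr_mult=10.0):
    """Split (name, param) pairs into an encoder group and a decoder group."""
    encoder, decoder = [], []
    for name, param in named_params:
        # Assumption: encoder parameters are identified by a name prefix.
        (encoder if name.startswith("encoder.") else decoder).append(param)
    return [
        {"params": encoder, "lr": encoder_lr},
        {"params": decoder, "lr": encoder_lr * decoder_lr_mult},
    ]

# Placeholder objects stand in for real parameter tensors.
named = [
    ("encoder.blocks.0.weight", object()),
    ("encoder.blocks.1.weight", object()),
    ("decoder.fusion.weight", object()),
]
groups = build_param_groups(named)
for group in groups:
    print(len(group["params"]), group["lr"])
```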
This is the heart of Depth Anything. The self-training pipeline has two stages:
Collect 1.5M labeled images from 6 datasets (BlendedMVS, DIML, HRWSI, IRS, MegaDepth, TartanAir). Initialize the encoder with DINOv2 pretrained weights. Train for 20 epochs with the affine-invariant loss on these labeled images. The result is a teacher model T that produces decent relative depth maps.
This is where the magic happens. The student S is re-initialized from DINOv2 (not copied from the teacher). Then it trains on the union of labeled images and pseudo-labeled unlabeled images:
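CutMix is typically used in image classification. The authors adapt it for depth: given two unlabeled images $u_a$ and $u_b$, they paste a rectangle from $u_b$ onto $u_a$. The student must predict depth for this composite image. The loss is computed separately in the two regions:

$$u_{ab} = u_a \odot (1 - M) + u_b \odot M$$

$$\mathcal{L}_u^{M} = \rho\big(S(u_{ab}) \odot M,\ T(u_b) \odot M\big), \qquad \mathcal{L}_u^{1-M} = \rho\big(S(u_{ab}) \odot (1 - M),\ T(u_a) \odot (1 - M)\big)$$

Where $M$ is a binary mask with a rectangular region set to 1, $S$ and $T$ are the student and teacher, and $\rho$ is the affine-invariant loss. The depth loss for the pasted region uses $u_b$'s pseudo-label, and the background region uses $u_a$'s pseudo-label. This forces the student to handle depth discontinuities and composite scenes, situations that never appear in the labeled data.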
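The region-wise loss described above can be sketched with NumPy. This is a simplified illustration, not the authors' code: `rect_mask`, `region_loss`, and `cutmix_depth_loss` are hypothetical names, the mask follows the convention in the text (1 inside the pasted rectangle), and weighting the two terms by region area is an assumption of this sketch.

```python
import numpy as np

def rect_mask(h, w, top, left, rh, rw):
    """Boolean mask that is True inside the pasted rectangle."""
    m = np.zeros((h, w), dtype=bool)
    m[top:top + rh, left:left + rw] = True
    return m

def normalize(d, eps=1e-6):
    """Shift by the median, scale by the mean absolute deviation."""
    t = np.median(d)
    s = np.mean(np.abs(d - t))
    return (d - t) / (s + eps)

def region_loss(pred, pseudo, mask):
    """Affine-invariant L1 loss, aligned using only the pixels in this region."""
    return float(np.mean(np.abs(normalize(pred[mask]) - normalize(pseudo[mask]))))

def cutmix_depth_loss(student_pred, pseudo_a, pseudo_b, mask):
    loss_pasted = region_loss(student_pred, pseudo_b, mask)       # rectangle from u_b
    loss_background = region_loss(student_pred, pseudo_a, ~mask)  # rest from u_a
    w = mask.mean()  # assumption: weight each term by its share of the image
    return float(w * loss_pasted + (1.0 - w) * loss_background)

rng = np.random.default_rng(0)
h, w = 8, 10
image_a = rng.uniform(size=(h, w, 3))
image_b = rng.uniform(size=(h, w, 3))
pseudo_a = rng.uniform(1.0, 5.0, size=(h, w))  # stand-ins for teacher outputs
pseudo_b = rng.uniform(1.0, 5.0, size=(h, w))
mask = rect_mask(h, w, top=2, left=3, rh=4, rw=5)

mixed = np.where(mask[..., None], image_b, image_a)  # composite student input
# A prediction that matches each region up to its own scale and shift.
ideal = np.where(mask, 2.0 * pseudo_b + 1.0, 0.5 * pseudo_a + 3.0)
print(cutmix_depth_loss(ideal, pseudo_a, pseudo_b, mask))
```

Because each region is normalized on its own, the two halves of the composite may carry different scales and shifts without being penalized, which is what the example's `ideal` prediction exercises.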
Depth Anything's data engine collects 62M unlabeled images from 8 large-scale public datasets. The selection criteria: diversity of scenes, high image quality, and broad coverage of real-world scenarios.
| Dataset | Images | Domain |
|---|---|---|
| SA-1B | 11.1M | Diverse web images (from Segment Anything) |
| Open Images V7 | 7.8M | Diverse web images with rich annotations |
| ImageNet-21K | 13.1M | Object-centric images, 21K categories |
| LSUN | 9.8M | Scenes (bedrooms, churches, towers, etc.) |
| BDD100K | 8.2M | Driving scenes (diverse weather, time of day) |
| Objects365 | 1.7M | Object detection scenes, 365 categories |
| Places365 | 6.5M | Scene recognition, 365 place categories |
| Google Landmarks | 4.1M | Landmarks and buildings worldwide |

The six labeled datasets used to train the teacher:

| Dataset | Images | Label source |
|---|---|---|
| BlendedMVS | 115K | Stereo |
| DIML | 927K | Stereo |
| HRWSI | 20K | Stereo |
| IRS | 103K | Stereo |
| MegaDepth | 128K | SfM |
| TartanAir | 306K | Stereo (synthetic) |
One practical detail: the sky has effectively infinite depth. To handle this, the authors apply a pretrained semantic segmentation model to detect sky regions and set their disparity to 0 (farthest point). This prevents the depth model from being confused by the sky's ambiguous appearance.
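In code, the sky handling amounts to a masked assignment. The sketch below assumes the sky mask has already been produced by some segmentation model; `zero_sky_disparity` and the toy arrays are hypothetical and only illustrate the operation.

```python
import numpy as np

def zero_sky_disparity(disparity, sky_mask):
    """Return a copy of the disparity map with sky pixels set to 0 (farthest)."""
    out = np.array(disparity, dtype=float)  # copies, so the input is untouched
    out[np.asarray(sky_mask, dtype=bool)] = 0.0
    return out

disparity = np.array([[0.3, 0.2, 0.4],
                      [0.8, 0.9, 0.7]])
# Assumed output of a segmentation model: the top row is sky.
sky_mask = np.array([[True, True, True],
                     [False, False, False]])
cleaned = zero_sky_disparity(disparity, sky_mask)
print(cleaned)
```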
Depth Anything uses three losses, each serving a distinct purpose. Let's walk through them.
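Different datasets have different depth scales and shifts. An indoor RGB-D dataset might measure depth in meters (0.5 to 10 m), while a stereo-reconstructed outdoor dataset might have arbitrary units. The affine-invariant loss ignores these differences:

$$\mathcal{L}_l = \frac{1}{HW} \sum_{i=1}^{HW} \rho(d_i^*, d_i), \qquad \rho(d_i^*, d_i) = \left| \hat{d}_i^* - \hat{d}_i \right|$$

Where $d_i^*$ is the prediction and $d_i$ the ground truth at pixel $i$, and $\hat{d}$ denotes the scale-and-shift normalized version of the depth:

$$\hat{d}_i = \frac{d_i - t(d)}{s(d)}, \qquad t(d) = \operatorname{median}(d), \qquad s(d) = \frac{1}{HW} \sum_{i=1}^{HW} \left| d_i - t(d) \right|$$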
By normalizing both prediction and ground truth to zero median and unit mean absolute deviation, the loss becomes invariant to any positive rescaling and shift of the depth. This lets you train jointly on datasets with incompatible depth scales.
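A small NumPy sketch makes the invariance concrete. It follows the normalization described above (median shift, mean-absolute-deviation scale); the function names and the small `eps` added for numerical safety are assumptions of this example.

```python
import numpy as np

def normalize_depth(d, eps=1e-6):
    """Shift by the median and scale by the mean absolute deviation."""
    t = np.median(d)
    s = np.mean(np.abs(d - t))
    return (d - t) / (s + eps)

def affine_invariant_loss(pred, target):
    """Mean absolute error between normalized prediction and normalized target."""
    return float(np.mean(np.abs(normalize_depth(pred) - normalize_depth(target))))

rng = np.random.default_rng(0)
target = rng.uniform(0.5, 10.0, size=(4, 6))
rescaled = 3.0 * target + 2.0                    # same structure, other units
unrelated = rng.uniform(0.5, 10.0, size=(4, 6))  # different structure

print(affine_invariant_loss(rescaled, target))   # close to zero
print(affine_invariant_loss(unrelated, target))  # clearly larger
```

Note that the invariance holds for positive scale factors; a negative scale would flip the ordering of near and far and is penalized.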
The same affine-invariant loss, but computed on pseudo-labeled unlabeled images. For CutMix, the loss is split into the pasted region and background region, each aligned independently:
This auxiliary loss keeps the student's encoder features aligned with the frozen DINOv2 encoder:

$$\mathcal{L}_{feat} = 1 - \frac{1}{HW} \sum_{i=1}^{HW} \cos(f_i, f'_i)$$

Where $f_i$ is the student's feature at pixel $i$, and $f'_i$ is the frozen DINOv2 feature. The key detail is a tolerance margin α = 0.85: if the cosine similarity already exceeds α, that pixel is excluded from the loss. This keeps the loss from forcing exact feature matching, which would prevent depth-specific adaptation.
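The margin logic can be sketched as a masked cosine loss. This is an illustrative version under assumptions: features are given as an array of shape `(num_pixels, channels)`, and averaging only over the pixels still below the margin is a choice made for this sketch, not something the text specifies.

```python
import numpy as np

def feature_alignment_loss(student_feats, frozen_feats, alpha=0.85, eps=1e-8):
    """1 - cosine similarity, averaged over pixels whose similarity is below alpha."""
    dot = np.sum(student_feats * frozen_feats, axis=-1)
    norms = (np.linalg.norm(student_feats, axis=-1)
             * np.linalg.norm(frozen_feats, axis=-1) + eps)
    cos = dot / norms
    active = cos < alpha  # pixels already above the margin are excluded
    if not active.any():
        return 0.0
    return float(np.mean(1.0 - cos[active]))

rng = np.random.default_rng(0)
frozen = rng.normal(size=(6, 16))              # stand-in for frozen DINOv2 features
drifted = frozen + rng.normal(size=(6, 16))    # student features that have moved

print(feature_alignment_loss(frozen, frozen))  # identical features: nothing is penalized
print(feature_alignment_loss(drifted, frozen))
print(feature_alignment_loss(-frozen, frozen)) # opposite features: maximal penalty
```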
Depth Anything is evaluated in two settings: zero-shot relative depth (predict depth up to an unknown scale and shift on unseen datasets, without fine-tuning) and fine-tuned metric depth (predict absolute distances after fine-tuning on NYUv2/KITTI).
Tested on 6 unseen datasets (KITTI, NYUv2, Sintel, DDAD, ETH3D, DIODE), Depth Anything ViT-L crushes MiDaS v3.1 ViT-L across the board:
When fine-tuned with metric depth labels on NYUv2 (indoor) and KITTI (driving), Depth Anything sets new state-of-the-art results:
The zero-shot pretraining provides such a strong backbone that even a small amount of metric depth fine-tuning produces exceptional results.
The authors conduct careful ablations to isolate each contribution:
| Configuration | KITTI AbsRel ↓ | NYUv2 AbsRel ↓ |
|---|---|---|
| Labeled only (baseline) | 0.090 | 0.056 |
| + Unlabeled (no augmentation) | 0.089 | 0.057 |
| + Color augmentation | 0.083 | 0.050 |
| + CutMix | 0.081 | 0.048 |
| + Feature alignment | 0.076 | 0.043 |
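
For readers unfamiliar with the AbsRel columns above, the metric is the mean of |prediction - ground truth| / ground truth. Because a relative depth model predicts depth only up to scale and shift, the prediction is first aligned to the ground truth. The sketch below uses a least-squares fit for that alignment; the function names are hypothetical, and actual benchmark protocols may differ (for example by aligning in inverse-depth space or masking invalid pixels).

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Least-squares scale and shift that map pred onto gt."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return scale * pred + shift

def abs_rel(pred, gt):
    """Absolute relative error: mean(|pred - gt| / gt)."""
    return float(np.mean(np.abs(pred - gt) / gt))

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(8, 8))
# A prediction in arbitrary units, with a little noise.
pred = (gt - 2.0) / 4.0 + rng.normal(0.0, 0.01, size=gt.shape)

aligned = align_scale_shift(pred, gt)
print(abs_rel(pred, gt))     # large: units do not match
print(abs_rel(aligned, gt))  # small: structure matches after alignment
```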
A strong depth model is a foundational building block for many 3D and scene understanding tasks. Depth Anything demonstrates this across several applications.
ControlNet lets you guide Stable Diffusion with structural conditions like edge maps or depth maps. Better depth maps = better ControlNet results. When Depth Anything's depth maps replace MiDaS's in ControlNet, the generated images show more accurate 3D structure, sharper object boundaries, and better spatial consistency.
Depth Anything's relative depth encoder can be fine-tuned into a metric depth estimator by adding a metric prediction head. Fine-tuned on NYUv2 or KITTI, it significantly outperforms ZoeDepth (the previous SOTA for generalizable metric depth). The strong relative depth backbone transfers directly — the model already understands scene geometry, it just needs to learn the absolute scale.
A surprising side benefit: the feature alignment loss with DINOv2 makes the student encoder also excel at semantic segmentation. Without any explicit segmentation training, Depth Anything's encoder (when paired with a segmentation head) achieves strong results on standard segmentation benchmarks. This suggests the encoder has become a universal visual encoder — good for both mid-level (depth) and high-level (semantic) tasks.
In June 2024, the same team released Depth Anything V2, which addressed several limitations of V1 and pushed performance significantly further.
V1 used 1.5M labeled images from real-world datasets with noisy labels (stereo matching errors, SfM artifacts, sensor noise). V2 replaces this with 595K synthetic images from procedurally generated 3D scenes with pixel-perfect depth labels. The insight: for training the teacher, label quality matters more than label quantity. Clean synthetic labels produce a better teacher, which produces better pseudo-labels, which produce a better student.
V2 trains a single large teacher based on DINOv2 ViT-G and distills it into students at ViT-S, ViT-B, ViT-L, and ViT-G scales. The large teacher (trained on clean synthetic data) provides higher-quality pseudo-labels for all student sizes.
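On conventional zero-shot benchmarks such as KITTI and NYUv2, the V2 paper reports AbsRel roughly on par with V1 rather than a large jump. The authors argue that these benchmarks, with their noisy and coarse labels, do not capture V2's main gains: finer detail and greater robustness on thin structures, transparent surfaces, and complex layouts. To measure those, they introduce a new benchmark, DA-2K. V2 also releases metric depth models, obtained by fine-tuning on data with metric depth labels.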
Depth Anything sits at the intersection of several important ideas in computer vision. Here's how it relates to the broader landscape.
MiDaS (Ranftl et al., 2020, 2022) pioneered the idea of training a single depth model on multiple datasets using the affine-invariant loss. Depth Anything directly builds on MiDaS's loss formulation and training framework, but adds the key ingredient of massive unlabeled data. Where MiDaS sought diversity through more labeled datasets, Depth Anything achieves it through pseudo-labeled unlabeled images.
DINOv2 (Oquab et al., 2024) provides the encoder backbone and the frozen features for the alignment loss. It's the "semantic brain" that Depth Anything inherits and adapts. Without DINOv2's strong pretrained features, the self-training pipeline would be less effective — the teacher would be weaker, the student's initial features would be worse, and the semantic alignment would be impossible.
ZoeDepth (Bhat et al., 2023) demonstrated that a strong relative depth model (MiDaS) can be fine-tuned into a strong metric depth model. Depth Anything follows the same principle: train for relative depth first (better generalization), then fine-tune for metric depth (better absolute accuracy). ZoeDepth uses MiDaS as its backbone; with Depth Anything as the backbone instead, metric depth performance improves dramatically.
Metric3D (Yin et al., 2023) takes a different approach: directly train for metric depth using camera intrinsics to handle scale ambiguity. While Depth Anything focuses on relative depth and then adapts to metric, Metric3D tackles the metric problem head-on. The two approaches are complementary — Depth Anything excels at zero-shot generalization, while Metric3D excels when camera parameters are known.
Depth Anything's recipe — teacher on labeled data, pseudo-labels on unlabeled data, student with strong augmentation — echoes successful patterns throughout deep learning: