Yang, Kang, Huang, Xu, Feng, Zhao — 2024

Depth Anything

A monocular depth estimation foundation model that exploits 62M unlabeled images through self-training with strong augmentation, aiming for robust zero-shot depth maps on any image under any conditions.

Prerequisites: CNNs / ViTs + Self-supervised learning basics

Chapter 0: The Problem

You have a single photograph. No stereo pair, no LiDAR, no depth sensor. Just one flat image. Can you figure out how far away every pixel is? This is monocular depth estimation (MDE) — recovering the 3D structure of a scene from a 2D image.

Humans do this effortlessly. We look at a photo and instantly sense that the car is closer than the tree, which is closer than the mountain. We use cues like relative size, occlusion, texture gradients, and perspective. But teaching a neural network to do this requires something we don't have much of: labeled depth data.

Why is labeled depth data so scarce?

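To get ground-truth depth for an image, you need specialized capture or reconstruction: a depth sensor such as LiDAR, a calibrated stereo pair, or a multi-view structure-from-motion (SfM) reconstruction. None of these can be done after the fact from an ordinary photograph.

The result: most labeled depth datasets contain tens to hundreds of thousands of images, and they cover narrow domains. NYUv2 has ~50K indoor frames. KITTI has ~93K driving frames. A model trained on these sees kitchens and highways, but has no idea what a mountain trail or underwater reef looks like.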

The data bottleneck: MiDaS (the previous state of the art) aggregated 12 labeled datasets totaling ~2M images. That sounds like a lot, but compared to the billions of images used in vision-language pretraining, it's tiny. Limited data coverage means poor generalization: MiDaS degrades badly on scenes far outside its training distribution. We need orders of magnitude more data, but labeling depth is fundamentally expensive.
Why is labeled depth data fundamentally scarce compared to, say, image classification labels?

Chapter 1: The Key Insight

Here is the core idea of Depth Anything, stated plainly:

The insight: Labeled depth data is scarce, but unlabeled monocular images are practically infinite. Train a teacher model on the small labeled set, use it to generate pseudo-depth-labels for 62 million unlabeled images, then train a student model on both — but force the student to learn under strong augmentation so it acquires knowledge the teacher never had.

This is a form of self-training (also called pseudo-labeling). The recipe has three ingredients:

  1. A teacher model T trained on 1.5M labeled images from 6 datasets. This teacher is already pretty good — it produces reasonable depth maps for most scenes.
  2. 62M unlabeled images collected from 8 large-scale public datasets (SA-1B, Open Images, ImageNet-21K, LSUN, etc.). The teacher generates a pseudo depth map for each one.
  3. A student model S trained on the union of labeled + pseudo-labeled images. The critical twist: the student sees strongly augmented versions of the unlabeled images (color jitter, Gaussian blur, CutMix), while the teacher generated labels from clean images.

Why strong augmentation matters

In their pilot studies, the authors found that naive self-training — student learns from pseudo-labels without augmentation — failed to improve over the labeled-only baseline. The teacher and student share the same architecture and pretraining (DINOv2), so they make similar predictions on the same data. The student isn't learning anything new.

Strong augmentation breaks this symmetry. When the student sees a color-jittered, blurred, CutMixed version of the image, it can't rely on surface-level visual cues. It's forced to learn deeper, more robust representations to match the teacher's clean-image prediction. The augmentation creates an information gap that the student fills by acquiring extra visual knowledge.
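
The asymmetry described above can be sketched in a few lines: the pseudo label always comes from the clean image, while only the student's input is perturbed. This is a toy illustration under stated assumptions, not the authors' code. `color_jitter`, `make_student_example`, and `toy_teacher` are hypothetical names, and the "teacher" is a stand-in function rather than a trained network.

```python
import random

def color_jitter(image, strength=0.4, rng=random):
    """Scale brightness by a random factor and clamp values to [0, 1]."""
    factor = 1.0 + rng.uniform(-strength, strength)
    return [[min(1.0, max(0.0, px * factor)) for px in row] for row in image]

def make_student_example(image, teacher, perturb):
    """Return (student_input, target) for one unlabeled image.

    The target is the teacher's prediction on the CLEAN image;
    only the student's input is perturbed.
    """
    target = teacher(image)          # pseudo label from the clean view
    student_input = perturb(image)   # the student gets a harder view
    return student_input, target

def toy_teacher(image):
    """Hypothetical stand-in teacher: 'depth' is inverted brightness."""
    return [[1.0 - px for px in row] for row in image]

image = [[0.2, 0.4], [0.6, 0.8]]
rng = random.Random(0)
student_input, target = make_student_example(
    image, toy_teacher, lambda img: color_jitter(img, rng=rng)
)
```

Because the target never sees the perturbation, the student cannot match it by copying whatever the teacher would have predicted on the perturbed input; it has to recover the clean-image answer from degraded evidence.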

Why did naive self-training (without strong augmentation) fail to improve over the labeled-only baseline?

Chapter 2: The DINOv2 Encoder

Both the teacher and student models use the same architecture: a DINOv2 encoder paired with a DPT (Dense Prediction Transformer) decoder. Understanding why DINOv2 matters is key to understanding Depth Anything's success.

What is DINOv2?

DINOv2 is a Vision Transformer pretrained with self-supervised learning on 142M curated images. It was trained without any labels using a combination of self-distillation (DINO) and masked image modeling (iBOT). The result is an encoder with remarkably rich visual features.

Why not just freeze DINOv2?

A natural question: DINOv2 already has excellent features. Why not just freeze the encoder and only train the decoder? The paper tried this and found that fine-tuning the encoder produces significantly better depth estimates. Here's why:

DINOv2's features are optimized for semantic similarity — different parts of a car look similar in feature space. But for depth estimation, different parts of the same car can have very different depths (the hood is close, the roof is farther). Fine-tuning lets the encoder develop part-level discriminative features while retaining the semantic backbone.

The encoder dilemma: Semantic features group "same object" together. Depth features must distinguish "same object, different distance." Depth Anything solves this by fine-tuning DINOv2 (adapting its features for depth) while also adding a feature alignment loss (Chapter 5) that prevents the encoder from losing its semantic knowledge entirely. It's a balancing act: depth-aware and semantic-aware.

The DPT decoder

Following MiDaS, Depth Anything uses the DPT (Dense Prediction Transformer) decoder. DPT takes multi-scale features from the ViT encoder (tokens from layers 5, 12, 18, 24 for ViT-L) and progressively upsamples them through fusion modules to produce a full-resolution depth map.

The decoder is randomly initialized and trained with a 10x larger learning rate than the encoder. This asymmetry makes sense: the encoder already has good features from DINOv2 pretraining and only needs gentle adaptation, while the decoder must learn depth-specific fusion from scratch.
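
One way to express the two learning rates is as a pair of parameter groups, in the `{"params": ..., "lr": ...}` dictionary layout that PyTorch optimizers accept. A minimal sketch: `build_param_groups` is a hypothetical helper, and the base rate of `5e-6` is an illustrative assumption, since this text only states the 10x ratio.

```python
def build_param_groups(encoder_params, decoder_params,
                       encoder_lr=5e-6, decoder_multiplier=10.0):
    """Build two optimizer groups: a gentle rate for the pretrained encoder
    and a larger rate for the randomly initialized decoder."""
    return [
        {"params": list(encoder_params), "lr": encoder_lr},
        {"params": list(decoder_params), "lr": encoder_lr * decoder_multiplier},
    ]

# Placeholder strings stand in for real parameter tensors.
groups = build_param_groups(["enc.w1", "enc.w2"], ["dec.w1"])
```

With real modules, the two lists would typically come from the encoder's and decoder's parameter iterators, and `groups` would be handed to the optimizer constructor.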

Why does Depth Anything fine-tune DINOv2 rather than freezing it?

Chapter 3: The Self-Training Pipeline

This is the heart of Depth Anything. The self-training pipeline has two stages:

Stage 1: Train the teacher

Collect 1.5M labeled images from 6 datasets (BlendedMVS, DIML, HRWSI, IRS, MegaDepth, TartanAir). Initialize the encoder with DINOv2 pretrained weights. Train for 20 epochs with the affine-invariant loss on these labeled images. The result is a teacher model T that produces decent relative depth maps.

Stage 2: Self-training with the student

This is where the magic happens. The student S is re-initialized from DINOv2 (not copied from the teacher). Then it trains on the union of labeled images and pseudo-labeled unlabeled images:

  1. Teacher labels: Teacher T generates pseudo depth maps for all 62M unlabeled images (one forward pass each, no augmentation).
  2. Strong augmentation: The student sees heavily augmented versions: color jitter + Gaussian blur + CutMix (50% probability).
  3. Joint training: Each batch has 1 part labeled images (real labels) and 2 parts unlabeled images (pseudo labels). The student learns from both.
  4. Feature alignment: An auxiliary loss keeps the student encoder aligned with frozen DINOv2 features (with tolerance margin α = 0.85).
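
The 1:2 batch composition can be sketched as a small generator. This is an illustration under stated assumptions, not the authors' data loader: `mixed_batches` is a hypothetical name, and the choices to cycle through the small labeled set and to visit each unlabeled image once are assumptions made for the example.

```python
import itertools
import random

def mixed_batches(labeled, unlabeled, labeled_per_batch=2, ratio=2, seed=0):
    """Yield batches with `ratio` unlabeled items per labeled item."""
    rng = random.Random(seed)
    labeled_cycle = itertools.cycle(labeled)   # small labeled set is reused
    pool = list(unlabeled)
    rng.shuffle(pool)
    step = labeled_per_batch * ratio           # unlabeled items per batch
    for start in range(0, len(pool) - step + 1, step):
        yield {
            "labeled": [next(labeled_cycle) for _ in range(labeled_per_batch)],
            "unlabeled": pool[start:start + step],
        }

labeled_ids = ["L0", "L1", "L2"]
unlabeled_ids = [f"U{i}" for i in range(12)]
batches = list(mixed_batches(labeled_ids, unlabeled_ids))
```

Each batch here holds 2 labeled and 4 unlabeled items; the labeled items would be scored against real depth, the unlabeled ones against the teacher's pseudo labels.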

CutMix for depth

CutMix is typically used in image classification. The authors adapt it for depth: given two unlabeled images $u_a$ and $u_b$, they combine them with a rectangular binary mask $M$. The student must predict depth for this composite image, and the loss is computed separately in the two regions:

$$u_{ab} = u_a \odot M + u_b \odot (1 - M)$$

Here $M$ is a binary mask with a rectangular region set to 1, so the rectangle keeps pixels from $u_a$ and everything outside it comes from $u_b$. Accordingly, the depth loss inside the rectangle uses $u_a$'s pseudo label, and the loss outside it uses $u_b$'s pseudo label. This forces the student to handle depth discontinuities and composite scenes, situations that never appear in the labeled data.
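
The mixing formula can be applied literally to small nested lists. In this sketch, which follows the formula term by term, pixels where the mask is 1 come from the first image and all other pixels come from the second. `rectangle_mask` and `cutmix` are hypothetical helper names, and the 3x3 images are toy values.

```python
def rectangle_mask(height, width, top, left, rect_h, rect_w):
    """Binary mask that is 1 inside the rectangle and 0 elsewhere."""
    return [
        [1 if top <= r < top + rect_h and left <= c < left + rect_w else 0
         for c in range(width)]
        for r in range(height)
    ]

def cutmix(u_a, u_b, mask):
    """u_ab = u_a * M + u_b * (1 - M), applied per pixel."""
    return [
        [a * m + b * (1 - m) for a, b, m in zip(row_a, row_b, row_m)]
        for row_a, row_b, row_m in zip(u_a, u_b, mask)
    ]

u_a = [[1.0] * 3 for _ in range(3)]   # toy "image a": all ones
u_b = [[0.0] * 3 for _ in range(3)]   # toy "image b": all zeros
mask = rectangle_mask(3, 3, top=0, left=0, rect_h=2, rect_w=2)
u_ab = cutmix(u_a, u_b, mask)
```

The same mask is reused at loss time to decide which teacher prediction supervises which region of the student's output.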

Why re-initialize the student? Instead of initializing the student from the teacher's weights, the authors re-initialize from DINOv2 pretrained weights. This may seem wasteful, but prior work shows that independent initialization leads to better performance in self-training. Starting fresh lets the student find its own optimization path rather than being trapped near the teacher's local minimum.
What is the role of CutMix in Depth Anything's self-training?

Chapter 4: The Data Engine

Depth Anything's data engine collects 62M unlabeled images from 8 large-scale public datasets. The selection criteria: diversity of scenes, high image quality, and broad coverage of real-world scenarios.

The 8 unlabeled sources

| Dataset | Images | Domain |
|---|---|---|
| SA-1B | 11.1M | Diverse web images (from Segment Anything) |
| Open Images V7 | 7.8M | Diverse web images with rich annotations |
| ImageNet-21K | 13.1M | Object-centric images, 21K categories |
| LSUN | 9.8M | Scenes (bedrooms, churches, towers, etc.) |
| BDD100K | 8.2M | Driving scenes (diverse weather, time of day) |
| Objects365 | 1.7M | Object detection scenes, 365 categories |
| Places365 | 6.5M | Scene recognition, 365 place categories |
| Google Landmarks | 4.1M | Landmarks and buildings worldwide |

The 6 labeled sources

| Dataset | Images | Label source |
|---|---|---|
| BlendedMVS | 115K | Stereo |
| DIML | 927K | Stereo |
| HRWSI | 20K | Stereo |
| IRS | 103K | Stereo |
| MegaDepth | 128K | SfM |
| TartanAir | 306K | Stereo (synthetic) |

Scale comparison: MiDaS v3.1 used ~2M labeled images from 12 datasets. Depth Anything uses only 1.5M labeled images (fewer!) but adds 62M unlabeled images, roughly 40x as many as its labeled set. The key insight: it's not about getting more labels, it's about getting more images and being clever about how you learn from them.
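
The totals can be checked directly from the per-dataset counts listed in the two tables above. A small arithmetic sketch; the dictionary names are hypothetical and the figures are the ones quoted in this chapter.

```python
# Unlabeled image counts, in millions.
unlabeled_millions = {
    "SA-1B": 11.1, "Open Images V7": 7.8, "ImageNet-21K": 13.1, "LSUN": 9.8,
    "BDD100K": 8.2, "Objects365": 1.7, "Places365": 6.5, "Google Landmarks": 4.1,
}
# Labeled image counts, in thousands.
labeled_thousands = {
    "BlendedMVS": 115, "DIML": 927, "HRWSI": 20,
    "IRS": 103, "MegaDepth": 128, "TartanAir": 306,
}

total_unlabeled = sum(unlabeled_millions.values())          # millions
total_labeled = sum(labeled_thousands.values()) / 1000.0    # millions
ratio = total_unlabeled / total_labeled
```

The sums come to about 62.3M unlabeled and about 1.6M labeled images, so the unlabeled pool is a little under 40 times the size of the labeled one.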

Sky detection

One practical detail: the sky has effectively infinite depth. To handle this, the authors apply a pretrained semantic segmentation model to detect sky regions and set their disparity to 0 (farthest point). This prevents the depth model from being confused by the sky's ambiguous appearance.
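
The sky rule amounts to a masked overwrite of the disparity map. A minimal sketch, assuming the sky mask has already been produced by some segmentation model; `zero_sky_disparity` is a hypothetical name and the 2x2 values are toy data.

```python
def zero_sky_disparity(disparity, sky_mask):
    """Set disparity to 0 (farthest) wherever the mask marks sky."""
    return [
        [0.0 if is_sky else d for d, is_sky in zip(row_d, row_m)]
        for row_d, row_m in zip(disparity, sky_mask)
    ]

disparity = [[0.3, 0.1], [0.8, 0.5]]
sky_mask = [[True, True], [False, False]]   # top row is sky
cleaned = zero_sky_disparity(disparity, sky_mask)
```

Disparity is inverse depth, so a value of 0 corresponds to a point infinitely far away, which is the natural label for sky.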

How does Depth Anything's data compare to MiDaS v3.1's approach?

Chapter 5: Training Details

Depth Anything uses three losses, each serving a distinct purpose. Let's walk through them.

Loss 1: Affine-invariant loss (labeled images)

Different datasets have different depth scales and shifts. An indoor RGB-D dataset might measure depth in meters (0.5-10m), while a stereo-reconstructed outdoor dataset might have arbitrary units. The affine-invariant loss ignores these differences:

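$$\mathcal{L}_l = \frac{1}{HW} \sum_{i=1}^{HW} \left| \hat{d}^*_i - \hat{d}_i \right|$$

Where $\hat{d}$ denotes the scale-and-shift normalized version of the depth $d$:

$$\hat{d}_i = \frac{d_i - t(d)}{s(d)}$$

$$t(d) = \operatorname{median}(d), \qquad s(d) = \frac{1}{HW} \sum_{i=1}^{HW} \left| d_i - t(d) \right|$$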

By normalizing both prediction and ground truth to zero median and unit mean-absolute-deviation, the loss becomes invariant to any positive scaling and any shift of the depth values. This lets you train jointly on datasets with incompatible depth scales.
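
The invariance is easy to verify numerically. The sketch below implements the normalization and loss as defined above on flattened lists of depth values; `normalize` and `affine_invariant_loss` are hypothetical names, and the code assumes the map is not constant (otherwise the scale term would be zero).

```python
from statistics import median

def normalize(depth):
    """Shift to zero median and scale to unit mean absolute deviation."""
    t = median(depth)
    s = sum(abs(d - t) for d in depth) / len(depth)
    return [(d - t) / s for d in depth]

def affine_invariant_loss(pred, target):
    """Mean absolute difference between the normalized maps."""
    pred_n, target_n = normalize(pred), normalize(target)
    return sum(abs(p - q) for p, q in zip(pred_n, target_n)) / len(pred)

target = [1.0, 2.0, 4.0, 8.0]
rescaled = [3.0 * d + 5.0 for d in target]   # same structure, new scale and shift
reversed_order = [8.0, 4.0, 2.0, 1.0]        # different structure

loss_same = affine_invariant_loss(rescaled, target)
loss_diff = affine_invariant_loss(reversed_order, target)
```

`loss_same` is zero up to floating-point error because the scale and shift are normalized away, while `loss_diff` is clearly positive because the depth ordering itself has changed.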

Loss 2: Unlabeled loss (pseudo-labeled images)

The same affine-invariant loss, but computed on pseudo-labeled unlabeled images. For CutMix, the loss is split into the pasted region and background region, each aligned independently:

$$\mathcal{L}_u = \frac{\sum M}{HW}\, \mathcal{L}_u^{M} + \frac{\sum (1 - M)}{HW}\, \mathcal{L}_u^{1-M}$$
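
A sketch of the region-weighted loss on flattened lists, assuming the convention of the mixing formula in Chapter 3: pixels where the mask is 1 come from the first image and are supervised by its teacher prediction, and the remaining pixels are supervised by the second image's teacher prediction. The function names are hypothetical, and each region is assumed to contain non-constant depth.

```python
from statistics import median

def masked_affine_invariant_loss(pred, target, mask):
    """Affine-invariant loss restricted to pixels where mask == 1."""
    def normalize(values):
        t = median(values)
        s = sum(abs(v - t) for v in values) / len(values)
        return [(v - t) / s for v in values]
    p = normalize([x for x, m in zip(pred, mask) if m])
    q = normalize([x for x, m in zip(target, mask) if m])
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def cutmix_unlabeled_loss(student_pred, teacher_a, teacher_b, mask):
    """Area-weighted sum of the two per-region losses."""
    inverse = [1 - m for m in mask]
    area = sum(mask) / len(mask)
    loss_in = masked_affine_invariant_loss(student_pred, teacher_a, mask)
    loss_out = masked_affine_invariant_loss(student_pred, teacher_b, inverse)
    return area * loss_in + (1 - area) * loss_out

mask = [1, 1, 1, 0, 0, 0]
teacher_a = [1.0, 2.0, 4.0, 9.0, 9.0, 9.0]   # only the first three are used
teacher_b = [7.0, 7.0, 7.0, 1.0, 3.0, 9.0]   # only the last three are used
# Student matches each region up to a DIFFERENT scale and shift.
student = [2 * 1.0 + 1, 2 * 2.0 + 1, 2 * 4.0 + 1,
           5 * 1.0 - 2, 5 * 3.0 - 2, 5 * 9.0 - 2]
loss = cutmix_unlabeled_loss(student, teacher_a, teacher_b, mask)
```

The loss here is zero up to rounding even though the two regions use different scales and shifts, which is what aligning each region independently buys: the two source images never had a common depth scale to begin with.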

Loss 3: Feature alignment loss

This auxiliary loss keeps the student's encoder features aligned with the frozen DINOv2 encoder:

$$\mathcal{L}_{feat} = 1 - \frac{1}{HW} \sum_{i=1}^{HW} \cos(f_i, f'_i)$$

Where $f_i$ is the student's feature at pixel $i$, and $f'_i$ is the frozen DINOv2 feature. The key detail is a tolerance margin α = 0.85: if the cosine similarity already exceeds α, that pixel is excluded from the loss. This prevents the loss from forcing exact feature matching (which would prevent depth-specific adaptation).

Why the tolerance margin? DINOv2 produces similar features for different parts of the same object (car front ≈ car rear in feature space). But depth estimation needs these parts to have different features because they're at different depths. The tolerance margin lets the student diverge from DINOv2 where depth-specific discrimination is needed, while still inheriting the broad semantic knowledge.
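
The margin can be sketched as a filter applied before averaging. This is an illustration, not the authors' implementation: the function names are hypothetical, features are plain lists, and averaging over only the pixels that remain (rather than over all HW pixels) is an assumption made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def feature_alignment_loss(student_feats, frozen_feats, alpha=0.85):
    """1 - mean cosine similarity over pixels not yet within the margin."""
    sims = [cosine(f, g) for f, g in zip(student_feats, frozen_feats)]
    kept = [s for s in sims if s <= alpha]   # pixels above alpha are ignored
    if not kept:
        return 0.0                           # every pixel is aligned enough
    return 1.0 - sum(kept) / len(kept)

student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
frozen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # similarities: 1.0, 0.0, ~0.707
loss = feature_alignment_loss(student, frozen)
```

The first pixel already exceeds the margin and contributes nothing, so only the two poorly aligned pixels pull the encoder back toward the frozen features.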

Training recipe
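
The settings quoted in the earlier chapters can be gathered into a single configuration object. A minimal sketch: the class and field names are hypothetical, and anything this walkthrough does not state (optimizer, batch size, input resolution) is left out rather than guessed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingRecipe:
    teacher_epochs: int = 20                       # Stage 1, labeled images only
    student_init: str = "dinov2"                   # re-initialized, not copied from teacher
    labeled_to_unlabeled_ratio: tuple = (1, 2)     # per-batch mix in Stage 2
    decoder_lr_multiplier: float = 10.0            # decoder rate relative to encoder
    unlabeled_augmentations: tuple = ("color_jitter", "gaussian_blur", "cutmix")
    cutmix_probability: float = 0.5
    alignment_margin_alpha: float = 0.85           # feature alignment tolerance

recipe = TrainingRecipe()
```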

What is the purpose of the tolerance margin α in the feature alignment loss?

Chapter 6: Results

Depth Anything is evaluated in two settings: zero-shot relative depth (predicting depth up to an unknown scale and shift on unseen datasets, without fine-tuning) and fine-tuned metric depth (predicting absolute distances after fine-tuning on NYUv2/KITTI).
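
The AbsRel numbers in this chapter's ablation table are mean absolute relative errors: the average of |prediction − truth| / truth over pixels. Because a relative-depth model is only correct up to scale and shift, its output has to be aligned to the ground truth before scoring. The sketch below uses a least-squares alignment, which is a common protocol and an assumption here, since this text does not spell out the evaluation procedure; the function names are hypothetical.

```python
def align_scale_shift(pred, gt):
    """Find s, t minimizing sum((s * pred + t - gt)^2) and apply them."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var_p
    t = mean_g - s * mean_p
    return [s * p + t for p in pred]

def abs_rel(pred, gt):
    """Mean absolute relative error; lower is better."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

gt = [1.0, 2.0, 4.0, 8.0]
relative_pred = [0.5 * g - 0.2 for g in gt]   # right structure, wrong scale and shift
raw_error = abs_rel(relative_pred, gt)
aligned_error = abs_rel(align_scale_shift(relative_pred, gt), gt)
```

Without alignment the error is large even though the prediction has exactly the right structure; after alignment it drops to zero up to rounding. An AbsRel of 0.076 therefore means the aligned prediction is off by 7.6% of the true value on average.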

Zero-shot relative depth

Tested on 6 unseen datasets (KITTI, NYUv2, Sintel, DDAD, ETH3D, DIODE), Depth Anything ViT-L outperforms MiDaS v3.1 ViT-L across the board.

Small model, big performance: Depth Anything ViT-S (24.8M parameters — less than 1/10 the size of MiDaS ViT-L) outperforms MiDaS on several benchmarks including Sintel, DDAD, and ETH3D. This demonstrates that data scale (62M images) can compensate for model scale — a small model trained on massive diverse data beats a large model trained on limited data.

Fine-tuned metric depth

When fine-tuned with metric depth labels on NYUv2 (indoor) and KITTI (driving), Depth Anything sets new state-of-the-art results.

The zero-shot pretraining provides such a strong backbone that even a small amount of metric depth fine-tuning produces exceptional results.

Ablation: What matters most?

The authors conduct careful ablations to isolate each contribution:

| Configuration | KITTI AbsRel ↓ | NYUv2 AbsRel ↓ |
|---|---|---|
| Labeled only (baseline) | 0.090 | 0.056 |
| + Unlabeled (no augmentation) | 0.089 | 0.057 |
| + Color augmentation | 0.083 | 0.050 |
| + CutMix | 0.081 | 0.048 |
| + Feature alignment | 0.076 | 0.043 |

The ablation tells the whole story: Adding unlabeled data without augmentation barely changes the result (KITTI improves by 0.001, while NYUv2 gets worse by 0.001). Color augmentation provides the first big jump. CutMix adds more. Feature alignment provides the final boost. Every component matters, but strong augmentation is the crucial ingredient that makes unlabeled data useful.

What does the ablation study reveal about adding unlabeled data without strong augmentation?

Chapter 7: Downstream Tasks

A strong depth model is a foundational building block for many 3D and scene understanding tasks. Depth Anything demonstrates this across several applications.

Depth-conditioned ControlNet

ControlNet lets you guide Stable Diffusion with structural conditions like edge maps or depth maps. Better depth maps = better ControlNet results. When Depth Anything's depth maps replace MiDaS's in ControlNet, the generated images show more accurate 3D structure, sharper object boundaries, and better spatial consistency.

Metric depth estimation

Depth Anything's relative depth encoder can be fine-tuned into a metric depth estimator by adding a metric prediction head. Fine-tuned on NYUv2 or KITTI, it significantly outperforms ZoeDepth (the previous SOTA for generalizable metric depth). The strong relative depth backbone transfers directly — the model already understands scene geometry, it just needs to learn the absolute scale.

Semantic segmentation

A surprising side benefit: the feature alignment loss with DINOv2 makes the student encoder also excel at semantic segmentation. Although it sees no segmentation labels during depth training, Depth Anything's encoder (when paired with a segmentation head and fine-tuned) achieves strong results on standard segmentation benchmarks. This suggests the encoder has become a universal visual encoder, good for both mid-level (depth) and high-level (semantic) tasks.

Universal encoder potential: The combination of depth supervision (which teaches geometric understanding) and DINOv2 feature alignment (which preserves semantic knowledge) produces an encoder that understands both "what things are" and "where things are." This multi-task capability wasn't the primary goal but emerged naturally from the training recipe.
Why does Depth Anything's encoder also work well for semantic segmentation, despite never being trained on segmentation labels?

Chapter 8: Depth Anything V2

In June 2024, the same team released Depth Anything V2, which addressed several limitations of V1 and pushed performance significantly further.

Key changes in V2

1. Synthetic data replaces labeled real data

V1 used 1.5M labeled images from real-world datasets with noisy labels (stereo matching errors, SfM artifacts, sensor noise). V2 replaces this with 595K synthetic images rendered from 3D scenes, which come with pixel-perfect depth labels. The insight: for training the teacher, label quality matters more than label quantity. Clean synthetic labels produce a better teacher, which produces better pseudo-labels, which produce a better student.

2. Larger teacher, all model sizes for student

V2 trains its teacher on the largest DINOv2 encoder (ViT-G) and trains students at ViT-S, ViT-B, ViT-L, and ViT-G scales. The larger teacher (trained on clean synthetic data) provides higher-quality pseudo-labels for all student sizes.

3. Improved training recipe

V1 → V2 summary: V1 showed that massive unlabeled data + self-training works. V2 showed that the teacher quality is the bottleneck — switching to clean synthetic data for the teacher (even though there's less of it) produces dramatically better results. The lesson: in self-training, teacher quality > teacher data quantity.

V2 results

Depth Anything V2 produces noticeably finer detail and more robust predictions than V1. On conventional zero-shot benchmarks such as KITTI and NYUv2, however, its AbsRel scores stay close to V1's: the V2 authors argue that the noisy labels of these benchmarks do not reward fine detail, and they introduce a new benchmark, DA-2K, on which V2 shows clear gains. V2 also provides metric depth models, obtained by fine-tuning the pretrained encoder on metric depth data.

What is the main lesson from Depth Anything V1 to V2?

Chapter 9: Connections

Depth Anything sits at the intersection of several important ideas in computer vision. Here's how it relates to the broader landscape.

MiDaS

MiDaS (Ranftl et al., 2020, 2022) pioneered the idea of training a single depth model on multiple datasets using the affine-invariant loss. Depth Anything directly builds on MiDaS's loss formulation and training framework, but adds the key ingredient of massive unlabeled data. Where MiDaS sought diversity through more labeled datasets, Depth Anything achieves it through pseudo-labeled unlabeled images.

DINOv2

DINOv2 (Oquab et al., 2024) provides the encoder backbone and the frozen features for the alignment loss. It's the "semantic brain" that Depth Anything inherits and adapts. Without DINOv2's strong pretrained features, the self-training pipeline would be less effective — the teacher would be weaker, the student's initial features would be worse, and the semantic alignment would be impossible.

ZoeDepth

ZoeDepth (Bhat et al., 2023) demonstrated that a strong relative depth model (MiDaS) can be fine-tuned into a strong metric depth model. Depth Anything follows the same principle: train for relative depth first (better generalization), then fine-tune for metric depth (better absolute accuracy). ZoeDepth uses MiDaS as its backbone; with Depth Anything as the backbone instead, metric depth performance improves dramatically.

Metric3D

Metric3D (Yin et al., 2023) takes a different approach: directly train for metric depth using camera intrinsics to handle scale ambiguity. While Depth Anything focuses on relative depth and then adapts to metric, Metric3D tackles the metric problem head-on. The two approaches are complementary — Depth Anything excels at zero-shot generalization, while Metric3D excels when camera parameters are known.

The broader self-training paradigm

Depth Anything's recipe (teacher on labeled data, pseudo-labels on unlabeled data, student with strong augmentation) echoes successful patterns throughout deep learning, most directly Noisy Student and other semi-supervised methods.

The big picture: Depth Anything shows that the "foundation model" paradigm (massive data, strong pretraining, self-training at scale) works for dense prediction tasks like depth estimation, not just classification or generation. The recipe is simple: start with a good encoder (DINOv2), get a decent teacher (labeled data), generate massive pseudo-labels (62M unlabeled), and force the student to work harder (strong augmentation). No novel architecture. No new loss function. Just scale, done right.
What is the fundamental principle that connects Depth Anything to Noisy Student and other semi-supervised methods?