Depth Anything — Veanors

Chapter 0: The Problem

You have a single photograph. No stereo pair, no LiDAR, no depth sensor. Just one flat image. Can you figure out how far away every pixel is? This is monocular depth estimation (MDE) — recovering the 3D structure of a scene from a 2D image.

Humans do this effortlessly. We look at a photo and instantly sense that the car is closer than the tree, which is closer than the mountain. We use cues like relative size, occlusion, texture gradients, and perspective. But teaching a neural network to do this requires something we don't have much of: labeled depth data.

Why is labeled depth data so scarce?

To get ground-truth depth for an image, you need one of:

LiDAR sensors: expensive hardware and sparse point clouds, mostly deployed for outdoor driving scenes
Stereo matching: requires calibrated stereo image pairs and is computationally intensive
Structure from Motion (SfM): needs many overlapping views from a moving camera and breaks down when objects in the scene move
RGB-D cameras: limited range, indoor-only for most devices

The result: the largest labeled depth datasets have at most a few hundred thousand images, and they cover narrow domains. NYUv2 has ~50K indoor frames. KITTI has ~93K driving frames. A model trained on these sees kitchens and highways, but has no idea what a mountain trail or underwater reef looks like.

The data bottleneck: MiDaS (the previous state of the art) aggregated 12 labeled datasets totaling ~2M images. That sounds like a lot — but compared to the billions of images in language-vision pretraining, it's tiny. Limited data coverage means poor generalization: MiDaS fails catastrophically on scenes outside its training distribution. We need orders of magnitude more data, but labeling depth is fundamentally expensive.

Why is labeled depth data fundamentally scarce compared to, say, image classification labels?

Depth labels require specialized hardware (LiDAR, stereo rigs, RGB-D cameras) or expensive computation (SfM, stereo matching); you can't just ask a human annotator to draw bounding boxes
Depth estimation is too easy, so nobody collects data
Neural networks don't need depth labels at all

Chapter 1: The Key Insight

Here is the core idea of Depth Anything, stated plainly:

The insight: Labeled depth data is scarce, but unlabeled monocular images are practically infinite. Train a teacher model on the small labeled set, use it to generate pseudo-depth-labels for 62 million unlabeled images, then train a student model on both — but force the student to learn under strong augmentation so it acquires knowledge the teacher never had.

This is a form of self-training (also called pseudo-labeling). The recipe has three ingredients:

A teacher model T trained on 1.5M labeled images from 6 datasets. This teacher is already pretty good — it produces reasonable depth maps for most scenes.
62M unlabeled images collected from 8 large-scale public datasets (SA-1B, Open Images, ImageNet-21K, LSUN, etc.). The teacher generates a pseudo depth map for each one.
A student model S trained on the union of labeled + pseudo-labeled images. The critical twist: the student sees strongly augmented versions of the unlabeled images (color jitter, Gaussian blur, CutMix), while the teacher generated labels from clean images.

Why strong augmentation matters

In their pilot studies, the authors found that naive self-training — student learns from pseudo-labels without augmentation — failed to improve over the labeled-only baseline. The teacher and student share the same architecture and pretraining (DINOv2), so they make similar predictions on the same data. The student isn't learning anything new.

Strong augmentation breaks this symmetry. When the student sees a color-jittered, blurred, CutMixed version of the image, it can't rely on surface-level visual cues. It's forced to learn deeper, more robust representations to match the teacher's clean-image prediction. The augmentation creates an information gap that the student fills by acquiring extra visual knowledge.
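A minimal sketch of the photometric half of this idea, assuming a float RGB image with values in [0, 1]. The jitter ranges, the blur sigma, and the function names are illustrative choices, not values from the paper. Because these perturbations change appearance but not geometry, the teacher's pseudo label computed on the clean image remains a valid target for the perturbed one.

```python
import numpy as np

def color_jitter(img, rng, brightness=0.4, contrast=0.4):
    """Randomly rescale brightness and contrast of an HxWx3 image in [0, 1]."""
    b = 1.0 + rng.uniform(-brightness, brightness)
    c = 1.0 + rng.uniform(-contrast, contrast)
    out = img * b
    mean = out.mean()
    out = (out - mean) * c + mean
    return np.clip(out, 0.0, 1.0)

def gaussian_blur(img, sigma=1.5):
    """Separable Gaussian blur over the two spatial axes of an HxWxC image."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    blur_1d = lambda v: np.convolve(v, kernel, mode="valid")
    rows = np.apply_along_axis(blur_1d, 0, padded)   # blur along height
    return np.apply_along_axis(blur_1d, 1, rows)     # then along width

def strong_augment(img, rng):
    """Student-side perturbation; the teacher's label comes from the clean img."""
    return gaussian_blur(color_jitter(img, rng))
```

A typical call would be `strong_augment(image, np.random.default_rng(0))`, with the pseudo depth map left untouched.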

Why did naive self-training (without strong augmentation) fail to improve over the labeled-only baseline?

The teacher and student share the same architecture and pretraining, so they make similar predictions; the student doesn't learn anything beyond what labeled data already provides
The pseudo labels were too noisy
The unlabeled images were too small

Chapter 2: The DINOv2 Encoder

Both the teacher and student models use the same architecture: a DINOv2 encoder paired with a DPT (Dense Prediction Transformer) decoder. Understanding why DINOv2 matters is key to understanding Depth Anything's success.

What is DINOv2?

DINOv2 is a Vision Transformer pretrained with self-supervised learning on 142M curated images. It was trained without any labels using a combination of self-distillation (DINO) and masked image modeling (iBOT). The result is an encoder with remarkably rich visual features:

Semantic awareness — features naturally cluster by object category
Spatial sensitivity — patch-level features encode local geometric structure
Domain robustness — performs well across diverse visual domains

Why not just freeze DINOv2?

A natural question: DINOv2 already has excellent features. Why not just freeze the encoder and only train the decoder? The paper tried this and found that fine-tuning the encoder produces significantly better depth estimates. Here's why:

DINOv2's features are optimized for semantic similarity — different parts of a car look similar in feature space. But for depth estimation, different parts of the same car can have very different depths (the hood is close, the roof is farther). Fine-tuning lets the encoder develop part-level discriminative features while retaining the semantic backbone.

The encoder dilemma: Semantic features group "same object" together. Depth features must distinguish "same object, different distance." Depth Anything solves this by fine-tuning DINOv2 (adapting its features for depth) while also adding a feature alignment loss (Chapter 5) that prevents the encoder from losing its semantic knowledge entirely. It's a balancing act: the encoder has to be both depth-aware and semantic-aware.

The DPT decoder

Following MiDaS, Depth Anything uses the DPT (Dense Prediction Transformer) decoder. DPT takes multi-scale features from the ViT encoder (tokens from layers 5, 12, 18, 24 for ViT-L) and progressively upsamples them through fusion modules to produce a full-resolution depth map.

The decoder is randomly initialized and trained with a 10x larger learning rate than the encoder. This asymmetry makes sense: the encoder already has good features from DINOv2 pretraining and only needs gentle adaptation, while the decoder must learn depth-specific fusion from scratch.
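As a rough sketch of the decoder's first step, the snippet below reshapes patch tokens tapped from several encoder layers into 2D feature maps, which is the form the fusion modules consume. The layer indices come from the text. The patch size of 14, the 518x518 input, the helper names, and the random stand-in tokens are assumptions for illustration, and the fusion and upsampling modules themselves are omitted.

```python
import numpy as np

PATCH = 14                      # assumed ViT patch size (DINOv2 uses 14)
TAP_LAYERS = (5, 12, 18, 24)    # layers named in the text for ViT-L

def tokens_to_map(tokens, height, width, patch=PATCH):
    """Reshape (N, C) patch tokens into a (C, H/patch, W/patch) feature map."""
    gh, gw = height // patch, width // patch
    n, c = tokens.shape
    if n != gh * gw:
        raise ValueError(f"expected {gh * gw} tokens, got {n}")
    return tokens.reshape(gh, gw, c).transpose(2, 0, 1)

def collect_feature_maps(layer_tokens, height, width):
    """Pick the tapped layers and turn each token set into a 2D map."""
    return [tokens_to_map(layer_tokens[i], height, width) for i in TAP_LAYERS]

# Stand-in for encoder outputs: random tokens for a 518x518 image.
# ViT-L tokens have 1024 channels; 64 keeps this example light.
rng = np.random.default_rng(0)
fake_tokens = {i: rng.standard_normal((37 * 37, 64)) for i in TAP_LAYERS}
maps = collect_feature_maps(fake_tokens, 518, 518)
```

Each entry of `maps` has the same spatial grid here; in DPT the fusion stage then resamples them to different scales before merging.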

Why does Depth Anything fine-tune DINOv2 rather than freezing it?

DINOv2 features optimize for semantic similarity (same object = similar features), but depth needs part-level discrimination (same object, different depths); fine-tuning develops this while retaining semantic knowledge
DINOv2 features are random and need full retraining
Freezing the encoder saves too much compute

Chapter 3: The Self-Training Pipeline

This is the heart of Depth Anything. The self-training pipeline has two stages:

Stage 1: Train the teacher

Collect 1.5M labeled images from 6 datasets (BlendedMVS, DIML, HRWSI, IRS, MegaDepth, TartanAir). Initialize the encoder with DINOv2 pretrained weights. Train for 20 epochs with the affine-invariant loss on these labeled images. The result is a teacher model T that produces decent relative depth maps.

Stage 2: Self-training with the student

This is where the magic happens. The student S is re-initialized from DINOv2 (not copied from the teacher). Then it trains on the union of labeled images and pseudo-labeled unlabeled images:

Teacher labels

Teacher T generates pseudo depth maps for all 62M unlabeled images (one forward pass each, no augmentation)

↓

Strong augmentation

Student sees heavily augmented versions: color jitter + Gaussian blur + CutMix (50% probability)

↓

Joint training

Each batch: 1 part labeled images (real labels) + 2 parts unlabeled images (pseudo labels). Student learns from both.

↓

Feature alignment

Auxiliary loss keeps student encoder aligned with frozen DINOv2 features (with tolerance margin α = 0.85)
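The joint-training step above can be sketched as a batch sampler that makes one pass over the unlabeled pool while cycling through the much smaller labeled set. This is only an illustration of the 1:2 ratio: the function name, the batch sizes, and the choice to drop a final incomplete batch are assumptions.

```python
import itertools
import random

def mixed_batches(labeled, unlabeled, n_labeled=2, ratio=2, seed=0):
    """Yield (labeled_batch, unlabeled_batch) pairs with a 1:ratio size split.

    Makes one pass over `unlabeled`; `labeled` is cycled as often as needed.
    A trailing incomplete unlabeled batch is dropped.
    """
    rng = random.Random(seed)
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    rng.shuffle(unlabeled)
    n_unlabeled = n_labeled * ratio
    labeled_cycle = itertools.cycle(labeled)
    for start in range(0, len(unlabeled) - n_unlabeled + 1, n_unlabeled):
        u_batch = unlabeled[start:start + n_unlabeled]
        l_batch = [next(labeled_cycle) for _ in range(n_labeled)]
        yield l_batch, u_batch
```

In a real loop, the labeled half of each batch would be scored against ground truth and the unlabeled half against the teacher's pseudo labels.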

CutMix for depth

CutMix is typically used in image classification. The authors adapt it for depth: given two unlabeled images u_a and u_b, they combine them with a random rectangular mask, so that the composite shows u_a inside the rectangle and u_b everywhere else. The student must predict depth for this composite image. The loss is computed separately in the two regions:

u_ab = u_a ⊙ M + u_b ⊙ (1 − M)

Where M is a binary mask with a rectangular region set to 1. The depth loss inside the rectangle (M = 1) uses u_a's pseudo label, and the loss in the remaining region (M = 0) uses u_b's pseudo label. This forces the student to handle abrupt depth discontinuities and composite scenes, situations that never appear in the labeled data.
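A small sketch of the compositing formula u_ab = u_a ⊙ M + u_b ⊙ (1 − M), following the formula literally: pixels where M = 1 come from u_a and all others from u_b. The rectangle size range and the function name are assumptions, and the 50% application probability is left to the caller.

```python
import numpy as np

def cutmix(u_a, u_b, rng, min_frac=0.25, max_frac=0.75):
    """Return (u_ab, M) for two HxWxC images of identical shape.

    M is an HxW binary mask that is 1 inside a random rectangle.
    """
    h, w = u_a.shape[:2]
    rh = int(h * rng.uniform(min_frac, max_frac))
    rw = int(w * rng.uniform(min_frac, max_frac))
    top = rng.integers(0, h - rh + 1)
    left = rng.integers(0, w - rw + 1)
    mask = np.zeros((h, w), dtype=u_a.dtype)
    mask[top:top + rh, left:left + rw] = 1
    m = mask[..., None]                 # broadcast over the channel axis
    u_ab = u_a * m + u_b * (1 - m)
    return u_ab, mask
```

The returned mask is kept alongside the composite because the loss later needs it to decide which pseudo label applies to which pixel.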

Why re-initialize the student? Instead of initializing the student from the teacher's weights, the authors re-initialize from DINOv2 pretrained weights. This may seem wasteful, but prior work shows that independent initialization leads to better performance in self-training. Starting fresh lets the student find its own optimization path rather than being trapped near the teacher's local minimum.

What is the role of CutMix in Depth Anything's self-training?

It pastes a rectangle from one unlabeled image onto another, forcing the student to handle depth discontinuities and composite scenes under strong spatial perturbation
It cuts out parts of the image to reduce computation
It mixes the depth labels of two images by averaging

Chapter 4: The Data Engine

Depth Anything's data engine collects 62M unlabeled images from 8 large-scale public datasets. The selection criteria: diversity of scenes, high image quality, and broad coverage of real-world scenarios.

The 8 unlabeled sources

Dataset	Images	Domain
SA-1B	11.1M	Diverse web images (from Segment Anything)
Open Images V7	7.8M	Diverse web images with rich annotations
ImageNet-21K	13.1M	Object-centric images, 21K categories
LSUN	9.8M	Scenes (bedrooms, churches, towers, etc.)
BDD100K	8.2M	Driving scenes (diverse weather, time of day)
Objects365	1.7M	Object detection scenes, 365 categories
Places365	6.5M	Scene recognition, 365 place categories
Google Landmarks	4.1M	Landmarks and buildings worldwide

The 6 labeled sources

Dataset	Images	Label source
BlendedMVS	115K	Stereo
DIML	927K	Stereo
HRWSI	20K	Stereo
IRS	103K	Stereo
MegaDepth	128K	SfM
TartanAir	306K	Stereo (synthetic)

Scale comparison: MiDaS v3.1 used ~2M labeled images from 12 datasets. Depth Anything uses only 1.5M labeled images (fewer!) but adds 62M unlabeled images, roughly 40x the size of its own labeled set and about 30x the total data MiDaS v3.1 trained on. The key insight: it's not about getting more labels, it's about getting more images and being clever about how you learn from them.
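The totals quoted here can be checked by summing the per-dataset counts from the two tables above. The snippet is plain arithmetic on those table values; the variable names are arbitrary.

```python
# Image counts copied from the tables above.
UNLABELED_MILLIONS = {
    "SA-1B": 11.1, "Open Images V7": 7.8, "ImageNet-21K": 13.1, "LSUN": 9.8,
    "BDD100K": 8.2, "Objects365": 1.7, "Places365": 6.5, "Google Landmarks": 4.1,
}
LABELED_THOUSANDS = {
    "BlendedMVS": 115, "DIML": 927, "HRWSI": 20,
    "IRS": 103, "MegaDepth": 128, "TartanAir": 306,
}

total_unlabeled_m = sum(UNLABELED_MILLIONS.values())
total_labeled_k = sum(LABELED_THOUSANDS.values())
unlabeled_per_labeled = total_unlabeled_m * 1000 / total_labeled_k

print(f"unlabeled: {total_unlabeled_m:.1f}M, labeled: {total_labeled_k}K")
print(f"unlabeled images per labeled image: {unlabeled_per_labeled:.1f}")
```

The labeled rows add up to slightly more than the rounded 1.5M quoted in the text, and the unlabeled rows to slightly more than 62M, so the ratio lands just under 40.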

Sky detection

One practical detail: the sky has effectively infinite depth. To handle this, the authors apply a pretrained semantic segmentation model to detect sky regions and set their disparity to 0 (farthest point). This prevents the depth model from being confused by the sky's ambiguous appearance.
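In code, this step amounts to overwriting disparity under a sky mask. The sketch assumes the mask has already been produced by some segmentation model; the function name and array layout are hypothetical.

```python
import numpy as np

def apply_sky_mask(disparity, sky_mask):
    """Return a copy of `disparity` with sky pixels forced to 0 (farthest)."""
    out = np.array(disparity, dtype=float, copy=True)
    out[np.asarray(sky_mask, dtype=bool)] = 0.0
    return out

# Toy 2x3 example: the top row is sky.
disp = np.array([[0.2, 0.1, 0.3],
                 [0.8, 0.9, 0.7]])
sky = np.array([[1, 1, 1],
                [0, 0, 0]])
cleaned = apply_sky_mask(disp, sky)
```

Working in disparity (inverse depth) is what makes this convenient: infinite distance maps to the finite value 0.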

How does Depth Anything's data compare to MiDaS v3.1's approach?

Depth Anything uses fewer labeled images (1.5M vs ~2M) but adds 62M unlabeled images with pseudo-labels, massively expanding data coverage without requiring more expensive label acquisition
Depth Anything uses 10x more labeled data
Both use the same amount of data

Chapter 5: Training Details

Depth Anything uses three losses, each serving a distinct purpose. Let's walk through them.

Loss 1: Affine-invariant loss (labeled images)

Different datasets have different depth scales and shifts. An indoor RGB-D dataset might measure depth in meters (0.5-10m), while a stereo-reconstructed outdoor dataset might have arbitrary units. The affine-invariant loss ignores these differences:

L_l = (1/HW) ∑_i |d̂^*_i − d̂_i|

Where d̂ denotes the scale-and-shift normalized version of the depth:

d̂_i = (d_i − t(d)) / s(d)

t(d) = median(d), s(d) = (1/HW) ∑ |d_i − t(d)|

By normalizing both prediction and ground truth to zero median and unit mean-absolute-deviation, the loss becomes invariant to any affine transformation of the depth. This lets you train jointly on datasets with incompatible depth scales.
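The normalization and loss above translate almost directly into NumPy. This is a sketch of the formulas as written here, not the authors' code; the `eps` guard against constant maps and the function names are additions.

```python
import numpy as np

def normalize(d, eps=1e-8):
    """Shift by the median t(d) and scale by the mean absolute deviation s(d)."""
    t = np.median(d)
    s = max(np.mean(np.abs(d - t)), eps)   # eps guards constant maps
    return (d - t) / s

def affine_invariant_loss(pred, target):
    """Mean absolute difference between the two normalized maps."""
    return float(np.mean(np.abs(normalize(target) - normalize(pred))))

rng = np.random.default_rng(0)
depth = rng.random((8, 8)) + 0.1
rescaled = 3.0 * depth + 7.0           # same scene, different scale and shift
loss_same_scene = affine_invariant_loss(depth, rescaled)
```

`loss_same_scene` is zero up to floating-point error, which is the invariance the paragraph describes. Note that it holds for positive scale factors only; flipping the sign of the depth would change the normalized map.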

Loss 2: Unlabeled loss (pseudo-labeled images)

The same affine-invariant loss, but computed on pseudo-labeled unlabeled images. For CutMix, the loss is split into the pasted region and background region, each aligned independently:

L_u = (∑M / HW) L^M_u + (∑(1−M) / HW) L^{1−M}_u
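A sketch of this area-weighted combination, assuming the convention of the compositing formula in Chapter 3: M = 1 marks pixels taken from u_a, so that region is compared with the teacher's prediction for u_a, and the rest with its prediction for u_b. Each region is normalized on its own before comparison. Helper names are hypothetical.

```python
import numpy as np

def _normalize(d, eps=1e-8):
    t = np.median(d)
    s = max(np.mean(np.abs(d - t)), eps)
    return (d - t) / s

def _rho(pred, target):
    """Affine-invariant mean absolute error over one region."""
    return float(np.mean(np.abs(_normalize(target) - _normalize(pred))))

def cutmix_unlabeled_loss(student_pred, teacher_a, teacher_b, mask):
    """All inputs are HxW maps; mask is 1 where the composite shows u_a."""
    m = mask.astype(bool)
    loss_m = _rho(student_pred[m], teacher_a[m])         # L^M_u
    loss_rest = _rho(student_pred[~m], teacher_b[~m])    # L^{1-M}_u
    w = m.mean()                                         # sum(M) / HW
    return w * loss_m + (1.0 - w) * loss_rest
```

Because the two regions are aligned independently, a prediction that matches each pseudo label up to its own scale and shift incurs no loss, even though the two halves of the composite have unrelated depth ranges.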

Loss 3: Feature alignment loss

This auxiliary loss keeps the student's encoder features aligned with the frozen DINOv2 encoder:

L_feat = 1 − (1/HW) ∑_i cos(f_i, f′_i)

Where f_i is the student's feature at pixel i, and f′_i is the frozen DINOv2 feature. The key detail: a tolerance margin α = 0.85 — if the cosine similarity already exceeds α, that pixel is excluded from the loss. This prevents the loss from forcing exact feature matching (which would prevent depth-specific adaptation).

Why the tolerance margin? DINOv2 produces similar features for different parts of the same object (car front ≈ car rear in feature space). But depth estimation needs these parts to have different features because they're at different depths. The tolerance margin lets the student diverge from DINOv2 where depth-specific discrimination is needed, while still inheriting the broad semantic knowledge.
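A minimal sketch of the margin mechanism, with features flattened to one row per pixel. The text does not say how the average is taken once pixels are excluded; averaging over the remaining pixels, and returning 0 when none remain, are assumptions made here.

```python
import numpy as np

def feature_alignment_loss(f_student, f_frozen, alpha=0.85):
    """f_student, f_frozen: (N, C) arrays, one feature vector per pixel."""
    num = np.sum(f_student * f_frozen, axis=1)
    den = np.linalg.norm(f_student, axis=1) * np.linalg.norm(f_frozen, axis=1)
    cos = num / np.maximum(den, 1e-12)
    active = cos <= alpha          # pixels already above the margin are skipped
    if not active.any():
        return 0.0
    return float(1.0 - cos[active].mean())
```

With `alpha=0.85`, a pixel whose student feature is nearly parallel to the frozen one contributes nothing, so the encoder is free to move it slightly for the sake of depth; a pixel that has drifted far is still pulled back.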

Training recipe

Encoder LR: 5e-6 (gentle — pretrained features are already good)
Decoder LR: 5e-5 (10x higher — randomly initialized, needs to learn fast)
Optimizer: AdamW with linear LR decay
Stage 1: Teacher trained for 20 epochs on labeled images
Stage 2: Student sweeps through all unlabeled images once (1 epoch), with labeled:unlabeled batch ratio of 1:2
Augmentation (labeled): Horizontal flip only
Augmentation (unlabeled): Color jitter + Gaussian blur + CutMix (50%)
Model sizes: ViT-S (24.8M), ViT-B (97.5M), ViT-L (335.3M)
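The two learning rates and the linear decay from the recipe above can be sketched in a few lines. The recipe only says "linear LR decay"; annealing all the way to zero and the function names are assumptions.

```python
ENCODER_LR = 5e-6   # from the recipe above
DECODER_LR = 5e-5   # 10x the encoder rate

def linear_decay(base_lr, step, total_steps):
    """Linearly anneal base_lr toward zero over total_steps."""
    if not 0 <= step <= total_steps:
        raise ValueError("step out of range")
    return base_lr * (1.0 - step / total_steps)

def learning_rates(step, total_steps):
    """Per-module learning rates at a given training step."""
    return {
        "encoder": linear_decay(ENCODER_LR, step, total_steps),
        "decoder": linear_decay(DECODER_LR, step, total_steps),
    }
```

In PyTorch this split is usually expressed as two parameter groups passed to `torch.optim.AdamW`, each with its own `lr`, so the 10x ratio is preserved throughout training.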

What is the purpose of the tolerance margin α in the feature alignment loss?

It lets the student diverge from DINOv2 features where depth-specific discrimination is needed (e.g., different parts of the same object at different depths) while still inheriting broad semantic knowledge
It speeds up training by ignoring hard examples
It prevents the encoder from training at all

Chapter 6: Results

Depth Anything is evaluated in two settings: zero-shot relative depth (predict ordinal depth without fine-tuning) and fine-tuned metric depth (predict absolute distances after fine-tuning on NYUv2/KITTI).

Zero-shot relative depth

Tested on 6 unseen datasets (KITTI, NYUv2, Sintel, DDAD, ETH3D, DIODE), Depth Anything ViT-L outperforms MiDaS v3.1 ViT-L across the board.

Small model, big performance: Depth Anything ViT-S (24.8M parameters — less than 1/10 the size of MiDaS ViT-L) outperforms MiDaS on several benchmarks including Sintel, DDAD, and ETH3D. This demonstrates that data scale (62M images) can compensate for model scale — a small model trained on massive diverse data beats a large model trained on limited data.

Fine-tuned metric depth

When fine-tuned with metric depth labels on NYUv2 (indoor) and KITTI (driving), Depth Anything sets new state-of-the-art results:

NYUv2: δ₁ = 0.984, AbsRel = 0.056 (previous SOTA: δ₁ = 0.964 by VPD)
KITTI: δ₁ = 0.982, AbsRel = 0.046 (previous SOTA: δ₁ = 0.978 by IEBins)

The zero-shot pretraining provides such a strong backbone that even a small amount of metric depth fine-tuning produces exceptional results.
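For readers unfamiliar with the two metrics quoted above, here is a sketch using their conventional definitions in the depth estimation literature: AbsRel is the mean of |prediction − ground truth| / ground truth (lower is better), and δ₁ is the fraction of pixels whose prediction is within a factor of 1.25 of the ground truth (higher is better). The toy values are made up for illustration.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error; lower is better."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))

gt = np.array([1.0, 2.0, 4.0, 10.0])       # toy ground-truth depths in meters
pred = np.array([1.1, 2.0, 4.0, 20.0])     # one small error, one large error
scores = (abs_rel(pred, gt), delta1(pred, gt))
```

In this toy case three of four pixels pass the 1.25 test, while the single large error dominates AbsRel, which is why the two numbers are usually reported together.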

Ablation: What matters most?

The authors conduct careful ablations to isolate each contribution:

Configuration	KITTI AbsRel ↓	NYUv2 AbsRel ↓
Labeled only (baseline)	0.090	0.056
+ Unlabeled (no augmentation)	0.089	0.057
+ Color augmentation	0.083	0.050
+ CutMix	0.081	0.048
+ Feature alignment	0.076	0.043

The ablation tells the whole story: Adding unlabeled data without augmentation barely moves the numbers (KITTI AbsRel improves by 0.001, while NYUv2 gets worse by 0.001). Color augmentation provides the first big jump. CutMix adds more. Feature alignment provides the final boost. Every component matters, but strong augmentation is the crucial ingredient that makes unlabeled data useful.
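The step-by-step changes can be read off the table with a little arithmetic. The snippet uses only the table values above; negative changes are improvements because AbsRel is lower-is-better.

```python
# (configuration, KITTI AbsRel, NYUv2 AbsRel) copied from the ablation table.
ABLATION = [
    ("Labeled only (baseline)",       0.090, 0.056),
    ("+ Unlabeled (no augmentation)", 0.089, 0.057),
    ("+ Color augmentation",          0.083, 0.050),
    ("+ CutMix",                      0.081, 0.048),
    ("+ Feature alignment",           0.076, 0.043),
]

def step_changes(rows):
    """AbsRel change of each row relative to the row above it."""
    changes = []
    for (_, k0, n0), (name, k1, n1) in zip(rows, rows[1:]):
        changes.append((name, round(k1 - k0, 3), round(n1 - n0, 3)))
    return changes

for name, dk, dn in step_changes(ABLATION):
    print(f"{name:32s} KITTI {dk:+.3f}  NYUv2 {dn:+.3f}")
```

Summing each column of changes gives the total gain of the full recipe over the labeled-only baseline.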

What does the ablation study reveal about adding unlabeled data without strong augmentation?

It barely improves over the labeled-only baseline, confirming that naive self-training provides negligible benefit when labeled data is already sufficient
It provides the biggest improvement
It makes performance worse

Chapter 7: Downstream Tasks

A strong depth model is a foundational building block for many 3D and scene understanding tasks. Depth Anything demonstrates this across several applications.

Depth-conditioned ControlNet

ControlNet lets you guide Stable Diffusion with structural conditions like edge maps or depth maps. Better depth maps = better ControlNet results. When Depth Anything's depth maps replace MiDaS's in ControlNet, the generated images show more accurate 3D structure, sharper object boundaries, and better spatial consistency.

Metric depth estimation

Depth Anything's relative depth encoder can be fine-tuned into a metric depth estimator by adding a metric prediction head. Fine-tuned on NYUv2 or KITTI, it significantly outperforms ZoeDepth (the previous SOTA for generalizable metric depth). The strong relative depth backbone transfers directly — the model already understands scene geometry, it just needs to learn the absolute scale.

Semantic segmentation

A surprising side benefit: the feature alignment loss with DINOv2 makes the student encoder also excel at semantic segmentation. Without any explicit segmentation training, Depth Anything's encoder (when paired with a segmentation head) achieves strong results on standard segmentation benchmarks. This suggests the encoder has become a universal visual encoder — good for both mid-level (depth) and high-level (semantic) tasks.

Universal encoder potential: The combination of depth supervision (which teaches geometric understanding) and DINOv2 feature alignment (which preserves semantic knowledge) produces an encoder that understands both "what things are" and "where things are." This multi-task capability wasn't the primary goal but emerged naturally from the training recipe.

Why does Depth Anything's encoder also work well for semantic segmentation, despite never being trained on segmentation labels?

The feature alignment loss preserves DINOv2's rich semantic features while depth training adds geometric understanding; together they produce a universal encoder for both tasks
The segmentation labels were implicitly in the depth data
All depth models can do segmentation

Chapter 8: Depth Anything V2

In June 2024, the same team released Depth Anything V2, which addressed several limitations of V1 and pushed performance significantly further.

Key changes in V2

1. Synthetic data replaces labeled real data

V1 used 1.5M labeled images, most of them from real-world datasets with noisy labels (stereo matching errors, SfM artifacts, sensor noise). V2 replaces this with 595K synthetic images rendered from 3D scenes, whose depth labels are pixel-perfect. The insight: for training the teacher, label quality matters more than label quantity. Clean synthetic labels produce a better teacher, which produces better pseudo-labels, which produce a better student.

2. Larger teacher, all model sizes for student

V2 trains its teacher at the largest scale, DINOv2 ViT-G, and trains students at ViT-S, ViT-B, ViT-L, and ViT-G scales. The larger teacher (trained on clean synthetic data) provides higher-quality pseudo-labels for all student sizes.

3. Improved training recipe

Feature alignment loss retained: as in V1, the DINOv2 alignment loss is still applied on pseudo-labeled images to preserve the encoder's semantic knowledge
Gradient matching loss — additional loss on depth gradients for sharper boundaries
Finer pseudo-labeling — teacher generates labels at higher resolution
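The text does not give a formula for the gradient matching loss. The sketch below follows the multi-scale form introduced with MiDaS, which penalizes the spatial gradients of the residual between prediction and target. It assumes both maps have already been scale-and-shift normalized; the number of scales and the subsampling scheme are illustrative assumptions.

```python
import numpy as np

def gradient_matching_loss(pred, target, scales=4):
    """Sum over scales of the mean absolute x/y gradients of (pred - target)."""
    residual = pred - target
    loss = 0.0
    for k in range(scales):
        step = 2 ** k
        r = residual[::step, ::step]            # coarser grid at each scale
        grad_x = np.abs(r[:, 1:] - r[:, :-1])
        grad_y = np.abs(r[1:, :] - r[:-1, :])
        loss += (grad_x.sum() + grad_y.sum()) / r.size
    return float(loss)
```

A constant offset between prediction and target costs nothing under this loss, while a misplaced or blurred edge does, which is why it encourages sharper boundaries.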

V1 → V2 summary: V1 showed that massive unlabeled data + self-training works. V2 showed that the teacher quality is the bottleneck — switching to clean synthetic data for the teacher (even though there's less of it) produces dramatically better results. The lesson: in self-training, teacher quality > teacher data quantity.

V2 results

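Depth Anything V2 produces visibly finer detail and more robust predictions than V1, especially on thin structures and on transparent or reflective surfaces. On conventional zero-shot benchmarks such as KITTI and NYUv2, however, the V2 paper reports numbers roughly on par with V1 (V1 ViT-L scores 0.076 and 0.043 AbsRel) and argues that the noisy labels in those benchmarks cannot reward sharper predictions. To measure the difference, the authors introduce DA-2K, a benchmark of sparse relative-depth annotations on high-resolution images, where V2 clearly leads. V2 also ships metric depth models, obtained by fine-tuning the V2 encoder on synthetic metric datasets (Hypersim for indoor scenes and Virtual KITTI 2 for outdoor scenes).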

What is the main lesson from Depth Anything V1 to V2?

Teacher quality is the bottleneck in self-training: V2 uses fewer but pixel-perfect synthetic labels for the teacher, producing dramatically better pseudo-labels and final student performance
More unlabeled data is always better
Bigger models are the only thing that matters

Chapter 9: Connections

Depth Anything sits at the intersection of several important ideas in computer vision. Here's how it relates to the broader landscape.

MiDaS

MiDaS (Ranftl et al., 2020, 2022) pioneered the idea of training a single depth model on multiple datasets using the affine-invariant loss. Depth Anything directly builds on MiDaS's loss formulation and training framework, but adds the key ingredient of massive unlabeled data. Where MiDaS sought diversity through more labeled datasets, Depth Anything achieves it through pseudo-labeled unlabeled images.

DINOv2

DINOv2 (Oquab et al., 2024) provides the encoder backbone and the frozen features for the alignment loss. It's the "semantic brain" that Depth Anything inherits and adapts. Without DINOv2's strong pretrained features, the self-training pipeline would be less effective — the teacher would be weaker, the student's initial features would be worse, and the semantic alignment would be impossible.

ZoeDepth

ZoeDepth (Bhat et al., 2023) demonstrated that a strong relative depth model (MiDaS) can be fine-tuned into a strong metric depth model. Depth Anything follows the same principle: train for relative depth first (better generalization), then fine-tune for metric depth (better absolute accuracy). ZoeDepth uses MiDaS as its backbone; with Depth Anything as the backbone instead, metric depth performance improves dramatically.

Metric3D

Metric3D (Yin et al., 2023) takes a different approach: directly train for metric depth using camera intrinsics to handle scale ambiguity. While Depth Anything focuses on relative depth and then adapts to metric, Metric3D tackles the metric problem head-on. The two approaches are complementary — Depth Anything excels at zero-shot generalization, while Metric3D excels when camera parameters are known.

The broader self-training paradigm

Depth Anything's recipe — teacher on labeled data, pseudo-labels on unlabeled data, student with strong augmentation — echoes successful patterns throughout deep learning:

Noisy Student (Xie et al., 2020): the same pattern for image classification. A teacher trained on ImageNet pseudo-labels 300M unlabeled images, and the student trains on them with noise.
Semi-supervised learning: FixMatch, Mean Teacher, and other methods use similar teacher-student frameworks with consistency regularization.
Knowledge distillation: the teacher-student framework itself traces back to Hinton et al. (2015), though Depth Anything's student ends up better than the teacher.

The big picture: Depth Anything proves that the "foundation model" paradigm — massive data, strong pretraining, self-supervised scaling — works for dense prediction tasks like depth estimation, not just classification or generation. The recipe is simple: start with a good encoder (DINOv2), get a decent teacher (labeled data), generate massive pseudo-labels (62M unlabeled), and force the student to work harder (strong augmentation). No novel architecture. No new loss function. Just scale, done right.

What is the fundamental principle that connects Depth Anything to Noisy Student and other semi-supervised methods?

The teacher-student self-training paradigm: a teacher generates pseudo-labels on unlabeled data, and the student learns from them under noise/augmentation to acquire knowledge beyond what labeled data alone provides
They all use the same architecture
They all train on the same dataset