Metric3D v2 — Veanors

Chapter 0: The Problem

You take a single photo with your phone. From that one image, you want two things: how far away every pixel is (metric depth), and which direction every surface faces (surface normals). These are the two most fundamental geometric properties of a scene — and they are deeply complementary.

Depth tells you the distance to each point. A pixel on a nearby table might be 0.8 meters away; a pixel on the far wall might be 5 meters. Depth is a scalar field — one number per pixel.

Surface normals tell you the orientation of the surface at each point. A horizontal floor has normals pointing straight up. A vertical wall has normals pointing toward you. A slanted roof has normals pointing diagonally. Normals are a vector field — three numbers (x, y, z direction) per pixel.

Together, depth and normals give you a complete geometric description of the scene. Depth provides the global structure (what is far, what is near). Normals provide the local geometry (which way each surface faces). If you have both at metric scale, you can reconstruct the 3D scene, measure real-world distances, and understand the physical layout — all from a single photo.

Here is the problem: existing methods train these two representations separately, and each has a critical weakness.

The two-headed monster: Depth methods that generalize well (MiDaS, DPT) learn affine-invariant depth — they predict relative structure but lose the real-world scale. A coffee mug might be predicted at "depth 0.3" and a building at "depth 0.9," but those numbers have no metric meaning. You cannot measure anything. Meanwhile, surface normal methods are bottlenecked by the lack of diverse training data — normal labels require expensive dense 3D reconstruction, so models struggle on outdoor scenes they have never seen.

The first problem — losing metric scale — comes from a fundamental ambiguity. If the neural network does not know the camera's focal length, the same image could have been taken by a wide-angle phone camera at 1 meter or a telephoto lens at 10 meters. Without focal length information, the network cannot predict metric depth. It must give up and predict only relative depth.

The second problem — limited normal training data — is about annotation cost. Depth labels can come from cheap sensors (LiDAR, RGB-D cameras). But normal labels require meticulously reconstructed 3D meshes: you need to capture a scene from many angles, run dense multi-view stereo, clean the reconstruction, and then compute normals from the resulting mesh. Only a handful of indoor datasets have gone through this process; outdoor normal annotations are almost nonexistent.

Here is the state of affairs before Metric3D v2:

MiDaS / DPT: Robust across scenes, but output is affine-invariant. Cannot measure real-world distances.
ZoeDepth: Adds "metric heads" fine-tuned per domain (indoor/outdoor). Better metric accuracy, but requires knowing the domain and fine-tuning.
Omnidata: Strong surface normals from dense reconstruction labels, but limited generalization outdoors and no metric depth.
Marigold: Beautiful depth from diffusion models, but affine-invariant and slow at inference.

None of these produces both metric depth and surface normals from a single model. Metric3D v2 does.

Depth vs Normals

A simple scene with two surfaces. Depth measures distance from the camera; normals show surface orientation. Toggle between the two views.

Why can't affine-invariant depth methods (like MiDaS) recover real-world measurements?

They predict depth up to an unknown scale and shift, so absolute distances are lost: "0.3" and "0.9" have no metric meaning without knowing the true scale factor
They are too slow to run on mobile devices
They only work on indoor scenes

Chapter 1: The Key Insight

Metric3D v2 is built on a beautifully simple geometric fact: depth and normals are dual representations of the same 3D surface.

Think about it. If you know the depth at every pixel and the camera parameters, you can compute the 3D position of every point. And if you have the 3D positions, you can compute the surface normal at any point by looking at how the surface tilts between neighboring points. In other words:

n(x, y) = normalize( ∂P/∂x × ∂P/∂y )

Where P is the 3D point corresponding to pixel (x, y), and n is the surface normal. The normal is the normalized cross product of the partial derivatives of the 3D surface, so it is fully determined by how depth changes from pixel to pixel.

Think about what this means concretely. If you look across a flat table, the depth changes smoothly and steadily as you move across it (it gets farther away). The surface tilts the same way everywhere, so the normals are constant: they all point straight up. If the table has a bump, the depth changes faster there, and the normals tilt to follow the bump's slope. The normal is determined by the derivative of the depth.
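
To make the cross-product formula concrete, here is a minimal NumPy sketch. It assumes a pinhole camera with focal length f in pixels and principal point (cx, cy), and it uses finite differences in place of the partial derivatives. The function names and the test scene are illustrative and are not taken from the paper.

```python
import numpy as np

def backproject(depth, f, cx, cy):
    """Lift a depth map (H, W) to 3D points (H, W, 3) with a pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / f
    y = (v - cy) * depth / f
    return np.stack([x, y, depth], axis=-1)

def normals_from_depth(depth, f, cx, cy):
    """Estimate unit normals as normalize(dP/dx x dP/dy) via finite differences."""
    P = backproject(depth, f, cx, cy)
    dP_dx = np.gradient(P, axis=1)  # change along image columns
    dP_dy = np.gradient(P, axis=0)  # change along image rows
    n = np.cross(dP_dx, dP_dy)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12
    return n

# Hypothetical scene: a wall facing the camera, 2 m away.
depth = np.full((8, 8), 2.0)
n = normals_from_depth(depth, f=500.0, cx=3.5, cy=3.5)
```

With this axis convention (x right, y down, z forward) the normals of the wall come out along +z, pointing away from the camera. Flip the sign if you prefer normals that face the viewer.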

This geometric relationship is the key that unlocks everything:

Direction 1: Depth → Normals

If you have accurate metric depth, you can compute surface normals by differentiating. This means depth labels implicitly contain normal information — you can train a normal estimator even without explicit normal labels.

↓

Direction 2: Normals → Depth

If you have surface normals, you can use them to refine depth. Normals provide constraints on how depth should change between neighboring pixels. Noisy depth that violates these constraints can be corrected.

↓

Joint Optimization

By iteratively refining both depth and normals together, each helps the other. Depth provides metric scale and global structure. Normals provide local geometric detail and consistency.

Why this matters: There are ~16M images with depth annotations but fewer than 20K outdoor images with normal labels. By exploiting the depth-to-normal relationship, Metric3D v2 can learn normals from roughly 800× more data (16M / 20K) than was previously available for outdoor normal training. The depth-normal duality turns a data-starved task into a data-rich one.

The second key insight addresses the metric ambiguity problem. The authors perform a careful analysis of which camera parameters actually matter for depth prediction. The answer is surprisingly specific:

Camera Parameter	Affects Metric Depth?	Why
Sensor size	No	Only changes field of view, not the depth-imaging relationship
Pixel size (δ)	No	Different pixel sizes produce different resolutions but the same α = f/S' ratio
Focal length (f)	Yes	Directly scales the relationship between imaging size and real-world distance

Since only the focal length matters, the solution is clean: normalize all training images to behave as though they were taken by the same "canonical" camera with a fixed focal length. This eliminates the ambiguity entirely, without requiring the network to learn about cameras at all.

How does the depth-normal geometric relationship help with the normal data scarcity problem?

Since normals can be derived from depth by differentiation, the 16M depth-labeled images can provide indirect supervision for normal estimation, turning a data-starved problem into a data-rich one
Normal labels are automatically generated by LiDAR sensors
Data augmentation creates enough synthetic normal labels

Chapter 2: Canonical Camera Transform

Why does focal length cause metric ambiguity? Consider two cameras photographing the same chair. Camera A has a focal length of 26mm at 1 meter distance. Camera B has a 52mm focal length at 2 meters. Both produce the same image of the chair — identical pixel appearance, identical imaging size. But the depth is completely different: 1m vs 2m.

A neural network seeing these two identical-looking images would receive contradictory training signals. Same input, different labels. It cannot learn metric depth without knowing which camera took the photo.

The Solution: One Canonical Camera

Metric3D v2 defines a canonical camera with a fixed focal length f^c. Before training, every image-depth pair is transformed so it appears as though the canonical camera took it. The model only ever sees "canonical" images and can learn metric depth without ambiguity.

There are two ways to perform this transformation:

Method 1: Transform the Labels (CSTM-label)

Keep the image unchanged. Instead, rescale the ground-truth depth by the ratio of canonical to actual focal length:

D^c = (f^c / f) · D*

At inference, reverse it: D = (f / f^c) · D^c. The network predicts depth in canonical space; you de-canonicalize with the actual camera's focal length.
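
The two formulas above can be sketched in a few lines of NumPy. The canonical focal length of 1000 pixels and the function names are assumptions made for illustration; the only thing taken from the text is the ratio f^c / f and its inverse.

```python
import numpy as np

F_CANONICAL = 1000.0  # hypothetical canonical focal length f^c, in pixels

def canonicalize_depth(depth_gt, f):
    """Training side: D^c = (f^c / f) * D*."""
    return (F_CANONICAL / f) * depth_gt

def decanonicalize_depth(depth_canonical, f):
    """Inference side: D = (f / f^c) * D^c."""
    return (f / F_CANONICAL) * depth_canonical

depth_gt = np.array([[1.0, 2.0], [4.0, 8.0]])  # metric depth in meters
f = 500.0                                      # hypothetical camera focal length, in pixels

d_c = canonicalize_depth(depth_gt, f)      # labels the network is trained on
d_back = decanonicalize_depth(d_c, f)      # round trip back to metric depth
```

Because both directions are a single multiplication, the round trip is exact up to floating-point error and the image itself is never touched.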

Method 2: Transform the Image (CSTM-image)

Resize the input image by ω_r = f^c / f so it looks like the canonical camera took it. The depth labels resize accordingly (without scaling values). At inference, resize the prediction back.

Why Does the Focal Length Cause Ambiguity?

Let's make this concrete. From the pinhole camera model, the depth d of an object is related to its real-world size S, its imaging size S', and the focal length f by:

d = S · (f / S')

If two cameras have different focal lengths f₁ = 2f₂ and the object is at distances d₁ = 2d₂, the imaging size S' is identical. A network looking at image appearance alone cannot tell these apart. But if we transform both depth maps to use the same canonical focal length f^c, the labels become consistent: the network sees the same image and gets the same (canonical) depth label. No contradiction.
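
Here is the chair example worked through numerically, as a small sketch. The chair height and the canonical focal length are made-up values; the 26mm/1m and 52mm/2m pairs come from the text above.

```python
def imaging_size(real_size, focal_length, distance):
    """S' = S * f / d, rearranged from d = S * (f / S')."""
    return real_size * focal_length / distance

S = 0.9     # hypothetical chair height, in meters
F_C = 39.0  # hypothetical canonical focal length, in mm

# Camera A: 26mm at 1m. Camera B: 52mm at 2m.
size_a = imaging_size(S, 26.0, 1.0)
size_b = imaging_size(S, 52.0, 2.0)

# Canonical depth labels: D^c = (f^c / f) * D
label_a = (F_C / 26.0) * 1.0
label_b = (F_C / 52.0) * 2.0
```

Both cameras give the same imaging size, and after canonicalization both images also carry the same depth label (1.5 here), so the contradiction disappears.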

Why CSTM-label wins: In experiments, CSTM-label outperforms CSTM-image. Resizing images introduces interpolation artifacts and changes the effective resolution. Scaling depth labels is mathematically clean — a simple multiplication that preserves all image detail.

Inference: De-canonicalization

At test time, the network predicts depth D^c in canonical space. To get real-world metric depth, you simply divide by the focal-length ratio: D = D^c / ω_d = D^c · (f / f^c). This requires knowing the test camera's focal length, which can usually be read from the photo's EXIF metadata and must be expressed in the same units as f^c. No special calibration is needed.

Normal labels are untouched: Surface normals do not depend on metric scale. If you scale a depth map by 2×, every surface still faces the same direction. So the canonical transform only applies to depth — normals pass through unchanged. This is another advantage of normals: they are inherently metric-agnostic.
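
The scale invariance can be checked directly against the cross-product formula from Chapter 1. A short derivation, assuming a fixed camera and a uniform positive factor s applied to the depth map (which scales every backprojected 3D point P by s):

```latex
\begin{aligned}
\mathbf{n}(sP)
  &= \operatorname{normalize}\!\left( \frac{\partial (sP)}{\partial x} \times \frac{\partial (sP)}{\partial y} \right) \\
  &= \operatorname{normalize}\!\left( s^{2} \, \frac{\partial P}{\partial x} \times \frac{\partial P}{\partial y} \right) \\
  &= \mathbf{n}(P), \qquad s > 0 .
\end{aligned}
```

The factor s² only changes the length of the cross product, and normalization removes it.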

Why does the canonical camera transform only apply to depth labels and NOT to normal labels?

Because normal labels are already in canonical space
Because surface normals are invariant to metric scale: scaling the depth does not change the direction any surface faces
Because normal labels are too sparse to transform reliably

Chapter 3: Joint Depth-Normal Optimization

This is the heart of Metric3D v2. Rather than predicting depth and normals independently, the model iteratively refines both together using recurrent blocks. Each iteration, depth helps normals and normals help depth.

The Iterative Refinement Loop

The encoder-decoder produces initial low-resolution predictions: D̂₀ for depth and N̂₀ for normals. Then, for T+1 iterations:

Step 1: ConvGRU Update

A ConvGRU block receives the current depth D̂_t, normal N̂_t, initial hidden features H₀, and current hidden features H_t. It updates the hidden state: H_{t+1} = ConvGRU(D̂_t, N̂_t, H₀, H_t).

↓

Step 2: Compute Updates

Two projection heads predict residuals from the updated hidden state: ΔD̂_{t+1} = G_d(H_{t+1}) and ΔN̂_{t+1} = G_n(H_{t+1}).

↓

Step 3: Apply Residuals

D̂_{t+1} = D̂_t + ΔD̂_{t+1} and N̂_{t+1} = N̂_t + ΔN̂_{t+1}. Both predictions are refined at every iteration.

After all iterations, the low-resolution predictions are upsampled to full resolution with learned upsampling and then passed through the output functions H_d and H_n. For depth, H_d is a ReLU (ensuring non-negative values). For normals, H_n is an L2 normalization to unit vectors at every pixel.

The Math

Formally, at each step the recurrent block F takes all current state as input:

H_{t+1} = ConvGRU(D̂_t, N̂_t, H₀, H_t)

ΔD̂_{t+1} = G_d(H_{t+1}), ΔN̂_{t+1} = G_n(H_{t+1})

The critical detail: both projection heads G_d and G_n read from the same updated hidden state H_{t+1}. This shared representation is how depth knowledge flows into the normal branch and vice versa. The ConvGRU's gating mechanism learns which hidden features to keep, reset, or update, automatically balancing the two tasks.
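
The structure of the loop can be sketched in plain NumPy. This is a toy under loud assumptions: the GRU below acts per pixel (equivalent to a ConvGRU with 1×1 kernels), all weights are random rather than trained, and the projection heads G_d and G_n are single linear maps. It only illustrates how one shared hidden state drives both residual updates.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ToyGRUCell:
    """Per-pixel GRU, a stand-in for a ConvGRU with 1x1 kernels (random weights)."""
    def __init__(self, in_ch, hid_ch):
        shape = (in_ch + hid_ch, hid_ch)
        self.Wz, self.Wr, self.Wq = (0.1 * rng.standard_normal(shape) for _ in range(3))

    def __call__(self, x, h):
        hx = np.concatenate([h, x], axis=-1)
        z = sigmoid(hx @ self.Wz)  # update gate
        r = sigmoid(hx @ self.Wr)  # reset gate
        q = np.tanh(np.concatenate([r * h, x], axis=-1) @ self.Wq)  # candidate state
        return (1 - z) * h + z * q

def refine(depth, normal, h0, cell, G_d, G_n, num_steps):
    h = h0
    for _ in range(num_steps):
        x = np.concatenate([depth, normal, h0], axis=-1)  # inputs D_t, N_t, H_0
        h = cell(x, h)                                    # H_{t+1}
        depth = depth + h @ G_d                           # D_{t+1} = D_t + dD_{t+1}
        normal = normal + h @ G_n                         # N_{t+1} = N_t + dN_{t+1}
    return depth, normal, h

H, W, C = 4, 5, 8  # hypothetical low-resolution grid and hidden width
depth0 = rng.random((H, W, 1))
normal0 = rng.standard_normal((H, W, 3))
h0 = np.tanh(rng.standard_normal((H, W, C)))
cell = ToyGRUCell(in_ch=1 + 3 + C, hid_ch=C)
G_d = 0.1 * rng.standard_normal((C, 1))
G_n = 0.1 * rng.standard_normal((C, 3))

depth_T, normal_T, h_T = refine(depth0, normal0, h0, cell, G_d, G_n, num_steps=4)
```

Note that both `depth` and `normal` are updated from the same `h` in every pass, which is the coupling described above.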

Why Joint Optimization Helps

The ConvGRU sees both depth and normals at every iteration. If the current depth estimate has a noisy region, the normals there will be inconsistent with the depth gradients. The ConvGRU can detect this mismatch and correct both. It is like having two witnesses who keep checking each other's story — inconsistencies get resolved with each pass.

The key innovation: Previous methods (IronDepth, IEBins) iterate over depth or normals independently. Metric3D v2 is the first to iterate over both jointly in a learning-based scheme. The shared hidden state H lets depth information flow into the normal branch and vice versa — implicit knowledge transfer that is more robust than explicit formula-based conversion.

The design is inspired by RAFT (Recurrent All-Pairs Field Transforms), which uses ConvGRU blocks to iteratively refine optical flow. The key adaptation: RAFT updates a single quantity (flow), while Metric3D v2 updates two quantities (depth and normals) through a shared hidden state. This coupling is what makes the joint optimization powerful — without the shared state, you would just have two independent RAFT-like refinement loops.

Joint Depth-Normal Optimization

A 1D surface (side view). Add noise to see how depth becomes bumpy. The "normals from depth" arrows become noisy too. Click "Run Joint Optimization" to watch iterative refinement smooth both signals. Drag the noise slider to increase the challenge.

What advantage does joint depth-normal iteration have over iterating depth and normals independently?

The shared hidden state lets depth information flow into the normal branch and vice versa, enabling implicit cross-task knowledge transfer that resolves inconsistencies between the two predictions
It uses fewer parameters by sharing weights
It runs faster because only one branch is updated per iteration

Chapter 4: Normal from Depth Distillation

Here is the data reality: there are about 9.5 million outdoor images with depth labels, but fewer than 20,000 with normal labels. If you only train the normal head on those 20K images, it will never generalize to the real world. How do you learn normals from depth-only data?

Three Sources of Normal Supervision

Metric3D v2's normal estimator learns from three complementary signals:

Source	When Available	Description
GT Normal Labels (L_n)	~10M indoor frames	Direct supervision with an uncertainty-aware angular loss. High quality but limited diversity.
Depth-Normal Consistency (L_d-n)	Always	Self-supervision: convert predicted depth to a pseudo-normal map via least-squares fitting, then minimize the angular difference with the predicted normals. Requires no labels at all.
Implicit Feature Fusion	Always	Through the shared ConvGRU hidden state, depth features implicitly teach the normal branch about geometry — more robust than explicit pseudo-labels.

The Consistency Loss

Given a predicted depth map D, you can compute a pseudo-normal at each pixel by fitting a local plane to the neighboring 3D points (using the least-squares method). Call this N_pseudo. The consistency loss is:

L_d-n = angular_distance(N_predicted, N_pseudo(D))

This is a self-supervised signal — it uses no ground-truth labels at all. It simply enforces that the model's own depth and normal predictions agree with each other geometrically.
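
As a rough sketch of the two ingredients, the snippet below fits a plane to a small neighborhood of 3D points by least squares (via SVD) and measures the angle between two normals. The function names, the orientation rule, and the 3×3 test patch are assumptions for illustration; the paper's exact fitting procedure and loss reduction may differ.

```python
import numpy as np

def fit_plane_normal(points):
    """Least-squares plane normal for (K, 3) points, oriented toward the camera at the origin."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                   # direction of least variance
    if np.dot(normal, centroid) > 0:  # flip so the normal faces the camera
        normal = -normal
    return normal

def angular_distance(n_pred, n_pseudo):
    """Mean angle in radians between unit normals (last axis holds x, y, z)."""
    cos = np.clip(np.sum(n_pred * n_pseudo, axis=-1), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

# Hypothetical 3x3 neighborhood of backprojected points on the plane z = 2.
xs, ys = np.meshgrid([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])
patch = np.stack([xs.ravel(), ys.ravel(), np.full(9, 2.0)], axis=1)

n_pseudo = fit_plane_normal(patch)
loss_same = angular_distance(n_pseudo, n_pseudo)
loss_orth = angular_distance(np.array([1.0, 0.0, 0.0]), n_pseudo)
```

A predicted normal that agrees with the pseudo-normal gives a loss near zero, while one at a right angle gives π/2.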

Why not just convert depth to normals directly? You could compute normals from predicted depth without learning a normal head at all. But this produces artifacts — depth predictions are noisy at edges and boundaries, and the numerical derivatives amplify that noise. The learned normal head produces much cleaner results because it has been trained on real normal labels (when available) and on the consistency signal (always). It learns to predict normals that are geometrically consistent with depth but visually cleaner than derivative-based conversion. The paper demonstrates this with Marigold — a state-of-the-art depth model — whose depth-derived normals show visible noise on smooth surfaces and errors at object boundaries.

The Data Flow

Consider a training image from an outdoor driving dataset with only depth labels (no normal GT). Here is what happens:

The depth head receives direct supervision from L_d (GT depth exists).
The normal head receives NO direct GT supervision (no normal labels).
But the normal head still learns from two sources: (1) the consistency loss L_d-n that aligns its predictions with the predicted depth, and (2) implicit knowledge transfer through the shared ConvGRU hidden state.
Over millions of such images, the normal head learns the geometric relationship between image appearance and surface orientation — despite never seeing a single normal label from this domain.

The overall loss: L = 0.5 · L_d(D^c, D*^c) + 1.0 · L_n(N, N*) + 0.01 · L_d-n(N, D). The small weight on L_d-n prevents the noisy pseudo-normals from overwhelming the clean GT supervision when both are available.
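
The weighting can be written as a tiny function. This sketch takes the three loss values as already-computed numbers and, following the mixed-annotation rule described in the training chapter, skips L_n when no normal labels exist. The function name and signature are hypothetical.

```python
def total_loss(l_d, l_dn, l_n=None, w_d=0.5, w_n=1.0, w_dn=0.01):
    """L = w_d * L_d + w_n * L_n + w_dn * L_d-n, dropping L_n when it is unavailable."""
    loss = w_d * l_d + w_dn * l_dn
    if l_n is not None:
        loss += w_n * l_n
    return loss

# Hypothetical loss values for a depth-only sample and a fully labeled sample.
depth_only = total_loss(l_d=2.0, l_dn=10.0)
fully_labeled = total_loss(l_d=2.0, l_dn=10.0, l_n=0.3)
```

Even with a consistency term twenty times larger than the depth term in raw value, its contribution after weighting stays small.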

Note that the depth loss L_d operates in canonical space (D^c vs D*^c), while the consistency loss L_d-n uses the de-canonicalized depth D. This is because the depth-to-normal conversion requires real-world metric depth to compute correct 3D positions — canonical depth would distort the 3D geometry.

Why is the depth-normal consistency loss weighted much lower (0.01) than the GT normal loss (1.0)?

To reduce computation cost
Because pseudo-normals derived from predicted depth are noisier than GT normals; a high weight would let noisy self-supervision overwhelm clean ground-truth supervision
Because the consistency loss is only applied to outdoor images

Chapter 5: The Architecture

Metric3D v2 uses a standard encoder-decoder pipeline, augmented with the canonical camera transform and the joint optimization module. Here is the full data flow.

Backbone: ViT-Large (DINOv2)

The image, after canonical camera transformation, is fed into a Vision Transformer backbone pretrained with DINOv2 self-supervised learning. DINOv2 provides excellent visual features because it was trained on 142M images with no labels, learning rich geometric and semantic representations. The ViT-Large version produces patch tokens at 1/14 resolution (each token represents a 14×14 pixel patch); the 1/4-resolution maps are produced later by the decoder.

The authors also experiment with a ConvNeXt-Large backbone (for those who prefer convnets) and a ViT-giant (1B+ parameters) for maximum accuracy on benchmarks.

Decoder: DPT

A DPT (Dense Prediction Transformer) decoder reassembles the multi-scale features from the ViT backbone into dense prediction maps. DPT works by tapping into features at multiple layers of the ViT and fusing them with convolutional upsampling. It produces three initial maps at 1/4 resolution:

Initial depth D̂₀: A rough depth estimate in canonical space
Initial normal N̂₀: An unnormalized 3-channel normal estimate
Initial hidden features H₀: Rich feature maps that the ConvGRU will use as a "memory" anchor during iterative refinement

Joint Optimization: Recurrent Blocks

The ConvGRU-based recurrent blocks iterate over depth and normals, as described in Chapter 3. The number of iterations T is a hyperparameter — more iterations give better results but cost more compute. Typically T = 3 to 5 iterations suffice. After T+1 steps, the refined predictions are upsampled to full resolution using learned convolutional upsampling heads.

Post-Processing

Depth is passed through ReLU (ensuring non-negative values) and then de-canonicalized by dividing by ω_d = f^c/f. Normals are L2-normalized to unit vectors. The full equation:

D^c = H_d(upsample(D̂_{T+1})), N = H_n(upsample(N̂_{T+1}))

Where H_d is ReLU and H_n is per-pixel L2 normalization (ensuring ||n|| = 1).
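
A minimal sketch of this post-processing step, assuming the upsampling has already happened and that focal lengths are given in pixels. The function name and the sample values are illustrative only.

```python
import numpy as np

def postprocess(depth_canonical_raw, normal_raw, f, f_canonical):
    """Apply H_d (ReLU), H_n (per-pixel L2 normalization), then de-canonicalize depth."""
    depth_c = np.maximum(depth_canonical_raw, 0.0)          # H_d: ReLU
    norm = np.linalg.norm(normal_raw, axis=-1, keepdims=True)
    normal = normal_raw / np.maximum(norm, 1e-12)           # H_n: unit vectors
    depth = depth_c * (f / f_canonical)                     # D = D^c / (f^c / f)
    return depth, normal

# Hypothetical raw network outputs for a 1x2 image.
raw_depth = np.array([[-0.5, 2.0]])
raw_normal = np.array([[[0.0, 0.0, 2.0], [3.0, 0.0, 4.0]]])

depth, normal = postprocess(raw_depth, raw_normal, f=500.0, f_canonical=1000.0)
```

The negative raw depth is clamped to zero, the second pixel's normal becomes (0.6, 0, 0.8), and the depth is halved because the camera's focal length is half the canonical one.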

Model Scaling

The architecture scales smoothly across three backbone sizes:

Backbone	Params	Use Case
ConvNeXt-Large	~200M	Fastest inference, good for real-time
ViT-Large	~300M	Best accuracy-speed tradeoff (default)
ViT-giant	~1B	Highest accuracy, server-side

Metric3D v2 Pipeline

Full architecture: image enters, canonical transform applied, encoder-decoder produces initial estimates, recurrent blocks refine jointly, outputs are de-canonicalized.

Model variants: The paper provides three backbones. ConvNeXt-Large (fastest, lowest accuracy). ViT-Large (best accuracy-speed tradeoff — this is the default). ViT-giant (highest accuracy, most expensive). All use the same CSTM + joint optimization framework.

What is the role of the DPT decoder in the pipeline?

It reassembles multi-scale ViT features into dense initial depth, normal, and hidden feature maps at 1/4 resolution; these serve as the starting point for iterative refinement
It performs the canonical camera transformation
It classifies the scene type (indoor vs outdoor)

Chapter 6: Training at Scale

Scale is what separates a research prototype from a foundation model. Metric3D v2 trains on an unprecedented collection of data.

Metric	Value
Total training images	16 million+
Number of datasets	16 (indoor + outdoor, real + synthetic)
Camera models	Thousands (phones, DSLRs, autonomous driving rigs, RGB-D sensors)
Images with depth labels	~16M
Images with normal labels	~10M (mostly indoor)
Outdoor normal-labeled images	< 20K
Training hardware	48 A100 GPUs
Training iterations	800K
Batch size	192

The Training Recipe

Key training details that make the system work at this scale:

Optimizer: AdamW with initial learning rate 1e-4, polynomial decay (power 0.9)
Resolution: 616 × 1064 for ViT, 512 × 960 for ConvNet
Augmentation: Random horizontal flip (50%), random crop after canonical transform
Backbone initialization: DINOv2 pretrained weights (ViT) or ImageNet-22K (ConvNeXt)
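
For reference, here is one common way to write a polynomial learning-rate decay with the numbers listed above. The source only states "polynomial decay (power 0.9)", so the exact formula, and the assumption that the rate decays to zero at the final iteration, are guesses at a typical schedule rather than the authors' code.

```python
def poly_lr(step, base_lr=1e-4, max_steps=800_000, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - step / max_steps) ** power."""
    progress = min(step, max_steps) / max_steps
    return base_lr * (1.0 - progress) ** power

lr_start = poly_lr(0)
lr_mid = poly_lr(400_000)
lr_end = poly_lr(800_000)
```

With power 0.9 the curve is close to linear, dropping slightly slower than a straight line early on.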

Dataset Balancing

With 16 datasets of wildly different sizes (some have millions of images, others have thousands), naive mixing would let large datasets dominate training. Following DiverseDepth, Metric3D v2 balances all datasets within each mini-batch so each accounts for an approximately equal share. This prevents the model from memorizing one domain at the expense of others.
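
A simple way to picture this balancing is a sampler that draws the same number of items from every dataset for each mini-batch. The sketch below is an assumption about how such a sampler could look, with made-up dataset names and sizes; it is not the authors' data loader.

```python
import random
from collections import Counter

def balanced_batch(datasets, batch_size, rng):
    """Draw a mini-batch with a near-equal number of samples from each dataset."""
    names = sorted(datasets)
    per_dataset, remainder = divmod(batch_size, len(names))
    batch = []
    for i, name in enumerate(names):
        k = per_dataset + (1 if i < remainder else 0)
        batch += [(name, rng.choice(datasets[name])) for _ in range(k)]
    rng.shuffle(batch)
    return batch

# Hypothetical datasets of very different sizes (values are sample indices).
datasets = {
    "big_outdoor": range(1_000_000),
    "small_indoor": range(1_000),
    "tiny_synthetic": range(50),
}

batch = balanced_batch(datasets, batch_size=192, rng=random.Random(0))
counts = Counter(name for name, _ in batch)
```

Each dataset contributes 64 of the 192 samples, regardless of whether it holds fifty images or a million.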

Random Proposal Normalization Loss (RPNL)

Standard scale-shift invariant losses normalize depth over the entire image, which squeezes fine-grained depth differences in nearby regions. Consider a scene with a table at 1m and a wall at 5m. Global normalization maps this range to [0, 1], so the 2cm height difference of objects on the table gets compressed to ~0.005 (0.02m out of a 4m range), almost invisible to the loss function.

RPNL fixes this by randomly cropping 32 patches (each 12.5%-50% of the image size) and applying scale-shift normalization locally within each patch. A patch that contains only the table surface will normalize the 2cm variations to span a much larger fraction of [0, 1], preserving local geometric detail. The loss is the mean absolute deviation (MAD) normalized L1 distance across all patches.
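
Here is a rough NumPy sketch of the idea. It assumes that the 12.5%-50% range refers to the patch side length as a fraction of the image side, that normalization shifts by the median and scales by the mean absolute deviation from it, and that every pixel has a valid label. These are simplifications, not the paper's exact implementation.

```python
import numpy as np

def mad_normalize(d):
    """Shift by the median and scale by the mean absolute deviation from it."""
    shift = np.median(d)
    scale = np.mean(np.abs(d - shift)) + 1e-8
    return (d - shift) / scale

def rpnl(pred, gt, num_patches=32, min_frac=0.125, max_frac=0.5, rng=None):
    """Average L1 distance between locally normalized crops of pred and gt."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = gt.shape
    losses = []
    for _ in range(num_patches):
        frac = rng.uniform(min_frac, max_frac)
        ph, pw = max(2, int(h * frac)), max(2, int(w * frac))
        top = rng.integers(0, h - ph + 1)
        left = rng.integers(0, w - pw + 1)
        p = pred[top:top + ph, left:left + pw]
        g = gt[top:top + ph, left:left + pw]
        losses.append(np.mean(np.abs(mad_normalize(p) - mad_normalize(g))))
    return float(np.mean(losses))

data_rng = np.random.default_rng(0)
gt = data_rng.random((64, 80)) + 1.0        # hypothetical ground-truth depth
unrelated = data_rng.random((64, 80)) + 1.0 # a prediction with no relation to gt

loss_affine = rpnl(3.0 * gt + 2.0, gt, rng=np.random.default_rng(1))
loss_unrelated = rpnl(unrelated, gt, rng=np.random.default_rng(1))
```

A prediction that differs from the label only by a scale and shift scores essentially zero, while an unrelated prediction is penalized in every patch.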

The Depth Loss Stack

Depth supervision uses four complementary losses:

Scale-invariant log loss (L_silog): Penalizes errors in log-depth while discounting a global offset, which makes it robust to global scale differences.
Pair-wise normal regression loss (L_PWN): Encourages local geometric consistency between neighboring pixels.
Virtual normal loss (L_VNL): Randomly samples point triplets in 3D and enforces that the normal of the plane each triplet spans in the predicted point cloud matches the one computed from ground-truth depth.
Random Proposal Normalization Loss (L_RPNL): The novel loss described above that preserves local detail.

Together: L_d = L_PWN + L_VNL + L_silog + L_RPNL.
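
To give a feel for one term in this stack, here is the scale-invariant log loss in the form introduced by Eigen et al. The exact variant and the value of λ used in Metric3D v2 are not stated here, so treat the formula and the default λ = 0.5 as assumptions.

```python
import numpy as np

def silog_loss(pred, gt, lam=0.5):
    """Scale-invariant log loss: mean(g^2) - lam * mean(g)^2, with g = log(pred) - log(gt)."""
    valid = gt > 0  # skip pixels without a depth label
    g = np.log(pred[valid]) - np.log(gt[valid])
    return float(np.mean(g ** 2) - lam * np.mean(g) ** 2)

gt = np.array([1.0, 2.0, 4.0])
pred = 2.0 * gt  # hypothetical prediction that is off by a global factor of 2

fully_invariant = silog_loss(pred, gt, lam=1.0)
plain_log_mse = silog_loss(pred, gt, lam=0.0)
```

With λ = 1 a pure global scale error costs nothing, while λ = 0 reduces to a plain squared error in log space; values in between trade off the two.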

Mixed annotations handled gracefully: Some datasets have both depth and normal labels. Some have depth only. The loss function simply drops the normal GT loss L_n when normal labels are absent and relies on the depth-normal consistency loss L_d-n instead. No architecture changes needed — the model trains on everything.

Why is dataset balancing important when training on 16 diverse datasets?

Without balancing, large datasets would dominate training and the model would memorize one domain at the expense of generalization to others
To reduce training time by using fewer samples
To ensure all images have the same resolution

Chapter 7: Results

Metric3D v2 was tested on over 16 benchmarks for depth and normals, both zero-shot (never seen the test domain during training) and fine-tuned. Here are the headline numbers.

NYUv2 Metric Depth (Indoor)

Method	Setting	δ₁ ↑	AbsRel ↓
MiDaS (affine-invariant)	ZS	—	affine only
ZoeDepth	FT	0.953	0.077
DepthAnything	FT	0.984	0.056
Marigold	ZS	—	affine only
Metric3D v2 (ViT-L)	ZS	0.975	0.063
Metric3D v2 (ViT-L)	FT	0.989	0.047
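
For readers unfamiliar with the two columns, δ₁ is the fraction of pixels whose predicted-to-true depth ratio (taken in whichever direction is larger) is below 1.25, and AbsRel is the mean of |prediction − truth| / truth. A small sketch with made-up values, assuming predictions are positive and that pixels with zero ground truth are unlabeled:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Return (delta_1, AbsRel) over pixels with valid ground truth."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < 1.25)
    abs_rel = np.mean(np.abs(p - g) / g)
    return float(delta1), float(abs_rel)

# Hypothetical depths in meters; the last pixel has no label.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([1.1, 2.0, 6.0, 3.0])

delta1, abs_rel = depth_metrics(pred, gt)
```

Here two of the three labeled pixels fall within the 1.25 ratio, so δ₁ is 2/3, and the average relative error is 0.2.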

KITTI Metric Depth (Outdoor Driving)

Method	Setting	δ₁ ↑	AbsRel ↓
DepthAnything	FT	0.982	0.046
Metric3D v2 (ViT-L)	ZS	0.974	0.052
Metric3D v2 (ViT-g)	FT	0.989	0.039

NYUv2 Surface Normals

Method	Setting	11.25° ↑	Mean Error ↓
Omnidata	ZS	0.577	16.7°
Bae et al.	ZS	0.597	16.0°
Polymax	ZS	0.656	13.1°
Metric3D v2 (ViT-L)	ZS	0.662	13.1°
Metric3D v2 (ViT-L)	FT	0.688	12.0°
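
The normal metrics are built on the per-pixel angular error between predicted and true unit normals: the table reports the fraction of pixels with error below 11.25° and the mean error in degrees. A short sketch with two hand-made pixels:

```python
import numpy as np

def normal_metrics(pred, gt, threshold_deg=11.25):
    """Return (mean angular error in degrees, fraction of pixels under the threshold)."""
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    return float(err.mean()), float(np.mean(err < threshold_deg))

# Hypothetical data: one perfect prediction and one that is off by 30 degrees.
angle = np.radians(30.0)
gt = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
pred = np.array([[0.0, 0.0, 1.0], [np.sin(angle), 0.0, np.cos(angle)]])

mean_err, frac_within = normal_metrics(pred, gt)
```

The mean error is 15° and half of the pixels fall under the 11.25° threshold.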

Zero-Shot Generalization (Never-Seen Datasets)

Perhaps the most impressive results are on datasets the model has never seen at all during training:

Dataset	Domain	Metric3D v2 δ₁	Best Prior δ₁
NYUv2	Indoor	0.975	0.969 (Polymax)
KITTI	Driving	0.974	0.968 (ZeroDepth)
ScanNet	Indoor	0.969	0.939 (HDN)
NuScenes	Driving	0.977	0.910 (ZeroDepth)
DIODE Indoor	Indoor	0.849	0.754 (ZeroDepth)
DIODE Outdoor	Outdoor	0.847	0.400 (ZoeDepth)
ETH3D	Mixed	0.993	0.969 (Polymax)

On DIODE Outdoor, where ZoeDepth manages only δ₁ = 0.400 (essentially failing), Metric3D v2 achieves 0.847, more than double the accuracy. The canonical camera transform is especially impactful on datasets with unusual capture setups (DIODE was captured with a survey-grade laser scanner), where previous methods fall apart because they have never trained on similar camera models.

On the affine-invariant depth benchmarks (ETH3D, iBIMS-1, DIODE), Metric3D v2 also outperforms MiDaS and DPT — even at their own game. The model has not sacrificed structure quality to gain metric accuracy; it has improved both.

The remarkable result: Metric3D v2's zero-shot metric depth on NYU (δ₁ = 0.975) is competitive with DepthAnything fine-tuned on NYU (δ₁ = 0.984). On KITTI, Metric3D v2 zero-shot nearly matches DepthAnything fine-tuned. This is the power of a foundation model: one model, trained once, works everywhere.

Results Comparison

δ₁ accuracy on NYUv2 (metric depth). Higher is better. Metric3D v2 zero-shot vs other methods.

What makes Metric3D v2's results particularly impressive compared to DepthAnything?

Metric3D v2 achieves near-competitive or better results in zero-shot mode (never seen the test dataset) while DepthAnything requires fine-tuning on the target dataset, demonstrating true foundation-model generalization
Metric3D v2 uses fewer parameters
Metric3D v2 runs in real time on mobile devices

Chapter 8: Applications

A model that predicts accurate metric depth and normals from a single photo unlocks applications that were previously impossible without expensive sensors or multi-view systems.

Single-Image Metrology

Take a photo with your iPhone. Read the focal length from the EXIF data. Feed the image into Metric3D v2. Get metric depth at every pixel. Backproject to 3D using the known camera intrinsics. Now you can measure real-world distances between any two points in the scene.
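
The last two steps (backproject, then measure) can be sketched as follows. The code assumes a pinhole camera with the focal length already converted to pixels (an EXIF focal length in millimeters has to be divided by the sensor's pixel size first) and a principal point (cx, cy). The depth map, intrinsics, and pixel coordinates are invented for the example.

```python
import numpy as np

def pixel_to_point(u, v, depth, f, cx, cy):
    """Backproject pixel (u, v) to a 3D point in meters using a pinhole model."""
    z = depth[v, u]
    return np.array([(u - cx) * z / f, (v - cy) * z / f, z])

def distance_between(p1, p2, depth, f, cx, cy):
    """Euclidean distance in meters between the 3D points behind two pixels."""
    a = pixel_to_point(*p1, depth, f, cx, cy)
    b = pixel_to_point(*p2, depth, f, cx, cy)
    return float(np.linalg.norm(a - b))

# Hypothetical prediction: a flat wall 2 m away in a 640x480 image.
depth = np.full((480, 640), 2.0)
f, cx, cy = 500.0, 320.0, 240.0

d = distance_between((220, 240), (420, 240), depth, f, cx, cy)
```

Two pixels 200 columns apart on a wall 2 m away correspond to points 0.8 m apart, since each pixel spans 2 / 500 m at that depth.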

The authors demonstrate this with an iPhone 14 Pro (f = 24mm, pixel size 2.44μm) and a Samsung Galaxy S23 (f = 35mm, pixel size 1μm) — two completely different cameras with different focal lengths and sensor characteristics, neither seen during training. Measured sizes (e.g., a drone's wingspan, a chair's height) come within a few centimeters of ground truth.

Monocular SLAM Improvement

Visual SLAM systems like Droid-SLAM suffer from scale drift: as the camera moves through a large-scale scene, the estimated scale gradually diverges from reality. After walking through a building, the map might be 20% too large or too small, with the error accumulating over time.

By naively feeding Metric3D v2's per-frame metric depth into Droid-SLAM, the scale drift is dramatically reduced. The paper shows trajectory predictions that closely match ground truth, with accurate metric-scale dense mapping. The key: Metric3D v2 provides an absolute scale reference at every frame, preventing the SLAM system from drifting.

Why scale matters for SLAM: Traditional monocular SLAM can estimate camera motion up to an unknown scale. But for a robot that needs to navigate in meters, or an AR app that needs to place virtual objects at correct physical sizes, unknown scale is useless. Metric3D v2 provides the absolute scale that anchors SLAM in the real world.

Metric 3D Reconstruction

With metric depth, you can backproject every pixel to a 3D point at its correct real-world position. With surface normals, you can orient mesh faces correctly. The combination produces high-quality metric 3D reconstructions from casually captured images — no LiDAR, no multi-view stereo, no calibration targets.

NeRF and Neural Rendering

Neural radiance fields benefit enormously from depth and normal priors. Metric3D v2's predictions can initialize NeRF geometry, speeding up convergence and improving quality in under-constrained regions (e.g., textureless walls). The metric scale ensures that the NeRF scene has correct physical dimensions, which matters for VR/AR applications where virtual objects must interact with real geometry.

Autonomous Driving

Self-driving cars need to know how far away other vehicles, pedestrians, and obstacles are in meters. LiDAR provides this, but it is expensive ($5K-$75K per sensor) and produces only sparse point clouds. A monocular camera with Metric3D v2 can provide dense metric depth at every pixel at a fraction of the cost, serving as a redundant safety channel or enabling depth estimation in camera-only setups. The KITTI benchmark results (δ₁ = 0.989 fine-tuned) demonstrate near-perfect depth prediction in driving scenarios.

Downstream task improvement: The authors show that simply plugging Metric3D v2's predictions into existing pipelines — SLAM, NeRF, 3D reconstruction — significantly improves their output. The model acts as a drop-in geometric prior that benefits many systems without requiring any architectural changes to those systems.

The metrology demo: Using an iPhone 14 Pro (f=24mm) and Samsung Galaxy S23 (f=35mm) — two cameras with completely different focal lengths, neither seen in training — the authors reconstruct 3D scenes and measure real-world object sizes. A drone's wingspan measured from a single photo comes within centimeters of ground truth. This is single-image metrology: measuring real objects from a single smartphone photo.

How does Metric3D v2 help with the scale drift problem in monocular SLAM?

It provides metric-scale depth predictions that anchor the SLAM system's estimated scale to real-world measurements, preventing gradual drift
It replaces the SLAM system entirely
It uses GPS data to correct the scale

Chapter 9: Connections

Metric3D v2 sits at the intersection of several important research threads. Let's map where it fits.

Relation to Metric3D v1

Metric3D v1 introduced the canonical camera space transformation for metric depth. v2 keeps this module but adds three major innovations: (1) joint depth-normal optimization via ConvGRU, (2) the ability to learn normals from depth labels via the consistency loss, and (3) a massive scale-up from ~8M to 16M+ training images with ViT backbones. The result is not just better depth; it is a complete geometric foundation model.

Relation to MiDaS / DPT

MiDaS and DPT are affine-invariant depth models — they predict structure but not metric scale. Metric3D v2 solves the metric problem that MiDaS could not: by canonicalizing the camera, it learns true metric depth while maintaining the same robustness to diverse scenes.

Relation to DepthAnything

DepthAnything focuses on learning strong depth representations from massive unlabeled data via self-training. It produces affine-invariant or fine-tuned metric depth. Metric3D v2 takes a different approach: rather than scaling up unlabeled data, it scales up labeled data with the CSTM trick and adds normals. The two approaches are complementary.

Relation to Marigold / GeoWizard

Marigold uses diffusion models for monocular depth estimation, producing beautiful affine-invariant predictions. GeoWizard extends this to joint depth-normal estimation with diffusion. Metric3D v2 uses discriminative models (ViT + DPT) instead of diffusion, which makes it much faster at inference and enables true metric prediction. The paper shows that Marigold's depth-derived normals contain artifacts (noise at edges, incorrect orientations on smooth surfaces), while Metric3D v2's jointly-trained normal head produces cleaner results.

Relation to UniDepth / ZoeDepth

ZoeDepth tackles metric depth by first training an affine-invariant model, then fine-tuning "metric heads" for specific domain distributions (indoor vs outdoor). UniDepth uses camera-aware features. Metric3D v2's CSTM approach is more elegant: rather than learning domain-specific heads or camera encoders, it mathematically normalizes the camera out of the problem. One model, one training stage, all domains.

Relation to Omnidata

Omnidata tackles the normal data scarcity problem by performing dense 3D reconstruction on 1300M frames to create normal labels. This is enormously expensive. Metric3D v2's depth-to-normal distillation achieves similar or better results by leveraging cheap depth labels instead — a more scalable approach.

Cheat Sheet

Aspect	Metric3D v2
Input	Single RGB image + camera focal length
Output	Metric depth map + surface normal map
Backbone	DINOv2 ViT-Large (or ConvNeXt-L, ViT-giant)
Key module	Canonical camera space transform (CSTM)
Innovation	Joint depth-normal iterative optimization via ConvGRU
Normal supervision	GT labels + depth-normal consistency + implicit feature fusion
Training data	16M+ images, 16 datasets, 1000s of cameras
Key result	#1 on NYU, KITTI, ScanNet (both depth and normals)
Inference	Single forward pass + T refinement iterations

The broader lesson: When two geometric quantities are mathematically related (depth and normals are linked by differentiation), training them jointly and letting each supervise the other can overcome data limitations. Metric3D v2 turns 16M depth labels into 16M joint depth-and-normal training samples — turning a data-starved problem into a data-rich one through geometric reasoning. This is a powerful template: find a mathematical relationship between a data-rich task and a data-poor task, then transfer knowledge through joint training.

What is the key methodological difference between Metric3D v2 and diffusion-based depth methods like Marigold?

Metric3D v2 uses a discriminative model (ViT + DPT) that is faster at inference and predicts true metric depth, while Marigold uses slower diffusion models that produce only affine-invariant depth
Metric3D v2 uses smaller training datasets
Marigold predicts surface normals while Metric3D v2 does not

Metric3D v2: Zero-shot Metric Depth & Surface Normals