Yin, Zhang, Chen, Cai, Yu, Wang, Chen, Shen — DJI, Tencent, Zhejiang Univ, Intel Labs, 2023

Metric3D: Zero-shot Metric 3D from a Single Image

Monocular depth models predict relative depth — close vs far — but not real-world meters. Metric3D solves this by normalizing all training cameras to one canonical space, unlocking true metric depth from any phone photo.

Prerequisites: Pinhole camera model + Depth estimation basics + Encoder-decoder CNNs

Chapter 0: The Problem

You take a photo of a table with your iPhone. A monocular depth model tells you the table is "closer than the wall behind it." Great — but how far away is the table? 1.5 meters? 3 meters? The model has no idea. It outputs relative depth: a ranking of near-to-far, with no real-world scale.

This is useless for the things you actually want to do. A robot arm needs to know the table is exactly 1.2 meters away to place a cup on it. A SLAM system needs metric scale to build a map that doesn't drift. An AR app needs real-world measurements to overlay furniture that fits your room.

The core gap: State-of-the-art depth models like MiDaS and LeReS can predict beautiful depth maps that generalize across scenes — but the depth values are only accurate up to an unknown scale and shift. They predict d' = a·d + b, where a and b are different for every image. You can't measure anything in meters.

Why not just train on metric depth? You can — but only if you use a single camera. Methods like AdaBins or NeWCRFs achieve excellent metric accuracy on NYU (Kinect) or KITTI (LiDAR), but they fail catastrophically on images from different cameras. A model trained on Kinect data (focal length ~525 pixels) gives nonsensical metric predictions on iPhone photos (focal length ~3000 pixels).

The dream: a single model that predicts metric depth — real meters — from any camera, including ones it has never seen during training. That's what Metric3D achieves.

[Interactive figure: Relative vs Metric Depth. Left: a scene with three objects at known distances. Right: the output of a relative depth model (only ordering preserved) versus a metric model (actual meters).]

Why can't existing metric depth models (trained on one dataset) generalize to images from different cameras?

Chapter 1: The Key Insight

Imagine you're standing 2 meters from a chair and you photograph it with a 26mm lens. Now swap to a 52mm lens and stand 4 meters away. The chair looks exactly the same size in both photos. Same pixels, same appearance — but the real depth is completely different: 2m vs 4m.

This is the focal length ambiguity. A neural network that only sees pixels cannot distinguish these two situations. If you train it on both images with different depth labels (2m and 4m), the network receives contradictory supervision for identical-looking inputs. It cannot converge. This is precisely why mixed-dataset metric depth training fails.

The formal relationship: For a pinhole camera, the imaging equation is d = f̂ · S / S', where d is depth, f̂ is focal length, S is real-world size, and S' is imaging size on the sensor. Two cameras with focal lengths f̂1 = 2f̂2, viewing the same object from distances d1 = 2d2, produce the exact same image. The depth is entangled with the focal length.
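
The ambiguity can be checked numerically by rearranging the imaging equation as S' = f̂ · S / d. The snippet below is a minimal sketch using the 26mm/52mm example from this chapter; the function name and the 0.9 m chair height are illustrative assumptions, not values from the paper.

```python
def imaging_size_mm(focal_mm: float, object_size_m: float, depth_m: float) -> float:
    """Pinhole imaging size on the sensor, S' = f * S / d, in millimetres."""
    return focal_mm * object_size_m / depth_m


CHAIR_HEIGHT_M = 0.9  # assumed object size, for illustration only

# Same chair, two setups: 26 mm lens at 2 m, and 52 mm lens at 4 m.
near = imaging_size_mm(focal_mm=26.0, object_size_m=CHAIR_HEIGHT_M, depth_m=2.0)
far = imaging_size_mm(focal_mm=52.0, object_size_m=CHAIR_HEIGHT_M, depth_m=4.0)

# Both setups give the same size on the sensor, although the depths differ by 2x.
print(near, far)
```

Doubling the focal length and doubling the distance cancel out exactly, so the chair's size in the image alone cannot tell the two depths apart.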

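What about other camera parameters? The authors make two critical observations, one about sensor size and one about pixel size: sensor size only determines the field of view, and pixel size only changes how the focal length is expressed in pixels. Neither alters the relationship between image appearance and metric depth.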

The conclusion is sharp: focal length is the only camera parameter that creates metric ambiguity. If we can somehow remove the effect of varying focal lengths from training, we can train on mixed datasets and get metric depth for free.

The key insight: The metric depth ambiguity in mixed-camera training comes entirely from unknown focal lengths. If we normalize all training images to behave as though they were captured by the same "canonical" camera with a fixed focal length, the ambiguity vanishes. One model can then learn metric depth from millions of images across thousands of cameras.
Why does the focal length — and not sensor size or pixel size — create metric depth ambiguity?

Chapter 2: Affine-Invariant Depth

Before Metric3D, the best approach to mixed-camera training was to give up on metric depth entirely. Methods like MiDaS, LeReS, and DPT learn affine-invariant depth: depth predictions that are accurate only up to an unknown scale and shift.

Concretely, instead of predicting the true depth d, these models predict:

d' = a · d + b

where a (scale) and b (shift) are unknown constants that differ for every image. During training, the loss function aligns predictions to ground truth using a scale-shift invariant loss: it first computes the best-fit a and b, then measures the error after alignment. This way, the model is never penalized for getting the scale wrong — it only needs to get the shape of the depth map right.
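
The align-then-measure idea can be sketched with an ordinary least-squares fit. This is an illustration of the principle under simple assumptions (all pixels valid, mean absolute error after alignment), not the exact loss used by MiDaS or LeReS; the function names are hypothetical. Here the fit maps the prediction onto the ground truth, which is the inverse of the affine map written above.

```python
import numpy as np


def align_scale_shift(pred, gt):
    """Least-squares scale a and shift b minimizing ||a * pred + b - gt||^2."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    design = np.stack([pred, np.ones_like(pred)], axis=1)  # columns: [pred, 1]
    (a, b), *_ = np.linalg.lstsq(design, gt, rcond=None)
    return a, b


def scale_shift_invariant_error(pred, gt):
    """Mean absolute error measured after the best-fit alignment."""
    a, b = align_scale_shift(pred, gt)
    aligned = a * np.asarray(pred, dtype=np.float64) + b
    return float(np.mean(np.abs(aligned - np.asarray(gt, dtype=np.float64))))


gt = np.array([1.2, 2.0, 3.5, 5.0])   # true depths in meters
pred = 0.25 * gt + 0.7                # right shape, wrong scale and shift
a, b = align_scale_shift(pred, gt)
error = scale_shift_invariant_error(pred, gt)
print(a, b, error)
```

A prediction that is an exact affine copy of the ground truth incurs essentially zero error, which is why such a loss cannot teach the network anything about absolute scale.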

Why this works for generalization

By decoupling scale and shift, these methods sidestep the focal length ambiguity entirely. The network doesn't need to know the focal length because it's not responsible for predicting absolute depth. This lets you train on millions of images from diverse cameras — which is why MiDaS generalizes beautifully to internet photos.

Why this fails for applications

The price is steep. Affine-invariant depth is fundamentally unsuitable for any task that needs real-world measurements, such as the robot manipulation, SLAM mapping, and AR placement examples from Chapter 0.

The trade-off before Metric3D: You could have metric depth (accurate meters) but only on one camera. Or you could have generalizable depth (works on any camera) but only as a relative ordering. Metric3D breaks this trade-off: metric depth that generalizes to any camera.
What does "affine-invariant depth" mean practically?

Chapter 3: Canonical Camera Transform

This is the core contribution of Metric3D. The idea is beautifully simple: pick one fixed "canonical" focal length fc, and transform every training image so that it behaves as if it were captured by a camera with that focal length. Once all images look like they came from the same camera, a standard depth model can learn metric depth without any ambiguity.

The authors propose two equivalent methods to achieve this.

Method 1: Transform the depth labels (CSTM label)

Leave the image unchanged. Instead, scale the ground-truth depth by the ratio of canonical-to-actual focal length:

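D*c = (fc / f) · D*

Here D* is the ground-truth depth, D*c is the same label expressed in canonical space, and f and fc are the actual and canonical focal lengths in pixels.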

If the actual camera has a longer focal length than the canonical one (f > fc), the depth labels are scaled down. This compensates for the fact that the longer lens makes objects appear larger (closer) than they would through the canonical lens.

At inference, reverse the transform: D = (f / fc) · Dc.
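
The label transform and its inverse are each a single multiplication. The sketch below assumes a canonical focal length of 1000 px purely for illustration; the function names are hypothetical.

```python
import numpy as np

CANONICAL_FOCAL_PX = 1000.0  # assumed value of fc, chosen for illustration


def to_canonical_depth(depth_m, focal_px, fc=CANONICAL_FOCAL_PX):
    """Training side: scale ground-truth depth into canonical space."""
    return (fc / focal_px) * np.asarray(depth_m, dtype=np.float64)


def from_canonical_depth(depth_c, focal_px, fc=CANONICAL_FOCAL_PX):
    """Inference side: map canonical-space depth back to metric depth."""
    return (focal_px / fc) * np.asarray(depth_c, dtype=np.float64)


gt = np.array([[2.0, 4.0], [6.0, 8.0]])               # meters, camera with f = 2000 px
canon = to_canonical_depth(gt, focal_px=2000.0)       # every label is halved
restored = from_canonical_depth(canon, focal_px=2000.0)
print(canon)
print(restored)
```

Because the two functions are exact inverses, no metric information is lost by training in canonical space.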

Method 2: Transform the input image (CSTM image)

Instead of modifying labels, resize the input image by the ratio ω = fc/f. A camera with focal length 2fc produces images that look "zoomed in", so we shrink the image by half; the resized image then has an effective focal length of fc in pixels. The depth labels are resized spatially to match but not scaled in value.

At inference, resize the predicted depth map back to the original resolution.
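
A minimal sketch of the image variant, assuming fc = 1000 px and a simple nearest-neighbour resize written in NumPy so the example stays self-contained. A real pipeline would more likely use bilinear interpolation for the RGB image; the helper names here are hypothetical.

```python
import numpy as np


def resize_nearest(array, scale):
    """Nearest-neighbour resize of an H x W (x C) array by a scalar factor."""
    h, w = array.shape[:2]
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    # For each output row/column, pick the source index it maps back to.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return array[rows[:, None], cols[None, :]]


def cstm_image(image, depth_m, focal_px, fc=1000.0):
    """Resize image and depth map by omega = fc / f; depth values stay in meters."""
    omega = fc / focal_px
    return resize_nearest(image, omega), resize_nearest(depth_m, omega), omega


image = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder RGB image
depth = np.full((480, 640), 3.0)                  # placeholder depth map, 3 m everywhere
image_c, depth_c, omega = cstm_image(image, depth, focal_px=2000.0)
print(image_c.shape, depth_c.shape, omega)
```

Note that only the spatial size changes: the depth map is still 3 m everywhere after the transform.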

Why both methods work: The ambiguity equation is d = f̂ · S / S'. Method 1 adjusts d (the label) to match fc. Method 2 adjusts S' (the imaging size via resizing) to match fc. Either way, the network sees a consistent relationship between image appearance and depth labels — as if everything came from one camera.

The analogy comes from human body reconstruction: to handle diverse poses, methods like SMPL map all body meshes to a canonical pose space. Metric3D does the same thing for cameras — mapping all images to a canonical camera space.

[Interactive figure: Canonical Camera Transform. The same scene at the same depth produces different imaging sizes under different focal lengths, and the canonical transform normalizes them. An orange box shows the canonical camera's view.]


De-canonical transformation at inference

At test time, the process reverses. You know the test camera's focal length (from EXIF metadata). Transform the input to canonical space, predict depth, then de-transform the depth back to the actual camera:

Step 1: Read focal length f from camera metadata (EXIF). Compute ratio ω = fc/f.
Step 2: Transform input: resize image by ω (CSTM image) or leave as-is (CSTM label).
Step 3: Run depth model → predict Dc in canonical space.
Step 4: De-canonical transform: D = Dc / ω (CSTM label) or resize back (CSTM image).
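
The four steps can be sketched for the CSTM-label variant as follows. The network is replaced by a hypothetical stand-in (`dummy_model`) that always predicts 1.5 m in canonical space, and fc = 1000 px is an assumed value; neither is taken from the paper's code.

```python
import numpy as np


def predict_metric_depth(image, focal_px, model, fc=1000.0):
    """CSTM-label inference: predict in canonical space, then de-canonicalize."""
    omega = fc / focal_px        # Step 1: ratio from the known focal length
    depth_c = model(image)       # Steps 2-3: image is left as-is, model predicts Dc
    return depth_c / omega       # Step 4: D = Dc / omega


def dummy_model(image):
    """Hypothetical stand-in for a trained network: 1.5 m everywhere in canonical space."""
    return np.full(image.shape[:2], 1.5)


image = np.zeros((4, 6, 3), dtype=np.uint8)
depth = predict_metric_depth(image, focal_px=2000.0, model=dummy_model)
print(depth[0, 0])
```

With f = 2000 px the ratio is 0.5, so a canonical prediction of 1.5 m corresponds to 3 m in the actual camera.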
In CSTM label, if the actual camera has f = 2000 px and the canonical camera has fc = 1000 px, how is the ground-truth depth D* transformed?

Chapter 4: Training at Scale

With the canonical camera transformation, Metric3D can finally do what was previously impossible: train one metric depth model on all available datasets simultaneously. The authors assemble a training set of staggering diversity:

Property | Value
Total images | 8 million+
Datasets merged | 11 public RGB-D datasets
Camera models | 10,000+ different cameras
Scene types | Indoor (rooms, offices) + Outdoor (driving, urban)
Depth sensors | Kinect, LiDAR, stereo, SfM, etc.

Without CSTM, training on this mix is a disaster. The model receives contradictory supervision — identical-looking images with different depth labels from different cameras — and cannot converge. The authors show that a naive baseline (same architecture, same data, no CSTM) completely fails to learn metric depth on zero-shot benchmarks.

Random Proposal Normalization Loss (RPNL)

The authors also introduce a clever training loss. The standard scale-shift invariant loss (used by MiDaS, LeReS) normalizes depth over the entire image. This works for global ordering but squeezes fine-grained local depth differences — especially between nearby objects.

RPNL fixes this by randomly cropping M = 32 small patches from the image and applying the scale-shift invariant loss on each patch independently. This forces the model to preserve local depth structure within small regions, not just global ordering.
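
A rough sketch of the patch-wise idea follows. Several details are assumptions made for illustration rather than facts stated in this text: each patch is normalized by its median and mean absolute deviation, patch sizes are drawn between 1/8 and 1/2 of the image side, the error is an L1 difference, and invalid-pixel masking is omitted.

```python
import numpy as np


def normalize_patch(d, eps=1e-6):
    """Remove a patch's own shift (median) and scale (mean absolute deviation)."""
    med = np.median(d)
    mad = np.mean(np.abs(d - med))
    return (d - med) / (mad + eps)


def rpnl(pred, gt, num_patches=32, min_frac=0.125, max_frac=0.5, rng=None):
    """Average L1 error between locally normalized prediction and ground truth."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = gt.shape
    losses = []
    for _ in range(num_patches):
        frac = rng.uniform(min_frac, max_frac)
        ph, pw = max(2, int(h * frac)), max(2, int(w * frac))
        top = rng.integers(0, h - ph + 1)    # upper bound is exclusive
        left = rng.integers(0, w - pw + 1)
        p = pred[top:top + ph, left:left + pw]
        g = gt[top:top + ph, left:left + pw]
        losses.append(np.mean(np.abs(normalize_patch(p) - normalize_patch(g))))
    return float(np.mean(losses))


rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 5.0, size=(64, 64))
loss_affine = rpnl(2.0 * gt + 1.0, gt, rng=rng)                        # same local structure
loss_noise = rpnl(rng.uniform(1.0, 5.0, size=(64, 64)), gt, rng=rng)   # unrelated structure
print(loss_affine, loss_noise)
```

Because each patch is normalized independently, small depth differences inside a patch are rescaled to fill the patch's own range instead of being compressed by the global depth range.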

Why local patches matter: Imagine a table with three objects at 1.2m, 1.3m, and 1.35m. Global normalization compresses these tiny differences. Patch-level normalization treats this local region as its own little depth map, preserving the fine-grained structure. The model learns to distinguish 1.2m from 1.35m, not just "all about 1.3m."

Dataset balancing

With 11 datasets of wildly different sizes (from thousands to millions of images), naive sampling would let large datasets dominate training. Following DiverseDepth, the authors balance all datasets in each mini-batch so that each dataset contributes an approximately equal fraction. This ensures the model learns from indoor Kinect data, outdoor LiDAR data, and everything in between.
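
One simple way to realize equal per-dataset contributions is to split each mini-batch evenly across datasets and sample within each one. The sketch below is an assumption about how such balancing could look, not the authors' sampler; dataset names and sizes are made up.

```python
import random


def balanced_batch(datasets, batch_size, rng=None):
    """Draw a batch in which every dataset contributes (almost) equally."""
    rng = rng or random.Random()
    names = list(datasets)
    per_dataset, remainder = divmod(batch_size, len(names))
    batch = []
    for i, name in enumerate(names):
        # Spread any remainder over the first few datasets.
        count = per_dataset + (1 if i < remainder else 0)
        batch.extend((name, rng.choice(datasets[name])) for _ in range(count))
    rng.shuffle(batch)
    return batch


# Hypothetical datasets, represented by their sample indices.
datasets = {
    "small_indoor": range(1_000),
    "medium_mixed": range(50_000),
    "large_outdoor": range(1_000_000),
}
batch = balanced_batch(datasets, batch_size=192, rng=random.Random(0))
print(len(batch))
```

With naive uniform sampling over all images, the largest dataset here would supply roughly 95% of every batch; the balanced version gives each dataset 64 of the 192 slots.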

Total loss

The final training objective combines four losses:

L = L_PWN + L_VNL + L_silog + L_RPNL

Where L_PWN is the pair-wise normal regression loss (surface smoothness), L_VNL is the virtual normal loss (geometric consistency), L_silog is the scale-invariant logarithmic loss (standard depth error), and L_RPNL is the random proposal normalization loss (local contrast).

What happens when you train a metric depth model on mixed datasets without canonical camera transformation?

Chapter 5: The Architecture

Metric3D's architecture is refreshingly simple. The canonical camera transformation is the innovation — the depth model itself is a standard encoder-decoder that can be plugged into any existing architecture.

Encoder: ConvNeXt-Large

The backbone is ConvNeXt-Large, a modern convolutional network pretrained on ImageNet-22K. It extracts multi-scale feature maps from the input image (after canonical transformation). The choice of ConvNeXt over a Vision Transformer is practical — it handles arbitrary input resolutions without interpolating positional embeddings.

Decoder: UNet

A UNet decoder progressively upsamples the encoder features, using skip connections to preserve fine spatial detail. The final output is a dense depth map at the input resolution, predicting one depth value per pixel.

The CSTM module wraps the model

The canonical camera space transformation module (CSTM) sits outside the depth model. It's a preprocessing step, not an architectural component. This means CSTM can be applied to any monocular depth architecture — ConvNeXt, ViT, DPT, anything. The depth model never sees camera intrinsics; it simply predicts depth in canonical space.

Input: RGB image I + camera intrinsics (f, u0, v0)
CSTM: Transform I → Ic (canonical space). Compute ω = fc/f.
Encoder: ConvNeXt-Large extracts multi-scale features from Ic
Decoder: UNet decoder with skip connections produces Dc
De-CSTM: Transform Dc → D (metric depth in original camera space)
The power of simplicity: Camera intrinsics are not encoded in the network (unlike CamConv, which tries to learn camera awareness from data). Instead, CSTM handles camera normalization explicitly as a geometric transform. This requires no additional network capacity and, because the transform is exact for any known focal length, extends to unseen cameras at test time without retraining.

Training details

Setting | Value
Backbone | ConvNeXt-Large (ImageNet-22K pretrained)
Optimizer | AdamW, lr = 0.0001, polynomial decay (power 0.9)
Batch size | 192
Training resolution | 512 × 960 (random crop after CSTM)
GPUs | 48 × A100
Iterations | 500K
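
For reference, polynomial decay is commonly written as lr = lr0 · (1 − t/T)^power. The sketch below plugs in the settings from the table; the exact formula the authors used is not given here, so treat this common form as an assumption.

```python
def poly_lr(iteration, base_lr=1e-4, max_iters=500_000, power=0.9):
    """Polynomial learning-rate decay from base_lr at step 0 to 0 at max_iters."""
    progress = min(iteration, max_iters) / max_iters
    return base_lr * (1.0 - progress) ** power


for step in (0, 250_000, 500_000):
    print(step, poly_lr(step))
```

With power 0.9 the schedule is close to linear, dropping slightly faster than linear early on and reaching zero at the final iteration.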
Why is CSTM implemented as a preprocessing wrapper rather than encoding camera intrinsics inside the neural network?

Chapter 6: From Depth to 3D

A metric depth map is powerful, but it's still a 2D image where each pixel stores a distance. To get actual 3D coordinates — a point cloud you can rotate, measure, and reconstruct — you need to backproject each pixel into 3D space using the camera intrinsics.

The backprojection equation

For each pixel (u, v) with predicted metric depth d, the 3D point (X, Y, Z) in camera coordinates is:

X = (u − u0) · d / f
Y = (v − v0) · d / f
Z = d

Where f is the focal length and (u0, v0) is the principal point (usually the image center). Each pixel defines a ray from the camera; the depth tells us where along that ray the surface lies.
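
The backprojection equations translate directly into a few lines of NumPy. This is a minimal sketch assuming a single focal length for both axes (as in the equations above) and a dense depth map with no invalid pixels; the function name is hypothetical.

```python
import numpy as np


def backproject(depth_m, focal_px, u0, v0):
    """Turn an H x W metric depth map into an (H*W) x 3 point cloud in camera coordinates."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    x = (u - u0) * depth_m / focal_px
    y = (v - v0) * depth_m / focal_px
    return np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)


depth = np.full((480, 640), 2.0)   # a flat wall 2 m from the camera
points = backproject(depth, focal_px=800.0, u0=320.0, v0=240.0)
print(points.shape)
print(points[0])                   # top-left pixel
```

The pixel at the principal point lands on the optical axis at (0, 0, 2), while the top-left pixel lands 0.8 m to the left and 0.6 m above it at the same depth.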

Why metric depth is essential: Backprojection with affine-invariant depth (d' = a·d + b) gives you a 3D shape that is wrong in two ways: the unknown scale a stretches the whole point cloud, and the unknown shift b distorts its shape along the viewing rays, like a rubber model of the scene. With metric depth, you get the actual 3D structure at the correct scale. The table is 1.2m wide. The room is 4m deep. The chair is 0.8m tall. Real measurements.

Multi-frame reconstruction

With per-frame metric depth and known camera poses (from SfM or SLAM), you can fuse point clouds from multiple viewpoints into a single, dense 3D reconstruction. Because all frames share the same metric scale (thanks to CSTM), the point clouds align naturally without per-frame scale adjustment.
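
Fusion with metric depth reduces to applying each frame's pose and concatenating. The sketch below assumes camera-to-world poses given as a rotation matrix R and translation t, so that X_world = R · X_cam + t; the function names and the two-frame example are illustrative.

```python
import numpy as np


def to_world(points_cam, rotation, translation):
    """Map N x 3 camera-frame points into the world frame: X_w = R X_c + t."""
    return points_cam @ rotation.T + translation


def fuse(frames):
    """Concatenate per-frame point clouds after moving each into the world frame."""
    return np.concatenate([to_world(p, r, t) for p, r, t in frames], axis=0)


identity = np.eye(3)
# Two cameras, 1 m apart along x, observing the same surface point.
frame_a = (np.array([[0.5, 0.0, 2.0]]), identity, np.array([0.0, 0.0, 0.0]))
frame_b = (np.array([[-0.5, 0.0, 2.0]]), identity, np.array([1.0, 0.0, 0.0]))

cloud = fuse([frame_a, frame_b])
print(cloud)
```

Both observations land on the same world point without any per-frame scale factor, which is only possible because the depths are already in meters.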

This is a massive advantage over affine-invariant methods. With LeReS or DPT, each frame has its own unknown scale, so you must solve for a per-frame alignment factor before merging — and errors accumulate rapidly.

[Interactive figure: Backprojection Visualizer. A simulated image with three objects at different depths, backprojected to 3D for a range of focal lengths. With metric depth, the 3D structure is correct regardless of focal length.]

Why does multi-frame 3D reconstruction work naturally with metric depth but not with affine-invariant depth?

Chapter 7: Results

Metric3D is evaluated in two modes: (1) as a metric depth model, directly compared to specialist methods on their home benchmarks, and (2) as a generalizable depth model, compared to affine-invariant methods on diverse zero-shot benchmarks. It excels at both.

Zero-shot metric depth

The most remarkable result: Metric3D — trained on mixed data from 11 datasets — achieves performance comparable to or better than methods trained specifically on each benchmark.

Method | Training | NYU AbsRel ↓ | KITTI AbsRel ↓
AdaBins | NYU / KITTI only | 0.103 | 0.058
NeWCRFs | NYU / KITTI only | 0.095 | 0.052
Metric3D (label) | 11 datasets, zero-shot | 0.083 | 0.058
Metric3D (image) | 11 datasets, zero-shot | 0.092 | 0.060

On NYU, Metric3D CSTM-label achieves 0.083 AbsRel, 12.6% lower than NeWCRFs (0.095), despite never having seen NYU during training. On KITTI, its 0.058 matches AdaBins and trails NeWCRFs (0.052) by a small margin.
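
AbsRel is the mean of |prediction − ground truth| / ground truth over valid pixels. The sketch below computes it on a toy example and reproduces the 12.6% figure from the NYU column; treating ground truth of 0 as "no measurement" is an assumption about the masking convention.

```python
import numpy as np


def abs_rel(pred, gt):
    """Mean absolute relative error over pixels that have ground truth (gt > 0)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def relative_improvement(baseline, ours):
    """Fraction by which `ours` lowers the baseline error."""
    return (baseline - ours) / baseline


gt = np.array([2.0, 4.0, 0.0, 5.0])      # 0.0 marks a pixel without ground truth
pred = np.array([2.2, 3.6, 1.0, 5.0])
score = abs_rel(pred, gt)                # errors of 10%, 10%, and 0% on valid pixels
gain = relative_improvement(0.095, 0.083)  # NeWCRFs vs Metric3D (label) on NYU
print(score, gain)
```

The improvement works out to 0.012 / 0.095, or about 12.6%.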

Generalization to unseen cameras

The real stress test: 6 benchmarks with camera models never seen during training. While specialist methods (trained on NYU or KITTI) degrade severely on these unseen cameras, Metric3D maintains strong performance:

Method | DIODE Indoor | 7Scenes | ETH3D | NuScenes
AdaBins | 0.443 | 0.218 | 1.271 | 0.445
NeWCRFs | 0.404 | 0.240 | 0.890 | 0.400
Metric3D (label) | 0.252 | 0.183 | 0.416 | 0.154

On NuScenes (outdoor driving), Metric3D cuts the error by more than 60% compared to NeWCRFs (0.154 vs 0.400).

[Figure: Zero-shot AbsRel Comparison. AbsRel (lower is better) across 4 benchmarks, Metric3D (orange) vs specialist methods (blue/purple). Metric3D generalizes dramatically better to unseen cameras.]

Ablation: CSTM is essential

Without CSTM (naive mixed-data training), the model completely fails to predict metric depth on zero-shot benchmarks. Both CSTM variants (label and image) enable metric prediction, with CSTM-label slightly outperforming CSTM-image on most benchmarks.

Championship result: Metric3D won the 2nd Monocular Depth Estimation Challenge, validating its state-of-the-art accuracy on a competitive public benchmark.
What is the most surprising result about Metric3D's performance on NYU?

Chapter 8: Applications

Metric depth unlocks applications that relative depth simply cannot support. The authors demonstrate three compelling use cases.

1. Metrology in the wild

Take a photo of a table with your phone. Read the focal length from the photo's EXIF metadata. Run Metric3D. Backproject to 3D. Measure the table's width.

The authors do exactly this with an iPhone 12 Pro Max and a Xiaomi 9 Android phone. The measured table sizes (from Metric3D's 3D reconstruction) match the real-world ground truth to within a few centimeters. This is impossible with affine-invariant depth — LeReS cannot even attempt a measurement because it has no scale.

Phone as a measuring tape: With Metric3D and a phone camera, you can measure furniture dimensions from a single photo. The EXIF metadata provides the focal length, CSTM handles the rest. No LiDAR, no calibration target, no special hardware.
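
The measurement itself is just the distance between two backprojected pixels. The sketch below uses made-up numbers (a 4000 × 3000 photo, f = 3000 px, both table edges at 1.5 m) and hypothetical function names; it assumes the principal point sits at the image center.

```python
import numpy as np


def pixel_to_point(u, v, depth_m, focal_px, u0, v0):
    """Backproject one pixel with known metric depth to a 3D point in meters."""
    return np.array([
        (u - u0) * depth_m / focal_px,
        (v - v0) * depth_m / focal_px,
        depth_m,
    ])


def measure_distance_m(pixel_a, pixel_b, focal_px, u0, v0):
    """Euclidean distance between two (u, v, depth) observations."""
    point_a = pixel_to_point(*pixel_a, focal_px, u0, v0)
    point_b = pixel_to_point(*pixel_b, focal_px, u0, v0)
    return float(np.linalg.norm(point_a - point_b))


# Left and right table edges as (u, v, predicted depth in meters); values are illustrative.
left_edge = (1000.0, 1500.0, 1.5)
right_edge = (3000.0, 1500.0, 1.5)
width = measure_distance_m(left_edge, right_edge, focal_px=3000.0, u0=2000.0, v0=1500.0)
print(width)
```

Here 2000 pixels of separation at 1.5 m with a 3000 px focal length correspond to a 1 m wide table. With affine-invariant depth the same computation would return a number in arbitrary units.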

2. Dense SLAM mapping

Monocular SLAM systems like DROID-SLAM suffer from scale drift: over long trajectories, the estimated scale gradually deviates from reality. This happens because monocular SLAM can recover camera motion and scene geometry only up to an unknown scale.

By feeding Metric3D's per-frame metric depth into DROID-SLAM, the system gains a metric prior for every frame, which largely removes the scale drift. On KITTI, the translation drift decreases substantially.

Method | Transl. drift (trel) ↓
DROID-SLAM (original) | significant scale drift
DROID-SLAM + Metric3D | greatly reduced (metric scale recovered)

3. 3D scene reconstruction

On 9 unseen NYU scenes, Metric3D produces 3D reconstructions that significantly outperform all baselines in Chamfer distance and F-score. Unlike LeReS (which requires per-frame scale alignment to ground truth), Metric3D's frames fuse directly because they all share the same metric scale.

The practical impact: Metric3D turns a phone into a 3D scanner. Take photos from different viewpoints, predict metric depth for each, fuse the point clouds with known poses, and you get a metrically accurate 3D model of the scene. No depth sensor required.
How does Metric3D improve monocular SLAM?

Chapter 9: Connections

Metric3D sits at a critical junction in the evolution of monocular depth estimation. Let's map the landscape.

Relation to MiDaS / DPT

MiDaS and DPT pioneered mixed-dataset training for monocular depth, but they gave up on metric depth in exchange for generalization. They learn affine-invariant depth using scale-shift invariant losses. Metric3D shows that with canonical camera transformation, you don't have to choose — you get both metric accuracy and cross-camera generalization.

Relation to Metric3D v2

The follow-up work extends Metric3D with a ViT-based backbone (ViT-giant), improved losses, and the ability to predict both metric depth and surface normals. It achieves even stronger zero-shot performance and sets new state-of-the-art results on many benchmarks.

Relation to Depth Anything / Depth Anything v2

Depth Anything and Depth Anything v2 scale up training with large amounts of unlabeled images, using pseudo-labels from a teacher model, to learn powerful depth features, then fine-tune for metric depth on specific datasets. Metric3D's canonical camera transformation is complementary: it could in principle be combined with such features to improve zero-shot metric depth.

Relation to GeoWizard

GeoWizard uses diffusion models to jointly predict depth and surface normals, achieving strong generalization. Like Metric3D, it aims for metric-aware predictions, but through a generative modeling approach rather than an explicit camera transformation.

Cheat Sheet

Aspect | Metric3D
Input | Single RGB image + camera intrinsics (focal length)
Output | Dense metric depth map (real meters)
Key innovation | Canonical camera space transformation (CSTM)
Backbone | ConvNeXt-Large (ImageNet-22K)
Decoder | UNet with skip connections
Training data | 8M+ images, 11 datasets, 10K+ cameras
Novel loss | Random Proposal Normalization Loss (RPNL)
Key result | Zero-shot metric depth competitive with or better than per-dataset specialists
Achievement | 1st place, 2nd Monocular Depth Estimation Challenge
The broader lesson: When a problem has a clean geometric structure — like the focal length ambiguity in metric depth — an explicit geometric solution (CSTM) beats trying to learn it from data (CamConv). Know your geometry, use it directly, and let the neural network focus on what only learning can solve: understanding scene semantics and layout.
What is the fundamental insight that distinguishes Metric3D from affine-invariant methods like MiDaS?