Metric3D — Veanors

Chapter 0: The Problem

You take a photo of a table with your iPhone. A monocular depth model tells you the table is "closer than the wall behind it." Great — but how far away is the table? 1.5 meters? 3 meters? The model has no idea. It outputs relative depth: a ranking of near-to-far, with no real-world scale.

This is useless for the things you actually want to do. A robot arm needs to know the table is exactly 1.2 meters away to place a cup on it. A SLAM system needs metric scale to build a map that doesn't drift. An AR app needs real-world measurements to overlay furniture that fits your room.

The core gap: State-of-the-art depth models like MiDaS and LeReS can predict beautiful depth maps that generalize across scenes — but the depth values are only accurate up to an unknown scale and shift. They predict d' = a·d + b, where a and b are different for every image. You can't measure anything in meters.

Why not just train on metric depth? You can — but only if you use a single camera. Methods like AdaBins or NeWCRFs achieve excellent metric accuracy on NYU (Kinect) or KITTI (LiDAR), but they fail catastrophically on images from different cameras. A model trained on Kinect data (focal length ~525 pixels) gives nonsensical metric predictions on iPhone photos (focal length ~3000 pixels).

The dream: a single model that predicts metric depth — real meters — from any camera, including ones it has never seen during training. That's what Metric3D achieves.

[Interactive figure: Relative vs Metric Depth. A scene with three objects at known distances, shown either as relative depth (only the ordering is preserved) or as metric depth (actual meters).]

Why can't existing metric depth models (trained on one dataset) generalize to images from different cameras?

Because the relationship between image appearance and metric depth depends on the camera's focal length: different cameras produce identical-looking images at different real-world distances
Because different cameras have different color profiles
Because metric depth requires stereo cameras

Chapter 1: The Key Insight

Imagine you're standing 2 meters from a chair and you photograph it with a 26mm lens. Now swap to a 52mm lens and stand 4 meters away. The chair looks exactly the same size in both photos. Same pixels, same appearance — but the real depth is completely different: 2m vs 4m.

This is the focal length ambiguity. A neural network that only sees pixels cannot distinguish these two situations. If you train it on both images with different depth labels (2m and 4m), the network receives contradictory supervision for identical-looking inputs. It cannot converge. This is precisely why mixed-dataset metric depth training fails.

The formal relationship: For a pinhole camera, the imaging equation is d = f̂ · S / S', where d is depth, f̂ is focal length, S is real-world size, and S' is imaging size on the sensor. Two cameras with focal lengths f̂₁ = 2f̂₂, viewing the same object from distances d₁ = 2d₂, produce the exact same image. The depth is entangled with the focal length.
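
A quick numeric check of this entanglement, as a minimal sketch. The helper name `imaging_size` and the sample values (a 0.5 m wide chair, focal lengths expressed in pixels) are illustrative assumptions, not taken from the paper.

```python
def imaging_size(focal_px: float, real_size_m: float, depth_m: float) -> float:
    """Pinhole projection rearranged from d = f * S / S': S' = f * S / d."""
    return focal_px * real_size_m / depth_m


chair_width_m = 0.5  # assumed object size

# Camera 1: focal length 2000 px, chair at 4 m.
size_1 = imaging_size(2000.0, chair_width_m, 4.0)
# Camera 2: half the focal length, half the distance.
size_2 = imaging_size(1000.0, chair_width_m, 2.0)

print(size_1, size_2)  # 250.0 250.0
```

Both cameras image the chair at the same width in pixels, so the pixels alone cannot say whether the chair is 2 m or 4 m away.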

What about other camera parameters? The authors make two critical observations:

Sensor size only affects the field of view (how much of the scene you see), not the metric depth relationship. It's irrelevant.
Pixel size changes the image resolution but not the depth relationship. Two sensors with different pixel sizes but the same focal length yield the same metric depth for the same imaging size.

The conclusion is sharp: focal length is the only camera parameter that creates metric ambiguity. If we can somehow remove the effect of varying focal lengths from training, we can train on mixed datasets and get metric depth for free.

The key insight: The metric depth ambiguity in mixed-camera training comes entirely from unknown focal lengths. If we normalize all training images to behave as though they were captured by the same "canonical" camera with a fixed focal length, the ambiguity vanishes. One model can then learn metric depth from millions of images across thousands of cameras.

Why does the focal length — and not sensor size or pixel size — create metric depth ambiguity?

Because focal length determines image resolution
Because sensor size is always the same across cameras
Because changing the focal length changes how real-world depth maps to imaging size: two different (focal, depth) pairs can produce identical images, creating contradictory training labels

Chapter 2: Affine-Invariant Depth

Before Metric3D, the best approach to mixed-camera training was to give up on metric depth entirely. Methods like MiDaS, LeReS, and DPT learn affine-invariant depth: depth predictions that are accurate only up to an unknown scale and shift.

Concretely, instead of predicting the true depth d, these models predict:

d' = a · d + b

where a (scale) and b (shift) are unknown constants that differ for every image. During training, the loss function aligns predictions to ground truth using a scale-shift invariant loss: it first computes the best-fit a and b, then measures the error after alignment. This way, the model is never penalized for getting the scale wrong — it only needs to get the shape of the depth map right.
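
The alignment step can be sketched with a closed-form least-squares fit. This is a simplified illustration: the function names are hypothetical, and MiDaS-style losses use more robust variants than the plain least squares and mean absolute error assumed here.

```python
def align_scale_shift(pred, gt):
    """Least-squares a, b that minimize sum((a * p + b - g) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    var = sum((p - mean_p) ** 2 for p in pred)
    a = cov / var
    b = mean_g - a * mean_p
    return a, b


def ssi_error(pred, gt):
    """Mean absolute error measured after the best-fit scale and shift."""
    a, b = align_scale_shift(pred, gt)
    return sum(abs(a * p + b - g) for p, g in zip(pred, gt)) / len(pred)


gt = [1.0, 2.0, 4.0, 8.0]
pred = [0.3 * g + 0.5 for g in gt]  # right shape, wrong scale and shift
print(ssi_error(pred, gt))  # approximately 0 (floating-point noise only)
```

A prediction that is wrong by any scale and shift scores as well as a perfect one, which is exactly why such a model never learns meters.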

Why this works for generalization

By decoupling scale and shift, these methods sidestep the focal length ambiguity entirely. The network doesn't need to know the focal length because it's not responsible for predicting absolute depth. This lets you train on millions of images from diverse cameras — which is why MiDaS generalizes beautifully to internet photos.

Why this fails for applications

The price is steep. Affine-invariant depth is fundamentally unsuitable for any task that needs real-world measurements:

Robotics: A robot needs to know the table is 1.2m away, not "closer than the wall."
3D reconstruction: Fusing depth maps from multiple views requires consistent metric scale. Affine-invariant depth gives each frame a different scale, causing catastrophic distortions.
SLAM: Scale drift is a well-known failure mode of monocular SLAM. Affine-invariant depth doesn't help, because it introduces its own unknown scale per frame.
Metrology: Measuring the size of a table from a photo is impossible if you don't know the depth scale.

The trade-off before Metric3D: You could have metric depth (accurate meters) but only on one camera. Or you could have generalizable depth (works on any camera) but only as a relative ordering. Metric3D breaks this trade-off: metric depth that generalizes to any camera.

What does "affine-invariant depth" mean practically?

The model predicts depth in meters but with a fixed offset
The predicted depth is accurate only up to an unknown scale (a) and shift (b), d' = a·d + b, so you can compare relative distances within one image but not measure real-world meters
The depth map is invariant to image rotation

Chapter 3: Canonical Camera Transform

This is the core contribution of Metric3D. The idea is beautifully simple: pick one fixed "canonical" focal length f^c, and transform every training image so that it behaves as if it were captured by a camera with that focal length. Once all images look like they came from the same camera, a standard depth model can learn metric depth without any ambiguity.

The authors propose two equivalent methods to achieve this.

Method 1: Transform the depth labels (CSTM label)

Leave the image unchanged. Instead, scale the ground-truth depth by the ratio of canonical-to-actual focal length:

D^*_c = (f^c / f) · D^*

If the actual camera has a longer focal length than the canonical one (f > f^c), the depth labels are scaled down. This compensates for the fact that the longer lens makes objects appear larger (closer) than they would through the canonical lens.

At inference, reverse the transform: D = (f / f^c) · D_c.
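
The label transform and its inverse are a single multiplication each. A minimal sketch, assuming a canonical focal length of 1000 px (an illustrative value) and hypothetical function names; depth maps are plain nested lists to keep the example dependency-free.

```python
CANONICAL_FOCAL_PX = 1000.0  # assumed value of f^c, for illustration only


def to_canonical_depth(depth_m, focal_px, canonical_focal_px=CANONICAL_FOCAL_PX):
    """Training side: D*_c = (f^c / f) * D*."""
    omega = canonical_focal_px / focal_px
    return [[omega * d for d in row] for row in depth_m]


def from_canonical_depth(depth_c, focal_px, canonical_focal_px=CANONICAL_FOCAL_PX):
    """Inference side: D = (f / f^c) * D_c."""
    omega = canonical_focal_px / focal_px
    return [[d / omega for d in row] for row in depth_c]


labels = [[2.0, 4.0], [3.0, 6.0]]            # ground truth in meters
canonical = to_canonical_depth(labels, focal_px=2000.0)
restored = from_canonical_depth(canonical, focal_px=2000.0)
print(canonical)  # [[1.0, 2.0], [1.5, 3.0]]
print(restored)   # [[2.0, 4.0], [3.0, 6.0]]
```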

Method 2: Transform the input image (CSTM image)

Instead of modifying labels, resize the input image by the ratio ω = f^c/f. A camera with focal length 2f^c produces images that look "zoomed in" relative to the canonical camera, so we shrink the image by half. After resizing, objects occupy the same number of pixels they would through the canonical camera (the field of view itself is unchanged). The depth labels are resized spatially to match, but their values are not scaled.

At inference, resize the predicted depth map back to the original resolution.
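
A rough sketch of the image-side variant, showing only the geometry: how the target size follows from ω, and that the depth labels are resampled without changing their values. The nearest-neighbour resize, the canonical focal length of 1000 px, and both function names are assumptions made for illustration; a real pipeline would use an image library for resizing.

```python
def canonical_size(height, width, focal_px, canonical_focal_px=1000.0):
    """Target size after resizing by omega = f^c / f."""
    omega = canonical_focal_px / focal_px
    return max(1, round(height * omega)), max(1, round(width * omega))


def resize_nearest(grid, new_h, new_w):
    """Nearest-neighbour resize of a 2D list; values are copied, never rescaled."""
    old_h, old_w = len(grid), len(grid[0])
    return [
        [
            grid[min(old_h - 1, int(r * old_h / new_h))][min(old_w - 1, int(c * old_w / new_w))]
            for c in range(new_w)
        ]
        for r in range(new_h)
    ]


depth = [
    [2.0, 2.0, 5.0, 5.0],
    [2.0, 2.0, 5.0, 5.0],
    [3.0, 3.0, 7.0, 7.0],
    [3.0, 3.0, 7.0, 7.0],
]
new_h, new_w = canonical_size(4, 4, focal_px=2000.0)  # omega = 0.5
small = resize_nearest(depth, new_h, new_w)
print(new_h, new_w, small)  # 2 2 [[2.0, 5.0], [3.0, 7.0]]
```

The label map gets smaller, but a pixel that was 5 m away is still labeled 5 m.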

Why both methods work: The ambiguity equation is d = f̂ · S / S'. Method 1 adjusts d (the label) to match f^c. Method 2 adjusts S' (the imaging size via resizing) to match f^c. Either way, the network sees a consistent relationship between image appearance and depth labels — as if everything came from one camera.
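
Written out with ω = f^c/f, the two methods land on the same canonical relationship. The symbols d_c and S'_c (canonical-space depth and imaging size) are notation introduced here for illustration, and f stands for the focal length written f̂ in the imaging equation above.

```latex
\begin{aligned}
\text{Pinhole model:}\quad
  & d = \frac{f\,S}{S'} \\[4pt]
\text{Method 1 (scale the label):}\quad
  & d_c = \omega\,d = \frac{f^c}{f}\cdot\frac{f\,S}{S'} = \frac{f^c\,S}{S'} \\[4pt]
\text{Method 2 (resize the image):}\quad
  & S'_c = \omega\,S'
    \;\Rightarrow\;
    d = \frac{f\,S}{S'} = \frac{f\,\omega\,S}{S'_c} = \frac{f^c\,S}{S'_c}
\end{aligned}
```

In both cases the pair the network sees (imaging size, depth label) obeys the pinhole equation of a camera with focal length f^c.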

The analogy comes from human body reconstruction: to handle diverse poses, methods like SMPL map all body meshes to a canonical pose space. Metric3D does the same thing for cameras — mapping all images to a canonical camera space.

[Interactive figure: Canonical Camera Transform. The same scene at the same depth produces different imaging sizes under different focal lengths, and the canonical transform normalizes them. Controls: focal length f (default 1000 px) and object depth (default 3.0 m).]

De-canonical transformation at inference

At test time, the process reverses. You need the test camera's focal length in pixels (for phone photos it can usually be derived from EXIF metadata). Transform the input to canonical space, predict depth, then de-transform the depth back to the actual camera:

Step 1

Read focal length f from camera metadata (EXIF). Compute ratio ω = f^c/f.

↓

Step 2

Transform input: resize image by ω (CSTM image) or leave as-is (CSTM label).

↓

Step 3

Run depth model → predict D_c in canonical space.

↓

Step 4

De-canonical transform: D = D_c / ω (CSTM label) or resize back (CSTM image).
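
The four steps above can be sketched for the CSTM-label variant as follows. The `dummy_depth_model` is a stand-in that returns a constant canonical depth, not a trained network, and the canonical focal length of 1000 px is an assumed value.

```python
CANONICAL_FOCAL_PX = 1000.0  # assumed f^c


def dummy_depth_model(image):
    """Stand-in for a trained network: predicts a canonical depth of 1.5 everywhere."""
    return [[1.5 for _ in row] for row in image]


def predict_metric_depth(image, focal_px, model=dummy_depth_model):
    """CSTM-label inference: the image is untouched, only the output is rescaled."""
    omega = CANONICAL_FOCAL_PX / focal_px                 # Step 1: omega = f^c / f
    depth_c = model(image)                                # Steps 2-3: predict D_c
    return [[d / omega for d in row] for row in depth_c]  # Step 4: D = D_c / omega


image = [[0, 0, 0], [0, 0, 0]]  # placeholder 2x3 "image"
depth = predict_metric_depth(image, focal_px=2000.0)
print(depth[0][0])  # 3.0
```

The model itself never receives the focal length; it enters only through ω before and after the forward pass.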

In CSTM label, if the actual camera has f = 2000 px and the canonical camera has f^c = 1000 px, how is the ground-truth depth D* transformed?

D*_c = (1000/2000) · D* = 0.5 · D*: the depth labels are halved because the longer lens makes objects appear as if they're half as far in canonical space
D*_c = 2 · D*: the depth labels are doubled
D*_c = D*: the labels are unchanged

Chapter 4: Training at Scale

With the canonical camera transformation, Metric3D can finally do what was previously impossible: train one metric depth model on all available datasets simultaneously. The authors assemble a training set of staggering diversity:

Property	Value
Total images	8 million+
Datasets merged	11 public RGB-D datasets
Camera models	10,000+ different cameras
Scene types	Indoor (rooms, offices) + Outdoor (driving, urban)
Depth sensors	Kinect, LiDAR, stereo, SfM, etc.

Without CSTM, training on this mix is a disaster. The model receives contradictory supervision — identical-looking images with different depth labels from different cameras — and cannot converge. The authors show that a naive baseline (same architecture, same data, no CSTM) completely fails to learn metric depth on zero-shot benchmarks.

Random Proposal Normalization Loss (RPNL)

The authors also introduce a clever training loss. The standard scale-shift invariant loss (used by MiDaS, LeReS) normalizes depth over the entire image. This works for global ordering but squeezes fine-grained local depth differences — especially between nearby objects.

RPNL fixes this by randomly cropping M = 32 small patches from the image and applying the scale-shift invariant loss on each patch independently. This forces the model to preserve local depth structure within small regions, not just global ordering.

Why local patches matter: Imagine a table with three objects at 1.2m, 1.3m, and 1.35m. Global normalization compresses these tiny differences. Patch-level normalization treats this local region as its own little depth map, preserving the fine-grained structure. The model learns to distinguish 1.2m from 1.35m, not just "all about 1.3m."
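
A rough sketch of the idea in plain Python. M = 32 comes from the text above; everything else is an assumption made for illustration: the patch size range, the normalization (median and mean absolute deviation, in the style of MiDaS), the L1 comparison, and the function names.

```python
import random
from statistics import median


def mad_normalize(values):
    """Subtract the median, then divide by the mean absolute deviation from it."""
    med = median(values)
    mad = sum(abs(v - med) for v in values) / len(values)
    return [(v - med) / (mad + 1e-6) for v in values]


def rpnl(pred, gt, num_patches=32, rng=None):
    """Average per-patch error after normalizing each patch independently."""
    rng = rng or random.Random(0)
    h, w = len(gt), len(gt[0])
    total = 0.0
    for _ in range(num_patches):
        # Patch sides between roughly 1/8 and 1/2 of the image (assumed range).
        ph = max(2, rng.randint(h // 8, h // 2))
        pw = max(2, rng.randint(w // 8, w // 2))
        top, left = rng.randint(0, h - ph), rng.randint(0, w - pw)
        cells = [(r, c) for r in range(top, top + ph) for c in range(left, left + pw)]
        p_n = mad_normalize([pred[r][c] for r, c in cells])
        g_n = mad_normalize([gt[r][c] for r, c in cells])
        total += sum(abs(a - b) for a, b in zip(p_n, g_n)) / len(cells)
    return total / num_patches


gt = [[1.0 + 0.05 * r + 0.1 * c for c in range(16)] for r in range(16)]
warped = [[d ** 3 for d in row] for row in gt]  # wrong local structure
print(rpnl(gt, gt), rpnl(warped, gt) > 0.0)  # 0.0 True
```

Because each patch is normalized on its own, a small depth step inside a patch is stretched to full range instead of being averaged away by the rest of the image.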

Dataset balancing

With 11 datasets of wildly different sizes (from thousands to millions of images), naive sampling would let large datasets dominate training. Following DiverseDepth, the authors balance all datasets in each mini-batch so that each dataset contributes an approximately equal fraction. This ensures the model learns from indoor Kinect data, outdoor LiDAR data, and everything in between.
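
One simple way to get this balance is to draw the same number of samples from every dataset for each mini-batch. The sketch below assumes that strategy; the dataset names and sizes are hypothetical, and the actual sampler used by the authors may differ.

```python
import random


def balanced_batch(datasets, batch_size, rng):
    """Draw batch_size // len(datasets) samples from each dataset."""
    names = sorted(datasets)
    per_dataset = batch_size // len(names)
    batch = []
    for name in names:
        # Sample with replacement so small datasets can still fill their share.
        batch += [(name, rng.choice(datasets[name])) for _ in range(per_dataset)]
    rng.shuffle(batch)
    return batch


datasets = {
    "indoor_small": list(range(1_000)),      # hypothetical dataset sizes
    "outdoor_medium": list(range(20_000)),
    "driving_large": list(range(100_000)),
}
batch = balanced_batch(datasets, batch_size=12, rng=random.Random(0))
counts = {name: sum(1 for n, _ in batch if n == name) for name in datasets}
print(counts)  # every dataset contributes 4 samples
```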

Total loss

The final training objective combines four losses:

L = L_PWN + L_VNL + L_silog + L_RPNL

Where L_PWN is the pair-wise normal regression loss (surface smoothness), L_VNL is the virtual normal loss (geometric consistency), L_silog is the scale-invariant logarithmic loss (standard depth error), and L_RPNL is the random proposal normalization loss (local contrast).
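
Of the four terms, the scale-invariant log loss is the easiest to write down. The sketch below uses the common form sqrt(mean(g²) − λ·mean(g)²) with g = log d − log d*; the value λ = 0.85 is a widely used default and an assumption here, not a figure from the text. The unweighted sum in `total_loss` follows the equation above.

```python
import math


def silog_loss(pred, gt, lam=0.85):
    """Scale-invariant log loss over pixels with valid (positive) ground truth."""
    g = [math.log(p) - math.log(t) for p, t in zip(pred, gt) if t > 0]
    mean_sq = sum(x * x for x in g) / len(g)
    mean = sum(g) / len(g)
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


def total_loss(l_pwn, l_vnl, l_silog, l_rpnl):
    """L = L_PWN + L_VNL + L_silog + L_RPNL, with unit weights."""
    return l_pwn + l_vnl + l_silog + l_rpnl


gt = [1.0, 2.0, 4.0]
print(silog_loss(gt, gt))                   # 0.0
print(silog_loss([2.0, 4.0, 8.0], gt) > 0)  # True: with lam < 1, a global scale error is penalized
```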

What happens when you train a metric depth model on mixed datasets without canonical camera transformation?

The model fails to converge because identical-looking images from different cameras have contradictory depth labels: the focal length ambiguity creates irreconcilable supervision conflicts
The model trains normally but is slightly less accurate
The model learns affine-invariant depth instead of metric depth

Chapter 5: The Architecture

Metric3D's architecture is refreshingly simple. The canonical camera transformation is the innovation — the depth model itself is a standard encoder-decoder that can be plugged into any existing architecture.

Encoder: ConvNeXt-Large

The backbone is ConvNeXt-Large, a modern convolutional network pretrained on ImageNet-22K. It extracts multi-scale feature maps from the input image (after canonical transformation). The choice of ConvNeXt over a Vision Transformer is practical — it handles arbitrary input resolutions without interpolating positional embeddings.

Decoder: UNet

A UNet decoder progressively upsamples the encoder features, using skip connections to preserve fine spatial detail. The final output is a dense depth map at the input resolution, predicting one depth value per pixel.

The CSTM module wraps the model

The canonical camera space transformation module (CSTM) sits outside the depth model. It's a preprocessing step, not an architectural component. This means CSTM can be applied to any monocular depth architecture — ConvNeXt, ViT, DPT, anything. The depth model never sees camera intrinsics; it simply predicts depth in canonical space.

Input

RGB image I + camera intrinsics (f, u₀, v₀)

↓

CSTM

Transform I → I_c (canonical space). Compute ω = f^c/f.

↓

Encoder

ConvNeXt-Large extracts multi-scale features from I_c

↓

Decoder

UNet decoder with skip connections produces D_c

↓

De-CSTM

Transform D_c → D (metric depth in original camera space)

The power of simplicity: Camera intrinsics are not encoded in the network (unlike CamConv, which tries to learn camera awareness from data). Instead, CSTM handles camera normalization explicitly as a geometric transform. This is more robust, requires no additional network capacity, and carries over to unseen cameras at test time, provided their focal length is known.

Training details

Setting	Value
Backbone	ConvNeXt-Large (ImageNet-22K pretrained)
Optimizer	AdamW, lr = 0.0001, polynomial decay (power 0.9)
Batch size	192
Training resolution	512 × 960 (random crop after CSTM)
GPUs	48 × A100
Iterations	500K
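
The learning-rate row in the table can be read as the standard "poly" schedule. The exact formula below is an assumption (the text gives only the initial rate, the power, and the iteration count), and warmup, if any, is not modeled.

```python
def poly_lr(iteration, base_lr=1e-4, max_iters=500_000, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - iteration / max_iters) ** power."""
    progress = min(iteration, max_iters) / max_iters
    return base_lr * (1.0 - progress) ** power


for it in (0, 250_000, 500_000):
    print(it, poly_lr(it))
```

The rate starts at 0.0001, falls to a little over half of that at the midpoint, and reaches zero at iteration 500K.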

Why is CSTM implemented as a preprocessing wrapper rather than encoding camera intrinsics inside the neural network?

Because it runs faster on GPUs
Because camera intrinsics are always available at test time
Because an explicit geometric transform is more robust than learned camera awareness, requires no extra network capacity, can be plugged into any existing depth architecture, and carries over to unseen cameras

Chapter 6: From Depth to 3D

A metric depth map is powerful, but it's still a 2D image where each pixel stores a distance. To get actual 3D coordinates — a point cloud you can rotate, measure, and reconstruct — you need to backproject each pixel into 3D space using the camera intrinsics.

The backprojection equation

For each pixel (u, v) with predicted metric depth d, the 3D point (X, Y, Z) in camera coordinates is:

X = (u − u₀) · d / f
Y = (v − v₀) · d / f
Z = d

Where f is the focal length and (u₀, v₀) is the principal point (usually the image center). Each pixel defines a ray from the camera; the depth tells us where along that ray the surface lies.
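
The three equations translate directly into a loop over pixels. A minimal sketch with a hypothetical `backproject` helper; the depth map is a nested list indexed as depth[v][u], and the tiny focal length is chosen only to keep the numbers readable.

```python
def backproject(depth, focal_px, u0, v0):
    """Return one (X, Y, Z) point in camera coordinates per pixel."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            x = (u - u0) * d / focal_px
            y = (v - v0) * d / focal_px
            points.append((x, y, d))
    return points


depth = [
    [2.0, 2.0, 2.0],
    [2.0, 2.0, 2.0],
    [2.0, 2.0, 2.0],
]
pts = backproject(depth, focal_px=2.0, u0=1.0, v0=1.0)
print(pts[0], pts[4])  # (-1.0, -1.0, 2.0) (0.0, 0.0, 2.0)
```

The center pixel lies on the optical axis, so it maps to X = Y = 0 at its predicted depth.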

Why metric depth is essential: Backprojection with affine-invariant depth (d' = a·d + b) gives you a 3D shape whose overall size is unknown and whose geometry is distorted by the unknown shift, like a rubber model of the scene. With metric depth, you get the actual 3D structure at the correct scale. The table is 1.2m wide. The room is 4m deep. The chair is 0.8m tall. Real measurements.

Multi-frame reconstruction

With per-frame metric depth and known camera poses (from SfM, SLAM, or another tracking source; EXIF metadata does not store poses), you can fuse point clouds from multiple viewpoints into a single, dense 3D reconstruction. Because all frames share the same metric scale (thanks to CSTM), the point clouds align naturally without per-frame scale adjustment.

This is a massive advantage over affine-invariant methods. With LeReS or DPT, each frame has its own unknown scale, so you must solve for a per-frame alignment factor before merging — and errors accumulate rapidly.
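
With metric depth, fusion reduces to moving each frame's points into a shared world frame and concatenating them. The sketch below assumes camera-to-world poses given as a 3x3 rotation and a translation; the function name and the two-frame scene are illustrative.

```python
def to_world(points_cam, rotation, translation):
    """Apply p_world = R * p_cam + t to each point (R is a 3x3 nested list)."""
    world = []
    for p in points_cam:
        world.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return world


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# The same wall point seen from two camera positions 1 m apart along Z.
frame_a = [(0.0, 0.0, 3.0)]  # camera A at the world origin
frame_b = [(0.0, 0.0, 2.0)]  # camera B moved 1 m toward the wall

cloud = to_world(frame_a, identity, (0.0, 0.0, 0.0)) + to_world(frame_b, identity, (0.0, 0.0, 1.0))
print(cloud)  # [(0.0, 0.0, 3.0), (0.0, 0.0, 3.0)]
```

Both observations land on the same world point with no per-frame scale factor. If frame B's depth carried an unknown scale, the two points would disagree.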

[Interactive figure: Backprojection Visualizer. Three objects at different depths are backprojected to 3D points; with metric depth, the 3D structure is correct regardless of focal length. Control: focal length (default 800 px).]

Why does multi-frame 3D reconstruction work naturally with metric depth but not with affine-invariant depth?

Because metric depth gives all frames a consistent real-world scale, so their point clouds align directly. Affine-invariant depth has a different unknown scale per frame, causing misalignment.
Because metric depth is more accurate
Because affine-invariant depth doesn't support backprojection

Chapter 7: Results

Metric3D is evaluated in two modes: (1) as a metric depth model, directly compared to specialist methods on their home benchmarks, and (2) as a generalizable depth model, compared to affine-invariant methods on diverse zero-shot benchmarks. It excels at both.

Zero-shot metric depth

The most remarkable result: Metric3D — trained on mixed data from 11 datasets — achieves performance comparable to or better than methods trained specifically on each benchmark.

Method	Training	NYU AbsRel ↓	KITTI AbsRel ↓
AdaBins	NYU / KITTI only	0.103	0.058
NeWCRFs	NYU / KITTI only	0.095	0.052
Metric3D (label)	11 datasets, zero-shot	0.083	0.058
Metric3D (image)	11 datasets, zero-shot	0.092	0.060

On NYU, Metric3D CSTM-label achieves 0.083 AbsRel, 12.6% lower than NeWCRFs (0.095), despite never having seen NYU during training. On KITTI, it matches AdaBins (0.058) and trails NeWCRFs (0.052) by a small margin.
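
For reference, AbsRel is the mean of |prediction − ground truth| / ground truth over valid pixels. The sketch below computes it on a toy example and reproduces the 12.6% figure from the table values; the function names are illustrative.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with positive ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_improvement(baseline, ours):
    """Fraction by which `ours` reduces the baseline error."""
    return (baseline - ours) / baseline


toy_error = abs_rel([1.1, 1.8, 4.0], [1.0, 2.0, 4.0])
nyu_gain = relative_improvement(baseline=0.095, ours=0.083)
print(round(100 * nyu_gain, 1))  # 12.6
```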

Generalization to unseen cameras

The real stress test: 6 benchmarks with camera models never seen during training (four of them are shown in the table below). While specialist methods (trained on NYU or KITTI) degrade severely on these unseen cameras, Metric3D maintains strong performance:

Method	DIODE Indoor AbsRel ↓	7Scenes AbsRel ↓	ETH3D AbsRel ↓	NuScenes AbsRel ↓
AdaBins	0.443	0.218	1.271	0.445
NeWCRFs	0.404	0.240	0.890	0.400
Metric3D (label)	0.252	0.183	0.416	0.154

On NuScenes (outdoor driving), Metric3D cuts the error by more than 60% compared to NeWCRFs (0.154 vs 0.400).

[Figure: Zero-shot AbsRel comparison (lower is better) across 4 benchmarks, Metric3D vs specialist methods. Metric3D generalizes dramatically better to unseen cameras.]

Ablation: CSTM is essential

Without CSTM (naive mixed-data training), the model completely fails to predict metric depth on zero-shot benchmarks. Both CSTM variants (label and image) enable metric prediction, with CSTM-label slightly outperforming CSTM-image on most benchmarks.

Championship result: Metric3D won the 2nd Monocular Depth Estimation Challenge, validating its state-of-the-art accuracy on a competitive public benchmark.

What is the most surprising result about Metric3D's performance on NYU?

It achieves better AbsRel (0.083) than specialist methods trained on NYU (NeWCRFs: 0.095), despite never having seen NYU data during training, proving zero-shot metric depth can surpass dataset-specific training
It runs faster than existing methods
It uses less GPU memory

Chapter 8: Applications

Metric depth unlocks applications that relative depth simply cannot support. The authors demonstrate three compelling use cases.

1. Metrology in the wild

Take a photo of a table with your phone. Read the focal length from the photo's EXIF metadata. Run Metric3D. Backproject to 3D. Measure the table's width.

The authors do exactly this with an iPhone 12 Pro Max and a Xiaomi 9 Android phone. The measured table sizes (from Metric3D's 3D reconstruction) match the real-world ground truth to within a few centimeters. This is impossible with affine-invariant depth — LeReS cannot even attempt a measurement because it has no scale.

Phone as a measuring tape: With Metric3D and a phone camera, you can measure furniture dimensions from a single photo. The EXIF metadata provides the focal length, CSTM handles the rest. No LiDAR, no calibration target, no special hardware.
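
A sketch of the measuring-tape workflow under stated assumptions. EXIF typically reports focal length in millimeters, so converting to pixels needs the sensor width, which you would have to look up for your device. The camera numbers, pixel coordinates, and depth below are hypothetical, and the depth stands in for a model prediction.

```python
import math


def focal_mm_to_px(focal_mm, sensor_width_mm, image_width_px):
    """Convert a focal length in millimeters to pixels."""
    return focal_mm * image_width_px / sensor_width_mm


def pixel_to_point(u, v, depth_m, focal_px, u0, v0):
    """Backproject one pixel to (X, Y, Z) in camera coordinates."""
    return ((u - u0) * depth_m / focal_px, (v - v0) * depth_m / focal_px, depth_m)


# Hypothetical camera: 5 mm lens, 6 mm wide sensor, 4000 px wide image.
focal_px = focal_mm_to_px(5.0, 6.0, 4000)

# Two pixels on the left and right table edges, both predicted at 2.0 m.
left = pixel_to_point(1000, 1500, 2.0, focal_px, u0=2000, v0=1500)
right = pixel_to_point(3000, 1500, 2.0, focal_px, u0=2000, v0=1500)

width_m = math.dist(left, right)
print(round(width_m, 3))  # 1.2
```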

2. Dense SLAM mapping

Monocular SLAM systems like DROID-SLAM suffer from scale drift: over long trajectories, the estimated scale gradually deviates from reality. This is because monocular SLAM can recover geometry only up to an unknown scale.

By feeding Metric3D's per-frame metric depth into DROID-SLAM, the scale drift problem essentially vanishes. The SLAM system now has a strong metric prior for every frame. On KITTI, the translation drift decreases dramatically.

Method	Transl. drift (t_rel) ↓
DROID-SLAM (original)	significant scale drift
DROID-SLAM + Metric3D	greatly reduced (metric scale recovered)

3. 3D scene reconstruction

On 9 unseen NYU scenes, Metric3D produces 3D reconstructions that significantly outperform all baselines in Chamfer distance and F-score. Unlike LeReS (which requires per-frame scale alignment to ground truth), Metric3D's frames fuse directly because they all share the same metric scale.
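
Both metrics compare a predicted point cloud with a ground-truth one. The brute-force sketch below uses one common convention for each: Chamfer distance as the sum of the two mean nearest-neighbour distances, and F-score as the harmonic mean of precision and recall at a distance threshold. The 5 cm threshold and the toy clouds are assumptions, and the paper's exact evaluation protocol may differ.

```python
import math


def nearest_dist(p, cloud):
    """Distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a, b):
    """Sum of mean nearest-neighbour distances in both directions."""
    a_to_b = sum(nearest_dist(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_dist(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def f_score(pred, gt, threshold=0.05):
    """Harmonic mean of precision and recall at the given distance threshold."""
    precision = sum(nearest_dist(p, gt) < threshold for p in pred) / len(pred)
    recall = sum(nearest_dist(q, pred) < threshold for q in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gt = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.5)]  # second point is 0.5 m too deep
print(chamfer_distance(pred, gt), f_score(pred, gt))  # 0.5 0.5
```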

The practical impact: Metric3D turns a phone into a 3D scanner. Take photos from different viewpoints, predict metric depth for each, fuse the point clouds with known poses, and you get a metrically accurate 3D model of the scene. No depth sensor required.

How does Metric3D improve monocular SLAM?

It replaces the SLAM system entirely
By providing per-frame metric depth as a prior, it eliminates the scale ambiguity that causes drift in monocular SLAM, so the system now recovers real-world metric scale
It provides better feature matching for SLAM

Chapter 9: Connections

Metric3D sits at a critical junction in the evolution of monocular depth estimation. Let's map the landscape.

Relation to MiDaS / DPT

MiDaS and DPT pioneered mixed-dataset training for monocular depth, but they gave up on metric depth in exchange for generalization. They learn affine-invariant depth using scale-shift invariant losses. Metric3D shows that with canonical camera transformation, you don't have to choose — you get both metric accuracy and cross-camera generalization.

Relation to Metric3D v2

The follow-up work extends Metric3D with a ViT-based backbone (ViT-giant), improved losses, and the ability to predict both metric depth and surface normals. It achieves even stronger zero-shot performance and sets new state-of-the-art results on many benchmarks.

Relation to Depth Anything / Depth Anything v2

Depth Anything and Depth Anything v2 scale up training with large amounts of unlabeled images annotated by a teacher model (pseudo-labels) to learn powerful depth features, then fine-tune for metric depth on specific datasets. Metric3D's canonical camera transformation is complementary: it could be combined with Depth Anything's features to enable even better zero-shot metric depth.

Relation to GeoWizard

GeoWizard uses diffusion models to jointly predict depth and surface normals, achieving strong generalization. Like Metric3D, it aims for metric-aware predictions, but through a generative modeling approach rather than an explicit camera transformation.

Cheat Sheet

Aspect	Metric3D
Input	Single RGB image + camera intrinsics (focal length)
Output	Dense metric depth map (real meters)
Key innovation	Canonical camera space transformation (CSTM)
Backbone	ConvNeXt-Large (ImageNet-22K)
Decoder	UNet with skip connections
Training data	8M+ images, 11 datasets, 10K+ cameras
Novel loss	Random Proposal Normalization Loss (RPNL)
Key result	Zero-shot metric depth outperforms per-dataset specialists
Achievement	1st place, 2nd Monocular Depth Estimation Challenge

The broader lesson: When a problem has a clean geometric structure — like the focal length ambiguity in metric depth — an explicit geometric solution (CSTM) beats trying to learn it from data (CamConv). Know your geometry, use it directly, and let the neural network focus on what only learning can solve: understanding scene semantics and layout.

What is the fundamental insight that distinguishes Metric3D from affine-invariant methods like MiDaS?

Metric3D uses a better backbone network
Metric3D trains on more data
Instead of giving up on metric depth to handle diverse cameras, Metric3D resolves the focal length ambiguity explicitly via canonical camera transformation, enabling metric depth training on mixed-camera datasets

Metric3D: Zero-shot Metric 3D from a Single Image