Monocular depth models predict relative depth — close vs far — but not real-world meters. Metric3D solves this by normalizing all training cameras to one canonical space, unlocking true metric depth from any phone photo.
You take a photo of a table with your iPhone. A monocular depth model tells you the table is "closer than the wall behind it." Great — but how far away is the table? 1.5 meters? 3 meters? The model has no idea. It outputs relative depth: a ranking of near-to-far, with no real-world scale.
This is useless for the things you actually want to do. A robot arm needs to know the table is exactly 1.2 meters away to place a cup on it. A SLAM system needs metric scale to build a map that doesn't drift. An AR app needs real-world measurements to overlay furniture that fits your room.
Why not just train on metric depth? You can — but only if you use a single camera. Methods like AdaBins or NeWCRFs achieve excellent metric accuracy on NYU (Kinect) or KITTI (LiDAR), but they fail catastrophically on images from different cameras. A model trained on Kinect data (focal length ~525 pixels) gives nonsensical metric predictions on iPhone photos (focal length ~3000 pixels).
The dream: a single model that predicts metric depth — real meters — from any camera, including ones it has never seen during training. That's what Metric3D achieves.
Figure: a scene with three objects at known distances (left), and the output of a relative depth model, which preserves only ordering, compared with the output of a metric model, which gives actual meters (right).
Imagine you're standing 2 meters from a chair and you photograph it with a 26mm lens. Now swap to a 52mm lens and stand 4 meters away. The chair looks exactly the same size in both photos. Same pixels, same appearance — but the real depth is completely different: 2m vs 4m.
This is the focal length ambiguity. A neural network that only sees pixels cannot distinguish these two situations. If you train it on both images with different depth labels (2m and 4m), the network receives contradictory supervision for identical-looking inputs. It cannot converge. This is precisely why mixed-dataset metric depth training fails.
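The ambiguity can be checked with the pinhole projection formula, where image size is proportional to focal length and inversely proportional to depth. The sketch below uses made-up numbers (a 0.5 m wide chair, focal lengths of 1000 and 2000 pixels standing in for the 26mm and 52mm lenses); only the 2x ratio matters.

```python
def projected_size_px(object_size_m, depth_m, focal_px):
    """Pinhole projection: image size in pixels of an object facing the camera."""
    return focal_px * object_size_m / depth_m


# Hypothetical numbers: a 0.5 m wide chair and two lenses whose focal
# lengths (in pixels) differ by 2x, like the 26mm and 52mm lenses above.
chair_width_m = 0.5
near = projected_size_px(chair_width_m, depth_m=2.0, focal_px=1000.0)
far = projected_size_px(chair_width_m, depth_m=4.0, focal_px=2000.0)

print(near, far)  # both 250.0
```

Doubling both the focal length and the distance leaves the chair's width in pixels unchanged, which is why pixels alone cannot reveal the depth.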
What about other camera parameters? The authors make two critical observations:
The conclusion is sharp: focal length is the only camera parameter that creates metric ambiguity. If we can somehow remove the effect of varying focal lengths from training, we can train on mixed datasets and get metric depth for free.
Before Metric3D, the best approach to mixed-camera training was to give up on metric depth entirely. Methods like MiDaS, LeReS, and DPT learn affine-invariant depth: depth predictions that are accurate only up to an unknown scale and shift.
Concretely, instead of predicting the true depth d, these models predict a depth d' related to it by an affine map, d' = a · d + b,
where a (scale) and b (shift) are unknown constants that differ for every image. During training, the loss function aligns predictions to ground truth using a scale-shift invariant loss: it first computes the best-fit a and b, then measures the error after alignment. This way, the model is never penalized for getting the scale wrong — it only needs to get the shape of the depth map right.
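To make the alignment step concrete, here is a minimal sketch of a scale-shift invariant error using a closed-form least-squares fit. It is an illustration only: the function names are hypothetical, and published methods differ in details such as the depth parameterization and the robust statistics they use for alignment.

```python
def fit_scale_shift(pred, target):
    """Least-squares a, b minimizing sum((a * p + b - t) ** 2).

    Assumes `pred` is not constant (otherwise the scale is undefined).
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    a = cov / var_p
    b = mean_t - a * mean_p
    return a, b


def scale_shift_invariant_error(pred, target):
    """Mean absolute error after aligning pred to target with the best a, b."""
    a, b = fit_scale_shift(pred, target)
    return sum(abs(a * p + b - t) for p, t in zip(pred, target)) / len(pred)


gt = [1.0, 2.0, 4.0, 8.0]
pred = [0.5 * d + 3.0 for d in gt]  # wrong scale and shift, correct shape
print(scale_shift_invariant_error(pred, gt))  # numerically zero
```

The prediction is off by a factor and an offset, yet the error after alignment vanishes, which is exactly the property that discards metric scale.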
By decoupling scale and shift, these methods sidestep the focal length ambiguity entirely. The network doesn't need to know the focal length because it's not responsible for predicting absolute depth. This lets you train on millions of images from diverse cameras — which is why MiDaS generalizes beautifully to internet photos.
The price is steep. Affine-invariant depth is fundamentally unsuitable for any task that needs real-world measurements:
This is the core contribution of Metric3D. The idea is beautifully simple: pick one fixed "canonical" focal length fc, and transform every training image so that it behaves as if it were captured by a camera with that focal length. Once all images look like they came from the same camera, a standard depth model can learn metric depth without any ambiguity.
The authors propose two equivalent methods to achieve this.
Leave the image unchanged. Instead, scale the ground-truth depth by the ratio of canonical-to-actual focal length: Dc = (fc / f) · D, where D is the original depth label and Dc is the label in canonical space.
If the actual camera has a longer focal length than the canonical one (f > fc), the depth labels are scaled down. This compensates for the fact that the longer lens makes objects appear larger (closer) than they would through the canonical lens.
At inference, reverse the transform: D = (f / fc) · Dc.
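The label transform and its inverse can be sketched in a few lines. The canonical focal length of 1000 pixels and the function names are assumptions chosen for illustration, and depth maps are plain nested lists to keep the example dependency-free.

```python
CANONICAL_FOCAL_PX = 1000.0  # assumed value, for illustration only


def to_canonical_depth(depth_m, focal_px, canonical_focal_px=CANONICAL_FOCAL_PX):
    """Training side: Dc = (fc / f) * D."""
    ratio = canonical_focal_px / focal_px
    return [[d * ratio for d in row] for row in depth_m]


def from_canonical_depth(depth_c, focal_px, canonical_focal_px=CANONICAL_FOCAL_PX):
    """Inference side: D = (f / fc) * Dc."""
    ratio = focal_px / canonical_focal_px
    return [[d * ratio for d in row] for row in depth_c]


labels = [[2.0, 4.0]]  # metric depth from a camera with f = 2000 px
canonical = to_canonical_depth(labels, focal_px=2000.0)
restored = from_canonical_depth(canonical, focal_px=2000.0)
print(canonical, restored)  # [[1.0, 2.0]] [[2.0, 4.0]]
```

A camera with twice the canonical focal length has its labels halved during training, and the same factor is undone at inference.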
Instead of modifying labels, resize the input image by the ratio fc/f. A camera with focal length 2fc produces images that look "zoomed in" — so we shrink the image by half to simulate the canonical camera's field of view. The depth labels are resized spatially but not scaled in value.
At inference, resize the predicted depth map back to the original resolution.
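A rough sketch of the image variant follows. It uses nearest-neighbor sampling on nested lists purely to stay self-contained; a real pipeline would use a proper image-resizing routine. The canonical focal length of 1000 pixels and the helper names are assumptions.

```python
def resize_nearest(grid, scale):
    """Resize a 2D list by `scale` using nearest-neighbor sampling."""
    src_h, src_w = len(grid), len(grid[0])
    dst_h = max(1, round(src_h * scale))
    dst_w = max(1, round(src_w * scale))
    return [
        [grid[min(src_h - 1, int(y / scale))][min(src_w - 1, int(x / scale))]
         for x in range(dst_w)]
        for y in range(dst_h)
    ]


def to_canonical_image(image, depth_m, focal_px, canonical_focal_px=1000.0):
    """Resize image and labels by fc / f; depth values are left untouched."""
    scale = canonical_focal_px / focal_px
    return resize_nearest(image, scale), resize_nearest(depth_m, scale)


image = [[y * 4 + x for x in range(4)] for y in range(4)]  # 4x4 dummy image
depth = [[3.0] * 4 for _ in range(4)]                      # 3 m everywhere
image_c, depth_c = to_canonical_image(image, depth, focal_px=2000.0)
print(image_c)  # [[0, 2], [8, 10]]
print(depth_c)  # [[3.0, 3.0], [3.0, 3.0]]
```

The image from the f = 2fc camera shrinks to half its size while every depth value stays at 3 m, matching the description above.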
The analogy comes from human body reconstruction: to handle diverse poses, methods like SMPL map all body meshes to a canonical pose space. Metric3D does the same thing for cameras — mapping all images to a canonical camera space.
Figure: the same scene at the same depth produces different image sizes under different focal lengths, and the canonical transform normalizes them. The orange box shows the canonical camera's view.
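At test time, the process reverses. You know the test camera's focal length (for example, from EXIF metadata, converted from millimeters to pixels). Transform the input to canonical space, predict depth, then de-transform the depth back to the actual camera: for the label variant, D = (f / fc) · Dc; for the image variant, resize the predicted depth map back to the original resolution.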
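The test-time flow for the label variant might look like the sketch below. Everything here is illustrative: `dummy_model` stands in for a trained network, the sensor numbers are made up, and the conversion assumes the physical sensor width is known alongside the EXIF focal length in millimeters.

```python
def focal_mm_to_px(focal_mm, sensor_width_mm, image_width_px):
    """Convert a focal length in millimeters to pixels."""
    return focal_mm / sensor_width_mm * image_width_px


def predict_metric_depth(image, focal_px, model, canonical_focal_px=1000.0):
    """Label variant: the model sees the unchanged image and outputs
    canonical-space depth, which is rescaled by f / fc."""
    depth_c = model(image)
    scale = focal_px / canonical_focal_px
    return [[d * scale for d in row] for row in depth_c]


def dummy_model(image):
    """Placeholder network: predicts 1.5 (canonical units) at every pixel."""
    return [[1.5 for _ in row] for row in image]


# Hypothetical camera: 5 mm lens, 8 mm wide sensor, 4000 px wide image.
focal_px = focal_mm_to_px(focal_mm=5.0, sensor_width_mm=8.0, image_width_px=4000)
depth = predict_metric_depth([[0, 0], [0, 0]], focal_px, dummy_model)
print(focal_px, depth)  # 2500.0 [[3.75, 3.75], [3.75, 3.75]]
```

Note that the network itself never receives the focal length; it only enters through the rescaling after prediction.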
With the canonical camera transformation, Metric3D can finally do what was previously impossible: train one metric depth model on all available datasets simultaneously. The authors assemble a training set of staggering diversity:
| Property | Value |
|---|---|
| Total images | 8 million+ |
| Datasets merged | 11 public RGB-D datasets |
| Camera models | 10,000+ different cameras |
| Scene types | Indoor (rooms, offices) + Outdoor (driving, urban) |
| Depth sensors | Kinect, LiDAR, stereo, SfM, etc. |
Without the canonical camera space transformation module (CSTM), training on this mix is a disaster. The model receives contradictory supervision (identical-looking images with different depth labels from different cameras) and cannot converge. The authors show that a naive baseline (same architecture, same data, no CSTM) completely fails to learn metric depth on zero-shot benchmarks.
The authors also introduce a clever training loss. The standard scale-shift invariant loss (used by MiDaS, LeReS) normalizes depth over the entire image. This works for global ordering but squeezes fine-grained local depth differences — especially between nearby objects.
The Random Proposal Normalization Loss (RPNL) fixes this by randomly cropping M = 32 small patches from the image and applying the scale-shift invariant loss on each patch independently. This forces the model to preserve local depth structure within small regions, not just global ordering.
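The patch-wise idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes fixed-size square patches, a median and mean-absolute-deviation normalization inside each patch, and an L1 difference; the function names are hypothetical.

```python
import random
from statistics import median


def normalize_patch(values):
    """Remove scale and shift using the median and mean absolute deviation."""
    med = median(values)
    mad = sum(abs(v - med) for v in values) / len(values)
    return [(v - med) / (mad + 1e-6) for v in values]


def rpnl_sketch(pred, target, num_patches=32, patch_size=2, rng=None):
    """Average normalized L1 error over randomly placed patches."""
    rng = rng or random.Random(0)
    height, width = len(target), len(target[0])
    total = 0.0
    for _ in range(num_patches):
        top = rng.randrange(height - patch_size + 1)
        left = rng.randrange(width - patch_size + 1)
        rows = range(top, top + patch_size)
        cols = range(left, left + patch_size)
        p = normalize_patch([pred[y][x] for y in rows for x in cols])
        t = normalize_patch([target[y][x] for y in rows for x in cols])
        total += sum(abs(a - b) for a, b in zip(p, t)) / len(p)
    return total / num_patches


target = [[float(y * 4 + x + 1) for x in range(4)] for y in range(4)]
affine = [[2.0 * d + 1.0 for d in row] for row in target]
warped = [[d * d for d in row] for row in target]
print(rpnl_sketch(affine, target))  # close to 0: local structure preserved
print(rpnl_sketch(warped, target))  # clearly above 0: local structure distorted
```

Because normalization happens inside each patch, small depth differences between neighboring pixels are not flattened by statistics computed over the whole image.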
With 11 datasets of wildly different sizes (from thousands to millions of images), naive sampling would let large datasets dominate training. Following DiverseDepth, the authors balance all datasets in each mini-batch so that each dataset contributes an approximately equal fraction. This ensures the model learns from indoor Kinect data, outdoor LiDAR data, and everything in between.
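One simple way to realize such balancing is to draw the same number of samples from every dataset for each mini-batch. The sketch below assumes this strategy and uses invented dataset names and sizes; the actual sampling scheme may differ.

```python
import random
from collections import Counter


def balanced_batch(datasets, batch_size, rng):
    """Draw a mini-batch with near-equal counts from each dataset."""
    names = sorted(datasets)
    per_dataset, remainder = divmod(batch_size, len(names))
    batch = []
    for i, name in enumerate(names):
        count = per_dataset + (1 if i < remainder else 0)
        batch += [(name, rng.choice(datasets[name])) for _ in range(count)]
    rng.shuffle(batch)
    return batch


# Hypothetical datasets, represented by index ranges of very different sizes.
datasets = {
    "small_indoor": range(1_000),
    "medium_mixed": range(50_000),
    "large_outdoor": range(1_000_000),
}
batch = balanced_batch(datasets, batch_size=192, rng=random.Random(0))
counts = Counter(name for name, _ in batch)
print(counts)  # 64 samples from each dataset
```

With uniform sampling over all images, the smallest dataset here would contribute well under one sample per batch on average; with balancing it contributes a third.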
The final training objective combines four losses: L = L_PWN + L_VNL + L_silog + L_RPNL,
where L_PWN is the pair-wise normal regression loss (surface smoothness), L_VNL is the virtual normal loss (geometric consistency), L_silog is the scale-invariant logarithmic loss (standard depth error), and L_RPNL is the random proposal normalization loss (local contrast).
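Of the four terms, the scale-invariant logarithmic loss is the easiest to write down. The sketch below follows the commonly used form, the square root of mean(g²) − λ · mean(g)² with g = log(pred) − log(gt); the value λ = 0.5 is an assumption, not a setting taken from the source.

```python
import math


def silog_loss(pred, gt, lam=0.5):
    """Scale-invariant log loss over positive depth values."""
    g = [math.log(p) - math.log(t) for p, t in zip(pred, gt)]
    mean_sq = sum(x * x for x in g) / len(g)
    mean = sum(g) / len(g)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, mean_sq - lam * mean * mean))


gt = [1.0, 2.0, 4.0]
doubled = [2.0 * d for d in gt]
print(silog_loss(gt, gt))                # 0.0
print(silog_loss(doubled, gt, lam=1.0))  # ~0: a global scale error is ignored
print(silog_loss(doubled, gt, lam=0.5))  # > 0: scale errors are partly penalized
```

With λ = 1 the loss ignores a global scale error entirely; values below 1 keep some penalty on absolute scale, which is what a metric model needs.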
Metric3D's architecture is refreshingly simple. The canonical camera transformation is the innovation — the depth model itself is a standard encoder-decoder that can be plugged into any existing architecture.
The backbone is ConvNeXt-Large, a modern convolutional network pretrained on ImageNet-22K. It extracts multi-scale feature maps from the input image (after canonical transformation). The choice of ConvNeXt over a Vision Transformer is practical — it handles arbitrary input resolutions without interpolating positional embeddings.
A UNet decoder progressively upsamples the encoder features, using skip connections to preserve fine spatial detail. The final output is a dense depth map at the input resolution, predicting one depth value per pixel.
The canonical camera space transformation module (CSTM) sits outside the depth model. It's a preprocessing step, not an architectural component. This means CSTM can be applied to any monocular depth architecture — ConvNeXt, ViT, DPT, anything. The depth model never sees camera intrinsics; it simply predicts depth in canonical space.
| Setting | Value |
|---|---|
| Backbone | ConvNeXt-Large (ImageNet-22K pretrained) |
| Optimizer | AdamW, lr = 0.0001, polynomial decay (power 0.9) |
| Batch size | 192 |
| Training resolution | 512 × 960 (random crop after CSTM) |
| GPUs | 48 × A100 |
| Iterations | 500K |
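For reference, a polynomial decay schedule with the values from the table can be written as below. The exact form (no warmup, decay to zero at the final iteration) is an assumption; only the base rate, the power, and the iteration count come from the table.

```python
def poly_lr(iteration, base_lr=1e-4, max_iters=500_000, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iters) ** power."""
    return base_lr * (1.0 - iteration / max_iters) ** power


for it in (0, 250_000, 500_000):
    print(it, poly_lr(it))
```

With power 0.9 the curve is close to linear: halfway through training the rate is a little above half of its starting value.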
A metric depth map is powerful, but it's still a 2D image where each pixel stores a distance. To get actual 3D coordinates — a point cloud you can rotate, measure, and reconstruct — you need to backproject each pixel into 3D space using the camera intrinsics.
For each pixel (u, v) with predicted metric depth d, the 3D point (X, Y, Z) in camera coordinates is: X = (u − u0) · d / f, Y = (v − v0) · d / f, Z = d,
where f is the focal length and (u0, v0) is the principal point (usually the image center). Each pixel defines a ray from the camera; the depth tells us where along that ray the surface lies.
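A minimal backprojection sketch under the pinhole model, assuming a single focal length for both axes and a depth map stored as nested lists (the function name is illustrative):

```python
def backproject(depth_m, focal_px, u0, v0):
    """Turn a depth map into a list of (X, Y, Z) points in camera coordinates."""
    points = []
    for v, row in enumerate(depth_m):
        for u, d in enumerate(row):
            x = (u - u0) * d / focal_px
            y = (v - v0) * d / focal_px
            points.append((x, y, d))
    return points


# Toy 2x2 depth map, 2 m everywhere, principal point at the image center.
depth = [[2.0, 2.0], [2.0, 2.0]]
cloud = backproject(depth, focal_px=2.0, u0=0.5, v0=0.5)
print(cloud)
# [(-0.5, -0.5, 2.0), (0.5, -0.5, 2.0), (-0.5, 0.5, 2.0), (0.5, 0.5, 2.0)]
```

Z is simply the predicted depth, while X and Y grow with both the pixel's offset from the principal point and the depth.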
With per-frame metric depth and known camera poses (from SfM or SLAM), you can fuse point clouds from multiple viewpoints into a single, dense 3D reconstruction. Because all frames share the same metric scale (thanks to CSTM), the point clouds align naturally without per-frame scale adjustment.
This is a massive advantage over affine-invariant methods. With LeReS or DPT, each frame has its own unknown scale, so you must solve for a per-frame alignment factor before merging — and errors accumulate rapidly.
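The fusion step reduces to applying each frame's camera-to-world pose and concatenating the results. The sketch below assumes poses are given as a 3x3 rotation and a translation in meters; the two-frame example is invented.

```python
def transform_points(points, rotation, translation):
    """Apply a camera-to-world pose: p_world = R @ p_cam + t."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def fuse(frames):
    """Concatenate per-frame point clouds after moving them to world space."""
    cloud = []
    for points, rotation, translation in frames:
        cloud.extend(transform_points(points, rotation, translation))
    return cloud


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# The same surface point seen from two cameras: one at the origin (2 m away)
# and one that has moved 1 m forward (so the point is 1 m away).
frame_a = ([(0.0, 0.0, 2.0)], identity, (0.0, 0.0, 0.0))
frame_b = ([(0.0, 0.0, 1.0)], identity, (0.0, 0.0, 1.0))
print(fuse([frame_a, frame_b]))  # [(0.0, 0.0, 2.0), (0.0, 0.0, 2.0)]
```

Both observations land on the same world point only because the two depths are in the same metric units; with an unknown per-frame scale, an extra factor would have to be estimated for each frame first.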
Figure: a simulated image with three objects at different depths, backprojected to 3D points under varying focal lengths. With metric depth, the 3D structure is correct regardless of focal length.
Metric3D is evaluated in two modes: (1) as a metric depth model, directly compared to specialist methods on their home benchmarks, and (2) as a generalizable depth model, compared to affine-invariant methods on diverse zero-shot benchmarks. It excels at both.
The most remarkable result: Metric3D — trained on mixed data from 11 datasets — achieves performance comparable to or better than methods trained specifically on each benchmark.
| Method | Training | NYU AbsRel ↓ | KITTI AbsRel ↓ |
|---|---|---|---|
| AdaBins | NYU / KITTI only | 0.103 | 0.058 |
| NeWCRFs | NYU / KITTI only | 0.095 | 0.052 |
| Metric3D (label) | 11 datasets, zero-shot | 0.083 | 0.058 |
| Metric3D (image) | 11 datasets, zero-shot | 0.092 | 0.060 |
On NYU, Metric3D with CSTM-label achieves 0.083 AbsRel, 12.6% better than NeWCRFs (0.095), despite never having seen NYU during training. On KITTI, its 0.058 matches AdaBins and trails NeWCRFs (0.052) by a small margin.
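For readers unfamiliar with the metric, AbsRel is the mean of |prediction − ground truth| / ground truth over valid pixels. The short sketch below computes it on toy values and reproduces the relative-improvement arithmetic from the table; the helper names are illustrative.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) depth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


def relative_improvement(baseline, new):
    """Fractional error reduction of `new` with respect to `baseline`."""
    return (baseline - new) / baseline


print(abs_rel([1.1, 1.8, 4.0], [1.0, 2.0, 4.0]))          # about 0.067
print(round(relative_improvement(0.095, 0.083) * 100, 1))  # 12.6
```

An AbsRel of 0.083 therefore means predictions are off by roughly 8% of the true distance on average.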
The real stress test: 6 benchmarks with camera models never seen during training. While specialist methods (trained on NYU or KITTI) degrade severely on these unseen cameras, Metric3D maintains strong performance:
| Method | DIODE Indoor AbsRel ↓ | 7Scenes AbsRel ↓ | ETH3D AbsRel ↓ | NuScenes AbsRel ↓ |
|---|---|---|---|---|
| AdaBins | 0.443 | 0.218 | 1.271 | 0.445 |
| NeWCRFs | 0.404 | 0.240 | 0.890 | 0.400 |
| Metric3D (label) | 0.252 | 0.183 | 0.416 | 0.154 |
On NuScenes (outdoor driving), Metric3D cuts AbsRel from 0.400 to 0.154, a reduction of more than 60% relative to NeWCRFs.
Figure: AbsRel (lower is better) across 4 benchmarks for Metric3D (orange) and specialist methods (blue/purple). Metric3D generalizes dramatically better to unseen cameras.
Without CSTM (naive mixed-data training), the model completely fails to predict metric depth on zero-shot benchmarks. Both CSTM variants (label and image) enable metric prediction, with CSTM-label slightly outperforming CSTM-image on most benchmarks.
Metric depth unlocks applications that relative depth simply cannot support. The authors demonstrate three compelling use cases.
Take a photo of a table with your phone. Read the focal length from the photo's EXIF metadata. Run Metric3D. Backproject to 3D. Measure the table's width.
The authors do exactly this with an iPhone 12 Pro Max and a Xiaomi 9 Android phone. The measured table sizes (from Metric3D's 3D reconstruction) match the real-world ground truth to within a few centimeters. This is impossible with affine-invariant depth — LeReS cannot even attempt a measurement because it has no scale.
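The measurement itself is just a distance between two backprojected pixels. The sketch below uses invented values (a 4000 × 3000 image, a focal length of 3000 pixels, both table edges at 1.5 m) to show the arithmetic; it is not the authors' measurement code.

```python
import math


def pixel_to_point(u, v, depth_m, focal_px, u0, v0):
    """Backproject one pixel with metric depth to camera coordinates."""
    return (
        (u - u0) * depth_m / focal_px,
        (v - v0) * depth_m / focal_px,
        depth_m,
    )


# Hypothetical pixels on the left and right edges of the table.
left = pixel_to_point(1000, 1500, 1.5, focal_px=3000.0, u0=2000.0, v0=1500.0)
right = pixel_to_point(3000, 1500, 1.5, focal_px=3000.0, u0=2000.0, v0=1500.0)
width_m = math.dist(left, right)
print(width_m)  # 1.0
```

If the depth were known only up to scale, the same two pixels could correspond to a table of any width.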
Monocular SLAM systems like Droid-SLAM suffer from scale drift: over long trajectories, the estimated scale gradually deviates from reality. This happens because monocular SLAM can recover camera motion and scene geometry only up to an unknown scale factor.
By feeding Metric3D's per-frame metric depth into Droid-SLAM, the scale drift problem essentially vanishes. The SLAM system now has a strong metric prior for every frame. On KITTI, the translation drift decreases dramatically.
| Method | Transl. drift (trel) ↓ |
|---|---|
| Droid-SLAM (original) | significant scale drift |
| Droid-SLAM + Metric3D | greatly reduced (metric scale recovered) |
On 9 unseen NYU scenes, Metric3D produces 3D reconstructions that significantly outperform all baselines in Chamfer distance and F-score. Unlike LeReS (which requires per-frame scale alignment to ground truth), Metric3D's frames fuse directly because they all share the same metric scale.
Metric3D sits at a critical junction in the evolution of monocular depth estimation. Let's map the landscape.
MiDaS and DPT pioneered mixed-dataset training for monocular depth, but they gave up on metric depth in exchange for generalization. They learn affine-invariant depth using scale-shift invariant losses. Metric3D shows that with canonical camera transformation, you don't have to choose — you get both metric accuracy and cross-camera generalization.
The follow-up work, Metric3D v2, extends Metric3D with a ViT-based backbone (ViT-giant), improved losses, and the ability to predict both metric depth and surface normals. It achieves even stronger zero-shot performance and sets new state-of-the-art results on many benchmarks.
Depth Anything and Depth Anything v2 learn powerful depth features from large-scale unlabeled images (using pseudo-labels from a teacher model), then fine-tune for metric depth on specific datasets. Metric3D's canonical camera transformation is complementary: it could be combined with Depth Anything's features to enable even better zero-shot metric depth.
GeoWizard uses diffusion models to jointly predict depth and surface normals, achieving strong generalization. Like Metric3D, it aims for metric-aware predictions, but through a generative modeling approach rather than an explicit camera transformation.
| Aspect | Metric3D |
|---|---|
| Input | Single RGB image + camera intrinsics (focal length) |
| Output | Dense metric depth map (real meters) |
| Key innovation | Canonical camera space transformation (CSTM) |
| Backbone | ConvNeXt-Large (ImageNet-22K) |
| Decoder | UNet with skip connections |
| Training data | 8M+ images, 11 datasets, 10K+ cameras |
| Novel loss | Random Proposal Normalization Loss (RPNL) |
| Key result | Zero-shot metric depth outperforms per-dataset specialists |
| Achievement | 1st place, 2nd Monocular Depth Estimation Challenge |