A geometric foundation model that jointly predicts metric depth and surface normals from a single image. Trained on 16M+ images from thousands of cameras, it ranks #1 on NYU, KITTI, and many zero-shot benchmarks for both tasks.
You take a single photo with your phone. From that one image, you want two things: how far away every pixel is (metric depth), and which direction every surface faces (surface normals). These are the two most fundamental geometric properties of a scene — and they are deeply complementary.
Depth tells you the distance to each point. A pixel on a nearby table might be 0.8 meters away; a pixel on the far wall might be 5 meters. Depth is a scalar field — one number per pixel.
Surface normals tell you the orientation of the surface at each point. A horizontal floor has normals pointing straight up. A vertical wall facing the camera has normals pointing toward you. A slanted roof has normals pointing diagonally. Normals are a vector field: three numbers (x, y, z direction) per pixel.
Together, depth and normals give you a complete geometric description of the scene. Depth provides the global structure (what is far, what is near). Normals provide the local geometry (which way each surface faces). If you have both at metric scale, you can reconstruct the 3D scene, measure real-world distances, and understand the physical layout — all from a single photo.
Here is the problem: existing methods train these two representations separately, and each has a critical weakness.
The first problem — losing metric scale — comes from a fundamental ambiguity. If the neural network does not know the camera's focal length, the same image could have been taken by a wide-angle phone camera at 1 meter or a telephoto lens at 10 meters. Without focal length information, the network cannot predict metric depth. It must give up and predict only relative depth.
The second problem — limited normal training data — is about annotation cost. Depth labels can come from cheap sensors (LiDAR, RGB-D cameras). But normal labels require meticulously reconstructed 3D meshes: you need to capture a scene from many angles, run dense multi-view stereo, clean the reconstruction, and then compute normals from the resulting mesh. Only a handful of indoor datasets have gone through this process; outdoor normal annotations are almost nonexistent.
Here is the state of affairs before Metric3D v2:
None of these produces both metric depth and surface normals from a single model. Metric3D v2 does.
Figure: A simple scene with two surfaces. Depth measures distance from the camera; normals show surface orientation.
Metric3D v2 is built on a beautifully simple geometric fact: depth and normals are dual representations of the same 3D surface.
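Think about it. If you know the depth at every pixel and the camera parameters, you can compute the 3D position of every point. And if you have the 3D positions, you can compute the surface normal at any point by looking at how the surface tilts between neighboring points. In other words:

$$n(x, y) = \frac{\dfrac{\partial P}{\partial x} \times \dfrac{\partial P}{\partial y}}{\left\lVert \dfrac{\partial P}{\partial x} \times \dfrac{\partial P}{\partial y} \right\rVert}$$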
Where P is the 3D point corresponding to pixel (x, y), and n is the surface normal. The normal is just the cross product of the partial derivatives of the 3D surface — it is the spatial gradient of the depth field.
Think about what this means concretely. If you have a flat table, the depth changes linearly as you move across it (it gets farther away). The gradient of this linear field is constant, so the normals are constant: they all point straight up. If the table has a bump, the depth changes faster there, and the normals tilt to follow the bump's slope. The normal IS the derivative of the depth.
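To see the depth-to-normal relationship in action, here is a minimal, dependency-free sketch. It backprojects three neighboring pixels with a pinhole model and takes the cross product of the finite differences. The function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the convention of flipping normals to face the camera are assumptions made for illustration, not details taken from the paper.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole backprojection of pixel (u, v) with metric depth to a 3D point."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normal_from_depth(depth_map, u, v, fx, fy, cx, cy):
    """Unit normal at (u, v) from forward differences of the 3D surface."""
    p = backproject(u, v, depth_map[v][u], fx, fy, cx, cy)
    p_right = backproject(u + 1, v, depth_map[v][u + 1], fx, fy, cx, cy)
    p_down = backproject(u, v + 1, depth_map[v + 1][u], fx, fy, cx, cy)
    dP_dx = tuple(a - b for a, b in zip(p_right, p))
    dP_dy = tuple(a - b for a, b in zip(p_down, p))
    n = cross(dP_dx, dP_dy)
    length = math.sqrt(sum(c * c for c in n))
    n = tuple(c / length for c in n)
    # Assumed convention: flip the normal so it faces the camera at the origin.
    if sum(a * b for a, b in zip(n, p)) > 0:
        n = tuple(-c for c in n)
    return n

# A wall facing the camera at 2 m: constant depth, normal along -z.
flat_wall = [[2.0] * 3 for _ in range(3)]
wall_normal = normal_from_depth(flat_wall, 1, 1, 500.0, 500.0, 1.0, 1.0)

# A surface that recedes to the right: the normal tilts along +x.
receding = [[2.0, 2.1, 2.2] for _ in range(3)]
tilted_normal = normal_from_depth(receding, 1, 1, 500.0, 500.0, 1.0, 1.0)
```

Note that the 3D points are only correct when the depth is metric and the intrinsics are known, which is why the relationship is tied to the camera model.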
This geometric relationship is the key that unlocks everything:
The second key insight addresses the metric ambiguity problem. The authors perform a careful analysis of which camera parameters actually matter for depth prediction. The answer is surprisingly specific:
| Camera Parameter | Affects Metric Depth? | Why |
|---|---|---|
| Sensor size | No | Only changes field of view, not the depth-imaging relationship |
| Pixel size (δ) | No | Different pixel sizes produce different resolutions but the same ratio α = f/S' between focal length and imaging size |
| Focal length (f) | Yes | Directly scales the relationship between imaging size and real-world distance |
Since only the focal length matters, the solution is clean: normalize all training images to behave as though they were taken by the same "canonical" camera with a fixed focal length. This eliminates the ambiguity entirely, without requiring the network to learn about cameras at all.
Why does focal length cause metric ambiguity? Consider two cameras photographing the same chair. Camera A has a focal length of 26mm at 1 meter distance. Camera B has a 52mm focal length at 2 meters. Both produce the same image of the chair — identical pixel appearance, identical imaging size. But the depth is completely different: 1m vs 2m.
A neural network seeing these two identical-looking images would receive contradictory training signals. Same input, different labels. It cannot learn metric depth without knowing which camera took the photo.
Metric3D v2 defines a canonical camera with a fixed focal length fc. Before training, every image-depth pair is transformed so it appears as though the canonical camera took it. The model only ever sees "canonical" images and can learn metric depth without ambiguity.
There are two ways to perform this transformation:
Keep the image unchanged. Instead, rescale the ground-truth depth by the ratio of canonical to actual focal length:

$$D_c = \omega_d \cdot D, \qquad \omega_d = \frac{f_c}{f}$$
At inference, reverse it: D = (f / fc) · Dc. The network predicts depth in canonical space; you de-canonicalize with the actual camera's focal length.
Resize the input image by ωr = fc / f so it looks like the canonical camera took it. The depth labels resize accordingly (without scaling values). At inference, resize the prediction back.
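Both transformations can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' code: the function names are made up, depth maps are nested lists, nearest-neighbor resizing is an assumed choice, and the canonical focal length of 1000 pixels is a hypothetical value.

```python
def canonicalize_depth_labels(depth, f, f_c):
    """Label transform: keep the image, scale depth values by omega_d = f_c / f."""
    omega_d = f_c / f
    return [[d * omega_d for d in row] for row in depth]

def decanonicalize_depth(depth_c, f, f_c):
    """Inverse of the label transform, applied to predictions at inference."""
    omega_d = f_c / f
    return [[d / omega_d for d in row] for row in depth_c]

def resize_nearest(grid, omega_r):
    """Image transform: resize an image or depth map by omega_r = f_c / f.

    Values are copied, not scaled (nearest-neighbor sampling is an assumption).
    """
    h, w = len(grid), len(grid[0])
    new_h = max(1, round(h * omega_r))
    new_w = max(1, round(w * omega_r))
    return [[grid[min(h - 1, int(r / omega_r))][min(w - 1, int(c / omega_r))]
             for c in range(new_w)]
            for r in range(new_h)]

f, f_c = 500.0, 1000.0            # hypothetical focal lengths in pixels
depth = [[2.0, 4.0], [2.0, 4.0]]  # metric depth in meters

depth_c = canonicalize_depth_labels(depth, f, f_c)   # values doubled
restored = decanonicalize_depth(depth_c, f, f_c)     # back to the original
resized = resize_nearest(depth, f_c / f)             # 4x4 map, same values
```

The first variant changes the numbers and keeps the pixels; the second changes the pixels and keeps the numbers. Either way, the network only ever sees data that is consistent with one camera.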
Let's make this concrete. From the pinhole camera model, the depth d of an object is related to its real-world size S, its imaging size S', and the focal length f by:

$$d = \frac{f \cdot S}{S'}$$
If two cameras have different focal lengths f1 = 2f2 and the object is at distances d1 = 2d2, the imaging size S' is identical. A network looking at image appearance alone cannot tell these apart. But if we transform both depth maps to use the same canonical focal length fc, the labels become consistent: the network sees the same image and gets the same (canonical) depth label. No contradiction.
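As a worked example, take the two cameras from the chair scenario (26mm at 1 meter, 52mm at 2 meters). The chair height S = 0.9 m and the canonical focal length f_c = 50mm are hypothetical values chosen only to make the arithmetic visible.

```latex
\begin{aligned}
S'_A &= \frac{f_A \, S}{d_A} = \frac{26\,\text{mm} \times 0.9\,\text{m}}{1\,\text{m}} = 23.4\,\text{mm}, &
S'_B &= \frac{f_B \, S}{d_B} = \frac{52\,\text{mm} \times 0.9\,\text{m}}{2\,\text{m}} = 23.4\,\text{mm}, \\
D_{c,A} &= \frac{f_c}{f_A}\, d_A = \frac{50}{26} \times 1\,\text{m} \approx 1.92\,\text{m}, &
D_{c,B} &= \frac{f_c}{f_B}\, d_B = \frac{50}{52} \times 2\,\text{m} \approx 1.92\,\text{m}.
\end{aligned}
```

The imaging sizes match, which is the ambiguity, and the canonical depth labels match too, which is the fix.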
At test time, the network predicts depth Dc in canonical space. To get real-world metric depth, you simply divide by the focal-length ratio: D = Dc / ωd = Dc · (f / fc). This requires knowing the test camera's focal length, which can usually be read from the photo's EXIF metadata, so no special calibration is needed.
This is the heart of Metric3D v2. Rather than predicting depth and normals independently, the model iteratively refines both together using recurrent blocks. Each iteration, depth helps normals and normals help depth.
The encoder-decoder produces initial low-resolution predictions: D̂0 for depth and N̂0 for normals. Then, for T+1 iterations:
After all iterations, the low-resolution predictions are upsampled to full resolution with learned upsampling and then post-processed by the heads Hd and Hn. Depth passes through ReLU (ensuring non-negative values). Normals are L2-normalized to unit vectors at every pixel.
Formally, at each step the recurrent block F takes all current state as input:
The critical detail: both projection heads Gd and Gn read from the same updated hidden state Ht+1. This shared representation is how depth knowledge flows into the normal branch and vice versa. The ConvGRU's gating mechanism learns which hidden features to keep, reset, or update — automatically balancing the two tasks.
The ConvGRU sees both depth and normals at every iteration. If the current depth estimate has a noisy region, the normals there will be inconsistent with the depth gradients. The ConvGRU can detect this mismatch and correct both. It is like having two witnesses who keep checking each other's story — inconsistencies get resolved with each pass.
The design is inspired by RAFT (Recurrent All-Pairs Field Transforms), which uses ConvGRU blocks to iteratively refine optical flow. The key adaptation: RAFT updates a single quantity (flow), while Metric3D v2 updates two quantities (depth and normals) through a shared hidden state. This coupling is what makes the joint optimization powerful — without the shared state, you would just have two independent RAFT-like refinement loops.
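The coupling is easier to see in code. Below is a structural sketch for a single pixel: a plain GRU cell stands in for the ConvGRU, the weights are random and untrained, and the residual updates follow RAFT's style. All names, sizes, and the residual form are assumptions for illustration; the point is only that both heads read the same updated hidden state.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class ToyJointRefiner:
    """Per-pixel GRU with two heads (hypothetical stand-in for the ConvGRU block)."""

    def __init__(self, hidden=8, seed=0):
        rng = random.Random(seed)
        n_in = hidden + 4  # hidden state + depth (1 value) + normal (3 values)
        self.Wz = rand_matrix(hidden, n_in, rng)  # update gate
        self.Wr = rand_matrix(hidden, n_in, rng)  # reset gate
        self.Wh = rand_matrix(hidden, n_in, rng)  # candidate state
        self.Gd = rand_matrix(1, hidden, rng)     # depth projection head
        self.Gn = rand_matrix(3, hidden, rng)     # normal projection head

    def step(self, depth, normal, h):
        x = [depth] + list(normal)
        z = [sigmoid(v) for v in matvec(self.Wz, h + x)]
        r = [sigmoid(v) for v in matvec(self.Wr, h + x)]
        gated = [ri * hi for ri, hi in zip(r, h)]
        h_tilde = [math.tanh(v) for v in matvec(self.Wh, gated + x)]
        h_new = [(1 - zi) * hi + zi * ht for zi, hi, ht in zip(z, h, h_tilde)]
        # Both heads read the SAME updated hidden state.
        depth_new = depth + matvec(self.Gd, h_new)[0]
        normal_new = [n + dn for n, dn in zip(normal, matvec(self.Gn, h_new))]
        return depth_new, normal_new, h_new

    def refine(self, depth0, normal0, h0, steps):
        d, n, h = depth0, list(normal0), list(h0)
        for _ in range(steps):
            d, n, h = self.step(d, n, h)
        d = max(d, 0.0)                           # ReLU on depth
        length = math.sqrt(sum(c * c for c in n))
        return d, [c / length for c in n]         # unit-length normal

refiner = ToyJointRefiner(hidden=8, seed=0)
depth, normal = refiner.refine(2.0, [0.0, 0.0, -1.0], [0.0] * 8, steps=4)
```

With untrained weights the outputs are meaningless. Removing the shared `h_new` and giving each head its own state would turn this into the two independent refinement loops the text warns about.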
Figure: A 1D surface (side view). Noise makes the depth bumpy, and the normals derived from that depth become noisy too; joint iterative refinement smooths both signals.
Here is the data reality: there are about 9.5 million outdoor images with depth labels, but fewer than 20,000 with normal labels. If you only train the normal head on those 20K images, it will struggle to generalize to outdoor scenes. How do you learn normals from depth-only data?
Metric3D v2's normal estimator learns from three complementary signals:
| Source | When Available | Description |
|---|---|---|
| GT Normal Labels (Ln) | ~10M indoor frames | Direct supervision with an uncertainty-aware angular loss. High quality but limited diversity. |
| Depth-Normal Consistency (Ld-n) | Always | Self-supervision: convert predicted depth to a pseudo-normal map via least-squares fitting, then minimize the angular difference with the predicted normals. Requires no labels at all. |
| Implicit Feature Fusion | Always | Through the shared ConvGRU hidden state, depth features implicitly teach the normal branch about geometry — more robust than explicit pseudo-labels. |
Given a predicted depth map D, you can compute a pseudo-normal at each pixel by fitting a local plane to the neighboring 3D points (using the least-squares method). Call this Npseudo. The consistency loss is:
This is a self-supervised signal — it uses no ground-truth labels at all. It simply enforces that the model's own depth and normal predictions agree with each other geometrically.
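Here is a minimal sketch of that idea for one pixel neighborhood. It fits a plane z = a·x + b·y + c to a handful of backprojected 3D points by least squares, turns the plane into a pseudo-normal, and measures the angle to a predicted normal. The function names and the use of the plain angle as the loss are assumptions; the text above does not spell out the exact loss form. The plane parameterization also breaks down for surfaces seen edge-on, which a real implementation would need to handle.

```python
import math

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_plane_normal(points):
    """Least-squares fit of z = a*x + b*y + c; returns a unit normal facing -z."""
    sxx = sum(x * x for x, y, z in points)
    sxy = sum(x * y for x, y, z in points)
    syy = sum(y * y for x, y, z in points)
    sx = sum(x for x, y, z in points)
    sy = sum(y for x, y, z in points)
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, len(points)]]
    rhs = [sum(x * z for x, y, z in points),
           sum(y * z for x, y, z in points),
           sum(z for x, y, z in points)]
    d = det3(A)

    def replaced(col):
        # Cramer's rule: swap one column of A for the right-hand side.
        return [[rhs[r] if c == col else A[r][c] for c in range(3)] for r in range(3)]

    a = det3(replaced(0)) / d
    b = det3(replaced(1)) / d
    length = math.sqrt(a * a + b * b + 1.0)
    return (a / length, b / length, -1.0 / length)

def consistency_loss(pred_normal, pseudo_normal):
    """Angle in radians between two unit normals."""
    cos = sum(p * q for p, q in zip(pred_normal, pseudo_normal))
    return math.acos(max(-1.0, min(1.0, cos)))

# 3D points sampled from the plane z = 2 + 0.5 * x.
neighborhood = [(0, 0, 2.0), (1, 0, 2.5), (0, 1, 2.0), (1, 1, 2.5), (2, 1, 3.0)]
pseudo = fit_plane_normal(neighborhood)
loss_if_agreeing = consistency_loss(pseudo, pseudo)
loss_if_fronto_parallel = consistency_loss((0.0, 0.0, -1.0), pseudo)
```

A normal prediction that ignores the slope of the depth (the fronto-parallel guess) is penalized, while one that agrees with the fitted plane is not.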
Consider a training image from an outdoor driving dataset with only depth labels (no normal GT). Here is what happens:
Note that the depth loss Ld operates in canonical space (Dc vs D*c), while the consistency loss Ld-n uses the de-canonicalized depth D. This is because the depth-to-normal conversion requires real-world metric depth to compute correct 3D positions — canonical depth would distort the 3D geometry.
Metric3D v2 uses a standard encoder-decoder pipeline, augmented with the canonical camera transform and the joint optimization module. Here is the full data flow.
The image, after canonical camera transformation, is fed into a Vision Transformer backbone pretrained with DINOv2 self-supervised learning. DINOv2 provides strong visual features because it was trained on 142M images with no labels, learning rich geometric and semantic representations. The ViT-Large version produces patch tokens at 1/14 resolution (each token represents a 14×14 pixel patch).
The authors also experiment with a ConvNeXt-Large backbone (for those who prefer convnets) and a ViT-giant (1B+ parameters) for maximum accuracy on benchmarks.
A DPT (Dense Prediction Transformer) decoder reassembles the multi-scale features from the ViT backbone into dense prediction maps. DPT works by tapping into features at multiple layers of the ViT and fusing them with convolutional upsampling. It produces three initial maps at 1/4 resolution:
The ConvGRU-based recurrent blocks iterate over depth and normals, as described above. The number of iterations T is a hyperparameter: more iterations give better results but cost more compute. Typically T = 3 to 5 suffices. After T+1 steps, the refined predictions are upsampled to full resolution using learned convolutional upsampling heads.
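Depth is passed through ReLU (ensuring non-negative values) and then de-canonicalized by dividing by ωd = fc/f. Normals are L2-normalized to unit vectors. The full equation:

$$D = \frac{H_d\big(\mathrm{upsample}(\hat{D}^{T+1})\big)}{\omega_d}, \qquad N = H_n\big(\mathrm{upsample}(\hat{N}^{T+1})\big)$$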
Where Hd is ReLU and Hn is per-pixel L2 normalization (ensuring ||n|| = 1).
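For a single pixel, the post-processing amounts to three small operations. The sketch below uses hypothetical function names and hypothetical focal lengths (500 and 1000 pixels); it only illustrates the ReLU, the division by ωd, and the L2 normalization described above.

```python
import math

def postprocess_depth(raw_depth_c, f, f_c):
    """ReLU on the canonical depth, then de-canonicalize by omega_d = f_c / f."""
    omega_d = f_c / f
    return max(raw_depth_c, 0.0) / omega_d

def postprocess_normal(raw_normal, eps=1e-12):
    """Scale the raw 3-vector to unit length (eps guards against a zero vector)."""
    length = math.sqrt(sum(c * c for c in raw_normal))
    return tuple(c / max(length, eps) for c in raw_normal)

f, f_c = 500.0, 1000.0                                # hypothetical focal lengths in pixels
metric_depth = postprocess_depth(4.0, f, f_c)         # 4.0 canonical -> 2.0 meters
clamped_depth = postprocess_depth(-0.3, f, f_c)       # negative output clamps to 0.0
unit_normal = postprocess_normal((0.0, 3.0, 4.0))     # roughly (0.0, 0.6, 0.8)
```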
The architecture scales smoothly across three backbone sizes:
| Backbone | Params | Use Case |
|---|---|---|
| ConvNeXt-Large | ~200M | Fastest inference, good for real-time |
| ViT-Large | ~300M | Best accuracy-speed tradeoff (default) |
| ViT-giant | ~1B | Highest accuracy, server-side |
Full architecture: image enters, canonical transform applied, encoder-decoder produces initial estimates, recurrent blocks refine jointly, outputs are de-canonicalized.
Scale is what separates a research prototype from a foundation model. Metric3D v2 trains on an unprecedented collection of data.
| Metric | Value |
|---|---|
| Total training images | 16 million+ |
| Number of datasets | 16 (indoor + outdoor, real + synthetic) |
| Camera models | Thousands (phones, DSLRs, autonomous driving rigs, RGB-D sensors) |
| Images with depth labels | ~16M |
| Images with normal labels | ~10M (mostly indoor) |
| Outdoor normal-labeled images | < 20K |
| Training hardware | 48 A100 GPUs |
| Training iterations | 800K |
| Batch size | 192 |
Key training details that make the system work at this scale:
With 16 datasets of wildly different sizes (some have millions of images, others have thousands), naive mixing would let large datasets dominate training. Following DiverseDepth, Metric3D v2 balances all datasets within each mini-batch so each accounts for an approximately equal share. This prevents the model from memorizing one domain at the expense of others.
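One simple way to get an equal share per dataset is to pick the dataset first and the sample second. The sketch below does this round-robin; the function name, the dictionary layout, and the round-robin scheme are assumptions, since the text only says that datasets are balanced within each mini-batch.

```python
import random

def balanced_batch(datasets, batch_size, rng):
    """Draw a mini-batch in which every dataset gets an (almost) equal share.

    datasets: dict mapping dataset name -> list of sample ids.
    """
    names = sorted(datasets)
    batch = []
    for i in range(batch_size):
        name = names[i % len(names)]          # round-robin over datasets
        batch.append((name, rng.choice(datasets[name])))
    rng.shuffle(batch)                        # avoid a fixed dataset order
    return batch

rng = random.Random(0)
datasets = {"big": list(range(1000)), "small": list(range(10))}
batch = balanced_batch(datasets, 8, rng)
counts = {name: sum(1 for n, _ in batch if n == name) for name in datasets}
```

Under uniform sampling over all images, the small dataset would supply about 1% of each batch; here it supplies half.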
Standard scale-shift invariant losses normalize depth over the entire image, which squeezes fine-grained depth differences in nearby regions. Consider a scene with a table at 1m and a wall at 5m. Global normalization maps this range to [0, 1], so the 2cm height difference of objects on the table is compressed to about 0.005 (0.02m out of a 4m range), almost invisible to the loss function.
The random proposal normalization loss (RPNL) fixes this by randomly cropping 32 patches (each 12.5%-50% of the image size) and applying scale-shift normalization locally within each patch. A patch that contains only the table surface will normalize the 2cm variations to span a much larger fraction of [0, 1], preserving local geometric detail. The loss is the mean absolute deviation (MAD) normalized L1 distance across all patches.
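The effect of local normalization is easy to check numerically. This sketch normalizes by subtracting the median and dividing by the mean absolute deviation around it, which is one reading of "MAD normalized" and should be treated as an assumption, as should the function names. Patch cropping is left out; the lists stand in for the pixels of a full image and of one patch.

```python
from statistics import median

def mad_normalize(values):
    """Shift by the median, scale by the mean absolute deviation around it."""
    m = median(values)
    scale = sum(abs(v - m) for v in values) / len(values)
    if scale == 0:
        return [0.0 for _ in values]
    return [(v - m) / scale for v in values]

def normalized_l1(pred, gt):
    """L1 distance between the normalized prediction and normalized ground truth."""
    p, g = mad_normalize(pred), mad_normalize(gt)
    return sum(abs(a - b) for a, b in zip(p, g)) / len(p)

table = [1.00, 1.00, 1.02, 1.02]   # depths in meters, 2 cm step on the table
wall = [5.0] * 4

global_norm = mad_normalize(table + wall)   # whole image: table and wall
local_norm = mad_normalize(table)           # patch covering only the table

global_gap = global_norm[2] - global_norm[0]   # about 0.01
local_gap = local_norm[2] - local_norm[0]      # about 2.0

# The loss ignores any positive scale and shift of the prediction.
affine_loss = normalized_l1([2 * v + 1 for v in table], table)
```

The same 2cm step is roughly 200 times larger after local normalization, so errors on it contribute meaningfully to the loss.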
Depth supervision uses four complementary losses:
Together: Ld = LPWN + LVNL + Lsilog + LRPNL.
Metric3D v2 was tested on over 16 benchmarks for depth and normals, both zero-shot (never seen the test domain during training) and fine-tuned. Here are the headline numbers.
| Method | Setting | δ1 ↑ | AbsRel ↓ |
|---|---|---|---|
| MiDaS (affine-invariant) | ZS | — | affine only |
| ZoeDepth | FT | 0.953 | 0.077 |
| DepthAnything | FT | 0.984 | 0.056 |
| Marigold | ZS | — | affine only |
| Metric3D v2 (ViT-L) | ZS | 0.975 | 0.063 |
| Metric3D v2 (ViT-L) | FT | 0.989 | 0.047 |
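For readers unfamiliar with the column headers: δ1 is the fraction of pixels whose predicted depth is within a factor of 1.25 of the ground truth, and AbsRel is the mean absolute error relative to the true depth. A minimal sketch of both, using flat lists of per-pixel depths (the function name, the masking of pixels without ground truth, and the small clamp on predictions are illustrative choices):

```python
def depth_metrics(pred, gt):
    """Return (delta1, abs_rel) over pixels that have valid ground truth."""
    pairs = [(max(p, 1e-6), g) for p, g in zip(pred, gt) if g > 0]
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / len(pairs)
    return delta1, abs_rel

pred = [1.0, 2.0, 4.0, 3.0]
gt = [1.0, 2.0, 2.0, 0.0]   # last pixel has no ground truth and is skipped
delta1, abs_rel = depth_metrics(pred, gt)
```

Two of the three valid pixels are exact and one is off by a factor of two, so δ1 is 2/3 and AbsRel is 1/3.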
| Method | Setting | δ1 ↑ | AbsRel ↓ |
|---|---|---|---|
| DepthAnything | FT | 0.982 | 0.046 |
| Metric3D v2 (ViT-L) | ZS | 0.974 | 0.052 |
| Metric3D v2 (ViT-g) | FT | 0.989 | 0.039 |
| Method | Setting | 11.25° ↑ | Mean Error ↓ |
|---|---|---|---|
| Omnidata | ZS | 0.577 | 16.7° |
| Bae et al. | ZS | 0.597 | 16.0° |
| Polymax | ZS | 0.656 | 13.1° |
| Metric3D v2 (ViT-L) | ZS | 0.662 | 13.1° |
| Metric3D v2 (ViT-L) | FT | 0.688 | 12.0° |
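The two normal metrics in the table are the fraction of pixels with an angular error below 11.25° and the mean angular error in degrees. A small sketch, with a hypothetical function name and normals given as 3-tuples:

```python
import math

def normal_metrics(pred, gt, threshold_deg=11.25):
    """Return (fraction of pixels under the threshold, mean angular error in degrees)."""
    errors = []
    for p, g in zip(pred, gt):
        dot = sum(a * b for a, b in zip(p, g))
        norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cos = max(-1.0, min(1.0, dot / norms))
        errors.append(math.degrees(math.acos(cos)))
    within = sum(1 for e in errors if e < threshold_deg) / len(errors)
    return within, sum(errors) / len(errors)

pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
within, mean_error = normal_metrics(pred, gt)
```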
Perhaps the most impressive results are on datasets the model has never seen at all during training:
| Dataset | Domain | Metric3D v2 δ1 | Best Prior δ1 |
|---|---|---|---|
| NYUv2 | Indoor | 0.975 | 0.969 (Polymax) |
| KITTI | Driving | 0.974 | 0.968 (ZeroDepth) |
| ScanNet | Indoor | 0.969 | 0.939 (HDN) |
| NuScenes | Driving | 0.977 | 0.910 (ZeroDepth) |
| DIODE Indoor | Indoor | 0.849 | 0.754 (ZeroDepth) |
| DIODE Outdoor | Outdoor | 0.847 | 0.400 (ZoeDepth) |
| ETH3D | Mixed | 0.993 | 0.969 (Polymax) |
On DIODE Outdoor, where ZoeDepth manages only δ1 = 0.400 (essentially failing), Metric3D v2 achieves 0.847, more than double the accuracy. The canonical camera transform is especially impactful on datasets captured with unusual cameras (DIODE was captured with a laser-scanner rig rather than a consumer camera), where previous methods fall apart because they have never trained on similar camera models.
On the affine-invariant depth benchmarks (ETH3D, iBIMS-1, DIODE), Metric3D v2 also outperforms MiDaS and DPT — even at their own game. The model has not sacrificed structure quality to gain metric accuracy; it has improved both.
δ1 accuracy on NYUv2 (metric depth). Higher is better. Metric3D v2 zero-shot vs other methods.
A model that predicts accurate metric depth and normals from a single photo unlocks applications that were previously impossible without expensive sensors or multi-view systems.
Take a photo with your iPhone. Read the focal length from the EXIF data. Feed the image into Metric3D v2. Get metric depth at every pixel. Backproject to 3D using the known camera intrinsics. Now you can measure real-world distances between any two points in the scene.
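The measuring step is a direct application of the pinhole model. The sketch below assumes you already have a metric depth map and the intrinsics in pixels (`fx`, `fy`, `cx`, `cy`); the function names and the numbers are hypothetical.

```python
import math

def pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Backproject pixel (u, v) with metric depth to a 3D point in meters."""
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)

def measure(pixel_a, pixel_b, depth_map, fx, fy, cx, cy):
    """Euclidean distance in meters between the 3D points behind two pixels."""
    (ua, va), (ub, vb) = pixel_a, pixel_b
    pa = pixel_to_point(ua, va, depth_map[va][ua], fx, fy, cx, cy)
    pb = pixel_to_point(ub, vb, depth_map[vb][ub], fx, fy, cx, cy)
    return math.dist(pa, pb)

# One image row of a wall 2 m away; the two pixels are 250 columns apart.
depth_map = [[2.0] * 300]
width_m = measure((10, 0), (260, 0), depth_map, 1000.0, 1000.0, 150.0, 0.0)
```

With a focal length of 1000 pixels and a depth of 2m, 250 pixels correspond to 0.5m. Any error in the predicted depth or in the focal length scales the measurement proportionally.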
The authors demonstrate this with an iPhone 14 Pro (f = 24mm, pixel size 2.44μm) and a Samsung Galaxy S23 (f = 35mm, pixel size 1μm) — two completely different cameras with different focal lengths and sensor characteristics, neither seen during training. Measured sizes (e.g., a drone's wingspan, a chair's height) come within a few centimeters of ground truth.
Visual SLAM systems like Droid-SLAM suffer from scale drift: as the camera moves through a large-scale scene, the estimated scale gradually diverges from reality. After walking through a building, the map might be 20% too large or too small, with the error accumulating over time.
By naively feeding Metric3D v2's per-frame metric depth into Droid-SLAM, the scale drift is dramatically reduced. The paper shows trajectory predictions that closely match ground truth, with accurate metric-scale dense mapping. The key: Metric3D v2 provides an absolute scale reference at every frame, preventing the SLAM system from drifting.
With metric depth, you can backproject every pixel to a 3D point at its correct real-world position. With surface normals, you can orient mesh faces correctly. The combination produces high-quality metric 3D reconstructions from casually captured images — no LiDAR, no multi-view stereo, no calibration targets.
Neural radiance fields benefit enormously from depth and normal priors. Metric3D v2's predictions can initialize NeRF geometry, speeding up convergence and improving quality in under-constrained regions (e.g., textureless walls). The metric scale ensures that the NeRF scene has correct physical dimensions, which matters for VR/AR applications where virtual objects must interact with real geometry.
Self-driving cars need to know how far away other vehicles, pedestrians, and obstacles are in meters. LiDAR provides this, but it is expensive ($5K-$75K per sensor) and produces only sparse point clouds. A monocular camera with Metric3D v2 can provide dense metric depth at every pixel at a fraction of the cost, serving as a redundant safety channel or enabling depth estimation in camera-only setups. On the KITTI benchmark, the fine-tuned model reaches δ1 = 0.989, meaning nearly 99% of pixels fall within a factor of 1.25 of the true depth in driving scenarios.
Metric3D v2 sits at the intersection of several important research threads. Let's map where it fits.
Metric3D v1 introduced the canonical camera space transformation for metric depth. v2 keeps this module but adds three major innovations: (1) joint depth-normal optimization via ConvGRU, (2) the ability to learn normals from depth labels via the consistency loss, and (3) a scale-up from roughly 8M to 16M+ training images with ViT backbones. The result is more than better depth: it is a geometric foundation model that covers both depth and normals.
MiDaS and DPT are affine-invariant depth models — they predict structure but not metric scale. Metric3D v2 solves the metric problem that MiDaS could not: by canonicalizing the camera, it learns true metric depth while maintaining the same robustness to diverse scenes.
DepthAnything focuses on learning strong depth representations from massive unlabeled data via self-training. It produces affine-invariant or fine-tuned metric depth. Metric3D v2 takes a different approach: rather than scaling up unlabeled data, it scales up labeled data with the canonical space transformation module (CSTM) and adds normals. The two approaches are complementary.
Marigold uses diffusion models for monocular depth estimation, producing beautiful affine-invariant predictions. GeoWizard extends this to joint depth-normal estimation with diffusion. Metric3D v2 uses discriminative models (ViT + DPT) instead of diffusion, which makes it much faster at inference and enables true metric prediction. The paper shows that Marigold's depth-derived normals contain artifacts (noise at edges, incorrect orientations on smooth surfaces), while Metric3D v2's jointly-trained normal head produces cleaner results.
ZoeDepth tackles metric depth by first training an affine-invariant model, then fine-tuning "metric heads" for specific domain distributions (indoor vs outdoor). UniDepth uses camera-aware features. Metric3D v2's CSTM approach is more elegant: rather than learning domain-specific heads or camera encoders, it mathematically normalizes the camera out of the problem. One model, one training stage, all domains.
Omnidata tackles the normal data scarcity problem by performing dense 3D reconstruction on 1300M frames to create normal labels. This is enormously expensive. Metric3D v2's depth-to-normal distillation achieves similar or better results by leveraging cheap depth labels instead — a more scalable approach.
| Aspect | Metric3D v2 |
|---|---|
| Input | Single RGB image + camera focal length |
| Output | Metric depth map + surface normal map |
| Backbone | DINOv2 ViT-Large (or ConvNeXt-L, ViT-giant) |
| Key module | Canonical camera space transform (CSTM) |
| Innovation | Joint depth-normal iterative optimization via ConvGRU |
| Normal supervision | GT labels + depth-normal consistency + implicit feature fusion |
| Training data | 16M+ images, 16 datasets, 1000s of cameras |
| Key result | #1 on NYU, KITTI, ScanNet (both depth and normals) |
| Inference | Single forward pass + T+1 refinement iterations |