Wang, Leroy, Cabon, Chidlovskii, Revaud — CVPR 2024

DUSt3R: Geometric 3D Vision Made Easy

Predict dense 3D pointmaps directly from image pairs using a transformer — no camera calibration, no feature matching, no multi-stage pipeline. One network replaces the entire SfM/MVS stack.

Prerequisites: Vision Transformers + Basic 3D geometry

Chapter 0: The Problem

You have a handful of photos of a room, a building, or a sculpture. You want the 3D shape. What does the traditional pipeline look like?

First, detect keypoints in every image (SIFT, SuperPoint). Then match them across pairs (SuperGlue, LoFTR). Then estimate essential matrices to recover relative camera poses. Then run bundle adjustment to refine everything jointly. Then feed the calibrated cameras into a multi-view stereo algorithm to get dense depth. Each step is a separate module, a separate paper, a separate failure mode.

This chain — Structure-from-Motion (SfM) followed by Multi-View Stereo (MVS) — is the backbone of 3D reconstruction. Tools like COLMAP implement it beautifully. But the pipeline is fragile: each sub-problem adds noise to the next. Keypoint matching fails on textureless surfaces. Essential matrix estimation breaks with insufficient camera motion. And the entire chain requires camera intrinsics (focal length, principal point) that you often don't have.

The core frustration: "An MVS algorithm is only as good as the quality of the input images and camera parameters." Every stage is solved independently — dense reconstruction can't help sparse SfM, and vice versa. There's no communication between sub-problems. DUSt3R asks: what if a single network solved everything at once?
Why is the traditional SfM → MVS pipeline fragile?

Chapter 1: The Key Insight

DUSt3R's radical idea: skip every intermediate step. Don't detect keypoints. Don't match features. Don't estimate cameras. Instead, take two RGB images and directly regress a 3D pointmap for each — a per-pixel 3D coordinate telling you where each pixel lives in space.

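Think about what this means. The network takes in two images of a scene and outputs, for every pixel in both images, an (x, y, z) coordinate in a shared 3D coordinate frame. From this output alone, you can extract depth maps, pixel correspondences, relative camera pose, and camera intrinsics (Chapters 7 and 8 show how).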

One forward pass. One network. No calibration input. No feature matching. No bundle adjustment.

Why pointmaps? Traditional approaches predict depth (a single number per pixel) plus camera parameters (rotation, translation, intrinsics) separately. DUSt3R predicts (x, y, z) per pixel directly. This is more expressive: the 3D points don't need to follow a pinhole camera model. In principle the network can learn arbitrary mappings from pixels to 3D, so real-world deviations such as lens distortion need no explicit model.
What does DUSt3R predict instead of depth maps + camera parameters?

Chapter 2: The Pointmap Representation

Let's define pointmaps precisely. A pointmap X is a W × H × 3 tensor: the same spatial resolution as the input image, but with three channels storing (x, y, z) instead of (R, G, B). Each pixel (i, j) maps to a 3D point $X_{i,j} \in \mathbb{R}^3$.

The notation $X^{n,m}$ means "the pointmap for image n, expressed in camera m's coordinate frame." DUSt3R always outputs both pointmaps in camera 1's coordinate frame:

$$F(I^1, I^2) \rightarrow (X^{1,1}, X^{2,1}, C^{1,1}, C^{2,1})$$

Both $X^{1,1}$ and $X^{2,1}$ live in the same coordinate system (camera 1), so they're directly comparable with no alignment needed. The confidence maps $C^{1,1}$ and $C^{2,1}$ tell you how certain the network is about each pixel's 3D position.

Why not just depth?

A depth map D gives you one number per pixel: how far away is this pixel? To get 3D points, you need the camera intrinsics K:

$$X_{i,j} = K^{-1} \, [\, i \cdot D_{i,j},\;\; j \cdot D_{i,j},\;\; D_{i,j} \,]^\top$$

This requires knowing the focal length and principal point. Pointmaps bypass this entirely — the network directly outputs the 3D coordinates without needing K. The relationship between pixels and 3D points is learned, not assumed to follow a pinhole model.
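The back-projection formula above can be sketched in a few lines of NumPy. This is only an illustration of the depth-plus-intrinsics route that pointmaps avoid; the function name `depth_to_pointmap` and the intrinsics values are made up for the example, and the array is stored as (H, W, 3), indexed `[row j, column i]`.

```python
import numpy as np

def depth_to_pointmap(depth, K):
    """Back-project a depth map (H, W) into a pointmap (H, W, 3) using intrinsics K."""
    H, W = depth.shape
    # i indexes columns (x) and j indexes rows (y), matching the formula above.
    i, j = np.meshgrid(np.arange(W), np.arange(H))
    scaled_pixels = np.stack([i * depth, j * depth, depth], axis=-1)  # (H, W, 3)
    # Apply K^{-1} to every pixel vector (row-vector form, hence the transpose).
    return scaled_pixels @ np.linalg.inv(K).T

# Hypothetical intrinsics: focal length 100 px, principal point at (2.0, 1.5).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 2.0)          # a flat wall 2 units away
X = depth_to_pointmap(depth, K)
print(X.shape)                        # (3, 4, 3)
```

Without K this conversion is impossible, which is exactly the dependency a directly regressed pointmap removes.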

Scale ambiguity: The predicted pointmaps are only defined up to an unknown global scale. A scene could be a miniature model or a full building, and the network can't tell from two images alone. DUSt3R normalizes by the average distance of all valid points to the origin: $z = \mathrm{norm}(X^{1,1}, X^{2,1})$, the mean of $\|X_i\|$ over all valid points $X_i$ in both pointmaps. Ground truth is normalized the same way during training.
Why does DUSt3R predict pointmaps (per-pixel 3D coordinates) rather than depth maps?

Chapter 3: The Architecture

DUSt3R's architecture is a Siamese encoder with a cross-attention decoder, inspired by CroCo (Cross-view Completion). It has three stages: encode, decode with information sharing, and regress pointmaps.

Stage 1: Siamese ViT Encoder

Both images are independently encoded by the same weight-sharing Vision Transformer (ViT-Large). Each image is split into 16×16 patches and processed into a sequence of tokens:

$$F^1 = \mathrm{Encoder}(I^1), \qquad F^2 = \mathrm{Encoder}(I^2)$$

At this stage, the two images don't interact. Each branch produces a token representation independently.
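To make the patch-to-token step concrete, here is a minimal NumPy sketch of splitting an image into 16×16 patches. It covers only the reshaping; the learned linear projection and positional information that a real ViT applies afterwards are omitted, and `patchify` is an illustrative name rather than a function from the DUSt3R code.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patch vectors."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be a multiple of the patch size"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)   # (196, 768): a 14 x 14 grid of patches, 16*16*3 values each
```

Each view is turned into its own token sequence this way, which is why the two encoder branches can run independently.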

Stage 2: Cross-Attention Decoder

This is where the magic happens. Two separate decoders (ViT-Base) process the two token streams, but they constantly exchange information via cross-attention. Each decoder block performs three operations in sequence:

  1. Self-attention: tokens within one view attend to each other
  2. Cross-attention: tokens from view 1 attend to tokens from view 2 (and vice versa)
  3. MLP: standard feed-forward layer

$$G^1_i = \mathrm{DecoderBlock}^1_i\left(G^1_{i-1},\; G^2_{i-1}\right)$$

$$G^2_i = \mathrm{DecoderBlock}^2_i\left(G^2_{i-1},\; G^1_{i-1}\right)$$

The cross-attention is crucial: it lets the network reason about correspondences between the two views. Without it, each branch would produce an independent depth map with no shared coordinate frame.
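The three operations of a decoder block can be sketched with toy NumPy code. This is a simplified illustration, not the actual implementation: it uses a single attention head, random weights, a tiny token width, and no layer normalization, all of which are assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # toy token width (hypothetical; the real decoder is much wider)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head scaled dot-product attention."""
    q, k, v = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def make_block():
    # 3 matrices for self-attention, 3 for cross-attention, 2 for the MLP.
    return [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(8)]

def decoder_block(own, other, params):
    sq, sk, sv, cq, ck, cv, w1, w2 = params
    own = own + attention(own, own, sq, sk, sv)      # 1. self-attention within the view
    own = own + attention(own, other, cq, ck, cv)    # 2. cross-attention to the other view
    own = own + np.maximum(own @ w1, 0.0) @ w2       # 3. MLP (ReLU feed-forward)
    return own

G1 = rng.normal(size=(196, D))   # tokens of view 1 from the previous layer
G2 = rng.normal(size=(196, D))   # tokens of view 2 from the previous layer
block1, block2 = make_block(), make_block()

# Both updates read the previous layer's tokens, as in the two equations above.
G1_next = decoder_block(G1, G2, block1)
G2_next = decoder_block(G2, G1, block2)
```

Because step 2 takes its keys and values from the other view, changing one image's tokens changes the other branch's output, which is the information sharing the text describes.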

Stage 3: Regression Heads

Two DPT heads take the decoder tokens from all layers and produce the final output: a pointmap $X \in \mathbb{R}^{W \times H \times 3}$ and a confidence map $C \in \mathbb{R}^{W \times H}$ for each view.

CroCo pretraining: The entire network is initialized from CroCo, a self-supervised model trained to complete masked patches of one view given the other view. This pretraining teaches the network cross-view geometric reasoning before it ever sees a 3D supervision signal — a massive head start.
Why is cross-attention between the two decoder branches essential?

Chapter 4: Training

DUSt3R is trained with a beautifully simple objective: regress the correct 3D coordinates for every pixel. No adversarial loss, no contrastive learning, no hand-crafted geometric constraints. Just predict the right (x, y, z) and you're done.

The regression loss

For each valid pixel i in view v, the loss is the Euclidean distance between prediction and ground truth, normalized by scale:

$$\ell_{\mathrm{regr}}(v, i) = \left\| \frac{1}{z} X^{v,1}_i - \frac{1}{\bar{z}} \bar{X}^{v,1}_i \right\|$$

Where z and z̄ are the average distances to the origin for predicted and ground-truth pointmaps, respectively. This normalization handles the inherent scale ambiguity — the network only needs to get the shape right, not the absolute metric scale.
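A small NumPy sketch shows why this normalization removes the scale ambiguity. The helper names `norm_factor` and `regression_loss` are illustrative, and the validity masks are assumed to be boolean arrays marking pixels with usable ground truth.

```python
import numpy as np

def norm_factor(X1, X2, valid1, valid2):
    """Average distance to the origin over all valid points of both views."""
    pts = np.concatenate([X1[valid1], X2[valid2]], axis=0)
    return np.linalg.norm(pts, axis=-1).mean()

def regression_loss(pred, gt, z_pred, z_gt):
    """Per-pixel Euclidean error between scale-normalized pointmaps."""
    return np.linalg.norm(pred / z_pred - gt / z_gt, axis=-1)

rng = np.random.default_rng(0)
gt1, gt2 = rng.normal(size=(8, 8, 3)), rng.normal(size=(8, 8, 3))
valid = np.ones((8, 8), dtype=bool)

# A prediction that is correct up to a global scale of 5 incurs zero loss.
pred1, pred2 = 5.0 * gt1, 5.0 * gt2
z_pred = norm_factor(pred1, pred2, valid, valid)
z_gt = norm_factor(gt1, gt2, valid, valid)
loss = regression_loss(pred1, gt1, z_pred, z_gt)
print(loss.max())   # numerically zero
```

Note that a single factor is computed over both views together, so the relative scale between the two pointmaps is still supervised.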

Confidence-aware weighting

Not all pixels are equally easy. The sky has no well-defined 3D point. Translucent objects are ambiguous. DUSt3R learns a per-pixel confidence score $C^{v,1}_i$ that weights the loss:

$$\mathcal{L}_{\mathrm{conf}} = \sum_{v} \sum_{i} C^{v,1}_i \, \ell_{\mathrm{regr}}(v, i) - \alpha \log C^{v,1}_i$$

The first term says: "weight hard pixels less." The second term (−α log C) is a regularizer that prevents the trivial solution of setting all confidences to zero. The network must try to predict everywhere, but it can express lower confidence in genuinely ambiguous regions.
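The trade-off between the two terms can be checked numerically. For a single pixel with fixed error ℓ, setting the derivative of C·ℓ − α log C to zero gives C = α/ℓ, so larger errors pull the preferred confidence down. The sketch below assumes strictly positive confidences and uses α = 0.2 purely as an illustrative value.

```python
import numpy as np

def confidence_loss(regr, conf, alpha=0.2):
    """Sum over pixels of C * l_regr - alpha * log(C); conf must be positive."""
    return np.sum(conf * regr - alpha * np.log(conf))

alpha = 0.2
regr = 0.05                                   # fixed regression error of one pixel
grid = np.linspace(0.1, 20.0, 2000)           # candidate confidence values
per_pixel = grid * regr - alpha * np.log(grid)
best = grid[np.argmin(per_pixel)]
print(best, alpha / regr)                     # the grid minimum sits near alpha / regr

# Driving confidence toward zero is penalized by the log term.
near_zero = confidence_loss(np.array([regr]), np.array([1e-6]))
at_optimum = confidence_loss(np.array([regr]), np.array([alpha / regr]))
```

Doubling the error halves the preferred confidence, which is how ambiguous regions end up with low scores without the network being allowed to ignore them.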

Training data

DUSt3R trains on a mixture of 8 datasets totaling 8.5M image pairs: Habitat (synthetic indoor), MegaDepth (outdoor landmarks), ARKitScenes (real indoor), ScanNet++ (high-quality indoor scans), CO3D (object-centric), BlendedMVS (diverse), Static Scenes 3D, and Waymo (autonomous driving). Ground truth pointmaps come from depth sensors, SfM reconstructions, or synthetic rendering.

Two-stage training: To handle the cost of high-resolution ViT processing, training happens in two phases: first at 224×224 resolution, then fine-tuned with the longest image side at 512 pixels. The network also sees random aspect ratios (16:9, 4:3, etc.) so it handles different image shapes at test time.
What prevents the confidence-aware loss from setting all confidence scores to zero (ignoring all pixels)?

Chapter 5: Global Alignment

So far, DUSt3R handles one pair of images at a time. But what if you have 10, 50, or 200 photos of a scene? You need all the pairwise pointmaps to agree — to live in a single, consistent 3D coordinate frame. This is the global alignment problem.

The connectivity graph

Given N images, build a graph where each edge connects two images that share visual content. You can determine overlap by running DUSt3R on all pairs (∼40ms per pair on an H100) and checking average confidence, or use image retrieval to select likely pairs.
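Building the graph from pairwise scores is straightforward once the scores exist. The sketch below assumes the average confidences have already been computed for every pair; the score values and the threshold are invented for illustration, and no particular cutoff is given in the text above.

```python
from itertools import combinations

def build_graph(pair_scores, n_images, threshold):
    """Keep edge (n, m) when the pair's average confidence exceeds the threshold."""
    edges = []
    for n, m in combinations(range(n_images), 2):
        if pair_scores[(n, m)] > threshold:
            edges.append((n, m))
    return edges

# Hypothetical average confidences for 4 images taken while walking around an object.
scores = {(0, 1): 3.2, (0, 2): 1.1, (0, 3): 1.0,
          (1, 2): 2.7, (1, 3): 1.2, (2, 3): 2.9}
edges = build_graph(scores, n_images=4, threshold=2.0)
print(edges)   # [(0, 1), (1, 2), (2, 3)]
```

Exhaustive pairing grows quadratically with the number of images, which is why image retrieval is the cheaper option for large collections.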

The optimization

For each pair e = (n, m) in the graph, DUSt3R predicts pointmaps $X^{n,e}$ and $X^{m,e}$ in a local coordinate frame. We want globally consistent pointmaps $\chi^n$ for every image n. For each pair, we introduce a rigid transformation $P_e$ and scale $\sigma_e$ that aligns the local predictions to the global frame:

$$\chi^* = \arg\min_{\chi, P, \sigma} \sum_{e \in \mathcal{E}} \sum_{v \in e} \sum_{i} C^{v,e}_i \left\| \chi^v_i - \sigma_e P_e X^{v,e}_i \right\|$$

The key insight: both pointmaps $X^{n,e}$ and $X^{m,e}$ from the same pair share the same local coordinate frame (camera n's frame), so the same rigid transformation $P_e$ should align both to the global frame. The product constraint $\prod_e \sigma_e = 1$ prevents the trivial all-zeros solution.
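The objective can be written down directly. The sketch below only evaluates it; a real system would minimize it with an autodiff framework. Parameterizing the scales as normalized log-values is one assumed way to satisfy the product constraint by construction, and every name here is illustrative rather than taken from the DUSt3R code.

```python
import numpy as np

def normalized_scales(log_scales):
    """Map free parameters to positive scales whose product is exactly 1."""
    return np.exp(log_scales - log_scales.mean())

def alignment_objective(chi, edges, local_pts, conf, R, t, log_scales):
    """Confidence-weighted 3D distance between global and aligned local points.

    chi: (N, P, 3) global pointmaps; local_pts[e][v]: (P, 3) local pointmap of
    view v in edge e; conf[e][v]: (P,) confidences; R: (E, 3, 3); t: (E, 3).
    """
    sigma = normalized_scales(log_scales)
    total = 0.0
    for e, pair in enumerate(edges):
        for v in pair:
            aligned = sigma[e] * (local_pts[e][v] @ R[e].T + t[e])   # sigma_e * P_e * X
            total += np.sum(conf[e][v] * np.linalg.norm(chi[v] - aligned, axis=-1))
    return total

def random_rotation(rng):
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1          # flip one axis to get a proper rotation
    return q

# Synthetic check: build local pointmaps from known transforms, then evaluate.
rng = np.random.default_rng(0)
n_pix = 20
chi = rng.normal(size=(3, n_pix, 3))
edges = [(0, 1), (1, 2)]
true_sigma = np.array([2.0, 0.5])                    # product is 1
R = np.stack([random_rotation(rng) for _ in edges])
t = rng.normal(size=(2, 3))

local_pts, conf = [], []
for e, pair in enumerate(edges):
    local_pts.append({v: (chi[v] / true_sigma[e] - t[e]) @ R[e] for v in pair})
    conf.append({v: np.ones(n_pix) for v in pair})

log_s = np.log(true_sigma)
at_truth = alignment_objective(chi, edges, local_pts, conf, R, t, log_s)
perturbed = alignment_objective(chi, edges, local_pts, conf, R, t + 0.1, log_s)
print(at_truth, perturbed)   # near zero at the true parameters, larger otherwise
```

Since every term is a plain 3D distance, the whole expression is differentiable in the poses, scales, and global points, which is what makes ordinary gradient descent sufficient.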

Not bundle adjustment: Traditional BA minimizes 2D reprojection errors, jointly refining 3D points and camera parameters, usually with specialized nonlinear least-squares solvers. DUSt3R's global alignment minimizes 3D point distances directly: no reprojection, no camera model assumptions. This is faster (standard gradient descent converges in seconds on a GPU) and simpler (no special solvers needed).

Recovering camera parameters

If you want camera parameters (poses, intrinsics, depth maps), you can enforce a pinhole model during alignment by parameterizing:

$$\chi^n_{i,j} := P_n^{-1} K_n^{-1} \, [\, i \cdot D^n_{i,j},\;\; j \cdot D^n_{i,j},\;\; D^n_{i,j} \,]^\top$$

This jointly optimizes all camera poses {Pn}, intrinsics {Kn}, and depth maps {Dn} — the complete output of a traditional SfM + MVS pipeline, obtained from a single optimization.

How does DUSt3R's global alignment differ from traditional bundle adjustment?

Chapter 6: Results

DUSt3R is evaluated on a remarkable breadth of tasks — all with the same model, no task-specific fine-tuning. This is the payoff of the unified pointmap representation: one model, many outputs.

Multi-view pose estimation

On CO3Dv2 and RealEstate10k, DUSt3R with global alignment achieves the best overall mAA@30, significantly surpassing PoseDiffusion and COLMAP+SuperPoint+SuperGlue. And DUSt3R was never explicitly trained for pose estimation — poses are a downstream extraction from pointmaps.

Monocular depth

For single-image depth, DUSt3R simply feeds the same image twice: F(I, I). In zero-shot evaluation (no fine-tuning on target datasets), it performs on par with state-of-the-art supervised baselines on NYUv2, KITTI, DDAD, and other benchmarks.

Visual localization

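On 7Scenes (indoor) and Cambridge Landmarks (outdoor), DUSt3R performs comparably to specialized localization methods like HLoc, despite never being trained for localization. It uses raw pointmap outputs as a 2D-2D pixel matcher between the query and a database image, lifts those matches to 2D-3D correspondences using the database image's known 3D points, then solves PnP-RANSAC for the pose.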

One model, many tasks: The same DUSt3R checkpoint produces state-of-the-art or competitive results on monocular depth, multi-view depth, relative pose estimation, absolute pose estimation, and dense 3D reconstruction. No other method before DUSt3R unified all these capabilities in a single model.
How does DUSt3R handle monocular (single-image) depth estimation?

Chapter 7: No Calibration Needed

Let's appreciate how revolutionary the "no calibration" aspect is. In traditional 3D vision, camera calibration is the first step and the hardest bottleneck. You need to know each camera's intrinsics (focal length, principal point) and its extrinsics (pose).

DUSt3R needs neither. The pointmap representation implicitly encodes all this information. The network has learned, from millions of training pairs, to map pixels to 3D coordinates regardless of the camera that captured them.

Recovering intrinsics if you want them

Since $X^{1,1}$ is expressed in camera 1's frame, you can recover the focal length by solving a simple optimization. Assuming the principal point is roughly centered and pixels are square:

$$f_1^* = \arg\min_{f_1} \sum_{i,j} C^{1,1}_{i,j} \left\| (i', j') - f_1 \, \frac{\left(X^{1,1}_{i,j,0},\; X^{1,1}_{i,j,1}\right)}{X^{1,1}_{i,j,2}} \right\|$$

where $i' = i - W/2$ and $j' = j - H/2$. The problem has no closed-form solution, but a fast iterative solver based on the Weiszfeld algorithm finds the optimum in just a few iterations. The network doesn't predict the focal length directly, but the focal length is implicit in the pointmap geometry.
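A Weiszfeld-style solver for this objective fits in a few lines: each iteration solves a weighted least-squares problem whose weights are the confidences divided by the current residuals. The NumPy sketch below is an assumed minimal version with an illustrative function name, not the reference implementation, and it is checked only on noise-free synthetic data.

```python
import numpy as np

def estimate_focal(pointmap, conf, n_iters=10):
    """Estimate the focal length from a pointmap (H, W, 3) given in its own camera frame."""
    H, W, _ = pointmap.shape
    i, j = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([i - W / 2, j - H / 2], axis=-1).reshape(-1, 2)   # centered pixels (i', j')
    pts = pointmap.reshape(-1, 3)
    ray = pts[:, :2] / pts[:, 2:3]                                   # (x/z, y/z) per pixel
    c = conf.reshape(-1)

    dot_pr = np.sum(pix * ray, axis=-1)
    dot_rr = np.sum(ray * ray, axis=-1)
    f = np.sum(c * dot_pr) / np.sum(c * dot_rr)       # plain least-squares initial guess
    for _ in range(n_iters):
        residual = np.linalg.norm(pix - f * ray, axis=-1)
        w = c / np.maximum(residual, 1e-8)            # Weiszfeld reweighting
        f = np.sum(w * dot_pr) / np.sum(w * dot_rr)   # weighted least-squares update
    return f

# Synthetic pointmap rendered with a known focal length and a centered principal point.
H, W, f_true = 48, 64, 120.0
i, j = np.meshgrid(np.arange(W), np.arange(H))
z = 2.0 + 0.01 * i + 0.02 * j                         # a slanted plane
X = np.stack([(i - W / 2) * z / f_true, (j - H / 2) * z / f_true, z], axis=-1)
f_est = estimate_focal(X, conf=np.ones((H, W)))
print(f_est)
```

The reweighting makes the estimate behave like a median rather than a mean, so a minority of badly predicted pixels has limited influence on the result.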

Any camera, any lens: Because pointmaps don't assume a specific camera model, DUSt3R is not restricted to calibrated pinhole images and tolerates images with unknown crop or zoom. In principle the same representation extends to wide-angle or fisheye lenses, to the extent that the training data covers them. The pointmap representation is more general than any parametric camera model.
How can camera intrinsics be recovered from DUSt3R's output, even though they're not explicitly predicted?

Chapter 8: Downstream Tasks

The pointmap representation is surprisingly versatile. Here's how DUSt3R extracts classical 3D vision outputs from its unified representation.

Point matching

Establishing pixel correspondences between two images: find nearest neighbors in 3D pointmap space. For pixel i in image 1, take its 3D point in $X^{1,1}$ and find the closest point in $X^{2,1}$. Retain only mutual (reciprocal) matches for robustness. This replaces SIFT/SuperPoint + SuperGlue in a single step.
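Reciprocal nearest-neighbor matching can be sketched with a brute-force distance matrix. This version is only meant to show the logic: it is quadratic in memory, so full-resolution pointmaps would need a spatial index such as a KD-tree, and the function name is illustrative.

```python
import numpy as np

def mutual_nearest_neighbors(X1, X2):
    """Return index pairs (a, b) where X1[a] and X2[b] are each other's nearest point."""
    P1, P2 = X1.reshape(-1, 3), X2.reshape(-1, 3)
    d2 = np.sum((P1[:, None, :] - P2[None, :, :]) ** 2, axis=-1)   # (N1, N2) squared distances
    nn12 = d2.argmin(axis=1)          # for each point of view 1, nearest point of view 2
    nn21 = d2.argmin(axis=0)          # for each point of view 2, nearest point of view 1
    a = np.arange(len(P1))
    keep = nn21[nn12] == a            # reciprocity check
    return np.stack([a[keep], nn12[keep]], axis=1)

# Toy check: view 2 holds the same 3D points as view 1, shuffled and slightly perturbed.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(50, 3))
perm = rng.permutation(50)
X2 = X1[perm] + 1e-6 * rng.normal(size=(50, 3))
matches = mutual_nearest_neighbors(X1, X2)
print(len(matches))
```

The reciprocity check discards points whose nearest neighbor prefers a different partner, which typically removes matches in regions visible from only one view.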

Relative pose estimation

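Compare the two pointmaps $X^{1,1}$ and $X^{1,2}$ using Procrustes alignment to recover the rigid transformation (rotation R, translation t, scale σ) between camera frames. Here $X^{1,2}$ is image 1's pointmap expressed in camera 2's frame, obtained by running the network on the swapped pair. Or use the matched points with PnP-RANSAC for a more robust estimate.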
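Procrustes alignment with scale has a closed-form solution via the SVD, often attributed to Umeyama. The sketch below is an unweighted version that treats every pixel equally, which is a simplification; in practice the two pointmaps would be flattened to (N, 3) arrays and could be weighted by confidence.

```python
import numpy as np

def procrustes(A, B):
    """Find scale s, rotation R, translation t minimizing sum ||s * R @ a + t - b||^2."""
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - mu_a, B - mu_b
    cov = B0.T @ A0 / len(A)                      # cross-covariance between the two sets
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))            # guard against a reflection
    Dm = np.diag([1.0, 1.0, d])
    R = U @ Dm @ Vt
    var_a = np.mean(np.sum(A0 ** 2, axis=1))
    s = np.trace(np.diag(S) @ Dm) / var_a
    t = mu_b - s * R @ mu_a
    return s, R, t

# Synthetic check: B is A under a known similarity transform.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
s_true, t_true = 2.5, np.array([0.3, -1.0, 4.0])
B = s_true * A @ R_true.T + t_true

s, R, t = procrustes(A, B)
print(s)
```

Because the solution is a least-squares fit, a few grossly wrong points can skew it, which is the motivation for the RANSAC-based alternative mentioned above.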

Absolute pose estimation (visual localization)

Given a query image $I^Q$ and a database image $I^B$ with known 3D coordinates: run DUSt3R on the pair, extract 2D-3D correspondences from the pointmaps, and solve PnP-RANSAC for the query camera's absolute pose.

Dense 3D reconstruction

The pointmaps are the reconstruction. For multi-view scenes, run global alignment (Chapter 5) to merge all pairwise predictions into one consistent 3D point cloud. Color each 3D point with its corresponding pixel color from the input images.

The unified representation thesis: Every classical 3D vision task — depth estimation, pose estimation, feature matching, 3D reconstruction, visual localization — is just a different reading of the same pointmap output. This is what makes DUSt3R "easy": the hard work happens once in the network, and downstream tasks are simple post-processing.
How does DUSt3R establish pixel correspondences between two images?

Chapter 9: Connections

What DUSt3R built on

COLMAP / SfM (Schönberger & Frahm, 2016): The gold-standard traditional pipeline. DUSt3R replaces its entire chain — feature detection, matching, essential matrix estimation, bundle adjustment, dense MVS — with a single forward pass. Global alignment serves a similar role to bundle adjustment but operates on 3D points directly.

CroCo (Weinzaepfel et al., 2022-23): Cross-view Completion — a self-supervised pretraining method where the model learns to reconstruct masked patches of one view given the other. DUSt3R's architecture is directly inspired by CroCo, and CroCo pretrained weights provide a massive initialization boost.

DPT (Ranftl et al., 2021): Dense Prediction Transformer — the regression head architecture that converts ViT tokens back to pixel-level predictions. DUSt3R uses DPT heads for pointmap and confidence output.

Monocular Depth Estimation (MiDaS, ZoeDepth, etc.): Prior work predicted depth from single images. DUSt3R generalizes this: a pair produces calibration-free stereo, and feeding the same image twice recovers monocular depth.

What DUSt3R enabled

MASt3R (Leroy, Cabon, Revaud, 2024): Matching And Stereo 3D Reconstruction. Extends DUSt3R by adding local feature descriptors to the pointmap output, improving the quality of pixel correspondences and enabling better global alignment. A direct successor.

VGGT (Wang et al., 2025): Visual Geometry Grounded Transformer. Pushes the DUSt3R paradigm further — handles arbitrary numbers of input images in a single forward pass (no separate global alignment), predicts pointmaps, cameras, and more in one shot.

3D Gaussian Splatting (Kerbl et al., 2023): DUSt3R pointmaps provide excellent initialization for 3DGS — you get a dense, colored point cloud from uncalibrated images, which 3DGS can then refine into a real-time renderable scene. Several works combine DUSt3R + 3DGS for fast novel view synthesis from casual photos.

DUSt3R's legacy: By proving that a single network can replace the entire SfM/MVS pipeline, DUSt3R changed how the community thinks about 3D reconstruction. The key lesson: don't chain together specialized modules — let a powerful transformer learn the whole mapping from pixels to 3D, end to end. The pointmap representation and global alignment procedure have become standard building blocks for subsequent methods.

Cheat sheet

Core idea: Regress per-pixel 3D pointmaps from image pairs, no camera calibration needed
Architecture: Siamese ViT-L encoder + cross-attention ViT-B decoder + DPT heads
Training: Confidence-weighted 3D regression loss on 8.5M pairs from 8 datasets
Multi-view: Global alignment, which optimizes rigid transforms to unify pairwise pointmaps
Impact: Replaces SfM + MVS pipeline; enables MASt3R, VGGT, DUSt3R + 3DGS
What is the key advance of MASt3R over DUSt3R?