Predict dense 3D pointmaps directly from image pairs using a transformer — no camera calibration, no feature matching, no multi-stage pipeline. One network replaces the entire SfM/MVS stack.
You have a handful of photos of a room, a building, or a sculpture. You want the 3D shape. What does the traditional pipeline look like?
First, detect keypoints in every image (SIFT, SuperPoint). Then match them across pairs (SuperGlue, LoFTR). Then estimate essential matrices to recover relative camera poses. Then run bundle adjustment to refine everything jointly. Then feed the calibrated cameras into a multi-view stereo algorithm to get dense depth. Each step is a separate module, a separate paper, a separate failure mode.
This chain — Structure-from-Motion (SfM) followed by Multi-View Stereo (MVS) — is the backbone of 3D reconstruction. Tools like COLMAP implement it beautifully. But the pipeline is fragile: each sub-problem adds noise to the next. Keypoint matching fails on textureless surfaces. Essential matrix estimation breaks with insufficient camera motion. And the entire chain requires camera intrinsics (focal length, principal point) that you often don't have.
DUSt3R's radical idea: skip every intermediate step. Don't detect keypoints. Don't match features. Don't estimate cameras. Instead, take two RGB images and directly regress a 3D pointmap for each — a per-pixel 3D coordinate telling you where each pixel lives in space.
Think about what this means. The network takes in two images of a scene and outputs, for every pixel in both images, an (x, y, z) coordinate in a shared 3D coordinate frame. From this output alone, you can extract depth maps, pixel correspondences, relative camera poses, and focal lengths.
One forward pass. One network. No calibration input. No feature matching. No bundle adjustment.
Let's define pointmaps precisely. A pointmap X is a W × H × 3 tensor: the same spatial resolution as the input image, but with three channels storing (x, y, z) instead of (R, G, B). Each pixel (i, j) maps to a 3D point Xi,j ∈ ℝ³.
The notation Xn,m means "the pointmap for image n, expressed in camera m's coordinate frame." DUSt3R always outputs both pointmaps in camera 1's coordinate frame: X1,1 for image 1 and X2,1 for image 2, each with a confidence map (C1 and C2).
Both X1,1 and X2,1 live in the same coordinate system (camera 1), so they're directly comparable — no alignment needed. The confidence maps C1 and C2 tell you how certain the network is about each pixel's 3D position.
A depth map D gives you one number per pixel: how far away is this pixel? To get 3D points, you need the camera intrinsics K:
Xi,j = K⁻¹ [i·Di,j, j·Di,j, Di,j]ᵀ
This requires knowing the focal length and principal point. Pointmaps bypass this entirely — the network directly outputs the 3D coordinates without needing K. The relationship between pixels and 3D points is learned, not assumed to follow a pinhole model.
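To make the contrast concrete, here is a minimal NumPy sketch of the classical route: unprojecting a depth map into a pointmap when K is known. The function name `depth_to_pointmap` and the toy intrinsics are assumptions for illustration, and arrays use NumPy's (H, W, 3) layout rather than W × H × 3.

```python
import numpy as np

def depth_to_pointmap(depth, K):
    """Unproject an (H, W) depth map into an (H, W, 3) pointmap in the camera frame."""
    H, W = depth.shape
    # u is the column index (x direction), v is the row index (y direction).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # K^{-1} [u, v, 1]^T for every pixel, written with row vectors.
    rays = pixels @ np.linalg.inv(K).T
    # Scale each ray by its depth to get the 3D point.
    return rays * depth[..., None]

# Toy intrinsics (hypothetical values): focal length 500, principal point (2, 1.5).
K = np.array([[500.0, 0.0, 2.0],
              [0.0, 500.0, 1.5],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 2.0)
X = depth_to_pointmap(depth, K)
```

DUSt3R's output has the same shape as `X`, but it is regressed directly from pixels, so neither `depth` nor `K` has to be supplied.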
DUSt3R's architecture is a Siamese encoder with a cross-attention decoder, inspired by CroCo (Cross-view Completion). It has three stages: encode, decode with information sharing, and regress pointmaps.
Both images are independently encoded by the same weight-sharing Vision Transformer (ViT-Large). Each image is split into 16×16 patches and processed into a sequence of tokens, F1 = Encoder(I1) and F2 = Encoder(I2).
At this stage, the two images don't interact. Each branch produces a token representation independently.
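As a rough sketch of the first encoder step, the snippet below cuts an image into 16×16 patches and flattens each into a vector. A real ViT then applies a learned linear projection and positional information, which is omitted here. The helper name `patchify` and the 224×224 input size are assumptions for illustration.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be a multiple of the patch size"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)                    # one row per patch
```

With these sizes each image yields 14 × 14 = 196 patch vectors of length 16 · 16 · 3 = 768, and both images go through exactly the same weights.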
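This is where the two views start to interact. Two separate decoders (ViT-Base) process the two token streams, exchanging information via cross-attention at every block. Each decoder block performs three operations in sequence: self-attention (each token attends to the tokens of its own view), cross-attention (each token attends to all tokens of the other view), and an MLP.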
The cross-attention is crucial: it lets the network reason about correspondences between the two views. Without it, each branch would produce an independent depth map with no shared coordinate frame.
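The sketch below shows the information flow of one such decoder block in plain NumPy. It is a simplification under stated assumptions: single-head attention, no layer normalization, a ReLU MLP, and random weights. The names `decoder_block` and `make_params` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head scaled dot-product attention."""
    q, k, v = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def decoder_block(own, other, params):
    # 1) Self-attention: tokens attend to tokens of their own view.
    x = own + attention(own, own, *params["self"])
    # 2) Cross-attention: queries from this view, keys/values from the other view.
    x = x + attention(x, other, *params["cross"])
    # 3) Token-wise MLP.
    W1, W2 = params["mlp"]
    return x + np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d = 8  # toy token dimension

def make_params():
    return {
        "self": [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)],
        "cross": [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)],
        "mlp": (rng.normal(scale=0.1, size=(d, 4 * d)),
                rng.normal(scale=0.1, size=(4 * d, d))),
    }

params1, params2 = make_params(), make_params()   # the two decoders do not share weights
tokens1 = rng.normal(size=(5, d))
tokens2 = rng.normal(size=(7, d))
# Each branch reads the other branch's tokens from the previous layer.
new1 = decoder_block(tokens1, tokens2, params1)
new2 = decoder_block(tokens2, tokens1, params2)
```

Changing `tokens2` changes `new1`, which is the dependency that lets the second view's pointmap end up in the first camera's frame.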
Two DPT heads take the decoder tokens from all layers and produce the final output: a pointmap X ∈ ℝ^(W×H×3) and a confidence map C ∈ ℝ^(W×H) for each view.
DUSt3R is trained with a beautifully simple objective: regress the correct 3D coordinates for every pixel. No adversarial loss, no contrastive learning, no hand-crafted geometric constraints. Just predict the right (x, y, z) and you're done.
For each valid pixel i in view v, the loss is the Euclidean distance between prediction and ground truth, normalized by scale:
ℓregr(v, i) = ‖ Xi / z − X̄i / z̄ ‖
Here Xi and X̄i are the predicted and ground-truth 3D points for pixel i of view v, both expressed in camera 1's frame, and z and z̄ are the average distances to the origin for the predicted and ground-truth pointmaps, respectively. This normalization handles the inherent scale ambiguity: the network only needs to get the shape right, not the absolute metric scale.
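A small NumPy sketch of this scale-normalized loss follows. It assumes pointmaps are (H, W, 3) arrays with boolean validity masks, and that the normalizers are averaged over the valid pixels of both views together. The function names are hypothetical.

```python
import numpy as np

def scale_factor(pointmaps, masks):
    """Average distance to the origin over all valid points of both views."""
    dists = np.concatenate([np.linalg.norm(X[m], axis=-1)
                            for X, m in zip(pointmaps, masks)])
    return dists.mean()

def regression_loss(pred, gt, masks):
    """Per-pixel Euclidean error after normalizing prediction and ground truth by scale."""
    z = scale_factor(pred, masks)
    z_bar = scale_factor(gt, masks)
    return [np.linalg.norm(P[m] / z - G[m] / z_bar, axis=-1)
            for P, G, m in zip(pred, gt, masks)]

rng = np.random.default_rng(0)
gt = [rng.uniform(1.0, 2.0, size=(4, 5, 3)) for _ in range(2)]
masks = [np.ones((4, 5), dtype=bool) for _ in range(2)]
# A prediction that is correct up to a global scale incurs (numerically) zero loss.
pred = [3.0 * g for g in gt]
losses = regression_loss(pred, gt, masks)
```

Because both sides are divided by their own average distance, a prediction that is three times too large everywhere is still treated as perfect.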
Not all pixels are equally easy. The sky has no well-defined 3D point. Translucent objects are ambiguous. DUSt3R learns a per-pixel confidence score Ci that weights the loss:
Lconf = Σv Σi [ Ci · ℓregr(v, i) − α log Ci ]
The first term says: "weight hard pixels less." The second term (−α log C) is a regularizer that prevents the trivial solution of setting all confidences to zero. The network must try to predict everywhere, but it can express lower confidence in genuinely ambiguous regions.
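The trade-off between the two terms is easy to probe numerically. The sketch below assumes confidences are parameterized as 1 + exp(raw) so they stay above 1, and uses α = 0.2 purely as an illustrative value. For a fixed per-pixel error ℓ, setting the derivative ℓ − α/C to zero gives C = α/ℓ, so larger errors pull the preferred confidence down.

```python
import numpy as np

def confidence_from_logits(raw):
    """Map unconstrained network outputs to confidences above 1 (assumed parameterization)."""
    return 1.0 + np.exp(raw)

def confidence_weighted_loss(regr, conf, alpha=0.2):
    """Sum of C * l_regr - alpha * log(C) over the given pixels."""
    return np.sum(conf * regr - alpha * np.log(conf))

# For one pixel with error 0.05, scan candidate confidences and find the minimizer.
alpha, regr = 0.2, 0.05
candidates = np.linspace(1.01, 20.0, 2000)
losses = np.array([confidence_weighted_loss(regr, c, alpha) for c in candidates])
best = candidates[losses.argmin()]   # close to alpha / regr
```

Repeating the scan with a larger `regr` moves `best` toward 1, which is the "weight hard pixels less" behavior in action.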
DUSt3R trains on a mixture of 8 datasets totaling 8.5M image pairs: Habitat (synthetic indoor), MegaDepth (outdoor landmarks), ARKitScenes (real indoor), ScanNet++ (high-quality indoor scans), CO3D (object-centric), BlendedMVS (diverse), Static Scenes 3D, and Waymo (autonomous driving). Ground truth pointmaps come from depth sensors, SfM reconstructions, or synthetic rendering.
So far, DUSt3R handles one pair of images at a time. But what if you have 10, 50, or 200 photos of a scene? You need all the pairwise pointmaps to agree — to live in a single, consistent 3D coordinate frame. This is the global alignment problem.
Given N images, build a graph where each edge connects two images that share visual content. You can determine overlap by running DUSt3R on all pairs (∼40ms per pair on an H100) and checking average confidence, or use image retrieval to select likely pairs.
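A minimal sketch of the exhaustive variant: score every pair and keep those whose average confidence clears a threshold. The callable `pair_confidence` stands in for running the network on a pair and averaging its confidence maps. It, the threshold value, and the toy scores are all assumptions for illustration.

```python
from itertools import combinations

def build_pair_graph(n_images, pair_confidence, threshold=3.0):
    """Return the edges (n, m) whose average pairwise confidence is high enough."""
    edges = []
    for n, m in combinations(range(n_images), 2):
        if pair_confidence(n, m) >= threshold:
            edges.append((n, m))
    return edges

# Toy scores standing in for real network outputs.
scores = {(0, 1): 5.0, (0, 2): 1.2, (1, 2): 4.1}
edges = build_pair_graph(3, lambda n, m: scores[(n, m)])
```

The number of pairs grows quadratically with N, which is why image retrieval becomes the more practical way to pick candidate pairs for large collections.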
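For each pair e = (n, m) in the graph, DUSt3R predicts pointmaps Xn,e and Xm,e in a local coordinate frame. We want globally consistent pointmaps χn for every image n. For each pair, we introduce a rigid transformation Pe and scale σe that aligns the local predictions to the global frame:
χ* = argmin over χ, P, σ of Σe Σv∈e Σi Ci ‖ χi − σe Pe Xi ‖
where Xi and Ci are the predicted point and confidence for pixel i of view v in pair e, and χi is the corresponding global point of view v.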
The key insight: both pointmaps Xn,e and Xm,e from the same pair share the same local coordinate frame (camera n's frame), so the same rigid transformation Pe should align both to the global frame. The product constraint ∏ σe = 1 prevents the trivial all-zeros solution.
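The sketch below only evaluates this alignment objective for given parameters. The actual optimization would update the global pointmaps, poses, and scales by gradient descent. It assumes flattened (N, 3) pointmaps, poses stored as (R, t) pairs, and the convention σ(RX + t). All names are hypothetical.

```python
import numpy as np

def apply_similarity(scale, R, t, points):
    """Apply sigma * (R x + t) to each row of an (N, 3) array."""
    return scale * (points @ R.T + t)

def alignment_objective(global_maps, edges, local_maps, conf, poses, scales):
    """Confidence-weighted 3D distance between global points and aligned pairwise predictions."""
    total = 0.0
    for e, (n, m) in enumerate(edges):
        R, t = poses[e]            # one rigid transform per pair, shared by both views
        for v in (n, m):
            aligned = apply_similarity(scales[e], R, t, local_maps[e][v])
            residual = np.linalg.norm(global_maps[v] - aligned, axis=-1)
            total += np.sum(conf[e][v] * residual)
    return total

theta, sigma = 0.3, 2.0
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.5, -1.0, 2.0])
rng = np.random.default_rng(0)
global_maps = {0: rng.normal(size=(6, 3)), 1: rng.normal(size=(6, 3))}
edges = [(0, 1)]
# Build local predictions that the true (R, t, sigma) maps exactly onto the global frame.
local_maps = [{v: (global_maps[v] / sigma - t) @ R for v in (0, 1)}]
conf = [{v: np.ones(6) for v in (0, 1)}]
perfect = alignment_objective(global_maps, edges, local_maps, conf, [(R, t)], [sigma])
identity = alignment_objective(global_maps, edges, local_maps, conf,
                               [(np.eye(3), np.zeros(3))], [1.0])
```

Here `perfect` is numerically zero while `identity` is not, showing that one transform per pair is enough to bring both views of that pair into the global frame.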
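If you want camera parameters (poses, intrinsics, depth maps), you can enforce a pinhole model during alignment by parameterizing:
χn(i, j) = Pn⁻¹ h( Kn⁻¹ [i·Dn(i, j), j·Dn(i, j), Dn(i, j)]ᵀ )
where h maps a 3D point to homogeneous coordinates.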
This jointly optimizes all camera poses {Pn}, intrinsics {Kn}, and depth maps {Dn} — the complete output of a traditional SfM + MVS pipeline, obtained from a single optimization.
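As a sketch of what this parameterization computes, the function below builds a global pointmap from a depth map, a focal length, and a camera-to-world matrix (playing the role of the inverse pose). It assumes a single focal length and a centered principal point. The name `pinhole_pointmap` and the toy values are hypothetical.

```python
import numpy as np

def pinhole_pointmap(depth, focal, cam_to_world):
    """Global (H, W, 3) pointmap from depth, focal length, and a 4x4 camera-to-world matrix."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject with a centered principal point and square pixels.
    x = (u - W / 2) / focal * depth
    y = (v - H / 2) / focal * depth
    cam_points = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # homogeneous
    world = cam_points @ cam_to_world.T
    return world[..., :3]

depth = np.full((4, 6), 2.0)
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]      # camera shifted one unit along x
chi = pinhole_pointmap(depth, focal=100.0, cam_to_world=pose)
```

During alignment, `depth`, `focal`, and `pose` would be the free variables for each image instead of the global points themselves.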
DUSt3R is evaluated on a remarkable breadth of tasks — all with the same model, no task-specific fine-tuning. This is the payoff of the unified pointmap representation: one model, many outputs.
On CO3Dv2 and RealEstate10k, DUSt3R with global alignment achieves the best overall mAA@30, significantly surpassing PoseDiffusion and COLMAP+SuperPoint+SuperGlue. And DUSt3R was never explicitly trained for pose estimation — poses are a downstream extraction from pointmaps.
For single-image depth, DUSt3R simply feeds the same image twice: F(I, I). In zero-shot evaluation (no fine-tuning on target datasets), it performs on par with state-of-the-art supervised baselines on NYUv2, KITTI, DDAD, and other benchmarks.
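On 7Scenes (indoor) and Cambridge Landmarks (outdoor), DUSt3R is comparable in accuracy to specialized localization methods like HLoc, despite never being trained for localization. It uses raw pointmap outputs as a 2D-2D pixel matcher between the query and a database image. Because the database image has known 3D coordinates, those matches become 2D-3D correspondences, from which PnP-RANSAC recovers the pose.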
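Let's appreciate how revolutionary the "no calibration" aspect is. In traditional 3D vision, camera calibration is the first step and the hardest bottleneck. You need to know the intrinsics (focal length, principal point) and the extrinsics (each camera's position and orientation).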
DUSt3R needs neither. The pointmap representation implicitly encodes all this information. The network has learned, from millions of training pairs, to map pixels to 3D coordinates regardless of the camera that captured them.
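Since X1,1 is expressed in camera 1's frame, you can recover the focal length by solving a simple optimization. Assuming the principal point is roughly centered and pixels are square:
f* = argmin over f of Σi,j Ci,j ‖ (i', j') − f · (Xi,j,0, Xi,j,1) / Xi,j,2 ‖
Here Xi,j,0, Xi,j,1, and Xi,j,2 are the x, y, and z components of X1,1 at pixel (i, j), and Ci,j is its confidence.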
Where i' = i − W/2, j' = j − H/2. This can be solved with a fast iterative solver, for example one based on the Weiszfeld algorithm, in just a few iterations. The network doesn't predict the focal length directly, but the focal length is implicit in the pointmap geometry.
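One way to sketch such a solver is Weiszfeld-style iteratively reweighted least squares: each step solves a weighted least-squares problem for f, with weights equal to confidence divided by the current residual. The function name `estimate_focal`, the iteration count, and the synthetic test scene are assumptions for illustration.

```python
import numpy as np

def estimate_focal(pointmap, conf, n_iters=10):
    """Estimate a single focal length from an (H, W, 3) camera-frame pointmap."""
    H, W, _ = pointmap.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u - W / 2, v - H / 2], axis=-1).reshape(-1, 2)   # centered pixels (i', j')
    pts = pointmap.reshape(-1, 3)
    q = pts[:, :2] / pts[:, 2:3]                                     # (x / z, y / z)
    c = conf.reshape(-1)
    pq = np.sum(pix * q, axis=1)
    qq = np.sum(q * q, axis=1)
    # Start from the confidence-weighted least-squares solution.
    f = np.sum(c * pq) / np.sum(c * qq)
    for _ in range(n_iters):
        residual = np.linalg.norm(pix - f * q, axis=1)
        w = c / np.maximum(residual, 1e-8)     # reweighting turns squared error into plain distance
        f = np.sum(w * pq) / np.sum(w * qq)
    return f

# Synthetic pointmap rendered with a known focal length.
H, W, true_f = 24, 32, 300.0
rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 3.0, size=(H, W))
u, v = np.meshgrid(np.arange(W), np.arange(H))
pointmap = np.stack([(u - W / 2) / true_f * depth,
                     (v - H / 2) / true_f * depth,
                     depth], axis=-1)
f = estimate_focal(pointmap, np.ones((H, W)))
```

On this noise-free scene the estimate recovers `true_f`. On real predictions the reweighting is what keeps a few badly predicted pixels from dominating the result.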
The pointmap representation is surprisingly versatile. Here's how DUSt3R extracts classical 3D vision outputs from its unified representation.
Establishing pixel correspondences between two images: find nearest neighbors in 3D pointmap space. For pixel i in image 1, find the closest point in X2,1. Retain only mutual (reciprocal) matches for robustness. This replaces SIFT/SuperPoint + SuperGlue in a single step.
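A brute-force sketch of reciprocal matching in 3D follows, assuming both pointmaps are flattened to (N, 3) arrays in the same frame. The pairwise distance matrix takes O(N1·N2) memory, so a KD-tree or subsampling would be needed at full image resolution. The function name is hypothetical.

```python
import numpy as np

def mutual_nearest_neighbors(points1, points2):
    """Return index pairs (i, j) where i and j are each other's nearest neighbor in 3D."""
    d2 = np.sum((points1[:, None, :] - points2[None, :, :]) ** 2, axis=-1)  # (N1, N2)
    nn12 = d2.argmin(axis=1)       # best match in view 2 for each point of view 1
    nn21 = d2.argmin(axis=0)       # best match in view 1 for each point of view 2
    idx1 = np.arange(len(points1))
    keep = nn21[nn12] == idx1      # keep only reciprocal matches
    return np.stack([idx1[keep], nn12[keep]], axis=1)

points1 = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [5.0, 5.0, 5.0]])
points2 = np.array([[1.01, 0.0, 1.0], [0.0, 0.01, 1.0]])
matches = mutual_nearest_neighbors(points1, points2)
```

The third point of `points1` has a nearest neighbor in `points2`, but that neighbor prefers a different point, so the reciprocity check discards it.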
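Compare the two pointmaps X1,1 and X1,2 (image 1 expressed in camera 1's frame and in camera 2's frame, the latter obtained by running the network on the swapped pair) using Procrustes alignment to recover the rigid transformation (rotation R, translation t, scale σ) between camera frames. Or use the matched points with PnP-RANSAC for a more robust estimate.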
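The Procrustes step can be sketched with the standard SVD-based similarity alignment (often attributed to Umeyama). This is a generic solver, not necessarily the exact routine DUSt3R uses. It assumes corresponding (N, 3) point sets and the convention dst ≈ s·R·src + t, and the function name is hypothetical.

```python
import numpy as np

def procrustes_similarity(src, dst):
    """Least-squares scale s, rotation R, translation t such that dst ≈ s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    signs = np.array([1.0, 1.0, d])
    R = U @ np.diag(signs) @ Vt
    var_src = np.mean(np.sum(src_c ** 2, axis=1))
    s = np.sum(S * signs) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

angle = 0.4
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
s_true, t_true = 2.5, np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))                 # stands in for one flattened pointmap
dst = s_true * src @ R_true.T + t_true         # the same points seen from the other frame
s_est, R_est, t_est = procrustes_similarity(src, dst)
```

In practice you would likely weight or filter points by confidence first, since plain least squares is sensitive to outliers. That is the reason the text mentions PnP-RANSAC as the more robust alternative.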
Given a query image IQ and a database image IB with known 3D coordinates: run DUSt3R on the pair, extract 2D-3D correspondences from the pointmaps, and solve PnP-RANSAC for the query camera's absolute pose.
The pointmaps are the reconstruction. For multi-view scenes, run global alignment (Chapter 5) to merge all pairwise predictions into one consistent 3D point cloud. Color each 3D point with its corresponding pixel color from the input images.
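Turning aligned pointmaps into a colored point cloud is mostly bookkeeping, as the sketch below shows. It assumes all pointmaps already share one frame, and it additionally drops low-confidence pixels. The threshold value and function name are assumptions, not something the method prescribes.

```python
import numpy as np

def colored_point_cloud(pointmaps, images, confidences, min_conf=1.5):
    """Merge per-view (H, W, 3) pointmaps and RGB images into one point cloud."""
    points, colors = [], []
    for X, I, C in zip(pointmaps, images, confidences):
        keep = C > min_conf            # boolean (H, W) mask of trusted pixels
        points.append(X[keep])         # (K, 3) coordinates
        colors.append(I[keep])         # (K, 3) matching pixel colors
    return np.concatenate(points), np.concatenate(colors)

X1, X2 = np.zeros((2, 2, 3)), np.ones((2, 2, 3))
I1, I2 = np.full((2, 2, 3), 0.2), np.full((2, 2, 3), 0.8)
C1 = np.array([[3.0, 1.0], [3.0, 3.0]])
C2 = np.array([[1.0, 1.0], [1.0, 5.0]])
points, colors = colored_point_cloud([X1, X2], [I1, I2], [C1, C2])
```

The two returned arrays line up row by row, so they can be written straight into a format such as PLY.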
COLMAP / SfM (Schönberger & Frahm, 2016): The gold-standard traditional pipeline. DUSt3R replaces its entire chain — feature detection, matching, essential matrix estimation, bundle adjustment, dense MVS — with a single forward pass. Global alignment serves a similar role to bundle adjustment but operates on 3D points directly.
CroCo (Weinzaepfel et al., 2022-23): Cross-view Completion — a self-supervised pretraining method where the model learns to reconstruct masked patches of one view given the other. DUSt3R's architecture is directly inspired by CroCo, and CroCo pretrained weights provide a massive initialization boost.
DPT (Ranftl et al., 2021): Dense Prediction Transformer — the regression head architecture that converts ViT tokens back to pixel-level predictions. DUSt3R uses DPT heads for pointmap and confidence output.
Monocular Depth Estimation (MiDaS, ZoeDepth, etc.): Prior work predicted depth from single images. DUSt3R generalizes this: a pair produces calibration-free stereo, and feeding the same image twice recovers monocular depth.
MASt3R (Leroy, Weinzaepfel, et al., 2024): Matching And Stereo 3D Reconstruction. Extends DUSt3R by adding local feature descriptors to the pointmap output, improving the quality of pixel correspondences and enabling better global alignment. A direct successor.
VGGT (Wang et al., 2025): Visual Geometry Grounded Transformer. Pushes the DUSt3R paradigm further — handles arbitrary numbers of input images in a single forward pass (no separate global alignment), predicts pointmaps, cameras, and more in one shot.
3D Gaussian Splatting (Kerbl et al., 2023): DUSt3R pointmaps provide excellent initialization for 3DGS — you get a dense, colored point cloud from uncalibrated images, which 3DGS can then refine into a real-time renderable scene. Several works combine DUSt3R + 3DGS for fast novel view synthesis from casual photos.