Extends DUSt3R by adding a local feature matching head that predicts dense descriptors alongside pointmaps — enabling both coarse global alignment AND fine local matching in a single forward pass.
Image matching is the backbone of 3D computer vision. Want to reconstruct a scene from photos? You need to find which pixels in image A correspond to which pixels in image B. Want to localize a camera? Same thing. Want to build a map for a robot? Matching, again.
For decades, matching has been treated as a 2D problem. Extract keypoints in each image, describe them with local features (SIFT, SuperPoint, etc.), then find nearest neighbors in feature space. This works well when viewpoints are similar, but breaks down under large viewpoint changes, repetitive textures, or low-texture regions.
Then came DUSt3R (2024), which flipped the script: instead of matching in 2D, it predicts full 3D pointmaps for each image and derives correspondences from the 3D geometry. This turned out to be remarkably robust to extreme viewpoint changes, vaulting to the top of the Map-free localization benchmark.
But DUSt3R has a critical weakness: its matches are imprecise. Pointmap regression is inherently noisy. A pixel whose true 3D position is (1.02, 3.51, 0.87) might be predicted as (1.05, 3.48, 0.90). That error, when projected back to 2D, yields matches that are off by several pixels. For tasks like visual localization, those few pixels matter enormously.
DUSt3R (orange) gives robust but imprecise matches. 2D matchers (gray) are precise but fragile under viewpoint change. MASt3R (teal) achieves both.
MASt3R's insight is deceptively simple: predict pointmaps AND dense local features simultaneously, from the same backbone, with separate heads.
Think about what DUSt3R's decoder already learns. The cross-attention between the two views forces the decoder to understand the 3D relationship between them — which pixels see the same surface, how the scene is laid out, where occlusions occur. This is an incredibly rich internal representation. But DUSt3R only uses it to regress 3D coordinates.
MASt3R adds a second head — a simple 2-layer MLP — that reads the same decoder features and outputs a d-dimensional local descriptor for every pixel. These descriptors are trained with a contrastive loss (InfoNCE) that directly rewards pixel-accurate matching.
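The result: MASt3R outputs, for each pixel in each image, three things: a 3D point (its entry in the pointmap), a confidence score for that point, and a d-dimensional local descriptor.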
You can match using either the 3D points or the local features. Or both. The local features consistently outperform the 3D points for matching, while the 3D points provide the geometric backbone that makes everything robust.
Before diving into MASt3R's additions, let's understand the foundation it builds on. DUSt3R (Dense and Unconstrained Stereo 3D Reconstruction) takes two uncalibrated images and outputs a 3D reconstruction — no camera intrinsics needed, no feature matching step, just raw pixels in, 3D points out.
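DUSt3R uses a Siamese encoder + intertwined decoder design: a shared ViT encoder processes each image independently, and the two decoder branches then exchange information through cross-attention before the prediction heads produce the outputs.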
A pointmap $X^{a,b}$ is a dense H×W×3 tensor that maps every pixel (u,v) of image $I^a$ to a 3D point expressed in the coordinate frame of camera $C^b$. DUSt3R predicts two pointmaps, $X^{1,1}$ and $X^{2,1}$, both expressed in camera 1's frame. This implicitly solves relative pose: if you know where every pixel of image 2 lands in camera 1's coordinate system, you know the relative camera transformation.
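To make the relative-pose claim concrete, here is a minimal sketch of recovering a rigid transform from two point sets. It assumes you also have the points of image 2 in camera 2's own frame (for example, from running the pair in swapped order), which this section does not describe. The function name `rigid_align` and the toy data are illustrative, and the solver is the standard Kabsch (orthogonal Procrustes) solution, not necessarily the exact procedure DUSt3R uses.

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation R and translation t with dst ≈ src @ R.T + t (Kabsch)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    cov = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(cov)
    # Guard against a reflection: force det(R) = +1.
    sign = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    trans = dst_c - rot @ src_c
    return rot, trans

# Toy check: build a known pose, transform random points, recover the pose.
rng = np.random.default_rng(0)
true_rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(true_rot) < 0:                    # make it a proper rotation
    true_rot[:, 0] *= -1
true_trans = np.array([0.3, -0.1, 1.2])
points_cam2 = rng.normal(size=(500, 3))                 # stand-in for image 2's points in camera 2
points_cam1 = points_cam2 @ true_rot.T + true_trans     # stand-in for the same points in camera 1
rot, trans = rigid_align(points_cam2, points_cam1)
```

With noise-free toy data the recovered `rot` and `trans` agree with the ground truth up to floating-point error. With real predicted pointmaps you would typically weight or filter points by confidence first.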
The normalizing factor ẑ (mean distance of 3D points from origin) makes the loss scale-invariant. DUSt3R wraps this with confidence weighting:
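Written out in the notation of this section, the per-pixel regression loss and its confidence-weighted version can be reconstructed as follows. This is a sketch: $z$ and $\hat{z}$ denote the normalizers of the predicted and ground-truth pointmaps, $\mathcal{V}^v$ is the set of valid pixels in view $v$, and $\alpha$ is a hyperparameter weighting the log term, whose value is not given here.

$$
\ell_{\text{regr}}(v, i) = \left\lVert \frac{1}{z} X_i^{v,1} - \frac{1}{\hat{z}} \hat{X}_i^{v,1} \right\rVert,
\qquad
\mathcal{L}_{\text{conf}} = \sum_{v \in \{1,2\}} \sum_{i \in \mathcal{V}^v} C_i^v \, \ell_{\text{regr}}(v, i) - \alpha \log C_i^v
$$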
The confidence $C_i^v$ lets the network downweight unreliable predictions (occluded regions, sky, etc.) while the log term prevents the trivial solution of setting all confidences to zero.
The shared encoder processes both images, the decoder exchanges information via cross-attention, and $\text{Head}_{\text{3D}}$ predicts pointmaps. MASt3R adds $\text{Head}_{\text{desc}}$ (teal) for local features.
This is MASt3R's core contribution. Alongside the existing $\text{Head}_{\text{3D}}$ that predicts pointmaps, a new $\text{Head}_{\text{desc}}$ is added that predicts dense local features.
$\text{Head}_{\text{desc}}$ is a 2-layer MLP with GELU activation, applied independently to each pixel's concatenated encoder+decoder representation.
Each output descriptor $D_i$ is L2-normalized to unit norm. The feature dimension is d=24, much smaller than typical descriptors (SuperPoint uses 256, SIFT uses 128). Why so compact? Because the features are only used for matching between the two views already processed by the decoder. The cross-attention has already done the heavy lifting of geometric understanding. The descriptors just need to be discriminative enough to pick out the right pixel.
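The descriptor head described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the released implementation: the input and hidden widths, the random weights, and the function names are made up, only d=24 comes from the text, and the sketch ignores how the real model turns patch tokens into per-pixel outputs.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def descriptor_head(features, w1, b1, w2, b2):
    """Per-pixel 2-layer MLP followed by L2 normalization.

    features: (H, W, C) array of concatenated encoder+decoder features.
    Returns:  (H, W, d) array of unit-norm descriptors.
    """
    hidden = gelu(features @ w1 + b1)                    # (H, W, hidden)
    desc = hidden @ w2 + b2                              # (H, W, d)
    norm = np.linalg.norm(desc, axis=-1, keepdims=True)
    return desc / np.maximum(norm, 1e-8)                 # avoid division by zero

# Toy sizes (assumed); only D = 24 is taken from the text.
rng = np.random.default_rng(0)
H, W, C, HIDDEN, D = 8, 6, 32, 64, 24
w1, b1 = rng.normal(scale=0.1, size=(C, HIDDEN)), np.zeros(HIDDEN)
w2, b2 = rng.normal(scale=0.1, size=(HIDDEN, D)), np.zeros(D)
features = rng.normal(size=(H, W, C))
descriptors = descriptor_head(features, w1, b1, w2, b2)
```

Because every descriptor has unit norm, a plain dot product between two descriptors is their cosine similarity, which is what the matching steps rely on.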
MASt3R trains the descriptors with InfoNCE — a contrastive loss that treats matching as classification. For each ground-truth correspondence (i,j) between images 1 and 2:
Where the similarity score uses a temperature-scaled dot product:
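The loss and its similarity score can be written out as below. This is a reconstruction in a common InfoNCE form: $\mathcal{P}^1$ and $\mathcal{P}^2$ are the sets of sampled pixels in each image that appear in $\hat{\mathcal{M}}$, and the placement of the temperature $\tau$ (dividing the dot product) is an assumed convention that may differ from the original paper's.

$$
\mathcal{L}_{\text{match}} = -\sum_{(i,j) \in \hat{\mathcal{M}}} \left[ \log \frac{s_\tau(i,j)}{\sum_{k \in \mathcal{P}^1} s_\tau(k,j)} + \log \frac{s_\tau(i,j)}{\sum_{k \in \mathcal{P}^2} s_\tau(i,k)} \right],
\qquad
s_\tau(i,j) = \exp\!\left( \frac{D_i^{1\top} D_j^{2}}{\tau} \right)
$$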
Notice that the loss is symmetric — it includes terms for both "given pixel i in image 1, find its match j in image 2" and "given pixel j in image 2, find its match i in image 1." This symmetry is crucial for the reciprocal matching scheme we'll see next.
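A minimal NumPy sketch of a symmetric InfoNCE loss over sampled correspondences follows. It assumes row n of `desc1` and row n of `desc2` form the n-th ground-truth pair, so the correct "class" for each row and each column is the diagonal. The temperature value 0.07 and the function names are illustrative assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)              # numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_infonce(desc1, desc2, temperature=0.07):
    """desc1[n] and desc2[n] are unit-norm descriptors of the n-th true pair."""
    logits = desc1 @ desc2.T / temperature               # (N, N) similarity scores
    idx = np.arange(len(desc1))
    loss_1to2 = -log_softmax(logits, axis=1)[idx, idx]   # given i, classify j
    loss_2to1 = -log_softmax(logits, axis=0)[idx, idx]   # given j, classify i
    return float((loss_1to2 + loss_2to1).sum())

rng = np.random.default_rng(0)
d1 = rng.normal(size=(16, 24))
d1 /= np.linalg.norm(d1, axis=1, keepdims=True)
aligned = symmetric_infonce(d1, d1)          # every pair matches perfectly
shuffled = symmetric_infonce(d1, d1[::-1])   # pairs deliberately mismatched
```

The aligned case gives a much smaller loss than the shuffled one, which is the behavior the loss is meant to reward.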
Ground-truth correspondences M̂ are obtained by finding reciprocal nearest neighbors between the ground-truth 3D pointmaps. During training, 4096 correspondences are randomly sampled per image pair. If fewer exist, random false correspondences pad the batch to keep the ratio of true matches constant.
Each pixel in both images gets a d=24 descriptor from $\text{Head}_{\text{desc}}$. Matching pixels have similar descriptors (high dot product), non-matching pixels have dissimilar descriptors.
Given dense feature maps $D^1$ and $D^2$ (each H×W×d), how do you extract reliable correspondences? The answer is reciprocal nearest neighbor matching, with a critical speedup.
A pair (i,j) is a mutual nearest neighbor if pixel i's nearest neighbor in image 2 is j, AND pixel j's nearest neighbor in image 1 is i:
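In symbols, one way to write this definition (a reconstruction, with $\mathrm{NN}_A$ denoting a nearest-neighbor lookup among the descriptors of image $A$) is:

$$
\mathcal{M} = \left\{ (i,j) \;\middle|\; j = \mathrm{NN}_2(D_i^1) \ \text{and}\ i = \mathrm{NN}_1(D_j^2) \right\},
\qquad
\mathrm{NN}_A(D_j^B) = \arg\min_{i} \left\lVert D_i^A - D_j^B \right\rVert
$$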
This reciprocal check is a powerful outlier filter. One-directional nearest neighbors often include many false matches (pixel i's closest feature in image 2 might be j, but j's closest in image 1 might be some other pixel k). Requiring mutual agreement dramatically reduces false matches.
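The reciprocal check can be sketched as a brute-force NumPy routine. It assumes the descriptors are flattened to (N, d) arrays with unit-norm rows, so the largest dot product is the nearest neighbor. The function name and toy data are illustrative.

```python
import numpy as np

def mutual_nearest_neighbors(desc1, desc2):
    """Return an (M, 2) array of index pairs (i, j) that are mutual nearest neighbors."""
    sim = desc1 @ desc2.T                  # (N1, N2) cosine similarities
    nn_1to2 = sim.argmax(axis=1)           # best j for every i
    nn_2to1 = sim.argmax(axis=0)           # best i for every j
    i = np.arange(len(desc1))
    keep = nn_2to1[nn_1to2] == i           # does j point back to i?
    return np.stack([i[keep], nn_1to2[keep]], axis=1)

# Toy data: image 2's descriptors are a shuffled copy of image 1's.
rng = np.random.default_rng(0)
desc1 = rng.normal(size=(50, 24))
desc1 /= np.linalg.norm(desc1, axis=1, keepdims=True)
perm = rng.permutation(50)
desc2 = desc1[perm]
matches = mutual_nearest_neighbors(desc1, desc2)
```

This version builds the full N1×N2 similarity matrix, which is exactly the cost that becomes prohibitive at full image resolution.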
Computing all mutual nearest neighbors naively requires comparing every pixel in image 1 to every pixel in image 2. For 512×384 images, that's 196,608 pixels per image, so 196,608² ≈ 39 billion comparisons. Even with optimized nearest-neighbor libraries, this takes 30-100 seconds on a CPU — far slower than the network's forward pass itself.
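MASt3R's solution is an iterative subsampling scheme. Instead of starting from all pixels, start from a sparse grid of k pixels (e.g., k=3000) in image 1. Each point is mapped to its nearest neighbor in image 2, and that neighbor is mapped back to its nearest neighbor in image 1. Points that land back where they started form a cycle and are kept as reciprocal matches; the rest continue from their new position in the next iteration.
The complexity drops from O(W²H²) to O(kWH). For k=3000 and WH=196,608, that's roughly a 65x speedup.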
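The subsampled scheme can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names, the evenly spaced starting indices (standing in for a sparse pixel grid), and the fixed iteration cap are assumptions.

```python
import numpy as np

def nearest(query, database):
    """Index of the most similar database row for each query row (dot product)."""
    return (query @ database.T).argmax(axis=1)

def fast_reciprocal_matches(desc1, desc2, k=64, max_iters=6):
    """Find reciprocal matches starting from k seed points instead of all pixels."""
    n1 = len(desc1)
    u = np.unique(np.linspace(0, n1 - 1, num=min(k, n1)).astype(int))  # sparse seeds
    found = set()
    for _ in range(max_iters):
        if len(u) == 0:
            break
        v = nearest(desc1[u], desc2)          # forward: image 1 -> image 2
        u_back = nearest(desc2[v], desc1)     # backward: image 2 -> image 1
        converged = u_back == u               # cycle closed: (u, v) is reciprocal
        found.update(zip(u[converged].tolist(), v[converged].tolist()))
        u = np.unique(u_back[~converged])     # unconverged points keep propagating
    return np.array(sorted(found), dtype=int).reshape(-1, 2)

# Toy data: image 2's descriptors are a noisy, shuffled copy of image 1's.
rng = np.random.default_rng(0)
desc1 = rng.normal(size=(400, 24))
desc1 /= np.linalg.norm(desc1, axis=1, keepdims=True)
perm = rng.permutation(400)
desc2 = desc1[perm] + 0.05 * rng.normal(size=(400, 24))
desc2 /= np.linalg.norm(desc2, axis=1, keepdims=True)
matches = fast_reciprocal_matches(desc1, desc2, k=64)
```

Each iteration compares at most k points against one full image, so the similarity matrices stay at k×WH rather than WH×WH. Every pair the function returns is a true mutual nearest neighbor, but it returns only a subset of all such pairs.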
Sparse initial points (left image) converge to reciprocal matches through forward-backward NN iterations. Teal points have converged (formed a cycle). Orange points are still propagating.
MASt3R's ViT encoder handles images with a maximum dimension of 512 pixels. For a 4000×3000 photo, that's roughly an 8x downscale (4000/512 ≈ 7.8). Matches found at this reduced resolution are accurate to ~8 pixels when mapped back to the original image. For many applications, that's not enough.
The solution is a two-stage process: match at coarse resolution first, then zoom in to refine.
The greedy crop selection is key to efficiency. If you have 10 crops per image, there are 100 possible pairs. But most of them don't overlap. The coarse matches from Stage 1 tell you which crop pairs are worth processing. Typically only 10-20% of pairs need refinement.
Coarse-to-fine matching lets MASt3R handle megapixel images while maintaining pixel-level accuracy. The coarse stage provides the global context (which parts of the scene overlap), and the fine stage provides the local precision (sub-pixel matching within overlapping regions).
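The crop-pair selection can be sketched as follows. It is a simplified illustration under several assumptions that are not from the text: crops form a non-overlapping square grid, coarse matches have already been scaled to full-resolution pixel coordinates, and selection stops once a chosen fraction (`coverage`) of coarse matches is covered. All function and parameter names are hypothetical.

```python
from collections import Counter

def crop_index(x, y, crop_size, cols):
    """Index of the grid crop containing full-resolution pixel (x, y)."""
    return (y // crop_size) * cols + (x // crop_size)

def select_crop_pairs(matches, crop_size, cols1, cols2, coverage=0.9):
    """Greedily pick the crop pairs that contain the most coarse matches."""
    votes = Counter(
        (crop_index(x1, y1, crop_size, cols1), crop_index(x2, y2, crop_size, cols2))
        for (x1, y1), (x2, y2) in matches
    )
    selected, covered = [], 0
    for pair, count in votes.most_common():        # most-voted pairs first
        if covered >= coverage * len(matches):
            break
        selected.append(pair)
        covered += count
    return selected

# Coarse matches found at 512x384, mapped back to a 4000x3000 photo.
scale = 4000 / 512
coarse = [((100, 50), (300, 200)), ((104, 52), (305, 198)), ((400, 300), (60, 40))]
full_res = [
    ((round(x1 * scale), round(y1 * scale)), (round(x2 * scale), round(y2 * scale)))
    for (x1, y1), (x2, y2) in coarse
]
pairs = select_crop_pairs(full_res, crop_size=512, cols1=8, cols2=8)
```

Only the selected crop pairs would then be run through the network at full resolution, with the resulting fine matches shifted by each crop's offset to return to original image coordinates.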
MASt3R's training combines two objectives that reinforce each other: 3D reconstruction and feature matching.
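In the notation used here, with $\mathcal{L}_{\text{conf}}$ the confidence-weighted reconstruction loss and $\mathcal{L}_{\text{match}}$ the matching loss, the combined objective can be reconstructed as:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{conf}} + \beta \, \mathcal{L}_{\text{match}}
$$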
Where β=1 balances the reconstruction and matching losses equally. Let's unpack each component.
Inherited from DUSt3R with one modification: when ground-truth is metric (known absolute scale), the scale normalization factor z is set to ẑ (the ground-truth normalizer). This forces the network to predict metric-scale geometry rather than up-to-scale reconstruction. Ten of the fourteen training datasets have metric ground truth.
The symmetric InfoNCE loss from Chapter 3. Ground-truth correspondences are mined from ground-truth pointmaps by finding reciprocal nearest neighbors in 3D, then 4096 are randomly sampled per pair.
MASt3R dominates multiple benchmarks — especially the ones where extreme viewpoint changes make 2D-only methods fall apart.
The hardest benchmark: localize a camera given a single reference image, with viewpoint changes up to 180°. No map, no SfM database — just one reference photo.
On CO3D: MASt3R achieves 78.1% AUC@15° for rotation (vs 62.7% for DUSt3R, 47.5% for LoFTR). On RealEstate10K: 80.5% vs 76.3% vs 72.5%.
On InLoc (indoor localization), Aachen Day-Night v1.1, and Cambridge Landmarks: MASt3R matches or exceeds the state of the art across all benchmarks.
On DTU and Tanks & Temples: MASt3R's 3D reconstructions improve over DUSt3R's, even though matching was the primary target. The matching loss helps the backbone learn better geometry.
VCRE AUC (%) on the Map-free test set. MASt3R achieves 93.0% — a massive 30-point jump over the previous best published result.
MASt3R was designed for pairwise matching, but its capabilities naturally extend to full Structure-from-Motion (SfM) — the problem of reconstructing a 3D scene and recovering all camera poses from a collection of images.
COLMAP, the standard SfM pipeline, follows a decades-old recipe: detect SIFT keypoints, match them, triangulate 3D points, bundle adjust. It's reliable but slow and brittle under challenging conditions (few textures, extreme viewpoints, repetitive patterns). Every step is hand-crafted and relies on assumptions that break in the wild.
MASt3R-SfM replaces the entire COLMAP pipeline. The key steps:
MASt3R-SfM works in situations where COLMAP fails entirely — scenes with few textures, extreme viewpoint changes, or sparse image coverage. It produces denser reconstructions (every pixel gets a 3D point, not just keypoints) and is more robust to degenerate configurations.
The tradeoff: MASt3R-SfM is currently slower than COLMAP for large image collections because every pair requires a ViT forward pass. But it's far more robust, and the quality of correspondences is substantially higher.
DUSt3R (Wang et al., 2024): The direct predecessor. MASt3R inherits the entire architecture — Siamese ViT encoder, cross-attention decoder, pointmap regression — and adds the matching head on top. DUSt3R proved that casting reconstruction as pointmap regression works; MASt3R proved that casting matching as a joint 3D + feature task works even better.
SuperGlue (Sarlin et al., 2020): Showed that attention-based global reasoning improves keypoint matching over naive nearest-neighbor pairing. But SuperGlue still relies on local keypoint detectors (SuperPoint). MASt3R goes further by making the entire pipeline — detection, description, matching, and geometric reasoning — happen in one network.
LoFTR (Sun et al., 2021): Pioneered dense, detector-free matching using Transformers. Showed that dense matching with global attention outperforms keypoint-based methods on hard benchmarks. But LoFTR operates purely in 2D. MASt3R's key advance is grounding this dense matching in 3D geometry.
COLMAP (Schönberger & Frahm, 2016): The standard SfM pipeline that MASt3R-SfM aims to replace for challenging scenarios. COLMAP's modular detect-describe-match-triangulate pipeline is reliable but brittle under extreme conditions.
VGGT (Wang et al., 2025): Visual Geometry Grounded Transformer — takes the idea further by predicting cameras, pointmaps, depth, and correspondences all in one feedforward pass. Builds on similar principles as MASt3R but with a more unified architecture and larger-scale training.
MASt3R-SLAM (2024-25): Extends MASt3R to real-time SLAM by combining its dense matching capabilities with a tracking-and-mapping framework. Uses MASt3R's pointmaps for initialization and its descriptors for tracking.