Leroy, Cabon, Revaud — Naver Labs, 2024

Grounding Image Matching in 3D with MASt3R

Extends DUSt3R by adding a local feature matching head that predicts dense descriptors alongside pointmaps — enabling both coarse global alignment AND fine local matching in a single forward pass.

Prerequisites: Feature matching + 3D reconstruction basics
10 chapters, 5+ simulations

Chapter 0: The Problem

Image matching is the backbone of 3D computer vision. Want to reconstruct a scene from photos? You need to find which pixels in image A correspond to which pixels in image B. Want to localize a camera? Same thing. Want to build a map for a robot? Matching, again.

For decades, matching has been treated as a 2D problem. Extract keypoints in each image, describe them with local features (SIFT, SuperPoint, etc.), then find nearest neighbors in feature space. This works well when viewpoints are similar, but breaks down under large viewpoint changes, repetitive textures, or low-texture regions.

Then came DUSt3R (2024), which flipped the script: instead of matching in 2D, it predicts full 3D pointmaps for each image and derives correspondences from the 3D geometry. This turned out to be remarkably robust to extreme viewpoint changes, vaulting to the top of the Map-free localization benchmark.

But DUSt3R has a critical weakness: its matches are imprecise. Pointmap regression is inherently noisy. A pixel whose true 3D position is (1.02, 3.51, 0.87) might be predicted as (1.05, 3.48, 0.90). That error, when projected back to 2D, yields matches that are off by several pixels. For tasks like visual localization, those few pixels matter enormously.

The gap: DUSt3R gives you robust coarse matches via 3D reasoning, but not precise ones. Classical 2D matchers like LoFTR give you precise local matches, but struggle with extreme viewpoint changes. Nobody has both. MASt3R closes this gap by adding a dedicated matching head that produces dense local features trained explicitly for pixel-accurate correspondence — while keeping DUSt3R's 3D backbone for robustness.

The Matching Precision Gap

DUSt3R (orange) gives robust but imprecise matches. 2D matchers (gray) are precise but fragile under viewpoint change. MASt3R (teal) achieves both. The figure compares how each method degrades as the viewpoint change grows.

Why are DUSt3R's correspondences imprecise, despite being robust to large viewpoint changes?

Chapter 1: The Key Insight

MASt3R's insight is deceptively simple: predict pointmaps AND dense local features simultaneously, from the same backbone, with separate heads.

Think about what DUSt3R's decoder already learns. The cross-attention between the two views forces the decoder to understand the 3D relationship between them — which pixels see the same surface, how the scene is laid out, where occlusions occur. This is an incredibly rich internal representation. But DUSt3R only uses it to regress 3D coordinates.

MASt3R adds a second head — a simple 2-layer MLP — that reads the same decoder features and outputs a d-dimensional local descriptor for every pixel. These descriptors are trained with a contrastive loss (InfoNCE) that directly rewards pixel-accurate matching.

Why joint training matters: The paper ablates training the matching head without the 3D reconstruction loss — just descriptors alone. Performance drops substantially, especially for pose estimation (median rotation error jumps from 3.0° to 10.8°). The 3D task forces the backbone to understand geometry, which makes the descriptors geometrically aware. Grounding matching in 3D is the whole point.

The result: MASt3R outputs, for each pixel in each image, three things:

  1. A 3D point (from Head3D) — for coarse geometric matching and reconstruction
  2. A confidence score — how reliable each prediction is
  3. A d-dimensional local feature (from Headdesc) — for precise, discriminative matching

You can match using either the 3D points or the local features. Or both. The local features consistently outperform the 3D points for matching, while the 3D points provide the geometric backbone that makes everything robust.

What happens when MASt3R's matching head is trained without the 3D reconstruction loss?

Chapter 2: DUSt3R Recap

Before diving into MASt3R's additions, let's understand the foundation it builds on. DUSt3R (Dense and Unconstrained Stereo 3D Reconstruction) takes two uncalibrated images and outputs a 3D reconstruction — no camera intrinsics needed, no feature matching step, just raw pixels in, 3D points out.

The architecture

DUSt3R uses a Siamese encoder + intertwined decoder design:

Encode
Both images pass through a shared ViT-Large encoder: H1 = Encoder(I1), H2 = Encoder(I2)
Decode
Two ViT-Base decoders process representations jointly via cross-attention: H'1, H'2 = Decoder(H1, H2)
Predict
Head3D regresses pointmaps and confidence from [Hv, H'v]

Pointmaps

A pointmap Xa,b is a dense H×W×3 tensor that maps every pixel (u,v) of image Ia to a 3D point expressed in the coordinate frame of camera Cb. DUSt3R predicts two pointmaps — X1,1 and X2,1 — both expressed in camera 1's frame. This implicitly solves relative pose: if you know where every pixel of image 2 lands in camera 1's coordinate system, you know the relative camera transformation.

The regression loss

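$$\ell_{\mathrm{regr}}(v, i) = \left\| \frac{1}{z} X_i^{v,1} - \frac{1}{\hat{z}} \hat{X}_i^{v,1} \right\|$$

Here X is the predicted pointmap and X̂ the ground truth. The normalizing factors z and ẑ (the mean distance of the predicted and ground-truth 3D points from the origin, respectively) make the loss scale-invariant. DUSt3R wraps this with confidence weighting:

$$\mathcal{L}_{\mathrm{conf}} = \sum_{v \in \{1,2\}} \sum_{i} \left( C_i^v \, \ell_{\mathrm{regr}}(v, i) - \alpha \log C_i^v \right)$$

The confidence $C_i^v$ lets the network downweight unreliable predictions (occluded regions, sky, etc.) while the log term prevents the trivial solution of setting all confidences to zero.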
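
To see why the log term blocks the trivial solution, here is a minimal sketch of the confidence-weighted sum for a handful of pixels. The function names and the value alpha=0.2 are illustrative assumptions, not taken from the released code. For a fixed per-pixel error r, the term c·r − α·log c is minimized at c = α / r, so the preferred confidence shrinks as the error grows but never reaches zero.

```python
import math

def confidence_weighted_loss(regr, conf, alpha=0.2):
    """Sum over pixels of C * regr - alpha * log(C).

    regr  : per-pixel regression errors (non-negative floats)
    conf  : per-pixel confidences (strictly positive floats)
    alpha : weight of the log regularizer (illustrative value)
    """
    return sum(c * r - alpha * math.log(c) for r, c in zip(regr, conf))

def best_confidence(r, alpha=0.2):
    """Confidence minimizing c * r - alpha * log(c) for a fixed error r.

    Setting the derivative r - alpha / c to zero gives c = alpha / r.
    """
    return alpha / r

# A pixel with a large error prefers a low (but non-zero) confidence.
r = 0.5
c_star = best_confidence(r)
loss_at_optimum = confidence_weighted_loss([r], [c_star])
```

Pushing the confidence toward zero makes −α·log c grow without bound, which is what keeps the network from discarding every pixel.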

DUSt3R Architecture

The shared encoder processes both images, the decoder exchanges information via cross-attention, and Head3D predicts pointmaps. MASt3R adds Headdesc (teal) for local features.

Key property of cross-attention: The decoder doesn't just process each image independently. Cross-attention lets each decoder token attend to the other image's tokens, building a joint understanding of the two views. This is why DUSt3R can handle extreme viewpoint changes — the decoder can reason about which parts of image 1 correspond to which parts of image 2, even when their appearances are drastically different due to viewpoint.
What do DUSt3R's pointmaps X1,1 and X2,1 represent?

Chapter 3: The Feature Head

This is MASt3R's core contribution. Alongside the existing Head3D that predicts pointmaps, a new Headdesc is added that predicts dense local features.

Architecture

Headdesc is a 2-layer MLP with GELU activation, applied independently to each pixel's concatenated encoder+decoder representation:

D1 = Headdesc([H1, H'1])
D2 = Headdesc([H2, H'2])

Each output descriptor Di is L2-normalized to unit norm. The feature dimension is d=24 — much smaller than typical descriptors (SuperPoint uses 256, SIFT uses 128). Why so compact? Because the features are only used for matching between the two views already processed by the decoder. The cross-attention has already done the heavy lifting of geometric understanding. The descriptors just need to be discriminative enough to pick out the right pixel.
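
As a rough illustration of the head described above, the following pure-Python sketch applies a 2-layer MLP with GELU to one pixel's concatenated features and L2-normalizes the result. The layer sizes, the random initialization, and the names `make_head` and `head_desc` are assumptions made for the example. A real implementation would run batched on a tensor library and would also handle the mapping from patch tokens to pixels, which is omitted here.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / max(norm, eps) for c in v]

def make_head(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialized 2-layer MLP (hypothetical sizes)."""
    rng = random.Random(seed)
    def init(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
    return {"w1": init(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "w2": init(out_dim, hidden_dim), "b2": [0.0] * out_dim}

def head_desc(head, enc_feat, dec_feat):
    """Descriptor for one pixel from its encoder and decoder features."""
    x = list(enc_feat) + list(dec_feat)            # concatenation [H, H']
    hidden = [gelu(v) for v in linear(head["w1"], head["b1"], x)]
    return l2_normalize(linear(head["w2"], head["b2"], hidden))

head = make_head(in_dim=16, hidden_dim=32, out_dim=24)
descriptor = head_desc(head, [0.1] * 8, [0.2] * 8)   # 24 values, unit norm
```

Because every descriptor has unit norm, the dot product between two descriptors is their cosine similarity, which is what the matching loss and the nearest-neighbor search operate on.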

The matching loss

MASt3R trains the descriptors with InfoNCE — a contrastive loss that treats matching as classification. For each ground-truth correspondence (i,j) between images 1 and 2:

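$$\mathcal{L}_{\mathrm{match}} = -\sum_{(i,j) \in \hat{\mathcal{M}}} \left[ \log \frac{s_\tau(i,j)}{\sum_{k} s_\tau(k,j)} + \log \frac{s_\tau(i,j)}{\sum_{k} s_\tau(i,k)} \right]$$

Where the similarity score uses a temperature-scaled dot product:

$$s_\tau(i,j) = \exp\left( -\tau \, D_i^{1\top} D_j^{2} \right)$$
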
Why InfoNCE over regression: The regression loss in DUSt3R rewards getting close to the right 3D point. InfoNCE rewards getting exactly the right pixel — you get zero credit for matching to an adjacent pixel. This binary nature forces the network to learn highly discriminative features. Think of it as the difference between "land near the bullseye" (regression) and "hit the bullseye" (classification).

The dual nature

Notice that the loss is symmetric — it includes terms for both "given pixel i in image 1, find its match j in image 2" and "given pixel j in image 2, find its match i in image 1." This symmetry is crucial for the reciprocal matching scheme we'll see next.

Ground-truth correspondences M̂ are obtained by finding reciprocal nearest neighbors between the ground-truth 3D pointmaps. During training, 4096 correspondences are randomly sampled per image pair. If fewer exist, random false correspondences pad the batch to keep the ratio of true matches constant.
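
The symmetric loss can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: it uses the common convention in which the score grows with the dot product (logits are the dot product divided by a temperature), the temperature value 0.07 is arbitrary, and the denominators run over every descriptor passed in rather than over a sampled subset.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_at(logits, index):
    """Log-probability of `index` under a softmax, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[index] - log_z

def symmetric_infonce(desc1, desc2, matches, temperature=0.07):
    """Symmetric InfoNCE over ground-truth pairs (i, j).

    desc1, desc2 : lists of unit-norm descriptors for images 1 and 2
    matches      : list of (i, j) index pairs that truly correspond
    """
    loss = 0.0
    for i, j in matches:
        # Image 1 -> image 2: pixel i must pick j among all pixels of image 2.
        row = [dot(desc1[i], d2) / temperature for d2 in desc2]
        # Image 2 -> image 1: pixel j must pick i among all pixels of image 1.
        col = [dot(d1, desc2[j]) / temperature for d1 in desc1]
        loss -= log_softmax_at(row, j) + log_softmax_at(col, i)
    return loss

desc1 = [(1.0, 0.0), (0.0, 1.0)]
desc2 = [(1.0, 0.0), (0.0, 1.0)]
good = symmetric_infonce(desc1, desc2, [(0, 0), (1, 1)])   # correct pairing
bad = symmetric_infonce(desc1, desc2, [(0, 1), (1, 0)])    # swapped pairing
```

Each term is a classification over pixels, so a descriptor that is most similar to a neighbor of the true match is penalized like any other wrong answer.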

Feature Head: Descriptors for Every Pixel

Each pixel in both images gets a d=24 descriptor from Headdesc. Matching pixels have similar descriptors (high dot product), non-matching pixels have dissimilar descriptors.

Why does MASt3R use d=24 dimensional descriptors instead of higher dimensions like SuperPoint's 256?

Chapter 4: The Matching Pipeline

Given dense feature maps D1 and D2 (each H×W×d), how do you extract reliable correspondences? The answer is reciprocal nearest neighbor matching — but with a critical speedup.

Mutual nearest neighbors

A pair (i,j) is a mutual nearest neighbor if pixel i's nearest neighbor in image 2 is j, AND pixel j's nearest neighbor in image 1 is i:

$$\mathcal{M} = \{ (i,j) \mid j = \mathrm{NN}_2(D_i^1) \ \text{and} \ i = \mathrm{NN}_1(D_j^2) \}$$

This reciprocal check is a powerful outlier filter. One-directional nearest neighbors often include many false matches (pixel i's closest feature in image 2 might be j, but j's closest in image 1 might be some other pixel k). Requiring mutual agreement dramatically reduces false matches.
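
A brute-force version of the reciprocal check, written in plain Python for clarity. It assumes unit-norm descriptors, so the nearest neighbor is the one with the highest dot product. The function names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(query, candidates):
    """Index of the candidate with the highest dot product to the query."""
    return max(range(len(candidates)), key=lambda k: dot(query, candidates[k]))

def mutual_nearest_neighbors(desc1, desc2):
    """All pairs (i, j) where i and j are each other's nearest neighbor."""
    nn12 = [nearest(d, desc2) for d in desc1]   # image 1 -> image 2
    nn21 = [nearest(d, desc1) for d in desc2]   # image 2 -> image 1
    return [(i, j) for i, j in enumerate(nn12) if nn21[j] == i]

desc1 = [(1.0, 0.0), (0.0, 1.0), (0.8, 0.6)]
desc2 = [(0.0, 1.0), (1.0, 0.0)]
pairs = mutual_nearest_neighbors(desc1, desc2)
```

In this toy example, pixel 2 of image 1 points to pixel 1 of image 2, but that pixel prefers pixel 0, so the one-directional match is rejected.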

The quadratic problem

Computing all mutual nearest neighbors naively requires comparing every pixel in image 1 to every pixel in image 2. For 512×384 images, that's 196,608 pixels per image, so 196,608² ≈ 39 billion comparisons. Even with optimized nearest-neighbor libraries, this takes 30-100 seconds on a CPU — far slower than the network's forward pass itself.

Fast reciprocal matching

MASt3R's solution is an iterative subsampling scheme. Instead of starting from all pixels, start from a sparse grid of k pixels (e.g., k=3000) in image 1:

Init
Sample k pixels on a regular grid in image 1: $U^0$
Forward
Map each pixel to its NN in image 2: $V^t = [\mathrm{NN}_2(D_u^1)]_{u \in U^t}$
Backward
Map back to image 1: $U^{t+1} = [\mathrm{NN}_1(D_v^2)]_{v \in V^t}$
Collect
Pairs where $U_n^t = U_n^{t+1}$ (a cycle!) are reciprocal matches → add to $\mathcal{M}_k$
Filter
Remove converged points, repeat with remaining. After ~5 iterations, nearly all points converge.

The complexity drops from O(W²H²) to O(kWH). The ratio between the two is WH/k, so for k=3000 and WH=196,608 that is roughly a 65x reduction in comparisons.
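
The iteration above can be sketched as follows. This is a simplified, unbatched illustration: descriptors are flat Python lists indexed by pixel, nearest neighbors use the dot product, and the name `fast_reciprocal_matching` and the `max_iters` default are assumptions made for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(query, candidates):
    """Index of the candidate with the highest dot product to the query."""
    return max(range(len(candidates)), key=lambda k: dot(query, candidates[k]))

def fast_reciprocal_matching(desc1, desc2, start_indices, max_iters=10):
    """Iterate forward/backward NN from a sparse set of image-1 pixels.

    A point that maps back onto itself has closed a cycle and is kept as a
    reciprocal match; the others continue from where they landed.
    """
    matches = set()
    active = list(dict.fromkeys(start_indices))        # de-duplicate, keep order
    for _ in range(max_iters):
        if not active:
            break
        forward = [nearest(desc1[u], desc2) for u in active]    # U^t -> V^t
        backward = [nearest(desc2[v], desc1) for v in forward]  # V^t -> U^{t+1}
        still_active = []
        for u, v, u_next in zip(active, forward, backward):
            if u_next == u:
                matches.add((u, v))          # converged: reciprocal match
            else:
                still_active.append(u_next)  # keep propagating
        active = list(dict.fromkeys(still_active))
    return sorted(matches)

desc1 = [(1.0, 0.0), (0.0, 1.0), (0.8, 0.6)]
desc2 = [(0.0, 1.0), (1.0, 0.0)]
found = fast_reciprocal_matching(desc1, desc2, start_indices=[2])
```

Starting from pixel 2 alone, the first round lands on pixel 0 of image 1, and the second round confirms (0, 1) as a reciprocal pair. Each round costs one nearest-neighbor search per active point in each direction, which is where the O(kWH) bound comes from.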

Surprising result: Subsampling doesn't just speed things up — it actually improves matching quality. With k=3000, MASt3R scores higher on Map-free than with the full correspondence set. Why? The iterative convergence process acts as a natural outlier filter. Points that don't converge to a stable cycle tend to be ambiguous or incorrect matches. Filtering them out improves downstream pose estimation.

Fast Reciprocal Matching

Sparse initial points (left image) converge to reciprocal matches through forward-backward NN iterations. Teal points have converged (formed a cycle). Orange points are still propagating.

Why does subsampling to k=3000 correspondences actually IMPROVE matching quality compared to using the full set?

Chapter 5: Coarse-to-Fine Matching

MASt3R's ViT encoder handles images with a maximum dimension of 512 pixels. For a 4000×3000 photo, that's an 8x downscale. Matches found at this reduced resolution are accurate to ~8 pixels when mapped back to the original image. For many applications, that's not enough.

The coarse-to-fine strategy

The solution is a two-stage process: match at coarse resolution first, then zoom in to refine.

Stage 1: Coarse
Downscale both images to 512px max dimension. Run MASt3R. Extract coarse matches $\mathcal{M}_k^0$ using fast reciprocal matching.
Generate crops
Create overlapping 512px crops (50% overlap) from each full-resolution image: $W^1$ and $W^2$.
Select pairs
Greedily select crop pairs $(w_1, w_2)$ that cover ≥90% of coarse correspondences $\mathcal{M}_k^0$.
Stage 2: Fine
Run MASt3R on each selected crop pair at full resolution. Extract matches. Map back to original image coordinates.

The greedy crop selection is key to efficiency. If you have 10 crops per image, there are 100 possible pairs. But most of them don't overlap. The coarse matches from Stage 1 tell you which crop pairs are worth processing. Typically only 10-20% of pairs need refinement.

Why 50% overlap? With no overlap, a match near a crop boundary might be split between two crops and lost. With 50% overlap, every point away from the image border is covered by two crops along each axis, so a match near one crop's edge sits well inside a neighboring crop. The tradeoff is more crops to process, but the greedy selection filters most of them out.
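
A minimal sketch of crop generation and greedy pair selection, under a few assumptions: crops are axis-aligned boxes `(x0, y0, x1, y1)`, the coarse matches have already been rescaled to full-resolution pixel coordinates, and the last crop along each axis is shifted to end at the image border. All function names are hypothetical.

```python
import math

def crop_starts(length, crop=512, overlap=0.5):
    """Start offsets of crops along one axis with the given overlap."""
    if length <= crop:
        return [0]
    stride = int(crop * (1 - overlap))
    starts = list(range(0, length - crop, stride))
    starts.append(length - crop)          # last crop ends at the border
    return starts

def make_crops(width, height, crop=512, overlap=0.5):
    """All crop boxes (x0, y0, x1, y1) covering a width x height image."""
    return [(x, y, min(x + crop, width), min(y + crop, height))
            for y in crop_starts(height, crop, overlap)
            for x in crop_starts(width, crop, overlap)]

def contains(box, point):
    x0, y0, x1, y1 = box
    return x0 <= point[0] < x1 and y0 <= point[1] < y1

def select_crop_pairs(crops1, crops2, matches, coverage=0.9):
    """Greedily pick crop pairs until `coverage` of the matches is covered."""
    uncovered = set(range(len(matches)))
    allowed_left = len(matches) - math.ceil(coverage * len(matches))
    selected = []
    while len(uncovered) > allowed_left:
        best_pair, best_hits = None, set()
        for a, box1 in enumerate(crops1):
            for b, box2 in enumerate(crops2):
                hits = {m for m in uncovered
                        if contains(box1, matches[m][0])
                        and contains(box2, matches[m][1])}
                if len(hits) > len(best_hits):
                    best_pair, best_hits = (a, b), hits
        if best_pair is None:              # nothing left that any pair covers
            break
        selected.append(best_pair)
        uncovered -= best_hits
    return selected

crops = make_crops(1024, 512)
coarse = [((100, 100), (900, 100)), ((120, 130), (910, 120))]
pairs = select_crop_pairs(crops, crops, coarse)
```

With three crops per image there are nine candidate pairs, but both coarse matches fall in the leftmost crop of image 1 and the rightmost crop of image 2, so a single pair is selected for refinement.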

The payoff

Coarse-to-fine matching lets MASt3R handle megapixel images while maintaining pixel-level accuracy. The coarse stage provides the global context (which parts of the scene overlap), and the fine stage provides the local precision (pixel-accurate matching within overlapping regions).

In MASt3R's coarse-to-fine scheme, what role do the coarse matches play?

Chapter 6: Training

MASt3R's training combines two objectives that reinforce each other: 3D reconstruction and feature matching.

The total loss

Ltotal = Lconf + β Lmatch

Where β=1 balances the reconstruction and matching losses equally. Let's unpack each component.

Reconstruction loss (Lconf)

Inherited from DUSt3R with one modification: when ground-truth is metric (known absolute scale), the scale normalization factor z is set to ẑ (the ground-truth normalizer). This forces the network to predict metric-scale geometry rather than up-to-scale reconstruction. Ten of the fourteen training datasets have metric ground truth.
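
The effect of this modification can be illustrated with a small sketch. It assumes the normalizers are the mean distance of the points from the origin and uses a hypothetical `gt_is_metric` flag per training sample. The function names are illustrative.

```python
import math

def mean_norm(points):
    """Mean Euclidean distance of 3D points from the origin."""
    return sum(math.sqrt(x * x + y * y + z * z) for x, y, z in points) / len(points)

def regression_losses(pred_pts, gt_pts, gt_is_metric):
    """Per-pixel regression loss with optional metric-scale supervision."""
    z_gt = mean_norm(gt_pts)
    # Metric ground truth: reuse the ground-truth normalizer for the
    # prediction, so a wrong absolute scale is penalized.
    z_pred = z_gt if gt_is_metric else mean_norm(pred_pts)
    return [math.dist([c / z_pred for c in p], [c / z_gt for c in g])
            for p, g in zip(pred_pts, gt_pts)]

gt = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
pred = [(2.0, 0.0, 0.0), (0.0, 4.0, 0.0)]        # right shape, twice too large

up_to_scale = regression_losses(pred, gt, gt_is_metric=False)
metric = regression_losses(pred, gt, gt_is_metric=True)
```

A prediction that is correct up to a global scale factor incurs essentially no loss in the up-to-scale case, but is penalized once the ground-truth normalizer is shared, which is what pushes the network toward metric predictions.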

Matching loss (Lmatch)

The symmetric InfoNCE loss from Chapter 3. Ground-truth correspondences are mined from ground-truth pointmaps by finding reciprocal nearest neighbors in 3D, then 4096 are randomly sampled per pair.

Training details

The synergy in the loss: Lconf teaches the backbone to understand 3D geometry. Lmatch teaches the descriptors to be discriminative at the pixel level. But they share the same backbone and decoder, so improvements in geometric understanding directly improve descriptor quality, and vice versa. The paper shows this is a strict improvement over training either loss alone.
Why does MASt3R modify DUSt3R's regression loss to sometimes skip scale normalization?

Chapter 7: Results

MASt3R dominates multiple benchmarks — especially the ones where extreme viewpoint changes make 2D-only methods fall apart.

Map-free localization

The hardest benchmark: localize a camera given a single reference image, with viewpoint changes up to 180°. No map, no SfM database — just one reference photo.

Relative pose estimation (CO3D, RealEstate10K)

On CO3D: MASt3R achieves 78.1% AUC@15° for rotation (vs 62.7% for DUSt3R, 47.5% for LoFTR). On RealEstate10K, the corresponding figures are 80.5% for MASt3R, 76.3% for DUSt3R, and 72.5% for LoFTR.

Visual localization

On InLoc (indoor localization), Aachen Day-Night v1.1, and Cambridge Landmarks: MASt3R matches or exceeds the state of the art across all benchmarks.

Multi-view stereo

On DTU and Tanks & Temples: MASt3R's 3D reconstructions improve over DUSt3R's, even though matching was the primary target. The matching loss helps the backbone learn better geometry.

Map-free Localization: MASt3R vs Prior Art

VCRE AUC (%) on the Map-free test set. MASt3R achieves 93.0% — a massive 30-point jump over the previous best published result.

The ablation that tells the story: Training with only Lmatch (no 3D loss) gives 88.5% AUC on Map-free. Training with both Lconf + Lmatch gives 93.0%. The 3D loss never supervises the descriptors directly, yet it boosts matching by 4.5 points. This is the clearest quantitative evidence that grounding matching in 3D works.
By how much does MASt3R improve over the previous best method on the Map-free localization benchmark?

Chapter 8: MASt3R-SfM

MASt3R was designed for pairwise matching, but its capabilities naturally extend to full Structure-from-Motion (SfM) — the problem of reconstructing a 3D scene and recovering all camera poses from a collection of images.

Why replace COLMAP?

COLMAP, the standard SfM pipeline, follows a decades-old recipe: detect SIFT keypoints, match them, triangulate 3D points, bundle adjust. It's reliable but slow and brittle under challenging conditions (few textures, extreme viewpoints, repetitive patterns). Every step is hand-crafted and relies on assumptions that break in the wild.

MASt3R-SfM replaces the entire COLMAP pipeline. The key steps:

Pairwise matching
Run MASt3R on all selected image pairs to get dense correspondences and pointmaps.
Coarse-to-fine
Apply the windowed coarse-to-fine scheme for high-resolution images.
Global alignment
Use DUSt3R's global alignment to merge all pairwise pointmaps into a single consistent coordinate frame.
Optional refinement
Bundle adjustment on the merged reconstruction using the dense correspondences as constraints.

The advantage

MASt3R-SfM works in situations where COLMAP fails entirely — scenes with few textures, extreme viewpoint changes, or sparse image coverage. It produces denser reconstructions (every pixel gets a 3D point, not just keypoints) and is more robust to degenerate configurations.

The tradeoff: MASt3R-SfM is currently slower than COLMAP for large image collections because every pair requires a ViT forward pass. But it's far more robust, and the quality of correspondences is substantially higher.

A paradigm shift: Traditional SfM pipelines are modular — detect, describe, match, triangulate, optimize — with each module independently designed. MASt3R-SfM replaces the first three modules with a single learned system that jointly reasons about 3D geometry and matching. This is the trend in modern 3D vision: end-to-end learned systems replacing hand-crafted pipelines.
What is the key advantage of MASt3R-SfM over traditional SfM pipelines like COLMAP?

Chapter 9: Connections

What MASt3R builds on

DUSt3R (Wang et al., 2024): The direct predecessor. MASt3R inherits the entire architecture — Siamese ViT encoder, cross-attention decoder, pointmap regression — and adds the matching head on top. DUSt3R proved that casting reconstruction as pointmap regression works; MASt3R proved that casting matching as a joint 3D + feature task works even better.

SuperGlue (Sarlin et al., 2020): Showed that attention-based global reasoning improves keypoint matching over naive nearest-neighbor pairing. But SuperGlue still relies on local keypoint detectors (SuperPoint). MASt3R goes further by making the entire pipeline — detection, description, matching, and geometric reasoning — happen in one network.

LoFTR (Sun et al., 2021): Pioneered dense, detector-free matching using Transformers. Showed that dense matching with global attention outperforms keypoint-based methods on hard benchmarks. But LoFTR operates purely in 2D. MASt3R's key advance is grounding this dense matching in 3D geometry.

What MASt3R influenced and relates to

COLMAP (Schönberger & Frahm, 2016): The standard SfM pipeline that MASt3R-SfM aims to replace for challenging scenarios. COLMAP's modular detect-describe-match-triangulate pipeline is reliable but brittle under extreme conditions.

VGGT (Wang et al., 2025): Visual Geometry Grounded Transformer — takes the idea further by predicting cameras, pointmaps, depth, and correspondences all in one feedforward pass. Builds on similar principles as MASt3R but with a more unified architecture and larger-scale training.

MASt3R-SLAM (2024-25): Extends MASt3R to real-time SLAM by combining its dense matching capabilities with a tracking-and-mapping framework. Uses MASt3R's pointmaps for initialization and its descriptors for tracking.

The evolution of matching: SIFT (local, handcrafted) → SuperPoint (local, learned) → SuperGlue (local + attention) → LoFTR (dense, 2D) → DUSt3R (dense, 3D, implicit) → MASt3R (dense, 3D, explicit matching) → VGGT (everything in one model). Each step adds more geometric awareness and more end-to-end integration. MASt3R sits at the critical inflection point where matching became a first-class 3D citizen.

Cheat sheet

Core idea
Add a matching head (Headdesc) to DUSt3R that predicts d=24 local features alongside pointmaps
Loss
Ltotal = Lconf (regression) + Lmatch (InfoNCE) — 3D and matching reinforce each other
Matching
Fast reciprocal NN with iterative subsampling: O(kWH) instead of O(W²H²), 65x speedup
Key result
93% VCRE AUC on Map-free (+30 points over the previous best published result), 36cm median translation error
Impact
Foundation for MASt3R-SfM, MASt3R-SLAM, and the trend toward 3D-grounded matching
What is the key difference between LoFTR and MASt3R in their approach to dense matching?