Murai, Dexheimer, Davison — CVPR 2025

MASt3R-SLAM

Real-time dense monocular SLAM built bottom-up from MASt3R's learned 3D reconstruction priors. No hand-designed features, no calibration required, no parametric camera model — just plug in a camera and go.

Prerequisites: Visual SLAM basics + DUSt3R/MASt3R + Bundle adjustment
10 chapters, 5+ simulations

Chapter 0: The Problem

You point your phone camera at a room and walk around. You want the device to know where it is and what the room looks like in 3D — simultaneously. That is SLAM: Simultaneous Localisation and Mapping.

Classical visual SLAM (think ORB-SLAM3, LSD-SLAM) is an engineering marvel. Decades of research produced systems that track camera poses and build sparse 3D maps in real time. But they carry heavy baggage.

The fundamental tension: Classical SLAM decomposes a single problem (understanding 3D structure from images) into many sub-problems (feature detection, matching, triangulation, pose estimation, camera modelling). Each sub-problem uses hand-crafted priors. When these priors disagree with reality — textureless walls, rolling shutter, fisheye lenses, unknown focal length — the system breaks. What if a single learned model could handle all of these implicitly?
Classical SLAM vs Learned SLAM

Classical SLAM chains many hand-designed modules. MASt3R-SLAM replaces the entire front-end with a single learned model.

Why does classical visual SLAM struggle on in-the-wild video from unknown cameras?

Chapter 1: The Key Insight

MASt3R-SLAM's insight is radical: replace SLAM's entire hand-designed front-end with MASt3R's learned 3D reconstruction prior.

MASt3R (Matching And Stereo TRansformer 3D Reconstruction) is a neural network that takes two images and outputs:

  1. Pointmaps — a dense 3D point for every pixel in both images, all in a shared coordinate frame
  2. Descriptors — per-pixel feature vectors for robust matching
  3. Confidences — how much to trust each point and each descriptor

Think about what this means. A single forward pass through MASt3R implicitly solves feature extraction, feature matching, depth estimation, and relative pose estimation — all at once. It doesn't need to know the camera model because the 3D geometry is predicted directly from pixel appearance, trained on millions of image pairs with known 3D structure.

The paradigm shift: Classical SLAM asks: "Given known camera parameters, how do I extract features, match them, and triangulate 3D points?" MASt3R-SLAM asks: "Given two images, what does the 3D scene look like?" — and then builds the SLAM system around that answer. The camera model is never assumed; it's implicitly encoded in the predicted pointmap rays.

But MASt3R alone isn't SLAM. It processes image pairs, not video streams. It has no concept of keyframes, loop closure, or global consistency. The paper's contribution is engineering a complete real-time SLAM system that uses MASt3R as its backbone — with efficient pointmap matching, keyframe management, local fusion, and second-order global optimization.

Frozen vs trained: MASt3R itself is completely frozen in this system — no fine-tuning, no adaptation, no gradient updates at runtime. It was pretrained by Naver Labs on millions of image pairs with 3D ground truth (indoor/outdoor, synthetic/real). The entire SLAM pipeline around it — matching, tracking, fusion, optimization — is classical engineering using MASt3R's outputs as input. This makes MASt3R-SLAM a beautiful example of "learned backbone + engineered system": the network provides the 3D prior, and traditional SLAM provides the temporal reasoning.

Concrete data flow per frame

At 512×384 resolution, here's what flows through the system for each new frame:

What does MASt3R-SLAM replace from classical SLAM with learned priors?

Chapter 2: MASt3R Backbone

Let's unpack what MASt3R actually gives us. Given two images $I^i$ and $I^j$, MASt3R runs a single forward pass and outputs:

Pointmaps

$X^i_i, X^j_i \in \mathbb{R}^{H \times W \times 3}$

These are dense 3D point clouds, one point per pixel, for both images, expressed in image i's coordinate frame. The notation $X^j_i$ means "the pointmap of image j in the coordinate frame of camera i." Every pixel gets a 3D position, giving us a dense reconstruction from just two views.

Descriptors

$D^i_i, D^j_i \in \mathbb{R}^{H \times W \times d}$

High-dimensional per-pixel feature vectors for matching. Unlike ORB or SIFT which are designed by hand, these are learned end-to-end for the task of 3D reconstruction. They enable robust wide-baseline matching that classical features cannot handle.

Confidences

$C^i_i, C^j_i \in \mathbb{R}^{H \times W \times 1}$   and   $Q^i_i, Q^j_i \in \mathbb{R}^{H \times W \times 1}$

Per-pixel confidence for the pointmaps (C) and for the descriptors (Q). This is crucial — the network knows when it's uncertain. Occluded regions, textureless areas, and ambiguous depths all get low confidence, which downstream optimization can use to weight residuals.

The central camera assumption

MASt3R-SLAM's only geometric assumption: all rays pass through a single camera centre. No focal length, no distortion model, no parametric camera. Instead, the pointmap itself defines the camera model — each pixel's 3D point implies a ray direction from the camera centre:

$\psi(X^i_i) = X^i_i / \lVert X^i_i \rVert$

This normalization converts pointmaps to unit rays, giving a smooth pixel-to-ray mapping that works for any camera — pinhole, fisheye, or even time-varying zoom.
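As a minimal sketch of this normalization, the NumPy snippet below converts a pointmap to unit rays. The function name `unit_rays` and the `eps` guard against zero-length points are assumptions made for illustration, not part of the paper's code.

```python
import numpy as np

def unit_rays(pointmap, eps=1e-12):
    """Map an (H, W, 3) pointmap to unit rays: psi(X) = X / ||X||."""
    norms = np.linalg.norm(pointmap, axis=-1, keepdims=True)
    return pointmap / np.maximum(norms, eps)  # eps avoids division by zero

# Toy 2x2 pointmap: points along the same direction but at different
# depths map to the same unit ray.
X = np.array([[[0.0, 0.0, 2.0], [0.0, 0.0, 5.0]],
              [[3.0, 0.0, 4.0], [6.0, 0.0, 8.0]]])
rays = unit_rays(X)
```

Note that no focal length or distortion parameter appears anywhere: the ray for each pixel is read directly off the predicted 3D point.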

MASt3R's Outputs

From two input images, MASt3R outputs dense pointmaps, descriptors, and confidences.

Scale ambiguity: Although MASt3R is trained on data with metric scale, the scale of pointmap predictions can be inconsistent across different image pairs. MASt3R-SLAM handles this by optimizing poses in Sim(3) — the similarity group that includes rotation, translation, and scale — giving 7 degrees of freedom per pose instead of the usual 6.
What MASt3R was trained on: MASt3R (the backbone) was pretrained on a mixture of indoor datasets (ScanNet, ARKitScenes, HyperSim), outdoor datasets (MegaDepth, CO3D), and synthetic environments. Its encoder is a ViT-Large (~300M params) initialized from CroCo (a cross-view completion pretraining task). The decoder adds ~200M params. Total: ~500M parameters, all frozen during SLAM operation. This pretraining gives MASt3R geometric intuition about real-world scenes — it has seen millions of rooms, streets, and objects from all angles, learning priors about surface continuity, object shapes, and depth distributions.
Why does MASt3R-SLAM normalize pointmaps into unit rays ψ(X)?

Chapter 3: The SLAM Pipeline

Here's the full system at a glance. Every component flows from MASt3R's outputs — pointmaps and descriptors are the universal currency.

New Frame
Image arrives → MASt3R forward pass with current keyframe → pointmaps + descriptors
Pointmap Matching
Iterative projective matching finds pixel correspondences in ~2ms using CUDA kernels
Tracking
Solve for relative pose $T_{kf}$ via ray-error minimization with Gauss-Newton IRLS
Pointmap Fusion
Running weighted average fuses new observations into keyframe's canonical pointmap
New Keyframe?
If matches drop below threshold $\omega_k$ → spawn new keyframe, add edge to graph
Loop Closure
Query retrieval database (ASMK) with encoded features → decode candidates with MASt3R → add edges
Global Optimization
Second-order Gauss-Newton on ray errors across all edges → globally consistent poses + geometry

Pointmap matching: the clever part

Classical SLAM uses projective data association: project a 3D point into an image using the known camera model. But MASt3R-SLAM has no camera model! Instead, it uses iterative projective matching: for each 3D point x from pointmap $X^j_i$, it finds the pixel p in image i whose ray best aligns with the ray to x:

$p^* = \arg\min_p \lVert \psi([X^i_i]_p) - \psi(x) \rVert^2$

This minimizes the angle between the queried ray and the target ray. It converges within 10 iterations because the ray field is smooth. Each point is solved independently, so the whole thing runs in parallel on GPU — just 2ms for tracking.
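The per-point search can be sketched on the CPU as follows. This is an illustrative single-point version under stated assumptions, not the paper's CUDA kernel: the helper names `bilinear` and `match_pixel`, the finite-difference Jacobian, the damping value, and the synthetic pinhole ray field used as test data are all choices made for this example.

```python
import numpy as np

def bilinear(field, p):
    """Sample an (H, W, 3) field at continuous pixel p = (u, v)."""
    H, W, _ = field.shape
    u = float(np.clip(p[0], 0.0, W - 1.001))
    v = float(np.clip(p[1], 0.0, H - 1.001))
    u0, v0 = int(u), int(v)
    du, dv = u - u0, v - v0
    top = (1 - du) * field[v0, u0] + du * field[v0, u0 + 1]
    bot = (1 - du) * field[v0 + 1, u0] + du * field[v0 + 1, u0 + 1]
    return (1 - dv) * top + dv * bot

def match_pixel(rays, x, p0, iters=10, damping=1e-6):
    """Find the pixel whose ray best aligns with the ray to 3D point x."""
    target = x / np.linalg.norm(x)
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        residual = bilinear(rays, p) - target
        # 3x2 Jacobian of the ray w.r.t. (u, v), by central differences.
        J = np.stack([(bilinear(rays, p + e) - bilinear(rays, p - e)) / 2.0
                      for e in np.eye(2)], axis=1)
        # Damped normal equations: a tiny 2x2 linear system per step.
        step = np.linalg.solve(J.T @ J + damping * np.eye(2), -J.T @ residual)
        p = p + step
    return p

# Synthetic ray field generated from a pinhole camera. The matcher itself
# never sees f, cx, cy; it only sees the per-pixel rays.
H, W, f, cx, cy = 48, 64, 50.0, 32.0, 24.0
u, v = np.meshgrid(np.arange(W), np.arange(H))
dirs = np.stack([(u - cx) / f, (v - cy) / f,
                 np.ones_like(u, dtype=float)], axis=-1)
rays = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

true_pixel = np.array([40.3, 20.7])
x = 3.0 * np.array([(true_pixel[0] - cx) / f, (true_pixel[1] - cy) / f, 1.0])
p_star = match_pixel(rays, x, p0=(cx, cy))
```

Because the ray field varies smoothly across the image, the local linearization is accurate and a handful of steps is enough. Each point's problem is independent, which is what makes a one-thread-per-point GPU implementation natural.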

Tracking with ray errors

Given matched points, we solve for the relative pose $T_{kf}$ by minimizing the directional ray error:

$E_r = \sum_{m,n} w(q_{m,n}, \sigma_r^2) \, \lVert \psi(\tilde{X}^k_{k,n}) - \psi(T_{kf} X^f_{f,m}) \rVert_\rho$

Why ray errors instead of 3D point errors? Because depth predictions from MASt3R can be inconsistent — a point might be at the right angular position but wrong depth. Ray errors are bounded (angles are always in [0, π]) and therefore naturally robust to depth outliers. The 3D point error would let a single outlier with wildly wrong depth dominate the cost.

SLAM Pipeline Showcase

Watch a simulated camera move through a scene. Keyframes are added when overlap drops. Loop closures connect revisited areas.

Worked example (matching in 2ms): A 512×384 image has ~196K pixels. For each pixel, iterative projection runs ~10 Levenberg-Marquardt steps on a 2D problem (the pixel coordinates). Each step computes a 3×2 Jacobian (the 3D ray with respect to the 2D pixel position) and solves a tiny 2×2 linear system. All 196K problems run in parallel via custom CUDA kernels. Then a local feature search refines matches using MASt3R's descriptors. Total: ~2ms on an RTX 4090.
Engineering decision (ray errors over 3D point errors): This is a critical design choice. MASt3R's depth predictions can be off by 10-20% in difficult regions (textureless walls, specular surfaces). A 3D point error $\lVert p_{pred} - p_{gt} \rVert$ would be dominated by these depth outliers: a point at 5m with 20% depth error contributes 1m of residual, overwhelming the signal from hundreds of correct points. Ray errors bound the maximum contribution: even a point at infinite depth contributes at most $\lVert \psi(a) - \psi(b) \rVert \le 2$, because the difference of two unit vectors has norm at most 2. This is why the Gauss-Newton IRLS solver converges reliably in 3-5 iterations with ray errors but would oscillate with 3D errors.
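The arithmetic in this argument can be checked directly. The snippet below is a small illustration with a made-up viewing direction: it compares the two residuals for a point at 5m whose predicted depth is 20% too large, and evaluates the worst-case ray residual.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

direction = unit(np.array([0.3, -0.1, 1.0]))   # arbitrary viewing direction
p_true = 5.0 * direction                       # point at 5 m
p_pred = 1.2 * p_true                          # same ray, depth 20% too large

# The 3D point error grows with the depth error.
point_error = np.linalg.norm(p_pred - p_true)
# The ray error ignores a pure depth error entirely.
ray_error = np.linalg.norm(unit(p_pred) - unit(p_true))
# Worst case for any pair of rays: opposite directions.
max_ray_error = np.linalg.norm(unit(p_true) - unit(-p_true))
```

The 3D residual is 1m for this single point, the ray residual is zero, and no pair of rays can ever contribute more than 2.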
Why does MASt3R-SLAM use ray errors instead of 3D point errors for tracking?

Chapter 4: Keyframe Management

Not every frame should be a keyframe. If we made every frame a keyframe, the backend optimization would grow quadratically. If we used too few keyframes, the system would lose tracking when the camera moves to a new viewpoint.

When to create a new keyframe

MASt3R-SLAM creates a new keyframe $K_i$ when the number of valid matches between the current frame and the current keyframe falls below a threshold $\omega_k$. "Valid matches" are pixel correspondences that survive outlier rejection, which discards matches with a large 3D distance or low confidence.

This is a natural criterion: when overlap drops, the current keyframe can no longer reliably constrain the current pose. Time for a fresh keyframe that sees the new part of the scene.
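A minimal sketch of this decision rule is shown below. All three threshold values are hypothetical, and expressing the keyframe threshold as a fraction of matches rather than an absolute count is an assumption made to keep the toy example resolution-independent.

```python
import numpy as np

# Hypothetical thresholds, chosen only for this illustration.
MAX_DIST_3D = 0.1      # metres: matches further apart than this are outliers
MIN_CONFIDENCE = 1.5   # matches below this confidence are outliers
OMEGA_K = 0.5          # minimum fraction of valid matches before a new keyframe

def valid_matches(dist_3d, confidence):
    """Boolean mask of matches that survive outlier rejection."""
    return (dist_3d < MAX_DIST_3D) & (confidence > MIN_CONFIDENCE)

def needs_new_keyframe(dist_3d, confidence, omega_k=OMEGA_K):
    """True when too few matches to the current keyframe remain valid."""
    return bool(valid_matches(dist_3d, confidence).mean() < omega_k)

dist = np.array([0.01, 0.02, 0.50, 0.03, 0.80, 0.90])
conf = np.array([3.0, 2.0, 3.0, 1.0, 3.0, 3.0])
low_overlap = needs_new_keyframe(dist, conf)               # 2 of 6 valid
high_overlap = needs_new_keyframe(dist * 0.01, conf + 1)   # all 6 valid
```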

Edges in the pose graph

When a new keyframe $K_i$ is created, a bidirectional edge to the previous keyframe $K_{i-1}$ is immediately added to the graph. This edge carries the pixel matches and the MASt3R predictions between the two keyframes, constraining their relative pose.

But sequential edges alone cause drift. To close loops, MASt3R-SLAM uses an image retrieval system based on ASMK (Aggregated Selective Match Kernels). Here's the loop closure process:

  1. Encode the new keyframe's MASt3R features into the retrieval codebook
  2. Query the database for the top-K most similar keyframes
  3. If retrieval score exceeds threshold $\omega_r$, run MASt3R decoder on the pair
  4. If enough matches survive (> $\omega_l$), add a bidirectional edge to the graph
  5. Update the retrieval database with the new keyframe's features
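The control flow of the steps above can be sketched as follows. This is only an outline under loud assumptions: cosine similarity over one global feature vector per keyframe stands in for ASMK retrieval, the `count_matches` callable stands in for running the MASt3R decoder and matching, and the function name, threshold values, and toy data are all hypothetical.

```python
import numpy as np

def loop_closure_step(new_id, new_feat, database, count_matches,
                      top_k=3, omega_r=0.8, omega_l=100):
    """Return new graph edges for keyframe new_id and update the database."""
    edges = []
    if database:
        ids = list(database)
        feats = np.stack([database[i] for i in ids])
        # Stand-in retrieval score (NOT ASMK): cosine similarity.
        scores = feats @ new_feat / (
            np.linalg.norm(feats, axis=1) * np.linalg.norm(new_feat))
        for idx in np.argsort(-scores)[:top_k]:           # step 2: top-K query
            if scores[idx] <= omega_r:                    # step 3: retrieval gate
                continue
            if count_matches(new_id, ids[idx]) > omega_l: # step 4: match gate
                edges.append((new_id, ids[idx]))          # one pair = bidirectional edge
    database[new_id] = new_feat                           # step 5: grow the database
    return edges

# Toy data: keyframe 2 looks like keyframe 0 and unlike keyframe 1.
database = {0: np.array([1.0, 0.0, 0.0]), 1: np.array([0.0, 1.0, 0.0])}
edges = loop_closure_step(2, np.array([0.99, 0.1, 0.0]), database,
                          count_matches=lambda a, b: 250)
```

Updating the database last means a keyframe is never retrieved as its own loop-closure candidate.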
Incremental retrieval: Previous work (MASt3R-SfM) used ASMK in a batch setting with all images available upfront. MASt3R-SLAM adapts it to work incrementally — the database grows as new keyframes arrive. The codebook has tens of thousands of centroids, and a dense L2 distance calculation is fast enough for real-time quantization.
What triggers tracking loss: The system loses tracking when the number of valid matches drops below a minimum threshold (much lower than the keyframe creation threshold $\omega_k$). In practice this happens with: (1) fast camera motion, where a large rotation between consecutive frames leaves MASt3R's pair prediction with minimal overlap; (2) extreme motion blur, which destroys the appearance features MASt3R relies on; (3) scene transitions, such as cutting to a completely different viewpoint. The relocalisation mechanism handles (3) gracefully, but (1) and (2) require slowing down. At 15 FPS, this means the camera shouldn't rotate faster than ~60°/second for reliable tracking.

Relocalisation

If tracking is lost (too few matches), the system queries the retrieval database with stricter thresholds. Once a retrieved keyframe matches well enough with the current frame, a new keyframe is inserted and tracking resumes. This is the same mechanism as loop closure, just triggered by failure instead of success.

What criterion does MASt3R-SLAM use to decide when to create a new keyframe?

Chapter 5: Local-Global Optimization

MASt3R-SLAM has two levels of optimization that mirror the classical frontend-backend split in SLAM — but both operate on MASt3R's pointmaps.

Local: Pointmap fusion (frontend)

Each time we track a new frame against the current keyframe, we fuse the new pointmap observation into the keyframe's canonical pointmap using a running weighted average:

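$\tilde{X}^k_k \leftarrow \dfrac{\tilde{C}^k_k \tilde{X}^k_k + C^k_f \, (T_{kf} X^k_f)}{\tilde{C}^k_k + C^k_f}$
$\tilde{C}^k_k \leftarrow \tilde{C}^k_k + C^k_f$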

This is a confidence-weighted filter. Early predictions from small-baseline frames have larger errors and lower confidence. As more frames observe the same keyframe from different viewpoints, the canonical pointmap converges toward the true 3D geometry. This is essentially Bayesian filtering — each observation refines the estimate.

Why filtering matters: Without fusion, each MASt3R prediction is a noisy snapshot from one image pair. With fusion, the keyframe accumulates evidence from many viewpoints. This is especially important for depth: a point seen from one angle might have ambiguous depth, but seeing it from multiple angles resolves the ambiguity — the same principle as multi-view stereo, but performed incrementally.
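The running weighted average is short enough to write out. In this sketch the function name `fuse` is an assumption, the new observation is assumed to be already transformed into the keyframe's frame, and the Gaussian noise with constant confidence in the demo is synthetic test data rather than anything MASt3R produces.

```python
import numpy as np

def fuse(X_canon, C_canon, X_new, C_new):
    """One fusion update. X: (H, W, 3) pointmaps, C: (H, W, 1) confidences."""
    C_sum = C_canon + C_new
    X_fused = (C_canon * X_canon + C_new * X_new) / C_sum
    return X_fused, C_sum

# Single update: a confidence-3 observation pulls the estimate 3/4 of the way.
X_one, C_one = fuse(np.full((1, 1, 3), 2.0), np.ones((1, 1, 1)),
                    np.full((1, 1, 3), 4.0), np.full((1, 1, 1), 3.0))

# Many noisy observations of the same geometry average out over time.
rng = np.random.default_rng(0)
truth = np.full((4, 4, 3), 2.0)
X = truth + rng.normal(0.0, 0.2, truth.shape)
C = np.ones((4, 4, 1))
first_error = np.abs(X - truth).mean()
for _ in range(50):
    obs = truth + rng.normal(0.0, 0.2, truth.shape)
    X, C = fuse(X, C, obs, np.ones((4, 4, 1)))
final_error = np.abs(X - truth).mean()
```

Only the running pointmap and the accumulated confidence are stored, so the cost per frame is constant no matter how many frames have already been fused.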

Global: Second-order backend optimization

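Given all keyframe poses $T_{WC_i}$ and canonical pointmaps $\tilde{X}^i_i$, the backend minimizes the ray error across all edges $\mathcal{E}$ in the pose graph:

$E_g = \sum_{(i,j) \in \mathcal{E}} \sum_{m,n} w(q_{m,n}, \sigma_r^2) \, \lVert \psi(\tilde{X}^i_{i,m}) - \psi(T_{ij} \tilde{X}^j_{j,n}) \rVert_\rho$

where $T_{ij} = T_{WC_i}^{-1} T_{WC_j}$.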

This is solved with Gauss-Newton using sparse Cholesky decomposition. Each Sim(3) pose has 7 DoF, so with N keyframes we solve a 7N × 7N sparse system. The first pose is fixed to remove gauge freedom. The Hessian is constructed with analytical Jacobians and parallel reductions in CUDA — at most 10 Gauss-Newton iterations per new keyframe, terminating early on convergence.
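To make the Sim(3) bookkeeping concrete, here is a small sketch of how a relative transform between two keyframes can be formed from their world poses. Representing a pose as a `(scale, rotation, translation)` tuple and the helper names are assumptions for illustration; an optimizer would typically work with a 7-parameter Lie-algebra update instead.

```python
import numpy as np

def sim3_apply(T, X):
    """Apply T = (s, R, t) to points X of shape (..., 3): s * R @ x + t."""
    s, R, t = T
    return (s * X) @ R.T + t

def sim3_inverse(T):
    s, R, t = T
    return (1.0 / s, R.T, -(R.T @ t) / s)

def sim3_compose(A, B):
    """Return the transform that applies B first, then A."""
    sa, Ra, ta = A
    sb, Rb, tb = B
    return (sa * sb, Ra @ Rb, sa * (Ra @ tb) + ta)

def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Two made-up keyframe poses (camera-to-world), each with 1 + 3 + 3 = 7 DoF.
T_WCi = (1.0, rot_z(0.3), np.array([1.0, 0.0, 0.0]))
T_WCj = (2.0, rot_z(-0.5), np.array([0.0, 2.0, 1.0]))

# Relative transform taking points from camera j's frame into camera i's.
T_ij = sim3_compose(sim3_inverse(T_WCi), T_WCj)

X_j = np.array([[0.5, -1.0, 3.0], [1.0, 1.0, 2.0]])
via_i = sim3_apply(T_WCi, sim3_apply(T_ij, X_j))   # j -> i -> world
direct = sim3_apply(T_WCj, X_j)                    # j -> world
```

Both routes to the world frame agree, which is the consistency the pose graph edges rely on.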

Pointmap Fusion Over Time

Watch how the canonical pointmap improves as more frames observe the same keyframe. Each observation is noisy; the weighted average converges to the true geometry.

Why second-order? Previous methods like DUSt3R and MASt3R-SfM used first-order optimization and needed rescaling after every iteration (because scale is part of the optimization). MASt3R-SLAM's second-order Gauss-Newton handles scale naturally through the Sim(3) parameterization and converges much faster — typically 3-5 iterations instead of hundreds. This is critical for real-time operation.
Local BA vs global BA: The system has two distinct optimization levels, and understanding their difference is key. Local (frontend): operates per-keyframe, fuses pointmap observations with a running weighted average — O(1) per frame, no iteration needed, purely incremental. Global (backend): solves a sparse 7N×7N system across all N keyframes — runs only when new keyframes are inserted or loop closures detected. With 50 keyframes, that is a 350×350 system solved via sparse Cholesky in ~10ms. The backend caps at 10 Gauss-Newton steps with early termination on convergence (||Δx|| < ε). The first keyframe pose is locked to remove gauge freedom.
What is the benefit of pointmap fusion in the frontend?

Chapter 6: Dense Reconstruction

Here's where MASt3R-SLAM's design pays off beautifully. Classical sparse SLAM produces a cloud of isolated 3D points. Getting a dense reconstruction requires a separate pipeline — multi-view stereo, volumetric fusion, or neural rendering. Each adds complexity and latency.

MASt3R-SLAM gets dense reconstruction for free. Every keyframe already has a canonical pointmap $\tilde{X}^k_k$ with a 3D point for every pixel. After global optimization adjusts all keyframe poses $T_{WC_k}$, each pointmap is transformed into the global frame:

$X^k_W = T_{WC_k} \tilde{X}^k_k$

The union of all transformed pointmaps is the dense reconstruction. No additional processing needed.
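As a sketch of that final step, the snippet below transforms each keyframe's pointmap by its pose and concatenates the results. The `(scale, rotation, translation)` pose tuple and the function names are assumptions for illustration, and the two tiny pointmaps are made-up data.

```python
import numpy as np

def to_world(T_WC, pointmap):
    """Transform an (H, W, 3) pointmap into the world frame; returns (H*W, 3)."""
    s, R, t = T_WC
    return (s * pointmap.reshape(-1, 3)) @ R.T + t

def dense_reconstruction(poses, pointmaps):
    """Union of all keyframe pointmaps, expressed in the world frame."""
    return np.concatenate([to_world(T, X) for T, X in zip(poses, pointmaps)],
                          axis=0)

# Two toy keyframes with 2x2 pointmaps: one at the origin, one shifted in x.
X = np.arange(12, dtype=float).reshape(2, 2, 3)
identity = (1.0, np.eye(3), np.zeros(3))
shifted = (1.0, np.eye(3), np.array([1.0, 0.0, 0.0]))
cloud = dense_reconstruction([identity, shifted], [X, X])
```

At 512×384 each keyframe contributes roughly 196K points, so the cloud grows linearly with the number of keyframes.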

Why is this better than classical dense SLAM?

Known calibration bonus

If camera calibration is available, MASt3R-SLAM makes two modifications:

  1. Constrained backprojection: Instead of using MASt3R's predicted rays, query only the depth and backproject along rays defined by the known camera model. This corrects any ray direction errors in MASt3R's predictions.
  2. Pixel-space residuals: Switch from ray errors to reprojection errors in pixel space, which is more precise when the projection function Π is known.
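Both modifications can be illustrated for the simplest case of a pinhole camera. The sketch below assumes an intrinsics matrix `K`, treats the z-coordinate of the predicted pointmap as the depth, and uses made-up function names and random test data; other camera models would swap in their own backprojection and projection functions.

```python
import numpy as np

def backproject_with_intrinsics(pointmap, K):
    """Keep the predicted z-depth, but place points on the calibrated rays."""
    H, W, _ = pointmap.shape
    depth = pointmap[..., 2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    rays = pix @ np.linalg.inv(K).T        # K^-1 [u, v, 1]^T, so z = 1
    return rays * depth[..., None]

def project(K, X):
    """Pinhole projection: 3D points (..., 3) to pixels (..., 2)."""
    uvw = X @ K.T
    return uvw[..., :2] / uvw[..., 2:3]

K = np.array([[50.0, 0.0, 2.0],
              [0.0, 50.0, 1.5],
              [0.0, 0.0, 1.0]])

# Fake prediction: random (wrong) x/y values, constant depth of 2 m.
rng = np.random.default_rng(0)
predicted = rng.normal(size=(3, 4, 3))
predicted[..., 2] = 2.0

X_fixed = backproject_with_intrinsics(predicted, K)
pixels = project(K, X_fixed)   # lands exactly on the pixel grid
```

After the constraint, every point reprojects onto its own pixel, so residuals can be measured in pixels with the same `project` function.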
Reconstruction quality: On 7-Scenes, MASt3R-SLAM achieves 0.074m accuracy and 0.057m completion (Chamfer: 0.066m), against DROID-SLAM's 0.115m accuracy, 0.040m completion, and 0.077m Chamfer. MASt3R-SLAM is better on accuracy and Chamfer distance, while DROID-SLAM has the lower (better) completion error. On EuRoC with calibration, accuracy is 0.099m vs DROID-SLAM's 0.173m. By these numbers, the learned 3D prior produces more accurate geometry, though not more complete geometry on 7-Scenes.
How does MASt3R-SLAM produce a dense 3D reconstruction?

Chapter 7: Results

MASt3R-SLAM is evaluated on four standard benchmarks: TUM RGB-D, 7-Scenes, ETH3D-SLAM, and EuRoC — all monocular RGB only.

TUM RGB-D: State-of-the-art with calibration

With known calibration, MASt3R-SLAM achieves 0.030m average ATE — the best among all compared methods. This beats DROID-SLAM (0.038m), DPV-SLAM++ (0.035m), and GO-SLAM (0.049m).

Without calibration: the real story

This is where the system truly shines. Without any calibration, MASt3R-SLAM achieves 0.060m average ATE on TUM, comparable to DPV-SLAM (0.076m), which uses known calibration. A baseline of DROID-SLAM with GeoCalib-predicted intrinsics scores 0.158m, about 2.6x worse.

ETH3D-SLAM

MASt3R-SLAM achieves the best ATE (0.086m) and AUC (23.935) among all methods including ORB-SLAM3 (0.135m), DROID-SLAM (0.171m), and DPV-SLAM++ (0.132m).

Trajectory Error Comparison

Average ATE (metres) across TUM RGB-D sequences. Lower is better. MASt3R-SLAM sets the new state of the art with calibration and is competitive without it.

Reconstruction quality

On 7-Scenes, MASt3R-SLAM's dense reconstruction has 0.074m accuracy vs DROID-SLAM's 0.115m — a 36% improvement. On EuRoC, accuracy is 0.099m vs 0.173m — a 43% improvement. The learned 3D prior produces significantly better geometry than triangulation-based approaches.

No calibration, no problem: Classical SLAM systems assume you've carefully calibrated your camera. ORB-SLAM3 fails entirely (marked X) on 5 of 9 TUM sequences. MASt3R-SLAM without calibration tracks all sequences successfully, with accuracy comparable to calibrated competitors. This is the "plug-and-play" promise delivered.
How does MASt3R-SLAM without calibration compare to DROID-SLAM with calibration on TUM RGB-D?

Chapter 8: Real-Time Performance

A SLAM system that produces beautiful results offline is useful for 3D scanning. A SLAM system that runs in real-time enables robotics, AR, and autonomous navigation. MASt3R-SLAM achieves 15 FPS on a single RTX 4090.

Where does the time go?

The timing budget per frame breaks down as:

The MASt3R network is the bottleneck — ~50ms of the ~67ms budget. Everything else the authors designed (matching, tracking, fusion, optimization) takes <15ms combined.

Engineering for speed

Several design choices are specifically motivated by real-time constraints:

Resolution matters: MASt3R resizes the largest image dimension to 512 pixels. This is a tradeoff — lower resolution means faster inference but coarser pointmaps. The 15 FPS figure uses full resolution outputs. At lower resolutions, the system can run faster but with reduced accuracy.
Hardware and memory budget: All results are on a single RTX 4090 (24GB VRAM, Ada Lovelace). Memory breakdown: MASt3R model weights ~2GB, input images + feature maps ~3GB, canonical pointmaps (scales with keyframe count: ~2MB per keyframe × up to ~200 keyframes = ~400MB), pose graph + Hessian <50MB. Total steady-state: ~6-8GB VRAM, well within the 24GB budget. The system has been tested on sequences up to 2000+ frames (TUM, EuRoC) with hundreds of keyframes without memory issues. The ~50ms MASt3R forward pass uses CUDA with TF32 precision on Ampere+ GPUs.
Timing Breakdown

Per-frame timing budget at 15 FPS (~67ms total). The MASt3R network dominates; all custom components together use less than 15ms.

What is the computational bottleneck in MASt3R-SLAM?

Chapter 9: Connections

What MASt3R-SLAM builds on

DUSt3R (Wang et al., 2024): The pioneer — showed that a transformer can predict dense 3D pointmaps from image pairs, replacing hand-designed stereo matching. MASt3R-SLAM inherits the pointmap representation and the idea of optimizing in a shared 3D coordinate frame.

MASt3R (Leroy et al., 2024): DUSt3R's successor with an additional matching head that predicts per-pixel descriptors for robust correspondence. MASt3R-SLAM uses this as its backbone — the descriptors are critical for the feature-refinement step of pointmap matching.

ORB-SLAM3 (Campos et al., 2021): The gold standard of classical sparse SLAM. MASt3R-SLAM matches or exceeds its accuracy on sequences where ORB-SLAM3 works — and succeeds on sequences where ORB-SLAM3 fails entirely.

DROID-SLAM (Teed & Deng, 2021): Learned dense SLAM using optical flow features with per-pixel bundle adjustment. MASt3R-SLAM shares the spirit of end-to-end learned features but replaces the flow-based architecture with a 3D reconstruction prior, and achieves better accuracy and robustness.

What MASt3R-SLAM enables

Gaussian SLAM: MASt3R-SLAM's dense pointmaps could initialize 3D Gaussian splatting for photorealistic novel view synthesis. Instead of growing Gaussians from scratch, seed them from the high-quality pointmaps.

FutureMapping / Spatial AI: Davison's vision of SLAM systems that understand semantics and enable robotic interaction. MASt3R-SLAM is a step toward "spatial intelligence" — dense geometry from any camera, in real time.

Plug-and-play robotics: A robot with any camera — GoPro, phone, microscope, endoscope — can now do real-time dense SLAM without calibration. This lowers the bar for deploying vision-based autonomy.

The learned prior trajectory: Single-view depth (Depth Anything) → two-view 3D (DUSt3R) → two-view 3D + matching (MASt3R) → real-time SLAM (MASt3R-SLAM). Each step uses a stronger prior and solves more of the 3D vision pipeline. The endgame may be a single network that takes in a video stream and outputs globally consistent 3D geometry, poses, and semantics — all in real time.

Cheat sheet

Core idea
Build a complete dense SLAM system bottom-up from MASt3R's learned two-view 3D reconstruction prior
Key innovation
Iterative projective matching (2ms), ray-error tracking, confidence-weighted pointmap fusion, second-order Sim(3) optimization
Camera model
Generic central camera — no calibration needed. Rays defined by pointmaps. Works with pinhole, fisheye, zoom, rolling shutter.
Performance
15 FPS on RTX 4090, SOTA accuracy with calibration (0.030m ATE on TUM), competitive without (0.060m)
Impact
First plug-and-play monocular dense SLAM — no features, no calibration, no assumptions. Enables in-the-wild spatial intelligence.
What is the fundamental difference between DROID-SLAM and MASt3R-SLAM in terms of their learned priors?