Real-time dense monocular SLAM built bottom-up from MASt3R's learned 3D reconstruction priors. No hand-designed features, no calibration required, no parametric camera model — just plug in a camera and go.
You point your phone camera at a room and walk around. You want the device to know where it is and what the room looks like in 3D — simultaneously. That is SLAM: Simultaneous Localisation and Mapping.
Classical visual SLAM (think ORB-SLAM3, LSD-SLAM) is an engineering marvel. Decades of research produced systems that track camera poses and build sparse 3D maps in real time. But they carry heavy baggage: hand-designed features, mandatory calibration, and a parametric camera model.
Classical SLAM chains many hand-designed modules. MASt3R-SLAM replaces the entire front-end with a single learned model.
MASt3R-SLAM's insight is radical: replace SLAM's entire hand-designed front-end with MASt3R's learned 3D reconstruction prior.
MASt3R (Matching And Stereo 3D Reconstruction) is a neural network that takes two images and outputs dense pointmaps, per-pixel descriptors, and confidences for both.
Think about what this means. A single forward pass through MASt3R implicitly solves feature extraction, feature matching, depth estimation, and relative pose estimation — all at once. It doesn't need to know the camera model because the 3D geometry is predicted directly from pixel appearance, trained on millions of image pairs with known 3D structure.
But MASt3R alone isn't SLAM. It processes image pairs, not video streams. It has no concept of keyframes, loop closure, or global consistency. The paper's contribution is engineering a complete real-time SLAM system that uses MASt3R as its backbone — with efficient pointmap matching, keyframe management, local fusion, and second-order global optimization.
At 512×384 resolution, here's what flows through the system for each new frame:
Let's unpack what MASt3R actually gives us. Given two images $I^i$ and $I^j$, MASt3R runs a single forward pass and outputs:
Pointmaps ($X^i_i$, $X^j_i$): These are dense 3D point clouds, one point per pixel, for both images, expressed in image i's coordinate frame. The notation $X^j_i$ means "the pointmap of image j in the coordinate frame of camera i." Every pixel gets a 3D position, giving us a dense reconstruction from just two views.
Descriptors: High-dimensional per-pixel feature vectors for matching. Unlike ORB or SIFT, which are designed by hand, these are learned end-to-end for the task of 3D reconstruction. They enable robust wide-baseline matching that classical features cannot handle.
Confidences ($C$, $Q$): Per-pixel confidence for the pointmaps ($C$) and for the descriptors ($Q$). This is crucial because the network knows when it's uncertain. Occluded regions, textureless areas, and ambiguous depths all get low confidence, which downstream optimization can use to weight residuals.
MASt3R-SLAM's only geometric assumption: all rays pass through a single camera centre. No focal length, no distortion model, no parametric camera. Instead, the pointmap itself defines the camera model: each pixel's 3D point $x$ implies a ray direction from the camera centre, obtained by scaling the point to unit length, $\psi(x) = x / \lVert x \rVert$.
This normalization converts pointmaps to unit rays, giving a smooth pixel-to-ray mapping that works for any camera — pinhole, fisheye, or even time-varying zoom.
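The normalization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `pointmap_to_rays` and the `eps` guard are assumptions made for the example.

```python
import numpy as np

def pointmap_to_rays(pointmap, eps=1e-8):
    """Normalize an (H, W, 3) pointmap to unit-length rays, one per pixel."""
    norms = np.linalg.norm(pointmap, axis=-1, keepdims=True)
    # eps (an assumption of this sketch) avoids division by zero for degenerate points.
    return pointmap / np.maximum(norms, eps)

# Toy 2x2 pointmap whose points lie at different depths.
pointmap = np.array([[[0.0, 0.0, 2.0], [1.0, 0.0, 1.0]],
                     [[0.0, 3.0, 4.0], [2.0, 2.0, 1.0]]])
rays = pointmap_to_rays(pointmap)

# Scaling every point by the same depth factor leaves the rays unchanged.
rays_scaled = pointmap_to_rays(5.0 * pointmap)
```

Because only the direction of each point survives, the ray field depends on how pixels map to directions and not on the predicted depth, which is why it can stand in for a camera model.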
From two input images, MASt3R outputs dense pointmaps, descriptors, and confidences.
Here's the full system at a glance. Every component flows from MASt3R's outputs — pointmaps and descriptors are the universal currency.
Classical SLAM uses projective data association: project a 3D point into an image using the known camera model. But MASt3R-SLAM has no camera model! Instead, it uses iterative projective matching: for each 3D point $x$ from pointmap $X^j_i$, it finds the pixel $p$ in image i whose ray best aligns with the ray to $x$, that is, $p^* = \arg\min_p \lVert \psi([X^i_i]_p) - \psi(x) \rVert^2$, with $\psi(\cdot)$ denoting normalization to a unit ray.
This minimizes the angle between the queried ray and the target ray. It converges within 10 iterations because the ray field is smooth. Each point is solved independently, so the whole thing runs in parallel on GPU — just 2ms for tracking.
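To make the matching objective concrete, here is a brute-force reference that evaluates the same ray-alignment cost for every pixel and takes the minimum. It is deliberately not the iterative per-point solver described above, only an exhaustive search over the same objective. The names `match_by_ray` and `unit_rays` are hypothetical, and the pinhole-style ray field is used only to build synthetic test data.

```python
import numpy as np

def unit_rays(points):
    """Scale 3D points (..., 3) to unit length."""
    return points / np.linalg.norm(points, axis=-1, keepdims=True)

def match_by_ray(pointmap_i, query_points):
    """For each query point, return the (row, col) of the pixel in image i
    whose ray is closest in direction (exhaustive search, no camera model)."""
    h, w, _ = pointmap_i.shape
    rays_i = unit_rays(pointmap_i).reshape(-1, 3)   # (H*W, 3)
    rays_q = unit_rays(query_points)                # (N, 3)
    # For unit vectors, ||a - b||^2 = 2 - 2 a.b
    cost = 2.0 - 2.0 * rays_q @ rays_i.T            # (N, H*W)
    best = np.argmin(cost, axis=1)
    return np.stack(np.unravel_index(best, (h, w)), axis=1)

# Synthetic ray field, used only to generate a test pointmap.
h, w = 6, 8
rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
directions = np.stack([(cols - w / 2) / w, (rows - h / 2) / h,
                       np.ones((h, w))], axis=-1)
pointmap_i = directions * 2.0                       # every point at depth 2

# Query points on the rays of pixels (1, 2) and (4, 7), at other depths.
queries = np.stack([directions[1, 2] * 5.0, directions[4, 7] * 0.5])
pixels = match_by_ray(pointmap_i, queries)
```

The exhaustive version costs O(N·H·W) per frame, which is what an iterative search over a smooth ray field avoids.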
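Given matched points, we solve for the relative pose $T_{kf}$ between the current frame and the keyframe by minimizing the directional ray error, the difference between the unit rays of matched points once $T_{kf}$ has been applied.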
Why ray errors instead of 3D point errors? Because depth predictions from MASt3R can be inconsistent — a point might be at the right angular position but wrong depth. Ray errors are bounded (angles are always in [0, π]) and therefore naturally robust to depth outliers. The 3D point error would let a single outlier with wildly wrong depth dominate the cost.
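A small numeric example illustrates the argument. The points below are made up for illustration: one candidate has the right direction but a depth that is 10x too large, the other has the wrong direction.

```python
import numpy as np

def unit(v):
    """Scale a 3D vector to unit length."""
    return v / np.linalg.norm(v)

target = np.array([0.5, 0.2, 2.0])
wrong_depth = 10.0 * target                    # same direction, depth 10x too large
wrong_direction = np.array([-0.5, 0.2, 2.0])   # mirrored across the optical axis

# 3D point error: dominated by the depth outlier.
point_error_depth = np.linalg.norm(wrong_depth - target)

# Ray errors: blind to the depth outlier, sensitive to the direction change,
# and never larger than 2 (the distance between opposite unit vectors).
ray_error_depth = np.linalg.norm(unit(wrong_depth) - unit(target))
ray_error_direction = np.linalg.norm(unit(wrong_direction) - unit(target))
```

Here the depth outlier contributes a point error of more than 18 units but a ray error of essentially zero, while the genuinely misaligned point still produces a clear ray error.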
Keyframes are added when overlap drops. Loop closures connect revisited areas.
Not every frame should be a keyframe. If we made every frame a keyframe, the backend optimization would grow quadratically. If we used too few keyframes, the system would lose tracking when the camera moves to a new viewpoint.
MASt3R-SLAM creates a new keyframe $K_i$ when the number of valid matches between the current frame and the current keyframe falls below a threshold $\omega_k$. "Valid matches" are pixel correspondences that survive outlier rejection, which discards matches with a large 3D distance or low confidence.
This is a natural criterion: when overlap drops, the current keyframe can no longer reliably constrain the current pose. Time for a fresh keyframe that sees the new part of the scene.
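The keyframe test can be sketched as follows. Several details are assumptions made for this example: the function name `needs_new_keyframe`, the specific rejection thresholds `max_distance` and `min_confidence`, and the choice to express the threshold as a fraction of matches and not as an absolute count.

```python
import numpy as np

def needs_new_keyframe(distances, confidences, omega_k,
                       max_distance=0.1, min_confidence=0.5):
    """Return True when too few matches survive outlier rejection.

    distances:   per-match 3D distance between matched points
    confidences: per-match confidence
    omega_k:     minimum fraction of valid matches (assumed to be a fraction here)
    """
    valid = (distances < max_distance) & (confidences > min_confidence)
    return bool(valid.mean() < omega_k)

confidences = np.full(10, 0.9)

# Only 4 of 10 matches are geometrically consistent: overlap has dropped.
low_overlap = np.array([0.01] * 4 + [0.5] * 6)
add_keyframe = needs_new_keyframe(low_overlap, confidences, omega_k=0.5)

# All matches are consistent: keep tracking against the current keyframe.
high_overlap = np.full(10, 0.01)
keep_tracking = not needs_new_keyframe(high_overlap, confidences, omega_k=0.5)
```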
When a new keyframe $K_i$ is created, a bidirectional edge to the previous keyframe $K_{i-1}$ is immediately added to the graph. This edge carries the pixel matches and the MASt3R predictions between the two keyframes, constraining their relative pose.
But sequential edges alone cause drift. To close loops, MASt3R-SLAM uses an image retrieval system based on ASMK (Aggregated Selective Match Kernels). Here's the loop closure process:
If tracking is lost (too few matches), the system queries the retrieval database with stricter thresholds. Once a retrieved keyframe matches well enough with the current frame, a new keyframe is inserted and tracking resumes. This is the same mechanism as loop closure, just triggered by failure instead of success.
MASt3R-SLAM has two levels of optimization that mirror the classical frontend-backend split in SLAM — but both operate on MASt3R's pointmaps.
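Each time we track a new frame against the current keyframe, we fuse the new pointmap observation into the keyframe's canonical pointmap using a running weighted average: $\tilde{X} \leftarrow (\tilde{C}\,\tilde{X} + C\,X) / (\tilde{C} + C)$ and $\tilde{C} \leftarrow \tilde{C} + C$, where $X$ is the new observation expressed in the keyframe's frame and $C$ is its confidence.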
This is a confidence-weighted filter. Early predictions from small-baseline frames have larger errors and lower confidence. As more frames observe the same keyframe from different viewpoints, the canonical pointmap converges toward the true 3D geometry. This is essentially Bayesian filtering — each observation refines the estimate.
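The running weighted average can be sketched as below. This is a minimal version that assumes the new observation has already been transformed into the keyframe's coordinate frame; the function name `fuse` is hypothetical.

```python
import numpy as np

def fuse(canonical_points, canonical_conf, new_points, new_conf):
    """Confidence-weighted running average of (H, W, 3) pointmaps.

    Confidences have shape (H, W). Returns the fused pointmap and the
    accumulated confidence.
    """
    total = canonical_conf + new_conf
    fused = (canonical_conf[..., None] * canonical_points
             + new_conf[..., None] * new_points) / total[..., None]
    return fused, total

# One-pixel "pointmap" observed three times with increasing confidence.
points = np.array([[[0.0, 0.0, 1.0]]])
conf = np.array([[1.0]])

points, conf = fuse(points, conf, np.array([[[0.0, 0.0, 2.0]]]), np.array([[3.0]]))
points, conf = fuse(points, conf, np.array([[[0.0, 0.0, 3.0]]]), np.array([[4.0]]))

# The result equals the weighted mean of all three observations:
# (1*1 + 3*2 + 4*3) / (1 + 3 + 4) = 2.375
fused_depth = points[0, 0, 2]
```

Only the running estimate and the accumulated confidence need to be stored, so earlier observations can be discarded after each update.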
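Given all keyframe poses $T_{WC_i}$ and canonical pointmaps $\tilde{X}^i_i$, the backend minimizes the ray error across all edges $\mathcal{E}$ in the pose graph, where each edge $(i, j)$ is evaluated through the relative pose $T_{ij} = T_{WC_i}^{-1} T_{WC_j}$.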
This is solved with Gauss-Newton using sparse Cholesky decomposition. Each Sim(3) pose has 7 DoF, so with N keyframes we solve a 7N × 7N sparse system. The first pose is fixed to remove gauge freedom. The Hessian is constructed with analytical Jacobians and parallel reductions in CUDA — at most 10 Gauss-Newton iterations per new keyframe, terminating early on convergence.
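As a small illustration of the quantities involved, the sketch below builds two Sim(3) poses as 4x4 matrices and forms the relative pose between them. The matrix representation, the helper names `sim3_matrix` and `rot_z`, and the pose values are all assumptions made for the example; the 7 DoF are the 3 rotation, 3 translation, and 1 scale parameters.

```python
import numpy as np

def sim3_matrix(scale, rotation, translation):
    """Build a 4x4 similarity transform [[s*R, t], [0, 1]]."""
    T = np.eye(4)
    T[:3, :3] = scale * rotation
    T[:3, 3] = translation
    return T

def rot_z(angle):
    """Rotation about the z axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Two made-up keyframe poses (camera-to-world).
T_WCi = sim3_matrix(1.0, rot_z(0.3), [1.0, 0.0, 0.0])
T_WCj = sim3_matrix(1.2, rot_z(0.8), [2.0, 0.5, 0.0])

# Relative pose that maps points from camera j's frame into camera i's frame.
T_ij = np.linalg.inv(T_WCi) @ T_WCj

# det(s*R) = s^3, so the cube root recovers the relative scale.
relative_scale = np.cbrt(np.linalg.det(T_ij[:3, :3]))
```

Composing `T_WCi @ T_ij` recovers `T_WCj`, which is the consistency that each pose-graph edge relies on.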
The canonical pointmap improves as more frames observe the same keyframe. Each observation is noisy; the weighted average converges to the true geometry.
Here's where MASt3R-SLAM's design pays off beautifully. Classical sparse SLAM produces a cloud of isolated 3D points. Getting a dense reconstruction requires a separate pipeline — multi-view stereo, volumetric fusion, or neural rendering. Each adds complexity and latency.
MASt3R-SLAM gets dense reconstruction for free. Every keyframe already has a canonical pointmap $\tilde{X}^k_k$ with a 3D point for every pixel. After global optimization adjusts all keyframe poses $T_{WC_k}$, each pointmap is transformed into the global frame as $T_{WC_k} \tilde{X}^k_k$.
The union of all transformed pointmaps is the dense reconstruction. No additional processing needed.
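Assembling the dense map then amounts to one transform per keyframe and a concatenation. The sketch below assumes poses are stored as 4x4 similarity matrices; `to_world` and `dense_reconstruction` are hypothetical names.

```python
import numpy as np

def to_world(T_WC, pointmap):
    """Apply a 4x4 similarity transform to an (H, W, 3) pointmap.

    Returns the transformed points as an (H*W, 3) array.
    """
    points = pointmap.reshape(-1, 3)
    return points @ T_WC[:3, :3].T + T_WC[:3, 3]

def dense_reconstruction(poses, pointmaps):
    """Union of all keyframe pointmaps expressed in the global frame."""
    return np.concatenate([to_world(T, X) for T, X in zip(poses, pointmaps)], axis=0)

# Two toy 2x2 keyframe pointmaps.
pointmap_a = np.arange(12, dtype=float).reshape(2, 2, 3)
pointmap_b = np.ones((2, 2, 3))

pose_a = np.eye(4)                    # first keyframe defines the global frame
pose_b = np.eye(4)
pose_b[:3, :3] *= 2.0                 # scale 2, no rotation
pose_b[:3, 3] = [1.0, 0.0, 0.0]       # shifted along x

cloud = dense_reconstruction([pose_a, pose_b], [pointmap_a, pointmap_b])
```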
If camera calibration is available, MASt3R-SLAM makes two modifications:
MASt3R-SLAM is evaluated on four standard benchmarks: TUM RGB-D, 7-Scenes, ETH3D-SLAM, and EuRoC — all monocular RGB only.
With known calibration, MASt3R-SLAM achieves 0.030m average ATE on TUM RGB-D, the best among all compared methods. This beats DROID-SLAM (0.038m), DPV-SLAM++ (0.035m), and GO-SLAM (0.049m).
This is where the system truly shines. Without any calibration, MASt3R-SLAM achieves 0.060m average ATE on TUM, lower than DPV-SLAM (0.076m) even though DPV-SLAM uses known calibration. A baseline of DROID-SLAM with GeoCalib-predicted intrinsics scores 0.158m, about 2.6x worse.
On ETH3D-SLAM, MASt3R-SLAM achieves the best ATE (0.086m) and AUC (23.935) among all methods, including ORB-SLAM3 (0.135m), DROID-SLAM (0.171m), and DPV-SLAM++ (0.132m).
Average ATE (metres) across TUM RGB-D sequences. Lower is better. MASt3R-SLAM sets the new state of the art with calibration and is competitive without it.
On 7-Scenes, MASt3R-SLAM's dense reconstruction has 0.074m accuracy vs DROID-SLAM's 0.115m — a 36% improvement. On EuRoC, accuracy is 0.099m vs 0.173m — a 43% improvement. The learned 3D prior produces significantly better geometry than triangulation-based approaches.
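The quoted percentages follow directly from the accuracy figures above. The short check below uses only those numbers; the helper name `relative_improvement` is made up for the example.

```python
def relative_improvement(baseline, ours):
    """Percentage reduction in error relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline

seven_scenes = relative_improvement(0.115, 0.074)   # DROID-SLAM vs MASt3R-SLAM
euroc = relative_improvement(0.173, 0.099)

print(f"7-Scenes: {seven_scenes:.1f}%  EuRoC: {euroc:.1f}%")
```

Rounded to whole numbers, these give the 36% and 43% quoted in the text.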
A SLAM system that produces beautiful results offline is useful for 3D scanning. A SLAM system that runs in real-time enables robotics, AR, and autonomous navigation. MASt3R-SLAM achieves 15 FPS on a single RTX 4090.
The timing budget per frame breaks down as:
The MASt3R network is the bottleneck — ~50ms of the ~67ms budget. Everything else the authors designed (matching, tracking, fusion, optimization) takes <15ms combined.
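The budget arithmetic is easy to verify from the frame rate. The sketch uses only the approximate figures quoted above (15 FPS and about 50ms for the network).

```python
fps = 15
frame_budget_ms = 1000.0 / fps          # time available per frame at 15 FPS

network_ms = 50.0                       # approximate MASt3R forward-pass cost
remaining_ms = frame_budget_ms - network_ms
network_share = network_ms / frame_budget_ms

print(f"budget: {frame_budget_ms:.1f} ms, "
      f"left after network: {remaining_ms:.1f} ms, "
      f"network share: {network_share:.0%}")
```

Roughly three quarters of each frame goes to the network, leaving about 17ms, which the custom components fit into with their combined cost of under 15ms.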
Several design choices are specifically motivated by real-time constraints:
Per-frame timing budget at 15 FPS (~67ms total). The MASt3R network dominates; all custom components together use less than 15ms.
DUSt3R (Wang et al., 2024): The pioneer — showed that a transformer can predict dense 3D pointmaps from image pairs, replacing hand-designed stereo matching. MASt3R-SLAM inherits the pointmap representation and the idea of optimizing in a shared 3D coordinate frame.
MASt3R (Leroy et al., 2024): DUSt3R's successor with an additional matching head that predicts per-pixel descriptors for robust correspondence. MASt3R-SLAM uses this as its backbone — the descriptors are critical for the feature-refinement step of pointmap matching.
ORB-SLAM3 (Campos et al., 2021): The gold standard of classical sparse SLAM. MASt3R-SLAM matches or exceeds its accuracy on sequences where ORB-SLAM3 works — and succeeds on sequences where ORB-SLAM3 fails entirely.
DROID-SLAM (Teed & Deng, 2021): Learned dense SLAM using optical flow features with per-pixel bundle adjustment. MASt3R-SLAM shares the spirit of end-to-end learned features but replaces the flow-based architecture with a 3D reconstruction prior, and achieves better accuracy and robustness.
Gaussian SLAM: MASt3R-SLAM's dense pointmaps could initialize 3D Gaussian splatting for photorealistic novel view synthesis. Instead of growing Gaussians from scratch, seed them from the high-quality pointmaps.
FutureMapping / Spatial AI: Davison's vision of SLAM systems that understand semantics and enable robotic interaction. MASt3R-SLAM is a step toward "spatial intelligence" — dense geometry from any camera, in real time.
Plug-and-play robotics: A robot with any camera — GoPro, phone, microscope, endoscope — can now do real-time dense SLAM without calibration. This lowers the bar for deploying vision-based autonomy.