Rooms from Motion

Chapter 0: The Problem

You walk through your apartment with a phone, snapping photos of every room. You want a 3D map of every object — the couch, the table, the bookshelf — with their real-world positions, sizes, and orientations. No depth sensor. No pre-calibrated camera rig. Just plain RGB images.

The traditional approach would be Structure from Motion (SfM): extract keypoints (corners, edges, blobs), match them across frames, triangulate their 3D positions, and solve for camera poses. Then, on top of that point cloud, try to fit some kind of object representation.

This works, but think about what it actually does. SfM reconstructs thousands of meaningless points on surfaces. It doesn't know that this cluster of points is a "table" and that cluster is a "chair." The objects — the things you actually care about — are an afterthought bolted on at the end.

The mismatch: You want a semantic 3D map of objects. SfM gives you a cloud of anonymous points. The information you need (object identity, size, pose) isn't represented at all in the point cloud. You're solving a harder problem (dense geometry) to get at an easier one (object layout).

What if you skipped the points entirely? What if the fundamental primitive for localization and mapping wasn't a point but an object?

Points vs. Objects

Left: traditional SfM scatters keypoints across surfaces. Right: RfM uses 3D object boxes directly.

Why is using points as the fundamental primitive for indoor 3D mapping wasteful?

Points reconstruct surface geometry but carry no semantic information, so you still need a separate step to identify and localize objects
Points are too slow to extract from images
Points require expensive LiDAR sensors to capture

Chapter 1: The Key Insight

RfM's insight is deceptively simple: if you can predict metric 3D bounding boxes from single images, you can do everything SfM does — but with objects instead of points.

Think about what SfM needs from its point correspondences. It matches the same point across two frames, then uses those matches to estimate the relative camera pose between the frames. Points are just a convenient vehicle for establishing correspondences and computing geometry.

But 3D bounding boxes can serve the same purpose. A box has 8 corners. If you detect the same object in two frames, you have 8 pairs of corresponding 3D points — the corners of the matched boxes. That is more than enough to estimate the relative pose between the two cameras.

Objects as correspondences: In SfM, you find 100+ matching keypoints to estimate a relative pose. In RfM, matching even 2-3 objects gives you 16-24 corresponding 3D corners — far more than the minimum needed for rigid alignment. And unlike keypoints, these correspondences come with semantic meaning: you know what each matched entity is.

The full RfM pipeline follows from this single insight:

1. Detect

CuTR predicts oriented 3D boxes with learned embeddings from each frame

↓

2. Match Objects

Cubify Match finds which boxes across frames correspond to the same object

↓

3. Match Corners

A second matcher aligns the 8 corners of matched boxes for finer correspondences

↓

4. Relative Pose

Kabsch-Umeyama alignment on matched corners gives the camera motion between frames

↓

5. Global Poses

Rotation and translation averaging over the view graph gives all camera poses

↓

6. Object Tracks

Union-find merges matched objects across all frames into persistent tracks

↓

7. Bundle Adjustment

Optimize global 3D box parameters via corner reprojection

Scaling advantage: The representation complexity scales with the number of objects, not the number of surface points. A room with 15 objects produces 15 boxes (120 corners) regardless of room size. A point cloud of the same room might contain 50,000+ points. The entire object map of a 10-room apartment fits in <50KB (150 objects × ~300 bytes each), while a dense point cloud of the same space would be >500MB.

Why can 3D bounding box corners replace keypoints for relative pose estimation?

Because boxes are easier to detect than keypoints
Because matched box corners provide corresponding 3D point pairs (8 per object), which is exactly what rigid alignment algorithms like Kabsch need
Because bounding boxes already encode the camera pose

Chapter 2: Per-Frame Detection

The foundation of RfM is CuTR (Cubify Transformer) — a single-image 3D object detector that produces oriented bounding boxes in metric coordinates. Every downstream step depends on CuTR giving good boxes, so let's understand what it does and what it outputs.

What CuTR Produces

For each detected object in a frame, CuTR outputs:

3D center: the (x, y, z) position of the box center in the camera's coordinate frame
3D extents: the (width, height, depth) of the box in meters
Orientation: a yaw angle θ around the gravity axis (gravity-aligned, just like Boxer)
Learned embedding: a feature vector describing the object's appearance and shape — this is crucial for matching
Category label: what kind of object it is (table, chair, sofa, etc.)

CuTR is trained with known camera intrinsics, so the 3D predictions are in metric units — actual meters, not arbitrary scale. This is essential because RfM needs to compute real-world distances between objects.

Frozen vs trained (what's learned where): CuTR (the detector) is trained on CA-1M, a large-scale indoor dataset with 1M+ frames and 3D box annotations. Its backbone is initialized from a pretrained image encoder (DINOv2 or similar) and fine-tuned end-to-end for 3D detection. Cubify Match (the object matcher) is trained from scratch, supervised with ground-truth object correspondences. The corner matcher is a separate trained module. The rest of the pipeline (Kabsch alignment, rotation averaging with glomap, union-find, bundle adjustment) is classical geometry with no learned parameters. So the split is: perception is learned (detection + matching), geometry is classical (alignment + averaging + optimization).

Concrete tensor shapes

For a single frame at 640×480 resolution with ~10 detected objects:

CuTR input: 640×480×3 RGB image + camera intrinsics (fx, fy, cx, cy)
CuTR output per object: 3D center (3), extents (3), yaw (1), embedding (256-dim), confidence (1), category (1) = 265 floats
Total per frame: ~10 objects × 265 = 2,650 floats, plus the box corners: 10 objects × 8 corners = 80 3D points (240 floats)
Cubify Match input: two sets of object features (N×265 and M×265) → output: assignment matrix (N+1)×(M+1) with dustbin
Box cloud per frame: 10 objects × 8 corners = 80 3D points → corner matcher produces 80×80 potential correspondences
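The shape arithmetic above can be checked with a few lines of Python. This is only a bookkeeping sketch: the field names and the `FIELD_SIZES` dictionary are illustrative assumptions, not CuTR's actual output format.

```python
# Per-object feature layout as listed above (field names are illustrative).
FIELD_SIZES = {
    "center": 3,
    "extents": 3,
    "yaw": 1,
    "embedding": 256,
    "confidence": 1,
    "category": 1,
}


def floats_per_object(fields=FIELD_SIZES):
    """Number of floats describing one detected object."""
    return sum(fields.values())


def frame_summary(num_objects):
    """Feature and box-cloud sizes for one frame."""
    corners = 8 * num_objects
    return {
        "feature_floats": num_objects * floats_per_object(),
        "corner_points": corners,
        "corner_floats": 3 * corners,
        "corner_pairs_vs_same_size_frame": corners * corners,
    }


def assignment_shape(n, m):
    """Assignment matrix shape with one extra dustbin row and column."""
    return (n + 1, m + 1)


if __name__ == "__main__":
    print(floats_per_object())
    print(frame_summary(10))
    print(assignment_shape(10, 12))
```

The dustbin row and column are what make the assignment matrix one larger than the object counts in each dimension.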

From Pixels to Metric Boxes

The key challenge is going from a flat 2D image to a 3D box with real-world dimensions. CuTR uses the camera's intrinsic parameters (focal length, principal point) to unproject image features into 3D rays, then predicts the depth and extent along those rays. The gravity direction is assumed known (from an IMU or estimated), which reduces rotation to a single yaw angle.

Why gravity alignment matters: Indoor objects almost always sit on flat surfaces — floors, desks, shelves. By assuming the box's "up" direction is aligned with gravity, CuTR reduces the rotation prediction from 3 angles to just 1 (yaw). This is the same assumption Boxer uses, and it holds well in practice for indoor scenes.

Engineering decision — gravity from where? The gravity direction comes from the device IMU (accelerometer) on phones/tablets, or from a learned gravity estimator for images without inertial data. This is a standard assumption in indoor 3D detection (also used by ARKit, ARCore, and Omni3D). It means the entire pipeline operates in a gravity-aligned coordinate system where "up" is always +Y, simplifying rotation from SO(3) (3 DoF) to SO(2) (1 DoF). The downstream benefit: Kabsch alignment becomes a 4-DoF problem (1 yaw + 3 translation) instead of 6-DoF, making it far more robust with fewer correspondences.
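To make the gravity-aligned parameterization concrete, here is a minimal sketch of how a (center, extents, yaw) box expands into its 8 corners. It assumes +Y is up, width runs along the box's local X axis, height along Y, and depth along Z; the `Box3D` class and that axis assignment are assumptions made for illustration, not CuTR's documented convention.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    center: tuple   # (x, y, z) in meters
    extents: tuple  # (width, height, depth) in meters
    yaw: float      # radians, rotation about the +Y (up) axis

    def corners(self):
        """Return the 8 corners: scale a unit cube, rotate about +Y, translate."""
        cx, cy, cz = self.center
        w, h, d = self.extents
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        out = []
        for sx in (-0.5, 0.5):
            for sy in (-0.5, 0.5):
                for sz in (-0.5, 0.5):
                    lx, ly, lz = sx * w, sy * h, sz * d
                    # Yaw only mixes X and Z; the Y coordinate is untouched.
                    out.append((cx + c * lx + s * lz,
                                cy + ly,
                                cz - s * lx + c * lz))
        return out


if __name__ == "__main__":
    table = Box3D(center=(1.0, 0.5, 3.0), extents=(2.0, 1.0, 1.0), yaw=0.0)
    for corner in table.corners():
        print(corner)
```

Because yaw never changes the Y coordinate, the top and bottom faces stay horizontal for any yaw value, which is exactly the constraint that gravity alignment buys.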

Per-Frame 3D Detection

Left: a camera view with 2D detections. Right: the corresponding 3D boxes in a top-down view.

What does CuTR output for each detected object, beyond the standard bounding box parameters?

A segmentation mask and optical flow
A point cloud of the object surface
A learned embedding vector that encodes the object's appearance and shape, used for matching across frames

Chapter 3: Object Matching

You've detected 3D boxes in Frame A and Frame B. Now the critical question: which box in Frame A is the same object as which box in Frame B? This is the data association problem, and RfM solves it with Cubify Match.

Why Not Just Use 3D IoU?

A naive approach would be to check which boxes overlap in 3D. But here's the problem: the frames are un-posed. You don't know the relative camera pose yet — that's what you're trying to compute. So you can't transform the boxes into a common coordinate frame to compare them.

You need a matcher that works before you know the camera poses. Cubify Match does this by comparing the learned embeddings from CuTR, not the 3D positions.

LightGlue-Style Architecture

Cubify Match adapts LightGlue, a state-of-the-art keypoint matcher, to work on 3D box features instead of 2D keypoint descriptors. The architecture uses alternating self- and cross-attention layers:

Self-attention within each frame: boxes in Frame A attend to each other, learning about the spatial layout of objects in that frame. Same for Frame B.
Cross-attention across frames: boxes in Frame A attend to boxes in Frame B, comparing their embeddings to find correspondences.
Optimal transport: the final assignment uses the Sinkhorn algorithm to produce a soft assignment matrix, where each row sums to ≤ 1 (an object in A matches at most one object in B, or is unmatched).

Why attention helps: Self-attention lets the matcher reason about context. If Frame A has a table with four chairs around it, the matcher knows "this is probably a dining set." When it sees a similar arrangement in Frame B, the cross-attention can match not just individual objects but the whole layout. Context disambiguates objects that look identical in isolation (like four identical chairs).

Handling Unmatched Objects

Not every object in Frame A appears in Frame B. The camera may have moved, revealing new objects and hiding old ones. Cubify Match handles this with a dustbin column in the assignment matrix — objects that don't have a good match in the other frame are assigned to the dustbin, meaning "unmatched."
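The dustbin mechanism can be sketched with a small log-domain Sinkhorn routine in the style popularized by SuperGlue. This is an illustrative toy, not Cubify Match's implementation: the score matrix, the fixed `dustbin_score`, and the function names are all assumptions (in a trained matcher the scores come from the attention layers and the dustbin score is typically learned).

```python
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn_with_dustbin(scores, dustbin_score=1.0, iters=50):
    """Soft assignment over an (N+1) x (M+1) matrix with a dustbin row and column."""
    n, m = len(scores), len(scores[0])
    # Append a dustbin column to every row, then a dustbin row.
    z = [list(row) + [dustbin_score] for row in scores]
    z.append([dustbin_score] * (m + 1))
    # Each real object carries mass 1; each dustbin can absorb the other side.
    log_mu = [0.0] * n + [math.log(m)]
    log_nu = [0.0] * m + [math.log(n)]
    u = [0.0] * (n + 1)
    v = [0.0] * (m + 1)
    for _ in range(iters):
        u = [log_mu[i] - logsumexp([z[i][j] + v[j] for j in range(m + 1)])
             for i in range(n + 1)]
        v = [log_nu[j] - logsumexp([z[i][j] + u[i] for i in range(n + 1)])
             for j in range(m + 1)]
    return [[math.exp(z[i][j] + u[i] + v[j]) for j in range(m + 1)]
            for i in range(n + 1)]


def extract_matches(plan):
    """Map each Frame A object to its best Frame B object, or None for the dustbin."""
    n, m = len(plan) - 1, len(plan[0]) - 1
    matches = {}
    for i in range(n):
        best = max(range(m + 1), key=lambda col: plan[i][col])
        matches[i] = best if best < m else None
    return matches


if __name__ == "__main__":
    # Three objects in Frame A, two in Frame B; A's third object has no partner.
    scores = [[5.0, 0.0],
              [0.0, 5.0],
              [0.0, 0.0]]
    print(extract_matches(sinkhorn_with_dustbin(scores)))
```

In this toy input the first two objects pair up along the diagonal, while the third object of Frame A puts most of its mass in the dustbin column and is reported as unmatched.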

Input features: Each box is represented by its CuTR embedding, its 3D center, extents, and yaw. The 3D parameters are encoded with positional encoding and concatenated with the embedding. This gives the matcher both what the object looks like and where it was predicted to be (in each frame's local coordinate system).

Why exhaustive pairwise matching over minimal sampling: Classical SfM matches hundreds to thousands of keypoints per image pair and relies on RANSAC with a minimal 5-point solver to separate inliers from outliers. RfM takes the opposite approach: it evaluates all possible object pairs between frames (typically 5-15 objects per frame, so 25-225 pairs), runs the full Sinkhorn optimal transport, then uses all matched corners for Kabsch. This exhaustive strategy works because: (1) the number of objects is small (unlike thousands of keypoints), (2) Sinkhorn produces globally optimal soft assignments, and (3) using all corners gives a highly over-determined system (8 corners × 3 matched objects = 24 point pairs for a 4-DoF problem that needs only 2). More correspondences mean a more robust pose.

Why can't RfM simply use 3D IoU to match objects across frames?

Because the frames are un-posed: without knowing the relative camera pose, the boxes from different frames can't be compared in a common coordinate system
Because 3D IoU is too expensive to compute
Because indoor objects often overlap in 3D space

Chapter 4: Corner Matching & Relative Pose

Cubify Match told us which objects correspond across frames. But to estimate the relative camera pose, we need something more precise: we need to know which corners of matched boxes correspond. This is where the "box cloud" idea comes in.

Box Clouds

Each oriented 3D box has 8 corners. If Frame A has N detected objects, we get 8N 3D points — the "box cloud" of that frame. Similarly for Frame B. The challenge is that the corner ordering might not be consistent across frames because the boxes are predicted independently with potentially different orientations.

RfM runs a second LightGlue-style matcher on these box clouds. Each corner is represented by a feature that combines: the CuTR embedding of its parent box, the corner's local position within the box (which of the 8 corners it is), and the corner's 3D coordinates.

Kabsch-Umeyama Alignment

Once we have matched corner pairs, we need to find the rigid transformation (rotation + translation) that best maps one set of corners to the other. This is the classic Kabsch-Umeyama algorithm:

Center the points: subtract the centroid from each set
Compute the cross-covariance matrix: H = P_A^T · P_B
SVD: decompose H = UΣV^T
Rotation: R = V · diag(1, 1, det(VU^T)) · U^T
Translation: t = centroid_B − R · centroid_A

(R*, t*) = argmin_{R, t} ∑_i || p_B,i − R · p_A,i − t ||²

Because indoor scenes are gravity-aligned, the relative pose is restricted to 4 DoF: one yaw angle for the rotation plus a 3D translation. This simplification makes the estimation more robust.
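When the rotation is a single yaw angle, the least-squares alignment has a closed form that needs no SVD: center both point sets, then take the yaw from an `atan2` of two sums over the horizontal coordinates. The sketch below assumes +Y is up and that yaw rotates about +Y; the function names are hypothetical, and this is one standard way to solve the 4-DoF problem rather than necessarily the exact variant RfM uses.

```python
import math


def rotate_yaw(p, yaw):
    """Rotate point p about the +Y axis by yaw radians."""
    x, y, z = p
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)


def align_yaw_translation(points_a, points_b):
    """Least-squares yaw and translation mapping points_a onto points_b."""
    n = len(points_a)
    ca = [sum(p[k] for p in points_a) / n for k in range(3)]
    cb = [sum(p[k] for p in points_b) / n for k in range(3)]
    cos_sum = sin_sum = 0.0
    for a, b in zip(points_a, points_b):
        ax, az = a[0] - ca[0], a[2] - ca[2]
        bx, bz = b[0] - cb[0], b[2] - cb[2]
        # The objective reduces to cos(yaw) * cos_sum + sin(yaw) * sin_sum.
        cos_sum += ax * bx + az * bz
        sin_sum += bx * az - bz * ax
    yaw = math.atan2(sin_sum, cos_sum)
    rotated_ca = rotate_yaw(ca, yaw)
    t = tuple(cb[k] - rotated_ca[k] for k in range(3))  # t = centroid_B - R * centroid_A
    return yaw, t


if __name__ == "__main__":
    corners_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 2.0), (0.0, 0.5, 2.0)]
    true_yaw, true_t = math.radians(30.0), (0.8, 0.1, -0.5)
    corners_b = [tuple(r + d for r, d in zip(rotate_yaw(p, true_yaw), true_t))
                 for p in corners_a]
    yaw, t = align_yaw_translation(corners_a, corners_b)
    print(math.degrees(yaw), t)
```

On noise-free input like this the known yaw and translation are recovered up to floating-point error; with noisy corners the same formula returns the least-squares yaw.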

RANSAC for outlier rejection: Not all corner matches are correct. RfM uses RANSAC: randomly sample a minimal set of matches, estimate the pose, count how many other matches agree (inliers), and keep the pose with the most inliers. This makes the system robust to matching errors.

Why 4-DoF Kabsch specifically: Standard Kabsch solves for full 6-DoF (3 rotation + 3 translation). RfM constrains this to 4-DoF by fixing the gravity axis: roll = 0, pitch = 0, leaving only yaw + 3D translation. The SVD-based solution is modified to find the optimal yaw angle that minimizes the sum of squared point-to-point distances. With N matched corners (typically 16-48), this is massively over-determined. The RANSAC loop samples a minimal set (just 2 matched corners suffice for 4-DoF), estimates the yaw + translation, then counts inliers within a 3D distance threshold. Typically 100 RANSAC iterations with 2-corner samples suffice for >99% confidence of drawing at least one all-inlier sample, which holds whenever more than roughly a fifth of the corner matches are inliers.
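The iteration count can be sanity-checked with the standard RANSAC formula: if a fraction w of matches are inliers and each sample draws s matches, then k iterations contain at least one all-inlier sample with probability 1 − (1 − w^s)^k. The helper names below are hypothetical, and the inlier ratios are illustrative values rather than figures from the paper.

```python
import math


def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Smallest k such that 1 - (1 - w**s)**k >= confidence."""
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        return 1  # every sample is all-inlier
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))


def min_inlier_ratio(iterations, sample_size, confidence=0.99):
    """Lowest inlier ratio for which the given iteration budget is enough."""
    p_good_sample = 1.0 - (1.0 - confidence) ** (1.0 / iterations)
    return p_good_sample ** (1.0 / sample_size)


if __name__ == "__main__":
    for w in (0.3, 0.5, 0.8):
        print(w, ransac_iterations(w, sample_size=2), ransac_iterations(w, sample_size=5))
    print(min_inlier_ratio(100, sample_size=2))
```

The comparison between `sample_size=2` and `sample_size=5` shows why a 2-corner minimal sample is attractive: the required iteration count grows quickly with the sample size at a fixed inlier ratio.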

Kabsch Alignment — Interactive Demo

Two sets of matched 3D corners (top-down view). Orange = Frame A corners, teal = Frame B corners. Kabsch rotates and translates the orange points onto the teal points; RANSAC filters out outlier matches.

Why does RfM restrict the relative pose to 4 DoF (yaw + translation) instead of full 6 DoF?

Because 6 DoF requires more matched points than are available
Because 4 DoF is computationally cheaper
Because indoor scenes are gravity-aligned, so roll and pitch are known from the IMU, and only yaw and 3D translation remain to be estimated

Chapter 5: Global Pose Estimation

Chapter 4 gave us pairwise relative poses: "Frame 3 is rotated 15 degrees and shifted 0.8m from Frame 7." But we need absolute poses: where is every camera in a single global coordinate system?

The View Graph

RfM builds a view graph where each node is a camera frame and each edge carries the relative pose from Kabsch alignment. Not every pair of frames has an edge — only pairs with sufficient matched objects (typically ≥ 2 matched objects after RANSAC).

The view graph might look like a chain (sequential frames) or have shortcuts (when the camera revisits an area). More edges mean more constraints, leading to a more accurate global solution.

Rotation Averaging

First, solve for all camera rotations globally. Each edge gives a noisy measurement of the relative rotation R_ij ≈ R_j · R_i⁻¹. Rotation averaging finds the set of absolute rotations {R_i} that best satisfies all pairwise constraints simultaneously.
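In the gravity-aligned setting each rotation is a single yaw angle, so the constraint R_ij ≈ R_j · R_i⁻¹ becomes yaw_j − yaw_i ≈ rel_ij. The toy sketch below averages yaws over a small view graph by repeatedly setting each frame's yaw to the circular mean of what its neighbors predict. It is meant only to illustrate the idea; it is not glomap's solver, and the function names and edge format are assumptions.

```python
import math


def average_yaws(num_frames, edges, iters=200):
    """Estimate absolute yaws from relative ones.

    edges: list of (i, j, rel) with rel ~ yaw_j - yaw_i, in radians.
    Frame 0 is held at yaw 0 to fix the global reference.
    """
    yaws = [0.0] * num_frames
    for _ in range(iters):
        for node in range(1, num_frames):
            sin_sum = cos_sum = 0.0
            for i, j, rel in edges:
                if j == node:
                    pred = yaws[i] + rel   # neighbor i predicts this node's yaw
                elif i == node:
                    pred = yaws[j] - rel   # neighbor j predicts this node's yaw
                else:
                    continue
                sin_sum += math.sin(pred)
                cos_sum += math.cos(pred)
            if sin_sum != 0.0 or cos_sum != 0.0:
                # Circular mean avoids wraparound problems near +/-180 degrees.
                yaws[node] = math.atan2(sin_sum, cos_sum)
    return yaws


if __name__ == "__main__":
    deg = math.radians
    # A chain 0-1-2-3 plus two shortcut edges (0-3 and 0-2).
    edges = [(0, 1, deg(10)), (1, 2, deg(10)), (2, 3, deg(10)),
             (0, 3, deg(30)), (0, 2, deg(20))]
    print([round(math.degrees(y), 3) for y in average_yaws(4, edges)])
```

With consistent measurements the iteration settles on the yaws that satisfy every edge; with noisy measurements the shortcut edges pull the estimate away from pure chain accumulation, which is the intuition behind solving globally.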

RfM uses glomap for this step — a state-of-the-art global SfM solver that handles rotation and translation averaging jointly.

Translation Averaging

Given the rotations, each edge also constrains the relative translation. In classical monocular pipelines an edge fixes only the translation direction, so translation averaging recovers the absolute positions {t_i} up to a global scale. Because CuTR predicts metric 3D boxes, each RfM edge carries a translation in meters and the scale is already determined: unlike monocular SfM, there's no scale ambiguity.

No scale ambiguity: Classic monocular SfM can only recover the scene up to an unknown scale factor. RfM avoids this entirely because CuTR predicts boxes in real meters. The matched corners carry metric distances, so the translation averaging produces metric results directly.

Outlier Filtering

Before global averaging, RfM filters the view graph edges. Edges with too few inlier matches, or where the estimated pose is inconsistent with neighboring edges, are removed. This prevents corrupted pairwise estimates from poisoning the global solution.

Why global beats sequential: A sequential approach (estimate pose 1→2, then 2→3, etc.) accumulates drift. By solving all poses simultaneously with rotation averaging, RfM distributes errors evenly across the graph. Loop closures (revisiting the same area) further reduce drift.

Why does RfM have no scale ambiguity, unlike traditional monocular SfM?

Because CuTR predicts 3D boxes in real meters, so the matched corners carry metric distances that anchor the global scale
Because it uses an IMU for scale estimation
Because rotation averaging eliminates scale factors

Chapter 6: Object Tracks & Maps

We now have global camera poses and per-frame 3D detections. The same real-world table might be detected in 30 different frames, producing 30 slightly different 3D boxes. How do we merge these into a single, clean object track?

Union-Find Across Matches

Cubify Match already told us which objects correspond between pairs of frames. RfM collects all these pairwise matches into a global union-find data structure (also called disjoint-set).

If object A₃ in Frame 3 matches object A₇ in Frame 7, and A₇ matches object A₁₂ in Frame 12, then union-find groups all three into the same set. Each set is one object track — all observations of the same physical object.
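A minimal union-find sketch of this grouping is shown below. Observations are keyed by a hypothetical `(frame_index, detection_index)` tuple; that key format and the `build_tracks` helper are assumptions made for illustration.

```python
class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def build_tracks(pairwise_matches):
    """Group observations into tracks from a list of (obs_a, obs_b) matches."""
    uf = UnionFind()
    for a, b in pairwise_matches:
        uf.union(a, b)
    tracks = {}
    for obs in list(uf.parent):
        tracks.setdefault(uf.find(obs), []).append(obs)
    return sorted(sorted(track) for track in tracks.values())


if __name__ == "__main__":
    matches = [
        ((3, 0), (7, 2)),    # an object in frame 3 matches one in frame 7
        ((7, 2), (12, 1)),   # the same frame-7 object matches one in frame 12
        ((3, 1), (7, 0)),    # a different object seen in frames 3 and 7
    ]
    for track in build_tracks(matches):
        print(track)
```

Frames 3 and 12 are never matched directly here, yet their observations end up in one track because both are linked through frame 7.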

Representative Box Selection

Each track contains multiple 3D box observations, all now transformed to the global coordinate frame using the estimated camera poses. But they're noisy — different viewpoints produce slightly different box predictions.

RfM selects a representative box for each track. Rather than averaging all boxes (which can produce invalid boxes if orientations differ), it picks the single observation that best represents the group. The criterion considers:

Detection confidence: higher-confidence detections are preferred
Consistency: the box most consistent with other observations in the track
Visibility: observations from viewpoints where the object is most fully visible

Why not average? Averaging 3D boxes sounds natural but is problematic. If one observation has yaw = 5° and another has yaw = 355°, the naive average gives 180° — completely wrong. And averaging extents across poor viewpoints can shrink or bloat the box. Selecting a single best observation avoids these pitfalls.
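The yaw wraparound problem is easy to reproduce. The sketch below contrasts the naive arithmetic mean with a circular mean computed from unit vectors; the function names are illustrative. A circular mean avoids this particular failure, but it does nothing for the extent problem, which is part of why picking a single representative observation remains attractive.

```python
import math


def naive_mean_deg(angles):
    """Arithmetic mean of angles in degrees (breaks at the 0/360 wraparound)."""
    return sum(angles) / len(angles)


def circular_mean_deg(angles):
    """Mean direction in degrees, in (-180, 180], via averaged unit vectors."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))


if __name__ == "__main__":
    yaws = [5.0, 355.0]
    print(naive_mean_deg(yaws))      # points the box the opposite way
    print(circular_mean_deg(yaws))   # stays near 0 degrees
```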

Why union-find over greedy clustering: Consider frames 1, 5, and 12. Frame 1 matches with Frame 5 (same table detected in both). Frame 5 matches with Frame 12 (same table). But Frames 1 and 12 might have zero overlap — the camera moved far between them, and the table is only partially visible in each. Union-find correctly propagates the transitive relationship: table_1 = table_5 = table_12, forming one track. Spatial clustering in 3D would miss this because the boxes are in different coordinate frames until poses are estimated. Union-find operates on the match graph directly, regardless of spatial relationships.

Object Track Formation

Multiple frames observe the same object. Union-find groups the observations, and a representative box (highlighted in teal) is selected.

Why does RfM use union-find instead of simply clustering nearby boxes?

Because spatial clustering would be too slow
Because pairwise matches form a graph of correspondences, and union-find correctly propagates transitive matches (A matches B, B matches C, so A matches C) that spatial clustering might miss
Because union-find produces smaller data structures

Chapter 7: Bundle Adjustment

The global poses from rotation averaging and the representative boxes from track merging are good — but they're not jointly optimized. Bundle adjustment (BA) is the final refinement step that makes everything consistent.

What Gets Optimized

Classical BA optimizes camera poses and 3D point positions to minimize reprojection error (the distance between where a 3D point should appear in an image and where it was actually observed). RfM adapts this idea to boxes:

Global box parameters: the center, extents, and yaw of each object track's representative box
Camera poses: the global position and orientation of each camera

The optimization minimizes a corner reprojection cost: for each object track, project the global box's 8 corners into every frame where it was observed, and compare against the per-frame detection's corners.

E = ∑_tracks ∑_frames ∑_corners || π(R_j · c_k + t_j) − ĉ_j,k ||²

Where c_k is the k-th global corner, R_j and t_j are the camera pose, π is the projection function, and ĉ_j,k is the observed corner in frame j.
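The cost E can be sketched for a single track with a pinhole projection for π. Everything here is a simplified illustration under stated assumptions: camera rotations are yaw-only, the camera looks along +Z, box extents map to the local X/Y/Z axes, and all function names and numeric values are made up for the example. It evaluates the cost only; a real bundle adjustment would hand these residuals to a nonlinear least-squares solver.

```python
import math


def rotate_yaw(p, yaw):
    x, y, z = p
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)


def box_corners(center, extents, yaw):
    """Global corners c_k of a gravity-aligned box."""
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                local = (sx * extents[0], sy * extents[1], sz * extents[2])
                r = rotate_yaw(local, yaw)
                corners.append(tuple(r[k] + center[k] for k in range(3)))
    return corners


def to_camera(p, cam_yaw, cam_t):
    """R_j * c_k + t_j with a yaw-only R_j."""
    r = rotate_yaw(p, cam_yaw)
    return tuple(r[k] + cam_t[k] for k in range(3))


def project(p, fx, fy, cx, cy):
    """Pinhole projection pi; assumes the point is in front of the camera (z > 0)."""
    return (fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy)


def corner_cost(box, cameras, observed, intrinsics):
    """Sum of squared pixel errors over frames and corners for one track."""
    total = 0.0
    for (cam_yaw, cam_t), obs in zip(cameras, observed):
        for corner, (u_obs, v_obs) in zip(box_corners(*box), obs):
            u, v = project(to_camera(corner, cam_yaw, cam_t), *intrinsics)
            total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total


if __name__ == "__main__":
    intrinsics = (500.0, 500.0, 320.0, 240.0)
    cameras = [(0.0, (0.0, 0.0, 0.0)), (math.radians(20.0), (0.5, 0.0, 0.2))]
    true_box = ((0.0, 0.0, 4.0), (1.5, 0.8, 0.9), math.radians(15.0))
    observed = [[project(to_camera(c, yaw, t), *intrinsics) for c in box_corners(*true_box)]
                for yaw, t in cameras]
    wrong_yaw = (true_box[0], true_box[1], true_box[2] + math.radians(10.0))
    print(corner_cost(true_box, cameras, observed, intrinsics))
    print(corner_cost(wrong_yaw, cameras, observed, intrinsics))
```

The second box shares its center with the first and differs only in yaw, so a center-only residual would not notice the error while the corner cost does.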

Why Corners, Not Centers?

You might wonder why RfM reprojects corners rather than just box centers. The answer is that corners carry orientation information. If a box is rotated 10 degrees incorrectly, the center might still be in roughly the right place, but the corners will be off. Optimizing through corners tightens both position and orientation simultaneously.

The BA effect: Bundle adjustment typically improves AP15 by 2-4 points on CA-1M. It's not a huge leap, but it matters — especially for objects seen from few viewpoints where the initial pose estimate is noisier. BA redistributes error across all observations.

Implementation

RfM uses a standard Levenberg-Marquardt optimizer. The Jacobians are computed analytically for the corner reprojection function, which involves the chain rule through the box parameterization, the rigid body transformation, and the camera projection.

Optional, not mandatory: BA is the last step and is optional. The system already produces good results without it. This is architecturally clean — each pipeline stage produces a valid output, and BA is purely a refinement on top.

Why does RfM's bundle adjustment optimize through box corners rather than box centers?

Because corners carry orientation information: a box with the wrong rotation can have a correct center but incorrect corners, so corner-based optimization tightens both position and orientation
Because centers are harder to compute gradients for
Because corners are more numerically stable

Chapter 8: Results

RfM is evaluated on two benchmarks: CA-1M (large-scale indoor dataset) and ScanNet++ (high-quality indoor scans). Let's look at what the numbers tell us.

3D Detection: Posed Setting (RGB-D)

When camera poses are given (not estimated by RfM), how does the detection pipeline compare?

Method	AP15 ↑	AP25 ↑	AP50 ↑
FCAF	28.7	38.1	45.1
UniDet3D	28.3	36.9	43.6
ImVoxelNet	14.2	22.7	33.5
RfM Posed	47.4	55.8	60.9

+18.7 AP15 over the best baseline. On CA-1M with posed RGB-D, RfM outperforms FCAF (a strong point-cloud-based detector) by a massive margin. The object-centric representation is simply more efficient than processing raw point clouds.

3D Detection: RGB Only (No Depth)

Can it work with just RGB images — no depth sensor at all?

Method	AP15 ↑	AP25 ↑
ImGeoNet	18.6	28.3
ImVoxelNet	14.0	22.3
RfM Posed+BA	31.3	43.8

Localization: Camera Pose Accuracy

The most surprising result: camera poses estimated from object boxes alone are competitive with those from dedicated SLAM systems.

Method	Rotation error ↓	Translation error ↓
DROID-SLAM (RGB-D)	2.6°	4.7 cm
RfM (RGB-D)	1.8°	4.0 cm
RfM (RGB only)	2.5°	12.7 cm

Better than DROID-SLAM. RfM with RGB-D achieves 1.8° rotation and 4.0 cm translation error — beating a state-of-the-art visual SLAM system that uses dense feature matching. This is remarkable: a system using only 3D boxes outperforms one using dense pixel-level correspondences for camera localization.

Key Ablation: More Objects Help

An important finding from the ablations: localization accuracy improves with more objects. More objects = more corners = more constraints for Kabsch alignment. This confirms the thesis that objects are sufficient primitives for localization.

What degrades — the taxonomy ablation: The paper's ablation study reveals exactly when RfM fails: (1) Few objects per frame (<3): with only 1-2 detected objects, there aren't enough corners for robust Kabsch alignment. The localization error jumps from 2.5° to >8° when going from 5 to 2 objects per frame. (2) Small objects only: tiny objects (mugs, books) have poorly estimated extents, making their corners unreliable landmarks. Large furniture (tables, sofas) with well-estimated dimensions gives the best results. (3) Narrow field of view: fewer objects visible per frame means fewer matches between frames. Wide-angle cameras see more objects and produce denser view graphs. (4) Repeated identical objects: four identical chairs around a table confuse the matcher — which chair is which? The learned embeddings help but aren't perfect for truly identical instances.

Concrete numbers: CuTR processes each frame in ~40ms on an A100 GPU. Cubify Match takes ~5ms per pair. Corner matching: ~3ms per pair. Kabsch + RANSAC: <1ms. Rotation averaging (glomap): ~0.5s for 200 frames. Bundle adjustment: ~2s for 200 frames with 50 object tracks. Total pipeline for a 200-frame apartment scan: ~15s end-to-end. Memory: ~4GB VRAM dominated by CuTR. The system was trained and evaluated on 8×A100 GPUs; inference runs on a single GPU.
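As a back-of-envelope check, the quoted per-stage timings can be combined to see how much of the ~15s total is left for pairwise work. The sketch takes the numbers above at face value and treats "<1ms" for Kabsch + RANSAC as a 1 ms upper bound; the resulting pair budget is an inference from that arithmetic, not a figure reported in the source.

```python
FRAMES = 200
DETECT_S = 0.040                      # CuTR per frame
PER_PAIR_S = 0.005 + 0.003 + 0.001    # object match + corner match + Kabsch/RANSAC bound
ROT_AVG_S = 0.5                       # rotation averaging for 200 frames
BA_S = 2.0                            # bundle adjustment for 200 frames
TOTAL_S = 15.0                        # quoted end-to-end time


def fixed_cost(frames=FRAMES):
    """Time that does not depend on how many frame pairs are matched."""
    return frames * DETECT_S + ROT_AVG_S + BA_S


def implied_pair_budget(total=TOTAL_S, frames=FRAMES):
    """Number of frame pairs that fit into the remaining time."""
    return (total - fixed_cost(frames)) / PER_PAIR_S


def all_pairs(frames=FRAMES):
    return frames * (frames - 1) // 2


if __name__ == "__main__":
    print(fixed_cost())
    print(implied_pair_budget())
    print(all_pairs())
```

Under these assumptions the remaining time covers far fewer pairs than the full set of frame pairs, which would be consistent with matching only a subset of pairs (for example nearby frames), though the source does not say how pairs are chosen.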

Results Comparison

AP15 on CA-1M (posed, RGB-D). Higher is better. RfM dominates existing methods.

What is the most surprising result from the RfM experiments?

That RfM works without depth sensors
That RfM achieves better camera localization (1.8°/4.0cm) than DROID-SLAM (2.6°/4.7cm) using only 3D boxes as primitives instead of dense correspondences
That RfM runs faster than other methods

Chapter 9: Connections

RfM sits at a fascinating intersection: it's a SLAM system that uses detection, and a detection system that uses SLAM. Let's map where it connects.

Relation to SLAM

Traditional SLAM (ORB-SLAM, DROID-SLAM) uses point features for both mapping and localization. Object SLAM systems like QuadricSLAM and CubeSLAM explored using objects, but relied on separate detection and odometry pipelines. RfM is the first system where objects are the only primitive for both tasks — no points involved at any stage.

Relation to DETR / CuTR

CuTR (the per-frame detector) is a DETR-style transformer detector adapted for 3D. RfM shows that CuTR's outputs are rich enough to support not just detection but full localization and mapping. The learned embeddings, originally designed for classification, turn out to be excellent features for cross-frame matching.

Relation to Boxer

Where Boxer separates 2D detection from 3D lifting, RfM goes further: it separates per-frame 3D detection from multi-frame localization and mapping. Both papers share the philosophy that good modular decomposition beats end-to-end monoliths.

Relation to SpatialVLM

RfM produces exactly the kind of structured 3D scene representation that spatial reasoning models need. A vision-language model asking "what's to the left of the bookshelf?" could query RfM's object tracks directly, getting metric positions and sizes without any dense reconstruction.

Cheat Sheet

Aspect	RfM
Input	Un-posed RGB (or RGB-D) images
Primitive	Oriented 3D bounding boxes (not points)
Per-frame detector	CuTR (Cubify Transformer)
Object matcher	Cubify Match (LightGlue-style)
Corner matcher	Second LightGlue-style matcher on box clouds
Relative pose	Kabsch-Umeyama (4 DoF)
Global poses	glomap (rotation + translation averaging)
Track formation	Union-find on pairwise matches
Refinement	Corner reprojection bundle adjustment
Key result (detection)	AP15 = 47.4 vs FCAF 28.7 (CA-1M posed)
Key result (localization)	1.8°/4.0cm vs DROID-SLAM 2.6°/4.7cm

The broader lesson: The right representation matters more than the right algorithm. By choosing objects as the fundamental primitive instead of points, RfM simplifies every downstream step — matching becomes about finding the same object (not the same corner pixel), pose estimation gets 8 correspondences per match (not 1), and the final map is immediately useful for spatial reasoning. Sometimes the best optimization is choosing the right abstraction.

What is the fundamental difference between RfM and previous object-SLAM systems like CubeSLAM?

RfM uses a better neural network for detection
RfM processes more frames per second
RfM uses objects as the only primitive for both localization and mapping, with no point features involved at any stage, while previous systems used objects alongside a separate point-based odometry pipeline