Lazarow, Kang & Dehghan — Apple, 2025

Rooms from Motion

Un-posed indoor 3D object detection as localization and mapping — using oriented boxes instead of points.

Prerequisites: 3D bounding boxes + Camera geometry basics + Structure from Motion intuition
10 chapters, 5+ simulations

Chapter 0: The Problem

You walk through your apartment with a phone, snapping photos of every room. You want a 3D map of every object — the couch, the table, the bookshelf — with their real-world positions, sizes, and orientations. No depth sensor. No pre-calibrated camera rig. Just plain RGB images.

The traditional approach would be Structure from Motion (SfM): extract keypoints (corners, edges, blobs), match them across frames, triangulate their 3D positions, and solve for camera poses. Then, on top of that point cloud, try to fit some kind of object representation.

This works, but think about what it actually does. SfM reconstructs thousands of meaningless points on surfaces. It doesn't know that this cluster of points is a "table" and that cluster is a "chair." The objects — the things you actually care about — are an afterthought bolted on at the end.

The mismatch: You want a semantic 3D map of objects. SfM gives you a cloud of anonymous points. The information you need (object identity, size, pose) isn't represented at all in the point cloud. You're solving a harder problem (dense geometry) to get at an easier one (object layout).

What if you skipped the points entirely? What if the fundamental primitive for localization and mapping wasn't a point but an object?

Points vs. Objects

Left: traditional SfM scatters keypoints across surfaces. Right: RfM uses 3D object boxes directly.

Why is using points as the fundamental primitive for indoor 3D mapping wasteful?

Chapter 1: The Key Insight

RfM's insight is deceptively simple: if you can predict metric 3D bounding boxes from single images, you can do everything SfM does — but with objects instead of points.

Think about what SfM needs from its point correspondences. It matches the same point across two frames, then uses those matches to estimate the relative camera pose between the frames. Points are just a convenient vehicle for establishing correspondences and computing geometry.

But 3D bounding boxes can serve the same purpose. A box has 8 corners. If you detect the same object in two frames, you have 8 pairs of corresponding 3D points — the corners of the matched boxes. That is more than enough to estimate the relative pose between the two cameras.

Objects as correspondences: In SfM, you find 100+ matching keypoints to estimate a relative pose. In RfM, matching even 2-3 objects gives you 16-24 corresponding 3D corners — far more than the minimum needed for rigid alignment. And unlike keypoints, these correspondences come with semantic meaning: you know what each matched entity is.
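To make the corner count concrete, here is a minimal sketch of how one oriented box expands into 8 corner points. The parameterization (center, full extents, yaw about a +Y up axis) follows the description in this article; the function name `box_corners` and the corner ordering are assumptions for illustration, not the paper's code.

```python
import math
from itertools import product

def box_corners(center, extents, yaw):
    """Return the 8 corners of a gravity-aligned box (up = +Y, yaw about Y)."""
    cx, cy, cz = center
    ex, ey, ez = (e / 2.0 for e in extents)  # half-extents along local x, y, z
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx, sy, sz in product((-1, 1), repeat=3):
        lx, ly, lz = sx * ex, sy * ey, sz * ez  # corner in the box's local frame
        # Rotate about +Y by yaw, then translate to the box center.
        corners.append((cx + c * lx + s * lz, cy + ly, cz - s * lx + c * lz))
    return corners

# A hypothetical 2 m x 1 m x 4 m box at the origin with zero yaw.
corners = box_corners((0.0, 0.0, 0.0), (2.0, 1.0, 4.0), 0.0)
```

Three matched objects expanded this way give 3 × 8 = 24 candidate 3D correspondences, which is the count quoted above.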

The full RfM pipeline follows from this single insight:

1. Detect
CuTR predicts oriented 3D boxes with learned embeddings from each frame
2. Match Objects
Cubify Match finds which boxes across frames correspond to the same object
3. Match Corners
A second matcher aligns the 8 corners of matched boxes for finer correspondences
4. Relative Pose
Kabsch-Umeyama alignment on matched corners gives the camera motion between frames
5. Global Poses
Rotation and translation averaging over the view graph gives all camera poses
6. Object Tracks
Union-find merges matched objects across all frames into persistent tracks
7. Bundle Adjustment
Optimize global 3D box parameters via corner reprojection
Scaling advantage: The representation complexity scales with the number of objects, not the number of surface points. A room with 15 objects produces 15 boxes (120 corners) regardless of room size. A point cloud of the same room might contain 50,000+ points. The entire object map of a 10-room apartment fits in <50KB (150 objects × ~300 bytes each), while a dense point cloud of the same space would be >500MB.
Why can 3D bounding box corners replace keypoints for relative pose estimation?

Chapter 2: Per-Frame Detection

The foundation of RfM is CuTR (Cubify Transformer) — a single-image 3D object detector that produces oriented bounding boxes in metric coordinates. Every downstream step depends on CuTR giving good boxes, so let's understand what it does and what it outputs.

What CuTR Produces

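For each detected object in a frame, CuTR outputs an oriented 3D box (center, extents, and yaw) together with a learned embedding that the downstream matchers consume.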

CuTR is trained with known camera intrinsics, so the 3D predictions are in metric units — actual meters, not arbitrary scale. This is essential because RfM needs to compute real-world distances between objects.

Frozen vs trained (what's learned where): CuTR (the detector) is trained on CA-1M (a large-scale indoor dataset with 1M+ frames and 3D box annotations). Its backbone is initialized from a pretrained image encoder (DINOv2 or similar) and fine-tuned end-to-end for 3D detection. Cubify Match (the object matcher) is trained from scratch, supervised with ground-truth object correspondences. The corner matcher is a separate trained module. The rest of the pipeline (Kabsch alignment, rotation averaging with glomap, union-find, bundle adjustment) is classical geometry with no learned parameters. So the split is: perception is learned (detection + matching), geometry is classical (alignment + averaging + optimization).

Concrete tensor shapes

For a single frame at 640×480 resolution with ~10 detected objects:

From Pixels to Metric Boxes

The key challenge is going from a flat 2D image to a 3D box with real-world dimensions. CuTR uses the camera's intrinsic parameters (focal length, principal point) to unproject image features into 3D rays, then predicts the depth and extent along those rays. The gravity direction is assumed known (from an IMU or estimated), which reduces rotation to a single yaw angle.
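The geometric half of that step can be sketched with a standard pinhole model. This only illustrates how intrinsics turn a pixel plus a depth into a metric 3D point; how CuTR predicts the depth and extent is learned and not shown. The intrinsics values below are hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at a given depth (meters) through a pinhole camera."""
    x = (u - cx) / fx * depth  # horizontal offset from the optical axis
    y = (v - cy) / fy * depth  # vertical offset from the optical axis
    return (x, y, depth)

# Hypothetical intrinsics for a 640x480 image: focal length 500 px, centered principal point.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

center_point = unproject(320.0, 240.0, 3.0, FX, FY, CX, CY)  # on the optical axis
offset_point = unproject(420.0, 240.0, 3.0, FX, FY, CX, CY)  # 100 px right of center
```

At 3 m depth, 100 px to the right of the principal point corresponds to about 0.6 m of lateral offset with these intrinsics, which is why known intrinsics are what make the predictions metric.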

Why gravity alignment matters: Indoor objects almost always sit on flat surfaces — floors, desks, shelves. By assuming the box's "up" direction is aligned with gravity, CuTR reduces the rotation prediction from 3 angles to just 1 (yaw). This is the same assumption Boxer uses, and it holds well in practice for indoor scenes.
Engineering decision — gravity from where? The gravity direction comes from the device IMU (accelerometer) on phones/tablets, or from a learned gravity estimator for images without inertial data. This is a standard assumption in indoor 3D detection (also used by ARKit, ARCore, and Omni3D). It means the entire pipeline operates in a gravity-aligned coordinate system where "up" is always +Y, simplifying rotation from SO(3) (3 DoF) to SO(2) (1 DoF). The downstream benefit: Kabsch alignment becomes a 4-DoF problem (1 yaw + 3 translation) instead of 6-DoF, making it far more robust with fewer correspondences.
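A small sketch of what the 4-DoF parameterization looks like in practice, assuming the +Y-up convention stated above. The helper names `yaw_matrix` and `apply_pose` are illustrative, not taken from the paper.

```python
import math

def yaw_matrix(yaw):
    """Rotation about the +Y (gravity) axis; the only rotation a 4-DoF pose allows."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def apply_pose(yaw, t, p):
    """Apply a 4-DoF pose (yaw, tx, ty, tz) to a 3D point p."""
    R = yaw_matrix(yaw)
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))

# The up axis is untouched by any yaw, so "up" stays "up" in every frame.
up_after = apply_pose(math.pi / 2, (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
# A horizontal direction does rotate: +X maps to -Z under a 90 degree yaw.
x_after = apply_pose(math.pi / 2, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

One angle and three translation components fully describe the pose, which is where the 4 DoF figure comes from.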
Per-Frame 3D Detection

Left: a camera view with 2D detections. Right: the corresponding 3D boxes in a top-down view.

What does CuTR output for each detected object, beyond the standard bounding box parameters?

Chapter 3: Object Matching

You've detected 3D boxes in Frame A and Frame B. Now the critical question: which box in Frame A is the same object as which box in Frame B? This is the data association problem, and RfM solves it with Cubify Match.

Why Not Just Use 3D IoU?

A naive approach would be to check which boxes overlap in 3D. But here's the problem: the frames are un-posed. You don't know the relative camera pose yet — that's what you're trying to compute. So you can't transform the boxes into a common coordinate frame to compare them.

You need a matcher that works before you know the camera poses. Cubify Match does this by comparing the learned embeddings from CuTR, not the 3D positions.

LightGlue-Style Architecture

Cubify Match adapts LightGlue, a state-of-the-art keypoint matcher, to work on 3D box features instead of 2D keypoint descriptors. The architecture uses alternating self- and cross-attention layers:

  1. Self-attention within each frame: boxes in Frame A attend to each other, learning about the spatial layout of objects in that frame. Same for Frame B.
  2. Cross-attention across frames: boxes in Frame A attend to boxes in Frame B, comparing their embeddings to find correspondences.
  3. Optimal transport: the final assignment uses the Sinkhorn algorithm to produce a soft assignment matrix, where each row sums to ≤ 1 (an object in A matches at most one object in B, or is unmatched).
Why attention helps: Self-attention lets the matcher reason about context. If Frame A has a table with four chairs around it, the matcher knows "this is probably a dining set." When it sees a similar arrangement in Frame B, the cross-attention can match not just individual objects but the whole layout. Context disambiguates objects that look identical in isolation (like four identical chairs).

Handling Unmatched Objects

Not every object in Frame A appears in Frame B. The camera may have moved, revealing new objects and hiding old ones. Cubify Match handles this with a dustbin column in the assignment matrix — objects that don't have a good match in the other frame are assigned to the dustbin, meaning "unmatched."
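The dustbin idea can be sketched with a small Sinkhorn loop. This follows the dustbin formulation popularized by SuperGlue (one extra row and column with a constant score); whether Cubify Match uses exactly these marginals, a learned dustbin score, or log-domain updates is not stated here, so treat every detail below as an assumption. The score matrix is made up for the example.

```python
import numpy as np

def sinkhorn_with_dustbin(scores, dustbin_score=0.0, iters=50):
    """Soft assignment between m boxes in A and n boxes in B, plus a dustbin row/column."""
    m, n = scores.shape
    # Augment with one dustbin row and one dustbin column holding a constant score.
    aug = np.full((m + 1, n + 1), dustbin_score, dtype=float)
    aug[:m, :n] = scores
    K = np.exp(aug)
    # Each real box carries mass 1; each dustbin can absorb every box of the other frame.
    row_mass = np.append(np.ones(m), n)
    col_mass = np.append(np.ones(n), m)
    u, v = np.ones(m + 1), np.ones(n + 1)
    for _ in range(iters):
        u = row_mass / (K @ v)    # rescale rows toward their target mass
        v = col_mass / (K.T @ u)  # rescale columns toward their target mass
    return u[:, None] * K * v[None, :]

# Hypothetical similarity scores: A has 3 boxes, B has 2; A's third box has no counterpart.
scores = np.array([[5.0, 0.0],
                   [0.0, 5.0],
                   [0.0, 0.0]])
P = sinkhorn_with_dustbin(scores)
matches = P[:3].argmax(axis=1)  # index 2 (== n) means "dustbin", i.e. unmatched
```

The first two boxes in A pick their counterparts in B, while the third sends most of its mass to the dustbin column, which is the "unmatched" outcome described above.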

Input features: Each box is represented by its CuTR embedding, its 3D center, extents, and yaw. The 3D parameters are encoded with positional encoding and concatenated with the embedding. This gives the matcher both what the object looks like and where it was predicted to be (in each frame's local coordinate system).
Why exhaustive pairwise matching over minimal sampling: Classical SfM matches only a few hundred keypoints and uses RANSAC with a minimal 5-point solver. RfM takes the opposite approach: it evaluates all possible object pairs between frames (typically 5-15 objects per frame = 25-225 pairs), runs the full Sinkhorn optimal transport, then uses all matched corners for Kabsch. This exhaustive strategy works because: (1) the number of objects is small (unlike thousands of keypoints), (2) Sinkhorn produces globally optimal soft assignments, and (3) using all corners gives a highly over-determined system (8 corners × 3 matched objects = 24 point pairs for a 4-DoF problem that needs only 2). More correspondences = more robust pose.
Why can't RfM simply use 3D IoU to match objects across frames?

Chapter 4: Corner Matching & Relative Pose

Cubify Match told us which objects correspond across frames. But to estimate the relative camera pose, we need something more precise: we need to know which corners of matched boxes correspond. This is where the "box cloud" idea comes in.

Box Clouds

Each oriented 3D box has 8 corners. If Frame A has N detected objects, we get 8N 3D points — the "box cloud" of that frame. Similarly for Frame B. The challenge is that the corner ordering might not be consistent across frames because the boxes are predicted independently with potentially different orientations.

RfM runs a second LightGlue-style matcher on these box clouds. Each corner is represented by a feature that combines: the CuTR embedding of its parent box, the corner's local position within the box (which of the 8 corners it is), and the corner's 3D coordinates.

Kabsch-Umeyama Alignment

Once we have matched corner pairs, we need to find the rigid transformation (rotation + translation) that best maps one set of corners to the other. This is the classic Kabsch-Umeyama algorithm:

  1. Center the points: subtract the centroid from each set
  2. Compute the cross-covariance matrix: $H = P_A^\top P_B$
  3. SVD: decompose $H = U \Sigma V^\top$
  4. Rotation: $R = V \,\mathrm{diag}(1, 1, \det(V U^\top))\, U^\top$
  5. Translation: $t = \mathrm{centroid}_B - R \cdot \mathrm{centroid}_A$
$$R^* = \arg\min_R \sum_i \lVert p_{B,i} - R\, p_{A,i} - t \rVert^2$$
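The five steps translate almost line for line into NumPy. This is a sketch of the unconstrained 6-DoF form on synthetic data: the box corners, the 30° yaw, and the translation are made-up test values, and `kabsch` is an illustrative name.

```python
import itertools
import numpy as np

def kabsch(P_A, P_B):
    """Least-squares rigid transform (R, t) with P_B ~= R @ P_A + t; inputs are (N, 3)."""
    cA, cB = P_A.mean(axis=0), P_B.mean(axis=0)  # step 1: centroids
    H = (P_A - cA).T @ (P_B - cB)                # step 2: 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)                  # step 3: H = U diag(S) Vt
    V = Vt.T
    d = np.sign(np.linalg.det(V @ U.T))          # +1 or -1; guards against reflections
    R = V @ np.diag([1.0, 1.0, d]) @ U.T         # step 4: rotation
    t = cB - R @ cA                              # step 5: translation
    return R, t

# Synthetic check: the 8 corners of a hypothetical 2 x 1 x 4 m box seen from two cameras.
P_A = np.array(list(itertools.product((-1.0, 1.0), (-0.5, 0.5), (-2.0, 2.0))))
yaw = np.deg2rad(30.0)
R_true = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(yaw), 0.0, np.cos(yaw)]])
t_true = np.array([0.8, 0.0, -0.3])
P_B = P_A @ R_true.T + t_true
R_est, t_est = kabsch(P_A, P_B)
```

With noise-free correspondences the recovered `R_est` and `t_est` match the ground truth to machine precision; with noisy corners the same code returns the least-squares fit.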

Because indoor scenes are gravity-aligned, the relative pose is restricted to 4 DoF: one yaw angle plus a 3D translation. This simplification makes the estimation more robust.
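For the yaw-only case there is a simple closed form: after centering, the optimal yaw is an `atan2` of two sums over the horizontal coordinates. The sketch below is one standard way to solve the constrained problem, assuming +Y is up; it is not confirmed to be how RfM implements it, and the test points are invented.

```python
import math

def rotate_y(p, yaw):
    """Rotate a point about the +Y axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * p[0] + s * p[2], p[1], -s * p[0] + c * p[2])

def solve_yaw_translation(pts_a, pts_b):
    """Least-squares (yaw, t) such that pts_b ~= rotate_y(pts_a, yaw) + t."""
    n = len(pts_a)
    ca = [sum(p[k] for p in pts_a) / n for k in range(3)]
    cb = [sum(p[k] for p in pts_b) / n for k in range(3)]
    sin_sum = cos_sum = 0.0
    for a, b in zip(pts_a, pts_b):
        ax, az = a[0] - ca[0], a[2] - ca[2]  # centered horizontal coordinates
        bx, bz = b[0] - cb[0], b[2] - cb[2]
        cos_sum += bx * ax + bz * az
        sin_sum += bx * az - bz * ax
    yaw = math.atan2(sin_sum, cos_sum)
    rotated_ca = rotate_y(ca, yaw)
    t = tuple(cb[k] - rotated_ca[k] for k in range(3))
    return yaw, t

# Two correspondences are the minimal set for 4 DoF (hypothetical values).
pts_a = [(1.0, 0.0, 0.0), (0.0, 1.0, 2.0)]
true_yaw, true_t = 0.5, (0.3, -0.1, 1.2)
pts_b = [tuple(r + t for r, t in zip(rotate_y(p, true_yaw), true_t)) for p in pts_a]
yaw_est, t_est = solve_yaw_translation(pts_a, pts_b)
```

The two points must differ in their horizontal position; two corners stacked vertically leave the yaw undetermined.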

RANSAC for outlier rejection: Not all corner matches are correct. RfM uses RANSAC: randomly sample a minimal set of matches, estimate the pose, count how many other matches agree (inliers), and keep the pose with the most inliers. This makes the system robust to matching errors.
Why 4-DoF Kabsch specifically: Standard Kabsch solves for the full 6 DoF (3 rotation + 3 translation). RfM constrains this to 4 DoF by fixing the gravity axis: roll = 0, pitch = 0, leaving only yaw + 3D translation. The SVD-based solution is modified to find the optimal yaw angle that minimizes the sum of squared point-to-point distances. With N matched corners (typically 16-48), this is heavily over-determined. The RANSAC loop samples a minimal set (just 2 matched corners suffice for 4 DoF), estimates the yaw + translation, then counts inliers within a 3D distance threshold. Typically 100 RANSAC iterations with 2-corner samples suffice for >99% confidence of drawing at least one all-inlier sample, provided at least about a quarter of the corner matches are inliers.
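The iteration count can be checked with the standard RANSAC bound: if a fraction w of matches are inliers and each sample draws s matches, then N = log(1 - p) / log(1 - w^s) iterations give confidence p of hitting at least one all-inlier sample. The inlier ratios below are hypothetical; only the 2-corner sample size, the 99% target, and the 100-iteration budget come from the text.

```python
import math

def ransac_iterations(inlier_ratio, sample_size=2, confidence=0.99):
    """Iterations needed to draw at least one all-inlier sample (0 < inlier_ratio < 1)."""
    p_good_sample = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))

needed_at_50 = ransac_iterations(0.5)  # half the corner matches are correct
needed_at_30 = ransac_iterations(0.3)
needed_at_20 = ransac_iterations(0.2)
```

Because the minimal sample is only 2 corners, the required count stays small even at modest inlier ratios; a 6-DoF solver with a 3-point sample would need noticeably more iterations at the same ratio.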
Kabsch Alignment — Interactive Demo

Two sets of matched 3D corners (top-down view). Orange = Frame A corners, teal = Frame B corners. Kabsch alignment rotates and translates the orange points onto the teal points; RANSAC filters out outlier matches.

Why does RfM restrict the relative pose to 4 DoF (yaw + translation) instead of full 6 DoF?

Chapter 5: Global Pose Estimation

Chapter 4 gave us pairwise relative poses: "Frame 3 is rotated 15 degrees and shifted 0.8m from Frame 7." But we need absolute poses: where is every camera in a single global coordinate system?

The View Graph

RfM builds a view graph where each node is a camera frame and each edge carries the relative pose from Kabsch alignment. Not every pair of frames has an edge — only pairs with sufficient matched objects (typically ≥ 2 matched objects after RANSAC).

The view graph might look like a chain (sequential frames) or have shortcuts (when the camera revisits an area). More edges mean more constraints, leading to a more accurate global solution.

Rotation Averaging

First, solve for all camera rotations globally. Each edge gives a noisy measurement of the relative rotation $R_{ij} \approx R_j R_i^{-1}$. Rotation averaging finds the set of absolute rotations $\{R_i\}$ that best satisfies all pairwise constraints simultaneously.
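In the gravity-aligned case every rotation is a single yaw, so the pairwise constraint reduces to an angle difference. The toy sketch below shows that constraint and the loop residual it implies; it does not implement rotation averaging itself (glomap's solver is far more involved), and the yaw values are invented.

```python
import math

def wrap(angle):
    """Wrap an angle in radians to [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def relative_yaw(yaw_i, yaw_j):
    """Yaw of R_ij = R_j * R_i^{-1} when both rotations are about the gravity axis."""
    return wrap(yaw_j - yaw_i)

def cycle_residual(relative_yaws):
    """Sum of relative yaws around a closed loop; zero when the edges are consistent."""
    return wrap(sum(relative_yaws))

# Hypothetical absolute yaws for four cameras forming a loop 0 -> 1 -> 2 -> 3 -> 0.
absolute = [math.radians(d) for d in (0.0, 40.0, 170.0, -120.0)]
loop = [(0, 1), (1, 2), (2, 3), (3, 0)]
clean_edges = [relative_yaw(absolute[i], absolute[j]) for i, j in loop]
noisy_edges = list(clean_edges)
noisy_edges[1] += math.radians(5.0)  # one edge is off by 5 degrees

clean_residual = cycle_residual(clean_edges)
noisy_residual = math.degrees(cycle_residual(noisy_edges))
```

A nonzero loop residual signals that at least one pairwise estimate on the loop is wrong, and rotation averaging spreads that error over the graph rather than letting it pile up at the end of a chain.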

RfM uses glomap for this step — a state-of-the-art global SfM solver that handles rotation and translation averaging jointly.

Translation Averaging

Given the rotations, each edge also constrains the relative translation direction. Translation averaging recovers the absolute positions $\{t_i\}$. In monocular SfM this is only possible up to a global scale, but because CuTR predicts metric 3D boxes, the scale is already determined and there is no scale ambiguity.

No scale ambiguity: Classic monocular SfM can only recover the scene up to an unknown scale factor. RfM avoids this entirely because CuTR predicts boxes in real meters. The matched corners carry metric distances, so the translation averaging produces metric results directly.

Outlier Filtering

Before global averaging, RfM filters the view graph edges. Edges with too few inlier matches, or where the estimated pose is inconsistent with neighboring edges, are removed. This prevents corrupted pairwise estimates from poisoning the global solution.

Why global beats sequential: A sequential approach (estimate pose 1→2, then 2→3, etc.) accumulates drift. By solving all poses simultaneously with rotation averaging, RfM distributes errors evenly across the graph. Loop closures (revisiting the same area) further reduce drift.
Why does RfM have no scale ambiguity, unlike traditional monocular SfM?

Chapter 6: Object Tracks & Maps

We now have global camera poses and per-frame 3D detections. The same real-world table might be detected in 30 different frames, producing 30 slightly different 3D boxes. How do we merge these into a single, clean object track?

Union-Find Across Matches

Cubify Match already told us which objects correspond between pairs of frames. RfM collects all these pairwise matches into a global union-find data structure (also called disjoint-set).

If object A3 in Frame 3 matches object A7 in Frame 7, and A7 matches object A12 in Frame 12, then union-find groups all three into the same set. Each set is one object track — all observations of the same physical object.
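A minimal union-find sketch over (frame, detection index) pairs shows how the transitive grouping falls out. The class, the key format, and the match list are illustrative assumptions; only the idea of merging pairwise matches into tracks comes from the text.

```python
from collections import defaultdict

class UnionFind:
    """Disjoint-set over hashable keys, with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression: point nodes at the root
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Hypothetical pairwise matches, keyed as (frame, detection index).
pairwise_matches = [
    ((3, 0), (7, 1)),   # table in frame 3 matches table in frame 7
    ((7, 1), (12, 2)),  # table in frame 7 matches table in frame 12
    ((3, 4), (7, 0)),   # a different object, seen in frames 3 and 7
]

uf = UnionFind()
for a, b in pairwise_matches:
    uf.union(a, b)

tracks = defaultdict(list)
for obs in list(uf.parent):
    tracks[uf.find(obs)].append(obs)
```

Frames 3 and 12 were never matched directly, yet their table detections end up in the same track because both are linked through frame 7.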

Representative Box Selection

Each track contains multiple 3D box observations, all now transformed to the global coordinate frame using the estimated camera poses. But they're noisy — different viewpoints produce slightly different box predictions.

RfM selects a representative box for each track. Rather than averaging all boxes (which can produce invalid boxes if orientations differ), it picks the single observation that best represents the group. The criterion considers:

Why not average? Averaging 3D boxes sounds natural but is problematic. If one observation has yaw = 5° and another has yaw = 355°, the naive average gives 180° — completely wrong. And averaging extents across poor viewpoints can shrink or bloat the box. Selecting a single best observation avoids these pitfalls.
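The wrap-around failure is easy to reproduce. The sketch below contrasts the naive mean with a circular mean, which fixes this particular case; it is shown only to illustrate the pitfall. RfM sidesteps the issue by selecting one observation, and a circular mean would still not handle the fact that a box rotated by 180° is the same box.

```python
import math

def naive_mean_deg(angles):
    """Arithmetic mean of angles in degrees; ignores wrap-around."""
    return sum(angles) / len(angles)

def circular_mean_deg(angles):
    """Mean direction of angles in degrees, computed on the unit circle."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0

yaws = [5.0, 355.0]  # two observations that differ by only 10 degrees
naive = naive_mean_deg(yaws)
circular = circular_mean_deg(yaws)
distance_from_zero = min(circular, 360.0 - circular)
```

The naive mean lands at 180°, facing the opposite way, while the circular mean sits at 0° (up to floating-point round-off).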
Why union-find over greedy clustering: Consider frames 1, 5, and 12. Frame 1 matches with Frame 5 (same table detected in both). Frame 5 matches with Frame 12 (same table). But Frames 1 and 12 might have zero overlap — the camera moved far between them, and the table is only partially visible in each. Union-find correctly propagates the transitive relationship: table_1 = table_5 = table_12, forming one track. Spatial clustering in 3D would miss this because the boxes are in different coordinate frames until poses are estimated. Union-find operates on the match graph directly, regardless of spatial relationships.
Object Track Formation

Multiple frames observe the same object. Union-find groups the observations into one track, and a representative box is selected (highlighted in teal).

Why does RfM use union-find instead of simply clustering nearby boxes?

Chapter 7: Bundle Adjustment

The global poses from rotation averaging and the representative boxes from track merging are good — but they're not jointly optimized. Bundle adjustment (BA) is the final refinement step that makes everything consistent.

What Gets Optimized

Classical BA optimizes camera poses and 3D point positions to minimize reprojection error (the distance between where a 3D point should appear in an image and where it was actually observed). RfM adapts this idea to boxes:

The optimization minimizes a corner reprojection cost: for each object track, project the global box's 8 corners into every frame where it was observed, and compare against the per-frame detection's corners.

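$$E = \sum_{\text{tracks}} \sum_{j \in \text{frames}} \sum_{k=1}^{8} \lVert \pi(R_j\, c_k + t_j) - \hat{c}_{j,k} \rVert^2$$

Where $c_k$ is the k-th global corner, $R_j$ and $t_j$ are the camera pose of frame $j$, $\pi$ is the projection function, and $\hat{c}_{j,k}$ is the observed corner in frame $j$.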
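The cost for a single track can be written out directly. This sketch assumes a pinhole projection for π and treats the observed corners as pixel coordinates; both are assumptions, as are the intrinsics and the box. It evaluates the cost only and leaves the optimizer out.

```python
from itertools import product

def project(p, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_cost(global_corners, observations, poses, intrinsics):
    """Sum of squared pixel errors for one track.

    observations: {frame: list of 8 observed pixel corners}
    poses:        {frame: (R, t)} mapping global coordinates into that camera
    """
    cost = 0.0
    for j, observed in observations.items():
        R, t = poses[j]
        for c_k, c_hat in zip(global_corners, observed):
            cam = tuple(sum(R[i][m] * c_k[m] for m in range(3)) + t[i] for i in range(3))
            u, v = project(cam, *intrinsics)
            cost += (u - c_hat[0]) ** 2 + (v - c_hat[1]) ** 2
    return cost

# Hypothetical setup: a 1 m cube 3 to 4 m in front of a camera at the origin.
intrinsics = (500.0, 500.0, 320.0, 240.0)
corners = list(product((-0.5, 0.5), (-0.5, 0.5), (3.0, 4.0)))
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = {0: (identity, (0.0, 0.0, 0.0))}

perfect = {0: [project(c, *intrinsics) for c in corners]}
shifted = {0: [(perfect[0][0][0] + 3.0, perfect[0][0][1])] + perfect[0][1:]}

cost_perfect = reprojection_cost(corners, perfect, poses, intrinsics)
cost_shifted = reprojection_cost(corners, shifted, poses, intrinsics)
```

A 3 px error on a single corner contributes 3² = 9 to the cost; bundle adjustment would adjust the box and pose parameters to drive the total down across all tracks and frames.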

Why Corners, Not Centers?

You might wonder why RfM reprojects corners rather than just box centers. The answer is that corners carry orientation information. If a box is rotated 10 degrees incorrectly, the center might still be in roughly the right place, but the corners will be off. Optimizing through corners tightens both position and orientation simultaneously.

The BA effect: Bundle adjustment typically improves AP15 by 2-4 points on CA-1M. It's not a huge leap, but it matters — especially for objects seen from few viewpoints where the initial pose estimate is noisier. BA redistributes error across all observations.

Implementation

RfM uses a standard Levenberg-Marquardt optimizer. The Jacobians are computed analytically for the corner reprojection function, which involves the chain rule through the box parameterization, the rigid body transformation, and the camera projection.

Optional, not mandatory: BA is the last step and is optional. The system already produces good results without it. This is architecturally clean — each pipeline stage produces a valid output, and BA is purely a refinement on top.
Why does RfM's bundle adjustment optimize through box corners rather than box centers?

Chapter 8: Results

RfM is evaluated on two benchmarks: CA-1M (large-scale indoor dataset) and ScanNet++ (high-quality indoor scans). Let's look at what the numbers tell us.

3D Detection: Posed Setting (RGB-D)

When camera poses are given (not estimated by RfM), how does the detection pipeline compare?

| Method | AP15 ↑ | AP25 ↑ | AP50 ↑ |
| --- | --- | --- | --- |
| FCAF | 28.7 | 38.1 | 45.1 |
| UniDet3D | 28.3 | 36.9 | 43.6 |
| ImVoxelNet | 14.2 | 22.7 | 33.5 |
| RfM Posed | 47.4 | 55.8 | 60.9 |
+18.7 AP15 over the best baseline. On CA-1M with posed RGB-D, RfM outperforms FCAF (a strong point-cloud-based detector) by a massive margin. The object-centric representation is simply more efficient than processing raw point clouds.

3D Detection: RGB Only (No Depth)

Can it work with just RGB images — no depth sensor at all?

| Method | AP15 ↑ | AP25 ↑ |
| --- | --- | --- |
| ImGeoNet | 18.6 | 28.3 |
| ImVoxelNet | 14.0 | 22.3 |
| RfM Posed+BA | 31.3 | 43.8 |

Localization: Camera Pose Accuracy

The most surprising result: using object boxes alone, RfM estimates camera poses that are competitive with dedicated SLAM systems.

| Method | Rotation error ↓ | Translation error ↓ |
| --- | --- | --- |
| DROID-SLAM (RGB-D) | 2.6° | 4.7 cm |
| RfM (RGB-D) | 1.8° | 4.0 cm |
| RfM (RGB only) | 2.5° | 12.7 cm |
Better than DROID-SLAM. RfM with RGB-D achieves 1.8° rotation and 4.0 cm translation error — beating a state-of-the-art visual SLAM system that uses dense feature matching. This is remarkable: a system using only 3D boxes outperforms one using dense pixel-level correspondences for camera localization.

Key Ablation: More Objects Help

An important finding from the ablations: localization accuracy improves with more objects. More objects = more corners = more constraints for Kabsch alignment. This supports the thesis that objects are sufficient primitives for localization.

What degrades (the taxonomy ablation): The paper's ablation study reveals exactly when RfM fails:
  1. Few objects per frame (<3): with only 1-2 detected objects, there aren't enough corners for robust Kabsch alignment. The localization error jumps from 2.5° to >8° when going from 5 to 2 objects per frame.
  2. Small objects only: tiny objects (mugs, books) have poorly estimated extents, making their corners unreliable landmarks. Large furniture (tables, sofas) with well-estimated dimensions gives the best results.
  3. Narrow field of view: fewer objects visible per frame means fewer matches between frames. Wide-angle cameras see more objects and produce denser view graphs.
  4. Repeated identical objects: four identical chairs around a table confuse the matcher (which chair is which?). The learned embeddings help but aren't perfect for truly identical instances.
Concrete numbers: CuTR processes each frame in ~40ms on an A100 GPU. Cubify Match takes ~5ms per pair. Corner matching: ~3ms per pair. Kabsch + RANSAC: <1ms. Rotation averaging (glomap): ~0.5s for 200 frames. Bundle adjustment: ~2s for 200 frames with 50 object tracks. Total pipeline for a 200-frame apartment scan: ~15s end-to-end. Memory: ~4GB VRAM dominated by CuTR. The system was trained and evaluated on 8×A100 GPUs; inference runs on a single GPU.
Results Comparison

AP15 on CA-1M (posed, RGB-D). Higher is better. RfM dominates existing methods.

What is the most surprising result from the RfM experiments?

Chapter 9: Connections

RfM sits at a fascinating intersection: it's a SLAM system that uses detection, and a detection system that uses SLAM. Let's map where it connects.

Relation to SLAM

Traditional SLAM (ORB-SLAM, DROID-SLAM) uses point features for both mapping and localization. Object SLAM systems like QuadricSLAM and CubeSLAM explored using objects, but relied on separate detection and odometry pipelines. RfM is the first system where objects are the only primitive for both tasks — no points involved at any stage.

Relation to DETR / CuTR

CuTR (the per-frame detector) is a DETR-style transformer detector adapted for 3D. RfM shows that CuTR's outputs are rich enough to support not just detection but full localization and mapping. The learned embeddings, originally designed for classification, turn out to be excellent features for cross-frame matching.

Relation to Boxer

Where Boxer separates 2D detection from 3D lifting, RfM goes further: it separates per-frame 3D detection from multi-frame localization and mapping. Both papers share the philosophy that good modular decomposition beats end-to-end monoliths.

Relation to SpatialVLM

RfM produces exactly the kind of structured 3D scene representation that spatial reasoning models need. A vision-language model asking "what's to the left of the bookshelf?" could query RfM's object tracks directly, getting metric positions and sizes without any dense reconstruction.

Cheat Sheet

| Aspect | RfM |
| --- | --- |
| Input | Un-posed RGB (or RGB-D) images |
| Primitive | Oriented 3D bounding boxes (not points) |
| Per-frame detector | CuTR (Cubify Transformer) |
| Object matcher | Cubify Match (LightGlue-style) |
| Corner matcher | Second LightGlue on box clouds |
| Relative pose | Kabsch-Umeyama (4 DoF) |
| Global poses | glomap (rotation + translation averaging) |
| Track formation | Union-find on pairwise matches |
| Refinement | Corner reprojection bundle adjustment |
| Key result (detection) | AP15 = 47.4 vs FCAF 28.7 (CA-1M posed) |
| Key result (localization) | 1.8°/4.0 cm vs DROID-SLAM 2.6°/4.7 cm |
The broader lesson: The right representation matters more than the right algorithm. By choosing objects as the fundamental primitive instead of points, RfM simplifies every downstream step — matching becomes about finding the same object (not the same corner pixel), pose estimation gets 8 correspondences per match (not 1), and the final map is immediately useful for spatial reasoning. Sometimes the best optimization is choosing the right abstraction.
What is the fundamental difference between RfM and previous object-SLAM systems like CubeSLAM?