Un-posed indoor 3D object detection as localization and mapping — using oriented boxes instead of points.
You walk through your apartment with a phone, snapping photos of every room. You want a 3D map of every object — the couch, the table, the bookshelf — with their real-world positions, sizes, and orientations. No depth sensor. No pre-calibrated camera rig. Just plain RGB images.
The traditional approach would be Structure from Motion (SfM): extract keypoints (corners, edges, blobs), match them across frames, triangulate their 3D positions, and solve for camera poses. Then, on top of that point cloud, try to fit some kind of object representation.
This works, but think about what it actually does. SfM reconstructs thousands of meaningless points on surfaces. It doesn't know that this cluster of points is a "table" and that cluster is a "chair." The objects — the things you actually care about — are an afterthought bolted on at the end.
What if you skipped the points entirely? What if the fundamental primitive for localization and mapping wasn't a point but an object?
Left: traditional SfM scatters keypoints across surfaces. Right: RfM uses 3D object boxes directly.
RfM's insight is deceptively simple: if you can predict metric 3D bounding boxes from single images, you can do everything SfM does — but with objects instead of points.
Think about what SfM needs from its point correspondences. It matches the same point across two frames, then uses those matches to estimate the relative camera pose between the frames. Points are just a convenient vehicle for establishing correspondences and computing geometry.
But 3D bounding boxes can serve the same purpose. A box has 8 corners. If you detect the same object in two frames, you have 8 pairs of corresponding 3D points — the corners of the matched boxes. That is more than enough to estimate the relative pose between the two cameras.
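To make the "8 corners per box" idea concrete, here is a minimal sketch of how an oriented box expands into corner points. The parameterization is an assumption made for illustration (z as the up axis, `size` as full extents, `yaw` as a rotation about z); the function name `box_corners` is hypothetical and not taken from RfM.

```python
import numpy as np

def box_corners(center, size, yaw):
    """Return the 8 corners (8x3) of a gravity-aligned oriented box.

    Assumptions for this sketch: z is up, `size` holds the full
    (length, width, height) extents, `yaw` is a rotation about z in radians.
    """
    # Corner signs in a fixed order, so corner k always refers to the
    # same local corner of every box.
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                     dtype=float)
    local = 0.5 * signs * np.asarray(size, dtype=float)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    # Rotate the local corners, then shift them to the box center.
    return local @ R.T + np.asarray(center, dtype=float)

corners = box_corners(center=[1.0, 2.0, 0.4], size=[2.0, 1.0, 0.8], yaw=np.pi / 2)
extent = corners.max(axis=0) - corners.min(axis=0)
```

With a 90 degree yaw the box's length ends up along the y axis, so `extent` comes out as roughly (1.0, 2.0, 0.8). Two matched boxes expanded this way give eight 3D point pairs.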
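The full RfM pipeline follows from this single insight: detect metric 3D boxes in every frame with CuTR, match objects across frames with Cubify Match, match the box corners and align them with Kabsch-Umeyama to get relative poses, combine those into global camera poses, merge per-frame detections into object tracks, and refine everything with bundle adjustment.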
The foundation of RfM is CuTR (Cubify Transformer) — a single-image 3D object detector that produces oriented bounding boxes in metric coordinates. Every downstream step depends on CuTR giving good boxes, so let's understand what it does and what it outputs.
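For each detected object in a frame, CuTR outputs an oriented 3D bounding box in metric units, expressed in that frame's own camera coordinates, along with a learned embedding that the later matching stages reuse.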
CuTR is trained with known camera intrinsics, so the 3D predictions are in metric units — actual meters, not arbitrary scale. This is essential because RfM needs to compute real-world distances between objects.
For a single frame at 640×480 resolution with ~10 detected objects:
The key challenge is going from a flat 2D image to a 3D box with real-world dimensions. CuTR uses the camera's intrinsic parameters (focal length, principal point) to unproject image features into 3D rays, then predicts the depth and extent along those rays. The gravity direction is assumed known (from an IMU or estimated), which reduces rotation to a single yaw angle.
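The unprojection step can be sketched with a plain pinhole model. This only illustrates how intrinsics turn a pixel into a ray and a depth into a metric point; it is not CuTR's network, and the intrinsics values and function names below are hypothetical.

```python
import numpy as np

def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Unit-length viewing ray through pixel (u, v) for a pinhole camera."""
    d = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    return d / np.linalg.norm(d)

def point_at_depth(u, v, depth, fx, fy, cx, cy):
    """3D point on that ray whose z-coordinate equals `depth` (in meters)."""
    return depth * np.array([(u - cx) / fx, (v - cy) / fy, 1.0])

# Hypothetical intrinsics for a 640x480 image.
fx = fy = 500.0
cx, cy = 320.0, 240.0

ray = pixel_to_ray(320.0, 240.0, fx, fy, cx, cy)         # through the principal point
point = point_at_depth(420.0, 240.0, 2.0, fx, fy, cx, cy)
```

A pixel 100 px to the right of the principal point at 2 m depth lands 0.4 m to the right of the optical axis, which is why known intrinsics are what make the predictions metric.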
Left: a camera view with 2D detections. Right: the corresponding 3D boxes in a top-down view.
You've detected 3D boxes in Frame A and Frame B. Now the critical question: which box in Frame A is the same object as which box in Frame B? This is the data association problem, and RfM solves it with Cubify Match.
A naive approach would be to check which boxes overlap in 3D. But here's the problem: the frames are un-posed. You don't know the relative camera pose yet — that's what you're trying to compute. So you can't transform the boxes into a common coordinate frame to compare them.
You need a matcher that works before you know the camera poses. Cubify Match does this by comparing the learned embeddings from CuTR, not the 3D positions.
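Cubify Match adapts LightGlue, a state-of-the-art keypoint matcher, to work on 3D box features instead of 2D keypoint descriptors. The architecture alternates self-attention layers, in which the boxes of one frame exchange context with each other, and cross-attention layers, in which each box attends to the boxes of the other frame.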
Not every object in Frame A appears in Frame B. The camera may have moved, revealing new objects and hiding old ones. Cubify Match handles this with a dustbin column in the assignment matrix — objects that don't have a good match in the other frame are assigned to the dustbin, meaning "unmatched."
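The dustbin idea can be illustrated on a small similarity matrix. This is a simplified sketch: the real matcher learns its scores, and how it normalizes them is not described here, so the greedy mutual-argmax rule, the constant `dustbin_score`, and the function name are all assumptions.

```python
import numpy as np

def match_with_dustbin(scores, dustbin_score=0.0):
    """Mutual-best matching on an (N, M) similarity matrix with a dustbin.

    Returns a list of (i, j) pairs. Objects whose best option is the
    dustbin, or whose choice is not mutual, stay unmatched.
    """
    n, m = scores.shape
    # Add one dustbin column and one dustbin row with a constant score.
    aug = np.full((n + 1, m + 1), dustbin_score, dtype=float)
    aug[:n, :m] = scores
    best_col = aug[:n].argmax(axis=1)     # per A object; index m means dustbin
    best_row = aug[:, :m].argmax(axis=0)  # per B object; index n means dustbin
    return [(i, int(j)) for i, j in enumerate(best_col)
            if j < m and best_row[j] == i]

# Two objects in Frame A, three in Frame B (made-up scores).
scores = np.array([[2.0, -1.0, -1.0],
                   [-1.0, -0.5, -2.0]])
matches = match_with_dustbin(scores)
```

Here only the first object of Frame A finds a partner. The second one scores below the dustbin against every candidate, so it is left unmatched.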
Cubify Match told us which objects correspond across frames. But to estimate the relative camera pose, we need something more precise: we need to know which corners of matched boxes correspond. This is where the "box cloud" idea comes in.
Each oriented 3D box has 8 corners. If Frame A has N detected objects, we get 8N 3D points — the "box cloud" of that frame. Similarly for Frame B. The challenge is that the corner ordering might not be consistent across frames because the boxes are predicted independently with potentially different orientations.
RfM runs a second LightGlue-style matcher on these box clouds. Each corner is represented by a feature that combines: the CuTR embedding of its parent box, the corner's local position within the box (which of the 8 corners it is), and the corner's 3D coordinates.
Once we have matched corner pairs, we need to find the rigid transformation (rotation + translation) that best maps one set of corners to the other. This is the classic Kabsch-Umeyama algorithm: center both point sets on their centroids, compute the cross-covariance matrix, take its SVD to obtain the rotation, and recover the translation from the centroids.
Because indoor scenes are gravity-aligned, the transformation is restricted to 4 DoF: one yaw angle plus a 3D translation. This simplification makes the estimation more robust.
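For the 4 DoF case, the least-squares alignment has a closed form, so the SVD of full Kabsch is not needed. The sketch below assumes z is the gravity axis and that all correspondences are inliers (no RANSAC loop); the function names are hypothetical.

```python
import numpy as np

def yaw_rotation(theta):
    """Rotation matrix for a yaw of `theta` radians about the z (up) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def align_yaw_translation(A, B):
    """Least-squares yaw and translation such that B ≈ A @ R.T + t.

    A and B are (N, 3) arrays of matched corners.
    """
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    a, b = A - mu_a, B - mu_b
    # Only the horizontal (x, y) components constrain the yaw angle.
    sin_term = np.sum(a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0])
    cos_term = np.sum(a[:, 0] * b[:, 0] + a[:, 1] * b[:, 1])
    theta = np.arctan2(sin_term, cos_term)
    R = yaw_rotation(theta)
    t = mu_b - R @ mu_a
    return theta, R, t

# Synthetic check: 2 matched boxes give 16 corners.
rng = np.random.default_rng(0)
A = rng.normal(size=(16, 3))
true_t = np.array([0.8, -0.2, 0.1])
B = A @ yaw_rotation(0.3).T + true_t
theta, R, t = align_yaw_translation(A, B)
```

On noise-free data the recovered `theta` and `t` match the values used to generate `B`. With real detections this solve would sit inside a RANSAC loop that discards bad corner matches first.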
Two sets of matched 3D corners (top-down view). Orange = Frame A corners, teal = Frame B corners. Kabsch alignment rotates and translates the orange points onto the teal points, and RANSAC filters out outlier matches.
Chapter 4 gave us pairwise relative poses: "Frame 3 is rotated 15 degrees and shifted 0.8m from Frame 7." But we need absolute poses: where is every camera in a single global coordinate system?
RfM builds a view graph where each node is a camera frame and each edge carries the relative pose from Kabsch alignment. Not every pair of frames has an edge — only pairs with sufficient matched objects (typically ≥ 2 matched objects after RANSAC).
The view graph might look like a chain (sequential frames) or have shortcuts (when the camera revisits an area). More edges mean more constraints, leading to a more accurate global solution.
First, solve for all camera rotations globally. Each edge gives a noisy measurement of the relative rotation R_ij ≈ R_j · R_i⁻¹. Rotation averaging finds the set of absolute rotations {R_i} that best satisfies all pairwise constraints simultaneously.
RfM uses glomap for this step, a global SfM solver that covers both the rotation averaging described here and the translation step that follows.
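To see how pairwise rotations turn into absolute ones, consider the yaw-only case, where the relation above becomes yaw_ij ≈ yaw_j - yaw_i. The sketch below only chains edges outward from a root frame by breadth-first search. That is a common way to initialize rotation averaging, not the robust averaging a solver like glomap performs, and the function name and edge format are assumptions.

```python
from collections import deque
import math

def propagate_yaw(num_frames, edges, root=0):
    """Chain relative yaws into absolute yaws over the view graph.

    edges: dict {(i, j): yaw_ij} with yaw_ij ≈ yaw_j - yaw_i (radians).
    Returns one absolute yaw per frame, or None if unreachable from root.
    """
    neighbors = {k: [] for k in range(num_frames)}
    for (i, j), rel in edges.items():
        neighbors[i].append((j, rel))
        neighbors[j].append((i, -rel))   # traversing the edge backwards
    yaw = [None] * num_frames
    yaw[root] = 0.0                      # the root frame defines the global frame
    queue = deque([root])
    while queue:
        i = queue.popleft()
        for j, rel in neighbors[i]:
            if yaw[j] is None:
                total = yaw[i] + rel
                # Wrap the angle back into (-pi, pi].
                yaw[j] = math.atan2(math.sin(total), math.cos(total))
                queue.append(j)
    return yaw

edges = {(0, 1): 0.2, (1, 2): 0.3, (3, 2): -0.1}
absolute = propagate_yaw(4, edges)
```

Each frame here is reached along a single path, so one bad edge would corrupt everything downstream of it. Averaging over all edges at once is what removes that weakness.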
Given the rotations, each edge also constrains the relative translation between its two cameras, and translation averaging recovers the absolute positions {t_i}. In monocular SfM this is only possible up to a global scale. Because CuTR predicts metric 3D boxes, the pairwise translations already carry real-world length, so RfM has no scale ambiguity.
Before global averaging, RfM filters the view graph edges. Edges with too few inlier matches, or where the estimated pose is inconsistent with neighboring edges, are removed. This prevents corrupted pairwise estimates from poisoning the global solution.
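One common way to test whether an edge agrees with its neighbors is a loop check over triplets of frames: going i→j→k should give the same rotation as going i→k directly. The sketch below applies that check to yaw angles. The criterion, the threshold, and the function names are illustrative assumptions, since RfM's exact filtering rule is not given here.

```python
from itertools import combinations
import math

def wrap_angle(a):
    """Wrap an angle in radians into (-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))

def inconsistent_triplets(edges, threshold):
    """Find frame triplets whose relative yaws do not close the loop.

    edges: dict {(i, j): yaw_ij} with i < j and yaw_ij ≈ yaw_j - yaw_i.
    Returns a list of (i, j, k, error) with error above `threshold`.
    """
    frames = sorted({f for edge in edges for f in edge})
    bad = []
    for i, j, k in combinations(frames, 3):
        if (i, j) in edges and (j, k) in edges and (i, k) in edges:
            error = abs(wrap_angle(edges[(i, j)] + edges[(j, k)] - edges[(i, k)]))
            if error > threshold:
                bad.append((i, j, k, error))
    return bad

edges = {(0, 1): 0.2, (1, 2): 0.3, (0, 2): 0.5,   # consistent loop
         (2, 3): 0.1, (0, 3): 1.5}                # (0, 3) disagrees with 0->2->3
bad = inconsistent_triplets(edges, threshold=0.1)
```

The check flags the triplet rather than the faulty edge. An edge that appears in several bad triplets and no good ones is the natural candidate for removal.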
We now have global camera poses and per-frame 3D detections. The same real-world table might be detected in 30 different frames, producing 30 slightly different 3D boxes. How do we merge these into a single, clean object track?
Cubify Match already told us which objects correspond between pairs of frames. RfM collects all these pairwise matches into a global union-find data structure (also called disjoint-set).
If object A3 in Frame 3 matches object A7 in Frame 7, and A7 matches object A12 in Frame 12, then union-find groups all three into the same set. Each set is one object track — all observations of the same physical object.
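The grouping described above can be sketched with a small union-find. Observations are identified here by hypothetical `(frame, object_index)` tuples, and the class and function names are illustrative rather than RfM's own.

```python
class UnionFind:
    """Disjoint-set structure over hashable items, with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def build_tracks(pairwise_matches):
    """Group observations linked by pairwise matches into object tracks."""
    uf = UnionFind()
    for a, b in pairwise_matches:
        uf.union(a, b)
    tracks = {}
    for obs in list(uf.parent):
        tracks.setdefault(uf.find(obs), []).append(obs)
    return sorted(sorted(track) for track in tracks.values())

# (frame, object_index) pairs; frames 3, 7 and 12 as in the text.
matches = [((3, 0), (7, 1)), ((7, 1), (12, 2)), ((3, 1), (7, 0))]
tracks = build_tracks(matches)
```

The first two matches chain into one three-observation track even though frames 3 and 12 were never matched directly. The same transitivity means a single wrong match can fuse two distinct objects, so upstream match quality matters.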
Each track contains multiple 3D box observations, all now transformed to the global coordinate frame using the estimated camera poses. But they're noisy — different viewpoints produce slightly different box predictions.
RfM selects a representative box for each track. Rather than averaging all boxes (which can produce invalid boxes if orientations differ), it picks the single observation that best represents the group. The criterion considers:
Multiple frames observe the same object. Union-find groups the observations into a track, and a representative box is selected for it (highlighted in teal).
The global poses from rotation averaging and the representative boxes from track merging are good — but they're not jointly optimized. Bundle adjustment (BA) is the final refinement step that makes everything consistent.
Classical BA optimizes camera poses and 3D point positions to minimize reprojection error (the distance between where a 3D point should appear in an image and where it was actually observed). RfM adapts this idea to boxes:
The optimization minimizes a corner reprojection cost: for each object track, project the global box's 8 corners into every frame where it was observed, and compare against the per-frame detection's corners.
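Written out for a single object track, and assuming a plain sum of squared errors (whether a robust loss or per-corner weighting is used is not stated here), the cost takes the following form. The set 𝓕 of frames that observe the track is notation introduced for this sketch, and (R_j, t_j) is assumed to map global coordinates into camera j.

```latex
E = \sum_{j \in \mathcal{F}} \sum_{k=1}^{8}
    \left\| \pi\!\left( R_j\, c_k + t_j \right) - \hat{c}_{j,k} \right\|^2
```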
Where c_k is the k-th global corner, R_j and t_j are the camera pose, π is the projection function, and ĉ_{j,k} is the observed corner in frame j.
You might wonder why RfM reprojects corners rather than just box centers. The answer is that corners carry orientation information. If a box is rotated 10 degrees incorrectly, the center might still be in roughly the right place, but the corners will be off. Optimizing through corners tightens both position and orientation simultaneously.
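A small numerical experiment shows the difference. The sketch below builds the same box twice in camera coordinates, once with the correct yaw and once rotated by 10 degrees, then compares reprojection error at the center and at the corners. The pinhole intrinsics, the axis convention (x right, y down, z forward, yaw about y), and the function names are assumptions made for illustration.

```python
import numpy as np

SIGNS = np.array([[sx, sy, sz]
                  for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                 dtype=float)

def box_corners_cam(center, size, yaw):
    """8 box corners in camera coordinates; yaw rotates about the y axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return (0.5 * SIGNS * np.asarray(size, dtype=float)) @ R.T + np.asarray(center, dtype=float)

def project(points, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    points = np.atleast_2d(points)
    return np.stack([fx * points[:, 0] / points[:, 2] + cx,
                     fy * points[:, 1] / points[:, 2] + cy], axis=1)

center, size = np.array([0.0, 0.0, 3.0]), np.array([1.0, 0.8, 0.6])
observed = box_corners_cam(center, size, yaw=0.0)
estimate = box_corners_cam(center, size, yaw=np.deg2rad(10.0))

# Error in pixels at the box center versus averaged over the 8 corners.
center_error = np.linalg.norm(project(estimate.mean(axis=0)) - project(observed.mean(axis=0)))
corner_error = np.linalg.norm(project(estimate) - project(observed), axis=1).mean()
```

The center error is zero up to floating-point rounding while the mean corner error is several pixels, so only the corner-based cost gives the optimizer a gradient that pulls the orientation back.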
RfM uses a standard Levenberg-Marquardt optimizer. The Jacobians are computed analytically for the corner reprojection function, which involves the chain rule through the box parameterization, the rigid body transformation, and the camera projection.
RfM is evaluated on two benchmarks: CA-1M (large-scale indoor dataset) and ScanNet++ (high-quality indoor scans). Let's look at what the numbers tell us.
When camera poses are given (not estimated by RfM), how does the detection pipeline compare?
| Method | AP15 ↑ | AP25 ↑ | AP50 ↑ |
|---|---|---|---|
| FCAF | 28.7 | 38.1 | 45.1 |
| UniDet3D | 28.3 | 36.9 | 43.6 |
| ImVoxelNet | 14.2 | 22.7 | 33.5 |
| RfM Posed | 47.4 | 55.8 | 60.9 |
Can it work with just RGB images — no depth sensor at all?
| Method | AP15 ↑ | AP25 ↑ |
|---|---|---|
| ImGeoNet | 18.6 | 28.3 |
| ImVoxelNet | 14.0 | 22.3 |
| RfM Posed+BA | 31.3 | 43.8 |
The most surprising result: camera poses estimated from object boxes alone are competitive with dedicated SLAM systems.
| Method | Rotation error ↓ | Translation error ↓ |
|---|---|---|
| DROID-SLAM (RGB-D) | 2.6° | 4.7 cm |
| RfM (RGB-D) | 1.8° | 4.0 cm |
| RfM (RGB only) | 2.5° | 12.7 cm |
An important finding from the ablations: localization accuracy improves with more objects. More objects = more corners = more constraints for Kabsch alignment. This supports the thesis that objects are sufficient primitives for localization.
AP15 on CA-1M (posed, RGB-D). Higher is better. RfM Posed reaches 47.4, ahead of FCAF (28.7), UniDet3D (28.3), and ImVoxelNet (14.2).
RfM sits at a fascinating intersection: it's a SLAM system that uses detection, and a detection system that uses SLAM. Let's map where it connects.
Traditional SLAM (ORB-SLAM, DROID-SLAM) uses point features for both mapping and localization. Object SLAM systems like QuadricSLAM and CubeSLAM explored using objects, but relied on separate detection and odometry pipelines. RfM is the first system where objects are the only primitive for both tasks — no points involved at any stage.
CuTR (the per-frame detector) is a DETR-style transformer detector adapted for 3D. RfM shows that CuTR's outputs are rich enough to support not just detection but full localization and mapping. The learned embeddings, originally designed for classification, turn out to be excellent features for cross-frame matching.
Where Boxer separates 2D detection from 3D lifting, RfM goes further: it separates per-frame 3D detection from multi-frame localization and mapping. Both papers share the philosophy that good modular decomposition beats end-to-end monoliths.
RfM produces exactly the kind of structured 3D scene representation that spatial reasoning models need. A vision-language model asking "what's to the left of the bookshelf?" could query RfM's object tracks directly, getting metric positions and sizes without any dense reconstruction.
| Aspect | RfM |
|---|---|
| Input | Un-posed RGB (or RGB-D) images |
| Primitive | Oriented 3D bounding boxes (not points) |
| Per-frame detector | CuTR (Cubify Transformer) |
| Object matcher | Cubify Match (LightGlue-style) |
| Corner matcher | Second LightGlue-style matcher on box clouds |
| Relative pose | Kabsch-Umeyama (4 DoF) |
| Global poses | glomap (rotation + translation averaging) |
| Track formation | Union-find on pairwise matches |
| Refinement | Corner reprojection bundle adjustment |
| Key result (detection) | AP15 = 47.4 vs FCAF 28.7 (CA-1M posed) |
| Key result (localization) | 1.8° / 4.0 cm vs DROID-SLAM 2.6° / 4.7 cm (RGB-D) |