Scaling indoor 3D object detection with exhaustive annotations on survey-grade scans and a fully transformer-based detector that needs no point clouds, no voxels, no sparse convolutions.
You walk into a room with an iPad and scan it. Your app detects a couch, a table, maybe a lamp. But here's the thing — it missed the stack of books, the remote control, the power strip under the desk, the three throw pillows, and about a hundred other objects. Why?
Because existing indoor 3D detection datasets are sparse. ARKitScenes, the largest available benchmark, labels only 21 handpicked categories and averages about 11 labeled objects per video. A real room has ten times that many objects.
This creates a vicious cycle. Models trained on sparse annotations learn to detect only common furniture. They never see small objects, unusual items, or cluttered surfaces during training. So they never learn to detect them.
There's a second problem lurking underneath. Most 3D detectors depend on specialized operations: sparse 3D convolutions, voxelization, custom CUDA kernels. These are fast on NVIDIA GPUs with the right libraries installed, but they don't run on Apple Silicon, mobile devices, or standard inference runtimes such as ONNX Runtime. That makes deployment outside the NVIDIA ecosystem painful.
Cubify Anything attacks both problems simultaneously: a dataset that labels everything, and a model that uses only standard transformer operations.
Here is the core idea behind Cubify Anything, and it's deceptively simple: annotate on the best geometry, detect from images.
ARKitScenes captures rooms with an iPad Pro, which produces a noisy mesh via ARKit's real-time SLAM. Annotators then try to draw 3D boxes on this noisy mesh. The mesh has holes, drift, and misalignment — so the annotations inherit that noise.
But the same rooms were also scanned with a FARO Focus laser scanner. FARO scans are survey-grade: millimeter accuracy, tens of millions of points, no drift. The raw data for perfect 3D annotations already exists. Nobody was using it for labeling.
This insight also explains why the paper makes two contributions. The authors couldn't just release a dataset, because existing models (point-based 3D detectors) can't fully exploit image-only training at this scale. And they couldn't just propose a model, because no dataset had the annotation quality to validate it. The dataset and the model are co-designed.
CA-1M (Cubify Anything — 1 Million) is the paper's first major contribution. Let's look at the numbers. Across 1,000+ indoor scenes, annotators placed 440,000 unique 3D bounding boxes. That's roughly 125 objects per video — more than ten times ARKitScenes' 11 per video.
But raw numbers don't tell the whole story. What makes CA-1M special is four key properties, and no prior dataset satisfies all of them simultaneously.
The class-agnostic approach is a deliberate design choice. Instead of defining a taxonomy of 20-50 categories and only labeling objects that fit, CA-1M labels everything and assigns category labels afterward. This means the dataset captures the true distribution of objects in indoor spaces — including the long tail of unusual items that category-specific datasets miss.
The per-frame rendering pipeline produces 15 million+ training frames from 3,500 iPad captures. Each frame comes with both 2D bounding boxes (projected from 3D) and full 3D annotations in the camera's coordinate system. This is far more training signal than any prior indoor dataset.
Labeling 440,000 objects in 3D sounds like an impossible task. How do you even look at a point cloud with millions of points and efficiently place tight-fitting boxes around every object? The answer: a carefully designed annotation pipeline with model-in-the-loop acceleration.
Each room has a FARO Focus laser scan — a dense point cloud with tens of millions of points at millimeter precision. Annotators work directly in a 3D viewer, rotating and zooming through the point cloud. Because the geometry is crisp and unambiguous, they can see exactly where each object begins and ends.
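Annotating from scratch for every scene would be painfully slow. Instead, the pipeline uses a bootstrapping approach: a detection model proposes boxes for each scene, and annotators confirm, adjust, delete, or add to those proposals.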
This iterative loop dramatically speeds up annotation. Instead of placing 125 boxes from nothing, annotators might confirm 80 proposals, tweak 30, delete 5 bad ones, and add 10 the model missed.
Some regions of the point cloud are ambiguous even at FARO quality — a dark corner might have too few scan points to tell whether that blob is a shoe or a crumpled towel. When annotators hit these cases, they can pull up the corresponding iPad video frames to see the actual RGB appearance. The calibrated camera poses let them cross-reference 3D points with 2D images.
Every annotated scene goes through a verification pass. Reviewers check for:
The combination of FARO-quality geometry, model-in-the-loop proposals, and multi-view image references makes the pipeline surprisingly efficient. An experienced annotator can process a moderately cluttered room (80-150 objects) in a single session. Without the model proposals, the same room would take several times longer.
The key insight is that verification is faster than creation. Looking at a proposed box and saying "yes, that's right" or "nudge it 2 cm left" is cognitively simpler than finding the object, deciding its bounds, and placing a box from scratch. The model handles the pattern matching; the human handles the judgment calls.
You've annotated 3D boxes on the FARO scan. Now you need labels for each of the millions of iPad video frames. How do you get from world-space 3D boxes to per-frame 2D+3D annotations? Through a rendering pipeline with three stages.
Each iPad frame has a known camera pose (position + orientation) from ARKit's visual-inertial odometry. For every 3D bounding box in the scene, we transform its 8 corners from world coordinates into the camera's coordinate system using the standard extrinsic matrix:
$$\mathbf{p}_{\text{cam}} = R\,\mathbf{p}_{\text{world}} + \mathbf{t}$$
where R is the 3×3 rotation matrix and t is the translation vector of the camera pose, expressed here as the world-to-camera transform.
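The corner transform can be sketched in a few lines of plain Python. This is a minimal illustration, assuming R and t already describe the world-to-camera mapping; the helper names `box_corners` and `world_to_camera` and the example pose are hypothetical, not taken from the paper's code.

```python
import math

def box_corners(center, dims):
    """Return the 8 corners of an axis-aligned box from its center and (w, h, d)."""
    cx, cy, cz = center
    w, h, d = dims
    return [
        (cx + sx * w / 2, cy + sy * h / 2, cz + sz * d / 2)
        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)
    ]

def world_to_camera(point, R, t):
    """Apply p_cam = R @ p_world + t, with R given as a 3x3 nested list."""
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + t[i]
        for i in range(3)
    )

# Hypothetical pose: a 90-degree rotation about the y axis plus a 2 m shift along z.
theta = math.pi / 2
R = [
    [math.cos(theta), 0.0, math.sin(theta)],
    [0.0, 1.0, 0.0],
    [-math.sin(theta), 0.0, math.cos(theta)],
]
t = (0.0, 0.0, 2.0)

corners_world = box_corners(center=(1.0, 0.0, 0.0), dims=(0.4, 0.4, 0.4))
corners_cam = [world_to_camera(p, R, t) for p in corners_world]
center_cam = world_to_camera((1.0, 0.0, 0.0), R, t)
```

If the stored pose is camera-to-world instead, it has to be inverted before this step.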
Not every object is visible in every frame. Frustum culling removes boxes that fall entirely outside the camera's field of view. If all 8 corners of a box project outside the image boundaries, or if the box center is behind the camera (negative depth), the box is culled for this frame.
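The two culling rules can be written down directly. The sketch below assumes a pinhole camera looking along +z; the intrinsics (`fx`, `fy`, `cx`, `cy`), the image size, and the helper names are illustrative assumptions rather than values from the paper.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point; None if it is not in front."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def is_culled(corners_cam, width, height, fx, fy, cx, cy):
    """Apply the two culling rules from the text to one box."""
    center_z = sum(p[2] for p in corners_cam) / len(corners_cam)
    if center_z <= 0:  # box center is behind the camera
        return True
    for p in corners_cam:
        uv = project(p, fx, fy, cx, cy)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            return False  # at least one corner lands inside the image
    return True  # every corner falls outside the image

def cube(center, half):
    """Hypothetical helper: corners of a cube with the given half-size."""
    cx, cy, cz = center
    return [(cx + sx * half, cy + sy * half, cz + sz * half)
            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]

# Hypothetical 640x480 camera with a 500-pixel focal length.
cam = dict(width=640, height=480, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
in_front = is_culled(cube((0.0, 0.0, 2.0), 0.2), **cam)      # False: visible
off_to_side = is_culled(cube((5.0, 0.0, 2.0), 0.2), **cam)   # True: outside image
behind = is_culled(cube((0.0, 0.0, -2.0), 0.2), **cam)       # True: behind camera
```

A corner-only test like this can wrongly cull a very large box whose corners all fall outside the image while its faces still cross it; a production pipeline would likely need a stricter intersection test.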
Even after frustum culling, some objects might be hidden behind other objects. The pipeline renders a depth buffer from the FARO point cloud for each camera viewpoint. Then for each surviving 3D box, it checks: does the depth at the box's projected center match the box's actual depth, or is something in front? If the object is heavily occluded (below a visibility threshold), it's filtered out.
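The center-depth comparison can be sketched as follows. This is a simplified illustration only: the function name `is_occluded`, the `tolerance` parameter, and the toy depth buffer are assumptions, and the visibility threshold mentioned above is not modeled.

```python
def is_occluded(depth_buffer, u, v, box_depth, tolerance):
    """Return True if the rendered surface at pixel (u, v) is in front of the box."""
    rendered = depth_buffer[v][u]  # row-major: row v, column u
    return rendered < box_depth - tolerance

# Hypothetical 3x3 depth buffer in meters; a surface sits 1.0 m away at the center pixel.
depth_buffer = [
    [3.0, 3.0, 3.0],
    [3.0, 1.0, 3.0],
    [3.0, 3.0, 3.0],
]
hidden = is_occluded(depth_buffer, u=1, v=1, box_depth=2.5, tolerance=0.3)
visible = is_occluded(depth_buffer, u=1, v=1, box_depth=1.2, tolerance=0.3)
```

The tolerance has to cover at least the distance from the object's front surface to its box center, otherwise an object would be flagged as occluding itself.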
Say there are 440 annotated 3D boxes in a living room scene. For a particular iPad frame:
This frame now has 85 labeled objects with both 2D and 3D annotations — all generated automatically from the one-time FARO annotation.
The result: from a single set of world-space 3D annotations, the pipeline automatically generates 2D bounding boxes and 3D box parameters for every one of the 15 million+ frames. Each frame gets both a projected 2D box (for 2D losses) and the 3D box in camera coordinates (for 3D losses).
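One plausible way to derive the projected 2D box is to project the 8 camera-space corners and take their bounding rectangle. The sketch below assumes a pinhole model with hypothetical intrinsics and skips clamping to the image bounds; `project_box_to_2d` is an illustrative name.

```python
def project_box_to_2d(corners_cam, fx, fy, cx, cy):
    """Return (u_min, v_min, u_max, v_max) enclosing all corners in front of the camera."""
    us, vs = [], []
    for x, y, z in corners_cam:
        if z <= 0:
            continue  # corners behind the camera cannot be projected
        us.append(fx * x / z + cx)
        vs.append(fy * y / z + cy)
    if not us:
        return None
    return (min(us), min(vs), max(us), max(vs))

# Hypothetical box spanning x, y in [-1, 1] and z in [2, 4], in camera coordinates.
corners_cam = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (2.0, 4.0)]
box_2d = project_box_to_2d(corners_cam, fx=100.0, fy=100.0, cx=320.0, cy=240.0)
```

The near face (z = 2) dominates the extent here, so the 2D box is set by the four closest corners.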
CuTR (Cubify Transformer) is the paper's model contribution. Its defining feature: it is a fully transformer-based 3D object detector that uses absolutely no 3D-specific operations. No point cloud processing, no voxelization, no sparse 3D convolutions. Just a ViT backbone feeding into a DETR-style detector, with a 3D prediction head on top.
CuTR supports two input modes: RGB-only, built on a DepthAnything backbone, and RGB-D, built on a MultiMAE backbone.
In both cases, the backbone produces a flat grid of image tokens — the same representation any ViT would produce. Nothing 3D-specific yet.
The token grid goes into a single-scale Plain DETR detector. This is a standard DETR: learnable object queries cross-attend to the image tokens. Each query either matches an object or produces a "no object" prediction. Hungarian matching assigns queries to ground-truth boxes during training.
Plain DETR was chosen deliberately over multi-scale variants (Deformable DETR, DINO-DETR) to keep the architecture free of custom operations. Single-scale attention reduces to dense matrix multiplications and a softmax, which every mainstream inference runtime supports.
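To make the "standard operations only" point concrete, here is a toy cross-attention step in plain Python. It is a deliberately stripped-down sketch: there are no learned projections, no multiple heads, and the query and token values are invented for illustration.

```python
import math

def matmul(a, b):
    """Dense matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, tokens):
    """Scaled dot-product attention with tokens serving as both keys and values."""
    d = len(queries[0])
    scores = matmul(queries, transpose(tokens))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, tokens), weights

queries = [[1.0, 0.0], [0.0, 1.0]]             # 2 object queries
tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # 3 image tokens
attended, weights = cross_attention(queries, tokens)
```

Everything here is a matrix product, an exponential, or a division, which is why this style of detector exports cleanly to general-purpose runtimes.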
Each object query is decoded into both a 2D and a 3D prediction: a 2D box, a 3D box (center, dimensions, and yaw), and a class score.
The data flows from image through ViT backbone, into single-scale DETR, and out through parallel 2D and 3D prediction heads.
When using affine-invariant depth (like DepthAnything's output), the raw depth values don't correspond to real-world meters. CuTR handles this by preserving the affine parameters (μ, σ) used to normalize the depth during encoding, then rescaling predictions back to metric depth:
$$z_{\text{metric}} = \sigma z + \mu$$
where z is the depth predicted in normalized space.
This means the network operates in a normalized depth space internally, which stabilizes training, and the real-world scale is recovered at the output.
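The normalize-then-rescale round trip can be sketched as below. How CuTR actually computes μ and σ is not stated here, so the mean and population standard deviation are an assumption, as are the function names.

```python
import statistics

def normalize_depth(depth_values):
    """Normalize depths to zero mean and unit spread, keeping (mu, sigma)."""
    mu = statistics.fmean(depth_values)
    sigma = statistics.pstdev(depth_values)
    if sigma == 0:
        raise ValueError("constant depth map cannot be normalized")
    normalized = [(d - mu) / sigma for d in depth_values]
    return normalized, mu, sigma

def rescale(z, mu, sigma):
    """Map a prediction from normalized space back to metric depth."""
    return sigma * z + mu

depths = [1.0, 2.0, 3.0]  # hypothetical depth samples in meters
normalized, mu, sigma = normalize_depth(depths)
recovered = [rescale(z, mu, sigma) for z in normalized]
```

Because the same (μ, σ) pair is used in both directions, any prediction made in normalized space maps to a consistent metric value.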
CuTR's training recipe combines ideas from 2D object detection (Hungarian matching from DETR) with a 3D-specific loss (Chamfer distance on box corners). Let's walk through each component.
Like DETR, CuTR uses Hungarian matching to assign predicted boxes to ground-truth boxes. The matching cost combines 2D box IoU with L1 distance and classification loss. Each ground-truth box is assigned to exactly one query, and unmatched queries are trained to predict "no object."
The matching operates on 2D boxes, not 3D — this is more stable because 2D IoU is a well-behaved metric. Once matched, the 3D loss is applied to the matched pairs.
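A toy version of the 2D matching step is sketched below. Two simplifications are assumptions made for brevity: the classification term is left out of the cost, and a brute-force search over assignments stands in for the Hungarian algorithm (it finds the same optimum but only scales to a handful of boxes).

```python
from itertools import permutations

def iou_2d(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_cost(pred, gt):
    """L1 distance between box coordinates plus (1 - IoU)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return l1 + (1.0 - iou_2d(pred, gt))

def match(preds, gts):
    """Return {gt_index: query_index} minimizing total cost (needs len(preds) >= len(gts))."""
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[q], gts[g]) for g, q in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return {g: q for g, q in enumerate(best)}

# Hypothetical normalized 2D boxes: three queries, two ground-truth objects.
preds = [(0.5, 0.5, 0.9, 0.9), (0.1, 0.1, 0.4, 0.4), (0.0, 0.0, 0.05, 0.05)]
gts = [(0.1, 0.1, 0.4, 0.45), (0.5, 0.5, 0.9, 0.85)]
matches = match(preds, gts)
unmatched = [q for q in range(len(preds)) if q not in matches.values()]
```

Query 2 ends up unmatched and would be supervised toward "no object". For realistic query counts, a polynomial-time solver such as SciPy's `linear_sum_assignment` is the usual choice.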
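For each matched pair, CuTR computes the Chamfer distance between the 8 corners of the predicted 3D box and the 8 corners of the ground-truth 3D box. In its standard symmetric form (the exact norm and normalization are not specified here):
$$\mathcal{L}_{\text{chamfer}}(P, G) = \sum_{p \in P} \min_{g \in G} \lVert p - g \rVert_2 + \sum_{g \in G} \min_{p \in P} \lVert g - p \rVert_2$$
where P is the set of 8 predicted corners and G is the set of 8 ground-truth corners. This loss has a nice property: it's smooth under rotations. Unlike direct L1 on angles (which has discontinuities at ±π), Chamfer distance on corners degrades gracefully as the predicted box rotates away from the target.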
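The contrast with an angle loss is easy to see numerically. The sketch below uses an unsquared, unnormalized symmetric Chamfer distance and a yaw-about-the-y-axis convention; both are assumptions for illustration, as are the box values.

```python
import math

def box_corners(center, dims, yaw):
    """8 corners of a box rotated by yaw about the vertical (y) axis."""
    cx, cy, cz = center
    w, h, l = dims  # extents along local x, y (vertical), z
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx in (-1, 1):
        for sy in (-1, 1):
            for sz in (-1, 1):
                x, y, z = sx * w / 2, sy * h / 2, sz * l / 2
                corners.append((cx + c * x + s * z, cy + y, cz - s * x + c * z))
    return corners

def chamfer(P, G):
    """Symmetric Chamfer distance between two corner sets."""
    def one_way(A, B):
        return sum(min(math.dist(a, b) for b in B) for a in A)
    return one_way(P, G) + one_way(G, P)

# Two nearly identical boxes whose yaw values sit on either side of the +/-pi wrap.
gt_yaw, pred_yaw = math.pi - 0.01, -math.pi + 0.01
G = box_corners((0.0, 0.0, 3.0), (1.0, 1.0, 2.0), gt_yaw)
P = box_corners((0.0, 0.0, 3.0), (1.0, 1.0, 2.0), pred_yaw)

angle_l1 = abs(gt_yaw - pred_yaw)  # close to 2*pi despite a 0.02 rad real difference
corner_loss = chamfer(P, G)        # small, reflecting the true geometric error
```

The angle loss reports a near-maximal error for two boxes that almost coincide, while the corner loss stays proportional to how far the corners actually moved.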
Because Hungarian matching assigns each ground-truth object to exactly one query during training, the model learns to suppress duplicate predictions on its own. As a result, CuTR needs no Non-Maximum Suppression (NMS) at inference time. Each query either detects an object or outputs "no object." This simplifies the inference pipeline and removes a hand-tuned hyperparameter (the NMS threshold).
Indoor objects typically sit on horizontal surfaces. By applying a gravity transform (from the IMU) to align the "up" direction before prediction, CuTR reduces the orientation problem to a single yaw angle. This makes the learning problem easier — the model only needs to predict how the object is rotated around the vertical axis, not arbitrary 3D rotations.
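One way to build such a gravity alignment is to compute the rotation that takes the measured gravity direction onto a fixed "down" axis. The sketch below uses Rodrigues' formula for aligning two unit vectors; the choice of (0, -1, 0) as down, the sample IMU reading, and the function names are all assumptions, not the paper's implementation.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def matmul3(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation_aligning(src, dst):
    """Rotation matrix taking direction src onto direction dst (Rodrigues' formula)."""
    a, b = normalize(src), normalize(dst)
    v = cross(a, b)
    c = sum(x * y for x, y in zip(a, b))
    if c < -1.0 + 1e-9:
        raise ValueError("vectors are opposite; the rotation axis is ambiguous")
    K = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    K2 = matmul3(K, K)
    k = 1.0 / (1.0 + c)
    return [[(1.0 if i == j else 0.0) + K[i][j] + k * K2[i][j]
             for j in range(3)] for i in range(3)]

def apply(R, p):
    return [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]

gravity = [0.10, -0.98, 0.15]  # hypothetical IMU gravity reading in camera coordinates
down = [0.0, -1.0, 0.0]        # assumed canonical "down" axis
R_align = rotation_aligning(gravity, down)
aligned = apply(R_align, normalize(gravity))
```

After this rotation the vertical axis is fixed, so the only remaining orientation freedom for an object resting on a horizontal surface is its yaw.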
CuTR trains on the per-frame labels from CA-1M's rendering pipeline. Each training sample is a single video frame with its associated 2D+3D box annotations. The model sees 15 million+ frames during training — a massive supervision signal compared to prior datasets.
Data augmentation includes standard 2D transforms (random crop, color jitter, horizontal flip) applied consistently to both the image and the box annotations. For RGB-D mode, the depth map is augmented with synthetic noise to make the model robust to the varying quality of commodity depth sensors.
How well does CuTR actually work? The paper evaluates it against established point-based 3D detectors on CA-1M, and the results tell a clear story.
All methods are evaluated using Average Recall (AR) at IoU thresholds of 25% and 50%. Recall matters more than precision here because CA-1M's exhaustive annotations mean every object is a valid detection target — we want to know how many the model finds.
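Recall at a given IoU threshold can be sketched as below. Two simplifications are assumed: boxes are axis-aligned (the real evaluation uses oriented boxes), and a ground-truth box counts as found if any prediction overlaps it enough, without enforcing one-to-one matching.

```python
def iou_3d_axis_aligned(a, b):
    """IoU of two boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)

def recall_at_iou(preds, gts, threshold):
    """Fraction of ground-truth boxes covered by some prediction at the threshold."""
    if not gts:
        return 0.0
    hits = sum(
        1 for g in gts
        if any(iou_3d_axis_aligned(p, g) >= threshold for p in preds)
    )
    return hits / len(gts)

# Hypothetical scene: three ground-truth unit cubes, two predictions.
gts = [(0, 0, 0, 1, 1, 1), (2, 0, 0, 3, 1, 1), (5, 5, 5, 6, 6, 6)]
preds = [(0, 0, 0, 1, 1, 1), (2.5, 0, 0, 3.5, 1, 1)]
recall_25 = recall_at_iou(preds, gts, 0.25)
recall_50 = recall_at_iou(preds, gts, 0.50)
```

The half-shifted prediction has an IoU of one third, so it counts at the 25% threshold but not at 50%, which is why AR@50 is always the harder number.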
| Method | Input | AR@25 | AR@50 |
|---|---|---|---|
| ImVoxelNet | RGB | 22.8 | 7.6 |
| CuTR | RGB | 35.4 | 16.3 |
| FCAF3D | RGB-D | 52.9 | 26.3 |
| TR3D+FF | RGB-D | 52.9 | 28.2 |
| CuTR | RGB-D | 62.3 | 33.4 |
A critical ablation: what happens when you swap iPad's noisy LiDAR depth for FARO's ground-truth depth? Point-based methods improve dramatically (they were hamstrung by noise). But CuTR still wins most metrics even with GT depth, suggesting its advantage isn't just noise robustness — the architecture itself is better at exploiting dense scale information.
On noisy commodity depth (the realistic deployment scenario), CuTR's advantage is even larger. Point-based methods that build 3D voxel grids from noisy depth get corrupted features. CuTR encodes depth as 2D image tokens, which is more robust to noise.
Another perspective: CuTR's ViT backbone was pre-trained on millions of images, giving it strong 2D appearance priors. Even when depth is noisy, the model can fall back on visual cues (perspective, relative size, occlusion patterns) to estimate 3D. Point-based methods don't have this fallback.
A dataset's true value isn't just how well models perform on it — it's how well they transfer from it. If CA-1M's annotations are genuinely high-quality and exhaustive, then pre-training on CA-1M should boost performance on other benchmarks.
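SUN RGB-D is a standard indoor 3D detection benchmark with 10 categories. The transfer experiment is straightforward: train CuTR on SUN RGB-D with and without CA-1M pre-training, and compare AR@25 against point-based baselines trained on SUN RGB-D alone.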
| Method | Pre-train | AR@25 | Δ |
|---|---|---|---|
| CuTR | None | 44.2 | — |
| CuTR | CA-1M | 57.2 | +13.0 |
| FCAF3D | None | 54.8 | — |
| TR3D+FF | None | 55.3 | — |
This result validates both the dataset and the model. The exhaustive annotations in CA-1M teach the model general object geometry — shapes, sizes, spatial relationships — that transfers to new scenes and new category sets. And CuTR's pure-transformer architecture is flexible enough to absorb this pre-training signal.
A natural question: couldn't you also pre-train FCAF3D or TR3D on CA-1M's point clouds and get the same benefit? In principle, yes — but in practice, there are structural barriers.
Point-based methods use sparse 3D convolutions (libraries like MinkowskiEngine or TorchSparse). These operate on irregular 3D grids defined by the input point cloud. Pre-training such architectures requires 3D data in a compatible format — you can't just feed in ImageNet and fine-tune. The 3D pre-training ecosystem is tiny compared to the 2D vision ecosystem.
Point-based methods like FCAF3D and TR3D are strong when trained directly on the target dataset. But they can't easily leverage pre-training from larger datasets because their sparse 3D convolutions are hard to pre-train in the general-purpose way that ViTs excel at.
CuTR, being built entirely from ViTs and DETR, inherits the entire ecosystem of vision pre-training. Its backbone starts from MultiMAE or DepthAnything — models trained on millions of images. CA-1M adds domain-specific 3D knowledge on top. This composable pre-training is a structural advantage of the fully-transformer approach.
Cubify Anything sits at the intersection of several research threads. Let's map where it fits and what it connects to.
The paper's two contributions — dataset and model — each draw from and influence different parts of the 3D vision landscape.
CuTR is a direct descendant of DETR. It uses the same Hungarian matching, the same learned object queries, and the same encoder-decoder transformer structure. The key extension is the 3D prediction head — DETR predicts 2D boxes, CuTR extends each query to predict depth, dimensions, and yaw as well.
Boxer takes a different decomposition: it separates 2D detection from 3D lifting. CuTR is end-to-end — one model does both detection and 3D estimation. Boxer explicitly references CuTR as a baseline, showing that the decomposition approach can outperform end-to-end systems when 2D detectors are strong enough. The two papers represent complementary philosophies.
FCAF3D and TR3D represent the dominant paradigm: build a 3D feature volume from point clouds and detect objects in 3D space directly. CuTR's contribution is showing that this paradigm isn't necessary — a pure image-based approach with the right data can match or exceed point-based methods, with much better deployment characteristics.
The point-based paradigm has a structural advantage (native 3D reasoning) but also a structural limitation (dependency on custom sparse convolution kernels). CuTR trades one for the other: it gives up explicit 3D structure for universal deployability. As transformers get faster and pre-training gets richer, this trade-off increasingly favors the CuTR approach.
CuTR's RGB-only mode relies on DepthAnything for monocular depth estimation. The affine-invariant depth encoding and rescaling trick (σz + μ) is a general-purpose technique for using relative depth predictions in metric 3D tasks.
This means CuTR can work in two regimes: with a real depth sensor (LiDAR on iPad Pro) for best accuracy, or with predicted depth from a monocular model for maximum accessibility. The same architecture handles both — only the depth input source changes.
| Aspect | Cubify Anything |
|---|---|
| Dataset | CA-1M: 440K objects, 1K+ scenes, exhaustive, FARO-annotated |
| Model | CuTR: ViT backbone + Plain DETR + 3D head |
| Input modes | RGB-only (DepthAnything) or RGB-D (MultiMAE) |
| Output | 2D box + 3D box (center, dims, yaw) + class per query |
| 3D loss | Chamfer distance on 8 box corners |
| Matching | Hungarian on 2D boxes, no NMS |
| Orientation | Gravity-aware, yaw only (IMU-aligned) |
| Key result | 62.3% AR@25 on CA-1M (vs 52.9% best point method) |
| Transfer | +13 AR@25 on SUN RGB-D from CA-1M pre-training |
| Deployment | Metal, ANE, ONNX — no custom sparse kernels |