Lazarow, Griffiths, Kohavi, Crespo, Dehghan — Apple, 2024

Cubify Anything

Scaling indoor 3D object detection with exhaustive annotations on survey-grade scans and a fully transformer-based detector that needs no point clouds, no voxels, no sparse convolutions.

Prerequisites: 2D object detection + Camera projection basics + Transformers (DETR)
10 chapters · 5+ simulations

Chapter 0: The Problem

You walk into a room with an iPad and scan it. Your app detects a couch, a table, maybe a lamp. But here's the thing — it missed the stack of books, the remote control, the power strip under the desk, the three throw pillows, and about a hundred other objects. Why?

Because existing indoor 3D detection datasets are sparse. ARKitScenes, the largest available benchmark, labels only 21 handpicked categories and averages about 11 labeled objects per video. A real room has ten times that many objects.

This creates a vicious cycle. Models trained on sparse annotations learn to detect only common furniture. They never see small objects, unusual items, or cluttered surfaces during training. So they never learn to detect them.

The data bottleneck: Indoor 3D detection isn't stuck because of bad models — it's stuck because of bad data. Even the best architecture can't learn to find objects it has never been labeled on. The field needs exhaustive annotations, not just bigger models.

There's a second problem lurking underneath. Most 3D detectors depend on specialized operations: sparse 3D convolutions, voxelization, custom CUDA kernels. These are fast on NVIDIA GPUs with the right libraries installed, but they generally don't run on Apple Silicon or mobile devices, and they don't export cleanly to standard formats like ONNX. The deployment story is terrible.

Cubify Anything attacks both problems simultaneously: a dataset that labels everything, and a model that uses only standard transformer operations.

The concrete pipeline end-to-end: FARO laser scan (survey-grade point cloud, mm accuracy) → annotator labels 9-DoF boxes on point cloud → registration transforms boxes to each iPad capture frame → rendering pipeline (project 3D boxes, frustum culling, depth-based occlusion test) → per-frame 2D+3D ground truth.

For inference: RGB image (1024×768) + optional depth (256×192) → ViT backbone → patch tokens → Plain DETR decoder with box-modulated cross-attention → per-query output: 2D box + 3D box (projected xy, depth z, dims l/w/h, yaw θ) + class score.

Annotation Density: ARKitScenes vs CA-1M

A room with furniture and objects. Toggle between datasets to see how many objects get labeled. Orange boxes are annotations — notice how much ARKitScenes misses.

Why does training on ARKitScenes produce detectors that miss most objects in a room?

Chapter 1: The Key Insight

Here is the core idea behind Cubify Anything, and it's deceptively simple: annotate on the best geometry, detect from images.

ARKitScenes captures rooms with an iPad Pro, which produces a noisy mesh via ARKit's real-time SLAM. Annotators then try to draw 3D boxes on this noisy mesh. The mesh has holes, drift, and misalignment — so the annotations inherit that noise.

But the same rooms were also scanned with a FARO Focus laser scanner. FARO scans are survey-grade: millimeter accuracy, tens of millions of points, no drift. The raw data for perfect 3D annotations already exists. Nobody was using it for labeling.

Step 1: FARO Scan
Survey-grade laser scanner captures the room with millimeter accuracy. Dense, clean point cloud — the ground truth of the room's geometry.
Step 2: Annotate on FARO
Label EVERY object with a 9-DoF 3D bounding box directly on the FARO point cloud. No noise, no drift — the boxes are spatially accurate.
Step 3: Render to Frames
Project world-space boxes onto each iPad video frame using known camera poses. Frustum culling + occlusion filtering = pixel-perfect 2D+3D labels for free.
Step 4: Train CuTR
Standard ViT + DETR on images only. No point cloud input needed at inference time. Runs on any hardware.
Why this matters: By disentangling where you annotate (FARO scans) from what you detect on (iPad images), you get the best of both worlds. Annotations are spatially perfect because FARO is accurate. Labels are exhaustive because you only annotate once per room, not per frame. And the model at inference time sees only images — no special hardware, no custom ops.

This insight also explains the paper's two contributions: they couldn't just release a dataset, because existing models (point-based 3D detectors) can't fully exploit image-only training at this scale. And they couldn't just propose a model, because no dataset had the annotation quality to validate it. The dataset and the model are co-designed.

What is the key disentanglement that makes CA-1M's annotations both accurate and exhaustive?

Chapter 2: The CA-1M Dataset

CA-1M (Cubify Anything — 1 Million) is the paper's first major contribution. Let's look at the numbers. Across 1,000+ indoor scenes, annotators placed 440,000 unique 3D bounding boxes. That's roughly 125 objects per video — more than ten times ARKitScenes' 11 per video.

But raw numbers don't tell the whole story. What makes CA-1M special is four key properties, and no prior dataset satisfies all of them simultaneously.

The Four Properties

Scale comparison: ARKitScenes has 56K objects across 21 categories. ScanNet has ~36K objects across 18 categories. CA-1M has 440K objects — class-agnostic, exhaustively labeled. That's nearly 8× the total object count of ARKitScenes.

The class-agnostic approach is a deliberate design choice. Instead of defining a taxonomy of 20-50 categories and only labeling objects that fit, CA-1M labels everything and assigns category labels afterward. This means the dataset captures the true distribution of objects in indoor spaces — including the long tail of unusual items that category-specific datasets miss.

Dataset Properties Comparison

Toggle each property to see which datasets satisfy it. Only CA-1M achieves all four simultaneously.

The per-frame rendering pipeline produces 15 million+ training frames from 3,500 iPad captures. Each frame comes with both 2D bounding boxes (projected from 3D) and full 3D annotations in the camera's coordinate system. This is far more training signal than any prior indoor dataset.

Concrete numbers: 440K unique 3D bounding boxes across 1,000+ rooms. 15M rendered frames from 3,500 iPad video captures. That's ~125 annotated objects per capture on average. Each 9-DoF box stores: center (x, y, z), dimensions (l, w, h), orientation (roll, pitch, yaw). The entire annotation is done once on the FARO scan; the 15M frame labels come for free via automated projection.
What makes CA-1M's "exhaustive" labeling different from ARKitScenes?

Chapter 3: The Annotation Pipeline

Labeling 440,000 objects in 3D sounds like an impossible task. How do you even look at a point cloud with millions of points and efficiently place tight-fitting boxes around every object? The answer: a carefully designed annotation pipeline with model-in-the-loop acceleration.

Starting Point: FARO Scans

Each room has a FARO Focus laser scan — a dense point cloud with tens of millions of points at millimeter precision. Annotators work directly in a 3D viewer, rotating and zooming through the point cloud. Because the geometry is crisp and unambiguous, they can see exactly where each object begins and ends.

Model-in-the-Loop

Annotating from scratch for every scene would be painfully slow. Instead, the pipeline uses a bootstrapping approach:

  1. Annotate a seed set of scenes manually
  2. Train a preliminary 3D detector on the seed set
  3. Run the detector on new scenes to generate proposal boxes
  4. Annotators verify, adjust, and supplement the proposals instead of drawing from scratch
  5. Retrain on the expanded dataset and repeat

This iterative loop dramatically speeds up annotation. Instead of placing 125 boxes from nothing, annotators might confirm 80 proposals, tweak 30, delete 5 bad ones, and add 15 the model missed.

Human + model synergy: The model handles the easy, obvious objects (large furniture). The human catches the hard ones (a pair of headphones draped over a monitor, a cable snaking behind a desk). Each iteration, the model gets better at the hard cases, and the human workload decreases.

Multi-View Image Support

Some regions of the point cloud are ambiguous even at FARO quality — a dark corner might have too few scan points to tell whether that blob is a shoe or a crumpled towel. When annotators hit these cases, they can pull up the corresponding iPad video frames to see the actual RGB appearance. The calibrated camera poses let them cross-reference 3D points with 2D images.

Quality Control

Every annotated scene goes through a verification pass. Reviewers check for:

9-DoF boxes: Each annotation specifies 9 degrees of freedom: 3D center (x, y, z), dimensions (length, width, height), and orientation (roll, pitch, yaw). The full orientation is needed because indoor objects aren't always aligned with the room axes — a chair might be angled, a book tilted on a shelf.

Annotation Speed

The combination of FARO-quality geometry, model-in-the-loop proposals, and multi-view image references makes the pipeline surprisingly efficient. An experienced annotator can process a moderately cluttered room (80-150 objects) in a single session. Without the model proposals, the same room would take several times longer.

The key insight is that verification is faster than creation. Looking at a proposed box and saying "yes, that's right" or "nudge it 2cm left" is cognitively simpler than finding the object, deciding its bounds, and placing a box from scratch. The model handles the pattern matching; the human handles the judgment calls.

Scale through iteration: The first batch of scenes is the slowest — no model proposals, everything manual. But each batch trains a better model, which generates better proposals for the next batch. By the end, the model catches 70-80% of objects automatically, and the human effort per scene drops dramatically. This is how you get to 440K annotations without an army of annotators.
How does the model-in-the-loop approach accelerate annotation?

Chapter 4: Per-Frame Rendering

You've annotated 3D boxes on the FARO scan. Now you need labels for each of the millions of iPad video frames. How do you get from world-space 3D boxes to per-frame 2D+3D annotations? Through a rendering pipeline with three stages.

Stage 1: World-to-Camera Transform

Each iPad frame has a known camera pose (position + orientation) from ARKit's visual-inertial odometry. For every 3D bounding box in the scene, we transform its 8 corners from world coordinates into the camera's coordinate system using the standard extrinsic matrix.

p_cam = R · p_world + t

where R is the 3×3 rotation matrix and t is the translation vector of the world-to-camera transform, i.e. the inverse of the camera's pose in the world.
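As a minimal sketch of this step, the transform can be applied to one box corner at a time. The function name and the example extrinsics below are hypothetical and chosen only to make the arithmetic easy to follow; a real pipeline would apply the same operation to all 8 corners of every box.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def world_to_camera(p_world: Vec3, R: Mat3, t: Vec3) -> List[float]:
    """Apply p_cam = R * p_world + t to a single 3D point."""
    return [
        sum(R[i][j] * p_world[j] for j in range(3)) + t[i]
        for i in range(3)
    ]


# Hypothetical extrinsics: a 90-degree rotation about the z axis plus a shift.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]

corner_world = [1.0, 0.0, 0.0]
corner_cam = world_to_camera(corner_world, R, t)
print(corner_cam)  # [0.0, 1.0, 2.0]
```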

Stage 2: Frustum Culling

Not every object is visible in every frame. Frustum culling removes boxes that fall entirely outside the camera's field of view. If all 8 corners of a box project outside the image boundaries, or if the box center is behind the camera (negative depth), the box is culled for this frame.
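The culling rule described here can be sketched with a simple pinhole projection. The intrinsics (fx, fy, cx, cy), the helper names, and the toy boxes are assumptions for illustration, not values from the paper. Corners behind the camera are treated as not landing in the image.

```python
from itertools import product
from typing import List, Sequence, Tuple

Point = Sequence[float]


def project(p_cam: Point, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point with positive depth."""
    x, y, z = p_cam
    return fx * x / z + cx, fy * y / z + cy


def is_culled(corners_cam: List[Point], center_cam: Point, fx: float, fy: float,
              cx: float, cy: float, width: int, height: int) -> bool:
    """Cull if the center is behind the camera or no corner lands in the image."""
    if center_cam[2] <= 0.0:
        return True
    for corner in corners_cam:
        if corner[2] <= 0.0:
            continue  # behind the camera, so it cannot land in the image
        u, v = project(corner, fx, fy, cx, cy)
        if 0.0 <= u < width and 0.0 <= v < height:
            return False  # at least one corner is visible
    return True


def cube_corners(center: Point, half: float) -> List[Tuple[float, ...]]:
    """8 corners of an axis-aligned cube (toy stand-in for a real box)."""
    return [tuple(c + s * half for c, s in zip(center, signs))
            for signs in product((-1.0, 1.0), repeat=3)]


# Hypothetical intrinsics for a 1024x768 image.
K = dict(fx=500.0, fy=500.0, cx=512.0, cy=384.0, width=1024, height=768)

in_front, off_to_side, behind = (0.0, 0.0, 2.0), (10.0, 0.0, 2.0), (0.0, 0.0, -2.0)
results = [is_culled(cube_corners(c, 0.1), c, **K)
           for c in (in_front, off_to_side, behind)]
print(results)  # [False, True, True]
```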

Stage 3: Occlusion Filtering

Even after frustum culling, some objects might be hidden behind other objects. The pipeline renders a depth buffer from the FARO point cloud for each camera viewpoint. Then for each surviving 3D box, it checks: does the depth at the box's projected center match the box's actual depth, or is something in front? If the object is heavily occluded (below a visibility threshold), it's filtered out.
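A single-sample version of this depth test might look like the sketch below. The tolerance value and the toy depth buffer are hypothetical, and sampling only the projected center is a simplification of whatever visibility threshold the actual pipeline uses.

```python
from typing import List


def is_occluded(depth_buffer: List[List[float]], u: float, v: float,
                box_depth: float, tolerance: float = 0.05) -> bool:
    """Compare the rendered surface depth at pixel (u, v) with the box depth.

    The box center lies inside the object, so the visible surface is normally
    a little closer than the center; `tolerance` (metres, hypothetical value)
    absorbs that gap. Assumes (u, v) falls inside the buffer.
    """
    row, col = int(v), int(u)
    surface_depth = depth_buffer[row][col]
    return surface_depth < box_depth - tolerance


# Toy 4x4 depth buffer in metres: a couch at 1.5 m fills the left two
# columns, a wall at 3.0 m fills the right two.
depth_buffer = [[1.5, 1.5, 3.0, 3.0] for _ in range(4)]

hidden = is_occluded(depth_buffer, u=1.2, v=1.7, box_depth=2.5)   # behind the couch
visible = is_occluded(depth_buffer, u=3.1, v=1.7, box_depth=3.0)  # on the wall
print(hidden, visible)  # True False
```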

The pipeline in one sentence: Transform every box to the camera frame, throw away boxes outside the field of view, throw away boxes hidden behind other things, and project the survivors to get pixel-perfect 2D boxes. All of this is automated — annotators never touch individual frames.
Frustum Culling & Occlusion

A top-down view of a room with objects (squares) and a camera (triangle). Drag the camera angle slider to change the view direction. Green objects are visible, red are culled (outside frustum), gray are occluded.

A Worked Example

Say there are 440 annotated 3D boxes in a living room scene. For a particular iPad frame:

  1. Transform: All 440 boxes are transformed to the camera's coordinate system using the frame's known pose.
  2. Cull: 320 boxes are behind the camera or outside the field of view. They're removed. 120 remain.
  3. Occlude: Of those 120, 35 are fully behind other objects (a box behind the couch, a cable behind the desk). They're filtered out. 85 survive.
  4. Project: The 85 surviving 3D boxes are projected to 2D, giving pixel-aligned bounding boxes.

This frame now has 85 labeled objects with both 2D and 3D annotations — all generated automatically from the one-time FARO annotation.

The result: from a single set of world-space 3D annotations, the pipeline automatically generates 2D bounding boxes and 3D box parameters for every one of the 15 million+ frames. Each frame gets both a projected 2D box (for 2D losses) and the 3D box in camera coordinates (for 3D losses).

Engineering detail — the occlusion test: The depth buffer is rendered from the FARO point cloud (not the iPad's noisy depth). For each candidate box, the pipeline samples the depth buffer at the box's projected 2D center. If the buffer depth is significantly less than the box's actual depth (something is in front), the box is marked occluded. The visibility threshold is tuned to avoid labeling barely-visible slivers — an object must be substantially unoccluded to count as a positive training example. This prevents the model from learning to "hallucinate" objects it can barely see.
Why is occlusion filtering needed in addition to frustum culling?

Chapter 5: The CuTR Architecture

CuTR (Cubify Transformer) is the paper's model contribution. Its defining feature: it is a fully transformer-based 3D object detector that uses absolutely no 3D-specific operations. No point cloud processing, no voxelization, no sparse 3D convolutions. Just a ViT backbone feeding into a DETR-style detector, with a 3D prediction head on top.

Backbone: Vision Transformer

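CuTR supports two input modes: RGB-D, where a MultiMAE backbone encodes the RGB image together with the depth map, and RGB-only, where the backbone is initialized from DepthAnything.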

In both cases, the backbone produces a flat grid of image tokens — the same representation any ViT would produce. Nothing 3D-specific yet.

Why MultiMAE handles asymmetric resolutions: RGB comes in at 1024×768, depth at 256×192 (1/4 resolution on each axis). With patch size 16×16, RGB produces (1024/16)×(768/16) = 64×48 = 3,072 tokens. Depth at 1/4 scale produces (256/16)×(192/16) = 16×12 = 192 tokens — only 1/16th of the RGB token count. This makes depth essentially free in terms of compute: you get metric scale information for <7% extra tokens.
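The token arithmetic can be checked in a few lines. The helper name is hypothetical; the resolutions and patch size are the ones quoted above.

```python
def num_patch_tokens(width: int, height: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens for an image of this size."""
    assert width % patch == 0 and height % patch == 0
    return (width // patch) * (height // patch)


rgb_tokens = num_patch_tokens(1024, 768)    # 64 * 48
depth_tokens = num_patch_tokens(256, 192)   # 16 * 12
extra = depth_tokens / rgb_tokens           # depth tokens relative to RGB tokens

print(rgb_tokens, depth_tokens, f"{extra:.2%}")  # 3072 192 6.25%
```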
Frozen vs. trained: The MultiMAE backbone is pretrained on RGB+depth reconstruction tasks, then fine-tuned during CuTR training. The DepthAnything backbone (for RGB-only mode) is pretrained on massive monocular depth data, then fine-tuned. The DETR decoder is trained from scratch. The backbone uses windowed attention with a 16×16 base patch and no relative position embeddings, which reduces cost and keeps the model export-friendly.

Detector: Plain DETR

The token grid goes into a single-scale Plain DETR detector. This is a standard DETR: learnable object queries cross-attend to the image tokens. Each query either matches an object or produces a "no object" prediction. Hungarian matching assigns queries to ground-truth boxes during training.

Plain DETR was chosen deliberately over multi-scale variants (Deformable DETR, DINO-DETR) to keep the architecture free of custom operations. Single-scale attention is pure matrix multiplication — it runs everywhere.

The deployment engineering decision: Deformable DETR uses deformable attention, a custom CUDA kernel that samples features at learned offsets. This is fast on NVIDIA GPUs but doesn't compile to Apple Neural Engine (ANE), Metal compute shaders, or ONNX without custom ops. Plain DETR's cross-attention is just softmax(Q·K^T/√d)·V: two matrix multiplications and a softmax, plus the linear projections that produce Q, K, and V. Every inference engine on earth can run it. The trade-off: attention cost that grows as O(N²) in the token count instead of O(N×K) deformable sampling. At 3,072 tokens, this is still tractable for real-time on modern hardware.
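To make the "only standard operations" point concrete, here is a minimal single-head sketch of scaled dot-product cross-attention in plain Python. It omits the learned Q/K/V projections, multiple heads, and any positional terms, so it illustrates the core operation rather than CuTR's actual decoder layer.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d)) V: queries attend over image tokens."""
    d = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)


# One object query attending over two identical-key image tokens:
# the weights are equal, so the output is the mean of the two values.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[2.0, 0.0], [4.0, 0.0]]
out = cross_attention(Q, K, V)
print(out)  # [[3.0, 0.0]]
```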

3D Prediction Head

Each object query is decoded into both a 2D and a 3D prediction: a 2D box, a 3D box (projected center xy, depth z, dimensions l/w/h, and yaw θ), and a class score.

Gravity-aware: CuTR assumes a known gravity direction (from the iPad's IMU). This lets it decompose orientation into just yaw θ (rotation around the gravity axis). Roll and pitch are handled by the gravity transform, reducing the orientation problem from SO(3) to SO(2).
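With orientation reduced to yaw, a box is fully described by a center, three dimensions, and one angle. The sketch below turns those 7 numbers into 8 corners. The axis convention (z up along gravity, length along x, width along y) is an assumption for illustration and may differ from the paper's.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple


def box_corners(center: Sequence[float], dims: Sequence[float],
                yaw: float) -> List[Tuple[float, float, float]]:
    """8 corners of a box rotated by `yaw` about the vertical (z) axis."""
    cx, cy, cz = center
    l, w, h = dims
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        x, y, z = sx * l, sy * w, sz * h          # corner in the box's own frame
        corners.append((cx + c * x - s * y,       # rotate in the horizontal plane
                        cy + s * x + c * y,
                        cz + z))                  # height is unaffected by yaw
    return corners


flat = box_corners((0.0, 0.0, 0.0), (2.0, 1.0, 1.0), yaw=0.0)
turned = box_corners((0.0, 0.0, 0.0), (2.0, 1.0, 1.0), yaw=math.pi / 2)

# With yaw = 0 the long side spans x; after a quarter turn it spans y.
print(max(p[0] for p in flat), round(max(p[1] for p in turned), 6))  # 1.0 1.0
```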
CuTR Architecture Flow

The data flows from image through ViT backbone, into single-scale DETR, and out through parallel 2D and 3D prediction heads.

Depth Rescaling for RGB-D

When using affine-invariant depth (like DepthAnything's output), the raw depth values don't correspond to real-world meters. CuTR handles this by preserving the affine parameters (μ, σ) used to normalize the depth during encoding, then rescaling predictions back:

z' = σ · z + μ,    dims' = σ · dims

This means the network operates in a normalized depth space internally, which stabilizes training, and the real-world scale is recovered at the output.
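A minimal sketch of the rescaling step, using the formula above. The function name and the example values of μ and σ are hypothetical.

```python
from typing import Sequence, Tuple


def rescale_prediction(z: float, dims: Sequence[float], mu: float,
                       sigma: float) -> Tuple[float, Tuple[float, ...]]:
    """Map normalized depth and dimensions back to metric units.

    Depth gets both the scale and the shift (z' = sigma * z + mu);
    dimensions are differences of positions, so only the scale applies.
    """
    return sigma * z + mu, tuple(sigma * d for d in dims)


# Hypothetical normalization parameters and a normalized prediction.
mu, sigma = 1.0, 2.0
z_metric, dims_metric = rescale_prediction(0.5, (0.25, 0.5, 1.0), mu, sigma)
print(z_metric, dims_metric)  # 2.0 (0.5, 1.0, 2.0)
```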

Why affine-invariant encoding + explicit scale: DepthAnything produces relative depth (up to shift and scale). The model predicts z in this normalized space, and the decoder also predicts μ and σ from the metric depth input (when available). At inference: z_metric = σ·z_pred + μ, and dims_metric = σ·dims_pred. This decouples the hard problem (relative depth ordering, which DepthAnything is excellent at) from the easy problem (absolute scale recovery from a single metric depth reading). When no metric depth is available, μ and σ are estimated from image statistics, which is less accurate but still functional.
Why does CuTR use single-scale Plain DETR instead of multi-scale variants like Deformable DETR?

Chapter 6: Training & Losses

CuTR's training recipe combines ideas from 2D object detection (Hungarian matching from DETR) with a 3D-specific loss (Chamfer distance on box corners). Let's walk through each component.

Hungarian Matching

Like DETR, CuTR uses Hungarian matching to assign predicted boxes to ground-truth boxes. The matching cost combines 2D box IoU with L1 distance and classification loss. Each ground-truth box is assigned to exactly one query, and unmatched queries are trained to predict "no object."

The matching operates on 2D boxes, not 3D — this is more stable because 2D IoU is a well-behaved metric. Once matched, the 3D loss is applied to the matched pairs.
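The matching step can be sketched on a toy example. The cost weights below are hypothetical, and the brute-force search over permutations is a stand-in that is only practical for a handful of boxes; real implementations typically use a Hungarian solver such as `scipy.optimize.linear_sum_assignment`. Boxes are (x1, y1, x2, y2) in normalized image coordinates.

```python
from itertools import permutations
from typing import Dict, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou_2d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 2D boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_cost(pred: Box, score: float, gt: Box,
               w_iou: float = 2.0, w_l1: float = 5.0, w_cls: float = 1.0) -> float:
    """Lower is better: small L1 distance, high IoU, high class score."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return w_l1 * l1 - w_iou * iou_2d(pred, gt) - w_cls * score


def match(preds: Sequence[Box], scores: Sequence[float],
          gts: Sequence[Box]) -> Dict[int, int]:
    """Return {gt_index: query_index} with minimum total cost.

    Assumes there are at least as many queries as ground-truth boxes.
    """
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[q], scores[q], gts[g]) for g, q in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return dict(enumerate(best))


preds = [(0.1, 0.1, 0.4, 0.4), (0.5, 0.5, 0.9, 0.9), (0.0, 0.0, 0.05, 0.05)]
scores = [0.9, 0.8, 0.1]
gts = [(0.5, 0.5, 0.9, 0.9), (0.1, 0.1, 0.4, 0.4)]

matches = match(preds, scores, gts)
unmatched = sorted(set(range(len(preds))) - set(matches.values()))
print(matches, unmatched)  # {0: 1, 1: 0} [2]
```

Query 2 is left unmatched, so during training it would be supervised toward "no object".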

Chamfer Loss on 3D Corners

For each matched pair, CuTR computes the Chamfer distance between the 8 corners of the predicted 3D box and the 8 corners of the ground-truth 3D box:

L_chamfer = ½ ( Σ_{p∈P} min_{g∈G} ‖p − g‖₂ + Σ_{g∈G} min_{p∈P} ‖g − p‖₂ )

where P is the set of 8 predicted corners and G is the set of 8 ground-truth corners. This loss has a nice property: it's smooth under rotations. Unlike direct L1 on angles (which has discontinuities at ±π), Chamfer distance on corners degrades gracefully as the predicted box rotates away from the target.
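A small sketch of this loss on two sets of 8 corners follows. It assumes the norm is the plain (unsquared) Euclidean distance; if the paper uses squared distances, only the inner `math.dist` call would change. The toy boxes are unit cubes, one shifted by 0.1 along x.

```python
import math
from itertools import product
from typing import List, Sequence

Point = Sequence[float]


def chamfer(P: List[Point], G: List[Point]) -> float:
    """Symmetric Chamfer distance between two corner sets."""
    def one_way(A: List[Point], B: List[Point]) -> float:
        # For each point in A, distance to its nearest neighbour in B.
        return sum(min(math.dist(a, b) for b in B) for a in A)

    return 0.5 * (one_way(P, G) + one_way(G, P))


gt_corners = [tuple(map(float, c)) for c in product((0, 1), repeat=3)]
pred_corners = [(x + 0.1, y, z) for x, y, z in gt_corners]

# Every corner is 0.1 from its nearest counterpart in both directions,
# so the loss is 0.5 * (8 * 0.1 + 8 * 0.1), which is 0.8 up to float rounding.
loss = chamfer(pred_corners, gt_corners)
print(round(loss, 6))  # 0.8
```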

Why corners, not parameters? Comparing box parameters (x, y, z, l, w, h, θ) directly is problematic because a small error in θ can cause a large error in corner positions for long, thin objects. Chamfer on corners directly measures what matters: how far the predicted box's geometry is from the target's.

No NMS

Because DETR's Hungarian matching assigns each object to exactly one query, there are no duplicate predictions. This means CuTR needs no Non-Maximum Suppression (NMS) at inference time. Each query either detects an object or outputs "no object." This simplifies the inference pipeline and removes a hand-tuned hyperparameter (the NMS threshold).

Why no NMS is an engineering win, not just elegance: In indoor scenes, objects at different depths often have nearly identical 2D bounding boxes (a book on a shelf behind a cup on a desk). Standard 2D NMS would suppress one of them. 3D IoU-based NMS would be better, but computing 3D IoU between rotated boxes requires custom kernels — exactly the kind of specialized operation CuTR is designed to avoid. DETR's set-based prediction sidesteps the problem entirely.

Gravity-Aware Orientation

Indoor objects typically sit on horizontal surfaces. By applying a gravity transform (from the IMU) to align the "up" direction before prediction, CuTR reduces the orientation problem to a single yaw angle. This makes the learning problem easier — the model only needs to predict how the object is rotated around the vertical axis, not arbitrary 3D rotations.

Total loss: L = L_match(2D IoU + L1 + class) + λ_3D · L_chamfer + λ_2D · L_GIoU. The 2D GIoU loss ensures the projected 2D boxes are accurate, while Chamfer handles 3D geometry.

Training Details

CuTR trains on the per-frame labels from CA-1M's rendering pipeline. Each training sample is a single video frame with its associated 2D+3D box annotations. The model sees 15 million+ frames during training — a massive supervision signal compared to prior datasets.

Data augmentation includes standard 2D transforms (random crop, color jitter, horizontal flip) applied consistently to both the image and the box annotations. For RGB-D mode, the depth map is augmented with synthetic noise to make the model robust to the varying quality of commodity depth sensors.

What's frozen, what's trained: The ViT backbone (MultiMAE or DepthAnything) starts from pretrained weights and is fine-tuned with a lower learning rate. The DETR decoder and all prediction heads are trained from scratch with a higher learning rate. This two-rate schedule lets the backbone adapt its features for 3D detection without destroying its pretrained representations. Standard AdamW optimizer, cosine learning rate schedule.
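The two-rate idea can be sketched as two cosine schedules that share a shape but start from different base rates. The base learning rates and step count below are hypothetical placeholders, not the paper's hyperparameters, and warmup is omitted.

```python
import math
from typing import Dict

BACKBONE_LR = 1e-5  # hypothetical: lower rate for the pretrained ViT
HEAD_LR = 1e-4      # hypothetical: higher rate for decoder and heads


def cosine_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    progress = min(step, total_steps) / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def learning_rates(step: int, total_steps: int) -> Dict[str, float]:
    """Per-group learning rates at a given training step."""
    return {
        "backbone": cosine_lr(BACKBONE_LR, step, total_steps),
        "decoder_and_heads": cosine_lr(HEAD_LR, step, total_steps),
    }


for step in (0, 500, 1000):
    print(step, learning_rates(step, total_steps=1000))
```

In a framework such as PyTorch, the same split would usually be expressed as two optimizer parameter groups with different `lr` values.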
Why 2D matching works for 3D: You might wonder why Hungarian matching operates on 2D boxes rather than 3D. The reason is stability: 2D IoU is well-defined and smooth, while 3D IoU between rotated boxes is expensive to compute and can have degenerate cases (e.g., thin objects). Matching on 2D first, then applying 3D loss to matched pairs, gives the best of both worlds.
Why does CuTR not need Non-Maximum Suppression (NMS) at inference?

Chapter 7: Results on CA-1M

How well does CuTR actually work? The paper evaluates it against established point-based 3D detectors on CA-1M, and the results tell a clear story.

The Main Comparison

All methods are evaluated using Average Recall (AR) at IoU thresholds of 25% and 50%. Recall matters more than precision here because CA-1M's exhaustive annotations mean every object is a valid detection target — we want to know how many the model finds.

Method     | Input | AR@25 | AR@50
ImVoxelNet | RGB   | 22.8  | 7.6
CuTR       | RGB   | 35.4  | 16.3
FCAF3D     | RGB-D | 52.9  | 26.3
TR3D+FF    | RGB-D | 52.9  | 28.2
CuTR       | RGB-D | 62.3  | 33.4
The headline result: CuTR RGB-D recalls 62.3% of all objects at IoU25, beating the best point-based method (FCAF3D/TR3D+FF at 52.9%) by nearly 10 points. Even CuTR RGB-only (35.4%) handily beats ImVoxelNet RGB (22.8%).

Noisy vs. Ground-Truth Depth

A critical ablation: what happens when you swap iPad's noisy LiDAR depth for FARO's ground-truth depth? Point-based methods improve dramatically (they were hamstrung by noise). But CuTR still wins most metrics even with GT depth, suggesting its advantage isn't just noise robustness — the architecture itself is better at exploiting dense scale information.

What degrades and by how much: With commodity LiDAR depth (the iPad's noisy sensor), point methods like FCAF3D score 52.9 AR@25. Give them FARO ground-truth depth and they jump significantly — their gap to CuTR narrows. But CuTR's own gap between noisy and GT depth is smaller, suggesting image-based methods handle depth noise more gracefully. The key insight: CuTR already has strong geometric priors from its pretrained backbone, so it relies less on perfect depth. RGB-only CuTR drops recall from 62.3% (RGB-D) to 35.4% — still beating the best RGB-only baseline (ImVoxelNet at 22.8%) by a wide margin.

On noisy commodity depth (the realistic deployment scenario), CuTR's advantage is even larger. Point-based methods that build 3D voxel grids from noisy depth get corrupted features. CuTR encodes depth as 2D image tokens, which is more robust to noise.

Why noise hurts point methods more: Sparse 3D convolutions operate on voxelized point clouds. If the depth is noisy, points land in wrong voxels, and the 3D features become spatially scrambled. CuTR, by contrast, treats depth as just another 2D input channel — noise in depth slightly degrades the features but doesn't corrupt the spatial structure of the 2D token grid.

Another perspective: CuTR's ViT backbone was pre-trained on millions of images, giving it strong 2D appearance priors. Even when depth is noisy, the model can fall back on visual cues (perspective, relative size, occlusion patterns) to estimate 3D. Point-based methods don't have this fallback.

AR@25 Results Comparison

Average Recall at IoU 25% on CA-1M. Toggle between RGB-only and RGB-D modes.

Why does CuTR outperform point-based methods even more on noisy commodity depth than on ground-truth depth?

Chapter 8: Transfer Learning

A dataset's true value isn't just how well models perform on it — it's how well they transfer from it. If CA-1M's annotations are genuinely high-quality and exhaustive, then pre-training on CA-1M should boost performance on other benchmarks.

The Experiment: SUN RGB-D

SUN RGB-D is a standard indoor 3D detection benchmark with 10 categories. The transfer experiment is straightforward:

  1. Pre-train CuTR on CA-1M (440K objects, exhaustive, class-agnostic)
  2. Fine-tune on SUN RGB-D (10 categories, standard splits)
  3. Compare against training on SUN RGB-D from scratch and against point-based methods
Method  | Pre-train | AR@25 | Δ
CuTR    | None      | 44.2  | -
CuTR    | CA-1M     | 57.2  | +13.0
FCAF3D  | None      | 54.8  | -
TR3D+FF | None      | 55.3  | -
The +13 AR25 jump: Pre-training on CA-1M boosts CuTR's SUN RGB-D AR@25 from 44.2 to 57.2 — a gain of 13 points. This is enough to surpass all point-based methods (FCAF3D: 54.8, TR3D+FF: 55.3) that were trained directly on SUN RGB-D. Data quality and quantity beat architecture.

This result validates both the dataset and the model. The exhaustive annotations in CA-1M teach the model general object geometry — shapes, sizes, spatial relationships — that transfers to new scenes and new category sets. And CuTR's pure-transformer architecture is flexible enough to absorb this pre-training signal.

The composable pre-training stack: CuTR's backbone starts from MultiMAE (pretrained on RGB+depth reconstruction) or DepthAnything (pretrained on massive monocular depth). Then CA-1M fine-tuning adds indoor 3D geometry knowledge. Then SUN RGB-D fine-tuning specializes to the target categories. Each stage builds on the last: general vision → depth understanding → indoor 3D → target domain. Point methods can't stack pre-training like this because sparse 3D convolutions have no equivalent of ImageNet pre-training.

Why Not Pre-train Point Methods Too?

A natural question: couldn't you also pre-train FCAF3D or TR3D on CA-1M's point clouds and get the same benefit? In principle, yes — but in practice, there are structural barriers.

Point-based methods use sparse 3D convolutions (libraries like MinkowskiEngine or TorchSparse). These operate on irregular 3D grids defined by the input point cloud. Pre-training such architectures requires 3D data in a compatible format — you can't just feed in ImageNet and fine-tune. The 3D pre-training ecosystem is tiny compared to the 2D vision ecosystem.

The Bigger Lesson

Point-based methods like FCAF3D and TR3D are strong when trained directly on the target dataset. But they can't easily leverage pre-training from larger datasets because their sparse 3D convolutions are hard to pre-train in the general-purpose way that ViTs excel at.

CuTR, being built entirely from ViTs and DETR, inherits the entire ecosystem of vision pre-training. Its backbone starts from MultiMAE or DepthAnything — models trained on millions of images. CA-1M adds domain-specific 3D knowledge on top. This composable pre-training is a structural advantage of the fully-transformer approach.

Accessibility win: Because CuTR uses only standard operations (attention, MLPs, layer norms), it can be exported to ONNX, run on Apple's Metal/Neural Engine, or deployed on any hardware with a transformer inference stack. No CUDA, no MinkowskiEngine, no custom sparse kernels.
Why does pre-training on CA-1M help CuTR so much more than it would help point-based methods?

Chapter 9: Connections

Cubify Anything sits at the intersection of several research threads. Let's map where it fits and what it connects to.

The paper's two contributions — dataset and model — each draw from and influence different parts of the 3D vision landscape.

Relation to DETR

CuTR is a direct descendant of DETR. It uses the same Hungarian matching, the same learned object queries, and the same encoder-decoder transformer structure. The key extension is the 3D prediction head — DETR predicts 2D boxes, CuTR extends each query to predict depth, dimensions, and yaw as well.

Relation to Boxer

Boxer takes a different decomposition: it separates 2D detection from 3D lifting. CuTR is end-to-end — one model does both detection and 3D estimation. Boxer explicitly references CuTR as a baseline, showing that the decomposition approach can outperform end-to-end systems when 2D detectors are strong enough. The two papers represent complementary philosophies.

Relation to Point-Based Methods

FCAF3D and TR3D represent the dominant paradigm: build a 3D feature volume from point clouds and detect objects in 3D space directly. CuTR's contribution is showing that this paradigm isn't necessary — a pure image-based approach with the right data can match or exceed point-based methods, with much better deployment characteristics.

The point-based paradigm has a structural advantage (native 3D reasoning) but also a structural limitation (dependency on custom sparse convolution kernels). CuTR trades one for the other: it gives up explicit 3D structure for universal deployability. As transformers get faster and pre-training gets richer, this trade-off increasingly favors the CuTR approach.

Relation to Depth Estimation

CuTR's RGB-only mode relies on DepthAnything for monocular depth estimation. The affine-invariant depth encoding and rescaling trick (σz + μ) is a general-purpose technique for using relative depth predictions in metric 3D tasks.

This means CuTR can work in two regimes: with a real depth sensor (LiDAR on iPad Pro) for best accuracy, or with predicted depth from a monocular model for maximum accessibility. The same architecture handles both — only the depth input source changes.

Cheat Sheet

Aspect      | Cubify Anything
Dataset     | CA-1M: 440K objects, 1K+ scenes, exhaustive, FARO-annotated
Model       | CuTR: ViT backbone + Plain DETR + 3D head
Input modes | RGB-only (DepthAnything) or RGB-D (MultiMAE)
Output      | 2D box + 3D box (center, dims, yaw) + class per query
3D loss     | Chamfer distance on 8 box corners
Matching    | Hungarian on 2D boxes, no NMS
Orientation | Gravity-aware, yaw only (IMU-aligned)
Key result  | 62.3% AR@25 on CA-1M (vs 52.9% best point method)
Transfer    | +13 AR@25 on SUN RGB-D from CA-1M pre-training
Deployment  | Metal, ANE, ONNX; no custom sparse kernels
The broader lesson: When both data and architecture are bottlenecks, fix them together. Exhaustive annotations on survey-grade geometry plus a deployment-friendly transformer — not a fancier architecture or a noisier dataset — was the path to scaling indoor 3D detection.
What is the fundamental difference between CuTR's approach and Boxer's approach to 3D detection?