Decompose open-world 3D detection: let existing 2D detectors handle semantics, then lift their boxes into metric 3D with a single transformer. Works with any depth source — or none at all.
You have an AR headset. A 2D detector — DETIC, OWLv2, SAM3 — can see a coffee mug on your desk and draw a 2D bounding box around it. But your robot arm needs to pick it up. For that, you need to know where that mug is in 3D: how far away, how wide, how tall, and which way it's rotated.
This is the 2D-to-3D lifting problem: given a 2D bounding box in an image, produce a 3D bounding box in the real world.
Why is this hard? A single 2D box is deeply ambiguous. The same small box could be a small object nearby or a large object far away, because perspective projection discards depth entirely.
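A small numeric sketch of this ambiguity, assuming an ideal pinhole camera and an object facing the camera head-on. The focal length and box coordinates are made-up illustration values, and `backproject_box_size` is a hypothetical helper, not part of Boxer.

```python
def backproject_box_size(box_2d, depth_m, fx, fy):
    """Metric width/height of a fronto-parallel object that fills box_2d at depth_m."""
    x1, y1, x2, y2 = box_2d
    width_m = (x2 - x1) * depth_m / fx   # pinhole model: pixels * depth / focal length
    height_m = (y2 - y1) * depth_m / fy
    return width_m, height_m


# Assumed camera: 600 px focal length. Assumed detection: a 120 x 90 px box.
box = (300.0, 200.0, 420.0, 290.0)
for depth in (0.5, 2.0, 8.0):
    w, h = backproject_box_size(box, depth, fx=600.0, fy=600.0)
    print(f"depth {depth:4.1f} m -> {w:.3f} m wide, {h:.3f} m tall")
```

The same pixels are consistent with a 10 cm object at half a metre and a 1.6 m object at eight metres, which is why some extra cue (depth samples, scene context, learned size priors) is needed to pick one.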
Existing approaches try to solve 2D detection and 3D estimation simultaneously (CuTR, Cube R-CNN). But this couples two very different problems: recognizing what something is (semantics) and understanding where it is in 3D (geometry). Boxer's idea: decouple them.
Drag the depth slider to see how the same 2D box corresponds to different 3D boxes. The orange rectangle is fixed in 2D — but the 3D box it represents changes drastically with depth.
Boxer's core insight is a clean decomposition: separate 2D detection from 3D lifting.
Existing 3D detectors try to do everything at once. They take a raw image and produce 3D boxes end-to-end. This sounds elegant, but it creates a painful dependency: you need 3D-annotated training data to teach the model both what objects look like and where they are in 3D. Such data is extremely expensive to collect.
Meanwhile, 2D detectors have gotten spectacularly good. Models like DETIC, OWLv2, and SAM3 can detect thousands of object categories in the wild — far more than any 3D dataset covers. They were trained on billions of internet images with cheap 2D labels.
Boxer's decomposition exploits this asymmetry:
This also makes the system open-world by inheritance. If your 2D detector can find "ergonomic keyboard" or "sourdough bread," BoxerNet can lift those detections to 3D — even though it has never seen those categories in its 3D training set. The geometry of lifting doesn't depend on the object's class.
BoxerNet is a transformer that takes three inputs and produces 7-DoF 3D bounding boxes. Let's trace the data flow.
The encoder fuses three types of information into a single set of tokens:
The combined tokens pass through transformer encoder layers with self-attention, letting the model reason about global spatial relationships.
The decoder receives one query per 2D detection. Each query is constructed from: (1) the 2D bounding box corners (x1, y1, x2, y2) encoded via sinusoidal positional encoding into a 256-dim vector, plus (2) RoI-pooled image features from the DINOv3 feature map cropped to that box region. The decoder applies cross-attention from these queries to the encoder tokens — each 2D box "looks at" the relevant parts of the scene to gather 3D information.
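To make the box-corner encoding concrete, here is a minimal sketch of a sinusoidal encoding that maps four normalized corner coordinates to a 256-dim vector. The frequency schedule, the temperature of 10000, and the normalization of coordinates to [0, 1] are assumptions borrowed from common DETR-style encodings; the text does not specify which variant BoxerNet uses.

```python
import math


def sinusoidal_box_encoding(box, dim=256, temperature=10000.0):
    """Encode (x1, y1, x2, y2), assumed normalized to [0, 1], as a dim-vector."""
    if dim % (2 * len(box)) != 0:
        raise ValueError("dim must be divisible by 2 * number of coordinates")
    pairs = dim // (2 * len(box))  # 32 sin/cos pairs per coordinate when dim=256
    encoding = []
    for value in box:
        for i in range(pairs):
            freq = temperature ** (-i / pairs)      # geometrically decreasing frequency
            angle = 2.0 * math.pi * value * freq
            encoding.extend((math.sin(angle), math.cos(angle)))
    return encoding


query_pos = sinusoidal_box_encoding((0.31, 0.21, 0.44, 0.30))
print(len(query_pos))  # 4 coordinates * 32 pairs * 2 values
```

In a full model this vector would be combined with the RoI-pooled image features to form the decoder query; that combination step is omitted here.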
Each query's final embedding (the cross-attention output, a 1024-dim vector we call the box latent) is passed through four parallel 2-layer MLPs (128 hidden dims, ReLU):
The three input streams merge in the encoder. 2D box queries attend to encoder tokens via cross-attention. Output heads produce 7-DoF boxes.
Depth is the strongest signal for 3D lifting — if you know how far away each pixel is, placing a 3D box becomes much easier. But depth comes in wildly different forms:
Most existing methods require dense depth. They feed a full depth image through a ViT encoder — expensive, and impossible when depth is sparse or absent.
Boxer takes a radically simpler approach. For each image patch (matching the ViT grid of 16×16 pixels), it computes the median depth of all depth samples that fall within that patch. If no depth samples land in a patch, it gets a special "no depth" token (value −1).
This single scalar per patch is then encoded through a 2-layer MLP (128 hidden dims, ReLU) and concatenated to the image+ray token for that patch.
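The per-patch median can be sketched in a few lines. The input format (a list of `(u, v, depth)` samples in pixel coordinates) and the function name are assumptions for illustration; only the 16×16 patch size, the median, and the −1 sentinel come from the description above.

```python
import math
from collections import defaultdict
from statistics import median

NO_DEPTH = -1.0  # sentinel for patches with no depth samples


def patch_median_depth(samples, image_w, image_h, patch=16):
    """samples: iterable of (u, v, depth_m). Returns a rows x cols grid of medians."""
    cols, rows = image_w // patch, image_h // patch
    buckets = defaultdict(list)
    for u, v, depth in samples:
        col, row = math.floor(u / patch), math.floor(v / patch)
        # Ignore samples outside the image or with invalid depth.
        if 0 <= col < cols and 0 <= row < rows and depth > 0:
            buckets[(row, col)].append(depth)
    return [
        [median(buckets[(r, c)]) if (r, c) in buckets else NO_DEPTH for c in range(cols)]
        for r in range(rows)
    ]


# A 32 x 32 image has a 2 x 2 patch grid; only the top row receives depth samples.
sparse_points = [(3, 4, 1.0), (10, 12, 3.0), (5, 5, 2.0), (20, 3, 4.0)]
grid = patch_median_depth(sparse_points, image_w=32, image_h=32)
print(grid)
```

The same function accepts a dense depth map (one sample per pixel), a sparse SLAM point cloud projected into the image, or an empty list, in which case every patch carries the sentinel.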
This design has three key advantages:
Not all 3D predictions are equally reliable. A coffee mug sitting on a clear table with dense depth data? Easy — the model should be confident. A partially occluded chair seen from a weird angle with no depth? Hard — the model should admit it's uncertain.
Boxer handles this with aleatoric uncertainty — uncertainty that comes from the data itself (noise, occlusion, ambiguity), not from the model's lack of training.
For each predicted 3D box, BoxerNet outputs both the box parameters and an uncertainty scalar σ̂. The training loss for each box is:

L = Chamfer · exp(−σ̂) + σ̂
Let's unpack this. There are two terms pulling in opposite directions:
This is the same principle as the learned loss attenuation of Kendall & Gal (2017), applied here to each predicted box. The Chamfer distance is computed between the 8 corners of the predicted and ground-truth boxes, giving a smooth, rotation-aware loss.
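A minimal sketch of the loss, written as Chamfer · exp(−σ̂) + σ̂. The exact Chamfer variant (mean of nearest-corner distances in both directions, unsquared) is an assumption; the text only says it is computed between the 8 corners. Setting the derivative with respect to σ̂ to zero gives σ̂ = ln(Chamfer), so for a fixed error the loss is minimized when the predicted uncertainty tracks the log of that error.

```python
import math


def chamfer_distance(corners_a, corners_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(corners_a, corners_b) + one_way(corners_b, corners_a)


def uncertainty_loss(chamfer, sigma_hat):
    """Attenuated loss: the error is down-weighted by exp(-sigma), sigma itself is penalized."""
    return chamfer * math.exp(-sigma_hat) + sigma_hat


# Unit cube versus the same cube shifted 0.1 along x (illustrative values).
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
shifted = [(x + 0.1, y, z) for x, y, z in cube]
print(chamfer_distance(cube, shifted))

# For a fixed Chamfer error, sweep sigma: the minimum sits at ln(chamfer).
chamfer = math.e
for sigma in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(sigma, round(uncertainty_loss(chamfer, sigma), 4))
```

Raising σ̂ above ln(Chamfer) costs more through the additive penalty than it saves through attenuation, which is what stops the model from declaring everything uncertain.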
Adjust the chamfer loss (how bad the prediction is) and the uncertainty σ̂. Watch how the total loss changes. Find the sweet spot: when chamfer is high, raising σ̂ helps — but only to a point.
A 3D bounding box in the real world has many possible parameterizations. Boxer uses a gravity-aligned 7-DoF representation — seven numbers that fully describe a box's position, size, and orientation:
| Parameter | Symbol | Meaning |
|---|---|---|
| Center X | x | Left-right position in world frame |
| Center Y | y | Up-down position (gravity direction) |
| Center Z | z | Depth (distance from camera) |
| Width | w | Extent along world X axis |
| Height | h | Extent along gravity axis |
| Depth extent | d | Extent along world Z axis |
| Yaw | θ | Rotation around the gravity (Y) axis |
Most indoor objects rest upright on horizontal surfaces. Chairs, tables, monitors, and mugs all have a natural "up" direction. By assuming the box is aligned with gravity (no pitch or roll), Boxer reduces rotation from 3 DoF (roll, pitch, yaw) to 1 DoF (yaw), a much easier prediction target.
The gravity direction is known from the device's IMU (inertial measurement unit), so the world frame is established before BoxerNet even runs.
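The seven parameters can be expanded into the eight corners of the box as sketched below. The axis convention (Y along gravity, yaw as a rotation about Y) follows the table; the sign convention of the rotation and the function name are assumptions.

```python
import math


def box_corners(x, y, z, w, h, d, yaw):
    """Eight corners of a gravity-aligned box; yaw rotates about the Y (gravity) axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                lx, ly, lz = sx * w, sy * h, sz * d  # corner in the box's local frame
                # Rotate about Y, then translate to the box center.
                corners.append((x + c * lx + s * lz, y + ly, z - s * lx + c * lz))
    return corners


# A 2 x 4 x 6 box at the origin, first unrotated, then turned by 90 degrees.
for yaw in (0.0, math.pi / 2):
    pts = box_corners(0.0, 0.0, 0.0, 2.0, 4.0, 6.0, yaw)
    extent_x = max(p[0] for p in pts) - min(p[0] for p in pts)
    extent_y = max(p[1] for p in pts) - min(p[1] for p in pts)
    print(f"yaw={yaw:.2f}: extent along X = {extent_x:.1f}, along Y = {extent_y:.1f}")
```

Because there is no pitch or roll, the Y extent always equals the height, whatever the yaw.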
BoxerNet predicts 3D boxes in the camera coordinate frame, then transforms them to the gravity-aligned world frame using the known camera pose. The center (x, y, z) is predicted as a 3D offset from the camera ray passing through the center of the 2D bounding box. This gives the model a strong prior — the 3D center should be "somewhere along the ray" through the 2D box center.
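The ray prior can be illustrated with a pinhole camera. The sketch below computes the ray through the 2D box center and shows that any point along it projects back to that center. How BoxerNet picks the anchor point on the ray is not stated in the text, so `center_from_ray` (distance along the ray plus an offset) is a hypothetical parameterization, and the intrinsics are made-up values.

```python
import math


def center_ray(box_2d, fx, fy, cx, cy):
    """Unit direction of the camera ray through the center of a 2D box."""
    x1, y1, x2, y2 = box_2d
    u, v = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    direction = ((u - cx) / fx, (v - cy) / fy, 1.0)
    norm = math.sqrt(sum(c * c for c in direction))
    return tuple(c / norm for c in direction)


def center_from_ray(ray, distance_m, offset=(0.0, 0.0, 0.0)):
    """Assumed parameterization: a point on the ray plus a small 3D offset."""
    return tuple(distance_m * r + o for r, o in zip(ray, offset))


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    X, Y, Z = point
    return fx * X / Z + cx, fy * Y / Z + cy


intrinsics = dict(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
ray = center_ray((300.0, 200.0, 420.0, 290.0), **intrinsics)
for distance in (0.5, 2.0, 8.0):
    print(distance, project(center_from_ray(ray, distance), **intrinsics))
```

With a zero offset every distance lands on the same pixel, so the network only has to resolve the position along the ray and a small correction around it.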
Adjust the 7 parameters to see how they define a 3D bounding box. The wireframe is shown in a simple perspective projection.
BoxerNet processes one frame at a time. But AR and robotics scenarios involve video — the camera moves, and the same object is seen from many angles. How do you merge all those per-frame 3D boxes into a single coherent scene?
Frame-by-frame predictions are noisy. The same chair might get slightly different 3D boxes from different viewpoints. A false positive might appear in one frame but not others. And the 2D detector might assign different class names to the same object across frames ("armchair" vs "chair").
Each per-frame 3D box is transformed from camera coordinates to a global world coordinate frame using the known camera pose (from SLAM or device tracking). Now all boxes from all frames live in the same coordinate system.
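The camera-to-world step is a rigid transform. In the sketch below the pose is given as a rotation matrix `R_wc` and translation `t_wc` mapping camera coordinates to world coordinates; those names and the example pose are assumptions, since the text does not say how poses are stored.

```python
def camera_to_world(point_cam, R_wc, t_wc):
    """Apply p_world = R_wc @ p_cam + t_wc to a single 3D point."""
    return tuple(
        sum(R_wc[i][j] * point_cam[j] for j in range(3)) + t_wc[i]
        for i in range(3)
    )


# Assumed pose: camera turned 90 degrees about the Y axis, located at (1, 0, 2).
R_wc = [
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
]
t_wc = (1.0, 0.0, 2.0)

# A box center 3 m straight ahead of the camera.
print(camera_to_world((0.0, 0.0, 3.0), R_wc, t_wc))
```

Applying the same transform to all eight corners of each per-frame box places every detection in the shared world frame.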
Boxer builds a graph where each node is a per-frame 3D detection. Two nodes are connected with an edge if:
The combination of geometric overlap and semantic similarity is crucial. Two boxes might overlap in 3D (a monitor and a laptop stacked on it) but have different semantics. Or two identical-looking chairs might be far apart in 3D.
Connected components of the graph represent the same physical object. Within each component, the boxes are merged using confidence-weighted averaging: boxes with a higher 3D confidence score (s3D) contribute more to the final center, extents, and yaw. A final 3D NMS pass with an IoU threshold of 0.5 removes any remaining duplicates.
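A compact sketch of the grouping and merging steps. It assumes the graph edges have already been computed from the geometric and semantic tests, finds connected components with union-find, and merges each component by confidence-weighted averaging. Several details are assumptions for illustration: the dictionary layout, the circular mean for yaw (which ignores the 180° symmetry of boxes), and keeping the maximum score for the merged box. The final NMS pass is omitted.

```python
import math


def connected_components(num_nodes, edges):
    """Union-find over detection indices; returns a list of index groups."""
    parent = list(range(num_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(num_nodes):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def merge_boxes(boxes):
    """Confidence-weighted average of center, size, and yaw for one component."""
    total = sum(b["score"] for b in boxes)
    center = tuple(sum(b["score"] * b["center"][k] for b in boxes) / total for k in range(3))
    size = tuple(sum(b["score"] * b["size"][k] for b in boxes) / total for k in range(3))
    # Average yaw on the circle so that angles near +pi and -pi do not cancel out.
    yaw = math.atan2(
        sum(b["score"] * math.sin(b["yaw"]) for b in boxes),
        sum(b["score"] * math.cos(b["yaw"]) for b in boxes),
    )
    return {"center": center, "size": size, "yaw": yaw,
            "score": max(b["score"] for b in boxes)}


detections = [
    {"center": (1.0, 0.0, 2.0), "size": (0.5, 0.9, 0.5), "yaw": 0.0, "score": 0.9},
    {"center": (1.2, 0.0, 2.0), "size": (0.5, 0.9, 0.5), "yaw": 0.2, "score": 0.3},
    {"center": (4.0, 0.0, 1.0), "size": (1.0, 0.7, 2.0), "yaw": 1.0, "score": 0.8},
]
edges = [(0, 1)]  # detections 0 and 1 passed both the overlap and similarity tests

objects = [merge_boxes([detections[i] for i in group])
           for group in connected_components(len(detections), edges)]
for obj in objects:
    print(obj)
```

The first merged center lands closer to the high-confidence detection than a plain average would, which is the intended effect of weighting by score.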
Boxer was trained on a large dataset spanning several devices and depth modalities. The key numbers:
| Metric | Value |
|---|---|
| Unique 3D bounding boxes | 1.22 million |
| Device types | 4 (Aria glasses, Quest headset, iPad, Azure Kinect) |
| Depth modalities | Dense (Kinect), sparse (SLAM points), none |
| Scenes | Indoor environments (offices, homes, labs) |
Collecting 1.22M 3D annotations is a massive effort. The pipeline works as follows:
Not every 3D object is visible in every frame. Boxer uses a ray-casting visibility check against the 3D mesh: if fewer than a threshold fraction of the box's surface is visible, the annotation is excluded from that frame. This prevents training on heavily occluded objects where the 3D box would be nearly impossible to predict.
During training, Boxer applies standard 2D augmentations (random crop, color jitter, horizontal flip) and also jitters the 2D input boxes by up to 20% of their size. This simulates the noise that real 2D detectors produce — their boxes are never perfectly tight.
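Box jitter of this kind might look like the sketch below. The 20% bound comes from the text; drawing an independent uniform shift for each edge, scaled by the box width or height, is an assumption about how the jitter is sampled.

```python
import random


def jitter_box(box, max_frac=0.2, rng=random):
    """Shift each edge of (x1, y1, x2, y2) by up to max_frac of the box size."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1

    def shift(extent):
        return rng.uniform(-max_frac, max_frac) * extent

    return (x1 + shift(w), y1 + shift(h), x2 + shift(w), y2 + shift(h))


rng = random.Random(0)  # seeded only so the example is repeatable
gt_box = (100.0, 50.0, 200.0, 130.0)
for _ in range(3):
    print(tuple(round(c, 1) for c in jitter_box(gt_box, rng=rng)))
```

With `max_frac` below 0.5 the jittered box can never invert, so no clamping or re-sorting of corners is needed.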
Boxer is evaluated on two challenging benchmarks that test different scenarios:
NymeriaPlus: egocentric video from Aria glasses, with sparse point-cloud depth from visual-inertial SLAM. This is the harder of the two settings: limited depth, a moving camera, and motion blur.
| Method | 2D Detector | mAP ↑ |
|---|---|---|
| CuTR (end-to-end baseline) | GT 2D boxes | 0.010 |
| BoxerNet | GT 2D boxes | 0.532 |
| BoxerNet | DETIC | 0.254 |
| BoxerNet | SAM3 | 0.278 |
The gap is large: with ground-truth 2D boxes, BoxerNet reaches roughly 53× the mAP of CuTR (0.532 vs. 0.010). Even with noisy real detectors (DETIC, SAM3), BoxerNet scores well above CuTR, which is itself given ground-truth boxes.
Indoor scenes captured with Azure Kinect, providing dense depth. This is the easier setting for geometry.
| Method | 2D Detector | mAP ↑ |
|---|---|---|
| CuTR | GT 2D boxes | 0.250 |
| BoxerNet | GT 2D boxes | 0.412 |
| BoxerNet | SAM3 | 0.270 |
What matters most? The ablation study on NymeriaPlus (with GT 2D boxes) reveals:
| Ablation | mAP | Change |
|---|---|---|
| Full BoxerNet | 0.518 | — |
| Remove point cloud depth | 0.279 | −46% |
| Remove aleatoric uncertainty | 0.485 | −6.4% |
| Remove camera ray features | 0.495 | −4.4% |
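The "Change" column is the relative drop with respect to the full model. A quick check of that arithmetic, using only the mAP values from the table:

```python
def relative_change(full_map, ablated_map):
    """Relative change in percent with respect to the full model."""
    return 100.0 * (ablated_map - full_map) / full_map


full = 0.518
ablations = {
    "Remove point cloud depth": 0.279,
    "Remove aleatoric uncertainty": 0.485,
    "Remove camera ray features": 0.495,
}
for name, value in ablations.items():
    print(f"{name}: {relative_change(full, value):+.1f}%")
```

Removing depth costs nearly half of the mAP, while the other two components each account for a few percent.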
mAP comparison between BoxerNet and CuTR across the two benchmarks (using GT 2D boxes). Higher is better.
Boxer sits at the intersection of several important research directions. Let's map where it fits.
BoxerNet borrows the transformer encoder-decoder architecture from DETR, but with a crucial difference: DETR's object queries are learned parameters that must discover objects from scratch. BoxerNet's queries are given by the 2D detector — the model knows where objects are in 2D and only needs to lift them to 3D. This makes the decoder's job much easier.
Cube R-CNN and CuTR are end-to-end 3D detectors that jointly detect and estimate 3D boxes. Boxer's decomposition shows that separating these two tasks leads to dramatically better performance, especially in the open-world setting where 2D detectors have a massive vocabulary advantage.
Models like Depth Anything and ZoeDepth predict per-pixel depth from a single image. Boxer's approach is complementary: if you have a monocular depth estimator, you could use its output as Boxer's depth input. The median-patch encoding would handle the estimated depth the same way it handles sensor depth.
SpatialVLM and similar vision-language models can reason about 3D spatial relationships in natural language. Boxer provides the grounding: metric 3D bounding boxes that anchor spatial reasoning to physical reality.
| Aspect | Boxer |
|---|---|
| Input | Image + 2D boxes + optional depth |
| Output | 7-DoF 3D box + uncertainty per detection |
| Backbone | DINOv3 ViT-Large |
| Decoder queries | From 2D detector (not learned) |
| Depth encoding | Median depth per patch (MLP) |
| Loss | Chamfer · exp(−σ̂) + σ̂ |
| Fusion | 3D IoU + CLIP similarity graph + NMS |
| Training data | 1.22M 3DBBs, 4 device types |
| Key result | 53× mAP vs CuTR on NymeriaPlus |
| Trainable params | ~25M (backbone frozen) |
| Inference speed | ~20 ms on RTX 4090 (960×960, bf16) |
| Training | 2 weeks, 16× H100, AdamW, cosine LR |