DeTone, Shen, Zhang, Ma, Straub, Newcombe, Engel — Meta Reality Labs, 2026

Robust Lifting of 2D Boxes to 3D

Decompose open-world 3D detection: let existing 2D detectors handle semantics, then lift their boxes into metric 3D with a single transformer. Works with any depth source — or none at all.

Prerequisites: 2D object detection + Camera geometry basics + Transformers (attention)

Chapter 0: The Problem

You have an AR headset. A 2D detector — DETIC, OWLv2, SAM3 — can see a coffee mug on your desk and draw a 2D bounding box around it. But your robot arm needs to pick it up. For that, you need to know where that mug is in 3D: how far away, how wide, how tall, and which way it's rotated.

This is the 2D-to-3D lifting problem: given a 2D bounding box in an image, produce a 3D bounding box in the real world.

Why is this hard? A single 2D box is deeply ambiguous. A small box near the top of the image could be a small object nearby or a large object far away. The 2D projection throws away depth information entirely.

The fundamental ambiguity: A 2D bounding box is a projection of infinitely many possible 3D boxes. A 40×40 pixel square could be a tennis ball at 1 meter or a beach ball at 10 meters. Without additional information — context, depth, learned priors — the problem is ill-posed.
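The tennis-ball versus beach-ball example follows directly from pinhole projection, where an object of physical size S at depth Z spans roughly f·S/Z pixels. The sketch below illustrates this; the 600-pixel focal length and the ball diameters are assumptions chosen for illustration, not values from the source.

```python
def projected_size_px(size_m: float, depth_m: float, focal_px: float) -> float:
    """Approximate image extent (pixels) of a fronto-parallel object under a pinhole camera."""
    return focal_px * size_m / depth_m


FOCAL_PX = 600.0  # assumed focal length in pixels (illustrative only)

# A ~6.7 cm tennis ball at 1 m and a ~67 cm beach ball at 10 m.
tennis_ball = projected_size_px(size_m=0.067, depth_m=1.0, focal_px=FOCAL_PX)
beach_ball = projected_size_px(size_m=0.67, depth_m=10.0, focal_px=FOCAL_PX)

# Both objects cover about the same number of pixels, so the 2D box alone cannot tell them apart.
print(round(tennis_ball, 1), round(beach_ball, 1))
```

Scaling size and depth by the same factor leaves the projected extent unchanged, which is exactly the family of 3D boxes that a single 2D box cannot distinguish.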

Existing approaches try to solve 2D detection and 3D estimation simultaneously (CuTR, Cube R-CNN). But this couples two very different problems: recognizing what something is (semantics) and understanding where it is in 3D (geometry). Boxer's idea: decouple them.

Full Data Flow at a Glance: RGB image I ∈ ℝ^(H×W×3) + pinhole intrinsics (fx, fy, cx, cy) + 6-DoF pose T_world←cam + N 2D bounding boxes (x1, y1, x2, y2, score) → DINOv3 features F_img ∈ ℝ^(H'×W'×1024) | camera rays → F_ray ∈ ℝ^(H'×W'×3) | depth → median per patch → F_depth ∈ ℝ^(H'×W'×1) | concat → F_enc ∈ ℝ^(H'×W'×1028) → self-attention encoder → cross-attention decoder with 2D box queries → per box: 7-DoF (x, y, z, w, h, d, θ) + uncertainty σ̂ + confidence s_3D.

Interactive figure: 2D → 3D Ambiguity. A depth slider (shown at 3.0 m) moves the 3D box along the viewing direction. The orange rectangle stays fixed in 2D, but the 3D box it represents changes drastically with depth.

Why is lifting a 2D bounding box to 3D fundamentally difficult?

Chapter 1: The Key Insight

Boxer's core insight is a clean decomposition: separate 2D detection from 3D lifting.

Existing 3D detectors try to do everything at once. They take a raw image and produce 3D boxes end-to-end. This sounds elegant, but it creates a painful dependency: you need 3D-annotated training data to teach the model both what objects look like and where they are in 3D. Such data is extremely expensive to collect.

Meanwhile, 2D detectors have gotten spectacularly good. Models like DETIC, OWLv2, and SAM3 can detect thousands of object categories in the wild — far more than any 3D dataset covers. They were trained on billions of internet images with cheap 2D labels.

Boxer's decomposition exploits this asymmetry:

Step 1: 2D Detection
Use any off-the-shelf open-world 2D detector (DETIC, OWLv2, SAM3). Get 2D bounding boxes + class labels. No 3D annotations needed.
Step 2: 3D Lifting
Feed image + 2D boxes into BoxerNet. It outputs 7-DoF 3D bounding boxes + uncertainty. Trained on 3D data, but doesn't need to learn semantics.
Step 3: Multi-View Fusion
Merge per-frame 3D boxes across time using 3D IoU + semantic similarity. Produces a coherent scene-level 3D map.
Why this matters: The decomposition lets each component use the best available data. 2D detectors leverage billions of web images. BoxerNet leverages millions of 3D annotations from AR/VR devices. Neither needs the other's data. You can swap in a better 2D detector tomorrow without retraining BoxerNet.

This also makes the system open-world by inheritance. If your 2D detector can find "ergonomic keyboard" or "sourdough bread," BoxerNet can lift those detections to 3D — even though it has never seen those categories in its 3D training set. The geometry of lifting doesn't depend on the object's class.

What happens when inputs degrade: With no SLAM/depth available, all F_depth patches are set to −1 (the "no depth" token), so the network relies purely on visual and ray features. With no device pose, T_world←cam defaults to identity, so 3D boxes remain in the camera frame rather than the world frame and multi-view fusion becomes impossible. With a noisy 2D detector (SAM3 instead of GT boxes), NymeriaPlus mAP drops from 0.532 to 0.278: accuracy roughly halves, but the system degrades gracefully rather than catastrophically.
What is the key advantage of separating 2D detection from 3D lifting?

Chapter 2: The BoxerNet Architecture

BoxerNet is a transformer that takes three inputs and produces 7-DoF 3D bounding boxes. Let's trace the data flow.

Encoder: Building a Rich Scene Representation

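The encoder fuses three types of information into a single set of tokens: DINOv3 image features (F_img), per-patch camera rays (F_ray), and per-patch median depth (F_depth).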

The combined tokens pass through transformer encoder layers with self-attention, letting the model reason about global spatial relationships.
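To make the ray stream concrete, here is a minimal sketch of a per-patch ray grid for a pinhole camera. It assumes one unit ray through each 16×16 patch center; the function name, the patch-center convention, and the example intrinsics are illustrative assumptions rather than details confirmed by the source.

```python
import math


def patch_rays(width, height, fx, fy, cx, cy, patch=16):
    """Return a (rows x cols) grid of unit ray directions, one per patch center."""
    rays = []
    for row in range(height // patch):
        ray_row = []
        for col in range(width // patch):
            # Pixel coordinates of the patch center.
            u = (col + 0.5) * patch
            v = (row + 0.5) * patch
            # Back-project through the pinhole model: K^-1 [u, v, 1]^T.
            x = (u - cx) / fx
            y = (v - cy) / fy
            norm = math.sqrt(x * x + y * y + 1.0)
            ray_row.append((x / norm, y / norm, 1.0 / norm))
        rays.append(ray_row)
    return rays


# Assumed intrinsics for a 960x960 image (illustrative values).
rays = patch_rays(960, 960, fx=600.0, fy=600.0, cx=480.0, cy=480.0)
print(len(rays), len(rays[0]))  # 60 x 60 grid, matching the ViT patch layout
```

Each patch then carries 3 ray channels next to its image and depth channels, which matches the 1024 + 3 + 1 = 1028 channel count in the data flow summary.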

Frozen vs. Trained: DINOv3 backbone: frozen (pretrained on 142M images — retraining would be wasteful and destroy generalization). Self-attention encoder layers: trained. Cross-attention decoder layers: trained. Ray MLP + Depth MLP: trained. Prediction heads: trained. Total trainable parameters: ~25M. This is a thin learnable shell around a massive frozen feature extractor.

Decoder: Querying with 2D Boxes

The decoder receives one query per 2D detection. Each query is constructed from: (1) the 2D bounding box corners (x1, y1, x2, y2) encoded via sinusoidal positional encoding into a 256-dim vector, plus (2) RoI-pooled image features from the DINOv3 feature map cropped to that box region. The decoder applies cross-attention from these queries to the encoder tokens — each 2D box "looks at" the relevant parts of the scene to gather 3D information.
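The sinusoidal part of the query can be sketched as follows. This is one plausible layout, assuming each of the four box coordinates is normalized by the image size and given 64 sin/cos features (4 × 64 = 256); the frequency schedule and the `temperature` parameter are assumptions borrowed from common transformer practice, not details stated in the source.

```python
import math


def sinusoidal_box_encoding(box, image_size, dim=256, temperature=10000.0):
    """Encode (x1, y1, x2, y2) as `dim` sin/cos features, dim/4 per coordinate."""
    per_coord = dim // 4      # 64 features per coordinate
    half = per_coord // 2     # 32 frequencies, each contributing a sin and a cos
    width, height = image_size
    normalized = (box[0] / width, box[1] / height, box[2] / width, box[3] / height)
    features = []
    for value in normalized:
        for i in range(half):
            freq = 1.0 / (temperature ** (i / half))
            angle = 2.0 * math.pi * value * freq
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


encoding = sinusoidal_box_encoding((100, 200, 300, 400), image_size=(960, 960))
print(len(encoding))  # 256
```

The RoI-pooled DINOv3 features described above would be combined with this vector to form the full query; that step is omitted here because it depends on the backbone.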

Output Heads

Each query's final embedding (the cross-attention output, a 1024-dim vector we call the box latent) is passed through four parallel 2-layer MLPs (128 hidden dims, ReLU), which together produce the 7-DoF box parameters and the uncertainty σ̂.

Compared to DETR: DETR uses learned object queries to discover what and where objects are. BoxerNet queries are given — they come from the 2D detector. This means BoxerNet doesn't need to learn detection; it only needs to learn geometry. The queries carry 2D spatial information that anchors the cross-attention to the right image region.
BoxerNet Pipeline

The three input streams merge in the encoder. 2D box queries attend to encoder tokens via cross-attention. Output heads produce 7-DoF boxes.

What are the three types of input tokens that BoxerNet's encoder fuses?

Chapter 3: Depth Encoding

Depth is the strongest signal for 3D lifting: if you know how far away each pixel is, placing a 3D box becomes much easier. But depth comes in wildly different forms: dense (e.g., Kinect or iPad LiDAR), sparse (SLAM points), or absent entirely.

Most existing methods require dense depth. They feed a full depth image through a ViT encoder — expensive, and impossible when depth is sparse or absent.

Boxer's Solution: Median Depth Patches

Boxer takes a radically simpler approach. For each image patch (matching the ViT grid of 16×16 pixels), it computes the median depth of all depth samples that fall within that patch. If no depth samples land in a patch, it gets a special "no depth" token (value −1).

This single scalar per patch is then encoded through a 2-layer MLP (128 hidden dims, ReLU) and concatenated to the image+ray token for that patch.

Concrete numbers: A sparse SLAM point cloud from Aria glasses has ~10K 3D points. Projected onto a 960×960 image with a 60×60 patch grid (3600 patches of 16×16 pixels each), most patches receive 0–3 depth points, and many receive none. The median is robust to this extreme sparsity: patches with points get a signal, patches without get −1. With dense depth (e.g., iPad LiDAR at 256×192 projected onto the same grid), every patch receives ~10–20 points. With no depth at all, all 3600 patches are −1 and the network learns to ignore F_depth entirely.
Why median (not mean)? The median is robust to outliers. If a patch contains points on both a table (1.5m) and a wall behind it (3.0m), the median picks the dominant surface rather than averaging them. It's also trivially cheap to compute — sort and pick the middle value. A mean would be skewed by a single noisy point at 50m from a SLAM artifact.
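A minimal sketch of the per-patch median, assuming the depth samples have already been projected to pixel coordinates (u, v) with metric depth z. The function name and the tiny 64×32 example image are illustrative assumptions, not taken from the source.

```python
from statistics import median

NO_DEPTH = -1.0  # the "no depth" token value described above


def median_depth_patches(points_uvz, width, height, patch=16):
    """Bucket projected depth samples (u, v, z) into patches and take the median per patch."""
    cols, rows = width // patch, height // patch
    buckets = {}
    for u, v, z in points_uvz:
        col, row = int(u // patch), int(v // patch)
        # Keep only samples that land inside the image and have positive depth.
        if 0 <= col < cols and 0 <= row < rows and z > 0:
            buckets.setdefault((row, col), []).append(z)
    return [
        [median(buckets[(r, c)]) if (r, c) in buckets else NO_DEPTH for c in range(cols)]
        for r in range(rows)
    ]


samples = [
    (5, 5, 1.5), (6, 7, 1.6), (9, 3, 50.0),  # two table points plus one 50 m outlier, same patch
    (40, 8, 3.0),                            # a single wall point in another patch
]
grid = median_depth_patches(samples, width=64, height=32)
print(grid[0])  # first patch row: the outlier does not dominate the first patch
```

In the first patch the median is 1.6 m, whereas the mean of the same three samples would be about 17.7 m, which is the failure mode the median avoids.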

This design has three key advantages:

Ablation result: On NymeriaPlus (egocentric, sparse depth), adding point cloud depth raises mAP from 0.279 to 0.518 — nearly doubling performance. Depth is hugely valuable, and Boxer's encoding captures it efficiently.
Why does Boxer use median depth per patch instead of feeding a full depth image through a ViT?

Chapter 4: Uncertainty-Aware Training

Not all 3D predictions are equally reliable. A coffee mug sitting on a clear table with dense depth data? Easy — the model should be confident. A partially occluded chair seen from a weird angle with no depth? Hard — the model should admit it's uncertain.

Boxer handles this with aleatoric uncertainty — uncertainty that comes from the data itself (noise, occlusion, ambiguity), not from the model's lack of training.

The Loss Function

For each predicted 3D box, BoxerNet outputs both the box parameters and an uncertainty scalar σ̂. The training loss for each box is:

L = L_chamfer · exp(−σ̂) + σ̂

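Let's unpack this. There are two terms pulling in opposite directions. The first, L_chamfer · exp(−σ̂), shrinks as σ̂ grows, so declaring high uncertainty discounts the geometric error. The second, +σ̂, grows linearly with σ̂, so declaring uncertainty has a cost of its own.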

The equilibrium: If the model predicts a bad box, it can reduce loss by increasing σ̂ — but only up to the point where the σ̂ penalty exceeds the chamfer savings. If the model predicts a good box, low σ̂ gives the best loss. The result: σ̂ naturally calibrates to how hard each sample is.

This is the same principle used in Kendall & Gal (2017) for multi-task learning, adapted here for per-prediction uncertainty. The Chamfer distance is computed between the 8 corners of the predicted and ground-truth boxes, giving a smooth, rotation-aware loss.
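The loss can be sketched in a few lines. This version assumes plain Euclidean nearest-neighbor distances between corner sets, averaged over both directions; whether the actual implementation uses squared distances or a different normalization is not stated here, so treat those choices as assumptions.

```python
import math


def chamfer_distance(corners_a, corners_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    def nearest(p, others):
        return min(math.dist(p, q) for q in others)

    total = sum(nearest(p, corners_b) for p in corners_a)
    total += sum(nearest(q, corners_a) for q in corners_b)
    return total / (len(corners_a) + len(corners_b))


def uncertainty_loss(chamfer, sigma_hat):
    """L = L_chamfer * exp(-sigma_hat) + sigma_hat."""
    return chamfer * math.exp(-sigma_hat) + sigma_hat


# Ground-truth unit cube and a prediction shifted by 10 cm along x.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
shifted = [(x + 0.1, y, z) for x, y, z in cube]
cd = chamfer_distance(cube, shifted)

# For a poor prediction (chamfer = 2.0), a moderate sigma_hat lowers the loss, a large one raises it.
for sigma_hat in (0.0, math.log(2.0), 3.0):
    print(round(sigma_hat, 3), round(uncertainty_loss(2.0, sigma_hat), 3))
```

With 8 corners per box, the two sums cover 16 nearest-neighbor distances in total.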

Architecture detail: The uncertainty head takes the same 1024-dim box latent as the 7-DoF heads; they all share the cross-attention decoder output. It is a 2-layer MLP: 1024 → 128 (ReLU) → 1, and it outputs a single unbounded scalar σ̂ (a log-variance). There is no sigmoid and no clamp; the loss itself provides the regularization. At inference, the 3D confidence is computed as s_3D = exp(−σ̂), and the final detection confidence is (s_2D + s_3D) / 2, so 2D detection quality and 3D lifting quality are scored separately before being averaged.

Interactive figure: Uncertainty-Aware Loss Landscape. Sliders set the chamfer loss (how bad the prediction is; shown at 2.0) and the uncertainty σ̂ (shown at 0.00), and the plot shows the total loss. When chamfer is high, raising σ̂ helps, but only up to a point.

Loss → Architecture reasoning: L = L_chamfer · exp(−σ̂) + σ̂. The Chamfer loss compares the 8 predicted corners to the 8 GT corners (16 nearest-neighbor distances, averaged). The σ̂ head produces its output from the same box latent as the 7-DoF heads. When σ̂ is high, exp(−σ̂) → 0, so the chamfer term vanishes: the model says "I don't know" and is barely penalized for being wrong. But the +σ̂ term grows linearly, so claiming ignorance has a cost. The optimum is σ̂* = ln(L_chamfer), which you can verify in the simulation above.
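The stated optimum can be checked by differentiating the loss with respect to σ̂ while holding the chamfer term fixed. The derivation below assumes L_chamfer > 0 and treats it as a constant for a given prediction.

```latex
\begin{aligned}
\mathcal{L}(\hat\sigma) &= L_{\text{chamfer}}\, e^{-\hat\sigma} + \hat\sigma \\
\frac{\partial \mathcal{L}}{\partial \hat\sigma} &= -L_{\text{chamfer}}\, e^{-\hat\sigma} + 1 = 0
  \;\Longrightarrow\; \hat\sigma^{*} = \ln L_{\text{chamfer}} \\
\frac{\partial^{2} \mathcal{L}}{\partial \hat\sigma^{2}} &= L_{\text{chamfer}}\, e^{-\hat\sigma} > 0
  \quad \text{(the stationary point is a minimum)} \\
\mathcal{L}(\hat\sigma^{*}) &= 1 + \ln L_{\text{chamfer}}
\end{aligned}
```

For example, with L_chamfer = 2.0 the optimum is σ̂* = ln 2 ≈ 0.693 and the loss falls from 2.0 (at σ̂ = 0) to about 1.693.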
Ablation result: Removing aleatoric uncertainty drops NymeriaPlus mAP from 0.518 to 0.485. The uncertainty mechanism therefore accounts for about 6.4% of the full model's mAP (in relative terms), by letting the model focus its capacity on learnable examples.
Why can't the model just set σ̂ = ∞ for every prediction to avoid all chamfer loss?

Chapter 5: The 7-DoF Representation

A 3D bounding box in the real world has many possible parameterizations. Boxer uses a gravity-aligned 7-DoF representation — seven numbers that fully describe a box's position, size, and orientation:

Parameter | Symbol | Meaning
Center X | x | Left-right position in world frame
Center Y | y | Up-down position (gravity direction)
Center Z | z | Depth (distance from camera)
Width | w | Extent along world X axis
Height | h | Extent along gravity axis
Depth extent | d | Extent along world Z axis
Yaw | θ | Rotation around the gravity (Y) axis

Why gravity-aligned?

Most indoor objects sit on surfaces aligned with gravity. Chairs, tables, monitors, and mugs all have a natural "up" direction. By assuming the box is aligned with gravity (no pitch or roll), Boxer reduces the rotation from a full 3-DoF (roll, pitch, yaw) to just 1-DoF (yaw). This is a much easier prediction target.

The gravity direction is known from the device's IMU (inertial measurement unit), so the world frame is established before BoxerNet even runs.
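To see how seven numbers pin down a full box, the sketch below expands a 7-DoF box into its 8 corners. It assumes Y is the gravity axis and that positive yaw rotates about Y in the usual right-handed sense; the sign convention is an assumption, since the source does not specify it.

```python
import math


def box_corners(x, y, z, w, h, d, yaw):
    """Return the 8 corners of a gravity-aligned box; yaw rotates about the Y (gravity) axis."""
    cos_t, sin_t = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                # Corner offset in the box's local frame.
                lx, ly, lz = sx * w, sy * h, sz * d
                # Rotate about Y, then translate to the box center.
                corners.append((
                    x + cos_t * lx + sin_t * lz,
                    y + ly,
                    z - sin_t * lx + cos_t * lz,
                ))
    return corners


# Unit cube 4 m in front of the origin, yawed by 0.40 rad.
corners = box_corners(0.0, 0.0, 4.0, 1.0, 1.0, 1.0, 0.40)
print(len(corners))  # 8
```

Because there is no pitch or roll, the Y coordinate of every corner depends only on the center height and h, which is what makes the yaw-only parameterization so compact.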

Coordinate System

BoxerNet predicts 3D boxes in the camera coordinate frame, then transforms them to the gravity-aligned world frame using the known camera pose. The center (x, y, z) is predicted as a 3D offset from the camera ray passing through the center of the 2D bounding box. This gives the model a strong prior — the 3D center should be "somewhere along the ray" through the 2D box center.

Predicting along the ray: Instead of predicting absolute (x, y, z) coordinates (which vary wildly with camera position), BoxerNet predicts a depth along the camera ray plus small lateral offsets. This is geometrically natural: the 2D box center gives the direction, and the model only needs to work out the distance. Concretely, given the 2D box center (u_c, v_c), the ray direction is d = K^(−1) [u_c, v_c, 1]^T (this ray vector d is distinct from the depth extent d of the box). The center head predicts (Δd, Δx, Δy), where Δd is the depth offset along the ray and Δx, Δy are small lateral corrections. The 3D center is then p = (Δd + d_0) · d̂ + (Δx, Δy, 0), where d̂ is the unit vector along the ray and d_0 is the median depth at that patch (if available). Extents (w, h, d) are predicted in log-space and exponentiated, which ensures they are always positive.
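The ray-based decoding can be sketched as below. The function names are hypothetical, and using d_0 = 0 when no depth is available is an assumption made for this example; the source only says d_0 is used "if available".

```python
import math


def lift_center(box_2d, intrinsics, delta, d0=0.0):
    """Decode a 3D center from (delta_d, delta_x, delta_y) along the ray through the 2D box center."""
    x1, y1, x2, y2 = box_2d
    fx, fy, cx, cy = intrinsics
    uc, vc = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    # K^-1 [uc, vc, 1]^T for a pinhole camera, then normalized to a unit vector.
    ray = ((uc - cx) / fx, (vc - cy) / fy, 1.0)
    norm = math.sqrt(sum(c * c for c in ray))
    unit = tuple(c / norm for c in ray)
    delta_d, delta_x, delta_y = delta
    distance = delta_d + d0
    return (distance * unit[0] + delta_x, distance * unit[1] + delta_y, distance * unit[2])


def decode_extents(log_w, log_h, log_d):
    """Exponentiate log-space predictions so that extents are always positive."""
    return (math.exp(log_w), math.exp(log_h), math.exp(log_d))


# Box centered on the principal point: the ray is the optical axis.
center = lift_center((470, 470, 490, 490), (600.0, 600.0, 480.0, 480.0), (0.5, 0.0, 0.0), d0=1.5)
print(center)
print(decode_extents(0.0, -1.0, 2.0))
```

With a patch median depth of 1.5 m and a predicted offset of 0.5 m, the center lands 2.0 m along the ray, so the network only has to correct the depth prior rather than regress distance from scratch.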

Interactive figure: 7-DoF Box Visualizer. Sliders for X (left-right), Y (up-down), Z (depth), width, height, depth extent, and yaw θ control a wireframe box drawn in a simple perspective projection. Shown values: X 0.0, Y 0.0, Z 4.0, width 1.0, height 1.0, depth extent 1.0, yaw 0.40.

Why does Boxer use only 1 rotation angle (yaw) instead of full 3-DoF rotation (roll, pitch, yaw)?

Chapter 6: Multi-View Fusion

BoxerNet processes one frame at a time. But AR and robotics scenarios involve video — the camera moves, and the same object is seen from many angles. How do you merge all those per-frame 3D boxes into a single coherent scene?

The Challenge

Frame-by-frame predictions are noisy. The same chair might get slightly different 3D boxes from different viewpoints. A false positive might appear in one frame but not others. And the 2D detector might assign different class names to the same object across frames ("armchair" vs "chair").

Step 1: Transform to World Frame

Each per-frame 3D box is transformed from camera coordinates to a global world coordinate frame using the known camera pose (from SLAM or device tracking). Now all boxes from all frames live in the same coordinate system.
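A minimal sketch of this step, assuming the pose is given as a 4×4 rigid transform in row-major nested lists and that a box is moved by transforming its corner points. The helper name and the example pose are illustrative.

```python
def transform_points(T_world_cam, points_cam):
    """Apply a 4x4 rigid transform to a list of 3D points (camera frame -> world frame)."""
    points_world = []
    for px, py, pz in points_cam:
        points_world.append(tuple(
            row[0] * px + row[1] * py + row[2] * pz + row[3]
            for row in T_world_cam[:3]
        ))
    return points_world


# Example pose: camera rotated 90 degrees about the gravity (Y) axis, located at (1, 0, 2).
T_world_cam = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0, 1.0],
]

# A point 4 m straight ahead of the camera ends up 4 m along world X from the camera position.
print(transform_points(T_world_cam, [(0.0, 0.0, 4.0)]))
```

If the pose defaults to identity, the function returns the camera-frame points unchanged, which is why fusion across frames is not possible without tracking.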

Step 2: Build a Similarity Graph

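Boxer builds a graph where each node is a per-frame 3D detection. Two nodes are connected with an edge only if both of the following hold: their 3D IoU is at least 0.25, and their CLIP embedding similarity is at least 0.7.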

The combination of geometric overlap and semantic similarity is crucial. Two boxes might overlap in 3D (a monitor and a laptop stacked on it) but have different semantics. Or two identical-looking chairs might be far apart in 3D.

Step 3: Connected Components + NMS

Connected components of the graph represent the same physical object. Within each component, the boxes are merged using confidence-weighted averaging (boxes with higher s_3D contribute more to the final center, extents, and yaw). A final 3D NMS pass with IoU threshold 0.5 removes any remaining duplicates.

Full fusion data flow: Per-frame 3D boxes (in camera frame) → transform to world frame via T_world←cam from device tracking → compute pairwise 3D IoU (threshold ≥ 0.25 for edge creation) → compute CLIP text embedding similarity between cropped 2D patches (threshold ≥ 0.7) → build undirected graph (edge requires BOTH conditions) → find connected components → confidence-weighted box averaging within each component → 3D NMS (IoU ≥ 0.5) → final scene-level 3D boxes with merged confidence scores.
Temporal consistency for free: Because fusion operates in a persistent world frame, objects naturally accumulate evidence over time. An object seen in 50 frames gets a much more accurate merged box than one seen in 3 frames. And false positives that appear in only 1–2 frames get suppressed by the graph structure — isolated nodes with no edges are discarded if their confidence is below a threshold.
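The graph-and-merge logic can be sketched with a union-find structure. This is a simplified illustration, not the paper's implementation: it assumes axis-aligned boxes (yaw is ignored in the IoU), takes precomputed embedding vectors as input, and omits the final NMS pass and the pruning of isolated low-confidence nodes.

```python
import math


def aabb_iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, cz, w, h, d); yaw is ignored in this sketch."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[i + 3] / 2, b[i] - b[i + 3] / 2)
        hi = min(a[i] + a[i + 3] / 2, b[i] + b[i + 3] / 2)
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    return inter / (a[3] * a[4] * a[5] + b[3] * b[4] * b[5] - inter)


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def fuse(detections, iou_thr=0.25, sim_thr=0.7):
    """Group detections that pass BOTH thresholds, then average boxes weighted by confidence."""
    parent = list(range(len(detections)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(detections)):
        for j in range(i + 1, len(detections)):
            a, b = detections[i], detections[j]
            if aabb_iou_3d(a["box"], b["box"]) >= iou_thr and cosine(a["emb"], b["emb"]) >= sim_thr:
                parent[find(i)] = find(j)

    groups = {}
    for i, det in enumerate(detections):
        groups.setdefault(find(i), []).append(det)

    merged = []
    for members in groups.values():
        total = sum(m["conf"] for m in members)
        box = tuple(sum(m["conf"] * m["box"][k] for m in members) / total for k in range(6))
        merged.append({"box": box, "count": len(members)})
    return merged


detections = [
    {"box": (0.0, 0.0, 2.0, 1.0, 1.0, 1.0), "emb": (1.0, 0.0), "conf": 0.8},  # chair, frame 1
    {"box": (0.1, 0.0, 2.0, 1.0, 1.0, 1.0), "emb": (0.9, 0.1), "conf": 0.4},  # chair, frame 2
    {"box": (0.0, 0.0, 2.0, 1.0, 1.0, 1.0), "emb": (0.0, 1.0), "conf": 0.9},  # overlapping, different semantics
]
scene = fuse(detections)
print(sorted(obj["count"] for obj in scene))
```

The two chair detections merge because they pass both tests, while the third box overlaps them geometrically but stays separate because its embedding differs. A full implementation would also need a circular mean for yaw.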
Why does multi-view fusion use BOTH 3D IoU AND semantic similarity to decide if two detections are the same object?

Chapter 7: Training at Scale

The Boxer system was trained on a dataset of remarkable scale and diversity. Here are the key numbers:

Metric | Value
Unique 3D bounding boxes | 1.22 million
Device types | 4 (Aria glasses, Quest headset, iPad, Azure Kinect)
Depth modalities | Dense (Kinect), sparse (SLAM points), none
Scenes | Indoor environments (offices, homes, labs)

Annotation Pipeline

Collecting 1.22M 3D annotations is a massive effort. The pipeline works as follows:

  1. 3D reconstruction: Each scene is reconstructed into a 3D mesh using multi-view stereo or depth fusion.
  2. 3D annotation: Human annotators place 3D bounding boxes on the mesh. This is done once per scene, not per frame.
  3. Projection to frames: The 3D annotations are automatically projected into every camera frame using known poses. A visibility check ensures only visible objects get projected.
  4. 2D box generation: For training, 2D bounding boxes are computed by projecting the 3D box corners into each frame and taking the tightest enclosing rectangle.
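Step 4 can be sketched as a projection of the eight box corners followed by a min/max. The sketch assumes the corners are already expressed in the camera frame and uses a plain pinhole model; returning None when any corner is behind the camera is a simplification chosen for this example, as is the lack of clipping to the image bounds.

```python
def project_box_to_2d(corners_cam, fx, fy, cx, cy):
    """Project camera-frame 3D corners with a pinhole model; return (x1, y1, x2, y2) or None."""
    us, vs = [], []
    for x, y, z in corners_cam:
        if z <= 0:
            return None  # a corner lies behind the camera; skipped in this sketch
        us.append(fx * x / z + cx)
        vs.append(fy * y / z + cy)
    # Tightest axis-aligned rectangle enclosing all projected corners.
    return (min(us), min(vs), max(us), max(vs))


# Unit cube centered 4 m in front of the camera.
corners_cam = [
    (sx, sy, 4.0 + sz)
    for sx in (-0.5, 0.5) for sy in (-0.5, 0.5) for sz in (-0.5, 0.5)
]
box_2d = project_box_to_2d(corners_cam, fx=600.0, fy=600.0, cx=480.0, cy=480.0)
print(box_2d)
```

The enclosing rectangle is set by the near face of the cube, since nearer corners project farther from the principal point.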

Visibility Computation

Not every 3D object is visible in every frame. Boxer uses a ray-casting visibility check against the 3D mesh: if fewer than a threshold fraction of the box's surface is visible, the annotation is excluded from that frame. This prevents training on heavily occluded objects where the 3D box would be nearly impossible to predict.

Data Augmentation

During training, Boxer applies standard 2D augmentations (random crop, color jitter, horizontal flip) and also jitters the 2D input boxes by up to 20% of their size. This simulates the noise that real 2D detectors produce — their boxes are never perfectly tight.
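One way to implement the box jitter is to perturb each edge independently by a uniform fraction of the box size. The per-edge uniform scheme below is an assumption; the source only states the 20% bound.

```python
import random


def jitter_box(box, max_frac=0.2, rng=random):
    """Shift each box edge independently by up to max_frac of the box width or height."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx1, dx2 = (rng.uniform(-max_frac, max_frac) * w for _ in range(2))
    dy1, dy2 = (rng.uniform(-max_frac, max_frac) * h for _ in range(2))
    # Sorting keeps the box well-formed even if a larger max_frac made edges cross.
    nx1, nx2 = sorted((x1 + dx1, x2 + dx2))
    ny1, ny2 = sorted((y1 + dy1, y2 + dy2))
    return (nx1, ny1, nx2, ny2)


rng = random.Random(0)  # seeded for reproducibility
print(jitter_box((100.0, 100.0, 200.0, 150.0), rng=rng))
```

Training on perturbed boxes like these exposes the lifter to the loose or shifted boxes that real detectors produce, instead of only the tight projections of ground-truth 3D boxes.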

Training compute: Trained for 2 weeks on 16 H100 GPUs with bfloat16 mixed precision. Optimizer: AdamW with weight decay 0.05. Learning rate: cosine decay from 1e-4 to 1e-5 over 500K steps. Batch size: 64 (4 per GPU). Forward pass at inference: ~20ms on RTX 4090 at 960×960 resolution with bfloat16 — fast enough for real-time AR/robotics at 50 FPS (BoxerNet alone, excluding 2D detection).
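The learning-rate schedule above corresponds to a standard cosine decay, sketched here with the stated endpoints. The absence of a warmup phase is an assumption, since none is mentioned.

```python
import math


def cosine_lr(step, total_steps=500_000, lr_max=1e-4, lr_min=1e-5):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for step in (0, 250_000, 500_000):
    print(step, cosine_lr(step))
```

Halfway through training the rate sits at the midpoint of the two endpoints, 5.5e-5, and it flattens out as it approaches 1e-5.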
Device diversity matters: By training on 4 different device types, BoxerNet learns to handle different camera intrinsics, fields of view, and depth characteristics. Aria glasses have a fisheye lens and sparse SLAM points. Kinect has a narrow field of view but dense depth. The model becomes robust to all of them. Critically, during training, depth is randomly dropped (set to all −1) with 20% probability, forcing the model to learn a fallback that works without any depth signal.
Why does Boxer jitter the 2D input boxes during training?

Chapter 8: Results

Boxer is evaluated on two challenging benchmarks that test different scenarios:

NymeriaPlus (Egocentric, Sparse/No Depth)

Egocentric video from Aria glasses. Sparse point cloud depth from visual-inertial SLAM. This is the hardest setting — limited depth, moving camera, ego-motion blur.

Method | 2D Detector | mAP ↑
CuTR (end-to-end baseline) | GT 2D boxes | 0.010
BoxerNet | GT 2D boxes | 0.532
BoxerNet | DETIC | 0.254
BoxerNet | SAM3 | 0.278

The gap is staggering: with perfect 2D boxes, BoxerNet achieves 53× higher mAP than CuTR. Even with noisy real detectors (DETIC, SAM3), Boxer far exceeds what end-to-end methods achieve.

CA-1M (Dense Depth)

Indoor scenes captured with Azure Kinect, providing dense depth. This is the easier setting for geometry.

Method | 2D Detector | mAP ↑
CuTR | GT 2D boxes | 0.250
BoxerNet | GT 2D boxes | 0.412
BoxerNet | SAM3 | 0.270

Key Ablations

What matters most? The ablation study on NymeriaPlus (with GT 2D boxes) reveals:

Ablation | mAP | Change
Full BoxerNet | 0.518 | (reference)
Remove point cloud depth | 0.279 | −46%
Remove aleatoric uncertainty | 0.485 | −6.4%
Remove camera ray features | 0.495 | −4.4%
Depth is king: Removing the point cloud nearly halves performance. This confirms what we'd expect geometrically — depth is the strongest single signal for 3D lifting. But even without any depth, BoxerNet still achieves 0.279 mAP, showing it can learn useful geometric priors from image features alone.
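The percentages in the ablation table are relative changes against the full model's mAP, which a few lines of arithmetic confirm. The variable names are illustrative; the mAP values are the ones reported above.

```python
FULL_MAP = 0.518  # full BoxerNet on NymeriaPlus, from the ablation table

ablations = {
    "no point cloud depth": 0.279,
    "no aleatoric uncertainty": 0.485,
    "no camera ray features": 0.495,
}


def relative_change_pct(ablated_map, full_map=FULL_MAP):
    """Relative change in percent, negative when the ablation hurts."""
    return 100.0 * (ablated_map - full_map) / full_map


changes = {name: round(relative_change_pct(m), 1) for name, m in ablations.items()}
for name, pct in changes.items():
    print(f"{name}: {pct}%")
```

Read the other way, adding depth to the depth-free model multiplies mAP by about 1.86, which is the "nearly doubling" quoted in the depth chapter.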
Results Comparison

mAP comparison between BoxerNet and CuTR across the two benchmarks (using GT 2D boxes). Higher is better.

According to the ablation study, what is the single most important input for BoxerNet's performance?

Chapter 9: Connections

Boxer sits at the intersection of several important research directions. Let's map where it fits.

Relation to DETR

BoxerNet borrows the transformer encoder-decoder architecture from DETR, but with a crucial difference: DETR's object queries are learned parameters that must discover objects from scratch. BoxerNet's queries are given by the 2D detector — the model knows where objects are in 2D and only needs to lift them to 3D. This makes the decoder's job much easier.

Relation to Cube R-CNN / CuTR

Cube R-CNN and CuTR are end-to-end 3D detectors that jointly detect and estimate 3D boxes. Boxer's decomposition shows that separating these two tasks leads to dramatically better performance, especially in the open-world setting where 2D detectors have a massive vocabulary advantage.

Relation to Monocular Depth Estimation

Models like Depth Anything and ZoeDepth predict per-pixel depth from a single image. Boxer's approach is complementary: if you have a monocular depth estimator, you could use its output as Boxer's depth input. The median-patch encoding would handle the estimated depth the same way it handles sensor depth.

Relation to SpatialVLM

SpatialVLM and similar vision-language models can reason about 3D spatial relationships in natural language. Boxer provides the grounding: metric 3D bounding boxes that anchor spatial reasoning to physical reality.

Cheat Sheet

Aspect | Boxer
Input | Image + 2D boxes + optional depth
Output | 7-DoF 3D box + uncertainty per detection
Backbone | DINOv3 ViT-Large
Decoder queries | From 2D detector (not learned)
Depth encoding | Median depth per patch (MLP)
Loss | Chamfer · exp(−σ̂) + σ̂
Fusion | 3D IoU + CLIP similarity graph + NMS
Training data | 1.22M 3DBBs, 4 device types
Key result | 53× mAP vs CuTR on NymeriaPlus
Trainable params | ~25M (backbone frozen)
Inference speed | ~20ms on RTX 4090 (960×960, bf16)
Training | 2 weeks, 16× H100, AdamW, cosine LR
The broader lesson: When a hard problem can be decomposed into a "solved" part and an "unsolved" part, do not retrain the solved part. Stand on the shoulders of open-world 2D detectors and focus your model on the geometry.
What is the key architectural difference between how DETR and BoxerNet use their decoder queries?