Boxer — Veanors

Chapter 0: The Problem

You have an AR headset. A 2D detector — DETIC, OWLv2, SAM3 — can see a coffee mug on your desk and draw a 2D bounding box around it. But your robot arm needs to pick it up. For that, you need to know where that mug is in 3D: how far away, how wide, how tall, and which way it's rotated.

This is the 2D-to-3D lifting problem: given a 2D bounding box in an image, produce a 3D bounding box in the real world.

Why is this hard? A single 2D box is deeply ambiguous. A small box near the top of the image could be a small object nearby or a large object far away. The 2D projection throws away depth information entirely.

The fundamental ambiguity: A 2D bounding box is a projection of infinitely many possible 3D boxes. A 40×40 pixel square could be a tennis ball at 1 meter or a beach ball at 10 meters. Without additional information — context, depth, learned priors — the problem is ill-posed.
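To make the ambiguity concrete, here is a minimal sketch of the pinhole projection arithmetic. The focal length (600 px) and the ball diameters are illustrative assumptions, not values from the Boxer paper.

```python
def projected_size_px(object_size_m: float, depth_m: float, focal_px: float) -> float:
    """Pinhole projection: image size in pixels of an object facing the camera."""
    return focal_px * object_size_m / depth_m

FOCAL_PX = 600.0  # hypothetical focal length in pixels

tennis_ball = projected_size_px(0.067, 1.0, FOCAL_PX)  # ~6.7 cm ball at 1 m
beach_ball = projected_size_px(0.67, 10.0, FOCAL_PX)   # ~67 cm ball at 10 m

print(f"tennis ball: {tennis_ball:.1f} px, beach ball: {beach_ball:.1f} px")
```

Both objects cover about 40 pixels, because only the ratio of size to depth survives the projection.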

Existing approaches try to solve 2D detection and 3D estimation simultaneously (CuTR, Cube R-CNN). But this couples two very different problems: recognizing what something is (semantics) and understanding where it is in 3D (geometry). Boxer's idea: decouple them.

Full Data Flow at a Glance:
Inputs: RGB image I ∈ R^(H×W×3), pinhole intrinsics (f_x, f_y, c_x, c_y), 6-DoF pose T_world←cam, and N 2D bounding boxes (x₁, y₁, x₂, y₂, score).
Features: DINOv3 features F_img ∈ R^(H'×W'×1024); camera rays F_ray ∈ R^(H'×W'×3); median depth per patch F_depth ∈ R^(H'×W'×1).
Encoder: concatenate into F_enc ∈ R^(H'×W'×1028), then apply self-attention.
Decoder: cross-attention with the 2D boxes as queries.
Output per box: 7-DoF (x, y, z, w, h, d, θ), uncertainty σ̂, and confidence s_3D.

Interactive figure (2D → 3D Ambiguity): a depth slider shows how the same 2D box corresponds to different 3D boxes. The orange rectangle stays fixed in 2D, while the 3D box it represents changes drastically with depth.

Why is lifting a 2D bounding box to 3D fundamentally difficult?

A single 2D box is the projection of infinitely many 3D boxes at different depths and sizes, so the mapping is one-to-many
2D detectors are too slow to run in real time
3D bounding boxes require more GPU memory than 2D boxes

Chapter 1: The Key Insight

Boxer's core insight is a clean decomposition: separate 2D detection from 3D lifting.

Existing 3D detectors try to do everything at once. They take a raw image and produce 3D boxes end-to-end. This sounds elegant, but it creates a painful dependency: you need 3D-annotated training data to teach the model both what objects look like and where they are in 3D. Such data is extremely expensive to collect.

Meanwhile, 2D detectors have gotten spectacularly good. Models like DETIC, OWLv2, and SAM3 can detect thousands of object categories in the wild — far more than any 3D dataset covers. They were trained on billions of internet images with cheap 2D labels.

Boxer's decomposition exploits this asymmetry:

Step 1: 2D Detection

Use any off-the-shelf open-world 2D detector (DETIC, OWLv2, SAM3). Get 2D bounding boxes + class labels. No 3D annotations needed.

↓

Step 2: 3D Lifting

Feed image + 2D boxes into BoxerNet. It outputs 7-DoF 3D bounding boxes + uncertainty. Trained on 3D data, but doesn't need to learn semantics.

↓

Step 3: Multi-View Fusion

Merge per-frame 3D boxes across time using 3D IoU + semantic similarity. Produces a coherent scene-level 3D map.

Why this matters: The decomposition lets each component use the best available data. 2D detectors leverage billions of web images. BoxerNet leverages over a million 3D annotations (1.22M, see Chapter 7) from AR/VR devices. Neither needs the other's data. You can swap in a better 2D detector tomorrow without retraining BoxerNet.

This also makes the system open-world by inheritance. If your 2D detector can find "ergonomic keyboard" or "sourdough bread," BoxerNet can lift those detections to 3D — even though it has never seen those categories in its 3D training set. The geometry of lifting doesn't depend on the object's class.

What happens when inputs degrade: With no SLAM/depth available, F_depth patches are all set to −1 (the "no depth" token) → the network learns to ignore depth and relies purely on visual + ray features. With no device pose, T_world←cam defaults to identity → 3D boxes remain in camera frame rather than world frame, and multi-view fusion becomes impossible. With a noisy 2D detector (SAM3 vs GT boxes), mAP drops from 0.532 to 0.278 — the system degrades gracefully rather than catastrophically.

What is the key advantage of separating 2D detection from 3D lifting?

It runs faster on mobile GPUs
Each component can leverage different data sources (cheap 2D web data for detection, expensive 3D data only for lifting), and they can be improved independently
It allows using smaller backbone networks

Chapter 2: The BoxerNet Architecture

BoxerNet is a transformer that takes three inputs and produces 7-DoF 3D bounding boxes. Let's trace the data flow.

Encoder: Building a Rich Scene Representation

The encoder fuses three types of information into a single set of tokens:

Image features: A DINOv3 backbone (ViT-Large, frozen) extracts patch-level features from the image. Input image is resized to 960×960. With patch size 16, this gives a 60×60 grid = 3600 patch tokens, each a 1024-dimensional vector encoding appearance.
Camera ray features: For each patch center (u, v), the 3D ray direction is computed as d = K⁻¹[u, v, 1]^T (using camera intrinsics K). A 2-layer MLP (1024 hidden, ReLU) encodes these 3D directions into 1024-dim vectors, added element-wise to the image tokens.
Depth features (optional): If depth data is available (from a point cloud, LiDAR, or depth sensor), the median depth within each patch is computed, giving one scalar per patch. A separate 2-layer MLP encodes this scalar into 1024 dims, concatenated to produce F_enc ∈ R^3600×1028. We'll cover this in detail in Chapter 3.
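The shapes above can be checked with a short NumPy sketch. It follows the data-flow summary in Chapter 0, which concatenates 1024 image channels, 3 ray channels, and 1 depth channel; the ray and depth MLPs are left out. The intrinsics are hypothetical, and normalising the rays to unit length is an assumption.

```python
import numpy as np

def patch_ray_directions(fx, fy, cx, cy, image_size=960, patch=16):
    """Unit ray direction K^-1 [u, v, 1]^T for each patch center."""
    n = image_size // patch                   # 60 patches per side
    centers = (np.arange(n) + 0.5) * patch    # patch-center pixel coordinates
    u, v = np.meshgrid(centers, centers)      # each (n, n); u varies along columns
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

# Hypothetical pinhole intrinsics for a 960x960 image.
rays = patch_ray_directions(fx=600.0, fy=600.0, cx=480.0, cy=480.0)

f_img = np.zeros((60, 60, 1024), dtype=np.float32)      # stand-in for DINOv3 features
f_depth = np.full((60, 60, 1), -1.0, dtype=np.float32)  # "no depth" in every patch
f_enc = np.concatenate([f_img, rays.astype(np.float32), f_depth], axis=-1)
print(f_enc.shape)  # (60, 60, 1028)
```

Flattening the first two axes gives the 3600 tokens that enter the self-attention layers.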

The combined tokens pass through transformer encoder layers with self-attention, letting the model reason about global spatial relationships.

Frozen vs. Trained: DINOv3 backbone: frozen (pretrained on a web-scale corpus of roughly 1.7B images; retraining would be wasteful and destroy generalization). Self-attention encoder layers: trained. Cross-attention decoder layers: trained. Ray MLP + Depth MLP: trained. Prediction heads: trained. Total trainable parameters: ~25M. This is a thin learnable shell around a massive frozen feature extractor.

Decoder: Querying with 2D Boxes

The decoder receives one query per 2D detection. Each query is constructed from: (1) the 2D bounding box corners (x₁, y₁, x₂, y₂) encoded via sinusoidal positional encoding into a 256-dim vector, plus (2) RoI-pooled image features from the DINOv3 feature map cropped to that box region. The decoder applies cross-attention from these queries to the encoder tokens — each 2D box "looks at" the relevant parts of the scene to gather 3D information.
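As a rough illustration of the positional part of a query, the sketch below encodes the four box coordinates with sines and cosines, 64 values per coordinate for 256 in total. The frequency ladder and the normalisation by image size are assumptions; the text only says the encoding is sinusoidal and 256-dimensional.

```python
import numpy as np

def encode_box(box_xyxy, image_size=960.0, dims_per_coord=64):
    """Sinusoidal encoding of (x1, y1, x2, y2) into 4 * dims_per_coord values."""
    coords = np.asarray(box_xyxy, dtype=np.float64) / image_size  # normalise to [0, 1]
    n_freq = dims_per_coord // 2
    freqs = 1.0 / (10000.0 ** (np.arange(n_freq) / n_freq))       # geometric frequency ladder
    angles = 2.0 * np.pi * coords[:, None] * freqs[None, :]       # shape (4, n_freq)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).reshape(-1)

query_pos = encode_box([300, 420, 460, 640])
print(query_pos.shape)  # (256,)
```

The RoI-pooled image features would be combined with this vector to form the full query; that step is omitted here.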

Output Heads

Each query's final embedding (the cross-attention output, a 1024-dim vector we call the box latent) is passed through four parallel 2-layer MLPs (128 hidden dims, ReLU):

Center head: predicts the 3D center (x, y, z) — as depth along ray + lateral offsets
Extent head: predicts the half-widths (w, h, d) in log-space (exp applied at output)
Yaw head: predicts sin(θ) and cos(θ) — avoids angle wraparound discontinuities
Uncertainty head: predicts log-variance σ̂ (a single scalar, unbounded)
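Turning the raw head outputs into usable box parameters might look like the sketch below. The function name and argument layout are hypothetical; the operations (exp for extents, atan2 for yaw, exp(−σ̂) for confidence) follow the descriptions in this chapter and in Chapter 4.

```python
import math

def decode_heads(log_extents, sin_yaw, cos_yaw, sigma_hat):
    """Convert raw head outputs into extents, a yaw angle, and a 3D confidence."""
    extents = tuple(math.exp(v) for v in log_extents)  # exp keeps extents positive
    yaw = math.atan2(sin_yaw, cos_yaw)                 # in [-pi, pi]; (sin, cos) need not be unit length
    confidence = math.exp(-sigma_hat)                  # s_3D = exp(-sigma_hat)
    return extents, yaw, confidence

extents, yaw, confidence = decode_heads(
    (0.0, math.log(0.5), math.log(1.5)), math.sin(0.4), math.cos(0.4), 0.5
)
print(extents, yaw, confidence)
```

Using atan2 means the network never has to output an angle directly, so there is no jump between −π and π to learn around.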

Compared to DETR: DETR uses learned object queries to discover what and where objects are. BoxerNet queries are given — they come from the 2D detector. This means BoxerNet doesn't need to learn detection; it only needs to learn geometry. The queries carry 2D spatial information that anchors the cross-attention to the right image region.

Figure (BoxerNet Pipeline): the three input streams merge in the encoder, the 2D box queries attend to the encoder tokens via cross-attention, and the output heads produce 7-DoF boxes.

What are the three types of input tokens that BoxerNet's encoder fuses?

Image features (DINOv3 patches), camera ray features (3D directions), and optional median depth features
RGB channels, depth map, and segmentation mask
Object queries, positional encodings, and class embeddings

Chapter 3: Depth Encoding

Depth is the strongest signal for 3D lifting — if you know how far away each pixel is, placing a 3D box becomes much easier. But depth comes in wildly different forms:

Dense depth maps: every pixel has a depth value (from a depth sensor like Azure Kinect)
Sparse point clouds: only a few thousand 3D points (from visual-inertial SLAM on an AR headset)
No depth at all: just a single RGB image

Most existing methods require dense depth. They feed a full depth image through a ViT encoder — expensive, and impossible when depth is sparse or absent.

Boxer's Solution: Median Depth Patches

Boxer takes a radically simpler approach. For each image patch (matching the ViT grid of 16×16 pixels), it computes the median depth of all depth samples that fall within that patch. If no depth samples land in a patch, it gets a special "no depth" token (value −1).

This single scalar per patch is then encoded through a 2-layer MLP (128 hidden dims, ReLU) and concatenated to the image+ray token for that patch.
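The per-patch median is easy to sketch in NumPy. The version below assumes the depth samples have already been projected to pixel coordinates (u, v); the function name and the choice to drop non-positive depths are assumptions for illustration.

```python
import numpy as np

def median_depth_patches(u, v, depth, image_size=960, patch=16, no_depth=-1.0):
    """Median depth per patch; patches with no samples keep the no-depth value."""
    n = image_size // patch
    grid = np.full((n, n), no_depth, dtype=np.float32)
    u, v, depth = (np.asarray(a) for a in (u, v, depth))
    inside = (u >= 0) & (u < image_size) & (v >= 0) & (v < image_size) & (depth > 0)
    rows = (v[inside] // patch).astype(int)
    cols = (u[inside] // patch).astype(int)
    flat = rows * n + cols                    # one integer id per patch
    d = depth[inside]
    for pid in np.unique(flat):               # only visits patches that received samples
        grid[pid // n, pid % n] = np.median(d[flat == pid])
    return grid

# Three samples land in the top-left patch, one near the image center.
u = np.array([5.0, 9.0, 12.0, 500.0])
v = np.array([3.0, 7.0, 10.0, 500.0])
depth = np.array([1.5, 1.6, 3.0, 2.0])
grid = median_depth_patches(u, v, depth)
print(grid[0, 0], grid[31, 31], int((grid == -1).sum()))
```

The same function covers dense, sparse, and missing depth: with no samples at all it simply returns a grid of −1.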

Concrete numbers: A sparse SLAM point cloud from Aria glasses has ~10K 3D points. Projected onto a 960×960 image with a 60×60 patch grid (3600 patches of 16×16 pixels each), most patches receive 0–3 depth points. Many patches get zero. The median is robust to this extreme sparsity — patches with points get a signal, patches without get −1. With dense depth (e.g., iPad LiDAR at 256×192 projected onto the same grid), every patch receives ~10–20 points. With no depth at all: all 3600 patches are −1, and the network learns to ignore F_depth entirely.

Why median (not mean)? The median is robust to outliers. If a patch contains points on both a table (1.5m) and a wall behind it (3.0m), the median picks the dominant surface rather than averaging them. It's also trivially cheap to compute — sort and pick the middle value. A mean would be skewed by a single noisy point at 50m from a SLAM artifact.
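A quick numeric check with made-up depth samples shows the difference. Three points lie on a surface at about 1.5 m and one is a 50 m outlier.

```python
import statistics

patch_depths = [1.5, 1.5, 1.6, 50.0]  # metres; the last value is a spurious SLAM point

median_depth = statistics.median(patch_depths)
mean_depth = statistics.mean(patch_depths)
print(f"median: {median_depth:.2f} m, mean: {mean_depth:.2f} m")
```

The median stays at roughly 1.55 m, while the mean is dragged to about 13.65 m, far from any real surface in the patch.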

This design has three key advantages:

Works with any density. Dense depth fills every patch. Sparse point clouds fill some patches. No depth fills none. The architecture handles all three gracefully — empty patches simply don't get a depth signal.
Extremely efficient. Instead of encoding a full H×W depth image through a second ViT (billions of FLOPs), Boxer encodes one scalar per patch through a tiny MLP (negligible cost).
Unified representation. At training time, Boxer can mix data from dense-depth devices (Kinect) and sparse-depth devices (AR glasses) in the same batch, without any architecture changes.

Ablation result: On NymeriaPlus (egocentric, sparse depth), adding point cloud depth raises mAP from 0.279 to 0.518 — nearly doubling performance. Depth is hugely valuable, and Boxer's encoding captures it efficiently.

Why does Boxer use median depth per patch instead of feeding a full depth image through a ViT?

Because median depth is more accurate than raw depth values
Because it removes the need for camera calibration
Because it works uniformly with dense, sparse, or no depth, costs negligible compute (one scalar per patch), and is robust to outliers

Chapter 4: Uncertainty-Aware Training

Not all 3D predictions are equally reliable. A coffee mug sitting on a clear table with dense depth data? Easy — the model should be confident. A partially occluded chair seen from a weird angle with no depth? Hard — the model should admit it's uncertain.

Boxer handles this with aleatoric uncertainty — uncertainty that comes from the data itself (noise, occlusion, ambiguity), not from the model's lack of training.

The Loss Function

For each predicted 3D box, BoxerNet outputs both the box parameters and an uncertainty scalar σ̂. The training loss for each box is:

L = L_chamfer · exp(−σ̂) + σ̂

Let's unpack this. There are two terms pulling in opposite directions:

exp(−σ̂) · L_chamfer: The chamfer loss (distance between predicted and ground-truth box corners) is down-weighted when σ̂ is large. High uncertainty = "don't punish me too hard for getting this wrong."
σ̂ (regularizer): This prevents the model from setting σ̂ = ∞ for everything. Claiming uncertainty has a cost. The model must earn its uncertainty by actually having a bad prediction.

The equilibrium: If the model predicts a bad box, it can reduce loss by increasing σ̂ — but only up to the point where the σ̂ penalty exceeds the chamfer savings. If the model predicts a good box, low σ̂ gives the best loss. The result: σ̂ naturally calibrates to how hard each sample is.
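The equilibrium can be checked numerically. Setting the derivative of L with respect to σ̂ to zero gives σ̂* = ln(L_chamfer); the sketch below compares that closed form against a brute-force search. The grid range and resolution are arbitrary choices.

```python
import math

def uncertainty_loss(chamfer: float, sigma_hat: float) -> float:
    """L = L_chamfer * exp(-sigma_hat) + sigma_hat."""
    return chamfer * math.exp(-sigma_hat) + sigma_hat

def best_sigma(chamfer: float, lo=-5.0, hi=5.0, steps=10001) -> float:
    """Brute-force argmin of the loss over an evenly spaced grid of sigma values."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda s: uncertainty_loss(chamfer, s))

for chamfer in (0.5, 2.0, 8.0):
    print(chamfer, round(best_sigma(chamfer), 3), round(math.log(chamfer), 3))
```

A worse prediction (larger chamfer) pushes the best σ̂ upward, and a good prediction (chamfer below 1) pushes it negative, which is why σ̂ tracks sample difficulty.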

This is the same principle used in Kendall & Gal (2017) for multi-task learning, adapted here for per-prediction uncertainty. The Chamfer distance is computed between the 8 corners of the predicted and ground-truth boxes, giving a smooth, rotation-aware loss.
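A corner-based Chamfer distance could be sketched as follows. The text does not pin down the exact form, so the Euclidean norm and the equal weighting of the two directions are assumptions; the axis-aligned helper exists only to build test boxes.

```python
import numpy as np

def corner_chamfer(pred, gt):
    """Symmetric Chamfer distance between two (8, 3) corner sets."""
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (8, 8) pairwise
    # Nearest GT corner for each predicted corner, and vice versa.
    return 0.5 * (dists.min(axis=1).mean() + dists.min(axis=0).mean())

def axis_aligned_corners(center, size):
    """8 corners of an unrotated box (illustration helper)."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return np.asarray(center) + 0.5 * signs * np.asarray(size)

gt = axis_aligned_corners([0.0, 0.0, 3.0], [1.0, 1.0, 1.0])
pred = axis_aligned_corners([0.1, 0.0, 3.0], [1.0, 1.0, 1.0])  # shifted 10 cm along x
print(corner_chamfer(pred, gt))
```

Because the loss works on corners rather than on (center, size, angle) separately, errors in position, extent, and yaw all show up in the same units.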

Architecture detail: The uncertainty MLP head takes the same 1024-dim box latent as the 7-DoF heads; they all share the cross-attention decoder output. The uncertainty head is a 2-layer MLP: 1024 → 128 (ReLU) → 1. It outputs a single unbounded scalar σ̂ (log-variance). There is no sigmoid and no clamp; the loss itself provides the regularization. At inference, confidence is computed as s_3D = exp(−σ̂), and the final detection confidence is the average (s_2D + s_3D) / 2, so 2D detection quality and 3D lifting quality are scored separately before being combined.

Interactive figure (Uncertainty-Aware Loss Landscape): sliders for the chamfer loss (how bad the prediction is) and the uncertainty σ̂ show how the total loss changes. When chamfer is high, raising σ̂ helps, but only up to a point.

Loss → Architecture reasoning: L = L_chamfer · exp(−σ̂) + σ̂. The Chamfer loss compares the 8 predicted corners to the 8 GT corners (16 nearest-corner distances, 8 in each direction, averaged). The σ̂ head produces its output from the same box latent as the 7-DoF heads. When σ̂ is high, exp(−σ̂) → 0, so the chamfer term vanishes: the model says "I don't know" and isn't penalized for being wrong. But the +σ̂ term grows linearly, so claiming ignorance has a cost. Setting dL/dσ̂ = −L_chamfer · exp(−σ̂) + 1 to zero gives the optimum σ̂* = ln(L_chamfer), which you can verify in the simulation above.

Ablation result: Removing aleatoric uncertainty drops NymeriaPlus mAP from 0.518 to 0.485, a 6.4% relative decrease. The uncertainty mechanism helps by letting the model focus its capacity on learnable examples.

Why can't the model just set σ̂ = ∞ for every prediction to avoid all chamfer loss?

Because the +σ̂ regularization term penalizes high uncertainty directly: claiming uncertainty has a cost that must be balanced against the chamfer savings
Because the optimizer clips gradient values above a threshold
Because σ̂ is bounded between 0 and 1 by a sigmoid activation

Chapter 5: The 7-DoF Representation

A 3D bounding box in the real world has many possible parameterizations. Boxer uses a gravity-aligned 7-DoF representation — seven numbers that fully describe a box's position, size, and orientation:

Parameter	Symbol	Meaning
Center X	x	Left-right position in world frame
Center Y	y	Up-down position (gravity direction)
Center Z	z	Depth (distance from camera)
Width	w	Extent along the box's local X axis (world X when θ = 0)
Height	h	Extent along the gravity axis
Depth extent	d	Extent along the box's local Z axis (world Z when θ = 0)
Yaw	θ	Rotation around the gravity (Y) axis

Why gravity-aligned?

Most indoor objects sit on surfaces aligned with gravity. Chairs, tables, monitors, and mugs all have a natural "up" direction. By assuming the box is aligned with gravity (no pitch or roll), Boxer reduces the rotation from a full 3-DoF (roll, pitch, yaw) to just 1-DoF (yaw). This is a much easier prediction target.
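To see how seven numbers pin down a box, the sketch below expands them into eight corners. It treats w, h, d as full extents and Y as the gravity axis, following the table above; the sign convention of the yaw rotation is an assumption.

```python
import numpy as np

def box_corners(x, y, z, w, h, d, yaw):
    """8 corners of a gravity-aligned box; yaw rotates about the vertical (Y) axis."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    local = 0.5 * signs * np.array([w, h, d])  # corners in the box's own frame
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return local @ rot_y.T + np.array([x, y, z])

corners = box_corners(0.0, 0.0, 4.0, 1.0, 1.0, 1.0, 0.40)
print(corners.shape)  # (8, 3)
```

With roll and pitch fixed at zero, the top and bottom faces always stay horizontal: yaw changes the x and z coordinates of the corners but never their height.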

The gravity direction is known from the device's IMU (inertial measurement unit), so the world frame is established before BoxerNet even runs.

Coordinate System

BoxerNet predicts 3D boxes in the camera coordinate frame, then transforms them to the gravity-aligned world frame using the known camera pose. The center (x, y, z) is predicted as a 3D offset from the camera ray passing through the center of the 2D bounding box. This gives the model a strong prior — the 3D center should be "somewhere along the ray" through the 2D box center.

Predicting along the ray: Instead of predicting absolute (x, y, z) coordinates (which vary wildly depending on camera position), BoxerNet predicts a depth along the camera ray plus small lateral offsets. This is geometrically natural — the 2D box center gives the direction, the model just needs to figure out the distance. Concretely: given the 2D box center (u_c, v_c), the ray direction is d = K⁻¹[u_c, v_c, 1]^T. The center head predicts (Δd, Δx, Δy) where Δd is the depth offset along the ray, and Δx, Δy are small lateral corrections. The 3D center is then p = (Δd + d₀) · d̂ + (Δx, Δy, 0), where d₀ is the median depth at that patch (if available). Extents (w, h, d) are predicted in log-space and exponentiated — this ensures they're always positive.
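The center formula can be written out directly. The sketch below is a literal reading of p = (Δd + d₀) · d̂ + (Δx, Δy, 0); the function name, the intrinsics in the example, and the fallback d₀ = 0 when no depth is available are assumptions.

```python
import numpy as np

def center_from_ray(box_xyxy, fx, fy, cx, cy, delta_d, delta_xy, d0=0.0):
    """3D center from the ray through the 2D box center plus predicted offsets."""
    x1, y1, x2, y2 = box_xyxy
    u_c, v_c = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    ray = np.array([(u_c - cx) / fx, (v_c - cy) / fy, 1.0])  # K^-1 [u_c, v_c, 1]^T
    ray_hat = ray / np.linalg.norm(ray)                      # unit direction d_hat
    return (delta_d + d0) * ray_hat + np.array([delta_xy[0], delta_xy[1], 0.0])

# Box centered on the principal point, so the ray is the optical axis.
center = center_from_ray((400, 400, 560, 560), 600.0, 600.0, 480.0, 480.0,
                         delta_d=0.2, delta_xy=(0.05, -0.02), d0=2.8)
print(center)
```

When a median patch depth d₀ is available the network only has to predict a small correction Δd; without it, Δd has to carry the full distance.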

Interactive figure (7-DoF Box Visualizer): sliders for the seven parameters (X left-right, Y up-down, Z depth, width, height, depth extent, yaw θ) show how they define a 3D bounding box, drawn as a wireframe in a simple perspective projection.

Why does Boxer use only 1 rotation angle (yaw) instead of full 3-DoF rotation (roll, pitch, yaw)?

Because the IMU can only measure yaw
Because most indoor objects sit on gravity-aligned surfaces, so roll and pitch are near zero; the gravity direction from the IMU fixes two of the three rotation axes
Because 3-DoF rotation requires quaternions which are harder to learn

Chapter 6: Multi-View Fusion

BoxerNet processes one frame at a time. But AR and robotics scenarios involve video — the camera moves, and the same object is seen from many angles. How do you merge all those per-frame 3D boxes into a single coherent scene?

The Challenge

Frame-by-frame predictions are noisy. The same chair might get slightly different 3D boxes from different viewpoints. A false positive might appear in one frame but not others. And the 2D detector might assign different class names to the same object across frames ("armchair" vs "chair").

Step 1: Transform to World Frame

Each per-frame 3D box is transformed from camera coordinates to a global world coordinate frame using the known camera pose (from SLAM or device tracking). Now all boxes from all frames live in the same coordinate system.
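For the box center, the transform is a single matrix product in homogeneous coordinates. The pose below is made up for illustration: a 90° rotation about the vertical axis plus a 1.5 m offset along Y.

```python
import numpy as np

def camera_to_world(center_cam, T_world_cam):
    """Apply a 4x4 rigid transform to a 3D point using homogeneous coordinates."""
    p = np.append(np.asarray(center_cam, dtype=np.float64), 1.0)
    return (T_world_cam @ p)[:3]

# Hypothetical camera pose T_world<-cam.
T = np.eye(4)
T[:3, :3] = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
T[:3, 3] = [0.0, 1.5, 0.0]

print(camera_to_world([0.0, 0.0, 3.0], T))
```

A point 3 m in front of the camera ends up at (3, 1.5, 0) in the world. The box extents are unchanged by a rigid transform; the yaw would also need the camera's heading added, which this sketch does not do.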

Step 2: Build a Similarity Graph

Boxer builds a graph where each node is a per-frame 3D detection. Two nodes are connected with an edge if:

3D IoU > threshold: the boxes overlap significantly in 3D space (they're probably the same object)
Semantic similarity: the CLIP embeddings of the two 2D crops are similar (they look like the same thing)

The combination of geometric overlap and semantic similarity is crucial. Two boxes might overlap in 3D (a monitor and a laptop stacked on it) but have different semantics. Or two identical-looking chairs might be far apart in 3D.

Step 3: Connected Components + NMS

Connected components of the graph represent the same physical object. Within each component, the boxes are merged using confidence-weighted averaging (boxes with higher s_3D contribute more to the final center, extents, and yaw). A final 3D NMS pass with IoU threshold 0.5 removes any remaining duplicates.

Full fusion data flow: Per-frame 3D boxes (in camera frame) → transform to world frame via T_world←cam from device tracking → compute pairwise 3D IoU (threshold ≥ 0.25 for edge creation) → compute CLIP text embedding similarity between cropped 2D patches (threshold ≥ 0.7) → build undirected graph (edge requires BOTH conditions) → find connected components → confidence-weighted box averaging within each component → 3D NMS (IoU ≥ 0.5) → final scene-level 3D boxes with merged confidence scores.
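The graph step can be sketched with a union-find structure. This is a simplified illustration, not Boxer's implementation: it uses axis-aligned IoU (ignoring yaw), averages only center and size, and skips the final NMS pass. Boxes are rows of (x, y, z, w, h, d); the thresholds are the ones quoted above.

```python
import numpy as np

def aabb_iou(a, b):
    """3D IoU of axis-aligned boxes given as (center xyz, size whd). Ignores yaw."""
    a_min, a_max = a[:3] - a[3:6] / 2, a[:3] + a[3:6] / 2
    b_min, b_max = b[:3] - b[3:6] / 2, b[:3] + b[3:6] / 2
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0.0, None)
    inter = overlap.prod()
    return inter / (a[3:6].prod() + b[3:6].prod() - inter)

def fuse(boxes, embeds, scores, iou_thr=0.25, sim_thr=0.7):
    """Merge detections that pass BOTH the IoU and the cosine-similarity test."""
    n = len(boxes)
    parent = list(range(n))

    def find(i):  # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    unit = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
    for i in range(n):
        for j in range(i + 1, n):
            if aabb_iou(boxes[i], boxes[j]) >= iou_thr and unit[i] @ unit[j] >= sim_thr:
                parent[find(i)] = find(j)  # add an edge: join the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Confidence-weighted average of the boxes in each connected component.
    return [np.average(boxes[idx], axis=0, weights=scores[idx]) for idx in groups.values()]

boxes = np.array([[0.0, 0, 3, 1, 1, 1], [0.1, 0, 3, 1, 1, 1], [5.0, 0, 3, 1, 1, 1]])
embeds = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.0]])  # toy 2-D stand-ins for CLIP vectors
scores = np.array([0.9, 0.3, 0.8])
fused = fuse(boxes, embeds, scores)
print(len(fused), fused[0][:3])
```

The first two detections overlap and look alike, so they merge into one box pulled toward the higher-confidence detection; the third looks the same but sits 5 m away, so it stays separate.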

Temporal consistency for free: Because fusion operates in a persistent world frame, objects naturally accumulate evidence over time. An object seen in 50 frames gets a much more accurate merged box than one seen in 3 frames. And false positives that appear in only 1–2 frames get suppressed by the graph structure — isolated nodes with no edges are discarded if their confidence is below a threshold.

Why does multi-view fusion use BOTH 3D IoU AND semantic similarity to decide if two detections are the same object?

Because objects can overlap in 3D but be different things (monitor on a laptop), or look similar but be spatially separate (two identical chairs); both signals are needed to avoid false merges
Because 3D IoU is too expensive to compute alone
Because CLIP embeddings contain depth information that improves IoU estimates

Chapter 7: Training at Scale

The Boxer system was trained on a dataset of remarkable scale and diversity. Here are the key numbers:

Metric	Value
Unique 3D bounding boxes	1.22 million
Device types	4 (Aria glasses, Quest headset, iPad, Azure Kinect)
Depth modalities	Dense (Kinect), sparse (SLAM points), none
Scenes	Indoor environments (offices, homes, labs)

Annotation Pipeline

Collecting 1.22M 3D annotations is a massive effort. The pipeline works as follows:

3D reconstruction: Each scene is reconstructed into a 3D mesh using multi-view stereo or depth fusion.
3D annotation: Human annotators place 3D bounding boxes on the mesh. This is done once per scene, not per frame.
Projection to frames: The 3D annotations are automatically projected into every camera frame using known poses. A visibility check ensures only visible objects get projected.
2D box generation: For training, 2D bounding boxes are computed by projecting the 3D box corners into each frame and taking the tightest enclosing rectangle.
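The last step, turning projected corners into a 2D box, is a min/max over the projected points. The sketch assumes corners already expressed in the camera frame and a plain pinhole model with hypothetical intrinsics; it refuses boxes that cross the camera plane rather than clipping them.

```python
import numpy as np

def project_to_2d_box(corners_cam, fx, fy, cx, cy):
    """Tightest axis-aligned rectangle around the projected corners (all need z > 0)."""
    corners_cam = np.asarray(corners_cam, dtype=np.float64)
    if np.any(corners_cam[:, 2] <= 0):
        raise ValueError("box crosses the camera plane; clip it before projecting")
    u = fx * corners_cam[:, 0] / corners_cam[:, 2] + cx
    v = fy * corners_cam[:, 1] / corners_cam[:, 2] + cy
    return u.min(), v.min(), u.max(), v.max()

# Unit cube centered 3 m in front of the camera.
cube = np.array([[sx * 0.5, sy * 0.5, 3.0 + sz * 0.5]
                 for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
box_2d = project_to_2d_box(cube, fx=600.0, fy=600.0, cx=480.0, cy=480.0)
print(box_2d)
```

The near face of the cube (z = 2.5 m) sets the rectangle, since it projects larger than the far face.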

Visibility Computation

Not every 3D object is visible in every frame. Boxer uses a ray-casting visibility check against the 3D mesh: if fewer than a threshold fraction of the box's surface is visible, the annotation is excluded from that frame. This prevents training on heavily occluded objects where the 3D box would be nearly impossible to predict.

Data Augmentation

During training, Boxer applies standard 2D augmentations (random crop, color jitter, horizontal flip) and also jitters the 2D input boxes by up to 20% of their size. This simulates the noise that real 2D detectors produce — their boxes are never perfectly tight.
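One plausible way to implement the box jitter is to move each edge independently by a uniform fraction of the box size. That reading of "up to 20% of their size" is an assumption, as are the function name and the clipping to image bounds.

```python
import numpy as np

def jitter_box(box_xyxy, rng, max_frac=0.2, image_size=960.0):
    """Shift each edge by up to max_frac of the box width or height."""
    x1, y1, x2, y2 = box_xyxy
    w, h = x2 - x1, y2 - y1
    noise = rng.uniform(-max_frac, max_frac, size=4) * np.array([w, h, w, h])
    return np.clip(np.array([x1, y1, x2, y2], dtype=np.float64) + noise, 0.0, image_size)

rng = np.random.default_rng(0)
box = np.array([300.0, 400.0, 500.0, 700.0])
print(jitter_box(box, rng))
```

With a 20% limit per edge the jittered box keeps at least 60% of its original width and height, so it never collapses or flips.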

Training compute: Trained for 2 weeks on 16 H100 GPUs with bfloat16 mixed precision. Optimizer: AdamW with weight decay 0.05. Learning rate: cosine decay from 1e-4 to 1e-5 over 500K steps. Batch size: 64 (4 per GPU). Forward pass at inference: ~20ms on RTX 4090 at 960×960 resolution with bfloat16 — fast enough for real-time AR/robotics at 50 FPS (BoxerNet alone, excluding 2D detection).
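The learning-rate schedule and the frame-rate figure are simple to reproduce. The sketch assumes a plain cosine curve with no warmup, which the text does not specify.

```python
import math

def cosine_lr(step, total_steps=500_000, lr_max=1e-4, lr_min=1e-5):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

for step in (0, 250_000, 500_000):
    print(step, cosine_lr(step))

frames_per_second = 1000.0 / 20.0  # 20 ms per forward pass
print(frames_per_second)           # 50.0
```

Halfway through training the rate sits midway between the two endpoints, at about 5.5e-5.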

Device diversity matters: By training on 4 different device types, BoxerNet learns to handle different camera intrinsics, fields of view, and depth characteristics. Aria glasses have a fisheye lens and sparse SLAM points. Kinect has a narrow field of view but dense depth. The model becomes robust to all of them. Critically, during training, depth is randomly dropped (set to all −1) with 20% probability, forcing the model to learn a fallback that works without any depth signal.

Why does Boxer jitter the 2D input boxes during training?

To increase the number of training samples
To simulate the imperfect 2D boxes that real off-the-shelf detectors produce at inference time, making BoxerNet robust to detection noise
To prevent the model from memorizing exact bounding box coordinates

Chapter 8: Results

Boxer is evaluated on two challenging benchmarks that test different scenarios:

NymeriaPlus (Egocentric, Sparse/No Depth)

Egocentric video from Aria glasses. Sparse point cloud depth from visual-inertial SLAM. This is the hardest setting — limited depth, moving camera, ego-motion blur.

Method	2D Detector	mAP ↑
CuTR (end-to-end baseline)	GT 2D boxes	0.010
BoxerNet	GT 2D boxes	0.532
BoxerNet	DETIC	0.254
BoxerNet	SAM3	0.278

The gap is staggering: with perfect 2D boxes, BoxerNet achieves 53× higher mAP than CuTR. Even with noisy real detectors (DETIC, SAM3), Boxer far exceeds what end-to-end methods achieve.

CA-1M (Dense Depth)

Indoor scenes captured with Azure Kinect, providing dense depth. This is the easier setting for geometry.

Method	2D Detector	mAP ↑
CuTR	GT 2D boxes	0.250
BoxerNet	GT 2D boxes	0.412
BoxerNet	SAM3	0.270

Key Ablations

What matters most? The ablation study on NymeriaPlus (with GT 2D boxes) reveals:

Ablation	mAP	Change
Full BoxerNet	0.518	—
Remove point cloud depth	0.279	−46%
Remove aleatoric uncertainty	0.485	−6.4%
Remove camera ray features	0.495	−4.4%

Depth is king: Removing the point cloud nearly halves performance. This confirms what we'd expect geometrically — depth is the strongest single signal for 3D lifting. But even without any depth, BoxerNet still achieves 0.279 mAP, showing it can learn useful geometric priors from image features alone.
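The percentages in the ablation table are relative changes against the full model, which a few lines of arithmetic confirm. The dictionary keys are just labels for this sketch.

```python
full = 0.518
ablations = {
    "no point cloud depth": 0.279,
    "no aleatoric uncertainty": 0.485,
    "no camera ray features": 0.495,
}

# Relative change in percent: 100 * (ablated - full) / full
changes = {name: 100 * (m_ap - full) / full for name, m_ap in ablations.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Depth accounts for a drop roughly seven times larger than either of the other two components.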

Figure (Results Comparison): mAP of BoxerNet and CuTR on the two benchmarks, using GT 2D boxes. Higher is better.

According to the ablation study, what is the single most important input for BoxerNet's performance?

Aleatoric uncertainty
Point cloud depth: removing it drops mAP from 0.518 to 0.279, a 46% decline
Camera ray features

Chapter 9: Connections

Boxer sits at the intersection of several important research directions. Let's map where it fits.

Relation to DETR

BoxerNet borrows the transformer encoder-decoder architecture from DETR, but with a crucial difference: DETR's object queries are learned parameters that must discover objects from scratch. BoxerNet's queries are given by the 2D detector — the model knows where objects are in 2D and only needs to lift them to 3D. This makes the decoder's job much easier.

Relation to Cube R-CNN / CuTR

Cube R-CNN and CuTR are end-to-end 3D detectors that jointly detect and estimate 3D boxes. Boxer's decomposition shows that separating these two tasks leads to dramatically better performance, especially in the open-world setting where 2D detectors have a massive vocabulary advantage.

Relation to Monocular Depth Estimation

Models like Depth Anything and ZoeDepth predict per-pixel depth from a single image. Boxer's approach is complementary: if you have a monocular depth estimator, you could use its output as Boxer's depth input. The median-patch encoding would handle the estimated depth the same way it handles sensor depth.

Relation to SpatialVLM

SpatialVLM and similar vision-language models can reason about 3D spatial relationships in natural language. Boxer provides the grounding: metric 3D bounding boxes that anchor spatial reasoning to physical reality.

Cheat Sheet

Aspect	Boxer
Input	Image + 2D boxes + optional depth
Output	7-DoF 3D box + uncertainty per detection
Backbone	DINOv3 ViT-Large
Decoder queries	From 2D detector (not learned)
Depth encoding	Median depth per patch (MLP)
Loss	Chamfer · exp(−σ̂) + σ̂
Fusion	3D IoU + CLIP similarity graph + NMS
Training data	1.22M 3DBBs, 4 device types
Key result	53× mAP vs CuTR on NymeriaPlus
Trainable params	~25M (backbone frozen)
Inference speed	~20ms on RTX 4090 (960×960, bf16)
Training	2 weeks, 16× H100, AdamW, cosine LR

The broader lesson: When a hard problem can be decomposed into a "solved" part and an "unsolved" part, do not retrain the solved part. Stand on the shoulders of open-world 2D detectors and focus your model on the geometry.

What is the key architectural difference between how DETR and BoxerNet use their decoder queries?

DETR uses more queries than BoxerNet
DETR's queries are learned parameters that must discover objects, while BoxerNet's queries are derived from 2D detections and only need to estimate 3D geometry
BoxerNet queries use cross-attention while DETR queries use self-attention

Robust Lifting of 2D Boxes to 3D