A single plain transformer predicts spatially consistent depth and ray maps from any number of images — no architectural specialization, no multi-task complexity. Just depth + rays.
You have photos of a room. Maybe one photo, maybe five, maybe a hundred frames from a phone video. You want to know: how far away is every pixel? Where was the camera when each photo was taken? Can you combine all of these into a single 3D point cloud?
Right now, you face a fragmented landscape. If you have one image, you use a monocular depth estimator like Depth Anything 2. If you have two images, you use DUSt3R. If you have many images, you use COLMAP (a slow, multi-stage pipeline) or VGGT (a large two-transformer model). Each tool is a separate system with its own architecture, training data, and failure modes.
What if one model could handle all of these cases? One image, two images, a hundred frames — all processed by the same network, producing consistent geometry every time?
Consider the specific failure modes. COLMAP achieves only 13.0 Auc3 on HiRoom (a challenging indoor dataset with textureless walls) compared to 81.7 from DA3. On ScanNet++, COLMAP gets 13.3 Auc3 versus 83.2 from DA3. Classical methods need well-textured surfaces to find correspondences. When those correspondences break, the entire pipeline collapses.
Feed-forward models like VGGT improved on this, but they use a complex two-transformer architecture (a pretrained DINOv2 encoder + a separate untrained cross-view transformer), predict redundant targets (depth maps + point maps + poses + correspondences), and still struggle on challenging scenes. VGGT's 0.90B parameter model is impressive, but DA3 shows that a 0.30B model can surpass it on many benchmarks.
The core question DA3 asks is: what is the minimal set of prediction targets and the minimal architecture that can recover 3D geometry from any number of views? The answer turns out to be surprisingly simple: predict depth + rays from a single pretrained transformer. No custom architecture. No point cloud heads. No iterative optimization. Just two dense maps per image, combined with element-wise operations.
DA3's insight comes in two parts, each challenging a widely held assumption in the field.
VGGT uses two separate transformers stacked together: a pretrained DINOv2 backbone plus a separate cross-view transformer trained from scratch. This means two-thirds of the model's blocks have never seen ImageNet-scale pretraining. DA3 asks: what if we just used one transformer and rearranged the attention pattern inside it?
Take a vanilla DINOv2 encoder. It already knows how to extract powerful visual features. Instead of adding a second transformer for cross-view reasoning, DA3 simply rearranges input tokens in the last third of the network so that self-attention happens across views instead of within views. No new parameters. No new architecture. Just tensor reordering.
The result: a ViT-L model (0.30B parameters) using DA3's approach outperforms a VGGT-style architecture with comparable parameter count by 20% (Table 6 in the paper). Full pretraining of every layer beats partial pretraining of a larger stack.
Previous unified 3D models predicted a zoo of outputs: point maps, depth maps, camera poses, correspondences, confidence maps. DA3 shows that exactly two dense predictions per image are sufficient: a depth map D (one depth value per pixel) and a ray map M (a ray origin t and a ray direction d per pixel).
From these two outputs, a 3D point in world coordinates is simply P = t + D · d.
That is it. Element-wise multiplication and addition. No matrix inversions. No rotation decomposition. No iterative optimization. Every pixel's 3D position falls out from combining its depth with its ray direction.
The paper also adds a lightweight camera head that predicts camera parameters directly (field of view f ∈ ℝ², quaternion q ∈ ℝ⁴, translation t ∈ ℝ³) from a single token per view. This is optional, since you can extract camera parameters from the ray map instead, but the camera head is 18.7× faster (0.46 ms vs 8.60 ms on an A100 GPU). Since it costs only ~0.1% of backbone computation, the authors include it essentially for free.
Before we look at the neural network, we need to deeply understand what it predicts. The depth-ray representation is the foundation of everything DA3 does, and it is beautifully simple once you see where it comes from.
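Start with the standard pinhole camera model. A pixel p = (u, v, 1)ᵀ in image i with depth D_i(u, v) projects to a 3D point P in world coordinates via:
P = R_i (D_i(u, v) · K_i⁻¹ p) + t_i
Where K_i is the 3×3 intrinsic matrix (focal length and principal point), R_i is the camera-to-world rotation, and t_i is the camera center in world coordinates.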
The problem: predicting R_i directly is hard because rotation matrices must satisfy RᵀR = I and det(R) = 1. These are nonlinear constraints that a neural network has no natural way to enforce. Previous works predict quaternions or 6D rotation representations, but these require careful normalization and still struggle.
DA3 sidesteps this entirely. Instead of predicting R_i and K_i separately, it predicts a dense ray map M_i ∈ ℝ^(H×W×6) that stores, for every pixel, the camera ray in world coordinates.
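Each pixel's ray has two components: an origin t ∈ ℝ³, the camera center in world coordinates, and a direction d ∈ ℝ³, the pixel's viewing direction rotated into the world frame (d = R_i K_i⁻¹ p).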
The direction d is not normalized. This is crucial: its magnitude preserves the projection scale, encoding both intrinsics and rotation in a single vector. A pixel near the edge of a wide-angle lens has a longer d than a pixel near the center — the direction itself encodes the focal length.
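To make the magnitude argument concrete, here is a minimal sketch that evaluates the camera-frame direction K⁻¹p for a simple pinhole camera. The image size, principal point, and focal lengths are hypothetical values chosen for illustration, and the rotation is left out because rotating a vector does not change its length.

```python
import math

def ray_direction_camera(u, v, focal, cx, cy):
    """Unnormalized direction K^-1 @ (u, v, 1) for a pinhole camera with square pixels."""
    return ((u - cx) / focal, (v - cy) / focal, 1.0)

def norm(vec):
    """Euclidean length of a 3-vector."""
    return math.sqrt(sum(c * c for c in vec))

# Hypothetical 518x518 image with the principal point at the center.
cx = cy = 259.0
center_wide = norm(ray_direction_camera(259.0, 259.0, 300.0, cx, cy))   # image center, short focal length
edge_wide = norm(ray_direction_camera(518.0, 259.0, 300.0, cx, cy))     # right edge, short focal length
edge_tele = norm(ray_direction_camera(518.0, 259.0, 1200.0, cx, cy))    # right edge, long focal length

print(center_wide, edge_wide, edge_tele)
```

The center ray has length exactly 1, while the edge ray is longer, and more so for the shorter focal length. That is how the unnormalized direction carries information about the intrinsics.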
Suppose we have a single image of a table. The network predicts:
The 3D point is:
That is it. Scalar multiplication and vector addition. For an entire image of H×W pixels, this is a single broadcasted element-wise operation — no matrix inversions, no SVD decompositions, no iterative solvers.
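As a sketch of that broadcasted operation, the NumPy snippet below turns a depth map and a ray map into a point map. The channel layout (origin in the first three channels, direction in the last three) and the toy values are assumptions made for this example.

```python
import numpy as np

def unproject(depth, ray_map):
    """Combine depth [H, W] and rays [H, W, 6] into world points [H, W, 3].

    Assumed layout: ray_map[..., :3] is the ray origin t,
    ray_map[..., 3:] is the unnormalized ray direction d.
    """
    origins = ray_map[..., :3]
    directions = ray_map[..., 3:]
    # P = t + D * d, broadcast over every pixel at once.
    return origins + depth[..., None] * directions

# Toy 2x3 "image": every pixel has depth 2, origin (1, 0, 0), direction (0, 0.5, 1).
H, W = 2, 3
depth = np.full((H, W), 2.0)
ray_map = np.zeros((H, W, 6))
ray_map[..., :3] = [1.0, 0.0, 0.0]
ray_map[..., 3:] = [0.0, 0.5, 1.0]

points = unproject(depth, ray_map)
print(points[0, 0])  # origin + 2 * direction
```

Each pixel here lands at (1, 1, 2): the origin plus twice the direction.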
In a perfect pinhole camera, all rays originate from the same point (the camera center). So why predict t per-pixel instead of once per image?
If you need explicit camera parameters (for downstream applications), you can extract them from the ray map. The procedure is:
This is a standard DLT (Direct Linear Transform) problem. Once H* is found, decompose it via RQ decomposition into K and R. Total cost: 8.60ms on an A100 GPU. But the camera head does the same thing in 0.46ms — an 18.7× speedup. That is why DA3 includes both: the ray head for dense geometric supervision during training, the camera head for fast inference.
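The sketch below shows one way such a recovery can work, under simplifying assumptions: it fits d = A p by ordinary linear least squares (with A = R K⁻¹), rather than the paper's exact objective, and then splits A⁻¹ = K Rᵀ with an RQ decomposition. The helper names and the synthetic camera are hypothetical.

```python
import numpy as np

def rq(a):
    """RQ decomposition a = r @ q (r upper triangular, q orthogonal), built from NumPy's QR."""
    flip = np.eye(3)[::-1]
    q_t, r_t = np.linalg.qr((flip @ a).T)
    r = flip @ r_t.T @ flip
    q = flip @ q_t.T
    # Flip signs so the diagonal of r is positive and can be read as intrinsics.
    signs = np.sign(np.diag(r))
    return r * signs, signs[:, None] * q

def camera_from_rays(pixels, directions):
    """Recover K and R from homogeneous pixels [M, 3] and ray directions [M, 3]."""
    # Solve pixels @ A.T = directions in the least-squares sense.
    a_t, *_ = np.linalg.lstsq(pixels, directions, rcond=None)
    k, r_transposed = rq(np.linalg.inv(a_t.T))  # inv(A) = K @ R.T
    return k / k[2, 2], r_transposed.T

# Synthetic check with a known camera (values are illustrative only).
k_true = np.array([[500.0, 0.0, 259.0], [0.0, 500.0, 259.0], [0.0, 0.0, 1.0]])
angle = 0.3
r_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
us, vs = np.meshgrid(np.arange(0.0, 518.0, 37.0), np.arange(0.0, 518.0, 37.0))
pixels = np.stack([us.ravel(), vs.ravel(), np.ones(us.size)], axis=1)
directions = pixels @ np.linalg.inv(k_true).T @ r_true.T  # d = R @ inv(K) @ p per row

k_est, r_est = camera_from_rays(pixels, directions)
print(np.round(k_est, 3))
```

With noise-free rays the fit reproduces the original K and R. On real predictions the rays are noisy, which is where a robust formulation matters.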
Each pixel has a ray origin (the camera center) and a direction. Depth scales along each ray to produce 3D points.
The architecture of DA3 is notable for what it does not have. No epipolar transformers. No cost volumes. No cascaded decoders. No custom attention patterns with learned relative position biases. Just a single pretrained ViT with token rearrangement, plus a dual-headed decoder.
Let us trace what happens when you feed N = 4 images (each 518×518 pixels, the default resolution) through DA3-Large (ViT-L, L = 24 layers).
Step 1: Patchification. Each image is split into 14×14 pixel patches, giving 37×37 = 1,369 patch tokens per image. With a 1024-dim embedding, that is 4 images × 1,369 tokens × 1024 dims = a tensor of shape [4, 1369, 1024].
Step 2: Camera token injection. If camera parameters are available, each image gets a camera token c_i = MLP(f_i, q_i, t_i) of dimension 1024. If not, a learnable token c_l is used. This is prepended to the patch tokens: [4, 1370, 1024]. Camera tokens participate in all attention operations, giving the model geometric context throughout.
Step 3: Within-view attention (layers 1–16). The first L_s = 16 layers apply standard self-attention independently to each image's tokens. Each image's 1,370 tokens attend only to each other. This is exactly monocular feature extraction, the same computation as running DINOv2 on each image separately. Output: [4, 1370, 1024].
Step 4: Alternating cross/within-view attention (layers 17–24). The last L_g = 8 layers alternate between within-view attention and cross-view attention (all 4×1,370 = 5,480 tokens attend to each other), so four of the eight layers are cross-view. This is where multi-view reasoning happens. Cross-view layers let tokens from different images exchange information. Output: [4, 1370, 1024].
Step 5: Dual-DPT head. Multi-scale features from layers {6, 12, 18, 24} are extracted and fed through shared reassembly (upsample + project). Then they split: the depth branch produces D ∈ ℝ^(H×W×1), and the ray branch produces M ∈ ℝ^(H×W×6). A confidence map σ ∈ ℝ^(H×W×1) is also predicted.
Step 6: Camera head (optional). A small transformer D_C operates on the N camera tokens to predict (f, q, t) per view. It processes only 4 tokens, so the cost is negligible.
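The shape bookkeeping in these steps can be reproduced with a few lines of arithmetic. This is only a sketch of the token counts described above, and the function and key names are made up for the example.

```python
def token_shapes(num_views, image_size=518, patch_size=14, dim=1024):
    """Trace the token tensor shapes for a batch of views."""
    side = image_size // patch_size      # patches per image side
    patches = side * side                # patch tokens per image
    per_view = patches + 1               # one camera token prepended per view
    return {
        "patch_grid": (side, side),
        "patch_tokens": (num_views, patches, dim),
        "with_camera_token": (num_views, per_view, dim),
        # Within-view layers: num_views independent sequences of per_view tokens.
        "within_view_sequences": (num_views, per_view),
        # Cross-view layers: one merged sequence over all views.
        "cross_view_sequence": (1, num_views * per_view),
    }

shapes = token_shapes(4)
for name, shape in shapes.items():
    print(name, shape)
```

For four 518×518 images this gives a 37×37 patch grid, 1,370 tokens per view after the camera token, and a merged cross-view sequence of 5,480 tokens.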
The remarkable finding: DA3-Large (0.30B) is 3× smaller than VGGT (0.90B) yet outperforms it on 5 out of 10 geometry benchmarks. This validates the core insight — a fully pretrained single transformer beats a larger partially-pretrained two-transformer stack.
When N_v = 1 (monocular input), the cross-view attention layers become standard within-view attention (there are no other views to cross with). The model naturally reduces to a monocular depth estimator, with no architectural change and no special mode. This is why DA3 can also beat Depth Anything 2 on monocular benchmarks: it was trained on both monocular and multi-view data, and the architecture handles both seamlessly.
The magic that turns a monocular feature extractor into a multi-view geometry engine is remarkably simple: rearranging tokens before self-attention. No new parameters. No cross-attention modules. Just a different ordering of the same tokens going through the same attention heads.
Consider N = 3 images, each producing K = 1,369 tokens. In within-view layers, attention happens independently per image:
```python
# Within-view attention: each image attends only to itself.
# tokens shape: [N, K, D] = [3, 1369, 1024]
for i in range(N):
    tokens[i] = self_attention(tokens[i])  # [1369, 1024] -> [1369, 1024]

# Each image's tokens only see tokens from the same image.
# This is exactly what DINOv2 was pretrained to do.
```
In cross-view layers, we simply reshape the tensor to merge all tokens into one sequence:
```python
# Cross-view attention: ALL tokens attend to ALL tokens.
# tokens shape: [N, K, D] = [3, 1369, 1024]

# Step 1: Reshape to merge views
all_tokens = tokens.reshape(N * K, D)      # [4107, 1024]

# Step 2: Standard self-attention on the merged sequence
all_tokens = self_attention(all_tokens)    # [4107, 1024]

# Step 3: Reshape back
tokens = all_tokens.reshape(N, K, D)       # [3, 1369, 1024]
```
This is the entire cross-view mechanism. No new parameters, no new modules. The same attention weights that were pretrained for within-view feature extraction now also handle cross-view reasoning. The transformer learns to repurpose its existing attention heads for both tasks.
In the last L_g layers, attention alternates every layer:
```python
# Alternating attention in the last L_g = 8 layers
# Layer 17: within-view (standard DINOv2)
# Layer 18: cross-view (tokens merged across images)
# Layer 19: within-view
# Layer 20: cross-view
# Layer 21: within-view
# Layer 22: cross-view
# Layer 23: within-view
# Layer 24: cross-view
# Indices are 0-based: layer_idx 16..23 corresponds to layers 17..24.
for layer_idx in range(L_s, L):
    layer = layers[layer_idx]
    if (layer_idx - L_s) % 2 == 0:
        # Within-view: attend within each image
        for i in range(N):
            tokens[i] = layer(tokens[i])
    else:
        # Cross-view: attend across all images
        merged = tokens.reshape(N * K, D)
        merged = layer(merged)
        tokens = merged.reshape(N, K, D)
```
This alternation is key. A within-view layer refines each image's features independently (consolidating the cross-view information just received). Then a cross-view layer shares information again. This interleaving is much more effective than doing all cross-view attention at once (Table 6: Full Alt. drops performance).
Standard self-attention has O(S²) complexity where S is the sequence length. Within-view attention processes S = K tokens per image (manageable). Cross-view attention processes S = N×K tokens (potentially large). For N = 10 images at 518×518 resolution, that is S = 1,369 tokens per within-view sequence versus S = 13,690 tokens in the single cross-view sequence, so one cross-view layer computes 10× as many attention scores as the ten within-view sequences combined.
But only L_g/2 = 4 out of 24 layers are cross-view. The remaining 20 layers are cheap within-view attention. The total cost increase is moderate: roughly 1.3–1.5× compared to pure monocular processing.
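A rough back-of-the-envelope estimate shows why the overall increase stays moderate even though a cross-view layer has a much larger attention matrix. The sketch uses a common textbook approximation of multiply-accumulates per token per transformer layer (12·D² for the linear projections and a 4× MLP, plus 2·S·D for attention over a sequence of length S). It ignores softmax, normalization, and the decoder heads, so treat the result as an order-of-magnitude check rather than a measured number.

```python
def per_token_macs(seq_len, dim):
    """Approximate multiply-accumulates per token in one transformer layer."""
    linear = 12 * dim * dim          # QKV + output projection (4 D^2) and 4x MLP (8 D^2)
    attention = 2 * seq_len * dim    # QK^T scores and attention-weighted sum of V
    return linear + attention

def multi_view_cost_ratio(num_views, tokens_per_view, dim=1024,
                          total_layers=24, global_layers=8):
    """Cost of the alternating-attention model relative to running every view monocularly."""
    cross_layers = global_layers // 2
    within_layers = total_layers - cross_layers
    within = per_token_macs(tokens_per_view, dim)
    cross = per_token_macs(num_views * tokens_per_view, dim)
    multi_view = within_layers * within + cross_layers * cross
    monocular = total_layers * within
    return multi_view / monocular

ratio = multi_view_cost_ratio(num_views=10, tokens_per_view=1369)
print(round(ratio, 2))
```

Under these assumptions the ratio for ten views comes out a little under 1.3, because the linear layers dominate the per-token cost and are unaffected by merging views.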
When N = 1, cross-view attention reduces to within-view attention (only one image's tokens exist). There is no conditional logic, no mode switch. The model simply processes whatever tokens it receives. This means DA3 is inherently a monocular depth estimator that gains multi-view superpowers when given extra images — for free.
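The claim is easy to check numerically with a toy single-head attention in NumPy. The attention function below is a stripped-down stand-in (no multi-head split, no output projection, random weights), not the DA3 layer itself.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Toy single-head self-attention over a [S, D] sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def within_view(tokens, *w):
    """Attend inside each view separately. tokens: [N, K, D]."""
    return np.stack([self_attention(view, *w) for view in tokens])

def cross_view(tokens, *w):
    """Merge all views into one sequence, attend, and split back."""
    n, k, d = tokens.shape
    return self_attention(tokens.reshape(n * k, d), *w).reshape(n, k, d)

rng = np.random.default_rng(0)
dim = 8
weights = [rng.normal(size=(dim, dim)) for _ in range(3)]
single = rng.normal(size=(1, 5, dim))   # one view
multi = rng.normal(size=(3, 5, dim))    # three views

same_for_one_view = bool(np.allclose(within_view(single, *weights), cross_view(single, *weights)))
differs_for_three = not np.allclose(within_view(multi, *weights), cross_view(multi, *weights))
print(same_for_one_view, differs_for_three)
```

With one view the two code paths produce identical outputs. With three views they differ, because tokens now mix across images.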
The backbone produces a sequence of feature tokens per image. The Dual-DPT (Dense Prediction Transformer) head converts these tokens into dense pixel-level depth and ray maps. Understanding DPT architecture is essential because the dual-branch design is what makes the disentangled depth-ray prediction work.
DPT (Ranftl et al., 2021) is a decoder architecture for vision transformers that produces dense predictions. It takes features from multiple layers of the transformer backbone and progressively upsamples them:
DA3's key modification is splitting the DPT into two branches after the reassembly stage:
```python
# Backbone outputs features at 4 intermediate layers
features = backbone(images)    # list of 4 tensors, each [N, K, D]

# Reassemble: reshape tokens -> 2D feature maps, project dims.
# For ViT-L with 518×518 input: K = 37×37 patches.
# Shapes assume the standard DPT reassembly, which resamples each level
# to its own scale (4×, 2×, 1×, 0.5×) so that the fusion sums line up.
f6 = reassemble(features[0])   # [N, C, 148, 148]
f12 = reassemble(features[1])  # [N, C, 74, 74]
f18 = reassemble(features[2])  # [N, C, 37, 37]
f24 = reassemble(features[3])  # [N, C, 19, 19]
# Shared so far; now SPLIT

# Depth branch: progressive fusion (upsample + merge + conv)
d = depth_fusion_4(f24)        # [N, C, 37, 37] (resized to match f18)
d = depth_fusion_3(d + f18)    # [N, C, 74, 74] (2× upsample)
d = depth_fusion_2(d + f12)    # [N, C, 148, 148]
d = depth_fusion_1(d + f6)     # [N, C, 296, 296]
depth = depth_output(d)        # [N, 1, 518, 518]
conf = conf_output(d)          # [N, 1, 518, 518]

# Ray branch: same structure, different weights
r = ray_fusion_4(f24)          # [N, C, 37, 37]
r = ray_fusion_3(r + f18)      # [N, C, 74, 74]
r = ray_fusion_2(r + f12)      # [N, C, 148, 148]
r = ray_fusion_1(r + f6)       # [N, C, 296, 296]
rays = ray_output(r)           # [N, 6, 518, 518]
```
DA3's teacher model predicts depth in exponential space rather than linear depth or disparity. This is a deliberate engineering choice:
The depth branch also outputs a confidence map σ ∈ ℝ^(H×W). This is used in the confidence-aware loss following DUSt3R: pixels where the model is uncertain contribute less to the loss. During inference, confidence maps tell downstream applications which depth values to trust. Occluded regions, sky pixels, and reflective surfaces typically get low confidence.
Figure: backbone features flow through shared reassembly, then split into depth and ray branches.
The architecture is simple. The representation is simple. The hard part is training data. Real-world depth sensors produce noisy, sparse, and incomplete depth maps. Synthetic data has perfect depth but poor visual diversity. DA3's teacher-student paradigm bridges this gap elegantly.
DA3 trains on three types of data:
The key challenge: you cannot train a geometry foundation model on noisy, sparse labels. The model would learn to predict noisy, sparse depth. But you also cannot train only on synthetic data — the model would not generalize to real photos.
The solution is a two-stage pipeline:
Stage 1: Train a teacher on synthetic data only. The teacher is a monocular relative depth estimation model trained exclusively on synthetic datasets where depth is perfect. The training corpus is massive and diverse: Hypersim, TartanAir, vKITTI2, BlendedMVS, SPRING, MVSynth, UnrealStereo4K, KenBurns, GTA-SM, TauAgent, MatrixCity, EDEN, ReplicaGSO, UrbanSyn, PointOdyssey, Structured3D, Objaverse, Trellis, and OmniObject. This covers indoor, outdoor, object-centric, and diverse in-the-wild scenes.
The teacher outputs relative depth (not metric depth) — it knows the shape of depth but not the absolute scale. It predicts in depth space (not disparity), using exponential depth representation for uniform sensitivity across distances.
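One way to read "uniform sensitivity across distances" is sketched below, assuming that the exponential representation means depth = exp(raw network output). That mapping is an assumption made for illustration. Under it, the same small change in the raw output produces the same relative change in depth whether the surface is near or far.

```python
import math

def depth_from_raw(raw):
    """Assumed mapping from a raw network output to depth."""
    return math.exp(raw)

step = 0.01  # identical perturbation of the raw output in both cases

near_ratio = depth_from_raw(0.0 + step) / depth_from_raw(0.0)   # depth around 1 unit
far_ratio = depth_from_raw(4.0 + step) / depth_from_raw(4.0)    # depth around 55 units

print(near_ratio, far_ratio)
```

Both ratios are about 1.01, so a fixed error in the raw output corresponds to roughly a 1% depth error at any distance. A linear depth output would instead give a fixed absolute error that is large for near objects and negligible for far ones.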
Stage 2: Generate pseudo-labels for all real data. Run the teacher on every real-world image to get dense, clean, detailed pseudo-depth maps. Then align these pseudo-depth maps to the original sparse/noisy ground truth via RANSAC least squares. This gives us the best of both worlds: the dense, clean detail of the teacher's prediction and the scale of the original measurements.
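The alignment step can be sketched as a small RANSAC loop around a least-squares fit. The sketch assumes a scale-and-shift model between teacher depth and sensor depth at the valid pixels. The sample size, iteration count, and inlier tolerance are illustrative choices, not values from the paper.

```python
import numpy as np

def fit_scale_shift(pred, target):
    """Least-squares scale and shift so that scale * pred + shift ~= target."""
    design = np.stack([pred, np.ones_like(pred)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, target, rcond=None)
    return scale, shift

def ransac_align(pred, target, iters=200, sample=8, tol=0.05, seed=0):
    """Robustly align teacher depth to noisy sensor depth (1D arrays of valid pixels)."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(pred.size, size=sample, replace=False)
        scale, shift = fit_scale_shift(pred[idx], target[idx])
        inliers = np.abs(scale * pred + shift - target) < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the largest consensus set found.
    return fit_scale_shift(pred[best_inliers], target[best_inliers])

# Synthetic example: sensor depth = 3 * teacher depth + 0.5, with 20% gross outliers.
pred = np.linspace(0.1, 1.0, 200)
target = 3.0 * pred + 0.5
target[::5] += 5.0

scale, shift = ransac_align(pred, target)
print(round(float(scale), 3), round(float(shift), 3))
```

A plain least-squares fit would be pulled toward the outliers, while the consensus-based fit recovers the underlying scale and shift.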
The student model (DA3 itself) is then trained on the aligned pseudo-labels using a composite loss that combines five terms: ℒ_D, ℒ_M, ℒ_P, ℒ_C, and ℒ_grad.
Let us break down each term:
ℒ_D (depth loss, confidence-aware):
ℒ_D = (1/|Ω|) Σ_{p∈Ω} ( D_{c,p} · |D̂_p − D_p| − λ_c · log D_{c,p} )
Where Ω is the set of valid pixels, D̂_p and D_p are the predicted and ground truth depth, D_{c,p} is the predicted confidence at pixel p, and λ_c weights the confidence regularizer. The first term penalizes depth error weighted by confidence. The second term (−log D_{c,p}) prevents the model from cheating by setting all confidences to zero. The balance: the model learns to assign high confidence to accurate predictions and low confidence to uncertain ones.
ℒ_M (ray map loss): L1 loss between predicted and ground truth ray maps. Supervises both ray origins and ray directions.
ℒ_P (point map loss): Computes 3D points from predicted depth and ray directions (P = D̂ · d + t), then penalizes the error against ground truth 3D points. This is a consistency loss: even if depth and rays are individually approximate, the combined 3D reconstruction should be accurate.
ℒ_C (camera loss): Optional, supervises the camera head's predictions of (f, q, t). Only active when camera parameters are known in the training data.
ℒ_grad (gradient loss):
ℒ_grad = ‖∇_x D̂ − ∇_x D‖₁ + ‖∇_y D̂ − ∇_y D‖₁
Penalizes differences in depth gradients. This preserves sharp edges (furniture boundaries, object silhouettes) while allowing smooth depth on planar surfaces. Without this term, the model would blur depth edges.
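The depth and gradient terms can be sketched in NumPy as follows. The L1 error, the regularizer weight `lam`, and the finite-difference gradients are assumptions made for this illustration. The point is to show the trade-off the confidence term creates.

```python
import numpy as np

def confidence_depth_loss(pred, target, conf, mask, lam=1.0):
    """Confidence-weighted L1 depth error with a -log(conf) regularizer, averaged over valid pixels."""
    per_pixel = conf * np.abs(pred - target) - lam * np.log(conf)
    return (per_pixel * mask).sum() / max(mask.sum(), 1)

def gradient_loss(pred, target):
    """L1 difference between horizontal and vertical finite-difference depth gradients."""
    dx = np.abs(np.diff(pred, axis=1) - np.diff(target, axis=1)).mean()
    dy = np.abs(np.diff(pred, axis=0) - np.diff(target, axis=0)).mean()
    return dx + dy

target = np.zeros((4, 4))
pred = np.full((4, 4), 4.0)          # every pixel is off by 4 units
mask = np.ones((4, 4), dtype=bool)

loss_confident = confidence_depth_loss(pred, target, np.full((4, 4), 1.0), mask)
loss_calibrated = confidence_depth_loss(pred, target, np.full((4, 4), 0.25), mask)
loss_collapsed = confidence_depth_loss(pred, target, np.full((4, 4), 1e-6), mask)
offset_only = gradient_loss(pred, target)  # constant offset leaves gradients unchanged

print(loss_confident, loss_calibrated, loss_collapsed, offset_only)
```

For a badly wrong prediction, lowering the confidence reduces the loss, but pushing it toward zero is penalized by the log term. The gradient loss ignores a constant offset and only reacts to differences in local structure.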
Before computing any loss, all ground truth signals are normalized by a common scale factor: the mean L2 norm of valid reprojected point maps P. This ensures that a 10m outdoor scene and a 1m tabletop scene contribute equally to the loss. Without this, the model would overfit to whichever scale dominates the training set.
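A minimal sketch of this normalization is shown below. It assumes that the scale is applied to the point map, the depth map, and the ray origins (the quantities that carry metric scale), and the function name is hypothetical.

```python
import numpy as np

def normalize_scene(points, depth, ray_origins, valid):
    """Divide scale-carrying ground truth by the mean L2 norm of the valid points."""
    scale = np.linalg.norm(points[valid], axis=-1).mean()
    return points / scale, depth / scale, ray_origins / scale, scale

rng = np.random.default_rng(0)
valid = np.ones((4, 4), dtype=bool)
tabletop_points = rng.uniform(0.5, 1.5, size=(4, 4, 3))   # roughly 1 m scene
tabletop_depth = tabletop_points[..., 2]
tabletop_origins = np.zeros((4, 4, 3))

# The same geometry at 10x the size, standing in for a larger outdoor scene.
outdoor = normalize_scene(10 * tabletop_points, 10 * tabletop_depth, 10 * tabletop_origins, valid)
tabletop = normalize_scene(tabletop_points, tabletop_depth, tabletop_origins, valid)

print(tabletop[3], outdoor[3])
```

After normalization the two scenes have identical targets, so neither scale dominates the loss.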
Visualizing how sparse/noisy real-world depth is transformed into clean, dense pseudo-labels via the teacher model.
DA3 introduces a comprehensive Visual Geometry Benchmark that evaluates three capabilities: pose estimation accuracy, geometric reconstruction accuracy, and visual rendering quality. It covers 5 datasets (89+ scenes) spanning object-level to indoor/outdoor environments.
Pose accuracy is measured by Auc3 and Auc30 (area under the accuracy curve at 3° and 30° thresholds for relative rotation and translation). Higher is better.
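For intuition, here is one common way to discretize such an AUC metric: take the larger of the rotation and translation angular errors for each image pair, measure the fraction of pairs below each integer threshold up to the cutoff, and average. The exact binning used by the benchmark may differ, so this is an illustrative sketch rather than the official evaluation code.

```python
def pose_auc(rot_err_deg, trans_err_deg, max_threshold):
    """Area under the accuracy-vs-threshold curve, in percent, using 1-degree steps."""
    errors = [max(r, t) for r, t in zip(rot_err_deg, trans_err_deg)]
    accuracies = []
    for threshold in range(1, max_threshold + 1):
        accuracies.append(sum(e < threshold for e in errors) / len(errors))
    return 100.0 * sum(accuracies) / max_threshold

# Three hypothetical image pairs: one very accurate, one decent, one poor.
rot = [0.5, 2.5, 10.0]
trans = [0.3, 1.0, 4.0]

auc3 = pose_auc(rot, trans, 3)
auc30 = pose_auc(rot, trans, 30)
print(round(auc3, 2), round(auc30, 2))
```

The strict 3° cutoff only rewards near-perfect poses, which is why AUC at 3° is much lower than AUC at 30° for the same predictions.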
The results tell a clear story. On challenging datasets with sparse views and textureless regions, DA3-Giant destroys all competition:
The average improvement over VGGT across all five datasets: 35.7% in camera pose accuracy.
Geometry accuracy is measured by reconstructing point clouds from predicted depth and poses, aligning them to ground truth via Umeyama alignment, and computing F-Score (all datasets except DTU, where Chamfer Distance is used).
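The F-Score itself is straightforward once the clouds are aligned. The brute-force sketch below assumes alignment has already been done and uses a hypothetical distance threshold `tau`. Real evaluations use nearest-neighbor structures for large clouds.

```python
import numpy as np

def f_score(pred, gt, tau):
    """F-Score between point clouds pred [P, 3] and gt [G, 3] at distance threshold tau."""
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # [P, G]
    precision = (dists.min(axis=1) < tau).mean()   # predicted points close to some GT point
    recall = (dists.min(axis=0) < tau).mean()      # GT points close to some predicted point
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.01], [5.0, 5.0, 5.0]])  # one good point, one far off

perfect = f_score(gt, gt, tau=0.05)
partial = f_score(pred, gt, tau=0.05)
print(perfect, partial)
```

Precision penalizes predicted points that float away from the surface, and recall penalizes surface regions the prediction misses. The F-Score balances the two.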
DA3-Giant achieves 23.6% relative improvement over VGGT and 16.7% over Pi3 on average across all five datasets in the pose-free setting. Key numbers:
Even on monocular depth benchmarks (single image input), DA3 outperforms Depth Anything 2 with an average rank of 2.20 vs 2.60. On ETH3D, DA3 achieves 98.6 δ₁ vs DA2's 86.5, a gap of 12.1 points. The teacher model (trained on synthetic data) achieves rank 1.00, showing headroom for future improvements.
Honest limitations the paper acknowledges or we can infer:
DA3 is not just a depth/pose estimator — it is a geometry backbone that can power downstream 3D tasks. To prove this, the paper fine-tunes DA3 for feed-forward novel view synthesis (FF-NVS): given a few input images, render the scene from a new camera viewpoint, without any per-scene optimization.
Following the minimal modeling philosophy, FF-NVS is achieved by adding a single new DPT head (GS-DPT) to the existing DA3 backbone. This head predicts per-pixel 3D Gaussian parameters:
The 3D position of each Gaussian comes from DA3's existing depth and ray predictions: P_i = t + D_i · d_i. These pixel-aligned 3D Gaussians are then splatted to render novel views using standard 3DGS rasterization.
The paper introduces a new NVS benchmark across three datasets: DL3DV (140 scenes), Tanks & Temples (6 scenes), and MegaDepth (19 scenes). Each scene has ~300 frames with COLMAP-estimated camera poses as ground truth.
DA3 outperforms all competitors:
Two key findings from this experiment:
1. Geometry-model-based NVS consistently beats specialized NVS models. All geometry backbones (DA3, VGGT, Fast3R) outperform dedicated 3DGS models (pixelSplat, MVSplat, DepthSplat) that use epipolar transformers or cost volumes. Large-scale geometric pretraining provides better features than task-specific architectures designed from scratch for NVS.
2. NVS quality correlates with geometry quality. Among the geometry backbones, the ranking on NVS (DA3 > VGGT > MV-DUSt3R > Fast3R) perfectly matches the ranking on geometry benchmarks. Better depth and pose estimation directly translates to better novel view rendering. This suggests FF-NVS can be effectively addressed simply by improving the geometry backbone.
Table 5 in the paper reveals another important result. For the GS-DPT head, having ray maps in the prediction targets dramatically improves NVS performance compared to using point clouds:
The disentangled depth-ray representation helps NVS because the Gaussian positions (from depth × ray direction) are more accurate when depth and camera are predicted separately.
From input images to novel view rendering: the DA3 backbone produces depth + rays, the GS-DPT head adds Gaussian parameters, and standard 3DGS rasterization renders the new view.
Depth Anything (V1, 2024) — The first generation used self-training on 62M unlabeled images with a DINOv2 backbone. Monocular depth only. No multi-view capability. Key innovation: using a pretrained teacher on labeled data, then training a student on unlabeled data where the teacher provides pseudo-labels.
Depth Anything V2 (2024) — Replaced mixed labeled/unlabeled training with a cleaner strategy: train the teacher on synthetic data only, generate pseudo-labels for real data, train the student on pseudo-labels. This is the direct ancestor of DA3's teacher-student paradigm. V2 was monocular only.
Depth Anything 3 (2026) — Extends from monocular to any-view. The teacher-student paradigm is inherited from V2. The backbone is still DINOv2. The key innovations are: (1) depth-ray representation instead of depth alone, (2) cross-view attention via token rearrangement, (3) the Dual-DPT head. DA3 subsumes V2 — with one image, it reduces to a monocular depth estimator that outperforms V2.
DUSt3R (2024) — First to predict pointmaps directly from image pairs. Two images in, two pointmaps out. For multi-view, requires expensive global alignment optimization (iteratively fusing pairwise results). DA3's advantage: handles any number of views natively via cross-view attention.
MASt3R (2024) — Extended DUSt3R with dense feature matching capabilities. Still limited to image pairs.
VGGT (CVPR 2025 Best Paper) — First to process N images simultaneously and predict all 3D geometry in one forward pass. Uses two transformers (pretrained DINOv2 + untrained cross-view transformer). Predicts multiple targets (depth, pointmaps, poses, correspondences). DA3's advantages: (1) single fully-pretrained transformer vs. two partially-pretrained, (2) minimal depth+ray targets vs. multi-task, (3) 0.30B beats 0.90B on many benchmarks.
Pi3 (2025) — Another multi-view geometry model. Strong on some benchmarks but outperformed by DA3-Giant across the board.
MapAnything (2025) — Feed-forward metric 3D reconstruction. Benefits from pose conditioning like DA3 but does not match DA3's pose-free performance.
DA3 demonstrates a powerful principle: minimal modeling with maximal pretraining. Instead of designing complex architectures with geometric inductive biases (cost volumes, epipolar attention, point cloud decoders), use a standard pretrained transformer and make the smallest possible modification (token rearrangement) to enable the new capability (cross-view reasoning). The teacher-student paradigm handles data quality. The depth-ray representation handles target design. Everything else is inherited from DINOv2.
This points toward a future where a single pretrained geometry backbone serves as the foundation for all 3D vision tasks — depth, poses, reconstruction, NVS, SLAM, and beyond. DA3 is the strongest evidence yet that this future is achievable.