Depth Anything 3

Chapter 0: The Problem

You have photos of a room. Maybe one photo, maybe five, maybe a hundred frames from a phone video. You want to know: how far away is every pixel? Where was the camera when each photo was taken? Can I combine all of these into a single 3D point cloud?

Right now, you face a fragmented landscape. If you have one image, you use a monocular depth estimator like Depth Anything 2. If you have two images, you use DUSt3R. If you have many images, you use COLMAP (a slow, multi-stage pipeline) or VGGT (a large two-transformer model). Each tool is a separate system with its own architecture, training data, and failure modes.

What if one model could handle all of these cases? One image, two images, a hundred frames — all processed by the same network, producing consistent geometry every time?

The fragmentation tax: Every specialized model means separate training, separate deployment, separate failure modes. Monocular depth models predict scale-ambiguous depth. Multi-view systems predict point clouds that entangle camera and scene geometry. Classical SfM pipelines (COLMAP) work well on textured surfaces but fail catastrophically on textureless walls, reflective surfaces, or moving objects. There is no single system that handles arbitrary view counts with consistent, metrically accurate geometry.

Consider the specific failure modes. COLMAP achieves only 13.0 Auc3 on HiRoom (a challenging indoor dataset with textureless walls) compared to 81.7 from DA3. On ScanNet++, COLMAP gets 13.3 Auc3 versus 83.2 from DA3. Classical methods need well-textured surfaces to find correspondences. When those correspondences break, the entire pipeline collapses.

Feed-forward models like VGGT improved on this, but they use a complex two-transformer architecture (a pretrained DINOv2 encoder + a separate cross-view transformer trained from scratch), predict redundant targets (depth maps + point maps + poses + correspondences), and still struggle on challenging scenes. VGGT's 0.90B parameter model is impressive, but DA3 shows that a 0.30B model can surpass it on half of the benchmark settings (5 of 10).

The core question DA3 asks is: what is the minimal set of prediction targets and the minimal architecture that can recover 3D geometry from any number of views? The answer turns out to be surprisingly simple: predict depth + rays from a single pretrained transformer. No custom architecture. No point cloud heads. No iterative optimization. Just two dense maps per image, combined with element-wise operations.

Why do classical SfM pipelines like COLMAP fail on indoor scenes with textureless walls?

They rely on detecting distinctive keypoint correspondences between images. Textureless regions produce no matches, so the pipeline cannot estimate camera poses or reconstruct geometry.
Indoor lighting is too dim for feature extraction.
COLMAP only works on outdoor scenes by design.

Chapter 1: The Key Insight

DA3's insight comes in two parts, each challenging a widely held assumption in the field.

Insight 1: A single plain transformer is enough

VGGT uses two separate transformers stacked together: a pretrained DINOv2 backbone plus a separate cross-view transformer trained from scratch. This means two-thirds of the model's blocks have never seen large-scale pretraining. DA3 asks: what if we just used one transformer and rearranged the attention pattern inside it?

Take a vanilla DINOv2 encoder. It already knows how to extract powerful visual features. Instead of adding a second transformer for cross-view reasoning, DA3 simply rearranges input tokens in the last third of the network so that self-attention happens across views instead of within views. No new parameters. No new architecture. Just tensor reordering.

The result: a ViT-L model (0.30B parameters) using DA3's approach outperforms a VGGT-style architecture with comparable parameter count by 20% (Table 6 in the paper). Full pretraining of every layer beats partial pretraining of a larger stack.

Insight 2: Depth + rays is the minimal prediction target

Previous unified 3D models predicted a zoo of outputs: point maps, depth maps, camera poses, correspondences, confidence maps. DA3 shows that exactly two dense predictions per image are sufficient:

A depth map D ∈ ℝ^H×W — how far every pixel is from the camera
A ray map M ∈ ℝ^H×W×6 — for each pixel, the origin (t ∈ ℝ³) and direction (d ∈ ℝ³) of the camera ray in world coordinates

From these two outputs, a 3D point in world coordinates is simply:

P = t + D(u, v) · d

That is it. Element-wise multiplication and addition. No matrix inversions. No rotation decomposition. No iterative optimization. Every pixel's 3D position falls out from combining its depth with its ray direction.

Why not just predict point clouds directly? DUSt3R predicts per-pixel 3D point maps. But point maps entangle depth and camera information into a single representation. When you predict a 3D point, errors in depth and errors in camera pose are mixed together and cannot be separated. DA3's ablation (Table 5) shows this clearly: directly predicting point clouds achieves only 31.6 Auc3, while depth + ray achieves 36.0 Auc3 — a 14% improvement. The disentangled representation lets each head specialize, and the ray map implicitly encodes the full camera pose without requiring orthogonality constraints on rotation matrices.

The paper also adds a lightweight camera head that predicts camera parameters directly (FOV f ∈ ℝ², quaternion q ∈ ℝ⁴, translation t ∈ ℝ³) from a single token per view. This is optional — you can extract camera parameters from the ray map instead — but the camera head is 18.7× faster (0.46ms vs 8.60ms on an A100 GPU). Since it costs only ~0.1% of backbone computation, they include it for free.

Why does predicting per-pixel ray maps work better than predicting rotation matrices directly?

Rotation matrices are always 3×3 and too small to capture camera info.
Ray maps avoid the orthogonality constraint (R^TR = I) that makes direct rotation prediction hard, while encoding the same information in a dense pixel-aligned format that the DPT head can naturally output.
Ray maps use fewer parameters than rotation matrices.

Chapter 2: The Depth-Ray Representation

Before we look at the neural network, we need to deeply understand what it predicts. The depth-ray representation is the foundation of everything DA3 does, and it is beautifully simple once you see where it comes from.

From camera geometry to rays

Start with the standard pinhole camera model. A pixel p = (u, v, 1)^T in image i projects to a 3D point P in world coordinates via:

P = R_i · D_i(u,v) · K_i^-1 p + t_i

Where:

D_i(u,v) is the depth at pixel (u,v) — how far the point is from the camera along the viewing ray
K_i is the 3×3 intrinsic matrix (focal lengths, principal point) — maps pixel coordinates to camera coordinates
R_i is the 3×3 rotation matrix — maps from camera frame to world frame
t_i is the 3D translation — the camera center in world coordinates

The problem: predicting R_i directly is hard because rotation matrices must satisfy R^TR = I and det(R) = 1. These are nonlinear constraints that a neural network has no natural way to enforce. Previous works predict quaternions or 6D rotation representations, but these require careful normalization and still struggle.

The ray map trick

DA3 sidesteps this entirely. Instead of predicting R_i and K_i separately, it predicts a dense ray map M_i ∈ ℝ^H×W×6 that stores, for every pixel, the camera ray in world coordinates.

Each pixel's ray has two components:

r = (t, d) where t ∈ ℝ³, d ∈ ℝ³

t = ray origin = camera center in world coordinates (same for all pixels in one image, but predicted per-pixel for robustness)
d = ray direction = R · K^-1 · p — the backprojected pixel direction, rotated into world frame

The direction d is not normalized. This is crucial: its magnitude preserves the projection scale, encoding both intrinsics and rotation in a single vector. A pixel near the edge of a wide-angle lens has a longer d than a pixel near the center — the direction itself encodes the focal length.
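As a rough illustration of the definition above, the following NumPy sketch builds a ray map from a known camera. The function name `make_ray_map` and the example intrinsics are assumptions chosen for illustration, not values from the paper. It shows that the unnormalized direction is exactly length 1 at the principal point and grows toward the image corner.

```python
import numpy as np

def make_ray_map(K, R, t, height, width):
    """Build an [H, W, 6] ray map: origin t in channels 0..2, d = R K^-1 p in 3..5."""
    # Homogeneous pixel coordinates p = (u, v, 1) for every pixel.
    u, v = np.meshgrid(np.arange(width, dtype=np.float64),
                       np.arange(height, dtype=np.float64))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)       # [H, W, 3]
    # Row-vector form of d = R K^-1 p, deliberately left unnormalized.
    directions = pixels @ np.linalg.inv(K).T @ R.T             # [H, W, 3]
    # A pinhole camera has one center, so the origin is the same at every pixel.
    origins = np.broadcast_to(t, directions.shape)
    return np.concatenate([origins, directions], axis=-1)      # [H, W, 6]

# Hypothetical camera: 500 px focal length, principal point at (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.5, 1.2, 0.0])
ray_map = make_ray_map(K, R, t, height=480, width=640)

center_norm = np.linalg.norm(ray_map[240, 320, 3:])   # pixel at the principal point
corner_norm = np.linalg.norm(ray_map[0, 0, 3:])       # top-left corner pixel
print(center_norm, corner_norm)
```

Halving the focal length in `K` makes the corner direction longer still, which is how the direction field carries intrinsics without a separate output.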

Worked example: recovering 3D points

Suppose we have a single image of a table. The network predicts:

Depth at pixel (100, 150): D = 2.3 meters
Ray origin at that pixel: t = (0.5, 1.2, 0.0) (camera is 0.5m right, 1.2m up from world origin)
Ray direction at that pixel: d = (-0.1, -0.3, 0.95) (pointing slightly down-left and mostly forward)

The 3D point is:

P = t + D · d = (0.5, 1.2, 0.0) + 2.3 · (-0.1, -0.3, 0.95) = (0.27, 0.51, 2.185)

That is it. Scalar multiplication and vector addition. For an entire image of H×W pixels, this is a single broadcasted element-wise operation — no matrix inversions, no SVD decompositions, no iterative solvers.
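The broadcasted form can be sketched in a few lines of NumPy. The helper name `unproject` is an assumption for illustration; the values are the ones from the worked example, stored in a 1×1 "image".

```python
import numpy as np

def unproject(depth, ray_map):
    """depth: [H, W], ray_map: [H, W, 6] -> world points [H, W, 3] via P = t + D * d."""
    origins = ray_map[..., :3]
    directions = ray_map[..., 3:]
    # depth[..., None] broadcasts the scalar depth over the 3 direction components.
    return origins + depth[..., None] * directions

depth = np.array([[2.3]])
ray_map = np.array([[[0.5, 1.2, 0.0, -0.1, -0.3, 0.95]]])
points = unproject(depth, ray_map)
print(points[0, 0])   # approximately (0.27, 0.51, 2.185)
```

The same call works unchanged for a full H×W depth map and ray map, since nothing in the function depends on the image size.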

Why per-pixel origins?

In a perfect pinhole camera, all rays originate from the same point (the camera center). So why predict t per-pixel instead of once per image?

Dense supervision beats sparse prediction. The paper's ablation (Table 8) answers this directly. Replacing per-pixel ray origin prediction with a single MLP-based global prediction drops performance from 35.1 to 32.2 average Auc3. The per-pixel formulation provides dense supervision during training — every pixel contributes a gradient signal to the ray head, creating much richer learning than a single global prediction. At inference, the predicted origins are nearly constant across pixels (as expected), but the dense training signal matters enormously.

Recovering camera parameters from rays

If you need explicit camera parameters (for downstream applications), you can extract them from the ray map. The procedure is:

Camera center: Average all per-pixel ray origins: t_c = mean(M[:,:,0:3])
Rotation and intrinsics: Define canonical rays d_I = K_I^-1 p with identity intrinsics. The transformation from canonical to predicted rays is d_cam = KR d_I, forming a homography H = KR. Solve for H* via least-squares matching between canonical and predicted ray directions:

H* = arg min_H Σ_h,w || H p_h,w × M(h,w,3:) ||

This is a standard DLT (Direct Linear Transform) problem. Once H* is found, decompose it via RQ decomposition into K and R. Total cost: 8.60ms on an A100 GPU. But the camera head does the same thing in 0.46ms — an 18.7× speedup. That is why DA3 includes both: the ray head for dense geometric supervision during training, the camera head for fast inference.
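To make the recovery step concrete, here is a simplified sketch, not the paper's solver. It assumes the ray definition d = R · K^-1 · p from earlier in this chapter and noise-free, unnormalized directions, so it fits the pixel-to-direction matrix R · K^-1 with ordinary linear least squares instead of the cross-product DLT objective, then splits it with a QR decomposition. The function names and the synthetic camera are illustrative assumptions.

```python
import numpy as np

def pixel_grid(height, width):
    """Homogeneous pixel coordinates p = (u, v, 1), flattened to [H*W, 3]."""
    u, v = np.meshgrid(np.arange(width, dtype=np.float64),
                       np.arange(height, dtype=np.float64))
    return np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)

def camera_from_ray_map(ray_map):
    """Recover (center, R, K) from an [H, W, 6] ray map, assuming d = R K^-1 p."""
    height, width, _ = ray_map.shape
    # Camera center: average of the per-pixel ray origins.
    center = ray_map[..., :3].reshape(-1, 3).mean(axis=0)
    pixels = pixel_grid(height, width)
    directions = ray_map[..., 3:].reshape(-1, 3)
    # Least squares for H in d = H p (row-vector form: directions = pixels @ H.T).
    solution, *_ = np.linalg.lstsq(pixels, directions, rcond=None)
    H = solution.T
    # H = R K^-1 is (orthogonal) x (upper triangular), which is a QR factorization.
    Q, U = np.linalg.qr(H)
    # Make the triangular factor's diagonal positive so the split is unique.
    signs = np.sign(np.diag(U))
    signs[signs == 0] = 1.0
    R = Q * signs                 # flip the matching columns of Q
    K_inv = signs[:, None] * U    # flip the matching rows of U
    K = np.linalg.inv(K_inv)
    return center, R, K / K[2, 2]

# Synthetic check with a hypothetical camera rotated 0.3 rad about the y axis.
angle = 0.3
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
K_true = np.array([[60.0, 0.0, 32.0], [0.0, 60.0, 24.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.5, 1.2, 0.0])
height, width = 48, 64
directions = pixel_grid(height, width) @ np.linalg.inv(K_true).T @ R_true.T
ray_map = np.concatenate(
    [np.broadcast_to(t_true, directions.shape), directions], axis=-1
).reshape(height, width, 6)

center, R_est, K_est = camera_from_ray_map(ray_map)
print(np.round(K_est, 3))
```

With noisy predicted rays whose scale is unreliable, a cross-product objective like the one above is the more robust choice, because it only compares directions.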

Given a pixel with depth D = 3.0, ray origin t = (0, 0, 0), and ray direction d = (0, 0, 1), what is the 3D world point?

P = (0, 0, 3): simply t + D · d = (0,0,0) + 3·(0,0,1)
P = (3, 3, 3): depth applies to all coordinates
P = (0, 0, 1): direction is already the point

Chapter 3: Architecture

The architecture of DA3 is notable for what it does not have. No epipolar transformers. No cost volumes. No cascaded decoders. No custom attention patterns with learned relative position biases. Just a single pretrained ViT with token rearrangement, plus a dual-headed decoder.

The three components

1. Single Transformer Backbone

Vanilla DINOv2 (ViT-S/B/L/G) — all weights pretrained. L_s layers of within-view attention, L_g layers alternating cross-view and within-view. No new parameters added to the backbone.

↓

2. Camera Encoder (optional)

If camera poses are known, encode (f_i, q_i, t_i) ∈ ℝ⁹ via MLP → camera token c_i. If unknown, use a shared learnable placeholder token. Prepended to patch tokens.

↓

3. Dual-DPT Head

Shared reassembly modules extract multi-scale features, then split into two branches: one for depth, one for rays. Each branch has its own fusion layers and output head. Also outputs confidence maps.

Data flow: tracing a forward pass

Let us trace what happens when you feed N = 4 images (each 518×518 pixels, the default resolution) through DA3-Large (ViT-L, L = 24 layers).

Step 1: Patchification. Each image is split into 14×14 pixel patches, giving 37×37 = 1,369 patch tokens per image. With a 1024-dim embedding, that is 4 images × 1,369 tokens × 1024 dims = a tensor of shape [4, 1369, 1024].

Step 2: Camera token injection. If camera parameters are available, each image gets a camera token c_i = MLP(f_i, q_i, t_i) of dimension 1024. If not, a learnable token c_l is used. This is prepended to the patch tokens: [4, 1370, 1024]. Camera tokens participate in all attention operations, giving the model geometric context throughout.

Step 3: Within-view attention (layers 1–16). The first L_s = 16 layers apply standard self-attention independently to each image's tokens. Each image's 1,370 tokens attend only to each other. This is exactly monocular feature extraction — the same computation as running DINOv2 on each image separately. Output: [4, 1370, 1024].

Step 4: Alternating within/cross-view attention (layers 17–24). The last L_g = 8 layers alternate layer by layer: layers 17, 19, 21, and 23 do within-view attention, and layers 18, 20, 22, and 24 do cross-view attention (all 4×1,370 = 5,480 tokens attend to each other). This is where multi-view reasoning happens. Cross-view layers let tokens from different images exchange information. Output: [4, 1370, 1024].

Step 5: Dual-DPT head. Multi-scale features from layers {6, 12, 18, 24} are extracted and fed through shared reassembly (upsample + project). Then they split: depth branch produces D ∈ ℝ^H×W×1, ray branch produces M ∈ ℝ^H×W×6. A confidence map σ ∈ ℝ^H×W×1 is also predicted.

Step 6: Camera head (optional). A small transformer D_C operates on the N camera tokens to predict (f, q, t) per view. Processes 4 tokens — negligible cost.
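The token counts in the steps above follow from simple arithmetic, sketched here in plain Python. The function name and dictionary keys are illustrative assumptions; the default sizes are the ones quoted in the walkthrough.

```python
def trace_token_shapes(num_views, image_size=518, patch_size=14, embed_dim=1024):
    """Return the tensor shapes and sequence lengths for a given number of views."""
    patches_per_side = image_size // patch_size          # 518 / 14 = 37
    patch_tokens = patches_per_side ** 2                  # 37 * 37 = 1369
    tokens_per_view = patch_tokens + 1                    # plus one camera token
    return {
        "patches": (num_views, patch_tokens, embed_dim),
        "with_camera_token": (num_views, tokens_per_view, embed_dim),
        "within_view_sequence": tokens_per_view,
        "cross_view_sequence": num_views * tokens_per_view,
    }

shapes = trace_token_shapes(num_views=4)
for name, value in shapes.items():
    print(f"{name}: {value}")
```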

Why L_s:L_g = 2:1? The paper ablates this ratio in Table 6. Full alternation (L_g = L, so cross-view layers are interleaved from the very first layer) is worse because it disrupts the pretrained within-view features too aggressively. The 2:1 split keeps 2/3 of layers operating exactly as DINOv2 was pretrained, preserving strong monocular features while adding cross-view reasoning in the final third. This is the "minimal modeling" philosophy in action: change as little as possible from the pretrained model.

Model sizes

DA3-Small

ViT-S · ~22M params · Fastest, lowest accuracy

DA3-Base

ViT-B · ~86M params · Good accuracy, efficient

DA3-Large

ViT-L · ~0.30B params · Surpasses VGGT (0.90B) on 5/10 settings

DA3-Giant

ViT-G · ~1.1B params · SOTA on 18/20 benchmark settings

The remarkable finding: DA3-Large (0.30B) is 3× smaller than VGGT (0.90B) yet outperforms it on 5 out of 10 geometry benchmarks. This validates the core insight — a fully pretrained single transformer beats a larger partially-pretrained two-transformer stack.

Single-image fallback

When N = 1 (monocular input), the cross-view attention layers become standard within-view attention (there are no other views to cross with). The model naturally reduces to a monocular depth estimator, with no architectural change and no special mode. This is why DA3 can also beat Depth Anything 2 on monocular benchmarks: it was trained on both monocular and multi-view data, and the architecture handles both seamlessly.

DA3-Large (0.30B params) outperforms VGGT (0.90B params) on many benchmarks. What architectural difference explains this?

DA3 uses a more complex architecture with more layers.
DA3 uses a single fully pretrained transformer (all layers from DINOv2), while VGGT stacks two transformers where 2/3 of the blocks are trained from scratch. Full pretraining beats partial pretraining at equal scale.
DA3 has more training data.

Chapter 4: Cross-View Attention

The magic that turns a monocular feature extractor into a multi-view geometry engine is remarkably simple: rearranging tokens before self-attention. No new parameters. No cross-attention modules. Just a different ordering of the same tokens going through the same attention heads.

Within-view attention (layers 1–L_s)

Consider N = 3 images, each producing K = 1,369 tokens. In within-view layers, attention happens independently per image:

# Within-view attention: each image attends to itself
# tokens shape: [N, K, D] = [3, 1369, 1024]
for i in range(N):
    tokens[i] = self_attention(tokens[i])  # [1369, 1024] → [1369, 1024]

# Each image's tokens only see tokens from the same image.
# This is exactly what DINOv2 was pretrained to do.

Cross-view attention (every second layer in L_g)

In cross-view layers, we simply reshape the tensor to merge all tokens into one sequence:

# Cross-view attention: ALL tokens attend to ALL tokens
# tokens shape: [N, K, D] = [3, 1369, 1024]

# Step 1: Reshape to merge views
all_tokens = tokens.reshape(N * K, D)  # [4107, 1024]

# Step 2: Standard self-attention on the merged sequence
all_tokens = self_attention(all_tokens)  # [4107, 1024]

# Step 3: Reshape back
tokens = all_tokens.reshape(N, K, D)  # [3, 1369, 1024]

This is the entire cross-view mechanism. No new parameters, no new modules. The same attention weights that were pretrained for within-view feature extraction now also handle cross-view reasoning. The transformer learns to repurpose its existing attention heads for both tasks.
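A runnable toy version of the two modes may help. This is a minimal single-head NumPy sketch with random weights and tiny dimensions, all of which are assumptions for illustration; a real ViT block also has multiple heads, layer norm, and an MLP. It shows that both modes use the same weights and that they coincide when only one view is present.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention on a [S, D] sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v

def within_view(tokens, *weights):
    """Each view attends only to its own tokens."""
    return np.stack([self_attention(view, *weights) for view in tokens])

def cross_view(tokens, *weights):
    """All views are merged into one sequence, then split back."""
    n, k, d = tokens.shape
    merged = tokens.reshape(n * k, d)
    return self_attention(merged, *weights).reshape(n, k, d)

rng = np.random.default_rng(0)
N, K, D = 3, 5, 8                                    # toy sizes, not 1369 x 1024
tokens = rng.normal(size=(N, K, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

within_out = within_view(tokens, w_q, w_k, w_v)
cross_out = cross_view(tokens, w_q, w_k, w_v)

# With a single view, the merged sequence is just that view's own tokens.
single = tokens[:1]
same_for_one_view = np.allclose(within_view(single, w_q, w_k, w_v),
                                cross_view(single, w_q, w_k, w_v))
print(within_out.shape, cross_out.shape, same_for_one_view)
```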

Why does this work? DINOv2 was pretrained to find relationships between patch tokens within a single image. Cross-view attention presents tokens from different images that depict the same 3D structure. A patch showing a chair leg in image 1 and a patch showing the same chair leg from a different angle in image 3 will have similar DINOv2 features. Cross-view attention lets the model discover these correspondences and reason about their 3D relationship — using the same attention mechanism it already mastered for intra-image reasoning.

The alternation pattern

In the last L_g layers, attention alternates every layer:

# Alternating attention in the last L_g = 8 layers
# Layer 17: within-view (standard DINOv2)
# Layer 18: cross-view  (tokens merged across images)
# Layer 19: within-view
# Layer 20: cross-view
# Layer 21: within-view
# Layer 22: cross-view
# Layer 23: within-view
# Layer 24: cross-view

# `layers` holds the L transformer blocks (0-indexed), so index 16 is layer 17.
for layer_idx in range(L_s, L):
    layer = layers[layer_idx]
    if (layer_idx - L_s) % 2 == 0:
        # Within-view: attend within each image
        for i in range(N):
            tokens[i] = layer(tokens[i])
    else:
        # Cross-view: attend across all images
        merged = tokens.reshape(N * K, D)
        merged = layer(merged)
        tokens = merged.reshape(N, K, D)

This alternation is key. A within-view layer refines each image's features independently (consolidating the cross-view information just received). Then a cross-view layer shares information again. This interleaving is much more effective than doing all cross-view attention at once (Table 6: Full Alt. drops performance).
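The layer listing above can be generated by a small helper, sketched here in plain Python. The function name and its parameters are assumptions for illustration; the ordering (within-view first, then cross-view) follows the listing.

```python
def attention_schedule(num_layers=24, num_within_only=16):
    """Return 'within' or 'cross' for each layer, in order (index 0 is layer 1)."""
    schedule = []
    for layer_idx in range(num_layers):
        if layer_idx < num_within_only:
            schedule.append("within")            # pure monocular layers
        elif (layer_idx - num_within_only) % 2 == 0:
            schedule.append("within")            # consolidate per image
        else:
            schedule.append("cross")             # exchange across images
    return schedule

schedule = attention_schedule()
for number, kind in enumerate(schedule, start=1):
    if number > 16:
        print(f"Layer {number}: {kind}-view")
```

Setting `num_within_only=0` gives the full-alternation variant that the ablation compares against.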

Computational cost

Standard self-attention has O(S²) complexity where S is the sequence length. Within-view attention processes S = K tokens per image (manageable). Cross-view attention processes S = N×K tokens (potentially large). For N = 10 images at 518×518 resolution:

Within-view: 1,370² = ~1.9M attention entries per image, ~19M total
Cross-view: (10×1,370)² = ~188M attention entries (10× more)

But only L_g/2 = 4 out of 24 layers are cross-view. The remaining 20 layers are cheap within-view attention. The total cost increase is moderate: roughly 1.3–1.5× compared to pure monocular processing.
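A back-of-the-envelope model shows why the overall increase stays moderate even though the attention matrix grows quadratically. The cost model below is an assumption, not a measurement: it counts multiply-accumulates for a standard transformer block (about 12·D² per token for the projections and a 4× MLP, plus 2·S·D per token for attention over a sequence of length S) and ignores memory traffic and kernel efficiency.

```python
def layer_cost(total_tokens, seq_len, dim=1024):
    """Approximate multiply-accumulates for one transformer block."""
    linear = 12 * dim * dim          # QKV + output projection + 4x MLP, per token
    attention = 2 * seq_len * dim    # attention scores and weighted sum, per token
    return total_tokens * (linear + attention)

def relative_cost(num_views, tokens_per_view=1370, num_layers=24, num_cross=4):
    """Cost of the mixed schedule divided by the cost of all-within-view layers."""
    total = num_views * tokens_per_view
    within = layer_cost(total, tokens_per_view)   # each token sees its own view
    cross = layer_cost(total, total)              # each token sees every view
    mixed = (num_layers - num_cross) * within + num_cross * cross
    return mixed / (num_layers * within)

ratio_10_views = relative_cost(10)
print(f"relative cost for 10 views: {ratio_10_views:.2f}x")
```

Under these assumptions the ratio for 10 views lands close to the lower end of the range quoted above, because the linear layers, whose cost does not depend on sequence length, dominate at this token count.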

Input-adaptive: monocular is free

When N = 1, cross-view attention reduces to within-view attention (only one image's tokens exist). There is no conditional logic, no mode switch. The model simply processes whatever tokens it receives. This means DA3 is inherently a monocular depth estimator that gains multi-view superpowers when given extra images — for free.

What new parameters does DA3 add to the pretrained DINOv2 backbone for cross-view reasoning?

None. Cross-view attention uses the same pretrained attention weights by simply rearranging (reshaping) the token order before standard self-attention.
A separate cross-attention module with new K/V projections.
Additional transformer layers stacked on top.

Chapter 5: The Dual-DPT Head

The backbone produces a sequence of feature tokens per image. The Dual-DPT (Dense Prediction Transformer) head converts these tokens into dense pixel-level depth and ray maps. Understanding DPT architecture is essential because the dual-branch design is what makes the disentangled depth-ray prediction work.

Background: what is DPT?

DPT (Ranftl et al., 2021) is a decoder architecture for vision transformers that produces dense predictions. It takes features from multiple layers of the transformer backbone and progressively upsamples them:

Reassemble: Extract feature tokens from 4 intermediate layers (e.g., layers {6, 12, 18, 24} of a ViT-L). Reshape from 1D token sequences back to 2D spatial feature maps. Project to a common dimension.
Fusion: Starting from the coarsest scale, progressively upsample and merge with finer-scale features via residual connections and convolutions. This builds up spatial resolution step by step.
Output: A final convolution maps the fused features to the output (e.g., depth value per pixel).

The Dual-DPT architecture

DA3's key modification is splitting the DPT into two branches after the reassembly stage:

Shared Reassembly

Features from layers {6,12,18,24} → reshape to 2D → project to common dim. Shared between depth and ray branches. This alignment is critical.

↓ split into two branches

Depth Branch

Own fusion layers (4 stages of upsample + residual conv) → output layer → D ∈ ℝ^H×W×1 + confidence σ

Ray Branch

Own fusion layers (4 stages of upsample + residual conv) → output layer → M ∈ ℝ^H×W×6

Why share reassembly but split fusion? The reassembly modules process the raw backbone features into spatial feature maps. Sharing these ensures that both branches see the same spatial representation, promoting alignment between depth and ray predictions. But depth and rays require different transformations from features to outputs — depth needs to learn absolute metric distances while rays need to learn directions in world coordinates. Separate fusion layers let each branch specialize without interfering.

Data flow through the Dual-DPT

# Backbone outputs features at 4 intermediate layers
features = backbone(images)  # list of 4 tensors, each [N, K, D]

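# Reassemble: reshape tokens → 2D feature maps, project dims, and resample
# each level to its own scale (standard DPT layout, assumed here so that
# the additions in the fusion stages below are shape-compatible)
# For ViT-L with 518×518 input: K=37×37 patches
f6  = reassemble(features[0])   # [N, C, 148, 148]  (4× upsample)
f12 = reassemble(features[1])   # [N, C, 74, 74]    (2× upsample)
f18 = reassemble(features[2])   # [N, C, 37, 37]
f24 = reassemble(features[3])   # [N, C, 19, 19]    (stride-2 downsample)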

# Shared so far — now SPLIT

# Depth branch: progressive fusion (upsample + merge + conv)
d = depth_fusion_4(f24)                    # [N, C, 37, 37]
d = depth_fusion_3(d + f18)                # [N, C, 74, 74]  (2× upsample)
d = depth_fusion_2(d + f12)                # [N, C, 148, 148]
d = depth_fusion_1(d + f6)                 # [N, C, 296, 296]
depth = depth_output(d)                    # [N, 1, 518, 518]
conf  = conf_output(d)                     # [N, 1, 518, 518]

# Ray branch: same structure, different weights
r = ray_fusion_4(f24)                      # [N, C, 37, 37]
r = ray_fusion_3(r + f18)                  # [N, C, 74, 74]
r = ray_fusion_2(r + f12)                  # [N, C, 148, 148]
r = ray_fusion_1(r + f6)                   # [N, C, 296, 296]
rays = ray_output(r)                       # [N, 6, 518, 518]

Depth representation: exponential depth

DA3's teacher model predicts depth in exponential space rather than linear depth or disparity. This is a deliberate engineering choice:

Linear depth: Near-camera regions (0–2m) are compressed into a tiny range while far regions (10–100m) dominate. Bad for indoor scenes.
Disparity (1/depth): Sensitive to near objects, but far regions are squeezed toward zero, so it loses resolution at long range.
Exponential depth (log-space): Uniform sensitivity across all distances. A 10% error at 1m and a 10% error at 100m get equal gradient signals. This is crucial for the diverse training data (objects at 0.5m, rooms at 5m, outdoor scenes at 100m+).
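A small numeric check illustrates the comparison. It reads "exponential depth" as supervising the logarithm of depth, which is an interpretation of the description above rather than the paper's exact parameterization. The function name is illustrative.

```python
import math

def errors_in_each_space(true_depth, relative_error=0.10):
    """Size of the same relative depth error in three parameterizations."""
    predicted = true_depth * (1.0 + relative_error)
    return {
        "linear": abs(predicted - true_depth),
        "disparity": abs(1.0 / predicted - 1.0 / true_depth),
        "log": abs(math.log(predicted) - math.log(true_depth)),
    }

near = errors_in_each_space(1.0)     # object at 1 m
far = errors_in_each_space(100.0)    # building at 100 m
for space in ("linear", "disparity", "log"):
    print(f"{space:>9}: near={near[space]:.5f}  far={far[space]:.5f}")
```

In linear depth the far error is 100 times larger, in disparity the near error is 100 times larger, and in log space both are the same size, so an L1 loss there weights them equally.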

Confidence map

The depth branch also outputs a confidence map σ ∈ ℝ^H×W. This is used in the confidence-aware loss following DUSt3R: pixels where the model is uncertain contribute less to the loss. During inference, confidence maps tell downstream applications which depth values to trust. Occluded regions, sky pixels, and reflective surfaces typically get low confidence.

Why does the Dual-DPT share reassembly modules between depth and ray branches, but use separate fusion layers?

Shared reassembly ensures both branches see the same spatial representation (alignment), while separate fusion lets each specialize in its output modality (metric depth vs. world-frame directions).
It reduces memory usage.
Reassembly is too expensive to run twice.

Chapter 6: Teacher-Student Training

The architecture is simple. The representation is simple. The hard part is training data. Real-world depth sensors produce noisy, sparse, and incomplete depth maps. Synthetic data has perfect depth but poor visual diversity. DA3's teacher-student paradigm bridges this gap elegantly.

The data quality problem

DA3 trains on three types of data:

Real-world depth captures (e.g., ARKitScenes with LiDAR depth): Diverse scenes but depth is sparse (many missing pixels) and noisy (sensor errors, flying pixels at depth boundaries)
3D reconstructions (e.g., CO3D, ScanNet): Depth from multi-view reconstruction. More complete but still has holes and noise
Synthetic data (Hypersim, TartanAir, Objaverse, etc.): Perfect ground truth depth, but limited visual diversity and domain gap

The key challenge: you cannot train a geometry foundation model on noisy, sparse labels. The model would learn to predict noisy, sparse depth. But you also cannot train only on synthetic data — the model would not generalize to real photos.

The teacher: DA3-Teacher

The solution is a two-stage pipeline:

Stage 1: Train a teacher on synthetic data only. The teacher is a monocular relative depth estimation model trained exclusively on synthetic datasets where depth is perfect. The training corpus is massive and diverse: Hypersim, TartanAir, vKITTI2, BlendedMVS, SPRING, MVSynth, UnrealStereo4K, KenBurns, GTA-SM, TauAgent, MatrixCity, EDEN, ReplicaGSO, UrbanSyn, PointOdyssey, Structured3D, Objaverse, Trellis, and OmniObject. This covers indoor, outdoor, object-centric, and diverse in-the-wild scenes.

The teacher outputs relative depth (not metric depth) — it knows the shape of depth but not the absolute scale. It predicts in depth space (not disparity), using exponential depth representation for uniform sensitivity across distances.

Stage 2: Generate pseudo-labels for all real data. Run the teacher on every real-world image to get dense, clean, detailed pseudo-depth maps. Then align these pseudo-depth maps to the original sparse/noisy ground truth via RANSAC least squares. This gives us the best of both worlds:

From the teacher: Dense coverage, clean edges, fine geometric detail
From the real ground truth: Correct metric scale and geometric accuracy

The alignment step is critical. The teacher predicts relative depth — it does not know if the wall is 2 meters or 20 meters away. RANSAC least squares finds the optimal scale and shift that aligns the teacher's dense prediction with the sparse ground truth measurements. Concretely: for each image, find s, b = argmin Σ|s · D_teacher(p) + b - D_GT(p)|² over valid ground truth pixels p, using RANSAC to reject outliers. Apply s and b to the teacher's full dense prediction → metric pseudo-depth for every pixel.
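The alignment can be sketched as a basic two-point RANSAC followed by a least-squares refit on the inliers. This is a generic illustration under stated assumptions: the iteration count, inlier threshold, function names, and the synthetic scene are all made up for the example and are not the paper's settings.

```python
import numpy as np

def fit_scale_shift(pred, target):
    """Least-squares s, b minimizing sum (s * pred + b - target)^2."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, target, rcond=None)
    return s, b

def ransac_align(pred, target, valid, iterations=200, threshold=0.05, seed=0):
    """Robustly align relative depth `pred` to sparse metric depth `target`."""
    rng = np.random.default_rng(seed)
    x, y = pred[valid], target[valid]
    best_inliers = np.ones(x.size, dtype=bool)   # fallback: use every valid pixel
    best_count = -1
    for _ in range(iterations):
        idx = rng.choice(x.size, size=2, replace=False)
        if x[idx[0]] == x[idx[1]]:
            continue                              # degenerate pair, no unique line
        s, b = fit_scale_shift(x[idx], y[idx])
        inliers = np.abs(s * x + b - y) < threshold
        if inliers.sum() > best_count:
            best_count, best_inliers = inliers.sum(), inliers
    # Refit on all inliers of the best hypothesis.
    return fit_scale_shift(x[best_inliers], y[best_inliers])

# Synthetic scene: the teacher is correct up to scale 4.0 and shift 0.5.
rng = np.random.default_rng(1)
teacher = rng.uniform(0.1, 1.0, size=(32, 32))         # dense relative depth
metric = 4.0 * teacher + 0.5                           # hypothetical true depth
valid = rng.random((32, 32)) < 0.1                     # sparse sensor coverage
sensor = np.where(valid, metric, 0.0)
outliers = valid & (rng.random((32, 32)) < 0.1)        # a few gross sensor errors
sensor[outliers] += 3.0

scale, shift = ransac_align(teacher, sensor, valid)
pseudo_depth = scale * teacher + shift                 # dense metric pseudo-label
print(f"scale={scale:.3f} shift={shift:.3f}")
```

The final line is the payoff: `pseudo_depth` is defined at every pixel, including the roughly 90% where the simulated sensor returned nothing.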

Training the student

The student model (DA3 itself) is then trained on the aligned pseudo-labels using a composite loss:

ℒ = ℒ_D(D̂, D) + ℒ_M(R̂, M) + ℒ_P(D̂ ⊙ d + t, P) + βℒ_C(ĉ, v) + αℒ_grad(D̂, D)

Let us break down each term:

ℒ_D — Depth loss (confidence-aware):

ℒ_D(D̂, D; D_c) = (1/|Ω|) Σ_{p ∈ Ω} m_p (D_c,p |D̂_p - D_p| - λ_c log D_c,p)

Where D_c,p is the predicted confidence at pixel p and m_p is a validity mask that excludes pixels without ground truth. The first term penalizes depth error weighted by confidence. The second term (-log D_c,p) prevents the model from cheating by setting all confidences to zero. The balance: the model learns to assign high confidence to accurate predictions and low confidence to uncertain ones.
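The trade-off between the two terms is easy to see numerically. The sketch below implements the formula as written, treating |Ω| as the total pixel count; the value of λ_c and the toy inputs are assumptions chosen for illustration.

```python
import numpy as np

def confidence_depth_loss(pred, target, confidence, mask, lambda_c=0.1):
    """Mean over pixels of m_p * (c_p * |pred - target| - lambda_c * log c_p)."""
    per_pixel = confidence * np.abs(pred - target) - lambda_c * np.log(confidence)
    return float((mask * per_pixel).sum() / mask.size)

target = np.full((2, 2), 2.0)
mask = np.ones((2, 2))
accurate = target + 0.01        # small error everywhere
inaccurate = target + 1.0       # large error everywhere
high_conf = np.full((2, 2), 5.0)
low_conf = np.full((2, 2), 0.5)

loss_accurate_high = confidence_depth_loss(accurate, target, high_conf, mask)
loss_accurate_low = confidence_depth_loss(accurate, target, low_conf, mask)
loss_inaccurate_high = confidence_depth_loss(inaccurate, target, high_conf, mask)
loss_inaccurate_low = confidence_depth_loss(inaccurate, target, low_conf, mask)
print(loss_accurate_high, loss_accurate_low)
print(loss_inaccurate_high, loss_inaccurate_low)
```

For a fixed error e, the per-pixel term c·e − λ_c·log c is minimized at c = λ_c / e, so the cheapest confidence is inversely proportional to the error.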

ℒ_M — Ray map loss: L1 loss between predicted and ground truth ray maps. Supervises both ray origins and ray directions.

ℒ_P — Point map loss: Computes 3D points from predicted depth and ray directions (P = D̂ · d + t), then penalizes the error against ground truth 3D points. This is a consistency loss: even if depth and rays are individually approximate, the combined 3D reconstruction should be accurate.

ℒ_C — Camera loss: Optional, supervises the camera head's predictions of (f, q, t). Only active when camera parameters are known in the training data.

ℒ_grad — Gradient loss:

ℒ_grad(D̂, D) = ||∇_xD̂ - ∇_xD||₁ + ||∇_yD̂ - ∇_yD||₁

Penalizes differences in depth gradients. This preserves sharp edges (furniture boundaries, object silhouettes) while allowing smooth depth on planar surfaces. Without this term, the model would blur depth edges.
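A minimal sketch of the gradient term, assuming forward finite differences for ∇_x and ∇_y and a summed L1 norm (the paper's exact discretization and normalization may differ):

```python
import numpy as np

def gradient_loss(pred, target):
    """L1 difference between horizontal and vertical depth gradients."""
    dx_pred, dx_target = np.diff(pred, axis=1), np.diff(target, axis=1)
    dy_pred, dy_target = np.diff(pred, axis=0), np.diff(target, axis=0)
    return float(np.abs(dx_pred - dx_target).sum()
                 + np.abs(dy_pred - dy_target).sum())

# Ground truth: a sharp depth edge (1 m on the left, 3 m on the right).
target = np.tile(np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0]), (4, 1))
offset = target + 0.5                                   # sharp edge, wrong offset
blurred = np.tile(np.linspace(1.0, 3.0, 6), (4, 1))     # edge smeared into a ramp

loss_offset = gradient_loss(offset, target)
loss_blurred = gradient_loss(blurred, target)
print(loss_offset, loss_blurred)
```

The constant offset costs nothing here because it leaves gradients untouched (the depth loss handles it), while the blurred edge is penalized even though its values stay within the correct range.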

Scale normalization

Before computing any loss, all ground truth signals are normalized by a common scale factor: the mean L₂ norm of valid reprojected point maps P. This ensures that a 10m outdoor scene and a 1m tabletop scene contribute equally to the loss. Without this, the model would overfit to whichever scale dominates the training set.
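The normalization can be sketched as follows. Which signals are divided by the scale is an assumption here (points, depth, and ray origins, since those carry units of length), as are the function name and the toy scenes.

```python
import numpy as np

def normalize_targets(points, depth, origins, valid):
    """Divide length-valued targets by the mean L2 norm of the valid 3D points."""
    scale = np.linalg.norm(points[valid], axis=-1).mean()
    return points / scale, depth / scale, origins / scale, scale

rng = np.random.default_rng(0)
tabletop = rng.uniform(0.2, 1.0, size=(4, 4, 3))   # points roughly 1 m away
street = 10.0 * tabletop                            # same geometry, 10x larger
valid = np.ones((4, 4), dtype=bool)
origins = np.zeros((4, 4, 3))

small = normalize_targets(tabletop, tabletop[..., 2], origins, valid)
large = normalize_targets(street, street[..., 2], origins, valid)
print(small[3], large[3])
```

After normalization the two scenes have identical targets, so neither dominates the loss because of its absolute size.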

Why does DA3 train a teacher model on synthetic data only and then use it to generate pseudo-labels for real data, rather than training directly on the mixed data?

Real-world depth is sparse and noisy, so training directly on it would teach the student to predict sparse, noisy depth. The teacher provides dense, clean depth with correct edges, which is then aligned to the real ground truth scale via RANSAC to get the best of both worlds.
Synthetic data is faster to load.
The teacher compresses the training data to save disk space.

Chapter 7: Results & Benchmark

DA3 introduces a comprehensive Visual Geometry Benchmark that evaluates three capabilities: pose estimation accuracy, geometric reconstruction accuracy, and visual rendering quality. It covers 5 datasets (89+ scenes) spanning object-level to indoor/outdoor environments.

Pose estimation: where DA3 dominates

Pose accuracy is measured by Auc3 and Auc30 (area under the accuracy curve at 3° and 30° thresholds for relative rotation and translation). Higher is better.

The results tell a clear story. On challenging datasets with sparse views and textureless regions, DA3-Giant leads every competing method by a wide margin:

HiRoom (indoor, textureless)

DA3-Giant: 81.7 Auc3 vs VGGT: 49.1 vs COLMAP: 13.0. DA3 is 6.3× better than COLMAP and 66% better than VGGT.

ScanNet++ (diverse indoor)

DA3-Giant: 83.2 Auc3 vs VGGT: 62.6 vs COLMAP: 13.3. A 33% relative gain over VGGT.

ETH3D (outdoor/mixed)

DA3-Giant: 39.3 Auc3 vs VGGT: 26.3 vs Pi3: 35.2. Surpasses even the strong Pi3 model.

DTU (dense, well-textured)

DA3-Giant: 85.6 Auc3 vs GLOMAP: 96.8. Classical methods still win on dense, well-textured scenes. Honest result.

The average improvement over VGGT across all five datasets: 35.7% in camera pose accuracy.

Geometric reconstruction: depth quality

Geometry accuracy is measured by reconstructing point clouds from predicted depth and poses, aligning them to ground truth via Umeyama alignment, and computing F-Score (all datasets except DTU, where Chamfer Distance is used).
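The two steps named above, similarity alignment and F-Score, can be sketched as follows. This is an illustrative version under stated assumptions: the correspondences fed to `umeyama` are taken as given, the distance threshold `tau` is a free parameter, and the brute-force nearest-neighbour search is only suitable for small point sets. The benchmark's own settings are not specified here.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity (s, R, t) such that dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    sign = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        sign[-1] = -1.0                        # avoid returning a reflection
    R = U @ np.diag(sign) @ Vt
    s = (S * sign).sum() / ((xs ** 2).sum() / len(src))
    t = mu_d - s * (R @ mu_s)
    return s, R, t

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()   # predicted points near the ground truth
    recall = (d.min(axis=0) < tau).mean()      # ground-truth points that are covered
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

def align_and_score(pred, gt, tau):
    s, R, t = umeyama(pred, gt)
    aligned = s * pred @ R.T + t
    return f_score(aligned, gt, tau)
```

The alignment removes the global scale, rotation, and translation that a pose-free model cannot know, so the F-Score measures only the shape of the reconstruction.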

DA3-Giant achieves 23.6% relative improvement over VGGT and 16.7% over Pi3 on average across all five datasets in the pose-free setting. Key numbers:

HiRoom: DA3: 89.3 F1 vs VGGT: 56.7 (57% improvement)
ETH3D: DA3: 74.4 F1 vs VGGT: 57.2 (30% improvement)
ScanNet++: DA3: 76.4 F1 vs VGGT: 66.4 (15% improvement)
DTU: DA3: 1.92mm CD vs VGGT: 2.05mm (7% improvement)
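The percentages in this list are relative changes, and they can be reproduced directly from the quoted scores. The helper names below are hypothetical; for the lower-is-better Chamfer Distance two conventions are shown, because they give slightly different numbers.

```python
def relative_gain(ours, baseline):
    """Relative improvement for a higher-is-better metric such as F1."""
    return (ours - baseline) / baseline

def error_ratio_gain(ours, baseline):
    """Lower-is-better metric: how much larger the baseline error is."""
    return baseline / ours - 1.0

def error_reduction(ours, baseline):
    """Lower-is-better metric: error removed, relative to the baseline."""
    return (baseline - ours) / baseline

scores = {"HiRoom": (89.3, 56.7), "ETH3D": (74.4, 57.2), "ScanNet++": (76.4, 66.4)}
for name, (da3, vggt) in scores.items():
    print(f"{name}: {100 * relative_gain(da3, vggt):.1f}% higher F1")

print(f"DTU, baseline/ours - 1: {100 * error_ratio_gain(1.92, 2.05):.1f}%")
print(f"DTU, error reduction:   {100 * error_reduction(1.92, 2.05):.1f}%")
```

The F1 gains come out at roughly 57%, 30%, and 15%, matching the list. For DTU, the ratio convention gives about 6.8% (the quoted 7% after rounding), while measuring the reduction against VGGT's error gives about 6.3%.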

Monocular depth: beating DA2

Even on monocular depth benchmarks (single image input), DA3 outperforms Depth Anything 2 with an average rank of 2.20 vs 2.60. On ETH3D, DA3 achieves 98.6 δ₁ vs DA2's 86.5 — a massive gap. The teacher model (trained on synthetic data) achieves rank 1.00, showing headroom for future improvements.

Scaling behavior

Pose estimation scales more strongly than depth estimation. A striking finding from the model size ablations: scaling from DA3-Small to DA3-Giant improves pose accuracy dramatically (HiRoom Auc3: 3.4 → 81.7, a 24× gain) but improves depth-based reconstruction more modestly (HiRoom F1: 12.9 → 89.3, a 7× gain). With pose conditioning (giving the model ground truth camera poses), scaling gains shrink significantly, confirming that larger models primarily improve at pose estimation. The implication: pose estimation is the harder task that most benefits from model scale.

Benchmark Results Explorer

Compare DA3 against all competitors across datasets and metrics.

What DA3 does NOT do well

Honest limitations the paper acknowledges or we can infer:

DTU (dense, well-textured): Classical methods (GLOMAP) still achieve 96.8 Auc3 vs DA3's 85.6. On perfectly textured objects with many views, classical correspondence + bundle adjustment is hard to beat.
Small models struggle: DA3-Small (ViT-S) achieves only 3.40 Auc3 on HiRoom vs 81.7 for DA3-Giant. The minimal modeling approach needs model scale to work — it is not a trick that works at any size.
Dynamic scenes: The paper tests on static scenes. Moving objects would break multi-view consistency. The conclusion mentions extending to dynamic scenes as future work.
Very large view counts: Cross-view attention is O(N²K²) in the worst case. For N > 100 views, this becomes expensive. Practical deployment may need chunking strategies.
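To make the last limitation concrete, the sketch below counts the query-key pairs scored by one attention layer, taking N as the number of views and K as the number of tokens per view. The function names and the chunking scheme are assumptions for illustration; the paper does not prescribe a chunking strategy here, and independent chunks give up attention between views in different chunks.

```python
def attention_pairs(num_views, tokens_per_view, cross_view):
    """Number of query-key pairs scored by one attention layer."""
    if cross_view:
        total = num_views * tokens_per_view      # all views share one sequence
        return total * total                     # O(N^2 K^2)
    return num_views * tokens_per_view ** 2      # O(N K^2): each view attends to itself

def chunked_pairs(num_views, tokens_per_view, chunk_size):
    """Cross-view cost when views are split into independent chunks."""
    full, rest = divmod(num_views, chunk_size)
    cost = full * attention_pairs(chunk_size, tokens_per_view, cross_view=True)
    if rest:
        cost += attention_pairs(rest, tokens_per_view, cross_view=True)
    return cost

K = 37 * 37  # illustrative: a 518 x 518 image with 14 x 14 patches
full_cost = attention_pairs(100, K, cross_view=True)
within_cost = attention_pairs(100, K, cross_view=False)
chunk_cost = chunked_pairs(100, K, chunk_size=10)
```

With 100 views, a cross-view layer scores 100 times as many pairs as a within-view layer, and splitting the views into ten chunks of ten cuts the cross-view cost by a factor of ten.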

Why does pose estimation benefit more from model scaling than depth estimation?

Depth estimation is a simpler task
Larger models have more memory for multi-view data
Pose estimation requires reasoning about geometric relationships across views (cross-view attention), which benefits from more capacity in the later transformer layers. Depth estimation is largely monocular (within-view) and saturates earlier. Evidence: with ground-truth poses, scaling gains shrink.

Chapter 8: Feed-Forward Novel View Synthesis

DA3 is not just a depth/pose estimator — it is a geometry backbone that can power downstream 3D tasks. To prove this, the paper fine-tunes DA3 for feed-forward novel view synthesis (FF-NVS): given a few input images, render the scene from a new camera viewpoint, without any per-scene optimization.

The approach: GS-DPT head

Following the minimal modeling philosophy, FF-NVS is achieved by adding a single new DPT head (GS-DPT) to the existing DA3 backbone. This head predicts per-pixel 3D Gaussian parameters:

σ_i — opacity (from the confidence head)
q_i ∈ ℍ — rotation quaternion of the Gaussian ellipsoid
s_i ∈ ℝ³ — scale (size of the Gaussian in 3D)
c_i ∈ ℝ³ — RGB color

The 3D position of each Gaussian comes from DA3's existing depth and ray predictions: P_i = t + D_i · d_i. These pixel-aligned 3D Gaussians are then splatted to render novel views using standard 3DGS rasterization.
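A minimal sketch of how the per-pixel outputs could be assembled into a flat set of Gaussians is shown below. The function name, array shapes, and dictionary keys are assumptions for illustration, and the output activations of the real GS-DPT head are not covered. The only operations taken from the text are the position formula P_i = t + D_i · d_i and the fact that rotations are quaternions, which are normalized to unit length here.

```python
import numpy as np

def build_gaussians(depth, ray_origin, ray_dir, opacity, quat, scale, color):
    """Turn per-pixel predictions into one Gaussian per pixel.

    depth:      (H, W)     predicted depth D_i
    ray_origin: (H, W, 3)  ray origin t for each pixel
    ray_dir:    (H, W, 3)  ray direction d_i
    opacity:    (H, W)     sigma_i
    quat:       (H, W, 4)  unnormalized rotation quaternion q_i
    scale:      (H, W, 3)  s_i
    color:      (H, W, 3)  c_i
    """
    centers = ray_origin + depth[..., None] * ray_dir        # P_i = t + D_i * d_i
    quat = quat / np.linalg.norm(quat, axis=-1, keepdims=True)
    return {
        "center": centers.reshape(-1, 3),
        "opacity": opacity.reshape(-1),
        "rotation": quat.reshape(-1, 4),
        "scale": scale.reshape(-1, 3),
        "color": color.reshape(-1, 3),
    }
```

Because the centers come from element-wise operations on the depth and ray maps, the new head only has to predict appearance and shape, not position.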

Why this validates the geometry backbone thesis: The GS-DPT head is structurally identical to the depth and ray DPT heads. No epipolar modules, no cost volumes, no cascaded decoders. The same architecture pattern (backbone features → DPT → dense prediction) that works for depth and rays also works for Gaussian parameters. The only difference is what the output channels represent. This is the power of having a strong geometry backbone — task-specific heads become trivial.

Results: NVS benchmark

The paper introduces a new NVS benchmark across three datasets: DL3DV (140 scenes), Tanks & Temples (6 scenes), and MegaDepth (19 scenes). Each scene has ~300 frames with COLMAP-estimated camera poses as ground truth.

DA3 outperforms all competitors:

DL3DV (in-domain)

DA3: 21.33 PSNR, 0.241 LPIPS vs VGGT: 20.96, 0.253 vs MVSplat: 18.13, 0.393

Tanks & Temples (out-of-domain)

DA3: 18.10 PSNR, 0.311 LPIPS vs VGGT: 17.18, 0.347

MegaDepth (out-of-domain)

DA3: 17.89 PSNR, 0.351 LPIPS vs VGGT: 16.45, 0.417
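For readers unfamiliar with the PSNR figures above, the metric is a log-scaled mean squared error between the rendered and the held-out image. The sketch below uses the standard definition and assumes pixel values in [0, 1]; the function name is hypothetical, and LPIPS is not covered because it requires a pretrained network.

```python
import numpy as np

def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    diff = np.asarray(rendered, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mse_ratio_from_db_gap(gap_db):
    """Ratio of mean squared errors implied by a PSNR gap in dB."""
    return 10.0 ** (-gap_db / 10.0)

megadepth_ratio = mse_ratio_from_db_gap(17.89 - 16.45)
```

Because the scale is logarithmic, small dB gaps are meaningful: the 1.44 dB gap between DA3 and VGGT on MegaDepth corresponds to a mean squared error about 28% lower, assuming both are averaged in the same way.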

Two key findings from this experiment:

1. Geometry-model-based NVS consistently beats specialized NVS models. All geometry backbones (DA3, VGGT, Fast3R) outperform dedicated 3DGS models (pixelSplat, MVSplat, DepthSplat) that use epipolar transformers or cost volumes. Large-scale geometric pretraining provides better features than task-specific architectures designed from scratch for NVS.

2. NVS quality correlates with geometry quality. Among the geometry backbones, the ranking on NVS (DA3 > VGGT > MV-DUSt3R > Fast3R) perfectly matches the ranking on geometry benchmarks. Better depth and pose estimation directly translates to better novel view rendering. This suggests FF-NVS can be effectively addressed simply by improving the geometry backbone.

The prediction target ablation for NVS

Table 5 in the paper reveals another important result. For the GS-DPT head, having ray maps in the prediction targets improves results compared to predicting point clouds (pcd) directly; the figures below are average F1 scores:

depth + ray + cam: Best overall (Avg F1: 56.5, best on DTU)
depth + ray: Close second (Avg F1: 56.4)
pcd alone: Worst (Avg F1: 51.5)

The disentangled depth-ray representation helps NVS because the Gaussian positions (from depth × ray direction) are more accurate when depth and camera are predicted separately.

NVS Pipeline

From input images to novel view rendering: the DA3 backbone produces depth + rays, the GS-DPT head adds Gaussian parameters, and standard 3DGS rasterization renders the new view.

Why do geometry-model-based NVS approaches outperform specialized NVS models like pixelSplat and MVSplat?

Geometry backbones (DA3, VGGT) benefit from large-scale pretraining on diverse 3D data, providing richer features than task-specific architectures trained from scratch for NVS alone. Better geometry directly yields better novel views.
Specialized NVS models use less compute
pixelSplat and MVSplat are older models that have not been updated

Chapter 9: Connections

The Depth Anything lineage

Depth Anything (V1, 2024) — The first generation used self-training on 62M unlabeled images with a DINOv2 backbone. Monocular depth only. No multi-view capability. Key innovation: using a pretrained teacher on labeled data, then training a student on unlabeled data where the teacher provides pseudo-labels.

Depth Anything V2 (2024) — Replaced mixed labeled/unlabeled training with a cleaner strategy: train the teacher on synthetic data only, generate pseudo-labels for real data, train the student on pseudo-labels. This is the direct ancestor of DA3's teacher-student paradigm. V2 was monocular only.

Depth Anything 3 (2026) — Extends from monocular to any-view. The teacher-student paradigm is inherited from V2. The backbone is still DINOv2. The key innovations are: (1) depth-ray representation instead of depth alone, (2) cross-view attention via token rearrangement, (3) the Dual-DPT head. DA3 subsumes V2 — with one image, it reduces to a monocular depth estimator that outperforms V2.

Related methods

DUSt3R (2024) — First to predict pointmaps directly from image pairs. Two images in, two pointmaps out. For multi-view, requires expensive global alignment optimization (iteratively fusing pairwise results). DA3's advantage: handles any number of views natively via cross-view attention.

MASt3R (2024) — Extended DUSt3R with dense feature matching capabilities. Still limited to image pairs.

VGGT (CVPR 2025 Best Paper) — First to process N images simultaneously and predict all 3D geometry in one forward pass. Uses two transformers (pretrained DINOv2 + untrained cross-view transformer). Predicts multiple targets (depth, pointmaps, poses, correspondences). DA3's advantages: (1) single fully-pretrained transformer vs. two partially-pretrained, (2) minimal depth+ray targets vs. multi-task, (3) 0.30B beats 0.90B on many benchmarks.

Pi3 (2025) — Another multi-view geometry model. Strong on some benchmarks but outperformed by DA3-Giant across the board.

MapAnything (2025) — Feed-forward metric 3D reconstruction. Benefits from pose conditioning like DA3 but does not match DA3's pose-free performance.

Cheat sheet: key equations

3D point from depth + ray

P = t + D(u,v) · d
t = ray origin (camera center), d = ray direction, D = depth. Element-wise ops only.

Ray direction from camera params

d = R K⁻¹ p
R = rotation matrix, K = intrinsics, p = pixel. The ray map encodes R and K implicitly.
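As a worked example of this formula, the sketch below builds a ray-direction map from a pinhole camera. Several details are assumptions for the example rather than statements about DA3's implementation: p is the homogeneous pixel (u, v, 1), pixel centers sit at half-integer coordinates, R maps camera coordinates to world coordinates, and the directions are left unnormalized so that multiplying by z-depth lands on the 3D point.

```python
import numpy as np

def ray_directions(K, R, height, width):
    """Per-pixel ray directions d = R @ inv(K) @ p, returned as (H, W, 3)."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous p = (u, v, 1)
    # Row-vector form of R @ inv(K) @ p applied to every pixel at once.
    return pixels @ np.linalg.inv(K).T @ R.T

def unproject(depth, ray_origin, directions):
    """P = t + D(u, v) * d, using element-wise operations only."""
    return ray_origin + depth[..., None] * directions

K = np.array([[100.0, 0.0, 2.5],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
dirs = ray_directions(K, np.eye(3), height=3, width=5)
points = unproject(np.full((3, 5), 4.0), np.array([1.0, 2.0, 3.0]), dirs)
```

With an identity rotation, the pixel at the principal point looks straight down the optical axis, and a depth of 4 places its 3D point four units in front of the camera center.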

Training loss

ℒ = ℒ_D + ℒ_M + ℒ_P + βℒ_C + αℒ_grad
Depth (confidence-weighted) + Ray map + Point consistency + Camera + Gradient smoothness

Attention split

L_s : L_g = 2:1
First 2/3 layers: within-view (monocular). Last 1/3: alternating within/cross-view.
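The split can be sketched as a layer schedule plus a reshape. Both helpers are hypothetical, and the order of alternation within the last third (cross-view first) is an assumption; only the 2:1 ratio and the idea that the two attention types differ by how tokens are rearranged come from the text.

```python
import numpy as np

def attention_schedule(num_layers):
    """Attention type per layer: first 2/3 within-view, last 1/3 alternating."""
    num_alternating = num_layers // 3
    schedule = ["within"] * (num_layers - num_alternating)
    for i in range(num_alternating):
        schedule.append("cross" if i % 2 == 0 else "within")
    return schedule

def rearrange_tokens(tokens, mode):
    """Reshape (B, N, K, C) tokens so that standard attention sees the right sequence.

    within: each view is its own sequence of K tokens   -> (B * N, K, C)
    cross:  all views form one sequence of N * K tokens -> (B, N * K, C)
    """
    b, n, k, c = tokens.shape
    if mode == "within":
        return tokens.reshape(b * n, k, c)
    return tokens.reshape(b, n * k, c)

schedule = attention_schedule(24)
tokens = np.zeros((2, 5, 16, 8))
```

Since only the sequence layout changes, the same pretrained attention weights can serve both modes, which is what keeps the modification minimal.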

Pseudo-label alignment

D_aligned = s · D_teacher + b
RANSAC least-squares finds (s,b) matching teacher to sparse GT.
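A minimal sketch of this alignment is given below. The function names, the two-point sampling, the iteration count, and the inlier threshold are all assumptions chosen for illustration, not the paper's settings.

```python
import numpy as np

def fit_scale_shift(teacher, gt):
    """Least-squares (s, b) minimizing ||s * teacher + b - gt||^2."""
    A = np.stack([teacher, np.ones_like(teacher)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s, b

def ransac_align(teacher, gt, valid, iters=200, thresh=0.05, seed=0):
    """Align dense teacher depth to sparse ground truth; returns (aligned, s, b)."""
    rng = np.random.default_rng(seed)
    x, y = teacher[valid], gt[valid]             # only pixels with ground truth
    best_inliers = np.ones(len(x), dtype=bool)   # fallback: use every valid pixel
    best_count = 0
    for _ in range(iters):
        i, j = rng.choice(len(x), size=2, replace=False)
        if np.isclose(x[i], x[j]):
            continue                             # degenerate sample, no unique line
        s, b = fit_scale_shift(x[[i, j]], y[[i, j]])
        inliers = np.abs(s * x + b - y) < thresh
        if inliers.sum() > best_count:
            best_count, best_inliers = inliers.sum(), inliers
    s, b = fit_scale_shift(x[best_inliers], y[best_inliers])   # refit on the consensus set
    return s * teacher + b, s, b
```

The result keeps the teacher's dense, sharp structure while taking its scale and offset from the real measurements, and outlier measurements are ignored because they never join the consensus set.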

When to use what

One image, need depth: DA3 (monocular mode) — outperforms Depth Anything 2
Few images (2–20), need depth + poses: DA3 — best feed-forward method
Many images (50+), well-textured, need maximum accuracy: COLMAP/GLOMAP still win on dense, textured scenes (DTU Auc3: 96.8 for GLOMAP vs 85.6 for DA3-Giant)
Novel view synthesis from sparse views: DA3 + GS-DPT — best feed-forward NVS backbone
Real-time SLAM: MASt3R-SLAM — DA3 is feed-forward, not designed for sequential real-time operation (yet)

What this paper means for the field

DA3 demonstrates a powerful principle: minimal modeling with maximal pretraining. Instead of designing complex architectures with geometric inductive biases (cost volumes, epipolar attention, point cloud decoders), use a standard pretrained transformer and make the smallest possible modification (token rearrangement) to enable the new capability (cross-view reasoning). The teacher-student paradigm handles data quality. The depth-ray representation handles target design. Everything else is inherited from DINOv2.

This points toward a future where a single pretrained geometry backbone serves as the foundation for all 3D vision tasks — depth, poses, reconstruction, NVS, SLAM, and beyond. DA3 is the strongest evidence yet that this future is achievable.

Related lessons on this site

Depth Anything V1 — The self-training paradigm for monocular depth
Depth Anything V2 — Synthetic teacher + real student pipeline
DUSt3R — Direct pointmap prediction from image pairs
MASt3R — Adding dense matching to DUSt3R
VGGT — The multi-output geometry transformer DA3 surpasses
MASt3R-SLAM — Real-time dense SLAM with learned 3D priors
Vision Transformer — The ViT architecture underlying DA3
Gleams: 3D Vision — Camera geometry, projection, reconstruction basics
Gleams: NeRF & 3DGS — Neural radiance fields and Gaussian splatting

What is DA3's single most important contribution relative to VGGT?

More training data
Demonstrating that a single fully-pretrained transformer with minimal modifications (token rearrangement + depth-ray targets) outperforms a more complex two-transformer architecture with multi-task outputs, proving that minimal modeling with maximal pretraining is the superior paradigm
Faster inference speed

Depth Anything 3: Recovering the Visual Space from Any Views