Wang, Chen, Karaev, Rupprecht, Novotny, Vedaldi — CVPR 2025 Best Paper

VGGT: Visual Geometry Grounded Transformer

A single feed-forward transformer that takes N unposed images and outputs ALL 3D geometry — camera poses, pointmaps, depth maps, correspondences — in one forward pass. No iteration, no RANSAC, no intrinsics needed.

Prerequisites: Vision Transformers + Basic 3D geometry + Structure from Motion concepts
10 chapters, 6 simulations

Chapter 0: The Problem

You have a bunch of photos of a scene. Maybe 5 photos of a room, or 50 photos of a building taken from different angles. You want the full 3D geometry: where each camera was, how deep every pixel is, a dense 3D point cloud, and correspondences between images. This is the fundamental problem of 3D computer vision.

Traditionally, this requires an entire pipeline of specialized modules, each solving one piece of the puzzle:

  1. Feature extraction: detect keypoints (SIFT, SuperPoint) in each image
  2. Feature matching: match keypoints across image pairs (SuperGlue, LoFTR)
  3. Geometric verification: RANSAC to filter outlier matches and estimate essential matrices
  4. Triangulation: recover 3D points from verified matches
  5. Bundle adjustment: jointly optimize all camera poses and 3D points (iterative, slow)
  6. Dense reconstruction: multi-view stereo to get dense depth maps and point clouds (another pipeline)

This is COLMAP — the gold standard since 2016. It works, but it is complex (thousands of lines of C++ across six stages), slow (seconds to minutes per scene), brittle (fails on textureless regions, repeated textures, extreme viewpoints), and each module is designed and tuned independently.

The fundamental bottleneck: Every stage in the classical pipeline makes hard decisions that downstream stages cannot undo. RANSAC rejects matches that might be correct. Triangulation discards points with high reprojection error. Bundle Adjustment converges to local minima. Information is lost at every boundary. What if a single neural network could reason about all these quantities jointly, correcting its own mistakes internally?
Classical Pipeline vs. VGGT

The traditional SfM/MVS pipeline requires six sequential stages. VGGT replaces the entire stack with a single forward pass.

Why does information get lost at each stage of the classical 3D reconstruction pipeline?

Chapter 1: The Key Insight

VGGT's insight is radical in its simplicity: a single large transformer, with almost no 3D-specific inductive biases, can learn to predict all 3D scene attributes simultaneously from raw images.

Not camera poses alone. Not depth maps alone. Not point clouds alone. Not correspondences alone. All of them, together, in one forward pass.

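\[ f(I_1, I_2, \ldots, I_N) = (g_i, D_i, P_i, T_i)_{i=1}^{N} \]

Where for each image \(I_i\): \(g_i\) is the camera parameters, \(D_i\) the depth map, \(P_i\) the point map, and \(T_i\) the tracking features (each is described in Chapter 3).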

This is possible because these outputs are not independent. Camera poses constrain depth maps. Depth maps and poses together determine point maps. Point correspondences are implied by the point maps. A model that predicts all of them can use their mutual consistency as an internal error signal during inference.

The paradigm shift: Previous methods like DUSt3R and MASt3R showed that pointmaps could be predicted directly from image pairs. But they could only process two images at once, then needed expensive global alignment optimization to fuse pairwise results. VGGT processes all N images simultaneously and produces globally consistent outputs — no post-processing needed. The transformer is the optimizer.

The architecture is deliberately simple: a standard large transformer (1.2 billion parameters, 24 layers) with one unusual design choice — alternating attention that switches between looking within each image and looking across all images. That is essentially the only 3D inductive bias. Everything else is learned from data.

What is VGGT's core architectural innovation compared to standard vision transformers?

Chapter 2: Architecture

The architecture has four components: a frozen DINOv2 tokenizer, camera and register tokens, the alternating-attention transformer backbone, and task-specific prediction heads.

Step 1: Tokenize images with DINOv2

Each input image \(I_i\) is patchified into K tokens (14×14 pixel patches) using a frozen DINOv2-Large encoder. This gives a set of 1024-dimensional tokens \(t^I_i\) per image. DINOv2 was chosen over a raw convolutional patchifier because it provides more stable training and better performance.

Frozen vs trained: The DINOv2 tokenizer is completely frozen — its weights never update during VGGT training. This is deliberate: DINOv2 was pretrained via self-supervised learning on 142M images (LVD-142M), giving it extremely robust visual features. Training it jointly would risk catastrophic forgetting. Everything else — the camera/register tokens, all 24 transformer blocks, and all prediction heads — is trained from scratch on 3D data. The total parameter budget: ~300M from frozen DINOv2-L, ~900M trained from scratch.

Concrete tensor shapes through the pipeline

Let's trace the exact data flow for N=10 images at 336×518 resolution:
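As a rough trace, the shapes below follow from the stated resolution, patch size, and token width alone. The number of special tokens per frame (`NUM_CAMERA`, `NUM_REGISTER`) is an assumption made for illustration, since the text does not state it.

```python
# Shape trace for N = 10 images at 336x518 with 14x14 patches.
N, H, W = 10, 336, 518
PATCH = 14
DIM = 1024            # token width stated for the DINOv2-L tokenizer
NUM_CAMERA = 1        # ASSUMPTION: one camera token per frame
NUM_REGISTER = 4      # ASSUMPTION: register-token count per frame

assert H % PATCH == 0 and W % PATCH == 0
grid_h, grid_w = H // PATCH, W // PATCH             # 24 x 37 patch grid
K = grid_h * grid_w                                 # 888 image tokens per frame
tokens_per_frame = K + NUM_CAMERA + NUM_REGISTER    # 893 under the assumptions above
total_tokens = N * tokens_per_frame                 # 8930 tokens in the global sequence

shapes = {
    "input images": (N, 3, H, W),
    "image tokens": (N, K, DIM),
    "with special tokens": (N, tokens_per_frame, DIM),
    "frame-wise attention input": (N, tokens_per_frame, DIM),
    "global attention input": (1, total_tokens, DIM),
}
for name, shape in shapes.items():
    print(f"{name:28s} {shape}")
```

The only difference between the two attention inputs is a reshape: the same tokens are viewed either as N short sequences or as one long sequence.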

Step 2: Append special tokens

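For each image, the model appends a learnable camera token \(t^g_i\) and a set of learnable register tokens \(t^R_i\) to that image's patch tokens.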

Crucially, the first frame gets different learnable tokens \((\bar{t}^g, \bar{t}^R)\) than all other frames. This tells the model which frame is the reference: all 3D outputs are expressed in the coordinate frame of camera 1.

Step 3: Alternating-Attention (AA) Transformer

The concatenation of all tokens from all frames passes through L = 24 blocks, each containing two attention layers:

  1. Frame-wise self-attention: tokens within each frame attend only to each other. This normalizes activations per-frame and lets the camera/register tokens interact with their own image tokens.
  2. Global self-attention: all tokens from all frames attend to each other. This is where cross-image reasoning happens: the model discovers correspondences, resolves relative poses, and builds a unified 3D understanding.

Why not cross-attention? The ablation study (Table 5 in the paper) shows alternating self-attention significantly outperforms both global-only self-attention and cross-attention. Cross-attention (each frame attends to tokens from other frames) scored 1.061 overall error vs 0.709 for alternating attention. Self-attention is both more expressive and more parameter-efficient for this task.

Engineering decision: why alternating? Frame-wise attention normalizes each image independently (preventing one bright or high-contrast image from dominating activations) and lets camera tokens absorb per-frame information. Global attention then lets the model discover correspondences and reason about multi-view geometry. Doing both in every layer means the model can iteratively refine: "here's what I see in each image" → "here's how these images relate to each other" → "given those relationships, let me re-interpret each image." This alternation is the only 3D inductive bias: no epipolar constraints, no depth priors, no camera model assumptions.
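A minimal NumPy sketch of the alternation may make the two attention modes concrete. It is not the model's implementation: it assumes a single head with identity projections and omits layer norm, MLPs, and learned weights. The only point is that frame-wise and global attention differ by how the token tensor is reshaped.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head attention with identity projections. x: (batch, tokens, dim)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def alternating_block(tokens):
    """tokens: (N frames, T tokens per frame, C channels)."""
    n, t, c = tokens.shape
    # Frame-wise: each frame is its own batch element, so frames cannot mix.
    tokens = tokens + self_attention(tokens)
    # Global: flatten all frames into one sequence of N*T tokens.
    flat = tokens.reshape(1, n * t, c)
    flat = flat + self_attention(flat)
    return flat.reshape(n, t, c)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 5, 8))      # 3 frames, 5 tokens each, 8 channels
y = alternating_block(x)
print(y.shape)
```

Perturbing one frame leaves the frame-wise output of the other frames untouched, but changes every frame's output after the global step.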

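Step 4: Prediction heads

The output tokens are decoded by task-specific heads: a camera head regresses the camera parameters from each frame's camera token, and DPT heads turn the image tokens into the dense per-pixel outputs (depth maps, point maps, and tracking features).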

VGGT Architecture

Images are tokenized by DINOv2, augmented with camera/register tokens, processed through 24 alternating-attention blocks, then decoded by task-specific heads.

Why does the first frame get different learnable camera and register tokens than all other frames?

Chapter 3: Multi-Task Outputs

VGGT simultaneously predicts five distinct 3D quantities from a single forward pass. These outputs are over-complete — they encode redundant information. That is the point.

Camera parameters \(g_i\)

Each image gets a 9-dimensional camera vector: rotation quaternion \(q \in \mathbb{R}^4\), translation vector \(t \in \mathbb{R}^3\), and field of view \(f \in \mathbb{R}^2\). The first camera is always the identity (\(q_1 = [0,0,0,1]\), \(t_1 = [0,0,0]\)). This means the model predicts both extrinsics (where each camera is in space) and intrinsics (focal length, via the field of view) without any calibration input.
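One way to turn such a 9-dimensional vector into the usual extrinsic and intrinsic matrices is sketched below. Several details are assumptions rather than facts from the text: the layout `[q, t, fov]`, the field of view being (vertical, horizontal) in radians, and the principal point sitting at the image centre. The scalar-last quaternion order is inferred from the identity being written as [0,0,0,1]. `decode_camera` is a hypothetical helper name.

```python
import numpy as np

def quat_to_rotmat(q):
    """Unit quaternion in (x, y, z, w) order -> 3x3 rotation matrix."""
    x, y, z, w = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w),     2 * (x * z + y * w)],
        [2 * (x * y + z * w),     1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w),     2 * (y * z + x * w),     1 - 2 * (x * x + y * y)],
    ])

def decode_camera(g, height, width):
    """Split g = [q(4), t(3), fov(2)] into a 3x4 extrinsic and a 3x3 intrinsic matrix.

    ASSUMPTIONS: fov = (vertical, horizontal) in radians; principal point at image centre.
    """
    q, t, fov = g[:4], g[4:7], g[7:9]
    extrinsic = np.hstack([quat_to_rotmat(q), t.reshape(3, 1)])   # [R | t]
    fy = (height / 2) / np.tan(fov[0] / 2)
    fx = (width / 2) / np.tan(fov[1] / 2)
    K = np.array([[fx, 0.0, width / 2], [0.0, fy, height / 2], [0.0, 0.0, 1.0]])
    return extrinsic, K

# Reference camera: identity pose, 90-degree field of view in both directions.
g_ref = np.array([0, 0, 0, 1, 0, 0, 0, np.pi / 2, np.pi / 2], dtype=float)
E_ref, K_ref = decode_camera(g_ref, height=336, width=518)
print(E_ref)
print(K_ref)
```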

Depth maps \(D_i\)

Per-pixel depth from each camera's viewpoint. Unlike independent per-image monocular depth estimates, these share a single consistent scale across views, because the model sees all images together and reasons about their geometric relationships.

Point maps \(P_i\)

Per-pixel 3D coordinates in the world frame (camera 1's coordinate system). Each pixel in each image maps to a 3D point [x, y, z]. This is what DUSt3R also predicts, but VGGT does it for all N images simultaneously instead of for pairs.

Tracking features \(T_i\)

Dense C-dimensional feature maps that can be queried to find correspondences. Given any point in any image, the tracking head correlates its feature with all other frames' feature maps to find the matching 2D location. This works for both ordered video frames and unordered photo collections.
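The correlation lookup described above can be sketched as a nearest-neighbour search over a feature map. This is only the first step of the idea: a real tracking head refines the match further, and `match_by_correlation` is a hypothetical helper, not part of any released API.

```python
import numpy as np

def match_by_correlation(query_feature, feature_map):
    """Find the (row, col) in feature_map (H, W, C) that best matches query_feature (C,)."""
    corr = feature_map @ query_feature                      # (H, W) dot-product correlation
    row, col = np.unravel_index(np.argmax(corr), corr.shape)
    return int(row), int(col)

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8, 16))
feats /= np.linalg.norm(feats, axis=-1, keepdims=True)      # unit-length features
query = feats[2, 3]                                         # stand-in for a feature from another frame
match = match_by_correlation(query, feats)
print(match)
```

Because nothing in the lookup depends on frame order, the same query works for video frames and for unordered photo collections.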

Uncertainty maps \(\Sigma_i\)

Per-pixel confidence estimates for depth and point maps. These are used during training (as aleatoric uncertainty in the loss function) and at inference time to identify which predictions are reliable.

Why over-complete helps: You might think predicting both depth maps and point maps is redundant, since point maps can be derived from depth plus camera parameters. The ablation study (Table 6) shows otherwise: training with all tasks simultaneously improves point map accuracy by 15% compared to training without camera prediction. The model uses the redundancy as a self-consistency signal, forcing its internal representations to be geometrically coherent. And at inference time, combining the depth and camera heads actually gives better point clouds than the dedicated point map head.
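Combining the depth and camera heads amounts to unprojecting each depth map with its camera. A sketch of that step follows, assuming depth is measured along the camera z-axis and that extrinsics map world to camera coordinates (x_cam = R x_world + t). Both conventions are assumptions for illustration, and `unproject_depth` is a hypothetical helper.

```python
import numpy as np

def unproject_depth(depth, K, R, t):
    """depth: (H, W). K: 3x3 intrinsics. R (3x3), t (3,): world-to-camera extrinsics.

    Returns an (H, W, 3) point map in the world frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))              # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    rays = pix @ np.linalg.inv(K).T                             # camera-frame rays with z = 1
    cam_points = rays * depth[..., None]                        # scale each ray by its depth
    return (cam_points - t) @ R                                 # R^T (x_cam - t) in row-vector form

depth = np.full((2, 3), 2.0)
points = unproject_depth(depth, np.eye(3), np.eye(3), np.zeros(3))
print(points.shape)
```

For the reference camera (identity pose) the point map is simply the scaled pixel rays; for other cameras the predicted pose moves the points into camera 1's frame.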
Multi-Task Output Relationships

All five outputs are geometrically related.

At inference time, what produces more accurate 3D point clouds: the dedicated point map head, or combining the depth head with the camera head?

Chapter 4: Training

Training a model this general requires a massive and diverse collection of 3D-annotated data. VGGT was trained on 17 data sources (16 named datasets plus a synthetic dataset of artist-created assets) spanning indoor scenes, outdoor environments, synthetic renders, and real captures.

Training datasets

Co3Dv2, BlendedMVS, DL3DV, MegaDepth, Kubric, WildRGB, ScanNet, HyperSim, Mapillary, Habitat, Replica, MVS-Synth, PointOdyssey, Virtual KITTI, Aria Synthetic Environments, Aria Digital Twin, and a synthetic dataset of artist-created assets.

Multi-task loss

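\[ \mathcal{L} = \mathcal{L}_{\text{camera}} + \mathcal{L}_{\text{depth}} + \mathcal{L}_{\text{pointmap}} + \lambda \mathcal{L}_{\text{track}} \]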

Where λ = 0.05 (tracking loss is down-weighted). The camera, depth, and pointmap losses naturally have similar magnitudes and do not need explicit balancing.

Camera loss: Huber loss between predicted and ground-truth camera parameters [q, t, f].

Depth loss: Aleatoric uncertainty-weighted loss with a gradient-based term:

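\[ \mathcal{L}_{\text{depth}} = \sum_{i=1}^{N} \left\| \Sigma_i^D \odot (\hat{D}_i - D_i) \right\| + \left\| \Sigma_i^D \odot (\nabla \hat{D}_i - \nabla D_i) \right\| - \alpha \log \Sigma_i^D \]

The gradient term penalizes errors in depth gradients (edges, discontinuities), not just absolute depth. Because \(\Sigma_i^D\) multiplies the residuals, the model could otherwise shrink the loss by driving it toward zero everywhere, which amounts to claiming total uncertainty. The \(-\alpha \log \Sigma_i^D\) term penalizes exactly that.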
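A small NumPy sketch of a loss with this structure, for a single depth map, is given below. The choice of norm (mean absolute value), the forward-difference gradient, and the value of `alpha` are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def spatial_gradient(x):
    """Forward differences along width and height. x: (H, W) -> (2, H, W)."""
    gx = np.diff(x, axis=1, append=x[:, -1:])
    gy = np.diff(x, axis=0, append=x[-1:, :])
    return np.stack([gx, gy])

def depth_loss(pred, gt, conf, alpha=1.0):
    """pred, gt, conf: (H, W). conf > 0 plays the role of the uncertainty map in the text."""
    data_term = np.abs(conf * (pred - gt)).mean()
    grad_term = np.abs(conf * (spatial_gradient(pred) - spatial_gradient(gt))).mean()
    reg_term = -alpha * np.log(conf).mean()      # grows without bound as conf -> 0
    return data_term + grad_term + reg_term

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 5.0, size=(4, 5))
pred = gt + 0.1                                  # constant offset: gradients still match
ones = np.ones_like(gt)
print(depth_loss(pred, gt, ones))
print(depth_loss(pred, gt, np.full_like(gt, 1e-6)))
```

Comparing the two printed values shows the regularizer at work: shrinking the weight to near zero removes the residual terms but is dominated by the log penalty.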

Tracking loss: L1 distance between predicted and ground-truth correspondences, plus binary cross-entropy for visibility prediction (whether a point is visible in each frame).

Training details

Ground truth normalization

Scenes are normalized by expressing everything in the first camera's coordinate frame, then scaling so the average point-to-origin distance is 1.0. Unlike DUSt3R, VGGT does not apply this normalization to its own predictions: the network has to learn to output geometry at the normalized scale directly.
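The scaling step can be sketched as follows, assuming the points and camera translations have already been expressed in the first camera's frame. `normalize_scene` is a hypothetical helper written for this illustration.

```python
import numpy as np

def normalize_scene(points, translations, depths):
    """Rescale a scene so the average point-to-origin distance is 1.

    points: (M, 3) valid 3D points in camera 1's frame.
    translations: (N, 3) camera translations in the same frame.
    depths: (N, H, W) depth maps.
    """
    scale = np.linalg.norm(points, axis=1).mean()    # average distance to the origin
    return points / scale, translations / scale, depths / scale, scale

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3)) * 5.0
trans = rng.normal(size=(4, 3))
deps = rng.uniform(1.0, 10.0, size=(4, 2, 2))
pts_n, trans_n, deps_n, scale = normalize_scene(pts, trans, deps)
print(np.linalg.norm(pts_n, axis=1).mean())
```

Points, translations, and depths must share the same scale factor; rotations are unaffected by a uniform rescaling.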

Aggressive color augmentation: Each frame within the same scene gets independent color jittering. This forces the model to be robust to varying lighting conditions and prevents it from using color consistency as a shortcut for matching. The geometry must come from shape and structure, not color.
Training compute: 64 NVIDIA A100 GPUs (80GB each) for 9 days. That is 64 × 9 × 24 ≈ 13,800 GPU-hours. At roughly $2 per GPU-hour for cloud A100s, this is about $27,600 in compute. The batch construction samples 2–24 frames per scene randomly, with frames resized to a maximum dimension of 518px. Training uses bfloat16 mixed precision with gradient checkpointing to fit the 1.2B-parameter model plus activations into memory. With 160K iterations on a cosine learning-rate schedule, the model sees approximately 10M image tuples in total.
Why does the depth loss include a gradient-based term ||∇D̂ − ∇D||?

Chapter 5: Single Forward Pass

This is what makes VGGT fundamentally different from everything that came before. Let's understand exactly what "single forward pass" means and what it replaces.

What classical methods do

COLMAP processes 10 images through: feature extraction (one pass per image), pairwise matching (up to N(N−1)/2 pairs), RANSAC per pair, incremental SfM (iteratively adding cameras with bundle adjustment at each step), final global bundle adjustment, then dense MVS. Each step is iterative. Total: >15 seconds, often minutes.

What DUSt3R/MASt3R do

Process all N(N−1)/2 pairs through the network (quadratic), then run global alignment optimization to merge the pairwise predictions into a consistent scene. For 10 images: 45 forward passes + iterative optimization. Total: ~7–9 seconds.
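The pair count behind that figure is simple to tabulate. The snippet below only counts network passes; it says nothing about the cost of the alignment optimization that follows.

```python
def pairwise_passes(n):
    """Number of unordered image pairs among n images."""
    return n * (n - 1) // 2

for n in (2, 10, 32, 100):
    print(f"N={n:3d}  pairwise passes={pairwise_passes(n):5d}  single-pass model=1")
```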

What VGGT does

All N images go through the transformer once. Tokens attend to each other across frames via global self-attention. Camera poses, depth maps, point maps, and tracking features come out the other end. Total: ~0.2 seconds for 10 images.

No iteration, no RANSAC, no intrinsics: VGGT requires zero camera calibration information. It predicts intrinsics (field of view) alongside extrinsics. There is no RANSAC for outlier rejection — the transformer learns to handle outliers internally. There is no iterative optimization — the 24 layers of alternating attention serve as implicit "optimization steps" where each layer refines the previous layer's representation. The transformer is the optimizer.

Optional post-processing

While VGGT's feed-forward outputs already beat optimization-based methods, you can optionally refine with bundle adjustment. Because VGGT provides excellent initialization (near-correct poses and dense correspondences), BA converges extremely fast: ~1.6 seconds on top of the 0.2s forward pass. This pushes AUC@30 from 85.3 to 93.5 on RealEstate10K — but even without BA, the feed-forward result already beats all prior methods.

Inference memory and speed on real hardware: On an H100 with flash attention v3: 10 images at 336×518 take 0.14s and 3.63 GB. On a consumer RTX 4090 (24GB): 10 images fit easily; you can process up to ~40 images before hitting VRAM limits. The backbone takes ~80% of runtime; DPT heads add ~0.03s per frame. If memory is tight, run the backbone on all frames jointly (for cross-view reasoning), then decode DPT heads one at a time — this trades latency for memory with zero accuracy loss.
Processing Time Comparison

Feed-forward inference time for 10 images. VGGT completes in 0.2 seconds what classical pipelines need 15+ seconds for.

Why can VGGT skip RANSAC entirely?

Chapter 6: Results

VGGT won Best Paper at CVPR 2025. Here is why — it dominates across every 3D task, often by large margins, while being orders of magnitude faster.

Camera pose estimation (RealEstate10K + CO3Dv2)

AUC@30 metric (higher is better), 10 random frames per scene:

On Re10K (a dataset VGGT was never trained on), the margin is enormous: 85.3 vs 78.9 for the next best method, in 50x less time.

Dense reconstruction (DTU)

Without ground-truth cameras, VGGT achieves 0.382 Chamfer distance vs DUSt3R's 1.741, roughly a 4.5x reduction. It even approaches methods that are given ground-truth cameras.
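For readers unfamiliar with the metric, a brute-force symmetric Chamfer distance between two point clouds can be sketched as below. This is a generic definition (the mean of accuracy and completeness); the official DTU evaluation protocol has additional details such as masking that are not reproduced here.

```python
import numpy as np

def chamfer_distance(pred, gt):
    """pred: (M, 3), gt: (K, 3). Mean of accuracy (pred->gt) and completeness (gt->pred)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)   # (M, K) pairwise distances
    accuracy = d.min(axis=1).mean()        # how close predicted points are to the ground truth
    completeness = d.min(axis=0).mean()    # how well the ground truth is covered
    return 0.5 * (accuracy + completeness)

pred_cloud = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
gt_cloud = np.array([[0.0, 0.0, 0.0]])
print(chamfer_distance(pred_cloud, gt_cloud))
```

The pairwise distance matrix makes this quadratic in the number of points, so it suits small examples only; real evaluations use spatial indexing.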

Point cloud quality (ETH3D)

Feed-forward in 0.2s: 0.709 overall vs DUSt3R's 1.005 (with expensive global alignment). The depth+camera combination scores 0.677 — better than any prior method.

Image matching (ScanNet-1500)

Despite not being specialized for two-view matching, VGGT outperforms the state-of-the-art dedicated matcher RoMa: AUC@20 of 73.4 vs 70.9.

Dynamic point tracking (TAP-Vid)

Using VGGT features as a backbone for CoTracker improves \(\delta^{\text{vis}}_{\text{avg}}\) from 78.9 to 84.0 on TAP-Vid RGB-S, and from 64.3 to 69.0 on Kinetics.

VGGT vs Prior Art: Camera Pose Estimation

AUC@30 on RealEstate10K (unseen dataset). Higher is better. Bar opacity indicates relative speed.

The generalization story: VGGT was never trained on RealEstate10K, yet it outperforms all methods by a huge margin on this dataset. It also handles extreme cases: oil paintings, non-overlapping frames, scenes with repeated textures (like deserts), and even single-image reconstruction. The model generalizes because it learns geometry, not dataset-specific patterns.
What degrades: VGGT struggles in three specific regimes: (1) Very few views with large baselines — with only 2 images and 90+ degree viewpoint change, the model has limited evidence to triangulate and produces noisier depth. (2) Textureless scenes — white walls and flat surfaces give the global attention less to latch onto. (3) Extreme scale differences — if one image shows a building from 2m away and another from 200m, the 14×14 patch tokenization loses fine details at the far distance. In all cases, providing more views (N ≥ 5) dramatically improves quality, as the global attention can "bridge" between views.
How does VGGT's dense reconstruction quality (Chamfer distance) compare to DUSt3R on the DTU dataset?

Chapter 7: Comparison with DUSt3R / MASt3R

DUSt3R (CVPR 2024) was the breakthrough that showed a transformer could directly predict 3D pointmaps from image pairs without any classical geometry pipeline. MASt3R extended it with better matching. VGGT is the next evolution. Understanding the differences is key to appreciating what changed.

DUSt3R: Pairwise then optimize

DUSt3R takes two images and predicts pointmaps for both, expressed in camera 1's frame. For N images, you must run N(N−1)/2 pairwise predictions, then solve a global alignment optimization to merge all pairwise results into one consistent 3D scene. This optimization takes seconds and can fail or converge to bad solutions.

MASt3R: Better matching, same limitation

MASt3R adds a matching head to DUSt3R, producing better correspondences. But it still processes pairs and still needs global alignment. For 32 images, DUSt3R takes over 200 seconds. For more than 32, it runs out of memory.

VGGT: All at once

VGGT processes all N images in a single forward pass. The global self-attention layers let every image's tokens attend to every other image's tokens, building a unified 3D representation internally. No pairwise decomposition, no global alignment, no quadratic scaling.

Pairwise vs. All-at-Once Processing

DUSt3R processes N(N−1)/2 pairs then optimizes. VGGT processes all N images simultaneously.

Key differences at a glance

DUSt3R / MASt3R: pairwise → O(N²) passes → global alignment (iterative) → ~7–200s depending on N
VGGT: all N images → 1 pass → done → ~0.2s for 10 images; memory grows roughly linearly with N, while runtime grows faster than linearly (3.12s at 100 images)
Handling non-overlapping views: DUSt3R fails when two images have no visual overlap — it cannot find correspondences. VGGT handles this gracefully because global attention lets the model reason about spatial relationships even between non-overlapping views, using intermediate frames as bridges.
Why pointmaps instead of just depth? A depth map tells you "this pixel is 3.2m from the camera." A pointmap tells you "this pixel is at world coordinate (1.7, 0.4, 3.1)." The difference is crucial: depth requires knowing the camera pose to be useful in world coordinates, while pointmaps are already in world coordinates. VGGT predicts both because they provide complementary error signals, and the ETH3D results show that combining depth with the estimated camera actually produces better pointmaps than the dedicated pointmap head (Table 3: 0.677 vs 0.709 overall error). The decomposition into simpler subtasks helps even when trained jointly.
What is the fundamental scaling difference between DUSt3R and VGGT when processing N images?

Chapter 8: Efficiency

VGGT scales efficiently with the number of input views. Here are the measured runtime and memory numbers on an NVIDIA H100 GPU with flash attention v3:

Runtime scaling

Frames   Runtime (s)   Memory (GB)
1        0.04          1.88
2        0.05          2.07
10       0.14          3.63
50       1.04          11.41
100      3.12          21.15
200      8.75          40.63

The backbone dominates cost. The camera head adds only ~5% runtime and ~2% memory. Each DPT head costs ~0.03s and ~0.2 GB per frame.
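The measurements above can be turned into rough scaling exponents with a log-log slope between the two largest settings. This is a back-of-the-envelope reading of the reported numbers, not a model of the hardware.

```python
import math

frames = [1, 2, 10, 50, 100, 200]
seconds = [0.04, 0.05, 0.14, 1.04, 3.12, 8.75]
gigabytes = [1.88, 2.07, 3.63, 11.41, 21.15, 40.63]

def loglog_slope(x0, y0, x1, y1):
    """Exponent p such that y grows like x**p between two measurements."""
    return math.log(y1 / y0) / math.log(x1 / x0)

time_exp = loglog_slope(frames[-2], seconds[-2], frames[-1], seconds[-1])
mem_exp = loglog_slope(frames[-2], gigabytes[-2], frames[-1], gigabytes[-1])
gb_per_extra_frame = (gigabytes[-1] - gigabytes[-2]) / (frames[-1] - frames[-2])

print(f"runtime exponent (100 -> 200 frames): {time_exp:.2f}")
print(f"memory exponent  (100 -> 200 frames): {mem_exp:.2f}")
print(f"memory per extra frame: {gb_per_extra_frame:.3f} GB")
```

An exponent near 1 indicates linear growth and an exponent near 2 quadratic growth. In this range memory sits close to linear, while runtime falls between the two.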

Why it scales well

Global self-attention is technically O(N²K²) in tokens (N frames, K patches each). But with flash attention and modern hardware, this is manageable up to hundreds of frames. And unlike DUSt3R, there is no quadratic number of forward passes — just one pass with more tokens.
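To see where the quadratic term comes from, the sketch below counts query-key pairs for the two attention modes. It uses K = 888 patch tokens per frame, which follows from 336×518 images with 14×14 patches, and ignores special tokens for simplicity.

```python
def attention_pairs(n_frames, tokens_per_frame):
    """Query-key pairs scored by frame-wise vs global self-attention."""
    frame_wise = n_frames * tokens_per_frame ** 2        # N independent K x K problems
    global_ = (n_frames * tokens_per_frame) ** 2         # one (N*K) x (N*K) problem
    return frame_wise, global_

for n in (1, 10, 100):
    fw, gl = attention_pairs(n, 888)
    print(f"N={n:3d}  frame-wise={fw:>14,}  global={gl:>16,}  ratio={gl // fw}")
```

Global attention does N times more work than frame-wise attention at the same N. Flash attention never materializes the full score matrix, which is consistent with memory growing close to linearly even though compute does not.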

Memory-constrained deployment

The DPT heads make independent predictions per frame. So if GPU memory is tight, you can run the backbone on all frames jointly (for cross-frame reasoning), then run DPT heads one frame at a time. This trades latency for memory without losing any accuracy.
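The control flow for this memory-saving mode can be sketched with stand-in callables. `backbone` and `dense_head` below are hypothetical placeholders rather than the model's real modules; the point is only that the joint pass happens once and the per-frame decoding happens in a loop.

```python
import numpy as np

def run_low_memory(frames, backbone, dense_head):
    """backbone: maps all frames jointly to (N, T, C) tokens.
    dense_head: maps one frame's (T, C) tokens to that frame's output."""
    tokens = backbone(frames)                    # one joint pass: cross-frame reasoning happens here
    outputs = []
    for frame_tokens in tokens:                  # decode one frame at a time to cap peak memory
        outputs.append(dense_head(frame_tokens))
    return np.stack(outputs)

# Stand-ins for illustration only.
def toy_backbone(x):
    return x - x.mean(axis=0, keepdims=True)     # mixes information across frames

def toy_head(t):
    return t.sum(axis=-1)                        # acts on each frame independently

rng = np.random.default_rng(0)
toy_frames = rng.normal(size=(4, 6, 8))
sequential = run_low_memory(toy_frames, toy_backbone, toy_head)
batched = toy_head(toy_backbone(toy_frames))
print(np.allclose(sequential, batched))
```

Because the head treats frames independently, looping over frames gives the same result as a batched call, which is why the trade is latency for memory rather than accuracy.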

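Practical deployment: For 10 images at 336×518 resolution, VGGT needs only 3.63 GB, which fits comfortably on consumer GPUs. At 100 images the backbone needs about 21 GB, which nominally fits a 24 GB card such as an RTX 3090 but leaves little headroom for the prediction heads unless they are decoded frame by frame. At 200 images it needs about 40 GB (A100/H100 territory). Tensor parallelism across multiple GPUs can extend this further.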
Runtime & Memory Scaling

Runtime (seconds) and GPU memory (GB) vs number of input frames on an H100 GPU.

How can VGGT be deployed on memory-constrained GPUs for many frames?

Chapter 9: Connections

What VGGT built on

DUSt3R (Wang et al., CVPR 2024): The breakthrough showing that a transformer can predict dense 3D pointmaps from image pairs without calibration. VGGT extends this from pairwise to any number of images, eliminates the global alignment optimization, and adds camera/depth/tracking heads.

MASt3R (Leroy et al., ECCV 2024): Extended DUSt3R with a matching head for better correspondences. VGGT's tracking head serves a similar purpose but works across all N images simultaneously.

COLMAP (Schönberger & Frahm, 2016): The gold standard classical SfM pipeline that VGGT replaces. COLMAP's incremental reconstruction with bundle adjustment remains a useful optional post-processing step for VGGT.

DINOv2 (Oquab et al., 2023): The self-supervised ViT backbone used as VGGT's tokenizer. Its strong visual features provide a stable initialization that enables reliable training.

CoTracker (Karaev et al., 2023): The point tracking architecture used as VGGT's tracking head. VGGT's features dramatically improve CoTracker's performance on dynamic scenes.

What VGGT enables

3D Gaussian Splatting: VGGT can provide the camera poses and initial point clouds that 3DGS needs for optimization, replacing COLMAP as the initialization pipeline.

Feed-forward novel view synthesis: By finetuning with Plücker ray tokens for target views, VGGT achieves competitive novel view synthesis without knowing input camera parameters.

Dynamic scene understanding: VGGT's features, when used as a backbone for video trackers, improve dynamic point tracking performance, opening the door to understanding non-rigid scenes.

FutureMapping/Spatial AI: VGGT represents a step toward the "FutureMapping" vision where a single model replaces entire SLAM pipelines, directly predicting scene geometry from sensor observations.

The bigger picture: VGGT follows the same scaling paradigm as GPTs, CLIP, DINO, and Stable Diffusion: build a large, simple model, train it on a massive dataset, and let it learn what hand-engineered pipelines previously required. Just as LLMs replaced hand-crafted NLP pipelines and diffusion models replaced hand-crafted image synthesis, VGGT represents the moment when a single neural network can replace the multi-stage 3D vision pipeline. The era of "geometry as a module" may be giving way to "geometry as an emergent capability."

Cheat sheet

Core idea: One transformer, one pass, all 3D outputs (poses, depth, pointmaps, correspondences)
Architecture: DINOv2 tokenizer + 24 alternating attention blocks (frame ↔ global) + DPT/camera heads
Key numbers: 1.2B params, 0.2s for 10 images, AUC@30 = 85.3 (feed-forward) / 93.5 (+ BA)
vs DUSt3R: 4.5x better Chamfer distance, 35x faster, handles any N (not just pairs)
Impact: CVPR 2025 Best Paper. Replaces COLMAP as 3D vision foundation.
How does VGGT relate to the broader trend in AI of replacing hand-engineered pipelines with large neural networks?