A single feed-forward transformer that takes N unposed images and outputs ALL 3D geometry — camera poses, pointmaps, depth maps, correspondences — in one forward pass. No iteration, no RANSAC, no intrinsics needed.
You have a bunch of photos of a scene. Maybe 5 photos of a room, or 50 photos of a building taken from different angles. You want the full 3D geometry: where each camera was, how deep every pixel is, a dense 3D point cloud, and correspondences between images. This is the fundamental problem of 3D computer vision.
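Traditionally, this requires an entire pipeline of specialized modules, each solving one piece of the puzzle: feature extraction, pairwise matching, RANSAC-based verification, incremental SfM, global bundle adjustment, and dense MVS.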
This is COLMAP — the gold standard since 2016. It works, but it is complex (thousands of lines of C++ across six stages), slow (seconds to minutes per scene), brittle (fails on textureless regions, repeated textures, extreme viewpoints), and each module is designed and tuned independently.
The traditional SfM/MVS pipeline requires six sequential stages. VGGT replaces the entire stack with a single forward pass.
VGGT's insight is radical in its simplicity: a single large transformer, with almost no 3D-specific inductive biases, can learn to predict all 3D scene attributes simultaneously from raw images.
Not camera poses alone. Not depth maps alone. Not point clouds alone. Not correspondences alone. All of them, together, in one forward pass.
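Formally, the model is a function that maps N input images to one set of 3D quantities per frame, $f\big((I_i)_{i=1}^{N}\big) = (g_i, D_i, P_i, T_i)_{i=1}^{N}$, where for each image $I_i$: $g_i \in \mathbb{R}^9$ is the camera vector, $D_i$ is the depth map, $P_i$ is the point map, and $T_i$ is a grid of C-dimensional tracking features.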
This is possible because these outputs are not independent. Camera poses constrain depth maps. Depth maps and poses together determine point maps. Point correspondences are implied by the point maps. Training a single model to predict all of them lets their mutual consistency act as an extra learning signal, even though the outputs are redundant.
The architecture is deliberately simple: a standard large transformer (1.2 billion parameters, 24 layers) with one unusual design choice — alternating attention that switches between looking within each image and looking across all images. That is essentially the only 3D inductive bias. Everything else is learned from data.
The architecture has four components: a frozen DINOv2 tokenizer, camera and register tokens, the alternating-attention transformer backbone, and task-specific prediction heads.
Each input image $I_i$ is patchified into K tokens (14×14 pixel patches) using a frozen DINOv2-Large encoder. This gives a set of 1024-dimensional tokens $t^I_i$ per image. DINOv2 was chosen over a raw convolutional patchifier because it provides much more stable training and better performance.
Let's trace the exact data flow for N=10 images at 336×518 resolution:
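As a rough trace of the shapes involved, the sketch below counts tokens for that configuration. The patch size (14) and token width (1024) come from the text above. The five special tokens per frame (one camera token plus four register tokens) are an assumption taken from the VGGT paper, and the function name `trace_token_shapes` is made up for illustration.

```python
def trace_token_shapes(num_frames, height, width, patch=14, dim=1024, special_per_frame=5):
    """Count the tokens the backbone sees for a set of frames.

    special_per_frame=5 (1 camera + 4 register tokens) is an assumption
    taken from the VGGT paper, not from the surrounding text.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    grid_h, grid_w = height // patch, width // patch
    patches_per_frame = grid_h * grid_w              # K image tokens per frame
    tokens_per_frame = patches_per_frame + special_per_frame
    total_tokens = num_frames * tokens_per_frame     # sequence length seen by global attention
    return {
        "patch_grid": (grid_h, grid_w),
        "patches_per_frame": patches_per_frame,
        "tokens_per_frame": tokens_per_frame,
        "total_tokens": total_tokens,
        "backbone_input_shape": (num_frames, tokens_per_frame, dim),
    }


shapes = trace_token_shapes(num_frames=10, height=336, width=518)
for name, value in shapes.items():
    print(f"{name}: {value}")
```

For 10 frames at 336×518 this gives a 24×37 patch grid, so K = 888 image tokens per frame, 893 tokens per frame once the special tokens are added, and 8,930 tokens in total.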
For each image, the model appends one camera token $t^g_i$ and four register tokens $t^R_i$ to the image tokens.
Crucially, the first frame gets different learnable tokens ($\bar{t}^g$, $\bar{t}^R$) than all other frames. This tells the model which frame is the reference: all 3D outputs are expressed in the coordinate frame of camera 1.
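The concatenation of all tokens from all frames passes through L = 24 blocks, each containing two attention layers: a frame-wise self-attention layer, in which tokens attend only to tokens from the same image, and a global self-attention layer, in which tokens attend to tokens from every image.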
Images are tokenized by DINOv2, augmented with camera/register tokens, processed through 24 alternating-attention blocks, then decoded by task-specific heads.
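The alternating pattern can be illustrated with a toy, dependency-free sketch. It is not the VGGT implementation: it uses a single head with identity query/key/value projections and omits layer normalization and the MLP, and the frame-then-global ordering inside a block is an assumption. The only thing it is meant to show is which tokens are allowed to see which.

```python
import math


def self_attention(tokens):
    """Single-head dot-product self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)                              # subtract max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([sum(w * tok[d] for w, tok in zip(weights, tokens)) / norm
                    for d in range(dim)])
    return out


def frame_attention(frames):
    """Frame-wise attention: tokens attend only within their own frame."""
    return [self_attention(frame) for frame in frames]


def global_attention(frames):
    """Global attention: tokens of all frames attend to each other."""
    flat = [tok for frame in frames for tok in frame]
    mixed = self_attention(flat)
    out, start = [], 0
    for frame in frames:                                # split back into per-frame lists
        out.append(mixed[start:start + len(frame)])
        start += len(frame)
    return out


def residual(frames, update):
    """Add an attention update back onto its input, token by token."""
    return [[[x + u for x, u in zip(tok, upd)] for tok, upd in zip(f, g)]
            for f, g in zip(frames, update)]


def alternating_block(frames):
    """One block: frame-wise attention, then global attention, each with a residual."""
    frames = residual(frames, frame_attention(frames))
    return residual(frames, global_attention(frames))


# Two frames with two 2-dimensional tokens each.
frames = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 1.0], [0.5, 0.5]]]
out = frames
for _ in range(3):                                      # the text uses 24 blocks
    out = alternating_block(out)
```

In this sketch, changing the second frame leaves the frame-wise output of the first frame untouched but changes its global output, which is the sense in which the global layers carry the cross-view information.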
VGGT simultaneously predicts five distinct 3D quantities from a single forward pass. These outputs are over-complete — they encode redundant information. That is the point.
Each image gets a 9-dimensional camera vector: rotation quaternion $q \in \mathbb{R}^4$, translation vector $t \in \mathbb{R}^3$, and field-of-view $f \in \mathbb{R}^2$. The first camera is always the identity ($q_1 = [0,0,0,1]$, $t_1 = [0,0,0]$). This means the model predicts both extrinsics (where each camera is in space) and intrinsics (focal length) without any calibration input.
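To make the 9-dimensional encoding concrete, here is a minimal sketch of how such a vector could be turned into a rotation matrix and a pinhole intrinsics matrix. Several things are assumptions rather than facts from the text: the quaternion is read in [x, y, z, w] order (which matches the identity [0,0,0,1] above), the two field-of-view values are taken as vertical and horizontal angles in radians, and the principal point is placed at the image centre. The function names are illustrative.

```python
import math


def quat_to_rotation(q):
    """Rotation matrix from a quaternion given in [x, y, z, w] order."""
    x, y, z, w = q
    norm = math.sqrt(x * x + y * y + z * z + w * w)
    x, y, z, w = x / norm, y / norm, z / norm, w / norm   # guard against non-unit input
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def fov_to_intrinsics(fov_h, fov_w, height, width):
    """Pinhole intrinsics from vertical/horizontal field of view in radians.

    Assumes the principal point sits at the image centre.
    """
    fy = (height / 2.0) / math.tan(fov_h / 2.0)
    fx = (width / 2.0) / math.tan(fov_w / 2.0)
    return [[fx, 0.0, width / 2.0], [0.0, fy, height / 2.0], [0.0, 0.0, 1.0]]


def decode_camera(q, t, fov, height, width):
    """Turn the three parts of a camera vector into (R, t, K)."""
    return quat_to_rotation(q), list(t), fov_to_intrinsics(fov[0], fov[1], height, width)


# The reference camera: identity rotation, zero translation, 90-degree field of view.
R, t, K = decode_camera([0, 0, 0, 1], [0, 0, 0], [math.pi / 2, math.pi / 2], 336, 518)
```

With a 90-degree field of view the focal length equals half the image size along that axis, which is a quick sanity check on the conversion.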
Per-pixel depth from each camera's viewpoint. Unlike the outputs of a monocular depth estimator run on each image separately, these depth maps share one scale across views, because the model sees all images together and reasons about their geometric relationships.
Per-pixel 3D coordinates in the world frame (camera 1's coordinate system). Each pixel in each image maps to a 3D point [x, y, z]. This is what DUSt3R also predicts — but VGGT does it for all N images simultaneously instead of pairs.
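The statement that depth maps and poses together determine point maps can be sketched as a plain unprojection. The conventions here are assumptions: depth is measured along the optical axis, extrinsics map world to camera coordinates as x_cam = R x_world + t, and pixel centres sit at integer coordinates. The function names are illustrative.

```python
def unproject_pixel(u, v, depth, K, R, t):
    """Lift pixel (u, v) with the given depth to a 3D point in the world frame.

    Assumes world-to-camera extrinsics: x_cam = R @ x_world + t.
    """
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]   # point in camera frame
    shifted = [cam[i] - t[i] for i in range(3)]
    # x_world = R^T @ (x_cam - t); R^T is read by swapping the indices of R.
    return [sum(R[r][c] * shifted[r] for r in range(3)) for c in range(3)]


def depth_to_pointmap(depth_map, K, R, t):
    """Unproject every pixel of a depth map (a list of rows) into a point map."""
    return [[unproject_pixel(u, v, d, K, R, t) for u, d in enumerate(row)]
            for v, row in enumerate(depth_map)]


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
K = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]
pointmap = depth_to_pointmap([[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]], K, identity, [0.0, 0.0, 0.0])
```

For the reference camera (identity rotation, zero translation) the point map is just the depth map pushed through the inverse intrinsics, so the z coordinate of every point equals its depth.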
Dense C-dimensional feature maps that can be queried to find correspondences. Given any point in any image, the tracking head correlates its feature with all other frames' feature maps to find the matching 2D location. This works for both ordered video frames and unordered photo collections.
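The correlation step can be sketched as a nearest-neighbour lookup in feature space. This is a deliberate simplification: it takes a hard argmax over cosine similarity, whereas a CoTracker-style head refines its estimate over several iterations. The names `cosine` and `match_by_correlation` are illustrative, not part of any released API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_by_correlation(query_feature, feature_map):
    """Return ((row, col), score) of the most similar location in feature_map."""
    best_score, best_rc = -2.0, None                    # cosine is never below -1
    for r, row in enumerate(feature_map):
        for c, feat in enumerate(row):
            score = cosine(query_feature, feat)
            if score > best_score:
                best_score, best_rc = score, (r, c)
    return best_rc, best_score


# Feature maps for two frames, each 2x2 with 3-dimensional features.
source_map = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
              [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]]
target_map = [[[0.0, 0.9, 0.1], [0.1, 0.0, 1.0]],
              [[1.0, 1.0, 0.1], [1.0, 0.1, 0.0]]]

query = source_map[0][0]                                # the point to track
location, score = match_by_correlation(query, target_map)
```

Because the lookup only compares features, it does not care whether the target frame is the next video frame or an unrelated photo of the same scene.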
Per-pixel confidence estimates for depth and point maps. These are used during training (as aleatoric uncertainty in the loss function) and at inference time to identify which predictions are reliable.
All five outputs are geometrically related.
Training a model this general requires a massive and diverse collection of 3D-annotated data. VGGT was trained on 16 named datasets plus one synthetic set, spanning indoor scenes, outdoor environments, synthetic renders, and real captures.
Co3Dv2, BlendMVS, DL3DV, MegaDepth, Kubric, WildRGB, ScanNet, HyperSim, Mapillary, Habitat, Replica, MVS-Synth, PointOdyssey, Virtual KITTI, Aria Synthetic Environments, Aria Digital Twin, and a synthetic dataset of artist-created assets.
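The model is trained end to end with a multi-task loss, $\mathcal{L} = \mathcal{L}_{\text{camera}} + \mathcal{L}_{\text{depth}} + \mathcal{L}_{\text{pmap}} + \lambda \mathcal{L}_{\text{track}}$, where λ = 0.05 (the tracking loss is down-weighted). The camera, depth, and pointmap losses naturally have similar magnitudes and do not need explicit balancing.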
Camera loss: Huber loss between predicted and ground-truth camera parameters [q, t, f].
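Depth loss: Aleatoric uncertainty-weighted loss with a gradient-based term: $\mathcal{L}_{\text{depth}} = \sum_{i=1}^{N} \lVert \Sigma_i^D \odot (\hat{D}_i - D_i) \rVert + \lVert \Sigma_i^D \odot (\nabla \hat{D}_i - \nabla D_i) \rVert - \alpha \log \Sigma_i^D$, where $\hat{D}_i$ is the predicted depth, $D_i$ the ground truth, and $\Sigma_i^D$ the predicted per-pixel uncertainty map, which acts as a confidence weight on the error.
The gradient term penalizes errors in depth gradients (edges, discontinuities), not just absolute depth. The $-\alpha \log \Sigma$ term prevents the model from cheating by driving that weight toward zero everywhere, which would amount to predicting unbounded uncertainty.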
Tracking loss: L1 distance between predicted and ground-truth correspondences, plus binary cross-entropy for visibility prediction (whether a point is visible in each frame).
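The loss terms listed above can be sketched in a few lines. The details are assumptions made for illustration: mean reduction, an L1 norm inside the depth loss, forward finite differences for the gradient term, a Huber threshold of 1.0, and a caller-supplied α. Only the overall structure and λ = 0.05 come from the text.

```python
import math


def huber(pred, target, delta=1.0):
    """Mean Huber loss over paired scalars (quadratic near zero, linear beyond delta)."""
    total = 0.0
    for p, g in zip(pred, target):
        r = abs(p - g)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)


def depth_loss(pred, target, conf, alpha):
    """Confidence-weighted depth loss with a finite-difference gradient term.

    pred, target and conf are 2D lists of equal size; conf must be positive.
    """
    rows, cols = len(pred), len(pred[0])
    # Differentiation is linear, so grad(pred) - grad(target) == grad(pred - target).
    res = [[pred[r][c] - target[r][c] for c in range(cols)] for r in range(rows)]
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            gx = abs(res[r][c + 1] - res[r][c]) if c + 1 < cols else 0.0
            gy = abs(res[r + 1][c] - res[r][c]) if r + 1 < rows else 0.0
            weighted = conf[r][c] * (abs(res[r][c]) + gx + gy)
            total += weighted - alpha * math.log(conf[r][c])   # log term rewards confidence
    return total / (rows * cols)


def total_loss(camera, depth, pmap, track, lam=0.05):
    """Sum of the task losses with the tracking term down-weighted by lam."""
    return camera + depth + pmap + lam * track


zeros = [[0.0, 0.0], [0.0, 0.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
half = [[0.5, 0.5], [0.5, 0.5]]
penalty_for_low_confidence = depth_loss(zeros, zeros, half, alpha=0.5)
```

The last line shows the role of the log term: even a perfect prediction pays a positive cost when its confidence is set below 1, so shrinking the confidence is not a free way to reduce the loss.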
Scenes are normalized by expressing everything in the first camera's coordinate frame, then scaling so the average point-to-origin distance is 1.0. Unlike DUSt3R, VGGT does not apply this normalization to the network's predictions; the model has to learn to output the normalized scale directly.
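A minimal sketch of that ground-truth normalization, assuming the points and camera translations have already been expressed in the first camera's frame. The function name and argument layout are illustrative.

```python
import math


def normalize_scene(points, translations, depth_maps):
    """Rescale a scene so the mean distance of its 3D points from the origin is 1.

    points: list of [x, y, z]; translations: list of [tx, ty, tz];
    depth_maps: list of 2D lists. Rotations are unaffected by a global scale.
    """
    distances = [math.sqrt(x * x + y * y + z * z) for x, y, z in points]
    mean_distance = sum(distances) / len(distances)
    if mean_distance == 0:
        raise ValueError("all points lie at the origin; scale is undefined")
    s = 1.0 / mean_distance
    scaled_points = [[s * x, s * y, s * z] for x, y, z in points]
    scaled_translations = [[s * a for a in t] for t in translations]
    scaled_depths = [[[s * d for d in row] for row in depth] for depth in depth_maps]
    return scaled_points, scaled_translations, scaled_depths, s


points = [[0.0, 0.0, 2.0], [0.0, 0.0, 4.0]]
translations = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
depth_maps = [[[2.0, 4.0]]]
norm_points, norm_translations, norm_depths, scale = normalize_scene(
    points, translations, depth_maps)
```

Points, camera translations, and depths are all multiplied by the same factor, so the scene stays geometrically consistent after rescaling.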
This is what makes VGGT fundamentally different from everything that came before. Let's understand exactly what "single forward pass" means and what it replaces.
COLMAP: Processes 10 images through feature extraction (one pass per image), pairwise matching (up to N(N−1)/2 pairs), RANSAC per pair, incremental SfM (iteratively adding cameras with bundle adjustment at each step), a final global bundle adjustment, then dense MVS. Most of these steps are iterative. Total: more than 15 seconds, often minutes.
DUSt3R / MASt3R: Process all N(N−1)/2 pairs through the network (quadratic in N), then run a global alignment optimization to merge the pairwise predictions into a consistent scene. For 10 images: 45 forward passes plus iterative optimization. Total: ~7–9 seconds.
VGGT: All N images go through the transformer once. Tokens attend to each other across frames via global self-attention. Camera poses, depth maps, point maps, and tracking features come out the other end. Total: ~0.2 seconds for 10 images.
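The difference in the number of network evaluations is simple arithmetic, sketched below. The function counts unordered image pairs, as the text does; `forward_passes` is an illustrative name.

```python
def forward_passes(num_images, pairwise):
    """Network evaluations needed: one per unordered pair, or a single joint pass."""
    if pairwise:
        return num_images * (num_images - 1) // 2
    return 1


for n in (2, 10, 32, 100):
    print(f"N={n:>3}: pairwise={forward_passes(n, pairwise=True):>5}, "
          f"joint={forward_passes(n, pairwise=False)}")
```

For 10 images the pairwise approach needs 45 passes and for 32 images it needs 496, before any global alignment is run.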
While VGGT's feed-forward outputs already beat optimization-based methods, you can optionally refine with bundle adjustment. Because VGGT provides excellent initialization (near-correct poses and dense correspondences), BA converges extremely fast: ~1.6 seconds on top of the 0.2s forward pass. This pushes AUC@30 from 85.3 to 93.5 on RealEstate10K — but even without BA, the feed-forward result already beats all prior methods.
Feed-forward inference time for 10 images. VGGT completes in 0.2 seconds what classical pipelines need 15+ seconds for.
VGGT won Best Paper at CVPR 2025. Here is why: it leads on every 3D task reported below, often by large margins, while running far faster than optimization-based pipelines.
AUC@30 metric (higher is better), 10 random frames per scene:
On Re10K (a dataset VGGT was never trained on), the margin is enormous: 85.3 vs 78.9 for the next best method, in 50x less time.
On DTU, without ground-truth cameras, VGGT achieves a Chamfer distance of 0.382 vs DUSt3R's 1.741, roughly 4.5x lower. It even approaches methods that are given ground-truth cameras.
On ETH3D, feed-forward in 0.2s: 0.709 overall vs DUSt3R's 1.005 (with expensive global alignment). The depth+camera combination scores 0.677, better than the prior methods it is compared against.
On ScanNet-1500, despite not being specialized for two-view matching, VGGT outperforms the state-of-the-art dedicated matcher RoMa: AUC@20 of 73.4 vs 70.9.
Using VGGT features as a backbone for CoTracker improves $\delta^{\text{vis}}_{\text{avg}}$ from 78.9 to 84.0 on TAP-Vid RGB-S, and from 64.3 to 69.0 on Kinetics.
AUC@30 on RealEstate10K (unseen dataset). Higher is better. Bar opacity indicates relative speed.
DUSt3R (CVPR 2024) was the breakthrough that showed a transformer could directly predict 3D pointmaps from image pairs without any classical geometry pipeline. MASt3R extended it with better matching. VGGT is the next evolution. Understanding the differences is key to appreciating what changed.
DUSt3R takes two images and predicts pointmaps for both, expressed in camera 1's frame. For N images, you must run N(N−1)/2 pairwise predictions, then solve a global alignment optimization to merge all pairwise results into one consistent 3D scene. This optimization takes seconds and can fail or converge to bad solutions.
MASt3R adds a matching head to DUSt3R, producing better correspondences. But it still processes pairs and still needs global alignment. For 32 images, DUSt3R takes over 200 seconds. For more than 32, it runs out of memory.
VGGT processes all N images in a single forward pass. The global self-attention layers let every image's tokens attend to every other image's tokens, building a unified 3D representation internally. No pairwise decomposition, no global alignment, no quadratic scaling.
DUSt3R processes N(N−1)/2 pairs then optimizes. VGGT processes all N images simultaneously.
VGGT scales efficiently with the number of input views. Here are the measured runtime and memory numbers on an NVIDIA H100 GPU with flash attention v3:
The backbone dominates cost. The camera head adds only ~5% runtime and ~2% memory. Each DPT head costs ~0.03s and ~0.2 GB per frame.
Global self-attention is technically O(N²K²) in tokens (N frames, K patches each). But with flash attention and modern hardware, this is manageable up to hundreds of frames. And unlike DUSt3R, there is no quadratic number of forward passes — just one pass with more tokens.
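To put the O(N²K²) figure in perspective, the sketch below counts query-key pairs for the two attention types. It ignores heads, feature width, and the special tokens, so it is a rough proportionality argument rather than a FLOP count; `attention_pairs` is an illustrative name.

```python
def attention_pairs(num_frames, tokens_per_frame):
    """Query-key pairs evaluated by one frame-wise and one global attention layer."""
    frame_wise = num_frames * tokens_per_frame ** 2        # N independent K x K problems
    global_all = (num_frames * tokens_per_frame) ** 2      # one (N*K) x (N*K) problem
    return frame_wise, global_all


for n in (1, 10, 50, 100):
    frame_wise, global_all = attention_pairs(n, tokens_per_frame=888)
    print(f"N={n:>3}: frame-wise={frame_wise:,}  global={global_all:,}  "
          f"ratio={global_all // frame_wise}")
```

The global layer costs N times as many pairs as the frame-wise layer, so it dominates as N grows, but it is still a single pass through the network rather than N(N−1)/2 of them.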
The DPT heads make independent predictions per frame. So if GPU memory is tight, you can run the backbone on all frames jointly (for cross-frame reasoning), then run DPT heads one frame at a time. This trades latency for memory without losing any accuracy.
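That memory-saving pattern can be sketched with stand-in components. `toy_backbone` and `toy_head` are hypothetical placeholders, not VGGT modules: the only properties they share with the description above are that the backbone mixes information across frames and the head treats each frame independently.

```python
def predict_with_chunked_heads(backbone, head, images, chunk_size=1):
    """Run the backbone once on all frames, then the dense head a few frames at a time."""
    tokens = backbone(images)                  # joint pass: cross-frame reasoning happens here
    outputs = []
    for start in range(0, len(tokens), chunk_size):
        # Only chunk_size frames of head activations are alive at any moment.
        outputs.extend(head(tokens[start:start + chunk_size]))
    return outputs


def toy_backbone(images):
    """Stand-in backbone: mixes a global mean into every frame's features."""
    mean = sum(sum(img) for img in images) / sum(len(img) for img in images)
    return [[x + mean for x in img] for img in images]


def toy_head(token_chunk):
    """Stand-in dense head: acts on each frame independently."""
    return [[2.0 * x for x in frame] for frame in token_chunk]


images = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
one_by_one = predict_with_chunked_heads(toy_backbone, toy_head, images, chunk_size=1)
all_at_once = predict_with_chunked_heads(toy_backbone, toy_head, images,
                                         chunk_size=len(images))
```

Both calls return the same predictions, because chunking only changes how many frames the head holds in memory at once, not what it computes for each frame.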
Runtime (seconds) and GPU memory (GB) vs number of input frames on an H100 GPU.
DUSt3R (Wang et al., CVPR 2024): The breakthrough showing that a transformer can predict dense 3D pointmaps from image pairs without calibration. VGGT extends this from pairwise to any number of images, eliminates the global alignment optimization, and adds camera/depth/tracking heads.
MASt3R (Leroy et al., 2024): Extended DUSt3R with a matching head for better correspondences. VGGT's tracking head serves a similar purpose but works across all N images simultaneously.
COLMAP (Schönberger & Frahm, 2016): The gold standard classical SfM pipeline that VGGT replaces. COLMAP's incremental reconstruction with bundle adjustment remains a useful optional post-processing step for VGGT.
DINOv2 (Oquab et al., 2023): The self-supervised ViT backbone used as VGGT's tokenizer. Its strong visual features provide a stable initialization that enables reliable training.
CoTracker (Karaev et al., 2023): The point tracking architecture used as VGGT's tracking head. VGGT's features dramatically improve CoTracker's performance on dynamic scenes.
3D Gaussian Splatting: VGGT can provide the camera poses and initial point clouds that 3DGS needs for optimization, replacing COLMAP as the initialization pipeline.
Feed-forward novel view synthesis: By finetuning with Plücker ray tokens for target views, VGGT achieves competitive novel view synthesis without knowing input camera parameters.
Dynamic scene understanding: VGGT's features, when used as a backbone for video trackers, improve dynamic point tracking performance, opening the door to understanding non-rigid scenes.
FutureMapping/Spatial AI: VGGT represents a step toward the "FutureMapping" vision where a single model replaces entire SLAM pipelines, directly predicting scene geometry from sensor observations.