Wang, Karaev, Rupprecht, Novotny — Oxford VGG + Meta AI, 2023

Visual Geometry Grounded Deep Structure From Motion

Make every SfM component differentiable — tracking, camera prediction, triangulation, bundle adjustment — and train the whole pipeline end-to-end. Simpler than COLMAP, and better on wide-baseline scenes.

Prerequisites: Epipolar geometry + Bundle adjustment basics + Transformers (attention)
10 Chapters · 4 Simulations

Chapter 0: The Problem

You have a handful of photos of a building — taken from different angles, at different times, maybe by different people. You want to figure out where each camera was when the photo was taken and build a 3D model of the building. This is Structure from Motion (SfM).

Classical SfM, epitomized by COLMAP, solves this in an incremental pipeline with many hand-engineered stages:

1. Detect Keypoints
Find distinctive image features (SIFT, SuperPoint) in every image independently.
2. Match Pairs
For every pair of images, match keypoints via nearest-neighbor search + geometric verification (RANSAC).
3. Chain into Tracks
Connect pairwise matches across images to form multi-view tracks. This step is purely hand-engineered.
4. Register Cameras
Initialize with a well-conditioned pair, then add cameras one-by-one via PnP + RANSAC.
5. Triangulate + BA
Triangulate 3D points, run bundle adjustment. Repeat steps 4-5 until all cameras are registered.

This works, spectacularly well in some cases. But it has a deep structural problem.

The core issue: Deep learning has improved individual SfM components (better keypoint matching, better feature descriptors), but the overall pipeline remains the same non-differentiable, incremental framework from the 2000s. Each component optimizes its own objective in isolation — keypoint matching doesn't know what bundle adjustment needs.
Simulation: Classical vs. Deep SfM. Classical SfM chains pairwise matches incrementally; VGGSfM tracks points across all frames simultaneously and recovers all cameras at once.

Why can't classical SfM benefit from end-to-end training?

Chapter 1: The Key Insight

VGGSfM's core idea is simple to state and hard to execute: make every component of SfM fully differentiable, then train the whole pipeline end-to-end.

The reconstruction function becomes a single differentiable function fθ:

fθ(I) = (P, X)

It takes a set of images I and outputs camera parameters P and a 3D point cloud X. Because fθ is differentiable, we can train it by minimizing a loss that compares the predicted cameras and points against ground truth:

θ* = arg minθ Σs L(fθ(Is), Ps*, Ts*, Xs*)

The pipeline decomposes into four seamless stages, each differentiable:

1. Deep Point Tracker T
Track 2D points across ALL frames simultaneously. No pairwise matching, no chaining. A single feed-forward network.
2. Camera Initializer TP
A Transformer that predicts all camera poses at once from image features + track features. No incremental registration.
3. Triangulator TX
A Transformer that predicts 3D point positions given cameras and tracks. Refines beyond simple DLT triangulation.
4. Differentiable BA
Levenberg-Marquardt optimization using the Theseus library. Gradients flow through the optimization via the implicit function theorem.
Why end-to-end matters: When components train in isolation, the tracker optimizes for tracking accuracy and the camera predictor optimizes for pose accuracy, but neither knows what the other needs. End-to-end training lets the tracker learn to produce tracks that make the camera predictor's job easier, and vice versa. The paper reports that this synergy yields +3.3 AUC points on CO3D (AUC@30) and +5.6 AUC points on IMC (AUC@10) over training the components separately.

Each of these four stages replaces a non-differentiable classical counterpart with a learned, differentiable alternative. The next four chapters examine each one in detail.

What is the primary benefit of making every SfM component differentiable?

Chapter 2: Deep Point Tracking

In classical SfM, correspondences are established in two steps: (1) match keypoints between image pairs, then (2) chain pairwise matches into multi-view tracks. Step 2 is hand-engineered and error-prone — one wrong pairwise match corrupts the entire track.
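
To make the fragility of step 2 concrete, here is a minimal sketch of track chaining with a union-find structure. It is an illustration only, not COLMAP's implementation; the function name `chain_matches` and the toy keypoint ids are hypothetical.

```python
from collections import defaultdict

def chain_matches(pairwise_matches):
    """Chain pairwise keypoint matches into multi-view tracks with union-find.

    pairwise_matches: iterable of ((image_a, kp_a), (image_b, kp_b)) tuples.
    Returns a sorted list of tracks, each a sorted list of (image, keypoint).
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in pairwise_matches:
        parent[find(a)] = find(b)  # merge the two observations' tracks

    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    return sorted(sorted(track) for track in groups.values())

# Two physical points, each observed in images 0, 1 and 2.
good = [((0, 10), (1, 11)), ((1, 11), (2, 12)),   # point A
        ((0, 20), (1, 21)), ((1, 21), (2, 22))]   # point B
tracks_clean = chain_matches(good)

# A single wrong match links point A in image 1 to point B in image 2.
bad = good + [((1, 11), (2, 22))]
tracks_corrupt = chain_matches(bad)

print(len(tracks_clean), "tracks without the bad match")
print(len(tracks_corrupt), "track with the bad match")
```

With the one bad match added, the two tracks collapse into a single track that contains two different keypoints per image, which is exactly the kind of corruption a direct multi-view tracker avoids.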

VGGSfM replaces both steps with a single deep point tracker that directly outputs multi-view tracks. Given NT query points in a reference frame, the tracker produces a track for each query — its location in every other frame, plus a visibility flag and confidence estimate.

Architecture: Coarse-to-Fine

The tracker follows a two-stage coarse-to-fine design inspired by CoTracker and PIPs:

  1. Coarse tracking: A 2D CNN extracts feature maps from all frames. Query descriptors are sampled from the reference frame's feature map. Each descriptor is correlated with all frames' feature maps at multiple resolutions, building a cost-volume pyramid. These cost volumes are flattened into tokens and fed to a Transformer that outputs coarse track positions.
  2. Fine tracking: Image patches are cropped around the coarse estimates and the tracking is repeated on these zoomed-in crops. This achieves sub-pixel accuracy — critical for SfM where even half-pixel errors propagate into large 3D errors.
Key difference from video trackers: CoTracker and PIPs assume temporal continuity — a point in frame t is near its position in frame t-1. VGGSfM's inputs are unordered, free-form images with no temporal relationship. So the tracker attends to all frames jointly rather than using a sliding window. It also predicts each track independently, allowing a much larger number of tracked points at test time.
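
The correlation step of coarse tracking can be sketched in a few lines of NumPy. This is a single-scale simplification under stated assumptions: random unit-norm features stand in for the CNN output, and a soft-argmax replaces the Transformer that the tracker actually uses to read the cost volumes. The names `correlate` and `soft_argmax` are illustrative.

```python
import numpy as np

def correlate(query_desc, feature_map):
    """Dot-product correlation of one query descriptor with every pixel.

    query_desc: (C,), feature_map: (H, W, C) -> cost map of shape (H, W).
    """
    return feature_map @ query_desc

def soft_argmax(cost, temperature=0.02):
    """Differentiable peak location: softmax-weighted mean of pixel coordinates."""
    h, w = cost.shape
    weights = np.exp((cost - cost.max()) / temperature)
    weights /= weights.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return np.array([(weights * xs).sum(), (weights * ys).sum()])  # (x, y)

rng = np.random.default_rng(0)
C, H, W = 16, 12, 20
feature_map = rng.normal(size=(H, W, C))
feature_map /= np.linalg.norm(feature_map, axis=-1, keepdims=True)

# Pretend the query descriptor matches the pixel at x=13, y=7 in this frame.
query = feature_map[7, 13].copy()
cost = correlate(query, feature_map)
peak = np.unravel_index(cost.argmax(), cost.shape)  # (row, col)
estimate = soft_argmax(cost)
print("argmax (row, col):", peak, "soft-argmax (x, y):", estimate)
```

In the full tracker this correlation is repeated at several resolutions to form the cost-volume pyramid, and again on zoomed-in crops in the fine stage.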

Tracking Confidence

Not all tracks are equally reliable. An occluded point or a textureless region produces unreliable tracks. VGGSfM predicts an aleatoric uncertainty σij for each track point yij. The confidence 1/σij tells downstream components how much to trust each correspondence.

During training, the standard L1 loss on track positions is replaced with a negative log-likelihood loss:

Ltrack = − Σi,j log N(yij* | yij, σij)

This forces the model to predict tight distributions (low σ) around accurate predictions, and wide distributions (high σ) around uncertain ones. The predicted uncertainty is anisotropic — a keypoint on a vertical stripe might have high uncertainty along the horizontal axis but low uncertainty along the vertical axis.
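
A small numerical sketch shows why this loss rewards calibrated uncertainty. It assumes an axis-aligned Gaussian (one σ per image axis), which is one simple way to obtain anisotropic uncertainty; the exact parameterization in the paper may differ.

```python
import numpy as np

def gaussian_nll(pred_xy, sigma_xy, target_xy):
    """Negative log-likelihood of targets under N(pred, diag(sigma^2)), summed over all entries."""
    residual = (target_xy - pred_xy) / sigma_xy
    per_axis = 0.5 * residual**2 + np.log(sigma_xy) + 0.5 * np.log(2 * np.pi)
    return per_axis.sum()

pred = np.array([100.0, 50.0])
target = np.array([102.0, 52.0])  # the prediction is off by 2 px on each axis

nll_overconfident = gaussian_nll(pred, np.array([0.5, 0.5]), target)
nll_calibrated = gaussian_nll(pred, np.array([2.0, 2.0]), target)
nll_too_wide = gaussian_nll(pred, np.array([20.0, 20.0]), target)

print(nll_overconfident, nll_calibrated, nll_too_wide)
```

The loss is lowest when σ matches the actual error: claiming 0.5 px of uncertainty for a 2 px error is punished heavily by the squared-residual term, while claiming 20 px is punished by the log σ term.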

Why tracks beat pairwise matches: Tracks directly encode multi-view consistency. The tracker locates each query point in every frame in a single pass, so there is no chain of pairwise matches for one bad match to break. The paper shows that feeding VGGSfM tracks to PixSfM (a classical pipeline) improves its accuracy, confirming the tracks' quality.
Why does VGGSfM's tracker use coarse-to-fine tracking instead of single-resolution tracking?

Chapter 3: Camera Prediction

In classical SfM, cameras are registered one at a time. You start with a good pair, solve the essential matrix, then add each new camera by solving the Perspective-n-Point (PnP) problem. This incremental process is fragile: if one camera is registered badly, the error cascades into all subsequent registrations.

VGGSfM replaces this with a deep Transformer that predicts all cameras simultaneously.

Architecture

The camera initializer TP takes two types of input tokens: image features (one token per image) and descriptors of the tracked points.

These are fused via cross-attention: the image features (queries) attend to the track descriptors (keys/values). This produces NI tokens — one per image — each enriched with correspondence information.
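
As a rough illustration of this fusion step, the sketch below implements single-head scaled dot-product cross-attention in NumPy. It omits the learned query/key/value projections and lets every image attend to every track descriptor, which are simplifications rather than the paper's exact design; the shapes and random inputs are made up.

```python
import numpy as np

def cross_attention(image_feats, track_descs):
    """Single-head cross-attention without learned projections.

    image_feats: (N_I, D) queries; track_descs: (N_T, D) keys and values.
    Returns the fused tokens (N_I, D) and the attention weights (N_I, N_T).
    """
    d = image_feats.shape[-1]
    scores = image_feats @ track_descs.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ track_descs, attn

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(3, 8))   # N_I = 3 images
track_descs = rng.normal(size=(5, 8))   # N_T = 5 track descriptors
fused, attn = cross_attention(image_feats, track_descs)
print(fused.shape, attn.sum(axis=-1))
```

The output keeps one token per image, matching the NI tokens described above.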

A preliminary camera estimate is obtained from the 8-point algorithm applied to track correspondences. Its 8-dimensional parameterization (quaternion + translation + log focal length) is embedded via harmonic positional encoding and concatenated with the cross-attention output.
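
A harmonic positional encoding of the 8-dimensional camera vector might look like the following sketch. The number of frequencies and their power-of-two spacing are assumptions borrowed from common practice (for example NeRF-style encodings); the source does not state them.

```python
import numpy as np

def harmonic_embedding(x, n_freqs=4):
    """Map each scalar in x to [sin(2^k x), cos(2^k x)] for k = 0..n_freqs-1."""
    x = np.asarray(x, dtype=float)
    freqs = 2.0 ** np.arange(n_freqs)        # 1, 2, 4, 8 (assumed spacing)
    angles = x[..., None] * freqs            # (..., 8, n_freqs)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*x.shape[:-1], -1)    # flatten the per-parameter features

# Identity rotation quaternion, zero translation, log focal length 0.
camera = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
embedded = harmonic_embedding(camera)
print(embedded.shape)
```

The embedding turns 8 raw numbers into a higher-dimensional vector that a Transformer can consume alongside the cross-attention output.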

The combined tokens pass through a Transformer trunk (8 self-attention layers, 512 hidden dimensions, 4 heads). The output is decoded into camera parameters via an MLP. This is applied iteratively 4 times, with each iteration refining the previous prediction.

Why all cameras at once? When a Transformer processes all cameras jointly via self-attention, each camera's prediction is informed by every other camera's prediction. Camera 5's pose constrains Camera 3's pose and vice versa. This mutual information sharing is impossible in incremental registration, where Camera 5 only benefits from cameras registered before it.

Camera Parameterization

Each camera is described by 8 parameters (a unit quaternion has only 3 degrees of freedom, so this is 7 degrees of freedom in total):

| Parameter | Dimensions | Description |
|---|---|---|
| Rotation quaternion | 4 | q(R) encodes the camera's orientation in SO(3) |
| Translation | 3 | t encodes the camera's position |
| Log focal length | 1 | ln(f) for numerical stability |

The principal point is fixed at the image center (standard practice). The full projection matrix is P = K R [I3 | t].
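
The sketch below assembles a projection matrix from the 8 parameters, following the formula P = K R [I3 | t] as written above. The quaternion ordering (w, x, y, z), square pixels, and the helper names are assumptions made for the example.

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a quaternion (w, x, y, z); the input is normalized first."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def projection_matrix(q, t, log_f, width, height):
    """P = K R [I3 | t] with the principal point fixed at the image center."""
    f = np.exp(log_f)
    K = np.array([[f, 0.0, width / 2], [0.0, f, height / 2], [0.0, 0.0, 1.0]])
    return K @ quat_to_rot(q) @ np.hstack([np.eye(3), np.reshape(t, (3, 1))])

def project(P, X):
    """Project a 3D point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

P = projection_matrix(q=(1, 0, 0, 0), t=(0, 0, 0), log_f=np.log(500.0),
                      width=512, height=512)
center_pixel = project(P, np.array([0.0, 0.0, 5.0]))
offset_pixel = project(P, np.array([1.0, 0.0, 5.0]))
print(center_pixel, offset_pixel)
```

A point on the optical axis lands on the image center, and a point 1 unit to the side at depth 5 lands f/5 = 100 pixels away from it.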

How does VGGSfM's camera predictor differ from classical incremental registration?

Chapter 4: Differentiable Triangulation

We have tracks (2D point locations across frames) and initial cameras. Now we need to recover the 3D points. This is triangulation: given a point's 2D projection in multiple cameras, find its 3D position.

The Geometry

Each camera defines a ray from the camera center through the observed 2D point. In the ideal case (no noise), all rays from the same 3D point intersect perfectly. With noise, they don't — and the goal is to find the 3D point that minimizes the distance to all rays.

The classical approach is Direct Linear Transform (DLT) triangulation: set up a system of linear equations from the projection constraints and solve via SVD. This gives a closed-form solution — fast, but not optimally accurate.
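
A minimal NumPy sketch of DLT triangulation follows. Each view contributes two linear equations in the homogeneous 3D point, and the solution is the right singular vector with the smallest singular value. The two toy cameras and the intrinsics are made up for the example.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Triangulate one 3D point from its 2D observations.

    projections: sequence of 3x4 camera matrices.
    points_2d: sequence of (u, v) pixel observations, one per camera.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])  # from u = (P[0] . X) / (P[2] . X)
        rows.append(v * P[2] - P[1])  # from v = (P[1] . X) / (P[2] . X)
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X_h = vt[-1]                      # null-space direction of the system
    return X_h[:3] / X_h[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

K = np.array([[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.3, -0.2, 4.0])
observations = [project(P1, X_true), project(P2, X_true)]
X_est = triangulate_dlt([P1, P2], observations)
print(X_est)
```

With noise-free observations the point is recovered exactly; with noisy observations DLT minimizes an algebraic error rather than the reprojection error, which is one reason a refinement step helps.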

VGGSfM's Learned Triangulator

VGGSfM first runs DLT to get a preliminary 3D point cloud X̄. Then it feeds this preliminary estimate into a Transformer TX for refinement. The input tokens are per-track descriptors dX(yij) that combine:

The Transformer processes these tokens and outputs refined 3D positions. Because the Transformer can attend across all points and all views, it can reason about global structure — "these five points form a planar surface" or "this point is inconsistent with its neighbors."

Why refine beyond DLT? DLT assumes all observations are equally reliable and finds the algebraically simplest solution. But some tracks are noisy (low confidence), and some cameras are inaccurately estimated. The learned triangulator can weight observations by their reliability and incorporate geometric priors that DLT ignores. The ablation shows replacing the learned triangulator with plain DLT drops AUC@10 from 73.92% to 69.42%.
Simulation: Interactive Triangulation. Multiple camera rays converge on a 3D point. Adding noise makes the rays stop intersecting perfectly, so the estimated 3D point drifts; more cameras give a more robust estimate.
Why does VGGSfM use a learned Transformer triangulator instead of plain DLT?

Chapter 5: Differentiable Bundle Adjustment

Bundle adjustment is the final, crucial refinement step. It jointly optimizes all camera parameters and all 3D points to minimize the total reprojection error:

LBA = Σi Σj vij · ||Pi(xj) − yij||

Here, Pi(xj) projects 3D point xj into camera i, and yij is the observed 2D position. The visibility flag vij masks out invisible points. The sum runs over all points and all cameras — this is why it's called "bundle" adjustment: we adjust all the bundles of rays simultaneously.
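
The objective can be evaluated directly from its definition. The sketch below is a plain NumPy rendering of LBA with made-up cameras and points; it only computes the loss and is not an optimizer.

```python
import numpy as np

def project_all(projections, points_3d):
    """Project every point into every camera: (N_I, 3, 4), (N_P, 3) -> (N_I, N_P, 2)."""
    X_h = np.concatenate([points_3d, np.ones((len(points_3d), 1))], axis=1)
    proj = np.einsum('irc,pc->ipr', projections, X_h)
    return proj[..., :2] / proj[..., 2:3]

def ba_loss(projections, points_3d, tracks, visibility):
    """Sum over cameras i and points j of v_ij * ||P_i(x_j) - y_ij||."""
    residual = np.linalg.norm(project_all(projections, points_3d) - tracks, axis=-1)
    return (visibility * residual).sum()

K = np.array([[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]])
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
projections = np.stack([P0, P1])
points_3d = np.array([[0.0, 0.0, 5.0], [1.0, 1.0, 4.0]])

tracks = project_all(projections, points_3d)   # perfect observations
visibility = np.ones((2, 2))
loss_perfect = ba_loss(projections, points_3d, tracks, visibility)

noisy = tracks.copy()
noisy[0, 1] += [3.0, 4.0]                      # shift one observation by 5 px
loss_noisy = ba_loss(projections, points_3d, noisy, visibility)

masked = visibility.copy()
masked[0, 1] = 0                               # mark that observation invisible
loss_masked = ba_loss(projections, points_3d, noisy, masked)
print(loss_perfect, loss_noisy, loss_masked)
```

The single perturbed observation contributes exactly its 5 px error, and the visibility flag removes it from the sum again.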

The Differentiability Challenge

Classical BA typically uses Ceres Solver, a C++ library that minimizes reprojection error via Levenberg-Marquardt (LM) optimization. This works beautifully for SfM, but Ceres runs outside the deep learning framework's autograd graph, so you can't backpropagate gradients through its solution.

To train the full VGGSfM pipeline end-to-end, gradients must flow through the BA layer — from the training loss back to the tracker parameters. This requires differentiating through an optimization process.

Solution: Implicit Differentiation via Theseus

VGGSfM uses the Theseus library (from Meta), which exploits the implicit function theorem to backpropagate through optimization. The idea:

  1. BA finds the solution (P*, X*) that satisfies the optimality condition: ∇LBA = 0.
  2. At this solution, the implicit function theorem tells us how the solution changes when the inputs (tracks, initial cameras) change — without unrolling the optimization steps.
  3. This gives us the Jacobian of the solution with respect to the inputs, which is all we need for backpropagation.
Why not just unroll? A naive approach would unroll every LM iteration and backpropagate through all of them. This is memory-expensive (you store all intermediate states) and can be numerically unstable (gradients through many iterations can explode or vanish). Implicit differentiation needs only the final solution: its gradients are exact when BA has converged to an optimum, and its memory cost does not grow with the number of LM steps taken.
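
The implicit function theorem argument can be checked on a one-dimensional toy problem. The inner objective f(x, θ) = (x − θ)² + 0.5·x⁴ is invented for illustration and has nothing to do with reprojection error; Newton's method stands in for LM, and this is not the Theseus API.

```python
def inner_grad(x, theta):
    """df/dx for f(x, theta) = (x - theta)^2 + 0.5 * x^4."""
    return 2.0 * (x - theta) + 2.0 * x**3

def inner_hess(x):
    """d2f/dx2 for the same objective."""
    return 2.0 + 6.0 * x**2

def solve(theta, iters=50):
    """Find x*(theta) = argmin_x f(x, theta) with Newton's method."""
    x = 0.0
    for _ in range(iters):
        x -= inner_grad(x, theta) / inner_hess(x)
    return x

def implicit_dx_dtheta(x_star):
    """Differentiate the optimality condition df/dx = 0 with respect to theta:
    f_xx * dx/dtheta + f_xtheta = 0, where f_xtheta = -2."""
    return 2.0 / inner_hess(x_star)

theta = 1.5
x_star = solve(theta)
implicit = implicit_dx_dtheta(x_star)

eps = 1e-6
finite_diff = (solve(theta + eps) - solve(theta - eps)) / (2 * eps)
print(x_star, implicit, finite_diff)
```

The implicit derivative uses only the converged solution, never the 50 intermediate Newton iterates, and it agrees with a finite-difference estimate obtained by re-solving the problem.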

Filtering

Before BA, VGGSfM filters out outlier correspondences based on:

This filtering is critical — a single bad correspondence can distort the entire reconstruction. At test time, VGGSfM runs the full pipeline (track → camera → triangulate → BA) multiple times, iterating until BA achieves sub-pixel reprojection error.

How does VGGSfM make bundle adjustment differentiable?

Chapter 6: Training

Training a fully differentiable SfM pipeline is tricky. The four components are deeply interdependent — the triangulator can't learn if the cameras are garbage, and the cameras can't learn if the tracks are garbage. VGGSfM handles this with a careful multi-stage training strategy.

Stage 1: Tracker Pre-training

The tracker T is first trained on the Kubric synthetic dataset, which provides perfect ground-truth tracks. This gives the tracker a solid foundation before being fine-tuned on real data (CO3D or MegaDepth).

Stage 2: Component-wise Training

With the tracker frozen, the camera initializer TP is trained on real data. Then, with both tracker and camera initializer frozen, the triangulator TX is trained. Each component builds on the quality of its predecessors.

Stage 3: End-to-End Joint Training

Finally, all components are unfrozen and trained jointly. This is where the magic happens — the synergy between components emerges as each adjusts to help the others.

Training Loss

The full loss has three groups of terms (3D points, cameras, and tracks):

L = Σj (|xj* − xj|ε + |xj* − x̂j|ε) + Σi (eP(Pi*, Pi) + eP(Pi*, P̂i)) − Σi,j log N(yij* | yij, σij)

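Let's unpack each piece. The first sum penalizes 3D point error against the ground truth xj*, for both the BA-refined points xj and the initial points x̂j, using the |·|ε penalty. The second sum applies the camera error eP to both the BA-refined cameras Pi and the initial cameras P̂i. The last term is the tracking negative log-likelihood from Chapter 2.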

Supervising intermediate outputs: The loss penalizes both the initial estimates (P̂, x̂) and the BA-refined estimates (P, x). This ensures that the camera initializer and triangulator produce reasonable outputs even before BA — BA shouldn't have to do all the work. It also provides gradient signal to early components that might otherwise suffer from vanishing gradients.

Training Details

| Detail | Value |
|---|---|
| Optimizer | AdamW, cyclic LR (30-epoch cycles) |
| Learning rate | 0.0001 (joint), 0.0005 (pre-training) |
| Hardware | 32 NVIDIA A100 (80GB) GPUs |
| Frames per batch | 3 to 30 (randomly sampled) |
| Query points (train) | 256 |
| BA steps (train) | 5 |
| Image resolution | 512 × 512 (zero-padded) |
Why does VGGSfM train in stages (tracker first, then camera predictor, then triangulator, then jointly)?

Chapter 7: Results

VGGSfM is evaluated on three benchmarks, each testing different conditions.

CO3D (Wide Baseline, Object-Centric)

Turntable-style videos of 51 categories. Wide baselines between test frames make this very challenging for classical SfM.

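| Method | Type | RRE@15° | RTE@15° | AUC@30° |
|---|---|---|---|---|
| COLMAP (SP+SG) | Incremental | 31.6 | 27.3 | 25.3 |
| PixSfM (SP+SG) | Incremental | 33.7 | 32.9 | 30.1 |
| PoseDiffusion | Deep | 80.5 | 79.8 | 66.5 |
| VGGSfM w/o Joint | Deep | 88.2 | 83.4 | 70.7 |
| VGGSfM | Deep | 92.1 | 88.3 | 74.0 |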

On CO3D, VGGSfM destroys classical methods. COLMAP achieves 25.3% AUC@30 and VGGSfM achieves 74.0%, a roughly 3x improvement. The wide baselines cripple pairwise matching, but VGGSfM's deep tracker handles them naturally.

IMC Phototourism (Narrow Baseline, Landmarks)

Famous landmarks photographed by tourists. Views overlap well — classical SfM's sweet spot.

| Method | AUC@3° | AUC@5° | AUC@10° |
|---|---|---|---|
| COLMAP (SIFT+NN) | 23.58 | 32.66 | 44.79 |
| PixSfM (SP+SG) | 45.19 | 57.22 | 70.47 |
| DFSfM (LoFTR) | 46.55 | 58.74 | 72.19 |
| PoseDiffusion | 12.31 | 23.17 | 36.82 |
| VGGSfM | 45.23 | 58.89 | 73.92 |

Even on IMC — where classical methods have decades of tuning — VGGSfM matches or beats them. It leads on AUC@5 and AUC@10, and is competitive on AUC@3.

ETH3D (3D Triangulation Quality)

Laser-scanned ground truth for evaluating 3D point quality. VGGSfM achieves the best accuracy AND completeness across all thresholds.

Simulation: Results Comparison. Camera pose accuracy (AUC) on CO3D and IMC. VGGSfM dramatically outperforms classical methods on wide-baseline CO3D and matches the best of them on narrow-baseline IMC.

The headline result: A single differentiable pipeline — simpler than COLMAP — achieves state-of-the-art on wide-baseline, narrow-baseline, and triangulation benchmarks. No single classical configuration works well across all three.
Where does VGGSfM show its largest improvement over classical SfM (COLMAP)?

Chapter 8: Ablations

The ablation studies reveal what matters most in VGGSfM's design. Each ablation removes or replaces one component, keeping everything else fixed.

End-to-End Training

The single most important design choice. Without joint training:

| Setting | CO3D AUC@30 | IMC AUC@10 |
|---|---|---|
| VGGSfM w/o Joint | 70.7 | 68.35 |
| VGGSfM (full) | 74.0 | 73.92 |
| Improvement | +3.3 | +5.6 |

End-to-end training consistently improves every metric. The synergy between components is real and substantial.

Coarse-to-Fine Tracking

Removing the fine tracker (using only coarse tracks) drops IMC AUC@10 from 73.92% to 62.30% — a devastating 11.6-point drop. Sub-pixel accuracy is not optional for SfM.

Camera Initializer vs. PoseDiffusion

Replacing VGGSfM's camera initializer with PoseDiffusion drops AUC@10 from 73.92% to 62.18%. The learned initializer is better than an off-the-shelf deep pose estimator.

Learned Triangulator vs. DLT

Replacing the learned triangulator with plain DLT drops AUC@10 from 73.92% to 69.42%. The learned refinement adds +4.5 points.

Tracks vs. Pairwise Matching

An interesting cross-evaluation: VGGSfM tracks fed to PixSfM (classical) achieve 70.62% AUC@10, slightly beating PixSfM with its own matches (70.47%). SuperPoint+SuperGlue matches fed to VGGSfM achieve only 68.78%. Both results confirm: (1) VGGSfM's tracks are excellent even outside its own pipeline, and (2) VGGSfM benefits from tracks trained jointly with it.

Ablation takeaway: Every component contributes. Going by the IMC AUC@10 figures quoted above, removing the fine tracker (−11.6) and swapping the camera initializer for PoseDiffusion (−11.7) cost the most, followed by dropping end-to-end training (−5.6) and replacing the learned triangulator with DLT (−4.5).
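
The size of each drop can be recomputed from the IMC AUC@10 figures quoted in this chapter. The short script below does only that arithmetic; the dictionary labels are paraphrases, and the numbers are copied from the text above.

```python
full = 73.92  # VGGSfM (full), IMC AUC@10

ablations = {
    "coarse tracking only (no fine tracker)": 62.30,
    "PoseDiffusion as camera initializer": 62.18,
    "no joint training": 68.35,
    "DLT instead of learned triangulator": 69.42,
}

# Drop in AUC@10 points relative to the full model, largest first.
drops = {name: round(full - score, 2) for name, score in ablations.items()}
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)

for name, drop in ranked:
    print(f"{name}: -{drop:.2f}")
```

On these figures the fine tracker and the camera initializer swap are within about 0.1 points of each other, so the two largest effects are essentially tied.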
Which ablation causes the largest performance drop on the IMC dataset?

Chapter 9: Connections

VGGSfM sits at the intersection of several research threads. Let's map where it fits.

Relation to COLMAP

COLMAP is the gold standard of incremental SfM. VGGSfM replaces every component of COLMAP with a learned, differentiable alternative. On narrow-baseline scenes (IMC), VGGSfM clearly beats COLMAP with SIFT+NN matching (73.92 vs. 44.79 AUC@10) and matches the strongest classical pipelines built on learned matchers (PixSfM, DFSfM). On wide-baseline scenes (CO3D), it dramatically surpasses COLMAP. COLMAP still has the advantage of scaling to thousands of images, whereas VGGSfM currently handles up to ~30 frames.

Relation to PoseDiffusion

PoseDiffusion uses a diffusion model to predict camera poses. It's one of the first deep methods to handle many cameras simultaneously. VGGSfM's camera initializer achieves better accuracy by incorporating track features and geometric priors (8-point algorithm initialization), and by being integrated into a full SfM pipeline that provides 3D structure.

Relation to CoTracker

CoTracker introduced joint point tracking in videos. VGGSfM adapts this idea for unordered images: it removes the temporal sliding window (since images aren't sequential) and adds sub-pixel coarse-to-fine refinement (since SfM needs higher accuracy than video tracking).

Relation to Classical SLAM

Real-time SLAM systems (the classical ORB-SLAM, the learned DROID-SLAM) solve a related problem but emphasize speed and sequential processing. DROID-SLAM notably uses a differentiable BA layer similar to VGGSfM. VGGSfM focuses on offline, high-accuracy reconstruction from unordered images.

Relation to Rooms from Motion / DUSt3R

These concurrent works also pursue end-to-end 3D reconstruction from images. DUSt3R directly predicts pointmaps (3D coordinates per pixel) without explicit camera estimation. VGGSfM retains the classical SfM decomposition (cameras + points) but makes it fully differentiable. Both approaches validate the trend toward learned 3D reconstruction.

Cheat Sheet

| Aspect | VGGSfM |
|---|---|
| Input | 3-30 unordered images of a scene |
| Output | Camera poses + intrinsics + 3D point cloud |
| Tracker | Coarse-to-fine deep tracker (all frames jointly) |
| Camera predictor | Transformer, all cameras at once |
| Triangulation | DLT + learned Transformer refinement |
| Bundle adjustment | Differentiable LM via Theseus (implicit diff) |
| Training | Multi-stage, then end-to-end joint |
| Key result | 3x COLMAP on CO3D, matches PixSfM on IMC |
| Limitation | ≤30 frames (can't scale to thousands like COLMAP) |
The broader lesson: Long-standing pipelines with many hand-engineered stages can often be simplified and improved by making them fully differentiable. The components don't need to be individually optimal — they need to work well together. End-to-end training discovers this synergy automatically.
Simulation: VGGSfM Architecture Flow. The full differentiable pipeline: images flow through the tracker, camera predictor, triangulator, and bundle adjustment. Gradients flow backward through the entire chain during training.

What is VGGSfM's primary limitation compared to classical SfM (COLMAP)?