Make every SfM component differentiable — tracking, camera prediction, triangulation, bundle adjustment — and train the whole pipeline end-to-end. Simpler than COLMAP, and better on wide-baseline scenes.
You have a handful of photos of a building — taken from different angles, at different times, maybe by different people. You want to figure out where each camera was when the photo was taken and build a 3D model of the building. This is Structure from Motion (SfM).
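Classical SfM, epitomized by COLMAP, solves this with an incremental pipeline of hand-engineered stages: keypoint detection and pairwise matching, chaining matches into multi-view tracks, one-at-a-time camera registration, triangulation, and bundle adjustment.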
This works — spectacularly well in some cases. But it has deep structural problems:
Classical SfM chains pairwise matches incrementally. VGGSfM tracks points across all frames simultaneously and recovers all cameras at once.
VGGSfM's core idea is simple to state and hard to execute: make every component of SfM fully differentiable, then train the whole pipeline end-to-end.
The reconstruction function becomes a single differentiable function fθ:

$$
f_\theta(\mathcal{I}) = (\mathcal{P}, X)
$$
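It takes a set of images I and outputs camera parameters P and a 3D point cloud X. Because fθ is differentiable, we can train it by minimizing a loss that compares the predicted cameras and points against ground truth:

$$
\theta^\star = \arg\min_\theta \; \mathcal{L}\big(f_\theta(\mathcal{I}),\, \mathcal{P}^\star,\, X^\star\big)
$$

where P* and X* denote the ground-truth cameras and points.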
The pipeline decomposes into four stages, each differentiable: point tracking, camera initialization, triangulation, and bundle adjustment.
Each of these four stages replaces a non-differentiable classical counterpart with a learned, differentiable alternative. The next four chapters examine each one in detail.
In classical SfM, correspondences are established in two steps: (1) match keypoints between image pairs, then (2) chain pairwise matches into multi-view tracks. Step 2 is hand-engineered and error-prone — one wrong pairwise match corrupts the entire track.
VGGSfM replaces both steps with a single deep point tracker that directly outputs multi-view tracks. Given NT query points in a reference frame, the tracker produces a track for each query — its location in every other frame, plus a visibility flag and confidence estimate.
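The tracker follows a two-stage coarse-to-fine design inspired by CoTracker and PIPs: a coarse tracker first localizes each query point in every frame, and a fine tracker then refines those estimates to sub-pixel accuracy.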
Not all tracks are equally reliable. An occluded point or a textureless region produces unreliable tracks. VGGSfM predicts an aleatoric uncertainty σij for each track point yij. The confidence 1/σij tells downstream components how much to trust each correspondence.
During training, the standard L1 loss on track positions is replaced with a negative log-likelihood loss:
This encourages the model to predict tight distributions (low σ) around accurate predictions, and wide distributions (high σ) around uncertain ones. The predicted uncertainty is anisotropic: a keypoint on a vertical stripe can slide along the stripe without changing its appearance, so it might have high uncertainty along the vertical axis but low uncertainty along the horizontal axis.
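To make the effect of such a loss concrete, here is a minimal NumPy sketch of a per-axis Gaussian negative log-likelihood. The Gaussian form, the diagonal (per-axis) σ, and the function name `gaussian_nll` are assumptions for illustration and may differ from the exact loss used in VGGSfM.

```python
import numpy as np

def gaussian_nll(pred, target, sigma):
    """Per-axis Gaussian negative log-likelihood, averaged over track points.

    pred, target, sigma: arrays of shape (..., 2) with x/y coordinates and the
    predicted per-axis standard deviation (assumed positive).
    """
    var = sigma ** 2
    nll = 0.5 * (np.log(2.0 * np.pi * var) + (target - pred) ** 2 / var)
    return float(nll.sum(axis=-1).mean())

target = np.array([[10.0, 20.0]])
accurate = np.array([[10.1, 20.1]])   # close to the ground truth
wrong = np.array([[14.0, 20.0]])      # 4 px off along x only

tight = np.full((1, 2), 0.2)          # confident on both axes
wide = np.full((1, 2), 4.0)           # uncertain on both axes
anisotropic = np.array([[4.0, 0.2]])  # uncertain along x, confident along y

nll_accurate_tight = gaussian_nll(accurate, target, tight)
nll_accurate_wide = gaussian_nll(accurate, target, wide)
nll_wrong_tight = gaussian_nll(wrong, target, tight)
nll_wrong_aniso = gaussian_nll(wrong, target, anisotropic)
```

An accurate prediction is rewarded for a small σ, while a prediction that is off along one axis is penalized far less if σ is large along that axis only.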
In classical SfM, cameras are registered one at a time. You start with a good pair, solve the essential matrix, then add each new camera by solving the Perspective-n-Point (PnP) problem. This incremental process is fragile: if one camera is registered badly, the error cascades into all subsequent registrations.
VGGSfM replaces this with a deep Transformer that predicts all cameras simultaneously.
The camera initializer TP takes two types of input tokens: image features and track descriptors.
These are fused via cross-attention: the image features (queries) attend to the track descriptors (keys/values). This produces NI tokens — one per image — each enriched with correspondence information.
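The fusion step can be illustrated with a single-head cross-attention in NumPy. This is a sketch only: the feature dimension, the flat pool of track descriptors, and the random projection weights are all assumptions, not the architecture's actual settings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, track_descs, w_q, w_k, w_v):
    """Image features act as queries; track descriptors act as keys and values."""
    q = image_feats @ w_q                      # (N_I, d)
    k = track_descs @ w_k                      # (M, d)
    v = track_descs @ w_v                      # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N_I, M) scaled dot products
    weights = softmax(scores, axis=-1)         # each image distributes attention over descriptors
    return weights @ v, weights                # (N_I, d) fused tokens

rng = np.random.default_rng(0)
n_images, n_descs, dim = 5, 40, 16             # hypothetical sizes
image_feats = rng.normal(size=(n_images, dim))
track_descs = rng.normal(size=(n_descs, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

tokens, weights = cross_attention(image_feats, track_descs, w_q, w_k, w_v)
```

The output has one row per image, matching the "one token per image" description above.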
A preliminary camera estimate is obtained from the 8-point algorithm applied to track correspondences. Its 8-dimensional parameterization (quaternion + translation + log focal length) is embedded via harmonic positional encoding and concatenated with the cross-attention output.
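A harmonic (sinusoidal) embedding of the 8-dimensional camera vector might look like the sketch below. The number of frequencies, their spacing, and the quaternion ordering (w, x, y, z) are assumptions chosen for illustration.

```python
import numpy as np

def harmonic_embedding(x, n_freqs=4):
    """Map each input value to sin/cos features at geometrically spaced frequencies."""
    freqs = 2.0 ** np.arange(n_freqs)             # 1, 2, 4, 8
    angles = x[..., None] * freqs                 # (..., D, n_freqs)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*x.shape[:-1], -1)         # (..., D * 2 * n_freqs)

camera = np.array([1.0, 0.0, 0.0, 0.0,    # identity rotation as a quaternion
                   0.1, -0.2, 1.5,        # translation
                   np.log(500.0)])        # log focal length
embedded = harmonic_embedding(camera)     # 8 values become 64 features
```

The embedding lifts each low-dimensional parameter into a higher-dimensional vector, which is the form in which it can be concatenated with other tokens.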
The combined tokens pass through a Transformer trunk (8 self-attention layers, 512 hidden dimensions, 4 heads). The output is decoded into camera parameters via an MLP. This is applied iteratively 4 times, with each iteration refining the previous prediction.
Each camera is described by 8 parameters:
| Parameter | Dimensions | Description |
|---|---|---|
| Rotation quaternion | 4 | q(R) encodes the camera's orientation in SO(3) |
| Translation | 3 | t encodes the camera's position |
| Log focal length | 1 | ln(f) for numerical stability |
The principal point is fixed at the image center (standard practice). The full projection matrix is P = K R [I3 | t].
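As an illustration of this parameterization, the sketch below assembles P = K R [I3 | t] from the 8 parameters and projects a point. The quaternion ordering (w, x, y, z) and the helper names are assumptions; the 512 × 512 image size follows the training resolution quoted later.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)   # normalize so the result is a valid rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def projection_matrix(params, image_size=(512, 512)):
    """Build P = K R [I3 | t] from (quaternion, translation, log focal length)."""
    q, t, log_f = params[:4], params[4:7], params[7]
    f = np.exp(log_f)                                    # undo the log parameterization
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0    # principal point at the image center
    K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])
    return K @ quat_to_rotmat(q) @ np.hstack([np.eye(3), t[:, None]])

def project(P, X):
    """Project a 3D point and dehomogenize to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

params = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.log(500.0)])
P = projection_matrix(params)
center_pixel = project(P, np.array([0.0, 0.0, 2.0]))   # point on the optical axis
side_pixel = project(P, np.array([1.0, 0.0, 2.0]))     # one unit to the right
```

With an identity rotation and zero translation, a point on the optical axis lands on the principal point, which is a quick sanity check for the convention.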
We have tracks (2D point locations across frames) and initial cameras. Now we need to recover the 3D points. This is triangulation: given a point's 2D projection in multiple cameras, find its 3D position.
Each camera defines a ray from the camera center through the observed 2D point. In the ideal case (no noise), all rays from the same 3D point intersect perfectly. With noise, they don't — and the goal is to find the 3D point that minimizes the distance to all rays.
The classical approach is Direct Linear Transform (DLT) triangulation: set up a system of linear equations from the projection constraints and solve via SVD. This gives a closed-form solution — fast, but not optimally accurate.
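A minimal NumPy version of DLT triangulation is sketched below. The two-camera setup, the intrinsics, and the function names are made up for the example.

```python
import numpy as np

def project(P, X):
    """Project a 3D point with a 3x4 matrix and dehomogenize to pixels."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate_dlt(projections, points_2d):
    """Triangulate one 3D point from two or more views via SVD."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                     # (2 * n_views, 4)
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]                           # singular vector of the smallest singular value
    return X_h[:3] / X_h[3]

# Hypothetical two-camera setup with a baseline of one unit along x.
K = np.array([[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.3, -0.2, 4.0])
obs = [project(P1, X_true), project(P2, X_true)]
X_clean = triangulate_dlt([P1, P2], obs)

# Perturb the observations by about one pixel and triangulate again.
rng = np.random.default_rng(0)
noisy_obs = [o + rng.normal(scale=1.0, size=2) for o in obs]
X_noisy = triangulate_dlt([P1, P2], noisy_obs)
```

With exact observations the linear system has an exact null vector and the point is recovered; with pixel noise the solution minimizes an algebraic error rather than the reprojection error, which is why DLT is fast but not optimally accurate.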
VGGSfM first runs DLT to get a preliminary 3D point cloud X̄. Then it feeds this preliminary estimate into a Transformer TX for refinement. The input tokens are per-track descriptors dX(yij) that combine:
The Transformer processes these tokens and outputs refined 3D positions. Because the Transformer can attend across all points and all views, it can reason about global structure — "these five points form a planar surface" or "this point is inconsistent with its neighbors."
Multiple camera rays converge on a 3D point. With noise, the rays no longer intersect perfectly and the estimated 3D point drifts. More cameras give a more robust estimate.
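Bundle adjustment is the final, crucial refinement step. It jointly optimizes all camera parameters and all 3D points to minimize the total reprojection error, which in its standard least-squares form reads:

$$
\min_{\{P_i\},\,\{x_j\}} \; \sum_{i} \sum_{j} v_{ij} \,\lVert P_i(x_j) - y_{ij} \rVert^2
$$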
Here, Pi(xj) projects 3D point xj into camera i, and yij is the observed 2D position. The visibility flag vij masks out invisible points. The sum runs over all points and all cameras — this is why it's called "bundle" adjustment: we adjust all the bundles of rays simultaneously.
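The masked sum described above can be evaluated in a few lines of NumPy. The squared pixel distance and the array layout (cameras first, points second) are assumptions for this sketch.

```python
import numpy as np

def project_all(projections, points_3d):
    """Project every point into every camera. Returns pixels of shape (N_I, N_P, 2)."""
    homog = np.concatenate([points_3d, np.ones((len(points_3d), 1))], axis=1)  # (N_P, 4)
    proj = np.einsum("iab,jb->ija", projections, homog)                        # (N_I, N_P, 3)
    return proj[..., :2] / proj[..., 2:3]

def reprojection_error(projections, points_3d, observations, visibility):
    """Sum of squared pixel residuals over visible (camera, point) pairs."""
    residuals = project_all(projections, points_3d) - observations
    sq_dist = np.sum(residuals ** 2, axis=-1)        # (N_I, N_P)
    return float(np.sum(visibility * sq_dist))       # invisible entries contribute zero

K = np.array([[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]])
projections = np.stack([
    K @ np.hstack([np.eye(3), np.zeros((3, 1))]),
    K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])]),
])
points_3d = np.array([[0.0, 0.0, 4.0], [0.5, -0.3, 5.0], [-0.4, 0.2, 3.0]])
observations = project_all(projections, points_3d)   # perfect observations
visibility = np.ones((2, 3), dtype=bool)

err_clean = reprojection_error(projections, points_3d, observations, visibility)

noisy = observations.copy()
noisy[1, 2] += np.array([3.0, 4.0])                  # one observation is 5 px off
err_noisy = reprojection_error(projections, points_3d, noisy, visibility)

masked = visibility.copy()
masked[1, 2] = False                                 # mark that observation as invisible
err_masked = reprojection_error(projections, points_3d, noisy, masked)
```

Masking the corrupted observation removes its contribution entirely, which is the role the visibility flag plays in the objective.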
Classical pipelines such as COLMAP run BA with the Ceres solver, a C++ library that minimizes reprojection error via Levenberg-Marquardt (LM) optimization. This works beautifully for SfM, but Ceres sits outside the deep-learning autograd graph, so you can't backpropagate gradients through it.
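For intuition about what an LM solver does, here is a toy NumPy sketch that refines a single 3D point against two fixed cameras. It is not the Ceres API; the finite-difference Jacobian, the damping schedule, and the camera setup are all simplifying assumptions.

```python
import numpy as np

def levenberg_marquardt(residual_fn, x0, n_iters=50, lam=1e-3, eps=1e-6):
    """Minimize 0.5 * ||residual_fn(x)||^2 with a basic damped Gauss-Newton loop."""
    x = np.asarray(x0, dtype=float)
    cost = 0.5 * np.sum(residual_fn(x) ** 2)
    for _ in range(n_iters):
        r = residual_fn(x)
        # Central-difference Jacobian, one column per parameter.
        J = np.stack([(residual_fn(x + eps * e) - residual_fn(x - eps * e)) / (2 * eps)
                      for e in np.eye(len(x))], axis=1)
        # Damped normal equations: (J^T J + lam * I) delta = -J^T r
        delta = np.linalg.solve(J.T @ J + lam * np.eye(len(x)), -J.T @ r)
        new_cost = 0.5 * np.sum(residual_fn(x + delta) ** 2)
        if new_cost < cost:      # accept the step and reduce the damping
            x, cost, lam = x + delta, new_cost, lam * 0.5
        else:                    # reject the step and increase the damping
            lam *= 10.0
    return x, cost

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

K = np.array([[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]])
cams = [K @ np.hstack([np.eye(3), np.zeros((3, 1))]),
        K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])]
X_true = np.array([0.3, -0.2, 4.0])
observations = np.concatenate([project(P, X_true) for P in cams])

def residuals(X):
    """Stacked pixel residuals of one 3D point in both cameras."""
    return np.concatenate([project(P, X) for P in cams]) - observations

X_est, final_cost = levenberg_marquardt(residuals, x0=[0.5, 0.0, 5.0])
```

Full bundle adjustment applies the same kind of update to all cameras and all points at once, with a much larger and sparse Jacobian.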
To train the full VGGSfM pipeline end-to-end, gradients must flow through the BA layer — from the training loss back to the tracker parameters. This requires differentiating through an optimization process.
VGGSfM uses the Theseus library (from Meta), which exploits the implicit function theorem to backpropagate through optimization. The idea:
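As a rough sketch of the general technique (not a description of Theseus internals), write the BA solution as the minimizer of an energy E over the variables z (cameras and points), where E also depends on upstream quantities θ such as the predicted tracks. The symbols z, θ, and E are introduced here for illustration. At the optimum the gradient vanishes:

$$
z^\star(\theta) = \arg\min_z E(z, \theta)
\quad\Longrightarrow\quad
\nabla_z E\big(z^\star(\theta), \theta\big) = 0
$$

Differentiating this stationarity condition with respect to θ and solving for the derivative of the solution gives

$$
\frac{\partial z^\star}{\partial \theta}
= -\left[\nabla^2_{zz} E\right]^{-1} \nabla^2_{z\theta} E
$$

so the gradient depends only on the converged solution, not on the individual optimizer iterations that produced it.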
Before BA, VGGSfM filters out outlier correspondences based on:
This filtering is critical — a single bad correspondence can distort the entire reconstruction. At test time, VGGSfM runs the full pipeline (track → camera → triangulate → BA) multiple times, iterating until BA achieves sub-pixel reprojection error.
Training a fully differentiable SfM pipeline is tricky. The four components are deeply interdependent — the triangulator can't learn if the cameras are garbage, and the cameras can't learn if the tracks are garbage. VGGSfM handles this with a careful multi-stage training strategy.
The tracker T is first trained on the Kubric synthetic dataset, which provides perfect ground-truth tracks. This gives the tracker a solid foundation before being fine-tuned on real data (CO3D or MegaDepth).
With the tracker frozen, the camera initializer TP is trained on real data. Then, with both tracker and camera initializer frozen, the triangulator TX is trained. Each component builds on the quality of its predecessors.
Finally, all components are unfrozen and trained jointly. Each component now receives gradients from the stages downstream of it, so it can adjust to help the others.
The full loss has four terms:
Let's unpack each piece:
| Detail | Value |
|---|---|
| Optimizer | AdamW, cyclic LR (30-epoch cycles) |
| Learning rate | 0.0001 (joint), 0.0005 (pre-training) |
| Hardware | 32 NVIDIA A100 (80GB) GPUs |
| Frames per batch | 3 to 30 (randomly sampled) |
| Query points (train) | 256 |
| BA steps (train) | 5 |
| Image resolution | 512 × 512 (zero-padded) |
VGGSfM is evaluated on three benchmarks, each testing different conditions.
CO3D: turntable-style videos of objects from 51 categories. Wide baselines between test frames make this very challenging for classical SfM.
| Method | Type | RRE@15° | RTE@15° | AUC@30° |
|---|---|---|---|---|
| COLMAP (SP+SG) | Incremental | 31.6 | 27.3 | 25.3 |
| PixSfM (SP+SG) | Incremental | 33.7 | 32.9 | 30.1 |
| PoseDiffusion | Deep | 80.5 | 79.8 | 66.5 |
| VGGSfM w/o Joint | Deep | 88.2 | 83.4 | 70.7 |
| VGGSfM | Deep | 92.1 | 88.3 | 74.0 |
On CO3D, VGGSfM far outperforms classical methods. COLMAP achieves 25.3% AUC@30 and VGGSfM achieves 74.0%, roughly a 2.9x improvement. The wide baselines cripple pairwise matching, but VGGSfM's deep tracker handles them naturally.
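For readers unfamiliar with these metrics, the sketch below computes a relative rotation error in degrees, the fraction of errors under a threshold (as in RRE@15°), and a simple AUC. The AUC definition used here, the mean accuracy over integer thresholds from 1° up to the cutoff, is an assumption and may differ from the benchmark's exact protocol.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Angle of the relative rotation between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def accuracy_at(errors_deg, threshold_deg):
    """Fraction of errors strictly below the threshold."""
    return float(np.mean(np.asarray(errors_deg) < threshold_deg))

def auc_at(errors_deg, max_threshold_deg):
    """Mean accuracy over integer thresholds 1..max_threshold_deg (assumed definition)."""
    thresholds = np.arange(1, max_threshold_deg + 1)
    return float(np.mean([accuracy_at(errors_deg, t) for t in thresholds]))

def rot_z(deg):
    """Rotation about the z axis by the given angle in degrees."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

err = rotation_error_deg(rot_z(10.0), np.eye(3))

errors = [2.0, 10.0, 20.0, 40.0]      # hypothetical per-pair errors in degrees
acc15 = accuracy_at(errors, 15)
auc30 = auc_at(errors, 30)
```

Unlike accuracy at a single threshold, the AUC rewards small errors more than errors that barely pass the cutoff.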
IMC: famous landmarks photographed by tourists. Views overlap well, which is classical SfM's sweet spot.
| Method | AUC@3° | AUC@5° | AUC@10° |
|---|---|---|---|
| COLMAP (SIFT+NN) | 23.58 | 32.66 | 44.79 |
| PixSfM (SP+SG) | 45.19 | 57.22 | 70.47 |
| DFSfM (LoFTR) | 46.55 | 58.74 | 72.19 |
| PoseDiffusion | 12.31 | 23.17 | 36.82 |
| VGGSfM | 45.23 | 58.89 | 73.92 |
Even on IMC — where classical methods have decades of tuning — VGGSfM matches or beats them. It leads on AUC@5 and AUC@10, and is competitive on AUC@3.
Laser-scanned ground truth for evaluating 3D point quality. VGGSfM achieves the best accuracy AND completeness across all thresholds.
Camera pose accuracy (AUC) on CO3D and IMC. VGGSfM dramatically outperforms classical methods on wide-baseline CO3D and matches them on narrow-baseline IMC.
The ablation studies reveal what matters most in VGGSfM's design. Each ablation removes or replaces one component, keeping everything else fixed.
Joint training is the single most important design choice. Compare the pipeline with and without it:
| Setting | CO3D AUC@30 | IMC AUC@10 |
|---|---|---|
| VGGSfM w/o Joint | 70.7 | 68.35 |
| VGGSfM (full) | 74.0 | 73.92 |
| Improvement | +3.3 | +5.6 |
End-to-end training consistently improves every metric. The synergy between components is real and substantial.
Removing the fine tracker (using only coarse tracks) drops IMC AUC@10 from 73.92% to 62.30% — a devastating 11.6-point drop. Sub-pixel accuracy is not optional for SfM.
Replacing VGGSfM's camera initializer with PoseDiffusion drops IMC AUC@10 from 73.92% to 62.18%. The learned initializer is better than an off-the-shelf deep pose estimator.
Replacing the learned triangulator with plain DLT drops IMC AUC@10 from 73.92% to 69.42%. The learned refinement adds 4.5 points.
An interesting cross-evaluation: VGGSfM tracks fed to PixSfM (classical) achieve 70.62% AUC@10, slightly beating PixSfM with its own matches (70.47%). SuperPoint+SuperGlue matches fed to VGGSfM achieve only 68.78%. Both results confirm: (1) VGGSfM's tracks are excellent even outside its own pipeline, and (2) VGGSfM benefits from tracks trained jointly with it.
VGGSfM sits at the intersection of several research threads. Let's map where it fits.
COLMAP is the gold standard of incremental SfM. VGGSfM replaces every component of COLMAP with a learned, differentiable alternative. On narrow-baseline scenes (IMC), VGGSfM matches the strongest baselines (PixSfM, DFSfM) and clearly beats COLMAP with SIFT matching (73.92% vs. 44.79% AUC@10). On wide-baseline scenes (CO3D), it dramatically surpasses COLMAP. COLMAP still has the advantage of scaling to thousands of images, whereas VGGSfM currently handles up to ~30 frames.
PoseDiffusion uses a diffusion model to predict camera poses. It's one of the first deep methods to handle many cameras simultaneously. VGGSfM's camera initializer achieves better accuracy by incorporating track features and geometric priors (8-point algorithm initialization), and by being integrated into a full SfM pipeline that provides 3D structure.
CoTracker introduced joint point tracking in videos. VGGSfM adapts this idea for unordered images: it removes the temporal sliding window (since images aren't sequential) and adds sub-pixel coarse-to-fine refinement (since SfM needs higher accuracy than video tracking).
Real-time SLAM systems (ORB-SLAM, DROID-SLAM) solve a related problem but emphasize speed and sequential processing. DROID-SLAM notably uses a differentiable BA layer similar to VGGSfM. VGGSfM focuses on offline, high-accuracy reconstruction from unordered images.
Concurrent works such as DUSt3R also pursue end-to-end 3D reconstruction from images. DUSt3R directly predicts pointmaps (3D coordinates per pixel) without explicit camera estimation. VGGSfM retains the classical SfM decomposition (cameras + points) but makes it fully differentiable. Both approaches validate the trend toward learned 3D reconstruction.
| Aspect | VGGSfM |
|---|---|
| Input | 3-30 unordered images of a scene |
| Output | Camera poses + intrinsics + 3D point cloud |
| Tracker | Coarse-to-fine deep tracker (all frames jointly) |
| Camera predictor | Transformer, all cameras at once |
| Triangulation | DLT + learned Transformer refinement |
| Bundle adjustment | Differentiable LM via Theseus (implicit diff) |
| Training | Multi-stage, then end-to-end joint |
| Key result | 3x COLMAP on CO3D, matches PixSfM on IMC |
| Limitation | ≤30 frames (can't scale to thousands like COLMAP) |
The full differentiable pipeline: images flow through the tracker, camera predictor, triangulator, and bundle adjustment. Gradients flow backward through the entire chain during training.