VGGSfM — Veanors

Chapter 0: The Problem

You have a handful of photos of a building — taken from different angles, at different times, maybe by different people. You want to figure out where each camera was when the photo was taken and build a 3D model of the building. This is Structure from Motion (SfM).

Classical SfM, epitomized by COLMAP, solves this in an incremental pipeline with many hand-engineered stages:

1. Detect Keypoints

Find distinctive image features (SIFT, SuperPoint) in every image independently.

↓

2. Match Pairs

For every pair of images, match keypoints via nearest-neighbor search + geometric verification (RANSAC).

↓

3. Chain into Tracks

Connect pairwise matches across images to form multi-view tracks. This step is purely hand-engineered.

↓

4. Register Cameras

Initialize with a well-conditioned pair, then add cameras one-by-one via PnP + RANSAC.

↓

5. Triangulate + BA

Triangulate 3D points, run bundle adjustment. Repeat steps 4-5 until all cameras are registered.

This works — spectacularly well in some cases. But it has deep structural problems:

Non-differentiable. RANSAC, incremental registration, Ceres-based BA — none of these have gradients. You cannot train the pipeline end-to-end on data.
Brittle chaining. Tracks are built by linking pairwise matches. One wrong match fuses two unrelated points into a single track, and that error is carried into registration and triangulation.
Incremental fragility. The sequential camera registration process can fail if the initial pair is poorly chosen, or if there aren't enough matches to register a new camera. Order matters.
Wide baseline failure. When images are taken far apart (wide baseline), pairwise matching struggles. COLMAP's accuracy degrades sharply.
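To make the "brittle chaining" point concrete, here is a minimal sketch of how pairwise matches are commonly chained into tracks with a union-find structure. It is an illustration only, not COLMAP's actual code; the function name `build_tracks` and the keypoint ids are made up for the example.

```python
def build_tracks(matches):
    """Chain pairwise matches into tracks with union-find.

    matches: iterable of ((img_a, kp_a), (img_b, kp_b)) pairs.
    Returns a sorted list of tracks, each a sorted list of (img, kp) observations.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(g) for g in groups.values())


# Two physical points, each seen in images 0, 1, 2 (keypoint ids are arbitrary).
good = [((0, 10), (1, 11)), ((1, 11), (2, 12)),   # point A
        ((0, 20), (1, 21)), ((1, 21), (2, 22))]   # point B
clean_tracks = build_tracks(good)

# One wrong match links point A in image 1 to point B in image 2.
bad_tracks = build_tracks(good + [((1, 11), (2, 22))])
```

With only correct matches the result is two tracks of three observations each. Adding the single wrong match merges them into one six-observation track that mixes two different 3D points.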

The core issue: Deep learning has improved individual SfM components (better keypoint matching, better feature descriptors), but the overall pipeline remains the same non-differentiable, incremental framework from the 2000s. Each component optimizes its own objective in isolation — keypoint matching doesn't know what bundle adjustment needs.

Classical vs. Deep SfM

Classical SfM chains pairwise matches incrementally. VGGSfM tracks points across ALL frames simultaneously and recovers all cameras at once.

Why can't classical SfM benefit from end-to-end training?

Because key components (RANSAC, incremental registration, Ceres BA) are non-differentiable, so there are no gradients to backpropagate through the full pipeline
Because there isn't enough training data for SfM
Because SfM is already optimal and cannot be improved

Chapter 1: The Key Insight

VGGSfM's core idea is simple to state and hard to execute: make every component of SfM fully differentiable, then train the whole pipeline end-to-end.

The reconstruction function becomes a single differentiable function f_θ:

f_θ(I) = (P, X)

It takes a set of images I and outputs camera parameters P and a 3D point cloud X. Because f_θ is differentiable, we can train it by minimizing a loss that compares the predicted cameras and points against ground truth:

θ^* = arg min_θ Σ_s L(f_θ(I_s), P_s^*, T_s^*, X_s^*)

The pipeline decomposes into four seamless stages, each differentiable:

1. Deep Point Tracker T

Track 2D points across ALL frames simultaneously. No pairwise matching, no chaining. A single feed-forward network.

↓

2. Camera Initializer T_P

A Transformer that predicts all camera poses at once from image features + track features. No incremental registration.

↓

3. Triangulator T_X

A Transformer that predicts 3D point positions given cameras and tracks. Refines beyond simple DLT triangulation.

↓

4. Differentiable BA

Levenberg-Marquardt optimization using the Theseus library. Gradients flow through the optimization via the implicit function theorem.

Why end-to-end matters: When components train in isolation, the tracker optimizes for tracking accuracy and the camera predictor optimizes for pose accuracy, but neither knows what the other needs. End-to-end training lets the tracker learn to produce tracks that make the camera predictor's job easier, and vice versa. The paper reports that this synergy is worth +3.3 AUC@30° points on CO3D and +5.6 AUC@10° points on IMC over training the components separately.

Each of these four stages replaces a non-differentiable classical counterpart with a learned, differentiable alternative. The next four chapters examine each one in detail.

What is the primary benefit of making every SfM component differentiable?

It makes the code simpler to implement
End-to-end training lets each component generate outputs that specifically facilitate its successor's task, improving overall performance beyond what isolated components achieve
It runs faster than classical SfM

Chapter 2: Deep Point Tracking

In classical SfM, correspondences are established in two steps: (1) match keypoints between image pairs, then (2) chain pairwise matches into multi-view tracks. Step 2 is hand-engineered and error-prone — one wrong pairwise match corrupts the entire track.

VGGSfM replaces both steps with a single deep point tracker that directly outputs multi-view tracks. Given N_T query points in a reference frame, the tracker produces a track for each query — its location in every other frame, plus a visibility flag and confidence estimate.

Architecture: Coarse-to-Fine

The tracker follows a two-stage coarse-to-fine design inspired by CoTracker and PIPs:

Coarse tracking: A 2D CNN extracts feature maps from all frames. Query descriptors are sampled from the reference frame's feature map. Each descriptor is correlated with all frames' feature maps at multiple resolutions, building a cost-volume pyramid. These cost volumes are flattened into tokens and fed to a Transformer that outputs coarse track positions.
Fine tracking: Image patches are cropped around the coarse estimates and the tracking is repeated on these zoomed-in crops. This achieves sub-pixel accuracy — critical for SfM where even half-pixel errors propagate into large 3D errors.
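The correlation step of the coarse stage can be sketched as follows. This is a simplified, single-resolution illustration and not the paper's network: it correlates one query descriptor with one frame's feature map and reads off a position with a soft-argmax, whereas the tracker builds a multi-resolution pyramid and passes it to a Transformer. The function names, the temperature value, and the random features are assumptions made for the example.

```python
import numpy as np

def correlate(query_desc, feat_map):
    """Dot-product correlation of one descriptor with every pixel of an (H, W, C) map."""
    return np.einsum("c,hwc->hw", query_desc, feat_map)

def soft_argmax(corr, temperature=0.02):
    """Expected (x, y) position under a softmax over the correlation map."""
    h, w = corr.shape
    logits = corr.ravel() / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    ys, xs = np.divmod(np.arange(h * w), w)   # row and column index of each cell
    return np.array([probs @ xs, probs @ ys])

rng = np.random.default_rng(0)
H, W, C = 16, 16, 32
feat = rng.normal(size=(H, W, C))
feat /= np.linalg.norm(feat, axis=-1, keepdims=True)  # unit-norm descriptors

query = feat[5, 9]            # pretend the query point reappears at x=9, y=5
corr = correlate(query, feat)
coarse_xy = soft_argmax(corr)
```

The soft-argmax returns a continuous position, which is one simple way to get below integer-pixel resolution from a discrete correlation map.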

Key difference from video trackers: CoTracker and PIPs assume temporal continuity — a point in frame t is near its position in frame t-1. VGGSfM's inputs are unordered, free-form images with no temporal relationship. So the tracker attends to all frames jointly rather than using a sliding window. It also predicts each track independently, allowing a much larger number of tracked points at test time.

Tracking Confidence

Not all tracks are equally reliable. An occluded point or a textureless region produces unreliable tracks. VGGSfM predicts an aleatoric uncertainty σ_ij for each track point y_ij. The confidence 1/σ_ij tells downstream components how much to trust each correspondence.

During training, the standard L1 loss on track positions is replaced with a negative log-likelihood loss:

L_track = − Σ_i,j log N(y_ij^* | y_ij, σ_ij)

This forces the model to predict tight distributions (low σ) around accurate predictions, and wide distributions (high σ) around uncertain ones. The predicted uncertainty is anisotropic — a keypoint on a vertical stripe might have high uncertainty along the horizontal axis but low uncertainty along the vertical axis.
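A small numeric sketch shows why this loss rewards calibrated uncertainty. It assumes a diagonal Gaussian with one σ per image axis, which is one simple way to get anisotropic uncertainty; the exact parameterization used by the paper is not specified here, and the numbers are invented.

```python
import numpy as np

def gaussian_nll(pred_xy, sigma_xy, gt_xy):
    """Per-axis Gaussian negative log-likelihood, summed over all track points.

    pred_xy, sigma_xy, gt_xy: arrays of shape (..., 2); sigma_xy must be positive.
    """
    var = sigma_xy ** 2
    nll = 0.5 * (np.log(2.0 * np.pi * var) + (gt_xy - pred_xy) ** 2 / var)
    return nll.sum()

gt = np.array([[10.0, 20.0]])
accurate = np.array([[10.1, 20.1]])
off = np.array([[13.0, 20.0]])   # 3 px error along x only

tight = np.array([[0.2, 0.2]])
wide_x = np.array([[3.0, 0.2]])  # anisotropic: uncertain in x, confident in y

nll_accurate_tight = gaussian_nll(accurate, tight, gt)
nll_accurate_wide = gaussian_nll(accurate, wide_x, gt)
nll_off_tight = gaussian_nll(off, tight, gt)
nll_off_wide = gaussian_nll(off, wide_x, gt)
```

For the accurate prediction the tight σ gives the lower loss, while for the prediction that is off along x the wide-in-x σ gives the lower loss. Overconfidence about a wrong prediction is what gets penalized most.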

Why tracks beat pairwise matches: Tracks directly encode multi-view consistency. A track is predicted for all frames jointly, with a position and a visibility flag in each one, so there is no chain of pairwise links for a single bad match to break. The paper shows that feeding VGGSfM tracks to PixSfM (a classical pipeline) improves its accuracy, confirming the tracks' quality.

Why does VGGSfM's tracker use coarse-to-fine tracking instead of single-resolution tracking?

Because SfM requires sub-pixel accuracy: the coarse stage finds approximate positions across all frames, then the fine stage refines to sub-pixel precision on zoomed-in patches
Because it uses less GPU memory
Because coarse-to-fine is standard in all computer vision models

Chapter 3: Camera Prediction

In classical SfM, cameras are registered one at a time. You start with a good pair, solve the essential matrix, then add each new camera by solving the Perspective-n-Point (PnP) problem. This incremental process is fragile: if one camera is registered badly, the error cascades into all subsequent registrations.

VGGSfM replaces this with a deep Transformer that predicts all cameras simultaneously.

Architecture

The camera initializer T_P takes two types of input tokens:

Global image features: A ResNet50 extracts a 512-dimensional feature vector φ(I_i) for each input image.
Track descriptors: An auxiliary branch of the tracker produces per-track-point descriptors d_P(y_ij) that carry correspondence information.

These are fused via cross-attention: the image features (queries) attend to the track descriptors (keys/values). This produces N_I tokens — one per image — each enriched with correspondence information.

A preliminary camera estimate is obtained from the 8-point algorithm applied to track correspondences. Its 8-dimensional parameterization (quaternion + translation + log focal length) is embedded via harmonic positional encoding and concatenated with the cross-attention output.
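As an illustration of the harmonic encoding step, the sketch below maps an 8-dimensional camera vector to sines and cosines at several frequencies. The number of frequencies, the power-of-two frequency schedule, and the example camera values are assumptions; the source does not state them.

```python
import numpy as np

def harmonic_embedding(x, n_freqs=4):
    """Map each scalar in x to [sin(2^k x), cos(2^k x)] for k = 0..n_freqs-1."""
    x = np.asarray(x, dtype=float)
    freqs = 2.0 ** np.arange(n_freqs)          # 1, 2, 4, 8, ...
    angles = x[..., None] * freqs              # (..., D, n_freqs)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*x.shape[:-1], -1)      # (..., D * 2 * n_freqs)

# Hypothetical preliminary camera: identity rotation, small translation, f = 1.2.
camera = np.concatenate([[1.0, 0.0, 0.0, 0.0], [0.1, -0.2, 0.5], [np.log(1.2)]])
embedded = harmonic_embedding(camera)
```

With four frequencies, the 8 raw parameters become a 64-dimensional bounded vector, which is easier for a Transformer to consume than raw values on very different scales.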

The combined tokens pass through a Transformer trunk (8 self-attention layers, 512 hidden dimensions, 4 heads). The output is decoded into camera parameters via an MLP. This is applied iteratively 4 times, with each iteration refining the previous prediction.

Why all cameras at once? When a Transformer processes all cameras jointly via self-attention, each camera's prediction is informed by every other camera's prediction. Camera 5's pose constrains Camera 3's pose and vice versa. This mutual information sharing is impossible in incremental registration, where Camera 5 only benefits from cameras registered before it.

Camera Parameterization

Each camera is described by 8 parameters (7 degrees of freedom, because a unit quaternion carries only 3):

Parameter	Dimensions	Description
Rotation quaternion	4	q(R) encodes the camera's orientation in SO(3)
Translation	3	t encodes the camera's position
Log focal length	1	ln(f) for numerical stability

The principal point is fixed at the image center (standard practice). The full projection matrix is P = K R [I₃ | t].
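The parameterization can be turned into a projection matrix with a few lines of NumPy. This is a sketch under stated assumptions: the quaternion is taken in (w, x, y, z) order, and the helper names `quat_to_rotmat`, `projection_matrix`, and `project` are invented for the example.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion (w, x, y, z); q is normalized first."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def projection_matrix(q, t, log_f, principal_point=(0.0, 0.0)):
    """Build P = K R [I | t] from the 8 camera parameters."""
    f = np.exp(log_f)
    cx, cy = principal_point
    K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])
    R = quat_to_rotmat(q)
    I_t = np.hstack([np.eye(3), np.reshape(t, (3, 1))])
    return K @ R @ I_t

def project(P, X):
    """Project a 3D point X (shape (3,)) to pixel coordinates."""
    x_h = P @ np.append(X, 1.0)
    return x_h[:2] / x_h[2]

P = projection_matrix(q=[1, 0, 0, 0], t=[0, 0, 0], log_f=np.log(500.0),
                      principal_point=(256.0, 256.0))
uv = project(P, np.array([0.1, -0.2, 2.0]))   # lands at (281, 206)
```

Predicting ln(f) and exponentiating keeps the focal length positive no matter what the network outputs.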

How does VGGSfM's camera predictor differ from classical incremental registration?

It uses more RANSAC iterations
It uses a CNN instead of a Transformer
It predicts all cameras simultaneously via a Transformer, so each camera's prediction is informed by all others, unlike incremental registration where cameras are added one-by-one

Chapter 4: Differentiable Triangulation

We have tracks (2D point locations across frames) and initial cameras. Now we need to recover the 3D points. This is triangulation: given a point's 2D projection in multiple cameras, find its 3D position.

The Geometry

Each camera defines a ray from the camera center through the observed 2D point. In the ideal case (no noise), all rays from the same 3D point intersect perfectly. With noise, they don't — and the goal is to find the 3D point that minimizes the distance to all rays.

The classical approach is Direct Linear Transform (DLT) triangulation: set up a system of linear equations from the projection constraints and solve via SVD. This gives a closed-form solution — fast, but not optimally accurate.
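The DLT step can be sketched in NumPy as below. Each view contributes two linear equations in the homogeneous 3D point, and the solution is the right singular vector with the smallest singular value. The `make_camera` helper and the test scene are hypothetical and exist only to exercise the function.

```python
import numpy as np

def triangulate_dlt(Ps, uvs):
    """Linear (DLT) triangulation of one 3D point from N >= 2 views.

    Ps:  list of (3, 4) projection matrices.
    uvs: list of observed (u, v) pixel coordinates, one per view.
    """
    rows = []
    for P, (u, v) in zip(Ps, uvs):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                     # (2N, 4)
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]                           # singular vector of the smallest singular value
    return X_h[:3] / X_h[3]

def make_camera(f, center):
    """Hypothetical helper: axis-aligned camera at `center` with focal length f."""
    K = np.diag([f, f, 1.0])
    return K @ np.hstack([np.eye(3), -np.reshape(center, (3, 1))])

X_true = np.array([0.3, -0.1, 4.0])
Ps = [make_camera(500.0, c) for c in ([0, 0, 0], [1, 0, 0], [0, 1, 0])]
uvs = []
for P in Ps:
    x_h = P @ np.append(X_true, 1.0)
    uvs.append(x_h[:2] / x_h[2])
X_est = triangulate_dlt(Ps, uvs)
```

With noise-free observations the estimate matches the true point. With noise, DLT minimizes an algebraic error rather than the reprojection error, which is why it is fast but not optimally accurate.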

VGGSfM's Learned Triangulator

VGGSfM first runs DLT to get a preliminary 3D point cloud X̄. Then it feeds this preliminary estimate into a Transformer T_X for refinement. The input tokens are per-track descriptors d_X(y_ij) that combine:

Tracker features (what the point looks like)
Positional harmonic embeddings of the DLT triangulated positions (where the point roughly is)

The Transformer processes these tokens and outputs refined 3D positions. Because the Transformer can attend across all points and all views, it can reason about global structure — "these five points form a planar surface" or "this point is inconsistent with its neighbors."

Why refine beyond DLT? DLT assumes all observations are equally reliable and finds the algebraically simplest solution. But some tracks are noisy (low confidence), and some cameras are inaccurately estimated. The learned triangulator can weight observations by their reliability and incorporate geometric priors that DLT ignores. The ablation shows replacing the learned triangulator with plain DLT drops AUC@10 from 73.92% to 69.42%.

Interactive Triangulation

Multiple camera rays converge on a 3D point. With noise, the rays no longer intersect perfectly and the estimated 3D point drifts. More cameras give a more robust estimate.

Why does VGGSfM use a learned Transformer triangulator instead of plain DLT?

Because DLT treats all observations equally and ignores confidence/noise, while the learned triangulator can weight observations by reliability and incorporate geometric priors, yielding a +4.5-point AUC@10 improvement
Because DLT is too slow for real-time applications
Because DLT requires at least 8 cameras

Chapter 5: Differentiable Bundle Adjustment

Bundle adjustment is the final, crucial refinement step. It jointly optimizes all camera parameters and all 3D points to minimize the total reprojection error:

L_BA = Σ_i Σ_j v_ij · ||P_i(x^j) − y_ij||

Here, P_i(x^j) projects 3D point x^j into camera i, and y_ij is the observed 2D position. The visibility flag v_ij masks out invisible points. The sum runs over all points and all cameras — this is why it's called "bundle" adjustment: we adjust all the bundles of rays simultaneously.
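The objective above can be evaluated directly, as in this minimal NumPy sketch. It computes only the loss value (no optimization), uses the plain Euclidean norm from the formula, and relies on a made-up two-camera, two-point scene.

```python
import numpy as np

def reprojection_loss(Ps, X, y, vis):
    """Sum of visibility-masked reprojection distances.

    Ps:  (N_cam, 3, 4) projection matrices
    X:   (N_pts, 3) 3D points
    y:   (N_cam, N_pts, 2) observed 2D positions
    vis: (N_cam, N_pts) visibility flags in {0, 1}
    """
    X_h = np.concatenate([X, np.ones((len(X), 1))], axis=1)    # (N_pts, 4)
    proj = np.einsum("cij,pj->cpi", Ps, X_h)                   # (N_cam, N_pts, 3)
    uv = proj[..., :2] / proj[..., 2:3]
    residual = np.linalg.norm(uv - y, axis=-1)                 # (N_cam, N_pts)
    return float((vis * residual).sum())

K = np.diag([100.0, 100.0, 1.0])
Ps = np.stack([K @ np.hstack([np.eye(3), np.zeros((3, 1))]),
               K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])])
X = np.array([[0.0, 0.0, 2.0], [1.0, 1.0, 4.0]])

proj = np.einsum("cij,pj->cpi", Ps, np.hstack([X, np.ones((2, 1))]))
y = proj[..., :2] / proj[..., 2:3]        # noise-free observations
y[0, 0] += [3.0, 4.0]                     # corrupt one observation by 5 px
y[1, 1] += [100.0, 0.0]                   # gross outlier, marked invisible below
vis = np.array([[1, 1], [1, 0]])
loss = reprojection_loss(Ps, X, y, vis)
```

The loss is 5 pixels: the corrupted visible observation contributes its full error, while the gross outlier is ignored because its visibility flag is 0.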

The Differentiability Challenge

Classical pipelines such as COLMAP run BA with Ceres Solver, a C++ library that minimizes reprojection error via Levenberg-Marquardt (LM) optimization. This works beautifully for SfM, but Ceres offers no way to backpropagate a training loss through its solution.

To train the full VGGSfM pipeline end-to-end, gradients must flow through the BA layer — from the training loss back to the tracker parameters. This requires differentiating through an optimization process.

Solution: Implicit Differentiation via Theseus

VGGSfM uses the Theseus library (from Meta), which exploits the implicit function theorem to backpropagate through optimization. The idea:

BA finds the solution (P*, X*) that satisfies the optimality condition: ∇L_BA = 0.
At this solution, the implicit function theorem tells us how the solution changes when the inputs (tracks, initial cameras) change — without unrolling the optimization steps.
This gives us the Jacobian of the solution with respect to the inputs, which is all we need for backpropagation.

Why not just unroll? A naive approach would unroll every LM iteration and backpropagate through all of them. This is memory-expensive (you store all intermediate states) and can be numerically unstable (gradients through many iterations can explode or vanish). Implicit differentiation instead needs only the final solution, so its memory cost does not grow with the number of LM steps, and its gradients are exact when the solver has converged to a stationary point.
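The idea can be checked on a toy problem. The sketch below is not Theseus code and does not use its API: it minimizes a scalar function f(x, θ) = ½(x − θ)² + ¼x⁴ with Newton's method, then obtains dx*/dθ from the optimality condition alone via dx*/dθ = −f_xθ / f_xx and compares it with a finite-difference estimate.

```python
def solve_inner(theta, steps=50):
    """Newton's method on f(x, theta) = 0.5 * (x - theta)**2 + 0.25 * x**4."""
    x = 0.0
    for _ in range(steps):
        grad = (x - theta) + x ** 3
        hess = 1.0 + 3.0 * x ** 2
        x -= grad / hess
    return x

def implicit_grad(theta):
    """dx*/dtheta from the optimality condition, without unrolling the solver."""
    x_star = solve_inner(theta)
    f_xx = 1.0 + 3.0 * x_star ** 2    # second derivative w.r.t. x at the solution
    f_xtheta = -1.0                   # derivative of df/dx w.r.t. theta
    return -f_xtheta / f_xx

theta = 2.0
eps = 1e-6
finite_diff = (solve_inner(theta + eps) - solve_inner(theta - eps)) / (2 * eps)
analytic = implicit_grad(theta)
```

For θ = 2 the minimizer is x* = 1 and both estimates give 0.25. The implicit gradient used only the final x*, not the 50 intermediate Newton iterates.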

Filtering

Before BA, VGGSfM filters out outlier correspondences based on:

Low visibility (v_ij = 0)
Low tracker confidence (high σ_ij)
Geometric constraints (epipolar consistency)
Large reprojection errors (from the initial triangulation)

This filtering is critical — a single bad correspondence can distort the entire reconstruction. At test time, VGGSfM runs the full pipeline (track → camera → triangulate → BA) multiple times, iterating until BA achieves sub-pixel reprojection error.
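A filter of this kind amounts to combining boolean masks over (camera, point) observations. The sketch below covers three of the four criteria (visibility, confidence, reprojection error) and leaves out the epipolar check. The function name and both thresholds are placeholders, not values from the paper.

```python
import numpy as np

def filter_observations(vis, sigma, reproj_err, max_sigma=2.0, max_reproj=4.0):
    """Boolean keep-mask over (camera, point) observations.

    The thresholds are placeholders, not values from the paper.
    """
    keep = vis.astype(bool)
    keep &= (sigma <= max_sigma)          # drop low-confidence track points
    keep &= (reproj_err <= max_reproj)    # drop large reprojection errors
    return keep

vis = np.array([[1, 1, 1],
                [1, 0, 1],
                [1, 1, 1]])
sigma = np.array([[0.5, 0.5, 5.0],
                  [0.5, 0.5, 0.5],
                  [0.5, 0.5, 0.5]])
reproj_err = np.array([[1.0, 1.0, 1.0],
                       [1.0, 1.0, 1.0],
                       [9.0, 1.0, 1.0]])
keep = filter_observations(vis, sigma, reproj_err)
```

In this example each criterion removes exactly one observation, so six of the nine survive and would be passed on to BA.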

How does VGGSfM make bundle adjustment differentiable?

It uses the Theseus library, which exploits the implicit function theorem to compute gradients through the optimization without unrolling, giving exact gradients with constant memory cost
It replaces Levenberg-Marquardt with gradient descent
It removes bundle adjustment entirely and relies on the triangulator

Chapter 6: Training

Training a fully differentiable SfM pipeline is tricky. The four components are deeply interdependent — the triangulator can't learn if the cameras are garbage, and the cameras can't learn if the tracks are garbage. VGGSfM handles this with a careful multi-stage training strategy.

Stage 1: Tracker Pre-training

The tracker T is first trained on the Kubric synthetic dataset, which provides perfect ground-truth tracks. This gives the tracker a solid foundation before being fine-tuned on real data (CO3D or MegaDepth).

Stage 2: Component-wise Training

With the tracker frozen, the camera initializer T_P is trained on real data. Then, with both tracker and camera initializer frozen, the triangulator T_X is trained. Each component builds on the quality of its predecessors.

Stage 3: End-to-End Joint Training

Finally, all components are unfrozen and trained jointly. This is where the magic happens — the synergy between components emerges as each adjusts to help the others.

Training Loss

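The full loss combines three kinds of terms. The 3D point and camera terms are each applied twice, once to the initial estimate and once to the BA-refined one: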

3D point loss: Pseudo-Huber loss between predicted and ground-truth 3D points — applied to both initial triangulation (x̂) and BA-refined points (x).
Camera loss: Huber loss on camera parameters — applied to both initial predictions (P̂) and BA-refined cameras (P).
Track loss: Negative log-likelihood of ground-truth track positions under the predicted distributions (y_ij, σ_ij).
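The three items above can be assembled into a single expression. The form below is a reconstruction from those descriptions, not a copy of the paper's equation: equal weights on every term are an assumption, as are the symbols |·|_ε for the pseudo-Huber loss and e_P for the Huber camera error.

```latex
\mathcal{L} =
  \sum_{j} \left( \lvert x^{j*} - x^{j} \rvert_{\epsilon}
                + \lvert x^{j*} - \hat{x}^{j} \rvert_{\epsilon} \right)
+ \sum_{i} \left( e_{P}(P_i^{*}, P_i) + e_{P}(P_i^{*}, \hat{P}_i) \right)
+ \mathcal{L}_{\text{track}},
\qquad
\mathcal{L}_{\text{track}} = -\sum_{i,j} \log \mathcal{N}\!\left(y_{ij}^{*} \mid y_{ij}, \sigma_{ij}\right)
```

Here starred quantities are ground truth, hatted quantities are the initial estimates, and unhatted ones are the BA-refined outputs.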

Supervising intermediate outputs: The loss penalizes both the initial estimates (P̂, x̂) and the BA-refined estimates (P, x). This ensures that the camera initializer and triangulator produce reasonable outputs even before BA — BA shouldn't have to do all the work. It also provides gradient signal to early components that might otherwise suffer from vanishing gradients.

Training Details

Detail	Value
Optimizer	AdamW, cyclic LR (30-epoch cycles)
Learning rate	0.0001 (joint), 0.0005 (pre-training)
Hardware	32 NVIDIA A100 (80GB) GPUs
Frames per batch	3 to 30 (randomly sampled)
Query points (train)	256
BA steps (train)	5
Image resolution	512 × 512 (zero-padded)

Why does VGGSfM train in stages (tracker first, then camera predictor, then triangulator, then jointly)?

Because it runs faster this way
Because the components are independent
Because each component depends on its predecessors' quality: training the triangulator before the camera predictor works would produce garbage. Staged training builds a solid foundation before joint fine-tuning unlocks synergy.

Chapter 7: Results

VGGSfM is evaluated on three benchmarks, each testing different conditions.

CO3D (Wide Baseline, Object-Centric)

Turntable-style videos of 51 categories. Wide baselines between test frames make this very challenging for classical SfM.

Method	Type	RRE@15°	RTE@15°	AUC@30°
COLMAP (SP+SG)	Incremental	31.6	27.3	25.3
PixSfM (SP+SG)	Incremental	33.7	32.9	30.1
PoseDiffusion	Deep	80.5	79.8	66.5
VGGSfM w/o Joint	Deep	88.2	83.4	70.7
VGGSfM	Deep	92.1	88.3	74.0

On CO3D, VGGSfM far outperforms the classical methods: COLMAP reaches 25.3 AUC@30° and VGGSfM reaches 74.0, nearly a 3x improvement. The wide baselines cripple pairwise matching, but VGGSfM's deep tracker copes with them.

IMC Phototourism (Narrow Baseline, Landmarks)

Famous landmarks photographed by tourists. Views overlap well — classical SfM's sweet spot.

Method	AUC@3°	AUC@5°	AUC@10°
COLMAP (SIFT+NN)	23.58	32.66	44.79
PixSfM (SP+SG)	45.19	57.22	70.47
DFSfM (LoFTR)	46.55	58.74	72.19
PoseDiffusion	12.31	23.17	36.82
VGGSfM	45.23	58.89	73.92

Even on IMC — where classical methods have decades of tuning — VGGSfM matches or beats them. It leads on AUC@5 and AUC@10, and is competitive on AUC@3.

ETH3D (3D Triangulation Quality)

Laser-scanned ground truth for evaluating 3D point quality. VGGSfM achieves the best accuracy AND completeness across all thresholds.

Results Comparison

Camera pose accuracy (AUC) on CO3D and IMC. VGGSfM dramatically outperforms classical methods on wide-baseline CO3D and matches them on narrow-baseline IMC.

The headline result: A single differentiable pipeline — simpler than COLMAP — achieves state-of-the-art on wide-baseline, narrow-baseline, and triangulation benchmarks. No single classical configuration works well across all three.

Where does VGGSfM show its largest improvement over classical SfM (COLMAP)?

On CO3D (wide-baseline scenes), where it achieves 74.0% AUC vs. COLMAP's 25.3%, because deep tracking handles wide baselines far better than pairwise keypoint matching
On IMC (narrow-baseline), where the improvement is equally large
On ETH3D, because triangulation is the hardest task

Chapter 8: Ablations

The ablation studies reveal what matters most in VGGSfM's design. Each ablation removes or replaces one component, keeping everything else fixed.

End-to-End Training

A central design choice. Without joint training, both benchmarks lose accuracy:

Setting	CO3D AUC@30	IMC AUC@10
VGGSfM w/o Joint	70.7	68.35
VGGSfM (full)	74.0	73.92
Improvement	+3.3	+5.6

End-to-end training improves both reported metrics. The synergy between components is real and substantial.

Coarse-to-Fine Tracking

Removing the fine tracker (using only coarse tracks) drops IMC AUC@10 from 73.92% to 62.30% — a devastating 11.6-point drop. Sub-pixel accuracy is not optional for SfM.

Camera Initializer vs. PoseDiffusion

Replacing VGGSfM's camera initializer with PoseDiffusion drops AUC@10 from 73.92% to 62.18%. The learned initializer is better than an off-the-shelf deep pose estimator.

Learned Triangulator vs. DLT

Replacing the learned triangulator with plain DLT drops AUC@10 from 73.92% to 69.42%. The learned refinement adds +4.5 points.

Tracks vs. Pairwise Matching

An interesting cross-evaluation: VGGSfM tracks fed to PixSfM (classical) achieve 70.62% AUC@10, slightly beating PixSfM with its own matches (70.47%). SuperPoint+SuperGlue matches fed to VGGSfM achieve only 68.78%. Both results confirm: (1) VGGSfM's tracks are excellent even outside its own pipeline, and (2) VGGSfM benefits from tracks trained jointly with it.

Ablation takeaway: Every component contributes. By the IMC AUC@10 figures quoted above, swapping the camera initializer for PoseDiffusion (−11.7) and removing the fine tracker (−11.6) cost the most, followed by dropping end-to-end training (−5.6) and replacing the learned triangulator with DLT (−4.5).

Which ablation causes the largest performance drop on the IMC dataset?

Removing end-to-end training
Removing the fine tracker (coarse-to-fine): AUC@10 drops by 11.6 points from 73.92% to 62.30%, showing how critical sub-pixel tracking accuracy is
Replacing the triangulator with DLT

Chapter 9: Connections

VGGSfM sits at the intersection of several research threads. Let's map where it fits.

Relation to COLMAP

COLMAP is the gold standard of incremental SfM. VGGSfM replaces every component of COLMAP with a learned, differentiable alternative. On narrow-baseline scenes (IMC), VGGSfM is on par with the strongest baselines in the table (PixSfM, DFSfM) and well ahead of COLMAP with SIFT+NN (73.92 vs. 44.79 AUC@10). On wide-baseline scenes (CO3D), it dramatically surpasses COLMAP. COLMAP still has the advantage of scaling to thousands of images, whereas VGGSfM currently handles up to ~30 frames.

Relation to PoseDiffusion

PoseDiffusion uses a diffusion model to predict camera poses. It's one of the first deep methods to handle many cameras simultaneously. VGGSfM's camera initializer achieves better accuracy by incorporating track features and geometric priors (8-point algorithm initialization), and by being integrated into a full SfM pipeline that provides 3D structure.

Relation to CoTracker

CoTracker introduced joint point tracking in videos. VGGSfM adapts this idea for unordered images: it removes the temporal sliding window (since images aren't sequential) and adds sub-pixel coarse-to-fine refinement (since SfM needs higher accuracy than video tracking).

Relation to Classical SLAM

Real-time SLAM systems (ORB-SLAM, DROID-SLAM) solve a related problem but emphasize speed and sequential processing. DROID-SLAM notably uses a differentiable BA layer similar to VGGSfM. VGGSfM focuses on offline, high-accuracy reconstruction from unordered images.

Relation to Rooms from Motion / DUSt3R

These concurrent works also pursue end-to-end 3D reconstruction from images. DUSt3R directly predicts pointmaps (3D coordinates per pixel) without explicit camera estimation. VGGSfM retains the classical SfM decomposition (cameras + points) but makes it fully differentiable. Both approaches validate the trend toward learned 3D reconstruction.

Cheat Sheet

Aspect	VGGSfM
Input	3-30 unordered images of a scene
Output	Camera poses + intrinsics + 3D point cloud
Tracker	Coarse-to-fine deep tracker (all frames jointly)
Camera predictor	Transformer, all cameras at once
Triangulation	DLT + learned Transformer refinement
Bundle adjustment	Differentiable LM via Theseus (implicit diff)
Training	Multi-stage, then end-to-end joint
Key result	3x COLMAP on CO3D, matches PixSfM on IMC
Limitation	≤30 frames (can't scale to thousands like COLMAP)

The broader lesson: Long-standing pipelines with many hand-engineered stages can often be simplified and improved by making them fully differentiable. The components don't need to be individually optimal — they need to work well together. End-to-end training discovers this synergy automatically.

VGGSfM Architecture Flow

The full differentiable pipeline: images flow through the tracker, camera predictor, triangulator, and bundle adjustment. Gradients flow backward through the entire chain during training.

What is VGGSfM's primary limitation compared to classical SfM (COLMAP)?

It can only process up to ~30 frames, while COLMAP can scale to thousands of images
It produces lower-quality 3D point clouds
It only works on indoor scenes

Visual Geometry Grounded Deep Structure From Motion