PoseDiffusion — Veanors

Chapter 0: The Problem

You have ten tourist photos of the Colosseum, taken from different angles by different people. You want to figure out exactly where each camera was when the photo was taken — its position, its orientation, and its focal length. This is camera pose estimation, and it's the foundation of everything from 3D reconstruction to augmented reality.

The classical approach is a pipeline called Structure from Motion (SfM). It works in stages: detect keypoints (SIFT, SuperPoint), match them across images (nearest-neighbor, SuperGlue), use RANSAC to reject outliers, compute relative poses from the five-point algorithm, then run Bundle Adjustment to jointly refine all cameras and 3D points.

This pipeline is brittle. Each stage can fail, and failures cascade. If keypoint matching fails on a wide-baseline pair (two photos taken from very different positions), that image pair gets dropped. If too many pairs drop, the reconstruction collapses.

The Achilles' heel: Classical SfM pipelines depend on finding reliable point correspondences between images. When views are sparse and baselines are wide — exactly the scenario that matters for real applications — correspondences become unreliable or nonexistent. A small matching failure at the start can doom the entire reconstruction.

Learned methods like RelPose tried to bypass the correspondence problem by directly predicting camera poses from image features. But RelPose only predicts rotations, not translations, and it can't match the precision of Bundle Adjustment when many images are available.

We need something that can handle both sparse wide-baseline views and dense multi-view sequences — gracefully, without brittle handoffs between pipeline stages.

Keypoint Matching Fragility

Two views of the same scene. Green lines are correct matches; red lines are incorrect. As the baseline widens, more matches fail.

Why does classical SfM struggle with sparse, wide-baseline views?

Because keypoint matching becomes unreliable when viewpoints differ widely, and pipeline failures cascade through RANSAC, pose estimation, and bundle adjustment
Because SIFT features are too slow to compute on large images
Because bundle adjustment requires at least 50 images to converge

Chapter 1: The Key Insight

Here's the idea that changes everything: what if camera pose estimation was just denoising?

Think about what a diffusion model does for images. You start with pure noise and gradually refine it into a coherent picture. At each step, the model nudges the noisy image slightly closer to something that looks real. The final result emerges through many small corrections, not one giant leap.

Now think about what Bundle Adjustment does. You start with rough camera poses (from a noisy initialization) and iteratively refine them until all the cameras are geometrically consistent. Each iteration makes small adjustments. The final result emerges through many corrections.

The parallel is striking. Both processes are iterative refinements from noise to signal. PoseDiffusion makes this parallel literal.

Step 1: Start from Noise

Sample N random camera poses from a Gaussian distribution. These are completely wrong — cameras pointing in random directions at random positions.

↓

Step 2: Denoise

A learned denoiser (transformer) takes the noisy poses + image features and predicts slightly cleaner poses. Repeat T times.

↓

Step 3: Guide with Geometry

At each step, nudge poses toward satisfying epipolar constraints from 2D point correspondences. This injects classical geometric reasoning into the diffusion process.

↓

Step 4: Output

After T denoising steps, the poses have converged. Each camera has an extrinsic (rotation + translation) and intrinsic (focal length).

Why this matters: By framing pose estimation as diffusion, we get three things for free: (1) iterative refinement, like BA, without hand-engineering the optimization; (2) a natural way to inject geometric constraints via classifier guidance; (3) the ability to model uncertainty — multiple samples from p(x|I) show how confident the model is about each camera.

The model learns the conditional distribution p(x|I) — the probability of camera parameters x given images I. At test time, sampling from this distribution produces camera poses. Because the distribution is assumed to be near-delta (i.e., there's essentially one right answer for a given set of images), any sample is a valid pose estimate.

The inference pipeline, end to end

1. Feature extraction

N images → DINO ViT-S/16 → N × 384-dim features (one-time cost: ~0.1s)

↓

2. Sample initial noise

x₁₀₀ ~ N(0, I): random quaternions, translations, focal lengths for all N cameras

↓

3. Denoise (steps 100 to 11)

90 transformer forward passes, each predicting clean cameras from current noisy state + image features. ~0.8s total.

↓

4. Denoise + GGS (steps 10 to 1)

10 steps with geometric guidance. Each: transformer prediction + 100 gradient descent iterations on Sampson error. ~60-90s total (bottleneck).

↓

5. Output

N cameras: rotation (quaternion), translation (3D vector), focal length. Ready for NeRF, 3D reconstruction, or AR.

What is the core analogy that PoseDiffusion exploits?

Camera images look similar to noisy images
Both diffusion denoising and bundle adjustment iteratively refine from a noisy initialization to a clean solution through many small corrections
Diffusion models are faster than gradient descent optimizers

Chapter 2: Background

Before we dive into PoseDiffusion's mechanics, let's nail down the three pieces of geometry it relies on.

Camera Extrinsics

A camera's extrinsic parameters describe where it is and which way it's pointing. Formally, extrinsics g = (R, t) consist of a rotation matrix R in SO(3) and a translation vector t in R³. Together, they define a rigid-body transformation from world coordinates to camera coordinates:

p_c = R · p_w + t

PoseDiffusion represents the rotation as a unit quaternion q in H (4 numbers) and keeps the translation as a 3-vector, giving 7 numbers per camera for extrinsics.
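To make the 7-number extrinsic representation concrete, here is a minimal sketch that turns a quaternion into a rotation matrix and applies p_c = R · p_w + t. The function names and the (w, x, y, z) ordering are assumptions chosen for illustration, not taken from the PoseDiffusion code.

```python
import math

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (row-major lists)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n   # normalize so the result is a valid rotation
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def world_to_camera(q, t, p_w):
    """Apply the rigid-body transform p_c = R * p_w + t."""
    R = quat_to_rotmat(q)
    return [sum(R[r][c] * p_w[c] for c in range(3)) + t[r] for r in range(3)]

# 90-degree rotation about the z-axis, followed by a shift of 1 along x.
q = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
p_c = world_to_camera(q, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

The rotation sends the world point (1, 0, 0) to (0, 1, 0), and the translation then moves it to roughly (1, 1, 0) in camera coordinates.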

Camera Intrinsics

The intrinsic parameters describe how the camera projects 3D points onto its 2D sensor. The calibration matrix K maps a 3D camera-space point to a 2D pixel:

K = [f, 0, p_x ; 0, f, p_y ; 0, 0, 1]

PoseDiffusion simplifies this to one degree of freedom: the focal length f. The principal point (p_x, p_y) is fixed at the image center, which is standard in SfM. The focal length is predicted as f = exp(f̂) to guarantee it's always positive. This adds 1 number per camera, for a total of 8 parameters per camera.
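As a small illustration of the one-parameter intrinsics, the sketch below projects a camera-space point through K with f = exp(f̂). The helper name `project` and the numbers are hypothetical.

```python
import math

def project(point_cam, f_hat, principal_point):
    """Project a camera-space 3D point to pixels with K = [f 0 p_x; 0 f p_y; 0 0 1]."""
    f = math.exp(f_hat)              # exponentiating keeps the focal length positive
    px, py = principal_point
    X, Y, Z = point_cam
    return (f * X / Z + px, f * Y / Z + py)

# A 224x224 image with the principal point at the center and f = 500 pixels.
u, v = project((0.5, -0.25, 2.0), math.log(500.0), (112.0, 112.0))
```

Because the network outputs f̂ rather than f, any real-valued prediction maps to a valid (positive) focal length.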

Epipolar Geometry

Given two cameras with known poses, there's a fundamental relationship between corresponding points. If a point p₁ in image 1 corresponds to a point p₂ in image 2, then the epipolar constraint says:

p̃₂^T F p̃₁ = 0

where F is the Fundamental Matrix computed from the two cameras' poses and intrinsics, and p̃ denotes homogeneous coordinates. This says: if you know one point in image 1, the corresponding point in image 2 must lie on a specific line (the epipolar line). If the poses are correct, all correct correspondences satisfy this constraint. If many correspondences violate it, either the poses or the matches are wrong.

The Sampson Error: The raw value of p̃₂^T F p̃₁ is only an algebraic residual: it has no direct geometric meaning and changes with the scale of F. The Sampson Epipolar Error is a first-order approximation to the geometric error that's much better behaved. It normalizes the residual by the gradient of the constraint, so it approximates the squared pixel distance to the epipolar line rather than an algebraic residual. This is what PoseDiffusion minimizes during guided sampling.
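The following sketch computes a Fundamental Matrix and the Sampson error for one correspondence. It assumes camera 1 sits at the origin and camera 2 has pose (R, t) relative to it; the helper names are hypothetical and this is an illustration, not the authors' implementation.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def k_inverse(f, px, py):
    """Closed-form inverse of K = [f 0 px; 0 f py; 0 0 1]."""
    return [[1 / f, 0, -px / f], [0, 1 / f, -py / f], [0, 0, 1]]

def fundamental(R, t, K1_inv, K2_inv):
    """F = K2^{-T} [t]x R K1^{-1} for camera 1 at the origin and camera 2 at (R, t)."""
    tx = [[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]]
    return matmul(matmul(transpose(K2_inv), matmul(tx, R)), K1_inv)

def sampson_error(F, p1, p2):
    """First-order geometric error of the epipolar constraint, in squared pixels."""
    x1, x2 = [p1[0], p1[1], 1.0], [p2[0], p2[1], 1.0]
    Fx1 = [sum(F[i][k] * x1[k] for k in range(3)) for i in range(3)]
    Ftx2 = [sum(F[k][i] * x2[k] for k in range(3)) for i in range(3)]
    residual = sum(x2[i] * Fx1[i] for i in range(3))      # algebraic residual
    return residual ** 2 / (Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2)

K_inv = k_inverse(500.0, 112.0, 112.0)
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
F = fundamental(I3, [-1.0, 0.0, 0.0], K_inv, K_inv)       # camera 2 shifted sideways
err_good = sampson_error(F, (149.5, 137.0), (24.5, 137.0))  # same image row: consistent
err_bad = sampson_error(F, (149.5, 137.0), (24.5, 139.0))   # match is 2 pixels off the line
```

For a pure sideways shift the epipolar lines are horizontal, so the consistent pair scores essentially zero while the pair that is two pixels off the line scores a clearly positive error.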

How many parameters per camera does PoseDiffusion predict?

8: quaternion (4) + translation (3) + log-focal-length (1)
6: rotation matrix (3) + translation (3)
12: full 3x4 projection matrix

Chapter 3: Diffusion for Poses

PoseDiffusion adapts DDPM (Denoising Diffusion Probabilistic Models) to the domain of camera parameters. Let's trace how the standard DDPM machinery maps onto this problem.

The Forward Process: Adding Noise to Cameras

Start with ground-truth camera parameters x₀. The forward process adds Gaussian noise over T steps:

q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) · x_{t-1}, β_t I)

After T steps, the cameras are indistinguishable from pure Gaussian noise. The variance schedule β₁, ..., β_T controls how fast this happens. PoseDiffusion uses T = 100 steps with a linear schedule from 10^-3 to 0.2.

A convenient closed-form lets us jump directly to any timestep:

x_t ~ N(√ᾱ_t · x₀, (1 - ᾱ_t) I)

where ᾱ_t = ∏_{i=1}^{t} (1 - β_i). This is essential for efficient training: we can sample any noise level without simulating the full chain.
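The closed form is easy to check numerically. This sketch builds the linear schedule described above (T = 100, β from 10^-3 to 0.2), accumulates ᾱ_t, and noises a single camera vector in one shot. The example camera values and the name `q_sample` are made up for illustration.

```python
import math
import random

T = 100
# Linear schedule: betas[0] is beta_1, betas[T-1] is beta_T.
betas = [1e-3 + (0.2 - 1e-3) * i / (T - 1) for i in range(T)]

alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)            # alpha_bar[t-1] holds the product up to step t

def q_sample(x0, t, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) I) without simulating the chain."""
    a = alpha_bar[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
x0 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.1, -0.2, 1.5]   # hypothetical camera: [log f, q_w..q_z, t_x..t_z]
x_50 = q_sample(x0, 50, rng)
```

With this schedule ᾱ_100 is far below 10^-3, so almost none of the original signal survives at the final step.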

The Reverse Process: Denoising Cameras

The reverse process is what we actually use at inference. Starting from pure noise x_T ~ N(0, I), we iteratively denoise:

p_θ(x_{t-1} | x_t, I) = N(x_{t-1}; √α_t · D_θ(x_t, t, I), (1 - α_t) I)

The denoiser D_θ takes the current noisy cameras x_t, the timestep t, and image features I, and predicts the clean cameras. PoseDiffusion uses the "x₀ prediction" formulation: the network directly predicts the clean signal rather than the noise. This is empirically more stable for camera parameters.
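A sampling loop built on x₀ prediction can be sketched as follows. One common way to realize it is to predict the clean cameras and then re-noise them to the level of step t-1 using the forward closed form. That update rule is an assumption of this sketch and may differ in detail from the exact reverse-step coefficients above; the `oracle` denoiser is a hypothetical stand-in for D_θ.

```python
import math
import random

T = 100
betas = [1e-3 + (0.2 - 1e-3) * i / (T - 1) for i in range(T)]
alpha_bar = [1.0]                         # alpha_bar[0] = 1 means "no noise"
for beta in betas:
    alpha_bar.append(alpha_bar[-1] * (1.0 - beta))

def normalize_quaternion(x):
    """L2-normalize entries 1..4 (the quaternion) of an 8-vector [log f, q, t]."""
    n = math.sqrt(sum(v * v for v in x[1:5]))
    return x[:1] + [v / n for v in x[1:5]] + x[5:]

def sample(denoiser, dim, rng):
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]          # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        x0_hat = normalize_quaternion(denoiser(x, t))      # predicted clean camera
        a_prev = alpha_bar[t - 1]
        # Re-noise the prediction to the noise level of step t-1 (zero noise at the end).
        x = [math.sqrt(a_prev) * v + math.sqrt(1.0 - a_prev) * rng.gauss(0.0, 1.0)
             for v in x0_hat]
    return x

target = [0.0, 1.0, 0.0, 0.0, 0.0, 0.1, -0.2, 1.5]
oracle = lambda x_t, t: list(target)      # hypothetical denoiser that always returns the truth
result = sample(oracle, 8, random.Random(0))
```

With a perfect denoiser the loop ends exactly at the target, because the last step adds no noise. A trained network would instead improve its prediction gradually as the input gets cleaner.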

What Makes This Different from Image Diffusion?

In image diffusion, each sample is a grid of pixels — high-dimensional, spatially structured. Here, each sample is a set of camera parameter vectors. There are key differences:

Variable-length sets: PoseDiffusion handles an arbitrary number N of cameras. The denoiser processes them jointly through a transformer.
Low dimension per camera: Just 8 numbers per camera, versus millions of pixels. But these 8 numbers must be geometrically precise.
Conditional on images: The denoising is conditioned on image features, so it's really modeling p(x|I), not p(x).
SE(3) structure: Camera parameters live on a manifold (rotations are in SO(3)), not in flat Euclidean space. PoseDiffusion handles this by working with quaternions and letting the noise schedule operate in the ambient R⁸ space.

Why "x₀ prediction" and not "noise prediction"? Most image diffusion models predict the noise ε that was added. PoseDiffusion instead predicts the clean cameras x₀ directly. The authors found this more stable, likely because camera parameters have hard geometric constraints (rotation quaternions must be unit norm) that are easier to enforce in the output space than in the noise space.

Concrete diffusion setup

Total diffusion steps T: 100 (much fewer than the 1000 typical for image diffusion — pose space is low-dimensional and doesn't need as many refinement steps)
Noise schedule: Linear β from 10^-3 to 0.2. This is aggressive — by step 100, the signal-to-noise ratio is very low, ensuring x₁₀₀ is nearly pure Gaussian.
Input dimension: 8 per camera × N cameras. For N=10, the total diffusion operates over an 80-dimensional space. Compare to image diffusion over a 32×32×4 = 4096-dimensional latent space. PoseDiffusion's space is about 50x smaller (4096 / 80 ≈ 51).
Quaternion normalization: After each denoising step, the quaternion portion of each camera's parameters is L2-normalized to unit length. This enforces the SO(3) constraint directly in the diffusion process.

Scale of the problem: PoseDiffusion diffuses over a tiny space (8N dimensions) compared to image models (thousands to millions of dimensions). This is why the denoiser can be so small (~5M params vs. 675M for DiT or 8B for SD3). But the geometric precision required is much higher — a 1-degree rotation error is immediately visible in reconstruction. The diffusion framework helps by providing many refinement steps, and GGS provides the final geometric precision.

Why does PoseDiffusion use x₀ prediction instead of noise prediction?

Because x₀ prediction generates higher-resolution images
Because noise prediction requires more training data
Because directly predicting clean camera parameters is empirically more stable, likely because geometric constraints like unit-norm quaternions are easier to enforce in the output space

Chapter 4: Geometry-Guided Sampling

The diffusion denoiser alone can produce reasonable poses, but it's a feed-forward neural network — and neural networks are notoriously bad at regressing precise geometric quantities like rotation angles and translation vectors. PoseDiffusion's secret weapon is Geometry-Guided Sampling (GGS): injecting classical epipolar constraints directly into the diffusion sampling process.

The Mechanism: Classifier Guidance for Geometry

Recall that in classifier-guided diffusion for images, you steer samples toward a desired class by adding the gradient of a classifier to the denoising step. PoseDiffusion does the same thing, but the "classifier" is replaced by an epipolar geometry likelihood.

At each denoising step t, the predicted mean μ_{t-1} is adjusted:

μ̂_{t-1} = D_θ(x_t, t, I) + s · ∇_x log p(I | x_t)

The gradient ∇ log p(I|x) pushes the poses toward satisfying epipolar constraints from 2D correspondences. The scalar s controls guidance strength.

The Sampson Likelihood

The likelihood p(I|x) is modeled as a product of exponential distributions over pairwise Sampson Errors:

p(I | x) ∝ ∏_{i,j} exp(−e^{ij})

where e^ij is the Sampson Epipolar Error between cameras i and j, computed from 2D correspondences extracted by SuperPoint+SuperGlue. The Sampson error is clamped at ε = 10 to handle outlier correspondences robustly.

The gradient of log p(I|x) is just the negative gradient of the total Sampson error across all pairs. It tells the optimizer: "adjust these camera poses so that epipolar lines pass closer to the matched keypoints."
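The effect of clamping is easy to see on toy numbers. In this sketch the negative log-likelihood is the sum of error terms, each capped at 10. Whether the cap applies per correspondence or per camera pair is not specified here, so the function simply clamps every term it is given; the name `neg_log_likelihood` is hypothetical.

```python
def neg_log_likelihood(sampson_errors, clamp=10.0):
    """-log p(I|x) up to a constant: sum of Sampson error terms, each capped at `clamp`."""
    return sum(min(e, clamp) for e in sampson_errors)

inliers = [0.4, 1.2, 0.1]
with_outlier = inliers + [2500.0]         # one grossly wrong match
cost_clean = neg_log_likelihood(inliers)
cost_outlier = neg_log_likelihood(with_outlier)
```

The outlier adds exactly 10 to the cost instead of 2500. Once a term is clamped it is constant, so it contributes no gradient and cannot drag the cameras toward a bad match.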

Implementation Details

Applied late: GGS runs only during the last 10 of the 100 diffusion steps. Early in the process, poses are too noisy for geometric constraints to help.
100 iterations per step: At each diffusion step where GGS is active, 100 gradient-descent iterations adjust the mean.
Adaptive strength: The guidance strength s is set so that the gradient norm doesn't exceed α · ||μ_t|| with α = 0.0001. This prevents geometric guidance from overwhelming the diffusion prior.

The best of both worlds: The diffusion model provides a strong learned prior over plausible camera configurations — the kind of thing that's hard to hand-engineer. GGS provides precise geometric constraints from classical SfM — the kind of thing that's hard to learn. Together, they exceed what either achieves alone. Removing GGS drops mAA(30) on CO3D from 66.5 to 56.0.

Why geometry guidance works so well

Neural networks tend to struggle with precise geometric regression. A network might learn that "cameras looking at the same object should face inward," but it is much harder for it to learn that "this specific camera's optical axis must pass within 0.5 pixels of this specific point." That level of precision comes from explicit geometric computation.

GGS provides this precision by computing exact Sampson errors from 2D correspondences. The gradient of the Sampson error tells the optimizer exactly which direction to nudge each camera to improve geometric consistency. The diffusion prior provides the rough neighborhood; GGS provides the final precision within that neighborhood.

The late application (last 10 steps) is critical: early in denoising, cameras are too scattered for epipolar constraints to be meaningful (random camera pairs have no real geometric relationship). By step 10, the diffusion model has already placed cameras approximately right — GGS just needs to fine-tune within a small region, which is exactly what gradient descent excels at.

Denoising with Geometry Guidance

Cameras (colored triangles) refine from random noise to clean poses as the timestep t runs from 100 down to 0. In the final steps (t ≤ 10), epipolar lines (dashed) appear as GGS activates, pulling cameras into geometric consistency.

Why is Geometry-Guided Sampling applied only in the last 10 of 100 diffusion steps?

Because early in the process poses are too noisy for epipolar constraints to provide meaningful geometric guidance; the constraints only help once poses are roughly correct
Because running GGS for all 100 steps would be too slow
Because the correspondence extractor needs cleaner images to find matches

Chapter 5: The Architecture

The denoiser D_θ is a transformer that processes the noisy camera parameters jointly with image features. Let's trace the data flow.

Inputs: Three Streams per Camera

For each of the N cameras, the transformer receives a token built from three components:

Noisy pose x_tⁱ: The 8-dimensional camera parameter vector at the current noise level. Projected through a linear layer to 96 dimensions.
Diffusion time t: The scalar timestep, also projected to 96 dimensions.
Image features ψ(Iⁱ): 384-dimensional DINO ViT-S/16 features for image Iⁱ, plus a 1-dimensional binary pivot flag (is this the canonical camera?). Total: 385 dimensions.

These three are concatenated into a single token per camera, then fed into the transformer.
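The token layout can be checked with a few lines of plain Python. The random linear layers below are placeholders for the learned projections, and feeding the timestep as a raw scalar is an assumption, since the text does not say how t is embedded before projection.

```python
import random

rng = random.Random(0)

def linear(in_dim, out_dim):
    """A randomly initialised linear layer (no bias), returned as a function on lists."""
    W = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return lambda x: [sum(w * v for w, v in zip(row, x)) for row in W]

pose_proj = linear(8, 96)     # noisy 8-dim pose -> 96
time_proj = linear(1, 96)     # scalar timestep -> 96

def build_token(noisy_pose, t, dino_feature, is_pivot):
    """Concatenate the three streams into one per-camera token."""
    image_part = list(dino_feature) + [1.0 if is_pivot else 0.0]   # 384 + 1 = 385
    return pose_proj(noisy_pose) + time_proj([float(t)]) + image_part

token = build_token([0.0] * 8, 42, [0.1] * 384, is_pivot=True)
```

The resulting token has 96 + 96 + 385 = 577 entries, with the pivot flag in the last position.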

The Transformer

The architecture is a standard transformer encoder with 8 layers, 4 attention heads, and feedforward dimension 1024. There's no decoder — all cameras attend to each other through self-attention. This is important: each camera can see every other camera's noisy pose and image features. The model learns to reason about multi-view consistency.

Output

The transformer's output tokens are passed through a 2-layer MLP (hidden dim 128, output dim 8) to produce the predicted clean camera parameters: log-focal-length f̂, quaternion q, and translation t for each camera.

PoseDiffusion Architecture

Data flow from inputs through the transformer to output camera parameters. All N cameras are processed jointly via self-attention.

The complete data flow with shapes

N input images

N × H × W × 3 (e.g., 10 images of the Colosseum)

↓

DINO ViT-S/16 feature extraction

Each image: center-crop + resize to 224×224 → extract at 3 scales (1x, 1/2, 1/3) → average → N × 384-dim features (CLS token)

↓

Build per-camera tokens

For each of N cameras: concat [pose x_t: 8 → project to 96] + [timestep t: 1 → project to 96] + [DINO feature: 384 + pivot flag: 1 = 385] = 577-dim token per camera

↓

Transformer encoder

N tokens × 577 dim → 8 layers, 4 heads, FFN dim 1024 → N tokens × 577 dim. All N cameras attend to each other.

↓

MLP head

N × 577 → 2-layer MLP (hidden 128) → N × 8 predicted clean cameras: [log(f), q_w, q_x, q_y, q_z, t_x, t_y, t_z]

↓

Post-process

Quaternion normalized to unit length, focal length = exp(f̂), translations unnormalized

Feature Extraction: DINO

Image features come from DINO ViT-S/16, pretrained in a self-supervised fashion. The images are center-cropped and resized to 224x224, then features are extracted at three scales (1x, 1/2, 1/3) and averaged for multi-scale understanding. DINO's weights are fine-tuned during training — this lets the feature extractor adapt to the pose-estimation task.

Coordinate Frame Canonicalization

SfM datasets define poses in arbitrary scene-specific coordinate frames. To prevent the model from overfitting to these arbitrary frames, PoseDiffusion canonicalizes all poses relative to a randomly selected pivot camera. The pivot camera gets identity rotation and zero translation. A binary flag in the input tells the model which camera is the pivot. Translations are further normalized by their median norm to handle scale ambiguity.
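Canonicalization can be written down directly from the world-to-camera convention p_c = R · p_w + t. In this sketch the new world frame is the pivot camera's frame, so each camera becomes (R_i R_p^T, t_i − R_i R_p^T t_p). Taking the median over non-pivot cameras only is an assumption, as is the function name.

```python
import math
import statistics

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(A):
    return [list(r) for r in zip(*A)]

def canonicalize(rotations, translations, pivot):
    """Re-express world-to-camera poses (R, t) in the pivot camera's frame, then fix scale."""
    Rp_T, tp = transpose(rotations[pivot]), translations[pivot]
    new_R, new_t = [], []
    for R, t in zip(rotations, translations):
        R_rel = matmul(R, Rp_T)                                     # R_i * R_p^T
        shift = [sum(R_rel[r][c] * tp[c] for c in range(3)) for r in range(3)]
        new_R.append(R_rel)
        new_t.append([t[r] - shift[r] for r in range(3)])           # t_i - R_rel * t_p
    norms = [math.sqrt(sum(v * v for v in t)) for i, t in enumerate(new_t) if i != pivot]
    scale = statistics.median(norms)                                # assumption: pivot excluded
    return new_R, [[v / scale for v in t] for t in new_t]

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Rz = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]          # 90 degrees about z
R_out, t_out = canonicalize([Rz, I3, Rz],
                            [[1.0, 2.0, 3.0], [0.0, 0.0, 5.0], [4.0, 0.0, 1.0]],
                            pivot=0)
```

After the call, the pivot camera has identity rotation and zero translation, and the remaining translations have median norm 1.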

Why self-attention over all cameras? Multi-view geometry is inherently about relationships between cameras. A single camera's pose is meaningless without reference to the others. Self-attention lets the model jointly reason about all cameras, naturally encoding constraints like "if camera 1 and camera 3 see similar features, they should be close together."

Model size: why so small?

PoseDiffusion's transformer has only 8 layers, 4 heads, and processes at most ~20 tokens. Compare:

Model	Tokens	Layers	Params	Task
PoseDiffusion	3-20	8	~5M	Denoise N camera poses
DiT-XL/2	256	28	675M	Denoise 32×32×4 latent
SD3-8B	~4500	38	8B	Denoise 128×128×16 latent

The tiny model works because: (1) the output space is 8N dimensions, not millions of pixels; (2) each token already carries rich 384-dim visual features from DINO; (3) the geometric reasoning required (relative camera placement) is compositionally simpler than photorealistic image generation. More parameters would likely overfit — CO3Dv2 has only ~37K scenes.

Why does PoseDiffusion canonicalize poses relative to a random pivot camera?

To make the model run faster
To ensure all training scenes use the same number of cameras
To prevent overfitting to arbitrary scene-specific coordinate frames: the model should learn geometry, not memorize reference frames

Chapter 6: Training

PoseDiffusion is trained with a remarkably simple objective: predict the clean cameras from noisy ones.

The Denoising Loss

At each training step, the model receives a batch of scenes with ground-truth cameras x₀ and images I. A random timestep t is sampled, noise is added to get x_t, and the denoiser predicts x₀:

L_diff = E_{t, x_t} || D_θ(x_t, t, I) − x₀ ||²

That's it. No adversarial losses, no perceptual losses, no multi-task heads. Just an L2 loss between predicted and ground-truth camera parameters, averaged over all cameras in the scene and all timesteps.
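One training step, as described, fits in a dozen lines. The sketch below draws a timestep, noises the ground-truth cameras with the closed form, and measures the squared error of a denoiser's x₀ prediction. The two lambda denoisers are hypothetical stand-ins for D_θ, used only to show the loss behaving as expected.

```python
import math
import random

T = 100
betas = [1e-3 + (0.2 - 1e-3) * i / (T - 1) for i in range(T)]
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def diffusion_loss(denoiser, x0, rng):
    """One Monte Carlo sample of L_diff for a scene given as a list of 8-dim cameras."""
    t = rng.randint(1, T)                                    # random timestep in 1..T
    a = alpha_bar[t - 1]
    x_t = [[math.sqrt(a) * v + math.sqrt(1 - a) * rng.gauss(0.0, 1.0) for v in cam]
           for cam in x0]                                    # noise every camera to level t
    pred = denoiser(x_t, t)
    sq = [(p - g) ** 2 for pc, gc in zip(pred, x0) for p, g in zip(pc, gc)]
    return sum(sq) / len(x0)                                 # averaged over cameras

scene = [[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
         [0.1, 0.0, 1.0, 0.0, 0.0, 0.5, 0.0, 1.0]]
perfect = lambda x_t, t: [list(c) for c in scene]            # oracle: always returns the truth
identity = lambda x_t, t: x_t                                # returns its noisy input unchanged
loss_perfect = diffusion_loss(perfect, scene, random.Random(0))
loss_identity = diffusion_loss(identity, scene, random.Random(0))
```

Only a single denoiser call appears in the loss, so the gradient never has to flow through more than one step of the chain.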

Training Data

Two datasets provide the training signal:

CO3Dv2: ~37,000 turntable-like videos of objects from 51 categories. Cameras annotated by COLMAP using 200 frames per video. Object-centric, mostly circular camera trajectories.
RealEstate10K: ~57,000 YouTube clips of interiors and exteriors. Cameras annotated by ORB-SLAM2 + bundle adjustment. Scene-centric, mostly linear fly-through trajectories.

Training Details

Optimizer: Adam, initial LR 0.0005, decayed 10x after 30 epochs
Batch sampling: Each batch randomly samples 3-20 frames from a random scene. This variable count teaches the model to handle any number of cameras.
Diffusion schedule: T = 100 steps, linear β schedule from 10^-3 to 0.2
Training time: 2 days on 8 GPUs until convergence

Frozen vs. trained components

Component	Parameters	Status
DINO ViT-S/16	~22M	Pretrained (self-supervised on ImageNet), fine-tuned during training
Transformer denoiser	~5M	Trained from scratch
MLP head	~75K	Trained from scratch
SuperPoint + SuperGlue	~13M	Frozen (pretrained, used only at inference for GGS)

Total trainable: ~27M parameters. This is remarkably small — orders of magnitude fewer than image diffusion models. The small model size is possible because: (1) the output space is only 8 numbers per camera, not millions of pixels; (2) DINO provides rich visual features without needing a massive backbone; (3) the transformer only processes N tokens (number of cameras), not thousands of image patches.

Engineering decisions

Why diffusion for poses instead of direct regression? A regression network maps images directly to a single pose prediction. The problem: multi-view pose estimation is inherently ambiguous (especially with few views), and the loss landscape has many local minima. Diffusion provides: (1) iterative refinement from coarse to fine (the 100-step chain), (2) implicit ensembling (can sample multiple times for uncertainty), and (3) a natural injection point for geometric constraints (GGS). The PoseReg ablation supports this: the same architecture trained as a direct regressor scores 48.2 mAA(30) on CO3Dv2, versus 56.0 for the diffusion model without GGS and 66.5 with GGS.

Why DINO instead of CLIP or ResNet? DINO ViT-S/16 provides features trained with self-supervised objectives that emphasize spatial structure and object parts — exactly what you need for geometric reasoning. CLIP features emphasize semantic similarity, which is less useful for precise pose estimation. ResNet features lack the global receptive field that ViT's self-attention provides.

Why fine-tune DINO? Generic DINO features are good but not specialized for pose. Fine-tuning lets the feature extractor learn to emphasize viewpoint-discriminative information (silhouettes, parallax cues) over semantic content (object identity). This is one of the keys to PoseDiffusion's performance.

No GGS during training: Geometry-Guided Sampling is only applied at inference time. During training, the model learns pure diffusion denoising. GGS is bolted on afterward as a test-time refinement. This is elegant: you can train the model once and then decide whether to use GGS at inference based on your accuracy/speed tradeoff.

A key advantage of the diffusion formulation: the model is trained one step at a time. Unlike methods that unroll an iterative solver and backpropagate through the full refinement chain, each training step only requires forward/backward through a single denoising step. This makes training tractable even for complex geometric reasoning.

Why is the one-step-at-a-time training an advantage of the diffusion formulation?

Because it avoids the need to backpropagate through the entire denoising chain; each step is supervised independently with a simple L2 loss, making training tractable
Because it requires fewer training images
Because it allows using a smaller transformer model

Chapter 7: Results

PoseDiffusion is evaluated on two challenging real-world datasets with different characteristics. The results are compelling.

CO3Dv2: Object-Centric Scenes

Each scene is a turntable-like video of a single object. Cameras orbit the object at roughly constant distance. PoseDiffusion significantly outperforms all baselines in both sparse and dense settings.

Method	RRA@15	RTA@15	mAA(30)
RelPose	57.1	—	—
COLMAP+SPSG	33.7	32.9	30.1
PixSfM	53.2	49.1	45.0
PoseReg (no diffusion)	57.0	53.4	48.2
Ours w/o GGS	75.9	72.8	56.0
PoseDiffusion	80.5	79.8	66.5

Key observations: (1) The diffusion model alone (w/o GGS) already beats every baseline. (2) Adding GGS provides a further 10+ point boost in mAA(30). (3) The non-diffusion baseline PoseReg with the same architecture scores much lower, validating that diffusion itself — not just the architecture — is responsible for the gains.
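For readers unfamiliar with the rotation metric in the table, here is a hedged sketch of how a relative rotation accuracy such as RRA@15 is commonly computed: compare predicted and ground-truth relative rotations for every camera pair and report the percentage within the angular threshold. The helper names are hypothetical, and the exact evaluation protocol behind the numbers above may differ in details.

```python
import math
from itertools import combinations

def matmul_T(A, B):
    """Return A * B^T for 3x3 matrices."""
    return [[sum(A[i][k] * B[j][k] for k in range(3)) for j in range(3)] for i in range(3)]

def rotation_angle_deg(R):
    """Angle of a rotation matrix, from trace(R) = 1 + 2 cos(theta)."""
    c = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def rra(pred, gt, threshold_deg=15.0):
    """Percentage of camera pairs whose relative rotation error is below the threshold."""
    hits, pairs = 0, 0
    for i, j in combinations(range(len(gt)), 2):
        rel_pred = matmul_T(pred[i], pred[j])      # R_i R_j^T, predicted
        rel_gt = matmul_T(gt[i], gt[j])            # R_i R_j^T, ground truth
        err = rotation_angle_deg(matmul_T(rel_pred, rel_gt))
        hits += err < threshold_deg
        pairs += 1
    return 100.0 * hits / pairs

def rot_z(deg):
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0], [math.sin(a), math.cos(a), 0.0], [0.0, 0.0, 1.0]]

gt = [rot_z(0), rot_z(40), rot_z(80)]
pred = [rot_z(0), rot_z(45), rot_z(110)]           # third camera is off by 30 degrees
score = rra(pred, gt)
```

In this toy case only the pair (0, 1) falls within 15 degrees, so one of three pairs counts as correct. Because the metric works on relative rotations, it does not depend on the choice of world frame.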

RealEstate10K: Scene-Centric Views

These are fly-through videos of real interiors and exteriors — the domain where COLMAP traditionally excels. Yet PoseDiffusion still wins across all metrics and frame counts.

Novel View Synthesis

To test whether the estimated cameras are truly useful, the authors train NeRFs using PoseDiffusion's output. The NeRF rendering quality (PSNR) matches or exceeds NeRFs trained with COLMAP cameras — and crucially, replacing predicted focal lengths with ground-truth makes no difference, proving that the intrinsic estimation is highly accurate.

Generalization

Perhaps the most impressive result: a model trained on 41 CO3Dv2 categories transfers to 10 unseen categories with only a small accuracy drop (50.8 to 48.0 mAA). Even more remarkably, transferring from CO3Dv2 (object-centric, circular trajectories) to RealEstate10K (scene-centric, linear trajectories) — a huge domain shift — produces results comparable to PixSfM.

Performance Comparison: mAA(30) on CO3Dv2

Mean Average Accuracy at 30 degrees. Higher is better. 10 input frames.

What degrades and when

Fewer images (N): With N=3 frames, mAA(30) drops to ~42 (from 66.5 at N=10). With N=2, the problem becomes highly ambiguous — there are infinite configurations consistent with just a stereo pair. More images = more mutual constraints = better accuracy.
Wider baselines: When cameras are spread far apart (wide-baseline), features share less overlap and the matching becomes harder. PoseDiffusion handles this better than COLMAP (which fails outright on many wide-baseline pairs) but still degrades. Extremely wide baselines (>120 degrees rotation) push error up significantly.
Without GGS: mAA(30) drops from 66.5 to 56.0 on CO3Dv2 — a 10.5 point drop. The model still beats all baselines without GGS, proving the diffusion prior alone is valuable. GGS provides geometric precision that the neural network alone cannot match.
Without diffusion (PoseReg baseline): Same architecture but trained as a direct regressor (no noise, no iterative denoising) scores only 48.2 mAA. Diffusion alone, without GGS, improves on this by 7.8 points (56.0 vs. 48.2); with GGS the gap grows to 18.3 points (66.5 vs. 48.2).
Domain shift: CO3Dv2 to RealEstate10K (object-centric circular → scene-centric linear) loses ~5-8 points. Unseen object categories within CO3Dv2 lose only 2.8 points (50.8 → 48.0), showing good within-domain generalization.

Concrete numbers: Total model size: ~27M parameters (tiny by modern standards). Training: 2 days on 8 GPUs (likely A100s). Inference without GGS: ~1 second for 20 frames (100 denoising steps, each a single transformer forward pass through N=20 tokens — this is very cheap). Inference with GGS: 60-90 seconds (100 gradient iterations × 10 GGS-active steps = 1000 optimization steps with Sampson error computation over all camera pairs). The GGS bottleneck is N² pairwise Sampson computations per iteration, not the neural network.
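The arithmetic behind the GGS bottleneck can be spelled out. This sketch counts unordered camera pairs, N(N-1)/2, and assumes every pair has correspondences and is evaluated once per gradient iteration; both are simplifying assumptions.

```python
from math import comb

def ggs_cost(n_cameras, guided_steps=10, iters_per_step=100):
    """Count optimisation iterations and pairwise Sampson evaluations for one sample."""
    iterations = guided_steps * iters_per_step
    pairs = comb(n_cameras, 2)                 # unordered camera pairs, N(N-1)/2
    return iterations, pairs, iterations * pairs

iterations, pairs, evaluations = ggs_cost(20)
```

For 20 frames that is 1000 iterations over 190 pairs, or 190,000 pairwise error evaluations, compared with only 100 transformer forward passes for the whole chain.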

Execution time: The GGS loop is unoptimized Python (100 iterations per step × 10 steps), which is where the 60-90 seconds go. That makes PoseDiffusion slower than COLMAP on easy sequences but competitive on the hard sparse-view cases where COLMAP often fails entirely.

What does the PoseReg ablation prove about PoseDiffusion?

That the diffusion framework itself, not just the transformer architecture, is responsible for the performance gains, since the same architecture without diffusion scores much lower
That a larger transformer would solve the problem without diffusion
That GGS is the only important component

Chapter 8: The Bundle Adjustment Connection

The full title of the PoseDiffusion paper ends with "Diffusion-aided Bundle Adjustment" for a reason. The connection runs deeper than a surface analogy.

Classical Bundle Adjustment

Bundle Adjustment (BA) is the gold standard for refining camera poses. Given initial camera estimates and 3D point estimates, BA minimizes the reprojection error: the sum of squared distances between observed 2D keypoints and the projected positions of the estimated 3D points. It's a nonlinear least-squares optimization, typically solved with Levenberg-Marquardt.

min_{x, P} ∑_{i,j} || p_jⁱ − π(Kⁱ, gⁱ, P_j) ||²

where p_jⁱ is the observed 2D position of point j in camera i, and π is the projection function.
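The BA objective can be evaluated in a few lines. The sketch below implements π for the simplified camera model used in this document (focal length plus fixed principal point) and sums squared pixel residuals. The data layout (tuples and a dict of observations) is an assumption made for brevity, and the minimization itself is not shown.

```python
def project(f, pp, R, t, P):
    """pi(K, g, P): world point -> camera frame via (R, t) -> pixels via f and principal point."""
    Xc = [sum(R[r][c] * P[c] for c in range(3)) + t[r] for r in range(3)]
    return (f * Xc[0] / Xc[2] + pp[0], f * Xc[1] / Xc[2] + pp[1])

def reprojection_error(cameras, points, observations):
    """Sum of squared pixel residuals over all (camera i, point j) observations."""
    total = 0.0
    for (i, j), (u, v) in observations.items():
        f, pp, R, t = cameras[i]
        pu, pv = project(f, pp, R, t, points[j])
        total += (u - pu) ** 2 + (v - pv) ** 2
    return total

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [(500.0, (112.0, 112.0), I3, [0.0, 0.0, 0.0]),
           (500.0, (112.0, 112.0), I3, [-1.0, 0.0, 0.0])]
points = [[0.0, 0.0, 4.0]]
exact = {(i, 0): project(*cameras[i], points[0]) for i in range(2)}
noisy = {key: (u + 1.0, v) for key, (u, v) in exact.items()}   # shift each observation by 1 px
err_exact = reprojection_error(cameras, points, exact)
err_noisy = reprojection_error(cameras, points, noisy)
```

Note that the objective needs the 3D points P_j as explicit unknowns, which is exactly the dependency that the Sampson-based guidance avoids.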

Diffusion as Implicit Bundle Adjustment

PoseDiffusion mirrors BA in several ways:

Iterative refinement: BA iterates Levenberg-Marquardt steps. PoseDiffusion iterates denoising steps. Both converge from a rough initialization to a precise solution.
Geometric constraints: BA explicitly minimizes reprojection error. GGS explicitly minimizes Sampson epipolar error. Both enforce multi-view geometric consistency.
Joint optimization: BA optimizes all cameras jointly. The transformer's self-attention reasons about all cameras jointly.

But PoseDiffusion also has advantages over classical BA:

Learned prior: BA has no prior over plausible camera configurations. PoseDiffusion has a strong learned prior from training on thousands of scenes. This prior is especially valuable when correspondences are sparse or unreliable.
No explicit 3D points: BA requires triangulating 3D points, which is a chicken-and-egg problem (you need poses to triangulate, but you need 3D points for BA). PoseDiffusion skips 3D point estimation entirely.
Single-step training: Learned pipelines that embed BA must unroll the full optimization to backpropagate gradients. PoseDiffusion trains one denoising step at a time.

BA vs. Diffusion: Iterative Refinement

Two paths to the same goal. Left: classical BA refines via gradient descent on reprojection error. Right: diffusion refines via learned denoising + geometric guidance. Both converge from rough to precise.

The deeper insight: PoseDiffusion suggests that many iterative optimization problems in geometry could be reformulated as diffusion processes. The diffusion framework provides a natural way to combine a learned prior with classical constraints, and the sampling process navigates complex optimization landscapes more robustly than gradient descent alone.

What is PoseDiffusion's key advantage over classical bundle adjustment?

It uses more GPU memory for better precision
It combines a strong learned prior over camera configurations with geometric constraints, allowing it to succeed even when correspondences are sparse or unreliable, situations where BA has no prior to fall back on
It runs faster than Levenberg-Marquardt

Chapter 9: Connections

PoseDiffusion sits at the intersection of classical geometry and deep generative modeling. Let's map where it connects to the broader landscape.

Relation to VGGSfM

VGGSfM (2024), from the same research group, pushes the hybrid idea further: it pairs learned camera prediction with differentiable Bundle Adjustment and jointly estimates 3D structure. Where PoseDiffusion skips 3D points entirely, VGGSfM uses them as an additional refinement signal. VGGSfM reports even higher accuracy, supporting the "learned initialization + classical refinement" paradigm.

Relation to Rooms from Motion

Rooms from Motion tackles indoor scene reconstruction from panoramic images. Like PoseDiffusion, it faces the challenge of wide baselines (rooms have very different viewpoints). Both methods show that learned priors can rescue reconstruction when classical matching fails.

Relation to COLMAP

COLMAP is the workhorse SfM pipeline that PoseDiffusion aims to replace. PoseDiffusion's training data is itself generated by COLMAP (on CO3Dv2) or ORB-SLAM (on RealEstate10K) — the student learns from the teacher, then surpasses it in the hardest cases. This is a recurring pattern in learned geometry: use classical methods to generate training data, then train a model that handles the failure modes better.

Relation to Classical SLAM

SLAM systems (ORB-SLAM, VINS-Mono) solve pose estimation in real-time. They use fast, approximate methods (PnP, essential matrix) and rely on temporal continuity. PoseDiffusion operates on unordered image sets, handles wider baselines, but is much slower. Future work could combine the best of both: SLAM for real-time tracking, PoseDiffusion for offline refinement.

Cheat Sheet

Aspect	PoseDiffusion
Input	N unordered images (any N)
Output	Camera extrinsics (quaternion + translation) + intrinsics (focal length)
Backbone	DINO ViT-S/16 (fine-tuned)
Denoiser	8-layer Transformer encoder, 4 heads
Diffusion	DDPM, T=100, x₀ prediction
Guidance	Sampson epipolar error via SuperPoint+SuperGlue
GGS schedule	Last 10 steps, 100 iterations each, adaptive strength
Training	L2 loss on clean cameras, 2 days on 8 GPUs
Key result	66.5 mAA(30) on CO3Dv2 vs. 45.0 for PixSfM

The broader lesson: Diffusion models are not just for generating images. Any iterative refinement problem — pose estimation, protein folding, planning — might benefit from the diffusion formulation. The key insight is that diffusion gives you a principled way to combine learned priors with test-time constraints, something that pure regression or pure optimization struggle to do.

What is the "student surpasses teacher" pattern in PoseDiffusion's training?

PoseDiffusion trains on camera poses generated by COLMAP/ORB-SLAM, then outperforms those same classical methods in the hardest cases by leveraging learned priors
The model distills a large teacher network into a smaller student
PoseDiffusion is pretrained on ImageNet before fine-tuning on pose data

PoseDiffusion: Solving Pose Estimation via Diffusion