Wang, Rupprecht, Novotny — Oxford VGG + Meta AI, 2023

PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment

What if estimating camera poses was just denoising? Start from random cameras, iteratively refine them through a learned diffusion process, and inject epipolar geometry constraints along the way. The result beats COLMAP on real-world scenes.

Prerequisites: Camera geometry (extrinsics/intrinsics) + Epipolar geometry basics + Diffusion models (DDPM)

Chapter 0: The Problem

You have ten tourist photos of the Colosseum, taken from different angles by different people. You want to figure out exactly where each camera was when the photo was taken — its position, its orientation, and its focal length. This is camera pose estimation, and it's the foundation of everything from 3D reconstruction to augmented reality.

The classical approach is a pipeline called Structure from Motion (SfM). It works in stages: detect keypoints (SIFT, SuperPoint), match them across images (nearest-neighbor, SuperGlue), use RANSAC to reject outliers, compute relative poses with the five-point algorithm, then run Bundle Adjustment to jointly refine all cameras and 3D points.

This pipeline is brittle. Each stage can fail, and failures cascade. If keypoint matching fails on a wide-baseline pair (two photos taken from very different positions), that image pair gets dropped. If too many pairs drop, the reconstruction collapses.

The Achilles' heel: Classical SfM pipelines depend on finding reliable point correspondences between images. When views are sparse and baselines are wide — exactly the scenario that matters for real applications — correspondences become unreliable or nonexistent. A small matching failure at the start can doom the entire reconstruction.

Learned methods like RelPose tried to bypass the correspondence problem by directly predicting camera poses from image features. But RelPose only predicts rotations, not translations, and it can't match the precision of Bundle Adjustment when many images are available.

We need something that can handle both sparse wide-baseline views and dense multi-view sequences — gracefully, without brittle handoffs between pipeline stages.

[Interactive figure: Keypoint Matching Fragility. Two views of the same scene; green lines are correct matches and red lines are incorrect. As the baseline widens, more matches fail.]

Why does classical SfM struggle with sparse, wide-baseline views?

Chapter 1: The Key Insight

Here's the idea that changes everything: what if camera pose estimation was just denoising?

Think about what a diffusion model does for images. You start with pure noise and gradually refine it into a coherent picture. At each step, the model nudges the noisy image slightly closer to something that looks real. The final result emerges through many small corrections, not one giant leap.

Now think about what Bundle Adjustment does. You start with rough camera poses (from a noisy initialization) and iteratively refine them until all the cameras are geometrically consistent. Each iteration makes small adjustments. The final result emerges through many corrections.

The parallel is striking. Both processes are iterative refinements from noise to signal. PoseDiffusion makes this parallel literal.

Step 1: Start from Noise
Sample N random camera poses from a Gaussian distribution. These are completely wrong — cameras pointing in random directions at random positions.
Step 2: Denoise
A learned denoiser (transformer) takes the noisy poses + image features and predicts slightly cleaner poses. Repeat T times.
Step 3: Guide with Geometry
At each step, nudge poses toward satisfying epipolar constraints from 2D point correspondences. This injects classical geometric reasoning into the diffusion process.
Step 4: Output
After T denoising steps, the poses have converged. Each camera has an extrinsic (rotation + translation) and intrinsic (focal length).
Why this matters: By framing pose estimation as diffusion, we get three things for free: (1) iterative refinement, like BA, without hand-engineering the optimization; (2) a natural way to inject geometric constraints via classifier guidance; (3) the ability to model uncertainty — multiple samples from p(x|I) show how confident the model is about each camera.

The model learns the conditional distribution p(x|I) — the probability of camera parameters x given images I. At test time, sampling from this distribution produces camera poses. Because the distribution is assumed to be near-delta (i.e., there's essentially one right answer for a given set of images), any sample is a valid pose estimate.

The inference pipeline, end to end

1. Feature extraction
N images → DINO ViT-S/16 → N × 384-dim features (one-time cost: ~0.1s)
2. Sample initial noise
x_100 ~ N(0, I): random quaternions, translations, focal lengths for all N cameras
3. Denoise (steps 100 to 11)
90 transformer forward passes, each predicting clean cameras from current noisy state + image features. ~0.8s total.
4. Denoise + GGS (steps 10 to 1)
10 steps with geometric guidance. Each: transformer prediction + 100 gradient descent iterations on Sampson error. ~60-90s total (bottleneck).
5. Output
N cameras: rotation (quaternion), translation (3D vector), focal length. Ready for NeRF, 3D reconstruction, or AR.
What is the core analogy that PoseDiffusion exploits?

Chapter 2: Background

Before we dive into PoseDiffusion's mechanics, let's nail down the three pieces of geometry it relies on.

Camera Extrinsics

A camera's extrinsic parameters describe where it is and which way it's pointing. Formally, extrinsics g = (R, t) consist of a rotation matrix R in SO(3) and a translation vector t in R^3. Together, they define a rigid-body transformation from world coordinates to camera coordinates:

p_c = R · p_w + t

PoseDiffusion represents the rotation as a unit quaternion q in H (4 numbers) and keeps the translation as a 3-vector, giving 7 numbers per camera for extrinsics.
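
To make the 7-number extrinsic representation concrete, here is a minimal NumPy sketch that turns a quaternion into a rotation matrix and applies the world-to-camera transform above. The (w, x, y, z) ordering and the helper names `quat_to_rotmat` and `world_to_camera` are assumptions made for illustration, not taken from the authors' code.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    # Re-normalize so that slightly non-unit inputs still give a valid rotation.
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def world_to_camera(q, t, p_world):
    """Apply the rigid-body transform p_c = R p_w + t."""
    R = quat_to_rotmat(q)
    return R @ np.asarray(p_world, dtype=float) + np.asarray(t, dtype=float)

# A 90-degree rotation about the z axis, plus a shift of 5 along the camera z axis.
q = [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]
p_c = world_to_camera(q, t=[0.0, 0.0, 5.0], p_world=[1.0, 0.0, 0.0])
print(p_c)
```

The re-normalization inside `quat_to_rotmat` matters in this setting: a network output is only approximately unit-norm, so the quaternion has to be projected back onto the unit sphere before it is used as a rotation.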

Camera Intrinsics

The intrinsic parameters describe how the camera projects 3D points onto its 2D sensor. The calibration matrix K maps a 3D camera-space point to a 2D pixel:

K = [f, 0, px ; 0, f, py ; 0, 0, 1]

PoseDiffusion simplifies this to one degree of freedom: the focal length f. The principal point (px, py) is fixed at the image center, which is standard in SfM. The focal length is predicted as f = exp(f̂) to guarantee it's always positive. This adds 1 number per camera, for a total of 8 parameters per camera.
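
A small sketch of the one-parameter intrinsics: build K from the raw prediction f̂ and project a camera-space point. Expressing the focal length in pixels and taking the principal point as half the image size are assumptions for this example; the function names are hypothetical.

```python
import numpy as np

def calibration_matrix(f_hat, width, height):
    """Build K from the log-focal-length f_hat, with the principal point at the image center."""
    f = np.exp(f_hat)                      # exp keeps the focal length positive
    px, py = width / 2.0, height / 2.0     # fixed principal point (assumed pixel units)
    return np.array([[f, 0.0, px],
                     [0.0, f, py],
                     [0.0, 0.0, 1.0]])

def project(K, p_cam):
    """Project a 3D camera-space point to 2D pixel coordinates."""
    uvw = K @ np.asarray(p_cam, dtype=float)
    return uvw[:2] / uvw[2]                # perspective divide

K = calibration_matrix(f_hat=np.log(500.0), width=640, height=480)
pixel = project(K, [0.2, -0.1, 2.0])
print(pixel)
```

Predicting f̂ and exponentiating means the network's output is unconstrained, while the focal length that enters K can never be zero or negative.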

Epipolar Geometry

Given two cameras with known poses, there's a fundamental relationship between corresponding points. If a point p1 in image 1 corresponds to a point p2 in image 2, then the epipolar constraint says:

p̃_2^T F p̃_1 = 0

where F is the Fundamental Matrix computed from the two cameras' poses and intrinsics, and p̃ denotes homogeneous coordinates. This says: if you know one point in image 1, the corresponding point in image 2 must lie on a specific line (the epipolar line). If the poses are correct, every correct correspondence satisfies this constraint. If correspondences violate it, either the poses are wrong or the matches are.

The Sampson Error: Using the raw residual p̃_2^T F p̃_1 as an error is poorly behaved, because it is an algebraic quantity whose scale has no direct geometric meaning. The Sampson Epipolar Error is a first-order approximation to the geometric error that is much better behaved. It normalizes the residual by the gradient of the constraint, so it approximates the squared pixel distance to the epipolar line rather than an algebraic residual. This is what PoseDiffusion minimizes during guided sampling.
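
The sketch below shows one common way to compute the Fundamental Matrix from two world-to-camera poses and evaluate the Sampson error of a single correspondence. It uses the textbook formula; the function names and the toy two-camera scene are assumptions for illustration, and the authors' implementation may differ in conventions and batching.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ u equals np.cross(v, u)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def fundamental_matrix(K1, R1, t1, K2, R2, t2):
    """F relating image 1 to image 2, for world-to-camera poses p_c = R p_w + t."""
    R = R2 @ R1.T                  # relative rotation, camera 1 -> camera 2
    t = t2 - R @ t1                # relative translation
    E = skew(t) @ R                # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def sampson_error(F, p1, p2):
    """First-order approximation of the squared distance to the epipolar constraint."""
    x1, x2 = np.append(p1, 1.0), np.append(p2, 1.0)   # homogeneous pixel coordinates
    Fx1, Ftx2 = F @ x1, F.T @ x2
    residual = x2 @ Fx1                                # algebraic residual
    grad_sq = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return residual ** 2 / grad_sq

def project(K, R, t, P):
    """Project a world point into a camera with intrinsics K and pose (R, t)."""
    uvw = K @ (R @ P + t)
    return uvw[:2] / uvw[2]

# Toy scene: two identical cameras, the second one shifted sideways.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R1, t1 = np.eye(3), np.zeros(3)
R2, t2 = np.eye(3), np.array([-1.0, 0.0, 0.0])
P = np.array([0.5, 0.2, 4.0])
p1, p2 = project(K, R1, t1, P), project(K, R2, t2, P)

F = fundamental_matrix(K, R1, t1, K, R2, t2)
err_good = sampson_error(F, p1, p2)                          # geometrically consistent match
err_bad = sampson_error(F, p1, p2 + np.array([0.0, 3.0]))    # match moved 3 px off the line
print(err_good, err_bad)
```

A consistent match gives an error near zero, while moving the match off its epipolar line gives an error measured in squared pixels, which is what makes the quantity usable as a guidance signal.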
How many parameters per camera does PoseDiffusion predict?

Chapter 3: Diffusion for Poses

PoseDiffusion adapts DDPM (Denoising Diffusion Probabilistic Models) to the domain of camera parameters. Let's trace how the standard DDPM machinery maps onto this problem.

The Forward Process: Adding Noise to Cameras

Start with ground-truth camera parameters x0. The forward process adds Gaussian noise over T steps:

q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) · x_{t-1}, β_t I)

After T steps, the cameras are indistinguishable from pure Gaussian noise. The variance schedule β_1, ..., β_T controls how fast this happens. PoseDiffusion uses T = 100 steps with a linear schedule from 10^-3 to 0.2.

A convenient closed-form lets us jump directly to any timestep:

x_t ~ N(√ᾱ_t · x_0, (1 − ᾱ_t) I)

where ᾱ_t = ∏_{i=1}^{t} (1 − β_i). This is essential for efficient training: we can sample any noise level without simulating the full chain.
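
As a quick sanity check on the schedule, the sketch below builds the linear β schedule stated above (T = 100, from 10^-3 to 0.2), computes ᾱ_t, and uses the closed form to jump to an arbitrary timestep. The camera values are random stand-ins and `q_sample` is a hypothetical helper name.

```python
import numpy as np

T = 100
betas = np.linspace(1e-3, 0.2, T)       # beta_1 ... beta_T (linear schedule)
alpha_bar = np.cumprod(1.0 - betas)     # alpha_bar_t = prod_{i<=t} (1 - beta_i)

def q_sample(x0, t, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) I); t is 1-indexed."""
    a = alpha_bar[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((10, 8))       # 10 cameras x 8 parameters (stand-in values)
x_mid = q_sample(x0, 50, rng)           # jump straight to t = 50

print(alpha_bar[0], alpha_bar[-1])      # signal kept after 1 step vs. after all T steps
```

With this schedule ᾱ_T is tiny, so x_T carries essentially no trace of x_0. That is what justifies starting inference from pure Gaussian noise.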

The Reverse Process: Denoising Cameras

The reverse process is what we actually use at inference. Starting from pure noise xT ~ N(0, I), we iteratively denoise:

p_θ(x_{t-1} | x_t, I) = N(x_{t-1}; √α_t · D_θ(x_t, t, I), (1 − α_t) I)

The denoiser Dθ takes the current noisy cameras xt, the timestep t, and image features I, and predicts the clean cameras. PoseDiffusion uses the "x0 prediction" formulation: the network directly predicts the clean signal rather than the noise. This is empirically more stable for camera parameters.

What Makes This Different from Image Diffusion?

In image diffusion, each sample is a grid of pixels: high-dimensional and spatially structured. Here, each sample is a set of camera parameter vectors. Two differences stand out: the prediction target and the scale of the problem.

Why "x0 prediction" and not "noise prediction"? Most image diffusion models predict the noise ε that was added. PoseDiffusion instead predicts the clean cameras x0 directly. The authors found this more stable, likely because camera parameters have hard geometric constraints (rotation quaternions must be unit norm) that are easier to enforce in the output space than in the noise space.

Concrete diffusion setup

Scale of the problem: PoseDiffusion diffuses over a tiny space (8N dimensions) compared to image models (thousands to millions of dimensions). This is why the denoiser can be so small (~5M params vs. 675M for DiT or 8B for SD3). But the geometric precision required is much higher — a 1-degree rotation error is immediately visible in reconstruction. The diffusion framework helps by providing many refinement steps, and GGS provides the final geometric precision.
Why does PoseDiffusion use x0 prediction instead of noise prediction?

Chapter 4: Geometry-Guided Sampling

The diffusion denoiser alone can produce reasonable poses, but it's a feed-forward neural network — and neural networks are notoriously bad at regressing precise geometric quantities like rotation angles and translation vectors. PoseDiffusion's secret weapon is Geometry-Guided Sampling (GGS): injecting classical epipolar constraints directly into the diffusion sampling process.

The Mechanism: Classifier Guidance for Geometry

Recall that in classifier-guided diffusion for images, you steer samples toward a desired class by adding the gradient of a classifier to the denoising step. PoseDiffusion does the same thing, but the "classifier" is replaced by an epipolar geometry likelihood.

At each denoising step t, the predicted mean μ_{t-1} is adjusted:

μ̂_{t-1} = D_θ(x_t, t, I) + s · ∇_x log p(I | x_t)

The gradient ∇ log p(I|x) pushes the poses toward satisfying epipolar constraints from 2D correspondences. The scalar s controls guidance strength.

The Sampson Likelihood

The likelihood p(I|x) is modeled as a product of exponential distributions over pairwise Sampson Errors:

p(I | x) = ∏_{i,j} exp(−e_ij)

where e_ij is the Sampson Epipolar Error between cameras i and j, computed from 2D correspondences extracted by SuperPoint+SuperGlue. The Sampson error is clamped at ε = 10 to handle outlier correspondences robustly.

The gradient of log p(I|x) is just the negative gradient of the total Sampson error across all pairs. It tells the optimizer: "adjust these camera poses so that epipolar lines pass closer to the matched keypoints."
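
In code, the log-likelihood is just a negated sum of clamped errors. The sketch below assumes the clamp is applied to each error term separately (the text does not say whether it is per pair or per correspondence), and `guidance_log_likelihood` is a hypothetical name.

```python
import numpy as np

def guidance_log_likelihood(sampson_errors, clamp=10.0):
    """log p(I | x) = -sum min(e, clamp), up to an additive constant."""
    e = np.minimum(np.asarray(sampson_errors, dtype=float), clamp)
    return -e.sum()

# Two reasonable matches and one gross outlier.
errors = [0.5, 2.0, 250.0]
log_p = guidance_log_likelihood(errors)
print(log_p)  # the outlier contributes only the clamp value
```

A useful side effect of clamping with a minimum: any term above the clamp is constant, so its gradient with respect to the poses is zero and the outlier stops pulling on the cameras altogether.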

Implementation Details

The best of both worlds: The diffusion model provides a strong learned prior over plausible camera configurations — the kind of thing that's hard to hand-engineer. GGS provides precise geometric constraints from classical SfM — the kind of thing that's hard to learn. Together, they exceed what either achieves alone. Removing GGS drops mAA(30) on CO3D from 66.5 to 56.0.

Why geometry guidance works so well

Neural networks tend to struggle with precise geometric regression. A network might learn that "cameras looking at the same object should face inward," but it is unlikely to learn that "this specific camera's optical axis must pass within 0.5 pixels of this specific point." That level of precision comes from explicit geometric computation.

GGS provides this precision by computing exact Sampson errors from 2D correspondences. The gradient of the Sampson error tells the optimizer exactly which direction to nudge each camera to improve geometric consistency. The diffusion prior provides the rough neighborhood; GGS provides the final precision within that neighborhood.

The late application (last 10 steps) is critical: early in denoising, cameras are too scattered for epipolar constraints to be meaningful (random camera pairs have no real geometric relationship). By step 10, the diffusion model has already placed cameras approximately right — GGS just needs to fine-tune within a small region, which is exactly what gradient descent excels at.
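
The schedule described above can be sketched as a plain loop. This is only a skeleton of the control flow under the stated numbers (T = 100, GGS in the last 10 steps, 100 gradient iterations per step); `denoise_step` and `guide_step` are hypothetical callables standing in for the transformer-based reverse step and one gradient update on the Sampson error.

```python
def sample_poses(denoise_step, guide_step, x_T, T=100, ggs_steps=10, ggs_iters=100):
    """Run T reverse steps; refine with geometric guidance during the last ggs_steps."""
    x = x_T
    for t in range(T, 0, -1):            # t = T, T-1, ..., 1
        x = denoise_step(x, t)           # one denoiser forward pass
        if t <= ggs_steps:               # guidance only at low noise levels
            for _ in range(ggs_iters):   # gradient iterations on the Sampson error
                x = guide_step(x)
    return x

# Stand-in callables that only count how often they are invoked.
calls = {"denoise": 0, "guide": 0}

def fake_denoise(x, t):
    calls["denoise"] += 1
    return x

def fake_guide(x):
    calls["guide"] += 1
    return x

sample_poses(fake_denoise, fake_guide, x_T=None)
print(calls)
```

Counting calls makes the cost asymmetry visible: the network runs once per step, while the guidance update runs a hundred times in each of the final steps.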

[Interactive figure: Denoising with Geometry Guidance. A slider steps through the denoising timesteps, starting at t = 100. Cameras (colored triangles) refine from random noise to clean poses. In the final steps (t ≤ 10), epipolar lines (dashed) appear as GGS activates, pulling cameras into geometric consistency.]

Why is Geometry-Guided Sampling applied only in the last 10 of 100 diffusion steps?

Chapter 5: The Architecture

The denoiser Dθ is a transformer that processes the noisy camera parameters jointly with image features. Let's trace the data flow.

Inputs: Three Streams per Camera

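For each of the N cameras, the transformer receives a token built from three components:

1. The noisy pose xt (8 numbers), projected to 96 dimensions.
2. The timestep t, projected to 96 dimensions.
3. The image's DINO feature (384 dimensions) with a binary pivot flag appended (385 dimensions in total).

These three are concatenated into a single 577-dimensional token per camera, then fed into the transformer.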

The Transformer

The architecture is a standard transformer encoder with 8 layers, 4 attention heads, and feedforward dimension 1024. There's no decoder — all cameras attend to each other through self-attention. This is important: each camera can see every other camera's noisy pose and image features. The model learns to reason about multi-view consistency.

Output

The transformer's output tokens are passed through a 2-layer MLP (hidden dim 128, output dim 8) to produce the predicted clean camera parameters: log-focal-length f̂, quaternion q, and translation t for each camera.

[Figure: PoseDiffusion Architecture. Data flow from inputs through the transformer to output camera parameters. All N cameras are processed jointly via self-attention.]

The complete data flow with shapes

N input images
N × H × W × 3 (e.g., 10 images of the Colosseum)
DINO ViT-S/16 feature extraction
Each image: center-crop + resize to 224×224 → extract at 3 scales (1x, 1/2, 1/3) → average → N × 384-dim features (CLS token)
Build per-camera tokens
For each of N cameras: concat [pose xt: 8 → project to 96] + [timestep t: 1 → project to 96] + [DINO feature: 384 + pivot flag: 1 = 385] = 577-dim token per camera
Transformer encoder
N tokens × 577 dim → 8 layers, 4 heads, FFN dim 1024 → N tokens × 577 dim. All N cameras attend to each other.
MLP head
N × 577 → 2-layer MLP (hidden 128) → N × 8 predicted clean cameras: [log(f), qw, qx, qy, qz, tx, ty, tz]
Post-process
Quaternion normalized to unit length, focal length = exp(f̂), translations unnormalized
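
To check the shape bookkeeping in the token-building step, here is a NumPy sketch that assembles the per-camera tokens. The random matrices `W_pose` and `W_time` are stand-ins for learned projections (the real model may use a different embedding), the features are random placeholders for DINO outputs, and the transformer itself is omitted. One caveat: a width of 577 is not divisible by 4 heads, so a standard multi-head attention layer would presumably need a projection to a suitable model width first; that detail is not covered by the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10                                     # number of cameras / images
poses = rng.standard_normal((N, 8))        # noisy camera parameters x_t
feats = rng.standard_normal((N, 384))      # placeholder for DINO features
t = 37                                     # current diffusion timestep

W_pose = rng.standard_normal((8, 96)) * 0.1   # stand-in for the learned pose projection
W_time = rng.standard_normal((1, 96)) * 0.1   # stand-in for the learned timestep projection

pivot_flag = np.zeros((N, 1))
pivot_flag[0] = 1.0                        # camera 0 plays the role of the pivot

pose_emb = poses @ W_pose                          # (N, 96)
time_emb = np.full((N, 1), t / 100.0) @ W_time     # (N, 96), same timestep for every camera
tokens = np.concatenate([pose_emb, time_emb, feats, pivot_flag], axis=1)

print(tokens.shape)   # one row per camera: 96 + 96 + 384 + 1 columns
```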

Feature Extraction: DINO

Image features come from DINO ViT-S/16, pretrained in a self-supervised fashion. The images are center-cropped and resized to 224x224, then features are extracted at three scales (1x, 1/2, 1/3) and averaged for multi-scale understanding. DINO's weights are fine-tuned during training — this lets the feature extractor adapt to the pose-estimation task.

Coordinate Frame Canonicalization

SfM datasets define poses in arbitrary scene-specific coordinate frames. To prevent the model from overfitting to these arbitrary frames, PoseDiffusion canonicalizes all poses relative to a randomly selected pivot camera. The pivot camera gets identity rotation and zero translation. A binary flag in the input tells the model which camera is the pivot. Translations are further normalized by their median norm to handle scale ambiguity.
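
A minimal sketch of the canonicalization, assuming world-to-camera poses (p_c = R p_w + t as in Chapter 2). Taking the median over the non-pivot cameras is an assumption; the text only says "median norm". The helper names are hypothetical.

```python
import numpy as np

def canonicalize(Rs, ts, pivot=0):
    """Re-express poses in the pivot camera's frame, then rescale translations."""
    R_p, t_p = Rs[pivot], ts[pivot]
    Rs_new = Rs @ R_p.T                    # R_i' = R_i R_p^T, so the pivot becomes identity
    ts_new = ts - Rs_new @ t_p             # t_i' = t_i - R_i' t_p, so the pivot becomes zero
    mask = np.arange(len(ts)) != pivot     # assumption: median over non-pivot cameras
    scale = np.median(np.linalg.norm(ts_new[mask], axis=1))
    return Rs_new, ts_new / scale

def rot_z(angle):
    """Rotation about the z axis, used only to build a toy example."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
Rs = np.stack([rot_z(a) for a in (0.3, 1.0, -0.7, 2.0)])   # (4, 3, 3)
ts = rng.standard_normal((4, 3))                            # (4, 3)
Rs_c, ts_c = canonicalize(Rs, ts, pivot=0)

print(Rs_c[0])   # pivot rotation after canonicalization
print(ts_c[0])   # pivot translation after canonicalization
```

Because the pivot is chosen at random during training, the same scene is seen in many different canonical frames, which acts as a form of data augmentation against frame-specific overfitting.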

Why self-attention over all cameras? Multi-view geometry is inherently about relationships between cameras. A single camera's pose is meaningless without reference to the others. Self-attention lets the model jointly reason about all cameras, naturally encoding constraints like "if camera 1 and camera 3 see similar features, they should be close together."

Model size: why so small?

PoseDiffusion's transformer has only 8 layers, 4 heads, and processes at most ~20 tokens. Compare:

Model | Tokens | Layers | Params | Task
PoseDiffusion | 3-20 | 8 | ~5M | Denoise N camera poses
DiT-XL/2 | 256 | 28 | 675M | Denoise 32×32×4 latent
SD3-8B | ~4500 | 38 | 8B | Denoise 128×128×16 latent

The tiny model works because: (1) the output space is 8N dimensions, not millions of pixels; (2) each token already carries rich 384-dim visual features from DINO; (3) the geometric reasoning required (relative camera placement) is compositionally simpler than photorealistic image generation. More parameters would likely overfit — CO3Dv2 has only ~37K scenes.

Why does PoseDiffusion canonicalize poses relative to a random pivot camera?

Chapter 6: Training

PoseDiffusion is trained with a remarkably simple objective: predict the clean cameras from noisy ones.

The Denoising Loss

At each training step, the model receives a batch of scenes with ground-truth cameras x0 and images I. A random timestep t is sampled, noise is added to get xt, and the denoiser predicts x0:

L_diff = E_{t, x_t} || D_θ(x_t, t, I) − x_0 ||^2

That's it. No adversarial losses, no perceptual losses, no multi-task heads. Just an L2 loss between predicted and ground-truth camera parameters, averaged over all cameras in the scene and all timesteps.
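
One training step can be sketched in a few lines. This version uses NumPy and a stand-in denoiser so it runs without a deep-learning framework; the schedule values come from Chapter 3, the uniform sampling of t is an assumption, and `diffusion_loss` is a hypothetical name.

```python
import numpy as np

T = 100
betas = np.linspace(1e-3, 0.2, T)
alpha_bar = np.cumprod(1.0 - betas)

def diffusion_loss(denoiser, x0, image_feats, rng):
    """One Monte Carlo sample of L_diff for a single scene."""
    t = int(rng.integers(1, T + 1))                       # random timestep in 1..T
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise      # closed-form forward process
    x0_pred = denoiser(x_t, t, image_feats)               # x0 prediction
    return float(np.mean(np.sum((x0_pred - x0) ** 2, axis=-1)))   # averaged over cameras

rng = np.random.default_rng(0)
x0 = rng.standard_normal((10, 8))          # ground-truth cameras (stand-in values)
feats = rng.standard_normal((10, 384))     # placeholder image features

oracle = lambda x_t, t, f: x0              # a perfect denoiser, for reference
identity = lambda x_t, t, f: x_t           # a denoiser that does nothing

print(diffusion_loss(oracle, x0, feats, rng))
print(diffusion_loss(identity, x0, feats, rng))
```

In a real training loop the denoiser would be the transformer, and this scalar would be backpropagated through it and through the fine-tuned DINO backbone.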

Training Data

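Two datasets provide the training signal: CO3Dv2 (object-centric, turntable-like videos) and RealEstate10K (scene-centric fly-through videos), the same two datasets used for evaluation.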

Training Details

Frozen vs. trained components

Component | Parameters | Status
DINO ViT-S/16 | ~22M | Pretrained (self-supervised on ImageNet), fine-tuned during training
Transformer denoiser | ~5M | Trained from scratch
MLP head | ~75K | Trained from scratch
SuperPoint + SuperGlue | ~13M | Frozen (pretrained, used only at inference for GGS)

Total trainable: ~27M parameters. This is remarkably small — orders of magnitude fewer than image diffusion models. The small model size is possible because: (1) the output space is only 8 numbers per camera, not millions of pixels; (2) DINO provides rich visual features without needing a massive backbone; (3) the transformer only processes N tokens (number of cameras), not thousands of image patches.
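
The ~75K figure for the MLP head can be reproduced with a little arithmetic, assuming the head takes the 577-dimensional token as input and that both linear layers have bias terms (both are assumptions about details the table does not spell out).

```python
def linear_params(d_in, d_out):
    """Parameter count of a dense layer with bias."""
    return d_in * d_out + d_out

# 2-layer MLP head: 577 -> 128 -> 8
mlp_head = linear_params(577, 128) + linear_params(128, 8)

# Trainable total, using the rounded table values for the two larger components.
trainable_millions = 22 + 5 + mlp_head / 1e6

print(mlp_head)
print(round(trainable_millions, 2))
```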

Engineering decisions

Why diffusion for poses instead of direct regression? A regression network maps images directly to a single pose prediction. The problem: multi-view pose estimation is inherently ambiguous (especially with few views), and the loss landscape has many local minima. Diffusion provides: (1) iterative refinement from coarse to fine (the 100-step chain), (2) implicit ensembling (can sample multiple times for uncertainty), and (3) a natural injection point for geometric constraints (GGS). The PoseReg ablation supports this: the same architecture trained as a direct regressor scores 48.2 mAA(30) on CO3Dv2, versus 56.0 for the diffusion model without GGS and 66.5 with GGS.

Why DINO instead of CLIP or ResNet? DINO ViT-S/16 provides features trained with self-supervised objectives that emphasize spatial structure and object parts — exactly what you need for geometric reasoning. CLIP features emphasize semantic similarity, which is less useful for precise pose estimation. ResNet features lack the global receptive field that ViT's self-attention provides.

Why fine-tune DINO? Generic DINO features are good but not specialized for pose. Fine-tuning lets the feature extractor learn to emphasize viewpoint-discriminative information (silhouettes, parallax cues) over semantic content (object identity). This is one of the keys to PoseDiffusion's performance.

No GGS during training: Geometry-Guided Sampling is only applied at inference time. During training, the model learns pure diffusion denoising. GGS is bolted on afterward as a test-time refinement. This is elegant: you can train the model once and then decide whether to use GGS at inference based on your accuracy/speed tradeoff.

A key advantage of the diffusion formulation: the model is trained one step at a time. Unlike unrolled iterative-refinement networks, which backpropagate through the whole chain of refinement steps, each training step only requires a forward and backward pass through a single denoising step. This keeps training tractable even for complex geometric reasoning.

Why is the one-step-at-a-time training an advantage of the diffusion formulation?

Chapter 7: Results

PoseDiffusion is evaluated on two challenging real-world datasets with different characteristics. The results are compelling.

CO3Dv2: Object-Centric Scenes

Each scene is a turntable-like video of a single object. Cameras orbit the object at roughly constant distance. PoseDiffusion significantly outperforms all baselines in both sparse and dense settings.

Method | RRA@15 | RTA@15 | mAA(30)
RelPose | 57.1 | - | -
COLMAP+SPSG | 33.7 | 32.9 | 30.1
PixSfM | 53.2 | 49.1 | 45.0
PoseReg (no diffusion) | 57.0 | 53.4 | 48.2
Ours w/o GGS | 75.9 | 72.8 | 56.0
PoseDiffusion | 80.5 | 79.8 | 66.5

Key observations: (1) The diffusion model alone (w/o GGS) already beats every baseline. (2) Adding GGS provides a further 10+ point boost in mAA(30). (3) The non-diffusion baseline PoseReg with the same architecture scores much lower, validating that diffusion itself — not just the architecture — is responsible for the gains.

RealEstate10K: Scene-Centric Views

These are fly-through videos of real interiors and exteriors — the domain where COLMAP traditionally excels. Yet PoseDiffusion still wins across all metrics and frame counts.

Novel View Synthesis

To test whether the estimated cameras are truly useful, the authors train NeRFs using PoseDiffusion's output. The NeRF rendering quality (PSNR) matches or exceeds that of NeRFs trained with COLMAP cameras. Replacing the predicted focal lengths with ground truth makes no measurable difference, which suggests that the intrinsic estimation is accurate enough for this task.

Generalization

Perhaps the most impressive result: a model trained on 41 CO3Dv2 categories transfers to 10 unseen categories with only a small accuracy drop (50.8 to 48.0 mAA). Even more remarkably, transferring from CO3Dv2 (object-centric, circular trajectories) to RealEstate10K (scene-centric, linear trajectories) — a huge domain shift — produces results comparable to PixSfM.

[Figure: Performance Comparison, mAA(30) on CO3Dv2. Mean Average Accuracy at 30 degrees; higher is better. 10 input frames.]

What degrades and when

Concrete numbers: Total model size: ~27M parameters (tiny by modern standards). Training: 2 days on 8 GPUs (likely A100s). Inference without GGS: ~1 second for 20 frames (100 denoising steps, each a single transformer forward pass over N = 20 tokens, which is very cheap). Inference with GGS: 60-90 seconds (100 gradient iterations × 10 GGS-active steps = 1000 optimization steps in an unoptimized Python loop, each computing the Sampson error over all camera pairs). The GGS bottleneck is the O(N²) pairwise Sampson computations per iteration, not the neural network.
Execution time in context: With GGS, PoseDiffusion is slower than COLMAP on easy sequences but competitive in the hard sparse-view cases where COLMAP often fails entirely.
What does the PoseReg ablation prove about PoseDiffusion?

Chapter 8: The Bundle Adjustment Connection

The paper's full title, "PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment", names Bundle Adjustment for a reason. The connection runs deeper than a surface analogy.

Classical Bundle Adjustment

Bundle Adjustment (BA) is the gold standard for refining camera poses. Given initial camera estimates and 3D point estimates, BA minimizes the reprojection error: the sum of squared distances between observed 2D keypoints and the projected positions of the estimated 3D points. It's a nonlinear least-squares optimization, typically solved with Levenberg-Marquardt.

min_{x, P} Σ_{i,j} || p_i^j − π(K_i, g_i, P_j) ||^2

where p_i^j is the observed 2D position of point j in camera i, and π is the projection function.
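
For comparison with the Sampson-based guidance, here is a small sketch of the reprojection objective that BA minimizes. It only evaluates the cost; a real BA solver would minimize it over cameras and points with Levenberg-Marquardt. The data layout (a dict keyed by camera and point index) is an assumption chosen for readability.

```python
import numpy as np

def project(K, R, t, P):
    """pi(K, g, P): project world point P into a camera with pose g = (R, t)."""
    uvw = K @ (R @ P + t)
    return uvw[:2] / uvw[2]

def reprojection_error(cameras, points, observations):
    """Sum of squared pixel residuals over all (camera i, point j) observations."""
    total = 0.0
    for (i, j), p_obs in observations.items():
        K, R, t = cameras[i]
        residual = np.asarray(p_obs, dtype=float) - project(K, R, t, points[j])
        total += float(residual @ residual)
    return total

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cameras = [(K, np.eye(3), np.zeros(3)),
           (K, np.eye(3), np.array([-1.0, 0.0, 0.0]))]
points = np.array([[0.5, 0.2, 4.0], [-0.3, 0.1, 5.0]])

# Perfect observations: every point projected into every camera.
obs = {(i, j): project(*cameras[i], points[j]) for i in range(2) for j in range(2)}
exact = reprojection_error(cameras, points, obs)

# Shift one observation by (3, 4) pixels.
obs[(0, 0)] = obs[(0, 0)] + np.array([3.0, 4.0])
perturbed = reprojection_error(cameras, points, obs)
print(exact, perturbed)
```

Note what this cost needs that the Sampson error does not: explicit 3D points P_j. That is the practical difference between full BA and the pairwise epipolar guidance used in GGS.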

Diffusion as Implicit Bundle Adjustment

PoseDiffusion mirrors BA in several ways:

But PoseDiffusion also has advantages over classical BA:

[Figure: BA vs. Diffusion, Iterative Refinement. Two paths to the same goal. Left: classical BA refines via gradient descent on reprojection error. Right: diffusion refines via learned denoising + geometric guidance. Both converge from rough to precise.]

The deeper insight: PoseDiffusion suggests that many iterative optimization problems in geometry could be reformulated as diffusion processes. The diffusion framework provides a natural way to combine a learned prior with classical constraints, and the sampling process navigates complex optimization landscapes more robustly than gradient descent alone.
What is PoseDiffusion's key advantage over classical bundle adjustment?

Chapter 9: Connections

PoseDiffusion sits at the intersection of classical geometry and deep generative modeling. Let's map where it connects to the broader landscape.

Relation to VGGSfM

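VGGSfM (2024), from an overlapping set of authors, pursues the same goal of combining learning with classical geometry, but by a different route: it does not rely on a diffusion sampler. Instead it builds a fully differentiable SfM pipeline (learned point tracking, learned camera initialization, triangulation, and differentiable Bundle Adjustment) and jointly estimates 3D structure. Where PoseDiffusion skips 3D points entirely, VGGSfM uses them as an additional refinement signal. Its higher accuracy supports the broader "learned initialization + geometric refinement" paradigm.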

Relation to Rooms from Motion

Rooms from Motion tackles indoor scene reconstruction from panoramic images. Like PoseDiffusion, it faces the challenge of wide baselines (rooms have very different viewpoints). Both methods show that learned priors can rescue reconstruction when classical matching fails.

Relation to COLMAP

COLMAP is the workhorse SfM pipeline that PoseDiffusion aims to replace. PoseDiffusion's training data is itself generated by COLMAP (on CO3Dv2) or ORB-SLAM (on RealEstate10K) — the student learns from the teacher, then surpasses it in the hardest cases. This is a recurring pattern in learned geometry: use classical methods to generate training data, then train a model that handles the failure modes better.

Relation to Classical SLAM

SLAM systems (ORB-SLAM, VINS-Mono) solve pose estimation in real-time. They use fast, approximate methods (PnP, essential matrix) and rely on temporal continuity. PoseDiffusion operates on unordered image sets, handles wider baselines, but is much slower. Future work could combine the best of both: SLAM for real-time tracking, PoseDiffusion for offline refinement.

Cheat Sheet

Aspect | PoseDiffusion
Input | N unordered images (any N)
Output | Camera extrinsics (quaternion + translation) + intrinsics (focal length)
Backbone | DINO ViT-S/16 (fine-tuned)
Denoiser | 8-layer Transformer encoder, 4 heads
Diffusion | DDPM, T=100, x0 prediction
Guidance | Sampson epipolar error via SuperPoint+SuperGlue
GGS schedule | Last 10 steps, 100 iterations each, adaptive strength
Training | L2 loss on clean cameras, 2 days on 8 GPUs
Key result | 66.5 mAA(30) on CO3Dv2 vs. 45.0 for PixSfM
The broader lesson: Diffusion models are not just for generating images. Any iterative refinement problem (pose estimation, protein folding, planning) might benefit from the diffusion formulation. The key insight is that diffusion gives you a principled way to combine learned priors with test-time constraints, something that neither pure regression nor pure optimization does easily.
What is the "student surpasses teacher" pattern in PoseDiffusion's training?