What if estimating camera poses was just denoising? Start from random cameras, iteratively refine them through a learned diffusion process, and inject epipolar geometry constraints along the way. The result beats COLMAP on real-world scenes.
You have ten tourist photos of the Colosseum, taken from different angles by different people. You want to figure out exactly where each camera was when the photo was taken — its position, its orientation, and its focal length. This is camera pose estimation, and it's the foundation of everything from 3D reconstruction to augmented reality.
The classical approach is a pipeline called Structure from Motion (SfM). It works in stages: detect keypoints (SIFT, SuperPoint), match them across images (nearest-neighbor, SuperGlue), use RANSAC to reject outliers, compute relative poses from the five-point algorithm, then run Bundle Adjustment to jointly refine all cameras and 3D points.
This pipeline is brittle. Each stage can fail, and failures cascade. If keypoint matching fails on a wide-baseline pair (two photos taken from very different positions), that image pair gets dropped. If too many pairs drop, the reconstruction collapses.
Learned methods like RelPose tried to bypass the correspondence problem by directly predicting camera poses from image features. But RelPose only predicts rotations, not translations, and it can't match the precision of Bundle Adjustment when many images are available.
We need something that can handle both sparse wide-baseline views and dense multi-view sequences — gracefully, without brittle handoffs between pipeline stages.
Two views of the same scene. Green lines are correct matches; red lines are incorrect. As the baseline widens, more matches fail.
Here's the idea that changes everything: what if camera pose estimation was just denoising?
Think about what a diffusion model does for images. You start with pure noise and gradually refine it into a coherent picture. At each step, the model nudges the noisy image slightly closer to something that looks real. The final result emerges through many small corrections, not one giant leap.
Now think about what Bundle Adjustment does. You start with rough camera poses (from a noisy initialization) and iteratively refine them until all the cameras are geometrically consistent. Each iteration makes small adjustments. The final result emerges through many corrections.
The parallel is striking. Both processes are iterative refinements from noise to signal. PoseDiffusion makes this parallel literal.
The model learns the conditional distribution p(x|I) — the probability of camera parameters x given images I. At test time, sampling from this distribution produces camera poses. Because the distribution is assumed to be near-delta (i.e., there's essentially one right answer for a given set of images), any sample is a valid pose estimate.
Before we dive into PoseDiffusion's mechanics, let's nail down the three pieces of geometry it relies on.
A camera's extrinsic parameters describe where it is and which way it's pointing. Formally, extrinsics g = (R, t) consist of a rotation matrix R in SO(3) and a translation vector t in ℝ³. Together, they define a rigid-body transformation from world coordinates to camera coordinates: x_cam = R x_world + t.
PoseDiffusion represents the rotation as a unit quaternion q in H (4 numbers) and keeps the translation as a 3-vector, giving 7 numbers per camera for extrinsics.
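To make the 7-number extrinsic representation concrete, here is a minimal NumPy sketch that converts a unit quaternion to a rotation matrix and applies the world-to-camera transform x_cam = R x_world + t. The (w, x, y, z) component order and the function names are assumptions made for illustration; the text does not say which quaternion convention PoseDiffusion uses.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion in (w, x, y, z) order (assumed) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a valid rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def world_to_camera(q, t, X_world):
    """Apply x_cam = R x_world + t to an (M, 3) array of world points."""
    R = quat_to_rotmat(q)
    return X_world @ R.T + t

# Example: a 90-degree rotation about the z axis, then a shift of 2 along z.
q = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
t = np.array([0.0, 0.0, 2.0])
X_cam = world_to_camera(q, t, np.array([[1.0, 0.0, 0.0]]))
```

The quaternion is re-normalized inside the conversion because a network output is not guaranteed to have unit length.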
The intrinsic parameters describe how the camera projects 3D points onto its 2D sensor. The calibration matrix K maps a 3D camera-space point to a 2D pixel:
PoseDiffusion simplifies this to one degree of freedom: the focal length f. The principal point (px, py) is fixed at the image center, which is standard in SfM. The focal length is predicted as f = exp(f̂) to guarantee it's always positive. This adds 1 number per camera, for a total of 8 parameters per camera.
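A small sketch of how the single predicted number f̂ could expand into a full calibration matrix and project a point. It assumes a focal length measured in pixels and a principal point at the image center; `calibration_matrix` and `project` are illustrative names, not PoseDiffusion's API.

```python
import numpy as np

def calibration_matrix(f_hat, width, height):
    """Build K from a log-focal-length; exp() keeps the focal length positive."""
    f = np.exp(f_hat)
    return np.array([
        [f,   0.0, width / 2.0],   # principal point fixed at the image center
        [0.0, f,   height / 2.0],
        [0.0, 0.0, 1.0],
    ])

def project(K, X_cam):
    """Project (M, 3) camera-space points to (M, 2) pixel coordinates."""
    uvw = X_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth

K = calibration_matrix(np.log(100.0), width=224, height=224)
pixels = project(K, np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]))
```

A point on the optical axis lands at the image center, and a point one unit to the side at depth 2 lands f/2 pixels away from it.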
Given two cameras with known poses, there's a fundamental relationship between corresponding points. If a point p1 in image 1 corresponds to a point p2 in image 2, then the epipolar constraint says: p̃2ᵀ F p̃1 = 0,
where F is the Fundamental Matrix computed from the two cameras' poses, and p̃ denotes homogeneous coordinates. This says: if you know one point in image 1, the corresponding point in image 2 must lie on a specific line (the epipolar line). If the poses are correct, all correspondences satisfy this constraint. If they don't, the poses are wrong.
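The constraint can be checked numerically. The sketch below builds F from two world-to-camera poses using the standard textbook construction F = K2⁻ᵀ [t]× R K1⁻¹, where (R, t) is the relative pose from camera 1 to camera 2, and then verifies that a projected 3D point satisfies the constraint. The camera values and helper names are made up for illustration.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]x, so that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_matrix(K1, R1, t1, K2, R2, t2):
    """F such that p2_h @ F @ p1_h == 0 for true correspondences."""
    R = R2 @ R1.T            # relative rotation, camera 1 -> camera 2
    t = t2 - R @ t1          # relative translation
    E = skew(t) @ R          # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def project_h(K, R, t, X):
    """Project a 3D world point to homogeneous pixel coordinates (u, v, 1)."""
    x = K @ (R @ X + t)
    return x / x[2]

K = np.array([[100.0, 0.0, 112.0], [0.0, 100.0, 112.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.3), np.sin(0.3)
R1, t1 = np.eye(3), np.zeros(3)
R2 = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])  # rotation about y
t2 = np.array([-1.0, 0.0, 0.2])

F = fundamental_matrix(K, R1, t1, K, R2, t2)
X = np.array([0.3, -0.2, 4.0])                 # a 3D point seen by both cameras
p1, p2 = project_h(K, R1, t1, X), project_h(K, R2, t2, X)
residual = p2 @ F @ p1                          # close to zero for a true match
```

Moving p2 off its epipolar line makes the residual non-zero, which is the signal that geometric guidance later exploits.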
PoseDiffusion adapts DDPM (Denoising Diffusion Probabilistic Models) to the domain of camera parameters. Let's trace how the standard DDPM machinery maps onto this problem.
Start with ground-truth camera parameters x0. The forward process adds Gaussian noise over T steps: q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) x_{t-1}, β_t I).
After T steps, the cameras are indistinguishable from pure Gaussian noise. The variance schedule β1, ..., βT controls how fast this happens. PoseDiffusion uses T = 100 steps with a linear schedule from 10⁻³ to 0.2.
A convenient closed form lets us jump directly to any timestep: q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 - ᾱ_t) I),
where ᾱ_t = ∏_{i=1}^{t} (1 - β_i). This is essential for efficient training: we can sample any noise level without simulating the full chain.
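As a concrete illustration, the schedule quoted above (T = 100, linear from 10⁻³ to 0.2) and the closed-form jump fit in a few lines of NumPy. The 1-indexed timestep convention and the `q_sample` name are choices made for this sketch.

```python
import numpy as np

T = 100
betas = np.linspace(1e-3, 0.2, T)        # linear variance schedule
alpha_bars = np.cumprod(1.0 - betas)     # alpha_bar_t = prod_{i<=t} (1 - beta_i)

def q_sample(x0, t, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) x0, (1 - alpha_bar_t) I) for t in 1..T."""
    a_bar = alpha_bars[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((10, 8))        # 10 cameras, 8 parameters each
x_mid = q_sample(x0, 50, rng)            # jump straight to step 50
```

With this schedule ᾱ_T is on the order of 10⁻⁵, so almost nothing of x0 survives at the final step, which is why sampling can start from pure Gaussian noise.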
The reverse process is what we actually use at inference. Starting from pure noise xT ~ N(0, I), we iteratively denoise:
The denoiser Dθ takes the current noisy cameras xt, the timestep t, and image features I, and predicts the clean cameras. PoseDiffusion uses the "x0 prediction" formulation: the network directly predicts the clean signal rather than the noise. This is empirically more stable for camera parameters.
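One common way to sample with an x0-predicting network is to re-noise the predicted clean sample to the previous noise level at every step. The loop below sketches that scheme; it is an assumption about the general recipe, not a transcription of the authors' exact update rule, and `denoiser` stands in for Dθ.

```python
import numpy as np

T = 100
betas = np.linspace(1e-3, 0.2, T)
alpha_bars = np.cumprod(1.0 - betas)

def sample_poses(denoiser, features, num_cameras, rng):
    """Run the reverse chain from x_T ~ N(0, I) down to a clean estimate."""
    x = rng.standard_normal((num_cameras, 8))
    for t in range(T, 0, -1):
        x0_hat = denoiser(x, t, features)       # network predicts clean cameras
        if t == 1:
            x = x0_hat                           # last step: keep the prediction
        else:
            a_prev = alpha_bars[t - 2]           # alpha_bar at step t - 1
            noise = rng.standard_normal(x.shape)
            x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * noise
    return x

# Toy check with a hypothetical "oracle" denoiser that always returns the answer.
rng = np.random.default_rng(0)
target = rng.standard_normal((5, 8))
result = sample_poses(lambda x, t, feats: target, None, 5, rng)
```

With a real network the prediction improves as the noise level drops, so later steps contribute finer corrections than earlier ones.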
In image diffusion, each sample is a grid of pixels — high-dimensional, spatially structured. Here, each sample is a set of camera parameter vectors. There are key differences:
The diffusion denoiser alone can produce reasonable poses, but it's a feed-forward neural network — and neural networks are notoriously bad at regressing precise geometric quantities like rotation angles and translation vectors. PoseDiffusion's secret weapon is Geometry-Guided Sampling (GGS): injecting classical epipolar constraints directly into the diffusion sampling process.
Recall that in classifier-guided diffusion for images, you steer samples toward a desired class by adding the gradient of a classifier to the denoising step. PoseDiffusion does the same thing, but the "classifier" is replaced by an epipolar geometry likelihood.
At each denoising step t, the predicted mean μ_{t-1} is adjusted: μ̂_{t-1} = μ_{t-1} + s ∇ log p(I|x).
The gradient ∇ log p(I|x) pushes the poses toward satisfying epipolar constraints from 2D correspondences. The scalar s controls guidance strength.
The likelihood p(I|x) is modeled as a product of exponential distributions over pairwise Sampson Errors: p(I|x) ∝ ∏_{i,j} exp(-e_ij),
where eij is the Sampson Epipolar Error between cameras i and j, computed from 2D correspondences extracted by SuperPoint+SuperGlue. The Sampson error is clamped at ε = 10 to handle outlier correspondences robustly.
The gradient of log p(I|x) is just the negative gradient of the total Sampson error across all pairs. It tells the optimizer: "adjust these camera poses so that epipolar lines pass closer to the matched keypoints."
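The Sampson error has a standard closed form, sketched below together with the clamped log-likelihood. Summing over correspondences and clamping each correspondence individually are assumptions of this sketch; the text only states that the error is clamped at ε = 10.

```python
import numpy as np

def sampson_error(F, p1, p2):
    """Sampson epipolar error for matched pixels p1 <-> p2, each of shape (M, 2)."""
    ones = np.ones((p1.shape[0], 1))
    h1 = np.hstack([p1, ones])            # homogeneous points in image 1
    h2 = np.hstack([p2, ones])            # homogeneous points in image 2
    Fh1 = h1 @ F.T                        # rows hold F @ h1  (epipolar lines in image 2)
    Fth2 = h2 @ F                         # rows hold F.T @ h2 (epipolar lines in image 1)
    num = np.sum(h2 * Fh1, axis=1) ** 2   # (h2^T F h1)^2
    den = Fh1[:, 0] ** 2 + Fh1[:, 1] ** 2 + Fth2[:, 0] ** 2 + Fth2[:, 1] ** 2
    return num / den

def log_likelihood(pairs, eps=10.0):
    """Unnormalized log p(I|x): minus the sum of clamped Sampson errors."""
    total = 0.0
    for F, p1, p2 in pairs:
        total += np.minimum(sampson_error(F, p1, p2), eps).sum()
    return -total

# Rectified stereo pair: matches must share the same image row (y1 == y2).
F = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
p1 = np.array([[10.0, 20.0], [10.0, 20.0], [10.0, 20.0]])
p2 = np.array([[15.0, 20.0], [15.0, 22.0], [15.0, 40.0]])   # exact, slightly off, outlier
errors = sampson_error(F, p1, p2)
ll = log_likelihood([(F, p1, p2)])
```

The third match is a gross outlier; clamping caps its contribution so that it cannot dominate the gradient.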
Feed-forward networks tend to struggle with precise geometric regression. A network might learn that "cameras looking at the same object should face inward," but it is unlikely to learn that "this camera's epipolar line must pass within 0.5 pixels of this specific keypoint." That level of precision comes from explicit geometric computation.
GGS provides this precision by computing exact Sampson errors from 2D correspondences. The gradient of the Sampson error tells the optimizer exactly which direction to nudge each camera to improve geometric consistency. The diffusion prior provides the rough neighborhood; GGS provides the final precision within that neighborhood.
Applying GGS only late (in the last 10 steps) is critical. Early in denoising, cameras are too scattered for epipolar constraints to be meaningful: random camera pairs have no real geometric relationship. By the time t drops below 10, the diffusion model has already placed the cameras approximately right, so GGS only needs to fine-tune within a small region, which is exactly where gradient descent works well.
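A toy sketch of the guidance schedule: leave the mean untouched for t ≥ 10, and take repeated gradient-ascent steps on the log-likelihood for t < 10. The finite-difference gradient is a stand-in for automatic differentiation, the fixed strength `s` simplifies whatever adaptive rule the method uses, and the quadratic `toy_log_lik` replaces the real Sampson-based likelihood.

```python
import numpy as np

def numerical_grad(fn, x, h=1e-5):
    """Central-difference gradient of a scalar function fn at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = h
        grad[idx] = (fn(x + step) - fn(x - step)) / (2.0 * h)
    return grad

def guide_mean(mu, log_lik, t, s=0.1, start_step=10, iters=100):
    """Shift the predicted mean along grad log p(I|x), but only for t < start_step."""
    if t >= start_step:
        return mu                        # too early: constraints are not meaningful yet
    mu = mu.copy()
    for _ in range(iters):
        mu = mu + s * numerical_grad(log_lik, mu)
    return mu

# Toy likelihood that peaks at a known "geometrically consistent" pose set.
rng = np.random.default_rng(0)
target = rng.standard_normal((3, 8))
toy_log_lik = lambda x: -np.sum((x - target) ** 2)

mu = target + 0.5 * rng.standard_normal((3, 8))   # rough poses from the denoiser
early = guide_mean(mu, toy_log_lik, t=50)         # returned unchanged
late = guide_mean(mu, toy_log_lik, t=5)           # pulled onto the likelihood peak
```

The guided result is only as good as its starting point: gradient steps refine within a basin, which is why the diffusion prior has to supply the rough neighborhood first.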
Cameras (colored triangles) refine from random noise to clean poses over the denoising timesteps. In the final steps (t < 10), epipolar lines (dashed) appear as GGS activates, pulling cameras into geometric consistency.
The denoiser Dθ is a transformer that processes the noisy camera parameters jointly with image features. Let's trace the data flow.
For each of the N cameras, the transformer receives a token built from three components:
These three are concatenated into a single token per camera, then fed into the transformer.
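A shape-level sketch of the concatenation. It assumes the three components are the denoiser's three inputs (the noisy 8-dimensional camera vector, an embedding of the timestep t, and the 384-dimensional image feature); the sinusoidal embedding and its width of 16 are illustrative choices, not details confirmed by the text.

```python
import numpy as np

def timestep_embedding(t, dim=16):
    """Sinusoidal embedding of a scalar timestep (an assumed design choice)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def build_tokens(noisy_poses, t, image_features):
    """Concatenate per-camera inputs into one token per camera."""
    n = noisy_poses.shape[0]
    t_emb = np.tile(timestep_embedding(t), (n, 1))   # same timestep for every camera
    return np.concatenate([noisy_poses, t_emb, image_features], axis=1)

rng = np.random.default_rng(0)
tokens = build_tokens(
    noisy_poses=rng.standard_normal((5, 8)),         # 5 cameras, 8 parameters each
    t=42,
    image_features=rng.standard_normal((5, 384)),    # one feature vector per image
)
```

The sequence length equals the number of cameras, so self-attention over these tokens stays cheap even for dense inputs.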
The architecture is a standard transformer encoder with 8 layers, 4 attention heads, and feedforward dimension 1024. There's no decoder — all cameras attend to each other through self-attention. This is important: each camera can see every other camera's noisy pose and image features. The model learns to reason about multi-view consistency.
The transformer's output tokens are passed through a 2-layer MLP (hidden dim 128, output dim 8) to produce the predicted clean camera parameters: log-focal-length f̂, quaternion q, and translation t for each camera.
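The 8 output numbers still have to be turned into valid camera parameters. The sketch below assumes the layout [f̂, q (4 values), t (3 values)] in that order and assumes the quaternion is normalized after prediction; both are illustrative choices.

```python
import numpy as np

def decode_camera(raw):
    """Split one 8-dim head output into focal length, unit quaternion, translation."""
    f_hat, q, t = raw[0], raw[1:5], raw[5:8]
    focal = np.exp(f_hat)                 # exp keeps the focal length positive
    q_unit = q / np.linalg.norm(q)        # project onto the unit sphere
    return focal, q_unit, t

focal, q_unit, t = decode_camera(np.array([0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0]))
```

Only the quaternion and focal length need this post-processing; the translation is unconstrained and passes through unchanged.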
Data flow from inputs through the transformer to output camera parameters. All N cameras are processed jointly via self-attention.
Image features come from DINO ViT-S/16, pretrained in a self-supervised fashion. The images are center-cropped and resized to 224x224, then features are extracted at three scales (1x, 1/2, 1/3) and averaged for multi-scale understanding. DINO's weights are fine-tuned during training — this lets the feature extractor adapt to the pose-estimation task.
SfM datasets define poses in arbitrary scene-specific coordinate frames. To prevent the model from overfitting to these arbitrary frames, PoseDiffusion canonicalizes all poses relative to a randomly selected pivot camera. The pivot camera gets identity rotation and zero translation. A binary flag in the input tells the model which camera is the pivot. Translations are further normalized by their median norm to handle scale ambiguity.
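The canonicalization can be sketched for world-to-camera poses (R_i, t_i) as follows: re-express every camera in the pivot camera's frame, then rescale translations. Taking the median over the non-pivot cameras is an assumption of this sketch, and `canonicalize` is an illustrative name.

```python
import numpy as np

def canonicalize(Rs, ts, pivot):
    """Rs: (N, 3, 3), ts: (N, 3), world-to-camera. Returns pivot-relative poses."""
    R_p, t_p = Rs[pivot], ts[pivot]
    Rs_rel = Rs @ R_p.T                      # R_i' = R_i R_p^T
    ts_rel = ts - Rs_rel @ t_p               # t_i' = t_i - R_i' t_p
    norms = np.linalg.norm(ts_rel, axis=1)
    others = np.arange(len(ts)) != pivot     # the pivot's own norm is zero
    ts_rel = ts_rel / np.median(norms[others])
    flags = np.zeros(len(ts))
    flags[pivot] = 1.0                       # binary flag marking the pivot camera
    return Rs_rel, ts_rel, flags

def rot_z(a):
    """Rotation by angle a about the z axis (helper for the example)."""
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

Rs = np.stack([rot_z(a) for a in (0.1, 0.7, 1.3)])
ts = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 3.0], [0.0, 2.0, 4.0]])
Rs_rel, ts_rel, flags = canonicalize(Rs, ts, pivot=1)
```

Relative rotations between any two cameras are unchanged by this step; only the arbitrary world frame and the global scale are removed. The sketch would divide by zero if all cameras shared one position, a case a real implementation would need to guard against.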
PoseDiffusion's transformer has only 8 layers, 4 heads, and processes at most ~20 tokens. Compare:
| Model | Tokens | Layers | Params | Task |
|---|---|---|---|---|
| PoseDiffusion | 3-20 | 8 | ~5M | Denoise N camera poses |
| DiT-XL/2 | 256 | 28 | 675M | Denoise 32×32×4 latent |
| SD3-8B | ~4500 | 38 | 8B | Denoise 128×128×16 latent |
The tiny model works because: (1) the output space is 8N dimensions, not millions of pixels; (2) each token already carries rich 384-dim visual features from DINO; (3) the geometric reasoning required (relative camera placement) is compositionally simpler than photorealistic image generation. More parameters would likely overfit — CO3Dv2 has only ~37K scenes.
PoseDiffusion is trained with a remarkably simple objective: predict the clean cameras from noisy ones.
At each training step, the model receives a batch of scenes with ground-truth cameras x0 and images I. A random timestep t is sampled, noise is added to get xt, and the denoiser predicts x0: L_diff = E ‖Dθ(x_t, t, I) - x_0‖², with the expectation taken over scenes, timesteps, and noise.
That's it. No adversarial losses, no perceptual losses, no multi-task heads. Just an L2 loss between predicted and ground-truth camera parameters, averaged over all cameras in the scene and all timesteps.
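One training step's loss, sketched in NumPy with a placeholder `denoiser` callable standing in for Dθ. A real implementation would use an autodiff framework and batches of scenes; this version only shows how the loss value is formed.

```python
import numpy as np

T = 100
betas = np.linspace(1e-3, 0.2, T)
alpha_bars = np.cumprod(1.0 - betas)

def training_loss(denoiser, x0, features, rng):
    """Sample a timestep, noise the clean cameras, and score the x0 prediction."""
    t = int(rng.integers(1, T + 1))                   # uniform over 1..T
    a_bar = alpha_bars[t - 1]
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    x0_hat = denoiser(x_t, t, features)
    return float(np.mean((x0_hat - x0) ** 2))         # L2 loss, averaged over cameras

rng = np.random.default_rng(0)
x0 = rng.standard_normal((6, 8))
perfect = training_loss(lambda x_t, t, f: x0, x0, None, rng)   # oracle prediction
lazy = training_loss(lambda x_t, t, f: x_t, x0, None, rng)     # returns its noisy input
```

Because each step touches a single noise level, the cost of one update does not depend on the length of the sampling chain.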
Two datasets provide the training signal: CO3Dv2 and RealEstate10K. The model's components and their training status:
| Component | Parameters | Status |
|---|---|---|
| DINO ViT-S/16 | ~22M | Pretrained (self-supervised on ImageNet), fine-tuned during training |
| Transformer denoiser | ~5M | Trained from scratch |
| MLP head | ~75K | Trained from scratch |
| SuperPoint + SuperGlue | ~13M | Frozen (pretrained, used only at inference for GGS) |
Total trainable: ~27M parameters. This is remarkably small — orders of magnitude fewer than image diffusion models. The small model size is possible because: (1) the output space is only 8 numbers per camera, not millions of pixels; (2) DINO provides rich visual features without needing a massive backbone; (3) the transformer only processes N tokens (number of cameras), not thousands of image patches.
Why diffusion for poses instead of direct regression? A regression network maps images directly to a single pose prediction. The problem: multi-view pose estimation is inherently ambiguous (especially with few views), and the loss landscape has many local minima. Diffusion provides: (1) iterative refinement from coarse to fine (the 100-step chain), (2) implicit ensembling (can sample multiple times for uncertainty), and (3) a natural injection point for geometric constraints (GGS). The PoseReg ablation confirms this: same architecture without diffusion scores 48.2 mAA vs 66.5 with diffusion.
Why DINO instead of CLIP or ResNet? DINO ViT-S/16 provides features trained with self-supervised objectives that emphasize spatial structure and object parts — exactly what you need for geometric reasoning. CLIP features emphasize semantic similarity, which is less useful for precise pose estimation. ResNet features lack the global receptive field that ViT's self-attention provides.
Why fine-tune DINO? Generic DINO features are good but not specialized for pose. Fine-tuning lets the feature extractor learn to emphasize viewpoint-discriminative information (silhouettes, parallax cues) over semantic content (object identity). This is one of the keys to PoseDiffusion's performance.
A key advantage of the diffusion formulation: the model is trained one step at a time. Unlike approaches that unroll an iterative refinement loop and backpropagate through every iteration, each training step only requires a forward and backward pass through a single denoising step. This keeps training tractable even for complex geometric reasoning.
PoseDiffusion is evaluated on two challenging real-world datasets with different characteristics. The results are compelling.
CO3Dv2: each scene is a turntable-like video of a single object. Cameras orbit the object at roughly constant distance. PoseDiffusion significantly outperforms all baselines in both sparse and dense settings.
| Method | RRA@15 | RTA@15 | mAA(30) |
|---|---|---|---|
| RelPose | 57.1 | — | — |
| COLMAP+SPSG | 33.7 | 32.9 | 30.1 |
| PixSfM | 53.2 | 49.1 | 45.0 |
| PoseReg (no diffusion) | 57.0 | 53.4 | 48.2 |
| Ours w/o GGS | 75.9 | 72.8 | 56.0 |
| PoseDiffusion | 80.5 | 79.8 | 66.5 |
Key observations: (1) The diffusion model alone (w/o GGS) already beats every baseline. (2) Adding GGS provides a further 10+ point boost in mAA(30). (3) The non-diffusion baseline PoseReg with the same architecture scores much lower, validating that diffusion itself — not just the architecture — is responsible for the gains.
RealEstate10K: these are fly-through videos of real interiors and exteriors, the domain where COLMAP traditionally excels. Yet PoseDiffusion still wins across all metrics and frame counts.
To test whether the estimated cameras are truly useful, the authors train NeRFs using PoseDiffusion's output. The NeRF rendering quality (PSNR) matches or exceeds that of NeRFs trained with COLMAP cameras. Crucially, replacing predicted focal lengths with ground truth makes no difference, which suggests that the intrinsic estimation is accurate enough for this downstream task.
Perhaps the most impressive result: a model trained on 41 CO3Dv2 categories transfers to 10 unseen categories with only a small accuracy drop (50.8 to 48.0 mAA). Even more remarkably, transferring from CO3Dv2 (object-centric, circular trajectories) to RealEstate10K (scene-centric, linear trajectories) — a huge domain shift — produces results comparable to PixSfM.
Mean Average Accuracy at 30 degrees. Higher is better. 10 input frames.
The paper's full title, "PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment", names Bundle Adjustment for a reason. The connection runs deeper than a surface analogy.
Bundle Adjustment (BA) is the gold standard for refining camera poses. Given initial camera estimates and 3D point estimates, BA minimizes the reprojection error: the sum of squared distances between observed 2D keypoints and the projected positions of the estimated 3D points. It's a nonlinear least-squares optimization, typically solved with Levenberg-Marquardt: min Σ_i Σ_j ‖p_j^i - π(X_j; g_i, K_i)‖²,
where p_j^i is the observed 2D position of point j in camera i, X_j is the estimated 3D point, and π is the projection function.
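For comparison with the Sampson-based guidance, here is a minimal sketch of the quantity BA minimizes. It evaluates the reprojection error for given cameras and points; the solver itself (Levenberg-Marquardt) is omitted, and the data layout is an assumption of this sketch.

```python
import numpy as np

def reprojection_error(Ks, Rs, ts, points3d, observations):
    """Sum of squared pixel distances between observed and projected points.

    observations: list of (camera index i, point index j, observed pixel of shape (2,)).
    """
    total = 0.0
    for i, j, p_obs in observations:
        x = Ks[i] @ (Rs[i] @ points3d[j] + ts[i])   # project point j into camera i
        p_proj = x[:2] / x[2]
        total += float(np.sum((p_obs - p_proj) ** 2))
    return total

K = np.array([[100.0, 0.0, 112.0], [0.0, 100.0, 112.0], [0.0, 0.0, 1.0]])
Ks, Rs = [K, K], [np.eye(3), np.eye(3)]
ts = [np.zeros(3), np.array([-1.0, 0.0, 0.0])]
points3d = np.array([[0.0, 0.0, 4.0], [1.0, 1.0, 5.0]])

# Noise-free observations generated from the true cameras and points.
obs = []
for i in range(2):
    for j in range(2):
        x = Ks[i] @ (Rs[i] @ points3d[j] + ts[i])
        obs.append((i, j, x[:2] / x[2]))

exact = reprojection_error(Ks, Rs, ts, points3d, obs)
shifted = [(obs[0][0], obs[0][1], obs[0][2] + np.array([3.0, 4.0]))] + obs[1:]
noisy = reprojection_error(Ks, Rs, ts, points3d, shifted)
```

Unlike the Sampson error, this objective requires explicit 3D points X_j, so it has more unknowns to optimize.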
PoseDiffusion mirrors BA in several ways:
But PoseDiffusion also has advantages over classical BA:
Two paths to the same goal. Left: classical BA refines via gradient descent on reprojection error. Right: diffusion refines via learned denoising + geometric guidance. Both converge from rough to precise.
PoseDiffusion sits at the intersection of classical geometry and deep generative modeling. Let's map where it connects to the broader landscape.
VGGSfM (2024) extends the PoseDiffusion idea: it adds differentiable Bundle Adjustment on top of the diffusion-predicted poses, and jointly estimates 3D structure. Where PoseDiffusion skips 3D points entirely, VGGSfM uses them as an additional refinement signal. VGGSfM achieves even higher accuracy, validating the "diffusion initialization + classical refinement" paradigm.
Rooms from Motion tackles indoor scene reconstruction from panoramic images. Like PoseDiffusion, it faces the challenge of wide baselines (rooms have very different viewpoints). Both methods show that learned priors can rescue reconstruction when classical matching fails.
COLMAP is the workhorse SfM pipeline that PoseDiffusion aims to replace. PoseDiffusion's training data is itself generated by COLMAP (on CO3Dv2) or ORB-SLAM (on RealEstate10K) — the student learns from the teacher, then surpasses it in the hardest cases. This is a recurring pattern in learned geometry: use classical methods to generate training data, then train a model that handles the failure modes better.
SLAM systems (ORB-SLAM, VINS-Mono) solve pose estimation in real-time. They use fast, approximate methods (PnP, essential matrix) and rely on temporal continuity. PoseDiffusion operates on unordered image sets, handles wider baselines, but is much slower. Future work could combine the best of both: SLAM for real-time tracking, PoseDiffusion for offline refinement.
| Aspect | PoseDiffusion |
|---|---|
| Input | N unordered images (any N) |
| Output | Camera extrinsics (quaternion + translation) + intrinsics (focal length) |
| Backbone | DINO ViT-S/16 (fine-tuned) |
| Denoiser | 8-layer Transformer encoder, 4 heads |
| Diffusion | DDPM, T=100, x0 prediction |
| Guidance | Sampson epipolar error via SuperPoint+SuperGlue |
| GGS schedule | Last 10 steps, 100 iterations each, adaptive strength |
| Training | L2 loss on clean cameras, 2 days on 8 GPUs |
| Key result | 66.5 mAA(30) on CO3Dv2 vs. 45.0 for PixSfM |