Recovering 3D structure and camera motion from 2D images: calibration, pose estimation, SfM, bundle adjustment, and SLAM.
Walk around a building and take photos from different angles. From these 2D images alone, can we recover the 3D shape of the building and figure out where each camera was? Yes — this is Structure from Motion (SfM).
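SfM solves two problems simultaneously: recovering the 3D structure of the scene (the positions of points on the building) and recovering the motion of the camera (the pose from which each photo was taken).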
Multiple views of a scene reveal its 3D structure through triangulation.
Before recovering 3D structure, you need to know your camera's intrinsic parameters: focal length, principal point, and lens distortion.
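The intrinsic matrix K maps from camera coordinates to pixel coordinates:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

Here $f_x$ and $f_y$ are the focal lengths in pixels, and $(c_x, c_y)$ is the principal point (usually near the image center).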
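As a minimal sketch of what K does, the function below projects a point given in camera coordinates to pixel coordinates. The intrinsic values are hypothetical numbers chosen for a 640x480 image, and lens distortion is ignored.

```python
def project(point_cam, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    X, Y, Z = point_cam
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    # Same result as multiplying by K and dividing by the third component.
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return u, v

# Hypothetical intrinsics (not from any real camera).
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0

print(project((0.0, 0.0, 2.0), fx, fy, cx, cy))    # (320.0, 240.0)
print(project((0.5, -0.25, 2.0), fx, fy, cx, cy))  # (520.0, 140.0)
```

A point on the optical axis lands exactly on the principal point, which is one quick sanity check for a calibration result.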
| Method | How It Works |
|---|---|
| Checkerboard | Photograph a known pattern from multiple angles. Detect corners, solve for K. (Zhang's method) |
| Vanishing points | Detect parallel lines in the scene. Their vanishing points constrain K. |
| Self-calibration | Estimate K from point correspondences alone, using constraints from the essential matrix. |
Pose estimation computes the camera's position and orientation given known 3D-to-2D correspondences. "I know this 3D point projects to this pixel — where must the camera be?"
The Perspective-n-Point (PnP) problem: given n pairs of 3D points and their 2D projections, find the camera pose [R|t].
| Points | Method |
|---|---|
| n = 3 | P3P: up to 4 solutions (disambiguate with a 4th point) |
| n ≥ 4 | Non-minimal solvers such as EPnP |
| n ≥ 6 | Direct linear transform (DLT): linear least squares, then nonlinear refinement |
| Many | RANSAC + PnP for robustness |
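The RANSAC row relies on scoring each pose hypothesis by how many correspondences it explains. The sketch below shows only that scoring step, not a PnP solver; the intrinsics, the 2-pixel threshold, and the function names are illustrative assumptions.

```python
import math

def project_world(X, R, t, fx, fy, cx, cy):
    """Project a world point with pose [R|t] and pinhole intrinsics."""
    # Camera coordinates: Xc = R * Xw + t
    xc = [sum(R[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]
    return fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy

def count_inliers(points3d, pixels, R, t, intr, threshold_px=2.0):
    """Count correspondences whose reprojection error is below a threshold."""
    inliers = 0
    for X, (u, v) in zip(points3d, pixels):
        pu, pv = project_world(X, R, t, *intr)
        if math.hypot(pu - u, pv - v) < threshold_px:
            inliers += 1
    return inliers

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intr = (800.0, 800.0, 320.0, 240.0)   # hypothetical fx, fy, cx, cy
points3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
            (1.0, 1.0, 1.0), (-1.0, 0.5, 2.0)]
true_t = (0.0, 0.0, 5.0)

# Synthetic observations from the true pose, with one simulated wrong match.
pixels = [project_world(X, IDENTITY, true_t, *intr) for X in points3d]
pixels[-1] = (0.0, 0.0)

print(count_inliers(points3d, pixels, IDENTITY, true_t, intr))           # 4
print(count_inliers(points3d, pixels, IDENTITY, (0.5, 0.0, 5.0), intr))  # 0
```

Inside a RANSAC loop, a minimal solver would propose many poses from random subsets, and the pose with the highest inlier count would be kept and refined.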
Visual localization uses pose estimation at city scale: match a query photo against a 3D map (built by SfM), find 2D-3D correspondences, and solve PnP. This enables GPS-free positioning, which can reach centimeter-level accuracy when the map is accurate and the query image matches it well.
The simplest SfM setup: two images. You do not know the 3D scene or either camera pose. All you have are point correspondences (from feature matching).
Despite this seemingly impossible situation, the geometry is remarkably constrained. A matched pair of points (x, x') in two calibrated cameras must satisfy the epipolar constraint:

$$x'^{\top} E \, x = 0$$
where E is the essential matrix. It encodes the relative rotation and translation between cameras. With 5 or more correspondences, you can recover E (and thus the relative pose) up to scale.
For uncalibrated cameras (unknown K), the constraint becomes $x'^{\top} F x = 0$, where F is the fundamental matrix. F has 7 DOF and can be estimated from 7 or more correspondences.
The essential matrix $E = [t]_\times R$ encodes the relative rotation R and translation direction t between two calibrated cameras. It has exactly 5 degrees of freedom (3 for rotation + 2 for translation direction; the magnitude of t is unknown due to scale ambiguity).
Given the epipolar constraint $x'^{\top} E x = 0$, each point correspondence provides one equation. Since E has 5 DOF, 5 points suffice (Nistér's 5-point algorithm). The 8-point algorithm (with normalization) needs more correspondences but is simpler to implement. In practice, either solver is run inside RANSAC to reject mismatched points.
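To make the constraint concrete, the following sketch builds E from a known rotation and translation and checks that a synthetic correspondence satisfies it. It assumes the convention that a point X in the first camera's frame maps to R X + t in the second; the pose and the 3D point are made-up values.

```python
import math

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def normalized(X):
    """Normalized image coordinates (K already removed)."""
    return [X[0] / X[2], X[1] / X[2], 1.0]

def epipolar_residual(E, x1, x2):
    """Value of x2^T E x1, which is zero for a geometrically consistent match."""
    Ex = matvec(E, x1)
    return sum(x2[i] * Ex[i] for i in range(3))

R = rot_y(math.radians(10.0))   # assumed relative rotation
t = [1.0, 0.0, 0.2]             # assumed relative translation
E = matmul(skew(t), R)

X1 = [0.3, -0.4, 5.0]                                   # point in camera-1 frame
X2 = [a + b for a, b in zip(matvec(R, X1), t)]          # same point in camera-2 frame
x1, x2 = normalized(X1), normalized(X2)

good = epipolar_residual(E, x1, x2)
bad = epipolar_residual(E, x1, [x2[0], x2[1] + 0.05, 1.0])  # shifted off its epipolar line
print(abs(good) < 1e-9, abs(bad) < 1e-9)
```

Scaling t by any nonzero factor scales E by the same factor and leaves the residual at zero, which is the scale ambiguity described above.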
The epipole is where the other camera's center projects into your image, and all epipolar lines pass through it. When a stereo pair is rectified (image planes coplanar and image rows parallel to the baseline), the epipoles move to infinity and the epipolar lines become horizontal scan lines, which is the basis for stereo matching (Chapter 12).
A point in one image constrains its match to an epipolar line in the other image.
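A small sketch of this idea: for a point x in the first image, the product E x gives the coefficients of its epipolar line in the second image. The example assumes normalized image coordinates and a camera pair related by a pure sideways translation with no rotation, so E reduces to the cross-product matrix of t = (1, 0, 0).

```python
def epipolar_line(E, x):
    """Line coefficients (a, b, c) with a*u + b*v + c = 0 in the second image."""
    return tuple(sum(E[i][k] * x[k] for k in range(3)) for i in range(3))

def point_line_distance(line, p):
    """Perpendicular distance from image point p = (u, v) to the line."""
    a, b, c = line
    return abs(a * p[0] + b * p[1] + c) / (a * a + b * b) ** 0.5

# E = [t]x for t = (1, 0, 0) and R = identity (an assumed, idealized pair).
E = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

x = (0.2, 0.1, 1.0)              # normalized point in image 1
line = epipolar_line(E, x)
print(line)                      # (0.0, -1.0, 0.1), i.e. the row v = 0.1

on_line = point_line_distance(line, (-0.3, 0.1))
off_line = point_line_distance(line, (-0.3, 0.15))
print(on_line, off_line)         # zero, and roughly 0.05
```

In this configuration every epipolar line is a horizontal row, so a matcher only needs to search along the row of the original point.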
Two frames give you relative pose and sparse 3D points. Adding more frames gives more points and more constraints, improving accuracy. Multi-frame SfM scales this to hundreds or thousands of images.
Two strategies:
| Strategy | How It Works | Trade-off |
|---|---|---|
| Incremental SfM | Start with 2 images. Add new images one at a time, solving PnP for pose, triangulating new points, running bundle adjustment. | Robust but slow. Drift accumulates. |
| Global SfM | Estimate all relative poses from image pairs, then solve for all absolute poses simultaneously (rotation averaging + translation averaging). | Faster but less robust to outliers. |
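The drift mentioned for incremental SfM comes from chaining relative poses, so small errors compound. The toy below illustrates this with planar poses (x, y, heading) rather than full 3D poses; the square path and the 1-degree heading bias are made-up values, and no actual SfM pipeline is involved.

```python
import math

def compose(pose, motion):
    """Apply a relative motion (forward distance, turn) to a 2D pose."""
    x, y, heading = pose
    distance, turn = motion
    return (x + distance * math.cos(heading),
            y + distance * math.sin(heading),
            heading + turn)

def chain(motions, start=(0.0, 0.0, 0.0)):
    """Accumulate relative motions into absolute poses, one step at a time."""
    poses = [start]
    for motion in motions:
        poses.append(compose(poses[-1], motion))
    return poses

def closing_error(motions):
    """Distance between the final position and the starting position."""
    x, y, _ = chain(motions)[-1]
    return math.hypot(x, y)

# Walk a unit square: four steps, each followed by a 90-degree left turn.
ideal = [(1.0, math.pi / 2)] * 4
# Same path, but every estimated turn is off by 1 degree.
biased = [(1.0, math.pi / 2 + math.radians(1.0))] * 4

print("ideal closing error:", closing_error(ideal))
print("biased closing error:", closing_error(biased))
```

With exact relative motions the path closes on itself; with the biased ones it does not, and the gap grows as more steps are chained. Bundle adjustment and loop closure exist to pull such errors back down.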
Internet-scale SfM (Building Rome in a Day) processed hundreds of thousands of tourist photos from Flickr and automatically built 3D models of famous landmarks. This demonstrated that SfM could work "in the wild" with uncontrolled, diverse imagery.
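Bundle adjustment is the final refinement step. It simultaneously optimizes all camera poses and all 3D point positions to minimize the total reprojection error:

$$\min_{\{P_i\},\,\{X_j\}} \; \sum_{(i,j)} \left\| x_{ij} - \pi(P_i, X_j) \right\|^2$$

where π is the projection function, $P_i$ denotes the parameters of camera i, $x_{ij}$ is the observed 2D position of point j in camera i, and $X_j$ is the 3D point position. The sum runs over all pairs (i, j) for which point j is observed in camera i.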
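Each reprojection residual depends on only one camera and one 3D point, so the Jacobian of the problem is very sparse. Modern solvers like Ceres (Google) and g2o exploit this sparsity with efficient sparse Cholesky factorization, which keeps problems with millions of observations tractable.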
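The sparsity that these solvers exploit can be seen by counting parameter blocks: each observation involves exactly one camera and one point, so its row of the Jacobian has only two nonzero blocks. The sketch below prints that block pattern for a small made-up problem; it is an illustration of the structure, not of how Ceres or g2o store it internally.

```python
def jacobian_block_pattern(num_cameras, observations):
    """For each (camera, point) observation, return its two nonzero column blocks.

    Column blocks are ordered with all cameras first, then all points.
    """
    return [(cam, num_cameras + pt) for cam, pt in observations]

def block_density(num_cameras, num_points, num_observations):
    """Fraction of Jacobian blocks that are nonzero."""
    total_blocks = num_observations * (num_cameras + num_points)
    return 2 * num_observations / total_blocks

# Hypothetical problem: 3 cameras, 5 points, each point seen by two cameras.
obs = [(0, 0), (1, 0), (0, 1), (2, 1), (1, 2),
       (2, 2), (0, 3), (1, 3), (1, 4), (2, 4)]

pattern = jacobian_block_pattern(3, obs)
for cam_col, pt_col in pattern:
    print("".join("x" if col in (cam_col, pt_col) else "." for col in range(8)))

print(block_density(3, 5, len(obs)))   # 0.25
```

The density is 2 / (cameras + points), so it shrinks quickly as the reconstruction grows. For a thousand cameras and a hundred thousand points, only about two blocks in every hundred thousand are nonzero.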
Simultaneous Localization and Mapping (SLAM) is SfM in real time. A robot or AR device moves through an unknown environment, building a map while simultaneously tracking its own position within that map.
| System | Key Approach |
|---|---|
| ORB-SLAM | Feature-based visual SLAM. ORB features, bag-of-words for loop closure, bundle adjustment. |
| LSD-SLAM | Direct (no features). Optimizes photometric error on semi-dense depth maps. |
| DTAM | Dense tracking and mapping. Real-time dense reconstruction from a single moving camera. |
| NeRF-SLAM | Neural implicit map representation. Combines SLAM with neural radiance fields. |
Applications span robotics (autonomous navigation), AR (spatial anchors), and autonomous driving (HD map building). Modern AR devices (Apple Vision Pro, Meta Quest) run visual-inertial SLAM in real time, fusing camera tracking with IMU data for robust 6-DOF pose estimation.
Triangulation: two cameras observe the same 3D point, and the intersection of their back-projected rays gives its 3D position.
Two cameras (blue triangles) observe a 3D point (yellow). Back-projected rays intersect at the point's 3D location.
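With noisy measurements the two rays rarely intersect exactly, so a common simple choice is the midpoint of the shortest segment between them. The following is a minimal sketch of that midpoint method, assuming each ray is given by a camera center and a direction in world coordinates; the camera positions and the test point are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the closest points on rays c1 + s*d1 and c2 + t*d2."""
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")
    s = (b * e - c * d) / denom   # parameter along ray 1
    t = (a * e - b * d) / denom   # parameter along ray 2
    p1 = [o + s * k for o, k in zip(c1, d1)]
    p2 = [o + t * k for o, k in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]

# Two camera centers one unit apart, both looking at the same point.
c1, c2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
target = (0.5, 0.2, 4.0)
d1 = [p - q for p, q in zip(target, c1)]
d2 = [p - q for p, q in zip(target, c2)]

exact = triangulate_midpoint(c1, d1, c2, d2)
print([round(v, 6) for v in exact])

# Perturb one ray slightly, as measurement noise would.
noisy = triangulate_midpoint(c1, d1, c2, [d2[0], d2[1] + 0.01, d2[2]])
print([round(v, 3) for v in noisy])
```

The first call recovers the target point; the second returns a nearby point even though the rays no longer meet. The guard against parallel rays reflects the fact that depth is poorly constrained when the baseline is small relative to the distance.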
| Concept | Used In |
|---|---|
| Camera calibration | Ch 2 (image formation), Ch 12 (stereo), every 3D vision task |
| Pose estimation / PnP | Ch 8 (alignment), Ch 12 (rectification), AR applications |
| Epipolar geometry | Ch 12 (stereo correspondence), Ch 9 (motion estimation) |
| Bundle adjustment | Ch 8 (panoramas), Ch 12 (multi-view stereo), autonomous driving |
| SLAM | Robotics, AR/VR, autonomous driving, drone navigation |
| SfM point clouds | Ch 13 (3D reconstruction), Ch 14 (neural rendering) |