Szeliski, Chapter 11

Structure from Motion and SLAM

Recovering 3D structure and camera motion from 2D images: calibration, pose estimation, SfM, bundle adjustment, and SLAM.

Prerequisites: Chapter 2 (projective geometry), Chapter 7 (features), Chapter 8 (alignment).
10
Chapters
6+
Simulations
0
Assumed CV Knowledge

Chapter 0: Why SfM?

Walk around a building and take photos from different angles. From these 2D images alone, can we recover the 3D shape of the building and figure out where each camera was? Yes — this is Structure from Motion (SfM).

SfM solves two problems simultaneously:

Why it matters: SfM powers Google Earth (3D city models from satellite/street images), autonomous driving (mapping from dashcam video), AR/VR (spatial mapping), archaeology (digital heritage preservation), and Hollywood VFX (match-moving CGI into real footage).
SfM Overview

Multiple views of a scene reveal its 3D structure through triangulation.

What two things does Structure from Motion recover simultaneously?

Chapter 1: Camera Calibration

Before recovering 3D structure, you need to know your camera's intrinsic parameters: focal length, principal point, and lens distortion.

K = ⎡ fx   0   cx
  ⎣ 0   fy   cy
  ⎣ 0   0   1 ⎦

The intrinsic matrix K maps from camera coordinates to pixel coordinates. fx, fy are the focal lengths in pixels, and (cx, cy) is the principal point (usually near image center).

MethodHow It Works
CheckerboardPhotograph a known pattern from multiple angles. Detect corners, solve for K. (Zhang's method)
Vanishing pointsDetect parallel lines in the scene. Their vanishing points constrain K.
Self-calibrationEstimate K from point correspondences alone, using constraints from the essential matrix.
Radial distortion bends straight lines into curves (barrel or pincushion distortion). It is modeled as: x' = x(1 + k1r2 + k2r4), where r is the distance from the principal point. Correcting distortion is essential before any geometric computation — uncorrected distortion corrupts all subsequent 3D estimates.
What does the intrinsic matrix K encode?

Chapter 2: Pose Estimation

Pose estimation computes the camera's position and orientation given known 3D-to-2D correspondences. "I know this 3D point projects to this pixel — where must the camera be?"

The Perspective-n-Point (PnP) problem: given n pairs of 3D points and their 2D projections, find the camera pose [R|t].

PointsMethod
n = 3P3P: 4 solutions (choose with a 4th point)
n = 4Direct linear transform (DLT)
n ≥ 6Linear least squares + refinement
ManyRANSAC + PnP for robustness
Triangulation: Once you know camera poses, recovering a 3D point from its projections in two or more views is straightforward: intersect the back-projected rays. In practice, rays do not intersect perfectly due to noise, so you minimize the reprojection error. This is triangulation — the inverse of projection.

Visual localization uses pose estimation at city scale: match a query photo against a 3D map (built by SfM), find 2D-3D correspondences, and solve PnP. This enables GPS-free positioning accurate to centimeters.

What is the minimum number of 3D-to-2D correspondences needed to estimate a camera's pose?

Chapter 3: Two-Frame SfM

The simplest SfM setup: two images. You do not know the 3D scene or either camera pose. All you have are point correspondences (from feature matching).

Despite this seemingly impossible situation, the geometry is remarkably constrained. A matched pair of points (x, x') in two calibrated cameras must satisfy the epipolar constraint:

x'T E x = 0

where E is the essential matrix. It encodes the relative rotation and translation between cameras. With 5 or more correspondences, you can recover E (and thus the relative pose) up to scale.

Scale ambiguity: From two views alone, you can recover the relative pose and 3D structure, but not the absolute scale. A model of a toy house and a real house produce identical images if the camera is scaled proportionally. You need either a known distance (baseline) or additional constraints (GPS, IMU) to resolve scale.

For uncalibrated cameras (unknown K), the constraint becomes x'TFx = 0, where F is the fundamental matrix. F has 7 DOF and can be estimated from 7 or more correspondences.

Why is there a scale ambiguity in two-frame SfM?

Chapter 4: The Essential Matrix

The essential matrix E = [t]×R encodes the relative rotation R and translation direction t between two calibrated cameras. It has exactly 5 degrees of freedom (3 for rotation + 2 for translation direction; magnitude is unknown due to scale ambiguity).

Given the epipolar constraint x'TEx = 0, each point correspondence provides one equation. Since E has 5 DOF, 5 points suffice (Nistér's 5-point algorithm). In practice, the 8-point algorithm (with normalization) is simpler and used inside RANSAC.

Epipolar lines: Given a point x in image 1, the epipolar constraint tells us its match must lie on a specific line in image 2: the epipolar line l' = Ex. This reduces the search from 2D to 1D, dramatically speeding up matching. It is also a powerful geometric check: if a match does not lie near the epipolar line, it is wrong.

The epipole is where the other camera's center projects into your image. All epipolar lines pass through it. When the cameras have the same horizontal axis (rectified), the epipolar lines become horizontal scan lines — the basis for stereo matching (Chapter 12).

Epipolar Geometry

A point in one image constrains its match to an epipolar line in the other image.

Point position 50
What does the epipolar constraint reduce the matching search from?

Chapter 5: Multi-Frame SfM

Two frames give you relative pose and sparse 3D points. Adding more frames gives more points and more constraints, improving accuracy. Multi-frame SfM scales this to hundreds or thousands of images.

Two strategies:

StrategyHow It WorksTrade-off
Incremental SfMStart with 2 images. Add new images one at a time, solving PnP for pose, triangulating new points, running bundle adjustment.Robust but slow. Drift accumulates.
Global SfMEstimate all relative poses from image pairs, then solve for all absolute poses simultaneously (rotation averaging + translation averaging).Faster but less robust to outliers.
COLMAP is the gold standard for incremental SfM. Given a set of unordered photos, it: (1) extracts and matches SIFT features, (2) initializes from the best image pair, (3) incrementally adds images via PnP, (4) triangulates new 3D points, (5) runs bundle adjustment after each addition. Output: camera poses + sparse 3D point cloud.

Internet-scale SfM (Building Rome in a Day) processed hundreds of thousands of tourist photos of landmarks, automatically building 3D models of famous sites from Flickr photos. This demonstrated that SfM could work "in the wild" with uncontrolled, diverse imagery.

What is the key advantage of incremental SfM over global SfM?

Chapter 6: Bundle Adjustment

Bundle adjustment is the final refinement step. It simultaneously optimizes all camera poses and all 3D point positions to minimize the total reprojection error:

minR,t,Xi,j ||xij − π(RiXj + ti)||2

where π is the projection function, xij is the observed 2D position of point j in camera i, and Xj is the 3D point position.

Sparsity is the key. A reconstruction with 1000 cameras and 100,000 points has 306,000 parameters. But each 2D observation involves only one camera (6 params) and one point (3 params). The resulting Jacobian is extremely sparse. The Schur complement trick eliminates all point variables, reducing the system to a camera-only problem of 6,000 unknowns. This makes bundle adjustment tractable even for city-scale reconstructions.

Modern solvers like Ceres (Google) and g2o exploit this sparsity with efficient sparse Cholesky factorization, often converging in seconds for problems with millions of observations.

What does bundle adjustment minimize?

Chapter 7: SLAM

Simultaneous Localization and Mapping (SLAM) is SfM in real time. A robot or AR device moves through an unknown environment, building a map while simultaneously tracking its own position within that map.

SystemKey Approach
ORB-SLAMFeature-based visual SLAM. ORB features, bag-of-words for loop closure, bundle adjustment.
LSD-SLAMDirect (no features). Optimizes photometric error on semi-dense depth maps.
DTAMDense tracking and mapping. Real-time dense reconstruction from a single moving camera.
NeRF-SLAMNeural implicit map representation. Combines SLAM with neural radiance fields.
Loop closure: The critical challenge in SLAM. As the robot explores, pose errors accumulate (drift). When it returns to a previously visited location, the system must recognize this (place recognition) and correct the accumulated drift by distributing the error around the loop. Without loop closure, maps become globally inconsistent.

Applications span robotics (autonomous navigation), AR (spatial anchors), and autonomous driving (HD map building). Modern AR devices (Apple Vision Pro, Meta Quest) run visual-inertial SLAM in real time, fusing camera tracking with IMU data for robust 6-DOF pose estimation.

SLAM Trajectory

Watch a camera trace a path, building a map and correcting drift on loop closure.

What is loop closure and why is it critical for SLAM?

Chapter 8: Showcase — Triangulation Demo

See how triangulation works: two cameras observe the same 3D point. The intersection of back-projected rays gives the 3D position. Move the point to see how the geometry changes.

Two-View Triangulation

Two cameras (blue triangles) observe a 3D point (yellow). Back-projected rays intersect at the point's 3D location.

Point depth 50
Baseline 40
Baseline matters: Wide baselines give better depth precision (the triangulation angle is larger). Narrow baselines are more robust (easier to match features). This is the classic baseline-matching trade-off in stereo and SfM.

Chapter 9: Connections

ConceptUsed In
Camera calibrationCh 2 (image formation), Ch 12 (stereo), every 3D vision task
Pose estimation / PnPCh 8 (alignment), Ch 12 (rectification), AR applications
Epipolar geometryCh 12 (stereo correspondence), Ch 9 (motion estimation)
Bundle adjustmentCh 8 (panoramas), Ch 12 (multi-view stereo), autonomous driving
SLAMRobotics, AR/VR, autonomous driving, drone navigation
SfM point cloudsCh 13 (3D reconstruction), Ch 14 (neural rendering)
Szeliski's perspective: "Structure from motion is perhaps the most beautiful result in computer vision: from nothing but 2D images, we recover the full 3D structure of the world and the cameras that captured it. The fact that this works at all is remarkable. That it works at city scale from tourist photos is extraordinary."
What technology enables AR devices to overlay virtual objects on the real world accurately?