Szeliski, Chapter 11

Structure from Motion and SLAM

Recovering 3D structure and camera motion from 2D images: calibration, pose estimation, SfM, bundle adjustment, and SLAM.

Prerequisites: Chapter 2 (projective geometry), Chapter 7 (features), Chapter 8 (alignment).

Chapters

Simulations

Assumed CV Knowledge

Chapter 0: Why SfM?

Walk around a building and take photos from different angles. From these 2D images alone, can we recover the 3D shape of the building and figure out where each camera was? Yes — this is Structure from Motion (SfM).

SfM solves two problems simultaneously:

Structure: The 3D positions of scene points
Motion: The position and orientation (pose) of each camera

Why it matters: SfM powers Google Earth (3D city models from satellite/street images), autonomous driving (mapping from dashcam video), AR/VR (spatial mapping), archaeology (digital heritage preservation), and Hollywood VFX (match-moving CGI into real footage).

SfM Overview

Multiple views of a scene reveal its 3D structure through triangulation.

What two things does Structure from Motion recover simultaneously?

The 3D positions of scene points (structure) and the position/orientation of each camera (motion) Color and brightness Object categories and boundaries

Chapter 1: Camera Calibration

Before recovering 3D structure, you need to know your camera's intrinsic parameters: focal length, principal point, and lens distortion.

K = ⎡ f_x 0 c_x ⎤
⎣ 0 f_y c_y ⎦
⎣ 0 0 1 ⎦

The intrinsic matrix K maps from camera coordinates to pixel coordinates. f_x, f_y are the focal lengths in pixels, and (c_x, c_y) is the principal point (usually near image center).

Method	How It Works
Checkerboard	Photograph a known pattern from multiple angles. Detect corners, solve for K. (Zhang's method)
Vanishing points	Detect parallel lines in the scene. Their vanishing points constrain K.
Self-calibration	Estimate K from point correspondences alone, using constraints from the essential matrix.

Radial distortion bends straight lines into curves (barrel or pincushion distortion). It is modeled as: x' = x(1 + k₁r² + k₂r⁴), where r is the distance from the principal point. Correcting distortion is essential before any geometric computation — uncorrected distortion corrupts all subsequent 3D estimates.

What does the intrinsic matrix K encode?

The camera's internal parameters (focal length, principal point) that map from 3D camera coordinates to 2D pixel coordinates The camera's position in the world The scene's 3D structure

Chapter 2: Pose Estimation

Pose estimation computes the camera's position and orientation given known 3D-to-2D correspondences. "I know this 3D point projects to this pixel — where must the camera be?"

The Perspective-n-Point (PnP) problem: given n pairs of 3D points and their 2D projections, find the camera pose [R|t].

Points	Method
n = 3	P3P: 4 solutions (choose with a 4th point)
n = 4	Direct linear transform (DLT)
n ≥ 6	Linear least squares + refinement
Many	RANSAC + PnP for robustness

Triangulation: Once you know camera poses, recovering a 3D point from its projections in two or more views is straightforward: intersect the back-projected rays. In practice, rays do not intersect perfectly due to noise, so you minimize the reprojection error. This is triangulation — the inverse of projection.

Visual localization uses pose estimation at city scale: match a query photo against a 3D map (built by SfM), find 2D-3D correspondences, and solve PnP. This enables GPS-free positioning accurate to centimeters.

What is the minimum number of 3D-to-2D correspondences needed to estimate a camera's pose?

3 points (P3P), which gives up to 4 possible solutions disambiguated with a 4th point 1 point 10 points

Chapter 3: Two-Frame SfM

The simplest SfM setup: two images. You do not know the 3D scene or either camera pose. All you have are point correspondences (from feature matching).

Despite this seemingly impossible situation, the geometry is remarkably constrained. A matched pair of points (x, x') in two calibrated cameras must satisfy the epipolar constraint:

x'^T E x = 0

where E is the essential matrix. It encodes the relative rotation and translation between cameras. With 5 or more correspondences, you can recover E (and thus the relative pose) up to scale.

Scale ambiguity: From two views alone, you can recover the relative pose and 3D structure, but not the absolute scale. A model of a toy house and a real house produce identical images if the camera is scaled proportionally. You need either a known distance (baseline) or additional constraints (GPS, IMU) to resolve scale.

For uncalibrated cameras (unknown K), the constraint becomes x'^TFx = 0, where F is the fundamental matrix. F has 7 DOF and can be estimated from 7 or more correspondences.

Why is there a scale ambiguity in two-frame SfM?

A scene and a proportionally scaled version of it produce identical 2D projections, so absolute scale cannot be determined from images alone The images are too small There are not enough feature matches

Chapter 4: The Essential Matrix

The essential matrix E = [t]_×R encodes the relative rotation R and translation direction t between two calibrated cameras. It has exactly 5 degrees of freedom (3 for rotation + 2 for translation direction; magnitude is unknown due to scale ambiguity).

Given the epipolar constraint x'^TEx = 0, each point correspondence provides one equation. Since E has 5 DOF, 5 points suffice (Nistér's 5-point algorithm). In practice, the 8-point algorithm (with normalization) is simpler and used inside RANSAC.

Epipolar lines: Given a point x in image 1, the epipolar constraint tells us its match must lie on a specific line in image 2: the epipolar line l' = Ex. This reduces the search from 2D to 1D, dramatically speeding up matching. It is also a powerful geometric check: if a match does not lie near the epipolar line, it is wrong.

The epipole is where the other camera's center projects into your image. All epipolar lines pass through it. When the cameras have the same horizontal axis (rectified), the epipolar lines become horizontal scan lines — the basis for stereo matching (Chapter 12).

Epipolar Geometry

A point in one image constrains its match to an epipolar line in the other image.

Point position 50

What does the epipolar constraint reduce the matching search from?

From a 2D search over the entire image to a 1D search along a single epipolar line From color to grayscale From multiple images to a single image

Chapter 5: Multi-Frame SfM

Two frames give you relative pose and sparse 3D points. Adding more frames gives more points and more constraints, improving accuracy. Multi-frame SfM scales this to hundreds or thousands of images.

Two strategies:

Strategy	How It Works	Trade-off
Incremental SfM	Start with 2 images. Add new images one at a time, solving PnP for pose, triangulating new points, running bundle adjustment.	Robust but slow. Drift accumulates.
Global SfM	Estimate all relative poses from image pairs, then solve for all absolute poses simultaneously (rotation averaging + translation averaging).	Faster but less robust to outliers.

COLMAP is the gold standard for incremental SfM. Given a set of unordered photos, it: (1) extracts and matches SIFT features, (2) initializes from the best image pair, (3) incrementally adds images via PnP, (4) triangulates new 3D points, (5) runs bundle adjustment after each addition. Output: camera poses + sparse 3D point cloud.

Internet-scale SfM (Building Rome in a Day) processed hundreds of thousands of tourist photos of landmarks, automatically building 3D models of famous sites from Flickr photos. This demonstrated that SfM could work "in the wild" with uncontrolled, diverse imagery.

What is the key advantage of incremental SfM over global SfM?

It is more robust because each new image is verified against the existing reconstruction, making it resilient to outliers It is faster It does not need feature matching

Chapter 6: Bundle Adjustment

Bundle adjustment is the final refinement step. It simultaneously optimizes all camera poses and all 3D point positions to minimize the total reprojection error:

min_R,t,X ∑_i,j ||x_ij − π(R_iX_j + t_i)||²

where π is the projection function, x_ij is the observed 2D position of point j in camera i, and X_j is the 3D point position.

Sparsity is the key. A reconstruction with 1000 cameras and 100,000 points has 306,000 parameters. But each 2D observation involves only one camera (6 params) and one point (3 params). The resulting Jacobian is extremely sparse. The Schur complement trick eliminates all point variables, reducing the system to a camera-only problem of 6,000 unknowns. This makes bundle adjustment tractable even for city-scale reconstructions.

Modern solvers like Ceres (Google) and g2o exploit this sparsity with efficient sparse Cholesky factorization, often converging in seconds for problems with millions of observations.

What does bundle adjustment minimize?

The total reprojection error — the sum of squared distances between observed 2D points and the re-projected 3D points across all cameras The number of cameras The image resolution

Chapter 7: SLAM

Simultaneous Localization and Mapping (SLAM) is SfM in real time. A robot or AR device moves through an unknown environment, building a map while simultaneously tracking its own position within that map.

System	Key Approach
ORB-SLAM	Feature-based visual SLAM. ORB features, bag-of-words for loop closure, bundle adjustment.
LSD-SLAM	Direct (no features). Optimizes photometric error on semi-dense depth maps.
DTAM	Dense tracking and mapping. Real-time dense reconstruction from a single moving camera.
NeRF-SLAM	Neural implicit map representation. Combines SLAM with neural radiance fields.

Loop closure: The critical challenge in SLAM. As the robot explores, pose errors accumulate (drift). When it returns to a previously visited location, the system must recognize this (place recognition) and correct the accumulated drift by distributing the error around the loop. Without loop closure, maps become globally inconsistent.

Applications span robotics (autonomous navigation), AR (spatial anchors), and autonomous driving (HD map building). Modern AR devices (Apple Vision Pro, Meta Quest) run visual-inertial SLAM in real time, fusing camera tracking with IMU data for robust 6-DOF pose estimation.

SLAM Trajectory

Watch a camera trace a path, building a map and correcting drift on loop closure.

What is loop closure and why is it critical for SLAM?

Recognizing a previously visited place and correcting accumulated drift to keep the map globally consistent Closing the camera aperture Ending the SLAM algorithm

Chapter 8: Showcase — Triangulation Demo

See how triangulation works: two cameras observe the same 3D point. The intersection of back-projected rays gives the 3D position. Move the point to see how the geometry changes.

Two-View Triangulation

Two cameras (blue triangles) observe a 3D point (yellow). Back-projected rays intersect at the point's 3D location.

Point depth 50

Baseline 40

Baseline matters: Wide baselines give better depth precision (the triangulation angle is larger). Narrow baselines are more robust (easier to match features). This is the classic baseline-matching trade-off in stereo and SfM.

Chapter 9: Connections

Concept	Used In
Camera calibration	Ch 2 (image formation), Ch 12 (stereo), every 3D vision task
Pose estimation / PnP	Ch 8 (alignment), Ch 12 (rectification), AR applications
Epipolar geometry	Ch 12 (stereo correspondence), Ch 9 (motion estimation)
Bundle adjustment	Ch 8 (panoramas), Ch 12 (multi-view stereo), autonomous driving
SLAM	Robotics, AR/VR, autonomous driving, drone navigation
SfM point clouds	Ch 13 (3D reconstruction), Ch 14 (neural rendering)

Szeliski's perspective: "Structure from motion is perhaps the most beautiful result in computer vision: from nothing but 2D images, we recover the full 3D structure of the world and the cameras that captured it. The fact that this works at all is remarkable. That it works at city scale from tourist photos is extraordinary."

What technology enables AR devices to overlay virtual objects on the real world accurately?

Visual-inertial SLAM, which tracks the device's 6-DOF pose in real time while building a 3D map of the environment Image classification Texture synthesis