Szeliski, Chapter 8

Image Alignment and Stitching

From matching pairs of images to building seamless panoramas: alignment, RANSAC, blending, and compositing.

Prerequisites: Chapter 7 (features), Chapter 2 (projective transforms), Chapter 4 (optimization).

Chapter 0: Why Stitching?

Your phone's panorama mode works like magic: slowly sweep the camera, and out comes a wide, seamless image. But behind that simplicity lies a chain of algorithms: feature detection, geometric alignment, global optimization, and pixel blending.

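Image stitching solves the problem of combining multiple overlapping images into a single, larger image. The pipeline: detect and match features, estimate the transformation between each overlapping pair robustly, refine all transformations jointly, warp the images onto a common projection surface, and blend the overlaps into the final panorama.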

The key insight: if the camera only rotates about its optical center (no translation), every pair of images is related by a homography, an 8-parameter projective transformation. This means a scene with any depth structure can be stitched exactly, because all images share a single viewpoint. Translation breaks this, but for panoramas, rotation is the dominant motion.
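A quick numerical check of this claim, as a sketch assuming NumPy and a hypothetical pinhole camera (focal length 500 px, principal point (320, 240)). For a pure rotation R, the induced homography is H = K R K^-1, and it predicts the second image's points correctly whether they sit 2 or 200 units away. The helper names are illustrative.

```python
import numpy as np

def rotation_homography(K, R):
    """Homography induced by a pure rotation R of a camera with intrinsics K."""
    return K @ R @ np.linalg.inv(K)

def project(K, R, X):
    """Project 3D points X (N x 3) into a camera at the origin with rotation R."""
    x = (K @ (R @ X.T)).T
    return x[:, :2] / x[:, 2:3]

# Hypothetical camera: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(10.0)  # pan by 10 degrees about the vertical (y) axis
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])

# Points at very different depths, all in front of both cameras.
X = np.array([[0.3, -0.2, 2.0],
              [-1.0, 0.5, 10.0],
              [20.0, 5.0, 200.0]])
x1 = project(K, np.eye(3), X)  # first image
x2 = project(K, R, X)          # second image, after the rotation

H = rotation_homography(K, R)
x1_h = np.hstack([x1, np.ones((len(x1), 1))])
x2_pred = (H @ x1_h.T).T
x2_pred = x2_pred[:, :2] / x2_pred[:, 2:3]
print(np.abs(x2_pred - x2).max())  # round-off level for every depth
```

Adding a translation to the second camera would make the prediction error depend on depth, which is exactly why parallax breaks the single-homography model.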
[Interactive figure: Stitching Pipeline. The five stages from raw images to a seamless panorama.]

Under what camera motion can two images be perfectly related by a homography?

Chapter 1: Pairwise Alignment

Given feature matches between two images, how do you find the transformation that aligns them? Start with the simplest case: a translation (shift in x and y).

With matched point pairs $(\mathbf{x}_i, \mathbf{x}'_i)$, set up a least-squares problem:

$$\min_{\mathbf{t}} \sum_i \left\| \mathbf{x}'_i - (\mathbf{x}_i + \mathbf{t}) \right\|^2$$

The solution is just the average displacement over the N matches, $\mathbf{t} = \frac{1}{N}\sum_i (\mathbf{x}'_i - \mathbf{x}_i)$. For more complex transformations, the math changes but the principle remains: find the parameters that minimize the sum of squared reprojection errors.
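In code, the closed-form translation estimate is a single mean. A minimal sketch assuming NumPy; `estimate_translation` and the three hand-made matches are hypothetical.

```python
import numpy as np

def estimate_translation(x, x_prime):
    """Least-squares translation t minimizing sum_i ||x'_i - (x_i + t)||^2.

    x, x_prime: (N, 2) arrays of matched points in the first and second image.
    Setting the gradient to zero gives t = mean(x'_i - x_i).
    """
    return np.mean(x_prime - x, axis=0)

x = np.array([[10.0, 20.0], [30.0, 5.0], [50.0, 40.0]])
noise = np.array([[0.1, 0.0], [-0.1, 0.2], [0.0, -0.2]])  # zero-mean, for illustration
x_prime = x + np.array([4.0, -2.0]) + noise
t = estimate_translation(x, x_prime)
print(t)  # close to [4, -2]
```

Stacking one 2×2 identity block per match and calling a generic least-squares solver gives the same answer; the mean is just that solution in closed form.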

| Model | DOF | Min. points | What it allows |
|---|---|---|---|
| Translation | 2 | 1 | Shift only |
| Similarity | 4 | 2 | Shift + rotate + uniform scale |
| Affine | 6 | 3 | Shift + rotate + scale + shear |
| Homography | 8 | 4 | Full projective (handles perspective) |
Homography estimation: A homography H is a 3×3 matrix with 8 degrees of freedom, since its overall scale is arbitrary. Each point correspondence gives 2 equations, so 4 points (no three collinear) give 8 equations, exactly enough. In practice, you use many more points and solve in the least-squares sense, which averages out localization noise; robustness to outright mismatches comes from RANSAC.
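The two-equations-per-match count becomes concrete in the direct linear transform (DLT). This is a minimal sketch assuming NumPy; `estimate_homography_dlt` is a hypothetical helper name. Each match contributes two rows to a 2N×9 matrix A, and the homography entries are the right singular vector of A with the smallest singular value, which is the least-squares solution when N > 4. On real data, normalizing coordinates first (Hartley normalization) improves conditioning; it is left out here for brevity.

```python
import numpy as np

def estimate_homography_dlt(x, x_prime):
    """Estimate H (3x3) with x' ~ H x from N >= 4 matches; x, x_prime are (N, 2) arrays."""
    rows = []
    for (u, v), (up, vp) in zip(x, x_prime):
        # up = (h1 u + h2 v + h3) / (h7 u + h8 v + h9), cross-multiplied:
        rows.append([-u, -v, -1.0, 0.0, 0.0, 0.0, up * u, up * v, up])
        # vp uses the second row of H in the same way:
        rows.append([0.0, 0.0, 0.0, -u, -v, -1.0, vp * u, vp * v, vp])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)  # unit-norm null vector (least squares when N > 4)
    return H / H[2, 2]        # fix the arbitrary scale so that h9 = 1

# Synthetic check: push 5 points through a known H, then recover it.
H_true = np.array([[1.1, 0.05, 20.0],
                   [-0.03, 0.95, -10.0],
                   [1e-4, 2e-4, 1.0]])
x = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 100.0], [0.0, 100.0], [50.0, 30.0]])
xh = np.hstack([x, np.ones((len(x), 1))]) @ H_true.T
x_prime = xh[:, :2] / xh[:, 2:3]
H_est = estimate_homography_dlt(x, x_prime)
print(np.round(H_est, 4))
```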
How many point correspondences are needed at minimum to estimate a homography?

Chapter 2: RANSAC

Feature matching always produces some wrong matches (outliers). Even a few outliers can completely destroy a least-squares fit. RANSAC (Random Sample Consensus) is the standard solution.

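The algorithm:
1. Randomly select a minimal sample of n matches (for example, n = 4 for a homography).
2. Fit the model to that sample.
3. Count the inliers: matches whose residual under this model falls below a threshold.
4. Repeat for k trials, keeping the model with the most inliers.
5. Re-estimate the model by least squares on all of its inliers.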
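A minimal RANSAC sketch for line fitting, assuming NumPy. The helper names, the 0.1 inlier threshold, the 100 iterations, and the synthetic data (70 inliers, 30 outliers) are illustrative assumptions. A line needs n = 2 points per sample, and the final model is refit by total least squares on the consensus set. A homography uses the same loop with n = 4 and a reprojection-error threshold.

```python
import numpy as np

def fit_line_tls(pts):
    """Total least-squares line through pts: returns (centroid, unit direction)."""
    c = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - c)
    return c, Vt[0]  # direction of largest spread

def line_distances(pts, c, d):
    """Perpendicular distance of each point to the line through c with unit direction d."""
    diff = pts - c
    return np.abs(diff[:, 0] * d[1] - diff[:, 1] * d[0])

def ransac_line(pts, n_iters=100, threshold=0.1, seed=None):
    rng = np.random.default_rng(seed)
    best = np.zeros(len(pts), dtype=bool)
    for _ in range(n_iters):
        i, j = rng.choice(len(pts), size=2, replace=False)  # minimal sample: 2 points
        d = pts[j] - pts[i]
        norm = np.linalg.norm(d)
        if norm == 0:
            continue  # degenerate sample, draw again
        inliers = line_distances(pts, pts[i], d / norm) < threshold  # score hypothesis
        if inliers.sum() > best.sum():
            best = inliers  # keep the hypothesis with the most inliers
    c, d = fit_line_tls(pts[best])  # refit on the consensus set (needs >= 2 inliers)
    return c, d, best

# Synthetic data: 70 points near y = 0.5x + 1 plus 30 uniform outliers.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 10, 70)
good = np.column_stack([xs, 0.5 * xs + 1 + rng.normal(0, 0.02, 70)])
bad = rng.uniform([0, -5], [10, 10], size=(30, 2))
pts = np.vstack([good, bad])

c, d, mask = ransac_line(pts, seed=1)
slope = d[1] / d[0]
intercept = c[1] - slope * c[0]
print(f"RANSAC: y = {slope:.3f} x + {intercept:.3f}, {mask.sum()} inliers")
```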

How many iterations? If the inlier ratio is w and each sample needs n points, the probability that one trial picks only inliers is w^n. To be 99% sure of drawing at least one all-inlier sample, you need k = log(0.01) / log(1 − w^n) trials. With 50% inliers and n = 4, that is about 72 iterations; with 80% inliers, just 9.
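The iteration formula as a small Python helper (the function name is hypothetical), evaluated for a few inlier ratios and sample sizes. The last call shows the cheaper two-point case used for line fitting.

```python
import math

def ransac_iterations(w, n, p=0.99):
    """Number of RANSAC trials k with P(at least one all-inlier sample) >= p.

    w: inlier ratio, n: matches per minimal sample, p: desired confidence.
    """
    return math.ceil(math.log(1 - p) / math.log(1 - w ** n))

print(ransac_iterations(0.5, 4))  # 72
print(ransac_iterations(0.8, 4))  # 9
print(ransac_iterations(0.8, 2))  # 5 (a line needs only n = 2)
```

Because w appears raised to the power n, minimal samples matter: every extra point per sample multiplies the required number of trials.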
[Interactive simulation: RANSAC Line Fitting. Least squares (gray) is pulled toward the outliers, while RANSAC (green) ignores them.]
Why is RANSAC preferred over least squares when outliers are present?

Chapter 3: Parametric Motion Models

The choice of motion model depends on the scene and the camera motion.

When to use what: For document scanning with the camera held nearly parallel to the page, an affine transform often suffices, since perspective foreshortening is small. For panoramas, a homography handles perspective. For whiteboard capture, a homography corrects keystone distortion and maps the trapezoid back to a rectangle.

A homography maps point (x, y) to (x', y') via:

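$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \frac{1}{h_7 x + h_8 y + 1} \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$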

The division by the last row is what gives projective transformations their power: parallel lines can converge, rectangles can become trapezoids.
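To see the effect of that division, here is a minimal NumPy sketch (helper names are hypothetical) that maps the unit square through an affine matrix (h7 = h8 = 0) and through a homography with h7 = 0.5. It then tests whether the square's top and bottom edges stay parallel using a 2D cross product.

```python
import numpy as np

def apply_homography(H, pts):
    """Map (N, 2) points through a 3x3 matrix, dividing by the third homogeneous coordinate."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def edge_cross(M, quad):
    """Cross product of the mapped bottom (q0->q1) and top (q3->q2) edges; 0 iff parallel."""
    q = apply_homography(M, quad)
    bottom, top = q[1] - q[0], q[2] - q[3]
    return bottom[0] * top[1] - bottom[1] * top[0]

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
A = np.array([[1.2, 0.3, 0.5],
              [0.1, 0.9, -0.2],
              [0.0, 0.0, 1.0]])  # affine: h7 = h8 = 0, the denominator is always 1
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.0, 1.0]])  # projective: h7 = 0.5, the denominator grows with x

print("affine:", edge_cross(A, square))      # ~0 (round-off only): edges stay parallel
print("homography:", edge_cross(H, square))  # nonzero: the square becomes a trapezoid
```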

[Interactive figure: Transformation Gallery. How different motion models transform a square grid.]

What makes a homography more powerful than an affine transform?

Chapter 4: Rotational Panoramas

When the camera rotates without translating, the relationship between any two images is a homography. But what projection surface should the panorama use?

| Projection | Properties | Best for |
|---|---|---|
| Planar | Straight lines stay straight. Severe distortion at the edges. | Narrow fields of view (<90°) |
| Cylindrical | Vertical lines stay straight. Wraps around horizontally. | Horizontal panoramas (120-360°) |
| Spherical | Full omnidirectional coverage. Area distortion at the poles. | Full 360° × 180° spheres |
Cylindrical projection: To create a cylindrical panorama, warp each image onto a cylinder. In cylindrical coordinates, images taken by a camera panning about its vertical axis differ only by a horizontal translation. This simplifies alignment dramatically: you only need to estimate a shift instead of a full homography.
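A minimal sketch of the cylindrical warp, assuming NumPy, coordinates measured from the principal point, no lens distortion, and the common scale choice s = f: x' = s·atan(x/f), y' = s·y / sqrt(x² + f²). The check synthesizes rays at a few azimuths and heights, pans the camera by 15° about its vertical axis, and confirms that the cylindrical coordinates shift by exactly f·α horizontally and not at all vertically. `to_cylindrical` is a hypothetical helper name.

```python
import numpy as np

def to_cylindrical(x, y, f, s=None):
    """Map image coords (x, y), measured from the principal point, to (s*theta, s*h)."""
    s = f if s is None else s
    theta = np.arctan2(x, f)           # azimuth of the viewing ray; atan(x / f) for f > 0
    h = y / np.sqrt(x ** 2 + f ** 2)   # height where the ray meets a unit-radius cylinder
    return s * theta, s * h

# Rays at azimuths phi and heights h, seen before and after a pan by alpha.
f = 500.0
alpha = np.deg2rad(15.0)
phi = np.deg2rad(np.array([-10.0, 5.0, 25.0]))
h = np.array([0.1, -0.3, 0.2])
x1, y1 = f * np.tan(phi), f * h / np.cos(phi)
x2, y2 = f * np.tan(phi - alpha), f * h / np.cos(phi - alpha)

u1, v1 = to_cylindrical(x1, y1, f)
u2, v2 = to_cylindrical(x2, y2, f)
print(u1 - u2)  # every entry equals f * alpha: a pure horizontal shift
print(v1 - v2)  # zero up to round-off: no vertical shift

# Warping with a focal length that is 20% too short breaks the pure-translation property.
u1w, _ = to_cylindrical(x1, y1, 0.8 * f)
u2w, _ = to_cylindrical(x2, y2, 0.8 * f)
print(u1w - u2w)  # the shift now varies from point to point
```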

But cylindrical projection introduces a subtlety: you need to know the camera's focal length to project correctly onto the cylinder. If the focal length is wrong, the warped images no longer differ by a pure translation, so no single shift aligns them exactly. Phone panorama modes often also use the gyroscope to estimate rotation rather than relying on feature-based alignment alone.

Why does cylindrical projection simplify panorama alignment?

Chapter 5: Global Alignment

Pairwise alignment estimates a transform between each pair. But errors accumulate: if image 1 aligns to image 2, and image 2 aligns to image 3, the alignment from 1 to 3 inherits both errors. Over many images, this creates visible drift.

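Global alignment estimates all transformations simultaneously, holding one image fixed as the reference frame and minimizing the total reprojection error across all overlapping image pairs:

$$\min_{H_1,\ldots,H_n} \sum_{(i,j)\,\in\,\text{pairs}} \; \sum_{k\,\in\,\text{matches}(i,j)} \left\| H_j \mathbf{x}_{kj} - H_i \mathbf{x}_{ki} \right\|^2$$

where $\mathbf{x}_{ki}$ is the location of match k in image i, and each $H_i$ maps image i into the panorama frame (including the division by the third homogeneous coordinate).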
Gap closing: When a panorama loops back on itself (360°), the last image overlaps the first. This creates a "loop closure" constraint that distributes accumulated drift evenly around the loop. Without it, there is a visible seam where the panorama wraps. With it, the error spreads so thin it becomes invisible.
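A one-dimensional sketch of gap closing, assuming NumPy and a cylindrical panorama in which each image's placement reduces to a horizontal offset t_i and the panorama width (here 800 px) is known. The function names and numbers are hypothetical. Chaining pairwise shifts dumps the entire misclosure at the wrap-around seam. Solving all constraints jointly by least squares, with image 0 fixed, spreads it evenly over every pair.

```python
import numpy as np

def chain_positions(d):
    """Pairwise-only alignment: accumulate measured shifts, with image 0 at 0."""
    return np.concatenate([[0.0], np.cumsum(d[:-1])])

def global_positions(d, width):
    """Least-squares offsets t_0..t_{n-1} (t_0 fixed at 0) for n images around a 360° loop.

    d[i] is the measured shift from image i to image (i + 1) mod n. The last
    measurement wraps around the loop: t_0 + width - t_{n-1} = d[n-1].
    """
    n = len(d)
    A = np.zeros((n, n - 1))          # column k-1 holds the coefficient of t_k
    b = np.asarray(d, dtype=float).copy()
    for i in range(n - 1):            # t_{i+1} - t_i = d[i]
        A[i, i] = 1.0
        if i > 0:
            A[i, i - 1] = -1.0
    A[n - 1, n - 2] = -1.0            # loop closure: -t_{n-1} = d[n-1] - width
    b[n - 1] -= width
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.concatenate([[0.0], t])

# Hypothetical loop: 8 images, true spacing 100 px on an 800 px cylinder, but every
# pairwise estimate is biased by +2 px, so the loop fails to close by 16 px.
d = np.full(8, 102.0)
chain = chain_positions(d)
print("seam error with chaining:", (800.0 - chain[-1]) - d[-1])  # -16.0, all at the wrap
print("global solution:", global_positions(d, 800.0))           # approximately 0, 100, ..., 700
```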

Recognizing panoramas (Brown and Lowe, 2007) automated the entire pipeline: given an unordered set of photos, the system identifies which images overlap, groups them into panoramas, estimates all transformations globally, and produces stitched results. Its combination of invariant features, RANSAC, bundle adjustment, and multi-band blending underlies much of today's panorama software.

What problem does global alignment solve that pairwise alignment does not?

Chapter 6: Bundle Adjustment

Global alignment for panoramas is a special case of bundle adjustment: the simultaneous refinement of camera parameters and 3D point positions to minimize total reprojection error.

For panoramas, "camera parameters" are just the rotation (and optionally focal length). For full 3D reconstruction (Chapter 11), bundle adjustment also optimizes camera positions and 3D point locations.

Why "bundle"? The name comes from the "bundle of rays" connecting each 3D point to the cameras that see it. Adjustment means tweaking camera poses and 3D points until these rays are as consistent as possible. It is a large nonlinear least-squares problem, typically solved with Levenberg-Marquardt. The Jacobian is sparse (each observation involves one camera and one point), enabling efficient computation even with millions of observations.

Modern bundle adjustment implementations (Ceres Solver, g2o) exploit this sparsity structure. Using the Schur complement, they eliminate the point parameters from the normal equations, so the linear system actually solved involves only the camera parameters. This pays off because there are far fewer cameras than points.
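A small dense sketch of the Schur-complement trick, assuming NumPy. The sizes, random Jacobian, and damping value are illustrative and are not taken from Ceres or g2o. The damped Levenberg-Marquardt normal equations are split into camera and point blocks. Because the point block C is block-diagonal (one 3×3 block per point), it inverts cheaply, the points are eliminated, and only the small camera system S is solved directly. Real solvers never form C⁻¹ or S as dense matrices the way this toy does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cam, cam_dim = 3, 6   # hypothetical: 3 cameras, 6 parameters each
n_pts, pt_dim = 20, 3   # 20 points, 3 coordinates each
nc, npt = n_cam * cam_dim, n_pts * pt_dim

# Jacobian with bundle-adjustment sparsity: each observation (2 residual rows)
# touches one camera block and one point block; every camera sees every point here.
blocks = []
for p in range(n_pts):
    for c in range(n_cam):
        rows = np.zeros((2, nc + npt))
        rows[:, c * cam_dim:(c + 1) * cam_dim] = rng.normal(size=(2, cam_dim))
        rows[:, nc + p * pt_dim:nc + (p + 1) * pt_dim] = rng.normal(size=(2, pt_dim))
        blocks.append(rows)
J = np.vstack(blocks)
r = rng.normal(size=J.shape[0])          # residual vector

lam = 1e-3                               # Levenberg-Marquardt damping
A = J.T @ J + lam * np.eye(nc + npt)     # damped normal equations: A dx = g
g = -J.T @ r

B, E, C = A[:nc, :nc], A[:nc, nc:], A[nc:, nc:]
g_c, g_p = g[:nc], g[nc:]

# C is block-diagonal (one 3x3 block per point), so it inverts block by block.
C_inv = np.zeros_like(C)
for p in range(n_pts):
    s = slice(p * pt_dim, (p + 1) * pt_dim)
    C_inv[s, s] = np.linalg.inv(C[s, s])

S = B - E @ C_inv @ E.T                  # reduced camera system, only nc x nc
dx_c = np.linalg.solve(S, g_c - E @ C_inv @ g_p)
dx_p = C_inv @ (g_p - E.T @ dx_c)        # back-substitute for the points
dx = np.concatenate([dx_c, dx_p])

print(S.shape, np.abs(dx - np.linalg.solve(A, g)).max())  # (18, 18), round-off-level error
```

The direct solve works on a 78×78 system; the Schur route only factors an 18×18 one plus twenty 3×3 blocks, and that gap widens enormously with real point counts.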

What makes bundle adjustment computationally tractable despite having millions of parameters?

Chapter 7: Blending and Compositing

Alignment is only half the battle. When images overlap, you need to combine their pixels without visible seams. Differences in exposure, white balance, and vignetting create mismatches at boundaries.

Blending strategies:

| Method | How it works | Quality |
|---|---|---|
| Feathering | Linear weight falloff from each image's center; weighted average in the overlap. | Simple, but shows ghosting with motion. |
| Laplacian pyramid | Blend each frequency band separately: slow transition for low frequencies, sharp seam for high frequencies. | Excellent; the standard approach. |
| Optimal seam | Find the cut through the overlap where the images match best (graph cut / dynamic programming). | Best for parallax and moving objects. |
| Gradient-domain | Blend gradients, then reconstruct via the Poisson equation; seamless by construction. | Eliminates intensity differences. |
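Laplacian blending: Decompose both images into Laplacian pyramids (band-pass frequency layers) and build a Gaussian pyramid of the blend mask. At each level, blend the two bands using the mask at that level, then collapse the blended pyramid. Because the coarse mask levels are blurry and the fine ones are sharp, low-frequency content (exposure) transitions smoothly while high-frequency content (texture) switches sharply at the seam. The result: seams that are hard to see, with far less ghosting than a wide feathering window.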
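A compact grayscale sketch of Laplacian pyramid blending, assuming NumPy only. The 5-tap binomial filter, the repeat-then-blur upsampling, the 4 levels, and all function names are illustrative choices, not a specific library's implementation. Because `collapse` uses the same `up` as pyramid construction, an unblended pyramid reconstructs its input exactly. The example joins two exposures that differ by 0.5 in brightness.

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # 5-tap binomial (Gaussian-like) filter

def blur(img):
    """Separable 5-tap blur with edge replication; output has the same shape as img."""
    p = np.pad(img, 2, mode="edge")
    h, w = img.shape
    tmp = sum(k * p[:, i:i + w] for i, k in enumerate(KERNEL))    # horizontal pass
    return sum(k * tmp[i:i + h, :] for i, k in enumerate(KERNEL))  # vertical pass

def down(img):
    return blur(img)[::2, ::2]

def up(img, shape):
    """Upsample by pixel repetition, crop to `shape`, then blur."""
    big = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)[:shape[0], :shape[1]]
    return blur(big)

def gaussian_pyramid(img, levels):
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(down(pyr[-1]))
    return pyr

def laplacian_pyramid(img, levels):
    g = gaussian_pyramid(img, levels)
    bands = [g[i] - up(g[i + 1], g[i].shape) for i in range(levels - 1)]
    return bands + [g[-1]]  # band-pass levels plus the low-pass residual

def collapse(pyr):
    img = pyr[-1]
    for band in reversed(pyr[:-1]):
        img = band + up(img, band.shape)
    return img

def laplacian_blend(a, b, mask, levels=4):
    """mask = 1 selects a, 0 selects b; each band uses the matching Gaussian mask level."""
    la, lb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    gm = gaussian_pyramid(mask.astype(float), levels)
    return collapse([m * x + (1 - m) * y for x, y, m in zip(la, lb, gm)])

# Two exposures of the same texture, joined at column 32.
rng = np.random.default_rng(0)
texture = rng.normal(0.0, 0.05, size=(64, 64))
a, b = 0.8 + texture, 0.3 + texture
mask = np.zeros((64, 64))
mask[:, :32] = 1.0
hard = mask * a + (1 - mask) * b  # naive hard cut: a 0.5 brightness step at the seam
soft = laplacian_blend(a, b, mask, levels=4)
```

Far from the seam the result equals the chosen input; near it, the exposure difference ramps over many pixels instead of jumping between two adjacent columns.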
[Interactive figure: Blending Comparison. Naive stitching (hard cut) versus smooth blending; note the difference at the seam.]

Why does Laplacian pyramid blending produce better results than simple feathering?

Chapter 8: Showcase — Panorama Builder

Let's visualize the stitching pipeline end to end. Three overlapping images are aligned via homographies and blended into a seamless panorama.

[Interactive figure: Step-by-Step Stitching. Three images align and blend into a single panorama.]

Real-world challenges: Motion parallax (nearby objects shift differently from far ones), moving objects (people walking), exposure differences, lens distortion, and rolling shutter artifacts all complicate real panoramas. Modern systems address many of these: deghosting removes moving objects, exposure compensation normalizes brightness, and lens distortion models correct warping.

Chapter 9: Connections

Image alignment and stitching connects to many other topics:

| Concept | Used in |
|---|---|
| RANSAC | Ch 7 (matching), Ch 11 (pose estimation), Ch 9 (motion), virtually everywhere |
| Homography estimation | Ch 2 (projective geometry), Ch 11 (planar SfM), AR overlays |
| Bundle adjustment | Ch 11 (SfM), Ch 12 (multi-view stereo), visual SLAM |
| Laplacian blending | Ch 10 (HDR, compositing), Ch 3 (pyramids), image editing |
| Panorama recognition | Ch 6 (image retrieval), photo organization |
| Global alignment | Ch 11 (global SfM), map building |
Szeliski's perspective: "Image stitching is one of the great success stories of computer vision. The combination of robust feature matching, projective geometry, and multi-band blending produces results that are almost always seamless. It's also one of the most widely deployed vision algorithms — billions of panoramas are created every year on smartphones."
Which component of the stitching pipeline is also critical for 3D reconstruction (Chapter 11)?