Szeliski, Chapter 8

Image Alignment and Stitching

From matching pairs of images to building seamless panoramas: alignment, RANSAC, blending, and compositing.

Prerequisites: Chapter 7 (features), Chapter 2 (projective transforms), Chapter 4 (optimization).

Chapter 0: Why Stitching?

Your phone's panorama mode works like magic: slowly sweep the camera, and out comes a wide, seamless image. But behind that simplicity lies a chain of algorithms: feature detection, geometric alignment, global optimization, and pixel blending.

Image stitching solves the problem of combining multiple overlapping images into a single, larger image. The pipeline:

Detect and match features between overlapping image pairs (Chapter 7)
Estimate geometric transformations that align each pair
Reject outliers using RANSAC
Globally align all images into a common coordinate frame
Blend the overlapping regions seamlessly

The key insight: If the camera only rotates about its optical center (no translation), every pair of images is related by a homography, a projective transformation with 8 degrees of freedom. Because a pure rotation produces no parallax, the pixel mapping does not depend on scene depth, so any scene can in principle be stitched exactly from a single viewpoint. Translation breaks this, but for panoramas, rotation is the dominant motion.
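One way to check this claim numerically: for a pinhole camera with intrinsic matrix K that rotates by R, pixels map through H = K R K⁻¹. The sketch below assumes NumPy and uses made-up intrinsics and a made-up pan angle; the helper names are illustrative only.

```python
import numpy as np

def rotation_about_y(angle):
    """Rotation matrix for a pan of `angle` radians about the vertical (y) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(K, R, X):
    """Pixel coordinates of 3D point X for a camera at the origin with rotation R."""
    p = K @ R @ X
    return p[:2] / p[2]

def warp(H, pixel):
    """Apply homography H to a 2D pixel using homogeneous coordinates."""
    q = H @ np.array([pixel[0], pixel[1], 1.0])
    return q[:2] / q[2]

# Hypothetical intrinsics: focal length 800 px, principal point (320, 240).
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R = rotation_about_y(0.1)
H = K @ R @ np.linalg.inv(K)

# Points at very different depths: the same H maps all of them correctly.
points = [np.array([0.5, 0.2, 2.0]),
          np.array([-1.0, 0.3, 10.0]),
          np.array([2.0, -1.0, 50.0])]
errors = [np.linalg.norm(warp(H, project(K, np.eye(3), X)) - project(K, R, X))
          for X in points]
```

The entries of `errors` stay at floating-point noise level whatever the depth of each point, which is the depth independence described above.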

Stitching Pipeline

The five stages from raw images to a seamless panorama.

Under what camera motion can two images be perfectly related by a homography?

Any camera motion
Pure rotation (no translation) from a single viewpoint
Only horizontal panning

Chapter 1: Pairwise Alignment

Given feature matches between two images, how do you find the transformation that aligns them? Start with the simplest case: a translation (shift in x and y).

With matched point pairs (x_i, x_i'), set up a least-squares system:

min_t ∑_i ||x_i' − (x_i + t)||²

The solution is just the average displacement: t = mean(x_i' − x_i). For more complex transformations, the math changes but the principle remains: find the parameters that minimize the sum of squared reprojection errors.

Model	DOF	Min. Points	What It Allows
Translation	2	1	Shift only
Similarity	4	2	Shift + rotate + uniform scale
Affine	6	3	Shift + rotate + scale + shear
Homography	8	4	Full projective (handles perspective)

Homography estimation: A homography H is a 3×3 matrix with 9 entries but only 8 DOF, since its overall scale is arbitrary. Each point correspondence gives 2 equations. With 4 points (no three of them collinear), you get 8 equations, which is exactly enough. In practice, you use many more points and solve by least squares for robustness.
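One common way to set up these equations is the direct linear transform (DLT): stack two rows per correspondence and take the null vector from an SVD. The sketch below assumes NumPy. It skips the coordinate normalization that is usually recommended for noisy data, and it assumes the bottom-right entry of H is nonzero so that it can be scaled to 1.

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT: solve A h = 0 for the 9 entries of H (defined up to scale)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if src.shape != dst.shape or src.shape[0] < 4:
        raise ValueError("need at least 4 correspondences")
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Two equations per match, from u = (h1 x + h2 y + h3) / (h7 x + h8 y + h9)
        # and the analogous expression for v.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)
    # The right singular vector for the smallest singular value minimizes
    # ||A h|| subject to ||h|| = 1.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # assumes H[2, 2] != 0

# Synthetic check with a made-up ground-truth homography.
H_true = np.array([[1.2, 0.1, 5.0], [-0.05, 0.9, -3.0], [0.001, 0.002, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 100], [0, 100], [50, 30]], dtype=float)
homog = np.hstack([src, np.ones((len(src), 1))]) @ H_true.T
dst = homog[:, :2] / homog[:, 2:]
H_est = estimate_homography(src, dst)
```

With noise-free matches, `H_est` agrees with `H_true` to numerical precision. With noisy matches, the same code returns an algebraic least-squares fit.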

How many point correspondences are needed at minimum to estimate a homography?

2 points
4 points: each gives 2 equations, and a homography has 8 degrees of freedom
8 points

Chapter 2: RANSAC

Feature matching always produces some wrong matches (outliers). Even a few outliers can completely destroy a least-squares fit. RANSAC (Random Sample Consensus) is the standard solution.

The algorithm:

Sample: Randomly pick the minimum number of points (e.g., 4 for a homography)
Fit: Compute the model from this minimal set
Score: Count how many other matches agree with the model (inliers)
Repeat: After many iterations, keep the model with the most inliers
Refine: Re-estimate the model using all inliers (least squares)
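The five steps above can be sketched for the simpler problem of line fitting, where the minimal sample is 2 points. This is an illustrative sketch in plain Python: the threshold, iteration count, seed, and data are made up, and the refinement step assumes the line is not vertical.

```python
import random

def fit_line(p, q):
    """Line through two points as (a, b, c) with a*x + b*y + c = 0, a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    if norm == 0.0:
        return None  # degenerate sample: identical points
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)

def least_squares_line(points):
    """Ordinary least squares fit of y = m*x + k (assumes a non-vertical line)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    m = sxy / sxx
    return m, my - m * mx

def ransac_line(points, threshold=0.5, iterations=200, seed=0):
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        p, q = rng.sample(points, 2)               # Sample a minimal set
        line = fit_line(p, q)                      # Fit
        if line is None:
            continue
        a, b, c = line
        inliers = [pt for pt in points             # Score by point-to-line distance
                   if abs(a * pt[0] + b * pt[1] + c) <= threshold]
        if len(inliers) > len(best_inliers):       # Keep the best model so far
            best_inliers = inliers
    return least_squares_line(best_inliers), best_inliers  # Refine on inliers

# Hypothetical data: 20 points on y = 2x + 1 plus 6 gross outliers.
points = [(float(x), 2.0 * x + 1.0) for x in range(20)]
points += [(2.0, 30.0), (5.0, -20.0), (8.0, 40.0),
           (12.0, -15.0), (15.0, 60.0), (18.0, 0.0)]
(m, k), inliers = ransac_line(points)
m_all, k_all = least_squares_line(points)  # plain least squares, for comparison
```

On this data the RANSAC result recovers the slope and intercept of the clean line, while `m_all` is visibly pulled away from 2 by the outliers.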

How many iterations? If the inlier ratio is w and we need n points, the probability of picking all inliers in one trial is wⁿ. To be 99% sure of at least one good sample: k = log(0.01)/log(1 − wⁿ), rounded up. With 50% inliers and n=4, you need about 72 iterations. With 80% inliers, just 9.
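The iteration formula can be evaluated directly. In this small sketch the function name `ransac_iterations` and the default confidence of 0.99 are illustrative choices, and the result is rounded up to a whole number of trials.

```python
import math

def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Trials needed so that at least one all-inlier sample occurs with `confidence`."""
    p_good = inlier_ratio ** sample_size  # chance that one sample is all inliers
    if not 0.0 < p_good < 1.0:
        raise ValueError("inlier_ratio must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

k_half = ransac_iterations(0.5, 4)  # homography with 50% inliers
k_high = ransac_iterations(0.8, 4)  # homography with 80% inliers
```

The count grows quickly as the inlier ratio falls or the sample size rises, which is one reason minimal samples are used.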

RANSAC Line Fitting

Watch RANSAC find the line despite outliers. Least squares (gray) is pulled by outliers. RANSAC (green) ignores them.

Why is RANSAC preferred over least squares when outliers are present?

RANSAC fits models using minimal random subsets and scores by inlier count, so outliers cannot corrupt the estimate
RANSAC is always faster
RANSAC uses all points simultaneously

Chapter 3: Parametric Motion Models

The choice of motion model depends on the scene and camera motion:

When to use what: For flatbed-style document scanning, where the page is planar and roughly parallel to the sensor, an affine transform suffices. For a page photographed at an angle, and for panoramas, a homography handles the perspective. For whiteboard capture, a homography corrects keystone distortion and maps the trapezoid back to a rectangle.

A homography maps point (x, y) to (x', y') via:

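x' = (h₁x + h₂y + h₃) / (h₇x + h₈y + 1)
y' = (h₄x + h₅y + h₆) / (h₇x + h₈y + 1)

Here (h₁, h₂, h₃), (h₄, h₅, h₆), and (h₇, h₈, 1) are the three rows of the 3×3 matrix H, scaled so that its last entry is 1.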

The division by the last row is what gives projective transformations their power: parallel lines can converge, rectangles can become trapezoids.
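A small sketch of this mapping in plain Python, with the eight parameters passed as a tuple (h₁, ..., h₈) and the ninth entry fixed to 1. The parameter values are made up to show a unit square turning into a trapezoid.

```python
def apply_homography(h, x, y):
    """Map (x, y) using h = (h1, ..., h8); the ninth matrix entry is fixed to 1."""
    h1, h2, h3, h4, h5, h6, h7, h8 = h
    w = h7 * x + h8 * y + 1.0  # projective divisor from the last row
    if w == 0.0:
        raise ZeroDivisionError("point maps to infinity")
    return (h1 * x + h2 * y + h3) / w, (h4 * x + h5 * y + h6) / w

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Affine case: h7 = h8 = 0, so the divisor is always 1 (here, the identity map).
affine = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)
# Projective case: a nonzero h8 makes the divisor grow with y.
projective = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.5)

affine_corners = [apply_homography(affine, x, y) for x, y in square]
warped_corners = [apply_homography(projective, x, y) for x, y in square]
```

In `warped_corners` the edge at y = 1 is shorter than the edge at y = 0, so the two originally parallel vertical edges now converge. No setting with h₇ = h₈ = 0 can produce that.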

Transformation Gallery

See how different motion models transform a square grid.

What makes a homography more powerful than an affine transform?

The projective division allows parallel lines to converge, modeling perspective effects that affine transforms cannot
Homographies are faster to compute
Homographies use fewer parameters

Chapter 4: Rotational Panoramas

When the camera rotates without translating, the relationship between any two images is a homography. But what projection surface should the panorama use?

Projection	Properties	Best For
Planar	Straight lines stay straight. Severe distortion at edges.	Narrow fields of view (<90°)
Cylindrical	Horizontal lines stay straight. Wraps around.	Horizontal panoramas (120-360°)
Spherical	Full omnidirectional coverage. Area distortion at poles.	Full 360° × 180° spheres

Cylindrical projection: To create a cylindrical panorama, project each image onto a cylinder. For a level camera that only pans (rotates about the vertical axis), images in cylindrical coordinates differ by only a horizontal translation. This simplifies alignment dramatically: you only need to estimate a 2D shift instead of a full homography.

But cylindrical projection introduces a subtlety: you need to know the camera's focal length to correctly project onto the cylinder. If the focal length is wrong, straight lines become curved. This is why phone panorama modes often use the gyroscope to estimate rotation rather than relying on feature-based alignment alone.
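A minimal sketch of the cylindrical mapping, assuming pixel coordinates measured from the principal point, a known focal length f in pixels, and a camera that pans about the vertical axis. The function names, the focal length, and the 3D points are illustrative.

```python
import math

def to_cylindrical(x, y, f):
    """Map image coordinates (relative to the principal point) onto a cylinder of radius f."""
    theta = math.atan2(x, f)   # angle around the cylinder axis
    h = y / math.hypot(x, f)   # height on a unit-radius cylinder
    return f * theta, f * h

def project(point, f, pan=0.0):
    """Pinhole projection of a 3D point for a camera panned by `pan` radians."""
    X, Y, Z = point
    Xc = X * math.cos(pan) - Z * math.sin(pan)
    Zc = X * math.sin(pan) + Z * math.cos(pan)
    return f * Xc / Zc, f * Y / Zc

f, pan = 500.0, 0.2  # hypothetical focal length (pixels) and pan angle (radians)
scene = [(1.0, 0.5, 5.0), (-2.0, 1.0, 8.0), (0.5, -1.0, 3.0)]

shifts = []
for P in scene:
    x0, y0 = to_cylindrical(*project(P, f), f)
    x1, y1 = to_cylindrical(*project(P, f, pan), f)
    shifts.append((x0 - x1, y0 - y1))
```

Every entry of `shifts` is (f·pan, 0) up to rounding, the same horizontal offset for every scene point. Repeating the experiment with a wrong f in `to_cylindrical` breaks this, which is the focal-length sensitivity noted above.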

Why does cylindrical projection simplify panorama alignment?

In cylindrical coordinates, pure rotational images differ by only a horizontal translation, reducing alignment to a 2D shift
It removes all distortion
It does not require feature detection

Chapter 5: Global Alignment

Pairwise alignment estimates a transform between each pair. But errors accumulate: if image 1 aligns to image 2, and image 2 aligns to image 3, the alignment from 1 to 3 inherits both errors. Over many images, this creates visible drift.

Global alignment estimates all transformations simultaneously, minimizing the total reprojection error across all image pairs:

min_{H₁,...,H_n} ∑_{pairs (i,j)} ∑_{matches k} ||H_j x_k^j − H_i x_k^i||²

Gap closing: When a panorama loops back on itself (360°), the last image overlaps the first. This creates a "loop closure" constraint that distributes accumulated drift evenly around the loop. Without it, there is a visible seam where the panorama wraps. With it, the error spreads so thin it becomes invisible.
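A deliberately simplified, one-dimensional sketch of gap closing. It assumes each image is described by a single pan angle and each pairwise estimate is a relative angle in degrees, and it solves for all angles at once with NumPy least squares. The function name `close_loop` and the measurements are made up; a real system optimizes rotations or homographies instead of scalar angles.

```python
import numpy as np

def close_loop(pairwise_deg, full_turn=360.0):
    """Absolute pan angles from relative ones, with a wrap-around constraint.

    pairwise_deg[i] is the measured angle from image i to image i+1;
    the last entry goes from the final image back to image 0.
    """
    d = np.asarray(pairwise_deg, dtype=float)
    n = len(d)
    A = np.zeros((n, n - 1))  # unknowns: theta[1..n-1]; theta[0] is fixed at 0
    b = d.copy()
    for i in range(n):
        if i >= 1:
            A[i, i - 1] = -1.0  # coefficient of theta[i]
        if i + 1 <= n - 1:
            A[i, i] = 1.0       # coefficient of theta[i+1]
    # Loop closure row: theta[0] + full_turn - theta[n-1] = d[n-1].
    b[n - 1] -= full_turn
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.concatenate([[0.0], theta])

measured = [92.0, 89.0, 91.0, 92.0]                       # sums to 364: 4 degrees of drift
chained = np.concatenate([[0.0], np.cumsum(measured[:-1])])  # sequential pairwise result
closed = close_loop(measured)
residuals = np.diff(np.append(closed, 360.0)) - measured
```

Chaining alone leaves the full 4° mismatch at the wrap-around. With the closure row, each of the four links absorbs an equal 1° correction, so no single seam carries the whole error.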

Recognizing panoramas (Brown and Lowe, 2007) automated the entire pipeline: given an unordered set of photos, the system automatically identifies which images overlap, groups them into panoramas, estimates all transformations globally, and produces stitched results. This approach underlies many automatic panorama tools.

What problem does global alignment solve that pairwise alignment does not?

It minimizes total error across all images simultaneously, preventing drift accumulation from sequential pairwise estimates
It makes alignment faster
It requires fewer feature matches

Chapter 6: Bundle Adjustment

Global alignment for panoramas is a special case of bundle adjustment: the simultaneous refinement of camera parameters and 3D point positions to minimize total reprojection error.

For panoramas, "camera parameters" are just the rotation (and optionally focal length). For full 3D reconstruction (Chapter 11), bundle adjustment also optimizes camera positions and 3D point locations.

Why "bundle"? The name comes from the "bundle of rays" connecting each 3D point to the cameras that see it. Adjustment means tweaking camera poses and 3D points until these rays are as consistent as possible. It is a large nonlinear least-squares problem, solved with Levenberg-Marquardt. The Jacobian is sparse (each observation involves one camera and one point), enabling efficient computation even with millions of observations.

Modern bundle adjustment implementations (Ceres Solver, g2o) exploit this sparsity structure. They decompose the normal equations using the Schur complement, reducing the problem size from (cameras + points) to just cameras, since there are far fewer cameras than points.
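The Schur complement step can be sketched on a tiny dense system. The notation here is an assumption: B is the camera block, C the point block, E the coupling block, and (v, w) the right-hand side. The sizes and random values are made up. In real bundle adjustment C is block-diagonal (one small block per 3D point), which is what makes inverting it cheap; a diagonal C stands in for that here.

```python
import numpy as np

def solve_with_schur(B, E, C, v, w):
    """Solve [[B, E], [E.T, C]] @ [dc, dp] = [v, w] by eliminating the point block."""
    C_inv = np.linalg.inv(C)                    # cheap when C is block-diagonal
    S = B - E @ C_inv @ E.T                     # reduced camera system (Schur complement)
    dc = np.linalg.solve(S, v - E @ C_inv @ w)  # camera update: small dense solve
    dp = C_inv @ (w - E.T @ dc)                 # point update by back-substitution
    return dc, dp

rng = np.random.default_rng(0)
n_cam, n_pts = 4, 6                             # toy sizes for camera and point blocks
B = 10.0 * np.eye(n_cam)
C = np.diag(rng.uniform(5.0, 6.0, n_pts))
E = 0.1 * rng.standard_normal((n_cam, n_pts))
v = rng.standard_normal(n_cam)
w = rng.standard_normal(n_pts)

dc, dp = solve_with_schur(B, E, C, v, w)
direct = np.linalg.solve(np.block([[B, E], [E.T, C]]), np.concatenate([v, w]))
```

Both routes give the same update. The difference is that the only dense solve in `solve_with_schur` has the size of the camera block.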

What makes bundle adjustment computationally tractable despite having millions of parameters?

The Jacobian is sparse (each observation involves only one camera and one point), and the Schur complement reduces the effective problem size
It uses a simple closed-form solution
It processes only a few points at a time

Chapter 7: Blending and Compositing

Alignment is only half the battle. When images overlap, you need to combine their pixels without visible seams. Differences in exposure, white balance, and vignetting create mismatches at boundaries.

Blending strategies:

Method	How It Works	Quality
Feathering	Linear weight falloff from center. Average in overlap.	Simple but shows ghosting with motion.
Laplacian pyramid	Blend at each frequency band separately. Low frequencies: slow transition. High frequencies: sharp seam.	Excellent. The standard approach.
Optimal seam	Find the cut through the overlap where images match best (graph cut / dynamic programming).	Best for parallax and moving objects.
Gradient-domain	Blend gradients, then reconstruct via Poisson equation. Seamless by construction.	Eliminates intensity differences.

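Laplacian blending: Decompose both images into Laplacian pyramids (band-pass frequency layers). Build a Gaussian pyramid of the blend mask. At each level, blend the two band-pass layers using the mask at that level, then collapse the blended pyramid back into an image. Because the mask is smoother at coarser levels, low-frequency content (exposure) transitions gradually while high-frequency content (texture) switches sharply at the seam. The result: far less visible boundaries and much less ghosting than feathering.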
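A compact one-dimensional sketch of this procedure, assuming NumPy, signals whose length is divisible by 2 at every level, a 5-tap binomial filter, and reflected borders. Real implementations work on 2D images; all names here are illustrative.

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def blur(x):
    """Smooth with the binomial kernel, reflecting the signal at the borders."""
    return np.convolve(np.pad(x, 2, mode="reflect"), KERNEL, mode="valid")

def down(x):
    """Blur, then keep every second sample."""
    return blur(x)[::2]

def up(x, n):
    """Insert zeros between samples, then blur (gain 2 restores the amplitude)."""
    out = np.zeros(n)
    out[::2] = x
    return 2.0 * blur(out)

def gaussian_pyramid(x, levels):
    pyr = [x]
    for _ in range(levels):
        pyr.append(down(pyr[-1]))
    return pyr

def laplacian_pyramid(x, levels):
    pyr = []
    for _ in range(levels):
        smaller = down(x)
        pyr.append(x - up(smaller, len(x)))  # band-pass detail at this scale
        x = smaller
    pyr.append(x)                            # low-pass residual
    return pyr

def blend(a, b, mask, levels=3):
    """Blend a (where mask = 1) and b (where mask = 0), band by band."""
    la, lb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    gm = gaussian_pyramid(mask, levels)
    mixed = [m * pa + (1.0 - m) * pb for pa, pb, m in zip(la, lb, gm)]
    out = mixed[-1]
    for band in reversed(mixed[:-1]):        # collapse from coarse to fine
        out = band + up(out, len(band))
    return out

# Two flat "exposures" (1.0 and 0.0) joined by a hard step mask.
n = 64
mask = np.concatenate([np.ones(n // 2), np.zeros(n // 2)])
result = blend(np.ones(n), np.zeros(n), mask)
```

A flat exposure difference lives entirely in the coarsest level, so `result` ramps gradually from 1 to 0 across many samples instead of jumping at the mask edge.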

Blending Comparison

Compare naive stitching (hard cut) with smooth blending. Notice the seam difference.

Why does Laplacian pyramid blending produce better results than simple feathering?

It blends each frequency band separately: slow transition for exposure differences, sharp transition for texture details
It is faster to compute
It uses more memory

Chapter 8: Showcase — Panorama Builder

Let's visualize the stitching pipeline end to end. Three overlapping images are aligned via homographies and blended into a seamless panorama.

Step-by-Step Stitching

Watch three images align and blend into a single panorama.

Real-world challenges: Motion parallax (nearby objects shift differently from far ones), moving objects (people walking), exposure differences, lens distortion, and rolling shutter artifacts all complicate real panoramas. Modern systems address each: deghosting removes moving objects, exposure compensation normalizes brightness, and lens distortion models correct warping.

Chapter 9: Connections

Image alignment and stitching connects to many other topics:

Concept	Used In
RANSAC	Ch 7 (matching), Ch 11 (pose estimation), Ch 9 (motion), virtually everywhere
Homography estimation	Ch 2 (projective geometry), Ch 11 (planar SfM), AR overlays
Bundle adjustment	Ch 11 (SfM), Ch 12 (multi-view stereo), visual SLAM
Laplacian blending	Ch 10 (HDR, compositing), Ch 3 (pyramids), image editing
Panorama recognition	Ch 6 (image retrieval), photo organization
Global alignment	Ch 11 (global SfM), map building

Szeliski's perspective: "Image stitching is one of the great success stories of computer vision. The combination of robust feature matching, projective geometry, and multi-band blending produces results that are almost always seamless. It's also one of the most widely deployed vision algorithms — billions of panoramas are created every year on smartphones."

Which component of the stitching pipeline is also critical for 3D reconstruction (Chapter 11)?

Bundle adjustment: it jointly refines camera rotations (and focal lengths) in stitching, and camera poses plus 3D structure in SfM
Laplacian blending
Cylindrical projection