Szeliski, Chapter 9

Motion Estimation

Computing how pixels move between frames: translational alignment, parametric motion, optical flow, and layered motion models.

Prerequisites: Chapter 3 (filtering), Chapter 4 (optimization), Chapter 8 (alignment) helpful.
10 chapters · 6+ simulations · 0 assumed CV knowledge

Chapter 0: Why Motion?

You watch a car drive past and effortlessly perceive its speed and direction. Your visual system computes motion from the changing pattern of light on your retina. Motion estimation gives computers the same ability: given two or more video frames, compute how each pixel moved.

Motion information enables:

Two levels of motion: Parametric motion models assume all pixels move according to a single global model (translation, affine, etc.) — good for camera motion. Optical flow estimates a separate motion vector for every pixel — good for object motion. Most real scenes need both: global camera motion plus local object motion.
Motion Field

Visualize per-pixel motion vectors. Toggle between uniform (camera) and local (object) motion.

What is the difference between parametric motion and optical flow?

Chapter 1: Translational Alignment

The simplest motion model: every pixel moves by the same amount (u, v). This is pure translation, caused by camera panning or a distant moving object.

Three approaches to estimate the translation:

Method | How It Works | Pros / Cons
Exhaustive search | Try every possible (u, v) shift, measure SSD or NCC | Simple but slow for large displacements
Fourier-based | Translation in the spatial domain = phase shift in the frequency domain. The cross-power spectrum peaks at the shift. | Fast (FFT), sub-pixel, but only handles translation
Coarse-to-fine | Build image pyramids. Estimate large motion at the coarse level, refine at finer levels. | Handles large displacements efficiently
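As a rough illustration of the Fourier-based row, here is a minimal NumPy sketch of phase correlation. The function name `phase_correlation_shift` is made up for this example, it assumes a circular (wrap-around) shift, and it returns integer shifts only; sub-pixel accuracy would need an extra peak-interpolation step that is not shown.

```python
import numpy as np

def phase_correlation_shift(ref, moved):
    """Estimate the integer (dy, dx) such that moved ≈ np.roll(ref, (dy, dx))."""
    F_ref = np.fft.fft2(ref)
    F_mov = np.fft.fft2(moved)
    # Normalised cross-power spectrum: keep only the phase difference.
    cross_power = F_mov * np.conj(F_ref)
    cross_power /= np.abs(cross_power) + 1e-12
    # Its inverse FFT is (ideally) a single peak located at the shift.
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative shifts (wrap-around).
    h, w = ref.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

# Usage: recover a known circular shift of a random test image.
rng = np.random.default_rng(0)
ref = rng.random((64, 64))
moved = np.roll(ref, (5, -7), axis=(0, 1))
shift = phase_correlation_shift(ref, moved)
```

The cost is dominated by three FFTs, independent of how large the shift is, which is why this approach is attractive when the motion really is a pure translation.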
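The coarse-to-fine trick: If the shift can be up to 100 pixels in any direction, a full-resolution search covers roughly a 200×200 window of candidate shifts, about 40,000 evaluations. Downsample by 8× and the same shift shrinks to about 12.5 pixels, so a 25×25 window (625 evaluations) is enough. Each finer level then needs only a small search around the upsampled estimate, so the number of candidate shifts grows with the number of pyramid levels, which is logarithmic in the maximum displacement. This hierarchical approach underpins nearly all modern motion estimation.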
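The idea can be sketched in a few lines of NumPy. This is an illustrative toy, not a production implementation: the helper names (`downsample2`, `ssd_search`, `coarse_to_fine_shift`) are invented here, the pyramid uses plain 2×2 block averaging, and shifts are circular so that image borders can be ignored.

```python
import numpy as np

def downsample2(img):
    """Halve the resolution by averaging 2x2 blocks (a crude pyramid level)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def ssd_search(ref, moved, center, radius):
    """Best integer (dy, dx) within `radius` of `center`, scored by SSD."""
    best, best_cost = center, np.inf
    for dy in range(center[0] - radius, center[0] + radius + 1):
        for dx in range(center[1] - radius, center[1] + radius + 1):
            cost = np.sum((np.roll(ref, (dy, dx), axis=(0, 1)) - moved) ** 2)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best

def coarse_to_fine_shift(ref, moved, levels=3, radius=2):
    """Estimate a translation by searching a small window at each pyramid level."""
    pyramid = [(ref, moved)]
    for _ in range(levels):
        r, m = pyramid[-1]
        pyramid.append((downsample2(r), downsample2(m)))
    shift = (0, 0)
    for i, (r, m) in enumerate(reversed(pyramid)):  # coarsest level first
        if i > 0:
            shift = (2 * shift[0], 2 * shift[1])     # upsample the estimate
        shift = ssd_search(r, m, shift, radius)
    return shift

# Usage: a (16, -8) shift is found with 4 levels x 25 candidates each.
rng = np.random.default_rng(0)
ref = rng.random((128, 128))
moved = np.roll(ref, (16, -8), axis=(0, 1))
estimate = coarse_to_fine_shift(ref, moved)
```

With `radius=2` every level evaluates 5×5 = 25 candidates, so four levels cost 100 SSD evaluations, whereas a single-level search covering ±16 pixels would need 33×33 = 1,089.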
Why does coarse-to-fine estimation dramatically speed up motion search?

Chapter 2: Lucas-Kanade

Instead of searching over all possible shifts, the Lucas-Kanade method uses calculus. Assume the image intensity is approximately constant along the motion path:

I(x + u, y + v, t + 1) ≈ I(x, y, t)

Taylor-expand the left side:

I_x u + I_y v + I_t ≈ 0

This is the linearized brightness constancy equation, also called the optical flow constraint. It gives one equation per pixel in two unknowns (u, v). Lucas-Kanade resolves this by assuming all pixels in a small window share the same motion, which yields an overdetermined system solved by least squares:

[u, v]^T = (A^T A)^{-1} A^T b

where each row of A holds the spatial gradients (I_x, I_y) at one pixel in the window, and b holds the corresponding negated temporal gradients, −I_t.
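A minimal NumPy sketch of this least-squares solve for a single window follows. The function name `lucas_kanade_window`, the window half-size, and the use of `np.gradient` and a plain frame difference for the derivatives are all choices made for this example; the sign convention puts −I_t on the right-hand side so that the solution satisfies I_x u + I_y v + I_t ≈ 0.

```python
import numpy as np

def lucas_kanade_window(I0, I1, y, x, half=7):
    """Estimate (u, v) for the window centred at row y, column x.

    u is the displacement along x (columns), v along y (rows).
    """
    Iy, Ix = np.gradient(I0)   # np.gradient returns d/drow first, then d/dcol
    It = I1 - I0               # temporal derivative as a frame difference
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    # One row [I_x, I_y] per pixel in the window.
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow                # array([u, v])

# Usage: a smooth synthetic image translated by a small sub-pixel amount.
ys, xs = np.mgrid[0:64, 0:64].astype(float)
pattern = lambda x, y: np.sin(0.2 * x) + np.cos(0.15 * y)
I0 = pattern(xs, ys)
I1 = pattern(xs - 0.1, ys + 0.05)   # true motion: u = 0.1, v = -0.05
u, v = lucas_kanade_window(I0, I1, 32, 32)
```

Because the derivation relies on a first-order Taylor expansion, a single solve like this is only trustworthy for small displacements; larger motions are usually handled by iterating or by the coarse-to-fine scheme from Chapter 1.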

The A^T A matrix is the structure tensor! The same matrix from Harris corner detection (Chapter 7). When both eigenvalues are large (a corner), the system is well-conditioned and the flow is reliable. When one eigenvalue is small (an edge), motion along the edge is ambiguous. When both are small (flat region), no motion can be estimated. Good flow estimation requires texture.
Lucas-Kanade Flow

The flow is most reliable at corners (both eigenvalues large). At edges, only perpendicular motion is estimable.
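The corner / edge / flat distinction can be checked numerically from the eigenvalues of the structure tensor. The sketch below is illustrative only: `classify_patch` and the threshold `tau` are hypothetical, and a sensible threshold depends on image scale and noise.

```python
import numpy as np

def structure_tensor_eigenvalues(patch):
    """Eigenvalues (ascending) of the summed 2x2 structure tensor of a patch."""
    Iy, Ix = np.gradient(patch.astype(float))
    M = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    return np.linalg.eigvalsh(M)

def classify_patch(patch, tau=1.0):
    """Label a patch by how many eigenvalues exceed the (assumed) threshold tau."""
    small, large = structure_tensor_eigenvalues(patch)
    if large < tau:
        return "flat"      # no gradient: flow cannot be estimated
    if small < tau:
        return "edge"      # one gradient direction: only normal motion
    return "corner"        # two directions: full (u, v) is recoverable

# Usage on three synthetic 15x15 patches.
flat = np.ones((15, 15))
edge = np.zeros((15, 15)); edge[:, 8:] = 1.0
corner = np.zeros((15, 15)); corner[8:, 8:] = 1.0
labels = [classify_patch(p) for p in (flat, edge, corner)]
```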

Why can't Lucas-Kanade estimate flow at flat (textureless) image regions?

Chapter 3: Parametric Motion

Sometimes all pixels move according to a single global model — for example, when the entire camera moves. Parametric motion fits a transformation (affine, homography) to the entire image pair.
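To make "fitting a transformation" concrete, here is a small sketch that fits an affine model by linear least squares. It assumes point correspondences between the two frames are already available (for example from tracked features), which is only one of several ways to estimate parametric motion; the function name `fit_affine` is invented for this example.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix M with dst ≈ M @ [x, y, 1]^T."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    X = np.hstack([src, np.ones((len(src), 1))])        # N x 3 design matrix
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)    # 3 x 2 solution
    return params.T                                     # 2 x 3

# Usage: recover a known affine motion from noise-free correspondences.
true_M = np.array([[1.02, -0.05, 3.0],
                   [0.04,  0.98, -2.0]])
rng = np.random.default_rng(0)
src = rng.random((20, 2)) * 100
dst = src @ true_M[:, :2].T + true_M[:, 2]
M = fit_affine(src, dst)
```

An affine model has six parameters, so three non-collinear correspondences are the minimum; using many more and solving in the least-squares sense averages out measurement noise.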

Application: Video stabilization. Estimate the frame-to-frame camera motion (usually an affine or homography). Smooth the motion trajectory. Apply the inverse of the residual (jitter) to each frame. Result: a steady video from a shaky handheld camera.
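The smoothing step can be sketched as follows, under simplifying assumptions: the per-frame motion is reduced to a pure translation (dx, dy), and the trajectory is smoothed with a plain moving average. The function name `stabilizing_corrections` and the `radius` parameter are made up for this illustration.

```python
import numpy as np

def stabilizing_corrections(frame_motions, radius=2):
    """Per-frame shift that moves the camera path onto its smoothed version.

    frame_motions: sequence of (dx, dy) frame-to-frame translations.
    """
    motions = np.asarray(frame_motions, dtype=float)
    trajectory = np.cumsum(motions, axis=0)              # accumulated camera path
    padded = np.pad(trajectory, ((radius, radius), (0, 0)), mode="edge")
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)  # moving-average filter
    smoothed = np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid") for k in range(2)],
        axis=1,
    )
    # Jitter is (trajectory - smoothed); the correction is its inverse.
    return smoothed - trajectory

# Usage: a steady pan of 1 px/frame plus alternating vertical shake.
motions = [(1.0, 0.5 if i % 2 else -0.5) for i in range(12)]
corrections = stabilizing_corrections(motions, radius=2)
```

A steady pan passes through the moving average unchanged away from the sequence ends, so only the shake is corrected. Real stabilizers typically smooth richer models (affine or homography parameters) and crop the result to hide the borders exposed by the correction.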

Learned motion estimation: Classical methods fit explicit parametric models. Deep networks (e.g., HomographyNet) can learn to predict homographies directly from image pairs, trained on synthetic warps. They can be more robust to large viewpoint changes and to scenes that only approximately satisfy the planar assumption.

For spline-based motion, the image is divided into a grid, and each grid cell has its own transformation parameters. The motion field is a smooth interpolation (B-spline) of these local transformations. This handles spatially varying motion without the full cost of per-pixel optical flow.

How does video stabilization use parametric motion estimation?

Chapter 4: Optical Flow

Optical flow estimates a motion vector (u, v) at every pixel. This is the most detailed motion representation: a dense vector field describing where every point moves from one frame to the next.

The brightness constancy equation gives one constraint per pixel but two unknowns (u, v). This is the aperture problem: through a small aperture, you can only see the component of motion perpendicular to the local edge. To resolve the ambiguity, you need an additional constraint.

The aperture problem: Watch a barber pole rotate. Through a small window, the stripes appear to move upward, but the pole is actually rotating about its vertical axis, so its surface moves horizontally. The local gradient reveals only motion perpendicular to the edge, not along it. This is why optical flow needs either a window that pools several gradient directions (Lucas-Kanade) or a smoothness assumption (Horn-Schunck) to disambiguate.
The Aperture Problem

Through a small window, only the motion perpendicular to the edge is visible. The true motion direction is ambiguous.
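The one component that is recoverable, often called the normal flow, follows directly from the constraint I_x u + I_y v + I_t ≈ 0: it is the flow along the gradient direction, −I_t ∇I / ||∇I||². A small sketch, with the function name `normal_flow` and the `eps` guard being choices made for this example:

```python
import numpy as np

def normal_flow(Ix, Iy, It, eps=1e-12):
    """Component of the flow along the image gradient (perpendicular to the edge)."""
    grad_mag_sq = Ix ** 2 + Iy ** 2
    scale = -It / (grad_mag_sq + eps)   # eps avoids division by zero in flat areas
    return scale * Ix, scale * Iy

# Usage: a vertical edge (gradient along x only) moving diagonally with (u, v) = (1, 1).
Ix, Iy = np.array([2.0]), np.array([0.0])
It = -(Ix * 1.0 + Iy * 1.0)            # temporal change implied by the true motion
un, vn = normal_flow(Ix, Iy, It)       # only the horizontal part is recovered
```

The true motion had a vertical component, but since the edge looks identical when slid along itself, that component leaves no trace in I_t and cannot be recovered locally.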

What is the aperture problem in optical flow estimation?

Chapter 5: Horn-Schunck

Horn and Schunck (1981) added a smoothness constraint: the flow field should vary smoothly across the image. They minimize:

E = ∑ [ (I_x u + I_y v + I_t)^2 + λ (||∇u||^2 + ||∇v||^2) ]

The first term enforces brightness constancy. The second term (weighted by λ) penalizes flow gradients, encouraging neighboring pixels to move similarly.
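Minimizing this energy leads to a simple iterative scheme: replace each flow vector by the average of its neighbours, then correct it along the image gradient to reduce the brightness constancy residual. The sketch below is an approximation under stated assumptions: it uses a 4-neighbour average, absorbs the constant from the discrete Laplacian into `lam`, and runs a fixed number of iterations. The names `neighbour_average` and `horn_schunck` are invented here.

```python
import numpy as np

def neighbour_average(f):
    """Mean of the 4 direct neighbours, with replicated borders."""
    p = np.pad(f, 1, mode="edge")
    return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])

def horn_schunck(I0, I1, lam=1.0, n_iters=100):
    """Dense flow (u, v) from a Jacobi-style iteration on the energy above."""
    Iy, Ix = np.gradient(I0)
    It = I1 - I0
    u = np.zeros_like(I0, dtype=float)
    v = np.zeros_like(I0, dtype=float)
    for _ in range(n_iters):
        u_bar, v_bar = neighbour_average(u), neighbour_average(v)
        # Brightness constancy residual of the smoothed flow, normalised.
        residual = (Ix * u_bar + Iy * v_bar + It) / (lam + Ix ** 2 + Iy ** 2)
        u = u_bar - Ix * residual
        v = v_bar - Iy * residual
    return u, v

# Usage: the same kind of smooth synthetic pair as in a Lucas-Kanade test.
ys, xs = np.mgrid[0:32, 0:32].astype(float)
pattern = lambda x, y: np.sin(0.2 * x) + np.cos(0.15 * y)
I0, I1 = pattern(xs, ys), pattern(xs - 0.1, ys + 0.05)
u, v = horn_schunck(I0, I1, lam=0.01, n_iters=100)
```

The role of λ is visible in the update: when `lam` is large compared with the squared gradient magnitude, the correction term is small and the flow is dominated by neighbour averaging.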

Smoothness vs. detail: Large λ gives a smooth flow field but blurs motion boundaries (where a moving object meets the background). Small λ preserves boundaries but may be noisy. The trade-off is fundamental: you want smooth flow within objects but sharp discontinuities between them. This is why modern methods use robust smoothness terms that allow discontinuities.
Method | Approach | Key Feature
Lucas-Kanade | Local (window) | Reliable at corners. Sparse.
Horn-Schunck | Global (variational) | Dense flow. Over-smooths boundaries.
TV-L1 | Global + robust penalty | Preserves discontinuities. Slower.
FlowNet/RAFT | Deep learning | State of the art. Fast at inference.
What trade-off does the smoothness parameter λ in Horn-Schunck control?

Chapter 6: Deep Optical Flow

The deep learning revolution transformed optical flow. Instead of hand-crafted energy functions, train a network end-to-end on ground truth flow data.

Method | Year | Key Innovation
FlowNet | 2015 | First end-to-end CNN for flow. Encoder-decoder architecture.
FlowNet 2.0 | 2017 | Stacked refinement networks. Competitive with classical methods.
PWC-Net | 2018 | Pyramid, warping, cost volume. Compact and efficient.
RAFT | 2020 | Recurrent all-pairs field transforms. State of the art.
RAFT's innovation: Instead of coarse-to-fine refinement, RAFT builds a full 4D correlation volume (all pairs of pixels) and iteratively updates the flow field using a GRU. Each update step looks up the correlation volume at the current flow estimate and adjusts. This avoids the loss of fine detail that plagues coarse-to-fine approaches.
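The all-pairs correlation volume itself is just a batch of dot products between feature vectors. The NumPy sketch below shows only that one step, not the feature encoder, the lookup, or the GRU update; the function name `all_pairs_correlation` is made up, and the scaling by the square root of the feature dimension is an assumption of this example.

```python
import numpy as np

def all_pairs_correlation(feat1, feat2):
    """4D volume C[i, j, k, l] = <feat1[i, j], feat2[k, l]> / sqrt(D).

    feat1, feat2: feature maps of shape (H, W, D).
    """
    d = feat1.shape[-1]
    return np.einsum("ijd,kld->ijkl", feat1, feat2) / np.sqrt(d)

# Usage: with one-hot features, each pixel matches exactly one target pixel.
feat1 = np.eye(16).reshape(4, 4, 16)           # a distinct feature per pixel
feat2 = np.roll(feat1, (1, 2), axis=(0, 1))    # second frame shifted by (1, 2)
corr = all_pairs_correlation(feat1, feat2)     # shape (4, 4, 4, 4)
best = np.unravel_index(np.argmax(corr[0, 0]), corr.shape[2:])
```

The volume has H·W·H·W entries, so its memory grows quadratically with the number of pixels; this is one reason such volumes are built on downsampled feature maps rather than at full image resolution.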

Training data is the bottleneck. Real optical flow ground truth is nearly impossible to obtain (you would need to know the true 3D motion of every point). Solutions: (1) synthetic data (Flying Chairs, Sintel), (2) unsupervised losses (photometric consistency), (3) semi-supervised pretraining then fine-tuning on real video.
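The photometric consistency idea in option (2) can be illustrated without any ground truth: warp the second frame back with the predicted flow and measure how well it matches the first. This is a deliberately simplified sketch, with nearest-neighbour sampling and border clamping standing in for the differentiable bilinear warping a real training loss would need; `photometric_loss` is a hypothetical name.

```python
import numpy as np

def photometric_loss(I0, I1, u, v):
    """Mean absolute difference between I0 and I1 sampled at (x + u, y + v)."""
    h, w = I0.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup, clamped to the image (a simplification).
    xt = np.clip(np.rint(xs + u).astype(int), 0, w - 1)
    yt = np.clip(np.rint(ys + v).astype(int), 0, h - 1)
    return float(np.mean(np.abs(I1[yt, xt] - I0)))

# Usage: the correct flow scores lower than the zero-flow guess.
rng = np.random.default_rng(0)
I0 = rng.random((32, 32))
I1 = np.roll(I0, (2, 3), axis=(0, 1))          # content moves by v = 2, u = 3
loss_true = photometric_loss(I0, I1, u=3, v=2)
loss_zero = photometric_loss(I0, I1, u=0, v=0)
```

The loss for the correct flow is not exactly zero here because pixels that move out of the frame are clamped to the border, a small-scale version of the occlusion problem that unsupervised methods have to handle.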

Why is ground truth optical flow data so difficult to obtain for real scenes?

Chapter 7: Layered Motion

Real scenes contain multiple objects moving independently. A car drives past a stationary building. A person walks in front of a crowd. Layered motion decomposes the scene into layers, each with its own motion model.

Each layer has:

Application: frame interpolation. To generate frames between two video frames (for slow motion or higher frame rate), you need to know how each pixel moves. But occluded regions appear in one frame and not the other. Layered motion handles this: each layer is interpolated separately, and the depth ordering determines which layer is visible at each pixel in the interpolated frame.
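A toy version of layered interpolation is sketched below. It assumes each layer is given as an image, a binary mask, and a single integer translation between the two frames, with layers listed back to front; real systems use richer per-layer motion models and soft mattes. The function name `interpolate_layers` and the tuple layout are invented for this example.

```python
import numpy as np

def interpolate_layers(layers, t):
    """Render the frame at time t in [0, 1] from back-to-front layers.

    layers: list of (image, mask, (dx, dy)), where (dx, dy) is the layer's
    total motion between the two frames.
    """
    out = np.zeros_like(layers[0][0], dtype=float)
    for image, mask, (dx, dy) in layers:
        # Move the layer a fraction t of the way along its own motion.
        shift = (int(round(t * dy)), int(round(t * dx)))
        moved_image = np.roll(image, shift, axis=(0, 1))
        moved_mask = np.roll(mask, shift, axis=(0, 1)).astype(bool)
        # Nearer layers are drawn later and overwrite farther ones.
        out[moved_mask] = moved_image[moved_mask]
    return out

# Usage: a static background and a one-pixel "object" moving 4 px to the right.
background = np.zeros((5, 8)); bg_mask = np.ones((5, 8))
obj = np.zeros((5, 8)); obj[2, 2] = 1.0
obj_mask = np.zeros((5, 8)); obj_mask[2, 2] = 1.0
layers = [(background, bg_mask, (0, 0)), (obj, obj_mask, (4, 0))]
halfway = interpolate_layers(layers, 0.5)      # object appears at column 4
```

Because the background layer is complete behind the object, the pixels the object uncovers as it moves are filled from the background rather than left as holes.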

Video object segmentation takes layered motion a step further: segment and track objects across an entire video sequence. Modern approaches (e.g., SAM-Track) combine segmentation foundation models with temporal tracking, enabling one-click tracking of any object through a video.

Why do layered motion models better represent real-world video than a single global flow field?

Chapter 8: Showcase — Flow Field Playground

Interact with a motion field. A set of particles moves according to a flow field. Adjust the flow type to see how different motion patterns look.

Particle Flow Visualization

Particles trace the flow field. Color encodes direction: right = teal, left = warm, up/down = blue.

Flow visualization: Optical flow is often visualized using the Middlebury color wheel: direction maps to hue, magnitude maps to saturation. This makes it easy to see the flow pattern at a glance. The particle visualization above shows the same information by tracing trajectories.
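The direction-to-hue, magnitude-to-saturation mapping can be sketched with Python's standard `colorsys` module. This is a generic HSV mapping for illustration, not the exact Middlebury color wheel; `flow_to_rgb` and `max_magnitude` are names chosen for this example.

```python
import colorsys
import math

def flow_to_rgb(u, v, max_magnitude=1.0):
    """Map one flow vector to an RGB triple in [0, 1]."""
    angle = math.atan2(v, u)                              # direction in radians
    hue = (angle % (2 * math.pi)) / (2 * math.pi)         # direction -> hue in [0, 1)
    saturation = min(math.hypot(u, v) / max_magnitude, 1.0)  # magnitude -> saturation
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)

# Usage: zero motion is white; faster motion gives more saturated color.
still = flow_to_rgb(0.0, 0.0)
rightward = flow_to_rgb(1.0, 0.0)
slow_rightward = flow_to_rgb(0.25, 0.0)
```

Normalizing by `max_magnitude` matters in practice: flow images are usually scaled by the largest motion in the frame (or a fixed cap) so that colors are comparable across pixels.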

Chapter 9: Connections

Concept | Used In
Optical flow | Ch 6 (action recognition), Ch 10 (video denoising), Ch 12 (depth from motion)
Lucas-Kanade / gradient-based | Ch 7 (feature tracking), Ch 8 (alignment refinement)
Parametric motion | Ch 8 (stitching), Ch 11 (pose estimation), video stabilization
Layered motion | Ch 10 (compositing), video editing, frame interpolation
Coarse-to-fine pyramids | Ch 3 (image pyramids), Ch 7 (multi-scale features), Ch 12 (stereo)
Deep flow (RAFT) | Autonomous driving, video generation, 3D scene understanding
Szeliski's perspective: "Motion estimation is the temporal analog of stereo correspondence (Chapter 12). Both solve the same fundamental problem — finding pixel correspondences between images — but with different constraints. Stereo uses epipolar geometry; motion uses temporal continuity. The algorithmic techniques transfer directly between them."
What is the fundamental similarity between optical flow (Chapter 9) and stereo matching (Chapter 12)?