Szeliski, Chapter 9

Motion Estimation

Computing how pixels move between frames: translational alignment, parametric motion, optical flow, and layered motion models.

Prerequisites: Chapter 3 (filtering) and Chapter 4 (optimization); Chapter 8 (alignment) is helpful.

Chapter 0: Why Motion?

You watch a car drive past and effortlessly perceive its speed and direction. Your visual system computes motion from the changing pattern of light on your retina. Motion estimation gives computers the same ability: given two or more video frames, compute how each pixel moved.

Motion information enables:

Video stabilization: Remove camera shake by estimating and compensating for camera motion
Frame interpolation: Generate in-between frames for slow-motion video
Video compression: Encode only the differences between frames (motion-compensated prediction)
Action recognition: Movement patterns distinguish walking from running from dancing
3D reconstruction: Motion parallax reveals depth

Two levels of motion: Parametric motion models assume all pixels move according to a single global model (translation, affine, etc.) — good for camera motion. Optical flow estimates a separate motion vector for every pixel — good for object motion. Most real scenes need both: global camera motion plus local object motion.

Motion Field

Visualize per-pixel motion vectors. Toggle between uniform (camera) and local (object) motion.

What is the difference between parametric motion and optical flow?

Parametric motion fits a single global model to all pixels; optical flow estimates independent motion at every pixel
They are the same thing
Optical flow only works on objects, not cameras

Chapter 1: Translational Alignment

The simplest motion model: every pixel moves by the same amount (u, v). This is pure translation, caused by camera panning or a distant moving object.

Three approaches to estimate the translation:

Method	How It Works	Pros / Cons
Exhaustive search	Try every possible (u,v) shift, measure SSD or NCC	Simple but slow for large displacements
Fourier-based	Translation in spatial domain = phase shift in frequency domain. Cross-power spectrum peaks at the shift.	Fast (FFT), sub-pixel, but only handles translation
Coarse-to-fine	Build image pyramids. Estimate large motion at coarse level, refine at fine levels.	Handles large displacements efficiently
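
As a rough illustration of the Fourier-based row in the table, the sketch below recovers an integer shift by phase correlation. It assumes grayscale NumPy arrays and a circular (wrap-around) shift; the function name `phase_correlation` is made up for this example, and sub-pixel refinement of the peak is left out.

```python
import numpy as np

def phase_correlation(ref, moved):
    """Estimate the integer shift (u, v) such that moved(x, y) ~ ref(x - u, y - v)."""
    F_ref = np.fft.fft2(ref)
    F_mov = np.fft.fft2(moved)
    cross = F_mov * np.conj(F_ref)          # cross-power spectrum
    cross /= np.abs(cross) + 1e-12          # keep only the phase
    corr = np.fft.ifft2(cross).real         # ideally a single peak at the shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape
    if dy > h // 2:                         # map wrapped indices to signed shifts
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dx), int(dy)

rng = np.random.default_rng(0)
ref = rng.random((64, 64))
moved = np.roll(ref, shift=(3, -5), axis=(0, 1))   # 3 px down, 5 px left
print(phase_correlation(ref, moved))               # (-5, 3)
```

Real frames do not wrap around, so in practice the images are usually windowed before the FFT; that step is omitted here.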

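The coarse-to-fine trick: A shift of up to 100 pixels in either direction requires searching a roughly 200×200 window at full resolution, about 40,000 evaluations. If you downsample by 8x, the shift shrinks to about 12 pixels, so a 25×25 window (625 evaluations) is enough. Each finer level then needs only a small search around the upsampled estimate, so the number of candidate shifts per level stays small and the number of levels grows only logarithmically with the maximum displacement. This hierarchical approach underpins most classical motion estimation methods and many learned ones.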
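
The sketch below shows one way the coarse-to-fine search could be organized, combining it with the exhaustive SSD search from the table. The helper names (`ssd_search`, `downsample2`, `coarse_to_fine`) are hypothetical, shifts are circular, and the test shift is a multiple of 8 so that every pyramid level sees an exact integer shift. These are simplifying assumptions, not requirements of the method.

```python
import numpy as np

def ssd_search(ref, moved, center, radius):
    """Exhaustive SSD search over integer (dy, dx) shifts within radius of center."""
    best, best_shift, evals = np.inf, center, 0
    for dy in range(center[0] - radius, center[0] + radius + 1):
        for dx in range(center[1] - radius, center[1] + radius + 1):
            err = np.sum((np.roll(ref, (dy, dx), axis=(0, 1)) - moved) ** 2)
            evals += 1
            if err < best:
                best, best_shift = err, (dy, dx)
    return best_shift, evals

def downsample2(img):
    """Halve the resolution by averaging 2x2 blocks."""
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def coarse_to_fine(ref, moved, levels=3, coarse_radius=4):
    pyramid = [(ref, moved)]
    for _ in range(levels):
        r, m = pyramid[-1]
        pyramid.append((downsample2(r), downsample2(m)))
    shift, total_evals = (0, 0), 0
    for i, (r, m) in enumerate(reversed(pyramid)):      # coarsest level first
        radius = coarse_radius if i == 0 else 1         # later levels only refine
        shift, evals = ssd_search(r, m, shift, radius)
        total_evals += evals
        if i < levels:                                   # scale estimate to next finer level
            shift = (2 * shift[0], 2 * shift[1])
    return shift, total_evals

rng = np.random.default_rng(1)
ref = rng.random((128, 128))
moved = np.roll(ref, (24, -16), axis=(0, 1))
shift, total_evals = coarse_to_fine(ref, moved)
print(shift, total_evals)                                # (24, -16) 108
```

For comparison, a full-resolution search over shifts of ±32 pixels would test 65 × 65 = 4,225 candidates instead of 108.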

Why does coarse-to-fine estimation dramatically speed up motion search?

Downsampling reduces large displacements to small ones, so the search window at each level is small, and refinement at finer levels adds precision at little extra cost
It uses a faster algorithm
It skips most of the image

Chapter 2: Lucas-Kanade

Instead of searching over all possible shifts, the Lucas-Kanade method uses calculus. Assume the image intensity is approximately constant along the motion path:

I(x + u, y + v, t + 1) ≈ I(x, y, t)

Taylor-expand the left side:

I_x u + I_y v + I_t ≈ 0

This is the brightness constancy equation. One equation, two unknowns (u, v). Lucas-Kanade solves this by assuming all pixels in a small window share the same motion. This gives an overdetermined system solved by least squares:

[u, v]^T = (A^T A)⁻¹ A^T b

where each row of A holds the spatial gradients (I_x, I_y) at one pixel of the window and b holds the corresponding negated temporal gradients (−I_t).

The A^T A matrix is the structure tensor, the same matrix used in Harris corner detection (Chapter 7). When both eigenvalues are large (a corner), the system is well-conditioned and the flow is reliable. When one eigenvalue is small (an edge), motion along the edge is ambiguous. When both are small (a flat region), no motion can be estimated. Good flow estimation requires texture.
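
A minimal single-window sketch of this least-squares solve is given below, assuming two small grayscale frames as NumPy arrays. The function name `lucas_kanade_window` and the synthetic test patterns are made up for illustration; `np.linalg.lstsq` is used in place of an explicit inverse so that the call also works when A^T A is singular.

```python
import numpy as np

def lucas_kanade_window(I0, I1):
    """Estimate one (u, v) for the whole window, plus the eigenvalues of A^T A."""
    Iy, Ix = np.gradient(I0)                 # np.gradient returns d/drow, then d/dcol
    It = I1 - I0                             # temporal difference
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()                          # note the minus sign
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)   # same solution as (A^T A)^-1 A^T b when full rank
    eigvals = np.linalg.eigvalsh(A.T @ A)    # structure tensor eigenvalues, ascending
    return flow, eigvals

ys, xs = np.mgrid[0:40, 0:40].astype(float)
true_u, true_v = 0.3, -0.2

def textured(x, y):
    return np.sin(0.2 * x) + np.cos(0.15 * y)

# Textured window: both eigenvalues are large and the flow is recovered.
flow, eigvals = lucas_kanade_window(textured(xs, ys), textured(xs - true_u, ys - true_v))

# Vertical stripes only (an edge-like window): one eigenvalue collapses to zero.
_, edge_eigvals = lucas_kanade_window(np.sin(0.2 * xs), np.sin(0.2 * (xs - true_u)))

print(flow, eigvals, edge_eigvals)
```

The estimate is only approximate because the Taylor expansion and the finite-difference gradients are both first-order approximations; small sub-pixel motions keep that error low.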

Lucas-Kanade Flow

The flow is most reliable at corners (both eigenvalues large). At edges, only the motion component perpendicular to the edge can be estimated.

Why can't Lucas-Kanade estimate flow at flat (textureless) image regions?

With no spatial gradients, the A^T A matrix is singular (both eigenvalues near zero), so the linear system has no unique solution
Flat regions cannot move
The algorithm only works on corners

Chapter 3: Parametric Motion

Sometimes all pixels move according to a single global model — for example, when the entire camera moves. Parametric motion fits a transformation (affine, homography) to the entire image pair.
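
To make the idea of a single global model concrete, the sketch below generates a flow field from six affine parameters and then recovers those parameters from noisy motion samples by least squares. The parameter ordering and the function names (`affine_flow`, `fit_affine`) are assumptions chosen for this example.

```python
import numpy as np

def affine_flow(params, xs, ys):
    """Affine motion model: u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y."""
    a0, a1, a2, a3, a4, a5 = params
    u = a0 + a1 * xs + a2 * ys
    v = a3 + a4 * xs + a5 * ys
    return u, v

def fit_affine(xs, ys, u, v):
    """Least-squares fit of the six affine parameters to sampled motion vectors."""
    X = np.stack([np.ones_like(xs), xs, ys], axis=1)
    pu, *_ = np.linalg.lstsq(X, u, rcond=None)
    pv, *_ = np.linalg.lstsq(X, v, rcond=None)
    return np.concatenate([pu, pv])

rng = np.random.default_rng(0)
true_params = np.array([2.0, 0.01, -0.02, -1.0, 0.02, 0.01])
xs = rng.uniform(0, 100, 50)
ys = rng.uniform(0, 100, 50)
u, v = affine_flow(true_params, xs, ys)
u = u + rng.normal(0, 0.01, 50)              # small measurement noise
v = v + rng.normal(0, 0.01, 50)
est_params = fit_affine(xs, ys, u, v)
print(est_params)
```

Six numbers describe the motion of every pixel, which is why parametric models are far cheaper to estimate than a dense flow field. A robust estimator would be needed if some samples came from independently moving objects.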

Application: Video stabilization. Estimate the frame-to-frame camera motion (usually an affine or homography). Smooth the motion trajectory. Apply the inverse of the residual (jitter) to each frame. Result: a steady video from a shaky handheld camera.
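
The stabilization steps above can be sketched as follows, using a translation-only camera model and a moving-average smoother for brevity. Both are simplifying assumptions: the text mentions affine or homography models, and the name `stabilizing_corrections` is hypothetical.

```python
import numpy as np

def stabilizing_corrections(frame_motion, radius=5):
    """frame_motion: (N, 2) per-frame camera translation. Returns per-frame corrections."""
    trajectory = np.cumsum(frame_motion, axis=0)            # accumulated camera path
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)     # moving-average filter
    padded = np.pad(trajectory, ((radius, radius), (0, 0)), mode="edge")
    smoothed = np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid") for k in range(trajectory.shape[1])],
        axis=1,
    )
    return smoothed - trajectory                             # inverse of the jitter

rng = np.random.default_rng(0)
n_frames = 100
# A steady 1 px/frame pan in x, plus hand-shake jitter in both directions.
frame_motion = np.column_stack([1.0 + rng.normal(0, 2.0, n_frames),
                                rng.normal(0, 2.0, n_frames)])
trajectory = np.cumsum(frame_motion, axis=0)
corrections = stabilizing_corrections(frame_motion)
stabilized = trajectory + corrections                        # path after compensation
print(np.std(np.diff(trajectory, axis=0), axis=0),
      np.std(np.diff(stabilized, axis=0), axis=0))
```

Each frame would then be warped by its correction. The intended pan survives because it is part of the smoothed trajectory, while the high-frequency jitter is removed.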

Learned motion estimation: Classical methods fit explicit parametric models. Deep networks (e.g., HomographyNet) can learn to predict homographies directly from image pairs, trained on synthetic warps. They can be more robust than classical fitting when viewpoint changes are large or when the scene is not strictly planar.

For spline-based motion, the image is divided into a grid, and each grid cell has its own transformation parameters. The motion field is a smooth interpolation (B-spline) of these local transformations. This handles spatially varying motion without the full cost of per-pixel optical flow.

How does video stabilization use parametric motion estimation?

It estimates frame-to-frame camera motion, smooths the trajectory, and applies the inverse of jitter to produce steady video
It removes all motion from the video
It tracks individual objects

Chapter 4: Optical Flow

Optical flow estimates a motion vector (u, v) at every pixel. This is the most detailed motion representation: a dense vector field describing where every point moves from one frame to the next.

The brightness constancy equation gives one constraint per pixel but two unknowns (u, v). This is the aperture problem: through a small aperture, you can only see the component of motion perpendicular to the local edge. To resolve the ambiguity, you need an additional constraint.

The aperture problem: Watch a barber pole rotate. Through a small window, the stripes appear to move upward — but the pole is actually rotating horizontally. The local gradient only reveals motion perpendicular to the edge, not along it. This is why optical flow needs either a large window (Lucas-Kanade) or a smoothness assumption (Horn-Schunck) to disambiguate.
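
A small numerical illustration of the aperture problem is given below: for a pattern of vertical stripes, a single brightness constancy constraint per pixel recovers only the motion component along the gradient (the "normal flow"). The function name `normal_flow` and the gradient threshold are assumptions made for this sketch.

```python
import numpy as np

def normal_flow(I0, I1, min_grad=0.1):
    """Per-pixel motion component along the image gradient: -I_t * grad(I) / |grad(I)|^2."""
    Iy, Ix = np.gradient(I0)
    It = I1 - I0
    grad_sq = Ix ** 2 + Iy ** 2
    valid = grad_sq > min_grad ** 2                   # skip pixels with almost no gradient
    scale = np.where(valid, -It / np.where(valid, grad_sq, 1.0), 0.0)
    return scale * Ix, scale * Iy, valid

ys, xs = np.mgrid[0:40, 0:40].astype(float)
true_u, true_v = 0.3, 0.5

# Vertical stripes: intensity depends on x only, so the vertical motion leaves no trace.
I0 = np.sin(0.2 * xs)
I1 = np.sin(0.2 * (xs - true_u))                      # moving by true_v changes nothing

n_u, n_v, valid = normal_flow(I0, I1)
print(n_u[valid].mean(), np.abs(n_v).max())           # about 0.3 horizontally, 0 vertically
```

The horizontal component is recovered, but the vertical component of 0.5 is invisible to the local measurement, which is exactly the ambiguity that a window or a smoothness term has to resolve.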

The Aperture Problem

Through a small window, only the motion perpendicular to the edge is visible. The true motion direction is ambiguous.

What is the aperture problem in optical flow estimation?

Through a local window, only the motion component perpendicular to the edge is observable, making the full motion direction ambiguous
The camera aperture is too small
The image resolution is too low

Chapter 5: Horn-Schunck

Horn and Schunck (1981) added a smoothness constraint: the flow field should vary smoothly across the image. They minimize:

E = ∑ [ (I_x u + I_y v + I_t)² + λ(||∇u||² + ||∇v||²) ]

The first term enforces brightness constancy. The second term (weighted by λ) penalizes flow gradients, encouraging neighboring pixels to move similarly.

Smoothness vs. detail: Large λ gives a smooth flow field but blurs motion boundaries (where a moving object meets the background). Small λ preserves boundaries but may be noisy. The trade-off is fundamental: you want smooth flow within objects but sharp discontinuities between them. This is why modern methods use robust smoothness terms that allow discontinuities.
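
One common way to minimize this energy is the iterative update from the original Horn-Schunck formulation, sketched below with NumPy. The 4-neighbor average with wrap-around borders, the default λ, and the synthetic test pattern are simplifications chosen for this example. The constant relating λ to the discrete Laplacian is absorbed into `lam`.

```python
import numpy as np

def horn_schunck(I0, I1, lam=0.01, n_iters=1000):
    """Dense flow by iterating the classic Horn-Schunck update."""
    Iy, Ix = np.gradient(I0)
    It = I1 - I0
    u = np.zeros_like(I0)
    v = np.zeros_like(I0)
    denom = lam + Ix ** 2 + Iy ** 2

    def neighbor_mean(f):
        # Average of the 4 neighbors; np.roll wraps around at the borders.
        return 0.25 * (np.roll(f, 1, 0) + np.roll(f, -1, 0) + np.roll(f, 1, 1) + np.roll(f, -1, 1))

    for _ in range(n_iters):
        u_bar, v_bar = neighbor_mean(u), neighbor_mean(v)
        residual = Ix * u_bar + Iy * v_bar + It        # brightness constancy error
        u = u_bar - Ix * residual / denom
        v = v_bar - Iy * residual / denom
    return u, v

ys, xs = np.mgrid[0:48, 0:48].astype(float)

def pattern(x, y):
    return np.sin(0.2 * x) + np.cos(0.15 * y)

I0 = pattern(xs, ys)
I1 = pattern(xs - 0.3, ys + 0.2)                       # true flow is (0.3, -0.2) everywhere
u, v = horn_schunck(I0, I1)
print(u.mean(), v.mean())
```

Because the true flow here is constant, the smoothness term costs nothing and the estimate approaches it. On a scene with a motion boundary, the same term would blur the flow across that boundary, more so as `lam` grows.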

Method	Approach	Key Feature
Lucas-Kanade	Local (window)	Reliable at corners. Sparse.
Horn-Schunck	Global (variational)	Dense flow. Over-smooths boundaries.
TV-L1	Global + robust penalty	Preserves discontinuities. Slower.
FlowNet/RAFT	Deep learning	State of the art. Fast at inference.

What trade-off does the smoothness parameter λ in Horn-Schunck control?

Large λ gives smooth flow but blurs motion boundaries; small λ preserves boundaries but may be noisy
It controls the image resolution
It determines the number of iterations

Chapter 6: Deep Optical Flow

The deep learning revolution transformed optical flow. Instead of hand-crafted energy functions, train a network end-to-end on ground truth flow data.

Method	Year	Key Innovation
FlowNet	2015	First end-to-end CNN for flow. Encoder-decoder architecture.
FlowNet 2.0	2017	Stacked refinement networks. Competitive with classical methods.
PWC-Net	2018	Pyramid, warping, cost volume. Compact and efficient.
RAFT	2020	Recurrent all-pairs field transforms. State of the art at publication.

RAFT's innovation: Instead of coarse-to-fine refinement, RAFT builds a full 4D correlation volume (all pairs of pixels) and iteratively updates the flow field using a GRU. Each update step looks up the correlation volume at the current flow estimate and adjusts. This avoids the loss of fine detail that plagues coarse-to-fine approaches.
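
The all-pairs correlation volume can be sketched in a few lines of NumPy, assuming per-pixel feature maps of shape (C, H, W). Random features stand in for a learned encoder here, the scaling by the square root of C is an assumption of this sketch, and the GRU update loop is not shown.

```python
import numpy as np

def all_pairs_correlation(f1, f2):
    """4D volume: corr[i, j, k, l] = dot(f1[:, i, j], f2[:, k, l]) / sqrt(C)."""
    c = f1.shape[0]
    return np.einsum("chw,cij->hwij", f1, f2) / np.sqrt(c)

rng = np.random.default_rng(0)
f1 = rng.normal(size=(128, 8, 8))                 # stand-in for frame 1 features
f2 = np.roll(f1, shift=(1, 2), axis=(1, 2))       # frame 2: everything moved 1 down, 2 right

corr = all_pairs_correlation(f1, f2)
print(corr.shape)                                 # (8, 8, 8, 8)

# For the feature at row 3, column 3 of frame 1, find the best match in frame 2.
match = np.unravel_index(np.argmax(corr[3, 3]), corr[3, 3].shape)
match = (int(match[0]), int(match[1]))
print(match)                                      # (4, 5)
```

The volume grows with the square of the number of pixels, which is one reason it is built on downsampled feature maps rather than on raw full-resolution images.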

Training data is the bottleneck. Real optical flow ground truth is nearly impossible to obtain (you would need to know the true 3D motion of every point). Solutions: (1) synthetic data (Flying Chairs, Sintel), (2) unsupervised losses (photometric consistency), (3) semi-supervised pretraining then fine-tuning on real video.
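
A photometric consistency loss of the kind used for unsupervised training can be sketched as below: warp the second frame back by the predicted flow and compare it with the first. The bilinear warp, the border clamping, and the function names are assumptions of this example. A real training setup would use a differentiable framework and usually adds occlusion handling.

```python
import numpy as np

def warp_backward(img, flow_u, flow_v):
    """Sample img at (x + u, y + v) with bilinear interpolation, clamping at the borders."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    x = np.clip(xs + flow_u, 0, w - 1)
    y = np.clip(ys + flow_v, 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0, x0] + ax * img[y0, x1]
    bottom = (1 - ax) * img[y1, x0] + ax * img[y1, x1]
    return (1 - ay) * top + ay * bottom

def photometric_loss(frame0, frame1, flow_u, flow_v):
    """Mean absolute difference between frame0 and frame1 warped back by the flow."""
    return float(np.mean(np.abs(frame0 - warp_backward(frame1, flow_u, flow_v))))

rng = np.random.default_rng(0)
frame0 = rng.random((32, 32))
frame1 = np.roll(frame0, (2, 3), axis=(0, 1))     # content moves 3 px right, 2 px down

zeros = np.zeros((32, 32))
loss_true = photometric_loss(frame0, frame1, zeros + 3.0, zeros + 2.0)
loss_zero = photometric_loss(frame0, frame1, zeros, zeros)
print(loss_true, loss_zero)
```

The correct flow gives a much lower loss than zero flow without any ground-truth labels. The remaining error comes from border pixels whose match lies outside the image.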

Why is ground truth optical flow data so difficult to obtain for real scenes?

You would need to know the true 3D motion of every visible point, which is impractical to measure in real scenes
The videos are too large
Cameras cannot record motion

Chapter 7: Layered Motion

Real scenes contain multiple objects moving independently. A car drives past a stationary building. A person walks in front of a crowd. Layered motion decomposes the scene into layers, each with its own motion model.

Each layer has:

A motion model (how the layer moves: affine, homography, or dense flow)
An appearance model (what the layer looks like: color, texture)
A support map (which pixels belong to this layer)

Application: frame interpolation. To generate frames between two video frames (for slow motion or higher frame rate), you need to know how each pixel moves. But occluded regions appear in one frame and not the other. Layered motion handles this: each layer is interpolated separately, and the depth ordering determines which layer is visible at each pixel in the interpolated frame.
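
The layer-by-layer interpolation described above might look like the following toy sketch. Each layer is a tuple of appearance, boolean support map, and a translational motion, listed from back to front. The data layout, the integer circular shifts, and the name `interpolate_layers` are all assumptions made for illustration.

```python
import numpy as np

def interpolate_layers(layers, t):
    """Composite layers (ordered back to front) at time t in [0, 1]."""
    h, w = layers[0][0].shape
    frame = np.zeros((h, w))
    for appearance, support, (dx, dy) in layers:
        shift = (int(round(t * dy)), int(round(t * dx)))     # (rows, cols)
        moved_appearance = np.roll(appearance, shift, axis=(0, 1))
        moved_support = np.roll(support, shift, axis=(0, 1))
        # Layers painted later are in front and overwrite what is behind them.
        frame = np.where(moved_support, moved_appearance, frame)
    return frame

h, w = 8, 12
background = (np.full((h, w), 0.2), np.ones((h, w), dtype=bool), (0, 0))   # static
square_support = np.zeros((h, w), dtype=bool)
square_support[2:5, 2:5] = True
foreground = (np.ones((h, w)), square_support, (4, 0))     # moves 4 px right per frame

middle = interpolate_layers([background, foreground], t=0.5)
print(middle[3])          # square now covers columns 4 to 6
```

Because the background layer is complete, the pixels uncovered behind the moving square are filled from it instead of being left as holes.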

Video object segmentation takes layered motion a step further: segment and track objects across an entire video sequence. Modern approaches (e.g., SAM-Track) combine segmentation foundation models with temporal tracking, enabling one-click tracking of any object through a video.

Why do layered motion models better represent real-world video than a single global flow field?

Real scenes contain multiple objects moving independently, and layered models assign each region its own motion, correctly handling occlusions
Layered models are faster to compute
They use fewer parameters

Chapter 8: Showcase — Flow Field Playground

Interact with a motion field. A set of particles moves according to a flow field. Adjust the flow type to see how different motion patterns look.

Particle Flow Visualization

Particles trace the flow field. Color encodes direction: right = teal, left = warm, up/down = blue.

Flow visualization: Optical flow is often visualized using the Middlebury color wheel: direction maps to hue, magnitude maps to saturation. This makes it easy to see the flow pattern at a glance. The particle visualization above shows the same information by tracing trajectories.
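
A simplified version of this color coding can be written with a plain HSV mapping, as sketched below. It is not the exact Middlebury color wheel (which uses a specific hand-designed palette); the hue offset and the function name `flow_to_rgb` are assumptions of this example.

```python
import colorsys
import numpy as np

def flow_to_rgb(u, v):
    """Map flow direction to hue and flow magnitude to saturation (value fixed at 1)."""
    magnitude = np.hypot(u, v)
    hue = (np.arctan2(v, u) + np.pi) / (2 * np.pi)          # direction mapped into [0, 1]
    saturation = magnitude / (magnitude.max() + 1e-12)      # normalize by the largest motion
    value = np.ones_like(magnitude)
    r, g, b = np.vectorize(colorsys.hsv_to_rgb)(hue, saturation, value)
    return np.stack([r, g, b], axis=-1)

# Two pixels: one stationary, one moving right.
u = np.array([[0.0, 1.0]])
v = np.array([[0.0, 0.0]])
rgb = flow_to_rgb(u, v)
print(rgb[0, 0], rgb[0, 1])    # white for no motion, cyan for rightward motion
```

Stationary pixels come out white because zero magnitude means zero saturation, so the background of a flow image stays neutral while moving regions stand out in color.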

Chapter 9: Connections

Concept	Used In
Optical flow	Ch 6 (action recognition), Ch 10 (video denoising), Ch 12 (depth from motion)
Lucas-Kanade / gradient-based	Ch 7 (feature tracking), Ch 8 (alignment refinement)
Parametric motion	Ch 8 (stitching), Ch 11 (pose estimation), video stabilization
Layered motion	Ch 10 (compositing), video editing, frame interpolation
Coarse-to-fine pyramids	Ch 3 (image pyramids), Ch 7 (multi-scale features), Ch 12 (stereo)
Deep flow (RAFT)	Autonomous driving, video generation, 3D scene understanding

Szeliski's perspective: "Motion estimation is the temporal analog of stereo correspondence (Chapter 12). Both solve the same fundamental problem — finding pixel correspondences between images — but with different constraints. Stereo uses epipolar geometry; motion uses temporal continuity. The algorithmic techniques transfer directly between them."

What is the fundamental similarity between optical flow (Chapter 9) and stereo matching (Chapter 12)?

Both solve the pixel correspondence problem (finding where each pixel in one image appears in another) using similar algorithms with different geometric constraints
They use the same camera setup
They only work on the same types of scenes