Computing how pixels move between frames: translational alignment, parametric motion, optical flow, and layered motion models.
You watch a car drive past and effortlessly perceive its speed and direction. Your visual system computes motion from the changing pattern of light on your retina. Motion estimation gives computers the same ability: given two or more video frames, compute how each pixel moved.
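Motion information enables applications such as video stabilization, object tracking, frame interpolation, video denoising, and action recognition.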
The simplest motion model: every pixel moves by the same amount (u, v). This is pure translation, caused by camera panning or a distant moving object.
Three approaches to estimate the translation:
| Method | How It Works | Pros / Cons |
|---|---|---|
| Exhaustive search | Try every possible (u,v) shift, measure SSD or NCC | Simple but slow for large displacements |
| Fourier-based | A translation in the spatial domain is a phase shift in the frequency domain. The inverse FFT of the normalized cross-power spectrum peaks at the shift. | Fast (FFT) and extendable to sub-pixel accuracy, but handles only translation |
| Coarse-to-fine | Build image pyramids. Estimate large motion at coarse level, refine at fine levels. | Handles large displacements efficiently |
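To make the Fourier-based row concrete, here is a minimal NumPy sketch of phase correlation. It assumes the second image is a circular (wrap-around) shift of the first by whole pixels, which real frames only approximate. The function name `phase_correlation` is illustrative and not taken from any library.

```python
import numpy as np

def phase_correlation(ref, moved):
    """Estimate the integer shift (dy, dx) such that
    moved is approximately np.roll(ref, (dy, dx), axis=(0, 1))."""
    F_ref = np.fft.fft2(ref)
    F_mov = np.fft.fft2(moved)
    # Cross-power spectrum, normalized so only the phase difference remains.
    cross = F_mov * np.conj(F_ref)
    cross /= np.abs(cross) + 1e-12
    # Back in the spatial domain this is (ideally) a single peak at the shift.
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peak indices beyond half the image size correspond to negative shifts.
    dy, dx = [int(p) if p <= n // 2 else int(p) - n
              for p, n in zip(peak, corr.shape)]
    return dy, dx

# Usage: shift a random texture by a known amount and recover it.
rng = np.random.default_rng(0)
ref = rng.random((64, 64))
moved = np.roll(ref, (3, -5), axis=(0, 1))
print(phase_correlation(ref, moved))
```

Sub-pixel accuracy would need an extra step, such as fitting a peak model around the maximum, which this sketch leaves out.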
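Instead of searching over all possible shifts, the Lucas-Kanade method uses calculus. Assume the image intensity is approximately constant along the motion path:

$$I(x + u,\, y + v,\, t + 1) = I(x, y, t)$$

Taylor-expand the left side to first order and cancel $I(x, y, t)$ from both sides:

$$I_x u + I_y v + I_t = 0$$

This is the brightness constancy equation. It is one equation in two unknowns (u, v). Lucas-Kanade resolves this by assuming all pixels in a small window share the same motion. This gives an overdetermined system, solved by least squares:

$$(A^\top A) \begin{bmatrix} u \\ v \end{bmatrix} = -A^\top b$$

where each row of $A$ contains the spatial gradients $(I_x, I_y)$ at one pixel of the window and $b$ contains the corresponding temporal gradients $I_t$.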
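The least-squares solve described above can be sketched in a few lines of NumPy. This is a single-window, single-iteration sketch under stated assumptions: the whole input patch is treated as one window, gradients come from `np.gradient`, and the temporal gradient is a plain frame difference. The function name `lucas_kanade_window` is illustrative.

```python
import numpy as np

def lucas_kanade_window(frame0, frame1):
    """Estimate one (u, v) for a whole patch, assuming small motion."""
    frame0 = np.asarray(frame0, dtype=float)
    frame1 = np.asarray(frame1, dtype=float)
    # np.gradient returns the derivative along axis 0 (y) first, then axis 1 (x).
    Iy, Ix = np.gradient(frame0)
    It = frame1 - frame0                      # temporal gradient
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # one row per pixel
    b = It.ravel()
    # Solve A [u, v]^T = -b in the least-squares sense.
    d, *_ = np.linalg.lstsq(A, -b, rcond=None)
    return d

# Usage: a smooth synthetic texture moved by a sub-pixel amount.
def texture(x, y):
    return np.sin(0.3 * x) + np.cos(0.2 * y) + np.sin(0.1 * (x + y))

X, Y = np.meshgrid(np.arange(40.0), np.arange(40.0))
u_true, v_true = 0.3, -0.2
frame0 = texture(X, Y)
frame1 = texture(X - u_true, Y - v_true)      # content moved by (u, v)
print(lucas_kanade_window(frame0, frame1))    # close to (0.3, -0.2)
```

Because the derivation is a first-order expansion, the estimate is only approximate and degrades as the displacement grows, which is why practical implementations iterate and work coarse-to-fine.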
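The flow is most reliable at corners, where both eigenvalues of the 2×2 matrix $A^\top A$ are large. At edges one eigenvalue is close to zero, and only the component of motion perpendicular to the edge can be estimated.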
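The eigenvalue criterion can be checked on toy patches. The sketch below builds the 2×2 matrix of summed gradient products for a step edge and for a corner. The patch sizes and the helper name `gradient_matrix_eigenvalues` are assumptions made for illustration.

```python
import numpy as np

def gradient_matrix_eigenvalues(patch):
    """Eigenvalues (ascending) of A^T A built from the patch gradients."""
    Iy, Ix = np.gradient(np.asarray(patch, dtype=float))
    ATA = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                    [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    return np.linalg.eigvalsh(ATA)

# A vertical step edge: intensity changes only along x.
edge = np.zeros((9, 9))
edge[:, 5:] = 1.0

# A corner: intensity changes along both x and y.
corner = np.zeros((9, 9))
corner[5:, 5:] = 1.0

print("edge  :", gradient_matrix_eigenvalues(edge))    # one eigenvalue is zero
print("corner:", gradient_matrix_eigenvalues(corner))  # both clearly positive
```

For the edge patch the system is rank-deficient, so the least-squares solve cannot pin down motion along the edge.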
Sometimes all pixels move according to a single global model, for example when the entire camera moves. Parametric motion estimation fits one transformation (such as an affine map or a homography) to the entire image pair.
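As one way to fit such a global model, the sketch below estimates an affine transformation by linear least squares. It assumes point correspondences between the two frames are already available (for instance from feature matches or sampled flow vectors), which is a simplification of fitting directly to image intensities. The name `fit_affine` is illustrative.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix M with dst ~= M @ [x, y, 1]^T.

    src, dst: arrays of shape (N, 2) holding matched (x, y) points, N >= 3.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    X = np.hstack([src, np.ones((len(src), 1))])    # homogeneous coords, (N, 3)
    M_T, *_ = np.linalg.lstsq(X, dst, rcond=None)   # solves X @ M^T = dst
    return M_T.T

# Usage: recover a known affine motion from synthetic correspondences.
rng = np.random.default_rng(0)
M_true = np.array([[1.1, 0.2, 3.0],
                   [-0.1, 0.9, -2.0]])
src = rng.random((20, 2)) * 100
dst = src @ M_true[:, :2].T + M_true[:, 2]
print(np.round(fit_affine(src, dst), 3))
```

A homography has eight degrees of freedom and is not linear in this form, so it needs a different solver. Real correspondences also contain outliers, which a robust estimator would have to handle.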
Application: Video stabilization. Estimate the frame-to-frame camera motion (usually an affine or homography). Smooth the motion trajectory. Apply the inverse of the residual (jitter) to each frame. Result: a steady video from a shaky handheld camera.
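The smoothing step of this pipeline can be sketched as follows. To keep it short, the sketch assumes the camera motion is a pure per-frame translation and smooths with a moving average. The function name `stabilizing_corrections` and the `radius` parameter are illustrative choices, not part of any particular stabilizer.

```python
import numpy as np

def stabilizing_corrections(frame_shifts, radius=2):
    """Per-frame corrective translations that remove jitter.

    frame_shifts: (T, 2) array of frame-to-frame camera translations.
    Returns a (T, 2) array: the shift to apply to each frame.
    """
    frame_shifts = np.asarray(frame_shifts, dtype=float)
    # Accumulate frame-to-frame motion into a camera trajectory.
    trajectory = np.cumsum(frame_shifts, axis=0)
    # Smooth the trajectory with a moving average (edges are replicated).
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    padded = np.pad(trajectory, ((radius, radius), (0, 0)), mode="edge")
    smooth = np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid") for k in range(2)],
        axis=1,
    )
    # Jitter is trajectory - smooth; the correction is its inverse.
    return smooth - trajectory

# Usage: a steady pan of 2 px/frame with random jitter on top.
rng = np.random.default_rng(0)
shifts = np.tile([2.0, 0.0], (30, 1)) + rng.normal(0, 0.5, (30, 2))
print(stabilizing_corrections(shifts)[:3])
```

A steady pan produces a linear trajectory, which the moving average leaves unchanged away from the sequence ends, so intentional camera motion is preserved while high-frequency shake is cancelled.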
For spline-based motion, the image is divided into a grid, and each grid cell has its own transformation parameters. The motion field is a smooth interpolation (B-spline) of these local transformations. This handles spatially varying motion without the full cost of per-pixel optical flow.
Optical flow estimates a motion vector (u, v) at every pixel. This is the most detailed motion representation: a dense vector field describing where every point moves from one frame to the next.
The brightness constancy equation gives one constraint per pixel but two unknowns (u, v). This is the aperture problem: through a small aperture (window), only the component of motion perpendicular to the local edge is visible, so the true motion direction is ambiguous. Resolving the ambiguity requires an additional constraint.
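This can be written out explicitly, assuming the usual notation in which $I_x$, $I_y$ are the spatial image derivatives and $I_t$ is the temporal derivative. The single constraint $I_x u + I_y v + I_t = 0$ determines only the flow component along the gradient direction $\nabla I = (I_x, I_y)^\top$, often called the normal flow:

$$\mathbf{u}_\perp = -\frac{I_t}{\lVert \nabla I \rVert^2}\, \nabla I$$

Any flow of the form $\mathbf{u}_\perp + \alpha\, \mathbf{t}$, where $\mathbf{t}$ is a vector along the edge (so $\nabla I \cdot \mathbf{t} = 0$) and $\alpha$ is arbitrary, satisfies the same constraint, which is why motion along the edge cannot be observed locally.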
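Horn and Schunck (1981) added a smoothness constraint: the flow field should vary smoothly across the image. They minimize:

$$E(u, v) = \iint \left[ \left( I_x u + I_y v + I_t \right)^2 + \lambda \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \right] dx\, dy$$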
The first term enforces brightness constancy. The second term (weighted by λ) penalizes flow gradients, encouraging neighboring pixels to move similarly.
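A compact NumPy sketch of the classic iterative scheme for this kind of energy follows. It is a sketch under assumptions: the Laplacian is approximated with a 4-neighbor average, constant factors from the discretization are absorbed into `lam`, and there is no pyramid or warping, so it only suits small motions. The names `horn_schunck` and `local_average` are illustrative.

```python
import numpy as np

def local_average(f):
    """Mean of the 4 neighbors of each pixel (edges replicated)."""
    p = np.pad(f, 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0

def horn_schunck(frame0, frame1, lam=1.0, n_iters=200):
    """Dense flow (u, v) from a smoothness-regularized brightness constraint."""
    frame0 = np.asarray(frame0, dtype=float)
    frame1 = np.asarray(frame1, dtype=float)
    Iy, Ix = np.gradient(frame0)      # axis 0 is y, axis 1 is x
    It = frame1 - frame0
    u = np.zeros_like(frame0)
    v = np.zeros_like(frame0)
    for _ in range(n_iters):
        u_bar, v_bar = local_average(u), local_average(v)
        # How much the neighborhood-average flow violates brightness
        # constancy, scaled by the balance between data and smoothness terms.
        resid = (Ix * u_bar + Iy * v_bar + It) / (lam + Ix**2 + Iy**2)
        u = u_bar - Ix * resid
        v = v_bar - Iy * resid
    return u, v

# Usage: a smooth texture translated by a small sub-pixel amount.
X, Y = np.meshgrid(np.arange(40.0), np.arange(40.0))
tex = lambda x, y: np.sin(0.3 * x) + np.cos(0.2 * y) + np.sin(0.1 * (x + y))
u, v = horn_schunck(tex(X, Y), tex(X - 0.3, Y + 0.2), lam=0.1, n_iters=500)
print(u.mean(), v.mean())
```

Larger `lam` gives smoother fields and propagates flow further into textureless regions, at the cost of blurring motion boundaries.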
| Method | Approach | Key Feature |
|---|---|---|
| Lucas-Kanade | Local (window) | Reliable at corners. Sparse. |
| Horn-Schunck | Global (variational) | Dense flow. Over-smooths boundaries. |
| TV-L1 | Global + robust penalty | Preserves discontinuities. Slower. |
| FlowNet/RAFT | Deep learning | State of the art on benchmarks such as Sintel and KITTI. Fast at inference. |
The deep learning revolution transformed optical flow. Instead of hand-crafted energy functions, train a network end-to-end on ground truth flow data.
| Method | Year | Key Innovation |
|---|---|---|
| FlowNet | 2015 | First end-to-end CNN for flow. Encoder-decoder architecture. |
| FlowNet 2.0 | 2017 | Stacked refinement networks. Competitive with classical methods. |
| PWC-Net | 2018 | Pyramid, warping, cost volume. Compact and efficient. |
| RAFT | 2020 | Recurrent all-pairs field transforms: iterative refinement over an all-pairs correlation volume. State of the art on Sintel and KITTI at publication. |
Training data is the bottleneck. Real optical flow ground truth is nearly impossible to obtain (you would need to know the true 3D motion of every point). Solutions: (1) synthetic data (Flying Chairs, Sintel), (2) unsupervised losses (photometric consistency), (3) semi-supervised pretraining then fine-tuning on real video.
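The photometric consistency idea behind unsupervised losses can be sketched without a deep learning framework. The code below warps the second frame back by a candidate flow and measures the mean absolute intensity difference. It assumes grayscale frames, a flow array of shape (H, W, 2) holding (u, v), and no occlusion handling. The names `bilinear_sample` and `photometric_loss` are illustrative.

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample img at real-valued (xs, ys), clamping to the image border."""
    h, w = img.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax, ay = xs - x0, ys - y0
    top = (1 - ax) * img[y0, x0] + ax * img[y0, x1]
    bot = (1 - ax) * img[y1, x0] + ax * img[y1, x1]
    return (1 - ay) * top + ay * bot

def photometric_loss(frame0, frame1, flow):
    """Mean |frame1(x + u, y + v) - frame0(x, y)| over all pixels."""
    h, w = frame0.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    warped = bilinear_sample(frame1, xs + flow[..., 0], ys + flow[..., 1])
    return float(np.mean(np.abs(warped - frame0)))

# Usage: the correct flow explains frame1 better than zero flow.
rng = np.random.default_rng(0)
frame0 = rng.random((32, 32))
frame1 = np.roll(frame0, (2, 3), axis=(0, 1))   # content moves 3 px right, 2 px down
true_flow = np.zeros((32, 32, 2))
true_flow[..., 0], true_flow[..., 1] = 3.0, 2.0
print(photometric_loss(frame0, frame1, true_flow),
      photometric_loss(frame0, frame1, np.zeros((32, 32, 2))))
```

In a training setup the same quantity would be computed with differentiable operations so that its gradient can update the network that predicts `flow`.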
Real scenes contain multiple objects moving independently. A car drives past a stationary building. A person walks in front of a crowd. Layered motion decomposes the scene into layers, each with its own motion model.
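Each layer typically has its own appearance (a color image), a support mask or alpha matte marking which pixels belong to it, and a motion model.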
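One ingredient of a layered decomposition, deciding which layer each pixel belongs to, can be sketched as follows. The sketch assumes a dense flow field is already available and that each layer's motion is a single known translation. Real layered methods estimate the motions and the assignment jointly and use richer models. The name `assign_layers` is illustrative.

```python
import numpy as np

def assign_layers(flow, layer_motions):
    """Label each pixel with the layer whose motion best matches its flow.

    flow: (H, W, 2) array of per-pixel (u, v).
    layer_motions: (K, 2) array-like, one translation per layer.
    Returns an (H, W) integer array of layer indices.
    """
    motions = np.asarray(layer_motions, dtype=float)
    # Squared distance between each pixel's flow and each layer's motion.
    diff = flow[:, :, None, :] - motions[None, None, :, :]
    residual = np.sum(diff**2, axis=-1)          # shape (H, W, K)
    return np.argmin(residual, axis=-1)

# Usage: a static background with a small object moving 4 px to the right.
flow = np.zeros((20, 20, 2))
flow[5:10, 5:10] = (4.0, 0.0)
labels = assign_layers(flow, [(0.0, 0.0), (4.0, 0.0)])
print(labels.sum())   # number of pixels assigned to the moving layer
```

The resulting label map plays the role of a hard support mask for each layer.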
Video object segmentation takes layered motion a step further: segment and track objects across an entire video sequence. Modern approaches (e.g., SAM-Track) combine segmentation foundation models with temporal tracking, enabling one-click tracking of any object through a video.
| Concept | Used In |
|---|---|
| Optical flow | Ch 6 (action recognition), Ch 10 (video denoising), Ch 12 (depth from motion) |
| Lucas-Kanade / gradient-based | Ch 7 (feature tracking), Ch 8 (alignment refinement) |
| Parametric motion | Ch 8 (stitching), Ch 11 (pose estimation), video stabilization |
| Layered motion | Ch 10 (compositing), video editing, frame interpolation |
| Coarse-to-fine pyramids | Ch 3 (image pyramids), Ch 7 (multi-scale features), Ch 12 (stereo) |
| Deep flow (RAFT) | Autonomous driving, video generation, 3D scene understanding |