The algorithm that teaches computers to see motion — from pixel displacements to video understanding.
You're watching a video of a busy street. Cars move left, pedestrians walk right, a bird flies overhead. You see motion effortlessly. But to a computer, each frame is just a grid of numbers — pixel brightnesses. Frame 1 is one grid. Frame 2 is another. There is no built-in notion of "that pixel moved over there."
Without a way to connect pixels across frames, a computer sees a video as a slideshow of unrelated photographs. It can't tell what moved, how fast, or in which direction. It can't stabilize shaky footage, interpolate between frames, or understand actions.
Optical flow is the answer: for every pixel in frame 1, compute where it ended up in frame 2. The result is a dense displacement field — a map of motion that turns a pair of static images into a description of a moving world.
A circle moves between frames. Without optical flow, the computer sees only two snapshots; the flow field supplies the displacement arrows that connect each pixel in frame 1 to its position in frame 2.
At every pixel location (x, y) in frame 1, optical flow assigns a displacement vector (u, v). The value u tells how far the pixel moved horizontally, and v tells how far it moved vertically. If a pixel at (100, 50) has flow (3, −2), it moved to (103, 48) in frame 2.
The output of optical flow is a tensor with shape [H, W, 2] — the same height and width as the image, but with 2 channels instead of 3 (RGB). Channel 0 is horizontal displacement u, channel 1 is vertical displacement v. This is a dense flow field: one vector per pixel.
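As a minimal sketch of this layout (NumPy is assumed, and the array names and image size are illustrative rather than taken from any particular library), a flow field can be stored as an [H, W, 2] array and used to look up where a pixel lands:

```python
import numpy as np

H, W = 120, 160
flow = np.zeros((H, W, 2), dtype=np.float32)  # channel 0 = u, channel 1 = v

# Hypothetical example: the pixel at (x=100, y=50) moves by (u, v) = (3, -2).
x, y = 100, 50
flow[y, x] = (3.0, -2.0)  # arrays are indexed [row, column] = [y, x]

u, v = flow[y, x]
print(x + u, y + v)  # 103.0 48.0

# Dense version: target coordinates for every pixel at once.
ys, xs = np.mgrid[0:H, 0:W]
targets = np.stack([xs + flow[..., 0], ys + flow[..., 1]], axis=-1)  # [H, W, 2]
```

Note the index order: the array is addressed as [y, x], while the flow vector is written (u, v) = (horizontal, vertical).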
A flow field has two values per pixel, which is hard to show as a grayscale image. The standard visualization uses the HSV color wheel: the direction of motion maps to hue (roughly red = right, cyan = left, green = down, magenta = up; the exact hues vary between tools), and the magnitude maps to saturation. A stationary pixel is white when magnitude drives saturation, or black in tools that map magnitude to brightness instead.
On the wheel, the angle of a flow vector picks the hue and its length picks the saturation: direction = hue, speed = saturation.
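One way to sketch this color coding uses only Python's standard `colorsys` module. The function name `flow_to_rgb` and the `max_mag` normalization are assumptions made for this illustration; real tools use their own color wheels and scaling.

```python
import colorsys
import math

def flow_to_rgb(u, v, max_mag=10.0):
    """Map one flow vector to an RGB triple with components in [0, 1]."""
    angle = math.atan2(v, u)                    # direction in radians
    hue = (angle / (2 * math.pi)) % 1.0         # wrap direction onto [0, 1)
    sat = min(math.hypot(u, v) / max_mag, 1.0)  # speed, clipped at max_mag
    return colorsys.hsv_to_rgb(hue, sat, 1.0)   # value fixed at 1, so zero flow is white

print(flow_to_rgb(0.0, 0.0))    # (1.0, 1.0, 1.0): stationary pixel is white
print(flow_to_rgb(10.0, 0.0))   # (1.0, 0.0, 0.0): rightward motion is red
print(flow_to_rgb(-10.0, 0.0))  # (0.0, 1.0, 1.0): leftward motion is cyan
```

With this particular angle-to-hue mapping, downward motion (positive v) lands at hue 0.25, a yellow-green. Other implementations assign the vertical hues differently.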
| Quantity | Shape | Meaning |
|---|---|---|
| Frame 1 | [H, W, 3] | RGB image at time t |
| Frame 2 | [H, W, 3] | RGB image at time t+1 |
| Flow | [H, W, 2] | Per-pixel displacement (u, v) |
| u | [H, W] | Horizontal displacement (pixels) |
| v | [H, W] | Vertical displacement (pixels) |
Every classical optical flow method starts with one assumption: a pixel's brightness doesn't change as it moves. If a pixel is bright in frame 1, it's still bright in frame 2; it just moved to a new location. Formally:
I(x, y, t) = I(x + u, y + v, t + 1)
Here I(x, y, t) is the brightness at position (x, y) in frame t. The pixel moved by (u, v), so in the next frame it's at (x+u, y+v). This is the brightness constancy assumption.
We Taylor-expand the right side around (x, y, t), keeping only the first-order terms:
I(x + u, y + v, t + 1) ≈ I(x, y, t) + Ix·u + Iy·v + It
Where Ix = ∂I/∂x (horizontal gradient), Iy = ∂I/∂y (vertical gradient), and It = ∂I/∂t (temporal change). Substituting the brightness constancy equation, I(x, y, t) cancels on both sides, leaving:
Ix·u + Iy·v + It = 0
This is the optical flow constraint equation. One equation, two unknowns (u and v). We can't solve it at a single pixel — there are infinitely many (u, v) pairs that satisfy it. This fundamental limitation is called the aperture problem.
Picture a horizontal bar moving behind a circular aperture. Its up-down motion is visible, but any sideways motion along the bar leaves the view through the aperture unchanged, so the true direction cannot be recovered from that window alone.
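A small numerical sketch of the aperture problem, assuming NumPy: from the single constraint Ix·u + Iy·v + It = 0, only the flow component along the brightness gradient (often called the normal flow) can be recovered. The helper name `normal_flow` and the numbers are made up for illustration.

```python
import numpy as np

def normal_flow(Ix, Iy, It, eps=1e-8):
    """Flow component along the brightness gradient at one pixel."""
    g2 = Ix * Ix + Iy * Iy          # squared gradient magnitude
    scale = -It / (g2 + eps)        # eps avoids division by zero in flat regions
    return np.array([scale * Ix, scale * Iy])

# Horizontal bar: brightness varies only vertically, so Ix = 0.
Ix, Iy = 0.0, 2.0
true_flow = np.array([3.0, 1.0])                  # mostly sideways motion
It = -(Ix * true_flow[0] + Iy * true_flow[1])     # from Ix*u + Iy*v + It = 0

print(normal_flow(Ix, Iy, It))  # approximately [0, 1]: the sideways part is invisible

# A very different true flow produces exactly the same measurement.
other_flow = np.array([-5.0, 1.0])
print(-(Ix * other_flow[0] + Iy * other_flow[1]) == It)  # True
```

Both flows share the same vertical component, so a single pixel's gradients cannot tell them apart.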
Lucas and Kanade (1981) had a simple idea: if flow is unknown at one pixel, use the neighbors. Assume that all pixels in a small patch (say 5×5) share the same flow (u, v). Now each pixel contributes one equation Ix·u + Iy·v + It = 0, and with 25 pixels we have 25 equations for 2 unknowns. That's an overdetermined system — solve with least squares.
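Stack all the pixel equations into a matrix. For a patch of n pixels, row i of A holds the spatial gradients at pixel i and entry i of b holds the negative temporal gradient:
A = [Ix1 Iy1; Ix2 Iy2; …; Ixn Iyn],  d = [u; v],  b = −[It1; It2; …; Itn],  A·d = b
The least-squares solution is:
d = (AᵀA)⁻¹ Aᵀ b
AᵀA is a 2×2 matrix (called the structure tensor). The key: it must be invertible. If the patch is a flat region (no gradient), AᵀA is all zeros, so there is no solution. If it's a pure edge, AᵀA has one near-zero eigenvalue, so the flow is ambiguous along the edge. Only at corners (two strong eigenvalues) is the system well-conditioned.
In practice the algorithm picks square patches and solves this 2×2 system for each one. Patch size is a tradeoff: larger patches average out noise and are more likely to contain two gradient directions, but they are also more likely to straddle pixels that move differently, which breaks the shared-flow assumption.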
| Quantity | Shape | Meaning |
|---|---|---|
| A | [n, 2] | Gradient matrix for n pixels in patch |
| b | [n, 1] | Negative temporal gradient |
| AᵀA | [2, 2] | Structure tensor (must be invertible) |
| d | [2, 1] | Flow vector (u, v) for this patch |
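The patch-wise least-squares solve can be sketched in a few lines of NumPy. This is a single-patch illustration under stated assumptions, not a full tracker: the function name `lucas_kanade_patch`, the eigenvalue threshold, and the synthetic Gaussian blob are all choices made for this example.

```python
import numpy as np

def lucas_kanade_patch(I1, I2, cx, cy, half=7, min_eig=1e-6):
    """Estimate one (u, v) for the square patch centred at (cx, cy)."""
    Iy, Ix = np.gradient(I1)     # np.gradient returns d/drow (y) first, then d/dcol (x)
    It = I2 - I1                 # temporal difference between the two frames
    win = (slice(cy - half, cy + half + 1), slice(cx - half, cx + half + 1))
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)  # [n, 2]
    b = -It[win].ravel()                                      # [n]
    AtA = A.T @ A                                             # 2x2 structure tensor
    eigvals = np.linalg.eigvalsh(AtA)                         # ascending order
    if eigvals[0] < min_eig:
        return None, eigvals     # flat region or pure edge: no reliable solution
    return np.linalg.solve(AtA, A.T @ b), eigvals

def blob(cx, cy, size=64, sigma=5.0):
    """Synthetic frame: one Gaussian blob centred at (cx, cy)."""
    y, x = np.mgrid[0:size, 0:size].astype(float)
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

true_u, true_v = 0.3, -0.2                     # known sub-pixel motion
I1 = blob(32.0, 32.0)
I2 = blob(32.0 + true_u, 32.0 + true_v)
d, eigvals = lucas_kanade_patch(I1, I2, cx=32, cy=32)
print(d)  # close to [0.3, -0.2]
```

The blob has gradients in every direction, so both eigenvalues are well above the threshold. On a flat image the same call returns `None`, which is the invertibility failure described above. The motion is kept sub-pixel because the first-order Taylor expansion only holds for small displacements.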
Lucas-Kanade solves flow locally, patch by patch. Horn and Schunck (1981) took the opposite approach: solve for the entire flow field at once, with a global smoothness constraint. The idea: neighboring pixels should have similar flow, because real objects move coherently.
Horn-Schunck minimizes a single energy over all pixels:
E(u, v) = ∬ [ (Ix·u + Iy·v + It)² + α² (|∇u|² + |∇v|²) ] dx dy
The first term is the data term: brightness constancy should hold everywhere. The second term is the smoothness term: flow should vary slowly. The parameter α controls the balance — large α means smoother flow (fewer sharp transitions), small α means more faithful to the raw data (noisier).
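The minimization leads to a system of equations solved iteratively with Gauss-Seidel or Jacobi iteration. At each iteration, every pixel's flow is updated using its neighbors' flow from the previous iteration. After enough iterations, the flow field converges. The update for each pixel is:
u ← ū − Ix (Ix·ū + Iy·v̄ + It) / (α² + Ix² + Iy²)
v ← v̄ − Iy (Ix·ū + Iy·v̄ + It) / (α² + Ix² + Iy²)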
Here ū and v̄ are the local averages of u and v from neighboring pixels. Each iteration nudges the flow toward both the data and its neighbors. α literally controls how much "peer pressure" each pixel feels from its neighbors.
Consider two objects moving in different directions. Low α gives sharp motion boundaries but noisy flow; high α gives smooth flow but blurs the boundary between the two objects.
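A minimal NumPy sketch of the iterative scheme: each pixel's flow is pulled toward the average of its neighbors and then corrected along the gradient by the constraint residual. Several simplifications are assumptions of this example rather than part of the original method: a plain 4-neighbor average, wrap-around borders via `np.roll`, a fixed iteration count, and a synthetic periodic texture as input.

```python
import numpy as np

def horn_schunck(I1, I2, alpha=0.1, n_iters=200):
    """Jacobi-style Horn-Schunck iterations; returns flow of shape [H, W, 2]."""
    Iy, Ix = np.gradient(I1)                 # d/drow (y), then d/dcol (x)
    It = I2 - I1
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)

    def local_avg(f):
        # 4-neighbour mean with wrap-around borders (a simplification).
        return 0.25 * (np.roll(f, 1, 0) + np.roll(f, -1, 0)
                       + np.roll(f, 1, 1) + np.roll(f, -1, 1))

    denom = alpha ** 2 + Ix ** 2 + Iy ** 2
    for _ in range(n_iters):
        u_bar, v_bar = local_avg(u), local_avg(v)
        resid = Ix * u_bar + Iy * v_bar + It  # constraint violation at the averages
        u = u_bar - Ix * resid / denom
        v = v_bar - Iy * resid / denom
    return np.stack([u, v], axis=-1)

# Synthetic pair: a smooth periodic texture translated by a known amount.
y, x = np.mgrid[0:64, 0:64].astype(float)
k = 2 * np.pi * 3 / 64

def texture(x, y):
    return np.sin(k * x) + np.cos(k * y)

true_u, true_v = 0.3, -0.2
I1 = texture(x, y)
I2 = texture(x - true_u, y - true_v)
flow = horn_schunck(I1, I2)
epe = np.linalg.norm(flow - np.array([true_u, true_v]), axis=-1).mean()
print(flow.shape, epe)  # compare epe with the zero-flow baseline of about 0.36
```

Raising `alpha` in this sketch makes the averaging step dominate, which is the smoothing behavior described above.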
| Method | Scope | Strengths | Weaknesses |
|---|---|---|---|
| Lucas-Kanade | Local (per patch) | Fast, robust at corners | Sparse, fails on large motion |
| Horn-Schunck | Global (entire image) | Dense, handles textureless areas | Slow, over-smooths boundaries |
Classical methods are elegant but slow and brittle. In 2015, Dosovitskiy et al. asked: can a convolutional neural network learn optical flow end-to-end? The answer was FlowNet — the first CNN to predict dense flow directly from an image pair.
FlowNetS (simple): stack both frames along the channel dimension, so the input is [H, W, 6] (3 RGB channels from each frame). Feed this into an encoder-decoder (contracting path with conv+stride, expanding path with deconv). The output is [H, W, 2], the predicted flow.
FlowNetC (correlation): instead of stacking the images, process each through a separate encoder to get feature maps, then compute a correlation volume that measures how similar each feature in image 1 is to features in image 2 within a search window. This gives the network an explicit matching signal.
Where do you get ground-truth flow to supervise training? You render it. The FlyingChairs dataset pastes random chair images onto backgrounds and moves them, so the renderer knows exactly how every pixel moved. The loss is simply:
L = (1/N) Σₚ ‖f_pred(p) − f_gt(p)‖₂, where p ranges over all N = H·W pixels
End-point error (EPE): the Euclidean distance between predicted and ground-truth flow, averaged over all pixels. FlowNet achieved ~2.7 pixel EPE on FlyingChairs while running at 10 FPS — orders of magnitude faster than classical methods.
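As a small illustration of the metric (NumPy assumed; the function name and toy arrays are made up for this example), EPE is the per-pixel Euclidean distance between two [H, W, 2] flow fields, averaged over the image:

```python
import numpy as np

def end_point_error(flow_pred, flow_gt):
    """Mean Euclidean distance between two [H, W, 2] flow fields."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

# Toy ground truth: every pixel moves by (3, -2).
gt = np.zeros((4, 4, 2))
gt[..., 0] = 3.0
gt[..., 1] = -2.0

# Toy prediction that is off by (3, 4) at every pixel.
pred = gt.copy()
pred[..., 0] += 3.0
pred[..., 1] += 4.0

print(end_point_error(pred, gt))  # 5.0, since sqrt(3^2 + 4^2) = 5
```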
The encoder shrinks spatial resolution while growing channel depth. The decoder expands back. Skip connections preserve fine detail.
| Model | Year | EPE (Sintel) | FPS | Key Idea |
|---|---|---|---|---|
| FlowNetS | 2015 | 7.42 | 10 | Stacked input, single encoder-decoder |
| FlowNetC | 2015 | 6.85 | 10 | Correlation layer for explicit matching |
| FlowNet2 | 2017 | 4.16 | 8 | Stacked networks, iterative refinement |
RAFT (Recurrent All-pairs Field Transforms, Teed & Deng 2020) is one of the most influential optical flow methods. It won the ECCV 2020 Best Paper Award, and many later methods build on its ideas. The core insight: don't predict flow in one shot; iteratively refine it by repeatedly looking up correlations.
Stage 1: Feature Extraction. A shared CNN encoder processes both images independently, producing feature maps at 1/8 resolution:
f1, f2 ∈ ℝ^(H/8 × W/8 × 256)
Stage 2: Correlation Volume. Compute the dot product between every feature in f1 and every feature in f2. The result is a 4D tensor:
C(i, j, k, l) = Σ_d f1(i, j, d) · f2(k, l, d),  C ∈ ℝ^(H/8 × W/8 × H/8 × W/8)
This is the all-pairs correlation volume. Entry C(i,j,k,l) measures how similar the feature at position (i,j) in image 1 is to the feature at (k,l) in image 2. Build it once, look it up many times.
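The all-pairs dot product maps directly onto an `einsum`. The sketch below uses NumPy with random toy features in place of a trained encoder, and the small grid sizes stand in for H/8 and W/8; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
H8, W8, D = 6, 8, 64                    # toy stand-ins for H/8, W/8, feature dimension
f1 = rng.standard_normal((H8, W8, D))
f2 = rng.standard_normal((H8, W8, D))

# corr[i, j, k, l] = dot product of f1[i, j] and f2[k, l]
corr = np.einsum("ijd,kld->ijkl", f1, f2)
print(corr.shape)  # (6, 8, 6, 8)

# Sanity check: if image 2's features are image 1's shifted right by 2 cells,
# the best match for cell (i, j) should sit at (i, j + 2).
f2_shifted = np.roll(f1, shift=2, axis=1)
corr_shifted = np.einsum("ijd,kld->ijkl", f1, f2_shifted)
i, j = 3, 2
k, l = np.unravel_index(np.argmax(corr_shifted[i, j]), (H8, W8))
print(k - i, l - j)  # 0 2
```

The offset of the best match, here (0, 2) in (row, column) order, is exactly the displacement a flow estimate at that cell should report, which is why the volume works as a lookup table.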
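Stage 3: Iterative Update (GRU). Starting from zero flow, a Gated Recurrent Unit refines the estimate over 12 iterations. At each iteration, the network looks up correlation values from the volume in a local window around the position that each pixel's current flow estimate points to, feeds them together with the current flow into the GRU to update its hidden state, and predicts a flow update Δf that is added to the estimate: f ← f + Δf.
RAFT refines its flow estimate step by step. Iteration 0 is the zero-flow initialization, and each GRU step improves the estimate by looking up correlations around the current prediction.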
RAFT supervises at every iteration, not just the last. The loss weights later iterations more heavily:
L = Σ_{i=1..N} γ^(N−i) ‖f_gt − f_i‖₁, with γ < 1 (0.8 in the RAFT paper)
This encourages each iteration to improve, not just the final output. L1 loss (not L2) because it's less sensitive to outliers at motion boundaries.
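A sketch of this sequence loss in NumPy, assuming the exponential weighting γ^(N−i) with γ = 0.8 as reported in the RAFT paper. A real training loop would use an autodiff framework and a validity mask; the function name and toy inputs are illustrative.

```python
import numpy as np

def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """Weighted sum of mean L1 errors over all refinement iterations."""
    n = len(flow_preds)
    total = 0.0
    for i, pred in enumerate(flow_preds, start=1):
        weight = gamma ** (n - i)              # the final iteration gets weight 1
        total += weight * np.abs(pred - flow_gt).mean()
    return total

# Toy example: three iterations whose per-pixel error shrinks 1.0 -> 0.5 -> 0.0.
gt = np.zeros((2, 2, 2))
preds = [gt + 1.0, gt + 0.5, gt + 0.0]
print(sequence_loss(preds, gt))  # about 0.64 * 1.0 + 0.8 * 0.5 + 1.0 * 0.0 = 1.04
```

Because early iterations carry smaller weights, a poor first guess costs less than a poor final answer, yet every step still receives a training signal.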
| Component | Shape | Purpose |
|---|---|---|
| Feature encoder | [H/8, W/8, 256] | Extract appearance features from each image |
| Correlation volume | [H/8, W/8, H/8, W/8] | All-pairs similarity lookup table |
| GRU hidden state | [H/8, W/8, 128] | Recurrent memory across iterations |
| Flow estimate | [H/8, W/8, 2] | Current (u, v) prediction, upsampled to [H, W, 2] |
Ground-truth optical flow for real video is nearly impossible to obtain. You'd need to know exactly where every single pixel moved — sub-pixel precision for millions of pixels. Instead, the entire field relies on synthetic datasets where the renderer provides perfect ground truth for free.
| Dataset | Year | Size | What It Is | Realism |
|---|---|---|---|---|
| FlyingChairs | 2015 | 22k pairs | Random chairs on backgrounds, 2D motion | Low |
| FlyingThings3D | 2016 | 26k pairs | Random 3D objects with 3D motion | Medium |
| MPI Sintel | 2012 | 1k pairs | Frames from animated film, realistic effects | High |
| KITTI | 2012/2015 | 400 pairs | Real driving scenes, LiDAR ground truth (sparse) | Real (sparse GT) |
RAFT and its successors follow a specific curriculum. Start simple, increase complexity: FlyingChairs → FlyingThings3D → Sintel.
Models trained on synthetic data struggle with real-world effects that don't appear in training. Each such effect violates an assumption the model learned from clean synthetic data and degrades the estimated flow.
Optical flow isn't just an academic exercise. It's a core primitive that powers dozens of real systems. Here's how flow enables each application:
Shaky handheld video has erratic global motion. Compute optical flow between consecutive frames → estimate the dominant (camera) motion → smooth the resulting camera trajectory → warp each frame by the difference between the smoothed and raw trajectories. The result: steady footage as if shot on a gimbal. Software stabilizers, such as those in phone cameras and YouTube's video stabilizer, use pipelines of this kind.
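A deliberately simplified sketch of the trajectory-smoothing step, assuming NumPy and a translation-only camera model. Real stabilizers typically fit richer motion models such as affine transforms or homographies. The function name, the median as the dominant-motion estimate, and the moving-average window are assumptions of this example, and the sign of the returned shift depends on the warping convention used downstream.

```python
import numpy as np

def stabilizing_offsets(flows, window=5):
    """Per-frame (dx, dy) corrections from a list of [H, W, 2] flow fields.

    `window` is assumed to be odd so the moving average is centred.
    """
    # Dominant motion per frame pair: the median flow vector.
    steps = np.array([np.median(f.reshape(-1, 2), axis=0) for f in flows])
    path = np.cumsum(steps, axis=0)                 # raw camera trajectory, [T, 2]
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(path, ((pad, pad), (0, 0)), mode="edge")
    smooth = np.stack(
        [np.convolve(padded[:, c], kernel, mode="valid") for c in range(2)],
        axis=1,
    )                                               # smoothed trajectory, [T, 2]
    return smooth - path                            # correction to apply per frame

# A perfectly steady pan (1 px right per frame) needs no interior correction.
steady = np.zeros((4, 4, 2))
steady[..., 0] = 1.0
offsets = stabilizing_offsets([steady] * 10)
print(offsets.shape)  # (10, 2)
```

For the steady pan, the smoothed and raw trajectories coincide away from the ends of the sequence, so the interior corrections are zero. Jittery input would instead yield nonzero corrections that cancel the high-frequency shake.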
Simonyan & Zisserman (2014) showed that feeding precomputed optical flow as a separate stream alongside RGB dramatically improves action recognition. The spatial stream sees appearance (who/what), the temporal stream sees motion (doing what). Flow compresses temporal information into a single tensor that CNNs can process.
Given frames at t=0 and t=1, generate the frame at t=0.5. Compute forward flow (0→1) and backward flow (1→0), then warp both frames to the midpoint and blend. This is how slow-motion video is created from standard 30fps footage.
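The warp-and-blend step can be sketched with nearest-neighbor backward warping in NumPy. Production interpolators use bilinear sampling and handle occlusions. The helper names are illustrative, and the sketch approximates the flow at a midpoint pixel by the flow stored at the same coordinates in the source frame.

```python
import numpy as np

def warp_backward(img, flow, t):
    """Sample img at positions displaced by -t * flow (nearest neighbour)."""
    H, W = img.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.rint(xs - t * flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys - t * flow[..., 1]).astype(int), 0, H - 1)
    return img[src_y, src_x]

def interpolate_midpoint(frame0, frame1, flow01, flow10):
    """Blend both frames after warping each halfway along its own flow."""
    return (0.5 * warp_backward(frame0, flow01, 0.5)
            + 0.5 * warp_backward(frame1, flow10, 0.5))

# Toy check: frame1 is frame0 shifted right by 2 px, so the midpoint frame
# should be frame0 shifted right by 1 px (away from the image border).
rng = np.random.default_rng(0)
frame0 = rng.random((8, 8))
frame1 = np.roll(frame0, 2, axis=1)
flow01 = np.zeros((8, 8, 2))
flow01[..., 0] = 2.0       # forward flow, 0 -> 1
flow10 = -flow01           # backward flow, 1 -> 0
mid = interpolate_midpoint(frame0, frame1, flow01, flow10)
print(np.allclose(mid[:, 1:7], np.roll(frame0, 1, axis=1)[:, 1:7]))  # True
```

The comparison skips the border columns because `np.roll` wraps around while the warp clamps at the image edge.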
Optical flow helps self-driving cars detect independently moving objects. The ego-vehicle's motion creates a predictable flow pattern (expansion from the focus of expansion). Anything deviating from this pattern is an independent mover — a pedestrian, cyclist, or other car.
Remove an object from video by computing flow, propagating the inpainting mask across frames along flow trajectories, then filling the hole with temporally consistent content.
To illustrate the stabilization case: when a camera shakes randomly while viewing a scene, optical flow estimates the camera motion and stabilization subtracts it.
Optical flow is one way to understand motion, but it has clear limitations: it's 2D (ignores depth), pairwise (two frames only), and dense (one vector per pixel, even where it's ambiguous). Modern research addresses each of these.
| Limitation | Why | Solution |
|---|---|---|
| 2D only | Flow is projected motion — no depth | Scene flow: 3D displacement per point |
| Pairwise | Only connects frame t to t+1 | CoTracker / TAPIR: long-range point tracking |
| Brightness constancy | Fails on lighting changes, occlusion | Learned features replace raw pixels |
| Uniform density | Wastes compute on static background | Sparse tracking: track only interesting points |
| Method | Tracks | Duration | Dimension |
|---|---|---|---|
| Optical Flow | All pixels | 2 frames | 2D |
| Scene Flow | All points | 2 frames | 3D |
| Point Tracking (CoTracker) | Selected points | Full video | 2D |
| SLAM | Sparse features | Full video | 3D + camera pose |
| Video Transformers | Implicit (tokens) | Full video | Learned space |
| Concept | One-Line Summary |
|---|---|
| Flow field | Per-pixel (u, v) displacement, shape [H, W, 2] |
| Brightness constancy | Ix·u + Iy·v + It = 0 (one equation, two unknowns) |
| Aperture problem | Can only see motion perpendicular to edge |
| Lucas-Kanade | Local patches, least squares d = (AᵀA)⁻¹Aᵀb |
| Horn-Schunck | Global smoothness energy, iterative solver, α controls smoothness |
| FlowNet | First CNN for flow, encoder-decoder, trained on FlyingChairs |
| RAFT | All-pairs correlation + GRU iterations, state-of-the-art |
| Synthetic training | Chairs → Things3D → Sintel curriculum |
| Key metric | End-point error (EPE): average pixel distance to ground truth |
You now understand optical flow — from the brightness constancy constraint that started it all, to RAFT's iterative correlation lookups that define the state of the art. Every pixel tells a story of motion.