Computer Vision

Understand Optical Flow

The algorithm that teaches computers to see motion — from pixel displacements to video understanding.

Prerequisites: Basic calculus (derivatives) + Intuition for images as grids of numbers. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Optical Flow?

You're watching a video of a busy street. Cars move left, pedestrians walk right, a bird flies overhead. You see motion effortlessly. But to a computer, each frame is just a grid of numbers — pixel brightnesses. Frame 1 is one grid. Frame 2 is another. There is no built-in notion of "that pixel moved over there."

Without a way to connect pixels across frames, a computer sees a video as a slideshow of unrelated photographs. It can't tell what moved, how fast, or in which direction. It can't stabilize shaky footage, interpolate between frames, or understand actions.

Optical flow is the answer: for every pixel in frame 1, compute where it ended up in frame 2. The result is a dense displacement field — a map of motion that turns a pair of static images into a description of a moving world.

The core question: Given two consecutive frames, where did each pixel go? Optical flow answers this with a vector (u, v) at every pixel — a complete description of apparent motion in the image.
Two Frames, No Motion Information

A circle moves between frames. Without optical flow, the computer just sees two snapshots. Click Show Flow to reveal the displacement arrows connecting frame 1 to frame 2.

Why can't a computer understand motion from raw video frames alone?

Chapter 1: The Flow Field

At every pixel location (x, y) in frame 1, optical flow assigns a displacement vector (u, v). The value u tells how far the pixel moved horizontally, and v tells how far it moved vertically. If a pixel at (100, 50) has flow (3, −2), it moved to (103, 48) in frame 2.

The output of optical flow is a tensor with shape [H, W, 2] — the same height and width as the image, but with 2 channels instead of 3 (RGB). Channel 0 is horizontal displacement u, channel 1 is vertical displacement v. This is a dense flow field: one vector per pixel.
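As a small illustration of this layout, the sketch below builds an empty flow field with NumPy and applies the displacement from the example above. The image size and variable names are arbitrary choices made for the illustration.

```python
import numpy as np

H, W = 120, 160
# Dense flow field: channel 0 holds u (horizontal), channel 1 holds v (vertical).
flow = np.zeros((H, W, 2), dtype=np.float32)

# The pixel at (x=100, y=50) moves by (u, v) = (3, -2).
# Arrays are indexed [row, column], i.e. [y, x].
x, y = 100, 50
flow[y, x] = (3.0, -2.0)

u, v = flow[y, x]
new_x, new_y = x + float(u), y + float(v)
print(flow.shape)      # (120, 160, 2)
print(new_x, new_y)    # 103.0 48.0
```

Note the index order: the flow vector for image position (x, y) lives at `flow[y, x]`, because the first array axis runs over rows.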

Input: Frame 1 [H, W, 3], Frame 2 [H, W, 3]
Optical Flow Algorithm: compute displacement for every pixel
Output: flow field [H, W, 2], one (u, v) per pixel

Visualizing Flow: Color Coding

A flow field has two values per pixel, which is hard to show as a grayscale image. The standard visualization uses the HSV color wheel: the direction of motion maps to hue (red = right, cyan = left, green = down, magenta = up), and the magnitude maps to saturation. A stationary pixel is white or black.
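The mapping can be sketched for a single vector with Python's standard `colorsys` module. This is a minimal version using a plain HSV wheel with the value fixed at 1, so stationary pixels come out white. The function name and the `max_mag` normalization are assumptions for illustration, and published color wheels space their hues somewhat differently, so the exact colors for "up" and "down" vary between tools.

```python
import colorsys
import math

def flow_to_rgb(u, v, max_mag=10.0):
    """Map one flow vector to an RGB triple in [0, 1]."""
    angle = math.atan2(v, u)                         # radians; +v points down in image coords
    hue = (angle % (2 * math.pi)) / (2 * math.pi)    # 0 = right (red), 0.5 = left (cyan)
    sat = min(math.hypot(u, v) / max_mag, 1.0)       # speed -> saturation, clipped at max_mag
    return colorsys.hsv_to_rgb(hue, sat, 1.0)        # value fixed at 1, so zero flow is white

print(flow_to_rgb(10.0, 0.0))    # (1.0, 0.0, 0.0)  red: moving right
print(flow_to_rgb(-10.0, 0.0))   # (0.0, 1.0, 1.0)  cyan: moving left
print(flow_to_rgb(0.0, 0.0))     # (1.0, 1.0, 1.0)  white: stationary
```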

Flow Color Wheel

Move your mouse over the wheel. The arrow shows the flow vector, and the color shows how it would be displayed in a flow visualization. Direction = hue, speed = saturation.

Reading a flow map: When you see a colorful flow visualization, each pixel's color tells you a direction and speed. Uniform color = the whole image moves the same way (for example, a camera pan). Rainbow = complex scene motion. This encoding is the standard one in the optical flow literature.

Quantity | Shape | Meaning
Frame 1 | [H, W, 3] | RGB image at time t
Frame 2 | [H, W, 3] | RGB image at time t+1
Flow | [H, W, 2] | Per-pixel displacement (u, v)
u | [H, W] | Horizontal displacement (pixels)
v | [H, W] | Vertical displacement (pixels)

What is the shape of a dense optical flow field for a 480×640 image?

Chapter 2: Brightness Constancy

Every classical optical flow method starts with one assumption: a pixel's brightness doesn't change as it moves. If a pixel is bright in frame 1, it's still bright in frame 2 — it just moved to a new location. Formally:

I(x, y, t) = I(x + u, y + v, t + 1)

Here I(x, y, t) is the brightness at position (x, y) in frame t. The pixel moved by (u, v), so in the next frame it's at (x+u, y+v). This is the brightness constancy assumption.

Taylor Expansion: One Equation, Two Unknowns

We Taylor-expand the right side around (x, y, t):

I(x+u, y+v, t+1) ≈ I(x,y,t) + Ix·u + Iy·v + It

Where Ix = ∂I/∂x (horizontal gradient), Iy = ∂I/∂y (vertical gradient), and It = ∂I/∂t (temporal change). Substituting the brightness constancy equation, I(x,y,t) cancels on both sides, leaving:

Ix·u + Iy·v + It = 0

This is the optical flow constraint equation. One equation, two unknowns (u and v). We can't solve it at a single pixel — there are infinitely many (u, v) pairs that satisfy it. This fundamental limitation is called the aperture problem.

The aperture problem: Imagine looking at a moving edge through a tiny hole (aperture). You can see motion perpendicular to the edge, but not along it. A horizontal edge could be moving purely up, or diagonally — you can't tell. One constraint equation can only determine the component of flow perpendicular to the image gradient.
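To make this concrete, the sketch below computes the one component the constraint does determine, often called the normal flow, from the gradients at a single pixel. The function name and the small `eps` guard against division by zero are assumptions for illustration.

```python
import numpy as np

def normal_flow(Ix, Iy, It, eps=1e-8):
    """Flow component along the gradient direction, the only part one constraint fixes."""
    grad_sq = Ix**2 + Iy**2
    scale = -It / (grad_sq + eps)
    return scale * Ix, scale * Iy

# A horizontal edge: brightness changes only with y, so Ix = 0.
Ix, Iy = 0.0, 2.0
# Suppose the true motion is (u, v) = (5, 1); the constraint gives It = -(Ix*u + Iy*v).
It = -(Ix * 5.0 + Iy * 1.0)

un, vn = normal_flow(Ix, Iy, It)
# un is 0 and vn is close to 1: the sideways component u = 5 is invisible.
```

Any motion of the form (u, 1) produces the same It here, which is exactly the ambiguity the aperture problem describes.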
The Aperture Problem

A bar moves behind a circular aperture. You can see up-down motion, but can you tell if the bar is also moving sideways? Drag the True Direction slider to change the actual motion.

True Direction: 90°
Why this matters: Every optical flow algorithm must add extra constraints beyond brightness constancy to resolve the aperture problem. Lucas-Kanade assumes local smoothness. Horn-Schunck assumes global smoothness. Neural networks learn from data. The assumption you add defines the method.
Why can't we solve for flow (u, v) at a single pixel using brightness constancy alone?

Chapter 3: Lucas-Kanade

Lucas and Kanade (1981) had a simple idea: if flow is unknown at one pixel, use the neighbors. Assume that all pixels in a small patch (say 5×5) share the same flow (u, v). Now each pixel contributes one equation Ix·u + Iy·v + It = 0, and with 25 pixels we have 25 equations for 2 unknowns. That's an overdetermined system — solve with least squares.

The Math: A 2×2 System

Stack all the pixel equations into a matrix. For a patch of n pixels:

A = [[Ix₁, Iy₁], [Ix₂, Iy₂], ..., [Ixₙ, Iyₙ]]    b = −[It₁, It₂, ..., Itₙ]ᵀ

The least-squares solution is:

[u, v]ᵀ = (AᵀA)⁻¹ Aᵀb

AᵀA is a 2×2 matrix (called the structure tensor). The key: it must be invertible. If the patch is a flat region (no gradient), AᵀA is all zeros, so every (u, v) fits equally well and there is no unique solution. If it's a pure edge, AᵀA has one near-zero eigenvalue, so the flow along the edge is ambiguous. Only at corners (two large eigenvalues) is the system well-conditioned.

Input: image pair → compute Ix, Iy, It gradients
For each patch: build A [n, 2] and b [n, 1] from gradient values
Solve: d = (AᵀA)⁻¹ Aᵀb → flow (u, v) for this patch
↓ repeat for all patches

Lucas-Kanade is local. It only looks at a small window around each pixel. This means it's fast and parallelizable, but it can't resolve large displacements (motion bigger than the patch) and it produces sparse or unreliable flow in textureless regions. It works best on corners and textured areas.
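The per-patch solve can be sketched in a few lines of NumPy. This is a minimal version that assumes the gradients are already computed. The function name and the `min_eig` threshold used to reject flat or edge-only patches are illustrative choices, not values from the original method.

```python
import numpy as np

def lucas_kanade_patch(Ix, Iy, It, min_eig=1e-3):
    """Solve for one (u, v) shared by all pixels of a patch.

    Ix, Iy, It: arrays of the same shape holding the patch gradients.
    Returns None when the structure tensor is too poorly conditioned.
    """
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # [n, 2]
    b = -It.ravel()                                  # [n]
    ATA = A.T @ A                                    # [2, 2] structure tensor
    if np.linalg.eigvalsh(ATA).min() < min_eig:      # flat region or pure edge
        return None
    return np.linalg.solve(ATA, A.T @ b)             # (u, v)

rng = np.random.default_rng(0)
Ix = rng.normal(size=(5, 5))                         # synthetic "textured" patch
Iy = rng.normal(size=(5, 5))
true_flow = np.array([1.5, -0.5])
It = -(Ix * true_flow[0] + Iy * true_flow[1])        # exact constraint, no noise

flow = lucas_kanade_patch(Ix, Iy, It)                # recovers true_flow
flat = lucas_kanade_patch(np.zeros((5, 5)), np.zeros((5, 5)), np.zeros((5, 5)))  # None
```

Checking the smaller eigenvalue before solving is the same test that corner detectors use to decide where tracking is reliable.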
Lucas-Kanade: Local Patches

A pattern translates between frames. The algorithm picks patches (squares) and solves a 2×2 system for each. Green arrows = computed flow. Adjust patch size to see the tradeoff.

Patch Size: 7
True Flow Angle: 30°

Quantity | Shape | Meaning
A | [n, 2] | Gradient matrix for n pixels in patch
b | [n, 1] | Negative temporal gradient
AᵀA | [2, 2] | Structure tensor (must be invertible)
d | [2, 1] | Flow vector (u, v) for this patch

Why does Lucas-Kanade fail in flat (textureless) regions?

Chapter 4: Horn-Schunck

Lucas-Kanade solves flow locally, patch by patch. Horn and Schunck (1981) took the opposite approach: solve for the entire flow field at once, with a global smoothness constraint. The idea: neighboring pixels should have similar flow, because real objects move coherently.

The Energy Function

Horn-Schunck minimizes a single energy over all pixels:

E = ∑_{x,y} [ (Ix·u + Iy·v + It)² + α²(|∇u|² + |∇v|²) ]

The first term is the data term: brightness constancy should hold everywhere. The second term is the smoothness term: flow should vary slowly. The parameter α controls the balance — large α means smoother flow (fewer sharp transitions), small α means more faithful to the raw data (noisier).

The minimization leads to a system of equations solved iteratively with Gauss-Seidel or Jacobi iteration. At each iteration, every pixel's flow is updated using its neighbors' flow from the previous iteration. After enough iterations, the flow field converges.

Initialize: u = 0, v = 0 everywhere (or from LK)
Iterate: for each pixel,
    u ← ū − Ix(Ix·ū + Iy·v̄ + It) / (α² + Ix² + Iy²)
    v ← v̄ − Iy(Ix·ū + Iy·v̄ + It) / (α² + Ix² + Iy²)
↓ repeat 50–200 times
Output: smooth, dense flow field [H, W, 2]

Here ū and v̄ are the local averages of u and v from neighboring pixels. Each iteration nudges the flow toward both the data and its neighbors. α literally controls how much "peer pressure" each pixel feels from its neighbors.
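The update rule translates almost line for line into NumPy. The sketch below is a simplified Jacobi-style version: it assumes precomputed gradients and uses a plain 4-neighbour mean for the local average, where the original paper uses a weighted 8-neighbour stencil. Function names and default parameters are illustrative.

```python
import numpy as np

def local_average(f):
    """Mean of the 4-connected neighbours, with edge values replicated at the border."""
    p = np.pad(f, 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0

def horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=100):
    """Iteratively estimate a dense flow field (u, v) from image gradients."""
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    for _ in range(n_iters):
        u_bar, v_bar = local_average(u), local_average(v)
        # Constraint residual at the averaged flow, shared by both updates.
        r = (Ix * u_bar + Iy * v_bar + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_bar - Ix * r
        v = v_bar - Iy * r
    return u, v

# Synthetic check: gradients consistent with a uniform flow of (1.0, 0.5).
rng = np.random.default_rng(0)
Ix = rng.normal(size=(8, 8))
Iy = rng.normal(size=(8, 8))
It = -(Ix * 1.0 + Iy * 0.5)
u, v = horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=300)
```

Because a uniform field has zero smoothness cost and zero data cost here, the iteration should settle on u close to 1.0 and v close to 0.5 everywhere.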

α is a smoothness dial. α = 0: pure data, noisy, holes in textureless regions. α → ∞: constant flow everywhere (useless). In practice, α between 1 and 100 works well. Larger objects and slower motion tolerate more smoothing.
Horn-Schunck: Smoothness vs Data Fidelity

Two objects move in different directions. Adjust α to see how smoothness affects the computed flow. Low α = sharp boundaries but noisy. High α = smooth but blurs motion boundaries.

α (smoothness): 10
Iterations: 30

Method | Scope | Strengths | Weaknesses
Lucas-Kanade | Local (per patch) | Fast, robust at corners | Sparse, fails on large motion
Horn-Schunck | Global (entire image) | Dense, handles textureless areas | Slow, over-smooths boundaries

What happens to the Horn-Schunck flow field when α is very large?

Chapter 5: FlowNet — Learning to See Motion

Classical methods are elegant but slow and brittle. In 2015, Dosovitskiy et al. asked: can a convolutional neural network learn optical flow end-to-end? The answer was FlowNet — the first CNN to predict dense flow directly from an image pair.

FlowNetSimple: The Brute Force Approach

Stack both frames along the channel dimension: the input is [H, W, 6] (3 RGB channels from each frame). Feed this into an encoder-decoder (a contracting path of strided convolutions, then an expanding path of deconvolutions). The output is [H, W, 2], the predicted flow.

Input: stack frames, [H, W, 3] + [H, W, 3] → [H, W, 6]
Encoder: conv layers with stride 2, [H, W, 6] → [H/64, W/64, 1024]
Decoder: deconv + skip connections, [H/64, W/64, 1024] → [H, W, 2]
Output: dense flow [H, W, 2]

FlowNetCorr: Explicit Matching

Instead of stacking the images, process each through a separate encoder to get feature maps, then compute a correlation volume that measures how similar each feature in image 1 is to features in image 2 within a search window. This gives the network an explicit matching signal.

Training: Synthetic Data

Where do you get ground-truth flow to supervise training? You render it. The FlyingChairs dataset pastes random chair images onto backgrounds and moves them — the renderer knows exactly how every pixel moved. The loss is simply:

L = ∑_{x,y} ‖flow_pred(x, y) − flow_gt(x, y)‖₂

End-point error (EPE): the Euclidean distance between predicted and ground-truth flow, averaged over all pixels. FlowNet achieved ~2.7 pixel EPE on FlyingChairs while running at 10 FPS — orders of magnitude faster than classical methods.
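EPE is short enough to write out directly. The sketch below assumes both flow fields are NumPy arrays of shape [H, W, 2]; the function name is an illustrative choice.

```python
import numpy as np

def end_point_error(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth flow, both [H, W, 2]."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

flow_gt = np.zeros((4, 4, 2))
flow_pred = np.zeros((4, 4, 2))
flow_pred[..., 0] = 3.0   # every pixel is off by 3 px horizontally
flow_pred[..., 1] = 4.0   # and by 4 px vertically
print(end_point_error(flow_pred, flow_gt))   # 5.0
```

The norm is taken per pixel over the (u, v) axis before averaging, so a single badly wrong pixel contributes its full distance rather than being squared.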

FlowNet2 (2017): Stack multiple FlowNets in series. The first network makes a rough estimate. The second takes the warped image (frame 2 warped back toward frame 1 by the first estimate) together with the remaining brightness error, and refines the flow. In the table below, Sintel EPE falls from 7.42 for FlowNetS to 4.16 for FlowNet2. This iterative refinement idea reappears in many later methods.
Encoder-Decoder Architecture

The encoder shrinks spatial resolution while growing channel depth. The decoder expands back. Skip connections preserve fine detail. Hover over each stage to see tensor shapes.

Model | Year | EPE (Sintel) | FPS | Key Idea
FlowNetS | 2015 | 7.42 | 10 | Stacked input, single encoder-decoder
FlowNetC | 2015 | 6.85 | 10 | Correlation layer for explicit matching
FlowNet2 | 2017 | 4.16 | 8 | Stacked networks, iterative refinement

Why is FlowNet trained on synthetic (rendered) data instead of real video?

Chapter 6: RAFT — The Modern Standard

RAFT (Recurrent All-pairs Field Transforms, Teed & Deng 2020) is one of the most influential optical flow methods. It won the ECCV 2020 Best Paper Award, and many later methods build on its ideas. The core insight: don't predict flow in one shot; refine it iteratively by repeatedly looking up correlations.

Architecture: Three Stages

Stage 1: Feature Extraction. A shared CNN encoder processes both images independently, producing feature maps at 1/8 resolution:

f1, f2 = Encoder(I1), Encoder(I2)    shape: [H/8, W/8, 256]

Stage 2: Correlation Volume. Compute the dot product between every feature in f1 and every feature in f2. The result is a 4D tensor:

C(i, j, k, l) = f1(i, j) · f2(k, l)    shape: [H/8, W/8, H/8, W/8]

This is the all-pairs correlation volume. Entry C(i,j,k,l) measures how similar the feature at position (i,j) in image 1 is to the feature at (k,l) in image 2. Build it once, look it up many times.
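The all-pairs dot product is a single `einsum` call. The sketch below uses toy sizes and random features in place of a real encoder, and it normalizes features to unit length so that the best match in this toy setup is unambiguous; that normalization is an assumption of the example, not a description of RAFT itself.

```python
import numpy as np

def all_pairs_correlation(f1, f2):
    """f1, f2: [h, w, d] feature maps. Returns C with shape [h, w, h, w]."""
    return np.einsum("ijd,kld->ijkl", f1, f2)

rng = np.random.default_rng(0)
h, w, d = 6, 8, 16                                  # toy sizes, far smaller than H/8 x W/8 x 256
f1 = rng.normal(size=(h, w, d))
f1 /= np.linalg.norm(f1, axis=-1, keepdims=True)    # unit-length features (toy choice)
f2 = np.roll(f1, shift=2, axis=1)                   # "image 2" = "image 1" shifted right by 2 cells

C = all_pairs_correlation(f1, f2)
print(C.shape)                                      # (6, 8, 6, 8)

# Best match in image 2 for the feature at (i, j) = (3, 1).
k, l = np.unravel_index(C[3, 1].argmax(), (h, w))
# (k, l) is (3, 3): a displacement of 2 feature cells to the right.
```

The memory cost grows with the square of the number of feature cells, which is one reason the volume is built at 1/8 resolution rather than on full-size images.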

Stage 3: Iterative Update (GRU). Starting from zero flow, a Gated Recurrent Unit refines the estimate over 12 iterations. At each iteration:

Lookup: at the current flow estimate, index into the correlation volume → [H/8, W/8, R²]
Context: concatenate correlation features + current flow + context features
GRU Update: h_t = GRU(h_{t−1}, input) → hidden state [H/8, W/8, 128]
Flow Head: two conv layers → Δflow [H/8, W/8, 2]; add to the current flow
↻ repeat 12 times

Why this works: The correlation volume is a "lookup table of similarity." At each iteration, the GRU says "given where I currently think things moved, how well do the features match?" If the match is poor, it adjusts. After 12 iterations, even large displacements converge. The all-pairs design means RAFT handles large motions that defeat patch-based methods.
RAFT: Iterative Flow Refinement

Watch RAFT iteratively refine its flow estimate. Iteration 0 starts from zero flow. Each GRU step improves the flow by looking up correlations. Adjust the number of iterations or add noise to see how the algorithm converges.

Iterations: 12
Motion Complexity: 2
Iteration: 0 / 12

Training Details

RAFT supervises at every iteration, not just the last. The loss weights later iterations more heavily:

L = ∑_{i=1}^{N} γ^{N−i} ‖flow_i − flow_gt‖₁     γ = 0.8

This encourages each iteration to improve, not just the final output. L1 loss (not L2) because it's less sensitive to outliers at motion boundaries.
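A minimal sketch of this weighted loss follows. It assumes the per-iteration predictions are given as a list of [H, W, 2] arrays and averages the per-pixel L1 distance over the image, which is one reasonable reading of the norm in the formula; the function name is illustrative.

```python
import numpy as np

def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """flow_preds: list of N arrays [H, W, 2], one per refinement iteration."""
    n = len(flow_preds)
    total = 0.0
    for i, flow_i in enumerate(flow_preds, start=1):
        weight = gamma ** (n - i)                     # the last iteration gets weight 1
        l1 = np.abs(flow_i - flow_gt).sum(axis=-1)    # per-pixel L1 distance
        total += weight * l1.mean()
    return total

flow_gt = np.zeros((2, 2, 2))
# Three iterations whose error shrinks from 2 px to 1 px to 0 px per component.
preds = [np.full((2, 2, 2), 2.0), np.full((2, 2, 2), 1.0), np.zeros((2, 2, 2))]
loss = sequence_loss(preds, flow_gt)
# Weights are 0.64, 0.8, 1.0 and per-pixel L1 errors are 4, 2, 0,
# so loss is about 0.64 * 4 + 0.8 * 2 = 4.16.
```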

Component | Shape | Purpose
Feature encoder | [H/8, W/8, 256] | Extract appearance features from each image
Correlation volume | [H/8, W/8, H/8, W/8] | All-pairs similarity lookup table
GRU hidden state | [H/8, W/8, 128] | Recurrent memory across iterations
Flow estimate | [H/8, W/8, 2] | Current (u, v) prediction, upsampled to [H, W, 2]

RAFT's numbers: EPE 1.61 on Sintel Clean, 2.86 on Sintel Final. It runs at ~10 FPS on a single GPU. Many competitive methods since 2020 (GMA, FlowFormer, VideoFlow) build on RAFT's correlation volume + iterative refinement.

Chapter 7: Training on Synthetic Data

Ground-truth optical flow for real video is nearly impossible to obtain. You'd need to know exactly where every single pixel moved — sub-pixel precision for millions of pixels. Instead, the entire field relies on synthetic datasets where the renderer provides perfect ground truth for free.

The Datasets

Dataset | Year | Size | What It Is | Realism
FlyingChairs | 2015 | 22k pairs | Random chairs on backgrounds, 2D motion | Low
FlyingThings3D | 2016 | 26k pairs | Random 3D objects with 3D motion | Medium
MPI Sintel | 2012 | 1k pairs | Frames from animated film, realistic effects | High
KITTI | 2012/2015 | 400 pairs | Real driving scenes, LiDAR ground truth (sparse) | Real (sparse GT)

The Training Schedule

RAFT and its successors follow a specific curriculum. Start simple, increase complexity:

Stage 1: FlyingChairs. 100k iterations. Simple 2D motion. The network learns basic correspondence.
Stage 2: FlyingThings3D. 100k iterations. 3D motion, occlusion, varying depth.
Stage 3: Fine-tune. Sintel + KITTI + HD1K. Real-world effects: blur, fog, specularities.

The Generalization Gap

Models trained on synthetic data struggle with real-world challenges that don't appear in training.

Why not just use real data? Getting dense per-pixel ground truth for real video requires exotic hardware (multiple high-speed cameras + structured light) or slow, expensive manual annotation. LiDAR gives sparse 3D flow for ~5% of pixels. Synthetic data gives 100% dense, sub-pixel accurate flow for free. The tradeoff: perfect annotations, imperfect realism.
Synthetic vs Real: What Degrades

Toggle real-world effects to see how they degrade optical flow estimation. Each effect violates an assumption the model learned from clean synthetic data.

Why do optical flow models follow a training curriculum (Chairs → Things → Sintel)?

Chapter 8: Applications

Optical flow isn't just an academic exercise. It's a core primitive that powers dozens of real systems. Here's how flow enables each application:

Video Stabilization

Shaky handheld video has erratic global motion. Compute optical flow between consecutive frames → estimate the dominant (camera) motion → smooth the camera trajectory over time → shift each frame by the difference between the smoothed and the raw trajectory. The result: steady footage as if shot on a gimbal. Many software stabilizers follow a pipeline of this kind.
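A translation-only version of this idea fits in a short sketch. It assumes the per-frame flow fields are already available, takes the median flow as the camera motion, and smooths the accumulated trajectory with a moving average. Real stabilizers fit richer motion models, and the function names and the `window` parameter here are illustrative assumptions.

```python
import numpy as np

def dominant_motion(flow):
    """Median (u, v) over all pixels: a crude, translation-only camera-motion estimate."""
    return np.median(flow.reshape(-1, 2), axis=0)

def stabilizing_offsets(flows, window=5):
    """flows: list of [H, W, 2] fields between consecutive frames (window must be odd).

    Returns a [T, 2] array of (dx, dy) shifts that move each frame from the
    raw camera trajectory onto a smoothed one.
    """
    steps = np.array([dominant_motion(f) for f in flows])    # [T, 2] per-frame motion
    path = np.cumsum(steps, axis=0)                          # raw camera trajectory
    half = window // 2
    padded = np.pad(path, ((half, half), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    smooth = np.stack(
        [np.convolve(padded[:, c], kernel, mode="valid") for c in range(2)],
        axis=1,
    )                                                        # moving-average trajectory
    return smooth - path

# A steady pan of 2 px per frame needs no correction away from the sequence ends.
pan = np.zeros((10, 10, 2))
pan[..., 0] = 2.0
offsets = stabilizing_offsets([pan] * 9, window=5)
```

Using the median rather than the mean keeps a large moving object from being mistaken for camera motion, as long as the background covers most of the frame.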

Action Recognition (Two-Stream Networks)

Simonyan & Zisserman (2014) showed that feeding precomputed optical flow as a separate stream alongside RGB dramatically improves action recognition. The spatial stream sees appearance (who/what), the temporal stream sees motion (doing what). Flow compresses temporal information into a single tensor that CNNs can process.

Spatial Stream: single RGB frame [H, W, 3] → CNN → "person, ball, court"
+
Temporal Stream: stacked flow [H, W, 2L] (L frames) → CNN → "kicking motion"
Fusion: late fusion → "person kicking ball"
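Building the temporal stream's input amounts to a channel-wise concatenation. The sketch below uses random arrays as stand-ins for precomputed flow fields; the sizes H, W, and L are arbitrary values chosen for illustration.

```python
import numpy as np

H, W, L = 224, 224, 10
rng = np.random.default_rng(0)
# Stand-ins for precomputed flow fields between L consecutive frame pairs.
flows = [rng.normal(size=(H, W, 2)).astype(np.float32) for _ in range(L)]

# Concatenate along the channel axis: (u1, v1, u2, v2, ..., uL, vL).
temporal_input = np.concatenate(flows, axis=-1)
print(temporal_input.shape)   # (224, 224, 20)
```

Channels 2k and 2k+1 hold the u and v components of the k-th flow field, so an ordinary 2D CNN sees L time steps of motion as extra input channels.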

Video Frame Interpolation

Given frames at t=0 and t=1, generate the frame at t=0.5. Compute forward flow (0→1) and backward flow (1→0), then warp both frames to the midpoint and blend. This is how slow-motion video is created from standard 30fps footage.
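The warp-and-blend step can be sketched as follows. This simplified version assumes linear motion, approximates the flow at each midpoint pixel by the flow stored at the same pixel in frames 0 and 1, and uses nearest-neighbour sampling with border clamping instead of bilinear interpolation. Function names are illustrative.

```python
import numpy as np

def backward_warp(img, flow):
    """Sample img at (x + u, y + v) with nearest-neighbour lookup, clamped at the border."""
    H, W = img.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, H - 1)
    return img[src_y, src_x]

def interpolate_midpoint(frame0, frame1, flow_01, flow_10):
    """Approximate the frame at t = 0.5 from two frames and their forward/backward flow."""
    from_0 = backward_warp(frame0, -0.5 * flow_01)   # step half the forward flow back into frame 0
    from_1 = backward_warp(frame1, -0.5 * flow_10)   # step half the backward flow back into frame 1
    return 0.5 * (from_0 + from_1)

# Toy check: the whole image shifts right by 2 px between frames.
rng = np.random.default_rng(0)
frame0 = rng.random((8, 12))
frame1 = np.roll(frame0, 2, axis=1)
flow_01 = np.zeros((8, 12, 2)); flow_01[..., 0] = 2.0
flow_10 = np.zeros((8, 12, 2)); flow_10[..., 0] = -2.0

mid = interpolate_midpoint(frame0, frame1, flow_01, flow_10)
expected = np.roll(frame0, 1, axis=1)   # content shifted right by 1 px
```

Away from the image border, `mid` matches `expected`. Near occlusions and motion boundaries the two warped frames disagree, which is where practical interpolators add occlusion reasoning on top of this blend.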

Autonomous Driving

Optical flow helps self-driving cars detect independently moving objects. The ego-vehicle's motion creates a predictable flow pattern (expansion from the focus of expansion). Anything deviating from this pattern is an independent mover — a pedestrian, cyclist, or other car.

Video Editing

Remove an object from video by computing flow, propagating the inpainting mask across frames along flow trajectories, then filling the hole with temporally consistent content.

The common thread: Optical flow converts temporal information (change over time) into a spatial representation (a 2D map). This lets spatial tools (CNNs, segmentation, warping) process motion without needing 3D or temporal architectures.
Application: Video Stabilization

A camera shakes randomly while viewing a scene. Optical flow estimates the camera motion, and stabilization subtracts it. Toggle stabilization to see the difference.

In the two-stream architecture for action recognition, what does the temporal stream take as input?

Chapter 9: Connections & Beyond

Optical flow is one way to understand motion, but it has clear limitations: it's 2D (ignores depth), pairwise (two frames only), and dense (one vector per pixel, even where it's ambiguous). Modern research addresses each of these.

What Optical Flow Cannot Do

Limitation | Why | Solution
2D only | Flow is projected motion, with no depth | Scene flow: 3D displacement per point
Pairwise | Only connects frame t to t+1 | CoTracker / TAPIR: long-range point tracking
Brightness constancy | Fails on lighting changes, occlusion | Learned features replace raw pixels
Uniform density | Wastes compute on static background | Sparse tracking: track only interesting points

The Broader Landscape

Method | Tracks | Duration | Dimension
Optical Flow | All pixels | 2 frames | 2D
Scene Flow | All points | 2 frames | 3D
Point Tracking (CoTracker) | Selected points | Full video | 2D
SLAM | Sparse features | Full video | 3D + camera pose
Video Transformers | Implicit (tokens) | Full video | Learned space

Related lessons: Optical flow connects to many areas. For 3D reconstruction, see SLAM lessons. For video understanding at scale, video transformers learn temporal relationships without explicit flow. For robotics, VLA architectures use flow-like representations for manipulation.

Cheat Sheet

Concept | One-Line Summary
Flow field | Per-pixel (u, v) displacement, shape [H, W, 2]
Brightness constancy | Ix·u + Iy·v + It = 0: one equation, two unknowns
Aperture problem | Can only see motion perpendicular to edge
Lucas-Kanade | Local patches, least squares (AᵀA)⁻¹Aᵀb
Horn-Schunck | Global smoothness energy, iterative solver, α controls smoothness
FlowNet | First CNN for flow, encoder-decoder, trained on FlyingChairs
RAFT | All-pairs correlation + GRU iterations, state-of-the-art
Synthetic training | Chairs → Things3D → Sintel curriculum
Key metric | End-point error (EPE): average pixel distance to ground truth

"Vision is not just about seeing — it's about seeing change."
— James Gibson, ecological psychologist

You now understand optical flow — from the brightness constancy constraint that started it all, to RAFT's iterative correlation lookups that define the state of the art. Every pixel tells a story of motion.

What is the key advantage of point tracking methods like CoTracker over optical flow?