Optical Flow — From Two Frames to Motion Understanding

Chapter 0: Why Optical Flow?

You're watching a video of a busy street. Cars move left, pedestrians walk right, a bird flies overhead. You see motion effortlessly. But to a computer, each frame is just a grid of numbers — pixel brightnesses. Frame 1 is one grid. Frame 2 is another. There is no built-in notion of "that pixel moved over there."

Without a way to connect pixels across frames, a computer sees a video as a slideshow of unrelated photographs. It can't tell what moved, how fast, or in which direction. It can't stabilize shaky footage, interpolate between frames, or understand actions.

Optical flow is the answer: for every pixel in frame 1, compute where it ended up in frame 2. The result is a dense displacement field — a map of motion that turns a pair of static images into a description of a moving world.

The core question: Given two consecutive frames, where did each pixel go? Optical flow answers this with a vector (u, v) at every pixel — a complete description of apparent motion in the image.

Two Frames, No Motion Information

A circle moves between frames. Without optical flow, the computer just sees two snapshots. Click Show Flow to reveal the displacement arrows connecting frame 1 to frame 2.

Why can't a computer understand motion from raw video frames alone?

Video files are too large to process
Each frame is just a grid of numbers with no correspondence to the next frame
Cameras can't capture fast motion

Chapter 1: The Flow Field

At every pixel location (x, y) in frame 1, optical flow assigns a displacement vector (u, v). The value u tells how far the pixel moved horizontally, and v tells how far it moved vertically. If a pixel at (100, 50) has flow (3, −2), it moved to (103, 48) in frame 2.

The output of optical flow is a tensor with shape [H, W, 2] — the same height and width as the image, but with 2 channels instead of 3 (RGB). Channel 0 is horizontal displacement u, channel 1 is vertical displacement v. This is a dense flow field: one vector per pixel.

Input

Frame 1: [H, W, 3] Frame 2: [H, W, 3]

↓

Optical Flow Algorithm

Compute displacement for every pixel

↓

Output

Flow field: [H, W, 2] — (u, v) per pixel
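As a minimal sketch of the layout described above (assuming NumPy, and a hypothetical helper named `destination` that is not part of any library), the flow field is just an array with two channels. Note that arrays are indexed [row, column], so the pixel at (x, y) is stored at `flow[y, x]`.

```python
import numpy as np

H, W = 120, 160
flow = np.zeros((H, W, 2), dtype=np.float32)  # channel 0 = u, channel 1 = v

# Arrays are indexed [row, column] = [y, x], so the pixel at (x=100, y=50)
# lives at flow[50, 100].
flow[50, 100] = (3.0, -2.0)

def destination(flow, x, y):
    """Return where pixel (x, y) of frame 1 lands in frame 2."""
    u, v = flow[y, x]
    return float(x + u), float(y + v)

print(flow.shape)                  # (120, 160, 2)
print(destination(flow, 100, 50))  # (103.0, 48.0)
```

This reproduces the worked example from the text: flow (3, −2) at (100, 50) points to (103, 48).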

Visualizing Flow: Color Coding

A flow field has two values per pixel, which is hard to show as a grayscale image. The standard visualization uses the HSV color wheel: the direction of motion maps to hue (red = right, cyan = left, green = down, magenta = up), and the magnitude maps to saturation. A stationary pixel has zero saturation and appears white; tools that map magnitude to brightness instead render it black.
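The mapping can be sketched for a single vector with Python's standard `colorsys` module. The function name `flow_to_rgb` and the `max_magnitude` normalization are assumptions for illustration; the exact hue assigned to each direction varies between tools, so only the red/cyan/white cases are shown.

```python
import colorsys
import math

def flow_to_rgb(u, v, max_magnitude=10.0):
    """Map one flow vector to an RGB triple in [0, 1].

    Direction sets the hue, speed sets the saturation, value stays at 1.
    """
    angle = math.atan2(v, u)                   # direction in radians
    hue = (angle / (2.0 * math.pi)) % 1.0      # wrap into [0, 1)
    saturation = min(math.hypot(u, v) / max_magnitude, 1.0)
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)

print(flow_to_rgb(0.0, 0.0))     # stationary: white
print(flow_to_rgb(10.0, 0.0))    # moving right at full speed: red
print(flow_to_rgb(-10.0, 0.0))   # moving left at full speed: cyan
```

Clamping the saturation at `max_magnitude` is a design choice: without it, one very fast pixel would wash out the colors of everything else.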

Flow Color Wheel

Move your mouse over the wheel. The arrow shows the flow vector, and the color shows how it would be displayed in a flow visualization. Direction = hue, speed = saturation.

Reading a flow map: When you see a colorful flow visualization, each pixel's color tells you a direction and speed. Uniform color = rigid motion (camera pan). Rainbow = complex scene motion. This encoding is universal in the optical flow literature.

Quantity	Shape	Meaning
Frame 1	[H, W, 3]	RGB image at time t
Frame 2	[H, W, 3]	RGB image at time t+1
Flow	[H, W, 2]	Per-pixel displacement (u, v)
u	[H, W]	Horizontal displacement (pixels)
v	[H, W]	Vertical displacement (pixels)

What is the shape of a dense optical flow field for a 480×640 image?

[480, 640, 2]: two channels for (u, v)
[480, 640, 3]: three channels for RGB
[480, 640]: one scalar per pixel

Chapter 2: Brightness Constancy

Every classical optical flow method starts with one assumption: a pixel's brightness doesn't change as it moves. If a pixel is bright in frame 1, it's still bright in frame 2 — it just moved to a new location. Formally:

I(x, y, t) = I(x + u, y + v, t + 1)

Here I(x, y, t) is the brightness at position (x, y) in frame t. The pixel moved by (u, v), so in the next frame it's at (x+u, y+v). This is the brightness constancy assumption.

Taylor Expansion: One Equation, Two Unknowns

We Taylor-expand the right side around (x, y, t):

I(x+u, y+v, t+1) ≈ I(x,y,t) + I_x·u + I_y·v + I_t

Where I_x = ∂I/∂x (horizontal gradient), I_y = ∂I/∂y (vertical gradient), and I_t = ∂I/∂t (temporal change). Substituting the brightness constancy equation, I(x,y,t) cancels on both sides, leaving:

I_x·u + I_y·v + I_t = 0

This is the optical flow constraint equation. One equation, two unknowns (u and v). We can't solve it at a single pixel — there are infinitely many (u, v) pairs that satisfy it. This fundamental limitation is called the aperture problem.

The aperture problem: Imagine looking at a moving edge through a tiny hole (aperture). You can see motion perpendicular to the edge, but not along it. A horizontal edge could be moving purely up, or diagonally — you can't tell. One constraint equation can only determine the component of flow perpendicular to the image gradient.
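A small numeric check makes the ambiguity concrete. The sketch below (assuming NumPy; `normal_flow` is a hypothetical helper name) computes the one component the constraint does pin down, then shows that adding any amount of motion along the edge still satisfies I_x·u + I_y·v + I_t = 0.

```python
import numpy as np

def normal_flow(ix, iy, it):
    """Flow component along the gradient, the only part one pixel determines."""
    grad = np.array([ix, iy], dtype=float)
    return -it * grad / np.dot(grad, grad)

ix, iy, it = 0.0, 2.0, -1.0       # horizontal edge: the gradient points along y
n = normal_flow(ix, iy, it)       # perpendicular-to-edge motion
tangent = np.array([-iy, ix])     # direction along the edge

for lam in (-1.0, 0.0, 2.5):
    u, v = n + lam * tangent
    residual = ix * u + iy * v + it
    print((float(u), float(v)), residual)   # residual is 0 for every lam
```

Every value of `lam` gives a different (u, v), yet all of them satisfy the constraint, which is exactly why a second source of information is needed.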

The Aperture Problem

A bar moves behind a circular aperture. You can see up-down motion, but can you tell if the bar is also moving sideways? Drag the True Direction slider to change the actual motion.

Why this matters: Every optical flow algorithm must add extra constraints beyond brightness constancy to resolve the aperture problem. Lucas-Kanade assumes the flow is constant within a small patch. Horn-Schunck assumes the flow varies smoothly across the whole image. Neural networks learn their priors from data. The assumption you add defines the method.

Why can't we solve for flow (u, v) at a single pixel using brightness constancy alone?

One equation with two unknowns, so infinitely many solutions
The brightness changes too fast
Pixels are too small to track

Chapter 3: Lucas-Kanade

Lucas and Kanade (1981) had a simple idea: if flow is unknown at one pixel, use the neighbors. Assume that all pixels in a small patch (say 5×5) share the same flow (u, v). Now each pixel contributes one equation I_x·u + I_y·v + I_t = 0, and with 25 pixels we have 25 equations for 2 unknowns. That's an overdetermined system — solve with least squares.

The Math: A 2×2 System

Stack all the pixel equations into a matrix. For a patch of n pixels:

A = [[I_x1, I_y1], [I_x2, I_y2], ..., [I_xn, I_yn]] b = −[I_t1, I_t2, ..., I_tn]^T

The least-squares solution is:

[u, v]^T = (A^TA)⁻¹ A^Tb

A^TA is a 2×2 matrix (called the structure tensor). The key: it must be invertible. If the patch is a flat region (no gradient), A^TA is all zeros — no solution. If it's a pure edge, A^TA has one near-zero eigenvalue — ambiguous along the edge. Only at corners (two strong eigenvalues) is the system well-conditioned.
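The flat / edge / corner distinction can be checked numerically. This is a minimal sketch assuming NumPy and three hand-made 7×7 patches; `structure_tensor_eigs` is a hypothetical helper, and gradients come from `np.gradient` (central differences), which is one choice among several.

```python
import numpy as np

def structure_tensor_eigs(patch):
    """Eigenvalues (ascending) of A^T A built from a patch's spatial gradients."""
    iy, ix = np.gradient(patch.astype(float))   # np.gradient returns d/drow, d/dcol
    A = np.stack([ix.ravel(), iy.ravel()], axis=1)
    return np.linalg.eigvalsh(A.T @ A)

yy, xx = np.mgrid[0:7, 0:7]
flat = np.ones((7, 7))                           # no texture at all
edge = (xx >= 3).astype(float)                   # vertical step edge
corner = ((xx >= 3) & (yy >= 3)).astype(float)   # bright lower-right quadrant

for name, patch in [("flat", flat), ("edge", edge), ("corner", corner)]:
    print(name, structure_tensor_eigs(patch))
```

The flat patch yields two zero eigenvalues, the edge yields one zero and one large eigenvalue, and only the corner yields two clearly non-zero eigenvalues, so only there is the 2×2 system safely invertible.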

Input

Image pair → compute I_x, I_y, I_t gradients

↓

For each patch

Build A [n, 2] and b [n, 1] from gradient values

↓

Solve

d = (A^TA)⁻¹ A^Tb → flow (u, v) for this patch

↓ repeat for all patches
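The pipeline above can be sketched for a single patch as follows. This is an illustrative implementation, not the original authors' code: it assumes NumPy, a synthetic texture shifted by a known sub-pixel amount, gradients taken on the average of the two frames, and `np.linalg.lstsq` in place of the explicit inverse (the two agree when A^TA is invertible). The names `lucas_kanade_patch` and `texture` are hypothetical.

```python
import numpy as np

def lucas_kanade_patch(frame1, frame2, cx, cy, half=3):
    """Estimate one (u, v) for the (2*half+1)^2 patch centred at (cx, cy)."""
    avg = 0.5 * (frame1 + frame2)
    iy, ix = np.gradient(avg)       # spatial gradients (rows = y, cols = x)
    it = frame2 - frame1            # temporal gradient
    win = (slice(cy - half, cy + half + 1), slice(cx - half, cx + half + 1))
    A = np.stack([ix[win].ravel(), iy[win].ravel()], axis=1)   # [n, 2]
    b = -it[win].ravel()                                       # [n]
    d, *_ = np.linalg.lstsq(A, b, rcond=None)                  # least squares
    return d

def texture(x, y):
    """Smooth synthetic pattern with gradients in several directions."""
    return np.sin(0.3 * x) + np.cos(0.25 * y) + np.sin(0.2 * (x + y))

yy, xx = np.mgrid[0:40, 0:40].astype(float)
true_u, true_v = 0.4, -0.3
frame1 = texture(xx, yy)
frame2 = texture(xx - true_u, yy - true_v)   # same content, shifted by (u, v)

estimate = lucas_kanade_patch(frame1, frame2, cx=20, cy=20, half=5)
print(estimate)   # should land close to (0.4, -0.3)
```

The shift is deliberately smaller than one pixel: the linearization behind the constraint equation only holds for small motion, which is the large-displacement limitation discussed next.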

Lucas-Kanade is local. It only looks at a small window around each pixel. This means it's fast and parallelizable, but it can't resolve large displacements (motion bigger than the patch) and it produces sparse or unreliable flow in textureless regions. It works best on corners and textured areas.

Lucas-Kanade: Local Patches

A pattern translates between frames. The algorithm picks patches (squares) and solves a 2×2 system for each. Green arrows = computed flow. Adjust patch size to see the tradeoff.

Quantity	Shape	Meaning
A	[n, 2]	Gradient matrix for n pixels in patch
b	[n, 1]	Negative temporal gradient
A^TA	[2, 2]	Structure tensor (must be invertible)
d	[2, 1]	Flow vector (u, v) for this patch

Lucas-Kanade fails in flat (textureless) regions because:

The brightness changes too fast
The gradients I_x and I_y are near zero, making A^TA singular
The patch is too large

Chapter 4: Horn-Schunck

Lucas-Kanade solves flow locally, patch by patch. Horn and Schunck (1981) took the opposite approach: solve for the entire flow field at once, with a global smoothness constraint. The idea: neighboring pixels should have similar flow, because real objects move coherently.

The Energy Function

Horn-Schunck minimizes a single energy over all pixels:

E = ∑_x,y [ (I_x·u + I_y·v + I_t)² + α²(|∇u|² + |∇v|²) ]

The first term is the data term: brightness constancy should hold everywhere. The second term is the smoothness term: flow should vary slowly. The parameter α controls the balance — large α means smoother flow (fewer sharp transitions), small α means more faithful to the raw data (noisier).

The minimization leads to a system of equations solved iteratively with Gauss-Seidel or Jacobi iteration. At each iteration, every pixel's flow is updated using its neighbors' flow from the previous iteration. After enough iterations, the flow field converges.

Initialize

u = 0, v = 0 everywhere (or from LK)

↓

Iterate

For each pixel: u ← ū − I_x(I_x·ū + I_y·v̄ + I_t) / (α² + I_x² + I_y²) and v ← v̄ − I_y(I_x·ū + I_y·v̄ + I_t) / (α² + I_x² + I_y²)

↓ repeat 50–200 times

Output

Smooth, dense flow field [H, W, 2]

Here ū and v̄ are the local averages of u and v from neighboring pixels. Each iteration nudges the flow toward both the data and its neighbors. α literally controls how much "peer pressure" each pixel feels from its neighbors.
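The update rule can be sketched in a few lines of NumPy. Several choices here are simplifying assumptions rather than the original formulation: the local average uses the four direct neighbours with wrap-around borders (via `np.roll`), gradients come from `np.gradient` on the mean of the two frames, and the test image is a synthetic texture. The names `neighbour_mean`, `horn_schunck`, and `texture` are hypothetical.

```python
import numpy as np

def neighbour_mean(f):
    """Average of the 4-connected neighbours (wrap-around borders for brevity)."""
    return 0.25 * (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                   np.roll(f, 1, 1) + np.roll(f, -1, 1))

def horn_schunck(frame1, frame2, alpha=0.1, n_iters=200):
    """Jacobi-style Horn-Schunck iterations; returns flow of shape [H, W, 2]."""
    iy, ix = np.gradient(0.5 * (frame1 + frame2))
    it = frame2 - frame1
    u = np.zeros_like(frame1)
    v = np.zeros_like(frame1)
    denom = alpha ** 2 + ix ** 2 + iy ** 2
    for _ in range(n_iters):
        u_bar, v_bar = neighbour_mean(u), neighbour_mean(v)
        residual = ix * u_bar + iy * v_bar + it   # constraint error at the average
        u = u_bar - ix * residual / denom
        v = v_bar - iy * residual / denom
    return np.stack([u, v], axis=-1)

def texture(x, y):
    return np.sin(0.3 * x) + np.cos(0.25 * y) + np.sin(0.2 * (x + y))

yy, xx = np.mgrid[0:48, 0:48].astype(float)
frame1 = texture(xx, yy)
frame2 = texture(xx - 0.4, yy + 0.3)              # true flow is (0.4, -0.3)

flow = horn_schunck(frame1, frame2)
iy, ix = np.gradient(0.5 * (frame1 + frame2))
it = frame2 - frame1
before = np.mean(it ** 2)                          # data error of zero flow
after = np.mean((ix * flow[..., 0] + iy * flow[..., 1] + it) ** 2)
print(before, after)                               # the data error should shrink
```

Because both `u` and `v` are updated from the previous iteration's averages, this is the Jacobi variant; a Gauss-Seidel version would reuse freshly updated neighbours within the same sweep.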

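α is a smoothness dial. α = 0: pure data, noisy, holes in textureless regions. α → ∞: constant flow everywhere (useless). The useful range depends on how intensities are scaled, because α² is weighed against squared gradients; for 0 to 255 images, α between 1 and 100 is typical. Larger objects and slower motion tolerate more smoothing.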

Horn-Schunck: Smoothness vs Data Fidelity

Two objects move in different directions. Adjust α to see how smoothness affects the computed flow. Low α = sharp boundaries but noisy. High α = smooth but blurs motion boundaries.

Method	Scope	Strengths	Weaknesses
Lucas-Kanade	Local (per patch)	Fast, robust at corners	Sparse, fails on large motion
Horn-Schunck	Global (entire image)	Dense, handles textureless areas	Slow, over-smooths boundaries

What happens to the Horn-Schunck flow field when α is very large?

The flow becomes noisier
The flow becomes overly smooth, blurring motion boundaries between objects
The algorithm runs faster

Chapter 5: FlowNet — Learning to See Motion

Classical methods are elegant but slow and brittle. In 2015, Dosovitskiy et al. asked: can a convolutional neural network learn optical flow end-to-end? The answer was FlowNet — the first CNN to predict dense flow directly from an image pair.

FlowNetSimple: The Brute Force Approach

Stack both frames along the channel dimension: the input is [H, W, 6] (3 RGB channels from each frame). Feed this into an encoder-decoder (a contracting path of strided convolutions, then an expanding path of deconvolutions). The output is [H, W, 2], the predicted flow.

Input

Stack frames: [H, W, 3] + [H, W, 3] → [H, W, 6]

↓

Encoder

Conv layers with stride 2: [H, W, 6] → [H/64, W/64, 1024]

↓

Decoder

Deconv + skip connections: [H/64, W/64, 1024] → [H/4, W/4, 2], then upsampled to [H, W, 2]

↓

Output

Dense flow: [H, W, 2]

FlowNetCorr: Explicit Matching

Instead of stacking the images, process each through a separate encoder to get feature maps, then compute a correlation volume that measures how similar each feature in image 1 is to features in image 2 within a search window. This gives the network an explicit matching signal.
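The idea of correlating within a search window can be sketched in NumPy. This is not FlowNetCorr's actual layer (which works on learned features and uses strides); it is a plain per-pixel dot product over a small displacement range, with random vectors standing in for features. `local_correlation` and `max_disp` are hypothetical names.

```python
import numpy as np

def local_correlation(f1, f2, max_disp=2):
    """Correlate f1[y, x] with f2[y + dy, x + dx] for |dx|, |dy| <= max_disp.

    f1, f2: [H, W, C] feature maps. Returns [H, W, (2*max_disp + 1)**2].
    """
    H, W, _ = f1.shape
    d = max_disp
    padded = np.pad(f2, ((d, d), (d, d), (0, 0)))    # zero padding at borders
    out = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = padded[d + dy:d + dy + H, d + dx:d + dx + W]
            out.append((f1 * shifted).sum(axis=-1))  # per-pixel dot product
    return np.stack(out, axis=-1)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((12, 12, 64))
f2 = np.roll(f1, shift=(1, 2), axis=(0, 1))          # content moves by dy=1, dx=2

corr = local_correlation(f1, f2, max_disp=2)
best = int(np.argmax(corr[5, 5]))
dy, dx = divmod(best, 5)                             # 5 = 2*max_disp + 1
print(corr.shape, (dy - 2, dx - 2))                  # strongest match at (1, 2)
```

The channel with the highest score directly encodes the displacement, which is the explicit matching signal the stacked-input network has to discover on its own.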

Training: Synthetic Data

Where do you get ground-truth flow to supervise training? You render it. The FlyingChairs dataset pastes random chair images onto backgrounds and moves them — the renderer knows exactly how every pixel moved. The loss is simply:

L = ∑_x,y ||flow_pred(x,y) − flow_gt(x,y)||₂

End-point error (EPE): the Euclidean distance between predicted and ground-truth flow, averaged over all pixels. FlowNet achieved ~2.7 pixel EPE on FlyingChairs while running at 10 FPS — orders of magnitude faster than classical methods.
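As a quick illustration of the metric (a sketch assuming NumPy and flow arrays of shape [H, W, 2]; `end_point_error` is a hypothetical helper name):

```python
import numpy as np

def end_point_error(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth flow."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

gt = np.zeros((4, 4, 2))
pred = np.zeros((4, 4, 2))
pred[0, 0] = (3.0, 4.0)              # one pixel is off by a 3-4-5 triangle
print(end_point_error(pred, gt))     # 5 / 16 = 0.3125
```

One pixel with a 5-pixel error among 16 pixels gives an average of 0.3125, which shows how a few large errors at motion boundaries can dominate an otherwise accurate flow field.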

FlowNet2 (2017): Stack multiple FlowNets in series. The first network makes a rough estimate. The second takes frame 2 warped back toward frame 1 by that estimate, along with the remaining brightness error, and refines the flow. In the comparison table in this chapter, Sintel EPE drops from 7.42 (FlowNetS) to 4.16 (FlowNet2). This iterative refinement idea reappears in most modern methods.
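Warping is the operation that makes this kind of refinement possible, so here is a minimal sketch of it. It assumes NumPy, a grayscale image, bilinear sampling, and clamping at the borders; `backward_warp` is a hypothetical name, and real implementations usually run on the GPU with a built-in sampler.

```python
import numpy as np

def backward_warp(frame2, flow):
    """Sample frame2 at (x + u, y + v) so the result aligns with frame 1.

    frame2: [H, W] grayscale, flow: [H, W, 2]. Bilinear, clamped at the borders.
    """
    H, W = frame2.shape
    yy, xx = np.mgrid[0:H, 0:W].astype(float)
    x = np.clip(xx + flow[..., 0], 0, W - 1)
    y = np.clip(yy + flow[..., 1], 0, H - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0                      # fractional offsets in [0, 1]
    top = (1 - wx) * frame2[y0, x0] + wx * frame2[y0, x0 + 1]
    bottom = (1 - wx) * frame2[y0 + 1, x0] + wx * frame2[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bottom

rng = np.random.default_rng(0)
frame1 = rng.random((8, 8))
frame2 = np.roll(frame1, shift=(1, 2), axis=(0, 1))   # content moves by (u=2, v=1)
flow = np.zeros((8, 8, 2))
flow[..., 0], flow[..., 1] = 2.0, 1.0

warped = backward_warp(frame2, flow)
print(np.allclose(warped[:-1, :-2], frame1[:-1, :-2]))   # True away from the border
```

If the flow is correct, the warped frame 2 matches frame 1; whatever mismatch remains is the residual that a refinement stage has to explain.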

Encoder-Decoder Architecture

The encoder shrinks spatial resolution while growing channel depth. The decoder expands back. Skip connections preserve fine detail. Hover over each stage to see tensor shapes.

Model	Year	EPE (Sintel)	FPS	Key Idea
FlowNetS	2015	7.42	10	Stacked input, single encoder-decoder
FlowNetC	2015	6.85	10	Correlation layer for explicit matching
FlowNet2	2017	4.16	8	Stacked networks, iterative refinement

Why is FlowNet trained on synthetic (rendered) data instead of real video?

Real video is too blurry
Synthetic data is more photorealistic
The renderer provides free, perfect ground-truth flow for every pixel

Chapter 6: RAFT — The Modern Standard

RAFT (Recurrent All-pairs Field Transforms, Teed & Deng 2020) is the reference point for modern optical flow. It won the ECCV 2020 Best Paper Award, and many later methods build on its ideas. The core insight: don't predict flow in one shot; refine it iteratively by repeatedly looking up correlations.

Architecture: Three Stages

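Stage 1: Feature Extraction. A shared CNN encoder processes both images independently, producing feature maps at 1/8 resolution (a separate context encoder, applied to image 1 only, supplies the context features used in stage 3):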

f₁, f₂ = Encoder(I₁), Encoder(I₂) shape: [H/8, W/8, 256]

Stage 2: Correlation Volume. Compute the dot product between every feature in f₁ and every feature in f₂. The result is a 4D tensor:

C(i, j, k, l) = f₁(i, j) · f₂(k, l) shape: [H/8, W/8, H/8, W/8]

This is the all-pairs correlation volume. Entry C(i,j,k,l) measures how similar the feature at position (i,j) in image 1 is to the feature at (k,l) in image 2. Build it once, look it up many times.
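The formula for C translates almost literally into one `np.einsum` call. The sketch below assumes NumPy and uses random vectors in place of learned features; `all_pairs_correlation` is a hypothetical name, and any normalization or pooling a real implementation applies is omitted.

```python
import numpy as np

def all_pairs_correlation(f1, f2):
    """C[i, j, k, l] = <f1[i, j], f2[k, l]> for feature maps of shape [h, w, D]."""
    return np.einsum("ijd,kld->ijkl", f1, f2)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((6, 8, 64))
f2 = np.roll(f1, shift=(1, 2), axis=(0, 1))   # every feature moves by (dy=1, dx=2)

C = all_pairs_correlation(f1, f2)
print(C.shape)                                 # (6, 8, 6, 8)

i, j = 2, 3
k, l = np.unravel_index(np.argmax(C[i, j]), C[i, j].shape)
print(int(k), int(l))                          # best match for (2, 3) is row 3, column 5
```

The volume has (h·w)² entries, which is why it is built at 1/8 resolution: at full resolution the memory cost would grow by a factor of 8⁴.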

Stage 3: Iterative Update (GRU). Starting from zero flow, a Gated Recurrent Unit refines the estimate over 12 iterations. At each iteration:

Lookup

At current flow estimate, index into correlation volume → [H/8, W/8, R²]

↓

Context

Concatenate: correlation features + current flow + context features

↓

GRU Update

h_t = GRU(h_t−1, input) → hidden state [H/8, W/8, 128]

↓

Flow Head

Two conv layers → Δflow [H/8, W/8, 2]. Add to current flow.

↻ repeat 12 times

Why this works: The correlation volume is a "lookup table of similarity." At each iteration, the GRU says "given where I currently think things moved, how well do the features match?" If the match is poor, it adjusts. After 12 iterations, even large displacements converge. The all-pairs design means RAFT handles large motions that defeat patch-based methods.

RAFT: Iterative Flow Refinement

Watch RAFT iteratively refine its flow estimate. Iteration 0 is random. Each GRU step improves the flow by looking up correlations. Adjust the number of iterations or add noise to see how the algorithm converges.

Training Details

RAFT supervises at every iteration, not just the last. The loss weights later iterations more heavily:

L = ∑_{i=1}^{N} γ^(N−i) · ||flow_i − flow_gt||₁, with γ = 0.8

This encourages each iteration to improve, not just the final output. L1 loss (not L2) because it's less sensitive to outliers at motion boundaries.
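The weighting is easy to see in code. This sketch assumes NumPy and averages the L1 error over pixels and channels (a normalization choice; the formula above leaves it implicit). `sequence_loss` is a hypothetical name.

```python
import numpy as np

def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """Weighted sum of per-iteration L1 errors; later iterations weigh more."""
    n = len(flow_preds)
    total = 0.0
    for i, pred in enumerate(flow_preds, start=1):
        weight = gamma ** (n - i)                 # the last iteration gets weight 1
        total += weight * np.abs(pred - flow_gt).mean()
    return total

gt = np.zeros((2, 2, 2))
preds = [np.full((2, 2, 2), err) for err in (1.0, 0.5, 0.25)]   # improving estimates
print(sequence_loss(preds, gt))   # 0.64*1.0 + 0.8*0.5 + 1.0*0.25 = 1.29
```

With three iterations the weights are 0.64, 0.8, and 1.0, so an early mistake costs less than a late one but is never free.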

Component	Shape	Purpose
Feature encoder	[H/8, W/8, 256]	Extract appearance features from each image
Correlation volume	[H/8, W/8, H/8, W/8]	All-pairs similarity lookup table
GRU hidden state	[H/8, W/8, 128]	Recurrent memory across iterations
Flow estimate	[H/8, W/8, 2]	Current (u, v) prediction, upsampled to [H, W, 2]

RAFT's numbers: EPE 1.61 on Sintel Clean, 2.86 on Sintel Final. It runs at ~10 FPS on a single GPU. Many competitive methods since 2020 (GMA, FlowFormer, VideoFlow) keep RAFT's correlation volume + iterative refinement as the backbone.

Chapter 7: Training on Synthetic Data

Ground-truth optical flow for real video is nearly impossible to obtain. You'd need to know exactly where every single pixel moved — sub-pixel precision for millions of pixels. Instead, the entire field relies on synthetic datasets where the renderer provides perfect ground truth for free.

The Datasets

Dataset	Year	Size	What It Is	Realism
FlyingChairs	2015	22k pairs	Random chairs on backgrounds, 2D motion	Low
FlyingThings3D	2016	26k pairs	Random 3D objects with 3D motion	Medium
MPI Sintel	2012	1k pairs	Frames from animated film, realistic effects	High
KITTI	2012/2015	400 pairs	Real driving scenes, LiDAR ground truth (sparse)	Real (sparse GT)

The Training Schedule

RAFT and its successors follow a specific curriculum. Start simple, increase complexity:

Stage 1: FlyingChairs

100k iterations. Simple 2D motion. The network learns basic correspondence.

↓

Stage 2: FlyingThings3D

100k iterations. 3D motion, occlusion, varying depth.

↓

Stage 3: Fine-tune

Sintel + KITTI + HD1K. Real-world effects: blur, fog, specularities.

The Generalization Gap

Models trained on synthetic data struggle with real-world challenges that don't appear in training:

Textureless surfaces — white walls, clear skies (synthetic data has rich textures)
Reflections and transparencies — glass, water, mirrors break brightness constancy
Non-Lambertian materials — specular highlights shift with viewpoint
Motion blur — fast-moving objects smear across the frame
Fog and rain — atmospheric effects change appearance between frames

Why not just use real data? Getting dense per-pixel ground truth for real video requires exotic hardware (multiple high-speed cameras + structured light) or slow, expensive manual annotation. LiDAR gives sparse 3D flow for ~5% of pixels. Synthetic data gives 100% dense, sub-pixel accurate flow for free. The tradeoff: perfect annotations, imperfect realism.

Synthetic vs Real: What Degrades

Toggle real-world effects to see how they degrade optical flow estimation. Each effect violates an assumption the model learned from clean synthetic data.

Why do optical flow models follow a training curriculum (Chairs → Things → Sintel)?

Start with simple motion patterns, gradually add complexity and realism
Each dataset tests a different image resolution
Later datasets have more images

Chapter 8: Applications

Optical flow isn't just an academic exercise. It's a core primitive that powers dozens of real systems. Here's how flow enables each application:

Video Stabilization

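Shaky handheld video has erratic global motion. Compute optical flow between consecutive frames → estimate the dominant (camera) motion → accumulate it into a camera path → smooth that path → warp each frame by the difference between the smoothed and raw paths. The result: steady footage as if shot on a gimbal. Many software stabilizers in phones and video platforms use variants of this pipeline, sometimes combined with gyroscope data.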
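A translation-only version of this idea can be sketched as follows. Everything here is a simplifying assumption: camera motion is modelled as a pure 2D shift taken as the median flow, the path is smoothed with a moving average, and the function names `camera_shift` and `stabilizing_offsets` are hypothetical. Real stabilizers typically fit richer motion models.

```python
import numpy as np

def camera_shift(flow):
    """Dominant translation of one flow field [H, W, 2], via the per-channel median."""
    return np.median(flow.reshape(-1, 2), axis=0)

def stabilizing_offsets(flows, window=5):
    """Per-frame correction = smoothed camera path minus the raw camera path."""
    steps = np.array([camera_shift(f) for f in flows])      # [T, 2]
    path = np.cumsum(steps, axis=0)                         # raw trajectory
    kernel = np.ones(window) / window                       # moving average
    half = window // 2
    padded = np.pad(path, ((half, half), (0, 0)), mode="edge")
    smooth = np.stack([np.convolve(padded[:, c], kernel, mode="valid")
                       for c in range(2)], axis=1)
    return smooth - path

# Nine frames of a steady pan (2 px/frame to the right) with a small object
# moving differently in one corner; the median ignores the object.
flows = []
for _ in range(9):
    f = np.full((6, 6, 2), (2.0, 0.0))
    f[:2, :2] = (-5.0, 3.0)
    flows.append(f)

offsets = stabilizing_offsets(flows)
print(offsets.shape)          # (9, 2)
print(offsets[2:-2])          # a steady pan needs no correction away from the ends
```

Using the median rather than the mean is the design choice that keeps independently moving objects from being mistaken for camera motion, as long as they cover a minority of the pixels.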

Action Recognition (Two-Stream Networks)

Simonyan & Zisserman (2014) showed that feeding precomputed optical flow as a separate stream alongside RGB dramatically improves action recognition. The spatial stream sees appearance (who/what), the temporal stream sees motion (doing what). Flow compresses temporal information into a single tensor that CNNs can process.

Spatial Stream

Single RGB frame [H, W, 3] → CNN → "person, ball, court"

Temporal Stream

Stacked flow [H, W, 2L] (L frames) → CNN → "kicking motion"

↓

Fusion

Late fusion → "person kicking ball"

Video Frame Interpolation

Given frames at t=0 and t=1, generate the frame at t=0.5. Compute forward flow (0→1) and backward flow (1→0), then warp both frames to the midpoint and blend. This is how slow-motion video is created from standard 30fps footage.

Autonomous Driving

Optical flow helps self-driving cars detect independently moving objects. The ego-vehicle's motion creates a predictable flow pattern (expansion from the focus of expansion). Anything deviating from this pattern is an independent mover — a pedestrian, cyclist, or other car.

Video Editing

Remove an object from video by computing flow, propagating the inpainting mask across frames along flow trajectories, then filling the hole with temporally consistent content.

The common thread: Optical flow converts temporal information (change over time) into a spatial representation (a 2D map). This lets spatial tools (CNNs, segmentation, warping) process motion without needing 3D or temporal architectures.

Application: Video Stabilization

A camera shakes randomly while viewing a scene. Optical flow estimates the camera motion, and stabilization subtracts it. Toggle stabilization to see the difference.

In the two-stream architecture for action recognition, what does the temporal stream take as input?

Raw video frames
Stacked optical flow fields
Audio features

Chapter 9: Connections & Beyond

Optical flow is one way to understand motion, but it has clear limitations: it's 2D (ignores depth), pairwise (two frames only), and dense (one vector per pixel, even where it's ambiguous). Modern research addresses each of these.

What Optical Flow Cannot Do

Limitation	Why	Solution
2D only	Flow is projected motion — no depth	Scene flow: 3D displacement per point
Pairwise	Only connects frame t to t+1	CoTracker / TAPIR: long-range point tracking
Brightness constancy	Fails on lighting changes, occlusion	Learned features replace raw pixels
Uniform density	Wastes compute on static background	Sparse tracking: track only interesting points

The Broader Landscape

Method	Tracks	Duration	Dimension
Optical Flow	All pixels	2 frames	2D
Scene Flow	All points	2 frames	3D
Point Tracking (CoTracker)	Selected points	Full video	2D
SLAM	Sparse features	Full video	3D + camera pose
Video Transformers	Implicit (tokens)	Full video	Learned space

Related lessons: Optical flow connects to many areas. For 3D reconstruction, see SLAM lessons. For video understanding at scale, video transformers learn temporal relationships without explicit flow. For robotics, VLA architectures use flow-like representations for manipulation.

Cheat Sheet

Concept	One-Line Summary
Flow field	Per-pixel (u, v) displacement, shape [H, W, 2]
Brightness constancy	I_x·u + I_y·v + I_t = 0: one equation, two unknowns
Aperture problem	Can only see motion perpendicular to edge
Lucas-Kanade	Local patches, least squares (A^TA)⁻¹A^Tb
Horn-Schunck	Global smoothness energy, iterative solver, α controls smoothness
FlowNet	First CNN for flow, encoder-decoder, trained on FlyingChairs
RAFT	All-pairs correlation + GRU iterations, state-of-the-art
Synthetic training	Chairs → Things3D → Sintel curriculum
Key metric	End-point error (EPE): average pixel distance to ground truth

"Vision is not just about seeing — it's about seeing change."

— James Gibson, ecological psychologist

You now understand optical flow — from the brightness constancy constraint that started it all, to RAFT's iterative correlation lookups that define the state of the art. Every pixel tells a story of motion.

What is the key advantage of point tracking methods like CoTracker over optical flow?

They produce denser output
They run faster
They track points across many frames, not just two
