Zachary Teed, Jia Deng — Princeton University, 2020

RAFT: Recurrent All-Pairs Field Transforms

Estimate dense optical flow by building a 4D correlation volume over all pixel pairs, then iteratively refining the flow field with a lightweight GRU — 12 identical update steps that mimic first-order optimization.

Prerequisites: Convolutional neural networks + Recurrent units (GRU) + Feature matching basics

Chapter 0: The Problem

You have two consecutive video frames. For every single pixel in frame 1, you want to know where that pixel moved to in frame 2. This is optical flow — a dense 2D displacement field that captures per-pixel motion between images.

Think of it this way: if you could paint every pixel a unique color and then photograph the scene a moment later, optical flow is the mapping that tells you where each color ended up. The output is a tensor of shape [H, W, 2] — two values (dx, dy) for every pixel, describing horizontal and vertical displacement.
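As a small illustration of the [H, W, 2] layout, the sketch below builds a toy flow field in NumPy and computes where each frame-1 pixel lands in frame 2. The (dx, dy) channel order follows the text; the array names and the toy sizes are assumptions made for this example.

```python
import numpy as np

H, W = 4, 6  # toy image size (assumption for the example)

# flow[y, x] = (dx, dy): displacement of pixel (x, y) from frame 1 to frame 2.
flow = np.zeros((H, W, 2), dtype=np.float32)
flow[..., 0] = 2.0   # every pixel moves 2 px to the right
flow[..., 1] = -1.0  # and 1 px up (image y axis points down)

# Pixel grid of frame 1, stored as (x, y) per location.
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
coords1 = np.stack([xs, ys], axis=-1).astype(np.float32)

# Position of each frame-1 pixel in frame 2.
coords2 = coords1 + flow
```

Here the pixel at (x=3, y=1) is mapped to (x=5, y=0).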

Why is this hard? Three reasons make optical flow one of the oldest unsolved problems in computer vision:

The fundamental tension: Traditional methods trade off a data term (matching visual appearance) against a regularization term (enforcing smoothness). Get the balance wrong and you either hallucinate motion or oversmooth it away. The key question: can we learn this trade-off from data instead of hand-designing it?

Before RAFT, deep learning approaches used coarse-to-fine pyramids: estimate flow at low resolution, then progressively refine at higher resolutions. This works but has three problems: (1) errors at coarse levels can never be recovered, (2) small fast-moving objects get missed at low resolution, and (3) training requires 1M+ iterations to learn the multi-stage cascade.

Full data flow at a glance:
Image pair I1, I2 ∈ R^(H×W×3)
→ Feature encoder gθ produces per-pixel features [H/8, W/8, 256] for both images
→ Context encoder hθ extracts features from I1 only
→ 4D correlation volume C ∈ R^(H/8×W/8×H/8×W/8) from dot products of all feature pairs
→ Pool the last 2 dims to form a 4-level pyramid {C1, C2, C3, C4}
→ Initialize flow f_0 = 0
→ For k = 0..11: the GRU update operator uses correlation lookups + current flow + context to predict Δf, then f_{k+1} = f_k + Δf
→ Convex upsampling from [H/8, W/8, 2] to [H, W, 2].

[Interactive simulation: Optical Flow Intuition. Dragging a circle in Frame 1 shows its flow vector, i.e. where that pixel moves to in Frame 2. The flow field is [H, W, 2]: two displacement values per pixel.]

What is the output of an optical flow model?

Chapter 1: The Key Insight

RAFT's central insight is to operate at a single resolution and iterate, rather than building a coarse-to-fine pyramid. This single change has cascading benefits.

Previous methods (FlowNet2, PWC-Net, LiteFlowNet) all followed the same playbook: estimate flow at 1/64 resolution, warp the second image, re-estimate flow at 1/32, warp again, and so on up to full resolution. Each stage has its own weights — no sharing. This means: if the coarse estimate is wrong, every subsequent stage tries to fix someone else's mistake.

RAFT takes a radically different approach. It maintains a single flow field at 1/8 resolution and applies the same update operator 12 times (or more — up to 100+ at inference without divergence). The weights are shared across all iterations. Each iteration receives the same information: what does the correlation volume say, and what does the current flow estimate look like?

Step 1: Extract Features
Feature encoder maps both images to [H/8, W/8, 256]. Context encoder maps I1 to [H/8, W/8, 256]. Run once.
Step 2: Build Correlation Volume
Dot product of all feature pairs → 4D volume [H/8, W/8, H/8, W/8]. Pool to form multi-scale pyramid. Run once.
Step 3: Iterate (12×)
GRU-based update operator: look up correlations at the current flow estimate, predict Δf, update f_{k+1} = f_k + Δf. Same weights every iteration.
Step 4: Upsample
Convex upsampling: [H/8, W/8, 2] → [H, W, 2] using learned 8×8×9 masks.
Why weight sharing matters: When you force all 12 updates to use the same parameters, you constrain the network to learn a general-purpose update rule — something that works at any iteration. This is exactly what first-order optimization algorithms do: the same step (gradient descent, ADMM, proximal gradient) is applied repeatedly until convergence. RAFT learns this step from data. Weight sharing also means only 2.7M parameters in the update operator, vs. 38M for FlowNetS.
What degrades when you change iterations: At training time, RAFT uses 12 iterations. At inference, you can run more. Going from 12 to 32 iterations consistently improves accuracy (the flow converges closer to the fixed point). Going below 6 degrades sharply — the flow hasn't converged. At 100+ iterations the updates become negligible (Δf → 0), confirming convergence. Unlike pyramid methods, where the number of stages is fixed by architecture, RAFT gracefully trades compute for accuracy.
Why does RAFT use weight-shared iterative updates instead of a coarse-to-fine pyramid?

Chapter 2: Feature Extraction

RAFT uses two separate encoders. Both are simple residual networks with 6 residual blocks, producing features at 1/8 resolution. But they serve very different purposes.

Feature Encoder gθ

The feature encoder is applied to both images. Its job is to extract appearance features that can be matched across frames. It maps each image from R^(H×W×3) to R^(H/8×W/8×256).

The architecture: 2 residual blocks at 1/2 resolution, 2 at 1/4, and 2 at 1/8. Each block has instance normalization (not batch norm) for better generalization across datasets. The output is a 256-dimensional feature vector for every pixel at 1/8 resolution.

Context Encoder hθ

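The context encoder is applied to only I1. Its job is to provide a persistent reference signal to the update operator: information about the scene structure, edges, and texture patterns that should guide how the flow evolves. It uses batch normalization (not instance norm) and outputs at the same [H/8, W/8, 256] resolution. In the reference implementation these 256 channels are split in two: 128 initialize the GRU hidden state and 128 are fed to the update operator as context features at every iteration.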

Frozen vs. Trained: Both encoders are trained from scratch along with the rest of the network. No pretrained backbones, no frozen weights. Feature encoder: trained, shared weights between I1 and I2. Context encoder: trained, separate weights. Update operator GRU: trained, shared across all 12 iterations. Convex upsampling mask: trained. Total parameters: ~5.3M. Small RAFT (1/5 params): ~1M.
Why two encoders? The feature encoder produces representations optimized for matching — features from corresponding pixels should have high dot products. The context encoder produces representations optimized for guiding updates — edge information, texture boundaries, and local structure that tells the update operator where flow discontinuities should be. Merging these two roles into one encoder degrades performance (ablation: +0.3 EPE on Sintel clean).
| Component | Input | Output Shape | Parameters |
|---|---|---|---|
| Feature encoder gθ | I1 and I2 (shared) | [H/8, W/8, 256] | ~2.5M |
| Context encoder hθ | I1 only | [H/8, W/8, 256] | ~2.5M |
| Update operator | correlation + flow + context | Δf [H/8, W/8, 2] | ~2.7M |
| Upsampling mask | hidden state | [H/8, W/8, 8×8×9] | ~0.1M |
Why does RAFT use two separate encoders instead of one?

Chapter 3: The 4D Correlation Volume

This is the heart of RAFT. After feature extraction, we have two feature maps: gθ(I1) ∈ R^(H/8×W/8×256) and gθ(I2) ∈ R^(H/8×W/8×256). Now we need to measure how similar every pixel in I1 is to every pixel in I2.

Building the Volume

Take every feature vector in I1 (indexed by i, j) and compute its dot product with every feature vector in I2 (indexed by k, l):

$$C_{ijkl} = \sum_{h} g_\theta(I_1)_{ijh} \cdot g_\theta(I_2)_{klh}$$

The result is a 4D tensor: C ∈ R^(H/8×W/8×H/8×W/8). For a 480×640 image at 1/8 resolution (60×80), this is 60×80×60×80 ≈ 23 million entries. It sounds huge, but it is computed as a single matrix multiplication and takes only 17% of total inference time.
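The "single matrix multiplication" can be sketched in NumPy as follows. This is a minimal illustration on toy-sized random features, not the reference code; the function name is made up for the example, and the reference implementation additionally scales the result by 1/sqrt(D), which is omitted here to match the formula above.

```python
import numpy as np

def all_pairs_correlation(fmap1, fmap2):
    """Dot product of every feature in fmap1 with every feature in fmap2.

    fmap1, fmap2: [h, w, D] feature maps. Returns C with shape [h, w, h, w],
    where C[i, j, k, l] = <fmap1[i, j], fmap2[k, l]>.
    """
    h, w, d = fmap1.shape
    a = fmap1.reshape(h * w, d)      # one row per frame-1 pixel
    b = fmap2.reshape(h * w, d)      # one row per frame-2 pixel
    corr = a @ b.T                   # [h*w, h*w] in one matmul
    return corr.reshape(h, w, h, w)

rng = np.random.default_rng(0)
h, w, d = 6, 8, 16                   # toy sizes; the text uses 60 x 80 x 256
f1 = rng.standard_normal((h, w, d))
f2 = rng.standard_normal((h, w, d))
C = all_pairs_correlation(f1, f2)

# Entry count for the 480 x 640 example from the text.
entries_480x640 = 60 * 80 * 60 * 80
```

Each slice C[i, j] is a 2D map of how well frame-1 pixel (i, j) matches every location in frame 2.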

The Correlation Pyramid

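We pool the last two dimensions (the I2 dimensions) with kernel sizes {1, 2, 4, 8} to create four levels:

C1: [H/8, W/8, H/8, W/8] (kernel 1, no pooling)
C2: [H/8, W/8, H/16, W/16] (kernel 2)
C3: [H/8, W/8, H/32, W/32] (kernel 4)
C4: [H/8, W/8, H/64, W/64] (kernel 8)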

The crucial design: we pool only the I2 dimensions, keeping the I1 dimensions at full resolution. This means the network retains high-resolution information about where in I1 each correspondence comes from, which is essential for recovering small fast-moving objects.
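A minimal NumPy sketch of this pooling, assuming average pooling with stride equal to the kernel size. The helper names are hypothetical. Note that only the last two axes shrink, while the first two (the I1 axes) keep their size at every level.

```python
import numpy as np

def avg_pool_last2(corr, k):
    """Average-pool the last two axes of a 4D volume with a k x k kernel, stride k."""
    h1, w1, h2, w2 = corr.shape
    h2k, w2k = h2 // k, w2 // k
    c = corr[:, :, :h2k * k, :w2k * k]   # drop remainder rows/cols, if any
    return c.reshape(h1, w1, h2k, k, w2k, k).mean(axis=(3, 5))

def build_pyramid(corr, kernels=(1, 2, 4, 8)):
    """Return the four pyramid levels described in the text."""
    return [avg_pool_last2(corr, k) for k in kernels]

rng = np.random.default_rng(0)
C = rng.standard_normal((6, 8, 8, 16))   # toy volume: [h1, w1, h2, w2]
pyramid = build_pyramid(C)
shapes = [level.shape for level in pyramid]
```

For this toy volume the level shapes are (6, 8, 8, 16), (6, 8, 4, 8), (6, 8, 2, 4) and (6, 8, 1, 2).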

Why all pairs? Previous methods (PWC-Net, FlowNet2) computed correlations only in a local window around the current flow estimate. This means they can only find correspondences within that window. If the window is 4 pixels wide at 1/16 resolution, that's 64 pixels at full resolution. A fast-moving baseball at 200 pixels of displacement is invisible. RAFT computes correlations for all pairs, so no displacement is out of reach. The multi-scale pyramid means a radius-4 lookup at level C4 covers 256 pixels at full resolution.

Correlation Lookup

During each GRU iteration, the current flow estimate f_k = (f¹, f²) maps each pixel x = (u, v) in I1 to its estimated position x' = (u + f¹(u, v), v + f²(u, v)) in I2. Around x', we sample a local neighborhood of radius r = 4 from each pyramid level using bilinear interpolation. Pyramid level l (pooled with kernel size 2^(l−1)) is indexed on a grid centered at x'/2^(l−1), so a radius-4 neighborhood at level 4 reaches 2^3 × 4 = 32 pixels on the 1/8-resolution grid (256 pixels at full resolution). The lookups from all levels are concatenated into a single feature vector.

Tensor shapes through the lookup: At each iteration, for each pixel in I1 (total H/8 × W/8 pixels), we look up a (2r+1) × (2r+1) = 9×9 = 81 correlation values from each of 4 pyramid levels = 324 values per pixel. Two conv layers reduce this to 256 dims. Together with flow features (128 dims) and context features (128 dims), the GRU input is 512 dimensions per pixel.
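The lookup for a single pixel can be sketched as below. This is an illustrative, loop-based version under stated assumptions: the pyramid is filled with random numbers, levels are indexed 0..3 (so the coordinate scale is 2**level), and samples outside the map count as zero. A real implementation would vectorize this over all pixels.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a 2D array at real-valued (x, y); out-of-bounds taps contribute zero."""
    h, w = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                out += wy * wx * img[yy, xx]
    return out

def lookup(pyramid, u, v, flow_uv, r=4):
    """Correlation features for frame-1 pixel (u, v) given its flow (fx, fy)."""
    feats = []
    for level, corr in enumerate(pyramid):
        # Estimated frame-2 position, rescaled to this level's coarser grid.
        cx = (u + flow_uv[0]) / 2 ** level
        cy = (v + flow_uv[1]) / 2 ** level
        slice2d = corr[v, u]             # match scores of (u, v) against frame 2
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                feats.append(bilinear_sample(slice2d, cx + dx, cy + dy))
    return np.array(feats)

rng = np.random.default_rng(0)
h = w = 16                               # toy 1/8-resolution grid
pyramid = [rng.standard_normal((h, w, h // 2 ** l, w // 2 ** l)) for l in range(4)]
feats = lookup(pyramid, u=5, v=6, flow_uv=(2.0, 1.0))
```

The result has 4 × 81 = 324 entries, matching the count in the text; entry 40 is the center of the level-0 window, i.e. the correlation at the estimated target itself.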
[Interactive simulation: 4D Correlation Volume. Selecting a pixel in the I1 grid displays its correlation slice, i.e. how similar that pixel is to every pixel in I2 (brighter = higher correlation). An orange box marks the lookup neighborhood around the current flow estimate.]

Why does RAFT pool only the I2 dimensions (not I1) when building the correlation pyramid?

Chapter 4: Iterative Updates

RAFT initializes the flow field to zero everywhere: f_0 = 0. Then it runs 12 iterations (at training time) of the same update operator, each producing a small correction Δf that is added to the current estimate: f_{k+1} = f_k + Δf.

Each iteration follows the same three steps:

Step 1: Gather Information

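The update operator receives three inputs, concatenated into a single feature map:

Correlation features: the pyramid lookup at the current flow estimate, reduced to 256 dims by two conv layers.
Flow features: an encoding of the current flow estimate f_k (128 dims).
Context features: the context encoder's output for I1 (128 dims).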

Step 2: Update Hidden State (GRU)

The concatenated features (512 dims) and the previous hidden state h_{k−1} (128 dims) are fed into a convolutional GRU cell. The GRU maintains a persistent hidden state that accumulates information across iterations, so it remembers what it has learned from previous updates.

Step 3: Predict Δf

The GRU's output hidden state h_k passes through 2 convolutional layers to produce the flow update Δf ∈ R^(H/8×W/8×2). This is added to the current flow to get f_{k+1}.

The optimization analogy: Think of each iteration as one step of gradient descent. The correlation lookup is like computing the gradient: "which direction should flow move to better align the two images?" The hidden state is like momentum: it accumulates evidence across iterations. The flow update Δf is the actual step. But unlike hand-crafted optimizers, the "gradient" and "momentum" are learned from data.
Engineering decision (detached gradients): When computing f_{k+1} = f_k + Δf, the gradient flows only through the Δf branch. The f_k term is treated as a constant (gradients detached). This prevents exploding gradients through the long chain of 12 iterations and stabilizes training. It also means each iteration is trained to predict the right increment, not the right absolute value.
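The shape of the refinement loop can be illustrated with a toy in NumPy. The update function here is a hypothetical stand-in that peeks at a known target flow; RAFT's learned GRU has no such access and infers its correction from correlation lookups. The sketch only shows the recurrence itself: one shared update function, applied repeatedly, with corrections that shrink as the estimate approaches a fixed point.

```python
import numpy as np

def update_operator(flow, target, step=0.5):
    """Hypothetical stand-in for the learned update: returns a correction delta_f."""
    return step * (target - flow)

rng = np.random.default_rng(0)
target = rng.uniform(-3, 3, size=(6, 8, 2))  # pretend true flow at 1/8 resolution
flow = np.zeros_like(target)                  # f_0 = 0
errors = []

for k in range(12):
    # In an autograd framework, this is where f_k would be detached
    # so that gradients flow only through delta_f.
    delta = update_operator(flow, target)     # same function every iteration
    flow = flow + delta                       # f_{k+1} = f_k + delta_f
    errors.append(float(np.abs(target - flow).mean()))
```

With this toy rule the error halves at every step, so 12 iterations bring it well below a hundredth of a pixel.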
[Interactive simulation: Iterative Flow Refinement. The flow field converges from zero initialization over 12 iterations, each adding a Δf correction, while a plot shows total displacement error decreasing over iterations.]

Why does RAFT detach gradients through f_k when computing f_{k+1} = f_k + Δf?

Chapter 5: The Convolutional GRU

The update operator's core is a convolutional GRU (Gated Recurrent Unit), adapted from the sequence modeling world. Instead of fully connected layers, it uses 3×3 convolutions, processing the entire spatial feature map at once.

The GRU Equations

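Given input x_t (the concatenated correlation + flow + context features) and previous hidden state h_{t−1}:

$$
\begin{aligned}
z_t &= \sigma\big(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_z)\big) \\
r_t &= \sigma\big(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_r)\big) \\
\tilde{h}_t &= \tanh\big(\mathrm{Conv}_{3\times3}([r_t \odot h_{t-1}, x_t], W_h)\big) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

Where σ is the sigmoid function, ⊙ is element-wise multiplication, [·, ·] is concatenation, and W_z, W_r, W_h are the separate learned weights of the three convolutions.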
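The equations above can be sketched as a small NumPy ConvGRU step. This is a minimal illustration with random weights and toy channel counts, not the trained operator; the helper names and the params dictionary layout are assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv3x3(x, w, b):
    """3x3 convolution, zero padding. x: [H, W, Cin], w: [3, 3, Cin, Cout], b: [Cout]."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1])) + b
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + H, dx:dx + W] @ w[dy, dx]
    return out

def conv_gru_step(h_prev, x, params):
    """One GRU update: gates z and r, candidate h_tilde, then the blend."""
    hx = np.concatenate([h_prev, x], axis=-1)
    z = sigmoid(conv3x3(hx, *params["z"]))            # update gate
    r = sigmoid(conv3x3(hx, *params["r"]))            # reset gate
    rhx = np.concatenate([r * h_prev, x], axis=-1)
    h_tilde = np.tanh(conv3x3(rhx, *params["h"]))     # candidate state
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
H, W, hidden, inp = 5, 7, 8, 16   # toy sizes; the text uses hidden=128, input=512
params = {
    name: (0.1 * rng.standard_normal((3, 3, hidden + inp, hidden)), np.zeros(hidden))
    for name in ("z", "r", "h")
}
h = np.tanh(rng.standard_normal((H, W, hidden)))      # initial hidden state in (-1, 1)
x = rng.standard_normal((H, W, inp))
for _ in range(12):                                   # same weights every iteration
    h = conv_gru_step(h, x, params)
```

Because the new state is a per-element blend of h_{t−1} and a tanh output, a hidden state that starts inside [−1, 1] stays there no matter how many steps are run.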

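Let's unpack each gate:

Update gate z_t: decides, per pixel and channel, how much of the hidden state is overwritten by the candidate. A value near 0 keeps h_{t−1}; a value near 1 replaces it.
Reset gate r_t: scales h_{t−1} before it enters the candidate computation, letting the cell ignore stale memory.
Candidate state h̃_t: the proposed new content, squashed into (−1, 1) by tanh.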

Why GRU and not LSTM? The GRU has fewer parameters (2 gates vs. 3) and works just as well here. The key property RAFT needs is bounded activations (sigmoid and tanh). These prevent the hidden state from growing unboundedly over 12+ iterations, encouraging convergence to a fixed point. An unconstrained linear recurrence has no such bound and can diverge.

Separable ConvGRU

RAFT also experiments with a separable variant: replace the single 3×3 convolution with two sequential GRUs — one with 1×5 convolution and one with 5×1 convolution. This increases the receptive field from 3×3 to 5×5 without significantly increasing parameters, and gives a small accuracy boost (0.1 EPE on Sintel).

Tensor shapes through the GRU: Hidden state h ∈ R^(H/8×W/8×128). Input x ∈ R^(H/8×W/8×512) (256 correlation + 128 flow + 128 context). All three gate convolutions: Conv3×3(640 → 128). Total GRU params: ~2.7M. For the separable variant, the 1×5 GRU processes [h, x] → h', then the 5×1 GRU processes [h', x] → h_t. The receptive field grows from 3×3 to 5×5 with minimal parameter increase.
What property of the GRU's activation functions (sigmoid, tanh) is essential for RAFT's iterative design?

Chapter 6: Convex Upsampling

The GRU outputs flow at 1/8 resolution: [H/8, W/8, 2]. But we need full-resolution flow: [H, W, 2]. Naive bilinear upsampling blurs edges and produces artifacts at motion boundaries. RAFT uses a learned convex upsampling scheme.

How It Works

For each pixel in the high-resolution output, RAFT expresses it as a convex combination of its 3×3 coarse neighbors. "Convex" means the weights are non-negative and sum to 1 (enforced by softmax).

Concretely, the GRU's hidden state is passed through two convolutional layers to produce a mask of shape [H/8, W/8, 8×8×9]. The 8×8 factor tiles each coarse pixel into 64 sub-pixels (to reach full resolution). The 9 values are the weights for the 3×3 neighborhood, passed through softmax. The full-resolution flow for each sub-pixel is:

$$f_{\text{up}}(x, y) = \sum_{n \in \mathcal{N}_{3\times3}} w_n(x, y) \cdot f_{\text{coarse}}(n)$$

where w_n are the softmax-normalized weights and f_coarse(n) are the 9 coarse-resolution flow values in the neighborhood.
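A NumPy sketch of this weighted combination follows. Several details are assumptions made for the example rather than facts from the text: the mask is laid out as [h, w, 8, 8, 9], borders use edge padding (an unfold-based implementation would zero-pad), and the result is multiplied by 8 so that displacements are expressed in full-resolution pixels, a rescaling the formula above does not show.

```python
import numpy as np

def convex_upsample(flow, mask_logits, factor=8):
    """flow: [h, w, 2]; mask_logits: [h, w, factor, factor, 9] -> [h*factor, w*factor, 2]."""
    h, w, _ = flow.shape
    # Softmax over the 9 neighbors: non-negative weights that sum to 1.
    m = mask_logits - mask_logits.max(axis=-1, keepdims=True)
    weights = np.exp(m)
    weights /= weights.sum(axis=-1, keepdims=True)
    # 3x3 coarse neighborhood of every coarse pixel: [h, w, 9, 2].
    fp = np.pad(flow, ((1, 1), (1, 1), (0, 0)), mode="edge")  # assumption: edge padding
    neigh = np.stack(
        [fp[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)], axis=2
    )
    # Convex combination for each of the factor x factor sub-pixels: [h, w, f, f, 2].
    up = np.einsum("hwabn,hwnc->hwabc", weights, neigh)
    # Interleave sub-pixels into the full-resolution grid.
    up = up.transpose(0, 2, 1, 3, 4).reshape(h * factor, w * factor, 2)
    return factor * up  # assumption: rescale displacements to full-resolution pixels

rng = np.random.default_rng(0)
h, w = 3, 4
const_flow = np.tile(np.array([1.5, -0.5]), (h, w, 1))
logits = rng.standard_normal((h, w, 8, 8, 9))
up_const = convex_upsample(const_flow, logits)

rand_flow = rng.uniform(-2, 2, size=(h, w, 2))
up_rand = convex_upsample(rand_flow, logits)
```

Whatever the mask logits are, a constant coarse field upsamples to the same constant (times 8), and a varying field never leaves the range of its coarse values. That is the "cannot hallucinate flow" property in concrete form.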

Why convex, not deconvolution? A transposed convolution (deconvolution) can produce negative weights and checkerboard artifacts. Convex upsampling guarantees that the output is always a weighted average of valid flow values — it cannot hallucinate flow that doesn't exist in the coarse field. The softmax ensures the weights are a proper probability distribution. At motion boundaries (e.g., a foreground object against a background), the weights become highly concentrated on one side, producing a sharp edge.
Implementation detail: In PyTorch, this is implemented using torch.nn.functional.unfold to extract 3×3 neighborhoods, which are then multiplied by the mask weights and reshaped. The mask prediction head adds ~0.1M parameters. At inference, only the final iteration's flow has to be upsampled. During training, the reference implementation upsamples every intermediate prediction with the learned mask, so that the loss for each iteration is computed at full resolution.
[Interactive simulation: Convex vs. Bilinear Upsampling. A motion boundary upsampled with bilinear interpolation (blurry edge) is shown next to convex upsampling (sharp edge). The convex weights concentrate on one side at the boundary.]

Why does convex upsampling produce sharper motion boundaries than bilinear upsampling?

Chapter 7: Training

RAFT is trained end-to-end on synthetic data, then optionally finetuned on real data. The training schedule and loss design are carefully engineered for convergence.

Loss Function

RAFT supervises on all 12 intermediate flow predictions, not just the final one. The loss is the L1 distance between predicted and ground truth flow, summed across all iterations with exponentially increasing weights:

$$L = \sum_{i=1}^{N} \gamma^{N-i} \, \lVert f_{gt} - f_i \rVert_1$$

where γ = 0.8. This means iteration 12 gets weight 1.0, iteration 11 gets 0.8, iteration 10 gets 0.64, and so on. Early iterations are supervised more lightly because they naturally have worse predictions. But they still get gradient signal, which helps the GRU learn to make useful early updates.
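The weighting can be checked with a short NumPy sketch. The normalization is an assumption for this example: the per-pixel L1 norm is averaged over pixels, whereas an implementation may average over all elements or apply a validity mask.

```python
import numpy as np

def sequence_loss(preds, flow_gt, gamma=0.8):
    """Weighted sum of per-iteration L1 errors; later iterations weigh more."""
    n = len(preds)
    total = 0.0
    for i, flow_i in enumerate(preds, start=1):
        weight = gamma ** (n - i)                       # iteration n gets weight 1.0
        l1_per_pixel = np.abs(flow_gt - flow_i).sum(axis=-1)
        total += weight * l1_per_pixel.mean()
    return float(total)

N = 12
weights = [0.8 ** (N - i) for i in range(1, N + 1)]

# Toy check: ground truth is zero and every prediction is off by (1, 1),
# so each iteration has a per-pixel L1 error of exactly 2.
flow_gt = np.zeros((4, 5, 2))
preds = [np.ones((4, 5, 2)) for _ in range(N)]
loss = sequence_loss(preds, flow_gt)
```

The last three weights are 0.64, 0.8 and 1.0, and the first iteration's weight is 0.8^11, roughly 0.086.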

Training Schedule

| Stage | Dataset | Iterations | Batch Size | Image Crop |
|---|---|---|---|---|
| 1. Pretrain | FlyingChairs (C) | 100K | 12 | 368 × 496 |
| 2. Pretrain | FlyingThings3D (T) | 100K | 6 | 400 × 720 |
| 3. Finetune (Sintel) | Sintel + KITTI + HD1K | 100K | 6 | 368 × 768 |
| 4. Finetune (KITTI) | KITTI-2015 | 50K | 6 | 288 × 960 |
Optimizer: AdamW with gradient clipping to [−1, 1]. Learning rate schedule: one-cycle policy with warmup. Training hardware: 2× 2080Ti GPUs. Total C+T training: ~200K iterations, far less than PWC-Net (1M+) or FlowNet2 (hundreds of thousands per stage). The C+T schedule is the standard "generalization test" — how well does the model transfer to Sintel/KITTI without seeing any real data?
What degrades when training changes: Training on FlyingChairs alone: Sintel clean EPE 3.12 (vs. 1.43 with C+T). The curriculum matters — Chairs provides basic motion patterns, Things3D adds complex occlusions and realistic objects. Reducing iterations from 12 to 6 at training time: +0.15 EPE on Sintel. The network needs enough iterations during training to learn the convergence behavior. Removing the exponential weighting (γ = 1, equal weights): +0.08 EPE — the network wastes capacity trying to make early iterations perfect instead of focusing on the final estimate.
Why does RAFT use exponentially increasing weights (γ = 0.8) in its multi-iteration loss?

Chapter 8: Results

RAFT sets the state of the art on both major optical flow benchmarks, with particularly impressive cross-dataset generalization.

Sintel (MPI Sintel, realistic animated scenes)

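| Method | Training | Clean EPE ↓ | Final EPE ↓ |
|---|---|---|---|
| FlowNet2 | C+T | 2.02 | 3.54 |
| PWC-Net | C+T | 2.55 | 3.93 |
| VCN | C+T | 2.21 | 3.68 |
| RAFT | C+T | 1.43 | 2.71 |

After finetuning:

| Method | Training | Clean EPE ↓ | Final EPE ↓ |
|---|---|---|---|
| FlowNet2 | C+T+S/K | 4.16 (test) | 5.74 (test) |
| RAFT | C+T+S/K | 1.61 (test) | 2.86 (test) |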

On Sintel test, RAFT achieves 1.61 EPE (clean) and 2.86 EPE (final) — 30-36% error reduction over all prior work.

KITTI-2015 (real driving scenes)

| Method | Training | EPE ↓ | F1-all ↓ |
|---|---|---|---|
| MaskFlowNet | C+T | - | 23.1% |
| VCN | C+T | 8.36 | 25.1% |
| RAFT | C+T | 5.04 | 17.4% |
| RAFT (finetuned) | C+T+S+K | 0.63 | 5.10% |

On KITTI test, RAFT achieves F1-all of 5.10% — a 16% error reduction from the previous best (6.10%).
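For reference, the two metrics in these tables can be computed as in the sketch below. EPE is the mean Euclidean distance between predicted and true flow vectors. For F1-all, the sketch uses the usual KITTI-2015 outlier rule (error above 3 px and above 5% of the true flow magnitude), which the text does not spell out. Validity masks, which real benchmarks apply, are left out for brevity, and the function names are made up for the example.

```python
import numpy as np

def epe_map(flow_pred, flow_gt):
    """Per-pixel end-point error: Euclidean distance between flow vectors."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1)

def average_epe(flow_pred, flow_gt):
    return float(epe_map(flow_pred, flow_gt).mean())

def f1_all(flow_pred, flow_gt):
    """Percentage of outlier pixels under the 3 px / 5% rule."""
    err = epe_map(flow_pred, flow_gt)
    mag = np.linalg.norm(flow_gt, axis=-1)
    outlier = (err > 3.0) & (err > 0.05 * mag)
    return 100.0 * float(outlier.mean())

# Toy check: 20 pixels, one of them off by the vector (3, 4), i.e. 5 px.
flow_gt = np.zeros((4, 5, 2))
flow_gt[..., 0] = 10.0
flow_pred = flow_gt.copy()
flow_pred[0, 0] += np.array([3.0, 4.0])

epe = average_epe(flow_pred, flow_gt)
f1 = f1_all(flow_pred, flow_gt)
```

In this toy case the average EPE is 5 / 20 = 0.25 px and F1-all is 5%, since one pixel in twenty is an outlier.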

Key Ablations

| Ablation | Sintel Clean EPE | Change |
|---|---|---|
| Full RAFT | 1.43 | - |
| Replace all-pairs with local correlation | 1.73 | +21% |
| Replace convex upsample with bilinear | 1.68 | +17% |
| Remove multi-scale pyramid (single level) | 1.62 | +13% |
| No weight sharing (12 separate decoders) | 1.71 | +20% |
| 6 iterations instead of 12 | 1.58 | +10% |
Efficiency: RAFT processes 1088×436 video at 10 FPS on a 1080Ti. Small RAFT (1/5 parameters) runs at 20 FPS while still outperforming all prior methods on Sintel. The 10× fewer training iterations compared to prior methods is a major practical advantage.
[Chart: Results Comparison. End-point error (EPE) on Sintel Clean with C+T training; lower is better. RAFT has the lowest error of the compared methods.]

According to the ablation study, which component contributes the most to RAFT's accuracy?

Chapter 9: Connections

RAFT sits at a pivotal point in computer vision — it changed how the field thinks about optical flow and iterative refinement.

Relation to Classical Optimization

Horn-Schunck (1981) and TV-L1 maintain a flow field and iteratively refine it using gradient descent on an energy function. RAFT does the same thing — but learns the energy function and the update rule from data. The correlation volume replaces the hand-crafted data term. The GRU replaces the hand-crafted regularization and update step.

Relation to Deep Equilibrium Models

RAFT's weight-shared iterations can be viewed as a Deep Equilibrium Model (DEQ): find the fixed point f* where the update operator produces Δf = 0. DEQ models solve for this fixed point directly (e.g., via Broyden's method). RAFT uses explicit unrolling but converges to the same idea: the final flow is the fixed point of a learned operator.

Influence on Downstream Work

RAFT's design has been adopted far beyond optical flow:

Cheat Sheet

| Aspect | RAFT |
|---|---|
| Input | Image pair I1, I2 ∈ R^(H×W×3) |
| Output | Dense flow field [H, W, 2] |
| Feature encoder | ResNet-like, 1/8 res, 256 dims |
| Correlation | All-pairs 4D volume + 4-level pyramid |
| Update operator | Convolutional GRU, 2.7M params |
| Iterations | 12 (train), 12-32 (test), 100+ stable |
| Upsampling | Learned convex combination (8×) |
| Loss | Multi-iteration L1, γ = 0.8 |
| Training data | FlyingChairs + FlyingThings3D |
| Key result | 30% EPE reduction on Sintel |
| Total params | ~5.3M (small: ~1M) |
| Speed | 10 FPS on 1080Ti (1088×436) |
The broader lesson: When you have a structured problem with a known iterative solution shape (match, update, refine), learn the update rule instead of learning the answer directly. Weight sharing forces generality. Bounded activations ensure convergence. And building the full search space (all-pairs) upfront means you never have to worry about missing a correspondence.
What is the key conceptual link between RAFT and classical optical flow methods like Horn-Schunck?