RAFT — Veanors

Chapter 0: The Problem

You have two consecutive video frames. For every single pixel in frame 1, you want to know where that pixel moved to in frame 2. This is optical flow — a dense 2D displacement field that captures per-pixel motion between images.

Think of it this way: if you could paint every pixel a unique color and then photograph the scene a moment later, optical flow is the mapping that tells you where each color ended up. The output is a tensor of shape [H, W, 2] — two values (dx, dy) for every pixel, describing horizontal and vertical displacement.
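To make the [H, W, 2] layout concrete, here is a minimal NumPy sketch. The toy sizes and the uniform motion are assumptions chosen for illustration; channel 0 is taken to be dx and channel 1 to be dy, as in the text.

```python
import numpy as np

H, W = 4, 6
# Toy flow field: every pixel moves 2 px right and 1 px down.
flow = np.zeros((H, W, 2), dtype=np.float32)
flow[..., 0] = 2.0   # dx (horizontal displacement)
flow[..., 1] = 1.0   # dy (vertical displacement)

# Pixel coordinate grid: coords[y, x] = (x, y).
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
coords = np.stack([xs, ys], axis=-1).astype(np.float32)

# Where each frame-1 pixel lands in frame 2.
target = coords + flow
```

For example, the pixel at column 3, row 1 maps to column 5, row 2.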

Why is this hard? Three reasons make optical flow one of the oldest unsolved problems in computer vision:

Large displacements: A fast-moving object can jump 100+ pixels between frames. Searching that wide is expensive.
Textureless regions: A white wall has no features to match. Without texture, any displacement looks equally plausible.
Occlusions: Objects disappear behind other objects. A pixel in frame 1 may have no match in frame 2 at all.

The fundamental tension: Traditional methods trade off a data term (matching visual appearance) against a regularization term (enforcing smoothness). Get the balance wrong and you either hallucinate motion or oversmooth it away. The key question: can we learn this trade-off from data instead of hand-designing it?

Before RAFT, deep learning approaches used coarse-to-fine pyramids: estimate flow at low resolution, then progressively refine at higher resolutions. This works but has three problems: (1) errors at coarse levels can never be recovered, (2) small fast-moving objects get missed at low resolution, and (3) training requires 1M+ iterations to learn the multi-stage cascade.

Full data flow at a glance:
Image pair I₁, I₂ ∈ R^{H×W×3}
→ Feature encoder g_θ produces per-pixel features [H/8, W/8, 256] for both images
→ Context encoder h_θ extracts features from I₁ only
→ 4D correlation volume C ∈ R^{H/8 × W/8 × H/8 × W/8} from dot products of all feature pairs
→ Pool last 2 dims to form 4-level pyramid {C¹, C², C³, C⁴}
→ Initialize flow f₀ = 0
→ For k = 0..11: GRU update operator uses correlation lookups + current flow + context → Δf → f_{k+1} = f_k + Δf
→ Convex upsampling from [H/8, W/8, 2] to [H, W, 2].

Optical Flow Intuition

Interactive demo: dragging the circle in Frame 1 shows its flow vector, i.e. where that pixel moves to in Frame 2. The flow field is [H, W, 2]: two displacement values per pixel.

What is the output of an optical flow model?

A dense displacement field [H, W, 2] giving horizontal and vertical motion per pixel
A single global motion vector for the entire image
A binary mask showing which pixels moved

Chapter 1: The Key Insight

RAFT's central insight is to operate at a single resolution and iterate, rather than building a coarse-to-fine pyramid. This single change has cascading benefits.

Previous methods (FlowNet2, PWC-Net, LiteFlowNet) all followed the same playbook: estimate flow at 1/64 resolution, warp the second image, re-estimate flow at 1/32, warp again, and so on up to full resolution. Each stage has its own weights — no sharing. This means: if the coarse estimate is wrong, every subsequent stage tries to fix someone else's mistake.

RAFT takes a radically different approach. It maintains a single flow field at 1/8 resolution and applies the same update operator 12 times (or more — up to 100+ at inference without divergence). The weights are shared across all iterations. Each iteration receives the same information: what does the correlation volume say, and what does the current flow estimate look like?

Step 1: Extract Features

Feature encoder maps both images to [H/8, W/8, 256]. Context encoder maps I₁ to [H/8, W/8, 256]. Run once.

↓

Step 2: Build Correlation Volume

Dot product of all feature pairs → 4D volume [H/8, W/8, H/8, W/8]. Pool to form multi-scale pyramid. Run once.

↓

Step 3: Iterate (12×)

GRU-based update operator: look up correlations at current flow estimate, predict Δf, update f_k+1 = f_k + Δf. Same weights every iteration.

↓

Step 4: Upsample

Convex upsampling: [H/8, W/8, 2] → [H, W, 2] using learned 8×8×9 masks.

Why weight sharing matters: When you force all 12 updates to use the same parameters, you constrain the network to learn a general-purpose update rule — something that works at any iteration. This is exactly what first-order optimization algorithms do: the same step (gradient descent, ADMM, proximal gradient) is applied repeatedly until convergence. RAFT learns this step from data. Weight sharing also means only 2.7M parameters in the update operator, vs. 38M for FlowNetS.

What degrades when you change iterations: At training time, RAFT uses 12 iterations. At inference, you can run more. Going from 12 to 32 iterations consistently improves accuracy (the flow converges closer to the fixed point). Going below 6 degrades sharply — the flow hasn't converged. At 100+ iterations the updates become negligible (Δf → 0), confirming convergence. Unlike pyramid methods, where the number of stages is fixed by architecture, RAFT gracefully trades compute for accuracy.
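The compute-for-accuracy trade-off can be illustrated with a toy fixed-point iteration. This is not RAFT: the `update` function below is a hypothetical stand-in for the learned operator (a fixed contraction toward a known target), used only to show how one shared rule behaves as the iteration count changes.

```python
import numpy as np

def update(flow, target, step=0.3):
    """Hypothetical stand-in for the learned update operator: returns a delta."""
    return step * (target - flow)

def run(num_iters, target):
    flow = np.zeros_like(target)          # f_0 = 0
    deltas = []
    for _ in range(num_iters):
        delta = update(flow, target)      # same rule every iteration
        flow = flow + delta               # f_{k+1} = f_k + delta
        deltas.append(float(np.abs(delta).max()))
    return flow, deltas

target = np.array([10.0, -4.0])
errors = {}
for n in (6, 12, 32):
    flow, deltas = run(n, target)
    errors[n] = float(np.abs(flow - target).max())
```

In this toy setting the error shrinks monotonically with more iterations and the last deltas approach zero, which mirrors the qualitative behavior described above.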

Why does RAFT use weight-shared iterative updates instead of a coarse-to-fine pyramid?

Weight sharing constrains the network to learn a general update rule (like an optimization step), enables flexible iteration counts, uses far fewer parameters (2.7M vs 38M), and avoids irrecoverable coarse-level errors
It is faster to train on a single GPU
Coarse-to-fine methods require more training data

Chapter 2: Feature Extraction

RAFT uses two separate encoders. Both are simple residual networks with 6 residual blocks, producing features at 1/8 resolution. But they serve very different purposes.

Feature Encoder g_θ

The feature encoder is applied to both images. Its job is to extract appearance features that can be matched across frames. It maps each image from R^{H×W×3} to R^{H/8×W/8×256}.

The architecture: 2 residual blocks at 1/2 resolution, 2 at 1/4, and 2 at 1/8. Each block has instance normalization (not batch norm) for better generalization across datasets. The output is a 256-dimensional feature vector for every pixel at 1/8 resolution.

Context Encoder h_θ

The context encoder is applied to only I₁. Its job is to provide a persistent reference signal to the update operator — information about the scene structure, edges, and texture patterns that should guide how the flow evolves. It uses batch normalization (not instance norm) and outputs at the same [H/8, W/8, 256] resolution.

Frozen vs. Trained: Both encoders are trained from scratch along with the rest of the network. No pretrained backbones, no frozen weights. Feature encoder: trained, shared weights between I₁ and I₂. Context encoder: trained, separate weights. Update operator GRU: trained, shared across all 12 iterations. Convex upsampling mask: trained. Total parameters: ~5.3M. Small RAFT (1/5 params): ~1M.

Why two encoders? The feature encoder produces representations optimized for matching — features from corresponding pixels should have high dot products. The context encoder produces representations optimized for guiding updates — edge information, texture boundaries, and local structure that tells the update operator where flow discontinuities should be. Merging these two roles into one encoder degrades performance (ablation: +0.3 EPE on Sintel clean).

Component	Input	Output Shape	Parameters
Feature encoder g_θ	I₁ and I₂ (shared)	[H/8, W/8, 256]	~2.5M
Context encoder h_θ	I₁ only	[H/8, W/8, 256]	~2.5M
Update operator	correlation + flow + context	Δf [H/8, W/8, 2]	~2.7M
Upsampling mask	hidden state	[H/8, W/8, 8×8×9]	~0.1M

Why does RAFT use two separate encoders instead of one?

The feature encoder is optimized for cross-image matching (high dot products at correspondences), while the context encoder is optimized for guiding flow updates (edge and structure information from I₁)
Two encoders double the parameter count for better capacity
The context encoder processes depth information that the feature encoder cannot

Chapter 3: The 4D Correlation Volume

This is the heart of RAFT. After feature extraction, we have two feature maps: g_θ(I₁) ∈ R^{H/8×W/8×256} and g_θ(I₂) ∈ R^{H/8×W/8×256}. Now we need to measure how similar every pixel in I₁ is to every pixel in I₂.

Building the Volume

Take every feature vector in I₁ (indexed by i, j) and compute its dot product with every feature vector in I₂ (indexed by k, l):

C_ijkl = ∑_h g_θ(I₁)_ijh · g_θ(I₂)_klh

The result is a 4D tensor: C ∈ R^{H/8 × W/8 × H/8 × W/8}. For a 480×640 image at 1/8 resolution (60×80), this is 60×80×60×80 = 23 million entries. It sounds huge, but it's computed as a single matrix multiplication and takes only 17% of total inference time.
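The "single matrix multiplication" claim can be sketched in NumPy. The toy sizes and random features are assumptions; the text's real sizes are 60 × 80 × 256. The `einsum` line is only a cross-check that the matmul matches the index formula for C_ijkl.

```python
import numpy as np

rng = np.random.default_rng(0)
H8, W8, D = 6, 8, 16            # toy sizes; the text uses 60 x 80 x 256
f1 = rng.standard_normal((H8, W8, D)).astype(np.float32)
f2 = rng.standard_normal((H8, W8, D)).astype(np.float32)

# Flatten the spatial dims, then one matmul gives every pairwise dot product.
corr = f1.reshape(H8 * W8, D) @ f2.reshape(H8 * W8, D).T
corr = corr.reshape(H8, W8, H8, W8)

# Same thing written index by index: C_ijkl = sum_h f1_ijh * f2_klh.
corr_ref = np.einsum("ijh,klh->ijkl", f1, f2)

# Entry count for the 480 x 640 example in the text.
n_entries = 60 * 80 * 60 * 80
```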

The Correlation Pyramid

We pool the last two dimensions (the I₂ dimensions) with kernel sizes {1, 2, 4, 8} to create four levels:

C¹: [H/8, W/8, H/8, W/8] — full resolution, small displacements
C²: [H/8, W/8, H/16, W/16] — 2x pooled
C³: [H/8, W/8, H/32, W/32] — 4x pooled
C⁴: [H/8, W/8, H/64, W/64] — 8x pooled, large displacements

The crucial design: we pool only the I₂ dimensions, keeping the I₁ dimensions at full resolution. This means the network retains high-resolution information about where in I₁ each correspondence comes from, which is essential for recovering small fast-moving objects.
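A small sketch of the pyramid construction, assuming average pooling with stride equal to the kernel size. The helper name `avg_pool_last2` and the toy volume are illustrative, not taken from any library.

```python
import numpy as np

def avg_pool_last2(c, k):
    """Average-pool the last two axes of c with kernel k and stride k."""
    h1, w1, h2, w2 = c.shape
    c = c[:, :, : h2 - h2 % k, : w2 - w2 % k]      # drop any ragged border
    c = c.reshape(h1, w1, h2 // k, k, w2 // k, k)
    return c.mean(axis=(3, 5))

rng = np.random.default_rng(0)
corr = rng.standard_normal((8, 16, 8, 16)).astype(np.float32)

# Kernel sizes {1, 2, 4, 8}: only the I2 axes (last two) shrink.
pyramid = [avg_pool_last2(corr, k) for k in (1, 2, 4, 8)]
shapes = [p.shape for p in pyramid]
```

The first two axes stay at 8 × 16 on every level, while the last two go 8 × 16, 4 × 8, 2 × 4, 1 × 2.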

Why all pairs? Previous methods (PWC-Net, FlowNet2) computed correlations only in a local window around the current flow estimate. This means they can only find correspondences within that window. If the window is 4 pixels wide at 1/16 resolution, that's 64 pixels at full resolution. A fast-moving baseball at 200 pixels of displacement is invisible. RAFT computes correlations for all pairs, so no displacement is out of reach. The multi-scale pyramid means a radius-4 lookup at level C⁴ covers 256 pixels at full resolution.

Correlation Lookup

During each GRU iteration, the current flow estimate f_k maps each pixel x = (u, v) in I₁ to its estimated position x' = (u + f_k¹, v + f_k²) in I₂. Around x', we sample a local neighborhood of radius r = 4 from each pyramid level using bilinear interpolation. Level l of the pyramid is pooled by 2^(l−1), so its grid is centered at x'/2^(l−1), and a radius-4 neighborhood at level 4 spans 2³ × 4 = 32 pixels at 1/8 resolution (256 at full resolution). The lookups from all levels are concatenated into a single feature vector.

Tensor shapes through the lookup: At each iteration, for each pixel in I₁ (total H/8 × W/8 pixels), we look up a (2r+1) × (2r+1) = 9×9 = 81 correlation values from each of 4 pyramid levels = 324 values per pixel. Two conv layers reduce this to 256 dims. Together with flow features (128 dims) and context features (128 dims), the GRU input is 512 dimensions per pixel.
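The lookup for a single I₁ pixel might look like the sketch below. It uses plain Python loops for readability rather than speed, treats samples outside the volume as zero (an assumption), and indexes levels from 0 so that level `lvl` is pooled by `2**lvl`. The function names are illustrative.

```python
import numpy as np

def bilinear(img, x, y):
    """Sample 2D array img at float coords (x, y); zero outside the image."""
    h, w = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                out += wy * wx * img[yy, xx]
    return out

def lookup(pyramid, i, j, flow_ij, r=4):
    """Correlation features for the I1 pixel at (row i, col j) given its flow (dx, dy)."""
    x, y = j + flow_ij[0], i + flow_ij[1]          # estimated position in I2
    feats = []
    for lvl, corr in enumerate(pyramid):           # lvl = 0..3, pooled by 2**lvl
        cx, cy = x / 2**lvl, y / 2**lvl
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                feats.append(bilinear(corr[i, j], cx + dx, cy + dy))
    return np.array(feats, dtype=np.float32)

rng = np.random.default_rng(0)
c1 = rng.standard_normal((8, 8, 8, 8)).astype(np.float32)
pyramid = [c1]
for _ in range(3):                                  # halve the I2 axes three times
    p = pyramid[-1]
    h1, w1, h2, w2 = p.shape
    pyramid.append(p.reshape(h1, w1, h2 // 2, 2, w2 // 2, 2).mean(axis=(3, 5)))

feats = lookup(pyramid, i=3, j=4, flow_ij=(1.5, -0.5))   # 4 levels x 81 values
```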

4D Correlation Volume

Interactive demo: selecting a pixel on the I₁ grid (left) shows its correlation slice on the right, i.e. how similar that pixel is to every pixel in I₂. Brighter means higher correlation. The orange box shows the lookup neighborhood around the current flow estimate.

Why does RAFT pool only the I₂ dimensions (not I₁) when building the correlation pyramid?

Because I₂ has more features than I₁
To reduce GPU memory usage symmetrically
Keeping I₁ at full resolution preserves per-pixel location information needed to recover small fast-moving objects, while pooling I₂ extends the search range for large displacements

Chapter 4: Iterative Updates

RAFT initializes the flow field to zero everywhere: f₀ = 0. Then it runs 12 iterations (at training time) of the same update operator, each producing a small correction Δf that is added to the current estimate: f_k+1 = f_k + Δf.

Each iteration follows the same three steps:

Step 1: Gather Information

The update operator receives three inputs, concatenated into a single feature map:

Correlation features: Look up the correlation pyramid at the current flow estimate (as described in Ch. 3). Process through 2 conv layers. Output: 256 dims per pixel.
Flow features: Pass the current flow estimate f_k through 2 conv layers. This gives the network explicit access to what the current flow looks like. Output: 128 dims per pixel.
Context features: Directly injected from the context encoder (computed once). This provides a stable reference signal about the structure of I₁. Output: 128 dims per pixel.

Step 2: Update Hidden State (GRU)

The concatenated features (512 dims) and the previous hidden state h_k-1 (128 dims) are fed into a convolutional GRU cell. The GRU maintains a persistent hidden state that accumulates information across iterations — it remembers what it has learned from previous updates.

Step 3: Predict Δf

The GRU's output hidden state h_k passes through 2 convolutional layers to produce the flow update Δf ∈ R^{H/8×W/8×2}. This is added to the current flow to get f_{k+1}.

The optimization analogy: Think of each iteration as one step of gradient descent. The correlation lookup is like computing the gradient: "which direction should flow move to better align the two images?" The hidden state is like momentum: it accumulates evidence across iterations. The flow update Δf is the actual step. But unlike hand-crafted optimizers, the "gradient" and "momentum" are learned from data.

Engineering decision — detached gradients: When computing f_k+1 = f_k + Δf, the gradient flows only through the Δf branch. The f_k term is treated as a constant (gradients detached). This prevents exploding gradients through the long chain of 12 iterations and stabilizes training. It also means each iteration is trained to predict the right increment, not the right absolute value.

Iterative Flow Refinement

Interactive demo: the flow field converges from zero initialization, with each of the 12 steps adding a Δf correction. The plot shows total displacement error decreasing over iterations.

Why does RAFT detach gradients through f_k when computing f_k+1 = f_k + Δf?

To prevent exploding gradients through the chain of 12 iterations, and to train each iteration to predict the right increment Δf rather than the absolute flow value
Because f_k is a constant that never changes
To reduce GPU memory usage during backpropagation

Chapter 5: The Convolutional GRU

The update operator's core is a convolutional GRU (Gated Recurrent Unit), adapted from the sequence modeling world. Instead of fully connected layers, it uses 3×3 convolutions, processing the entire spatial feature map at once.

The GRU Equations

Given input x_t (the concatenated correlation + flow + context features) and previous hidden state h_t-1:

z_t = σ(Conv_3×3([h_{t−1}, x_t], W_z))

r_t = σ(Conv_3×3([h_{t−1}, x_t], W_r))

h̃_t = tanh(Conv_3×3([r_t ⊙ h_{t−1}, x_t], W_h))

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

Where σ is the sigmoid function, ⊙ is element-wise multiplication, [·, ·] is concatenation, and W_z, W_r, W_h are three separate sets of convolution weights.

Let's unpack each gate:

z_t (update gate): How much to replace old hidden state with new candidate. z = 0 means keep everything. z = 1 means replace everything. This is a "trust slider" between memory and new evidence.
r_t (reset gate): How much of the old hidden state to let through when computing the candidate. r = 0 means ignore the past entirely. r = 1 means use all of it. This lets the network "forget" if the old estimate was bad.
h̃_t (candidate): The proposed new hidden state, computed from gated past + current input.

Why GRU and not LSTM? The GRU has fewer parameters (2 gates vs. 3) and works just as well here. The key property RAFT needs is bounded activations (sigmoid and tanh). Because h_t is a gated blend of h_{t−1} and a tanh output, the hidden state stays bounded over 12+ iterations, which encourages convergence to a fixed point. An unconstrained linear RNN has no such bound and can diverge.
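The four GRU equations above can be written as a short NumPy sketch. The channel counts are toy values (the text uses 128 hidden and 512 input channels), the weights are random rather than trained, and `conv3x3` is a hand-rolled helper with zero padding; all of these are assumptions for illustration.

```python
import numpy as np

def conv3x3(x, w, b):
    """3x3 convolution with zero padding. x: [H, W, Cin], w: [3, 3, Cin, Cout]."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]), dtype=x.dtype) + b
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + h, dx:dx + wd] @ w[dy, dx]
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv_gru(h_prev, x, params):
    """One ConvGRU step: update gate, reset gate, candidate, blend."""
    hx = np.concatenate([h_prev, x], axis=-1)
    z = sigmoid(conv3x3(hx, *params["z"]))                 # update gate
    r = sigmoid(conv3x3(hx, *params["r"]))                 # reset gate
    rhx = np.concatenate([r * h_prev, x], axis=-1)
    h_cand = np.tanh(conv3x3(rhx, *params["h"]))           # candidate state
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
H8, W8, C_h, C_x = 5, 7, 8, 12        # toy sizes, not the sizes used in the text
params = {
    k: (0.1 * rng.standard_normal((3, 3, C_h + C_x, C_h)), np.zeros(C_h))
    for k in ("z", "r", "h")
}
h = np.zeros((H8, W8, C_h))
x = rng.standard_normal((H8, W8, C_x))
for _ in range(50):                    # iterate well past 12 steps
    h = conv_gru(h, x, params)
```

Since each new state is a convex blend of the previous state and a tanh output, starting from zeros keeps every entry of `h` within [−1, 1] however many steps are run.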

Separable ConvGRU

RAFT also experiments with a separable variant: replace the single 3×3 convolution with two sequential GRUs — one with 1×5 convolution and one with 5×1 convolution. This increases the receptive field from 3×3 to 5×5 without significantly increasing parameters, and gives a small accuracy boost (0.1 EPE on Sintel).

Tensor shapes through the GRU: Hidden state h ∈ R^{H/8 × W/8 × 128}. Input x ∈ R^{H/8 × W/8 × 512} (256 correlation + 128 flow + 128 context). All three gate convolutions: Conv_3×3(640 → 128), about 2.2M weights in total, which is most of the ~2.7M-parameter update operator. For the separable variant, the 1×5 GRU processes [h, x] → h', then the 5×1 GRU processes [h', x] → h_t. The receptive field grows from 3×3 to 5×5 with only a modest parameter increase.
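The parameter cost of the two variants can be checked with quick arithmetic. This sketch assumes the channel sizes quoted in the text (640 in, 128 out) and a bias on every convolution; the helper name is illustrative.

```python
c_in, c_out = 640, 128            # [h, x] channels in, hidden channels out (from the text)

def conv_params(kh, kw, cin, cout):
    """Weights plus biases of one 2D convolution."""
    return kh * kw * cin * cout + cout

# Standard ConvGRU: three 3x3 convolutions (z, r, candidate).
gru_3x3 = 3 * conv_params(3, 3, c_in, c_out)

# Separable variant: a 1x5 GRU followed by a 5x1 GRU, three convolutions each.
gru_sep = 3 * conv_params(1, 5, c_in, c_out) + 3 * conv_params(5, 1, c_in, c_out)

ratio = gru_sep / gru_3x3
```

Under these assumptions the separable variant costs roughly 11% more parameters than the 3×3 version.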

What property of the GRU's activation functions (sigmoid, tanh) is essential for RAFT's iterative design?

They make the network easier to parallelize on GPUs
They bound the hidden state, preventing unbounded growth over 12+ iterations and encouraging convergence to a fixed point
They reduce memory consumption compared to ReLU

Chapter 6: Convex Upsampling

The GRU outputs flow at 1/8 resolution: [H/8, W/8, 2]. But we need full-resolution flow: [H, W, 2]. Naive bilinear upsampling blurs edges and produces artifacts at motion boundaries. RAFT uses a learned convex upsampling scheme.

How It Works

For each pixel in the high-resolution output, RAFT expresses it as a convex combination of its 3×3 coarse neighbors. "Convex" means the weights are non-negative and sum to 1 (enforced by softmax).

Concretely, the GRU's hidden state is passed through two convolutional layers to produce a mask of shape [H/8, W/8, 8×8×9]. The 8×8 factor tiles each coarse pixel into 64 sub-pixels (to reach full resolution). The 9 values are the weights for the 3×3 neighborhood, passed through softmax. The full-resolution flow for each sub-pixel is:

f^up(x, y) = ∑_{n ∈ N_3×3} w_n(x, y) · f^coarse(n)

where w_n are the softmax-normalized weights and f^coarse(n) are the 9 coarse-resolution flow values in the neighborhood.
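The formula above can be sketched in NumPy as follows. The mask layout [h, w, 8, 8, 9], the zero padding at the border, and the helper name are assumptions for illustration. The flow values are also multiplied by the upsampling factor, since displacements are measured in pixels and the reference PyTorch code scales the coarse flow by 8 before combining.

```python
import numpy as np

def convex_upsample(flow, mask_logits, factor=8):
    """flow: [h, w, 2]; mask_logits: [h, w, factor, factor, 9] -> [h*factor, w*factor, 2]."""
    h, w, _ = flow.shape
    # Softmax over the 9 neighbours: weights are non-negative and sum to 1.
    m = np.exp(mask_logits - mask_logits.max(axis=-1, keepdims=True))
    m = m / m.sum(axis=-1, keepdims=True)
    # Gather each coarse pixel's 3x3 neighbourhood (zero padded): [h, w, 9, 2].
    fp = np.pad(flow, ((1, 1), (1, 1), (0, 0)))
    nbrs = np.stack(
        [fp[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)], axis=2
    )
    nbrs = factor * nbrs                                   # rescale displacements to full-res pixels
    up = np.einsum("hwabn,hwnc->hwabc", m, nbrs)           # [h, w, factor, factor, 2]
    return up.transpose(0, 2, 1, 3, 4).reshape(h * factor, w * factor, 2)

rng = np.random.default_rng(0)
coarse = rng.standard_normal((3, 4, 2))
logits = rng.standard_normal((3, 4, 8, 8, 9))
up = convex_upsample(coarse, logits)                       # shape (24, 32, 2)
```

If the logits put almost all weight on the centre neighbour (index 4), every sub-pixel simply copies its own coarse flow value, scaled by 8.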

Why convex, not deconvolution? A transposed convolution (deconvolution) can produce negative weights and checkerboard artifacts. Convex upsampling guarantees that the output is always a weighted average of valid flow values — it cannot hallucinate flow that doesn't exist in the coarse field. The softmax ensures the weights are a proper probability distribution. At motion boundaries (e.g., a foreground object against a background), the weights become highly concentrated on one side, producing a sharp edge.

Implementation detail: In PyTorch, this is implemented using torch.nn.functional.unfold to extract 3×3 neighborhoods, then multiplying by the mask weights and reshaping. The mask prediction head adds ~0.1M parameters. At inference, only the final iteration's flow needs to be upsampled. During training, the reference implementation upsamples every intermediate prediction so that the multi-iteration loss can be computed against full-resolution ground truth.

Convex vs. Bilinear Upsampling

Interactive demo: a motion boundary upsampled two ways. Left: bilinear (blurry edge). Right: convex (sharp edge). The convex weights concentrate on one side at the boundary.

Why does convex upsampling produce sharper motion boundaries than bilinear upsampling?

Because convex upsampling uses more neighboring pixels
Because the learned softmax weights can concentrate on one side of a boundary, avoiding blending between foreground and background flow values
Because it uses a larger kernel size for upsampling

Chapter 7: Training

RAFT is trained end-to-end on synthetic data, then optionally finetuned on real data. The training schedule and loss design are carefully engineered for convergence.

Loss Function

RAFT supervises on all 12 intermediate flow predictions, not just the final one. The loss is the L1 distance between predicted and ground truth flow, summed across all iterations with exponentially increasing weights:

L = ∑_{i=1}^{N} γ^{N−i} · ‖f_gt − f_i‖₁

where γ = 0.8. This means iteration 12 gets weight 1.0, iteration 11 gets 0.8, iteration 10 gets 0.64, and so on. Early iterations are supervised more lightly because they naturally have worse predictions. But they still get gradient signal, which helps the GRU learn to make useful early updates.
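The weighting scheme is easy to check numerically. In this sketch the per-pixel L1 distance is averaged over the image, which is an assumption about normalization; `sequence_loss` is an illustrative name.

```python
import numpy as np

def sequence_loss(preds, flow_gt, gamma=0.8):
    """preds: list of N flow fields [H, W, 2], ordered from first to last iteration."""
    n = len(preds)
    total = 0.0
    for i, pred in enumerate(preds, start=1):
        weight = gamma ** (n - i)                          # last iteration gets weight 1
        total += weight * np.abs(flow_gt - pred).sum(axis=-1).mean()
    return total

# Per-iteration weights for N = 12 and gamma = 0.8.
weights = [0.8 ** (12 - i) for i in range(1, 13)]

# Tiny example: two predictions, the first all zeros, the second exact.
gt = np.ones((2, 3, 2))
loss = sequence_loss([np.zeros_like(gt), gt], gt)
```

With N = 12 the first iteration's weight is about 0.086, so early predictions contribute roughly a tenth as much as the final one.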

Training Schedule

Stage	Dataset	Iterations	Batch Size	Image Crop
1. Pretrain	FlyingChairs (C)	100K	12	368 × 496
2. Pretrain	FlyingThings3D (T)	100K	6	400 × 720
3. Finetune (Sintel)	Sintel + KITTI + HD1K	100K	6	368 × 768
4. Finetune (KITTI)	KITTI-2015	50K	6	288 × 960

Optimizer: AdamW with gradient clipping to [−1, 1]. Learning rate schedule: one-cycle policy with warmup. Training hardware: 2× 2080Ti GPUs. Total C+T training: ~200K iterations, far less than PWC-Net (1M+) or FlowNet2 (hundreds of thousands per stage). The C+T schedule is the standard "generalization test" — how well does the model transfer to Sintel/KITTI without seeing any real data?

What degrades when training changes: Training on FlyingChairs alone: Sintel clean EPE 3.12 (vs. 1.43 with C+T). The curriculum matters — Chairs provides basic motion patterns, Things3D adds complex occlusions and realistic objects. Reducing iterations from 12 to 6 at training time: +0.15 EPE on Sintel. The network needs enough iterations during training to learn the convergence behavior. Removing the exponential weighting (γ = 1, equal weights): +0.08 EPE — the network wastes capacity trying to make early iterations perfect instead of focusing on the final estimate.

Why does RAFT use exponentially increasing weights (γ = 0.8) in its multi-iteration loss?

Later iterations have better flow estimates and should be supervised more strongly, while early iterations still get gradient signal to learn useful initial updates without being penalized for naturally worse predictions
To compensate for gradient vanishing in the recurrent chain
To make training converge in fewer GPU hours

Chapter 8: Results

RAFT sets the state of the art on both major optical flow benchmarks, with particularly impressive cross-dataset generalization.

Sintel (MPI Sintel, realistic animated scenes)

Method	Training	Clean EPE ↓	Final EPE ↓
FlowNet2	C+T	2.02	3.54
PWC-Net	C+T	2.55	3.93
VCN	C+T	2.21	3.68
RAFT	C+T	1.43	2.71
After finetuning:
FlowNet2	C+T+S/K	4.16 (test)	5.74 (test)
RAFT	C+T+S/K	1.61 (test)	2.86 (test)

On Sintel test, RAFT achieves 1.61 EPE (clean) and 2.86 EPE (final) — 30-36% error reduction over all prior work.

KITTI-2015 (real driving scenes)

Method	Training	EPE ↓	F1-all ↓
MaskFlowNet	C+T	-	23.1%
VCN	C+T	8.36	25.1%
RAFT	C+T	5.04	17.4%
RAFT (finetuned)	C+T+S+K	0.63 (train)	5.10% (test)

On KITTI test, RAFT achieves F1-all of 5.10% — a 16% error reduction from the previous best (6.10%).
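Both metrics in these tables are simple to compute. The sketch below assumes the usual definitions: EPE is the Euclidean distance between predicted and true flow vectors, averaged over pixels, and F1-all is the percentage of pixels whose error exceeds both 3 px and 5% of the true flow magnitude. Valid-pixel masks, which real KITTI evaluation needs, are omitted, and the function names are illustrative.

```python
import numpy as np

def epe_map(flow_pred, flow_gt):
    """Per-pixel end-point error: Euclidean distance between flow vectors."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1)

def f1_all(flow_pred, flow_gt):
    """Percentage of outliers: error > 3 px AND > 5% of the true magnitude."""
    err = epe_map(flow_pred, flow_gt)
    mag = np.linalg.norm(flow_gt, axis=-1)
    outlier = (err > 3.0) & (err > 0.05 * mag)
    return 100.0 * outlier.mean()

# Toy 2x2 example: true flow is 10 px to the right everywhere.
gt = np.zeros((2, 2, 2))
gt[..., 0] = 10.0
pred = gt.copy()
pred[0, 0] += (3.0, 4.0)                  # one pixel is off by 5 px

mean_epe = epe_map(pred, gt).mean()       # one 5 px error over four pixels
outlier_pct = f1_all(pred, gt)            # one outlier out of four pixels

# Relative error reduction quoted for KITTI test: 6.10% -> 5.10%.
reduction = (6.10 - 5.10) / 6.10
```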

Key Ablations

Ablation	Sintel Clean EPE	Change
Full RAFT	1.43	—
Replace all-pairs with local correlation	1.73	+21%
Replace convex upsample with bilinear	1.68	+17%
Remove multi-scale pyramid (single level)	1.62	+13%
No weight sharing (12 separate decoders)	1.71	+20%
6 iterations instead of 12	1.58	+10%

Efficiency: RAFT processes 1088×436 video at 10 FPS on a 1080Ti. Small RAFT (1/5 parameters) runs at 20 FPS while still outperforming all prior methods on Sintel. The 10× fewer training iterations compared to prior methods is a major practical advantage.

Results Comparison

End-point error (EPE) on Sintel Clean (C+T training). Lower is better. RAFT dramatically reduces error compared to prior methods.

According to the ablation study, which component contributes the most to RAFT's accuracy?

The all-pairs correlation volume (removing it causes the largest EPE increase of +21%) and weight sharing across iterations (+20%)
The convex upsampling module
Using 12 iterations instead of 6

Chapter 9: Connections

RAFT sits at a pivotal point in computer vision — it changed how the field thinks about optical flow and iterative refinement.

Relation to Classical Optimization

Horn-Schunck (1981) and TV-L1 maintain a flow field and iteratively refine it using gradient descent on an energy function. RAFT does the same thing — but learns the energy function and the update rule from data. The correlation volume replaces the hand-crafted data term. The GRU replaces the hand-crafted regularization and update step.
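For reference, the Horn-Schunck energy makes the data term and the regularization term explicit. The symbols here are not from the text above: (u, v) are the horizontal and vertical flow components, I_x, I_y, I_t are the spatial and temporal image derivatives, and α is the weight that sets the balance between the two terms.

```latex
E(u, v) = \iint
  \underbrace{\left( I_x u + I_y v + I_t \right)^2}_{\text{data term}}
  + \alpha^2
  \underbrace{\left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right)}_{\text{smoothness term}}
  \, dx \, dy
```

In the analogy drawn above, RAFT's correlation lookups play the role of the first term and the learned GRU update absorbs the role of the second.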

Relation to Deep Equilibrium Models

RAFT's weight-shared iterations can be viewed as a Deep Equilibrium Model (DEQ): find the fixed point f* where the update operator produces Δf = 0. DEQ models solve for this fixed point directly (e.g., via Broyden's method). RAFT uses explicit unrolling but converges to the same idea: the final flow is the fixed point of a learned operator.

Influence on Downstream Work

RAFT's design has been adopted far beyond optical flow:

RAFT-Stereo (2021): Same architecture for stereo disparity estimation. The 4D volume becomes 3D (only horizontal displacement).
RAFT-3D (2021): Extends to scene flow (3D motion) by operating on rigid-body motion fields.
CoTracker: Long-range point tracking uses iterative correlation lookups inspired by RAFT.
FlowFormer (2022): Replaces the GRU with a transformer but keeps the all-pairs correlation volume.

Cheat Sheet

Aspect	RAFT
Input	Image pair I₁, I₂ ∈ R^{H×W×3}
Output	Dense flow field [H, W, 2]
Feature encoder	ResNet-like, 1/8 res, 256 dims
Correlation	All-pairs 4D volume + 4-level pyramid
Update operator	Convolutional GRU, 2.7M params
Iterations	12 (train), 12-32 (test), 100+ stable
Upsampling	Learned convex combination (8×)
Loss	Multi-iteration L1, γ = 0.8
Training data	FlyingChairs + FlyingThings3D
Key result	30% EPE reduction on Sintel
Total params	~5.3M (small: ~1M)
Speed	10 FPS on 1080Ti (1088×436)

The broader lesson: When you have a structured problem with a known iterative solution shape (match, update, refine), learn the update rule instead of learning the answer directly. Weight sharing forces generality. Bounded activations ensure convergence. And building the full search space (all-pairs) upfront means you never have to worry about missing a correspondence.

What is the key conceptual link between RAFT and classical optical flow methods like Horn-Schunck?

Both maintain a single flow field and iteratively refine it, but RAFT learns the update rule and similarity measure from data instead of hand-crafting them
Both use the same loss function
Both use coarse-to-fine pyramids for large displacements

RAFT: Recurrent All-Pairs Field Transforms