Estimate dense optical flow by building a 4D correlation volume over all pixel pairs, then iteratively refining the flow field with a lightweight GRU — 12 identical update steps that mimic first-order optimization.
You have two consecutive video frames. For every single pixel in frame 1, you want to know where that pixel moved to in frame 2. This is optical flow — a dense 2D displacement field that captures per-pixel motion between images.
Think of it this way: if you could paint every pixel a unique color and then photograph the scene a moment later, optical flow is the mapping that tells you where each color ended up. The output is a tensor of shape [H, W, 2] — two values (dx, dy) for every pixel, describing horizontal and vertical displacement.
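To make the [H, W, 2] layout concrete, here is a minimal NumPy sketch. The tiny 4×6 field and the constant displacement are made-up illustration values, not anything from RAFT itself.

```python
import numpy as np

H, W = 4, 6

# Hypothetical flow field: every pixel moves 2 px right and 1 px down.
flow = np.zeros((H, W, 2), dtype=np.float32)
flow[..., 0] = 2.0  # dx, horizontal displacement
flow[..., 1] = 1.0  # dy, vertical displacement

# Pixel coordinate grids for frame 1: xs[y, x] = x and ys[y, x] = y.
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

# Where each frame-1 pixel lands in frame 2.
target_x = xs + flow[..., 0]
target_y = ys + flow[..., 1]
```

Reading `target_x[1, 3]` and `target_y[1, 3]` gives the frame-2 position of the pixel at row 1, column 3.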
Why is this hard? Three reasons make optical flow one of the oldest unsolved problems in computer vision:
Before RAFT, deep learning approaches used coarse-to-fine pyramids: estimate flow at low resolution, then progressively refine at higher resolutions. This works but has three problems: (1) errors at coarse levels can never be recovered, (2) small fast-moving objects get missed at low resolution, and (3) training requires 1M+ iterations to learn the multi-stage cascade.
Interactive figure: dragging the circle in Frame 1 (left panel) shows an arrow for the flow vector, i.e. where that pixel moves to in Frame 2. The flow field is [H, W, 2]: two displacement values per pixel.
RAFT's central insight is to operate at a single resolution and iterate, rather than building a coarse-to-fine pyramid. This single change has cascading benefits.
Previous methods (FlowNet2, PWC-Net, LiteFlowNet) all followed the same playbook: estimate flow at 1/64 resolution, warp the second image, re-estimate flow at 1/32, warp again, and so on up to full resolution. Each stage has its own weights — no sharing. This means: if the coarse estimate is wrong, every subsequent stage tries to fix someone else's mistake.
RAFT takes a radically different approach. It maintains a single flow field at 1/8 resolution and applies the same update operator 12 times (or more — up to 100+ at inference without divergence). The weights are shared across all iterations. Each iteration receives the same information: what does the correlation volume say, and what does the current flow estimate look like?
RAFT uses two separate encoders. Both are simple residual networks with 6 residual blocks, producing features at 1/8 resolution. But they serve very different purposes.
The feature encoder is applied to both images. Its job is to extract appearance features that can be matched across frames. It maps each image from $\mathbb{R}^{H \times W \times 3}$ to $\mathbb{R}^{H/8 \times W/8 \times 256}$.
The architecture: 2 residual blocks at 1/2 resolution, 2 at 1/4, and 2 at 1/8. Each block has instance normalization (not batch norm) for better generalization across datasets. The output is a 256-dimensional feature vector for every pixel at 1/8 resolution.
The context encoder is applied to only I1. Its job is to provide a persistent reference signal to the update operator — information about the scene structure, edges, and texture patterns that should guide how the flow evolves. It uses batch normalization (not instance norm) and outputs at the same [H/8, W/8, 256] resolution.
| Component | Input | Output Shape | Parameters |
|---|---|---|---|
| Feature encoder gθ | I1 and I2 (shared) | [H/8, W/8, 256] | ~2.5M |
| Context encoder hθ | I1 only | [H/8, W/8, 256] | ~2.5M |
| Update operator | correlation + flow + context | Δf [H/8, W/8, 2] | ~2.7M |
| Upsampling mask | hidden state | [H/8, W/8, 8×8×9] | ~0.1M |
This is the heart of RAFT. After feature extraction, we have two feature maps: $g_\theta(I_1) \in \mathbb{R}^{H/8 \times W/8 \times 256}$ and $g_\theta(I_2) \in \mathbb{R}^{H/8 \times W/8 \times 256}$. Now we need to measure how similar every pixel in $I_1$ is to every pixel in $I_2$.
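Take every feature vector in $I_1$ (indexed by $i, j$) and compute its dot product with every feature vector in $I_2$ (indexed by $k, l$), summing over the feature channel $h$:

$$C_{ijkl} = \sum_{h} g_\theta(I_1)_{ijh} \cdot g_\theta(I_2)_{klh}$$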
The result is a 4D tensor: $C \in \mathbb{R}^{H/8 \times W/8 \times H/8 \times W/8}$. For a 480×640 image at 1/8 resolution (60×80), this is 60×80×60×80 ≈ 23 million entries. It sounds huge, but it's computed as a single matrix multiplication and takes only 17% of total inference time.
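The "single matrix multiplication" claim is easy to check in a few lines. The following is a minimal NumPy sketch, not RAFT's actual code: the function name, the small feature maps, and the absence of any normalization are all assumptions made for illustration.

```python
import numpy as np

def all_pairs_correlation(fmap1, fmap2):
    """fmap1, fmap2: [H, W, D] feature maps. Returns C with shape [H, W, H, W]."""
    H, W, D = fmap1.shape
    a = fmap1.reshape(H * W, D)   # one row per I1 pixel
    b = fmap2.reshape(H * W, D)   # one row per I2 pixel
    corr = a @ b.T                # [H*W, H*W]: every dot product at once
    return corr.reshape(H, W, H, W)

# Toy feature maps (random values, 16 channels instead of 256).
rng = np.random.default_rng(0)
f1 = rng.standard_normal((6, 8, 16))
f2 = rng.standard_normal((6, 8, 16))
C = all_pairs_correlation(f1, f2)

# Entry count for a 480x640 input at 1/8 resolution.
num_entries = 60 * 80 * 60 * 80
```

Here `C[i, j, k, l]` is the dot product of the I1 feature at (i, j) with the I2 feature at (k, l), and `num_entries` comes out to 23,040,000.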
We pool the last two dimensions (the I2 dimensions) with kernel sizes {1, 2, 4, 8} to create four levels, whose I2 dimensions shrink by factors of 1, 2, 4, and 8 respectively.
The crucial design: we pool only the I2 dimensions, keeping the I1 dimensions at full resolution. This means the network retains high-resolution information about where in I1 each correspondence comes from, which is essential for recovering small fast-moving objects.
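Here is a rough sketch of that pooling in NumPy. It assumes average pooling with stride equal to the kernel size, and the helper names and toy volume size are hypothetical.

```python
import numpy as np

def avg_pool_last2(corr, k):
    """Average-pool only the last two (I2) axes of corr [H, W, H2, W2], kernel = stride = k."""
    H, W, H2, W2 = corr.shape
    h2, w2 = H2 // k, W2 // k
    cropped = corr[:, :, :h2 * k, :w2 * k]
    # Split each I2 axis into (cells, k) and average within every k x k cell.
    return cropped.reshape(H, W, h2, k, w2, k).mean(axis=(3, 5))

def build_pyramid(corr, kernels=(1, 2, 4, 8)):
    return [avg_pool_last2(corr, k) for k in kernels]

# Toy correlation volume: an 8x16 feature map correlated with itself-sized I2.
corr = np.random.default_rng(0).standard_normal((8, 16, 8, 16))
pyramid = build_pyramid(corr)
shapes = [level.shape for level in pyramid]
```

The first two axes (the I1 position) keep their size at every level; only the I2 axes shrink.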
During each GRU iteration, the current flow estimate $f_k = (f_k^1, f_k^2)$ maps each pixel $x = (u, v)$ in $I_1$ to its estimated position $x' = (u + f_k^1, v + f_k^2)$ in $I_2$. Around $x'$, we sample a local neighborhood of radius $r = 4$ from each pyramid level using bilinear interpolation. At a level pooled by a factor $s$, the grid is centered at $x'/s$, so a radius-4 neighborhood at the coarsest level ($s = 8$) covers $8 \times 4 = 32$ pixels of the 1/8-resolution map (256 pixels at full resolution). The lookups from all levels are concatenated into a single feature vector.
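The lookup for a single I1 pixel can be sketched as follows. This is an illustrative NumPy version under a few assumptions: `x'` is divided by each level's pooling factor, samples outside the map read as zero, and the function names and toy sizes are made up.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample 2D array img (indexed [row, col]) at real-valued (x, y); outside reads as 0."""
    H, W = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    wx, wy = x - x0, y - y0
    val = 0.0
    for yy, wyy in ((y0, 1.0 - wy), (y0 + 1, wy)):
        for xx, wxx in ((x0, 1.0 - wx), (x0 + 1, wx)):
            if 0 <= yy < H and 0 <= xx < W:
                val += wyy * wxx * img[yy, xx]
    return val

def lookup(pyramid, factors, u, v, flow_uv, radius=4):
    """Correlation features for I1 pixel (u, v) given its current flow (dx, dy)."""
    xp, yp = u + flow_uv[0], v + flow_uv[1]   # estimated position x' in I2
    feats = []
    for level, s in zip(pyramid, factors):
        corr_slice = level[v, u]              # similarity of (u, v) to all of I2
        cx, cy = xp / s, yp / s               # x' in this level's coordinates
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                feats.append(bilinear_sample(corr_slice, cx + dx, cy + dy))
    return np.array(feats)

# Toy 4-level pyramid built by average-pooling the I2 axes.
rng = np.random.default_rng(0)
corr = rng.standard_normal((8, 16, 8, 16))
factors = (1, 2, 4, 8)
pyramid = [corr.reshape(8, 16, 8 // s, s, 16 // s, s).mean(axis=(3, 5)) for s in factors]

feats = lookup(pyramid, factors, u=5, v=3, flow_uv=(1.5, -0.25))
```

With radius 4, each level contributes a 9×9 grid, so the four levels together give 4 × 81 = 324 values per pixel.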
Interactive figure: selecting a pixel on the I1 grid (left) shows its correlation slice on the right, i.e. how similar that pixel is to every pixel in I2. Brighter = higher correlation. The orange box shows the lookup neighborhood around the current flow estimate.
RAFT initializes the flow field to zero everywhere: $f_0 = 0$. Then it runs 12 iterations (at training time) of the same update operator, each producing a small correction $\Delta f$ that is added to the current estimate: $f_{k+1} = f_k + \Delta f$.
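The control flow is simple enough to sketch directly. In the NumPy snippet below, `toy_update` is a hypothetical stand-in for the learned update operator (it just moves halfway toward a known target), so only the loop structure reflects RAFT.

```python
import numpy as np

def refine(update_fn, shape, num_iters=12):
    """Run the same update function num_iters times, starting from zero flow."""
    flow = np.zeros(shape + (2,))      # f_0 = 0
    history = [flow.copy()]
    for _ in range(num_iters):
        delta = update_fn(flow)        # same operator, same weights, every step
        flow = flow + delta            # f_{k+1} = f_k + delta_f
        history.append(flow.copy())
    return flow, history

# Hypothetical target flow and toy update rule, for illustration only.
target = np.full((4, 5, 2), 3.0)

def toy_update(flow):
    return 0.5 * (target - flow)

flow, history = refine(toy_update, shape=(4, 5))
errors = [float(np.abs(f - target).max()) for f in history]
```

Because the operator is shared, running more iterations at inference is just a matter of raising `num_iters`.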
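Each iteration follows the same three steps: look up correlation features around the current flow estimate, update the GRU hidden state, and predict a flow correction Δf that is added to the estimate.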
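The update operator receives three inputs, concatenated into a single feature map: the correlation features looked up from the pyramid, the current flow estimate, and the context features from the context encoder.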
The concatenated features (512 dims) and the previous hidden state $h_{k-1}$ (128 dims) are fed into a convolutional GRU cell. The GRU maintains a persistent hidden state that accumulates information across iterations, so it remembers what it has learned from previous updates.
The GRU's output hidden state $h_k$ passes through 2 convolutional layers to produce the flow update $\Delta f \in \mathbb{R}^{H/8 \times W/8 \times 2}$. This is added to the current flow to get $f_{k+1}$.
Interactive figure: the flow field converging from zero initialization, with each step adding a Δf correction. The plot shows total displacement error decreasing over iterations.
The update operator's core is a convolutional GRU (Gated Recurrent Unit), adapted from the sequence modeling world. Instead of fully connected layers, it uses 3×3 convolutions, processing the entire spatial feature map at once.
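Given input $x_t$ (the concatenated correlation + flow + context features) and previous hidden state $h_{t-1}$:

$$z_t = \sigma(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_z))$$

$$r_t = \sigma(\mathrm{Conv}_{3\times3}([h_{t-1}, x_t], W_r))$$

$$\tilde{h}_t = \tanh(\mathrm{Conv}_{3\times3}([r_t \odot h_{t-1}, x_t], W_h))$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$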
Where σ is the sigmoid function, ⊙ is element-wise multiplication, and [·, ·] is concatenation.
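A convolutional GRU step is short enough to write out by hand. The NumPy sketch below assumes the standard GRU gate structure (update gate z, reset gate r, candidate state); the channel sizes, random weights, and helper names are toy choices, not RAFT's.

```python
import numpy as np

def conv3x3(x, w, b):
    """'Same' 3x3 convolution. x: [H, W, Cin], w: [3, 3, Cin, Cout], b: [Cout]."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1])) + b
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + H, dx:dx + W] @ w[dy, dx]
    return out

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv_gru_step(h_prev, x, params):
    hx = np.concatenate([h_prev, x], axis=-1)          # [h, x]
    z = sigmoid(conv3x3(hx, *params["z"]))             # update gate
    r = sigmoid(conv3x3(hx, *params["r"]))             # reset gate
    rhx = np.concatenate([r * h_prev, x], axis=-1)     # [r * h, x]
    h_tilde = np.tanh(conv3x3(rhx, *params["h"]))      # candidate state
    return (1.0 - z) * h_prev + z * h_tilde            # gated blend

# Toy sizes: 8 hidden channels and 12 input channels (RAFT uses 128 and 512).
rng = np.random.default_rng(0)
hidden, inp = 8, 12

def init(cin, cout):
    return 0.1 * rng.standard_normal((3, 3, cin, cout)), np.zeros(cout)

params = {name: init(hidden + inp, hidden) for name in ("z", "r", "h")}
h = np.zeros((5, 7, hidden))
x = rng.standard_normal((5, 7, inp))
for _ in range(3):
    h = conv_gru_step(h, x, params)
```

The hidden state keeps its spatial shape across steps, and because the output is a blend of the previous state and a tanh output, its values stay within [-1, 1] when started from zero.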
Let's unpack each gate:
RAFT also experiments with a separable variant: replace the single 3×3 convolution with two sequential GRUs — one with 1×5 convolution and one with 5×1 convolution. This increases the receptive field from 3×3 to 5×5 without significantly increasing parameters, and gives a small accuracy boost (0.1 EPE on Sintel).
The GRU outputs flow at 1/8 resolution: [H/8, W/8, 2]. But we need full-resolution flow: [H, W, 2]. Naive bilinear upsampling blurs edges and produces artifacts at motion boundaries. RAFT uses a learned convex upsampling scheme.
For each pixel in the high-resolution output, RAFT expresses it as a convex combination of its 3×3 coarse neighbors. "Convex" means the weights are non-negative and sum to 1 (enforced by softmax).
Concretely, the GRU's hidden state is passed through two convolutional layers to produce a mask of shape [H/8, W/8, 8×8×9]. The 8×8 factor tiles each coarse pixel into 64 sub-pixels (to reach full resolution). The 9 values are the weights for the 3×3 neighborhood, passed through softmax. The full-resolution flow for each sub-pixel is:

$$f_{\text{full}} = \sum_{n=1}^{9} w_n \, f_{\text{coarse}}(n)$$

where $w_n$ are the softmax-normalized weights and $f_{\text{coarse}}(n)$ are the 9 coarse-resolution flow values in the neighborhood.
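The convex combination can be sketched in NumPy as below. This is an illustrative version with assumptions: the mask is laid out as [H, W, 8, 8, 9], borders are zero-padded, and flow values are left in coarse-pixel units (converting them to full-resolution pixels would additionally multiply by the upsampling factor).

```python
import numpy as np

def convex_upsample(flow, mask_logits, factor=8):
    """flow: [H, W, 2] coarse flow; mask_logits: [H, W, factor, factor, 9].
    Returns upsampled flow of shape [H*factor, W*factor, 2]."""
    H, W, _ = flow.shape
    # Softmax over the 9 neighbours: non-negative weights that sum to 1.
    m = mask_logits - mask_logits.max(axis=-1, keepdims=True)
    w = np.exp(m)
    w /= w.sum(axis=-1, keepdims=True)
    # 3x3 coarse neighbourhood of every coarse pixel: [H, W, 9, 2].
    padded = np.pad(flow, ((1, 1), (1, 1), (0, 0)))
    neigh = np.stack(
        [padded[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
        axis=2,
    )
    # Weighted sum per sub-pixel: [H, W, factor, factor, 2].
    up = np.einsum("hwabn,hwnc->hwabc", w, neigh)
    # Interleave the sub-pixels into the full-resolution grid.
    return up.transpose(0, 2, 1, 3, 4).reshape(H * factor, W * factor, 2)

# Toy check: logits that put almost all weight on the centre neighbour.
rng = np.random.default_rng(0)
flow = rng.standard_normal((3, 4, 2))
logits = np.zeros((3, 4, 8, 8, 9))
logits[..., 4] = 50.0
up = convex_upsample(flow, logits)
```

With the weight concentrated on the centre neighbour, the result reduces to nearest-neighbour upsampling, which shows how the mask can copy one side of a motion boundary instead of blending across it.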
Implementation note: use torch.nn.functional.unfold to extract the 3×3 neighborhoods, then multiply by the mask weights and reshape. The mask prediction head adds ~0.1M parameters. The full-resolution flow is only needed at the final iteration during training (all intermediate predictions can be kept at 1/8 resolution and upsampled bilinearly for the loss), reducing memory pressure.

Interactive figure: how a motion boundary is upsampled. Left: bilinear (blurry edge). Right: convex (sharp edge). The convex weights concentrate on one side at the boundary.
RAFT is trained end-to-end on synthetic data, then optionally finetuned on real data. The training schedule and loss design are carefully engineered for convergence.
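RAFT supervises on all 12 intermediate flow predictions, not just the final one. The loss is the L1 distance between predicted and ground truth flow, summed across all iterations with exponentially increasing weights. Writing $f_i$ for the prediction after iteration $i$ and $N = 12$ for the number of iterations:

$$\mathcal{L} = \sum_{i=1}^{N} \gamma^{N-i} \, \lVert f_{gt} - f_i \rVert_1$$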
where γ = 0.8. This means iteration 12 gets weight 1.0, iteration 11 gets 0.8, iteration 10 gets 0.64, and so on. Early iterations are supervised more lightly because they naturally have worse predictions. But they still get gradient signal, which helps the GRU learn to make useful early updates.
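The weighting is easy to verify numerically. The sketch below is a NumPy illustration; averaging the per-pixel L1 distance over the image is an assumed normalization, and the function name is hypothetical.

```python
import numpy as np

def sequence_loss(preds, gt, gamma=0.8):
    """preds: list of [H, W, 2] flow predictions, one per iteration; gt: [H, W, 2]."""
    n = len(preds)
    total = 0.0
    for i, pred in enumerate(preds, start=1):
        weight = gamma ** (n - i)                      # last iteration gets weight 1
        l1 = np.abs(pred - gt).sum(axis=-1).mean()     # per-pixel L1, averaged
        total += weight * l1
    return total

# Weights for iterations 1..12 with gamma = 0.8.
weights = [0.8 ** (12 - i) for i in range(1, 13)]

# Tiny example: two predictions, each off by 1 in both components.
gt = np.zeros((2, 3, 2))
loss = sequence_loss([gt + 1.0, gt + 1.0], gt)
```

The last three weights come out as 0.64, 0.8, and 1.0, matching the description above, while the first iteration's weight is below 0.1.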
| Stage | Dataset | Iterations | Batch Size | Image Crop |
|---|---|---|---|---|
| 1. Pretrain | FlyingChairs (C) | 100K | 12 | 368 × 496 |
| 2. Pretrain | FlyingThings3D (T) | 100K | 6 | 400 × 720 |
| 3. Finetune (Sintel) | Sintel + KITTI + HD1K | 100K | 6 | 368 × 768 |
| 4. Finetune (KITTI) | KITTI-2015 | 50K | 6 | 288 × 960 |
RAFT sets the state of the art on both major optical flow benchmarks, with particularly impressive cross-dataset generalization.
| Method | Training | Clean EPE ↓ | Final EPE ↓ |
|---|---|---|---|
| FlowNet2 | C+T | 2.02 | 3.54 |
| PWC-Net | C+T | 2.55 | 3.93 |
| VCN | C+T | 2.21 | 3.68 |
| RAFT | C+T | 1.43 | 2.71 |
| After finetuning: | | | |
| FlowNet2 | C+T+S/K | 4.16 (test) | 5.74 (test) |
| RAFT | C+T+S/K | 1.61 (test) | 2.86 (test) |
On Sintel test, RAFT achieves 1.61 EPE (clean) and 2.86 EPE (final) — 30-36% error reduction over all prior work.
| Method | Training | EPE ↓ | F1-all ↓ |
|---|---|---|---|
| MaskFlowNet | C+T | - | 23.1% |
| VCN | C+T | 8.36 | 25.1% |
| RAFT | C+T | 5.04 | 17.4% |
| RAFT (finetuned) | C+T+S+K | 0.63 (train) | 5.10% (test) |
On KITTI test, RAFT achieves F1-all of 5.10% — a 16% error reduction from the previous best (6.10%).
| Ablation | Sintel Clean EPE | Change |
|---|---|---|
| Full RAFT | 1.43 | — |
| Replace all-pairs with local correlation | 1.73 | +21% |
| Replace convex upsample with bilinear | 1.68 | +17% |
| Remove multi-scale pyramid (single level) | 1.62 | +13% |
| No weight sharing (12 separate decoders) | 1.71 | +20% |
| 6 iterations instead of 12 | 1.58 | +10% |
End-point error (EPE) on Sintel Clean (C+T training). Lower is better. RAFT dramatically reduces error compared to prior methods.
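For reference, both metrics in the tables above can be computed in a few lines. This NumPy sketch assumes the usual definitions: EPE is the mean Euclidean distance between predicted and ground-truth flow vectors, and F1-all follows the KITTI 2015 outlier convention (error above 3 px and above 5% of the ground-truth magnitude). The function names and the toy data are made up.

```python
import numpy as np

def epe(pred, gt):
    """Mean end-point error over all pixels. pred, gt: [H, W, 2]."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def f1_all(pred, gt):
    """Percentage of outlier pixels under the assumed KITTI 2015 convention."""
    err = np.linalg.norm(pred - gt, axis=-1)
    mag = np.linalg.norm(gt, axis=-1)
    outlier = (err > 3.0) & (err > 0.05 * mag)
    return 100.0 * float(outlier.mean())

# Toy example: one pixel is off by (3, 4), i.e. 5 px; the other is exact.
gt = np.zeros((1, 2, 2))
pred = gt.copy()
pred[0, 0] = (3.0, 4.0)

example_epe = epe(pred, gt)
example_f1 = f1_all(pred, gt)
```

In this toy case the EPE is 2.5 px and half of the pixels count as outliers.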
RAFT sits at a pivotal point in computer vision — it changed how the field thinks about optical flow and iterative refinement.
Horn-Schunck (1981) and TV-L1 maintain a flow field and iteratively refine it using gradient descent on an energy function. RAFT does the same thing — but learns the energy function and the update rule from data. The correlation volume replaces the hand-crafted data term. The GRU replaces the hand-crafted regularization and update step.
RAFT's weight-shared iterations can be viewed as a Deep Equilibrium Model (DEQ): find the fixed point f* where the update operator produces Δf = 0. DEQ models solve for this fixed point directly (e.g., via Broyden's method). RAFT uses explicit unrolling but converges to the same idea: the final flow is the fixed point of a learned operator.
RAFT's design has been adopted far beyond optical flow:
| Aspect | RAFT |
|---|---|
| Input | Image pair $I_1, I_2 \in \mathbb{R}^{H \times W \times 3}$ |
| Output | Dense flow field [H, W, 2] |
| Feature encoder | ResNet-like, 1/8 res, 256 dims |
| Correlation | All-pairs 4D volume + 4-level pyramid |
| Update operator | Convolutional GRU, 2.7M params |
| Iterations | 12 (train), 12-32 (test), 100+ stable |
| Upsampling | Learned convex combination (8×) |
| Loss | Multi-iteration L1, γ = 0.8 |
| Training data | FlyingChairs + FlyingThings3D |
| Key result | 30% EPE reduction on Sintel |
| Total params | ~5.3M (small: ~1M) |
| Speed | 10 FPS on 1080Ti (1088×436) |