The Complete Beginner's Path

Understand Flow Matching

The elegant successor to diffusion models. Learn how straight-line transport between noise and data enables faster, simpler generation — powering SD3 and Flux.

Prerequisites: Basic calculus + Familiarity with diffusion models (helpful but not required).
9 Chapters · 6+ Simulations · 0 Stochastic ODEs

Chapter 0: Paths Between Distributions

Generative modeling is fundamentally about transport: moving samples from a distribution we can easily sample (noise) to a distribution we want (data). Diffusion models do this via a winding, stochastic path with hundreds of steps. What if we could find a straight line instead?

Flow matching frames generation as learning a velocity field that transports particles from noise to data along smooth paths. At t=0, particles are random noise. At t=1, they've arrived at data samples. The velocity field v(x, t) tells each particle which direction to move at each moment.

The key idea: Instead of learning to remove noise (diffusion), learn a velocity field that pushes noise toward data. Straighter paths = fewer integration steps = faster generation.
Particles Flowing: Noise to Data

Watch particles travel from random noise (t=0) toward two data clusters (t=1). Each particle follows the learned velocity field.

Check: What does the velocity field v(x, t) describe?

Chapter 1: Continuous Normalizing Flows

The mathematical backbone of flow matching is the Continuous Normalizing Flow (CNF). Instead of a sequence of discrete transformations (like normalizing flows), a CNF defines a continuous trajectory via an ODE:

dx/dt = vθ(x, t)     x(0) ~ pnoise,   x(1) ~ pdata

Starting from a noise sample x(0), we integrate the ODE from t=0 to t=1 to get a data sample x(1). The velocity field vθ is parameterized by a neural network. The elegant part: this single ODE replaces the entire noising/denoising dance of diffusion models.

Why ODE, not SDE? Diffusion models use stochastic differential equations (random noise at each step). CNFs use ordinary differential equations (deterministic given v). Deterministic paths are easier to integrate, require fewer steps, and give exact log-likelihoods.
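That determinism claim is easy to check numerically. Below is a minimal numpy sketch in which a hand-picked toy velocity field stands in for a learned vθ (the field, the target point, and the step count are all illustrative assumptions, not part of the course material): integrating the same noise sample twice gives identical outputs, which would not hold for an SDE.

```python
import numpy as np

def v(x, t):
    # Toy stand-in for a learned velocity field v_theta(x, t):
    # it pushes every point toward the fixed target (2, 2).
    target = np.array([2.0, 2.0])
    return target - x

def integrate(x0, steps=100):
    # Euler integration of dx/dt = v(x, t) from t=0 to t=1.
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        x += dt * v(x, k * dt)
    return x

x0 = np.array([0.5, -1.0])
a = integrate(x0)
b = integrate(x0)
# Deterministic given v: the same x(0) always yields the same x(1).
assert np.array_equal(a, b)
```

There is no random draw inside the integration loop, so the trajectory is a pure function of the initial condition; that is exactly the ODE-vs-SDE distinction in the paragraph above.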
Noise x(0)
x(0) ~ N(0, I)
↓ integrate dx/dt = v(x,t)
x(0.25)
Starting to take shape
↓ integrate
x(0.5)
Halfway there
↓ integrate
Data x(1)
A sample from pdata
Check: What type of equation governs a continuous normalizing flow?

Chapter 2: Optimal Transport

Not all paths from noise to data are equal. A path could be a wild, looping curve — or a straight line. Optimal transport (OT) asks: what's the most efficient way to move mass from one distribution to another? The answer: straight paths with constant velocity.

Why does straightness matter? Because straighter paths can be integrated with fewer numerical steps. A perfectly straight path needs just one Euler step. Curved paths need many steps to follow accurately.

Curved vs Straight Paths

Compare diffusion-style curved paths (left) with OT straight paths (right). Straight paths need fewer integration steps.

Straight-line interpolation: The OT path from noise z to data x1 is simply xt = (1-t)z + t x1. The velocity along this path is constant: v = x1 - z. This is as simple as it gets.
xt = (1-t) · z + t · x1    →    v = x1 - z
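The constant-velocity claim can be verified directly. A short numpy sketch (the specific noise and data samples are illustrative): differentiating the straight-line interpolation at any time t recovers the same vector x1 - z.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(2)     # noise sample
x1 = np.array([3.0, 1.0])      # data sample (toy)

def x_t(t):
    # Straight-line (OT) interpolation between noise and data.
    return (1 - t) * z + t * x1

# Finite-difference velocity at several times: constant, equal to x1 - z.
h = 1e-6
for t in [0.1, 0.5, 0.9]:
    vel = (x_t(t + h) - x_t(t)) / h
    assert np.allclose(vel, x1 - z)
```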
Check: Why are straight transport paths better than curved ones?

Chapter 3: Conditional Flow Matching

The big problem with learning a velocity field: the true marginal velocity vt(x) is intractable. It requires integrating over all possible data-noise pairings. Conditional Flow Matching (CFM) sidesteps this by training on per-sample straight paths.

For each training pair (noise z, data x1), the conditional path is xt = (1-t)z + t x1 and the conditional velocity is ut = x1 - z. The beautiful insight: matching these conditional velocities gives the same gradient as matching the true marginal velocity.

LCFM = Et, z, x1 [ || vθ(xt, t) - (x1 - z) ||² ]
Why this works: The conditional and marginal flow matching losses have identical gradients (proven by Lipman et al., 2023). We can train on simple per-sample paths while implicitly learning the complex marginal flow. No intractable integrals required.
  1. Sample: pick z ~ N(0, I) and x1 from the dataset
  2. Interpolate: xt = (1-t)z + t x1
  3. Target velocity: ut = x1 - z
  4. Loss: || vθ(xt, t) - ut ||²
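These steps translate directly into one loss evaluation. A minimal numpy sketch over a batch, where `velocity_net` is a hypothetical stand-in that predicts zero velocity (as an untrained model might); the batch size and toy data cluster are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 64, 2                        # batch size, data dimension (toy)

def velocity_net(x, t):
    # Hypothetical untrained model: predicts zero velocity everywhere.
    return np.zeros_like(x)

# 1) Sample noise and data
z = rng.standard_normal((B, D))
x1 = rng.standard_normal((B, D)) + np.array([3.0, 0.0])  # toy data cluster
# 2) Interpolate at random times
t = rng.uniform(size=(B, 1))
xt = (1 - t) * z + t * x1
# 3) Conditional target velocity
ut = x1 - z
# 4) CFM loss: MSE between predicted and conditional velocity
loss = np.mean(np.sum((velocity_net(xt, t) - ut) ** 2, axis=1))
```

With a zero predictor, the loss is simply the mean squared norm of the targets; training (next chapter) drives it toward the irreducible conditional variance.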
Check: What does Conditional Flow Matching train on?

Chapter 4: Training

Training a flow matching model is arguably even simpler than training a diffusion model. For each training step:

  1. Sample noise z ~ N(0, I)
  2. Sample a data point x1 from the dataset
  3. Sample a random time t ~ U(0, 1)
  4. Compute xt = (1-t)z + t x1
  5. Predict velocity: v̂ = vθ(xt, t)
  6. Loss = || v̂ - (x1 - z) ||²
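The six steps above can be run end to end. Here is a minimal sketch assuming a linear velocity model trained with hand-written gradient descent; the toy dataset, model, learning rate, and step counts are illustrative assumptions (a real vθ would be a neural network trained with an autodiff framework):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2
# Toy dataset: a Gaussian cluster centered at (3, 0).
data = rng.standard_normal((1000, D)) + np.array([3.0, 0.0])

# Linear velocity model v(x, t) = W @ [x, t, 1], a stand-in for a network.
W = np.zeros((D, D + 2))

def features(x, t):
    return np.concatenate([x, t, np.ones_like(t)], axis=1)  # (B, D+2)

losses = []
for step in range(500):
    # Steps 1-3: sample noise, data, and a random time
    z = rng.standard_normal((64, D))
    x1 = data[rng.integers(0, len(data), 64)]
    t = rng.uniform(size=(64, 1))
    # Step 4: interpolate
    xt = (1 - t) * z + t * x1
    # Steps 5-6: predict velocity, MSE against the target x1 - z
    phi = features(xt, t)               # (B, D+2)
    pred = phi @ W.T                    # (B, D)
    err = pred - (x1 - z)
    losses.append(np.mean(err ** 2))
    # Gradient descent on the flow matching loss
    grad = 2 * err.T @ phi / len(err)   # (D, D+2)
    W -= 0.05 * grad

# Training drives the loss down toward the conditional variance floor.
assert np.mean(losses[-50:]) < np.mean(losses[:50])
```

Note that nothing in the loop simulates a trajectory: each update touches a single random point (xt, t), which is what makes flow matching training simulation-free.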
Compare to diffusion: In diffusion, you predict noise ε. In flow matching, you predict velocity (x1 - z). The target is a direction vector pointing from noise to data. The loss is MSE in both cases.
Interactive: Velocity Field

Arrows show the learned velocity field at time t. At t=0, velocities point toward data. At t=1, they've converged.

Check: What is the training target in flow matching?

Chapter 5: Sampling

To generate a sample, start from noise z ~ N(0, I) and integrate the learned ODE forward from t=0 to t=1. The simplest method is Euler integration: take N evenly spaced steps, at each step nudging x by the predicted velocity times the step size.

x(t+Δt) = x(t) + Δt · vθ(x(t), t)     where Δt = 1/N

Because the paths are approximately straight, Euler's method works well even with few steps. Higher-order solvers (Midpoint, RK4) give better accuracy for the same step count, but even plain Euler with 20-50 steps produces excellent results.
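The "one Euler step suffices for straight paths" claim can be demonstrated with an analytically known field. In the toy case where the data distribution is a single point x1, the marginal velocity is exactly v(x, t) = (x1 - x)/(1 - t) and every path is a straight line; the single data point and step counts below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = np.array([2.0, -1.0])   # single data point (toy target)

def v(x, t):
    # Marginal velocity for straight paths into a single data point:
    # each particle moves along the line from its noise sample to x1.
    return (x1 - x) / (1.0 - t)

def euler_sample(z, steps):
    x, dt = z.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * v(x, k * dt)
    return x

z = rng.standard_normal(2)
# Perfectly straight field: one Euler step already lands on the target,
# and more steps stay on the same straight line.
assert np.allclose(euler_sample(z, 1), x1)
assert np.allclose(euler_sample(z, 50), x1)
```

Learned fields are only approximately straight, which is why 20-50 steps, rather than exactly one, are typical in practice.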

Interactive: Euler Steps Along the Flow

Watch particles integrate from noise to data. Adjust the step count: more steps = more accurate paths.

The straight-path advantage: If paths were perfectly straight, one Euler step would be exact. In practice, learned paths are nearly straight, so 20-50 steps suffice. Compare with DDPM's 1000 steps!
Check: Why does flow matching need fewer sampling steps than diffusion?

Chapter 6: Reflow & Distillation

Even though flow matching paths are straighter than diffusion, they're not perfectly straight. Paths from different noise-data pairs can cross each other, forcing the network to learn a curved velocity field to avoid collisions. Reflow straightens paths further.

The procedure: (1) Sample noise, run the trained ODE forward, and pair each noise sample with the data point the model produces. (2) Retrain with conditional flow matching on these coupled pairs. Each iteration makes paths straighter. After 2-3 reflow iterations, paths are nearly straight enough for 1-step generation.
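Why does the reflow coupling remove crossings? ODE trajectories cannot intersect, so in 1D the flow map from noise to data is monotone: it pairs the k-th smallest noise sample with the k-th smallest generated sample. The numpy sketch below uses that sorted pairing as a toy stand-in for actually running a trained model forward (the 1D setting and sample counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)            # noise samples (1D toy)
x1 = rng.standard_normal(n) + 3.0     # data samples

def crossings(z, x1):
    # Two straight paths (z_i -> x1_i) and (z_j -> x1_j) cross iff the
    # pairs are ordered oppositely at t=0 and t=1.
    c = 0
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            if (z[i] - z[j]) * (x1[i] - x1[j]) < 0:
                c += 1
    return c

# Reflow coupling: in 1D the deterministic flow map is monotone, i.e. it
# matches ranks. (Stand-in for running the trained ODE forward.)
x1_reflow = np.sort(x1)[np.argsort(np.argsort(z))]

assert crossings(z, x1) > 0           # independent pairing: paths cross
assert crossings(z, x1_reflow) == 0   # reflow pairing: no crossings
```

With no crossings, straight conditional paths no longer conflict at shared points, so the retrained marginal field can itself be (nearly) straight.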

Initial Flow
Good but paths cross
↓ Reflow iteration 1
Straighter
Fewer crossings, lower curvature
↓ Reflow iteration 2
Nearly Straight
Suitable for 1-2 step generation

Distillation

Distillation takes a different approach: train a student model to mimic the teacher in fewer steps. Progressive distillation halves the step count repeatedly (64 → 32 → 16 → 8 → 4 → 2 → 1). Combined with reflow, this yields high-quality 1-4 step models.

Path Straightening

Compare paths before and after reflow. Straighter paths = fewer steps needed.

The endgame: Reflow + distillation aims for single-step generation with diffusion-level quality. This is essentially what Flux-Schnell and SDXL-Turbo achieve in practice.
Check: What does reflow do to transport paths?

Chapter 7: SD3 / Flux

Flow matching has gone from theory to production. Both Stable Diffusion 3 (Stability AI) and Flux (Black Forest Labs) use rectified flow matching as their core framework, combined with a new architecture: the MMDiT (Multimodal Diffusion Transformer).

MMDiT Architecture

MMDiT replaces the U-Net with a Transformer. Both the noisy latent patches and the text tokens are processed as separate streams that interact through joint attention layers. This bidirectional interaction gives the text genuine influence over image generation.
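The "joint attention" idea reduces to running attention over the concatenated token sequence. A deliberately stripped-down numpy sketch (single head, identity Q/K/V projections, toy dimensions; real MMDiT blocks use separate learned projections and per-stream MLPs, none of which are shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dim (toy size)
img = rng.standard_normal((16, d))      # 16 patchified image tokens
txt = rng.standard_normal((4, d))       # 4 text tokens

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# Joint attention: concatenate both streams into one sequence so every
# image token can attend to every text token, and vice versa.
seq = np.concatenate([img, txt], axis=0)            # (20, d)
attn = softmax(seq @ seq.T / np.sqrt(d), axis=-1)   # (20, 20)
out = attn @ seq
img_out, txt_out = out[:16], out[16:]               # split streams back
```

The key point is the concatenation: because image and text tokens share one attention matrix, conditioning is bidirectional rather than one-way cross-attention.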

Text Stream
T5 + CLIP embeddings as tokens
Joint Attention
Image and text tokens attend to each other
Image Stream
Patchified noisy latent + positional encoding
Feature       | SD 1.5 | SDXL            | SD3 / Flux
Architecture  | U-Net  | Larger U-Net    | MMDiT
Framework     | DDPM   | DDPM            | Flow matching
Text encoder  | CLIP   | CLIP + OpenCLIP | CLIP + T5-XXL
Resolution    | 512px  | 1024px          | 1024px+
Steps         | 20-50  | 20-40           | 20-30
Flux variants: Flux.1-pro (best quality, API-only), Flux.1-dev (open weights, guidance-distilled), Flux.1-schnell (4-step distilled, fastest). The schnell variant demonstrates the power of reflow + distillation.
Check: What architecture do SD3 and Flux use instead of U-Net?

Chapter 8: Flow vs Diffusion

Flow matching and diffusion are closely related — in fact, diffusion can be seen as a special case of flow matching with a particular (non-straight) path choice. But the differences matter in practice.

Aspect          | Diffusion (DDPM)             | Flow Matching
What it learns  | Noise prediction εθ          | Velocity field vθ
Path shape      | Curved (variance-preserving) | Straight (OT interpolation)
Math framework  | SDE (stochastic)             | ODE (deterministic)
Typical steps   | 20-1000                      | 10-50
Noise schedule  | βt schedule (many choices)   | Linear interpolation (one choice)
Training target | ε (noise)                    | x1 - z (velocity)
Log-likelihood  | Approximate (ELBO)           | Exact (via ODE)
Side-by-Side: Diffusion vs Flow

Left: diffusion-style curved trajectories. Right: flow matching straight trajectories. Both reach the same target.

When to Use Which?

Use diffusion when you have existing DDPM/DDIM infrastructure, need maximum compatibility with LoRAs/ControlNets built for SD 1.5/SDXL, or when stochastic sampling benefits diversity.
Use flow matching for new projects, when speed matters (fewer steps), when you want a simpler noise schedule, or when using modern architectures (DiT/MMDiT). It's the clear direction the field is heading.
"The shortest path between two truths in the real domain passes through the complex domain."
— Jacques Hadamard

You now understand flow matching: straight paths, simple training, fast sampling. The next generation of generative models is built on these ideas.