The elegant successor to diffusion models. Learn how straight-line transport between noise and data enables faster, simpler generation — powering SD3 and Flux.
Generative modeling is fundamentally about transport: moving samples from a distribution we can easily sample (noise) to a distribution we want (data). Diffusion models do this via a winding, stochastic path with hundreds of steps. What if we could find a straight line instead?
Flow matching frames generation as learning a velocity field that transports particles from noise to data along smooth paths. At t=0, particles are random noise. At t=1, they've arrived at data samples. The velocity field v(x, t) tells each particle which direction to move at each moment.
Watch particles travel from random noise (t=0) toward two data clusters (t=1). Each particle follows the learned velocity field.
The mathematical backbone of flow matching is the Continuous Normalizing Flow (CNF). Instead of a sequence of discrete transformations (as in classic normalizing flows), a CNF defines a continuous trajectory via an ODE: dx(t)/dt = vθ(x(t), t).
Starting from a noise sample x(0), we integrate the ODE from t=0 to t=1 to get a data sample x(1). The velocity field vθ is parameterized by a neural network. The elegant part: this single ODE replaces the entire noising/denoising dance of diffusion models.
Not all paths from noise to data are equal. A path could be a wild, looping curve — or a straight line. Optimal transport (OT) asks: what's the most efficient way to move mass from one distribution to another? The answer: straight paths with constant velocity.
Why does straightness matter? Because straighter paths can be integrated with fewer numerical steps. A perfectly straight path needs just one Euler step. Curved paths need many steps to follow accurately.
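The one-step claim is easy to verify numerically. A minimal sketch (a toy 2-D setup, illustrative only): on the straight path x_t = (1-t)z + t·x1, the velocity x1 - z is constant, so a single Euler step of size 1 lands exactly on the data.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 2))    # noise samples at t=0
x1 = rng.normal(size=(5, 2))   # data samples at t=1

v = x1 - z                     # constant velocity of the straight path
x = z + 1.0 * v                # one Euler step with dt = 1

# The step lands exactly on the data: zero integration error on a straight path.
print(np.allclose(x, x1))      # True
```

A curved path has no such shortcut: its velocity changes along the way, so a single large step overshoots and many small steps are needed.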
Compare diffusion-style curved paths (left) with OT straight paths (right). Straight paths need fewer integration steps.
The big problem with learning a velocity field: the true marginal velocity vt(x) is intractable. It requires integrating over all possible data-noise pairings. Conditional Flow Matching (CFM) sidesteps this by training on per-sample straight paths.
For each training pair (noise z, data x1), the conditional path is xt = (1-t)z + t x1 and the conditional velocity is ut = x1 - z. The beautiful insight: matching these conditional velocities gives the same gradient as matching the true marginal velocity.
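In code, the conditional path and velocity are one line each. A small sketch (variable names are illustrative) that also checks the stated velocity against a finite difference of the path:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=(4, 2))    # noise sample
x1 = rng.normal(size=(4, 2))   # paired data sample

def x_t(t):
    """Conditional straight path: x_t = (1 - t) z + t x1."""
    return (1 - t) * z + t * x1

u = x1 - z                     # conditional velocity, constant along the path

# Finite-difference check: (x_{t+h} - x_t) / h should equal x1 - z.
h = 1e-6
approx = (x_t(0.3 + h) - x_t(0.3)) / h
print(np.allclose(approx, u, atol=1e-4))  # True
```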
Training a flow matching model is arguably even simpler than training a diffusion model. Each training step: (1) sample a data point x1 and a noise sample z, (2) sample a time t uniformly in [0, 1], (3) compute the interpolated point xt = (1-t)z + t x1, and (4) regress vθ(xt, t) toward the target velocity x1 - z with a mean-squared-error loss.
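A runnable sketch of this training loop, assuming a toy 1-D setup and a deliberately simple linear model vθ(x, t) = a·x + b·t + c trained by manual gradient descent (all names and the model form are illustrative, not from a real implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, c = 0.0, 0.0, 0.0                  # parameters of the toy linear model
lr = 0.05

def v(x, t):
    return a * x + b * t + c

losses = []
for step in range(2000):
    x1 = 2.0 + 0.1 * rng.normal(size=64)   # toy 1-D "data" near 2
    z = rng.normal(size=64)                # noise z ~ N(0, 1)
    t = rng.uniform(size=64)               # random times in [0, 1]

    xt = (1 - t) * z + t * x1              # point on the conditional path
    u = x1 - z                             # regression target: conditional velocity

    err = v(xt, t) - u                     # residual of the velocity regression
    losses.append(np.mean(err ** 2))

    # Manual gradient descent on the MSE loss.
    a -= lr * np.mean(2 * err * xt)
    b -= lr * np.mean(2 * err * t)
    c -= lr * np.mean(2 * err)

print(losses[0] > losses[-1])  # True: the loss drops as vθ fits the velocity field
```

A real model replaces the linear map with a neural network and the manual updates with an optimizer, but the loss and data pipeline are exactly this simple.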
Arrows show the learned velocity field at time t. At t=0, velocities point toward data. At t=1, they've converged.
To generate a sample, start from noise z ~ N(0, I) and integrate the learned ODE forward from t=0 to t=1. The simplest method is Euler integration: take N evenly spaced steps, at each step nudging x by the predicted velocity times the step size.
Because the paths are approximately straight, Euler's method works well even with few steps. Higher-order solvers (Midpoint, RK4) give better accuracy for the same step count, but even plain Euler with 20-50 steps produces excellent results.
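As a concrete sketch, here is plain Euler integration for a toy case where the data distribution is a single point x1 = 2, so the exact marginal velocity is known in closed form, v(x, t) = (x1 - x)/(1 - t). The closed form is an assumption of this toy setup; in practice a trained network supplies the velocity.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = 2.0                        # toy data: a single point mass
x = rng.normal(size=1000)       # start from noise z ~ N(0, 1)

N = 20                          # number of Euler steps
dt = 1.0 / N
for k in range(N):
    t = k * dt
    v = (x1 - x) / (1.0 - t)    # exact velocity for the point-mass target
    x = x + dt * v              # Euler step: nudge x by velocity times step size

# Every particle arrives at the data point.
print(np.allclose(x, x1))       # True
```

Swapping the loop body for a midpoint or RK4 update changes only the local accuracy per step, not the overall structure.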
Watch particles integrate from noise to data. Adjust the step count: more steps = more accurate paths.
Even though flow matching paths are straighter than diffusion's, they're not perfectly straight. Conditional paths from different noise-data pairs can cross, and because the learned marginal velocity field is single-valued, it must average over the crossing directions, which bends the trajectories. Reflow straightens paths further.
The procedure: (1) Generate data-noise pairs by running the trained model forward and backward. (2) Retrain on these paired samples. Each iteration makes paths straighter. After 2-3 reflow iterations, paths are nearly straight enough for 1-step generation.
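The pairing step is the heart of reflow. A minimal, self-contained sketch (the closed-form velocity field standing in for a trained model is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

def flow_forward(z, n_steps=50):
    """Stand-in for integrating a trained velocity field from t=0 to t=1.
    Uses a toy closed-form field that pulls everything toward x1 = 2
    (an illustrative assumption, not a trained network)."""
    x = z.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * (2.0 - x) / (1.0 - t)
    return x

# Reflow step 1: pair each noise sample with ITS OWN model output,
# replacing the random noise-data coupling used in the first round.
z = rng.normal(size=256)
x1 = flow_forward(z)
pairs = list(zip(z, x1))

# Reflow step 2: retrain on straight interpolations of these pairs
# (training loop omitted); the new target velocity for each pair is:
u = x1 - z
print(len(pairs))  # 256
```

Because each noise sample is now coupled to the point the model itself transports it to, the new conditional paths no longer cross, and retraining on them yields a straighter field.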
Distillation takes a different approach: train a student model to mimic the teacher in fewer steps. Progressive distillation halves the step count repeatedly (64 → 32 → 16 → 8 → 4 → 2 → 1). Combined with reflow, this yields high-quality 1-4 step models.
Compare paths before and after reflow. Straighter paths = fewer steps needed.
Flow matching has gone from theory to production. Both Stable Diffusion 3 (Stability AI) and Flux (Black Forest Labs) use rectified flow matching as their core framework, combined with a new architecture: the MMDiT (Multimodal Diffusion Transformer).
MMDiT replaces the U-Net with a Transformer. Both the noisy latent patches and the text tokens are processed as separate streams that interact through joint attention layers. This bidirectional interaction gives the text genuine influence over image generation.
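The joint-attention idea can be sketched in a few lines. This is schematic only: shapes, names, and the single shared projection are illustrative, and the real MMDiT adds per-modality projections, multiple heads, and timestep modulation.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8                                   # embedding dimension (toy)
img = rng.normal(size=(16, d))          # 16 noisy latent patch tokens
txt = rng.normal(size=(4, d))           # 4 text tokens

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Joint attention: concatenate both streams so every image token can
# attend to every text token and vice versa (bidirectional interaction).
x = np.concatenate([img, txt], axis=0)          # (20, d)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(d))            # (20, 20): mixes both modalities
out = attn @ v

# Split back into the two streams after the shared attention.
img_out, txt_out = out[:16], out[16:]
print(img_out.shape, txt_out.shape)             # (16, 8) (4, 8)
```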
| Feature | SD 1.5 | SDXL | SD3 / Flux |
|---|---|---|---|
| Architecture | U-Net | Larger U-Net | MMDiT |
| Framework | DDPM | DDPM | Flow matching |
| Text encoder | CLIP | CLIP + OpenCLIP | CLIP + T5-XXL |
| Resolution | 512px | 1024px | 1024px+ |
| Steps | 20-50 | 20-40 | 20-30 |
Flow matching and diffusion are closely related — in fact, diffusion can be seen as a special case of flow matching with a particular (non-straight) path choice. But the differences matter in practice.
| Aspect | Diffusion (DDPM) | Flow Matching |
|---|---|---|
| What it learns | Noise prediction εθ | Velocity field vθ |
| Path shape | Curved (variance-preserving) | Straight (OT interpolation) |
| Math framework | SDE (stochastic) | ODE (deterministic) |
| Typical steps | 20-1000 | 10-50 |
| Noise schedule | βt schedule (many choices) | Linear interpolation (one choice) |
| Training target | ε (noise) | x1 - z (velocity) |
| Log-likelihood | Approximate (ELBO) | Exact (via ODE) |
Left: diffusion-style curved trajectories. Right: flow matching straight trajectories. Both reach the same target.
You now understand flow matching: straight paths, simple training, fast sampling. The next generation of generative models is built on these ideas.