What Is Flow Matching?
Flow matching is a framework for training continuous normalizing flows (CNFs) that learn a velocity field transporting a simple noise distribution (e.g. Gaussian) to a complex data distribution. The core insight: instead of learning a curved, stochastic denoising process like diffusion, flow matching constructs straight-line probability paths between noise and data.
The velocity field v_θ(x_t, t) is parameterized by a neural network and
defines an ordinary differential equation (ODE):
dx_t/dt = v_θ(x_t, t),  t ∈ [0, 1].
At t = 0 we have pure noise; at t = 1 we have data. The network learns to push probability mass along approximately optimal transport (OT) paths — short, nearly straight routes from noise to data. Because the paths are close to straight, simple ODE solvers (even Euler with very few steps) can integrate them accurately.
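The "straight paths integrate exactly" claim can be checked in a few lines. A minimal NumPy sketch — the constant conditional velocity u = x_1 − x_0 below is an idealized stand-in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)   # noise sample at t = 0
x1 = rng.standard_normal(4)   # "data" sample at t = 1

def velocity(x, t):
    # Idealized straight-line velocity: constant u = x1 - x0.
    # A trained v_theta would approximate this along the path.
    return x1 - x0

# A single Euler step with dt = 1 lands exactly on the data point,
# because the velocity is constant along the straight path.
x = x0 + 1.0 * velocity(x0, 0.0)
print(np.allclose(x, x1))  # True
```

A learned field is only approximately constant, so real samplers use a handful of steps rather than one — but the same Euler update applies.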
Architecture
The architecture is a continuous normalizing flow: a neural network
v_θ (typically a U-Net or DiT) that predicts a velocity vector at
every point in space and time. Integrating this velocity field from t=0 to t=1
transforms noise into data samples.
The training objective is strikingly simple:
L_CFM(θ) = E_{t, x_0, x_1} || v_θ(x_t, t) − u_t ||²
where u_t = x_1 − x_0 is the conditional velocity that moves noise sample
x_0 to data sample x_1 along a straight line:
x_t = (1-t) x_0 + t x_1.
The network simply regresses to match these straight-line velocities.
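In code, the regression target for one (noise, data) pair is just a difference vector. A NumPy sketch of the target construction (no model included):

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.standard_normal((8, 2))   # batch of noise samples
x1 = rng.standard_normal((8, 2))   # batch of data samples
t = rng.uniform(size=(8, 1))       # one time per sample

x_t = (1 - t) * x0 + t * x1        # point on the straight path
u_t = x1 - x0                      # conditional velocity target

# The target is the time-derivative of the interpolation:
# d/dt [(1-t) x0 + t x1] = x1 - x0, independent of t.
eps = 1e-4
x_t_eps = (1 - (t + eps)) * x0 + (t + eps) * x1
print(np.allclose((x_t_eps - x_t) / eps, u_t))  # True
```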
Core Mechanisms
Model a continuous-time transformation via an ODE. Unlike discrete normalizing flows (with invertible layers), CNFs define a smooth path through time, parameterized by a learned velocity field. The change-of-variables formula gives exact log-likelihoods.
OT theory finds the minimum-cost mapping between distributions. Flow matching uses OT to construct straight interpolation paths xt = (1-t)x0 + tx1, which are provably the shortest paths under Euclidean cost. Straighter paths = fewer integration steps = faster sampling.
The key training trick: instead of learning the global velocity field directly (intractable), learn to match conditional velocities that connect each noise-data pair along a straight line. The marginal of these conditional flows recovers the true velocity field. This makes training a simple per-sample regression.
The interpolation xt = (1-t)x0 + tx1 defines a time-dependent Gaussian probability path pt(x | x1) with mean μt = tx1 and standard deviation σt = 1-t (variance (1-t)²). This path smoothly transforms from N(0, I) at t=0 to a delta at x1 at t=1.
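These path statistics are easy to verify by Monte Carlo. A small sketch, assuming a fixed scalar data point x1 = 2.0:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = 2.0                  # a fixed (scalar) data point
t = 0.3
x0 = rng.standard_normal(200_000)   # noise ~ N(0, 1)

x_t = (1 - t) * x0 + t * x1         # samples from p_t(x | x1)

print(round(x_t.mean(), 2))  # ~= t * x1  = 0.6
print(round(x_t.std(), 2))   # ~= 1 - t   = 0.7
```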
Diffusion vs Flow Matching
Both frameworks generate data by transforming noise, but they take fundamentally different paths through the probability space:
Diffusion (DDPM/DDIM)
Follows curved paths through probability space. The stochastic denoising process requires many small steps to stay on the learned manifold. Noise schedule and variance weighting are critical hyperparameters.
Flow Matching
Follows straight OT paths. The deterministic ODE has near-constant velocity, so even a coarse Euler integrator produces high-quality samples. No noise schedule needed — just linear interpolation.
Reflow / Distillation
Even after training with OT paths, the learned velocity field may not produce perfectly straight trajectories (it matches conditional velocities, not global OT). Reflow is a self-distillation technique that straightens the learned flow:
Train the teacher
Train a flow matching model vθ with standard CFM loss. This gives good but not perfectly straight paths.
Generate (x0, x1) pairs
Sample noise x0 and run the teacher ODE to get paired data x1 = ODE(x0; vθ). These pairs lie on the teacher's actual trajectories.
Retrain on paired data
Train a student vφ on these (x0, x1) pairs with the same CFM loss. The student learns straight paths between the teacher's actual endpoints, making trajectories straighter.
Iterate (optional)
Repeat steps 2–3. Each round straightens the paths further. After 2–3 rounds of reflow, a single Euler step can produce reasonable samples.
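The pair-generation step (step 2) can be sketched as follows. `teacher_v` below is a hypothetical hand-written stand-in for a trained teacher network, not a real model:

```python
import numpy as np

rng = np.random.default_rng(3)

def teacher_v(x, t):
    # Hypothetical stand-in for a trained teacher v_theta:
    # a slightly curved field pushing samples toward +2.
    return (2.0 - x) / (1.0 - t + 0.1)

def euler_ode(v, x0, n_steps=100):
    # Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v(x, k * dt)
    return x

# Step 2 of reflow: pair each noise sample with the teacher's endpoint.
x0 = rng.standard_normal(512)
x1 = euler_ode(teacher_v, x0)

# Step 3 would retrain a student with the CFM loss on these (x0, x1)
# pairs, whose straight-line targets are simply:
u = x1 - x0
```

The point is that (x0, x1) are now coupled by the teacher's actual flow, so the student's straight-line targets no longer cross each other the way random noise-data pairings do.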
Key Models
Flow matching has rapidly become the backbone of state-of-the-art generative models, replacing diffusion in many flagship systems:
Stable Diffusion 3
Uses rectified flow matching with a multimodal DiT (MM-DiT) backbone. Text and image tokens attend to each other via joint attention. The flow formulation enables high-quality 8-step generation.
Flux.1
Built by the original Stable Diffusion team. Pure rectified flow transformer with rotary positional embeddings. Flux.1-dev and Flux.1-schnell (distilled to 4 steps) set new benchmarks in text-to-image quality.
π0
Uses flow matching as the action generation head in a vision-language-action model. The velocity field maps noise to robot actions, enabling multimodal action distributions for dexterous manipulation.
Training Pipeline
Sample data and noise
Draw a data sample x1 from the training set and noise x0 ~ N(0, I).
Sample time and interpolate
Draw t ~ U(0,1). Construct the interpolated point xt = (1-t)x0 + tx1.
Compute target velocity
The target velocity is simply ut = x1 − x0 (the direction from noise to data).
Regress
Feed x_t and t into the network. Minimize || v_θ(x_t, t) − u_t ||². That is it. No noise schedule, no weighting tricks.
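The four steps above fit in a short loop. A toy NumPy sketch that fits a linear velocity model v(x, t) = W0·x + W1·t + W2 by SGD on 1-D data — a deliberately tiny stand-in for a real U-Net/DiT, with all names illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=0.5, size=4096)  # toy 1-D "dataset"
W = np.zeros(3)  # parameters of the linear velocity model

def v_theta(x, t):
    # Stand-in for a neural network: v(x, t) = W0*x + W1*t + W2.
    return W[0] * x + W[1] * t + W[2]

def cfm_loss(batch=2048):
    x1 = rng.choice(data, size=batch)       # 1. sample data
    x0 = rng.standard_normal(batch)         #    ... and noise
    t = rng.uniform(size=batch)             # 2. sample time
    x_t = (1 - t) * x0 + t * x1             #    ... and interpolate
    return np.mean((v_theta(x_t, t) - (x1 - x0)) ** 2)

loss_before = cfm_loss()
for _ in range(2000):                       # plain SGD on the CFM loss
    x1 = rng.choice(data, size=64)
    x0 = rng.standard_normal(64)
    t = rng.uniform(size=64)
    x_t = (1 - t) * x0 + t * x1
    err = v_theta(x_t, t) - (x1 - x0)       # 3.-4. target and regression
    W -= 0.05 * np.array([np.mean(err * x_t), np.mean(err * t), np.mean(err)])

print(cfm_loss() < loss_before)  # True: the CFM loss has dropped
```

A linear model cannot represent the true marginal velocity exactly, but the loss falls substantially — the same recipe with a deep network is the entire training pipeline.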
Inference Pipeline
Sample initial noise
Draw x0 ~ N(0, I) in the latent space (for a latent model like SD3 or Flux, noise is sampled directly at the latent resolution; no VAE encoder pass is needed at inference — the decoder is only applied at the end).
Integrate the ODE
Use an ODE solver (Euler, midpoint, or adaptive RK45) to integrate dx/dt = vθ(xt, t) from t=0 to t=1. Each step is one network forward pass.
Apply guidance (optional)
For conditional generation, use classifier-free guidance: v_guided = v_uncond + s (v_cond − v_uncond) with guidance scale s. Works identically to diffusion CFG.
Decode
Pass the final x1 through the VAE decoder (for latent models) to get the output image, video, or action sequence.
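Putting the inference steps together: a sketch of an Euler sampler with classifier-free guidance. `v_cond` and `v_uncond` are hypothetical hand-written stand-ins for one network queried with and without the condition — here, ideal fields whose flows end at points c and u respectively:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in velocity fields: for a single endpoint e, the ideal
# velocity at (x, t) is (e - x) / (1 - t).
c, u = 1.0, 0.0
def v_cond(x, t):   return (c - x) / (1.0 - t)
def v_uncond(x, t): return (u - x) / (1.0 - t)

def sample(n_steps=50, s=2.0):
    x = rng.standard_normal(16)            # 1. initial noise
    dt = 1.0 / n_steps
    for k in range(n_steps):               # 2. Euler integration
        t = k * dt                         #    (t stays below 1: no 1/0)
        v = v_uncond(x, t) + s * (v_cond(x, t) - v_uncond(x, t))  # 3. CFG
        x = x + dt * v
    return x                               # 4. decoding would go here

x1 = sample()
print(np.allclose(x1, u + 2.0 * (c - u)))  # True: guided endpoint is 2.0
```

Note the familiar CFG effect: with s = 2, the guided flow overshoots the conditional endpoint c = 1 and lands at u + s(c − u) = 2, exactly as the extrapolation formula predicts.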
Model Zoo
A non-exhaustive catalog of notable flow matching models and their key characteristics:
| Model | Backbone | Domain | Steps | Key Innovation |
|---|---|---|---|---|
| Stable Diffusion 3 | MM-DiT | Text-to-Image | 8–28 | Joint text-image attention with rectified flow |
| Flux.1-dev | DiT | Text-to-Image | 20–50 | Parallel + single-stream transformer blocks |
| Flux.1-schnell | DiT | Text-to-Image | 1–4 | Guidance-distilled from Flux.1-dev |
| π0 | VLM + Flow | Robot Actions | 10 | Flow matching action head for dexterous manipulation |
| InstaFlow | U-Net | Text-to-Image | 1 | Reflow distillation to single-step generation |
| SiT | DiT | Image Generation | 10–50 | Scalable interpolant transformers with flow matching |
| Voicebox | Transformer | Speech Synthesis | 16 | Non-autoregressive TTS via conditional flow matching |
| Riemannian FM | Various | Manifold Data | 10–50 | Extends flow matching to Riemannian manifolds |
| CogVideoX | 3D DiT | Text-to-Video | 50 | Expert transformer with rectified flow for video |