What Is Flow Matching?

Flow matching is a framework for training continuous normalizing flows (CNFs) that learn a velocity field transporting a simple noise distribution (e.g. Gaussian) to a complex data distribution. The core insight: instead of learning a curved, stochastic denoising process like diffusion, flow matching constructs straight-line probability paths between noise and data.

The velocity field v_θ(x_t, t) is parameterized by a neural network and defines an ordinary differential equation (ODE):

dx_t/dt = v_θ(x_t, t),   t ∈ [0, 1]

At t = 0 we have pure noise; at t = 1 we have data. The network learns to push probability mass along optimal transport (OT) paths — the shortest, straightest routes from noise to data. Because the paths are straight, simple ODE solvers (even Euler with very few steps) can integrate them accurately.
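The ODE above can be integrated in a few lines of fixed-step Euler. A minimal sketch, using a toy constant velocity field as a stand-in for a trained network (the points x0 and x1 are made-up values):

```python
import numpy as np

def euler_integrate(v, x0, n_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

# Toy stand-in for a trained v_theta: a constant velocity pointing from a
# given noise sample straight at a data point (hypothetical values).
x0 = np.array([0.5, 0.2])
x1 = np.array([3.0, -1.0])
v = lambda x, t: x1 - x0   # straight-line conditional velocity

print(euler_integrate(v, x0))  # lands on x1
```

Because this field is exactly straight, the integration error is zero regardless of step count; a trained field is only approximately straight.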

Key Insight
Unlike diffusion models, which learn a score function (the gradient of the log-density), flow matching directly learns a velocity field. The training objective is a simple regression loss — no noise schedule, no variance weighting, no ELBO. This makes the framework conceptually cleaner and easier to train.

Architecture

The architecture is a continuous normalizing flow: a neural network v_θ (typically a U-Net or DiT) that predicts a velocity vector at every point in space and time. Integrating this velocity field from t=0 to t=1 transforms noise into data samples.

[Interactive figure: flow field visualization. Particles start as pure noise at t = 0; a slider animates their transport to data at t = 1.]

The training objective is strikingly simple:

L_FM = E_{t, x0, x1} ‖ v_θ(x_t, t) − u_t(x_t | x0, x1) ‖²

where u_t is the conditional velocity that moves noise sample x0 to data sample x1 along a straight line: x_t = (1-t) x_0 + t x_1. The network simply regresses to match these straight-line velocities.
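The loss can be computed in a few lines. A sketch, assuming NumPy arrays and a placeholder network that predicts zeros (a real v_θ would be a U-Net or DiT):

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(v_theta, x0, x1, t):
    # Interpolated point on the straight line between noise and data.
    xt = (1 - t)[:, None] * x0 + t[:, None] * x1
    ut = x1 - x0                          # conditional target velocity
    return np.mean(np.sum((v_theta(xt, t) - ut) ** 2, axis=1))

x0 = rng.standard_normal((64, 2))             # noise batch
x1 = rng.standard_normal((64, 2)) + 4.0       # toy "data" batch
t = rng.uniform(size=64)

# An untrained placeholder network that predicts zero velocity everywhere.
loss = cfm_loss(lambda xt, tt: np.zeros_like(xt), x0, x1, t)
print(loss)  # large: the network has learned nothing yet
```

A network that output exactly x1 − x0 for each pair would drive this loss to zero.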

Core Mechanisms

Continuous Normalizing Flows

Model a continuous-time transformation via an ODE. Unlike discrete normalizing flows (built from invertible layers), CNFs define a smooth path through time, parameterized by a learned velocity field. The instantaneous change-of-variables formula, an integral of the velocity field's divergence, gives exact log-likelihoods.

Optimal Transport Paths

OT theory finds the minimum-cost mapping between distributions. Flow matching uses OT to construct straight interpolation paths xt = (1-t)x0 + tx1, which are provably the shortest paths under Euclidean cost. Straighter paths = fewer integration steps = faster sampling.

Conditional Flow Matching (CFM)

The key training trick: instead of learning the global velocity field directly (intractable), learn to match conditional velocities that connect each noise-data pair along a straight line. The marginal of these conditional flows recovers the true velocity field. This makes training a simple per-sample regression.

Gaussian Probability Paths

The interpolation x_t = (1-t)x0 + t x1 defines a time-dependent Gaussian probability path p_t(x | x1) with mean μ_t = t x1 and standard deviation σ_t = 1 − t (variance (1 − t)²). This path smoothly transforms from N(0, I) at t = 0 to a point mass at x1 at t = 1.
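This is easy to check empirically: interpolating standard-normal noise with a fixed x1 should give samples with mean t·x1 and per-coordinate standard deviation 1 − t. A quick NumPy check with toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = np.array([2.0, -3.0])       # a fixed data point (toy values)
t = 0.3

x0 = rng.standard_normal((200_000, 2))
xt = (1 - t) * x0 + t * x1

# Empirical mean should be t * x1; per-coordinate std should be 1 - t.
print(xt.mean(axis=0), xt.std(axis=0))
```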

Diffusion vs Flow Matching

Both frameworks generate data by transforming noise, but they take fundamentally different paths through the probability space:

[Interactive figure: side-by-side trajectories for diffusion vs flow matching.]

Diffusion (DDPM/DDIM)

20–50 steps
Typical sampling steps

Follows curved paths through probability space. The stochastic denoising process requires many small steps to stay on the learned manifold. Noise schedule and variance weighting are critical hyperparameters.

Flow Matching

4–10 steps
Typical sampling steps

Follows straight OT paths. The deterministic ODE has near-constant velocity, so even a coarse Euler integrator produces high-quality samples. No noise schedule needed — just linear interpolation.

Why Fewer Steps?
A straight line between two points can be traversed in one Euler step with zero error. In practice, the learned velocity field is almost straight (after OT matching), so 4–10 steps suffice. Diffusion paths curve because the forward process adds noise in many small increments, and the reverse process must carefully undo each one.
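The step-count argument can be verified numerically: a constant (straight) field is integrated exactly by a single Euler step, while a curved field — here a rotation, standing in for a curved denoising path — needs many:

```python
import numpy as np

def euler(v, x0, n_steps):
    x, dt = np.array(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

# Straight field: constant velocity. One Euler step is exact.
straight = lambda x, t: np.array([2.0, 2.0])
err_straight = np.linalg.norm(euler(straight, [0.0, 0.0], 1) - np.array([2.0, 2.0]))

# Curved field (a rotation, standing in for a curved path): coarse Euler
# overshoots, and the error shrinks only as the step count grows.
curved = lambda x, t: np.array([-x[1], x[0]])
exact = np.array([np.cos(1.0), np.sin(1.0)])   # true solution at t = 1
errs = [np.linalg.norm(euler(curved, [1.0, 0.0], n) - exact) for n in (1, 10, 100)]
print(err_straight, errs)
```

The straight-field error is exactly zero with one step; the curved field's error decreases monotonically with more steps.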

Reflow / Distillation

Even after training with OT paths, the learned velocity field may not produce perfectly straight trajectories (it matches conditional velocities, not global OT). Reflow is a self-distillation technique that straightens the learned flow:

[Interactive figure: ODE integration steps before and after reflow.]
1

Train the teacher

Train a flow matching model vθ with standard CFM loss. This gives good but not perfectly straight paths.

2

Generate (x0, x1) pairs

Sample noise x0 and run the teacher ODE to get paired data x1 = ODE(x0; vθ). These pairs lie on the teacher's actual trajectories.

3

Retrain on paired data

Train a student vφ on these (x0, x1) pairs with the same CFM loss. The student learns straight paths between the teacher's actual endpoints, making trajectories straighter.

4

Iterate (optional)

Repeat steps 2–3. Each round straightens the paths further. After 2–3 rounds of reflow, a single Euler step can produce reasonable samples.
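Steps 2–3 can be sketched as follows. The teacher field here is a hypothetical stand-in (a simple drift toward a fixed mean mu), not a trained model; the point is pairing each noise sample with the teacher's own ODE endpoint:

```python
import numpy as np

rng = np.random.default_rng(2)

def integrate(v, x0, n_steps=50):
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

# Hypothetical "teacher" field drifting samples toward a fixed mean mu;
# it stands in for a trained v_theta and is not a real model.
mu = np.array([5.0, 0.0])
teacher = lambda x, t: mu - x

# Step 2: couple each noise sample with the teacher's own ODE endpoint.
x0 = rng.standard_normal((8, 2))
x1 = np.stack([integrate(teacher, sample) for sample in x0])

# Step 3 would retrain a student with the same CFM loss on these (x0, x1)
# pairs; its targets become straight lines between coupled endpoints.
student_targets = x1 - x0
print(student_targets.shape)
```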

Key Models

Flow matching has rapidly become the backbone of state-of-the-art generative models, replacing diffusion in many flagship systems:

Stable Diffusion 3
Stability AI — 2024

Uses rectified flow matching with a multimodal DiT (MM-DiT) backbone. Text and image tokens attend to each other via joint attention. The flow formulation enables high-quality 8-step generation.

Flux
Black Forest Labs — 2024

Built by the original Stable Diffusion team. Pure rectified flow transformer with rotary positional embeddings. Flux.1-dev and Flux.1-schnell (distilled to 4 steps) set new benchmarks in text-to-image quality.

π0 Action Head
Physical Intelligence — 2024

Uses flow matching as the action generation head in a vision-language-action model. The velocity field maps noise to robot actions, enabling multimodal action distributions for dexterous manipulation.

Training Pipeline

[Interactive figure: the learned velocity field over training epochs.]
1

Sample data and noise

Draw a data sample x1 from the training set and noise x0 ~ N(0, I).

2

Sample time and interpolate

Draw t ~ U(0,1). Construct the interpolated point xt = (1-t)x0 + tx1.

3

Compute target velocity

The target velocity is simply ut = x1 − x0 (the direction from noise to data).

4

Regress

Feed x_t and t into the network. Minimize ‖ v_θ(x_t, t) − u_t ‖². That is it: no noise schedule, no weighting tricks.
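The four steps form a plain regression loop. A self-contained toy run, with a linear model and manual gradients standing in for a real network and optimizer (the dataset is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy "dataset": a tight Gaussian blob (stands in for real training data).
data = rng.standard_normal((4096, 2)) * 0.1 + np.array([4.0, 4.0])

# Tiny linear model v(x, t) = x @ W + t * w_t + b, a stand-in for a U-Net/DiT.
W, w_t, b = np.zeros((2, 2)), np.zeros(2), np.zeros(2)
lr, losses = 0.05, []

for step in range(500):
    # Steps 1-2: sample data, noise, and time; interpolate.
    x1 = data[rng.integers(0, len(data), size=256)]
    x0 = rng.standard_normal((256, 2))
    t = rng.uniform(size=256)
    xt = (1 - t)[:, None] * x0 + t[:, None] * x1
    # Step 3: straight-line target velocity.
    ut = x1 - x0
    # Step 4: regress; manual gradient of the mean squared error.
    err = xt @ W + t[:, None] * w_t + b - ut
    losses.append(float(np.mean(err ** 2)))
    W -= lr * xt.T @ err / len(err)
    w_t -= lr * (t[:, None] * err).sum(axis=0) / len(err)
    b -= lr * err.mean(axis=0)

print(losses[0], losses[-1])  # the loss drops sharply
```

Sampling would then integrate the fitted field from t = 0 to t = 1, exactly as in the inference pipeline below.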

Inference Pipeline

1

Sample initial noise

Draw x0 ~ N(0, I). For a latent model like SD3 or Flux, sample directly in the VAE latent space; the VAE encoder is only needed during training, and decoding happens after integration.

2

Integrate the ODE

Use an ODE solver (Euler, midpoint, or adaptive RK45) to integrate dx/dt = vθ(xt, t) from t=0 to t=1. Each step is one network forward pass.

3

Apply guidance (optional)

For conditional generation, use classifier-free guidance: v_guided = v_uncond + s (v_cond − v_uncond) with guidance scale s. Works identically to diffusion CFG.

4

Decode

Pass the final x1 through the VAE decoder (for latent models) to get the output image, video, or action sequence.
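Steps 1–3 can be sketched as one Euler loop with guidance. The two velocity fields below are hypothetical stand-ins for v_θ evaluated with and without conditioning (simple drifts, not a trained model); a real pipeline would call the network twice per step and decode at the end:

```python
import numpy as np

# Hypothetical cond/uncond velocity fields, standing in for v_theta
# evaluated with and without the text conditioning (not a real model).
target = np.array([3.0, 3.0])
v_uncond = lambda x, t: -x              # drifts toward the origin
v_cond   = lambda x, t: target - x      # drifts toward the condition

def sample_cfg(x0, scale=2.0, n_steps=8):
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        vu, vc = v_uncond(x, i * dt), v_cond(x, i * dt)
        v = vu + scale * (vc - vu)      # classifier-free guidance
        x = x + dt * v                  # one Euler step = one forward pass
    return x

x = sample_cfg(np.random.default_rng(4).standard_normal(2))
# A latent model would now pass x through the VAE decoder.
print(x)
```

Note that with scale > 1 the guided drift deliberately overshoots the conditional target, just as CFG does in diffusion samplers.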

Model Zoo

A non-exhaustive catalog of notable flow matching models and their key characteristics:

Model | Backbone | Domain | Steps | Key Innovation
Stable Diffusion 3 | MM-DiT | Text-to-Image | 8–28 | Joint text-image attention with rectified flow
Flux.1-dev | DiT | Text-to-Image | 20–50 | Parallel + single-stream transformer blocks
Flux.1-schnell | DiT | Text-to-Image | 1–4 | Guidance-distilled from Flux.1-dev
π0 | VLM + Flow | Robot Actions | 10 | Flow matching action head for dexterous manipulation
InstaFlow | U-Net | Text-to-Image | 1 | Reflow distillation to single-step generation
SiT | DiT | Image Generation | 10–50 | Scalable interpolant transformers with flow matching
Voicebox | Transformer | Speech Synthesis | 16 | Non-autoregressive TTS via conditional flow matching
Riemannian FM | Various | Manifold Data | 10–50 | Extends flow matching to Riemannian manifolds
CogVideoX | 3D DiT | Text-to-Video | 50 | Expert transformer with rectified flow for video