What Is Flow Matching?
Flow matching is a framework for training continuous normalizing flows (CNFs) that learn a velocity field transporting a simple noise distribution (e.g. Gaussian) to a complex data distribution. The core insight: instead of learning a curved, stochastic denoising process like diffusion, flow matching constructs straight-line probability paths between noise and data.
The velocity field v_θ(x_t, t) is parameterized by a neural network and
defines an ordinary differential equation (ODE):
dx_t/dt = v_θ(x_t, t),  t ∈ [0, 1].
At t = 0 we have pure noise; at t = 1 we have data. The network learns to push probability mass along approximately optimal transport (OT) paths — short, nearly straight routes from noise to data. Because the paths are close to straight, simple ODE solvers (even Euler with very few steps) can integrate them accurately.
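The "straight paths integrate exactly" claim can be checked in a few lines. A minimal NumPy sketch — the constant conditional velocity u = x_1 − x_0 below is an idealized stand-in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)   # noise sample at t = 0
x1 = rng.standard_normal(4)   # "data" sample at t = 1

def velocity(x, t):
    # Idealized straight-line velocity: constant u = x1 - x0.
    # A trained v_theta would approximate this along the path.
    return x1 - x0

# A single Euler step with dt = 1 lands exactly on the data point,
# because the velocity is constant along the straight path.
x = x0 + 1.0 * velocity(x0, 0.0)
print(np.allclose(x, x1))  # True
```

A learned field is only approximately constant, so real samplers use a handful of steps rather than one — but the same Euler update applies.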
Architecture
The architecture is a continuous normalizing flow: a neural network
v_θ (typically a U-Net or DiT) that predicts a velocity vector at
every point in space and time. Integrating this velocity field from t=0 to t=1
transforms noise into data samples.
The training objective is strikingly simple:
L_CFM(θ) = E_{t, x_0, x_1} || v_θ(x_t, t) − u_t ||²
where u_t = x_1 − x_0 is the conditional velocity that moves noise sample
x_0 to data sample x_1 along a straight line:
x_t = (1-t) x_0 + t x_1.
The network simply regresses to match these straight-line velocities.
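In code, the regression target for one (noise, data) pair is just a difference vector. A NumPy sketch of the target construction (no model included):

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.standard_normal((8, 2))   # batch of noise samples
x1 = rng.standard_normal((8, 2))   # batch of data samples
t = rng.uniform(size=(8, 1))       # one time per sample

x_t = (1 - t) * x0 + t * x1        # point on the straight path
u_t = x1 - x0                      # conditional velocity target

# The target is the time-derivative of the interpolation:
# d/dt [(1-t) x0 + t x1] = x1 - x0, independent of t.
eps = 1e-4
x_t_eps = (1 - (t + eps)) * x0 + (t + eps) * x1
print(np.allclose((x_t_eps - x_t) / eps, u_t))  # True
```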
Core Mechanisms
Model a continuous-time transformation via an ODE. Unlike discrete normalizing flows (with invertible layers), CNFs define a smooth path through time, parameterized by a learned velocity field. The change-of-variables formula gives exact log-likelihoods.
OT theory finds the minimum-cost mapping between distributions. Flow matching uses OT to construct straight interpolation paths xt = (1-t)x0 + tx1, which are provably the shortest paths under Euclidean cost. Straighter paths = fewer integration steps = faster sampling.
The key training trick: instead of learning the global velocity field directly (intractable), learn to match conditional velocities that connect each noise-data pair along a straight line. The marginal of these conditional flows recovers the true velocity field. This makes training a simple per-sample regression.
The interpolation xt = (1-t)x0 + tx1 defines a time-dependent Gaussian probability path pt(x | x1) with mean μt = tx1 and standard deviation σt = 1-t (variance (1-t)²). This path smoothly transforms from N(0, I) at t=0 to a delta at x1 at t=1.
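These path statistics are easy to verify by Monte Carlo. A small sketch, assuming a fixed scalar data point x1 = 2.0:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = 2.0                  # a fixed (scalar) data point
t = 0.3
x0 = rng.standard_normal(200_000)   # noise ~ N(0, 1)

x_t = (1 - t) * x0 + t * x1         # samples from p_t(x | x1)

print(round(x_t.mean(), 2))  # ~= t * x1  = 0.6
print(round(x_t.std(), 2))   # ~= 1 - t   = 0.7
```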
Diffusion vs Flow Matching
Both frameworks generate data by transforming noise, but they take fundamentally different paths through the probability space:
Diffusion (DDPM/DDIM)
Follows curved paths through probability space. The stochastic denoising process requires many small steps to stay on the learned manifold. Noise schedule and variance weighting are critical hyperparameters.
Flow Matching
Follows straight OT paths. The deterministic ODE has near-constant velocity, so even a coarse Euler integrator produces high-quality samples. No noise schedule needed — just linear interpolation.
Reflow / Distillation
Even after training with OT paths, the learned velocity field may not produce perfectly straight trajectories (it matches conditional velocities, not global OT). Reflow is a self-distillation technique that straightens the learned flow:
Train the teacher
Train a flow matching model vθ with standard CFM loss. This gives good but not perfectly straight paths.
Generate (x0, x1) pairs
Sample noise x0 and run the teacher ODE to get paired data x1 = ODE(x0; vθ). These pairs lie on the teacher's actual trajectories.
Retrain on paired data
Train a student vφ on these (x0, x1) pairs with the same CFM loss. The student learns straight paths between the teacher's actual endpoints, making trajectories straighter.
Iterate (optional)
Repeat steps 2–3. Each round straightens the paths further. After 2–3 rounds of reflow, a single Euler step can produce reasonable samples.
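The pair-generation step (step 2) can be sketched as follows. `teacher_v` below is a hypothetical hand-written stand-in for a trained teacher network, not a real model:

```python
import numpy as np

rng = np.random.default_rng(3)

def teacher_v(x, t):
    # Hypothetical stand-in for a trained teacher v_theta:
    # a slightly curved field pushing samples toward +2.
    return (2.0 - x) / (1.0 - t + 0.1)

def euler_ode(v, x0, n_steps=100):
    # Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v(x, k * dt)
    return x

# Step 2 of reflow: pair each noise sample with the teacher's endpoint.
x0 = rng.standard_normal(512)
x1 = euler_ode(teacher_v, x0)

# Step 3 would retrain a student with the CFM loss on these (x0, x1)
# pairs, whose straight-line targets are simply:
u = x1 - x0
```

The point is that (x0, x1) are now coupled by the teacher's actual flow, so the student's straight-line targets no longer cross each other the way random noise-data pairings do.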
Key Models
Flow matching has rapidly become the backbone of state-of-the-art generative models, replacing diffusion in many flagship systems:
Stable Diffusion 3
Uses rectified flow matching with a multimodal DiT (MM-DiT) backbone. Text and image tokens attend to each other via joint attention. The flow formulation enables high-quality 8-step generation.
Flux.1
Built by the original Stable Diffusion team. Pure rectified flow transformer with rotary positional embeddings. Flux.1-dev and Flux.1-schnell (distilled to 4 steps) set new benchmarks in text-to-image quality.
π0
Uses flow matching as the action generation head in a vision-language-action model. The velocity field maps noise to robot actions, enabling multimodal action distributions for dexterous manipulation.
Training Pipeline
Sample data and noise
Draw a data sample x1 from the training set and noise x0 ~ N(0, I).
Sample time and interpolate
Draw t ~ U(0,1). Construct the interpolated point xt = (1-t)x0 + tx1.
Compute target velocity
The target velocity is simply ut = x1 − x0 (the direction from noise to data).
Regress
Feed x_t and t into the network. Minimize || v_θ(x_t, t) − u_t ||². That is it. No noise schedule, no weighting tricks.
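The four steps above fit in a short loop. A toy NumPy sketch that fits a linear velocity model v(x, t) = W0·x + W1·t + W2 by SGD on 1-D data — a deliberately tiny stand-in for a real U-Net/DiT, with all names illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=0.5, size=4096)  # toy 1-D "dataset"
W = np.zeros(3)  # parameters of the linear velocity model

def v_theta(x, t):
    # Stand-in for a neural network: v(x, t) = W0*x + W1*t + W2.
    return W[0] * x + W[1] * t + W[2]

def cfm_loss(batch=2048):
    x1 = rng.choice(data, size=batch)       # 1. sample data
    x0 = rng.standard_normal(batch)         #    ... and noise
    t = rng.uniform(size=batch)             # 2. sample time
    x_t = (1 - t) * x0 + t * x1             #    ... and interpolate
    return np.mean((v_theta(x_t, t) - (x1 - x0)) ** 2)

loss_before = cfm_loss()
for _ in range(2000):                       # plain SGD on the CFM loss
    x1 = rng.choice(data, size=64)
    x0 = rng.standard_normal(64)
    t = rng.uniform(size=64)
    x_t = (1 - t) * x0 + t * x1
    err = v_theta(x_t, t) - (x1 - x0)       # 3.-4. target and regression
    W -= 0.05 * np.array([np.mean(err * x_t), np.mean(err * t), np.mean(err)])

print(cfm_loss() < loss_before)  # True: the CFM loss has dropped
```

A linear model cannot represent the true marginal velocity exactly, but the loss falls substantially — the same recipe with a deep network is the entire training pipeline.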
Inference Pipeline
Sample initial noise
Draw x0 ~ N(0, I) in the latent space (for a latent model like SD3 or Flux, noise is sampled directly at the latent resolution; no VAE encoder pass is needed at inference — the decoder is only applied at the end).
Integrate the ODE
Use an ODE solver (Euler, midpoint, or adaptive RK45) to integrate dx/dt = vθ(xt, t) from t=0 to t=1. Each step is one network forward pass.
Apply guidance (optional)
For conditional generation, use classifier-free guidance: v_guided = v_uncond + s (v_cond − v_uncond) with guidance scale s. Works identically to diffusion CFG.
Decode
Pass the final x1 through the VAE decoder (for latent models) to get the output image, video, or action sequence.
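Putting the inference steps together: a sketch of an Euler sampler with classifier-free guidance. `v_cond` and `v_uncond` are hypothetical hand-written stand-ins for one network queried with and without the condition — here, ideal fields whose flows end at points c and u respectively:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in velocity fields: for a single endpoint e, the ideal
# velocity at (x, t) is (e - x) / (1 - t).
c, u = 1.0, 0.0
def v_cond(x, t):   return (c - x) / (1.0 - t)
def v_uncond(x, t): return (u - x) / (1.0 - t)

def sample(n_steps=50, s=2.0):
    x = rng.standard_normal(16)            # 1. initial noise
    dt = 1.0 / n_steps
    for k in range(n_steps):               # 2. Euler integration
        t = k * dt                         #    (t stays below 1: no 1/0)
        v = v_uncond(x, t) + s * (v_cond(x, t) - v_uncond(x, t))  # 3. CFG
        x = x + dt * v
    return x                               # 4. decoding would go here

x1 = sample()
print(np.allclose(x1, u + 2.0 * (c - u)))  # True: guided endpoint is 2.0
```

Note the familiar CFG effect: with s = 2, the guided flow overshoots the conditional endpoint c = 1 and lands at u + s(c − u) = 2, exactly as the extrapolation formula predicts.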
Model Zoo
A non-exhaustive catalog of notable flow matching models and their key characteristics:
| Model | Backbone | Domain | Steps | Key Innovation |
|---|---|---|---|---|
| Stable Diffusion 3 | MM-DiT | Text-to-Image | 8–28 | Joint text-image attention with rectified flow |
| Flux.1-dev | DiT | Text-to-Image | 20–50 | Parallel + single-stream transformer blocks |
| Flux.1-schnell | DiT | Text-to-Image | 1–4 | Guidance-distilled from Flux.1-dev |
| π0 | VLM + Flow | Robot Actions | 10 | Flow matching action head for dexterous manipulation |
| InstaFlow | U-Net | Text-to-Image | 1 | Reflow distillation to single-step generation |
| SiT | DiT | Image Generation | 10–50 | Scalable interpolant transformers with flow matching |
| Voicebox | Transformer | Speech Synthesis | 16 | Non-autoregressive TTS via conditional flow matching |
| Riemannian FM | Various | Manifold Data | 10–50 | Extends flow matching to Riemannian manifolds |
| CogVideoX | 3D DiT | Text-to-Video | 50 | Expert transformer with rectified flow for video |