Introduction
Diffusion models have achieved extraordinary results, but their training and sampling mechanics are surprisingly indirect. To generate data, you learn to reverse a noise-corruption process — predicting the noise added at each of a thousand tiny steps. The resulting sample trajectories curve and meander through high-dimensional space, requiring many network evaluations to produce a single sample.
In 2023, two independent lines of work converged on a radically simpler alternative.
Lipman et al. (Flow Matching for Generative Modeling) and Liu et al.
(Flow Straight and Fast) proposed flow matching: instead of learning
to reverse a diffusion process, learn a velocity field that transports noise
samples directly to data samples along straight-line paths. The neural network predicts
vθ(x, t) — how fast and in what direction a point x should move at
time t — and an ODE integrator follows these velocities from noise to data.
The advantages are striking. Training is simulation-free — no need to run a forward process or compute complex noise schedules. The learned paths are straighter, requiring fewer integration steps at generation time. The objective is a simple regression loss with no weighting gymnastics. And the mathematical framework — continuous normalizing flows — is cleaner and more general than the Markov chain machinery of DDPMs.
This article builds on the SDE perspective from Article 04 and the score-based framework from Article 03. Familiarity with neural ODEs is helpful but not required — we derive everything from scratch. The key prerequisite is comfort with the idea of a time-dependent vector field guiding samples through space.
Continuous Normalizing Flows
Before we can understand flow matching, we need the mathematical object it operates on: the continuous normalizing flow (CNF). Introduced by Chen et al. (2018) as part of their Neural ODE framework, a CNF defines a generative model through an ordinary differential equation rather than a sequence of discrete transformations.
Neural ODEs
A Neural ODE defines a continuous transformation of space. Given a neural network
vθ(x, t) that takes a position x and time t, we define the flow
through the ODE:

dx(t)/dt = vθ(x(t), t)
Starting from an initial point x0 at time t = 0, we integrate this ODE
forward to get the transformed point x1 at time t = 1. The function
vθ is the velocity field — at every point in space
and every moment in time, it tells you which direction to move and how fast. The entire trajectory
x(t) for t ∈ [0, 1] is determined by the initial condition and the velocity field.
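As a concrete (if toy) illustration — the helper name integrate and the hand-written contracting field below are ours, not from any of the cited papers — a forward-Euler solve of this ODE takes just a few lines:

```python
import torch

def integrate(v, x0, steps=100):
    """Forward-Euler solve of dx/dt = v(x, t) from t = 0 to t = 1."""
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + v(x, t) * dt  # move each point along its velocity
    return x

# A hand-written velocity field that contracts every point toward (2, 0):
target = torch.tensor([2.0, 0.0])
field = lambda x, t: target - x

torch.manual_seed(0)
x0 = torch.randn(64, 2)        # initial points
x1 = integrate(field, x0)      # every point has moved toward (2, 0)
```

Swapping the hand-written field for a neural network vθ gives exactly the generative setup described above.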
We write the solution map as φt(x0) — the flow map
that sends each starting point to its location at time t. For a generative model, we want this flow
to transform a simple distribution (Gaussian noise) into a complex one (data). That is:

x0 ~ p0 = 𝒩(0, I)  ⟹  x1 = φ1(x0) ~ p1 ≈ pdata
Push-forward & change of variables
The flow map φt doesn't just move individual points — it transforms
entire probability distributions. If x0 ~ p0, then the
distribution of xt = φt(x0) is called the
push-forward of p0 through φt, written
(φt)#p0 = pt.
Because the ODE defines a smooth, invertible mapping (as long as vθ is Lipschitz), we can compute exact log-likelihoods via the instantaneous change of variables formula (also from Chen et al., 2018):

d log pt(x(t)) / dt = −∇ · vθ(x(t), t)
This is beautiful: the log-density along a trajectory changes at a rate given by the divergence of the velocity field. Regions where the flow expands (positive divergence) decrease in density; regions where it compresses (negative divergence) increase in density. Unlike discrete normalizing flows, there are no architectural constraints — any neural network architecture can serve as vθ.
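To make the divergence term concrete, here is a small autograd computation of ∇ · vθ (exact, looping over dimensions, so only sensible in low dimension — in high dimension practitioners use the Hutchinson trace estimator instead). The helper name divergence is ours:

```python
import torch

def divergence(v, x, t):
    """Exact divergence ∇·v at points x: sum of ∂v_i/∂x_i via autograd."""
    x = x.detach().requires_grad_(True)
    out = v(x, t)
    div = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):
        # gradient of the i-th output component w.r.t. all inputs; keep column i
        grad_i = torch.autograd.grad(out[:, i].sum(), x, create_graph=True)[0]
        div = div + grad_i[:, i]
    return div

# For v(x) = -x in 2 dimensions, ∇·v = -2 everywhere (pure compression):
contracting = lambda x, t: -x
d = divergence(contracting, torch.randn(5, 2), t=0.0)
```

A field with everywhere-negative divergence, like this one, concentrates density — matching the compression intuition above.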
The original CNF training approach (Chen et al., 2018) required simulating the ODE during training to compute the loss, making it extremely slow. Maximum likelihood training required computing the divergence trace, adding further cost. Flow matching solves both problems by providing a simulation-free training objective.
Velocity arrows show the learned flow at each time t. Particles follow the velocity field from noise (t=0) to data (t=1).
The Flow Matching Objective
Now we have the goal: find a velocity field vθ(x, t) whose flow
transforms p0 = 𝒩(0, I) into p1 ≈ pdata. But how
do we train it?
The most natural approach would be to regress vθ onto a target velocity field ut(x) that generates the desired probability path pt:

ℒFM(θ) = 𝔼t∼U[0,1], x∼pt ||vθ(x, t) − ut(x)||²
This is the flow matching objective: at each time t, sample a point x from the marginal distribution pt, and regress our network's predicted velocity onto the true velocity that would generate pt.
There's a fundamental problem: we don't know ut(x) or pt(x). The target velocity field ut is defined implicitly — it's the velocity field whose flow generates the probability path from p0 to p1. Computing it would require knowing the transport map between the noise distribution and the data distribution, which is exactly what we're trying to learn. We're stuck in a chicken-and-egg problem.
This is precisely the impasse that made original CNF training require simulation. The breakthrough of flow matching is realizing we can sidestep this problem entirely by conditioning.
Conditional Flow Matching
The key insight of Lipman et al. (2023) is elegant: instead of trying to construct the marginal probability path pt and its velocity field ut, we construct conditional probability paths pt(x | x1) — one for each data point x1 — where both the path and the velocity are known in closed form.
The Gaussian probability path
For each data sample x1, define the conditional probability path:

pt(x | x1) = 𝒩(x | t x1, (1 − t)² I)
At t = 0, this is 𝒩(0, I) — pure noise. At t = 1, it collapses to a delta at x1 — the data point. The mean interpolates linearly from the origin to x1, while the variance shrinks to zero. Samples from this conditional path follow a simple formula:
Sample x0 ~ 𝒩(0, I) and a data point x1 ~ pdata. The conditional sample at time t is:

xt = (1 − t) x0 + t x1
This is just linear interpolation between noise and data! The trajectory for each (x0, x1) pair is a perfectly straight line.
Now, what is the velocity along this straight-line path? Since xt = (1−t)x0 + t·x1, differentiating with respect to t gives:

dxt/dt = x1 − x0
The conditional velocity is constant — it doesn't depend on t or on the current position xt. It's simply the displacement vector from the noise sample to the data sample. Each particle moves at constant speed in a straight line from where it started to where it needs to go.
The CFM loss
The conditional flow matching (CFM) loss replaces the intractable marginal velocity ut(x) with the known conditional velocity ut(x | x1):

ℒCFM(θ) = 𝔼t∼U[0,1], x1∼pdata, x0∼𝒩(0,I) ||vθ(xt, t) − (x1 − x0)||²
where xt = (1 − t) x0 + t x1.
This is simulation-free: no ODE integration during training. Just sample noise, sample data, interpolate, and regress the network output onto the displacement vector.
The mathematical miracle: Lipman et al. prove that the CFM loss has the same gradients as the intractable FM loss (up to a constant). Training with the conditional objective recovers the correct marginal velocity field. The proof uses the fact that the marginal velocity ut(x) is a mixture of conditional velocities ut(x | x1) weighted by the posterior p(x1 | xt). Regressing onto conditional velocities at points sampled from the conditional paths yields the same expected gradient as regressing onto marginal velocities at points sampled from the marginal path.
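In symbols (our notation, with q denoting the data distribution), the mixture relation underlying the proof reads:

```latex
u_t(x) \;=\; \int u_t(x \mid x_1)\,\frac{p_t(x \mid x_1)\,q(x_1)}{p_t(x)}\,\mathrm{d}x_1
\;=\; \mathbb{E}_{x_1 \sim p(x_1 \mid x_t = x)}\bigl[\,u_t(x \mid x_1)\,\bigr]
```

Because the minimizer of a squared-error regression is the conditional expectation of its target, regressing vθ onto conditional velocities recovers exactly this marginal ut.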
Compare this to DDPM training: sample noise ε, compute xt = √ᾱt x0 + √(1 − ᾱt) ε, regress onto ε (note the clashing conventions: DDPM's x0 is the clean data, which plays the role of flow matching's x1). The structure is nearly identical! The difference: flow matching regresses onto the displacement x1 − x0 rather than the noise ε, and uses linear interpolation rather than the diffusion noising schedule.
Straight lines connect noise samples (left) to data samples (right); particles travel along these paths, tracing the marginal distribution at each time t.
Optimal Transport & Rectified Flows
Standard CFM pairs noise samples x0 with data samples x1 randomly — each x0 is independently sampled and independently paired with an x1. But the pairing between source and target points has a dramatic effect on the geometry of the learned flow. Random pairing produces crossing trajectories: a noise sample on the left might be paired with a data sample on the right, while a nearby noise sample on the right is paired with data on the left. Their paths cross, creating unnecessary curvature in the marginal velocity field.
Optimal transport (OT) provides the mathematically optimal pairing. The OT coupling minimizes the total displacement cost — it pairs source and target points to minimize the sum of squared distances ∑i ||x0,i − x1,π(i)||² over permutations π. This produces non-crossing trajectories that are globally straighter.
OT-CFM
Tong et al. (2024) proposed OT-CFM: use mini-batch optimal transport to pair noise and data samples within each training batch. For a batch of N noise samples and N data samples, solve the discrete OT problem (using the Sinkhorn algorithm or Hungarian method) to find the optimal assignment, then train with the CFM loss using these OT-paired samples.
The mini-batch OT approximation is surprisingly effective. Even with small batch sizes (64-256), the resulting flows are measurably straighter and can be sampled with fewer ODE solver steps. The computational overhead of solving the OT problem within each batch is negligible compared to the neural network forward pass.
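A minimal sketch of the mini-batch pairing step, using SciPy's Hungarian solver (linear_sum_assignment); the helper name ot_pair is ours, not from the paper:

```python
import torch
from scipy.optimize import linear_sum_assignment

def ot_pair(x0, x1):
    """Re-order a batch so each noise sample is matched to the data sample
    that minimizes the total squared displacement (exact mini-batch OT)."""
    cost = torch.cdist(x0.flatten(1), x1.flatten(1)) ** 2  # (N, N) pairwise costs
    rows, cols = linear_sum_assignment(cost.numpy())       # Hungarian method
    return x0[rows], x1[cols]

# Two noise points and two data points whose naive pairing would cross:
x0 = torch.tensor([[0.0], [1.0]])
x1 = torch.tensor([[1.1], [0.1]])
a, b = ot_pair(x0, x1)  # OT pairs 0.0 with 0.1, and 1.0 with 1.1
```

The paired batches then feed straight into the usual CFM loss; for larger batches the entropy-regularized Sinkhorn algorithm is a faster approximate alternative.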
Reflow: iterative straightening
Liu et al. (2023) proposed an alternative approach to straightening: reflow (also called rectified flows). The idea is elegantly recursive:
1. Train a flow model with random (x0, x1) pairings.
2. Use the trained model to generate synthetic (x0, x1) pairs by integrating the ODE from noise to data.
3. Re-train a new flow model using these synthetic pairs as training data.
4. Repeat. Each iteration produces straighter flows.
Why does this work? After step 1, the model has learned some flow φ. In step 2, we draw x0 ~ 𝒩(0, I) and compute x1 = φ1(x0). This (x0, x1) pair is now causally coupled — they're connected by the learned flow. Training on these coupled pairs in step 3 straightens the trajectories because the new model learns to go directly from x0 to its corresponding x1 in a straight line, rather than following the potentially curved path the original model took.
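Step 2 — generating the coupled pairs — can be sketched as follows (a fixed-step Euler solve; the helper name make_reflow_pairs is ours):

```python
import torch

@torch.no_grad()
def make_reflow_pairs(model, n, dim, steps=100):
    """Draw noise, push it through the learned ODE, and return the
    causally coupled (x0, x1) pairs used to retrain a straighter flow."""
    x0 = torch.randn(n, dim)
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.full((n, 1), i * dt)
        x = x + model(x, t) * dt  # Euler step along the learned velocity
    return x0, x

# Sanity check with a stand-in "model" of constant velocity 1: x1 = x0 + 1.
const_model = lambda x, t: torch.ones_like(x)
x0, x1 = make_reflow_pairs(const_model, n=8, dim=2)
```

Training the next flow on (x0, x1) pairs produced this way is what straightens the trajectories.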
The ultimate payoff of straight flows: if trajectories are perfectly straight, you only need one Euler step to go from noise to data. This means a well-straightened flow can be distilled into a single-step generator — essentially collapsing the iterative ODE solve into a single forward pass. This is the foundation of consistency distillation and one-step generation methods.
Left: diffusion model trajectories curve through space. Right: flow matching trajectories are nearly straight. Straighter paths = fewer solver steps = faster generation.
Flow Matching vs Diffusion: A Detailed Comparison
Flow matching and diffusion models are more similar than they first appear. They use the same neural network architectures (U-Nets, DiTs), operate over the same data types, and produce comparable sample quality. The differences lie in the training objective, the interpolation schedule, and the geometry of sample trajectories.
| Aspect | DDPM / Diffusion | Flow Matching |
|---|---|---|
| Framework | Reverse Markov chain / SDE | Continuous normalizing flow / ODE |
| Network predicts | Noise εθ(xt, t) | Velocity vθ(xt, t) |
| Interpolation | xt = √ᾱt x0 + √(1−ᾱt) ε | xt = (1−t) x0 + t x1 |
| Target | ε (the noise that was added) | x1 − x0 (displacement vector) |
| Training loss | ||εθ(xt, t) − ε||² | ||vθ(xt, t) − (x1−x0)||² |
| Noise schedule | Learned or fixed βt schedule | Linear interpolation (no schedule needed) |
| Path geometry | Curved (noise added/removed incrementally) | Straight (linear interpolation) |
| Sampling steps | 50–1000 typical | 10–100 typical (fewer with OT/reflow) |
| Likelihood | ELBO (lower bound) | Exact via change of variables (but expensive) |
The relationship between the two frameworks runs deep. Kingma & Gao (2024) showed that diffusion models and flow matching are members of the same family — both can be viewed as learning a velocity field for an ODE, just with different interpolation schedules and parameterizations. The ε-prediction of diffusion and the velocity prediction of flow matching are related by a simple linear transformation that depends on the noise schedule.
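A hedged sketch of that relation, in a generic-interpolant notation (αt and σt denote the schedule, x1 the data, ε the noise; signs and schedule conventions vary across papers): writing xt = αt x1 + σt ε, differentiating in t, and eliminating x1 via x1 = (xt − σt ε)/αt gives

```latex
\dot{x}_t \;=\; \dot{\alpha}_t\,x_1 + \dot{\sigma}_t\,\varepsilon
\;=\; \frac{\dot{\alpha}_t}{\alpha_t}\,x_t
\;+\;\Bigl(\dot{\sigma}_t - \frac{\dot{\alpha}_t\,\sigma_t}{\alpha_t}\Bigr)\varepsilon
```

so an ε-prediction network determines the velocity (and vice versa) through schedule-dependent linear coefficients.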
In practice, the choice often comes down to engineering convenience. Flow matching's linear interpolation is simpler to implement and reason about. Its straighter paths mean fewer sampling steps, which translates directly to faster inference. Modern systems like Stable Diffusion 3 (Esser et al., 2024) use flow matching (specifically, rectified flows) as their training framework — a testament to its practical advantages.
Extensions & Frontiers
The flow matching framework has proven remarkably extensible. Its clean mathematical structure — define a probability path, derive the conditional velocity, regress — generalizes far beyond Euclidean spaces and Gaussian paths.
Riemannian Flow Matching (Chen & Lipman, 2024) extends flow matching to curved spaces — spheres, tori, hyperbolic spaces, and general Riemannian manifolds. This is crucial for data that lives on manifolds: molecular conformations (SO(3) for rotations), geological data (the sphere S²), or protein backbone angles (the torus). The key modification: replace linear interpolation with geodesic interpolation on the manifold, and replace Euclidean velocities with tangent vectors.
Discrete Flow Matching (Campbell et al., 2024) adapts the framework to discrete data — text, categorical features, molecular graphs. Instead of continuous velocities, the model learns transition rates between discrete states. This provides a principled alternative to autoregressive generation and discrete diffusion, with the same straight-path advantages.
Stochastic Interpolants (Albergo & Vanden-Eijnden, 2023) generalize the interpolation xt = (1−t)x0 + t·x1 by adding stochastic noise along the path: xt = αt x0 + βt x1 + σt ε. This interpolates between deterministic ODE flows (σt = 0) and stochastic SDE flows (σt > 0), unifying flow matching and score-based diffusion in a single framework.
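As a sketch, one valid choice of coefficients (ours — many others satisfy the framework) keeps the linear endpoints and injects noise that vanishes at t = 0 and t = 1:

```python
import torch

def stochastic_interpolant(x0, x1, t, sigma=0.1):
    """Sample x_t = (1-t) x0 + t x1 + sigma * sqrt(t(1-t)) * eps.
    The noise amplitude vanishes at both endpoints, so x0 and x1 are hit exactly."""
    eps = torch.randn_like(x0)
    return (1 - t) * x0 + t * x1 + sigma * torch.sqrt(t * (1 - t)) * eps

x0 = torch.randn(4, 2)                         # source samples
x1 = torch.randn(4, 2)                         # target samples
at_start = stochastic_interpolant(x0, x1, torch.zeros(4, 1))  # equals x0
at_end = stochastic_interpolant(x0, x1, torch.ones(4, 1))     # equals x1
```

Setting sigma = 0 recovers the deterministic linear path of standard flow matching.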
Schrödinger Bridges provide the mathematically optimal stochastic transport between two distributions. While OT gives the optimal deterministic transport (straight lines with optimal pairing), Schrödinger bridges solve the entropy-regularized version, yielding diffusion processes (not just ODEs) that optimally connect distributions. Several recent works (De Bortoli et al., 2021; Shi et al., 2023) connect Schrödinger bridges to flow matching, enabling optimal stochastic transport without simulation-based training.
Left: random pairing creates crossing paths. Right: optimal transport pairing minimizes total displacement, producing non-crossing trajectories.
Implementation
One of flow matching's greatest strengths is how little code it requires. The training loop is shorter and simpler than DDPM, with no noise schedule, no ᾱt computation, and no variance weighting. Here's a side-by-side comparison:
DDPM training step:
```python
# DDPM training step
x_0 = get_data_batch()                        # clean data, shape (B, C, H, W)
t = torch.randint(0, T, (B,))                 # discrete timestep
eps = torch.randn_like(x_0)                   # noise
alpha_bar_t = alpha_bar[t].view(B, 1, 1, 1)   # cumulative schedule, broadcastable
x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * eps
eps_pred = model(x_t, t)                      # predict noise
loss = F.mse_loss(eps_pred, eps)
```
Flow matching training step:
```python
# Flow matching training step
x_1 = get_data_batch()                # data (target)
x_0 = torch.randn_like(x_1)           # noise (source)
t = torch.rand(B, 1, 1, 1)            # continuous time ~ U[0, 1]
x_t = (1 - t) * x_0 + t * x_1         # linear interpolation
v_pred = model(x_t, t)                # predict velocity
loss = F.mse_loss(v_pred, x_1 - x_0)  # target = displacement
```
The simplicity is remarkable. Flow matching eliminates the noise schedule entirely — no βt, no αt, no cumulative products. Time is sampled continuously from U[0,1] rather than as discrete integers. The interpolation is a simple lerp. The target is the raw displacement.
Time sampling strategies. While uniform sampling t ~ U[0,1] works, several works have found that non-uniform time sampling improves training. Logit-normal sampling (t ~ σ(Normal(0, 1))) concentrates samples near t = 0.5, where the interpolation is most ambiguous and the velocity field most complex. Stable Diffusion 3 uses this approach. Another option is to sample t with a density proportional to the expected loss at that timestep, focusing training effort on the hardest times.
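The logit-normal sampler is one line; the comparison below simply checks that it does concentrate mass near t = 0.5 (variable names are ours):

```python
import torch

torch.manual_seed(0)
B = 4096
t_uniform = torch.rand(B)                       # baseline: t ~ U[0, 1]
t_logit_normal = torch.sigmoid(torch.randn(B))  # sigmoid of a standard normal

# Fraction of samples landing in the middle band [0.25, 0.75]:
mid_uniform = ((t_uniform - 0.5).abs() < 0.25).float().mean()       # ≈ 0.5 in expectation
mid_logit = ((t_logit_normal - 0.5).abs() < 0.25).float().mean()    # noticeably higher
```

The logit-normal samples cluster around the middle of the trajectory, where the velocity field is hardest to learn.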
Loss weighting. Although the unweighted MSE loss works well, some practitioners apply a time-dependent weight w(t) to the loss. Common choices include w(t) = 1/(1 − t + ε) to emphasize the final stages (where small errors become large in pixel space), or w(t) = 1/σt² to normalize for the varying scale of the velocity across time.
Sampling (inference). To generate samples, integrate the learned ODE from t = 0 to t = 1:
```python
# Flow matching sampling (Euler method)
x = torch.randn(B, C, H, W)  # start from noise
steps = 50                   # number of ODE steps
dt = 1.0 / steps
for i in range(steps):
    t = torch.full((B, 1, 1, 1), i * dt)
    v = model(x, t)          # predicted velocity
    x = x + v * dt           # Euler step
# x is now approximately a data sample
```
Higher-order ODE solvers (Runge-Kutta, DPM-Solver) can reduce the number of steps further. With well-straightened flows (from OT-CFM or reflow), even 10–20 Euler steps can produce high-quality samples.
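A midpoint (second-order Runge-Kutta) variant of the Euler loop costs two model calls per step but is exact for velocity fields linear in t; the name sample_midpoint is ours:

```python
import torch

@torch.no_grad()
def sample_midpoint(model, x, steps=20):
    """Midpoint-rule ODE sampler: evaluate the velocity at the start of each
    step, then take the full step using the velocity at the step's midpoint."""
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        k = model(x, t)                                      # slope at step start
        x = x + model(x + 0.5 * dt * k, t + 0.5 * dt) * dt   # midpoint slope
    return x

# With velocity v(x, t) = 2t the exact solution is x(1) = x(0) + 1,
# and the midpoint rule reproduces it to floating-point accuracy.
linear_in_t = lambda x, t: 2.0 * t * torch.ones_like(x)
out = sample_midpoint(linear_in_t, torch.zeros(3, 1))
```

For curved flows this typically matches Euler's quality with far fewer steps; for well-rectified flows the advantage shrinks, since straight paths make Euler nearly exact.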
References
Seminal papers and key works referenced in this article.
- Lipman et al. "Flow Matching for Generative Modeling." ICLR, 2023. arXiv
- Liu et al. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." ICLR, 2023. arXiv
- Albergo & Vanden-Eijnden. "Building Normalizing Flows with Stochastic Interpolants." ICLR, 2023. arXiv
- Esser et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." ICML, 2024. arXiv