Train Continuous Normalizing Flows without simulating ODEs. Regress vector fields directly using conditional probability paths — enabling straight-line Optimal Transport trajectories that diffusion models cannot achieve.
You want to turn noise into images. Diffusion models do this brilliantly — but they discovered an unintuitive trick: add noise gradually, then learn to reverse it. The resulting trajectories from noise to data are curved, wandering paths through high-dimensional space.
What if you could just learn straight lines from noise to data? That's faster to traverse (fewer ODE steps at inference), easier to learn (simpler vector field), and more stable to train. But there's a catch: the elegant mathematical framework for this — Continuous Normalizing Flows — has been impractical because training requires simulating ODEs, which is extremely expensive.
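The key advantages of Flow Matching over traditional diffusion training, developed in the sections below, are simulation-free training with closed-form regression targets, the freedom to choose any Gaussian probability path, and straighter trajectories that need fewer ODE steps at inference.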
Diffusion models follow curved paths (orange). Flow Matching with OT follows straight paths (teal).
A Continuous Normalizing Flow (CNF) defines a generative model by moving samples along a time-dependent vector field. Think of it as a river: at time t = 0 you drop particles into noise, and the flow carries them to data at t = 1.
A CNF is defined by a vector field vt : Rd → Rd that generates a flow φt via the ODE:

$$\frac{d}{dt}\,\phi_t(x) = v_t\big(\phi_t(x)\big)$$
Starting from φ0(x) = x (identity at t = 0), the flow map φt transforms samples from the source distribution p0 (noise) to the target distribution p1 (data). At each time t, there is an intermediate distribution pt = [φt]# p0 — the pushforward of p0 through φt.
The densities pt satisfy the continuity equation:

$$\frac{\partial p_t(x)}{\partial t} + \nabla \cdot \big(p_t(x)\, v_t(x)\big) = 0$$
This is a conservation law: probability mass is neither created nor destroyed, it just flows. The vector field vt generates the probability path pt. Given vt, you can compute pt by solving this PDE (or equivalently, by simulating the ODE and pushing samples forward).
To generate a sample: draw x0 ~ p0 = N(0, I), then solve the ODE forward from t = 0 to t = 1 using an ODE solver (Euler, RK45, etc.):

$$\frac{dx_t}{dt} = v_\theta(t, x_t), \qquad x_1 = x_0 + \int_0^1 v_\theta(t, x_t)\, dt$$
With a learned vθ, this gives you a sample from the model's approximation of q (the data distribution).
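The sampling procedure above can be sketched with a fixed-step Euler solver. This is a minimal illustration in plain Python, not a production sampler: `euler_sample` and `toy_field` are hypothetical names, and the constant `toy_field` merely stands in for a trained vθ so that the result can be checked by hand.

```python
import random

def euler_sample(v, x0, num_steps=100):
    """Integrate dx/dt = v(t, x) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        velocity = v(t, x)
        # One Euler step: move along the current velocity for a time dt.
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x

def toy_field(t, x):
    """Hypothetical stand-in for a trained v_theta: a constant velocity."""
    return [3.0, -1.0]

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(2)]   # x0 ~ N(0, I)
x1 = euler_sample(toy_field, x0, num_steps=10)  # sample at t = 1
```

For a constant field the Euler solution is exact, so `x1` equals `x0` shifted by `[3.0, -1.0]`. With a learned, time-varying field the step count trades accuracy against the number of network evaluations.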
Particles start as noise (t=0) and flow to data (t=1) following the vector field.
We have a beautiful generative model: learn vθ, integrate it to generate samples. But how do we train vθ? This is where CNFs ran into a wall for years.
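The most natural objective is to match our learned vector field to the true one that generates the data distribution:

$$\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t \sim U[0,1],\ x \sim p_t(x)} \big\| v_\theta(t, x) - u_t(x) \big\|^2$$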
where ut is the vector field that generates a probability path pt interpolating between p0 and the data distribution q.
Problem 1: We don't know ut. It's defined implicitly as whatever vector field generates pt via the continuity equation. There's no closed-form expression for the marginal ut(x).
Problem 2: Even if we could compute ut, sampling from pt requires simulating the ODE forward from t = 0 — which means running an ODE solver inside the training loop. This is the same issue that plagued original CNF training (FFJORD): backpropagating through an ODE solver is expensive, memory-intensive, and numerically unstable.
Flow Matching's insight: we can get the simplicity of score matching (closed-form targets, no simulation) but for any probability path — not just diffusion paths.
ODE-based training requires expensive forward/backward solves per step. Flow Matching needs only a single network evaluation.
Here's the key insight that unlocks everything. Instead of thinking about the marginal probability path pt (which is intractable), we think about conditional probability paths pt(x | x1) — the path from noise to a specific data point x1.
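Define a conditional probability path pt(x | x1) that starts as pure noise, p0(x | x1) = N(x | 0, I) regardless of x1, and ends concentrated around the data point, p1(x | x1) = N(x | x1, σmin² I) for a small σmin > 0.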
The simplest choice is a Gaussian conditional path:

$$p_t(x \mid x_1) = \mathcal{N}\big(x \mid \mu_t(x_1),\ \sigma_t(x_1)^2 I\big)$$

where μt and σt are differentiable functions of t with boundary conditions μ0 = 0, σ0 = 1, μ1 = x1, σ1 = σmin ≈ 0.
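Every Gaussian conditional path is generated by the affine flow ψt(x0) = σt x0 + μt. The vector field of that flow, which therefore generates the path via the continuity equation, has a closed-form expression:

$$u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\,\big(x - \mu_t(x_1)\big) + \mu_t'(x_1)$$

where primes denote derivatives with respect to t.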
This is exact. No approximation. No simulation needed. Given a sample xt from pt(x | x1), we can compute the vector field analytically.
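The claim that the closed-form field generates the Gaussian path can be checked numerically in one dimension. The sketch below assumes the standard Gaussian-path field u_t(x | x1) = (σt'/σt)(x − μt) + μt' and, purely for illustration, the linear choice μt = t·x1, σt = 1 − (1 − σmin)t. It evaluates the continuity-equation residual ∂p/∂t + ∂(p·u)/∂x with central finite differences. All function names and the value of `SIGMA_MIN` are hypothetical.

```python
import math

SIGMA_MIN = 1e-2  # assumed small terminal width, chosen for illustration

def mu(t, x1):
    return t * x1

def sigma(t):
    return 1.0 - (1.0 - SIGMA_MIN) * t

def density(t, x, x1):
    """Gaussian density N(x | mu_t, sigma_t^2) in one dimension."""
    s = sigma(t)
    z = (x - mu(t, x1)) / s
    return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))

def cond_field(t, x, x1):
    """Closed-form field: (sigma'/sigma) * (x - mu) + mu'."""
    dsigma = -(1.0 - SIGMA_MIN)  # d sigma_t / dt
    dmu = x1                     # d mu_t / dt
    return dsigma / sigma(t) * (x - mu(t, x1)) + dmu

def continuity_residual(t, x, x1, h=1e-4):
    """Central-difference estimate of dp/dt + d(p * u)/dx."""
    dp_dt = (density(t + h, x, x1) - density(t - h, x, x1)) / (2.0 * h)

    def flux(y):
        return density(t, y, x1) * cond_field(t, y, x1)

    dflux_dx = (flux(x + h) - flux(x - h)) / (2.0 * h)
    return dp_dt + dflux_dx

residuals = [
    abs(continuity_residual(t, x, 2.0))
    for t in (0.2, 0.5, 0.8)
    for x in (-1.0, 0.5, 1.5)
]
max_residual = max(residuals)  # should be near zero, up to discretization error
```

The individual terms of the residual are of order one or larger at these points, so a residual near zero indicates that mass is conserved under this field rather than that both terms happen to vanish.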
Sampling from pt(x | x1) is trivial via reparameterization:

$$x_t = \mu_t(x_1) + \sigma_t(x_1)\, x_0, \qquad x_0 \sim \mathcal{N}(0, I)$$
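For the simplest choice (linear interpolation, a.k.a. the Optimal Transport path), set μt = t x1 and σt = 1 − t:

$$x_t = (1 - t)\, x_0 + t\, x_1$$

This is just linear interpolation between a noise sample and a data sample! And the conditional vector field becomes:

$$u_t(x_t \mid x_1) = x_1 - x_0$$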
The target is simply the displacement from noise to data. Constant in time. A straight line.
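As a small worked example, the function below builds one training pair (xt, target) for the linear path, keeping an optional `sigma_min` so the σmin → 0 simplification is explicit. It is a sketch in plain Python. `ot_training_pair` is a hypothetical helper, and the general form with `sigma_min` assumes σt = 1 − (1 − σmin)t.

```python
def ot_training_pair(x0, x1, t, sigma_min=0.0):
    """Return (x_t, target) for the linear (OT) conditional path.

    x_t    = (1 - (1 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1 - sigma_min) * x0
    With sigma_min = 0 this reduces to x_t = (1 - t) x0 + t x1, target = x1 - x0.
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    xt = [scale * noise + t * data for noise, data in zip(x0, x1)]
    target = [data - (1.0 - sigma_min) * noise for noise, data in zip(x0, x1)]
    return xt, target

x0 = [0.5, -1.0]   # a noise sample
x1 = [2.0, 3.0]    # a data sample
start, _ = ot_training_pair(x0, x1, t=0.0)
end, _ = ot_training_pair(x0, x1, t=1.0)
quarter, target = ot_training_pair(x0, x1, t=0.25)
```

Here `start` equals `x0`, `end` equals `x1`, and `quarter` equals `x0` plus 0.25 times `target`, which is the constant-velocity property described above.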
Each data point x1 defines a conditional path from noise x0. The conditional vector field is constant: x1 - x0.
We now have all the pieces. The conditional vector field ut(x | x1) is tractable. But we want to learn the marginal vector field ut(x) that generates the marginal pt. How does conditioning on x1 help?
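The marginal probability path is defined by marginalizing the conditional paths over the data distribution:

$$p_t(x) = \int p_t(x \mid x_1)\, q(x_1)\, dx_1$$

The crucial mathematical result (Theorem 1 in the paper) is that the marginal vector field generating this pt is the posterior-weighted average of the conditional fields:

$$u_t(x) = \int u_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1$$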
This is still intractable (it involves an integral over all data points). But here's the punchline:
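Define the Conditional Flow Matching (CFM) loss:

$$\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t \sim U[0,1],\ x_1 \sim q,\ x \sim p_t(x \mid x_1)} \big\| v_\theta(t, x) - u_t(x \mid x_1) \big\|^2$$

LFM and LCFM differ only by a constant independent of θ, so they have identical gradients: ∇θ LFM = ∇θ LCFM (Theorem 2 in the paper). Regressing onto the tractable conditional target therefore trains the same vθ as regressing onto the intractable marginal field.

The proof uses a simple trick. Expand the squared norm inside LFM:

$$\big\| v_\theta(t,x) - u_t(x) \big\|^2 = \big\| v_\theta(t,x) \big\|^2 - 2\,\big\langle v_\theta(t,x),\, u_t(x) \big\rangle + \big\| u_t(x) \big\|^2$$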
The cross term ⟨vθ(t,x), ut(x)⟩ under Ex ~ pt can be rewritten by substituting the marginal expressions. After expanding pt(x) = ∫ pt(x|x1)q(x1)dx1 and ut(x) as its conditional mixture, the cross term becomes Ex1 ~ q Ex ~ pt(x|x1) ⟨vθ, ut(x|x1)⟩ — exactly the cross term in LCFM. The squared norms also match up to constants independent of θ.
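Written out, the cross-term step looks as follows. This is a sketch of the standard argument, assuming pt(x) > 0 and enough regularity to exchange the order of integration. It uses the identity ut(x) pt(x) = ∫ ut(x | x1) pt(x | x1) q(x1) dx1, which follows from the mixture form of the marginal vector field.

$$
\begin{aligned}
\mathbb{E}_{x \sim p_t}\big\langle v_\theta(t,x),\, u_t(x) \big\rangle
&= \int \big\langle v_\theta(t,x),\, u_t(x) \big\rangle\, p_t(x)\, dx \\
&= \int \Big\langle v_\theta(t,x),\, \int u_t(x \mid x_1)\, p_t(x \mid x_1)\, q(x_1)\, dx_1 \Big\rangle\, dx \\
&= \iint \big\langle v_\theta(t,x),\, u_t(x \mid x_1) \big\rangle\, p_t(x \mid x_1)\, q(x_1)\, dx_1\, dx \\
&= \mathbb{E}_{x_1 \sim q,\ x \sim p_t(x \mid x_1)} \big\langle v_\theta(t,x),\, u_t(x \mid x_1) \big\rangle .
\end{aligned}
$$

The same substitution pt(x) = ∫ pt(x | x1) q(x1) dx1 shows that the ‖vθ‖² terms of the two losses are equal, and the remaining ‖ut‖² terms do not depend on θ.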
With OT conditional paths, training is beautifully simple:
    import torch

    # Flow Matching training step (OT path)
    def fm_step(model, x1, optimizer):
        # x1: batch of data samples [B, d]
        B = x1.shape[0]
        t = torch.rand(B, 1, device=x1.device)   # sample time uniformly in [0, 1]
        x0 = torch.randn_like(x1)                # sample noise
        xt = (1 - t) * x0 + t * x1               # interpolate
        target = x1 - x0                         # conditional vector field
        pred = model(t, xt)                      # predict vector field
        loss = (pred - target).pow(2).mean()
        optimizer.zero_grad()                    # clear gradients from the previous step
        loss.backward()
        optimizer.step()
        return loss.item()
That's it. Four lines of math. No ODE solver. No score function. No noise schedule to tune.
Flow Matching works with any Gaussian conditional path. But which path should you choose? This is where the Optimal Transport (OT) path shines — and where FM decisively separates from diffusion.
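The OT path uses linear interpolation:

$$\mu_t(x_1) = t\, x_1, \qquad \sigma_t(x_1) = 1 - (1 - \sigma_{\min})\, t$$

For practical purposes with σmin → 0:

$$x_t = (1 - t)\, x_0 + t\, x_1, \qquad u_t(x_t \mid x_1) = x_1 - x_0$$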
This is the displacement interpolation from Optimal Transport theory. The sample moves along a straight line from x0 to x1 at constant speed.
In OT theory, the optimal way to transport mass from one distribution to another (minimizing total displacement) is via straight lines. The conditional OT path xt = (1-t)x0 + tx1 moves each particle along the shortest path between its noise sample and its data target. While this is only the true OT map when conditioning on a single x1, the conditional paths still produce straighter marginal flows than diffusion paths.
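The Variance Preserving path used by DDPM:

$$p_t(x \mid x_1) = \mathcal{N}\big(x \mid \alpha_{1-t}\, x_1,\ (1 - \alpha_{1-t}^2)\, I\big)$$

where αt = exp(−½ ∫0t β(s) ds) for some noise schedule β(t). The index 1 − t appears because here noise sits at t = 0 and data at t = 1. This produces curved trajectories: the signal and noise coefficients satisfy α² + σ² = 1, so the conditional trajectory xt = α1−t x1 + √(1 − α1−t²) x0 bends along an arc between x0 and x1 instead of following the straight chord, and its speed varies with the schedule.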
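One way to quantify the difference is the ratio of path length to endpoint distance, which is 1 for a straight line. The sketch below compares the two conditional paths for a single (x0, x1) pair. It assumes a linear schedule β(s) running from 0.1 to 20, a common choice that the text above does not specify, and it uses α at time 1 − t so that noise sits at t = 0. All names are hypothetical.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear noise schedule beta(s)

def alpha(s):
    """alpha_s = exp(-0.5 * integral_0^s beta(r) dr) for the linear schedule."""
    integral = BETA_MIN * s + 0.5 * (BETA_MAX - BETA_MIN) * s * s
    return math.exp(-0.5 * integral)

def vp_point(t, x0, x1):
    """VP conditional path with noise at t = 0 and data at t = 1."""
    a = alpha(1.0 - t)
    b = math.sqrt(1.0 - a * a)
    return [a * data + b * noise for noise, data in zip(x0, x1)]

def ot_point(t, x0, x1):
    """OT conditional path: straight-line interpolation."""
    return [(1.0 - t) * noise + t * data for noise, data in zip(x0, x1)]

def straightness(path_fn, x0, x1, steps=1000):
    """Polyline length divided by endpoint distance; 1.0 means straight."""
    pts = [path_fn(k / steps, x0, x1) for k in range(steps + 1)]
    length = sum(math.dist(p, q) for p, q in zip(pts, pts[1:]))
    return length / math.dist(pts[0], pts[-1])

x0, x1 = [1.0, 0.0], [0.0, 1.0]   # one noise sample and one data point
ot_ratio = straightness(ot_point, x0, x1)
vp_ratio = straightness(vp_point, x0, x1)
```

For this pair `ot_ratio` is 1 up to rounding, while `vp_ratio` is noticeably larger because the VP path follows an arc of the unit circle. Note that this measures only the conditional paths. The marginal trajectories of a trained model depend on the data distribution as well.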
Teal = OT (straight lines). Orange = Diffusion VP (curved). OT paths run straight from noise to data, while diffusion paths bend away from the straight chord.
Flow Matching doesn't replace diffusion — it generalizes it. Score matching for diffusion models is a special case of Conditional Flow Matching when you choose the diffusion probability path.
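Recall: in diffusion models, we have a noising process that, for a fixed data point x1, produces

$$x_t = \alpha_t\, x_1 + \sigma_t\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

where, in this section, αt and σt are shorthand for the signal and noise coefficients of the path at time t.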
The score function is ∇x log pt(x | x1) = −(xt − αt x1) / σt² = −ε / σt.
Denoising score matching trains: predict ε from xt.
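Now, the conditional vector field for the diffusion (VP) path is the general Gaussian-path field with μt = αt x1:

$$u_t(x \mid x_1) = \frac{\sigma_t'}{\sigma_t}\,\big(x - \alpha_t\, x_1\big) + \alpha_t'\, x_1$$

Substituting x = αt x1 + σt ε:

$$u_t(x_t \mid x_1) = \sigma_t'\, \varepsilon + \alpha_t'\, x_1$$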
This is a linear combination of ε and x1 — exactly what diffusion models predict (up to reparameterization). Training a network to predict this vector field is equivalent to training it to predict ε or x1.
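To make the reparameterization concrete: eliminating x1 = (xt − σt ε) / αt from σt' ε + αt' x1 gives the velocity as a function of xt and ε alone, namely (αt'/αt) xt + (σt' − αt' σt/αt) ε. The one-dimensional sketch below checks this against the direct conditional field. It assumes a linear schedule β(s) from 0.1 to 20 and the noise-at-t = 0 time convention, neither of which is fixed by the text above. `coeffs`, `cond_field`, and `velocity_from_eps` are hypothetical names.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear noise schedule beta(s)

def beta(s):
    return BETA_MIN + (BETA_MAX - BETA_MIN) * s

def alpha(s):
    return math.exp(-0.5 * (BETA_MIN * s + 0.5 * (BETA_MAX - BETA_MIN) * s * s))

def coeffs(t):
    """Return a, b and their t-derivatives for x_t = a * x1 + b * eps."""
    s = 1.0 - t                 # noise at t = 0, data at t = 1
    a = alpha(s)
    da = 0.5 * beta(s) * a      # d/dt of alpha(1 - t)
    b = math.sqrt(1.0 - a * a)
    db = -a * da / b            # d/dt of sqrt(1 - a^2)
    return a, b, da, db

def cond_field(t, x, x1):
    """Direct form: (b'/b) * (x - a * x1) + a' * x1."""
    a, b, da, db = coeffs(t)
    return db / b * (x - a * x1) + da * x1

def velocity_from_eps(t, x, eps):
    """Same velocity written in terms of x_t and the noise eps only."""
    a, b, da, db = coeffs(t)
    return (da / a) * x + (db - da * b / a) * eps

t, x1, eps = 0.6, 1.5, -0.7
a, b, _, _ = coeffs(t)
xt = a * x1 + b * eps
u_direct = cond_field(t, xt, x1)
u_from_eps = velocity_from_eps(t, xt, eps)
```

The two values agree up to floating-point rounding, which is the sense in which an ε-predicting network and a vector-field-predicting network carry the same information on this path.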
| Property | Diffusion (VP/VE) | Flow Matching |
|---|---|---|
| Path geometry | Curved (noise schedule-dependent) | Any — including straight OT lines |
| Training target | ε or x1 or score | Vector field (encompasses all above) |
| Inference | SDE/ODE solver (many steps) | ODE solver (fewer steps with OT) |
| Path flexibility | Locked to diffusion process | Any Gaussian path works |
| Noise schedule | Critical hyperparameter | Not needed (for OT paths) |
Rectified Flows (Liu et al., 2022) independently proposed the same OT conditional path xt = (1-t)x0 + tx1 and the same training objective. The key difference: Rectified Flows additionally proposed reflow — iteratively straightening trajectories by using the model's own predictions as new training pairs. Flow Matching provides the theoretical foundation (the equivalence theorem) while Rectified Flows provide the practical refinement technique.
Flow Matching isn't just a theoretical curiosity — it achieves state-of-the-art results and has become the foundation for modern generative models.
On ImageNet 64×64 (unconditional), comparing different probability paths with the same architecture:
| Method | Path | NLL (bpd) | FID | NFE |
|---|---|---|---|---|
| FM + VP path | Diffusion (VP) | 0.97 | 4.31 | 142 |
| FM + OT path | Optimal Transport | 0.93 | 3.50 | 110 |
| DDPM (original) | Diffusion (VP) | 3.70 | 11.0 | 1000 |
| Score SDE (VP) | Diffusion (VP) | 2.99 | 2.41* | 2000 |
*Score SDE uses 2000 NFEs and Predictor-Corrector sampling. FM with the OT path uses only 110 NFEs with an adaptive ODE solver.
The paper shows that FM with OT paths has significantly more stable training dynamics than FM with VP paths.
Flow Matching has become a standard training paradigm for modern generative models.
OT paths (teal) achieve lower FID with fewer steps than VP diffusion paths (orange).
Flow Matching sits at the intersection of several major ideas in generative modeling and beyond.