Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, Matt Le — Meta AI, 2022

Flow Matching for Generative Modeling

Train Continuous Normalizing Flows without simulating ODEs. Regress vector fields directly using conditional probability paths — enabling straight-line Optimal Transport trajectories that diffusion models cannot achieve.

Prerequisites: Probability distributions & sampling + ODEs (basic) + Score matching / Diffusion intuition

Chapter 0: Why Flow Matching?

You want to turn noise into images. Diffusion models do this brilliantly — but they discovered an unintuitive trick: add noise gradually, then learn to reverse it. The resulting trajectories from noise to data are curved, wandering paths through high-dimensional space.

What if you could just learn straight lines from noise to data? That's faster to traverse (fewer ODE steps at inference), easier to learn (simpler vector field), and more stable to train. But there's a catch: the elegant mathematical framework for this — Continuous Normalizing Flows — has been impractical because training requires simulating ODEs, which is extremely expensive.

The breakthrough: Flow Matching shows you can train CNFs without ever simulating an ODE during training. Instead of the intractable marginal vector field, you regress against conditional vector fields that have closed-form solutions. The resulting loss is as simple as score matching but works with ANY probability path — including straight-line Optimal Transport paths that diffusion cannot access.

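The key advantages of Flow Matching over traditional diffusion training, each developed in the chapters below: simulation-free training with closed-form regression targets, freedom to choose the probability path, and fewer ODE steps at inference when that path is straight.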

Noise-to-Data Trajectories: Curved vs Straight

Diffusion models follow curved paths (orange). Flow Matching with OT follows straight paths (teal).

What is the core computational bottleneck that Flow Matching eliminates from CNF training?

Chapter 1: Continuous Normalizing Flows

A Continuous Normalizing Flow (CNF) defines a generative model by moving samples along a time-dependent vector field. Think of it as a river: at time t = 0 you drop particles into noise, and the flow carries them to data at t = 1.

The ODE Formulation

A CNF is defined by a vector field vt : Rd → Rd that generates a flow φt via the ODE:

dφt(x) / dt = vt(φt(x)),   φ0(x) = x

Starting from φ0(x) = x (identity at t = 0), the flow map φt transforms samples from the source distribution p0 (noise) to the target distribution p1 (data). At each time t, there is an intermediate distribution pt = [φt]# p0 — the pushforward of p0 through φt.

Pushforward intuition: If you have a bucket of particles distributed as p0 (standard Gaussian), and you flow each particle through the ODE for time t, the resulting cloud of particles forms the distribution pt. At t = 1, that cloud should match your data distribution q. The pushforward [φt]# p0 just means "apply φt to each sample and look at the resulting distribution."

The Continuity Equation

The densities pt satisfy the continuity equation:

∂pt / ∂t + div(pt vt) = 0

This is a conservation law: probability mass is neither created nor destroyed, it just flows. The vector field vt generates the probability path pt. Given vt, you can compute pt by solving this PDE (or equivalently, by simulating the ODE and pushing samples forward).
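A quick numerical sanity check can make the conservation law concrete. The sketch below assumes a 1D Gaussian whose mean and standard deviation change linearly in time (the functions `mean`, `std`, and `velocity` are illustrative choices for this example, not something taken from the paper) and uses central finite differences to confirm that ∂pt/∂t + ∂(pt vt)/∂x is close to zero.

```python
import numpy as np

def mean(t):
    return 2.0 * t            # illustrative: mean moves from 0 to 2

def std(t):
    return 1.0 - 0.9 * t      # illustrative: std shrinks from 1 to 0.1

def density(x, t):
    """p_t(x): a 1D Gaussian with the time-dependent mean and std above."""
    z = (x - mean(t)) / std(t)
    return np.exp(-0.5 * z ** 2) / (std(t) * np.sqrt(2.0 * np.pi))

def velocity(x, t):
    """Field that moves each particle x = std(t) * z + mean(t) with z held fixed."""
    dmean, dstd = 2.0, -0.9
    return (dstd / std(t)) * (x - mean(t)) + dmean

def continuity_residual(x, t, eps=1e-4):
    """Finite-difference estimate of dp/dt + d(p * v)/dx, which should vanish."""
    dp_dt = (density(x, t + eps) - density(x, t - eps)) / (2 * eps)
    flux_hi = density(x + eps, t) * velocity(x + eps, t)
    flux_lo = density(x - eps, t) * velocity(x - eps, t)
    return dp_dt + (flux_hi - flux_lo) / (2 * eps)

xs = np.linspace(-3.0, 5.0, 81)
max_residual = np.max(np.abs(continuity_residual(xs, t=0.5)))
```

The residual is not exactly zero only because of finite-difference error. Swapping in a velocity field that does not match the density (for example, dropping the `dmean` term) makes the residual clearly nonzero.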

Sampling from a CNF

To generate a sample: draw x0 ~ p0 = N(0, I), then solve the ODE forward from t = 0 to t = 1 using an ODE solver (Euler, RK45, etc.):

x1 = x0 + ∫₀¹ vθ(t, φt(x0)) dt

With a learned vθ, this gives you a sample from the model's approximation of q (the data distribution).
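As a minimal sketch of that sampling loop, here is fixed-step Euler integration in NumPy. The function `toy_velocity` is a hypothetical stand-in for a trained vθ: it is the field of the affine flow φt(x) = (1 + t(S − 1))·x + t·M, which carries N(0, 1) to N(M, S²), so the result can be checked in closed form.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps=100):
    """Integrate dx/dt = velocity(t, x) from t = 0 to t = 1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        x = x + h * velocity(t, x)
    return x

# Hypothetical stand-in for a trained v_theta (target mean M, target std S).
M, S = 3.0, 0.5

def toy_velocity(t, x):
    scale = 1.0 + t * (S - 1.0)              # std of p_t under the affine flow
    return (S - 1.0) * (x - t * M) / scale + M

rng = np.random.default_rng(0)
x0 = rng.standard_normal(10_000)             # x0 ~ p0 = N(0, 1)
x1 = euler_sample(toy_velocity, x0, n_steps=20)
```

Each particle in this toy flow moves at constant velocity, so Euler reproduces `S * x0 + M` up to rounding. A learned field is generally not this well behaved, which is why adaptive solvers such as RK45 are commonly used.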

CNFs vs. Discrete Normalizing Flows: Discrete NFs (RealNVP, Glow) use a fixed number of invertible layers. CNFs use a continuous transformation parameterized by an ODE — infinitely flexible, no architecture constraints on invertibility. The price: you need an ODE solver at inference. But the flexibility is enormous — any smooth vector field works.

Flow ODE: Particles Moving Through Time

Particles start as noise (t=0) and flow to data (t=1) following the vector field.

What does the flow map φt do in a CNF?

Chapter 2: The Training Problem

We have a beautiful generative model: learn vθ, integrate it to generate samples. But how do we train vθ? This is where CNFs ran into a wall for years.

The Naive Approach: Flow Matching (Intractable)

The most natural objective is to match our learned vector field to the true one that generates the data distribution:

LFM(θ) = Et ~ U[0,1] Ex ~ pt || vθ(t, x) − ut(x) ||²

where ut is the vector field that generates a probability path pt interpolating between p0 and the data distribution q.

Problem 1: We don't know ut. It's defined implicitly as whatever vector field generates pt via the continuity equation. There's no closed-form expression for the marginal ut(x).

Problem 2: Even if we could compute ut, sampling from pt requires simulating the ODE forward from t = 0 — which means running an ODE solver inside the training loop. This is the same issue that plagued original CNF training (FFJORD): backpropagating through an ODE solver is expensive, memory-intensive, and numerically unstable.

The cost of simulation: FFJORD-style training requires: (1) Forward ODE solve to get pt samples, (2) Backward ODE solve (adjoint method) for gradients, (3) Trace estimation for the log-likelihood term. Each training step therefore costs roughly NFE network evaluations in each direction, where NFE is the number of function evaluations the solver needs (often tens to hundreds). Diffusion training needs only one forward and one backward pass per step, so simulation-based CNF training can be one to two orders of magnitude more expensive per step.

Prior Solutions and Their Limitations

Flow Matching's insight: we can get the simplicity of score matching (closed-form targets, no simulation) but for any probability path — not just diffusion paths.

Training Cost Comparison

ODE-based training requires expensive forward/backward solves per step. Flow Matching needs only a single network evaluation.

Why can't we directly minimize LFM = E||vθ(t,x) − ut(x)||²?

Chapter 3: Conditional Paths

Here's the key insight that unlocks everything. Instead of thinking about the marginal probability path pt (which is intractable), we think about conditional probability paths pt(x | x1) — the path from noise to a specific data point x1.

The Conditioning Trick

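Define a conditional probability path pt(x | x1) that starts at the noise distribution at t = 0, so p0(x | x1) = p0(x), and concentrates around the data point x1 at t = 1.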

The simplest choice is a Gaussian conditional path:

pt(x | x1) = N(x; μt(x1), σt(x1)² I)

where μt and σt are differentiable functions of t with boundary conditions μ0 ≈ 0, σ0 ≈ 1, μ1 = x1, σ1 ≈ 0.

The Conditional Vector Field

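For a Gaussian conditional path, consider the affine flow ψt(x) = σt(x1) x + μt(x1), which pushes N(0, I) onto pt(x | x1). The vector field ut(x | x1) that defines this flow generates the path (via the continuity equation), and it has a closed-form expression (Theorem 3 in the paper):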

ut(x | x1) = (σt′ / σt)(x − μt(x1)) + μt′(x1)

This is exact. No approximation. No simulation needed. Given a sample xt from pt(x | x1), we can compute the vector field analytically.
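The closed form is easy to check numerically. In the sketch below, `path` is an illustrative schedule chosen for this example (μt = t²·x1, σt = 1 − 0.9t, not a schedule from the paper), and `flow` is the map x0 ↦ σt·x0 + μt that places a noise sample on the path. The time derivative of `flow` should equal the closed-form field evaluated at the current position.

```python
import numpy as np

def conditional_vector_field(x, mu, dmu, sigma, dsigma):
    """u_t(x | x1) = (sigma_t' / sigma_t) * (x - mu_t) + mu_t'."""
    return (dsigma / sigma) * (x - mu) + dmu

def path(t, x1):
    """Illustrative schedule: returns mu_t, mu_t', sigma_t, sigma_t'."""
    mu, dmu = t ** 2 * x1, 2.0 * t * x1
    sigma, dsigma = 1.0 - 0.9 * t, -0.9
    return mu, dmu, sigma, dsigma

def flow(t, x0, x1):
    """psi_t(x0) = sigma_t * x0 + mu_t: where the noise sample x0 sits at time t."""
    mu, _, sigma, _ = path(t, x1)
    return sigma * x0 + mu

rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal(4), rng.standard_normal(4)
t, eps = 0.3, 1e-5

# Closed-form field at the particle's current position ...
analytic = conditional_vector_field(flow(t, x0, x1), *path(t, x1))
# ... versus the particle's actual velocity, estimated by central differences.
numeric = (flow(t + eps, x0, x1) - flow(t - eps, x0, x1)) / (2 * eps)
```

Any differentiable pair (μt, σt) with σt > 0 can be dropped into `path`; the agreement between `analytic` and `numeric` does not depend on the particular schedule.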

The Reparameterization

Sampling from pt(x | x1) is trivial via reparameterization:

xt = μt(x1) + σt(x1) · ε,   ε ~ N(0, I)

For the simplest choice (linear interpolation, a.k.a. Optimal Transport path):

μt(x1) = t · x1,   σt = 1 − t
xt = (1 − t) x0 + t x1,   x0 ~ N(0, I)

This is just linear interpolation between a noise sample and a data sample! And the conditional vector field becomes:

ut(xt | x1) = x1 − x0

The target is simply the displacement from noise to data. Constant in time. A straight line.
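To see where x1 − x0 comes from, here is the substitution written out. It only plugs μt(x1) = t·x1 and σt = 1 − t into the general Gaussian formula from earlier in this chapter; nothing beyond those definitions is assumed.

```latex
\begin{aligned}
u_t(x \mid x_1)
  &= \frac{\sigma_t'}{\sigma_t}\,\bigl(x - \mu_t(x_1)\bigr) + \mu_t'(x_1)
   = \frac{-1}{1-t}\,(x - t\,x_1) + x_1
   = \frac{x_1 - x}{1-t}, \\[4pt]
u_t(x_t \mid x_1)
  &= \frac{x_1 - \bigl((1-t)\,x_0 + t\,x_1\bigr)}{1-t}
   = \frac{(1-t)\,(x_1 - x_0)}{1-t}
   = x_1 - x_0 .
\end{aligned}
```

As a function of position the field is (x1 − x)/(1 − t): it always points at x1. It is constant only along the trajectory xt, which is exactly where the training loss evaluates it.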

The magic: We've turned an intractable marginal problem into a trivial conditional problem. The conditional vector field ut(xt | x1) = x1 − x0 is just "point from where you are toward the data." No ODE, no simulation, no score function — just a vector.
Conditional Path: Linear Interpolation

Each data point x1 defines a conditional path from noise x0. The conditional vector field is constant: x1 - x0.

For the linear interpolation path xt = (1-t)x0 + t*x1, what is the conditional vector field ut(xt | x1)?

Chapter 4: The Flow Matching Objective

We now have all the pieces. The conditional vector field ut(x | x1) is tractable. But we want to learn the marginal vector field ut(x) that generates the marginal pt. How does conditioning on x1 help?

From Marginal to Conditional

The crucial mathematical result (Theorem 1 in the paper) starts from the marginal probability path, defined by averaging the conditional paths over the data distribution:

pt(x) = ∫ pt(x | x1) q(x1) dx1

And the marginal vector field that generates this pt is:

ut(x) = ∫ ut(x | x1) · pt(x | x1) q(x1) / pt(x) dx1
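For a tiny dataset the integral becomes a finite sum, which makes the formula easy to inspect. The sketch below assumes a 1D empirical q that puts equal weight on each entry of `data`, and OT-style conditional paths with μt = t·x1 and σt = 1 − (1 − σmin)·t, where the small floor `sigma_min` is added here so the Gaussian never collapses. The name `marginal_vector_field` is a hypothetical helper, not an API from the paper.

```python
import numpy as np

def marginal_vector_field(x, t, data, sigma_min=1e-3):
    """u_t(x) as a posterior-weighted average of conditional fields (1D, uniform q)."""
    data = np.asarray(data, dtype=float)
    sigma = 1.0 - (1.0 - sigma_min) * t
    mu = t * data                                   # one conditional mean per data point
    # log p_t(x | x1_i) up to a constant shared by all i (sigma is the same for all i)
    log_w = -0.5 * ((x - mu) / sigma) ** 2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                    # = p_t(x | x1_i) q(x1_i) / p_t(x)
    # Conditional field (sigma_t' / sigma_t) * (x - mu_t) + mu_t' for each data point
    u_cond = (-(1.0 - sigma_min) / sigma) * (x - mu) + data
    return float(np.sum(w * u_cond))

data = np.array([-1.0, 1.0])
u_center = marginal_vector_field(0.0, 0.5, data)    # symmetric point between the modes
u_near_mode = marginal_vector_field(0.9, 0.9, data) # late time, close to the +1 mode
```

At the symmetric point the two conditional fields cancel, while late in time the weights collapse onto the nearest data point. Note that every evaluation touches the whole dataset, which is what makes this expression impractical as a training target at scale.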

This is still intractable (it involves an integral over all data points). But here's the punchline:

The Conditional Flow Matching Loss

Define the Conditional Flow Matching (CFM) loss:

LCFM(θ) = Et ~ U[0,1] Ex1 ~ q Ext ~ pt(x|x1) || vθ(t, xt) − ut(xt | x1) ||²

Theorem 2 (Lipman et al., 2022): LCFM and LFM are equal up to a constant independent of θ, so they have identical gradients with respect to θ: ∇θ LCFM = ∇θ LFM. Minimizing the tractable conditional loss is equivalent to minimizing the intractable marginal loss.

The Proof Sketch

The proof uses a simple trick. Expand LFM:

LFM = Et Ex ~ pt [||vθ(t, x)||² − 2 ⟨vθ(t, x), ut(x)⟩ + ||ut(x)||²]

The cross term ⟨vθ(t,x), ut(x)⟩ under Ex ~ pt can be rewritten by substituting the marginal expressions. After expanding pt(x) = ∫ pt(x|x1)q(x1)dx1 and ut(x) as its conditional mixture, the cross term becomes Ex1 ~ q Ex ~ pt(x|x1) ⟨vθ, ut(x|x1)⟩ — exactly the cross term in LCFM. The squared norms also match up to constants independent of θ.
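Written out, the cross-term step looks like this. It uses the marginal field in the form ut(x) = ∫ ut(x | x1) pt(x | x1) q(x1) / pt(x) dx1, and it assumes pt(x) > 0 and that the order of integration can be exchanged.

```latex
\begin{aligned}
\mathbb{E}_{x \sim p_t}\,\bigl\langle v_\theta(t,x),\, u_t(x) \bigr\rangle
  &= \int \Bigl\langle v_\theta(t,x),\,
       \int u_t(x \mid x_1)\,\frac{p_t(x \mid x_1)\,q(x_1)}{p_t(x)}\,dx_1
     \Bigr\rangle\, p_t(x)\,dx \\
  &= \iint \bigl\langle v_\theta(t,x),\, u_t(x \mid x_1) \bigr\rangle\,
       p_t(x \mid x_1)\,q(x_1)\,dx_1\,dx \\
  &= \mathbb{E}_{x_1 \sim q}\,\mathbb{E}_{x \sim p_t(x \mid x_1)}\,
       \bigl\langle v_\theta(t,x),\, u_t(x \mid x_1) \bigr\rangle .
\end{aligned}
```

The factor pt(x) cancels in the first line, which is the whole trick: the intractable posterior weighting inside ut(x) is absorbed by sampling x1 from q and x from pt(x | x1).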

The Training Algorithm

With OT conditional paths, training is beautifully simple:

# Flow Matching training step (OT path)
import torch

def fm_step(model, x1, optimizer):
    # x1: batch of data samples [B, d]
    B = x1.shape[0]
    t = torch.rand(B, 1, device=x1.device)   # sample time uniformly in [0, 1]
    x0 = torch.randn_like(x1)                # sample noise
    xt = (1 - t) * x0 + t * x1               # interpolate along the OT path
    target = x1 - x0                         # conditional vector field
    pred = model(t, xt)                      # predict vector field
    loss = (pred - target).pow(2).mean()
    optimizer.zero_grad()                    # clear gradients from the previous step
    loss.backward()
    optimizer.step()
    return loss.item()

That's it. Four lines of math. No ODE solver. No score function. No noise schedule to tune.
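To watch the whole loop run end to end without a neural network, here is a toy 1D sketch in NumPy. It is a deliberately simplified stand-in and not the paper's setup: the "model" is a separate linear fit v(t, x) = a + b·x at each Euler time step, fitted by least squares on the same (xt, x1 − x0) pairs used in the training step. For Gaussian data the marginal field is linear in x, so this model class is enough. The names `fit_linear_cfm` and `euler_sample` are hypothetical.

```python
import numpy as np

def fit_linear_cfm(x1_data, n_steps=50, n_pairs=4000, seed=0):
    """Fit v(t_k, x) = a_k + b_k * x on CFM targets at each Euler time t_k = k / n_steps."""
    rng = np.random.default_rng(seed)
    coefs = np.zeros((n_steps, 2))
    for k in range(n_steps):
        t = k / n_steps
        x1 = rng.choice(x1_data, size=n_pairs)      # data samples
        x0 = rng.standard_normal(n_pairs)           # noise samples
        xt = (1.0 - t) * x0 + t * x1                # OT interpolation
        target = x1 - x0                            # conditional vector field
        features = np.stack([np.ones_like(xt), xt], axis=1)
        coefs[k] = np.linalg.lstsq(features, target, rcond=None)[0]
    return coefs

def euler_sample(coefs, n, seed=1):
    """Generate n samples by Euler-integrating the fitted field from t = 0 to t = 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    h = 1.0 / len(coefs)
    for a, b in coefs:
        x = x + h * (a + b * x)
    return x

# Toy "dataset": a 1D Gaussian with mean 3 and std 0.5 (an assumption of this example).
data = np.random.default_rng(42).normal(3.0, 0.5, size=10_000)
coefs = fit_linear_cfm(data)
samples = euler_sample(coefs, n=10_000)
```

The regression never sees the marginal field, only per-pair displacements x1 − x0, yet the generated samples land near the data's mean and spread. That is the equivalence theorem at work on a problem small enough to check by eye.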

Compare to diffusion training: DDPM training: sample t, look up the noise schedule coefficients αt and σt, add noise xt = αt·xdata + σt·ε, predict ε. Flow Matching: sample t, interpolate xt = (1-t)x0 + tx1, predict x1 - x0. (DDPM papers write the data sample as x0; in this article x0 is noise and x1 is data.) The structure is almost identical! But FM works for any path, not just diffusion paths. And the OT path gives straighter trajectories.
Why can we use the Conditional FM loss (LCFM) instead of the intractable marginal FM loss (LFM)?

Chapter 5: Optimal Transport Paths

Flow Matching works with any Gaussian conditional path. But which path should you choose? This is where the Optimal Transport (OT) path shines — and where FM decisively separates from diffusion.

The OT Conditional Path

The OT path uses linear interpolation:

μt(x1) = t · x1,   σt(x1) = 1 − (1 − σmin)t

For practical purposes with σmin → 0:

xt = (1 − t) x0 + t x1

This is the displacement interpolation from Optimal Transport theory. The sample moves along a straight line from x0 to x1 at constant speed.
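With a nonzero σmin the interpolation and its regression target pick up a small correction. The sketch below follows from plugging μt = t·x1 and σt = 1 − (1 − σmin)·t into the Gaussian conditional field; the helper name `ot_cfm_pair` and the default `sigma_min` value are assumptions of this example.

```python
import numpy as np

def ot_cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, target) for the OT conditional path with a sigma_min floor."""
    sigma_t = 1.0 - (1.0 - sigma_min) * t
    xt = sigma_t * x0 + t * x1                 # sample from p_t(x | x1)
    target = x1 - (1.0 - sigma_min) * x0       # conditional vector field at x_t
    return xt, target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 2))               # noise batch [B, d]
x1 = rng.standard_normal((8, 2))               # stand-in data batch [B, d]
t = rng.uniform(size=(8, 1))                   # one time per sample, broadcast over d
xt, target = ot_cfm_pair(x0, x1, t)
```

Setting `sigma_min=0.0` recovers xt = (1 − t)·x0 + t·x1 and the target x1 − x0 used in the training step.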

Why "Optimal Transport"?

In OT theory, the optimal way to transport mass from one distribution to another (minimizing total displacement) is via straight lines. The conditional OT path xt = (1-t)x0 + tx1 moves each particle along the shortest path between its noise sample and its data target. While this is only the true OT map when conditioning on a single x1, the conditional paths still produce straighter marginal flows than diffusion paths.

Diffusion Path (VP-SDE) for Comparison

The Variance Preserving path used by DDPM:

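μt(x1) = α_{1−t} x1,   σt = √(1 − α_{1−t}²)

where α_s = exp(−½ ∫₀ˢ β(r) dr) for some noise schedule β. The index is 1 − t because diffusion time runs from data to noise, while here t = 0 is noise and t = 1 is data. This produces curved trajectories: the particle first drifts toward the origin (as the noise component shrinks) then curves toward x1 (as signal emerges).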
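A small experiment shows what the curvature costs a numerical solver. The sketch below integrates the conditional field for a single (x0, x1) pair with 5 Euler steps, once for the OT path and once for a VP path. Assumptions to note: time runs from noise at t = 0 to data at t = 1, so the VP schedule is evaluated at 1 − t; the linear schedule β(s) = 0.1 + 19.9·s is a commonly used choice that this sketch adopts for illustration; and these are conditional trajectories, not the learned marginal field.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0   # linear schedule endpoints (assumed for this sketch)

def beta(s):
    return BETA_MIN + s * (BETA_MAX - BETA_MIN)

def alpha(s):
    """alpha_s = exp(-0.5 * integral of beta from 0 to s) for the linear schedule."""
    return np.exp(-0.5 * (BETA_MIN * s + 0.5 * (BETA_MAX - BETA_MIN) * s ** 2))

def vp_field(t, x, x1):
    """Conditional field for mu_t = alpha_{1-t} * x1, sigma_t = sqrt(1 - alpha_{1-t}^2)."""
    s = 1.0 - t
    a = alpha(s)
    da_dt = 0.5 * beta(s) * a                  # d/dt of alpha_{1-t}
    dlog_sigma_dt = -a * da_dt / (1.0 - a ** 2)
    return dlog_sigma_dt * (x - a * x1) + da_dt * x1

def ot_field(t, x, x1):
    """Conditional OT field (sigma_min = 0): always points at x1."""
    return (x1 - x) / (1.0 - t)

def euler(field, x0, x1, n_steps):
    x, h = float(x0), 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * field(k * h, x, x1)        # left-endpoint Euler step
    return x

x0, x1 = 1.0, 2.0
ot_error = abs(euler(ot_field, x0, x1, n_steps=5) - x1)
vp_error = abs(euler(vp_field, x0, x1, n_steps=5) - x1)
```

The OT trajectory has constant velocity, so Euler lands on x1 up to rounding even with a handful of steps. The VP trajectory misses by a visible margin at the same step count and needs a finer grid or a higher-order solver to compensate.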

Why Straight Lines Win

Empirical evidence: In the paper's ImageNet 64×64 experiments (see the results table in Chapter 7), Flow Matching with the OT path reaches a lower FID with fewer function evaluations than the same model trained with the VP (diffusion) path. Same architecture, same training setup; the only difference is the path geometry.

SHOWCASE: OT vs Diffusion Trajectories

Teal = OT (straight lines). Orange = Diffusion VP (curved). Notice how OT paths cross directly while diffusion paths curve through the origin first.

Why do Optimal Transport paths require fewer ODE steps at inference than diffusion paths?

Chapter 6: Connection to Diffusion

Flow Matching doesn't replace diffusion — it generalizes it. Score matching for diffusion models is a special case of Conditional Flow Matching when you choose the diffusion probability path.

Score Matching as a Special Case

Recall: in diffusion models, we have a noising process:

xt = αt x1 + σt ε,   ε ~ N(0, I)

The score function is ∇x log pt(x | x1) = −(xt − αt x1) / σt² = −ε / σt.

Denoising score matching trains: predict ε from xt.

Now, the conditional vector field for the diffusion (VP) path is:

ut(x | x1) = (σt′ / σt)(x − αt x1) + αt′ x1

Substituting x = αt x1 + σt ε:

ut(x | x1) = σt′ ε + αt′ x1

This is a linear combination of ε and x1 — exactly what diffusion models predict (up to reparameterization). Training a network to predict this vector field is equivalent to training it to predict ε or x1.
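The reparameterization can be made explicit in a few lines. The sketch below checks that the closed-form field and the "σt′·ε + αt′·x1" form agree, and then inverts the relation to recover ε from a vector-field prediction. The numeric values of α, σ and their derivatives are arbitrary illustrative numbers (chosen so that α² + σ² = 1), and the helper names are hypothetical.

```python
import numpy as np

def vector_field_from_eps(eps, x1, dalpha, dsigma):
    """u_t(x | x1) written in terms of the noise: sigma_t' * eps + alpha_t' * x1."""
    return dsigma * eps + dalpha * x1

def eps_from_vector_field(u, x, alpha, sigma, dalpha, dsigma):
    """Invert u = sigma' * eps + alpha' * x1 using x1 = (x - sigma * eps) / alpha."""
    return (alpha * u - dalpha * x) / (alpha * dsigma - dalpha * sigma)

# Illustrative values at one time t on a VP path (alpha^2 + sigma^2 = 1).
alpha, sigma = 0.6, 0.8
dalpha = 0.7
dsigma = -alpha * dalpha / sigma           # follows from differentiating alpha^2 + sigma^2 = 1

rng = np.random.default_rng(0)
x1 = rng.standard_normal(5)                # data sample
eps = rng.standard_normal(5)               # noise sample
x = alpha * x1 + sigma * eps               # noised sample x_t

u_closed = (dsigma / sigma) * (x - alpha * x1) + dalpha * x1
u_from_eps = vector_field_from_eps(eps, x1, dalpha, dsigma)
eps_back = eps_from_vector_field(u_closed, x, alpha, sigma, dalpha, dsigma)
score = -eps / sigma                       # conditional score at x_t
```

Because the map between ε and the vector field is affine in the prediction for fixed (t, xt), a network trained on one target can be converted to the other at inference; what changes between the parameterizations is how errors are weighted across t.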

The unification: Denoising score matching (DSM), DDPM's ε-prediction, and data prediction (called x0-prediction in DDPM notation, where x0 is the data sample that this article calls x1) all fit into Conditional Flow Matching with the VP probability path. The targets differ only by a time-dependent reparameterization, which changes how the loss is weighted across t. FM reveals that what diffusion models are really doing is learning a vector field; they just never frame it that way.

What FM Adds Beyond Diffusion

Property | Diffusion (VP/VE) | Flow Matching
Path geometry | Curved (noise schedule-dependent) | Any, including straight OT lines
Training target | ε, x0 (data, in DDPM notation), or score | Vector field (encompasses all of these)
Inference | SDE/ODE solver (many steps) | ODE solver (fewer steps with OT)
Path flexibility | Locked to the diffusion process | Any Gaussian path works
Noise schedule | Critical hyperparameter | Not needed (for OT paths)

Relationship to Rectified Flows

Rectified Flows (Liu et al., 2022) independently proposed the same OT conditional path xt = (1-t)x0 + tx1 and the same training objective. The key difference: Rectified Flows additionally proposed reflow — iteratively straightening trajectories by using the model's own predictions as new training pairs. Flow Matching provides the theoretical foundation (the equivalence theorem) while Rectified Flows provide the practical refinement technique.
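The reflow idea described above can be sketched in a few lines. This is a schematic reading of the procedure, not the authors' code: `velocity` is a hypothetical stand-in for an already trained model, `integrate` is plain fixed-step Euler, and `reflow_batch` builds the new training pairs by coupling each noise sample with the point the current model maps it to.

```python
import numpy as np

def integrate(velocity, x0, n_steps=50):
    """Push noise samples through the current model with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * velocity(k * h, x)
    return x

def reflow_batch(velocity, x0, t):
    """Build (x_t, target) pairs from the model's own couplings (x0, z1)."""
    z1 = integrate(velocity, x0)               # model output replaces the data sample
    xt = (1.0 - t) * x0 + t * z1               # straight line between the coupled pair
    target = z1 - x0                           # regression target for the next model
    return xt, target

def toy_velocity(t, x):
    """Hypothetical trained model: shifts every sample by 2 over unit time."""
    return np.full_like(x, 2.0)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((6, 1))
t = rng.uniform(size=(6, 1))
xt, target = reflow_batch(toy_velocity, x0, t)
```

The next model is then trained on these pairs with the same regression loss as before. Because each x0 is now paired with a deterministic endpoint rather than an independent data sample, the resulting trajectories tend to cross less and become straighter.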

The landscape in 2024: Stable Diffusion 3 (Esser et al., 2024), Flux, and a growing number of modern image generators train with Flow Matching or rectified-flow objectives on straight OT-style paths. Much of the field has moved from "diffusion" to "flow" formulations because of the advantages this paper identified: simpler training, faster inference, and less dependence on a hand-tuned noise schedule.
How is denoising score matching related to Flow Matching?

Chapter 7: Results & Applications

Flow Matching isn't just a theoretical curiosity — it achieves state-of-the-art results and has become the foundation for modern generative models.

ImageNet Results

On ImageNet 64×64 (unconditional), comparing different probability paths with the same architecture:

Method | Path | NLL (bpd) | FID | NFE
DDPM | Diffusion (VP) | 3.32 | 17.36 | 264
Score Matching | Diffusion (VP) | 3.40 | 19.74 | 441
FM + VP path | Diffusion (VP) | 3.33 | 16.88 | 187
FM + OT path | Optimal Transport | 3.31 | 14.45 | 138

Values are the ImageNet 64×64 columns of Table 1 in the paper. All rows use the same architecture, and NFE is the number of function evaluations taken by an adaptive ODE solver.

Key finding: OT-CFM simultaneously achieves better likelihood AND better sample quality (FID) than VP-CFM, while using fewer integration steps. This is remarkable — traditionally likelihood and sample quality are at odds (the "likelihood-FID tradeoff"). Straight paths help both.

Training Stability

The paper reports that FM with OT paths trains more stably and converges faster than the same model trained with VP paths.

Downstream Impact

Flow Matching has become a standard training paradigm for modern generative models, including SD3, Flux, MovieGen, and pi0 (see the lineage in Chapter 8).

Why FM won for robotics: Robot action generation needs fast inference (real-time control at 10-50 Hz). FM with OT paths can often sample in roughly 5-10 integration steps, compared with the 50-100 denoising steps commonly used with DDPM-style policies. This makes flow-based policies practical for real-time robot control in settings where diffusion policies were too slow.
FID vs Number of Function Evaluations

OT paths (teal) achieve lower FID with fewer steps than VP diffusion paths (orange).

What practical advantage does Flow Matching with OT paths provide for downstream applications like robotics?

Chapter 8: Connections

Flow Matching sits at the intersection of several major ideas in generative modeling and beyond.

The Intellectual Lineage

2018: Neural ODEs (Chen et al.), continuous-depth networks
2019: FFJORD, free-form CNFs with trace estimation (expensive)
2020: Score SDE (Song et al.), unified SDE framework for diffusion
2022: Flow Matching (this paper), simulation-free CNF training
2022: Rectified Flows (Liu et al.), iterative straightening + reflow
2024: SD3, Flux, MovieGen, pi0, with FM as the default generative backbone

Key Takeaways

The big picture: Flow Matching showed that a natural way to think about generative modeling is through vector fields and probability paths. Diffusion was always doing this implicitly. FM made it explicit, and in doing so unlocked better paths (OT), faster inference, and simpler training. It is the theoretical foundation for much of today's flow-based generative modeling.

References

  1. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., Le, M. "Flow Matching for Generative Modeling." ICLR 2023. arXiv:2210.02747
  2. Liu, X., Gong, C., Liu, Q. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." ICLR 2023. arXiv:2209.03003
  3. Chen, R. T. Q., Rubanova, Y., Bettencourt, J., Duvenaud, D. "Neural Ordinary Differential Equations." NeurIPS 2018. arXiv:1806.07366
  4. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., Poole, B. "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR 2021. arXiv:2011.13456
  5. Esser, P. et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." ICML 2024. arXiv:2403.03206