Flow Matching — Veanors

Chapter 0: Why Flow Matching?

You want to turn noise into images. Diffusion models do this brilliantly — but they discovered an unintuitive trick: add noise gradually, then learn to reverse it. The resulting trajectories from noise to data are curved, wandering paths through high-dimensional space.

What if you could just learn straight lines from noise to data? That's faster to traverse (fewer ODE steps at inference), easier to learn (simpler vector field), and more stable to train. But there's a catch: the elegant mathematical framework for this — Continuous Normalizing Flows — has been impractical because training requires simulating ODEs, which is extremely expensive.

The breakthrough: Flow Matching shows you can train CNFs without ever simulating an ODE during training. Instead of the intractable marginal vector field, you regress against conditional vector fields that have closed-form solutions. The resulting loss is as simple as score matching but works with ANY probability path — including straight-line Optimal Transport paths that diffusion cannot access.

The key advantages of Flow Matching over traditional diffusion training:

Simulation-free training: No ODE solver during training. Just sample t, sample a pair (x₀, x₁), compute the conditional vector field analytically, and regress.
Path flexibility: Works with any Gaussian probability path. Diffusion paths, VP/VE paths, and crucially — Optimal Transport paths that give straight trajectories.
Faster inference: Straight OT paths need fewer integration steps (as few as ~10 NFEs vs 100+ for diffusion).
Unified framework: Score matching / denoising diffusion are special cases. FM generalizes them.

Noise-to-Data Trajectories: Curved vs Straight

Diffusion models follow curved paths (orange). Flow Matching with OT follows straight paths (teal).

What is the core computational bottleneck that Flow Matching eliminates from CNF training?

Simulating ODEs during training to compute the likelihood or match the vector field
Computing the score function of the data distribution
Backpropagating through a U-Net architecture

Chapter 1: Continuous Normalizing Flows

A Continuous Normalizing Flow (CNF) defines a generative model by moving samples along a time-dependent vector field. Think of it as a river: at time t = 0 you drop particles into noise, and the flow carries them to data at t = 1.

The ODE Formulation

A CNF is defined by a vector field v_t : R^d → R^d that generates a flow φ_t via the ODE:

dφ_t(x) / dt = v_t(φ_t(x)), φ₀(x) = x

Starting from φ₀(x) = x (identity at t = 0), the flow map φ_t transforms samples from the source distribution p₀ (noise) to the target distribution p₁ (data). At each time t, there is an intermediate distribution p_t = [φ_t]_# p₀ — the pushforward of p₀ through φ_t.

Pushforward intuition: If you have a bucket of particles distributed as p₀ (standard Gaussian), and you flow each particle through the ODE for time t, the resulting cloud of particles forms the distribution p_t. At t = 1, that cloud should match your data distribution q. The pushforward [φ_t]_# p₀ just means "apply φ_t to each sample and look at the resulting distribution."
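The pushforward can be checked numerically on a toy case. The sketch below assumes a hypothetical 1D linear field v_t(x) = a·x (chosen for illustration, not taken from the text), whose flow map has the closed form φ_t(x) = e^{at}·x, so pushing N(0, 1) through φ_t should give N(0, e^{2at}).

```python
import numpy as np

def flow_map(x, t, a=0.5):
    """Closed-form flow of the assumed toy field v_t(x) = a * x."""
    return np.exp(a * t) * x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(200_000)   # particles drawn from p_0 = N(0, 1)
xt = flow_map(x0, t=1.0)            # push every particle through phi_1

empirical_std = float(xt.std())     # spread of the pushed-forward cloud
expected_std = float(np.exp(0.5))   # std of N(0, e^{2at}) at a = 0.5, t = 1
print(empirical_std, expected_std)
```

The two printed values should agree up to sampling noise, which is all that "[φ_t]_# p₀" claims: transform the samples, then look at their distribution.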

The Continuity Equation

The densities p_t satisfy the continuity equation:

∂p_t / ∂t + div(p_t v_t) = 0

This is a conservation law: probability mass is neither created nor destroyed, it just flows. The vector field v_t generates the probability path p_t. Given v_t, you can compute p_t by solving this PDE (or equivalently, by simulating the ODE and pushing samples forward).
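As a worked check, the continuity equation can be verified by hand for a simple case. The example below assumes a 1D linear field v_t(x) = a x with p₀ = N(0, 1); this toy field is an illustration only and does not appear in the text.

```latex
% Assumed toy example: v_t(x) = a x in one dimension, p_0 = N(0, 1)
\begin{align*}
\phi_t(x) &= e^{at} x, \qquad p_t(x) = \mathcal{N}(x;\, 0,\, s_t^2), \quad s_t = e^{at} \\
\partial_t p_t(x) &= p_t(x)\left(\frac{a x^2}{s_t^2} - a\right) \\
\partial_x\big(p_t(x)\, a x\big) &= a\, p_t(x) + a x\, \partial_x p_t(x)
  = p_t(x)\left(a - \frac{a x^2}{s_t^2}\right) \\
\partial_t p_t(x) + \partial_x\big(p_t(x)\, v_t(x)\big) &= 0
\end{align*}
```

The time derivative and the divergence term cancel exactly, so this v_t generates this p_t.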

Sampling from a CNF

To generate a sample: draw x₀ ~ p₀ = N(0, I), then solve the ODE forward from t = 0 to t = 1 using an ODE solver (Euler, RK45, etc.):

x₁ = x₀ + ∫₀¹ v_θ(t, φ_t(x₀)) dt

With a learned v_θ, this gives you a sample from the model's approximation of q (the data distribution).
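A minimal sketch of the sampling loop with fixed-step Euler is shown below. The function name `euler_sample` and the stand-in field `v_toy` are assumptions for illustration; in practice `v` would be the trained network v_θ and the solver might be adaptive.

```python
import numpy as np

def euler_sample(v, x0, n_steps=10):
    """Integrate dx/dt = v(t, x) from t = 0 to t = 1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h                 # left endpoint of the current step, always < 1
        x = x + h * v(t, x)
    return x

# Assumed stand-in for a trained network: a field that sends every point
# along a straight line to one fixed target, v(t, x) = (target - x) / (1 - t).
target = np.array([2.0, -1.0])
v_toy = lambda t, x: (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((5, 2))          # five noise samples in 2D
x1 = euler_sample(v_toy, x0, n_steps=10)  # every row should land on the target
print(x1)
```

Because the trajectories of this toy field are straight lines, Euler reaches the target for any step count; a learned field generally needs enough steps to track its curvature.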

CNFs vs. Discrete Normalizing Flows: Discrete NFs (RealNVP, Glow) use a fixed number of invertible layers. CNFs use a continuous transformation parameterized by an ODE — infinitely flexible, no architecture constraints on invertibility. The price: you need an ODE solver at inference. But the flexibility is enormous — any smooth vector field works.

Flow ODE: Particles Moving Through Time

Particles start as noise (t=0) and flow to data (t=1) following the vector field.

What does the flow map φ_t do in a CNF?

It computes the likelihood of a data point
It transforms samples from p₀ (noise) to p_t (intermediate distribution) by integrating the vector field from 0 to t
It applies a fixed invertible transformation like an affine coupling layer

Chapter 2: The Training Problem

We have a beautiful generative model: learn v_θ, integrate it to generate samples. But how do we train v_θ? This is where CNFs ran into a wall for years.

The Naive Approach: Flow Matching (Intractable)

The most natural objective is to match our learned vector field to the true one that generates the data distribution:

L_FM(θ) = E_{t ~ U[0,1]} E_{x ~ p_t} || v_θ(t, x) − u_t(x) ||²

where u_t is the vector field that generates a probability path p_t interpolating between p₀ and the data distribution q.

Problem 1: We don't know u_t. It's defined implicitly as whatever vector field generates p_t via the continuity equation. There's no closed-form expression for the marginal u_t(x).

Problem 2: Even if we could compute u_t, sampling from p_t requires simulating the ODE forward from t = 0 — which means running an ODE solver inside the training loop. This is the same issue that plagued original CNF training (FFJORD): backpropagating through an ODE solver is expensive, memory-intensive, and numerically unstable.

The cost of simulation: FFJORD-style training requires: (1) Forward ODE solve to get p_t samples, (2) Backward ODE solve (adjoint method) for gradients, (3) Trace estimation for the log-likelihood term. Each training step costs O(NFE × d) where NFE is the number of function evaluations (50-200 typically). This makes CNFs 10-100x slower to train than diffusion models, which need only one forward pass per step.

Prior Solutions and Their Limitations

FFJORD (2019): Train with maximum likelihood via the instantaneous change of variables formula. Requires trace estimation (Hutchinson's) + ODE simulation. Expensive.
Score matching for diffusion (2020-2021): Sidestep the ODE issue by using a specific noising process (VP, VE SDE) where p_t(x | x₁) is known in closed form. But this locks you into diffusion-style curved paths.

Flow Matching's insight: we can get the simplicity of score matching (closed-form targets, no simulation) but for any probability path — not just diffusion paths.

Training Cost Comparison

ODE-based training requires expensive forward/backward solves per step. Flow Matching needs only a single network evaluation.

Why can't we directly minimize L_FM = E||v_θ(t,x) - u_t(x)||²?

Because u_t(x) has no closed-form expression and sampling from p_t requires expensive ODE simulation
Because the loss is non-convex
Because v_θ must be invertible

Chapter 3: Conditional Paths

Here's the key insight that unlocks everything. Instead of thinking about the marginal probability path p_t (which is intractable), we think about conditional probability paths p_t(x | x₁) — the path from noise to a specific data point x₁.

The Conditioning Trick

Define a conditional probability path p_t(x | x₁) that:

At t = 0: p₀(x | x₁) ≈ p₀(x) = N(0, I) (starts at noise, approximately independent of x₁)
At t = 1: p₁(x | x₁) = N(x₁, σ²_min I) (concentrated around x₁, essentially a delta)

The simplest choice is a Gaussian conditional path:

p_t(x | x₁) = N(x; μ_t(x₁), σ_t(x₁)² I)

where μ_t and σ_t are differentiable functions of t with boundary conditions μ₀ ≈ 0, σ₀ ≈ 1, μ₁ = x₁, σ₁ ≈ 0.

The Conditional Vector Field

Consider the flow ψ_t(x) = σ_t(x₁) x + μ_t(x₁), which maps noise x ~ N(0, I) to a sample of p_t(x | x₁). The unique vector field u_t(x | x₁) that defines this flow generates the Gaussian conditional path (via the continuity equation), and it has a closed-form expression:

u_t(x | x₁) = (σ_t′ / σ_t)(x − μ_t(x₁)) + μ_t′(x₁)

This is exact. No approximation. No simulation needed. Given a sample x_t from p_t(x | x₁), we can compute the vector field analytically.
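The closed form can be sanity-checked against a finite-difference derivative of the path itself. The sketch below assumes a hypothetical trigonometric schedule, μ_t = sin(πt/2)·x₁ and σ_t = cos(πt/2), picked only because its derivatives are easy to write down; the helper name is also an assumption.

```python
import numpy as np

def conditional_vector_field(x, x1, t, mu, sigma, dmu, dsigma):
    """u_t(x | x1) = (sigma'_t / sigma_t) * (x - mu_t(x1)) + mu'_t(x1)."""
    return dsigma(t) / sigma(t) * (x - mu(t, x1)) + dmu(t, x1)

# Assumed example schedule (illustration only).
half_pi = np.pi / 2
mu = lambda t, x1: np.sin(half_pi * t) * x1
sigma = lambda t: np.cos(half_pi * t)
dmu = lambda t, x1: half_pi * np.cos(half_pi * t) * x1
dsigma = lambda t: -half_pi * np.sin(half_pi * t)

rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal(3), rng.standard_normal(3)
psi = lambda t: sigma(t) * x0 + mu(t, x1)   # trajectory of one noise sample

t, h = 0.3, 1e-5
u_formula = conditional_vector_field(psi(t), x1, t, mu, sigma, dmu, dsigma)
u_finite_diff = (psi(t + h) - psi(t - h)) / (2 * h)   # numerical d/dt of the path
print(u_formula, u_finite_diff)
```

The two vectors agree to numerical precision, which is what "generates the path" means for a single trajectory.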

The Reparameterization

Sampling from p_t(x | x₁) is trivial via reparameterization:

x_t = μ_t(x₁) + σ_t(x₁) · ε, ε ~ N(0, I)

For the simplest choice (linear interpolation, a.k.a. Optimal Transport path):

μ_t(x₁) = t · x₁, σ_t = 1 − t

x_t = (1 − t) x₀ + t x₁, x₀ ~ N(0, I)

This is just linear interpolation between a noise sample and a data sample! And the conditional vector field becomes:

u_t(x_t | x₁) = x₁ − x₀

The target is simply the displacement from noise to data. Constant in time. A straight line.

The magic: We've turned an intractable marginal problem into a trivial conditional problem. The conditional vector field u_t(x_t | x₁) = x₁ − x₀ is just "point from where you are toward the data." No ODE, no simulation, no score function — just a vector.
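One detail worth making concrete: the network only sees x_t, not x₀. For the linear path the same target can be written in terms of the current position as (x₁ − x_t) / (1 − t), and it still equals x₁ − x₀ at every t. A small check, with the helper name being an assumption:

```python
import numpy as np

def ot_field_from_position(xt, x1, t):
    """OT conditional field written as a function of the current position x_t."""
    return (x1 - xt) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)   # noise sample
x1 = rng.standard_normal(4)   # data sample

fields = []
for t in (0.0, 0.25, 0.5, 0.9):
    xt = (1 - t) * x0 + t * x1                 # point on the straight path at time t
    fields.append(ot_field_from_position(xt, x1, t))

print(np.stack(fields))   # every row is the same vector, x1 - x0
```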

Conditional Path: Linear Interpolation

Each data point x₁ defines a conditional path from noise x₀. The conditional vector field is constant: x₁ - x₀.

For the linear interpolation path x_t = (1-t)x₀ + t*x₁, what is the conditional vector field u_t(x_t | x₁)?

t · (x₁ - x₀)
(x_t - x₁) / (1 - t)
x₁ - x₀ (constant, independent of t)

Chapter 4: The Flow Matching Objective

We now have all the pieces. The conditional vector field u_t(x | x₁) is tractable. But we want to learn the marginal vector field u_t(x) that generates the marginal p_t. How does conditioning on x₁ help?

From Marginal to Conditional

The crucial mathematical result (Theorem 1 in the paper) starts from the marginal probability path, obtained by marginalizing over the conditional paths:

p_t(x) = ∫ p_t(x | x₁) q(x₁) dx₁

And the marginal vector field that generates this p_t is:

u_t(x) = ∫ u_t(x | x₁) · p_t(x | x₁) q(x₁) / p_t(x) dx₁

This is still intractable for real data (it involves an integral over all data points). The next section shows how to sidestep it.
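For a small, finite dataset the integral in the marginal formula reduces to a weighted sum, which makes the structure of u_t(x) easy to inspect. The sketch below assumes 1D data points with equal weight and OT conditional paths with σ_min = 0; the function name is hypothetical.

```python
import numpy as np

def marginal_field(x, t, data):
    """Marginal u_t(x) for equally weighted 1D data points and OT conditional paths.

    Uses p_t(x | x1) = N(x; t * x1, (1 - t)^2) and u_t(x | x1) = (x1 - x) / (1 - t).
    """
    data = np.asarray(data, dtype=float)
    sigma = 1.0 - t
    # Log of p_t(x | x1), dropping the constant shared by all x1 (same sigma).
    log_w = -0.5 * ((x - t * data) / sigma) ** 2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                     # weights p_t(x | x1) q(x1) / p_t(x)
    u_cond = (data - x) / sigma      # conditional field for each data point
    return float(np.sum(w * u_cond))

data = [-1.0, 1.0]
print(marginal_field(0.0, 0.5, data))   # midway between the points: pulls cancel
print(marginal_field(0.4, 0.5, data))   # closer to +1: net pull is positive
```

The marginal field is a posterior-weighted average of "point toward x₁" vectors. With a real dataset the sum runs over every training example at every x, which is why it is never computed directly.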

The Conditional Flow Matching Loss

Define the Conditional Flow Matching (CFM) loss:

L_CFM(θ) = E_{t ~ U[0,1]} E_{x₁ ~ q} E_{x_t ~ p_t(x|x₁)} || v_θ(t, x_t) − u_t(x_t | x₁) ||²

Theorem (Lipman et al., 2022): L_CFM and L_FM have identical gradients with respect to θ. That is, ∇_θ L_CFM = ∇_θ L_FM. Minimizing the tractable conditional loss is equivalent to minimizing the intractable marginal loss.
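The equivalence can be illustrated numerically in a case where the marginal field is known. The sketch below assumes 1D Gaussian data q = N(m, s²) and a fixed time t (both chosen for illustration); then u_t(x) = E[x₁ − x₀ | x_t = x] is linear in x, so a least-squares fit on conditional targets should recover it.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s, t, n = 2.0, 0.5, 0.6, 400_000   # assumed data mean/std, fixed time, sample count

x1 = m + s * rng.standard_normal(n)   # data samples from q = N(m, s^2)
x0 = rng.standard_normal(n)           # noise samples from p_0 = N(0, 1)
xt = (1 - t) * x0 + t * x1            # samples from p_t(x | x1)
target = x1 - x0                      # conditional targets u_t(x_t | x1)

# Minimize the CFM loss at this t over v(x) = w * x + b (ordinary least squares).
A = np.stack([xt, np.ones(n)], axis=1)
(w_fit, b_fit), *_ = np.linalg.lstsq(A, target, rcond=None)

# Marginal field u_t(x) = E[x1 - x0 | x_t = x] from joint-Gaussian formulas.
w_true = (t * s**2 - (1 - t)) / ((1 - t) ** 2 + t**2 * s**2)
b_true = m - w_true * t * m
print(w_fit, w_true, b_fit, b_true)
```

The regression never evaluates the marginal field, yet its minimizer matches it up to sampling noise.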

The Proof Sketch

The proof uses a simple trick. Expand L_FM:

L_FM = E_t E_{x ~ p_t} [||v_θ(t,x)||² − 2 ⟨v_θ(t,x), u_t(x)⟩ + ||u_t(x)||²]

The cross term ⟨v_θ(t,x), u_t(x)⟩ under E_{x ~ p_t} can be rewritten by substituting the marginal expressions. After expanding p_t(x) = ∫ p_t(x|x₁)q(x₁)dx₁ and u_t(x) as its conditional mixture, the cross term becomes E_{x₁ ~ q} E_{x ~ p_t(x|x₁)} ⟨v_θ, u_t(x|x₁)⟩, which is exactly the cross term in L_CFM. The ||v_θ||² terms match for the same reason, and the remaining ||u||² terms differ only by a constant independent of θ.
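Written out, the cross-term step is a single exchange of integrals. The derivation below uses only the two marginalization identities, p_t(x) = ∫ p_t(x|x₁) q(x₁) dx₁ and u_t(x) = ∫ u_t(x|x₁) p_t(x|x₁) q(x₁) / p_t(x) dx₁, and assumes p_t(x) > 0 so the division is well defined.

```latex
\begin{align*}
\mathbb{E}_{x \sim p_t}\,\langle v_\theta(t,x),\, u_t(x) \rangle
 &= \int \Big\langle v_\theta(t,x),\,
      \int u_t(x \mid x_1)\,\frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1
    \Big\rangle\, p_t(x)\, dx \\
 &= \iint \langle v_\theta(t,x),\, u_t(x \mid x_1) \rangle\,
      p_t(x \mid x_1)\, q(x_1)\, dx_1\, dx \\
 &= \mathbb{E}_{x_1 \sim q}\, \mathbb{E}_{x \sim p_t(\cdot \mid x_1)}\,
      \langle v_\theta(t,x),\, u_t(x \mid x_1) \rangle
\end{align*}
```

The factor p_t(x) cancels in the first line, which is the whole trick.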

The Training Algorithm

With OT conditional paths, training is beautifully simple:

# Flow Matching training step (OT path)
import torch

def fm_step(model, x1, optimizer):
    # x1: batch of data samples [B, d]
    B = x1.shape[0]
    t = torch.rand(B, 1, device=x1.device)  # sample time in [0, 1), broadcast over d
    x0 = torch.randn_like(x1)               # sample noise
    xt = (1 - t) * x0 + t * x1              # interpolate along the straight path
    target = x1 - x0                        # conditional vector field
    pred = model(t, xt)                     # predict vector field
    loss = (pred - target).pow(2).mean()
    optimizer.zero_grad()                   # clear gradients from the previous step
    loss.backward()
    optimizer.step()
    return loss.item()

That's it. Four lines of math. No ODE solver. No score function. No noise schedule to tune.

Compare to diffusion training: DDPM training: sample t, compute the noise schedule α_t, σ_t, add noise x_t = α_t x₀ + σ_t ε (in DDPM notation x₀ is the data sample), predict ε. Flow Matching: sample t, interpolate x_t = (1-t)x₀ + tx₁ (here x₀ is noise and x₁ is data), predict x₁ - x₀. The structure is almost identical! But FM works for any path, not just diffusion paths. And the OT path gives straighter trajectories.

Why can we use the Conditional FM loss (L_CFM) instead of the intractable marginal FM loss (L_FM)?

Because L_CFM is a tighter bound on L_FM
Because they have identical gradients with respect to model parameters θ, so minimizing one is equivalent to minimizing the other
Because the conditional path converges to the marginal path as training progresses

Chapter 5: Optimal Transport Paths

Flow Matching works with any Gaussian conditional path. But which path should you choose? This is where the Optimal Transport (OT) path shines — and where FM decisively separates from diffusion.

The OT Conditional Path

The OT path uses linear interpolation:

μ_t(x₁) = t · x₁, σ_t(x₁) = 1 − (1 − σ_min)t

For practical purposes with σ_min → 0:

x_t = (1 − t) x₀ + t x₁

This is the displacement interpolation from Optimal Transport theory. The sample moves along a straight line from x₀ to x₁ at constant speed.
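With a nonzero σ_min the sample and the regression target change only slightly. A minimal sketch follows; the function name and the default σ_min = 1e-4 are assumptions for illustration.

```python
import numpy as np

def ot_path_sample(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, target) for the OT conditional path with a small sigma_min."""
    sigma_t = 1.0 - (1.0 - sigma_min) * t
    xt = sigma_t * x0 + t * x1               # mu_t = t * x1, noise scaled by sigma_t
    target = x1 - (1.0 - sigma_min) * x0     # d/dt of x_t, constant in t
    return xt, target

rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal(3), rng.standard_normal(3)

xt_end, target = ot_path_sample(x0, x1, t=1.0)
print(xt_end - x1)   # residual noise of size sigma_min * x0 at t = 1
```

Setting `sigma_min=0` recovers x_t = (1 − t) x₀ + t x₁ with target x₁ − x₀.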

Why "Optimal Transport"?

In OT theory, the optimal way to transport mass from one distribution to another (minimizing total displacement) is via straight lines. The conditional OT path x_t = (1-t)x₀ + tx₁ moves each particle along the shortest path between its noise sample and its data target. While this is only the true OT map when conditioning on a single x₁, the conditional paths still produce straighter marginal flows than diffusion paths.

Diffusion Path (VP-SDE) for Comparison

The Variance Preserving path used by DDPM:

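μ_t(x₁) = α_{1−t} x₁, σ_t = √(1 − α_{1−t}²)

where α_s = exp(−½ ∫₀^s β(r) dr) with some noise schedule β. The index 1 − t appears because diffusion time runs from data to noise, while here t = 0 is noise and t = 1 is data. This produces curved trajectories: the mean and noise scales trade off along α² + σ² = 1 rather than linearly, so each conditional path bends away from the straight segment between x₀ and x₁, and its speed varies with t.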
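The curvature can be measured directly on one noise/data pair. The sketch below assumes a linear schedule β(s) = β_min + s(β_max − β_min) with β_min = 0.1 and β_max = 20, and indexes the path so that t = 0 is noise and t = 1 is data (that is, it evaluates α at 1 − t). All function names are hypothetical.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed linear schedule beta(s)

def alpha(s):
    """alpha_s = exp(-0.5 * integral_0^s beta(r) dr) for the linear schedule."""
    integral = BETA_MIN * s + 0.5 * (BETA_MAX - BETA_MIN) * s**2
    return np.exp(-0.5 * integral)

def vp_point(x0, x1, t):
    """Point on the VP conditional path, with t = 0 noise and t = 1 data."""
    a = alpha(1.0 - t)
    return a * x1 + np.sqrt(1.0 - a**2) * x0

def ot_point(x0, x1, t):
    """Point on the OT conditional path (sigma_min = 0)."""
    return (1.0 - t) * x0 + t * x1

def max_chord_deviation(path, x0, x1, n=201):
    """Largest distance between the path and the segment joining its endpoints."""
    ts = np.linspace(0.0, 1.0, n)
    pts = np.stack([path(x0, x1, t) for t in ts])
    start, end = pts[0], pts[-1]
    d = (end - start) / np.linalg.norm(end - start)   # unit chord direction
    rel = pts - start
    perp = rel - np.outer(rel @ d, d)                 # component off the chord
    return float(np.linalg.norm(perp, axis=1).max())

x0 = np.array([1.0, 0.0])   # noise sample
x1 = np.array([0.0, 1.0])   # data sample
vp_dev = max_chord_deviation(vp_point, x0, x1)
ot_dev = max_chord_deviation(ot_point, x0, x1)
print(vp_dev, ot_dev)
```

For this pair the VP path is an arc of the unit circle, so its deviation from the chord is clearly nonzero, while the OT deviation is zero up to rounding.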

Why Straight Lines Win

Easier to learn: The vector field for straight lines is nearly constant (x₁ - x₀). For curved paths, the vector field changes direction throughout, requiring the network to learn complex time-dependent behavior.
Fewer integration steps: Straight trajectories have low curvature → Euler's method with large steps is accurate → fewer NFEs at inference (10 vs 100+).
Better conditioning: The neural network sees the same target (x₁ - x₀) regardless of t, reducing variance in gradient estimates.
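The link between curvature and step count can be seen with a generic toy comparison. The sketch below is not a model of any trained network: it assumes a constant field (straight path) and a rotation field (quarter-circle path) that both carry x₀ to x₁ in unit time, and measures the Euler endpoint error.

```python
import numpy as np

def euler(v, x0, n_steps):
    """Fixed-step Euler integration of dx/dt = v(t, x) over [0, 1]."""
    x, h = np.array(x0, dtype=float), 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * v(k * h, x)
    return x

x0 = np.array([1.0, 0.0])
x1 = np.array([0.0, 1.0])

# Straight path: constant velocity x1 - x0.
straight = lambda t, x: x1 - x0
# Curved path (assumed toy): rotation by pi/2 over unit time, also ending at x1.
omega = np.pi / 2
curved = lambda t, x: omega * np.array([-x[1], x[0]])

def endpoint_error(v, n_steps):
    """Distance between the Euler endpoint and the true endpoint x1."""
    return float(np.linalg.norm(euler(v, x0, n_steps) - x1))

for n in (1, 10, 100):
    print(n, endpoint_error(straight, n), endpoint_error(curved, n))
```

The straight field is integrated exactly in a single step, while the curved field needs many steps before the error becomes small.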

Empirical evidence: On ImageNet 64×64, the paper's ablation (same architecture, same training setup) reports a lower FID at fewer NFEs for the OT path than for the diffusion (VP) path. The only difference between the two runs is the path geometry.

SHOWCASE: OT vs Diffusion Trajectories

Teal = OT (straight lines). Orange = Diffusion VP (curved). Notice how OT paths cross directly while diffusion paths curve.


Why do Optimal Transport paths require fewer ODE steps at inference than diffusion paths?

Straight-line trajectories have low curvature, so simple ODE solvers (Euler) with large step sizes remain accurate and fewer function evaluations are needed
OT paths use a smaller network that evaluates faster
OT paths operate in a lower-dimensional latent space

Chapter 6: Connection to Diffusion

Flow Matching doesn't replace diffusion — it generalizes it. Score matching for diffusion models is a special case of Conditional Flow Matching when you choose the diffusion probability path.

Score Matching as a Special Case

Recall: in diffusion models, we have a noising process:

x_t = α_t x₁ + σ_t ε, ε ~ N(0, I)

The score function is ∇_x log p_t(x | x₁) = −(x_t − α_t x₁) / σ_t² = −ε / σ_t.

Denoising score matching trains: predict ε from x_t.

Now, the conditional vector field for the diffusion (VP) path is:

u_t(x | x₁) = (σ_t′ / σ_t)(x − α_t x₁) + α_t′ x₁

Substituting x = α_t x₁ + σ_t ε:

u_t(x | x₁) = σ_t′ ε + α_t′ x₁

This is a linear combination of ε and x₁ — exactly what diffusion models predict (up to reparameterization). Training a network to predict this vector field is equivalent to training it to predict ε or x₁.
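The reparameterization can be made concrete: given x_t and ε (or a network's ε-prediction), the vector field follows by solving for x₁ and plugging in. The sketch below assumes a hypothetical schedule α_t = sin(πt/2), σ_t = cos(πt/2), chosen only because α_t² + σ_t² = 1 and the derivatives are simple.

```python
import numpy as np

def field_from_eps(x, eps, alpha, sigma, dalpha, dsigma):
    """u_t(x | x1) = sigma' * eps + alpha' * x1, with x1 recovered from x and eps."""
    x1_hat = (x - sigma * eps) / alpha
    return dsigma * eps + dalpha * x1_hat

# Assumed toy schedule evaluated at one time point.
t = 0.3
half_pi = np.pi / 2
alpha, sigma = np.sin(half_pi * t), np.cos(half_pi * t)
dalpha, dsigma = half_pi * np.cos(half_pi * t), -half_pi * np.sin(half_pi * t)

rng = np.random.default_rng(0)
x1, eps = rng.standard_normal(3), rng.standard_normal(3)
x = alpha * x1 + sigma * eps                       # noisy sample x_t

u_direct = dsigma / sigma * (x - alpha * x1) + dalpha * x1   # closed-form field
u_from_eps = field_from_eps(x, eps, alpha, sigma, dalpha, dsigma)
print(u_direct, u_from_eps)
```

Both routes give the same vector, so an ε-predicting network and a vector-field network carry the same information wherever α_t is nonzero.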

The unification: Denoising score matching (DSM), DDPM's ε-prediction, and x₀-prediction (data prediction; the data sample is x₁ in this document's notation) are all special cases of Conditional Flow Matching with the VP-SDE probability path. The only difference is a time-dependent affine reparameterization of the target. FM reveals that what diffusion models are really doing is learning a vector field, even though they are rarely framed that way.

What FM Adds Beyond Diffusion

Property	Diffusion (VP/VE)	Flow Matching
Path geometry	Curved (noise schedule-dependent)	Any — including straight OT lines
Training target	ε or x₀ or score	Vector field (encompasses all above)
Inference	SDE/ODE solver (many steps)	ODE solver (fewer steps with OT)
Path flexibility	Locked to diffusion process	Any Gaussian path works
Noise schedule	Critical hyperparameter	Not needed (for OT paths)

Relationship to Rectified Flows

Rectified Flows (Liu et al., 2022) independently proposed the same OT conditional path x_t = (1-t)x₀ + tx₁ and the same training objective. The key difference: Rectified Flows additionally proposed reflow — iteratively straightening trajectories by using the model's own predictions as new training pairs. Flow Matching provides the theoretical foundation (the equivalence theorem) while Rectified Flows provide the practical refinement technique.

The landscape in 2024: Stable Diffusion 3, Flux, and most modern image generators use Flow Matching with OT paths. The field has largely moved from "diffusion" to "flow" formulations, precisely because of the advantages this paper identified: simpler training, faster inference, no noise schedule.

How is denoising score matching related to Flow Matching?

They are completely different frameworks with no mathematical connection
Score matching is a special case of CFM when using the VP-SDE diffusion probability path: predicting ε is equivalent to predicting the conditional vector field up to a time-dependent rescaling
Flow Matching is an approximation of score matching that trades accuracy for speed

Chapter 7: Results & Applications

Flow Matching isn't just a theoretical curiosity — it achieves state-of-the-art results and has become the foundation for modern generative models.

ImageNet Results

On ImageNet 64×64 (unconditional), comparing different probability paths with the same architecture:

Method	Path	NLL (bits/dim)	FID	NFE
DDPM	Diffusion (VP)	3.32	17.36	264
Score Matching	Diffusion (VP)	3.40	19.74	441
ScoreFlow	Diffusion (VP)	3.36	24.95	601
FM + VP path	Diffusion (VP)	3.33	16.88	187
FM + OT path	Optimal Transport	3.31	14.45	138

NFE is the number of function evaluations used by an adaptive ODE solver at sampling time. All rows are the paper's own ablation runs; only the training objective and probability path differ.

Key finding: OT-CFM simultaneously achieves better likelihood AND better sample quality (FID) than VP-CFM, while using fewer integration steps. This is remarkable — traditionally likelihood and sample quality are at odds (the "likelihood-FID tradeoff"). Straight paths help both.

Training Stability

The paper reports faster, more stable training for FM with OT paths than with VP paths. Plausible contributing factors:

Loss variance is lower (the target x₁ - x₀ is constant across time, reducing gradient noise)
No need to tune noise schedules (linear, cosine, etc.) — there is no schedule
The learned vector field is smoother (lower Lipschitz constant), making ODE integration more stable

Downstream Impact

Flow Matching has become the standard training paradigm for modern generative models:

Stable Diffusion 3 / Flux (2024): Use rectified flow (= OT-CFM) as the training objective. The "diffusion" in the name is legacy — the math is Flow Matching.
Speech synthesis (Voicebox, 2023): Flow Matching for text-to-speech, achieving state-of-the-art naturalness. The model uses FM to generate mel spectrograms.
Video generation (MovieGen, 2024): FM for text-to-video, leveraging the fast inference (fewer steps = less latency for video frames).
Protein design (FrameFlow, 2023): FM on SE(3) for protein backbone generation.
Robotics (pi0, 2024): Flow Matching as the action generation head in Vision-Language-Action models.

Why FM won for robotics: Robot action generation needs fast inference (real-time control at 10-50 Hz). FM with OT paths needs only 5-10 denoising steps vs 50-100 for DDPM. This makes flow-based policies practical for real-time robot control where diffusion policies were too slow.

FID vs Number of Function Evaluations

OT paths (teal) achieve lower FID with fewer steps than VP diffusion paths (orange).

What practical advantage does Flow Matching with OT paths provide for downstream applications like robotics?

It uses less GPU memory during training
It produces higher resolution outputs
Much faster inference (5-10 ODE steps vs 50-100 for diffusion), enabling real-time control at 10-50 Hz

Chapter 8: Connections

Flow Matching sits at the intersection of several major ideas in generative modeling and beyond.

The Intellectual Lineage

2018

Neural ODEs (Chen et al.) — continuous-depth networks

↓

2019

FFJORD — free-form CNFs with trace estimation (expensive)

↓

2020

Score SDE (Song et al.) — unified SDE framework for diffusion

↓

2022

Flow Matching (this paper) — simulation-free CNF training

↓