Introduction
Diffusion models have achieved extraordinary results, but their training and sampling mechanics are surprisingly indirect. To generate data, you learn to reverse a noise-corruption process — predicting the noise added at each of a thousand tiny steps. The resulting sample trajectories curve and meander through high-dimensional space, requiring many network evaluations to produce a single sample.
In 2023, two independent lines of work converged on a radically simpler alternative.
Lipman et al. (Flow Matching for Generative Modeling) and Liu et al.
(Flow Straight and Fast) proposed flow matching: instead of learning
to reverse a diffusion process, learn a velocity field that transports noise
samples directly to data samples along straight-line paths. The neural network predicts
vθ(x, t) — how fast and in what direction a point x should move at
time t — and an ODE integrator follows these velocities from noise to data.
The advantages are striking. Training is simulation-free — no need to run a forward process or compute complex noise schedules. The learned paths are straighter, requiring fewer integration steps at generation time. The objective is a simple regression loss with no weighting gymnastics. And the mathematical framework — continuous normalizing flows — is cleaner and more general than the Markov chain machinery of DDPMs.
This article builds on the SDE perspective from Article 04 and the score-based framework from Article 03. Familiarity with neural ODEs is helpful but not required — we derive everything from scratch. The key prerequisite is comfort with the idea of a time-dependent vector field guiding samples through space.
Continuous Normalizing Flows
Before we can understand flow matching, we need the mathematical object it operates on: the continuous normalizing flow (CNF). Introduced by Chen et al. (2018) as part of their Neural ODE framework, a CNF defines a generative model through an ordinary differential equation rather than a sequence of discrete transformations.
Neural ODEs
A Neural ODE defines a continuous transformation of space. Given a neural network
vθ(x, t) that takes a position x and time t, we define the flow
through the ODE:

dx(t)/dt = vθ(x(t), t)
Starting from an initial point x0 at time t = 0, we integrate this ODE
forward to get the transformed point x1 at time t = 1. The function
vθ is the velocity field — at every point in space
and every moment in time, it tells you which direction to move and how fast. The entire trajectory
x(t) for t ∈ [0, 1] is determined by the initial condition and the velocity field.
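As a concrete (if toy) illustration — the helper name integrate and the hand-written contracting field below are ours, not from any of the cited papers — a forward-Euler solve of this ODE takes just a few lines:

```python
import torch

def integrate(v, x0, steps=100):
    """Forward-Euler solve of dx/dt = v(x, t) from t = 0 to t = 1."""
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + v(x, t) * dt  # move each point along its velocity
    return x

# A hand-written velocity field that contracts every point toward (2, 0):
target = torch.tensor([2.0, 0.0])
field = lambda x, t: target - x

torch.manual_seed(0)
x0 = torch.randn(64, 2)        # initial points
x1 = integrate(field, x0)      # every point has moved toward (2, 0)
```

Swapping the hand-written field for a neural network vθ gives exactly the generative setup described above.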
We write the solution map as φt(x0) — the flow map
that sends each starting point to its location at time t. For a generative model, we want this flow
to transform a simple distribution (Gaussian noise) into a complex one (data). That is:

x0 ~ p0 = 𝒩(0, I)  ⟹  x1 = φ1(x0) ~ p1 ≈ pdata
Push-forward & change of variables
The flow map φt doesn't just move individual points — it transforms
entire probability distributions. If x0 ~ p0, then the
distribution of xt = φt(x0) is called the
push-forward of p0 through φt, written
(φt)#p0 = pt.
Because the ODE defines a smooth, invertible mapping (as long as vθ is Lipschitz), we can compute exact log-likelihoods via the instantaneous change of variables formula (also from Chen et al., 2018):

d log pt(x(t)) / dt = −∇ · vθ(x(t), t)
This is beautiful: the log-density along a trajectory changes at a rate given by the divergence of the velocity field. Regions where the flow expands (positive divergence) decrease in density; regions where it compresses (negative divergence) increase in density. Unlike discrete normalizing flows, there are no architectural constraints — any neural network architecture can serve as vθ.
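To make the divergence term concrete, here is a small autograd computation of ∇ · vθ (exact, looping over dimensions, so only sensible in low dimension — in high dimension practitioners use the Hutchinson trace estimator instead). The helper name divergence is ours:

```python
import torch

def divergence(v, x, t):
    """Exact divergence ∇·v at points x: sum of ∂v_i/∂x_i via autograd."""
    x = x.detach().requires_grad_(True)
    out = v(x, t)
    div = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):
        # gradient of the i-th output component w.r.t. all inputs; keep column i
        grad_i = torch.autograd.grad(out[:, i].sum(), x, create_graph=True)[0]
        div = div + grad_i[:, i]
    return div

# For v(x) = -x in 2 dimensions, ∇·v = -2 everywhere (pure compression):
contracting = lambda x, t: -x
d = divergence(contracting, torch.randn(5, 2), t=0.0)
```

A field with everywhere-negative divergence, like this one, concentrates density — matching the compression intuition above.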
The original CNF training approach (Chen et al., 2018) required simulating the ODE during training to compute the loss, making it extremely slow. Maximum likelihood training required computing the divergence trace, adding further cost. Flow matching solves both problems by providing a simulation-free training objective.
Velocity arrows show the learned flow at each time t. Particles follow the velocity field from noise (t=0) to data (t=1).
The Flow Matching Objective
Now we have the goal: find a velocity field vθ(x, t) whose flow
transforms p0 = 𝒩(0, I) into p1 ≈ pdata. But how
do we train it?
The most natural approach would be to regress vθ onto a target velocity field ut(x) that generates the desired probability path pt:

ℒFM(θ) = 𝔼t∼U[0,1], x∼pt ||vθ(x, t) − ut(x)||²
This is the flow matching objective: at each time t, sample a point x from the marginal distribution pt, and regress our network's predicted velocity onto the true velocity that would generate pt.
There's a fundamental problem: we don't know ut(x) or pt(x). The target velocity field ut is defined implicitly — it's the velocity field whose flow generates the probability path from p0 to p1. Computing it would require knowing the transport map between the noise distribution and the data distribution, which is exactly what we're trying to learn. We're stuck in a chicken-and-egg problem.
This is precisely the impasse that made original CNF training require simulation. The breakthrough of flow matching is realizing we can sidestep this problem entirely by conditioning.
Conditional Flow Matching
The key insight of Lipman et al. (2023) is elegant: instead of trying to construct the marginal probability path pt and its velocity field ut, we construct conditional probability paths pt(x | x1) — one for each data point x1 — where both the path and the velocity are known in closed form.
The Gaussian probability path
For each data sample x1, define the conditional probability path:

pt(x | x1) = 𝒩(x | t x1, (1 − t)² I)
At t = 0, this is 𝒩(0, I) — pure noise. At t = 1, it collapses to a delta at x1 — the data point. The mean interpolates linearly from the origin to x1, while the variance shrinks to zero. Samples from this conditional path follow a simple formula:
Sample x0 ~ 𝒩(0, I) and a data point x1 ~ pdata. The conditional sample at time t is:

xt = (1 − t) x0 + t x1
This is just linear interpolation between noise and data! The trajectory for each (x0, x1) pair is a perfectly straight line.
Now, what is the velocity along this straight-line path? Since xt = (1−t)x0 + t·x1, differentiating with respect to t gives:

dxt/dt = x1 − x0
The conditional velocity is constant — it doesn't depend on t or on the current position xt. It's simply the displacement vector from the noise sample to the data sample. Each particle moves at constant speed in a straight line from where it started to where it needs to go.
The CFM loss
The conditional flow matching (CFM) loss replaces the intractable marginal velocity ut(x) with the known conditional velocity ut(x | x1):

ℒCFM(θ) = 𝔼t∼U[0,1], x1∼pdata, x0∼𝒩(0,I) ||vθ(xt, t) − (x1 − x0)||²
where xt = (1 − t) x0 + t x1.
This is simulation-free: no ODE integration during training. Just sample noise, sample data, interpolate, and regress the network output onto the displacement vector.
The mathematical miracle: Lipman et al. prove that the CFM loss has the same gradients as the intractable FM loss (up to a constant). Training with the conditional objective recovers the correct marginal velocity field. The proof uses the fact that the marginal velocity ut(x) is a mixture of conditional velocities ut(x | x1) weighted by the posterior p(x1 | xt). Regressing onto conditional velocities at points sampled from the conditional paths yields the same expected gradient as regressing onto marginal velocities at points sampled from the marginal path.
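In symbols (our notation, with q denoting the data distribution), the mixture relation underlying the proof reads:

```latex
u_t(x) \;=\; \int u_t(x \mid x_1)\,\frac{p_t(x \mid x_1)\,q(x_1)}{p_t(x)}\,\mathrm{d}x_1
\;=\; \mathbb{E}_{x_1 \sim p(x_1 \mid x_t = x)}\bigl[\,u_t(x \mid x_1)\,\bigr]
```

Because the minimizer of a squared-error regression is the conditional expectation of its target, regressing vθ onto conditional velocities recovers exactly this marginal ut.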
Compare this to DDPM training: sample noise ε, compute xt = √ᾱt x0 + √(1 − ᾱt) ε, regress onto ε (note the clashing conventions: DDPM's x0 is the clean data, which plays the role of flow matching's x1). The structure is nearly identical! The difference: flow matching regresses onto the displacement x1 − x0 rather than the noise ε, and uses linear interpolation rather than the diffusion noising schedule.
Straight lines connect noise samples (left) to data samples (right); particles travel along these paths, tracing the marginal distribution at each time t.
Optimal Transport & Rectified Flows
Standard CFM pairs noise samples x0 with data samples x1 randomly — each x0 is independently sampled and independently paired with an x1. But the pairing between source and target points has a dramatic effect on the geometry of the learned flow. Random pairing produces crossing trajectories: a noise sample on the left might be paired with a data sample on the right, while a nearby noise sample on the right is paired with data on the left. Their paths cross, creating unnecessary curvature in the marginal velocity field.
Optimal transport (OT) provides the mathematically optimal pairing. The OT coupling minimizes the total displacement cost — it pairs source and target points to minimize the sum of squared distances ∑i ||x0,i − x1,π(i)||² over permutations π. This produces non-crossing trajectories that are globally straighter.
OT-CFM
Tong et al. (2024) proposed OT-CFM: use mini-batch optimal transport to pair noise and data samples within each training batch. For a batch of N noise samples and N data samples, solve the discrete OT problem (using the Sinkhorn algorithm or Hungarian method) to find the optimal assignment, then train with the CFM loss using these OT-paired samples.
The mini-batch OT approximation is surprisingly effective. Even with small batch sizes (64-256), the resulting flows are measurably straighter and can be sampled with fewer ODE solver steps. The computational overhead of solving the OT problem within each batch is negligible compared to the neural network forward pass.
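A minimal sketch of the mini-batch pairing step, using SciPy's Hungarian solver (linear_sum_assignment); the helper name ot_pair is ours, not from the paper:

```python
import torch
from scipy.optimize import linear_sum_assignment

def ot_pair(x0, x1):
    """Re-order a batch so each noise sample is matched to the data sample
    that minimizes the total squared displacement (exact mini-batch OT)."""
    cost = torch.cdist(x0.flatten(1), x1.flatten(1)) ** 2  # (N, N) pairwise costs
    rows, cols = linear_sum_assignment(cost.numpy())       # Hungarian method
    return x0[rows], x1[cols]

# Two noise points and two data points whose naive pairing would cross:
x0 = torch.tensor([[0.0], [1.0]])
x1 = torch.tensor([[1.1], [0.1]])
a, b = ot_pair(x0, x1)  # OT pairs 0.0 with 0.1, and 1.0 with 1.1
```

The paired batches then feed straight into the usual CFM loss; for larger batches the entropy-regularized Sinkhorn algorithm is a faster approximate alternative.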
Reflow: iterative straightening
Liu et al. (2023) proposed an alternative approach to straightening: reflow (also called rectified flows). The idea is elegantly recursive:
1. Train a flow model with random (x0, x1) pairings.
2. Use the trained model to generate synthetic (x0, x1) pairs by integrating the ODE from noise to data.
3. Re-train a new flow model using these synthetic pairs as training data.
4. Repeat. Each iteration produces straighter flows.
Why does this work? After step 1, the model has learned some flow φ. In step 2, we draw x0 ~ 𝒩(0, I) and compute x1 = φ1(x0). This (x0, x1) pair is now causally coupled — they're connected by the learned flow. Training on these coupled pairs in step 3 straightens the trajectories because the new model learns to go directly from x0 to its corresponding x1 in a straight line, rather than following the potentially curved path the original model took.
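Step 2 — generating the coupled pairs — can be sketched as follows (a fixed-step Euler solve; the helper name make_reflow_pairs is ours):

```python
import torch

@torch.no_grad()
def make_reflow_pairs(model, n, dim, steps=100):
    """Draw noise, push it through the learned ODE, and return the
    causally coupled (x0, x1) pairs used to retrain a straighter flow."""
    x0 = torch.randn(n, dim)
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.full((n, 1), i * dt)
        x = x + model(x, t) * dt  # Euler step along the learned velocity
    return x0, x

# Sanity check with a stand-in "model" of constant velocity 1: x1 = x0 + 1.
const_model = lambda x, t: torch.ones_like(x)
x0, x1 = make_reflow_pairs(const_model, n=8, dim=2)
```

Training the next flow on (x0, x1) pairs produced this way is what straightens the trajectories.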
The ultimate payoff of straight flows: if trajectories are perfectly straight, you only need one Euler step to go from noise to data. This means a well-straightened flow can be distilled into a single-step generator — essentially collapsing the iterative ODE solve into a single forward pass. This is the foundation of consistency distillation and one-step generation methods.
Left: diffusion model trajectories curve through space. Right: flow matching trajectories are nearly straight. Straighter paths = fewer solver steps = faster generation.
Flow Matching vs Diffusion: A Detailed Comparison
Flow matching and diffusion models are more similar than they first appear. They use the same neural network architectures (U-Nets, DiTs), operate over the same data types, and produce comparable sample quality. The differences lie in the training objective, the interpolation schedule, and the geometry of sample trajectories.
| Aspect | DDPM / Diffusion | Flow Matching |
|---|---|---|
| Framework | Reverse Markov chain / SDE | Continuous normalizing flow / ODE |
| Network predicts | Noise εθ(xt, t) | Velocity vθ(xt, t) |
| Interpolation | xt = √ᾱt x0 + √(1−ᾱt) ε | xt = (1−t) x0 + t x1 |
| Target | ε (the noise that was added) | x1 − x0 (displacement vector) |
| Training loss | ||εθ(xt, t) − ε||² | ||vθ(xt, t) − (x1−x0)||² |
| Noise schedule | Learned or fixed βt schedule | Linear interpolation (no schedule needed) |
| Path geometry | Curved (noise added/removed incrementally) | Straight (linear interpolation) |
| Sampling steps | 50–1000 typical | 10–100 typical (fewer with OT/reflow) |
| Likelihood | ELBO (lower bound) | Exact via change of variables (but expensive) |
The relationship between the two frameworks runs deep. Kingma & Gao (2024) showed that diffusion models and flow matching are members of the same family — both can be viewed as learning a velocity field for an ODE, just with different interpolation schedules and parameterizations. The ε-prediction of diffusion and the velocity prediction of flow matching are related by a simple linear transformation that depends on the noise schedule.
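A hedged sketch of that relation, in a generic-interpolant notation (αt and σt denote the schedule, x1 the data, ε the noise; signs and schedule conventions vary across papers): writing xt = αt x1 + σt ε, differentiating in t, and eliminating x1 via x1 = (xt − σt ε)/αt gives

```latex
\dot{x}_t \;=\; \dot{\alpha}_t\,x_1 + \dot{\sigma}_t\,\varepsilon
\;=\; \frac{\dot{\alpha}_t}{\alpha_t}\,x_t
\;+\;\Bigl(\dot{\sigma}_t - \frac{\dot{\alpha}_t\,\sigma_t}{\alpha_t}\Bigr)\varepsilon
```

so an ε-prediction network determines the velocity (and vice versa) through schedule-dependent linear coefficients.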
In practice, the choice often comes down to engineering convenience. Flow matching's linear interpolation is simpler to implement and reason about. Its straighter paths mean fewer sampling steps, which translates directly to faster inference. Modern systems like Stable Diffusion 3 (Esser et al., 2024) use flow matching (specifically, rectified flows) as their training framework — a testament to its practical advantages.
Extensions & Frontiers
The flow matching framework has proven remarkably extensible. Its clean mathematical structure — define a probability path, derive the conditional velocity, regress — generalizes far beyond Euclidean spaces and Gaussian paths.
Riemannian Flow Matching (Chen & Lipman, 2024) extends flow matching to curved spaces — spheres, tori, hyperbolic spaces, and general Riemannian manifolds. This is crucial for data that lives on manifolds: molecular conformations (SO(3) for rotations), geological data (the sphere S²), or protein backbone angles (the torus). The key modification: replace linear interpolation with geodesic interpolation on the manifold, and replace Euclidean velocities with tangent vectors.
Discrete Flow Matching (Campbell et al., 2024) adapts the framework to discrete data — text, categorical features, molecular graphs. Instead of continuous velocities, the model learns transition rates between discrete states. This provides a principled alternative to autoregressive generation and discrete diffusion, with the same straight-path advantages.
Stochastic Interpolants (Albergo & Vanden-Eijnden, 2023) generalize the interpolation xt = (1−t)x0 + t·x1 by adding stochastic noise along the path: xt = αt x0 + βt x1 + σt ε. This interpolates between deterministic ODE flows (σt = 0) and stochastic SDE flows (σt > 0), unifying flow matching and score-based diffusion in a single framework.
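As a sketch, one valid choice of coefficients (ours — many others satisfy the framework) keeps the linear endpoints and injects noise that vanishes at t = 0 and t = 1:

```python
import torch

def stochastic_interpolant(x0, x1, t, sigma=0.1):
    """Sample x_t = (1-t) x0 + t x1 + sigma * sqrt(t(1-t)) * eps.
    The noise amplitude vanishes at both endpoints, so x0 and x1 are hit exactly."""
    eps = torch.randn_like(x0)
    return (1 - t) * x0 + t * x1 + sigma * torch.sqrt(t * (1 - t)) * eps

x0 = torch.randn(4, 2)                         # source samples
x1 = torch.randn(4, 2)                         # target samples
at_start = stochastic_interpolant(x0, x1, torch.zeros(4, 1))  # equals x0
at_end = stochastic_interpolant(x0, x1, torch.ones(4, 1))     # equals x1
```

Setting sigma = 0 recovers the deterministic linear path of standard flow matching.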
Schrödinger Bridges provide the mathematically optimal stochastic transport between two distributions. While OT gives the optimal deterministic transport (straight lines with optimal pairing), Schrödinger bridges solve the entropy-regularized version, yielding diffusion processes (not just ODEs) that optimally connect distributions. Several recent works (De Bortoli et al., 2021; Shi et al., 2023) connect Schrödinger bridges to flow matching, enabling optimal stochastic transport without simulation-based training.
Left: random pairing creates crossing paths. Right: optimal transport pairing minimizes total displacement, producing non-crossing trajectories.
Implementation
One of flow matching's greatest strengths is how little code it requires. The training loop is shorter and simpler than DDPM, with no noise schedule, no ᾱt computation, and no variance weighting. Here's a side-by-side comparison:
DDPM training step:
```python
# DDPM training step
x_0 = get_data_batch()                        # clean data, shape (B, C, H, W)
t = torch.randint(0, T, (B,))                 # discrete timestep
eps = torch.randn_like(x_0)                   # noise
alpha_bar_t = alpha_bar[t].view(B, 1, 1, 1)   # cumulative schedule, broadcastable
x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * eps
eps_pred = model(x_t, t)                      # predict noise
loss = F.mse_loss(eps_pred, eps)
```
Flow matching training step:
```python
# Flow matching training step
x_1 = get_data_batch()                # data (target)
x_0 = torch.randn_like(x_1)           # noise (source)
t = torch.rand(B, 1, 1, 1)            # continuous time ~ U[0, 1]
x_t = (1 - t) * x_0 + t * x_1         # linear interpolation
v_pred = model(x_t, t)                # predict velocity
loss = F.mse_loss(v_pred, x_1 - x_0)  # target = displacement
```
The simplicity is remarkable. Flow matching eliminates the noise schedule entirely — no βt, no αt, no cumulative products. Time is sampled continuously from U[0,1] rather than as discrete integers. The interpolation is a simple lerp. The target is the raw displacement.
Time sampling strategies. While uniform sampling t ~ U[0,1] works, several works have found that non-uniform time sampling improves training. Logit-normal sampling (t ~ σ(Normal(0, 1))) concentrates samples near t = 0.5, where the interpolation is most ambiguous and the velocity field most complex. Stable Diffusion 3 uses this approach. Another option is to sample t with a density proportional to the expected loss at that timestep, focusing training effort on the hardest times.
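The logit-normal sampler is one line; the comparison below simply checks that it does concentrate mass near t = 0.5 (variable names are ours):

```python
import torch

torch.manual_seed(0)
B = 4096
t_uniform = torch.rand(B)                       # baseline: t ~ U[0, 1]
t_logit_normal = torch.sigmoid(torch.randn(B))  # sigmoid of a standard normal

# Fraction of samples landing in the middle band [0.25, 0.75]:
mid_uniform = ((t_uniform - 0.5).abs() < 0.25).float().mean()       # ≈ 0.5 in expectation
mid_logit = ((t_logit_normal - 0.5).abs() < 0.25).float().mean()    # noticeably higher
```

The logit-normal samples cluster around the middle of the trajectory, where the velocity field is hardest to learn.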
Loss weighting. Although the unweighted MSE loss works well, some practitioners apply a time-dependent weight w(t) to the loss. Common choices include w(t) = 1/(1 − t + ε) to emphasize the final stages (where small errors become large in pixel space), or w(t) = 1/σt² to normalize for the varying scale of the velocity across time.
Sampling (inference). To generate samples, integrate the learned ODE from t = 0 to t = 1:
```python
# Flow matching sampling (Euler method)
x = torch.randn(B, C, H, W)  # start from noise
steps = 50                   # number of ODE steps
dt = 1.0 / steps
for i in range(steps):
    t = torch.full((B, 1, 1, 1), i * dt)
    v = model(x, t)          # predicted velocity
    x = x + v * dt           # Euler step
# x is now approximately a data sample
```
Higher-order ODE solvers (Runge-Kutta, DPM-Solver) can reduce the number of steps further. With well-straightened flows (from OT-CFM or reflow), even 10–20 Euler steps can produce high-quality samples.
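A midpoint (second-order Runge-Kutta) variant of the Euler loop costs two model calls per step but is exact for velocity fields linear in t; the name sample_midpoint is ours:

```python
import torch

@torch.no_grad()
def sample_midpoint(model, x, steps=20):
    """Midpoint-rule ODE sampler: evaluate the velocity at the start of each
    step, then take the full step using the velocity at the step's midpoint."""
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        k = model(x, t)                                      # slope at step start
        x = x + model(x + 0.5 * dt * k, t + 0.5 * dt) * dt   # midpoint slope
    return x

# With velocity v(x, t) = 2t the exact solution is x(1) = x(0) + 1,
# and the midpoint rule reproduces it to floating-point accuracy.
linear_in_t = lambda x, t: 2.0 * t * torch.ones_like(x)
out = sample_midpoint(linear_in_t, torch.zeros(3, 1))
```

For curved flows this typically matches Euler's quality with far fewer steps; for well-rectified flows the advantage shrinks, since straight paths make Euler nearly exact.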
References
Seminal papers and key works referenced in this article.
- Lipman et al. "Flow Matching for Generative Modeling." ICLR, 2023. arXiv
- Liu et al. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." ICLR, 2023. arXiv
- Albergo & Vanden-Eijnden. "Building Normalizing Flows with Stochastic Interpolants." ICLR, 2023. arXiv
- Esser et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." ICML, 2024. arXiv