Ch 3: Flow Matching — Flow Matching & Diffusion

t	α_t = t	β_t = 1−t	x = α_tz + β_tε = t·3 + (1−t)·0.7
0.0	0.0	1.0	0 + 0.7 = 0.700
0.25	0.25	0.75	0.75 + 0.525 = 1.275
0.50	0.50	0.50	1.5 + 0.35 = 1.850
0.75	0.75	0.25	2.25 + 0.175 = 2.425
1.0	1.0	0.0	3.0 + 0 = 3.000

Chapter 3: The Conditional Vector Field

A probability path tells us what distributions we want at each time t. But we need a vector field whose ODE actually produces those distributions. Given a conditional probability path p_t(·|z) = N(α_tz, β_t²I), we need to find u_t^target(x|z) such that:

X₀ ~ p_init, dX/dt = u_t^target(X_t|z) ⇒ X_t ~ p_t(·|z)

This is one of the most important derivations in the course. Let's do it step by step, showing every algebraic move.

Step 1: Define the conditional flow. We want a flow ψ_t(x|z) such that if X₀ ~ N(0, I), then X_t = ψ_t(X₀|z) ~ N(α_tz, β_t²I). The simplest choice is an affine map:

ψ_t^target(x|z) = α_t z + β_t x

Step 2: Verify it produces the right distributions. If X₀ ~ N(0, I), then X_t = ψ_t(X₀|z) = α_tz + β_tX₀. Since X₀ is Gaussian and we are applying an affine transformation, X_t is also Gaussian:

E[X_t] = α_tz + β_t E[X₀] = α_tz + 0 = α_tz

Var(X_t) = β_t² Var(X₀) = β_t² I

So X_t ~ N(α_tz, β_t²I) = p_t(·|z). Check. The flow produces the correct conditional probability path.

Step 3: Extract the vector field. By the definition of a flow, dψ/dt = u(ψ, z). We differentiate ψ_t with respect to t:

dψ_t/dt = d(α_tz + β_tx)/dt = α̇_t z + β̇_t x

Step 4: Re-express in terms of the output. The vector field u must be a function of ψ_t (the current position), not x (the initial position). Since ψ_t = α_tz + β_tx, we solve for x = (ψ_t − α_tz)/β_t and substitute:

u_t^target(ψ_t|z) = α̇_t z + β̇_t · (ψ_t − α_tz)/β_t

= (α̇_t − α_tβ̇_t/β_t) z + (β̇_t/β_t) ψ_t

Renaming ψ_t back to x (since the formula should express the vector field as a function of the current position x), the conditional Gaussian vector field is:

u_t^target(x|z) = (α̇_t − α_tβ̇_t/β_t) z + (β̇_t/β_t) x

Understanding each term. The vector field has two parts:

1. The z-dependent term (α̇_t − α_tβ̇_t/β_t) z — this pulls the particle toward the data point z. The coefficient depends on the time derivatives of the schedulers.

2. The x-dependent term (β̇_t/β_t) x — this adjusts the velocity based on the current position. For CondOT, β̇_t/β_t = −1/(1−t), which is always negative, meaning it pushes toward the origin.

Combined, the two terms produce a velocity that moves x along a trajectory from the initial noise toward the target data point z.

This is an exact, analytical formula. No neural network needed. Given schedules α_t, β_t and a data point z, we can compute the conditional vector field at any (x, t) by plugging into the formula. This is the target that the neural network will learn to approximate.

Step 5: Specialize to CondOT (α_t = t, β_t = 1−t). Let's compute each derivative:

α̇_t = d(t)/dt = 1. β̇_t = d(1−t)/dt = −1.

Now substitute into the general formula:

u_t^target(x|z) = (1 − t · (−1)/(1−t)) z + (−1/(1−t)) x

= (1 + t/(1−t)) z − x/(1−t)

= z/(1−t) − x/(1−t) = (z − x)/(1−t)

Step 6: The elegant shortcut. There is a much simpler way to see the answer. Since x = α_tz + β_tε, the trajectory of a specific sample is x(t) = tz + (1−t)ε for CondOT. Its time derivative is:

dx/dt = d(tz + (1−t)ε)/dt = z − ε

But dx/dt is the velocity of the trajectory, which is exactly the conditional vector field evaluated at x(t). Therefore:

u_t^target(x|z) = α̇_t z + β̇_t ε = 1 · z + (−1) · ε = z − ε

The CondOT target is simply z − ε! This is the velocity that, at any time t, points from the current noisy sample x = tz + (1−t)ε toward the clean data z at constant speed. The neural network learns to predict (z − ε) from the noisy input x and time t.

Numerical example. z = 3.0, ε = 0.7, so the target velocity is z − ε = 3.0 − 0.7 = 2.3 at every time t. The velocity is constant along the straight line from ε to z.

Verification by trajectory. The trajectory under this velocity is x(t) = x(0) + t · (z − ε). Since x(0) = ε = 0.7:

t	x(t) = 0.7 + t · 2.3	Should equal tz + (1−t)ε	Match?
0.0	0.7 + 0 = 0.700	0 + 0.7 = 0.700	Yes
0.25	0.7 + 0.575 = 1.275	0.75 + 0.525 = 1.275	Yes
0.50	0.7 + 1.15 = 1.850	1.5 + 0.35 = 1.850	Yes
1.0	0.7 + 2.3 = 3.000	3.0 + 0 = 3.000	Yes

The constant velocity z − ε traces a straight line from ε to z. This is the optimal transport solution — the shortest path between the two points.

2D numerical example. z = (3.0, −1.0), ε = (0.5, 0.8). Target velocity: z − ε = (2.5, −1.8). At t = 0.4: x = 0.4(3.0, −1.0) + 0.6(0.5, 0.8) = (1.2, −0.4) + (0.3, 0.48) = (1.5, 0.08). The trajectory is a straight line in 2D from (0.5, 0.8) to (3.0, −1.0).

python
# Computing the conditional vector field for CondOT
def condot_target(z, eps):
    """Target velocity for CondOT path."""
    return z - eps  # That's it. The target is z - eps.

# Example
z = torch.tensor([3.0, -1.0])
eps = torch.tensor([0.5, 0.8])
target = condot_target(z, eps)  # tensor([2.5, -1.8])

For the CondOT path, if z = 5 and ε = −1, what is the target velocity u_t^target(x|z)?

z + ε = 4 z · ε = −5 z − ε = 6

Chapter 4: The Marginal Vector Field

The conditional vector field u_t^target(x|z) moves everything toward a single data point z. That is useless by itself — we do not want every sample to collapse to the same z. We need a vector field that generates all of p_data.

The marginal vector field is the average of conditional vector fields, weighted by how likely each data point z is to have produced the current noisy sample x:

u_t^target(x) = ∫ u_t^target(x|z) · p_t(x|z) p_data(z) / p_t(x) · dz

The weight p_t(x|z)p_data(z)/p_t(x) is a posterior: given the noisy observation x at time t, how likely is it that the clean data was z? This is Bayes' rule in action.

Worked example — marginal field with 2 data points. Suppose p_data puts equal mass on z₁ = −3 and z₂ = +3 (1D). At t = 0.5, consider a noisy observation x = 1.0. Which data point produced it?

Under the conditional path (CondOT), x = tz + (1−t)ε, so at t = 0.5:

If z = z₁ = −3: x = 0.5(−3) + 0.5ε, so ε = (x − 0.5(−3))/0.5 = (1 + 1.5)/0.5 = 5.0.

If z = z₂ = +3: x = 0.5(3) + 0.5ε, so ε = (x − 1.5)/0.5 = −1.0.

Since ε ~ N(0, 1), ε = −1.0 is much more likely than ε = 5.0. So the posterior strongly favors z₂ = +3. The marginal velocity at x = 1.0 is dominated by the conditional velocity pointing toward z₂.

Conditional velocities: u(x|z₁) = z₁ − ε₁ = −3 − 5 = −8, u(x|z₂) = z₂ − ε₂ = 3 − (−1) = 4.

Posterior weights: w₁ ∝ N(5; 0, 1) ≈ 10⁻⁶, w₂ ∝ N(−1; 0, 1) ≈ 0.24.

Marginal velocity: u(x) ≈ w₁(−8) + w₂(4) / (w₁ + w₂) ≈ 4.0. The field points toward +3.

Intuition for the marginal vector field: For each possible data point z, we compute the velocity that would bring us to z. Then we average these velocities, weighting by how much we believe x came from z. Early in the process (t near 0, x is very noisy), many data points are plausible, so the velocity is an average over many directions. Late in the process (t near 1, x is nearly clean), only one or two data points are plausible, and the velocity focuses sharply.

Conditional vs. Marginal Vector Fields

Orange arrows show the conditional field for ONE data point z. Teal arrows show the marginal field (averaged over all data points). Toggle between them to see the difference.

View

t 0.30

Theorem 9 (Marginalization Trick): The marginal vector field u_t^target(x) generates the marginal probability path p_t:

X₀ ~ p_init, dX/dt = u_t^target(X_t) ⇒ X_t ~ p_t, X₁ ~ p_data

This is the key theorem. If we could compute u_t^target(x), we would be done — just simulate this ODE and get perfect samples. The problem: we cannot compute u_t^target(x) because the integral involves p_t(x) in the denominator, which is intractable.

The setup for the big trick: We cannot compute u_t^target(x) directly. But we CAN compute u_t^target(x|z) for any specific z. Flow matching exploits this: we train the neural network to match the conditional vector field — and this turns out to be equivalent to matching the marginal one!

The continuity equation. The mathematical tool that proves Theorem 9 is the continuity equation. For an ODE dX/dt = u_t(X_t), the density p_t(x) of X_t evolves according to:

∂_t p_t(x) = −div(p_t u_t)(x)

This equation from physics says: the rate of change of probability at x equals the net inflow of probability mass. The divergence measures outflow; the negative divergence measures inflow. Multiplying by p_t accounts for the total mass being transported.

Proof idea for Theorem 9. We need to show that u_t^target(x) = ∫ u(x|z) · posterior(z|x) dz satisfies the continuity equation for p_t. The proof directly computes:

1. Start with ∂_t p_t(x) = ∫ ∂_t p_t(x|z) p_data(z) dz (definition of marginal).

2. Each conditional term satisfies its own continuity equation: ∂_t p_t(x|z) = −div(p_t(·|z) u_t(·|z))(x).

3. Integrate: ∂_t p_t(x) = −div(∫ p_t(x|z) u_t(x|z) p_data(z) dz).

4. Multiply and divide by p_t(x): = −div(p_t(x) · ∫ u_t(x|z) posterior(z|x) dz) = −div(p_t u_t^target).

This matches the continuity equation for u_t^target, proving the theorem.

Why can't we compute the marginal vector field u_t^target(x) directly?

The integral requires evaluating p_t(x), which is intractable It does not exist for Gaussian paths It requires knowledge of the neural network architecture

Chapter 5: The Flow Matching Loss

We want u_t^θ ≈ u_t^target (marginal vector field). The natural loss would be:

L_FM(θ) = E_{t~Unif, x~p_t} [ || u_t^θ(x) − u_t^target(x) ||² ]

But we cannot compute u_t^target(x)! Instead, define the conditional flow matching loss:

L_CFM(θ) = E_{t~Unif, z~p_data, x~p_t(·|z)} [ || u_t^θ(x) − u_t^target(x|z) ||² ]

The difference: we regress against the conditional vector field u_t^target(x|z) instead of the marginal u_t^target(x). We can compute this because we know the conditional vector field analytically.

Theorem 12 (The key result): L_FM(θ) = L_CFM(θ) + C, where C is a constant independent of θ. Therefore:

∇_θ L_FM(θ) = ∇_θ L_CFM(θ)

Minimizing the conditional loss is identical to minimizing the marginal loss. By training against the tractable conditional target, we implicitly train against the intractable marginal target.

Proof sketch. Expand ||a − b||² = ||a||² − 2a^Tb + ||b||² for both losses. The ||a||² terms are the same (they both sample x ~ p_t). The cross terms are equal because:

E_{x~p_t}[u_t^θ(x)^T u_t^target(x)] = E_{z~p_data, x~p_t(·|z)}[u_t^θ(x)^T u_t^target(x|z)]

This is the crucial step: the marginal vector field in the cross-term can be replaced by the conditional one by expanding the definition and using linearity of integration. The ||b||² terms differ but are constants in θ.

Why this is profound: Training is simulation-free. We never simulate any ODE during training. Each training step is O(1) neural network evaluations (not O(n) like ODE simulation). This makes flow matching scalable to billion-parameter models.

Full proof of Theorem 12. We need to show L_FM(θ) = L_CFM(θ) + C. Expand the squared norm:

||a − b||² = ||a||² − 2a^Tb + ||b||²

Apply to L_FM:

L_FM = E[||u_θ||²] − 2E[u_θ^T u^target] + E[||u^target||²]

The last term is constant in θ (call it C₁). The first term E[||u_θ(x)||²] where x ~ p_t is the same as E[||u_θ(x)||²] where z ~ p_data, x ~ p_t(·|z), because sampling x ~ p_t is equivalent to the two-step procedure.

The crucial step is the cross-term. Using the definition of u_t^target(x) as a posterior-weighted integral:

E_{x~p_t}[u_θ(x)^T u^target(x)] = E_{z~p_data, x~p_t(·|z)}[u_θ(x)^T u^target(x|z)]

This is the key identity. It says: averaging u_θ against the marginal velocity is the same as averaging against the conditional velocity, because the marginal velocity is a posterior-weighted average of conditional velocities. When we take the expectation over x ~ p_t, the posterior weights integrate out.

Substituting back and adding/subtracting ||u^target(x|z)||² inside the expectation, we get L_FM = L_CFM + C₁ + C₂, where C₂ is also independent of θ. QED.

Worked example — verifying the theorem for d = 1. Suppose p_data puts mass 0.5 on z = −2 and 0.5 on z = +2. CondOT path. Consider fixed t = 0.5, x = 0.

The marginal velocity u^target(x = 0) at t = 0.5:

For z = −2: ε = (x − tz)/(1−t) = (0 − (−1))/0.5 = 2.0. Conditional velocity: z − ε = −2 − 2 = −4. Weight: N(ε=2; 0, 1) = 0.054.

For z = +2: ε = (0 − 1)/0.5 = −2.0. Conditional velocity: z − ε = 2 − (−2) = 4. Weight: N(ε=−2; 0, 1) = 0.054.

Both weights are equal! So u^target(0) = 0.5(−4) + 0.5(4) = 0. The velocity at the midpoint is zero because both data points pull equally.

If we slightly perturb x = 0.5: For z = +2, ε = (0.5 − 1)/0.5 = −1.0, weight N(−1; 0, 1) = 0.242. For z = −2, ε = (0.5 + 1)/0.5 = 3.0, weight N(3; 0, 1) = 0.004. Now z = +2 dominates, and u^target(0.5) ≈ 4. The field points toward +2. This is how the marginal field "decides" which mode to send each particle to.

Why does training with conditional targets work even though different z values give different targets for the same x? Because the neural network learns the average target over all possible z values, weighted by the posterior. This average IS the marginal velocity. Stochastic gradient descent naturally computes this average over many training steps. Each step shows a different (z, ε) pair, and the accumulated gradient converges to the gradient of the marginal loss.

An analogy to make this concrete. Imagine training a weather prediction model. At location x = "40°N, 74°W" at time t = "January," the target wind velocity varies across years. Some years it blows east, some west, some north. If you train the model on individual year samples, the model learns the average wind at that location and time. This average is the climatological wind — the marginal velocity in our analogy. Individual years are like individual data points z. The model cannot memorize a specific year; it learns the average over all years. This is exactly how flow matching works: individual training samples give conditional targets, but the model learns the marginal (average) velocity.

Convergence of SGD for flow matching. Each mini-batch provides a noisy estimate of ∇_θ L_CFM. By Theorem 12, this is an unbiased estimate of ∇_θ L_FM. Standard SGD convergence results guarantee that the parameters converge to a minimizer of L_FM (given sufficient model capacity and decreasing learning rate). In practice, Adam with a fixed learning rate and cosine schedule works well.

Training vs. sampling complexity. A comparison that highlights why flow matching is practical:

Operation	NFE per step	Backprop?	Typical iterations
Flow matching training step	1	Yes	100K–1M
Sampling (Euler, 50 steps)	50	No	1 per image
Naive "simulate + backprop" training	50	Yes (through all 50)	Prohibitive

Flow matching training is 50x cheaper per step than the naive approach, with no memory overhead from storing intermediate ODE states for backpropagation.

Memory usage comparison. The naive "simulate + backprop" approach stores all intermediate states for backpropagation. For 50 Euler steps with a DiT-XL model: memory = 50 × (activation memory per step) ≈ 50 × 2GB = 100GB. This exceeds the memory of any single GPU. Flow matching: memory = 1 × 2GB = 2GB. A 50x memory reduction.

Techniques like checkpointing (recomputing activations instead of storing them) can reduce the naive approach's memory to O(1), but at the cost of 2-3x more computation. Flow matching avoids this tradeoff entirely.

EMA (Exponential Moving Average) of weights. One important practical detail: during training, we maintain an EMA of the model weights with decay rate 0.9999. At inference, we use the EMA weights instead of the raw weights. This smooths out training noise and consistently improves sample quality. All production models use this technique.

python
# EMA of weights (essential for quality)
class EMA:
    def __init__(self, model, decay=0.9999):
        self.ema_params = {n: p.clone() for n, p in model.named_parameters()}
        self.decay = decay

    def update(self, model):
        for n, p in model.named_parameters():
            self.ema_params[n].mul_(self.decay).add_(p, alpha=1-self.decay)

# Usage: after each optimizer step, call ema.update(model)
# At inference: load ema.ema_params into the model

python
# Conditional Flow Matching Loss (one training step)
def cfm_loss(model, z_batch):
    """z_batch: [B, d] — batch of data samples"""
    B, d = z_batch.shape
    t = torch.rand(B, 1)                   # t ~ Unif[0,1]
    eps = torch.randn_like(z_batch)          # eps ~ N(0, I)
    x = t * z_batch + (1 - t) * eps         # x ~ p_t(·|z)
    target = z_batch - eps                   # u_target = z - eps
    pred = model(x, t)                       # u_theta(x, t)
    return (pred - target).pow(2).mean()     # MSE loss

Why can we train against u_t^target(x|z) instead of u_t^target(x)?

Because L_FM = L_CFM + C (they differ by a constant in θ), so their gradients are identical Because the conditional field is more accurate Because we do not actually care about the marginal field

Chapter 6: The CondOT Path

Let us instantiate everything for the most popular choice: the Conditional Optimal Transport (CondOT) path with α_t = t and β_t = 1 − t. This gives:

Quantity	General Gaussian	CondOT (α=t, β=1−t)
Conditional path	p_t(·\|z) = N(α_tz, β_t²I)	N(tz, (1−t)²I)
Sample from path	x = α_tz + β_tε	x = tz + (1−t)ε
Derivatives	α̇_t, β̇_t	α̇_t = 1, β̇_t = −1
Conditional velocity	(α̇_t − α_tβ̇_t/β_t)z + (β̇_t/β_t)x	z − ε
CFM loss	\|\|u_t^θ(α_tz+β_tε) − (α̇_tz+β̇_tε)\|\|²	\|\|u_t^θ(tz+(1−t)ε) − (z−ε)\|\|²

Worked example — one training step by hand. Suppose our dataset is {2.0, 5.0, −1.0}. One training step:

1. Sample data: z = 5.0 (randomly chosen from dataset).

2. Sample time: t = 0.3 (from Unif[0,1]).

3. Sample noise: ε = −0.8 (from N(0,1)).

4. Compute noisy input: x = 0.3 · 5.0 + 0.7 · (−0.8) = 1.5 − 0.56 = 0.94.

5. Compute target velocity: z − ε = 5.0 − (−0.8) = 5.8.

6. Neural network predicts: u_0.3^θ(0.94) = 4.2 (hypothetical).

7. Loss: (4.2 − 5.8)² = (−1.6)² = 2.56.

8. Backpropagate and update θ to reduce this loss.

Why "optimal transport"? The CondOT path creates straight-line trajectories from ε to z. Each noise sample ε is transported along the shortest path to a data point z. This is related to optimal transport theory, which studies the most efficient ways to move mass from one distribution to another. Straight-line paths mean the vector field changes slowly over time, making it easier for the neural network to learn.

The general Gaussian CFM loss with arbitrary schedulers is:

L_CFM = E_{t, z, ε}[ ||u_t^θ(α_tz + β_tε) − (α̇_tz + β̇_tε)||² ]

Alternative noise schedules. While CondOT (α_t = t, β_t = 1 − t) creates straight-line paths, other schedules are possible:

Name	α_t	β_t	Path shape
CondOT (linear)	t	1 − t	Straight lines from ε to z
Cosine schedule	sin(πt/2)	cos(πt/2)	Curved paths, slow start & end
VP (DDPM-style)	√(ᾱ_t)	√(1 − ᾱ_t)	Preserves norm

In practice, the CondOT schedule works extremely well. Straight-line paths mean the vector field is nearly constant over time, which is easier for a neural network to learn. The cosine schedule was popular before flow matching (in DDPM-era models) because it avoids spending too much training time at very high or very low noise levels.

Signal-to-Noise Ratio (SNR). The noise schedule determines how much "signal" (data) vs. "noise" is present at each time t. The SNR is defined as:

SNR(t) = α_t² / β_t²

t (CondOT)	α_t = t	β_t = 1−t	SNR = t²/(1−t)²	Interpretation
0.1	0.1	0.9	0.012	Almost pure noise
0.3	0.3	0.7	0.184	Noise dominates
0.5	0.5	0.5	1.000	Equal signal and noise
0.7	0.7	0.3	5.444	Signal dominates
0.9	0.9	0.1	81.0	Almost pure data

The network faces very different tasks at different t values. At low t (high noise), it must infer the general class or mode of the data. At high t (low noise), it must predict fine details like texture and edges. Modern training strategies sometimes sample t non-uniformly to give more weight to informative noise levels (like the logit-normal sampling in Stable Diffusion 3).

Time sampling strategies. Uniform sampling t ~ Unif[0,1] weights all noise levels equally. But not all noise levels are equally important for image quality:

Strategy	Distribution of t	Effect
Uniform	Unif[0, 1]	Equal weight to all noise levels
Logit-normal	t = sigmoid(Normal(0, 1))	More weight near t = 0.5 (medium noise)
Cosine	t = cos⁻¹(Unif)	More weight near t = 0 and t = 1
Min-SNR weighting	Weight loss by min(SNR, γ)	Caps contribution of high-SNR timesteps

Stable Diffusion 3 uses logit-normal time sampling, which concentrates training on medium noise levels where the neural network has the most to learn. Very high noise (t near 0) is easy (just predict the data mean) and very low noise (t near 1) is easy (just predict the identity). The interesting work happens in between.

python
# Time sampling strategies
import torch

# Uniform
t_uniform = torch.rand(B)

# Logit-normal (Stable Diffusion 3)
t_logit_normal = torch.sigmoid(torch.randn(B))

# CDF-based cosine
u = torch.rand(B)
t_cosine = 1 - torch.acos(u) / torch.pi * 2

# All three are valid; logit-normal is most popular now

python
# Different noise schedules
import torch
import math

def condot_schedule(t):
    return t, 1 - t  # alpha, beta

def cosine_schedule(t):
    alpha = torch.sin(math.pi * t / 2)
    beta = torch.cos(math.pi * t / 2)
    return alpha, beta

# CondOT: x = t*z + (1-t)*eps, target = z - eps
# Cosine: x = sin(πt/2)*z + cos(πt/2)*eps
#         target = (π/2)*cos(πt/2)*z - (π/2)*sin(πt/2)*eps

In the CondOT path, the training target (z − ε) has a beautiful geometric interpretation. What is it?

The constant velocity of the straight line from noise ε to data z The gradient of the log-likelihood The noise prediction

Chapter 7: The Training Algorithm

Let us write out the complete training procedure and work through it in detail.

1. Sample data z

Draw a random example from the dataset

↓

2. Sample time t ~ Unif[0,1]

Random time along the probability path

↓

3. Sample noise ε ~ N(0, I)

Random Gaussian noise

↓

4. Compute x = tz + (1−t)ε

The noisy sample at time t

↓

5. Loss = ||u_θ(x,t) − (z−ε)||²

MSE between prediction and target

↓

6. Update θ via SGD

Standard backpropagation

Training Algorithm Step-Through

Click "Training Step" to see one iteration of Algorithm 3. Watch the loss decrease as the model learns to predict (z − ε) from noisy inputs.

python
# Algorithm 3: Complete Flow Matching Training Loop
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, d=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d + 1, hidden),  # input: [x, t]
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, d),      # output: velocity
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

# Training loop
model = VelocityNet(d=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(10000):
    z = sample_data(batch_size=256)       # [256, 2]
    t = torch.rand(256, 1)                # [256, 1]
    eps = torch.randn_like(z)              # [256, 2]
    x = t * z + (1 - t) * eps             # noisy sample
    target = z - eps                       # target velocity
    loss = (model(x, t) - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Three remarkable properties of this algorithm:
1. Simulation-free: No ODE simulation during training. Each step is one neural network forward + backward pass.
2. Simple regression: The loss is just MSE between predicted and target velocities.
3. Scalable: Works with standard mini-batch SGD. This is how billion-parameter models like Stable Diffusion 3 are trained.

What does the neural network see during training? Let's trace through 5 training steps in detail:

#	z (data)	t	ε (noise)	x = tz+(1−t)ε	Target = z−ε
1	2.0	0.71	−0.3	1.33	2.3
2	−1.0	0.12	1.5	1.20	−2.5
3	5.0	0.85	0.2	4.28	4.8
4	2.0	0.33	−1.1	−0.08	3.1
5	−1.0	0.55	0.8	−0.19	−1.8

Notice: the network receives input (x, t) and must predict the target. The same x value can have different targets depending on what z and ε were used — but on average, the network learns the correct marginal velocity. This is the magic of Theorem 12: training against individual conditional targets implicitly trains against the (intractable) average.

What does the loss curve look like? A typical flow matching training run on CIFAR-10 (32×32 images):

Training steps	MSE Loss	FID Score	Quality
1K	~2.0	>200	Random noise
10K	~0.8	~100	Blurry blobs with color
50K	~0.4	~30	Recognizable objects
200K	~0.25	~10	Good quality
500K	~0.20	~5	State of the art

The FID (Frechet Inception Distance) measures how similar the generated distribution is to the real data distribution. Lower is better. The loss decreases monotonically, but FID can plateau or even increase slightly due to overfitting.

Debugging flow matching training. Common issues and diagnostics:

If loss does not decrease: Check that x = t*z + (1-t)*eps (not z*t + eps*(1-t) with wrong broadcasting). Verify that t has shape [B, 1, 1, 1] for images (broadcasts over spatial dimensions). Make sure the model receives both x and t as input.

Mini-batch size matters. Larger batches give lower-variance gradient estimates, leading to faster convergence. For 2D toy experiments, B = 256 is plenty. For image generation, B = 128-512 per GPU with gradient accumulation. Stable Diffusion 3 was trained with an effective batch size of 2048.

Learning rate schedule. Most flow matching models use a linear warmup (1K-10K steps) followed by cosine decay. The peak learning rate is typically 1e-4 to 3e-4 for Adam. Using AdamW with weight decay 0.01 helps regularization.

python
# Complete training loop with all practical details
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = DiT(in_channels=4, hidden=1024, depth=24)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=500000)

for step in range(500000):
    z = next(dataloader)          # [B, 4, 64, 64] latent images
    B = z.shape[0]

    # Logit-normal time sampling (better than uniform)
    u = torch.randn(B)
    t = torch.sigmoid(u).view(B, 1, 1, 1)  # [B, 1, 1, 1]

    eps = torch.randn_like(z)
    x = t * z + (1 - t) * eps
    target = z - eps

    loss = (model(x, t.squeeze()) - target).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()

Tensor shapes in practice. For a batch of B = 256 images of size 64 × 64 × 3:

Tensor	Shape	Description
z	[256, 3, 64, 64]	Batch of data samples
t	[256, 1, 1, 1]	Random times (broadcast over spatial dims)
ε	[256, 3, 64, 64]	Gaussian noise
x = tz + (1−t)ε	[256, 3, 64, 64]	Noisy images
target = z − ε	[256, 3, 64, 64]	Velocity targets
pred = u_θ(x, t)	[256, 3, 64, 64]	Network prediction
loss	scalar	MSE over all elements

During flow matching training, how many neural network evaluations are needed per training step?

Just 1 (forward pass on the noisy sample x at time t) 50+ (one per ODE simulation step) 2 (one for each of the encoder and decoder)

Chapter 8: Trained Model Showcase

After training, we sample by simulating the ODE with the learned vector field. This is the full pipeline:

X₀ ~ N(0, I), dX_t = u_t^θ(X_t) dt ⇒ X₁ ≈ p_data

Below, we simulate a pre-trained flow matching model on a 2D distribution. Particles start as Gaussian noise and evolve to the target distribution using Euler steps. You can see the straight-line trajectories characteristic of the CondOT path.

Flow Matching: From Noise to Data

Watch 100 particles flow from Gaussian noise (t=0) to the target distribution (t=1). The target is a checkerboard pattern. Particles follow approximately straight paths — the hallmark of CondOT flow matching.

Target

Euler steps 50

python
# Complete Flow Matching Pipeline: Train + Sample
# After training (Algorithm 3), sample via Algorithm 1:
def generate_samples(model, n_samples=1000, n_steps=50, d=2):
    x = torch.randn(n_samples, d)   # pure noise
    h = 1.0 / n_steps
    with torch.no_grad():
        for i in range(n_steps):
            t = torch.full((n_samples, 1), i * h)
            x = x + h * model(x, t)     # Euler step
    return x  # generated samples!

The full picture: Training = regression against (z − ε). Sampling = simulate ODE with Euler method. That is the entire flow matching pipeline. Stable Diffusion 3 and FLUX are scaled-up versions of exactly this algorithm, with a U-Net or DiT as the neural network and images as data.

What happens at each sampling step — in detail. Suppose we have trained a model on 2D "two moons" data. During sampling with 20 Euler steps:

Step	t	X_t	u_θ(X_t, t)	What the network "thinks"
0	0.00	(0.3, −0.5)	(0.8, 1.2)	Noise — uncertain, average direction
5	0.25	(0.5, 0.1)	(1.1, 0.9)	Starting to "choose" the upper moon
10	0.50	(0.8, 0.5)	(0.7, 0.3)	Committed to upper moon, refining
15	0.75	(1.0, 0.7)	(0.2, 0.1)	Almost at target, slowing down
20	1.00	(1.05, 0.72)	—	Final sample on the upper moon

Notice how the velocity starts large (the particle needs to move far) and decreases as the particle approaches the data manifold. This is a consequence of the CondOT path: the velocity is constant in the conditional case (z − ε), but in the marginal case it varies as the posterior concentrates.

Why CondOT paths produce straight trajectories. For the CondOT path, the conditional trajectory is x(t) = tz + (1−t)ε — a straight line from ε to z. The marginal trajectories are not exactly straight (because the posterior-weighted average of straight lines is generally curved), but they are approximately straight, especially when the modes are well-separated. Straight trajectories mean the velocity field changes slowly over time, which has two benefits: (1) the neural network finds it easier to learn a slowly-varying function, and (2) the Euler method is more accurate for slowly-varying ODEs, so fewer steps are needed.

This is why flow matching with CondOT typically needs only 20-50 Euler steps, while DDPM-style models (which use curved VP noise schedules) historically needed 1000 steps. The improvement is purely due to straighter trajectories — the underlying mathematics is the same.

Trajectory straightness metric. Lipman et al. (2022) defined a straightness metric for flow models:

S = E[||X₁ − X₀||² / ∫₀¹ ||u_t(X_t)||² dt]

For perfectly straight trajectories, S = 1 (the path length equals the displacement). For curved trajectories, S < 1 (the path is longer than needed). CondOT paths achieve S ≈ 0.99, while VP (DDPM-style) paths achieve S ≈ 0.7. The higher the straightness, the fewer Euler steps needed for accurate sampling.

What the trained model looks like inside. After training, the neural network has learned a complex, time-dependent vector field. At t = 0 (pure noise), the field is approximately constant — pushing all particles in roughly the same direction (toward the center of the data distribution). As t increases, the field becomes more complex and spatially varying, with different regions pointing toward different modes. By t = 0.9, the field has fine-grained structure that places each particle precisely on the data manifold.

This progression from simple to complex is another reason why flow matching works so well: the neural network can dedicate most of its capacity to the hard part (fine-grained details at high t) rather than the easy part (rough direction at low t).

The efficiency argument: Flow matching training costs O(1) NFE per training step. ODE sampling costs O(n) NFE per image. For a model trained for 500K steps on batches of 256: training = 500K × 1 = 500K forward passes (with backprop). Generating 10K images = 10K × 50 = 500K forward passes (without backprop). The training cost and generation cost are roughly balanced, which is very different from GANs (cheap generation, hard training) or autoregressive models (expensive generation).

Extensions and variations of flow matching. The basic algorithm we described can be extended in several important ways:

1. Stochastic interpolants (Albergo & Vanden-Eijnden, 2023). A generalization where the interpolation between noise and data can have additional stochastic terms. This includes flow matching and score matching as special cases.

2. Rectified flows (Liu et al., 2022). After training, the trajectories may not be perfectly straight. Rectified flows "straighten" the trajectories by retraining: use the trained model to generate (noise, data) pairs, then retrain on these pairs. This iterative process produces straighter paths, enabling accurate generation in fewer steps.

3. Discrete flow matching. Flow matching extended to discrete data (like text tokens). Instead of continuous interpolation, discrete flow matching uses Markov chains that gradually convert a uniform distribution into the data distribution. This is covered in Chapter 7 of the book.

4. Equivariant flow matching. For data with symmetries (e.g., molecules that should be invariant to rotation), the vector field can be designed to respect these symmetries. This dramatically improves sample efficiency for scientific applications.

5. Consistency models (Song et al., 2023). Train a model that directly maps any point on the ODE trajectory to the endpoint (clean data). This enables one-step generation at the cost of additional training complexity. Consistency models can be trained from scratch or distilled from a pre-trained flow matching model.

6. Flow matching for video. The extension to video is straightforward: z ∈ R^T×H×W×3 is a sequence of frames. The neural network is a 3D U-Net or video DiT that processes both spatial and temporal dimensions. The training loss and sampling procedure are identical — just with higher-dimensional z. Meta's Movie Gen and OpenAI's Sora both use this approach.

7. Conditional optimal transport. The "OT" in CondOT refers to optimal transport theory. The connection: the CondOT path creates straight-line trajectories from noise to data, which correspond to the optimal transport plan between the noise and data distributions (under the Wasserstein-2 metric). This is not just a mathematical curiosity — optimal transport paths are the most efficient possible, minimizing the total "work" of transporting noise to data. This efficiency is why CondOT paths require fewer simulation steps.

Why does straight = efficient? Consider moving a particle from point A to point B. The shortest path is a straight line (distance ||B - A||). Any curved path has length > ||B - A||, meaning the particle travels further than necessary. Longer paths require more ODE steps to traverse accurately. CondOT paths are straight, so they are the most efficient paths, requiring the fewest steps. This is the geometric insight behind flow matching's practical success.

The scaling behavior of flow matching. How does flow matching scale with model size and data?

Model Size	Dataset	FID (ImageNet 256)	Training Cost
DiT-S/2 (33M)	ImageNet-1K	~68	~400 A100 hours
DiT-B/2 (130M)	ImageNet-1K	~43	~1,600 A100 hours
DiT-L/2 (458M)	ImageNet-1K	~10	~5,000 A100 hours
DiT-XL/2 (675M)	ImageNet-1K	~2.3	~13,000 A100 hours
SD3-Medium (~2B)	~1B images	<2	~500,000 A100 hours

The scaling is remarkably clean: bigger models + more data = better quality, with no signs of saturation. This is analogous to the scaling laws observed in large language models, and suggests that simply scaling up flow matching models will continue to improve quality.

DiT (Diffusion Transformer) architecture. The dominant neural network architecture for flow matching is the DiT, which replaces the traditional U-Net with a Vision Transformer. Key design choices:

1. Patchification: The input latent (e.g., 64×64×4) is split into 2×2 patches, producing a sequence of N = 1024 tokens, each of dimension D.

2. Time conditioning: The time t is embedded as a sinusoidal encoding and added to the token embeddings via adaptive layer norm (adaLN).

3. Text conditioning: Text prompt embeddings from CLIP/T5 are cross-attended in each transformer block.

4. Output: The transformer outputs velocity predictions for each patch, which are unpatchified to the original spatial resolution.

python
# Simplified DiT architecture
class DiTBlock(nn.Module):
    def forward(self, x, t_emb, text_emb):
        # x: [B, N, D] (sequence of patch tokens)
        # t_emb: [B, D] (time embedding)
        # text_emb: [B, L, D] (text token embeddings)

        # AdaLN: modulate layer norm with time
        scale, shift = self.adaLN(t_emb).chunk(2, dim=-1)
        x = self.norm(x) * (1 + scale) + shift

        # Self-attention + cross-attention with text
        x = x + self.self_attn(x)
        x = x + self.cross_attn(x, text_emb)
        x = x + self.ffn(x)
        return x

The DiT architecture scales much better than U-Nets because Transformers have well-understood scaling laws and benefit from hardware optimizations (FlashAttention, etc.). Stable Diffusion 3 and FLUX both use DiT-style architectures.

Multi-resolution training. Modern models are trained at multiple resolutions to improve efficiency and quality. The approach: start training at low resolution (e.g., 256×256), then fine-tune at higher resolution (512×512, 1024×1024). This works because the vector field learned at low resolution transfers well to higher resolution — the overall structure of the flow is similar, and only fine details change.

python
# Multi-resolution training schedule (simplified)
# Phase 1: Low resolution (fast, learns global structure)
for step in range(200000):
    z = load_batch(resolution=256)      # latent: [B, 4, 32, 32]
    loss = cfm_loss(model, z)
    optimizer.step()

# Phase 2: High resolution (slower, learns fine details)
for step in range(100000):
    z = load_batch(resolution=1024)     # latent: [B, 4, 128, 128]
    loss = cfm_loss(model, z)           # same loss function!
    optimizer.step()

Noise-conditioned augmentation. During training, many models apply random augmentations (crop, flip, color jitter) to the data. These augmentations increase the effective dataset size and improve generalization. However, care must be taken: augmentations that change the semantics (e.g., flipping text) should not be used. Common augmentations for image models: random horizontal flip, random crop, and multi-aspect-ratio training.

The flow matching loss landscape. Unlike GAN training, which has notoriously difficult optimization landscapes (saddle points, cycling), the flow matching loss is a simple convex-like objective for a fixed architecture. The loss decreases monotonically during training, gradients are well-behaved, and training rarely diverges. This stability is one of the main practical advantages of flow matching over GANs.

Distributed training. Large flow matching models are trained on clusters of hundreds to thousands of GPUs using data parallelism. Each GPU processes a different mini-batch, and gradients are averaged across GPUs. The training is embarrassingly parallel — there are no sequential dependencies between mini-batches. This is unlike autoregressive models, which require sequential token generation even during training.

The simplicity of flow matching is its superpower. The entire training algorithm fits on a single slide: sample z, sample t, sample ε, compute x = tz + (1−t)ε, compute loss = ||u_θ(x,t) − (z−ε)||², update weights. Everything else — architectures, noise schedules, guidance, distillation — is optimization on top of this foundation. When debugging flow matching models, always return to this simple recipe and verify each component.

Open research questions as of 2026:

1. Can we achieve one-step generation at the quality of 50-step models? Consistency distillation is getting close.

2. How to best handle conditional generation for complex, multi-modal conditions (text + image + layout)?

3. Can flow matching scale to extremely long videos (minutes, not seconds)?

4. What are the fundamental limits of flow matching? Is there a quality ceiling, or will it keep improving with scale?

5. How to best combine flow matching with discrete diffusion for joint text-image models?

6. Can we develop better noise schedules that are provably optimal for a given dataset?

7. How to handle variable-length data (different image resolutions, different video lengths) efficiently?

Recommended reading. The original flow matching papers are accessible and well-written. In order of importance:

Paper	Authors	Key contribution
Flow Matching for Generative Modeling	Lipman et al., 2022	The original flow matching paper
Rectified Flow	Liu et al., 2022	Straightening trajectories by retraining
Building Normalizing Flows with Stochastic Interpolants	Albergo & Vanden-Eijnden, 2023	Generalized framework (stochastic interpolants)
Scalable Diffusion Models with Transformers (DiT)	Peebles & Xie, 2023	Transformer architecture for diffusion
Scaling Rectified Flow Transformers	Esser et al., 2024	SD3 architecture and training details

Implementing flow matching from scratch. The complete implementation of a flow matching model on a 2D toy dataset requires only about 50 lines of PyTorch code (as shown in Chapter 8). For images, the main additions are: (1) a DiT or U-Net architecture instead of an MLP, (2) a VAE for latent encoding, and (3) text conditioning. The core training loop — sample z, sample t, sample ε, compute loss, update weights — remains identical regardless of the data type or model size.

Key equations to remember from this chapter:

Gaussian path: p_t(·|z) = N(α_tz, β_t²I)

Sample from path: x = α_tz + β_tε, ε ~ N(0, I)

CondOT: α_t = t, β_t = 1 − t

Conditional velocity: u_t^target(x|z) = α̇_tz + β̇_tε = z − ε (CondOT)

Marginal velocity: u_t^target(x) = ∫ u(x|z) · posterior(z|x) dz

CFM loss: L = E_t,z,ε[||u_θ(α_tz + β_tε, t) − (α̇_tz + β̇_tε)||²]

Theorem 12: ∇_θL_FM = ∇_θL_CFM

These seven equations are the complete mathematical foundation of flow matching. Everything else — noise schedules, architectures, guidance, distillation — is built on top of this foundation. If you understand why each equation has the form it does, you understand flow matching.

The flow matching research ecosystem. Flow matching has spawned a vibrant research community. Key research directions include:

1. Optimal transport connections: Making the link between CondOT paths and Wasserstein distances rigorous and exploiting it for better schedules.

2. Faster samplers: Reducing the number of NFE needed from 50 to 1-4 via distillation, consistency models, and analytic solvers.

3. Better architectures: Moving from U-Nets to DiTs to MMDiTs (SD3's multi-modal DiT) to even more efficient designs.

4. Controllable generation: Going beyond text conditioning to spatial control (ControlNet), style transfer, and compositional generation.

5. Scientific applications: Protein design, drug discovery, weather prediction, materials science — anywhere that requires sampling from complex distributions over structured data.

6. Video and 3D: Scaling flow matching to higher-dimensional data types with temporal and spatial consistency constraints.

The field moves fast. By the time you read this, some of the "open questions" listed above may have been answered. But the mathematical foundations — probability paths, conditional vector fields, the marginalization trick, simulation-free training — are timeless. They will remain the bedrock of generative modeling regardless of which specific architectures or applications dominate in the future.

python
# Rectified flow: straightening trajectories
# Step 1: Train initial model (standard flow matching)
model_v1 = train_flow_matching(dataset)

# Step 2: Generate (noise, data) pairs using trained model
x0_samples = torch.randn(10000, d)         # noise
x1_samples = generate(model_v1, x0_samples)  # generated data

# Step 3: Retrain on the paired data (noise, generated_data)
# The trajectories are now straighter because we use
# matched pairs instead of random (z, eps) samples
model_v2 = train_flow_matching_paired(x0_samples, x1_samples)

# model_v2 needs fewer steps for same quality

Number of function evaluations (NFE). Each Euler step requires one neural network forward pass. For 20 steps on a DiT-XL/2 model (675M parameters, generating 512×512 images), each forward pass takes about 40ms on an A100 GPU. Total generation time: 20 × 40ms = 0.8 seconds per image. Heun's method with 10 steps: 20 NFE, same cost, often better quality.

python
# End-to-end flow matching: train and sample on Two Moons
import torch
import torch.nn as nn

# 1. Dataset
def sample_moons(n):
    t = torch.rand(n) * torch.pi
    x = torch.where(torch.rand(n) < 0.5,
        torch.stack([torch.cos(t), torch.sin(t)], 1),
        torch.stack([1-torch.cos(t), 1-torch.sin(t)-.5], 1))
    return x + 0.05 * torch.randn_like(x)

# 2. Model
model = nn.Sequential(
    nn.Linear(3, 128), nn.SiLU(),
    nn.Linear(128, 128), nn.SiLU(),
    nn.Linear(128, 2))

# 3. Train (Algorithm 3)
opt = torch.optim.Adam(model.parameters(), 1e-3)
for step in range(5000):
    z = sample_moons(256)
    t = torch.rand(256, 1)
    eps = torch.randn_like(z)
    x = t * z + (1-t) * eps
    loss = (model(torch.cat([x, t], 1)) - (z - eps)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# 4. Sample (Algorithm 1)
x = torch.randn(500, 2)
for i in range(50):
    t = torch.full((500, 1), i/50)
    x = x + (1/50) * model(torch.cat([x, t], 1))
# x now contains 500 samples from Two Moons!

After training a flow matching model, how do we generate a new sample?

Initialize X₀ ~ N(0,I) and simulate dX/dt = u_t^θ(X_t) from t=0 to t=1 using Euler method Feed a random z into the network and take the output Run the training loss backwards

Component	Formula	Status
Conditional path	p_t(·\|z) = N(α_tz, β_t²I)	Chosen by us
Marginal path	p_t(x) = ∫ p_t(x\|z)p_data(z)dz	Exists but intractable
Conditional vector field	u_t^target(x\|z) = α̇_tz + β̇_tε	Known analytically
Marginal vector field	u_t^target(x) = ∫ u(x\|z) · posterior(z\|x) dz	Exists but intractable
CFM loss	E[\|\|u_θ(x,t) − u_t^target(x\|z)\|\|²]	Tractable & equivalent to L_FM
Sampling	Euler ODE simulation	n neural net evaluations

Pitfall	Solution
t = 0 or t = 1 causes division by zero	Clamp t to [ε, 1−ε] with ε = 10⁻⁵
Loss spikes for extreme t values	Use logit-normal time sampling instead of Unif[0,1]
Generated samples are blurry	Increase Euler steps, or use Heun's method
Model memorizes training data	Dataset too small; increase diversity
Straight trajectories not achieved	Verify CondOT path implementation (alpha=t, beta=1-t)

Model	Domain	Organization
Stable Diffusion 3	Text-to-image	Stability AI
FLUX	Text-to-image	Black Forest Labs
Movie Gen Video	Text-to-video	Meta
AlphaFold3	Protein structure	Google DeepMind

Flow Matching

Chapter 0: Why Train?

Chapter 1: Probability Paths