The elegant successor to diffusion models. Learn how straight-line transport between noise and data enables faster, simpler generation — powering SD3 and Flux.
Generative modeling is fundamentally about transport: moving samples from a distribution we can easily sample (noise) to a distribution we want (data). Diffusion models do this via a winding, stochastic path with hundreds of steps. What if we could find a straight line instead?
Flow matching frames generation as learning a velocity field that transports particles from noise to data along smooth paths. At t=0, particles are random noise. At t=1, they've arrived at data samples. The velocity field v(x, t) tells each particle which direction to move at each moment.
The path from noise to data is a plain linear interpolation. Given a noise sample x0 ~ N(0, I) and a data sample x1:

$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1]$$
At t=0, xt = x0 (pure noise). At t=1, xt = x1 (pure data). At t=0.5, it's the average of noise and data. The velocity along this path is constant: v = x1 - x0. No noise schedule, no ᾱt, no reparameterization trick. Just a straight line.
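The endpoint and constant-velocity claims above can be checked numerically. This is a minimal NumPy sketch rather than anything from the original material; the function names `interpolate` and `conditional_velocity` and the stand-in data vector are illustrative assumptions.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def conditional_velocity(x0, x1):
    """Velocity along the straight path; it does not depend on t."""
    return x1 - x0

rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)          # noise sample
x1 = np.array([2.0, -1.0, 0.5])      # stand-in "data" sample (assumed for illustration)

# A finite-difference derivative of the path matches x1 - x0 at any t.
t, h = 0.3, 1e-6
fd = (interpolate(x0, x1, t + h) - interpolate(x0, x1, t)) / h
```

Repeating the finite-difference check at other values of `t` gives the same vector, which is what "constant velocity" means here.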
Watch particles travel from random noise (t=0) toward two data clusters (t=1). Each particle follows the learned velocity field.
In Chapter 0, we showed particles flowing from noise to data along paths. That's a powerful visual intuition — but to actually build this, we need to answer a precise question: how do we tell a particle where to go at each moment?
The answer is a velocity field. At every point in space and every moment in time, the velocity field says "move in this direction at this speed." If you've ever seen a weather map with wind arrows, that's a velocity field — each arrow tells air molecules where to flow. Our velocity field does the same thing for data particles.
The math tool for "follow a velocity field" is a differential equation — specifically, an ordinary differential equation (ODE). An ODE just says: "the rate of change of x equals some function of x and time." If you've done basic calculus, you've seen this as dx/dt = f(x, t). That's all an ODE is — a rule for how something changes over time.
This gives us the mathematical backbone of flow matching, called a Continuous Normalizing Flow (CNF):

$$\frac{dx}{dt} = v_\theta(x, t), \qquad x(0) \sim \mathcal{N}(0, I)$$
Read this as: "starting from a noise sample x(0), follow the velocity field v from time t=0 to t=1, and you arrive at a data sample x(1)." The velocity field vθ is parameterized by a neural network — the θ are the learnable weights. Training teaches the network what velocity to assign at each (x, t) so that the flow transforms noise into realistic data.
The network takes in xt (the particle's current position, shape [B, C, H, W] — a batch of images) and the scalar time t, and outputs the velocity v (same shape [B, C, H, W]). Same input/output shape as a diffusion denoiser — but instead of predicting noise, it predicts direction of motion.
An ODE dx/dt = v(x, t) is a Markov system: the velocity depends only on where you are (x) and when (t), not how you got there. This means two particles at the same (x, t) must have the same velocity — their paths cannot cross at the same time. This is the uniqueness theorem for ODEs (Picard-Lindelöf): given a Lipschitz velocity field, each initial condition produces exactly one trajectory.
If v depended on history, you'd need a more complex representation (path-dependent or stochastic). The Markov property is what makes CNFs tractable — the network only needs to map (x, t) → v, not reason about trajectories.
This is also why paths crossing is problematic: at a crossing point, two particles at the same (x, t) need to go to different places. The network must learn a compromised velocity that sends them in an average direction — reducing sample quality. Reflow (Chapter 6) reduces crossings precisely to avoid this.
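The "compromised velocity" at a crossing can be made concrete with two hand-picked pairs. This is a toy sketch under stated assumptions: the four points are hypothetical, and the averaged velocity is held fixed for the rest of the path, which a real learned field would not do.

```python
import numpy as np

# Two hypothetical noise -> data pairs whose straight paths cross at t = 0.5.
z_a, x1_a = np.array([0.0, 0.0]), np.array([1.0, 1.0])
z_b, x1_b = np.array([0.0, 1.0]), np.array([1.0, 0.0])

t = 0.5
xt_a = (1 - t) * z_a + t * x1_a
xt_b = (1 - t) * z_b + t * x1_b      # same point as xt_a

v_a = x1_a - z_a                     # conditional velocity of pair A
v_b = x1_b - z_b                     # conditional velocity of pair B

# An MSE-trained network can output only one vector at (x_t, t);
# the squared-error minimizer over both targets is their mean.
v_avg = 0.5 * (v_a + v_b)

# Following the averaged velocity for the remaining time lands between the
# two targets (simplification: v_avg is held fixed for the rest of the path).
x_end = xt_a + (1 - t) * v_avg
```

Here `x_end` is the midpoint of the two data points, so it is not a sample of either one. That is the quality loss the paragraph above describes.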
We now have a framework: a velocity field v(x, t) that guides particles from noise to data. But we haven't said anything about which paths those particles should take. A velocity field could send particles along wild, looping curves — or along straight lines. Both get from noise to data, but one is far more efficient.
Optimal transport (OT) is the math of "what's the cheapest way to move mass from one pile to another?" Think of it like a shipping problem: you have warehouses (noise) and stores (data), and you want to minimize total distance traveled. The answer, for our case, turns out to be beautifully simple: straight paths with constant velocity.
Why does straightness matter? Because straighter paths can be integrated with fewer numerical steps. A perfectly straight path needs just one Euler step. Curved paths need many steps to follow accurately.
Compare diffusion-style curved paths (left) with OT straight paths (right). Straight paths need fewer integration steps.
Consider Euler integration: x_{t+dt} = x_t + v(x_t, t) · dt. If the true velocity is constant (straight path), then a single Euler step with dt=1 gives the exact answer. No numerical error at all. In practice, the velocity field isn't perfectly constant (different noise-data pairs have different velocities), so we need a few steps. But "a few" means 10-50, not 1000.
Contrast with diffusion: the variance-preserving path curves through high-dimensional space. The velocity changes direction at every point, so you need many small steps to track the curve accurately. This is the fundamental geometric reason flow matching is faster.
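The geometric argument can be checked with a small experiment. The sketch below assumes two hand-written velocity fields: a constant one (straight path) and a quarter-circle rotation standing in for a curved path. Neither is a trained model, and the rotation is not a diffusion path; it is only an example of a velocity that changes direction along the way.

```python
import numpy as np

def euler(v, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x, dt = np.asarray(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        x = x + v(x, i * dt) * dt
    return x

x0 = np.array([1.0, 0.0])
target = np.array([0.0, 1.0])

# Straight path: constant velocity, so one step is exact.
v_straight = lambda x, t: target - x0

# Curved path (illustrative only): quarter-circle rotation from x0 to target.
omega = np.pi / 2
v_curved = lambda x, t: omega * np.array([-x[1], x[0]])

err_straight_1 = np.linalg.norm(euler(v_straight, x0, 1) - target)
err_curved_1 = np.linalg.norm(euler(v_curved, x0, 1) - target)
err_curved_100 = np.linalg.norm(euler(v_curved, x0, 100) - target)
```

One Euler step is exact on the straight path up to floating-point rounding. On the curved path a single step misses badly, and even 100 steps leave a visible residual.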
Given a fixed starting point z and endpoint x1, consider all smooth paths x(t) with x(0) = z, x(1) = x1. The kinetic energy (transport cost) of a path is:

$$\mathrm{KE}[x] = \int_0^1 \left\|\frac{dx}{dt}\right\|^2 dt$$
Your task: Prove that the straight-line path x(t) = (1-t)z + t x1 minimizes this kinetic energy among all paths connecting z to x1.
Full derivation:
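1. For any path x(t) from z to x1, the length is L = ∫₀¹ ||dx/dt|| dt. The straight-line distance is ||x1 - z||, and by the triangle inequality, L ≥ ||x1 - z|| with equality iff the path is a straight line.
2. By Cauchy-Schwarz on the functions f(t) = 1 and g(t) = ||dx/dt||:

$$L^2 = \left(\int_0^1 1 \cdot \left\|\frac{dx}{dt}\right\| dt\right)^2 \le \left(\int_0^1 1^2\,dt\right)\left(\int_0^1 \left\|\frac{dx}{dt}\right\|^2 dt\right) = \mathrm{KE},$$

with equality iff the speed ||dx/dt|| is constant in t.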
3. Therefore KE ≥ L² ≥ ||x1 - z||². The straight-line path achieves KE = ||x1 - z||² exactly (constant velocity), so it is the unique minimizer.
The key insight: Minimizing kinetic energy simultaneously forces the path to be (a) straight (shortest length) and (b) constant-speed (no acceleration). This is Brenier's theorem in its simplest form: the optimal transport map in Euclidean space is the straight-line displacement.
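A numerical sanity check of the two conditions (straight and constant-speed) is sketched below. The endpoints and the two competitor paths are assumptions chosen so that the exact energies are known: 25 for the straight path, 100/3 for the same segment traversed at uneven speed, and 25 + π²/2 for a path with a sideways bump.

```python
import numpy as np

def kinetic_energy(path, n=2001):
    """Approximate the integral over [0, 1] of ||dx/dt||^2 dt by finite differences."""
    t = np.linspace(0.0, 1.0, n)
    x = np.stack([path(s) for s in t])          # [n, d] points along the path
    dx = np.diff(x, axis=0)                      # displacement per segment
    dt = np.diff(t)[:, None]
    return float(np.sum(np.sum((dx / dt) ** 2, axis=1) * dt[:, 0]))

z = np.array([0.0, 0.0])
x1 = np.array([3.0, 4.0])                        # ||x1 - z||^2 = 25

straight = lambda s: (1 - s) * z + s * x1
# Same straight segment, but traversed at non-constant speed (s -> s^2).
uneven = lambda s: (1 - s**2) * z + s**2 * x1
# Same endpoints, with a sideways bump perpendicular to the segment.
bumpy = lambda s: straight(s) + np.sin(np.pi * s) * np.array([-4.0, 3.0]) / 5.0

ke_straight = kinetic_energy(straight)
ke_uneven = kinetic_energy(uneven)
ke_bumpy = kinetic_energy(bumpy)
```

Both competitors cost more than the straight constant-speed path. The first violates the constant-speed condition from step 2, the second the shortest-length condition from step 1.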
The OT problem is the free-particle Lagrangian. The velocity field v(x, t) plays the role of a physical velocity. The ODE dx/dt = v is Newton's first law. Flow matching is literally physics: particles travel in straight lines unless a force (data distribution) curves them.
Where else does "minimize squared velocity integrated over time" appear? (Hint: think Kalman smoothing, spline interpolation, geodesics on manifolds.)
We know the ideal path for each particle: a straight line from its specific noise point z to its specific data point x1. But here's the problem: during generation, the network sees an intermediate point xt and has to decide where to send it — without knowing which x1 it's headed toward. The "true" velocity at xt would require averaging over every possible data point it might be going to. That's impossibly expensive to compute (mathematicians call this intractable — meaning "can't be computed in practice").
Conditional Flow Matching (CFM) is the elegant trick that makes training possible anyway. Instead of computing that impossible average, we train on individual pairs. For each training pair (noise z, data x1), we know exactly where the particle should go: from z to x1 in a straight line. The conditional velocity for this specific pair is simply ut = x1 - z (the direction from noise to data, constant along the path).
The beautiful mathematical insight (proven by Lipman et al., 2023): if you train the network to match these per-pair velocities across many random pairs, the network automatically learns the correct average velocity at every point. You never have to compute the intractable average explicitly — gradient descent finds it for you.
Here's the concrete difference this makes. Without CFM, the target velocity at a point xt would require averaging over all possible noise-data pairs that pass through xt — an intractable integral. With CFM, for each training sample we just compute x1 - z (a single subtraction). The magic is that gradient descent over many such samples converges to the correct marginal velocity field.
The marginal flow matching loss is L_FM = E_{t, x~p_t}[||vθ(x, t) - u_t(x)||²], where u_t(x) is the true marginal velocity at (x, t). This is intractable because u_t(x) = E[x1 - z | x_t = x] requires knowing p(x1 | x_t).
The conditional FM loss is L_CFM = E_{t, z, x1}[||vθ(x_t, t) - (x1 - z)||²] where x_t = (1-t)z + t x1.
Your task: Show that ∇θ LCFM = ∇θ LFM. (They have the same gradient, so optimizing one optimizes the other.)
Full derivation:
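1. Write both losses with the squared norm expanded. Writing a for the regression target (a = ut(xt) in LFM, a = x1 - z in LCFM), each loss has the form:

$$\mathcal{L} = \mathbb{E}\big[\|v_\theta(x_t, t)\|^2\big] - 2\,\mathbb{E}\big[\langle v_\theta(x_t, t),\, a\rangle\big] + \mathbb{E}\big[\|a\|^2\big]$$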
The last term doesn't depend on θ, so ∇θL is determined by the first two terms.
2. The first term E[||vθ(xt, t)||²] is the same in both losses because both sample xt from pt (the CFM interpolation samples from the same marginal distribution pt by construction).
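3. For the cross-term in LCFM, condition on (xt, t) and apply the tower property of expectation:

$$\mathbb{E}_{t, z, x_1}\big[\langle v_\theta(x_t, t),\, x_1 - z\rangle\big] = \mathbb{E}_{t,\, x \sim p_t}\big[\langle v_\theta(x, t),\, \mathbb{E}[x_1 - z \mid x_t = x, t]\rangle\big]$$

4. The conditional expectation E[x1 - z | xt = x, t] is precisely the definition of the marginal velocity field ut(x). So:

$$\mathbb{E}_{t, z, x_1}\big[\langle v_\theta(x_t, t),\, x_1 - z\rangle\big] = \mathbb{E}_{t,\, x \sim p_t}\big[\langle v_\theta(x, t),\, u_t(x)\rangle\big]$$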
This is exactly the cross-term in LFM. Since both relevant terms match, ∇θLCFM = ∇θLFM. QED.
The key insight: The per-sample velocity (x1 - z) is a noisy but unbiased estimate of the marginal velocity ut(x). SGD with unbiased gradients converges to the same optimum as the full gradient. This is why we can train on simple per-sample targets and implicitly learn the complex marginal flow.
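The "noisy but unbiased" claim can be illustrated by Monte Carlo on a toy problem. The sketch assumes 1-D data taking the values -1 and +1 with equal probability. That choice is an assumption made so that the marginal velocity has a closed form: E[x1 | xt = x] = tanh(t x / (1-t)²).

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, x_query, half_width = 400_000, 0.5, 0.3, 0.02

x1 = rng.choice(np.array([-1.0, 1.0]), size=n)   # toy 1-D "data": two points
z = rng.standard_normal(n)                        # noise
xt = (1 - t) * z + t * x1                         # interpolated positions
target = x1 - z                                   # per-sample CFM targets

# Average the per-sample targets over samples that land near x_query.
near = np.abs(xt - x_query) < half_width
u_monte_carlo = target[near].mean()

# Closed form for this toy data: E[x1 | xt = x] = tanh(t * x / (1 - t)^2).
posterior_mean = np.tanh(t * x_query / (1 - t) ** 2)
u_closed_form = (posterior_mean - x_query) / (1 - t)
```

The individual targets near `x_query` cluster around two very different values, one per data point, yet their average agrees with the closed-form marginal velocity up to sampling noise. An MSE-trained network converges toward that average.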
Flow matching and diffusion training are both instances of denoising score matching: train a network to predict a target direction from a noisy intermediate. The difference is the path geometry: diffusion uses a variance-preserving arc on a hypersphere, flow matching uses a straight line through ambient space. Lipman et al. (2023) showed that DDPM is literally a special case of flow matching with a specific (non-optimal) path choice.
If you change the FM path to xt = cos(πt/2) z + sin(πt/2) x1, what familiar framework do you recover? (Answer: variance-preserving diffusion with a cosine schedule.)
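One way to see why the trigonometric path is called variance-preserving is to compare per-coordinate variances along the two paths. This sketch assumes z and x1 are independent with unit variance. The data assumption is an idealization for illustration, not a property of real datasets.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 11)

# Per-coordinate variance of x_t when z and x1 are independent with unit variance.
var_linear = (1 - t) ** 2 + t ** 2                                    # straight-line path
var_trig = np.cos(np.pi * t / 2) ** 2 + np.sin(np.pi * t / 2) ** 2   # trigonometric path
```

The trigonometric coefficients satisfy cos² + sin² = 1, so the variance stays at 1 for all t. The linear path's variance dips to 0.5 at t = 0.5.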
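Training a flow matching model is arguably even simpler than training a diffusion model. Each training step samples a data batch, a noise tensor, and a time t, interpolates linearly between noise and data, and regresses the network's output onto the constant velocity x_1 - x_0.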
Here it is in full. Compare this to the diffusion training loop — notice what's missing:
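```python
import torch
import torch.nn.functional as F

# network, dataloader and optimizer are assumed to be defined elsewhere
for x_1 in dataloader:                               # 1. Sample data, shape [B, C, H, W]
    B = x_1.shape[0]
    x_0 = torch.randn_like(x_1)                      # 2. Sample noise ~ N(0, I)
    t = torch.rand(B, 1, 1, 1, device=x_1.device)    # 3. Sample t ~ Uniform[0, 1]
    x_t = (1 - t) * x_0 + t * x_1                    # 4. Linear interpolation
    target = x_1 - x_0                               # 5. Target velocity (constant!)
    v_hat = network(x_t, t.view(B))                  # 6. Predict velocity (t as shape [B])
    loss = F.mse_loss(v_hat, target)                 # 7. MSE loss
    optimizer.zero_grad()                            # clear gradients from the previous step
    loss.backward()
    optimizer.step()
```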
The interpolation is (1-t)*x_0 + t*x_1 and the target is x_1 - x_0. That's the entire mathematical content of the training loop. This simplicity is why flow matching is becoming the default for new systems.

Arrows show the learned velocity field at time t. At t=0, velocities point toward data. At t=1, they've converged.
Your task: Implement a cfm_loss function that computes the loss for a single training step. This is the complete mathematical content of flow matching training; everything else is standard deep learning boilerplate.

Hint: Sample time as t = torch.rand(B, 1, 1, 1) so that it broadcasts against [B, C, H, W], or reshape after sampling. The model expects t as shape [B], so pass t.view(B). (t.squeeze() also works, except that it collapses to a scalar when B = 1.)

```python
import torch
import torch.nn.functional as F

def cfm_loss(model, x_1, noise_fn=torch.randn_like):
    B = x_1.shape[0]
    # 1. Sample noise (same shape as data)
    z = noise_fn(x_1)                                # [B, C, H, W]
    # 2. Sample time uniformly in [0, 1]
    t = torch.rand(B, 1, 1, 1, device=x_1.device)    # [B,1,1,1] for broadcasting
    # 3. Linear interpolation: x_t = (1-t)*z + t*x_1
    x_t = (1 - t) * z + t * x_1                      # [B, C, H, W]
    # 4. Target velocity = endpoint - startpoint
    target = x_1 - z                                 # [B, C, H, W]
    # 5. Network predicts velocity at (x_t, t); t is passed with shape [B]
    v_pred = model(x_t, t.view(B))                   # [B, C, H, W]
    # 6. MSE between predicted and target velocity
    return F.mse_loss(v_pred, target)
```
To generate a sample, start from noise z ~ N(0, I) and integrate the learned ODE forward from t=0 to t=1. The simplest method is Euler integration: take N evenly spaced steps, at each step nudging x by the predicted velocity times the step size.
Because the paths are approximately straight, Euler's method works well even with few steps. Higher-order solvers (Midpoint, RK4) give better accuracy for the same step count, at the cost of extra network evaluations per step, but even plain Euler with 20-50 steps produces excellent results.
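```python
import torch

# Generate an image from pure noise (network is assumed to be a trained velocity model)
x = torch.randn(1, 4, 64, 64)                    # start from noise (latent space)
N = 20                                           # number of Euler steps
dt = 1.0 / N
with torch.no_grad():                            # sampling needs no gradients
    for i in range(N):
        t = torch.full((x.shape[0],), i / N)     # current time 0, 0.05, 0.10, ... as shape [B]
        v = network(x, t)                        # predict velocity at current position
        x = x + v * dt                           # take one Euler step
# x is now (approximately) a sample from the data distribution
```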
Compare this to DDPM's 1000 steps with stochastic noise injection at each step. Flow matching sampling is an ODE (deterministic), not an SDE (stochastic). Same noise in → same image out. Every time.
| Method | Typical Steps | Stochastic? | Same seed → same output? |
|---|---|---|---|
| DDPM | 1000 | Yes (SDE) | No |
| DDIM | 50 | No (ODE) | Yes |
| Flow Matching | 20-50 | No (ODE) | Yes |
Watch particles integrate from noise to data. Adjust the step count: more steps = more accurate paths.
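The remark that higher-order solvers are more accurate at the same step count can be checked on a field with a known answer. The sketch below assumes 1-D Gaussian data x1 ~ N(mu, sigma²), for which the marginal velocity has a closed form and the exact flow maps z to mu + sigma·z. This closed-form field stands in for a trained network, and the solver names are illustrative.

```python
import numpy as np

def euler_solve(v, x, n_steps):
    """First-order Euler: one velocity evaluation per step."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

def midpoint_solve(v, x, n_steps):
    """Second-order midpoint method: two velocity evaluations per step."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x_half = x + 0.5 * dt * v(x, t)           # half step with the current velocity
        x = x + dt * v(x_half, t + 0.5 * dt)      # full step with the midpoint velocity
    return x

mu, sigma = 2.0, 0.5    # assumed toy data distribution: x1 ~ N(mu, sigma^2)

def velocity(x, t):
    """Closed-form marginal velocity E[x1 - z | x_t = x] for the Gaussian toy problem."""
    var_t = (1 - t) ** 2 + (t * sigma) ** 2
    return mu + (t * sigma ** 2 - (1 - t)) / var_t * (x - t * mu)

z = 1.0
exact = mu + sigma * z                            # exact endpoint of the flow from z
x_euler = euler_solve(velocity, z, 10)
x_midpoint = midpoint_solve(velocity, z, 10)
```

With 10 steps each, the midpoint result lands much closer to `exact` than the Euler result, but it uses 20 velocity evaluations instead of 10. For a fair comparison, count network evaluations rather than steps.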
Even though flow matching paths are straighter than diffusion, they're not perfectly straight. Paths from different noise-data pairs can cross each other, forcing the network to learn a curved velocity field to avoid collisions. Reflow straightens paths further.
The procedure is elegant: (1) Take a trained flow model. (2) Sample noise z ~ N(0, I), run the model forward to get the corresponding data point x1 = ODE(z). Now you have a (z, x1) pair that the model actually maps. (3) Retrain a new model on straight lines between these pairs. Because the pairs are already coupled (the model mapped z to x1), the straight-line approximation is much closer to the true path.
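```python
import torch
import torch.nn.functional as F

# Reflow: straighten paths by retraining on coupled pairs
# Step 1: Generate coupled (noise, data) pairs from the trained model
z = torch.randn(10000, 4, 64, 64)
with torch.no_grad():
    x_1 = ode_solve(old_model, z, t=0, t_end=1)   # ode_solve: ODE integrator assumed defined elsewhere

# Step 2: Retrain a new model on straight lines between these pairs
for z_b, x_1_b in zip(z.split(64), x_1.split(64)):   # minibatches of coupled pairs
    B = z_b.shape[0]
    t = torch.rand(B, 1, 1, 1)
    x_t = (1 - t) * z_b + t * x_1_b               # straight line between COUPLED pair
    loss = F.mse_loss(new_model(x_t, t.view(B)), x_1_b - z_b)  # velocity = x_1 - z
    optimizer.zero_grad()                         # optimizer over new_model's parameters
    loss.backward()
    optimizer.step()
```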
Each iteration makes paths straighter. After 2-3 reflow iterations, paths are nearly straight enough for 1-step generation.
Distillation takes a different approach: train a student model to mimic the teacher in fewer steps. Progressive distillation halves the step count repeatedly (64 → 32 → 16 → 8 → 4 → 2 → 1). Combined with reflow, this yields high-quality 1-4 step models.
Compare paths before and after reflow. Straighter paths = fewer steps needed.
Flow matching has gone from theory to production. Both Stable Diffusion 3 (Stability AI) and Flux (Black Forest Labs) use rectified flow matching as their core framework, combined with a new architecture: the MMDiT (Multimodal Diffusion Transformer).
MMDiT replaces the U-Net with a Transformer. Both the noisy latent patches and the text tokens are processed as separate streams that interact through joint attention layers. This bidirectional interaction gives the text genuine influence over image generation.
| Feature | SD 1.5 | SDXL | SD3 / Flux |
|---|---|---|---|
| Architecture | U-Net | Larger U-Net | MMDiT |
| Framework | DDPM | DDPM | Flow matching |
| Text encoder | CLIP | CLIP + OpenCLIP | CLIP + T5-XXL |
| Resolution | 512px | 1024px | 1024px+ |
| Steps | 20-50 | 20-40 | 20-30 |
Flux.1-schnell generates a 1024×1024 image in 4 steps. Let that sink in. DDPM needed 1000 steps for 256×256. That's a 250× reduction in steps at 16× the pixel count. The combination of flow matching (straight paths), rectified flow (straightened further), and distillation (student mimics teacher in fewer steps) made this possible.
Real-world solution (FrameFlow / FoldFlow / Chroma):
1. Noise on SO(3): Use the isotropic Gaussian on SO(3) (IGSO(3)), which is the Gaussian's analog for rotations. Sample by exponentiating a random tangent vector: R = exp(skew(ξ)), ξ ~ N(0, σ²I). At σ→∞, this approaches uniform on SO(3).
2. Straight line on SO(3): Use the geodesic (shortest rotation path): R(t) = R0 exp(t · log(R0⁻¹ R1)). The "velocity" is in the tangent space (Lie algebra so(3)), and interpolation stays on the manifold. This is the Riemannian analog of linear interpolation.
3. ODE steps: 100-200 steps with an adaptive solver (Dormand-Prince/RK45). Each step requires a full network forward pass through an SE(3)-equivariant transformer with pair interactions. At 500 residues with 100 steps: ~100 forward passes × 50ms each = 5 seconds on A100. Tight but feasible.
4. Equivariance: Use an SE(3)-equivariant architecture (e.g., EGNN, IPA from AlphaFold2). Key insight: the velocity network predicts vectors in the local frame of each residue, so global rotations/translations automatically transform correctly. Frame-based representations (as in AlphaFold2's IPA) give equivariance by construction.
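The geodesic interpolation from item 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by FrameFlow, FoldFlow, or Chroma. The helper names are assumptions, and the logarithm is only valid for relative rotation angles below π.

```python
import numpy as np

def skew(w):
    """Map a 3-vector to the corresponding skew-symmetric matrix in so(3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Rodrigues' formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3) + skew(w)               # first-order approximation near identity
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Rotation matrix -> rotation vector (valid for rotation angles below pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    # R - R^T = 2 sin(theta) * skew(axis), read off component by component.
    axis_times_2sin = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * axis_times_2sin / (2.0 * np.sin(theta))

def geodesic(R0, R1, t):
    """R(t) = R0 exp(t log(R0^T R1)); uses R0^{-1} = R0^T for rotations."""
    return R0 @ so3_exp(t * so3_log(R0.T @ R1))

R0 = so3_exp(np.array([0.3, -0.2, 0.5]))
R1 = so3_exp(np.array([-0.4, 0.9, 0.1]))
R_half = geodesic(R0, R1, 0.5)
```

The tangent vector `so3_log(R0.T @ R1)` plays the role of x1 - x0 in the Euclidean case: it is constant along the path, and every intermediate `R(t)` remains a valid rotation.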
Flow matching and diffusion are closely related — in fact, diffusion can be seen as a special case of flow matching with a particular (non-straight) path choice. But the differences matter in practice.
| Aspect | Diffusion (DDPM) | Flow Matching |
|---|---|---|
| What it learns | Noise prediction εθ | Velocity field vθ |
| Path shape | Curved (variance-preserving) | Straight (OT interpolation) |
| Math framework | SDE (stochastic) | ODE (deterministic) |
| Typical steps | 20-1000 | 10-50 |
| Noise schedule | βt schedule (many choices) | Linear interpolation (one choice) |
| Training target | ε (noise) | x1 - z (velocity) |
| Log-likelihood | Approximate (ELBO) | Exact (via ODE) |
Left: diffusion-style curved trajectories. Right: flow matching straight trajectories. Both reach the same target.
Four concrete reasons flow matching is becoming the default for new systems:
Conditioning works exactly as in diffusion: inject class labels or text embeddings into the network. The velocity field becomes v(x_t, t, c) where c is the conditioning signal. Classifier-free guidance applies identically: train with condition dropout, and at inference compute v_guided = v_uncond + w · (v_cond - v_uncond).
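The two pieces of classifier-free guidance, condition dropout during training and velocity extrapolation at inference, can be sketched as plain array operations. The function names, the use of a zero vector as the "null" condition, and the [B, D] embedding layout are assumptions for illustration. Real systems typically use a learned or empty-prompt embedding.

```python
import numpy as np

def drop_condition(c, null_c, p_drop, rng):
    """Training-time condition dropout: replace each row of c with null_c with probability p_drop."""
    mask = rng.random(c.shape[0]) < p_drop
    return np.where(mask[:, None], null_c, c), mask

def guided_velocity(v_uncond, v_cond, w):
    """Inference-time guidance: extrapolate from the unconditional toward the conditional velocity."""
    return v_uncond + w * (v_cond - v_uncond)

rng = np.random.default_rng(0)
c = np.ones((4, 3))                  # hypothetical batch of condition embeddings, shape [B, D]
null_c = np.zeros(3)                 # hypothetical "no condition" embedding
c_train, dropped = drop_condition(c, null_c, p_drop=0.1, rng=rng)

v_uncond = np.array([0.0, 1.0])
v_cond = np.array([1.0, 1.0])
v_guided = guided_velocity(v_uncond, v_cond, w=3.0)
```

With w = 0 the result is the unconditional velocity and with w = 1 it is the conditional one. Values above 1 push further in the direction the condition suggests.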
In score-based diffusion, the network learns sθ(x, t) ≈ ∇x log pt(x) (the score). In flow matching, it learns vθ(x, t) (the velocity). These are related.
For the OT path xt = (1-t)z + t x1 with z ~ N(0, I):
Your task: Show that vt(x) = x/t + ((1-t)/t) · ∇x log pt(x), i.e., the velocity field can be decomposed into a term linear in x and a score-scaled term.
Full derivation:
1. The conditional distribution is p(xt | x1) = N(xt; t x1, (1-t)² I).
2. From the conditional velocity, ut(x) = E[(x1 - x)/(1-t) | xt = x] (derived in Hint 2).
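3. Tweedie's formula for Gaussians: the posterior mean of the "clean" signal given noisy observation is E[μ | x] = x + σ² ∇x log p(x). Here μ = t x1, σ = (1-t), so:

$$\mathbb{E}[t\,x_1 \mid x_t = x] = x + (1-t)^2\,\nabla_x \log p_t(x)$$

4. Therefore E[x1 | xt = x] = [x + (1-t)² ∇x log pt(x)] / t.
5. Substituting into the velocity:

$$v_t(x) = \frac{\mathbb{E}[x_1 \mid x_t = x] - x}{1-t} = \frac{1}{1-t}\left(\frac{x + (1-t)^2\,\nabla_x \log p_t(x)}{t} - x\right)$$

Simplifying, with s(x, t) = ∇x log pt(x): the x terms combine to x(1-t)/(t(1-t)) = x/t, and the score term becomes (1-t) s(x, t)/t. After algebra:

$$v_t(x) = \frac{x}{t} + \frac{1-t}{t}\,\nabla_x \log p_t(x)$$

The exact relationship (cleanly stated): vt(x) = [E[x1|xt=x] - x] / (1-t), and via Tweedie: vt(x) = x/t + ((1-t)/t) · ∇x log pt(x), valid for 0 < t < 1.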
The key insight: The velocity field and score function contain exactly the same information — they're connected by a time-dependent linear transformation. Learning one is equivalent to learning the other. The difference is purely in the loss weighting and path geometry, not in what the network fundamentally represents. This is why FM and score matching achieve similar final quality — they're learning the same object with different parameterizations.
You now understand flow matching: straight paths, simple training, fast sampling. The next generation of generative models is built on these ideas.