The Complete Beginner's Path

Understand Flow Matching

The elegant successor to diffusion models. Learn how straight-line transport between noise and data enables faster, simpler generation — powering SD3 and Flux.

Prerequisites: Basic calculus + Familiarity with diffusion models (helpful but not required).
9 Chapters · 6+ Simulations · 0 Stochastic ODEs

Chapter 0: Paths Between Distributions

Generative modeling is fundamentally about transport: moving samples from a distribution we can easily sample (noise) to a distribution we want (data). Diffusion models do this via a winding, stochastic path with hundreds of steps. What if we could find a straight line instead?

Flow matching frames generation as learning a velocity field that transports particles from noise to data along smooth paths. At t=0, particles are random noise. At t=1, they've arrived at data samples. The velocity field v(x, t) tells each particle which direction to move at each moment.

The key idea: Instead of learning to remove noise (diffusion), learn a velocity field that pushes noise toward data. Straighter paths = fewer integration steps = faster generation.

The Simplest Possible Path

The path from noise to data is a plain linear interpolation. Given a noise sample x0 ~ N(0, I) and a data sample x1:

xt = (1 - t) · x0 + t · x1

At t=0, xt = x0 (pure noise). At t=1, xt = x1 (pure data). At t=0.5, it's the average of noise and data. The velocity along this path is constant: v = x1 - x0. No noise schedule, no ᾱt, no reparameterization trick. Just a straight line.
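
As a small illustration of the interpolation above, the sketch below evaluates xt at a few times and computes the constant velocity. The two sample vectors are made-up values chosen for easy arithmetic, and the helper names `interpolate` and `velocity` are not from the lesson.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (noise) to x1 (data) at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity(x0, x1):
    """Constant velocity along the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

x0 = [2.0, -1.0]   # hypothetical noise sample
x1 = [0.0, 3.0]    # hypothetical data sample

for t in (0.0, 0.5, 1.0):
    print(t, interpolate(x0, x1, t))
print("velocity:", velocity(x0, x1))
```

At t=0.5 the point is the elementwise average of the two samples, and the velocity does not depend on t at all.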

Compare to diffusion: Diffusion uses xt = √ᾱt · x0 + √(1 - ᾱt) · ε with a carefully designed schedule for ᾱt. Flow matching uses xt = (1-t) · x0 + t · x1. One requires a noise schedule with dozens of design choices. The other is a single line of arithmetic.
Particles Flowing: Noise to Data

Watch particles travel from random noise (t=0) toward two data clusters (t=1). Each particle follows the learned velocity field.

Check: What does the velocity field v(x, t) describe?

Chapter 1: Continuous Normalizing Flows

In Chapter 0, we showed particles flowing from noise to data along paths. That's a powerful visual intuition — but to actually build this, we need to answer a precise question: how do we tell a particle where to go at each moment?

The answer is a velocity field. At every point in space and every moment in time, the velocity field says "move in this direction at this speed." If you've ever seen a weather map with wind arrows, that's a velocity field — each arrow tells air molecules where to flow. Our velocity field does the same thing for data particles.

The math tool for "follow a velocity field" is a differential equation — specifically, an ordinary differential equation (ODE). An ODE just says: "the rate of change of x equals some function of x and time." If you've done basic calculus, you've seen this as dx/dt = f(x, t). That's all an ODE is — a rule for how something changes over time.

This gives us the mathematical backbone of flow matching, called a Continuous Normalizing Flow (CNF):

dx/dt = vθ(x, t)     x(0) ~ pnoise,   x(1) ~ pdata

Read this as: "starting from a noise sample x(0), follow the velocity field v from time t=0 to t=1, and you arrive at a data sample x(1)." The velocity field vθ is parameterized by a neural network — the θ are the learnable weights. Training teaches the network what velocity to assign at each (x, t) so that the flow transforms noise into realistic data.
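
To make "follow the velocity field" concrete, here is a minimal sketch that integrates dx/dt = v(x, t) with Euler steps. It does not use a neural network: the hand-written `velocity` function is a stand-in for vθ, chosen as the conditional field (target - x)/(1 - t) that sends every 1D particle to one assumed data point, `target = 3.0`.

```python
def velocity(x, t, target=3.0):
    """Hand-written stand-in for v_theta: steers every particle to `target` by t=1."""
    return (target - x) / (1.0 - t)

def euler_integrate(x0, n_steps=10):
    """Follow dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt              # t stops at 1 - dt, so the division is safe
        x = x + dt * velocity(x, t)
    return x

for start in (-2.0, 0.0, 5.0):
    print(start, "->", euler_integrate(start))
```

Every starting point ends up at the target (up to floating-point rounding), and rerunning with the same start gives the same result, which is the deterministic behavior an ODE guarantees.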

The network takes in xt (the particle's current position, shape [B, C, H, W] — a batch of images) and the scalar time t, and outputs the velocity v (same shape [B, C, H, W]). Same input/output shape as a diffusion denoiser — but instead of predicting noise, it predicts direction of motion.

ODE vs SDE: deterministic vs random. You might hear about stochastic differential equations (SDEs) in diffusion models. An SDE adds random noise at each step, like a velocity field with random gusts of wind. An ODE has no randomness: given the same starting point, you always follow the same path. Flow matching uses ODEs, which means deterministic paths, typically fewer integration steps, and a log-likelihood that can be computed through the change-of-variables formula (up to ODE solver error). This is a practical advantage over DDPM-style stochastic sampling, although diffusion models also admit a deterministic ODE sampler (DDIM, compared in Chapter 5).
Noise x(0)
x(0) ~ N(0, I)
↓ integrate dx/dt = v(x,t)
x(0.25)
Starting to take shape
↓ integrate
x(0.5)
Halfway there
↓ integrate
Data x(1)
A sample from pdata
Check: What type of equation governs a continuous normalizing flow?
Checkpoint — Before you move on
Explain in your own words: why does a velocity field v(x, t) define a unique trajectory for each particle, and what would go wrong if v depended on the particle's history (i.e., where it came from) rather than just its current position and time?
Model Answer

An ODE dx/dt = v(x, t) is a Markov system: the velocity depends only on where you are (x) and when (t), not how you got there. This means two particles at the same (x, t) must have the same velocity — their paths cannot cross at the same time. This is the uniqueness theorem for ODEs (Picard-Lindelöf): given a Lipschitz velocity field, each initial condition produces exactly one trajectory.

If v depended on history, you'd need a more complex representation (path-dependent or stochastic). The Markov property is what makes CNFs tractable — the network only needs to map (x, t) → v, not reason about trajectories.

This is also why paths crossing is problematic: at a crossing point, two particles at the same (x, t) need to go to different places. The network must learn a compromised velocity that sends them in an average direction — reducing sample quality. Reflow (Chapter 6) reduces crossings precisely to avoid this.

Chapter 2: Optimal Transport

We now have a framework: a velocity field v(x, t) that guides particles from noise to data. But we haven't said anything about which paths those particles should take. A velocity field could send particles along wild, looping curves — or along straight lines. Both get from noise to data, but one is far more efficient.

Optimal transport (OT) is the math of "what's the cheapest way to move mass from one pile to another?" Think of it like a shipping problem: you have warehouses (noise) and stores (data), and you want to minimize total distance traveled. The answer, for our case, turns out to be beautifully simple: straight paths with constant velocity.

Why does straightness matter? Because straighter paths can be integrated with fewer numerical steps. A perfectly straight path needs just one Euler step. Curved paths need many steps to follow accurately.

Curved vs Straight Paths

Compare diffusion-style curved paths (left) with OT straight paths (right). Straight paths need fewer integration steps.

Straight-line interpolation: The OT path from noise z to data x1 is simply xt = (1-t)z + t x1. The velocity along this path is constant: v = x1 - z. This is as simple as it gets.
xt = (1-t) · z + t · x1    →    v = x1 - z

Why Straight Paths Need Fewer Steps

Consider Euler integration: xt+dt = xt + v(xt, t) · dt. If the true velocity is constant (straight path), then a single Euler step with dt=1 gives the exact answer. No numerical error at all. In practice, the velocity field isn't perfectly constant (different noise-data pairs have different velocities), so we need a few steps. But "a few" means 10-50, not 1000.

Contrast with diffusion: the variance-preserving path curves through high-dimensional space. The velocity changes direction at every point, so you need many small steps to track the curve accurately. This is the fundamental geometric reason flow matching is faster.
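
The step-count argument can be checked on a toy problem. The sketch below runs the same Euler integrator on two hand-written 2D velocity fields that both move a particle from (1, 0) to (0, 1): a constant one (straight path) and one that traces a quarter circle (curved path). Both fields and the helper names are assumptions made for illustration, not learned models.

```python
import math

def euler(v, x, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with n_steps Euler steps (2D points)."""
    dt = 1.0 / n_steps
    for k in range(n_steps):
        vx, vy = v(x, k * dt)
        x = (x[0] + dt * vx, x[1] + dt * vy)
    return x

def straight_v(x, t):
    """Constant velocity: the straight path from (1, 0) to (0, 1)."""
    return (-1.0, 1.0)

def curved_v(x, t):
    """Velocity of the quarter-circle path (cos(pi t / 2), sin(pi t / 2))."""
    w = math.pi / 2
    return (-w * math.sin(w * t), w * math.cos(w * t))

def error(v, n_steps):
    """Distance between the Euler endpoint and the true endpoint (0, 1)."""
    end = euler(v, (1.0, 0.0), n_steps)
    return math.dist(end, (0.0, 1.0))

for n in (1, 5, 50):
    print(n, error(straight_v, n), error(curved_v, n))
```

The straight field is exact with a single step, while the curved field needs many steps before its endpoint error becomes small.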

Check: Why are straight transport paths better than curved ones?
🔨 Derivation: Why the Straight Line Minimizes Transport Cost

Given a fixed starting point z and endpoint x1, consider all smooth paths x(t) with x(0) = z, x(1) = x1. The kinetic energy (transport cost) of a path is:

KE = ∫₀¹ ||dx/dt||² dt

Your task: Prove that the straight-line path x(t) = (1-t)z + t x1 minimizes this kinetic energy among all paths connecting z to x1.

Hint 1: For x(t) = (1-t)z + t x1, the velocity is dx/dt = x1 - z, which is constant (does not depend on t). So KEstraight = ||x1 - z||².
Hint 2: By Cauchy-Schwarz: (∫₀¹ ||dx/dt|| dt)² ≤ (∫₀¹ 1² dt)(∫₀¹ ||dx/dt||² dt) = KE. The left side is the path length squared. What is the minimum possible length of a path from z to x1?
Hint 3: Cauchy-Schwarz gives equality iff ||dx/dt|| is constant along the path. Combined with the minimum-length constraint: the shortest path is the straight line, and constant speed along that line gives the unique minimizer.

Full derivation:

1. For any path x(t) from z to x1, the length is L = ∫₀¹ ||dx/dt|| dt. The straight-line distance is ||x1 - z||, and by the triangle inequality, L ≥ ||x1 - z|| with equality iff the path is a straight line.

2. By Cauchy-Schwarz on the functions f(t) = 1 and g(t) = ||dx/dt||:

L² = (∫₀¹ ||dx/dt|| dt)² ≤ (∫₀¹ dt)(∫₀¹ ||dx/dt||² dt) = KE

3. Therefore KE ≥ L² ≥ ||x1 - z||². The straight-line path achieves KE = ||x1 - z||² exactly (constant velocity), so it is the unique minimizer.

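The key insight: Minimizing kinetic energy simultaneously forces the path to be (a) straight (shortest length) and (b) constant-speed (no acceleration). This is the single-particle version of the Benamou-Brenier dynamic formulation of optimal transport: under quadratic cost, mass moves along straight lines at constant speed. The result holds per pair; which noise point gets matched to which data point (the coupling) is a separate question, and it is why paths from different pairs can still cross (Chapter 6).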
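
A numerical sanity check of the derivation, as a sketch: the code below approximates the kinetic energy of three paths with the same endpoints using finite differences. The endpoints, the two alternative paths (`uneven`, `wiggly`), and the helper names are illustrative choices, not part of the lesson.

```python
import math

def kinetic_energy(path, n=10000):
    """Approximate the integral of ||dx/dt||^2 over [0, 1] with finite differences."""
    dt = 1.0 / n
    total = 0.0
    for k in range(n):
        a, b = path(k * dt), path((k + 1) * dt)
        speed_sq = sum(((q - p) / dt) ** 2 for p, q in zip(a, b))
        total += speed_sq * dt
    return total

z, x1 = (0.0, 0.0), (3.0, 4.0)          # ||x1 - z||^2 = 25

def straight(t):                         # straight line, constant speed
    return tuple(p + t * (q - p) for p, q in zip(z, x1))

def uneven(t):                           # same straight line, but accelerating
    return straight(t * t)

def wiggly(t):                           # same endpoints, sideways detour
    x, y = straight(t)
    bump = math.sin(math.pi * t)
    return (x - 0.8 * bump, y + 0.6 * bump)   # offset perpendicular to x1 - z

for name, path in [("straight", straight), ("uneven", uneven), ("wiggly", wiggly)]:
    print(name, round(kinetic_energy(path), 3))
```

Only the straight, constant-speed path reaches the lower bound ||x1 - z||² = 25. The accelerating path stays on the line but pays for uneven speed (analytically 4/3 · 25), and the detour pays for extra length.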

🔗 Pattern Recognition
Optimal Transport ↔ Physics of Least Action
Flow Matching (this lesson)
Minimize ∫ ||dx/dt||² dt → straight-line path at constant speed
Classical Mechanics
Minimize ∫ ½m||v||² dt (action) → straight-line trajectory (no forces)

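The OT problem has the form of the free-particle Lagrangian. The velocity field v(x, t) plays the role of a physical velocity, and each conditional path obeys Newton's first law: with no force acting, a particle moves in a straight line at constant speed. The analogy is loose beyond that point: the learned marginal field bends trajectories because it averages over many noise-data pairs, not because a physical force acts on them.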

Where else does "minimize squared velocity integrated over time" appear? (Hint: think Kalman smoothing, spline interpolation, geodesics on manifolds.)

Chapter 3: Conditional Flow Matching

We know the ideal path for each particle: a straight line from its specific noise point z to its specific data point x1. But here's the problem: during generation, the network sees an intermediate point xt and has to decide where to send it — without knowing which x1 it's headed toward. The "true" velocity at xt would require averaging over every possible data point it might be going to. That's impossibly expensive to compute (mathematicians call this intractable — meaning "can't be computed in practice").

Conditional Flow Matching (CFM) is the elegant trick that makes training possible anyway. Instead of computing that impossible average, we train on individual pairs. For each training pair (noise z, data x1), we know exactly where the particle should go: from z to x1 in a straight line. The conditional velocity for this specific pair is simply ut = x1 - z (the direction from noise to data, constant along the path).

The beautiful mathematical insight (proven by Lipman et al., 2023): if you train the network to match these per-pair velocities across many random pairs, the network automatically learns the correct average velocity at every point. You never have to compute the intractable average explicitly — gradient descent finds it for you.

LCFM = Et, z, x1 [ || vθ(xt, t) - (x1 - z) ||² ]
Why this works: The conditional and marginal flow matching losses have identical gradients (proven by Lipman et al., 2023). We can train on simple per-sample paths while implicitly learning the complex marginal flow. No intractable integrals required.

Here's the concrete difference this makes. Without CFM, the target velocity at a point xt would require averaging over all possible noise-data pairs that pass through xt — an intractable integral. With CFM, for each training sample we just compute x1 - z (a single subtraction). The magic is that gradient descent over many such samples converges to the correct marginal velocity field.

The notation shift. In flow matching papers, noise is called x0 or z (starting point at t=0), and data is called x1 (endpoint at t=1). This is the opposite of diffusion notation, where x0 is the clean data and xT is noise. Don't let this trip you up — the convention follows the direction of generation (t goes from 0 to 1 in flow matching, from T to 0 in diffusion).
Sample
Pick z ~ N(0,I) and x1 from dataset
Interpolate
xt = (1-t)z + t x1
Target velocity
ut = x1 - z
Loss
|| vθ(xt, t) - ut ||²
Check: What does Conditional Flow Matching train on?
🔨 Derivation: CFM Loss = Marginal FM Loss (in expectation)

The marginal flow matching loss is LFM = E[||vθ(x, t) - ut(x)||²] with t ~ U(0, 1) and x ~ pt, where ut(x) is the true marginal velocity at (x, t). This is intractable because ut(x) = E[x1 - z | xt = x] requires knowing p(x1 | xt).

The conditional FM loss is LCFM = Et, z, x1[||vθ(xt, t) - (x1 - z)||²] where xt = (1-t)z + tx1.

Your task: Show that ∇θ LCFM = ∇θ LFM. (They have the same gradient, so optimizing one optimizes the other.)

Hint 1: ||vθ - u||² = ||vθ||² - 2⟨vθ, u⟩ + ||u||². The gradient w.r.t. θ only sees the first two terms. The ||u||² term is a constant. So the gradient is determined by E[∇θ||vθ||² - 2⟨∇θvθ, u⟩]. The ||vθ||² term is the same in both losses (same θ-dependence). So you just need to show the cross-term matches.
Hint 2: The CFM cross-term is Et,z,x1[⟨vθ(xt, t), x1 - z⟩]. Condition on (xt, t): this becomes Et, xt[⟨vθ(xt, t), E[x1 - z | xt, t]⟩]. What is E[x1 - z | xt, t]?
Hint 3: By definition, the marginal velocity ut(x) = E[x1 - z | xt = x]. So the conditional expectation of the per-sample velocity IS the marginal velocity. The cross-terms match.

Full derivation:

1. Write both losses with the squared norm expanded:

L = E[||vθ||²] - 2E[⟨vθ, u⟩] + E[||u||²]

The last term doesn't depend on θ, so ∇θL is determined by the first two terms.

2. The first term E[||vθ(xt, t)||²] is the same in both losses because both sample xt from pt (the CFM interpolation samples from the same marginal distribution pt by construction).

3. For the cross-term in LCFM:

Et,z,x1[⟨vθ(xt, t), x1 - z⟩] = Et, xt[⟨vθ(xt, t), E[x1 - z | xt, t]⟩]

4. The conditional expectation E[x1 - z | xt = x, t] is precisely the definition of the marginal velocity field ut(x). So:

= Et, xt[⟨vθ(xt, t), ut(xt)⟩]

This is exactly the cross-term in LFM. Since both relevant terms match, ∇θLCFM = ∇θLFM. QED.

The key insight: The per-sample velocity (x1 - z) is a noisy but unbiased estimate of the marginal velocity ut(x). SGD with unbiased gradients converges to the same optimum as the full gradient. This is why we can train on simple per-sample targets and implicitly learn the complex marginal flow.
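
The result can be verified on a tiny discrete example. In the sketch below, time is fixed at t = 0.5, the five (z, x1) pairs are hand-picked so that several of them share the same xt, and the "network" is an assumed linear model v(x) = a·x + b. None of these choices come from the lesson. If the two losses have identical gradients, their difference must be the same constant for every (a, b).

```python
from collections import defaultdict

T = 0.5
pairs = [(-1.0, 1.0), (1.0, -1.0), (1.0, 3.0), (3.0, 1.0), (0.0, 4.0)]  # (z, x1)

# Each pair contributes one training sample: (x_t, conditional velocity u)
samples = [((1 - T) * z + T * x1, x1 - z) for z, x1 in pairs]

# Marginal velocity: average of the conditional velocities that share x_t
groups = defaultdict(list)
for x_t, u in samples:
    groups[x_t].append(u)
marginal = {x_t: sum(us) / len(us) for x_t, us in groups.items()}

def model(x, a, b):
    """Toy velocity model standing in for the neural network."""
    return a * x + b

def loss_cfm(a, b):
    return sum((model(x, a, b) - u) ** 2 for x, u in samples) / len(samples)

def loss_fm(a, b):
    return sum((model(x, a, b) - marginal[x]) ** 2 for x, _ in samples) / len(samples)

for a, b in [(0.0, 0.0), (1.0, -0.5), (-2.0, 3.0)]:
    print((a, b), loss_cfm(a, b) - loss_fm(a, b))
```

The printed gap is the same for every parameter setting: it is the variance of the per-pair velocities around the marginal velocity, the E[||u||²]-type term that does not depend on θ.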

🔗 Pattern Recognition
Flow Matching ↔ Diffusion (Same Gradient, Different Parameterization)
Flow Matching (this lesson)
Target = x1 - z (velocity along straight path)
Path: xt = (1-t)z + t x1
Diffusion
Target = ε (noise added)
Path: xt = √ᾱt x0 + √(1-ᾱt) ε

Both follow the same recipe as denoising score matching: train a network to predict a target direction from a noisy intermediate. The difference is the path geometry: diffusion uses a variance-preserving arc (the squared coefficients of signal and noise sum to 1), flow matching uses a straight line through ambient space. Lipman et al. (2023) showed that diffusion paths arise as a special case of flow matching with a specific (non-optimal) path choice.

If you change the FM path to xt = cos(πt/2) z + sin(πt/2) x1, what familiar framework do you recover? (Answer: variance-preserving diffusion with a cosine schedule.)
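
To see why the cosine path is called variance-preserving, the sketch below compares Var[xt] under both paths. It assumes z and x1 are independent with unit variance (roughly true for normalized data, but an assumption nonetheless), so that Var[xt] is simply the sum of the squared coefficients.

```python
import math

def linear_coeffs(t):
    """Flow matching path: x_t = (1 - t) z + t x1."""
    return 1 - t, t

def cosine_coeffs(t):
    """Cosine path: x_t = cos(pi t / 2) z + sin(pi t / 2) x1."""
    return math.cos(math.pi * t / 2), math.sin(math.pi * t / 2)

def variance(coeffs, t):
    """Var[x_t] for x_t = a*z + b*x1 with independent, unit-variance z and x1."""
    a, b = coeffs(t)
    return a * a + b * b

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, round(variance(linear_coeffs, t), 3), round(variance(cosine_coeffs, t), 3))
```

The cosine path keeps the variance at 1 for every t, while the linear path dips to 0.5 at the midpoint.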

Chapter 4: Training

Training a flow matching model is arguably even simpler than training a diffusion model. For each training step:

  1. Sample noise z ~ N(0, I)
  2. Sample a data point x1 from the dataset
  3. Sample a random time t ~ U(0, 1)
  4. Compute xt = (1-t)z + t x1
  5. Predict velocity: v̂ = vθ(xt, t)
  6. Loss = || v̂ - (x1 - z) ||²

The Complete Training Loop

Here it is in full. Compare this to the diffusion training loop — notice what's missing:

python
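import torch
import torch.nn.functional as F

# Assumes `network`, `optimizer`, and `dataloader` are already defined.
for x_1 in dataloader:                                 # 1. Sample data, [B, C, H, W]
    B = x_1.shape[0]
    x_0 = torch.randn_like(x_1)                        # 2. Sample noise ~ N(0, I)
    t = torch.rand(B, 1, 1, 1, device=x_1.device)      # 3. Sample t ~ Uniform[0, 1)

    x_t = (1 - t) * x_0 + t * x_1                      # 4. Linear interpolation
    target = x_1 - x_0                                 # 5. Target velocity (constant!)

    v_hat = network(x_t, t.view(B))                    # 6. Predict velocity; t passed as [B]
    loss = F.mse_loss(v_hat, target)                   # 7. MSE loss

    optimizer.zero_grad()                              # clear gradients from the previous step
    loss.backward()
    optimizer.step()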
What's missing compared to diffusion? No noise schedule (βt, ᾱt). No cumulative product computation. No reparameterization. No variance-preserving formula. The interpolation is (1-t)*x_0 + t*x_1 and the target is x_1 - x_0. That's the entire mathematical content of the training loop. This simplicity is why flow matching is becoming the default for new systems.
Interactive: Velocity Field

Arrows show the learned velocity field at time t. At t=0, velocities point toward data. At t=1, they've converged.

Check: What is the training target in flow matching?
💻 Build It: Implement the CFM Training Step from Scratch
You've seen the training loop above. Now implement the core cfm_loss function that computes a single training step. This is the complete mathematical content of flow matching training — everything else is standard deep learning boilerplate.
signature
def cfm_loss(model, x_1, noise_fn=torch.randn_like):
    """Compute the conditional flow matching loss for a batch.

    Args:
        model: Neural network that takes (x_t, t) and returns predicted velocity.
               x_t shape: [B, C, H, W], t shape: [B]
        x_1: Batch of data samples, shape [B, C, H, W]
        noise_fn: Function to sample noise (default: standard Gaussian)

    Returns:
        loss: Scalar MSE loss between predicted and target velocity
    """
Test case
Given x_1 = [[1.0, 2.0]], z = [[0.0, 0.0]], t = 0.3:
x_t = (1-0.3)*[[0,0]] + 0.3*[[1,2]] = [[0.3, 0.6]]
target velocity = [[1,2]] - [[0,0]] = [[1.0, 2.0]]
If model predicts [[0.8, 1.9]], loss = mean of [(0.8-1)², (1.9-2)²] = mean of [0.04, 0.01] = 0.025
Hint: t must broadcast against x_1. For image batches of shape [B, C, H, W] that means shape [B, 1, 1, 1]; the 2-D test case above needs [B, 1]. Sample t with shape [B] and reshape a copy for the interpolation. The model expects t with shape [B], so pass the unreshaped t (or t.view(B)). Avoid t.squeeze(), which collapses t to a scalar when B = 1.
python
import torch
import torch.nn.functional as F

def cfm_loss(model, x_1, noise_fn=torch.randn_like):
    B = x_1.shape[0]

    # 1. Sample noise (same shape as data)
    z = noise_fn(x_1)                                  # same shape as x_1

    # 2. Sample time uniformly in [0, 1), one value per batch element
    t = torch.rand(B, device=x_1.device)               # [B]
    t_b = t.view(B, *([1] * (x_1.dim() - 1)))          # [B, 1, ..., 1] for broadcasting

    # 3. Linear interpolation: x_t = (1-t)*z + t*x_1
    x_t = (1 - t_b) * z + t_b * x_1

    # 4. Target velocity = endpoint - startpoint
    target = x_1 - z

    # 5. Network predicts velocity at (x_t, t); t stays [B] even when B == 1
    v_pred = model(x_t, t)

    # 6. MSE between predicted and target velocity
    return F.mse_loss(v_pred, target)
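Bonus challenge: Modify this to support time-weighted loss: weight the MSE by w(t) = 1/(1-t+ε) to emphasize accuracy near t=1 (where samples must be close to data). For comparison, SD3 shifts the emphasis across timesteps in a different way: it samples t from a logit-normal distribution instead of uniformly, which concentrates training on intermediate times.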

Chapter 5: Sampling

To generate a sample, start from noise z ~ N(0, I) and integrate the learned ODE forward from t=0 to t=1. The simplest method is Euler integration: take N evenly spaced steps, at each step nudging x by the predicted velocity times the step size.

xt+Δt = xt + Δt · vθ(xt, t)     where Δt = 1/N

Because the paths are approximately straight, Euler's method works well even with few steps. Higher-order solvers (Midpoint, RK4) give better accuracy for the same step count, but even plain Euler with 20-50 steps produces excellent results.
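
As a sketch of what "higher-order" buys, the code below compares an Euler step with a midpoint step on a toy ODE with a known answer, dx/dt = x, whose exact solution at t=1 is e · x(0). The toy field and the helper names are assumptions for illustration. With a learned model, `v` would be the network call.

```python
import math

def v(x, t):
    """Toy velocity field with a known solution: dx/dt = x gives x(1) = e * x(0)."""
    return x

def euler_step(x, t, dt):
    return x + dt * v(x, t)

def midpoint_step(x, t, dt):
    x_half = x + 0.5 * dt * v(x, t)            # half step to the midpoint
    return x + dt * v(x_half, t + 0.5 * dt)    # full step using the midpoint velocity

def solve(step, x0, n_steps):
    """Integrate from t=0 to t=1 with n_steps applications of `step`."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x = step(x, k * dt, dt)
    return x

for n in (5, 20):
    euler_err = abs(solve(euler_step, 1.0, n) - math.e)
    midpoint_err = abs(solve(midpoint_step, 1.0, n) - math.e)
    print(n, euler_err, midpoint_err)
```

The midpoint method evaluates the velocity twice per step, so a fair comparison is at equal evaluation counts. On this toy problem, 10 midpoint steps (20 evaluations) are still more accurate than 20 Euler steps.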

The Sampling Loop

python
import torch

# Generate an image from pure noise (assumes a trained `network` is defined)
x = torch.randn(1, 4, 64, 64)      # start from noise (latent space)
N = 20                             # number of Euler steps
dt = 1.0 / N

with torch.no_grad():              # no gradients needed at inference
    for i in range(N):
        t = torch.full((x.shape[0],), i / N)   # current time as a [B] tensor: 0, 0.05, 0.10, ...
        v = network(x, t)          # predict velocity at current position
        x = x + v * dt             # take one Euler step
# x is now an approximate sample from the data distribution

Compare this to DDPM's 1000 steps with stochastic noise injection at each step. Flow matching sampling is an ODE (deterministic), not an SDE (stochastic). Same noise in → same image out. Every time.

Method | Typical steps | Stochastic? | Same initial noise → same output?
DDPM | 1000 | Yes (SDE) | No
DDIM | 50 | No (ODE) | Yes
Flow Matching | 20-50 | No (ODE) | Yes
Interactive: Euler Steps Along the Flow

Watch particles integrate from noise to data. Adjust the step count: more steps = more accurate paths.

The straight-path advantage: If paths were perfectly straight, one Euler step would be exact. In practice, learned paths are nearly straight, so 20-50 steps suffice. Compare with DDPM's 1000 steps!
Check: Why does flow matching need fewer sampling steps than diffusion?
💥 Break-It Lab: What Dies When You Break the Sampling ODE?
A working flow matching sampler integrates dx/dt = v(x, t) from t=0 to t=1 using Euler steps. Particles start at noise and converge to data clusters. Below you can break three components and watch what happens.
Use Curved Paths (sinusoidal)
Failure mode: Curved velocities change direction at every point. With few Euler steps, particles overshoot at turns and miss the target entirely. The approximation error compounds multiplicatively at each step. You need 5-10x more steps to get the same accuracy as straight paths.
Feed Wrong Time (always t=0.5)
Failure mode: The network uses t to know how far along the path a particle is. Each conditional velocity x1 - z is constant in time, but the learned marginal velocity is not: near t=0 it points roughly toward the data mean (almost any data point is still a plausible destination), and near t=1 it points toward a specific nearby mode. With t frozen at 0.5, every step uses the velocity suited to the halfway distribution. Wrong time → wrong velocity both early and late → particles drift off the learned trajectories.
Start from Data (no noise at t=0)
Failure mode: If particles start at data instead of noise, the velocity field has nothing to do — particles are already at the destination. But the network was trained for the noise→data direction. Starting at data, it tries to "move" particles that are already there, overshooting into nonsense regions with no training signal. Garbage in, garbage out.

Chapter 6: Reflow & Distillation

Even though flow matching paths are straighter than diffusion, they're not perfectly straight. Paths from different noise-data pairs can cross each other, forcing the network to learn a curved velocity field to avoid collisions. Reflow straightens paths further.

The procedure is elegant: (1) Take a trained flow model. (2) Sample noise z ~ N(0, I), run the model forward to get the corresponding data point x1 = ODE(z). Now you have a (z, x1) pair that the model actually maps. (3) Retrain a new model on straight lines between these pairs. Because the pairs are already coupled (the model mapped z to x1), the straight-line approximation is much closer to the true path.

python
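# Reflow: straighten paths by retraining on coupled pairs
# Assumes `old_model`, `new_model`, `optimizer`, and an `ode_solve` helper are defined elsewhere.
import torch
import torch.nn.functional as F

# Step 1: Generate coupled (noise, data) pairs from the trained model
z = torch.randn(10000, 4, 64, 64)
with torch.no_grad():
    x_1 = ode_solve(old_model, z, t=0, t_end=1)    # run the trained model

# Step 2: Retrain on straight lines between these pairs
for idx in torch.randperm(len(z)).split(64):       # shuffled mini-batches of 64
    z_b, x_1_b = z[idx], x_1[idx]
    t = torch.rand(len(idx), 1, 1, 1)
    x_t = (1 - t) * z_b + t * x_1_b                # straight line between COUPLED pair
    loss = F.mse_loss(new_model(x_t, t.view(-1)), x_1_b - z_b)   # velocity = x_1 - z
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()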

Each iteration makes paths straighter. After 2-3 reflow iterations, paths are nearly straight enough for 1-step generation.
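
Why coupled pairs cross less can be illustrated in 1D without training anything. In one dimension, ODE trajectories cannot pass each other, so the map from noise to data produced by a flow preserves ordering. The sketch below imitates that coupling by sorting, which is a stand-in for running a trained model and not the reflow procedure itself. The two-cluster data and the helper name `count_crossings` are made up for the example.

```python
import random

def count_crossings(pairs):
    """Two straight 1D paths z -> x cross iff their start and end orders disagree."""
    n = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            (z_i, x_i), (z_j, x_j) = pairs[i], pairs[j]
            if (z_i - z_j) * (x_i - x_j) < 0:
                n += 1
    return n

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
data = [rng.choice((-2.0, 2.0)) + 0.1 * rng.gauss(0.0, 1.0) for _ in range(200)]

independent = list(zip(noise, data))                # random pairing, as in initial training
coupled = list(zip(sorted(noise), sorted(data)))    # order-preserving pairing

print(count_crossings(independent), count_crossings(coupled))
```

The independent pairing produces many crossing straight lines, while the order-preserving pairing produces none. In higher dimensions there is no sorting shortcut, which is why reflow obtains the coupling by actually integrating the trained model.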

Initial Flow
Good but paths cross
↓ Reflow iteration 1
Straighter
Fewer crossings, lower curvature
↓ Reflow iteration 2
Nearly Straight
Suitable for 1-2 step generation

Distillation

Distillation takes a different approach: train a student model to mimic the teacher in fewer steps. Progressive distillation halves the step count repeatedly (64 → 32 → 16 → 8 → 4 → 2 → 1). Combined with reflow, this yields high-quality 1-4 step models.

Path Straightening

Compare paths before and after reflow. Straighter paths = fewer steps needed.

The endgame: Reflow + distillation aims for single-step generation with diffusion-level quality. Few-step production models such as Flux.1-schnell and SDXL-Turbo pursue the same goal, although they get there through adversarial distillation rather than reflow.
Check: What does reflow do to transport paths?
⚔ Adversarial: You apply reflow 5 times and the loss keeps decreasing. But FID (sample quality) gets WORSE after iteration 3. What is happening?
Your model trains on the reflow pairs (zi, x1,i) generated by the previous iteration. Loss goes down every iteration: 0.42 → 0.31 → 0.24 → 0.19 → 0.16. But FID goes 12.1 → 9.8 → 8.4 → 9.2 → 11.5. What explains the divergence between loss and quality?

Chapter 7: SD3 / Flux

Flow matching has gone from theory to production. Both Stable Diffusion 3 (Stability AI) and Flux (Black Forest Labs) use rectified flow matching as their core framework, combined with a new architecture: the MMDiT (Multimodal Diffusion Transformer).

MMDiT Architecture

MMDiT replaces the U-Net with a Transformer. Both the noisy latent patches and the text tokens are processed as separate streams that interact through joint attention layers. This bidirectional interaction gives the text genuine influence over image generation.

Text Stream
T5 + CLIP embeddings as tokens
Joint Attention
Image and text tokens attend to each other
Image Stream
Patchified noisy latent + positional encoding
Feature | SD 1.5 | SDXL | SD3 / Flux
Architecture | U-Net | Larger U-Net | MMDiT
Framework | DDPM | DDPM | Flow matching
Text encoder | CLIP | CLIP + OpenCLIP | CLIP + T5-XXL
Resolution | 512px | 1024px | 1024px+
Steps | 20-50 | 20-40 | 20-30

From Theory to Production Numbers

Flux.1-schnell generates a 1024×1024 image in 4 steps. Let that sink in. DDPM needed 1000 steps for 256×256. That's a 250× reduction in steps at 16× the pixel count. The combination of flow matching (straight paths), the rectified flow formulation, and distillation (student mimics teacher in fewer steps) made this possible.

Flux variants: Flux.1-pro (best quality, API-only), Flux.1-dev (open weights, guidance-distilled, ~20 steps), Flux.1-schnell (4-step distilled, fastest). The schnell variant shows what step distillation on top of a flow matching model can deliver: 4 steps instead of ~20, roughly 5× fewer network evaluations per image at near-pro quality.
Check: What architecture do SD3 and Flux use instead of U-Net?
🏗 Design Challenge: You're the Architect: Flow Matching for Protein Structure Generation
You're designing a flow matching model for de novo protein backbone generation. Each protein is a chain of residues, each with a 3D position (x, y, z) and an orientation frame (rotation matrix in SO(3)). The output must be physically valid: no steric clashes, correct bond lengths, and the model should be equivariant to rotations and translations (rotating the input should rotate the output identically).
Data representation
500 residues × (3D position + 3×3 rotation) = 500 × 12 values
Symmetry requirement
SE(3) equivariance (rotation + translation)
Inference budget
≤ 10 seconds on A100 for one protein
Noise distribution
N(0, I) for positions, but what for rotations?
1. What noise distribution do you use for the rotation part? (N(0, I) on matrix entries violates SO(3). What's the "uniform noise" analog on a rotation manifold?)
2. How do you define a "straight line" between two rotations? (Linear interpolation of rotation matrices leaves SO(3).)
3. How many ODE steps do you need? What solver? (Consider: 500 residues interact pairwise, so each network forward pass is expensive.)
4. How do you enforce SE(3) equivariance in the velocity network architecture?

Real-world solution (FrameFlow / FoldFlow / Chroma):

1. Noise on SO(3): Use the isotropic Gaussian on SO(3) (IGSO(3)), the rotation-manifold analog of a Gaussian. A simple approximate sampler exponentiates a random tangent vector: R = exp(skew(ξ)), ξ ~ N(0, σ²I). As σ→∞, the distribution approaches the uniform (Haar) distribution on SO(3), which is the true analog of "uniform noise" for rotations.

2. Straight line on SO(3): Use the geodesic (shortest rotation path): R(t) = R0 exp(t · log(R0⁻¹ R1)). The "velocity" is in the tangent space (Lie algebra so(3)), and interpolation stays on the manifold. This is the Riemannian analog of linear interpolation.

3. ODE steps: 100-200 steps with an adaptive solver (Dormand-Prince/RK45). Each step requires a full network forward pass through an SE(3)-equivariant transformer with pair interactions. At 500 residues with 100 steps: ~100 forward passes × 50ms each = 5 seconds on A100. Tight but feasible.

4. Equivariance: Use an SE(3)-equivariant architecture (e.g., EGNN, IPA from AlphaFold2). Key insight: the velocity network predicts vectors in the local frame of each residue, so global rotations/translations automatically transform correctly. Frame-based representations (as in AlphaFold2's IPA) give equivariance by construction.
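
The geodesic formula from point 2 can be sketched in plain Python. The helper names (`log_so3`, `exp_so3`, `geodesic`, `lerp`) are illustrative and do not come from FrameFlow or FoldFlow, and `log_so3` assumes a rotation angle strictly below π. The example compares the geodesic midpoint with naive entrywise interpolation for a 90° rotation about the z axis.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def det(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def log_so3(r):
    """Rotation matrix -> rotation vector (axis * angle); assumes 0 <= angle < pi."""
    cos_a = max(-1.0, min(1.0, (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0))
    angle = math.acos(cos_a)
    if angle < 1e-12:
        return [0.0, 0.0, 0.0]
    s = angle / (2.0 * math.sin(angle))
    return [s * (r[2][1] - r[1][2]), s * (r[0][2] - r[2][0]), s * (r[1][0] - r[0][1])]

def exp_so3(w):
    """Rotation vector -> rotation matrix via Rodrigues' formula."""
    angle = math.sqrt(sum(c * c for c in w))
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if angle < 1e-12:
        return eye
    x, y, z = (c / angle for c in w)
    k = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]   # skew matrix of the unit axis
    k2 = matmul(k, k)
    s, c = math.sin(angle), 1.0 - math.cos(angle)
    return [[eye[i][j] + s * k[i][j] + c * k2[i][j] for j in range(3)] for i in range(3)]

def geodesic(r0, r1, t):
    """R(t) = R0 exp(t log(R0^T R1)); the transpose is the inverse of a rotation."""
    w = log_so3(matmul(transpose(r0), r1))
    return matmul(r0, exp_so3([t * c for c in w]))

def lerp(r0, r1, t):
    """Naive entrywise interpolation, which leaves SO(3)."""
    return [[(1 - t) * r0[i][j] + t * r1[i][j] for j in range(3)] for i in range(3)]

r0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
r1 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # 90 degrees about z

print(det(lerp(r0, r1, 0.5)), det(geodesic(r0, r1, 0.5)))
```

The entrywise midpoint has determinant 0.5, so it is not a rotation, while the geodesic midpoint is the 45° rotation with determinant 1.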

Chapter 8: Flow vs Diffusion

Flow matching and diffusion are closely related — in fact, diffusion can be seen as a special case of flow matching with a particular (non-straight) path choice. But the differences matter in practice.

Aspect | Diffusion (DDPM) | Flow Matching
What it learns | Noise prediction εθ | Velocity field vθ
Path shape | Curved (variance-preserving) | Straight (OT interpolation)
Math framework | SDE (stochastic) | ODE (deterministic)
Typical steps | 20-1000 | 10-50
Noise schedule | βt schedule (many choices) | Linear interpolation (one choice)
Training target | ε (noise) | x1 - z (velocity)
Log-likelihood | Approximate (ELBO) | Exact (via ODE)
Side-by-Side: Diffusion vs Flow

Left: diffusion-style curved trajectories. Right: flow matching straight trajectories. Both reach the same target.

Why Flow Matching Is Winning

Four concrete reasons flow matching is becoming the default for new systems:

  1. Simpler math. No noise schedule design (βt linear vs cosine vs sigmoid). No ᾱt cumulative products. Just linear interpolation.
  2. Faster sampling. Straight paths need 10-50 Euler steps instead of DDPM's 1000. Even compared to DDIM (50 steps), flow matching matches quality with fewer steps.
  3. Deterministic by default. ODE, not SDE. Same seed = same output. Reproducibility is free.
  4. This is what production uses. Stable Diffusion 3, Flux, π0 (robot policy) — all flow matching. The industry has voted.

Conditioning and Guidance

Conditioning works exactly as in diffusion: inject class labels or text embeddings into the network. The velocity field becomes v(xt, t, c) where c is the conditioning signal. Classifier-free guidance applies identically — train with condition dropout, at inference compute vguided = vuncond + w · (vcond - vuncond).
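
The guidance formula is a one-line extrapolation. The sketch below applies it to two made-up velocity vectors that stand in for the network's unconditional and conditional outputs. `guided_velocity` is an illustrative helper name.

```python
def guided_velocity(v_uncond, v_cond, w):
    """Classifier-free guidance: v_uncond + w * (v_cond - v_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(v_uncond, v_cond)]

v_uncond = [0.5, -1.0]   # hypothetical network output with the condition dropped
v_cond = [1.0, 0.0]      # hypothetical network output with the condition supplied

print(guided_velocity(v_uncond, v_cond, 0.0))   # w = 0 recovers the unconditional velocity
print(guided_velocity(v_uncond, v_cond, 1.0))   # w = 1 recovers the conditional velocity
print(guided_velocity(v_uncond, v_cond, 3.0))   # w > 1 extrapolates past the conditional one
```

Values of w above 1 push the velocity further in the direction that the condition suggests, which is the regime guidance is normally used in.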

When to Use Which?

Use diffusion when you have existing DDPM/DDIM infrastructure, need maximum compatibility with LoRAs/ControlNets built for SD 1.5/SDXL, or when stochastic sampling benefits diversity.
Use flow matching for new projects, when speed matters (fewer steps), when you want simpler code and math, or when using modern architectures (DiT/MMDiT). It's the clear direction the field is heading.
⚔ Adversarial: Your flow model generates excellent samples with 100 ODE steps but completely fails at 5 steps. DDIM (diffusion) works fine at 5 steps. Why?
Both models were trained on the same dataset with the same architecture (DiT-XL). Both achieve FID ~2 at their respective "full step" counts. But when you reduce to 5 steps: flow matching FID jumps to 45, while DDIM only degrades to FID ~8. Both use Euler integration with uniform timesteps.
🔨 Derivation: Flow Matching Velocity ↔ Score Function Connection

In score-based diffusion, the network learns sθ(x, t) ≈ ∇x log pt(x) (the score). In flow matching, it learns vθ(x, t) (the velocity). These are related.

For the OT path xt = (1-t)z + t x1 with z ~ N(0, I):

Your task: Show that vt(x) = x/t + ((1-t)/t) · ∇x log pt(x), i.e., the velocity field can be decomposed into a term that is linear in x and a score-scaled term.

Hint 1: At time t, xt = (1-t)z + t x1. Since z ~ N(0, I), and x1 is fixed for a conditional path: xt | x1 ~ N(t x1, (1-t)² I). The marginal pt(x) = ∫ N(x; t x1, (1-t)² I) pdata(x1) dx1 is a continuous mixture of Gaussians.
Hint 2: The marginal velocity is ut(x) = E[x1 - z | xt = x]. From xt = (1-t)z + t x1, we get z = (x - t x1)/(1-t). So x1 - z = x1 - (x - t x1)/(1-t) = (x1 - x)/(1-t). Therefore ut(x) = E[x1 | xt=x]/(1-t) - x/(1-t).
Hint 3: For a Gaussian kernel p(xt | x1) = N(t x1, (1-t)² I), Tweedie's formula gives: E[t x1 | xt] = x + (1-t)² ∇x log pt(x). So E[x1 | xt] = (x + (1-t)² ∇x log pt(x)) / t.

Full derivation:

1. The conditional distribution is p(xt | x1) = N(xt; t x1, (1-t)² I).

2. From the conditional velocity, ut(x) = E[(x1 - x)/(1-t) | xt = x] (derived in Hint 2).

3. Tweedie's formula for Gaussians: the posterior mean of the "clean" signal given noisy observation is E[μ | x] = x + σ² ∇x log p(x). Here μ = t x1, σ = (1-t), so:

E[t x1 | xt = x] = x + (1-t)² ∇x log pt(x)

4. Therefore E[x1 | xt = x] = [x + (1-t)² ∇x log pt(x)] / t.

5. Substituting into the velocity:

ut(x) = (E[x1 | xt] - x) / (1-t) = ([x + (1-t)² s(x,t)]/t - x) / (1-t)

Simplifying: distribute the 1/(1-t) factor over both terms:

ut(x) = x · (1/t - 1)/(1-t) + (1-t) · s(x,t)/t

Since (1/t - 1) = (1-t)/t, the first term reduces to x/t.

The exact relationship (cleanly stated): vt(x) = [E[x1|xt=x] - x] / (1-t), and via Tweedie: vt(x) = x/t + ((1-t)/t) · ∇x log pt(x), valid for 0 < t < 1.

The key insight: The velocity field and score function contain exactly the same information — they're connected by a time-dependent linear transformation. Learning one is equivalent to learning the other. The difference is purely in the loss weighting and path geometry, not in what the network fundamentally represents. This is why FM and score matching achieve similar final quality — they're learning the same object with different parameterizations.
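
The closing relationship simplifies to vt(x) = x/t + ((1-t)/t) · ∇x log pt(x), and it can be checked numerically in a case where everything is known in closed form. The sketch below assumes 1D data with x1 ~ N(0, 1) independent of z ~ N(0, 1), a toy choice made only so that pt, its score, and E[x1 | xt] are all analytic. The function names are illustrative.

```python
def sigma_sq(t):
    """Var[x_t] when z and x1 are independent N(0, 1): (1-t)^2 + t^2."""
    return (1 - t) ** 2 + t ** 2

def score(x, t):
    """Score of p_t = N(0, sigma_sq(t))."""
    return -x / sigma_sq(t)

def velocity_from_posterior(x, t):
    """(E[x1 | x_t = x] - x) / (1 - t), using E[x1 | x_t = x] = t x / sigma_sq(t)."""
    return (t * x / sigma_sq(t) - x) / (1 - t)

def velocity_from_score(x, t):
    """x / t + ((1 - t) / t) * score(x, t)."""
    return x / t + (1 - t) / t * score(x, t)

for t in (0.1, 0.5, 0.9):
    print(t, velocity_from_posterior(1.7, t), velocity_from_score(1.7, t))
```

Both routes give the same velocity at every tested (x, t), which matches the claim that velocity and score carry the same information for 0 < t < 1.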

"The shortest path between two truths in the real domain passes through the complex domain."
— Jacques Hadamard

You now understand flow matching: straight paths, simple training, fast sampling. The next generation of generative models is built on these ideas.