microFlow — From Noise to Data Along Straight Paths

Chapter 0: Paths Between Distributions

Generative modeling is fundamentally about transport: moving samples from a distribution we can easily sample (noise) to a distribution we want (data). Diffusion models do this via a winding, stochastic path with hundreds of steps. What if we could find a straight line instead?

Flow matching frames generation as learning a velocity field that transports particles from noise to data along smooth paths. At t=0, particles are random noise. At t=1, they've arrived at data samples. The velocity field v(x, t) tells each particle which direction to move at each moment.

The key idea: Instead of learning to remove noise (diffusion), learn a velocity field that pushes noise toward data. Straighter paths = fewer integration steps = faster generation.

The Simplest Possible Path

The path from noise to data is a plain linear interpolation. Given a noise sample x₀ ~ N(0, I) and a data sample x₁:

x_t = (1 - t) · x₀ + t · x₁

At t=0, x_t = x₀ (pure noise). At t=1, x_t = x₁ (pure data). At t=0.5, it's the average of noise and data. The velocity along this path is constant: v = x₁ - x₀. No noise schedule, no ᾱ_t, no reparameterization trick. Just a straight line.
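As a minimal sketch of the interpolation above, the snippet below evaluates x_t for small list-valued vectors in plain Python. The helper names `interpolate` and `velocity` and the sample values are illustrative assumptions, not part of any library. It also checks numerically that the slope of the path is x₁ - x₀ regardless of t.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (noise) to x1 (data) at time t."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity(x0, x1):
    """Constant velocity along the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


x0 = [0.5, -1.0]   # stand-in for a noise sample
x1 = [2.0, 1.0]    # stand-in for a data sample

midpoint = interpolate(x0, x1, 0.5)   # average of noise and data

# Finite-difference slope between two nearby times; the same for any t.
h = 1e-3
slope = [(b - a) / h
         for a, b in zip(interpolate(x0, x1, 0.3), interpolate(x0, x1, 0.3 + h))]
```

Changing 0.3 to any other time in [0, 1) should leave `slope` unchanged up to rounding, which is what "constant velocity" means here.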

Compare to diffusion: Diffusion uses x_t = √ᾱ_t · x₀ + √(1 - ᾱ_t) · ε with a carefully designed schedule for ᾱ_t. Flow matching uses x_t = (1-t) · x₀ + t · x₁. One requires a noise schedule with dozens of design choices. The other is a single line of arithmetic.

Particles Flowing: Noise to Data

Watch particles travel from random noise (t=0) toward two data clusters (t=1). Each particle follows the learned velocity field.


Check: What does the velocity field v(x, t) describe?

The amount of noise to add at each step
The direction and speed each particle should move at time t
The final position of each particle

Chapter 1: Continuous Normalizing Flows

In Chapter 0, we showed particles flowing from noise to data along paths. That's a powerful visual intuition — but to actually build this, we need to answer a precise question: how do we tell a particle where to go at each moment?

The answer is a velocity field. At every point in space and every moment in time, the velocity field says "move in this direction at this speed." If you've ever seen a weather map with wind arrows, that's a velocity field — each arrow tells air molecules where to flow. Our velocity field does the same thing for data particles.

The math tool for "follow a velocity field" is a differential equation — specifically, an ordinary differential equation (ODE). An ODE just says: "the rate of change of x equals some function of x and time." If you've done basic calculus, you've seen this as dx/dt = f(x, t). That's all an ODE is — a rule for how something changes over time.

This gives us the mathematical backbone of flow matching, called a Continuous Normalizing Flow (CNF):

dx/dt = v_θ(x, t),   x(0) ~ p_noise,   x(1) ~ p_data

Read this as: "starting from a noise sample x(0), follow the velocity field v from time t=0 to t=1, and you arrive at a data sample x(1)." The velocity field v_θ is parameterized by a neural network — the θ are the learnable weights. Training teaches the network what velocity to assign at each (x, t) so that the flow transforms noise into realistic data.

The network takes in x_t (the particle's current position, shape [B, C, H, W] — a batch of images) and the scalar time t, and outputs the velocity v (same shape [B, C, H, W]). Same input/output shape as a diffusion denoiser — but instead of predicting noise, it predicts direction of motion.

ODE vs SDE: deterministic vs random. You might hear about stochastic differential equations (SDEs) in diffusion models. An SDE adds random noise at each step, like a velocity field with random gusts of wind. An ODE has no randomness: given the same starting point, you always follow the same path. Flow matching uses ODEs, which gives deterministic paths and, through the instantaneous change-of-variables formula, exact log-likelihoods (at the cost of integrating the divergence of v along the path). Diffusion models can also be sampled through an equivalent probability-flow ODE, so the practical speed advantage of flow matching comes from straighter paths rather than from determinism alone.
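The difference can be sketched with a toy 1-D example. The velocity field `toy_velocity` (pulling x toward 2.0), the noise scale `sigma`, and the function names are assumptions chosen purely for illustration. The only point is that the Euler update for an ODE has no random term, while the Euler-Maruyama update for an SDE adds a Gaussian kick at each step.

```python
import math
import random


def toy_velocity(x, t):
    # Hypothetical field for illustration: pull x toward the value 2.0.
    return 2.0 - x


def euler_ode(x, n_steps=50):
    """Deterministic Euler integration of dx/dt = v(x, t) from t=0 to t=1."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + toy_velocity(x, i * dt) * dt
    return x


def euler_maruyama_sde(x, rng, n_steps=50, sigma=0.5):
    """Same drift, plus a random kick of scale sigma * sqrt(dt) per step."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        kick = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = x + toy_velocity(x, i * dt) * dt + kick
    return x


start = -1.0
ode_a = euler_ode(start)
ode_b = euler_ode(start)                               # identical to ode_a
sde_a = euler_maruyama_sde(start, random.Random(0))
sde_b = euler_maruyama_sde(start, random.Random(1))    # differs from sde_a
```

Two ODE runs from the same start agree exactly; two SDE runs agree only if they also share every random draw.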

Noise x(0)

x(0) ~ N(0, I)

↓ integrate dx/dt = v(x,t)

x(0.25)

Starting to take shape

↓ integrate

x(0.5)

Halfway there

↓ integrate

Data x(1)

A sample from p_data

Check: What type of equation governs a continuous normalizing flow?

An ordinary differential equation (ODE)
A partial differential equation (PDE)
A stochastic difference equation

Checkpoint — Before you move on

Explain in your own words: why does a velocity field v(x, t) define a unique trajectory for each particle, and what would go wrong if v depended on the particle's history (i.e., where it came from) rather than just its current position and time?


Model Answer

An ODE dx/dt = v(x, t) is a Markov system: the velocity depends only on where you are (x) and when (t), not how you got there. This means two particles at the same (x, t) must have the same velocity — their paths cannot cross at the same time. This is the uniqueness theorem for ODEs (Picard-Lindelöf): given a Lipschitz velocity field, each initial condition produces exactly one trajectory.

If v depended on history, you'd need a more complex representation (path-dependent or stochastic). The Markov property is what makes CNFs tractable — the network only needs to map (x, t) → v, not reason about trajectories.

This is also why paths crossing is problematic: at a crossing point, two particles at the same (x, t) need to go to different places. The network must learn a compromised velocity that sends them in an average direction — reducing sample quality. Reflow (Chapter 6) reduces crossings precisely to avoid this.

Chapter 2: Optimal Transport

We now have a framework: a velocity field v(x, t) that guides particles from noise to data. But we haven't said anything about which paths those particles should take. A velocity field could send particles along wild, looping curves — or along straight lines. Both get from noise to data, but one is far more efficient.

Optimal transport (OT) is the math of "what's the cheapest way to move mass from one pile to another?" Think of it like a shipping problem: you have warehouses (noise) and stores (data), and you want to minimize total distance traveled. The answer, for our case, turns out to be beautifully simple: straight paths with constant velocity.

Why does straightness matter? Because straighter paths can be integrated with fewer numerical steps. A perfectly straight path needs just one Euler step. Curved paths need many steps to follow accurately.

Curved vs Straight Paths

Compare diffusion-style curved paths (left) with OT straight paths (right). Straight paths need fewer integration steps.

Straight-line interpolation: The OT path from noise z to data x₁ is simply x_t = (1-t)z + t x₁. The velocity along this path is constant: v = x₁ - z. This is as simple as it gets.

x_t = (1-t) · z + t · x₁ → v = x₁ - z

Why Straight Paths Need Fewer Steps

Consider Euler integration: x_{t+dt} = x_t + v(x_t, t) · dt. If the true velocity is constant (straight path), then a single Euler step with dt=1 gives the exact answer. No numerical error at all. In practice, the velocity field isn't perfectly constant (different noise-data pairs have different velocities), so we need a few steps. But "a few" means 10-50, not 1000.

Contrast with diffusion: the variance-preserving path curves through high-dimensional space. The velocity changes direction at every point, so you need many small steps to track the curve accurately. This is the fundamental geometric reason flow matching is faster.
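The step-count argument can be checked on a toy pair of 2-D points. In this sketch the endpoints, the quarter-circle "curved" path, and all function names are assumptions for illustration. The velocity depends only on t because we follow a single fixed (z, x₁) pair, not a learned field.

```python
import math


def euler(velocity, x, n_steps):
    """Integrate dx/dt = velocity(t) from t=0 to t=1 with n_steps Euler steps."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        v = velocity(i * dt)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x


def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


z, x1 = [1.0, 0.0], [0.0, 1.0]          # toy noise and data points


def straight_velocity(t):
    # Straight path x_t = (1-t) z + t x1 has constant velocity x1 - z.
    return [b - a for a, b in zip(z, x1)]


def curved_velocity(t):
    # Curved path x_t = cos(pi t/2) z + sin(pi t/2) x1, differentiated in t.
    a = math.pi / 2
    return [a * (-math.sin(a * t) * zi + math.cos(a * t) * xi)
            for zi, xi in zip(z, x1)]


straight_error = distance(euler(straight_velocity, z, 1), x1)
curved_errors = {n: distance(euler(curved_velocity, z, n), x1) for n in (1, 5, 50)}
```

One Euler step lands exactly on x₁ for the straight path, while the curved path's endpoint error shrinks only gradually as the step count grows.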

Check: Why are straight transport paths better than curved ones?

They can be integrated accurately with fewer steps
They produce higher resolution images
They require a larger neural network

🔨 Derivation: Why the Straight Line Minimizes Transport Cost

Given a fixed starting point z and endpoint x₁, consider all smooth paths x(t) with x(0) = z, x(1) = x₁. The kinetic energy (transport cost) of a path is:

KE = ∫₀¹ ||dx/dt||² dt

Your task: Prove that the straight-line path x(t) = (1-t)z + t x₁ minimizes this kinetic energy among all paths connecting z to x₁.

For x(t) = (1-t)z + t x₁, the velocity is dx/dt = x₁ - z, which is constant (does not depend on t). So KE_straight = ||x₁ - z||².

By Cauchy-Schwarz: (∫₀¹ ||dx/dt|| dt)² ≤ (∫₀¹ 1² dt)(∫₀¹ ||dx/dt||² dt) = KE. The left side is the path length squared. What is the minimum possible length of a path from z to x₁?

Cauchy-Schwarz gives equality iff ||dx/dt|| is constant along the path. Combined with the minimum-length constraint: the shortest path is the straight line, and constant speed along that line gives the unique minimizer.

Full derivation:

1. For any path x(t) from z to x₁, the length is L = ∫₀¹ ||dx/dt|| dt. The straight-line distance is ||x₁ - z||, and by the triangle inequality, L ≥ ||x₁ - z|| with equality iff the path is a straight line.

2. By Cauchy-Schwarz on the functions f(t) = 1 and g(t) = ||dx/dt||:

L² = (∫₀¹ ||dx/dt|| dt)² ≤ (∫₀¹ dt)(∫₀¹ ||dx/dt||² dt) = KE

3. Therefore KE ≥ L² ≥ ||x₁ - z||². The straight-line path achieves KE = ||x₁ - z||² exactly (constant velocity), so it is the unique minimizer.

The key insight: Minimizing kinetic energy simultaneously forces the path to be (a) straight (shortest length) and (b) constant-speed (no acceleration). This is the single-particle building block of dynamic optimal transport (the Benamou-Brenier formulation): for the squared-distance cost in Euclidean space, mass moves along straight lines at constant speed.
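The inequality KE ≥ ||x₁ - z||² can also be checked numerically. The sketch below approximates the kinetic energy with finite differences for three paths between the same endpoints. The endpoints, the two alternative paths, and the function names are assumptions picked for illustration.

```python
import math


def kinetic_energy(path, n=2000):
    """Approximate the integral of ||dx/dt||^2 over [0, 1] with finite differences."""
    dt = 1.0 / n
    total = 0.0
    for i in range(n):
        a, b = path(i * dt), path((i + 1) * dt)
        speed_sq = sum(((bj - aj) / dt) ** 2 for aj, bj in zip(a, b))
        total += speed_sq * dt
    return total


z, x1 = [0.0, 0.0], [3.0, 4.0]          # toy endpoints, ||x1 - z||^2 = 25


def straight(t):                         # straight line at constant speed
    return [(1 - t) * a + t * b for a, b in zip(z, x1)]


def straight_uneven(t):                  # same line, but accelerating (t -> t^2)
    return straight(t * t)


def detour(t):                           # same endpoints, with a sideways bump
    x, y = straight(t)
    return [x + math.sin(math.pi * t), y]


ke_straight = kinetic_energy(straight)
ke_uneven = kinetic_energy(straight_uneven)
ke_detour = kinetic_energy(detour)
```

The accelerating path stays on the straight line but pays for uneven speed (step 2 of the derivation), and the detour pays for extra length (step 1); only the constant-speed straight line reaches the lower bound of 25.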

🔗 Pattern Recognition

Optimal Transport ↔ Physics of Least Action

Flow Matching (this lesson)

Minimize ∫ ||dx/dt||² dt → straight-line path at constant speed

Classical Mechanics

Minimize ∫ ½m||v||² dt (action) → straight-line trajectory (no forces)

The OT problem is the free-particle Lagrangian. The velocity field v(x, t) plays the role of a physical velocity, and the constant-velocity solution is Newton's first law: with no force there is no acceleration. Each conditional path in flow matching is exactly such a free particle; the marginal velocity field bends only because it averages over many (noise, data) pairs.

Where else does "minimize squared velocity integrated over time" appear? (Hint: think Kalman smoothing, spline interpolation, geodesics on manifolds.)

Chapter 3: Conditional Flow Matching

We know the ideal path for each particle: a straight line from its specific noise point z to its specific data point x₁. But here's the problem: during generation, the network sees an intermediate point x_t and has to decide where to send it — without knowing which x₁ it's headed toward. The "true" velocity at x_t would require averaging over every possible data point it might be going to. That's impossibly expensive to compute (mathematicians call this intractable — meaning "can't be computed in practice").

Conditional Flow Matching (CFM) is the elegant trick that makes training possible anyway. Instead of computing that impossible average, we train on individual pairs. For each training pair (noise z, data x₁), we know exactly where the particle should go: from z to x₁ in a straight line. The conditional velocity for this specific pair is simply u_t = x₁ - z (the direction from noise to data, constant along the path).

The beautiful mathematical insight (proven by Lipman et al., 2023): if you train the network to match these per-pair velocities across many random pairs, the network automatically learns the correct average velocity at every point. You never have to compute the intractable average explicitly — gradient descent finds it for you.

L_CFM = E_{t, z, x₁} [ || v_θ(x_t, t) - (x₁ - z) ||² ]

Why this works: The conditional and marginal flow matching losses have identical gradients (proven by Lipman et al., 2023). We can train on simple per-sample paths while implicitly learning the complex marginal flow. No intractable integrals required.

Here's the concrete difference this makes. Without CFM, the target velocity at a point x_t would require averaging over all possible noise-data pairs that pass through x_t — an intractable integral. With CFM, for each training sample we just compute x₁ - z (a single subtraction). The magic is that gradient descent over many such samples converges to the correct marginal velocity field.
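A tiny 1-D experiment illustrates why regressing on per-pair targets recovers the average. In this sketch three equally likely (z, x₁) pairs are chosen (an assumption for illustration) so that their straight paths all pass through x_t = 0 at t = 0.5. A single prediction at that point cannot match all three targets, and the squared-error minimizer turns out to be their mean, which is the marginal velocity there.

```python
def cfm_loss_at_point(v, targets):
    """Mean squared error between one predicted velocity v and per-pair targets."""
    return sum((v - u) ** 2 for u in targets) / len(targets)


# Toy 1-D setup (an assumption for illustration): three noise/data pairs whose
# straight paths all pass through the same point x_t = 0 at t = 0.5.
pairs = [(-1.0, 1.0), (1.0, -1.0), (-2.0, 2.0)]        # (z, x1)
assert all(abs(0.5 * z + 0.5 * x1) < 1e-12 for z, x1 in pairs)

targets = [x1 - z for z, x1 in pairs]                   # per-pair velocities
marginal = sum(targets) / len(targets)                  # conditional average

# Scan candidate predictions on a grid; the best one should sit at the average.
candidates = [i / 100 for i in range(-500, 501)]
best = min(candidates, key=lambda v: cfm_loss_at_point(v, targets))
```

A real network does this implicitly at every (x, t): gradient descent on the per-pair loss pulls its output toward the conditional mean of the targets it sees there.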

The notation shift. In flow matching papers, noise is called x₀ or z (starting point at t=0), and data is called x₁ (endpoint at t=1). This is the opposite of diffusion notation, where x₀ is the clean data and x_T is noise. Don't let this trip you up — the convention follows the direction of generation (t goes from 0 to 1 in flow matching, from T to 0 in diffusion).

Sample

Pick z ~ N(0,I) and x₁ from dataset

↓

Interpolate

x_t = (1-t)z + t x₁

↓

Target velocity

u_t = x₁ - z

↓

Loss

|| v_θ(x_t, t) - u_t ||²

Check: What does Conditional Flow Matching train on?

The marginal velocity field across all data
Per-sample straight-line velocities (x₁ - z)
The noise prediction like in diffusion

🔨 Derivation: CFM Loss = Marginal FM Loss (in expectation)

The marginal flow matching loss is L_FM = E_{t, x~p_t}[||v_θ(x, t) - u_t(x)||²], where u_t(x) is the true marginal velocity at (x, t). This is intractable because u_t(x) = E_{x₁|x_t=x}[x₁ - z] requires knowing p(x₁ | x_t).

The conditional FM loss is L_CFM = E_{t, z, x₁}[||v_θ(x_t, t) - (x₁ - z)||²] where x_t = (1-t)z + tx₁.

Your task: Show that ∇_θ L_CFM = ∇_θ L_FM. (They have the same gradient, so optimizing one optimizes the other.)

||v_θ - u||² = ||v_θ||² - 2⟨v_θ, u⟩ + ||u||². The gradient w.r.t. θ only sees the first two terms. The ||u||² term is a constant. So the gradient is determined by E[∇_θ||v_θ||² - 2⟨∇_θv_θ, u⟩]. The ||v_θ||² term is the same in both losses (same θ-dependence). So you just need to show the cross-term matches.

The CFM cross-term is E_{t, z, x₁}[⟨v_θ(x_t, t), x₁ - z⟩]. Condition on (x_t, t): this becomes E_{t, x_t}[⟨v_θ(x_t, t), E[x₁ - z | x_t, t]⟩]. What is E[x₁ - z | x_t, t]?

By definition, the marginal velocity u_t(x) = E[x₁ - z | x_t = x]. So the conditional expectation of the per-sample velocity IS the marginal velocity. The cross-terms match.

Full derivation:

1. Write both losses with the squared norm expanded:

L = E[||v_θ||²] - 2E[⟨v_θ, u⟩] + E[||u||²]

The last term doesn't depend on θ, so ∇_θL is determined by the first two terms.

2. The first term E[||v_θ(x_t, t)||²] is the same in both losses because both sample x_t from p_t (the CFM interpolation samples from the same marginal distribution p_t by construction).

3. For the cross-term in L_CFM:

E_{t, z, x₁}[⟨v_θ(x_t, t), x₁ - z⟩] = E_{t, x_t}[⟨v_θ(x_t, t), E[x₁ - z | x_t, t]⟩]

4. The conditional expectation E[x₁ - z | x_t = x, t] is precisely the definition of the marginal velocity field u_t(x). So:

= E_{t, x_t}[⟨v_θ(x_t, t), u_t(x_t)⟩]

This is exactly the cross-term in L_FM. Since both relevant terms match, ∇_θL_CFM = ∇_θL_FM. QED.

The key insight: The per-sample velocity (x₁ - z) is a noisy but unbiased estimate of the marginal velocity u_t(x). SGD with unbiased gradients converges to the same optimum as the full gradient. This is why we can train on simple per-sample targets and implicitly learn the complex marginal flow.

🔗 Pattern Recognition

Flow Matching ↔ Diffusion (Same Gradient, Different Parameterization)

Flow Matching (this lesson)

Target = x₁ - z (velocity along straight path)
Path: x_t = (1-t)z + t x₁

Diffusion

Target = ε (noise added)
Path: x_t = √ᾱ_t x₀ + √(1-ᾱ_t) ε

Both are instances of denoising score matching: train a network to predict a target direction from a noisy intermediate. The difference is the path geometry: diffusion uses a variance-preserving arc on a hypersphere, flow matching uses a straight line through ambient space. Lipman et al. (2023) showed that DDPM is literally a special case of flow matching with a specific (non-optimal) path choice.

If you change the FM path to x_t = cos(πt/2) z + sin(πt/2) x₁, what familiar framework do you recover? (Answer: variance-preserving diffusion with a cosine schedule.)
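One way to see why the cosine path counts as variance-preserving is to compare the coefficients of the two paths. The sketch below assumes, for illustration only, that z and x₁ are independent with unit variance per coordinate, so Var[x_t] is the sum of the squared coefficients. The function names are hypothetical.

```python
import math


def linear_coeffs(t):
    """Coefficients (on z, on x1) for the straight path x_t = (1-t) z + t x1."""
    return 1 - t, t


def cosine_coeffs(t):
    """Coefficients for the path x_t = cos(pi t/2) z + sin(pi t/2) x1."""
    return math.cos(math.pi * t / 2), math.sin(math.pi * t / 2)


def variance(coeffs, t):
    # For independent z and x1 with unit variance per coordinate (an assumption
    # of this sketch), Var[x_t] = a^2 + b^2.
    a, b = coeffs(t)
    return a * a + b * b


times = [i / 10 for i in range(11)]
linear_var = [variance(linear_coeffs, t) for t in times]
cosine_var = [variance(cosine_coeffs, t) for t in times]
```

Under that assumption the cosine path keeps the variance at 1 for every t, while the straight path dips to 0.5 at the midpoint.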

Chapter 4: Training

Training a flow matching model is arguably even simpler than training a diffusion model. For each training step:

Sample noise z ~ N(0, I)
Sample a data point x₁ from the dataset
Sample a random time t ~ U(0, 1)
Compute x_t = (1-t)z + t x₁
Predict velocity: v̂ = v_θ(x_t, t)
Loss = || v̂ - (x₁ - z) ||²

The Complete Training Loop

Here it is in full. Compare this to the diffusion training loop — notice what's missing:

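python
import torch
import torch.nn.functional as F

# Assumes network, optimizer, and dataloader are already defined.
for x_1 in dataloader:                     # 1. Sample data, shape [B, C, H, W]
    B = x_1.shape[0]
    x_0 = torch.randn_like(x_1)            # 2. Sample noise ~ N(0, I)
    t = torch.rand(B, 1, 1, 1, device=x_1.device)  # 3. Sample t ~ Uniform[0, 1]

    x_t = (1 - t) * x_0 + t * x_1          # 4. Linear interpolation
    target = x_1 - x_0                     # 5. Target velocity (constant!)

    v_hat = network(x_t, t.view(B))        # 6. Predict velocity; t passed as shape [B]
    loss = F.mse_loss(v_hat, target)       # 7. MSE loss

    optimizer.zero_grad()                  # clear gradients from the previous step
    loss.backward()
    optimizer.step()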

What's missing compared to diffusion? No noise schedule (β_t, ᾱ_t). No cumulative product computation. No reparameterization. No variance-preserving formula. The interpolation is (1-t)*x_0 + t*x_1 and the target is x_1 - x_0. That's the entire mathematical content of the training loop. This simplicity is why flow matching is becoming the default for new systems.

Interactive: Velocity Field

Arrows show the learned velocity field at time t. At t=0, velocities point toward data. At t=1, they've converged.


Check: What is the training target in flow matching?

The velocity vector (x₁ - z) pointing from noise to data
The noise that was added
The clean data point x₁ directly

💻 Build It: Implement the CFM Training Step from Scratch

You've seen the training loop above. Now implement the core cfm_loss function that computes a single training step. This is the complete mathematical content of flow matching training — everything else is standard deep learning boilerplate.

signature
def cfm_loss(model, x_1, noise_fn=torch.randn_like):
    """Compute the conditional flow matching loss for a batch.

    Args:
        model: Neural network that takes (x_t, t) and returns predicted velocity.
            x_t shape: [B, C, H, W], t shape: [B]
        x_1: Batch of data samples, shape [B, C, H, W]
        noise_fn: Function to sample noise (default: standard Gaussian)

    Returns:
        loss: Scalar MSE loss between predicted and target velocity
    """

Test case

Given x_1 = [[1.0, 2.0]], z = [[0.0, 0.0]], t = 0.3:
x_t = (1-0.3)*[[0,0]] + 0.3*[[1,2]] = [[0.3, 0.6]]
target velocity = [[1,2]] - [[0,0]] = [[1.0, 2.0]]
If model predicts [[0.8, 1.9]], loss = mean([(0.8-1)², (1.9-2)²]) = mean([0.04, 0.01]) = 0.025
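The arithmetic of this test case can be reproduced without PyTorch. The sketch below uses plain lists and a hypothetical `mse` helper that mirrors what an element-wise mean squared error computes; the "model prediction" is simply the hard-coded vector from the test case.

```python
def mse(pred, target):
    """Mean of squared element-wise differences (what an MSE loss averages)."""
    diffs = [(p - q) ** 2 for p, q in zip(pred, target)]
    return sum(diffs) / len(diffs)


x_1, z, t = [1.0, 2.0], [0.0, 0.0], 0.3

x_t = [(1 - t) * a + t * b for a, b in zip(z, x_1)]    # interpolated point
target = [b - a for a, b in zip(z, x_1)]               # target velocity x_1 - z
loss = mse([0.8, 1.9], target)                         # hard-coded model output
```

A correct `cfm_loss` should return the same value when its noise, time, and model output are pinned to these numbers.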

t needs shape [B, 1, 1, 1] to broadcast against [B, C, H, W] tensors. Use t = torch.rand(B, 1, 1, 1) or reshape after sampling. The model expects t as shape [B], so pass t.view(B). t.squeeze() also works for B > 1, but it collapses to a 0-dim tensor when B = 1.

python
import torch
import torch.nn.functional as F

def cfm_loss(model, x_1, noise_fn=torch.randn_like):
    B = x_1.shape[0]

    # 1. Sample noise (same shape as data)
    z = noise_fn(x_1)                      # [B, C, H, W]

    # 2. Sample time uniformly in [0, 1]
    t = torch.rand(B, 1, 1, 1, device=x_1.device)  # [B,1,1,1] for broadcasting

    # 3. Linear interpolation: x_t = (1-t)*z + t*x_1
    x_t = (1 - t) * z + t * x_1             # [B, C, H, W]

    # 4. Target velocity = endpoint - startpoint
    target = x_1 - z                         # [B, C, H, W]

    # 5. Network predicts velocity at (x_t, t); view(B) keeps t as shape [B] even when B == 1
    v_pred = model(x_t, t.view(B))          # [B, C, H, W]

    # 6. MSE between predicted and target velocity
    return F.mse_loss(v_pred, target)

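Bonus challenge: Modify this to support time-weighted loss: weight the MSE by w(t) = 1/(1-t+ε) to emphasize accuracy near t=1 (where samples must be close to data). SD3 pursues a related idea in practice, but by changing how t is sampled (a logit-normal density that favors mid-range timesteps) rather than by using this exact weight.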

Chapter 5: Sampling

To generate a sample, start from noise z ~ N(0, I) and integrate the learned ODE forward from t=0 to t=1. The simplest method is Euler integration: take N evenly spaced steps, at each step nudging x by the predicted velocity times the step size.

x_{t+Δt} = x_t + Δt · v_θ(x_t, t),   where Δt = 1/N

Because the paths are approximately straight, Euler's method works well even with few steps. Higher-order solvers (Midpoint, RK4) give better accuracy for the same step count, but even plain Euler with 20-50 steps produces excellent results.
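To make the Euler vs higher-order comparison concrete, the sketch below integrates a toy curved field with both Euler and the midpoint method. The field (a quarter-turn rotation about the origin) and all function names are assumptions for illustration; a real sampler would call the trained network in place of `velocity`.

```python
import math


def velocity(x, t):
    # Toy curved field (an assumption for illustration): rotation about the
    # origin that completes a quarter turn over t in [0, 1].
    a = math.pi / 2
    return [-a * x[1], a * x[0]]


def euler_step(x, t, dt):
    v = velocity(x, t)
    return [xi + dt * vi for xi, vi in zip(x, v)]


def midpoint_step(x, t, dt):
    # Evaluate the velocity at a half-step estimate, then take the full step.
    half = euler_step(x, t, dt / 2)
    v_mid = velocity(half, t + dt / 2)
    return [xi + dt * vi for xi, vi in zip(x, v_mid)]


def integrate(step, x, n_steps):
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = step(x, i * dt, dt)
    return x


def error(x, exact):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, exact)))


start, exact_end = [1.0, 0.0], [0.0, 1.0]      # a quarter turn maps (1,0) to (0,1)
euler_err = error(integrate(euler_step, start, 10), exact_end)
midpoint_err = error(integrate(midpoint_step, start, 10), exact_end)
```

With 10 steps each, the midpoint error is more than an order of magnitude smaller on this field. The trade-off is that each midpoint step evaluates the velocity twice, so the fair comparison for a neural sampler is per network evaluation rather than per step.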

The Sampling Loop

python
import torch

# Generate an image from pure noise (network is the trained velocity model)
x = torch.randn(1, 4, 64, 64)      # start from noise (latent space)
N = 20                             # number of Euler steps
dt = 1.0 / N

with torch.no_grad():              # no gradients needed at inference
    for i in range(N):
        t = torch.full((x.shape[0],), i / N)   # current time as a [B] tensor: 0, 0.05, 0.10, ...
        v = network(x, t)          # predict velocity at current position
        x = x + v * dt             # take one Euler step
# x is now an approximate sample from the data distribution

Compare this to DDPM's 1000 steps with stochastic noise injection at each step. Flow matching sampling is an ODE (deterministic), not an SDE (stochastic). Same noise in → same image out. Every time.

Method	Typical Steps	Stochastic?	Same initial noise → same output?
DDPM	1000	Yes (SDE)	No
DDIM	50	No (ODE)	Yes
Flow Matching	20-50	No (ODE)	Yes

Interactive: Euler Steps Along the Flow

Watch particles integrate from noise to data. Adjust the step count: more steps = more accurate paths.


The straight-path advantage: If paths were perfectly straight, one Euler step would be exact. In practice, learned paths are nearly straight, so 20-50 steps suffice. Compare with DDPM's 1000 steps!

Check: Why does flow matching need fewer sampling steps than diffusion?

The transport paths are nearly straight, so simple integration works
It uses a bigger neural network
It generates lower quality images

💥 Break-It Lab: What Dies When You Break the Sampling ODE?

A working flow matching sampler integrates dx/dt = v(x, t) from t=0 to t=1 using Euler steps. Particles start at noise and converge to data clusters. Below you can break three components and watch what happens.

Use Curved Paths (sinusoidal)

Failure mode: Curved velocities change direction at every point. With few Euler steps, particles overshoot at turns and miss the target entirely. The approximation error compounds multiplicatively at each step. You need 5-10x more steps to get the same accuracy as straight paths.

Feed Wrong Time (always t=0.5)

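Failure mode: The network uses t to know how far along the path a particle is, and therefore which destinations are still plausible. Along a straight path the speed is roughly constant, so the damage is mostly to direction rather than magnitude. Near t=0 the field should point toward a broad average of the data, and near t=1 it should commit to a specific nearby data point. Feeding t=0.5 everywhere applies the mid-path field to particles at every stage: wrong time → wrong target → particles are steered poorly early on and never get the fine correction they need late.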

Start from Data (no noise at t=0)

Failure mode: If particles start at data instead of noise, the velocity field has nothing to do — particles are already at the destination. But the network was trained for the noise→data direction. Starting at data, it tries to "move" particles that are already there, overshooting into nonsense regions with no training signal. Garbage in, garbage out.

Chapter 6: Reflow & Distillation

Even though flow matching paths are straighter than diffusion, they're not perfectly straight. Paths from different noise-data pairs can cross each other, forcing the network to learn a curved velocity field to avoid collisions. Reflow straightens paths further.

The procedure is elegant: (1) Take a trained flow model. (2) Sample noise z ~ N(0, I), run the model forward to get the corresponding data point x₁ = ODE(z). Now you have a (z, x₁) pair that the model actually maps. (3) Retrain a new model on straight lines between these pairs. Because the pairs are already coupled (the model mapped z to x₁), the straight-line approximation is much closer to the true path.

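python
# Reflow: straighten paths by retraining on coupled pairs
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Step 1: Generate coupled (noise, data) pairs from the trained model.
# ode_solve is an assumed helper that integrates dx/dt = old_model(x, t), e.g. with Euler steps.
z = torch.randn(10000, 4, 64, 64)
with torch.no_grad():
    x_1 = ode_solve(old_model, z, t=0, t_end=1)   # run the trained model
loader = DataLoader(TensorDataset(z, x_1), batch_size=64, shuffle=True)  # store the pairs

# Step 2: Retrain on straight lines between these pairs (new_model, optimizer assumed defined)
for z_b, x_1_b in loader:
    B = z_b.shape[0]
    t = torch.rand(B, 1, 1, 1)
    x_t = (1 - t) * z_b + t * x_1_b               # straight line between COUPLED pair
    loss = F.mse_loss(new_model(x_t, t.view(B)), x_1_b - z_b)   # velocity = x_1 - z
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()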

Each iteration makes paths straighter. After 2-3 reflow iterations, paths are nearly straight enough for 1-step generation.
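Why re-coupling removes crossings is easiest to see in one dimension. The sketch below is a toy illustration, not an implementation of reflow: it relies on the assumption that trajectories of a 1-D ODE cannot pass each other, so the coupling induced by a flow is order-preserving, and it uses sorted pairing as a stand-in for that coupling. The data distribution and names are hypothetical.

```python
import random


def count_crossings(pairs):
    """Number of pairs of straight 1-D paths that cross for some t in (0, 1)."""
    crossings = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            (z_i, x_i), (z_j, x_j) = pairs[i], pairs[j]
            # Two paths cross iff their order at t=0 and at t=1 disagree.
            if (z_i - z_j) * (x_i - x_j) < 0:
                crossings += 1
    return crossings


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
data = [rng.choice([-3.0, 3.0]) + 0.3 * rng.gauss(0.0, 1.0) for _ in range(200)]

# Independent coupling: each noise sample is paired with an unrelated data sample.
independent = list(zip(noise, data))

# Stand-in for the coupling a 1-D ODE flow induces (assumption: the flow map
# preserves order, so the k-th smallest noise goes to the k-th smallest data).
recoupled = list(zip(sorted(noise), sorted(data)))

before = count_crossings(independent)
after = count_crossings(recoupled)
```

With independent pairing, a large share of path pairs cross; with the order-preserving pairing none do, so straight lines between the re-coupled pairs never contradict each other. In higher dimensions the picture is less clean, which is one reason several reflow iterations may be needed.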

Initial Flow

Good but paths cross

↓ Reflow iteration 1

Straighter

Fewer crossings, lower curvature

↓ Reflow iteration 2

Nearly Straight

Suitable for 1-2 step generation

Distillation

Distillation takes a different approach: train a student model to mimic the teacher in fewer steps. Progressive distillation halves the step count repeatedly (64 → 32 → 16 → 8 → 4 → 2 → 1). Combined with reflow, this yields high-quality 1-4 step models.

Path Straightening

Compare paths before and after reflow. Straighter paths = fewer steps needed.


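The endgame: Reflow + distillation aims for single-step generation with diffusion-level quality. Few-step models such as Flux.1-schnell and SDXL-Turbo show that this regime is reachable in practice, although both are reported to rely on adversarial distillation rather than on reflow itself.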

Check: What does reflow do to transport paths?

Straightens them so fewer integration steps are needed
Makes them more curved for better quality
Removes the need for a neural network

⚔ Adversarial: You apply reflow 5 times and the loss keeps decreasing. But FID (sample quality) gets WORSE after iteration 3. What is happening?

Your model trains on the reflow pairs (z_i, x_{1,i}) generated by the previous iteration. Loss goes down every iteration: 0.42 → 0.31 → 0.24 → 0.19 → 0.16. But FID goes 12.1 → 9.8 → 8.4 → 9.2 → 11.5. What explains the divergence between loss and quality?

The network is overfitting to the training data
Reflow pairs accumulate ODE integration errors: later iterations train on progressively corrupted (z, x₁) pairs
The learning rate is too high for later iterations

Chapter 7: SD3 / Flux

Flow matching has gone from theory to production. Both Stable Diffusion 3 (Stability AI) and Flux (Black Forest Labs) use rectified flow matching as their core framework, combined with a new architecture: the MMDiT (Multimodal Diffusion Transformer).

MMDiT Architecture

MMDiT replaces the U-Net with a Transformer. Both the noisy latent patches and the text tokens are processed as separate streams that interact through joint attention layers. This bidirectional interaction gives the text genuine influence over image generation.

Text Stream

T5 + CLIP embeddings as tokens

↓

Joint Attention

Image and text tokens attend to each other

↑

Image Stream

Patchified noisy latent + positional encoding

Feature	SD 1.5	SDXL	SD3 / Flux
Architecture	U-Net	Larger U-Net	MMDiT
Framework	DDPM	DDPM	Flow matching
Text encoder	CLIP	CLIP + OpenCLIP	CLIP + T5-XXL
Resolution	512px	1024px	1024px+
Steps	20-50	20-40	20-30

From Theory to Production Numbers

Flux.1-schnell generates a 1024×1024 image in 4 steps. Let that sink in. DDPM needed 1000 steps for 256×256. That's a 250× reduction in steps at 16× the pixel count (4× the side length). The combination of flow matching (straight paths), rectified flow (straightened further), and distillation (student mimics teacher in fewer steps) made this possible.

Flux variants: Flux.1-pro (best quality, API-only), Flux.1-dev (open weights, guidance-distilled, ~20 steps), Flux.1-schnell (4-step distilled, fastest). The schnell variant shows how far distillation on top of a rectified-flow model can go: near-pro quality in roughly 5× fewer steps than dev (4 vs ~20).

Check: What architecture do SD3 and Flux use instead of U-Net?

A GAN discriminator
MMDiT (Multimodal Diffusion Transformer)
A VAE with skip connections

🏗 Design Challenge: You're the Architect: Flow Matching for Protein Structure Generation

You're designing a flow matching model for de novo protein backbone generation. Each protein is a chain of residues, each with a 3D position (x, y, z) and an orientation frame (rotation matrix in SO(3)). The output must be physically valid: no steric clashes, correct bond lengths, and the model should be equivariant to rotations and translations (rotating the input should rotate the output identically).

Data representation

500 residues × (3D position + 3×3 rotation) = 500 × 12 values

Symmetry requirement

SE(3) equivariance (rotation + translation)

Inference budget

≤ 10 seconds on A100 for one protein

Noise distribution

N(0, I) for positions, but what for rotations?

1. What noise distribution do you use for the rotation part? (N(0, I) on matrix entries violates SO(3). What's the "uniform noise" analog on a rotation manifold?)

2. How do you define a "straight line" between two rotations? (Linear interpolation of rotation matrices leaves SO(3).)

3. How many ODE steps do you need? What solver? (Consider: 500 residues interact pairwise, so each network forward pass is expensive.)

4. How do you enforce SE(3) equivariance in the velocity network architecture?

Real-world solution (FrameFlow / FoldFlow / Chroma):

1. Noise on SO(3): Use the uniform distribution on SO(3), or the isotropic Gaussian on SO(3) (IGSO(3)), which is the rotation analog of a Gaussian and approaches uniform as σ→∞. A simple approximate sampler exponentiates a random tangent vector: R = exp(skew(ξ)), ξ ~ N(0, σ²I).

2. Straight line on SO(3): Use the geodesic (shortest rotation path): R(t) = R₀ exp(t · log(R₀⁻¹R₁)). The "velocity" is in the tangent space (Lie algebra so(3)), and interpolation stays on the manifold. This is the Riemannian analog of linear interpolation.

3. ODE steps: Budget roughly 100-200 network evaluations. Each evaluation is a full forward pass through an SE(3)-equivariant transformer with pair interactions. With plain Euler at 500 residues: ~100 forward passes × 50ms each = 5 seconds on A100. Tight but feasible. An adaptive solver (Dormand-Prince/RK45) needs several forward passes per step, so count evaluations rather than steps against the 10-second budget.

4. Equivariance: Use an SE(3)-equivariant architecture (e.g., EGNN, IPA from AlphaFold2). Key insight: the velocity network predicts vectors in the local frame of each residue, so global rotations/translations automatically transform correctly. Frame-based representations (as in AlphaFold2's IPA) give equivariance by construction.
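The geodesic in item 2 can be sketched compactly with unit quaternions instead of rotation matrices. This is a substitution made here for brevity, not necessarily what the cited systems implement: spherical linear interpolation (slerp) between unit quaternions traces the same shortest rotation path as R(t) = R₀ exp(t · log(R₀⁻¹R₁)). The helper names and the example rotations are assumptions.

```python
import math


def slerp(q0, q1, t):
    """Geodesic between unit quaternions q0 and q1, stored as (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                      # q and -q are the same rotation; go the short way
        q1 = [-c for c in q1]
        dot = -dot
    dot = min(1.0, dot)
    angle = math.acos(dot)             # half the rotation angle between q0 and q1
    if angle < 1e-8:                   # nearly identical rotations
        return list(q0)
    s0 = math.sin((1 - t) * angle) / math.sin(angle)
    s1 = math.sin(t * angle) / math.sin(angle)
    return [s0 * a + s1 * b for a, b in zip(q0, q1)]


def rotation_about_z(theta):
    """Unit quaternion for a rotation by theta radians about the z-axis."""
    return [math.cos(theta / 2), 0.0, 0.0, math.sin(theta / 2)]


r0 = rotation_about_z(0.0)
r1 = rotation_about_z(math.pi / 2)
halfway = slerp(r0, r1, 0.5)           # a 45-degree rotation about z
norm = math.sqrt(sum(c * c for c in halfway))
```

Unlike a linear blend of matrix entries, every intermediate result stays a valid rotation (the quaternion keeps unit norm), which is the property the flow needs.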

Chapter 8: Flow vs Diffusion

Flow matching and diffusion are closely related — in fact, diffusion can be seen as a special case of flow matching with a particular (non-straight) path choice. But the differences matter in practice.

Aspect	Diffusion (DDPM)	Flow Matching
What it learns	Noise prediction ε_θ	Velocity field v_θ
Path shape	Curved (variance-preserving)	Straight (OT interpolation)
Math framework	SDE (stochastic)	ODE (deterministic)
Typical steps	20-1000	10-50
Noise schedule	β_t schedule (many choices)	Linear interpolation (one choice)
Training target	ε (noise)	x₁ - z (velocity)
Log-likelihood	Approximate (ELBO)	Exact (via ODE)

Side-by-Side: Diffusion vs Flow

Left: diffusion-style curved trajectories. Right: flow matching straight trajectories. Both reach the same target.

Why Flow Matching Is Winning

Four concrete reasons flow matching is becoming the default for new systems:

Simpler math. No noise schedule design (β_t linear vs cosine vs sigmoid). No ᾱ_t cumulative products. Just linear interpolation.
Faster sampling. Straight paths need 10-50 Euler steps instead of DDPM's 1000. Even compared to DDIM (50 steps), flow matching matches quality with fewer steps.
Deterministic by default. ODE, not SDE. Same initial noise = same output. Reproducibility is free.
This is what production uses. Stable Diffusion 3, Flux, π0 (robot policy) — all flow matching. The industry has voted.

Conditional Generation

Conditioning works exactly as in diffusion: inject class labels or text embeddings into the network. The velocity field becomes v(x_t, t, c) where c is the conditioning signal. Classifier-free guidance applies identically — train with condition dropout, at inference compute v_guided = v_uncond + w · (v_cond - v_uncond).
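The guidance formula above is a one-line extrapolation. The sketch below applies it to small list-valued velocities; the function name `guided_velocity` and the example numbers are illustrative assumptions standing in for two forward passes of the network.

```python
def guided_velocity(v_uncond, v_cond, w):
    """Classifier-free guidance: v_uncond + w * (v_cond - v_uncond), element-wise."""
    return [u + w * (c - u) for u, c in zip(v_uncond, v_cond)]


v_uncond = [0.0, 1.0]    # stand-in for the network output with the condition dropped
v_cond = [1.0, 3.0]      # stand-in for the network output with the condition

no_guidance = guided_velocity(v_uncond, v_cond, 0.0)   # equals v_uncond
plain_cond = guided_velocity(v_uncond, v_cond, 1.0)    # equals v_cond
amplified = guided_velocity(v_uncond, v_cond, 3.0)     # extrapolates past v_cond
```

Setting w = 1 recovers the plain conditional velocity; w > 1 pushes further along the direction in which the condition changes the prediction.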

When to Use Which?

Use diffusion when you have existing DDPM/DDIM infrastructure, need maximum compatibility with LoRAs/ControlNets built for SD 1.5/SDXL, or when stochastic sampling benefits diversity.

Use flow matching for new projects, when speed matters (fewer steps), when you want simpler code and math, or when using modern architectures (DiT/MMDiT). It's the clear direction the field is heading.

⚔ Adversarial: Your flow model generates excellent samples with 100 ODE steps but completely fails at 5 steps. DDIM (diffusion) works fine at 5 steps. Why?

Both models were trained on the same dataset with the same architecture (DiT-XL). Both achieve FID ~2 at their respective "full step" counts. But when you reduce to 5 steps: flow matching FID jumps to 45, while DDIM only degrades to FID ~8. Both use Euler integration with uniform timesteps.

Flow matching is inherently worse at low step counts
The flow model needs a different architecture for low-step generation
The flow model's paths cross heavily (not yet reflowed), causing high curvature that Euler can't track in 5 steps; DDIM's variance-preserving paths are designed to avoid crossing
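
To build intuition for how step count interacts with trajectory curvature, here is a toy experiment. It does not reproduce the FID numbers in the scenario. It only measures Euler endpoint error on two hand-written velocity fields that share the same start and end point: a quarter-turn rotation (curved) and a constant velocity (straight). All names are made up for this example.

```python
import numpy as np

def euler_integrate(velocity_fn, x0, num_steps):
    """Euler integration of dx/dt = velocity_fn(x, t) over t in [0, 1]."""
    x = np.array(x0, dtype=float)
    h = 1.0 / num_steps
    for k in range(num_steps):
        x = x + h * velocity_fn(x, k * h)
    return x

x0 = np.array([1.0, 0.0])
endpoint = np.array([0.0, 1.0])        # exact endpoint of both toy flows

omega = np.pi / 2.0                    # quarter turn over unit time

def curved_velocity(x, t):
    """Rotation about the origin: the trajectory is a circular arc."""
    return omega * np.array([-x[1], x[0]])

def straight_velocity(x, t):
    """Constant velocity: the trajectory is a straight segment."""
    return endpoint - x0

def endpoint_error(velocity_fn, num_steps):
    result = euler_integrate(velocity_fn, x0, num_steps)
    return float(np.linalg.norm(result - endpoint))

curved_err_5 = endpoint_error(curved_velocity, 5)
curved_err_100 = endpoint_error(curved_velocity, 100)
straight_err_5 = endpoint_error(straight_velocity, 5)
```

On the straight field, 5 steps already land on the endpoint up to rounding. On the curved field, the 5-step error is much larger than the 100-step error, because each Euler step follows a tangent that the true path immediately bends away from.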

🔨 Derivation: Flow Matching Velocity ↔ Score Function Connection

In score-based diffusion, the network learns s_θ(x, t) ≈ ∇_x log p_t(x) (the score). In flow matching, it learns v_θ(x, t) (the velocity). These are related.

For the OT path x_t = (1-t)z + t x₁ with z ~ N(0, I):

Your task: Show that v_t(x) = x/t + ((1-t)/t) · ∇_x log p_t(x) for 0 < t < 1, i.e., the velocity field decomposes into a term linear in x and a score-scaled term.

Hint 1: At time t, x_t = (1-t)z + t x₁. Since z ~ N(0, I), and x₁ is fixed for a conditional path: x_t | x₁ ~ N(t x₁, (1-t)² I). The marginal is p_t(x) = ∫ N(x; t x₁, (1-t)² I) p_data(x₁) dx₁, a mixture of Gaussians.

Hint 2: The marginal velocity is u_t(x) = E[x₁ - z | x_t = x]. From x_t = (1-t)z + t x₁, we get z = (x - t x₁)/(1-t). So x₁ - z = x₁ - (x - t x₁)/(1-t) = (x₁ - x)/(1-t). Therefore u_t(x) = E[x₁ | x_t=x]/(1-t) - x/(1-t).

Hint 3: For a Gaussian kernel p(x_t | x₁) = N(t x₁, (1-t)² I), Tweedie's formula gives: E[t x₁ | x_t] = x + (1-t)² ∇_x log p_t(x). So E[x₁ | x_t] = (x + (1-t)² ∇_x log p_t(x)) / t.

Full derivation:

1. The conditional distribution is p(x_t | x₁) = N(x_t; t x₁, (1-t)² I).

2. From the conditional velocity, u_t(x) = E[(x₁ - x)/(1-t) | x_t = x] (derived in Hint 2).

3. Tweedie's formula for Gaussians: the posterior mean of the "clean" signal given noisy observation is E[μ | x] = x + σ² ∇_x log p(x). Here μ = t x₁, σ = (1-t), so:

E[t x₁ | x_t = x] = x + (1-t)² ∇_x log p_t(x)

4. Therefore E[x₁ | x_t = x] = [x + (1-t)² ∇_x log p_t(x)] / t.

5. Substituting into the velocity:

u_t(x) = (E[x₁ | x_t] - x) / (1-t) = ([x + (1-t)² s(x,t)]/t - x) / (1-t), where s(x,t) = ∇_x log p_t(x).

Simplifying the numerator: [x + (1-t)² s(x,t)]/t - x = x(1-t)/t + (1-t)² s(x,t)/t. Dividing by (1-t):

u_t(x) = x/t + ((1-t)/t) · s(x,t)

The exact relationship (cleanly stated): v_t(x) = [E[x₁|x_t=x] - x] / (1-t), and via Tweedie: v_t(x) = x/t + ((1-t)/t) · ∇_x log p_t(x), valid for 0 < t < 1.

The key insight: The velocity field and score function contain exactly the same information — they're connected by a time-dependent linear transformation. Learning one is equivalent to learning the other. The difference is purely in the loss weighting and path geometry, not in what the network fundamentally represents. This is why FM and score matching achieve similar final quality — they're learning the same object with different parameterizations.
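
As a sanity check on the velocity-score relationship, you can compare both sides in a case where everything is available in closed form. This sketch assumes a hypothetical 1-D data distribution x₁ ~ N(m, s²), for which x_t is Gaussian with mean t·m and variance (1-t)² + t²s². It computes the velocity once from the posterior mean E[x₁ | x_t] and once from the score, using the simplified form v_t(x) = x/t + ((1-t)/t) · ∇_x log p_t(x).

```python
import numpy as np

m, s = 1.5, 0.8  # hypothetical data distribution parameters: x1 ~ N(m, s^2)

def marginal_variance(t):
    """Variance of x_t = (1-t) z + t x1 with z ~ N(0, 1) independent of x1."""
    return (1.0 - t) ** 2 + (t * s) ** 2

def score(x, t):
    """Score of the Gaussian marginal p_t = N(t m, marginal_variance(t))."""
    return -(x - t * m) / marginal_variance(t)

def velocity_from_posterior_mean(x, t):
    """u_t(x) = (E[x1 | x_t = x] - x) / (1 - t), using Gaussian conditioning."""
    posterior_mean = m + t * s ** 2 * (x - t * m) / marginal_variance(t)
    return (posterior_mean - x) / (1.0 - t)

def velocity_from_score(x, t):
    """u_t(x) = x / t + ((1 - t) / t) * score."""
    return x / t + (1.0 - t) / t * score(x, t)

xs = np.linspace(-3.0, 3.0, 7)
max_gap = max(
    float(np.max(np.abs(velocity_from_posterior_mean(xs, t) - velocity_from_score(xs, t))))
    for t in (0.1, 0.5, 0.9)
)
```

If the two expressions agree, `max_gap` sits at floating-point noise level. The check only covers 0 < t < 1, since the score form divides by t and the posterior-mean form divides by (1 - t).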

"The shortest path between two truths in the real domain passes through the complex domain."

— Jacques Hadamard

You now understand flow matching: straight paths, simple training, fast sampling. The next generation of generative models is built on these ideas.
