Holderrieth & Erives, Chapter 2

Flow and Diffusion Models

The mathematical machinery of generation: vector fields, ODEs, SDEs, and the Euler method.

Prerequisites: Ch 1 (generation as sampling), basic calculus (derivatives). That's it.

Chapter 0: Why ODEs?

In the last chapter, we formalized generative modeling as sampling from a data distribution pdata. We need an algorithm that converts simple noise — say from a Gaussian — into complex, structured data. But how?

Here is the key intuition: imagine a particle starting at a random position in space (noise). We want it to end up at a position that looks like real data. What if we could define a velocity field that tells the particle which direction to move at every point in space and time? The particle follows this field like a leaf in a stream, and if the field is designed correctly, it arrives at a sample from pdata.

This is exactly what a differential equation does. An ordinary differential equation (ODE) defines a velocity field, and "solving" the ODE means following that field to trace out a trajectory. A stochastic differential equation (SDE) adds random jitter on top — like a leaf in a turbulent stream.

The big idea: Generative models are differential equations. The vector field is a neural network. Generating a sample = simulating the differential equation from noise to data. The entire question of "how to generate" reduces to "how to define and simulate the right vector field."

This chapter builds the mathematical toolkit piece by piece:

Vector Fields
A function that assigns a velocity to every point in space
↓ define
ODEs & Flows
Follow the velocity field to trace deterministic trajectories
↓ simulate with
Euler Method
Take small steps along the velocity: Xt+h = Xt + h · ut(Xt)
↓ add randomness via
Brownian Motion & SDEs
Add noise at each step: Xt+h = Xt + h · ut(Xt) + σt√h · ε
In a flow/diffusion generative model, what role does the neural network play?

Chapter 1: Vector Fields

A vector field is a function that assigns a velocity vector to every point in space and time:

u : Rd × [0, 1] → Rd,   (x, t) ↦ ut(x)

At position x and time t, the vector field ut(x) tells you the velocity — which direction and how fast to move. Think of it as a wind map: at every location and every moment, there is an arrow showing which way the wind blows and how strongly.

Worked example — linear vector field. Consider ut(x) = −θx for θ = 2. This field always points toward the origin. At x = 3, the velocity is −6 (move left). At x = −1, the velocity is +2 (move right). Every particle is pulled toward zero, and the further away you are, the stronger the pull.

Worked example — circular vector field. In 2D, ut(x1, x2) = (−x2, x1) creates circular trajectories. At (1, 0) the velocity is (0, 1) — pointing upward. At (0, 1) the velocity is (−1, 0) — pointing left. Particles orbit the origin.

Interactive Vector Field Explorer

Choose a vector field type and see the velocity arrows. Click anywhere to drop a particle and watch it follow the field.

Field type
θ 2.0

Worked example: saddle point vector field. In 2D, ut(x1, x2) = (θ x1, −θ x2) creates a saddle point. With θ = 2, the velocity at (1, 1) is (2, −2): pushed right and down. The x1 axis is unstable (grows), the x2 axis is stable (shrinks). Particles spread along x1 and compress along x2.
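The three worked examples can be checked with a few lines of plain Python. This is a minimal sketch; the function names are illustrative and the fields are written for scalar inputs rather than tensors.

```python
def linear_vf(x, theta=2.0):
    """1D linear field u(x) = -theta * x (pulls toward the origin)."""
    return -theta * x

def circular_vf(x1, x2):
    """2D rotation field u(x1, x2) = (-x2, x1)."""
    return (-x2, x1)

def saddle_vf(x1, x2, theta=2.0):
    """2D saddle field u(x1, x2) = (theta * x1, -theta * x2)."""
    return (theta * x1, -theta * x2)

print(linear_vf(3.0))         # velocity at x = 3
print(circular_vf(0.0, 1.0))  # velocity at (0, 1)
print(saddle_vf(1.0, 1.0))    # velocity at (1, 1)
```

At (1, 1) the saddle field has a positive x1 component and a negative x2 component, which is why particles spread along x1 while being compressed along x2.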

Mathematical properties of vector fields. A vector field ut(x) maps every point in Rd to a velocity vector in Rd. For our purposes, the vector field depends on both position x and time t. This time-dependence is crucial — the neural network needs to know "what time it is" to provide the correct velocity. At t = 0 (noise), the field should push particles apart. At t = 1 (data), it should guide them to the right locations.

python
# Defining vector fields in code
import torch

def linear_vf(x, theta=2.0):
    """u(x) = -theta * x (pulls toward origin)"""
    return -theta * x

def circular_vf(x, theta=1.0):
    """u(x1,x2) = (-theta*x2, theta*x1) (rotation)"""
    return theta * torch.stack([-x[...,1], x[...,0]], dim=-1)

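# Neural network vector field (what we actually use)
class NeuralVF(torch.nn.Module):
    def __init__(self, d, hidden=64):
        super().__init__()
        # Small MLP as a placeholder; the input is x concatenated with t
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d + 1, hidden),
            torch.nn.SiLU(),
            torch.nn.Linear(hidden, d),
        )

    def forward(self, x, t):
        """x: [batch, d], t: [batch, 1] -> velocity [batch, d]"""
        return self.net(torch.cat([x, t], dim=-1))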
Why vector fields matter: In a generative model, the neural network is the vector field. It takes a noisy point x and a time t, and outputs a velocity utθ(x). Generating a sample means following this velocity from t = 0 (noise) to t = 1 (data). The entire learning problem is: find parameters θ so that this velocity field transports noise into data.
For the linear vector field ut(x) = −2x, what is the velocity at x = 5?

Chapter 2: ODEs and Flows

An ordinary differential equation (ODE) says: "follow the vector field." Given a starting point x0, the ODE asks for a trajectory Xt whose velocity at every moment equals the vector field:

dXt/dt = ut(Xt),    X0 = x0

The first equation says the trajectory's velocity must match the vector field. The second equation says we start at x0. Together, they fully determine the trajectory — there is exactly one path through each starting point.

The solution to an ODE is called a flow, denoted ψt:

ψ : Rd × [0, 1] → Rd,   (x0, t) ↦ ψt(x0)

The flow ψt(x0) tells you: "if you start at x0 and follow the vector field for time t, where do you end up?" The trajectory is Xt = ψt(X0).

Three descriptions, one object: Vector fields, ODEs, and flows are three ways to describe the same thing. The vector field defines the ODE. The ODE's solution is the flow. Given any one, you can derive the other two.

Worked example — linear ODE. Let ut(x) = −θx for θ = 2. The ODE is dX/dt = −2X. The solution is:

ψt(x0) = e^(−2t) · x0

Verification: At t = 0, ψ0(x0) = e^0 · x0 = x0. Check. The derivative is dψt/dt = −2e^(−2t) · x0 = −2ψt(x0) = ut(ψt(x0)). Check. So the flow satisfies the ODE.

Numerical example. Start at x0 = 4, θ = 2:

t | ψt(4) = 4e^(−2t) | Velocity ut(ψt) = −2ψt
0.00 | 4.000 | −8.000
0.25 | 2.426 | −4.852
0.50 | 1.472 | −2.943
0.75 | 0.893 | −1.785
1.00 | 0.541 | −1.083

The particle decays exponentially toward zero. The velocity decreases as the particle gets closer to the origin.
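The table can be reproduced from the closed-form flow. The sketch below assumes nothing beyond the formula ψt(x0) = e^(−2t) · x0 and adds a finite-difference check that the flow's time derivative matches the vector field; the helper names are illustrative.

```python
import math

def flow(x0, t, theta=2.0):
    """Closed-form flow of dX/dt = -theta * X."""
    return math.exp(-theta * t) * x0

def velocity(x, theta=2.0):
    """Vector field u(x) = -theta * x."""
    return -theta * x

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    x = flow(4.0, t)
    print(f"t={t:.2f}  position={x:.3f}  velocity={velocity(x):.3f}")

# Central finite difference of the flow at t = 0.5, to compare with the field.
eps = 1e-6
fd = (flow(4.0, 0.5 + eps) - flow(4.0, 0.5 - eps)) / (2 * eps)
print(fd, velocity(flow(4.0, 0.5)))
```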

Existence and uniqueness (Theorem 3): If the vector field ut(x) is continuously differentiable with bounded derivatives (in practice this is assumed to hold for neural network vector fields), the ODE has a unique solution. This means: given any starting point, there is exactly one trajectory. No ambiguity, no branching. This is great news for generative modeling: the flow is well-defined.

Moreover, the flow ψt is a diffeomorphism — a smooth, invertible transformation with a smooth inverse. This means the flow "warps" space without tearing or folding it. Think of it as stretching a rubber sheet: every point maps to exactly one other point, and you can always undo the deformation.

Why diffeomorphisms matter: The flow maps the noise distribution N(0, I) to the data distribution pdata. Because the map is invertible, we can also go backwards: given a data point z, we can find the noise point (ψ1)^(−1)(z), the inverse of the time-1 flow map applied to z, that generated it. This invertibility is useful for computing likelihoods and for encoding data into a latent space.

Worked example — 2D rotation flow. For the circular vector field u(x1, x2) = (−x2, x1), the flow is a rotation:

ψt(x0) = (x0,1 cos(t) − x0,2 sin(t),   x0,1 sin(t) + x0,2 cos(t))

At t = π/2: the flow rotates everything 90 degrees counterclockwise. At t = π: 180 degrees. The inverse is simply rotating in the opposite direction.
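A small sketch of the rotation flow and its inverse, using only the formula above. The function name is illustrative.

```python
import math

def rotation_flow(x0, t):
    """Flow of u(x1, x2) = (-x2, x1): rotate x0 counterclockwise by angle t."""
    x1, x2 = x0
    c, s = math.cos(t), math.sin(t)
    return (x1 * c - x2 * s, x1 * s + x2 * c)

start = (1.0, 0.0)
quarter = rotation_flow(start, math.pi / 2)   # close to (0, 1)
back = rotation_flow(quarter, -math.pi / 2)   # inverse: rotate by -t
print(quarter, back)
```

Rotating by t and then by −t returns the starting point up to floating-point rounding, which is the invertibility property of the flow in its simplest form.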

For the ODE dX/dt = −2X with X0 = 4, what is X1?

Chapter 3: The Euler Method

We can analytically solve simple ODEs like dX/dt = −θX. But when the vector field ut(x) is a neural network with millions of parameters, there is no formula for the flow. We need to simulate the ODE numerically.

The Euler method is the simplest simulation technique. The idea: take small discrete steps along the vector field. Starting at X0 = x0, update iteratively:

Xt+h = Xt + h · ut(Xt)

where h = 1/n is the step size and n is the number of steps. We evaluate the velocity at the current position, multiply by the step size, and add it to get the new position. After n steps, we reach t = 1.

Worked example — Euler for dX/dt = −2X. Let X0 = 4, n = 4 steps, so h = 0.25:

Step | t | Xt | ut(Xt) = −2Xt | Xt+h = Xt + 0.25 · u
1 | 0.00 | 4.000 | −8.000 | 4 + 0.25(−8) = 2.000
2 | 0.25 | 2.000 | −4.000 | 2 + 0.25(−4) = 1.000
3 | 0.50 | 1.000 | −2.000 | 1 + 0.25(−2) = 0.500
4 | 0.75 | 0.500 | −1.000 | 0.5 + 0.25(−1) = 0.250

Euler gives X1 ≈ 0.250. The exact answer is 4e^(−2) ≈ 0.541. The error is significant because h = 0.25 is coarse. With n = 100 steps (h = 0.01), Euler gives X1 = 4 · 0.98^100 ≈ 0.530, which is much closer.
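The worked example can be replayed in code. This is a minimal sketch for the specific ODE dX/dt = −2X (the function name and the choice of step counts are illustrative); it prints the Euler estimate of X1 and its relative error against the exact value.

```python
import math

def euler_linear(x0=4.0, theta=2.0, n_steps=4):
    """Euler method for dX/dt = -theta * X on the interval [0, 1]."""
    h = 1.0 / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + h * (-theta * x)   # one Euler step
    return x

exact = 4.0 * math.exp(-2.0)
for n in (4, 10, 50, 100):
    approx = euler_linear(n_steps=n)
    rel_err = abs(approx - exact) / exact
    print(f"n={n:4d}  X_1={approx:.4f}  relative error={rel_err:.1%}")
```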

Accuracy vs. cost tradeoff: More steps (smaller h) = more accurate but slower. Each step requires one evaluation of the neural network utθ(x). In practice, state-of-the-art image generators use 20-50 Euler steps. That means 20-50 forward passes of a large neural network per generated image.

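Error analysis. The Euler method has a local truncation error of O(h^2) per step and a global error of O(h), so halving the step size roughly halves the error. For the running example dX/dt = −2X with X0 = 4:

Steps n | Step size h = 1/n | Euler X1 | Relative error at t = 1 | NFE (neural net evals)
4 | 0.250 | 0.250 | ~54% | 4
10 | 0.100 | 0.429 | ~21% | 10
50 | 0.020 | 0.520 | ~4% | 50
100 | 0.010 | 0.530 | ~2% | 100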

Heun's method (a refinement): Instead of blindly trusting the velocity at the current point, Heun's method takes a trial step, evaluates the velocity at the trial point, and averages the two velocities:

X't+h = Xt + h · ut(Xt)    (trial step)
Xt+h = Xt + (h/2)(ut(Xt) + ut+h(X't+h))    (corrected step)

This costs 2 function evaluations (NFE) per step, but the global error improves from O(h) to O(h^2). So 25 Heun steps = 50 NFE, the same budget as 50 Euler steps. In practice, Heun often gives better results at the same compute budget.
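To make the comparison concrete, here is a sketch of Euler and Heun on the running example dX/dt = −2X with X0 = 4, at an equal budget of 8 function evaluations. The toy field u and the budget are illustrative choices, not part of the course code.

```python
import math

def u(x, t, theta=2.0):
    """Toy vector field u_t(x) = -theta * x (time-independent)."""
    return -theta * x

def euler(x0, n_steps):
    h, x = 1.0 / n_steps, x0
    for i in range(n_steps):
        x = x + h * u(x, i * h)            # 1 evaluation per step
    return x

def heun(x0, n_steps):
    h, x = 1.0 / n_steps, x0
    for i in range(n_steps):
        t = i * h
        v1 = u(x, t)                       # velocity at the current point
        x_trial = x + h * v1               # trial (Euler) step
        v2 = u(x_trial, t + h)             # velocity at the trial point
        x = x + 0.5 * h * (v1 + v2)        # corrected step, 2 evaluations
    return x

exact = 4.0 * math.exp(-2.0)
err_euler = abs(euler(4.0, 8) - exact)     # 8 steps x 1 NFE
err_heun = abs(heun(4.0, 4) - exact)       # 4 steps x 2 NFE
print(err_euler, err_heun)
```

On this example Heun's error is roughly half of Euler's at the same number of evaluations; the gap widens as the step size shrinks.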

Euler Method Step-Through

Watch the Euler method approximate the true ODE solution. Increase steps for better accuracy. The orange curve is the exact solution; teal dots are Euler steps.

Steps n 4
python
def euler_sample(u_theta, n_steps=50, d=2):
    """Sample from a flow model using Euler method."""
    h = 1.0 / n_steps
    x = torch.randn(d)  # X_0 ~ N(0, I)
    t = 0.0
    for _ in range(n_steps):
        v = u_theta(x, t)  # evaluate neural network
        x = x + h * v       # Euler step
        t += h
    return x  # X_1 ~ p_data
Using Euler with step size h = 0.5 for dX/dt = −2X, X0 = 4: what is X after 1 step (at t = 0.5)?

Chapter 4: Flow Models

We now have everything needed to define a flow model — a generative model based on an ODE. The recipe:

Step 1: Initialize
Sample X0 ~ pinit = N(0, Id) — pure Gaussian noise
Step 2: Simulate ODE
dXt/dt = utθ(Xt) using Euler method for t ∈ [0, 1]
Step 3: Output
X1 ~ pdata — the endpoint is our generated sample

The ODE is deterministic — given the same starting noise X0, you always get the same output X1. The randomness comes entirely from the random initialization. Different noise samples produce different outputs.

Key insight: Although we call it a "flow model," the neural network parameterizes the vector field utθ, not the flow ψt. The flow is computed by simulating the ODE. The neural network never sees the flow directly — it just outputs velocities.

What neural architecture is used? The neural network utθ must take a noisy input x and a time t and output a velocity of the same shape. Common architectures:

Architecture | Used By | Input | Output
U-Net | Stable Diffusion 1/2, DALL-E 2 | [B, C, H, W] + t | [B, C, H, W]
DiT (Diffusion Transformer) | SD3, FLUX, Sora | [B, N, D] + t | [B, N, D]
Simple MLP | Toy experiments | [B, d] + t | [B, d]

The time t is typically embedded using sinusoidal or learned embeddings, similar to positional encodings in Transformers. This time embedding is added to intermediate activations so the network "knows" what stage of the denoising process it is in. Near t = 0 the input is pure noise and the network must decide the coarse structure of the sample; near t = 1 the input is nearly clean and the remaining changes are fine details.
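A sinusoidal time embedding can be sketched in a few lines. The version below is an illustration in plain Python, assuming a geometric range of frequencies as in Transformer positional encodings; the parameter names dim and max_period are assumptions, and real implementations often rescale t (for example by 1000) before embedding.

```python
import math

def sinusoidal_embedding(t, dim=8, max_period=10000.0):
    """Map a scalar time t to a dim-dimensional vector of sines and cosines."""
    half = dim // 2
    # Frequencies decay geometrically from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = sinusoidal_embedding(0.25)
print(len(emb), emb[:2])
```

The resulting vector would typically be passed through a small MLP and added to the network's intermediate activations.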

Why the initial distribution matters. We usually choose pinit = N(0, Id) because:

1. We can sample from it easily (just generate random numbers).

2. It has full support on Rd — every point has positive probability.

3. It is isotropic — no preferred direction, so the model does not need to "undo" any structure in the noise.

4. It is well-studied mathematically, making theoretical analysis tractable.

Could we use a different pinit? Yes! Some models use a uniform distribution on a hypersphere, or a truncated Gaussian. But N(0, I) is the simplest and most common choice. The choice of pinit affects the difficulty of the learning problem: if pinit is very different from pdata, the flow must perform a more dramatic transformation, which requires more neural network capacity and more ODE steps.

The flow model is a universal approximator. Given a sufficiently expressive neural network utθ and the right training algorithm, a flow model can approximate any continuous probability distribution pdata. This follows from the universality of neural networks combined with the existence and uniqueness theorem for ODEs. The practical question is not whether a flow model can model pdata, but how many parameters and training steps it needs.

Computational cost of generation. Each Euler step requires one forward pass through the neural network. The total cost of generating one sample is n × (cost of one forward pass). For reference:

Model | Params | Steps | Time per image (A100)
DiT-S/2 (CIFAR) | 33M | 50 | ~0.3s
DiT-XL/2 (ImageNet 256) | 675M | 50 | ~2s
SD3-Medium (512×512) | 2B | 28 | ~3s
FLUX.1 (1024×1024) | 12B | 50 | ~15s

Research in "distillation" aims to reduce the number of steps needed. Consistency models and progressive distillation can achieve good quality in 1-4 steps, at the cost of additional training. This is an active area of research as of 2025-2026.

Latent vs. pixel space. In practice, flow/diffusion models rarely operate directly in pixel space (too high-dimensional). Instead, a VAE first encodes images into a lower-dimensional latent space. The dimensions given above are for the latent representation. The actual generation pipeline is: noise → ODE/SDE in latent space → VAE decoder → pixels.

Classifier-free guidance. One of the most impactful practical techniques goes beyond the basic theory of this chapter. By training the model to sometimes drop the conditioning (text prompt), it learns both conditional and unconditional generation. At inference, the two predictions are combined to amplify the conditioning signal, which dramatically improves text-to-image alignment. The mathematical details appear in Chapter 5 of the book and rely on the score function concepts from Chapter 4.
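One common way to write the combination step is a linear extrapolation from the unconditional prediction toward the conditional one. The sketch below assumes that form; the name guidance_scale and the list-based vectors are illustrative, and the precise formulation used in the course is deferred to the later chapter.

```python
def cfg_velocity(u_cond, u_uncond, guidance_scale=4.0):
    """Blend conditional and unconditional predictions elementwise.

    guidance_scale = 1 recovers the conditional prediction;
    larger values push further along (u_cond - u_uncond).
    """
    return [uu + guidance_scale * (uc - uu) for uc, uu in zip(u_cond, u_uncond)]

u_c = [1.0, -2.0]   # hypothetical conditional velocity
u_u = [0.5, -1.0]   # hypothetical unconditional velocity
guided = cfg_velocity(u_c, u_u, guidance_scale=2.0)
print(guided)
```

Because both predictions are needed at every solver step, guidance doubles the number of network evaluations per step.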

The Euler method vs. higher-order solvers. Beyond Euler and Heun, there are more advanced ODE solvers like DPM-Solver, DPM-Solver++, and UniPC that take advantage of the specific structure of the flow matching ODE. These can achieve the same quality as 50 Euler steps in just 10-20 steps by using higher-order approximations and specialized scheduling. In production systems, these advanced solvers are standard.

Step schedule matters. Instead of using uniform step sizes h = 1/n, production models use non-uniform schedules. The key insight: more steps should be allocated where the vector field changes rapidly. For CondOT paths, the field is nearly constant, so uniform steps work well. For VP (DDPM-style) paths, more steps are needed near t = 1 where fine details emerge.

python
# Non-uniform step schedules
import torch

n_steps = 50  # number of solver steps (example value)

# Uniform schedule (basic)
t_uniform = torch.linspace(0, 1, n_steps + 1)

# Quadratic schedule (more steps near t=1)
s = torch.linspace(0, 1, n_steps + 1)
t_quad = 1 - (1 - s) ** 2

# Karras schedule (empirically optimized): a grid of noise levels sigma,
# decreasing from sigma_max to sigma_min, rather than times in [0, 1]
sigma_min, sigma_max = 0.002, 80
rho = 7
ramp = torch.linspace(0, 1, n_steps + 1)
sigma_karras = (sigma_max ** (1/rho) + ramp * (sigma_min ** (1/rho) - sigma_max ** (1/rho))) ** rho

The choice of step schedule can improve FID by 10-30% at the same number of steps. This is one of the most impactful "free" improvements in practice.

Adaptive step sizes. Some ODE solvers (like Dopri5) automatically choose the step size based on a local error estimate. They take larger steps where the vector field is smooth and smaller steps where it changes rapidly. This can achieve the same accuracy as 50 Euler steps in just 15-20 adaptive steps, with no manual tuning of the schedule.

Implementation in practice. The torchdiffeq library provides ODE solvers that work with PyTorch:

python
# Using torchdiffeq for adaptive ODE solving
import torch
from torchdiffeq import odeint

class FlowODE(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, t, x):
        """Required signature for torchdiffeq: (t, x) -> dx/dt"""
        t_batch = t.expand(x.shape[0], 1)
        return self.model(x, t_batch)

# Sample with adaptive solver (assumes a trained `model` and the data dimension `d` are already defined)
x0 = torch.randn(16, d)  # batch of 16 noise samples
t_span = torch.tensor([0.0, 1.0])
x1 = odeint(FlowODE(model), x0, t_span,
    method='dopri5', atol=1e-5, rtol=1e-5)[-1]

The adaptive solver automatically determines how many function evaluations (NFE) are needed, typically 15-40 depending on the vector field complexity. This often beats fixed-step Euler at the same compute budget.

Numerical stability considerations. When implementing ODE/SDE solvers for generative models:

1. Avoid t = 0 and t = 1 exactly. Some vector fields diverge at the endpoints. Use t ∈ [ε, 1−ε] with ε = 10^(−5).

2. Use float32 or higher. Float16 can cause numerical issues in the early steps (t near 0) where the velocity is large.

3. Gradient clipping during training. Clip gradients to norm 1.0 to prevent training instabilities.

4. EMA of model weights. Use exponential moving average of parameters for sampling (decay 0.9999).
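Two of these tips are easy to sketch in isolation: a time grid that stays away from the endpoints, and an exponential moving average update. Both helpers below are illustrative (flat Python lists stand in for parameter tensors), not a specific library's API.

```python
def clipped_time_grid(n_steps, eps=1e-5):
    """Uniform grid on [eps, 1 - eps] that avoids the exact endpoints."""
    return [eps + (1.0 - 2.0 * eps) * i / n_steps for i in range(n_steps + 1)]

def ema_update(ema_params, params, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

grid = clipped_time_grid(4)

# Small decay here only so the effect is visible after a few updates.
ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 2.0], decay=0.5)
print(grid, ema)
```

With a decay of 0.9999 the average moves slowly, so sampling with the EMA weights smooths out noise from individual training steps.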

A note on continuous vs. discrete time. In this course, we treat time as continuous (t ∈ [0, 1]). The original DDPM paper used discrete time steps (t ∈ {0, 1, ..., T}). The continuous formulation is more elegant and general — the discrete version is recovered as a special case when you discretize. Most modern implementations use continuous time internally and only discretize for the Euler/Heun solver.

Forward process vs. reverse process. In the DDPM literature, the "forward process" adds noise to data (going from t = 0, clean data, to t = T, pure noise) and the "reverse process" removes noise (going from t = T back to t = 0). In the flow matching literature, the time axis is reversed: t = 0 is noise and t = 1 is data. Both conventions are valid; just be careful when reading papers to check which convention is used.

In this course, we follow the flow matching convention: t = 0 is noise, t = 1 is data. The ODE flows from t = 0 to t = 1 during sampling. During training, we sample random t values and compute the loss at that timestep.

The complete lifecycle of a generated image. To tie everything together, here is what happens end-to-end when you type "a sunset over the ocean" into Stable Diffusion 3:

Encode prompt
CLIP + T5 encode "a sunset over the ocean" into embedding vectors
Initialize noise
X0 = randn(1, 16, 64, 64): random latent noise (the SD3 VAE uses 16 latent channels)
Simulate ODE (28 steps)
Each step: DiT(Xt, t, prompt_emb) → velocity → Xt+h
CFG: blend conditional and unconditional predictions
Decode latent
VAE decoder maps X1 (64×64×16) to pixels (512×512×3)
Display
A photorealistic sunset over the ocean appears on your screen

Total time: ~3 seconds on a modern GPU. Total neural network forward passes: 2 × 28 = 56 (two per step for classifier-free guidance). The model has ~2 billion parameters and was trained on ~1 billion images for ~500,000 GPU-hours. This is the machinery that Ch 3 and Ch 4 will teach you to build.

The mathematical elegance. Step back and appreciate the framework we have built. The entire generative modeling pipeline reduces to: (1) choose a vector field parametrized by a neural network, (2) simulate a differential equation. That is it. The complexity of generating images, videos, proteins, and music is encoded in the neural network weights, which are learned from data using the simple regression losses of Ch 3-4. The differential equation framework provides the mathematical guarantee that this process is well-defined (existence and uniqueness of solutions) and produces valid probability distributions (via the continuity/Fokker-Planck equations). This marriage of deep learning and classical mathematics is what makes flow and diffusion models both practical and beautiful.

Key equations to remember from this chapter:

ODE:   dXt/dt = ut(Xt),   X0 = x0
Euler:   Xt+h = Xt + h · ut(Xt)
SDE:   dXt = ut(Xt) dt + σt dWt
Euler-Maruyama:   Xt+h = Xt + h · ut(Xt) + σt√h · ε

Summary of ODE/SDE solvers for generative models:

Solver | Type | Order | NFE per step | Typical steps
Euler | ODE | 1 | 1 | 50-100
Heun | ODE | 2 | 2 | 25-50
DPM-Solver++ | ODE | 2-3 | 1-2 | 15-25
Dopri5 (RK45) | ODE | 5 | adaptive | auto (15-40 NFE)
Euler-Maruyama | SDE | 0.5 | 1 | 100-1000

Worked example — Algorithm 1 by hand. Suppose d = 1, n = 4 steps, h = 0.25, and X0 = 1.5 (sampled from N(0,1)). Suppose the neural network outputs these velocities:

t | Xt | utθ(Xt) | Xt+h = Xt + 0.25 · u
0.00 | 1.500 | −3.2 | 1.5 + 0.25(−3.2) = 0.700
0.25 | 0.700 | −1.8 | 0.7 + 0.25(−1.8) = 0.250
0.50 | 0.250 | +0.5 | 0.25 + 0.25(0.5) = 0.375
0.75 | 0.375 | +0.9 | 0.375 + 0.25(0.9) = 0.600

The output X1 = 0.600 is our generated sample. The noise X0 = 1.5 was transformed into the data sample 0.600 by the learned vector field.

python
# Algorithm 1: Sampling from a Flow Model
def sample_flow_model(u_theta, n=50, d=784):
    t = 0.0
    h = 1.0 / n
    x = torch.randn(d)          # X_0 ~ N(0, I)
    for i in range(n):
        x = x + h * u_theta(x, t)  # Euler step
        t += h
    return x                      # X_1 ~ p_data
In a flow model, where does the randomness come from?

Chapter 5: Brownian Motion

Flow models are deterministic — once you fix the noise X0, the trajectory is fully determined. Diffusion models add stochasticity during the trajectory itself, using a fundamental mathematical object called Brownian motion.

A Brownian motion W = (Wt)0 ≤ t ≤ 1 is a continuous random walk. Think of it as a drunk person stumbling around — at every instant, they take a tiny random step. It has three defining properties:

Brownian motion (Wt):
1. Starts at zero: W0 = 0.
2. Normal increments: Wt − Ws ~ N(0, (t − s) Id). The variance grows linearly with the time gap.
3. Independent increments: Non-overlapping time intervals have independent steps.

Worked example — simulating Brownian motion. In 1D, with step size h = 0.01, we simulate by setting W0 = 0 and updating:

Wt+h = Wt + √h · εt,    εt ~ N(0, 1)

With h = 0.01: √h = 0.1. At each step, we add a Gaussian with standard deviation 0.1. After 100 steps, we have a Brownian motion path from t = 0 to t = 1.

Numerical trace. Let ε1 = 0.83, ε2 = −1.21, ε3 = 0.47:

Step | Wt | εt | Wt+h = Wt + 0.1 · ε
1 | 0.000 | 0.83 | 0.083
2 | 0.083 | −1.21 | −0.038
3 | −0.038 | 0.47 | 0.009

The path zigzags randomly. Run the simulation again and you get a completely different path. This randomness is the ingredient that makes diffusion models stochastic.

Brownian Motion Simulator

Each click generates a new Brownian motion path. All paths start at zero but diverge wildly due to random increments.

Fun fact: Brownian motion paths are continuous (you could draw them without lifting your pen) but have infinite length (you would never stop drawing). They are also nowhere differentiable — the path is infinitely jagged at every point. This is why SDEs require special mathematics (stochastic calculus) rather than ordinary calculus.

Statistics of Brownian motion. Key properties that follow directly from the definition:

Property | Formula | Interpretation
Mean | E[Wt] = 0 | On average, goes nowhere
Variance | Var(Wt) = t | Spread grows linearly with time
Std. dev. | √t | At t=1: std=1. At t=4: std=2
Covariance | Cov(Ws, Wt) = min(s,t) | Past and future are correlated

Numerical check. Simulate 10,000 Brownian paths to t = 1. The empirical distribution of W1 should be approximately N(0, 1). Mean ≈ 0, Std ≈ 1. This is because W1 − W0 ~ N(0, 1) by the normal increments property.
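The numerical check can be run in plain Python. This sketch uses the standard library only, with a fixed seed and 50 steps per path as arbitrary choices; the function name is illustrative.

```python
import random
import statistics

def sample_w1(n_paths=10_000, n_steps=50, seed=0):
    """Simulate W_1 for many 1D Brownian paths via W_{t+h} = W_t + sqrt(h) * eps."""
    rng = random.Random(seed)
    h = 1.0 / n_steps
    sqrt_h = h ** 0.5
    endpoints = []
    for _ in range(n_paths):
        w = 0.0                                  # W_0 = 0
        for _ in range(n_steps):
            w += sqrt_h * rng.gauss(0.0, 1.0)    # independent normal increment
        endpoints.append(w)
    return endpoints

w1 = sample_w1()
mean_w1 = statistics.fmean(w1)
std_w1 = statistics.pstdev(w1)
print(f"mean={mean_w1:.3f}  std={std_w1:.3f}")
```

Both statistics should land close to 0 and 1 respectively, with sampling fluctuations on the order of 0.01 for 10,000 paths.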

python
# Simulating Brownian motion
import torch

def simulate_brownian(n_paths=5, n_steps=200, d=1):
    h = 1.0 / n_steps
    W = torch.zeros(n_paths, d)
    paths = [W.clone()]
    for _ in range(n_steps):
        W = W + (h ** 0.5) * torch.randn(n_paths, d)
        paths.append(W.clone())
    return torch.stack(paths)  # [n_steps+1, n_paths, d]
The increment W0.7 − W0.3 of a Brownian motion has what distribution?

Chapter 6: Stochastic Differential Equations

An SDE extends an ODE by adding a Brownian motion term. At each step, the particle follows the vector field and gets a random kick:

dXt = ut(Xt) dt + σt dWt

The first term ut(Xt) dt is the drift — the deterministic part, same as an ODE. The second term σt dWt is the diffusion — the random part, scaled by the diffusion coefficient σt. When σt = 0, the SDE reduces to an ODE.

Intuition: An ODE is like floating down a calm river — deterministic, smooth. An SDE is like floating down a turbulent river — there is a general current (drift) plus unpredictable eddies (diffusion). The larger σt, the more turbulent.

We simulate SDEs using the Euler-Maruyama method, the stochastic analog of the Euler method:

Xt+h = Xt + h · ut(Xt) + σt √h · εt,    εt ~ N(0, Id)

Compare to Euler for ODEs: Xt+h = Xt + h · ut(Xt). The only difference is the added noise σt√h · εt.

Worked example — Ornstein-Uhlenbeck process. The OU process uses ut(x) = −θx and constant σ:

dXt = −θ Xt dt + σ dWt

The drift −θx pushes the particle back toward zero (a "spring"). The diffusion σ adds noise. The two forces balance: the particle bounces around zero, eventually settling into a Gaussian distribution N(0, σ^2/(2θ)).

Numerical example. Let θ = 2, σ = 1, X0 = 3, h = 0.1:

Step | Xt | drift = −2X · 0.1 | ε | noise = √0.1 · ε | Xt+h
1 | 3.000 | −0.600 | 0.52 | 0.164 | 2.564
2 | 2.564 | −0.513 | −1.31 | −0.414 | 1.637
3 | 1.637 | −0.327 | 0.88 | 0.278 | 1.588

Notice how the trajectory is jagged (unlike the smooth ODE trajectory) because of the random noise at each step.
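The three Euler-Maruyama steps in the table can be replayed with the same fixed noise values, as a quick sanity check of the arithmetic. The variable names are illustrative.

```python
theta, sigma, h = 2.0, 1.0, 0.1
x = 3.0
trace = []
for eps in (0.52, -1.31, 0.88):        # the noise draws used in the table
    drift = -theta * x * h             # deterministic part: h * u(x)
    noise = sigma * (h ** 0.5) * eps   # stochastic part: sigma * sqrt(h) * eps
    x = x + drift + noise
    trace.append(round(x, 3))
print(trace)
```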

The √h scaling is crucial. Why does the noise scale as σ√h rather than σh? Because Brownian increments have variance proportional to the time gap. If we used σh, the total variance after n = 1/h steps would be n · (σh)^2 = σ^2 · h, which vanishes as h → 0. With σ√h, the total variance is n · (σ√h)^2 = n · σ^2 · h = σ^2, which is finite. This is the mathematically correct scaling for continuous-time stochastic processes.
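The same argument in numbers, as a small sketch: the total variance accumulated over n = 1/h independent steps, for noise scaled by h^power. The helper name and its power argument are illustrative.

```python
def total_variance(n, sigma=1.0, power=0.5):
    """Variance accumulated over n steps of size h = 1/n with noise sigma * h**power."""
    h = 1.0 / n
    return n * (sigma * h ** power) ** 2

for n in (10, 100, 1000):
    print(f"n={n:5d}  sqrt(h) scaling: {total_variance(n, power=0.5):.4f}"
          f"   h scaling: {total_variance(n, power=1.0):.4f}")
```

With power 0.5 the total stays at σ^2 for every n, while with power 1 it shrinks like 1/n.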

Convergence of the OU process. The Ornstein-Uhlenbeck process converges to the stationary distribution N(0, σ^2/(2θ)). For our example with θ = 2, σ = 1: the stationary distribution is N(0, 0.25). Starting at X0 = 3, the process decays toward zero (due to the drift −2X) and fluctuates around zero (due to the noise), eventually settling into N(0, 0.25).
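The stationary variance can be checked empirically. The sketch below simulates many OU paths with Euler-Maruyama and measures the spread at the final time. It assumes a horizon of t = 3 (longer than the [0, 1] interval used elsewhere in this chapter) so that the process has time to forget its starting point; the step size, path count, and seed are arbitrary choices.

```python
import random
import statistics

def simulate_ou(n_paths=5000, theta=2.0, sigma=1.0, x0=3.0,
                h=0.01, t_end=3.0, seed=0):
    """Return the final values of n_paths Euler-Maruyama OU trajectories."""
    rng = random.Random(seed)
    n_steps = round(t_end / h)
    sqrt_h = h ** 0.5
    finals = []
    for _ in range(n_paths):
        x = x0
        for _ in range(n_steps):
            x += -theta * x * h + sigma * sqrt_h * rng.gauss(0.0, 1.0)
        finals.append(x)
    return finals

finals = simulate_ou()
ou_mean = statistics.fmean(finals)
ou_var = statistics.pvariance(finals)
print(f"mean={ou_mean:.3f}  variance={ou_var:.3f}  (theory: 0, 0.25)")
```

The empirical variance sits slightly above 0.25 because of the finite step size; it approaches σ^2/(2θ) as h shrinks.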

The OU process as prototype: The Ornstein-Uhlenbeck process is to SDEs what the linear ODE dX/dt = −θX is to ODEs: the simplest non-trivial example, and the one that gives the most insight. It was the starting point for the original diffusion models (SMLD and DDPM in 2019-2020). Understanding it deeply pays dividends throughout this course.
ODE vs SDE Comparison

Both start at X0 = 3 with drift u(x) = −2x. The ODE (orange) is smooth and deterministic. The SDE (teal) is noisy. Increase σ to see more randomness.

σ 1.0
python
# Euler-Maruyama method for SDE simulation
def euler_maruyama(u_theta, sigma_t, n=50, d=2):
    h = 1.0 / n
    x = torch.randn(d)
    t = 0.0
    for _ in range(n):
        eps = torch.randn_like(x)
        x = x + h * u_theta(x, t) + sigma_t(t) * (h**0.5) * eps
        t += h
    return x
What is the difference between the Euler method and Euler-Maruyama?

Chapter 7: Diffusion Models

A diffusion model is a generative model based on an SDE, just as a flow model is based on an ODE. The recipe is identical except we add noise during sampling:

X0 ~ pinit,    dXt = utθ(Xt) dt + σt dWt

The neural network utθ parameterizes the vector field (exactly as in a flow model). The diffusion coefficient σt is a fixed schedule — not learned. The goal is still X1 ~ pdata.

Flow model = Diffusion model with σt = 0. Every flow model is a special case of a diffusion model. At sampling time, the only difference is whether noise is added along the trajectory. The training algorithms (flow matching vs. score matching) are introduced separately in Chapters 3 and 4.

Why would we want noise during sampling? Three reasons:

1. Error correction. Neural networks are imperfect — utθ only approximates the true vector field. The added noise can help "shake" trajectories out of error states, similar to how simulated annealing helps optimization escape local minima.

2. Diversity. Different amounts of noise can produce more diverse samples. The same initial noise X0 can lead to different outputs depending on the Brownian motion realization.

3. Theoretical guarantees. Under certain conditions, SDE sampling can converge faster than ODE sampling, especially when the target distribution has many well-separated modes.

Worked example — comparing ODE and SDE sampling. Suppose our target has two sharp modes at −5 and +5. With ODE sampling, a particle starting at X0 = 0.1 always ends up at the same mode (say +5). With SDE sampling, the Brownian noise can "push" the particle across the boundary, allowing it to reach either mode. This can improve mode coverage.

python
# Algorithm 2: Sampling from a Diffusion Model
def sample_diffusion_model(u_theta, sigma, n=50, d=784):
    t = 0.0
    h = 1.0 / n
    x = torch.randn(d)                   # X_0 ~ N(0, I)
    for i in range(n):
        eps = torch.randn_like(x)          # fresh noise
        x = x + h * u_theta(x, t) \
            + sigma(t) * (h**0.5) * eps  # Euler-Maruyama
        t += h
    return x                              # X_1 ~ p_data
Property | Flow Model | Diffusion Model
Equation | ODE | SDE
Trajectories | Smooth, deterministic | Jagged, stochastic
Randomness source | Initial noise X0 only | X0 + Brownian motion Wt
Simulation | Euler method | Euler-Maruyama method
Same X0 → same output? | Yes | No
σt | 0 | > 0 (fixed schedule)
What is the relationship between flow models and diffusion models?

Chapter 8: ODE vs SDE Showcase

Let's see everything in action. The canvas below lets you watch flow models (ODE) and diffusion models (SDE) generate samples from a 2D distribution in real time. Multiple particles start as Gaussian noise and evolve toward the target data distribution.

Flow Model Sampling Simulator

Watch particles flow from noise (t=0) to data (t=1). The target is a mixture of 4 Gaussians. Toggle SDE mode and adjust σ to see the difference.

Mode
σ (SDE only) 0.5
Steps 50
What to observe: In ODE mode, trajectories are smooth curves. Particles that start near each other stay near each other. In SDE mode, trajectories are jagged. Even particles starting from the same noise can diverge due to different Brownian motion realizations. Both end up at the target distribution — but the paths differ.

Practical considerations for real models:

Aspect | ODE (Flow Model) | SDE (Diffusion Model)
Typical steps | 20-50 | 50-1000
Image quality | Good with few steps | Better with many steps
Deterministic? | Yes (same seed = same output) | No (different each run)
Speed | Faster (fewer steps needed) | Slower (more steps for stability)
Guidance | Works but limited | Naturally supports classifier guidance
Likelihood | Can compute exact log p(x) | Cannot (easily)

In practice, modern systems like Stable Diffusion 3 and FLUX use ODE sampling (flow matching) by default because it is faster. SDE sampling is used when higher quality is needed or when guidance techniques require it.

The key takeaway from this showcase: A flow model (ODE) and a diffusion model (SDE) are two ways to sample with the same learned vector field; the samplers differ only in whether noise is injected at each step. This flexibility is one of the most powerful aspects of the flow/diffusion framework: train once, sample with either method depending on your quality/speed requirements.

Interpolation between ODE and SDE. You can even mix the two approaches. Start with SDE sampling (large σ) for the early steps (to get good mode coverage) and switch to ODE sampling (σ = 0) for the later steps (for precise, deterministic refinement). This "stochastic early, deterministic late" strategy is used by some state-of-the-art samplers.
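
A minimal sketch of the "stochastic early, deterministic late" idea, under stated assumptions: the function name `sample_hybrid`, the `switch_t` parameter, and the scalar toy field are all hypothetical and chosen only for illustration. It uses the plain Euler-Maruyama update X_{t+h} = X_t + h·u_t(X_t) + σ√h·ε before the switch time and the plain Euler update afterwards. States are Python floats to keep the sketch dependency-free; the same loop applies to tensors.

```python
import math
import random


def sample_hybrid(model, x0, sigma=0.5, n_steps=100, switch_t=0.5, rng=None):
    """Euler-Maruyama steps while t < switch_t, plain Euler steps afterwards.

    model(x, t) returns the velocity; x0 is a float in this sketch.
    """
    if rng is None:
        rng = random.Random()
    x, h = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        x = x + h * model(x, t)  # drift step, shared by both phases
        if t < switch_t:
            # stochastic phase: add Brownian increment sigma * sqrt(h) * eps
            x = x + sigma * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x


# Toy velocity field (assumption): pulls every particle toward 1.0.
toy_model = lambda x, t: 1.0 - x

x_hybrid = sample_hybrid(toy_model, 0.0, rng=random.Random(0))
x_ode = sample_hybrid(toy_model, 0.0, switch_t=0.0)  # never enters the noisy phase
print(x_hybrid, x_ode)
```

Setting `switch_t=0.0` recovers pure ODE sampling and `switch_t=1.0` pure SDE sampling, so a single parameter sweeps between the two regimes. If a score estimate is available, the stochastic phase would add the (σ²/2)·score correction to the drift.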

python
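# Complete comparison: ODE vs SDE sampling
import torch

def sample_ode(model, x0, n_steps=50):
    """Deterministic ODE sampling with the Euler method."""
    x, h = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        x = x + h * model(x, t)  # Euler step along the learned velocity
    return x

def sample_sde(model, x0, sigma=0.5, n_steps=100):
    """Stochastic SDE sampling with the Euler-Maruyama method."""
    x, h = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        eps = torch.randn_like(x)  # fresh Gaussian noise at every step
        # The drift here uses only the velocity; the summary card's SDE line
        # additionally carries a (sigma^2 / 2) * score term in the drift.
        x = x + h * model(x, t) + sigma * (h**0.5) * eps
    return x

# Same model, two sampling methods:
model = lambda x, t: -x           # stand-in velocity field (assumption); use a trained network here
x0 = torch.randn(1, 784)
img_ode = sample_ode(model, x0)   # always the same output for this x0
img_sde = sample_sde(model, x0)   # different each time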
You run a flow model (ODE) twice with the same X0. Do you get the same output?

Chapter 9: Connections

We have built the complete mathematical machinery for generative modeling with differential equations. Let's recap:

Concept | Definition | Role in Generation
Vector field u_t(x) | Velocity at position x, time t | Neural network parameterizes this
ODE | dX/dt = u_t(X_t) | Deterministic trajectory from noise to data
Flow ψ_t | Solution map of ODE | Where a particle ends up at time t
Euler method | X_{t+h} = X_t + h·u_t(X_t) | Numerical simulation of ODE
Brownian motion | Continuous random walk W_t | Source of stochasticity in SDEs
SDE | dX = u dt + σ dW | Stochastic trajectory from noise to data
Euler-Maruyama | X_{t+h} = X_t + h·u + σ√h·ε | Numerical simulation of SDE
Flow model | SDE with σ = 0 | Deterministic generative model
Diffusion model | SDE with σ > 0 | Stochastic generative model
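
To make the "Flow" and "Euler method" rows concrete, here is a small sketch under stated assumptions: the toy field u_t(x) = −x is chosen for illustration because its flow can be solved by hand, ψ_t(x_0) = x_0·e^(−t). The helper name `euler_flow` is hypothetical. The loop compares the Euler approximation with that exact flow for several step counts.

```python
import math


def euler_flow(u, x0, t_end=1.0, n_steps=100):
    """Approximate the flow psi_t(x0) at t = t_end with n_steps Euler steps."""
    x, h = x0, t_end / n_steps
    for i in range(n_steps):
        x = x + h * u(x, i * h)  # X_{t+h} = X_t + h * u_t(X_t)
    return x


u = lambda x, t: -x                       # toy vector field (assumption)
exact = lambda x0, t: x0 * math.exp(-t)   # its flow, solved by hand

for n in (10, 100, 1000):
    err = abs(euler_flow(u, 1.0, n_steps=n) - exact(1.0, 1.0))
    print(f"n_steps={n:5d}  error={err:.6f}")
```

The error shrinks roughly in proportion to the step size h, which is the expected behavior of a first-order method such as Euler.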

The open question: We know how to sample from a flow/diffusion model (simulate the ODE/SDE). But we have not yet discussed how to train the neural network u_t^θ. That is the subject of the next two chapters: Flow Matching (Chapter 3) shows how to train a flow model, and Score Matching (Chapter 4) shows how to train a diffusion model. The training algorithms are strikingly simple: just regression.

What we know vs. what we need. Here is the gap between what we have and what we need:

What We Have | What We Need
How to define a vector field (Ch 1) | How to LEARN the right vector field
How to solve an ODE/SDE numerically (Ch 3, 6) | What ODE/SDE to solve
The Euler and Euler-Maruyama methods | A training loss function
A dataset z_1, ..., z_N from p_data | Parameters θ such that X_1 ~ p_data

Chapter 3 (flow matching) closes this gap with a beautifully simple idea: define a probability path from noise to data, compute the vector field that follows this path analytically, and train the neural network to match it via MSE regression. No ODE simulation during training. No adversarial training. Just regression.
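
The regression idea can be sketched in a few lines. This is an illustration under stated assumptions, not the training procedure of Chapter 3: it assumes the linear path α_t = t, β_t = 1 − t (so the target velocity is z − ε), one-dimensional data, and a hypothetical three-parameter linear model u_θ(x, t) = a·x + b·t + c with hand-written gradients. All function names are made up for this sketch.

```python
import random


def make_training_pair(z, rng):
    """Sample (x, t, target) on the linear path x = t*z + (1 - t)*eps."""
    t = rng.random()                # t ~ Uniform[0, 1)
    eps = rng.gauss(0.0, 1.0)       # eps ~ N(0, 1)
    x = t * z + (1.0 - t) * eps     # point on the path between noise and data
    target = z - eps                # time derivative of the path
    return x, t, target


def u_theta(params, x, t):
    """Toy stand-in for the neural network: a linear function of (x, t)."""
    a, b, c = params
    return a * x + b * t + c


def loss_and_grad(params, x, t, target):
    """Squared error and its gradient with respect to (a, b, c)."""
    err = u_theta(params, x, t) - target
    return err * err, (2 * err * x, 2 * err * t, 2 * err)


def train(data, n_iters=2000, lr=0.01, seed=0):
    """Plain SGD on the regression loss; no ODE is simulated during training."""
    rng = random.Random(seed)
    params = (0.0, 0.0, 0.0)
    for _ in range(n_iters):
        z = rng.choice(data)
        x, t, target = make_training_pair(z, rng)
        _, grad = loss_and_grad(params, x, t, target)
        params = tuple(p - lr * g for p, g in zip(params, grad))
    return params


params = train([-2.0, 2.0])  # toy dataset with two data points (assumption)
print(params)
```

Each iteration only samples a data point, a noise value, and a time, then takes one gradient step on a squared error. A real model would replace `u_theta` with a neural network and the manual gradient with automatic differentiation.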

The key equations of this entire course fit on one card:

The entire flow/diffusion framework in 6 lines:

Data: z_1, ..., z_N ~ p_data    (collect examples)
Path: x = α_t z + β_t ε    (ε ~ N(0, I), interpolate noise ↔ data)
Target: u_target = α̇_t z + β̇_t ε    (velocity along the path)
Train: min ||u_θ(x, t) − u_target||²    (MSE regression)
ODE sample: X_{t+h} = X_t + h·u_θ(X_t, t)    (Euler from noise)
SDE sample: X_{t+h} = X_t + h·[u_θ + (σ²/2)·s_θ] + σ√h·ε    (add noise)
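
As a worked instance of the card, take the linear schedule α_t = t, β_t = 1 − t. This choice is an assumption here, picked because it reproduces the (z − eps) flow matching target in the summary code. Substituting into the Path and Target lines, and using that the conditional distribution of x given z is Gaussian with mean α_t z and standard deviation β_t, gives:

```latex
x = t\,z + (1-t)\,\varepsilon, \qquad
u_{\text{target}} = \dot\alpha_t\, z + \dot\beta_t\, \varepsilon = z - \varepsilon, \qquad
\nabla_x \log p_t(x \mid z) = -\frac{x - t z}{(1-t)^2} = -\frac{\varepsilon}{1-t}
```

The last expression is the conditional score, which is where the eps/beta form of the score matching loss comes from.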
python
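# Summary: the full generative modeling stack
import torch
import torch.nn as nn

class NeuralNet(nn.Module):
    """Minimal stand-in for u_theta: an MLP on the concatenation [x, t] (assumption)."""
    def __init__(self, d, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + 1, hidden), nn.SiLU(), nn.Linear(hidden, d))
    def forward(self, x, t):
        t_col = torch.full((x.shape[0], 1), float(t))  # broadcast scalar time to the batch
        return self.net(torch.cat([x, t_col], dim=1))

d, hidden = 784, 256  # illustrative sizes (assumption)

# 1. REPRESENTATION (Ch 1)
z = torch.randn(1, d)  # stand-in for a data sample z ∈ R^d (image, video, protein, ...)

# 2. MACHINERY (Ch 2, this chapter)
u_theta = NeuralNet(d, hidden)  # u: R^d × [0,1] → R^d
# ODE: dX/dt = u_theta(X, t)  → flow model
# SDE: dX = u_theta(X,t)dt + σ dW  → diffusion model
# Simulate via Euler or Euler-Maruyama

# 3. TRAINING (Ch 3-4, next)
# Flow matching: L = ||u_theta(x,t) - (z - eps)||^2
# Score matching: L = ||s_theta(x,t) + eps/beta_t||^2

# 4. GENERATION
# X_0 ~ N(0, I) → simulate ODE/SDE → X_1 ~ p_data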
Ch 2 (Done)
We can SAMPLE from flow/diffusion models via ODE/SDE simulation
↓ next: how to TRAIN
Ch 3: Flow Matching
Train u_t^θ by regressing against conditional vector fields
Ch 4: Score Matching
Score functions, SDE extension trick, denoising score matching
To build a complete generative model, what two things do we need?