Ch 2: Flow & Diffusion Models — Flow Matching & Diffusion

Chapter 1: Vector Fields

A vector field is a function that assigns a velocity vector to every point in space and time:

u : R^d × [0, 1] → R^d, (x, t) ↦ u_t(x)

At position x and time t, the vector field u_t(x) tells you the velocity — which direction and how fast to move. Think of it as a wind map: at every location and every moment, there is an arrow showing which way the wind blows and how strongly.

Worked example — linear vector field. Consider u_t(x) = −θx for θ = 2. This field always points toward the origin. At x = 3, the velocity is −6 (move left). At x = −1, the velocity is +2 (move right). Every particle is pulled toward zero, and the further away you are, the stronger the pull.

Worked example — circular vector field. In 2D, u_t(x₁, x₂) = (−x₂, x₁) creates circular trajectories. At (1, 0) the velocity is (0, 1) — pointing upward. At (0, 1) the velocity is (−1, 0) — pointing left. Particles orbit the origin.

[Interactive demo: Vector Field Explorer. Choose a field type and a value of θ (default 2.0) to see the velocity arrows; clicking anywhere drops a particle that follows the field.]

Worked example — saddle point vector field. In 2D, u_t(x₁, x₂) = (θ x₁, −θ x₂) creates a saddle point. With θ = 2, at (1, 1) the velocity is (2, −2): pushed right and down. The x₁ axis is unstable (grows), the x₂ axis is stable (shrinks). Particles spread along x₁ and compress along x₂.
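The three worked examples can be checked numerically. The following is a minimal sketch in plain Python; the function names are illustrative and θ = 2 is assumed for the linear and saddle fields.

```python
# Evaluate the three example vector fields at the points used in the text.
# Function names are illustrative assumptions, not part of the course code.

def linear_vf(x, theta=2.0):
    """1D field u(x) = -theta * x (pulls toward the origin)."""
    return -theta * x

def circular_vf(x1, x2):
    """2D field u(x1, x2) = (-x2, x1) (rotation about the origin)."""
    return (-x2, x1)

def saddle_vf(x1, x2, theta=2.0):
    """2D field u(x1, x2) = (theta * x1, -theta * x2) (saddle point)."""
    return (theta * x1, -theta * x2)

print(linear_vf(3.0))       # -6.0
print(linear_vf(-1.0))      # 2.0
print(circular_vf(1, 0))    # (0, 1)
print(circular_vf(0, 1))    # (-1, 0)
print(saddle_vf(1.0, 1.0))  # (2.0, -2.0)
```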

Mathematical properties of vector fields. A vector field u_t(x) maps every point in R^d to a velocity vector in R^d. For our purposes, the vector field depends on both position x and time t. This time-dependence is crucial — the neural network needs to know "what time it is" to provide the correct velocity. At t = 0 (noise), the field should push particles apart. At t = 1 (data), it should guide them to the right locations.

python
# Defining vector fields in code
import torch

def linear_vf(x, theta=2.0):
    """u(x) = -theta * x (pulls toward origin)"""
    return -theta * x

def circular_vf(x, theta=1.0):
    """u(x1,x2) = (-theta*x2, theta*x1) (rotation)"""
    return theta * torch.stack([-x[...,1], x[...,0]], dim=-1)

# Neural network vector field (what we actually use)
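class NeuralVF(torch.nn.Module):
    def __init__(self, d, hidden=128):
        super().__init__()
        # Illustrative small MLP; the input is [x, t] concatenated, so d + 1 features
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d + 1, hidden), torch.nn.SiLU(),
            torch.nn.Linear(hidden, d))

    def forward(self, x, t):
        """x: [batch, d], t: [batch, 1] -> velocity [batch, d]"""
        return self.net(torch.cat([x, t], dim=-1))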

Why vector fields matter: In a generative model, the neural network is the vector field. It takes a noisy point x and a time t, and outputs a velocity u_t^θ(x). Generating a sample means following this velocity from t = 0 (noise) to t = 1 (data). The entire learning problem is: find parameters θ so that this velocity field transports noise into data.

For the linear vector field u_t(x) = −2x, what is the velocity at x = 5?

(a) +10 (move right)
(b) 0 (stationary)
(c) −10 (move left toward origin)

Chapter 2: ODEs and Flows

An ordinary differential equation (ODE) says: "follow the vector field." Given a starting point x₀, the ODE asks for a trajectory X_t whose velocity at every moment equals the vector field:

dX_t/dt = u_t(X_t), X₀ = x₀

The first equation says the trajectory's velocity must match the vector field. The second equation says we start at x₀. Together, they fully determine the trajectory — there is exactly one path through each starting point.

The solution to an ODE is called a flow, denoted ψ_t:

ψ : R^d × [0, 1] → R^d, (x₀, t) ↦ ψ_t(x₀)

The flow ψ_t(x₀) tells you: "if you start at x₀ and follow the vector field for time t, where do you end up?" The trajectory is X_t = ψ_t(X₀).

Three descriptions, one object: Vector fields, ODEs, and flows are three ways to describe the same thing. The vector field defines the ODE. The ODE's solution is the flow. Given any one, you can derive the other two.

Worked example — linear ODE. Let u_t(x) = −θx for θ = 2. The ODE is dX/dt = −2X. The solution is:

ψ_t(x₀) = e^(−2t) · x₀

Verification: At t = 0, ψ₀(x₀) = e⁰ · x₀ = x₀. Check. The derivative is dψ_t/dt = −2e^(−2t) · x₀ = −2ψ_t(x₀) = u_t(ψ_t(x₀)). Check. So the flow satisfies the ODE.

Numerical example. Start at x₀ = 4, θ = 2:

t	ψ_t(4) = 4e^−2t	Velocity u_t(ψ_t) = −2ψ_t
0.0	4.000	−8.000
0.25	2.426	−4.852
0.50	1.472	−2.943
0.75	0.893	−1.785
1.0	0.541	−1.083

The particle decays exponentially toward zero. The velocity decreases as the particle gets closer to the origin.
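The table can be reproduced, and the ODE checked, with a few lines of plain Python. This is a sketch: `flow` and `velocity` are illustrative names, and the time derivative is approximated with a central finite difference.

```python
import math

THETA = 2.0

def flow(x0, t):
    """Closed-form flow of dX/dt = -THETA * X: psi_t(x0) = exp(-THETA * t) * x0."""
    return math.exp(-THETA * t) * x0

def velocity(x):
    """Vector field u(x) = -THETA * x."""
    return -THETA * x

x0 = 4.0
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    x = flow(x0, t)
    print(f"t={t:.2f}  psi_t={x:.3f}  velocity={velocity(x):.3f}")

# Finite-difference check that d/dt psi_t(x0) equals u(psi_t(x0)) at t = 0.5
eps = 1e-6
dpsi_dt = (flow(x0, 0.5 + eps) - flow(x0, 0.5 - eps)) / (2 * eps)
print(abs(dpsi_dt - velocity(flow(x0, 0.5))) < 1e-6)  # True
```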

Existence and uniqueness (Theorem 3): If the vector field u_t(x) is continuously differentiable with bounded derivatives, the ODE has a unique solution. This holds for neural networks built from smooth activations with bounded derivatives; Lipschitz continuity, as in ReLU networks, is also enough. This means: given any starting point, there is exactly one trajectory. No ambiguity, no branching. This is great news for generative modeling: the flow is well-defined.

Moreover, the flow ψ_t is a diffeomorphism — a smooth, invertible transformation with a smooth inverse. This means the flow "warps" space without tearing or folding it. Think of it as stretching a rubber sheet: every point maps to exactly one other point, and you can always undo the deformation.

Why diffeomorphisms matter: The flow maps the noise distribution N(0, I) to the data distribution p_data. Because the map is invertible, we can also go backwards: given a data point z, we can find the noise point ψ₁⁻¹(z) that generated it. This invertibility is useful for computing likelihoods and for encoding data into a latent space.

Worked example — 2D rotation flow. For the circular vector field u(x₁, x₂) = (−x₂, x₁), the flow is a rotation:

ψ_t(x₀) = (x_0,1 cos(t) − x_0,2 sin(t), x_0,1 sin(t) + x_0,2 cos(t))

At t = π/2: the flow rotates everything 90 degrees counterclockwise. At t = π: 180 degrees. The inverse is simply rotating in the opposite direction.
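The rotation flow and its inverse can be verified directly. The sketch below uses an illustrative helper, `rotation_flow`, and checks that rotating by π/2 and then by −π/2 returns the starting point.

```python
import math

def rotation_flow(x0, t):
    """Flow of u(x1, x2) = (-x2, x1): rotate x0 counterclockwise by angle t."""
    x1, x2 = x0
    return (x1 * math.cos(t) - x2 * math.sin(t),
            x1 * math.sin(t) + x2 * math.cos(t))

start = (1.0, 0.0)
quarter = rotation_flow(start, math.pi / 2)   # approximately (0, 1)
half = rotation_flow(start, math.pi)          # approximately (-1, 0)
back = rotation_flow(quarter, -math.pi / 2)   # inverse: rotate the other way

for name, point in (("quarter turn", quarter), ("half turn", half), ("undo quarter", back)):
    print(name, [round(abs(c), 6) * (1 if c >= 0 else -1) for c in point])
```

Because the inverse of a rotation by t is a rotation by −t, the inverse flow here has the same closed form with the sign of t flipped.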

For the ODE dX/dt = −2X with X₀ = 4, what is X₁?

(a) 0 (the particle reaches the origin)
(b) 4e⁻² ≈ 0.541 (exponential decay)
(c) 2 (halved)

Chapter 3: The Euler Method

We can analytically solve simple ODEs like dX/dt = −θX. But when the vector field u_t(x) is a neural network with millions of parameters, there is no formula for the flow. We need to simulate the ODE numerically.

The Euler method is the simplest simulation technique. The idea: take small discrete steps along the vector field. Starting at X₀ = x₀, update iteratively:

X_t+h = X_t + h · u_t(X_t)

where h = 1/n is the step size and n is the number of steps. We evaluate the velocity at the current position, multiply by the step size, and add it to get the new position. After n steps, we reach t = 1.

Worked example — Euler for dX/dt = −2X. Let X₀ = 4, n = 4 steps, so h = 0.25:

Step	t	X_t	u_t(X_t) = −2X_t	X_t+h = X_t + 0.25 · u
1	0.00	4.000	−8.000	4 + 0.25(−8) = 2.000
2	0.25	2.000	−4.000	2 + 0.25(−4) = 1.000
3	0.50	1.000	−2.000	1 + 0.25(−2) = 0.500
4	0.75	0.500	−1.000	0.5 + 0.25(−1) = 0.250

Euler gives X₁ ≈ 0.250. The exact answer is 4e⁻² ≈ 0.541. The error is significant because h = 0.25 is coarse. With n = 100 steps (h = 0.01), Euler gives X₁ = 4 · (0.98)¹⁰⁰ ≈ 0.530, which is much closer.
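The hand computation can be replayed in code. This is a minimal plain-Python sketch; `euler_solve` is an illustrative name, and the vector field is the worked example u_t(x) = −2x.

```python
import math

def euler_solve(u, x0, n_steps):
    """Integrate dX/dt = u(x, t) from t = 0 to t = 1 with n_steps Euler steps."""
    h = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        t = i * h
        x = x + h * u(x, t)  # one Euler step
    return x

def u(x, t):
    """Vector field of the worked example: u_t(x) = -2x."""
    return -2.0 * x

exact = 4.0 * math.exp(-2.0)
for n in (4, 100):
    approx = euler_solve(u, 4.0, n)
    print(f"n={n:3d}  Euler={approx:.4f}  exact={exact:.4f}  "
          f"relative error={(exact - approx) / exact:.1%}")
```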

Accuracy vs. cost tradeoff: More steps (smaller h) = more accurate but slower. Each step requires one evaluation of the neural network u_t^θ(x). In practice, state-of-the-art image generators use 20-50 Euler steps. That means 20-50 forward passes of a large neural network per generated image.

Error analysis. The Euler method has a local truncation error of O(h²) per step and a global error of O(h). This means:

Steps n	Step size h = 1/n	Relative error at t = 1 (for dX/dt = −2X, X₀ = 4)	NFE (neural net evals)
4	0.250	~54%	4
10	0.100	~21%	10
50	0.020	~4%	50
100	0.010	~2%	100
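First-order convergence can be observed directly: each time h is halved, the global error should roughly halve as well. The sketch below measures this for the running example dX/dt = −2X, X₀ = 4; `euler_final` is an illustrative helper name.

```python
import math

def euler_final(n_steps, x0=4.0, theta=2.0):
    """Euler approximation of X_1 for dX/dt = -theta * X with h = 1 / n_steps."""
    h = 1.0 / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + h * (-theta * x)
    return x

exact = 4.0 * math.exp(-2.0)
previous_error = None
for n in (50, 100, 200, 400):
    error = abs(exact - euler_final(n))
    ratio = "" if previous_error is None else f"  ratio to previous={previous_error / error:.2f}"
    print(f"n={n:3d}  h={1.0 / n:.4f}  global error={error:.5f}{ratio}")
    previous_error = error
```

A ratio close to 2 between consecutive rows is the signature of an O(h) method.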

Heun's method (a refinement): Instead of blindly trusting the velocity at the current point, Heun's method takes a trial step, evaluates the velocity at the trial point, and averages the two velocities:

X'_t+h = X_t + h · u_t(X_t) (trial step)

X_t+h = X_t + (h/2)(u_t(X_t) + u_t+h(X'_t+h)) (corrected step)

This costs 2 function evaluations (NFE) per step but has O(h²) global error, compared with O(h) for Euler. So 25 Heun steps = 50 NFE, comparable to 50 Euler steps. In practice, Heun often gives better results at the same compute budget.
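To make the cost comparison concrete, the sketch below runs Heun and Euler on dX/dt = −2X, X₀ = 4 with the same budget of 8 function evaluations. The function names are illustrative, and the result is for this toy problem only.

```python
import math

def u(x, t):
    """Vector field of the running example: u_t(x) = -2x."""
    return -2.0 * x

def euler_solve(u, x0, n_steps):
    h = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        x = x + h * u(x, i * h)
    return x

def heun_solve(u, x0, n_steps):
    h = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        t = i * h
        v1 = u(x, t)                  # velocity at the current point
        x_trial = x + h * v1          # trial (Euler) step
        v2 = u(x_trial, t + h)        # velocity at the trial point
        x = x + 0.5 * h * (v1 + v2)   # corrected step with the averaged velocity
    return x

exact = 4.0 * math.exp(-2.0)
heun_4 = heun_solve(u, 4.0, 4)     # 4 steps x 2 evaluations = 8 NFE
euler_8 = euler_solve(u, 4.0, 8)   # 8 steps x 1 evaluation  = 8 NFE
print(f"Heun,  8 NFE: {heun_4:.4f}  error={abs(heun_4 - exact):.4f}")
print(f"Euler, 8 NFE: {euler_8:.4f}  error={abs(euler_8 - exact):.4f}")
```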

[Interactive demo: Euler Method Step-Through. The Euler method approximates the true ODE solution; the orange curve is the exact solution and the teal dots are Euler steps. Control: number of steps n (default 4).]

python
import torch

def euler_sample(u_theta, n_steps=50, d=2):
    """Sample from a flow model using Euler method."""
    h = 1.0 / n_steps
    x = torch.randn(d)  # X_0 ~ N(0, I)
    t = 0.0
    for _ in range(n_steps):
        v = u_theta(x, t)  # evaluate neural network
        x = x + h * v       # Euler step
        t += h
    return x  # X_1 ~ p_data

Using Euler with step size h = 0.5 for dX/dt = −2X, X₀ = 4: what is X after 1 step (at t = 0.5)?

(a) X_0.5 = 4 + 0.5 · (−8) = 0
(b) X_0.5 = 4 · e⁻¹ ≈ 1.47
(c) X_0.5 = 4 − 2 = 2

Chapter 4: Flow Models

We now have everything needed to define a flow model — a generative model based on an ODE. The recipe:

Step 1: Initialize

Sample X₀ ~ p_init = N(0, I_d) — pure Gaussian noise

↓

Step 2: Simulate ODE

dX_t/dt = u_t^θ(X_t) using Euler method for t ∈ [0, 1]

↓

Step 3: Output

X₁ ~ p_data — the endpoint is our generated sample

The ODE is deterministic — given the same starting noise X₀, you always get the same output X₁. The randomness comes entirely from the random initialization. Different noise samples produce different outputs.

Key insight: Although we call it a "flow model," the neural network parameterizes the vector field u_t^θ, not the flow ψ_t. The flow is computed by simulating the ODE. The neural network never sees the flow directly — it just outputs velocities.

What neural architecture is used? The neural network u_t^θ must take a noisy input x and a time t and output a velocity of the same shape. Common architectures:

Architecture	Used By	Input	Output
U-Net	Stable Diffusion 1/2, DALL-E 2	[B, C, H, W] + t	[B, C, H, W]
DiT (Diffusion Transformer)	SD3, FLUX, Sora	[B, N, D] + t	[B, N, D]
Simple MLP	Toy experiments	[B, d] + t	[B, d]

The time t is typically embedded using sinusoidal or learned embeddings, similar to positional encodings in Transformers. This time embedding is added to intermediate activations so the network "knows" what stage of the denoising process it is in. Near t = 0 the input is pure noise, so the velocity mostly sets the coarse direction toward the data. Near t = 1 the input is nearly clean, and the velocity mainly refines fine details.
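A sinusoidal time embedding can be sketched as follows. This is one common form, not necessarily the one used by any specific model: the embedding size `dim` and the constant `max_period` are assumed values, and real implementations often rescale t before embedding it.

```python
import math

def sinusoidal_time_embedding(t, dim=8, max_period=10000.0):
    """Map a scalar time t to a dim-dimensional vector of sines and cosines.

    Frequencies decrease geometrically from 1 toward 1 / max_period,
    as in Transformer positional encodings. dim must be even.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = sinusoidal_time_embedding(0.25)
print(len(emb))                        # 8
print([round(v, 4) for v in emb])
```

The resulting vector is what would be passed through a small MLP and added to the network's intermediate activations.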

Why the initial distribution matters. We usually choose p_init = N(0, I_d) because:

1. We can sample from it easily (just generate random numbers).

2. It has full support on R^d — every point has positive probability.

3. It is isotropic — no preferred direction, so the model does not need to "undo" any structure in the noise.

4. It is well-studied mathematically, making theoretical analysis tractable.

Could we use a different p_init? Yes! Some models use a uniform distribution on a hypersphere, or a truncated Gaussian. But N(0, I) is the simplest and most common choice. The choice of p_init affects the difficulty of the learning problem: if p_init is very different from p_data, the flow must perform a more dramatic transformation, which requires more neural network capacity and more ODE steps.

The flow model is a universal approximator. Given a sufficiently expressive neural network u_t^θ and the right training algorithm, a flow model can approximate any continuous probability distribution p_data. This follows from the universality of neural networks combined with the existence and uniqueness theorem for ODEs. The practical question is not whether a flow model can model p_data, but how many parameters and training steps it needs.

Computational cost of generation. Each Euler step requires one forward pass through the neural network. The total cost of generating one sample is n × (cost of one forward pass). For reference:

Model	Params	Steps	Time per image (A100)
DiT-S/2 (CIFAR)	33M	50	~0.3s
DiT-XL/2 (ImageNet 256)	675M	50	~2s
SD3-Medium (512×512)	2B	28	~3s
FLUX.1 (1024×1024)	12B	50	~15s

Research in "distillation" aims to reduce the number of steps needed. Consistency models and progressive distillation can achieve good quality in 1-4 steps, at the cost of additional training. This is an active area of research as of 2025-2026.

Latent vs. pixel space. In practice, flow/diffusion models rarely operate directly in pixel space (too high-dimensional). Instead, a VAE first encodes images into a lower-dimensional latent space. The dimensions given above are for the latent representation. The actual generation pipeline is: noise → ODE/SDE in latent space → VAE decoder → pixels.

Classifier-free guidance. One of the most impactful practical techniques is not part of the basic theory but builds on the score function. By training the model to sometimes drop the conditioning (text prompt), it learns both conditional and unconditional generation. At inference, the two predictions are combined to amplify the conditioning signal. This dramatically improves text-to-image alignment. The mathematical details appear in Chapter 5 of the book and rely on the score function concepts from Chapter 4.
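The combination step is commonly written as a linear extrapolation from the unconditional prediction toward the conditional one. The sketch below assumes that form; the name `guidance_scale` and the example vectors are illustrative, not taken from a specific model.

```python
def cfg_combine(v_uncond, v_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. guidance_scale = 1 recovers the conditional
    prediction; larger values amplify the conditioning signal."""
    return [vu + guidance_scale * (vc - vu) for vu, vc in zip(v_uncond, v_cond)]

v_uncond = [0.0, 1.0]   # made-up model output without the prompt
v_cond = [1.0, 3.0]     # made-up model output with the prompt
print(cfg_combine(v_uncond, v_cond, 1.0))   # [1.0, 3.0]
print(cfg_combine(v_uncond, v_cond, 2.0))   # [2.0, 5.0]
```

Both predictions come from the same network, which is why guided sampling costs two forward passes per step.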

The Euler method vs. higher-order solvers. Beyond Euler and Heun, there are more advanced ODE solvers like DPM-Solver, DPM-Solver++, and UniPC that take advantage of the specific structure of the flow matching ODE. These can achieve the same quality as 50 Euler steps in just 10-20 steps by using higher-order approximations and specialized scheduling. In production systems, these advanced solvers are standard.

Step schedule matters. Instead of using uniform step sizes h = 1/n, production models use non-uniform schedules. The key insight: more steps should be allocated where the vector field changes rapidly. For CondOT paths, the field is nearly constant, so uniform steps work well. For VP (DDPM-style) paths, more steps are needed near t = 1 where fine details emerge.

python
# Non-uniform step schedules
import torch

n_steps = 50  # number of solver steps (example value)

# Uniform schedule (basic)
t_uniform = torch.linspace(0, 1, n_steps + 1)

# Quadratic schedule (more steps near t=1): 1 - (1 - s)^2 clusters points near 1
t_quad = 1 - (1 - torch.linspace(0, 1, n_steps + 1)) ** 2

# Karras schedule (empirically optimized)
# Note: this yields noise levels sigma, decreasing from sigma_max to sigma_min,
# not times in [0, 1]
sigma_min, sigma_max = 0.002, 80
rho = 7
ramp = torch.linspace(0, 1, n_steps + 1)
sigmas_karras = (sigma_max ** (1/rho) + ramp * (sigma_min ** (1/rho) - sigma_max ** (1/rho))) ** rho

The choice of step schedule can improve FID by 10-30% at the same number of steps. This is one of the most impactful "free" improvements in practice.

Adaptive step sizes. Some ODE solvers (like Dopri5) automatically choose the step size based on a local error estimate. They take larger steps where the vector field is smooth and smaller steps where it changes rapidly. This can achieve the same accuracy as 50 Euler steps in just 15-20 adaptive steps, with no manual tuning of the schedule.

Implementation in practice. The torchdiffeq library provides ODE solvers that work with PyTorch:

python
# Using torchdiffeq for adaptive ODE solving
import torch
from torchdiffeq import odeint

class FlowODE(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, t, x):
        """Required signature for torchdiffeq: (t, x) -> dx/dt"""
        t_batch = t.expand(x.shape[0], 1)
        return self.model(x, t_batch)

# Sample with adaptive solver
# (assumes `model` is a trained network called as model(x, t) and `d` is the data dimension)
x0 = torch.randn(16, d)  # batch of 16 noise samples
t_span = torch.tensor([0.0, 1.0])
x1 = odeint(FlowODE(model), x0, t_span,
    method='dopri5', atol=1e-5, rtol=1e-5)[-1]

The adaptive solver automatically determines how many function evaluations (NFE) are needed, typically 15-40 depending on the vector field complexity. This often beats fixed-step Euler at the same compute budget.

Numerical stability considerations. When implementing ODE/SDE solvers for generative models:

1. Avoid t = 0 and t = 1 exactly. Some vector fields diverge at the endpoints. Use t ∈ [ε, 1−ε] with ε = 10⁻⁵.

2. Use float32 or higher. Float16 can cause numerical issues in the early steps (t near 0) where the velocity is large.

3. Gradient clipping during training. Clip gradients to norm 1.0 to prevent training instabilities.

4. EMA of model weights. Use exponential moving average of parameters for sampling (decay 0.9999).
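Of the four items above, the EMA update is the easiest to get subtly wrong, so here is a minimal sketch of it. Plain Python lists stand in for model parameters, and `ema_update` is an illustrative name; real code would apply the same formula to each parameter tensor after every optimizer step.

```python
def ema_update(ema_params, params, decay=0.9999):
    """Return updated EMA weights: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

params = [1.0, -2.0]        # stand-in for the current training weights
ema_params = [0.0, 0.0]     # EMA copy used for sampling
for _ in range(10000):
    ema_params = ema_update(ema_params, params)

# After 10,000 updates with constant weights, the EMA has covered
# 1 - 0.9999**10000 (about 63%) of the distance to the weights.
print([round(e, 3) for e in ema_params])
```

With decay 0.9999 the average has a memory of roughly 10,000 steps, which is why EMA weights lag early in training.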

A note on continuous vs. discrete time. In this course, we treat time as continuous (t ∈ [0, 1]). The original DDPM paper used discrete time steps (t ∈ {0, 1, ..., T}). The continuous formulation is more elegant and general — the discrete version is recovered as a special case when you discretize. Most modern implementations use continuous time internally and only discretize for the Euler/Heun solver.

Forward process vs. reverse process. In the DDPM literature, the "forward process" adds noise to data (going from t = 0 to t = T) and the "reverse process" removes noise (going from t = T back to t = 0), so t = 0 is data. In the flow matching literature, the convention is reversed: t = 0 is noise and t = 1 is data. Both conventions are valid; just check which one a paper uses before reading its equations.

In this course, we follow the flow matching convention: t = 0 is noise, t = 1 is data. The ODE flows from t = 0 to t = 1 during sampling. During training, we sample random t values and compute the loss at that timestep.

The complete lifecycle of a generated image. To tie everything together, here is what happens end-to-end when you type "a sunset over the ocean" into Stable Diffusion 3:

Encode prompt

CLIP + T5 encode "a sunset over the ocean" into embedding vectors

↓

Initialize noise

X₀ = randn(1, 16, 64, 64): random latent noise (the SD3 VAE uses 16 latent channels)

↓

Simulate ODE (28 steps)

Each step: DiT(X_t, t, prompt_emb) → velocity → X_t+h
CFG: blend conditional and unconditional predictions

↓

Decode latent

VAE decoder maps X₁ (64×64×16) to pixels (512×512×3)

↓

Display

A photorealistic sunset over the ocean appears on your screen

Total time: ~3 seconds on a modern GPU. Total neural network forward passes: 2 × 28 = 56 (two per step for classifier-free guidance). The model has ~2 billion parameters and was trained on ~1 billion images for ~500,000 GPU-hours. This is the machinery that Ch 3 and Ch 4 will teach you to build.

The mathematical elegance. Step back and appreciate the framework we have built. The entire generative modeling pipeline reduces to: (1) choose a vector field parametrized by a neural network, (2) simulate a differential equation. That is it. The complexity of generating images, videos, proteins, and music is encoded in the neural network weights, which are learned from data using the simple regression losses of Ch 3-4. The differential equation framework provides the mathematical guarantee that this process is well-defined (existence and uniqueness of solutions) and produces valid probability distributions (via the continuity/Fokker-Planck equations). This marriage of deep learning and classical mathematics is what makes flow and diffusion models both practical and beautiful.

Key equations to remember from this chapter:

ODE: dX_t/dt = u_t(X_t), X₀ = x₀

Euler: X_t+h = X_t + h · u_t(X_t)

SDE: dX_t = u_t(X_t) dt + σ_t dW_t

Euler-Maruyama: X_t+h = X_t + h · u_t(X_t) + σ_t√h · ε

Summary of ODE/SDE solvers for generative models:

Solver	Type	Order	NFE per step	Typical steps
Euler	ODE	1	1	50-100
Heun	ODE	2	2	25-50
DPM-Solver++	ODE	2-3	1-2	15-25
Dopri5 (RK45)	ODE	5	adaptive	auto (15-40 NFE)
Euler-Maruyama	SDE	0.5	1	100-1000

Worked example — Algorithm 1 by hand. Suppose d = 1, n = 4 steps, h = 0.25, and X₀ = 1.5 (sampled from N(0,1)). Suppose the neural network outputs these velocities:

t	X_t	u_t^θ(X_t)	X_t+h = X_t + 0.25 · u
0.00	1.500	−3.2	1.5 + 0.25(−3.2) = 0.700
0.25	0.700	−1.8	0.7 + 0.25(−1.8) = 0.250
0.50	0.250	+0.5	0.25 + 0.25(0.5) = 0.375
0.75	0.375	+0.9	0.375 + 0.25(0.9) = 0.600

The output X₁ = 0.600 is our generated sample. The noise X₀ = 1.5 was transformed into the data sample 0.600 by the learned vector field.
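The table can be replayed in code by replacing the neural network with a lookup of the tabulated velocities. This is only a sketch: `table_velocities` and the stand-in `u_theta` are hypothetical and exist just to reproduce the arithmetic.

```python
# Velocities the (hypothetical) network returns in the table, keyed by time t.
table_velocities = {0.00: -3.2, 0.25: -1.8, 0.50: 0.5, 0.75: 0.9}

def u_theta(x, t):
    """Stand-in for the neural network: look up the tabulated velocity."""
    return table_velocities[round(t, 2)]

n, h = 4, 0.25
x, t = 1.5, 0.0
for _ in range(n):
    v = u_theta(x, t)
    x_next = x + h * v
    print(f"t={t:.2f}  X_t={x:.3f}  u={v:+.1f}  X_t+h={x_next:.3f}")
    x, t = x_next, t + h

print(f"X_1 = {x:.3f}")
```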

python
# Algorithm 1: Sampling from a Flow Model
import torch

def sample_flow_model(u_theta, n=50, d=784):
    t = 0.0
    h = 1.0 / n
    x = torch.randn(d)          # X_0 ~ N(0, I)
    for i in range(n):
        x = x + h * u_theta(x, t)  # Euler step
        t += h
    return x                      # X_1 ~ p_data

In a flow model, where does the randomness come from?

(a) The random initialization X₀ ~ N(0, I)
(b) Random noise added at each ODE step
(c) The neural network weights are randomized at inference

Chapter 5: Brownian Motion

Flow models are deterministic — once you fix the noise X₀, the trajectory is fully determined. Diffusion models add stochasticity during the trajectory itself, using a fundamental mathematical object called Brownian motion.

A Brownian motion W = (W_t)_{0 ≤ t ≤ 1} is a continuous random walk. Think of it as a drunk person stumbling around — at every instant, they take a tiny random step. It has three defining properties:

Brownian motion (W_t):
1. Starts at zero: W₀ = 0.
2. Normal increments: W_t − W_s ~ N(0, (t − s) I_d). The variance grows linearly with the time gap.
3. Independent increments: Non-overlapping time intervals have independent steps.

Worked example — simulating Brownian motion. In 1D, with step size h = 0.01, we simulate by setting W₀ = 0 and updating:

W_t+h = W_t + √h · ε_t, ε_t ~ N(0, 1)

With h = 0.01: √h = 0.1. At each step, we add a Gaussian with standard deviation 0.1. After 100 steps, we have a Brownian motion path from t = 0 to t = 1.

Numerical trace. Let ε₁ = 0.83, ε₂ = −1.21, ε₃ = 0.47:

Step	W_t	ε_t	W_t+h = W_t + 0.1 · ε
1	0.000	0.83	0.083
2	0.083	−1.21	−0.038
3	−0.038	0.47	0.009

The path zigzags randomly. Run the simulation again and you get a completely different path. This randomness is the ingredient that makes diffusion models stochastic.
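The three-step trace can be replayed with the same ε values. The sketch below is plain Python with the draws hard-coded; in a real simulation they would come from a Gaussian random number generator.

```python
import math

h = 0.01
eps_values = [0.83, -1.21, 0.47]   # the three draws from the table

w = 0.0                            # W_0 = 0
for step, eps in enumerate(eps_values, start=1):
    w_next = w + math.sqrt(h) * eps
    print(f"step {step}: W_t={w:+.3f}  eps={eps:+.2f}  W_t+h={w_next:+.3f}")
    w = w_next
```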

[Interactive demo: Brownian Motion Simulator. Each click generates a new Brownian motion path; all paths start at zero but diverge widely because of the random increments.]

Fun fact: Brownian motion paths are continuous (you could draw them without lifting your pen) but have infinite length (you would never stop drawing). They are also nowhere differentiable — the path is infinitely jagged at every point. This is why SDEs require special mathematics (stochastic calculus) rather than ordinary calculus.

Statistics of Brownian motion. Key properties that follow directly from the definition:

Property	Formula	Interpretation
Mean	E[W_t] = 0	On average, goes nowhere
Variance	Var(W_t) = t	Spread grows linearly with time
Std. dev.	√t	At t=1: std=1. At t=4: std=2
Covariance	Cov(W_s, W_t) = min(s,t)	Past and future are correlated

Numerical check. Simulate 10,000 Brownian paths to t = 1. The empirical distribution of W₁ should be approximately N(0, 1). Mean ≈ 0, Std ≈ 1. This is because W₁ − W₀ ~ N(0, 1) by the normal increments property.
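This check is easy to run. The plain-Python sketch below simulates 10,000 paths with 20 steps each; the step count and the fixed seed are choices made here for speed and repeatability, and the helper name is illustrative.

```python
import math
import random
import statistics

def brownian_endpoint(n_steps, rng):
    """Simulate one 1D Brownian path on [0, 1] and return W_1."""
    h = 1.0 / n_steps
    w = 0.0
    for _ in range(n_steps):
        w += math.sqrt(h) * rng.gauss(0.0, 1.0)
    return w

rng = random.Random(0)  # fixed seed so the run is repeatable
endpoints = [brownian_endpoint(20, rng) for _ in range(10000)]
print(f"mean of W_1 = {statistics.fmean(endpoints):+.3f}")
print(f"std of W_1  = {statistics.pstdev(endpoints):.3f}")
```

The number of steps does not affect the distribution of W₁ here, because a sum of independent Gaussian increments is exactly Gaussian with variance equal to the total elapsed time.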

python
# Simulating Brownian motion
import torch

def simulate_brownian(n_paths=5, n_steps=200, d=1):
    h = 1.0 / n_steps
    W = torch.zeros(n_paths, d)
    paths = [W.clone()]
    for _ in range(n_steps):
        W = W + (h ** 0.5) * torch.randn(n_paths, d)
        paths.append(W.clone())
    return torch.stack(paths)  # [n_steps+1, n_paths, d]

The increment W_0.7 − W_0.3 of a Brownian motion has what distribution?

(a) N(0, 0.7) (variance = endpoint)
(b) N(0, 0.4) (variance = time gap = 0.7 − 0.3)
(c) Uniform on [−1, 1]

Chapter 6: Stochastic Differential Equations

An SDE extends an ODE by adding a Brownian motion term. At each step, the particle follows the vector field and gets a random kick:

dX_t = u_t(X_t) dt + σ_t dW_t

The first term u_t(X_t) dt is the drift — the deterministic part, same as an ODE. The second term σ_t dW_t is the diffusion — the random part, scaled by the diffusion coefficient σ_t. When σ_t = 0, the SDE reduces to an ODE.

Intuition: An ODE is like floating down a calm river — deterministic, smooth. An SDE is like floating down a turbulent river — there is a general current (drift) plus unpredictable eddies (diffusion). The larger σ_t, the more turbulent.

We simulate SDEs using the Euler-Maruyama method, the stochastic analog of the Euler method:

X_t+h = X_t + h · u_t(X_t) + σ_t √h · ε_t, ε_t ~ N(0, I_d)

Compare to Euler for ODEs: X_t+h = X_t + h · u_t(X_t). The only difference is the added noise σ_t√h · ε_t.

Worked example — Ornstein-Uhlenbeck process. The OU process uses u_t(x) = −θx and constant σ:

dX_t = −θ X_t dt + σ dW_t

The drift −θx pushes the particle back toward zero (a "spring"). The diffusion σ adds noise. The two forces balance: the particle bounces around zero, eventually settling into a Gaussian distribution N(0, σ²/(2θ)).

Numerical example. Let θ = 2, σ = 1, X₀ = 3, h = 0.1:

Step	X_t	drift = −2X · 0.1	ε	noise = √0.1 · ε	X_t+h
1	3.000	−0.600	0.52	0.164	2.564
2	2.564	−0.513	−1.31	−0.414	1.637
3	1.637	−0.327	0.88	0.278	1.588

Notice how the trajectory is jagged (unlike the smooth ODE trajectory) because of the random noise at each step.
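The three Euler-Maruyama steps in the table can be replayed with the same ε values. This is a plain-Python sketch with the noise draws hard-coded from the table.

```python
import math

theta, sigma, h = 2.0, 1.0, 0.1
eps_values = [0.52, -1.31, 0.88]   # the draws used in the table

x = 3.0
for step, eps in enumerate(eps_values, start=1):
    drift = -theta * x * h                 # deterministic part: h * u(x)
    noise = sigma * math.sqrt(h) * eps     # random part: sigma * sqrt(h) * eps
    x_next = x + drift + noise
    print(f"step {step}: X_t={x:.3f}  drift={drift:+.3f}  noise={noise:+.3f}  X_t+h={x_next:.3f}")
    x = x_next
```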

The √h scaling is crucial. Why does the noise scale as σ√h rather than σ h? Because Brownian increments have variance proportional to the time gap. If we used σ h, the total variance after n steps would be n · (σh)² = σ²h, which vanishes as h → 0. With σ√h, the total variance is n · (σ√h)² = n · σ²h = σ², which is finite. This is the mathematically correct scaling for continuous-time stochastic processes.
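The two scalings can be compared by computing the total variance accumulated over n = 1/h steps. The sketch below is pure arithmetic with σ = 1 and no random sampling.

```python
sigma = 1.0
for n in (10, 100, 1000, 10000):
    h = 1.0 / n
    var_linear = n * (sigma * h) ** 2        # noise scaled by sigma * h
    var_sqrt = n * (sigma * h ** 0.5) ** 2   # noise scaled by sigma * sqrt(h)
    print(f"n={n:5d}  sigma*h scaling: {var_linear:.4f}   sigma*sqrt(h) scaling: {var_sqrt:.4f}")
```

The first column shrinks toward zero as the step size decreases, while the second stays at σ² regardless of n.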

Convergence of the OU process. The Ornstein-Uhlenbeck process converges to the stationary distribution N(0, σ²/(2θ)). For our example with θ = 2, σ = 1: the stationary distribution is N(0, 0.25). Starting at X₀ = 3, the process decays toward zero (due to the drift −2X) and fluctuates around zero (due to the noise), eventually settling into N(0, 0.25).
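The stationary variance can be checked by simulation. The sketch below runs Euler-Maruyama to T = 4 rather than T = 1, an assumption made here so the process has time to forget X₀ = 3. The step size, path count, and seed are also illustrative choices, and the discretization slightly inflates the variance.

```python
import math
import random
import statistics

def simulate_ou(x0, theta, sigma, h, n_steps, rng):
    """Euler-Maruyama for dX = -theta * X dt + sigma dW; returns the final X."""
    x = x0
    for _ in range(n_steps):
        x = x - theta * x * h + sigma * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
theta, sigma = 2.0, 1.0
# Run to T = 4 (200 steps of h = 0.02), long enough to forget X_0 = 3.
finals = [simulate_ou(3.0, theta, sigma, 0.02, 200, rng) for _ in range(2000)]

print(f"empirical mean     = {statistics.fmean(finals):+.3f}")
print(f"empirical variance = {statistics.pvariance(finals):.3f}")
print(f"sigma^2/(2*theta)  = {sigma**2 / (2 * theta):.3f}")
```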

The OU process as prototype: The Ornstein-Uhlenbeck process is to SDEs what the linear ODE dX/dt = −θX is to ODEs: the simplest non-trivial example, and the one that gives the most insight. It was the starting point for the original diffusion models (SMLD and DDPM in 2019-2020). Understanding it deeply pays dividends throughout this course.

[Interactive demo: ODE vs SDE Comparison. Both start at X₀ = 3 with drift u(x) = −2x. The ODE (orange) is smooth and deterministic; the SDE (teal) is noisy, and increasing σ (default 1.0) adds more randomness.]

python
# Euler-Maruyama method for SDE simulation
import torch

def euler_maruyama(u_theta, sigma_t, n=50, d=2):
    h = 1.0 / n
    x = torch.randn(d)
    t = 0.0
    for _ in range(n):
        eps = torch.randn_like(x)
        x = x + h * u_theta(x, t) + sigma_t(t) * (h**0.5) * eps
        t += h
    return x

What is the difference between the Euler method and Euler-Maruyama?

(a) Euler uses a neural network; Euler-Maruyama does not
(b) Euler-Maruyama uses larger step sizes
(c) Euler-Maruyama adds Gaussian noise σ_t√h · ε at each step

Chapter 7: Diffusion Models

A diffusion model is a generative model based on an SDE, just as a flow model is based on an ODE. The recipe is identical except we add noise during sampling:

X₀ ~ p_init, dX_t = u_t^θ(X_t) dt + σ_t dW_t

The neural network u_t^θ parameterizes the vector field (exactly as in a flow model). The diffusion coefficient σ_t is a fixed schedule — not learned. The goal is still X₁ ~ p_data.

Flow model = Diffusion model with σ_t = 0. Every flow model is a special case of a diffusion model. The distinction is purely about whether we add noise during sampling. The training algorithms (flow matching vs. score matching) also differ, as we will see in Chapters 3 and 4.

Why would we want noise during sampling? Three reasons:

1. Error correction. Neural networks are imperfect — u_t^θ only approximates the true vector field. The added noise can help "shake" trajectories out of error states, similar to how simulated annealing helps optimization escape local minima.

2. Diversity. Different amounts of noise can produce more diverse samples. The same initial noise X₀ can lead to different outputs depending on the Brownian motion realization.

3. Theoretical guarantees. Under certain conditions, SDE sampling can converge faster than ODE sampling, especially when the target distribution has many well-separated modes.

Worked example — comparing ODE and SDE sampling. Suppose our target has two sharp modes at −5 and +5. With ODE sampling, a particle starting at X₀ = 0.1 always ends up at the same mode (say +5). With SDE sampling, the Brownian noise can "push" the particle across the boundary, allowing it to reach either mode. This can improve mode coverage.

python
# Algorithm 2: Sampling from a Diffusion Model
import torch

def sample_diffusion_model(u_theta, sigma, n=50, d=784):
    t = 0.0
    h = 1.0 / n
    x = torch.randn(d)                   # X_0 ~ N(0, I)
    for i in range(n):
        eps = torch.randn_like(x)          # fresh noise
        x = x + h * u_theta(x, t) \
            + sigma(t) * (h**0.5) * eps  # Euler-Maruyama
        t += h
    return x                              # X_1 ~ p_data

Property	Flow Model	Diffusion Model
Equation	ODE	SDE
Trajectories	Smooth, deterministic	Jagged, stochastic
Randomness source	Initial noise X₀ only	X₀ + Brownian motion W_t
Simulation	Euler method	Euler-Maruyama method
Same X₀ → same output?	Yes	No
σ_t	0	> 0 (fixed schedule)

What is the relationship between flow models and diffusion models?

(a) A flow model is a diffusion model with σ_t = 0 (no noise during sampling)
(b) They are completely unrelated algorithms
(c) A diffusion model is a special case of a flow model

Chapter 8: ODE vs SDE Showcase

Let's see everything in action. The canvas below lets you watch flow models (ODE) and diffusion models (SDE) generate samples from a 2D distribution in real time. Multiple particles start as Gaussian noise and evolve toward the target data distribution.

[Interactive demo: Flow Model Sampling Simulator. Particles flow from noise (t = 0) to data (t = 1); the target is a mixture of 4 Gaussians. Controls: mode (ODE or SDE), σ for SDE mode (default 0.5), and number of steps (default 50).]

What to observe: In ODE mode, trajectories are smooth curves. Particles that start near each other stay near each other. In SDE mode, trajectories are jagged. Even particles starting from the same noise can diverge due to different Brownian motion realizations. Both end up at the target distribution — but the paths differ.

Practical considerations for real models:

Aspect	ODE (Flow Model)	SDE (Diffusion Model)
Typical steps	20-50	50-1000
Image quality	Good with few steps	Better with many steps
Deterministic?	Yes (same seed = same output)	No (different each run)
Speed	Faster (fewer steps needed)	Slower (more steps for stability)
Guidance	Works but limited	Naturally supports classifier guidance
Likelihood	Can compute exact log p(x)	Cannot (easily)

In practice, modern systems like Stable Diffusion 3 and FLUX use ODE sampling (flow matching) by default because it is faster. SDE sampling is used when higher quality is needed or when guidance techniques require it.

The key takeaway from this showcase: A flow model (ODE) and a diffusion model (SDE) are two ways to sample with the same learned vector field u_t^θ. The vector field is shared and only the sampling differs: SDE sampling adds noise at each step, together with the score-based correction term (σ²/2)s_θ that appears in the summary equations of Chapter 9. This flexibility is one of the most powerful aspects of the flow/diffusion framework: train once, then sample with either method depending on your quality/speed requirements.

Interpolation between ODE and SDE. You can even mix the two approaches. Start with SDE sampling (large σ) for the early steps (to get good mode coverage) and switch to ODE sampling (σ = 0) for the later steps (for precise, deterministic refinement). This "stochastic early, deterministic late" strategy is used by some state-of-the-art samplers.
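One way to express "stochastic early, deterministic late" is a diffusion coefficient that is positive before a switch time and zero afterwards. The sketch below is hypothetical: the switch time `t_switch`, the level `sigma_early`, and the toy drift are assumptions for illustration and do not describe any particular published sampler.

```python
import math
import random

def hybrid_sigma(t, sigma_early=0.5, t_switch=0.5):
    """Diffusion coefficient: sigma_early before t_switch, zero afterwards."""
    return sigma_early if t < t_switch else 0.0

def sample_hybrid(u, x0, n_steps=100, rng=None):
    """Euler-Maruyama with hybrid_sigma; reduces to plain Euler once sigma = 0."""
    rng = rng or random.Random()
    h = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        t = i * h
        noise = hybrid_sigma(t) * math.sqrt(h) * rng.gauss(0.0, 1.0)
        x = x + h * u(x, t) + noise
    return x

# Toy drift standing in for a trained model (an assumption for illustration).
print(sample_hybrid(lambda x, t: -2.0 * x, 3.0, rng=random.Random(0)))
```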

python
# Complete comparison: ODE vs SDE sampling
import torch

def sample_ode(model, x0, n_steps=50):
    """Deterministic ODE sampling."""
    x, h = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        x = x + h * model(x, t)
    return x

def sample_sde(model, x0, sigma=0.5, n_steps=100):
    """Stochastic SDE sampling."""
    x, h = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        eps = torch.randn_like(x)
        x = x + h * model(x, t) + sigma * (h**0.5) * eps
    return x

# Same model, two sampling methods:
x0 = torch.randn(1, 784)
img_ode = sample_ode(model, x0)   # always same output for this x0
img_sde = sample_sde(model, x0)   # different each time

You run a flow model (ODE) twice with the same X₀. Do you get the same output?

(a) Yes: the ODE is deterministic, so same input → same output
(b) No: the Brownian motion adds randomness
(c) It depends on the step size

Chapter 9: Connections

We have built the complete mathematical machinery for generative modeling with differential equations. Let's recap:

Concept	Definition	Role in Generation
Vector field u_t(x)	Velocity at position x, time t	Neural network parameterizes this
ODE	dX/dt = u_t(X_t)	Deterministic trajectory from noise to data
Flow ψ_t	Solution map of ODE	Where a particle ends up at time t
Euler method	X_t+h = X_t + hu_t(X_t)	Numerical simulation of ODE
Brownian motion	Continuous random walk W_t	Source of stochasticity in SDEs
SDE	dX = u dt + σ dW	Stochastic trajectory from noise to data
Euler-Maruyama	X_t+h = X_t + hu + σ√h ε	Numerical simulation of SDE
Flow model	SDE with σ = 0	Deterministic generative model
Diffusion model	SDE with σ > 0	Stochastic generative model

The open question: We know how to sample from a flow/diffusion model (simulate the ODE/SDE). But we have not yet discussed how to train the neural network u_t^θ. That is the subject of the next two chapters: Flow Matching (Chapter 3) shows how to train a flow model, and Score Matching (Chapter 4) shows how to train a diffusion model. The training algorithms are strikingly simple — just regression.

What we know vs. what we need. Here is the gap between what we have and what we need:

What We Have	What We Need
How to define a vector field (Ch 1)	How to LEARN the right vector field
How to solve an ODE/SDE numerically (Ch 3, 6)	What ODE/SDE to solve
The Euler and Euler-Maruyama methods	A training loss function
A dataset z₁, ..., z_N from p_data	Parameters θ such that X₁ ~ p_data

Chapter 3 (flow matching) closes this gap with a beautifully simple idea: define a probability path from noise to data, compute the vector field that follows this path analytically, and train the neural network to match it via MSE regression. No ODE simulation during training. No adversarial training. Just regression.

The key equations of this entire course fit on one card:

The entire flow/diffusion framework in 6 lines:

Data: z₁, ..., z_N ~ p_data    (collect examples)
Path: x = α_tz + β_tε    (ε ~ N(0,I), interpolate noise ↔ data)
Target: u_target = α̇_tz + β̇_tε    (velocity along the path)
Train: min ||u_θ(x, t) − u_target||²    (MSE regression)
ODE sample: X_t+h = X_t + h u_θ(X_t, t)    (Euler from noise)
SDE sample: X_t+h = X_t + h [u_θ + (σ²/2)s_θ] + σ√h ε    (add noise)
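As a worked instance of the Path and Target lines, take the linear schedule α_t = t, β_t = 1 − t. This schedule is an assumption chosen here for illustration; it gives the target z − ε. The sketch below confirms by finite differences that the time derivative of the path equals that target, using made-up scalar values for z and ε.

```python
def path(z, eps, t):
    """Linear path x_t = alpha_t * z + beta_t * eps with alpha_t = t, beta_t = 1 - t."""
    return t * z + (1.0 - t) * eps

def target_velocity(z, eps):
    """u_target = alpha_dot * z + beta_dot * eps = z - eps for the linear schedule."""
    return z - eps

z, eps, t, dt = 2.0, -0.5, 0.3, 1e-6
finite_difference = (path(z, eps, t + dt) - path(z, eps, t - dt)) / (2 * dt)
print(finite_difference, target_velocity(z, eps))   # both approximately 2.5
print(path(z, eps, 0.0), path(z, eps, 1.0))         # -0.5 (noise) and 2.0 (data)
```

The endpoints confirm the convention used throughout: the path starts at the noise sample at t = 0 and ends at the data point at t = 1.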

python
# Summary: the full generative modeling stack

# 1. REPRESENTATION (Ch 1)
z = data_sample   # z ∈ R^d (image, video, protein, ...)

# 2. MACHINERY (Ch 2 — this chapter)
u_theta = NeuralNet(d, hidden)  # u: R^d × [0,1] → R^d
# ODE: dX/dt = u_theta(X, t)  → flow model
# SDE: dX = u_theta(X,t)dt + σ dW  → diffusion model
# Simulate via Euler or Euler-Maruyama

# 3. TRAINING (Ch 3-4 — next)
# Flow matching: L = ||u_theta(x,t) - (z - eps)||^2
# Score matching: L = ||s_theta(x,t) + eps/beta||^2

# 4. GENERATION
# X_0 ~ N(0, I) → simulate ODE/SDE → X_1 ~ p_data

Ch 2 (Done)

We can SAMPLE from flow/diffusion models via ODE/SDE simulation

↓ next: how to TRAIN

Ch 3: Flow Matching

Train u_t^θ by regressing against conditional vector fields

↓

Ch 4: Score Matching

Score functions, SDE extension trick, denoising score matching

To build a complete generative model, what two things do we need?

(a) A sampling algorithm (ODE/SDE simulation) and a training algorithm (to learn u_t^θ)
(b) A discriminator and a generator
(c) An encoder and a decoder
