The mathematical machinery of generation: vector fields, ODEs, SDEs, and the Euler method.
In the last chapter, we formalized generative modeling as sampling from a data distribution pdata. We need an algorithm that converts simple noise — say from a Gaussian — into complex, structured data. But how?
Here is the key intuition: imagine a particle starting at a random position in space (noise). We want it to end up at a position that looks like real data. What if we could define a velocity field that tells the particle which direction to move at every point in space and time? The particle follows this field like a leaf in a stream, and if the field is designed correctly, it arrives at a sample from pdata.
This is exactly what a differential equation does. An ordinary differential equation (ODE) defines a velocity field, and "solving" the ODE means following that field to trace out a trajectory. A stochastic differential equation (SDE) adds random jitter on top — like a leaf in a turbulent stream.
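This chapter builds the mathematical toolkit piece by piece: vector fields, ODEs and flows, the Euler method, flow models, Brownian motion, SDEs and the Euler-Maruyama method, and finally diffusion models.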
A vector field is a function that assigns a velocity vector to every point in space and time:

$$u : \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d, \qquad (x, t) \mapsto u_t(x)$$
At position x and time t, the vector field ut(x) tells you the velocity — which direction and how fast to move. Think of it as a wind map: at every location and every moment, there is an arrow showing which way the wind blows and how strongly.
Worked example — linear vector field. Consider ut(x) = −θx for θ = 2. This field always points toward the origin. At x = 3, the velocity is −6 (move left). At x = −1, the velocity is +2 (move right). Every particle is pulled toward zero, and the further away you are, the stronger the pull.
Worked example — circular vector field. In 2D, ut(x1, x2) = (−x2, x1) creates circular trajectories. At (1, 0) the velocity is (0, 1) — pointing upward. At (0, 1) the velocity is (−1, 0) — pointing left. Particles orbit the origin.
[Interactive figure: velocity arrows for a selectable vector field type; a particle dropped anywhere follows the field.]
Worked example: saddle point vector field. In 2D, ut(x1, x2) = (θ x1, −θ x2) creates a saddle point. With θ = 2, the velocity at (1, 1) is (2, −2): pushed right and down. The x1 axis is unstable (|x1| grows), the x2 axis is stable (|x2| shrinks). Particles spread along x1 and compress along x2.
Mathematical properties of vector fields. A vector field ut(x) maps every point in Rd to a velocity vector in Rd. For our purposes, the vector field depends on both position x and time t. This time-dependence is crucial — the neural network needs to know "what time it is" to provide the correct velocity. At t = 0 (noise), the field should push particles apart. At t = 1 (data), it should guide them to the right locations.
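```python
# Defining vector fields in code
import torch


def linear_vf(x, theta=2.0):
    """u(x) = -theta * x (pulls toward origin)"""
    return -theta * x


def circular_vf(x, theta=1.0):
    """u(x1,x2) = (-theta*x2, theta*x1) (rotation)"""
    return theta * torch.stack([-x[..., 1], x[..., 0]], dim=-1)


# Neural network vector field (what we actually use)
class NeuralVF(torch.nn.Module):
    def __init__(self, d, hidden=128):
        super().__init__()
        # Placeholder MLP; the architecture and hidden width are illustrative choices.
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d + 1, hidden),
            torch.nn.SiLU(),
            torch.nn.Linear(hidden, d),
        )

    def forward(self, x, t):
        """x: [batch, d], t: [batch, 1] -> velocity [batch, d]"""
        return self.net(torch.cat([x, t], dim=-1))
```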
An ordinary differential equation (ODE) says: "follow the vector field." Given a starting point x0, the ODE asks for a trajectory Xt whose velocity at every moment equals the vector field:

$$\frac{\mathrm{d}}{\mathrm{d}t} X_t = u_t(X_t), \qquad X_0 = x_0$$

The first equation says the trajectory's velocity must match the vector field. The second equation says we start at x0. Together, they fully determine the trajectory: provided ut is sufficiently regular (for example, Lipschitz continuous in x), there is exactly one path through each starting point.
The solution to an ODE is called a flow, denoted ψt:

$$\frac{\mathrm{d}}{\mathrm{d}t}\psi_t(x_0) = u_t(\psi_t(x_0)), \qquad \psi_0(x_0) = x_0$$
The flow ψt(x0) tells you: "if you start at x0 and follow the vector field for time t, where do you end up?" The trajectory is Xt = ψt(X0).
Worked example: linear ODE. Let ut(x) = −θx for θ = 2. The ODE is dX/dt = −2X. The solution is:

$$\psi_t(x_0) = e^{-2t} x_0$$

Verification: At t = 0, $\psi_0(x_0) = e^{0} \cdot x_0 = x_0$. Check. The derivative is $\frac{\mathrm{d}}{\mathrm{d}t}\psi_t(x_0) = -2e^{-2t}x_0 = -2\psi_t(x_0) = u_t(\psi_t(x_0))$. Check. So the flow satisfies the ODE.
Numerical example. Start at x0 = 4, θ = 2:
| t | ψt(4) = $4e^{-2t}$ | Velocity ut(ψt) = −2ψt |
|---|---|---|
| 0.0 | 4.000 | −8.000 |
| 0.25 | 2.426 | −4.852 |
| 0.50 | 1.472 | −2.943 |
| 0.75 | 0.893 | −1.785 |
| 1.0 | 0.541 | −1.083 |
The particle decays exponentially toward zero. The velocity decreases as the particle gets closer to the origin.
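The table values can be reproduced in a few lines of plain Python. This is a minimal sketch; the helper names `flow` and `velocity` are illustrative and do not come from any library.

```python
import math

THETA = 2.0  # decay rate from the worked example


def flow(t, x0, theta=THETA):
    """Closed-form flow of dX/dt = -theta * X: psi_t(x0) = exp(-theta * t) * x0."""
    return math.exp(-theta * t) * x0


def velocity(x, theta=THETA):
    """Vector field u(x) = -theta * x."""
    return -theta * x


if __name__ == "__main__":
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        x = flow(t, 4.0)
        print(f"t={t:.2f}  psi_t(4)={x:.3f}  velocity={velocity(x):.3f}")
```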
Moreover, the flow ψt is a diffeomorphism — a smooth, invertible transformation with a smooth inverse. This means the flow "warps" space without tearing or folding it. Think of it as stretching a rubber sheet: every point maps to exactly one other point, and you can always undo the deformation.
Why diffeomorphisms matter: The flow maps the noise distribution N(0, I) to the data distribution pdata. Because the map is invertible, we can also go backwards: given a data point z, we can find the noise point $\psi_1^{-1}(z)$ that generated it. This invertibility is useful for computing likelihoods and for encoding data into a latent space.
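Worked example: 2D rotation flow. For the circular vector field u(x1, x2) = (−x2, x1), the flow is a rotation:

$$\psi_t(x) = \begin{pmatrix} \cos t & -\sin t \\ \sin t & \cos t \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$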
At t = π/2: the flow rotates everything 90 degrees counterclockwise. At t = π: 180 degrees. The inverse is simply rotating in the opposite direction.
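A short sketch of the rotation flow and its inverse, written in plain Python. The function names are hypothetical; the point is only that undoing the flow amounts to running it with the opposite sign of t.

```python
import math


def rotation_flow(t, x):
    """Flow of u(x1, x2) = (-x2, x1): rotate x counterclockwise by angle t."""
    x1, x2 = x
    c, s = math.cos(t), math.sin(t)
    return (c * x1 - s * x2, s * x1 + c * x2)


def rotation_flow_inverse(t, z):
    """Inverse flow: rotate by -t, undoing rotation_flow(t, .)."""
    return rotation_flow(-t, z)


if __name__ == "__main__":
    x0 = (1.0, 0.0)
    z = rotation_flow(math.pi / 2, x0)            # quarter turn
    back = rotation_flow_inverse(math.pi / 2, z)  # should recover x0 up to rounding
    print(z, back)
```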
We can analytically solve simple ODEs like dX/dt = −θX. But when the vector field ut(x) is a neural network with millions of parameters, there is no formula for the flow. We need to simulate the ODE numerically.
The Euler method is the simplest simulation technique. The idea: take small discrete steps along the vector field. Starting at X0 = x0, update iteratively:

$$X_{t+h} = X_t + h\,u_t(X_t), \qquad t = 0, h, 2h, \dots, 1 - h$$
where h = 1/n is the step size and n is the number of steps. We evaluate the velocity at the current position, multiply by the step size, and add it to get the new position. After n steps, we reach t = 1.
Worked example — Euler for dX/dt = −2X. Let X0 = 4, n = 4 steps, so h = 0.25:
| Step | t | Xt | ut(Xt) = −2Xt | Xt+h = Xt + 0.25 · u |
|---|---|---|---|---|
| 1 | 0.00 | 4.000 | −8.000 | 4 + 0.25(−8) = 2.000 |
| 2 | 0.25 | 2.000 | −4.000 | 2 + 0.25(−4) = 1.000 |
| 3 | 0.50 | 1.000 | −2.000 | 1 + 0.25(−2) = 0.500 |
| 4 | 0.75 | 0.500 | −1.000 | 0.5 + 0.25(−1) = 0.250 |
Euler gives X1 ≈ 0.250. The exact answer is $4e^{-2}$ ≈ 0.541. The error is significant because h = 0.25 is coarse. With n = 100 steps (h = 0.01), Euler gives X1 = $4 \cdot 0.98^{100}$ ≈ 0.530, which is much closer.
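The same computation can be run for any number of steps. The sketch below hard-codes the linear field u(x) = −θx from the worked example; `euler_linear` is an illustrative name, not a library function.

```python
import math


def euler_linear(x0, theta, n):
    """Euler method for dX/dt = -theta * X on [0, 1] with n equal steps."""
    h = 1.0 / n
    x = x0
    for _ in range(n):
        x = x + h * (-theta * x)  # X_{t+h} = X_t + h * u(X_t)
    return x


if __name__ == "__main__":
    exact = 4.0 * math.exp(-2.0)
    for n in (4, 10, 100, 1000):
        approx = euler_linear(4.0, 2.0, n)
        print(f"n={n:5d}  Euler={approx:.4f}  abs. error={abs(approx - exact):.4f}")
```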
Error analysis. The Euler method has a local truncation error of $O(h^2)$ per step and a global error of $O(h)$: halving the step size roughly halves the error. For the running example dX/dt = −2X with X0 = 4, measured at t = 1:
| Steps n | Step size h = 1/n | Relative error at t = 1 | NFE (neural net evals) |
|---|---|---|---|
| 4 | 0.250 | ~54% | 4 |
| 10 | 0.100 | ~21% | 10 |
| 50 | 0.020 | ~4% | 50 |
| 100 | 0.010 | ~2% | 100 |
Heun's method (a refinement): Instead of blindly trusting the velocity at the current point, Heun's method takes a trial Euler step, evaluates the velocity at the trial point, and averages the two velocities:

$$\tilde{X}_{t+h} = X_t + h\,u_t(X_t), \qquad X_{t+h} = X_t + \frac{h}{2}\left(u_t(X_t) + u_{t+h}(\tilde{X}_{t+h})\right)$$

This costs 2 function evaluations (NFE) per step but has $O(h^2)$ global error, compared with $O(h)$ for Euler. So 25 Heun steps = 50 NFE, comparable to 50 Euler steps. In practice, Heun often gives better results at the same compute budget.
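To make the trade-off concrete, here is a sketch comparing Euler and Heun at the same budget of 50 function evaluations on the running example dX/dt = −2X. The solver functions are generic, but `linear_field` and the step counts are just the example's choices.

```python
import math


def euler(u, x0, n):
    """n Euler steps on [0, 1]; uses n evaluations of u."""
    h, x = 1.0 / n, x0
    for i in range(n):
        x = x + h * u(x, i * h)
    return x


def heun(u, x0, n):
    """n Heun steps on [0, 1]; uses 2n evaluations of u."""
    h, x = 1.0 / n, x0
    for i in range(n):
        t = i * h
        v1 = u(x, t)                 # velocity at the current point
        x_trial = x + h * v1         # trial Euler step
        v2 = u(x_trial, t + h)       # velocity at the trial point
        x = x + 0.5 * h * (v1 + v2)  # average the two velocities
    return x


def linear_field(x, t):
    """u_t(x) = -2x, the running example."""
    return -2.0 * x


if __name__ == "__main__":
    exact = 4.0 * math.exp(-2.0)
    print("Euler, 50 NFE:", abs(euler(linear_field, 4.0, 50) - exact))
    print("Heun,  50 NFE:", abs(heun(linear_field, 4.0, 25) - exact))
```

On this smooth example Heun's error is noticeably smaller at equal NFE; with very coarse steps the ordering can flip, so the comparison is worth rerunning for the step counts you care about.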
[Interactive figure: the Euler method approximating the true ODE solution for an adjustable number of steps. The orange curve is the exact solution; teal dots are Euler steps.]
```python
import torch


def euler_sample(u_theta, n_steps=50, d=2):
    """Sample from a flow model using Euler method."""
    h = 1.0 / n_steps
    x = torch.randn(d)       # X_0 ~ N(0, I)
    t = 0.0
    for _ in range(n_steps):
        v = u_theta(x, t)    # evaluate neural network
        x = x + h * v        # Euler step
        t += h
    return x                 # X_1 ~ p_data
```
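We now have everything needed to define a flow model, a generative model based on an ODE. The recipe:
1. Sample the initial point from a simple distribution: X0 ~ pinit.
2. Simulate the ODE dXt/dt = utθ(Xt) from t = 0 to t = 1, where utθ is a neural network with parameters θ.
3. Return X1. The goal of training is to make X1 ~ pdata.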
The ODE is deterministic — given the same starting noise X0, you always get the same output X1. The randomness comes entirely from the random initialization. Different noise samples produce different outputs.
What neural architecture is used? The neural network utθ must take a noisy input x and a time t and output a velocity of the same shape. Common architectures:
| Architecture | Used By | Input | Output |
|---|---|---|---|
| U-Net | Stable Diffusion 1/2, DALL-E 2 | [B, C, H, W] + t | [B, C, H, W] |
| DiT (Diffusion Transformer) | SD3, FLUX, Sora | [B, N, D] + t | [B, N, D] |
| Simple MLP | Toy experiments | [B, d] + t | [B, d] |
The time t is typically embedded using sinusoidal or learned embeddings, similar to positional encodings in Transformers. This time embedding is added to intermediate activations so the network "knows" what stage of the denoising process it is in. At t = 0, the input is pure noise, so the network should output a large velocity. At t = 1, the input is nearly clean, and the velocity should be small (just fine-tuning details).
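A sinusoidal time embedding can be sketched in plain Python as follows. The defaults `dim=8` and `max_period=10000.0` are assumptions borrowed from Transformer positional encodings, not values given in this chapter.

```python
import math


def sinusoidal_time_embedding(t, dim=8, max_period=10000.0):
    """Map a scalar time t to a list of `dim` sin/cos features.

    dim and max_period are illustrative defaults, not values from the text.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down towards 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```

Implementations often rescale t (for example by a factor of 1000) before embedding so that t ∈ [0, 1] covers a useful range of phases; that scaling is a design choice left out of this sketch.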
Why the initial distribution matters. We usually choose pinit = N(0, Id) because:
1. We can sample from it easily (just generate random numbers).
2. It has full support on Rd: every point has positive probability density.
3. It is isotropic — no preferred direction, so the model does not need to "undo" any structure in the noise.
4. It is well-studied mathematically, making theoretical analysis tractable.
Could we use a different pinit? Yes! Some models use a uniform distribution on a hypersphere, or a truncated Gaussian. But N(0, I) is the simplest and most common choice. The choice of pinit affects the difficulty of the learning problem: if pinit is very different from pdata, the flow must perform a more dramatic transformation, which requires more neural network capacity and more ODE steps.
Computational cost of generation. Each Euler step requires one forward pass through the neural network. The total cost of generating one sample is n × (cost of one forward pass). For reference:
| Model | Params | Steps | Time per image (A100) |
|---|---|---|---|
| DiT-S/2 (CIFAR) | 33M | 50 | ~0.3s |
| DiT-XL/2 (ImageNet 256) | 675M | 50 | ~2s |
| SD3-Medium (512×512) | 2B | 28 | ~3s |
| FLUX.1 (1024×1024) | 12B | 50 | ~15s |
Research in "distillation" aims to reduce the number of steps needed. Consistency models and progressive distillation can achieve good quality in 1-4 steps, at the cost of additional training. This is an active area of research as of 2025-2026.
Latent vs. pixel space. In practice, flow/diffusion models rarely operate directly in pixel space (too high-dimensional). Instead, a VAE first encodes images into a lower-dimensional latent space. The dimensions given above are for the latent representation. The actual generation pipeline is: noise → ODE/SDE in latent space → VAE decoder → pixels.
Classifier-free guidance. One of the most impactful practical techniques is not part of the basic theory but uses the score function. By training the model to sometimes drop the conditioning (text prompt), it learns both conditional and unconditional generation. At inference, the two predictions are combined to amplify the conditioning signal. This dramatically improves text-to-image alignment. We will see the mathematical details in Chapter 5 of the book, but it relies on the score function concepts from Chapter 4.
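The combination step can be sketched as a linear extrapolation between the two predictions. This assumes the common convention in which a guidance scale of 1 recovers the conditional prediction; some papers parameterize the scale differently, and the function name is hypothetical.

```python
def cfg_velocity(u_cond, u_uncond, guidance_scale):
    """Combine conditional and unconditional predictions elementwise.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional one, and values above 1 extrapolate past the conditional
    prediction, amplifying the effect of the conditioning.
    """
    return [uu + guidance_scale * (uc - uu) for uc, uu in zip(u_cond, u_uncond)]
```

Each sampling step then needs two network evaluations, one with the prompt and one without.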
The Euler method vs. higher-order solvers. Beyond Euler and Heun, there are more advanced ODE solvers like DPM-Solver, DPM-Solver++, and UniPC that take advantage of the semi-linear structure of the diffusion sampling ODE. These can achieve the same quality as 50 Euler steps in just 10-20 steps by using higher-order approximations and specialized scheduling. In production systems, these advanced solvers are standard.
Step schedule matters. Instead of using uniform step sizes h = 1/n, production models use non-uniform schedules. The key insight: more steps should be allocated where the vector field changes rapidly. For CondOT paths, the field is nearly constant, so uniform steps work well. For VP (DDPM-style) paths, more steps are needed near t = 1 where fine details emerge.
```python
# Non-uniform step schedules
import torch

n_steps = 50  # number of solver steps (example value)

# Uniform schedule (basic)
t_uniform = torch.linspace(0, 1, n_steps + 1)

# Quadratic schedule (more steps near t=1)
t_quad = 1 - (1 - torch.linspace(0, 1, n_steps + 1)) ** 2

# Karras schedule (empirically optimized).
# Note: this yields noise levels sigma decreasing from sigma_max to sigma_min,
# not times in [0, 1].
sigma_min, sigma_max = 0.002, 80
rho = 7
ramp = torch.linspace(0, 1, n_steps + 1)
sigma_karras = (sigma_max ** (1 / rho)
                + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
```
The choice of step schedule can improve FID by 10-30% at the same number of steps. This is one of the most impactful "free" improvements in practice.
Adaptive step sizes. Some ODE solvers (like Dopri5) automatically choose the step size based on a local error estimate. They take larger steps where the vector field is smooth and smaller steps where it changes rapidly. This can achieve the same accuracy as 50 Euler steps in just 15-20 adaptive steps, with no manual tuning of the schedule.
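The idea of choosing the step from a local error estimate can be shown with a much simpler pair than Dopri5. The sketch below is a 1D illustration using an Euler/Heun pair: the difference between the two results estimates the local error, and the step is shrunk or grown accordingly. The tolerance, safety factor 0.9, and growth limits are illustrative assumptions, not values from any particular library.

```python
def adaptive_euler_heun(u, x0, t0=0.0, t1=1.0, tol=1e-4, h=0.1):
    """Adaptive 1D ODE solver; returns (x at t1, number of evaluations of u)."""
    t, x, nfe = t0, x0, 0
    while t < t1 - 1e-12:
        h = min(h, t1 - t)              # do not step past the end time
        k1 = u(x, t)
        x_euler = x + h * k1            # first-order result
        k2 = u(x_euler, t + h)
        x_heun = x + 0.5 * h * (k1 + k2)  # second-order result
        nfe += 2
        err = abs(x_heun - x_euler)     # local error estimate
        if err <= tol:                  # accept the step, keep the better value
            t, x = t + h, x_heun
        # Shrink h after a large error, grow it after a small one.
        h *= min(2.0, max(0.2, 0.9 * (tol / (err + 1e-16)) ** 0.5))
    return x, nfe
```

For example, `adaptive_euler_heun(lambda x, t: -2.0 * x, 4.0)` integrates the running example and takes larger steps as the particle approaches the origin, where the velocity is small.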
Implementation in practice. The torchdiffeq library provides ODE solvers that work with PyTorch:
```python
# Using torchdiffeq for adaptive ODE solving
import torch
from torchdiffeq import odeint


class FlowODE(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, t, x):
        """Required signature for torchdiffeq: (t, x) -> dx/dt"""
        t_batch = t.expand(x.shape[0], 1)  # broadcast scalar time to [batch, 1]
        return self.model(x, t_batch)


# Sample with adaptive solver.
# `model` is assumed to be a trained vector field taking (x: [batch, d], t: [batch, 1]),
# and `d` is the data dimension; both must be defined before running this.
x0 = torch.randn(16, d)  # batch of 16 noise samples
t_span = torch.tensor([0.0, 1.0])
x1 = odeint(FlowODE(model), x0, t_span,
            method='dopri5', atol=1e-5, rtol=1e-5)[-1]
```
The adaptive solver automatically determines how many function evaluations (NFE) are needed, typically 15-40 depending on the vector field complexity. This often beats fixed-step Euler at the same compute budget.
Numerical stability considerations. When implementing ODE/SDE solvers for generative models:
1. Avoid t = 0 and t = 1 exactly. Some vector fields diverge at the endpoints. Use t ∈ [ε, 1−ε] with ε = $10^{-5}$.
2. Use float32 or higher. Float16 can cause numerical issues in the early steps (t near 0) where the velocity is large.
3. Gradient clipping during training. Clip gradients to norm 1.0 to prevent training instabilities.
4. EMA of model weights. Use exponential moving average of parameters for sampling (decay 0.9999).
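Two of these items are simple enough to sketch directly. The helpers below operate on plain Python numbers and lists rather than tensors, and their names are illustrative.

```python
def clamp_time(t, eps=1e-5):
    """Keep t inside [eps, 1 - eps] to avoid evaluating exactly at the endpoints."""
    return min(max(t, eps), 1.0 - eps)


def ema_update(ema_params, params, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

In a training loop, `ema_update` would be called once after each optimizer step, and the EMA copy would be the one used for sampling.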
A note on continuous vs. discrete time. In this course, we treat time as continuous (t ∈ [0, 1]). The original DDPM paper used discrete time steps (t ∈ {0, 1, ..., T}). The continuous formulation is more elegant and general — the discrete version is recovered as a special case when you discretize. Most modern implementations use continuous time internally and only discretize for the Euler/Heun solver.
Forward process vs. reverse process. In the DDPM literature, the "forward process" adds noise to data (going from t = 0 to t = 1) and the "reverse process" removes noise (going from t = 1 to t = 0). In the flow matching literature, the convention is reversed: t = 0 is noise and t = 1 is data. Both conventions are valid; just be careful when reading papers to check which convention is used.
In this course, we follow the flow matching convention: t = 0 is noise, t = 1 is data. The ODE flows from t = 0 to t = 1 during sampling. During training, we sample random t values and compute the loss at that timestep.
The complete lifecycle of a generated image. To tie everything together, here is what happens end-to-end when you type "a sunset over the ocean" into Stable Diffusion 3:
Total time: ~3 seconds on a modern GPU. Total neural network forward passes: 2 × 28 = 56 (two per step for classifier-free guidance). The model has ~2 billion parameters and was trained on ~1 billion images for ~500,000 GPU-hours. This is the machinery that Ch 3 and Ch 4 will teach you to build.
The mathematical elegance. Step back and appreciate the framework we have built. The entire generative modeling pipeline reduces to: (1) choose a vector field parametrized by a neural network, (2) simulate a differential equation. That is it. The complexity of generating images, videos, proteins, and music is encoded in the neural network weights, which are learned from data using the simple regression losses of Ch 3-4. The differential equation framework provides the mathematical guarantee that this process is well-defined (existence and uniqueness of solutions) and produces valid probability distributions (via the continuity/Fokker-Planck equations). This marriage of deep learning and classical mathematics is what makes flow and diffusion models both practical and beautiful.
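Key equations to remember from this chapter:

$$\text{ODE:}\quad \frac{\mathrm{d}}{\mathrm{d}t}X_t = u_t(X_t) \qquad\qquad \text{Euler:}\quad X_{t+h} = X_t + h\,u_t(X_t)$$

$$\text{SDE:}\quad \mathrm{d}X_t = u_t(X_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t \qquad \text{Euler-Maruyama:}\quad X_{t+h} = X_t + h\,u_t(X_t) + \sigma_t\sqrt{h}\,\varepsilon_t$$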
Summary of ODE/SDE solvers for generative models:
| Solver | Type | Order | NFE per step | Typical steps |
|---|---|---|---|---|
| Euler | ODE | 1 | 1 | 50-100 |
| Heun | ODE | 2 | 2 | 25-50 |
| DPM-Solver++ | ODE | 2-3 | 1-2 | 15-25 |
| Dopri5 (RK45) | ODE | 5 | adaptive | auto (15-40 NFE) |
| Euler-Maruyama | SDE | 0.5 | 1 | 100-1000 |
Worked example — Algorithm 1 by hand. Suppose d = 1, n = 4 steps, h = 0.25, and X0 = 1.5 (sampled from N(0,1)). Suppose the neural network outputs these velocities:
| t | Xt | utθ(Xt) | Xt+h = Xt + 0.25 · u |
|---|---|---|---|
| 0.00 | 1.500 | −3.2 | 1.5 + 0.25(−3.2) = 0.700 |
| 0.25 | 0.700 | −1.8 | 0.7 + 0.25(−1.8) = 0.250 |
| 0.50 | 0.250 | +0.5 | 0.25 + 0.25(0.5) = 0.375 |
| 0.75 | 0.375 | +0.9 | 0.375 + 0.25(0.9) = 0.600 |
The output X1 = 0.600 is our generated sample. The noise X0 = 1.5 was transformed into the data sample 0.600 by the learned vector field.
```python
# Algorithm 1: Sampling from a Flow Model
import torch


def sample_flow_model(u_theta, n=50, d=784):
    t = 0.0
    h = 1.0 / n
    x = torch.randn(d)              # X_0 ~ N(0, I)
    for i in range(n):
        x = x + h * u_theta(x, t)   # Euler step
        t += h
    return x                        # X_1 ~ p_data
```
Flow models are deterministic — once you fix the noise X0, the trajectory is fully determined. Diffusion models add stochasticity during the trajectory itself, using a fundamental mathematical object called Brownian motion.
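A Brownian motion $W = (W_t)_{0 \le t \le 1}$ is a continuous random walk. Think of it as a drunk person stumbling around: at every instant, they take a tiny random step. It has three defining properties:
1. Starts at zero: W0 = 0.
2. Normal increments: Wt − Ws ~ N(0, (t − s) Id) for all 0 ≤ s < t.
3. Independent increments: increments over non-overlapping time intervals are independent.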
Worked example: simulating Brownian motion. In 1D, with step size h = 0.01, we simulate by setting W0 = 0 and updating:

$$W_{t+h} = W_t + \sqrt{h}\,\varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, 1)$$
With h = 0.01: √h = 0.1. At each step, we add a Gaussian with standard deviation 0.1. After 100 steps, we have a Brownian motion path from t = 0 to t = 1.
Numerical trace. Let ε1 = 0.83, ε2 = −1.21, ε3 = 0.47:
| Step | Wt | εt | Wt+h = Wt + 0.1 · ε |
|---|---|---|---|
| 1 | 0.000 | 0.83 | 0.083 |
| 2 | 0.083 | −1.21 | −0.038 |
| 3 | −0.038 | 0.47 | 0.009 |
The path zigzags randomly. Run the simulation again and you get a completely different path. This randomness is the ingredient that makes diffusion models stochastic.
[Interactive figure: each click generates a new Brownian motion path. All paths start at zero but diverge wildly due to random increments.]
Statistics of Brownian motion. Key properties that follow directly from the definition:
| Property | Formula | Interpretation |
|---|---|---|
| Mean | E[Wt] = 0 | On average, goes nowhere |
| Variance | Var(Wt) = t | Spread grows linearly with time |
| Std. dev. | √t | At t=1: std=1. At t=4: std=2 |
| Covariance | Cov(Ws, Wt) = min(s,t) | Past and future are correlated |
Numerical check. Simulate 10,000 Brownian paths to t = 1. The empirical distribution of W1 should be approximately N(0, 1). Mean ≈ 0, Std ≈ 1. This is because W1 − W0 ~ N(0, 1) by the normal increments property.
```python
# Simulating Brownian motion
import torch


def simulate_brownian(n_paths=5, n_steps=200, d=1):
    h = 1.0 / n_steps
    W = torch.zeros(n_paths, d)
    paths = [W.clone()]
    for _ in range(n_steps):
        W = W + (h ** 0.5) * torch.randn(n_paths, d)
        paths.append(W.clone())
    return torch.stack(paths)  # [n_steps+1, n_paths, d]
```
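The statistical properties listed in the table can be checked empirically. The sketch below uses only the standard library and, to stay fast in pure Python, fewer paths than the 10,000 mentioned in the text; the path count, step count, and seed are arbitrary choices.

```python
import math
import random


def brownian_endpoints(n_paths=2000, n_steps=50, seed=0):
    """Simulate 1D Brownian paths on [0, 1]; return (W_0.5, W_1) per path."""
    rng = random.Random(seed)
    h = 1.0 / n_steps
    out = []
    for _ in range(n_paths):
        w, w_half = 0.0, 0.0
        for k in range(1, n_steps + 1):
            w += math.sqrt(h) * rng.gauss(0.0, 1.0)  # W_{t+h} = W_t + sqrt(h) * eps
            if k == n_steps // 2:
                w_half = w                            # record the value at t = 0.5
        out.append((w_half, w))
    return out


def summarize(samples):
    """Empirical mean and variance of W_1, and covariance of (W_0.5, W_1)."""
    n = len(samples)
    mean_half = sum(a for a, _ in samples) / n
    mean_one = sum(b for _, b in samples) / n
    var_one = sum((b - mean_one) ** 2 for _, b in samples) / n
    cov = sum((a - mean_half) * (b - mean_one) for a, b in samples) / n
    return mean_one, var_one, cov
```

Calling `summarize(brownian_endpoints())` should give a mean near 0, a variance near 1, and a covariance near min(0.5, 1) = 0.5, up to sampling noise.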
An SDE extends an ODE by adding a Brownian motion term. At each step, the particle follows the vector field and gets a random kick:

$$\mathrm{d}X_t = u_t(X_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t, \qquad X_0 = x_0$$
The first term ut(Xt) dt is the drift — the deterministic part, same as an ODE. The second term σt dWt is the diffusion — the random part, scaled by the diffusion coefficient σt. When σt = 0, the SDE reduces to an ODE.
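We simulate SDEs using the Euler-Maruyama method, the stochastic analog of the Euler method:

$$X_{t+h} = X_t + h\,u_t(X_t) + \sigma_t\sqrt{h}\,\varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, I_d)$$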
Compare to Euler for ODEs: Xt+h = Xt + h · ut(Xt). The only difference is the added noise σt√h · εt.
Worked example: Ornstein-Uhlenbeck process. The OU process uses ut(x) = −θx and constant σ:

$$\mathrm{d}X_t = -\theta X_t\,\mathrm{d}t + \sigma\,\mathrm{d}W_t$$

The drift −θx pushes the particle back toward zero (a "spring"). The diffusion σ adds noise. The two forces balance: the particle bounces around zero, eventually settling into a Gaussian distribution $\mathcal{N}(0, \sigma^2/(2\theta))$.
Numerical example. Let θ = 2, σ = 1, X0 = 3, h = 0.1:
| Step | Xt | drift = −2X · 0.1 | ε | noise = √0.1 · ε | Xt+h |
|---|---|---|---|---|---|
| 1 | 3.000 | −0.600 | 0.52 | 0.164 | 2.564 |
| 2 | 2.564 | −0.513 | −1.31 | −0.414 | 1.637 |
| 3 | 1.637 | −0.327 | 0.88 | 0.278 | 1.588 |
Notice how the trajectory is jagged (unlike the smooth ODE trajectory) because of the random noise at each step.
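The three steps in the table can be replayed by feeding the same noise draws into an Euler-Maruyama update. This is a small sketch in plain Python; `euler_maruyama_trace` is an illustrative helper that takes the ε values as input instead of sampling them.

```python
import math


def euler_maruyama_trace(x0, theta, sigma, h, eps_values):
    """Apply Euler-Maruyama to dX = -theta*X dt + sigma dW with given noise draws."""
    xs = [x0]
    for eps in eps_values:
        x = xs[-1]
        drift = -theta * x * h                # deterministic part
        noise = sigma * math.sqrt(h) * eps    # random part, scaled by sqrt(h)
        xs.append(x + drift + noise)
    return xs


if __name__ == "__main__":
    trace = euler_maruyama_trace(3.0, 2.0, 1.0, 0.1, [0.52, -1.31, 0.88])
    print([round(x, 3) for x in trace])
```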
The √h scaling is crucial. Why does the noise scale as σ√h rather than σh? Because Brownian increments have variance proportional to the time gap. Over n = 1/h steps, using σh would give a total variance of $n \cdot (\sigma h)^2 = \sigma^2 h$, which vanishes as h → 0. With σ√h, the total variance is $n \cdot (\sigma\sqrt{h})^2 = n \cdot \sigma^2 h = \sigma^2$, which is finite and independent of h. This is the mathematically correct scaling for continuous-time stochastic processes.
Convergence of the OU process. The Ornstein-Uhlenbeck process converges to the stationary distribution $\mathcal{N}(0, \sigma^2/(2\theta))$. For our example with θ = 2, σ = 1: the stationary distribution is N(0, 0.25). Starting at X0 = 3, the process decays toward zero (due to the drift −2X) and fluctuates around zero (due to the noise), eventually settling into N(0, 0.25).
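The stationary variance can be checked by simulation. The sketch below runs many Euler-Maruyama paths to a time T that is long compared with 1/θ and measures the spread of the endpoints; T, h, the number of paths, and the seed are arbitrary illustrative choices.

```python
import math
import random


def ou_samples(theta=2.0, sigma=1.0, x0=3.0, T=4.0, h=0.02, n_paths=2000, seed=0):
    """Euler-Maruyama samples of X_T for dX = -theta*X dt + sigma dW."""
    rng = random.Random(seed)
    n_steps = int(round(T / h))
    finals = []
    for _path in range(n_paths):
        x = x0
        for _step in range(n_steps):
            x += -theta * x * h + sigma * math.sqrt(h) * rng.gauss(0.0, 1.0)
        finals.append(x)
    return finals


def mean_and_variance(values):
    """Empirical mean and (biased) variance of a list of numbers."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)
```

With the defaults, `mean_and_variance(ou_samples())` should return a mean close to 0 and a variance close to 0.25. The finite step size adds a small upward bias to the variance, which shrinks as h decreases.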
[Interactive figure: ODE vs. SDE trajectories. Both start at X0 = 3 with drift u(x) = −2x. The ODE (orange) is smooth and deterministic. The SDE (teal) is noisy, and larger σ gives more randomness.]
```python
# Euler-Maruyama method for SDE simulation
import torch


def euler_maruyama(u_theta, sigma_t, n=50, d=2):
    h = 1.0 / n
    x = torch.randn(d)
    t = 0.0
    for _ in range(n):
        eps = torch.randn_like(x)
        x = x + h * u_theta(x, t) + sigma_t(t) * (h**0.5) * eps
        t += h
    return x
```
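A diffusion model is a generative model based on an SDE, just as a flow model is based on an ODE. The recipe is identical except we add noise during sampling:
1. Sample the initial point: X0 ~ pinit.
2. Simulate the SDE dXt = utθ(Xt) dt + σt dWt from t = 0 to t = 1.
3. Return X1.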
The neural network utθ parameterizes the vector field (exactly as in a flow model). The diffusion coefficient σt is a fixed schedule — not learned. The goal is still X1 ~ pdata.
Why would we want noise during sampling? Three reasons:
1. Error correction. Neural networks are imperfect — utθ only approximates the true vector field. The added noise can help "shake" trajectories out of error states, similar to how simulated annealing helps optimization escape local minima.
2. Diversity. Different amounts of noise can produce more diverse samples. The same initial noise X0 can lead to different outputs depending on the Brownian motion realization.
3. Theoretical guarantees. Under certain conditions, SDE sampling can converge faster than ODE sampling, especially when the target distribution has many well-separated modes.
Worked example — comparing ODE and SDE sampling. Suppose our target has two sharp modes at −5 and +5. With ODE sampling, a particle starting at X0 = 0.1 always ends up at the same mode (say +5). With SDE sampling, the Brownian noise can "push" the particle across the boundary, allowing it to reach either mode. This can improve mode coverage.
```python
# Algorithm 2: Sampling from a Diffusion Model
import torch


def sample_diffusion_model(u_theta, sigma, n=50, d=784):
    t = 0.0
    h = 1.0 / n
    x = torch.randn(d)                        # X_0 ~ N(0, I)
    for i in range(n):
        eps = torch.randn_like(x)             # fresh noise
        x = (x + h * u_theta(x, t)
             + sigma(t) * (h**0.5) * eps)     # Euler-Maruyama
        t += h
    return x                                  # X_1 ~ p_data
```
| Property | Flow Model | Diffusion Model |
|---|---|---|
| Equation | ODE | SDE |
| Trajectories | Smooth, deterministic | Jagged, stochastic |
| Randomness source | Initial noise X0 only | X0 + Brownian motion Wt |
| Simulation | Euler method | Euler-Maruyama method |
| Same X0 → same output? | Yes | No |
| σt | 0 | > 0 (fixed schedule) |
Let's see everything in action. The canvas below lets you watch flow models (ODE) and diffusion models (SDE) generate samples from a 2D distribution in real time. Multiple particles start as Gaussian noise and evolve toward the target data distribution.
[Interactive figure: particles flow from noise (t=0) to data (t=1). The target is a mixture of 4 Gaussians. An SDE toggle and a σ control show the difference between the two sampling modes.]
Practical considerations for real models:
| Aspect | ODE (Flow Model) | SDE (Diffusion Model) |
|---|---|---|
| Typical steps | 20-50 | 50-1000 |
| Image quality | Good with few steps | Better with many steps |
| Deterministic? | Yes (same seed = same output) | No (different each run) |
| Speed | Faster (fewer steps needed) | Slower (more steps for stability) |
| Guidance | Works but limited | Naturally supports classifier guidance |
| Likelihood | Can compute exact log p(x) | Cannot (easily) |
In practice, modern systems like Stable Diffusion 3 and FLUX use ODE sampling (flow matching) by default because it is faster. SDE sampling is used when higher quality is needed or when guidance techniques require it.
The key takeaway from this showcase: A flow model (ODE) and a diffusion model (SDE) are two ways to use the same trained neural network. The training is identical — only the sampling differs. This flexibility is one of the most powerful aspects of the flow/diffusion framework: train once, sample with either method depending on your quality/speed requirements.
Interpolation between ODE and SDE. You can even mix the two approaches. Start with SDE sampling (large σ) for the early steps (to get good mode coverage) and switch to ODE sampling (σ = 0) for the later steps (for precise, deterministic refinement). This "stochastic early, deterministic late" strategy is used by some state-of-the-art samplers.
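A 1D sketch of this "stochastic early, deterministic late" idea is shown below. The switch time `t_switch`, the early noise level `sigma_early`, and the function name are hypothetical choices for illustration, not settings taken from any specific sampler.

```python
import math
import random


def hybrid_sample(u, x0, sigma_early=0.5, t_switch=0.5, n_steps=100, seed=None):
    """Euler-Maruyama while t < t_switch, plain Euler afterwards (1D sketch)."""
    rng = random.Random(seed)
    h = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        t = i * h
        sigma = sigma_early if t < t_switch else 0.0  # noise only in the early phase
        step = h * u(x, t)                            # drift, evaluated at the current x
        if sigma > 0.0:
            step += sigma * math.sqrt(h) * rng.gauss(0.0, 1.0)
        x = x + step
    return x
```

Setting `t_switch=0.0` recovers the deterministic Euler sampler, and `t_switch=1.0` recovers Euler-Maruyama with constant σ.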
```python
# Complete comparison: ODE vs SDE sampling
import torch


def sample_ode(model, x0, n_steps=50):
    """Deterministic ODE sampling."""
    x, h = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        x = x + h * model(x, t)
    return x


def sample_sde(model, x0, sigma=0.5, n_steps=100):
    """Stochastic SDE sampling."""
    x, h = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        eps = torch.randn_like(x)
        x = x + h * model(x, t) + sigma * (h**0.5) * eps
    return x


# Same model, two sampling methods (`model` is assumed to be a trained vector field):
x0 = torch.randn(1, 784)
img_ode = sample_ode(model, x0)  # always same output for this x0
img_sde = sample_sde(model, x0)  # different each time
```
We have built the complete mathematical machinery for generative modeling with differential equations. Let's recap:
| Concept | Definition | Role in Generation |
|---|---|---|
| Vector field ut(x) | Velocity at position x, time t | Neural network parameterizes this |
| ODE | dX/dt = ut(Xt) | Deterministic trajectory from noise to data |
| Flow ψt | Solution map of ODE | Where a particle ends up at time t |
| Euler method | Xt+h = Xt + h · ut(Xt) | Numerical simulation of ODE |
| Brownian motion | Continuous random walk Wt | Source of stochasticity in SDEs |
| SDE | dX = u dt + σ dW | Stochastic trajectory from noise to data |
| Euler-Maruyama | Xt+h = Xt + hu + σ√h ε | Numerical simulation of SDE |
| Flow model | SDE with σ = 0 | Deterministic generative model |
| Diffusion model | SDE with σ > 0 | Stochastic generative model |
What we know vs. what we need. Here is the gap between what we have and what we need:
| What We Have | What We Need |
|---|---|
| How to define a vector field (this chapter) | How to LEARN the right vector field |
| How to solve an ODE/SDE numerically (this chapter) | What ODE/SDE to solve |
| The Euler and Euler-Maruyama methods | A training loss function |
| A dataset z1, ..., zN from pdata | Parameters θ such that X1 ~ pdata |
Chapter 3 (flow matching) closes this gap with a beautifully simple idea: define a probability path from noise to data, compute the vector field that follows this path analytically, and train the neural network to match it via MSE regression. No ODE simulation during training. No adversarial training. Just regression.
The key equations of this entire course fit on one card:
```python
# Summary: the full generative modeling stack
# (schematic: data_sample, NeuralNet, d, and hidden are placeholders)

# 1. REPRESENTATION (Ch 1)
z = data_sample  # z ∈ R^d (image, video, protein, ...)

# 2. MACHINERY (Ch 2, this chapter)
u_theta = NeuralNet(d, hidden)  # u: R^d × [0,1] → R^d
# ODE: dX/dt = u_theta(X, t)          → flow model
# SDE: dX = u_theta(X,t)dt + σ dW     → diffusion model
# Simulate via Euler or Euler-Maruyama

# 3. TRAINING (Ch 3-4, next)
# Flow matching:  L = ||u_theta(x,t) - (z - eps)||^2
# Score matching: L = ||s_theta(x,t) + eps/beta||^2

# 4. GENERATION
# X_0 ~ N(0, I) → simulate ODE/SDE → X_1 ~ p_data
```