Creating noise from data is easy; creating data from noise is generative modeling.
You type "a dog running through snow with mountains behind it" into an image generator. A few seconds later, a photorealistic picture appears. It is not retrieved from a database — it was created from scratch. How?
Previous generations of AI were primarily about prediction: classify this email as spam, predict tomorrow's stock price, detect a tumor in an X-ray. These systems take an input and produce a label or number. But generative AI does something fundamentally different: it creates new objects — images, videos, protein structures, music — that never existed before.
At the heart of every generative model in this course lies a single, beautiful idea: start with pure random noise, then iteratively sculpt it into data. The "sculpting" is done by simulating a differential equation. Flow matching and diffusion models are families of techniques that let us construct, train, and simulate these equations at massive scale using deep neural networks.
This course covers the two dominant algorithms behind modern generative AI: flow matching and denoising diffusion. These power Stable Diffusion, FLUX, Meta's Movie Gen, and even AlphaFold3 for protein structure prediction. The theory is elegant, the implementation is simple, and the results are staggering.
But before we can learn to generate, we need to formalize what "generate" even means. What is a "distribution of images"? What does it mean to "sample" from it? Let's build the mathematical vocabulary we need.
Interactive demo: click "Generate" to sample from different 2D distributions. Each dot is a sample. Notice how samples cluster where the distribution has high probability.
The interactive above shows that a "distribution" is really just a rule for where points are likely to appear. A Gaussian clusters near the center. A ring distribution clusters around a circle. A checkerboard puts mass on alternating squares. A generative model learns to produce samples that match whatever distribution the training data came from.
To build a generative model, we first need to represent the objects we want to generate — images, videos, protein structures — as mathematical objects that we can manipulate. The universal representation is the vector.
Consider a color image with H × W pixels, each with three color channels (red, green, blue). Each pixel-channel pair stores an intensity value, a real number. So the entire image is a collection of H × W × 3 real numbers. We can flatten this into a single vector z ∈ R^d where d = H × W × 3.
Worked example — a tiny image. A 4 × 4 grayscale image has d = 4 × 4 × 1 = 16 dimensions. Each pixel stores a brightness value between 0 (black) and 1 (white). So a specific image might be the vector:
That is 16 numbers — one per pixel. A 256 × 256 RGB image lives in d = 256 × 256 × 3 = 196,608 dimensions. Every possible image of that resolution is a single point in a 196,608-dimensional space.
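To make the flattening concrete, here is a minimal sketch in NumPy. The pixel values are hypothetical, chosen only for illustration; they are not taken from the course material.

```python
import numpy as np

# Hypothetical 4x4 grayscale image: 0 = black, 1 = white (illustrative values)
image = np.array([
    [0.0, 0.1, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.8, 0.9, 0.2],
    [0.0, 0.1, 0.2, 0.0],
])

z = image.flatten()          # row-major flattening: a vector in R^16
restored = z.reshape(4, 4)   # flattening loses nothing; reshape undoes it

d_rgb = 256 * 256 * 3        # dimension of a 256x256 RGB image
print(z.shape, d_rgb)
```

The round trip through `reshape` shows that the vector and the image carry exactly the same information; only the layout differs.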
Worked example: molecular structure. A protein with N = 100 atoms has each atom located in 3D space. The structure is z = (z1, ..., z100) where zi ∈ R^3. Flattened: z ∈ R^300. AlphaFold3 generates these vectors using a diffusion model.
Worked example — video. A 3-second video at 30fps with 64 × 64 RGB frames lives in d = 90 × 64 × 64 × 3 = 1,105,920 dimensions. Over a million dimensions for a tiny clip.
| Data Type | Representation | Typical d |
|---|---|---|
| 256×256 RGB Image | z ∈ R^{H×W×3} | 196,608 |
| 3s Video (64×64, 30fps) | z ∈ R^{T×H×W×3} | ~1.1M |
| Protein (100 atoms) | z ∈ R^{3N} | 300 |
| Audio (1s, 16kHz) | z ∈ R^T | 16,000 |
Why vectors and not something else? The vector representation lets us use the full power of calculus and linear algebra. We can add vectors (blend two images), scale them, compute distances (how similar are two images?), and define smooth transformations (gradually convert noise into an image). Differential equations — the mathematical engine of flow and diffusion models — operate on vectors. Without this representation, none of the theory in this course would apply.
```python
# Representing objects as vectors in code
import numpy as np

# A 32x32 RGB image as a vector
image = np.random.rand(32, 32, 3)   # shape: (32, 32, 3)
z = image.flatten()                 # shape: (3072,), a vector in R^3072
print(z.shape)                      # (3072,)

# A protein with 100 atoms
atoms = np.random.rand(100, 3)      # 100 atoms, each in R^3
z_protein = atoms.flatten()         # shape: (300,)

# A 1-second audio clip at 16kHz
z_audio = np.random.rand(16000)     # shape: (16000,)
```
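The vector operations mentioned above (blending, distances, smooth transformations) can be sketched in a few lines. The two "images" here are random stand-ins, and the helper name `interpolate` is ours, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stand-in "images" as vectors in R^3072 (random values, for illustration only)
z_a = rng.random(32 * 32 * 3)
z_b = rng.random(32 * 32 * 3)

# Blending: a convex combination of two vectors is again a vector in R^3072
blend = 0.5 * z_a + 0.5 * z_b

# Distance: the Euclidean norm of the difference measures how far apart two images are
dist = np.linalg.norm(z_a - z_b)

# Smooth transformation: a straight-line path from z_a (t = 0) to z_b (t = 1)
def interpolate(t):
    return (1.0 - t) * z_a + t * z_b

path = np.stack([interpolate(t) for t in np.linspace(0.0, 1.0, 5)])
print(blend.shape, path.shape)
```

Replacing `z_a` with Gaussian noise and `z_b` with a data sample gives the simplest picture of a noise-to-data path, which is the kind of object the later chapters study.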
There is no single "best" image of a dog. There are millions of possible dog images, and many of them would satisfy a user. Some are more likely (a clear photo of a golden retriever) and some are less likely (an abstract painting of a dog). This diversity is captured by a probability distribution.
We define a data distribution pdata over the space R^d of all possible objects. Mathematically, pdata is a probability density function:
$$p_{\text{data}} : \mathbb{R}^d \to \mathbb{R}_{\ge 0}, \qquad z \mapsto p_{\text{data}}(z)$$
It assigns each possible object z ∈ R^d a non-negative likelihood pdata(z). Images that look like real photographs get high likelihood. Random noise gets near-zero likelihood.
Worked example — 1D distribution. Suppose our "data" is human heights in centimeters. Then pdata might be a Gaussian centered at 170 cm with standard deviation 10 cm. A height of 170 cm has high likelihood; a height of 300 cm has near-zero likelihood. Sampling z ~ pdata gives a random height that looks like a real human height.
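The height example can be checked numerically. This is a minimal sketch using the Gaussian from the text (mean 170 cm, standard deviation 10 cm); the function name `p_data` is just a label for this toy density.

```python
import numpy as np

mu, sigma = 170.0, 10.0   # mean and standard deviation in cm, as in the example

def p_data(z):
    """Density of N(mu, sigma^2) evaluated at z."""
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
heights = rng.normal(mu, sigma, size=5)   # five draws z ~ p_data

print(p_data(170.0))          # peak density, about 0.04 per cm
print(p_data(300.0))          # 13 standard deviations out: effectively zero
print(np.round(heights, 1))   # five plausible heights
```

Note the two separate operations: evaluating the density at a point, and drawing samples. For real image data we can do neither directly, which is why a generative model is needed.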
Worked example — 2D distribution. Consider 2D data points arranged in a checkerboard pattern. The distribution pdata assigns high probability to the dark squares and zero probability to the light squares. Samples cluster in the dark regions.
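A checkerboard density is simple enough to write down explicitly. The sketch below assumes a 4 × 4 board on the unit square with "dark" squares at even index sums; the board size and the convention for which squares are dark are our choices, not specified in the text.

```python
import numpy as np

def p_checkerboard(z, cells=4):
    """Density on [0, 1)^2: uniform on squares with even index sum, zero elsewhere."""
    z = np.atleast_2d(z)
    inside = np.all((z >= 0.0) & (z < 1.0), axis=1)
    ij = np.floor(z * cells).astype(int)          # which square each point falls in
    dark = (ij.sum(axis=1) % 2) == 0
    # Dark squares cover half the unit square, so the density there is 1 / 0.5 = 2
    return np.where(inside & dark, 2.0, 0.0)

print(p_checkerboard([0.1, 0.1]))   # square (0, 0): dark
print(p_checkerboard([0.1, 0.3]))   # square (0, 1): light

# Monte Carlo check that the density integrates to 1 over the unit square
rng = np.random.default_rng(0)
integral = p_checkerboard(rng.random((100_000, 2))).mean()
print(integral)
```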
In the real world, pdata lives in incredibly high dimensions and has incredibly complex structure. The distribution of all 256×256 images is a function over R^{196,608}. It has mass concentrated on a tiny fraction of that space: the manifold of "images that look like real photographs." Almost all of R^{196,608} corresponds to random noise or visual garbage.
Worked example: the manifold hypothesis. Consider 28×28 grayscale images of handwritten digits (the MNIST dataset). Each image is a vector in R^784. But not all 784-dimensional vectors look like digits. If you pick a random vector in R^784, you get noise. The set of "images that look like digits" forms a thin, curved surface (a manifold) embedded in R^784. The data distribution pdata places all its mass on this manifold and zero mass everywhere else.
This manifold is what a generative model must learn. It does not need to fill all of R^784 with plausible images. It just needs to find the tiny subset where digits live and generate points on that subset.
```python
# Demonstrating the manifold hypothesis
import numpy as np

# A random vector in R^784: this is NOT a valid digit
noise = np.random.rand(784)
# If you reshape to (28, 28) and display: just random noise

# An actual MNIST digit (z ~ p_data) looks structured:
# The pixel values form recognizable strokes

# p_data(noise) ≈ 0  (random vectors have ~zero probability)
# p_data(digit) > 0  (structured images have positive probability)
```
Interactive demo: each distribution assigns different probabilities to different regions. Brighter = higher probability. Drag the cursor to see local probability density.
Now we can say precisely what "generate" means. Generating an object z is the same as sampling from the data distribution:
$$z \sim p_{\text{data}}$$
This is the formal statement: to generate an image of a dog, we sample a random vector z from the distribution pdata of dog images. Each time we sample, we get a different image — but all samples are "likely" according to pdata, meaning they look like real dog images.
Worked example: 1D Gaussian. If pdata = N(170, 10²) (human heights), then sampling z ~ pdata might give us 167.3, 172.8, 155.1, 180.4, ... Each sample is a plausible height. We will never get z = 1000 because the density there is essentially zero.
Worked example — 2D moons. If pdata is the "two moons" distribution, samples cluster along two crescent shapes. Drawing z ~ pdata gives random points on those crescents.
The beauty of this formulation is that it separates what we want (samples from pdata) from how we get them (the generative model). Different algorithms — GANs, VAEs, flow matching, diffusion — are different "hows" for the same "what."
But here is the catch: we never have direct access to pdata. We cannot evaluate pdata(z) for a given z (is this particular pixel arrangement likely?). We cannot invert pdata to produce samples directly. All we have is a finite collection of examples. This brings us to the concept of a dataset.
We cannot access pdata directly, but we can collect examples from it. A dataset is a finite collection of samples drawn independently from pdata:
$$z_1, \dots, z_N \sim p_{\text{data}}$$
For images, a dataset might be millions of photographs scraped from the internet (like ImageNet or LAION). For proteins, it might be experimentally resolved 3D structures from the Protein Data Bank. For videos, it might be clips from YouTube.
Worked example — 2D checkerboard. Suppose pdata is a checkerboard distribution. With N = 10 samples, the pattern is barely visible. With N = 100, you start to see structure. With N = 10,000, the checkerboard is unmistakable. The dataset is a finite proxy for the infinite distribution.
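One way to quantify "the pattern becomes visible" is to compare the fraction of samples landing in each square with the true probability of that square. The sketch below assumes a 4 × 4 checkerboard on the unit square and uses total variation distance as the error measure; both are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
cells = 4

def sample_checkerboard(n):
    """Rejection sampler: uniform on the squares of [0, 1)^2 whose index sum is even."""
    out = np.empty((0, 2))
    while len(out) < n:
        pts = rng.random((2 * n, 2))                       # uniform proposals
        ij = np.floor(pts * cells).astype(int)
        out = np.vstack([out, pts[(ij.sum(axis=1) % 2) == 0]])
    return out[:n]

# True probability of each square: 1/8 on the 8 dark squares, 0 on the light ones
i, j = np.indices((cells, cells))
p_true = np.where((i + j) % 2 == 0, 1.0 / 8.0, 0.0)

def histogram_error(n):
    """Total variation distance between empirical square frequencies and p_true."""
    z = sample_checkerboard(n)
    ij = np.floor(z * cells).astype(int)
    counts = np.zeros((cells, cells))
    np.add.at(counts, (ij[:, 0], ij[:, 1]), 1)
    return 0.5 * np.abs(counts / n - p_true).sum()

errors = {n: histogram_error(n) for n in (10, 100, 10_000)}
print(errors)
```

The error shrinks as N grows, but it never reaches zero for finite N, which is the sense in which a dataset is only a proxy for the distribution.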
This is a crucial point: the training data is not the distribution. The distribution pdata is a theoretical object — the "true" density over all possible objects of interest. The dataset is a finite window into that distribution. Generative models must learn enough about pdata from the dataset to generate new samples that were not in the training set.
Overfitting vs. generalization. If the model simply memorizes the training examples, it has failed. A good generative model generalizes: it generates new samples that are plausible under pdata but were not in the training set. How do we know it is not just memorizing? Several tests:
1. Nearest-neighbor test: Compare generated samples to their nearest neighbors in the training set. If they are too similar, the model is memorizing.
2. Interpolation test: Interpolate between two noise vectors and check that intermediate samples look plausible. Memorizing models produce garbage in between memorized samples.
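3. Distribution metrics: FID compares the feature statistics of generated samples to those of real samples. Measured against the training set, a model that memorizes perfectly would score FID ≈ 0, but so would a model that generalizes perfectly, so training-set FID alone cannot tell the two apart. Computing FID against a held-out test set is more informative, and a large gap between training-set and test-set FID is a warning sign of memorization.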
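The nearest-neighbor test from the list above can be sketched on toy 2D data. Here the "memorizing model" is simulated by copying training points, and distances are measured in raw coordinates; for images, such comparisons are often done in a learned feature space instead. The function name `nearest_train_distance` is ours.

```python
import numpy as np

def nearest_train_distance(generated, train):
    """For each generated sample, the Euclidean distance to its closest training sample."""
    # Pairwise differences via broadcasting: shape (n_generated, n_train, d)
    diffs = generated[:, None, :] - train[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))                    # toy "training set"

memorized = train[rng.integers(0, 500, size=200)]    # a model that copies training points
fresh = rng.normal(size=(200, 2))                    # new draws from the same distribution

d_mem = nearest_train_distance(memorized, train)
d_new = nearest_train_distance(fresh, train)
print(d_mem.max(), np.median(d_new))
```

Copied samples sit at distance zero from the training set, while genuinely new samples sit at a small but positive distance. How small is "too small" depends on the data and has to be calibrated, for example against held-out real samples.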
In practice, models with millions of parameters trained on millions of images generalize well. Models trained on small datasets (say, 1000 images) tend to memorize. The transition from memorization to generalization is one of the most interesting phenomena in deep generative modeling.
Dataset quality vs. quantity. Not all data is created equal. A dataset of 1 million high-quality, diverse images often produces better results than 10 million low-quality, repetitive images. Data curation — filtering out duplicates, low-quality images, and harmful content — is a critical step in training production models. LAION-5B, the dataset behind many open-source models, went through multiple rounds of filtering before use.
The training data pipeline for a real model.
Interactive demo: increase N to see how more samples reveal the structure of the underlying distribution. The true distribution is the "Two Moons" shape.
In practice, the quality of a generative model depends enormously on the size and quality of the dataset. Stable Diffusion was trained on billions of image-text pairs. Large language models such as GPT-4 are reported to be trained on trillions of text tokens. The datasets are large because the distributions are complex.
How many samples are "enough"? Consider our earlier calculation: a 256 × 256 RGB image lives in R^{196,608}. The space is enormous, but the manifold of "natural images" is much lower-dimensional. Empirically, training on millions to billions of images captures enough of this manifold for high-quality generation. The exact number depends on the diversity of the target distribution: faces are simpler than "all images" and need fewer samples.
The intrinsic dimensionality of data. While 256 × 256 images live in R^{196,608}, the manifold of natural images has a much lower intrinsic dimensionality. Published estimates vary with the estimation method, but they are orders of magnitude below the pixel count. This is why generative models work at all: they do not need to "fill" all 196,608 dimensions, just the thin manifold where real images live. This is also why latent diffusion models (Stable Diffusion) first compress images to a much smaller latent space before applying the diffusion process: a 512×512 RGB image (d = 786,432) becomes a 64×64×4 latent (d = 16,384), a 48× reduction. The compression loses little information because the intrinsic dimensionality is much smaller than the pixel dimensionality.
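The gap between ambient and intrinsic dimension can be illustrated with synthetic data. The sketch below builds points that vary along only 5 directions inside R^200 and recovers that number with PCA. All sizes are toy values, and PCA only detects flat (linear) structure, so this is an illustration of the idea rather than a method for real image manifolds, which are curved.

```python
import numpy as np

rng = np.random.default_rng(0)
d_ambient, d_intrinsic, n = 200, 5, 1000   # toy sizes, chosen for illustration

# Data that truly varies along 5 directions, embedded linearly in R^200 plus tiny noise
latent = rng.normal(size=(n, d_intrinsic))
embed = rng.normal(size=(d_intrinsic, d_ambient))
data = latent @ embed + 1e-3 * rng.normal(size=(n, d_ambient))

# PCA via SVD of the centered data: squared singular values are variances per direction
centered = data - data.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
explained = np.cumsum(s**2) / np.sum(s**2)

# Smallest number of directions that explains 99.9% of the variance
n_components = int(np.searchsorted(explained, 0.999) + 1)
print(n_components, "of", d_ambient, "dimensions carry the signal")
```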
The central role of randomness. A generative model produces different outputs each time because it starts from a random noise sample. Two calls to the same model with different noise produce two different images (or proteins, or videos). The noise seed controls which specific sample you get. This is why AI art tools let you set a "seed" — fixing the noise gives a reproducible output.
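The role of the seed can be shown with the noise alone. This sketch uses NumPy's seeded generator and a latent shape of (4, 64, 64) as an illustrative choice; the helper `sample_noise` is hypothetical. Reproducible noise gives a reproducible output only if the rest of the pipeline is deterministic as well.

```python
import numpy as np

def sample_noise(seed, shape=(4, 64, 64)):
    """Draw the initial Gaussian noise X_0 ~ N(0, I) from a seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

a = sample_noise(seed=42)
b = sample_noise(seed=42)   # same seed: identical noise
c = sample_noise(seed=43)   # different seed: different noise, hence a different sample

print(np.array_equal(a, b), np.array_equal(a, c))   # prints: True False
```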
Mode coverage vs. mode quality. A perfect generative model would produce all the diversity of pdata (mode coverage) with each sample being high quality (mode quality). In practice, there is often a tension between the two. GANs tend to have high quality but low coverage (mode collapse). Diffusion models tend to have both, which is one reason they have become dominant.
Evaluation metrics for generative models. How do we know if a generative model is good?
| Metric | What It Measures | Lower is better? |
|---|---|---|
| FID (Fréchet Inception Distance) | Distribution similarity (statistics of generated vs. real images) | Yes (0 = identical) |
| IS (Inception Score) | Quality and diversity of generated images | No (higher = better) |
| CLIP Score | Alignment between image and text prompt | No (higher = better) |
| LPIPS | Perceptual similarity between individual images | Depends on context |
| Human evaluation | Subjective quality assessed by humans | Gold standard |
FID is the most widely reported metric. State-of-the-art models achieve FID < 2 on standard benchmarks like ImageNet 256×256, meaning the Inception-feature statistics (mean and covariance) of generated images are very close to those of real images.
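The quantity behind FID is the Fréchet distance between two Gaussians fitted to feature vectors: the squared distance between the means plus a trace term comparing the covariances. The sketch below computes it on toy 8-dimensional features. Real FID applies the same formula to features from a pretrained Inception network, which is not included here, and the function name `frechet_distance` is ours.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two sets of feature vectors."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues of
    # cov_a @ cov_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(5000, 8))
good = rng.normal(size=(5000, 8))               # same distribution as `real`
shifted = rng.normal(loc=1.0, size=(5000, 8))   # mean shifted by 1 in every coordinate

fd_good = frechet_distance(real, good)
fd_shifted = frechet_distance(real, shifted)
print(fd_good, fd_shifted)
```

Matching distributions give a value near zero, while the shifted set scores close to 8, the squared length of the mean shift. Because only means and covariances enter the formula, two distributions can differ in other ways and still score well.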
Why this course matters practically. Flow matching and diffusion models are not just academic curiosities. They power products used by hundreds of millions of people: Midjourney, DALL-E 3, Stable Diffusion, Google Imagen, Adobe Firefly, and many more. Understanding their mathematical foundations gives you the ability to design, train, debug, and improve these systems. The algorithms are remarkably simple once you understand the theory — which is exactly what the remaining chapters build.
Dataset bias matters. If the dataset contains mostly photos of cats and dogs, the generative model will primarily produce cats and dogs. If the dataset lacks diversity (e.g., photos from only one culture), the model inherits this bias. The model cannot generate objects outside the support of its training distribution — or at least, not well.
```python
# Creating a toy dataset for flow matching experiments
import torch

def sample_two_moons(n):
    """Sample from the 'two moons' distribution in 2D."""
    n1 = n // 2
    n2 = n - n1

    # Upper moon
    theta1 = torch.rand(n1) * torch.pi
    x1 = torch.cos(theta1) + torch.randn(n1) * 0.05
    y1 = torch.sin(theta1) + torch.randn(n1) * 0.05

    # Lower moon
    theta2 = torch.rand(n2) * torch.pi
    x2 = 1 - torch.cos(theta2) + torch.randn(n2) * 0.05
    y2 = 1 - torch.sin(theta2) - 0.5 + torch.randn(n2) * 0.05

    return torch.stack([
        torch.cat([x1, x2]),
        torch.cat([y1, y2])
    ], dim=1)  # [n, 2]
```
| Domain | Example Dataset | Size |
|---|---|---|
| Images | LAION-5B | ~5 billion image-text pairs |
| Proteins | Protein Data Bank | ~200,000 structures |
| Video | HD-VILA-100M | ~100 million clips |
| Audio | AudioSet | ~2 million clips |
Unconditional generation ("give me a random image") is useful but limited. In practice, we almost always want to condition on something: a text prompt, a class label, a partial sketch, or a noisy measurement. This is guided generation (also called conditional generation). Formally, given a conditioning variable y, we sample from the conditional data distribution:
$$z \sim p_{\text{data}}(\cdot \mid y)$$
Worked example — text-to-image. You want an image of "a photorealistic cat blowing out birthday candles." Here y = "a photorealistic cat blowing out birthday candles" and z ~ pdata(·|y) is a random image that matches this description. Different samples from pdata(·|y) give different images, all featuring a cat with candles.
Worked example — super-resolution. Given a low-resolution 32×32 image y, generate a plausible high-resolution 256×256 version z. Multiple valid high-res images exist for any given low-res input, so this is genuinely a sampling problem.
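Worked example: protein design. Given a desired protein function y, generate a 3D structure z that realizes that function. Diffusion-based design models such as RFdiffusion target this problem. AlphaFold3 addresses the related structure prediction problem, where y is the molecular sequence and z is the 3D structure.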
Worked example — inpainting. Given an image with a missing region and y = "the surrounding pixels," generate z = "the completed image." Many plausible completions exist, so again we sample from a conditional distribution.
The training data for guided generation consists of pairs (zi, yi). For text-to-image: zi is an image and yi is its caption. A single model learns to condition on arbitrary y values.
| Application | z (generated) | y (condition) |
|---|---|---|
| Text-to-image | Image pixels | Text prompt |
| Super-resolution | High-res image | Low-res image |
| Protein design | 3D structure | Desired function |
| Inpainting | Complete image | Masked image + mask |
| Text-to-video | Video frames | Text description |
| Music generation | Audio waveform | Text description or melody |
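A toy version of a paired dataset (z, y) makes the conditional distribution tangible. In the sketch below, y is a binary label that selects one of two moons and z is a noisy 2D point on that moon; the label encoding and the helper `sample_pairs` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(n):
    """Toy paired dataset (z_i, y_i): y picks a moon, z is a noisy point on that moon."""
    y = rng.integers(0, 2, size=n)                   # condition: 0 = upper, 1 = lower
    theta = rng.random(n) * np.pi
    upper = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    lower = np.stack([1.0 - np.cos(theta), 0.5 - np.sin(theta)], axis=1)
    z = np.where(y[:, None] == 0, upper, lower) + 0.05 * rng.normal(size=(n, 2))
    return z, y

z, y = sample_pairs(2000)

# Unconditional generation targets all of z; guided generation with y = 1
# targets only the points distributed as p_data(. | y = 1).
z_given_lower = z[y == 1]
print(z.shape, z_given_lower.shape)
```

Filtering a dataset by label works here because y takes only two values. With free-form text prompts almost every y is unique, so a guided model has to learn to generalize across conditions rather than look them up.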
Let's tie everything together with a hands-on simulation. A generative model is an algorithm that returns approximate samples from pdata. In this course, we will build generative models using differential equations — but for now, let's see what "sampling from a distribution" actually looks like.
The simplest possible generative model is: just return a random training example. This is called memorization — and it is not what we want. A good generative model creates new samples that look like they came from pdata but are not copies of training data.
A slightly better approach: fit a parametric model (like a Gaussian mixture) to the data and sample from that. But fitting a good density model in 196,608 dimensions is essentially as hard as the original problem.
The preview of what is coming in Chapters 2-4: We will learn to define a trajectory that starts at random noise X0 ~ N(0, I) and ends at data X1 ~ pdata. The trajectory is defined by a vector field (learned by a neural network) that tells each point which direction to move at each moment in time. Simulating this trajectory converts noise to data.
This approach is fundamentally different from memorization or density estimation. The model never stores data points or evaluates densities. Instead, it learns a transformation — a smooth, continuous map from the simple Gaussian distribution to the complex data distribution. Every point in the Gaussian gets mapped to a point in the data distribution, and the map is defined by the ODE.
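The noise-to-data pipeline in practice. A real image generator like Stable Diffusion 3 runs three stages: sample Gaussian noise in latent space, simulate the flow ODE with 20-50 neural network calls conditioned on the text prompt, and decode the final latent to pixels with the VAE decoder.
Total time: ~2-5 seconds on an A100 GPU. Each of the 20-50 neural network calls processes the entire latent, which for Stable Diffusion 3 has 16 channels (64×64×16 = 65,536 dimensions for a 512×512 image). The cumulative effect of these small velocity steps is a photorealistic image.
```python
# What Stable Diffusion 3 does (simplified sketch, not the actual implementation)
import torch

def generate_image(dit_model, vae_decoder, encode_text, prompt, n_steps=28):
    # dit_model, vae_decoder, and encode_text are placeholders supplied by the caller
    # Step 1: Random noise in latent space (SD3 latents have 16 channels)
    x = torch.randn(1, 16, 64, 64)

    # Step 2: Simulate the flow ODE with Euler steps
    prompt_emb = encode_text(prompt)       # CLIP/T5 encoding
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor([i * h])
        v = dit_model(x, t, prompt_emb)    # 2B param DiT
        x = x + h * v

    # Step 3: Decode to pixel space
    image = vae_decoder(x)                 # [1, 3, 512, 512]
    return image
```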
Worked example — noise to data in 1D. Suppose pdata is a mixture of two Gaussians centered at −3 and +3. We start with noise X0 ~ N(0, 1). A flow model learns a vector field ut(x) such that following the ODE dXt/dt = ut(Xt) for t from 0 to 1 pushes X0 toward one of the two modes. At t = 1, the sample X1 lands near −3 or +3 — a valid sample from pdata.
Interactive demo: toggle conditional mode to see how conditioning changes which regions of the distribution are sampled. In conditional mode, only samples from the selected region are generated.
Notice the difference between unconditional and conditional sampling. In unconditional mode, samples cover the entire distribution. In conditional mode, we restrict to a specific region — this is what text-to-image does, but in 196,608 dimensions instead of 2.
How noise becomes structure. Here is a concrete numerical walkthrough of what happens inside a generative model. Consider a 1D example where pdata = 0.5 N(−3, 0.25) + 0.5 N(+3, 0.25) (a mixture of two Gaussians). The generative process:
| Step | t | Xt | What's happening |
|---|---|---|---|
| Init | 0.0 | 1.2 | Random noise from N(0,1) |
| 1 | 0.2 | 1.4 | Still mostly noise, drifting slightly |
| 2 | 0.4 | 1.8 | Vector field starts to "choose" the +3 mode |
| 3 | 0.6 | 2.3 | Clearly heading toward +3 |
| 4 | 0.8 | 2.7 | Almost there |
| 5 | 1.0 | 2.95 | A sample from pdata (near +3 mode) |
The initial noise X0 = 1.2 was positive, so the learned vector field pushed it toward the +3 mode. If X0 had been −0.8, the field would have pushed it toward the −3 mode. The vector field acts like a sorting mechanism — it routes each noise sample to the appropriate data region.
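The routing behavior described above can be reproduced without any training, because for this 1D mixture the vector field has a closed form. The sketch below assumes the straight-line path Xt = (1 − t) X0 + t z between noise X0 and a data point z, which is one common choice and is not fixed by the text, and it uses the exact field where a real model would use a trained network. The numbers it produces will not match the illustrative table.

```python
import numpy as np

means = np.array([-3.0, 3.0])   # mixture means from the example
s2 = 0.25                        # component variance from the example

def vector_field(x, t):
    """Marginal velocity u_t(x) for the assumed path X_t = (1 - t) * X_0 + t * z."""
    v_t = t**2 * s2 + (1.0 - t) ** 2            # variance of X_t within one component
    resid = x[:, None] - t * means[None, :]     # shape (n, 2)
    # Posterior probability that x belongs to each component (equal prior weights)
    logw = -0.5 * resid**2 / v_t
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # E[z - X_0 | X_t = x, component k] for jointly Gaussian (z, X_0)
    u_k = means[None, :] + (t * s2 - (1.0 - t)) / v_t * resid
    return (w * u_k).sum(axis=1)

def simulate(x0, n_steps=200):
    """Euler simulation of dX_t/dt = u_t(X_t) from t = 0 to t = 1."""
    x, h = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + h * vector_field(x, i * h)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5000)   # noise X_0 ~ N(0, 1)
x1 = simulate(x0)
print(np.mean(np.abs(x1)), np.std(x1[x1 > 0]))
```

Positive noise samples end near +3 and negative ones near −3, and the spread around each mode is close to the component standard deviation of 0.5.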
Why differential equations? Why not just learn a direct mapping f: noise → data? You could train a neural network fθ that maps a noise vector ε to a sample z in a single forward pass. This is essentially what a GAN generator does, although a GAN is trained adversarially rather than on (ε, z) pairs, since no natural pairing between noise and data exists. The problem: learning a one-shot mapping in very high dimensions is hard. The function must be highly nonlinear, and the training landscape has many bad local minima.
Differential equations decompose this hard one-shot mapping into many easy steps. Each step moves the particle a little bit in the right direction. The neural network only needs to predict a small velocity at each point, not the entire transformation. This is like the difference between trying to draw a picture in one pen stroke (hard) versus many small strokes (easy). The composability of differential equations is what makes flow and diffusion models work.
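```python
# A minimal sketch of what a generative model does at sampling time
import torch

def generate(model, n_samples, d, n_steps=50):
    # d (data dimension) and n_steps are passed in explicitly; the default is illustrative
    # Step 1: Sample noise from a simple distribution
    x = torch.randn(n_samples, d)               # X_0 ~ N(0, I)

    # Step 2: Simulate the ODE to transform noise into data
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((n_samples, 1), i * dt)  # current time, one entry per sample
        velocity = model(x, t)                  # neural network predicts the vector field
        x = x + dt * velocity                   # Euler step
    return x                                    # X_1 ~ p_data (approximately)
```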
We have laid the mathematical foundation for generative modeling. Let's recap what we established and preview what comes next.
| Concept | What It Means | Symbol |
|---|---|---|
| Object | A vector in R^d (image, video, molecule, ...) | z |
| Data distribution | Probability density over all possible objects | pdata(z) |
| Generation | Sampling z ~ pdata | z ~ pdata |
| Dataset | Finite samples from pdata | z1,...,zN |
| Guided generation | Sampling z ~ pdata(·|y) | z ~ pdata(·|y) |
| Generative model | Algorithm that produces approximate samples from pdata | — |
Other generative model families include:
| Model Family | Core Idea | Strengths | Weaknesses |
|---|---|---|---|
| GANs | Generator vs. discriminator adversarial game | Fast sampling, sharp outputs | Training instability, mode collapse |
| VAEs | Encoder-decoder with latent space regularization | Principled latent space, fast | Blurry outputs |
| Autoregressive | Generate one element at a time (token, pixel) | Exact likelihoods, proven for text | Slow sequential sampling |
| Normalizing Flows | Learn invertible transformations | Exact densities, fast both ways | Architectural constraints for invertibility |
| Flow/Diffusion | ODE/SDE from noise to data | High quality, flexible, scalable | Multi-step sampling (slower than GANs) |
Flow matching and diffusion models are the current state-of-the-art for continuous data (images, video, audio, molecules). Their combination of simplicity (the training loss is just MSE), scalability (standard mini-batch SGD), and quality (state-of-the-art FID scores) has made them the dominant approach since 2022.
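To make "the training loss is just MSE" concrete, here is a minimal sketch of one common flow matching objective, assuming a straight-line path between noise and data. The toy dataset, network size, and hyperparameters are illustrative choices of ours, not the course's reference implementation; the derivation belongs to the later chapters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small MLP u_theta(x, t): input is (x, t) concatenated, output is a velocity in R^2
model = nn.Sequential(
    nn.Linear(3, 64), nn.SiLU(),
    nn.Linear(64, 64), nn.SiLU(),
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def sample_data(n):
    """Stand-in for p_data: two Gaussian blobs in 2D (toy choice for illustration)."""
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    return centers[torch.randint(0, 2, (n,))] + 0.3 * torch.randn(n, 2)

def flow_matching_loss(model, z):
    x0 = torch.randn_like(z)               # noise X_0 ~ N(0, I)
    t = torch.rand(z.shape[0], 1)          # time t ~ Uniform[0, 1]
    xt = (1.0 - t) * x0 + t * z            # point on the straight path from noise to data
    target = z - x0                        # velocity of that path
    pred = model(torch.cat([xt, t], dim=1))
    return ((pred - target) ** 2).mean()   # plain MSE regression

losses = []
for step in range(300):
    loss = flow_matching_loss(model, sample_data(256))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(sum(losses[:10]) / 10, sum(losses[-20:]) / 20)
```

There is no discriminator and no density evaluation anywhere in the loop: each step draws a mini-batch, builds a regression target, and takes a gradient step. The loss does not go to zero even for a perfect model, because many different (noise, data) pairs pass through the same point at the same time with different velocities.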
A brief history of generative modeling milestones:
| Year | Model | Key Idea |
|---|---|---|
| 2014 | GAN (Goodfellow) | Generator vs. discriminator adversarial game |
| 2014 | VAE (Kingma) | Variational inference with reparameterization |
| 2015 | Deep Diffusion (Sohl-Dickstein) | First diffusion model (thermodynamics-inspired) |
| 2019 | NCSN (Song) | Score matching with Langevin dynamics |
| 2020 | DDPM (Ho) | Denoising diffusion with noise prediction |
| 2021 | DALL-E, GLIDE | Text-to-image generation (DALL-E: autoregressive transformer; GLIDE: diffusion) |
| 2022 | Stable Diffusion, Flow Matching | Latent diffusion + CondOT training |
| 2024 | FLUX, SD3, Sora | DiT architecture, video generation |
| 2024 | AlphaFold3 | Diffusion for protein structure prediction |
The timeline shows a remarkable convergence: from research prototypes (2014-2019) to practical tools (2020-2022) to industry-defining products (2023-present). Flow matching and diffusion models went from academic papers to powering mass-market image generators in just four years.
Why flow and diffusion models won. Several factors contributed to their dominance over earlier approaches:
1. Training stability. Unlike GANs, there is no adversarial game. The loss is a simple MSE regression, which converges reliably without mode collapse or training instabilities.
2. Sample quality. Flow/diffusion models achieve the best FID scores on standard benchmarks, surpassing GANs, VAEs, and autoregressive models for continuous data.
3. Mode coverage. The model captures the full diversity of the data distribution, unlike GANs which often miss rare modes.
4. Scalability. The training algorithm scales cleanly to billion-parameter models with standard distributed training techniques.
5. Flexibility. The same model supports unconditional generation, text-to-image, inpainting, super-resolution, and more — often with minimal architectural changes.
6. Theoretical foundation. The mathematics of probability paths, continuity equations, and score functions provides a principled framework for analysis and improvement.
The one drawback: generation speed. GANs generate in one forward pass; flow models need 20-50 forward passes. Research on distillation and consistency models is rapidly closing this gap, with some models now achieving near-GAN speeds at flow-model quality.
Applications beyond image generation. Flow and diffusion models have found surprising applications far beyond generating pretty pictures:
| Application | What is z? | What is y? | Example System |
|---|---|---|---|
| Image generation | Image (latent) | Text prompt | Stable Diffusion 3 |
| Video generation | Video frames | Text + first frame | Sora, Movie Gen |
| Protein structure and design | 3D atom coordinates | Sequence (prediction) or desired function (design) | AlphaFold3, RFdiffusion |
| Drug discovery (docking) | Ligand pose (position, orientation, torsion angles) | Target protein and ligand | DiffDock |
| Audio synthesis | Audio waveform | Text description | AudioLDM, Stable Audio |
| 3D shape generation | 3D point cloud | Text or image | Point-E, Shap-E |
| Robot policy | Action trajectory | Observation | Diffusion Policy, π0 |
| Weather forecasting | Weather state | Current state | GenCast (DeepMind) |
The universality of the framework — "learn to convert noise into structured data" — applies to any domain where the data can be represented as continuous vectors. This is why understanding flow matching is such a valuable skill: master the theory once, apply it everywhere.
What you will be able to do after this course:
1. Understand how Stable Diffusion, FLUX, Sora, and AlphaFold3 work, down to the mathematical foundations.
2. Implement a flow matching model from scratch in PyTorch (training + sampling).
3. Debug common issues in diffusion model training (loss curves, sample quality, noise schedules).
4. Choose between velocity prediction, score prediction, and noise prediction for your application.
5. Extend the framework to new data types (molecules, 3D, video) by understanding the core math.
6. Read the latest research papers on generative modeling with a solid foundation, since the notation and concepts are now familiar.
A final analogy. Generative modeling with flow matching is like learning to sculpt. The noise is the raw clay. The vector field is the sculptor's hands, applying pressure at each point in time to shape the material. The training process teaches the sculptor (neural network) how to shape clay (noise) into a specific form (data). Each generation starts with a new lump of clay (random noise) and follows the same sculpting procedure (ODE simulation) to produce a unique sculpture (sample). The sculptures all look "real" because the sculpting procedure was learned from thousands of real examples.
Required mathematical background. Due to the technical nature of differential equations, this course assumes some familiarity with:
1. Derivatives and integrals. You should know what df/dt means and be comfortable with the chain rule.
2. Probability basics. Random variables, expected value, Gaussian distribution, Bayes' rule.
3. Vectors and matrices. Vector addition, dot product, matrix-vector multiplication.
4. Python and PyTorch. Basic tensor operations, neural network training loops.
Do not worry if you are rusty — we will derive everything from scratch and motivate every formula. The Appendix of the book provides a probability refresher for those who need it. The key insight of this course is that the ideas are intuitive (convert noise to data by following a velocity field) even though the formalism requires some mathematical machinery.
Summary of notation. We will use this notation consistently throughout all chapters:
| Symbol | Meaning |
|---|---|
| d | Dimensionality of the data |
| z ∈ R^d | A data sample (image, protein, ...) |
| pdata | The data distribution (unknown, complex) |
| pinit | The initial/noise distribution, usually N(0, I) |
| pt | Probability path at time t |
| ut(x) | Vector field (velocity at position x, time t) |
| ut^θ(x) | Neural network vector field with parameters θ |
| ψt | Flow (solution map of ODE) |
| Wt | Brownian motion |
| σt | Diffusion coefficient |
| αt, βt | Noise schedulers for the probability path |
| y | Conditioning variable (text prompt, class label, ...) |