Holderrieth & Erives, Chapter 1

Introduction to Generative Modeling

Creating noise from data is easy; creating data from noise is generative modeling.

Prerequisites: Basic probability (what a distribution is). That's it.

Chapter 0: Why Generate?

You type "a dog running through snow with mountains behind it" into an image generator. A few seconds later, a photorealistic picture appears. It is not retrieved from a database — it was created from scratch. How?

Previous generations of AI were primarily about prediction: classify this email as spam, predict tomorrow's stock price, detect a tumor in an X-ray. These systems take an input and produce a label or number. But generative AI does something fundamentally different: it creates new objects — images, videos, protein structures, music — that never existed before.

At the heart of every generative model in this course lies a single, beautiful idea: start with pure random noise, then iteratively sculpt it into data. The "sculpting" is done by simulating a differential equation. Flow matching and diffusion models are families of techniques that let us construct, train, and simulate these equations at massive scale using deep neural networks.

The core premise: Generative modeling = learning to convert noise into data. We start from a simple distribution (like a Gaussian) and learn a transformation that maps it to a complex data distribution (like "all possible images of dogs").

This course covers the two dominant algorithms behind modern generative AI: flow matching and denoising diffusion. These power Stable Diffusion, FLUX, Meta's Movie Gen, and even AlphaFold3 for protein structure prediction. The theory is elegant, the implementation is simple, and the results are staggering.

But before we can learn to generate, we need to formalize what "generate" even means. What is a "distribution of images"? What does it mean to "sample" from it? Let's build the mathematical vocabulary we need.

What Generative Models Do

Click "Generate" to sample from different 2D distributions. Each dot is a sample. Notice how samples cluster where the distribution has high probability.


The interactive above shows that a "distribution" is really just a rule for where points are likely to appear. A Gaussian clusters near the center. A ring distribution clusters around a circle. A checkerboard puts mass on alternating squares. A generative model learns to produce samples that match whatever distribution the training data came from.

What is the fundamental task of a generative model?

Chapter 1: Objects as Vectors

To build a generative model, we first need to represent the objects we want to generate — images, videos, protein structures — as mathematical objects that we can manipulate. The universal representation is the vector.

Consider a color image with H × W pixels, each with three color channels (red, green, blue). Each pixel-channel pair stores an intensity value, a real number. So the entire image is a collection of H × W × 3 real numbers. We can flatten this into a single vector z ∈ R^d where d = H × W × 3.

Key Idea 1 (Objects as Vectors): We identify the objects being generated as vectors z ∈ R^d. An image is a vector in R^(H×W×3). A video is a vector in R^(T×H×W×3). A molecular structure with N atoms is a vector in R^(3×N).

Worked example — a tiny image. A 4 × 4 grayscale image has d = 4 × 4 × 1 = 16 dimensions. Each pixel stores a brightness value between 0 (black) and 1 (white). So a specific image might be the vector:

z = (0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.5, 0.5, 0.1, 0.9, 0.8, 0.2, 0.7, 0.3)

That is 16 numbers — one per pixel. A 256 × 256 RGB image lives in d = 256 × 256 × 3 = 196,608 dimensions. Every possible image of that resolution is a single point in a 196,608-dimensional space.

Worked example: molecular structure. A protein with N = 100 atoms has each atom located in 3D space. The structure is z = (z_1, ..., z_100) where z_i ∈ R^3. Flattened: z ∈ R^300. AlphaFold3 generates these vectors using a diffusion model.

Worked example — video. A 3-second video at 30fps with 64 × 64 RGB frames lives in d = 90 × 64 × 64 × 3 = 1,105,920 dimensions. Over a million dimensions for a tiny clip.
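The dimension counts in these worked examples are just products of the shape entries. As a quick check, the short sketch below recomputes them; the dictionary labels are made up for this illustration.

```python
from math import prod

# Shapes taken from the worked examples above.
shapes = {
    "4x4 grayscale image": (4, 4, 1),
    "256x256 RGB image": (256, 256, 3),
    "protein, 100 atoms": (100, 3),
    "3 s video, 30 fps, 64x64 RGB": (3 * 30, 64, 64, 3),
}

# The flattened dimension d is the product of the shape entries.
dims = {name: prod(shape) for name, shape in shapes.items()}

for name, d in dims.items():
    print(f"{name}: d = {d:,}")
```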

| Data type | Representation | Typical d |
| --- | --- | --- |
| 256×256 RGB image | z ∈ R^(H×W×3) | 196,608 |
| 3 s video (64×64, 30 fps) | z ∈ R^(T×H×W×3) | ~1.1M |
| Protein (100 atoms) | z ∈ R^(3N) | 300 |
| Audio (1 s, 16 kHz) | z ∈ R^T | 16,000 |

What about text? Text is discrete: words or tokens from a vocabulary. Most language models treat text as sequences of discrete tokens, not continuous vectors. Continuous diffusion models for text exist (Section 7 of the book) but they are not our main focus. For now, think of z as always being a continuous vector.

Why vectors and not something else? The vector representation lets us use the full power of calculus and linear algebra. We can add vectors (blend two images), scale them, compute distances (how similar are two images?), and define smooth transformations (gradually convert noise into an image). Differential equations — the mathematical engine of flow and diffusion models — operate on vectors. Without this representation, none of the theory in this course would apply.
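To make "blend" and "distance" concrete, here is a minimal sketch using plain Python lists so that it has no dependencies. The two 2×2 "images" and the helper names `blend` and `distance` are invented for this illustration.

```python
import math

def blend(a, b, alpha):
    """Convex combination (1 - alpha) * a + alpha * b, element by element."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]

def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two made-up 2x2 grayscale "images", flattened to vectors in R^4.
img_a = [0.0, 0.2, 0.4, 0.6]
img_b = [1.0, 0.8, 0.6, 0.4]

halfway = blend(img_a, img_b, 0.5)     # the pixel-wise average of the two images
print(halfway)
print(distance(img_a, img_b))          # how far apart the two images are
print(distance(img_a, halfway))        # the blend sits halfway between them
```

Blending with alpha between 0 and 1 traces a straight line between the two vectors, which is the simplest example of a smooth path through the space of objects.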

```python
# Representing objects as vectors in code
import numpy as np

# A 32x32 RGB image as a vector
image = np.random.rand(32, 32, 3)   # shape: (32, 32, 3)
z = image.flatten()                 # shape: (3072,), a vector in R^3072
print(z.shape)                      # (3072,)

# A protein with 100 atoms
atoms = np.random.rand(100, 3)      # 100 atoms, each in R^3
z_protein = atoms.flatten()         # shape: (300,)

# A 1-second audio clip at 16kHz
z_audio = np.random.rand(16000)     # shape: (16000,)
```

The key abstraction: Once everything is a vector z ∈ R^d, the same generative modeling algorithms work for images, proteins, audio, and video. The algorithms do not "know" what the vectors represent. They just learn to transform Gaussian noise into vectors that have the same statistical properties as the training data. The domain-specific knowledge is all in the data.

A 32×32 RGB image is represented as a vector z in what space?

Chapter 2: The Data Distribution

There is no single "best" image of a dog. There are millions of possible dog images, and many of them would satisfy a user. Some are more likely (a clear photo of a golden retriever) and some are less likely (an abstract painting of a dog). This diversity is captured by a probability distribution.

We define a data distribution p_data over the space R^d of all possible objects. Mathematically, p_data is a probability density function:

p_data : R^d → R_{≥0}

It assigns each possible object z ∈ R^d a non-negative likelihood p_data(z). Images that look like real photographs get high likelihood. Random noise gets near-zero likelihood.

The crucial reframing: "How good is this image?" is a subjective question with no clean answer. "How likely is this image under the data distribution?" is a precise mathematical question. Generative modeling replaces the first question with the second.

Worked example: 1D distribution. Suppose our "data" is human heights in centimeters. Then p_data might be a Gaussian centered at 170 cm with standard deviation 10 cm. A height of 170 cm has high likelihood; a height of 300 cm has near-zero likelihood. Sampling z ~ p_data gives a random height that looks like a real human height.
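The height example can be checked numerically. The sketch below evaluates the Gaussian density with mean 170 and standard deviation 10 at a few heights; the function name `gaussian_pdf` is chosen here for illustration.

```python
import math

def gaussian_pdf(z, mean, std):
    """Density of N(mean, std^2) evaluated at z."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((z - mean) / std) ** 2)

MEAN_CM, STD_CM = 170.0, 10.0

# The density is largest at the mean and collapses far out in the tail.
for height in (170.0, 190.0, 300.0):
    print(height, gaussian_pdf(height, MEAN_CM, STD_CM))
```

Note that these values are densities, not probabilities: they can be compared with each other, but only integrals of the density over a range give probabilities.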

Worked example: 2D distribution. Consider 2D data points arranged in a checkerboard pattern. The distribution p_data assigns high density to the dark squares and zero density to the light squares. Samples cluster in the dark regions.

In the real world, p_data lives in incredibly high dimensions and has incredibly complex structure. The distribution of all 256×256 RGB images is a function over R^196,608. It has mass concentrated on a tiny fraction of that space: the manifold of "images that look like real photographs." Almost all of R^196,608 corresponds to random noise or visual garbage.

Worked example: the manifold hypothesis. Consider 28×28 grayscale images of handwritten digits (the MNIST dataset). Each image is a vector in R^784. But not all 784-dimensional vectors look like digits. If you pick a random vector in R^784, you get noise. The set of "images that look like digits" forms a thin, curved surface (a manifold) embedded in R^784. Under the manifold hypothesis, the data distribution p_data places essentially all its mass on or very near this manifold and almost none elsewhere.

This manifold is what a generative model must learn. It does not need to fill all of R^784 with plausible images. It just needs to find the tiny subset where digits live and generate points on that subset.

```python
# Demonstrating the manifold hypothesis
import numpy as np

# A random vector in R^784: this is NOT a valid digit
noise = np.random.rand(784)
# If you reshape to (28, 28) and display: just random noise

# An actual MNIST digit (z ~ p_data) looks structured:
# The pixel values form recognizable strokes
# p_data(noise) ≈ 0   (random vectors have ~zero probability)
# p_data(digit) > 0   (structured images have positive probability)
```

Visualizing Data Distributions in 2D

Each distribution assigns different probabilities to different regions. Brighter = higher probability. Drag the cursor to see local probability density.

The data distribution pdata(z) assigns each possible object z a...

Chapter 3: Generation as Sampling

Now we can say precisely what "generate" means. Generating an object z is the same as sampling from the data distribution:

z ~ p_data

This is the formal statement: to generate an image of a dog, we sample a random vector z from the distribution p_data of dog images. Each time we sample, we get a different image, but all samples are "likely" according to p_data, meaning they look like real dog images.

Key Idea 2 (Generation as Sampling): Generating an object z is modeled as sampling from the data distribution z ~ p_data. A generative model is any algorithm that produces such samples.

Worked example: 1D Gaussian. If p_data = N(170, 10²) (human heights), then sampling z ~ p_data might give us 167.3, 172.8, 155.1, 180.4, ... Each sample is a plausible height. In practice we will never see z = 1000 because the density there is essentially zero.
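Sampling from this 1D Gaussian takes a few lines. The sketch below uses Python's standard `random` module with a fixed seed; the sample count of 10,000 is an arbitrary choice for the illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Draw samples z ~ N(170, 10^2); gauss takes (mean, standard deviation).
heights = [rng.gauss(170.0, 10.0) for _ in range(10_000)]

print([round(h, 1) for h in heights[:4]])   # a few individual samples
print(round(statistics.mean(heights), 1))   # sample mean, close to 170
print(round(statistics.stdev(heights), 1))  # sample standard deviation, close to 10
print(round(max(heights), 1))               # even the largest sample is nowhere near 1000
```

Here sampling is easy because the density has a known closed form. For real data such as images, no such formula exists, which is the difficulty the rest of the chapter builds toward.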

Worked example: 2D moons. If p_data is the "two moons" distribution, samples cluster along two crescent shapes. Drawing z ~ p_data gives random points on those crescents.

The beauty of this formulation is that it separates what we want (samples from p_data) from how we get them (the generative model). Different algorithms (GANs, VAEs, flow matching, diffusion) are different "hows" for the same "what."

But here is the catch: we never have direct access to p_data. We cannot evaluate p_data(z) for a given z (is this particular pixel arrangement likely?), and we have no procedure that draws fresh samples from it on demand. All we have is a finite collection of examples. This brings us to the concept of a dataset.

The fundamental challenge: We want to sample from p_data, but we can neither evaluate p_data(z) nor sample from it directly. All we have is a finite dataset of examples. The entire field of generative modeling is about building algorithms that can produce new samples despite never seeing the full distribution.
"Generating an image" is mathematically equivalent to:

Chapter 4: Datasets

We cannot access p_data directly, but we can collect examples from it. A dataset is a finite collection of samples drawn independently from p_data:

z_1, z_2, ..., z_N ~ p_data

For images, a dataset might be millions of photographs scraped from the internet (like ImageNet or LAION). For proteins, it might be experimentally resolved 3D structures from the Protein Data Bank. For videos, it might be clips from YouTube.

Key Idea 3 (Dataset): A dataset consists of a finite number of samples z_1, ..., z_N ~ p_data. As N grows, the dataset becomes an increasingly better approximation of the true distribution.

Worked example: 2D checkerboard. Suppose p_data is a checkerboard distribution. With N = 10 samples, the pattern is barely visible. With N = 100, you start to see structure. With N = 10,000, the checkerboard is unmistakable. The dataset is a finite proxy for the infinite distribution.
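The effect of N can be measured rather than eyeballed. The sketch below assumes a 4×4 checkerboard on [0, 4)² with uniform density on the 8 dark squares (this specific layout is an assumption, not taken from the simulation), and reports how far the empirical cell frequencies are from the true value 1/8.

```python
import random
from collections import Counter

def sample_checkerboard(n, rng):
    """Draw n points uniformly from the dark squares of a 4x4 checkerboard on [0, 4)^2."""
    points = []
    while len(points) < n:
        x, y = rng.uniform(0, 4), rng.uniform(0, 4)
        if (int(x) + int(y)) % 2 == 0:  # keep only dark squares (rejection sampling)
            points.append((x, y))
    return points

def max_cell_error(points):
    """Largest gap between a dark cell's empirical frequency and its true probability 1/8."""
    counts = Counter((int(x), int(y)) for x, y in points)
    dark_cells = [(i, j) for i in range(4) for j in range(4) if (i + j) % 2 == 0]
    return max(abs(counts[c] / len(points) - 1 / 8) for c in dark_cells)

rng = random.Random(0)
errors = {n: max_cell_error(sample_checkerboard(n, rng)) for n in (10, 100, 10_000)}
for n, err in errors.items():
    print(n, round(err, 4))
```

With 10 samples spread over 8 cells the frequencies cannot all be close to 1/8, while with 10,000 samples the worst cell is off by only a small amount.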

This is a crucial point: the training data is not the distribution. The distribution p_data is a theoretical object, the "true" density over all possible objects of interest. The dataset is a finite window into that distribution. Generative models must learn enough about p_data from the dataset to generate new samples that were not in the training set.

Overfitting vs. generalization. If the model simply memorizes the training examples, it has failed. A good generative model generalizes: it generates new samples that are plausible under p_data but were not in the training set. How do we know it is not just memorizing? Several tests:

1. Nearest-neighbor test: Compare generated samples to their nearest neighbors in the training set. If they are too similar, the model is memorizing.

2. Interpolation test: Interpolate between two noise vectors and check that intermediate samples look plausible. Memorizing models produce garbage in between memorized samples.

3. Distribution metrics: FID compares the statistics of generated samples to those of real samples. A model that memorizes perfectly would score close to FID = 0, but so would a model that generalizes perfectly. FID by itself therefore cannot detect memorization; it is usually computed against a held-out test set and paired with a check such as the nearest-neighbor test above.
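The nearest-neighbor test from the list above can be sketched in a few lines. The 2D points and the cutoff `THRESHOLD` are invented for this illustration; a real test would use image features and a cutoff tuned to the data scale.

```python
import math

def nearest_train_distance(sample, train_set):
    """Euclidean distance from one generated sample to its closest training example."""
    return min(math.dist(sample, z) for z in train_set)

# Made-up 2D "training set" and two generated points.
train_set = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
copied = (1.0, 0.0)   # an exact copy of a training point
novel = (0.5, 0.5)    # a new point between training points

THRESHOLD = 1e-3      # hypothetical cutoff; a real one depends on the data scale

for name, g in (("copied", copied), ("novel", novel)):
    d = nearest_train_distance(g, train_set)
    print(name, round(d, 4), "possible memorization" if d < THRESHOLD else "ok")
```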

In practice, models with millions of parameters trained on millions of images generalize well. Models trained on small datasets (say, 1000 images) tend to memorize. The transition from memorization to generalization is one of the most interesting phenomena in deep generative modeling.

Dataset quality vs. quantity. Not all data is created equal. A dataset of 1 million high-quality, diverse images often produces better results than 10 million low-quality, repetitive images. Data curation — filtering out duplicates, low-quality images, and harmful content — is a critical step in training production models. LAION-5B, the dataset behind many open-source models, went through multiple rounds of filtering before use.

The training data pipeline for a real model:

1. Collect: Scrape images from the web (billions).
2. Filter: Remove duplicates and low-resolution, NSFW, or watermarked images.
3. Caption: Generate text descriptions using BLIP-2 or CogVLM.
4. Encode: Pre-compute VAE latents and CLIP/T5 text embeddings.
5. Train: Run flow matching on (latent, text_embedding) pairs.

Dataset Size and Distribution Approximation

Increase N to see how more samples reveal the structure of the underlying distribution. The true distribution is the "Two Moons" shape.


In practice, the quality of a generative model depends enormously on the size and quality of the dataset. Stable Diffusion was trained on subsets of LAION-5B containing hundreds of millions to billions of image-text pairs. Large language models such as GPT-4 are reported to be trained on trillions of text tokens. The datasets are large because the distributions are complex.

How many samples are "enough"? Consider our earlier calculation: a 256 × 256 RGB image lives in R^196,608. The space is enormous, but the manifold of "natural images" is much lower-dimensional. Empirically, training on millions to billions of images captures enough of this manifold for high-quality generation. The exact number depends on the diversity of the target distribution: faces are simpler than "all images" and need fewer samples.

The intrinsic dimensionality of data. While images live in R^196,608, the manifold of natural images has a much lower intrinsic dimensionality. Estimates vary with the dataset and the estimation method, from tens to a few thousand. This is why generative models work at all: they do not need to "fill" all 196,608 dimensions, just the thin manifold where real images live. This is also why latent diffusion models (Stable Diffusion) first compress a 512×512 image to a 64×64×4 latent (d = 16,384) before applying the diffusion process. The compression loses little perceptual information because the intrinsic dimensionality is much smaller than the pixel dimensionality.

From pixels to latents: Modern image generators (Stable Diffusion, FLUX) do NOT apply flow matching directly in pixel space. They first train a VAE to compress images to a much smaller latent space (typically 8x spatial compression; 4 channels in earlier Stable Diffusion versions, 16 in SD3 and FLUX). Flow matching operates in this latent space. With 4 channels, a 512×512 RGB image shrinks from d = 786,432 to d = 16,384, a 48x reduction, making training and sampling much faster. Chapter 6 of the book covers this in detail.
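The size of the latent space follows directly from the compression factors. The sketch below assumes 8x spatial downsampling and 4 latent channels, as stated above, and computes the pixel and latent dimensions for two resolutions; `latent_shape` and `flat_dim` are helper names made up for this illustration.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Latent tensor shape for a VAE with the given spatial downsampling and channel count."""
    return (height // downsample, width // downsample, channels)

def flat_dim(shape):
    """Number of entries in a tensor of the given shape."""
    d = 1
    for s in shape:
        d *= s
    return d

for h, w in ((256, 256), (512, 512)):
    pixel_d = flat_dim((h, w, 3))
    lat = latent_shape(h, w)
    lat_d = flat_dim(lat)
    print(f"{h}x{w}: pixels d={pixel_d:,}, latent {lat} d={lat_d:,}, ratio {pixel_d / lat_d:.0f}x")
```

Under these assumptions the ratio of pixel dimension to latent dimension is the same at every resolution, since it equals (8 × 8 × 3) / 4.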

The central role of randomness. A generative model produces different outputs each time because it starts from a random noise sample. Two calls to the same model with different noise produce two different images (or proteins, or videos). The noise seed controls which specific sample you get. This is why AI art tools let you set a "seed" — fixing the noise gives a reproducible output.
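The role of the seed can be seen without any model at all. The sketch below draws a small noise vector from a seeded generator; `sample_noise` and the dimension d = 4 are chosen only for this illustration.

```python
import random

def sample_noise(seed, d=4):
    """Draw a d-dimensional standard Gaussian noise vector from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

a = sample_noise(seed=42)
b = sample_noise(seed=42)
c = sample_noise(seed=43)

print(a == b)  # same seed, same noise vector
print(a == c)  # different seed, different noise vector
```

If the rest of the generation procedure is deterministic, as it is when simulating an ODE, then identical noise leads to an identical output.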

Mode coverage vs. mode quality. A perfect generative model would produce all the diversity of p_data (mode coverage) with each sample being high quality (mode quality). In practice, there is often a tension between the two. GANs tend to have high quality but low coverage (mode collapse). Diffusion models tend to have both, which is one reason they have become dominant.

Evaluation metrics for generative models. How do we know if a generative model is good?

| Metric | What It Measures | Lower is better? |
| --- | --- | --- |
| FID (Fréchet Inception Distance) | Distribution similarity (statistics of generated vs. real images) | Yes (0 = identical) |
| IS (Inception Score) | Quality and diversity of generated images | No (higher = better) |
| CLIP Score | Alignment between image and text prompt | No (higher = better) |
| LPIPS | Perceptual similarity between individual images | Depends on context |
| Human evaluation | Subjective quality assessed by humans | Gold standard |

FID is the most widely reported metric. State-of-the-art models achieve FID < 2 on standard benchmarks like ImageNet 256×256, meaning the feature statistics of generated images are very close to those of real images.
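FID fits a Gaussian to each set of samples and measures the Fréchet distance between the two Gaussians. Real FID does this on high-dimensional Inception network features; the sketch below is only a 1D analogue, where the distance reduces to (μ₁ − μ₂)² + (σ₁ − σ₂)². The sample sets are synthetic and chosen for illustration.

```python
import random
import statistics

def frechet_distance_1d(xs, ys):
    """Squared Frechet distance between Gaussians fitted to two 1D samples."""
    mu_x, mu_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
    return (mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2

rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(5000)]
good = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # same distribution as `real`
bad = [rng.gauss(2.0, 0.5) for _ in range(5000)]   # shifted and too narrow

print(frechet_distance_1d(real, real))  # identical sets
print(frechet_distance_1d(real, good))  # small: matching statistics
print(frechet_distance_1d(real, bad))   # large: mean and spread both differ
```

The score depends only on the mean and spread of each set, so two very different sample sets with matching statistics would also score near zero. This is one reason FID is usually combined with other checks.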

Why this course matters practically. Flow matching and diffusion models are not just academic curiosities. They power products used by hundreds of millions of people: Midjourney, DALL-E 3, Stable Diffusion, Google Imagen, Adobe Firefly, and many more. Understanding their mathematical foundations gives you the ability to design, train, debug, and improve these systems. The algorithms are remarkably simple once you understand the theory — which is exactly what the remaining chapters build.

Dataset bias matters. If the dataset contains mostly photos of cats and dogs, the generative model will primarily produce cats and dogs. If the dataset lacks diversity (e.g., photos from only one culture), the model inherits this bias. The model cannot generate objects outside the support of its training distribution — or at least, not well.

```python
# Creating a toy dataset for flow matching experiments
import torch

def sample_two_moons(n):
    """Sample from the 'two moons' distribution in 2D."""
    n1 = n // 2
    n2 = n - n1
    # Upper moon
    theta1 = torch.rand(n1) * torch.pi
    x1 = torch.cos(theta1) + torch.randn(n1) * 0.05
    y1 = torch.sin(theta1) + torch.randn(n1) * 0.05
    # Lower moon
    theta2 = torch.rand(n2) * torch.pi
    x2 = 1 - torch.cos(theta2) + torch.randn(n2) * 0.05
    y2 = 1 - torch.sin(theta2) - 0.5 + torch.randn(n2) * 0.05
    return torch.stack([
        torch.cat([x1, x2]),
        torch.cat([y1, y2])
    ], dim=1)  # [n, 2]
```

| Domain | Example Dataset | Size |
| --- | --- | --- |
| Images | LAION-5B | ~5 billion image-text pairs |
| Proteins | Protein Data Bank | ~200,000 structures |
| Video | HD-VILA-100M | ~100 million clips |
| Audio | AudioSet | ~2 million clips |

A dataset is a finite proxy for p_data. What happens as the dataset size N grows very large?

Chapter 5: Guided Generation

Unconditional generation — "give me a random image" — is useful but limited. In practice, we almost always want to condition on something: a text prompt, a class label, a partial sketch, or a noisy measurement. This is guided generation (also called conditional generation).

Key Idea 4 (Guided Generation): Guided generation involves sampling from a conditional distribution z ~ p_data(·|y), where y is a conditioning variable. The model must work for any y, not just a fixed one.

Worked example: text-to-image. You want an image of "a photorealistic cat blowing out birthday candles." Here y = "a photorealistic cat blowing out birthday candles" and z ~ p_data(·|y) is a random image that matches this description. Different samples from p_data(·|y) give different images, all featuring a cat with candles.

Worked example: super-resolution. Given a low-resolution 32×32 image y, generate a plausible high-resolution 256×256 version z. Multiple valid high-res images exist for any given low-res input, so this is genuinely a sampling problem.

Worked example: protein design. Given a desired protein function y, generate a 3D structure z that realizes that function. Diffusion-based design models such as RFdiffusion target this problem; AlphaFold3 addresses the related task of predicting a structure from a given sequence.

Worked example — inpainting. Given an image with a missing region and y = "the surrounding pixels," generate z = "the completed image." Many plausible completions exist, so again we sample from a conditional distribution.

The training data for guided generation consists of pairs (z_i, y_i). For text-to-image: z_i is an image and y_i is its caption. A single model learns to condition on arbitrary y values.
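A toy version of conditional sampling and of a paired dataset can be written down directly. In the sketch below, the condition y takes one of two made-up labels, and p_data(·|y) is assumed to be a 1D Gaussian around −3 or +3; none of these choices come from a real model.

```python
import random

# Hypothetical conditional distribution: y in {"left", "right"} selects the mode.
MODE_CENTER = {"left": -3.0, "right": 3.0}

def sample_conditional(y, rng, std=0.5):
    """Draw z ~ p_data(. | y) for the toy model: a Gaussian around the mode chosen by y."""
    return rng.gauss(MODE_CENTER[y], std)

def sample_pair(rng):
    """Draw one training pair (z, y): first pick y uniformly, then z given y."""
    y = rng.choice(sorted(MODE_CENTER))
    return sample_conditional(y, rng), y

rng = random.Random(0)
pairs = [sample_pair(rng) for _ in range(1000)]  # dataset of (z_i, y_i)
left_samples = [sample_conditional("left", rng) for _ in range(1000)]

print(pairs[:3])
print(sum(left_samples) / len(left_samples))  # close to -3
```

Fixing y restricts sampling to one part of the distribution, while different draws for the same y still differ from each other.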

| Application | z (generated) | y (condition) |
| --- | --- | --- |
| Text-to-image | Image pixels | Text prompt |
| Super-resolution | High-res image | Low-res image |
| Protein design | 3D structure | Desired function |
| Inpainting | Complete image | Masked image + mask |
| Text-to-video | Video frames | Text description |
| Music generation | Audio waveform | Text description or melody |

Good news: It turns out that techniques for unconditional generation generalize straightforwardly to the conditional case. For the first three sections of this course, we focus almost exclusively on unconditional generation z ~ p_data, knowing that conditioning comes essentially "for free" later (Chapter 5 of the book covers guidance).

Here is the full picture of what we are building toward:

1. Represent: Objects z as vectors in R^d.
2. Formalize: Generation as sampling z ~ p_data.
3. Train: A model using a finite dataset z_1, ..., z_N ~ p_data.
4. Generate: New samples by converting noise to data via ODEs/SDEs.

In guided generation, what is the conditioning variable y?

Chapter 6: The Sampler

Let's tie everything together with a hands-on simulation. A generative model is an algorithm that returns approximate samples from p_data. In this course, we will build generative models using differential equations, but for now, let's see what "sampling from a distribution" actually looks like.

The simplest possible generative model is: just return a random training example. This is called memorization, and it is not what we want. A good generative model creates new samples that look like they came from p_data but are not copies of training data.

Think of it this way: If you ask a memorizing model for a dog image, it picks a random photo from its training set. If you ask a good generative model, it "imagines" a new dog image — one that has never existed before — by understanding the underlying distribution of dog images.

A slightly better approach: fit a parametric model (like a Gaussian mixture) to the data and sample from that. But fitting a good density model in 196,608 dimensions is essentially as hard as the original problem.

The preview of what is coming in Chapters 2-4: We will learn to define a trajectory that starts at random noise X_0 ~ N(0, I) and ends at data X_1 ~ p_data. The trajectory is defined by a vector field (learned by a neural network) that tells each point which direction to move at each moment in time. Simulating this trajectory converts noise to data.

This approach is fundamentally different from memorization or density estimation. The model never stores data points or evaluates densities. Instead, it learns a transformation — a smooth, continuous map from the simple Gaussian distribution to the complex data distribution. Every point in the Gaussian gets mapped to a point in the data distribution, and the map is defined by the ODE.

The noise-to-data pipeline in practice. In a latent image generator in the style of Stable Diffusion 3 (shapes simplified to the 64×64×4 latent used above; SD3 itself uses a 16-channel latent):

1. Sample noise: X_0 ~ N(0, I), a 64×64×4 tensor of Gaussian noise in latent space.
2. Simulate ODE (20-50 steps): Each step computes X_{t+h} = X_t + h · DiT(X_t, t, prompt). The DiT has ~2B parameters.
3. Decode latent to image: X_1 is a 64×64×4 latent. A VAE decoder maps it to a 512×512×3 image.

Total time: ~2-5 seconds on an A100 GPU. Each of the 20-50 neural network calls processes the entire 64×64×4 = 16,384-dimensional latent. The cumulative effect of these small velocity steps is a photorealistic image.

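```python
# What Stable Diffusion 3 does (simplified sketch)
import torch

@torch.no_grad()
def generate_image(dit_model, vae_decoder, encode_text, prompt, n_steps=28):
    """dit_model, vae_decoder, and encode_text are placeholders for trained networks."""
    # Step 1: Random noise in latent space (channels first: 4 x 64 x 64)
    x = torch.randn(1, 4, 64, 64)

    # Step 2: Simulate the flow ODE from t = 0 to t = 1 with Euler steps
    prompt_emb = encode_text(prompt)  # CLIP/T5 encoding
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * h)  # one time value per batch element
        v = dit_model(x, t, prompt_emb)       # 2B param DiT predicts the velocity
        x = x + h * v                         # Euler step

    # Step 3: Decode to pixel space
    image = vae_decoder(x)  # [1, 3, 512, 512]
    return image
```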

Worked example: noise to data in 1D. Suppose p_data is a mixture of two Gaussians centered at −3 and +3. We start with noise X_0 ~ N(0, 1). A flow model learns a vector field u_t(x) such that following the ODE dX_t/dt = u_t(X_t) for t from 0 to 1 pushes X_0 toward one of the two modes. At t = 1, the sample X_1 lands near −3 or +3, a valid sample from p_data.

Sampling from 2D Distributions

Toggle conditional mode to see how conditioning changes which regions of the distribution are sampled. In conditional mode, only samples from the selected region are generated.


Notice the difference between unconditional and conditional sampling. In unconditional mode, samples cover the entire distribution. In conditional mode, we restrict to a specific region — this is what text-to-image does, but in 196,608 dimensions instead of 2.

How noise becomes structure. Here is a concrete numerical walkthrough of what happens inside a generative model. Consider a 1D example where p_data = 0.5 N(−3, 0.25) + 0.5 N(+3, 0.25) (a mixture of two Gaussians). The generative process:

| Step | t | X_t | What's happening |
| --- | --- | --- | --- |
| Init | 0.0 | 1.2 | Random noise from N(0, 1) |
| 1 | 0.2 | 1.4 | Still mostly noise, drifting slightly |
| 2 | 0.4 | 1.8 | Vector field starts to "choose" the +3 mode |
| 3 | 0.6 | 2.3 | Clearly heading toward +3 |
| 4 | 0.8 | 2.7 | Almost there |
| 5 | 1.0 | 2.95 | A sample from p_data (near +3 mode) |

The initial noise X_0 = 1.2 was positive, so the learned vector field pushed it toward the +3 mode. If X_0 had been −0.8, the field would have pushed it toward the −3 mode. The vector field acts like a sorting mechanism: it routes each noise sample to the appropriate data region.
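The routing behavior can be imitated with a hand-written vector field. The sketch below is not a learned model and will not reproduce the numbers in the walkthrough: `toy_vector_field` is a made-up stand-in that pulls each point toward +3 or −3 according to its sign, and the gain 4.0 is an arbitrary choice. It ignores t and squeezes every sample onto the mode centers, so it illustrates only the Euler simulation and the sorting effect, not correct sampling from p_data.

```python
def toy_vector_field(x, t):
    """Hand-written stand-in for a learned field: pull x toward +3 or -3 by its sign.

    The time argument t is accepted for interface compatibility but not used here.
    """
    target = 3.0 if x >= 0 else -3.0
    return 4.0 * (target - x)

def euler_trajectory(x0, n_steps=5):
    """Integrate dX_t/dt = u_t(X_t) from t = 0 to t = 1 with n_steps Euler steps."""
    h = 1.0 / n_steps
    xs = [x0]
    for i in range(n_steps):
        t = i * h
        xs.append(xs[-1] + h * toy_vector_field(xs[-1], t))
    return xs

# A positive start is routed toward +3, a negative start toward -3.
for x0 in (1.2, -0.8):
    print([round(x, 3) for x in euler_trajectory(x0)])
```

A trained flow model replaces `toy_vector_field` with a neural network whose field depends on both x and t, which is what lets the final samples reproduce the spread of each mode rather than collapsing onto its center.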

Why differential equations? Why not just learn a direct mapping f: noise → data? You could try to train a neural network f_θ so that f_θ(ε) ≈ z, but there are no natural (ε, z) pairs to regress on: nothing says which noise vector should map to which image. A GAN generator learns such a one-shot mapping anyway, using an adversarial loss instead of pairs. The problem: learning a one-shot mapping in very high dimensions is hard. The function must be highly nonlinear, and training is often unstable.

Differential equations decompose this hard one-shot mapping into many easy steps. Each step moves the particle a little bit in the right direction. The neural network only needs to predict a small velocity at each point, not the entire transformation. This is like the difference between trying to draw a picture in one pen stroke (hard) versus many small strokes (easy). The composability of differential equations is what makes flow and diffusion models work.

```python
# Sketch: what a generative model does at sampling time
import torch

def generate(model, n_samples, d, n_steps=50):
    # Step 1: Sample noise from a simple distribution
    x = torch.randn(n_samples, d)  # X_0 ~ N(0, I)

    # Step 2: Simulate the ODE from t = 0 to t = 1 to transform noise into data
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((n_samples, 1), i * dt)  # current time for every sample
        velocity = model(x, t)                  # neural network predicts vector field
        x = x + dt * velocity                   # Euler step

    return x  # X_1 ~ p_data (approximately)
```

Why is "just return a random training example" a bad generative model?

Chapter 7: Connections

We have laid the mathematical foundation for generative modeling. Let's recap what we established and preview what comes next.

| Concept | What It Means | Symbol |
| --- | --- | --- |
| Object | A vector in R^d (image, video, molecule, ...) | z |
| Data distribution | Probability density over all possible objects | p_data(z) |
| Generation | Sampling z ~ p_data | z ~ p_data |
| Dataset | Finite samples from p_data | z_1, ..., z_N |
| Guided generation | Sampling z ~ p_data(·|y) | z ~ p_data(·|y) |
| Generative model | Algorithm that produces approximate samples from p_data | |

What comes next: In the next chapter, we learn the machinery of generation: how to build generative models using ordinary differential equations (ODEs) and stochastic differential equations (SDEs). This is the mathematical backbone: vector fields, flows, Brownian motion, and the Euler method. Once we have this toolkit, Chapter 3 (Flow Matching) and Chapter 4 (Score Matching) show how to train these models.

The roadmap for the rest of this course:

Ch 1 (Done): Generative Modeling = Sampling from p_data
Ch 2 (Flow & Diffusion): ODEs, SDEs, Euler method, flow & diffusion models
Ch 3 (Flow Matching): How to TRAIN the vector field (the core algorithm)
Ch 4 (Score Matching): Score functions, SDE sampling, denoising diffusion
Ch 5-7 (Advanced): Guidance, latent spaces, architectures, discrete diffusion

Other generative model families include:

| Model Family | Core Idea | Strengths | Weaknesses |
| --- | --- | --- | --- |
| GANs | Generator vs. discriminator adversarial game | Fast sampling, sharp outputs | Training instability, mode collapse |
| VAEs | Encoder-decoder with latent space regularization | Principled latent space, fast | Blurry outputs |
| Autoregressive | Generate one element at a time (token, pixel) | Exact likelihoods, proven for text | Slow sequential sampling |
| Normalizing Flows | Learn invertible transformations | Exact densities, fast both ways | Architectural constraints for invertibility |
| Flow/Diffusion | ODE/SDE from noise to data | High quality, flexible, scalable | Multi-step sampling (slower than GANs) |

Flow matching and diffusion models are the current state-of-the-art for continuous data (images, video, audio, molecules). Their combination of simplicity (the training loss is just MSE), scalability (standard mini-batch SGD), and quality (state-of-the-art FID scores) has made them the dominant approach since 2022.

A brief history of generative modeling milestones:

| Year | Model | Key Idea |
| --- | --- | --- |
| 2014 | GAN (Goodfellow) | Generator vs. discriminator adversarial game |
| 2014 | VAE (Kingma) | Variational inference with reparameterization |
| 2015 | Deep Diffusion (Sohl-Dickstein) | First diffusion model (thermodynamics-inspired) |
| 2019 | NCSN (Song) | Score matching with Langevin dynamics |
| 2020 | DDPM (Ho) | Denoising diffusion with noise prediction |
| 2021 | DALL-E, GLIDE | Text-to-image at scale (DALL-E autoregressive, GLIDE diffusion-based) |
| 2022 | Stable Diffusion, Flow Matching | Latent diffusion + CondOT training |
| 2024 | FLUX, SD3, Sora | DiT architecture, video generation |
| 2024 | AlphaFold3 | Diffusion for protein structure prediction |

The timeline shows a remarkable convergence: from theoretical curiosities (2014-2019) to practical tools (2020-2022) to industry-defining products (2023-present). Flow matching and diffusion models went from academic papers to powering millions of image generations per day in just four years.

Why flow and diffusion models won. Several factors contributed to their dominance over earlier approaches:

1. Training stability. Unlike GANs, there is no adversarial game. The loss is a simple MSE regression, which converges reliably without mode collapse or training instabilities.

2. Sample quality. Flow/diffusion models achieve the best FID scores on standard benchmarks, surpassing GANs, VAEs, and autoregressive models for continuous data.

3. Mode coverage. The model captures the full diversity of the data distribution, unlike GANs which often miss rare modes.

4. Scalability. The training algorithm scales cleanly to billion-parameter models with standard distributed training techniques.

5. Flexibility. The same model supports unconditional generation, text-to-image, inpainting, super-resolution, and more — often with minimal architectural changes.

6. Theoretical foundation. The mathematics of probability paths, continuity equations, and score functions provides a principled framework for analysis and improvement.

The main drawback: generation speed. GANs generate in one forward pass; flow and diffusion models typically need tens of forward passes (20-50 is common). Research on distillation and consistency models is rapidly closing this gap, with some models now approaching GAN speeds at flow-model quality.
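
The cost of multi-step sampling can be seen in a small sketch. It assumes a one-dimensional toy problem in which both the noise distribution N(0, 1) and the data distribution N(mu, sigma²) are Gaussian, so a vector field that transports one to the other along a straight-line path can be written in closed form. In a real model this field would be a neural network, and every call to `velocity` below would be one forward pass. The function names and the choice of the Euler method are illustrative assumptions.

```python
def velocity(x, t, mu=2.0, sigma=0.5):
    """Closed-form vector field carrying N(0, 1) at t=0 to N(mu, sigma^2) at t=1.

    Along this path the distribution at time t is Gaussian with mean t*mu and
    variance (t*sigma)^2 + (1 - t)^2. This toy field stands in for a trained network.
    """
    var_t = (t * sigma) ** 2 + (1.0 - t) ** 2
    rate = (t * sigma ** 2 - (1.0 - t)) / var_t  # (d/dt std_t) / std_t
    return mu + rate * (x - t * mu)

def euler_sample(x0, n_steps, field=velocity):
    """Integrate dx/dt = field(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x = x0
    h = 1.0 / n_steps
    n_evals = 0
    for k in range(n_steps):
        t = k * h
        x = x + h * field(x, t)  # one "forward pass" per step
        n_evals += 1
    return x, n_evals

x0 = 1.0                  # a fixed starting point, standing in for a noise draw
exact = 2.0 + 0.5 * x0    # the exact flow maps x0 to mu + sigma * x0
for n in (1, 2, 10, 50):
    x1, n_evals = euler_sample(x0, n)
    print(f"{n_evals:3d} evaluations -> x1 = {x1:.4f} (error {abs(x1 - exact):.4f})")
```

In this toy setting, fewer steps mean fewer evaluations but a larger discretization error. Distillation and consistency methods aim to keep the small error while using very few evaluations.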

Applications beyond image generation. Flow and diffusion models have found surprising applications far beyond generating pretty pictures:

| Application | What is z? | What is y? | Example system |
|---|---|---|---|
| Image generation | Image (latent) | Text prompt | Stable Diffusion 3 |
| Video generation | Video frames | Text + first frame | Sora, Movie Gen |
| Protein structure and design | 3D atom coordinates | Sequence or desired function | AlphaFold3, RFdiffusion |
| Drug discovery | Ligand pose (position, orientation, torsions) | Target protein | DiffDock |
| Audio synthesis | Audio (waveform or latent) | Text description | AudioLDM |
| 3D shape generation | 3D point cloud | Text or image | Point-E, Shap-E |
| Robot policy | Action trajectory | Observation | Diffusion Policy, π0 |
| Weather forecasting | Future weather state | Current weather state | GenCast (DeepMind) |

The universality of the framework — "learn to convert noise into structured data" — applies to any domain where the data can be represented as continuous vectors. This is why understanding flow matching is such a valuable skill: master the theory once, apply it everywhere.
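
As a small illustration of "data represented as continuous vectors", the sketch below flattens a tiny image into a vector z of length d and back again. The nested-list layout (height × width × channels) and the helper names are assumptions made for this example only; real systems use tensor libraries and often work in a learned latent space instead of raw pixels.

```python
def flatten_image(image):
    """Flatten an H x W x C nested list into a flat vector of length d = H*W*C."""
    return [float(v) for row in image for pixel in row for v in pixel]

def unflatten_image(z, height, width, channels):
    """Inverse of flatten_image: rebuild the H x W x C nested list from z."""
    if len(z) != height * width * channels:
        raise ValueError("vector length does not match the requested shape")
    values = iter(z)
    return [[[next(values) for _ in range(channels)]
             for _ in range(width)]
            for _ in range(height)]

# A 2 x 2 RGB "image" with intensities in [0, 1].
image = [
    [[0.0, 0.5, 1.0], [0.25, 0.25, 0.25]],
    [[1.0, 1.0, 0.0], [0.1, 0.2, 0.3]],
]
z = flatten_image(image)        # the data sample z, here with d = 12
restored = unflatten_image(z, 2, 2, 3)
print(len(z), restored == image)
```

The same idea carries over to the other rows of the table: atom coordinates, action trajectories, and weather fields are all arrays of real numbers that can be laid out as one long vector.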

What you will be able to do after this course:

1. Understand how Stable Diffusion, FLUX, Sora, and AlphaFold3 work, down to the mathematical foundations.

2. Implement a flow matching model from scratch in PyTorch (training + sampling).

3. Debug common issues in diffusion model training (loss curves, sample quality, noise schedules).

4. Choose between velocity prediction, score prediction, and noise prediction for your application.

5. Extend the framework to new data types (molecules, 3D, video) by understanding the core math.

6. Read the latest research papers on generative modeling with a solid foundation, because the notation and concepts will already be familiar.

A final analogy. Generative modeling with flow matching is like learning to sculpt. The noise is the raw clay. The vector field is the sculptor's hands, applying pressure at each point in time to shape the material. The training process teaches the sculptor (neural network) how to shape clay (noise) into a specific form (data). Each generation starts with a new lump of clay (random noise) and follows the same sculpting procedure (ODE simulation) to produce a unique sculpture (sample). The sculptures all look "real" because the sculpting procedure was learned from thousands of real examples.

Required mathematical background. Because the course is built on differential equations, it assumes some familiarity with:

1. Derivatives and integrals. You should know what df/dt means and be comfortable with the chain rule.

2. Probability basics. Random variables, expected value, Gaussian distribution, Bayes' rule.

3. Vectors and matrices. Vector addition, dot product, matrix-vector multiplication.

4. Python and PyTorch. Basic tensor operations, neural network training loops.

Do not worry if you are rusty — we will derive everything from scratch and motivate every formula. The Appendix of the book provides a probability refresher for those who need it. The key insight of this course is that the ideas are intuitive (convert noise to data by following a velocity field) even though the formalism requires some mathematical machinery.

How to read this course: Each chapter builds on the previous one. Read linearly. Play with every interactive simulation: the simulations are not decoration; they build intuition that the math confirms. Answer every quiz before moving on. If a concept does not click, re-read the worked examples. The goal is not to memorize formulas but to understand why each formula has the form it does. By the end of Chapter 4, you will have a complete understanding of the theory behind the most powerful generative AI systems in the world.

Chapter 1 takeaway: Generative modeling is mathematically equivalent to sampling from a probability distribution. We represent objects as vectors, define a distribution over them, collect a dataset of examples, and build a model that can produce new samples. Everything else is about how to build that model efficiently.

Summary of notation. We will use this notation consistently throughout all chapters:

| Symbol | Meaning |
|---|---|
| $d$ | Dimensionality of the data |
| $z \in \mathbb{R}^d$ | A data sample (image, protein, ...) |
| $p_{\text{data}}$ | The data distribution (unknown, complex) |
| $p_{\text{init}}$ | The initial/noise distribution, usually $\mathcal{N}(0, I)$ |
| $p_t$ | Probability path at time $t$ |
| $u_t(x)$ | Vector field (velocity at position $x$, time $t$) |
| $u_t^\theta(x)$ | Neural network vector field with parameters $\theta$ |
| $\psi_t$ | Flow (solution map of the ODE) |
| $W_t$ | Brownian motion |
| $\sigma_t$ | Diffusion coefficient |
| $\alpha_t, \beta_t$ | Noise schedulers for the probability path |
| $y$ | Conditioning variable (text prompt, class label, ...) |

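
As a rough guide to how these symbols fit together, the relations below sketch the usual flow/diffusion setup. The Gaussian form of the path, the noise variable $\epsilon$, and the boundary conditions on $\alpha_t, \beta_t$ are assumed conventions, since the notation summary only names the symbols; later chapters may state them in a slightly different form.

```latex
\begin{aligned}
X_0 &\sim p_{\text{init}}, \qquad \frac{d}{dt} X_t = u_t^\theta(X_t)
  && \text{(flow model: an ODE driven by the learned vector field)} \\
X_t &= \psi_t(X_0)
  && \text{(the flow } \psi_t \text{ maps a starting point to its position at time } t\text{)} \\
dX_t &= u_t^\theta(X_t)\,dt + \sigma_t\,dW_t
  && \text{(diffusion model: an SDE with Brownian noise)} \\
x_t &= \alpha_t z + \beta_t \epsilon, \qquad z \sim p_{\text{data}},\ \epsilon \sim \mathcal{N}(0, I)
  && \text{(a sample from a Gaussian probability path } p_t\text{)} \\
\alpha_0 &= \beta_1 = 0, \qquad \alpha_1 = \beta_0 = 1
  && \text{(assumed, so that } p_0 = p_{\text{init}} \text{ and } p_1 = p_{\text{data}}\text{)}
\end{aligned}
```

With conditioning, the same relations hold with $u_t^\theta(x)$ replaced by a field that also takes $y$ as input.
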
In flow matching and diffusion models, how does the generative model produce samples?