Ch 1: Introduction — Flow Matching & Diffusion

Chapter 0: Why Generate?

You type "a dog running through snow with mountains behind it" into an image generator. A few seconds later, a photorealistic picture appears. It is not retrieved from a database — it was created from scratch. How?

Previous generations of AI were primarily about prediction: classify this email as spam, predict tomorrow's stock price, detect a tumor in an X-ray. These systems take an input and produce a label or number. But generative AI does something fundamentally different: it creates new objects — images, videos, protein structures, music — that never existed before.

At the heart of every generative model in this course lies a single, beautiful idea: start with pure random noise, then iteratively sculpt it into data. The "sculpting" is done by simulating a differential equation. Flow matching and diffusion models are families of techniques that let us construct, train, and simulate these equations at massive scale using deep neural networks.

The core premise: Generative modeling = learning to convert noise into data. We start from a simple distribution (like a Gaussian) and learn a transformation that maps it to a complex data distribution (like "all possible images of dogs").

This course covers the two dominant algorithms behind modern generative AI: flow matching and denoising diffusion. These power Stable Diffusion, FLUX, Meta's Movie Gen, and even AlphaFold3 for protein structure prediction. The theory is elegant, the implementation is simple, and the results are staggering.

But before we can learn to generate, we need to formalize what "generate" even means. What is a "distribution of images"? What does it mean to "sample" from it? Let's build the mathematical vocabulary we need.

What Generative Models Do

Click "Generate" to sample from different 2D distributions. Each dot is a sample. Notice how samples cluster where the distribution has high probability.

The interactive above shows that a "distribution" is really just a rule for where points are likely to appear. A Gaussian clusters near the center. A ring distribution clusters around a circle. A checkerboard puts mass on alternating squares. A generative model learns to produce samples that match whatever distribution the training data came from.

What is the fundamental task of a generative model?

Classify inputs into categories
Convert noise into samples that follow the data distribution
Compress data to a smaller representation

Chapter 1: Objects as Vectors

To build a generative model, we first need to represent the objects we want to generate — images, videos, protein structures — as mathematical objects that we can manipulate. The universal representation is the vector.

Consider a color image with H × W pixels, each with three color channels (red, green, blue). Each pixel-channel pair stores an intensity value — a real number. So the entire image is a collection of H × W × 3 real numbers. We can flatten this into a single vector z ∈ R^d where d = H × W × 3.

Key Idea 1 — Objects as Vectors: We identify the objects being generated as vectors z ∈ R^d. An image is a vector in R^H×W×3. A video is a vector in R^T×H×W×3. A molecular structure is a vector in R^3×N.

Worked example — a tiny image. A 4 × 4 grayscale image has d = 4 × 4 × 1 = 16 dimensions. Each pixel stores a brightness value between 0 (black) and 1 (white). So a specific image might be the vector:

z = (0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.5, 0.5, 0.1, 0.9, 0.8, 0.2, 0.7, 0.3)

That is 16 numbers — one per pixel. A 256 × 256 RGB image lives in d = 256 × 256 × 3 = 196,608 dimensions. Every possible image of that resolution is a single point in a 196,608-dimensional space.

Worked example — molecular structure. A protein with N = 100 atoms has each atom located in 3D space. The structure is z = (z₁, ..., z₁₀₀) where z_i ∈ R³. Flattened: z ∈ R³⁰⁰. AlphaFold3 generates these vectors using a diffusion model.

Worked example — video. A 3-second video at 30fps with 64 × 64 RGB frames lives in d = 90 × 64 × 64 × 3 = 1,105,920 dimensions. Over a million dimensions for a tiny clip.

Data Type	Representation	Typical d
256×256 RGB Image	z ∈ R^(H×W×3)	196,608
3s Video (64×64, 30fps)	z ∈ R^(T×H×W×3)	~1.1M
Protein (100 atoms)	z ∈ R^(3N)	300
Audio (1s, 16kHz)	z ∈ R^T	16,000
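
The dimension counts in the table can be checked with a few lines of arithmetic. This is a small sketch; the helper name `flat_dim` is ours, introduced only for illustration.

```python
import math

def flat_dim(*shape):
    """Number of real values in an array of the given shape (hypothetical helper)."""
    return math.prod(shape)

d_image = flat_dim(256, 256, 3)          # H x W x 3
d_video = flat_dim(3 * 30, 64, 64, 3)    # T x H x W x 3, with T = 3 s * 30 fps
d_protein = flat_dim(3, 100)             # 3 coordinates per atom, N = 100 atoms
d_audio = flat_dim(16_000)               # 1 s sampled at 16 kHz

print(d_image, d_video, d_protein, d_audio)
```

Whatever the shape of the original object, the generative model only ever sees the flattened count d.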

What about text? Text is discrete — words or tokens from a vocabulary. Most language models treat text as sequences of discrete tokens, not continuous vectors. Continuous diffusion models for text exist (Section 7 of the book) but they are not our main focus. For now, think of z as always being a continuous vector.

Why vectors and not something else? The vector representation lets us use the full power of calculus and linear algebra. We can add vectors (blend two images), scale them, compute distances (how similar are two images?), and define smooth transformations (gradually convert noise into an image). Differential equations — the mathematical engine of flow and diffusion models — operate on vectors. Without this representation, none of the theory in this course would apply.
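
A short sketch of those vector operations, using random arrays as stand-ins for real images. The names `img_a` and `img_b` and the 0.5 blend weight are illustrative assumptions, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stand-in "images" with pixel values in [0, 1], flattened to vectors in R^3072
img_a = rng.random((32, 32, 3)).reshape(-1)
img_b = rng.random((32, 32, 3)).reshape(-1)

# Blend: a weighted sum of two vectors is again a vector in R^3072
blend = 0.5 * img_a + 0.5 * img_b

# Distance: the Euclidean norm of the difference measures dissimilarity
dist_ab = np.linalg.norm(img_a - img_b)
dist_a_blend = np.linalg.norm(img_a - blend)

# Smooth transformation: a straight-line path from img_a (s = 0) to img_b (s = 1)
path = [(1 - s) * img_a + s * img_b for s in np.linspace(0.0, 1.0, 5)]

print(blend.shape, dist_ab, dist_a_blend)
```

The midpoint blend sits exactly halfway between the two images in Euclidean distance, which is the kind of geometric statement that only makes sense once images are vectors.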

python
# Representing objects as vectors in code
import numpy as np

# A 32x32 RGB image as a vector
image = np.random.rand(32, 32, 3)   # shape: (32, 32, 3)
z = image.flatten()                    # shape: (3072,) — a vector in R^3072
print(z.shape)                         # (3072,)

# A protein with 100 atoms
atoms = np.random.rand(100, 3)       # 100 atoms, each in R^3
z_protein = atoms.flatten()            # shape: (300,)

# A 1-second audio clip at 16kHz
z_audio = np.random.rand(16000)       # shape: (16000,)

The key abstraction: Once everything is a vector z ∈ R^d, the same generative modeling algorithms work for images, proteins, audio, and video. The algorithms do not "know" what the vectors represent — they just learn to transform Gaussian noise into vectors that have the same statistical properties as the training data. The domain-specific knowledge is all in the data.

A 32×32 RGB image is represented as a vector z in what space?

R³⁰⁷² (since 32 × 32 × 3 = 3,072)
R¹⁰²⁴ (since 32 × 32 = 1,024)
R³² (one dimension per row)

Chapter 2: The Data Distribution

There is no single "best" image of a dog. There are millions of possible dog images, and many of them would satisfy a user. Some are more likely (a clear photo of a golden retriever) and some are less likely (an abstract painting of a dog). This diversity is captured by a probability distribution.

We define a data distribution p_data over the space R^d of all possible objects. Mathematically, p_data is a probability density function:

p_data : R^d → R_≥0

It assigns each possible object z ∈ R^d a non-negative likelihood p_data(z). Images that look like real photographs get high likelihood. Random noise gets near-zero likelihood.

The crucial reframing: "How good is this image?" is a subjective question with no clean answer. "How likely is this image under the data distribution?" is a precise mathematical question. Generative modeling replaces the first question with the second.

Worked example — 1D distribution. Suppose our "data" is human heights in centimeters. Then p_data might be a Gaussian centered at 170 cm with standard deviation 10 cm. A height of 170 cm has high likelihood; a height of 300 cm has near-zero likelihood. Sampling z ~ p_data gives a random height that looks like a real human height.
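
To make "high likelihood" and "near-zero likelihood" concrete, the Gaussian density can be evaluated directly. This is a minimal sketch that assumes the N(170, 10²) model from the example; `gaussian_pdf` is a helper written here for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mean) / std) ** 2)

p_170 = gaussian_pdf(170.0, mean=170.0, std=10.0)  # at the mean
p_190 = gaussian_pdf(190.0, mean=170.0, std=10.0)  # two standard deviations away
p_300 = gaussian_pdf(300.0, mean=170.0, std=10.0)  # thirteen standard deviations away

print(p_170, p_190, p_300)
```

The density at 170 cm is about 0.04 per centimeter, while the density at 300 cm is smaller by dozens of orders of magnitude.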

Worked example — 2D distribution. Consider 2D data points arranged in a checkerboard pattern. The distribution p_data assigns high probability to the dark squares and zero probability to the light squares. Samples cluster in the dark regions.

In the real world, p_data lives in incredibly high dimensions and has incredibly complex structure. The distribution of all 256×256 images is a function over R^196,608. It has mass concentrated on a tiny fraction of that space — the manifold of "images that look like real photographs." Almost all of R^196,608 corresponds to random noise or visual garbage.

Worked example — the manifold hypothesis. Consider 28×28 grayscale images of handwritten digits (the MNIST dataset). Each image is a vector in R⁷⁸⁴. But not all 784-dimensional vectors look like digits. If you pick a random vector in R⁷⁸⁴, you get noise. The set of "images that look like digits" forms a thin, curved surface (a manifold) embedded in R⁷⁸⁴. The data distribution p_data concentrates almost all of its mass on or near this manifold and assigns near-zero density everywhere else.

This manifold is what a generative model must learn. It does not need to fill all of R⁷⁸⁴ with plausible images. It just needs to find the tiny subset where digits live and generate points on that subset.

python
# Demonstrating the manifold hypothesis
import numpy as np

# A random vector in R^784 — this is NOT a valid digit
noise = np.random.rand(784)
# If you reshape to (28, 28) and display: just random noise

# An actual MNIST digit (z ~ p_data) looks structured:
# The pixel values form recognizable strokes
# p_data(noise) ≈ 0   (random vectors have ~zero probability)
# p_data(digit) > 0   (structured images have positive probability)

Visualizing Data Distributions in 2D

Each distribution assigns different probabilities to different regions. Brighter = higher probability. Drag the cursor to see local probability density.

The data distribution p_data(z) assigns each possible object z a...

Category label (dog, cat, etc.)
Non-negative likelihood value indicating how "data-like" z is
Quality score from 0 to 10

Chapter 3: Generation as Sampling

Now we can say precisely what "generate" means. Generating an object z is the same as sampling from the data distribution:

z ~ p_data

This is the formal statement: to generate an image of a dog, we sample a random vector z from the distribution p_data of dog images. Each time we sample, we get a different image — but all samples are "likely" according to p_data, meaning they look like real dog images.

Key Idea 2 — Generation as Sampling: Generating an object z is modeled as sampling from the data distribution z ~ p_data. A generative model is any algorithm that produces such samples.

Worked example — 1D Gaussian. If p_data = N(170, 10²) (human heights), then sampling z ~ p_data might give us 167.3, 172.8, 155.1, 180.4, ... Each sample is a plausible height. We will never get z = 1000 because the density there is essentially zero.
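
Sampling from this 1D distribution takes one library call. The sketch below assumes NumPy's random generator; the seed and the sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # the seed is arbitrary

# Draw 100,000 samples z ~ N(170, 10^2)
heights = rng.normal(loc=170.0, scale=10.0, size=100_000)

print(heights[:4])                      # a few individual samples
print(heights.mean(), heights.std())    # close to 170 and 10
print(heights.min(), heights.max())     # the extremes stay far below 1000
```

For a Gaussian we can sample directly because the distribution has a known closed form. The rest of the course is about what to do when it does not.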

Worked example — 2D moons. If p_data is the "two moons" distribution, samples cluster along two crescent shapes. Drawing z ~ p_data gives random points on those crescents.

The beauty of this formulation is that it separates what we want (samples from p_data) from how we get them (the generative model). Different algorithms — GANs, VAEs, flow matching, diffusion — are different "hows" for the same "what."

But here is the catch: we never have direct access to p_data. We cannot evaluate p_data(z) for a given z (is this particular pixel arrangement likely?). We cannot invert p_data to produce samples directly. All we have is a finite collection of examples. This brings us to the concept of a dataset.

The fundamental challenge: We want to sample from p_data, but we can neither evaluate p_data(z) nor sample from it directly. All we have is a finite dataset of examples. The entire field of generative modeling is about building algorithms that can produce new samples despite never seeing the full distribution.

"Generating an image" is mathematically equivalent to:

Finding the single best image in the dataset
Evaluating p_data(z) at every possible z
Sampling a random z from the data distribution p_data

Chapter 4: Datasets

We cannot access p_data directly, but we can collect examples from it. A dataset is a finite collection of samples drawn independently from p_data:

z₁, z₂, ..., z_N ~ p_data

For images, a dataset might be millions of photographs scraped from the internet (like ImageNet or LAION). For proteins, it might be experimentally resolved 3D structures from the Protein Data Bank. For videos, it might be clips from YouTube.

Key Idea 3 — Dataset: A dataset consists of a finite number of samples z₁, ..., z_N ~ p_data. As N grows, the dataset becomes an increasingly better approximation of the true distribution.

Worked example — 2D checkerboard. Suppose p_data is a checkerboard distribution. With N = 10 samples, the pattern is barely visible. With N = 100, you start to see structure. With N = 10,000, the checkerboard is unmistakable. The dataset is a finite proxy for the infinite distribution.
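
The effect of N can be measured rather than eyeballed. The sketch below assumes a 4 × 4 checkerboard on [0, 4)² with uniform mass on the dark squares (our choice of parameterization) and reports how far the empirical square frequencies are from the true value of 1/8.

```python
import numpy as np

def sample_checkerboard(n, rng, size=4):
    """Draw n points uniformly from the dark squares of a size x size board (size even)."""
    x = rng.uniform(0, size, n)
    y = rng.uniform(0, size, n)
    # A square is "dark" when floor(x) + floor(y) is even. Points that land on a
    # light square are shifted up by one square (wrapping around), which flips parity.
    light = (np.floor(x) + np.floor(y)) % 2 == 1
    y = np.where(light, (y + 1) % size, y)
    return np.stack([x, y], axis=1)

rng = np.random.default_rng(0)
dark = (np.add.outer(np.arange(4), np.arange(4)) % 2) == 0  # mask of dark squares

for n in (10, 100, 10_000):
    pts = sample_checkerboard(n, rng)
    # Empirical probability of each of the 16 squares
    counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=4, range=[[0, 4], [0, 4]])
    freq = counts / n
    max_err = np.abs(freq[dark] - 1 / 8).max()  # each dark square has true mass 1/8
    print(n, "max error on dark squares:", round(float(max_err), 3))
```

The error shrinks as N grows, which is the precise sense in which the dataset becomes a better proxy for p_data.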

This is a crucial point: the training data is not the distribution. The distribution p_data is a theoretical object — the "true" density over all possible objects of interest. The dataset is a finite window into that distribution. Generative models must learn enough about p_data from the dataset to generate new samples that were not in the training set.

Overfitting vs. generalization. If the model simply memorizes the training examples, it has failed. A good generative model generalizes: it generates new samples that are plausible under p_data but were not in the training set. How do we know it is not just memorizing? Several tests:

1. Nearest-neighbor test: Compare generated samples to their nearest neighbors in the training set. If they are too similar, the model is memorizing.

2. Interpolation test: Interpolate between two noise vectors and check that intermediate samples look plausible. Memorizing models produce garbage in between memorized samples.

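3. Distribution metrics: FID (Fréchet Inception Distance) compares the feature statistics of generated samples to those of real samples. Measured against the training set, a model that memorizes perfectly would score FID ≈ 0, but so would a model that generalizes perfectly, so this number alone cannot separate the two. In practice, computing FID against a held-out test set and pairing it with the nearest-neighbor test above gives a more reliable picture.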
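
The nearest-neighbor test from the list above can be sketched in a few lines. Everything here is a toy stand-in: the 2D Gaussian "training set", the jittered copies that imitate a memorizing model, and the fresh draws that imitate a generalizing one are all assumptions made for illustration.

```python
import numpy as np

def nearest_neighbor_distances(generated, train):
    """For each generated sample, the Euclidean distance to its closest training sample."""
    # Pairwise differences via broadcasting: shape [n_generated, n_train, d]
    diffs = generated[:, None, :] - train[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return dists.min(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))       # stand-in training set
held_out = rng.normal(size=(500, 2))    # fresh real samples from the same distribution

# A "memorizing" generator returns training points with tiny jitter
memorized = train[rng.integers(0, 500, size=200)] + 1e-3 * rng.normal(size=(200, 2))
# A "generalizing" generator is imitated by fresh draws from the true distribution
fresh = rng.normal(size=(200, 2))

# Reference scale: how close held-out real data sits to the training set
baseline = np.median(nearest_neighbor_distances(held_out, train))
med_memorized = np.median(nearest_neighbor_distances(memorized, train))
med_fresh = np.median(nearest_neighbor_distances(fresh, train))
print("baseline :", baseline)
print("memorized:", med_memorized)
print("fresh    :", med_fresh)
```

Comparing against the held-out baseline matters: "too similar" only has meaning relative to how close genuinely new data sits to the training set.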

In practice, models with millions of parameters trained on millions of images generalize well. Models trained on small datasets (say, 1000 images) tend to memorize. The transition from memorization to generalization is one of the most interesting phenomena in deep generative modeling.

Dataset quality vs. quantity. Not all data is created equal. A dataset of 1 million high-quality, diverse images often produces better results than 10 million low-quality, repetitive images. Data curation — filtering out duplicates, low-quality images, and harmful content — is a critical step in training production models. LAION-5B, the dataset behind many open-source models, went through multiple rounds of filtering before use.

The training data pipeline for a real model.

Collect

Scrape images from the web (billions)

↓

Filter

Remove duplicates, low-res, NSFW, watermarked images

↓

Caption

Generate text descriptions using BLIP-2 or CogVLM

↓

Encode

Pre-compute VAE latents and CLIP/T5 text embeddings

↓

Train

Flow matching on (latent, text_embedding) pairs

Dataset Size and Distribution Approximation

Increase N to see how more samples reveal the structure of the underlying distribution. The true distribution is the "Two Moons" shape.

In practice, the quality of a generative model depends enormously on the size and quality of the dataset. Stable Diffusion was trained on billions of image-text pairs, and large language models are trained on trillions of text tokens. The datasets are large because the distributions are complex.

How many samples are "enough"? Consider our earlier calculation: a 256 × 256 RGB image lives in R^196,608. The space is enormous, but the manifold of "natural images" is much lower-dimensional. Empirically, training on millions to billions of images captures enough of this manifold for high-quality generation. The exact number depends on the diversity of the target distribution — faces are simpler than "all images" and need fewer samples.

The intrinsic dimensionality of data. While images live in R^196,608, the manifold of natural images has a much lower intrinsic dimensionality: published estimates vary with the dataset and the estimator, but all fall far below the pixel count. This is why generative models work at all: they do not need to "fill" all 196,608 dimensions, just the thin manifold where real images live. This is also why latent diffusion models (Stable Diffusion) first compress a 512×512 image to a 64×64×4 latent (d = 16,384) before applying the diffusion process. The compression loses little perceptually relevant information because the intrinsic dimensionality is much smaller than the pixel dimensionality.

From pixels to latents: Modern image generators (Stable Diffusion, FLUX) do NOT apply flow matching directly in pixel space. They first train a VAE to compress images to a much smaller latent space (typically 8x spatial compression, with 4 channels in this example). Flow matching operates in this latent space. For a 512×512 RGB image this reduces d from 786,432 to 16,384, a 48x reduction, making training and sampling much faster. Chapter 6 of the book covers this in detail.
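
The dimension arithmetic behind latent compression can be sketched as follows, assuming 8x spatial downsampling and 4 latent channels. The helper `latent_dim` is hypothetical and exists only for this calculation.

```python
def latent_dim(height, width, downsample=8, latent_channels=4):
    """Pixel and latent dimensionality of an RGB image under the assumed VAE factors."""
    d_pixel = height * width * 3
    d_latent = (height // downsample) * (width // downsample) * latent_channels
    return d_pixel, d_latent

for side in (256, 512):
    d_pixel, d_latent = latent_dim(side, side)
    print(f"{side}x{side}: {d_pixel} -> {d_latent} ({d_pixel // d_latent}x smaller)")
```

With these factors, a 64×64×4 latent corresponds to a 512×512 image, and a 256×256 image would compress to 32×32×4.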

The central role of randomness. A generative model produces different outputs each time because it starts from a random noise sample. Two calls to the same model with different noise produce two different images (or proteins, or videos). The noise seed controls which specific sample you get. This is why AI art tools let you set a "seed" — fixing the noise gives a reproducible output.
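
The role of the seed can be demonstrated without any model at all, since a deterministic sampler maps identical noise to identical output. This sketch assumes NumPy's generator API; the 4×64×64 shape mirrors the latent shape used elsewhere in this chapter.

```python
import numpy as np

def sample_noise(seed, shape=(4, 64, 64)):
    """Draw X_0 ~ N(0, I) from a generator initialized with the given seed."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

a = sample_noise(seed=42)
b = sample_noise(seed=42)   # same seed: identical noise, hence identical output
c = sample_noise(seed=43)   # different seed: different noise, hence a different sample

print(np.array_equal(a, b), np.array_equal(a, c))  # True False
```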

Mode coverage vs. mode quality. A perfect generative model would produce all the diversity of p_data (mode coverage) with each sample being high quality (mode quality). In practice, there is often a tension between the two. GANs tend to have high quality but low coverage (mode collapse). Diffusion models tend to have both, which is one reason they have become dominant.

Evaluation metrics for generative models. How do we know if a generative model is good?

Metric	What It Measures	Lower is better?
FID (Fréchet Inception Distance)	Distribution similarity (statistics of generated vs. real images)	Yes (0 = identical)
IS (Inception Score)	Quality and diversity of generated images	No (higher = better)
CLIP Score	Alignment between image and text prompt	No (higher = better)
LPIPS	Perceptual similarity between individual images	Depends on context
Human evaluation	Subjective quality assessed by humans	Gold standard

FID is the most widely reported metric. State-of-the-art models achieve FID < 2 on standard benchmarks like ImageNet 256×256, meaning the feature statistics of generated images are nearly indistinguishable from those of real images.
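
For intuition, FID fits a Gaussian to each set of feature vectors and computes the Fréchet distance between them: ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)). The sketch below implements only that distance on arbitrary feature arrays. A real FID computation first extracts features with a pretrained Inception network, which is omitted here; the random arrays are stand-ins.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets of shape [n, d]."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of square roots of the eigenvalues of
    # cov_r @ cov_g, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(5000, 8))                 # stand-in "real" features
gen_same = rng.normal(size=(5000, 8))             # same distribution as real
gen_shifted = rng.normal(loc=1.0, size=(5000, 8)) # mean shifted by 1 in each of 8 dims

print(frechet_distance(real, gen_same))     # close to 0
print(frechet_distance(real, gen_shifted))  # close to 8, the squared mean shift
```

Because the metric depends only on means and covariances of features, it summarizes the whole sample set and says nothing about any individual image.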

Why this course matters practically. Flow matching and diffusion models are not just academic curiosities. They power products used by hundreds of millions of people: Midjourney, DALL-E 3, Stable Diffusion, Google Imagen, Adobe Firefly, and many more. Understanding their mathematical foundations gives you the ability to design, train, debug, and improve these systems. The algorithms are remarkably simple once you understand the theory — which is exactly what the remaining chapters build.

Dataset bias matters. If the dataset contains mostly photos of cats and dogs, the generative model will primarily produce cats and dogs. If the dataset lacks diversity (e.g., photos from only one culture), the model inherits this bias. The model cannot generate objects outside the support of its training distribution — or at least, not well.

python
# Creating a toy dataset for flow matching experiments
import torch

def sample_two_moons(n):
    """Sample from the 'two moons' distribution in 2D."""
    n1 = n // 2
    n2 = n - n1
    # Upper moon
    theta1 = torch.rand(n1) * torch.pi
    x1 = torch.cos(theta1) + torch.randn(n1) * 0.05
    y1 = torch.sin(theta1) + torch.randn(n1) * 0.05
    # Lower moon
    theta2 = torch.rand(n2) * torch.pi
    x2 = 1 - torch.cos(theta2) + torch.randn(n2) * 0.05
    y2 = 1 - torch.sin(theta2) - 0.5 + torch.randn(n2) * 0.05
    return torch.stack([
        torch.cat([x1, x2]),
        torch.cat([y1, y2])
    ], dim=1)  # [n, 2]

Domain	Example Dataset	Size
Images	LAION-5B	~5 billion image-text pairs
Proteins	Protein Data Bank	~200,000 structures
Video	HD-VILA-100M	~100 million clips
Audio	AudioSet	~2 million clips

A dataset is a finite proxy for p_data. What happens as the dataset size N grows very large?

The dataset becomes an increasingly better representation of p_data
The dataset converges to a single point
The dataset becomes more noisy

Chapter 5: Guided Generation

Unconditional generation — "give me a random image" — is useful but limited. In practice, we almost always want to condition on something: a text prompt, a class label, a partial sketch, or a noisy measurement. This is guided generation (also called conditional generation).

Key Idea 4 — Guided Generation: Guided generation involves sampling from a conditional distribution z ~ p_data(·|y), where y is a conditioning variable. The model must work for any y, not just a fixed one.

Worked example — text-to-image. You want an image of "a photorealistic cat blowing out birthday candles." Here y = "a photorealistic cat blowing out birthday candles" and z ~ p_data(·|y) is a random image that matches this description. Different samples from p_data(·|y) give different images, all featuring a cat with candles.

Worked example — super-resolution. Given a low-resolution 32×32 image y, generate a plausible high-resolution 256×256 version z. Multiple valid high-res images exist for any given low-res input, so this is genuinely a sampling problem.

Worked example — protein design. Given a desired protein function y, generate a 3D structure z that realizes that function. AlphaFold3 solves a closely related conditional problem: given a molecular sequence y, it generates a plausible 3D structure z.

Worked example — inpainting. Given an image with a missing region and y = "the surrounding pixels," generate z = "the completed image." Many plausible completions exist, so again we sample from a conditional distribution.

The training data for guided generation consists of pairs (z_i, y_i). For text-to-image: z_i is an image and y_i is its caption. A single model learns to condition on arbitrary y values.
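
A toy version of learning from pairs (z_i, y_i) may help fix ideas. In this sketch y is a binary label, z is a single number, and the "model" is just a Gaussian fitted per label. This stands in for a real conditional generative model and is not how flow matching handles conditioning; the means −3 and +3 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired dataset (z_i, y_i): y is a class label in {0, 1}, and z is a 1D
# "object" whose distribution depends on y.
n = 5_000
y_train = rng.integers(0, 2, size=n)
z_train = np.where(y_train == 0, -3.0, 3.0) + 0.5 * rng.standard_normal(n)

# "Training": estimate a Gaussian for each value of y from the pairs
params = {}
for label in (0, 1):
    z_label = z_train[y_train == label]
    params[label] = (z_label.mean(), z_label.std())

def sample_conditional(y, n_samples, rng):
    """Draw samples from the fitted approximation of p_data(. | y)."""
    mean, std = params[y]
    return mean + std * rng.standard_normal(n_samples)

print(sample_conditional(0, 3, rng))   # samples near -3
print(sample_conditional(1, 3, rng))   # samples near +3
```

One set of fitted parameters serves every value of y, which mirrors the requirement that a single model handle arbitrary conditions.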

Application	z (generated)	y (condition)
Text-to-image	Image pixels	Text prompt
Super-resolution	High-res image	Low-res image
Protein design	3D structure	Desired function
Inpainting	Complete image	Masked image + mask
Text-to-video	Video frames	Text description
Music generation	Audio waveform	Text description or melody

Good news: It turns out that techniques for unconditional generation generalize straightforwardly to the conditional case. For the first three sections of this course, we focus almost exclusively on unconditional generation z ~ p_data, knowing that conditioning comes essentially "for free" later (Chapter 5 of the book covers guidance).

Here is the full picture of what we are building toward:

Represent

Objects z as vectors in R^d

↓

Formalize

Generation as sampling z ~ p_data

↓

Train

A model using a finite dataset z₁,...,z_N ~ p_data

↓

Generate

New samples by converting noise to data via ODEs/SDEs

In guided generation, what is the conditioning variable y?

The output image or generated object
Any user-specified input (text prompt, class label, etc.) that controls what is generated
The random noise used to initialize the generation process

Chapter 6: The Sampler

Let's tie everything together with a hands-on simulation. A generative model is an algorithm that returns approximate samples from p_data. In this course, we will build generative models using differential equations — but for now, let's see what "sampling from a distribution" actually looks like.

The simplest possible generative model is: just return a random training example. This is called memorization — and it is not what we want. A good generative model creates new samples that look like they came from p_data but are not copies of training data.

Think of it this way: If you ask a memorizing model for a dog image, it picks a random photo from its training set. If you ask a good generative model, it "imagines" a new dog image — one that has never existed before — by understanding the underlying distribution of dog images.

A slightly better approach: fit a parametric model (like a Gaussian mixture) to the data and sample from that. But fitting a good density model in 196,608 dimensions is essentially as hard as the original problem.
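
The limits of simple parametric fits already show up in 2D. The sketch below fits a single Gaussian (the simplest parametric family, used here instead of a full mixture) to two-moons data by matching mean and covariance, then checks how far its samples land from the data. The NumPy two-moons sampler and the nearest-neighbor distance used as a yardstick are our own illustrative choices.

```python
import numpy as np

def sample_two_moons(n, rng, noise=0.05):
    """NumPy version of the two-moons toy distribution."""
    n1 = n // 2
    t1 = rng.uniform(0, np.pi, n1)
    t2 = rng.uniform(0, np.pi, n - n1)
    upper = np.stack([np.cos(t1), np.sin(t1)], axis=1)
    lower = np.stack([1 - np.cos(t2), 0.5 - np.sin(t2)], axis=1)
    return np.concatenate([upper, lower]) + noise * rng.standard_normal((n, 2))

def dist_to_nearest(points, reference):
    """Distance from each point to its nearest neighbor in the reference set."""
    diffs = points[:, None, :] - reference[None, :, :]
    return np.linalg.norm(diffs, axis=-1).min(axis=1)

rng = np.random.default_rng(0)
data = sample_two_moons(2_000, rng)

# Fit a single Gaussian by matching the mean and covariance, then sample from it
mean, cov = data.mean(axis=0), np.cov(data, rowvar=False)
gauss_samples = rng.multivariate_normal(mean, cov, size=500)

# Fresh two-moons samples sit on the crescents; many Gaussian samples do not
fresh = sample_two_moons(500, rng)
med_fresh = np.median(dist_to_nearest(fresh, data))
med_gauss = np.median(dist_to_nearest(gauss_samples, data))
print("fresh data      :", med_fresh)
print("fitted Gaussian :", med_gauss)
```

The fitted Gaussian reproduces the mean and covariance of the data yet places much of its mass off the crescents. Richer families reduce this mismatch, but the number of parameters needed grows quickly with dimension.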

The preview of what is coming in Chapters 2-4: We will learn to define a trajectory that starts at random noise X₀ ~ N(0, I) and ends at data X₁ ~ p_data. The trajectory is defined by a vector field (learned by a neural network) that tells each point which direction to move at each moment in time. Simulating this trajectory converts noise to data.

This approach is fundamentally different from memorization or density estimation. The model never stores data points or evaluates densities. Instead, it learns a transformation — a smooth, continuous map from the simple Gaussian distribution to the complex data distribution. Every point in the Gaussian gets mapped to a point in the data distribution, and the map is defined by the ODE.

The noise-to-data pipeline in practice. In a latent image generator in the style of Stable Diffusion 3 (shapes simplified to a 4-channel latent; SD3 itself uses 16 latent channels):

1. Sample noise

X₀ ~ N(0, I) — a 64×64×4 tensor of Gaussian noise in latent space

↓

2. Simulate ODE (20-50 steps)

Each step: X_{t+h} = X_t + h · DiT(X_t, t, prompt). The DiT has ~2B parameters.

↓

3. Decode latent to image

X₁ is a 64×64×4 latent. A VAE decoder maps it to a 512×512×3 image.

Total time: ~2-5 seconds on an A100 GPU. Each of the 20-50 neural network calls processes the entire 64×64×4 = 16,384-dimensional latent. The cumulative effect of these small velocity steps is a photorealistic image.

python
# What Stable Diffusion 3 does (simplified)
# dit_model, vae_decoder, and encode_text are placeholders for trained components.
import torch

@torch.no_grad()
def generate_image(dit_model, vae_decoder, encode_text, prompt, n_steps=28):
    # Step 1: Random noise in latent space, X_0 ~ N(0, I)
    x = torch.randn(1, 4, 64, 64)  # latent noise

    # Step 2: Simulate the flow ODE from t = 0 to t = 1 with Euler steps
    prompt_emb = encode_text(prompt)  # CLIP/T5 encoding
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.tensor([i * h])        # current time, shape [1]
        v = dit_model(x, t, prompt_emb)  # 2B param DiT predicts the velocity
        x = x + h * v                    # Euler step

    # Step 3: Decode to pixel space
    image = vae_decoder(x)  # [1, 3, 512, 512]
    return image

Worked example — noise to data in 1D. Suppose p_data is a mixture of two Gaussians centered at −3 and +3. We start with noise X₀ ~ N(0, 1). A flow model learns a vector field u_t(x) such that following the ODE dX_t/dt = u_t(X_t) for t from 0 to 1 pushes X₀ toward one of the two modes. At t = 1, the sample X₁ lands near −3 or +3 — a valid sample from p_data.

Sampling from 2D Distributions

Toggle conditional mode to see how conditioning changes which regions of the distribution are sampled. In conditional mode, only samples from the selected region are generated.

Notice the difference between unconditional and conditional sampling. In unconditional mode, samples cover the entire distribution. In conditional mode, we restrict to a specific region — this is what text-to-image does, but in 196,608 dimensions instead of 2.

How noise becomes structure. Here is a concrete numerical walkthrough of what happens inside a generative model. Consider a 1D example where p_data = 0.5 N(−3, 0.25) + 0.5 N(+3, 0.25) (a mixture of two Gaussians). The generative process:

Step	t	X_t	What's happening
Init	0.0	1.2	Random noise from N(0,1)
1	0.2	1.4	Still mostly noise, drifting slightly
2	0.4	1.8	Vector field starts to "choose" the +3 mode
3	0.6	2.3	Clearly heading toward +3
4	0.8	2.7	Almost there
5	1.0	2.95	A sample from p_data (near +3 mode)

The initial noise X₀ = 1.2 was positive, so the learned vector field pushed it toward the +3 mode. If X₀ had been −0.8, the field would have pushed it toward the −3 mode. The vector field acts like a sorting mechanism — it routes each noise sample to the appropriate data region.
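
The sorting behavior can be reproduced numerically. For this particular p_data, the velocity field of the straight-line path X_t = (1 − t) X₀ + t Z has a closed form, so the sketch below uses that exact field in place of a trained network. The choice of path and the closed-form field are our assumptions, and the individual trajectory values will differ from the illustrative table above.

```python
import numpy as np

MEANS = np.array([-3.0, 3.0])     # mixture means from the example
WEIGHTS = np.array([0.5, 0.5])    # mixture weights
VAR = 0.25                        # variance of each component

def vector_field(x, t):
    """Exact marginal velocity u_t(x) for the path X_t = (1 - t) X_0 + t Z,
    with X_0 ~ N(0, 1) and Z ~ p_data. It stands in for a trained network."""
    var_t = (1 - t) ** 2 + t ** 2 * VAR                # variance of X_t within a component
    centered = x[:, None] - t * MEANS[None, :]         # x - t * m_k, shape [n, 2]
    # Posterior probability that x belongs to each component at time t
    logits = np.log(WEIGHTS)[None, :] - 0.5 * centered ** 2 / var_t
    resp = np.exp(logits - logits.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # Conditional mean of Z - X_0 given X_t = x, per component
    per_component = MEANS[None, :] + (t * VAR - (1 - t)) / var_t * centered
    return (resp * per_component).sum(axis=1)

def simulate(x0, n_steps=200):
    """Euler simulation of dX_t/dt = u_t(X_t) from t = 0 to t = 1."""
    x, h = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + h * vector_field(x, i * h)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(20_000)
x1 = simulate(x0)

print("fraction near +3:", np.mean(x1 > 0))
print("mean |X_1|:", np.abs(x1).mean(), " spread around the mode:", (np.abs(x1) - 3).std())
print("sign preserved:", np.mean(np.sign(x1) == np.sign(x0)))
```

Positive noise samples end near +3 and negative ones near −3, with a spread close to the component standard deviation of 0.5, so the final samples follow the target mixture.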

Why differential equations? Why not just learn a direct mapping f: noise → data? You could train a neural network f_θ that maps a noise sample ε to a data sample z = f_θ(ε) in a single forward pass. This is essentially what a GAN generator does; because no natural pairing (ε, z) exists to regress on, it has to be trained adversarially. The problem: learning a one-shot mapping in very high dimensions is hard. The function must be highly nonlinear and the training landscape has many bad local minima.

Differential equations decompose this hard one-shot mapping into many easy steps. Each step moves the particle a little bit in the right direction. The neural network only needs to predict a small velocity at each point, not the entire transformation. This is like the difference between trying to draw a picture in one pen stroke (hard) versus many small strokes (easy). The composability of differential equations is what makes flow and diffusion models work.

python
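# Sketch: what a flow-based generative model does at sampling time
import torch

def generate(model, n_samples, d, n_steps=50):
    # model is a placeholder for a trained network mapping (x, t) to a velocity
    # Step 1: Sample noise from a simple distribution
    x = torch.randn(n_samples, d)  # X_0 ~ N(0, I)

    # Step 2: Simulate the ODE from t = 0 to t = 1 to transform noise into data
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((n_samples, 1), i * dt)  # current time for every sample
        velocity = model(x, t)    # neural network predicts the vector field
        x = x + dt * velocity     # Euler step

    return x  # X_1 ~ p_data (approximately)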

Why is "just return a random training example" a bad generative model?

It memorizes rather than learning the distribution, so it cannot create new samples
It is too slow
It requires too much memory

Chapter 7: Connections

We have laid the mathematical foundation for generative modeling. Let's recap what we established and preview what comes next.

Concept	What It Means	Symbol
Object	A vector in R^d (image, video, molecule, ...)	z
Data distribution	Probability density over all possible objects	p_data(z)
Generation	Sampling z ~ p_data	z ~ p_data
Dataset	Finite samples from p_data	z₁,...,z_N
Guided generation	Sampling z ~ p_data(·|y)	z ~ p_data(·|y)
Generative model	Algorithm that produces approximate samples from p_data	—

What comes next: In the next chapter, we learn the machinery of generation — how to build generative models using ordinary differential equations (ODEs) and stochastic differential equations (SDEs). This is the mathematical backbone: vector fields, flows, Brownian motion, and the Euler method. Once we have this toolkit, Chapter 3 (Flow Matching) and Chapter 4 (Score Matching) show how to train these models.

The roadmap for the rest of this course:

Ch 1 (Done)

Generative Modeling = Sampling from p_data

↓

Ch 2: Flow & Diffusion

ODEs, SDEs, Euler method, flow & diffusion models

↓

Ch 3: Flow Matching

How to TRAIN the vector field (the core algorithm)

↓

Ch 4: Score Matching

Score functions, SDE sampling, denoising diffusion

↓

Ch 5-7: Advanced

Guidance, latent spaces, architectures, discrete diffusion

Other generative model families include:

Model Family	Core Idea	Strengths	Weaknesses
GANs	Generator vs. discriminator adversarial game	Fast sampling, sharp outputs	Training instability, mode collapse
VAEs	Encoder-decoder with latent space regularization	Principled latent space, fast	Blurry outputs
Autoregressive	Generate one element at a time (token, pixel)	Exact likelihoods, proven for text	Slow sequential sampling
Normalizing Flows	Learn invertible transformations	Exact densities, fast both ways	Architectural constraints for invertibility
Flow/Diffusion	ODE/SDE from noise to data	High quality, flexible, scalable	Multi-step sampling (slower than GANs)

Flow matching and diffusion models are the current state-of-the-art for continuous data (images, video, audio, molecules). Their combination of simplicity (the training loss is just MSE), scalability (standard mini-batch SGD), and quality (state-of-the-art FID scores) has made them the dominant approach since 2022.

A brief history of generative modeling milestones:

Year	Model	Key Idea
2014	GAN (Goodfellow)	Generator vs. discriminator adversarial game
2014	VAE (Kingma)	Variational inference with reparameterization
2015	Deep Diffusion (Sohl-Dickstein)	First diffusion model (thermodynamics-inspired)
2019	NCSN (Song)	Score matching with Langevin dynamics
2020	DDPM (Ho)	Denoising diffusion with noise prediction
2021	DALL-E, GLIDE	Text-to-image at scale (DALL-E autoregressive, GLIDE diffusion-based)
2022	Stable Diffusion, Flow Matching	Latent diffusion + CondOT training
2024	FLUX, SD3, Sora	DiT architecture, video generation
2024	AlphaFold3	Diffusion for protein structure prediction

The timeline shows a remarkable convergence: from research prototypes (2014-2019) to practical tools (2020-2022) to industry-defining products (2023-present). Within only a few years, flow matching and diffusion models went from academic papers to the engines behind widely deployed image and video generators.

Why flow and diffusion models won. Several factors contributed to their dominance over earlier approaches:

1. Training stability. Unlike GANs, there is no adversarial game. The loss is a simple MSE regression, which converges reliably without mode collapse or training instabilities.

2. Sample quality. Flow/diffusion models achieve the best FID scores on standard benchmarks, surpassing GANs, VAEs, and autoregressive models for continuous data.

3. Mode coverage. The model captures the full diversity of the data distribution, unlike GANs which often miss rare modes.

4. Scalability. The training algorithm scales cleanly to billion-parameter models with standard distributed training techniques.

5. Flexibility. The same model supports unconditional generation, text-to-image, inpainting, super-resolution, and more — often with minimal architectural changes.

6. Theoretical foundation. The mathematics of probability paths, continuity equations, and score functions provides a principled framework for analysis and improvement.

The one drawback: generation speed. A GAN generates a sample in one forward pass, whereas a flow or diffusion model typically needs 20-50, one per step of the ODE or SDE simulation. Research on distillation and consistency models is rapidly closing this gap, with some models now achieving near-GAN speeds at flow-model quality.
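The forward-pass count comes straight from how sampling works: each Euler step of the ODE calls the vector field once. The sketch below counts those calls. It is a toy under stated assumptions: `toy_vector_field` is a hypothetical stand-in for a trained network, the "data distribution" is a single target point, and `euler_sample` is an illustrative helper, not an API from any library.

```python
import random

def toy_vector_field(x, t, target):
    """Stand-in for a trained network u_t(x): velocity that reaches `target` at t=1."""
    return [(z_i - x_i) / (1.0 - t) for x_i, z_i in zip(x, target)]

def euler_sample(vector_field, x0, n_steps):
    """Simulate dx/dt = u_t(x) from t=0 to t=1 with n_steps explicit Euler steps."""
    h = 1.0 / n_steps
    x = list(x0)
    n_evals = 0
    for k in range(n_steps):
        t = k * h                      # largest t used is 1 - h, so 1 - t never hits 0
        v = vector_field(x, t)         # one "forward pass" per step
        n_evals += 1
        x = [x_i + h * v_i for x_i, v_i in zip(x, v)]
    return x, n_evals

rng = random.Random(0)
target = [2.0, -1.0]                                  # toy data point (assumption)
x0 = [rng.gauss(0.0, 1.0) for _ in target]            # start from Gaussian noise
sample, n_evals = euler_sample(
    lambda x, t: toy_vector_field(x, t, target), x0, n_steps=20
)
print(f"forward passes: {n_evals}, sample: {sample}")
```

With a real network, each call to `vector_field` is a full forward pass of a large model, so cutting `n_steps` is exactly what distillation and consistency methods aim for.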

Applications beyond image generation. Flow and diffusion models have found surprising applications far beyond generating pretty pictures:

Application	What is z?	What is y?	Example System
Image generation	Image (latent)	Text prompt	Stable Diffusion 3
Video generation	Video frames	Text + first frame	Sora, Movie Gen
Protein structure and design	3D atom coordinates	Sequence or desired function	AlphaFold3, RFdiffusion
Drug discovery (docking)	Ligand pose	Target protein	DiffDock
Audio synthesis	Audio waveform	Text description	AudioLDM
3D shape generation	3D point cloud	Text or image	Point-E, Shap-E
Robot policy	Action trajectory	Observation	Diffusion Policy, π₀
Weather forecasting	Weather state	Current state	GenCast (DeepMind)

The framework, "learn to convert noise into structured data", applies to any domain where the data can be represented as continuous vectors. This universality is why understanding flow matching is such a valuable skill: master the theory once, apply it everywhere.

What you will be able to do after this course:

1. Understand how Stable Diffusion, FLUX, Sora, and AlphaFold3 work, down to the mathematical foundations.

2. Implement a flow matching model from scratch in PyTorch (training + sampling).

3. Debug common issues in diffusion model training (loss curves, sample quality, noise schedules).

4. Choose between velocity prediction, score prediction, and noise prediction for your application.

5. Extend the framework to new data types (molecules, 3D, video) by understanding the core math.

6. Read the latest research papers on generative modeling with confidence, because the notation and concepts will already be familiar.

A final analogy. Generative modeling with flow matching is like learning to sculpt. The noise is the raw clay. The vector field is the sculptor's hands, applying pressure at each point in time to shape the material. The training process teaches the sculptor (neural network) how to shape clay (noise) into a specific form (data). Each generation starts with a new lump of clay (random noise) and follows the same sculpting procedure (ODE simulation) to produce a unique sculpture (sample). The sculptures all look "real" because the sculpting procedure was learned from thousands of real examples.

Required mathematical background. Due to the technical nature of differential equations, this course assumes some familiarity with:

1. Derivatives and integrals. You should know what df/dt means and be comfortable with the chain rule.

2. Probability basics. Random variables, expected value, Gaussian distribution, Bayes' rule.

3. Vectors and matrices. Vector addition, dot product, matrix-vector multiplication.

4. Python and PyTorch. Basic tensor operations, neural network training loops.

Do not worry if you are rusty — we will derive everything from scratch and motivate every formula. The Appendix of the book provides a probability refresher for those who need it. The key insight of this course is that the ideas are intuitive (convert noise to data by following a velocity field) even though the formalism requires some mathematical machinery.

How to read this course: Each chapter builds on the previous one. Read linearly. Play with every interactive simulation — the simulations are not decoration; they build intuition that the math confirms. Answer every quiz before moving on. If a concept does not click, re-read the worked examples. The goal is not to memorize formulas but to understand why each formula has the form it does. By the end of Chapter 4, you will have a complete understanding of the theory behind the most powerful generative AI systems in the world.

Chapter 1 takeaway: Generative modeling is mathematically equivalent to sampling from a probability distribution. We represent objects as vectors, define a distribution over them, collect a dataset of examples, and build a model that can produce new samples. Everything else is about how to build that model efficiently.

Summary of notation. We will use this notation consistently throughout all chapters:

Symbol	Meaning
d	Dimensionality of the data
z ∈ R^d	A data sample (image, protein, ...)
p_data	The data distribution (unknown, complex)
p_init	The initial/noise distribution, usually N(0, I)
p_t	Probability path at time t
u_t(x)	Vector field (velocity at position x, time t)
u_t^θ(x)	Neural network vector field with parameters θ
ψ_t	Flow (solution map of ODE)
W_t	Brownian motion
σ_t	Diffusion coefficient
α_t, β_t	Noise schedulers for the probability path
y	Conditioning variable (text prompt, class label, ...)
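As a preview of how these symbols fit together, the block below sketches the relationships in their most common form. It assumes the usual conventions that time runs from t = 0 (noise) to t = 1 (data) and that the schedulers satisfy α_0 = β_1 = 0 and α_1 = β_0 = 1; later chapters give the precise definitions, which may differ in detail.

```latex
\begin{aligned}
X_0 &\sim p_{\mathrm{init}}
  && \text{(start from noise)} \\
\frac{\mathrm{d}}{\mathrm{d}t} X_t &= u_t^{\theta}(X_t)
  && \text{(flow model: an ODE)} \\
X_t &= \psi_t(X_0)
  && \text{(flow: where the ODE carries } X_0 \text{ by time } t\text{)} \\
\mathrm{d}X_t &= u_t^{\theta}(X_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t
  && \text{(diffusion model: an SDE)} \\
p_t(x \mid z) &= \mathcal{N}\!\left(x;\ \alpha_t z,\ \beta_t^{2} I_d\right)
  && \text{(Gaussian probability path toward a data point } z\text{)} \\
X_1 &\sim p_{\mathrm{data}}
  && \text{(the goal of training)}
\end{aligned}
```

Setting σ_t = 0 in the SDE recovers the ODE, which is why flow models can be viewed as a special case of diffusion models.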

In flow matching and diffusion models, how does the generative model produce samples?

By searching the training set for the closest match
By training a discriminator to distinguish real from fake
By starting from noise and simulating a differential equation that converts it to data
