The engine behind Stable Diffusion, DALL-E, and modern image generation. Learn how neural networks learn to create by learning to denoise.
Generative modeling has a deceptively simple goal: given a dataset of images (faces, landscapes, cats), learn the underlying probability distribution p(x) and then draw new samples from it. A perfect generative model would produce images indistinguishable from real photographs.
Why is this hard? Because an image is a point in an astronomically high-dimensional space. A 512×512 RGB image lives in a space with 786,432 dimensions. The "real image" manifold is a tiny, twisted surface in that vast void. Random points are just static.
Each cell is a random pixel grid. Pure noise has no structure. Generation means learning to place every pixel in just the right spot.
The key insight of diffusion models: destruction is easy, creation is hard. The forward process takes a real image and gradually adds Gaussian noise over T steps (typically T=1000). At each step, the image gets a little noisier, until at step T it's indistinguishable from pure static.
Mathematically, at each step t we mix the current image with a little noise: q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I). The noise schedule β_t controls how fast the signal is destroyed.
Drag the slider to add noise. At t=0 you see the original signal. At t=1000 it's pure static.
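The forward process is short enough to write out directly. A minimal NumPy sketch, assuming the linear β schedule from the original DDPM paper (`forward_step` is an illustrative name, not a library API):

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    # q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule (the DDPM default)
x0 = rng.standard_normal(1024)          # stand-in for a flattened image
x = x0
for beta in betas:
    x = forward_step(x, beta, rng)
# after all 1000 steps, x is statistically indistinguishable from unit Gaussian noise
```

Note the fixed point: each step keeps the variance near 1 while shrinking the signal, so by step T almost nothing of x0 survives.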
If we could reverse the forward process — undo each noise step — we'd have a generative model! Start from pure noise x_T ~ N(0, I) and iteratively denoise to get a clean image x_0. The problem: the exact reverse q(x_{t-1} | x_t) requires knowing the full data distribution, which is what we're trying to learn.
Solution: train a neural network ε_θ(x_t, t) to approximate the reverse step. This network takes in a noisy image and the timestep, and predicts the noise that was added. Given the predicted noise, we can estimate the slightly-less-noisy image.
Training is surprisingly simple. For each training step: (1) pick a random image x_0 from the dataset, (2) pick a random timestep t, (3) sample noise ε ~ N(0,I), (4) create the noisy image x_t = √ᾱ_t x_0 + √(1-ᾱ_t) ε, and (5) train the network to predict ε from x_t and t.
That's it. Plain MSE loss between the true noise and the predicted noise. No adversarial training, no mode collapse, no training instability. This simplicity is a huge reason diffusion models won.
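The five steps translate almost line-for-line into code. A minimal NumPy sketch, assuming a linear β schedule and using a placeholder in place of a real network (`training_step` and `predict_noise` are illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative products give the closed form

def training_step(x0, predict_noise):
    t = int(rng.integers(T))                                  # (2) random timestep
    eps = rng.standard_normal(x0.shape)                       # (3) sample noise
    x_t = np.sqrt(alpha_bar[t]) * x0 \
        + np.sqrt(1 - alpha_bar[t]) * eps                     # (4) noisy image
    eps_hat = predict_noise(x_t, t)                           # (5) network's guess
    return float(np.mean((eps - eps_hat) ** 2))               # plain MSE loss

x0 = rng.standard_normal(64)   # stand-in for one training image
# a placeholder "network" that always predicts zero noise
loss = training_step(x0, predict_noise=lambda x_t, t: np.zeros_like(x_t))
```

In a real system `predict_noise` would be a U-Net or Transformer, and the loss would be backpropagated; everything else stays this simple.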
The theoretical foundation of diffusion models rests on three pillars. You don't need them to use diffusion models, but understanding them reveals why the simple training objective actually works.
We want to maximize log p(x0), but it's intractable. Instead we maximize a lower bound (the Evidence Lower Bound). The ELBO decomposes into a sum of KL divergences — one per timestep — each comparing the true reverse step to our learned approximation.
KL divergence measures how different two distributions are. Since both q and pθ are Gaussian, the KL has a closed form. It reduces to comparing means — which becomes the simple MSE loss.
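Because both Gaussians share the same fixed variance, the closed-form KL really is just a scaled squared distance between means. A small numerical illustration (`kl_gaussians` is a hypothetical helper and the means are made up):

```python
import numpy as np

def kl_gaussians(mu1, mu2, sigma):
    # KL( N(mu1, sigma^2 I) || N(mu2, sigma^2 I) ) for equal-variance
    # diagonal Gaussians collapses to a scaled squared error between means
    return float(np.sum((mu1 - mu2) ** 2) / (2.0 * sigma ** 2))

mu_true = np.array([0.5, -1.0, 2.0])   # mean of the true reverse step
mu_pred = np.array([0.4, -0.9, 2.1])   # mean predicted by the network
kl = kl_gaussians(mu_true, mu_pred, sigma=0.1)
# up to the 1/(2σ²) factor, this is exactly the MSE between the two means
```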
The score is ∇_x log p(x) — a vector pointing toward higher-density regions. Denoising is equivalent to estimating the score: ε_θ(x_t, t) ∝ -∇_x log p(x_t). This connection to score matching is why diffusion models are sometimes called "score-based generative models."
Arrows show ∇ log p(x), pointing toward the data distribution (two clusters). The score field guides sampling.
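To make the score concrete, here is the single-Gaussian special case, where it has a closed form, checked against a finite-difference gradient of log p (a toy sketch; μ and σ are arbitrary):

```python
import numpy as np

MU, SIGMA = 3.0, 1.0

def log_p(x):
    # log density of a 1D Gaussian N(MU, SIGMA^2)
    return -0.5 * ((x - MU) / SIGMA) ** 2 - np.log(SIGMA * np.sqrt(2 * np.pi))

def score(x):
    # closed-form score: grad_x log p(x) = (MU - x) / SIGMA^2
    return (MU - x) / SIGMA ** 2

x = 0.0                                             # a point far below the mean
s = score(x)                                        # positive: points toward MU = 3
fd = (log_p(x + 1e-5) - log_p(x - 1e-5)) / 2e-5     # finite-difference check
```

Following the score uphill moves a sample toward the data; the denoising network learns this vector field for every noise level at once.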
Once trained, we generate images by starting from noise and iteratively denoising. The original DDPM sampler uses all T=1000 steps — faithful to the theory but painfully slow (~1 minute per image).
DDIM (Denoising Diffusion Implicit Models) showed that the same trained network supports a non-Markovian formulation whose reverse process is deterministic, which lets you skip most of the steps. With just 50 steps, quality is nearly identical. DPM-Solver treats sampling as solving an ODE and uses higher-order methods (like Runge-Kutta) to achieve great quality in 10-25 steps.
| Sampler | Steps | Speed | Quality |
|---|---|---|---|
| DDPM | 1000 | Slow | Excellent |
| DDIM | 50 | Fast | Very good |
| DPM-Solver | 15-25 | Very fast | Excellent |
| DPM-Solver++ | 10-20 | Very fast | Excellent |
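A sketch of the deterministic DDIM update (η = 0), run with 50 of the 1000 timesteps. To keep it self-contained, the trained network is replaced by an oracle that knows a toy, single-point data distribution — `ddim_step` and `oracle_eps` are illustrative names, not a library API:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def ddim_step(x_t, eps_hat, t, t_prev):
    # deterministic DDIM update (eta = 0): estimate x_0, then re-noise to t_prev
    x0_pred = (x_t - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    return np.sqrt(alpha_bar[t_prev]) * x0_pred \
         + np.sqrt(1 - alpha_bar[t_prev]) * eps_hat

# toy setup: the "dataset" is a single point, so the ideal noise prediction is known
x0_true = np.full(16, 2.0)
def oracle_eps(x_t, t):
    return (x_t - np.sqrt(alpha_bar[t]) * x0_true) / np.sqrt(1 - alpha_bar[t])

x = rng.standard_normal(16)                          # start from pure noise
timesteps = np.linspace(T - 1, 0, 50).astype(int)    # 50 steps instead of 1000
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = ddim_step(x, oracle_eps(x, t), t, t_prev)
# with the oracle predictor, the 50-step trajectory lands on x0_true
```

The key point: because the update is deterministic, skipping from t to a much earlier t_prev is legal, which is exactly what buys the 20× speedup in the table above.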
Watch a 1D distribution emerge from noise. More steps = smoother convergence. Fewer steps = faster but noisier.
Diffusing directly in pixel space is expensive: a 512×512 image has ~786K dimensions. Latent Diffusion Models (LDMs) first encode the image into a compact latent space using a pretrained VAE (Variational Autoencoder), then run diffusion there.
The VAE encoder compresses the image by 8x in each spatial dimension: 512×512 → 64×64 latent. The diffusion model learns to denoise in this 64×64 space (much cheaper!), then the VAE decoder reconstructs the final image. This is exactly what Stable Diffusion does.
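The savings are easy to quantify. Assuming Stable Diffusion's 4-channel latent at 8× spatial downsampling:

```python
pixel_dims = 512 * 512 * 3        # diffusing in RGB pixel space
latent_dims = 64 * 64 * 4         # SD-style latent: 8x downsampled, 4 channels
ratio = pixel_dims / latent_dims  # how much smaller the diffusion space is
print(pixel_dims, latent_dims, ratio)   # 786432 16384 48.0
```

A 48× reduction in dimensionality, before even counting the quadratic cost of attention over spatial positions.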
Compare the computational cost. Each block represents a unit of work. Latent diffusion is dramatically cheaper.
Unconditional generation is impressive but rarely what we want. We want to say "a cat wearing a top hat" and get that image. Conditioning injects a text prompt into the denoising process.
The pipeline: (1) A text encoder (typically CLIP) converts the prompt into a sequence of token embeddings. (2) These embeddings are injected into the U-Net via cross-attention layers: the noisy image features attend to the text tokens. The network learns to denoise differently depending on the text.
During training, the text condition is randomly dropped (replaced with empty text) some percentage of the time (typically around 10%). At inference, we compute both the conditional and unconditional noise predictions, then amplify the difference: ε̂ = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) - ε_θ(x_t, ∅)).
The guidance scale w (typically 7-12) controls how strongly the model follows the prompt. Higher w = more adherence to text but less diversity and potential artifacts.
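The guidance computation itself is one line. A sketch with made-up 2D noise predictions (`cfg` is an illustrative name):

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    # classifier-free guidance: start from the unconditional prediction and
    # push w times along the direction the text condition suggests
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.0])   # prediction with the prompt dropped
eps_cond = np.array([1.0, -1.0])    # prediction with the prompt present
guided = cfg(eps_cond, eps_uncond, w=7.5)
# w = 1 recovers the plain conditional prediction; w > 1 amplifies the text signal
```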
See how classifier-free guidance amplifies the conditional signal. Low w = generic. High w = strongly steered (but may overshoot).
Text alone is often insufficient. You might want to specify a precise pose, edge map, or depth layout. ControlNet adds a parallel encoder that takes a spatial condition (like a Canny edge image) and injects it into the U-Net's skip connections.
The genius: the original Stable Diffusion weights are frozen. ControlNet trains a copy of the encoder that learns to translate spatial signals. This preserves the base model's quality while adding precise spatial control.
| Control Type | Input | What It Controls |
|---|---|---|
| Canny | Edge map | Outline / structure |
| Depth | Depth map | 3D layout, foreground/background |
| OpenPose | Skeleton keypoints | Human pose |
| Segmentation | Semantic map | Region content types |
| IP-Adapter | Reference image | Style and subject transfer |
Diffusion models have evolved rapidly. Here's a map of the landscape as of mid-2025:
| Model | Year | Key Innovation |
|---|---|---|
| DDPM | 2020 | Showed diffusion can match GANs |
| DALL-E 2 | 2022 | Diffusion prior over CLIP embeddings |
| Stable Diffusion 1.5 | 2022 | Open-source latent diffusion |
| SDXL | 2023 | Larger U-Net, dual text encoders, 1024px |
| DALL-E 3 | 2023 | Better text understanding via recaptioning |
| SD3 / SD3.5 | 2024 | MMDiT (Transformer replaces U-Net) + flow matching |
| Flux | 2024 | Rectified flow, DiT architecture, open weights |
A radical departure: instead of iterating T steps, learn to jump directly from any noisy x_t to x_0 in a single step. Consistency models (Song et al., 2023) enforce that all points on the same denoising trajectory map to the same output. The result: 1-2 step generation with quality approaching multi-step diffusion.
Major milestones in diffusion model development.
You now understand how diffusion models create. From noise to structure, one step at a time.