"Learn to denoise, and you learn to generate."
A diffusion model is a generative model that learns to reverse a gradual noising process. During training, we take a clean data sample and progressively add Gaussian noise over many timesteps until it becomes indistinguishable from pure random noise. We then train a neural network — typically a U-Net or a Diffusion Transformer (DiT) — to predict and undo that noise at every step.
At generation time, we start from pure noise and iteratively denoise, one step at a time, to produce a realistic sample. The model never sees the full generative process in one shot — it only learns to make things slightly less noisy. Stack a thousand tiny improvements, and you get an image.
The forward process destroys structure; the reverse process creates it. Click Play to watch the forward diffusion corrupt an image into noise, then the reverse process reconstruct it step by step.
Gradually add Gaussian noise according to a variance schedule beta_1 ... beta_T.
At each step t, we mix the previous sample with fresh noise.
By step T (typically 1000), the result is indistinguishable from N(0, I).
The key trick: we can jump directly to any timestep t using the cumulative product alpha_bar_t.
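Because a mix of Gaussians is Gaussian, the whole forward chain collapses into one closed-form step. A minimal NumPy sketch, using illustrative DDPM-style schedule values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear schedule (values in the original DDPM range).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # cumulative product alpha_bar_t

def q_sample(x0, t, eps):
    """Jump directly from x_0 to x_t in closed form — no loop over steps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((8, 8))        # stand-in for a clean image
eps = rng.standard_normal(x0.shape)
x_mid = q_sample(x0, 500, eps)          # noisy sample at t=500 in one shot
```

By t = T, alpha_bar_T is vanishingly small, so x_T is essentially pure noise — which is exactly why generation can start from N(0, I).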
The neural network learns epsilon_theta(x_t, t) — the total noise mixed into x_t.
Alternatively, it can predict the clean image x_0 directly, or a "velocity" v.
The predicted noise is subtracted (scaled appropriately) to get x_{t-1}.
Repeat T times: noise becomes data.
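One reverse step can be sketched as follows. The network call is replaced by random noise here (a stand-in, since no trained model is available in a snippet); the update is the DDPM posterior mean plus fresh noise:

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_step(x_t, t, eps_hat):
    """Compute x_{t-1} from x_t and the predicted noise."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                      # final step is deterministic
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z  # sigma_t^2 = beta_t variant

x_t = rng.standard_normal((8, 8))
eps_hat = rng.standard_normal(x_t.shape)  # stand-in for model(x_t, t)
x_prev = ddpm_step(x_t, 500, eps_hat)
```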
Linear: beta increases linearly from 0.0001 to 0.02. Simple, original DDPM.
Cosine: slower noise ramp, preserves structure longer — better for high-res.
Learned: the schedule itself is optimized during training. Used in improved DDPM and some LDM variants.
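The linear and cosine schedules are both a few lines of NumPy. A sketch (the cosine form follows Nichol & Dhariwal's construction, with their offset s = 0.008):

```python
import numpy as np

T = 1000

# Linear schedule (original DDPM): beta rises from 1e-4 to 0.02.
betas_linear = np.linspace(1e-4, 0.02, T)
alpha_bar_linear = np.cumprod(1.0 - betas_linear)

# Cosine schedule: defined through alpha_bar, then converted to per-step betas.
s = 0.008
steps = np.arange(T + 1) / T
f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
betas_cosine = np.clip(1.0 - f[1:] / f[:-1], 0.0, 0.999)
alpha_bar_cosine = np.cumprod(1.0 - betas_cosine)

# The cosine schedule destroys signal more slowly through the middle steps.
```

Comparing the two at the midpoint t = 500 shows the cosine curve retaining far more signal, which is exactly the "slower noise ramp" described above.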
Train the model to handle both conditional and unconditional generation. At inference, interpolate:
eps = eps_uncond + w * (eps_cond - eps_uncond).
Guidance scale w controls fidelity vs diversity. w=1 means no guidance;
w=7.5 is the sweet spot for Stable Diffusion. Higher values produce sharper but less diverse images.
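The interpolation is a one-liner; for w > 1 it actually extrapolates past the conditional prediction, which is where the fidelity boost comes from. A minimal sketch with toy vectors:

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 1.0])
eps_c = np.array([1.0, 1.0])

# w = 1 recovers the conditional prediction exactly (no guidance).
no_guidance = classifier_free_guidance(eps_u, eps_c, 1.0)

# w = 7.5 pushes well past it along the conditional direction.
guided = classifier_free_guidance(eps_u, eps_c, 7.5)
```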
Running diffusion at full pixel resolution is extremely expensive. Latent Diffusion Models (LDM) solve this by first encoding the image into a compressed latent space using a pretrained VAE (variational autoencoder), performing diffusion in that latent space (typically 8x smaller per spatial dimension), then decoding back to pixels. This is the core idea behind Stable Diffusion.
The pipeline: Image (512x512) is encoded by the VAE encoder into a Latent (64x64x4), diffusion runs entirely in latent space with the U-Net/DiT, and the VAE decoder maps back to pixel space. Text conditioning enters via cross-attention using a frozen CLIP text encoder.
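The savings follow directly from the shapes above:

```python
# Pixel space: 512 x 512 x 3 values; latent space: 64 x 64 x 4 values.
pixels = 512 * 512 * 3
latents = 64 * 64 * 4
ratio = pixels / latents   # the U-Net/DiT processes ~48x fewer values per step
```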
Drag the slider to see how guidance scale w affects the balance between diversity and prompt fidelity.
Low guidance gives blurry, diverse outputs. High guidance gives sharp, mode-collapsed results.
The schedule controls how quickly information is destroyed. Toggle each schedule to compare how
alpha_bar_t (the fraction of original signal retained; the signal-to-noise ratio is alpha_bar_t / (1 - alpha_bar_t)) decays over timesteps.
Diffusion models are most powerful when conditioned on auxiliary signals. The conditioning mechanism determines what the model generates and how controllable it is.
A frozen CLIP or T5 text encoder produces token embeddings from the prompt. These embeddings are injected into the U-Net/DiT via cross-attention layers — the noisy latent attends to the text tokens at every denoising step. SD3 and Flux use dual text encoders (CLIP + T5) for richer text understanding.
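The mechanics of that attention can be sketched in a few lines of NumPy. Dimensions and projection matrices here are toy stand-ins (real models use multi-head attention with learned weights):

```python
import numpy as np

rng = np.random.default_rng(2)

d = 16
latent = rng.standard_normal((64, d))     # 8x8 latent map flattened to 64 tokens
text = rng.standard_normal((77, 32))      # e.g. 77 CLIP token embeddings

# Projections (random stand-ins): queries come from the noisy latent,
# keys and values come from the text tokens.
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((32, d))
Wv = rng.standard_normal((32, d))

Q, K, V = latent @ Wq, text @ Wk, text @ Wv
scores = Q @ K.T / np.sqrt(d)             # (64, 77): latent attends to text
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)  # softmax over the 77 text tokens
out = attn @ V                            # each latent token mixes in text info
```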
Add spatial control without retraining the base model. ControlNet clones the encoder of the U-Net and injects its outputs via zero-convolution layers. Inputs: Canny edges, depth maps, pose skeletons, semantic segmentation. Enables precise structural control over generation.
Image Prompt Adapter — condition on a reference image instead of (or alongside) text. A lightweight projection network maps CLIP image embeddings into the cross-attention space. Decoupled cross-attention ensures text and image signals don't interfere. Enables style transfer, character consistency, and image variation.
Distill a pretrained diffusion model into one that can generate in a single step.
The consistency function f(x_t, t) maps any point on the diffusion trajectory directly
to x_0. Latent Consistency Models (LCM) achieve this for Stable Diffusion,
enabling 1-4 step generation with minimal quality loss.
Bridge between diffusion and flow matching. Instead of curved SDE trajectories, learn straight-line ODE paths from noise to data. "Reflow" iteratively straightens paths, enabling fewer sampling steps. SD3 and Flux are built on rectified flow.
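The straight-line idea fits in a few lines. In this sketch the velocity network is replaced by the exact straight-line velocity (an assumption, for illustration), so Euler integration from noise recovers the data point exactly:

```python
import numpy as np

rng = np.random.default_rng(3)

x0 = rng.standard_normal((8,))           # data sample
x1 = rng.standard_normal((8,))           # noise sample

def interpolate(t):
    """Straight-line path from data (t=0) to noise (t=1)."""
    return (1 - t) * x0 + t * x1

# Training target: the constant velocity along the line.
v_target = x1 - x0

# Sampling: Euler steps along the ODE dx/dt = v, integrated from t=1 to t=0.
x = x1.copy()
for t in np.linspace(1.0, 0.0, 11)[:-1]:
    x = x - v_target * 0.1
```

Because the path is straight and the velocity constant, the step count barely matters — which is why rectified flow models sample well with few steps.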
The theoretical foundation. A diffusion model implicitly learns the score function
nabla_x log p(x) — the gradient of the log-density. This connects diffusion to
Langevin dynamics, SDEs, and the broader framework of score-based generative modeling
(Song & Ermon, 2019).
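When the score is known exactly, Langevin dynamics alone can sample from the distribution. For a standard Gaussian the score is simply -x, which makes for a self-contained sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

def score(x):
    """Score of N(0, I): the gradient of log p(x) is -x."""
    return -x

# Langevin dynamics: gradient ascent on the log-density plus injected noise.
step = 0.01
x = rng.standard_normal((5000,)) * 10.0   # start far from the target
for _ in range(2000):
    x = x + step * score(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)

# The samples now approximate N(0, 1).
```

A score-based diffusion model plays the same game, except the score of the noised data distribution is estimated by the network at every noise level.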
Training a diffusion model is conceptually simple: sample a clean image, corrupt it to a random timestep, ask the network to undo the corruption, and minimize the error. Repeat billions of times.
Sample an (image, caption) pair from the dataset.
Encode the image into latent space: z_0 = E(x).
Sample a timestep t ~ Uniform(1, T).
Sample noise epsilon ~ N(0, I) and create the noisy latent: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * epsilon.
Encode the caption into a conditioning vector c.
Predict the noise: epsilon_hat = model(z_t, t, c).
Compute the loss L = MSE(epsilon_hat, epsilon). Backpropagate and update weights.
Generation reverses the noising process. Start from pure noise and iteratively denoise. The scheduler (sampler) determines how timesteps are traversed.
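The training recipe can be sketched end to end with a stand-in linear "model" and random data (all stand-ins; a real run uses a U-Net/DiT, a VAE encoder, and a text encoder):

```python
import numpy as np

rng = np.random.default_rng(5)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

W = np.zeros((16, 16))                      # stand-in for the denoiser's weights

def model(z_t, t):
    return z_t @ W                          # toy "network" (no conditioning here)

lr = 0.01
for _ in range(200):
    z0 = rng.standard_normal((16,))         # stand-in for an encoded latent z_0 = E(x)
    t = rng.integers(0, T)                  # t ~ Uniform
    eps = rng.standard_normal((16,))        # eps ~ N(0, I)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_hat = model(z_t, t)
    grad = np.outer(z_t, eps_hat - eps)     # gradient of the MSE loss w.r.t. W
    W -= lr * grad                          # SGD update
```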
Sample z_T ~ N(0, I) in latent space.
Encode the prompt into a conditioning vector c.
For t = T, T-1, ..., 1: predict the noise epsilon_hat = model(z_t, t, c), apply CFG, then compute z_{t-1} using the scheduler's update rule.
Decode the final latent: x = D(z_0) using the VAE decoder.
DDPM uses all 1000 steps (stochastic). DDIM skips steps deterministically (20-50 steps). DPM-Solver++ and Euler are faster ODE solvers. LCM needs just 1-4 steps. The choice of scheduler is the biggest knob for speed vs quality.
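DDIM's deterministic step-skipping update can be sketched as follows. The network is replaced by an oracle that returns the true noise (an assumption, so the trajectory is exact and 50 steps land back on the clean latent):

```python
import numpy as np

rng = np.random.default_rng(6)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Oracle stand-in: a real sampler would call model(z_t, t, c) here.
z0_true = rng.standard_normal((64,))
eps_true = rng.standard_normal((64,))
def model(z_t, t):
    return eps_true

timesteps = np.linspace(T - 1, 0, 50, dtype=int)    # 50 of the 1000 steps
z = np.sqrt(alpha_bar[T - 1]) * z0_true + np.sqrt(1 - alpha_bar[T - 1]) * eps_true
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    eps_hat = model(z, t)
    # Estimate the clean latent, then re-noise it to the earlier timestep (eta = 0).
    z0_hat = (z - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    z = np.sqrt(alpha_bar[t_prev]) * z0_hat + np.sqrt(1 - alpha_bar[t_prev]) * eps_hat
```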
The original open-source LDM. U-Net backbone, 512x512, CLIP text encoder. Still among the most fine-tuned models ever released.
Scaled-up SD with dual text encoders (CLIP-ViT-L + OpenCLIP-ViT-bigG), 1024x1024 native resolution, and a refiner model.
Rectified flow + DiT (MM-DiT) architecture. Triple text encoders (CLIP-L, CLIP-G, T5-XXL). State-of-the-art text rendering.
Integrated with ChatGPT. Trained on recaptioned data for strong prompt following. Not open-source, but wildly influential.
Pixel-space diffusion with a frozen T5-XXL text encoder. Cascaded pipeline: 64x64 base model + super-resolution stages.
Rectified flow transformer from the original Stable Diffusion creators. Best open model for text rendering and prompt adherence.