"Learn to denoise, and you learn to generate."
A diffusion model is a generative model that learns to reverse a gradual noising process. During training, we take a clean data sample and progressively add Gaussian noise over many timesteps until it becomes indistinguishable from pure random noise. We then train a neural network — typically a U-Net or a Diffusion Transformer (DiT) — to predict and undo that noise at every step.
At generation time, we start from pure noise and iteratively denoise, one step at a time, to produce a realistic sample. The model never sees the full generative process in one shot — it only learns to make things slightly less noisy. Stack a thousand tiny improvements, and you get an image.
The forward process destroys structure; the reverse process creates it. Click Play to watch the forward diffusion corrupt an image into noise, then the reverse process reconstruct it step by step.
Gradually add Gaussian noise according to a variance schedule beta_1 ... beta_T.
At each step t, we mix the previous sample with fresh noise.
By step T (typically 1000), the result is indistinguishable from N(0, I).
The key trick: we can jump directly to any timestep t using the cumulative product alpha_bar_t.
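Because a mix of Gaussians is Gaussian, the whole forward chain collapses into one closed-form step. A minimal NumPy sketch, using illustrative DDPM-style schedule values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear schedule (values in the original DDPM range).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # cumulative product alpha_bar_t

def q_sample(x0, t, eps):
    """Jump directly from x_0 to x_t in closed form — no loop over steps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal((8, 8))        # stand-in for a clean image
eps = rng.standard_normal(x0.shape)
x_mid = q_sample(x0, 500, eps)          # noisy sample at t=500 in one shot
```

By t = T, alpha_bar_T is vanishingly small, so x_T is essentially pure noise — which is exactly why generation can start from N(0, I).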
The neural network learns epsilon_theta(x_t, t) — the total noise mixed into x_t.
Alternatively, it can predict the clean image x_0 directly, or a "velocity" v.
The predicted noise is subtracted (scaled appropriately) to get x_{t-1}.
Repeat T times: noise becomes data.
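One reverse step can be sketched as follows. The network call is replaced by random noise here (a stand-in, since no trained model is available in a snippet); the update is the DDPM posterior mean plus fresh noise:

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_step(x_t, t, eps_hat):
    """Compute x_{t-1} from x_t and the predicted noise."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                      # final step is deterministic
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z  # sigma_t^2 = beta_t variant

x_t = rng.standard_normal((8, 8))
eps_hat = rng.standard_normal(x_t.shape)  # stand-in for model(x_t, t)
x_prev = ddpm_step(x_t, 500, eps_hat)
```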
Linear: beta increases linearly from 0.0001 to 0.02. Simple, original DDPM.
Cosine: slower noise ramp, preserves structure longer — better for high-res.
Learned: the schedule itself is optimized during training. Used in improved DDPM and some LDM variants.
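The linear and cosine schedules are both a few lines of NumPy. A sketch (the cosine form follows Nichol & Dhariwal's construction, with their offset s = 0.008):

```python
import numpy as np

T = 1000

# Linear schedule (original DDPM): beta rises from 1e-4 to 0.02.
betas_linear = np.linspace(1e-4, 0.02, T)
alpha_bar_linear = np.cumprod(1.0 - betas_linear)

# Cosine schedule: defined through alpha_bar, then converted to per-step betas.
s = 0.008
steps = np.arange(T + 1) / T
f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
betas_cosine = np.clip(1.0 - f[1:] / f[:-1], 0.0, 0.999)
alpha_bar_cosine = np.cumprod(1.0 - betas_cosine)

# The cosine schedule destroys signal more slowly through the middle steps.
```

Comparing the two at the midpoint t = 500 shows the cosine curve retaining far more signal, which is exactly the "slower noise ramp" described above.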
Train the model to handle both conditional and unconditional generation. At inference, interpolate:
eps = eps_uncond + w * (eps_cond - eps_uncond).
Guidance scale w controls fidelity vs diversity. w=1 means no guidance;
w=7.5 is the sweet spot for Stable Diffusion. Higher values produce sharper but less diverse images.
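The interpolation is a one-liner; for w > 1 it actually extrapolates past the conditional prediction, which is where the fidelity boost comes from. A minimal sketch with toy vectors:

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 1.0])
eps_c = np.array([1.0, 1.0])

# w = 1 recovers the conditional prediction exactly (no guidance).
no_guidance = classifier_free_guidance(eps_u, eps_c, 1.0)

# w = 7.5 pushes well past it along the conditional direction.
guided = classifier_free_guidance(eps_u, eps_c, 7.5)
```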
Running diffusion at full pixel resolution is extremely expensive. Latent Diffusion Models (LDM) solve this by first encoding the image into a compressed latent space using a pretrained VAE (variational autoencoder), performing diffusion in that latent space (typically 8x smaller per spatial dimension), then decoding back to pixels. This is the core idea behind Stable Diffusion.
The pipeline: Image (512x512) is encoded by the VAE encoder into a Latent (64x64x4), diffusion runs entirely in latent space with the U-Net/DiT, and the VAE decoder maps back to pixel space. Text conditioning enters via cross-attention using a frozen CLIP text encoder.
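The savings follow directly from the shapes above:

```python
# Pixel space: 512 x 512 x 3 values; latent space: 64 x 64 x 4 values.
pixels = 512 * 512 * 3
latents = 64 * 64 * 4
ratio = pixels / latents   # the U-Net/DiT processes ~48x fewer values per step
```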
Drag the slider to see how guidance scale w affects the balance between diversity and prompt fidelity.
Low guidance gives blurry, diverse outputs. High guidance gives sharp, mode-collapsed results.
The schedule controls how quickly information is destroyed. Toggle each schedule to compare how
alpha_bar_t (the fraction of original signal retained; the signal-to-noise ratio is alpha_bar_t / (1 - alpha_bar_t)) decays over timesteps.
Diffusion models are most powerful when conditioned on auxiliary signals. The conditioning mechanism determines what the model generates and how controllable it is.
A frozen CLIP or T5 text encoder produces token embeddings from the prompt. These embeddings are injected into the U-Net/DiT via cross-attention layers — the noisy latent attends to the text tokens at every denoising step. SD3 and Flux use dual text encoders (CLIP + T5) for richer text understanding.
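The mechanics of that attention can be sketched in a few lines of NumPy. Dimensions and projection matrices here are toy stand-ins (real models use multi-head attention with learned weights):

```python
import numpy as np

rng = np.random.default_rng(2)

d = 16
latent = rng.standard_normal((64, d))     # 8x8 latent map flattened to 64 tokens
text = rng.standard_normal((77, 32))      # e.g. 77 CLIP token embeddings

# Projections (random stand-ins): queries come from the noisy latent,
# keys and values come from the text tokens.
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((32, d))
Wv = rng.standard_normal((32, d))

Q, K, V = latent @ Wq, text @ Wk, text @ Wv
scores = Q @ K.T / np.sqrt(d)             # (64, 77): latent attends to text
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)  # softmax over the 77 text tokens
out = attn @ V                            # each latent token mixes in text info
```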
Add spatial control without retraining the base model. ControlNet clones the encoder of the U-Net and injects its outputs via zero-convolution layers. Inputs: Canny edges, depth maps, pose skeletons, semantic segmentation. Enables precise structural control over generation.
Image Prompt Adapter — condition on a reference image instead of (or alongside) text. A lightweight projection network maps CLIP image embeddings into the cross-attention space. Decoupled cross-attention ensures text and image signals don't interfere. Enables style transfer, character consistency, and image variation.
Distill a pretrained diffusion model into one that can generate in a single step.
The consistency function f(x_t, t) maps any point on the diffusion trajectory directly
to x_0. Latent Consistency Models (LCM) achieve this for Stable Diffusion,
enabling 1-4 step generation with minimal quality loss.
Bridge between diffusion and flow matching. Instead of curved SDE trajectories, learn straight-line ODE paths from noise to data. "Reflow" iteratively straightens paths, enabling fewer sampling steps. SD3 and Flux are built on rectified flow.
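The straight-line idea fits in a few lines. In this sketch the velocity network is replaced by the exact straight-line velocity (an assumption, for illustration), so Euler integration from noise recovers the data point exactly:

```python
import numpy as np

rng = np.random.default_rng(3)

x0 = rng.standard_normal((8,))           # data sample
x1 = rng.standard_normal((8,))           # noise sample

def interpolate(t):
    """Straight-line path from data (t=0) to noise (t=1)."""
    return (1 - t) * x0 + t * x1

# Training target: the constant velocity along the line.
v_target = x1 - x0

# Sampling: Euler steps along the ODE dx/dt = v, integrated from t=1 to t=0.
x = x1.copy()
for t in np.linspace(1.0, 0.0, 11)[:-1]:
    x = x - v_target * 0.1
```

Because the path is straight and the velocity constant, the step count barely matters — which is why rectified flow models sample well with few steps.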
The theoretical foundation. A diffusion model implicitly learns the score function
nabla_x log p(x) — the gradient of the log-density. This connects diffusion to
Langevin dynamics, SDEs, and the broader framework of score-based generative modeling
(Song & Ermon, 2019).
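When the score is known exactly, Langevin dynamics alone can sample from the distribution. For a standard Gaussian the score is simply -x, which makes for a self-contained sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

def score(x):
    """Score of N(0, I): the gradient of log p(x) is -x."""
    return -x

# Langevin dynamics: gradient ascent on the log-density plus injected noise.
step = 0.01
x = rng.standard_normal((5000,)) * 10.0   # start far from the target
for _ in range(2000):
    x = x + step * score(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)

# The samples now approximate N(0, 1).
```

A score-based diffusion model plays the same game, except the score of the noised data distribution is estimated by the network at every noise level.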
Training a diffusion model is conceptually simple: sample a clean image, corrupt it to a random timestep, ask the network to undo the corruption, and minimize the error. Repeat billions of times.
Sample an (image, caption) pair from the dataset.
Encode the image into latent space: z_0 = E(x).
Sample a timestep t ~ Uniform(1, T).
Sample noise epsilon ~ N(0, I) and create the noisy latent: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * epsilon.
Encode the caption into a conditioning vector c.
Predict the noise: epsilon_hat = model(z_t, t, c).
Compute the loss L = MSE(epsilon_hat, epsilon). Backpropagate and update weights.
Generation reverses the noising process. Start from pure noise and iteratively denoise. The scheduler (sampler) determines how timesteps are traversed.
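The training recipe can be sketched end to end with a stand-in linear "model" and random data (all stand-ins; a real run uses a U-Net/DiT, a VAE encoder, and a text encoder):

```python
import numpy as np

rng = np.random.default_rng(5)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

W = np.zeros((16, 16))                      # stand-in for the denoiser's weights

def model(z_t, t):
    return z_t @ W                          # toy "network" (no conditioning here)

lr = 0.01
for _ in range(200):
    z0 = rng.standard_normal((16,))         # stand-in for an encoded latent z_0 = E(x)
    t = rng.integers(0, T)                  # t ~ Uniform
    eps = rng.standard_normal((16,))        # eps ~ N(0, I)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_hat = model(z_t, t)
    grad = np.outer(z_t, eps_hat - eps)     # gradient of the MSE loss w.r.t. W
    W -= lr * grad                          # SGD update
```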
Sample z_T ~ N(0, I) in latent space.
Encode the prompt into a conditioning vector c.
For t = T, T-1, ..., 1: predict the noise epsilon_hat = model(z_t, t, c), apply CFG, then compute z_{t-1} using the scheduler's update rule.
Decode the final latent: x = D(z_0) using the VAE decoder.
DDPM uses all 1000 steps (stochastic). DDIM skips steps deterministically (20-50 steps). DPM-Solver++ and Euler are faster ODE solvers. LCM needs just 1-4 steps. The choice of scheduler is the biggest knob for speed vs quality.
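DDIM's deterministic step-skipping update can be sketched as follows. The network is replaced by an oracle that returns the true noise (an assumption, so the trajectory is exact and 50 steps land back on the clean latent):

```python
import numpy as np

rng = np.random.default_rng(6)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Oracle stand-in: a real sampler would call model(z_t, t, c) here.
z0_true = rng.standard_normal((64,))
eps_true = rng.standard_normal((64,))
def model(z_t, t):
    return eps_true

timesteps = np.linspace(T - 1, 0, 50, dtype=int)    # 50 of the 1000 steps
z = np.sqrt(alpha_bar[T - 1]) * z0_true + np.sqrt(1 - alpha_bar[T - 1]) * eps_true
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    eps_hat = model(z, t)
    # Estimate the clean latent, then re-noise it to the earlier timestep (eta = 0).
    z0_hat = (z - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    z = np.sqrt(alpha_bar[t_prev]) * z0_hat + np.sqrt(1 - alpha_bar[t_prev]) * eps_hat
```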
The original open-source LDM. U-Net backbone, 512x512, CLIP text encoder. Still among the most fine-tuned models ever released.
Scaled-up SD with dual text encoders (CLIP-ViT-L + OpenCLIP-ViT-bigG), 1024x1024 native resolution, and a refiner model.
Rectified flow + DiT (MM-DiT) architecture. Triple text encoders (CLIP-L, CLIP-G, T5-XXL). State-of-the-art text rendering.
Integrated with ChatGPT. Trained on recaptioned data for strong prompt following. Not open-source, but wildly influential.
Pixel-space diffusion with a frozen T5-XXL text encoder. Cascaded pipeline: 64x64 base model + super-resolution stages.
Rectified flow transformer from the original Stable Diffusion creators. Best open model for text rendering and prompt adherence.