Introduction

Everything we have built so far — the forward process, the denoising objective, score matching, SDEs, flow matching — ultimately distills to a single requirement: we need a neural network εθ(xt, t) (or equivalently a score network, or a velocity field) that can predict noise from a corrupted input at any timestep. The entire theoretical edifice is only as powerful as this network's capacity to learn the denoising function.

This article is about the practical choices that make diffusion models work. The neural network is where all the capacity lives — where the model encodes its understanding of what a face looks like, how light scatters through clouds, or what makes a dog distinct from a wolf. We will trace the evolution of diffusion architectures from the original U-Net design through the modern Diffusion Transformer (DiT), and explore how conditioning — on timestep, class label, or free-form text — steers the generative process toward desired outputs.

The story has two interleaved threads: architecture (what goes inside the network) and conditioning (how we inject external information to control generation). Both are essential. A powerful architecture without good conditioning produces beautiful but uncontrollable images. Perfect conditioning on a weak backbone produces controllable but mediocre results.

ℹ Prerequisites

This article assumes familiarity with the DDPM training objective (Article 02) and the score function (Article 03). You should be comfortable with the idea that a neural network takes in a noisy image xt and a timestep t, and predicts either the noise ε, the score ∇ log pt(xt), or the velocity vt. The specific prediction target doesn't change the architecture discussion.

U-Net Architecture

The U-Net (Ronneberger et al., 2015) was originally designed for biomedical image segmentation, but it became the de facto backbone for diffusion models thanks to Ho et al. (2020). Its signature feature is the encoder-decoder structure with skip connections — a design perfectly suited for denoising, where the network must preserve fine spatial details while also reasoning about global structure.

The architecture has three parts:

  • Encoder (downsampling path): A sequence of blocks that progressively reduce spatial resolution while increasing channel depth. A typical chain might go 64 → 128 → 256 → 512 channels, with resolution halving at each stage via strided convolutions or average pooling. Each resolution level contains one or more ResNet blocks.
  • Bottleneck: At the lowest resolution (e.g., 8×8 for a 256×256 input), the network processes features with the highest channel count. This is where self-attention is most computationally tractable and most impactful, capturing global relationships across the entire image.
  • Decoder (upsampling path): Mirrors the encoder, progressively increasing resolution and decreasing channels. Each decoder block receives a skip connection from the corresponding encoder block — the feature maps are concatenated along the channel dimension before the decoder block processes them.

The skip connections are the critical design choice. Without them, fine-grained spatial information would be lost during the compression through the bottleneck. With them, the decoder has direct access to high-resolution features from the encoder, allowing it to reconstruct precise details. For denoising, this is essential: the network must output an image at the same resolution as its input, with pixel-level precision.
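The encoder-skip-decoder data flow above can be sketched shape-wise in a few lines of NumPy. The `downsample`, `upsample`, and concatenation below are illustrative stand-ins for the learned blocks, not an actual implementation:

```python
import numpy as np

def downsample(x):
    """Halve spatial resolution with 2x2 average pooling (NCHW layout)."""
    n, c, h, w = x.shape
    return x.reshape(n, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def upsample(x):
    """Double spatial resolution by nearest-neighbor repetition."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

x = np.random.randn(1, 64, 32, 32)               # encoder features at 32x32
skip = x                                          # saved for the decoder
bottleneck = downsample(x)                        # (1, 64, 16, 16)
up = upsample(bottleneck)                         # back to (1, 64, 32, 32)
decoder_in = np.concatenate([up, skip], axis=1)   # skip concat: (1, 128, 32, 32)
```

Note that the skip doubles the decoder's input channels, which is why decoder blocks in real U-Nets are built to accept the concatenated width.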

ResNet Blocks & Time Embedding

Each block within the U-Net is typically a ResNet block with a residual connection: the input is added to the output of two convolutional layers. Between the convolutions, the timestep embedding is injected (more on this in the next section). The standard structure is:

h = Conv(SiLU(GroupNorm(x))) → h = h + Proj(temb) → h = Conv(SiLU(GroupNorm(h))) → output = h + x

GroupNorm replaces BatchNorm because diffusion training uses small batch sizes (the images are large and the models are deep). The SiLU activation (x · σ(x), also called Swish) provides smooth gradients and has become standard in modern diffusion architectures.
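A minimal NumPy sketch of this block, using 1×1 channel-mixing matrices as stand-ins for the 3×3 convolutions (an assumption made to keep the example short; the weight names are illustrative):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def group_norm(x, groups=8, eps=1e-5):
    """Normalize a (N, C, H, W) tensor within channel groups."""
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

def resnet_block(x, temb, w1, w2, w_t):
    """GroupNorm -> SiLU -> conv, timestep injection, second
    GroupNorm -> SiLU -> conv, then the residual connection."""
    h = np.einsum('oc,nchw->nohw', w1, silu(group_norm(x)))
    h = h + (w_t @ temb)[None, :, None, None]   # per-channel timestep bias
    h = np.einsum('oc,nchw->nohw', w2, silu(group_norm(h)))
    return x + h                                 # residual connection

C, D = 16, 32
x = np.random.randn(2, C, 8, 8)
temb = np.random.randn(D)
w1 = np.random.randn(C, C) * 0.1
w2 = np.random.randn(C, C) * 0.1
w_t = np.random.randn(C, D) * 0.1
out = resnet_block(x, temb, w1, w2, w_t)
```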

U-Net Architecture Interactive

Encoder-decoder with skip connections. Hover over any block to see its dimensions and role. The blue path is the encoder (downsampling), yellow is the bottleneck, and green is the decoder (upsampling).

Time Conditioning

A diffusion model must behave differently at every timestep. At high noise levels (large t), the network should focus on recovering global structure — rough shapes, major color regions. At low noise levels (small t), it should refine fine details — textures, edges, subtle shading. The timestep t must therefore be communicated to the network in a way that enables smooth, expressive modulation of its behavior across the entire noise spectrum.

The standard approach, borrowed from the positional encodings in the Transformer (Vaswani et al., 2017), uses sinusoidal embeddings:

PE(t, 2i) = sin(t / 10000^(2i/d)),    PE(t, 2i+1) = cos(t / 10000^(2i/d))

This produces a d-dimensional vector for each scalar timestep t. The different frequencies ensure that nearby timesteps have similar embeddings while distant ones are well-separated — exactly the inductive bias we want. The sinusoidal embedding is then passed through a small MLP (typically two linear layers with SiLU activation) to produce a timestep embedding vector temb ∈ ℝd.
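A sketch of the embedding computation, following the common convention of splitting the dimension into a sin half and a cos half (one of several equivalent layouts):

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """d-dimensional sinusoidal embedding of a scalar timestep t."""
    half = dim // 2
    # frequencies 1 / max_period^(i/half) for i = 0 .. half-1
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

e10, e11, e900 = (timestep_embedding(t, 128) for t in (10, 11, 900))
# nearby timesteps get similar embeddings, distant ones are well-separated
near = np.linalg.norm(e10 - e11)
far = np.linalg.norm(e10 - e900)
```

In a full model, the output of `timestep_embedding` would then pass through the small MLP described above to produce temb.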

AdaGN & FiLM Conditioning

How do we actually inject temb into the network? The simplest approach is additive conditioning: project temb to match the channel dimension and add it to the feature map after the first convolution. This works but is limited — it can shift activations but not rescale them.

A more powerful approach is Adaptive Group Normalization (AdaGN), also known as FiLM conditioning (Feature-wise Linear Modulation, Perez et al., 2018). Instead of learning fixed normalization parameters, the network predicts them from the timestep:

AdaGN(h, t) = γt · GroupNorm(h) + βt

where γt = Wγ temb and βt = Wβ temb are linear projections of the timestep embedding. This gives the timestep control over both the scale and shift of every feature channel — a much richer form of conditioning.

🔢 Why scale-and-shift works so well

FiLM conditioning is powerful because it modulates the information flow through the network. Setting γ close to zero effectively gates off a feature channel; setting it large amplifies it. This allows the timestep to dynamically reconfigure which features the network uses at each noise level — coarse structure features at high noise, fine detail features at low noise — without any explicit architectural switching.
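A small NumPy sketch of AdaGN; the projection matrices `w_gamma` and `w_beta` are illustrative placeholders for the learned linear layers:

```python
import numpy as np

def group_norm(x, groups=4, eps=1e-5):
    """Normalize a (N, C, H, W) tensor within channel groups."""
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

def adagn(h, temb, w_gamma, w_beta):
    """Timestep-dependent scale and shift per channel (FiLM)."""
    gamma = w_gamma @ temb   # (C,)
    beta = w_beta @ temb     # (C,)
    return gamma[None, :, None, None] * group_norm(h) + beta[None, :, None, None]

C, D = 8, 16
h = np.random.randn(2, C, 4, 4)
temb = np.random.randn(D)
out = adagn(h, temb, np.random.randn(C, D) * 0.1, np.random.randn(C, D) * 0.1)
# with gamma and beta projected to zero, the features are fully gated off
gated = adagn(h, temb, np.zeros((C, D)), np.zeros((C, D)))
```

The `gated` case makes the gating argument from the box above concrete: a zero scale erases a channel's contribution entirely.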

Self-Attention in U-Nets

Convolutions are local operations — a 3×3 kernel sees only a tiny spatial neighborhood. For diffusion models, this locality is a problem: at high noise levels, the model needs to coordinate across the entire image to establish global structure. Self-attention solves this by allowing every spatial position to attend to every other position, capturing long-range dependencies in a single layer.

However, self-attention has quadratic cost in the number of spatial positions: for a feature map of size H×W, the attention matrix is (HW)×(HW). At full resolution (256×256 = 65,536 positions), this is prohibitively expensive. The standard solution is to apply self-attention only at lower-resolution feature maps — typically at 16×16 or 32×32 — where the quadratic cost is manageable (256 or 1024 positions).

In the U-Net, attention blocks are interleaved with ResNet blocks at the lower resolution levels. A typical configuration applies self-attention at 16×16 and 32×32 but not at 64×64 or higher. The attention is multi-head, with the feature map reshaped so that spatial positions become the sequence dimension:

Q = WQ h,   K = WK h,   V = WV h
Attention(Q, K, V) = softmax(Q Kᵀ / √dk) V

where h ∈ ℝ^((H·W) × C) is the flattened feature map. Each spatial position attends to all others, enabling the network to learn relationships like "this region is a shadow cast by that object" or "these two eyes should be symmetric" — relationships that span far beyond any convolutional receptive field.

The combination of convolutions (efficient local processing) and attention (powerful global reasoning) at strategic resolution levels gives the U-Net its remarkable capacity for denoising.
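The flatten-attend-reshape pattern can be sketched as follows (single-head with illustrative weight matrices; real implementations are multi-head and batched):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(fmap, wq, wk, wv):
    """Single-head self-attention over a (C, H, W) feature map:
    spatial positions become the sequence dimension."""
    c, hh, ww = fmap.shape
    h = fmap.reshape(c, hh * ww).T                   # (H*W, C): one token per position
    q, k, v = h @ wq, h @ wk, h @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (H*W, H*W) attention matrix
    return (attn @ v).T.reshape(c, hh, ww)           # back to a spatial map

C = 8
fmap = np.random.randn(C, 16, 16)                    # 16x16: 256 positions
wq, wk, wv = (np.random.randn(C, C) * 0.1 for _ in range(3))
out = spatial_self_attention(fmap, wq, wk, wv)
```

The `(H*W, H*W)` attention matrix is exactly the quadratic cost discussed above, which is why this is applied at 16×16 rather than full resolution.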

Diffusion Transformers (DiT)

The success of self-attention in U-Nets raised a natural question: why not replace the convolutional backbone entirely with a Transformer? Peebles & Xie (2023) answered this with the Diffusion Transformer (DiT), demonstrating that a pure Transformer architecture could match and exceed the U-Net on class-conditional ImageNet generation.

The DiT architecture is elegantly simple:

  1. Patchify: The input image (or noisy latent) is divided into non-overlapping patches (e.g., 2×2 or 4×4), each linearly embedded into a token vector. A 256×256 image with 4×4 patches yields a sequence of 4,096 tokens — feasible for modern Transformers.
  2. Positional embedding: Standard learnable or sinusoidal position embeddings are added to distinguish spatial locations.
  3. Transformer blocks: A stack of standard Transformer blocks with multi-head self-attention, feed-forward networks (MLPs), and layer normalization. The key innovation is how conditioning is injected.
  4. Unpatchify: The final token sequence is reshaped back into a spatial grid and linearly projected to the output channels (predicting noise, score, or velocity).
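Patchify and unpatchify are pure reshapes with no learned parameters beyond the linear embedding; a sketch for a single image, assuming channel-first layout:

```python
import numpy as np

def patchify(img, p):
    """Split a (C, H, W) image into a sequence of flattened p x p patches."""
    c, h, w = img.shape
    x = img.reshape(c, h // p, p, w // p, p)
    return x.transpose(1, 3, 0, 2, 4).reshape((h // p) * (w // p), c * p * p)

def unpatchify(tokens, c, h, w, p):
    """Inverse of patchify: token sequence back to a (C, H, W) grid."""
    x = tokens.reshape(h // p, w // p, c, p, p)
    return x.transpose(2, 0, 3, 1, 4).reshape(c, h, w)

img = np.random.randn(4, 32, 32)          # e.g. a 4-channel 32x32 latent
tokens = patchify(img, 2)                  # (256, 16): 16x16 grid of patches
recon = unpatchify(tokens, 4, 32, 32, 2)   # round-trips exactly
```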

adaLN-Zero Conditioning

The critical design choice in DiT is adaLN-Zero — an adaptive layer normalization scheme where the timestep and class embeddings modulate the scale, shift, and a per-layer gating parameter:

adaLN-Zero(h, c) = h + αc · Block(γc · LayerNorm(h) + βc)

where Block is the attention or MLP sub-layer and c combines the timestep and class embeddings.

The extra gating parameter αc is initialized to zero at the start of training, meaning each Transformer block initially acts as an identity function. This zero-initialization is crucial for training stability — it allows the model to gradually learn to use each block rather than being hit with random transformations from the start.
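The identity-at-initialization property is easy to demonstrate; the `w_mlp` stand-in below abstracts the attention/MLP sub-layer, and the parameter names are illustrative:

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def dit_block(h, cond, params):
    """One residual branch with adaLN-Zero gating."""
    gamma = params['w_gamma'] @ cond
    beta = params['w_beta'] @ cond
    alpha = params['w_alpha'] @ cond           # zero at initialization
    branch = layer_norm(h) * gamma + beta      # adaptive scale and shift
    branch = branch @ params['w_mlp']          # stand-in for attention/MLP
    return h + alpha * branch                  # gated residual

d, dc = 16, 8
h = np.random.randn(10, d)
cond = np.random.randn(dc)
params = {
    'w_gamma': np.random.randn(d, dc) * 0.1,
    'w_beta': np.random.randn(d, dc) * 0.1,
    'w_alpha': np.zeros((d, dc)),              # the "Zero" in adaLN-Zero
    'w_mlp': np.random.randn(d, d) * 0.1,
}
out = dit_block(h, cond, params)               # equals h exactly at init
```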

DiT demonstrated a clean scaling law: FID scores improved smoothly with increased model size and training compute, following the same predictable scaling behavior observed in language models. This was a landmark finding — it suggested that diffusion models could benefit from the same "just make it bigger" recipe that transformed NLP.

| Property | U-Net | DiT |
| --- | --- | --- |
| Core operation | Convolution + sparse attention | Full self-attention on patches |
| Spatial structure | Multi-scale (explicit down/up) | Single-scale (patchified) |
| Skip connections | Encoder → decoder skips | Residual within each block |
| Time conditioning | AdaGN / additive | adaLN-Zero |
| Scaling | Architecture-specific tuning | Clean compute scaling laws |
| Inductive bias | Strong spatial (convolutions) | Weak (learned from data) |
| Key models | DDPM, Stable Diffusion 1/2 | DiT, SD3, Flux, Sora |

Classifier Guidance

Training a powerful denoising network is only half the battle. We also need to control what the model generates. The first breakthrough in guided diffusion came from Dhariwal & Nichol (2021), who showed that a separately trained classifier could steer the sampling process toward a desired class.

The idea is mathematically elegant. Recall that the score function ∇x log pt(x) points toward regions of higher data density. If we want to generate images of class y, we want to sample from the conditional distribution pt(x | y). By Bayes' rule:

∇x log pt(x | y) = ∇x log pt(x) + ∇x log pt(y | x)

The first term is the unconditional score — what our diffusion model already provides. The second term is the gradient of a classifier's log-probability with respect to the noisy input. We train a noise-aware classifier pφ(y | xt) on noisy images, then at sampling time, we modify the score:

ε̃θ(xt, t, y) = εθ(xt, t) - √(1 - ᾱt) · s · ∇x log pφ(y | xt)

The guidance scale s controls how strongly the classifier steers generation. Higher s produces images that are more recognizably of class y, at the cost of reduced diversity. This tradeoff between fidelity (how well the image matches the condition) and diversity (how varied the samples are) is fundamental to all guidance methods.
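A toy sketch of the guided prediction. The Gaussian "classifier" with an analytic gradient is a stand-in for backpropagating through a real noise-aware classifier network; all names and values here are illustrative:

```python
import numpy as np

def classifier_grad(x, class_mean):
    """grad_x log N(x; class_mean, I): toy stand-in for a classifier gradient."""
    return class_mean - x

def guided_eps(eps, x, class_mean, alpha_bar, s):
    # eps_tilde = eps - sqrt(1 - alpha_bar) * s * grad_x log p(y | x)
    return eps - np.sqrt(1.0 - alpha_bar) * s * classifier_grad(x, class_mean)

x = np.zeros(4)
eps = np.random.randn(4)
out = guided_eps(eps, x, class_mean=np.ones(4), alpha_bar=0.5, s=2.0)
shift = out - eps       # constant pull toward the class mean, scaled by s
```

Increasing `s` scales `shift` linearly, which is the fidelity-diversity knob described above.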

Classifier guidance produced a landmark result: for the first time, diffusion models beat GANs on ImageNet generation (FID 4.59 vs. BigGAN's 6.95). But it has a major practical limitation — it requires training a separate classifier on noisy images, which is expensive and limits the types of conditioning that can be applied.

Classifier-Free Guidance

Ho & Salimans (2022) proposed a brilliantly simple alternative: instead of training a separate classifier, train the diffusion model itself to be both conditional and unconditional. During training, the conditioning signal c (class label, text embedding, etc.) is randomly dropped with some probability (typically 10-20%), replaced by a null token ∅. This means the same network learns both:

  • εθ(xt, t, c) — the conditional prediction
  • εθ(xt, t, ∅) — the unconditional prediction

At sampling time, the two predictions are combined:

ε̃ = εθ(xt, t, ∅) + w · (εθ(xt, t, c) - εθ(xt, t, ∅))

When w = 1, this reduces to standard conditional generation. When w > 1, the model extrapolates in the direction of the conditioning — moving further away from the unconditional prediction than a purely conditional model would. This amplifies the influence of the conditioning signal, producing images that more strongly match the desired condition.

The guidance weight w (often called the "CFG scale") is the single most important hyperparameter in practical diffusion sampling. Typical values range from 3 to 15 depending on the application. Too low and the images are diverse but may not match the prompt. Too high and the images become oversaturated, artifact-ridden caricatures of the prompt.
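The combination rule itself is a one-liner; the prediction values below are illustrative:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """eps_tilde = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])       # unconditional prediction
eps_c = np.array([1.0, -1.0])      # conditional prediction
at_1 = cfg_combine(eps_u, eps_c, 1.0)    # w = 1 recovers the conditional
at_75 = cfg_combine(eps_u, eps_c, 7.5)   # w > 1 extrapolates well past it
```

Note that both predictions come from the same network, so every sampling step under CFG costs two forward passes (or one batched pass over the two conditions).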

💡 Why classifier-free guidance works so well

The formula ε̃ = ε + w(εc − ε) can be interpreted geometrically: the difference (εc − ε) points in the direction that conditioning "wants to push" the prediction. Multiplying by w > 1 amplifies this push. Equivalently, it implicitly raises the classifier's log-probability to a power w, sharpening the conditional distribution. The model is using its own internal "classifier" — the difference between its conditional and unconditional predictions — rather than relying on an external one.

Classifier-Free Guidance Effect Interactive

Adjust the guidance weight w. At w=1, the model samples from the learned conditional. Higher w sharpens the distribution toward the condition but reduces diversity.


Text Conditioning & Cross-Attention

Class-conditional generation is useful for benchmarks, but the real magic begins when diffusion models are conditioned on free-form text. The key challenge: text is a variable-length sequence of tokens, while the diffusion model operates on spatial feature maps. How do we bridge these fundamentally different modalities?

The answer comes in two parts: a text encoder that converts the prompt into a rich sequence of embedding vectors, and cross-attention layers that allow the image features to query the text embeddings.

Text encoders. The most common choices are CLIP (Radford et al., 2021) and T5 (Raffel et al., 2020). CLIP was trained contrastively on 400M image-text pairs, so its text embeddings already encode visual semantics — the embedding of "a red sports car" is close to embeddings of actual sports car images. T5 is a large language model with richer linguistic understanding but less visual grounding. Modern systems often use both: CLIP for visual alignment and T5 for complex compositional understanding.

Cross-attention. Inside the denoising network, cross-attention layers are inserted alongside the self-attention layers. The mechanism is identical to Transformer cross-attention:

Q = WQ himage,   K = WK htext,   V = WV htext
CrossAttn(himage, htext) = softmax(Q Kᵀ / √dk) V

The queries come from the image features (spatial positions in the feature map), while the keys and values come from the text encoder's output sequence. Each spatial position in the image can attend to every text token, learning which words are relevant to which spatial locations. The word "red" might receive high attention weights from pixels in the car region, while "sky" attends to the upper portion of the image.

This creates a powerful soft spatial grounding: the model learns to associate textual concepts with spatial regions without any explicit supervision of where objects should appear. The compositional structure of language ("a cat sitting on a mat") is translated into compositional spatial structure through the attention patterns.
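A single-head sketch of the mechanism; the 77-token sequence length mirrors CLIP's default, and the weight shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(h_image, h_text, wq, wk, wv):
    """Image positions query the text tokens: Q from image, K/V from text."""
    q = h_image @ wq                 # (num_pixels, d)
    k = h_text @ wk                  # (num_tokens, d)
    v = h_text @ wv                  # (num_tokens, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (num_pixels, num_tokens)
    return attn @ v                  # text information routed to each position

h_image = np.random.randn(64 * 64, 32)   # flattened 64x64 feature map
h_text = np.random.randn(77, 48)         # e.g. a 77-token text embedding
wq = np.random.randn(32, 16) * 0.1
wk = np.random.randn(48, 16) * 0.1
wv = np.random.randn(48, 16) * 0.1
out = cross_attention(h_image, h_text, wq, wk, wv)
```

The `attn` matrix is exactly the word-to-region map described above: row i gives the attention weights that spatial position i places on each text token.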

Cross-Attention: Text ↔ Image Interactive

Hover over a text token (left) to see which image spatial positions attend to it, or hover over an image position (right) to see which text tokens it attends to. Line thickness indicates attention weight.

Latent Diffusion

Running a diffusion model directly on pixels is computationally brutal. A 512×512 RGB image has 786,432 dimensions. Every forward pass of the denoising network must process this full-resolution tensor, and sampling requires dozens to hundreds of such passes. Rombach et al. (2022) proposed an elegant solution: run the diffusion process in a compressed latent space instead of pixel space.

The Latent Diffusion Model (LDM) architecture has three stages:

  1. Encoder (E): A pretrained autoencoder (typically a VQ-VAE or KL-regularized VAE) compresses images from pixel space to a low-dimensional latent space. A 512×512×3 image becomes a 64×64×4 latent tensor — a 48× compression in dimensionality. The encoder is trained once and frozen.
  2. Diffusion model: The standard forward/reverse diffusion process operates entirely in this latent space. The denoising U-Net (or DiT) is much smaller because it processes 64×64×4 tensors instead of 512×512×3. Training and sampling are both dramatically faster.
  3. Decoder (D): After sampling is complete, the clean latent z0 is passed through the decoder to produce the final pixel-space image. This is a single forward pass — no iteration required.

The autoencoder is trained to reconstruct images with high fidelity while keeping the latent space smooth and well-structured. A KL penalty or vector quantization ensures the latent space doesn't develop pathological regions that the diffusion model would struggle with.
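A shape-level sketch of the three-stage pipeline; `encode` and `decode` are zero-returning placeholders for a pretrained VAE, and only the dimensionalities are meaningful:

```python
import numpy as np

def encode(img):
    """Placeholder for the frozen VAE encoder: 3x512x512 -> 4x64x64."""
    return np.zeros((4, 64, 64))

def decode(z):
    """Placeholder for the VAE decoder: 4x64x64 -> 3x512x512."""
    return np.zeros((3, 512, 512))

z = encode(np.zeros((3, 512, 512)))          # diffusion runs on this tensor
img = decode(z)                               # one forward pass at the end
savings = (512 * 512 * 3) // (64 * 64 * 4)    # 48x fewer dimensions
```

Every denoising step operates on the 16,384-dimensional latent rather than the 786,432-dimensional image, which is where the training and sampling speedups come from.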

ℹ Stable Diffusion = Latent Diffusion + CLIP + CFG

Stable Diffusion (Rombach et al., 2022) is the most famous instantiation of Latent Diffusion. It combines an LDM with a CLIP text encoder, cross-attention conditioning, and classifier-free guidance. The full pipeline: CLIP encodes the text prompt → the U-Net denoises in latent space conditioned on text via cross-attention → the VAE decoder produces the final image. Training on 64×64 latents derived from 512×512 images made it feasible to train on large datasets (LAION-5B) and run inference on consumer GPUs — democratizing high-quality image generation.

The latent diffusion insight is general: any domain where a good autoencoder exists can benefit from running diffusion in the compressed space. This has been applied to video (compressing both spatial and temporal dimensions), audio (spectrograms to latents), and 3D generation (tri-plane latent representations).

Latent vs Pixel Space Interactive

Left: pixel-space diffusion operates on the full image grid. Right: latent diffusion operates on a compressed representation. The ratio shows the dimensionality savings. Hover to compare.

References

Seminal papers and key works referenced in this article.

  1. Ronneberger et al. "U-Net: Convolutional Networks for Biomedical Image Segmentation." MICCAI, 2015. arXiv
  2. Peebles & Xie. "Scalable Diffusion Models with Transformers." ICCV, 2023. arXiv
  3. Dhariwal & Nichol. "Diffusion Models Beat GANs on Image Synthesis." NeurIPS, 2021. arXiv
  4. Ho & Salimans. "Classifier-Free Diffusion Guidance." NeurIPS Workshop, 2022. arXiv
  5. Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR, 2022. arXiv
  6. Ho et al. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020. arXiv
  7. Vaswani et al. "Attention Is All You Need." NeurIPS, 2017. arXiv
  8. Perez et al. "FiLM: Visual Reasoning with a General Conditioning Layer." AAAI, 2018. arXiv
  9. Radford et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML, 2021. arXiv
  10. Raffel et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR, 2020. arXiv