Introduction
Everything we have built so far — the forward process, the denoising objective, score matching,
SDEs, flow matching — ultimately distills to a single requirement: we need a neural network
εθ(xt, t) (or equivalently a score network,
or a velocity field) that can predict noise from a corrupted input at any timestep. The entire
theoretical edifice is only as powerful as this network's capacity to learn the denoising function.
This article is about the practical choices that make diffusion models work. The neural network is where all the capacity lives — where the model encodes its understanding of what a face looks like, how light scatters through clouds, or what makes a dog distinct from a wolf. We will trace the evolution of diffusion architectures from the original U-Net design through the modern Diffusion Transformer (DiT), and explore how conditioning — on timestep, class label, or free-form text — steers the generative process toward desired outputs.
The story has two interleaved threads: architecture (what goes inside the network) and conditioning (how we inject external information to control generation). Both are essential. A powerful architecture without good conditioning produces beautiful but uncontrollable images. Perfect conditioning on a weak backbone produces controllable but mediocre results.
This article assumes familiarity with the DDPM training objective (Article 02) and the score function (Article 03). You should be comfortable with the idea that a neural network takes in a noisy image xt and a timestep t, and predicts either the noise ε, the score ∇ log pt(xt), or the velocity vt. The specific prediction target doesn't change the architecture discussion.
U-Net Architecture
The U-Net (Ronneberger et al., 2015) was originally designed for biomedical image segmentation, but it became the de facto backbone for diffusion models thanks to Ho et al. (2020). Its signature feature is the encoder-decoder structure with skip connections — a design perfectly suited for denoising, where the network must preserve fine spatial details while also reasoning about global structure.
The architecture has three parts:
- Encoder (downsampling path): A sequence of blocks that progressively reduce spatial resolution while increasing channel depth. A typical chain might go 64 → 128 → 256 → 512 channels, with resolution halving at each stage via strided convolutions or average pooling. Each resolution level contains one or more ResNet blocks.
- Bottleneck: At the lowest resolution (e.g., 8×8 for a 256×256 input), the network processes features with the highest channel count. This is where self-attention is most computationally tractable and most impactful, capturing global relationships across the entire image.
- Decoder (upsampling path): Mirrors the encoder, progressively increasing resolution and decreasing channels. Each decoder block receives a skip connection from the corresponding encoder block — the feature maps are concatenated along the channel dimension before the decoder block processes them.
The skip connections are the critical design choice. Without them, fine-grained spatial information would be lost during the compression through the bottleneck. With them, the decoder has direct access to high-resolution features from the encoder, allowing it to reconstruct precise details. For denoising, this is essential: the network must output an image at the same resolution as its input, with pixel-level precision.
ResNet Blocks & Time Embedding
Each block within the U-Net is typically a ResNet block with a residual connection: the input is added to the output of two convolutional layers. Between the convolutions, the timestep embedding is injected (more on this in the next section). The standard structure is GroupNorm → SiLU → Conv, then the projected timestep embedding is added to the feature map, then GroupNorm → SiLU → Conv again, all wrapped in the residual connection.
GroupNorm replaces BatchNorm because diffusion training uses small batch sizes (the images are large and the models are deep). The SiLU activation (x · σ(x), also called Swish) provides smooth gradients and has become standard in modern diffusion architectures.
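The block just described can be sketched in plain NumPy. This is illustrative, not any particular codebase's implementation: 1×1 convolutions (per-pixel channel mixing) stand in for the usual 3×3 ones to keep the sketch short, and the weight names are made up.

```python
import numpy as np

def group_norm(x, groups=8, eps=1e-5):
    # x: (C, H, W); normalize over each group of channels and all spatial positions
    C, H, W = x.shape
    g = x.reshape(groups, C // groups, H, W)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(C, H, W)

def silu(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x), a.k.a. Swish

def resnet_block(x, temb, W1, W2, Wt):
    # x: (C, H, W) features; temb: (d,) timestep embedding
    # W1, W2: (C, C) 1x1 "convolutions"; Wt: (C, d) timestep projection
    h = np.einsum('oc,chw->ohw', W1, silu(group_norm(x)))
    h = h + (Wt @ temb)[:, None, None]           # inject timestep between the convs
    h = np.einsum('oc,chw->ohw', W2, silu(group_norm(h)))
    return x + h                                  # residual connection
```

The residual output has the same shape as the input, which is what lets these blocks be stacked freely at every resolution level.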
Encoder-decoder with skip connections: the encoder path downsamples, the bottleneck processes features at the lowest resolution, and the decoder path upsamples back to the input resolution.
Time Conditioning
A diffusion model must behave differently at every timestep. At high noise levels (large t), the
network should focus on recovering global structure — rough shapes, major color regions. At low
noise levels (small t), it should refine fine details — textures, edges, subtle shading. The
timestep t must therefore be communicated to the network in a way that enables
smooth, expressive modulation of its behavior across the entire noise spectrum.
The standard approach, borrowed from the positional encodings in the Transformer (Vaswani et al., 2017), uses sinusoidal embeddings:
emb(t)[2i] = sin(t / 10000^(2i/d)),   emb(t)[2i+1] = cos(t / 10000^(2i/d))
This produces a d-dimensional vector for each scalar timestep t. The different frequencies ensure
that nearby timesteps have similar embeddings while distant ones are well-separated — exactly the
inductive bias we want. The sinusoidal embedding is then passed through a small MLP
(typically two linear layers with SiLU activation) to produce a timestep embedding vector
temb ∈ ℝd.
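A minimal NumPy version of the sinusoidal embedding (the function name and the exact frequency layout are illustrative; implementations differ in small details such as interleaving sines and cosines):

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    # Half the dimensions get sines, half cosines, at geometrically
    # spaced frequencies from 1 down to roughly 1/max_period.
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])
```

Nearby timesteps differ only in the high-frequency components, so their embeddings stay close, while distant timesteps diverge across all frequencies. The result is then fed through the small MLP described above to produce temb.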
AdaGN & FiLM Conditioning
How do we actually inject temb into the network? The simplest approach
is additive conditioning: project temb to match the channel dimension
and add it to the feature map after the first convolution. This works but is limited — it can
shift activations but not rescale them.
A more powerful approach is Adaptive Group Normalization (AdaGN), also known as FiLM conditioning (Feature-wise Linear Modulation, Perez et al., 2018). Instead of learning fixed normalization parameters, the network predicts them from the timestep:
AdaGN(h, t) = γt ⊙ GroupNorm(h) + βt
where γt = Wγ temb and βt = Wβ temb are linear projections of the timestep embedding. This gives the timestep control over both the scale and shift of every feature channel — a much richer form of conditioning.
FiLM conditioning is powerful because it modulates the information flow through the network. Setting γ close to zero effectively gates off a feature channel; setting it large amplifies it. This allows the timestep to dynamically reconfigure which features the network uses at each noise level — coarse structure features at high noise, fine detail features at low noise — without any explicit architectural switching.
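A NumPy sketch of the AdaGN/FiLM computation (weight names are illustrative, not from any specific library):

```python
import numpy as np

def ada_group_norm(h, temb, W_gamma, W_beta, groups=8, eps=1e-5):
    # h: (C, H, W) features; temb: (d,) timestep embedding
    C, H, W = h.shape
    g = h.reshape(groups, C // groups, H, W)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    normed = ((g - mean) / np.sqrt(var + eps)).reshape(C, H, W)
    gamma = W_gamma @ temb                 # (C,) per-channel scale from the timestep
    beta = W_beta @ temb                   # (C,) per-channel shift from the timestep
    return gamma[:, None, None] * normed + beta[:, None, None]
```

If the projection produces gamma = 0 for some channel, that channel's normalized contribution is zeroed out entirely — the gating behavior described above.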
Self-Attention in U-Nets
Convolutions are local operations — a 3×3 kernel sees only a tiny spatial neighborhood. For diffusion models, this locality is a problem: at high noise levels, the model needs to coordinate across the entire image to establish global structure. Self-attention solves this by allowing every spatial position to attend to every other position, capturing long-range dependencies in a single layer.
However, self-attention has quadratic cost in the number of spatial positions: for a feature map of size H×W, the attention matrix is (HW)×(HW). At full resolution (256×256 = 65,536 positions), this is prohibitively expensive. The standard solution is to apply self-attention only at lower-resolution feature maps — typically at 16×16 or 32×32 — where the quadratic cost is manageable (256 or 1024 positions).
In the U-Net, attention blocks are interleaved with ResNet blocks at the lower resolution levels. A typical configuration applies self-attention at 16×16 and 32×32 but not at 64×64 or higher. The attention is multi-head, with the feature map reshaped so that spatial positions become the sequence dimension:
Attention(Q, K, V) = softmax(Q KT / √dk) V
where Q, K, and V are linear projections of the flattened feature map h ∈ ℝ(H·W) × C.
Each spatial position attends to all others, enabling the network to learn relationships like
"this region is a shadow cast by that object" or "these two eyes should be symmetric" — relationships
that span far beyond any convolutional receptive field.
The combination of convolutions (efficient local processing) and attention (powerful global reasoning) at strategic resolution levels gives the U-Net its remarkable capacity for denoising.
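The reshape-then-attend pattern can be sketched as follows (single head, NumPy, illustrative weight shapes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(h, Wq, Wk, Wv):
    # h: (C, H, W) feature map; Wq, Wk, Wv: (C, C) projections
    C, H, W = h.shape
    tokens = h.reshape(C, H * W).T                   # (HW, C): positions become the sequence
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(C))             # (HW, HW) attention matrix
    out = attn @ V                                   # every position mixes in all others
    return out.T.reshape(C, H, W)
```

The (HW)×(HW) attention matrix is what makes full-resolution attention prohibitive: at 16×16 it has 256² = 65,536 entries, while at 256×256 it would have 65,536² ≈ 4.3 billion.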
Diffusion Transformers (DiT)
The success of self-attention in U-Nets raised a natural question: why not replace the convolutional backbone entirely with a Transformer? Peebles & Xie (2023) answered this with the Diffusion Transformer (DiT), demonstrating that a pure Transformer architecture could match and exceed the U-Net on class-conditional ImageNet generation.
The DiT architecture is elegantly simple:
- Patchify: The input image (or noisy latent) is divided into non-overlapping patches (e.g., 2×2 or 4×4), each linearly embedded into a token vector. A 256×256 image with 4×4 patches yields a sequence of 4,096 tokens — feasible for modern Transformers.
- Positional embedding: Standard learnable or sinusoidal position embeddings are added to distinguish spatial locations.
- Transformer blocks: A stack of standard Transformer blocks with multi-head self-attention, feed-forward networks (MLPs), and layer normalization. The key innovation is how conditioning is injected.
- Unpatchify: The final token sequence is reshaped back into a spatial grid and linearly projected to the output channels (predicting noise, score, or velocity).
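Patchify and unpatchify are pure reshapes plus a linear projection; the reshape part can be sketched in NumPy (channels-last layout, names illustrative):

```python
import numpy as np

def patchify(img, p):
    # img: (H, W, C) -> tokens: (H*W / p^2, p*p*C)
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * C)

def unpatchify(tokens, H, W, p, C):
    # inverse of patchify: tokens back to an (H, W, C) grid
    x = tokens.reshape(H // p, W // p, p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)
```

A 256×256×3 image with p = 4 yields (256/4)² = 4,096 tokens of dimension 48, matching the count above; in a real DiT each token is then linearly embedded to the model width.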
adaLN-Zero Conditioning
The critical design choice in DiT is adaLN-Zero — an adaptive layer normalization scheme where the timestep and class embeddings modulate the scale, shift, and a per-layer gating parameter:
h ← h + αc ⊙ Block(γc ⊙ LayerNorm(h) + βc)
where γc, βc, and αc are linear projections of the combined timestep and class embedding, and Block is the self-attention or MLP sub-layer. The gating parameter αc is initialized to zero at the start of training, meaning each Transformer block initially acts as an identity function. This zero-initialization is crucial for training stability — it allows the model to gradually learn to use each block rather than being hit with random transformations from the start.
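A one-token NumPy sketch of the adaLN-Zero update (W_mod and the block function are stand-ins), showing that zero-initialized gating makes the block an identity at the start of training:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def adaln_zero_block(h, c, block_fn, W_mod):
    # W_mod: (3*d, dc) maps conditioning c to (gamma, beta, alpha);
    # the alpha rows are zero-initialized in DiT.
    gamma, beta, alpha = np.split(W_mod @ c, 3)
    return h + alpha * block_fn(gamma * layer_norm(h) + beta)

d, dc = 8, 4
rng = np.random.default_rng(4)
W_mod = rng.normal(size=(3 * d, dc))
W_mod[2 * d:] = 0.0                     # zero-init the gating rows
h = rng.normal(size=d)
c = rng.normal(size=dc)
out = adaln_zero_block(h, c, np.tanh, W_mod)
# alpha == 0, so the block passes h through unchanged
```

As training proceeds, gradients flow into the alpha rows and each block gradually "switches on".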
DiT demonstrated a clean scaling law: FID scores improved smoothly with increased model size and training compute, following the same predictable scaling behavior observed in language models. This was a landmark finding — it suggested that diffusion models could benefit from the same "just make it bigger" recipe that transformed NLP.
| Property | U-Net | DiT |
|---|---|---|
| Core operation | Convolution + sparse attention | Full self-attention on patches |
| Spatial structure | Multi-scale (explicit down/up) | Single-scale (patchified) |
| Skip connections | Encoder → decoder skips | Residual within each block |
| Time conditioning | AdaGN / additive | adaLN-Zero |
| Scaling | Architecture-specific tuning | Clean compute scaling laws |
| Inductive bias | Strong spatial (convolutions) | Weak (learned from data) |
| Key models | DDPM, Stable Diffusion 1/2 | DiT, SD3, Flux, Sora |
Classifier Guidance
Training a powerful denoising network is only half the battle. We also need to control what the model generates. The first breakthrough in guided diffusion came from Dhariwal & Nichol (2021), who showed that a separately trained classifier could steer the sampling process toward a desired class.
The idea is mathematically elegant. Recall that the score function ∇x log pt(x) points toward regions of higher data density. If we want to generate images of class y, we want to sample from the conditional distribution pt(x | y). By Bayes' rule:
∇x log pt(x | y) = ∇x log pt(x) + ∇x log pt(y | x)
The first term is the unconditional score — what our diffusion model already provides. The second term is the gradient of a classifier's log-probability with respect to the noisy input. We train a noise-aware classifier pφ(y | xt) on noisy images, then at sampling time, we modify the score:
∇x log pt(xt | y) ≈ ∇x log pt(xt) + s · ∇x log pφ(y | xt)
The guidance scale s controls how strongly the classifier steers
generation. Higher s produces images that are more recognizably of class y, at the cost of reduced
diversity. This tradeoff between fidelity (how well the image matches the
condition) and diversity (how varied the samples are) is fundamental to all
guidance methods.
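The score arithmetic can be checked on a toy 1-D example where everything is analytic: take p(x) = ½N(−2, 1) + ½N(+2, 1) with two equally likely classes, for which the exact posterior is p(y=1 | x) = sigmoid(4x). This is an illustration of the Bayes-rule combination, not a diffusion model:

```python
import numpy as np

def gauss(x, mu):
    # unit-variance Gaussian density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def uncond_score(x):
    # d/dx log p(x) for p(x) = 0.5 N(x; -2, 1) + 0.5 N(x; +2, 1)
    p = 0.5 * gauss(x, -2) + 0.5 * gauss(x, 2)
    dp = -0.5 * gauss(x, -2) * (x + 2) - 0.5 * gauss(x, 2) * (x - 2)
    return dp / p

def classifier_grad(x):
    # d/dx log p(y=1 | x), with p(y=1 | x) = sigmoid(4x)
    return 4.0 * (1.0 - 1.0 / (1.0 + np.exp(-4.0 * x)))

def guided_score(x, s):
    return uncond_score(x) + s * classifier_grad(x)
```

At s = 1 this recovers the true conditional score of class 1, which is −(x − 2) for N(+2, 1); s > 1 overshoots toward the class mode, which is exactly the fidelity/diversity trade described above.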
Classifier guidance produced a landmark result: for the first time, diffusion models beat GANs on ImageNet generation (FID 4.59 vs. BigGAN's 6.95). But it has a major practical limitation — it requires training a separate classifier on noisy images, which is expensive and limits the types of conditioning that can be applied.
Classifier-Free Guidance
Ho & Salimans (2022) proposed a brilliantly simple alternative: instead of
training a separate classifier, train the diffusion model itself to be both conditional and
unconditional. During training, the conditioning signal c (class label, text embedding, etc.)
is randomly dropped with some probability (typically 10-20%), replaced by a null token
∅. This means the same network learns both:
- εθ(xt, t, c) — the conditional prediction
- εθ(xt, t, ∅) — the unconditional prediction
At sampling time, the two predictions are combined:
ε̃θ(xt, t, c) = εθ(xt, t, ∅) + w · (εθ(xt, t, c) − εθ(xt, t, ∅))
When w = 1, this reduces to standard conditional generation. When w > 1,
the model extrapolates in the direction of the conditioning — moving further away from the
unconditional prediction than a purely conditional model would. This amplifies the influence of the
conditioning signal, producing images that more strongly match the desired condition.
The guidance weight w (often called the "CFG scale") is the single most important
hyperparameter in practical diffusion sampling. Typical values range from 3 to 15 depending on
the application. Too low and the images are diverse but may not match the prompt. Too high and
the images become oversaturated, artifact-ridden caricatures of the prompt.
The formula ε̃ = ε∅ + w(εc - ε∅)
can be interpreted geometrically: the difference
(εc - ε∅) points in the direction that
conditioning "wants to push" the prediction. Multiplying by w > 1 amplifies this push.
Equivalently, it implicitly raises the classifier's log-probability to a power w, sharpening
the conditional distribution. The model is using its own internal "classifier" — the difference
between its conditional and unconditional predictions — rather than relying on an external one.
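The combination rule itself is one line; a sketch with dummy prediction vectors:

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    # eps_tilde = eps_uncond + w * (eps_cond - eps_uncond)
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 0.0])    # stand-in conditional prediction
eps_u = np.array([0.0, 0.0])    # stand-in unconditional prediction
```

Here w = 0 returns the unconditional prediction, w = 1 the conditional one, and w = 7.5 (a common Stable Diffusion default) pushes 7.5× as far along the conditional direction.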
At w = 1, the model samples from the learned conditional; higher w sharpens the distribution toward the condition but reduces diversity.
Text Conditioning & Cross-Attention
Class-conditional generation is useful for benchmarks, but the real magic begins when diffusion models are conditioned on free-form text. The key challenge: text is a variable-length sequence of tokens, while the diffusion model operates on spatial feature maps. How do we bridge these fundamentally different modalities?
The answer comes in two parts: a text encoder that converts the prompt into a rich sequence of embedding vectors, and cross-attention layers that allow the image features to query the text embeddings.
Text encoders. The most common choices are CLIP (Radford et al., 2021) and T5 (Raffel et al., 2020). CLIP was trained contrastively on 400M image-text pairs, so its text embeddings already encode visual semantics — the embedding of "a red sports car" is close to embeddings of actual sports car images. T5 is a large language model with richer linguistic understanding but less visual grounding. Modern systems often use both: CLIP for visual alignment and T5 for complex compositional understanding.
Cross-attention. Inside the denoising network, cross-attention layers are inserted alongside the self-attention layers. The mechanism is identical to Transformer cross-attention:
CrossAttn(himage, htext) = softmax(Q KT / √dk) V
The queries come from the image features (spatial positions in the feature map), while the keys and values come from the text encoder's output sequence. Each spatial position in the image can attend to every text token, learning which words are relevant to which spatial locations. The word "red" might receive high attention weights from pixels in the car region, while "sky" attends to the upper portion of the image.
This creates a powerful soft spatial grounding: the model learns to associate textual concepts with spatial regions without any explicit supervision of where objects should appear. The compositional structure of language ("a cat sitting on a mat") is translated into compositional spatial structure through the attention patterns.
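A single-head NumPy sketch of this asymmetry — queries from the image, keys and values from the text (weight shapes illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(h_image, h_text, Wq, Wk, Wv):
    # h_image: (HW, C) flattened spatial features
    # h_text:  (L, D) text-encoder output, one row per token
    Q = h_image @ Wq                                 # queries from spatial positions
    K = h_text @ Wk                                  # keys from text tokens
    V = h_text @ Wv                                  # values from text tokens
    attn = softmax(Q @ K.T / np.sqrt(Wq.shape[1]))   # (HW, L) position-to-token weights
    return attn @ V                                  # each position pulls in text content
```

The attn matrix is exactly the soft grounding described above: entry (i, j) is how strongly spatial position i reads from text token j.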
Each text token is attended to by specific image spatial positions, and each image position distributes its attention across the text tokens; the attention weight sets the strength of each link.
Latent Diffusion
Running a diffusion model directly on pixels is computationally brutal. A 512×512 RGB image has 786,432 dimensions. Every forward pass of the denoising network must process this full-resolution tensor, and sampling requires dozens to hundreds of such passes. Rombach et al. (2022) proposed an elegant solution: run the diffusion process in a compressed latent space instead of pixel space.
The Latent Diffusion Model (LDM) architecture has three stages:
- Encoder (E): A pretrained autoencoder (typically a VQ-VAE or KL-regularized VAE) compresses images from pixel space to a low-dimensional latent space. A 512×512×3 image becomes a 64×64×4 latent tensor — a 48× compression in dimensionality. The encoder is trained once and frozen.
- Diffusion model: The standard forward/reverse diffusion process operates entirely in this latent space. The denoising U-Net (or DiT) is much smaller because it processes 64×64×4 tensors instead of 512×512×3. Training and sampling are both dramatically faster.
- Decoder (D): After sampling is complete, the clean latent z0 is passed through the decoder to produce the final pixel-space image. This is a single forward pass — no iteration required.
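The compression ratio quoted above checks out directly:

```python
# 512x512 RGB pixels vs. the 64x64x4 latent it is encoded to
pixel_dims = 512 * 512 * 3      # 786,432 dimensions
latent_dims = 64 * 64 * 4       # 16,384 dimensions
print(pixel_dims // latent_dims)  # -> 48
```

Every denoising step operates on the 16,384-dimensional latent instead of the 786,432-dimensional image, which is where the training and sampling speedups come from.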
The autoencoder is trained to reconstruct images with high fidelity while keeping the latent space smooth and well-structured. A KL penalty or vector quantization ensures the latent space doesn't develop pathological regions that the diffusion model would struggle with.
Stable Diffusion (Rombach et al., 2022) is the most famous instantiation of Latent Diffusion. It combines an LDM with a CLIP text encoder, cross-attention conditioning, and classifier-free guidance. The full pipeline: CLIP encodes the text prompt → the U-Net denoises in latent space conditioned on text via cross-attention → the VAE decoder produces the final image. Training on 64×64 latents from 512×512 images made it feasible to train on large datasets (LAION-5B) and run inference on consumer GPUs — democratizing high-quality image generation.
The latent diffusion insight is general: any domain where a good autoencoder exists can benefit from running diffusion in the compressed space. This has been applied to video (compressing both spatial and temporal dimensions), audio (spectrograms to latents), and 3D generation (tri-plane latent representations).
Left: pixel-space diffusion operates on the full image grid. Right: latent diffusion operates on a compressed representation; the ratio shows the dimensionality savings.
Trends & Future Directions
The architecture landscape is evolving rapidly. Several trends are reshaping how diffusion models are designed:
Multi-resolution DiT. Pure single-scale DiTs process all patches at the same resolution, missing the multi-scale inductive bias that made U-Nets successful. Recent architectures like U-ViT (Bao et al., 2023) and Hourglass DiT combine the Transformer backbone with U-Net-style multi-resolution processing — downsampling tokens at intermediate layers and using skip connections across resolution levels. This marries the scaling benefits of Transformers with the spatial efficiency of the U-Net design.
Mixture of Experts (MoE). As models scale to billions of parameters, MoE layers offer a path to increased capacity without proportional compute costs. Each token is routed to a subset of expert MLPs, allowing the model to maintain a large parameter count while keeping the per-token FLOPs manageable. This is particularly appealing for diffusion models, where different noise levels and content types may benefit from specialized processing.
Stable Diffusion 3 & Flux. The latest generation of open models has converged on the MM-DiT (multimodal DiT) architecture, which processes text and image tokens in a single unified Transformer stream with bidirectional attention. This eliminates the asymmetry of separate text encoders and cross-attention, allowing text and image representations to co-evolve through every layer. SD3 (Esser et al., 2024) and Flux (Black Forest Labs, 2024) both use this approach, combined with flow matching training objectives and rectified flow for straighter sampling trajectories.
Scaling and efficiency. The field is simultaneously pushing in two directions: larger models with better quality (following DiT scaling laws), and smaller, distilled models that can generate high-quality images in 1-4 steps. Techniques like consistency distillation, progressive distillation, and adversarial distillation compress hundreds of diffusion steps into a handful, enabling real-time generation on mobile devices.
The architecture story is far from settled. What is clear is that the fundamental building blocks — attention, normalization-based conditioning, multi-scale processing, and latent compression — will remain central even as specific designs continue to evolve.
References
Seminal papers and key works referenced in this article.
- Ronneberger et al. "U-Net: Convolutional Networks for Biomedical Image Segmentation." MICCAI, 2015. arXiv
- Peebles & Xie. "Scalable Diffusion Models with Transformers." ICCV, 2023. arXiv
- Dhariwal & Nichol. "Diffusion Models Beat GANs on Image Synthesis." NeurIPS, 2021. arXiv
- Ho & Salimans. "Classifier-Free Diffusion Guidance." NeurIPS Workshop, 2022. arXiv
- Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR, 2022. arXiv