Introduction

Everything we have built so far — the forward process, the denoising objective, score matching, SDEs, flow matching — ultimately distills to a single requirement: we need a neural network εθ(xt, t) (or equivalently a score network, or a velocity field) that can predict noise from a corrupted input at any timestep. The entire theoretical edifice is only as powerful as this network's capacity to learn the denoising function.

This article is about the practical choices that make diffusion models work. The neural network is where all the capacity lives — where the model encodes its understanding of what a face looks like, how light scatters through clouds, or what makes a dog distinct from a wolf. We will trace the evolution of diffusion architectures from the original U-Net design through the modern Diffusion Transformer (DiT), and explore how conditioning — on timestep, class label, or free-form text — steers the generative process toward desired outputs.

The story has two interleaved threads: architecture (what goes inside the network) and conditioning (how we inject external information to control generation). Both are essential. A powerful architecture without good conditioning produces beautiful but uncontrollable images. Perfect conditioning on a weak backbone produces controllable but mediocre results.

ℹ Prerequisites

This article assumes familiarity with the DDPM training objective (Article 02) and the score function (Article 03). You should be comfortable with the idea that a neural network takes in a noisy image xt and a timestep t, and predicts either the noise ε, the score ∇ log pt(xt), or the velocity vt. The specific prediction target doesn't change the architecture discussion.

U-Net Architecture

The U-Net (Ronneberger et al., 2015) was originally designed for biomedical image segmentation, but it became the de facto backbone for diffusion models thanks to Ho et al. (2020). Its signature feature is the encoder-decoder structure with skip connections — a design perfectly suited for denoising, where the network must preserve fine spatial details while also reasoning about global structure.

The architecture has three parts:

  • Encoder (downsampling path): A sequence of blocks that progressively reduce spatial resolution while increasing channel depth. A typical chain might go 64 → 128 → 256 → 512 channels, with resolution halving at each stage via strided convolutions or average pooling. Each resolution level contains one or more ResNet blocks.
  • Bottleneck: At the lowest resolution (e.g., 8×8 for a 256×256 input), the network processes features with the highest channel count. This is where self-attention is most computationally tractable and most impactful, capturing global relationships across the entire image.
  • Decoder (upsampling path): Mirrors the encoder, progressively increasing resolution and decreasing channels. Each decoder block receives a skip connection from the corresponding encoder block — the feature maps are concatenated along the channel dimension before the decoder block processes them.

The skip connections are the critical design choice. Without them, fine-grained spatial information would be lost during the compression through the bottleneck. With them, the decoder has direct access to high-resolution features from the encoder, allowing it to reconstruct precise details. For denoising, this is essential: the network must output an image at the same resolution as its input, with pixel-level precision.
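The encoder-skip-decoder data flow above can be sketched shape-wise in a few lines of NumPy. The `downsample`, `upsample`, and concatenation below are illustrative stand-ins for the learned blocks, not an actual implementation:

```python
import numpy as np

def downsample(x):
    """Halve spatial resolution with 2x2 average pooling (NCHW layout)."""
    n, c, h, w = x.shape
    return x.reshape(n, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def upsample(x):
    """Double spatial resolution by nearest-neighbor repetition."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

x = np.random.randn(1, 64, 32, 32)               # encoder features at 32x32
skip = x                                          # saved for the decoder
bottleneck = downsample(x)                        # (1, 64, 16, 16)
up = upsample(bottleneck)                         # back to (1, 64, 32, 32)
decoder_in = np.concatenate([up, skip], axis=1)   # skip concat: (1, 128, 32, 32)
```

Note that the skip doubles the decoder's input channels, which is why decoder blocks in real U-Nets are built to accept the concatenated width.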

ResNet Blocks & Time Embedding

Each block within the U-Net is typically a ResNet block with a residual connection: the input is added to the output of two convolutional layers. Between the convolutions, the timestep embedding is injected (more on this in the next section). The standard structure is:

h = Conv(SiLU(GroupNorm(x))) → h = h + Proj(temb) → h = Conv(SiLU(GroupNorm(h))) → output = h + x

GroupNorm replaces BatchNorm because diffusion training uses small batch sizes (the images are large and the models are deep). The SiLU activation (x · σ(x), also called Swish) provides smooth gradients and has become standard in modern diffusion architectures.
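A minimal NumPy sketch of this block, using 1×1 channel-mixing matrices as stand-ins for the 3×3 convolutions (an assumption made to keep the example short; the weight names are illustrative):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def group_norm(x, groups=8, eps=1e-5):
    """Normalize a (N, C, H, W) tensor within channel groups."""
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

def resnet_block(x, temb, w1, w2, w_t):
    """GroupNorm -> SiLU -> conv, timestep injection, second
    GroupNorm -> SiLU -> conv, then the residual connection."""
    h = np.einsum('oc,nchw->nohw', w1, silu(group_norm(x)))
    h = h + (w_t @ temb)[None, :, None, None]   # per-channel timestep bias
    h = np.einsum('oc,nchw->nohw', w2, silu(group_norm(h)))
    return x + h                                 # residual connection

C, D = 16, 32
x = np.random.randn(2, C, 8, 8)
temb = np.random.randn(D)
w1 = np.random.randn(C, C) * 0.1
w2 = np.random.randn(C, C) * 0.1
w_t = np.random.randn(C, D) * 0.1
out = resnet_block(x, temb, w1, w2, w_t)
```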

U-Net Architecture Interactive

Encoder-decoder with skip connections. Hover over any block to see its dimensions and role. The blue path is the encoder (downsampling), yellow is the bottleneck, and green is the decoder (upsampling).

Time Conditioning

A diffusion model must behave differently at every timestep. At high noise levels (large t), the network should focus on recovering global structure — rough shapes, major color regions. At low noise levels (small t), it should refine fine details — textures, edges, subtle shading. The timestep t must therefore be communicated to the network in a way that enables smooth, expressive modulation of its behavior across the entire noise spectrum.

The standard approach, borrowed from the positional encodings in the Transformer (Vaswani et al., 2017), uses sinusoidal embeddings:

PE(t, 2i) = sin(t / 10000^(2i/d)),    PE(t, 2i+1) = cos(t / 10000^(2i/d))

This produces a d-dimensional vector for each scalar timestep t. The different frequencies ensure that nearby timesteps have similar embeddings while distant ones are well-separated — exactly the inductive bias we want. The sinusoidal embedding is then passed through a small MLP (typically two linear layers with SiLU activation) to produce a timestep embedding vector temb ∈ ℝd.
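A sketch of the embedding computation, following the common convention of splitting the dimension into a sin half and a cos half (one of several equivalent layouts):

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """d-dimensional sinusoidal embedding of a scalar timestep t."""
    half = dim // 2
    # frequencies 1 / max_period^(i/half) for i = 0 .. half-1
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

e10, e11, e900 = (timestep_embedding(t, 128) for t in (10, 11, 900))
# nearby timesteps get similar embeddings, distant ones are well-separated
near = np.linalg.norm(e10 - e11)
far = np.linalg.norm(e10 - e900)
```

In a full model, the output of `timestep_embedding` would then pass through the small MLP described above to produce temb.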

AdaGN & FiLM Conditioning

How do we actually inject temb into the network? The simplest approach is additive conditioning: project temb to match the channel dimension and add it to the feature map after the first convolution. This works but is limited — it can shift activations but not rescale them.

A more powerful approach is Adaptive Group Normalization (AdaGN), also known as FiLM conditioning (Feature-wise Linear Modulation, Perez et al., 2018). Instead of learning fixed normalization parameters, the network predicts them from the timestep:

AdaGN(h, t) = γt · GroupNorm(h) + βt

where γt = Wγ temb and βt = Wβ temb are linear projections of the timestep embedding. This gives the timestep control over both the scale and shift of every feature channel — a much richer form of conditioning.

🔢 Why scale-and-shift works so well

FiLM conditioning is powerful because it modulates the information flow through the network. Setting γ close to zero effectively gates off a feature channel; setting it large amplifies it. This allows the timestep to dynamically reconfigure which features the network uses at each noise level — coarse structure features at high noise, fine detail features at low noise — without any explicit architectural switching.
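A small NumPy sketch of AdaGN; the projection matrices `w_gamma` and `w_beta` are illustrative placeholders for the learned linear layers:

```python
import numpy as np

def group_norm(x, groups=4, eps=1e-5):
    """Normalize a (N, C, H, W) tensor within channel groups."""
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

def adagn(h, temb, w_gamma, w_beta):
    """Timestep-dependent scale and shift per channel (FiLM)."""
    gamma = w_gamma @ temb   # (C,)
    beta = w_beta @ temb     # (C,)
    return gamma[None, :, None, None] * group_norm(h) + beta[None, :, None, None]

C, D = 8, 16
h = np.random.randn(2, C, 4, 4)
temb = np.random.randn(D)
out = adagn(h, temb, np.random.randn(C, D) * 0.1, np.random.randn(C, D) * 0.1)
# with gamma and beta projected to zero, the features are fully gated off
gated = adagn(h, temb, np.zeros((C, D)), np.zeros((C, D)))
```

The `gated` case makes the gating argument from the box above concrete: a zero scale erases a channel's contribution entirely.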

Self-Attention in U-Nets

Convolutions are local operations — a 3×3 kernel sees only a tiny spatial neighborhood. For diffusion models, this locality is a problem: at high noise levels, the model needs to coordinate across the entire image to establish global structure. Self-attention solves this by allowing every spatial position to attend to every other position, capturing long-range dependencies in a single layer.

However, self-attention has quadratic cost in the number of spatial positions: for a feature map of size H×W, the attention matrix is (HW)×(HW). At full resolution (256×256 = 65,536 positions), this is prohibitively expensive. The standard solution is to apply self-attention only at lower-resolution feature maps — typically at 16×16 or 32×32 — where the quadratic cost is manageable (256 or 1024 positions).

In the U-Net, attention blocks are interleaved with ResNet blocks at the lower resolution levels. A typical configuration applies self-attention at 16×16 and 32×32 but not at 64×64 or higher. The attention is multi-head, with the feature map reshaped so that spatial positions become the sequence dimension:

Q = WQ h,   K = WK h,   V = WV h
Attention(Q, K, V) = softmax(Q Kᵀ / √dk) V

where h ∈ ℝ^((H·W) × C) is the flattened feature map. Each spatial position attends to all others, enabling the network to learn relationships like "this region is a shadow cast by that object" or "these two eyes should be symmetric" — relationships that span far beyond any convolutional receptive field.

The combination of convolutions (efficient local processing) and attention (powerful global reasoning) at strategic resolution levels gives the U-Net its remarkable capacity for denoising.
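The flatten-attend-reshape pattern can be sketched as follows (single-head with illustrative weight matrices; real implementations are multi-head and batched):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(fmap, wq, wk, wv):
    """Single-head self-attention over a (C, H, W) feature map:
    spatial positions become the sequence dimension."""
    c, hh, ww = fmap.shape
    h = fmap.reshape(c, hh * ww).T                   # (H*W, C): one token per position
    q, k, v = h @ wq, h @ wk, h @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (H*W, H*W) attention matrix
    return (attn @ v).T.reshape(c, hh, ww)           # back to a spatial map

C = 8
fmap = np.random.randn(C, 16, 16)                    # 16x16: 256 positions
wq, wk, wv = (np.random.randn(C, C) * 0.1 for _ in range(3))
out = spatial_self_attention(fmap, wq, wk, wv)
```

The `(H*W, H*W)` attention matrix is exactly the quadratic cost discussed above, which is why this is applied at 16×16 rather than full resolution.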

Diffusion Transformers (DiT)

The success of self-attention in U-Nets raised a natural question: why not replace the convolutional backbone entirely with a Transformer? Peebles & Xie (2023) answered this with the Diffusion Transformer (DiT), demonstrating that a pure Transformer architecture could match and exceed the U-Net on class-conditional ImageNet generation.

The DiT architecture is elegantly simple:

  1. Patchify: The input image (or noisy latent) is divided into non-overlapping patches (e.g., 2×2 or 4×4), each linearly embedded into a token vector. A 256×256 image with 4×4 patches yields a sequence of 4,096 tokens — feasible for modern Transformers.
  2. Positional embedding: Standard learnable or sinusoidal position embeddings are added to distinguish spatial locations.
  3. Transformer blocks: A stack of standard Transformer blocks with multi-head self-attention, feed-forward networks (MLPs), and layer normalization. The key innovation is how conditioning is injected.
  4. Unpatchify: The final token sequence is reshaped back into a spatial grid and linearly projected to the output channels (predicting noise, score, or velocity).
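Patchify and unpatchify are pure reshapes with no learned parameters beyond the linear embedding; a sketch for a single image, assuming channel-first layout:

```python
import numpy as np

def patchify(img, p):
    """Split a (C, H, W) image into a sequence of flattened p x p patches."""
    c, h, w = img.shape
    x = img.reshape(c, h // p, p, w // p, p)
    return x.transpose(1, 3, 0, 2, 4).reshape((h // p) * (w // p), c * p * p)

def unpatchify(tokens, c, h, w, p):
    """Inverse of patchify: token sequence back to a (C, H, W) grid."""
    x = tokens.reshape(h // p, w // p, c, p, p)
    return x.transpose(2, 0, 3, 1, 4).reshape(c, h, w)

img = np.random.randn(4, 32, 32)          # e.g. a 4-channel 32x32 latent
tokens = patchify(img, 2)                  # (256, 16): 16x16 grid of patches
recon = unpatchify(tokens, 4, 32, 32, 2)   # round-trips exactly
```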

adaLN-Zero Conditioning

The critical design choice in DiT is adaLN-Zero — an adaptive layer normalization scheme where the timestep and class embeddings modulate the scale, shift, and a per-layer gating parameter:

adaLN-Zero(h, c) = h + αc · Block(γc · LayerNorm(h) + βc)

where Block is the attention or MLP sub-layer and c combines the timestep and class embeddings.

The extra gating parameter αc is initialized to zero at the start of training, meaning each Transformer block initially acts as an identity function. This zero-initialization is crucial for training stability — it allows the model to gradually learn to use each block rather than being hit with random transformations from the start.
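The identity-at-initialization property is easy to demonstrate; the `w_mlp` stand-in below abstracts the attention/MLP sub-layer, and the parameter names are illustrative:

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def dit_block(h, cond, params):
    """One residual branch with adaLN-Zero gating."""
    gamma = params['w_gamma'] @ cond
    beta = params['w_beta'] @ cond
    alpha = params['w_alpha'] @ cond           # zero at initialization
    branch = layer_norm(h) * gamma + beta      # adaptive scale and shift
    branch = branch @ params['w_mlp']          # stand-in for attention/MLP
    return h + alpha * branch                  # gated residual

d, dc = 16, 8
h = np.random.randn(10, d)
cond = np.random.randn(dc)
params = {
    'w_gamma': np.random.randn(d, dc) * 0.1,
    'w_beta': np.random.randn(d, dc) * 0.1,
    'w_alpha': np.zeros((d, dc)),              # the "Zero" in adaLN-Zero
    'w_mlp': np.random.randn(d, d) * 0.1,
}
out = dit_block(h, cond, params)               # equals h exactly at init
```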

DiT demonstrated a clean scaling law: FID scores improved smoothly with increased model size and training compute, following the same predictable scaling behavior observed in language models. This was a landmark finding — it suggested that diffusion models could benefit from the same "just make it bigger" recipe that transformed NLP.

| Property | U-Net | DiT |
| --- | --- | --- |
| Core operation | Convolution + sparse attention | Full self-attention on patches |
| Spatial structure | Multi-scale (explicit down/up) | Single-scale (patchified) |
| Skip connections | Encoder → decoder skips | Residual within each block |
| Time conditioning | AdaGN / additive | adaLN-Zero |
| Scaling | Architecture-specific tuning | Clean compute scaling laws |
| Inductive bias | Strong spatial (convolutions) | Weak (learned from data) |
| Key models | DDPM, Stable Diffusion 1/2 | DiT, SD3, Flux, Sora |

Classifier Guidance

Training a powerful denoising network is only half the battle. We also need to control what the model generates. The first breakthrough in guided diffusion came from Dhariwal & Nichol (2021), who showed that a separately trained classifier could steer the sampling process toward a desired class.

The idea is mathematically elegant. Recall that the score function ∇x log pt(x) points toward regions of higher data density. If we want to generate images of class y, we want to sample from the conditional distribution pt(x | y). By Bayes' rule:

∇x log pt(x | y) = ∇x log pt(x) + ∇x log pt(y | x)

The first term is the unconditional score — what our diffusion model already provides. The second term is the gradient of a classifier's log-probability with respect to the noisy input. We train a noise-aware classifier pφ(y | xt) on noisy images, then at sampling time, we modify the score:

ε̃θ(xt, t, y) = εθ(xt, t) - √(1 - ᾱt) · s · ∇x log pφ(y | xt)

The guidance scale s controls how strongly the classifier steers generation. Higher s produces images that are more recognizably of class y, at the cost of reduced diversity. This tradeoff between fidelity (how well the image matches the condition) and diversity (how varied the samples are) is fundamental to all guidance methods.
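A toy sketch of the guided prediction. The Gaussian "classifier" with an analytic gradient is a stand-in for backpropagating through a real noise-aware classifier network; all names and values here are illustrative:

```python
import numpy as np

def classifier_grad(x, class_mean):
    """grad_x log N(x; class_mean, I): toy stand-in for a classifier gradient."""
    return class_mean - x

def guided_eps(eps, x, class_mean, alpha_bar, s):
    # eps_tilde = eps - sqrt(1 - alpha_bar) * s * grad_x log p(y | x)
    return eps - np.sqrt(1.0 - alpha_bar) * s * classifier_grad(x, class_mean)

x = np.zeros(4)
eps = np.random.randn(4)
out = guided_eps(eps, x, class_mean=np.ones(4), alpha_bar=0.5, s=2.0)
shift = out - eps       # constant pull toward the class mean, scaled by s
```

Increasing `s` scales `shift` linearly, which is the fidelity-diversity knob described above.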

Classifier guidance produced a landmark result: for the first time, diffusion models beat GANs on ImageNet generation (FID 4.59 vs. BigGAN's 6.95). But it has a major practical limitation — it requires training a separate classifier on noisy images, which is expensive and limits the types of conditioning that can be applied.

Classifier-Free Guidance

Ho & Salimans (2022) proposed a brilliantly simple alternative: instead of training a separate classifier, train the diffusion model itself to be both conditional and unconditional. During training, the conditioning signal c (class label, text embedding, etc.) is randomly dropped with some probability (typically 10-20%), replaced by a null token ∅. This means the same network learns both:

  • εθ(xt, t, c) — the conditional prediction
  • εθ(xt, t, ∅) — the unconditional prediction

At sampling time, the two predictions are combined:

ε̃ = εθ(xt, t, ∅) + w · (εθ(xt, t, c) - εθ(xt, t, ∅))

When w = 1, this reduces to standard conditional generation. When w > 1, the model extrapolates in the direction of the conditioning — moving further away from the unconditional prediction than a purely conditional model would. This amplifies the influence of the conditioning signal, producing images that more strongly match the desired condition.

The guidance weight w (often called the "CFG scale") is the single most important hyperparameter in practical diffusion sampling. Typical values range from 3 to 15 depending on the application. Too low and the images are diverse but may not match the prompt. Too high and the images become oversaturated, artifact-ridden caricatures of the prompt.
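The combination rule itself is a one-liner; the prediction values below are illustrative:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """eps_tilde = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])       # unconditional prediction
eps_c = np.array([1.0, -1.0])      # conditional prediction
at_1 = cfg_combine(eps_u, eps_c, 1.0)    # w = 1 recovers the conditional
at_75 = cfg_combine(eps_u, eps_c, 7.5)   # w > 1 extrapolates well past it
```

Note that both predictions come from the same network, so every sampling step under CFG costs two forward passes (or one batched pass over the two conditions).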

💡 Why classifier-free guidance works so well

The formula ε̃ = ε + w(εc − ε) can be interpreted geometrically: the difference (εc − ε) points in the direction that conditioning "wants to push" the prediction. Multiplying by w > 1 amplifies this push. Equivalently, it implicitly raises the classifier's log-probability to a power w, sharpening the conditional distribution. The model is using its own internal "classifier" — the difference between its conditional and unconditional predictions — rather than relying on an external one.

Classifier-Free Guidance Effect Interactive

Adjust the guidance weight w. At w=1, the model samples from the learned conditional. Higher w sharpens the distribution toward the condition but reduces diversity.


Text Conditioning & Cross-Attention

Class-conditional generation is useful for benchmarks, but the real magic begins when diffusion models are conditioned on free-form text. The key challenge: text is a variable-length sequence of tokens, while the diffusion model operates on spatial feature maps. How do we bridge these fundamentally different modalities?

The answer comes in two parts: a text encoder that converts the prompt into a rich sequence of embedding vectors, and cross-attention layers that allow the image features to query the text embeddings.

Text encoders. The most common choices are CLIP (Radford et al., 2021) and T5 (Raffel et al., 2020). CLIP was trained contrastively on 400M image-text pairs, so its text embeddings already encode visual semantics — the embedding of "a red sports car" is close to embeddings of actual sports car images. T5 is a large language model with richer linguistic understanding but less visual grounding. Modern systems often use both: CLIP for visual alignment and T5 for complex compositional understanding.

Cross-attention. Inside the denoising network, cross-attention layers are inserted alongside the self-attention layers. The mechanism is identical to Transformer cross-attention:

Q = WQ himage,   K = WK htext,   V = WV htext
CrossAttn(himage, htext) = softmax(Q Kᵀ / √dk) V

The queries come from the image features (spatial positions in the feature map), while the keys and values come from the text encoder's output sequence. Each spatial position in the image can attend to every text token, learning which words are relevant to which spatial locations. The word "red" might receive high attention weights from pixels in the car region, while "sky" attends to the upper portion of the image.

This creates a powerful soft spatial grounding: the model learns to associate textual concepts with spatial regions without any explicit supervision of where objects should appear. The compositional structure of language ("a cat sitting on a mat") is translated into compositional spatial structure through the attention patterns.
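A single-head sketch of the mechanism; the 77-token sequence length mirrors CLIP's default, and the weight shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(h_image, h_text, wq, wk, wv):
    """Image positions query the text tokens: Q from image, K/V from text."""
    q = h_image @ wq                 # (num_pixels, d)
    k = h_text @ wk                  # (num_tokens, d)
    v = h_text @ wv                  # (num_tokens, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (num_pixels, num_tokens)
    return attn @ v                  # text information routed to each position

h_image = np.random.randn(64 * 64, 32)   # flattened 64x64 feature map
h_text = np.random.randn(77, 48)         # e.g. a 77-token text embedding
wq = np.random.randn(32, 16) * 0.1
wk = np.random.randn(48, 16) * 0.1
wv = np.random.randn(48, 16) * 0.1
out = cross_attention(h_image, h_text, wq, wk, wv)
```

The `attn` matrix is exactly the word-to-region map described above: row i gives the attention weights that spatial position i places on each text token.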

Cross-Attention: Text ↔ Image Interactive

Hover over a text token (left) to see which image spatial positions attend to it, or hover over an image position (right) to see which text tokens it attends to. Line thickness indicates attention weight.

Latent Diffusion

Running a diffusion model directly on pixels is computationally brutal. A 512×512 RGB image has 786,432 dimensions. Every forward pass of the denoising network must process this full-resolution tensor, and sampling requires dozens to hundreds of such passes. Rombach et al. (2022) proposed an elegant solution: run the diffusion process in a compressed latent space instead of pixel space.

The Latent Diffusion Model (LDM) architecture has three stages:

  1. Encoder (E): A pretrained autoencoder (typically a VQ-VAE or KL-regularized VAE) compresses images from pixel space to a low-dimensional latent space. A 512×512×3 image becomes a 64×64×4 latent tensor — a 48× compression in dimensionality. The encoder is trained once and frozen.
  2. Diffusion model: The standard forward/reverse diffusion process operates entirely in this latent space. The denoising U-Net (or DiT) is much smaller because it processes 64×64×4 tensors instead of 512×512×3. Training and sampling are both dramatically faster.
  3. Decoder (D): After sampling is complete, the clean latent z0 is passed through the decoder to produce the final pixel-space image. This is a single forward pass — no iteration required.

The autoencoder is trained to reconstruct images with high fidelity while keeping the latent space smooth and well-structured. A KL penalty or vector quantization ensures the latent space doesn't develop pathological regions that the diffusion model would struggle with.
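A shape-level sketch of the three-stage pipeline; `encode` and `decode` are zero-returning placeholders for a pretrained VAE, and only the dimensionalities are meaningful:

```python
import numpy as np

def encode(img):
    """Placeholder for the frozen VAE encoder: 3x512x512 -> 4x64x64."""
    return np.zeros((4, 64, 64))

def decode(z):
    """Placeholder for the VAE decoder: 4x64x64 -> 3x512x512."""
    return np.zeros((3, 512, 512))

z = encode(np.zeros((3, 512, 512)))          # diffusion runs on this tensor
img = decode(z)                               # one forward pass at the end
savings = (512 * 512 * 3) // (64 * 64 * 4)    # 48x fewer dimensions
```

Every denoising step operates on the 16,384-dimensional latent rather than the 786,432-dimensional image, which is where the training and sampling speedups come from.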

ℹ Stable Diffusion = Latent Diffusion + CLIP + CFG

Stable Diffusion (Rombach et al., 2022) is the most famous instantiation of Latent Diffusion. It combines an LDM with a CLIP text encoder, cross-attention conditioning, and classifier-free guidance. The full pipeline: CLIP encodes the text prompt → the U-Net denoises in latent space conditioned on text via cross-attention → the VAE decoder produces the final image. Training on 64×64 latents derived from 512×512 images made it feasible to train on large datasets (LAION-5B) and run inference on consumer GPUs — democratizing high-quality image generation.

The latent diffusion insight is general: any domain where a good autoencoder exists can benefit from running diffusion in the compressed space. This has been applied to video (compressing both spatial and temporal dimensions), audio (spectrograms to latents), and 3D generation (tri-plane latent representations).

Latent vs Pixel Space Interactive

Left: pixel-space diffusion operates on the full image grid. Right: latent diffusion operates on a compressed representation. The ratio shows the dimensionality savings. Hover to compare.

References

Seminal papers and key works referenced in this article.

  1. Ronneberger et al. "U-Net: Convolutional Networks for Biomedical Image Segmentation." MICCAI, 2015. arXiv
  2. Peebles & Xie. "Scalable Diffusion Models with Transformers." ICCV, 2023. arXiv
  3. Dhariwal & Nichol. "Diffusion Models Beat GANs on Image Synthesis." NeurIPS, 2021. arXiv
  4. Ho & Salimans. "Classifier-Free Diffusion Guidance." NeurIPS Workshop, 2022. arXiv
  5. Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR, 2022. arXiv
  6. Ho et al. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020. arXiv
  7. Vaswani et al. "Attention Is All You Need." NeurIPS, 2017. arXiv
  8. Perez et al. "FiLM: Visual Reasoning with a General Conditioning Layer." AAAI, 2018. arXiv
  9. Radford et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML, 2021. arXiv
  10. Raffel et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR, 2020. arXiv