Why one design — the transformer — conquered language, vision, audio, video, diffusion, robotics, and everything in between. The design patterns that make it work.
It's 2017. You've just built a transformer that translates English to French better than anything before it. Your boss walks in and says: "Great. Now make it generate images." You stare at the paper on your desk. Attention Is All You Need was designed for sequences of words. Images aren't sequences of words. Do you start from scratch?
You don't. And that decision — that instinct to adapt rather than reinvent — turns out to be one of the most important ideas in modern AI. Between 2017 and 2024, researchers took the exact same transformer architecture and retrofitted it to handle images (ViT, 2020), diffusion (DiT, 2022), video (ViViT, 2021), audio (AST, 2021), point clouds (Point Transformer, 2020), protein folding (AlphaFold 2, 2021), robot control (RT-2, 2023), and dozens more domains.
Each time, the recipe was eerily similar: keep the core (attention + feed-forward + residual connections), swap the tokenizer, adjust the positional encoding, and choose a conditioning mechanism. The same backbone, different clothes. This isn't a coincidence — it reveals something deep about why the transformer works.
Click a domain to see what changes and what stays the same. The teal blocks are universal. The orange blocks are domain-specific.
Look at that diagram carefully. No matter which domain you pick, three things are always teal (universal): self-attention, feed-forward network, and residual connections. The only things that change are the input tokenizer, the positional encoding, and the conditioning mechanism. Three thin layers of domain adaptation wrapping a universal core.
That ratio — roughly 90% universal, 10% domain-specific — is the transformer's superpower. But why? Why can the same attention mechanism that learns "cat" relates to "sat" also learn that a pixel patch in the upper-left relates to a pixel patch in the lower-right? The answer isn't magic. It's a specific set of architectural design decisions that, taken together, make the transformer maximally reusable. Let's dissect each one.
Here's how fast it happened. Each entry below is a team taking the transformer and adapting it to a new domain. Notice how little novelty most adaptations required as the playbook became standardized:
| Year | Model | Domain | What Changed |
|---|---|---|---|
| 2017 | Transformer | Machine Translation | The original: encoder-decoder with sinusoidal positional encodings |
| 2018 | GPT / BERT | General NLP | Decoder-only / Encoder-only — showed you don't need both halves |
| 2020 | ViT | Image Classification | Patch tokenizer + class token. That's it. |
| 2020 | Point Transformer | 3D Point Clouds | kNN-local attention instead of global |
| 2021 | AST | Audio | Spectrogram → patches, same as ViT |
| 2021 | ViViT | Video | Spatiotemporal tube tokenizer |
| 2021 | AlphaFold 2 | Protein Structure | MSA attention + pair representation |
| 2022 | DiT | Diffusion/Generation | Replaced U-Net, added AdaLN-Zero conditioning |
| 2023 | RT-2 | Robot Control | Actions as text tokens. Used a pretrained VLM directly. |
Nine domains in six years. The core transformer block barely changed across any of them. What evolved was the adapter layer — the thin domain-specific shell. That shell is what we'll learn to design in this lesson.
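Let's make this concrete. A standard transformer block with d_model = 768 (like ViT-Base) contains, counting weight matrices only: the attention projections (Q, K, V, output), 4 × 768² = 2,359,296 params; the feed-forward network (up and down projections with 4× expansion), 2 × 4 × 768² = 4,718,592 params; and two LayerNorms, 2 × 2 × 768 = 3,072 params. That is 7,080,960 params per block.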
The patch embedding (the domain-specific tokenizer) for ViT-Base? A single linear projection from 768-dimensional flattened patches to 768 dimensions: 768 × 768 + 768 = 590,592 params. That's 0.7% of a 12-block ViT-Base (86M total). The domain-specific part is a rounding error.
```python
# Count: how much of ViT is domain-specific?
d = 768
n_layers = 12

# Universal (per block): attention (Q, K, V, Out) + FFN (up, down) + LayerNorm
attn_params = 4 * d * d        # 2,359,296
ffn_params = 2 * 4 * d * d     # 4,718,592
ln_params = 2 * 2 * d          # 3,072
block_total = attn_params + ffn_params + ln_params  # 7,080,960
universal = block_total * n_layers                  # 84,971,520

# Domain-specific: patch embedding + class token + position embedding
patch_embed = d * d + d        # 590,592 (linear projection)
cls_token = d                  # 768 (one learnable vector)
pos_embed = 197 * d            # 151,296 (14×14 patches + CLS)
domain_specific = patch_embed + cls_token + pos_embed  # 742,656

total = universal + domain_specific
print(f"Universal: {universal:,} ({universal / total * 100:.1f}%)")
print(f"Domain: {domain_specific:,} ({domain_specific / total * 100:.1f}%)")
# Universal: 84,971,520 (99.1%)
# Domain: 742,656 (0.9%)
```
Less than 1% of the model is domain-specific. The rest is perfectly general sequence processing machinery. This is the transformer's design genius: a thin, swappable adapter sitting atop a massive, reusable core.
If you had to point to the single most important reason the transformer is universal, it wouldn't be attention. It would be something much simpler: the residual connection. That humble "add the input back to the output" after every sub-layer is what makes the entire architecture modular, composable, and trainable at depth. Without it, none of the retrofitting we saw in Chapter 0 would work.
Here's why. Think of a transformer as a highway — an information highway running through the model from input to output. Each layer (attention, FFN) is an off-ramp/on-ramp: it reads from the highway, computes something, and writes the result back onto the highway by adding it to the existing stream. The stream itself flows unimpeded from the first layer to the last.
Without residual connections, a two-layer network computes: y = f2(f1(x)).
The output is the result of composing functions. The original input x is gone — it's been fully transformed. If f1 loses some information, f2 can never recover it.
With residual connections, the same network computes: h1 = x + f1(x), then h2 = h1 + f2(h1).
Expand this and you see something remarkable. The output is: h2 = x + f1(x) + f2(x + f1(x)).
The original input x is always present. Layer f1 adds its contribution. Layer f2 adds another contribution. Neither layer needs to preserve information from the input — the residual connection does it automatically. Each layer just needs to compute what's missing or what needs correction.
Let's trace actual numbers. Suppose x = [1.0, 2.0] and we have two simple layers where f1 doubles, f2 halves:
Without residual: h1 = f1(x) = [2.0, 4.0], then h2 = f2(h1) = [1.0, 2.0].
We got back to where we started. The two layers cancelled. Worse: if f1 mapped to [0, 0] (a common failure mode during training), f2 sees nothing. Information is destroyed.
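With residual: h1 = x + f1(x) = [1.0, 2.0] + [2.0, 4.0] = [3.0, 6.0], then h2 = h1 + f2(h1) = [3.0, 6.0] + [1.5, 3.0] = [4.5, 9.0].
Even if f1 outputs zeros, h1 = [1.0, 2.0], so the input survives. Even if f2 outputs zeros, h2 = [3.0, 6.0], so everything accumulated so far survives. A dead layer cannot wipe out the stream. A layer changes the stream only through what it explicitly adds to it.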
```python
import numpy as np

x = np.array([1.0, 2.0])

# Without residual: information can be destroyed
def no_residual(x, f1, f2):
    h1 = f1(x)  # if f1 → zeros, game over
    h2 = f2(h1)
    return h2

# With residual: input always survives
def with_residual(x, f1, f2):
    h1 = x + f1(x)    # even if f1 → zeros, h1 = x
    h2 = h1 + f2(h1)  # even if f2 → zeros, h2 = h1
    return h2

# Test with a "dead" layer
dead = lambda x: np.zeros_like(x)
double = lambda x: x * 2

print(no_residual(x, dead, double))    # [0. 0.]: destroyed!
print(with_residual(x, dead, double))  # [3. 6.]: input survived
```
The residual stream has three consequences for universality:
1. Layers are optional. If a layer hasn't learned anything useful yet (as often happens early in training), it can output near-zeros and the stream flows through unharmed. This means you can add layers to a pretrained model and they start as no-ops — the model behaves as before while the new layers gradually learn to contribute.
2. Layers are modular. Each layer reads from and writes to the same shared representation. It doesn't matter whether the layer before it was attention or FFN or something entirely new — as long as it reads a vector and writes a vector of the same dimension, it plugs in. This is why you can insert cross-attention layers (Chapter 4) into an existing model without rewriting anything.
3. Gradients flow freely. During backpropagation, the gradient of the loss with respect to an early layer doesn't need to pass through every intervening layer — it has a direct path through the residual connections. This is what makes 100+ layer transformers trainable when a naive 100-layer network would have vanishing gradients.
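The first consequence (layers are optional) is easy to check numerically. The sketch below is illustrative only: `make_layer` and `residual_stack` are hypothetical helpers, and the toy tanh layers stand in for real attention or FFN sub-layers. It inserts a zero-weight layer into an existing residual stack and confirms the output does not change.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

def make_layer(W):
    """A toy sub-layer: a linear map followed by tanh (hypothetical stand-in)."""
    return lambda h: np.tanh(h @ W)

def residual_stack(x, layers):
    """Run x through the layers, adding each output back onto the stream."""
    h = x
    for f in layers:
        h = h + f(h)
    return h

x = rng.normal(size=(3, d))  # 3 tokens, d dimensions each
pretrained = [make_layer(rng.normal(size=(d, d)) * 0.1) for _ in range(2)]

# A new layer with zero weights: tanh(h @ 0) = 0, so it writes nothing to the stream.
new_layer = make_layer(np.zeros((d, d)))
extended = [pretrained[0], new_layer, pretrained[1]]

before = residual_stack(x, pretrained)
after = residual_stack(x, extended)
print(np.allclose(before, after))  # True
```

Once training gives the new layer non-zero weights, it starts contributing, but at insertion time the model's behavior is unchanged.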
Toggle layers on/off. Watch how the stream flows. When a layer is off, it contributes zero — but the stream still flows because of the residual connection. Drag the corruption slider to simulate a layer outputting noise.
Let's trace the gradient. Consider a loss L at the output of a 4-layer residual network, with h0 = x and h_i = h_(i-1) + f_i(h_(i-1)). The gradient with respect to the input x is:
∂L/∂x = ∂L/∂h4 · (I + ∂f4/∂h3)(I + ∂f3/∂h2)(I + ∂f2/∂h1)(I + ∂f1/∂h0) = ∂L/∂h4 · (I + Σ_i ∂f_i/∂h_(i-1) + higher-order products)
That leading I (identity matrix) is the hero. It means the gradient always has a direct path back to the input, regardless of what the individual layers do. Even if every ∂f_i/∂h is tiny (vanishing), the identity term keeps that path open. Exploding terms are a separate problem, handled by normalization. This is why transformers can be 100+ layers deep.
Without residuals, the gradient would be:
∂L/∂x = ∂L/∂h4 · ∂f4/∂h3 · ∂f3/∂h2 · ∂f2/∂h1 · ∂f1/∂h0
A chain of multiplications. If each Jacobian has norm slightly less than 1 (say 0.9), after 100 layers the gradient magnitude is 0.9^100 ≈ 2.66 × 10^-5. Practically zero. The residual connection breaks this chain.
```python
# Gradient magnitude after N layers (scalar toy model)
n_layers = 100
jacobian_norm = 0.9  # each layer slightly shrinks gradients

# Without residual: product of Jacobians
no_res_grad = jacobian_norm ** n_layers
print(f"Without residual: {no_res_grad:.2e}")  # 2.66e-05, vanished

# With residual: each Jacobian is (I + df/dx), so in this scalar model
# the product is (1 + jac)^N. The identity term keeps it from vanishing.
with_res_grad = (1 + jacobian_norm) ** n_layers  # explodes, but LayerNorm tames it
print(f"With residual (raw): {with_res_grad:.2e}")  # 7.51e+27
# In practice, LayerNorm keeps this in check. The point is that it doesn't vanish.
```
The transformer doesn't know what a word is. It doesn't know what a pixel is. It doesn't know what a sound wave is. All it knows is: "I receive a sequence of vectors, each of dimension d_model. I process them with attention and FFN. I output a sequence of vectors." That's it. The entire architecture is built around this one abstraction: a sequence of d-dimensional vectors.
This means the entire burden of domain adaptation falls on the tokenizer — the component that converts raw domain data (text, images, audio, point clouds, robot states) into that universal format. Get the tokenizer right, and the transformer does the rest. This chapter is about how that conversion works for each major domain.
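Every domain follows the same three-step recipe: (1) split the raw data into chunks, (2) flatten each chunk and project it to d_model dimensions, (3) add positional information.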
Text was the original domain. The tokenizer splits text into subword tokens using algorithms like BPE (Byte-Pair Encoding). Common words stay whole ("the", "and"), uncommon words get split ("unbelievable" → "un", "believ", "able"). Each token maps to a row in a learned embedding matrix.
This was the big breakthrough. Instead of feeding individual pixels (which would create impossibly long sequences), ViT cuts the image into a grid of non-overlapping patches. Each patch is flattened and linearly projected to d_model dimensions.
That's it. A 224×224 image becomes 197 tokens of dimension 768 — the same shape a 197-word sentence would have. The transformer can't tell the difference.
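The shape arithmetic can be sketched in a few lines of NumPy. This is a minimal illustration, not ViT's actual implementation: `patchify` is a hypothetical helper, and the random `W_proj` and zero `cls` vector stand in for parameters that would be learned.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an [H, W, C] image into flattened non-overlapping patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                # [gh, gw, patch, patch, C]
    return x.reshape(gh * gw, patch * patch * C)  # [N, patch*patch*C]

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
d_model = 768

patches = patchify(image)                         # 14 x 14 = 196 patches of 768 values
W_proj = rng.normal(size=(patches.shape[1], d_model)) * 0.02  # stand-in for learned weights
cls = np.zeros((1, d_model))                      # stand-in for the learned class token

tokens = np.concatenate([cls, patches @ W_proj])  # prepend CLS to the projected patches
print(patches.shape, tokens.shape)  # (196, 768) (197, 768)
```

With 16×16 RGB patches, the flattened patch size (16 × 16 × 3 = 768) happens to equal d_model for ViT-Base, which is why the projection is a 768 × 768 matrix.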
Audio is first converted to a mel spectrogram — a 2D image where the x-axis is time and the y-axis is frequency. Then it's patched exactly like ViT.
Video adds a time dimension. ViViT extracts tubelet tokens — 3D patches spanning space AND time:
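A rough sketch of tubelet extraction, assuming a 32-frame 224×224 clip and a tubelet size of 2 frames × 16 × 16 pixels. Both the `tubelets` helper and these sizes are assumptions chosen for illustration; they reproduce the 3,136-token figure used in this lesson.

```python
import numpy as np

def tubelets(video, t=2, p=16):
    """Split a [T, H, W, C] clip into flattened t x p x p tubelets.

    Assumes T is divisible by t and H, W are divisible by p.
    """
    T, H, W, C = video.shape
    gt, gh, gw = T // t, H // p, W // p
    x = video.reshape(gt, t, gh, p, gw, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)          # [gt, gh, gw, t, p, p, C]
    return x.reshape(gt * gh * gw, t * p * p * C)

video = np.zeros((32, 224, 224, 3), dtype=np.float32)
tokens_in = tubelets(video)
print(tokens_in.shape)  # (3136, 1536)
```

Each row would then go through a linear projection from 1,536 values to d_model, exactly as an image patch does.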
3D point clouds are already discrete — each point has (x, y, z) coordinates plus optional features (color, normals). The tokenizer just embeds each point:
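As a minimal sketch, a per-point embedding can be a single shared linear map over each point's features. The 6-value feature layout (xyz plus RGB) and the random weights below are assumptions for illustration; real point-cloud models often use a small shared MLP instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, d_model = 1024, 768

xyz = rng.uniform(-1, 1, size=(n_points, 3))  # coordinates
rgb = rng.uniform(0, 1, size=(n_points, 3))   # optional per-point color
points = np.concatenate([xyz, rgb], axis=1)   # [1024, 6]

# Stand-in for a learned linear embedding shared by every point.
W = rng.normal(size=(6, d_model)) * 0.02
b = np.zeros(d_model)
tokens = points @ W + b                       # [1024, 768]

# The same map is applied to every point, so reordering points reorders tokens identically.
perm = rng.permutation(n_points)
print(tokens.shape, np.allclose(points[perm] @ W + b, tokens[perm]))
```

Because the coordinates are part of each point's features, position information enters through the embedding itself.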
RT-2 does something clever: it discretizes continuous robot actions (joint angles, gripper open/close) into text tokens. A 7-DOF action becomes 7 integer tokens, concatenated to the language instruction and image tokens. The transformer processes all three modalities in a single sequence.
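The discretization step can be sketched as uniform binning. The helper names, the [-1, 1] action range, and the choice of 256 bins per dimension are assumptions for illustration, not RT-2's actual code.

```python
import numpy as np

N_BINS = 256  # assumed number of bins per action dimension

def actions_to_tokens(action, low, high, n_bins=N_BINS):
    """Map each continuous action dimension to an integer bin in [0, n_bins - 1]."""
    scaled = (np.asarray(action) - low) / (high - low)  # normalize to [0, 1]
    return np.clip((scaled * n_bins).astype(int), 0, n_bins - 1)

def tokens_to_actions(tokens, low, high, n_bins=N_BINS):
    """Invert the mapping by returning the center of each bin."""
    return low + (np.asarray(tokens) + 0.5) / n_bins * (high - low)

low, high = -1.0, 1.0
action = np.array([0.10, -0.25, 0.00, 0.90, -1.00, 0.33, 1.00])  # 7 dimensions
tokens = actions_to_tokens(action, low, high)
recovered = tokens_to_actions(tokens, low, high)

print(tokens)                              # 7 integers, one per action dimension
print(np.max(np.abs(recovered - action)))  # at most half a bin width
```

The round trip loses at most half a bin width per dimension, which is the price of expressing a continuous action in a discrete vocabulary.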
Toggle between domains to see how raw data becomes a token sequence. Every domain produces the same shape: [N, d_model].
Notice the pattern: no matter the domain, the output is always [N, d_model]. The number N varies (197 for images, 512 for audio, 3136 for video), and longer sequences cost quadratically more in attention, but the format is identical. This is the abstraction barrier that makes the transformer universal.
```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Universal pattern: chunk → flatten → project → add position"""
    def __init__(self, in_dim, d_model, n_tokens):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)  # learned projection
        self.pos = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)  # learned positions

    def forward(self, patches):
        # patches: [batch, N, in_dim]
        tokens = self.proj(patches)  # [batch, N, d_model]
        tokens = tokens + self.pos   # add positional encoding
        return tokens                # ready for transformer

# Text tokenizer: embedding lookup (in_dim = vocab_size, one-hot)
text_tok = PatchTokenizer(50257, 768, 512)
# Image tokenizer: 16×16 RGB patch → 768
image_tok = PatchTokenizer(16 * 16 * 3, 768, 197)
# Audio tokenizer: 16×16 spectrogram patch → 768
audio_tok = PatchTokenizer(16 * 16 * 1, 768, 512)
# Same transformer processes all three: it sees [batch, N, 768] every time
```
You've tokenized your data into [N, d_model]. Now what? It enters the attention mechanism. And here's the crucial insight: attention doesn't know what the tokens represent. It's a pure set operation. It takes N vectors, computes pairwise similarity scores between all pairs, and produces N output vectors. Whether those vectors came from words, image patches, audio frames, or robot joint states — the computation is identical.
This isn't a bug. It's the design. Attention is an adaptive pooling operation: each token computes a weighted average over all tokens (itself included), where the weights are computed from the data by learned projections. The only inductive bias it has is "some tokens are more relevant to each other than others." It discovers which tokens are relevant from training data alone.
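Mathematically, self-attention is a function on sets of vectors. Given a set X = {x1, ..., xN}, stacked as the rows of a matrix, attention computes: Q = X·W_Q, K = X·W_K, V = X·W_V, and Attention(X) = softmax(Q·K^T / √d)·V.
Let's trace what this does for a single query token x_i: (1) form the query q_i = x_i·W_Q; (2) score it against every key, s_ij = q_i · k_j / √d; (3) normalize the scores with a softmax over j to get weights α_ij; (4) output the weighted sum o_i = Σ_j α_ij v_j.
Notice: nothing in this computation references position, spatial structure, or domain. It's purely about vector similarity. If the query of token 5 has a large dot product with the key of token 42, token 5 will attend to token 42, regardless of whether they're adjacent words, distant image patches, or one is a text token and the other is an image token.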
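One way to test the "no reference to position" claim is to shuffle the input tokens and check that the outputs are shuffled the same way. The sketch below uses random weights and a hypothetical `self_attention` helper; it assumes single-head attention with no positional encoding added.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8
W_Q, W_K, W_V = (rng.normal(size=(d, d)) * 0.3 for _ in range(3))

def self_attention(X):
    """Single-head scaled dot-product self-attention, no positional information."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over keys
    return weights @ V

X = rng.normal(size=(N, d))
perm = rng.permutation(N)

# Shuffling the input tokens shuffles the outputs the same way; nothing else changes.
print(np.allclose(self_attention(X[perm]), self_attention(X)[perm]))  # True
```

This property (permutation equivariance) is why order has to be injected separately, through positional encodings.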
Imagine 4 tokens, each 3-dimensional. We'll trace attention with the same weights, but different input domains:
```python
import numpy as np

# Same attention weights for both domains
np.random.seed(42)
W_Q = np.random.randn(3, 3) * 0.5
W_K = np.random.randn(3, 3) * 0.5
W_V = np.random.randn(3, 3) * 0.5

def attention(X):
    Q = X @ W_Q  # [4, 3]
    K = X @ W_K
    V = X @ W_V
    scores = Q @ K.T / np.sqrt(3)  # [4, 4]
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax
    return weights @ V  # [4, 3]

# "Text tokens": embeddings for [The, cat, sat, down]
text = np.array([[0.2, 0.8, 0.1], [0.9, 0.1, 0.7], [0.3, 0.6, 0.4], [0.5, 0.3, 0.2]])
# "Image patches": embeddings for 4 image patches
image = np.array([[0.7, 0.2, 0.5], [0.1, 0.9, 0.3], [0.4, 0.4, 0.8], [0.6, 0.1, 0.6]])

# Same function, same weights, different data
text_out = attention(text)    # Works perfectly on text
image_out = attention(image)  # Works perfectly on images
# The attention function doesn't know (or care) what domain the tokens came from
```
If attention is domain-agnostic, how does the model know about spatial structure? Through positional encoding — and this is one of the few domain-specific components.
Different domains use different positional encodings because their data has different structure:
| Domain | Positional Encoding | Why |
|---|---|---|
| Language | 1D learned or sinusoidal | Text is sequential — only order matters |
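| Images (ViT) | 1D learned embeddings over the flattened patch grid | Patches have (row, col) positions; ViT learns the 2D layout from data instead of encoding it explicitly |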
| Video (ViViT) | 3D: spatial + temporal | Patches have (time, row, col) |
| Audio (AST) | 2D: time + frequency | Spectrogram patches have (time, freq) |
| Diffusion (DiT) | 2D sinusoidal (fixed sin-cos) | Patches have (row, col) positions, encoded with fixed frequencies instead of learned vectors |
| Point Clouds | 3D coordinates directly | Points have (x, y, z) — feed as features |
| Robotics (RT-2) | 1D (sequence position) | Concatenated sequence of image + text + action |
The positional encoding is the transformer's only inductive bias for spatial/temporal structure. Everything else — which patches relate to which, what spatial patterns matter — is learned from data through the attention weights.
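To make one row of the table concrete, here is a sketch of a fixed 2D sinusoidal encoding for a 14×14 patch grid. The helper names and the channel split (half for rows, half for columns) are assumptions, one common way to build such an encoding and not a specific model's code.

```python
import numpy as np

def sincos_1d(positions, dim):
    """Standard sinusoidal encoding of a 1D position array into `dim` features."""
    assert dim % 2 == 0
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = np.outer(positions, freqs)                  # [P, dim/2]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(grid_h, grid_w, dim):
    """Encode (row, col) by giving half the channels to each axis."""
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    enc_r = sincos_1d(rows.reshape(-1), dim // 2)
    enc_c = sincos_1d(cols.reshape(-1), dim // 2)
    return np.concatenate([enc_r, enc_c], axis=1)        # [grid_h * grid_w, dim]

pos = sincos_2d(14, 14, 768)
print(pos.shape)  # (196, 768)
```

Two patches in the same row share the first half of their encoding, and two patches in the same column share the second half, so the grid structure is visible to attention without being hardwired into it.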
Toggle domain to see how the same attention mechanism produces different patterns depending on input structure. The heatmap shows attention weights — brighter = higher weight.
CNNs have strong inductive bias: local connectivity (a pixel relates most to its neighbors) and translation equivariance (a pattern at position A is the same pattern at position B). This makes CNNs data-efficient for images — they "know" about spatial locality from the start.
Transformers have almost no inductive bias. They don't assume locality, translation equivariance, or any spatial structure. This seems like a weakness, and it is — on small datasets. ViT trained on ImageNet-1K (1.3M images) underperforms ResNet. But ViT pretrained on JFT-300M (300M images) crushes ResNet.
Why? Because with enough data, the model discovers the right inductive bias from the data itself. And the bias it discovers might be better than what a human engineer would have hardcoded. Early ViT layers learn local patterns (like a CNN), but later layers learn long-range dependencies that a CNN can reach only indirectly, by stacking many layers to grow its receptive field.
This lack of hardcoded bias is precisely what makes the transformer universal. A CNN can only process grid-structured data (images). An RNN can only process sequential data (text). The transformer can process anything that can be expressed as a set of vectors — because it makes no assumptions about the structure.
Self-attention lets tokens within a single sequence talk to each other. But what if you need two different representations to interact? A diffusion model needs to condition on a text prompt. A VLM needs image features to inform text generation. A robot policy needs language instructions to guide motor outputs. In every case, you need information from one modality to influence another.
This is where cross-attention comes in — and it's arguably the most important design pattern in modern AI. Cross-attention is identical to self-attention with one change: the queries come from one representation, while the keys and values come from another.
In self-attention, all three projections come from the same input X: Q = X·W_Q, K = X·W_K, V = X·W_V.
In cross-attention, queries come from the target (the thing being updated) and keys/values come from the source (the conditioning signal): Q = X_target·W_Q, K = X_source·W_K, V = X_source·W_V.
The attention score is still the dot product between query and key, but now the query asks "what information do I need?" and the key/value from the source answers "here's what I have." The result is a representation of the target that's been conditioned on the source.
Suppose you're building Stable Diffusion. You have noisy image features (the target) and a text prompt embedding (the source). Let's trace the shapes:
```python
import torch
import torch.nn as nn

# Dimensions
d_model = 768         # transformer width
n_image_tokens = 256  # 16×16 latent patches
n_text_tokens = 77    # CLIP max sequence length

# Input representations
image_features = torch.randn(1, n_image_tokens, d_model)  # [1, 256, 768]
text_features = torch.randn(1, n_text_tokens, d_model)    # [1, 77, 768]

# Cross-attention projections
W_Q = nn.Linear(d_model, d_model)  # queries from IMAGE
W_K = nn.Linear(d_model, d_model)  # keys from TEXT
W_V = nn.Linear(d_model, d_model)  # values from TEXT

# Compute cross-attention
Q = W_Q(image_features)  # [1, 256, 768]: "what does each patch need?"
K = W_K(text_features)   # [1, 77, 768]: "what does each word offer?"
V = W_V(text_features)   # [1, 77, 768]: "what info does each word carry?"

# Attention weights: [1, 256, 768] × [1, 768, 77] → [1, 256, 77]
scores = Q @ K.transpose(-2, -1) / (d_model ** 0.5)
weights = torch.softmax(scores, dim=-1)  # [1, 256, 77]

# Each image patch gets a weighted average of text token values
output = weights @ V  # [1, 256, 77] × [1, 77, 768] → [1, 256, 768]

# weights[0, 42, :] tells us: for image patch 42,
# how much does it attend to each of the 77 text tokens?
# In a trained model, if the prompt is "a red car", patch 42 (in the car region)
# would be expected to attend strongly to "car" and "red".
# Here the weights are random, so only the shapes are meaningful.
```
The critical shape to remember: the attention matrix is [n_target, n_source]. Each target token has a distribution over source tokens. This is a soft lookup: each image patch retrieves the most relevant text information.
Here's the analogy that makes cross-attention click. Think of it as a database:
| Database Concept | Cross-Attention | Concrete Example |
|---|---|---|
| Query | Q = target · W_Q | "What information does image patch 42 need?" |
| Index/Key | K = source · W_K | "Each text token advertises its content" |
| Value/Record | V = source · W_V | "The actual information each text token carries" |
| Match Score | softmax(QK^T/√d) | "How relevant is each word to this patch?" |
| Retrieved Record | Σ_j α_ij v_j | "Weighted blend of relevant word meanings" |
The difference from a real database: it's soft (retrieves a weighted combination, not a single exact match) and learned (the W matrices are trained to define what "relevant" means).
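The "soft" part can be seen with a toy lookup. The keys, values, and `temperature` parameter below are made up for illustration: at the default temperature the result blends several records, and as the temperature shrinks the lookup approaches an exact-match retrieval of the best-scoring record.

```python
import numpy as np

keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # three source "records"
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([1.0, 0.2])

def soft_lookup(q, K, V, temperature=1.0):
    """Return softmax match weights and the weighted blend of values."""
    scores = K @ q / temperature
    w = np.exp(scores - scores.max())  # subtract max for numerical stability
    w /= w.sum()
    return w, w @ V

w_soft, out_soft = soft_lookup(query, keys, values)
w_sharp, out_sharp = soft_lookup(query, keys, values, temperature=0.01)

print(w_soft.round(3), out_soft)    # a blend of all three records
print(w_sharp.round(3), out_sharp)  # nearly all weight on the best match
```

In real cross-attention there is no temperature knob to tune by hand. The learned W matrices determine how sharp or diffuse the match scores are.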
Watch how queries from the target attend to keys/values from the source. Click a target token to see which source tokens it attends to. Drag the slider to change the conditioning strength.
Cross-attention is everywhere in modern AI:
| Model | Target (Q) | Source (K, V) | Purpose |
|---|---|---|---|
| Stable Diffusion | Noisy image features | CLIP text embeddings | Condition denoising on text prompt |
| Flamingo | Language tokens | Vision features | Ground language in visual context |
| Original Transformer | Decoder tokens | Encoder tokens | Translation: target attends to source sentence |
| DETR | Object queries | Image features | Detect objects by querying image |
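| RT-2 / pi0 | Action tokens | Vision + language | Ground actions in perception + instruction (RT-2 does this with self-attention over one concatenated sequence, not a separate cross-attention layer) |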
| IP-Adapter | Denoising features | Reference image features | Style/content transfer from reference |
Cross-attention is one way to inject conditioning information into a transformer. But it's not the only way — and it's not always the best way. Over the past few years, researchers have discovered a whole zoo of conditioning mechanisms, each with different trade-offs in compute cost, expressiveness, and architectural complexity.
The fundamental question is always the same: how do I get information from signal C into representation X? The answer depends on what C looks like (scalar? vector? sequence?), how much compute you can afford, and whether C should influence the content or the statistics of X.
We covered this in Chapter 4. Each target token dynamically selects which source tokens to attend to. Best when the conditioning signal is a rich sequence (text prompts, image features).
Instead of cross-attending to a sequence, AdaLN converts the conditioning signal into scale (γ) and shift (β) parameters for layer normalization. The conditioning signal (timestep, class label) is projected through an MLP to produce per-layer γ and β.
The "Zero" in AdaLN-Zero: the gate α is initialized to zero, so the conditioning layer starts as a no-op and gradually learns to contribute. This is the same "initialize as identity" trick that makes residual connections work.
FiLM is the predecessor to AdaLN. It applies a learned affine transformation to each feature channel: scale and shift, but applied to the features directly, not to a normalization layer.
The difference from AdaLN: FiLM applies scale/shift to raw features. AdaLN applies them to normalized features. In practice, AdaLN works better because LayerNorm stabilizes the features before modulation.
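The contrast can be sketched numerically. Everything below is a toy setup: `W_cond` is a random stand-in for the learned projection from the conditioning vector to (γ, β), and the input features are deliberately badly scaled to show what normalization buys.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, cond_dim = 4, 8, 3

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Stand-in for the learned projection from conditioning vector to (gamma, beta).
W_cond = rng.normal(size=(cond_dim, 2 * d)) * 0.1
c = rng.normal(size=(cond_dim,))
gamma, beta = np.split(c @ W_cond, 2)

x = rng.normal(size=(N, d)) * 50.0 + 7.0  # deliberately badly scaled features

film = gamma * x + beta                   # FiLM: modulate raw features
adaln = gamma * layer_norm(x) + beta      # AdaLN: normalize first, then modulate

print(np.abs(film).max(), np.abs(adaln).max())
```

Rescaling `x` changes the FiLM output proportionally but leaves the AdaLN output essentially unchanged, which is one way to read the claim that normalization stabilizes the features before modulation.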
The simplest approach: just concatenate the conditioning tokens to the input sequence and let self-attention figure it out.
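Used in: LLaVA (image tokens concatenated to text), RT-2 (action tokens concatenated to perception), many VLMs. Simple, but expensive when the number of conditioning tokens M is large, because self-attention then runs over all N + M tokens at O((N+M)²) cost.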
Add learnable "virtual tokens" to the beginning of the sequence. These tokens carry the conditioning information and influence subsequent tokens through attention.
| Mechanism | Signal Type | Cost | Expressiveness | Best For |
|---|---|---|---|---|
| Cross-Attention | Rich sequence | High (O(NM)) | Highest — token-level selection | Text prompts, multi-modal fusion |
| AdaLN-Zero | Global vector | Very low (O(d)) | Medium — per-layer modulation | Timestep, class label, style |
| FiLM | Global vector | Very low (O(d)) | Medium — feature-wise scaling | Simple conditioning signals |
| Concatenation | Any sequence | High (O((N+M)²)) | High — full self-attention | Multi-modal with shared backbone |
| Prefix Tuning | Task/style | Low (O((M+N)²)) | Low-Medium — soft prompt | Task adaptation, few-shot |
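To put numbers on the cost column, the sketch below counts query-key score entries as a rough proxy for attention cost. The sizes reuse the N = 256 image tokens and M = 77 text tokens from the cross-attention example; `attention_pairs` is a hypothetical helper, and the count ignores heads, projections, and FFN cost.

```python
def attention_pairs(n_query, n_key):
    """Number of query-key score entries, a rough proxy for attention cost."""
    return n_query * n_key

N, M = 256, 77  # target tokens, conditioning tokens

costs = {
    "self-attention only": attention_pairs(N, N),
    "self + cross-attention": attention_pairs(N, N) + attention_pairs(N, M),
    "concatenation": attention_pairs(N + M, N + M),
}
for name, pairs in costs.items():
    print(f"{name:>24}: {pairs:,}")
# self-attention only: 65,536
# self + cross-attention: 85,248
# concatenation: 110,889
```

AdaLN and FiLM add no score entries at all, since they touch each token independently.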
Toggle between mechanisms to see how each injects the conditioning signal (orange) into the main representation (teal). Watch the data flow change.
When the DiT paper (Peebles & Xie, 2022) designed a transformer for diffusion, they compared cross-attention, AdaLN, and in-context conditioning. The conditioning signal was simple: a class label (integer 0-999) plus a diffusion timestep (integer 0-999). Both are single vectors, not sequences.
Cross-attention would create Q/K/V projections and attention weights for a source sequence of just two tokens (the timestep embedding and the class embedding). That's a lot of machinery for so little information. AdaLN converts the conditioning vector into scale/shift parameters, which is much more efficient.
The results: in the paper's comparison of block designs, AdaLN-Zero reached the lowest FID, ahead of cross-attention and in-context conditioning, and the final DiT-XL/2 model reports FID 2.27 on ImageNet 256×256. Simpler was better because the conditioning signal was simple.
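```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """DiT's conditioning mechanism: a transformer block modulated by a vector c."""
    def __init__(self, d_model, cond_dim, n_heads=8):
        super().__init__()
        # Sub-layers. The norms have no learned affine: c supplies scale and shift.
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # One MLP produces 6 modulation parameters per layer:
        #   gamma1, beta1, alpha1 (for attention)
        #   gamma2, beta2, alpha2 (for FFN)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * d_model))
        # Initialize output to zero → gates start at 0, so the block starts as a no-op
        nn.init.zeros_(self.mlp[1].weight)
        nn.init.zeros_(self.mlp[1].bias)

    def forward(self, x, c):
        # x: [batch, N, d_model]; c: [batch, cond_dim], e.g. timestep + class embedding
        params = self.mlp(c).unsqueeze(1)                 # [batch, 1, 6*d_model]
        g1, b1, a1, g2, b2, a2 = params.chunk(6, dim=-1)  # each [batch, 1, d_model]

        # Modulate attention sub-layer. DiT's reference code scales by (1 + gamma),
        # so gamma = 0 leaves the normalized features unchanged instead of zeroing them.
        h = (1 + g1) * self.norm1(x) + b1
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + a1 * h  # gated residual

        # Modulate FFN sub-layer
        h = self.ffn((1 + g2) * self.norm2(x) + b2)
        x = x + a2 * h  # gated residual
        return x
```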
Now we have all the pieces: the residual stream (Chapter 1), tokenization (Chapter 2), domain-agnostic attention (Chapter 3), cross-attention (Chapter 4), and the conditioning zoo (Chapter 5). It's time to see how they come together. This chapter is the payoff — a complete field guide to how each major domain adapted the transformer.
The playbook has exactly four steps, and every successful adaptation follows them:
The simplest and most influential adaptation. Dosovitskiy et al. asked: what's the minimum change needed to make a transformer process images?
| Component | Original Transformer | ViT |
|---|---|---|
| Tokenizer | Subword (BPE) | 16×16 patch + linear projection |
| Position | 1D sinusoidal/learned | 1D learned positional embeddings over the patch sequence |
| Attention | Causal (decoder) or bidirectional (encoder) | Bidirectional (all patches see all patches) |
| Conditioning | N/A | N/A (classification, no external signal) |
| Output | Token probabilities | [CLS] token → classification head |
| Core modified? | NO — identical attention + FFN | |
The total novelty: a patch embedding layer and learned position embeddings sized for the patch sequence. Everything else is copy-paste from BERT.
DiT replaced the U-Net in diffusion models with a transformer. The key challenge: diffusion models need to condition on a timestep (how noisy is the current image) and a class label (what to generate).
| Component | ViT | DiT |
|---|---|---|
| Tokenizer | Pixel patches | Latent patches (from VAE encoder) |
| Position | 1D learned | 2D sinusoidal (frequency-based) |
| Attention | Bidirectional global | Bidirectional global (same) |
| Conditioning | None | AdaLN-Zero (timestep + class → scale/shift/gate) |
| Output | [CLS] → class | All tokens → predicted noise (unpatchify) |
| Core modified? | NO — same attention + FFN blocks | |
DiT's novelty: AdaLN-Zero conditioning and operating on latent space patches instead of pixel patches. The transformer itself? Unchanged.
Video is images plus time. The challenge: a 32-frame video at ViT resolution creates 32 × 196 = 6,272 tokens. That's quadratic attention cost of O(6272²) ≈ 39M operations per attention layer. The solution: factored attention.
| Component | ViT | ViViT |
|---|---|---|
| Tokenizer | 2D patches | 3D tubelets (space × time) |
| Position | 2D | 3D (spatial + temporal, separable) |
| Attention | Global | Factored: spatial-only then temporal-only |
| Conditioning | None | None (classification) |
| Core modified? | Attention PATTERN changed (factored), but the mechanism is still standard dot-product attention | |
Factored attention: instead of one global attention over 6,272 tokens, do spatial attention (196 tokens within each frame) then temporal attention (32 tokens across frames for each spatial position). Cost drops from O(6272²) ≈ 39.3M score entries to O(196² × 32 + 32² × 196) ≈ 1.43M, roughly a 27× reduction.
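The arithmetic is quick to verify. This sketch counts attention score entries only (a proxy for cost that ignores heads and projections), using the 32-frame, 196-patch setup from the text.

```python
frames, patches_per_frame = 32, 196

joint = (frames * patches_per_frame) ** 2   # one global attention over all tokens
spatial = frames * patches_per_frame ** 2   # 196 x 196 attention inside each frame
temporal = patches_per_frame * frames ** 2  # 32 x 32 attention at each spatial position
factored = spatial + temporal

print(f"joint:    {joint:,}")                # 39,337,984
print(f"factored: {factored:,}")             # 1,430,016
print(f"ratio:    {joint / factored:.1f}x")  # 27.5x
```

Most of the factored cost comes from the spatial term, so the saving grows with the number of frames.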
RT-2 is perhaps the most elegant adaptation. Instead of designing a new architecture for robot control, the team took pretrained Vision-Language Models (PaLI-X and PaLM-E) and tokenized robot actions as text. The model generates action tokens the same way it generates word tokens.
| Component | PaLM-E (VLM) | RT-2 |
|---|---|---|
| Tokenizer | Text BPE + ViT patches | Same + discretized actions as text tokens |
| Position | 1D sequential | Same (actions are just more tokens in the sequence) |
| Attention | Causal (autoregressive) | Same |
| Conditioning | Image + text concatenated | Same |
| Core modified? | NO — literally zero architectural changes | |
RT-2 didn't modify the transformer AT ALL. It just mapped discretized actions onto tokens in the model's existing vocabulary. This is the purest example of the transformer's universality: the architecture doesn't even know it's controlling a robot.
Pick a target domain. Watch the base transformer morph — orange blocks are the parts that change, teal blocks stay identical. The percentages show how much of the total architecture changed.
After reviewing every major adaptation, the recipe crystallizes:
If you can answer these four questions for your domain, you can build a transformer for it. The core — attention + FFN + residual — stays identical. The engineering decisions are ALL in the adapter layers.
So far we've talked about adapting a single transformer to a new domain. But the real power emerges when you compose multiple pretrained models. You've trained a great vision encoder (DINOv2) and a great language model (LLaMA). How do you combine them into a VLM without retraining either from scratch?
This is the domain of composition patterns — the architectural strategies for connecting pretrained modules. Each pattern makes a different trade-off between flexibility, compute cost, and how much of the pretrained knowledge you preserve.
The most common pattern. You freeze the backbone (keep its weights fixed) and train a small adapter module that translates between representations.
Why freeze? Two reasons. First, the backbone already encodes enormously valuable knowledge from pretraining (often on billions of examples). Fine-tuning risks catastrophic forgetting — the model unlearns its general capabilities while learning the new task. Second, freezing is cheap — you only need gradients through the adapter, not the backbone.
Why adapter? The vision encoder and LLM typically have different embedding dimensions and different "languages" (the feature spaces don't align). The adapter bridges this gap. Different adapter designs have different expressiveness:
| Adapter | Mechanism | Params | Used By |
|---|---|---|---|
| Linear Projection | Single matrix: d_vision → d_llm | d_v × d_l | LLaVA |
| MLP | 2-layer MLP with GELU | ~2 × d_v × d_l | LLaVA-1.5 |
| Q-Former | Learnable queries cross-attend to vision features | ~188M | BLIP-2, InstructBLIP |
| Perceiver Resampler | Similar to Q-Former with latent array | ~50M | Flamingo |
```python
import torch
import torch.nn as nn

class LLaVA(nn.Module):
    def __init__(self, vision_encoder, llm, d_vision=1024, d_llm=4096):
        super().__init__()
        self.vision = vision_encoder  # CLIP ViT-L/14, FROZEN
        self.llm = llm                # Vicuna-7B, initially frozen, then unfrozen
        # The only new thing: a 2-layer MLP adapter (the LLaVA-1.5 variant)
        self.adapter = nn.Sequential(
            nn.Linear(d_vision, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )
        # Adapter params: 1024×4096 + 4096 + 4096×4096 + 4096 ≈ 21M
        # vs Vision encoder: ~300M, LLM: ~7B
        # Adapter is 0.3% of total model!

    def forward(self, image, text_embeds):
        # text_embeds: text tokens already embedded by the LLM, [1, T, 4096]
        # (raw token IDs cannot be concatenated with float vision features)
        # Step 1: Extract vision features (frozen)
        with torch.no_grad():
            vis_features = self.vision(image)       # [1, 256, 1024]
        # Step 2: Adapt to LLM space (trainable)
        vis_tokens = self.adapter(vis_features)     # [1, 256, 4096]
        # Step 3: Concatenate with text and run LLM on the embedded sequence
        combined = torch.cat([vis_tokens, text_embeds], dim=1)
        output = self.llm(combined)
        return output
```
Notice: the adapter is 0.3% of total parameters. Yet it bridges a 300M-parameter vision encoder with a 7B-parameter LLM. This ratio — tiny adapter, massive pretrained backbone — is the hallmark of efficient composition.
Two separate encoders process two modalities independently, producing embeddings in a shared space. The training objective aligns the spaces (e.g., contrastive loss pushes matching image-text pairs together and non-matching pairs apart).
Used by: CLIP, SigLIP, ALIGN. The encoders never directly interact — they communicate only through the shared embedding space. This makes dual encoders extremely efficient for retrieval (precompute all image embeddings, search by text) but limited for generation (no token-level cross-modal interaction).
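The contrastive objective can be sketched in a few lines. This is a minimal CLIP-style symmetric cross-entropy loss under simplifying assumptions: a fixed temperature of 0.07 (CLIP learns it), random tensors standing in for encoder outputs, and a hypothetical function name. SigLIP uses a sigmoid-based variant that is not shown here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature         # [B, B] scaled cosine similarities
    targets = torch.arange(img.size(0))          # row i should match column i
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

torch.manual_seed(0)
emb = torch.randn(4, 32)                                 # stand-in for encoder outputs
aligned = contrastive_loss(emb, emb)                     # every pair matches
shuffled = contrastive_loss(emb, emb.roll(1, dims=0))    # every pair mismatched
```

The aligned batch gives a much lower loss than the shuffled one, which is the signal that pulls matching pairs together in the shared space.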
Instead of one FFN per layer, use N expert FFNs and a router that selects which expert(s) process each token. This is a composition pattern because it combines multiple specialized sub-networks within a single architecture.
Why it works: Different tokens route to different experts, creating implicit specialization. In multilingual models, different languages tend to cluster on different experts. In multimodal models, image tokens and text tokens may use different experts. With 8 experts and top-2 routing, the layer stores roughly 8× the FFN parameters of a dense layer but spends only about 2× the FFN compute per token.
```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: [batch, seq, d_model]
        gates = torch.softmax(self.router(x), dim=-1)        # [B, S, N]
        top_vals, top_idx = gates.topk(self.top_k, dim=-1)   # [B, S, top_k]
        # Only compute the top-k experts per token
        # (gate values are used as-is, without renormalizing over the top-k)
        output = torch.zeros_like(x)
        for i in range(self.top_k):
            expert_idx = top_idx[..., i]  # which expert for this slot
            weight = top_vals[..., i]     # gate value
            for e in range(len(self.experts)):
                mask = (expert_idx == e)  # tokens routed to expert e in this slot
                if mask.any():
                    output[mask] += weight[mask].unsqueeze(-1) * self.experts[e](x[mask])
        return output
```
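The gap between stored parameters and active parameters can be checked with simple arithmetic. This is a sketch under assumed sizes (d_model = 512, 4× expansion, biases included, hypothetical helper name) and it counts only the FFN and router, not attention.

```python
def ffn_params(d_model: int, expansion: int = 4) -> int:
    """Weights plus biases of one two-layer FFN (d -> expansion*d -> d)."""
    hidden = expansion * d_model
    return (d_model * hidden + hidden) + (hidden * d_model + d_model)

d_model, n_experts, top_k = 512, 8, 2            # illustrative sizes
dense = ffn_params(d_model)                      # one ordinary FFN
router = d_model * n_experts + n_experts         # linear router

moe_total = n_experts * dense + router           # parameters stored in the layer
moe_active = top_k * dense + router              # parameters touched per token

print(f"stored: {moe_total / dense:.2f}x dense")   # stored: 8.00x dense
print(f"active: {moe_active / dense:.2f}x dense")  # active: 2.00x dense
```

The router is negligible next to the experts, so capacity scales with the number of experts while per-token cost scales with `top_k`.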
The simplest composition: a shared backbone (pretrained transformer) with task-specific heads (small networks appended to the output). The backbone extracts general features; the head adapts to the task.
| Task | Head | Input from Backbone |
|---|---|---|
| Classification | Linear: d → N_classes | [CLS] token or mean pool |
| Detection | Transformer decoder + FFN | All token features (DETR) |
| Segmentation | Upsampling + per-pixel classifier | All tokens, unpatchified |
| Generation | Linear: d → vocab_size | Last token (autoregressive) |
| Robot Control | Action tokenizer (discretize) | Action token positions |
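Two rows of the table can be sketched as follows. The `MultiHeadModel` wrapper and the tiny stand-in backbone are hypothetical and only illustrate the wiring; a real setup would load a pretrained transformer as the backbone.

```python
import torch
import torch.nn as nn

class MultiHeadModel(nn.Module):
    """Hypothetical wrapper: one shared backbone, one small head per task."""
    def __init__(self, backbone, d_model, n_classes, vocab_size):
        super().__init__()
        self.backbone = backbone
        self.cls_head = nn.Linear(d_model, n_classes)    # classification head
        self.lm_head = nn.Linear(d_model, vocab_size)    # generation head

    def forward(self, tokens, task):
        feats = self.backbone(tokens)                    # [B, N, d_model]
        if task == "classify":
            return self.cls_head(feats.mean(dim=1))      # mean pool -> [B, n_classes]
        if task == "generate":
            return self.lm_head(feats[:, -1])            # last token -> [B, vocab_size]
        raise ValueError(f"unknown task: {task}")

# Stand-in backbone (randomly initialized, for shape checking only)
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
model = MultiHeadModel(backbone, d_model=64, n_classes=10, vocab_size=100)

x = torch.randn(2, 16, 64)                   # already-embedded tokens
class_logits = model(x, "classify")          # shape [2, 10]
next_token_logits = model(x, "generate")     # shape [2, 100]
```

Both heads read the same backbone features, so adding a task means adding one small module rather than a new model.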
Select a composition pattern. Blue = frozen, orange = trainable. Watch how data flows between components.
The billion-dollar question: should you freeze the backbone or fine-tune it? Here are the actual decision factors:
| Factor | Freeze | Fine-tune |
|---|---|---|
| Training data | Small (<100K examples) | Large (>1M examples) |
| Domain gap | Small (natural images → natural images) | Large (natural images → medical images) |
| Compute budget | Low (only train adapter) | High (gradients through everything) |
| Risk of forgetting | High (backbone knowledge is critical) | Low (task-specific performance matters more) |
| Multi-task | Yes (shared backbone, per-task adapters) | No (fine-tuned model is task-specific) |
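In PyTorch, the "freeze" column usually comes down to turning off `requires_grad` on the backbone and handing only the remaining parameters to the optimizer. The sketch below uses a small stand-in backbone and adapter with made-up sizes; the `freeze` and `count_params` helpers are illustrative names.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Stop gradient updates for every parameter in the module."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()

def count_params(module: nn.Module, trainable_only: bool = False) -> int:
    return sum(p.numel() for p in module.parameters()
               if p.requires_grad or not trainable_only)

# Stand-ins: a real backbone would be a pretrained encoder
backbone = freeze(nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256)))
adapter = nn.Linear(256, 64)
model = nn.Sequential(backbone, adapter)

# Only trainable parameters go to the optimizer
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-3)
loss = model(torch.randn(8, 256)).pow(2).mean()   # dummy objective
loss.backward()
optimizer.step()

total = count_params(model)                           # 542016
trainable = count_params(model, trainable_only=True)  # 16448
```

Fine-tuning is the same code without the `freeze` call, at the cost of storing gradients and optimizer state for every backbone parameter.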
We've established that the transformer is universal because its core is domain-agnostic and its adapter layers are thin. But there's one more mystery: why do deeper transformers consistently outperform wider ones at the same parameter count? GPT-3 has 96 layers. GPT-4 is rumored to have even more. Why not use 10 very wide layers instead?
The answer involves three interrelated ideas: feature hierarchies, the residual stream view, and the lottery ticket hypothesis. Together, they explain why stacking transformer layers works — and predict when it stops working.
Early layers learn simple patterns. Middle layers compose them into complex ones. Late layers build task-specific representations. This holds across every domain:
| Layer Depth | Language (GPT) | Vision (ViT) | Diffusion (DiT) |
|---|---|---|---|
| Early (1-4) | Word identity, punctuation | Edges, colors, textures | Low-frequency noise patterns |
| Middle (5-8) | Syntax, phrase structure | Object parts, spatial relationships | Object shapes, layout |
| Late (9-12) | Semantics, reasoning | Object categories, scenes | Fine details, textures |
This hierarchy emerges naturally from training — nobody programs it. Depth creates the representational capacity for this hierarchy. A shallow network (2-3 layers) can't build the compositional features that a deep network can.
Here's a concrete trace. In a 12-layer ViT classifying "golden retriever":
Each layer adds one level of abstraction. You can't jump from "brown pixels" to "golden retriever" in one layer — the gap is too large. You need intermediate representations.
From Chapter 1, we know each layer reads from and writes to a shared residual stream. This gives us a powerful way to think about depth: each layer makes a small edit to the stream. More layers = more edits = richer final representation.
Anthropic's research (Elhage et al., 2021) formalized this as the "residual stream" view of transformers. They showed that:
Attention heads READ from and WRITE to the stream independently. A head in layer 5 might read information written by a head in layer 2, even though layers 3 and 4 are in between. The residual connection enables this long-range communication.
The stream accumulates features; it doesn't replace them. After layer 1, the stream contains: original input + layer 1's contribution. After layer 12: original input + all 12 layers' contributions. Each layer only adds to the stream, so nothing is overwritten along the way.
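The additive picture can be shown with a toy example. This is a sketch in plain Python where the "layers" are arbitrary stand-in functions, not real attention or FFN blocks; it only demonstrates that the final stream equals the input plus the sum of every layer's contribution.

```python
def run_residual_stream(x, layers):
    """Apply layers additively: each one reads the stream and adds a delta to it."""
    stream = list(x)
    deltas = []
    for layer in layers:
        delta = layer(stream)                              # layer READS the current stream
        deltas.append(delta)
        stream = [s + d for s, d in zip(stream, delta)]    # and WRITES by addition
    return stream, deltas

# Toy stand-ins for transformer blocks
layers = [
    lambda s: [0.5 * v for v in s],
    lambda s: [1.0 for _ in s],
    lambda s: [-0.25 * v for v in s],
]

x = [1.0, 2.0, -4.0]
final, deltas = run_residual_stream(x, layers)

# Input plus the sum of all contributions reproduces the final stream
reconstructed = [xi + sum(d[i] for d in deltas) for i, xi in enumerate(x)]
print(final, reconstructed)   # [1.875, 3.0, -3.75] [1.875, 3.0, -3.75]
```

Because each delta is computed from the stream as it stands at that point, a later layer can read what any earlier layer wrote.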
Adjust depth and width at constant parameter count. Watch how the feature hierarchy changes. Deep models build layered abstractions. Wide models compute more features per layer but can't compose them as deeply.
The lottery ticket hypothesis (Frankle & Carbin, 2019) suggests that large networks work because they contain many "lottery tickets": small sub-networks that, if trained in isolation, would achieve good performance. Deeper networks contain exponentially more potential sub-networks because depth creates combinatorial diversity.
Think of it this way: a 12-layer network with 12 attention heads per layer has 144 heads total. But the number of circuits (paths through specific heads across layers) grows exponentially with depth. A 2-layer network with 12 heads per layer has at most 12 × 12 = 144 circuits. A 12-layer network has 12¹² ≈ 8.9 × 10¹² potential circuits. More depth = more lottery tickets = higher chance of finding a good solution.
Let's compare two models with roughly comparable parameter counts (about 75M vs. 85M):
```python
# Model A: Deep and narrow
d_model_A = 512
n_layers_A = 24
params_per_layer_A = 4 * d_model_A**2 + 8 * d_model_A**2  # attn + FFN
total_A = params_per_layer_A * n_layers_A
print(f"Model A (24 layers, d=512): {total_A/1e6:.1f}M")   # 75.5M

# Model B: Shallow and wide
d_model_B = 1536
n_layers_B = 3
params_per_layer_B = 4 * d_model_B**2 + 8 * d_model_B**2
total_B = params_per_layer_B * n_layers_B
print(f"Model B (3 layers, d=1536): {total_B/1e6:.1f}M")   # 84.9M

# Similar parameter count, but (assuming 12 heads per layer):
# - Model A: 24 levels of abstraction, 12^24 possible circuits
# - Model B: 3 levels of abstraction, 12^3 = 1,728 circuits
# Extremely shallow models like B tend to lose at matched parameter count.
# (Kaplan et al., 2020 report that shape matters only weakly for less extreme ratios.)
```
Depth isn't free. Three failure modes:
1. Diminishing returns. Each additional layer adds less new information. Going from 12 to 24 layers helps a lot. Going from 96 to 192 helps very little. The scaling law of Kaplan et al. (2020) fits loss as a power law in non-embedding parameter count N: L(N) ∝ N^(−α) with α ≈ 0.076 for transformers. Doubling depth at fixed width roughly doubles N, so it lowers loss by only about 5% (2^(−0.076) ≈ 0.95), and each further doubling buys a smaller absolute gain.
2. Training instability. Very deep networks (100+ layers) become harder to train. Gradients, despite residual connections, can still accumulate numerical errors. This is why techniques like pre-norm (LayerNorm before attention, not after) became standard for deep transformers.
3. Inference latency. Layers execute sequentially, so you can't parallelize depth. A 96-layer model runs 96 layer computations one after another in every forward pass. Width, by contrast, parallelizes across GPU cores. For real-time applications, a shallower, wider model might be faster even if slightly less accurate.
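The "~5% per doubling" figure in item 1 follows directly from the exponent. The sketch below assumes a pure power law with α ≈ 0.076 in whatever quantity is being doubled and ignores any irreducible loss floor; `loss_ratio` is an illustrative helper name.

```python
def loss_ratio(scale_factor: float, alpha: float = 0.076) -> float:
    """Relative loss after scaling by scale_factor, assuming loss ∝ x^(-alpha)."""
    return scale_factor ** (-alpha)

one_doubling = loss_ratio(2)          # about 0.949
to_halve_loss = 2 ** (1 / 0.076)      # scale-up factor needed to cut the loss in half

print(f"{(1 - one_doubling) * 100:.1f}% lower loss per doubling")   # 5.1% lower loss per doubling
print(f"about {to_halve_loss:,.0f}x scale-up to halve the loss")
```

With an exponent this small, halving the loss would take a scale-up of several thousand times, which is why each extra layer in an already deep model buys so little.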
You've just learned the architectural design patterns that make the transformer a universal backbone. Let's map what we covered to where you can go deeper.
| Concept | Key Insight | When You Need It |
|---|---|---|
| Residual Stream | Layers edit a shared stream, not transform it | Understanding why layers are modular and composable |
| Tokenize Everything | Convert any domain to [N, d_model] | Adapting transformers to new data types |
| Agnostic Attention | Attention is a set operation — domain-free | Understanding why one mechanism works everywhere |
| Cross-Attention | Q from target, K/V from source — universal conditioning | Building multi-modal or conditioned models |
| Conditioning Zoo | Match mechanism complexity to signal complexity | Choosing between cross-attn, AdaLN, FiLM, concat, prefix |
| Retrofitting | 4 steps: tokenizer, position, attention pattern, conditioning | Adapting transformers to any new domain |
| Composition | Frozen backbone + adapter is 0.3% params | Combining pretrained models without retraining |
| Depth | Depth creates hierarchies + exponential circuits | Deciding model shape (depth vs width) |
| Want to Go Deeper On... | Read This |
|---|---|
| How self-attention works from scratch | Gleam: Transformer |
| How attention + FFN work at a component level | Gleam: Attention & Transformers |
| Vision transformers and image representations | Deep-Dive: Vision Transformers |
| Multi-modal fusion patterns in depth | Deep-Dive: Multimodal Fusion |
| DiT and diffusion architectures | Deep-Dive: Architectures & Conditioning |
| Diffusion models from zero | Gleam: Diffusion |
| Flow matching (DiT's denoising objective) | Gleam: Flow Matching |
| VLMs (how vision + language compose) | Gleam: VLM |
| VLAs (how language controls robots) | Gleam: VLA |
| Contrastive learning and CLIP | Gleam: Contrastive & CLIP |
| Model compression and efficiency | Gleam: Model Compression |
| Efficient architectures (beyond vanilla transformer) | Gleam: Efficient Architectures |
| World models and predictive architectures | Gleam: World Models |
| The DiT paper in detail | Paper: DiT |
| The ViT paper in detail | Paper: Vision Transformer |
The transformer's universality isn't an accident. It's the result of four deliberate design decisions that, together, create a maximally reusable architecture:
We are living through a remarkable convergence in AI architecture. For the first time in the field's history, the same design is state-of-the-art across nearly every modality and task. Understanding the design patterns behind that universality — which is what this lesson taught — is arguably the single most important architectural insight in modern AI.
"The transformer is not the final architecture. But it is the first universal one."