The architecture behind GPT, BERT, LLaMA, and every frontier language model. One paper changed everything — here's how it works.
Text is a sequence of words. Audio is a sequence of samples. Video is a sequence of frames. A stock price is a sequence of values over time. Before the Transformer, we processed sequences with recurrent neural networks (RNNs) — one element at a time, left to right. It was slow and it forgot things.
The Transformer processes every element at once. Instead of a conveyor belt, it's a spotlight that shines on the entire sequence simultaneously. This parallelism is why Transformers train so fast on GPUs — and why they scale to billions of parameters.
Imagine reading the sentence: "The cat sat on the mat because it was tired." What does "it" refer to? You instantly know it means "the cat." Your brain attends to "cat" when processing "it." That's attention.
In a neural network, each token is a vector (a list of numbers). Attention lets each token compute a weighted average over the vectors of all tokens in the sequence, its own included. The weights come from dot products, which measure how similar two vectors are. Similar tokens get high weights; irrelevant ones get weights near zero.
But we don't compare raw token vectors directly. Each token is projected through three learned linear layers to produce three separate vectors: Query (Q), Key (K), and Value (V). Here's the actual tensor math:
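```python
import torch

# Input: x has shape [batch, seq_len, d_model]
# e.g. batch=1, seq_len=10 tokens, d_model=512
x = torch.randn(1, 10, 512)

# Learned projection matrices (random here, for illustration only)
W_Q, W_K, W_V = (torch.randn(512, 512) for _ in range(3))

Q = x @ W_Q   # [1, 10, 512] @ [512, 512] → [1, 10, 512]
K = x @ W_K   # [1, 10, 512] @ [512, 512] → [1, 10, 512]
V = x @ W_V   # [1, 10, 512] @ [512, 512] → [1, 10, 512]
```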
Three weight matrices, each [d_model, d_model]. That's 512 × 512 × 3 = 786,432 parameters just for one attention layer's projections. These matrices are learned during training — they determine what "looking for" (Q), "containing" (K), and "carrying" (V) mean.
Next, compute the full attention in one shot:
```python
import math
import torch

d_k = Q.shape[-1]                         # 512 in this single-head example
scores = Q @ K.transpose(-2, -1)          # [1, 10, 512] @ [1, 512, 10] → [1, 10, 10]
scores = scores / math.sqrt(d_k)          # scale (Chapter 2 explains why)
weights = torch.softmax(scores, dim=-1)   # [1, 10, 10]; each row sums to 1
output = weights @ V                      # [1, 10, 10] @ [1, 10, 512] → [1, 10, 512]
```
The output has the same shape as the input: [batch, seq_len, d_model]. Each token's output is now a weighted mix of all tokens' Value vectors, where the weights are determined by Query-Key similarity. That's the entire mechanism.
Input is [B, T, D]. Output is [B, T, D]. In between, we create a [B, T, T] attention matrix: that's the T × T "who attends to whom" map. For a 2048-token sequence, that's a 2048 × 2048 ≈ 4.2M entry matrix. This is why attention is O(n²) in sequence length.
Click any token to see its dot-product similarity with every other token. Brighter = higher similarity.
Attention shares a deep principle with the Kalman filter: compute a weighted combination where the weights reflect the quality of the information. The Kalman gain asks "how much should I trust this measurement?" Attention asks "how much should I trust this token's information?" The analogy is loose, though: the Kalman gain is provably optimal under its linear-Gaussian assumptions, while attention weights are learned and carry no such guarantee.
Can you spot this same "optimal weighting" pattern when we reach Mixture-of-Experts routing in Chapter 9?
Raw attention has three ingredients. Each token produces three vectors by multiplying its embedding with learned weight matrices: a Query (Q), a Key (K), and a Value (V).
The attention score between tokens i and j is $Q_i \cdot K_j$. We divide by $\sqrt{d_k}$ to prevent the dot products from getting too large (which would make softmax saturate and kill gradients). Then softmax converts scores to weights that sum to 1.
Four tokens with 2D Q and K vectors. Watch how weights shift as you drag the query vector of the selected token.
Assume each entry of Q and K is drawn i.i.d. from N(0, 1). The dot product is $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, a sum over $d_k$ dimensions.
Your task: Derive the variance of this dot product. Then explain why dividing by $\sqrt{d_k}$ (not $d_k$, not 1) is the correct normalization.
Full derivation:
1. Each entry $q_i, k_i \sim N(0, 1)$, independent.
2. $\mathrm{Var}(q_i k_i) = E[q_i^2]\,E[k_i^2] - (E[q_i]\,E[k_i])^2 = 1 \cdot 1 - 0 = 1$
3. The dot product sums $d_k$ such independent terms: $\mathrm{Var}(q \cdot k) = d_k$
4. Standard deviation = $\sqrt{d_k}$. For $d_k = 64$, typical dot products are around ±8.
5. Dividing by $\sqrt{d_k}$ normalizes to unit variance: scores mostly stay in the [-3, 3] range where softmax has healthy gradients.
The key insight: This isn't a heuristic. It's the unique scaling that preserves unit variance regardless of head dimension. It falls directly out of the statistics.
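The derivation can also be checked empirically. The sketch below is a small Monte Carlo estimate in plain Python; the function name, sample count, and seed are illustrative choices, not part of the original text.

```python
import random

def dot_product_variance(d_k, n_samples=4000, seed=0):
    """Estimate Var(q . k) for q, k with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(qi * ki for qi, ki in zip(q, k)))
    mean = sum(dots) / n_samples
    # Unbiased sample variance
    return sum((x - mean) ** 2 for x in dots) / (n_samples - 1)

d_k = 64
raw_var = dot_product_variance(d_k)
scaled_var = raw_var / d_k   # Var(x / sqrt(d_k)) = Var(x) / d_k
print(f"raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

The raw estimate should land close to 64 and the scaled one close to 1, up to sampling noise.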
Let's prove this rigorously. Assume each entry of Q and K is drawn independently from N(0, 1): mean zero, variance one. The dot product of two $d_k$-dimensional vectors is:
$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$
Each term $q_i k_i$ has mean 0 and variance 1 × 1 = 1. Since the terms are independent, the sum has mean 0 and variance $d_k$. By the central limit theorem it is approximately $N(0, d_k)$ for large $d_k$.
For $d_k = 64$, a typical dot product might be ±8. Let's see what softmax does with large values:
```text
# Concrete numbers
# Without scaling (d_k = 64):
raw scores = [8.0, 0.1, -0.3]
softmax    = [0.9994, 0.0004, 0.0002]   # almost one-hot!

# With scaling (divide by √64 = 8):
scaled     = [1.0, 0.0125, -0.0375]
softmax    = [0.579, 0.216, 0.205]      # smooth distribution
```
Without scaling, dot products grow with $d_k$. For $d_k = 64$, scores reach ±8. Softmax of [8, 0.1, -0.3] ≈ [0.9994, 0.0004, 0.0002], which is essentially one-hot. The gradient of softmax at saturation is nearly zero: ∂softmax/∂z ≈ 0. This means the attention weights barely update. The model tends to lock onto whichever token happened to have the highest initial score and struggles to redistribute attention, so learning in the attention layers slows to a crawl.
The specific failure: gradient vanishing in the attention weights, not in the value path. The FFN still trains, but attention becomes a random fixed lookup — catastrophic for language modeling.
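To make the saturation argument concrete, here is a minimal sketch that computes the softmax of the example scores and the diagonal of the softmax Jacobian, p_i · (1 − p_i), with and without scaling. The helper names are illustrative, and the diagonal is only one slice of the full gradient.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_self_gradients(scores):
    """Diagonal of the softmax Jacobian: d p_i / d z_i = p_i * (1 - p_i)."""
    return [p * (1.0 - p) for p in softmax(scores)]

raw = [8.0, 0.1, -0.3]
scaled = [s / math.sqrt(64) for s in raw]        # divide by sqrt(d_k) = 8

raw_probs = softmax(raw)
scaled_probs = softmax(scaled)
raw_grads = softmax_self_gradients(raw)
scaled_grads = softmax_self_gradients(scaled)

print("raw   :", [round(p, 4) for p in raw_probs], "max grad", max(raw_grads))
print("scaled:", [round(p, 4) for p in scaled_probs], "max grad", max(scaled_grads))
```

The unscaled gradients come out hundreds of times smaller than the scaled ones, which is the vanishing-gradient effect described above.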
One set of Q, K, V can only learn one type of pattern. But language has many simultaneous relationships: syntax (subject-verb), coreference (pronoun-noun), semantic similarity, positional patterns. Multi-head attention runs several attention operations in parallel, each with its own learned Q, K, V projections.
If the model dimension is d = 512 and we use h = 8 heads, each head works with d/h = 64 dimensions. After computing attention independently, we concatenate all head outputs and project back to the full dimension.
Select a head to see its attention pattern. Each head learns to focus on different relationships.
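A common misconception concerns how this is implemented: multi-head attention is NOT computed as 8 separate attention calls with 8 separately stored weight matrices. Each head does have its own projection, but the per-head matrices are stacked side by side into one [512, 512] matrix, so in practice you do one big projection and then reshape: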
```python
# d_model=512, n_heads=8, d_k=64
Q = x @ W_Q                          # [B, T, 512] @ [512, 512] → [B, T, 512]

# Reshape into heads:
Q = Q.view(B, T, 8, 64)              # split last dim into 8 heads of 64
Q = Q.transpose(1, 2)                # [B, 8, T, 64]; heads become a batch dim

# Same for K and V. Now attention is a single batched matmul:
scores = Q @ K.transpose(-2, -1)     # [B, 8, T, 64] @ [B, 8, 64, T] → [B, 8, T, T]
# 8 independent T×T attention matrices, computed in one GPU kernel
```
After attention, we concatenate heads and project back:
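```python
import torch

weights = torch.softmax(scores / 8, dim=-1)              # scale by √d_k = √64 = 8
out = weights @ V                                        # [B, 8, T, T] @ [B, 8, T, 64] → [B, 8, T, 64]
out = out.transpose(1, 2).contiguous().view(B, T, 512)   # concat heads
out = out @ W_O                                          # [B, T, 512] @ [512, 512] → [B, T, 512]
```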
```python
import torch.nn.functional as F

def multi_head_attention(x, W_Q, W_K, W_V, W_O, n_heads):
    B, T, d_model = x.shape
    d_k = d_model // n_heads

    # Project
    Q = x @ W_Q                                     # [B, T, d_model]
    K = x @ W_K
    V = x @ W_V

    # Reshape into heads
    Q = Q.view(B, T, n_heads, d_k).transpose(1, 2)  # [B, h, T, d_k]
    K = K.view(B, T, n_heads, d_k).transpose(1, 2)
    V = V.view(B, T, n_heads, d_k).transpose(1, 2)

    # Scaled dot-product attention
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    out = weights @ V                               # [B, h, T, d_k]

    # Concat heads and project
    out = out.transpose(1, 2).contiguous().view(B, T, d_model)
    return out @ W_O
```
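The "one big projection, then reshape" trick works because slicing the columns of a matrix product is the same as multiplying by the sliced matrix. The following sketch checks that on tiny made-up sizes in plain Python. The `matmul` and `column_slice` helpers are illustrative, and only the Q projection is shown.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_slice(m, start, stop):
    return [row[start:stop] for row in m]

rng = random.Random(0)
T, d_model, n_heads = 3, 8, 2          # tiny illustrative sizes
d_k = d_model // n_heads

x = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(T)]
W_Q = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_model)]

# Fused: one [T, d_model] projection, then split the columns into heads.
Q_fused = matmul(x, W_Q)
heads_from_fused = [column_slice(Q_fused, h * d_k, (h + 1) * d_k)
                    for h in range(n_heads)]

# Separate: each head multiplies by its own [d_model, d_k] slice of W_Q.
heads_separate = [matmul(x, column_slice(W_Q, h * d_k, (h + 1) * d_k))
                  for h in range(n_heads)]

max_diff = max(abs(a - b)
               for hf, hs in zip(heads_from_fused, heads_separate)
               for row_f, row_s in zip(hf, hs)
               for a, b in zip(row_f, row_s))
print("largest difference between the two approaches:", max_diff)
```

Each head therefore still has its own learned projection; the fused matrix is just a more GPU-friendly way to store and apply all of them at once.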
Attention is permutation-equivariant: if you shuffle the input tokens, the outputs are shuffled in exactly the same way, because the mechanism itself has no notion of order. "Cat sat mat" and "Mat cat sat" would produce the same attention weights between each pair of words, just rearranged. We need to inject position information explicitly.
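A quick way to see this order-blindness is to run self-attention on the same tokens in two different orders and compare each token's output. The sketch below uses made-up 2-D embeddings and identity Q/K/V projections, which is a simplifying assumption; learned projections would behave the same way.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens))
                        for j in range(d)])
    return outputs

cat, sat, mat = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]   # hypothetical embeddings
out_a = self_attention([cat, sat, mat])               # "cat sat mat"
out_b = self_attention([mat, cat, sat])               # "mat cat sat"

# "cat" is at index 0 in the first order and index 1 in the second.
print("cat, order A:", out_a[0])
print("cat, order B:", out_b[1])
```

Without position information, "cat" gets the same output vector wherever it appears, which is exactly the gap positional encodings fill.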
Each row is a position (0–31). Each column is a dimension. Color = encoding value. Notice the wave patterns at different frequencies.
The original Transformer paper claims: "for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos)." This means the model can learn to attend to "3 tokens back" using a simple linear transformation.
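Your task: Given $PE(\text{pos}, 2i) = \sin(\text{pos} / 10000^{2i/d})$ and $PE(\text{pos}, 2i+1) = \cos(\text{pos} / 10000^{2i/d})$, prove that $PE(\text{pos}+k) = M_k \cdot PE(\text{pos})$ for some matrix $M_k$ that depends only on k, not on pos.
Proof: Let $\omega_i = 1 / 10000^{2i/d}$. For each dimension pair (2i, 2i+1):
$PE(\text{pos}+k, 2i) = \sin(\omega_i(\text{pos}+k)) = \sin(\omega_i \text{pos})\cos(\omega_i k) + \cos(\omega_i \text{pos})\sin(\omega_i k)$
$PE(\text{pos}+k, 2i+1) = \cos(\omega_i(\text{pos}+k)) = \cos(\omega_i \text{pos})\cos(\omega_i k) - \sin(\omega_i \text{pos})\sin(\omega_i k)$
In matrix form:
$$\begin{bmatrix} PE(\text{pos}+k, 2i) \\ PE(\text{pos}+k, 2i+1) \end{bmatrix} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix} \begin{bmatrix} PE(\text{pos}, 2i) \\ PE(\text{pos}, 2i+1) \end{bmatrix}$$
The 2×2 matrix is a rotation by angle $\omega_i k$ (clockwise in the usual convention), and it contains no pos. The full $M_k$ is block-diagonal with d/2 such rotation blocks.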
The key insight: Sinusoidal position encodings encode relative position as a rotation. The model's linear layers can learn these rotation matrices, letting it attend to "k positions back" regardless of absolute position. This is why the original paper hoped sinusoidal encodings would extrapolate to sequence lengths longer than those seen in training, and why RoPE (which applies rotation directly to Q and K) is the modern evolution of this idea.
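The rotation property is easy to verify numerically. This sketch builds the sinusoidal encoding for one position, applies the block-diagonal shift one 2×2 block at a time, and compares against the encoding computed directly at pos + k. The function names and the small d_model are illustrative choices.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding: even dims use sin, odd dims use cos."""
    pe = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(omega * pos), math.cos(omega * pos)])
    return pe

def shift(pe, k, d_model):
    """Apply the block-diagonal matrix M_k to PE(pos), one 2x2 block per pair."""
    out = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        cos_k, sin_k = math.cos(omega * k), math.sin(omega * k)
        out.extend([cos_k * s + sin_k * c,       # sin(omega * (pos + k))
                    -sin_k * s + cos_k * c])     # cos(omega * (pos + k))
    return out

d_model, pos, k = 16, 7, 3
direct = positional_encoding(pos + k, d_model)
rotated = shift(positional_encoding(pos, d_model), k, d_model)
max_error = max(abs(a - b) for a, b in zip(direct, rotated))
print("max |PE(pos+k) - M_k PE(pos)| =", max_error)
```

Note that `shift` never looks at `pos`: the same coefficients work for every starting position, which is the content of the claim.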
An encoder block takes a sequence and returns a refined sequence of the same shape. It has two sub-layers, each wrapped in a residual connection (add the input back) and layer normalization. The residual connections are critical — they let gradients flow straight through, enabling very deep stacks.
Watch a token vector flow through each sub-layer. The residual stream carries information forward.
Let's trace a concrete example through one encoder block. Assume d_model = 512, 8 heads, sequence length = 10:
```python
# Input: x = [B, 10, 512]
# Note: this trace uses the pre-norm ordering (LayerNorm before each sub-layer);
# the original paper applied LayerNorm after each residual add.

# Step 1: LayerNorm (normalize each token vector independently)
x_norm = layer_norm(x)            # [B, 10, 512]; same shape

# Step 2: Multi-Head Self-Attention
attn_out = mha(x_norm)            # [B, 10, 512]; same shape

# Step 3: Residual add
x = x + attn_out                  # [B, 10, 512]; ADD, not replace

# Step 4: LayerNorm again
x_norm = layer_norm(x)            # [B, 10, 512]

# Step 5: Feed-Forward Network (the 4× expansion)
h = x_norm @ W1 + b1              # [B, 10, 512] @ [512, 2048] → [B, 10, 2048]
h = gelu(h)                       # activation function
ffn_out = h @ W2 + b2             # [B, 10, 2048] @ [2048, 512] → [B, 10, 512]

# Step 6: Residual add
x = x + ffn_out                   # [B, 10, 512]; still same shape!
```
The FFN expands from 512 to 2048 (4×), applies a nonlinearity, then projects back to 512. Why 4×? It's a design choice, not derived from theory. The original paper used 4× and it stuck. Some modern models use 8/3× with gated variants (SwiGLU). The expansion gives each token a wider space to "think" before compressing back.
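One consequence of the 4× expansion is that the FFN holds roughly twice as many weights as the attention projections. The sketch below counts them for the shapes traced above. It assumes bias-free attention projections (W_Q, W_K, W_V, W_O), an FFN with biases, and ignores LayerNorm parameters; real models vary on all three.

```python
def encoder_block_params(d_model, ffn_mult=4):
    """Count weights in one block under the assumptions stated above."""
    d_ff = ffn_mult * d_model
    attention = 4 * d_model * d_model                        # W_Q, W_K, W_V, W_O
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model   # W1, b1, W2, b2
    return {"attention": attention, "ffn": ffn, "total": attention + ffn}

counts = encoder_block_params(512)
for name, n in counts.items():
    print(f"{name:>9}: {n:,}")
```

For d_model = 512 this gives about 1.05M attention weights and about 2.10M FFN weights per block.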
The decoder block has the same structure as the encoder, plus two crucial additions: causal masking and (in encoder-decoder models) cross-attention.
Causal masking: During generation, token i must not see tokens i+1, i+2, ... (the future). We achieve this by setting those attention scores to −∞ before softmax, which forces their weights to zero. This creates a lower-triangular attention matrix.
Cross-attention: In translation models, the decoder attends to the encoder's output. The decoder provides Q; the encoder provides K and V. This is how the decoder "reads" the source language.
The attention matrix before and after masking. White cells are visible; dark cells are masked (−∞). Each token can only see itself and earlier tokens.
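Causal masking as described above can be sketched in a few lines: replace every score above the diagonal with −∞, then apply softmax row by row. The score values here are made up for illustration.

```python
import math

def causal_attention_weights(scores):
    """Mask future positions with -inf, then apply softmax row by row."""
    T = len(scores)
    weights = []
    for i in range(T):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(T)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.5, 1.0, -0.2],
          [0.1, 0.3, 2.0],
          [1.5, -1.0, 0.4]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is exactly [1, 0, 0]; every row still sums to 1 over the positions it is allowed to see.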
Training a Transformer for language modeling is deceptively simple: given a sequence of tokens, predict the next token at every position. The loss function is cross-entropy between the predicted probability distribution and the actual next token.
Teacher forcing: During training, we don't use the model's own predictions as input for the next step. Instead, we always feed the true previous tokens. This is faster and more stable than autoregressive training, but it means the model never sees its own mistakes during training.
The model predicts a probability for the correct next token. Drag the slider to see how loss changes. Higher confidence in the right answer = lower loss.
| Concept | What It Does |
|---|---|
| Next-token prediction | The training objective: predict $x_{t+1}$ from $x_1, \ldots, x_t$ |
| Cross-entropy loss | Measures how far predicted distribution is from truth |
| Teacher forcing | Use true tokens (not predictions) as input during training |
| AdamW optimizer | Adaptive learning rate + weight decay |
| Warmup + cosine decay | Gradually increase then decrease learning rate |
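As a minimal sketch of the objective, the function below computes the average next-token cross-entropy for a short sequence. The probability tables stand in for a model's softmax outputs and are entirely hypothetical.

```python
import math

def next_token_loss(probs_per_position, tokens):
    """Average cross-entropy for predicting tokens[t + 1] at position t.

    probs_per_position[t] maps each vocabulary token to the probability the
    model assigns it as the next token after seeing tokens[: t + 1].
    """
    losses = []
    for t in range(len(tokens) - 1):
        target = tokens[t + 1]          # teacher forcing: the true next token
        losses.append(-math.log(probs_per_position[t][target]))
    return sum(losses) / len(losses)

tokens = ["the", "cat", "sat"]
# Hypothetical model outputs for positions 0 and 1
probs = [{"the": 0.1, "cat": 0.8, "sat": 0.1},
         {"the": 0.25, "cat": 0.25, "sat": 0.5}]
loss = next_token_loss(probs, tokens)
print(f"average loss: {loss:.4f}")
```

A probability of 1.0 on the correct token would contribute zero loss; the more mass the model puts elsewhere, the larger −log p becomes.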
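During generation, the model produces one token at a time. Without optimization, generating token n means re-running the model over all n−1 previous tokens and recomputing Keys and Values that have not changed, so each step's attention cost grows as O(n²). The KV cache stores the Key and Value vectors from previous tokens so we never recompute them: each step only computes Q, K, and V for the new token and attends over the cache, which is O(n) per step.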
Click "Generate Token" to add one token. The cache (blue bars) grows while each step only computes one new Q (orange).
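The KV cache stores K and V tensors for every layer and every token generated so far. Here's the formula, per sequence:
$$\text{KV cache bytes} = 2 \times n_{\text{layers}} \times \text{seq\_len} \times d_{\text{model}} \times \text{bytes per value}$$
The "2" is for K and V. Let's compute this for real models: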
| Model | Layers | d_model | Seq Len | KV Cache Size |
|---|---|---|---|---|
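| GPT-2 Small | 12 | 768 | 1,024 | 36 MB |
| GPT-3 (175B) | 96 | 12,288 | 2,048 | 9.0 GB |
| LLaMA-2 70B | 80 | 8,192 | 4,096 | 10.0 GB |
| LLaMA-3 70B | 80 | 8,192 | 128,000 | ~312 GB |
For GPT-3, the calculation: 96 layers × 2 (K+V) × 2,048 tokens × 12,288 dims × 2 bytes (FP16) = 9.66 × 10⁹ bytes ≈ 9.0 GB (counting 1 GB as 2³⁰ bytes, as in the rest of the table). That's just for one user's cache: serving 100 concurrent users needs about 900 GB of KV cache memory alone. The LLaMA rows apply the same full-width formula; both models actually use grouped-query attention, which stores fewer K/V heads and shrinks the real cache accordingly.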
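The cache-size arithmetic is simple enough to script. The sketch below assumes FP16 values by default. The `n_heads` / `n_kv_heads` parameters are an illustrative extension for grouped-query attention and are not part of the formula in the text.

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_value=2,
                   n_heads=1, n_kv_heads=1):
    """KV cache size in bytes for one sequence.

    n_kv_heads / n_heads scales the cache for grouped-query attention;
    leave both at 1 for standard multi-head attention.
    """
    kv_dim = d_model * n_kv_heads // n_heads
    return 2 * n_layers * seq_len * kv_dim * bytes_per_value   # 2 = K and V

GIB = 2 ** 30
gpt2_small = kv_cache_bytes(12, 768, 1024)
gpt3 = kv_cache_bytes(96, 12288, 2048)
llama70b_gqa = kv_cache_bytes(80, 8192, 4096, n_heads=64, n_kv_heads=8)

print(f"GPT-2 Small:              {gpt2_small / 2**20:.0f} MiB")
print(f"GPT-3 (175B):             {gpt3 / GIB:.1f} GiB")
print(f"70B with 8 of 64 KV heads: {llama70b_gqa / GIB:.2f} GiB")
```

The last line shows how much the head-sharing option matters: with 8 KV heads out of 64, the 4,096-token cache drops to an eighth of the full-width figure.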
The math: KV per user = 80 × 2 × 32,768 × 8,192 × 2 bytes ≈ 86 × 10⁹ bytes = 80 GB per user (counting 1 GB as 2³⁰ bytes). For 100 users: 8,000 GB. Plus 140 GB weights. Total: 8,140 GB. You have 640 GB. You're roughly 13× over budget.
Real solutions, in order of impact:
1. GQA (Grouped-Query Attention): LLaMA-3 70B uses 8 KV heads shared across 64 query heads. KV cache shrinks by 8× → 10 GB/user.
2. KV cache quantization (INT8): Another 2× reduction → 5 GB/user.
3. Paged Attention (vLLM): Don't pre-allocate full 32K. Most requests use 2-4K tokens. Only allocate pages as needed → average 3-5 GB/user effective.
4. Tensor parallelism: Shard weights across 8 GPUs (17.5 GB each). Remaining ~62 GB per GPU for KV cache. 8 GPUs × 62 GB = ~496 GB for KV. At 5 GB/user effective: ~99 concurrent users. It fits, just barely.
The insight: no single technique solves it. You need GQA + quantization + paged allocation + parallelism together. This is why LLaMA chose GQA — it's a serving-time decision made at training time.
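The first two reductions can be reproduced with a few lines of arithmetic. This sketch uses the scenario's figures (80 layers, 32K context, d_model 8,192) and treats GQA as dividing the K/V width by 8 and INT8 as one byte per value. Paged allocation and parallelism depend on the workload and are not modeled.

```python
GIB = 2 ** 30

def kv_bytes_per_user(n_layers, seq_len, kv_dim, bytes_per_value):
    """KV cache bytes for one sequence; the leading 2 covers K and V."""
    return 2 * n_layers * seq_len * kv_dim * bytes_per_value

baseline = kv_bytes_per_user(80, 32768, 8192, 2)         # full multi-head, FP16
with_gqa = kv_bytes_per_user(80, 32768, 8192 // 8, 2)    # 8 KV heads for 64 Q heads
with_int8 = kv_bytes_per_user(80, 32768, 8192 // 8, 1)   # GQA plus INT8 cache

for name, size in [("baseline", baseline), ("GQA", with_gqa),
                   ("GQA + INT8", with_int8)]:
    print(f"{name:>10}: {size / GIB:.1f} GiB per user")
```

Stacking the two techniques takes the per-user cache from 80 GiB to 5 GiB, a 16× reduction before any allocation tricks are applied.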
As models get bigger, we face a dilemma: more parameters = better quality, but also more compute per token. Mixture of Experts (MoE) breaks this tradeoff. Instead of one giant FFN, we have many smaller "expert" FFNs. A router selects the top-k experts for each token. Only those experts run.
Each token is routed to 2 of 8 experts. Different tokens activate different experts. Click "Route" to see a new random routing.
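A top-k router can be sketched as: softmax over the router logits, keep the k largest, and renormalize so the kept weights sum to 1. Renormalizing over the selected experts is one common choice rather than the only one, and the logits below are hypothetical.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]   # one token, 8 experts
assignment = route_top_k(logits, k=2)
print(assignment)
```

Only the selected experts' FFNs would run for this token, and their outputs would be combined using the returned weights.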
Kaplan et al. (2020) discovered that model performance follows predictable power laws: loss falls as a power law in model size, dataset size, and training compute.
| Technique | Idea | Speedup |
|---|---|---|
| Flash Attention | Tile computation to stay in SRAM, never materialize full attention matrix | 2–4x |
| Ring Attention | Distribute sequence across GPUs, pass KV blocks in a ring | Linear in #GPUs |
| Grouped-Query Attention | Share KV heads across multiple Q heads | KV memory shrinks by the Q-to-KV head ratio (e.g. 8× for 64 Q heads and 8 KV heads) |
| Sliding Window | Each token only attends to nearby tokens | O(n·w) instead of O(n²) |
Root cause of expert collapse: Softmax routing creates a rich-get-richer feedback loop. The expert that handles slightly more tokens gets more gradient signal, becomes slightly better, then attracts even more tokens. Left unchecked, this can snowball within the first thousand or so steps into a state where a few experts receive nearly all tokens.
Industry solutions:
1. Load balancing loss (Switch Transformer): Add an auxiliary loss $\alpha \cdot E \cdot \sum_{i=1}^{E} f_i \cdot P_i$, where E = number of experts, $f_i$ = fraction of tokens routed to expert i, and $P_i$ = average router probability for expert i. This penalizes concentration. α = 0.01 typically. Trade-off: can reduce model quality by ~0.5% if α is too high.
2. Expert capacity factor (GShard): Cap each expert to at most C × (N/E) tokens per batch, where N = tokens in the batch and E = number of experts (C≈1.25). Overflow tokens are dropped or routed to a second choice. Trade-off: dropped tokens lose information; too low a C drops more tokens, while too high a C wastes compute on unused expert slots.
3. Noisy top-k gating (Shazeer et al., 2017): Add noise to router logits during training. Top-2 from (logits + noise). Trade-off: slows convergence early but prevents lock-in.
4. Expert choice routing (Zhou et al.): Flip the problem — each expert CHOOSES its top-k tokens instead of each token choosing experts. Guarantees perfect balance by construction. Trade-off: variable number of experts per token; some tokens may get 0 experts.
Mixtral-8x7B uses a simple top-2 softmax router with load balancing loss. DeepSeek-MoE adds shared experts (always active) plus routed experts. The key insight: you need an explicit mechanism to fight the rich-get-richer dynamic. Softmax routing left on its own tends to collapse.
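The Switch-style load balancing loss is straightforward to compute from a batch of routing decisions. The sketch below follows the Switch Transformer form, which multiplies the sum of f_i · P_i by the number of experts so that a perfectly balanced router scores α regardless of expert count. The inputs are made-up top-1 assignments.

```python
def load_balancing_loss(assignments, router_probs, n_experts, alpha=0.01):
    """Auxiliary loss alpha * n_experts * sum_i f_i * P_i.

    assignments[t]  : index of the expert chosen for token t (top-1 routing)
    router_probs[t] : router softmax probabilities for token t
    """
    n_tokens = len(assignments)
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    P = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))

# Four tokens spread evenly over four experts
balanced = load_balancing_loss([0, 1, 2, 3], [[0.25] * 4] * 4, n_experts=4)
# Four tokens all sent to expert 0 with high confidence
collapsed = load_balancing_loss([0, 0, 0, 0],
                                [[0.97, 0.01, 0.01, 0.01]] * 4, n_experts=4)
print(f"balanced: {balanced:.4f}, collapsed: {collapsed:.4f}")
```

The collapsed router pays nearly four times the penalty of the balanced one, which is the pressure that pushes tokens back toward underused experts.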
The Transformer's power isn't in any single component — it's in how they compose. Here are the key phenomena researchers have discovered:
Think of the residual connections as a communication bus. Each layer reads from the stream, processes information, and writes its contribution back. Information from early layers is never destroyed — it flows all the way to the final layer. This is fundamentally different from a pipeline where each stage replaces the previous output.
One of the most remarkable discoveries: pairs of attention heads that implement in-context pattern completion. If the model sees "Harry Potter is a wizard... Harry Potter is a", the induction head copies "wizard" from the earlier occurrence. This is a key mechanism behind in-context learning — the ability to learn new tasks from examples in the prompt without any weight updates.
The sequence repeats. Watch how the attention pattern forms a diagonal stripe shifted by the repeat length — that's the induction head copying from the first occurrence.
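The behavior an induction head implements can be caricatured as a lookup rule: find the most recent earlier occurrence of the current token and predict whatever followed it. The function below is a toy version of that rule in plain Python, not a model of how the attention heads actually compute it.

```python
def induction_copy(tokens):
    """Toy induction rule: find the most recent earlier occurrence of the
    current (last) token and predict the token that followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan earlier positions, latest first
        if tokens[i] == current:
            return tokens[i + 1]
    return None                                 # nothing earlier to copy from

prompt = "Harry Potter is a wizard . Harry Potter is a".split()
prediction = induction_copy(prompt)
print(prediction)
```

In a real model the "find the earlier occurrence" and "copy what came next" steps are split across two attention heads in different layers, which is why induction heads come in pairs.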
| Phenomenon | What Happens |
|---|---|
| In-context learning | Learns new tasks from examples in the prompt |
| Chain-of-thought | Step-by-step reasoning improves accuracy |
| Few-shot generalization | Solves unseen tasks with just a few examples |
| Tool use | Learns to call APIs, write code, use calculators |
The Transformer you just learned is the same architecture behind GPT, BERT, LLaMA, and every frontier language model.
The pattern: tokenize your domain, then attend. Any sequential or set-structured problem can be cast as a Transformer problem. The architecture doesn't know or care what the tokens represent.
You now understand the architecture that powers every frontier AI model. From dot products to multi-head attention, from encoder blocks to KV caches — this is the foundation of modern AI.