CS224N Lecture 5 — The Transformer

Chapter 0: Why Replace RNNs?

You're translating a 200-word paragraph from English to French. Your RNN encoder reads the English words one by one, left to right. By the time it reaches word 200, the information from word 1 has passed through 199 transformations. It's been squeezed, distorted, and mostly forgotten. The model has to summarize the entire paragraph into a single hidden vector before the decoder can start generating French. That's like reading a novel through a keyhole — one word at a time, no going back.

But the problem isn't just memory. It's speed. Because each RNN step depends on the output of the previous step, you cannot parallelize across time steps. Step 50 must wait for step 49, which waits for step 48, all the way back to step 1. On a modern GPU with thousands of cores sitting idle, the RNN forces you into a single-lane highway.

In 2017, Vaswani et al. published "Attention Is All You Need" and proposed a radical alternative: throw away recurrence entirely. Replace the sequential hidden state with a mechanism that lets every position look at every other position simultaneously. No loops. No sequential bottleneck. Just one big matrix multiply.

They called it the Transformer.

The Three Sins of RNNs

Sin 1: Sequential computation. Processing n tokens takes O(n) sequential steps. You can't start the next step until the current one is done. A 512-token input means 512 serial operations, even if you have 10,000 GPU cores.

Sin 2: Long-range forgetting. Information from token 1 must survive through 511 transformations to reach token 512. Even LSTMs, designed to fight vanishing gradients, struggle with dependencies spanning hundreds of tokens. The gradient signal decays exponentially with distance.

Sin 3: No direct connections. In an RNN, token 1 and token 200 are connected only through a chain of 199 hidden states. Each link in the chain can distort the signal. In attention, any two tokens are connected by a single computation step — O(1) path length, regardless of their distance.

The simulation below shows this visually. On the left, an RNN processes tokens sequentially — the signal from early tokens fades as the sequence grows. On the right, attention connects all tokens simultaneously with direct links. Drag the slider to increase sequence length and watch the difference.

RNN vs. Attention: Sequential vs. Parallel

Drag the slider to change sequence length. Left: RNN (sequential, fading). Right: Attention (parallel, direct connections).

RNNs are like reading through a keyhole, one word at a time. Attention lets you see the whole page at once. Every token can directly attend to every other token in a single step — no sequential bottleneck, no vanishing gradients across distance.

What the Transformer Achieves

Property	RNN	Transformer
Sequential operations	O(n)	O(1)
Max path length	O(n)	O(1)
Computation per layer	O(n · d²)	O(n² · d)
Parallelizable	No	Yes

There's a trade-off: the Transformer's attention costs O(n²) per layer because every token looks at every other. For very long sequences this becomes expensive. But for the sequence lengths common at the time (roughly 512 to 2048 tokens), the parallelism advantage dominates. The big Transformer model in the original paper trained in 3.5 days on 8 GPUs. A comparable RNN would likely have taken far longer.
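
To make the trade-off in the table concrete, here is a small sketch in plain Python that plugs numbers into the two per-layer cost expressions. The function names and the example sizes are illustrative assumptions, and constant factors are ignored, so read the ratios as rough guides only.

```python
def rnn_layer_cost(n, d):
    """Per-layer cost of a recurrent layer, O(n * d^2), constants ignored."""
    return n * d * d

def attention_layer_cost(n, d):
    """Per-layer cost of self-attention, O(n^2 * d), constants ignored."""
    return n * n * d

d = 512  # hypothetical model width
for n in (64, 512, 4096):
    ratio = attention_layer_cost(n, d) / rnn_layer_cost(n, d)
    print(f"n={n:5d}  attention / rnn cost ratio = {ratio:.3f}")
```

The ratio is simply n / d: attention does less arithmetic per layer while n is below d, the two match at n = d, and attention costs more beyond that. The parallelism advantage is separate from this count and is not modeled here.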

The Hardware Revolution Connection

The Transformer arrived at exactly the right moment in hardware history. GPUs in 2017 had thousands of cores optimized for matrix multiplication (NVIDIA's P100 had 3,584 CUDA cores). RNNs could use only a small fraction of this hardware because sequential dependencies forced serial computation. The Transformer, built almost entirely from matrix multiplies, could keep the whole GPU busy.

This hardware-algorithm co-design explains the explosion of scale that followed. GPT-2 (2019): 1.5B parameters. GPT-3 (2020): 175B. PaLM (2022): 540B. Each of these models is built from the same core blocks as the 65M-parameter Transformer from 2017 (attention, feed-forward layers, residual connections, and normalization), just bigger and with modest variations such as decoder-only stacks. The scaling laws research by Kaplan et al. (2020) showed that Transformer performance improves predictably with more data, compute, and parameters. No fundamental architectural changes needed. Just more matrix multiplies on more GPUs.

The contrast with RNNs is stark. You can't just "make an RNN bigger" and expect proportional improvement. The sequential bottleneck means training time scales linearly with model size AND sequence length. A 175B-parameter RNN would take years to train on the same data. The Transformer's parallelism made large-scale language models economically feasible for the first time.

Here's the remarkable timeline: 2017, Transformer with 65M parameters trains in days. 2018, GPT with 117M. 2019, GPT-2 with 1.5B. 2020, GPT-3 with 175B. 2023, GPT-4 (estimated 1.8T). Each step was enabled by the same architecture — the Transformer — applied at increasing scale. No other architecture in the history of machine learning has shown this consistent scaling behavior.

The Road to "Attention Is All You Need"

The Transformer didn't emerge from nothing. It was the culmination of several years of incremental progress:

Year	Innovation	Key Idea
2014	Seq2Seq (Sutskever)	Encoder-decoder architecture with LSTMs
2015	Attention (Bahdanau)	Let decoder attend to encoder hidden states
2015	Residual Networks (He)	Skip connections enable deep networks
2016	Layer Normalization (Ba)	Normalize per-example, not per-batch
2017	Transformer	Replace ALL recurrence with attention

The key leap: previous work used attention alongside RNNs (the RNN reads the sequence, attention helps with alignment). The Transformer's radical claim: attention is sufficient. No RNN needed at all. The paper's title says it: "Attention Is All You Need."

This lesson walks through every component of the Transformer architecture: self-attention, scaling, multi-head attention, positional encoding, the encoder block, the decoder with masking, and the full system. By the end, you'll be able to build one from scratch.

What this lesson covers: Self-attention as a soft dictionary lookup. Scaled dot-product attention and why sqrt(d_k) matters. Multi-head attention for parallel relationship learning. Positional encoding with sinusoids. The encoder block (attention + FFN + residual + LayerNorm). The decoder with causal masking and cross-attention. A full Transformer builder simulation. Extensions to images, music, and beyond.

Why can't RNNs be parallelized across time steps?

They use too much memory
Each step's output depends on the previous step's hidden state
They don't support batching

Chapter 1: Self-Attention

When you read "The cat sat on the mat because it was tired," how do you know "it" refers to "the cat" and not "the mat"? Your brain doesn't process words in isolation — it considers the meaning of every other word in the sentence to resolve ambiguities. "It was tired" suggests a living thing, so "it" must be the cat.

Self-attention is the mechanism that gives the Transformer this ability. For each position in the sequence, it computes a weighted average over all positions, where the weights reflect how relevant each other position is to the current one. The result: every token's representation is enriched by information from every other token.

Queries, Keys, and Values

Self-attention works like a soft dictionary lookup. Imagine a dictionary where you look up a word and get a definition. In attention:

Each token produces three vectors:

Query (Q): "What am I looking for?" The question this token asks about context.

Key (K): "What do I contain?" The label this token advertises to other tokens.

Value (V): "What information do I carry?" The actual content to pass along.

These three vectors come from three different learned linear projections of the same input embedding. If the input embedding for token i is x_i (a d-dimensional row vector), then:

q_i = x_i W_Q, k_i = x_i W_K, v_i = x_i W_V

Where W_Q, W_K, W_V are learned weight matrices of shape [d × d_k]. The same matrices are applied to every position: the Transformer learns what to ask (Q), what to advertise (K), and what to say (V) as general functions of the input.

How Attention Scores Work

To compute the output for token i, we take its query q_i and compute dot products with every key k_j in the sequence. A high dot product means "token j is relevant to token i." These raw scores are then passed through softmax to get weights that sum to 1. Finally, the output is a weighted sum of the value vectors:

score(i, j) = q_i · k_j

α_ij = softmax(score(i, ·))_j = exp(q_i · k_j) / ∑_m exp(q_i · k_m)

output_i = ∑_j α_ij · v_j

Token i's output is a blend of all value vectors, weighted by how much each key matched token i's query. If "it" strongly attends to "cat," then "it"'s output representation will contain a lot of "cat"'s information.
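
The three formulas above can be sketched for a single query in plain Python, with explicit loops instead of matrix operations. The helper names `softmax` and `attend` and the 2-dimensional example vectors are assumptions made for illustration; they are not part of any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Output for one query: softmax(q . k_j) weighted sum of the v_j."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]
    return output, weights

# Hypothetical 2-dimensional keys and values, chosen only for illustration.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attend([2.0, 0.0], keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(o, 3) for o in output])
```

The query matches the first and third keys equally (score 2 each) and the second key not at all (score 0), so the output leans toward the first and third values while still carrying a small share of the second.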

The widget below shows this in action. Click any token to select it as the query. Attention weight lines appear from that token to all others, with thickness proportional to the attention weight. The Q, K, V vectors are shown as colored bars below.

Self-Attention Visualizer

Click a token to select it as the query. Lines show attention weights to all other tokens. Thicker = more attention.

Self-attention is a soft dictionary lookup: the query asks "what am I looking for?", every key answers "how relevant am I?", and the output is a weighted mix of values. Unlike a hard lookup (which returns one entry), attention blends all entries — just with most weight on the best matches.

A Worked Example by Hand

Let's trace self-attention on a tiny example. Three tokens: "I", "love", "dogs". Suppose d_model = 4 and d_k = 4 (no dimensionality reduction for simplicity). After embedding, our input matrix X is:

X = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]

Suppose our learned weight matrices are (simplified for hand computation):

W_Q = W_K = W_V = I_4×4 (identity, for simplicity)

Then Q = K = V = X. The attention scores are:

Q K^T = X X^T = [[2, 0, 1], [0, 2, 1], [1, 1, 2]]

Token 0 ("I") has scores [2, 0, 1]. It attends strongly to itself (score 2), least to "love" (score 0), and moderately to "dogs" (score 1). After softmax: [0.67, 0.09, 0.24]. The output for "I" is 0.67·v_I + 0.09·v_love + 0.24·v_dogs: mostly itself, with some "dogs" mixed in and a small share of "love". Note that a score of 0 still earns a nonzero weight after softmax.

In practice, learned W_Q, W_K, W_V matrices are NOT identity — they learn to project into a space where semantically related tokens produce high dot products. After training, the query for "it" might naturally align with the key for "cat" because the model has learned that pronouns need to find their antecedents.
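
If you want to check the hand computation, the sketch below reproduces it in plain Python for all three tokens at once. It assumes identity projections exactly as in the example, so Q = K = V = X; the helper names `matmul`, `transpose`, and `softmax_rows` are illustrative.

```python
import math

X = [[1, 0, 1, 0],   # "I"
     [0, 1, 0, 1],   # "love"
     [1, 1, 0, 0]]   # "dogs"

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# With identity projections, Q = K = V = X.
scores = matmul(X, transpose(X))
weights = softmax_rows(scores)
outputs = matmul(weights, X)

print("scores:", scores)
print("weights for 'I':", [round(w, 3) for w in weights[0]])
print("output for 'I': ", [round(o, 3) for o in outputs[0]])
```

Each row of `weights` sums to 1, and each row of `outputs` is a blend of the rows of X, which is the "weighted average over all positions" described earlier.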

Why This Solves the RNN Problem

In an RNN, token 1 and token 100 are connected by a chain of 99 hidden state transformations. In self-attention, they're connected by a single dot product. The path length between any two tokens is O(1). This means gradients flow directly between distant tokens during backpropagation — no vanishing gradient through 99 intermediate steps.

And because every token's attention weights are independent of every other token's (no sequential dependency), the entire computation can be parallelized as a single matrix multiplication. All positions computed simultaneously.

The Matrix Form

All of self-attention can be written as a handful of matrix multiplications and a softmax, with no loops and no sequential dependencies:

Q = X W_Q, K = X W_K, V = X W_V

Attention(Q, K, V) = softmax(Q K^T) · V

This is a batch operation: every position's attention is computed simultaneously. On a GPU, each step translates to a single cuBLAS GEMM (General Matrix Multiply) call. That's a large part of why Transformers train so much faster than RNNs on modern hardware.

In Code

python
import torch
import torch.nn.functional as F

def self_attention(X, W_Q, W_K, W_V):
    # X: [seq_len, d_model]
    # W_Q, W_K, W_V: [d_model, d_k]
    Q = X @ W_Q          # [seq_len, d_k]
    K = X @ W_K          # [seq_len, d_k]
    V = X @ W_V          # [seq_len, d_k]

    scores = Q @ K.T     # [seq_len, seq_len]
    weights = F.softmax(scores, dim=-1)
    output = weights @ V  # [seq_len, d_k]
    return output, weights

That's the entire mechanism: three projection multiplies, one multiply to get the scores, a softmax, and one more multiply to mix the values. Six lines of code, and it replaces the entire recurrent loop of an RNN.

Attention Is O(n²) — Is That Bad?

Self-attention computes a score for every pair of tokens: n queries × n keys = n² dot products. For n = 512, that's 262,144 pairs per head per layer. For n = 2048 (GPT-3's context length), it's about 4.2 million. For n = 100,000 (some modern models), it's 10 billion.

This quadratic scaling is the Transformer's Achilles heel. It's why the original paper used only n = 512. It's why GPT-2 could only handle 1024 tokens. And it's why an entire subfield of "efficient attention" has emerged: Linformer (linear approximation), Performer (random feature maps), FlashAttention (hardware-aware exact attention), Mamba (selective state spaces that bypass attention entirely), and many others.

The memory cost is equally important. The attention matrix itself is [n, n] per head per layer. For GPT-3 (96 heads, 96 layers, n = 2048), storing every attention matrix for a single sequence in float16 takes 96 × 96 × 2048 × 2048 × 2 bytes ≈ 77 GB (72 GiB) during training. This is why techniques like gradient checkpointing (recompute activations instead of storing them) and mixed-precision training (use float16 instead of float32) are essential for large Transformers.
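
The memory arithmetic is easy to reproduce. The sketch below is a back-of-the-envelope calculator, assuming one sequence, 2-byte (float16) entries, and every head's attention matrix in every layer kept in memory at once. The function name is illustrative. It prints the total in both decimal gigabytes and binary gibibytes, since the two units differ by about 7%.

```python
def attention_matrix_bytes(n, heads, layers, bytes_per_value=2):
    """Bytes needed to store one [n, n] attention matrix per head per layer."""
    return heads * layers * n * n * bytes_per_value

total = attention_matrix_bytes(n=2048, heads=96, layers=96)
print(f"{total:,} bytes = {total / 1e9:.1f} GB = {total / 2**30:.1f} GiB")

# Doubling the context length quadruples the cost.
print(attention_matrix_bytes(4096, 96, 96) / total)
```

The second print makes the quadratic growth visible: twice the tokens, four times the attention memory.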

To put the cost in perspective: processing a 1M-token context with standard attention means 10¹² query-key pairs per layer. Each pair costs a dot product and a weighted value update across the model width, so at d_model = 12,288 (GPT-3's width) the two attention matrix multiplies come to roughly 5 × 10¹⁶ FLOPs per layer. Even at GPU speeds of 10¹⁵ FLOPS, that's about 50 seconds per layer just for attention, and you need 96 layers. This is why long-context Transformers are so expensive and why alternative architectures like Mamba (which processes sequences in O(n) time) are attracting attention. The Transformer's quadratic cost may ultimately limit its dominance for very long sequences, even as it remains supreme for moderate-length contexts where quality matters most.

FlashAttention (Dao et al., 2022) deserves special mention: it computes exact attention in O(n²) time but with dramatically less memory by tiling the computation to fit in GPU SRAM (fast cache) instead of slow HBM (main GPU memory). FlashAttention doesn't change the math at all — it changes how the math is executed on hardware. The result: 2-4x speedup and models can handle 4-16x longer sequences at the same memory budget. This is a perfect example of hardware-aware algorithm design.
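
The tiling idea rests on a streaming form of softmax: you can visit keys and values block by block, keeping only a running maximum, a running normalizer, and a running weighted sum. The sketch below shows that idea for a single query in plain Python. It is not the FlashAttention kernel (no GPU, no SRAM management), and the function names, block size, and example vectors are assumptions for illustration.

```python
import math

def attention_row_full(q, keys, values):
    """Reference: materialize every score for one query, then softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(values[0])
    return [sum(e * v[c] for e, v in zip(exps, values)) / z for c in range(dim)]

def attention_row_tiled(q, keys, values, block=2):
    """Same result, visiting keys/values one block at a time.

    Only a running max, a running normalizer, and a running weighted sum
    are kept, so the full row of scores is never stored.
    """
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical data, chosen only for illustration.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [-0.7, 0.9], [0.4, 0.4], [2.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

full = attention_row_full(q, keys, values)
tiled = attention_row_tiled(q, keys, values, block=2)
print("max difference:", max(abs(a - b) for a, b in zip(full, tiled)))
```

The two functions agree to floating-point precision, which is the sense in which tiling "doesn't change the math": the same softmax-weighted sum comes out, only the order of operations and the amount of temporary storage differ.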

What determines how much token A attends to token B?

The distance between their positions
The magnitude of their value vectors
The dot product of A's query with B's key

Chapter 2: Scaled Dot-Product Attention

Take the dot product of two 512-dimensional vectors. Each element is roughly standard-normal (mean 0, variance 1). The dot product is the sum of 512 products of independent random variables. Each product has variance 1 and variances add for independent terms, so the sum has variance 512 and a standard deviation of √512 ≈ 22.6 (by the central limit theorem it is also approximately normal). That means typical dot products range from about −45 to +45, two standard deviations either side of zero.

Now pass those through softmax. Softmax with inputs in the range [−45, +45] is catastrophically peaked: the largest value gets nearly all the probability mass, and everything else is essentially zero. The attention pattern becomes one-hot — each token attends to exactly one other token, ignoring everything else. That's not a soft weighted average; it's a hard lookup. And worse, the gradients through softmax vanish when the output is nearly one-hot.

The Fix: Divide by √d_k

The solution is beautifully simple. If the dot product has variance d_k, divide by √d_k to bring the variance back to 1. Now the softmax inputs are in the range [−3, +3] (roughly), and softmax produces a smooth distribution:

Attention(Q, K, V) = softmax(Q K^T / √d_k) · V

This is the complete scaled dot-product attention formula from "Attention Is All You Need." Every attention computation in the Transformer uses this exact formula.

Why √d_k Specifically?

If q and k are vectors where each element is drawn i.i.d. from N(0, 1), then:

q · k = ∑_{i=1}^{d_k} q_i · k_i

Each product q_i · k_i has mean 0 and variance 1 (product of two standard normals). The sum of d_k such terms has variance d_k (variances add for independent variables). Standard deviation = √d_k. Dividing by √d_k normalizes the variance back to 1, so softmax inputs stay in a well-behaved range regardless of how large d_k is.
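
You can check the variance argument empirically. The sketch below samples random standard-normal vectors with Python's `random` module and measures the standard deviation of their dot products, with and without the √d_k divisor. The function name, trial count, and seed are arbitrary assumptions, so expect the printed values to land near √d_k and 1 rather than exactly on them.

```python
import math
import random

def dot_product_std(d_k, scaled, trials=1000, seed=0):
    """Empirical std of q . k for i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        dot = sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(d_k))
        samples.append(dot / math.sqrt(d_k) if scaled else dot)
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return math.sqrt(var)

results = {}
for d_k in (4, 64, 512):
    results[d_k] = (dot_product_std(d_k, scaled=False),
                    dot_product_std(d_k, scaled=True))
    unscaled_std, scaled_std = results[d_k]
    print(f"d_k={d_k:4d}  sqrt(d_k)={math.sqrt(d_k):6.2f}  "
          f"unscaled std={unscaled_std:6.2f}  scaled std={scaled_std:5.2f}")
```

The unscaled column grows with √d_k while the scaled column stays close to 1, which is exactly what the divisor is meant to achieve.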

The simulation below shows this dramatically. Adjust d_k with the slider and toggle scaling on/off. Without scaling, as d_k grows, the softmax output becomes a spike. With scaling, it stays smooth.

Scaled vs. Unscaled Attention

Drag d_k to see how dimension affects softmax. Toggle scaling on/off.

Dividing by √d_k is temperature control. Without it, softmax saturates — one token gets ~100% of the weight and gradients vanish. With it, the distribution stays smooth and trainable. It's one line of code that makes or breaks training.

A Worked Example

Let d_k = 4. Suppose token "it" has query q = [1, 0, −1, 2] and three keys are k₁ = [2, 1, 0, 1], k₂ = [0, −1, 1, 0], k₃ = [1, 0, −1, 3].

Raw dot products:

q · k₁ = 2 + 0 + 0 + 2 = 4

q · k₂ = 0 + 0 − 1 + 0 = −1

q · k₃ = 1 + 0 + 1 + 6 = 8

Scaled (divide by √4 = 2): [2, −0.5, 4]

Softmax of [4, −1, 8] (unscaled): [0.018, 0.000, 0.982] — almost all weight on k₃.

Softmax of [2, −0.5, 4] (scaled): [0.118, 0.010, 0.872]. Still mostly k₃, but now k₁ gets 11.8%. The model can learn nuanced blending.

Now imagine d_k = 512 instead of 4. Typical dot products would be about 11x larger, because their standard deviation grows with √d_k and √(512/4) ≈ 11.3. Without the √d_k divisor, the softmax outputs would be [~0.0, ~0.0, ~1.0], indistinguishable from a hard lookup. Gradients at the 0.0 positions would be effectively zero, making it very hard for the model to learn that k₁ is partially relevant. The √512 ≈ 22.6 divisor brings everything back to a manageable range.
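
The worked example can be replayed in a few lines of plain Python. The query and keys are the ones from the example; the `softmax` helper is an illustrative stand-in for a library call.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

q = [1, 0, -1, 2]
keys = [[2, 1, 0, 1], [0, -1, 1, 0], [1, 0, -1, 3]]
d_k = len(q)

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

print("raw scores:      ", raw)
print("scaled scores:   ", scaled)
print("unscaled softmax:", [round(p, 3) for p in softmax(raw)])
print("scaled softmax:  ", [round(p, 3) for p in softmax(scaled)])
```

Try replacing the keys with larger vectors to watch the unscaled distribution collapse onto a single entry while the scaled one stays spread out.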

In PyTorch: One Line

python
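import torch.nn.functional as F

# The full scaled dot-product attention
def scaled_dot_product(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Scale scores so their variance stays near 1 regardless of d_k
    scores = Q @ K.transpose(-2, -1) / d_k**0.5
    if mask is not None:
        # Positions where mask == 0 get a large negative score (~0 weight)
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights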

The mask parameter is how we implement causal masking for the decoder (Chapter 6). For the encoder, mask is None — all positions can see all others.
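
To see what a mask does before reaching the decoder chapter, here is a plain-Python analogue of the `masked_fill` step. It assumes the common lower-triangular convention (position i may look at positions j ≤ i); the helper names and the all-zero scores are illustrative choices.

```python
import math

def causal_mask(n):
    """mask[i][j] is 1 when position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask, fill=-1e9):
    """Row-wise softmax after replacing masked-out scores with a large negative."""
    out = []
    for row, mask_row in zip(scores, mask):
        filled = [s if keep else fill for s, keep in zip(row, mask_row)]
        m = max(filled)
        exps = [math.exp(s - m) for s in filled]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]   # uniform scores, for illustration
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its weight evenly over positions 0 through i and gives zero weight to everything after it, so no token can peek at the future.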

What Happens Without Scaling: A Visualization

The effect is dramatic in practice. Take d_k = 512 as an illustration (the original Transformer uses 64 dimensions per head and models like GPT-3 use 128, so the effect there is smaller but still significant). The dot products have standard deviation √512 ≈ 22.6. Typical softmax inputs look like [−30, 5, 42, −15, 28, ...]. After softmax, the value 42 gets probability ~0.9999 and everything else gets ~0.0000. The model becomes a hard lookup table, with each token attending to essentially one other token.

With scaling, the same inputs become [−1.3, 0.2, 1.9, −0.7, 1.2, ...]. Softmax over just these five values gives roughly [0.02, 0.10, 0.56, 0.04, 0.28]. Now the model can express "I'm mostly interested in token 3, but tokens 2 and 5 are also relevant." This soft blending is what makes attention powerful. Hard attention (attending to exactly one token) loses the ability to aggregate information from multiple sources.

The gradient tells the same story. At softmax saturation, the gradient is nearly zero: ∂softmax/∂z ≈ 0 when one output dominates. The model can't learn to adjust the attention pattern because the gradient carries no information about which direction to update. Scaling keeps the gradient informative throughout training.
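
The vanishing gradient can be made concrete with the softmax Jacobian, ∂p_i/∂z_j = p_i(δ_ij − p_j). The sketch below evaluates it for the five example scores above, before and after dividing by √512. The helper names are illustrative, and "largest entry" is just one simple way to summarize how much gradient can flow.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[i][j] = d p_i / d z_j = p_i * (delta_ij - p_j)."""
    p = softmax(z)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def largest_entry(J):
    return max(abs(x) for row in J for x in row)

unscaled = [-30.0, 5.0, 42.0, -15.0, 28.0]
scaled = [v / math.sqrt(512) for v in unscaled]

grad_unscaled = largest_entry(softmax_jacobian(unscaled))
grad_scaled = largest_entry(softmax_jacobian(scaled))
print(f"largest Jacobian entry, unscaled: {grad_unscaled:.2e}")
print(f"largest Jacobian entry, scaled:   {grad_scaled:.2e}")
```

In the saturated case every entry of the Jacobian is tiny, so a change in any score barely moves the output. After scaling, the entries are several orders of magnitude larger and the attention pattern can actually be adjusted by gradient descent.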

Alternative: Additive Attention

Before scaled dot-product attention, Bahdanau et al. (2015) used additive attention:

score(q, k) = v^T tanh(W₁ q + W₂ k)

This doesn't have the scaling problem (tanh keeps each hidden unit in [−1, 1]), but it's slower in practice because it can't be computed as a single matrix multiply. Dot-product attention is O(n² · d) with highly optimized GEMM; additive attention runs a small feed-forward network for every query-key pair. Vaswani et al. report that the two perform similarly for small d_k, but dot-product attention is much faster and more space-efficient on GPUs. The scaling factor is the small price we pay for that speed.

Why do we divide attention scores by √d_k before applying softmax?

To prevent dot products from growing with d_k, which would make softmax saturate and kill gradients
To make the output vectors unit length
To reduce the number of parameters

Chapter 3: Multi-Head Attention

One attention head might learn that "it" refers to "cat" (coreference). But there are other relationships worth capturing: "sat" relates to "cat" (subject-verb), "on" relates to "mat" (prepositional attachment), "tired" relates to "sat" (causal). Why force a single set of Q/K/V weights to capture all these different relationships simultaneously?

Multi-head attention runs h separate attention operations in parallel, each with its own learned W_Q, W_K, W_V matrices. Each head projects into a smaller subspace (d_k = d_model/h) so the total computation is the same as single-head attention with full dimensionality.

The Math

For each head i (from 1 to h):

head_i = Attention(X W_Qⁱ, X W_Kⁱ, X W_Vⁱ)

Where W_Qⁱ, W_Kⁱ, W_Vⁱ each have shape [d_model × d_k], with d_k = d_model/h.

The outputs of all heads are concatenated and projected back to d_model:

MultiHead(X) = Concat(head₁, ..., head_h) · W_O

Where W_O has shape [h · d_k × d_model] = [d_model × d_model]. The output is the same dimension as the input.

Why This Works

In the original Transformer, d_model = 512 and h = 8, so each head operates in d_k = 64 dimensions. One head learns syntax, another learns coreference, another learns proximity — each in its own 64-dimensional subspace. The concat + output projection W_O learns how to combine these different relationship types into a unified representation.

This is the key insight: different types of relationships can be learned independently. A single 512-dimensional attention head might try to average syntax and coreference patterns together, losing both. Eight 64-dimensional heads can specialize.

The simulation below shows four heads attending to different aspects of a sentence. Each head is color-coded. Click each head to toggle it on/off and see the attention pattern it learns.

Multi-Head Attention Patterns

Click head buttons to toggle individual attention patterns. Each head learns different relationships.

Multi-head attention is like having multiple pairs of eyes, each looking for a different kind of relationship. One head finds subjects, another finds objects, another finds temporal modifiers. Together, they capture the full structure of the sentence.

Tensor Shapes at Each Step

Tensor	Shape	Description
Input X	[n, d_model]	n tokens, each 512-dim
W_Qⁱ	[d_model, d_k]	[512, 64] per head
Qⁱ, Kⁱ, Vⁱ	[n, d_k]	[n, 64] per head
Scores	[n, n]	Attention matrix per head
head_i output	[n, d_k]	[n, 64] per head
Concat	[n, h · d_k]	[n, 512] = [n, d_model]
Final output	[n, d_model]	[n, 512] same as input

Notice: the final output has the same shape as the input. This is critical — it means we can stack attention layers. The output of layer 1 feeds directly into layer 2.

Multi-Head Attention in Code

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.d_k = d_model // n_heads    # 64
        self.h = n_heads
        # One big projection, then split into heads
        self.W_qkv = nn.Linear(d_model, 3 * d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, D = x.shape
        # Project to Q, K, V in one shot
        qkv = self.W_qkv(x)                     # [B, T, 3*D]
        qkv = qkv.reshape(B, T, 3, self.h, self.d_k)
        qkv = qkv.permute(2, 0, 3, 1, 4)       # [3, B, h, T, d_k]
        Q, K, V = qkv[0], qkv[1], qkv[2]

        # Scaled dot-product attention per head
        scores = Q @ K.transpose(-2, -1) / self.d_k**0.5
        weights = F.softmax(scores, dim=-1)      # [B, h, T, T]
        out = weights @ V                          # [B, h, T, d_k]

        # Concat heads and project
        out = out.transpose(1, 2).reshape(B, T, D)
        return self.W_o(out)                    # [B, T, D]

The key implementation trick: instead of h separate W_Q, W_K, W_V matrices, we use one large projection and reshape. This is more efficient on GPUs because one large matrix multiply is faster than h small ones.

What Do Heads Actually Learn?

Researchers have probed trained Transformer heads to discover what relationships they capture. Some findings from Clark et al. (2019) and Voita et al. (2019):

Positional heads always attend to the previous token (position i attends most to position i−1) or to the first token. These are surprisingly common and seem to implement simple positional patterns.

Syntactic heads attend from a verb to its subject, from a noun to its determiner, or from a pronoun to its antecedent. These heads effectively learn to parse grammar without being explicitly trained on parse trees.

Rare token heads attend to rare or unusual tokens in the sequence, possibly implementing a "surprise detection" mechanism.

Separator heads attend to punctuation and sentence boundaries, perhaps helping the model understand document structure.

Not all heads are useful. Voita et al. showed that you can prune over 60% of heads in a trained Transformer with minimal quality loss. The model is over-parameterized in attention heads, and most of the work is done by a small number of critical heads. This finding has implications for inference efficiency: Grouped Query Attention (GQA), used in LLaMA 2 and later models, shares key-value heads across multiple query heads, reducing KV cache size by 4-8x with minimal quality loss.
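
To see where the KV cache saving comes from, here is a rough size calculator. It assumes the cache stores one key and one value vector per token, per KV head, per layer, in 2-byte floats. The model configuration is hypothetical and chosen only to make the arithmetic clean; it is not taken from any specific released model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values cached for one sequence: 2 tensors per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model: 32 layers, 32 query heads of width 128, 4096-token context.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)

print(mha / 2**30, "GiB with one KV head per query head")
print(gqa / 2**30, "GiB with 8 shared KV heads")
print("reduction factor:", mha / gqa)
```

The reduction factor is simply the number of query heads that share each KV head, which is why sharing across groups of 4 or 8 query heads yields savings in the range quoted above.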

Parameter Count

Each head has three weight matrices (W_Q, W_K, W_V), each [512 × 64]. That's 3 × 512 × 64 = 98,304 parameters per head. With 8 heads: 786,432 parameters. Plus W_O at [512 × 512] = 262,144. Total for multi-head attention: ~1M parameters. About the same as a single-head attention with d_k = 512 would cost (3 × 512 × 512 = 786,432 for Q/K/V). Multi-head attention adds expressiveness without adding cost.
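
The count above is easy to script, which helps when you change d_model or the number of heads. This sketch counts weights only by default, matching the arithmetic in the text; the `include_bias` flag is an added assumption covering the bias vectors that linear layers usually carry.

```python
def mha_param_count(d_model, n_heads, include_bias=False):
    """Parameter count for multi-head attention with d_k = d_model // n_heads."""
    d_k = d_model // n_heads
    per_head_qkv = 3 * d_model * d_k          # W_Q, W_K, W_V for one head
    qkv = n_heads * per_head_qkv
    w_o = (n_heads * d_k) * d_model           # output projection
    bias = (3 * d_model + d_model) if include_bias else 0
    return {"per_head_qkv": per_head_qkv, "qkv": qkv, "w_o": w_o,
            "total": qkv + w_o + bias}

counts = mha_param_count(d_model=512, n_heads=8)
print(counts)

# The total does not depend on how many heads the width is split into.
print(mha_param_count(512, 1)["total"], mha_param_count(512, 16)["total"])
```

Because d_k shrinks as the head count grows, the total stays fixed at 4 × d_model² weights for any head count that divides d_model evenly.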

A Worked Example: Two Heads on "it"

Consider "The cat sat because it was tired" with d_model = 4 and h = 2 heads (d_k = 2 each).

Head 1 (coreference): Projects "it" into query q₁ = [0.9, 0.1]. Keys for "cat" = [0.8, 0.2], "sat" = [0.1, 0.9], "tired" = [0.2, 0.3]. Dot products: cat = 0.74, sat = 0.18, tired = 0.21. After softmax: cat = 0.46, sat = 0.26, tired = 0.27. Head 1 puts the most weight on the antecedent "cat."

Head 2 (adjacency): Projects "it" into query q₂ = [0.2, 0.8]. Keys for "because" = [0.3, 0.9], "was" = [0.4, 0.7], "tired" = [0.1, 0.3]. Dot products: because = 0.78, was = 0.64, tired = 0.26. After softmax: because = 0.41, was = 0.35, tired = 0.24. Head 2 finds the local context.

Concatenating: head₁ output (2 dims focused on "cat") || head₂ output (2 dims focused on "because"/"was") = 4-dim vector containing both coreference AND local context. W_O then learns how to combine these different relationship types into a single unified representation.
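
The two-head example can be recomputed with a short script. It applies softmax directly to the raw dot products listed in the example (no √d_k divisor, an assumption made here to keep the arithmetic easy to follow), and the helper name `head_weights` is illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_weights(query, keys):
    """Attention weights of one head, returned as {token: weight}."""
    names = list(keys)
    scores = [sum(q * k for q, k in zip(query, keys[name])) for name in names]
    return dict(zip(names, softmax(scores)))

head1 = head_weights([0.9, 0.1],
                     {"cat": [0.8, 0.2], "sat": [0.1, 0.9], "tired": [0.2, 0.3]})
head2 = head_weights([0.2, 0.8],
                     {"because": [0.3, 0.9], "was": [0.4, 0.7], "tired": [0.1, 0.3]})

print({k: round(v, 2) for k, v in head1.items()})
print({k: round(v, 2) for k, v in head2.items()})
```

With such small dot products the distributions are fairly flat. In a trained model the projections produce larger score gaps, so each head's preference is usually much sharper than in this toy case.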

A single head with d_k = 4 would be forced to choose: learn coreference OR learn adjacency, but not both. Multi-head lets it learn both simultaneously.

After concatenating all h heads (each outputting [n, d_k]), what is the output dimension before the W_O projection?

[n, d_k]
[n, h × d_k] = [n, d_model]
[n, h × d_model]

Chapter 4: Positional Encoding

"Dog bites man" and "Man bites dog" have identical tokens: {dog, bites, man}. Self-attention computes dot products between queries and keys, and nothing in that computation depends on where a token sits in the sequence. If you shuffle the input tokens, each token still gets exactly the same attention weights and output as before; the outputs are simply shuffled the same way. The mechanism itself has no notion of order. It treats the input as a set, not a sequence.

This is a fundamental problem. Word order carries meaning. "The cat chased the mouse" means something very different from "The mouse chased the cat." We need to inject position information explicitly.

Sinusoidal Positional Encoding

Vaswani et al. added a positional encoding vector to each input embedding. For position pos and dimension i:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each dimension gets a sinusoidal wave at a different frequency. Low-index dimensions oscillate quickly (changing every position); high-index dimensions oscillate slowly (changing over thousands of positions). The result: each position gets a unique "fingerprint" vector.

Why Sinusoids? The Clock Analogy

Think of how a clock represents time. The second hand spins fast (high frequency), the minute hand spins slow (medium frequency), and the hour hand barely moves (low frequency). Any moment in time is uniquely identified by the combination of all three hands. 3:15:42 is different from 3:15:43 because the second hand moved, even though the other hands are in the same place.

Sinusoidal positional encoding works the same way. Each dimension is a "hand" spinning at a different frequency. Low-index dimensions spin fast (changing every token), high-index dimensions spin slowly (repeating only over thousands of tokens). The combination of all dimensions uniquely identifies each position.

Three elegant properties:

1. Unique per position. No two positions share the same encoding vector. The combination of different-frequency sinusoids creates a unique pattern at each position, like the unique combination of hands on a clock.

2. Bounded values. Every element is between −1 and +1 (sine and cosine are bounded). This plays well with the input embeddings, which are typically initialized with similar magnitude.

3. Relative positions are learnable. For any fixed offset k, PE(pos+k) is a linear function of PE(pos). This means the model can learn to attend to "the word 3 positions ago" by learning a simple linear transformation of the positional encoding. The dot product PE(pos) · PE(pos+k) depends only on k, not on pos itself.
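
Property 3 can be verified numerically. The sketch below builds the encoding with Python's `math` module, applies a per-frequency rotation that depends only on the offset k, and checks that it maps PE(pos) onto PE(pos + k). It also compares dot products at two different starting positions. The function names and the small d_model are assumptions for illustration.

```python
import math

def pe(pos, d_model):
    """Sinusoidal encoding: sin/cos pairs at geometrically spaced frequencies."""
    vec = []
    for two_i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (two_i / d_model))
        vec.append(math.sin(pos * freq))
        vec.append(math.cos(pos * freq))
    return vec

def shift(vec, k, d_model):
    """Apply the fixed linear map that turns PE(pos) into PE(pos + k)."""
    out = []
    for two_i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (two_i / d_model))
        s, c = vec[two_i], vec[two_i + 1]
        cos_k, sin_k = math.cos(k * freq), math.sin(k * freq)
        out.append(s * cos_k + c * sin_k)   # sin(a + b)
        out.append(c * cos_k - s * sin_k)   # cos(a + b)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model, k = 16, 3

# The map depends on k only, yet it works for any starting position.
err = max(abs(x - y)
          for x, y in zip(shift(pe(7, d_model), k, d_model), pe(7 + k, d_model)))
print("max error of linear shift:", err)

# Dot products between encodings k apart agree across positions.
near = dot(pe(5, d_model), pe(5 + k, d_model))
far = dot(pe(100, d_model), pe(100 + k, d_model))
print(near, far)
```

The shift is a rotation of each (sin, cos) pair by an angle proportional to k, which is why a linear layer is enough to express "k positions away."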

The simulation below shows the positional encoding as a heatmap: rows are positions, columns are dimensions. Hover any row to see its encoding vector. The slider controls the maximum sequence length. Notice how low dimensions oscillate fast and high dimensions oscillate slowly.

Sinusoidal Positional Encoding

Top: encoding heatmap (position × dimension). Bottom: 2D PCA projection of position vectors. Hover rows to inspect.

Sinusoidal encodings let the model learn relative positions: the encoding of position p+k is a linear function of position p. This means a simple linear layer can extract "how far apart are these two tokens?" from their positional encodings.

Added, Not Concatenated

The positional encoding is added to the input embedding, not concatenated:

input_pos = embedding(token_pos) + PE(pos)

Why add instead of concatenate? Concatenation would double the dimension (wasting computation) and force the model to learn separate weights for "content" and "position" dimensions. Addition lets position information blend naturally with semantic information. The model can learn to use position through the standard Q/K/V projections without any architectural changes.

Positional Encoding in Code

python
import torch
import math

def sinusoidal_pe(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(0, max_len).unsqueeze(1)  # [max_len, 1]
    div = torch.exp(
        torch.arange(0, d_model, 2) * -math.log(10000) / d_model
    )  # [d_model/2]
    pe[:, 0::2] = torch.sin(pos * div)  # even dims
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims
    return pe  # [max_len, d_model]

# Usage: embeddings = token_embed(x) + sinusoidal_pe(512, 512)[:seq_len]

The division term exp(arange * -log(10000) / d_model) computes 1/10000^(2i/d_model) in log space for numerical stability. This creates frequencies that span from 1 (dimensions 0 and 1 complete a full cycle every 2π ≈ 6 positions) down to roughly 1/10000 (the last pair of dimensions completes a cycle only every few tens of thousands of positions).
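To make the frequency range concrete, here is a small sketch that lists the per-pair frequencies and their wavelengths for d_model = 512. The function name `pe_frequencies` is an assumption for illustration; it applies the same 10000^(-i/d_model) rule as `sinusoidal_pe` above, written without torch.

```python
import math

def pe_frequencies(d_model):
    """Frequency of each (sin, cos) pair: 10000^(-i/d_model) for i = 0, 2, 4, ..."""
    return [10000 ** (-i / d_model) for i in range(0, d_model, 2)]

freqs = pe_frequencies(512)
wavelengths = [2 * math.pi / f for f in freqs]  # positions per full cycle

print(len(freqs))                        # one frequency per pair of dimensions
print(freqs[0], freqs[-1])               # fastest pair is 1.0; slowest is close to 1/10000
print(wavelengths[0], wavelengths[-1])   # from about 6 positions to tens of thousands
```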

Learned vs. Fixed

The original Transformer uses fixed sinusoidal encodings (no learned parameters). Vaswani et al. also tried learned positional embeddings — a separate embedding table of shape [max_len, d_model] — and found "nearly identical results." Later models like BERT and GPT use learned embeddings; very long-context models use RoPE (rotary position embeddings) or ALiBi. The key insight remains: attention needs position information injected explicitly.

RoPE (Rotary Position Embeddings) is worth mentioning because it powers most modern LLMs (LLaMA, Mistral, GPT-NeoX). Instead of adding position to the embedding, RoPE applies a rotation to Q and K vectors based on position. The dot product Q·K then naturally encodes relative position. RoPE is elegant because it makes relative position a property of the dot product itself, not something the model has to learn from additive encodings.
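A minimal sketch of the rotation idea, in plain Python. The assumptions here are mine rather than the lecture's: adjacent dimensions are paired (some implementations pair dimension i with i + d/2 instead), the base is 10000, and the vectors are made-up numbers. The point it illustrates is that the score between a rotated query and a rotated key depends only on the offset between their positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by the angle pos * base^(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # hypothetical query vector
k = [1.0, 0.4, -0.7, 0.9]   # hypothetical key vector

# Same offset (query two positions after the key), different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 103), rope(k, 101))
print(score_near, score_far)  # should agree up to floating-point rounding
```

Because each pair is only rotated, vector norms are unchanged; position enters the score purely through the angle difference.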

Extrapolation: Going Beyond Training Length

One advantage of sinusoidal encodings: they're defined for any position, even positions not seen during training. If you train on sequences of length 512, you can theoretically evaluate on length 1024 — the sine/cosine functions extend naturally. In practice, the model's quality degrades significantly beyond the training length because attention patterns were only trained on the shorter context.

This length extrapolation problem is a major area of current research. Approaches include:

Method	Key Idea	Used By
ALiBi	Penalize attention by distance (no explicit PE)	BLOOM, MPT
RoPE + NTK scaling	Adjust RoPE frequencies to extend context	CodeLlama, extended LLaMA
YaRN	Frequency-dependent interpolation of RoPE frequencies plus attention temperature scaling	Various fine-tunes
Ring Attention	Distribute long context across multiple GPUs	Research
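As one concrete instance from the table, ALiBi replaces positional encodings with a per-head linear penalty on attention scores. The sketch below follows the slope rule described in the ALiBi paper for head counts that are powers of two (a geometric sequence starting at 2^(-8/h)). The function names are assumptions, and real implementations vectorize this.

```python
def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/h), 2^(-16/h), ... (assumes n_heads is a power of two)."""
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to causal attention scores: slope * (j - i) for key j <= query i,
    and -inf for future keys (the causal mask)."""
    return [[slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])   # 0.5 and 0.00390625 (= 1/256) for 8 heads

for row in alibi_bias(4, slopes[0]):
    print(row)                 # penalty grows linearly with distance into the past
```

Heads with large slopes focus on nearby tokens; heads with small slopes can look far back, and no position vector is ever added to the embeddings.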

The upshot: position encoding isn't a solved problem. It remains one of the most active research frontiers in Transformer architecture design, especially as models push toward 1M+ token context windows.

Why can't we just use integer positions (0, 1, 2, ...) as positional encodings?

Integers are too small to carry information
Unbounded integers would dominate the embedding magnitudes for long sequences, and relative position wouldn't be a linear function
Integers can't be represented in floating-point

Chapter 5: The Encoder Block

We have attention and position. Now we need to package them into a repeatable building block that can be stacked into a deep network. The Transformer encoder is built from N identical layers (N=6 in the original paper), each containing the same two sub-modules wired together with residual connections and layer normalization.

Sub-layer 1: Multi-Head Self-Attention

The input (a sequence of n vectors, each d_model-dimensional) goes through multi-head attention. Every position attends to every other position. The output has the same shape as the input: [n, d_model].

Sub-layer 2: Position-wise Feed-Forward Network (FFN)

After attention, each position independently passes through a two-layer fully connected network:

FFN(x) = max(0, x W₁ + b₁) W₂ + b₂

The hidden dimension d_ff is typically 4 × d_model (so 2048 in the original). ReLU activation between the two layers. The same FFN is applied to every position independently — no interaction between positions here. That's the job of attention. The FFN's job is to transform each position's representation nonlinearly, acting like a small neural network applied to each token independently.
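A minimal sketch of the position-wise property, in plain Python with tiny made-up weights (d_model = 2, d_ff = 3; the numbers are purely illustrative). The same W₁, b₁, W₂, b₂ are applied to each token's vector separately, so reordering the tokens simply reorders the outputs.

```python
def linear(x, W, b):
    """Row vector x times matrix W (given as a list of rows), plus bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]    # shape [d_model=2, d_ff=3]
b1 = [0.0, 0.0, 0.1]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # shape [d_ff=3, d_model=2]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [-1.0, 0.5], [0.0, 0.0]]   # a sequence of 3 positions
outputs = [ffn(t, W1, b1, W2, b2) for t in tokens]
reversed_outputs = [ffn(t, W1, b1, W2, b2) for t in reversed(tokens)]

print(outputs)
print(reversed_outputs == list(reversed(outputs)))  # True: positions never interact
```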

Why is d_ff = 4 × d_model? The expansion allows the FFN to represent more complex functions. Think of it as: attention mixes information across positions; FFN processes information within each position. The 4x expansion gives the FFN enough capacity to do useful computation.

Recent research suggests the FFN acts as a key-value memory. With the x W₁ convention used above, each column of W₁ is a "key" that activates on specific input patterns, and the corresponding row of W₂ is the "value" that gets added to the representation when that pattern is detected. With d_ff = 2048, the FFN has 2048 memory slots. One slot might fire when it sees a capital letter after a period (triggering "this is a sentence start"), another when it sees a number followed by a unit (triggering "this is a measurement"). On this view, the FFN is where the Transformer stores much of its factual knowledge.

This is why scaling up d_ff (and by extension d_model) improves the model's ability to store facts. GPT-3 has d_ff = 4 × 12288 = 49,152 memory slots per layer, across 96 layers. That's 4.7 million memory slots — enough to store an enormous amount of world knowledge.

Modern FFN Variants

The original FFN uses ReLU activation. Modern Transformers have found that SwiGLU (a gated variant of Swish) works better:

FFN_SwiGLU(x) = (Swish(x W₁) ⊙ x W₃) W₂

Where ⊙ is element-wise multiplication and Swish(x) = x · σ(x). The gating mechanism (multiplying by x W₃) lets the network selectively activate different memory slots more precisely than ReLU. LLaMA, Mistral, and most 2023-2024 models use SwiGLU. The cost: an extra weight matrix W₃, which increases FFN parameters by ~50% at a fixed d_ff. In practice, models such as LLaMA shrink d_ff to roughly two-thirds of 4 × d_model, so the total FFN parameter count stays about the same.

Another common variant: GeGLU (GELU-gated), which replaces Swish with GELU. The differences between SwiGLU, GeGLU, and ReGLU (ReLU-gated) are small — the key insight is that gating helps, regardless of which activation function gates it.
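The gated formula can be sketched in a few lines of plain Python. The weights below are made-up numbers and biases are omitted (as in the formula above); `ffn_params` is a hypothetical helper that only counts weight-matrix entries.

```python
import math

def swish(z):
    return z / (1.0 + math.exp(-z))   # z * sigmoid(z)

def matvec(x, W):
    """Row vector x times matrix W (given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def ffn_swiglu(x, W1, W3, W2):
    gate = [swish(z) for z in matvec(x, W1)]         # Swish(x W1)
    value = matvec(x, W3)                            # x W3
    hidden = [g * v for g, v in zip(gate, value)]    # element-wise product
    return matvec(hidden, W2)

def ffn_params(d_model, d_ff, gated):
    """Weight count without biases: 2 matrices for the ReLU FFN, 3 for gated variants."""
    return (3 if gated else 2) * d_model * d_ff

x = [1.0, -2.0]
W1 = [[0.5, -1.0, 0.0], [1.0, 0.5, 2.0]]   # [d_model=2, d_ff=3]
W3 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]   # [d_model=2, d_ff=3]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # [d_ff=3, d_model=2]
print(ffn_swiglu(x, W1, W3, W2))

print(ffn_params(512, 2048, gated=False))  # 2097152
print(ffn_params(512, 2048, gated=True))   # 3145728, i.e. 1.5x at the same d_ff
```

Swapping `swish` for a GELU or ReLU function in `ffn_swiglu` would give the GeGLU and ReGLU variants with no other changes.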

Residual Connections

Each sub-layer has a residual connection that adds the input to the output:

output = LayerNorm(x + SubLayer(x))

Residual connections are arguably the most important architectural choice in the Transformer. Without them, deep Transformers (6+ layers) fail to train. Why? In a deep network, the gradient must pass through every layer during backpropagation. Each layer transforms the gradient, and many layers of transformation can shrink it to near-zero (vanishing gradient). The residual connection provides a "highway" for the gradient to flow directly from the loss back to early layers, bypassing the sub-layers. Because the derivative of x + SubLayer(x) with respect to x is the identity plus the sub-layer's own Jacobian, the gradient always contains an identity term, no matter how the sub-layer transforms it.

Layer Normalization

Layer normalization normalizes each token's representation to have zero mean and unit variance, then applies learned scale (γ) and shift (β) parameters:

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β

Where μ and σ² are the mean and variance computed across the d_model dimensions of a single token (not across the batch or sequence). This stabilizes training by preventing the internal representations from drifting to very large or very small values.
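Here is the formula as a short plain-Python sketch. The default ε = 1e-5 matches the default of PyTorch's `nn.LayerNorm`; the example vectors are made up. Each token is normalized using only its own d_model values.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one token vector across its feature dimensions."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d   # biased variance, as in the formula
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

sequence = [[2.0, 4.0, 6.0, 8.0],          # token 0
            [100.0, 100.5, 99.5, 100.0]]   # token 1, on a very different scale
normalized = [layer_norm(token) for token in sequence]

for token in normalized:
    mean = sum(token) / len(token)
    variance = sum((v - mean) ** 2 for v in token) / len(token)
    print(round(mean, 6), round(variance, 4))   # close to 0 and close to 1 for every token
```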

Why Layer Norm, Not Batch Norm?

Batch normalization (Ioffe & Szegedy, 2015) normalizes across the batch dimension: for each feature, compute the mean and variance across all examples in the batch. This works well for CNNs where batch statistics are stable, but fails for sequences because:

1. Variable sequence lengths. Different sequences in a batch have different lengths. Position 50 might exist in 3 of 8 batch items. Batch statistics at position 50 are computed from only 3 examples — too noisy to be useful.

2. Train-test mismatch. Batch norm uses running statistics at test time, but at inference we often process one sequence at a time (batch size 1). The running statistics from training (computed over batches of 32+) don't match the single-example test distribution.

3. Sequence length changes. Batch norm statistics are position-dependent. A model trained on 512-token sequences has statistics for positions 0-511. At test time with 1024 tokens, positions 512-1023 have no statistics.

Layer norm avoids all these problems because it normalizes each token independently, with no cross-sequence or cross-batch dependencies.

Click each component in the simulation below to expand it and see the tensor shapes flowing through.

[Interactive figure: Encoder Block: Data Flow. Each block expands to show tensor shapes; residual connections are shown as bypass arrows.]

The residual connection is the most important part of the Transformer. Without it, deep Transformers can't train. The gradient has a direct highway from the loss to every layer, bypassing all the attention and FFN transformations. This is why you can stack 6, 12, even 96 layers.

Putting It Together

Input

Token embeddings + positional encoding. Shape: [n, 512]

↓

Multi-Head Self-Attention

All positions attend to all positions. Shape: [n, 512] → [n, 512]

↓ + residual

Layer Norm

Normalize each token independently. Shape: [n, 512]

↓

Feed-Forward Network

[n, 512] → [n, 2048] → ReLU → [n, 512]. Per-position.

↓ + residual

Layer Norm

Normalize again. Shape: [n, 512]. This is the layer output.

↻ repeat N=6 times

After 6 such layers, the encoder produces a rich, contextual representation of the input sequence. Every token's representation has been enriched by information from every other token, through 6 rounds of attention and nonlinear transformation.

One Encoder Block in Code

python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        attn_out, _ = self.attn(x, x, x)       # self-attention
        x = self.ln1(x + attn_out)              # residual + norm
        ff_out = self.ff(x)                     # feed-forward
        x = self.ln2(x + ff_out)               # residual + norm
        return x                               # [batch, seq_len, d_model]

Notice how compact this is: two sub-layers, each wrapped in residual + LayerNorm. The entire encoder is just nn.Sequential(*[EncoderBlock() for _ in range(N)]). The Transformer's power comes from repetition of simple blocks, not from any single complex component.

What Happens Without Each Component?

Each component exists for a reason. Here's what breaks if you remove one:

Remove This	What Happens
Residual connection	Training diverges after 3-4 layers. Gradients vanish. Deep Transformers become untrainable.
Layer normalization	Activations drift to extreme values. Training becomes unstable and requires very small learning rates.
FFN (keep only attention)	Model loses per-position processing power. Quality drops ~2-3 BLEU. The model can mix information but can't transform it.
Multi-head (use single head)	Model captures fewer relationship types simultaneously. Quality drops ~1 BLEU but still works.
Positional encoding	Model treats input as a bag-of-words. Word order information is lost. "Dog bites man" = "Man bites dog."

Pre-Norm vs. Post-Norm

The original Transformer applies LayerNorm after the residual connection: LayerNorm(x + SubLayer(x)). This is called Post-Norm. Many later implementations (including GPT-2, GPT-3, LLaMA) use Pre-Norm: x + SubLayer(LayerNorm(x)). Pre-Norm is more stable during training because the residual path is a clean identity — the gradient flows through without any normalization in the way. Post-Norm sometimes needs learning rate warmup to avoid divergence.
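The two orderings differ only in where LayerNorm sits relative to the residual sum. The sketch below is a toy illustration in plain Python: `sublayer` is any function from a vector to a vector (a stand-in for attention or the FFN), the learned scale and shift are omitted, and the zero sub-layer is a deliberately artificial choice that isolates the residual path.

```python
import math

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """Original Transformer: LayerNorm(x + SubLayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """GPT-2 style: x + SubLayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def zero_sublayer(v):
    return [0.0] * len(v)

x = [10.0, 20.0, 30.0]
print(pre_norm(x, zero_sublayer))    # [10.0, 20.0, 30.0]: the input passes through untouched
print(post_norm(x, zero_sublayer))   # rescaled: the residual path itself goes through LayerNorm
```

With Pre-Norm, stacking many blocks leaves an unmodified copy of the input on the residual stream, which is the "clean identity" described above.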

What does the residual connection in an encoder block add together?

The Q and K matrices
The attention output and the FFN output
The sub-layer's input and its output (bypassing the sub-layer)

Chapter 6: The Decoder

The encoder sees the whole input at once. But when generating output, the decoder must go one token at a time — it produces "The," then "cat," then "sat," each conditioned on what it has generated so far. If the decoder could see future tokens during training, it would just copy the answer instead of learning to predict. No peeking allowed.

Three Sub-layers, Not Two

The decoder block has the same two sub-layers as the encoder (self-attention + FFN), plus a third: cross-attention sandwiched between them.

1. Masked Self-Attention

Decoder attends only to earlier positions. Causal mask prevents peeking ahead.

↓ + residual + LayerNorm

2. Cross-Attention

Decoder attends to encoder output. Q from decoder, K/V from encoder.

↓ + residual + LayerNorm

3. Feed-Forward Network

Same as encoder FFN. Per-position, independent.

↓ + residual + LayerNorm

Masked Self-Attention: The Causal Mask

In the encoder, every position attends to every other position. In the decoder, position i can only attend to positions ≤ i. This is enforced by setting all attention scores from i to positions j > i to −∞ before softmax. Since exp(−∞) = 0, future positions get zero attention weight.

In matrix form, the causal mask M has −∞ strictly above the diagonal and 0 everywhere else, and it is added to the scores before softmax:

MaskedAttention(Q, K, V) = softmax(Q K^T / √d_k + M) · V

Where M_ij = 0 if i ≥ j, and M_ij = −∞ if i < j. This ensures that each position can only gather information from the past and present, never the future.

Causal Mask: Worked Example

Suppose the decoder generates the sentence "Le chat dort" (French for "The cat sleeps"). The mask matrix M for 3 positions is:

M = [[0, −∞, −∞], [0, 0, −∞], [0, 0, 0]]

At position 0 ("Le"), the raw attention scores are [2.1, 3.5, −0.8]. Adding the mask: [2.1, −∞, −∞]. After softmax: [1.0, 0.0, 0.0]. "Le" can only see itself.

At position 1 ("chat"), scores are [1.5, 2.8, 0.9]. After mask: [1.5, 2.8, −∞]. Softmax: [0.21, 0.79, 0.0]. "chat" attends to "Le" (21%) and itself (79%).

At position 2 ("dort"), scores are [0.3, 1.8, 2.1]. No masking needed (last row is all zeros). Softmax: [0.09, 0.39, 0.52]. "dort" can see everything.

The mask elegantly prevents information leakage while allowing the entire sequence to be processed in parallel during training. Without it, the model would cheat by looking at future tokens.
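The worked example can be reproduced with a few lines of plain Python. This is a minimal sketch using the raw scores quoted above; the helper names are assumptions, and a real implementation would operate on tensors for all heads at once.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = 0 if j <= i, else -inf."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[2.1, 3.5, -0.8],   # "Le"
          [1.5, 2.8, 0.9],    # "chat"
          [0.3, 1.8, 2.1]]    # "dort"

mask = causal_mask(3)
weights = [softmax([s + m for s, m in zip(score_row, mask_row)])
           for score_row, mask_row in zip(scores, mask)]

for token, row in zip(["Le", "chat", "dort"], weights):
    print(token, [round(w, 2) for w in row])
```

Every row still sums to 1; the mask only redistributes weight away from future positions.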

Cross-Attention: Connecting Encoder and Decoder

Cross-attention is where the decoder "reads" the encoder's output. The mechanism is identical to self-attention, but with one crucial difference: the queries come from the decoder, while the keys and values come from the encoder.

Q = decoder_hidden · W_Q

K = encoder_output · W_K, V = encoder_output · W_V

This lets each decoder position ask: "Which parts of the input should I pay attention to right now?" When generating "le" in French, the decoder might attend strongly to "the" in the English encoder output. When generating "chat," it attends to "cat."
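A minimal plain-Python sketch of the shapes involved. The vectors are hypothetical and already projected (W_Q, W_K, W_V are left out for brevity). The attention weight matrix has one row per decoder position and one column per encoder position.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of vectors; returns (outputs, weights)."""
    d_k = len(K[0])
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
               for q in Q]
    outputs = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
               for w_row in weights]
    return outputs, weights

decoder_Q = [[1.0, 0.0], [0.0, 1.0]]                  # 2 target positions
encoder_K = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]      # 3 source positions
encoder_V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

out, w = attention(decoder_Q, encoder_K, encoder_V)
print(len(w), len(w[0]))          # 2 3
for row in w:
    print([round(p, 2) for p in row])
```

Here the first decoder position puts most of its weight on the first source position and the second on the second, a toy version of "le" aligning with "the" and "chat" with "cat". No causal mask is needed, because the entire source sentence is available.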

The simulation below shows the decoder in action. The attention matrix shows the causal mask (grayed upper triangle). Step through generation token-by-token — each step reveals one more row in the attention matrix. Cross-attention arrows connect to the encoder.

[Interactive figure: Decoder: Masked Self-Attention + Cross-Attention. Steps through autoregressive generation one token at a time, starting from the <SOS> token. Gray = masked (can't see future). Purple lines = cross-attention to encoder.]

The causal mask is the decoder's blindfold. During training, all decoder positions are computed in parallel (like the encoder), but the mask ensures each position can only see earlier positions. At inference time, generation is inherently sequential — we don't know token t+1 until we've generated token t.

Training vs. Inference

A subtle but important distinction. During training, we know the entire target sequence. We feed it all at once to the decoder and apply the causal mask. This is called teacher forcing — the model always sees the correct previous tokens, even if it would have predicted wrong ones. All positions are computed in parallel (a single forward pass), making training efficient.

During inference, we don't have the target. We generate one token at a time: feed <SOS>, predict the first token, feed <SOS> + first token, predict the second, and so on. This is autoregressive and inherently sequential. Each step requires a full forward pass through the decoder, though we can cache the key/value computations from previous positions (KV caching) to avoid redundant computation.
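The two regimes can be sketched with token IDs alone. In the snippet below, `toy_model` is a made-up stand-in for a trained decoder (it just replays a fixed sequence), and the IDs for <SOS> and <EOS> are arbitrary assumptions.

```python
def teacher_forcing_pair(target_ids, sos_id):
    """Training: decoder input is the target shifted right by one; labels are the target."""
    decoder_input = [sos_id] + target_ids[:-1]
    labels = list(target_ids)
    return decoder_input, labels

def greedy_decode(next_token_fn, sos_id, eos_id, max_len=20):
    """Inference: one model call per generated token, each conditioned on the prefix so far."""
    generated = [sos_id]
    for _ in range(max_len):
        token = next_token_fn(generated)
        generated.append(token)
        if token == eos_id:
            break
    return generated[1:]

def toy_model(prefix):
    """Hypothetical decoder that always continues the sequence 5, 6, 7, <EOS>=2."""
    script = [5, 6, 7, 2]
    return script[len(prefix) - 1]

print(teacher_forcing_pair([5, 6, 7, 2], sos_id=1))   # ([1, 5, 6, 7], [5, 6, 7, 2])
print(greedy_decode(toy_model, sos_id=1, eos_id=2))    # [5, 6, 7, 2]
```

During training, all four (input, label) positions are scored in a single masked forward pass. During inference, the loop calls the model four times.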

Why Teacher Forcing Works (and When It Doesn't)

Teacher forcing has a subtle problem: exposure bias. During training, the model always sees correct previous tokens. During inference, it sees its own (potentially wrong) predictions. If the model generates a wrong token early on, all subsequent predictions are conditioned on that error — but the model was never trained in this scenario. It's like practicing basketball by always catching perfect passes, then being thrown a bad pass in a real game.

Despite this theoretical concern, teacher forcing works well in practice for Transformers. The attention mechanism helps: even if one token is wrong, the model can attend to many other (correct) tokens. The error doesn't propagate through a hidden state chain like in RNNs — it's just one position in a set of many. Techniques like scheduled sampling (gradually mixing model predictions into the training targets) can reduce exposure bias but add complexity and are rarely used with modern Transformers.

KV Caching: The Speed Trick

Without caching, generating token t requires a forward pass over all t positions, including recomputing keys and values for positions 1 through t−1, which we already computed in previous steps. The attention alone costs O(t² · d) per step, or O(n³ · d) total for n tokens.

With KV caching, we store the key and value vectors from all previous positions. At step t, we only compute Q, K, V for the new position t, append the new K and V to the cache, and attend over the full cached sequence. The projection work per step drops from O(t · d²) to O(d²), and the attention itself costs O(t · d) because only one query row is needed. Total attention cost: O(n² · d) instead of O(n³ · d).

cache_K ← [cache_K; k_new], cache_V ← [cache_V; v_new]

The cost: memory. For each layer, we store n × d_k for keys and n × d_k for values, times h heads. For GPT-3 (96 layers, 96 heads, d_k = 128, 2048 tokens): 96 × 2 × 96 × 128 × 2048 × 2 bytes = ~9.7 GB per sequence. This is why LLMs need so much GPU memory during inference, and why KV cache compression is an active research area.
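Both halves of the trade-off fit in a short sketch. The single-head `KVCache` class below is a simplified assumption (plain Python lists, made-up vectors, no batching), and `kv_cache_bytes` is a hypothetical helper that simply multiplies out the factors listed above.

```python
import math

def attend(q, keys, values):
    """Single-query scaled dot-product attention over all cached keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

class KVCache:
    """Stores keys and values of earlier positions so they are never recomputed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q_new, k_new, v_new):
        self.keys.append(k_new)      # cache_K <- [cache_K; k_new]
        self.values.append(v_new)    # cache_V <- [cache_V; v_new]
        return attend(q_new, self.keys, self.values)

def kv_cache_bytes(n_layers, n_heads, d_k, n_tokens, bytes_per_value=2):
    """Keys and values (the factor 2) for every layer, head, and token."""
    return n_layers * 2 * n_heads * d_k * n_tokens * bytes_per_value

# Hypothetical per-position (q, k, v) vectors for three decoding steps.
steps = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
         ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
         ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0])]
cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in steps]
print(cached_outputs[-1])

print(kv_cache_bytes(96, 96, 128, 2048))   # 9663676416 bytes, about 9.7 GB
```

Each call to `step` does work proportional to the current cache length, and the cache itself grows by one key and one value per layer and head for every generated token.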

The Full Decoder in Code

python
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ln3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # x: [batch, tgt_len, d_model], enc_out: [batch, src_len, d_model]
        # causal_mask: [tgt_len, tgt_len]; boolean (True = blocked) or float (-inf = blocked)
        # 1. Masked self-attention (decoder attends to itself)
        attn1, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.ln1(x + attn1)
        # 2. Cross-attention (Q from decoder, K/V from encoder)
        attn2, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.ln2(x + attn2)
        # 3. Feed-forward
        x = self.ln3(x + self.ff(x))
        return x                                # [batch, tgt_len, d_model]

In cross-attention, where do the queries come from and where do the keys/values come from?

Queries from the decoder, keys and values from the encoder
Queries from the encoder, keys and values from the decoder
All three from the encoder

Chapter 7: Transformer Builder

Time to put it all together. Build a Transformer piece by piece and watch data flow through each component. Start with raw tokens, add embedding, positional encoding, encoder layers, decoder layers, and the output head. Each addition animates the data as it flows through.

[Interactive figure: Build-a-Transformer. Components are added one at a time, starting with the embedding. Toggles: residual connections, positional encoding. Sliders: layers (default 6), d_model (default 512), heads (default 8). Displays the running parameter count.]

Watch how the parameter count changes as you adjust the sliders. The original Transformer base model has 65M parameters (d_model=512, N=6, h=8). The large model has 213M parameters (d_model=1024, N=6, h=16). Modern LLMs like GPT-3 (175B) and Llama (65B) use the same building blocks, arranged as a decoder-only stack and scaled up massively.

Training Recipe

The original Transformer was trained on WMT 2014 English-German (4.5M sentence pairs) and English-French (36M pairs). The training recipe contained several innovations that became standard practice:

Learning rate warmup. The learning rate starts at zero and linearly increases for the first 4,000 steps, then decays proportional to the inverse square root of the step number:

lr = d_model^−0.5 · min(step^−0.5, step · warmup_steps^−1.5)

Why warmup? Early in training, the model's parameters are random. Large learning rates + random parameters = large, unstable updates. Warmup lets the model find a reasonable region of parameter space before full-speed optimization. This schedule is now so standard it's called the "Transformer schedule" or "Noam schedule" (after one of the paper's authors).
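The schedule is a one-liner in code. This sketch implements the formula above directly; the function name `noam_lr` is an assumption, and the step counter is taken to start at 1 because step = 0 would divide by zero.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 100000):
    print(step, noam_lr(step))
```

The two arguments of `min` cross at step = warmup_steps, which is where the learning rate peaks (about 7e-4 for d_model = 512 and 4,000 warmup steps). Quadrupling the step count after that point halves the learning rate.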

Label smoothing. Instead of training against hard one-hot targets (0 everywhere, 1 at the correct token), they used soft targets with smoothing ε = 0.1: roughly 0.9 at the correct token, with the remaining 0.1 spread evenly over the vocabulary. This hurts perplexity (the exponentiated average negative log-likelihood on test data) because the model learns to be less certain, but it improves accuracy and BLEU scores.
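A small sketch of smoothed targets in plain Python. It assumes one common convention, mixing the one-hot target with a uniform distribution over the whole vocabulary, so the correct token ends up with 1 − ε + ε/|V|. Other implementations spread ε over only the incorrect tokens. The five-token vocabulary and the two predicted distributions are made up.

```python
import math

def smoothed_targets(correct_id, vocab_size, eps=0.1):
    """(1 - eps) * one_hot + eps * uniform; the entries sum to 1."""
    dist = [eps / vocab_size] * vocab_size
    dist[correct_id] += 1.0 - eps
    return dist

def cross_entropy(target_dist, predicted_probs):
    return -sum(t * math.log(p) for t, p in zip(target_dist, predicted_probs) if t > 0)

target = smoothed_targets(correct_id=2, vocab_size=5)
print(target)   # about 0.92 on the correct token, 0.02 on each of the others

overconfident = [0.001, 0.001, 0.996, 0.001, 0.001]
calibrated = [0.02, 0.02, 0.92, 0.02, 0.02]
print(cross_entropy(target, overconfident))   # larger loss
print(cross_entropy(target, calibrated))      # smaller loss
```

Against a hard one-hot target the overconfident prediction would score better; with smoothing it is penalized, which is the mechanism that keeps the model from becoming too certain.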

Dropout. Applied after every sub-layer, after attention weights, and in the positional encoding addition. Rate: 0.1 for the base model. Without dropout, the Transformer overfits quickly on smaller datasets.

Adam optimizer with β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁹. Note that β₂ = 0.98 is lower than Adam's usual default of 0.999, so the second-moment estimate averages over a shorter window and adapts more quickly to changes in gradient scale.

Results That Changed NLP

The Transformer achieved 28.4 BLEU on English-to-German translation, beating the previous best results, including ensembles, by over 2 BLEU points. On English-to-French, it hit 41.0 BLEU, a new single-model state of the art. And it trained in 3.5 days on 8 P100 GPUs, a fraction of the training cost of the previous state-of-the-art models.

Model	EN-DE BLEU	EN-FR BLEU	Training Cost
ConvS2S (2017)	25.2	40.5	Very high
GNMT + RL, single model (Google, 2016)	24.6	39.9	6 days, 96 GPUs
Transformer (base)	27.3	38.1	12 hrs, 8 GPUs
Transformer (big)	28.4	41.0	3.5 days, 8 GPUs

The cost efficiency was as impressive as the quality. The Transformer wasn't just better — it was dramatically cheaper to train. This is what made the subsequent scaling revolution possible.

Full Architecture Summary

Component	Purpose	Parameters (base)
Token Embedding	Map token IDs to d_model vectors	V × 512 ≈ 19M
Positional Encoding	Inject position info (sinusoidal = 0 params)	0
Encoder Self-Attention ×6	Context mixing across positions	6 × 1.05M = 6.3M
Encoder FFN ×6	Per-position nonlinear transform	6 × 2.1M = 12.6M
Encoder LayerNorm ×12	Stabilize activations	12 × 1K ≈ 12K
Decoder (same + cross-attn) ×6	Generate output autoregressively	≈25M
Output Linear + Softmax	Map d_model to vocabulary probs	512 × V ≈ 19M (shared with the token embedding in the paper, so counted once)
Total (base)		≈65M
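The table's arithmetic can be reproduced in a few lines. This sketch makes several assumptions: a shared vocabulary of V = 37,000 tokens, an output projection that shares its weights with the token embedding (as the paper describes), and weight matrices only, since biases and LayerNorm parameters add well under 1%.

```python
def transformer_base_params(vocab=37000, d_model=512, d_ff=2048, n_layers=6):
    """Approximate parameter count from weight matrices alone."""
    embedding = vocab * d_model                   # also used as the output projection
    attention = 4 * d_model * d_model             # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff                      # W_1, W_2
    encoder = n_layers * (attention + ffn)
    decoder = n_layers * (2 * attention + ffn)    # self-attention + cross-attention + FFN
    return {"embedding": embedding, "encoder": encoder, "decoder": decoder,
            "total": embedding + encoder + decoder}

counts = transformer_base_params()
for name, value in counts.items():
    print(f"{name:>10}: {value / 1e6:.1f}M")
```

Under these assumptions the total lands near 63M, in the same range as the ≈65M quoted for the base model. Counting the output projection as a separate 19M matrix would push the sum to roughly 82M.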

Minimal Transformer in PyTorch

Here's the complete encoder stack — everything from tokens to contextual representations — in under 30 lines of PyTorch:

python
import torch, torch.nn as nn, math

class Transformer(nn.Module):
    def __init__(self, vocab=37000, d=512, N=6, h=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.pe = sinusoidal_pe(5000, d)   # from earlier
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=d, nhead=h, dim_feedforward=4*d,
                dropout=0.1, batch_first=True
            ), num_layers=N
        )
        self.out = nn.Linear(d, vocab)
        self.d = d

    def forward(self, x):
        # x: [batch, seq_len] token IDs
        seq_len = x.size(1)
        x = self.embed(x) * math.sqrt(self.d)  # scale embeddings
        x = x + self.pe[:seq_len].to(x.device)
        x = self.encoder(x)             # [batch, seq_len, d]
        return self.out(x)              # [batch, seq_len, vocab]

The embedding scaling by √d_model is a detail from the paper that's easy to miss: it ensures the embedding magnitudes are comparable to the positional encoding magnitudes (which are bounded in [−1, 1]). Without this scaling, d_model = 512 embeddings initialized near zero would be dwarfed by the positional signal.

Inference: Greedy vs. Beam Search vs. Sampling

Once the Transformer produces output probabilities, how do we choose the next token?

Greedy decoding: always pick the highest-probability token. Fast (one forward pass per step) but can produce repetitive, generic text. "The cat sat on the mat. The cat sat on the mat."

Beam search: keep the top-k partial sequences at each step (beam width k), exploring multiple paths. The winning sequence is the one with the highest total log-probability. Produces higher-quality translations than greedy but is k times more expensive. The original Transformer used beam width 4.

Sampling with temperature: sample from the softmax distribution, optionally sharpened (temperature < 1) or flattened (temperature > 1). Top-k sampling restricts to the k most likely tokens. Top-p (nucleus) sampling restricts to the smallest set of tokens whose cumulative probability exceeds p. Chat-oriented deployments often use top-p with p around 0.9 and temperature around 0.7 for a balance of quality and diversity, though the exact values vary by system.
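The selection rules above can be sketched as small functions over a list of logits or probabilities. This is an illustrative plain-Python version with made-up logits; the function names are assumptions, and beam search is left out to keep it short.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng=random):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
print(greedy(softmax(logits)))                     # 0
print(softmax(logits, temperature=0.7))            # sharper than temperature 1.0
print(top_k_filter([0.5, 0.3, 0.15, 0.05], k=2))   # only the two most likely tokens survive
print(top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)) # three tokens needed to reach 0.9
print(sample(top_p_filter(softmax(logits, 0.7), 0.9)))
```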

Chapter 8: Beyond Text

Vaswani et al. built the Transformer for machine translation. Within six years, it had conquered language modeling (GPT), bidirectional understanding (BERT), computer vision (ViT), music generation (Music Transformer), protein folding (AlphaFold 2), and robotics (RT-2). How? Because the Transformer's core operation, attention over a set of tokens, doesn't care what those tokens represent.

The Universal Trick: Tokenize Anything

The key insight: attention treats its input as a set of vectors. It doesn't know or care whether those vectors represent words, image patches, audio frames, or amino acids. As long as you can convert your data into a sequence of embeddings, you can apply a Transformer.

Image Transformer (Parmar et al., 2018): Treats each pixel as a token. But full self-attention over every pixel in a 256×256 image would need 65,536 × 65,536 ≈ 4.3 billion attention scores. Solution: local attention, in which each pixel attends only to a small neighborhood (a "local attention window"), reducing cost from O(n²) to O(n · w) where w is the window size. The model can still capture long-range dependencies by stacking many layers.

Vision Transformer (ViT, Dosovitskiy et al., 2021): Instead of pixels, split the image into 16×16 patches, flatten each patch into a vector, and treat patches as tokens. A 224×224 image becomes 196 tokens — manageable for full self-attention. Add a [CLS] token for classification, add learned positional embeddings, and apply a standard Transformer encoder. ViT matched or beat CNNs on ImageNet when trained on large datasets.

Music Transformer (Huang et al., 2018): Represents music as a sequence of MIDI events (note-on, note-off, time-shift). The key innovation: relative positional encoding instead of absolute. In music, what matters is the interval between notes, not their absolute position. The Music Transformer uses relative attention to capture patterns like "this note is a fifth above the note 4 beats ago."

AlphaFold 2 (DeepMind, 2021): Solved the 50-year-old protein folding problem using a modified Transformer. The input is an amino acid sequence (like a sentence of 20 possible "characters"). The attention mechanism learns which amino acids interact in 3D space — residues that are far apart in sequence can be close in the folded structure. The model uses a custom attention variant called "triangle attention" that respects geometric constraints.

RT-2 (Google, 2023): A robot that thinks in language. The input is an image (tokenized like ViT) concatenated with a text instruction ("pick up the blue cup"). The output is a sequence of action tokens (motor commands). The same attention mechanism that resolves coreference in text now decides which part of the visual scene to focus on while planning robot movements.
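The token counts quoted above for Image Transformer and ViT translate directly into attention cost. The short calculation below counts query-key pairs per attention layer; the window size of 256 is a hypothetical value chosen only to illustrate the O(n · w) scaling.

```python
def full_attention_pairs(n_tokens):
    """Every token attends to every token: n^2 query-key pairs."""
    return n_tokens * n_tokens

def local_attention_pairs(n_tokens, window):
    """Every token attends to a fixed-size neighborhood: n * w pairs."""
    return n_tokens * window

pixels = 256 * 256
print(full_attention_pairs(pixels))            # 4294967296 pairs for pixel-level full attention
print(local_attention_pairs(pixels, 256))      # 16777216 pairs with a window of 256

vit_tokens = (224 // 16) ** 2
print(vit_tokens, full_attention_pairs(vit_tokens))   # 196 tokens, 38416 pairs
```

Local windows cut the pixel-level cost by a factor of n / w, while patching shrinks n itself, which is why ViT can afford full self-attention.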

Click each panel below to see how each domain tokenizes its data for the Transformer.

[Interactive figure: Transformers Everywhere. One panel per domain showing how its data is tokenized for attention. Text = subwords, Images = patches, Music = MIDI events.]

The Transformer's secret weapon: attention treats input as a SET of tokens. Any data you can tokenize — text, images, audio, protein sequences, robot trajectories — you can Transform. The architecture is domain-agnostic; only the tokenizer changes.

The Transformer Family Tree

Model	Year	Domain	Key Adaptation
Transformer	2017	Translation	Original encoder-decoder
GPT	2018	Language	Decoder-only, autoregressive
BERT	2018	Language	Encoder-only, masked LM
Image Transformer	2018	Vision	Local attention windows
Music Transformer	2018	Music	Relative positional encoding
ViT	2021	Vision	Image patches as tokens
AlphaFold 2	2021	Biology	Amino acid + structure tokens
RT-2	2023	Robotics	Vision + language + action tokens

ViT: Step by Step

Let's trace a 224×224 RGB image through ViT:

1. Patch Extraction

Split image into 16×16 patches. 224/16 = 14 patches per side. 14×14 = 196 patches. Each patch: 16×16×3 = 768 values (256 pixels × 3 color channels).

↓

2. Linear Projection

Flatten each patch to a 768-dim vector. Project to d_model via learned linear layer. Shape: [196, d_model].

↓

3. Prepend [CLS]

Add a learnable [CLS] token at position 0. Shape: [197, d_model]. This token's final representation is used for classification.

↓

4. Add Pos Embeddings

Add learned 1D position embeddings (not sinusoidal; the ViT authors report no significant gain from 2D-aware embeddings). Shape: [197, d_model].

↓

5. Transformer Encoder

Standard encoder blocks, N = 12 (ViT-B). Self-attention over all 197 tokens.

↓

6. Classification Head

Take [CLS] token output. MLP head maps to class probabilities. Done.
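The shape bookkeeping in steps 1 to 3 above can be sketched with nested lists. This is an illustration only: the helper name `patchify` is an assumption, the image is a dummy array of zeros, and the learned projection, [CLS] vector, and position embeddings are not modeled.

```python
def patchify(image, patch):
    """image: H x W x C nested lists (H and W divisible by patch).
    Returns flattened patches in row-major order."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [value
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)
                    for value in image[r][c]]
            patches.append(flat)
    return patches

# A dummy 224 x 224 RGB image of zeros stands in for real pixel data.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16)
print(len(patches), len(patches[0]))   # 196 768

seq_len = len(patches) + 1             # +1 for the [CLS] token
print(seq_len)                         # 197
```

Each 768-value patch vector would then be multiplied by a learned [768, d_model] matrix, which plays the same role as the word-embedding lookup in a text model.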

The beauty: steps 5 and 6 are identical to BERT. ViT literally is BERT for images. The only difference is the tokenizer (patch embedding instead of word embedding). This universality is what makes the Transformer the most important architecture in modern AI.

One surprising finding from ViT: it needs much more data than CNNs to perform well. CNNs have an inductive bias toward local spatial patterns (convolution kernels are local). The Transformer has no such bias — it must learn locality from data. On ImageNet alone (1.2M images), ViT underperforms ResNet. But on JFT-300M (300M images), ViT crushes ResNet. The Transformer trades inductive bias for flexibility: give it enough data, and it learns better representations than any hard-coded structure.

This data-hungry property explains a lot about modern AI. Transformers need massive datasets because they start with fewer assumptions. But those fewer assumptions mean they can discover patterns that human-designed architectures miss. The bet is: data is cheaper than engineering. And so far, that bet has paid off spectacularly.

The Scaling Law Implication

Every domain that adopted the Transformer discovered the same thing: performance improves predictably with scale. More data, more parameters, more compute — and the model gets better in a smooth, predictable curve (a power law). This finding, first documented by Kaplan et al. (2020) for language, extended to vision (Zhai et al., 2022), robotics (Brohan et al., 2023), and biology. It suggests that the Transformer architecture has few inherent bottlenecks — the main limit is how much compute you can afford.

What modification does Image Transformer make to standard self-attention?

It removes positional encoding
It restricts attention to a local neighborhood (local attention windows) instead of all positions
It uses convolutional layers instead of attention

Chapter 9: Connections

The Transformer didn't appear in a vacuum. It built on a decade of attention research and spawned an avalanche of follow-up work that continues today.

Papers

Attention Is All You Need (Vaswani et al., 2017) — The original Transformer paper. 100,000+ citations.
Layer Normalization (Ba et al., 2016) — The normalization technique used in every Transformer layer.
Image Transformer (Parmar et al., 2018) — First application of Transformers to image generation with local attention.
Music Transformer (Huang et al., 2018) — Relative attention for music generation.

RNN vs. LSTM vs. Transformer

Property	Vanilla RNN	LSTM	Transformer
Sequential computation	O(n)	O(n)	O(1)
Max path length	O(n)	O(n)	O(1)
Handles long-range?	No (vanishing grad)	Partially (gates help)	Yes (direct attention)
Parallelizable (train)?	No	No	Yes
Memory mechanism	Hidden state h_t	Cell state c_t + gates	Attention weights
Parameter sharing	Across time steps	Across time steps	Across positions
Year	1990	1997	2017

Key Numbers to Remember

Aspect	Original Transformer (2017)	Modern LLM (2024)
d_model	512	4096-12288
Layers	6	32-96
Heads	8	32-96
Parameters	65M	7B-400B+
Context length	512	8K-128K+
Training data	4.5M sentence pairs	1-15 trillion tokens
Training compute	~100 GPU-hours	~10M GPU-hours
Architecture changes	—	RoPE, GQA, SwiGLU, RMSNorm

The remarkable thing: the 2024 column is essentially the same architecture as the 2017 column. The core mechanism (scaled dot-product attention with multi-head projections, residual connections, and a normalization layer around each sub-layer) is unchanged. The improvements are mostly engineering refinements (RoPE for better position encoding, SwiGLU for a better activation function, RMSNorm for cheaper normalization, Grouped Query Attention for KV cache efficiency) and massive scaling.
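Two of the numbers above can be checked with back-of-envelope arithmetic. The sketch below first estimates the weight count of the 2017 base model, then compares KV cache size with and without Grouped Query Attention. Assumptions: d_ff = 2048 and a shared vocabulary of about 37,000 tokens come from the original paper's base configuration, not from the table; biases and normalization parameters are ignored; and the "modern" configuration (32 layers, 32 heads of dimension 128, fp16, 8 KV heads under GQA) is hypothetical.

```python
def transformer_base_params(d_model=512, d_ff=2048, n_layers=6, vocab=37_000):
    """Rough weight count for an encoder-decoder Transformer (no biases or norms)."""
    attention = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff                     # two linear layers
    encoder = n_layers * (attention + ffn)
    decoder = n_layers * (2 * attention + ffn)   # self-attention + cross-attention
    embeddings = vocab * d_model                 # one shared embedding matrix
    return encoder + decoder + embeddings

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory for the cached keys and values of a single sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

print(f"2017 base model: ~{transformer_base_params() / 1e6:.0f}M weights")

# Hypothetical modern decoder: 32 layers, head_dim 128, 8K context, fp16 values.
full_mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
with_gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"KV cache, 32 KV heads: {full_mha / 2**30:.1f} GiB")
print(f"KV cache,  8 KV heads: {with_gqa / 2**30:.1f} GiB")
```

The estimate lands close to the 65M in the table, and sharing each key/value head across four query heads shrinks the cache by the same factor of four.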

Where to Go Next

L04: Language Models & RNNs — What the Transformer replaced. Understand what came before to appreciate what changed.
Transformer Deep Dive — Our full standalone Transformer lesson with more simulations and code.
GPT — What happens when you take only the decoder half and train on massive data.

Decoder-Only vs. Encoder-Only

The original Transformer has both encoder and decoder. But many later models use only half:

Variant	Used In	Key Difference
Encoder-only	BERT, RoBERTa	No causal mask, no decoder. Bidirectional attention. Best for understanding tasks (classification, NER).
Decoder-only	GPT, LLaMA, Claude	No encoder, no cross-attention. Causal mask only. Best for generation tasks.
Encoder-decoder	T5, BART, Original	Full architecture. Best for sequence-to-sequence (translation, summarization).
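The "causal mask" column is the whole difference in how the two halves attend. As a minimal sketch in plain Python (the helper names are illustrative), the two masks for a 4-token sequence look like this:

```python
def attention_mask(n, causal):
    """Return an n x n grid where mask[i][j] is True if position i may attend to j."""
    return [[(not causal) or j <= i for j in range(n)] for i in range(n)]

def show(mask):
    """Render the mask with 1 for allowed and . for blocked."""
    return "\n".join(" ".join("1" if ok else "." for ok in row) for row in mask)

encoder_mask = attention_mask(4, causal=False)  # bidirectional: everything visible
decoder_mask = attention_mask(4, causal=True)   # causal: only current and earlier positions

print("Encoder-only (bidirectional):")
print(show(encoder_mask))
print("Decoder-only (causal):")
print(show(decoder_mask))
```

In a real model the blocked entries are set to negative infinity before the softmax, so they receive zero attention weight.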

The decoder-only variant won the scaling wars. GPT-3, GPT-4, LLaMA, Claude, Gemini — all decoder-only. Why? Because generation is the most general capability. A model that can generate text can also classify (generate the label), translate (generate the target language), and answer questions (generate the answer). Encoder-only models can't generate; they can only encode.

The Eight Authors

The Transformer paper was written by eight Google researchers: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. Several went on to found major AI companies: Aidan Gomez co-founded Cohere, Illia Polosukhin co-founded NEAR Protocol, and Noam Shazeer co-founded Character.AI before returning to Google. The paper has over 100,000 citations — one of the most cited papers in computer science history.

The paper's title, "Attention Is All You Need," was deliberately provocative. At the time, most researchers believed that attention was a useful supplement to RNNs, not a replacement. The claim that attention alone could outperform RNN-based systems was controversial. History proved them right.

Open Questions

Despite seven years of dominance, fundamental questions about the Transformer remain open:

Why does it work so well? We have no theoretical proof that attention is the optimal mechanism for sequence modeling. State space models (Mamba, S4) achieve comparable results on some tasks with O(n) instead of O(n²) computation. Are there better architectures waiting to be discovered?

What are the limits of scaling? Kaplan et al.'s scaling laws show no plateau over the range of compute they tested, which tempts the extrapolation that performance keeps improving with more compute. But is there a ceiling? Some researchers argue that Transformers can only interpolate within their training distribution, not truly generalize. Others argue that sufficient scale IS generalization.

Why do residual connections matter so much? Residual connections are necessary for training deep Transformers, but we lack a deep theoretical understanding of why. The gradient highway explanation is intuitive, but it doesn't tell us at what depth residuals become necessary: shallow networks train fine without them.

Can we beat O(n²)? Linear attention variants exist but have so far tended to underperform standard attention. FlashAttention makes O(n²) attention faster and leaner on memory but doesn't change the asymptotic compute. State space models offer O(n) alternatives. The optimal complexity for sequence modeling remains an open problem.
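To make the quadratic cost concrete, the sketch below counts the entries of the n x n score matrix for the context lengths from the Key Numbers table. It assumes fp16 (2 bytes per score) and, as a simplification, that the full matrix for one head in one layer is stored at once.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Entries in the n x n score matrix QK^T for one head in one layer."""
    return n_tokens * n_tokens

def score_matrix_gib(n_tokens: int, bytes_per_value: int = 2) -> float:
    """Size in GiB if the full score matrix were stored (fp16 assumed)."""
    return attention_score_entries(n_tokens) * bytes_per_value / 2**30

for n in (512, 8_192, 131_072):
    print(f"n={n:>7}: {attention_score_entries(n):>14,} scores, "
          f"{score_matrix_gib(n):10.4f} GiB per head per layer")

# Going from 8K to 128K tokens is 16x the length but 256x the scores.
growth = attention_score_entries(131_072) // attention_score_entries(8_192)
print(f"8K -> 128K context: {growth}x more scores")
```

A 16x longer context means 256x more attention scores to compute, which is why long-context models lean on tricks that avoid ever holding the whole matrix in memory.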

What's remarkable is that despite these open questions, the Transformer continues to win empirically. It may not be theoretically optimal, but it's practically unbeatable. As Yann LeCun has noted, "the Transformer is not the final architecture — but it's the best one we have today." Understanding it deeply, as this lesson has aimed to do, is the foundation for understanding everything that comes next in AI.

"What I cannot create, I do not understand." — Richard Feynman. The Transformer is simple enough to implement in 200 lines of PyTorch. Build one. Train it on Shakespeare. Watch it learn to generate prose. That's when the architecture truly clicks.
