Stanford CS 231n · Lecture 8 · Attention and Transformers

The Complete Guide to Attention & Transformers

What if every word in a sentence could look at every other word to decide what matters? That's attention — and it replaced everything.

Bahdanau Attention Self-Attention Multi-Head Vision Transformer

Chapter 01

The Bottleneck Problem

You want to translate "we see the sky" into Italian: "vediamo il cielo." The standard approach in 2014 was a sequence-to-sequence model: an encoder RNN reads the English tokens one by one and compresses the entire sentence into a single hidden vector c, then a decoder RNN unfolds that vector into the Italian output.

The encoder updates its hidden state at each step: ht = fW(xt, ht-1). After processing all T input tokens, the final hidden state hT becomes the context vector c. The decoder generates output tokens conditioned on c: st = gU(yt-1, st-1, c).

Encoder-Decoder Encoder: ht = fW(xt, ht-1)
Context: c = hT
Decoder: st = gU(yt-1, st-1, c)

This works for short sentences. But think about what happens when T = 1,000. You're asking a single fixed-size vector — maybe 512 or 1,024 dimensions — to memorize the meaning, word order, and nuance of a thousand tokens. It's like compressing an entire book into a single sentence.

The Information Bottleneck

The context vector c has a fixed size regardless of the input length. As the source sequence grows, more and more information must be squeezed into the same number of floats. Translation quality drops sharply for long sentences — the decoder forgets early tokens because c can't hold everything.

Worked Example — Why Fixed Context Fails

Translate: "The cat that the dog that the boy owned chased ran away." This has nested relative clauses. The decoder needs to know that "ran away" refers to "the cat" (the very first noun), not "the boy" or "the dog." But by the time the encoder reaches "ran away," the information about "the cat" has been overwritten by all the intervening tokens. The context vector c is dominated by recent tokens.

What's the fix? Instead of compressing the entire input into one vector, let the decoder look back at the entire input sequence on every step. At each output position, the decoder decides which parts of the input are relevant right now. Translating "cielo" (sky)? Look at position 4 ("sky"). Translating "vediamo" (we see)? Look at positions 1 and 2.

The Key Insight

Don't force all input information through a single bottleneck. Instead, give the decoder access to all encoder hidden states and let it learn to focus on the relevant ones at each step. This is attention.

Information Theory Perspective

Think of it in terms of information capacity. A vector of D floats (32-bit each) can store at most D × 32 = 16,384 bits for D = 512. A sentence of 1,000 tokens, each from a vocabulary of 50,000 words, carries up to roughly 1,000 × log2(50,000) ≈ 15,600 bits of information. On paper the two numbers are about equal, but 16,384 bits is a hard ceiling that assumes every bit of every float is used perfectly; a trained network's noisy, continuous hidden state stores far less. In practice the context vector doesn't have the bandwidth to losslessly encode a long sentence. Attention sidesteps this by giving the decoder random access to the full set of encoder states, so the available capacity grows with the input length, at the cost of computing alignment scores at each step.

The Broader Impact

The bottleneck problem appears everywhere, not just in translation. Image captioning (compress an entire image into a vector, then generate a sentence), text summarization (compress a document, then generate a summary), speech recognition (compress an audio signal, then generate text) — all suffered from the same limitation. Attention provided a universal fix: let the decoder selectively access the encoder's full representation.

Empirical Evidence — BLEU Score vs. Sentence Length

Cho et al. (2014) showed that the BLEU score (translation quality) of a basic encoder-decoder model dropped sharply as sentences grew beyond ~20 words. Bahdanau et al. (2015) showed that with attention, BLEU scores remained high even for sentences of 50+ words. The improvement was most dramatic on long sentences, exactly where the bottleneck hurts most. This result helped convince the NLP community that attention was a fundamental architectural advance rather than an incremental improvement.

Chapter 02

The Attention Mechanism

Bahdanau et al. (2015) proposed a simple but powerful idea: at each decoder step t, compute a fresh context vector ct as a weighted sum of all encoder hidden states. The weights are learned — the network figures out which encoder states matter for the current output.

Step 1: Alignment Scores

For each encoder hidden state hi, compute a scalar alignment score that measures how relevant hi is to the current decoder state st-1:

Alignment Score et,i = fatt(st-1, hi)
fatt is a small neural network (often a single linear layer)

Step 2: Attention Weights

Normalize the alignment scores with softmax so they sum to 1. These are the attention weights:

Attention Weights at,i = softmax(et)i = exp(et,i) / ∑j exp(et,j)
0 < at,i < 1 and ∑i at,i = 1

Step 3: Context Vector

Compute the context vector as a weighted sum of encoder states:

Context Vector ct = ∑i at,i · hi

This context vector ct is different at every decoder step. When the decoder is generating "vediamo" (we see), the attention weights might concentrate on h1 and h2 (the encoder states for "we" and "see"). When generating "cielo" (sky), they shift to h4.
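
The three steps above can be sketched in NumPy. This is a minimal sketch under assumptions, not the paper's implementation: it takes fatt to be the additive form v · tanh(Ws·s + Wh·h), and the names `additive_attention`, `W_s`, `W_h`, `v`, and all sizes and random weights are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def additive_attention(s_prev, H, W_s, W_h, v):
    """One decoder step of soft attention.

    s_prev: previous decoder state, shape [D_s]
    H:      encoder hidden states, shape [T, D_h]
    W_s, W_h, v: parameters of the (assumed) additive scoring network
    Returns (context vector [D_h], attention weights [T]).
    """
    # Step 1: one scalar alignment score per encoder state.
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v   # [T]
    # Step 2: normalize the scores into weights that sum to 1.
    weights = softmax(scores)                      # [T]
    # Step 3: context vector = weighted sum of encoder states.
    context = weights @ H                          # [D_h]
    return context, weights

rng = np.random.default_rng(0)
T, D_h, D_s, D_a = 4, 8, 8, 16   # 4 source tokens; other sizes are arbitrary
H = rng.normal(size=(T, D_h))
s_prev = rng.normal(size=D_s)
W_s = rng.normal(size=(D_s, D_a))
W_h = rng.normal(size=(D_h, D_a))
v = rng.normal(size=D_a)

context, weights = additive_attention(s_prev, H, W_s, W_h, v)
print(weights, weights.sum())
```

Calling the function again with a different `s_prev` yields a different set of weights and therefore a different context vector, which is the whole point of the mechanism.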

Definition
Soft Attention (Bahdanau Attention)

A mechanism that computes a weighted average over all encoder hidden states at each decoder step. The weights are learned alignment scores normalized by softmax. "Soft" because it uses continuous weights rather than hard selection of a single position. The entire computation is differentiable — no supervision on the attention weights is needed. Backprop learns them automatically.

Worked Example — Attention in Translation

Source: "we see the sky" → encoder states h1, h2, h3, h4.

Decoder step 1 (generating "vediamo" = "we see"):
Alignment scores: e1 = [2.1, 1.8, 0.3, 0.1]
After softmax: a1 = [0.49, 0.36, 0.08, 0.07]
Context: c1 = 0.49·h1 + 0.36·h2 + 0.08·h3 + 0.07·h4
The model focuses on "we" and "see" — exactly what it's translating.

Decoder step 3 (generating "cielo" = "sky"):
Alignment scores: e3 = [0.1, 0.2, 0.5, 2.8]
After softmax: a3 = [0.05, 0.06, 0.08, 0.81]
Context: c3 = 0.05·h1 + 0.06·h2 + 0.08·h3 + 0.81·h4
Now the model focuses almost entirely on "sky."
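
Hand-rounded softmax values are easy to get slightly off, so it can help to recompute them. The short sketch below applies softmax to the two score vectors from this example; the variable names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

e1 = np.array([2.1, 1.8, 0.3, 0.1])   # scores while generating "vediamo"
e3 = np.array([0.1, 0.2, 0.5, 2.8])   # scores while generating "cielo"

a1 = softmax(e1)
a3 = softmax(e3)

print(np.round(a1, 2))
print(np.round(a3, 2))
# 1-based source position that receives the largest weight at each step.
print(a1.argmax() + 1, a3.argmax() + 1)
```

Position 1 ("we") receives the largest weight at the first step and position 4 ("sky") at the third, matching the qualitative story above.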

Visualizing Attention Weights

When researchers visualize attention weights on real translation tasks, something beautiful emerges: the attention matrix looks roughly diagonal for languages with similar word order (English-French: "The agreement" → "L'accord") but shows cross-diagonal patterns where word order differs ("European Economic Area" → "zone économique européenne" reverses the adjective order).

Worked Example — Diagonal vs. Non-Diagonal Attention

Similar word order: "I eat bread" → "Yo como pan" (Spanish). Attention for "Yo" peaks on "I". Attention for "como" peaks on "eat". Nearly diagonal.

Different word order: "I don't know" → "Je ne sais pas" (French). "sais" (know) is at position 3 in French and "know" is at position 3 in English, so that pair still sits on the diagonal. But "pas" (the second half of the negation, at position 4) attends to "don't" at position 2, off the diagonal. The model learns the reordering.

Longer-range: "The man who I met yesterday left" → in German the verb goes to the end. The attention weight from the final German verb reaches all the way back to "left" in the English input. Without attention, this information would need to survive through the entire decoder hidden state chain.

Why Attention Works

The bottleneck is gone. Instead of squeezing T hidden states into one vector, the decoder gets access to all T states at every step. Attention weights act as a soft pointer — "which input positions should I focus on right now?" For languages with different word orders (English-French: "European Economic Area" → "zone économique européenne"), attention learns the reordering automatically. And the whole thing is end-to-end differentiable — no supervision on the attention weights is required. Backprop learns them from the translation loss alone.

Why Is It Differentiable?

Every operation in the attention mechanism is smooth: the alignment function fatt is a neural network (differentiable), softmax is differentiable, and weighted summation is linear (trivially differentiable). The gradient of the loss flows backward through ct = ∑i at,i · hi, through the softmax, through fatt, and into both the encoder (updating hi) and decoder (updating st-1). This means the model can learn where to look entirely from the translation objective — no hand-labeled alignments needed.

Chapter 03

Self-Attention

Bahdanau attention is a decoder-to-encoder mechanism: the decoder queries attend to encoder states. But there's a more general operator hiding inside. What if a sequence attended to itself?

In self-attention, every token in a sequence looks at every other token in the same sequence. The input is a set of vectors X = {x1, x2, ..., xN}. Each vector produces one output that's a weighted combination of information from all inputs.

Definition
Self-Attention

An operation where each element in a sequence computes its output by attending to all other elements in the same sequence. Unlike cross-attention (decoder → encoder), self-attention has a single set of inputs that serve as both the source and target. Each input generates a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what information do I provide?").

Think of it like a meeting. Each person (token) has a question (query), a nametag (key), and a message (value). To form your output, you compare your question against everyone's nametag, figure out who's relevant, and blend their messages proportionally.

The Meeting Analogy

You're at a conference. You're the query: "I need to know about gradients." You look at everyone's nametag (key): "Optimization researcher," "Image processing expert," "Computer vision professor," "Gradient flow specialist." The "Gradient flow specialist" and "Optimization researcher" nametags match your question well — high attention weights. You then listen to each person's value (their actual expertise) and blend it proportionally. Your output is mostly "gradient flow" knowledge with some "optimization" knowledge mixed in. The person studying image processing contributed almost nothing to your output — they had a low attention weight despite being at the same conference.

QKV: Three Projections, Three Roles

Why do we need three separate projections? Can't we just use X directly for everything? The answer is role separation. What a token is looking for (query), what a token advertises (key), and what a token provides (value) are three different things. The word "bank" might have a key that says "I'm a noun related to money or rivers" (for other tokens to find it), a query that says "I need context to disambiguate" (to look at surrounding tokens), and a value that says "here's my semantic content" (what it contributes to other tokens' representations). Three separate linear projections let the model learn these three roles independently.

From Cross-Attention to Self-Attention

In cross-attention, queries come from the decoder and keys/values come from the encoder — two separate sequences. In self-attention, queries, keys, and values all come from the same sequence. The shapes simplify: if the input is X [N × Din], then Q, K, V are all [N × Dout] and the attention matrix E is [N × N].

Worked Example — Self-Attention on "The cat sat"

Three tokens: x1="The", x2="cat", x3="sat". Each produces Q, K, V vectors.

Token "sat" (query q3): It computes similarity with every key: e3,1 = q3·k1, e3,2 = q3·k2, e3,3 = q3·k3. Suppose after softmax: a3 = [0.05, 0.70, 0.25]. "sat" mostly attends to "cat" (the subject) and somewhat to itself. Output: y3 = 0.05·v1 + 0.70·v2 + 0.25·v3.

The output y3 for "sat" now contains information about its subject "cat" — it knows what sat.

Permutation Equivariance

A remarkable property: self-attention is permutation equivariant. If you shuffle the input tokens, the outputs get shuffled in the same way. Formally: F(σ(X)) = σ(F(X)). This means self-attention operates on sets, not sequences — it has no notion of position or order.

Proof of Permutation Equivariance

Let σ be a permutation. Apply σ to the input: X' = σ(X). Then Q' = X' · WQ = σ(X) · WQ = σ(Q). Similarly K' = σ(K), V' = σ(V). The similarity matrix E'ij = Q'i · K'j = Qσ(i) · Kσ(j) = Eσ(i),σ(j). So E' is E with rows and columns permuted. After softmax (applied per row), A' is the same permutation of A. Finally Y' = A'V' = σ(AV) = σ(Y). The output is just the permuted version of the original output. QED.
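
The proof can also be checked numerically. The sketch below is illustrative: the weights are random, the permutation is an arbitrary fixed choice, and the 1/√D scaling inside `self_attention` is an assumption (equivariance holds with or without it).

```python
import numpy as np

def softmax_rows(E):
    E = E - E.max(axis=1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax_rows(Q @ K.T / np.sqrt(K.shape[1]))
    return A @ V

rng = np.random.default_rng(0)
N, D = 5, 8
X = rng.normal(size=(N, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

perm = np.array([2, 0, 4, 1, 3])   # an arbitrary permutation sigma
Y = self_attention(X, W_q, W_k, W_v)
Y_perm = self_attention(X[perm], W_q, W_k, W_v)

# F(sigma(X)) should equal sigma(F(X)).
print(np.allclose(Y_perm, Y[perm]))   # True
```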

No Position Awareness

"The cat sat on the mat" and "mat the on sat cat the" produce the same attention weights (just permuted). Self-attention alone can't tell word order. This is why we'll need positional encoding (Chapter 7) — we must explicitly inject position information.

Why Self-Attention Is Powerful

Every output directly depends on all inputs. In an RNN, information from token 1 must travel through every intermediate hidden state to reach token 100 — a chain of 99 steps where it can degrade. In self-attention, token 100 looks directly at token 1 in a single step. This is why transformers handle long-range dependencies so much better than RNNs.

Chapter 04

Scaled Dot-Product Attention

Bahdanau used a learned neural network fatt to compute alignment scores. The modern approach is simpler: use the dot product between query and key vectors, with one crucial scaling factor.

The Full Computation

Given input vectors X [N × Din], we first project them into queries, keys, and values using learned weight matrices:

QKV Projection Q = X · WQ [N × Dk]
K = X · WK [N × Dk]
V = X · WV [N × Dv]

Then compute the attention output in three steps:

Scaled Dot-Product Attention E = Q · K^T / √Dk [N × N]
A = softmax(E, dim=1) [N × N]
Y = A · V [N × Dv]

Or more compactly:

One-Line Form Attention(Q, K, V) = softmax(Q · K^T / √Dk) · V
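
As a minimal NumPy sketch of the formula (the function name, the sizes, and the random projection matrices are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(D_k)) V, with softmax taken over each row."""
    D_k = Q.shape[-1]
    E = Q @ K.T / np.sqrt(D_k)               # [N, N] scaled similarities
    E = E - E.max(axis=1, keepdims=True)     # subtract row max for stability
    A = np.exp(E)
    A = A / A.sum(axis=1, keepdims=True)     # [N, N], each row sums to 1
    return A @ V, A                          # [N, D_v], [N, N]

rng = np.random.default_rng(0)
N, D_in, D_k, D_v = 6, 16, 8, 12
X = rng.normal(size=(N, D_in))
W_Q = rng.normal(size=(D_in, D_k))
W_K = rng.normal(size=(D_in, D_k))
W_V = rng.normal(size=(D_in, D_v))

Y, A = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(Y.shape, A.shape, A.sum(axis=1))
```

Note that queries and keys must share the dimension Dk, while the value dimension Dv is free to differ; the output inherits Dv.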

Why Divide by √Dk?

This is the part most presentations skip. Consider two random vectors q and k of dimension Dk, where each component is drawn from a distribution with mean 0 and variance 1. Their dot product q · k = ∑i qi · ki has mean 0 and variance Dk.

Derivation — Variance of the Dot Product

Each term qi · ki has E[qiki] = E[qi]E[ki] = 0 (by independence) and Var(qiki) = E[qi^2]E[ki^2] − (E[qiki])^2 = 1 × 1 − 0 = 1. Since the dot product is a sum of Dk independent terms, its variance is Dk × 1 = Dk. Standard deviation is √Dk.

For Dk = 64, dot products can easily reach magnitudes of ±8. Softmax applied to values this large produces outputs extremely close to 0 or 1 — a near-one-hot distribution. The gradients of softmax in this regime are vanishingly small, killing learning.

Dividing by √Dk normalizes the variance back to 1, keeping the softmax inputs in a regime where gradients flow well.
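
The variance claim is easy to check by simulation. The sketch below assumes standard normal components (any zero-mean, unit-variance distribution should behave similarly); the sample count and the choice of dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 10_000

def dot_product_variance(D_k):
    """Empirical variance of q.k before and after dividing by sqrt(D_k)."""
    q = rng.standard_normal((num_samples, D_k))
    k = rng.standard_normal((num_samples, D_k))
    dots = np.sum(q * k, axis=1)
    return dots.var(), (dots / np.sqrt(D_k)).var()

results = {D_k: dot_product_variance(D_k) for D_k in (16, 64, 256)}
for D_k, (raw, scaled) in results.items():
    print(f"D_k={D_k:4d}  var(q.k)={raw:8.2f}  var(q.k/sqrt(D_k))={scaled:.3f}")
```

The unscaled variance should track Dk, while the scaled variance stays near 1 regardless of dimension.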

Worked Example — Scaling Matters

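Dk = 64. Two random vectors with unit-variance components have magnitude about √64 = 8, so a well-aligned query-key pair can produce a dot product of ~50.

Without scaling: softmax([50, 2, 1, -3]) ≈ [1.0, 0.0, 0.0, 0.0]. The losing entries are below 10^-20, so the gradient of softmax is numerically 0 and learning stops.

With scaling: divide by √64 = 8: softmax([6.25, 0.25, 0.125, -0.375]) ≈ [0.994, 0.002, 0.002, 0.001]. The distribution is still peaked, but the other entries are no longer numerically zero, so a usable gradient reaches every score.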
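
To see the effect on gradients directly, the sketch below evaluates both softmaxes and uses p1 · (1 − p1), the derivative of the winning probability with respect to its own score, as a rough proxy for how much gradient survives. The proxy and the variable names are choices made for this illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([50.0, 2.0, 1.0, -3.0])   # raw dot products from the example
D_k = 64

p_raw = softmax(scores)
p_scaled = softmax(scores / np.sqrt(D_k))

# d p_1 / d e_1 = p_1 * (1 - p_1) for the largest entry.
grad_raw = p_raw[0] * (1 - p_raw[0])
grad_scaled = p_scaled[0] * (1 - p_scaled[0])

print(np.round(p_scaled, 3))
print(grad_raw, grad_scaled)
```

In double precision the unscaled winner rounds to exactly 1, so its gradient proxy is 0, while the scaled version keeps a small but usable value.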

Complexity Analysis

The attention matrix E has shape [N × N], where N is the sequence length. Computing it requires N^2 dot products, each of dimension Dk.

Complexity
O(N^2 · D) Time, O(N^2) Memory

QKV projection: O(N · D^2). Attention matrix: O(N^2 · D). Weighted sum: O(N^2 · D). Storing the N × N attention matrix requires O(N^2) memory. For N = 4,096 tokens with D = 512: the attention matrix alone is 4096^2 ≈ 16.8 million entries. This quadratic cost is the main limitation of transformers: doubling the sequence length quadruples the compute.

The Whole Thing Is Four Matrix Multiplies

Self-attention looks intimidating in equations but is remarkably simple computationally:

Self-Attention in Practice
  1. QKV Projection: [N × D] × [D × 3D] → [N × 3D], then split into Q, K, V
  2. QK Similarity: [N × D] × [D × N] → [N × N], then scale by 1/√D
  3. Softmax + AV product: [N × N] × [N × D] → [N × D]
  4. Output projection: [N × D] × [D × D] → [N × D]

Just Matrix Multiplies

Self-attention is four matrix multiplications plus one softmax. That's it. GPUs are extremely good at matrix multiplication. This is why transformers are so parallelizable: unlike RNNs, which must step through time sequentially, each of these matmuls processes all N tokens at once.

Cross-Attention vs. Self-Attention

It's worth pausing to distinguish these two forms precisely:

Property | Cross-Attention | Self-Attention
Queries come from | Decoder sequence [NQ × DQ] | Same input X [N × D]
Keys/Values come from | Encoder sequence [NX × DX] | Same input X [N × D]
Attention matrix shape | [NQ × NX] (rectangular) | [N × N] (square)
Purpose | Decoder reads encoder states | Tokens mix within one sequence
Used in | Encoder-decoder Transformers (T5, original Transformer) | All Transformers (GPT, BERT, ViT)

Worked Example — Full Numeric Computation

N = 3 tokens, D = 4. Input X and weight matrices WQ, WK, WV all [4 × 4].

Step 1 (QKV): Q = X · WQ, K = X · WK, V = X · WV. Three matrix multiplies, each [3 × 4] × [4 × 4] = [3 × 4].

Step 2 (Similarity): E = Q · K^T / √4 = Q · K^T / 2. Shape: [3 × 4] × [4 × 3] = [3 × 3]. Each entry Eij is the scaled dot product between query i and key j.

Step 3 (Softmax): Apply softmax row-wise. Each row sums to 1. Row i gives the attention distribution for query i over all keys.

Step 4 (Output): Y = A · V. Shape: [3 × 3] × [3 × 4] = [3 × 4]. Each output yi is a weighted combination of all value vectors.

Total learnable parameters: 3 × D^2 = 3 × 16 = 48 (for WQ, WK, WV), plus D^2 = 16 for the output projection WO, for 64 in total.

Chapter 05

Multi-Head Attention

A single attention head computes one set of attention weights — one "pattern" of which tokens attend to which. But language has many simultaneous relationships: syntactic (subject-verb), semantic (synonyms), positional (nearby words). A single head must blend all these into one attention matrix. Can we do better?

Definition
Multi-Head Attention

Run H independent self-attention operations ("heads") in parallel, each with its own learned WQ, WK, WV matrices. Concatenate their outputs and project through a final linear layer WO. Each head can learn a different attention pattern.

Multi-Head Attention headh = Attention(X · WQh, X · WKh, X · WVh) [N × DH]
MultiHead(X) = Concat(head1, ..., headH) · WO [N × D]

The key trick: each head operates in a lower-dimensional subspace. If the model dimension is D = 512 and we use H = 8 heads, each head has dimension DH = D / H = 64. The total parameter count is the same as a single head with dimension D.

Worked Example — What Different Heads Learn

Sentence: "The cat that I fed yesterday slept on the mat."

Head 1 (syntactic): "slept" attends strongly to "cat" (its subject), ignoring the relative clause.

Head 2 (local): Each word attends to its immediate neighbors — a kind of learned smoothing.

Head 3 (semantic): "cat" attends to "mat" and "slept" (related concepts in the scene).

Head 4 (positional): Attends to a fixed relative offset — always looking 2 tokens back.

The output projection WO blends all four perspectives into a single representation.

Why Not Just One Big Head?

A single head with Dk = 512 can compute a richer similarity function, but it still produces only one attention pattern per query. Eight heads of Dk = 64 use the same number of parameters and can learn 8 different attention patterns simultaneously, which tends to work better in practice even though each head sees a smaller subspace. The output projection WO then learns how to combine them.

Heads as Subspace Experts

Think of each head as a specialist that projects tokens into a different 64-dimensional subspace and finds patterns there. Head 1 might project tokens so that subjects and verbs are close together. Head 2 might project so that co-referent nouns cluster. The final WO matrix merges these specialized views.

Parameter Count

Component | Shape | Parameters
WQ (all heads) | D × (H · DH) = D × D | D^2
WK (all heads) | D × D | D^2
WV (all heads) | D × D | D^2
WO (output) | D × D | D^2
Total | | 4D^2

For D = 512: 4 × 512^2 = 1,048,576 parameters per multi-head attention layer. The number of heads H doesn't affect the total; it just determines how the computation is divided.

Implementation: Batched Matmul

In practice, all H heads are computed in parallel using a single batched matrix multiply. The fused QKV projection produces a tensor of shape [N × 3HDH], which is reshaped to [H × N × 3DH] and split into Q, K, V. The attention computation runs independently per head using batched matmul (the H dimension acts as a batch dimension). This is extremely efficient on GPUs.

Worked Example — Multi-Head Shapes

D = 512, H = 8, DH = 64. Input X: [N × 512].

Fused QKV: X · WQKV = [N × 512] × [512 × 1536] = [N × 1536]. Split into Q, K, V each [N × 512], then reshape to [8 × N × 64].

Per-head attention: E = Q · K^T = [8 × N × 64] × [8 × 64 × N] = [8 × N × N]. One matmul, 8 heads at once.

Per-head output: A · V = [8 × N × N] × [8 × N × 64] = [8 × N × 64]. Reshape to [N × 512].

Output projection: [N × 512] × [512 × 512] = [N × 512]. Done. Same input and output shape.
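
The shape walk-through above maps almost line for line onto NumPy. This is a sketch with random weights; the function name `multi_head_attention`, the helper `to_heads`, and N = 10 are illustrative, and the 1/√DH scaling from Chapter 04 is included.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_qkv, W_o, H):
    N, D = X.shape
    D_h = D // H

    def to_heads(M):
        # [N, D] -> [N, H, D_h] -> [H, N, D_h]
        return M.reshape(N, H, D_h).transpose(1, 0, 2)

    Q, K, V = np.split(X @ W_qkv, 3, axis=1)        # fused QKV, each [N, D]
    Q, K, V = to_heads(Q), to_heads(K), to_heads(V)
    E = Q @ K.transpose(0, 2, 1) / np.sqrt(D_h)     # [H, N, N], batched matmul
    A = softmax(E, axis=-1)                         # per-head attention weights
    heads = A @ V                                   # [H, N, D_h]
    concat = heads.transpose(1, 0, 2).reshape(N, D) # back to [N, D]
    return concat @ W_o, A

rng = np.random.default_rng(0)
N, D, H = 10, 512, 8
X = rng.normal(size=(N, D))
W_qkv = rng.normal(size=(D, 3 * D)) / np.sqrt(D)
W_o = rng.normal(size=(D, D)) / np.sqrt(D)

Y, A = multi_head_attention(X, W_qkv, W_o, H)
print(Y.shape, A.shape)   # (10, 512) (8, 10, 10)
```

The head dimension acts purely as a batch dimension here: `@` on 3-D arrays performs one matrix multiply per head.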

Chapter 06

The Transformer Block

Self-attention alone is not enough. It lets tokens communicate, but it doesn't add nonlinearity per token. The Transformer block wraps multi-head self-attention with three additional components: residual connections, layer normalization, and a feedforward network.

Transformer Block (Post-Norm, Vaswani et al. 2017)
  1. Self-Attention: Z = MultiHeadSelfAttention(X) [N × D]
  2. Residual + LayerNorm: X' = LayerNorm(X + Z) [N × D]
  3. Feedforward (MLP): F = MLP(X') [N × D]
  4. Residual + LayerNorm: Y = LayerNorm(X' + F) [N × D]

Layer Normalization

Layer normalization normalizes each vector independently across its feature dimensions. Given a vector h of dimension D:

Layer Normalization μ = (1/D) ∑j hj (mean)
σ = √((1/D) ∑j (hj − μ)^2) (std)
z = (h − μ) / σ
y = γ ⊙ z + β (learnable scale & shift)

This stabilizes training by preventing activations from drifting to extreme values. Each token's representation is normalized independently — no dependence on batch statistics like batch normalization.
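
A minimal NumPy sketch of these four equations follows. The small constant `eps` added under the square root is an assumption borrowed from common practice to avoid division by zero; it does not appear in the formulas above.

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """Normalize each row of h over its feature dimension, then scale and shift."""
    mu = h.mean(axis=-1, keepdims=True)
    var = ((h - mu) ** 2).mean(axis=-1, keepdims=True)
    z = (h - mu) / np.sqrt(var + eps)
    return gamma * z + beta

rng = np.random.default_rng(0)
N, D = 4, 512
h = 3.0 * rng.normal(size=(N, D)) + 7.0   # deliberately off-center, wide activations
gamma, beta = np.ones(D), np.zeros(D)      # identity scale and shift

y = layer_norm(h, gamma, beta)
print(y.mean(axis=-1), y.std(axis=-1))
```

Each of the N rows is normalized using only its own D features, so the result does not depend on what else is in the batch.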

The Feedforward Network (MLP)

The MLP applies the same two-layer network to each token independently:

Feedforward Network MLP(x) = W2 · ReLU(W1 · x + b1) + b2
W1: [D × 4D], W2: [4D × D]. Expands then contracts.

The standard expansion factor is 4×. If D = 512, the hidden layer has 2,048 dimensions. This gives each token a chance to do "private computation": self-attention is the communication channel between tokens, but the MLP is where individual token representations get transformed.

Communication vs. Computation

Self-attention is the communication step: tokens exchange information. The MLP is the computation step: each token processes its updated representation privately. The transformer alternates between these two operations, stacking identical blocks.

Worked Example — Why the MLP Matters

After self-attention, token "sat" has gathered information from "cat" and "on." Its representation is now a weighted blend of value vectors. But blending alone is linear — it can only produce linear combinations. The MLP adds nonlinearity (via ReLU or SwiGLU), letting the model compute new features: "this is a past-tense verb whose subject is an animal." The 4× expansion (D → 4D → D) gives the MLP a large hidden space to compute these features before projecting back to model dimension.

An analogy: self-attention is like collecting notes from colleagues at a meeting. The MLP is going back to your desk and thinking about what those notes mean. Both steps are essential.

Residual Connections

The X + Z in step 2 is a residual connection (a.k.a. skip connection). It adds the input directly to the output. This serves two purposes: (1) gradients flow directly through the addition, avoiding vanishing gradients in deep stacks, and (2) the model only needs to learn the change to the representation, not the entire representation from scratch.

Pre-Norm vs. Post-Norm

The original Transformer (Vaswani 2017) places LayerNorm after the residual add. Modern practice moves it before the sublayer — inside the residual path:

Pre-Norm (Modern) X' = X + MultiHeadAttention(LayerNorm(X))
Y = X' + MLP(LayerNorm(X'))
Why Pre-Norm?
Training Stability

Post-norm places the normalization outside the residual, which means the model can't easily learn the identity function (a problem at initialization). Pre-norm normalizes inside the residual branch, so the main path is a clean addition. This makes training more stable, especially for very deep models (50+ layers). Nearly all modern transformers (GPT-2+, LLaMA, etc.) use pre-norm.
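
The two orderings can be compared side by side in a short sketch. To stay compact it makes several simplifying assumptions: single-head attention stands in for multi-head attention, LayerNorm has no learnable scale or shift, biases are omitted, and all weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 16   # toy sizes

W_q, W_k, W_v, W_o = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(4))
W_1 = rng.normal(size=(D, 4 * D)) / np.sqrt(D)
W_2 = rng.normal(size=(4 * D, D)) / np.sqrt(4 * D)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    e = q @ k.T / np.sqrt(D)
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a = a / a.sum(axis=1, keepdims=True)
    return a @ v @ W_o

def mlp(x):
    return np.maximum(x @ W_1, 0.0) @ W_2

def post_norm_block(x):
    x1 = layer_norm(x + attention(x))      # normalize after the residual add
    return layer_norm(x1 + mlp(x1))

def pre_norm_block(x):
    x1 = x + attention(layer_norm(x))      # normalize inside the branch
    return x1 + mlp(layer_norm(x1))

X = rng.normal(size=(N, D))
post_out = post_norm_block(X)
pre_out = pre_norm_block(X)

# In pre-norm, the input reaches the output through plain additions only.
attn_branch = attention(layer_norm(X))
mlp_branch = mlp(layer_norm(X + attn_branch))
print(np.allclose(pre_out, X + attn_branch + mlp_branch))   # True
```

The final check makes the "clean addition" point concrete: the pre-norm output is the untouched input plus two branch outputs, whereas the post-norm output has passed through a normalization on the main path.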

Stacking Blocks

A Transformer is simply a stack of identical blocks. The original paper used 6 encoder blocks and 6 decoder blocks (12 in total). The architecture has barely changed since 2017, but models have gotten much bigger:

Model | Blocks | D | Heads | Params
Transformer (2017) | 12 | 1,024 | 16 | 213M
GPT-2 (2019) | 48 | 1,600 | 25 | 1.5B
GPT-3 (2020) | 96 | 12,288 | 96 | 175B

Compute Breakdown

A transformer block is 6 matrix multiplies: 4 from multi-head attention (QKV projection, QK similarity, AV product, output projection) and 2 from the MLP (up-project and down-project). Every other operation (softmax, LayerNorm, residual adds) is cheap by comparison. Transformers are matrix-multiply machines.

SwiGLU: The Modern MLP

The original Transformer uses ReLU in the MLP: Y = W2 · ReLU(W1 · x). Modern transformers (LLaMA, PaLM) replace this with SwiGLU, a gated variant:

SwiGLU MLP Y = (Swish(X · W1) ⊙ (X · W2)) · W3
W1, W2: [D × H], W3: [H × D], where H is the hidden width (not the number of heads). H = 8D/3 keeps the parameter count equal to a 4× ReLU MLP.

The ⊙ is element-wise multiplication — a gating mechanism. One branch (Swish(X · W1)) decides how much information to let through, while the other branch (X · W2) provides the information. Shazeer (2020) showed this outperforms plain ReLU MLPs across the board, and offered the wonderfully honest explanation: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."
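
A minimal NumPy sketch of the gated MLP, with a parameter-count comparison against the 4× ReLU MLP. Biases are omitted, the weights are random, and `hidden` stands for the hidden width written H in the formula above; the integer rounding of 8D/3 is a choice made here.

```python
import numpy as np

def swish(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(X, W1, W2, W3):
    gate = swish(X @ W1)        # decides how much to let through
    value = X @ W2              # carries the information
    return (gate * value) @ W3  # element-wise gating, then down-projection

D = 512
hidden = 8 * D // 3             # 1365
rng = np.random.default_rng(0)
X = rng.normal(size=(10, D))
W1 = rng.normal(size=(D, hidden)) / np.sqrt(D)
W2 = rng.normal(size=(D, hidden)) / np.sqrt(D)
W3 = rng.normal(size=(hidden, D)) / np.sqrt(hidden)

Y = swiglu_mlp(X, W1, W2, W3)
relu_mlp_params = 2 * D * (4 * D)   # up- and down-projection of a 4x MLP
swiglu_params = 3 * D * hidden      # three matrices at the narrower width
print(Y.shape, relu_mlp_params, swiglu_params)
```

Three matrices at width 8D/3 cost about the same as two matrices at width 4D, which is why the narrower hidden size is used.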

Mixture of Experts (MoE)

The biggest modern models are widely believed to use Mixture of Experts, though architectural details for closed models such as GPT-4, Gemini, and Claude are largely unpublished. Instead of one MLP per block, learn E separate MLP "experts." Each token is routed to only A < E of them. This multiplies the MLP parameters by E but the MLP compute by only A. A model with 1T total parameters might only activate 100B per token.

Definition
Mixture of Experts (MoE)

Learn E separate sets of MLP weights per transformer block. A learned router network selects the top A experts for each token. The token's MLP output is a weighted combination of only the A active experts' outputs. This decouples model capacity (total parameters) from inference cost (active parameters per token).
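
The routing logic in this definition can be sketched as follows. This is an illustration under assumptions, not any particular model's router: the router is a single linear layer, the gate weights are a softmax over the selected experts' logits only, and the per-token loop favors clarity over speed. All names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def expert_mlp(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

def moe_layer(X, W_router, experts, top_a):
    """X: [N, D]; W_router: [D, E]; experts: list of E (W1, W2) pairs."""
    logits = X @ W_router                              # [N, E] router scores
    top_idx = np.argsort(logits, axis=1)[:, -top_a:]   # [N, A] chosen experts
    out = np.zeros_like(X)
    for n, chosen in enumerate(top_idx):
        gates = softmax(logits[n, chosen])             # weights over active experts
        for g, e in zip(gates, chosen):
            out[n] += g * expert_mlp(X[n], *experts[e])
    return out, top_idx

rng = np.random.default_rng(0)
N, D, E, A = 5, 8, 4, 2
X = rng.normal(size=(N, D))
W_router = rng.normal(size=(D, E))
experts = [(rng.normal(size=(D, 4 * D)), rng.normal(size=(4 * D, D)))
           for _ in range(E)]

Y, routed = moe_layer(X, W_router, experts, top_a=A)
print(Y.shape)
print(routed)   # which A experts each token was sent to
```

Only A of the E expert MLPs run for each token, so compute scales with A while the parameter count scales with E.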

Chapter 07

Positional Encoding

Self-attention is permutation equivariant — it treats its input as a set, not a sequence. "The cat sat" and "sat cat the" produce the same attention patterns (just permuted). But language is ordered: "dog bites man" means something very different from "man bites dog." We need to inject position information.

Sinusoidal Positional Encoding

Vaswani et al. (2017) proposed adding a fixed vector to each input token that encodes its position. For position pos and dimension i:

Sinusoidal Positional Encoding PE(pos, 2i) = sin(pos / 10000^(2i/D))
PE(pos, 2i+1) = cos(pos / 10000^(2i/D))

Each dimension of the positional encoding oscillates at a different frequency. Low dimensions change rapidly (high frequency), high dimensions change slowly (low frequency). This creates a unique "fingerprint" for each position.

Why Sinusoids?

Three desirable properties: (1) Each position gets a unique encoding. (2) The encoding has bounded magnitude regardless of sequence length — sin and cos are always between -1 and 1. (3) The model can learn to attend to relative positions: PE(pos + k) can be expressed as a linear function of PE(pos) for any fixed offset k, because sin(a + b) = sin(a)cos(b) + cos(a)sin(b). This means "look 3 positions back" is a linear transformation the model can learn.
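
Property (3) can be made concrete. For each sin/cos pair, the angle-addition identities give a 2 × 2 rotation that depends only on the offset k, so stacking these blocks yields one matrix that shifts any position by k. The sketch below builds that matrix and checks it; the names `positional_encoding` and `offset_matrix` are illustrative.

```python
import numpy as np

def positional_encoding(pos, D):
    i = np.arange(D // 2)
    angles = pos / 10000.0 ** (2 * i / D)
    pe = np.empty(D)
    pe[0::2] = np.sin(angles)   # even dimensions
    pe[1::2] = np.cos(angles)   # odd dimensions
    return pe

def offset_matrix(k, D):
    """Block-diagonal matrix M_k with M_k @ PE(pos) == PE(pos + k) for all pos."""
    i = np.arange(D // 2)
    theta = k / 10000.0 ** (2 * i / D)
    M = np.zeros((D, D))
    for j, t in enumerate(theta):
        # sin(a + t) =  cos(t) sin(a) + sin(t) cos(a)
        # cos(a + t) = -sin(t) sin(a) + cos(t) cos(a)
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[np.cos(t), np.sin(t)],
                                               [-np.sin(t), np.cos(t)]]
    return M

D = 8
M_back3 = offset_matrix(-3, D)   # "look 3 positions back"
checks = [np.allclose(M_back3 @ positional_encoding(pos, D),
                      positional_encoding(pos - 3, D))
          for pos in (3, 10, 57)]
print(checks)   # [True, True, True]
```

The same matrix works at every position, which is what lets a learned linear map implement a fixed relative offset.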

Worked Example — Encoding Position 5

D = 4. We compute PE(5) using two frequencies:

Dim 0: sin(5 / 10000^(0/4)) = sin(5 / 1) = sin(5) ≈ −0.96

Dim 1: cos(5 / 10000^(0/4)) = cos(5) ≈ 0.28

Dim 2: sin(5 / 10000^(2/4)) = sin(5 / 100) = sin(0.05) ≈ 0.05

Dim 3: cos(5 / 10000^(2/4)) = cos(0.05) ≈ 1.00

PE(5) ≈ [−0.96, 0.28, 0.05, 1.00]. This is added to x5 before entering the transformer.
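
In practice the encodings for all positions are usually precomputed as one table. A vectorized sketch, with the illustrative name `sinusoidal_table`, reproduces the worked example as row 5:

```python
import numpy as np

def sinusoidal_table(num_positions, D):
    """Return a [num_positions, D] table whose row `pos` is PE(pos)."""
    pos = np.arange(num_positions)[:, None]     # [N, 1]
    i = np.arange(D // 2)[None, :]              # [1, D/2]
    angles = pos / 10000.0 ** (2 * i / D)       # [N, D/2]
    table = np.empty((num_positions, D))
    table[:, 0::2] = np.sin(angles)             # even dimensions
    table[:, 1::2] = np.cos(angles)             # odd dimensions
    return table

table = sinusoidal_table(num_positions=8, D=4)
print(np.round(table[5], 2))
```

Row 5 rounds to the four values computed by hand above, and row 0 is always the alternating pattern [0, 1, 0, 1, ...].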

Learned Positional Embeddings

An alternative: learn a lookup table of D-dimensional vectors, one per position. The model has a learnable matrix PE [Nmax × D] and adds PE[pos] to xpos. This is what GPT-2 and BERT use.

Method | Pros | Cons
Sinusoidal (fixed) | No extra parameters; can extrapolate to unseen lengths | Less flexible; theoretical extrapolation doesn't always work in practice
Learned | Maximum flexibility; can capture position-specific patterns | Extra parameters; hard limit on sequence length

Position Is Added, Not Concatenated

The positional encoding is added to the token embedding, not concatenated. This means the model dimension stays the same. The input to the transformer is xpos + PE(pos), where both vectors have dimension D. The model must learn to disentangle position information from content information — and empirically, it does.

Modern Alternatives: RoPE

A limitation of both sinusoidal and learned positional encodings: they encode absolute position. Token 5 always gets the same encoding regardless of context. Modern models like LLaMA use Rotary Position Embeddings (RoPE), which encode relative position by rotating query and key vectors. The dot product Qi · Kj then naturally depends on the offset (i − j) rather than the absolute positions i and j. This allows better generalization to sequence lengths not seen during training.

Worked Example — Why Relative Position Matters

Consider the phrase "the big red ball" appearing at positions [5,6,7,8] in one sentence and [100,101,102,103] in another. With absolute encoding, the attention patterns between "big" and "ball" would differ because their absolute positions differ. With relative encoding, the offset (2 positions apart) is the same in both cases, so the model can learn "attend to the noun 2 positions ahead" as a general rule. This is especially important for models that must handle varying context lengths.
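
The offset-only behavior can be demonstrated with a small rotary sketch. It assumes one common convention (rotating adjacent pairs of dimensions, with the same base of 10000 as the sinusoidal encoding); real implementations differ in how they pair dimensions. The function name `rope` and the random vectors are illustrative.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions of x by an angle proportional to pos."""
    D = x.shape[-1]
    i = np.arange(D // 2)
    theta = pos / base ** (2 * i / D)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * np.cos(theta) - x_odd * np.sin(theta)
    out[1::2] = x_even * np.sin(theta) + x_odd * np.cos(theta)
    return out

rng = np.random.default_rng(0)
D = 8
q = rng.normal(size=D)   # stand-in query vector for "big"
k = rng.normal(size=D)   # stand-in key vector for "ball"

score_a = rope(q, 6) @ rope(k, 8)       # positions 6 and 8
score_b = rope(q, 101) @ rope(k, 103)   # positions 101 and 103, same offset
score_c = rope(q, 6) @ rope(k, 9)       # different offset

print(np.isclose(score_a, score_b), np.isclose(score_a, score_c))
```

The first two scores agree because rotating the query by one angle and the key by another leaves a dot product that depends only on the difference of the angles.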

Chapter 08

Vision Transformer (ViT)

Transformers were designed for sequences of words. Images aren't sequences — they're 2D grids of pixels. But Dosovitskiy et al. (2021) showed that you can treat an image as a sequence of patches and apply a standard transformer. The result: Vision Transformer (ViT), which matches or beats CNNs on image classification when given enough data.

Patch Embedding

Take a 224 × 224 × 3 image. Divide it into a grid of non-overlapping patches, each 16 × 16 × 3 = 768 values. This gives (224/16)^2 = 14^2 = 196 patches. Flatten each patch and apply a linear projection from 768 to D dimensions:

Patch Embedding Image: 224 × 224 × 3
Patches: 196 patches, each 16 × 16 × 3 = 768 dims
Projection: Wpatch [768 × D] → 196 vectors of dim D
Equivalent to a 16×16 convolution with stride 16 and D output channels
Patches Are Tokens

Each image patch is treated as a "word." The 16×16 patch is the image equivalent of a word embedding. The transformer processes 196 "tokens" exactly as it would process 196 words. No convolutions, no pooling, no spatial inductive bias — just a standard transformer on a set of patch vectors.
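
Patch extraction is a reshape and a transpose. The sketch below assumes a height × width × channels layout and uses a random image and a random projection; `patchify`, `W_patch`, and D = 64 are illustrative choices.

```python
import numpy as np

def patchify(image, patch):
    """[H, W, C] image -> [num_patches, patch * patch * C] flattened patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch               # grid height and width
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                # [gh, gw, patch, patch, C]
    return x.reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image, 16)                     # [196, 768]

D = 64
W_patch = rng.normal(size=(768, D)) * 0.02        # linear patch projection
tokens = patches @ W_patch                        # [196, D]
print(patches.shape, tokens.shape)                # (196, 768) (196, 64)
```

Patches come out in row-major order, so patch 0 is the top-left corner and patch 1 is its neighbor to the right.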

Position and Classification

Add learned positional embeddings to each patch vector (a learnable [196 × D] matrix, or [197 × D] once the [CLS] token described next is included). Since we flattened the 2D grid, the position embeddings must encode both row and column; the transformer has no built-in 2D structure.

For classification, ViT prepends a special [CLS] token (a learned D-dimensional vector) to the patch sequence. After the transformer, the [CLS] token's output vector is fed to a linear classifier. Alternatively, you can average-pool all patch outputs and classify from that — both work.

ViT Pipeline
  1. Patchify: Split image into 16×16 patches. Flatten to vectors of dim 768.
  2. Linear projection: Project each patch vector to dim D.
  3. Prepend [CLS] token: Now we have 197 vectors.
  4. Add positional embeddings: Add a learned [197 × D] matrix.
  5. Transformer blocks: Pass through L blocks of self-attention + MLP.
  6. Classify: Take [CLS] output (or average pool), project to C classes.
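The six steps above can be traced as array shapes. This is a rough NumPy sketch under stated assumptions: the values D = 64 and C = 10 are small illustrative choices, all weights are random rather than learned, and step 5 is left as a placeholder because the transformer blocks do not change the shape.

```python
import numpy as np

rng = np.random.default_rng(0)
P, D, C = 16, 64, 10  # patch size from the text; D and C are illustrative values
image = rng.normal(size=(224, 224, 3))

# 1. Patchify: split rows and columns into a 14 x 14 grid of 16 x 16 x 3 blocks.
g = 224 // P
patches = (
    image.reshape(g, P, g, P, 3)   # (grid row, row in patch, grid col, col in patch, channel)
    .transpose(0, 2, 1, 3, 4)      # (grid row, grid col, row in patch, col in patch, channel)
    .reshape(g * g, P * P * 3)     # (196, 768)
)

# 2. Linear projection of each flattened patch to D dimensions.
W_patch = rng.normal(size=(P * P * 3, D)) * 0.02
tokens = patches @ W_patch                      # (196, D)

# 3. Prepend the [CLS] token (a learned vector in practice; zeros here).
cls_token = np.zeros((1, D))
tokens = np.concatenate([cls_token, tokens])    # (197, D)

# 4. Add positional embeddings (learned in practice; random here).
pos_embed = rng.normal(size=(g * g + 1, D)) * 0.02
tokens = tokens + pos_embed

# 5. Transformer blocks would go here; they map (197, D) to (197, D).

# 6. Classify from the [CLS] output.
W_head = rng.normal(size=(D, C)) * 0.02
logits = tokens[0] @ W_head                     # (C,)
print(patches.shape, tokens.shape, logits.shape)
```

The reshape and transpose in step 1 perform the same regrouping of pixels as a 16×16 convolution with stride 16 would before applying its weights.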

ViT vs. CNNs

Property | CNN (ResNet) | ViT
Inductive bias | Strong: locality, translation equivariance | Weak: only patch structure
Receptive field | Grows with depth (local → global) | Global from layer 1 (every patch sees every patch)
Data efficiency | Better with small datasets (inductive bias helps) | Needs large datasets (300M+ images for best results)
Scalability | Saturates as model grows | Keeps improving with more data + params
Compute pattern | Convolution (local matmuls) | Self-attention (global matmuls)
Data Hunger

ViT trained on ImageNet alone (1.3M images) underperforms ResNets. The lack of inductive bias means it must learn locality and translation invariance from data. Pre-training on JFT-300M (300 million images) or ImageNet-21k (14 million images) makes ViT competitive or superior. The lesson: transformers trade inductive bias for data.

ViT Model Variants

ViT-B/16: Base model, 16×16 patches, D = 768, 12 blocks, 12 heads, 86M params.

ViT-L/16: Large model, D = 1024, 24 blocks, 16 heads, 307M params.

ViT-H/14: Huge model, 14×14 patches (more tokens), D = 1280, 32 blocks, 16 heads, 632M params.

Token Count Calculation

For a 224×224 image with patch size P: N = (224/P)². With P = 16: N = 14² = 196. With P = 14: N = 16² = 256. Smaller patches → more tokens → more fine-grained attention → but O(N²) attention cost grows fast. Going from P = 16 to P = 14 increases attention compute by (256/196)² ≈ 1.7×. That's the trade-off: resolution vs. compute.
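The same arithmetic as a small Python sketch. The helper name `num_tokens` is hypothetical, and the cost ratio counts only the N² attention-matrix term, ignoring the rest of the block.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("patch size must divide the image size")
    return (image_size // patch_size) ** 2

n16 = num_tokens(224, 16)            # 14 * 14 = 196
n14 = num_tokens(224, 14)            # 16 * 16 = 256
attention_ratio = (n14 / n16) ** 2   # relative size of the N x N attention matrix

print(n16, n14, round(attention_ratio, 2))
```

Halving the patch size to P = 8 would give 28² = 784 tokens, a 16× larger attention matrix than P = 16.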

Why ViT Works

The magic of ViT is its simplicity. No pooling layers, no convolutional kernels, no special spatial operations. Just patch embedding + standard transformer + classification head. This means any improvement to the transformer architecture (better attention, better normalization, longer context) automatically benefits vision models too. The same architecture handles text, images, audio, video — unifying AI around a single computational primitive.

Universality

Before ViT, vision had ResNets, language had LSTMs/Transformers, audio had WaveNets. After ViT, the transformer became the universal architecture. Convert your data into a sequence of vectors (tokens, patches, spectrogram frames), add positional encoding, and run a transformer. The same attention mechanism handles spatial, temporal, and semantic relationships. This universality is arguably the transformer's greatest contribution.

Chapter 09

Self-Attention Visualizer

This is the interactive payoff. Below, you can see the full self-attention computation unfold step by step: input tokens are projected to Q, K, V vectors, the attention matrix is computed, and weighted sums produce the outputs. Select different tokens and sentences to see how attention patterns change.
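For readers following along without the widget, the same computation can be sketched in NumPy. The embeddings and projection matrices here are random stand-ins (assumptions for illustration), so the printed weights will not show the linguistic patterns a trained model would. Only the mechanics are the same: project to Q, K, V, compute the attention matrix, then take weighted sums.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over N token vectors of dimension D."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    E = Q @ K.T / np.sqrt(K.shape[-1])  # (N, N) scaled similarity scores
    A = softmax(E, axis=-1)             # each row sums to 1
    return A @ V, A                     # outputs are weighted sums of the value vectors

rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
N, D = len(tokens), 8
X = rng.normal(size=(N, D))  # random stand-ins for learned token embeddings
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

Y, A = self_attention(X, W_q, W_k, W_v)

# Attention weights for the query token "sat" (one row of A).
query = tokens.index("sat")
for tok, weight in zip(tokens, A[query]):
    print(f"{tok:>4s} {weight:.2f}")
```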

Showcase: Full Self-Attention Pipeline

Select a sentence and click a query token to see what it attends to. The attention matrix and output blending are shown in real time.

What to Look For

"The cat sat on the mat": Notice how "sat" strongly attends to "cat" (its subject) and weakly to "on" (the preposition that follows). "mat" attends to "the" (its determiner) and "on" (the spatial relation).

"I saw the man with the telescope": This sentence is famously ambiguous. Does "with the telescope" modify "saw" (I used a telescope to see) or "man" (the man had a telescope)? Watch how the attention weights don't resolve this — a single layer of self-attention can represent either interpretation depending on the learned weights.

How to Read Attention Patterns

When you explore the visualizer, keep these patterns in mind:

Common Attention Patterns in Practice

Diagonal: Each token mostly attends to itself. Common in early layers where the model is still building basic representations.

Vertical stripes: All tokens attend to a specific position (often [CLS] or punctuation). This position acts as a "sink" that aggregates global information.

Subject-verb: Verbs attend strongly to their subjects. "The cat [that I saw] sat" — "sat" attends to "cat" despite the intervening clause.

Coreference: Pronouns attend to their antecedents. "John picked up the ball. He threw it." — "He" attends to "John."

Uniform: Attention spread evenly across all tokens. This can mean the head is "averaging" information globally, or that it's not learning anything useful.

Attention Is Not Explanation

A common mistake: interpreting attention weights as "what the model is thinking about." Attention weights show where information flows, but the model can route information through many heads and many layers. A token might receive low attention from the final layer but high attention from an earlier layer, with the information already baked into its representation. Use attention visualization as intuition, not proof.

Chapter 10

Summary & Connections

Attention started as a fix for the encoder-decoder bottleneck. Self-attention generalized it into a primitive that operates on sets of vectors. The Transformer stacked self-attention with MLPs, residual connections, and layer normalization into what is arguably the most successful neural network architecture to date.

Three Transformer Families

Type | Architecture | Masking | Use Case | Examples
Encoder-only | Stack of transformer blocks with full (unmasked) self-attention | None: every token sees every other token | Understanding: classification, NER, sentiment | BERT, RoBERTa
Decoder-only | Stack of transformer blocks with masked (causal) self-attention | Causal mask: token i can only attend to tokens ≤ i | Generation: language modeling, text completion | GPT-2, GPT-3, LLaMA
Encoder-decoder | Encoder (full attention) + Decoder (masked self-attention + cross-attention to encoder) | Causal in decoder self-attention; full in cross-attention | Seq-to-seq: translation, summarization | Original Transformer, T5, BART

Masked Self-Attention for Language Modeling

Decoder-only models like GPT use masked (causal) self-attention. Before computing softmax, entries where j > i are set to −∞. After softmax, these become 0 — each token can only attend to previous tokens and itself. This enables autoregressive generation: predict the next token given all previous tokens.

Causal Masking
E_ij = Q_i · K_j / √D_k if j ≤ i
E_ij = −∞ if j > i
A = softmax(E) ⇒ A_ij = 0 for j > i
Worked Example — Causal Mask

Sentence: "Attention is very cool". Token 1 ("Attention") can only see itself: mask row 1 = [E_11, −∞, −∞, −∞]. After softmax: [1.0, 0, 0, 0].

Token 3 ("very") can see tokens 1, 2, 3 but NOT token 4: mask row 3 = [E_31, E_32, E_33, −∞]. After softmax, the weights only distribute over the first 3 tokens.

This ensures that when predicting the next token, the model cannot "cheat" by looking at future tokens. During training, all positions are processed in parallel (one forward pass), but each position only sees its causal context.
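The four-token example can be reproduced with a short NumPy sketch. The score matrix E is random here (an assumption for illustration; in a model it would come from Q · K / √D_k), and rows are 0-indexed, so the text's "token 1" is row 0.

```python
import numpy as np

def causal_attention_weights(E):
    """Apply a causal mask to an (N, N) score matrix, then softmax each row."""
    N = E.shape[0]
    future = np.triu(np.ones((N, N), dtype=bool), k=1)  # True where j > i
    E = np.where(future, -np.inf, E)                    # block attention to future tokens
    E = E - E.max(axis=-1, keepdims=True)               # row max is finite: the diagonal is never masked
    w = np.exp(E)                                       # exp(-inf) = 0
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 4))  # stand-in scores for "Attention is very cool"
A = causal_attention_weights(E)

print(np.round(A, 2))
```

The first row comes out as [1, 0, 0, 0] regardless of the scores, and the third row has a zero only in its last entry, matching the worked example.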

GPT: Decoder-Only Language Model

GPT (Radford et al., 2018) showed that a decoder-only transformer with masked self-attention, pre-trained on a massive text corpus, can be fine-tuned for many NLP tasks. The training objective is simple: predict the next token. Given tokens [x_1, ..., x_t], predict x_{t+1} using softmax over the vocabulary.

Language Modeling Objective
L(θ) = − ∑_t log P(x_{t+1} | x_1, ..., x_t; θ)

The architecture is straightforward: an embedding matrix [V × D] converts tokens to vectors, a stack of masked self-attention blocks processes them, and a projection matrix [D × V] converts each output vector into scores over the vocabulary. Training minimizes cross-entropy between predicted and actual next tokens.
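A minimal NumPy sketch of this objective, with assumptions stated up front: the vocabulary size V = 5 and dimension D = 4 are toy values, the names `W_embed` and `W_out` are hypothetical, and the stack of masked self-attention blocks is skipped entirely so that only the embedding, output projection, and loss are shown. The loss is summed over positions to match the formula above; implementations commonly average instead.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Summed cross-entropy where logits[t] scores the token that follows tokens[t]."""
    logits, targets = logits[:-1], tokens[1:]  # the last position has no next token
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
V, D = 5, 4                      # toy vocabulary size and model dimension
tokens = np.array([0, 3, 1, 4])  # a toy token sequence

W_embed = rng.normal(size=(V, D))  # embedding matrix [V x D]
W_out = rng.normal(size=(D, V))    # projection matrix [D x V]

hidden = W_embed[tokens]   # masked self-attention blocks would transform this
logits = hidden @ W_out    # (T, V) scores over the vocabulary
loss = next_token_loss(logits, tokens)

# Sanity check: uniform predictions cost log(V) per predicted token.
uniform_loss = next_token_loss(np.zeros((len(tokens), V)), tokens)
print(loss, uniform_loss)
```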

BERT: Masked Language Modeling

BERT (Devlin et al., 2019) uses an encoder-only transformer with no causal mask. Every token sees every other token. Training objective: randomly mask 15% of input tokens, and the model predicts the masked tokens. This forces bidirectional understanding — unlike GPT, which only looks backward.
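The masking step can be sketched as follows. This is a simplified illustration, not BERT's exact recipe: each position is masked independently with probability 0.15, so the masked fraction is 15% only in expectation, and every selected token is replaced by the mask id (the published recipe sometimes substitutes a random token or leaves the token unchanged). The function name `mask_tokens`, the mask id 0, and the `-100` "ignore" label are assumptions made for this example.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob=0.15, rng=None):
    """Return masked inputs, prediction labels, and the boolean mask."""
    if rng is None:
        rng = np.random.default_rng()
    token_ids = np.asarray(token_ids)
    is_masked = rng.random(token_ids.shape) < mask_prob
    inputs = np.where(is_masked, mask_id, token_ids)
    # Only masked positions carry a label; -100 is an arbitrary "ignore" marker.
    labels = np.where(is_masked, token_ids, -100)
    return inputs, labels, is_masked

rng = np.random.default_rng(0)
token_ids = np.arange(1, 41)  # 40 toy token ids; id 0 is reserved for [MASK]
inputs, labels, is_masked = mask_tokens(token_ids, mask_id=0, rng=rng)

print(inputs)
print(int(is_masked.sum()), "of", len(token_ids), "tokens masked")
```

The encoder sees `inputs` with full bidirectional attention, and the loss is computed only at positions where `is_masked` is true.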

BERT vs. GPT

BERT: Input "The [MASK] sat on the mat" → predict "cat". Every token sees every other token (bidirectional). Excellent for classification and understanding tasks. Cannot generate text autoregressively.

GPT: Input "The cat sat on the" → predict "mat". Each token only sees previous tokens (unidirectional). Excellent for text generation. Can also do classification (with fine-tuning) but less naturally.

Why Bidirectional Attention Helps Understanding

Consider: "The bank by the river was eroded." Unidirectional (GPT): when processing "bank," the model sees only "The" to its left, so it can't tell whether "bank" means a financial institution or a riverbank. Bidirectional (BERT): "bank" also sees "river" to its right, which disambiguates it to "riverbank." For understanding tasks (sentiment, NER, question answering), this bidirectional context is crucial. The trade-off: BERT can't generate text token-by-token because it requires the full input to make predictions.

Encoder-Decoder: The Original Design

Vaswani et al. (2017) designed the original Transformer as an encoder-decoder model for translation. The encoder uses full self-attention (every source token sees every other source token). The decoder uses masked self-attention (each target token only sees previous target tokens) plus cross-attention (each target token attends to all source tokens). This design directly inherited the seq2seq structure that motivated attention in the first place.

T5 (Raffel et al., 2020) showed that many NLP tasks can be framed as seq-to-seq. For example, sentiment classification takes a task-prefixed review as input text and produces the label word "positive" as output text. This made encoder-decoder models surprisingly versatile, though GPT-style decoder-only models have since dominated due to simpler training and better scaling.

Modern Tweaks

The core architecture has barely changed since 2017, but several refinements are now standard:

Modification | Change | Why
Pre-Norm | LayerNorm before sublayer, inside residual | More stable training for deep models
RMSNorm | Replace LayerNorm with root-mean-square normalization | Slightly more stable; removes mean centering
SwiGLU MLP | Gated linear unit with Swish activation | Empirically better than ReLU MLP
MoE | Multiple expert MLPs; each token routed to A of E experts | Massive params, modest compute increase
RoPE | Rotary positional embeddings applied to Q, K | Better relative position modeling; can extrapolate
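Three of these tweaks (Pre-Norm, RMSNorm, SwiGLU) fit in a few lines of NumPy. This is a sketch under assumptions rather than any specific model's code: the weight names `W_gate`, `W_up`, and `W_down` are hypothetical, biases are omitted, and the sizes D = 12 and hidden width 8D/3 = 32 are toy values.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Scale by the root mean square of the features; no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def swiglu_mlp(x, W_gate, W_up, W_down):
    """Gated MLP: a Swish-activated gate multiplies a linear 'up' projection."""
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

def pre_norm_mlp_sublayer(x, gain, W_gate, W_up, W_down):
    """Pre-Norm: normalize inside the residual branch, then add the input back."""
    return x + swiglu_mlp(rms_norm(x, gain), W_gate, W_up, W_down)

rng = np.random.default_rng(0)
N, D = 3, 12
d_hidden = 8 * D // 3  # 32
x = rng.normal(size=(N, D))
gain = np.ones(D)
W_gate = rng.normal(size=(D, d_hidden)) * 0.1
W_up = rng.normal(size=(D, d_hidden)) * 0.1
W_down = rng.normal(size=(d_hidden, D)) * 0.1

y = pre_norm_mlp_sublayer(x, gain, W_gate, W_up, W_down)
print(y.shape)
```

The gated MLP uses three weight matrices instead of two, which is why its hidden width is usually shrunk to about two thirds of the ungated one.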

Three Ways to Process Sequences

Method | Receptive Field | Compute | Memory | Parallelism
RNN | Full (sequential) | O(N · D²) | O(D) | Low (sequential)
CNN | Local (kernel-size per layer) | O(N · K · D²) | O(N · D) | High
Self-Attention | Global (every token) | O(N² · D) | O(N²) | High
The One Sentence

Attention lets every token look at every other token to decide what matters. The Transformer stacks this primitive into the most scalable architecture we have.

Quick Reference — Transformer Dimensions
The Key Hyperparameters

D (model dimension): size of each token vector. 512–12,288 in practice.

H (number of heads): parallel attention heads. D_H = D/H per head.

L (number of layers/blocks): depth of the stack. 6–96 in practice.

N (context length): maximum sequence length. 512–128,000+ in modern models.

V (vocabulary size): number of unique tokens. 30,000–100,000 typically.

d_ff (MLP hidden dim): usually 4D or 8D/3 (with SwiGLU).
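These hyperparameters determine most of the parameter count. A rough sketch, assuming each block holds four D × D attention matrices (query, key, value, output) plus the MLP matrices, and ignoring biases, normalization weights, and embeddings; the helper name `block_params` is hypothetical.

```python
def block_params(D: int, d_ff: int, gated: bool = False) -> int:
    """Approximate weight count of one transformer block (matrices only)."""
    attention = 4 * D * D                  # query, key, value, and output projections
    mlp = (3 if gated else 2) * D * d_ff   # a gated MLP has one extra D x d_ff matrix
    return attention + mlp

# ViT-B/16 settings from earlier: D = 768, 12 blocks, 12 heads, d_ff = 4D.
D, H, L = 768, 12, 12
head_dim = D // H                            # 64
vit_b_blocks = L * block_params(D, 4 * D)    # 84,934,656

# With SwiGLU, d_ff = 8D/3 keeps the MLP the same size as the ungated 4D version.
plain = block_params(D, 4 * D)
swiglu = block_params(D, 8 * D // 3, gated=True)

print(head_dim, vit_b_blocks, plain == swiglu)
```

The block total of about 85M sits just under the 86M quoted for ViT-B/16; the remainder comes from the pieces this estimate leaves out. The equality in the last line is the reason for the 8D/3 convention: three matrices of width 8D/3 cost the same as two of width 4D.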

What Hasn't Changed

It's remarkable how little the core architecture has evolved. The original 2017 Transformer had: multi-head self-attention, feedforward networks, residual connections, layer normalization, and positional encoding. Every major model in 2025 still has all five. The changes are incremental refinements (better normalization, better activation functions, better position encoding) — not architectural revolutions. The transformer is not just a good architecture. It's a stable attractor in design space.

Why Transformers Scale

The scaling hypothesis: transformer performance improves predictably as you increase model size, dataset size, and compute. This was demonstrated by Kaplan et al. (2020) with scaling laws: loss follows a power law in each of these quantities. No other architecture has shown such reliable scaling behavior. The combination of parallelizable computation (matrix multiplies), flexible capacity (just stack more blocks), and the attention mechanism's ability to represent complex dependencies appears to be uniquely well-suited to gradient-based optimization at scale.

Timeline

Year | Milestone | Key Contribution
2014 | Seq2Seq (Sutskever et al.) | Encoder-decoder RNN for translation
2015 | Bahdanau Attention | Soft attention for seq2seq; dynamic context vector
2017 | Transformer (Vaswani et al.) | Self-attention everywhere; no RNNs needed
2018 | GPT-1 (Radford et al.) | Decoder-only transformer for language modeling
2019 | BERT (Devlin et al.) | Encoder-only; masked language modeling; bidirectional
2019 | GPT-2 | 1.5B params; zero-shot capabilities
2020 | GPT-3 | 175B params; in-context learning
2021 | ViT (Dosovitskiy et al.) | Patches as tokens; transformers replace CNNs for vision

References

# | Paper
1 | Sutskever et al. "Sequence to Sequence Learning with Neural Networks." NeurIPS 2014.
2 | Bahdanau et al. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015.
3 | Vaswani et al. "Attention Is All You Need." NeurIPS 2017.
4 | Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers." NAACL 2019.
5 | Radford et al. "Improving Language Understanding by Generative Pre-Training." OpenAI 2018.
6 | Dosovitskiy et al. "An Image Is Worth 16x16 Words." ICLR 2021.
7 | Kaplan et al. "Scaling Laws for Neural Language Models." arXiv 2020.
8 | Shazeer. "GLU Variants Improve Transformer." arXiv 2020.