The architecture that replaced recurrence with attention — and changed everything.
You're translating a 200-word paragraph from English to French. Your RNN encoder reads the English words one by one, left to right. By the time it reaches word 200, the information from word 1 has passed through 199 transformations. It's been squeezed, distorted, and mostly forgotten. The model has to summarize the entire paragraph into a single hidden vector before the decoder can start generating French. That's like reading a novel through a keyhole — one word at a time, no going back.
But the problem isn't just memory. It's speed. Because each RNN step depends on the output of the previous step, you cannot parallelize across time steps. Step 50 must wait for step 49, which waits for step 48, all the way back to step 1. On a modern GPU with thousands of cores sitting idle, the RNN forces you into a single-lane highway.
In 2017, Vaswani et al. published "Attention Is All You Need" and proposed a radical alternative: throw away recurrence entirely. Replace the sequential hidden state with a mechanism that lets every position look at every other position simultaneously. No loops. No sequential bottleneck. Just one big matrix multiply.
They called it the Transformer.
Sin 1: Sequential computation. Processing n tokens takes O(n) sequential steps. Step t+1 cannot start until step t has finished. A 512-token input means 512 serial operations, even if you have 10,000 GPU cores.
Sin 2: Long-range forgetting. Information from token 1 must survive through 511 transformations to reach token 512. Even LSTMs, designed to fight vanishing gradients, struggle with dependencies spanning hundreds of tokens. The gradient signal decays exponentially with distance.
Sin 3: No direct connections. In an RNN, token 1 and token 200 are connected only through a chain of 199 hidden states. Each link in the chain can distort the signal. In attention, any two tokens are connected by a single computation step — O(1) path length, regardless of their distance.
The simulation below shows this visually. On the left, an RNN processes tokens sequentially — the signal from early tokens fades as the sequence grows. On the right, attention connects all tokens simultaneously with direct links. Drag the slider to increase sequence length and watch the difference.
Drag the slider to change sequence length. Left: RNN (sequential, fading). Right: Attention (parallel, direct connections).
| Property | RNN | Transformer |
|---|---|---|
| Sequential operations | O(n) | O(1) |
| Max path length | O(n) | O(1) |
| Computation per layer | O(n · d²) | O(n² · d) |
| Parallelizable | No | Yes |
There's a trade-off: the Transformer's attention costs O(n²) per layer because every token looks at every other. For very long sequences this becomes expensive. But for the sequence lengths typical of the paper's translation task (sentences well under 512 tokens, so n is usually smaller than d), the parallelism advantage dominates. The largest model in the paper trained in 3.5 days on 8 GPUs, at a small fraction of the training cost reported for the best competing recurrent and convolutional systems.
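To make the trade-off in the table concrete, here is a rough sketch that plugs numbers into the two per-layer cost formulas. The function names are illustrative, constant factors are ignored, and real runtimes depend heavily on hardware, so treat the output as order-of-magnitude intuition only.

```python
def rnn_layer_cost(n, d):
    """Approximate multiply-adds for a recurrent layer: n steps, each a d x d matrix-vector product."""
    return n * d * d


def attention_layer_cost(n, d):
    """Approximate multiply-adds for self-attention: n x n scores, each a d-dimensional dot product."""
    return n * n * d


d = 512
for n in (64, 512, 4096):
    ratio = attention_layer_cost(n, d) / rnn_layer_cost(n, d)
    # The RNN needs n sequential steps; attention needs a constant number of parallel steps.
    print(f"n={n:5d}  attention/RNN work = {ratio:5.2f}  RNN sequential steps = {n}")
```

Under these formulas attention does less total work than the recurrent layer when n < d and more when n > d, but in both regimes its work can be spread across cores instead of waiting on n serial steps.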
The Transformer arrived at exactly the right moment in hardware history. GPUs in 2017 had thousands of cores optimized for matrix multiplication (NVIDIA's P100 had 3,584 CUDA cores). RNNs could use only a small fraction of this hardware because sequential dependencies forced serial computation. The Transformer, built almost entirely from matrix multiplies, could keep the GPU far busier.
This hardware-algorithm co-design explains the explosion of scale that followed. GPT-2 (2019): 1.5B parameters. GPT-3 (2020): 175B. PaLM (2022): 540B. Each of these models is a close descendant of the 2017 Transformer (a decoder-only stack of the same attention and feed-forward blocks, with minor changes), just far bigger than the original 65M-parameter base model. The scaling laws research by Kaplan et al. (2020) showed that Transformer performance improves predictably with more data, compute, and parameters. No fundamental architectural changes needed. Just more matrix multiplies on more accelerators.
The contrast with RNNs is stark. You can't just "make an RNN bigger" and expect proportional improvement. The sequential bottleneck means training time scales linearly with model size AND sequence length. A 175B-parameter RNN would take years to train on the same data. The Transformer's parallelism made large-scale language models economically feasible for the first time.
Here's the remarkable timeline: 2017, the Transformer (65M parameters in the base model, 213M in the big one) trains in hours to days. 2018, GPT with 117M. 2019, GPT-2 with 1.5B. 2020, GPT-3 with 175B. 2023, GPT-4 (size undisclosed; unofficial estimates are around 1.8T). Each step was enabled by the same architecture, the Transformer, applied at increasing scale. No other architecture in the history of machine learning has shown this consistent scaling behavior.
The Transformer didn't emerge from nothing. It was the culmination of several years of incremental progress:
| Year | Innovation | Key Idea |
|---|---|---|
| 2014 | Seq2Seq (Sutskever) | Encoder-decoder architecture with LSTMs |
| 2015 | Attention (Bahdanau) | Let decoder attend to encoder hidden states |
| 2015 | Residual Networks (He) | Skip connections enable deep networks |
| 2016 | Layer Normalization (Ba) | Normalize per-example, not per-batch |
| 2017 | Transformer | Replace ALL recurrence with attention |
The key leap: previous work used attention alongside RNNs (the RNN reads the sequence, attention helps with alignment). The Transformer's radical claim: attention is sufficient. No RNN needed at all. The paper's title says it: "Attention Is All You Need."
This lesson walks through every component of the Transformer architecture: self-attention, scaling, multi-head attention, positional encoding, the encoder block, the decoder with masking, and the full system. By the end, you'll be able to build one from scratch.
When you read "The cat sat on the mat because it was tired," how do you know "it" refers to "the cat" and not "the mat"? Your brain doesn't process words in isolation — it considers the meaning of every other word in the sentence to resolve ambiguities. "It was tired" suggests a living thing, so "it" must be the cat.
Self-attention is the mechanism that gives the Transformer this ability. For each position in the sequence, it computes a weighted average over all positions, where the weights reflect how relevant each other position is to the current one. The result: every token's representation is enriched by information from every other token.
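Self-attention works like a soft dictionary lookup. Imagine a dictionary where you look up a word and get a definition. In attention, the lookup is soft: instead of returning the single entry that matches exactly, it returns a blend of all entries, weighted by how well each one matches.
Each token produces three vectors: a query (what this token is looking for), a key (what this token advertises to others), and a value (the content this token contributes when it is attended to).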
These three vectors come from three different learned linear projections of the same input embedding. If the input embedding for token i is xi (a d-dimensional vector), then:

$$q_i = x_i W^Q, \qquad k_i = x_i W^K, \qquad v_i = x_i W^V$$
Where WQ, WK, WV are learned weight matrices of shape [d × dk]. The same matrices are applied to every position — the Transformer learns what to ask (Q), what to advertise (K), and what to say (V) as general functions of the input.
To compute the output for token i, we take its query qi and compute dot products with every key kj in the sequence. A high dot product means "token j is relevant to token i." These raw scores are then passed through softmax to get weights that sum to 1. Finally, the output is a weighted sum of the value vectors:

$$\text{output}_i = \sum_j \alpha_{ij}\, v_j, \qquad \alpha_{ij} = \frac{\exp(q_i \cdot k_j)}{\sum_{j'} \exp(q_i \cdot k_{j'})}$$
Token i's output is a blend of all value vectors, weighted by how much each key matched token i's query. If "it" strongly attends to "cat," then "it"'s output representation will contain a lot of "cat"'s information.
The widget below shows this in action. Click any token to select it as the query. Attention weight lines appear from that token to all others, with thickness proportional to the attention weight. The Q, K, V vectors are shown as colored bars below.
Click a token to select it as the query. Lines show attention weights to all other tokens. Thicker = more attention.
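Let's trace self-attention on a tiny example. Three tokens: "I", "love", "dogs". Suppose dmodel = 4 and dk = 4 (no dimensionality reduction for simplicity). After embedding, the input matrix X has shape [3, 4], one row per token.
Suppose our learned weight matrices WQ, WK, WV are all the identity matrix (simplified for hand computation).
Then Q = K = V = X, and the attention scores are the pairwise dot products QKᵀ = XXᵀ, a [3, 3] matrix with one row of scores per query token.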
Token 0 ("I") has scores [2, 0, 1]. It attends strongly to itself (score 2), not at all to "love" (score 0), and weakly to "dogs" (score 1). After softmax: [0.67, 0.09, 0.24]. The output for "I" is 0.67·vI + 0.09·vlove + 0.24·vdogs: mostly itself, with a bit of "dogs" mixed in.
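The trace above can be reproduced in a few lines of plain Python. The embedding values in `X` are an assumption made for illustration, chosen only so that the first row of scores comes out as [2, 0, 1]; the helper names are likewise illustrative.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Hypothetical embeddings for "I", "love", "dogs" (d_model = 4).
X = [
    [1, 0, 1, 0],  # "I"
    [0, 1, 0, 1],  # "love"
    [1, 1, 0, 0],  # "dogs"
]

Q = K = V = X  # identity projections, as in the hand computation
scores = matmul(Q, transpose(K))        # [3, 3] pairwise dot products
weights = [softmax(r) for r in scores]  # each row sums to 1
output = matmul(weights, V)             # [3, 4] blended value vectors

print(scores[0])
print([round(w, 3) for w in weights[0]])
```

The first row of `weights` is the attention distribution for "I", and the first row of `output` is the corresponding blend of value vectors.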
In practice, learned WQ, WK, WV matrices are NOT identity — they learn to project into a space where semantically related tokens produce high dot products. After training, the query for "it" might naturally align with the key for "cat" because the model has learned that pronouns need to find their antecedents.
In an RNN, token 1 and token 100 are connected by a chain of 99 hidden state transformations. In self-attention, they're connected by a single dot product. The path length between any two tokens is O(1). This means gradients flow directly between distant tokens during backpropagation — no vanishing gradient through 99 intermediate steps.
And because every token's attention weights are independent of every other token's (no sequential dependency), the entire computation can be parallelized as a single matrix multiplication. All positions computed simultaneously.
All of self-attention can be written as a few matrix multiplications and a softmax, with no loops and no sequential dependencies:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V, \qquad \text{Attention}(Q, K, V) = \text{softmax}(QK^\top)\,V$$

This is a batch operation: every position's attention is computed simultaneously. On a GPU, each step maps to a GEMM (General Matrix Multiply) call in a library such as cuBLAS. That's a large part of why Transformers train so much faster than RNNs on modern hardware.
```python
import torch
import torch.nn.functional as F

def self_attention(X, W_Q, W_K, W_V):
    # X: [seq_len, d_model]
    # W_Q, W_K, W_V: [d_model, d_k]
    Q = X @ W_Q  # [seq_len, d_k]
    K = X @ W_K  # [seq_len, d_k]
    V = X @ W_V  # [seq_len, d_k]
    scores = Q @ K.T  # [seq_len, seq_len]
    weights = F.softmax(scores, dim=-1)
    output = weights @ V  # [seq_len, d_k]
    return output, weights
```
That's the entire mechanism: three projection multiplies, one multiply for the scores, a softmax, and one final multiply with the values. A few lines of math, and it replaces the entire recurrent loop of an RNN.
Self-attention computes a score for every pair of tokens: n queries × n keys = n² dot products. For n = 512, that's 262,144 pairs per layer. For n = 4096 (twice GPT-3's 2048-token context), it's 16.7 million. For n = 100,000 (some modern models), it's 10 billion.
This quadratic scaling is the Transformer's Achilles heel. It's a large part of why early models kept contexts short: BERT was limited to 512 tokens and GPT-2 to 1024. And it's why an entire subfield of "efficient attention" has emerged: Linformer (linear approximation), Performer (random feature maps), FlashAttention (hardware-aware exact attention), Mamba (selective state spaces that bypass attention entirely), and many others.
The memory cost is equally important. The attention matrix itself is [n, n] per head per layer. For GPT-3 (96 heads, 96 layers, n = 2048): 96 × 96 × 2048 × 2048 × 2 bytes ≈ 77 GB (72 GiB) per training sequence just for the attention matrices, if all of them are stored. This is why techniques like gradient checkpointing (recompute activations instead of storing them) and mixed-precision training (use float16 instead of float32) are essential for large Transformers.
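The arithmetic behind that memory estimate is easy to check. The sketch below assumes every [n, n] attention matrix for one sequence is kept in memory at 2 bytes per value; the function name is illustrative.

```python
def attention_matrix_bytes(n_tokens, n_heads, n_layers, bytes_per_value=2):
    """Bytes needed to store one [n, n] attention matrix per head per layer."""
    return n_heads * n_layers * n_tokens * n_tokens * bytes_per_value


total = attention_matrix_bytes(n_tokens=2048, n_heads=96, n_layers=96)
print(f"{total:,} bytes = {total / 1e9:.1f} GB = {total / 2**30:.0f} GiB")

# Doubling the context length quadruples the cost.
print(attention_matrix_bytes(4096, 96, 96) / total)
```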
To put the cost in perspective: processing a 1M-token context with standard attention requires 10¹² token pairs per layer. Even if each pair cost a single floating-point operation, a GPU sustaining 10¹⁵ FLOP/s would need 1 millisecond per layer. In reality every pair costs a dk-dimensional dot product in each head, plus the matching weighted sum over values, and a GPT-3-sized model has 96 layers, so the real figure is thousands of times larger. This is why long-context Transformers are so expensive and why alternative architectures like Mamba (which processes sequences in O(n) time) are attracting attention. The Transformer's quadratic cost may ultimately limit its dominance for very long sequences, even as it remains supreme for moderate-length contexts where quality matters most.
FlashAttention (Dao et al., 2022) deserves special mention: it computes exact attention in O(n²) time but with dramatically less memory by tiling the computation to fit in GPU SRAM (fast cache) instead of slow HBM (main GPU memory). FlashAttention doesn't change the math at all — it changes how the math is executed on hardware. The result: 2-4x speedup and models can handle 4-16x longer sequences at the same memory budget. This is a perfect example of hardware-aware algorithm design.
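The core trick that makes tiling possible is an "online" softmax: process keys and values block by block while keeping only a running maximum, a running normalizer, and a running weighted sum. The sketch below shows that idea for a single query in plain Python. It is not FlashAttention itself (no GPU memory hierarchy, no backward pass), and the function names, block size, and sample data are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_row_naive(q, keys, values):
    """Standard attention for one query: materializes all scores at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / z for j in range(len(values[0]))]


def attention_row_tiled(q, keys, values, block=2):
    """Same result, but only one block of scores exists at any time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (exp(-inf) is 0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.0, -1.0, 2.0]
keys = [[2, 1, 0, 1], [0, -1, 1, 0], [1, 0, -1, 3], [0.5, 0.5, 0.5, 0.5], [-1, 2, 0, 1]]
values = [[1, 0], [0, 1], [1, 1], [2, -1], [0.5, 0.5]]

naive = attention_row_naive(q, keys, values)
tiled = attention_row_tiled(q, keys, values, block=2)
print(naive, tiled)
```

Both functions compute exact attention; the tiled version simply never holds the full row of scores, which is the property that lets a real kernel keep its working set in fast on-chip memory.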
Take the dot product of two 512-dimensional vectors whose elements are roughly standard-normal (mean 0, variance 1). The dot product is the sum of 512 products of independent random variables. Each product has variance 1, and variances of independent terms add, so the sum has variance 512 (and, by the central limit theorem, is approximately normal). The dot product therefore has a standard deviation of √512 ≈ 22.6, which means typical dot products range from about −45 to +45 (two standard deviations).
Now pass those through softmax. Softmax with inputs in the range [−45, +45] is catastrophically peaked: the largest value gets nearly all the probability mass, and everything else is essentially zero. The attention pattern becomes one-hot — each token attends to exactly one other token, ignoring everything else. That's not a soft weighted average; it's a hard lookup. And worse, the gradients through softmax vanish when the output is nearly one-hot.
The solution is beautifully simple. If the dot product has variance dk, divide by √dk to bring the variance back to 1. Now the softmax inputs are in the range [−3, +3] (roughly), and softmax produces a smooth distribution:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
This is the complete scaled dot-product attention formula from "Attention Is All You Need." Every attention computation in the Transformer uses this exact formula.
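If q and k are vectors where each element is drawn i.i.d. from N(0, 1), then:

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad \mathbb{E}[q \cdot k] = 0, \qquad \mathrm{Var}(q \cdot k) = d_k$$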
Each product qi · ki has mean 0 and variance 1 (product of two standard normals). The sum of dk such terms has variance dk (variances add for independent variables). Standard deviation = √dk. Dividing by √dk normalizes the variance back to 1, so softmax inputs stay in a well-behaved range regardless of how large dk is.
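The variance argument can be checked empirically. The sketch below samples random standard-normal queries and keys and measures the variance of their dot products before and after dividing by √dk; the function name, trial count, and seed are arbitrary choices for illustration.

```python
import math
import random
import statistics


def dot_product_variances(d_k, trials=2000, seed=0):
    """Return (variance of q.k, variance of q.k / sqrt(d_k)) over random N(0, 1) vectors."""
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [s / math.sqrt(d_k) for s in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)


for d_k in (4, 64, 256):
    raw_var, scaled_var = dot_product_variances(d_k)
    print(f"d_k={d_k:3d}  raw variance ~ {raw_var:7.1f}  scaled variance ~ {scaled_var:.2f}")
```

The raw variance should track dk while the scaled variance stays near 1, up to sampling noise.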
The simulation below shows this dramatically. Adjust dk with the slider and toggle scaling on/off. Without scaling, as dk grows, the softmax output becomes a spike. With scaling, it stays smooth.
Drag d_k to see how dimension affects softmax. Toggle scaling on/off.
Let dk = 4. Suppose token "it" has query q = [1, 0, −1, 2] and three keys are k1 = [2, 1, 0, 1], k2 = [0, −1, 1, 0], k3 = [1, 0, −1, 3].
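Raw dot products: q · k1 = 2 + 0 + 0 + 2 = 4, q · k2 = 0 + 0 − 1 + 0 = −1, q · k3 = 1 + 0 + 1 + 6 = 8, giving [4, −1, 8].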
Scaled (divide by √4 = 2): [2, −0.5, 4]
Softmax of [4, −1, 8] (unscaled): [0.018, 0.000, 0.982] — almost all weight on k3.
Softmax of [2, −0.5, 4] (scaled): [0.118, 0.010, 0.872]. Still mostly k3, but now k1 gets 11.8%. The model can learn nuanced blending.
Now imagine dk = 512 instead of 4. With 128 times as many terms in each sum, the dot products of random vectors would typically be about √128 ≈ 11x larger in magnitude (and up to 128x larger for well-aligned vectors). Without the √dk divisor, the softmax outputs would be [~0.0, ~0.0, ~1.0], indistinguishable from a hard lookup. Gradients at the 0.0 positions would be effectively zero, making it impossible for the model to learn that k1 is partially relevant. The √512 ≈ 22.6 divisor brings everything back to a manageable range.
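The dk = 4 example can be verified directly. This is a plain-Python sketch using the query and keys given in the text; the helper names are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


q = [1, 0, -1, 2]
keys = [[2, 1, 0, 1], [0, -1, 1, 0], [1, 0, -1, 3]]
d_k = len(q)

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

print("raw scores:   ", raw, "->", [round(w, 3) for w in softmax(raw)])
print("scaled scores:", scaled, "->", [round(w, 3) for w in softmax(scaled)])
```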
```python
import torch
import torch.nn.functional as F

# The full scaled dot-product attention
def scaled_dot_product(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5
    if mask is not None:
        # Positions where mask == 0 get a very negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights
```
The mask parameter is how we implement causal masking for the decoder (Chapter 6). For the encoder, mask is None — all positions can see all others.
The effect is dramatic in practice. Take dk = 512 for illustration (the original Transformer uses 64 per head and GPT-3 uses 128, where the same effect appears in milder form): the dot products have standard deviation √512 ≈ 22.6. Typical softmax inputs look like [−30, 5, 42, −15, 28, ...]. After softmax, the value 42 gets probability ~0.9999 and everything else gets ~0.0000. The model becomes a hard lookup table in which each token attends to exactly one other token.
With scaling, the same inputs become [−1.3, 0.2, 1.9, −0.7, 1.2, ...]. Taking just these five values, softmax gives roughly [0.02, 0.10, 0.56, 0.04, 0.28]. Now the model can express "I'm mostly interested in token 3, but tokens 5 and 2 are also relevant." This soft blending is what makes attention powerful. Hard attention (attending to exactly one token) loses the ability to aggregate information from multiple sources.
The gradient tells the same story. At softmax saturation, the gradient is nearly zero: ∂softmax/∂z ≈ 0 when one output dominates. The model can't learn to adjust the attention pattern because the gradient carries no information about which direction to update. Scaling keeps the gradient informative throughout training.
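One way to see the saturation effect is to compute the softmax Jacobian, whose entries are p_i(δ_ij − p_j), for the unscaled and scaled inputs used above. This is a small sketch with illustrative helper names; it treats the five listed scores as the whole row.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z):
    """J[i][j] = d softmax_i / d z_j = p_i * (1[i == j] - p_j)."""
    p = softmax(z)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)] for i in range(n)]


def largest_entry(J):
    return max(abs(v) for row in J for v in row)


unscaled = [-30, 5, 42, -15, 28]
scaled = [z / math.sqrt(512) for z in unscaled]

print("largest gradient entry, unscaled:", largest_entry(softmax_jacobian(unscaled)))
print("largest gradient entry, scaled:  ", largest_entry(softmax_jacobian(scaled)))
```

With the unscaled scores every Jacobian entry is close to zero, so almost no learning signal passes through; with scaling the entries are orders of magnitude larger.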
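Before scaled dot-product attention, Bahdanau et al. (2015) used additive attention:

$$\text{score}(q, k) = w^\top \tanh(W_1\, q + W_2\, k)$$

Here W1, W2, and the vector w are learned parameters.
This doesn't have the scaling problem (tanh keeps values in [−1, 1]), but it's slower because it can't be computed as a single matrix multiply. Dot-product attention is O(n² · d) with highly optimized GEMM; additive attention requires per-pair computation. Vaswani et al. report that for small dk the two perform similarly, while for larger dk additive attention beats dot-product attention unless the dot products are scaled. With scaling, dot-product attention keeps the quality and is significantly faster on GPUs. The scaling factor is the small price we pay for that speed.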
One attention head might learn that "it" refers to "cat" (coreference). But there are other relationships worth capturing: "sat" relates to "cat" (subject-verb), "on" relates to "mat" (prepositional attachment), "tired" relates to "sat" (causal). Why force a single set of Q/K/V weights to capture all these different relationships simultaneously?
Multi-head attention runs h separate attention operations in parallel, each with its own learned WQ, WK, WV matrices. Each head projects into a smaller subspace (dk = dmodel/h) so the total computation is the same as single-head attention with full dimensionality.
For each head i (from 1 to h):

$$\text{head}_i = \text{Attention}(X W^Q_i,\; X W^K_i,\; X W^V_i)$$
Where WQi, WKi, WVi each have shape [dmodel × dk], with dk = dmodel/h.
The outputs of all heads are concatenated and projected back to dmodel:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$
Where WO has shape [h · dk × dmodel] = [dmodel × dmodel]. The output is the same dimension as the input.
In the original Transformer, dmodel = 512 and h = 8, so each head operates in dk = 64 dimensions. One head might learn syntax, another coreference, another proximity, each in its own 64-dimensional subspace. The concat + output projection WO learns how to combine these different relationship types into a unified representation.
This is the key insight: different types of relationships can be learned independently. A single 512-dimensional attention head might try to average syntax and coreference patterns together, losing both. Eight 64-dimensional heads can specialize.
The simulation below shows four heads attending to different aspects of a sentence. Each head is color-coded. Click each head to toggle it on/off and see the attention pattern it learns.
Click head buttons to toggle individual attention patterns. Each head learns different relationships.
| Tensor | Shape | Description |
|---|---|---|
| Input X | [n, dmodel] | n tokens, each 512-dim |
| WQi | [dmodel, dk] | [512, 64] per head |
| Qi, Ki, Vi | [n, dk] | [n, 64] per head |
| Scores | [n, n] | Attention matrix per head |
| headi output | [n, dk] | [n, 64] per head |
| Concat | [n, h · dk] | [n, 512] = [n, dmodel] |
| Final output | [n, dmodel] | [n, 512] same as input |
Notice: the final output has the same shape as the input. This is critical — it means we can stack attention layers. The output of layer 1 feeds directly into layer 2.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.d_k = d_model // n_heads  # 64
        self.h = n_heads
        # One big projection, then split into heads
        self.W_qkv = nn.Linear(d_model, 3 * d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, D = x.shape
        # Project to Q, K, V in one shot
        qkv = self.W_qkv(x)  # [B, T, 3*D]
        qkv = qkv.reshape(B, T, 3, self.h, self.d_k)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # [3, B, h, T, d_k]
        Q, K, V = qkv[0], qkv[1], qkv[2]
        # Scaled dot-product attention per head
        scores = Q @ K.transpose(-2, -1) / self.d_k**0.5
        weights = F.softmax(scores, dim=-1)  # [B, h, T, T]
        out = weights @ V  # [B, h, T, d_k]
        # Concat heads and project
        out = out.transpose(1, 2).reshape(B, T, D)
        return self.W_o(out)  # [B, T, D]
```
The key implementation trick: instead of h separate WQ, WK, WV matrices, we use one large projection and reshape. This is more efficient on GPUs because one large matrix multiply is faster than h small ones.
Researchers have probed trained Transformer heads to discover what relationships they capture. Some findings from Clark et al. (2019) and Voita et al. (2019):
Positional heads always attend to the previous token (position i attends most to position i−1) or to the first token. These are surprisingly common and seem to implement simple positional patterns.
Syntactic heads attend from a verb to its subject, from a noun to its determiner, or from a pronoun to its antecedent. These heads effectively learn to parse grammar without being explicitly trained on parse trees.
Rare token heads attend to rare or unusual tokens in the sequence, possibly implementing a "surprise detection" mechanism.
Separator heads attend to punctuation and sentence boundaries, perhaps helping the model understand document structure.
Not all heads are useful. Voita et al. showed that you can prune over 60% of heads in a trained Transformer with minimal quality loss. The model is over-parameterized in attention heads, and most of the work is done by a small number of critical heads. This finding has implications for inference efficiency: Grouped Query Attention (GQA), used in LLaMA 2 and later models, shares key-value heads across multiple query heads, reducing KV cache size by 4-8x with minimal quality loss.
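The KV-cache saving from sharing key-value heads is simple arithmetic. The sketch below uses a hypothetical configuration (80 layers, 64 query heads, head dimension 128, 4096 cached tokens, 2 bytes per value) chosen only for illustration; the function name is also an assumption.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Bytes cached per sequence: one key tensor and one value tensor per layer."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value


# Standard multi-head attention: one KV head per query head (64 of them).
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, d_head=128, seq_len=4096)
# Grouped-query attention: 8 KV heads, each shared by 8 query heads.
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=4096)

print(f"MHA cache: {mha / 2**30:.2f} GiB, GQA cache: {gqa / 2**30:.2f} GiB, ratio: {mha // gqa}x")
```

The cache shrinks in direct proportion to the number of KV heads, while the number of query heads (and so the attention patterns the model can express) is unchanged.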
Each head has three weight matrices (WQ, WK, WV), each [512 × 64]. That's 3 × 512 × 64 = 98,304 parameters per head. With 8 heads: 786,432 parameters. Plus WO at [512 × 512] = 262,144. Total for multi-head attention: ~1M parameters. About the same as a single-head attention with dk = 512 would cost (3 × 512 × 512 = 786,432 for Q/K/V). Multi-head attention adds expressiveness without adding cost.
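The parameter count above can be confirmed with a short calculation. This sketch ignores bias terms, and the function name is illustrative.

```python
def multi_head_params(d_model, n_heads):
    """Weight count for Q/K/V projections plus the output projection (biases ignored)."""
    d_k = d_model // n_heads
    qkv = n_heads * 3 * d_model * d_k   # three [d_model, d_k] matrices per head
    out = (n_heads * d_k) * d_model     # W_O: [h * d_k, d_model]
    return qkv + out


print(multi_head_params(512, 8))  # eight 64-dimensional heads
print(multi_head_params(512, 1))  # one 512-dimensional head
```

Both calls give the same total, which is the sense in which splitting into heads adds no parameters.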
Consider "The cat sat because it was tired" with dmodel = 4 and h = 2 heads (dk = 2 each).
Head 1 (coreference): Projects "it" into query q1 = [0.9, 0.1]. Keys for "cat" = [0.8, 0.2], "sat" = [0.1, 0.9], "tired" = [0.2, 0.3]. Dot products: cat = 0.74, sat = 0.18, tired = 0.21. After softmax (unscaled, for simplicity): cat = 0.46, sat = 0.26, tired = 0.27. Head 1 puts its largest weight on the antecedent "cat."
Head 2 (adjacency): Projects "it" into query q2 = [0.2, 0.8]. Keys for "because" = [0.3, 0.9], "was" = [0.4, 0.7], "tired" = [0.1, 0.3]. Dot products: because = 0.78, was = 0.64, tired = 0.26. After softmax (unscaled, for simplicity): because = 0.41, was = 0.35, tired = 0.24. Head 2 finds the local context.
Concatenating: head1 output (2 dims focused on "cat") || head2 output (2 dims focused on "because"/"was") = 4-dim vector containing both coreference AND local context. WO then learns how to combine these different relationship types into a single unified representation.
A single head with dk = 4 would be forced to choose: learn coreference OR learn adjacency, but not both. Multi-head lets it learn both simultaneously.
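The two-head example can be recomputed in plain Python. The queries and keys are the ones given in the example; softmax is applied to the raw (unscaled) dot products, and the helper names are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def head_weights(query, keys):
    """Attention weights of one head, returned as {token: weight}."""
    return dict(zip(keys, softmax([dot(query, k) for k in keys.values()])))


w1 = head_weights([0.9, 0.1], {"cat": [0.8, 0.2], "sat": [0.1, 0.9], "tired": [0.2, 0.3]})
w2 = head_weights([0.2, 0.8], {"because": [0.3, 0.9], "was": [0.4, 0.7], "tired": [0.1, 0.3]})

print({t: round(w, 2) for t, w in w1.items()})
print({t: round(w, 2) for t, w in w2.items()})
```

With vectors this small the distributions are fairly flat; each head's top choice is still the token its query was built to find.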
"Dog bites man" and "Man bites dog" have identical tokens: {dog, bites, man}. Self-attention computes dot products between queries and keys, but dot products are symmetric and permutation-invariant. If you shuffle the input tokens, the attention weights change only because the Q/K/V values changed — but the mechanism itself has no notion of order. It treats the input as a set, not a sequence.
This is a fundamental problem. Word order carries meaning. "The cat chased the mouse" means something very different from "The mouse chased the cat." We need to inject position information explicitly.
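The order-blindness can be demonstrated directly: reorder the input rows and each token receives exactly the same output vector, just at its new position (the operation is permutation-equivariant). The sketch below uses identity projections and made-up embeddings purely for illustration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Self-attention with identity Q/K/V projections and no positional information."""
    scores = [[sum(a * b for a, b in zip(q, k)) for k in X] for q in X]
    weights = [softmax(row) for row in scores]
    dim = len(X[0])
    return [[sum(w * v[j] for w, v in zip(row, X)) for j in range(dim)] for row in weights]


# Hypothetical embeddings for "dog", "bites", "man".
X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
perm = [2, 1, 0]  # "man bites dog"
X_shuffled = [X[i] for i in perm]

out = self_attention(X)
out_shuffled = self_attention(X_shuffled)

# Row r of the shuffled output equals row perm[r] of the original output.
same = all(
    abs(a - b) < 1e-9
    for r, i in enumerate(perm)
    for a, b in zip(out_shuffled[r], out[i])
)
print(same)
```

Because "dog" gets the same representation whether it is the subject or the object, the model cannot tell the two sentences apart without extra position information.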
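Vaswani et al. added a positional encoding vector to each input embedding. For position pos and dimension index i (each i covers one sine/cosine pair):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$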
Each dimension gets a sinusoidal wave at a different frequency. Low-index dimensions oscillate quickly (changing every position); high-index dimensions oscillate slowly (changing over hundreds of positions). The result: each position gets a unique "fingerprint" vector.
Think of how a clock represents time. The second hand spins fast (high frequency), the minute hand spins slow (medium frequency), and the hour hand barely moves (low frequency). Any moment in time is uniquely identified by the combination of all three hands. 3:15:42 is different from 3:15:43 because the second hand moved, even though the other hands are in the same place.
Sinusoidal positional encoding works the same way. Each dimension is a "hand" spinning at a different frequency. Low-index dimensions spin fast (changing every token), high-index dimensions spin slowly (repeating only over thousands of tokens). The combination of all dimensions uniquely identifies each position.
Three elegant properties:
1. Unique per position. No two positions share the same encoding vector. The combination of different-frequency sinusoids creates a unique pattern at each position, like the unique combination of hands on a clock.
2. Bounded values. Every element is between −1 and +1 (sine and cosine are bounded). This plays well with the input embeddings, which are typically initialized with similar magnitude.
3. Relative positions are learnable. For any fixed offset k, PE(pos+k) is a linear function of PE(pos). This means the model can learn to attend to "the word 3 positions ago" by learning a simple linear transformation of the positional encoding. The dot product PE(pos) · PE(pos+k) depends only on k, not on pos itself.
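The third property is easy to check numerically: the dot product between the encodings of two positions depends only on how far apart they are. This is a plain-Python sketch of the sinusoidal formula, assuming an even d_model; the function names are illustrative.

```python
import math


def sinusoidal_pe(pos, d_model=64):
    """Encoding vector for one position: interleaved sin/cos pairs (d_model assumed even)."""
    pe = []
    for two_i in range(0, d_model, 2):
        angle = pos / (10000 ** (two_i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


offset = 5
near_start = dot(sinusoidal_pe(3), sinusoidal_pe(3 + offset))
far_along = dot(sinusoidal_pe(100), sinusoidal_pe(100 + offset))
print(near_start, far_along)  # equal up to floating-point error

# Every element is bounded, and a position's dot product with itself is d_model / 2.
print(dot(sinusoidal_pe(42), sinusoidal_pe(42)))
```

The equality follows from sin(a)sin(b) + cos(a)cos(b) = cos(a − b): each sine/cosine pair contributes the cosine of its frequency times the offset.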
The simulation below shows the positional encoding as a heatmap: rows are positions, columns are dimensions. Hover any row to see its encoding vector. The slider controls the maximum sequence length. Notice how low dimensions oscillate fast and high dimensions oscillate slowly.
Top: encoding heatmap (position × dimension). Bottom: 2D PCA projection of position vectors. Hover rows to inspect.
The positional encoding is added to the input embedding, not concatenated:

$$x_{pos} = \text{Embed}(\text{token}_{pos}) + PE_{pos}$$
Why add instead of concatenate? Concatenation would double the dimension (wasting computation) and force the model to learn separate weights for "content" and "position" dimensions. Addition lets position information blend naturally with semantic information. The model can learn to use position through the standard Q/K/V projections without any architectural changes.
```python
import torch
import math

def sinusoidal_pe(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(0, max_len).unsqueeze(1)  # [max_len, 1]
    div = torch.exp(
        torch.arange(0, d_model, 2) * -math.log(10000) / d_model
    )  # [d_model/2]
    pe[:, 0::2] = torch.sin(pos * div)  # even dims
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims
    return pe  # [max_len, d_model]

# Usage: embeddings = token_embed(x) + sinusoidal_pe(512, 512)[:seq_len]
```
The division term exp(arange * -log(10000) / d_model) computes 1/10000^(2i/d_model) in log space for numerical stability. This creates frequencies that span from 1 (dimension 0 changes noticeably at every position) down to roughly 1/10000 (the last dimensions complete a cycle only over tens of thousands of positions).
The original Transformer uses fixed sinusoidal encodings (no learned parameters). Vaswani et al. also tried learned positional embeddings — a separate embedding table of shape [max_len, dmodel] — and found "nearly identical results." Later models like BERT and GPT use learned embeddings; very long-context models use RoPE (rotary position embeddings) or ALiBi. The key insight remains: attention needs position information injected explicitly.
RoPE (Rotary Position Embeddings) is worth mentioning because it powers most modern LLMs (LLaMA, Mistral, GPT-NeoX). Instead of adding position to the embedding, RoPE applies a rotation to Q and K vectors based on position. The dot product Q·K then naturally encodes relative position. RoPE is elegant because it makes relative position a property of the dot product itself, not something the model has to learn from additive encodings.
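A minimal sketch of the rotation idea, in plain Python: each adjacent pair of dimensions is rotated by an angle proportional to the token's position, and the resulting query-key dot product depends only on the distance between the two positions. Pairing adjacent dimensions and using base 10000 are assumptions made here for illustration; production implementations differ in memory layout and other details.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of vec by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (5 positions apart) at two different absolute locations.
score_early = dot(rope(q, 2), rope(k, 7))
score_late = dot(rope(q, 50), rope(k, 55))
print(score_early, score_late)  # equal up to floating-point error
```

Rotating by position 0 leaves a vector unchanged, and rotations preserve vector length, so the content of Q and K is kept intact while relative position enters through the angle between them.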
One advantage of sinusoidal encodings: they're defined for any position, even positions not seen during training. If you train on sequences of length 512, you can theoretically evaluate on length 1024 — the sine/cosine functions extend naturally. In practice, the model's quality degrades significantly beyond the training length because attention patterns were only trained on the shorter context.
This length extrapolation problem is a major area of current research. Approaches include:
| Method | Key Idea | Used By |
|---|---|---|
| ALiBi | Penalize attention by distance (no explicit PE) | BLOOM, MPT |
| RoPE + NTK scaling | Adjust RoPE frequencies to extend context | CodeLlama, extended LLaMA |
| YaRN | Frequency-dependent interpolation of RoPE, plus attention temperature scaling | Various fine-tunes |
| Ring Attention | Distribute long context across multiple GPUs | Research |
The upshot: position encoding isn't a solved problem. It remains one of the most active research frontiers in Transformer architecture design, especially as models push toward 1M+ token context windows.
We have attention and position. Now we need to package them into a repeatable building block that can be stacked into a deep network. The Transformer encoder is built from N identical layers (N=6 in the original paper), each containing the same two sub-modules wired together with residual connections and layer normalization.
The input (a sequence of n vectors, each dmodel-dimensional) goes through multi-head attention. Every position attends to every other position. The output has the same shape as the input: [n, dmodel].
After attention, each position independently passes through a two-layer fully connected network:

$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
The hidden dimension dff is typically 4 × dmodel (so 2048 in the original). ReLU activation between the two layers. The same FFN is applied to every position independently — no interaction between positions here. That's the job of attention. The FFN's job is to transform each position's representation nonlinearly, acting like a small neural network applied to each token independently.
Why is dff = 4 × dmodel? The expansion allows the FFN to represent more complex functions. Think of it as: attention mixes information across positions; FFN processes information within each position. The 4x expansion gives the FFN enough capacity to do useful computation.
Recent research suggests the FFN acts as a key-value memory. Each of the dff hidden units has an input weight vector in W1 that acts as a "key," activating on specific input patterns, and an output weight vector in W2 that acts as the "value" added to the representation when that pattern is detected. With dff = 2048, the FFN has 2048 memory slots. One slot might fire when it sees a capital letter after a period (triggering "this is a sentence start"), another when it sees a number followed by a unit (triggering "this is a measurement"). On this view, the FFN is where the Transformer stores much of its factual knowledge.
This is why scaling up dff (and by extension dmodel) improves the model's ability to store facts. GPT-3 has dff = 4 × 12288 = 49,152 memory slots per layer, across 96 layers. That's 4.7 million memory slots — enough to store an enormous amount of world knowledge.
The original FFN uses ReLU activation. Modern Transformers have found that SwiGLU (a gated variant of Swish) works better:

$$\text{FFN}_{\text{SwiGLU}}(x) = \big(\text{Swish}(xW_1) \odot xW_3\big)\,W_2$$

Where ⊙ is element-wise multiplication and Swish(x) = x · σ(x). The gating mechanism (multiplying by x W3) lets the network selectively activate different memory slots more precisely than ReLU. LLaMA, Mistral, and most 2023-2024 models use SwiGLU. The cost: an extra weight matrix W3, which increases FFN parameters by ~50% at a fixed dff. In practice, models such as LLaMA shrink dff to roughly 8/3 × dmodel so that the total parameter count stays about the same.
Another common variant: GeGLU (GELU-gated), which replaces Swish with GELU. The differences between SwiGLU, GeGLU, and ReGLU (ReLU-gated) are small — the key insight is that gating helps, regardless of which activation function gates it.
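To make the difference between the two feed-forward variants concrete, here is a plain-Python sketch applied to a single token vector. Biases are omitted, the tiny weight matrices are made up for illustration, and the function names are assumptions.

```python
import math


def matvec(x, W):
    """Row vector x times matrix W, where W is a list of rows."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]


def swish(z):
    return z / (1.0 + math.exp(-z))  # z * sigmoid(z)


def ffn_relu(x, W1, W2):
    hidden = [max(0.0, h) for h in matvec(x, W1)]
    return matvec(hidden, W2)


def ffn_swiglu(x, W1, W3, W2):
    gate = [swish(h) for h in matvec(x, W1)]   # Swish(x W1)
    up = matvec(x, W3)                         # x W3
    hidden = [g * u for g, u in zip(gate, up)] # element-wise gating
    return matvec(hidden, W2)


# d_model = 2, d_ff = 3 (hypothetical toy sizes).
x = [1.0, 2.0]
W1 = [[1.0, 0.0, -1.0], [0.5, 1.0, 0.0]]
W3 = [[0.5, 1.0, 0.0], [0.0, 0.5, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(ffn_relu(x, W1, W2))
print(ffn_swiglu(x, W1, W3, W2))
```

The gated version needs three weight matrices instead of two, which is where the extra parameters at a fixed hidden size come from.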
Each sub-layer has a residual connection that adds the input to the output:
output = x + SubLayer(x)
Residual connections are arguably the most important architectural choice in the Transformer. Without them, deep Transformers (6+ layers) are very hard to train. Why? In a deep network, the gradient must pass through every layer during backpropagation. Each layer transforms the gradient, and many layers of transformation can shrink it to near-zero (vanishing gradient). The residual connection provides a "highway" for the gradient to flow directly from the loss back to early layers, bypassing the sub-layers. Because the derivative of x + SubLayer(x) with respect to x is the identity plus the sub-layer's Jacobian, the gradient always includes an identity term that passes through unchanged, no matter how the sub-layer transforms it.
Layer normalization normalizes each token's representation to have zero mean and unit variance, then applies learned scale (γ) and shift (β) parameters:
LayerNorm(x) = γ ⊙ (x − μ) / √(σ² + ε) + β
Where μ and σ² are the mean and variance computed across the dmodel dimensions of a single token (not across the batch or sequence), and ε is a small constant that avoids division by zero. This stabilizes training by preventing the internal representations from drifting to very large or very small values.
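As a small worked example of the normalization just described, the sketch below applies layer normalization to one token vector in plain Python. The function name `layer_norm`, the default `eps=1e-5`, and the toy vector are assumptions chosen for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector over its feature dimension, then scale and shift."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d  # biased variance over features
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

# One token with d_model = 4; gamma = 1 and beta = 0 leave the normalized values as-is
token = [2.0, 4.0, 6.0, 8.0]
out = layer_norm(token, gamma=[1.0] * 4, beta=[0.0] * 4)

out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print([round(v, 3) for v in out], round(out_mean, 6), round(out_var, 4))
```

Note that the statistics come from the single token alone, so the result does not depend on batch size or sequence length.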
Batch normalization (Ioffe & Szegedy, 2015) normalizes across the batch dimension: for each feature, compute the mean and variance across all examples in the batch. This works well for CNNs where batch statistics are stable, but fails for sequences because:
1. Variable sequence lengths. Different sequences in a batch have different lengths. Position 50 might exist in 3 of 8 batch items. Batch statistics at position 50 are computed from only 3 examples — too noisy to be useful.
2. Train-test mismatch. Batch norm uses running statistics at test time, but at inference we often process one sequence at a time (batch size 1). The running statistics from training (computed over batches of 32+) don't match the single-example test distribution.
3. Sequence length changes. Batch norm statistics are position-dependent. A model trained on 512-token sequences has statistics for positions 0-511. At test time with 1024 tokens, positions 512-1023 have no statistics.
Layer norm avoids all these problems because it normalizes each token independently, with no cross-sequence or cross-batch dependencies.
Click each component in the simulation below to expand it and see the tensor shapes flowing through.
After 6 such layers, the encoder produces a rich, contextual representation of the input sequence. Every token's representation has been enriched by information from every other token, through 6 rounds of attention and nonlinear transformation.
```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        attn_out, _ = self.attn(x, x, x)  # self-attention
        x = self.ln1(x + attn_out)        # residual + norm
        ff_out = self.ff(x)               # feed-forward
        x = self.ln2(x + ff_out)          # residual + norm
        return x                          # [batch, seq_len, d_model]
```
Notice how compact this is: two sub-layers, each wrapped in residual + LayerNorm. The entire encoder is just nn.Sequential(*[EncoderBlock() for _ in range(N)]). The Transformer's power comes from repetition of simple blocks, not from any single complex component.
Each component exists for a reason. Here's what breaks if you remove one:
| Remove This | What Happens |
|---|---|
| Residual connection | Training diverges after 3-4 layers. Gradients vanish. Deep Transformers become untrainable. |
| Layer normalization | Activations drift to extreme values. Training becomes unstable and requires very small learning rates. |
| FFN (keep only attention) | Model loses per-position processing power. Quality drops ~2-3 BLEU. The model can mix information but can't transform it. |
| Multi-head (use single head) | Model captures fewer relationship types simultaneously. Quality drops ~1 BLEU but still works. |
| Positional encoding | Model treats input as a bag-of-words. Word order information is lost. "Dog bites man" = "Man bites dog." |
The original Transformer applies LayerNorm after the residual connection: LayerNorm(x + SubLayer(x)). This is called Post-Norm. Many later implementations (including GPT-2, GPT-3, LLaMA) use Pre-Norm: x + SubLayer(LayerNorm(x)). Pre-Norm is more stable during training because the residual path is a clean identity — the gradient flows through without any normalization in the way. Post-Norm sometimes needs learning rate warmup to avoid divergence.
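The two orderings differ only in where the normalization sits relative to the residual sum. The sketch below contrasts them in plain Python; `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and the learned γ and β are omitted for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Layer norm without learned scale and shift, for illustration only."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """Original Transformer ordering: LayerNorm(x + SubLayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Pre-Norm ordering: x + SubLayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(v):
    """Hypothetical sub-layer: a fixed linear scaling."""
    return [0.1 * t for t in v]

x = [10.0, 20.0, 30.0, 40.0]
post_out = post_norm(x, toy_sublayer)
pre_out = pre_norm(x, toy_sublayer)
print([round(v, 2) for v in post_out])  # rescaled to zero mean, unit variance
print([round(v, 2) for v in pre_out])   # x itself plus a small update
```

In the Pre-Norm output the input x is still present unchanged in the sum, which is the "clean identity" path described above; in Post-Norm the whole sum is rescaled.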
The encoder sees the whole input at once. But when generating output, the decoder must go one token at a time — it produces "The," then "cat," then "sat," each conditioned on what it has generated so far. If the decoder could see future tokens during training, it would just copy the answer instead of learning to predict. No peeking allowed.
The decoder block has the same two sub-layers as the encoder (self-attention + FFN), plus a third: cross-attention sandwiched between them.
In the encoder, every position attends to every other position. In the decoder, position i can only attend to positions ≤ i. This is enforced by setting all attention scores from i to positions j > i to −∞ before softmax. Since exp(−∞) = 0, future positions get zero attention weight.
In matrix form, the causal mask M is added to the scaled scores before softmax, and its entries above the diagonal are −∞:
MaskedAttention(Q, K, V) = softmax(QKᵀ / √dk + M) V
Where Mij = 0 if i ≥ j, and Mij = −∞ if i < j. This ensures that each position can only gather information from the past and present, never the future.
Suppose the decoder generates the sentence "Le chat dort" (French for "The cat sleeps"). The mask matrix M for 3 positions is:
M = [ 0  −∞  −∞ ]
    [ 0   0  −∞ ]
    [ 0   0   0 ]
At position 0 ("Le"), the raw attention scores are [2.1, 3.5, −0.8]. Adding the mask: [2.1, −∞, −∞]. After softmax: [1.0, 0.0, 0.0]. "Le" can only see itself.
At position 1 ("chat"), scores are [1.5, 2.8, 0.9]. After mask: [1.5, 2.8, −∞]. Softmax: [0.21, 0.79, 0.0]. "chat" attends to "Le" (21%) and itself (79%).
At position 2 ("dort"), scores are [0.3, 1.8, 2.1]. No masking needed (last row is all zeros). Softmax: [0.09, 0.39, 0.52]. "dort" can see everything.
The mask elegantly prevents information leakage while allowing the entire sequence to be processed in parallel during training. Without it, the model would cheat by looking at future tokens.
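The worked example above can be checked numerically. This sketch builds the 3×3 causal mask and recomputes the attention weights for each row from the raw scores quoted in the text; the helper names `causal_mask` and `softmax` are illustrative.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """0 on and below the diagonal, -inf above it."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Raw scores for "Le", "chat", "dort"
scores = [[2.1, 3.5, -0.8],
          [1.5, 2.8, 0.9],
          [0.3, 1.8, 2.1]]
mask = causal_mask(3)
weights = [softmax([s + m for s, m in zip(s_row, m_row)])
           for s_row, m_row in zip(scores, mask)]

for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.21, 0.79, 0.0]
# [0.09, 0.39, 0.52]
```

Every row still sums to 1, and all weight above the diagonal is exactly zero.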
Cross-attention is where the decoder "reads" the encoder's output. The mechanism is identical to self-attention, but with one crucial difference: the queries come from the decoder, while the keys and values come from the encoder.
This lets each decoder position ask: "Which parts of the input should I pay attention to right now?" When generating "le" in French, the decoder might attend strongly to "the" in the English encoder output. When generating "chat," it attends to "cat."
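A minimal single-head sketch of this idea follows, in plain Python. To keep it short it assumes the query, key, and value vectors are given directly (no learned projections), and the toy numbers are invented so that the first decoder position matches the first encoder position and the second matches the second.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(dec_queries, enc_keys, enc_values):
    """Queries come from the decoder; keys and values come from the encoder."""
    d_k = len(enc_keys[0])
    outputs, all_weights = [], []
    for q in dec_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in enc_keys]
        w = softmax(scores)
        out = [sum(wi * v[j] for wi, v in zip(w, enc_values))
               for j in range(len(enc_values[0]))]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights

# Three encoder positions (e.g. "the", "cat", "sleeps"), two decoder positions
enc_keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]]
enc_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dec_queries = [[4.0, 0.0], [0.0, 4.0]]

outputs, weights = cross_attention(dec_queries, enc_keys, enc_values)
for w in weights:
    print([round(p, 3) for p in w])
```

The output has one row per decoder position, while each weight row has one entry per encoder position, so the two sequences may have different lengths.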
The simulation below shows the decoder in action. The attention matrix shows the causal mask (grayed upper triangle). Step through generation token-by-token — each step reveals one more row in the attention matrix. Cross-attention arrows connect to the encoder.
A subtle but important distinction. During training, we know the entire target sequence. We feed it all at once to the decoder and apply the causal mask. This is called teacher forcing — the model always sees the correct previous tokens, even if it would have predicted wrong ones. All positions are computed in parallel (a single forward pass), making training efficient.
During inference, we don't have the target. We generate one token at a time: feed <SOS>, predict the first token, feed <SOS> + first token, predict the second, and so on. This is autoregressive and inherently sequential. Each step requires a full forward pass through the decoder, though we can cache the key/value computations from previous positions (KV caching) to avoid redundant computation.
Teacher forcing has a subtle problem: exposure bias. During training, the model always sees correct previous tokens. During inference, it sees its own (potentially wrong) predictions. If the model generates a wrong token early on, all subsequent predictions are conditioned on that error — but the model was never trained in this scenario. It's like practicing basketball by always catching perfect passes, then being thrown a bad pass in a real game.
Despite this theoretical concern, teacher forcing works well in practice for Transformers. The attention mechanism helps: even if one token is wrong, the model can attend to many other (correct) tokens. The error doesn't propagate through a hidden state chain like in RNNs — it's just one position in a set of many. Techniques like scheduled sampling (gradually mixing model predictions into the training targets) can reduce exposure bias but add complexity and are rarely used with modern Transformers.
Without caching, generating token t means re-running the decoder over all t positions, including recomputing keys and values for positions 1 through t−1, which we already computed in previous steps. Attention alone then costs O(t² · d) per token, or O(n³ · d) in total for n tokens.
With KV caching, we store the key and value vectors from all previous positions. At step t, we only compute Q, K, V for the new position t, append the new K and V to the cache, and attend over the full cached sequence. The projection work per step drops from O(t · d²) to O(d²), and the attention work drops from O(t² · d) to O(t · d). Total attention cost: O(n² · d) instead of O(n³ · d).
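The caching idea can be sketched with a tiny single-head example. The `KVCache` class below is a hypothetical illustration, not a library API, and it assumes the per-position q, k, and v vectors are already projected. It checks that attending over the cache gives the same result as recomputing from scratch at every step.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]

class KVCache:
    """Stores keys and values of all previous positions for one head."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the new position's k and v are added; earlier ones are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy per-position vectors (illustrative values)
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ks = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
recomputed_out = [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(len(qs))]
print(cached_out == recomputed_out)  # True
```

The cache grows by one key and one value per generated token, which is exactly the memory cost discussed next.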
The cost: memory. For each layer, we store n × dk numbers for keys and n × dk for values, times h heads. For GPT-3 (96 layers, 96 heads, dk = 128, 2048 tokens) at 2 bytes per number (fp16): 96 × 2 × 96 × 128 × 2048 × 2 bytes ≈ 9.7 GB per sequence. This is why LLMs need so much GPU memory during inference, and why KV cache compression is an active research area.
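The memory estimate is simple enough to reproduce as a one-line formula. The function name `kv_cache_bytes` is illustrative, and the default of 2 bytes per value assumes 16-bit storage.

```python
def kv_cache_bytes(n_layers, n_heads, d_k, n_tokens, bytes_per_value=2):
    """Keys plus values (factor 2) for every layer, head, and cached token."""
    return n_layers * 2 * n_heads * d_k * n_tokens * bytes_per_value

gpt3_bytes = kv_cache_bytes(n_layers=96, n_heads=96, d_k=128, n_tokens=2048)
print(gpt3_bytes)                     # 9663676416
print(round(gpt3_bytes / 1e9, 2))     # 9.66 (decimal GB)
print(round(gpt3_bytes / 2**30, 2))   # 9.0 (binary GiB)
```

Because the formula is linear in `n_tokens`, doubling the context length doubles the cache size per sequence.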
```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ln3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # x: [batch, tgt_len, d_model], enc_out: [batch, src_len, d_model]
        # causal_mask: [tgt_len, tgt_len]; a bool mask marks blocked pairs with True,
        # a float mask is added to the attention scores

        # 1. Masked self-attention (decoder attends to itself)
        attn1, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.ln1(x + attn1)

        # 2. Cross-attention (Q from decoder, K/V from encoder)
        attn2, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.ln2(x + attn2)

        # 3. Feed-forward
        x = self.ln3(x + self.ff(x))
        return x
```
Time to put it all together. Build a Transformer piece by piece and watch data flow through each component. Start with raw tokens, add embedding, positional encoding, encoder layers, decoder layers, and the output head. Each addition animates the data as it flows through.
Watch how the parameter count changes as you adjust the sliders. The original Transformer base model has 65M parameters (dmodel=512, N=6, h=8). The large model has 213M parameters (dmodel=1024, N=6, h=16). Modern LLMs like GPT-3 (175B) and LLaMA (65B) use the same core architecture, just scaled up massively.
The original Transformer was trained on WMT 2014 English-German (4.5M sentence pairs) and English-French (36M pairs). The training recipe contained several innovations that became standard practice:
Learning rate warmup. The learning rate starts at zero and linearly increases for the first 4,000 steps, then decays proportional to the inverse square root of the step number:
lrate = dmodel^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5))
Why warmup? Early in training, the model's parameters are random. Large learning rates + random parameters = large, unstable updates. Warmup lets the model find a reasonable region of parameter space before full-speed optimization. This schedule is now so standard it's called the "Transformer schedule" or "Noam schedule" (after one of the paper's authors).
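The schedule is short enough to write out directly. This sketch follows the formula from the paper with dmodel = 512 and 4,000 warmup steps; the function name `noam_lr` is illustrative, and step 0 is handled explicitly because 0 raised to a negative power is undefined.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    if step == 0:
        return 0.0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (0, 1000, 2000, 4000, 8000, 16000, 100000):
    print(step, f"{noam_lr(step):.6f}")
```

The two arguments of `min` are equal exactly at `step == warmup_steps`, so the peak learning rate (about 7e-4 for these settings) is reached at the end of warmup.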
Label smoothing. Instead of training against hard one-hot targets (0 everywhere, 1 at the correct token), they used soft targets with a smoothing value of 0.1: 0.9 at the correct token, 0.1 / |V| everywhere else. This hurts perplexity (the exponentiated average negative log-likelihood on test data, where lower is better) because the model learns to be less confident, but it improves accuracy and BLEU scores.
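Here is a minimal sketch of how smoothed targets can be built and scored. It uses one common formulation, mixing the one-hot target with a uniform distribution so the result sums to exactly 1 (the correct token then gets 0.9 + 0.1 / |V|); the function names and the 5-token vocabulary are assumptions for illustration.

```python
import math

def smoothed_targets(correct_index, vocab_size, eps=0.1):
    """(1 - eps) * one_hot + eps * uniform."""
    target = [eps / vocab_size] * vocab_size
    target[correct_index] += 1.0 - eps
    return target

def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

target = smoothed_targets(correct_index=2, vocab_size=5)
print([round(t, 2) for t in target])  # [0.02, 0.02, 0.92, 0.02, 0.02]

overconfident = [0.0025, 0.0025, 0.99, 0.0025, 0.0025]
calibrated = target  # a prediction that matches the soft target exactly
print(round(cross_entropy(target, overconfident), 3))
print(round(cross_entropy(target, calibrated), 3))
```

The overconfident prediction gets the higher loss, which is how smoothing discourages the model from pushing probabilities all the way to 1.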
Dropout. Applied after every sub-layer, after attention weights, and in the positional encoding addition. Rate: 0.1 for the base model. Without dropout, the Transformer overfits quickly on smaller datasets.
Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10⁻⁹. The β2 value is lower than Adam's usual default of 0.999, so the second-moment estimate averages over a shorter window and adapts more quickly as gradient scales change during training.
The Transformer achieved 28.4 BLEU on English-to-German translation, beating the previous best results, including ensembles, by over 2 BLEU points. On English-to-French, it hit 41.0 BLEU, a new single-model state of the art. And it trained in 3.5 days on 8 P100 GPUs, a small fraction of the training cost of the previous best models.
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost |
|---|---|---|---|
| ConvS2S (2017) | 25.2 | 40.5 | Very high |
| GNMT + RL (Google, 2016) | 24.6 | 39.9 | 6 days, 96 GPUs |
| Transformer (base) | 27.3 | 38.1 | 12 hrs, 8 GPUs |
| Transformer (big) | 28.4 | 41.0 | 3.5 days, 8 GPUs |
The cost efficiency was as impressive as the quality. The Transformer wasn't just better — it was dramatically cheaper to train. This is what made the subsequent scaling revolution possible.
| Component | Purpose | Parameters (base) |
|---|---|---|
| Token Embedding | Map token IDs to d_model vectors | V × 512 ≈ 19M |
| Positional Encoding | Inject position info (sinusoidal = 0 params) | 0 |
| Encoder Self-Attention ×6 | Context mixing across positions | 6 × 1.05M = 6.3M |
| Encoder FFN ×6 | Per-position nonlinear transform | 6 × 2.1M = 12.6M |
| Encoder LayerNorm ×12 | Stabilize activations | 12 × 1K ≈ 12K |
| Decoder (same + cross-attn) ×6 | Generate output autoregressively | ≈25M |
| Output Linear + Softmax | Map d_model to vocabulary probs | 512 × V ≈ 19M, shared with the token embedding (0 extra) |
| Total (base) | | ≈65M |
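The table's arithmetic can be reproduced with a few lines. This sketch counts weight matrices only (biases and LayerNorm parameters are ignored), assumes a vocabulary of 37,000 tokens, and assumes the output projection shares its matrix with the token embedding, as the paper describes. The function name is illustrative.

```python
def transformer_base_weights(vocab=37000, d_model=512, d_ff=2048, n_layers=6):
    """Approximate weight counts for the base encoder-decoder Transformer."""
    attn = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff                # W1, W2
    embedding = vocab * d_model             # shared with the output projection
    encoder = n_layers * (attn + ffn)
    decoder = n_layers * (2 * attn + ffn)   # self-attention + cross-attention + FFN
    return {
        "attention_per_layer": attn,
        "ffn_per_layer": ffn,
        "embedding": embedding,
        "encoder": encoder,
        "decoder": decoder,
        "total": embedding + encoder + decoder,
    }

counts = transformer_base_weights()
for name, value in counts.items():
    print(f"{name:>20}: {value / 1e6:.2f}M")
```

The total comes to roughly 63M, in line with the ≈65M quoted for the base model once biases, LayerNorm parameters, and the exact vocabulary size are accounted for. Counting the output projection as a separate 19M matrix would overshoot that figure.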
Here's the complete encoder stack — everything from tokens to contextual representations — in under 30 lines of PyTorch:
```python
import torch, torch.nn as nn, math

class Transformer(nn.Module):
    def __init__(self, vocab=37000, d=512, N=6, h=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.pe = sinusoidal_pe(5000, d)  # from earlier
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=d, nhead=h, dim_feedforward=4*d,
                dropout=0.1, batch_first=True
            ),
            num_layers=N
        )
        self.out = nn.Linear(d, vocab)
        self.d = d

    def forward(self, x):
        # x: [batch, seq_len] token IDs
        seq_len = x.size(1)
        x = self.embed(x) * math.sqrt(self.d)  # scale embeddings
        x = x + self.pe[:seq_len].to(x.device)
        x = self.encoder(x)                    # [batch, seq_len, d]
        return self.out(x)                     # [batch, seq_len, vocab]
```
The embedding scaling by √dmodel is a detail from the paper that's easy to miss: it ensures the embedding magnitudes are comparable to the positional encoding magnitudes (which are bounded in [−1, 1]). Without this scaling, dmodel = 512 embeddings initialized near zero would be dwarfed by the positional signal.
Once the Transformer produces output probabilities, how do we choose the next token?
Greedy decoding: always pick the highest-probability token. Fast (one forward pass per step) but can produce repetitive, generic text. "The cat sat on the mat. The cat sat on the mat."
Beam search: keep the top-k partial sequences at each step (beam width k), exploring multiple paths. The winning sequence is the one with the highest total log-probability. Produces higher-quality translations than greedy but is k times more expensive. The original Transformer used beam width 4.
Sampling with temperature: sample from the softmax distribution, optionally sharpened (temperature < 1) or flattened (temperature > 1). Top-k sampling restricts to the k most likely tokens. Top-p (nucleus) sampling restricts to the smallest set of tokens whose cumulative probability exceeds p. Modern chatbots use top-p with p ≈ 0.9 and temperature ≈ 0.7 for a balance of quality and diversity.
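The decoding strategies above differ only in how they turn a probability vector into a token choice. The sketch below shows greedy selection, temperature scaling, top-k, and top-p filtering in plain Python; all function names and the example logits are illustrative assumptions.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature below 1 sharpens the distribution, above 1 flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def greedy(probs):
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits)
sharp = softmax_with_temperature(logits, temperature=0.5)
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.9)

rng = random.Random(0)  # fixed seed so the run is repeatable
print(greedy(probs), sample(top2, rng), sample(nucleus, rng))
```

With these logits, top-k with k = 2 keeps two tokens, while top-p with p = 0.9 keeps three, because the two most likely tokens together cover only about 83% of the probability mass.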
Vaswani et al. built the Transformer for machine translation. Within six years, it had conquered language modeling (GPT), bidirectional understanding (BERT), computer vision (ViT), music generation (Music Transformer), protein folding (AlphaFold 2), and robotics (RT-2). How? Because the Transformer's core operation, attention over a set of tokens, doesn't care what those tokens represent.
The key insight: attention treats its input as a set of vectors. It doesn't know or care whether those vectors represent words, image patches, audio frames, or amino acids. As long as you can convert your data into a sequence of embeddings, you can apply a Transformer.
Image Transformer (Parmar et al., 2018): Treats each pixel as a token. But full self-attention over every pixel in a 256×256 image would need 65,536 × 65,536 = 4 billion attention computations. Solution: local attention — each pixel attends only to a small neighborhood (a "local attention window"), reducing cost from O(n²) to O(n · w) where w is the window size. The model can still capture long-range dependencies by stacking many layers.
Vision Transformer (ViT, Dosovitskiy et al., 2021): Instead of pixels, split the image into 16×16 patches, flatten each patch into a vector, and treat patches as tokens. A 224×224 image becomes 196 tokens — manageable for full self-attention. Add a [CLS] token for classification, add learned positional embeddings, and apply a standard Transformer encoder. ViT matched or beat CNNs on ImageNet when trained on large datasets.
Music Transformer (Huang et al., 2018): Represents music as a sequence of MIDI events (note-on, note-off, time-shift). The key innovation: relative positional encoding instead of absolute. In music, what matters is the interval between notes, not their absolute position. The Music Transformer uses relative attention to capture patterns like "this note is a fifth above the note 4 beats ago."
AlphaFold 2 (DeepMind, 2021): Solved the 50-year-old protein folding problem using a modified Transformer. The input is an amino acid sequence (like a sentence of 20 possible "characters"). The attention mechanism learns which amino acids interact in 3D space — residues that are far apart in sequence can be close in the folded structure. The model uses a custom attention variant called "triangle attention" that respects geometric constraints.
RT-2 (Google, 2023): A robot that thinks in language. The input is an image (tokenized like ViT) concatenated with a text instruction ("pick up the blue cup"). The output is a sequence of action tokens (motor commands). The same attention mechanism that resolves coreference in text now decides which part of the visual scene to focus on while planning robot movements.
Click each panel below to see how each domain tokenizes its data for the Transformer.
| Model | Year | Domain | Key Adaptation |
|---|---|---|---|
| Transformer | 2017 | Translation | Original encoder-decoder |
| GPT | 2018 | Language | Decoder-only, autoregressive |
| BERT | 2018 | Language | Encoder-only, masked LM |
| Image Transformer | 2018 | Vision | Local attention windows |
| Music Transformer | 2018 | Music | Relative positional encoding |
| ViT | 2021 | Vision | Image patches as tokens |
| AlphaFold 2 | 2021 | Biology | Amino acid + structure tokens |
| RT-2 | 2023 | Robotics | Vision + language + action tokens |
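Let's trace a 224×224 RGB image through ViT:
1. Input: a 224×224×3 image.
2. Split it into 14 × 14 = 196 non-overlapping 16×16 patches.
3. Flatten each patch into a vector of 16 × 16 × 3 = 768 values and project it linearly to dmodel.
4. Prepend the [CLS] token and add learned positional embeddings, giving 197 tokens.
5. Run the tokens through a standard Transformer encoder.
6. Feed the final [CLS] representation to a classification head.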
The beauty: steps 5 and 6 are identical to BERT. ViT literally is BERT for images. The only difference is the tokenizer (patch embedding instead of word embedding). This universality is what makes the Transformer the most important architecture in modern AI.
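The patch "tokenizer" is little more than a reshape. The sketch below splits an image stored as nested Python lists into flattened 16×16 patches. It assumes the height and width are exact multiples of the patch size and leaves out the learned linear projection, the [CLS] token, and the positional embeddings; the function name `patchify` is illustrative.

```python
def patchify(image, patch=16):
    """image: H x W x C nested lists -> list of flattened patches in row-major order."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append the pixel's channel values
            patches.append(flat)
    return patches

# A blank 224 x 224 RGB image
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 768
```

Each of the 196 vectors of length 768 plays the same role that a word embedding lookup plays in a text model.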
One surprising finding from ViT: it needs much more data than CNNs to perform well. CNNs have an inductive bias toward local spatial patterns (convolution kernels are local). The Transformer has no such bias — it must learn locality from data. On ImageNet alone (1.2M images), ViT underperforms ResNet. But on JFT-300M (300M images), ViT crushes ResNet. The Transformer trades inductive bias for flexibility: give it enough data, and it learns better representations than any hard-coded structure.
This data-hungry property explains a lot about modern AI. Transformers need massive datasets because they start with fewer assumptions. But those fewer assumptions mean they can discover patterns that human-designed architectures miss. The bet is: data is cheaper than engineering. And so far, that bet has paid off spectacularly.
Every domain that adopted the Transformer discovered the same thing: performance improves predictably with scale. More data, more parameters, more compute — and the model gets better in a smooth, predictable curve (a power law). This finding, first documented by Kaplan et al. (2020) for language, extended to vision (Zhai et al., 2022), robotics (Brohan et al., 2023), and biology. It suggests that the Transformer architecture has few inherent bottlenecks — the main limit is how much compute you can afford.
The Transformer didn't appear in a vacuum. It built on a decade of attention research and spawned an avalanche of follow-up work that continues today.
| Property | Vanilla RNN | LSTM | Transformer |
|---|---|---|---|
| Sequential computation | O(n) | O(n) | O(1) |
| Max path length | O(n) | O(n) | O(1) |
| Handles long-range? | No (vanishing grad) | Partially (gates help) | Yes (direct attention) |
| Parallelizable (train)? | No | No | Yes |
| Memory mechanism | Hidden state ht | Cell state ct + gates | Attention weights |
| Parameter sharing | Across time steps | Across time steps | Across positions |
| Year | 1990 | 1997 | 2017 |
| Aspect | Original Transformer (2017) | Modern LLM (2024) |
|---|---|---|
| d_model | 512 | 4096-12288 |
| Layers | 6 | 32-96 |
| Heads | 8 | 32-96 |
| Parameters | 65M | 7B-400B+ |
| Context length | 512 | 8K-128K+ |
| Training data | 4.5M sentence pairs | 1-15 trillion tokens |
| Training compute | ~100 GPU-hours | ~10M GPU-hours |
| Architecture changes | — | RoPE, GQA, SwiGLU, RMSNorm |
The remarkable thing: the 2024 column is the same architecture as the 2017 column. The core mechanism — scaled dot-product attention with multi-head projections, residual connections, and layer normalization — is unchanged. The improvements are mostly engineering refinements (RoPE for better position encoding, SwiGLU for a better activation function, Grouped Query Attention for KV cache efficiency) and massive scaling.
The original Transformer has both encoder and decoder. But many later models use only half:
| Variant | Used In | Key Difference |
|---|---|---|
| Encoder-only | BERT, RoBERTa | No causal mask, no decoder. Bidirectional attention. Best for understanding tasks (classification, NER). |
| Decoder-only | GPT, LLaMA, Claude | No encoder, no cross-attention. Causal mask only. Best for generation tasks. |
| Encoder-decoder | T5, BART, Original | Full architecture. Best for sequence-to-sequence (translation, summarization). |
The decoder-only variant won the scaling wars. GPT-3, GPT-4, LLaMA, Claude, Gemini — all decoder-only. Why? Because generation is the most general capability. A model that can generate text can also classify (generate the label), translate (generate the target language), and answer questions (generate the answer). Encoder-only models can't generate; they can only encode.
The Transformer paper was written by eight Google researchers: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. Several went on to found major AI companies: Aidan Gomez co-founded Cohere, Illia Polosukhin co-founded NEAR Protocol, and Noam Shazeer co-founded Character.AI before returning to Google. The paper has over 100,000 citations — one of the most cited papers in computer science history.
The paper's title, "Attention Is All You Need," was deliberately provocative. At the time, most researchers believed that attention was a useful supplement to RNNs, not a replacement. The claim that attention alone could outperform RNN-based systems was controversial. History proved them right.
Despite seven years of dominance, fundamental questions about the Transformer remain open:
Why does it work so well? We have no theoretical proof that attention is the optimal mechanism for sequence modeling. State space models (Mamba, S4) achieve comparable results on some tasks with O(n) instead of O(n²) computation. Are there better architectures waiting to be discovered?
What are the limits of scaling? Kaplan's scaling laws suggest performance improves indefinitely with more compute. But is there a ceiling? Some researchers argue that Transformers can only interpolate within their training distribution, not truly generalize. Others argue that sufficient scale IS generalization.
Why do residual connections matter so much? Residual connections are necessary for deep Transformers, but we lack a deep theoretical understanding of why. The gradient highway explanation is intuitive but doesn't explain why some architectures train fine without residuals (like shallow networks).
Can we beat O(n²)? Linear attention variants exist but consistently underperform standard attention. FlashAttention makes O(n²) faster but doesn't change the scaling. State space models offer O(n) alternatives. The optimal complexity for sequence modeling remains an open problem.
What's remarkable is that despite these open questions, the Transformer continues to win empirically. It may not be theoretically optimal, but it's practically unbeatable. As Yann LeCun has noted, "the Transformer is not the final architecture — but it's the best one we have today." Understanding it deeply, as this lesson has aimed to do, is the foundation for understanding everything that comes next in AI.