What if every word in a sentence could look at every other word to decide what matters? That's attention — and it replaced everything.
You want to translate "we see the sky" into Italian: "vediamo il cielo." The standard approach in 2014 was a sequence-to-sequence model: an encoder RNN reads the English tokens one by one and compresses the entire sentence into a single hidden vector c, then a decoder RNN unfolds that vector into the Italian output.
The encoder updates its hidden state at each step: $h_t = f_W(x_t, h_{t-1})$. After processing all T input tokens, the final hidden state $h_T$ becomes the context vector c. The decoder generates output tokens conditioned on c: $s_t = g_U(y_{t-1}, s_{t-1}, c)$.
This works for short sentences. But think about what happens when T = 1,000. You're asking a single fixed-size vector — maybe 512 or 1,024 dimensions — to memorize the meaning, word order, and nuance of a thousand tokens. It's like compressing an entire book into a single sentence.
The context vector c has a fixed size regardless of the input length. As the source sequence grows, more and more information must be squeezed into the same number of floats. Translation quality drops sharply for long sentences — the decoder forgets early tokens because c can't hold everything.
Translate: "The cat that the dog that the boy owned chased ran away." This has nested relative clauses. The decoder needs to know that "ran away" refers to "the cat" (the very first noun), not "the boy" or "the dog." But by the time the encoder reaches "ran away," the information about "the cat" has been overwritten by all the intervening tokens. The context vector c is dominated by recent tokens.
What's the fix? Instead of compressing the entire input into one vector, let the decoder look back at the entire input sequence on every step. At each output position, the decoder decides which parts of the input are relevant right now. Translating "cielo" (sky)? Look at position 4 ("sky"). Translating "vediamo" (we see)? Look at positions 1 and 2.
Don't force all input information through a single bottleneck. Instead, give the decoder access to all encoder hidden states and let it learn to focus on the relevant ones at each step. This is attention.
Think of it in terms of information capacity. A vector of D floats (32-bit each) can store at most D × 32 bits, which is 16,384 bits for D = 512. A sentence of 1,000 tokens, each from a vocabulary of 50,000 words, carries up to 1,000 × log2(50,000) ≈ 15,600 bits of information. Even in this best case the sentence nearly fills the vector, and a trained network cannot use every bit of every float, so the usable capacity is far smaller; double the sentence length and even the raw bound is exceeded. The context vector simply doesn't have the bandwidth to losslessly encode a long sentence. Attention solves this by giving the decoder random access to the full encoder state, so the available capacity grows with the input length, at the cost of computing alignment scores at each step.
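The back-of-the-envelope comparison can be reproduced in a few lines. This is only a sketch: the function names are made up for illustration, and the token count assumes every word in the vocabulary is equally likely, which overstates the information in real text.

```python
import math

def context_vector_bits(dim: int, bits_per_float: int = 32) -> int:
    """Upper bound on the raw storage of a vector of `dim` floats."""
    return dim * bits_per_float

def sentence_bits(num_tokens: int, vocab_size: int) -> float:
    """Bits needed to name every token if all words were equally likely."""
    return num_tokens * math.log2(vocab_size)

capacity = context_vector_bits(512)
need_1k = sentence_bits(1_000, 50_000)
need_2k = sentence_bits(2_000, 50_000)

print(f"vector capacity: {capacity} bits")
print(f"1,000 tokens:    {need_1k:.0f} bits")
print(f"2,000 tokens:    {need_2k:.0f} bits")
```

The fixed vector does not grow with the input, while the sentence's information content grows linearly with its length.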
The bottleneck problem appears everywhere, not just in translation. Image captioning (compress an entire image into a vector, then generate a sentence), text summarization (compress a document, then generate a summary), speech recognition (compress an audio signal, then generate text) — all suffered from the same limitation. Attention provided a universal fix: let the decoder selectively access the encoder's full representation.
Cho et al. (2014) showed that a plain encoder-decoder model's BLEU score (translation quality) dropped sharply as sentences grew longer, and the attention-free baseline in Bahdanau et al. (2015) degraded beyond roughly 20 words. Bahdanau et al. (2015) showed that with attention, BLEU scores remained high even for sentences of 50+ words. The improvement was most dramatic on long sentences, exactly where the bottleneck hurts most. This result did much to convince the NLP community that attention was not just an incremental improvement but a fundamental architectural advance.
Bahdanau et al. (2015) proposed a simple but powerful idea: at each decoder step t, compute a fresh context vector ct as a weighted sum of all encoder hidden states. The weights are learned — the network figures out which encoder states matter for the current output.
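For each encoder hidden state $h_i$, compute a scalar alignment score that measures how relevant $h_i$ is to the current decoder state $s_{t-1}$:
$$e_{t,i} = f_{att}(s_{t-1}, h_i)$$
Normalize the alignment scores with softmax so they sum to 1. These are the attention weights:
$$a_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}$$
Compute the context vector as a weighted sum of encoder states:
$$c_t = \sum_{i=1}^{T} a_{t,i}\, h_i$$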
This context vector ct is different at every decoder step. When the decoder is generating "vediamo" (we see), the attention weights might concentrate on h1 and h2 (the encoder states for "we" and "see"). When generating "cielo" (sky), they shift to h4.
A mechanism that computes a weighted average over all encoder hidden states at each decoder step. The weights are learned alignment scores normalized by softmax. "Soft" because it uses continuous weights rather than hard selection of a single position. The entire computation is differentiable — no supervision on the attention weights is needed. Backprop learns them automatically.
Source: "we see the sky" → encoder states h1, h2, h3, h4.
Decoder step 1 (generating "vediamo" = "we see"):
Alignment scores: e1 = [2.1, 1.8, 0.3, 0.1]
After softmax: a1 = [0.49, 0.36, 0.08, 0.07]
Context: c1 = 0.49·h1 + 0.36·h2 + 0.08·h3 + 0.07·h4
The model focuses on "we" and "see" — exactly what it's translating.
Decoder step 3 (generating "cielo" = "sky"):
Alignment scores: e3 = [0.1, 0.2, 0.5, 2.8]
After softmax: a3 = [0.05, 0.06, 0.08, 0.81]
Context: c3 = 0.05·h1 + 0.06·h2 + 0.08·h3 + 0.81·h4
Now the model focuses almost entirely on "sky."
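To check the arithmetic in this example yourself, here is a minimal sketch that turns the two sets of alignment scores into attention weights and a context vector. The 2-dimensional encoder states are invented purely for illustration; real hidden states would have hundreds of dimensions and come from a trained encoder.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, states):
    """Weighted sum of encoder states (each state is a list of floats)."""
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]

# Hypothetical toy encoder states for "we", "see", "the", "sky".
h = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]

a1 = softmax([2.1, 1.8, 0.3, 0.1])  # decoder step 1 ("vediamo")
a3 = softmax([0.1, 0.2, 0.5, 2.8])  # decoder step 3 ("cielo")
c1 = context_vector(a1, h)
c3 = context_vector(a3, h)

print([round(a, 2) for a in a1])
print([round(a, 2) for a in a3])
```

Because the weights always sum to 1, each context vector is a convex combination of the encoder states.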
When researchers visualize attention weights on real translation tasks, something beautiful emerges: the attention matrix looks roughly diagonal for languages with similar word order (English-French: "The agreement" → "L'accord") but shows cross-diagonal patterns where word order differs ("European Economic Area" → "zone économique européenne" reverses the adjective order).
Similar word order: "I eat bread" → "Yo como pan" (Spanish). Attention for "Yo" peaks on "I". Attention for "como" peaks on "eat". Nearly diagonal.
Different word order: "I don't know" → "Je ne sais pas" (French). "sais" (know) is position 3 in French and "know" is position 3 in English, so that part is still diagonal. But "pas" (negation), at position 4, attends back to "don't" at position 2. The model learns the reordering.
Longer-range: "The man who I met yesterday left" → in German the verb goes to the end. The attention weight from the final German verb reaches all the way back to "left" in the English input. Without attention, this information would need to survive through the entire decoder hidden state chain.
The bottleneck is gone. Instead of squeezing T hidden states into one vector, the decoder gets access to all T states at every step. Attention weights act as a soft pointer — "which input positions should I focus on right now?" For languages with different word orders (English-French: "European Economic Area" → "zone économique européenne"), attention learns the reordering automatically. And the whole thing is end-to-end differentiable — no supervision on the attention weights is required. Backprop learns them from the translation loss alone.
Every operation in the attention mechanism is smooth: the alignment function $f_{att}$ is a neural network (differentiable), softmax is differentiable, and weighted summation is linear (trivially differentiable). The gradient of the loss flows backward through $c_t = \sum_i a_{t,i}\, h_i$, through the softmax, through $f_{att}$, and into both the encoder (updating $h_i$) and decoder (updating $s_{t-1}$). This means the model can learn where to look entirely from the translation objective, with no hand-labeled alignments needed.
Bahdanau attention is a decoder-to-encoder mechanism: the decoder queries attend to encoder states. But there's a more general operator hiding inside. What if a sequence attended to itself?
In self-attention, every token in a sequence looks at every other token in the same sequence. The input is a set of vectors X = {x1, x2, ..., xN}. Each vector produces one output that's a weighted combination of information from all inputs.
An operation where each element in a sequence computes its output by attending to all other elements in the same sequence. Unlike cross-attention (decoder → encoder), self-attention has a single set of inputs that serve as both the source and target. Each input generates a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what information do I provide?").
Think of it like a meeting. Each person (token) has a question (query), a nametag (key), and a message (value). To form your output, you compare your question against everyone's nametag, figure out who's relevant, and blend their messages proportionally.
You're at a conference. You're the query: "I need to know about gradients." You look at everyone's nametag (key): "Optimization researcher," "Image processing expert," "Computer vision professor," "Gradient flow specialist." The "Gradient flow specialist" and "Optimization researcher" nametags match your question well — high attention weights. You then listen to each person's value (their actual expertise) and blend it proportionally. Your output is mostly "gradient flow" knowledge with some "optimization" knowledge mixed in. The person studying image processing contributed almost nothing to your output — they had a low attention weight despite being at the same conference.
Why do we need three separate projections? Can't we just use X directly for everything? The answer is role separation. What a token is looking for (query), what a token advertises (key), and what a token provides (value) are three different things. The word "bank" might have a key that says "I'm a noun related to money or rivers" (for other tokens to find it), a query that says "I need context to disambiguate" (to look at surrounding tokens), and a value that says "here's my semantic content" (what it contributes to other tokens' representations). Three separate linear projections let the model learn these three roles independently.
In cross-attention, queries come from the decoder and keys/values come from the encoder — two separate sequences. In self-attention, queries, keys, and values all come from the same sequence. The shapes simplify: if the input is X [N × Din], then Q, K, V are all [N × Dout] and the attention matrix E is [N × N].
Three tokens: x1="The", x2="cat", x3="sat". Each produces Q, K, V vectors.
Token "sat" (query q3): It computes similarity with every key: e3,1 = q3·k1, e3,2 = q3·k2, e3,3 = q3·k3. Suppose after softmax: a3 = [0.05, 0.70, 0.25]. "sat" mostly attends to "cat" (the subject) and somewhat to itself. Output: y3 = 0.05·v1 + 0.70·v2 + 0.25·v3.
The output y3 for "sat" now contains information about its subject "cat" — it knows what sat.
A remarkable property: self-attention is permutation equivariant. If you shuffle the input tokens, the outputs get shuffled in the same way. Formally: F(σ(X)) = σ(F(X)). This means self-attention operates on sets, not sequences — it has no notion of position or order.
Let σ be a permutation. Apply σ to the input: X' = σ(X). Then Q' = X' · WQ = σ(X) · WQ = σ(Q). Similarly K' = σ(K), V' = σ(V). The similarity matrix satisfies $E'_{ij} = Q'_i \cdot K'_j = Q_{\sigma(i)} \cdot K_{\sigma(j)} = E_{\sigma(i),\sigma(j)}$. So E' is E with rows and columns permuted by σ. After softmax (applied per row), A' is the same permutation of A. Finally Y' = A'V' = σ(AV) = σ(Y). The output is just the permuted version of the original output. QED.
"The cat sat on the mat" and "mat the on sat cat the" produce the same attention weights (just permuted). Self-attention alone can't tell word order. This is why we'll need positional encoding (Chapter 7) — we must explicitly inject position information.
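The equivariance claim is easy to test numerically. The sketch below uses NumPy with random toy weights (the sizes, seed, and function names are arbitrary choices for illustration) and checks that shuffling the input rows shuffles the output rows in exactly the same way. It uses unscaled dot products, as in the proof.

```python
import numpy as np

def softmax_rows(E):
    """Row-wise softmax with the row maximum subtracted for stability."""
    E = E - E.max(axis=-1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Unscaled dot-product self-attention on X of shape [N, D]."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax_rows(Q @ K.T)
    return A @ V

rng = np.random.default_rng(0)
N, D = 5, 4
X = rng.normal(size=(N, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

perm = rng.permutation(N)                      # the permutation sigma
Y = self_attention(X, W_Q, W_K, W_V)           # F(X)
Y_perm = self_attention(X[perm], W_Q, W_K, W_V)  # F(sigma(X))

equivariant = np.allclose(Y_perm, Y[perm])     # compare with sigma(F(X))
print(equivariant)
```

Nothing in the computation refers to a row index, which is the whole reason the check passes.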
Every output directly depends on all inputs. In an RNN, information from token 1 must travel through every intermediate hidden state to reach token 100 — a chain of 99 steps where it can degrade. In self-attention, token 100 looks directly at token 1 in a single step. This is why transformers handle long-range dependencies so much better than RNNs.
Bahdanau used a learned neural network fatt to compute alignment scores. The modern approach is simpler: use the dot product between query and key vectors, with one crucial scaling factor.
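Given input vectors X [N × Din], we first project them into queries, keys, and values using learned weight matrices:
$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$
Then compute the attention output in three steps:
$$E = \frac{Q K^\top}{\sqrt{D_k}}, \qquad A = \mathrm{softmax}(E)\ \text{(row-wise)}, \qquad Y = A V$$
Or more compactly:
$$Y = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{D_k}}\right) V$$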
This is the part most presentations skip. Consider two random vectors q and k of dimension Dk, where each component is drawn from a distribution with mean 0 and variance 1. Their dot product q · k = ∑i qi · ki has mean 0 and variance Dk.
Each term qi · ki has E[qi ki] = E[qi] E[ki] = 0 (by independence) and Var(qi ki) = E[qi²] E[ki²] = 1 × 1 = 1. Since the dot product is a sum of Dk independent terms, its variance is Dk × 1 = Dk. Standard deviation is √Dk.
For Dk = 64, dot products can easily reach magnitudes of ±8. Softmax applied to values this large produces outputs extremely close to 0 or 1 — a near-one-hot distribution. The gradients of softmax in this regime are vanishingly small, killing learning.
Dividing by √Dk normalizes the variance back to 1, keeping the softmax inputs in a regime where gradients flow well.
Dk = 64. Two random vectors with magnitude √64 = 8. Dot product could be ~50.
Without scaling: softmax([50, 2, 1, -3]) ≈ [1.0, 0.0, 0.0, 0.0]. Near-deterministic. Gradient of softmax is ≈ 0 — learning stops.
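With scaling: divide by 8: softmax([6.25, 0.25, 0.125, -0.375]) ≈ [0.994, 0.002, 0.002, 0.001]. Still peaked, but no longer saturated: the losing entries now have probabilities around 10⁻³ rather than around 10⁻²¹, so gradients can still flow. For typical scaled scores with standard deviation near 1, which is the case the scaling is designed for, the distribution is smoother still.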
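Both halves of the argument can be checked empirically. The following sketch, using only the standard library, estimates the standard deviation of dot products between random unit-variance vectors and then applies softmax to the example scores with and without scaling. The trial count and seed are arbitrary.

```python
import math
import random

def dot_product_std(dim, trials=2000, seed=0):
    """Empirical std of q . k for q, k with i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    var = sum((d - mean) ** 2 for d in dots) / trials
    return math.sqrt(var)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_std = dot_product_std(64)            # expected to be close to sqrt(64) = 8
scaled_std = raw_std / math.sqrt(64)     # expected to be close to 1

scores = [50.0, 2.0, 1.0, -3.0]
unscaled = softmax(scores)
scaled = softmax([s / math.sqrt(64) for s in scores])

print(f"std of raw dot products:    {raw_std:.2f}")
print(f"std after dividing by 8:    {scaled_std:.2f}")
print(f"runner-up prob, unscaled:   {unscaled[1]:.1e}")
print(f"runner-up prob, scaled:     {scaled[1]:.1e}")
```

The runner-up probability is a rough proxy for how much gradient signal reaches the non-winning positions.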
The attention matrix E has shape [N × N], where N is the sequence length. Computing it requires N² dot products, each of dimension Dk.
QKV projection: O(N · D²). Attention matrix: O(N² · D). Weighted sum: O(N² · D). Storing the N × N attention matrix requires O(N²) memory. For N = 4,096 tokens with D = 512, the attention matrix alone has 4096² ≈ 16.8 million entries. This quadratic cost is the main limitation of transformers: doubling the sequence length quadruples the attention compute and memory.
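To see how the terms scale, here is a rough cost model for one single-head self-attention layer. It counts multiply-accumulates only and ignores constants, softmax, and memory traffic, so treat it as an illustration of the growth rates rather than a performance estimate. The function name and dictionary keys are made up.

```python
def attention_cost(n_tokens: int, d_model: int) -> dict:
    """Approximate multiply-accumulate counts for single-head self-attention."""
    return {
        "qkv_projection": 3 * n_tokens * d_model * d_model,  # O(N * D^2)
        "similarity": n_tokens * n_tokens * d_model,         # O(N^2 * D)
        "weighted_sum": n_tokens * n_tokens * d_model,       # O(N^2 * D)
        "attention_entries": n_tokens * n_tokens,            # O(N^2) memory
    }

base = attention_cost(4096, 512)
doubled = attention_cost(8192, 512)

for name in base:
    print(f"{name:18s} x{doubled[name] / base[name]:.0f} when N doubles")
```

Only the projection term grows linearly in N; everything that touches the attention matrix grows fourfold.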
Self-attention looks intimidating in equations but is remarkably simple computationally:
Self-attention is four matrix multiplications plus one softmax. That's it. GPUs are extremely good at matrix multiplication. This is why transformers are so parallelizable — unlike RNNs, which require sequential computation through time steps, all of these matmuls can run simultaneously.
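The four matrix multiplications and the softmax can be sketched in NumPy as follows. This is a minimal single-head version under a few assumptions: the three projections are fused into one hypothetical matrix W_QKV, Dk equals the model dimension D, biases are omitted, and the weights are random rather than trained.

```python
import numpy as np

def softmax_rows(E):
    """Row-wise softmax with the row maximum subtracted for stability."""
    E = E - E.max(axis=-1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=-1, keepdims=True)

def self_attention(X, W_QKV, W_O):
    """Single-head scaled dot-product self-attention on X of shape [N, D]."""
    D = X.shape[-1]
    Q, K, V = np.split(X @ W_QKV, 3, axis=-1)  # matmul 1: fused QKV projection
    E = Q @ K.T / np.sqrt(D)                   # matmul 2: scaled similarities [N, N]
    A = softmax_rows(E)                        # the one softmax
    Y = A @ V                                  # matmul 3: weighted sum of values
    return Y @ W_O, A                          # matmul 4: output projection

rng = np.random.default_rng(0)
N, D = 3, 4                                    # toy sizes
X = rng.normal(size=(N, D))
W_QKV = rng.normal(size=(D, 3 * D))
W_O = rng.normal(size=(D, D))

out, A = self_attention(X, W_QKV, W_O)
print(out.shape, A.shape)
```

There is no loop over tokens anywhere: every position is handled by the same handful of matrix products.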
It's worth pausing to distinguish these two forms precisely:
| Property | Cross-Attention | Self-Attention |
|---|---|---|
| Queries come from | Decoder sequence [NQ × DQ] | Same input X [N × D] |
| Keys/Values come from | Encoder sequence [NX × DX] | Same input X [N × D] |
| Attention matrix shape | [NQ × NX] (rectangular) | [N × N] (square) |
| Purpose | Decoder reads encoder states | Tokens mix within one sequence |
| Used in | Encoder-decoder Transformers (T5, original) | All Transformers (GPT, BERT, ViT) |
N = 3 tokens, D = 4. Input X and weight matrices WQ, WK, WV all [4 × 4].
Step 1 (QKV): Q = X · WQ, K = X · WK, V = X · WV. Three matrix multiplies, each [3 × 4] × [4 × 4] = [3 × 4].
Step 2 (Similarity): E = Q · Kᵀ / √4 = Q · Kᵀ / 2. Shape: [3 × 4] × [4 × 3] = [3 × 3]. Each entry Eij is the scaled dot product between query i and key j.
Step 3 (Softmax): Apply softmax row-wise. Each row sums to 1. Row i gives the attention distribution for query i over all keys.
Step 4 (Output): Y = A · V. Shape: [3 × 3] × [3 × 4] = [3 × 4]. Each output yi is a weighted combination of all value vectors.
Total learnable parameters: 3 × D² = 3 × 16 = 48 (for WQ, WK, WV). Adding an output projection WO contributes another D² = 16, for 64 in total.
A single attention head computes one set of attention weights — one "pattern" of which tokens attend to which. But language has many simultaneous relationships: syntactic (subject-verb), semantic (synonyms), positional (nearby words). A single head must blend all these into one attention matrix. Can we do better?
Run H independent self-attention operations ("heads") in parallel, each with its own learned WQ, WK, WV matrices. Concatenate their outputs and project through a final linear layer WO. Each head can learn a different attention pattern.
The key trick: each head operates in a lower-dimensional subspace. If the model dimension is D = 512 and we use H = 8 heads, each head has dimension DH = D / H = 64. The total parameter count is the same as a single head with dimension D.
Sentence: "The cat that I fed yesterday slept on the mat."
Head 1 (syntactic): "slept" attends strongly to "cat" (its subject), ignoring the relative clause.
Head 2 (local): Each word attends to its immediate neighbors — a kind of learned smoothing.
Head 3 (semantic): "cat" attends to "mat" and "slept" (related concepts in the scene).
Head 4 (positional): Attends to a fixed relative offset — always looking 2 tokens back.
The output projection WO blends all four perspectives into a single representation.
A single head, whatever its dimension, produces only one attention pattern per layer. Eight heads of Dk = 64 use the same number of parameters as one head of Dk = 512 but can learn 8 different attention patterns simultaneously, at the price of scoring each pattern in a smaller subspace. The output projection WO then learns how to combine them.
Think of each head as a specialist that projects tokens into a different 64-dimensional subspace and finds patterns there. Head 1 might project tokens so that subjects and verbs are close together. Head 2 might project so that co-referent nouns cluster. The final WO matrix merges these specialized views.
| Component | Shape | Parameters |
|---|---|---|
| WQ (all heads) | D × (H · DH) = D × D | D² |
| WK (all heads) | D × D | D² |
| WV (all heads) | D × D | D² |
| WO (output) | D × D | D² |
| Total | — | 4D² |
For D = 512: 4 × 512² = 1,048,576 parameters per multi-head attention layer (ignoring biases). The number of heads H doesn't affect the total; it just determines how the computation is divided.
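The parameter count is simple enough to write down as a function. This sketch ignores bias terms and uses a made-up function name; it exists only to show that the head count cancels out.

```python
def mha_params(d_model: int, n_heads: int) -> int:
    """Weight count of multi-head attention, ignoring biases."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    per_projection = d_model * (n_heads * d_head)   # W_Q, W_K or W_V, all heads
    output_projection = d_model * d_model           # W_O
    return 3 * per_projection + output_projection

for heads in (1, 8, 16):
    print(heads, mha_params(512, heads))
```

Splitting D across more heads shrinks each head by exactly the factor by which the number of heads grows.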
In practice, all H heads are computed in parallel using a single batched matrix multiply. The fused QKV projection produces a tensor of shape [N × 3HDH], which is reshaped to [H × N × 3DH] and split into Q, K, V. The attention computation runs independently per head using batched matmul (the H dimension acts as a batch dimension). This is extremely efficient on GPUs.
D = 512, H = 8, DH = 64. Input X: [N × 512].
Fused QKV: X · WQKV = [N × 512] × [512 × 1536] = [N × 1536]. Split into Q, K, V each [N × 512], then reshape to [8 × N × 64].
Per-head attention: E = Q · Kᵀ / √64 = [8 × N × 64] × [8 × 64 × N] = [8 × N × N]. One batched matmul, 8 heads at once.
Per-head output: A · V = [8 × N × N] × [8 × N × 64] = [8 × N × 64]. Reshape to [N × 512].
Output projection: [N × 512] × [512 × 512] = [N × 512]. Done. Same input and output shape.
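The shape bookkeeping above can be traced in NumPy. The sketch below assumes a fused W_QKV matrix, random weights, no biases, and N = 10 tokens; the reshape-then-transpose layout is one common way to expose the head dimension, not the only one.

```python
import numpy as np

def multi_head_attention(X, W_QKV, W_O, n_heads):
    """Multi-head self-attention on X of shape [N, D] via batched matmul."""
    N, D = X.shape
    d_head = D // n_heads
    Q, K, V = np.split(X @ W_QKV, 3, axis=-1)           # each [N, D]

    def to_heads(M):                                    # [N, D] -> [H, N, d_head]
        return M.reshape(N, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = to_heads(Q), to_heads(K), to_heads(V)
    E = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # [H, N, N]
    E = E - E.max(axis=-1, keepdims=True)               # stabilise the softmax
    A = np.exp(E)
    A = A / A.sum(axis=-1, keepdims=True)
    Y = (A @ V).transpose(1, 0, 2).reshape(N, D)        # [H, N, d_head] -> [N, D]
    return Y @ W_O, A

rng = np.random.default_rng(0)
N, D, H = 10, 512, 8
X = rng.normal(size=(N, D))
W_QKV = rng.normal(size=(D, 3 * D)) * 0.02
W_O = rng.normal(size=(D, D)) * 0.02

out, A = multi_head_attention(X, W_QKV, W_O, H)
print(out.shape, A.shape)
```

The head axis is treated as a batch axis by `@`, so all 8 attention matrices come out of a single call.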
Self-attention alone is not enough. It lets tokens communicate, but it doesn't add nonlinearity per token. The Transformer block wraps multi-head self-attention with three additional components: residual connections, layer normalization, and a feedforward network.
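Layer normalization normalizes each vector independently across its feature dimensions. Given a vector h of dimension D:
$$\mu = \frac{1}{D}\sum_{i=1}^{D} h_i, \qquad \sigma^2 = \frac{1}{D}\sum_{i=1}^{D} (h_i - \mu)^2, \qquad \mathrm{LayerNorm}(h) = \gamma \odot \frac{h - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
Here γ and β are learned per-feature scale and shift vectors, and ε is a small constant for numerical stability.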
This stabilizes training by preventing activations from drifting to extreme values. Each token's representation is normalized independently — no dependence on batch statistics like batch normalization.
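A plain-Python sketch of layer normalization for a single token vector makes the "no batch statistics" point concrete. The default scale of 1 and shift of 0 stand in for the learned parameters, and the value of eps is an assumption.

```python
import math

def layer_norm(h, gamma=None, beta=None, eps=1e-5):
    """Normalize one vector across its features, then scale and shift."""
    D = len(h)
    gamma = gamma if gamma is not None else [1.0] * D
    beta = beta if beta is not None else [0.0] * D
    mean = sum(h) / D
    var = sum((x - mean) ** 2 for x in h) / D
    return [g * (x - mean) / math.sqrt(var + eps) + b
            for x, g, b in zip(h, gamma, beta)]

y = layer_norm([2.0, 4.0, 6.0, 8.0])
y_scaled = layer_norm([20.0, 40.0, 60.0, 80.0])   # same vector, 10x larger

print([round(v, 3) for v in y])
print([round(v, 3) for v in y_scaled])
```

Scaling the input by 10 leaves the output essentially unchanged, which is how the operation keeps activations from drifting.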
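The MLP applies the same two-layer network to each token independently:
$$\mathrm{MLP}(x) = W_2\, \mathrm{ReLU}(W_1 x)$$
Here W1 expands x from D to the hidden width and W2 projects back to D (biases omitted).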
The standard expansion factor is 4×. If D = 512, the hidden layer has 2,048 dimensions. This gives each token a chance to do "private computation" — self-attention is the communication channel between tokens, but the MLP is where individual token representations get transformed.
Self-attention is the communication step: tokens exchange information. The MLP is the computation step: each token processes its updated representation privately. The transformer alternates between these two operations, stacking identical blocks.
After self-attention, token "sat" has gathered information from "cat" and "on." Its representation is now a weighted blend of value vectors. But blending alone is linear — it can only produce linear combinations. The MLP adds nonlinearity (via ReLU or SwiGLU), letting the model compute new features: "this is a past-tense verb whose subject is an animal." The 4× expansion (D → 4D → D) gives the MLP a large hidden space to compute these features before projecting back to model dimension.
An analogy: self-attention is like collecting notes from colleagues at a meeting. The MLP is going back to your desk and thinking about what those notes mean. Both steps are essential.
A post-norm block, as in the original Transformer, computes four steps: (1) Z = SelfAttention(X); (2) H = LayerNorm(X + Z); (3) M = MLP(H); (4) Y = LayerNorm(H + M). The X + Z in step 2 is a residual connection (a.k.a. skip connection). It adds the input directly to the output. This serves two purposes: (1) gradients flow directly through the addition, avoiding vanishing gradients in deep stacks, and (2) the model only needs to learn the change to the representation, not the entire representation from scratch.
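The original Transformer (Vaswani 2017) places LayerNorm after the residual add. Modern practice moves it before the sublayer, inside the residual path:
$$H = X + \mathrm{SelfAttention}(\mathrm{LayerNorm}(X)), \qquad Y = H + \mathrm{MLP}(\mathrm{LayerNorm}(H))$$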
Post-norm places the normalization outside the residual, which means the model can't easily learn the identity function (a problem at initialization). Pre-norm normalizes inside the residual branch, so the main path is a clean addition. This makes training more stable, especially for very deep models (50+ layers). Nearly all modern transformers (GPT-2+, LLaMA, etc.) use pre-norm.
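The identity-function argument can be made concrete with a small experiment. In this sketch the attention and MLP sublayers are passed in as plain callables, and layer normalization has no learned scale or shift; both are simplifications. Feeding in sublayers that output zeros (an idealized stand-in for a block that has learned nothing yet) shows which variant leaves its input untouched.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Per-row layer norm without learned scale/shift (a simplification)."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def post_norm_block(X, attn, mlp):
    H = layer_norm(X + attn(X))          # norm sits on the main path
    return layer_norm(H + mlp(H))

def pre_norm_block(X, attn, mlp):
    H = X + attn(layer_norm(X))          # norm sits inside the residual branch
    return H + mlp(layer_norm(H))

def zero_sublayer(X):
    """Hypothetical sublayer that contributes nothing."""
    return np.zeros_like(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8)) * 5.0 + 3.0

pre = pre_norm_block(X, zero_sublayer, zero_sublayer)
post = post_norm_block(X, zero_sublayer, zero_sublayer)

print("pre-norm is identity: ", np.array_equal(pre, X))
print("post-norm is identity:", np.allclose(post, X))
```

With pre-norm the input passes straight through the additions; with post-norm it is rescaled even when both sublayers are silent.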
A Transformer is simply a stack of identical blocks. The original paper used 6 encoder blocks and 6 decoder blocks, which is the 12 shown for its large configuration in the table below. The architecture has barely changed since 2017, but models have gotten much bigger:
| Model | Blocks | D | Heads | Params |
|---|---|---|---|---|
| Transformer (2017) | 12 | 1,024 | 16 | 213M |
| GPT-2 (2019) | 48 | 1,600 | 25 | 1.5B |
| GPT-3 (2020) | 96 | 12,288 | 96 | 175B |
A transformer block is 6 matrix multiplies: 4 from multi-head attention (QKV projection, QK similarity, AV product, output projection) and 2 from the MLP (up-project and down-project). Every other operation (softmax, LayerNorm, residual adds) is cheap by comparison. Transformers are matrix-multiply machines.
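The original Transformer uses ReLU in the MLP: Y = W2 · ReLU(W1 · x). Modern transformers (LLaMA, PaLM) replace this with SwiGLU, a gated variant:
$$Y = \big(\mathrm{Swish}(X W_1) \odot (X W_2)\big)\, W_3$$
Here W1 and W2 both project up to the hidden width, and W3 (a name used here for the down-projection) maps back to D.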
The ⊙ is element-wise multiplication — a gating mechanism. One branch (Swish(X · W1)) decides how much information to let through, while the other branch (X · W2) provides the information. Shazeer (2020) showed this outperforms plain ReLU MLPs across the board, and offered the wonderfully honest explanation: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."
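A minimal NumPy sketch of the gated MLP follows. The names W1 and W2 follow the text; W3 for the down-projection, the toy sizes, and the random weights are assumptions for illustration, and biases are omitted.

```python
import numpy as np

def swish(x):
    """Swish (SiLU): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(X, W1, W2, W3):
    """Gated MLP: the Swish branch gates the linear branch element-wise."""
    gate = swish(X @ W1)       # decides how much to let through
    value = X @ W2             # carries the information
    return (gate * value) @ W3 # project back to model dimension

rng = np.random.default_rng(0)
N, D, hidden = 3, 8, 16
X = rng.normal(size=(N, D))
W1 = rng.normal(size=(D, hidden)) * 0.1
W2 = rng.normal(size=(D, hidden)) * 0.1
W3 = rng.normal(size=(hidden, D)) * 0.1

Y = swiglu_mlp(X, W1, W2, W3)
Y_zero = swiglu_mlp(np.zeros((N, D)), W1, W2, W3)
print(Y.shape)
```

Note that the gated form needs three weight matrices instead of two, which is why implementations often shrink the hidden width to keep the parameter count comparable.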
Some of the biggest modern models are reported or widely believed to use Mixture of Experts (MoE), although vendors rarely publish architectural details. Instead of one MLP per block, learn E separate MLP "experts." Each token is routed to only A < E of them. This multiplies the MLP parameters by E but only multiplies the MLP compute by A. A model with 1T total parameters might only activate 100B per token.
Learn E separate sets of MLP weights per transformer block. A learned router network selects the top A experts for each token. The token's MLP output is a weighted combination of only the A active experts' outputs. This decouples model capacity (total parameters) from inference cost (active parameters per token).
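Top-A routing for a single token can be sketched as below. Several details are assumptions made for illustration: each expert is a one-layer ReLU map rather than a full MLP, the router is a single linear layer, and the gate weights are a softmax over only the selected experts' logits, which is one common choice among several.

```python
import numpy as np

def moe_forward(x, router_W, experts, top_a):
    """Route one token x of shape [D] to its top_a experts and blend them."""
    logits = x @ router_W                        # one score per expert, shape [E]
    top = np.argsort(logits)[-top_a:]            # indices of the top_a experts
    gate = np.exp(logits[top] - logits[top].max())
    gate = gate / gate.sum()                     # softmax over the chosen experts
    out = sum(g * experts[i](x) for g, i in zip(gate, top))
    return out, top

rng = np.random.default_rng(0)
D, E, A = 8, 4, 2
calls = []                                       # records which experts ran

def make_expert(idx):
    W = rng.normal(size=(D, D))
    def expert(x):
        calls.append(idx)
        return np.maximum(x @ W, 0.0)            # toy stand-in for an MLP
    return expert

experts = [make_expert(e) for e in range(E)]
router_W = rng.normal(size=(D, E))
x = rng.normal(size=D)

out, top = moe_forward(x, router_W, experts, top_a=A)
print("experts evaluated:", sorted(calls), "of", E)
```

Only the selected experts are ever called, so the per-token compute tracks A while the parameter count tracks E.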
Self-attention is permutation equivariant — it treats its input as a set, not a sequence. "The cat sat" and "sat cat the" produce the same attention patterns (just permuted). But language is ordered: "dog bites man" means something very different from "man bites dog." We need to inject position information.
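Vaswani et al. (2017) proposed adding a fixed vector to each input token that encodes its position. For position pos and dimension-pair index i (so dimensions 2i and 2i + 1 share one frequency):
$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/D}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/D}}\right)$$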
Each dimension of the positional encoding oscillates at a different frequency. Low dimensions change rapidly (high frequency), high dimensions change slowly (low frequency). This creates a unique "fingerprint" for each position.
Three desirable properties: (1) Each position gets a unique encoding. (2) The encoding has bounded magnitude regardless of sequence length — sin and cos are always between -1 and 1. (3) The model can learn to attend to relative positions: PE(pos + k) can be expressed as a linear function of PE(pos) for any fixed offset k, because sin(a + b) = sin(a)cos(b) + cos(a)sin(b). This means "look 3 positions back" is a linear transformation the model can learn.
D = 4. We compute PE(5) using two frequencies:
Dim 0: sin(5 / 10000^(0/4)) = sin(5 / 1) = sin(5) ≈ −0.96
Dim 1: cos(5 / 10000^(0/4)) = cos(5) ≈ 0.28
Dim 2: sin(5 / 10000^(2/4)) = sin(5 / 100) = sin(0.05) ≈ 0.05
Dim 3: cos(5 / 10000^(2/4)) = cos(0.05) ≈ 1.00
PE(5) ≈ [−0.96, 0.28, 0.05, 1.00]. This is added to x5 before entering the transformer.
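The worked example and the relative-position property can both be checked with the standard library. In this sketch `positional_encoding` follows the sinusoidal scheme for an even D, and `shift` is a hypothetical helper that applies the angle-addition identity to move an encoding by a fixed offset k without knowing pos.

```python
import math

def positional_encoding(pos, d_model, base=10000.0):
    """Sinusoidal encoding: one (sin, cos) pair per frequency; d_model even."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / base ** (i / d_model)
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

def shift(pe, k, base=10000.0):
    """Map PE(pos) to PE(pos + k) with a rotation that depends only on k."""
    d_model = len(pe)
    out = []
    for i in range(0, d_model, 2):
        w = k / base ** (i / d_model)
        s, c = pe[i], pe[i + 1]
        out.append(s * math.cos(w) + c * math.sin(w))   # sin(a + w)
        out.append(c * math.cos(w) - s * math.sin(w))   # cos(a + w)
    return out

pe5 = positional_encoding(5, 4)
pe8 = positional_encoding(8, 4)
pe5_shifted = shift(pe5, 3)

print([round(v, 2) for v in pe5])
print(all(abs(a - b) < 1e-9 for a, b in zip(pe5_shifted, pe8)))
```

Since `shift` is linear in its input for a fixed k, "look k positions away" is something a learned linear map can express.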
An alternative: learn a lookup table of D-dimensional vectors, one per position. The model has a learnable matrix PE [Nmax × D] and adds PE[pos] to xpos. This is what GPT-2 and BERT use.
| Method | Pros | Cons |
|---|---|---|
| Sinusoidal (fixed) | No extra parameters; can extrapolate to unseen lengths | Less flexible; theoretical extrapolation doesn't always work in practice |
| Learned | Maximum flexibility; can capture position-specific patterns | Extra parameters; hard limit on sequence length |
The positional encoding is added to the token embedding, not concatenated. This means the model dimension stays the same. The input to the transformer is xpos + PE(pos), where both vectors have dimension D. The model must learn to disentangle position information from content information — and empirically, it does.
A limitation of both sinusoidal and learned positional encodings: they encode absolute position. Token 5 always gets the same encoding regardless of context. Modern models like LLaMA use Rotary Position Embeddings (RoPE), which encode relative position by rotating query and key vectors. The dot product Qi · Kj then naturally depends on the offset (i − j) rather than the absolute positions i and j. This allows better generalization to sequence lengths not seen during training.
Consider the phrase "the big red ball" appearing at positions [5,6,7,8] in one sentence and [100,101,102,103] in another. With absolute encoding, the attention patterns between "big" and "ball" would differ because their absolute positions differ. With relative encoding, the offset (2 positions apart) is the same in both cases, so the model can learn "attend to the noun 2 positions ahead" as a general rule. This is especially important for models that must handle varying context lengths.
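The relative-position behaviour of rotary embeddings can be seen with a single 2-D pair. Real RoPE applies many such rotations at different frequencies across the query and key dimensions; using one pair with a frequency of 1 radian per position, and the particular q and k values, is a deliberate simplification.

```python
import math

def rotate(vec, pos, theta=1.0):
    """Rotate a 2-D vector by the angle pos * theta."""
    x, y = vec
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def rope_score(q, k, i, j):
    """Dot product of q rotated to position i and k rotated to position j."""
    qi, kj = rotate(q, i), rotate(k, j)
    return qi[0] * kj[0] + qi[1] * kj[1]

q, k = (1.0, 0.5), (0.3, -0.8)

near = rope_score(q, k, 5, 7)        # "big" at 6... offset of 2, early in the text
far = rope_score(q, k, 100, 102)     # same offset of 2, much later
other = rope_score(q, k, 5, 8)       # offset of 3

print(round(near, 6), round(far, 6), round(other, 6))
```

The first two scores agree because rotating both vectors by a common angle leaves their dot product unchanged; only the offset between positions matters.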
Transformers were designed for sequences of words. Images aren't sequences — they're 2D grids of pixels. But Dosovitskiy et al. (2021) showed that you can treat an image as a sequence of patches and apply a standard transformer. The result: Vision Transformer (ViT), which matches or beats CNNs on image classification when given enough data.
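Take a 224 × 224 × 3 image. Divide it into a grid of non-overlapping patches, each 16 × 16 × 3 = 768 values. This gives (224/16)² = 14² = 196 patches. Flatten each patch and apply a linear projection from 768 to D dimensions:
$$z_p = x_p W_E, \qquad x_p \in \mathbb{R}^{768}, \quad W_E \in \mathbb{R}^{768 \times D}$$
Here x_p is the flattened patch p and W_E is the learned projection matrix.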
Each image patch is treated as a "word." The 16×16 patch is the image equivalent of a word embedding. The transformer processes 196 "tokens" exactly as it would process 196 words. No convolutions, no pooling, no spatial inductive bias — just a standard transformer on a set of patch vectors.
Add learned positional embeddings to each patch vector (a learnable [196 × D] matrix). Since we flattened the 2D grid, position embeddings must encode both row and column — the transformer has no built-in 2D structure.
For classification, ViT prepends a special [CLS] token (a learned D-dimensional vector) to the patch sequence. After the transformer, the [CLS] token's output vector is fed to a linear classifier. Alternatively, you can average-pool all patch outputs and classify from that — both work.
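The patch pipeline described above can be sketched in NumPy. Everything numeric here is a stand-in: the image is random, W_E and the position table are random rather than learned, D = 64 is chosen only to keep the example small, and, following the text, position embeddings are added to the 196 patches only.

```python
import numpy as np

def patchify(image, patch):
    """Split an [H, W, C] image into flattened non-overlapping patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # [gh, gw, patch, patch, C]
    return x.reshape(gh * gw, patch * patch * C)   # [num_patches, patch*patch*C]

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
D = 64

patches = patchify(image, 16)                      # [196, 768]
W_E = rng.normal(size=(768, D)) * 0.02             # stand-in for the learned projection
pos = rng.normal(size=(196, D)) * 0.02             # stand-in for learned positions
cls = np.zeros((1, D))                             # stand-in for the learned [CLS] vector

tokens = np.concatenate([cls, patches @ W_E + pos], axis=0)
print(patches.shape, tokens.shape)
```

From this point on, `tokens` is just a sequence of vectors, and a standard transformer can process it unchanged.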
| Property | CNN (ResNet) | ViT |
|---|---|---|
| Inductive bias | Strong: locality, translation equivariance | Weak: only patch structure |
| Receptive field | Grows with depth (local → global) | Global from layer 1 (every patch sees every patch) |
| Data efficiency | Better with small datasets (inductive bias helps) | Needs large datasets (300M+ images for best results) |
| Scalability | Saturates as model grows | Keeps improving with more data + params |
| Compute pattern | Convolution (local matmuls) | Self-attention (global matmuls) |
ViT trained on ImageNet alone (1.3M images) underperforms ResNets. The lack of inductive bias means it must learn locality and translation invariance from data. Pre-training on JFT-300M (300 million images) or ImageNet-21k (14 million images) makes ViT competitive or superior. The lesson: transformers trade inductive bias for data.
ViT-B/16: Base model, 16×16 patches, D = 768, 12 blocks, 12 heads, 86M params.
ViT-L/16: Large model, D = 1024, 24 blocks, 16 heads, 307M params.
ViT-H/14: Huge model, 14×14 patches (more tokens), D = 1280, 32 blocks, 16 heads, 632M params.
For a 224×224 image with patch size P: N = (224/P)². With P = 16: N = 14² = 196. With P = 14: N = 16² = 256. Smaller patches mean more tokens and finer-grained attention, but the O(N²) attention cost grows fast. Going from P = 16 to P = 14 increases attention compute by (256/196)² ≈ 1.7×. That's the trade-off: resolution vs. compute.
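The token-count arithmetic is easy to tabulate. In this sketch the function name is made up, and the cost ratio considers only the N² attention term, not the rest of the network.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    return (image_size // patch_size) ** 2

n16 = num_patches(224, 16)
n14 = num_patches(224, 14)
attention_ratio = (n14 / n16) ** 2    # attention cost scales with N^2

print(n16, n14, round(attention_ratio, 2))
```

Trying a patch size of 8 in the same function shows how quickly the token count, and with it the quadratic term, climbs.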
The magic of ViT is its simplicity. No pooling layers, no convolutional kernels, no special spatial operations. Just patch embedding + standard transformer + classification head. This means any improvement to the transformer architecture (better attention, better normalization, longer context) automatically benefits vision models too. The same architecture handles text, images, audio, video — unifying AI around a single computational primitive.
Before ViT, vision had ResNets, language had LSTMs/Transformers, audio had WaveNets. After ViT, the transformer became the universal architecture. Convert your data into a sequence of vectors (tokens, patches, spectrogram frames), add positional encoding, and run a transformer. The same attention mechanism handles spatial, temporal, and semantic relationships. This universality is arguably the transformer's greatest contribution.
This is the interactive payoff. Below, you can see the full self-attention computation unfold step by step: input tokens are projected to Q, K, V vectors, the attention matrix is computed, and weighted sums produce the outputs. Select different tokens and sentences to see how attention patterns change.
"The cat sat on the mat": Notice how "sat" strongly attends to "cat" (its subject) and weakly to "on" (the preposition that follows). "mat" attends to "the" (its determiner) and "on" (the spatial relation).
"I saw the man with the telescope": This sentence is famously ambiguous. Does "with the telescope" modify "saw" (I used a telescope to see) or "man" (the man had a telescope)? Watch how the attention weights don't resolve this — a single layer of self-attention can represent either interpretation depending on the learned weights.
When you explore the visualizer, keep these patterns in mind:
Diagonal: Each token mostly attends to itself. Common in early layers where the model is still building basic representations.
Vertical stripes: All tokens attend to a specific position (often [CLS] or punctuation). This position acts as a "sink" that aggregates global information.
Subject-verb: Verbs attend strongly to their subjects. "The cat [that I saw] sat" — "sat" attends to "cat" despite the intervening clause.
Coreference: Pronouns attend to their antecedents. "John picked up the ball. He threw it." — "He" attends to "John."
Uniform: Attention spread evenly across all tokens. This can mean the head is "averaging" information globally, or that it's not learning anything useful.
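Some of these patterns can be told apart with simple summary statistics of an attention matrix. The sketch below is one possible set of diagnostics, not a standard tool; the function names `row_entropy`, `diagonal_mass`, and `column_mass` are made up for this example, and the three matrices are idealized versions of the patterns.

```python
import numpy as np

def row_entropy(A):
    """Mean entropy of the attention rows in nats: near 0 is sharp, log(N) is uniform."""
    return float(-(A * np.log(A + 1e-12)).sum(axis=-1).mean())

def diagonal_mass(A):
    """Average weight each token places on itself."""
    return float(np.trace(A) / A.shape[0])

def column_mass(A):
    """Average attention received by each position (a large entry suggests a sink)."""
    return A.mean(axis=0)

N = 6
diagonal = np.eye(N)                    # every token attends only to itself
uniform = np.full((N, N), 1.0 / N)      # attention spread evenly
stripe = np.zeros((N, N))
stripe[:, 0] = 1.0                      # every token attends to position 0

for name, A in [("diagonal", diagonal), ("uniform", uniform), ("stripe", stripe)]:
    print(name, round(row_entropy(A), 3), round(diagonal_mass(A), 3), column_mass(A).round(2))
```

Subject-verb and coreference patterns cannot be detected this way, because they depend on which tokens are linguistically related rather than on the shape of the matrix.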
A common mistake: interpreting attention weights as "what the model is thinking about." Attention weights show where information flows, but the model can route information through many heads and many layers. A token might receive low attention from the final layer but high attention from an earlier layer, with the information already baked into its representation. Use attention visualization as intuition, not proof.
Attention started as a fix for the encoder-decoder bottleneck. Self-attention generalized it into a primitive that operates on sets of vectors. The Transformer stacked self-attention with MLPs, residual connections, and layer normalization into what is arguably the most successful neural network architecture to date.
| Type | Architecture | Masking | Use Case | Examples |
|---|---|---|---|---|
| Encoder-only | Stack of transformer blocks with full (unmasked) self-attention | None — every token sees every other token | Understanding: classification, NER, sentiment | BERT, RoBERTa |
| Decoder-only | Stack of transformer blocks with masked (causal) self-attention | Causal mask: token i can only attend to tokens ≤ i | Generation: language modeling, text completion | GPT-2, GPT-3, LLaMA |
| Encoder-decoder | Encoder (full attention) + Decoder (masked self-attention + cross-attention to encoder) | Causal in decoder self-attention; full in cross-attention | Seq-to-seq: translation, summarization | Original Transformer, T5, BART |
Decoder-only models like GPT use masked (causal) self-attention. Before computing softmax, entries where j > i are set to −∞. After softmax, these become 0 — each token can only attend to previous tokens and itself. This enables autoregressive generation: predict the next token given all previous tokens.
Sentence: "Attention is very cool". Token 1 ("Attention") can only see itself: mask row 1 = [E_{11}, −∞, −∞, −∞]. After softmax: [1.0, 0, 0, 0].
Token 3 ("very") can see tokens 1, 2, 3 but NOT token 4: mask row 3 = [E_{31}, E_{32}, E_{33}, −∞]. After softmax, the weights are distributed over the first 3 tokens only.
This ensures that when predicting the next token, the model cannot "cheat" by looking at future tokens. During training, all positions are processed in parallel (one forward pass), but each position only sees its causal context.
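The masking step can be sketched in NumPy as follows. The score matrix `E` is random here, standing in for the query-key scores of the four-token example. The function sets entries with j > i to −∞ before the softmax, as described above.

```python
import numpy as np

def causal_softmax(E):
    """Row-wise softmax over scores E (N, N), with future positions masked out."""
    N = E.shape[0]
    future = np.triu(np.ones((N, N), dtype=bool), k=1)   # True where j > i
    masked = np.where(future, -np.inf, E)                # future scores become -inf
    masked = masked - masked.max(axis=-1, keepdims=True) # stability; the diagonal is always finite
    w = np.exp(masked)                                   # exp(-inf) = 0
    return w / w.sum(axis=-1, keepdims=True)

tokens = ["Attention", "is", "very", "cool"]
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 4))          # placeholder attention scores
A = causal_softmax(E)

print(np.round(A, 2))
print(A[0])                          # first token attends only to itself: [1. 0. 0. 0.]
```

All four rows are computed in one call, which mirrors how training processes every position in parallel while each row still sees only its own causal context.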
GPT (Radford et al., 2018) showed that a decoder-only transformer with masked self-attention, pre-trained on a massive text corpus, can be fine-tuned for many NLP tasks. The training objective is simple: predict the next token. Given tokens [x_1, ..., x_t], predict x_{t+1} using a softmax over the vocabulary.
The architecture is straightforward: an embedding matrix [V × D] converts tokens to vectors, a stack of masked self-attention blocks processes them, and a projection matrix [D × V] converts each output vector into scores over the vocabulary. Training minimizes cross-entropy between predicted and actual next tokens.
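The data flow and loss can be sketched as below. This is a toy illustration under loud assumptions: the `blocks` argument is an identity placeholder where the stack of masked self-attention blocks would go, the weights are random, and the vocabulary size and dimension are arbitrary. It shows only how the [V × D] embedding, the [D × V] projection, and the shifted targets fit together.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def next_token_loss(token_ids, W_embed, W_out, blocks=lambda h: h):
    """Mean cross-entropy for predicting token t+1 from position t.

    `blocks` is a placeholder (identity by default) for the masked
    self-attention stack; a real model would transform h there.
    """
    x = W_embed[token_ids[:-1]]          # (T-1, D): embed every token except the last
    h = blocks(x)                        # (T-1, D): contextualized vectors
    logits = h @ W_out                   # (T-1, V): scores over the vocabulary
    targets = token_ids[1:]              # the "next token" at each position
    logp = log_softmax(logits)
    return float(-logp[np.arange(len(targets)), targets].mean())

V, D = 50, 16                            # toy vocabulary size and model dimension (assumed)
rng = np.random.default_rng(0)
W_embed = rng.normal(0, 0.02, (V, D))    # [V x D] embedding matrix
W_out = rng.normal(0, 0.02, (D, V))      # [D x V] output projection
token_ids = rng.integers(0, V, size=12)  # a random "sentence"

loss = next_token_loss(token_ids, W_embed, W_out)
print(loss, np.log(V))                   # an untrained model sits near log(V)
```

With small random weights the predicted distribution is close to uniform, so the loss starts near log(V). Training pushes it down from there.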
BERT (Devlin et al., 2019) uses an encoder-only transformer with no causal mask, so every token sees every other token. Training objective: randomly select 15% of the input tokens, hide most of them behind a [MASK] token, and train the model to predict the originals. This forces bidirectional understanding, unlike GPT, which only looks backward.
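The masking step of this objective can be sketched as follows. This is a simplified version that replaces every selected token with the mask id; the original BERT recipe replaces only most of the selected tokens with [MASK] and leaves or randomizes the rest. The names `mask_tokens`, `MASK_ID`, and `IGNORE` are hypothetical, chosen for this example.

```python
import numpy as np

MASK_ID = 0      # hypothetical id of the [MASK] token
IGNORE = -1      # hypothetical label meaning "no loss at this position"

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Return (inputs, labels, selected) for masked language modeling."""
    if rng is None:
        rng = np.random.default_rng()
    token_ids = np.asarray(token_ids)
    selected = rng.random(token_ids.shape) < mask_prob   # pick about 15% of positions
    inputs = np.where(selected, MASK_ID, token_ids)      # hide the selected tokens
    labels = np.where(selected, token_ids, IGNORE)       # predict only the hidden ones
    return inputs, labels, selected

rng = np.random.default_rng(0)
token_ids = rng.integers(1, 1000, size=1000)             # fake token ids, never equal to MASK_ID
inputs, labels, selected = mask_tokens(token_ids, rng=rng)

print(f"masked fraction: {selected.mean():.3f}")
```

The loss is computed only where `labels != IGNORE`, so the model is scored on reconstructing the hidden tokens from context on both sides.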
BERT: Input "The [MASK] sat on the mat" → predict "cat". Every token sees every other token (bidirectional). Excellent for classification and understanding tasks. Cannot generate text autoregressively.
GPT: Input "The cat sat on the" → predict "mat". Each token only sees previous tokens (unidirectional). Excellent for text generation. Can also do classification (with fine-tuning) but less naturally.
Consider: "The bank by the river was eroded." Unidirectional (GPT): when processing "bank," the model only sees "The" — it can't tell if "bank" means a financial institution or a riverbank. Bidirectional (BERT): "bank" sees "river" to the right, immediately disambiguating to "riverbank." For classification tasks (sentiment, NER, question answering), this bidirectional context is crucial. The trade-off: BERT can't generate text token-by-token because it requires the full input to make predictions.
Vaswani et al. (2017) designed the original Transformer as an encoder-decoder model for translation. The encoder uses full self-attention (every source token sees every other source token). The decoder uses masked self-attention (each target token only sees previous target tokens) plus cross-attention (each target token attends to all source tokens). This design directly inherited the seq2seq structure that motivated attention in the first place.
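Cross-attention differs from self-attention only in where the queries, keys, and values come from: queries from the target (decoder) side, keys and values from the source (encoder) side. A minimal single-head sketch follows, with random vectors standing in for the encoder outputs and decoder states of the "we see the sky" example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(target, source, W_q, W_k, W_v):
    """Each target position attends over all source positions."""
    Q = target @ W_q                               # queries come from the decoder
    K = source @ W_k                               # keys come from the encoder
    V = source @ W_v                               # values come from the encoder
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))    # (T_target, T_source), no causal mask
    return A @ V, A

rng = np.random.default_rng(0)
D = 8                                              # toy model dimension (assumed)
source = rng.normal(size=(4, D))                   # placeholder encoder outputs: we / see / the / sky
target = rng.normal(size=(3, D))                   # placeholder decoder states: vediamo / il / cielo
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

out, A = cross_attention(target, source, W_q, W_k, W_v)
print(A.shape, out.shape)                          # (3, 4) (3, 8)
```

The attention matrix is rectangular: one row per target token, one column per source token. No causal mask is needed because the whole source sentence is available before decoding starts.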
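T5 (Raffel et al., 2020) showed that many NLP tasks can be framed as seq-to-seq by writing both the task and the answer as text. Sentiment classification, for example, becomes: input, a short task prefix followed by the review text; output, the label as a string such as "positive". This made encoder-decoder models surprisingly versatile, though GPT-style decoder-only models have since become the dominant choice, a shift usually attributed to simpler training and better scaling.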
The core architecture has barely changed since 2017, but several refinements are now standard:
| Modification | Change | Why |
|---|---|---|
| Pre-Norm | LayerNorm before sublayer, inside residual | More stable training for deep models |
| RMSNorm | Replace LayerNorm with root-mean-square normalization | Simpler and cheaper: drops mean centering, with comparable training stability |
| SwiGLU MLP | Gated linear unit with Swish activation | Empirically better than ReLU MLP |
| MoE | Multiple expert MLPs; a router sends each token to only A of the E experts | Many more parameters for a modest increase in compute per token |
| RoPE | Rotary positional embeddings applied to Q, K | Better relative position modeling; can extrapolate |
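
Three of these refinements are small enough to sketch directly. The code below is an illustrative NumPy version of RMSNorm, a SwiGLU MLP, and RoPE, written from their usual definitions rather than taken from any particular model. The shapes, the `eps` value, the frequency base of 10000, and the choice to rotate adjacent feature pairs are assumptions, and real implementations differ in such details.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Scale each vector to unit root-mean-square; no mean subtraction."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swish(x):
    return x / (1.0 + np.exp(-x))                  # x * sigmoid(x)

def swiglu_mlp(x, W_gate, W_up, W_down):
    """Gated MLP: a Swish-activated gate multiplies a linear 'up' projection."""
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

def rope(x, base=10000.0):
    """Rotate adjacent feature pairs of x (N, D) by angles that grow with position."""
    N, D = x.shape
    freqs = base ** (-np.arange(0, D, 2) / D)        # (D/2,) one frequency per pair
    angles = np.arange(N)[:, None] * freqs[None, :]  # (N, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
N, D = 6, 12
d_ff = 8 * D // 3                                  # the 8D/3 hidden size used with SwiGLU
x = rng.normal(size=(N, D))

y = rms_norm(x, gain=np.ones(D))
z = swiglu_mlp(y, rng.normal(size=(D, d_ff)), rng.normal(size=(D, d_ff)),
               rng.normal(size=(d_ff, D)))

# RoPE: put the same query and key vector at every position; the score
# between positions i and j then depends only on the offset i - j.
q, k = rng.normal(size=D), rng.normal(size=D)
scores = rope(np.tile(q, (N, 1))) @ rope(np.tile(k, (N, 1))).T
print(np.isclose(scores[1, 0], scores[4, 3]))      # True: both have offset 1
```

The last check shows what "relative position modeling" means in practice: after the rotation, the query-key dot product is a function of the distance between tokens, not of their absolute positions.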
| Method | Receptive Field | Compute | Memory | Parallelism |
|---|---|---|---|---|
| RNN | Full (sequential) | O(N · D²) | O(D) | Low (sequential) |
| CNN | Local (kernel-size per layer) | O(N · K · D²) | O(N · D) | High |
| Self-Attention | Global (every token) | O(N² · D) | O(N²) | High |
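
To get a feel for these growth rates, the leading-order terms from the table can be plugged in directly. The sketch below drops all constant factors and uses an arbitrary model dimension and kernel size, so the numbers are only meaningful as ratios.

```python
def layer_costs(N, D, K=3):
    """Leading-order per-layer costs from the table (constant factors dropped)."""
    return {
        "rnn":  {"compute": N * D**2,     "memory": D},
        "cnn":  {"compute": N * K * D**2, "memory": N * D},
        "attn": {"compute": N**2 * D,     "memory": N**2},
    }

D = 1024                                   # model dimension (assumed)
for N in (512, 8192, 131072):              # context lengths to compare
    c = layer_costs(N, D)
    ratio = c["attn"]["compute"] / c["rnn"]["compute"]   # simplifies to N / D
    print(f"N={N:>6}: attention/RNN compute = {ratio:g}, "
          f"attention memory = {c['attn']['memory']:,} entries")
```

The ratio of attention compute to RNN compute is N / D, so attention is the cheaper of the two while the sequence is shorter than the model dimension and becomes the dominant cost once the context grows well past it.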
Attention lets every token look at every other token to decide what matters. The Transformer stacks this primitive into the most scalable architecture we have.
D (model dimension): size of each token vector. 512–12,288 in practice.
H (number of heads): parallel attention heads. D_H = D/H per head.
L (number of layers/blocks): depth of the stack. 6–96 in practice.
N (context length): maximum sequence length. 512–128,000+ in modern models.
V (vocabulary size): number of unique tokens. 30,000–100,000 typically.
d_ff (MLP hidden dim): usually 4D or 8D/3 (with SwiGLU).
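These hyperparameters roughly determine the parameter count. The estimate below is a back-of-the-envelope sketch that ignores biases, normalization gains, and positional embeddings, and assumes the input and output embeddings share one matrix. The example values (L = 96, D = 12,288, V = 50,257) are the published GPT-3 settings, brought in here as an assumption rather than taken from this text.

```python
def transformer_params(D, L, V, d_ff=None, gated=False):
    """Approximate parameter count of a transformer stack.

    Ignores biases, norm gains, and positional embeddings; assumes tied
    input/output embeddings. Set gated=True for a SwiGLU-style MLP, which
    has three weight matrices instead of two.
    """
    if d_ff is None:
        d_ff = 4 * D
    attn = 4 * D * D                          # W_q, W_k, W_v and the output projection
    mlp = (3 if gated else 2) * D * d_ff      # up/down projections, plus a gate if gated
    embed = V * D                             # token embedding matrix
    return L * (attn + mlp) + embed

total = transformer_params(D=12288, L=96, V=50257)
print(f"{total / 1e9:.1f}B")                  # 174.6B

# The 8D/3 hidden size keeps a gated MLP at the same size as a 4D ungated one.
D = 3072
plain = transformer_params(D, L=1, V=0)
gated = transformer_params(D, L=1, V=0, d_ff=8 * D // 3, gated=True)
print(plain == gated)                         # True
```

With d_ff = 4D, each block holds about 12·D² parameters, which is why the total is dominated by L and D rather than by the vocabulary size.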
It's remarkable how little the core architecture has evolved. The original 2017 Transformer had: multi-head self-attention, feedforward networks, residual connections, layer normalization, and positional encoding. Every major model in 2025 still has all five. The changes are incremental refinements (better normalization, better activation functions, better position encoding) — not architectural revolutions. The transformer is not just a good architecture. It's a stable attractor in design space.
The scaling hypothesis: transformer performance improves predictably as you increase model size, dataset size, and compute. Kaplan et al. (2020) demonstrated this with scaling laws: loss follows a power law in each of these quantities, as long as the other two are not the bottleneck. Few other architectures have shown such reliable scaling behavior. The combination of parallelizable computation (matrix multiplies), flexible capacity (just stack more blocks), and the attention mechanism's ability to represent complex dependencies appears to be particularly well suited to gradient-based optimization at scale.
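A power law is a straight line on a log-log plot, which is what makes it easy to fit and extrapolate. The sketch below fits the form L(x) = (x_c / x)^alpha to synthetic data. The constants `true_alpha` and `true_xc` are invented for illustration and are not the values reported by Kaplan et al.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ~ (x_c / x) ** alpha by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    alpha = -slope                        # log(loss) = -alpha*log(x) + alpha*log(x_c)
    x_c = np.exp(intercept / alpha)
    return alpha, x_c

# Synthetic "model size vs. loss" points; both constants are made up.
true_alpha, true_xc = 0.08, 1e13
sizes = np.logspace(6, 9, 8)              # 1M to 1B parameters
losses = (true_xc / sizes) ** true_alpha

alpha, x_c = fit_power_law(sizes, losses)
predicted = (x_c / 1e10) ** alpha         # extrapolate one order of magnitude further
print(f"alpha = {alpha:.3f}, predicted loss at 10B params = {predicted:.3f}")
```

On real training runs the points are noisy and the fit only holds over the range where the other resources are not limiting, so extrapolations like the last line are estimates rather than guarantees.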
| Year | Milestone | Key Contribution |
|---|---|---|
| 2014 | Seq2Seq (Sutskever et al.) | Encoder-decoder RNN for translation |
| 2015 | Bahdanau Attention | Soft attention for seq2seq; dynamic context vector |
| 2017 | Transformer (Vaswani et al.) | Self-attention everywhere; no RNNs needed |
| 2018 | GPT-1 (Radford et al.) | Decoder-only transformer for language modeling |
| 2019 | BERT (Devlin et al.) | Encoder-only; masked language modeling; bidirectional |
| 2019 | GPT-2 | 1.5B params; zero-shot capabilities |
| 2020 | GPT-3 | 175B params; in-context learning |
| 2021 | ViT (Dosovitskiy et al.) | Patches as tokens; transformers replace CNNs for vision |
| # | Paper |
|---|---|
| 1 | Sutskever et al. "Sequence to Sequence Learning with Neural Networks." NeurIPS 2014. |
| 2 | Bahdanau et al. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR 2015. |
| 3 | Vaswani et al. "Attention Is All You Need." NeurIPS 2017. |
| 4 | Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. |
| 5 | Radford et al. "Improving Language Understanding by Generative Pre-Training." OpenAI 2018. |
| 6 | Dosovitskiy et al. "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. |
| 7 | Kaplan et al. "Scaling Laws for Neural Language Models." arXiv 2020. |
| 8 | Shazeer. "GLU Variants Improve Transformer." arXiv 2020. |