What if you threw away recurrence entirely and built a sequence model from pure attention? This paper did exactly that — and launched the era of modern AI.
By 2017, the best machine translation systems all shared the same backbone: recurrent neural networks (RNNs). An RNN reads a sentence one word at a time, updating a hidden state at each step. To translate "The cat sat on the mat" into French, the encoder processes "The," then "cat," then "sat," and so on — sequentially.
This sequential processing creates three problems.
Problem 1: No parallelism. You cannot process word 5 until word 4 is done, because word 5 depends on the hidden state produced by word 4. On a GPU with thousands of cores, most of them sit idle, waiting. Training is slow.
Problem 2: Long-range forgetting. By the time the RNN reaches word 50, the hidden state has been overwritten so many times that information about word 1 is diluted. Attention mechanisms (Bahdanau, 2014) helped by letting the decoder look back at all encoder states, but the encoder itself was still sequential, still forgetting.
Problem 3: Slow training, period. Google's Neural Machine Translation system (GNMT) used 96 GPUs for a week to train. ConvS2S required enormous compute budgets. The sequential bottleneck was not just a theoretical concern — it directly limited how quickly the field could iterate on new ideas.
Attention was not a new idea. Bahdanau et al. (2014) had introduced additive attention as an add-on to RNN encoder-decoder models. Their insight: instead of compressing the entire input sentence into a single vector, let the decoder look back at all encoder hidden states and focus on the relevant ones. This dramatically improved translation quality, especially for long sentences.
But in Bahdanau's model, and in nearly all subsequent work, attention was used alongside recurrence. The RNN still did the heavy lifting of encoding the sequence; attention just helped the decoder pick the right source words. No translation model had relied on attention alone, without any recurrence or convolution.
The footnote of the paper reveals the intellectual genesis. Jakob Uszkoreit first proposed replacing RNNs with self-attention. Ashish Vaswani and Illia Polosukhin designed and implemented the first models. Noam Shazeer proposed three of the paper's key innovations: scaled dot-product attention, multi-head attention, and the parameter-free sinusoidal position representation. It was a truly collaborative breakthrough, with each author contributing crucial pieces.
Earlier attempts to reduce sequential computation used convolutions (ByteNet, ConvS2S), which compute all positions in parallel but connect distant positions only through stacks of layers. The number of operations needed to relate two positions grows with the distance between them: logarithmically for ByteNet and linearly for ConvS2S. That growth makes long-range dependencies hard to learn.
Vaswani and colleagues at Google asked a radical question: what if you removed the RNN entirely? What if the only mechanism relating tokens to each other was attention? The result was the Transformer — a model that processes all positions in parallel, connects any two positions in a single operation (O(1) path length!), and trained to state-of-the-art translation quality in 3.5 days on 8 GPUs.
This paper did not just improve machine translation. It became the foundation of GPT, BERT, PaLM, LLaMA, and virtually every large language model built since 2018. As of 2024, it has over 130,000 citations — more than almost any other paper in computer science history. Let's understand exactly how it works.
Before diving into the math, let's build intuition for what attention does. Forget neural networks for a moment. Think about a dictionary.
A dictionary maps keys to values. You have a query ("I want the meaning of 'cat'"), you find the matching key ("cat"), and you retrieve the value ("a small furry animal"). This is hard lookup — you either find an exact match or you don't.
Now imagine a soft dictionary. Your query is "I want something between 'cat' and 'dog'." Instead of finding one exact match, you compute a similarity score between your query and every key. Then you take a weighted average of all values, where the weights come from those similarity scores.
This is exactly what attention does. Every token in the sequence plays three roles simultaneously. It asks questions (queries), advertises what it contains (keys), and carries information (values). The attention mechanism computes pairwise similarity between every query and every key, then uses those similarities as weights to mix all the values.
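The soft dictionary can be sketched in a few lines of plain Python. This is only an illustration of the idea: the function name `soft_lookup` and the 2-D vectors standing in for "cat" and "dog" are made up for the example, and the similarity score is assumed to be a dot product followed by softmax.

```python
import math

def soft_lookup(query, keys, values):
    """Soft dictionary read: a similarity-weighted average of all values."""
    # Similarity of the query to every key (dot product).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax: positive weights that sum to 1 (shifted by the max for stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Blend the value vectors using those weights.
    dim = len(values[0])
    blended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, blended

# Hypothetical 2-D vectors standing in for the keys "cat" and "dog".
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

# A query halfway between the two keys blends both values equally.
weights, blended = soft_lookup([1.0, 1.0], keys, values)
print(weights, blended)
```

A query that matches one key much better than the other would push nearly all the weight onto that key, which recovers the hard lookup as a limiting case.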
This formulation has deep roots in information retrieval. Search engines work the same way: you type a query, the engine compares it against document keys (titles, keywords), and returns the most relevant values (documents). The Transformer does this at every layer, for every token, in a differentiable way that can be trained end-to-end with gradient descent.
The beautiful thing about this formulation is that it requires no sequential processing. Every token computes its query, key, and value simultaneously. Every pairwise similarity is computed simultaneously. The entire operation is one big matrix multiplication — perfectly suited for GPUs.
One more analogy that helps: think of a cocktail party. Each person (token) is simultaneously asking questions (queries), broadcasting information about themselves (keys), and carrying knowledge to share (values). Attention is the social mechanism by which each person decides who to listen to and how much to weight what they hear. The "cocktail party" happens in parallel — everyone talks and listens at the same time.
Let's make this precise. You have a sequence of n tokens, each represented as a vector of dimension dmodel = 512. Stack them into a matrix X with shape (n × dmodel).
Now, we need each token to play three roles. We create three separate linear projections:

$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$

where W^Q, W^K, and W^V are learned weight matrices of shape (dmodel × dk), taking dv = dk as the paper does. Q, K, and V are all (n × dk) matrices. Each row is one token's query, key, or value vector.
Why three separate projections instead of using X directly? Because a token's "question" and its "answer" are different things. When the word "it" appears in a sentence, its query might encode "I need to find my antecedent" while its key might encode "I am a pronoun in object position." These are different aspects of the same token, and the network learns to project them into different spaces.
There is a subtle but important distinction between attention and self-attention. In Bahdanau's original attention, the queries came from one sequence (the decoder) and the keys/values from another (the encoder). In self-attention, queries, keys, and values all come from the same sequence. Each token attends to all tokens in its own sequence, including itself. This is what makes the Transformer's encoder work — it builds rich contextual representations by letting every source token gather information from every other source token.
The term "self-attention" was coined by Cheng et al. (2016) and Lin et al. (2017), who used it for reading comprehension and sentence embeddings. But they always used it inside an RNN. The Transformer was the first model to use self-attention as the only mechanism for building sequence representations.
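Now we compute attention step by step: score every query against every key with QK^T, scale the scores by √dk, apply softmax across each row to get weights, and use those weights to average the rows of V.
The full formula in one line:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$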
This is the paper's equation (1). It is the single most important equation in modern deep learning.
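Equation (1) translates almost line for line into code. The sketch below uses plain Python lists so every step is visible; the helper names (`matmul`, `softmax`, `scaled_dot_product_attention`) and the toy matrices are assumptions made for illustration, not values from the paper.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                                   # (n x n) raw scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]                # each row sums to 1
    return matmul(weights, V), weights

# Toy input: n = 3 tokens, d_model = 4, d_k = d_v = 2 (illustrative values only).
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0], [0.0, 0.5]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.5], [0.5, 0.0]]
W_V = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output)
```

A real implementation would use a tensor library so that the two matrix multiplies run as single GPU kernels; the list-based version trades speed for transparency.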
Let's trace through a tiny numerical example. Suppose we have 3 tokens with dk = 2:
| | dim 0 | dim 1 |
|---|---|---|
| Q ("the") | 1.0 | 0.5 |
| Q ("cat") | 0.2 | 1.0 |
| Q ("sat") | 0.8 | 0.3 |

| | dim 0 | dim 1 |
|---|---|---|
| K ("the") | 0.9 | 0.1 |
| K ("cat") | 0.3 | 0.8 |
| K ("sat") | 0.7 | 0.6 |
Step 1 (scores, QK^T): the dot product of "cat"'s query with "the"'s key is (0.2)(0.9) + (1.0)(0.1) = 0.28. With "cat"'s key: (0.2)(0.3) + (1.0)(0.8) = 0.86. With "sat"'s key: (0.2)(0.7) + (1.0)(0.6) = 0.74. So "cat" has the highest affinity for itself, followed by "sat."
Step 2 (scale): divide each score by √dk = √2 ≈ 1.41, giving roughly [0.20, 0.61, 0.52].
Step 3 (softmax): convert the scaled scores into weights that sum to 1. "Cat" ends up with weights of roughly [0.26, 0.39, 0.36], meaning it pulls about 39% of its information from itself, 36% from "sat," and 26% from "the." (The rounded values add up to slightly more than 1.)
Step 4 (multiply by V): the output for "cat" is 0.26 · V("the") + 0.39 · V("cat") + 0.36 · V("sat"), a blended representation enriched by contextual information.
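The arithmetic for the "cat" row can be checked directly. The Q and K values below are copied from the tables above; the text does not give V, so the V vectors here are invented purely to make the last step concrete.

```python
import math

tokens = ["the", "cat", "sat"]
Q = {"the": [1.0, 0.5], "cat": [0.2, 1.0], "sat": [0.8, 0.3]}
K = {"the": [0.9, 0.1], "cat": [0.3, 0.8], "sat": [0.7, 0.6]}
# V is not specified in the text; these values are made up for illustration.
V = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}

d_k = 2
q = Q["cat"]

# Step 1: dot product of "cat"'s query with every key.
scores = [sum(a * b for a, b in zip(q, K[t])) for t in tokens]
# Step 2: scale by sqrt(d_k).
scaled = [s / math.sqrt(d_k) for s in scores]
# Step 3: softmax over the scaled scores.
exps = [math.exp(s) for s in scaled]
weights = [e / sum(exps) for e in exps]
# Step 4: weighted average of the value vectors.
output = [sum(w * V[t][d] for w, t in zip(weights, tokens)) for d in range(2)]

print([round(s, 2) for s in scores])   # [0.28, 0.86, 0.74]
print([round(w, 2) for w in weights])  # [0.26, 0.39, 0.36]
```

Because the scores are close together and the scaling shrinks them further, the weights stay fairly even: "cat" listens mostly to itself but still draws a quarter of its output from "the."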
Notice the asymmetric roles. Q and K determine who talks to whom (the attention pattern). V determines what information gets passed. You could have two tokens with very different V vectors but similar K vectors — they would attract the same queries but contribute different information. This separation of "addressing" (Q, K) from "content" (V) is analogous to the separation of addresses and data in computer memory.
Let's unpack why the √dk scaling is there.
This seems like a small detail, but it is critical. Without the scaling factor, the Transformer would not train. Let's derive exactly why.
Suppose the components of the query vector q and the key vector k are independent random variables, each with mean 0 and variance 1. (This is roughly true after proper initialization.) The dot product is:

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$

Each term q_i k_i has mean 0 and variance 1 · 1 = 1 (the variance of a product of two independent zero-mean variables equals the product of their variances). Since we sum dk such independent terms, the dot product has:

$$\mathbb{E}[q \cdot k] = 0, \qquad \mathrm{Var}(q \cdot k) = d_k$$

So the standard deviation of the dot product grows as √dk. With dk = 64, the dot products have standard deviation 8. This means some entries of QK^T will be around +8 while others are around −8.
Now feed these into softmax: softmax(z)_i = e^(z_i) / Σ_j e^(z_j). When the inputs have large magnitude, the exponentials diverge wildly. If one entry is 8 and another is −8, then e^8 ≈ 2981 while e^(−8) ≈ 0.0003. The softmax output becomes essentially a one-hot vector: all the weight goes to the largest score, and the gradient of softmax becomes vanishingly small.
After dividing by √dk:

$$\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{\mathrm{Var}(q \cdot k)}{d_k} = \frac{d_k}{d_k} = 1$$
The dot products now have unit variance regardless of the dimension. Softmax receives well-behaved inputs, gradients flow, and training succeeds.
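The variance argument is easy to check empirically. The sketch below draws random unit-variance queries and keys (an assumption that mirrors the derivation, not trained weights) and measures the spread of their dot products; the function name `dot_product_std` and the trial count are arbitrary choices.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

def dot_product_std(d_k, trials=2000):
    """Empirical standard deviation of q.k for unit-variance random q and k."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    variance = sum((x - mean) ** 2 for x in dots) / trials
    return math.sqrt(variance)

# The raw spread should track sqrt(d_k); the scaled spread should stay near 1.
for d_k in (4, 16, 64):
    raw = dot_product_std(d_k)
    print(d_k, round(raw, 2), round(raw / math.sqrt(d_k), 2))

raw_std_64 = dot_product_std(64)
scaled_std_64 = raw_std_64 / math.sqrt(64)
```

The printed raw standard deviations should land close to 2, 4, and 8, while the scaled column should hover around 1 for every dimension.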
The paper notes that for small dk, the scaling barely matters. But for dk = 64 (the value they use), it is essential. Additive attention (Bahdanau et al.) avoids this problem by using a learned feed-forward network instead of a dot product, but it is slower because it cannot be implemented as a single matrix multiply.
Why is dot-product attention faster than additive attention? Additive attention computes v^T tanh(W1·q + W2·k) for each query-key pair. This requires forming an intermediate hidden state for every pair, then applying a nonlinearity, then a dot product with v. For n tokens, that is n² invocations of a small neural network. Dot-product attention computes QK^T, a single matrix multiply. Modern GPUs are spectacularly optimized for matrix multiplication (they are literally designed for it), so dot-product attention is vastly faster in practice, even though both have the same O(n²d) theoretical complexity.
Let's see this concretely. With dk = 64, a typical dot product might be +6.3. After softmax over a row of n = 10 tokens, the maximum entry gets weight ~0.98, and the rest share ~0.02. The attention is nearly a hard lookup — and the gradient of softmax at a nearly-one-hot output is essentially zero. The network cannot learn to redistribute attention because the gradients have vanished.
After scaling by √64 = 8, that same dot product becomes +0.79. Softmax distributes weight more evenly — perhaps [0.18, 0.15, 0.14, ...]. Gradients are healthy, and the network can learn nuanced attention patterns where multiple tokens contribute to the output.
| Attention Type | Scoring Function | Scaling Needed? | Speed |
|---|---|---|---|
| Additive (Bahdanau) | v^T tanh(W1·q + W2·k) | No, tanh bounds the output | Slower |
| Dot-product | q^T k | Yes, grows with dk | Fast |
| Scaled dot-product (this paper) | q^T k / √dk | Built in | Fast |
The scaling factor is so important that it appears in the name: "Scaled Dot-Product Attention." Three words, each load-bearing. Scaled for the √dk normalization. Dot-Product for the efficient matrix-multiply-based scoring. Attention for the softmax-weighted aggregation. Noam Shazeer, one of the paper's authors, proposed this formulation.
A single attention head computes one set of attention weights. But language has many kinds of relationships. The word "it" needs to attend to its antecedent (a syntactic relationship). The word "bank" needs to attend to "river" or "money" to disambiguate (a semantic relationship). The word "sat" needs to attend to "cat" to know who is sitting (a subject-verb relationship).
A single attention head tries to do all of this with one set of weights. The result is a compromise — averaging inhibits specialization.
The fix: run multiple attention heads in parallel, each with its own learned projections. Each head operates in a lower-dimensional subspace and can specialize in a different type of relationship.
The paper uses h = 8 heads with dk = dv = dmodel/h = 64. Each head projects the 512-dimensional input down to 64 dimensions, computes attention in that subspace, and produces a 64-dimensional output. The 8 outputs are concatenated back to 512 dimensions, then projected one more time through WO.
Why is the total cost the same as single-head attention? Because each head uses dk = 64 instead of 512. The cost of attention is O(n² · dk). With 8 heads at dk = 64, the total is 8 × O(n² · 64) = O(n² · 512), the same as one head at dk = 512. We get specialization for free.
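A minimal multi-head attention sketch, under stated assumptions: plain Python lists, toy sizes (dmodel = 8, h = 2), and random weights standing in for learned parameters. The helper names are illustrative only.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    # Small random weights stand in for learned parameters.
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)]
            for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concatenate the heads feature-wise: token t gets head_1[t] + head_2[t] + ...
    concat = [[x for out in head_outputs for x in out[t]] for t in range(len(X))]
    # Final output projection mixes information across heads.
    return matmul(concat, W_O)

rng = random.Random(0)
n, d_model, h = 5, 8, 2          # toy sizes; the paper uses d_model = 512, h = 8
d_k = d_model // h
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_k, d_model, rng)
Y = multi_head_attention(X, heads, W_O)
print(len(Y), len(Y[0]))
```

Each head works in a dk-dimensional subspace, so the output keeps the shape of the input (n × dmodel) and the layer can be stacked.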
Let's count the parameters in multi-head attention for the base model:
| Matrix | Shape | Parameters |
|---|---|---|
| W_i^Q (×8 heads) | (512 × 64) × 8 | 262,144 |
| W_i^K (×8 heads) | (512 × 64) × 8 | 262,144 |
| W_i^V (×8 heads) | (512 × 64) × 8 | 262,144 |
| W^O | (512 × 512) | 262,144 |
| Total per attention layer | | 1,048,576 ≈ 1M |
Exactly 4 · dmodel² parameters per multi-head attention layer. Note that the 8 per-head projection matrices for Q (each 512 × 64) can equivalently be viewed as a single matrix of shape (512 × 512), followed by splitting the output into 8 chunks of 64. This is how it is implemented in practice — one big matrix multiply, then reshape.
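The count and the "one big matrix, then split" view can both be verified in a few lines. The parameter arithmetic uses the base-model sizes from the text; the fused-projection check uses a toy 4 × 4 matrix with invented entries.

```python
d_model, h = 512, 8
d_k = d_model // h  # 64

params = {
    "W_Q": h * d_model * d_k,
    "W_K": h * d_model * d_k,
    "W_V": h * d_model * d_k,
    "W_O": d_model * d_model,
}
total = sum(params.values())
print(params, total, total == 4 * d_model ** 2)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Toy check: one fused (4 x 4) projection, split into 2 heads of width 2,
# equals projecting with each head's own column block.
x = [[1.0, 2.0, 3.0, 4.0]]
W_fused = [[float(4 * r + c) for c in range(4)] for r in range(4)]
toy_d_k = 2

fused = matmul(x, W_fused)[0]
split = [fused[i * toy_d_k:(i + 1) * toy_d_k] for i in range(2)]
per_head = [
    matmul(x, [row[i * toy_d_k:(i + 1) * toy_d_k] for row in W_fused])[0]
    for i in range(2)
]
print(split, per_head)
```

The two results match because each head's projection is just a block of columns in the fused matrix, which is why implementations can do one large multiply and reshape.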
An important ablation from the paper: what happens with more heads but a smaller head dimension? They tested h = 16 with dk = 32 and h = 32 with dk = 16, both at the same total cost. With 16 heads the model matched the base configuration, but with 32 heads of dimension 16 it lost 0.4 BLEU, suggesting there is a minimum head dimension below which each head does not have enough capacity to compute useful attention patterns. The base model settled on 8 heads of 64 dimensions each.
The output projection WO is often overlooked but plays a critical role. After concatenating the 8 head outputs, we have a 512-dimensional vector — but it is just a concatenation, not a meaningful blend. The WO projection (512 × 512) learns to mix the heads' outputs, combining information from all 8 subspaces into a single coherent representation. Without WO, the heads could not communicate.
The paper's appendix shows that different heads do indeed learn different patterns. Some heads attend to adjacent positions (local syntax). Others attend to distant positions (long-range dependencies). Some heads show sharp, peaked attention; others show diffuse, spread-out patterns. The paper shows visualizations of layer 5 heads: some follow a long-distance dependency of the verb "making" (completing the phrase "making ... more difficult"), while others appear to resolve pronoun anaphora ("its" attends back to its referent).
We have a problem. The attention mechanism is permutation-equivariant: if you shuffle the input tokens, the output tokens get shuffled in exactly the same way. Attention does not know that "cat sat" is different from "sat cat." There is no notion of position built into the mechanism.
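Permutation equivariance can be demonstrated on a toy input. In this sketch the learned projections are omitted (Q = K = V = X) to keep it short; that is a simplification, though the property also holds with projections because they act on each token independently.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Self-attention with Q = K = V = X and no positional information."""
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        w = softmax(scores)
        out.append([sum(wi * v[d] for wi, v in zip(w, X)) for d in range(d_k)])
    return out

# Three made-up token vectors and an arbitrary shuffle of their order.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]

out = self_attention(X)
out_shuffled = self_attention([X[i] for i in perm])
expected = [out[i] for i in perm]  # shuffle the original outputs the same way

max_diff = max(abs(a - b)
               for row_a, row_b in zip(out_shuffled, expected)
               for a, b in zip(row_a, row_b))
print(max_diff)
```

The difference is zero up to floating-point rounding: shuffling the inputs only shuffles the outputs, so nothing in the mechanism itself encodes word order.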
RNNs get position for free — they process tokens sequentially, so position is implicit in the order of computation. Transformers process all tokens simultaneously, so we must explicitly inject position information.
The solution: add a positional encoding vector to each token embedding before feeding it into the Transformer. The encoding has the same dimension dmodel = 512 so it can be added element-wise.
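$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Here pos is the token position (0, 1, 2, ...) and i indexes the (sin, cos) pair of dimensions 2i and 2i+1. Each pair gets a sinusoid of a different frequency. The wavelengths form a geometric progression from 2π (for i = 0, the fastest-oscillating dimensions) to 10000 · 2π (for the slowest).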
Think of it like a clock. Dimension 0 is the second hand: it oscillates rapidly, distinguishing position 0 from position 1. The highest dimensions are the hour hand: they barely move between adjacent positions, but still separate position 0 from position 1000. Together, all 512 dimensions give every position a unique "time stamp" at multiple resolutions, from fine-grained local position to coarse global position.
Why sinusoids? Three reasons:
Reason 3: Extrapolation. Since the encoding is a deterministic function, the model can potentially handle sequences longer than those seen during training. (The paper hypothesized this; in practice, later work like RoPE and ALiBi improved upon it.)
Let's work through a concrete example. With dmodel = 4 (tiny, for illustration), the positional encoding for position 3 would be:
| Dimension | Formula | Value |
|---|---|---|
| 0 (i=0, sin) | sin(3 / 10000^(0/4)) = sin(3) | 0.141 |
| 1 (i=0, cos) | cos(3 / 10000^(0/4)) = cos(3) | −0.990 |
| 2 (i=1, sin) | sin(3 / 10000^(2/4)) = sin(0.03) | 0.030 |
| 3 (i=1, cos) | cos(3 / 10000^(2/4)) = cos(0.03) | 1.000 |
Notice how the low-index dimensions oscillate rapidly (distinguishing adjacent positions) while the high-index dimensions oscillate slowly (providing coarse position information). Together, they give each position a unique fingerprint.
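The table can be reproduced with a short function. This is a sketch of the sinusoidal formula for a single position; the function name `positional_encoding` is an illustrative choice.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: [sin, cos, sin, cos, ...]."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

pe3 = positional_encoding(3, 4)
print([round(v, 3) for v in pe3])  # [0.141, -0.99, 0.03, 1.0]

# The same function scales to the paper's d_model = 512.
pe_full = positional_encoding(3, 512)
print(len(pe_full))
```

In a full model this vector is added element-wise to the token embedding at the same position.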
Why this enables relative position attention. Consider dimensions 2i and 2i+1, which form a (sin, cos) pair at frequency ωi. The encoding at position pos+k can be written as a rotation of the encoding at position pos:
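$$PE(pos+k,\,2i) = \sin(\omega_i (pos+k)) = \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k)$$

$$PE(pos+k,\,2i+1) = \cos(\omega_i (pos+k)) = \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k)$$

In matrix form, for each (sin, cos) pair:

$$\begin{bmatrix} PE(pos+k,\,2i) \\ PE(pos+k,\,2i+1) \end{bmatrix} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix} \begin{bmatrix} PE(pos,\,2i) \\ PE(pos,\,2i+1) \end{bmatrix}$$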
This is a 2D rotation matrix! The positional encoding at pos+k is literally a rotation of the encoding at pos by angle ωik. Each frequency dimension rotates at a different speed. The network can learn a linear projection that "rotates" any position encoding by a fixed offset, enabling relative position attention without explicitly computing relative positions.
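The rotation identity can be checked numerically for one (sin, cos) pair. The pair index, position, and offset below are arbitrary values picked for the check.

```python
import math

d_model = 512
i, pos, k = 5, 17, 4  # arbitrary pair index, position, and offset

omega = 1.0 / (10000 ** (2 * i / d_model))

# Encoding of the pair at position pos.
s, c = math.sin(omega * pos), math.cos(omega * pos)

# Rotation matrix that depends only on the offset k, not on pos.
rot = [[math.cos(omega * k), math.sin(omega * k)],
       [-math.sin(omega * k), math.cos(omega * k)]]

rotated = (rot[0][0] * s + rot[0][1] * c,
           rot[1][0] * s + rot[1][1] * c)

# Encoding of the pair computed directly at position pos + k.
direct = (math.sin(omega * (pos + k)), math.cos(omega * (pos + k)))

print(rotated, direct)
```

Both tuples agree to floating-point precision, and the matrix would be the same for any starting position, which is what lets a fixed linear map express "k positions later."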
This rotation perspective became the foundation for Rotary Position Embeddings (RoPE, Su et al. 2021), which applies the rotation directly to the query and key vectors inside the attention computation rather than adding it to the input embeddings. RoPE is now used in LLaMA, Mistral, and most modern open-source LLMs. The seed of the idea is right here in the original Transformer paper.
The paper also tested learned positional embeddings and found "nearly identical results" (Table 3, row E). They chose sinusoidal encodings for the extrapolation benefit. In practice, most modern Transformers use learned or rotary positional encodings, but the insight that position must be explicitly injected remains foundational.
Now let's assemble the pieces into the full architecture. The Transformer has an encoder-decoder structure, just like the RNN models it replaced. But instead of recurrence, it uses stacked layers of self-attention and feed-forward networks.
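The encoder is a stack of N = 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism, followed by a position-wise fully connected feed-forward network (FFN).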
Each sub-layer is wrapped with two critical components:
Residual connection. The output of each sub-layer is x + Sublayer(x). This is the skip connection from ResNet. Without it, a 6-layer Transformer with 12 sub-layers would suffer from gradient degradation. The residual connection creates a "highway" for gradients to flow backward through the network.
Layer normalization. Applied after the residual addition: LayerNorm(x + Sublayer(x)). Unlike batch normalization (which normalizes across the batch dimension), layer normalization normalizes across the feature dimension within each token. For each token's 512-dimensional vector, it computes the mean and variance across all 512 values, then shifts and scales to zero mean and unit variance. Two learned parameters (γ and β) allow the network to adjust the normalized distribution.
Why layer norm instead of batch norm? Batch normalization computes statistics across all tokens in a batch, which is problematic for sequences of different lengths. Layer normalization works token-by-token, making it independent of batch size and sequence length. It also stabilizes training by preventing the "internal covariate shift" that plagues deep networks — each sub-layer receives well-behaved inputs regardless of what the previous layers do.
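The residual-plus-normalization wrapper is small enough to write out for a single token vector. This is a sketch under assumptions: the epsilon value is a common default rather than something the paper specifies, and the lambda standing in for the sub-layer is invented for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize one token's feature vector, then apply learned scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def add_and_norm(x, sublayer, gamma, beta):
    """Post-norm wrapper from the text: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)], gamma, beta)

x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * 4   # identity scale
beta = [0.0] * 4    # zero shift
# Stand-in sub-layer; a real one would be attention or the FFN.
out = add_and_norm(x, lambda v: [0.5 * t for t in v], gamma, beta)

out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out, out_mean, out_var)
```

With gamma = 1 and beta = 0 the output has mean 0 and variance close to 1, computed from this one token's features only, so the result does not depend on batch size or sequence length.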
The FFN has a hidden dimension of dff = 2048, which is 4× the model dimension of 512. This expansion-then-compression pattern (512 → 2048 → 512) gives the network a high-dimensional space to do nonlinear computation at each position. The expansion is crucial — the ReLU activation in the middle sparsifies the representation (zeroing out negative values), and having 4× more neurons means the network can afford to lose some while still retaining enough information.
Later research suggests that the FFN layers are where much of the Transformer's factual knowledge is stored. Roughly, the attention layers decide which tokens to look at, and the FFN layers decide what to do with the gathered information. Think of it as: attention is reading comprehension, FFN is reasoning. Dai et al. (2022) showed you can even edit individual facts by modifying specific rows of the FFN weight matrices.
The paper notes that the FFN can also be described as "two convolutions with kernel size 1." This is not just a mathematical curiosity: a kernel-size-1 (pointwise) convolution applies the same linear map independently at every position, so standard convolution code in existing deep learning frameworks can implement the Transformer's FFN directly.
Let's count the parameters in one FFN layer: W1 is (512 × 2048) = 1,048,576 parameters, plus b1 with 2048. W2 is (2048 × 512) = 1,048,576 parameters, plus b2 with 512. Total: ~2.1M parameters per FFN layer. Compare to ~1M for the attention layer. The FFN has twice the parameters of attention! In the full model (6 encoder + 6 decoder layers), the FFN parameters dominate. This is why later work on making Transformers more efficient often focuses on the FFN (mixture of experts, for example, routes each token to only a subset of FFN parameters).
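The FFN is two linear maps with a ReLU in between, applied to each token separately. The sketch below shows it on a toy 2 → 4 → 2 example with hand-picked weights (invented for illustration), followed by the parameter count for the base-model sizes.

```python
def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Toy sizes (d_model = 2, d_ff = 4) with made-up weights.
W1 = [[1.0, 0.0, -1.0, 0.0],
      [0.0, 1.0, 0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 0.0],
      [0.0, 1.0]]
b2 = [0.0, 0.0]

out = ffn([1.0, -1.0], W1, b1, W2, b2)
print(out)  # [1.0, 1.0]

# Parameter count for the base model.
d_model, d_ff = 512, 2048
ffn_params = d_model * d_ff + d_ff + d_ff * d_model + d_model
attn_params = 4 * d_model ** 2
print(ffn_params, attn_params, round(ffn_params / attn_params, 2))
```

In the toy run the ReLU zeroes two of the four hidden units, which is the sparsifying behavior described above; the wider hidden layer leaves enough active units to carry the signal.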
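Putting it all together, the information flow through one encoder layer is: take the input x, apply multi-head self-attention, add x back and apply LayerNorm, then apply the position-wise FFN, add the FFN's input back, and apply LayerNorm again.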
One crucial design choice: all sub-layers and embedding layers produce outputs of dimension dmodel = 512. This uniform dimensionality makes residual connections trivial — no projection needed, just add.
The paper provides a beautiful comparison of self-attention vs recurrence vs convolution (their Table 1):
| Layer Type | Complexity per Layer | Sequential Ops | Max Path Length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |
The maximum path length is how many layers a signal must traverse to connect two arbitrary positions. For self-attention, it is O(1) — any two tokens connect directly in a single layer. For recurrence, it is O(n) — information from token 1 must pass through n hidden states to reach token n. This is why the Transformer handles long-range dependencies so well.
The cost is the O(n²) term: self-attention computes a score for every pair of tokens. For a 1024-token sequence, that is about 1 million scores per layer. This quadratic cost is the Transformer's Achilles' heel for very long sequences, and it motivated later work on efficient attention (Linformer, Performer, FlashAttention).
Here is the full set of hyperparameters for the Transformer base model:
| Parameter | Value | Role |
|---|---|---|
| dmodel | 512 | Embedding dimension and residual stream width |
| N | 6 | Number of encoder layers (and decoder layers) |
| h | 8 | Number of attention heads per layer |
| dk = dv | 64 | Key/value dimension per head (= dmodel/h) |
| dff | 2048 | Feed-forward inner dimension (4× dmodel) |
| Pdrop | 0.1 | Dropout rate |
| Total params | ~65M | Base model parameter count |
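The decoder also has N = 6 identical layers, but each layer has three sub-layers instead of two: masked multi-head self-attention over the target tokens, encoder-decoder attention over the encoder's output, and the same position-wise feed-forward network.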
Why masking? During training, the decoder receives the entire target sentence at once (for parallelism). But at inference time, it generates tokens one at a time. To make training match inference, we mask future positions: when computing attention for position t, the model cannot see positions t+1, t+2, etc. This is implemented by setting the corresponding entries in the QK^T score matrix to −∞ before softmax, which drives those attention weights to zero.
Encoder-decoder attention is the bridge between the two halves. The queries come from the decoder (asking "what in the source sentence is relevant to what I am generating right now?"), while the keys and values come from the encoder's final output. This mirrors the attention mechanism in older seq2seq models — but here the encoder representations are computed purely with self-attention, not recurrence.
To summarize, the Transformer uses multi-head attention in three distinct ways:
| Attention Type | Q from | K, V from | Masking? | Purpose |
|---|---|---|---|---|
| Encoder self-attention | Encoder | Encoder | None | Each source token attends to all source tokens |
| Masked decoder self-attention | Decoder | Decoder | Causal | Each target token attends only to previous targets |
| Encoder-decoder attention | Decoder | Encoder | None | Each target token attends to all source tokens |
This three-way use of the same mechanism is elegant. The same multi-head attention code handles all three cases — the only difference is where Q, K, V come from and whether a causal mask is applied. A single function with three arguments (Q_source, KV_source, mask) implements all of: self-attention, cross-attention, and causal self-attention. This economy of design is one reason the Transformer was so easy to implement and scale.
The decoder also uses beam search at inference time. At each step, instead of greedily taking the most probable token, it maintains k candidate sequences (the paper uses a beam size of 4) and extends each by one token. After the full output is generated, the highest-scoring candidate (under the model's log-probability, adjusted by a length penalty) is selected. This simple search procedure usually yields better translations than greedy decoding.
Let's be concrete about how the mask works. For a sequence of 4 tokens, the attention score matrix after masking (and before softmax) looks like this, with rows as queries and columns as keys:

| | tok 0 | tok 1 | tok 2 | tok 3 |
|---|---|---|---|---|
| tok 0 | s00 | −∞ | −∞ | −∞ |
| tok 1 | s10 | s11 | −∞ | −∞ |
| tok 2 | s20 | s21 | s22 | −∞ |
| tok 3 | s30 | s31 | s32 | s33 |
The −∞ entries become 0 after softmax. Token 2, for example, can only attend to tokens 0, 1, and 2. It knows nothing about token 3. This is the causal mask — it makes the attention lower-triangular, ensuring that generation is autoregressive.
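A causal mask is usually built as an additive matrix: 0 where attention is allowed, −∞ where it is not. The sketch below applies one to a 4 × 4 score matrix; the all-zero scores are a placeholder chosen so the effect of the mask is easy to read off.

```python
import math

def causal_mask(n):
    """0.0 where key position j <= query position i, -inf for future positions."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(masked)  # always finite: the diagonal is never masked
        exps = [math.exp(v - peak) for v in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

n = 4
scores = [[0.0] * n for _ in range(n)]  # placeholder scores: every pair equally similar
weights = masked_softmax(scores, causal_mask(n))
for row in weights:
    print([round(w, 3) for w in row])
```

With equal scores, token 0 puts all its weight on itself, token 1 splits evenly over two positions, and so on down the lower triangle; every future position gets exactly zero weight.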
Weight tying. The model ties the input embedding, output embedding, and the pre-softmax linear transformation. They all share the same weight matrix. This is a powerful regularization: it forces the model to use a consistent token representation throughout. If "cat" has embedding vector v, then the model will also predict "cat" by looking for outputs that are close to v in the same space.
The embedding layers multiply the shared weights by √dmodel. The paper does not say why; a common explanation is scale matching. With dmodel = 512, each component of a learned embedding is roughly 1/√512 ≈ 0.044 in magnitude (assuming an initialization that keeps the vector norm reasonable). But the positional encodings have components up to 1.0 (since sin and cos are bounded by 1). Without the √512 ≈ 22.6× scaling, the positional signal would overwhelm the token identity. After scaling, the token and position contributions are on the same scale.
Inference: autoregressive decoding. At test time, the decoder generates tokens one at a time. It starts with a special start token, feeds it through the decoder, predicts the first output token, then feeds both tokens through the decoder, predicts the second, and so on. This is inherently sequential: you cannot predict token 5 without first generating tokens 1-4. But the encoder side runs once in parallel over the entire input. Training avoids the loop altogether: with the causal mask, all target positions are processed in one parallel pass, and the mask guarantees that each position sees only what it would see during autoregressive decoding.
[Interactive visualization: self-attention computed step by step on a short sentence, showing the raw dot-product scores, the scaled scores, the softmax weights, and the resulting attention pattern for each head.]

In the demo, each token gets random query, key, and value vectors, and each head has independent Q, K, V projections, so each head produces a different pattern. Changing even one word in the sentence shifts the attention pattern. The demo's heads are given distinct "personalities" (attending locally to nearby words, attending to sentence boundaries, spreading attention diffusely, or capturing longer-range patterns). In a real Transformer, these specializations emerge from training on millions of sentences.
Reading the heatmap. Each row corresponds to a query token (the token doing the attending). Each column corresponds to a key token (the token being attended to). A bright cell at row i, column j means "token i pays strong attention to token j." Each row sums to 1.0 — the attention weights are a probability distribution.
In the paper's appendix, the authors show real attention patterns from a trained Transformer. One head in layer 5 appears to perform anaphora resolution: the word "its" attends very sharply to the word "Law" (its referent). Other heads in the same layer exhibit behavior that seems related to sentence structure. Each head has learned a different function, without any explicit supervision.
The paper evaluated the Transformer on two machine translation benchmarks: WMT 2014 English-to-German (4.5M sentence pairs) and WMT 2014 English-to-French (36M sentence pairs).
The results were decisive:
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost (EN-DE) |
|---|---|---|---|
| GNMT + RL Ensemble | 26.30 | 41.16 | 1.8 × 10^20 FLOPs |
| ConvS2S Ensemble | 26.36 | 41.29 | 7.7 × 10^19 FLOPs |
| Transformer (base) | 27.3 | 38.1 | 3.3 × 10^18 FLOPs |
| Transformer (big) | 28.4 | 41.8 | 2.3 × 10^19 FLOPs |
The Transformer (big) surpassed all previous models — including ensembles of multiple models — while using a fraction of the training compute. The base model trained in 12 hours on 8 P100 GPUs. The big model took 3.5 days.
The paper also included an ablation study (Table 3) that revealed what mattered most:
| Variation | Effect on BLEU |
|---|---|
| Single head instead of 8 | −0.9 BLEU |
| Smaller key dimension (dk = 16) | −0.7 BLEU |
| Big model (dmodel = 1024, dff = 4096, 16 heads) | +0.6 BLEU |
| Dropout removed (Pdrop = 0) | −1.2 BLEU |
| Learned positional embeddings | −0.1 BLEU (nearly identical) |
Multi-head attention was crucial (−0.9 BLEU when reduced to a single head), though quality also dropped off with too many heads. Shrinking the key dimension hurt quality as well, which suggests that computing compatibility between queries and keys is not trivial. Dropout was important for regularization.
The paper also tested the Transformer on English constituency parsing, a task where the input is a sentence and the output is its syntactic tree structure. Despite being designed for translation, the Transformer achieved an F1 score of 91.3 on the Wall Street Journal portion of the Penn Treebank, outperforming all previously published models except the Recurrent Neural Network Grammar. When trained in a semi-supervised setup with roughly 17M additional sentences, it achieved 92.7 F1, beating all previously reported semi-supervised results. This was a powerful signal that the architecture was not just a good translation model: it was a general-purpose sequence-to-sequence machine.
The big model configuration doubled the model dimension, the number of heads, and the feed-forward width:
| Parameter | Base | Big |
|---|---|---|
| dmodel | 512 | 1024 |
| Heads (h) | 8 | 16 |
| dff | 2048 | 4096 |
| N (layers) | 6 | 6 |
| Pdrop | 0.1 | 0.3 |
| Parameters | ~65M | ~213M |
| EN-DE BLEU | 27.3 | 28.4 |
| Training time | 12 hours | 3.5 days |
The big model used stronger dropout (0.3 vs 0.1) to compensate for its larger capacity. Even at 213M parameters — tiny by today's standards — it was the best translation model in the world. Modern LLMs with billions of parameters are direct descendants of this architecture, scaled up along the exact dimensions the paper identified: dmodel, N, h, and dff.
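As a rough sanity check on the parameter counts in the table, the sketch below estimates them from dmodel, dff, and N alone. It rests on several assumptions that are not stated in the text above: a shared vocabulary of about 37,000 tokens with one embedding matrix reused for input and output, and no biases or layer-norm parameters. Note that the number of heads does not appear, because splitting dmodel across heads leaves the projection sizes unchanged.

```python
def transformer_params(d_model, d_ff, n_layers, vocab_size=37_000):
    """Approximate parameter count of an encoder-decoder Transformer.

    Ignores biases and layer-norm parameters; assumes one shared
    embedding/output matrix (vocab_size is an assumed value).
    """
    attn = 4 * d_model * d_model        # W_Q, W_K, W_V, W_O of one attention block
    ffn = 2 * d_model * d_ff            # the two linear maps of the feed-forward net
    encoder_layer = attn + ffn          # self-attention + FFN
    decoder_layer = 2 * attn + ffn      # self-attention + cross-attention + FFN
    embeddings = vocab_size * d_model
    return n_layers * (encoder_layer + decoder_layer) + embeddings

base = transformer_params(d_model=512, d_ff=2048, n_layers=6)
big = transformer_params(d_model=1024, d_ff=4096, n_layers=6)
print(f"base: {base / 1e6:.1f}M parameters")
print(f"big:  {big / 1e6:.1f}M parameters")
```

Under these assumptions the estimates land within a few percent of the ~65M and ~213M figures in the table, which suggests most of the capacity sits in the attention projections, the feed-forward layers, and the embedding matrix.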
The warmup learning rate schedule. The paper introduced a learning rate schedule that has become standard in Transformer training: lrate = dmodel^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5)), with warmup_steps = 4000.
For the first 4000 steps, the learning rate increases linearly. After that, it decays proportionally to the inverse square root of the step number. The intuition: at the start of training, the model's gradients are noisy and unpredictable (the attention weights are essentially random). A small, gradually increasing learning rate prevents early instability. Once the model has learned some structure, the learning rate peaks and then decays smoothly.
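The schedule described above can be sketched in a few lines of Python. The function name `transformer_lr` is made up for this illustration; the formula is the one given in the paper, lrate = dmodel^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5)).

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for `warmup_steps` steps, then inverse-square-root decay."""
    step = max(step, 1)  # guard against step 0, where step ** -0.5 is undefined
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 2000, 4000, 16000, 100000):
    print(f"step {s:>6}: lr = {transformer_lr(s):.6f}")
```

With the base model's dmodel = 512, the peak at step 4000 is about 0.0007. During warmup, doubling the step doubles the rate; after the peak, quadrupling the step halves it.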
They used Adam with β1 = 0.9, β2 = 0.98, and ε = 10⁻⁹. The lower β2 (0.98 instead of the typical 0.999) shortens the window over which squared gradients are averaged, which makes the optimizer more responsive to recent gradient magnitudes.
Regularization. Three forms: (1) residual dropout (Pdrop = 0.1) applied to the output of each sub-layer before the residual addition; (2) dropout on the sum of token embeddings and positional encodings; (3) label smoothing (εls = 0.1), which replaces the hard one-hot target with a softened distribution. Label smoothing hurt perplexity (the model is less confident) but improved BLEU score (the model is better calibrated).
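Label smoothing is easy to see in code. The sketch below assumes the common uniform variant, where the target becomes (1 − ε) · one-hot + ε / V for a vocabulary of size V; some implementations instead spread ε over the V − 1 incorrect classes only. The function names and the toy vocabulary of 5 tokens are illustrative.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    """Mix each one-hot target with a uniform distribution over the vocabulary."""
    one_hot = np.eye(vocab_size)[target_ids]
    return (1.0 - eps) * one_hot + eps / vocab_size

def cross_entropy(log_probs, target_dist):
    """Mean cross-entropy between target distributions and predicted log-probs."""
    return float(-(target_dist * log_probs).sum(axis=-1).mean())

targets = np.array([2, 0])                  # hypothetical target token ids
q = smooth_labels(targets, vocab_size=5)
print(q[0])                                 # 0.92 on the true token, 0.02 elsewhere

# A uniform prediction, just to show the loss being evaluated.
log_probs = np.log(np.full((2, 5), 0.2))
print(cross_entropy(log_probs, q))
```

Because the smoothed target never puts all its mass on one token, the loss keeps penalizing the model for pushing the correct token's probability all the way to 1, which is the "less confident" behavior described above.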
Inference details. For decoding, the paper used beam search with beam size 4 and length penalty α = 0.6. Beam search maintains 4 candidate translations at each step, expanding each by one token and keeping the 4 best-scoring candidates. The length penalty discourages the model from producing very short translations, which tend to have higher total probability (fewer token probabilities are multiplied together) but miss content. The maximum output length was set to input length + 50 tokens.
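To see how a length penalty changes which candidate wins, here is a small sketch. It assumes the GNMT-style penalty lp(Y) = ((5 + |Y|) / 6)^α from Wu et al. (2016), which is the usual reading of "length penalty α = 0.6"; the two hypotheses and their log-probabilities are invented for illustration.

```python
def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty (assumed form): ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_prob, length, alpha=0.6):
    """Beam score: total log-probability divided by the length penalty."""
    return log_prob / length_penalty(length, alpha)

# Two hypothetical beam candidates: (total log-probability, length in tokens).
short_hyp = (-4.0, 4)
long_hyp = (-5.0, 10)

print("raw:       ", short_hyp[0], long_hyp[0])
print("normalized:", normalized_score(*short_hyp), normalized_score(*long_hyp))
```

On raw log-probability the short candidate wins; after normalization the longer one scores higher, because its log-probability is divided by a larger penalty.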
For the reported BLEU scores, the base model used checkpoint averaging over the last 5 checkpoints (written at 10-minute intervals), and the big model averaged the last 20 checkpoints. Checkpoint averaging takes the mean of the saved weights themselves, a cheap approximation to ensembling that smooths out the noise of individual training steps and often improves BLEU at no extra training cost.
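Checkpoint averaging amounts to taking an element-wise mean of the saved parameters. The sketch below assumes each checkpoint is a plain dictionary mapping parameter names to NumPy arrays; real frameworks store checkpoints in their own formats, so the names and toy values here are hypothetical.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of each parameter across a list of checkpoints."""
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}

# Five toy "checkpoints" with two parameters each.
checkpoints = [
    {"w": np.full((2, 2), float(i)), "b": np.array([i, 2.0 * i])}
    for i in range(5)
]

averaged = average_checkpoints(checkpoints)
print(averaged["w"])
print(averaged["b"])
```

The result is a single set of weights, so inference costs the same as for one model, unlike a true ensemble that runs every member.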
This paper is arguably the most influential machine learning paper ever published. Its impact radiates in every direction.
The Transformer and GPT. Radford et al. (2018) took the Transformer decoder, removed the encoder entirely, and trained it as a language model on raw text. This became GPT, and the decoder-only Transformer became the backbone of GPT-2, GPT-3, GPT-4, Claude, LLaMA, and essentially every modern LLM. The insight: you do not need an encoder-decoder architecture for language modeling — the decoder alone suffices.
The Transformer and BERT. Devlin et al. (2018) took the Transformer encoder, removed the decoder, and trained it with masked language modeling. BERT showed that the encoder produces powerful bidirectional representations useful for classification, question answering, and other tasks. The encoder-only and decoder-only branches both emerged from this paper.
The Transformer and Vision. Dosovitskiy et al. (2020) showed that the Transformer works on images too. Split an image into patches, treat each patch as a token, and apply the standard Transformer encoder. Vision Transformers (ViT) now match or exceed CNNs on image classification. The architecture was so general that it transferred to an entirely different modality.
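The "split an image into patches" step can be sketched with two reshapes. The 224×224 image and 16×16 patch size are illustrative choices, and the sketch stops at flattened patches; a real ViT would follow this with a learned linear projection and positional embeddings.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

Each row of `tokens` then plays the role that a word embedding plays in the text model.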
The Transformer and Scaling Laws. Kaplan et al. (2020) found that the performance of Transformer language models follows smooth power laws as you scale model size, data, and compute. This predictability enabled the strategic scaling that produced GPT-3 and beyond. These laws were measured on Transformers, whose parallel training made experiments at that scale practical.
The Transformer and Efficient Attention. The O(n²) cost of self-attention motivated an entire research field: Linformer (low-rank approximation), Performer (kernel trick to linearize attention), Longformer (local + global attention), and FlashAttention (IO-aware exact attention). All of these exist because the original Transformer's quadratic cost limits context length. The quest for longer context is a direct consequence of this paper's design.
The Transformer and Multi-Modal AI. The architecture turned out to be so general that it works on almost any modality: audio (Whisper), proteins (AlphaFold2's Evoformer), code (Codex), molecules, weather prediction, and even game playing. The key insight is that almost any structured data can be tokenized into a sequence, and self-attention can learn relationships within that sequence. The Transformer became the "universal function approximator" for structured data.
A note on the authors. The eight authors all contributed roughly equally (the listing order is random, as stated in the footnote). Several went on to found major AI companies: Noam Shazeer co-founded Character.AI, Aidan Gomez co-founded Cohere, Illia Polosukhin co-founded NEAR Protocol, and Niki Parmar co-founded Adept AI. Jakob Uszkoreit co-founded Inceptive (RNA design). A single paper seeded half a dozen billion-dollar companies.
What the paper got "wrong." The original Transformer used post-layer-normalization (LayerNorm after the residual add: LayerNorm(x + Sublayer(x))). Later work showed that pre-layer-normalization (LayerNorm before the sub-layer: x + Sublayer(LayerNorm(x))) trains more stably, especially at scale. The inverse-square-root decay, while effective, was somewhat ad hoc; many later recipes keep the warmup but pair AdamW with cosine decay instead. The fixed sinusoidal positional encoding has been largely replaced by learned (GPT) or rotary (RoPE) alternatives, the latter handling long sequences better. And the ReLU activation in the FFN has been superseded by GELU, SwiGLU, and other smoother variants.
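The difference between the two normalization placements is just the order of operations, as this sketch shows. It is a simplified illustration: `layer_norm` omits the learnable gain and bias, and the sub-layer is a hypothetical random linear map standing in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    """Original Transformer: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    """Pre-LN variant: x + Sublayer(LayerNorm(x))."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

def sublayer(h):
    return h @ W  # stand-in for attention or the feed-forward network

x = rng.normal(size=(4, 8))
post = post_ln_block(x, sublayer)
pre = pre_ln_block(x, sublayer)
print(post.mean(axis=-1))               # close to 0: the output is re-normalized
print(np.abs(pre - x).max() > 0)        # x passes through the skip path untouched
```

In the pre-LN form the residual path carries `x` forward without passing through a normalization, which is one common explanation for why deep stacks of such blocks are easier to optimize.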
But these are refinements, not refutations. The core architecture (multi-head self-attention, residual connections, feed-forward networks, layer normalization) survives in essentially every modern LLM. A time-traveler from 2017 reading the architecture diagram of a modern open model such as LLaMA would recognize every component. The Transformer has proven to be not just a good model, but the right model: a rare case where the first design was so well-conceived that seven years of intensive research have produced only incremental improvements.
The implementation. The paper's open-source implementation was done in the "tensor2tensor" library, primarily by Łukasz Kaiser and Aidan Gomez. It became the reference implementation that the entire community built on. Later, the "Annotated Transformer" blog post by Harvard NLP translated the paper into readable, line-by-line PyTorch code. Today, every deep learning framework (PyTorch, JAX, TensorFlow) has Transformer layers as built-in primitives. The architecture is so fundamental that it is infrastructure, not research code.
Paper details. "Attention Is All You Need," Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. NeurIPS 2017 (as NIPS). arXiv:1706.03762. Submitted June 2017. Over 130,000 citations.