Attention Is All You Need

Chapter 0: The Bottleneck

By 2017, most of the best machine translation systems shared the same backbone: recurrent neural networks (RNNs). An RNN reads a sentence one word at a time, updating a hidden state at each step. To translate "The cat sat on the mat" into French, the encoder processes "The," then "cat," then "sat," and so on, strictly in sequence.

This sequential processing creates three problems.

Problem 1: No parallelism. You cannot process word 5 until word 4 is done, because word 5 depends on the hidden state produced by word 4. On a GPU with thousands of cores, most of them sit idle, waiting. Training is slow.

Problem 2: Long-range forgetting. By the time the RNN reaches word 50, the hidden state has been overwritten so many times that information about word 1 is diluted. Attention mechanisms (Bahdanau, 2014) helped by letting the decoder look back at all encoder states, but the encoder itself was still sequential, still forgetting.

Problem 3: Slow training, period. Google's Neural Machine Translation system (GNMT) used 96 GPUs for a week to train. ConvS2S required enormous compute budgets. The sequential bottleneck was not just a theoretical concern — it directly limited how quickly the field could iterate on new ideas.

The fundamental constraint: RNNs process tokens one at a time. This makes training slow (no parallelism) and makes long sequences hard (information must survive many sequential updates). The Transformer eliminates both problems by replacing recurrence with attention — a mechanism that can look at every position simultaneously.

Attention was not a new idea. Bahdanau et al. (2014) had introduced additive attention as an add-on to RNN encoder-decoder models. Their insight: instead of compressing the entire input sentence into a single vector, let the decoder look back at all encoder hidden states and focus on the relevant ones. This dramatically improved translation quality, especially for long sentences.

But in Bahdanau's model, and in nearly all subsequent work, attention was used alongside recurrence. The RNN still did the heavy lifting of encoding the sequence; attention just helped the decoder pick the right source words. No translation model had relied on attention alone, without any recurrence or convolution at all.

The footnote of the paper reveals the intellectual genesis. Jakob Uszkoreit first proposed replacing RNNs with self-attention. Ashish Vaswani and Illia Polosukhin designed and implemented the first models. Noam Shazeer proposed three of the paper's key innovations: scaled dot-product attention, multi-head attention, and the parameter-free sinusoidal position representation. It was a truly collaborative breakthrough, with each author contributing crucial pieces.

Earlier attempts to reduce sequential computation used convolutions (ByteNet, ConvS2S), which compute all positions in parallel but connect distant positions only through stacks of layers. The number of operations needed to relate two positions grows with the distance between them: linearly for ConvS2S and logarithmically for ByteNet. Even in the better case, a signal from position 1 must pass through O(log n) layers to reach position n, which makes long-range dependencies hard to learn.

Vaswani and colleagues at Google asked a radical question: what if you removed the RNN entirely? What if the only mechanism relating tokens to each other was attention? The result was the Transformer — a model that processes all positions in parallel, connects any two positions in a single operation (O(1) path length!), and trained to state-of-the-art translation quality in 3.5 days on 8 GPUs.

This paper did not just improve machine translation. It became the foundation of GPT, BERT, PaLM, LLaMA, and virtually every large language model built since 2018. As of 2024, it has over 130,000 citations — more than almost any other paper in computer science history. Let's understand exactly how it works.

Why can't RNNs fully exploit GPU parallelism during training?

Because each token's hidden state depends on the previous token's hidden state, forcing sequential computation
Because RNNs use too much memory
Because GPUs cannot perform matrix multiplication

Chapter 1: Attention as Lookup

Before diving into the math, let's build intuition for what attention does. Forget neural networks for a moment. Think about a dictionary.

A dictionary maps keys to values. You have a query ("I want the meaning of 'cat'"), you find the matching key ("cat"), and you retrieve the value ("a small furry animal"). This is a hard lookup: you either find an exact match or you don't.

Now imagine a soft dictionary. Your query is "I want something between 'cat' and 'dog'." Instead of finding one exact match, you compute a similarity score between your query and every key. Then you take a weighted average of all values, where the weights come from those similarity scores.
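The soft dictionary can be sketched in a few lines of plain Python. This is an illustration only: the 2-dimensional key vectors, the numeric values, and the helper names `softmax` and `soft_lookup` are made up for the example, and the dot product is assumed as the similarity score.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Return (weights, blended value): a similarity-weighted average of all values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, blended

# Hypothetical entries: "cat", "dog", and an unrelated "car".
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
values = [[10.0], [20.0], [30.0]]
query = [1.0, 1.0]  # "something between cat and dog"

weights, blended = soft_lookup(query, keys, values)
print(weights, blended)
```

The query matches "cat" and "dog" equally, so they split almost all of the weight, and the unrelated entry contributes very little to the blended value.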

Query

"What am I looking for?" — the token that needs information

↓ compare against

Keys

"What do I contain?" — labels on each available token

↓ similarity scores select

Values

"What information do I carry?" — the actual content to retrieve

This is exactly what attention does. Every token in the sequence plays three roles simultaneously. It asks questions (queries), advertises what it contains (keys), and carries information (values). The attention mechanism computes pairwise similarity between every query and every key, then uses those similarities as weights to mix all the values.

The key intuition: Attention is a soft, differentiable lookup table. Each token asks "who in this sequence is relevant to me?" and receives a weighted blend of all tokens' information, with the weights learned by the network.

This formulation has deep roots in information retrieval. Search engines work the same way: you type a query, the engine compares it against document keys (titles, keywords), and returns the most relevant values (documents). The Transformer does this at every layer, for every token, in a differentiable way that can be trained end-to-end with gradient descent.

The beautiful thing about this formulation is that it requires no sequential processing. Every token computes its query, key, and value simultaneously. Every pairwise similarity is computed simultaneously. The entire operation reduces to a handful of matrix multiplications plus a softmax, which is perfectly suited for GPUs.

One more analogy that helps: think of a cocktail party. Each person (token) is simultaneously asking questions (queries), broadcasting information about themselves (keys), and carrying knowledge to share (values). Attention is the social mechanism by which each person decides who to listen to and how much to weight what they hear. The "cocktail party" happens in parallel — everyone talks and listens at the same time.

In the attention mechanism, what determines how much information token A receives from token B?

The distance between token A and token B in the sequence
The similarity between token A's query and token B's key
The magnitude of token B's value vector

Chapter 2: Queries, Keys, Values

Let's make this precise. You have a sequence of n tokens, each represented as a vector of dimension d_model = 512. Stack them into a matrix X with shape (n × d_model).

Now, we need each token to play three roles. We create three separate linear projections:

Q = X W^Q, K = X W^K, V = X W^V

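Here W^Q and W^K are learned (d_model × d_k) weight matrices and W^V is a learned (d_model × d_v) matrix, so Q and K are (n × d_k) and V is (n × d_v). The paper sets d_k = d_v, so all three have the same shape. Each row is one token's query, key, or value vector.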

Why three separate projections instead of using X directly? Because a token's "question" and its "answer" are different things. When the word "it" appears in a sentence, its query might encode "I need to find my antecedent" while its key might encode "I am a pronoun in object position." These are different aspects of the same token, and the network learns to project them into different spaces.

There is a subtle but important distinction between attention and self-attention. In Bahdanau's original attention, the queries came from one sequence (the decoder) and the keys/values from another (the encoder). In self-attention, queries, keys, and values all come from the same sequence. Each token attends to all tokens in its own sequence, including itself. This is what makes the Transformer's encoder work — it builds rich contextual representations by letting every source token gather information from every other source token.

Self-attention (sometimes called intra-attention) predates the Transformer: Cheng et al. (2016) and Lin et al. (2017), among others, used it for reading comprehension and sentence embeddings. But those models used it inside an RNN. The Transformer was the first sequence transduction model to use self-attention as the only mechanism for building sequence representations.

Now we compute attention in three steps:

Step 1: Score

Compute QK^T — the dot product of every query with every key. Result is an (n × n) matrix of raw similarity scores.

↓

Step 2: Normalize

Divide by √d_k, then apply softmax row-wise. Each row sums to 1 — they are attention weights.

↓

Step 3: Aggregate

Multiply the weight matrix by V. Each token's output is a weighted sum of all value vectors.

The full formula in one line:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

This is the paper's equation (1). It is the single most important equation in modern deep learning.
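The three steps and the one-line formula can be sketched directly in plain Python. This is a minimal sketch, not the paper's implementation: the helper names `matmul`, `softmax_rows`, and `attention` are chosen for this example, and the toy matrices X, W_Q, W_K, W_V hold arbitrary illustrative numbers.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning (output, weights)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                                   # step 1: (n x n) scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)                            # step 2: rows sum to 1
    return matmul(weights, V), weights                        # step 3: (n x d_v) output

# Toy input: n = 3 tokens, d_model = 3, d_k = d_v = 2 (illustrative values only).
X = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0],
     [1.0, 1.0, 0.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.5, -0.5]]
W_V = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
output, weights = attention(Q, K, V)
print(len(output), "rows,", len(output[0]), "columns")
```

Nothing in `attention` loops over time steps: every row of the score matrix is computed independently, which is what makes the operation parallel.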

Let's trace through a tiny numerical example. Suppose we have 3 tokens with d_k = 2:

	dim 0	dim 1
Q ("the")	1.0	0.5
Q ("cat")	0.2	1.0
Q ("sat")	0.8	0.3

	dim 0	dim 1
K ("the")	0.9	0.1
K ("cat")	0.3	0.8
K ("sat")	0.7	0.6

Step 1 (score): the dot product of "cat"'s query with "the"'s key is (0.2)(0.9) + (1.0)(0.1) = 0.28. With "cat"'s key: (0.2)(0.3) + (1.0)(0.8) = 0.86. With "sat"'s key: (0.2)(0.7) + (1.0)(0.6) = 0.74. So "cat" has the highest affinity for itself, followed closely by "sat."

Step 2 (normalize), scaling: divide each score by √d_k = √2 ≈ 1.41, giving roughly 0.20, 0.61, and 0.52.

Step 2 (normalize), softmax: convert the scaled scores into weights that sum to 1. "Cat" ends up with weights of about [0.257, 0.387, 0.356], meaning it pulls about 39% of its information from itself, 36% from "sat," and 26% from "the."

Step 3 (aggregate): the output for "cat" is 0.257 · V_"the" + 0.387 · V_"cat" + 0.356 · V_"sat". A blended representation, enriched by contextual information.
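The arithmetic for the "cat" row can be checked with a short script. The Q and K values are the ones from the tables above; the V vectors are not given in the text, so the ones below are hypothetical placeholders chosen only to make the final weighted sum concrete.

```python
import math

tokens = ["the", "cat", "sat"]
Q = {"the": [1.0, 0.5], "cat": [0.2, 1.0], "sat": [0.8, 0.3]}
K = {"the": [0.9, 0.1], "cat": [0.3, 0.8], "sat": [0.7, 0.6]}
# Hypothetical value vectors (not from the text), used only for the last step.
V = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}

d_k = 2
q = Q["cat"]

# Raw dot-product scores of "cat"'s query against every key.
scores = [sum(a * b for a, b in zip(q, K[t])) for t in tokens]

# Scale by sqrt(d_k), then softmax.
scaled = [s / math.sqrt(d_k) for s in scores]
exps = [math.exp(s) for s in scaled]
weights = [e / sum(exps) for e in exps]

# Weighted sum of the value vectors.
output = [sum(w * V[t][d] for w, t in zip(weights, tokens)) for d in range(2)]

print([round(s, 2) for s in scores])    # [0.28, 0.86, 0.74]
print([round(w, 3) for w in weights])   # [0.257, 0.387, 0.356]
print([round(o, 3) for o in output])
```

The raw scores are fairly close together, so after scaling and softmax no single token dominates: "cat" draws on all three tokens.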

Notice the asymmetric roles. Q and K determine who talks to whom (the attention pattern). V determines what information gets passed. You could have two tokens with very different V vectors but similar K vectors — they would attract the same queries but contribute different information. This separation of "addressing" (Q, K) from "content" (V) is analogous to the separation of addresses and data in computer memory.

Self-attention as matrix factorization. The attention output can be written as AV, where A = softmax(QK^T/√d_k) is a stochastic matrix (rows sum to 1). Each output row is a convex combination of value rows. The attention matrix A acts as a routing table, dynamically computed from the input. This is fundamentally different from a fixed weight matrix — the connections change depending on what the input is.

The entire operation is built from matrix multiplication. Q times K-transpose gives scores. Softmax gives weights. Weights times V gives outputs. Two matrix multiplies and one row-wise softmax. No recurrence, no sequential dependency. Every token is processed in parallel.

One detail remains unexplained: the √d_k scaling. The next chapter unpacks why it is there.

What is the shape of the attention score matrix QK^T for a sequence of n tokens?

(n × n): every token has a score against every token, itself included
(n × d_k): same shape as Q
(d_k × d_k): a square matrix over feature dimensions

Chapter 3: Why Scale by √d_k

This seems like a small detail, but it matters. Without the scaling factor, training becomes unstable or stalls once d_k is large. Let's derive exactly why.

Suppose the components of the query vector q and the key vector k are independent random variables, each with mean 0 and variance 1. (This is roughly true after proper initialization.) The dot product is:

q · k = ∑_{i=1}^{d_k} q_i k_i

Each term q_i k_i has mean 0 and variance 1 · 1 = 1 (the variance of a product of two independent zero-mean variables equals the product of their variances). Since we sum d_k such independent terms, the dot product has:

E[q · k] = 0, Var[q · k] = d_k

So the standard deviation of the dot product grows as √d_k. With d_k = 64, the dot products have standard deviation 8. This means some entries of QK^T will be around +8 while others are around −8.

Now feed these into softmax: softmax(z)_i = e^(z_i) / ∑_j e^(z_j). When the inputs have large magnitude, the exponentials diverge wildly. If one entry is 8 and another is −8, then e⁸ ≈ 2981 while e⁻⁸ ≈ 0.0003. The softmax output becomes essentially a one-hot vector: all the weight goes to the largest score, and the gradient of softmax becomes vanishingly small.

The saturation problem: Large dot products push softmax into its flat regions where gradients vanish. The network stops learning because it cannot adjust the attention weights. Dividing by √d_k brings the variance back to 1, keeping softmax in its sensitive, gradient-rich region.

After dividing by √d_k:

Var[q · k / √d_k] = Var[q · k] / d_k = d_k / d_k = 1

The dot products now have unit variance regardless of the dimension. Softmax receives well-behaved inputs, gradients flow, and training succeeds.
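The variance claim is easy to check empirically. The sketch below draws random unit-variance query and key vectors and measures the variance of their dot product before and after scaling. The sample count, the seed, and the helper name `dot_product_samples` are arbitrary choices for this illustration.

```python
import random
import statistics

def dot_product_samples(d_k, n_samples, rng):
    """Dot products of independent standard-normal vectors q and k."""
    samples = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    return samples

rng = random.Random(0)   # fixed seed so the run is repeatable
d_k = 64

raw = dot_product_samples(d_k, 4000, rng)
scaled = [s / d_k ** 0.5 for s in raw]

raw_var = statistics.pvariance(raw)
scaled_var = statistics.pvariance(scaled)
print(f"unscaled variance ~ {raw_var:.1f} (theory: {d_k})")
print(f"scaled variance   ~ {scaled_var:.2f} (theory: 1)")
```

With a few thousand samples the estimates should land close to d_k and 1 respectively, up to sampling noise.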

The paper notes that for small d_k, the scaling barely matters. But for d_k = 64 (the value they use), it is essential. Additive attention (Bahdanau et al.) avoids this problem by using a learned feed-forward network instead of a dot product, but it is slower because it cannot be implemented as a single matrix multiply.

Why is dot-product attention faster than additive attention? Additive attention computes v^T tanh(W₁q + W₂k) for each query-key pair. This requires forming an intermediate hidden state for every pair, then applying a nonlinearity, then a dot product with v. For n tokens, that is n² invocations of a small neural network. Dot-product attention computes QK^T — a single matrix multiply. Modern GPUs are spectacularly optimized for matrix multiplication (they are literally designed for it), so dot-product attention is vastly faster in practice, even though both have the same O(n²d) theoretical complexity.

Let's see this concretely. With d_k = 64, one large dot product in a row might be +6.3. Suppose the other nine scores in a row of n = 10 tokens sit near 0. After softmax, the maximum entry gets weight ~0.98, and the rest share ~0.02. The attention is nearly a hard lookup, and the gradient of softmax at a nearly-one-hot output is essentially zero. The network struggles to redistribute attention because the gradients have all but vanished.

After scaling by √64 = 8, that same dot product becomes +0.79. Softmax now distributes weight far more evenly: about 0.20 for the top token and about 0.09 for each of the others. Gradients are healthy, and the network can learn nuanced attention patterns where multiple tokens contribute to the output.
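A small experiment makes the saturation effect visible. It assumes one specific row of scores, a single large score of 6.3 with nine others at 0, which is an illustrative choice rather than something measured from a trained model. It also reports p(1 − p) for the top entry, the diagonal term of the softmax Jacobian, as a rough proxy for how much gradient can flow.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [6.3] + [0.0] * 9                       # assumed example row, n = 10
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

raw_w = softmax(raw_scores)
scaled_w = softmax(scaled_scores)

# d softmax_i / d z_i = p_i * (1 - p_i): close to 0 when p_i is close to 0 or 1.
raw_grad = raw_w[0] * (1.0 - raw_w[0])
scaled_grad = scaled_w[0] * (1.0 - scaled_w[0])

print(f"unscaled: top weight {raw_w[0]:.3f}, gradient term {raw_grad:.3f}")
print(f"scaled:   top weight {scaled_w[0]:.3f}, gradient term {scaled_grad:.3f}")
```

Under these assumptions the unscaled row is almost one-hot, while the scaled row keeps meaningful weight on every token and has a gradient term roughly ten times larger.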

Attention Type	Scoring Function	Scaling Needed?	Speed
Additive (Bahdanau)	v^T tanh(W₁q + W₂k)	No — tanh bounds the output	Slower
Dot-product	q^Tk	Yes — grows with d_k	Fast
Scaled dot-product (this paper)	q^Tk / √d_k	Built in	Fast

The scaling factor is so important that it appears in the name: "Scaled Dot-Product Attention." Three words, each load-bearing. Scaled for the √d_k normalization. Dot-Product for the efficient matrix-multiply-based scoring. Attention for the softmax-weighted aggregation. Noam Shazeer, one of the paper's authors, proposed this formulation.

If d_k = 64 and q, k have unit-variance components, what is the standard deviation of the unscaled dot product q · k?

64
8, the square root of d_k (√64)
1

Chapter 4: Multi-Head Attention

A single attention head computes one set of attention weights. But language has many kinds of relationships. The word "it" needs to attend to its antecedent (a syntactic relationship). The word "bank" needs to attend to "river" or "money" to disambiguate (a semantic relationship). The word "sat" needs to attend to "cat" to know who is sitting (a subject-verb relationship).

A single attention head tries to do all of this with one set of weights. The result is a compromise — averaging inhibits specialization.

The fix: run multiple attention heads in parallel, each with its own learned projections. Each head operates in a lower-dimensional subspace and can specialize in a different type of relationship.

MultiHead(Q, K, V) = Concat(head₁, …, head_h) W^O

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

The paper uses h = 8 heads with d_k = d_v = d_model/h = 64. Each head projects the 512-dimensional input down to 64 dimensions, computes attention in that subspace, and produces a 64-dimensional output. The 8 outputs are concatenated back to 512 dimensions, then projected one more time through W^O.

The geometry of multi-head attention: Think of the 512-dimensional space as having 8 "rooms" of 64 dimensions each. Each head gets its own room to work in. Head 1 might learn to attend to the previous word. Head 2 might attend to the subject of the sentence. Head 3 might attend to punctuation. They operate independently and their results are combined. This is like having 8 different experts analyzing the sentence simultaneously.

Why is the total cost the same as single-head attention? Because each head uses d_k = 64 instead of 512. The cost of attention is O(n² · d_k). With 8 heads at d_k = 64, the total is 8 × O(n² · 64) = O(n² · 512), the same as one head at d_k = 512. We get specialization for free.

Let's count the parameters in multi-head attention for the base model:

Matrix	Shape	Parameters
W_i^Q (×8 heads)	(512 × 64) × 8	262,144
W_i^K (×8 heads)	(512 × 64) × 8	262,144
W_i^V (×8 heads)	(512 × 64) × 8	262,144
W^O	(512 × 512)	262,144
Total per attention layer		1,048,576 ≈ 1M

Exactly 4 · d_model² parameters per multi-head attention layer. Note that the 8 per-head projection matrices for Q (each 512 × 64) can equivalently be viewed as a single matrix of shape (512 × 512), followed by splitting the output into 8 chunks of 64. This is how it is implemented in practice — one big matrix multiply, then reshape.

An important ablation from the paper (Table 3, row A): what happens when the number of heads changes while total compute stays constant? With h = 16 and d_k = 32, BLEU matched the base model at 25.8. With h = 32 and d_k = 16, it dropped to 25.4 (−0.4 BLEU), and a single head with d_k = 512 was worst at 24.9. This suggests there is a minimum head dimension below which each head does not have enough capacity to compute useful attention patterns, and that a single head cannot specialize. The base model settled on 8 heads of 64 dimensions each.

The multi-head attention recipe: (1) Project Q, K, V with learned matrices. (2) Split into h heads. (3) Run scaled dot-product attention on each head in parallel. (4) Concatenate heads. (5) Project back with W^O. Five steps, all parallelizable, all differentiable. This is the single most important computational primitive in modern AI.
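The five-step recipe can be sketched in plain Python with toy sizes. This is a minimal sketch under stated assumptions: it uses the "one big projection, then split into chunks" formulation, tiny dimensions (d_model = 8, h = 2) instead of the paper's 512 and 8, random weights in place of learned ones, and helper names invented for this example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = []
    for row in scores:
        scaled = [s / math.sqrt(d_k) for s in row]
        mx = max(scaled)
        exps = [math.exp(s - mx) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, V)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # (1) Project with full-width matrices.
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_head = len(Q[0]) // h
    heads = []
    for i in range(h):
        lo, hi = i * d_head, (i + 1) * d_head
        # (2) Split into heads, (3) attend within each head's subspace.
        heads.append(attention([r[lo:hi] for r in Q],
                               [r[lo:hi] for r in K],
                               [r[lo:hi] for r in V]))
    # (4) Concatenate the head outputs for each token.
    concat = [[x for head in heads for x in head[t]] for t in range(len(X))]
    # (5) Mix the heads with the output projection.
    return matmul(concat, W_o)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, h = 3, 8, 2          # toy sizes, not the paper's
X = random_matrix(n, d_model, rng)
W_q, W_k, W_v, W_o = (random_matrix(d_model, d_model, rng) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(len(out), "tokens x", len(out[0]), "dims")
```

The heads in this sketch run in a Python loop for readability; in a real implementation they would be batched into a single tensor operation.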

The output projection W^O is often overlooked but plays a critical role. After concatenating the 8 head outputs, we have a 512-dimensional vector — but it is just a concatenation, not a meaningful blend. The W^O projection (512 × 512) learns to mix the heads' outputs, combining information from all 8 subspaces into a single coherent representation. Without W^O, the heads could not communicate.

The paper's appendix shows that different heads do indeed learn different patterns. Some heads attend to adjacent positions (local syntax). Others attend to distant positions (long-range dependencies). Some heads show sharp, peaked attention; others show diffuse, spread-out patterns. The visualizations come from layer 5 of the 6-layer encoder: several heads link the verb "making" to the distant words that complete the phrase "making ... more difficult," and two heads appear to resolve pronoun anaphora ("its" attends back to its referent).

With 8 attention heads and d_model = 512, what is the dimension of each head's key/query space?

64: each head operates in d_model/h = 512/8 dimensions
512: each head uses the full model dimension
8: one dimension per head

Chapter 5: Positional Encoding

We have a problem. The attention mechanism is permutation-equivariant: if you shuffle the input tokens, the output tokens get shuffled in exactly the same way. Attention does not know that "cat sat" is different from "sat cat." There is no notion of position built into the mechanism.

RNNs get position for free — they process tokens sequentially, so position is implicit in the order of computation. Transformers process all tokens simultaneously, so we must explicitly inject position information.

The solution: add a positional encoding vector to each token embedding before feeding it into the Transformer. The encoding has the same dimension d_model = 512 so it can be added element-wise.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where pos is the token position (0, 1, 2, ...) and i indexes the sine/cosine pairs (i = 0, 1, ..., d_model/2 − 1), so dimensions 2i and 2i+1 share one frequency. Each pair gets a sinusoid of a different frequency. The wavelengths form a geometric progression from 2π (for i = 0, the fastest-oscillating pair) to 10000 · 2π (for the slowest).

Think of it like a clock. Dimensions 0 and 1 are the second hand: they oscillate rapidly, distinguishing position 0 from position 1. The last few dimensions are the hour hand: with wavelengths of roughly 60,000 positions, they barely move between adjacent positions but still separate position 0 from position 1000. Together, all 512 dimensions give every position a unique "time stamp" at multiple resolutions, from fine-grained local position to coarse global position.

Why sinusoids? Three reasons:

Reason 1: Relative positions are linear functions. For any fixed offset k, PE(pos + k) can be written as a linear transformation of PE(pos). Specifically, sin(a + b) = sin(a)cos(b) + cos(a)sin(b). This means the network can learn to attend to "the token 3 positions ago" with a simple linear operation — it just needs to learn the matrix that maps PE(pos) to PE(pos + 3).

Reason 2: Bounded magnitude. Unlike learned position embeddings that could grow arbitrarily, sin and cos are bounded between −1 and +1. This keeps the positional signal on the same scale as the token embeddings.

Reason 3: Extrapolation. Since the encoding is a deterministic function, the model can potentially handle sequences longer than those seen during training. (The paper hypothesized this; in practice, later work like RoPE and ALiBi improved upon it.)

Let's work through a concrete example. With d_model = 4 (tiny, for illustration), the positional encoding for position 3 would be:

Dimension	Formula	Value
0 (i=0, sin)	sin(3 / 10000^(0/4)) = sin(3)	0.141
1 (i=0, cos)	cos(3 / 10000^(0/4)) = cos(3)	−0.990
2 (i=1, sin)	sin(3 / 10000^(2/4)) = sin(0.03)	0.030
3 (i=1, cos)	cos(3 / 10000^(2/4)) = cos(0.03)	1.000
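The table can be reproduced with a direct transcription of the two formulas. The function name `positional_encoding` is chosen for this sketch, and it assumes d_model is even so that every sine has a matching cosine.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position (assumes d_model is even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))   # dimension 2i
        pe.append(math.cos(angle))   # dimension 2i + 1
    return pe

pe3 = positional_encoding(3, 4)
print([round(x, 3) for x in pe3])    # [0.141, -0.99, 0.03, 1.0]

# The same function works for the paper's d_model = 512.
pe_full = positional_encoding(3, 512)
print(len(pe_full))                  # 512
```

Because the encoding is a pure function of the position, it can be computed once for all positions and added to the token embeddings.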

Notice how the low-index dimensions oscillate rapidly (distinguishing adjacent positions) while the high-index dimensions oscillate slowly (providing coarse position information). Together, they give each position a unique fingerprint.

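Why this enables relative position attention. Consider dimensions 2i and 2i+1, which form a (sin, cos) pair at frequency ω_i = 1 / 10000^(2i/d_model). The encoding at position pos+k can be written as a rotation of the encoding at position pos. By the angle-addition identities:

PE(pos+k, 2i) = sin(ω_i(pos+k)) = sin(ω_i · pos) cos(ω_i · k) + cos(ω_i · pos) sin(ω_i · k)

PE(pos+k, 2i+1) = cos(ω_i(pos+k)) = cos(ω_i · pos) cos(ω_i · k) − sin(ω_i · pos) sin(ω_i · k)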

In matrix form, for each (sin, cos) pair:

[PE(pos+k, 2i)  ]	 	[ cos(ω_i k)  sin(ω_i k)]	[PE(pos, 2i)  ]
[PE(pos+k, 2i+1)]	=	[−sin(ω_i k)  cos(ω_i k)]	[PE(pos, 2i+1)]

This is a 2D rotation matrix! The positional encoding at pos+k is literally a rotation of the encoding at pos by angle ω_i k, and the matrix depends only on the offset k, not on pos. Each frequency pair rotates at a different speed. The network can therefore learn a linear projection that "rotates" any position encoding by a fixed offset, enabling relative position attention without explicitly computing relative positions.
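The rotation property can be verified numerically. The sketch below compares the encoding computed directly at pos + k with the encoding at pos multiplied by the 2 × 2 matrix above, for a handful of frequencies, positions, and offsets. The helper names and the particular test values are arbitrary choices for this check.

```python
import math

D_MODEL = 512

def omega(i, d_model=D_MODEL):
    """Frequency of the (sin, cos) pair at dimensions 2i and 2i + 1."""
    return 1.0 / (10000 ** (2 * i / d_model))

def pe_pair(pos, w):
    return (math.sin(w * pos), math.cos(w * pos))

def rotate(pair, w, k):
    """Apply [[cos(wk), sin(wk)], [-sin(wk), cos(wk)]] to a (sin, cos) pair."""
    s, c = pair
    cos_k, sin_k = math.cos(w * k), math.sin(w * k)
    return (cos_k * s + sin_k * c, -sin_k * s + cos_k * c)

max_error = 0.0
for i in (0, 1, 100, 255):
    w = omega(i)
    for pos in (0, 7, 123):
        for k in (1, 3, 50):
            direct = pe_pair(pos + k, w)
            rotated = rotate(pe_pair(pos, w), w, k)
            err = max(abs(a - b) for a, b in zip(direct, rotated))
            max_error = max(max_error, err)

print("largest mismatch:", max_error)
```

The mismatch should be at the level of floating-point rounding, confirming that shifting by k is a linear map that does not depend on pos.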

This rotation perspective became the foundation for Rotary Position Embeddings (RoPE, Su et al. 2021), which applies the rotation directly to the query and key vectors inside the attention computation rather than adding it to the input embeddings. RoPE is now used in LLaMA, Mistral, and most modern open-source LLMs. The seed of the idea is right here in the original Transformer paper.

The paper also tested learned positional embeddings and found "nearly identical results" (Table 3, row E). The authors chose sinusoidal encodings because they may let the model extrapolate to sequences longer than those seen in training. In practice, most modern Transformers use learned or rotary positional encodings, but the insight that position must be explicitly injected remains foundational.

Why does the Transformer need positional encoding at all?

Because attention is permutation-equivariant: without position information, "cat sat" and "sat cat" produce the same outputs, merely reordered
Because the model needs to count the number of tokens
Because sinusoidal functions are computationally cheaper than learned embeddings

Chapter 6: The Encoder Stack

Now let's assemble the pieces into the full architecture. The Transformer has an encoder-decoder structure, just like the RNN models it replaced. But instead of recurrence, it uses stacked layers of self-attention and feed-forward networks.

The encoder is a stack of N = 6 identical layers. Each layer has two sub-layers:

Sub-layer 1: Multi-Head Self-Attention

Each token attends to all other tokens in the input. Q, K, V all come from the same source — hence "self"-attention.

↓

Sub-layer 2: Feed-Forward Network

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. Two linear layers with ReLU in between. Applied to each position independently.

Each sub-layer is wrapped with two critical components:

Residual connection. The output of each sub-layer is x + Sublayer(x). This is the skip connection from ResNet. Without it, a 6-layer Transformer with 12 sub-layers would suffer from gradient degradation. The residual connection creates a "highway" for gradients to flow backward through the network.

Layer normalization. Applied after the residual addition: LayerNorm(x + Sublayer(x)). Unlike batch normalization (which normalizes across the batch dimension), layer normalization normalizes across the feature dimension within each token. For each token's 512-dimensional vector, it computes the mean and variance across all 512 values, then shifts and scales to zero mean and unit variance. Two learned parameter vectors (γ and β, each of dimension 512) then let the network rescale and shift the normalized values.
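The "Add & Norm" wrapper is short enough to write out for a single token. This is a sketch with assumptions: the small constant `eps` added to the variance is a common numerical safeguard rather than something stated in the text, and the 4-dimensional vectors stand in for the 512-dimensional ones.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector, then rescale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def add_and_norm(x, sublayer_out, gamma, beta):
    """LayerNorm(x + Sublayer(x)) for a single token."""
    residual = [a + b for a, b in zip(x, sublayer_out)]
    return layer_norm(residual, gamma, beta)

# Toy 4-dimensional token and a made-up sub-layer output.
x = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, -0.5, 0.5, -0.5]
gamma = [1.0] * 4    # identity scale, as at a typical initialization
beta = [0.0] * 4     # zero shift

y = add_and_norm(x, sublayer_out, gamma, beta)
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
print(y, y_mean, y_var)
```

Each token is normalized using only its own features, so the result does not depend on the other tokens in the sequence or on the batch.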

Why layer norm instead of batch norm? Batch normalization computes statistics across all tokens in a batch, which is problematic for sequences of different lengths. Layer normalization works token-by-token, making it independent of batch size and sequence length. It also stabilizes training by preventing the "internal covariate shift" that plagues deep networks — each sub-layer receives well-behaved inputs regardless of what the previous layers do.

Position-wise = independent. The feed-forward network is applied to each position independently and identically. Position 1 is processed by the same weights as position 7. It is the self-attention sub-layer that mixes information between positions. The FFN then processes each position's mixed representation independently. Think of it as: attention is for communication, FFN is for computation.

The FFN has a hidden dimension of d_ff = 2048, which is 4× the model dimension of 512. This expansion-then-compression pattern (512 → 2048 → 512) gives the network a high-dimensional space to do nonlinear computation at each position. The expansion is crucial — the ReLU activation in the middle sparsifies the representation (zeroing out negative values), and having 4× more neurons means the network can afford to lose some while still retaining enough information.
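The position-wise FFN can be sketched with toy sizes. The weights below are hand-picked illustrative numbers with d_model = 2 and d_ff = 4 (a smaller expansion than the paper's 512 to 2048), and the function names are chosen for this example.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single token vector x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]            # expand, then ReLU
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]              # compress back

def ffn_positionwise(X, W1, b1, W2, b2):
    """Apply the same FFN independently to every position."""
    return [ffn(x, W1, b1, W2, b2) for x in X]

# Toy weights: W1 is (d_model x d_ff) = (2 x 4), W2 is (d_ff x d_model) = (4 x 2).
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]
b2 = [0.1, -0.1]

X = [[1.0, 2.0], [-1.0, 0.5]]
out = ffn_positionwise(X, W1, b1, W2, b2)
print(out)
```

For the first token the hidden pre-activations are [1, 1, −0.5, 3]; the ReLU zeroes the negative entry, which is the sparsifying effect described above. Swapping the order of the rows in X simply swaps the rows of the output, because no information moves between positions.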

Later research suggests that the FFN layers are where the Transformer stores much of its factual knowledge. The attention layers decide which tokens to look at, and the FFN layers decide what to do with the gathered information. Think of it as: attention is reading comprehension, FFN is reasoning. Dai et al. (2022) showed you can even edit individual facts by modifying specific rows of the FFN weight matrices.

The paper notes that the FFN can also be described as "two convolutions with kernel size 1." This is not just a mathematical curiosity: it means the same code that implements pointwise (1×1) convolutions can implement the Transformer's FFN, making the architecture easy to build on existing deep learning frameworks.

Let's count the parameters in one FFN layer: W₁ is (512 × 2048) = 1,048,576 parameters, plus b₁ with 2048. W₂ is (2048 × 512) = 1,048,576 parameters, plus b₂ with 512. Total: ~2.1M parameters per FFN layer. Compare to ~1M for one attention sub-layer: the FFN has twice the parameters. In the full model (6 encoder + 6 decoder layers), the 12 FFN sub-layers hold about 25M parameters and the 18 attention sub-layers (one per encoder layer, two per decoder layer) about 19M, so the FFN is the largest single component. This is one reason later work on making Transformers more efficient often focuses on the FFN (mixture of experts, for example, routes each token to only a subset of FFN parameters).
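The parameter arithmetic is easy to reproduce. The sketch below counts weights for one attention sub-layer (projection matrices only, no biases, matching the table in Chapter 4) and one FFN sub-layer (with biases), then totals them over the stack. Embeddings, layer-norm parameters, and the output softmax layer are deliberately left out, so the totals are lower than the full model size.

```python
d_model, d_ff, h = 512, 2048, 8
d_k = d_model // h

# One multi-head attention sub-layer: W^Q, W^K, W^V for each head, plus W^O.
attn_params = 3 * h * d_model * d_k + d_model * d_model

# One feed-forward sub-layer: W1, b1, W2, b2.
ffn_params = d_model * d_ff + d_ff + d_ff * d_model + d_model

# 6 encoder layers have 1 attention sub-layer each; 6 decoder layers have 2.
n_attn_sublayers = 6 * 1 + 6 * 2
n_ffn_sublayers = 6 + 6

total_attn = n_attn_sublayers * attn_params
total_ffn = n_ffn_sublayers * ffn_params

print(f"attention sub-layer: {attn_params:,}")
print(f"FFN sub-layer:       {ffn_params:,}")
print(f"all attention:       {total_attn:,}")
print(f"all FFN:             {total_ffn:,}")
```

Per sub-layer the FFN is twice the size of attention, but the decoder's extra attention sub-layer narrows the gap across the whole stack.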

Putting it all together, here is the information flow through one encoder layer:

Input

x (512-dim vector per token)

↓

Multi-Head Attention

a = MultiHead(x, x, x) — each token gathers context from all others

↓

Add & Norm

x′ = LayerNorm(x + a) — residual connection + normalization

↓

Feed-Forward

f = FFN(x′) — independent nonlinear transformation per token

↓

Add & Norm

x″ = LayerNorm(x′ + f) — output of this layer, input to the next

One crucial design choice: all sub-layers and embedding layers produce outputs of dimension d_model = 512. This uniform dimensionality makes residual connections trivial — no projection needed, just add.

The paper provides a beautiful comparison of self-attention vs recurrence vs convolution (their Table 1):

Layer Type	Complexity per Layer	Sequential Ops	Max Path Length
Self-Attention	O(n² · d)	O(1)	O(1)
Recurrent	O(n · d²)	O(n)	O(n)
Convolutional	O(k · n · d²)	O(1)	O(log_k(n))

The maximum path length is how many sequential steps or layers a signal must traverse to connect two arbitrary positions. For self-attention, it is O(1): any two tokens connect directly in a single layer. For recurrence, it is O(n): information from token 1 must pass through n hidden states to reach token n. This is why the Transformer handles long-range dependencies so well.

The cost is the O(n²) term: self-attention computes a score for every pair of tokens. For a 1024-token sequence, that is about 1 million scores per head, or about 8 million per layer with 8 heads. This quadratic cost is the Transformer's Achilles' heel for very long sequences, and it motivated later work on efficient attention (Linformer, Performer, FlashAttention).

Here is the full set of hyperparameters for the Transformer base model:

Parameter	Value	Role
d_model	512	Embedding dimension and residual stream width
N	6	Number of encoder layers (and decoder layers)
h	8	Number of attention heads per layer
d_k = d_v	64	Key/value dimension per head (= d_model/h)
d_ff	2048	Feed-forward inner dimension (4× d_model)
P_drop	0.1	Dropout rate
Total params	~65M	Base model parameter count

In the Transformer encoder, which component is responsible for mixing information between different token positions?

The feed-forward network
The multi-head self-attention sub-layer
Layer normalization

Chapter 7: The Decoder Stack

The decoder also has N = 6 identical layers, but each layer has three sub-layers instead of two.

Sub-layer 1: Masked Multi-Head Self-Attention

The decoder attends over its own (shifted) output sequence, but each position sees only itself and earlier positions. Future positions are masked out.

↓

Sub-layer 2: Encoder-Decoder Attention

Queries come from the decoder. Keys and values come from the encoder output. This is how the decoder "reads" the input sentence.

↓

Sub-layer 3: Feed-Forward Network

Same as the encoder FFN — applied independently to each position.

Why masking? During training, the decoder receives the entire target sentence at once (for parallelism). But at inference time, it generates tokens one at a time. To make training match inference, we mask future positions: when computing attention for position t, the model cannot see positions t+1, t+2, etc. This is implemented by setting the corresponding entries in the QK^T matrix to −∞ before softmax, which drives those attention weights to zero.

The causal mask: Set score(i, j) = −∞ whenever j > i. After softmax, these entries become 0. Token i can only attend to tokens 0 through i. This preserves the autoregressive property: each prediction depends only on previously generated tokens.

Encoder-decoder attention is the bridge between the two halves. The queries come from the decoder (asking "what in the source sentence is relevant to what I am generating right now?"), while the keys and values come from the encoder's final output. This mirrors the attention mechanism in older seq2seq models — but here the encoder representations are computed purely with self-attention, not recurrence.

To summarize, the Transformer uses multi-head attention in three distinct ways:

Attention Type	Q from	K, V from	Masking?	Purpose
Encoder self-attention	Encoder	Encoder	None	Each source token attends to all source tokens
Masked decoder self-attention	Decoder	Decoder	Causal	Each target token attends only to previous targets
Encoder-decoder attention	Decoder	Encoder	None	Each target token attends to all source tokens

This three-way use of the same mechanism is elegant. The same multi-head attention code handles all three cases — the only difference is where Q, K, V come from and whether a causal mask is applied. A single function with three arguments (Q_source, KV_source, mask) implements all of: self-attention, cross-attention, and causal self-attention. This economy of design is one reason the Transformer was so easy to implement and scale.
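
To make the "one function, three uses" point concrete, here is a minimal single-head sketch in plain Python. It is an illustration under simplifying assumptions, not the paper's implementation: the learned projections W^Q, W^K, W^V and the split into multiple heads are omitted, so keys and values are both taken directly from `kv_source`. The names `attention`, `q_source`, `kv_source`, and `causal` are chosen here for illustration.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q_source, kv_source, causal=False):
    """Scaled dot-product attention over lists of vectors (single head, no projections)."""
    d_k = len(kv_source[0])
    outputs = []
    for i, q in enumerate(q_source):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in kv_source]
        if causal:
            # Position i may only see positions 0..i.
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, kv_source))
                        for c in range(len(kv_source[0]))])
    return outputs


source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stand-in encoder states
target = [[0.5, 0.5], [1.0, 0.0]]               # stand-in decoder states

encoder_self = attention(source, source)                 # encoder self-attention
decoder_self = attention(target, target, causal=True)    # masked decoder self-attention
cross = attention(target, source)                        # encoder-decoder attention
```

With the causal flag set, the first target position can only attend to itself, so `decoder_self[0]` is just `target[0]`.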

At inference time, decoding uses beam search. At each step, instead of greedily taking the most probable token, it maintains k candidate sequences (k = 4 in the paper) and extends each by one token. After the full output is generated, the highest-scoring candidate (under the model's log-probability, adjusted by a length penalty) is selected. Beam search usually improves BLEU over greedy decoding, although the size of the gain varies with the model and the dataset.

Let's be concrete about how the mask works. For a sequence of 4 tokens, the attention score matrix after masking (but before softmax) looks like:

	tok 0	tok 1	tok 2	tok 3
tok 0	s₀₀	−∞	−∞	−∞
tok 1	s₁₀	s₁₁	−∞	−∞
tok 2	s₂₀	s₂₁	s₂₂	−∞
tok 3	s₃₀	s₃₁	s₃₂	s₃₃

The −∞ entries become 0 after softmax. Token 2, for example, can only attend to tokens 0, 1, and 2. It knows nothing about token 3. This is the causal mask — it makes the attention lower-triangular, ensuring that generation is autoregressive.
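
The effect of the mask is easy to check numerically. The sketch below (plain Python, with a hypothetical helper name `causal_softmax`) masks a 4×4 score matrix and applies softmax row by row. All raw scores are set to zero here purely so that the resulting weights are easy to read.

```python
import math


def causal_softmax(scores):
    """Set score(i, j) to -inf wherever j > i, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        peak = max(masked)                       # finite, since j = i is never masked
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


weights = causal_softmax([[0.0] * 4 for _ in range(4)])
for row in weights:
    print([round(w, 2) for w in row])
```

Each row spreads its weight evenly over the positions it is allowed to see, and every entry above the diagonal is exactly zero.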

Weight tying. The model ties the input embedding, output embedding, and the pre-softmax linear transformation. They all share the same weight matrix. This is a powerful regularization: it forces the model to use a consistent token representation throughout. If "cat" has embedding vector v, then the model will also predict "cat" by looking for outputs that are close to v in the same space.

The embedding layers multiply the shared weights by √d_model. The paper does not say why. A common explanation runs as follows: with d_model = 512, if the learned embeddings are initialized so that each component is roughly 1/√512 ≈ 0.044 in magnitude (to keep the vector norm near 1), they are much smaller than the positional encodings, whose components reach 1.0 (since sin and cos are bounded by 1). Without the √512 ≈ 22.6× scaling, the positional signal would overwhelm the token identity. After scaling, the token and position contributions are on a comparable scale.

Inference: autoregressive decoding. At test time, the decoder generates tokens one at a time. It starts with a special start token, feeds it through the decoder, predicts the first output token, then feeds both tokens through the decoder, predicts the second, and so on. This is inherently sequential: you cannot predict token 5 without first generating tokens 1-4. But the encoder side runs once in parallel over the entire input. Training is different: all target positions are processed simultaneously under the causal mask, which buys parallel speed at train time while preserving correct autoregressive behavior.
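
The shape of that loop can be sketched in a few lines. The `encode` and `decode_step` callables below are hypothetical stand-ins for a trained encoder and decoder (the toy versions simply upper-case the source tokens); the point is only to show that the encoder is called once while the decoder is called once per generated token.

```python
def greedy_decode(encode, decode_step, source, start="<s>", end="</s>", max_len=50):
    memory = encode(source)                    # encoder runs once over the whole input
    output = [start]
    while len(output) < max_len:
        token = decode_step(memory, output)    # decoder sees only what has been generated
        output.append(token)
        if token == end:
            break
    return output[1:]


# Toy stand-ins (hypothetical): "translate" by upper-casing each source token.
calls = {"encode": 0, "decode": 0}


def toy_encode(source):
    calls["encode"] += 1
    return list(source)


def toy_decode_step(memory, generated):
    calls["decode"] += 1
    position = len(generated) - 1              # real tokens produced so far
    return memory[position].upper() if position < len(memory) else "</s>"


result = greedy_decode(toy_encode, toy_decode_step, ["le", "chat"])
print(result, calls)
```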

In encoder-decoder attention, where do the queries, keys, and values come from?

Queries from the decoder, keys and values from the encoder
All three from the encoder
All three from the decoder

Chapter 8: Showcase — Attention Visualizer

Time to see attention in action. This interactive visualization computes self-attention step by step on a short sentence. Each token gets random query, key, and value vectors. You can see the raw dot-product scores, the scaled scores, the softmax weights, and the resulting attention patterns.

Use the head slider to see how different attention heads produce different patterns. Each head has independent Q, K, V projections, so each head can specialize in a different linguistic relationship.

What to look for: Watch how some heads produce sharp attention (attending strongly to one token) while others produce diffuse attention (spreading weight across many tokens). In real Transformers, sharp heads often capture syntactic relations (subject-verb agreement) and diffuse heads capture broader semantic context.

The visualization below performs these exact steps:

Step 1

Tokenize the sentence (split on spaces). Each token gets a pseudo-random embedding.

↓

Step 2

Project through W^Q and W^K (different random matrices per head) to get Q and K.

↓

Step 3

Compute QK^T / √d_k, apply softmax. The heatmap shows the resulting attention weights.
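
For readers without the interactive widget, the three steps can be approximated in plain Python. This is a sketch under assumptions, not the page's actual code: embeddings are seeded by the token text, the per-head matrices W^Q and W^K are seeded by the head number, and the sizes `d_model=8` and `d_k=4` are arbitrary small values chosen for readability.

```python
import math
import random


def embed(token, d_model):
    rng = random.Random(token)                 # seeded by the token text
    return [rng.gauss(0.0, 1.0) for _ in range(d_model)]


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(x, W):
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def head_attention(sentence, head, d_model=8, d_k=4):
    tokens = sentence.split()                  # Step 1: tokenize on spaces, embed
    X = [embed(tok, d_model) for tok in tokens]
    rng = random.Random(head)                  # Step 2: different random matrices per head
    W_q = random_matrix(rng, d_model, d_k)
    W_k = random_matrix(rng, d_model, d_k)
    Q = [project(x, W_q) for x in X]
    K = [project(x, W_k) for x in X]
    weights = []                               # Step 3: softmax(QK^T / sqrt(d_k)) per row
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return tokens, weights


tokens, head1 = head_attention("the cat sat on the mat", head=1)
_, head2 = head_attention("the cat sat on the mat", head=2)
```

Because no positional information is added in this sketch, the two occurrences of "the" produce identical rows; the two heads differ only through their projection matrices.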

Try different sentences. Notice how changing even one word shifts the attention pattern. Each head has its own "personality" — Head 1 tends to attend locally (nearby words), Head 2 attends to sentence boundaries, Head 3 spreads attention diffusely, and Head 4 captures longer-range patterns. In a real Transformer, these specializations emerge from training on millions of sentences.

Reading the heatmap. Each row corresponds to a query token (the token doing the attending). Each column corresponds to a key token (the token being attended to). A bright cell at row i, column j means "token i pays strong attention to token j." Each row sums to 1.0 — the attention weights are a probability distribution.

In the paper's appendix, the authors show real attention patterns from a trained Transformer. Two heads in layer 5 (of 6) appear to be involved in anaphora resolution: the word "its" attends sharply to a small number of words, including its referent "Law." Other heads in the same layer show behavior that seems related to the structure of the sentence. Different heads have evidently learned different functions, without any explicit supervision.

The interpretability surprise. The paper's authors did not design these patterns. They emerged purely from the translation objective. The multi-head architecture gave the model enough "slots" to specialize, and gradient descent found useful specializations automatically. This interpretability was an unexpected bonus, and attention patterns became an early entry point for later work on interpreting Transformers.

Why do different attention heads in the same layer produce different attention patterns?

Because each head has its own learned W^Q, W^K, W^V projection matrices, projecting into different subspaces
Because each head processes a different subset of tokens
Because each head uses a different activation function

Chapter 9: The Experiments

The paper evaluated the Transformer on two machine translation benchmarks: WMT 2014 English-to-German (4.5M sentence pairs) and WMT 2014 English-to-French (36M sentence pairs).

The results were decisive:

Model	EN-DE BLEU	EN-FR BLEU	Training cost (EN-DE)
GNMT + RL Ensemble	26.30	41.16	1.8 × 10²⁰ FLOPs
ConvS2S Ensemble	26.36	41.29	7.7 × 10¹⁹ FLOPs
Transformer (base)	27.3	38.1	3.3 × 10¹⁸ FLOPs
Transformer (big)	28.4	41.8	2.3 × 10¹⁹ FLOPs

The Transformer (big) surpassed all previous models, including ensembles of multiple models, while using a fraction of the training compute. The base model trained in 12 hours on 8 P100 GPUs. The big model took 3.5 days. The costs above are the paper's estimates for English-to-German; for English-to-French, the two ensembles cost more than 10²¹ FLOPs each.

The efficiency revolution: On English-to-German, the Transformer base model used 3.3 × 10¹⁸ FLOPs to outperform ensembles that cost 7.7 × 10¹⁹ to 1.8 × 10²⁰ FLOPs. That is roughly 23–55× cheaper. (On English-to-French the base model did not beat the ensembles; the big model did.) The parallelism of self-attention was not just theoretically nice: it translated directly into dramatically lower training cost.

The paper also included an ablation study (Table 3) that revealed what mattered most:

Variation	Effect on BLEU
Single head instead of 8	−0.9 BLEU
Smaller key dimension (d_k = 16)	−0.7 BLEU
Big model (d_model = 1024, 16 heads)	+0.6 BLEU
Dropout removed	−1.2 BLEU
Learned positional embeddings	≈ same BLEU (−0.1)

These numbers are measured on the English-to-German development set (newstest2013), where the base model scores 25.8 BLEU. Multi-head attention mattered: a single head cost 0.9 BLEU, although quality also dropped with too many heads (32 heads scored 25.4). Shrinking the key dimension hurt as well. Dropout was very important for regularization: removing it caused the largest drop in the table.

The paper also tested the Transformer on English constituency parsing, a task where the input is a sentence and the output is its syntactic tree structure. Despite being designed for translation, the Transformer achieved an F1 score of 91.3 on the Wall Street Journal portion of the Penn Treebank, outperforming all previously published models trained on that data alone except the Recurrent Neural Network Grammar. In a semi-supervised setup with about 17M additional sentences, it achieved 92.7 F1, ahead of the earlier semi-supervised parsers in the paper's comparison. This was a powerful signal that the architecture was not just a good translation model: it was a general-purpose sequence-to-sequence machine.

The big model configuration doubled the model dimension, the feed-forward dimension, and the number of heads:

Parameter	Base	Big
d_model	512	1024
Heads (h)	8	16
d_ff	2048	4096
N (layers)	6	6
P_drop	0.1	0.3
Parameters	~65M	~213M
EN-DE BLEU	27.3	28.4
Training time	12 hours	3.5 days

The big model used stronger dropout (0.3 vs 0.1) to compensate for its larger capacity. Even at 213M parameters — tiny by today's standards — it was the best translation model in the world. Modern LLMs with billions of parameters are direct descendants of this architecture, scaled up along the exact dimensions the paper identified: d_model, N, h, and d_ff.

The warmup learning rate schedule. The paper introduced a learning rate schedule that has become standard in Transformer training:

lr = d_model^−0.5 · min(step^−0.5, step · warmup^−1.5)

For the first 4000 steps, the learning rate increases linearly. After that, it decays proportionally to the inverse square root of the step number. The intuition: at the start of training, the model's gradients are noisy and unpredictable (the attention weights are essentially random). A small, gradually increasing learning rate prevents early instability. Once the model has learned some structure, the learning rate peaks and then decays smoothly.
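
The schedule is short enough to write out directly. The function below is a direct transcription of the formula above with `warmup_steps=4000`; the function name and the guard against step 0 are additions made here for illustration.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # guard added here: the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 2000, 4000, 16000, 100000):
    print(step, f"{transformer_lr(step):.2e}")
```

The two branches of the `min` meet at step 4000, where the rate peaks at roughly 7e-4 for d_model = 512. At step 2000 (linear warmup) and at step 16000 (inverse square root decay) the rate is half of that peak.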

They used Adam with β₁ = 0.9, β₂ = 0.98, and ε = 10⁻⁹. The high β₂ (0.98 instead of the typical 0.999) makes the optimizer more responsive to recent gradient magnitudes.

Regularization. Three forms: (1) residual dropout (P_drop = 0.1) applied to the output of each sub-layer before the residual addition; (2) dropout on the sum of token embeddings and positional encodings; (3) label smoothing (ε_ls = 0.1), which replaces the hard one-hot target with a softened distribution. Label smoothing hurt perplexity (the model is less confident) but improved BLEU score (the model is better calibrated).

Label smoothing in detail. Instead of training the model to predict P("chat") = 1.0 and P(everything else) = 0.0, label smoothing uses P("chat") = 0.9 and distributes the remaining 0.1 uniformly across all 37,000 vocabulary tokens. This prevents the model from becoming infinitely confident, which improves generalization. The perplexity goes up (the model is less "sure" of the right answer), but the BLEU score goes up too (the model makes fewer catastrophic errors on ambiguous cases).
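
A small numeric sketch shows why smoothing discourages extreme confidence. It follows the description above literally (0.1 spread uniformly over all tokens, so the correct token ends up with 0.9 plus its share); some implementations spread the mass over the other tokens only. A 5-token vocabulary stands in for the real 37,000, and the function names are illustrative.

```python
import math


def smoothed_target(correct_index, vocab_size, epsilon=0.1):
    """Put 1 - epsilon on the correct token and spread epsilon uniformly over all tokens."""
    uniform = epsilon / vocab_size
    target = [uniform] * vocab_size
    target[correct_index] += 1.0 - epsilon
    return target


def cross_entropy(target, predicted):
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


target = smoothed_target(correct_index=2, vocab_size=5)
overconfident = [0.001, 0.001, 0.996, 0.001, 0.001]

matched_loss = cross_entropy(target, target)               # prediction equals the soft target
overconfident_loss = cross_entropy(target, overconfident)  # prediction is nearly one-hot
print(round(matched_loss, 3), round(overconfident_loss, 3))
```

Under the smoothed target, the near one-hot prediction is penalized more than the prediction that matches the soft distribution, which is the pressure that keeps the model from becoming overconfident.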

Inference details. For decoding, the paper used beam search with beam size 4 and length penalty α = 0.6. Beam search maintains 4 candidate translations at each step, expanding each by one token and keeping the 4 best-scoring candidates. The length penalty discourages the model from producing very short translations (which tend to have higher per-token probability but miss content). The maximum output length was set to input length + 50 tokens.
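
Here is a toy sketch of beam search with a length penalty. Two things are assumptions rather than facts from this article: the penalty formula ((5 + length) / 6)^α is the one from the GNMT paper, which the Transformer paper cites for its decoding setup, and `next_log_probs` is a hypothetical callable standing in for the trained decoder. The toy model is built so that the greedy first choice leads to a worse complete sequence.

```python
import math


def length_penalty(length, alpha=0.6):
    # Assumed form, taken from GNMT (Wu et al., 2016).
    return ((5 + length) / 6) ** alpha


def beam_search(next_log_probs, beam_size=4, max_len=10, end="</s>", alpha=0.6):
    beams = [([], 0.0)]                      # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == end else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)                   # keep unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[1] / length_penalty(len(c[0]), alpha))[0]


def toy_model(prefix):
    """Hypothetical next-token distribution; any unlisted prefix ends the sentence."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 1.0},
    }
    probs = table.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


best = beam_search(toy_model, beam_size=4)
greedy = beam_search(toy_model, beam_size=1)
print(best, greedy)
```

With a beam of 1 the search commits to "a" (probability 0.6) and ends with a sequence of probability 0.3; with a beam of 4 it keeps "b" alive and finds the sequence with probability 0.4.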

For the reported BLEU scores, the base model used checkpoint averaging over the last 5 checkpoints (written at 10-minute intervals), and the big model averaged the last 20 checkpoints. Checkpoint averaging is a simple ensembling technique that smooths out the noise of individual training steps and often gives a small BLEU improvement at no extra training cost.
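
Checkpoint averaging itself is just an element-wise mean over saved parameters. The sketch below assumes each checkpoint is a dict mapping a parameter name to a flat list of floats, a simplification chosen for illustration; real checkpoints hold tensors, but the arithmetic is the same.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts that share the same names and sizes."""
    count = len(checkpoints)
    first = checkpoints[0]
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / count
               for i in range(len(first[name]))]
        for name in first
    }


last_checkpoints = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
averaged = average_checkpoints(last_checkpoints)
print(averaged)
```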

Which variation caused the largest drop in BLEU score in the ablation study?

Reducing from 8 heads to 1 head (−0.9 BLEU)
Removing dropout (−1.2 BLEU)
Using learned positional embeddings (≈ same BLEU)

Chapter 10: Connections

This paper is arguably the most influential machine learning paper ever published. Its impact radiates in every direction.

The Transformer and GPT. Radford et al. (2018) took the Transformer decoder, removed the encoder entirely, and trained it as a language model on raw text. This became GPT, and the decoder-only Transformer became the backbone of GPT-2, GPT-3, GPT-4, Claude, LLaMA, and essentially every modern LLM. The insight: you do not need an encoder-decoder architecture for language modeling — the decoder alone suffices.

The Transformer and BERT. Devlin et al. (2018) took the Transformer encoder, removed the decoder, and trained it with masked language modeling. BERT showed that the encoder produces powerful bidirectional representations useful for classification, question answering, and other tasks. The encoder-only and decoder-only branches both emerged from this paper.

The Transformer and Vision. Dosovitskiy et al. (2020) showed that the Transformer works on images too. Split an image into patches, treat each patch as a token, and apply the standard Transformer encoder. Vision Transformers (ViT) now match or exceed CNNs on image classification. The architecture was so general that it transferred to an entirely different modality.

The Transformer and Scaling Laws. Kaplan et al. (2020) found that Transformer language-model performance follows smooth power laws as you scale model size, data, and compute. This predictability enabled the strategic scaling that produced GPT-3 and beyond. The study focused on Transformers and reported that they scale more favorably than LSTMs, which helps explain why scaling efforts concentrated on this architecture.

The Transformer and Efficient Attention. The O(n²) cost of self-attention motivated an entire research field: Linformer (low-rank approximation), Performer (kernel trick to linearize attention), Longformer (local + global attention), and FlashAttention (IO-aware exact attention). All of these exist because the original Transformer's quadratic cost limits context length. The quest for longer context is a direct consequence of this paper's design.

The Transformer and Multi-Modal AI. The architecture turned out to be so general that it works on almost any modality: audio (Whisper), proteins (AlphaFold2's Evoformer), code (Codex), molecules, weather prediction, and even game playing. The key insight is that any structured data can be tokenized into a sequence, and self-attention can learn relationships within that sequence. The Transformer became the "universal function approximator" for structured data.

Transformer (2017)

Self-attention replaces recurrence. Parallel training. State-of-the-art translation.

↓ spawned

GPT, BERT, T5 (2018–2019)

Decoder-only, encoder-only, and encoder-decoder variants for different tasks

↓ scaled to

GPT-3, PaLM, LLaMA, Claude (2020+)

Billions of parameters, emergent capabilities, the foundation of modern AI

The lasting impact. Before this paper, machine learning was fragmented: CNNs for vision, RNNs for language, specialized architectures for each domain. After this paper, one architecture conquered everything. The Transformer is now the universal backbone for language, vision, audio, protein folding, robotics, and more. The title was prophetic — attention really was all you needed.

A note on the authors. The eight authors all contributed roughly equally (the listing order is random, as stated in the footnote). Several went on to found major AI companies: Noam Shazeer co-founded Character.AI, Aidan Gomez co-founded Cohere, Illia Polosukhin co-founded NEAR Protocol, and Niki Parmar co-founded Adept AI. Jakob Uszkoreit co-founded Inceptive (RNA design). A single paper seeded half a dozen billion-dollar companies.

What the paper got "wrong." The original Transformer used post-layer-normalization (LayerNorm after the residual add: LayerNorm(x + Sublayer(x))). Later work showed that pre-layer-normalization (LayerNorm before the sub-layer: x + Sublayer(LayerNorm(x))) trains more stably, especially at scale. The warmup schedule, while effective, was somewhat ad-hoc; later work found that AdamW with cosine decay works as well or better. The fixed sinusoidal positional encoding has been largely replaced by learned embeddings (GPT) or rotary encodings (RoPE), the latter chosen in part for better behavior on long sequences. And the ReLU activation in the FFN has been superseded by GELU, SwiGLU, and other smoother variants.

But these are refinements, not refutations. The core architecture (multi-head self-attention, residual connections, feed-forward networks, layer normalization) remains in place in essentially every modern LLM. A time-traveler from 2017 reading the architecture diagram of a modern open model such as LLaMA would recognize almost every component. The Transformer has proven to be not just a good model, but the right model: a rare case where the first design was so well-conceived that seven years of intensive research have produced only incremental improvements.
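
The post-LN versus pre-LN difference mentioned above comes down to where the normalization sits relative to the residual add. The sketch below uses a layer norm without the learned gain and bias and a hypothetical stand-in sub-layer (`double`), so it illustrates only the wiring, not a trained model.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (learned gain and bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln(x, sublayer):
    # Original Transformer: LayerNorm(x + Sublayer(x))
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln(x, sublayer):
    # Later variant: x + Sublayer(LayerNorm(x))
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def double(v):
    """Hypothetical stand-in for an attention or feed-forward sub-layer."""
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0]
post = post_ln(x, double)
pre = pre_ln(x, double)
print([round(v, 3) for v in post], [round(v, 3) for v in pre])
```

In the post-LN form the whole residual stream is renormalized at every layer, while in the pre-LN form the input x passes through the addition untouched, which is one common explanation for its more stable training.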

The implementation. The paper's open-source implementation was done in the "tensor2tensor" library, primarily by Łukasz Kaiser and Aidan Gomez. It became the reference implementation that the entire community built on. Later, the "Annotated Transformer" blog post by Harvard NLP translated the paper into readable, line-by-line PyTorch code. Today, every deep learning framework (PyTorch, JAX, TensorFlow) has Transformer layers as built-in primitives. The architecture is so fundamental that it is infrastructure, not research code.

Paper details. "Attention Is All You Need," Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. NeurIPS 2017 (as NIPS). arXiv:1706.03762. Submitted June 2017. Over 130,000 citations.

What architectural modification did GPT make to the original Transformer from this paper?

It kept only the decoder and removed the encoder, using masked self-attention for language modeling
It added convolutional layers between attention layers
It replaced multi-head attention with single-head attention
