The Complete Beginner's Path

Understand the Transformer
From Scratch

The architecture behind GPT, BERT, LLaMA, and every frontier language model. One paper changed everything — here's how it works.

Prerequisites: Basic linear algebra + Intuition for neural nets. That's it.
11 chapters · 12+ interactive simulations · 0 assumed ML knowledge

Chapter 0: What Is a Sequence?

Text is a sequence of words. Audio is a sequence of samples. Video is a sequence of frames. A stock price is a sequence of values over time. Before the Transformer, we processed sequences with recurrent neural networks (RNNs) — one element at a time, left to right. It was slow and it forgot things.

The Transformer processes every element at once. Instead of a conveyor belt, it's a spotlight that shines on the entire sequence simultaneously. This parallelism is why Transformers train so fast on GPUs — and why they scale to billions of parameters.

The core shift: RNNs read a book one word at a time, trying to remember what happened on page 1 when they reach page 300. Transformers can see every page at once and decide which pages matter for understanding each word.
Sequence Types

Click a domain to see how it's tokenized into a sequence.

Key insight: Order matters. "Dog bites man" and "Man bites dog" have the same words but very different meanings. Any useful sequence model must somehow encode position — we'll see how in Chapter 4.
Check: Why do Transformers train faster than RNNs?

Chapter 1: Attention from Scratch

Imagine reading the sentence: "The cat sat on the mat because it was tired." What does "it" refer to? You instantly know it means "the cat." Your brain attends to "cat" when processing "it." That's attention.

In a neural network, each token is a vector (a list of numbers). Attention lets each token compute a weighted average of all other tokens' vectors. The weights come from dot products — measuring how similar two tokens are. Similar tokens get high weights; irrelevant ones get near-zero.

Interactive: Token Similarity

Click any token to see its dot-product similarity with every other token. Brighter = higher similarity.

similarity(q, k) = q · k = ∑ᵢ qᵢkᵢ
Why dot products? Two vectors pointing in the same direction have a large dot product. Orthogonal vectors have zero. This gives us a fast, differentiable way to measure "how much should token A care about token B?"
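The idea fits in a few lines of NumPy. The vectors below are made-up toy values, not real embeddings:

```python
import numpy as np

# Made-up 4-dimensional token vectors (toy values, not real embeddings).
cat = np.array([0.9, 0.1, 0.0, 0.3])
it  = np.array([0.8, 0.2, 0.1, 0.2])
mat = np.array([0.1, 0.9, 0.4, 0.0])

# Dot product = sum of elementwise products; large when vectors align.
sim_cat = it @ cat
sim_mat = it @ mat

assert sim_cat > sim_mat  # "it" aligns more with "cat" than with "mat"
```

Because "it" points in roughly the same direction as "cat", their dot product is larger, and attention would weight "cat" more heavily.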
Check: What does a high dot product between two token vectors mean?

Chapter 2: Scaled Dot-Product Attention

Raw attention has three ingredients. Each token produces three vectors by multiplying its embedding with learned weight matrices: a Query (what this token is looking for), a Key (what this token offers to others), and a Value (the information it passes along when attended to).

The attention score between tokens i and j is Qᵢ · Kⱼ. We divide by √dₖ to prevent the dot products from getting too large (which would make softmax saturate and kill gradients). Then softmax converts scores to weights that sum to 1.

Attention(Q, K, V) = softmax( Q Kᵀ / √dₖ ) · V
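The whole formula is a few lines of NumPy. Here Q, K, V are random toy matrices, not learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) scaled similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted average of Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 2))   # four tokens, 2-D queries
K = rng.normal(size=(4, 2))
V = rng.normal(size=(4, 2))

out, w = attention(Q, K, V)
assert out.shape == (4, 2)
assert np.allclose(w.sum(axis=1), 1.0)  # rows are probability distributions
```

Each output row is a blend of all Value vectors, weighted by how well that token's Query matches every Key.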
Interactive: Compute Attention Weights

Four tokens with 2D Q and K vectors. Watch how weights shift as you drag the query vector of the selected token.

Why scale? If dₖ = 64, dot products can easily reach values like 30–50. Softmax of 50 is essentially 1.0, making the model attend to only one token and ignore the rest. Dividing by √64 = 8 keeps things in a reasonable range.
Check: What is the purpose of dividing by √dₖ?

Chapter 3: Multi-Head Attention

One set of Q, K, V can only learn one type of pattern. But language has many simultaneous relationships: syntax (subject-verb), coreference (pronoun-noun), semantic similarity, positional patterns. Multi-head attention runs several attention operations in parallel, each with its own learned Q, K, V projections.

If the model dimension is d = 512 and we use h = 8 heads, each head works with d/h = 64 dimensions. After computing attention independently, we concatenate all head outputs and project back to the full dimension.

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) Wᴼ
Input
x ∈ ℝ^(n×512)
Split into 8 heads
Each head: Q, K, V ∈ ℝ^(n×64)
Parallel Attention
8 independent softmax(QKᵀ/√64)V
Concat + Project
Concat all heads → Wᴼ → ℝ^(n×512)
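The split-and-concat bookkeeping is just a reshape. A sketch with the per-head attention itself skipped, to show that the split loses nothing:

```python
import numpy as np

n, d, h = 6, 512, 8      # sequence length, model dim, number of heads
d_head = d // h          # 64 dimensions per head
x = np.random.default_rng(1).normal(size=(n, d))

# Split: (n, d) -> (h, n, d_head). Each head sees its own 64-dim slice.
heads = x.reshape(n, h, d_head).transpose(1, 0, 2)
assert heads.shape == (8, 6, 64)

# After per-head attention (skipped here), concatenate heads back together.
concat = heads.transpose(1, 0, 2).reshape(n, d)
assert np.allclose(concat, x)  # split followed by concat is lossless
```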
Interactive: What Each Head Sees

Select a head to see its attention pattern. Each head learns to focus on different relationships.

Key insight: Multi-head attention costs roughly the same as single-head attention with full dimensionality. The split is free — and you get richer, more diverse attention patterns.
Check: Why use multiple attention heads instead of one?

Chapter 4: Positional Encoding

Attention has no built-in notion of order: shuffle the input tokens and the outputs are simply shuffled the same way, because the mechanism only compares token vectors, never positions. "Cat sat mat" and "Mat cat sat" would produce the same set of attention relationships. We need to inject position information explicitly.

Three approaches:

Sinusoidal (Original)
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Learned (BERT, GPT-2)
Trainable embedding matrix E_pos ∈ ℝ^(max_len×d)
RoPE (LLaMA, Modern)
Rotate Q and K vectors by position-dependent angle. Relative position encoded in the dot product itself.
Interactive: Sinusoidal Position Encodings

Each row is a position (0–31). Each column is a dimension. Color = encoding value. Notice the wave patterns at different frequencies.

Why sinusoids? The original paper showed that for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos). This means the model can learn to attend to relative positions using simple linear operations.
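A NumPy sketch of these encodings; `sinusoidal_pe` is an illustrative helper name, computing sine and cosine at geometrically spaced frequencies:

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    pos = np.arange(max_len)[:, None]   # positions 0..max_len-1, as a column
    i = np.arange(d // 2)[None, :]      # one frequency per dimension pair
    freq = 1.0 / 10000 ** (2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(pos * freq)    # even dimensions: sine
    pe[:, 1::2] = np.cos(pos * freq)    # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(32, 16)
assert pe.shape == (32, 16)
assert np.allclose(pe[0, 0::2], 0.0)  # at position 0: sin(0) = 0
assert np.allclose(pe[0, 1::2], 1.0)  # at position 0: cos(0) = 1
```

Every value stays in [−1, 1], so the encoding can simply be added to the token embeddings.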
Check: Without positional encoding, attention is:

Chapter 5: The Encoder Block

An encoder block takes a sequence and returns a refined sequence of the same shape. It has two sub-layers, each wrapped in a residual connection (add the input back) and layer normalization. The residual connections are critical — they let gradients flow straight through, enabling very deep stacks.

Input
x ∈ ℝ^(n×d)
LayerNorm
Normalize each token vector
Multi-Head Self-Attention
Each token attends to all tokens
↓ + residual
LayerNorm
Normalize again
Feed-Forward Network
Two linear layers with ReLU/GELU: d → 4d → d
↓ + residual
Output
Same shape: ℝ^(n×d)
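The whole block fits in a short NumPy sketch. This is a pre-LN variant with toy random weights and no learnable LayerNorm gain or bias, so it illustrates the data flow rather than a trainable implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(x, Wq, Wk, Wv, W1, W2):
    # Sub-layer 1: LayerNorm -> self-attention -> add residual
    h = layer_norm(x)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = x + attn                       # residual: add the input back
    # Sub-layer 2: LayerNorm -> FFN (d -> 4d -> d) -> add residual
    h = layer_norm(x)
    ffn = np.maximum(0, h @ W1) @ W2   # ReLU between the two linear layers
    return x + ffn

rng = np.random.default_rng(0)
n, d = 5, 16
x = rng.normal(size=(n, d))
Wq = rng.normal(size=(d, d)) * 0.1
Wk = rng.normal(size=(d, d)) * 0.1
Wv = rng.normal(size=(d, d)) * 0.1
W1 = rng.normal(size=(d, 4 * d)) * 0.1
W2 = rng.normal(size=(4 * d, d)) * 0.1

y = encoder_block(x, Wq, Wk, Wv, W1, W2)
assert y.shape == x.shape  # same shape in, same shape out
```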
Interactive: Data Flow Through an Encoder Block

Watch a token vector flow through each sub-layer. The residual stream carries information forward.

The FFN is a memory bank. The attention layer lets tokens talk to each other. The FFN processes each token independently — it's where the model stores factual knowledge. In large models, the FFN accounts for roughly two-thirds of all parameters.
Check: What is the purpose of the residual connection?

Chapter 6: The Decoder Block

The decoder block has the same structure as the encoder, plus two crucial additions: causal masking and (in encoder-decoder models) cross-attention.

Causal masking: During generation, token i must not see tokens i+1, i+2, ... (the future). We achieve this by setting those attention scores to −∞ before softmax, which forces their weights to zero. This creates a lower-triangular attention matrix.
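A NumPy sketch of the mask: attention scores above the diagonal are set to −∞ before softmax, so the resulting weights on future tokens are exactly zero.

```python
import numpy as np

n = 4
scores = np.random.default_rng(2).normal(size=(n, n))

# Token i may only attend to positions <= i: set future scores to -inf.
future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
scores[future] = -np.inf

e = np.exp(scores - scores.max(-1, keepdims=True))
weights = e / e.sum(-1, keepdims=True)  # softmax, row by row

assert np.allclose(np.triu(weights, k=1), 0.0)  # zero weight on the future
assert np.allclose(weights.sum(-1), 1.0)        # rows still sum to 1
assert weights[0, 0] == 1.0                     # token 0 sees only itself
```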

Cross-attention: In translation models, the decoder attends to the encoder's output. The decoder provides Q; the encoder provides K and V. This is how the decoder "reads" the source language.

Interactive: Causal Mask

The attention matrix before and after masking. White cells are visible; dark cells are masked (−∞). Each token can only see itself and earlier tokens.

Decoder Input
Previously generated tokens
Masked Self-Attention
Causal: can only look backward
↓ + residual
Cross-Attention
Q from decoder, K/V from encoder
↓ + residual
Feed-Forward Network
Same as encoder FFN
↓ + residual
Output
Logits over vocabulary
Decoder-only models (GPT, LLaMA) skip cross-attention entirely. They use only masked self-attention + FFN. The entire "encoder" is implicit: the prompt IS the encoding. This simplification is why decoder-only models dominate today.
Check: Why does the decoder use causal masking?

Chapter 7: Training

Training a Transformer for language modeling is deceptively simple: given a sequence of tokens, predict the next token at every position. The loss function is cross-entropy between the predicted probability distribution and the actual next token.

L = − ∑ₜ log P(xₜ₊₁ | x₁, ..., xₜ)

Teacher forcing: During training, we don't feed the model's own predictions back in as input for the next step. Instead, we always feed the true previous tokens. This is faster and more stable than training on the model's own samples, but it means the model never sees its own mistakes during training.
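A toy sketch of the objective: random logits stand in for the model, and the loss is cross-entropy on the true next token at every position. Teacher forcing means the targets are simply the input sequence shifted by one.

```python
import numpy as np

# Toy setup: 5-token sequence over a 5-word vocabulary. Random logits
# stand in for the model's output at each of the first 4 positions.
tokens = np.array([0, 1, 2, 3, 4])  # e.g. "the cat sat on mat"
vocab_size = 5
rng = np.random.default_rng(3)
logits = rng.normal(size=(len(tokens) - 1, vocab_size))

# Position t (fed the TRUE tokens x1..t) predicts token t+1, so the
# targets are the input shifted left by one.
targets = tokens[1:]
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
loss = -np.mean(np.log(probs[np.arange(len(targets)), targets]))

assert loss > 0                 # cross-entropy is positive unless P = 1
uniform = np.log(vocab_size)    # a uniform guesser scores log(5) ≈ 1.61
```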

Interactive: Cross-Entropy Loss

The model predicts a probability for the correct next token. Drag the slider to see how loss changes. Higher confidence in the right answer = lower loss.

The beauty of causal masking: With one forward pass, the model produces predictions for every position simultaneously. Position 1 predicts token 2, position 2 predicts token 3, and so on. One sequence gives you n−1 training examples for free.
Concept | What It Does
Next-token prediction | The training objective: predict xₜ₊₁ from x₁..ₜ
Cross-entropy loss | Measures how far the predicted distribution is from the truth
Teacher forcing | Use true tokens (not predictions) as input during training
AdamW optimizer | Adaptive learning rate + weight decay
Warmup + cosine decay | Gradually increase, then decrease, the learning rate
Check: What is teacher forcing?

Chapter 8: KV Cache & Inference

During generation, the model produces one token at a time. Without optimization, producing each new token means re-running attention over the entire prefix from scratch, so per-step cost grows with sequence length and total work for a full sequence grows cubically. The KV cache stores the Key and Value vectors from previous tokens so we never recompute them, cutting the total attention work for a full sequence to O(n²).

Step 1
Process "The" → compute K₁, V₁ → store in cache
Step 2
Process "cat" → compute K₂, V₂ → append to cache → attend to [K₁, K₂]
Step 3
Process "sat" → compute K₃, V₃ → append → attend to [K₁, K₂, K₃]
↓ ...
Step n
Only compute Qₙ for the new token; reuse all cached K, V
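The cache bookkeeping can be sketched directly. Random vectors stand in for the learned q/k/v projections here; the point is that each step appends one Key and one Value and reuses everything else:

```python
import numpy as np

rng = np.random.default_rng(4)
d_k = 8
K_cache, V_cache = [], []

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate_step(token):
    """Process ONE new token: compute its q, k, v and reuse cached K, V."""
    q = rng.normal(size=d_k)  # stand-ins for the token's learned projections
    k = rng.normal(size=d_k)
    v = rng.normal(size=d_k)
    K_cache.append(k)         # append once, never recompute
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)  # (t, d_k) each
    w = softmax(K @ q / np.sqrt(d_k))            # attend over all cached keys
    return w @ V

for t, tok in enumerate(["The", "cat", "sat"]):
    out = generate_step(tok)
    assert len(K_cache) == t + 1  # cache grows by exactly one per step
    assert out.shape == (d_k,)
```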
Interactive: Watch the KV Cache Grow

Click "Generate Token" to add one token. The cache (blue bars) grows while each step only computes one new Q (orange).

Memory cost: For a 70B parameter model with 128K context, the KV cache alone can require ~40 GB of memory. This is often the bottleneck for serving large models, not the model weights themselves.
Check: What does the KV cache store?

Chapter 9: MoE & Scaling

As models get bigger, we face a dilemma: more parameters = better quality, but also more compute per token. Mixture of Experts (MoE) breaks this tradeoff. Instead of one giant FFN, we have many smaller "expert" FFNs. A router selects the top-k experts for each token. Only those experts run.

y = ∑_{i ∈ TopK} gᵢ · Expertᵢ(x)
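A minimal sketch of top-k routing for a single token, with made-up router and expert weights (in a real model all of these are learned):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_experts, top_k, d = 8, 2, 4
rng = np.random.default_rng(5)
W_router = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]

x = rng.normal(size=d)                 # one token's vector
logits = x @ W_router
chosen = np.argsort(logits)[-top_k:]   # indices of the top-2 experts
gates = softmax(logits[chosen])        # renormalize over the chosen experts

# Only the 2 chosen experts run; the other 6 cost nothing for this token.
y = sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

assert len(chosen) == top_k
assert np.isclose(gates.sum(), 1.0)
assert y.shape == (d,)
```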
Interactive: Expert Routing

Each token is routed to 2 of 8 experts. Different tokens activate different experts. Click "Route" to see a new random routing.

Scaling Laws

Kaplan et al. (2020) discovered that model performance follows predictable power laws:

Loss ∝ N^(−0.076) where N = number of parameters. Double the parameters, loss drops ~5%. This is why labs race to build bigger models — the returns are predictable.
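The arithmetic behind that ~5% figure is easy to check:

```python
# Loss ∝ N^(-0.076): doubling N multiplies loss by 2^(-0.076).
ratio = 2 ** -0.076
assert 0.94 < ratio < 0.96  # about a 5% drop in loss
```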

Efficient Attention

Technique | Idea | Speedup
Flash Attention | Tile the computation to stay in SRAM; never materialize the full attention matrix | 2–4x
Ring Attention | Distribute the sequence across GPUs, passing KV blocks in a ring | Linear in #GPUs
Grouped-Query Attention | Share KV heads across multiple Q heads | 1.5–2x less KV memory
Sliding Window | Each token attends only to nearby tokens | O(n·w) instead of O(n²)
Check: What is the key advantage of Mixture of Experts?

Chapter 10: What Makes It Work

The Transformer's power isn't in any single component — it's in how they compose. Here are the key phenomena researchers have discovered:

The Residual Stream

Think of the residual connections as a communication bus. Each layer reads from the stream, processes information, and writes its contribution back. Information from early layers is never destroyed — it flows all the way to the final layer. This is fundamentally different from a pipeline where each stage replaces the previous output.

Induction Heads

One of the most remarkable discoveries: pairs of attention heads that implement in-context pattern completion. If the model sees "Harry Potter is a wizard... Harry Potter is a", the induction head copies "wizard" from the earlier occurrence. This is a key mechanism behind in-context learning — the ability to learn new tasks from examples in the prompt without any weight updates.

Interactive: Induction Head Pattern

The sequence repeats. Watch how the attention pattern forms a diagonal stripe shifted by the repeat length — that's the induction head copying from the first occurrence.


Emergent Abilities

Phenomenon | What Happens
In-context learning | Learns new tasks from examples in the prompt
Chain-of-thought | Step-by-step reasoning improves accuracy
Few-shot generalization | Solves unseen tasks with just a few examples
Tool use | Learns to call APIs, write code, use calculators
The deep mystery: We designed the Transformer for translation. We didn't design it for reasoning, coding, or creative writing. These abilities emerged from scale. Understanding why is one of the most important open questions in AI.
"Attention is all you need."
— Vaswani et al., 2017

You now understand the architecture that powers every frontier AI model. From dot products to multi-head attention, from encoder blocks to KV caches — this is the foundation of modern AI.