The architecture behind GPT, BERT, LLaMA, and every frontier language model. One paper changed everything — here's how it works.
Text is a sequence of words. Audio is a sequence of samples. Video is a sequence of frames. A stock price is a sequence of values over time. Before the Transformer, we processed sequences with recurrent neural networks (RNNs) — one element at a time, left to right. It was slow and it forgot things.
The Transformer processes every element at once. Instead of a conveyor belt, it's a spotlight that shines on the entire sequence simultaneously. This parallelism is why Transformers train so fast on GPUs — and why they scale to billions of parameters.
Click a domain to see how it's tokenized into a sequence.
Imagine reading the sentence: "The cat sat on the mat because it was tired." What does "it" refer to? You instantly know it means "the cat." Your brain attends to "cat" when processing "it." That's attention.
In a neural network, each token is a vector (a list of numbers). Attention lets each token compute a weighted average of all other tokens' vectors. The weights come from dot products — measuring how similar two tokens are. Similar tokens get high weights; irrelevant ones get near-zero.
Click any token to see its dot-product similarity with every other token. Brighter = higher similarity.
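The same idea can be sketched in a few lines of NumPy. The embedding values below are made up for illustration; real embeddings are learned:

```python
import numpy as np

# Toy 4-dimensional embeddings for three tokens (invented values for illustration).
cat = np.array([0.9, 0.1, 0.3, 0.7])
it  = np.array([0.8, 0.2, 0.4, 0.6])
mat = np.array([0.1, 0.9, 0.8, 0.1])

# The dot product acts as a similarity score:
# "it" aligns far more with "cat" than with "mat".
sim_cat = np.dot(it, cat)  # larger
sim_mat = np.dot(it, mat)  # smaller
print(sim_cat, sim_mat)
```

After a softmax over these scores, "cat" would receive most of the attention weight when processing "it".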
Raw attention has three ingredients: a Query (Q), a Key (K), and a Value (V). Each token produces all three vectors by multiplying its embedding with learned weight matrices:
The attention score between tokens i and j is Qᵢ · Kⱼ. We divide by √dₖ to prevent the dot products from getting too large (which would make softmax saturate and kill gradients). Then softmax converts scores to weights that sum to 1.
Four tokens with 2D Q and K vectors. Watch how weights shift as you drag the query vector of the selected token.
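Putting the recipe together — scores, scaling by √dₖ, softmax, weighted sum of Values — a minimal NumPy sketch of scaled dot-product attention (random Q, K, V standing in for learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) similarity matrix
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8                              # four tokens, 8-dim vectors
Q, K, V = rng.normal(size=(3, n, d))
out, weights = attention(Q, K, V)
print(weights.sum(axis=-1))              # every row of weights sums to 1
```

The output has the same shape as the input values: each token's new vector is a weighted average of all tokens' Value vectors.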
One set of Q, K, V can only learn one type of pattern. But language has many simultaneous relationships: syntax (subject-verb), coreference (pronoun-noun), semantic similarity, positional patterns. Multi-head attention runs several attention operations in parallel, each with its own learned Q, K, V projections.
If the model dimension is d = 512 and we use h = 8 heads, each head works with d/h = 64 dimensions. After computing attention independently, we concatenate all head outputs and project back to the full dimension.
Select a head to see its attention pattern. Each head learns to focus on different relationships.
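The split-attend-concatenate-project dance can be sketched as follows (random weight matrices as stand-ins for learned parameters, a per-head loop for clarity rather than speed):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Split d into h heads, attend independently, concatenate, project back."""
    n, d = X.shape
    d_h = d // h                          # 512 / 8 = 64 dims per head
    # project, then reshape to (h, n, d_h) so each head gets its own slice
    Q = (X @ Wq).reshape(n, h, d_h).transpose(1, 0, 2)
    K = (X @ Wk).reshape(n, h, d_h).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, h, d_h).transpose(1, 0, 2)
    heads = []
    for i in range(h):                    # each head attends independently
        scores = Q[i] @ K[i].T / np.sqrt(d_h)
        heads.append(softmax(scores) @ V[i])
    concat = np.concatenate(heads, axis=-1)  # (n, d) — glue heads back together
    return concat @ Wo                       # final output projection

rng = np.random.default_rng(1)
n, d, h = 5, 512, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = rng.normal(size=(4, d, d)) * 0.02
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)  # same (n, d) shape in and out
```

Production implementations batch all heads into one tensor operation, but the math is identical.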
Attention has no built-in notion of order: if you shuffle the input tokens, the outputs simply shuffle in exactly the same way. As far as the mechanism is concerned, "cat sat mat" and "mat cat sat" are the same bag of tokens. We need to inject position information explicitly.
Each row is a position (0–31). Each column is a dimension. Color = encoding value. Notice the wave patterns at different frequencies.
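Those wave patterns come from the original paper's sinusoidal encoding: each dimension pair oscillates at a different frequency. A small sketch that builds the same 32-position grid:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]      # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2) frequency index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = sinusoidal_encoding(32, 16)
print(pe.shape)  # one encoding row per position
```

This vector is simply added to each token's embedding, giving every position a unique, smoothly varying signature.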
An encoder block takes a sequence and returns a refined sequence of the same shape. It has two sub-layers, each wrapped in a residual connection (add the input back) and layer normalization. The residual connections are critical — they let gradients flow straight through, enabling very deep stacks.
Watch a token vector flow through each sub-layer. The residual stream carries information forward.
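The two-sub-layer structure — attention, then feed-forward, each wrapped in "add the input back, then normalize" — can be sketched like this (identity Q/K/V projections and random FFN weights, purely to show the wiring):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # identity projections for brevity; real blocks use learned Wq, Wk, Wv
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

def ffn(x, W1, W2):
    return np.maximum(0, x @ W1) @ W2  # two-layer MLP with ReLU

def encoder_block(x, W1, W2):
    x = layer_norm(x + self_attention(x))  # sub-layer 1: attention + residual + LN
    x = layer_norm(x + ffn(x, W1, W2))     # sub-layer 2: FFN + residual + LN
    return x                               # same shape as the input

rng = np.random.default_rng(2)
n, d = 4, 16
x = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, 4 * d)) * 0.1     # FFN expands 4x, then contracts
W2 = rng.normal(size=(4 * d, d)) * 0.1
y = encoder_block(x, W1, W2)
print(y.shape)
```

Because input and output shapes match, blocks stack cleanly: the output of one is the input of the next.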
The decoder block has the same structure as the encoder, plus two crucial additions: causal masking and (in encoder-decoder models) cross-attention.
Causal masking: During generation, token i must not see tokens i+1, i+2, ... (the future). We achieve this by setting those attention scores to −∞ before softmax, which forces their weights to zero. This creates a lower-triangular attention matrix.
Cross-attention: In translation models, the decoder attends to the encoder's output. The decoder provides Q; the encoder provides K and V. This is how the decoder "reads" the source language.
The attention matrix before and after masking. White cells are visible; dark cells are masked (−∞). Each token can only see itself and earlier tokens.
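The mask itself is one line of NumPy: mark everything above the diagonal as "future" and set those scores to −∞ before the softmax:

```python
import numpy as np

n = 4
scores = np.random.default_rng(3).normal(size=(n, n))

mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future
scores[mask] = -np.inf                            # future positions get -infinity

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)       # softmax: exp(-inf) = 0
print(weights.round(2))  # lower-triangular: each row sees only itself and earlier tokens
```

Row i of the result is zero everywhere past column i, exactly the lower-triangular pattern in the widget above.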
Training a Transformer for language modeling is deceptively simple: given a sequence of tokens, predict the next token at every position. The loss function is cross-entropy between the predicted probability distribution and the actual next token.
Teacher forcing: During training, we don't use the model's own predictions as input for the next step. Instead, we always feed the true previous tokens. This is faster and more stable than feeding the model its own (possibly wrong) outputs, but it means the model never sees its own mistakes during training — a train/inference mismatch known as exposure bias.
The model predicts a probability for the correct next token. Drag the slider to see how loss changes. Higher confidence in the right answer = lower loss.
| Concept | What It Does |
|---|---|
| Next-token prediction | The training objective: predict xₜ₊₁ from x₁…xₜ |
| Cross-entropy loss | Measures how far predicted distribution is from truth |
| Teacher forcing | Use true tokens (not predictions) as input during training |
| AdamW optimizer | Adaptive learning rate + weight decay |
| Warmup + cosine decay | Gradually increase then decrease learning rate |
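The objective and loss from the table can be sketched in a few lines — toy logits over a 5-token vocabulary, with teacher-forced targets:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def next_token_loss(logits, targets):
    """Mean cross-entropy over positions: -log p(correct next token)."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

# 3 positions, vocabulary of 5 tokens (invented numbers)
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 4.0, 0.1]])
targets = np.array([0, 1, 3])   # the true next tokens (teacher forcing)
loss = next_token_loss(logits, targets)
print(loss)  # low: the model is confident and correct at every position
```

Swap the targets to tokens the model assigns low probability and the loss jumps — exactly the behavior the slider above demonstrates.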
During generation, the model produces one token at a time. Without optimization, each step recomputes attention over the entire prefix — O(n²) work per step, O(n³) for a full sequence. The KV cache stores the Key and Value vectors from previous tokens so we never recompute them: each step only computes one new token's attention against the cache, O(n) per step and O(n²) overall.
Click "Generate Token" to add one token. The cache (blue bars) grows while each step only computes one new Q (orange).
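A minimal cache is just an append-only list of K and V vectors; each decoding step contributes one new entry and attends against everything stored so far. A toy single-head sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only store of Key/Value vectors from previous decoding steps."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        self.K.append(k)                   # cache the new token's K and V ...
        self.V.append(v)
        K = np.stack(self.K)               # (t, d) — old entries never recomputed
        V = np.stack(self.V)
        d = q.shape[-1]
        w = softmax(q @ K.T / np.sqrt(d))  # one new query vs. all cached keys: O(t)
        return w @ V

rng = np.random.default_rng(4)
cache = KVCache()
for t in range(5):                         # generate 5 tokens, one at a time
    q, k, v = rng.normal(size=(3, 8))     # stand-ins for the new token's Q, K, V
    out = cache.step(q, k, v)
print(len(cache.K))                        # cache now holds 5 K/V pairs
```

Real implementations preallocate the cache as one tensor per layer and head, but the access pattern is the same.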
As models get bigger, we face a dilemma: more parameters = better quality, but also more compute per token. Mixture of Experts (MoE) breaks this tradeoff. Instead of one giant FFN, we have many smaller "expert" FFNs. A router selects the top-k experts for each token. Only those experts run.
Each token is routed to 2 of 8 experts. Different tokens activate different experts. Click "Route" to see a new random routing.
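The router-plus-experts pattern can be sketched as follows — 8 toy linear "experts", top-2 routing, with the router's weights renormalized over the chosen experts (all weights random, for illustration only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_W, experts, k=2):
    """Route each token to its top-k experts; mix their outputs by router weight."""
    logits = x @ router_W                      # (n_tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = softmax(logits[t, top[t]])     # renormalize over chosen experts
        for gate, e in zip(gates, top[t]):
            out[t] += gate * experts[e](x[t])  # only k of the experts ever run
    return out

rng = np.random.default_rng(5)
n_tokens, d, n_experts = 6, 16, 8
x = rng.normal(size=(n_tokens, d))
router_W = rng.normal(size=(d, n_experts))
# each "expert" is a tiny ReLU layer standing in for a full FFN
Ws = rng.normal(size=(n_experts, d, d)) * 0.1
experts = [lambda v, W=W: np.maximum(0, v @ W) for W in Ws]
out = moe_layer(x, router_W, experts)
print(out.shape)
```

With k = 2 of 8 experts active, each token pays roughly a quarter of the compute of a dense layer with the same total parameter count.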
Kaplan et al. (2020) discovered that model performance follows predictable power laws:
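The parameter-count law from the paper has the form L(N) = (N_c / N)^αᴺ. A sketch using the paper's approximate fitted constants (αᴺ ≈ 0.076, N_c ≈ 8.8×10¹³; treat the exact numbers as illustrative):

```python
# Kaplan et al.'s loss-vs-parameters power law, L(N) = (N_c / N) ** alpha_N.
# Constants are the paper's approximate fits; shown for illustration.
def loss_from_params(N, alpha_N=0.076, N_c=8.8e13):
    return (N_c / N) ** alpha_N

for N in [1e8, 1e9, 1e10]:
    print(f"{N:.0e} params -> predicted loss {loss_from_params(N):.3f}")
```

The striking part: loss falls smoothly and predictably as N grows, with no sign of a plateau across several orders of magnitude.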
| Technique | Idea | Speedup |
|---|---|---|
| Flash Attention | Tile computation to stay in SRAM, never materialize full attention matrix | 2–4x |
| Ring Attention | Distribute sequence across GPUs, pass KV blocks in a ring | Linear in #GPUs |
| Grouped-Query Attention | Share KV heads across multiple Q heads | 1.5–2x less KV memory |
| Sliding Window | Each token only attends to nearby tokens | O(n·w) instead of O(n²) |
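The sliding-window idea from the last row is just a stricter mask: causal, but also cut off beyond a window of w earlier tokens. A small sketch:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Token i may attend to tokens j with i - w < j <= i (causal, window size w)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)   # True = visible

mask = sliding_window_mask(6, 3)
print(mask.astype(int))  # a band of width 3 hugging the diagonal
```

Each row has at most w visible positions, so attention costs O(n·w) instead of O(n²); stacking layers still lets information propagate beyond the window.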
The Transformer's power isn't in any single component — it's in how they compose. Here are the key phenomena researchers have discovered:
Think of the residual connections as a communication bus. Each layer reads from the stream, processes information, and writes its contribution back. Information from early layers is never destroyed — it flows all the way to the final layer. This is fundamentally different from a pipeline where each stage replaces the previous output.
One of the most remarkable discoveries: pairs of attention heads that implement in-context pattern completion. If the model sees "Harry Potter is a wizard... Harry Potter is a", the induction head copies "wizard" from the earlier occurrence. This is a key mechanism behind in-context learning — the ability to learn new tasks from examples in the prompt without any weight updates.
The sequence repeats. Watch how the attention pattern forms a diagonal stripe shifted by the repeat length — that's the induction head copying from the first occurrence.
| Phenomenon | What Happens |
|---|---|
| In-context learning | Learns new tasks from examples in the prompt |
| Chain-of-thought | Step-by-step reasoning improves accuracy |
| Few-shot generalization | Solves unseen tasks with just a few examples |
| Tool use | Learns to call APIs, write code, use calculators |
You now understand the architecture that powers every frontier AI model. From dot products to multi-head attention, from encoder blocks to KV caches — this is the foundation of modern AI.