The architecture behind GPT, Claude, Gemini, and everything else.
A Transformer is a neural network that processes sequences by letting every element attend to every other element in parallel. Instead of reading tokens one-by-one like an RNN, it computes all pairwise relationships simultaneously using a mechanism called self-attention.
This simple idea — "Attention Is All You Need" — replaced recurrence with parallelism, unlocking massive scaling on GPUs. Every modern LLM, from GPT-4 to Claude to Gemini, is built on this foundation. The core architecture has barely changed since 2017; what changed is the scale and the training recipe.
Click any component to highlight its data flow. The decoder-only variant (used by GPT, Claude, Llama) stacks N identical blocks, each with self-attention and a feed-forward network.
A sequence of token IDs (integers) mapped to dense vectors via an embedding table.
Shape: [B, T, d_model], where B is batch size, T is sequence length, and d_model is the hidden width (typically 4096–12288).
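The lookup is just indexing into a matrix. A minimal numpy sketch with toy sizes (real models use far larger V and d_model):

```python
import numpy as np

# Toy sizes, assumed for illustration; real models use V ~ 32k-256k, d_model ~ 4096+.
V, d_model = 1000, 64          # vocab size, hidden width
B, T = 2, 8                    # batch size, sequence length

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(V, d_model))  # one learned row per token ID

token_ids = rng.integers(0, V, size=(B, T))      # integer token IDs, [B, T]
x = embedding_table[token_ids]                   # lookup: [B, T] -> [B, T, d_model]

print(x.shape)  # (2, 8, 64)
```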
A probability distribution over the vocabulary for the next token at each position.
During autoregressive generation, only the last position's logits matter.
Shape: [B, T, V] where V is vocab size (~32k–256k).
Each token computes Query, Key, and Value vectors. Attention weights = softmax(QKᵀ / √d_k). The output is a weighted sum of Values. This lets every token "look at" every other token.
Instead of one big attention, split into H parallel heads (e.g., 32–128), each attending to a different subspace. Concat and project. This lets the model capture different relationship types simultaneously.
In decoder-only models, each token can only attend to itself and earlier tokens. This is enforced by setting future positions to −∞ before softmax — a triangular mask that prevents information leaking from the future.
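The three ideas above — scaled dot-product attention, multiple heads, and the causal mask — fit in one short function. A numpy sketch under assumed toy shapes (no output projection, single sequence, no batch dimension):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv, n_heads):
    """Multi-head causal self-attention (toy sketch, output projection omitted)."""
    T, d = x.shape
    hd = d // n_heads                          # per-head dimension d_k
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # each [T, d]
    # Split into heads: [n_heads, T, hd]
    split = lambda M: M.reshape(T, n_heads, hd).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(hd)    # [n_heads, T, T]
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # block future positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                   # softmax over keys
    out = w @ V                                        # weighted sum of Values
    return out.transpose(1, 0, 2).reshape(T, d)        # concat heads -> [T, d]

rng = np.random.default_rng(0)
T, d = 5, 16
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
y = causal_self_attention(x, Wq, Wk, Wv, n_heads=4)
print(y.shape)  # (5, 16)
```

Setting masked scores to −∞ before the softmax makes their weights exactly zero, so token t's output depends only on tokens 0…t.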
Used in encoder-decoder models (T5, original Transformer). Queries come from the decoder; Keys and Values come from the encoder output. This is how the decoder "reads" the input sequence during translation or summarization.
Watch how Query, Key, and Value matrices interact. Hover over tokens to see which other tokens they attend to and with what weight.
Transformers have no built-in notion of order. Positional encodings inject sequence position information. Three dominant approaches:
Fixed sine/cosine waves at different frequencies. Original paper. No learned parameters. Theoretically generalizes to any length.
Rotary Position Embedding. Rotates Q and K vectors by position-dependent angles. Used by Llama, Mistral, Qwen. Relative by design.
Attention with Linear Biases. No positional embedding — just subtracts a linear penalty proportional to distance. Simple and effective.
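The sinusoidal variant is the easiest of the three to write down. A sketch of the original paper's formulation, with sin on even channels and cos on odd channels:

```python
import numpy as np

def sinusoidal_pe(T, d_model):
    """Fixed sine/cosine positional encodings from the original Transformer paper."""
    pos = np.arange(T)[:, None]               # positions 0..T-1, column vector
    i = np.arange(0, d_model, 2)[None, :]     # even channel indices
    angles = pos / (10000 ** (i / d_model))   # one frequency per channel pair
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(T=50, d_model=64)
print(pe.shape)     # (50, 64)
print(pe[0, :4])    # position 0: sin channels are 0, cos channels are 1 -> [0. 1. 0. 1.]
```

Because the frequencies are fixed rather than learned, the same function can emit encodings for any position, which is why it generalizes to unseen lengths in principle.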
The Transformer has been iteratively improved since 2017. The biggest wins:
Cache Key/Value matrices for past tokens during autoregressive decode. Avoids recomputation. Now universal.
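The mechanics are just array concatenation: each decode step computes K/V for the one new token and appends them. A minimal sketch with assumed toy shapes:

```python
import numpy as np

# Toy KV cache: during decode, append each new token's K/V instead of
# recomputing the whole sequence. Shapes assumed: [T, d_k].
d_k = 8
k_cache = np.empty((0, d_k))
v_cache = np.empty((0, d_k))

rng = np.random.default_rng(0)
for step in range(4):                        # four decode steps
    k_new = rng.normal(size=(1, d_k))        # K/V for the one new token only
    v_new = rng.normal(size=(1, d_k))
    k_cache = np.concatenate([k_cache, k_new])
    v_cache = np.concatenate([v_cache, v_new])
    q = rng.normal(size=(1, d_k))            # query for the new token
    scores = q @ k_cache.T / np.sqrt(d_k)    # attends to all cached positions
    # ... softmax(scores) @ v_cache as usual

print(k_cache.shape)  # (4, 8)
```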
Fuses attention into a single GPU kernel with tiling, never materializing the full T×T score matrix. 2–4x faster, with extra memory linear in sequence length instead of quadratic. Changed everything.
Distributes attention across devices in a ring. Enables million-token contexts by splitting the KV cache across GPUs.
Replace dense FFN with sparse routed experts. Mixtral uses 8 experts, picks top-2. More capacity, same FLOPs.
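The routing step can be sketched in a few lines. This is a toy illustration of top-k gating, not Mixtral's actual implementation; each "expert" here is a single linear map standing in for a full FFN:

```python
import numpy as np

def moe_ffn(x, experts, router_W, top_k=2):
    """Sketch of sparse top-k expert routing for one token vector x."""
    logits = x @ router_W                     # one router score per expert
    top = np.argsort(logits)[-top_k:]         # indices of the top-k experts
    gate = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalized gates
    # Only the chosen experts run; the rest cost no FLOPs.
    return sum(g * experts[e](x) for g, e in zip(gate, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: x @ W for W in weights]   # stand-ins for full FFNs
router_W = rng.normal(size=(d, n_experts))

y = moe_ffn(rng.normal(size=d), experts, router_W)
print(y.shape)  # (16,)
```

The model stores all 8 experts' parameters (capacity) but each token only pays for 2 of them (FLOPs).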
Draft tokens with a small model, verify in parallel with the big model. 2–3x faster inference without quality loss.
Grouped-Query / Multi-Query Attention. Share K/V heads across query heads. Cuts KV cache by 4–8x. Used in Llama 2+.
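The cache saving falls straight out of the shapes: only the K/V heads are stored, then broadcast across their query-head group at attention time. A toy sketch with assumed sizes:

```python
import numpy as np

# GQA sketch: 8 query heads share 2 K/V heads, shrinking the KV cache 4x.
n_q_heads, n_kv_heads, T, hd = 8, 2, 5, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_q_heads, T, hd))
K = rng.normal(size=(n_kv_heads, T, hd))   # only 2 K/V heads are ever cached
V = rng.normal(size=(n_kv_heads, T, hd))

group = n_q_heads // n_kv_heads            # query heads per K/V head
K_rep = np.repeat(K, group, axis=0)        # broadcast K/V across each group
V_rep = np.repeat(V, group, axis=0)
scores = Q @ K_rep.transpose(0, 2, 1) / np.sqrt(hd)
print(scores.shape)  # (8, 5, 5)
```

Multi-Query Attention is the extreme case n_kv_heads = 1.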
Rotary Position Embeddings encode relative position via rotation matrices. Dominant in open-source LLMs.
Limit attention to a local window of recent tokens (some variants, like Longformer, add a few global tokens). Mistral's approach to efficient long context without the full quadratic cost.
Massive next-token prediction on internet-scale data. This is where the model learns language, facts, and reasoning patterns.
Fine-tune the model to follow instructions, be helpful, and avoid harm.
Full fine-tuning a 70B model requires hundreds of GB of memory. LoRA freezes the base model and trains low-rank adapter matrices (rank 8–64), reducing trainable parameters by 99%+. QLoRA goes further by quantizing the base model to 4-bit, making fine-tuning possible on a single GPU.
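The core of LoRA is one extra low-rank term in the forward pass. A numpy sketch following the paper's convention (frozen W, trainable A and B, scaling α/r; sizes here are toy assumptions):

```python
import numpy as np

# LoRA sketch: frozen weight W plus trainable low-rank update (x @ A) @ B.
d_in, d_out, r, alpha = 64, 64, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d_in, d_out))       # frozen base weight
A = rng.normal(size=(d_in, r)) * 0.01    # trainable, rank r
B = np.zeros((r, d_out))                 # trainable, zero-init: no change at start

def lora_forward(x):
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.normal(size=(2, d_in))
# With B = 0 the adapted layer matches the frozen layer exactly:
print(np.allclose(lora_forward(x), x @ W))  # True

print(W.size, A.size + B.size)  # 4096 frozen vs 1024 trainable at this toy size
```

At real sizes the ratio is far more dramatic: rank-8 adapters on a 4096×4096 weight train ~65k parameters against ~16.8M frozen ones, which is where the 99%+ reduction comes from.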
Generation happens in two phases: prefill (process the full prompt in parallel) and decode (generate tokens one at a time using the KV cache).
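A toy greedy-decoding loop makes the two phases concrete. The "model" here is a stand-in returning random logits; a real implementation would fill the KV cache during prefill and feed only the new token during decode:

```python
import numpy as np

V = 100  # toy vocab size, assumed for illustration
rng = np.random.default_rng(0)

def model(token_ids):
    """Stand-in for a Transformer: logits for every position, [T, V]."""
    return rng.normal(size=(len(token_ids), V))

prompt = [5, 17, 42]
# Prefill: one parallel pass over the whole prompt.
logits = model(prompt)
next_token = int(np.argmax(logits[-1]))      # only the last position's logits matter

# Decode: one token at a time.
generated = list(prompt)
for _ in range(5):
    generated.append(next_token)
    logits = model(generated)                # real impls feed only the new token
    next_token = int(np.argmax(logits[-1]))

print(len(generated))  # 8: 3 prompt tokens + 5 generated
```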
Notable Transformer-based models and their key specs.
| Model | Params | Context | Attention | Key Features |
|---|---|---|---|---|
| GPT-4o | ~1.8T (MoE) | 128k | GQA | Multimodal, MoE, tool use |
| Claude 3.5 Sonnet | undisclosed | 200k | — | Long context, strong reasoning, tool use |
| Gemini 1.5 Pro | ~1T (MoE) | 1M+ | Ring Attention | Extreme context window, multimodal native |
| Llama 3.1 405B | 405B | 128k | GQA + RoPE | Open-weight, dense, multilingual |
| Mistral Large | 123B | 128k | GQA + SWA | Sliding window attention, efficient |
| Mixtral 8x22B | 176B (39B active) | 64k | GQA + MoE | Sparse MoE, 8 experts top-2 routing |
| DeepSeek-V3 | 671B (37B active) | 128k | MLA + MoE | Multi-head Latent Attention, 256 experts |
| Qwen 2.5 72B | 72B | 128k | GQA + RoPE | Strong multilingual, code, math |