Karpathy built the entire GPT algorithm in 243 lines of Python. This masterclass will make sure you understand every single one.
A Large Language Model like ChatGPT is, at its core, a next-token predictor. You give it some text, and it predicts what word should come next. The entire miracle of "artificial intelligence" emerges from doing this one thing extremely well.
microGPT learns patterns in 32,000 human names and generates new names that sound plausible but never existed. The same algorithm, scaled up 100,000x, produces ChatGPT.
Names generated by the 243-line model after training:
Every LLM runs this exact loop:
Neural networks only understand numbers. A vector is just a list of numbers. A parameter is a single adjustable number the model learns during training.
The derivative tells you: if I nudge the input a tiny bit, how much does the output change?
Example: f(x) = x². The tangent line at a point shows the derivative (slope) there.
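The "nudge the input" idea can be tried directly with a finite difference. This is a minimal sketch; the helper name `numerical_derivative` and the step size `h` are illustrative choices, not part of microGPT.

```python
def f(x):
    """The example function f(x) = x^2."""
    return x ** 2


def numerical_derivative(fn, x, h=1e-6):
    """Central difference: nudge x by +h and -h and measure how fn changes."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


# The analytic derivative of x^2 is 2x, so the slope at x = 3 should be close to 6.
slope_at_3 = numerical_derivative(f, 3.0)
```

Comparing `slope_at_3` with the analytic value `2 * 3.0` is a quick sanity check that the slope really is "change in output per unit change in input".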
Gradient descent: compute the loss, compute gradients, take a small step downhill, repeat.
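As a rough illustration of that loop, here is gradient descent on a one-parameter toy loss. The loss `(x - 3)^2`, the learning rate, and the step count are assumptions chosen for the example, not values from microGPT.

```python
def loss(x):
    """Toy loss with its minimum at x = 3."""
    return (x - 3.0) ** 2


def grad(x):
    """Derivative of the toy loss with respect to x."""
    return 2.0 * (x - 3.0)


x = 0.0               # starting guess
learning_rate = 0.1   # size of each downhill step

for step in range(100):
    # Move against the gradient: downhill on the loss surface.
    x = x - learning_rate * grad(x)
```

Each step shrinks the distance to the minimum by a constant factor (0.8 with these settings), so `x` ends up very close to 3.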
Every number in microGPT is wrapped in a Value object that tracks its gradient. When you multiply two Values, the result remembers how it was made.
```python
class Value:
    def __init__(self, data):
        self.data = data  # the actual number
        self.grad = 0     # how the loss depends on this
```
| Operation | Local Gradient |
|---|---|
| a + b | Both inputs: slope = 1 |
| a × b | ∂/∂a = b, ∂/∂b = a |
| aⁿ | n · aⁿ⁻¹ |
| log(a) | 1/a |
| exp(a) | exp(a) |
| relu(a) | 1 if a>0, else 0 |
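To see how the local gradients in the table combine through the chain rule, here is a small sketch of a `Value` that supports only `+` and `×`. The attribute names `_parents` and `_local_grads` and the structure of `backward` are illustrative assumptions, not necessarily how microGPT implements it.

```python
class Value:
    """A number that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was built from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        # a + b: the local gradient is 1 for both inputs
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # a * b: d/da = b and d/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Visit nodes in topological order so that a node's grad is
        # complete before it is passed on to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad  # chain rule


a, b, c = Value(2.0), Value(3.0), Value(4.0)
out = a * b + c
out.backward()
```

After `backward()`, `a.grad` holds `b`'s value, `b.grad` holds `a`'s value, and `c.grad` is 1, matching the table rows for multiplication and addition.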
We've built a gradient descent engine. It can adjust numbers to minimize a loss function. But here's the problem: how do we feed text into a system that only understands numbers?
The naive approach: assign each character a number. a=1, b=2, c=3... z=26. Now "cat" becomes [3, 1, 20]. But this creates a lie — it says "b" is halfway between "a" and "c," and "m" is close to "n." In reality, the letter "a" has no meaningful numerical relationship to "b." The number 1 vs. 2 is an arbitrary label, not a measurement.
A single number per character can only tell you which character it is. It can't capture anything about what that character means in context. Think of it like GPS: a single number (longitude) tells you east-west position, but you need two numbers (latitude + longitude) to pinpoint a location on Earth. With more numbers, you can describe richer things.
Before anything else, we need a consistent way to refer to each character. We assign each one an integer ID — think of it as a name tag, not a measurement. The specific numbers don't matter; what matters is that each character gets a unique ID.
For microGPT: a=0, b=1, c=2, ..., z=25, and a special "beginning of sequence" token BOS=26. That's 27 possible tokens total.
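A minimal sketch of this mapping, assuming lowercase input only. The helper names `encode` and `decode` are made up for the example.

```python
import string

BOS = 26  # special "beginning of sequence" token
stoi = {ch: i for i, ch in enumerate(string.ascii_lowercase)}  # 'a' -> 0, ..., 'z' -> 25
itos = {i: ch for ch, i in stoi.items()}                       # reverse lookup
vocab_size = len(stoi) + 1                                     # 26 letters + BOS


def encode(name):
    """Turn a lowercase name into token IDs, starting with BOS."""
    return [BOS] + [stoi[ch] for ch in name]


def decode(ids):
    """Turn token IDs back into text, dropping BOS."""
    return "".join(itos[i] for i in ids if i != BOS)


ids = encode("cat")
```

The IDs are only labels: swapping which letter gets which number would change nothing, as long as the mapping stays consistent.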
Now we convert each token ID into a vector. How? With a simple lookup table — a big grid of numbers called an embedding table.
Picture a spreadsheet with 27 rows (one per character) and 16 columns (one per dimension). Each row is that character's vector. To "embed" the character "e" (ID=4), you just grab row 4. No math, no computation — just look up the row.
```python
import torch
import torch.nn as nn

# The embedding table: 27 characters, each gets a 16-number vector
wte = nn.Embedding(27, 16)  # 27 rows × 16 columns = 432 learnable numbers

# "Look up" a character: just index into the table
# token ID 4 (the letter 'e') → row 4 → a vector of 16 numbers
e_vector = wte(torch.tensor(4))  # same as wte.weight[4], e.g. [0.23, -0.81, 1.42, ..., 0.05]
```
Before training, these 432 numbers are random. The letter "e" starts with a meaningless vector. But during training, gradient descent adjusts these numbers so that characters appearing in similar contexts end up with similar vectors. The model discovers structure on its own.
There's a subtle problem. If we just embed each character, the model sees the word "cat" as three vectors: [vec_c, vec_a, vec_t]. But it also sees "act" as the same three vectors in a different order: [vec_a, vec_c, vec_t]. How does the model know which character came first?
The answer: we add a second set of vectors that encode position. Position 0 gets its own vector, position 1 gets a different vector, and so on. These are also stored in a lookup table and also learned during training.
```python
import torch
import torch.nn as nn

wte = nn.Embedding(27, 16)  # token embedding table: 27 tokens × 16 numbers

# Position embedding table: 16 positions, each gets a 16-number vector
wpe = nn.Embedding(16, 16)  # 16 rows × 16 columns = 256 learnable numbers

# For the word "cat" at positions 0, 1, 2:
tok_emb = wte(torch.tensor([2, 0, 19]))  # look up c, a, t → three 16-number vectors
pos_emb = wpe(torch.tensor([0, 1, 2]))   # look up positions 0, 1, 2 → three 16-number vectors

# Combine: add them element-by-element
x = tok_emb + pos_emb  # "c at position 0", "a at position 1", "t at position 2"
```
Why add instead of sticking them side by side (concatenating)? If we concatenated, each vector would double from 16 numbers to 32 numbers, making every downstream computation more expensive. Addition keeps the size at 16. The model learns to pack both "which character" and "which position" into the same 16 numbers — they share the space.
From here on, we'll describe the size of data using shape notation like [3, 16]. This just means "3 rows, 16 columns" — a grid of numbers. You'll also see a third dimension: [B, T, 16], where B is the batch size (how many examples we process at once for efficiency) and T is the sequence length (how many characters). So [4, 8, 16] means "4 examples, each with 8 characters, each character described by 16 numbers."
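The shape notation is easy to check with random data. This sketch uses NumPy arrays as stand-ins for the embedding tables; the random token IDs are placeholders, not real names.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, n_embd = 4, 8, 16                  # batch size, sequence length, vector size

wte = rng.normal(size=(27, n_embd))      # token table:    [27, 16]
wpe = rng.normal(size=(16, n_embd))      # position table: [16, 16]

token_ids = rng.integers(0, 27, size=(B, T))  # [4, 8] integer IDs
tok_emb = wte[token_ids]                      # [4, 8, 16] one row per token
pos_emb = wpe[np.arange(T)]                   # [8, 16] one row per position
x = tok_emb + pos_emb                         # broadcasts to [4, 8, 16]
```

The position vectors have no batch dimension. Broadcasting reuses the same `[T, 16]` block for every example in the batch.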
microGPT uses 27 characters and 16 dimensions. Real models are bigger, but the mechanism is identical — just larger tables:
| Model | Vocab Size | Vector Size | Embedding Table Size |
|---|---|---|---|
| microGPT | 27 characters | 16 numbers each | 432 numbers |
| GPT-2 Small | 50,257 tokens | 768 numbers each | 38.6 million numbers |
| GPT-3 | 50,257 tokens | 12,288 numbers each | 617 million numbers |
GPT-2 and GPT-3 use subword tokens instead of individual characters (byte-pair-encoded chunks like "the", "ing", "tion"), which is more efficient. But the embedding mechanism is the same: one row per token, looked up by index.
Step 1: Tokenize "hi" → [26, 7, 8] (BOS=26, h=7, i=8). Three integer IDs.
Step 2: Token embedding lookup: grab rows 26, 7, 8 from the token table → three vectors of 16 numbers each.
Step 3: Position embedding lookup: grab rows 0, 1, 2 from the position table → three vectors of 16 numbers each.
Step 4: Add element-by-element: token vector + position vector → three vectors of 16 numbers each. This is what enters attention.
Why add, not concatenate? Concatenation would double the vector size from 16 to 32, making every downstream computation more expensive. Addition keeps the size at 16. It works because the model learns to pack both identity and position into the same 16-dimensional space during training.
Each token creates three vectors: Query ("what am I looking for?"), Key ("what do I contain?"), Value ("what info do I offer?").
GPT is autoregressive — it predicts the next token from only the previous tokens. This means token 3 must NOT see tokens 4, 5, 6... How do we enforce this? With a causal mask: a lower-triangular boolean matrix.
```python
import torch

T = 5
scores = torch.randn(T, T)  # stand-in for the raw attention scores

# For a 5-token sequence, the mask looks like:
mask = torch.tril(torch.ones(T, T))
# [[1, 0, 0, 0, 0],   # token 0 sees only itself
#  [1, 1, 0, 0, 0],   # token 1 sees tokens 0,1
#  [1, 1, 1, 0, 0],   # token 2 sees tokens 0,1,2
#  [1, 1, 1, 1, 0],   # token 3 sees tokens 0,1,2,3
#  [1, 1, 1, 1, 1]]   # token 4 sees all tokens

# Where mask is 0, we set the attention score to -infinity
# exp(-inf) = 0, so after softmax those tokens are completely invisible
scores = scores.masked_fill(mask == 0, float('-inf'))
```
This is the ENTIRE autoregressive property. No special architecture, no separate backward pass. Just a triangular matrix of zeros and ones applied before softmax. The implementation is one line of code, but it's what makes GPT a generative model.
The original 2017 Transformer had both encoder and decoder stacks. GPT's insight (Radford et al., 2018): throw away the encoder entirely. A causal decoder alone, trained on enough data, learns to do everything. The causal mask IS the architecture — remove it and you get BERT (bidirectional encoder). Same attention mechanism, same Q/K/V math, different masking pattern = completely different model behavior.
BERT uses a [MASK] token and predicts masked words bidirectionally. Why can't BERT generate text autoregressively the way GPT does?
| Setting | Value | Meaning |
|---|---|---|
| n_embd | 16 | Each token = 16 numbers |
| n_head | 4 | 4 parallel attention patterns |
| n_layer | 1 | One [Attention + MLP] block |
| vocab_size | 27 | 26 letters + BOS |
| Total params | 4,192 | 4,192 learnable numbers |
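With n_embd = 16 and n_head = 4, the usual convention is to split each 16-number vector into 4 heads of 4 numbers, run causal attention in each head independently, and then merge the results. The sketch below shows that convention in NumPy; the function name, the random weights, and the sequence length of 5 are assumptions for illustration, not microGPT's actual code.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, W_Q, W_K, W_V, W_O, n_head):
    """Causal multi-head self-attention for one sequence x of shape [T, n_embd]."""
    T, n_embd = x.shape
    head_dim = n_embd // n_head

    def split(m):
        # [T, n_embd] -> [n_head, T, head_dim]
        return m.reshape(T, n_head, head_dim).transpose(1, 0, 2)

    Q, K, V = split(x @ W_Q), split(x @ W_K), split(x @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)  # [n_head, T, T]
    mask = np.tril(np.ones((T, T), dtype=bool))            # causal mask
    scores = np.where(mask, scores, -np.inf)               # hide future tokens
    attn = softmax(scores)                                 # [n_head, T, T]
    out = attn @ V                                         # [n_head, T, head_dim]
    out = out.transpose(1, 0, 2).reshape(T, n_embd)        # merge heads back
    return out @ W_O, attn


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
W_Q, W_K, W_V, W_O = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))
out, attn = multi_head_attention(x, W_Q, W_K, W_V, W_O, n_head=4)
```

Each of the 4 slices of `attn` is its own `[T, T]` attention pattern, which is what "4 parallel attention patterns" refers to. Splitting into heads does not change the parameter count: the four weight matrices are still 16 × 16.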
Let's account for every single parameter in microGPT:
microGPT parameter audit

```
# Embeddings
token_emb: 27 × 16 = 432
pos_emb:   16 × 16 = 256

# Attention (1 block, 4 heads)
W_Q: 16 × 16 = 256
W_K: 16 × 16 = 256
W_V: 16 × 16 = 256
W_O: 16 × 16 = 256

# MLP (4× expansion: 16 → 64 → 16)
W_up:   16 × 64 = 1,024
b_up:   64      = 64
W_down: 64 × 16 = 1,024
b_down: 16      = 16

# LayerNorm (2 per block, each has scale + shift)
ln1: 16 + 16 = 32
ln2: 16 + 16 = 32

# LM Head (shared with token_emb via weight tying)
lm_head:  16 × 27 = 432   # often tied with token_emb
ln_final: 16 + 16 = 32

# TOTAL as itemized: 3,936 with lm_head tied to token_emb (4,368 if untied)
# The quoted 4,192 corresponds to a variant with no biases (−80),
# no learnable norm parameters (−96), and an untied lm_head:
# 432 + 256 + 1,024 + 2,048 + 432 = 4,192
```
Now here's the same audit for GPT-2 Small (124M parameters):
| Component | Formula | Params |
|---|---|---|
| Token embedding | 50,257 × 768 | 38.6M |
| Position embedding | 1,024 × 768 | 0.8M |
| Attention (12 layers) | 12 × 4 × 768² | 28.3M |
| MLP (12 layers) | 12 × 2 × 768 × 3072 | 56.6M |
| LayerNorm + biases | small | ~0.1M |
| Total | | ~124M |
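The arithmetic in the table can be reproduced in a few lines. This is a rough sketch that counts only the weight matrices listed above and ignores biases and LayerNorm parameters; the function and argument names are made up for the example.

```python
def gpt2_param_breakdown(vocab=50257, ctx=1024, d=768, n_layer=12):
    """Count the large weight matrices of a GPT-2-style model (no biases, no LayerNorm)."""
    counts = {
        "token_emb": vocab * d,               # one row per token
        "pos_emb": ctx * d,                   # one row per position
        "attention": n_layer * 4 * d * d,     # W_Q, W_K, W_V, W_O per layer
        "mlp": n_layer * 2 * d * (4 * d),     # up and down projections per layer
    }
    counts["total"] = sum(counts.values())
    return counts


small = gpt2_param_breakdown()
```

With the defaults, `small["total"]` comes to about 124.3 million, and the remaining ~0.1M in the table is the biases and LayerNorm parameters. The output projection adds nothing because it reuses the token embedding matrix.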
```python
import numpy as np

def layernorm(x, g, b, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return g * (x - mean) / np.sqrt(var + eps) + b

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gpt_forward(token_ids, params):
    T = len(token_ids)
    d = 16

    # 1. Embed
    x = params['wte'][token_ids] + params['wpe'][np.arange(T)]

    # 2. Attention block
    x_norm = layernorm(x, params['ln1_g'], params['ln1_b'])
    Q = x_norm @ params['W_Q']  # [T, 16]
    K = x_norm @ params['W_K']  # [T, 16]
    V = x_norm @ params['W_V']  # [T, 16]

    # Scaled dot-product with causal mask
    scores = Q @ K.T / np.sqrt(d)  # [T, T]
    mask = np.tril(np.ones((T, T)))
    scores = np.where(mask == 0, -1e9, scores)
    attn = softmax(scores)         # [T, T]
    attn_out = attn @ V            # [T, 16]
    attn_out = attn_out @ params['W_O']

    # 3. Residual
    x = x + attn_out

    # 4. MLP block
    x_norm = layernorm(x, params['ln2_g'], params['ln2_b'])
    h = x_norm @ params['W_up'] + params['b_up']        # [T, 64]
    h = np.maximum(0, h)                                # ReLU
    mlp_out = h @ params['W_down'] + params['b_down']   # [T, 16]

    # 5. Residual
    x = x + mlp_out

    # 6. Final norm + project to vocab
    x = layernorm(x, params['ln_f_g'], params['ln_f_b'])
    logits = x @ params['wte'].T  # [T, 27], weight tying!
    return logits
```
Here's the crucial detail: the model's input and its target (ground truth) are the same sequence, shifted by one position. The target for position t is the token at position t+1.
```python
import torch
import torch.nn.functional as F

# Training sequence: "emma" (a=0, ..., z=25, BOS=26)
tokens = torch.tensor([[26, 4, 12, 12, 0]])  # [BOS, e, m, m, a], shape [B=1, 5]
input_tokens = tokens[:, :-1]  # [BOS, e, m, m] (positions 0,1,2,3)
targets = tokens[:, 1:]        # [e, m, m, a]   (the NEXT token at each position)

logits = model(input_tokens)   # [B, 4, 27]: 27 logits per position (model defined elsewhere)

# Cross-entropy loss at EVERY position:
# Position 0: model predicts next after BOS → should be 'e'
# Position 1: model predicts next after 'e' → should be 'm'
# Position 2: model predicts next after 'm' → should be 'm'
# Position 3: model predicts next after 'm' → should be 'a'
loss = F.cross_entropy(logits.reshape(-1, 27), targets.reshape(-1))
```
The model outputs [batch, seq_len, vocab_size] logits. The loss is computed at every position simultaneously, so one forward pass over seq_len input tokens gives you seq_len training signals (four in the "emma" example). This is why language model training extracts so much signal per example compared to, say, image classification, where each image gives you just one label.
We've been using loss = −log P(correct token) without justification. Where does this come from?
Given a dataset of sequences, the model assigns probability $P_\theta(x_t \mid x_{<t})$ to each next token. We want to find parameters $\theta$ that make the training data most probable.
Your task: Start from Maximum Likelihood Estimation (maximize the probability of the data) and show that it's equivalent to minimizing the average cross-entropy loss −log P.
Full derivation:
1. Likelihood: $P(\text{dataset}) = \prod_{\text{sequences}} \prod_{t=1}^{T} P_\theta(x_t \mid x_{<t})$
2. Log-likelihood: $\log P = \sum_{\text{sequences}} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$
3. Negate and average: $L(\theta) = -\frac{1}{N} \sum_i \sum_t \log P_\theta(x_t^{(i)} \mid x_{<t}^{(i)})$
4. Per-token: At each position, we have a true distribution $q$ (one-hot on the correct token) and model distribution $p$. The cross-entropy $H(q, p) = -\sum_v q(v) \log p(v) = -\log p(\text{correct})$ since $q$ is one-hot.
The key insight: Cross-entropy loss isn't an arbitrary choice. Minimizing it is exactly maximum likelihood estimation for categorical distributions. Using MSE or L1 on probabilities would not, in general, give you the MLE solution.
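Step 4 of the derivation can be checked numerically. In this small sketch the three-token distribution `p` is an arbitrary made-up example.

```python
import math


def cross_entropy(q, p):
    """H(q, p) = -sum_v q(v) * log p(v), skipping terms where q(v) = 0."""
    return -sum(qv * math.log(pv) for qv, pv in zip(q, p) if qv > 0)


p = [0.1, 0.7, 0.2]   # model's predicted distribution over 3 tokens
q = [0.0, 1.0, 0.0]   # one-hot truth: token 1 is correct

loss = cross_entropy(q, p)
shortcut = -math.log(p[1])   # -log P(correct token)
```

Because `q` is one-hot, every term except the correct token drops out of the sum, so `loss` and `shortcut` are the same number.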
GPT-2's validation loss was ~3.3 nats. GPT-3's was ~2.8 nats. These numbers are hard to interpret. Perplexity converts loss into an intuitive quantity: "on average, the model is as confused as if it were choosing uniformly from PPL options."
Your task: Show that perplexity = exp(average cross-entropy loss), and explain why perplexity = V (vocabulary size) for a random model and perplexity = 1 for a perfect model.
Full derivation:
$\text{PPL} = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t}) \right) = \exp(\text{average\_loss})$
Random model: $P(x_t) = 1/V$ for all $t$. Loss = $\log V$. PPL = $\exp(\log V) = V$. For GPT's 50K vocab, a random model has PPL = 50,257.
Perfect model: P(correct) = 1. Loss = 0. PPL = 1.
GPT-2: Loss ≈ 3.3 → PPL = exp(3.3) ≈ 27. "On average, the model is choosing from ~27 equally likely options."
GPT-3: Loss ≈ 2.8 → PPL = exp(2.8) ≈ 16. Narrowed it down to ~16 options.
The key insight: Perplexity has a direct interpretation: it's the effective branching factor. A model with PPL=27 is, on average, as uncertain as someone choosing from 27 equally likely options. Because it is measured per token, perplexity is only directly comparable between models that share a tokenizer and evaluation dataset.
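The two boundary cases from the derivation are easy to reproduce. This sketch assumes a list of per-token log-probabilities as input; the function name and the sequence length of 10 are illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-probability of the correct tokens."""
    avg_loss = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_loss)


V = 27  # microGPT's vocabulary size

# Random model: every token gets probability 1/V.
uniform_ppl = perplexity([math.log(1 / V)] * 10)

# Perfect model: the correct token always gets probability 1.
perfect_ppl = perplexity([math.log(1.0)] * 10)

# Converting a reported loss (in nats) to perplexity:
ppl_from_loss = math.exp(3.3)
```

`uniform_ppl` recovers the vocabulary size, `perfect_ppl` is 1, and a loss of 3.3 nats maps to a perplexity of about 27.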
Here is the complete generation algorithm. It's surprisingly short:
```python
import torch

def generate(model, prompt_tokens, max_new_tokens, temperature=1.0):
    tokens = prompt_tokens.clone()  # start with prompt, shape [1, T]
    for _ in range(max_new_tokens):
        # 1. Forward pass: only need logits for LAST position
        logits = model(tokens)      # [1, T, vocab_size]
        logits = logits[:, -1, :]   # [1, vocab_size], last token only
        # 2. Apply temperature
        logits = logits / temperature  # higher temperature → flatter distribution
        # 3. Convert to probabilities
        probs = torch.softmax(logits, dim=-1)
        # 4. Sample from the distribution
        next_token = torch.multinomial(probs, num_samples=1)  # [1, 1]
        # 5. Append and repeat
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```
That's it. Five lines in the loop. The model sees more and more context each iteration (or with KV cache, just the new token). Generation is inherently sequential — you can't parallelize it because each token depends on all previous ones.
Temperature is division before softmax. It controls the "sharpness" of the distribution:
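A small sketch makes the effect visible. The three logits below are arbitrary example values, and `softmax_with_temperature` is a hypothetical helper name.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()          # subtract the max so exp() cannot overflow
    e = np.exp(z)
    return e / e.sum()


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)  # more peaked
base = softmax_with_temperature(logits, temperature=1.0)   # plain softmax
flat = softmax_with_temperature(logits, temperature=2.0)   # closer to uniform
```

Lower temperatures push probability toward the top token, and higher temperatures spread it out. As the temperature approaches 0, sampling approaches always picking the most likely token.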
Temperature alone isn't enough. Even with T=0.8, the model might sometimes sample a very unlikely token (the "long tail"). Top-k fixes this by zeroing out everything except the k most likely tokens before sampling:
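```python
import torch

# Top-k: keep only the 5 highest logits, mask the rest with -inf
logits = torch.tensor([3.2, 2.5, -0.1, 1.8, -2.0, 0.5, -1.2, 1.5])
kth_value = torch.topk(logits, k=5).values[-1]  # 5th-largest logit
top_5 = logits.masked_fill(logits < kth_value, float('-inf'))
# top_5 = [3.2, 2.5, -inf, 1.8, -inf, 0.5, -inf, 1.5]

# Now softmax only distributes probability among those 5
probs = torch.softmax(top_5, dim=-1)

# Top-p (nucleus): keep smallest set of tokens
# whose cumulative probability ≥ p (e.g., 0.9)
# More adaptive: sometimes keeps 3 tokens, sometimes 20
```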
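Top-p is only described in comments above, so here is one way it could be sketched in NumPy. The function name `top_p_filter` and the example probabilities are assumptions for illustration.

```python
import numpy as np


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]             # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    # Index of the first position where the running total reaches p, plus one.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()            # renormalize the survivors


probs = [0.5, 0.3, 0.1, 0.05, 0.05]
nucleus = top_p_filter(probs, p=0.85)
```

With these numbers the running total reaches 0.85 at the third token, so three tokens survive and the two least likely ones get probability zero. A flatter distribution would keep more tokens for the same `p`, which is why top-p adapts where top-k does not.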
Here are the architectural parameters of the GPT family, from microGPT to GPT-3, plus rumored figures for GPT-4. Notice how each dimension scales:
| Model | Params | Layers | Heads | d_model | Context | Year |
|---|---|---|---|---|---|---|
| microGPT | 4,192 | 1 | 4 | 16 | 16 | 2024 |
| GPT-2 Small | 124M | 12 | 12 | 768 | 1,024 | 2019 |
| GPT-2 Medium | 355M | 24 | 16 | 1,024 | 1,024 | 2019 |
| GPT-2 Large | 774M | 36 | 20 | 1,280 | 1,024 | 2019 |
| GPT-2 XL | 1.5B | 48 | 25 | 1,600 | 1,024 | 2019 |
| GPT-3 | 175B | 96 | 96 | 12,288 | 2,048 | 2020 |
| GPT-4* | ~1.8T* | ~120* | ~128* | ~16K* | 128K | 2023 |
*GPT-4 specs are rumored (MoE with ~16 experts, ~110B active per forward pass). OpenAI has not confirmed.
| Dimension | microGPT | GPT-4 class |
|---|---|---|
| Data | 32K names | Trillions of tokens |
| Parameters | 4,192 | 100B – 1T+ |
| Layers | 1 | 80 – 128+ |
| Context | 16 chars | 128K+ tokens |
| Training | ~1 minute | ~3 months |
| Cost | $0 | $100M+ |
Pre-training is microGPT's algorithm at scale. SFT (Supervised Fine-Tuning) continues training on high-quality conversation data — same loss function, just better data. RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preferences, then optimizes the language model against it. The base algorithm never changes — what changes is what you train on.
The Chinchilla paper (Hoffmann et al., 2022) showed that for a fixed compute budget C, there's an optimal balance between model parameters N and training tokens D. The loss follows:
$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$
where α ≈ 0.34, β ≈ 0.28, and compute C ≈ 6ND (FLOPs ≈ 6 × params × tokens).
Your task: Given a fixed compute budget C, derive the optimal N and D that minimize L. Show that N and D should each scale roughly as the square root of C (i.e., if you 10x your compute, you should ~3x both params and data, not 10x either one).
Full derivation:
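Substitute $D = C/(6N)$ into the loss: $L(N) = E + \frac{A}{N^{\alpha}} + B \left( \frac{6N}{C} \right)^{\beta}$
Take $dL/dN = 0$: $\frac{\alpha A}{N^{\alpha+1}} = \frac{\beta B \, 6^{\beta} N^{\beta-1}}{C^{\beta}}$
Rearrange: $N^{\alpha+\beta} = \frac{\alpha A \, C^{\beta}}{\beta B \, 6^{\beta}}$
So: $N_{opt} \propto C^{\beta/(\alpha+\beta)} = C^{0.28/0.62} \approx C^{0.45}$
And: $D_{opt} = C/(6N_{opt}) \propto C^{1-0.45} = C^{0.55}$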
The key insight: Both N and D grow sub-linearly with C, roughly as √C, with D growing slightly faster (exponent 0.55 vs 0.45). GPT-3 was trained on too few tokens for its size (300B tokens for 175B params, under 2 tokens per parameter). Chinchilla (70B params, 1.4T tokens, about 20 tokens per parameter) outperformed GPT-3 with 2.5x fewer params, and the 280B-parameter Gopher with 4x fewer.
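The closed-form optimum can be sanity-checked against a brute-force search. In this sketch α and β are the values quoted above, while `A`, `B`, `E`, and the budget `C` are placeholder constants chosen only to make the numbers concrete, not fitted values.

```python
import numpy as np

alpha, beta = 0.34, 0.28
A, B, E = 400.0, 400.0, 1.7   # placeholder constants, for illustration only


def loss(N, C):
    """L(N) after substituting D = C / (6N)."""
    D = C / (6.0 * N)
    return E + A / N**alpha + B / D**beta


def n_opt(C):
    """Closed-form minimizer obtained from dL/dN = 0."""
    return (alpha * A * C**beta / (beta * B * 6.0**beta)) ** (1.0 / (alpha + beta))


C = 1e21
N_closed = n_opt(C)

# Brute force: evaluate the loss on a dense log-spaced grid of model sizes.
grid = np.logspace(6, 12, 20001)
N_grid = grid[np.argmin(loss(grid, C))]

# How much should N grow when compute grows 10x?
growth_10x = n_opt(10 * C) / n_opt(C)
```

The grid minimum lands within a fraction of a percent of `N_closed`, and `growth_10x` equals 10 raised to β/(α+β), roughly 2.8. So 10x more compute calls for about 3x more parameters under this loss model.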
Real-world solution (circa 2024):
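With $C = 3 \times 10^{23}$ FLOPs, Chinchilla's ~20:1 token-to-param ratio gives roughly N ≈ 50B params and D ≈ 1T tokens ($6 \times 5 \times 10^{10} \times 10^{12} = 3 \times 10^{23}$). Spending the same compute on N ≈ 13B params and D ≈ 4T tokens (about 300:1) is deliberately over-trained. The field has moved toward this kind of "over-training" of smaller models (more tokens than Chinchilla-optimal) because inference cost matters more than training cost. Llama-3 8B was trained on 15T tokens (1875:1 token-to-param ratio vs Chinchilla's ~20:1).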
Modern answer: Train a 7-13B dense model on 4-15T tokens. Context 4K-8K for pre-training (extend later with RoPE scaling). Use ~1000 H100s for 3-4 months. Dense > MoE at this scale because MoE routing overhead dominates when model is small. Budget: ~60% pre-training, ~10% SFT data curation, ~20% RLHF/DPO, ~10% evaluation and iteration.
The key trade-off: Chinchilla minimizes training loss for fixed compute. But in production, inference cost dominates. A smaller model trained longer has worse training efficiency but better deployment economics.
Pre-training makes GPT a brilliant document completer. But "complete this document" isn't the same as "be helpful." RLHF adds a second objective: maximize a reward model trained on human preferences, while staying close to the pre-trained model (the KL penalty prevents "reward hacking"). The base GPT never changes architecture — only the loss signal changes.
Both losses are expectations over text. What's the fundamental difference in what distribution the expectation is taken over?
You now understand the creation. The only question left is: what will you build?