The Complete Beginner's Path

Understand GPT
From Absolute Zero

Karpathy built the entire GPT algorithm in 243 lines of Python. This masterclass will make sure you understand every single one.

Prerequisites: Basic Python + High school algebra. That's it.
11 Chapters · 15+ Simulations · 0 Assumed ML Knowledge

Chapter 0: Why Does This Matter?

A Large Language Model like ChatGPT is, at its core, a next-token predictor. You give it some text, and it predicts which token (roughly, a word or word fragment) should come next. The entire miracle of "artificial intelligence" emerges from doing this one thing extremely well.

The big reveal: When you chat with ChatGPT, it's not "thinking" the way you do. It's completing a document one token at a time. Your conversation is just a funny-looking document.

What does microGPT do?

microGPT learns patterns in 32,000 human names and generates new names that sound plausible but never existed. The same algorithm, scaled up 100,000x, produces ChatGPT.

See It In Action

Names generated by the 243-line model after training:

The 5-Step Loop

Every LLM runs this exact loop:

Step 1: Feed a sequence of tokens into the model
Step 2: Model outputs a probability for each possible next token
Step 3: Compare prediction to actual next token → compute loss
Step 4: Figure out how to adjust each parameter (backpropagation)
Step 5: Adjust parameters slightly → repeat 1000s of times
Check: What does an LLM fundamentally do?

Chapter 1: Numbers All The Way Down

Neural networks only understand numbers. A vector is just a list of numbers. A parameter is a single adjustable number the model learns during training.

Interactive Dot Product

Vector A: a1 = 1.0, a2 = 2.0, a3 = -1.0
Vector B: b1 = 2.0, b2 = 1.0, b3 = 0.5
Key insight: The dot product measures how similar two vectors are. Positive = similar direction. Negative = opposite. Zero = unrelated. This is the foundation of attention.
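As a small sketch of the idea, here is the dot product written out in plain Python using the slider defaults shown above. The helper name `dot` is just an illustrative choice.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0, -1.0]   # slider defaults for Vector A
b = [2.0, 1.0, 0.5]    # slider defaults for Vector B

print(dot(a, b))                      # 1*2 + 2*1 + (-1)*0.5 = 3.5
print(dot(a, [-x for x in a]))        # opposite direction: negative
print(dot([1.0, 0.0], [0.0, 1.0]))    # perpendicular vectors: zero
```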
Check: What is a "parameter" in a neural network?

Chapter 2: The Slope of a Hill

The derivative tells you: if I nudge the input a tiny bit, how much does the output change?

Interactive Derivative Explorer

Drag the point along f(x) = x². The tangent line shows the derivative (slope).

x = 1.0
f(1.0) = 1.0 | slope = 2.0
The gradient is a vector of slopes — one per parameter. It points uphill. We go the opposite direction.
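One way to check a slope without any calculus is to nudge the input and measure the change. The sketch below does that for f(x) = x²; the function name `numerical_slope` and the step size `h` are illustrative choices.

```python
def f(x):
    return x ** 2

def numerical_slope(fn, x, h=1e-6):
    """Central-difference estimate of the derivative of fn at x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

for x in (1.0, -2.0, 0.0):
    print(f"x = {x:+.1f}  f(x) = {f(x):.1f}  slope ≈ {numerical_slope(f, x):.4f}")
```

At x = 1.0 the estimate agrees with the explorer's slope of 2.0, and the sign of the slope says which way is uphill.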
If the derivative of loss w.r.t. a parameter is +3.5, what should you do?

Chapter 3: Rolling Downhill

Gradient descent: compute the loss, compute gradients, take a small step downhill, repeat.

Gradient Descent Simulator

Click anywhere to place the ball, then watch it roll downhill.

Learning rate: 0.1
parameter = parameter − learning_rate × gradient
This is ALL of training. Compute loss. Compute gradients. Take a small step downhill. Repeat.
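The update rule can be tried on a one-parameter toy problem. The loss below, (w − 3)², is a made-up example chosen because its minimum is known; it is not part of microGPT.

```python
def loss(w):
    return (w - 3.0) ** 2      # toy loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)     # derivative of the toy loss

w = 0.0                # starting parameter value
learning_rate = 0.1    # same value as the simulator default

history = []
for step in range(50):
    history.append(loss(w))
    w = w - learning_rate * grad(w)   # parameter = parameter - lr * gradient

print(f"final w = {w:.4f}, final loss = {loss(w):.6f}")
```

Each step shrinks the distance to the minimum by a constant factor here; a learning rate above 1.0 on this toy loss would overshoot and diverge instead.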

Chapter 4: The Autograd Engine

Every number in microGPT is wrapped in a Value object that tracks its gradient. When you multiply two Values, the result remembers how it was made.

python
class Value:
    def __init__(self, data):
        self.data = data   # the actual number
        self.grad = 0      # how the loss depends on this

The 6 Building Blocks

| Operation | Local Gradient |
| --- | --- |
| a + b | Both inputs: slope = 1 |
| a × b | ∂/∂a = b, ∂/∂b = a |
| $a^n$ | $n \cdot a^{n-1}$ |
| log(a) | 1/a |
| exp(a) | exp(a) |
| relu(a) | 1 if a>0, else 0 |
That's all the calculus you need. These 6 local derivatives, combined via the chain rule, let you compute gradients through ANY computation.
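To make the chain-rule bookkeeping concrete, here is a minimal sketch of such an engine supporting three of the six operations. It is an illustration under my own naming (`_parents`, `_local_grads`), not the microGPT source.

```python
import math

class Value:
    """A number that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def log(self):
        return Value(math.log(self.data), (self,), (1.0 / self.data,))

    def backward(self):
        # Build an ordering where inputs come before the values built from them
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node._parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

a, b = Value(2.0), Value(3.0)
out = (a * b + a).log()     # log(a*b + a) = log(8)
out.backward()
print(a.grad, b.grad)       # (b + 1)/8 = 0.5 and a/8 = 0.25
```

Note the `+=` in `backward`: `a` is used twice in the expression, so its gradient is the sum of the contributions from both paths.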

Chapter 5: From Characters to Vectors

The Problem: Numbers Aren't Rich Enough

We've built a gradient descent engine. It can adjust numbers to minimize a loss function. But here's the problem: how do we feed text into a system that only understands numbers?

The naive approach: assign each character a number. a=1, b=2, c=3... z=26. Now "cat" becomes [3, 1, 20]. But this creates a lie — it says "b" is halfway between "a" and "c," and "m" is close to "n." In reality, the letter "a" has no meaningful numerical relationship to "b." The number 1 vs. 2 is an arbitrary label, not a measurement.

A single number per character can only tell you which character it is. It can't capture anything about what that character means in context. Think of it like GPS: a single number (longitude) tells you east-west position, but you need two numbers (latitude + longitude) to pinpoint a location on Earth. With more numbers, you can describe richer things.

The core idea behind embeddings: instead of representing each character with 1 number, represent it with a list of numbers — a vector. With 16 numbers per character, the model has 16 "dimensions" it can use to encode properties like: "appears at the start of words," "usually followed by a vowel," "common in names." The model discovers these properties during training. We don't tell it what the 16 dimensions mean — it figures that out by learning to predict the next character.

Step 1: Tokenize — Give Each Character a Name

Before anything else, we need a consistent way to refer to each character. We assign each one an integer ID — think of it as a name tag, not a measurement. The specific numbers don't matter; what matters is that each character gets a unique ID.

For microGPT: a=0, b=1, c=2, ..., z=25, and a special "beginning of sequence" token BOS=26. That's 27 possible tokens total.
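A tokenizer for this scheme fits in a few lines. The sketch below assumes lowercase input only; the names `stoi`, `itos`, `encode`, and `decode` are illustrative.

```python
import string

BOS = 26                                   # special beginning-of-sequence ID
stoi = {ch: i for i, ch in enumerate(string.ascii_lowercase)}  # 'a' -> 0 ... 'z' -> 25
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    """Lowercase letters to IDs, with BOS in front."""
    return [BOS] + [stoi[ch] for ch in name]

def decode(ids):
    """IDs back to text, dropping any BOS tokens."""
    return "".join(itos[i] for i in ids if i != BOS)

print(encode("hi"))                 # [26, 7, 8]
print(decode([26, 4, 12, 12, 0]))   # emma
```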

Live Tokenizer

Type any text below. Each character gets mapped to its integer ID.

Step 2: Embed — From Name Tags to Rich Descriptions

Now we convert each token ID into a vector. How? With a simple lookup table — a big grid of numbers called an embedding table.

Picture a spreadsheet with 27 rows (one per character) and 16 columns (one per dimension). Each row is that character's vector. To "embed" the character "e" (ID=4), you just grab row 4. No math, no computation — just look up the row.

python
import torch
import torch.nn as nn

# The embedding table: 27 characters, each gets a 16-number vector
wte = nn.Embedding(27, 16)   # 27 rows × 16 columns = 432 learnable numbers

# "Look up" a character: pass its ID to the table, which returns that row
# token ID 4 (the letter 'e') → row 4 → a vector of 16 numbers
e_vector = wte(torch.tensor(4))   # e.g. [0.23, -0.81, 1.42, ..., 0.05] (16 numbers)

Before training, these 432 numbers are random. The letter "e" starts with a meaningless vector. But during training, gradient descent adjusts these numbers so that characters appearing in similar contexts end up with similar vectors. The model discovers structure on its own.

Think of it this way: if you described every person in a room with 16 numbers — height, weight, age, hair length, voice pitch, etc. — people who look similar would have similar number lists. Embeddings do the same thing for characters, except the model invents its own categories instead of using human-chosen ones like "height."

Step 3: Position — Where Are You in the Sentence?

There's a subtle problem. If we just embed each character, the model sees the word "cat" as three vectors: [vec_c, vec_a, vec_t], and "act" as the same three vectors in a different order: [vec_a, vec_c, vec_t]. Attention itself has no built-in sense of order: it compares vectors by content, not by where they sit. How does the model know which character came first?

The answer: we add a second set of vectors that encode position. Position 0 gets its own vector, position 1 gets a different vector, and so on. These are also stored in a lookup table and also learned during training.

python
import torch
import torch.nn as nn

wte = nn.Embedding(27, 16)   # token embedding table from Step 2
# Position embedding table: 16 positions, each gets a 16-number vector
wpe = nn.Embedding(16, 16)   # 16 rows × 16 columns = 256 learnable numbers

# For the word "cat" at positions 0, 1, 2:
tok_emb = wte(torch.tensor([2, 0, 19]))   # look up c, a, t → three 16-number vectors
pos_emb = wpe(torch.tensor([0, 1, 2]))    # look up positions 0, 1, 2 → three 16-number vectors

# Combine: add them element-by-element
x = tok_emb + pos_emb         # "c at position 0", "a at position 1", "t at position 2"

Why add instead of sticking them side by side (concatenating)? If we concatenated, each vector would double from 16 numbers to 32 numbers, making every downstream computation more expensive. Addition keeps the size at 16. The model learns to pack both "which character" and "which position" into the same 16 numbers — they share the space.

The full pipeline for "hi":
(1) Tokenize: "hi" → [BOS, h, i] → [26, 7, 8] — three integer IDs
(2) Token embed: look up rows 26, 7, 8 → three vectors of 16 numbers each
(3) Position embed: look up rows 0, 1, 2 → three vectors of 16 numbers each
(4) Add them: token + position → three vectors of 16 numbers each
That's what enters the attention layer next. Three characters, each described by 16 numbers that encode both what the character is and where it sits.
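The four steps above can be sketched in NumPy. The tables here are filled with random numbers as stand-ins for learned values, so only the shapes and the lookup logic are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
wte = rng.normal(size=(27, 16))   # token table: random stand-in for learned values
wpe = rng.normal(size=(16, 16))   # position table: random stand-in for learned values

token_ids = [26, 7, 8]                      # BOS, h, i
tok_emb = wte[token_ids]                    # rows 26, 7, 8 -> shape (3, 16)
pos_emb = wpe[np.arange(len(token_ids))]    # rows 0, 1, 2  -> shape (3, 16)
x = tok_emb + pos_emb                       # element-wise sum -> shape (3, 16)

print(x.shape)

# Same characters in a different order: 'h' now sits at position 2, not 1
ih = wte[[26, 8, 7]] + wpe[np.arange(3)]
print(np.allclose(x[1], ih[2]))   # False: same letter, different position
```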

Reading Shape Notation

From here on, we'll describe the size of data using shape notation like [3, 16]. This just means "3 rows, 16 columns" — a grid of numbers. You'll also see a third dimension: [B, T, 16], where B is the batch size (how many examples we process at once for efficiency) and T is the sequence length (how many characters). So [4, 8, 16] means "4 examples, each with 8 characters, each character described by 16 numbers."

Scaling Up

microGPT uses 27 characters and 16 dimensions. Real models are bigger, but the mechanism is identical — just larger tables:

| Model | Vocab Size | Vector Size | Embedding Table Size |
| --- | --- | --- | --- |
| microGPT | 27 characters | 16 numbers each | 432 numbers |
| GPT-2 Small | 50,257 tokens | 768 numbers each | 38.6 million numbers |
| GPT-3 | 50,257 tokens | 12,288 numbers each | 617 million numbers |

GPT-2 and GPT-3 use subword tokens instead of individual characters (chunks like "the", "ing", "tion", produced by byte-pair encoding), which is more efficient. But the embedding mechanism is the same: one row per token, looked up by index.

Check: Why can't we just feed the raw integer IDs (a=0, b=1, ...) directly into the model?
Checkpoint — Before you move on
Trace the full pipeline from raw text "hi" to the data that enters the attention layer. What happens at each step? Why do we add position embeddings instead of concatenating them?
Model Answer

Step 1: Tokenize "hi" → [26, 7, 8] (BOS=26, h=7, i=8). Three integer IDs.

Step 2: Token embedding lookup: grab rows 26, 7, 8 from the token table → three vectors of 16 numbers each.

Step 3: Position embedding lookup: grab rows 0, 1, 2 from the position table → three vectors of 16 numbers each.

Step 4: Add element-by-element: token vector + position vector → three vectors of 16 numbers each. This is what enters attention.

Why add, not concatenate? Concatenation would double the vector size from 16 to 32, making every downstream computation more expensive. Addition keeps the size at 16. It works because the model learns to pack both identity and position into the same 16-dimensional space during training.

Chapter 6: Attention — Tokens Talking

Each token creates three vectors: Query ("what am I looking for?"), Key ("what do I contain?"), Value ("what info do I offer?").

attention_weight = softmax( Q · Kᵀ / √d ), output = attention_weight · V
Multi-head: microGPT uses 4 attention heads, each on a 4-dim slice of the 16-dim vector. Each head learns different patterns.
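As a rough sketch of the multi-head split, the NumPy code below runs four heads over 4-dimensional slices and concatenates the results. The Q, K, and V matrices are random stand-ins for the learned projections, the scaling uses each head's own dimension (4), and no causal mask is applied; all of these are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, n_embd, n_head = 3, 16, 4
head_dim = n_embd // n_head            # 4 dimensions per head

Q = rng.normal(size=(T, n_embd))       # random stand-in for projected queries
K = rng.normal(size=(T, n_embd))       # random stand-in for projected keys
V = rng.normal(size=(T, n_embd))       # random stand-in for projected values

outputs = []
for h in range(n_head):
    cols = slice(h * head_dim, (h + 1) * head_dim)            # this head's 4-dim slice
    scores = Q[:, cols] @ K[:, cols].T / np.sqrt(head_dim)    # (T, T) similarity scores
    weights = softmax(scores)                                 # each row sums to 1
    outputs.append(weights @ V[:, cols])                      # (T, 4) weighted mix of values

out = np.concatenate(outputs, axis=-1)   # heads side by side: back to (T, 16)
print(out.shape)
```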

The Causal Mask: The Autoregressive Secret

GPT is autoregressive — it predicts the next token from only the previous tokens. This means token 3 must NOT see tokens 4, 5, 6... How do we enforce this? With a causal mask: a lower-triangular boolean matrix.

python
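import torch

# For a 5-token sequence, the mask looks like:
mask = torch.tensor([
  [1, 0, 0, 0, 0],  # token 0 sees only itself
  [1, 1, 0, 0, 0],  # token 1 sees tokens 0,1
  [1, 1, 1, 0, 0],  # token 2 sees tokens 0,1,2
  [1, 1, 1, 1, 0],  # token 3 sees tokens 0,1,2,3
  [1, 1, 1, 1, 1],  # token 4 sees all tokens
])  # same pattern as torch.tril(torch.ones(5, 5))

scores = torch.randn(5, 5)  # stand-in for the raw attention scores Q · Kᵀ / √d

# Where mask is 0, we set the attention score to -infinity
# exp(-inf) = 0, so after softmax those tokens get exactly zero weight
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)  # each row sums to 1 over the visible tokens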

This is the ENTIRE autoregressive property. No special architecture, no separate backward pass. Just a triangular matrix of zeros and ones applied before softmax. The implementation is one line of code, but it's what makes GPT a generative model.

Why this works for training too: During training, you feed in a full sequence and the mask ensures each position can only see previous positions. Position 0 predicts token 1, position 1 predicts token 2, etc. One forward pass, T−1 predictions. This is why GPT trains so efficiently.
Check: In a 6-token sequence, which tokens can token 3 attend to?
🔗 Pattern Recognition
GPT is a Decoder-Only Transformer
GPT (this lesson)
Causal mask: each token sees only past tokens.
One stack of [Attention + MLP] blocks.
Trained with next-token prediction.
Full Transformer
Encoder: bidirectional (no mask, sees all tokens).
Decoder: causal + cross-attention to encoder.
Trained with seq2seq (translate, summarize). → Transformer lesson

The original 2017 Transformer had both encoder and decoder stacks. GPT's insight (Radford et al., 2018): throw away the encoder entirely. A causal decoder alone, trained on enough data, learns to do everything. The causal mask IS the architecture — remove it and you get BERT (bidirectional encoder). Same attention mechanism, same Q/K/V math, different masking pattern = completely different model behavior.

BERT uses a [MASK] token and predicts masked words bidirectionally. Why can't BERT generate text autoregressively the way GPT does?

Chapter 7: The Full Model

Complete Forward Pass
Attention = Communication
Tokens look at each other.
MLP = Computation
Each token thinks independently.

Model Configuration

| Setting | Value | Meaning |
| --- | --- | --- |
| n_embd | 16 | Each token = 16 numbers |
| n_head | 4 | 4 parallel attention patterns |
| n_layer | 1 | One [Attention + MLP] block |
| vocab_size | 27 | 26 letters + BOS |
| Total params | 4,192 | 4,192 learnable numbers |

Where Do the Parameters Live?

Let's account for every single parameter in microGPT:

microGPT parameter audit
# Embeddings
token_emb:   27 × 16  =   432
pos_emb:     16 × 16  =   256

# Attention (1 block, 4 heads)
W_Q:         16 × 16  =   256
W_K:         16 × 16  =   256
W_V:         16 × 16  =   256
W_O:         16 × 16  =   256

# MLP (4× expansion: 16 → 64 → 16)
W_up:        16 × 64  = 1,024
b_up:        64       =    64
W_down:      64 × 16  = 1,024
b_down:      16       =    16

# LayerNorm (2 per block, each has scale + shift)
ln1:         16 + 16   =    32
ln2:         16 + 16   =    32

# LM Head (shared with token_emb via weight tying)
lm_head:     16 × 27  =   432  # often tied with token_emb
ln_final:    16 + 16   =    32

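# TOTAL of the rows above: 3,936 with lm_head tied (4,368 if counted separately)
# The 4,192 headline figure equals the weight matrices alone with an untied
# lm_head: 432 + 256 + 1,024 + 2,048 + 432 (no biases, no LayerNorm parameters)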

Now here's the same audit for GPT-2 Small (124M parameters):

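| Component | Formula | Params |
| --- | --- | --- |
| Token embedding | 50,257 × 768 | 38.6M |
| Position embedding | 1,024 × 768 | 0.8M |
| Attention (12 layers) | 12 × 4 × 768² | 28.3M |
| MLP (12 layers) | 12 × 2 × 768 × 3072 | 56.6M |
| LayerNorm + biases | small | ~0.1M |
| Total | | ~124M |

Where the weight is: In GPT-2 Small, embeddings are 32%, attention is 23%, and the MLP is 45%. The MLP is twice the size of attention because of the 4× expansion (8·d² versus 4·d² weights per layer), and that 2:1 ratio holds at every scale. The embedding share does not: it shrinks as models grow, to well under 1% in GPT-3 (617M of 175B).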
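The GPT-2 Small figures can be reproduced from the formulas in the table. This sketch counts weight matrices only (biases and LayerNorm are left out, as an approximation), so the total lands slightly under the official 124M.

```python
vocab, ctx, d, n_layer = 50_257, 1_024, 768, 12

token_emb = vocab * d
pos_emb = ctx * d
attention = n_layer * 4 * d * d          # W_Q, W_K, W_V, W_O per layer
mlp = n_layer * 2 * d * (4 * d)          # up- and down-projection per layer

total = token_emb + pos_emb + attention + mlp   # biases and LayerNorm omitted
for name, count in [("embeddings", token_emb + pos_emb),
                    ("attention", attention),
                    ("mlp", mlp)]:
    print(f"{name:<11}{count / 1e6:6.1f}M  {100 * count / total:4.1f}%")
print(f"{'total':<11}{total / 1e6:6.1f}M")
```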
Check: Which component has the most parameters in a typical GPT model?
💻 Build It Implement the GPT Forward Pass from Scratch
You've seen the parameter audit above. Now implement the complete forward pass: embed tokens, add position embeddings, pass through one transformer block (attention + MLP with residual connections and layer norms), and project to vocabulary logits.
signature
def gpt_forward(token_ids, params):
    """
    Args:
        token_ids: list of ints, length T (each in [0, 26])
        params: dict with keys:
            'wte': [27, 16] token embedding matrix
            'wpe': [16, 16] position embedding matrix
            'W_Q', 'W_K', 'W_V', 'W_O': [16, 16] attention weights
            'W_up': [16, 64], 'b_up': [64]
            'W_down': [64, 16], 'b_down': [16]
            'ln1_g', 'ln1_b': [16] layernorm params
            'ln2_g', 'ln2_b': [16] layernorm params
            'ln_f_g', 'ln_f_b': [16] final layernorm
    Returns:
        logits: [T, 27] unnormalized scores for next token
    """
Test case
token_ids = [26, 4, 12, 12, 0] # BOS, e, m, m, a
output shape: [5, 27] (one logit vector per position)
logits[0] should be the model's guess for what follows BOS
softmax(logits[0]) should sum to 1.0
After computing QK^T / sqrt(d), create a [T,T] lower-triangular matrix of ones. Set upper-triangle positions to -infinity before softmax. This ensures position i only attends to positions 0..i.
python
import numpy as np

def layernorm(x, g, b, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return g * (x - mean) / np.sqrt(var + eps) + b

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gpt_forward(token_ids, params):
    T = len(token_ids)
    d = 16

    # 1. Embed
    x = params['wte'][token_ids] + params['wpe'][np.arange(T)]

    # 2. Attention block
    x_norm = layernorm(x, params['ln1_g'], params['ln1_b'])
    Q = x_norm @ params['W_Q']  # [T, 16]
    K = x_norm @ params['W_K']  # [T, 16]
    V = x_norm @ params['W_V']  # [T, 16]

    # Scaled dot-product with causal mask
    scores = Q @ K.T / np.sqrt(d)  # [T, T]
    mask = np.tril(np.ones((T, T)))
    scores = np.where(mask == 0, -1e9, scores)
    attn = softmax(scores)  # [T, T]
    attn_out = attn @ V  # [T, 16]
    attn_out = attn_out @ params['W_O']

    # 3. Residual
    x = x + attn_out

    # 4. MLP block
    x_norm = layernorm(x, params['ln2_g'], params['ln2_b'])
    h = x_norm @ params['W_up'] + params['b_up']  # [T, 64]
    h = np.maximum(0, h)  # ReLU
    mlp_out = h @ params['W_down'] + params['b_down']  # [T, 16]

    # 5. Residual
    x = x + mlp_out

    # 6. Final norm + project to vocab
    x = layernorm(x, params['ln_f_g'], params['ln_f_b'])
    logits = x @ params['wte'].T  # [T, 27] — weight tying!

    return logits
Bonus challenge: Modify this to support 4 attention heads (split Q/K/V into 4 chunks of dim 4, compute attention separately, concatenate). How does the output change?
💥 Break-It Lab What Dies When You Remove Components?
A working GPT trained for 5000 steps on name data. Toggle components off to see the failure modes in the loss curve and generated output.
Remove Causal Mask
Failure mode: Training loss drops to near-zero (the model cheats by looking at future tokens), but generation produces garbage. The model never learned to predict — it learned to copy. This is data leakage: the model sees the answer during training, so it never develops the ability to actually predict.
Remove Learning Rate Warmup
Failure mode: Loss spikes wildly in the first 100 steps and may diverge to infinity (NaN). Without warmup, the initial random gradients are huge and the large learning rate sends parameters to extreme values. Adam's moment estimates need time to calibrate. Warmup gives them that time.
Remove Residual Connections
Failure mode: With 1 layer (microGPT), it barely matters. With 12+ layers (GPT-2), gradients vanish — loss plateaus at random (ln(27) ≈ 3.3) and never decreases. Residuals create a "gradient highway" that lets information flow directly from output back to early layers.
Reduce Context to 3 Tokens
Failure mode: Loss converges but to a higher floor (~2.4 instead of ~1.9). The model can't learn long-range patterns: "emm" is ambiguous (emma? emmy? emmett?) but "BOS_emm" is clearer. Slashing context removes information the model needs to disambiguate. Generated names are shorter and less realistic.

Chapter 8: Training — Learning From Mistakes

loss = −log(probability assigned to the correct answer)
Loss Intuition Builder
P(correct) = 0.10 → loss = −ln(0.10) = 2.303
Loss as "surprise": P=90% → loss=0.1 (expected). P=1% → loss=4.6 (shocked). Training minimizes total surprise.
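A few values of the loss, computed directly, make the "surprise" scale easier to feel. The function name `surprise` is an illustrative choice; the last probability, 1/27, is what a model guessing uniformly over microGPT's vocabulary would assign.

```python
import math

def surprise(p_correct):
    """Negative log-probability (in nats) of the correct token."""
    return -math.log(p_correct)

for p in (0.90, 0.10, 0.01, 1 / 27):
    print(f"P(correct) = {p:.3f}  ->  loss = {surprise(p):.3f}")
```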

The Shift Trick

Here's the crucial detail: the model's input and its target (ground truth) are the same sequence, shifted by one position. The target for position t is the token at position t+1.

python
import torch
import torch.nn.functional as F

# Training sequence: "emma" with BOS in front
tokens = torch.tensor([[26, 4, 12, 12, 0]])   # [B=1, 5]: BOS, e, m, m, a
inputs = tokens[:, :-1]    # [BOS, e, m, m]  (positions 0,1,2,3)
targets = tokens[:, 1:]    # [e, m, m, a]    (the NEXT token at each position)

logits = torch.randn(1, 4, 27)  # stand-in for model(inputs): [B, 4, 27], 27 logits per position

# Cross-entropy loss at EVERY position:
# Position 0: model predicts next after BOS → should be 'e'
# Position 1: model predicts next after 'e' → should be 'm'
# Position 2: model predicts next after 'm' → should be 'm'
# Position 3: model predicts next after 'm' → should be 'a'

loss = F.cross_entropy(logits.view(-1, 27), targets.reshape(-1))

The model outputs [batch, seq_len, vocab_size] logits. The loss is computed at every position simultaneously, so one forward pass over seq_len input tokens gives you seq_len training signals (here, 4 inputs and 4 targets). This dense supervision is why language model training extracts so much signal from each example compared to, say, image classification where each image gives you just one label.
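Here is a self-contained NumPy sketch of the shift and the per-position loss. The `cross_entropy` helper is written out by hand for illustration, and the all-zero logits stand in for an untrained model that treats every token as equally likely.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean of -log softmax(logits)[target] over all positions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)           # for stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

sequence = np.array([26, 4, 12, 12, 0])     # BOS, e, m, m, a
inputs, targets = sequence[:-1], sequence[1:]   # same sequence, offset by one

uniform_logits = np.zeros((len(inputs), 27))    # a model that knows nothing
print(inputs, targets)
print(cross_entropy(uniform_logits, targets))   # equals ln(27), about 3.296
```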

Scale context: GPT-3 was trained on ~300 billion tokens. With a context length of 2048, that's ~146 million sequences, and each sequence supplies about 2,048 training signals. At a batch size of ~3.2M tokens, that works out to roughly 94,000 optimizer steps (300B / 3.2M).
Watch the Loss Decrease
Check: If the input is [BOS, h, e, l, l, o], what is the target at position 2?
🔨 Derivation Why Cross-Entropy? Deriving the NTP Loss from Maximum Likelihood

We've been using loss = −log P(correct token) without justification. Where does this come from?

Given a dataset of sequences, the model assigns probability $P_\theta(x_t \mid x_{<t})$ to each next token. We want to find parameters θ that make the training data most probable.

Your task: Start from Maximum Likelihood Estimation (maximize the probability of the data) and show that it's equivalent to minimizing the average cross-entropy loss −log P.

For a sequence $x_1, \dots, x_T$, the joint probability factorizes autoregressively: $P(x) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$. The dataset likelihood is the product over all sequences.
Products are numerically unstable (underflow). log converts product to sum: $\log P(x) = \sum_t \log P(x_t \mid x_{<t})$. Maximizing log-likelihood = maximizing likelihood (log is monotone).
We conventionally minimize losses (gradient descent goes downhill). Maximizing log P = minimizing −log P. The average over all positions gives cross-entropy.

Full derivation:

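1. Likelihood: $P(\text{dataset}) = \prod_{\text{sequences}} \prod_{t=1}^{T} P_\theta(x_t \mid x_{<t})$

2. Log-likelihood: $\log P = \sum_{\text{sequences}} \sum_{t} \log P_\theta(x_t \mid x_{<t})$

3. Negate and average: $L(\theta) = -\frac{1}{N} \sum_{i} \sum_{t} \log P_\theta(x_t^{(i)} \mid x_{<t}^{(i)})$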

4. Per-token: At each position, we have a true distribution q (one-hot on the correct token) and model distribution p. The cross-entropy H(q, p) = −∑v q(v) log p(v) = −log p(correct) since q is one-hot.

The key insight: Cross-entropy loss isn't an arbitrary choice — it's the ONLY loss function that corresponds to maximum likelihood estimation for categorical distributions. Using MSE or L1 on probabilities would not give you the MLE solution.

🔨 Derivation Perplexity — Making Loss Interpretable

GPT-2's validation loss was ~3.3 nats. GPT-3's was ~2.8 nats. These numbers are hard to interpret. Perplexity converts loss into an intuitive quantity: "on average, the model is as confused as if it were choosing uniformly from PPL options."

Your task: Show that perplexity = exp(average cross-entropy loss), and explain why perplexity = V (vocabulary size) for a random model and perplexity = 1 for a perfect model.

Perplexity is defined as exp(H), where H is the average cross-entropy: $H = -\frac{1}{T} \sum_t \log P(x_t \mid x_{<t})$. This is just exponentiating the loss.
A random model assigns P = 1/V to every token (uniform over vocabulary). So loss = −log(1/V) = log(V). Perplexity = exp(log(V)) = V.
A perfect model assigns P = 1 to the correct token. Loss = −log(1) = 0. Perplexity = exp(0) = 1. It's never "surprised."

Full derivation:

$\text{PPL} = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t}) \right) = \exp(\text{average loss})$

Random model: $P(x_t) = 1/V$ for all t. Loss = log(V). PPL = exp(log V) = V. For GPT's 50K vocab, a random model has PPL = 50,257.

Perfect model: P(correct) = 1. Loss = 0. PPL = 1.

GPT-2: Loss ≈ 3.3 → PPL = exp(3.3) ≈ 27. "On average, the model is choosing from ~27 equally likely options."

GPT-3: Loss ≈ 2.8 → PPL = exp(2.8) ≈ 16. Narrowed it down to ~16 options.

The key insight: Perplexity has a beautiful interpretation: it's the effective branching factor. A model with PPL=27 is, on average, as uncertain as someone choosing from 27 equally likely options. Because it is measured per token, it is only directly comparable between models that share a tokenizer and an evaluation set.
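The two boundary cases are easy to confirm numerically. In this sketch `perplexity` takes the probabilities a model assigned to the correct tokens; the function name and the example probabilities are illustrative.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the correct tokens."""
    avg_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_loss)

V = 27
print(perplexity([1 / V] * 10))             # uniform guesser: PPL equals V
print(perplexity([1.0] * 10))               # perfect model: PPL equals 1
print(perplexity([0.5, 0.25, 0.5, 0.25]))   # mixed confidence: 2**1.5, about 2.83
```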

⚔ Adversarial: Two models, same loss, different quality
Model A (trained on Wikipedia) and Model B (trained on random Reddit comments) both achieve validation loss of 3.0 on their respective validation sets. A colleague claims they're "equally good." What's wrong with this reasoning?

Chapter 9: Generation — Creating Something New

Start: Feed BOS token
Predict: Model outputs 27 probabilities
Sample: Pick character based on probabilities
Feed Back: Use picked character as next input
↻ Repeat until BOS is generated

The Actual Generation Loop

Here is the complete generation algorithm. It's surprisingly short:

python
import torch

def generate(model, prompt_tokens, max_new_tokens, temperature=1.0):
    tokens = prompt_tokens.clone()  # start with prompt, shape [1, T]

    for _ in range(max_new_tokens):
        # 1. Forward pass: only need logits for LAST position
        logits = model(tokens)          # [1, T, vocab_size]
        logits = logits[:, -1, :]       # [1, vocab_size], last position only

        # 2. Apply temperature
        logits = logits / temperature   # higher temperature → flatter distribution

        # 3. Convert to probabilities
        probs = torch.softmax(logits, dim=-1)

        # 4. Sample from the distribution
        next_token = torch.multinomial(probs, num_samples=1)   # [1, 1]

        # 5. Append and repeat
        tokens = torch.cat([tokens, next_token], dim=1)

    return tokens

That's it. Five steps in the loop. The model sees more and more context each iteration (or with KV cache, just the new token). Generation is inherently sequential: you can't parallelize it across positions because each token depends on all previous ones.

Temperature

Temperature is division before softmax. It controls the "sharpness" of the distribution:

$P(\text{token}_i) = \text{softmax}(\text{logits} / T)_i$
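The effect of temperature is easy to see on a three-token example. The logits below are made up for illustration, and `softmax_with_temperature` is a hypothetical helper name.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature   # divide BEFORE softmax
    z = z - z.max()                                     # subtract max for stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.0]
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
```

Lower temperatures push more probability onto the top token; higher temperatures spread it out toward uniform.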

Top-k and Top-p Sampling

Temperature alone isn't enough. Even with T=0.8, the model might sometimes sample a very unlikely token (the "long tail"). Top-k fixes this by zeroing out everything except the k most likely tokens before sampling:

python
# Top-k: keep only the 5 highest logits, set the rest to -inf
inf = float('inf')
logits = [3.2, 2.5, -0.1, 1.8, -2.0, 0.5, -1.2, 1.5]
top_5  = [3.2, 2.5, -inf, 1.8, -inf, 0.5, -inf, 1.5]
# exp(-inf) = 0, so softmax distributes probability only among those 5

# Top-p (nucleus): keep smallest set of tokens
# whose cumulative probability ≥ p (e.g., 0.9)
# More adaptive: sometimes keeps 3 tokens, sometimes 20
T→0: Always pick most likely (greedy, deterministic). T=1: Sample from learned distribution. T→∞: Uniform random. In practice, T=0.7 with top-p=0.9 is a common sweet spot.
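Both filters can be sketched in NumPy as functions that return logits with the discarded tokens set to -inf. The function names are illustrative, and production libraries may handle ties and edge cases differently.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf (ties at the cutoff are kept)."""
    logits = np.asarray(logits, dtype=float)
    cutoff = np.sort(logits)[-k]                 # k-th largest value
    return np.where(logits >= cutoff, logits, -np.inf)

def top_p_filter(logits, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    logits = np.asarray(logits, dtype=float)
    order = np.argsort(logits)[::-1]             # most likely first
    cumulative = np.cumsum(softmax(logits)[order])
    n_keep = int(np.searchsorted(cumulative, p)) + 1   # first position where cumsum >= p
    n_keep = min(n_keep, len(logits))
    filtered = np.full_like(logits, -np.inf)
    filtered[order[:n_keep]] = logits[order[:n_keep]]
    return filtered

logits = [3.2, 2.5, -0.1, 1.8, -2.0, 0.5, -1.2, 1.5]
print(top_k_filter(logits, 5))
print(softmax(top_p_filter(logits, 0.9)).round(3))
```

On these logits, top-p with p = 0.9 keeps four tokens, one fewer than top-k with k = 5, because the four most likely tokens already cover more than 90% of the probability.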
Check: What does top-k sampling do?
⚔ Adversarial: Your GPT generates fluent text but repeats itself after ~50 tokens. Training loss is low. What's failing?
You've trained a 125M parameter GPT on 10B tokens of web text. Validation loss converged to 3.2 (reasonable). But during generation at temperature 0.8, the model produces coherent text for ~50 tokens then enters repetitive loops ("the the the..." or repeating the same sentence). Greedy decoding (T=0) is even worse.

Chapter 10: From Micro to Macro

Identical at every scale: Next-token prediction. Chain rule. Autograd. Attention. Residuals. Softmax. Adam. The training loop. The generation loop. The algorithm is the same — only the numbers change.

The GPT Family

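Here are the architectural parameters of the GPT models covered in this lesson (GPT-4's are rumored, not confirmed). Notice how each dimension scales:

| Model | Params | Layers | Heads | d_model | Context | Year |
| --- | --- | --- | --- | --- | --- | --- |
| microGPT | 4,192 | 1 | 4 | 16 | 16 | 2024 |
| GPT-2 Small | 124M | 12 | 12 | 768 | 1,024 | 2019 |
| GPT-2 Medium | 355M | 24 | 16 | 1,024 | 1,024 | 2019 |
| GPT-2 Large | 774M | 36 | 20 | 1,280 | 1,024 | 2019 |
| GPT-2 XL | 1.5B | 48 | 25 | 1,600 | 1,024 | 2019 |
| GPT-3 | 175B | 96 | 96 | 12,288 | 2,048 | 2020 |
| GPT-4* | ~1.8T* | ~120* | ~128* | ~16K* | 128K | 2023 |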

*GPT-4 specs are rumored (MoE with ~16 experts, ~110B active per forward pass). OpenAI has not confirmed.

The scaling pattern: From GPT-2 Small to GPT-3, parameters grew 1,400×. But d_model only grew 16× (768→12288), layers grew 8× (12→96), and heads grew 8× (12→96). Parameters scale roughly as d_model² × layers, so doubling d_model quadruples parameters.

| Dimension | microGPT | GPT-4 class |
| --- | --- | --- |
| Data | 32K names | Trillions of tokens |
| Parameters | 4,192 | 100B – 1T+ |
| Layers | 1 | 80 – 128+ |
| Context | 16 chars | 128K+ tokens |
| Training | ~1 minute | ~3 months |
| Cost | $0 | $100M+ |
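The scaling pattern (parameters roughly proportional to d_model² × layers) can be checked with a one-line estimate. The constant 12 assumes the layout used throughout this lesson: four d×d attention matrices plus an MLP with 4× expansion, ignoring embeddings, biases, and norms.

```python
def approx_block_params(n_layer, d_model):
    """Attention (4 d^2) plus MLP (8 d^2) weights per layer; everything else ignored."""
    return 12 * n_layer * d_model ** 2

gpt2_small = approx_block_params(12, 768)
gpt3 = approx_block_params(96, 12_288)

print(f"GPT-2 Small blocks: {gpt2_small / 1e6:.0f}M")    # about 85M of the 124M total
print(f"GPT-3 blocks:       {gpt3 / 1e9:.0f}B")          # about 174B of the 175B total
print(f"doubling d_model:   {approx_block_params(12, 1536) / gpt2_small:.0f}x")
```

The estimate is nearly exact for GPT-3, where embeddings are a rounding error, and noticeably low for GPT-2 Small, where embeddings are about a third of the model.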

The Three-Stage Pipeline

Pre-training: Same algorithm, massive scale. Result: document completer.
SFT: Fine-tune on conversations. Result: an assistant.
RLHF: Reinforce good behavior. Result: helpful, safe assistant.

Pre-training is microGPT's algorithm at scale. SFT (Supervised Fine-Tuning) continues training on high-quality conversation data — same loss function, just better data. RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preferences, then optimizes the language model against it. The base algorithm never changes — what changes is what you train on.

🔨 Derivation Chinchilla Optimal Compute Allocation

The Chinchilla paper (Hoffmann et al., 2022) showed that for a fixed compute budget C, there's an optimal balance between model parameters N and training tokens D. The loss follows:

$L(N, D) = E + \dfrac{A}{N^{\alpha}} + \dfrac{B}{D^{\beta}}$

where α ≈ 0.34, β ≈ 0.28, and compute C ≈ 6ND (FLOPs ≈ 6 × params × tokens).

Your task: Given a fixed compute budget C, derive the optimal N and D that minimize L. Show that both scale as power laws in C whose exponents sum to 1 (i.e., if you 10x your compute, you should grow params by roughly 2.8x and data by roughly 3.5x, not 10x each).

Use Lagrange multipliers or substitute $D = C/(6N)$ to eliminate one variable. You want to minimize $L(N) = E + A/N^{\alpha} + B/(C/(6N))^{\beta}$.
$dL/dN = -\alpha A/N^{\alpha+1} + \beta B\,(6/C)^{\beta} N^{\beta-1} = 0$. Solve for N in terms of C.
You'll get $N^{\alpha+\beta} \propto C^{\beta}$. So $N \propto C^{\beta/(\alpha+\beta)}$. With α=0.34, β=0.28: $N \propto C^{0.45}$.

Full derivation:

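Substitute $D = C/(6N)$ into the loss: $L(N) = E + A/N^{\alpha} + B\,(6N/C)^{\beta}$

Take $dL/dN = 0$: $\alpha A/N^{\alpha+1} = \beta B \cdot 6^{\beta} N^{\beta-1}/C^{\beta}$

Rearrange: $N^{\alpha+\beta} = (\alpha A\, C^{\beta}) / (\beta B \cdot 6^{\beta})$

So: $N_{opt} \propto C^{\beta/(\alpha+\beta)} = C^{0.28/0.62} \approx C^{0.45}$

And: $D_{opt} = C/(6N) \propto C^{1-0.45} = C^{0.55}$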

The key insight: Both N and D grow sub-linearly with C, but D grows slightly faster (exponent 0.55 versus 0.45), and because N·D ∝ C the two exponents sum to 1. In practice the Chinchilla result is usually summarized as roughly 20 training tokens per parameter. GPT-3 was trained on too few tokens for its size (300B tokens for 175B params, under 2 tokens per parameter). Chinchilla (70B params, 1.4T tokens) outperformed GPT-3 with 2.5x fewer params.
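As a quick numeric check of the exponents derived above, the sketch below plugs in the quoted α and β and shows what a 10× compute increase implies. The variable names `n_exp` and `d_exp` are illustrative.

```python
alpha, beta = 0.34, 0.28          # exponents quoted in the loss formula

n_exp = beta / (alpha + beta)     # N_opt grows as C ** n_exp
d_exp = alpha / (alpha + beta)    # D_opt grows as C ** d_exp (because N * D ∝ C)

print(f"N_opt ∝ C^{n_exp:.2f},  D_opt ∝ C^{d_exp:.2f}")
print(f"10x compute -> {10 ** n_exp:.1f}x params, {10 ** d_exp:.1f}x tokens")
```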

🏗 Design Challenge You're the Architect: $10M Compute Budget
Your startup just raised $10M earmarked for training a GPT-class language model. You need to decide the model architecture and training configuration. H100 GPUs cost ~$2/hr, you have 6 months, and your target is a strong general-purpose chat model.
Compute Budget: $10M ≈ 5M GPU-hours ≈ 3×10^23 FLOPs
Timeline: 6 months (cluster size trades off with wall-clock)
Target: General-purpose chat, competitive with GPT-3.5
Inference Cost: Must serve at <$5/M tokens
1. How many parameters? (Chinchilla says N ∝ C^0.45. Compute C = 6ND.)
2. How many training tokens? (What ratio of tokens-to-params?)
3. What context length? (Longer = more memory per token = fewer tokens/second)
4. Cluster size? (More GPUs = faster but communication overhead grows)
5. MoE or dense? (MoE gives more capacity at same inference cost)

Real-world solution (circa 2024):

With C = 3×10^23 FLOPs and Chinchilla's ~20:1 token-to-param ratio, C = 6ND gives N ≈ 50B params, D ≈ 1T tokens. However, the field has moved toward "over-training" smaller models (more tokens than Chinchilla-optimal) because inference cost matters more than training cost: the same budget also trains a 13B model on about 4T tokens (roughly 300:1). Llama-3 8B was trained on 15T tokens (1875:1 token-to-param ratio vs Chinchilla's ~20:1).

Modern answer: Train a 7-13B dense model on 4-15T tokens. Context 4K-8K for pre-training (extend later with RoPE scaling). Use ~1000 H100s for 3-4 months. Dense > MoE at this scale because MoE routing overhead dominates when model is small. Budget: ~60% pre-training, ~10% SFT data curation, ~20% RLHF/DPO, ~10% evaluation and iteration.

The key trade-off: Chinchilla minimizes training loss for fixed compute. But in production, inference cost dominates. A smaller model trained longer has worse training efficiency but better deployment economics.
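The trade-off can be explored with the C = 6ND rule alone. This sketch takes the challenge's FLOPs budget as given, solves for the compute-optimal point at the ~20 tokens-per-parameter ratio, and then shows how many tokens the same budget buys for smaller models. The candidate sizes 7B and 13B are illustrative.

```python
import math

C = 3e23                       # FLOPs budget from the challenge
tokens_per_param = 20          # Chinchilla-style ratio

# C = 6 * N * D with D = ratio * N  =>  N = sqrt(C / (6 * ratio))
N_opt = math.sqrt(C / (6 * tokens_per_param))
D_opt = tokens_per_param * N_opt
print(f"compute-optimal: N ≈ {N_opt / 1e9:.0f}B params, D ≈ {D_opt / 1e12:.1f}T tokens")

# Over-trained alternative: fix a smaller model, spend the rest on tokens
for N in (7e9, 13e9):
    D = C / (6 * N)
    print(f"N = {N / 1e9:.0f}B -> D ≈ {D / 1e12:.1f}T tokens ({D / N:.0f} tokens per param)")
```

The smaller models cost the same to train but are several times cheaper to serve, which is the deployment argument made above.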

🔗 Pattern Recognition
From Next-Token Prediction to Alignment
This Lesson (GPT)
loss = −log P(correct next token)
Optimizes: predict human text accurately
RLHF / Reward Alignment
loss = −reward(response) + β·KL(policy || base)
Optimizes: generate text humans prefer → Reward & Alignment

Pre-training makes GPT a brilliant document completer. But "complete this document" isn't the same as "be helpful." RLHF adds a second objective: maximize a reward model trained on human preferences, while staying close to the pre-trained model (the KL penalty prevents "reward hacking"). The base GPT never changes architecture — only the loss signal changes.

Both losses are expectations over text. What's the fundamental difference in what distribution the expectation is taken over?

"What I cannot create, I do not understand."
— Richard Feynman

You now understand the creation. The only question left is: what will you build?