microGPT — From Absolute Zero to Mastery

Operation	Local Gradient
a + b	Both inputs: slope = 1
a × b	∂/∂a = b, ∂/∂b = a
aⁿ	n · a^(n−1)
log(a)	1/a
exp(a)	exp(a)
relu(a)	1 if a>0, else 0

Chapter 5: From Characters to Vectors

The Problem: Numbers Aren't Rich Enough

We've built a gradient descent engine. It can adjust numbers to minimize a loss function. But here's the problem: how do we feed text into a system that only understands numbers?

The naive approach: assign each character a number. a=1, b=2, c=3... z=26. Now "cat" becomes [3, 1, 20]. But this creates a lie — it says "b" is halfway between "a" and "c," and "m" is close to "n." In reality, the letter "a" has no meaningful numerical relationship to "b." The number 1 vs. 2 is an arbitrary label, not a measurement.

A single number per character can only tell you which character it is. It can't capture anything about what that character means in context. Think of it like GPS: a single number (longitude) tells you east-west position, but you need two numbers (latitude + longitude) to pinpoint a location on Earth. With more numbers, you can describe richer things.

The core idea behind embeddings: instead of representing each character with 1 number, represent it with a list of numbers — a vector. With 16 numbers per character, the model has 16 "dimensions" it can use to encode properties like: "appears at the start of words," "usually followed by a vowel," "common in names." The model discovers these properties during training. We don't tell it what the 16 dimensions mean — it figures that out by learning to predict the next character.

Step 1: Tokenize — Give Each Character a Name

Before anything else, we need a consistent way to refer to each character. We assign each one an integer ID — think of it as a name tag, not a measurement. The specific numbers don't matter; what matters is that each character gets a unique ID.

For microGPT: a=0, b=1, c=2, ..., z=25, and a special "beginning of sequence" token BOS=26. That's 27 possible tokens total.
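A minimal sketch of this tokenizer, assuming the input contains only lowercase a-z. The helper names `encode` and `decode` are illustrative and not taken from microGPT's source.

```python
import string

# Vocabulary: a=0 ... z=25, plus one special token BOS=26 (27 tokens total).
CHARS = string.ascii_lowercase
BOS = len(CHARS)                                  # 26
stoi = {ch: i for i, ch in enumerate(CHARS)}      # character -> ID
itos = {i: ch for ch, i in stoi.items()}          # ID -> character

def encode(text):
    """Map a lowercase string to token IDs, prefixed with BOS."""
    return [BOS] + [stoi[ch] for ch in text]

def decode(ids):
    """Map token IDs back to a string, skipping BOS."""
    return "".join(itos[i] for i in ids if i != BOS)

print(encode("hi"))              # [26, 7, 8]
print(decode([26, 2, 0, 19]))    # cat
```

The IDs carry no meaning beyond identity: shuffling the mapping would work equally well, as long as it stays fixed for the whole of training.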

Step 2: Embed — From Name Tags to Rich Descriptions

Now we convert each token ID into a vector. How? With a simple lookup table — a big grid of numbers called an embedding table.

Picture a spreadsheet with 27 rows (one per character) and 16 columns (one per dimension). Each row is that character's vector. To "embed" the character "e" (ID=4), you just grab row 4. No math, no computation — just look up the row.

python
import torch
import torch.nn as nn

# The embedding table: 27 characters, each gets a 16-number vector
wte = nn.Embedding(27, 16)   # 27 rows × 16 columns = 432 learnable numbers

# "Look up" a character: call the table with a token ID
# token ID 4 (the letter 'e') → row 4 → a vector of 16 numbers
e_vector = wte(torch.tensor(4))   # same values as wte.weight[4], e.g. [0.23, -0.81, ..., 0.05]

Before training, these 432 numbers are random. The letter "e" starts with a meaningless vector. But during training, gradient descent adjusts these numbers so that characters appearing in similar contexts end up with similar vectors. The model discovers structure on its own.

Think of it this way: if you described every person in a room with 16 numbers — height, weight, age, hair length, voice pitch, etc. — people who look similar would have similar number lists. Embeddings do the same thing for characters, except the model invents its own categories instead of using human-chosen ones like "height."

Step 3: Position — Where Are You in the Sentence?

There's a subtle problem. If we just embed each character, the model sees the word "cat" as three vectors: [vec_c, vec_a, vec_t]. But it also sees "act" as the same three vectors in a different order: [vec_a, vec_c, vec_t]. How does the model know which character came first?

The answer: we add a second set of vectors that encode position. Position 0 gets its own vector, position 1 gets a different vector, and so on. These are also stored in a lookup table and also learned during training.

python
import torch
import torch.nn as nn

# Position embedding table: 16 positions, each gets a 16-number vector
wpe = nn.Embedding(16, 16)   # 16 rows × 16 columns = 256 learnable numbers

# For the word "cat" at positions 0, 1, 2 (wte is the token table from Step 2):
tok_emb = wte(torch.tensor([2, 0, 19]))   # look up c, a, t → three 16-number vectors
pos_emb = wpe(torch.tensor([0, 1, 2]))    # look up positions 0, 1, 2 → three 16-number vectors

# Combine: add them element-by-element
x = tok_emb + pos_emb         # "c at position 0", "a at position 1", "t at position 2"

Why add instead of sticking them side by side (concatenating)? If we concatenated, each vector would double from 16 numbers to 32 numbers, making every downstream computation more expensive. Addition keeps the size at 16. The model learns to pack both "which character" and "which position" into the same 16 numbers — they share the space.

The full pipeline for "hi":
(1) Tokenize: "hi" → [BOS, h, i] → [26, 7, 8] — three integer IDs
(2) Token embed: look up rows 26, 7, 8 → three vectors of 16 numbers each
(3) Position embed: look up rows 0, 1, 2 → three vectors of 16 numbers each
(4) Add them: token + position → three vectors of 16 numbers each
That's what enters the attention layer next. Three characters, each described by 16 numbers that encode both what the character is and where it sits.
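The four steps above can be sketched in NumPy as follows. The tables here are filled with random numbers as stand-ins for trained weights, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, block_size, n_embd = 27, 16, 16

# Randomly initialised lookup tables (stand-ins for trained weights).
wte = rng.normal(size=(vocab_size, n_embd))   # token embedding table
wpe = rng.normal(size=(block_size, n_embd))   # position embedding table

token_ids = np.array([26, 7, 8])              # (1) tokenize: BOS, h, i
positions = np.arange(len(token_ids))         # 0, 1, 2

tok_emb = wte[token_ids]                      # (2) token lookup    -> shape (3, 16)
pos_emb = wpe[positions]                      # (3) position lookup -> shape (3, 16)
x = tok_emb + pos_emb                         # (4) add             -> shape (3, 16)

# For comparison: concatenating instead of adding doubles the width.
x_cat = np.concatenate([tok_emb, pos_emb], axis=-1)

print(x.shape)      # (3, 16)
print(x_cat.shape)  # (3, 32)
```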

Reading Shape Notation

From here on, we'll describe the size of data using shape notation like [3, 16]. This just means "3 rows, 16 columns" — a grid of numbers. You'll also see a third dimension: [B, T, 16], where B is the batch size (how many examples we process at once for efficiency) and T is the sequence length (how many characters). So [4, 8, 16] means "4 examples, each with 8 characters, each character described by 16 numbers."
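As a quick illustration of the notation, the following sketch builds an array of shape [4, 8, 16] in NumPy and peels off one axis at a time. The name `C` for the last axis is an assumption made for this example.

```python
import numpy as np

B, T, C = 4, 8, 16              # batch size, sequence length, numbers per character
batch = np.zeros((B, T, C))

print(batch.shape)              # (4, 8, 16): the whole batch
print(batch[0].shape)           # (8, 16): one example, 8 characters
print(batch[0, 2].shape)        # (16,): one character's vector
```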

Scaling Up

microGPT uses 27 characters and 16 dimensions. Real models are bigger, but the mechanism is identical — just larger tables:

Model	Vocab Size	Vector Size	Embedding Table Size
microGPT	27 characters	16 numbers each	432 numbers
GPT-2 Small	50,257 tokens	768 numbers each	38.6 million numbers
GPT-3	50,257 tokens	12,288 numbers each	617 million numbers

GPT-2 and GPT-3 use subword tokens instead of individual characters (chunks like "the", "ing", "tion", produced by byte-pair encoding), which is more efficient. But the embedding mechanism is the same: one row per token, looked up by index.

Check: Why can't we just feed the raw integer IDs (a=0, b=1, ...) directly into the model?

(a) The numbers are too small
(b) A single number implies false relationships (b is "between" a and c) and can't capture rich properties: we need a vector of many numbers per character
(c) Neural networks can't process integers

Checkpoint — Before you move on

Trace the full pipeline from raw text "hi" to the data that enters the attention layer. What happens at each step? Why do we add position embeddings instead of concatenating them?

Model Answer

Step 1: Tokenize "hi" → [26, 7, 8] (BOS=26, h=7, i=8). Three integer IDs.

Step 2: Token embedding lookup: grab rows 26, 7, 8 from the token table → three vectors of 16 numbers each.

Step 3: Position embedding lookup: grab rows 0, 1, 2 from the position table → three vectors of 16 numbers each.

Step 4: Add element-by-element: token vector + position vector → three vectors of 16 numbers each. This is what enters attention.

Why add, not concatenate? Concatenation would double the vector size from 16 to 32, making every downstream computation more expensive. Addition keeps the size at 16. It works because the model learns to pack both identity and position into the same 16-dimensional space during training.

Chapter 6: Attention — Tokens Talking

Each token creates three vectors: Query ("what am I looking for?"), Key ("what do I contain?"), Value ("what info do I offer?").

attention_weights = softmax( Q · Kᵀ / √d ), where d is the dimension of each query/key vector
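To make the formula concrete, here is a small sketch for a single query scored against three keys. The vectors are made-up numbers chosen so the dot products are easy to check by hand.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(scores)                               # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d = 4                                             # dimension of each query/key
query = [1.0, 0.0, 1.0, 0.0]                      # "what am I looking for?"
keys = [                                          # "what do I contain?"
    [1.0, 0.0, 1.0, 0.0],                         # same direction as the query
    [0.0, 1.0, 0.0, 1.0],                         # orthogonal to the query
    [0.5, 0.5, 0.5, 0.5],                         # partial match
]

# Dot product of the query with each key, scaled by sqrt(d).
scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
weights = softmax(scores)

print(scores)    # [1.0, 0.0, 0.5]
print(weights)   # largest weight goes to the best-matching key
```

The output of attention for this query is then the weighted average of the three Value vectors, using `weights` as the mixing proportions.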

Multi-head: microGPT uses 4 attention heads, each on a 4-dim slice of the 16-dim vector. Each head learns different patterns.

The Causal Mask: The Autoregressive Secret

GPT is autoregressive — it predicts the next token from only the previous tokens. This means token 3 must NOT see tokens 4, 5, 6... How do we enforce this? With a causal mask: a lower-triangular boolean matrix.

python
import torch

# For a 5-token sequence, the mask looks like:
mask = torch.tensor([
  [1, 0, 0, 0, 0],  # token 0 sees only itself
  [1, 1, 0, 0, 0],  # token 1 sees tokens 0,1
  [1, 1, 1, 0, 0],  # token 2 sees tokens 0,1,2
  [1, 1, 1, 1, 0],  # token 3 sees tokens 0,1,2,3
  [1, 1, 1, 1, 1],  # token 4 sees all tokens
])                  # equivalently: torch.tril(torch.ones(5, 5))

scores = torch.randn(5, 5)   # placeholder for the raw attention scores

# Where mask is 0, we set the attention score to -infinity
# exp(-inf) = 0, so after softmax those tokens get zero weight
scores = scores.masked_fill(mask == 0, float('-inf'))

This is the ENTIRE autoregressive property. No special architecture, no separate backward pass. Just a triangular matrix of zeros and ones applied before softmax. The implementation is one line of code, but it's what makes GPT a generative model.

Why this works for training too: During training, you feed in a full sequence and the mask ensures each position can only see previous positions. Position 0 predicts token 1, position 1 predicts token 2, etc. One forward pass, T−1 predictions. This is why GPT trains so efficiently.
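A small NumPy sketch of the mask in action, using random numbers in place of real attention scores. After the softmax, every entry above the diagonal is exactly zero and each row still sums to 1.

```python
import numpy as np

T = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))              # stand-in for Q @ K.T / sqrt(d)

mask = np.tril(np.ones((T, T), dtype=bool))   # True where attention is allowed
masked = np.where(mask, scores, -np.inf)      # future positions become -inf

# Row-wise softmax; exp(-inf) = 0, so future positions get zero weight.
exps = np.exp(masked - masked.max(axis=-1, keepdims=True))
attn = exps / exps.sum(axis=-1, keepdims=True)

print(np.round(attn, 2))
```

Row 0 has a single allowed position, so its weight is 1.0 on itself; row 4 spreads its weight over all five positions.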

Check: In a 6-token sequence, which tokens can token 3 attend to?

(a) All 6 tokens
(b) Tokens 0, 1, 2, 3 (itself and earlier)
(c) Only token 2 (the previous one)

🔗 Pattern Recognition

GPT is a Decoder-Only Transformer

GPT (this lesson)

Causal mask: each token sees only past tokens.
One stack of [Attention + MLP] blocks.
Trained with next-token prediction.

Full Transformer

Encoder: bidirectional (no mask, sees all tokens).
Decoder: causal + cross-attention to encoder.
Trained with seq2seq (translate, summarize). → Transformer lesson

The original 2017 Transformer had both encoder and decoder stacks. GPT's insight (Radford et al., 2018): throw away the encoder entirely. A causal decoder alone, trained on enough data, turns out to handle a remarkably wide range of tasks. The causal mask IS the architecture: remove it and you get a bidirectional encoder in the style of BERT. Same attention mechanism, same Q/K/V math, different masking pattern = completely different model behavior.

BERT uses a [MASK] token and predicts masked words bidirectionally. Why can't BERT generate text autoregressively the way GPT does?

Chapter 7: The Full Model

Complete Forward Pass

Attention = Communication
Tokens look at each other.

MLP = Computation
Each token thinks independently.

Model Configuration

Setting	Value	Meaning
n_embd	16	Each token = 16 numbers
n_head	4	4 parallel attention patterns
n_layer	1	One [Attention + MLP] block
vocab_size	27	26 letters + BOS
Total params	4,192	4,192 learnable numbers

Where Do the Parameters Live?

Let's account for every single parameter in microGPT:

microGPT parameter audit
# Embeddings
token_emb:   27 × 16  =   432
pos_emb:     16 × 16  =   256

# Attention (1 block, 4 heads)
W_Q:         16 × 16  =   256
W_K:         16 × 16  =   256
W_V:         16 × 16  =   256
W_O:         16 × 16  =   256

# MLP (4× expansion: 16 → 64 → 16)
W_up:        16 × 64  = 1,024
b_up:        64       =    64
W_down:      64 × 16  = 1,024
b_down:      16       =    16

# LayerNorm (2 per block, each has scale + shift)
ln1:         16 + 16   =    32
ln2:         16 + 16   =    32

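# LM Head (counted separately here; if tied to token_emb it adds no new parameters)
lm_head:     16 × 27  =   432
ln_final:    16 + 16   =    32

# TOTAL (weight matrices only: 432 + 256 + 1,024 + 2,048 + 432):  4,192
# Adding the biases (80) and LayerNorm parameters (96) listed above gives 4,368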
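The audit can be re-checked with a few lines of arithmetic. This sketch counts the weight matrices separately from the biases and LayerNorm parameters and treats the LM head as untied; the dictionary keys are just labels for this example.

```python
n_embd, vocab_size, block_size, n_hidden = 16, 27, 16, 64

weights = {
    "token_emb": vocab_size * n_embd,                     # 432
    "pos_emb": block_size * n_embd,                       # 256
    "attention (W_Q, W_K, W_V, W_O)": 4 * n_embd * n_embd,  # 1,024
    "mlp (W_up, W_down)": 2 * n_embd * n_hidden,          # 2,048
    "lm_head (untied)": n_embd * vocab_size,              # 432
}
biases = n_hidden + n_embd          # b_up + b_down
layernorms = 3 * 2 * n_embd         # ln1, ln2, ln_final: scale + shift each

weights_only = sum(weights.values())
print(weights_only)                          # 4192
print(weights_only + biases + layernorms)    # 4368
```

With weight tying, the LM head reuses the token embedding table, which removes 432 from either total.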

Now here's the same audit for GPT-2 Small (124M parameters):

Component	Formula	Params
Token embedding	50,257 × 768	38.6M
Position embedding	1,024 × 768	0.8M
Attention (12 layers)	12 × 4 × 768²	28.3M
MLP (12 layers)	12 × 2 × 768 × 3072	56.6M
LayerNorm + biases	small	~0.1M
Total		~124M

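Where the weight is: In GPT-2 Small, embeddings are 32%, attention is 23%, and the MLP is 45%. The MLP outweighs attention because of the 4× expansion: each block holds 8·d² MLP weights against 4·d² attention weights. That 2:1 ratio holds across scales, while the embedding share shrinks as models grow, so the MLP remains the biggest component in larger models too.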
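The percentages follow directly from the formulas in the table. The sketch below recomputes them, ignoring biases and LayerNorm parameters as the table's "small" row suggests.

```python
vocab, ctx, d, layers = 50_257, 1_024, 768, 12

embeddings = vocab * d + ctx * d          # token + position tables
attention = layers * 4 * d * d            # W_Q, W_K, W_V, W_O per layer
mlp = layers * 2 * d * (4 * d)            # W_up and W_down per layer
total = embeddings + attention + mlp      # biases and LayerNorm left out

for name, n in [("embeddings", embeddings), ("attention", attention), ("mlp", mlp)]:
    print(f"{name:<11}{n / 1e6:6.1f}M  {100 * n / total:4.1f}%")
```

Because the MLP and attention terms both scale with d² × layers while the embeddings scale only with d, the embedding share drops as d and the layer count grow.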

Check: Which component has the most parameters in a typical GPT model?

(a) The attention layers
(b) The embedding table
(c) The MLP (feed-forward) layers

💻 Build It: Implement the GPT Forward Pass from Scratch

You've seen the parameter audit above. Now implement the complete forward pass: embed tokens, add position embeddings, pass through one transformer block (attention + MLP with residual connections and layer norms), and project to vocabulary logits.

signature

def gpt_forward(token_ids, params):
    """
    Args:
        token_ids: list of ints, length T (each in [0, 26])
        params: dict with keys:
            'wte': [27, 16] token embedding matrix
            'wpe': [16, 16] position embedding matrix
            'W_Q', 'W_K', 'W_V', 'W_O': [16, 16] attention weights
            'W_up': [16, 64], 'b_up': [64]
            'W_down': [64, 16], 'b_down': [16]
            'ln1_g', 'ln1_b': [16] layernorm params
            'ln2_g', 'ln2_b': [16] layernorm params
            'ln_f_g', 'ln_f_b': [16] final layernorm
    Returns:
        logits: [T, 27] unnormalized scores for next token
    """

Test case

token_ids = [26, 4, 12, 12, 0] # BOS, e, m, m, a
output shape: [5, 27] (one logit vector per position)
logits[0] should be the model's guess for what follows BOS
softmax(logits[0]) should sum to 1.0

After computing QK^T / sqrt(d), create a [T,T] lower-triangular matrix of ones. Set upper-triangle positions to -infinity before softmax. This ensures position i only attends to positions 0..i.

python
import numpy as np

def layernorm(x, g, b, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return g * (x - mean) / np.sqrt(var + eps) + b

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gpt_forward(token_ids, params):
    T = len(token_ids)
    d = 16

    # 1. Embed
    x = params['wte'][token_ids] + params['wpe'][np.arange(T)]

    # 2. Attention block
    x_norm = layernorm(x, params['ln1_g'], params['ln1_b'])
    Q = x_norm @ params['W_Q']  # [T, 16]
    K = x_norm @ params['W_K']  # [T, 16]
    V = x_norm @ params['W_V']  # [T, 16]

    # Scaled dot-product with causal mask
    scores = Q @ K.T / np.sqrt(d)  # [T, T]
    mask = np.tril(np.ones((T, T)))
    scores = np.where(mask == 0, -1e9, scores)
    attn = softmax(scores)  # [T, T]
    attn_out = attn @ V  # [T, 16]
    attn_out = attn_out @ params['W_O']

    # 3. Residual
    x = x + attn_out

    # 4. MLP block
    x_norm = layernorm(x, params['ln2_g'], params['ln2_b'])
    h = x_norm @ params['W_up'] + params['b_up']  # [T, 64]
    h = np.maximum(0, h)  # ReLU
    mlp_out = h @ params['W_down'] + params['b_down']  # [T, 16]

    # 5. Residual
    x = x + mlp_out

    # 6. Final norm + project to vocab
    x = layernorm(x, params['ln_f_g'], params['ln_f_b'])
    logits = x @ params['wte'].T  # [T, 27] — weight tying!

    return logits

Bonus challenge: Modify this to support 4 attention heads (split Q/K/V into 4 chunks of dim 4, compute attention separately, concatenate). How does the output change?
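One possible sketch of the bonus challenge, in NumPy. The function name `multi_head_attention` is hypothetical, and scaling each head's scores by the square root of the per-head dimension (4) rather than 16 is the usual convention, assumed here rather than taken from the lesson.

```python
import numpy as np

def multi_head_attention(x_norm, W_Q, W_K, W_V, W_O, n_head=4):
    """Causal self-attention with n_head heads; x_norm has shape [T, n_embd]."""
    T, n_embd = x_norm.shape
    hd = n_embd // n_head                              # per-head dimension (4 here)

    # Project, then split the last axis into heads: [T, 16] -> [n_head, T, hd]
    def split(m):
        return m.reshape(T, n_head, hd).transpose(1, 0, 2)
    Q, K, V = split(x_norm @ W_Q), split(x_norm @ W_K), split(x_norm @ W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(hd)    # [n_head, T, T]
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -1e9)              # hide future positions

    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)           # softmax per head, per row

    out = attn @ V                                     # [n_head, T, hd]
    out = out.transpose(1, 0, 2).reshape(T, n_embd)    # concatenate the heads
    return out @ W_O                                   # [T, n_embd]

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
W = [rng.normal(size=(16, 16)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, *W)
print(y.shape)                                         # (5, 16)
```

The output shape matches the single-head version, so this can stand in for the attention step of `gpt_forward`. The values differ because each head computes its own softmax over a 4-dimensional slice instead of one softmax over all 16 dimensions.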

💥 Break-It Lab: What Dies When You Remove Components?

Take a working GPT trained for 5,000 steps on name data. Removing one component at a time produces distinct failure modes in the loss curve and the generated output.

Remove Causal Mask

Failure mode: Training loss drops to near-zero (the model cheats by looking at future tokens), but generation produces garbage. The model never learned to predict — it learned to copy. This is data leakage: the model sees the answer during training, so it never develops the ability to actually predict.

Remove Learning Rate Warmup

Failure mode: Loss spikes wildly in the first 100 steps and may diverge to infinity (NaN). Without warmup, the initial random gradients are huge and the large learning rate sends parameters to extreme values. Adam's moment estimates need time to calibrate. Warmup gives them that time.

Remove Residual Connections

Failure mode: With 1 layer (microGPT), it barely matters. With 12+ layers (GPT-2), gradients vanish: loss plateaus near chance level (ln(V) for a vocabulary of V tokens, e.g. ln(27) ≈ 3.3 for microGPT's 27 tokens) and never decreases. Residuals create a "gradient highway" that lets information flow directly from output back to early layers.

Reduce Context to 3 Tokens

Failure mode: Loss converges but to a higher floor (~2.4 instead of ~1.9). The model can't learn long-range patterns: "emm" is ambiguous (emma? emmy? emmett?) but "BOS_emm" is clearer. Slashing context removes information the model needs to disambiguate. Generated names are shorter and less realistic.

Chapter 8: Training — Learning From Mistakes

loss = −log(probability assigned to the correct answer)

Worked example: P(correct) = 0.10 → loss = −ln(0.10) ≈ 2.303

Loss as "surprise": P=90% → loss=0.1 (expected). P=1% → loss=4.6 (shocked). Training minimizes total surprise.
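The "surprise" numbers can be reproduced directly. The helper name `surprise` is made up for this example.

```python
import math

def surprise(p):
    """Loss for one prediction: negative log of the probability given to the correct answer."""
    return -math.log(p)

for p in (0.9, 0.5, 0.1, 0.01):
    print(f"P(correct) = {p:<4}  loss = {surprise(p):.3f}")
```

Halving the probability always adds the same amount (ln 2 ≈ 0.693) to the loss, which is why confident wrong answers are punished so heavily.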

The Shift Trick

Here's the crucial detail: the model's input and its target (ground truth) are the same sequence, shifted by one position. The target for position t is the token at position t+1.

python
# Training sequence: "emma", tokenized as [BOS, e, m, m, a]
# tokens: integer tensor of shape [B, 5]; cross_entropy is e.g. torch.nn.functional.cross_entropy
# Input tokens:  [BOS, e, m, m]  (positions 0,1,2,3)
# Target tokens: [e, m, m, a]    (the NEXT token at each position)

input_tokens = tokens[:, :-1]   # everything except the last token
targets = tokens[:, 1:]         # shift by 1: everything except the first token
logits = model(input_tokens)    # [B, 4, 27]: 27 logits per position

# Cross-entropy loss at EVERY position:
# Position 0: model predicts next after BOS → should be 'e'
# Position 1: model predicts next after 'e' → should be 'm'
# Position 2: model predicts next after 'm' → should be 'm'
# Position 3: model predicts next after 'm' → should be 'a'

loss = cross_entropy(logits.reshape(-1, 27), targets.reshape(-1))

The model outputs [batch, seq_len, vocab_size] logits, and the loss is computed at every position simultaneously, so a training sequence of T tokens (T−1 inputs paired with T−1 shifted targets) yields T−1 training signals from a single forward pass. This is why language model training is so data-efficient compared to, say, image classification where each image gives you just one label.
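Here is a dependency-free sketch of the shift and the per-position loss. Random logits stand in for a real model's output, so the loss value itself is meaningless; the point is the pairing of inputs with targets.

```python
import math
import random

BOS, vocab_size = 26, 27
tokens = [BOS, 4, 12, 12, 0]                  # BOS, e, m, m, a
inputs, targets = tokens[:-1], tokens[1:]     # [BOS, e, m, m] and [e, m, m, a]

random.seed(0)
# Stand-in for model(inputs): one row of 27 random logits per input position.
logits = [[random.gauss(0, 1) for _ in range(vocab_size)] for _ in inputs]

def cross_entropy(row, target):
    """-log softmax(row)[target], computed stably via log-sum-exp."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return log_z - row[target]

losses = [cross_entropy(row, t) for row, t in zip(logits, targets)]
loss = sum(losses) / len(losses)              # average over the 4 positions
print(len(losses), round(loss, 3))
```

Five tokens produce four (input, target) pairs, and all four losses come from one forward pass.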

Scale context: GPT-3 was trained on ~300 billion tokens. With a context length of 2048, that's ~146 million sequences. At each sequence, the model gets 2047 training signals. Total gradient updates: roughly 94,000 training steps at a batch size of ~3.2M tokens each (300B ÷ 3.2M).

Watch the Loss Decrease

Check: If the input is [BOS, h, e, l, l, o], what is the target at position 2?

(a) 'e' (the current token)
(b) 'l' (the next token)
(c) 'h' (the previous token)

🔨 Derivation: Why Cross-Entropy? Deriving the NTP Loss from Maximum Likelihood

We've been using loss = −log P(correct token) without justification. Where does this come from?

Given a dataset of sequences, the model assigns probability P_θ(x_t | x_<t) to each next token. We want to find parameters θ that make the training data most probable.

Your task: Start from Maximum Likelihood Estimation (maximize the probability of the data) and show that it's equivalent to minimizing the average cross-entropy loss −log P.

For a sequence x₁,...,x_T, the joint probability factorizes autoregressively: P(x) = ∏_{t=1}^{T} P(x_t | x_<t). The dataset likelihood is the product over all sequences.

Products are numerically unstable (underflow). log converts product to sum: log P(x) = ∑_t log P(x_t | x_<t). Maximizing log-likelihood = maximizing likelihood (log is monotone).

We conventionally minimize losses (gradient descent goes downhill). Maximizing log P = minimizing −log P. The average over all positions gives cross-entropy.

Full derivation:

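1. Likelihood: P(dataset) = ∏_sequences ∏_{t=1}^{T} P_θ(x_t | x_<t)

2. Log-likelihood: log P(dataset) = ∑_sequences ∑_{t=1}^{T} log P_θ(x_t | x_<t)

3. Negate and average: L(θ) = −(1/N) ∑_i ∑_t log P_θ(x_t⁽ⁱ⁾ | x_<t⁽ⁱ⁾), where i indexes sequences and N is the total number of predicted tokens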

4. Per-token: At each position, we have a true distribution q (one-hot on the correct token) and model distribution p. The cross-entropy H(q, p) = −∑_v q(v) log p(v) = −log p(correct) since q is one-hot.

The key insight: Cross-entropy loss isn't an arbitrary choice — it's the ONLY loss function that corresponds to maximum likelihood estimation for categorical distributions. Using MSE or L1 on probabilities would not give you the MLE solution.

🔨 Derivation: Perplexity, Making Loss Interpretable

Suppose GPT-2's validation loss is ~3.3 nats and GPT-3's is ~2.8 nats (illustrative figures; reported losses depend on the dataset and tokenizer). These numbers are hard to interpret. Perplexity converts loss into an intuitive quantity: "on average, the model is as confused as if it were choosing uniformly from PPL options."

Your task: Show that perplexity = exp(average cross-entropy loss), and explain why perplexity = V (vocabulary size) for a random model and perplexity = 1 for a perfect model.

Perplexity is defined as exp(H), where H is the average cross-entropy: H = −(1/T) ∑_t log P(x_t | x_<t). This is just exponentiating the loss.

A random model assigns P = 1/V to every token (uniform over vocabulary). So loss = −log(1/V) = log(V). Perplexity = exp(log(V)) = V.

A perfect model assigns P = 1 to the correct token. Loss = −log(1) = 0. Perplexity = exp(0) = 1. It's never "surprised."

Full derivation:

PPL = exp( −(1/T) ∑_{t=1}^{T} log P(x_t | x_<t) ) = exp(average_loss)

Random model: P(x_t) = 1/V for all t. Loss = log(V). PPL = exp(log V) = V. For GPT's 50K vocab, a random model has PPL = 50,257.

Perfect model: P(correct) = 1. Loss = 0. PPL = 1.

GPT-2: Loss ≈ 3.3 → PPL = exp(3.3) ≈ 27. "On average, the model is choosing from ~27 equally likely options."

GPT-3: Loss ≈ 2.8 → PPL = exp(2.8) ≈ 16. Narrowed it down to ~16 options.

The key insight: Perplexity has a beautiful interpretation: it's the effective branching factor. A model with PPL=27 is, on average, as uncertain as someone choosing from 27 equally likely options. That makes it easier to read than raw loss, but per-token perplexity is only directly comparable between models that share a tokenizer and an evaluation dataset.
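The two boundary cases are easy to confirm numerically. In this sketch, `perplexity` takes the probabilities a model assigned to the correct tokens; the function name and inputs are illustrative.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the correct tokens."""
    avg_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_loss)

V = 27
print(perplexity([1 / V] * 10))          # uniform guesser: approximately 27
print(perplexity([1.0] * 10))            # perfect model: 1.0
print(perplexity([0.5, 0.25, 0.125]))    # approximately 4
```

The last line shows that perplexity is the geometric mean of 1/p over the sequence: the inverse probabilities are 2, 4 and 8, and their geometric mean is 4.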

⚔ Adversarial: Two models, same loss, different quality

Model A (trained on Wikipedia) and Model B (trained on random Reddit comments) both achieve validation loss of 3.0 on their respective validation sets. A colleague claims they're "equally good." What's wrong with this reasoning?

(a) Nothing: same loss means same quality
(b) Cross-entropy measures how well you model your training distribution. If the data is low-entropy (predictable Wikipedia), loss 3.0 is mediocre. If data is high-entropy (chaotic Reddit), loss 3.0 might be near optimal.
(c) The models have different vocabulary sizes so losses aren't comparable

Chapter 9: Generation — Creating Something New

The generation loop in four steps:
(1) Start: feed the BOS token.
(2) Predict: the model outputs 27 probabilities.
(3) Sample: pick a character based on those probabilities.
(4) Feed back: use the picked character as the next input.
Repeat steps 2 to 4 until BOS is generated.

The Actual Generation Loop

Here is the complete generation algorithm. It's surprisingly short:

python
import torch

def generate(model, prompt_tokens, max_new_tokens, temperature=1.0):
    tokens = prompt_tokens.clone()  # start with prompt

    for _ in range(max_new_tokens):
        # 1. Forward pass — only need logits for LAST position
        logits = model(tokens)          # [1, T, vocab_size]
        logits = logits[:, -1, :]       # [1, vocab_size] — last token only

        # 2. Apply temperature
        logits = logits / temperature   # higher T → flatter distribution

        # 3. Convert to probabilities
        probs = torch.softmax(logits, dim=-1)

        # 4. Sample from the distribution
        next_token = torch.multinomial(probs, num_samples=1)

        # 5. Append and repeat
        tokens = torch.cat([tokens, next_token], dim=1)

    return tokens

That's it. Five lines in the loop. The model sees more and more context each iteration (or with KV cache, just the new token). Generation is inherently sequential — you can't parallelize it because each token depends on all previous ones.

Temperature

Temperature is division before softmax. It controls the "sharpness" of the distribution:

P(token_i) = softmax( logits / T )_i
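A short sketch of the effect on a toy three-token distribution. The logits are made up; only the trend across temperatures matters.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a stable softmax."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures push more probability onto the top token; higher temperatures flatten the distribution toward uniform.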

Top-k and Top-p Sampling

Temperature alone isn't enough. Even with T=0.8, the model might sometimes sample a very unlikely token (the "long tail"). Top-k fixes this by zeroing out everything except the k most likely tokens before sampling:

python
# Top-k: keep only the 5 highest logits, set the rest to -inf
inf = float('inf')
logits = [3.2, 2.5, -0.1, 1.8, -2.0, 0.5, -1.2, 1.5]
top_5  = [3.2, 2.5, -inf, 1.8, -inf, 0.5, -inf, 1.5]
# Now softmax only distributes probability among those 5

# Top-p (nucleus): keep smallest set of tokens
# whose cumulative probability ≥ p (e.g., 0.9)
# More adaptive — sometimes keeps 3 tokens, sometimes 20

T→0: Always pick most likely (greedy, deterministic). T=1: Sample from learned distribution. T→∞: Uniform random. In practice, T=0.7 with top-p=0.9 is a common sweet spot.
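Both filters can be sketched in plain Python as functions that return a new logit list with the rejected tokens set to −inf. The function names are illustrative, and ties at the top-k cutoff are all kept, which is one of several reasonable choices.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf."""
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [v if v >= cutoff else -math.inf for v in logits]

def top_p_filter(logits, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [v if i in keep else -math.inf for i, v in enumerate(logits)]

logits = [3.2, 2.5, -0.1, 1.8, -2.0, 0.5, -1.2, 1.5]
print(top_k_filter(logits, 5))
print(top_p_filter(logits, 0.9))
print(softmax(top_k_filter(logits, 5)))   # filtered tokens get probability 0
```

On these logits, top-p with p=0.9 keeps fewer tokens than top-k with k=5, because the four most likely tokens already cover more than 90% of the probability mass.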

Check: What does top-k sampling do?

(a) It picks the k-th most likely token
(b) It zeros out all but the k highest logits, then samples from the rest
(c) It generates k tokens at once

⚔ Adversarial: Your GPT generates fluent text but repeats itself after ~50 tokens. Training loss is low. What's failing?

You've trained a 125M parameter GPT on 10B tokens of web text. Validation loss converged to 3.2 (reasonable). But during generation at temperature 0.8, the model produces coherent text for ~50 tokens then enters repetitive loops ("the the the..." or repeating the same sentence). Greedy decoding (T=0) is even worse.

(a) The model is undertrained: it needs more gradient steps
(b) The embedding dimension is too small for the vocabulary
(c) The model's probability mass concentrates on recent tokens (degeneration), and greedy/low-temp decoding amplifies this feedback loop
(d) The causal mask is incorrectly implemented

Chapter 10: From Micro to Macro

Identical at every scale: Next-token prediction. Chain rule. Autograd. Attention. Residuals. Softmax. Adam. The training loop. The generation loop. The algorithm is the same — only the numbers change.

The GPT Family

Here are the architectural parameters for the GPT-2 family, GPT-3, and (as rumored) GPT-4, alongside microGPT. Notice how each dimension scales:

Model	Params	Layers	Heads	d_model	Context	Year
microGPT	4,192	1	4	16	16	2024
GPT-2 Small	124M	12	12	768	1,024	2019
GPT-2 Medium	355M	24	16	1,024	1,024	2019
GPT-2 Large	774M	36	20	1,280	1,024	2019
GPT-2 XL	1.5B	48	25	1,600	1,024	2019
GPT-3	175B	96	96	12,288	2,048	2020
GPT-4*	~1.8T*	~120*	~128*	~16K*	128K	2023

*GPT-4 specs are rumored (MoE with ~16 experts, ~110B active per forward pass). OpenAI has not confirmed.

The scaling pattern: From GPT-2 Small to GPT-3, parameters grew 1,400×. But d_model only grew 16× (768→12288), layers grew 8× (12→96), and heads grew 8× (12→96). Parameters scale roughly as d_model² × layers, so doubling d_model quadruples parameters.
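The d_model² × layers rule can be made concrete with the per-block count used in the parameter audit: 4·d² attention weights plus 8·d² MLP weights per layer. The function below is an approximation that ignores embeddings, biases and LayerNorm.

```python
def approx_block_params(d_model, n_layers):
    """Weight matrices in the transformer blocks: (4 + 8) * d_model^2 per layer."""
    return 12 * n_layers * d_model ** 2

for name, d, layers in [("GPT-2 Small", 768, 12), ("GPT-3", 12_288, 96)]:
    print(f"{name:<12}{approx_block_params(d, layers) / 1e9:8.3f}B")

# Doubling d_model at a fixed depth quadruples the count.
print(approx_block_params(1_536, 12) / approx_block_params(768, 12))   # 4.0
```

For GPT-3 the approximation lands close to the quoted 175B. For GPT-2 Small it gives about 85M, with the embedding tables making up most of the remaining gap to 124M.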

Dimension	microGPT	GPT-4 class
Data	32K names	Trillions of tokens
Parameters	4,192	100B – 1T+
Layers	1	80 – 128+
Context	16 chars	128K+ tokens
Training	~1 minute	~3 months
Cost	$0	$100M+

The Three-Stage Pipeline

(1) Pre-training: same algorithm, massive scale. Result: a document completer.
(2) SFT: fine-tune on conversations. Result: an assistant.
(3) RLHF: reinforce good behavior. Result: a helpful, safe assistant.

Pre-training is microGPT's algorithm at scale. SFT (Supervised Fine-Tuning) continues training on high-quality conversation data — same loss function, just better data. RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preferences, then optimizes the language model against it. The base algorithm never changes — what changes is what you train on.

🔨 Derivation: Chinchilla Optimal Compute Allocation

The Chinchilla paper (Hoffmann et al., 2022) showed that for a fixed compute budget C, there's an optimal balance between model parameters N and training tokens D. The loss follows:

L(N, D) = E + A/N^α + B/D^β

where α ≈ 0.34, β ≈ 0.28, and compute C ≈ 6ND (FLOPs ≈ 6 × params × tokens).

Your task: Given a fixed compute budget C, derive the optimal N and D that minimize L. Show that N and D each scale as a power of C, with exponents that sum to 1, so both grow sub-linearly (i.e., if you 10x your compute, you should roughly 3x both params and data, not 10x them).

Use Lagrange multipliers or substitute D = C/(6N) to eliminate one variable. You want to minimize L(N) = E + A/N^α + B/(C/(6N))^β.

dL/dN = −αA/N^(α+1) + βB·(6/C)^β·N^(β−1) = 0. Solve for N in terms of C.

You'll get N^(α+β) ∝ C^β. So N ∝ C^(β/(α+β)). With α=0.34, β=0.28: N ∝ C^0.45.

Full derivation:

Substitute D = C/(6N) into the loss: L(N) = E + A/N^α + B·(6N/C)^β

Set dL/dN = 0: αA/N^(α+1) = βB·6^β·N^(β−1)/C^β

Rearrange: N^(α+β) = (αA·C^β) / (βB·6^β)

So: N_opt ∝ C^(β/(α+β)) = C^(0.28/0.62) ≈ C^0.45

And: D_opt = C/(6·N_opt) ∝ C^(1−0.45) = C^0.55

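The key insight: Both N and D grow sub-linearly with C, and with these exponents D grows slightly faster than N (C^0.55 vs C^0.45). The paper's other two estimation approaches give exponents close to 0.5 for both, which is the basis of the common rule of thumb of roughly 20 training tokens per parameter. By that measure GPT-3 was trained on too few tokens for its size (300B tokens for 175B params, under 2 tokens per parameter). Chinchilla (70B params, 1.4T tokens) outperformed both GPT-3 and the 280B-parameter Gopher, the latter with 4x fewer params.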
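The closed-form optimum can be sanity-checked numerically. In this sketch A, B and E are placeholder constants, not the paper's fitted values; the exponents do not depend on them.

```python
import math

alpha, beta = 0.34, 0.28
A, B, E = 400.0, 400.0, 1.7      # placeholder constants (assumed for illustration)

def loss(N, C):
    D = C / (6 * N)               # tokens implied by the compute budget C = 6ND
    return E + A / N**alpha + B / D**beta

def n_opt(C):
    """Closed-form minimiser obtained by setting dL/dN = 0."""
    scale = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    return scale * (C / 6) ** (beta / (alpha + beta))

for C in (1e21, 1e22, 1e23):
    N = n_opt(C)
    # The closed form should beat nearby values of N.
    assert loss(N, C) <= min(loss(0.9 * N, C), loss(1.1 * N, C))
    print(f"C = {C:.0e}  N_opt = {N:.2e}  D_opt = {C / (6 * N):.2e}")

# Exponents for N and D: about 0.45 and 0.55, summing to 1.
print(beta / (alpha + beta), alpha / (alpha + beta))
```

Each 10× increase in C multiplies N_opt by about 2.8 and D_opt by about 3.5, and the two factors multiply back to 10.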

🏗 Design Challenge: You're the Architect with a $10M Compute Budget

Your startup just raised $10M earmarked for training a GPT-class language model. You need to decide the model architecture and training configuration. H100 GPUs cost ~$2/hr, you have 6 months, and your target is a strong general-purpose chat model.

Compute budget: $10M ≈ 5M GPU-hours ≈ 3×10²³ FLOPs
Timeline: 6 months (cluster size trades off with wall-clock)
Target: General-purpose chat, competitive with GPT-3.5
Inference cost: Must serve at <$5/M tokens

1. How many parameters? (Chinchilla says N ∝ C^0.45. Compute C = 6ND.)

2. How many training tokens? (What ratio of tokens-to-params?)

3. What context length? (Longer = more memory per token = fewer tokens/second)

4. Cluster size? (More GPUs = faster but communication overhead grows)

5. MoE or dense? (MoE gives more capacity at same inference cost)

Real-world solution (circa 2024):

With C = 3×10²³ FLOPs and Chinchilla's ~20 tokens per parameter, C = 6ND gives N ≈ 50B params and D ≈ 1T tokens. However, the field has moved toward "over-training" smaller models (more tokens than Chinchilla-optimal) because inference cost matters more than training cost: the same budget also covers N ≈ 13B params on D ≈ 4T tokens. Llama-3 8B was trained on 15T tokens (1875:1 token-to-param ratio vs Chinchilla's ~20:1).

Modern answer: Train a 7-13B dense model on 4-15T tokens. Context 4K-8K for pre-training (extend later with RoPE scaling). Use ~1000 H100s for 3-4 months. Dense > MoE at this scale because MoE routing overhead dominates when model is small. Budget: ~60% pre-training, ~10% SFT data curation, ~20% RLHF/DPO, ~10% evaluation and iteration.

The key trade-off: Chinchilla minimizes training loss for fixed compute. But in production, inference cost dominates. A smaller model trained longer has worse training efficiency but better deployment economics.
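The arithmetic behind such allocations is a one-liner once a tokens-per-parameter ratio is chosen. The sketch below solves C = 6ND for a few ratios; the function name `allocate` and the middle ratio of 300:1 are assumptions picked for illustration.

```python
import math

def allocate(C, tokens_per_param):
    """Solve C = 6 * N * D with D = tokens_per_param * N."""
    N = math.sqrt(C / (6 * tokens_per_param))
    return N, tokens_per_param * N

C = 3e23
for ratio in (20, 300, 1875):
    N, D = allocate(C, ratio)
    print(f"{ratio:>5}:1  N = {N / 1e9:5.1f}B params  D = {D / 1e12:4.1f}T tokens")
```

At 20:1 the budget buys about 50B parameters on 1T tokens; at 300:1 it buys about 13B parameters on roughly 4T tokens. Moving down the list trades training efficiency for a smaller, cheaper-to-serve model.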

🔗 Pattern Recognition

From Next-Token Prediction to Alignment

This Lesson (GPT)

loss = −log P(correct next token)
Optimizes: predict human text accurately

RLHF / Reward Alignment

loss = −reward(response) + β·KL(policy || base)
Optimizes: generate text humans prefer → Reward & Alignment

Pre-training makes GPT a brilliant document completer. But "complete this document" isn't the same as "be helpful." RLHF adds a second objective: maximize a reward model trained on human preferences, while staying close to the pre-trained model (the KL penalty prevents "reward hacking"). The base GPT never changes architecture — only the loss signal changes.

Both losses are expectations over text. What's the fundamental difference in what distribution the expectation is taken over?

"What I cannot create, I do not understand."

— Richard Feynman

You now understand the creation. The only question left is: what will you build?

Understand GPT
From Absolute Zero

Chapter 0: Why Does This Matter?

What does microGPT do?

The 5-Step Loop

Chapter 1: Numbers All The Way Down

Interactive Dot Product

Chapter 2: The Slope of a Hill

Chapter 3: Rolling Downhill

Chapter 4: The Autograd Engine

The 6 Building Blocks

Chapter 5: From Characters to Vectors

The Problem: Numbers Aren't Rich Enough

Step 1: Tokenize — Give Each Character a Name

Step 2: Embed — From Name Tags to Rich Descriptions

Step 3: Position — Where Are You in the Sentence?

Reading Shape Notation

Scaling Up

Chapter 6: Attention — Tokens Talking

The Causal Mask: The Autoregressive Secret

Chapter 7: The Full Model

Model Configuration

Where Do the Parameters Live?

Chapter 8: Training — Learning From Mistakes

The Shift Trick

Chapter 9: Generation — Creating Something New

The Actual Generation Loop

Temperature

Top-k and Top-p Sampling

Chapter 10: From Micro to Macro

The GPT Family

The Three-Stage Pipeline
