The Complete Beginner's Path

Understand GPT
From Absolute Zero

Karpathy built the entire GPT algorithm in 243 lines of Python. This masterclass will make sure you understand every single one.

Prerequisites: Basic Python + High school algebra. That's it.
10 chapters · 15+ interactive simulations · 0 assumed ML knowledge

Chapter 0: Why Does This Matter?

A Large Language Model like ChatGPT is, at its core, a next-token predictor. You give it some text, and it predicts what word should come next. The entire miracle of "artificial intelligence" emerges from doing this one thing extremely well.

The big reveal: When you chat with ChatGPT, it's not "thinking" the way you do. It's completing a document one token at a time. Your conversation is just a funny-looking document.

What does microGPT do?

microGPT learns patterns in 32,000 human names and generates new names that sound plausible but never existed. The same algorithm, scaled up 100,000x, produces ChatGPT.

See It In Action

Names generated by the 243-line model after training:

The 5-Step Loop

Every LLM runs this exact loop:

Step 1: Feed a sequence of tokens into the model.
Step 2: The model outputs a probability for each possible next token.
Step 3: Compare the prediction to the actual next token → compute the loss.
Step 4: Figure out how to adjust each parameter (backpropagation).
Step 5: Adjust the parameters slightly → repeat 1000s of times.
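The five steps can be sketched as a toy Python loop. This is not microGPT itself — it is a single made-up parameter learning to multiply by 3 — but the loop structure is exactly the one above:

```python
# A toy version of the 5-step loop: one parameter w learns to map x -> 3x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = 0.0                                  # start with a bad parameter
lr = 0.01                                # learning rate

for _ in range(1000):                    # Step 5: repeat thousands of times
    for x, y in data:
        pred = w * x                     # Steps 1-2: feed input, get prediction
        loss = (pred - y) ** 2           # Step 3: compare to the true answer
        grad = 2 * (pred - y) * x        # Step 4: how loss changes if w changes
        w -= lr * grad                   # Step 5: nudge w downhill

print(round(w, 3))  # converges to 3.0
```

microGPT does the same thing, except `w` is 4,192 numbers and `pred` is a probability distribution over tokens.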
Check: What does an LLM fundamentally do?

Chapter 1: Numbers All The Way Down

Neural networks only understand numbers. A vector is just a list of numbers. A parameter is a single adjustable number the model learns during training.

Interactive Dot Product

Drag the sliders to explore.
Vector A: a1 = 1.0, a2 = 2.0, a3 = −1.0
Vector B: b1 = 2.0, b2 = 1.0, b3 = 0.5
Key insight: The dot product measures how similar two vectors are. Positive = similar direction. Negative = opposite. Zero = unrelated. This is the foundation of attention.
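In code, the dot product of the two slider vectors above is one line of Python:

```python
# Dot product: multiply matching components, then sum.
a = [1.0, 2.0, -1.0]
b = [2.0, 1.0, 0.5]
dot = sum(x * y for x, y in zip(a, b))
print(dot)  # 1*2 + 2*1 + (-1)*0.5 = 3.5
```

A positive result like 3.5 means the vectors point in broadly similar directions.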
Check: What is a "parameter" in a neural network?

Chapter 2: The Slope of a Hill

The derivative tells you: if I nudge the input a tiny bit, how much does the output change?

Interactive Derivative Explorer

Drag the point along f(x) = x². The tangent line shows the derivative (slope).

x = 1.0 → f(1.0) = 1.0, slope = 2.0
The gradient is a vector of slopes — one per parameter. It points uphill. We go the opposite direction.
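You can check the slope numerically in Python by nudging the input a tiny amount and watching the output change:

```python
# Numerical derivative: nudge x by a tiny h and see how much f changes.
def f(x):
    return x ** 2

x, h = 1.0, 1e-6
slope = (f(x + h) - f(x)) / h   # approximates the true derivative 2x
print(round(slope, 3))  # about 2.0, matching the tangent line above
```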
Check: If the derivative of the loss w.r.t. a parameter is +3.5, what should you do?

Chapter 3: Rolling Downhill

Gradient descent: compute the loss, compute gradients, take a small step downhill, repeat.

Gradient Descent Simulator

Click anywhere to place the ball, then watch it roll downhill.

Learning rate: 0.1
parameter = parameter − learning_rate × gradient
This is ALL of training. Compute loss. Compute gradients. Take a small step downhill. Repeat.
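A minimal sketch of that update rule in Python, rolling a "ball" down the hill f(p) = p²:

```python
# Gradient descent on f(p) = p**2: start at p = 5, roll down to p = 0.
p = 5.0
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * p                   # derivative of p**2
    p = p - learning_rate * gradient   # parameter = parameter - lr * gradient
print(round(p, 6))  # essentially 0, the bottom of the hill
```

Each step multiplies p by 0.8, so it shrinks toward the minimum geometrically.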

Chapter 4: The Autograd Engine

Every number in microGPT is wrapped in a Value object that tracks its gradient. When you multiply two Values, the result remembers how it was made.

python
class Value:
    def __init__(self, data, _children=()):
        self.data = data        # the actual number
        self.grad = 0           # how the loss depends on this
        self._prev = _children  # the Values that produced this one

The 6 Building Blocks

Operation | Local Gradient
a + b | ∂/∂a = 1, ∂/∂b = 1
a × b | ∂/∂a = b, ∂/∂b = a
a^n | n · a^(n−1)
log(a) | 1/a
exp(a) | exp(a)
relu(a) | 1 if a > 0, else 0
That's all the calculus you need. These 6 local derivatives, combined via the chain rule, let you compute gradients through ANY computation.
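Here is a stripped-down autograd sketch in the spirit of the Value class above, implementing just + and × with their local gradients and a chain-rule backward pass. The names `_prev` and `_backward` are illustrative, not necessarily microGPT's exact internals:

```python
class Value:
    """Minimal autograd sketch: tracks data, grad, and how it was made."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = _children              # inputs that produced this Value
        self._backward = lambda: None       # how to pass gradient to inputs

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad           # local slope of + is 1 for both
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(ab)/da = b
            other.grad += self.data * out.grad   # d(ab)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in reverse topological order, applying the chain rule.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a          # loss = 2*3 + 2 = 8
loss.backward()
print(loss.data, a.grad, b.grad)  # 8.0, d(loss)/da = b + 1 = 4.0, d(loss)/db = a = 2.0
```

Note how each gradient is accumulated with `+=`: a Value used twice (like `a` here) collects gradient from both paths, which is the chain rule in action.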

Chapter 5: From Characters to Vectors

Tokenize

Assign each character an integer ID (a=0, b=1, ..., z=25, BOS=26).

Live Tokenizer

Embed

Each token gets a 16-dimensional vector identity. Before training, these are random. After training, similar characters cluster together.

x = token_embedding + position_embedding
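Both steps can be sketched in plain Python. The random embeddings below stand in for learned ones — in the real model, training adjusts these 16-dimensional vectors:

```python
import random

# Tokenize: each character maps to an integer ID.
chars = "abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i for i, ch in enumerate(chars)}   # a=0 ... z=25
BOS = 26                                       # special start/end token

tokens = [BOS] + [stoi[ch] for ch in "emma"]
print(tokens)  # [26, 4, 12, 12, 0]

# Embed: each token ID indexes a 16-dimensional vector.
random.seed(0)
n_embd = 16
tok_emb = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(27)]
pos_emb = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(16)]

# x = token_embedding + position_embedding, per position in the sequence
x = [[t + p for t, p in zip(tok_emb[tok], pos_emb[pos])]
     for pos, tok in enumerate(tokens)]
print(len(x), len(x[0]))  # 5 16 -> five positions, 16 numbers each
```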

Chapter 6: Attention — Tokens Talking

Each token creates three vectors: Query ("what am I looking for?"), Key ("what do I contain?"), Value ("what info do I offer?").

attention_weight = softmax( Q · K / √d )
Softmax Visualizer
Multi-head: microGPT uses 4 attention heads, each on a 4-dim slice of the 16-dim vector. Each head learns different patterns.
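One row of attention can be computed in a few lines of Python. The Query and Key values here are made up for illustration; a trained model produces them from the token vectors:

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability, then normalize the exps.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d = 4                                   # per-head dimension (16 / 4 heads)
q = [1.0, 0.0, 1.0, 0.0]                # this token's Query
keys = [[1.0, 0.0, 1.0, 0.0],           # Keys of the tokens it can look at
        [0.0, 1.0, 0.0, 1.0],
        [0.5, 0.5, 0.5, 0.5]]

# attention_weight = softmax(Q . K / sqrt(d))
scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
weights = softmax(scores)
print([round(w, 3) for w in weights])   # sums to 1; largest for the matching Key
```

The first Key matches the Query exactly, so it wins the largest share of attention, while the weights still sum to 1.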

Chapter 7: The Full Model

Complete Forward Pass
Attention = Communication
Tokens look at each other.
MLP = Computation
Each token thinks independently.

Model Configuration

Setting | Value | Meaning
n_embd | 16 | Each token = 16 numbers
n_head | 4 | 4 parallel attention patterns
n_layer | 1 | One [Attention + MLP] block
vocab_size | 27 | 26 letters + BOS
Total params | 4,192 | All the learnable numbers
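The 4,192 total can be reproduced with one plausible parameter breakdown. The assumptions here (no bias terms, a context length of 16, a 4× MLP expansion) are mine for illustration, not confirmed details of microGPT's layout:

```python
n_embd, n_layer, vocab_size, block_size = 16, 1, 27, 16

wte = vocab_size * n_embd        # token embeddings: 27 x 16 = 432
wpe = block_size * n_embd        # position embeddings: 16 x 16 = 256
attn = 4 * n_embd * n_embd       # Wq, Wk, Wv, Wo: 4 x 256 = 1024
mlp = 2 * n_embd * (4 * n_embd)  # expand to 64 and project back: 2048
lm_head = n_embd * vocab_size    # final projection to 27 logits: 432

total = wte + wpe + n_layer * (attn + mlp) + lm_head
print(total)  # 4192
```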

Chapter 8: Training — Learning From Mistakes

loss = −log(probability assigned to the correct answer)
Loss Intuition Builder
loss = −ln(P)
Example: P(correct) = 0.10 → loss = 2.303
Loss as "surprise": P=90% → loss=0.1 (expected). P=1% → loss=4.6 (shocked). Training minimizes total surprise.
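The "surprise" numbers above come straight from −ln(P) and are easy to check in Python:

```python
import math

def loss(p_correct):
    # Cross-entropy for the single correct token: -ln(P)
    return -math.log(p_correct)

print(round(loss(0.90), 1))  # 0.1  -> the model expected this token
print(round(loss(0.10), 3))  # 2.303
print(round(loss(0.01), 1))  # 4.6  -> the model is shocked
```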
Watch the Loss Decrease

Chapter 9: Generation — Creating Something New

Step 1 (Start): Feed the BOS token.
Step 2 (Predict): The model outputs 27 probabilities.
Step 3 (Sample): Pick a character based on those probabilities.
Step 4 (Feed back): Use the picked character as the next input.
↻ Repeat from Step 2 until BOS is generated again.
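The four-step loop in sketch form. `fake_model` is a hypothetical stand-in for the real forward pass — it returns uniform probabilities, whereas a trained model would compute them from the tokens it was fed:

```python
import random

random.seed(42)
BOS = 26
chars = "abcdefghijklmnopqrstuvwxyz"

def fake_model(tokens):
    # Stand-in for the forward pass: 27 equal probabilities.
    return [1.0 / 27] * 27

tokens = [BOS]                           # Step 1: feed the BOS token
while len(tokens) < 17:                  # context limit: at most 16 characters
    probs = fake_model(tokens)           # Step 2: 27 probabilities
    nxt = random.choices(range(27), weights=probs)[0]   # Step 3: sample
    if nxt == BOS:                       # BOS again means "name finished"
        break
    tokens.append(nxt)                   # Step 4: feed back as next input

print("".join(chars[t] for t in tokens[1:]))
```

With uniform probabilities the output is gibberish; the whole point of training is to reshape those 27 numbers so plausible names come out.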

Temperature

Temperature Playground
Temperature: 1.0
T→0: Always pick most likely (greedy). T=1: Balanced. T→∞: Random.
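A sketch of temperature scaling applied to made-up logits before the softmax:

```python
import math

def apply_temperature(logits, T):
    # Divide logits by T before softmax: small T sharpens, large T flattens.
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
for T in (0.1, 1.0, 10.0):
    print(T, [round(p, 3) for p in apply_temperature(logits, T)])
# T=0.1: nearly all the mass on the top logit (close to greedy)
# T=1.0: balanced
# T=10:  close to uniform (nearly random)
```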

Chapter 10: From Micro to Macro

Identical at every scale: Next-token prediction. Chain rule. Autograd. Attention. Residuals. Softmax. Adam. The training loop. The generation loop.
Dimension | microGPT | GPT-4 class
Data | 32K names | Trillions of tokens
Parameters | 4,192 | 100B – 1T+
Layers | 1 | 80 – 128+
Context | 16 chars | 128K+ tokens
Training time | ~1 minute | ~3 months
Cost | $0 | $100M+
Pre-training: same algorithm, massive scale. Result: a document completer.
SFT: fine-tune on conversations. Result: an assistant.
RLHF: reinforce good behavior. Result: a helpful, safe assistant.
"What I cannot create, I do not understand."
— Richard Feynman

You now understand the creation. The only question left is: what will you build?