Andrej Karpathy built the entire GPT algorithm in 243 lines of Python. This masterclass will make sure you understand every single line.
A Large Language Model like ChatGPT is, at its core, a next-token predictor. You give it some text, and it predicts what word should come next. The entire miracle of "artificial intelligence" emerges from doing this one thing extremely well.
microGPT learns patterns in 32,000 human names and generates new names that sound plausible but never existed. The same algorithm, scaled up 100,000x, produces ChatGPT.
Names generated by the 243-line model after training.
Every LLM runs this exact loop: tokenize the input, predict a probability for each possible next token, sample one, append it, and repeat.
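A minimal sketch of that loop in plain Python; `model`, `tokenize`, and `detokenize` here are stand-ins for the real pieces, not microGPT's actual functions:

```python
import random

def generate(model, tokenize, detokenize, prompt, max_new_tokens=16):
    """The core LLM loop: predict a distribution over the next token,
    sample from it, append, repeat."""
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = model(tokens)  # one probability per possible next token
        next_id = random.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_id)
    return detokenize(tokens)
```

Everything that makes ChatGPT feel intelligent lives inside `model`; the loop around it never changes.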
Neural networks only understand numbers. A vector is just a list of numbers. A parameter is a single adjustable number the model learns during training.
The derivative tells you: if I nudge the input a tiny bit, how much does the output change?
Picture the graph of f(x) = x²: at any point, the slope of the tangent line is the derivative.
Gradient descent: compute the loss, compute gradients, take a small step downhill, repeat.
Picture a ball placed anywhere on the loss surface: gradient descent is that ball rolling downhill.
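The same ball-rolling idea in code, using f(x) = x² (derivative 2x) as a stand-in for the loss:

```python
x = 5.0               # start somewhere on the bowl
lr = 0.1              # learning rate: size of each downhill step
for _ in range(100):
    grad = 2 * x      # derivative of f(x) = x**2 at the current x
    x -= lr * grad    # step against the slope
# after 100 steps, x has rolled down to (nearly) the minimum at 0
```

Training microGPT is this exact procedure, just with 4,192 parameters instead of one `x`.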
Every number in microGPT is wrapped in a Value object that tracks its gradient. When you multiply two Values, the result remembers how it was made.
```python
class Value:
    def __init__(self, data):
        self.data = data  # the actual number
        self.grad = 0     # how the loss depends on this
```
| Operation | Local Gradient |
|---|---|
| a + b | Both inputs: slope = 1 |
| a × b | ∂/∂a = b, ∂/∂b = a |
| aⁿ | n · aⁿ⁻¹ |
| log(a) | 1/a |
| exp(a) | exp(a) |
| relu(a) | 1 if a>0, else 0 |
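A simplified sketch of how `Value` applies the multiplication rule from the table; the real class supports all the operations above, but one operation shows the pattern:

```python
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # local gradients from the table: d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
```

So `c = a * b` remembers its two parents, and when gradients flow backward, each parent receives the other's value times `c.grad`.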
Assign each character an integer ID (a=0, b=1, ..., z=25, BOS=26).
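That assignment is just a small lookup table; a sketch (function names here are illustrative, not microGPT's exact code):

```python
chars = "abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i for i, ch in enumerate(chars)}  # a=0, b=1, ..., z=25
BOS = 26                                      # beginning-of-sequence token
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    return [BOS] + [stoi[c] for c in name]

def decode(ids):
    return "".join(itos[i] for i in ids if i != BOS)
```

This character-level tokenizer is the tiny cousin of the subword tokenizers real LLMs use; the principle is identical.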
Each token gets a 16-dimensional vector identity. Before training, these are random. After training, similar characters cluster together.
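A sketch of that embedding table, with random numbers standing in for the untrained state:

```python
import random
random.seed(0)

vocab_size, n_embd = 27, 16
# one 16-dimensional vector per token; random before training begins
emb = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(vocab_size)]

def embed(token_id):
    return emb[token_id]  # the token's 16-number identity vector
```

Training nudges these 27 × 16 numbers like every other parameter, which is what pulls similar characters toward each other.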
Each token creates three vectors: Query ("what am I looking for?"), Key ("what do I contain?"), Value ("what info do I offer?").
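A single attention step, sketched with plain Python lists; this shows the mechanism (scaled dot products, softmaxed into a weighted mix of value vectors), not microGPT's exact code:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, keys, values):
    """query: one vector; keys/values: one vector per visible token."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]  # "how relevant?"
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]               # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted mix of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

When a query lines up strongly with one key, that token's value vector dominates the output; when scores are flat, the output is an even blend.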
| Setting | Value | Meaning |
|---|---|---|
| n_embd | 16 | Each token = 16 numbers |
| n_head | 4 | 4 parallel attention patterns |
| n_layer | 1 | One [Attention + MLP] block |
| vocab_size | 27 | 26 letters + BOS |
| Total params | 4,192 | 4,192 learnable numbers |
| Dimension | microGPT | GPT-4 class |
|---|---|---|
| Data | 32K names | Trillions of tokens |
| Parameters | 4,192 | 100B – 1T+ |
| Layers | 1 | 80 – 128+ |
| Context | 16 chars | 128K+ tokens |
| Training | ~1 minute | ~3 months |
| Cost | $0 | $100M+ |
You now understand how it was built. The only question left is: what will you build?