How neural networks learn by tracing blame backwards through computation.
You hardcode a function to classify emails as spam. It checks if the subject contains "buy now!" and flags it. Works great — for a day. Then spammers switch to "limited offer," "act fast," "urgent deal." Your function is blind to all of them. You patch it with more rules. Spammers adapt again. You're playing whack-a-mole against a universe of possible phrasings.
The core problem is this: you wrote a fixed function. Its behavior was frozen the moment you typed the last character. It cannot improve from experience. It cannot notice patterns you didn't anticipate. It cannot adapt when the world changes.
What if the function could fix itself? What if, every time it misclassified an email, it could figure out which part of its computation was responsible for the mistake and nudge that part in the right direction?
That's what a neural network does. It's a function with adjustable parameters — numbers called weights — and a procedure for adjusting those weights to reduce errors. The procedure is called learning, and the specific algorithm that makes it efficient is called backpropagation.
Here's the key shift in thinking. Traditional programming: you write the rules. Machine learning: you provide examples and the machine discovers the rules. You don't tell the spam filter that "limited offer" is suspicious. You show it 10,000 emails labeled spam or not-spam, and it learns which patterns predict spam. The function writes itself.
This works because of gradient descent — a systematic way to improve the weights. At each step, we ask: "If I increase this weight by a tiny amount, does the error go up or down?" If it goes down, we increase the weight. If it goes up, we decrease it. Do this for all weights, thousands of times, and the function converges to something useful.
To see this concretely, imagine a simple 2D classification task. You have points in two classes — blue and orange — and a linear boundary separating them. But the initial boundary is wrong. It misclassifies several points. What if we could automatically adjust the boundary's angle and position to minimize mistakes?
That's gradient descent. At each step, we measure the error, compute which direction to move the boundary to reduce that error, and take a small step. Watch it happen below.
The line starts wrong. Click "Learn" to run gradient descent steps. The boundary rotates and shifts to classify the points correctly. Adjust the learning rate to see how step size affects convergence.
Notice what happened. You didn't hand-tune the line's angle or position. You didn't write rules for where the boundary should go. You gave the system a measure of error and a way to reduce it, and the boundary found the right place on its own. That's the essence of all neural network training.
This loop — predict, measure error, compute gradients, update weights — is the heartbeat of every neural network ever trained. GPT-4 does it. Image classifiers do it. Self-driving cars do it. The architectures differ wildly. The loop is always the same.
How many times does the loop run? For GPT-3, training involved approximately 300 billion tokens, processed in batches of ~3 million tokens each. That's roughly 100,000 gradient update steps. Each step costs hundreds of GPU-hours. The total training cost was estimated at $4.6 million in compute. All of that — months of computation on thousands of GPUs — is just this loop running over and over.
The remarkable thing is that this simple loop — invented in 1986 — still powers every large model trained in 2025. The architectures have changed dramatically (feedforward → CNN → RNN → Transformer), the optimizers have improved (SGD → Adam → AdamW), the hardware has evolved (CPU → GPU → TPU → custom AI chips). But the core algorithm has not. Forward, loss, backward, update. It works.
Every neural network — GPT-4, image classifiers, self-driving cars — is built from one tiny unit repeated millions of times. Understand this one unit and you understand the building block of all of deep learning. It's called a neuron (or a unit or a node).
A neuron does three things. First, it takes multiple input numbers and multiplies each one by a weight. Second, it adds all those weighted inputs together, plus a bias term. Third, it passes the sum through an activation function to produce a single output number.
That's it. Three operations: weight, sum, activate.
Where x1, x2, x3 are the inputs, w1, w2, w3 are the weights, b is the bias, and f is the activation function. The weights control how much each input matters. A large positive weight means "pay a lot of attention to this input." A negative weight means "when this input is large, push the output down." Zero means "ignore this input."
The bias is an offset. Even when all inputs are zero, the neuron can still produce a non-zero output thanks to the bias. Think of it as the neuron's baseline tendency — its default opinion before seeing any evidence.
Let's make this concrete with numbers. Say x1 = 2.0, x2 = −1.0, x3 = 0.5, and we're using the ReLU activation (f(z) = max(0, z)). If the weights are w1 = 0.5, w2 = 0.8, w3 = −0.3, and b = 0.1:
The neuron received three inputs, weighted and summed them to get 0.15, then ReLU left it unchanged because it's positive. If the sum had been negative — say −0.5 — ReLU would have output 0. The neuron would be "silent."
Let's make this tangible. You're building a spam detector with three features: x1 = number of exclamation marks (normalized 0–1), x2 = contains "free" (1 or 0), x3 = sender in contacts (1 or 0). A neuron might learn weights w1 = 1.5, w2 = 2.0, w3 = −3.0, b = −0.5.
Read the weights like a story: "Exclamation marks are mildly suspicious (+1.5). The word 'free' is very suspicious (+2.0). Being in contacts is strong evidence of NOT spam (−3.0). Default lean: slightly toward not-spam (−0.5)."
Email from your friend (x = [0.2, 0, 1]): z = 1.5(0.2) + 2.0(0) + (−3.0)(1) + (−0.5) = 0.3 + 0 − 3 − 0.5 = −3.2. ReLU(−3.2) = 0. Not spam. Correct — the contact weight dominated.
Spam email (x = [0.9, 1, 0]): z = 1.5(0.9) + 2.0(1) + (−3.0)(0) + (−0.5) = 1.35 + 2.0 + 0 − 0.5 = 2.85. ReLU(2.85) = 2.85. High activation. Spam detected.
Edge case — spam from a friend (x = [0.8, 1, 1]): z = 1.5(0.8) + 2.0(1) + (−3.0)(1) + (−0.5) = 1.2 + 2.0 − 3.0 − 0.5 = −0.3. ReLU(−0.3) = 0. Not spam. The contact weight overrides the spam signals. Whether this is correct depends on your tolerance — and this is exactly why we need multiple neurons (different feature detectors with different weight configurations).
The name "neuron" comes from the biological brain cell. But don't take the analogy too seriously. Biological neurons communicate via spikes with complex timing; artificial neurons output a single floating-point number. Biological networks are sparse and stochastic; artificial networks are dense and deterministic. The only real similarity is the structure: many inputs, one output, a threshold effect.
Adjust the three weights and bias using the sliders. Watch the weighted sum bar and the ReLU output update in real time. Try making the sum negative — the output clamps to zero.
A single neuron can only compute a simple function. But arrange many neurons side by side, each receiving the same inputs but with different weights, and you get a layer. Each neuron in the layer produces its own output, so a layer of 4 neurons transforms 3 inputs into 4 outputs. Stack multiple layers and you get a neural network.
In vector notation, a layer with n inputs and m neurons is:
Where W is a [m × n] matrix (each row is one neuron's weights), x is the [n] input vector, b is the [m] bias vector, and f is applied element-wise. This is just our single-neuron formula repeated m times in parallel, written compactly as a matrix multiply.
python import numpy as np class Neuron: def __init__(self, n_inputs): # Random weights, small values (Xavier-ish) self.w = np.random.randn(n_inputs) * np.sqrt(1/n_inputs) self.b = 0.0 def forward(self, x): # Weighted sum + bias + ReLU self.z = np.dot(self.w, x) + self.b # scalar self.out = max(0, self.z) # ReLU self.x = x # save for backprop return self.out # Usage: 3-input neuron n = Neuron(3) output = n.forward(np.array([2.0, -1.0, 0.5])) print(output) # some scalar ≥ 0
We save the input x and the pre-activation z during the forward pass. These cached values are exactly what backpropagation needs to compute gradients — we'll see why in Chapter 5.
How does this connect to NLP? In CS224N Lecture 3, Collobert and Weston (2011) showed that a simple feedforward network can do named entity recognition. The input is a window of word vectors: concatenate the embeddings of 5 consecutive words into one long vector, feed it to a hidden layer, and classify the center word as Person, Location, Organization, or None.
If each word vector is 300-dimensional and the window size is 5, the input layer has 1500 neurons. A hidden layer of 100 neurons means W1 is [100, 1500] — 150,000 learnable parameters in just the first layer. The neuron is the same unit we studied above, but the inputs are now word embeddings from Chapter L02 instead of raw numbers.
Stack two linear functions. f(x) = 2x, g(x) = 3x. Their composition is g(f(x)) = 3(2x) = 6x — still a linear function. Add a hundred more layers, all linear: the composition is still just one big linear function. You gained exactly nothing from the extra layers.
This is the fundamental problem with linear neurons. Without a nonlinear activation function, a network of any depth collapses into a single-layer network. The activation function is what gives deep networks their expressive power — the ability to represent curved, folded, and twisted decision boundaries that linear functions can never capture.
Three activation functions dominate the history of neural networks. Each solved a problem the previous one had.
The sigmoid squashes any input into the range (0, 1). Large positive inputs map to ~1. Large negative inputs map to ~0. Zero maps to 0.5. It was the default choice in the 1990s because it resembles a biological neuron's firing rate curve: below threshold it's silent, above threshold it saturates.
The problem is the vanishing gradient. Look at the derivative: σ′(z) = σ(z)(1 − σ(z)). When z is large and positive, σ(z) ≈ 1, so the derivative ≈ 0. When z is large and negative, σ(z) ≈ 0, derivative ≈ 0 again. The gradient is only meaningful in a narrow band around zero. Stack many sigmoid layers and gradients get multiplied together — a product of tiny numbers becomes astronomically small. Deep networks can't learn.
Tanh maps inputs to (−1, 1) instead of (0, 1). This is better because its output is zero-centered — negative inputs produce negative outputs. With sigmoid, the output is always positive, which means gradients for the weights always have the same sign, causing zig-zagging during optimization. Tanh fixes this.
But tanh still saturates at the extremes, so it still suffers from the vanishing gradient problem, just slightly less than sigmoid.
The Rectified Linear Unit is almost embarrassingly simple. If the input is positive, pass it through. If it's negative, output zero. That's it. No exponentiation, no division. The derivative is 1 for positive inputs and 0 for negative inputs.
ReLU has one issue: dying neurons. If a neuron's weighted sum is always negative (due to unlucky initialization or a large gradient spike), its output is always zero and its gradient is always zero. The neuron is permanently dead — it can never recover. Variants like Leaky ReLU (f(z) = max(0.01z, z)) fix this by allowing a small gradient when z < 0.
Here's a stunning result: a single hidden layer with enough neurons and a nonlinear activation function can approximate any continuous function to arbitrary accuracy (Cybenko, 1989). This is the universal approximation theorem. It means that in theory, a one-hidden-layer network is sufficient for any task.
In practice, depth matters enormously. A deep network with 100 neurons per layer can represent functions that would require exponentially many neurons in a single layer. Think of it this way: each layer adds a new level of abstraction. Layer 1 detects edges. Layer 2 combines edges into shapes. Layer 3 combines shapes into objects. Doing all of that in one layer would require an astronomical number of neurons.
python import numpy as np def sigmoid(z): return 1 / (1 + np.exp(-z)) def tanh(z): return np.tanh(z) def relu(z): return np.maximum(0, z) # Derivatives (needed for backprop) def sigmoid_grad(z): s = sigmoid(z); return s * (1 - s) def tanh_grad(z): return 1 - np.tanh(z)**2 def relu_grad(z): return (z > 0).astype(float)
Notice something beautiful about ReLU's gradient: it's a boolean cast to float. No exponentials, no divisions. This computational simplicity is another reason ReLU trains faster than sigmoid or tanh.
Research hasn't stopped at ReLU. Several variants have emerged:
GELU (Gaussian Error Linear Unit): used in BERT and GPT. It smoothly approximates ReLU with GELU(x) = x · Φ(x), where Φ is the Gaussian CDF. Unlike ReLU's hard zero, GELU gently attenuates negative inputs. This smooth behavior can improve training stability in Transformers.
SiLU / Swish: f(x) = x · σ(x). Similar to GELU but uses sigmoid instead of the Gaussian CDF. Used in many modern architectures. Non-monotonic — it dips slightly below zero for negative inputs, which empirically helps.
For hidden layers in 2025: ReLU remains the default for CNNs and simple networks. GELU dominates in Transformers (language models). Sigmoid is reserved for output layers where you need a probability (0, 1). Tanh appears in LSTMs and specific gating mechanisms.
Let's prove the claim from the opening formally. Two linear layers without activation:
Weff = W2W1 is just another matrix. A [2, 4] matrix times a [4, 3] matrix gives a [2, 3] matrix. We could have computed y = Weffx directly with one layer. The hidden layer added parameters but no expressiveness. This generalizes to any number of layers: N stacked linear layers collapse to one.
With a nonlinear activation, the collapse is broken. y = W2ReLU(W1x) cannot be written as Wx for any single matrix W. The ReLU creates piecewise-linear regions that a single linear layer cannot replicate. Each additional layer with activation adds genuinely new representational capacity.
| Function | Range | Gradient Behavior | When to Use |
|---|---|---|---|
| Sigmoid | (0, 1) | Vanishes at extremes | Output layer for binary classification |
| Tanh | (−1, 1) | Vanishes, but zero-centered | RNNs (historically) |
| ReLU | [0, ∞) | 0 or 1 — no vanishing | Hidden layers (default choice) |
Toggle between activation functions. The solid curve is the function; the dashed curve is its derivative. Drag the x-slider to see the function value and gradient at that point. Notice how sigmoid and tanh derivatives vanish at the extremes.
You have a 3-number input: [height, weight, age]. How does the network turn that into "diabetic: 72% likely"? By pushing the numbers forward through the network, layer by layer, until a prediction pops out the other end. This is the forward pass.
Let's trace it through a concrete 2-layer network with 3 inputs, 4 hidden neurons, and 2 outputs. Every step is just an operation you already know: matrix multiply, add bias, activate.
Step 1: Input. Our input vector x has shape [3]. These are raw features — for example, height=170, weight=80, age=45.
Step 2: First layer linear transform. Multiply by weight matrix W1 of shape [4, 3], then add bias b1 of shape [4]. The result z1 = W1x + b1 has shape [4]. Each of the 4 entries is one hidden neuron's weighted sum.
Step 3: First activation. Apply ReLU element-wise: h = ReLU(z1). Still shape [4], but now all negative entries are clamped to zero.
Step 4: Second layer linear transform. Multiply by W2 of shape [2, 4], add b2 of shape [2]. Result z2 has shape [2] — one score per output class.
Step 5: Softmax. Convert raw scores to probabilities: p = softmax(z2). Still shape [2], but entries sum to 1. If p = [0.28, 0.72], the network says "28% healthy, 72% diabetic."
| Object | Shape | What It Is |
|---|---|---|
| x (input) | [3] | Raw features |
| W1 | [4, 3] | First layer weights |
| b1 | [4] | First layer bias |
| z1 = W1x + b1 | [4] | Pre-activation (linear output) |
| h = ReLU(z1) | [4] | Hidden activations |
| W2 | [2, 4] | Second layer weights |
| b2 | [2] | Second layer bias |
| z2 = W2h + b2 | [2] | Raw output scores (logits) |
| p = softmax(z2) | [2] | Class probabilities |
Let's nail down exactly what "matrix multiply" means with small numbers. Suppose x = [0.7, 0.8, 0.45] and W1 is a [4, 3] matrix (4 neurons, each with 3 weights):
Neuron 0: (0.2)(0.7) + (0.1)(0.8) + (−0.3)(0.45) = 0.14 + 0.08 − 0.135 = 0.085
Neuron 1: (−0.1)(0.7) + (0.4)(0.8) + (0.1)(0.45) = −0.07 + 0.32 + 0.045 = 0.295
Neuron 2: (0.3)(0.7) + (−0.2)(0.8) + (0.5)(0.45) = 0.21 − 0.16 + 0.225 = 0.275
Neuron 3: (0.1)(0.7) + (0.2)(0.8) + (−0.1)(0.45) = 0.07 + 0.16 − 0.045 = 0.185
After adding bias b1 = [0.1, −0.2, 0.1, 0.3] and applying ReLU:
z1 = [0.185, 0.095, 0.375, 0.485] → h = ReLU(z1) = [0.185, 0.095, 0.375, 0.485]
All positive, so ReLU doesn't change anything here. If any entry had been negative, it would be clamped to zero. Each dot product is one neuron "voting" on the input.
Let's change the input to see ReLU in action. With x = [0.1, 0.1, 0.1]:
Neuron 0: (0.2)(0.1) + (0.1)(0.1) + (−0.3)(0.1) + 0.1 = 0.02 + 0.01 − 0.03 + 0.1 = 0.1
Neuron 1: (−0.1)(0.1) + (0.4)(0.1) + (0.1)(0.1) + (−0.2) = −0.01 + 0.04 + 0.01 − 0.2 = −0.16
Neuron 1's pre-activation is −0.16. ReLU clamps it to 0. That neuron is silent for this input — it contributes nothing to the next layer. Different inputs activate different subsets of neurons. This selective activation is part of what gives neural networks their representational power.
You might wonder: why matrix multiplication specifically? Why not some other operation? The answer has two parts.
First, matrix multiplication is a linear transformation. It can rotate, scale, shear, and project vectors. These are exactly the operations needed to rearrange the input space into regions that correspond to different outputs. The activation function then bends these regions into nonlinear shapes.
Second, matrix multiplication is embarrassingly parallel. GPUs are built from thousands of cores optimized for matrix operations. A matrix multiply of shape [4096, 4096] × [4096, 1] involves 16 million multiply-adds — and a GPU completes them in microseconds. This computational efficiency is why neural networks scale so well on modern hardware.
To appreciate the scale, here's what a forward pass looks like for GPT-2 Small (124M parameters):
| Operation | Shape | FLOPs per token |
|---|---|---|
| Token embedding lookup | [50257, 768] → [768] | 0 (lookup, no math) |
| 12 × attention QKV projection | [768, 2304] per layer | ~4.2M per layer |
| 12 × FFN | [768, 3072] then [3072, 768] | ~9.4M per layer |
| Output projection | [768, 50257] | ~77M |
Total: roughly 240 million FLOPs per token. For a 1024-token sequence, that's ~250 billion FLOPs for a single forward pass. And GPT-2 Small is tiny by modern standards. Same operations, same algorithm — vastly more scale.
Click "Step" to advance through the 5 stages of a forward pass in a 3→4→2 network. Tensor shapes and actual numbers are shown at each stage.
python import torch import torch.nn as nn model = nn.Sequential( nn.Linear(3, 4), # W1: [4,3], b1: [4] nn.ReLU(), # element-wise max(0, z) nn.Linear(4, 2), # W2: [2,4], b2: [2] ) x = torch.tensor([170.0, 80.0, 45.0]) # [3] logits = model(x) # [2] probs = torch.softmax(logits, dim=0) # [2], sums to 1
Five lines. That's a complete forward pass. PyTorch handles the matrix multiplications, bias additions, and ReLU applications behind the scenes. The shapes flow exactly as our table predicted: [3] → [4] → [4] → [2] → [2].
The softmax function converts raw scores (called logits) into probabilities. Given a vector z of K scores:
Why exponentiate? Because ez is always positive (ensuring valid probabilities) and preserves ordering (higher scores get higher probabilities). The denominator normalizes so the probabilities sum to 1. If one logit is much larger than the others, its softmax output approaches 1 while the others approach 0 — the network is "confident."
Practical trick: to avoid numerical overflow, subtract the maximum logit before exponentiating: softmax(z) = softmax(z − max(z)). This doesn't change the result (the constant cancels) but prevents ez from becoming astronomically large.
For binary classification, you can skip softmax and use a single output neuron with sigmoid: p = σ(z). This produces a probability directly. For multi-label classification (where multiple classes can be true simultaneously), use sigmoid on each output independently instead of softmax (which enforces "exactly one class is true").
For regression (predicting a continuous number like house price), don't use any activation on the output — just let the raw linear output through. The output can be any real number, which is what you want for regression.
| Task | Output Activation | Loss Function |
|---|---|---|
| Binary classification | Sigmoid (1 output) | Binary cross-entropy |
| Multi-class classification | Softmax (K outputs) | Cross-entropy |
| Multi-label classification | Sigmoid per output | Binary CE per output |
| Regression | None (identity) | MSE |
The network predicted "cat: 20%, dog: 80%" but the image is a cat. How do we turn "you're wrong" into a number? We need a function that takes the predicted probabilities and the true label and outputs a single number that's large when the prediction is bad and small when it's good. That function is called the loss function (or cost function or objective).
The loss function is the neural network's only source of feedback. It's the signal that drives learning. A poorly chosen loss function means the network optimizes for the wrong thing. A good loss function precisely encodes what "correct" means.
The simplest loss: square the difference between the prediction and the target.
For classification with probability output p and true label y = 1: if p = 0.8, loss = (1 − 0.8)² = 0.04. If p = 0.2, loss = (1 − 0.2)² = 0.64. MSE penalizes wrong answers, but the penalty grows only quadratically. A confident wrong answer (p = 0.01) gets loss 0.98 — not much worse than a mildly wrong answer (p = 0.3, loss = 0.49).
This is the standard loss for classification. For binary classification:
Where p is the predicted probability of the true class. If the true class is "cat" and you predicted p(cat) = 0.8, the loss is −log(0.8) = 0.22. If you predicted p(cat) = 0.01, the loss is −log(0.01) = 4.61. That's a 20x larger penalty for being confidently wrong.
The steep cliff near p = 0 is the key visual insight. As the predicted probability of the correct class approaches zero, cross-entropy loss shoots to infinity. This creates a powerful gradient signal: "You are catastrophically wrong. Fix this immediately." MSE, by contrast, just says "You're a little more wrong than before."
When you train a language model like GPT, the loss function is cross-entropy over the vocabulary. At each position, the model predicts a probability distribution over ~50,000 tokens. The true next token is one of those 50,000. Cross-entropy loss is −log(ptrue), where ptrue is the probability the model assigned to the correct token.
If the model thinks the next word after "The cat sat on the" is "mat" with probability 0.3, the loss is −log(0.3) = 1.2. If it predicted "quantum" with probability 0.0001, the loss would be −log(0.0001) = 9.2. Over millions of training examples, this signal teaches the model which words tend to follow which contexts.
The metric perplexity = eaverage loss gives an intuitive interpretation: a perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 options. GPT-4's perplexity is estimated around 5–10 on typical English text.
For K classes, the loss is the sum over all classes, but only the true class contributes (because the one-hot label is zero for all other classes):
This simplifies to exactly what we had above: just the negative log of the predicted probability for the correct class. In PyTorch, nn.CrossEntropyLoss combines softmax and cross-entropy in one numerically stable operation.
Cross-entropy isn't arbitrary — it comes from information theory. It measures the number of bits needed to encode outcomes from one distribution (the true labels) using another distribution (the model's predictions). When the model perfectly predicts the true distribution, cross-entropy equals the entropy of the true distribution — the minimum possible. Any mismatch increases the cost.
The gradient of cross-entropy with respect to the logits has a beautiful form: for softmax + cross-entropy, the gradient is simply prediction minus target. If the true class is k and the model predicts probabilities p, then dL/dzk = pk − 1 and dL/dzj = pj for all other j. This clean gradient is another reason cross-entropy is preferred over MSE for classification.
Let's verify with numbers. Say logits z = [1.0, 2.0, 0.5] and the true class is 1. Softmax gives p = [0.21, 0.57, 0.13]. The gradient is: dL/dz = [0.21, 0.57 − 1.0, 0.13] = [0.21, −0.43, 0.13]. The gradient is negative for the true class (increase that logit to reduce loss) and positive for wrong classes (decrease those logits). Beautifully intuitive.
python import torch import torch.nn as nn # CrossEntropyLoss takes raw logits (before softmax) logits = torch.tensor([1.2, -0.5, 3.1]) # 3 classes target = torch.tensor(2) # true class is 2 loss = nn.CrossEntropyLoss()(logits, target) # Manual version: probs = torch.softmax(logits, dim=0) # [0.142, 0.026, 0.832] manual_loss = -torch.log(probs[2]) # -log(0.832) = 0.184
Drag the slider to change the predicted probability for the true class. Watch how both losses change. Notice the steep cliff in cross-entropy near p=0 — confident wrong answers are punished exponentially.
Turn the thermostat up 1 degree. The heater burns more gas. More gas means a bigger bill. How much does 1 degree cost? You need (dollars per gallon) × (gallons per degree). That's the chain rule: to find how a change at the start affects the end, multiply the rates of change along the chain.
In calculus notation: if y = f(g(x)), then dy/dx = (dy/dg) · (dg/dx). The derivative of the composition is the product of the individual derivatives. This extends to any number of composed functions:
Each term in the product is called a local gradient — it describes how sensitive one variable is to the variable just before it in the chain. To get the end-to-end gradient (how the loss changes when we wiggle a weight), we just multiply all the local gradients along the path from the weight to the loss.
Let's trace a concrete computation. We have weight w = 2.0, input x = 1.5, and we compute:
Now the backward pass. We want dL/dw — how much does the loss change if we wiggle w?
Local gradient 1: dL/dh = 2(h − y) = 2(0.881 − 1) = −0.238. The loss decreases if h increases (good — h should be closer to 1).
Local gradient 2: dh/da = σ(a)(1 − σ(a)) = 0.881 × 0.119 = 0.105. Sigmoid's derivative at a = 2.0.
Local gradient 3: da/dz = 1. The bias addition has derivative 1 with respect to z.
Local gradient 4: dz/dw = x = 1.5. The multiplication z = w · x has derivative x with respect to w.
Chain them: dL/dw = (−0.238)(0.105)(1)(1.5) = −0.0375. The gradient is negative, meaning increasing w will decrease the loss. So gradient descent will increase w — exactly the right direction.
Gradient descent says: wnew = w − η · dL/dw, where η is the learning rate. With η = 0.1:
The weight increased slightly. After many such updates, w will converge to a value that makes the loss near zero. The miracle of this process is that we never told the network what w should be — we only told it what the loss should be (small), and the chain rule figured out the rest.
Look at that chain of multiplications again: (−0.238)(0.105)(1)(1.5). The sigmoid derivative (0.105) is less than 1. In a 10-layer network, you'd multiply 10 such terms together. If each is 0.1, the product is 0.110 = 10−10. The gradient has effectively vanished — early layers get no useful signal.
Conversely, if each local gradient is 1.5, the product after 10 layers is 1.510 ≈ 57. The gradient explodes. Weights get massive updates, the network blows up, and you see NaN. This is why activation function choice matters so much (Chapter 2) and why ReLU's gradient of exactly 1 for positive inputs is so valuable.
What if a node has multiple outputs — for example, one hidden neuron feeds into three neurons in the next layer? The chain rule says: sum the gradients from all children. Each downstream path contributes its own gradient, and we add them all up.
This is why backpropagation works for arbitrary computation graphs, not just chains. At each node: sum the gradients flowing in from all outputs, then multiply by the local gradient to pass backward. Branching (one output goes to multiple nodes) means summing. Merging (multiple inputs combine) means each input gets its own gradient path.
A modern language model like GPT-4 has hundreds of billions of parameters. The computation graph has trillions of nodes. Computing the chain rule by hand is obviously impossible. Automatic differentiation (autodiff) systems like PyTorch's autograd do it mechanically.
During the forward pass, PyTorch records every operation in a directed acyclic graph (DAG). Each tensor knows which operation created it and what its inputs were. When you call loss.backward(), PyTorch walks this graph in reverse topological order, applying the chain rule at each step. The programmer never sees a derivative. The system handles everything.
But understanding the chain rule is still essential. When your model won't train, the first question is: "Are the gradients reaching the early layers?" If they're vanishing (sigmoid + deep network) or exploding (bad initialization), you need to understand why — and that requires knowing the chain rule.
When implementing backprop by hand, how do you know your gradients are correct? Use numerical gradient checking. For each weight w, compute:
This finite-difference approximation should match your backprop gradient to ~5 significant digits. If it doesn't, you have a bug. Typical ε = 10−5. This is slow (requires 2 forward passes per weight) so you only use it for debugging, not training.
python def grad_check(f, w, eps=1e-5): """Compare analytical gradient with numerical.""" grad_numerical = np.zeros_like(w) for i in range(len(w)): w[i] += eps loss_plus = f(w) w[i] -= 2*eps loss_minus = f(w) w[i] += eps # restore grad_numerical[i] = (loss_plus - loss_minus) / (2*eps) return grad_numerical
Click "Forward" to fill in values left-to-right through the computation graph. Then click "Backward" to fill gradients right-to-left, watching them multiply along the chain. Each node shows its value (top, teal) and local gradient (bottom, orange).
You know the chain rule for one path. But a real network has millions of paths — every weight connects to the loss through a different chain of operations. Computing each gradient independently by tracing its own path would mean repeating vast amounts of shared computation. For a network with n weights, that's n separate traversals of the graph.
Backpropagation computes ALL gradients in a single backward sweep by reusing intermediate results. It's the chain rule applied systematically, layer by layer, from the loss back to the inputs. The key insight is that gradients flowing into a node can be computed once and shared by all paths passing through that node.
Consider a network with 3 layers and 1000 weights per layer. Naively computing dL/dwi for each weight independently means traversing the graph 3000 times. Backprop traverses it once. The cost of the backward pass is approximately equal to the cost of the forward pass. This O(1) cost per gradient (amortized) is what makes training neural networks feasible.
For a single layer h = ReLU(Wx + b), given the incoming gradient dL/dh from the layer above:
The gradient for the weights (dL/dW) is the outer product of the incoming gradient and the layer's input. The gradient to pass backward (dL/dx) is the transpose of the weight matrix times the incoming gradient. These two operations — outer product and transpose multiply — are the computational core of backprop.
For our 3→4→2 network from Chapter 3, here are the gradient shapes at each step of backprop:
| Gradient | Shape | How Computed |
|---|---|---|
| dL/dz2 | [2] | p − y (softmax gradient) |
| dL/dW2 | [2, 4] | dL/dz2 · hT (outer product) |
| dL/db2 | [2] | dL/dz2 (copy) |
| dL/dh | [4] | W2T · dL/dz2 (pass backward) |
| dL/dz1 | [4] | dL/dh ⊙ ReLU′(z1) |
| dL/dW1 | [4, 3] | dL/dz1 · xT (outer product) |
| dL/db1 | [4] | dL/dz1 (copy) |
Notice the symmetry: each weight gradient has the same shape as the weight matrix itself. This makes sense — we need one gradient value per weight. The backward pass mirrors the forward pass but in reverse: forward goes [3]→[4]→[2], backward goes [2]→[4]→[3].
Backpropagation was independently discovered multiple times. Linnainmaa described the core idea (reverse-mode automatic differentiation) in 1970. Werbos applied it to neural networks in his 1974 thesis. But the paper that made the world pay attention was Rumelhart, Hinton, and Williams (1986) in Nature.
Before 1986, many researchers believed multi-layer networks were untrainable — the perceptron (a single-layer network) could be trained, but how to extend training to hidden layers was unclear. Backpropagation answered that question definitively: propagate the error signal backward through each layer. It was the breakthrough that made deep learning possible.
For decades afterward, backprop was considered "solved" and attention shifted to architectures and data. But Karpathy's 2016 blog post argued convincingly that understanding backprop (not just calling .backward()) is essential for debugging modern networks. Saturating activations, dead ReLUs, exploding gradients — all of these are gradient flow problems that require backprop intuition to diagnose.
There's a cost to backprop's efficiency: memory. The forward pass must save every intermediate activation because the backward pass needs them. A network with L layers and N activations per layer stores L × N values. For large models like GPT-4 with billions of parameters, this activation memory dominates GPU usage.
Techniques like gradient checkpointing trade memory for compute: save activations only at every k-th layer, and recompute the intermediate ones during the backward pass. This reduces memory from O(L) to O(√L) at the cost of ~33% more compute. It's essential for training the largest models.
Mixed precision training stores activations in 16-bit floats instead of 32-bit, halving memory. The gradients and weight updates are computed in 32-bit for numerical stability, but the saved activations only need enough precision for the backward pass. This is now standard practice.
python # Forward pass (PyTorch records the computation graph automatically) logits = model(x) # forward loss = nn.CrossEntropyLoss()(logits, y) # compute loss loss.backward() # ALL gradients computed here # After backward(), every parameter has a .grad attribute: for p in model.parameters(): p.data -= lr * p.grad # gradient descent update
loss.backward() is the entire backpropagation algorithm. PyTorch's autograd engine recorded every operation during the forward pass, then replays them in reverse to compute gradients. You never write a single gradient formula by hand.
To truly understand backprop, implement it once without a framework. Here's a complete 2-layer network with manual forward and backward passes:
python import numpy as np # Network: 3 → 4 → 2 W1 = np.random.randn(4, 3) * 0.5 # [4, 3] b1 = np.zeros(4) # [4] W2 = np.random.randn(2, 4) * 0.5 # [2, 4] b2 = np.zeros(2) # [2] def forward(x): z1 = W1 @ x + b1 # [4] h = np.maximum(0, z1) # [4] ReLU z2 = W2 @ h + b2 # [2] e = np.exp(z2 - z2.max()) p = e / e.sum() # [2] softmax return z1, h, z2, p def backward(x, z1, h, p, y): # Softmax + CE gradient: p - one_hot(y) dz2 = p.copy() dz2[y] -= 1 # [2] dW2 = np.outer(dz2, h) # [2, 4] db2 = dz2 # [2] dh = W2.T @ dz2 # [4] dz1 = dh * (z1 > 0) # [4] ReLU grad dW1 = np.outer(dz1, x) # [4, 3] db1 = dz1 # [4] return dW1, db1, dW2, db2
Look at the backward function. It's just 7 lines. Each line corresponds to one step of the chain rule, working from the output back to the input. The key operations are: outer product (for weight gradients), transpose multiply (for passing gradients backward), and element-wise multiply (for ReLU's gradient mask).
Let's run through the shapes one more time to make sure everything checks out:
| Line | Operation | Shape Check |
|---|---|---|
dz2 = p.copy(); dz2[y] -= 1 | Softmax+CE gradient | [2] — one gradient per output |
dW2 = np.outer(dz2, h) | Outer product | [2] ⊗ [4] = [2,4] — matches W2 shape |
dh = W2.T @ dz2 | Transpose multiply | [4,2] × [2] = [4] — gradient per hidden unit |
dz1 = dh * (z1 > 0) | ReLU mask | [4] ⊙ [4] = [4] — zero out dead neurons |
dW1 = np.outer(dz1, x) | Outer product | [4] ⊗ [3] = [4,3] — matches W1 shape |
Every gradient has the same shape as the parameter it corresponds to. This is a universal property of backprop: if W has shape [m, n], then dL/dW also has shape [m, n]. You can always sanity-check your implementation by verifying this.
Backpropagation is reverse-mode automatic differentiation. There's also forward-mode, which computes derivatives left-to-right during the forward pass. Forward mode computes one derivative per pass (doutput/done-input). Reverse mode computes all derivatives in one pass (done-output/dall-inputs).
Since neural network training has one scalar output (the loss) and millions of inputs (the weights), reverse mode is vastly more efficient. This is why we always do backward passes, never forward-mode differentiation. If we had one weight and a million outputs, forward mode would win instead.
For a vector-to-vector function f: Rn → Rm, the derivative is the Jacobian matrix J of shape [m, n], where Jij = ∂fi/∂xj. The chain rule in matrix form is just Jacobian multiplication:
Backpropagation never builds the full Jacobian (it's too large — a layer with 4096 inputs and 4096 outputs has a 16-million-entry Jacobian). Instead, it computes vector-Jacobian products (VJPs): multiply the incoming gradient vector by the Jacobian transpose without ever materializing the full matrix. This is the core trick that makes backprop efficient for neural networks.
For a linear layer y = Wx, the Jacobian is just W itself. The VJP is WT · dL/dy — exactly the operation we compute in backprop. For element-wise operations like ReLU, the Jacobian is diagonal (each output depends only on its corresponding input), so the VJP is just element-wise multiplication by the diagonal entries — the dh * (z1 > 0) operation in our from-scratch code.
This VJP perspective is the mathematically clean way to understand backpropagation. Each layer implements two functions: forward (compute output from input) and VJP (compute input gradient from output gradient). Stack them together and you get a trainable system of arbitrary depth.
Click "Run" to watch a forward pass (teal) followed by a backward pass (orange) through a 3-layer network. Activations fill left-to-right, then gradients flow right-to-left. Edges pulse as gradients propagate. Use the speed slider to slow the animation.
Everything clicks here. Watch a neural network classify 2D points in real time — every weight change, every loss drop, every decision boundary shift. This is what all the theory from Chapters 0–6 looks like in action.
On the left: a scatter plot of 2D points in two classes. The colored background shows the network's current decision boundary — the region where it predicts each class. On the right: the loss curve over training steps, so you can watch the network improve (or fail to).
Choose a dataset and architecture, then click Play. Watch the decision boundary evolve as the network learns. The loss curve shows training progress. Try different learning rates and hidden sizes.
Experiment ideas to build intuition:
Learning rate too high: Set LR to 1.0. The loss will oscillate wildly or diverge — the network overshoots the optimal weights and bounces around. Lower it to 0.01 and watch it converge smoothly but slowly.
Learning rate too low: Set LR to 0.001 on the spiral dataset. After 500 epochs it's barely learned anything. The loss crawls down at a snail's pace. Learning rate is the most important hyperparameter.
Not enough capacity: Try the spiral dataset with 2 hidden units. The network can only draw straight-ish boundaries. It maxes out at ~50% accuracy. Switch to 16 units and it carves out the spiral perfectly. But more units means more parameters means more risk of overfitting (on small datasets).
XOR is the classic test: A single layer (linear boundary) can never solve XOR — the classes are not linearly separable. But our 2-layer network with 4+ hidden units handles it easily. This is the fundamental capability that nonlinearity provides.
A single neuron with no activation draws a straight line (in 2D) or a hyperplane (in higher dimensions). That's all a linear function can do. But stack two layers with ReLU and something remarkable happens: each hidden neuron contributes one "crease" — a boundary where the function bends. Four hidden neurons can create a region enclosed by four creases. Eight neurons can create curved, complex boundaries.
The spiral dataset makes this vivid. Two interleaved spirals are extremely non-linear. With 2 hidden units, the network can only draw a boundary with 2 creases — essentially a straight line with a kink. It's hopeless. With 16 units, the boundary can fold 16 times, tracing the spiral path precisely.
When you click Play, here's what happens on every frame:
The decision boundary visualization works by evaluating the network at a grid of points across the 2D plane. At each grid point, if the network predicts > 0.5, it's colored orange (class 1); otherwise teal (class 0). The intensity of the color reflects the confidence — darker means more certain.
Our playground uses batch gradient descent: it computes the gradient using ALL data points before updating. This gives a clean gradient estimate but is slow for large datasets. In practice, real networks use mini-batch gradient descent: sample a random subset (batch) of 32-512 examples, compute the gradient on that batch, and update. This is noisier but much faster per step.
Stochastic gradient descent (SGD) technically means batch size = 1: update after every single example. This is very noisy but can escape local minima. In practice, "SGD" usually refers to mini-batch SGD. The noise from small batches is actually helpful — it acts as implicit regularization, preventing the network from settling into sharp, overfitting minima.
You built a network, wrote the training loop, hit run... and the loss is NaN. Or it flatlines at a high value. Or it drops to zero on training data but fails on test data. What went wrong?
This chapter covers the three most common failure modes and how to fix them: bad initialization, missing regularization, and wrong learning rate.
If you initialize all weights to zero, every neuron in a layer computes the exact same function. They all receive the same gradient and update identically. The network has many neurons but they're all clones — you've effectively got a single neuron per layer. This is called the symmetry problem.
If you initialize with large random values, the pre-activations z = Wx will be huge. Sigmoid and tanh saturate, killing gradients. Even ReLU can produce exploding activations that cascade through layers, leading to NaN.
The solution is Xavier initialization (Glorot, 2010): draw weights from a distribution with variance 1/nin, where nin is the number of inputs to the layer. This keeps the variance of activations roughly constant across layers — not too small, not too big.
For ReLU networks, He initialization (He et al., 2015) is better: variance 2/nin. The factor of 2 compensates for the fact that ReLU zeros out half the activations (those with negative pre-activation are zeroed, cutting the variance in half).
Here's the derivation. For one neuron: z = ∑ wi xi. Assuming inputs and weights are independent with zero mean:
We want Var(z) = Var(x) (keep variance constant across layers). This gives us Var(w) = 1/nin. That's it — the entire derivation of Xavier initialization. For He init, account for ReLU zeroing half the activations: Var(z) = (nin/2) · Var(w) · Var(x), so Var(w) = 2/nin.
A network with enough parameters can memorize the training set perfectly — loss drops to zero. But it fails on new data because it learned the noise, not the pattern. This is overfitting.
L2 regularization (weight decay) adds a penalty proportional to the squared magnitude of the weights:
This discourages large weights, which tend to produce sharp, complex decision boundaries that fit noise. Small weights produce smoother boundaries that generalize better. The hyperparameter λ controls the trade-off: too large and the network underfits (too smooth); too small and it overfits.
Dropout (Srivastava, 2014) randomly zeros out a fraction of neurons during each training step. This forces the network to be robust — no single neuron can be solely responsible for a prediction. At test time, all neurons are active but their outputs are scaled down by the dropout rate. Typical rate: 0.5 for hidden layers, 0.2 for input layers.
The intuition: imagine a team where any member might be absent on any given day. The team can't rely on any single person — everyone has to be capable. Dropout creates this same pressure on neurons. Each neuron has to be useful on its own, not just as part of a memorized pattern. This dramatically reduces overfitting.
Mathematically, dropout is equivalent to training an ensemble of all possible sub-networks and averaging their predictions. A network with n neurons has 2n possible sub-networks (each neuron either present or dropped). At test time, using all neurons with scaled weights approximates the ensemble average. This interpretation, due to Hinton, explains why dropout works so well — ensembles are one of the most reliable techniques in machine learning, and dropout gives you one for free.
python # Dropout in practice (PyTorch) model = nn.Sequential( nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.5), # 50% of neurons zeroed each step nn.Linear(256, 10), ) # model.train() → dropout active # model.eval() → dropout off, outputs scaled
Too high: the loss oscillates or diverges. The weight updates overshoot the minimum and bounce around. Too low: convergence is painfully slow. The loss decreases but takes thousands of extra epochs. Just right: the loss decreases steadily and reaches a good minimum.
A common strategy is learning rate scheduling. Start with a moderate LR (e.g., 0.001 for Adam, 0.01 for SGD) and decrease it over time. This allows the network to make large, coarse adjustments early and fine-tune later. The most popular schedules are cosine decay and step decay.
python import torch import torch.nn as nn # 1. Define the model model = nn.Sequential( nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2), ) # 2. Choose loss and optimizer criterion = nn.CrossEntropyLoss() optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4) # 3. Training loop for epoch in range(100): logits = model(X_train) # forward loss = criterion(logits, y_train) # loss optimizer.zero_grad() # clear old gradients loss.backward() # backprop optimizer.step() # update weights
Five lines in the loop. That's the entire algorithm. optimizer.zero_grad() is needed because PyTorch accumulates gradients by default (useful for gradient accumulation with large batches). weight_decay=1e-4 adds L2 regularization directly into the optimizer.
When training fails, check these in order:
The most common bug: forgetting optimizer.zero_grad(). Without it, gradients accumulate across steps, making the effective gradient grow larger and larger until the network diverges. Always zero gradients before backward().
Good practitioners monitor more than just the loss. Here's what to track:
| Metric | What It Tells You | Warning Signs |
|---|---|---|
| Training loss | How well the model fits training data | Not decreasing: broken; NaN: exploding |
| Validation loss | How well the model generalizes | Rising while train loss falls: overfitting |
| Gradient norm | Magnitude of the gradient vector | >100: exploding; <1e-7: vanishing |
| Weight norm per layer | How large the parameters are | Growing without bound: no regularization |
| Activation histogram | Distribution of neuron outputs | All zeros: dead ReLU; all saturated: sigmoid saturation |
python # Monitor gradient norms during training for name, p in model.named_parameters(): if p.grad is not None: grad_norm = p.grad.norm().item() print(f"{name}: grad_norm={grad_norm:.4f}")
| Init Method | Formula | Best For |
|---|---|---|
| Xavier | Var = 1/nin | Sigmoid, Tanh |
| He | Var = 2/nin | ReLU and variants |
Three mini-simulations. Use the tabs to switch between them. Init: activation magnitudes across layers for different initialization strategies. Regularization: L2 penalty controls overfitting. Learning rate: too high diverges, too low crawls.
Backpropagation is the engine. Now we need the car — architectures that use this engine to solve real problems. The feedforward network we built here is the simplest architecture. It processes each input independently, with no memory of previous inputs. For language, images, and sequences, we need more.
In the next lecture (L04), we'll add memory: recurrent neural networks read sequences word by word, carrying a hidden state that encodes everything seen so far. Backpropagation through time (BPTT) extends what we learned here to the time dimension — but introduces the vanishing gradient problem at a new scale.
The simple update rule w ← w − η · dL/dw works but has problems. The loss landscape has ravines where the gradient oscillates across the narrow dimension and crawls along the long dimension. Modern optimizers fix this.
| Optimizer | Key Idea | When to Use |
|---|---|---|
| SGD | Update using gradient of a mini-batch (not full dataset) | Simple baselines, convex problems |
| SGD + Momentum | Accumulate a velocity vector. Like a ball rolling downhill — it builds speed in consistent directions and dampens oscillations. | CNNs, when you have time to tune |
| Adam | Adaptive learning rate per parameter. Combines momentum with RMSProp (scale by running average of squared gradients). | Default choice for most tasks. Transformers, NLP, generative models. |
Why Adam dominates. SGD uses one learning rate for all parameters. But different parameters need different rates: frequent gradients (for common words in NLP) should use small steps; infrequent gradients (for rare words) should use large steps. Adam tracks the first moment (mean) and second moment (variance) of each gradient and adapts the step size accordingly. This per-parameter adaptivity is why Adam converges faster with less tuning.
Typical hyperparameters: β1 = 0.9, β2 = 0.999, ε = 10−8, η = 0.001. In practice, you almost never need to tune β1 and β2. The learning rate η is the main knob.
AdamW (Loshchilov and Hutter, 2019) is a corrected version of Adam that decouples weight decay from the adaptive learning rate. In original Adam, L2 regularization interacts with the per-parameter scaling, making it less effective. AdamW applies weight decay directly to the weights, independent of the gradient scaling. This is now the default optimizer for training Transformers, including all GPT models.
python # AdamW is the standard for modern NLP models optimizer = torch.optim.AdamW( model.parameters(), lr=3e-4, # typical for Transformers weight_decay=0.01, # decoupled L2 regularization betas=(0.9, 0.999), # momentum and RMSProp decay )
These three papers provide the theoretical foundation for everything in this lesson:
Learning Representations by Backpropagating Errors (Rumelhart, Hinton, Williams, 1986) — The paper that introduced backpropagation to neural networks. Showed that multi-layer networks can learn useful internal representations by propagating error signals backward. The algorithm that made deep learning possible.
Yes You Should Understand Backprop (Karpathy, 2016) — A modern argument for why understanding backprop matters, even in the age of autograd. Practical examples of how backprop intuition helps debug training failures.
NLP (Almost) from Scratch (Collobert & Weston, 2011) — Showed that a simple neural network trained end-to-end with backprop can match or beat hand-engineered NLP features. The first convincing demonstration that deep learning could replace feature engineering in NLP.
This lesson focused on the core algorithm. Several important topics were deliberately left out for future lessons:
| Topic | Why It Matters | Where It's Covered |
|---|---|---|
| Batch normalization | Normalizes activations within layers, stabilizes training | Deep learning fundamentals |
| Residual connections | Enable training of very deep networks (100+ layers) | Transformer architecture |
| Attention | Allows the network to focus on relevant parts of the input | CS224N L05 |
| Automatic differentiation | How PyTorch actually implements backprop under the hood | Framework internals |
← L02: Word Vectors — How words become numbers that neural networks can process.
L04: Language Models & RNNs → — Adding memory to neural networks for sequential data. (Coming soon)
Before this lesson, a neural network was a black box. Now you can see inside. You know that it's just matrix multiplications and nonlinearities. You know that the loss function quantifies "how wrong." You know that backpropagation traces blame backward through the computation to figure out which weights to change. And you know that gradient descent makes the changes.
This combination — differentiable forward pass + backprop + gradient descent — is why deep learning works. It's not that neural networks are inherently intelligent. It's that we have an efficient algorithm (backprop) for finding good weights in an astronomically large search space. The algorithm turns learning from an intractable search problem into a tractable optimization problem.
| Year | Milestone | Significance |
|---|---|---|
| 1958 | Perceptron (Rosenblatt) | First trainable neural network (1 layer) |
| 1969 | Minsky & Papert critique | Proved perceptron can't learn XOR, killed funding for 15 years |
| 1986 | Backpropagation (Rumelhart et al.) | Made multi-layer networks trainable |
| 2006 | Deep Belief Networks (Hinton) | Reignited interest in deep networks |
| 2012 | AlexNet (Krizhevsky et al.) | Deep CNN + ReLU + GPU won ImageNet, launched the deep learning era |
| 2017 | Transformer (Vaswani et al.) | Attention replaces recurrence, enables GPT and BERT |
| 2020+ | Scaling laws era | More data + more parameters + more compute = better performance, predictably |
The feedforward network you learned in this lesson is the ancestor of all of these. The Transformer is essentially a feedforward network with a special layer (self-attention) inserted between the standard linear + activation layers. The training algorithm is still backpropagation. The optimizer is usually Adam. The loss is cross-entropy. Everything in this lesson scales up.
As you continue through CS224N, every new architecture — RNNs, LSTMs, Transformers, BERT, GPT — will use the same training machinery you learned here. The differences are all in the forward pass: how input is transformed into output. The backward pass (backpropagation) and the update rule (gradient descent) remain identical.
Master this lesson and you've mastered the engine. The rest is bodywork.