How does a neural network actually learn? By following the slope downhill.
You've built a classifier. It has weights. It makes predictions. But right now those weights are random, and the predictions are garbage. How do you find good weights?
Your loss function tells you how badly the current weights perform — a single number. High loss means bad predictions. Low loss means good predictions. The question becomes: how do you change the weights to make the loss go down?
A 2D slice of a loss landscape. Bright = high loss. Dark = low loss. The white dot is your current position in weight-space. Click to place it, then hit "Step Downhill" to watch it roll toward the minimum.
Many machine learning algorithms, including linear classifiers, SVMs, and neural networks, share the same structure: define a loss function, then minimize it by adjusting weights. The loss function is fixed (you chose it). The optimization strategy is how you navigate the landscape to find the bottom.
The naive approach? Try random weights, keep whichever set gives the lowest loss. This is like exploring a mountain range by parachuting to random spots and hoping to land in the deepest valley. It works terribly. We need something smarter: a way to figure out which direction is downhill from wherever we currently stand.
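To make the random-search baseline concrete, here is a minimal sketch. The toy loss `toy_loss` (sum of squared weights, minimum of 0 at the origin) and the helper name `random_search` are assumptions chosen for illustration, not part of any real classifier.

```python
import numpy as np

def toy_loss(w):
    # Assumed toy loss: sum of squares, smallest (0) when every weight is 0.
    return float(np.sum(w ** 2))

def random_search(loss_fn, dim, num_tries=1000, seed=0):
    """Try random weight vectors and keep the one with the lowest loss."""
    rng = np.random.default_rng(seed)
    best_w, best_loss = None, float("inf")
    for _ in range(num_tries):
        w = rng.normal(size=dim)      # parachute to a random spot
        loss = loss_fn(w)
        if loss < best_loss:          # remember the deepest valley so far
            best_w, best_loss = w, loss
    return best_w, best_loss

best_w, best_loss = random_search(toy_loss, dim=10)
print("best loss found by random search:", best_loss)
```

Even with 1,000 tries in only 10 dimensions, the best random guess typically stays well above the true minimum of 0, and the gap tends to grow with the number of weights.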
Standing on our loss landscape, we need to find the downhill direction. The tool for this is the gradient — a vector of partial derivatives, one for each weight. The gradient points in the direction of steepest ascent. So we walk in the opposite direction.
But how do we actually compute a gradient? The simplest way is the numerical gradient: wiggle each weight by a tiny amount h, measure how the loss changes, and compute the slope.
dL/dw ≈ (L(w + h) − L(w − h)) / (2h)
This is the centered difference formula. You nudge the weight up by h and compute the loss, then nudge it down by h and compute the loss again. The difference divided by 2h is the approximate slope. Do this for every weight and you have the full gradient vector.
A simple loss curve L(w) = w². Drag the slider to move w. The orange tangent line shows the gradient (slope) at that point. Positive slope → move left. Negative slope → move right.
The numerical gradient is simple but painfully slow. If you have a million weights, you need two million forward passes (two per weight with the centered difference) just to compute one gradient. Each wiggle requires re-evaluating the entire loss function. For a neural network with millions of parameters, this is completely impractical.
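```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Centered-difference estimate of the gradient of f at w (1-D float array)."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        old = w[i]
        w[i] = old + h          # nudge weight i up
        loss_plus = f(w)
        w[i] = old - h          # nudge weight i down
        loss_minus = f(w)
        grad[i] = (loss_plus - loss_minus) / (2 * h)
        w[i] = old              # restore the original value
    return grad
```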
The numerical gradient wiggles each weight and measures what happens. The analytic gradient uses calculus to derive an exact formula for the gradient. No wiggling, no approximation — just math.
For a simple loss L(w) = w², calculus gives us dL/dw = 2w. That's it. One formula, one evaluation, exact answer. Compare this to the numerical approach: two function evaluations per weight, and still just an approximation.
For L(w) = w², the analytic gradient is 2w (orange line). The numerical gradient (teal dots, computed with finite differences) closely matches. In general, a smaller h gives a better approximation, until h gets so small that floating-point round-off takes over.
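How the step size h affects the approximation can be sketched in a few lines. This example assumes a cubic toy loss L(w) = w³ rather than the w² curve from the text, because for a pure quadratic the centered difference happens to be exact for any h and would show no error at all. The function names are illustrative.

```python
def loss(w):
    return w ** 3              # assumed toy loss, chosen so the error is visible

def analytic_grad(w):
    return 3 * w ** 2          # exact derivative from calculus

def centered_diff(f, w, h):
    # Slope estimate from one nudge up and one nudge down.
    return (f(w + h) - f(w - h)) / (2 * h)

w = 1.0
errors = {}
for h in (1e-1, 1e-2, 1e-3):
    errors[h] = abs(centered_diff(loss, w, h) - analytic_grad(w))
    print(f"h={h:g}  absolute error={errors[h]:.2e}")
```

In this range the error shrinks as h shrinks. Pushing h far smaller would eventually make round-off in the subtraction dominate, which is why values around 1e-5 are a common compromise.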
| Property | Numerical Gradient | Analytic Gradient |
|---|---|---|
| Speed | 2N loss evaluations for N weights (centered difference) | One forward and one backward pass with backprop |
| Accuracy | Approximate (depends on h) | Exact (up to floating-point precision) |
| Implementation | Simple loop | Requires derivation |
| Use case | Debugging / gradient checking | Actual training |
In practice, you derive the analytic gradient, implement it, and then verify it against the numerical gradient on a few random inputs. If they match (within floating-point tolerance), you can trust the analytic version and use it for fast training.
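A gradient check along these lines might look like the sketch below. The least-squares loss, the random data, and the 1e-6 threshold are all assumptions picked for illustration. The comparison uses a relative error so that the check does not depend on the scale of the gradient.

```python
import numpy as np

def loss_fn(w, X, y):
    # Assumed example loss: mean squared error of a linear model.
    return np.mean((X @ w - y) ** 2)

def analytic_grad(w, X, y):
    # Hand-derived gradient of the mean squared error.
    return 2.0 * (X.T @ (X @ w - y)) / len(y)

def numerical_grad(f, w, h=1e-5):
    # Centered difference, one weight at a time.
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
w = rng.normal(size=4)

g_analytic = analytic_grad(w, X, y)
g_numeric = numerical_grad(lambda v: loss_fn(v, X, y), w)

# Relative error: difference in the two gradients, scaled by their size.
rel_error = np.linalg.norm(g_analytic - g_numeric) / (
    np.linalg.norm(g_analytic) + np.linalg.norm(g_numeric)
)
print("relative error:", rel_error)
print("gradient check passed:", rel_error < 1e-6)
```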
Now we have the gradient — it tells us the uphill direction. To go downhill, we walk in the opposite direction. This gives us the simplest optimization algorithm: gradient descent.
w ← w − η · ∇L(w)
That's the entire algorithm. η (eta) is the learning rate: how big a step we take. ∇L(w) is the gradient of the loss with respect to the weights. We subtract because the gradient points uphill and we want to go downhill.
Gradient descent on L(w) = w². Adjust the learning rate and watch the trajectory. Too high = overshooting. Too low = barely moving. Just right = smooth convergence.
In code, gradient descent is just a loop:
```python
# Vanilla gradient descent
while True:
    grad = compute_gradient(loss_fn, weights, data)
    weights -= learning_rate * grad  # the update step
```
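For a version that actually runs, here is a minimal sketch on the L(w) = w² curve, where the gradient is 2w. The starting point, step count, and the three learning rates are assumptions chosen to show the three regimes: converging, crawling, and overshooting.

```python
def loss(w):
    return w ** 2

def grad(w):
    return 2 * w               # analytic gradient of w squared

def gradient_descent(w0, learning_rate, num_steps=50):
    w = w0
    for _ in range(num_steps):
        w = w - learning_rate * grad(w)   # step against the gradient
    return w

w_good = gradient_descent(5.0, learning_rate=0.1)    # shrinks toward 0
w_slow = gradient_descent(5.0, learning_rate=0.001)  # barely moves
w_bad = gradient_descent(5.0, learning_rate=1.1)     # overshoots and diverges

for name, w in [("0.1", w_good), ("0.001", w_slow), ("1.1", w_bad)]:
    print(f"learning rate {name}: final w = {w:.4g}, loss = {loss(w):.4g}")
```

On this curve each update multiplies w by (1 − 2η), so any learning rate above 1.0 makes the magnitude of w grow at every step instead of shrinking.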
Each iteration, we compute the gradient over the entire training set, then update. This is called batch gradient descent — "batch" because we use the full batch of data. It gives the cleanest gradient estimate, but it's expensive: every single update requires a pass over all training examples.
Batch gradient descent computes the gradient over the entire training set. With ImageNet (1.2 million images), that means processing every single image before updating the weights once. That's absurdly slow.
The insight: you don't need the exact gradient. An approximation is good enough, as long as it points roughly downhill. Stochastic Gradient Descent (SGD) estimates the gradient from a single random example. Mini-batch SGD uses a small random subset (typically 32-256 examples) — the sweet spot between noise and speed.
∇L(w) ≈ (1/B) · Σᵢ ∇Lᵢ(w), summed over the B examples in the mini-batch
Where B is the batch size and Lᵢ is the loss on example i. The mini-batch gradient is a noisy estimate of the true gradient. But here's the beautiful part: that noise often helps. It is widely thought to act like a regularizer, nudging the optimizer away from sharp, narrow minima that tend to generalize poorly.
Three gradient descent trajectories on the same loss surface. Teal = full batch (smooth but slow). Orange = mini-batch (noisy but fast). Purple = SGD (very noisy, single example). All reach roughly the same region.
```python
# Mini-batch SGD
for epoch in range(num_epochs):
    for batch in get_mini_batches(data, batch_size=32):
        grad = compute_gradient(loss_fn, weights, batch)
        weights -= learning_rate * grad
```
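The loop above leaves `compute_gradient` and `get_mini_batches` undefined. One way to fill them in, assuming a small synthetic linear-regression problem with a mean-squared-error loss, is sketched below. The data, `true_w`, the learning rate, and the helper signatures are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])               # weights we hope to recover
X = rng.normal(size=(200, 3))
y = X @ true_w + 0.1 * rng.normal(size=200)       # targets with a little noise

def compute_gradient(w, X_batch, y_batch):
    # Gradient of the mean squared error over this batch only.
    residual = X_batch @ w - y_batch
    return 2.0 * (X_batch.T @ residual) / len(y_batch)

def get_mini_batches(X, y, batch_size, rng):
    # Shuffle once per epoch, then yield consecutive slices.
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

weights = np.zeros(3)
learning_rate = 0.05
for epoch in range(50):
    for X_batch, y_batch in get_mini_batches(X, y, 32, rng):
        weights -= learning_rate * compute_gradient(weights, X_batch, y_batch)

print("learned weights:", weights)
print("true weights:   ", true_w)
```

Each update sees at most 32 of the 200 examples, yet the weights should still settle close to `true_w`, because the noisy steps point downhill on average.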
Mini-batch SGD is the workhorse of deep learning. Virtually every neural network you've heard of (GPT, ResNet, DALL-E) was trained with some variant of it, whether plain SGD, SGD with momentum, or an adaptive method such as Adam.
We know what we need: the gradient of the loss with respect to every weight. And we know analytic gradients are the way to get them. But for a deep neural network with millions of weights arranged in dozens of layers — how do we actually compute all those derivatives efficiently?
The answer is backpropagation. It's not a separate algorithm — it's just the chain rule from calculus, applied systematically to a computation graph.
Consider a simple example: f(x, y, z) = (x + y) · z. We can break this into two operations: q = x + y (the add gate) and f = q · z (the multiply gate).
The chain rule says: ∂f/∂x = (∂f/∂q) · (∂q/∂x). We compute ∂f/∂q at the multiply gate, and ∂q/∂x at the add gate, then multiply them together. That's backpropagation — each gate computes its local gradient, and gradients multiply as they flow backward.
A simple computation graph: f = (x + y) · z. Green values flow forward. Orange gradients flow backward via the chain rule.
The key insight: each node only needs to know its local gradient (the derivative of its output with respect to its inputs) and the upstream gradient (the gradient flowing in from above). Multiply them together, and pass the result backward. No node needs to understand the rest of the graph.
Backpropagation works because each gate only needs to know one thing: given its inputs, what is the derivative of its output with respect to each input? This is the local gradient.
Let's work through the concrete example from Chapter 5. Set x = −2, y = 5, z = −4.
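Forward pass:
q = x + y = −2 + 5 = 3
f = q · z = 3 · (−4) = −12
Backward pass (starting from ∂f/∂f = 1):
At the multiply gate (f = q · z):
∂f/∂q = z = −4, ∂f/∂z = q = 3
At the add gate (q = x + y):
∂q/∂x = 1, ∂q/∂y = 1
Chain rule gives the final gradients:
∂f/∂x = (∂f/∂q) · (∂q/∂x) = −4 · 1 = −4
∂f/∂y = (∂f/∂q) · (∂q/∂y) = −4 · 1 = −4
∂f/∂z = q = 3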
Adjust x, y, z and watch the forward values and backward gradients update in real time. Each gate computes only its local derivative.
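The walkthrough for x = −2, y = 5, z = −4 can be sketched as straight-line Python. The variable names (`df_dq` and so on) are just a naming convention assumed here; each backward line is one local gradient multiplied by the upstream gradient.

```python
# Inputs from the worked example.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: evaluate the two gates in order.
q = x + y            # add gate
f = q * z            # multiply gate

# Backward pass: start from df/df = 1 and walk the graph in reverse.
df_df = 1.0
df_dq = z * df_df    # multiply gate: local gradient w.r.t. q is z
df_dz = q * df_df    # multiply gate: local gradient w.r.t. z is q
df_dx = 1.0 * df_dq  # add gate: local gradient is 1
df_dy = 1.0 * df_dq  # add gate: local gradient is 1

print("q =", q, " f =", f)
print("df/dx =", df_dx, " df/dy =", df_dy, " df/dz =", df_dz)
```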
Now let's put it all together with a richer computation graph. Below is an interactive circuit that computes a more complex function. You set the inputs, watch values flow forward through the gates, then see gradients flow backward — gate by gate — via the chain rule.
Circuit: f = (x + y) · z + max(x, w). Set inputs, run forward pass, then backward pass. Watch gradients propagate through each gate.
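The same circuit can be sketched in code. The input values and the function name `circuit` are assumptions, and ties in the max gate are arbitrarily sent to x. The point worth noticing is that x feeds two gates, so its gradient is the sum of what flows back along both paths.

```python
def circuit(x, y, z, w):
    """Forward and backward pass for f = (x + y) * z + max(x, w)."""
    # Forward pass.
    q = x + y
    p = q * z
    m = max(x, w)
    f = p + m

    # Backward pass, starting from df/df = 1.
    df_dp = 1.0              # final add gate copies the gradient to p ...
    df_dm = 1.0              # ... and to m
    df_dq = z * df_dp        # multiply gate
    df_dz = q * df_dp
    df_dx = 1.0 * df_dq      # first add gate, path 1 into x
    df_dy = 1.0 * df_dq
    if x >= w:               # max gate routes the gradient to the winner
        df_dx += df_dm       # path 2 into x: contributions add up
        df_dw = 0.0
    else:
        df_dw = df_dm
    return f, {"x": df_dx, "y": df_dy, "z": df_dz, "w": df_dw}

f1, grads1 = circuit(3.0, -1.0, 2.0, 1.0)   # x wins the max
f2, grads2 = circuit(1.0, -1.0, 2.0, 3.0)   # w wins the max
print(f1, grads1)
print(f2, grads2)
```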
Try these experiments:
Once you've seen backpropagation through a few gates, you notice patterns. Each common gate has a characteristic way it handles gradients. Memorizing these patterns lets you read computation graphs like a circuit diagram.
The add gate: gradient distributor. For f = x + y, we have ∂f/∂x = 1 and ∂f/∂y = 1. The upstream gradient is simply copied to both inputs, unchanged. The add gate distributes the gradient equally.
The multiply gate: gradient switcher. For f = x · y, we have ∂f/∂x = y and ∂f/∂y = x. Each input's gradient is the upstream gradient scaled by the other input's value, so the inputs swap roles. If one input is large, the other's gradient is large; if it is tiny, the other's gradient is tiny. This is one reason long chains of multiplications can make gradients explode or vanish.
The max gate: gradient router. For f = max(x, y), the gradient flows entirely to whichever input was larger. The other input gets zero gradient. It's like a switch — the max gate routes the gradient to the winner and blocks the loser.
Three gates, each with upstream gradient = 1.0. Watch how each gate handles gradient distribution differently. Adjust inputs to see the pattern.
| Gate | Forward | Local Gradient | Role |
|---|---|---|---|
| Add | x + y | ∂f/∂x = 1, ∂f/∂y = 1 | Distributor — copies gradient to all inputs |
| Multiply | x · y | ∂f/∂x = y, ∂f/∂y = x | Switcher — scales gradient by the other input |
| Max | max(x, y) | 1 for winner, 0 for loser | Router — sends gradient only to the larger input |
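The three patterns in the table can be written as tiny backward functions. This is a sketch with assumed function names; each takes the upstream gradient and returns the gradients for its two inputs. Breaking ties in the max gate toward the first input is an arbitrary choice made here.

```python
def add_backward(upstream):
    # Distributor: both inputs receive the upstream gradient unchanged.
    return upstream, upstream

def multiply_backward(x, y, upstream):
    # Switcher: each input's gradient is scaled by the other input.
    return y * upstream, x * upstream

def max_backward(x, y, upstream):
    # Router: the larger input gets everything, the other gets zero.
    if x >= y:
        return upstream, 0.0
    return 0.0, upstream

print("add:     ", add_backward(2.0))
print("multiply:", multiply_backward(3.0, -4.0, 2.0))
print("max:     ", max_backward(1.0, 5.0, 2.0))
```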
You now understand the complete optimization pipeline: define a loss, compute its gradient with respect to the weights using backpropagation, and update the weights in the opposite direction of the gradient. Every neural network, from a single-layer perceptron to GPT-4, learns this way.
But the story doesn't end with vanilla SGD. Modern deep learning uses more sophisticated optimizers, some of which add momentum and some of which adapt the learning rate for each parameter individually:
| Concept | What We Learned | What Comes Next |
|---|---|---|
| Gradient | Direction of steepest ascent | Second-order methods (curvature) |
| Learning rate | Fixed step size | Adaptive rates (Adam, AdaGrad) |
| SGD | Mini-batch gradient estimation | Momentum, Nesterov, Adam |
| Backprop | Chain rule on a computation graph | Autograd (automatic differentiation) |
| Gate patterns | Add, multiply, max | Sigmoid, softmax, attention gates |
The key lessons from this chapter:
"What I cannot create, I do not understand." — Richard Feynman