Deep Learning Foundations

Optimization & Backpropagation
From Zero

How does a neural network actually learn? By following the slope downhill.

Prerequisites: Basic calculus (derivatives) + Linear classification. That's it.
10
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: Why Optimization?

You've built a classifier. It has weights. It makes predictions. But right now those weights are random, and the predictions are garbage. How do you find good weights?

Your loss function tells you how badly the current weights perform — a single number. High loss means bad predictions. Low loss means good predictions. The question becomes: how do you change the weights to make the loss go down?

The core idea: Imagine the loss function as a landscape — hills and valleys across weight-space. You're standing somewhere on that landscape. You want to walk downhill to the lowest valley. Optimization is the art of finding the downhill direction and taking a step.
The Loss Landscape

A 2D slice of a loss landscape. Bright = high loss. Dark = low loss. The white dot is your current position in weight-space. Click to place it, then hit "Step Downhill" to watch it roll toward the minimum.

Every machine learning algorithm — linear classifiers, SVMs, neural networks — has this same structure: define a loss function, then minimize it by adjusting weights. The loss function is fixed (you chose it). The optimization strategy is how you navigate the landscape to find the bottom.

The naive approach? Try random weights, keep whichever set gives the lowest loss. This is like exploring a mountain range by parachuting to random spots and hoping to land in the deepest valley. It works terribly. We need something smarter: a way to figure out which direction is downhill from wherever we currently stand.

What does the loss function tell us?

Chapter 1: Gradients — The Direction of Steepest Ascent

Standing on our loss landscape, we need to find the downhill direction. The tool for this is the gradient — a vector of partial derivatives, one for each weight. The gradient points in the direction of steepest ascent. So we walk in the opposite direction.

But how do we actually compute a gradient? The simplest way is the numerical gradient: wiggle each weight by a tiny amount h, measure how the loss changes, and compute the slope.

∂L / ∂wi ≈ ( L(wi + h) − L(wi − h) ) / 2h

This is the centered difference formula. You nudge the weight up by h, compute the loss. Nudge it down by h, compute the loss. The difference divided by 2h is the approximate slope. Do this for every weight and you have the full gradient vector.

Think of it this way: You're blindfolded on a hillside. To figure out which way is downhill, you take a tiny step forward and feel if you went up or down. Then step backward and feel again. The direction that goes most steeply down is the negative gradient.
Numerical Gradient in 1D

A simple loss curve L(w) = w². Drag the slider to move w. The orange tangent line shows the gradient (slope) at that point. Positive slope → move left. Negative slope → move right.

w2.0

The numerical gradient is simple but painfully slow. If you have a million weights, you need a million forward passes just to compute one gradient. Each wiggle requires re-evaluating the entire loss function. For a neural network with millions of parameters, this is completely impractical.

python
def numerical_gradient(f, w, h=1e-5):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        old = w[i]
        w[i] = old + h
        loss_plus = f(w)
        w[i] = old - h
        loss_minus = f(w)
        grad[i] = (loss_plus - loss_minus) / (2 * h)
        w[i] = old  # restore
    return grad
Why is the numerical gradient impractical for neural networks?

Chapter 2: Analytic Gradients — Calculus to the Rescue

The numerical gradient wiggles each weight and measures what happens. The analytic gradient uses calculus to derive an exact formula for the gradient. No wiggling, no approximation — just math.

For a simple loss L(w) = w², calculus gives us dL/dw = 2w. That's it. One formula, one evaluation, exact answer. Compare this to the numerical approach: two function evaluations per weight, and still just an approximation.

The tradeoff: Numerical gradients are easy to code but slow and approximate. Analytic gradients are exact and fast, but require you to derive the gradient formula by hand (or have a system do it for you). In practice, we always use analytic gradients for training. We use numerical gradients to check that our analytic gradient is correct — a technique called gradient checking.
Analytic vs Numerical Gradient

For L(w) = w², the analytic gradient is 2w (orange line). The numerical gradient (teal dots, computed with finite differences) closely matches. Smaller h = better approximation.

h (step size)0.01
PropertyNumerical GradientAnalytic Gradient
SpeedO(N) forward passes per N weightsO(1) with backprop
AccuracyApproximate (depends on h)Exact
ImplementationSimple loopRequires derivation
Use caseDebugging / gradient checkingActual training

In practice, you derive the analytic gradient, implement it, and then verify it against the numerical gradient on a few random inputs. If they match (within floating-point tolerance), you can trust the analytic version and use it for fast training.

What is gradient checking?

Chapter 3: Gradient Descent — The Update Rule

Now we have the gradient — it tells us the uphill direction. To go downhill, we walk in the opposite direction. This gives us the simplest optimization algorithm: gradient descent.

w ← w − η · ∇L(w)

That's the entire algorithm. η (eta) is the learning rate — how big a step we take. ∇L(w) is the gradient of the loss with respect to the weights. We subtract because the gradient points uphill and we want to go downhill.

The learning rate is the most important hyperparameter in deep learning. Too large, and you overshoot the minimum — the loss explodes. Too small, and training takes forever, possibly getting stuck in a shallow local minimum. Finding the right learning rate is an art.
Learning Rate: Too High vs Too Low

Gradient descent on L(w) = w². Adjust the learning rate and watch the trajectory. Too high = overshooting. Too low = barely moving. Just right = smooth convergence.

η0.10

In code, gradient descent is just a loop:

python
# Vanilla gradient descent
while True:
    grad = compute_gradient(loss_fn, weights, data)
    weights -= learning_rate * grad  # the update step

Each iteration, we compute the gradient over the entire training set, then update. This is called batch gradient descent — "batch" because we use the full batch of data. It gives the cleanest gradient estimate, but it's expensive: every single update requires a pass over all training examples.

What happens if the learning rate is too large?

Chapter 4: Mini-Batch SGD — Approximate Gradients, Real Speed

Batch gradient descent computes the gradient over the entire training set. With ImageNet (1.2 million images), that means processing every single image before updating the weights once. That's absurdly slow.

The insight: you don't need the exact gradient. An approximation is good enough, as long as it points roughly downhill. Stochastic Gradient Descent (SGD) estimates the gradient from a single random example. Mini-batch SGD uses a small random subset (typically 32-256 examples) — the sweet spot between noise and speed.

∇L ≈ (1/B) ∑i=1B ∇Li

Where B is the batch size. The mini-batch gradient is a noisy estimate of the true gradient. But here's the beautiful part: that noise actually helps. It acts like a regularizer, preventing the optimizer from settling into sharp, narrow minima that generalize poorly.

Why noise helps: Imagine two valleys — one narrow and deep, one wide and shallow. Both have low loss on training data. But the wide valley is more likely to generalize to new data (small changes in weights don't change the loss much). SGD's noise makes it hard to stay in the narrow valley — random fluctuations bounce it out. The wide valley is more stable under noise, so SGD naturally prefers it.
Batch vs Mini-Batch vs Stochastic

Three gradient descent trajectories on the same loss surface. Teal = full batch (smooth but slow). Orange = mini-batch (noisy but fast). Purple = SGD (very noisy, single example). All reach roughly the same region.

Batch Size16
python
# Mini-batch SGD
for epoch in range(num_epochs):
    for batch in get_mini_batches(data, batch_size=32):
        grad = compute_gradient(loss_fn, weights, batch)
        weights -= learning_rate * grad

Mini-batch SGD is the workhorse of deep learning. Virtually every neural network you've heard of — GPT, ResNet, DALL-E — was trained with some variant of SGD.

Why does mini-batch SGD use a subset of data instead of the full training set?

Chapter 5: Backpropagation — Chain Rule Through a Graph

We know what we need: the gradient of the loss with respect to every weight. And we know analytic gradients are the way to get them. But for a deep neural network with millions of weights arranged in dozens of layers — how do we actually compute all those derivatives efficiently?

The answer is backpropagation. It's not a separate algorithm — it's just the chain rule from calculus, applied systematically to a computation graph.

A computation graph is a directed graph where each node is an operation (add, multiply, max, etc.) and edges carry values. The forward pass flows inputs through the graph to produce the loss. The backward pass flows gradients back through the graph to produce derivatives with respect to every input.

Consider a simple example: f(x, y, z) = (x + y) · z. We can break this into two operations:

q = x + y
Add gate: combines x and y
f = q · z
Multiply gate: combines q and z

The chain rule says: ∂f/∂x = (∂f/∂q) · (∂q/∂x). We compute ∂f/∂q at the multiply gate, and ∂q/∂x at the add gate, then multiply them together. That's backpropagation — each gate computes its local gradient, and gradients multiply as they flow backward.

Forward and Backward Pass

A simple computation graph: f = (x + y) · z. Green values flow forward. Orange gradients flow backward via the chain rule.

The key insight: each node only needs to know its local gradient (the derivative of its output with respect to its inputs) and the upstream gradient (the gradient flowing in from above). Multiply them together, and pass the result backward. No node needs to understand the rest of the graph.

What does each node in a computation graph compute during the backward pass?

Chapter 6: Local Gradients — Each Gate Knows Its Own Derivative

Backpropagation works because each gate only needs to know one thing: given its inputs, what is the derivative of its output with respect to each input? This is the local gradient.

Let's work through the concrete example from Chapter 5. Set x = −2, y = 5, z = −4.

Forward pass:

q = x + y = −2 + 5 = 3
f = q · z = 3 · (−4) = −12

Backward pass (starting from ∂f/∂f = 1):

At the multiply gate (f = q · z):

∂f/∂q = z = −4    ∂f/∂z = q = 3

At the add gate (q = x + y):

∂q/∂x = 1    ∂q/∂y = 1

Chain rule gives the final gradients:

∂f/∂x = (∂f/∂q) · (∂q/∂x) = −4 · 1 = −4
∂f/∂y = (∂f/∂q) · (∂q/∂y) = −4 · 1 = −4
∂f/∂z = 3
Sanity check: ∂f/∂x = −4 means that if we increase x by a tiny amount h, f decreases by 4h. Let's verify: f(−2 + 0.01, 5, −4) = (−1.99 + 5)(−4) = 3.01 × −4 = −12.04. The change is −0.04 = −4 × 0.01. It checks out.
Local Gradients at Each Gate

Adjust x, y, z and watch the forward values and backward gradients update in real time. Each gate computes only its local derivative.

x-2.0
y5.0
z-4.0
For the multiply gate f = a · b, what is the local gradient ∂f/∂a?

Chapter 7: Backprop Through a Circuit — The Showcase

Now let's put it all together with a richer computation graph. Below is an interactive circuit that computes a more complex function. You set the inputs, watch values flow forward through the gates, then see gradients flow backward — gate by gate — via the chain rule.

What to watch for: During the forward pass, values flow left-to-right, each gate combining its inputs. During the backward pass, gradients flow right-to-left. At each gate, the incoming gradient is multiplied by the local gradient. By the time gradients reach the inputs, they encode how much each input affects the final output.
Interactive Computation Graph

Circuit: f = (x + y) · z + max(x, w). Set inputs, run forward pass, then backward pass. Watch gradients propagate through each gate.

x2.0
y1.0
z-3.0
w4.0
Set inputs and click Forward Pass

Try these experiments:

Chapter 8: Gate Patterns — Add, Multiply, Max

Once you've seen backpropagation through a few gates, you notice patterns. Each common gate has a characteristic way it handles gradients. Memorizing these patterns lets you read computation graphs like a circuit diagram.

The add gate: gradient distributor. For f = x + y, we have ∂f/∂x = 1 and ∂f/∂y = 1. The upstream gradient is simply copied to both inputs, unchanged. The add gate distributes the gradient equally.

The multiply gate: gradient switcher. For f = x · y, we have ∂f/∂x = y and ∂f/∂y = x. The gradient for each input is scaled by the other input's value. If one input is large, the other's gradient is large. This is why multiplications can cause gradient explosion or vanishing.

The max gate: gradient router. For f = max(x, y), the gradient flows entirely to whichever input was larger. The other input gets zero gradient. It's like a switch — the max gate routes the gradient to the winner and blocks the loser.

Why this matters for neural networks: A ReLU activation is f = max(0, x). When x > 0, the gradient flows through. When x ≤ 0, the gradient is zero — the neuron is "dead" for this input. This is the dying ReLU problem, and it comes directly from the max gate pattern.
Gate Patterns

Three gates, each with upstream gradient = 1.0. Watch how each gate handles gradient distribution differently. Adjust inputs to see the pattern.

x3.0
y2.0
GateForwardLocal GradientRole
Addx + y∂f/∂x = 1, ∂f/∂y = 1Distributor — copies gradient to all inputs
Multiplyx · y∂f/∂x = y, ∂f/∂y = xSwitcher — scales gradient by the other input
Maxmax(x, y)1 for winner, 0 for loserRouter — sends gradient only to the larger input
If f = x · y and the upstream gradient is 2.0, what gradient does x receive when y = 5?

Chapter 9: Beyond — From Gradients to Neural Networks

You now understand the complete optimization pipeline: define a loss, compute its gradient with respect to the weights using backpropagation, and update the weights in the opposite direction of the gradient. Every neural network, from a single-layer perceptron to GPT-4, learns this way.

But the story doesn't end with vanilla SGD. Modern deep learning uses sophisticated optimizers that adapt the learning rate for each parameter individually:

What's next: SGD with Momentum adds velocity — the update accumulates past gradients like a ball rolling downhill, carrying momentum through flat regions. Adam adapts the learning rate per-parameter using running averages of the gradient and its square. These are the optimizers that power modern deep learning.
ConceptWhat We LearnedWhat Comes Next
GradientDirection of steepest ascentSecond-order methods (curvature)
Learning rateFixed step sizeAdaptive rates (Adam, AdaGrad)
SGDMini-batch gradient estimationMomentum, Nesterov, Adam
BackpropChain rule on a computation graphAutograd (automatic differentiation)
Gate patternsAdd, multiply, maxSigmoid, softmax, attention gates

The key lessons from this chapter:

"What I cannot create, I do not understand." — Richard Feynman

What is the fundamental difference between SGD and Adam?