Optimization & Backpropagation — From Loss Landscapes to Learning

Chapter 0: Why Optimization?

You've built a classifier. It has weights. It makes predictions. But right now those weights are random, and the predictions are garbage. How do you find good weights?

Your loss function tells you how badly the current weights perform — a single number. High loss means bad predictions. Low loss means good predictions. The question becomes: how do you change the weights to make the loss go down?

The core idea: Imagine the loss function as a landscape — hills and valleys across weight-space. You're standing somewhere on that landscape. You want to walk downhill to the lowest valley. Optimization is the art of finding the downhill direction and taking a step.

The Loss Landscape

A 2D slice of a loss landscape. Bright = high loss. Dark = low loss. The white dot is your current position in weight-space. Click to place it, then hit "Step Downhill" to watch it roll toward the minimum.

Every machine learning algorithm — linear classifiers, SVMs, neural networks — has this same structure: define a loss function, then minimize it by adjusting weights. The loss function is fixed (you chose it). The optimization strategy is how you navigate the landscape to find the bottom.

The naive approach? Try random weights, keep whichever set gives the lowest loss. This is like exploring a mountain range by parachuting to random spots and hoping to land in the deepest valley. It works terribly. We need something smarter: a way to figure out which direction is downhill from wherever we currently stand.

What does the loss function tell us?

The best possible weights How badly the current weights perform — a single number we want to minimize Which training examples are hardest

Chapter 1: Gradients — The Direction of Steepest Ascent

Standing on our loss landscape, we need to find the downhill direction. The tool for this is the gradient — a vector of partial derivatives, one for each weight. The gradient points in the direction of steepest ascent. So we walk in the opposite direction.

But how do we actually compute a gradient? The simplest way is the numerical gradient: wiggle each weight by a tiny amount h, measure how the loss changes, and compute the slope.

∂L / ∂w_i ≈ ( L(w_i + h) − L(w_i − h) ) / 2h

This is the centered difference formula. You nudge the weight up by h, compute the loss. Nudge it down by h, compute the loss. The difference divided by 2h is the approximate slope. Do this for every weight and you have the full gradient vector.

Think of it this way: You're blindfolded on a hillside. To figure out which way is downhill, you take a tiny step forward and feel if you went up or down. Then step backward and feel again. The direction that goes most steeply down is the negative gradient.

Numerical Gradient in 1D

A simple loss curve L(w) = w². Drag the slider to move w. The orange tangent line shows the gradient (slope) at that point. Positive slope → move left. Negative slope → move right.

w2.0

The numerical gradient is simple but painfully slow. If you have a million weights, you need a million forward passes just to compute one gradient. Each wiggle requires re-evaluating the entire loss function. For a neural network with millions of parameters, this is completely impractical.

python
def numerical_gradient(f, w, h=1e-5):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        old = w[i]
        w[i] = old + h
        loss_plus = f(w)
        w[i] = old - h
        loss_minus = f(w)
        grad[i] = (loss_plus - loss_minus) / (2 * h)
        w[i] = old  # restore
    return grad

Why is the numerical gradient impractical for neural networks?

It requires one forward pass per weight — millions of passes for a large network It only works for linear functions It gives the wrong direction

Chapter 2: Analytic Gradients — Calculus to the Rescue

The numerical gradient wiggles each weight and measures what happens. The analytic gradient uses calculus to derive an exact formula for the gradient. No wiggling, no approximation — just math.

For a simple loss L(w) = w², calculus gives us dL/dw = 2w. That's it. One formula, one evaluation, exact answer. Compare this to the numerical approach: two function evaluations per weight, and still just an approximation.

The tradeoff: Numerical gradients are easy to code but slow and approximate. Analytic gradients are exact and fast, but require you to derive the gradient formula by hand (or have a system do it for you). In practice, we always use analytic gradients for training. We use numerical gradients to check that our analytic gradient is correct — a technique called gradient checking.

Analytic vs Numerical Gradient

For L(w) = w², the analytic gradient is 2w (orange line). The numerical gradient (teal dots, computed with finite differences) closely matches. Smaller h = better approximation.

h (step size)0.01

Property	Numerical Gradient	Analytic Gradient
Speed	O(N) forward passes per N weights	O(1) with backprop
Accuracy	Approximate (depends on h)	Exact
Implementation	Simple loop	Requires derivation
Use case	Debugging / gradient checking	Actual training

In practice, you derive the analytic gradient, implement it, and then verify it against the numerical gradient on a few random inputs. If they match (within floating-point tolerance), you can trust the analytic version and use it for fast training.

What is gradient checking?

Training with a very small learning rate Checking if the loss is decreasing each iteration Comparing the analytic gradient to the numerical gradient to verify correctness

Chapter 3: Gradient Descent — The Update Rule

Now we have the gradient — it tells us the uphill direction. To go downhill, we walk in the opposite direction. This gives us the simplest optimization algorithm: gradient descent.

w ← w − η · ∇L(w)

That's the entire algorithm. η (eta) is the learning rate — how big a step we take. ∇L(w) is the gradient of the loss with respect to the weights. We subtract because the gradient points uphill and we want to go downhill.

The learning rate is the most important hyperparameter in deep learning. Too large, and you overshoot the minimum — the loss explodes. Too small, and training takes forever, possibly getting stuck in a shallow local minimum. Finding the right learning rate is an art.

Learning Rate: Too High vs Too Low

Gradient descent on L(w) = w². Adjust the learning rate and watch the trajectory. Too high = overshooting. Too low = barely moving. Just right = smooth convergence.

η0.10

In code, gradient descent is just a loop:

python
# Vanilla gradient descent
while True:
    grad = compute_gradient(loss_fn, weights, data)
    weights -= learning_rate * grad  # the update step

Each iteration, we compute the gradient over the entire training set, then update. This is called batch gradient descent — "batch" because we use the full batch of data. It gives the cleanest gradient estimate, but it's expensive: every single update requires a pass over all training examples.

What happens if the learning rate is too large?

The weights overshoot the minimum and the loss can explode Training converges faster The gradient becomes zero

Chapter 4: Mini-Batch SGD — Approximate Gradients, Real Speed

Batch gradient descent computes the gradient over the entire training set. With ImageNet (1.2 million images), that means processing every single image before updating the weights once. That's absurdly slow.

The insight: you don't need the exact gradient. An approximation is good enough, as long as it points roughly downhill. Stochastic Gradient Descent (SGD) estimates the gradient from a single random example. Mini-batch SGD uses a small random subset (typically 32-256 examples) — the sweet spot between noise and speed.

∇L ≈ (1/B) ∑_i=1^B ∇L_i

Where B is the batch size. The mini-batch gradient is a noisy estimate of the true gradient. But here's the beautiful part: that noise actually helps. It acts like a regularizer, preventing the optimizer from settling into sharp, narrow minima that generalize poorly.

Why noise helps: Imagine two valleys — one narrow and deep, one wide and shallow. Both have low loss on training data. But the wide valley is more likely to generalize to new data (small changes in weights don't change the loss much). SGD's noise makes it hard to stay in the narrow valley — random fluctuations bounce it out. The wide valley is more stable under noise, so SGD naturally prefers it.

Batch vs Mini-Batch vs Stochastic

Three gradient descent trajectories on the same loss surface. Teal = full batch (smooth but slow). Orange = mini-batch (noisy but fast). Purple = SGD (very noisy, single example). All reach roughly the same region.

Batch Size16

python
# Mini-batch SGD
for epoch in range(num_epochs):
    for batch in get_mini_batches(data, batch_size=32):
        grad = compute_gradient(loss_fn, weights, batch)
        weights -= learning_rate * grad

Mini-batch SGD is the workhorse of deep learning. Virtually every neural network you've heard of — GPT, ResNet, DALL-E — was trained with some variant of SGD.

Why does mini-batch SGD use a subset of data instead of the full training set?

The full gradient is always wrong Computing the full gradient is too slow — approximate gradients allow frequent, fast updates Smaller batches use less memory so we can have bigger models

Chapter 5: Backpropagation — Chain Rule Through a Graph

We know what we need: the gradient of the loss with respect to every weight. And we know analytic gradients are the way to get them. But for a deep neural network with millions of weights arranged in dozens of layers — how do we actually compute all those derivatives efficiently?

The answer is backpropagation. It's not a separate algorithm — it's just the chain rule from calculus, applied systematically to a computation graph.

A computation graph is a directed graph where each node is an operation (add, multiply, max, etc.) and edges carry values. The forward pass flows inputs through the graph to produce the loss. The backward pass flows gradients back through the graph to produce derivatives with respect to every input.

Consider a simple example: f(x, y, z) = (x + y) · z. We can break this into two operations:

q = x + y

Add gate: combines x and y

↓

f = q · z

Multiply gate: combines q and z

The chain rule says: ∂f/∂x = (∂f/∂q) · (∂q/∂x). We compute ∂f/∂q at the multiply gate, and ∂q/∂x at the add gate, then multiply them together. That's backpropagation — each gate computes its local gradient, and gradients multiply as they flow backward.

Forward and Backward Pass

A simple computation graph: f = (x + y) · z. Green values flow forward. Orange gradients flow backward via the chain rule.

The key insight: each node only needs to know its local gradient (the derivative of its output with respect to its inputs) and the upstream gradient (the gradient flowing in from above). Multiply them together, and pass the result backward. No node needs to understand the rest of the graph.

What does each node in a computation graph compute during the backward pass?

Its local gradient times the upstream gradient, then passes it backward The total loss of the network The weight update directly

Chapter 6: Local Gradients — Each Gate Knows Its Own Derivative

Backpropagation works because each gate only needs to know one thing: given its inputs, what is the derivative of its output with respect to each input? This is the local gradient.

Let's work through the concrete example from Chapter 5. Set x = −2, y = 5, z = −4.

Forward pass:

q = x + y = −2 + 5 = 3

f = q · z = 3 · (−4) = −12

Backward pass (starting from ∂f/∂f = 1):

At the multiply gate (f = q · z):

∂f/∂q = z = −4 ∂f/∂z = q = 3

At the add gate (q = x + y):

∂q/∂x = 1 ∂q/∂y = 1

Chain rule gives the final gradients:

∂f/∂x = (∂f/∂q) · (∂q/∂x) = −4 · 1 = −4

∂f/∂y = (∂f/∂q) · (∂q/∂y) = −4 · 1 = −4

∂f/∂z = 3

Sanity check: ∂f/∂x = −4 means that if we increase x by a tiny amount h, f decreases by 4h. Let's verify: f(−2 + 0.01, 5, −4) = (−1.99 + 5)(−4) = 3.01 × −4 = −12.04. The change is −0.04 = −4 × 0.01. It checks out.

Local Gradients at Each Gate

Adjust x, y, z and watch the forward values and backward gradients update in real time. Each gate computes only its local derivative.

x-2.0

y5.0

z-4.0

For the multiply gate f = a · b, what is the local gradient ∂f/∂a?

a b — the derivative with respect to one input is the other input a · b

Chapter 7: Backprop Through a Circuit — The Showcase

Now let's put it all together with a richer computation graph. Below is an interactive circuit that computes a more complex function. You set the inputs, watch values flow forward through the gates, then see gradients flow backward — gate by gate — via the chain rule.

What to watch for: During the forward pass, values flow left-to-right, each gate combining its inputs. During the backward pass, gradients flow right-to-left. At each gate, the incoming gradient is multiplied by the local gradient. By the time gradients reach the inputs, they encode how much each input affects the final output.

Interactive Computation Graph

Circuit: f = (x + y) · z + max(x, w). Set inputs, run forward pass, then backward pass. Watch gradients propagate through each gate.

x2.0

y1.0

z-3.0

w4.0

Set inputs and click Forward Pass

Try these experiments:

Set x=2, w=4. Since max(x,w)=w, the gradient for x from the max gate is 0 (it was "routed" away).
Make z large and positive. Watch how the multiply gate amplifies gradients for x and y.
Set x=w (equal inputs to max). The max gate picks one arbitrarily — a non-differentiable point in theory, but irrelevant in practice.

Chapter 8: Gate Patterns — Add, Multiply, Max

Once you've seen backpropagation through a few gates, you notice patterns. Each common gate has a characteristic way it handles gradients. Memorizing these patterns lets you read computation graphs like a circuit diagram.

The add gate: gradient distributor. For f = x + y, we have ∂f/∂x = 1 and ∂f/∂y = 1. The upstream gradient is simply copied to both inputs, unchanged. The add gate distributes the gradient equally.

The multiply gate: gradient switcher. For f = x · y, we have ∂f/∂x = y and ∂f/∂y = x. The gradient for each input is scaled by the other input's value. If one input is large, the other's gradient is large. This is why multiplications can cause gradient explosion or vanishing.

The max gate: gradient router. For f = max(x, y), the gradient flows entirely to whichever input was larger. The other input gets zero gradient. It's like a switch — the max gate routes the gradient to the winner and blocks the loser.

Why this matters for neural networks: A ReLU activation is f = max(0, x). When x > 0, the gradient flows through. When x ≤ 0, the gradient is zero — the neuron is "dead" for this input. This is the dying ReLU problem, and it comes directly from the max gate pattern.

Gate Patterns

Three gates, each with upstream gradient = 1.0. Watch how each gate handles gradient distribution differently. Adjust inputs to see the pattern.

x3.0

y2.0

Gate	Forward	Local Gradient	Role
Add	x + y	∂f/∂x = 1, ∂f/∂y = 1	Distributor — copies gradient to all inputs
Multiply	x · y	∂f/∂x = y, ∂f/∂y = x	Switcher — scales gradient by the other input
Max	max(x, y)	1 for winner, 0 for loser	Router — sends gradient only to the larger input

If f = x · y and the upstream gradient is 2.0, what gradient does x receive when y = 5?

2.0 10.0 — upstream gradient (2.0) times the local gradient (y = 5) 5.0

Chapter 9: Beyond — From Gradients to Neural Networks

You now understand the complete optimization pipeline: define a loss, compute its gradient with respect to the weights using backpropagation, and update the weights in the opposite direction of the gradient. Every neural network, from a single-layer perceptron to GPT-4, learns this way.

But the story doesn't end with vanilla SGD. Modern deep learning uses sophisticated optimizers that adapt the learning rate for each parameter individually:

What's next: SGD with Momentum adds velocity — the update accumulates past gradients like a ball rolling downhill, carrying momentum through flat regions. Adam adapts the learning rate per-parameter using running averages of the gradient and its square. These are the optimizers that power modern deep learning.

Concept	What We Learned	What Comes Next
Gradient	Direction of steepest ascent	Second-order methods (curvature)
Learning rate	Fixed step size	Adaptive rates (Adam, AdaGrad)
SGD	Mini-batch gradient estimation	Momentum, Nesterov, Adam
Backprop	Chain rule on a computation graph	Autograd (automatic differentiation)
Gate patterns	Add, multiply, max	Sigmoid, softmax, attention gates

The key lessons from this chapter:

Gradient = direction of steepest ascent. Negate it to go downhill.
Analytic > numerical. Use calculus for speed, numerical for verification.
Learning rate is critical. Too big = diverge. Too small = stuck.
Mini-batch SGD trades exact gradients for speed, and the noise helps generalization.
Backpropagation is the chain rule applied recursively through a computation graph.
Gates have patterns: add distributes, multiply switches, max routes.

"What I cannot create, I do not understand." — Richard Feynman

What is the fundamental difference between SGD and Adam?

Adam uses the full batch, SGD uses mini-batches Adam adapts the learning rate per-parameter; SGD uses one fixed rate for all Adam doesn't use gradients

Optimization & BackpropagationFrom Zero