A neural network is a chain of simple operations. Backpropagation is the trick that lets us compute how to adjust every single weight — by tracing blame backwards through the chain.
Imagine you have a dataset of red and blue points on a 2D plane. The red points form a ring around the blue points — like a donut with a blueberry center. Can a linear classifier separate them? No. A linear classifier draws a single straight line (or hyperplane in higher dimensions). No matter how you rotate or shift that line, it will always misclassify some points.
This is the fundamental limitation of the linear score function we've been using: f = Wx. It can only carve the input space into regions separated by flat boundaries. Many real-world problems — like recognizing a "9" vs a "6", or distinguishing a cat from a dog — require curved, nonlinear decision boundaries.
The simplest example of linear failure is XOR. Four points on a 2D grid:
Points: (0,0) → class 0, (0,1) → class 1, (1,0) → class 1, (1,1) → class 0.
Try drawing a line that puts (0,1) and (1,0) on one side, and (0,0) and (1,1) on the other. You can't. The "same class" points are diagonal to each other. No single line separates them.
A linear classifier computes w1x1 + w2x2 + b. For (0,1) to be positive: w2 + b > 0. For (1,0) to be positive: w1 + b > 0. For (0,0) to be negative: b < 0. For (1,1) to be negative: w1 + w2 + b < 0. Adding the first two: w1 + w2 + 2b > 0. But the last constraint says w1 + w2 + b < 0, meaning w1 + w2 < −b. Combined with b < 0: w1 + w2 + 2b > 0 implies w1 + w2 > −2b > 0 > −b. Contradiction.
The CS 231n lecture shows a beautiful insight: what if we transform the inputs before classifying? Take those donut-shaped points in (x, y) coordinates and convert them to polar coordinates (r, θ). Suddenly, the ring of red points has large r values and the blue center has small r values. A simple threshold on r separates them perfectly.
But here's the problem: who designs the transform? For polar coordinates, we used domain knowledge. For images of cats and dogs, what's the right transform? Nobody knows. That's where neural networks come in: they learn the transform.
A neural network is a learned feature transform followed by a linear classifier. The first layers learn to warp the input space until the classes become linearly separable. The last layer draws the separating hyperplane. The whole thing is trained end-to-end.
A curved surface in input space that separates classes. Linear classifiers can only produce flat (hyperplane) boundaries. To get curved boundaries, we need either hand-crafted feature transforms or learned ones (neural networks).
A linear classifier for CIFAR-10 learns exactly 10 templates — one per class. Each row of the weight matrix W (10×3072) acts as a template that gets compared against the input image via dot product. The problem: a single "horse" template must account for horses facing left, right, running, standing, in daylight, at dusk. It ends up as a blurry average of all horse configurations.
Imagine the "car" template. Cars can be red, blue, or white. The single template averages these colors, producing a greyish blob. A red car activates the template moderately. A blue car activates it moderately. Neither activates it strongly, because the template is a compromised average.
With 100 hidden units, you could learn: template #1 = "red car shape," template #2 = "blue car front," template #3 = "white sedan side." The output layer combines: score(car) = 0.8 × template1 + 0.5 × template2 + 0.3 × template3. Now each sub-template can specialize.
A linear classifier computes f = Wx. A 2-layer neural network computes f = W2 max(0, W1x). That's it. Two matrix multiplications with a nonlinear function sandwiched between them. But this small change has enormous consequences.
Let's trace through it. Say x is a 3072-dimensional image (32×32×3 CIFAR-10 image flattened). W1 is a 100×3072 matrix, producing a 100-dimensional hidden layer h = W1x. We apply max(0, ·) elementwise to get h' = max(0, h). Then W2 is a 10×100 matrix, mapping h' to 10 class scores.
Input: x = [1, -2]. W1 = [[2, 1], [-1, 3]], b1 = [0, 0]. W2 = [[1, -1]], b2 = [0].
Step 1: h = W1x + b1 = [2·1 + 1·(-2), -1·1 + 3·(-2)] = [0, -7].
Step 2: h' = max(0, h) = max(0, [0, -7]) = [0, 0].
Step 3: f = W2h' + b2 = [1·0 + (-1)·0] = [0].
Both hidden units got killed by ReLU. Let's try x = [1, 2]: h = [4, 5], h' = [4, 5], f = [4 - 5] = [-1]. Now the network computes something nontrivial.
What if we skip the max(0, ·) and just stack two linear layers? Then f = W2(W1x) = (W2W1)x = W'x. It collapses back to a single linear classifier! Two linear transforms compose into one linear transform. The nonlinearity between layers is what prevents this collapse and gives the network its power.
Stacking 100 linear layers without activation functions gives you exactly the same representational power as a single layer. The product W100 · W99 · ... · W1 is just one big matrix. You've burned 100x the parameters to learn a linear function. The activation function is not optional — it's what makes depth meaningful.
In a linear classifier, each row of W is a "template" for one class. There's exactly one template per class. A horse template must capture all possible horses — different orientations, colors, backgrounds — in a single linear filter. That's why linear classifiers produce blurry, averaged templates.
A 2-layer network with 100 hidden units learns 100 templates in W1. These aren't class-specific — they're shared features like "edge at 45 degrees" or "brown patch." W2 then combines templates: "horse = template 3 + template 17 - template 42." Each class can be a different mix of the 100 learned features.
A 2-layer neural network with enough hidden units can approximate any continuous function to arbitrary precision (Cybenko, 1989; Hornik, 1991). This is the Universal Approximation Theorem. It doesn't say the network will find the right function (that's an optimization problem), only that it has the capacity to represent it. Think of it as: "the architecture is not the bottleneck."
Each additional layer adds another level of learned abstraction. Layer 1 might learn edges. Layer 2 combines edges into parts (eyes, wheels). Layer 3 combines parts into objects (cats, cars). This hierarchical composition is the deep in deep learning.
A multi-layer perceptron (MLP) is a neural network where every neuron in one layer connects to every neuron in the next. Also called a "fully-connected network" or "dense network." The architecture is defined by the number of layers and the width (number of neurons) of each layer.
The nonlinear function applied after each linear layer is called the activation function. The choice of activation function matters more than you might think — it affects whether gradients flow well during training, whether neurons die, and how fast the network learns.
Sigmoid squashes any input into the range (0, 1). Historically popular because it resembles the firing rate of a biological neuron. But it has three serious problems:
1. Saturated neurons kill gradients. When the input is very positive or very negative, σ(x) is near 0 or 1. The derivative σ(x)(1−σ(x)) is near zero. During backpropagation, the gradient gets multiplied by this near-zero value and effectively vanishes.
2. Outputs are not zero-centered. If the input to a neuron is always positive (which happens when the previous layer uses sigmoid), then the gradients on the weights are always all positive or all negative. This causes zig-zagging during gradient descent.
3. exp() is expensive. Minor, but relevant at scale.
Input x = 10. σ(10) = 1/(1 + e-10) ≈ 0.99995. Derivative: 0.99995 × 0.00005 = 0.0000454.
If the upstream gradient is 1.0, the downstream gradient is 0.0000454 — four orders of magnitude smaller. Stack 5 layers of sigmoid and the gradient at the first layer is essentially zero. The network cannot learn.
Tanh fixes the zero-centering problem (outputs are in [−1, 1]), but still saturates at the extremes. It was the default choice before ReLU, and is still used in some architectures (e.g., LSTMs).
ReLU is stupidly simple: if the input is negative, output zero. If positive, pass it through unchanged. No exponentials, no division. It trains 6x faster than sigmoid in practice (Krizhevsky et al., 2012).
For positive inputs, the gradient is exactly 1. No matter how deep the network, if a neuron is active, it passes the gradient through undistorted. This solves the vanishing gradient problem that plagued sigmoid and tanh networks. The gradient is either 0 (neuron off) or 1 (neuron on) — no values in between that could shrink over many layers.
If a ReLU neuron's input is always negative (for every training example), its output is always 0 and its gradient is always 0. It can never update its weights. It's dead. This can happen if the learning rate is too large — a large gradient update pushes the weights so that the neuron's pre-activation is always negative. Once dead, always dead. In practice, ~10-20% of neurons in a ReLU network may die during training.
Leaky ReLU fixes dead neurons by allowing a small gradient (α = 0.01) when the input is negative. The neuron is never completely off, so it can always recover. ELU goes further by producing negative outputs that push the mean activation closer to zero, which helps with learning dynamics.
Use ReLU. It's the default for almost all feedforward and convolutional networks. If you see dead neurons, try Leaky ReLU. Don't use sigmoid or tanh in hidden layers unless you have a specific reason (e.g., gates in LSTMs). Sigmoid in the output layer is fine for binary classification.
| Activation | Range | Gradient at x=0 | Dead Neurons? | Use When |
|---|---|---|---|---|
| Sigmoid | (0, 1) | 0.25 | Saturates | Output layer for binary |
| Tanh | (−1, 1) | 1.0 | Saturates | LSTM gates, RNN |
| ReLU | [0, ∞) | undefined | Yes | Default everywhere |
| Leaky ReLU | (−∞, ∞) | undefined | No | If ReLU dies |
| ELU | (−α, ∞) | ~α | No | When zero-mean matters |
We've been writing neural networks as formulas: f = W2 max(0, W1x). But to compute gradients efficiently, we need a different representation. We need to see the computation as a graph — a network of simple operations connected by data flow.
A directed acyclic graph (DAG) where each node represents a simple operation (add, multiply, max, exp, etc.) and each edge carries a value. Inputs enter at the left, the loss comes out at the right. The forward pass flows left-to-right, computing the output. The backward pass flows right-to-left, computing gradients.
Consider a simple function: f(x, y, z) = (x + y) · z. We can break this into two operations:
f(x, y, z) = (x + y) · z. Let x = −2, y = 5, z = −4.
Node 1 (add): q = x + y = −2 + 5 = 3.
Node 2 (multiply): f = q · z = 3 · (−4) = −12.
The graph has three input nodes (x, y, z), one intermediate node (q), and one output node (f). Data flows left to right: inputs → add → multiply → output.
The point of the graph representation is modularity. Each node only needs to know two things: (1) how to compute its output from its inputs (the forward pass), and (2) how to compute the gradient of its output with respect to its inputs (the local gradient). It doesn't need to know anything about the rest of the network.
Deriving gradients for the whole network by hand is a nightmare of matrix calculus. But each individual node is trivial: the gradient of addition is 1, the gradient of multiplication is the other input, the gradient of max is 0 or 1. Backpropagation chains these trivial local gradients together to get the full network gradient. Simple parts, complex whole.
Any neural network — no matter how complex — can be expressed as a computational graph. The linear classifier f = Wx + b is three nodes: multiply, add bias, done. A 100-layer ResNet is thousands of nodes. But the gradient computation is always the same: forward pass to get values, backward pass to get gradients.
Full pipeline for a linear classifier with SVM loss:
Node 1: s = W · x (matrix multiply: scores)
Node 2: margins = sj − sy + 1 (hinge computation per class)
Node 3: Ldata = Σ max(0, margins) (sum of hinge losses)
Node 4: R = λ Σ W2 (regularization)
Node 5: L = Ldata + R (total loss)
Five nodes. Each one is simple. The gradient of the whole thing is computed by chaining local gradients backward through these five nodes.
The chain rule from calculus is the mathematical engine behind backpropagation. If you understand the chain rule, you understand backprop. They are the same thing.
If y = f(x) and z = g(y), then z is a function of x through y. The chain rule says:
Let y = 3x and z = y2. Then dy/dx = 3 and dz/dy = 2y = 2(3x) = 6x.
Chain rule: dz/dx = (dz/dy)(dy/dx) = 6x · 3 = 18x.
Check: z = y2 = (3x)2 = 9x2. Direct derivative: dz/dx = 18x. Matches!
In neural networks, a value often depends on x through multiple paths. The multivariable chain rule says: sum the contributions from all paths.
Let a = 2x, b = x2, z = a + b = 2x + x2.
Path through a: ∂z/∂a · da/dx = 1 · 2 = 2.
Path through b: ∂z/∂b · db/dx = 1 · 2x = 2x.
Total: dz/dx = 2 + 2x. Check: z = 2x + x2, dz/dx = 2 + 2x. Correct.
In a computational graph, the chain rule tells us: to find how the loss L changes with respect to any variable x, multiply the upstream gradient (dL/dz, coming from the right) by the local gradient (dz/dx, computed at this node). That product is the downstream gradient passed to the left. This is literally all backpropagation does — at every node.
Think of it as unit conversion. dL/dz has units "loss per z-unit." dz/dx has units "z-units per x-unit." Multiplying: "loss per z-unit" × "z-units per x-unit" = "loss per x-unit" = dL/dx. The chain rule is dimensional analysis for derivatives.
The chain rule extends to any number of composed functions. If a = f(x), b = g(a), c = h(b), L = j(c), then:
a = 2x (da/dx = 2). b = a3 (db/da = 3a2). L = −b (dL/db = −1).
At x = 1: a = 2, b = 8, L = −8.
dL/dx = (−1) · (3 × 4) · 2 = −1 · 12 · 2 = −24.
Check: L = −(2x)3 = −8x3. dL/dx = −24x2. At x = 1: −24. Matches.
If each local gradient is less than 1 (as happens with sigmoid, where the max gradient is 0.25), the product shrinks exponentially with depth. Ten sigmoid layers: gradient ≤ 0.2510 ≈ 10−6. The first layer's weights barely update. This is why deep sigmoid networks were considered untrainable before ReLU (which has local gradient exactly 1 when active) and residual connections.
Now we put it all together. Backpropagation is the algorithm that computes the gradient of the loss with respect to every weight in the network, by applying the chain rule systematically from right to left through the computational graph.
Let's trace through the complete forward and backward pass for f(x, y, z) = (x + y) · z, with x = −2, y = 5, z = −4.
FORWARD PASS (left to right):
q = x + y = −2 + 5 = 3
f = q · z = 3 · (−4) = −12
BACKWARD PASS (right to left):
Step 1 — Base case: df/df = 1 (the gradient of the output with respect to itself is always 1).
Step 2 — Multiply node: f = q · z.
Local gradients: ∂f/∂q = z = −4, ∂f/∂z = q = 3.
Downstream: df/dq = 1 · (−4) = −4. df/dz = 1 · 3 = 3.
Step 3 — Add node: q = x + y.
Local gradients: ∂q/∂x = 1, ∂q/∂y = 1.
Downstream: df/dx = (−4) · 1 = −4. df/dy = (−4) · 1 = −4.
Result: ∂f/∂x = −4, ∂f/∂y = −4, ∂f/∂z = 3.
Check ∂f/∂x: change x from −2 to −1.999 (a tiny bump of +0.001). New f = (−1.999 + 5) · (−4) = 3.001 · (−4) = −12.004. Change in f: −12.004 − (−12) = −0.004. Gradient ≈ −0.004/0.001 = −4. Matches our analytical result exactly. Always check with numerical gradients.
The lecture walks through a more involved example: f(w0, x0, w1, x1, w2) = 1 / (1 + e−(w0x0 + w1x1 + w2)).
This is a sigmoid applied to a dot product — exactly what one neuron computes. Let's set w0 = 2, x0 = −1, w1 = −3, x1 = −2, w2 = −3.
Forward: w0x0 = 2·(−1) = −2. w1x1 = (−3)·(−2) = 6. Sum = −2 + 6 + (−3) = 1. Sigmoid: 1/(1 + e−1) ≈ 0.73.
Backward: The sigmoid has a beautiful local gradient: σ(x)(1 − σ(x)) = 0.73 · 0.27 ≈ 0.20.
Starting from df/df = 1:
• Gradient at sigmoid output: 1 × 0.20 = 0.20
• The sum node distributes: gradient to each input of the sum is 0.20
• At w0x0 multiply node: ∂/∂w0 = x0 × 0.20 = (−1)(0.20) = −0.20. ∂/∂x0 = w0 × 0.20 = 2(0.20) = 0.40.
• At w1x1 multiply node: ∂/∂w1 = x1 × 0.20 = (−2)(0.20) = −0.40. ∂/∂x1 = w1 × 0.20 = (−3)(0.20) = −0.60.
• ∂/∂w2 = 1 × 0.20 = 0.20 (bias is just added).
σ(x) = 1/(1 + e−x) = (1 + e−x)−1.
dσ/dx = −(1 + e−x)−2 · (−e−x) = e−x / (1 + e−x)2.
Rewrite: = [1/(1 + e−x)] · [e−x/(1 + e−x)] = σ(x) · [(1 + e−x − 1)/(1 + e−x)]
= σ(x) · [1 − 1/(1 + e−x)] = σ(x) · (1 − σ(x)).
Maximum at x = 0 where σ(0) = 0.5: gradient = 0.5 × 0.5 = 0.25. This is the best case — it only gets smaller from here.
The computational graph representation is not unique. We could break sigmoid into four nodes (negate, exp, add 1, reciprocal) and backprop through each. Or we could treat the entire sigmoid as one node with the local gradient σ(x)(1 − σ(x)). Both give the same answer. Choose the grouping where local gradients are easy to express. In practice, frameworks like PyTorch define custom backward functions for common compound operations.
After doing enough backprop examples, you start to notice that each type of operation has a characteristic gradient behavior. These patterns are so fundamental that CS 231n gives them names. Once you recognize them, you can read gradients off a computational graph by inspection.
x = 3, y = 4. z = x + y = 7. Upstream gradient: dL/dz = 2.
dL/dx = dL/dz · ∂z/∂x = 2 · 1 = 2.
dL/dy = dL/dz · ∂z/∂y = 2 · 1 = 2.
Both inputs receive the same gradient. The add gate is a "gradient copier."
x = 2, y = 3. z = xy = 6. Upstream gradient: dL/dz = 5.
dL/dx = dL/dz · ∂z/∂x = 5 · y = 5 · 3 = 15.
dL/dy = dL/dz · ∂z/∂y = 5 · x = 5 · 2 = 10.
The gradient on x is scaled by y's value (3), and vice versa. If one input is large, the other gets a large gradient. This is why weight initialization matters — large activations mean large weight gradients.
x = 4, y = 9. z = max(4, 9) = 9. Upstream gradient: dL/dz = 5.
dL/dx = 5 · 0 = 0 (x lost the max).
dL/dy = 5 · 1 = 5 (y won the max).
The max gate acts like a switch: it routes the entire gradient to whichever input was larger and blocks the other. This is exactly what happens in ReLU (max(0, x)): if x > 0, gradient passes through; if x < 0, gradient is zero.
x is used in two places: one path gives dL/dy1 = 4, the other gives dL/dy2 = 2.
dL/dx = 4 + 2 = 6.
This is the multivariable chain rule at work: x influences L through two separate paths, so we sum the contributions.
Add distributes gradient equally to all inputs. Multiply swaps and scales — each input's gradient is the upstream times the other input's value. Max routes the full gradient to the winner, zero to the loser. Copy (fan-out) adds gradients from all downstream paths. These four patterns compose to describe gradient flow through any computational graph.
So far, all our examples used scalar values. Real neural networks operate on vectors and matrices. When we move from scalars to vectors, derivatives become Jacobian matrices.
| Input | Output | Derivative | Shape |
|---|---|---|---|
| Scalar | Scalar | Regular derivative dy/dx | 1 × 1 |
| Vector (D) | Scalar | Gradient ∇xy | D × 1 |
| Vector (D) | Vector (M) | Jacobian ∂y/∂x | M × D |
For a function f: RD → RM, the Jacobian J is an M×D matrix where Jij = ∂yi/∂xj. Each entry says: "if I nudge the j-th input by a tiny amount, how much does the i-th output change?" The Jacobian contains ALL partial derivative information.
The chain rule with vectors: if z = f(x) and the loss is a scalar L, then:
The key insight: dL/dx always has the same shape as x. If x is a 100-dimensional vector, dL/dx is also a 100-dimensional vector. If x is a 64×64 matrix, dL/dx is also 64×64. The gradient tells you how to adjust each element.
Input: x = [1, −2, 3, −1]. f(x) = max(0, x) = [1, 0, 3, 0]. Upstream: dL/dz = [4, −1, 5, 9].
Jacobian: ReLU is elementwise, so the Jacobian is diagonal:
J = [[1,0,0,0], [0,0,0,0], [0,0,1,0], [0,0,0,0]]
Diagonal entries are 1 where xi > 0, and 0 where xi ≤ 0.
Backward: dL/dx = JT · dL/dz = [1·4, 0·(−1), 1·5, 0·9] = [4, 0, 5, 0].
Gradient passes through where ReLU was active, and is zeroed where ReLU was off. In code, this is just: dx = dz * (x > 0). One line.
For a neural network layer with 4096 inputs and 4096 outputs, the Jacobian is 4096×4096 = 16 million entries — about 64 MB per layer. For the matrix multiply y = Wx, the Jacobian is even worse: it's (N·M) × (N·M) if we flatten all inputs and outputs. That could be 256 GB! Instead, we compute the Jacobian-vector product implicitly. For ReLU, it's an elementwise mask. For matrix multiply, it's another matrix multiply.
For y = x · W where x is [N×D] and W is [D×M]:
x = [[2, 1, −3], [−3, 4, 2]] (2×3). W = [[3,2,1,−1],[2,1,3,2],[3,2,1,−2]] (3×4).
y = xW = [[13, 9, −2, −6], [5, 2, 17, 1]] (2×4).
Upstream: dL/dy = [[2, 3, −3, 9], [−8, 1, 4, 6]] (2×4).
dL/dx = dL/dy · WT (2×4 · 4×3 = 2×3). Think: "for each element of x, which elements of y did it affect? The whole row of y, through the corresponding row of W."
dL/dW = xT · dL/dy (3×2 · 2×4 = 3×4). Same shape as W. Each weight gradient is the sum over the batch of "input activation times output gradient."
Consider one weight Wjk. It only affects yik for all batch elements i, through yik = Σj xij Wjk. So ∂yik/∂Wjk = xij. The gradient: dL/dWjk = Σi dL/dyik · xij. In matrix form, that's the (j,k) entry of xT · dL/dy. The shapes force this to be true — there's only one way to multiply a (D×N) matrix with an (N×M) matrix to get a (D×M) result.
Here's a practical trick from CS 231n that saves enormous amounts of time: you can often derive the gradient formula by shape-matching. You know dL/dx must have the same shape as x. You know dL/dy (the upstream gradient) has some shape. There's usually only one way to multiply the available matrices to get the right output shape.
In a layer y = xW + b, where x is [N×D], W is [D×M], b is [1×M], and y is [N×M].
dL/dy is [N×M]. We need dL/db which is [1×M].
The only way to reduce [N×M] to [1×M] is to sum over the batch dimension N.
So: dL/db = Σi=1N dL/dyi = sum over rows of dL/dy. In NumPy: db = np.sum(dy, axis=0).
Why? Because b is broadcast across all N batch elements, so the gradient accumulates from all of them (the "copy gate adds" pattern from Chapter 7).
Putting it all together, the backward pass for one fully-connected layer y = max(0, xW + b) involves three steps:
x = [[1, 2]] (1×2), W = [[0.5, -0.3], [0.8, 0.1]] (2×2), b = [[0.1, -0.2]].
Forward: xW = [[1·0.5 + 2·0.8, 1·(-0.3) + 2·0.1]] = [[2.1, -0.1]]. Plus b: [[2.2, -0.3]]. ReLU: [[2.2, 0]].
Upstream dL/dy = [[1, 1]].
ReLU backward: mask = [[1, 0]] (second element was negative). dL/d(pre) = [[1·1, 1·0]] = [[1, 0]].
dL/db = [[1, 0]].
dL/dW = xT · [[1, 0]] = [[1],[2]] · [[1, 0]] = [[1, 0], [2, 0]].
dL/dx = [[1, 0]] · WT = [[1, 0]] · [[0.5, 0.8], [-0.3, 0.1]] = [[0.5, 0.8]].
The second weight column gets zero gradient because ReLU killed that output. The first column gets updated normally.
Notice that to compute dL/dW = xT dL/dy, we need the input x. To compute the ReLU mask, we need the pre-activation values. The forward pass must cache all intermediate activations for the backward pass to use. For a network with L layers, each storing an [N×D] activation matrix, the memory cost is O(L × N × D). This is why large batch sizes on deep networks run out of GPU memory — it's not the parameters, it's the cached activations.
Time to put it all together. Below is a computational graph for f(x, y, z) = (x + y) · z. You can step through the backward pass one node at a time, watching gradients flow from right to left. At each step, you'll see the upstream gradient, the local gradient, and the computed downstream gradient.
Hit "Forward" to compute values, then "Step Backward" to trace gradients one node at a time. The currently active node is highlighted in gold.
1. The backward pass starts at the output with gradient = 1 (base case). 2. At the multiply node, the gradient on q is z (−4) and the gradient on z is q (3) — the "swap" pattern. 3. At the add node, both inputs (x and y) get the same gradient (−4) — the "distribute" pattern. 4. The whole computation costs the same as TWO forward passes — no matrix inversions, no expensive solvers.
After running the backward pass, verify each gradient:
• df/dx: Bump x from −2 to −1.999. New f = (−1.999+5)(−4) = 3.001(−4) = −12.004. Gradient ≈ −0.004/0.001 = −4. Matches.
• df/dy: Bump y from 5 to 5.001. New f = (−2+5.001)(−4) = 3.001(−4) = −12.004. Gradient ≈ −4. Matches.
• df/dz: Bump z from −4 to −3.999. New f = 3(−3.999) = −11.997. Gradient ≈ (−11.997+12)/0.001 = 3. Matches.
Now that we have the gradients, what do they mean? Remember, a gradient tells you: "if I increase this input by a tiny amount, how does the output change?"
df/dx = −4 means: if x increases by 1, f decreases by 4. Since we want to minimize f (it's our loss), we should increase x (move in the direction opposite to the gradient for minimization, same direction for maximization).
df/dz = 3 means: if z increases by 1, f increases by 3. To minimize f, we should decrease z.
The magnitude tells you sensitivity. f is more sensitive to x and y (gradient magnitude 4) than it would be to an input with gradient 0.1. Gradient magnitude = how much this input matters.
Let's step through the complete backprop for a sigmoid neuron, combining all the gate patterns:
f(w, x) = 1 / (1 + e−(w·x + b)). Let w = 2, x = −1, b = 3.
Forward (left to right):
1. a = w · x = 2 × (−1) = −2 [multiply gate]
2. c = a + b = −2 + 3 = 1 [add gate]
3. d = −c = −1 [negate gate]
4. e = exp(d) = e−1 ≈ 0.368 [exp gate]
5. g = 1 + e = 1.368 [add gate]
6. f = 1/g ≈ 0.731 [reciprocal gate]
Backward (right to left):
6. df/df = 1. Reciprocal: df/dg = −1/g2 = −1/1.3682 ≈ −0.534
5. Add gate distributes: df/d(1) = −0.534, df/de = −0.534
4. Exp gate: local grad = ed = 0.368. df/dd = −0.534 × 0.368 ≈ −0.197
3. Negate: df/dc = −0.197 × (−1) = 0.197
2. Add distributes: df/da = 0.197, df/db = 0.197
1. Multiply swaps: df/dw = 0.197 × x = 0.197 × (−1) = −0.197. df/dx = 0.197 × w = 0.197 × 2 = 0.394
Shortcut check: σ(1)(1−σ(1)) = 0.731 × 0.269 ≈ 0.197. Then dw = 0.197 × x = −0.197. Matches perfectly.
In real code, we don't manually derive gradients for each new network. Instead, each operation is a module with two methods: forward() and backward(). The framework chains them automatically.
forward(x, y): cache x and y. Return x * y.
backward(dout): dx = dout * self.y (cached). dy = dout * self.x (cached). Return (dx, dy).
This is actual PyTorch code, simplified. Every operation in PyTorch — matmul, relu, sigmoid, softmax, conv2d — implements exactly this pattern.
Here's the CS 231n example as Python pseudocode for f(w0, x0, w1, x1, w2) = σ(w0x0 + w1x1 + w2):
w0, x0, w1, x1, w2 = 2, -1, -3, -2, -3
Forward:
d0 = w0 * x0 # -2 (multiply)
d1 = w1 * x1 # 6 (multiply)
d2 = d0 + d1 # 4 (add)
d3 = d2 + w2 # 1 (add bias)
f = 1.0 / (1 + math.exp(-d3)) # 0.731 (sigmoid)
Backward:
df = 1.0 # base case
dd3 = f * (1 - f) * df # 0.197 (sigmoid grad)
dd2 = 1.0 * dd3 # 0.197 (add passes through)
dw2 = 1.0 * dd3 # 0.197 (add passes through)
dd0 = 1.0 * dd2 # 0.197 (add distributes)
dd1 = 1.0 * dd2 # 0.197 (add distributes)
dw0 = x0 * dd0 # -0.197 (mul swaps)
dx0 = w0 * dd0 # 0.394 (mul swaps)
dw1 = x1 * dd1 # -0.394 (mul swaps)
dx1 = w1 * dd1 # -0.591 (mul swaps)
Notice the pattern: backward is literally forward in reverse, with each line replaced by its gradient rule. This is why it's called "backpropagation" — you propagate gradients backward through the same computation graph.
Modern frameworks (PyTorch, JAX, TensorFlow) implement autograd: they automatically build the computational graph during the forward pass and compute gradients by calling backward() on each node in reverse order. You never write backward() yourself unless you're defining a new custom operation.
Automatic differentiation is the mechanical process of applying the chain rule to a program. The framework records every operation during the forward pass in a "tape" (the computational graph). When you call .backward(), it replays the tape in reverse, calling each node's backward() method. The result: exact gradients for every parameter, computed in a single backward pass.
The entire backprop we did by hand can be done in 4 lines:
x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)
f = (x + y) * z
f.backward()
Now x.grad = -4, y.grad = -4, z.grad = 3. Exactly what we computed by hand. PyTorch built the graph (add → multiply), then ran backward through it automatically.
Forward pass: Compute the result of each operation and save any intermediates needed for gradient computation in memory.
Backward pass: Apply the chain rule to compute the gradient of the loss with respect to every input, parameter, and intermediate value.
That's it. Neural network training is: (1) forward pass to compute loss, (2) backward pass to compute gradients, (3) gradient descent to update weights. Repeat millions of times.
| Consideration | Details |
|---|---|
| Memory | Forward pass must cache activations for backward. A 100-layer network stores 100 layers of activations. This is why GPU memory is the bottleneck, not compute. |
| Gradient checking | Always verify analytical gradients with numerical ones: (f(x+h) − f(x−h)) / 2h. Use h ≈ 10−5. If relative error > 10−5, you have a bug. |
| Computational cost | Backward pass is ~2x the cost of forward pass. Total training step ≈ 3x forward pass. |
| Custom ops | If you define a new operation, you must write both forward() and backward(). Get backward() wrong and training silently fails (weights don't update correctly). |
To see how all the pieces fit, here's the complete flow for training a 2-layer neural network. This is what CS 231n implements in about 20 lines of NumPy:
Input x = [[1, 2]], true label y = 0, learning rate = 0.1.
After forward: loss = 1.38. After backward: dW1 = [[0.05, -0.02], [0.10, -0.04]].
Update: W1[0][0] was 0.50, becomes 0.50 - 0.1 × 0.05 = 0.495.
Each weight moves a tiny bit in the direction that reduces the loss. After thousands of these tiny nudges, the weights converge to values that correctly classify the training data.
The entire learning algorithm is three operations in a loop: forward (compute loss), backward (compute gradients), update (gradient descent). Everything else — batch normalization, dropout, learning rate schedules, Adam optimizer — is refinement of these three steps. The core has not changed since Rumelhart, Hinton, and Williams described backpropagation in 1986.
Convolutional Neural Networks (CS 231n Lecture 5): Instead of fully-connected layers, use filters that slide over the input spatially. The backward pass for convolutions follows the exact same chain rule — it's just a different forward operation with a different local gradient.
Training tricks: Batch normalization, dropout, residual connections — all of these are just new nodes in the computational graph, each with their own forward() and backward().
Beyond first-order: Backpropagation computes first derivatives. Second-order methods (Newton's method, natural gradient) need the Hessian — the Jacobian of the gradient. This is a matrix of size (parameters × parameters) and is far too expensive for modern networks with billions of parameters.
| Topic | Connection |
|---|---|
| Transformers | Self-attention is a differentiable operation with its own forward/backward. The gradients flow through Q, K, V projections via the chain rule. |
| Diffusion models | The noise prediction network is trained with backprop. The denoising loss is just another node in the graph. |
| Policy gradient (RL) | REINFORCE uses a different kind of gradient (score function estimator) because the environment is not differentiable. When the dynamics ARE differentiable, backprop through time gives exact gradients. |
| GANs | Two computational graphs (generator + discriminator) with gradients flowing between them. The generator's loss depends on the discriminator's output. |
Backpropagation was not invented once. The chain rule has existed since Leibniz (1676). Automatic differentiation was described by Wengert (1964). The specific application to neural networks was explored by Linnainmaa (1970), Werbos (1974), and popularized by Rumelhart, Hinton, and Williams (1986). But it took hardware (GPUs), data (ImageNet), and architectural innovations (ReLU, batch norm, residual connections) for backprop to become the engine of modern AI.
| Year | Milestone | Contribution |
|---|---|---|
| 1676 | Leibniz | Chain rule of calculus |
| 1964 | Wengert | Automatic differentiation |
| 1970 | Linnainmaa | Reverse-mode autodiff (the algorithm behind backprop) |
| 1974 | Werbos | Applied to neural networks in PhD thesis |
| 1986 | Rumelhart, Hinton, Williams | Popularized backprop, showed it works on real problems |
| 2012 | Krizhevsky, Sutskever, Hinton | AlexNet: backprop + ReLU + GPU = ImageNet breakthrough |
| 2015 | PyTorch / TensorFlow | Autograd frameworks make backprop invisible to the user |
Backpropagation is just the chain rule applied systematically to a computational graph: upstream gradient × local gradient = downstream gradient, repeated at every node from output to input.