← Gleams
Stanford CS 231n · Lecture 4 · Neural Networks and Backpropagation

How Gradients Flow Backwards

A neural network is a chain of simple operations. Backpropagation is the trick that lets us compute how to adjust every single weight — by tracing blame backwards through the chain.

Computational Graphs Chain Rule Gate Patterns Jacobians
Roadmap

What You'll Master

Chapter 01

Beyond Linear Classifiers

Imagine you have a dataset of red and blue points on a 2D plane. The red points form a ring around the blue points — like a donut with a blueberry center. Can a linear classifier separate them? No. A linear classifier draws a single straight line (or hyperplane in higher dimensions). No matter how you rotate or shift that line, it will always misclassify some points.

This is the fundamental limitation of the linear score function we've been using: f = Wx. It can only carve the input space into regions separated by flat boundaries. Many real-world problems — like recognizing a "9" vs a "6", or distinguishing a cat from a dog — require curved, nonlinear decision boundaries.

The XOR Problem

The simplest example of linear failure is XOR. Four points on a 2D grid:

Worked Example — XOR Cannot Be Linearly Separated

Points: (0,0) → class 0, (0,1) → class 1, (1,0) → class 1, (1,1) → class 0.

Try drawing a line that puts (0,1) and (1,0) on one side, and (0,0) and (1,1) on the other. You can't. The "same class" points are diagonal to each other. No single line separates them.

A linear classifier computes w1x1 + w2x2 + b. For (0,1) to be positive: w2 + b > 0. For (1,0) to be positive: w1 + b > 0. For (0,0) to be negative: b < 0. For (1,1) to be negative: w1 + w2 + b < 0. Adding the first two: w1 + w2 + 2b > 0. But the last constraint says w1 + w2 + b < 0, meaning w1 + w2 < −b. Combined with b < 0: w1 + w2 + 2b > 0 implies w1 + w2 > −2b > 0 > −b. Contradiction.

The Feature Transform Trick

The CS 231n lecture shows a beautiful insight: what if we transform the inputs before classifying? Take those donut-shaped points in (x, y) coordinates and convert them to polar coordinates (r, θ). Suddenly, the ring of red points has large r values and the blue center has small r values. A simple threshold on r separates them perfectly.

But here's the problem: who designs the transform? For polar coordinates, we used domain knowledge. For images of cats and dogs, what's the right transform? Nobody knows. That's where neural networks come in: they learn the transform.

The Core Insight

A neural network is a learned feature transform followed by a linear classifier. The first layers learn to warp the input space until the classes become linearly separable. The last layer draws the separating hyperplane. The whole thing is trained end-to-end.

Definition
Nonlinear Decision Boundary

A curved surface in input space that separates classes. Linear classifiers can only produce flat (hyperplane) boundaries. To get curved boundaries, we need either hand-crafted feature transforms or learned ones (neural networks).

How Many Templates Does One Layer Learn?

A linear classifier for CIFAR-10 learns exactly 10 templates — one per class. Each row of the weight matrix W (10×3072) acts as a template that gets compared against the input image via dot product. The problem: a single "horse" template must account for horses facing left, right, running, standing, in daylight, at dusk. It ends up as a blurry average of all horse configurations.

Worked Example — Why One Template Fails

Imagine the "car" template. Cars can be red, blue, or white. The single template averages these colors, producing a greyish blob. A red car activates the template moderately. A blue car activates it moderately. Neither activates it strongly, because the template is a compromised average.

With 100 hidden units, you could learn: template #1 = "red car shape," template #2 = "blue car front," template #3 = "white sedan side." The output layer combines: score(car) = 0.8 × template1 + 0.5 × template2 + 0.3 × template3. Now each sub-template can specialize.

Chapter 02

Neural Networks

A linear classifier computes f = Wx. A 2-layer neural network computes f = W2 max(0, W1x). That's it. Two matrix multiplications with a nonlinear function sandwiched between them. But this small change has enormous consequences.

Linear Classifier f = Wx

2-Layer Neural Network f = W2 · max(0, W1 · x)

W1 maps input to a hidden layer, max(0, ·) is ReLU, W2 maps hidden layer to scores

Let's trace through it. Say x is a 3072-dimensional image (32×32×3 CIFAR-10 image flattened). W1 is a 100×3072 matrix, producing a 100-dimensional hidden layer h = W1x. We apply max(0, ·) elementwise to get h' = max(0, h). Then W2 is a 10×100 matrix, mapping h' to 10 class scores.

Worked Example — A Tiny 2-Layer Network

Input: x = [1, -2]. W1 = [[2, 1], [-1, 3]], b1 = [0, 0]. W2 = [[1, -1]], b2 = [0].

Step 1: h = W1x + b1 = [2·1 + 1·(-2), -1·1 + 3·(-2)] = [0, -7].

Step 2: h' = max(0, h) = max(0, [0, -7]) = [0, 0].

Step 3: f = W2h' + b2 = [1·0 + (-1)·0] = [0].

Both hidden units got killed by ReLU. Let's try x = [1, 2]: h = [4, 5], h' = [4, 5], f = [4 - 5] = [-1]. Now the network computes something nontrivial.

Why the Nonlinearity Is Essential

What if we skip the max(0, ·) and just stack two linear layers? Then f = W2(W1x) = (W2W1)x = W'x. It collapses back to a single linear classifier! Two linear transforms compose into one linear transform. The nonlinearity between layers is what prevents this collapse and gives the network its power.

Without Nonlinearity, Depth Is Useless

Stacking 100 linear layers without activation functions gives you exactly the same representational power as a single layer. The product W100 · W99 · ... · W1 is just one big matrix. You've burned 100x the parameters to learn a linear function. The activation function is not optional — it's what makes depth meaningful.

Learning Templates

In a linear classifier, each row of W is a "template" for one class. There's exactly one template per class. A horse template must capture all possible horses — different orientations, colors, backgrounds — in a single linear filter. That's why linear classifiers produce blurry, averaged templates.

A 2-layer network with 100 hidden units learns 100 templates in W1. These aren't class-specific — they're shared features like "edge at 45 degrees" or "brown patch." W2 then combines templates: "horse = template 3 + template 17 - template 42." Each class can be a different mix of the 100 learned features.

Universal Approximation

A 2-layer neural network with enough hidden units can approximate any continuous function to arbitrary precision (Cybenko, 1989; Hornik, 1991). This is the Universal Approximation Theorem. It doesn't say the network will find the right function (that's an optimization problem), only that it has the capacity to represent it. Think of it as: "the architecture is not the bottleneck."

Deeper Networks

3-Layer Neural Network f = W3 · max(0, W2 · max(0, W1 · x))

Each additional layer adds another level of learned abstraction. Layer 1 might learn edges. Layer 2 combines edges into parts (eyes, wheels). Layer 3 combines parts into objects (cats, cars). This hierarchical composition is the deep in deep learning.

Definition
Fully-Connected Network (MLP)

A multi-layer perceptron (MLP) is a neural network where every neuron in one layer connects to every neuron in the next. Also called a "fully-connected network" or "dense network." The architecture is defined by the number of layers and the width (number of neurons) of each layer.

Chapter 03

Activation Functions

The nonlinear function applied after each linear layer is called the activation function. The choice of activation function matters more than you might think — it affects whether gradients flow well during training, whether neurons die, and how fast the network learns.

Sigmoid

Sigmoid σ(x) = 1 / (1 + e−x)

Output range: (0, 1). Derivative: σ(x) · (1 − σ(x))

Sigmoid squashes any input into the range (0, 1). Historically popular because it resembles the firing rate of a biological neuron. But it has three serious problems:

Sigmoid Problems

1. Saturated neurons kill gradients. When the input is very positive or very negative, σ(x) is near 0 or 1. The derivative σ(x)(1−σ(x)) is near zero. During backpropagation, the gradient gets multiplied by this near-zero value and effectively vanishes.

2. Outputs are not zero-centered. If the input to a neuron is always positive (which happens when the previous layer uses sigmoid), then the gradients on the weights are always all positive or all negative. This causes zig-zagging during gradient descent.

3. exp() is expensive. Minor, but relevant at scale.

Worked Example — Gradient Vanishing in Sigmoid

Input x = 10. σ(10) = 1/(1 + e-10) ≈ 0.99995. Derivative: 0.99995 × 0.00005 = 0.0000454.

If the upstream gradient is 1.0, the downstream gradient is 0.0000454 — four orders of magnitude smaller. Stack 5 layers of sigmoid and the gradient at the first layer is essentially zero. The network cannot learn.

Tanh

Tanh tanh(x) = (ex − e−x) / (ex + e−x)

Output range: (−1, 1). Derivative: 1 − tanh2(x). Zero-centered!

Tanh fixes the zero-centering problem (outputs are in [−1, 1]), but still saturates at the extremes. It was the default choice before ReLU, and is still used in some architectures (e.g., LSTMs).

ReLU: The Modern Default

Rectified Linear Unit (ReLU) f(x) = max(0, x)

Output range: [0, ∞). Derivative: 0 if x < 0, 1 if x > 0.

ReLU is stupidly simple: if the input is negative, output zero. If positive, pass it through unchanged. No exponentials, no division. It trains 6x faster than sigmoid in practice (Krizhevsky et al., 2012).

Why ReLU Works So Well

For positive inputs, the gradient is exactly 1. No matter how deep the network, if a neuron is active, it passes the gradient through undistorted. This solves the vanishing gradient problem that plagued sigmoid and tanh networks. The gradient is either 0 (neuron off) or 1 (neuron on) — no values in between that could shrink over many layers.

Dead Neurons

If a ReLU neuron's input is always negative (for every training example), its output is always 0 and its gradient is always 0. It can never update its weights. It's dead. This can happen if the learning rate is too large — a large gradient update pushes the weights so that the neuron's pre-activation is always negative. Once dead, always dead. In practice, ~10-20% of neurons in a ReLU network may die during training.

Leaky ReLU and ELU

Leaky ReLU f(x) = x if x > 0, αx if x ≤ 0    (typically α = 0.01)

ELU (Exponential Linear Unit) f(x) = x if x > 0, α(ex − 1) if x ≤ 0

Leaky ReLU fixes dead neurons by allowing a small gradient (α = 0.01) when the input is negative. The neuron is never completely off, so it can always recover. ELU goes further by producing negative outputs that push the mean activation closer to zero, which helps with learning dynamics.

Activation Functions — Interactive Explorer
Toggle between activation functions to see their shapes and derivatives. Notice where the derivative goes to zero (gradient vanishing zones).
Sigmoid: squashes to (0, 1)
Practical Recommendation from CS 231n

Use ReLU. It's the default for almost all feedforward and convolutional networks. If you see dead neurons, try Leaky ReLU. Don't use sigmoid or tanh in hidden layers unless you have a specific reason (e.g., gates in LSTMs). Sigmoid in the output layer is fine for binary classification.

ActivationRangeGradient at x=0Dead Neurons?Use When
Sigmoid(0, 1)0.25SaturatesOutput layer for binary
Tanh(−1, 1)1.0SaturatesLSTM gates, RNN
ReLU[0, ∞)undefinedYesDefault everywhere
Leaky ReLU(−∞, ∞)undefinedNoIf ReLU dies
ELU(−α, ∞)NoWhen zero-mean matters
Chapter 04

Computational Graphs

We've been writing neural networks as formulas: f = W2 max(0, W1x). But to compute gradients efficiently, we need a different representation. We need to see the computation as a graph — a network of simple operations connected by data flow.

Definition
Computational Graph

A directed acyclic graph (DAG) where each node represents a simple operation (add, multiply, max, exp, etc.) and each edge carries a value. Inputs enter at the left, the loss comes out at the right. The forward pass flows left-to-right, computing the output. The backward pass flows right-to-left, computing gradients.

Consider a simple function: f(x, y, z) = (x + y) · z. We can break this into two operations:

Worked Example — Building a Computation Graph

f(x, y, z) = (x + y) · z. Let x = −2, y = 5, z = −4.

Node 1 (add): q = x + y = −2 + 5 = 3.

Node 2 (multiply): f = q · z = 3 · (−4) = −12.

The graph has three input nodes (x, y, z), one intermediate node (q), and one output node (f). Data flows left to right: inputs → add → multiply → output.

Computation Graph — Forward Pass
This graph computes f(x,y,z) = (x + y) × z. Each node shows its computed value. Follow the arrows left-to-right to see how the final result is built from simple operations.
x=−2, y=5, z=−4 f = (x + y) × z = −12

Why Graphs?

The point of the graph representation is modularity. Each node only needs to know two things: (1) how to compute its output from its inputs (the forward pass), and (2) how to compute the gradient of its output with respect to its inputs (the local gradient). It doesn't need to know anything about the rest of the network.

Modularity Is the Key

Deriving gradients for the whole network by hand is a nightmare of matrix calculus. But each individual node is trivial: the gradient of addition is 1, the gradient of multiplication is the other input, the gradient of max is 0 or 1. Backpropagation chains these trivial local gradients together to get the full network gradient. Simple parts, complex whole.

Any neural network — no matter how complex — can be expressed as a computational graph. The linear classifier f = Wx + b is three nodes: multiply, add bias, done. A 100-layer ResNet is thousands of nodes. But the gradient computation is always the same: forward pass to get values, backward pass to get gradients.

Worked Example — A Loss Function as a Graph

Full pipeline for a linear classifier with SVM loss:

Node 1: s = W · x (matrix multiply: scores)

Node 2: margins = sj − sy + 1 (hinge computation per class)

Node 3: Ldata = Σ max(0, margins) (sum of hinge losses)

Node 4: R = λ Σ W2 (regularization)

Node 5: L = Ldata + R (total loss)

Five nodes. Each one is simple. The gradient of the whole thing is computed by chaining local gradients backward through these five nodes.

Chapter 05

The Chain Rule

The chain rule from calculus is the mathematical engine behind backpropagation. If you understand the chain rule, you understand backprop. They are the same thing.

Single Variable

If y = f(x) and z = g(y), then z is a function of x through y. The chain rule says:

Chain Rule (Single Variable) dz/dx = (dz/dy) · (dy/dx)

"How much z changes when x changes" = "how much z changes when y changes" × "how much y changes when x changes"
Worked Example — Chain Rule with Numbers

Let y = 3x and z = y2. Then dy/dx = 3 and dz/dy = 2y = 2(3x) = 6x.

Chain rule: dz/dx = (dz/dy)(dy/dx) = 6x · 3 = 18x.

Check: z = y2 = (3x)2 = 9x2. Direct derivative: dz/dx = 18x. Matches!

Multivariable

In neural networks, a value often depends on x through multiple paths. The multivariable chain rule says: sum the contributions from all paths.

Chain Rule (Multivariable) If z = f(a, b) where a = g(x) and b = h(x):
dz/dx = (∂z/∂a) · (da/dx) + (∂z/∂b) · (db/dx)

Sum over ALL paths from x to z
Worked Example — Multiple Paths

Let a = 2x, b = x2, z = a + b = 2x + x2.

Path through a: ∂z/∂a · da/dx = 1 · 2 = 2.

Path through b: ∂z/∂b · db/dx = 1 · 2x = 2x.

Total: dz/dx = 2 + 2x. Check: z = 2x + x2, dz/dx = 2 + 2x. Correct.

The Connection to Backprop

In a computational graph, the chain rule tells us: to find how the loss L changes with respect to any variable x, multiply the upstream gradient (dL/dz, coming from the right) by the local gradient (dz/dx, computed at this node). That product is the downstream gradient passed to the left. This is literally all backpropagation does — at every node.

The Backprop Recipe

Gradient at Any Node
  1. upstream gradient: dL/dz — "how much does the loss change if this node's output changes?" (comes from the node to the right)
  2. local gradient: dz/dx — "how much does this node's output change if this node's input changes?" (computed from the node's own operation)
  3. downstream gradient: dL/dx = (dL/dz) · (dz/dx) — "how much does the loss change if this node's input changes?" (passed to the node to the left)
Why Multiply?

Think of it as unit conversion. dL/dz has units "loss per z-unit." dz/dx has units "z-units per x-unit." Multiplying: "loss per z-unit" × "z-units per x-unit" = "loss per x-unit" = dL/dx. The chain rule is dimensional analysis for derivatives.

Longer Chains

The chain rule extends to any number of composed functions. If a = f(x), b = g(a), c = h(b), L = j(c), then:

Extended Chain Rule dL/dx = (dL/dc) · (dc/db) · (db/da) · (da/dx)

Product of ALL local gradients along the chain from output to input.
Worked Example — Three-Link Chain

a = 2x (da/dx = 2). b = a3 (db/da = 3a2). L = −b (dL/db = −1).

At x = 1: a = 2, b = 8, L = −8.

dL/dx = (−1) · (3 × 4) · 2 = −1 · 12 · 2 = −24.

Check: L = −(2x)3 = −8x3. dL/dx = −24x2. At x = 1: −24. Matches.

The Vanishing Gradient Problem

If each local gradient is less than 1 (as happens with sigmoid, where the max gradient is 0.25), the product shrinks exponentially with depth. Ten sigmoid layers: gradient ≤ 0.2510 ≈ 10−6. The first layer's weights barely update. This is why deep sigmoid networks were considered untrainable before ReLU (which has local gradient exactly 1 when active) and residual connections.

Chapter 06

Backpropagation

Now we put it all together. Backpropagation is the algorithm that computes the gradient of the loss with respect to every weight in the network, by applying the chain rule systematically from right to left through the computational graph.

Full Worked Example

Let's trace through the complete forward and backward pass for f(x, y, z) = (x + y) · z, with x = −2, y = 5, z = −4.

Worked Example — Complete Backprop

FORWARD PASS (left to right):

q = x + y = −2 + 5 = 3

f = q · z = 3 · (−4) = −12

BACKWARD PASS (right to left):

Step 1 — Base case: df/df = 1 (the gradient of the output with respect to itself is always 1).

Step 2 — Multiply node: f = q · z.

   Local gradients: ∂f/∂q = z = −4, ∂f/∂z = q = 3.

   Downstream: df/dq = 1 · (−4) = −4. df/dz = 1 · 3 = 3.

Step 3 — Add node: q = x + y.

   Local gradients: ∂q/∂x = 1, ∂q/∂y = 1.

   Downstream: df/dx = (−4) · 1 = −4. df/dy = (−4) · 1 = −4.

Result: ∂f/∂x = −4, ∂f/∂y = −4, ∂f/∂z = 3.

Verification by Numerical Gradient

Check ∂f/∂x: change x from −2 to −1.999 (a tiny bump of +0.001). New f = (−1.999 + 5) · (−4) = 3.001 · (−4) = −12.004. Change in f: −12.004 − (−12) = −0.004. Gradient ≈ −0.004/0.001 = −4. Matches our analytical result exactly. Always check with numerical gradients.

A More Complex Example: Sigmoid

The lecture walks through a more involved example: f(w0, x0, w1, x1, w2) = 1 / (1 + e−(w0x0 + w1x1 + w2)).

This is a sigmoid applied to a dot product — exactly what one neuron computes. Let's set w0 = 2, x0 = −1, w1 = −3, x1 = −2, w2 = −3.

Worked Example — Sigmoid Backprop

Forward: w0x0 = 2·(−1) = −2. w1x1 = (−3)·(−2) = 6. Sum = −2 + 6 + (−3) = 1. Sigmoid: 1/(1 + e−1) ≈ 0.73.

Backward: The sigmoid has a beautiful local gradient: σ(x)(1 − σ(x)) = 0.73 · 0.27 ≈ 0.20.

Starting from df/df = 1:

• Gradient at sigmoid output: 1 × 0.20 = 0.20

• The sum node distributes: gradient to each input of the sum is 0.20

• At w0x0 multiply node: ∂/∂w0 = x0 × 0.20 = (−1)(0.20) = −0.20. ∂/∂x0 = w0 × 0.20 = 2(0.20) = 0.40.

• At w1x1 multiply node: ∂/∂w1 = x1 × 0.20 = (−2)(0.20) = −0.40. ∂/∂x1 = w1 × 0.20 = (−3)(0.20) = −0.60.

• ∂/∂w2 = 1 × 0.20 = 0.20 (bias is just added).

Sigmoid Local Gradient Derivation

σ(x) = 1/(1 + e−x) = (1 + e−x)−1.

dσ/dx = −(1 + e−x)−2 · (−e−x) = e−x / (1 + e−x)2.

Rewrite: = [1/(1 + e−x)] · [e−x/(1 + e−x)] = σ(x) · [(1 + e−x − 1)/(1 + e−x)]

= σ(x) · [1 − 1/(1 + e−x)] = σ(x) · (1 − σ(x)).

Maximum at x = 0 where σ(0) = 0.5: gradient = 0.5 × 0.5 = 0.25. This is the best case — it only gets smaller from here.

Grouping Nodes

The computational graph representation is not unique. We could break sigmoid into four nodes (negate, exp, add 1, reciprocal) and backprop through each. Or we could treat the entire sigmoid as one node with the local gradient σ(x)(1 − σ(x)). Both give the same answer. Choose the grouping where local gradients are easy to express. In practice, frameworks like PyTorch define custom backward functions for common compound operations.

Chapter 07

Gate Patterns

After doing enough backprop examples, you start to notice that each type of operation has a characteristic gradient behavior. These patterns are so fundamental that CS 231n gives them names. Once you recognize them, you can read gradients off a computational graph by inspection.

The Add Gate: Gradient Distributor

Add Gate z = x + y
∂z/∂x = 1,  ∂z/∂y = 1

The upstream gradient passes through unchanged to BOTH inputs. The add gate distributes.
Worked Example — Add Distributes

x = 3, y = 4. z = x + y = 7. Upstream gradient: dL/dz = 2.

dL/dx = dL/dz · ∂z/∂x = 2 · 1 = 2.

dL/dy = dL/dz · ∂z/∂y = 2 · 1 = 2.

Both inputs receive the same gradient. The add gate is a "gradient copier."

The Multiply Gate: Swap Multiplier

Multiply Gate z = x · y
∂z/∂x = y,  ∂z/∂y = x

Each input's gradient is scaled by THE OTHER input's value. Inputs swap!
Worked Example — Multiply Swaps

x = 2, y = 3. z = xy = 6. Upstream gradient: dL/dz = 5.

dL/dx = dL/dz · ∂z/∂x = 5 · y = 5 · 3 = 15.

dL/dy = dL/dz · ∂z/∂y = 5 · x = 5 · 2 = 10.

The gradient on x is scaled by y's value (3), and vice versa. If one input is large, the other gets a large gradient. This is why weight initialization matters — large activations mean large weight gradients.

The Max Gate: Gradient Router

Max Gate z = max(x, y)
∂z/∂x = 1 if x > y, else 0
∂z/∂y = 1 if y > x, else 0

The entire gradient flows to the winning input. The loser gets ZERO.
Worked Example — Max Routes

x = 4, y = 9. z = max(4, 9) = 9. Upstream gradient: dL/dz = 5.

dL/dx = 5 · 0 = 0 (x lost the max).

dL/dy = 5 · 1 = 5 (y won the max).

The max gate acts like a switch: it routes the entire gradient to whichever input was larger and blocks the other. This is exactly what happens in ReLU (max(0, x)): if x > 0, gradient passes through; if x < 0, gradient is zero.

The Copy Gate: Gradient Adder

Copy Gate (Fan-Out) When one value is used in multiple places: y1 = x, y2 = x
dL/dx = dL/dy1 + dL/dy2

Gradients from all downstream consumers ADD together at the fan-out point.
Worked Example — Copy Adds

x is used in two places: one path gives dL/dy1 = 4, the other gives dL/dy2 = 2.

dL/dx = 4 + 2 = 6.

This is the multivariable chain rule at work: x influences L through two separate paths, so we sum the contributions.

Gate Patterns — How Gradients Flow
Three gates side by side. Green values flow forward (left to right). Orange values flow backward (right to left). Notice how each gate type transforms the gradient differently.
Forward: green | Backward: orange Add distributes, Mul swaps, Max routes
The Four Patterns Summarized

Add distributes gradient equally to all inputs. Multiply swaps and scales — each input's gradient is the upstream times the other input's value. Max routes the full gradient to the winner, zero to the loser. Copy (fan-out) adds gradients from all downstream paths. These four patterns compose to describe gradient flow through any computational graph.

Chapter 08

Vectorized Backprop

So far, all our examples used scalar values. Real neural networks operate on vectors and matrices. When we move from scalars to vectors, derivatives become Jacobian matrices.

The Derivative Zoo

InputOutputDerivativeShape
ScalarScalarRegular derivative dy/dx1 × 1
Vector (D)ScalarGradient ∇xyD × 1
Vector (D)Vector (M)Jacobian ∂y/∂xM × D
Definition
Jacobian Matrix

For a function f: RD → RM, the Jacobian J is an M×D matrix where Jij = ∂yi/∂xj. Each entry says: "if I nudge the j-th input by a tiny amount, how much does the i-th output change?" The Jacobian contains ALL partial derivative information.

Vector Chain Rule

The chain rule with vectors: if z = f(x) and the loss is a scalar L, then:

Vector Chain Rule dL/dx = (dz/dx)T · dL/dz

[D×M] · [M×1] = [D×1]. Jacobian transpose times upstream gradient.

The key insight: dL/dx always has the same shape as x. If x is a 100-dimensional vector, dL/dx is also a 100-dimensional vector. If x is a 64×64 matrix, dL/dx is also 64×64. The gradient tells you how to adjust each element.

Example: ReLU on a Vector

Worked Example — Vectorized ReLU Backward

Input: x = [1, −2, 3, −1]. f(x) = max(0, x) = [1, 0, 3, 0]. Upstream: dL/dz = [4, −1, 5, 9].

Jacobian: ReLU is elementwise, so the Jacobian is diagonal:

J = [[1,0,0,0], [0,0,0,0], [0,0,1,0], [0,0,0,0]]

Diagonal entries are 1 where xi > 0, and 0 where xi ≤ 0.

Backward: dL/dx = JT · dL/dz = [1·4, 0·(−1), 1·5, 0·9] = [4, 0, 5, 0].

Gradient passes through where ReLU was active, and is zeroed where ReLU was off. In code, this is just: dx = dz * (x > 0). One line.

Never Form the Full Jacobian

For a neural network layer with 4096 inputs and 4096 outputs, the Jacobian is 4096×4096 = 16 million entries — about 64 MB per layer. For the matrix multiply y = Wx, the Jacobian is even worse: it's (N·M) × (N·M) if we flatten all inputs and outputs. That could be 256 GB! Instead, we compute the Jacobian-vector product implicitly. For ReLU, it's an elementwise mask. For matrix multiply, it's another matrix multiply.

Matrix Multiply Backward

For y = x · W where x is [N×D] and W is [D×M]:

Matrix Multiply Gradients dL/dx = dL/dy · WT    [N×D] = [N×M] · [M×D]
dL/dW = xT · dL/dy    [D×M] = [D×N] · [N×M]

These are the only arrangements that make the shapes match. That's how you remember them.
Worked Example — Matrix Multiply Backward

x = [[2, 1, −3], [−3, 4, 2]] (2×3). W = [[3,2,1,−1],[2,1,3,2],[3,2,1,−2]] (3×4).

y = xW = [[13, 9, −2, −6], [5, 2, 17, 1]] (2×4).

Upstream: dL/dy = [[2, 3, −3, 9], [−8, 1, 4, 6]] (2×4).

dL/dx = dL/dy · WT (2×4 · 4×3 = 2×3). Think: "for each element of x, which elements of y did it affect? The whole row of y, through the corresponding row of W."

dL/dW = xT · dL/dy (3×2 · 2×4 = 3×4). Same shape as W. Each weight gradient is the sum over the batch of "input activation times output gradient."

Intuition for dL/dW = xT dL/dy

Consider one weight Wjk. It only affects yik for all batch elements i, through yik = Σj xij Wjk. So ∂yik/∂Wjk = xij. The gradient: dL/dWjk = Σi dL/dyik · xij. In matrix form, that's the (j,k) entry of xT · dL/dy. The shapes force this to be true — there's only one way to multiply a (D×N) matrix with an (N×M) matrix to get a (D×M) result.

The Shape-Matching Heuristic

Here's a practical trick from CS 231n that saves enormous amounts of time: you can often derive the gradient formula by shape-matching. You know dL/dx must have the same shape as x. You know dL/dy (the upstream gradient) has some shape. There's usually only one way to multiply the available matrices to get the right output shape.

Worked Example — Shape Matching for Bias Gradient

In a layer y = xW + b, where x is [N×D], W is [D×M], b is [1×M], and y is [N×M].

dL/dy is [N×M]. We need dL/db which is [1×M].

The only way to reduce [N×M] to [1×M] is to sum over the batch dimension N.

So: dL/db = Σi=1N dL/dyi = sum over rows of dL/dy. In NumPy: db = np.sum(dy, axis=0).

Why? Because b is broadcast across all N batch elements, so the gradient accumulates from all of them (the "copy gate adds" pattern from Chapter 7).

Full Neural Network Layer Backward

Putting it all together, the backward pass for one fully-connected layer y = max(0, xW + b) involves three steps:

FC + ReLU Layer Backward
  1. ReLU backward: dL/d(pre-relu) = dL/dy ⊙ (xW + b > 0). Elementwise mask.
  2. Bias backward: dL/db = sum over batch of dL/d(pre-relu). Shape: [1×M].
  3. Weight backward: dL/dW = xT · dL/d(pre-relu). Shape: [D×M].
  4. Input backward: dL/dx = dL/d(pre-relu) · WT. Shape: [N×D]. Pass this to the previous layer.
Worked Example — Full Layer with Numbers

x = [[1, 2]] (1×2), W = [[0.5, -0.3], [0.8, 0.1]] (2×2), b = [[0.1, -0.2]].

Forward: xW = [[1·0.5 + 2·0.8, 1·(-0.3) + 2·0.1]] = [[2.1, -0.1]]. Plus b: [[2.2, -0.3]]. ReLU: [[2.2, 0]].

Upstream dL/dy = [[1, 1]].

ReLU backward: mask = [[1, 0]] (second element was negative). dL/d(pre) = [[1·1, 1·0]] = [[1, 0]].

dL/db = [[1, 0]].

dL/dW = xT · [[1, 0]] = [[1],[2]] · [[1, 0]] = [[1, 0], [2, 0]].

dL/dx = [[1, 0]] · WT = [[1, 0]] · [[0.5, 0.8], [-0.3, 0.1]] = [[0.5, 0.8]].

The second weight column gets zero gradient because ReLU killed that output. The first column gets updated normally.

Memory Cost of Backprop

Notice that to compute dL/dW = xT dL/dy, we need the input x. To compute the ReLU mask, we need the pre-activation values. The forward pass must cache all intermediate activations for the backward pass to use. For a network with L layers, each storing an [N×D] activation matrix, the memory cost is O(L × N × D). This is why large batch sizes on deep networks run out of GPU memory — it's not the parameters, it's the cached activations.

Chapter 09

Interactive Backprop Calculator

Time to put it all together. Below is a computational graph for f(x, y, z) = (x + y) · z. You can step through the backward pass one node at a time, watching gradients flow from right to left. At each step, you'll see the upstream gradient, the local gradient, and the computed downstream gradient.

Hit "Forward" to compute values, then "Step Backward" to trace gradients one node at a time. The currently active node is highlighted in gold.

Backpropagation Step-Through
Step through the backward pass of f = (x + y) × z. Watch how the upstream gradient at each node gets multiplied by the local gradient to produce the downstream gradient.
Press Forward to compute values
What to Notice

1. The backward pass starts at the output with gradient = 1 (base case). 2. At the multiply node, the gradient on q is z (−4) and the gradient on z is q (3) — the "swap" pattern. 3. At the add node, both inputs (x and y) get the same gradient (−4) — the "distribute" pattern. 4. The whole computation costs the same as TWO forward passes — no matrix inversions, no expensive solvers.

Sanity Check — Numerical Gradients

After running the backward pass, verify each gradient:

• df/dx: Bump x from −2 to −1.999. New f = (−1.999+5)(−4) = 3.001(−4) = −12.004. Gradient ≈ −0.004/0.001 = −4. Matches.

• df/dy: Bump y from 5 to 5.001. New f = (−2+5.001)(−4) = 3.001(−4) = −12.004. Gradient ≈ −4. Matches.

• df/dz: Bump z from −4 to −3.999. New f = 3(−3.999) = −11.997. Gradient ≈ (−11.997+12)/0.001 = 3. Matches.

Interpreting the Gradients

Now that we have the gradients, what do they mean? Remember, a gradient tells you: "if I increase this input by a tiny amount, how does the output change?"

Worked Example — Reading the Gradient

df/dx = −4 means: if x increases by 1, f decreases by 4. Since we want to minimize f (it's our loss), we should increase x (move in the direction opposite to the gradient for minimization, same direction for maximization).

df/dz = 3 means: if z increases by 1, f increases by 3. To minimize f, we should decrease z.

The magnitude tells you sensitivity. f is more sensitive to x and y (gradient magnitude 4) than it would be to an input with gradient 0.1. Gradient magnitude = how much this input matters.

A Larger Example: Full Sigmoid Neuron

Let's step through the complete backprop for a sigmoid neuron, combining all the gate patterns:

Worked Example — Sigmoid Neuron Step-By-Step

f(w, x) = 1 / (1 + e−(w·x + b)). Let w = 2, x = −1, b = 3.

Forward (left to right):

1. a = w · x = 2 × (−1) = −2    [multiply gate]

2. c = a + b = −2 + 3 = 1    [add gate]

3. d = −c = −1    [negate gate]

4. e = exp(d) = e−1 ≈ 0.368    [exp gate]

5. g = 1 + e = 1.368    [add gate]

6. f = 1/g ≈ 0.731    [reciprocal gate]

Backward (right to left):

6. df/df = 1. Reciprocal: df/dg = −1/g2 = −1/1.3682 ≈ −0.534

5. Add gate distributes: df/d(1) = −0.534, df/de = −0.534

4. Exp gate: local grad = ed = 0.368. df/dd = −0.534 × 0.368 ≈ −0.197

3. Negate: df/dc = −0.197 × (−1) = 0.197

2. Add distributes: df/da = 0.197, df/db = 0.197

1. Multiply swaps: df/dw = 0.197 × x = 0.197 × (−1) = −0.197. df/dx = 0.197 × w = 0.197 × 2 = 0.394

Shortcut check: σ(1)(1−σ(1)) = 0.731 × 0.269 ≈ 0.197. Then dw = 0.197 × x = −0.197. Matches perfectly.

Chapter 10

Implementation & Connections

In real code, we don't manually derive gradients for each new network. Instead, each operation is a module with two methods: forward() and backward(). The framework chains them automatically.

The Modular API

forward() / backward() Pattern
  1. forward(inputs): Compute the output from inputs. Cache any values needed for the backward pass (like the input itself, for a multiply gate). Return the output.
  2. backward(upstream_gradient): Use the cached values and the upstream gradient to compute the downstream gradient. Return the gradient with respect to each input.
Worked Example — Multiply Gate Module

forward(x, y): cache x and y. Return x * y.

backward(dout): dx = dout * self.y (cached). dy = dout * self.x (cached). Return (dx, dy).

This is actual PyTorch code, simplified. Every operation in PyTorch — matmul, relu, sigmoid, softmax, conv2d — implements exactly this pattern.

Worked Example — Flat Backprop Implementation (Python)

Here's the CS 231n example as Python pseudocode for f(w0, x0, w1, x1, w2) = σ(w0x0 + w1x1 + w2):

w0, x0, w1, x1, w2 = 2, -1, -3, -2, -3

Forward:

d0 = w0 * x0 # -2 (multiply)

d1 = w1 * x1 # 6 (multiply)

d2 = d0 + d1 # 4 (add)

d3 = d2 + w2 # 1 (add bias)

f = 1.0 / (1 + math.exp(-d3)) # 0.731 (sigmoid)

Backward:

df = 1.0 # base case

dd3 = f * (1 - f) * df # 0.197 (sigmoid grad)

dd2 = 1.0 * dd3 # 0.197 (add passes through)

dw2 = 1.0 * dd3 # 0.197 (add passes through)

dd0 = 1.0 * dd2 # 0.197 (add distributes)

dd1 = 1.0 * dd2 # 0.197 (add distributes)

dw0 = x0 * dd0 # -0.197 (mul swaps)

dx0 = w0 * dd0 # 0.394 (mul swaps)

dw1 = x1 * dd1 # -0.394 (mul swaps)

dx1 = w1 * dd1 # -0.591 (mul swaps)

Notice the pattern: backward is literally forward in reverse, with each line replaced by its gradient rule. This is why it's called "backpropagation" — you propagate gradients backward through the same computation graph.

Autograd: Automatic Differentiation

Modern frameworks (PyTorch, JAX, TensorFlow) implement autograd: they automatically build the computational graph during the forward pass and compute gradients by calling backward() on each node in reverse order. You never write backward() yourself unless you're defining a new custom operation.

Definition
Autograd

Automatic differentiation is the mechanical process of applying the chain rule to a program. The framework records every operation during the forward pass in a "tape" (the computational graph). When you call .backward(), it replays the tape in reverse, calling each node's backward() method. The result: exact gradients for every parameter, computed in a single backward pass.

Worked Example — PyTorch Autograd

The entire backprop we did by hand can be done in 4 lines:

x = torch.tensor(-2.0, requires_grad=True)

y = torch.tensor(5.0, requires_grad=True)

z = torch.tensor(-4.0, requires_grad=True)

f = (x + y) * z

f.backward()

Now x.grad = -4, y.grad = -4, z.grad = 3. Exactly what we computed by hand. PyTorch built the graph (add → multiply), then ran backward through it automatically.

The Big Picture: What Forward/Backward Means

Summary from CS 231n Lecture 4

Forward pass: Compute the result of each operation and save any intermediates needed for gradient computation in memory.

Backward pass: Apply the chain rule to compute the gradient of the loss with respect to every input, parameter, and intermediate value.

That's it. Neural network training is: (1) forward pass to compute loss, (2) backward pass to compute gradients, (3) gradient descent to update weights. Repeat millions of times.

Practical Considerations

ConsiderationDetails
MemoryForward pass must cache activations for backward. A 100-layer network stores 100 layers of activations. This is why GPU memory is the bottleneck, not compute.
Gradient checkingAlways verify analytical gradients with numerical ones: (f(x+h) − f(x−h)) / 2h. Use h ≈ 10−5. If relative error > 10−5, you have a bug.
Computational costBackward pass is ~2x the cost of forward pass. Total training step ≈ 3x forward pass.
Custom opsIf you define a new operation, you must write both forward() and backward(). Get backward() wrong and training silently fails (weights don't update correctly).

A Complete Training Loop

To see how all the pieces fit, here's the complete flow for training a 2-layer neural network. This is what CS 231n implements in about 20 lines of NumPy:

Training a 2-Layer Network
  1. Initialize: W1 = random(D, H) × 0.01, W2 = random(H, C) × 0.01. Small random weights.
  2. Forward: h = max(0, x · W1). scores = h · W2. loss = softmax_loss(scores, y) + λ(||W1||2 + ||W2||2).
  3. Backward: dscores from softmax loss. dW2 = hT · dscores. dh = dscores · W2T. dh_pre = dh ⊙ (h > 0). dW1 = xT · dh_pre. Add regularization gradients: dW1 += 2λW1, dW2 += 2λW2.
  4. Update: W1 -= lr × dW1. W2 -= lr × dW2.
  5. Repeat for 1000+ iterations.
Worked Example — One Gradient Descent Step

Input x = [[1, 2]], true label y = 0, learning rate = 0.1.

After forward: loss = 1.38. After backward: dW1 = [[0.05, -0.02], [0.10, -0.04]].

Update: W1[0][0] was 0.50, becomes 0.50 - 0.1 × 0.05 = 0.495.

Each weight moves a tiny bit in the direction that reduces the loss. After thousands of these tiny nudges, the weights converge to values that correctly classify the training data.

The Remarkable Simplicity

The entire learning algorithm is three operations in a loop: forward (compute loss), backward (compute gradients), update (gradient descent). Everything else — batch normalization, dropout, learning rate schedules, Adam optimizer — is refinement of these three steps. The core has not changed since Rumelhart, Hinton, and Williams described backpropagation in 1986.

What Comes Next

Convolutional Neural Networks (CS 231n Lecture 5): Instead of fully-connected layers, use filters that slide over the input spatially. The backward pass for convolutions follows the exact same chain rule — it's just a different forward operation with a different local gradient.

Training tricks: Batch normalization, dropout, residual connections — all of these are just new nodes in the computational graph, each with their own forward() and backward().

Beyond first-order: Backpropagation computes first derivatives. Second-order methods (Newton's method, natural gradient) need the Hessian — the Jacobian of the gradient. This is a matrix of size (parameters × parameters) and is far too expensive for modern networks with billions of parameters.

Connections to Other Lessons

TopicConnection
TransformersSelf-attention is a differentiable operation with its own forward/backward. The gradients flow through Q, K, V projections via the chain rule.
Diffusion modelsThe noise prediction network is trained with backprop. The denoising loss is just another node in the graph.
Policy gradient (RL)REINFORCE uses a different kind of gradient (score function estimator) because the environment is not differentiable. When the dynamics ARE differentiable, backprop through time gives exact gradients.
GANsTwo computational graphs (generator + discriminator) with gradients flowing between them. The generator's loss depends on the discriminator's output.

Historical Context

Backpropagation was not invented once. The chain rule has existed since Leibniz (1676). Automatic differentiation was described by Wengert (1964). The specific application to neural networks was explored by Linnainmaa (1970), Werbos (1974), and popularized by Rumelhart, Hinton, and Williams (1986). But it took hardware (GPUs), data (ImageNet), and architectural innovations (ReLU, batch norm, residual connections) for backprop to become the engine of modern AI.

YearMilestoneContribution
1676LeibnizChain rule of calculus
1964WengertAutomatic differentiation
1970LinnainmaaReverse-mode autodiff (the algorithm behind backprop)
1974WerbosApplied to neural networks in PhD thesis
1986Rumelhart, Hinton, WilliamsPopularized backprop, showed it works on real problems
2012Krizhevsky, Sutskever, HintonAlexNet: backprop + ReLU + GPU = ImageNet breakthrough
2015PyTorch / TensorFlowAutograd frameworks make backprop invisible to the user
The One Sentence

Backpropagation is just the chain rule applied systematically to a computational graph: upstream gradient × local gradient = downstream gradient, repeated at every node from output to input.