The paper that made gradient descent through multi-layer neural networks practical — and showed that hidden layers learn meaningful internal representations automatically.
Imagine you are coaching a basketball team. Your team loses a game 80-95. You know the final score — the error — but who is responsible? The point guard missed three assists. The center failed two blocks. The shooting guard missed a free throw at a critical moment. How much blame does each player deserve for the loss?
This is the credit assignment problem: given an error at the output, how do you distribute blame to the internal components that contributed to it?
In 1986, neural networks faced exactly this problem. A single-layer network (a perceptron) could learn from its errors easily — you just adjust the weights connecting inputs to outputs. But single-layer networks could only solve simple, linearly separable problems. Everyone knew that adding hidden layers would make networks more powerful. The trouble was: how do you adjust the weights in the hidden layers?
The hidden units have no direct target. There's no teacher saying "hidden unit 3 should have output 0.7." The error is only defined at the output layer. The hidden layers are invisible — they get no direct error signal.
A network with hidden layers produces an error at the output. The question: how much blame does each hidden unit deserve? Click "Propagate Blame" to see the key idea of backpropagation.
The idea wasn't entirely new. Paul Werbos described it in his 1974 PhD thesis. Seppo Linnainmaa published the mathematical formulation in 1970. But Rumelhart, Hinton, and Williams' 1986 Nature paper was the one that demonstrated it convincingly on real problems, showed that hidden layers learn interpretable representations, and made the AI community pay attention.
By the end of this lesson, you'll understand exactly how backpropagation works — not just the algorithm, but why it has the form it does. You'll implement it from scratch, watch it solve problems that stumped an entire generation of AI researchers, and see the internal representations it discovers.
To understand why backpropagation matters, you need to understand what came before it — and what broke.
The perceptron, invented by Frank Rosenblatt in 1958, was the first neural network that could learn from data. It's a single-layer network: inputs connect directly to outputs through a set of weights. The output is computed as:
Where σ is a nonlinear activation function (originally a step function, later a sigmoid). The perceptron can learn any linearly separable function — any function where you can draw a straight line (or hyperplane) separating the positive from the negative examples.
AND is linearly separable. OR is linearly separable. But XOR — "one or the other but not both" — is not.
Each panel shows the four possible inputs to a two-input logic gate. Green = output 1, red = output 0. A perceptron can learn a function only if a single line can separate green from red. Toggle between gates to see which ones are learnable.
In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a book that proved single-layer networks cannot compute XOR — or any non-linearly-separable function. The proof is simple: no single line can separate (0,0) and (1,1) (both output 0) from (0,1) and (1,0) (both output 1). They're interleaved.
Everyone knew that multi-layer networks could solve XOR. The architecture was obvious: add a hidden layer. But training was the problem. The perceptron learning rule adjusts weights based on the error at the output. With a hidden layer, you have weights connecting inputs to hidden units, and weights connecting hidden units to outputs. The output weights are easy — you can see the error. But the hidden weights? The error doesn't directly tell you what they should be.
This stalemate lasted nearly two decades. The Minsky-Papert result was widely (and incorrectly) interpreted as proving neural networks were fundamentally limited. Funding dried up. Researchers moved to other areas. It was the first AI winter.
With one hidden layer, a neural network can approximate any continuous function (the universal approximation theorem, proved by Cybenko in 1989). With two hidden layers, it can approximate any function at all. The expressive power was never the problem. The problem was always: how do you train them?
| Architecture | Can compute | Cannot compute | Trainable? |
|---|---|---|---|
| Single layer (perceptron) | AND, OR, NOT, linear functions | XOR, parity, symmetry | Yes (perceptron rule) |
| Two layers (1 hidden) | Any continuous function | Some discontinuous functions | No method until 1986 |
| Three+ layers | Any function | Nothing (universal) | No method until 1986 |
Backpropagation is the chain rule applied to computation graphs. That's it. If you understand the chain rule from calculus, you understand the mathematical core of backpropagation. Everything else is bookkeeping.
The chain rule says: if y = f(g(x)), then the derivative of y with respect to x is:
In words: the rate at which y changes with x equals the rate at which y changes with g, times the rate at which g changes with x. Each link in the chain contributes a multiplicative factor.
Let's make this concrete. Suppose we have a simple chain of computations:
How does c change when we nudge x? By the chain rule:
Each operation in the chain contributes its local gradient: how much the output of that operation changes per unit change in its input. The chain rule says we multiply all the local gradients together to get the total gradient.
A chain of operations: x → multiply by 2 → add 1 → square = output. Drag x to change the input and watch the local gradients and total gradient update. The total gradient is always the product of local gradients.
In a real network, a single value often feeds into multiple downstream operations. What then? The multivariate chain rule says: add the contributions from each path.
If a feeds into both b and c, and both feed into d:
This is why we use partial derivatives (∂) in neural networks — each variable can affect the output through multiple paths, and we sum over all of them.
Think of each operation as a node in a graph. During the forward pass, values flow from inputs to outputs. During the backward pass, gradients flow from outputs to inputs. Each node receives a gradient from above (how much the final loss changes per unit change in its output), multiplies it by its local gradient, and sends the result downward.
This is exactly what backpropagation does. It's gradient message passing through a computation graph.
python # Chain rule by hand for: c = (2x + 1)^2 x = 3.0 # Forward pass a = 2 * x # a = 6 b = a + 1 # b = 7 c = b ** 2 # c = 49 # Backward pass: compute dc/dx dc_db = 2 * b # local gradient of squaring: 2b = 14 db_da = 1.0 # local gradient of addition: 1 da_dx = 2.0 # local gradient of multiplication: 2 dc_dx = dc_db * db_da * da_dx # 14 * 1 * 2 = 28 print(f"dc/dx = {dc_dx}") # 28.0
Before we can propagate errors backward, we need to understand how a network computes its output in the first place. This is the forward pass: data flows from input to output, layer by layer.
Rumelhart et al. use a specific architecture. Each unit j in the network computes two things:
This is the net input — the weighted sum of all inputs to unit j. Then:
This is the output — the net input passed through the sigmoid (also called the logistic function). The sigmoid squashes any real number into the range (0, 1).
The sigmoid's shape is an S-curve that maps (−∞, +∞) to (0, 1). At net = 0, the output is exactly 0.5. For large positive inputs, it saturates near 1. For large negative inputs, near 0.
Its derivative — the slope of the S-curve — peaks at net = 0 (where the slope is 0.25) and approaches zero at the extremes. This will matter enormously when we discuss gradient flow.
The sigmoid (orange) squashes inputs to (0,1). Its derivative (teal) peaks at 0.25 when net=0. Drag the slider to trace a point and see both values. Notice how the derivative vanishes for large |net|.
Let's trace a complete forward pass through a network with 2 inputs, 2 hidden units, and 1 output. All weights and biases are given:
| Connection | Weight |
|---|---|
| x1 → h1 | w11 = 0.5 |
| x2 → h1 | w12 = −0.3 |
| x1 → h2 | w21 = 0.8 |
| x2 → h2 | w22 = 0.2 |
| h1 → y | v1 = 0.6 |
| h2 → y | v2 = −0.4 |
With inputs x1 = 1, x2 = 1 (and biases of 0 for simplicity):
The network outputs 0.510. If the target was 1.0, the error is 0.49. Now we need backpropagation to figure out how to adjust all six weights to reduce this error.
python import numpy as np def sigmoid(x): return 1 / (1 + np.exp(-x)) def sigmoid_deriv(s): """Derivative of sigmoid, given sigmoid output s.""" return s * (1 - s) # Forward pass x = np.array([1.0, 1.0]) W_hidden = np.array([[0.5, -0.3], # weights to h1 [0.8, 0.2]]) # weights to h2 W_out = np.array([0.6, -0.4]) h_net = W_hidden @ x # [0.2, 1.0] h_out = sigmoid(h_net) # [0.550, 0.731] y_net = W_out @ h_out # 0.038 y_out = sigmoid(y_net) # 0.510 print(f"Hidden outputs: {h_out}") # [0.550, 0.731] print(f"Network output: {y_out:.3f}") # 0.510
Now comes the heart of the paper. We have a network that produced output y = 0.510 when the target was t = 1.0. The error (using sum-of-squared-errors) is:
We want to compute ∂E/∂w for every weight w in the network, then adjust each weight in the direction that reduces E. The backward pass computes all these gradients in a single sweep from output to input.
For the output unit, we compute the error signal δ:
This has two parts: how wrong we are (t − y) and how sensitive the output is to changes in its input (σ'). Plugging in our numbers:
Here's where the magic happens. A hidden unit j has no direct target. But we can compute how much it contributed to the output error by looking at the weight connecting it to the output and the output's error signal:
In words: the error signal for hidden unit j is its local sigmoid derivative, times the weighted sum of error signals from all units it feeds into. This is the backward propagation of errors — we literally propagate δ values backward through the network, just as we propagated activations forward.
For our network:
Notice: h2 gets a negative error signal because its weight to the output is negative. The network is telling h2: "You're pushing the output in the wrong direction through your negative connection."
Once we have δ for every unit, the weight update is simple:
Where η is the learning rate. The weight change is proportional to the error signal at the receiving unit, times the output of the sending unit.
Watch δ values propagate backward from the output layer to the hidden layer. Each δ is the product of the local sigmoid derivative and the weighted sum of downstream δs. Click "Step" to advance through the backward pass.
python # Backward pass (continuing from forward pass code above) target = 1.0 lr = 0.5 # learning rate # Output error signal delta_out = (target - y_out) * sigmoid_deriv(y_out) # Hidden layer error signals delta_hidden = sigmoid_deriv(h_out) * (W_out * delta_out) # Weight updates W_out += lr * delta_out * h_out # shape: (2,) W_hidden += lr * np.outer(delta_hidden, x) # shape: (2,2) print(f"delta_out: {delta_out:.4f}") print(f"delta_hidden: {delta_hidden}")
Let's derive the hidden layer δ from scratch. We want ∂E/∂wji, where wji is a weight feeding into hidden unit j. The error is E = ½ ∑k(tk − ok)2 where k ranges over output units.
By the chain rule, we decompose this into how E changes with the net input to j:
The second factor is simple: since netj = ∑i wji oi, we have ∂netj/∂wji = oi (the input from unit i).
The first factor, ∂E/∂netj, is what we call −δj. For a hidden unit j that feeds into output units k:
Since netk = ∑j wkj oj, we have ∂netk/∂oj = wkj. And ∂oj/∂netj = σ'(netj) = oj(1 − oj).
Putting it together:
This is exactly the formula from the paper. The δ at a hidden unit is the product of its local sigmoid derivative and the weighted sum of δs from all downstream units. The weight update rule follows directly:
Rumelhart et al. called their algorithm the generalized delta rule. The original delta rule (Widrow-Hoff, 1960) was: Δw = η · (target − output) · input. It could only train single-layer networks because it required a target for each unit.
The generalized delta rule extends this to any number of layers by replacing (target − output) with the backpropagated error signal δ. The complete algorithm is:
That's the entire algorithm. Four steps, repeated for every training example. The paper proves that this procedure performs gradient descent on the total error surface — every weight update moves in the direction of steepest descent.
The error E is a function of all the weights in the network. Imagine a landscape in weight space where the height at each point is the error. We want to find the lowest valley — the weight configuration with minimum error.
The gradient ∇E points in the direction of steepest ascent. By moving in the opposite direction (−∇E), we go downhill. The learning rate η controls the step size. Too large and we overshoot valleys. Too small and we crawl.
A 1D slice of an error landscape. The ball starts at a random position and rolls downhill following the negative gradient. Adjust the learning rate to see how step size affects convergence.
One concern with gradient descent is local minima — valleys in the error surface that are not the deepest point overall. Gradient descent, being a greedy algorithm, can get stuck in these shallow valleys. Rumelhart et al. acknowledged this concern but argued (and experiments confirmed) that backpropagation works well in practice because:
Modern understanding (Choromanska et al., 2015) suggests that in deep networks, local minima are not the main problem — most critical points are saddle points (flat in some directions, downhill in others). Modern optimizers like Adam handle these well.
The paper also introduced momentum — an enhancement where each weight update includes a fraction of the previous update:
Where α is the momentum coefficient (typically 0.9). Momentum helps in two ways: it accelerates convergence in consistent gradient directions (like a ball gaining speed on a consistent slope), and it dampens oscillations in directions where the gradient keeps changing sign.
Rumelhart et al. proposed two modes of training:
Modern practice uses mini-batch training (Bottou, 1991) — a compromise where gradients are averaged over a small batch (typically 32-256 examples). This combines the noise of online training (which helps escape local minima) with the efficiency of batch training (which enables GPU parallelism).
python # Full training loop with momentum lr = 0.5 momentum = 0.9 prev_dW_hidden = np.zeros_like(W_hidden) prev_dW_out = np.zeros_like(W_out) for epoch in range(1000): for x, t in training_data: # Forward h = sigmoid(W_hidden @ x) y = sigmoid(W_out @ h) # Backward d_out = (t - y) * sigmoid_deriv(y) d_hid = sigmoid_deriv(h) * (W_out * d_out) # Update with momentum dW_out = lr * d_out * h + momentum * prev_dW_out dW_hid = lr * np.outer(d_hid, x) + momentum * prev_dW_hidden W_out += dW_out W_hidden += dW_hid prev_dW_out = dW_out prev_dW_hidden = dW_hid
Here's the part that changed everything. Rumelhart, Hinton, and Williams didn't just show that backpropagation works — they showed that hidden layers learn meaningful internal representations. Nobody told the network what features to extract. It discovered them on its own.
Consider a task where the network must detect whether a binary string is symmetric — whether it reads the same forwards and backwards. "101101" is symmetric. "110010" is not. With a 6-bit input, there are 26 = 64 possible strings, of which 8 are symmetric.
A human looking at this problem would pair up bits: is bit 1 the same as bit 6? Is bit 2 the same as bit 5? Is bit 3 the same as bit 4? Three comparisons determine symmetry. Remarkably, the hidden layer learns exactly this representation.
Before this paper, the common view was that hidden layers were mysterious "black boxes" — mathematical necessities with no interpretable function. Rumelhart et al. showed that hidden units are feature detectors. They learn whatever features are useful for the task at hand.
This is the origin of the entire field of representation learning — the idea that the right features aren't hand-designed but learned. Today's deep learning success (convolutional nets learning edge detectors, transformers learning syntactic heads) traces directly back to this demonstration.
A 6-input network with 3 hidden units learns to detect symmetry. After training, each hidden unit has learned to compare one pair of corresponding bits. Click "Train" to watch the hidden unit weight patterns emerge.
To understand what a hidden unit has learned, we examine its weight vector — the set of weights connecting all inputs to that unit. For the symmetry task with 6 inputs:
The output unit then combines these with AND-like logic: the string is symmetric if and only if ALL three pairs match. Backpropagation discovered this decomposition entirely on its own.
This is remarkable because the network wasn't given any hint about the pairing structure. It could have learned any arbitrary mapping from 6-bit strings to 0/1. Instead, it found the simplest decomposition — the one a human would design. This happens because gradient descent, with its smooth optimization landscape, naturally finds solutions with structure. Highly structured solutions tend to have lower error across the entire training set, while arbitrary memorization requires more complex (higher-curvature) weight configurations.
The paper includes another striking demonstration: an auto-encoder. The input is an 8-bit one-hot vector (representing 8 items). The network has 8 inputs, 3 hidden units, and 8 outputs. The target is the same as the input — the network must reproduce its input.
With only 3 hidden units, the network is forced to compress 8 items into a 3-bit code. After training, each hidden unit learns to act as one bit of a binary encoding: the 8 items are mapped to the 8 possible 3-bit binary strings (000, 001, 010, ..., 111).
The network invented binary encoding. Nobody taught it. The bottleneck forced it to discover an efficient representation, and backpropagation found it.
| Item | One-hot input | Learned hidden code | Binary equivalent |
|---|---|---|---|
| 1 | [1,0,0,0,0,0,0,0] | [0,0,0] | 000 |
| 2 | [0,1,0,0,0,0,0,0] | [0,0,1] | 001 |
| 3 | [0,0,1,0,0,0,0,0] | [0,1,0] | 010 |
| 4 | [0,0,0,1,0,0,0,0] | [0,1,1] | 011 |
| 5 | [0,0,0,0,1,0,0,0] | [1,0,0] | 100 |
| 6 | [0,0,0,0,0,1,0,0] | [1,0,1] | 101 |
| 7 | [0,0,0,0,0,0,1,0] | [1,1,0] | 110 |
| 8 | [0,0,0,0,0,0,0,1] | [1,1,1] | 111 |
Now let's watch backpropagation solve the problem that stumped a generation. XOR — the function that a single-layer perceptron cannot learn — falls to a two-layer network trained with backpropagation.
The XOR truth table:
| x1 | x2 | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Our network: 2 inputs, 2 hidden units (with sigmoid), 1 output (with sigmoid). We train with learning rate 0.5 and momentum 0.9. The network starts with random weights and learns to compute XOR through backpropagation.
Watch a 2-2-1 network learn XOR in real time. The left panel shows the network with weight values (line thickness = magnitude, color = sign). The right panel shows the decision boundary evolving. The hidden layer learns to create a linearly separable representation.
After training, the hidden layer has learned two decision boundaries. Hidden unit 1 computes something like "x1 OR x2" and hidden unit 2 computes something like "x1 AND x2." The output unit then computes "h1 AND NOT h2" — which is exactly XOR.
The hidden layer has re-represented the inputs. In the original (x1, x2) space, the four points are not linearly separable. In the hidden (h1, h2) space, they are:
In hidden space, (0,1) and (1,0) map to nearly the same point (both get high h1, low h2), while (0,0) gets low h1 and (1,1) gets high h1 AND high h2. A single line in hidden space now separates the classes.
python import numpy as np # XOR dataset X = np.array([[0,0],[0,1],[1,0],[1,1]]) T = np.array([0,1,1,0]) # Network: 2 → 2 → 1 np.random.seed(42) W1 = np.random.randn(2,2) * 0.5 b1 = np.zeros(2) W2 = np.random.randn(2) * 0.5 b2 = 0.0 def sig(x): return 1/(1+np.exp(-x)) def sig_d(s): return s*(1-s) lr = 2.0 for epoch in range(5000): for i in range(4): # Forward h = sig(X[i] @ W1 + b1) y = sig(h @ W2 + b2) # Backward d2 = (T[i] - y) * sig_d(y) d1 = sig_d(h) * W2 * d2 # Update W2 += lr * d2 * h b2 += lr * d2 W1 += lr * np.outer(d1, X[i]) b1 += lr * d1 # Test for x in X: h = sig(x @ W1 + b1) y = sig(h @ W2 + b2) print(f"{x} -> {y:.3f}") # [0, 0] -> 0.013 # [0, 1] -> 0.987 # [1, 0] -> 0.987 # [1, 1] -> 0.014
The most impressive demonstration in the paper goes far beyond XOR. Rumelhart et al. trained a network on family relationships: given a person and a relationship type ("Colin" + "has-father"), predict the answer ("James"). The network was given no information about family structure — just a flat list of (person, relationship, answer) triples.
The family tree had two isomorphic families (English and Italian) with members connected by relationships like has-father, has-mother, has-wife, has-son, has-daughter, has-aunt, has-uncle.
After training, the first hidden layer had learned features that encode abstract properties of people — their generation (child/parent/grandparent), their nationality (English/Italian), and their branch in the family tree. No one told the network about generations or nationalities. It discovered these abstract features because they were useful for predicting relationships.
Even more remarkably, the network could generalize to held-out relationship triples it had never seen. Because the English and Italian families had the same structure, learning "Colin has-father James" helped the network predict "Roberto has-father Emilio" — even though it had never seen that particular triple. The network had learned the concept of "father," not just a lookup table.
A simplified family tree experiment. The network learns to predict relationships from (person, relation) pairs. After training, examine the hidden layer activations to see that the network discovers generation and nationality features on its own.
A lookup table can memorize every (person, relationship, answer) triple. But it cannot generalize. The network's learned representations allow it to generalize because they capture the underlying structure of the domain. This is the difference between memorizing that 3+4=7 and understanding addition.
Hinton later called these learned representations distributed representations — each concept is represented by a pattern of activity across many units, and each unit participates in representing many concepts. This is fundamentally different from local representations (one unit = one concept) and is the key to generalization.
In a local representation, "Colin" is unit #1 and "Roberto" is unit #15. Learning that Colin's father is James teaches you nothing about Roberto. The two representations share no structure.
In a distributed representation, Colin might be encoded as [generation=child, nationality=English, branch=left] and Roberto as [generation=child, nationality=Italian, branch=left]. The two encodings share two of three features. So learning the "father" relationship for one helps predict it for the other — the network has learned that "father" means "same branch, one generation up," and this rule applies regardless of nationality.
This is exactly how modern word embeddings work: "king" and "queen" share features (royalty, authority) and differ on one (gender), enabling the analogy king − man + woman ≈ queen.
| Property | Lookup table | Learned representations |
|---|---|---|
| Storage | O(n) for n facts | O(d · k) for d dims, k concepts |
| Generalization | None — only returns stored facts | Yes — uses structural similarity |
| Features | None — raw identifiers | Discovered — generation, nationality, etc. |
| Graceful degradation | Exact match or nothing | Similar inputs give similar outputs |
The 1986 backpropagation paper is arguably the most important paper in the history of deep learning. Not because the algorithm was new — Werbos, Linnainmaa, and others had the math earlier — but because Rumelhart, Hinton, and Williams demonstrated convincingly that:
Every modern deep learning system uses backpropagation or a descendant of it.
| Development | Year | Connection to backprop |
|---|---|---|
| Convolutional neural networks (LeCun) | 1989 | Backprop through weight-shared convolutions |
| LSTM (Hochreiter & Schmidhuber) | 1997 | Solved vanishing gradients in backprop through time |
| Deep learning revival (Hinton) | 2006 | Pre-training + backprop fine-tuning |
| AlexNet (Krizhevsky) | 2012 | Backprop + ReLU + GPU = ImageNet revolution |
| Transformers (Vaswani) | 2017 | Still trained end-to-end with backprop |
| GPT-4 (OpenAI) | 2023 | Backprop at trillion-parameter scale |
"What I cannot create, I do not understand." — Richard Feynman
The 1986 Nature paper has been cited over 50,000 times. It is among the most cited papers in all of computer science. Hinton went on to win the Turing Award (2018) alongside Yann LeCun and Yoshua Bengio — three researchers whose careers were built on the foundation of backpropagation.
The paper's most lasting insight is not the algorithm (which others had discovered independently) but the demonstration that learned representations are meaningful. This single idea — that neural networks discover useful features automatically — is the intellectual foundation of the entire deep learning revolution. Every time a convolutional network learns edge detectors, every time a language model learns grammar, every time an embedding captures analogy structure — it's backpropagation learning internal representations, exactly as Rumelhart, Hinton, and Williams showed 40 years ago.