Backpropagation (Rumelhart 1986)

Chapter 0: The Credit Assignment Problem

Imagine you are coaching a basketball team. Your team loses a game 80-95. You know the final score — the error — but who is responsible? The point guard missed three assists. The center failed two blocks. The shooting guard missed a free throw at a critical moment. How much blame does each player deserve for the loss?

This is the credit assignment problem: given an error at the output, how do you distribute blame to the internal components that contributed to it?

In 1986, neural networks faced exactly this problem. A single-layer network (a perceptron) could learn from its errors easily — you just adjust the weights connecting inputs to outputs. But single-layer networks could only solve simple, linearly separable problems. Everyone knew that adding hidden layers would make networks more powerful. The trouble was: how do you adjust the weights in the hidden layers?

The hidden units have no direct target. There's no teacher saying "hidden unit 3 should have output 0.7." The error is only defined at the output layer. The hidden layers are invisible — they get no direct error signal.

The Credit Assignment Problem

A network with hidden layers produces an error at the output. The question: how much blame does each hidden unit deserve? Click "Propagate Blame" to see the key idea of backpropagation.

Click to see blame flow backward

The breakthrough: Rumelhart, Hinton, and Williams showed that the chain rule from calculus provides a systematic way to propagate error signals backward through any number of layers. Each hidden unit receives blame in proportion to how much it contributed to the output error. This idea — backpropagation — is the foundation of all modern deep learning.

The idea wasn't entirely new. Paul Werbos described it in his 1974 PhD thesis. Seppo Linnainmaa published the mathematical formulation in 1970. But Rumelhart, Hinton, and Williams' 1986 Nature paper was the one that demonstrated it convincingly on real problems, showed that hidden layers learn interpretable representations, and made the AI community pay attention.

By the end of this lesson, you'll understand exactly how backpropagation works — not just the algorithm, but why it has the form it does. You'll implement it from scratch, watch it solve problems that stumped an entire generation of AI researchers, and see the internal representations it discovers.

What is the credit assignment problem in neural networks?

Given an error at the output, determining how much each hidden unit contributed to that error — so we know how to adjust weights we can't directly supervise Deciding which neurons to add or remove from the network Choosing the right number of layers for a given problem

Chapter 1: Perceptron Limits

To understand why backpropagation matters, you need to understand what came before it — and what broke.

The perceptron, invented by Frank Rosenblatt in 1958, was the first neural network that could learn from data. It's a single-layer network: inputs connect directly to outputs through a set of weights. The output is computed as:

y = σ(w₁x₁ + w₂x₂ + b)

Where σ is a nonlinear activation function (originally a step function, later a sigmoid). The perceptron can learn any linearly separable function — any function where you can draw a straight line (or hyperplane) separating the positive from the negative examples.

AND is linearly separable. OR is linearly separable. But XOR — "one or the other but not both" — is not.

Linear Separability: AND, OR, and XOR

Each panel shows the four possible inputs to a two-input logic gate. Green = output 1, red = output 0. A perceptron can learn a function only if a single line can separate green from red. Toggle between gates to see which ones are learnable.

In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a book that proved single-layer networks cannot compute XOR — or any non-linearly-separable function. The proof is simple: no single line can separate (0,0) and (1,1) (both output 0) from (0,1) and (1,0) (both output 1). They're interleaved.

The key insight: The XOR problem isn't hard because it's complex — it has only four data points! It's hard because it requires creating an internal representation that transforms the inputs into a space where they become linearly separable. A hidden layer does exactly this: it re-represents the inputs in a new coordinate system where XOR becomes solvable.

Everyone knew that multi-layer networks could solve XOR. The architecture was obvious: add a hidden layer. But training was the problem. The perceptron learning rule adjusts weights based on the error at the output. With a hidden layer, you have weights connecting inputs to hidden units, and weights connecting hidden units to outputs. The output weights are easy — you can see the error. But the hidden weights? The error doesn't directly tell you what they should be.

This stalemate lasted nearly two decades. The Minsky-Papert result was widely (and incorrectly) interpreted as proving neural networks were fundamentally limited. Funding dried up. Researchers moved to other areas. It was the first AI winter.

What multi-layer networks can do

With one hidden layer, a neural network can approximate any continuous function (the universal approximation theorem, proved by Cybenko in 1989). With two hidden layers, it can approximate any function at all. The expressive power was never the problem. The problem was always: how do you train them?

Architecture	Can compute	Cannot compute	Trainable?
Single layer (perceptron)	AND, OR, NOT, linear functions	XOR, parity, symmetry	Yes (perceptron rule)
Two layers (1 hidden)	Any continuous function	Some discontinuous functions	No method until 1986
Three+ layers	Any function	Nothing (universal)	No method until 1986

Why can't a single-layer perceptron solve XOR?

XOR has too many inputs for a perceptron to handle The four XOR input-output pairs are not linearly separable — no single line can divide the 0-outputs from the 1-outputs Perceptrons cannot use nonlinear activation functions

Chapter 2: The Chain Rule

Backpropagation is the chain rule applied to computation graphs. That's it. If you understand the chain rule from calculus, you understand the mathematical core of backpropagation. Everything else is bookkeeping.

The chain rule says: if y = f(g(x)), then the derivative of y with respect to x is:

dy/dx = (dy/dg) · (dg/dx)

In words: the rate at which y changes with x equals the rate at which y changes with g, times the rate at which g changes with x. Each link in the chain contributes a multiplicative factor.

Let's make this concrete. Suppose we have a simple chain of computations:

x = 3

Input value

↓ a = 2x

a = 6

First operation: multiply by 2

↓ b = a + 1

b = 7

Second operation: add 1

↓ c = b²

c = 49

Third operation: square

How does c change when we nudge x? By the chain rule:

dc/dx = (dc/db) · (db/da) · (da/dx) = 2b · 1 · 2 = 2(7) · 1 · 2 = 28

Each operation in the chain contributes its local gradient: how much the output of that operation changes per unit change in its input. The chain rule says we multiply all the local gradients together to get the total gradient.

Chain Rule on a Computation Graph

A chain of operations: x → multiply by 2 → add 1 → square = output. Drag x to change the input and watch the local gradients and total gradient update. The total gradient is always the product of local gradients.

x 3.0

The key idea for neural networks: A neural network is just a long chain of operations: multiply by weights, add bias, apply activation, multiply by more weights, add bias, apply activation... The chain rule lets us compute dc/dw for any weight w, no matter how deep in the network it is. We just multiply local gradients backward from the output to that weight.

Branching: when outputs feed multiple paths

In a real network, a single value often feeds into multiple downstream operations. What then? The multivariate chain rule says: add the contributions from each path.

If a feeds into both b and c, and both feed into d:

∂d/∂a = (∂d/∂b)(∂b/∂a) + (∂d/∂c)(∂c/∂a)

This is why we use partial derivatives (∂) in neural networks — each variable can affect the output through multiple paths, and we sum over all of them.

The chain rule as message passing

Think of each operation as a node in a graph. During the forward pass, values flow from inputs to outputs. During the backward pass, gradients flow from outputs to inputs. Each node receives a gradient from above (how much the final loss changes per unit change in its output), multiplies it by its local gradient, and sends the result downward.

This is exactly what backpropagation does. It's gradient message passing through a computation graph.

python
# Chain rule by hand for: c = (2x + 1)^2
x = 3.0

# Forward pass
a = 2 * x           # a = 6
b = a + 1           # b = 7
c = b ** 2          # c = 49

# Backward pass: compute dc/dx
dc_db = 2 * b        # local gradient of squaring: 2b = 14
db_da = 1.0          # local gradient of addition: 1
da_dx = 2.0          # local gradient of multiplication: 2

dc_dx = dc_db * db_da * da_dx  # 14 * 1 * 2 = 28
print(f"dc/dx = {dc_dx}")       # 28.0

If y = f(g(h(x))), how do you compute dy/dx?

Multiply the local gradients: dy/dx = (dy/dg)(dg/dh)(dh/dx) — each operation in the chain contributes a multiplicative factor Add the local gradients: dy/dx = (dy/dg) + (dg/dh) + (dh/dx) Take the derivative of only the outermost function: dy/dx = f'(x)

Chapter 3: The Forward Pass

Before we can propagate errors backward, we need to understand how a network computes its output in the first place. This is the forward pass: data flows from input to output, layer by layer.

Rumelhart et al. use a specific architecture. Each unit j in the network computes two things:

net_j = ∑_i w_ji · o_i

This is the net input — the weighted sum of all inputs to unit j. Then:

o_j = σ(net_j) = 1 / (1 + e^−net_j)

This is the output — the net input passed through the sigmoid (also called the logistic function). The sigmoid squashes any real number into the range (0, 1).

Why sigmoid? Two reasons. First, it's differentiable everywhere — unlike the step function used in the original perceptron. This is essential for gradient-based learning. Second, its derivative has a beautiful form: σ'(x) = σ(x)(1 − σ(x)). You can compute the derivative from the output alone, without needing to store the input. This makes backpropagation computationally efficient.

The sigmoid and its derivative

The sigmoid's shape is an S-curve that maps (−∞, +∞) to (0, 1). At net = 0, the output is exactly 0.5. For large positive inputs, it saturates near 1. For large negative inputs, near 0.

Its derivative — the slope of the S-curve — peaks at net = 0 (where the slope is 0.25) and approaches zero at the extremes. This will matter enormously when we discuss gradient flow.

The Sigmoid Function and Its Derivative

The sigmoid (orange) squashes inputs to (0,1). Its derivative (teal) peaks at 0.25 when net=0. Drag the slider to trace a point and see both values. Notice how the derivative vanishes for large |net|.

net 0.0

A concrete forward pass

Let's trace a complete forward pass through a network with 2 inputs, 2 hidden units, and 1 output. All weights and biases are given:

Connection	Weight
x₁ → h₁	w₁₁ = 0.5
x₂ → h₁	w₁₂ = −0.3
x₁ → h₂	w₂₁ = 0.8
x₂ → h₂	w₂₂ = 0.2
h₁ → y	v₁ = 0.6
h₂ → y	v₂ = −0.4

With inputs x₁ = 1, x₂ = 1 (and biases of 0 for simplicity):

Hidden unit 1

net₁ = 0.5(1) + (−0.3)(1) = 0.2 → o₁ = σ(0.2) = 0.550

Hidden unit 2

net₂ = 0.8(1) + 0.2(1) = 1.0 → o₂ = σ(1.0) = 0.731

↓

Output

net_y = 0.6(0.550) + (−0.4)(0.731) = 0.038 → y = σ(0.038) = 0.510

The network outputs 0.510. If the target was 1.0, the error is 0.49. Now we need backpropagation to figure out how to adjust all six weights to reduce this error.

python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(s):
    """Derivative of sigmoid, given sigmoid output s."""
    return s * (1 - s)

# Forward pass
x = np.array([1.0, 1.0])
W_hidden = np.array([[0.5, -0.3],   # weights to h1
                     [0.8,  0.2]])  # weights to h2
W_out = np.array([0.6, -0.4])

h_net = W_hidden @ x          # [0.2, 1.0]
h_out = sigmoid(h_net)        # [0.550, 0.731]
y_net = W_out @ h_out          # 0.038
y_out = sigmoid(y_net)         # 0.510

print(f"Hidden outputs: {h_out}")  # [0.550, 0.731]
print(f"Network output: {y_out:.3f}")  # 0.510

Why does the 1986 paper use the sigmoid activation function instead of a step function?

The sigmoid is faster to compute than a step function The sigmoid is differentiable everywhere, which is required for gradient-based learning — and its derivative can be computed cheaply from its output: σ'(x) = σ(x)(1 − σ(x)) The sigmoid always outputs exactly 0 or 1

Chapter 4: The Backward Pass

Now comes the heart of the paper. We have a network that produced output y = 0.510 when the target was t = 1.0. The error (using sum-of-squared-errors) is:

E = ½ (t − y)² = ½ (1.0 − 0.510)² = 0.120

We want to compute ∂E/∂w for every weight w in the network, then adjust each weight in the direction that reduces E. The backward pass computes all these gradients in a single sweep from output to input.

Step 1: Output layer gradient

For the output unit, we compute the error signal δ:

δ_output = (t − y) · σ'(net_y) = (t − y) · y(1 − y)

This has two parts: how wrong we are (t − y) and how sensitive the output is to changes in its input (σ'). Plugging in our numbers:

δ_output = (1.0 − 0.510) · 0.510 · (1 − 0.510) = 0.490 · 0.250 = 0.122

Step 2: Hidden layer gradient

Here's where the magic happens. A hidden unit j has no direct target. But we can compute how much it contributed to the output error by looking at the weight connecting it to the output and the output's error signal:

δ_j = σ'(net_j) · ∑_k w_kj · δ_k

In words: the error signal for hidden unit j is its local sigmoid derivative, times the weighted sum of error signals from all units it feeds into. This is the backward propagation of errors — we literally propagate δ values backward through the network, just as we propagated activations forward.

For our network:

δ_h1 = o₁(1 − o₁) · v₁ · δ_out = 0.550(0.450) · 0.6 · 0.122 = 0.0181

δ_h2 = o₂(1 − o₂) · v₂ · δ_out = 0.731(0.269) · (−0.4) · 0.122 = −0.0096

Notice: h₂ gets a negative error signal because its weight to the output is negative. The network is telling h₂: "You're pushing the output in the wrong direction through your negative connection."

Step 3: Weight updates

Once we have δ for every unit, the weight update is simple:

Δw_ji = η · δ_j · o_i

Where η is the learning rate. The weight change is proportional to the error signal at the receiving unit, times the output of the sending unit.

Backward Pass: Error Signal Propagation

Watch δ values propagate backward from the output layer to the hidden layer. Each δ is the product of the local sigmoid derivative and the weighted sum of downstream δs. Click "Step" to advance through the backward pass.

Step 0: Forward pass complete

python
# Backward pass (continuing from forward pass code above)
target = 1.0
lr = 0.5  # learning rate

# Output error signal
delta_out = (target - y_out) * sigmoid_deriv(y_out)

# Hidden layer error signals
delta_hidden = sigmoid_deriv(h_out) * (W_out * delta_out)

# Weight updates
W_out    += lr * delta_out * h_out      # shape: (2,)
W_hidden += lr * np.outer(delta_hidden, x)  # shape: (2,2)

print(f"delta_out: {delta_out:.4f}")
print(f"delta_hidden: {delta_hidden}")

The beautiful symmetry: Forward pass computes activations layer by layer, left to right. Backward pass computes error signals layer by layer, right to left. Each uses the same weights and the same structure — just in opposite directions. This symmetry is why it's called backpropagation.

Full derivation: why does δ have this form?

Let's derive the hidden layer δ from scratch. We want ∂E/∂w_ji, where w_ji is a weight feeding into hidden unit j. The error is E = ½ ∑_k(t_k − o_k)² where k ranges over output units.

By the chain rule, we decompose this into how E changes with the net input to j:

∂E/∂w_ji = (∂E/∂net_j) · (∂net_j/∂w_ji)

The second factor is simple: since net_j = ∑_i w_ji o_i, we have ∂net_j/∂w_ji = o_i (the input from unit i).

The first factor, ∂E/∂net_j, is what we call −δ_j. For a hidden unit j that feeds into output units k:

∂E/∂net_j = ∑_k (∂E/∂net_k) · (∂net_k/∂o_j) · (∂o_j/∂net_j)

Since net_k = ∑_j w_kj o_j, we have ∂net_k/∂o_j = w_kj. And ∂o_j/∂net_j = σ'(net_j) = o_j(1 − o_j).

Putting it together:

δ_j = o_j(1 − o_j) ∑_k w_kj δ_k

This is exactly the formula from the paper. The δ at a hidden unit is the product of its local sigmoid derivative and the weighted sum of δs from all downstream units. The weight update rule follows directly:

Δw_ji = −η ∂E/∂w_ji = η δ_j o_i

Why this derivation matters: The δ formula isn't arbitrary — it's a direct consequence of the chain rule applied to the error function. Every factor has a clear meaning: o_j(1 − o_j) tells us how sensitive unit j's output is to its input, and ∑ w_kjδ_k tells us how much unit j's output matters to the final error. The product gives us exactly how much blame unit j deserves.

How does a hidden unit's error signal δ get computed?

By multiplying its local sigmoid derivative by the weighted sum of error signals from all downstream units — the error is literally propagated backward through the weights By comparing the hidden unit's output directly to a target value By averaging the error signals of all other hidden units

Chapter 5: The Generalized Delta Rule

Rumelhart et al. called their algorithm the generalized delta rule. The original delta rule (Widrow-Hoff, 1960) was: Δw = η · (target − output) · input. It could only train single-layer networks because it required a target for each unit.

The generalized delta rule extends this to any number of layers by replacing (target − output) with the backpropagated error signal δ. The complete algorithm is:

Step 1: Forward

Compute all activations o_j from input to output

↓

Step 2: Output δ

δ_j = (t_j − o_j) · o_j(1 − o_j) for output units

↓

Step 3: Hidden δ

δ_j = o_j(1 − o_j) ∑_k w_kjδ_k for hidden units

↓

Step 4: Update

Δw_ji = η · δ_j · o_i for all weights

↻ repeat for next training example

That's the entire algorithm. Four steps, repeated for every training example. The paper proves that this procedure performs gradient descent on the total error surface — every weight update moves in the direction of steepest descent.

Why gradient descent works

The error E is a function of all the weights in the network. Imagine a landscape in weight space where the height at each point is the error. We want to find the lowest valley — the weight configuration with minimum error.

The gradient ∇E points in the direction of steepest ascent. By moving in the opposite direction (−∇E), we go downhill. The learning rate η controls the step size. Too large and we overshoot valleys. Too small and we crawl.

Gradient Descent on an Error Surface

A 1D slice of an error landscape. The ball starts at a random position and rolls downhill following the negative gradient. Adjust the learning rate to see how step size affects convergence.

Learning rate η 0.10

Step 0 — Click Step to descend

The error surface and local minima

One concern with gradient descent is local minima — valleys in the error surface that are not the deepest point overall. Gradient descent, being a greedy algorithm, can get stuck in these shallow valleys. Rumelhart et al. acknowledged this concern but argued (and experiments confirmed) that backpropagation works well in practice because:

The high-dimensional error surface has many paths downhill — getting truly stuck is rare
Multiple random initializations can find different solutions; the best one is kept
Momentum helps escape shallow local minima by building up speed

Modern understanding (Choromanska et al., 2015) suggests that in deep networks, local minima are not the main problem — most critical points are saddle points (flat in some directions, downhill in others). Modern optimizers like Adam handle these well.

Momentum

The paper also introduced momentum — an enhancement where each weight update includes a fraction of the previous update:

Δw_ji(n) = η · δ_j · o_i + α · Δw_ji(n−1)

Where α is the momentum coefficient (typically 0.9). Momentum helps in two ways: it accelerates convergence in consistent gradient directions (like a ball gaining speed on a consistent slope), and it dampens oscillations in directions where the gradient keeps changing sign.

Physics analogy: Think of a ball rolling on the error surface. Without momentum, the ball has no mass — it moves exactly in the direction of the local slope at each step. With momentum, the ball has inertia — it builds up speed in consistent downhill directions and resists sudden changes. In narrow ravines (common in neural network loss landscapes), momentum prevents the zig-zagging that pure gradient descent exhibits.

Batch vs. online training

Rumelhart et al. proposed two modes of training:

Online (stochastic): Update weights after every single training example. This is noisy but provides frequent updates. Each update is cheap.
Batch: Accumulate gradients over all training examples, then update once. This gives a more accurate gradient but is slower per update.

Modern practice uses mini-batch training (Bottou, 1991) — a compromise where gradients are averaged over a small batch (typically 32-256 examples). This combines the noise of online training (which helps escape local minima) with the efficiency of batch training (which enables GPU parallelism).

python
# Full training loop with momentum
lr = 0.5
momentum = 0.9
prev_dW_hidden = np.zeros_like(W_hidden)
prev_dW_out = np.zeros_like(W_out)

for epoch in range(1000):
    for x, t in training_data:
        # Forward
        h = sigmoid(W_hidden @ x)
        y = sigmoid(W_out @ h)

        # Backward
        d_out = (t - y) * sigmoid_deriv(y)
        d_hid = sigmoid_deriv(h) * (W_out * d_out)

        # Update with momentum
        dW_out = lr * d_out * h + momentum * prev_dW_out
        dW_hid = lr * np.outer(d_hid, x) + momentum * prev_dW_hidden

        W_out += dW_out
        W_hidden += dW_hid
        prev_dW_out = dW_out
        prev_dW_hidden = dW_hid

What does the "generalized" in "generalized delta rule" refer to?

It generalizes to networks with more neurons per layer It generalizes the single-layer delta rule to multi-layer networks by replacing the direct (target − output) error with backpropagated error signals δ It generalizes to different activation functions beyond sigmoid

Chapter 6: Internal Representations

Here's the part that changed everything. Rumelhart, Hinton, and Williams didn't just show that backpropagation works — they showed that hidden layers learn meaningful internal representations. Nobody told the network what features to extract. It discovered them on its own.

Consider a task where the network must detect whether a binary string is symmetric — whether it reads the same forwards and backwards. "101101" is symmetric. "110010" is not. With a 6-bit input, there are 2⁶ = 64 possible strings, of which 8 are symmetric.

A human looking at this problem would pair up bits: is bit 1 the same as bit 6? Is bit 2 the same as bit 5? Is bit 3 the same as bit 4? Three comparisons determine symmetry. Remarkably, the hidden layer learns exactly this representation.

The revelation: After training, each hidden unit had developed high positive weights to one pair of corresponding positions (e.g., bits 1 and 6) and near-zero weights to everything else. Without being told about the pairing structure, the network independently discovered it. The hidden layer created a compressed, meaningful representation — not a random projection.

Why this matters

Before this paper, the common view was that hidden layers were mysterious "black boxes" — mathematical necessities with no interpretable function. Rumelhart et al. showed that hidden units are feature detectors. They learn whatever features are useful for the task at hand.

This is the origin of the entire field of representation learning — the idea that the right features aren't hand-designed but learned. Today's deep learning success (convolutional nets learning edge detectors, transformers learning syntactic heads) traces directly back to this demonstration.

Symmetry Detection: What Hidden Units Learn

A 6-input network with 3 hidden units learns to detect symmetry. After training, each hidden unit has learned to compare one pair of corresponding bits. Click "Train" to watch the hidden unit weight patterns emerge.

Epoch 0 — weights random

Visualizing what hidden units learn

To understand what a hidden unit has learned, we examine its weight vector — the set of weights connecting all inputs to that unit. For the symmetry task with 6 inputs:

If hidden unit 1 has weights [+0.8, 0, 0, 0, 0, +0.8], it activates when bits 1 and 6 are both high — it's checking if the first and last bits match.
If hidden unit 2 has weights [0, +0.8, 0, 0, +0.8, 0], it checks bits 2 and 5.
If hidden unit 3 has weights [0, 0, +0.8, +0.8, 0, 0], it checks bits 3 and 4.

The output unit then combines these with AND-like logic: the string is symmetric if and only if ALL three pairs match. Backpropagation discovered this decomposition entirely on its own.

This is remarkable because the network wasn't given any hint about the pairing structure. It could have learned any arbitrary mapping from 6-bit strings to 0/1. Instead, it found the simplest decomposition — the one a human would design. This happens because gradient descent, with its smooth optimization landscape, naturally finds solutions with structure. Highly structured solutions tend to have lower error across the entire training set, while arbitrary memorization requires more complex (higher-curvature) weight configurations.

Encoder-decoder tasks

The paper includes another striking demonstration: an auto-encoder. The input is an 8-bit one-hot vector (representing 8 items). The network has 8 inputs, 3 hidden units, and 8 outputs. The target is the same as the input — the network must reproduce its input.

With only 3 hidden units, the network is forced to compress 8 items into a 3-bit code. After training, each hidden unit learns to act as one bit of a binary encoding: the 8 items are mapped to the 8 possible 3-bit binary strings (000, 001, 010, ..., 111).

Input: [0,0,0,1,0,0,0,0] → Hidden: [0,1,1] → Output: [0,0,0,1,0,0,0,0]

The network invented binary encoding. Nobody taught it. The bottleneck forced it to discover an efficient representation, and backpropagation found it.

Item	One-hot input	Learned hidden code	Binary equivalent
1	[1,0,0,0,0,0,0,0]	[0,0,0]	000
2	[0,1,0,0,0,0,0,0]	[0,0,1]	001
3	[0,0,1,0,0,0,0,0]	[0,1,0]	010
4	[0,0,0,1,0,0,0,0]	[0,1,1]	011
5	[0,0,0,0,1,0,0,0]	[1,0,0]	100
6	[0,0,0,0,0,1,0,0]	[1,0,1]	101
7	[0,0,0,0,0,0,1,0]	[1,1,0]	110
8	[0,0,0,0,0,0,0,1]	[1,1,1]	111

In the symmetry detection experiment, what did the hidden units learn without being told?

Each hidden unit learned to compare one pair of corresponding positions (bit 1 vs bit 6, bit 2 vs bit 5, bit 3 vs bit 4) — the network independently discovered the pairing structure Each hidden unit learned to memorize one of the symmetric patterns Each hidden unit learned to count the number of 1s in the input

Chapter 7: XOR Solved — Interactive Showcase

Now let's watch backpropagation solve the problem that stumped a generation. XOR — the function that a single-layer perceptron cannot learn — falls to a two-layer network trained with backpropagation.

The XOR truth table:

x₁	x₂	XOR
0	0	0
0	1	1
1	0	1
1	1	0

Our network: 2 inputs, 2 hidden units (with sigmoid), 1 output (with sigmoid). We train with learning rate 0.5 and momentum 0.9. The network starts with random weights and learns to compute XOR through backpropagation.

Backpropagation Solves XOR — Full Training

Watch a 2-2-1 network learn XOR in real time. The left panel shows the network with weight values (line thickness = magnitude, color = sign). The right panel shows the decision boundary evolving. The hidden layer learns to create a linearly separable representation.

Learning rate 0.5

Momentum 0.90

Epoch 0 — Error: 1.000

What the hidden layer actually does

After training, the hidden layer has learned two decision boundaries. Hidden unit 1 computes something like "x₁ OR x₂" and hidden unit 2 computes something like "x₁ AND x₂." The output unit then computes "h₁ AND NOT h₂" — which is exactly XOR.

The hidden layer has re-represented the inputs. In the original (x₁, x₂) space, the four points are not linearly separable. In the hidden (h₁, h₂) space, they are:

Input space (not separable)

(0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0 — interleaved

↓ hidden layer transforms

Hidden space (separable!)

(0.05,0.05)→0, (0.95,0.25)→1, (0.95,0.25)→1, (0.99,0.99)→0

In hidden space, (0,1) and (1,0) map to nearly the same point (both get high h₁, low h₂), while (0,0) gets low h₁ and (1,1) gets high h₁ AND high h₂. A single line in hidden space now separates the classes.

The deep lesson: XOR is trivial for a two-layer network. The hidden layer learns to transform the problem from one where the classes are interleaved to one where they're separated. This is what every deep network does: each layer transforms the representation to make the task a little easier for the next layer. Classification is ultimately always linear — the nonlinearity is in the representation, not the classifier.

python
import numpy as np

# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]])
T = np.array([0,1,1,0])

# Network: 2 → 2 → 1
np.random.seed(42)
W1 = np.random.randn(2,2) * 0.5
b1 = np.zeros(2)
W2 = np.random.randn(2) * 0.5
b2 = 0.0

def sig(x): return 1/(1+np.exp(-x))
def sig_d(s): return s*(1-s)

lr = 2.0
for epoch in range(5000):
    for i in range(4):
        # Forward
        h = sig(X[i] @ W1 + b1)
        y = sig(h @ W2 + b2)
        # Backward
        d2 = (T[i] - y) * sig_d(y)
        d1 = sig_d(h) * W2 * d2
        # Update
        W2 += lr * d2 * h
        b2 += lr * d2
        W1 += lr * np.outer(d1, X[i])
        b1 += lr * d1

# Test
for x in X:
    h = sig(x @ W1 + b1)
    y = sig(h @ W2 + b2)
    print(f"{x} -> {y:.3f}")
# [0, 0] -> 0.013
# [0, 1] -> 0.987
# [1, 0] -> 0.987
# [1, 1] -> 0.014

How does a two-layer network solve XOR?

The hidden layer learns to re-represent the inputs in a new space where the XOR classes become linearly separable, then the output layer draws a single line in that space The hidden layer memorizes all four XOR examples The network uses a lookup table instead of learned weights

Chapter 8: Family Trees

The most impressive demonstration in the paper goes far beyond XOR. Rumelhart et al. trained a network on family relationships: given a person and a relationship type ("Colin" + "has-father"), predict the answer ("James"). The network was given no information about family structure — just a flat list of (person, relationship, answer) triples.

The family tree had two isomorphic families (English and Italian) with members connected by relationships like has-father, has-mother, has-wife, has-son, has-daughter, has-aunt, has-uncle.

The setup: Each person was represented as a one-hot vector over 24 people. Each relationship was a one-hot vector over 12 relationship types. The network had to output the correct person. The architecture: input person (24) + input relationship (12) → hidden layer 1 (6 units) → hidden layer 2 (6 units) → output person (24).

What the network discovered

After training, the first hidden layer had learned features that encode abstract properties of people — their generation (child/parent/grandparent), their nationality (English/Italian), and their branch in the family tree. No one told the network about generations or nationalities. It discovered these abstract features because they were useful for predicting relationships.

Even more remarkably, the network could generalize to held-out relationship triples it had never seen. Because the English and Italian families had the same structure, learning "Colin has-father James" helped the network predict "Roberto has-father Emilio" — even though it had never seen that particular triple. The network had learned the concept of "father," not just a lookup table.

Family Tree: Learned Representations

A simplified family tree experiment. The network learns to predict relationships from (person, relation) pairs. After training, examine the hidden layer activations to see that the network discovers generation and nationality features on its own.

Epoch 0 — Click Train

From memorization to understanding

A lookup table can memorize every (person, relationship, answer) triple. But it cannot generalize. The network's learned representations allow it to generalize because they capture the underlying structure of the domain. This is the difference between memorizing that 3+4=7 and understanding addition.

Hinton later called these learned representations distributed representations — each concept is represented by a pattern of activity across many units, and each unit participates in representing many concepts. This is fundamentally different from local representations (one unit = one concept) and is the key to generalization.

Verified: The family tree network used 24 input units (one-hot person), 12 relationship units, two hidden layers of 6 units each, and 24 output units. After training on 100 of the 104 relationship triples, it correctly predicted the 4 held-out triples. This was the first demonstration of systematic generalization in a neural network — the ability to apply learned rules to new examples.

Why distributed representations enable generalization

In a local representation, "Colin" is unit #1 and "Roberto" is unit #15. Learning that Colin's father is James teaches you nothing about Roberto. The two representations share no structure.

In a distributed representation, Colin might be encoded as [generation=child, nationality=English, branch=left] and Roberto as [generation=child, nationality=Italian, branch=left]. The two encodings share two of three features. So learning the "father" relationship for one helps predict it for the other — the network has learned that "father" means "same branch, one generation up," and this rule applies regardless of nationality.

This is exactly how modern word embeddings work: "king" and "queen" share features (royalty, authority) and differ on one (gender), enabling the analogy king − man + woman ≈ queen.

Property	Lookup table	Learned representations
Storage	O(n) for n facts	O(d · k) for d dims, k concepts
Generalization	None — only returns stored facts	Yes — uses structural similarity
Features	None — raw identifiers	Discovered — generation, nationality, etc.
Graceful degradation	Exact match or nothing	Similar inputs give similar outputs

What abstract features did the hidden layer discover in the family tree experiment?

Generation (child/parent/grandparent), nationality (English/Italian), and family branch — all without being told these categories exist The names and ages of each family member The correct spelling of each relationship type

Chapter 9: Connections

The 1986 backpropagation paper is arguably the most important paper in the history of deep learning. Not because the algorithm was new — Werbos, Linnainmaa, and others had the math earlier — but because Rumelhart, Hinton, and Williams demonstrated convincingly that:

Multi-layer networks can be trained efficiently
Hidden layers learn meaningful representations
These representations support generalization

Every modern deep learning system uses backpropagation or a descendant of it.

What came next

Development	Year	Connection to backprop
Convolutional neural networks (LeCun)	1989	Backprop through weight-shared convolutions
LSTM (Hochreiter & Schmidhuber)	1997	Solved vanishing gradients in backprop through time
Deep learning revival (Hinton)	2006	Pre-training + backprop fine-tuning
AlexNet (Krizhevsky)	2012	Backprop + ReLU + GPU = ImageNet revolution
Transformers (Vaswani)	2017	Still trained end-to-end with backprop
GPT-4 (OpenAI)	2023	Backprop at trillion-parameter scale

Limitations the paper didn't solve

Vanishing gradients: Deep sigmoid networks have gradients that shrink exponentially with depth (max sigmoid derivative is 0.25). This was solved by ReLU (2010s) and skip connections (ResNet, 2015).
Local minima: Gradient descent can get stuck. In practice, saddle points are a bigger problem (solved by Adam and modern optimizers).
Computation cost: Backprop through deep networks was too slow until GPUs made it practical (2010s).

Historical perspective: From 1986 to today, the core algorithm hasn't changed. We compute a forward pass, compare to a target, and propagate gradients backward through the chain rule. What has changed: activation functions (sigmoid → ReLU), architectures (feedforward → convolutional → attention), scale (hundreds of parameters → trillions), hardware (CPUs → GPU clusters). But the learning algorithm is still Rumelhart, Hinton, and Williams' backpropagation.

Related Veanors

Yes You Should Understand Backprop — Karpathy's practical guide to gradient debugging
Word2Vec — Embeddings learned through backprop
NLP from Scratch — Multi-task learning with backprop

"What I cannot create, I do not understand." — Richard Feynman

The paper's place in history

The 1986 Nature paper has been cited over 50,000 times. It is among the most cited papers in all of computer science. Hinton went on to win the Turing Award (2018) alongside Yann LeCun and Yoshua Bengio — three researchers whose careers were built on the foundation of backpropagation.

The paper's most lasting insight is not the algorithm (which others had discovered independently) but the demonstration that learned representations are meaningful. This single idea — that neural networks discover useful features automatically — is the intellectual foundation of the entire deep learning revolution. Every time a convolutional network learns edge detectors, every time a language model learns grammar, every time an embedding captures analogy structure — it's backpropagation learning internal representations, exactly as Rumelhart, Hinton, and Williams showed 40 years ago.

What was the key contribution of the Rumelhart et al. 1986 paper beyond the backpropagation algorithm itself?

Demonstrating that hidden layers learn meaningful, interpretable internal representations that support generalization — not just memorization Proving that neural networks are Turing-complete Inventing the perceptron learning rule

Learning Representations by Backpropagating Errors