CS224N Lecture 3

Backpropagation and Neural Networks

How neural networks learn by tracing blame backwards through computation.

Prerequisites: L02 Word Vectors (recommended) + basic calculus (derivatives). That's it.
10
Chapters
10+
Simulations
0
Assumed Knowledge

Chapter 0: Why Learning?

You hardcode a function to classify emails as spam. It checks if the subject contains "buy now!" and flags it. Works great — for a day. Then spammers switch to "limited offer," "act fast," "urgent deal." Your function is blind to all of them. You patch it with more rules. Spammers adapt again. You're playing whack-a-mole against a universe of possible phrasings.

The core problem is this: you wrote a fixed function. Its behavior was frozen the moment you typed the last character. It cannot improve from experience. It cannot notice patterns you didn't anticipate. It cannot adapt when the world changes.

What if the function could fix itself? What if, every time it misclassified an email, it could figure out which part of its computation was responsible for the mistake and nudge that part in the right direction?

That's what a neural network does. It's a function with adjustable parameters — numbers called weights — and a procedure for adjusting those weights to reduce errors. The procedure is called learning, and the specific algorithm that makes it efficient is called backpropagation.

Here's the key shift in thinking. Traditional programming: you write the rules. Machine learning: you provide examples and the machine discovers the rules. You don't tell the spam filter that "limited offer" is suspicious. You show it 10,000 emails labeled spam or not-spam, and it learns which patterns predict spam. The function writes itself.

This works because of gradient descent — a systematic way to improve the weights. At each step, we ask: "If I increase this weight by a tiny amount, does the error go up or down?" If it goes down, we increase the weight. If it goes up, we decrease it. Do this for all weights, thousands of times, and the function converges to something useful.

A neural network isn't magic. It's a function with knobs. Learning is turning the knobs until the errors shrink. Backpropagation tells you which direction to turn each knob and by how much.

To see this concretely, imagine a simple 2D classification task. You have points in two classes — blue and orange — and a linear boundary separating them. But the initial boundary is wrong. It misclassifies several points. What if we could automatically adjust the boundary's angle and position to minimize mistakes?

That's gradient descent. At each step, we measure the error, compute which direction to move the boundary to reduce that error, and take a small step. Watch it happen below.

Learning a Decision Boundary

The line starts wrong. Click "Learn" to run gradient descent steps. The boundary rotates and shifts to classify the points correctly. Adjust the learning rate to see how step size affects convergence.

Learning rate0.10
Step 0 — click Learn to start.

Notice what happened. You didn't hand-tune the line's angle or position. You didn't write rules for where the boundary should go. You gave the system a measure of error and a way to reduce it, and the boundary found the right place on its own. That's the essence of all neural network training.

This loop — predict, measure error, compute gradients, update weights — is the heartbeat of every neural network ever trained. GPT-4 does it. Image classifiers do it. Self-driving cars do it. The architectures differ wildly. The loop is always the same.

How many times does the loop run? For GPT-3, training involved approximately 300 billion tokens, processed in batches of ~3 million tokens each. That's roughly 100,000 gradient update steps. Each step costs hundreds of GPU-hours. The total training cost was estimated at $4.6 million in compute. All of that — months of computation on thousands of GPUs — is just this loop running over and over.

The remarkable thing is that this simple loop — invented in 1986 — still powers every large model trained in 2025. The architectures have changed dramatically (feedforward → CNN → RNN → Transformer), the optimizers have improved (SGD → Adam → AdamW), the hardware has evolved (CPU → GPU → TPU → custom AI chips). But the core algorithm has not. Forward, loss, backward, update. It works.

The Training Loop

1. Forward Pass
Push data through the network to get a prediction.
2. Compute Loss
Compare prediction to truth. Get a single number: how wrong are we?
3. Backward Pass
Compute gradients: which weights caused the error?
4. Update Weights
Nudge each weight in the direction that reduces the loss.
↻ repeat
What this lesson covers: The single neuron. Activation functions. The forward pass. Loss functions. The chain rule. Backpropagation. A full neural network playground. Practical tips for initialization, regularization, and learning rate. By the end, you'll understand how every neural network learns — from a single perceptron to GPT.
Why can't we just hardcode the perfect weights for a neural network?

Chapter 1: The Neuron

Every neural network — GPT-4, image classifiers, self-driving cars — is built from one tiny unit repeated millions of times. Understand this one unit and you understand the building block of all of deep learning. It's called a neuron (or a unit or a node).

A neuron does three things. First, it takes multiple input numbers and multiplies each one by a weight. Second, it adds all those weighted inputs together, plus a bias term. Third, it passes the sum through an activation function to produce a single output number.

That's it. Three operations: weight, sum, activate.

y = f(w1x1 + w2x2 + w3x3 + b)

Where x1, x2, x3 are the inputs, w1, w2, w3 are the weights, b is the bias, and f is the activation function. The weights control how much each input matters. A large positive weight means "pay a lot of attention to this input." A negative weight means "when this input is large, push the output down." Zero means "ignore this input."

The bias is an offset. Even when all inputs are zero, the neuron can still produce a non-zero output thanks to the bias. Think of it as the neuron's baseline tendency — its default opinion before seeing any evidence.

Think of each weight as a volume knob. The neuron listens to multiple signals, each with its own volume. It sums the adjusted signals, adds a baseline (bias), then squishes the result through a nonlinearity. One number comes out. That's the neuron's "opinion."

Let's make this concrete with numbers. Say x1 = 2.0, x2 = −1.0, x3 = 0.5, and we're using the ReLU activation (f(z) = max(0, z)). If the weights are w1 = 0.5, w2 = 0.8, w3 = −0.3, and b = 0.1:

z = (0.5)(2.0) + (0.8)(−1.0) + (−0.3)(0.5) + 0.1 = 1.0 − 0.8 − 0.15 + 0.1 = 0.15
y = ReLU(0.15) = max(0, 0.15) = 0.15

The neuron received three inputs, weighted and summed them to get 0.15, then ReLU left it unchanged because it's positive. If the sum had been negative — say −0.5 — ReLU would have output 0. The neuron would be "silent."

A Worked Example: Email Spam Detector

Let's make this tangible. You're building a spam detector with three features: x1 = number of exclamation marks (normalized 0–1), x2 = contains "free" (1 or 0), x3 = sender in contacts (1 or 0). A neuron might learn weights w1 = 1.5, w2 = 2.0, w3 = −3.0, b = −0.5.

Read the weights like a story: "Exclamation marks are mildly suspicious (+1.5). The word 'free' is very suspicious (+2.0). Being in contacts is strong evidence of NOT spam (−3.0). Default lean: slightly toward not-spam (−0.5)."

Email from your friend (x = [0.2, 0, 1]): z = 1.5(0.2) + 2.0(0) + (−3.0)(1) + (−0.5) = 0.3 + 0 − 3 − 0.5 = −3.2. ReLU(−3.2) = 0. Not spam. Correct — the contact weight dominated.

Spam email (x = [0.9, 1, 0]): z = 1.5(0.9) + 2.0(1) + (−3.0)(0) + (−0.5) = 1.35 + 2.0 + 0 − 0.5 = 2.85. ReLU(2.85) = 2.85. High activation. Spam detected.

Edge case — spam from a friend (x = [0.8, 1, 1]): z = 1.5(0.8) + 2.0(1) + (−3.0)(1) + (−0.5) = 1.2 + 2.0 − 3.0 − 0.5 = −0.3. ReLU(−0.3) = 0. Not spam. The contact weight overrides the spam signals. Whether this is correct depends on your tolerance — and this is exactly why we need multiple neurons (different feature detectors with different weight configurations).

Why the Biology Analogy Breaks Down

The name "neuron" comes from the biological brain cell. But don't take the analogy too seriously. Biological neurons communicate via spikes with complex timing; artificial neurons output a single floating-point number. Biological networks are sparse and stochastic; artificial networks are dense and deterministic. The only real similarity is the structure: many inputs, one output, a threshold effect.

Interactive Neuron

Adjust the three weights and bias using the sliders. Watch the weighted sum bar and the ReLU output update in real time. Try making the sum negative — the output clamps to zero.

w10.5
w20.8
w3-0.3
bias0.1

From One Neuron to a Layer

A single neuron can only compute a simple function. But arrange many neurons side by side, each receiving the same inputs but with different weights, and you get a layer. Each neuron in the layer produces its own output, so a layer of 4 neurons transforms 3 inputs into 4 outputs. Stack multiple layers and you get a neural network.

In vector notation, a layer with n inputs and m neurons is:

h = f(Wx + b)

Where W is a [m × n] matrix (each row is one neuron's weights), x is the [n] input vector, b is the [m] bias vector, and f is applied element-wise. This is just our single-neuron formula repeated m times in parallel, written compactly as a matrix multiply.

A Neuron From Scratch

python
import numpy as np

class Neuron:
    def __init__(self, n_inputs):
        # Random weights, small values (Xavier-ish)
        self.w = np.random.randn(n_inputs) * np.sqrt(1/n_inputs)
        self.b = 0.0

    def forward(self, x):
        # Weighted sum + bias + ReLU
        self.z = np.dot(self.w, x) + self.b  # scalar
        self.out = max(0, self.z)             # ReLU
        self.x = x                              # save for backprop
        return self.out

# Usage: 3-input neuron
n = Neuron(3)
output = n.forward(np.array([2.0, -1.0, 0.5]))
print(output)  # some scalar ≥ 0

We save the input x and the pre-activation z during the forward pass. These cached values are exactly what backpropagation needs to compute gradients — we'll see why in Chapter 5.

Neurons in NLP: Window Classifiers

How does this connect to NLP? In CS224N Lecture 3, Collobert and Weston (2011) showed that a simple feedforward network can do named entity recognition. The input is a window of word vectors: concatenate the embeddings of 5 consecutive words into one long vector, feed it to a hidden layer, and classify the center word as Person, Location, Organization, or None.

If each word vector is 300-dimensional and the window size is 5, the input layer has 1500 neurons. A hidden layer of 100 neurons means W1 is [100, 1500] — 150,000 learnable parameters in just the first layer. The neuron is the same unit we studied above, but the inputs are now word embeddings from Chapter L02 instead of raw numbers.

The connection to L02: Word vectors give us inputs that capture meaning. Neural networks give us functions that can learn from those inputs. Together: feed word embeddings into a neural network and train it end-to-end. The embeddings and the network weights learn simultaneously.
If all weights are zero, what does the neuron output regardless of input?

Chapter 2: Activation Functions

Stack two linear functions. f(x) = 2x, g(x) = 3x. Their composition is g(f(x)) = 3(2x) = 6x — still a linear function. Add a hundred more layers, all linear: the composition is still just one big linear function. You gained exactly nothing from the extra layers.

This is the fundamental problem with linear neurons. Without a nonlinear activation function, a network of any depth collapses into a single-layer network. The activation function is what gives deep networks their expressive power — the ability to represent curved, folded, and twisted decision boundaries that linear functions can never capture.

Three activation functions dominate the history of neural networks. Each solved a problem the previous one had.

Sigmoid: σ(z) = 1 / (1 + e−z)

The sigmoid squashes any input into the range (0, 1). Large positive inputs map to ~1. Large negative inputs map to ~0. Zero maps to 0.5. It was the default choice in the 1990s because it resembles a biological neuron's firing rate curve: below threshold it's silent, above threshold it saturates.

The problem is the vanishing gradient. Look at the derivative: σ′(z) = σ(z)(1 − σ(z)). When z is large and positive, σ(z) ≈ 1, so the derivative ≈ 0. When z is large and negative, σ(z) ≈ 0, derivative ≈ 0 again. The gradient is only meaningful in a narrow band around zero. Stack many sigmoid layers and gradients get multiplied together — a product of tiny numbers becomes astronomically small. Deep networks can't learn.

Tanh: tanh(z) = (ez − e−z) / (ez + e−z)

Tanh maps inputs to (−1, 1) instead of (0, 1). This is better because its output is zero-centered — negative inputs produce negative outputs. With sigmoid, the output is always positive, which means gradients for the weights always have the same sign, causing zig-zagging during optimization. Tanh fixes this.

But tanh still saturates at the extremes, so it still suffers from the vanishing gradient problem, just slightly less than sigmoid.

ReLU: f(z) = max(0, z)

The Rectified Linear Unit is almost embarrassingly simple. If the input is positive, pass it through. If it's negative, output zero. That's it. No exponentiation, no division. The derivative is 1 for positive inputs and 0 for negative inputs.

ReLU(z) = max(0, z),    ReLU′(z) = 1 if z > 0, else 0
ReLU's gradient is either 0 or 1. No vanishing, no exploding. That simplicity is why it won. A network with ReLU activations trains 6x faster than the same network with tanh (Krizhevsky et al., 2012). ReLU is the default activation for hidden layers in nearly all modern networks.

ReLU has one issue: dying neurons. If a neuron's weighted sum is always negative (due to unlucky initialization or a large gradient spike), its output is always zero and its gradient is always zero. The neuron is permanently dead — it can never recover. Variants like Leaky ReLU (f(z) = max(0.01z, z)) fix this by allowing a small gradient when z < 0.

The Universal Approximation Theorem

Here's a stunning result: a single hidden layer with enough neurons and a nonlinear activation function can approximate any continuous function to arbitrary accuracy (Cybenko, 1989). This is the universal approximation theorem. It means that in theory, a one-hidden-layer network is sufficient for any task.

In practice, depth matters enormously. A deep network with 100 neurons per layer can represent functions that would require exponentially many neurons in a single layer. Think of it this way: each layer adds a new level of abstraction. Layer 1 detects edges. Layer 2 combines edges into shapes. Layer 3 combines shapes into objects. Doing all of that in one layer would require an astronomical number of neurons.

Activation in Code

python
import numpy as np

def sigmoid(z):  return 1 / (1 + np.exp(-z))
def tanh(z):     return np.tanh(z)
def relu(z):     return np.maximum(0, z)

# Derivatives (needed for backprop)
def sigmoid_grad(z): s = sigmoid(z); return s * (1 - s)
def tanh_grad(z):    return 1 - np.tanh(z)**2
def relu_grad(z):    return (z > 0).astype(float)

Notice something beautiful about ReLU's gradient: it's a boolean cast to float. No exponentials, no divisions. This computational simplicity is another reason ReLU trains faster than sigmoid or tanh.

Modern Activation Functions

Research hasn't stopped at ReLU. Several variants have emerged:

GELU (Gaussian Error Linear Unit): used in BERT and GPT. It smoothly approximates ReLU with GELU(x) = x · Φ(x), where Φ is the Gaussian CDF. Unlike ReLU's hard zero, GELU gently attenuates negative inputs. This smooth behavior can improve training stability in Transformers.

SiLU / Swish: f(x) = x · σ(x). Similar to GELU but uses sigmoid instead of the Gaussian CDF. Used in many modern architectures. Non-monotonic — it dips slightly below zero for negative inputs, which empirically helps.

For hidden layers in 2025: ReLU remains the default for CNNs and simple networks. GELU dominates in Transformers (language models). Sigmoid is reserved for output layers where you need a probability (0, 1). Tanh appears in LSTMs and specific gating mechanisms.

Proof That Linearity Collapses

Let's prove the claim from the opening formally. Two linear layers without activation:

h = W1x    (no activation)
y = W2h = W2(W1x) = (W2W1)x = Weffx

Weff = W2W1 is just another matrix. A [2, 4] matrix times a [4, 3] matrix gives a [2, 3] matrix. We could have computed y = Weffx directly with one layer. The hidden layer added parameters but no expressiveness. This generalizes to any number of layers: N stacked linear layers collapse to one.

With a nonlinear activation, the collapse is broken. y = W2ReLU(W1x) cannot be written as Wx for any single matrix W. The ReLU creates piecewise-linear regions that a single linear layer cannot replicate. Each additional layer with activation adds genuinely new representational capacity.

FunctionRangeGradient BehaviorWhen to Use
Sigmoid(0, 1)Vanishes at extremesOutput layer for binary classification
Tanh(−1, 1)Vanishes, but zero-centeredRNNs (historically)
ReLU[0, ∞)0 or 1 — no vanishingHidden layers (default choice)
Activation Function Explorer

Toggle between activation functions. The solid curve is the function; the dashed curve is its derivative. Drag the x-slider to see the function value and gradient at that point. Notice how sigmoid and tanh derivatives vanish at the extremes.

Input x0.0
Why does stacking linear layers without activation functions not increase expressive power?

Chapter 3: The Forward Pass

You have a 3-number input: [height, weight, age]. How does the network turn that into "diabetic: 72% likely"? By pushing the numbers forward through the network, layer by layer, until a prediction pops out the other end. This is the forward pass.

Let's trace it through a concrete 2-layer network with 3 inputs, 4 hidden neurons, and 2 outputs. Every step is just an operation you already know: matrix multiply, add bias, activate.

Step-by-Step with Shapes

Step 1: Input. Our input vector x has shape [3]. These are raw features — for example, height=170, weight=80, age=45.

Step 2: First layer linear transform. Multiply by weight matrix W1 of shape [4, 3], then add bias b1 of shape [4]. The result z1 = W1x + b1 has shape [4]. Each of the 4 entries is one hidden neuron's weighted sum.

Step 3: First activation. Apply ReLU element-wise: h = ReLU(z1). Still shape [4], but now all negative entries are clamped to zero.

Step 4: Second layer linear transform. Multiply by W2 of shape [2, 4], add b2 of shape [2]. Result z2 has shape [2] — one score per output class.

Step 5: Softmax. Convert raw scores to probabilities: p = softmax(z2). Still shape [2], but entries sum to 1. If p = [0.28, 0.72], the network says "28% healthy, 72% diabetic."

x ⟶ W1x + b1 ⟶ ReLU ⟶ W2h + b2 ⟶ softmax ⟶ p
ObjectShapeWhat It Is
x (input)[3]Raw features
W1[4, 3]First layer weights
b1[4]First layer bias
z1 = W1x + b1[4]Pre-activation (linear output)
h = ReLU(z1)[4]Hidden activations
W2[2, 4]Second layer weights
b2[2]Second layer bias
z2 = W2h + b2[2]Raw output scores (logits)
p = softmax(z2)[2]Class probabilities
The forward pass is just matrix multiply, add bias, activate — repeated. That's it. Every neural network, from a 2-layer classifier to a billion-parameter language model, is this same pattern stacked deeper and wider. The magic isn't in the individual operations — it's in the weights that get learned.

Hand-Worked Matrix Multiply

Let's nail down exactly what "matrix multiply" means with small numbers. Suppose x = [0.7, 0.8, 0.45] and W1 is a [4, 3] matrix (4 neurons, each with 3 weights):

W1 = [[0.2, 0.1, −0.3],
         [−0.1, 0.4, 0.1],
         [0.3, −0.2, 0.5],
         [0.1, 0.2, −0.1]]

Neuron 0: (0.2)(0.7) + (0.1)(0.8) + (−0.3)(0.45) = 0.14 + 0.08 − 0.135 = 0.085
Neuron 1: (−0.1)(0.7) + (0.4)(0.8) + (0.1)(0.45) = −0.07 + 0.32 + 0.045 = 0.295
Neuron 2: (0.3)(0.7) + (−0.2)(0.8) + (0.5)(0.45) = 0.21 − 0.16 + 0.225 = 0.275
Neuron 3: (0.1)(0.7) + (0.2)(0.8) + (−0.1)(0.45) = 0.07 + 0.16 − 0.045 = 0.185

After adding bias b1 = [0.1, −0.2, 0.1, 0.3] and applying ReLU:

z1 = [0.185, 0.095, 0.375, 0.485] → h = ReLU(z1) = [0.185, 0.095, 0.375, 0.485]

All positive, so ReLU doesn't change anything here. If any entry had been negative, it would be clamped to zero. Each dot product is one neuron "voting" on the input.

Let's change the input to see ReLU in action. With x = [0.1, 0.1, 0.1]:

Neuron 0: (0.2)(0.1) + (0.1)(0.1) + (−0.3)(0.1) + 0.1 = 0.02 + 0.01 − 0.03 + 0.1 = 0.1
Neuron 1: (−0.1)(0.1) + (0.4)(0.1) + (0.1)(0.1) + (−0.2) = −0.01 + 0.04 + 0.01 − 0.2 = −0.16

Neuron 1's pre-activation is −0.16. ReLU clamps it to 0. That neuron is silent for this input — it contributes nothing to the next layer. Different inputs activate different subsets of neurons. This selective activation is part of what gives neural networks their representational power.

Why Matrix Multiplication?

You might wonder: why matrix multiplication specifically? Why not some other operation? The answer has two parts.

First, matrix multiplication is a linear transformation. It can rotate, scale, shear, and project vectors. These are exactly the operations needed to rearrange the input space into regions that correspond to different outputs. The activation function then bends these regions into nonlinear shapes.

Second, matrix multiplication is embarrassingly parallel. GPUs are built from thousands of cores optimized for matrix operations. A matrix multiply of shape [4096, 4096] × [4096, 1] involves 16 million multiply-adds — and a GPU completes them in microseconds. This computational efficiency is why neural networks scale so well on modern hardware.

Real-World Scale: GPT-2 Forward Pass

To appreciate the scale, here's what a forward pass looks like for GPT-2 Small (124M parameters):

OperationShapeFLOPs per token
Token embedding lookup[50257, 768] → [768]0 (lookup, no math)
12 × attention QKV projection[768, 2304] per layer~4.2M per layer
12 × FFN[768, 3072] then [3072, 768]~9.4M per layer
Output projection[768, 50257]~77M

Total: roughly 240 million FLOPs per token. For a 1024-token sequence, that's ~250 billion FLOPs for a single forward pass. And GPT-2 Small is tiny by modern standards. Same operations, same algorithm — vastly more scale.

Step-Through Forward Pass

Click "Step" to advance through the 5 stages of a forward pass in a 3→4→2 network. Tensor shapes and actual numbers are shown at each stage.

Click "Step" to begin the forward pass.

PyTorch Implementation

python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 4),    # W1: [4,3], b1: [4]
    nn.ReLU(),            # element-wise max(0, z)
    nn.Linear(4, 2),    # W2: [2,4], b2: [2]
)
x = torch.tensor([170.0, 80.0, 45.0])  # [3]
logits = model(x)                          # [2]
probs  = torch.softmax(logits, dim=0)      # [2], sums to 1

Five lines. That's a complete forward pass. PyTorch handles the matrix multiplications, bias additions, and ReLU applications behind the scenes. The shapes flow exactly as our table predicted: [3] → [4] → [4] → [2] → [2].

What Softmax Actually Does

The softmax function converts raw scores (called logits) into probabilities. Given a vector z of K scores:

softmax(z)i = ezi / ∑j=1K ezj

Why exponentiate? Because ez is always positive (ensuring valid probabilities) and preserves ordering (higher scores get higher probabilities). The denominator normalizes so the probabilities sum to 1. If one logit is much larger than the others, its softmax output approaches 1 while the others approach 0 — the network is "confident."

Practical trick: to avoid numerical overflow, subtract the maximum logit before exponentiating: softmax(z) = softmax(z − max(z)). This doesn't change the result (the constant cancels) but prevents ez from becoming astronomically large.

What If We Don't Use Softmax?

For binary classification, you can skip softmax and use a single output neuron with sigmoid: p = σ(z). This produces a probability directly. For multi-label classification (where multiple classes can be true simultaneously), use sigmoid on each output independently instead of softmax (which enforces "exactly one class is true").

For regression (predicting a continuous number like house price), don't use any activation on the output — just let the raw linear output through. The output can be any real number, which is what you want for regression.

TaskOutput ActivationLoss Function
Binary classificationSigmoid (1 output)Binary cross-entropy
Multi-class classificationSoftmax (K outputs)Cross-entropy
Multi-label classificationSigmoid per outputBinary CE per output
RegressionNone (identity)MSE
If the input has shape [3] and W1 has shape [4, 3], what is the hidden layer shape?

Chapter 4: Loss Functions

The network predicted "cat: 20%, dog: 80%" but the image is a cat. How do we turn "you're wrong" into a number? We need a function that takes the predicted probabilities and the true label and outputs a single number that's large when the prediction is bad and small when it's good. That function is called the loss function (or cost function or objective).

The loss function is the neural network's only source of feedback. It's the signal that drives learning. A poorly chosen loss function means the network optimizes for the wrong thing. A good loss function precisely encodes what "correct" means.

Mean Squared Error (MSE)

The simplest loss: square the difference between the prediction and the target.

LMSE = (y − p)2

For classification with probability output p and true label y = 1: if p = 0.8, loss = (1 − 0.8)² = 0.04. If p = 0.2, loss = (1 − 0.2)² = 0.64. MSE penalizes wrong answers, but the penalty grows only quadratically. A confident wrong answer (p = 0.01) gets loss 0.98 — not much worse than a mildly wrong answer (p = 0.3, loss = 0.49).

Cross-Entropy Loss

This is the standard loss for classification. For binary classification:

LCE = −log(p)

Where p is the predicted probability of the true class. If the true class is "cat" and you predicted p(cat) = 0.8, the loss is −log(0.8) = 0.22. If you predicted p(cat) = 0.01, the loss is −log(0.01) = 4.61. That's a 20x larger penalty for being confidently wrong.

Cross-entropy punishes confident wrong answers harshly. Predicting 0.01 for the true class costs about 20x more than predicting 0.5. This is exactly what we want: the network should be strongly discouraged from being both wrong and confident. MSE doesn't have this property — it treats all degrees of wrongness roughly the same.

The steep cliff near p = 0 is the key visual insight. As the predicted probability of the correct class approaches zero, cross-entropy loss shoots to infinity. This creates a powerful gradient signal: "You are catastrophically wrong. Fix this immediately." MSE, by contrast, just says "You're a little more wrong than before."

Loss in NLP: Language Model Training

When you train a language model like GPT, the loss function is cross-entropy over the vocabulary. At each position, the model predicts a probability distribution over ~50,000 tokens. The true next token is one of those 50,000. Cross-entropy loss is −log(ptrue), where ptrue is the probability the model assigned to the correct token.

If the model thinks the next word after "The cat sat on the" is "mat" with probability 0.3, the loss is −log(0.3) = 1.2. If it predicted "quantum" with probability 0.0001, the loss would be −log(0.0001) = 9.2. Over millions of training examples, this signal teaches the model which words tend to follow which contexts.

The metric perplexity = eaverage loss gives an intuitive interpretation: a perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 options. GPT-4's perplexity is estimated around 5–10 on typical English text.

Multi-Class Cross-Entropy

For K classes, the loss is the sum over all classes, but only the true class contributes (because the one-hot label is zero for all other classes):

L = −∑k=1K yk log(pk) = −log(ptrue)

This simplifies to exactly what we had above: just the negative log of the predicted probability for the correct class. In PyTorch, nn.CrossEntropyLoss combines softmax and cross-entropy in one numerically stable operation.

Why Cross-Entropy Works: Information Theory

Cross-entropy isn't arbitrary — it comes from information theory. It measures the number of bits needed to encode outcomes from one distribution (the true labels) using another distribution (the model's predictions). When the model perfectly predicts the true distribution, cross-entropy equals the entropy of the true distribution — the minimum possible. Any mismatch increases the cost.

The gradient of cross-entropy with respect to the logits has a beautiful form: for softmax + cross-entropy, the gradient is simply prediction minus target. If the true class is k and the model predicts probabilities p, then dL/dzk = pk − 1 and dL/dzj = pj for all other j. This clean gradient is another reason cross-entropy is preferred over MSE for classification.

Let's verify with numbers. Say logits z = [1.0, 2.0, 0.5] and the true class is 1. Softmax gives p = [0.21, 0.57, 0.13]. The gradient is: dL/dz = [0.21, 0.57 − 1.0, 0.13] = [0.21, −0.43, 0.13]. The gradient is negative for the true class (increase that logit to reduce loss) and positive for wrong classes (decrease those logits). Beautifully intuitive.

Loss in Code

python
import torch
import torch.nn as nn

# CrossEntropyLoss takes raw logits (before softmax)
logits = torch.tensor([1.2, -0.5, 3.1])  # 3 classes
target = torch.tensor(2)                   # true class is 2
loss   = nn.CrossEntropyLoss()(logits, target)

# Manual version:
probs = torch.softmax(logits, dim=0)    # [0.142, 0.026, 0.832]
manual_loss = -torch.log(probs[2])      # -log(0.832) = 0.184
MSE vs. Cross-Entropy Loss

Drag the slider to change the predicted probability for the true class. Watch how both losses change. Notice the steep cliff in cross-entropy near p=0 — confident wrong answers are punished exponentially.

Predicted p(true)0.50
Which loss penalizes a confident wrong prediction more — MSE or cross-entropy?

Chapter 5: The Chain Rule

Turn the thermostat up 1 degree. The heater burns more gas. More gas means a bigger bill. How much does 1 degree cost? You need (dollars per gallon) × (gallons per degree). That's the chain rule: to find how a change at the start affects the end, multiply the rates of change along the chain.

In calculus notation: if y = f(g(x)), then dy/dx = (dy/dg) · (dg/dx). The derivative of the composition is the product of the individual derivatives. This extends to any number of composed functions:

dL/dw = (dL/da) · (da/dz) · (dz/dw)

Each term in the product is called a local gradient — it describes how sensitive one variable is to the variable just before it in the chain. To get the end-to-end gradient (how the loss changes when we wiggle a weight), we just multiply all the local gradients along the path from the weight to the loss.

Concrete Example: A 4-Node Chain

Let's trace a concrete computation. We have weight w = 2.0, input x = 1.5, and we compute:

z = w · x
Multiply: z = 2.0 × 1.5 = 3.0
a = z + b
Add bias b = −1.0: a = 3.0 + (−1.0) = 2.0
h = σ(a)
Sigmoid: h = 1/(1+e−2.0) = 0.881
L = (h − y)2
MSE loss with target y=1: L = (0.881 − 1)² = 0.0142

Now the backward pass. We want dL/dw — how much does the loss change if we wiggle w?

Local gradient 1: dL/dh = 2(h − y) = 2(0.881 − 1) = −0.238. The loss decreases if h increases (good — h should be closer to 1).

Local gradient 2: dh/da = σ(a)(1 − σ(a)) = 0.881 × 0.119 = 0.105. Sigmoid's derivative at a = 2.0.

Local gradient 3: da/dz = 1. The bias addition has derivative 1 with respect to z.

Local gradient 4: dz/dw = x = 1.5. The multiplication z = w · x has derivative x with respect to w.

Chain them: dL/dw = (−0.238)(0.105)(1)(1.5) = −0.0375. The gradient is negative, meaning increasing w will decrease the loss. So gradient descent will increase w — exactly the right direction.

The Gradient Update

Gradient descent says: wnew = w − η · dL/dw, where η is the learning rate. With η = 0.1:

wnew = 2.0 − 0.1 × (−0.0375) = 2.0 + 0.00375 = 2.00375

The weight increased slightly. After many such updates, w will converge to a value that makes the loss near zero. The miracle of this process is that we never told the network what w should be — we only told it what the loss should be (small), and the chain rule figured out the rest.

Why Gradients Can Vanish or Explode

Look at that chain of multiplications again: (−0.238)(0.105)(1)(1.5). The sigmoid derivative (0.105) is less than 1. In a 10-layer network, you'd multiply 10 such terms together. If each is 0.1, the product is 0.110 = 10−10. The gradient has effectively vanished — early layers get no useful signal.

Conversely, if each local gradient is 1.5, the product after 10 layers is 1.510 ≈ 57. The gradient explodes. Weights get massive updates, the network blows up, and you see NaN. This is why activation function choice matters so much (Chapter 2) and why ReLU's gradient of exactly 1 for positive inputs is so valuable.

Multiple Paths: Summing Over Children

What if a node has multiple outputs — for example, one hidden neuron feeds into three neurons in the next layer? The chain rule says: sum the gradients from all children. Each downstream path contributes its own gradient, and we add them all up.

dL/dhj = ∑k (dL/dzk) · (dzk/dhj)

This is why backpropagation works for arbitrary computation graphs, not just chains. At each node: sum the gradients flowing in from all outputs, then multiply by the local gradient to pass backward. Branching (one output goes to multiple nodes) means summing. Merging (multiple inputs combine) means each input gets its own gradient path.

Chain Rule at Scale: Why Autodiff Exists

A modern language model like GPT-4 has hundreds of billions of parameters. The computation graph has trillions of nodes. Computing the chain rule by hand is obviously impossible. Automatic differentiation (autodiff) systems like PyTorch's autograd do it mechanically.

During the forward pass, PyTorch records every operation in a directed acyclic graph (DAG). Each tensor knows which operation created it and what its inputs were. When you call loss.backward(), PyTorch walks this graph in reverse topological order, applying the chain rule at each step. The programmer never sees a derivative. The system handles everything.

But understanding the chain rule is still essential. When your model won't train, the first question is: "Are the gradients reaching the early layers?" If they're vanishing (sigmoid + deep network) or exploding (bad initialization), you need to understand why — and that requires knowing the chain rule.

Gradient Checking: Trust but Verify

When implementing backprop by hand, how do you know your gradients are correct? Use numerical gradient checking. For each weight w, compute:

dL/dw ≈ (L(w + ε) − L(w − ε)) / (2ε)

This finite-difference approximation should match your backprop gradient to ~5 significant digits. If it doesn't, you have a bug. Typical ε = 10−5. This is slow (requires 2 forward passes per weight) so you only use it for debugging, not training.

python
def grad_check(f, w, eps=1e-5):
    """Compare analytical gradient with numerical."""
    grad_numerical = np.zeros_like(w)
    for i in range(len(w)):
        w[i] += eps
        loss_plus = f(w)
        w[i] -= 2*eps
        loss_minus = f(w)
        w[i] += eps  # restore
        grad_numerical[i] = (loss_plus - loss_minus) / (2*eps)
    return grad_numerical
The chain rule isn't a trick. It's bookkeeping. At each node, you answer one question: "If my input wiggles by a tiny amount, how much does my output wiggle?" Multiply all the wiggles from loss to weight, and you have the gradient. That's all backpropagation is.
Computation Graph — Forward and Backward

Click "Forward" to fill in values left-to-right through the computation graph. Then click "Backward" to fill gradients right-to-left, watching them multiply along the chain. Each node shows its value (top, teal) and local gradient (bottom, orange).

Click "Forward" to begin.
In f(g(x)), if g′(x) = 3 and f′(g(x)) = 2, what is df/dx?

Chapter 6: Backpropagation

You know the chain rule for one path. But a real network has millions of paths — every weight connects to the loss through a different chain of operations. Computing each gradient independently by tracing its own path would mean repeating vast amounts of shared computation. For a network with n weights, that's n separate traversals of the graph.

Backpropagation computes ALL gradients in a single backward sweep by reusing intermediate results. It's the chain rule applied systematically, layer by layer, from the loss back to the inputs. The key insight is that gradients flowing into a node can be computed once and shared by all paths passing through that node.

The Algorithm in 5 Steps

1. Forward Pass
Push input through the network, computing and saving every intermediate value (activations, pre-activations).
2. Compute Loss
Evaluate the loss function at the output. This is the starting point for gradients.
3. Output Gradient
Compute dL/d(output). For softmax + cross-entropy, this is simply (prediction − target).
4. Propagate Backward
At each layer (right to left): multiply incoming gradient by local gradient. Save weight gradients. Pass input gradient to the layer before.
5. Update Weights
w ← w − η · dL/dw for every weight. η is the learning rate.

Why It's Efficient

Consider a network with 3 layers and 1000 weights per layer. Naively computing dL/dwi for each weight independently means traversing the graph 3000 times. Backprop traverses it once. The cost of the backward pass is approximately equal to the cost of the forward pass. This O(1) cost per gradient (amortized) is what makes training neural networks feasible.

Backprop computes ALL gradients in one backward pass — the same cost as one forward pass. This is why neural networks are trainable at all. Without backprop, computing gradients for a million-parameter network would require a million forward passes. With backprop, it takes one forward pass plus one backward pass. The algorithm made deep learning practical.

The Math for One Layer

For a single layer h = ReLU(Wx + b), given the incoming gradient dL/dh from the layer above:

dL/dz = dL/dh ⊙ ReLU′(z)    (element-wise, shape [m])
dL/dW = dL/dz · xT    (outer product, shape [m, n])
dL/db = dL/dz    (same shape [m])
dL/dx = WT · dL/dz    (pass to previous layer, shape [n])

The gradient for the weights (dL/dW) is the outer product of the incoming gradient and the layer's input. The gradient to pass backward (dL/dx) is the transpose of the weight matrix times the incoming gradient. These two operations — outer product and transpose multiply — are the computational core of backprop.

Concrete Gradient Shapes

For our 3→4→2 network from Chapter 3, here are the gradient shapes at each step of backprop:

GradientShapeHow Computed
dL/dz2[2]p − y (softmax gradient)
dL/dW2[2, 4]dL/dz2 · hT (outer product)
dL/db2[2]dL/dz2 (copy)
dL/dh[4]W2T · dL/dz2 (pass backward)
dL/dz1[4]dL/dh ⊙ ReLU′(z1)
dL/dW1[4, 3]dL/dz1 · xT (outer product)
dL/db1[4]dL/dz1 (copy)

Notice the symmetry: each weight gradient has the same shape as the weight matrix itself. This makes sense — we need one gradient value per weight. The backward pass mirrors the forward pass but in reverse: forward goes [3]→[4]→[2], backward goes [2]→[4]→[3].

Historical Context

Backpropagation was independently discovered multiple times. Linnainmaa described the core idea (reverse-mode automatic differentiation) in 1970. Werbos applied it to neural networks in his 1974 thesis. But the paper that made the world pay attention was Rumelhart, Hinton, and Williams (1986) in Nature.

Before 1986, many researchers believed multi-layer networks were untrainable — the perceptron (a single-layer network) could be trained, but how to extend training to hidden layers was unclear. Backpropagation answered that question definitively: propagate the error signal backward through each layer. It was the breakthrough that made deep learning possible.

For decades afterward, backprop was considered "solved" and attention shifted to architectures and data. But Karpathy's 2016 blog post argued convincingly that understanding backprop (not just calling .backward()) is essential for debugging modern networks. Saturating activations, dead ReLUs, exploding gradients — all of these are gradient flow problems that require backprop intuition to diagnose.

The Memory Cost of Backprop

There's a cost to backprop's efficiency: memory. The forward pass must save every intermediate activation because the backward pass needs them. A network with L layers and N activations per layer stores L × N values. For large models like GPT-4 with billions of parameters, this activation memory dominates GPU usage.

Techniques like gradient checkpointing trade memory for compute: save activations only at every k-th layer, and recompute the intermediate ones during the backward pass. This reduces memory from O(L) to O(√L) at the cost of ~33% more compute. It's essential for training the largest models.

Mixed precision training stores activations in 16-bit floats instead of 32-bit, halving memory. The gradients and weight updates are computed in 32-bit for numerical stability, but the saved activations only need enough precision for the backward pass. This is now standard practice.

PyTorch: One Line

python
# Forward pass (PyTorch records the computation graph automatically)
logits = model(x)                         # forward
loss   = nn.CrossEntropyLoss()(logits, y)  # compute loss
loss.backward()                             # ALL gradients computed here

# After backward(), every parameter has a .grad attribute:
for p in model.parameters():
    p.data -= lr * p.grad  # gradient descent update

loss.backward() is the entire backpropagation algorithm. PyTorch's autograd engine recorded every operation during the forward pass, then replays them in reverse to compute gradients. You never write a single gradient formula by hand.

Manual Backprop From Scratch

To truly understand backprop, implement it once without a framework. Here's a complete 2-layer network with manual forward and backward passes:

python
import numpy as np

# Network: 3 → 4 → 2
W1 = np.random.randn(4, 3) * 0.5  # [4, 3]
b1 = np.zeros(4)                   # [4]
W2 = np.random.randn(2, 4) * 0.5  # [2, 4]
b2 = np.zeros(2)                   # [2]

def forward(x):
    z1 = W1 @ x + b1        # [4]
    h  = np.maximum(0, z1)   # [4] ReLU
    z2 = W2 @ h + b2        # [2]
    e  = np.exp(z2 - z2.max())
    p  = e / e.sum()         # [2] softmax
    return z1, h, z2, p

def backward(x, z1, h, p, y):
    # Softmax + CE gradient: p - one_hot(y)
    dz2 = p.copy()
    dz2[y] -= 1              # [2]

    dW2 = np.outer(dz2, h)   # [2, 4]
    db2 = dz2               # [2]
    dh  = W2.T @ dz2        # [4]

    dz1 = dh * (z1 > 0)     # [4] ReLU grad
    dW1 = np.outer(dz1, x)   # [4, 3]
    db1 = dz1               # [4]
    return dW1, db1, dW2, db2

Look at the backward function. It's just 7 lines. Each line corresponds to one step of the chain rule, working from the output back to the input. The key operations are: outer product (for weight gradients), transpose multiply (for passing gradients backward), and element-wise multiply (for ReLU's gradient mask).

Verifying Backprop Correctness

Let's run through the shapes one more time to make sure everything checks out:

LineOperationShape Check
dz2 = p.copy(); dz2[y] -= 1Softmax+CE gradient[2] — one gradient per output
dW2 = np.outer(dz2, h)Outer product[2] ⊗ [4] = [2,4] — matches W2 shape
dh = W2.T @ dz2Transpose multiply[4,2] × [2] = [4] — gradient per hidden unit
dz1 = dh * (z1 > 0)ReLU mask[4] ⊙ [4] = [4] — zero out dead neurons
dW1 = np.outer(dz1, x)Outer product[4] ⊗ [3] = [4,3] — matches W1 shape

Every gradient has the same shape as the parameter it corresponds to. This is a universal property of backprop: if W has shape [m, n], then dL/dW also has shape [m, n]. You can always sanity-check your implementation by verifying this.

Forward vs. Reverse Mode

Backpropagation is reverse-mode automatic differentiation. There's also forward-mode, which computes derivatives left-to-right during the forward pass. Forward mode computes one derivative per pass (doutput/done-input). Reverse mode computes all derivatives in one pass (done-output/dall-inputs).

Since neural network training has one scalar output (the loss) and millions of inputs (the weights), reverse mode is vastly more efficient. This is why we always do backward passes, never forward-mode differentiation. If we had one weight and a million outputs, forward mode would win instead.

Jacobians and the Chain Rule in Matrix Form

For a vector-to-vector function f: Rn → Rm, the derivative is the Jacobian matrix J of shape [m, n], where Jij = ∂fi/∂xj. The chain rule in matrix form is just Jacobian multiplication:

∂L/∂x = JfT · ∂L/∂f(x)

Backpropagation never builds the full Jacobian (it's too large — a layer with 4096 inputs and 4096 outputs has a 16-million-entry Jacobian). Instead, it computes vector-Jacobian products (VJPs): multiply the incoming gradient vector by the Jacobian transpose without ever materializing the full matrix. This is the core trick that makes backprop efficient for neural networks.

For a linear layer y = Wx, the Jacobian is just W itself. The VJP is WT · dL/dy — exactly the operation we compute in backprop. For element-wise operations like ReLU, the Jacobian is diagonal (each output depends only on its corresponding input), so the VJP is just element-wise multiplication by the diagonal entries — the dh * (z1 > 0) operation in our from-scratch code.

This VJP perspective is the mathematically clean way to understand backpropagation. Each layer implements two functions: forward (compute output from input) and VJP (compute input gradient from output gradient). Stack them together and you get a trainable system of arbitrary depth.

Backpropagation Through a Network

Click "Run" to watch a forward pass (teal) followed by a backward pass (orange) through a 3-layer network. Activations fill left-to-right, then gradients flow right-to-left. Edges pulse as gradients propagate. Use the speed slider to slow the animation.

Speed5
Click "Run" to start the animation.
Why is backprop more efficient than computing each gradient independently?

Chapter 7: Gradient Descent Playground

Everything clicks here. Watch a neural network classify 2D points in real time — every weight change, every loss drop, every decision boundary shift. This is what all the theory from Chapters 0–6 looks like in action.

On the left: a scatter plot of 2D points in two classes. The colored background shows the network's current decision boundary — the region where it predicts each class. On the right: the loss curve over training steps, so you can watch the network improve (or fail to).

Try the spiral dataset with 2 hidden units. It can't learn it — the boundary stays linear. Now try 8 hidden units. Watch the boundary fold and curve to separate the spiraling classes. That's the power of width. More neurons = more flexible decision boundaries.
Neural Network Playground

Choose a dataset and architecture, then click Play. Watch the decision boundary evolve as the network learns. The loss curve shows training progress. Try different learning rates and hidden sizes.

Learning rate0.100
Hidden units8
Epoch 0 — Loss: — — Accuracy: —

Experiment ideas to build intuition:

Learning rate too high: Set LR to 1.0. The loss will oscillate wildly or diverge — the network overshoots the optimal weights and bounces around. Lower it to 0.01 and watch it converge smoothly but slowly.

Learning rate too low: Set LR to 0.001 on the spiral dataset. After 500 epochs it's barely learned anything. The loss crawls down at a snail's pace. Learning rate is the most important hyperparameter.

Not enough capacity: Try the spiral dataset with 2 hidden units. The network can only draw straight-ish boundaries. It maxes out at ~50% accuracy. Switch to 16 units and it carves out the spiral perfectly. But more units means more parameters means more risk of overfitting (on small datasets).

XOR is the classic test: A single layer (linear boundary) can never solve XOR — the classes are not linearly separable. But our 2-layer network with 4+ hidden units handles it easily. This is the fundamental capability that nonlinearity provides.

Why Decision Boundaries Curve

A single neuron with no activation draws a straight line (in 2D) or a hyperplane (in higher dimensions). That's all a linear function can do. But stack two layers with ReLU and something remarkable happens: each hidden neuron contributes one "crease" — a boundary where the function bends. Four hidden neurons can create a region enclosed by four creases. Eight neurons can create curved, complex boundaries.

The spiral dataset makes this vivid. Two interleaved spirals are extremely non-linear. With 2 hidden units, the network can only draw a boundary with 2 creases — essentially a straight line with a kink. It's hopeless. With 16 units, the boundary can fold 16 times, tracing the spiral path precisely.

What's Actually Happening Under the Hood

When you click Play, here's what happens on every frame:

Forward
For each of the 100 data points: compute h = ReLU(W1[x,y] + b1), then p = sigmoid(W2h + b2).
Loss
Compute binary cross-entropy: L = −[y log(p) + (1−y) log(1−p)] averaged over all points.
Backward
Compute dL/dW2, dL/db2, dL/dW1, dL/db1 using the chain rule.
Update
W ← W − lr · dL/dW for all weights. One step of gradient descent.
↻ 5 steps per animation frame

The decision boundary visualization works by evaluating the network at a grid of points across the 2D plane. At each grid point, if the network predicts > 0.5, it's colored orange (class 1); otherwise teal (class 0). The intensity of the color reflects the confidence — darker means more certain.

Stochastic vs. Batch vs. Mini-Batch

Our playground uses batch gradient descent: it computes the gradient using ALL data points before updating. This gives a clean gradient estimate but is slow for large datasets. In practice, real networks use mini-batch gradient descent: sample a random subset (batch) of 32-512 examples, compute the gradient on that batch, and update. This is noisier but much faster per step.

Stochastic gradient descent (SGD) technically means batch size = 1: update after every single example. This is very noisy but can escape local minima. In practice, "SGD" usually refers to mini-batch SGD. The noise from small batches is actually helpful — it acts as implicit regularization, preventing the network from settling into sharp, overfitting minima.

This playground is the same training loop you'll use in PyTorch, TensorFlow, and JAX. The framework handles the bookkeeping automatically, but the algorithm is identical: forward pass, loss, backward pass, update. Repeat. The entire field of deep learning is variations on this theme.

Chapter 8: Practical Tips

You built a network, wrote the training loop, hit run... and the loss is NaN. Or it flatlines at a high value. Or it drops to zero on training data but fails on test data. What went wrong?

This chapter covers the three most common failure modes and how to fix them: bad initialization, missing regularization, and wrong learning rate.

Initialization: Where You Start Matters

If you initialize all weights to zero, every neuron in a layer computes the exact same function. They all receive the same gradient and update identically. The network has many neurons but they're all clones — you've effectively got a single neuron per layer. This is called the symmetry problem.

If you initialize with large random values, the pre-activations z = Wx will be huge. Sigmoid and tanh saturate, killing gradients. Even ReLU can produce exploding activations that cascade through layers, leading to NaN.

The solution is Xavier initialization (Glorot, 2010): draw weights from a distribution with variance 1/nin, where nin is the number of inputs to the layer. This keeps the variance of activations roughly constant across layers — not too small, not too big.

W ~ N(0, 1/nin)    (Xavier init)

For ReLU networks, He initialization (He et al., 2015) is better: variance 2/nin. The factor of 2 compensates for the fact that ReLU zeros out half the activations (those with negative pre-activation are zeroed, cutting the variance in half).

The Math Behind Xavier Init

Here's the derivation. For one neuron: z = ∑ wi xi. Assuming inputs and weights are independent with zero mean:

Var(z) = nin · Var(w) · Var(x)

We want Var(z) = Var(x) (keep variance constant across layers). This gives us Var(w) = 1/nin. That's it — the entire derivation of Xavier initialization. For He init, account for ReLU zeroing half the activations: Var(z) = (nin/2) · Var(w) · Var(x), so Var(w) = 2/nin.

W ~ N(0, 2/nin)    (He init for ReLU)

Regularization: Preventing Overfitting

A network with enough parameters can memorize the training set perfectly — loss drops to zero. But it fails on new data because it learned the noise, not the pattern. This is overfitting.

L2 regularization (weight decay) adds a penalty proportional to the squared magnitude of the weights:

Ltotal = Ldata + λ ∑ wi2

This discourages large weights, which tend to produce sharp, complex decision boundaries that fit noise. Small weights produce smoother boundaries that generalize better. The hyperparameter λ controls the trade-off: too large and the network underfits (too smooth); too small and it overfits.

Dropout (Srivastava, 2014) randomly zeros out a fraction of neurons during each training step. This forces the network to be robust — no single neuron can be solely responsible for a prediction. At test time, all neurons are active but their outputs are scaled down by the dropout rate. Typical rate: 0.5 for hidden layers, 0.2 for input layers.

The intuition: imagine a team where any member might be absent on any given day. The team can't rely on any single person — everyone has to be capable. Dropout creates this same pressure on neurons. Each neuron has to be useful on its own, not just as part of a memorized pattern. This dramatically reduces overfitting.

Mathematically, dropout is equivalent to training an ensemble of all possible sub-networks and averaging their predictions. A network with n neurons has 2n possible sub-networks (each neuron either present or dropped). At test time, using all neurons with scaled weights approximates the ensemble average. This interpretation, due to Hinton, explains why dropout works so well — ensembles are one of the most reliable techniques in machine learning, and dropout gives you one for free.

python
# Dropout in practice (PyTorch)
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.5),   # 50% of neurons zeroed each step
    nn.Linear(256, 10),
)
# model.train()  → dropout active
# model.eval()   → dropout off, outputs scaled

Learning Rate: The Most Important Hyperparameter

Too high: the loss oscillates or diverges. The weight updates overshoot the minimum and bounce around. Too low: convergence is painfully slow. The loss decreases but takes thousands of extra epochs. Just right: the loss decreases steadily and reaches a good minimum.

A common strategy is learning rate scheduling. Start with a moderate LR (e.g., 0.001 for Adam, 0.01 for SGD) and decrease it over time. This allows the network to make large, coarse adjustments early and fine-tune later. The most popular schedules are cosine decay and step decay.

A Complete Training Loop

python
import torch
import torch.nn as nn

# 1. Define the model
model = nn.Sequential(
    nn.Linear(3, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

# 2. Choose loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# 3. Training loop
for epoch in range(100):
    logits = model(X_train)          # forward
    loss = criterion(logits, y_train)  # loss
    optimizer.zero_grad()              # clear old gradients
    loss.backward()                   # backprop
    optimizer.step()                   # update weights

Five lines in the loop. That's the entire algorithm. optimizer.zero_grad() is needed because PyTorch accumulates gradients by default (useful for gradient accumulation with large batches). weight_decay=1e-4 adds L2 regularization directly into the optimizer.

Debugging a Broken Training Run

When training fails, check these in order:

Loss is NaN
Learning rate too high, or log(0) in loss. Fix: lower LR, add ε to log, check for 0-probability predictions.
Loss doesn't decrease
1) Learning rate too low. 2) Bug in data loading (labels shuffled). 3) All-zero initialization. 4) Gradient vanishing (check gradient norms).
Train loss low, test loss high
Overfitting. Add dropout, L2 regularization, or more training data. Reduce model size.
Both losses low
Success! But verify: check predictions on individual examples, not just aggregate metrics.

The most common bug: forgetting optimizer.zero_grad(). Without it, gradients accumulate across steps, making the effective gradient grow larger and larger until the network diverges. Always zero gradients before backward().

Monitoring Training: What to Log

Good practitioners monitor more than just the loss. Here's what to track:

MetricWhat It Tells YouWarning Signs
Training lossHow well the model fits training dataNot decreasing: broken; NaN: exploding
Validation lossHow well the model generalizesRising while train loss falls: overfitting
Gradient normMagnitude of the gradient vector>100: exploding; <1e-7: vanishing
Weight norm per layerHow large the parameters areGrowing without bound: no regularization
Activation histogramDistribution of neuron outputsAll zeros: dead ReLU; all saturated: sigmoid saturation
python
# Monitor gradient norms during training
for name, p in model.named_parameters():
    if p.grad is not None:
        grad_norm = p.grad.norm().item()
        print(f"{name}: grad_norm={grad_norm:.4f}")
Xavier init: variance = 1/nin. Keep the signal's magnitude constant across layers. Without proper init, deep networks either explode (NaN) or vanish (zero gradients). This one formula is the difference between "training works" and "training fails."
Init MethodFormulaBest For
XavierVar = 1/ninSigmoid, Tanh
HeVar = 2/ninReLU and variants
Initialization, Regularization, and Learning Rate

Three mini-simulations. Use the tabs to switch between them. Init: activation magnitudes across layers for different initialization strategies. Regularization: L2 penalty controls overfitting. Learning rate: too high diverges, too low crawls.

StrategyXavier
Why is initializing all weights to zero bad?

Chapter 9: Connections

Backpropagation is the engine. Now we need the car — architectures that use this engine to solve real problems. The feedforward network we built here is the simplest architecture. It processes each input independently, with no memory of previous inputs. For language, images, and sequences, we need more.

In the next lecture (L04), we'll add memory: recurrent neural networks read sequences word by word, carrying a hidden state that encodes everything seen so far. Backpropagation through time (BPTT) extends what we learned here to the time dimension — but introduces the vanishing gradient problem at a new scale.

Optimizers: Beyond Vanilla Gradient Descent

The simple update rule w ← w − η · dL/dw works but has problems. The loss landscape has ravines where the gradient oscillates across the narrow dimension and crawls along the long dimension. Modern optimizers fix this.

OptimizerKey IdeaWhen to Use
SGDUpdate using gradient of a mini-batch (not full dataset)Simple baselines, convex problems
SGD + MomentumAccumulate a velocity vector. Like a ball rolling downhill — it builds speed in consistent directions and dampens oscillations.CNNs, when you have time to tune
AdamAdaptive learning rate per parameter. Combines momentum with RMSProp (scale by running average of squared gradients).Default choice for most tasks. Transformers, NLP, generative models.

Why Adam dominates. SGD uses one learning rate for all parameters. But different parameters need different rates: frequent gradients (for common words in NLP) should use small steps; infrequent gradients (for rare words) should use large steps. Adam tracks the first moment (mean) and second moment (variance) of each gradient and adapts the step size accordingly. This per-parameter adaptivity is why Adam converges faster with less tuning.

mt = β1 mt-1 + (1−β1) gt    (momentum)
vt = β2 vt-1 + (1−β2) gt2    (RMSProp)
wt = wt-1 − η · m̂t / (√v̂t + ε)

Typical hyperparameters: β1 = 0.9, β2 = 0.999, ε = 10−8, η = 0.001. In practice, you almost never need to tune β1 and β2. The learning rate η is the main knob.

AdamW (Loshchilov and Hutter, 2019) is a corrected version of Adam that decouples weight decay from the adaptive learning rate. In original Adam, L2 regularization interacts with the per-parameter scaling, making it less effective. AdamW applies weight decay directly to the weights, independent of the gradient scaling. This is now the default optimizer for training Transformers, including all GPT models.

python
# AdamW is the standard for modern NLP models
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,              # typical for Transformers
    weight_decay=0.01,     # decoupled L2 regularization
    betas=(0.9, 0.999),    # momentum and RMSProp decay
)

Seminal Papers

These three papers provide the theoretical foundation for everything in this lesson:

Learning Representations by Backpropagating Errors (Rumelhart, Hinton, Williams, 1986) — The paper that introduced backpropagation to neural networks. Showed that multi-layer networks can learn useful internal representations by propagating error signals backward. The algorithm that made deep learning possible.

Yes You Should Understand Backprop (Karpathy, 2016) — A modern argument for why understanding backprop matters, even in the age of autograd. Practical examples of how backprop intuition helps debug training failures.

NLP (Almost) from Scratch (Collobert & Weston, 2011) — Showed that a simple neural network trained end-to-end with backprop can match or beat hand-engineered NLP features. The first convincing demonstration that deep learning could replace feature engineering in NLP.

What We Didn't Cover

This lesson focused on the core algorithm. Several important topics were deliberately left out for future lessons:

TopicWhy It MattersWhere It's Covered
Batch normalizationNormalizes activations within layers, stabilizes trainingDeep learning fundamentals
Residual connectionsEnable training of very deep networks (100+ layers)Transformer architecture
AttentionAllows the network to focus on relevant parts of the inputCS224N L05
Automatic differentiationHow PyTorch actually implements backprop under the hoodFramework internals

Next Steps

← L02: Word Vectors — How words become numbers that neural networks can process.

L04: Language Models & RNNs → — Adding memory to neural networks for sequential data. (Coming soon)

The Big Picture

Before this lesson, a neural network was a black box. Now you can see inside. You know that it's just matrix multiplications and nonlinearities. You know that the loss function quantifies "how wrong." You know that backpropagation traces blame backward through the computation to figure out which weights to change. And you know that gradient descent makes the changes.

This combination — differentiable forward pass + backprop + gradient descent — is why deep learning works. It's not that neural networks are inherently intelligent. It's that we have an efficient algorithm (backprop) for finding good weights in an astronomically large search space. The algorithm turns learning from an intractable search problem into a tractable optimization problem.

A Timeline of Key Milestones

YearMilestoneSignificance
1958Perceptron (Rosenblatt)First trainable neural network (1 layer)
1969Minsky & Papert critiqueProved perceptron can't learn XOR, killed funding for 15 years
1986Backpropagation (Rumelhart et al.)Made multi-layer networks trainable
2006Deep Belief Networks (Hinton)Reignited interest in deep networks
2012AlexNet (Krizhevsky et al.)Deep CNN + ReLU + GPU won ImageNet, launched the deep learning era
2017Transformer (Vaswani et al.)Attention replaces recurrence, enables GPT and BERT
2020+Scaling laws eraMore data + more parameters + more compute = better performance, predictably

The feedforward network you learned in this lesson is the ancestor of all of these. The Transformer is essentially a feedforward network with a special layer (self-attention) inserted between the standard linear + activation layers. The training algorithm is still backpropagation. The optimizer is usually Adam. The loss is cross-entropy. Everything in this lesson scales up.

Key Takeaways for the Course Ahead

As you continue through CS224N, every new architecture — RNNs, LSTMs, Transformers, BERT, GPT — will use the same training machinery you learned here. The differences are all in the forward pass: how input is transformed into output. The backward pass (backpropagation) and the update rule (gradient descent) remain identical.

Master this lesson and you've mastered the engine. The rest is bodywork.

"What I cannot create, I do not understand." — Richard Feynman. You now understand the complete pipeline: input → forward pass → loss → backward pass → update. Every neural network ever trained follows this loop. The architectures change, the loss functions change, the optimizers change — but the loop is eternal.