From linear classifiers to universal function approximators — one hidden layer at a time.
You're building a classifier for two-class data. You've mastered linear methods: a weight vector w and a bias that define a separating hyperplane wTx + b = 0. For linearly separable data, this works beautifully.
But what about the data below? Two classes arranged in a spiral. No single line (or hyperplane) can separate them. Every linear classifier will misclassify at least a quarter of the points.
This isn't a contrived example. Real-world signals — speech phonemes, EEG states, radar returns — routinely have nonlinear decision boundaries. Linear classifiers hit a hard wall.
The idea is simple: transform x through a set of nonlinear functions, then apply a linear classifier to the transformed features. If the transformations are chosen well, data that was tangled in the original space becomes separable in the new space.
Neural networks learn these transformations from data, rather than requiring us to design them by hand.
Two spirals that no linear classifier can separate. The orange line shows the best linear attempt. Click "Add Hidden Layer" to see what a neural network can learn.
Linear classifierA single hidden layer neural network maps an input vector x ∈ Rp to an output through two stages. First, M hidden units compute nonlinear features. Then, the output layer combines those features linearly.
Each hidden unit m = 1, ..., M computes:
where αm is the weight vector for hidden unit m (including bias αm0), and σ is a nonlinear activation function (sigmoid, tanh, or ReLU — we'll explore these in Chapter 2).
Think of each hidden unit as a "feature detector." The weights αm define which direction in input space this unit responds to, and the activation function makes the response nonlinear.
For regression (predicting a continuous value), output k is:
where gk(u) = u (the identity function) for regression, and g = softmax for classification.
With p inputs, M hidden units, and K outputs:
| Layer | Parameters | Count |
|---|---|---|
| Hidden (α) | M vectors of size p+1 | M(p + 1) |
| Output (β) | K vectors of size M+1 | K(M + 1) |
| Total | M(p+1) + K(M+1) |
For p = 10 inputs, M = 50 hidden units, K = 3 outputs: 50(11) + 3(51) = 550 + 153 = 703 parameters. Small enough to fit on a napkin, yet powerful enough to approximate virtually any smooth function.
Adjust the number of hidden units M to see how the network structure changes. Each line is a learnable weight.
The activation function σ is what makes neural networks nonlinear. Without it, stacking linear layers just gives another linear function: W2(W1x) = (W2W1)x. The choice of σ has major practical implications for training speed and gradient flow.
Maps any real number to (0, 1). Historically the first activation used. Derivative: σ'(u) = σ(u)(1 − σ(u)). Maximum derivative of 0.25 at u = 0. Problem: for large |u|, the derivative is near zero — vanishing gradients. During backprop, gradients get multiplied through layers, and tiny derivatives kill the signal.
Maps to (−1, 1). Zero-centered, so outputs can be positive or negative. Derivative: tanh'(u) = 1 − tanh2(u). Maximum of 1.0 at u = 0. Still suffers vanishing gradients in the tails, but less severely than sigmoid since the max derivative is 4× larger.
Dead simple. For positive inputs, the gradient is exactly 1 — no vanishing. For negative inputs, the gradient is 0. This creates dead neurons (units that never activate), but in practice ReLU trains much faster than sigmoid or tanh for deep networks.
| Activation | Range | Max |σ'| | Vanishing? | Use |
|---|---|---|---|---|
| Sigmoid | (0, 1) | 0.25 | Yes | Output for binary classification |
| Tanh | (−1, 1) | 1.0 | Moderate | RNNs, small nets |
| ReLU | [0, ∞) | 1.0 | No | Default for hidden layers |
See each activation (solid) and its derivative (dashed). Notice how sigmoid's derivative nearly vanishes for |u| > 3.
Here's a remarkable fact: a single hidden layer network with enough hidden units can approximate any continuous function on a compact set to arbitrary accuracy. This is the Universal Approximation Theorem (Cybenko 1989, Hornik 1991).
More precisely: for any continuous f: [a,b]p → R and any ε > 0, there exists M hidden units with weights αm and βm such that:
provided σ is a non-constant, bounded, continuous function (sigmoid qualifies).
Consider 1D. A sigmoid σ(αx + b) is a smooth step function. By making α large, it becomes nearly a sharp step at x = −b/α. Now, two sigmoids with opposite signs create a "bump":
This bump is approximately 1 on [a, b] and 0 elsewhere. By summing many bumps of different heights and positions, we can approximate any function — like building a histogram that matches the target function's shape.
The function f(x) = |x| is continuous but not differentiable at 0. With M = 2 ReLU units:
Two hidden units, exact representation. With smooth activations like sigmoid, we'd need more units for the sharp corner at zero, but we can get arbitrarily close.
A 1D target function (teal) is approximated by a sum of sigmoid bumps (orange). Add more hidden units to get a tighter fit.
We have a network with parameters θ = {α1, ..., αM, β}. Training means finding θ that minimizes a loss L(θ). Gradient descent needs ∂L/∂θ for every parameter. Backpropagation computes all these gradients efficiently using the chain rule, in one backward pass through the network.
Let's derive backprop for a concrete case. One input x, one hidden layer with M units, one output ŷ. Regression with MSE loss.
With concrete numbers. Let x = 2.0, y = 1.0 (the target). Two hidden units (M = 2). Sigmoid activation. Initial weights:
| Parameter | Value |
|---|---|
| α1, b1 | 0.5, −0.3 |
| α2, b2 | −0.4, 0.2 |
| β1, β2, β0 | 0.6, −0.8, 0.1 |
Hidden unit 1: α1x + b1 = 0.5(2) − 0.3 = 0.7. σ(0.7) = 1/(1 + e−0.7) = 0.668. So z1 = 0.668.
Hidden unit 2: α2x + b2 = −0.4(2) + 0.2 = −0.6. σ(−0.6) = 1/(1 + e0.6) = 0.354. So z2 = 0.354.
Output: ŷ = 0.6(0.668) + (−0.8)(0.354) + 0.1 = 0.401 − 0.283 + 0.1 = 0.218.
Loss: L = 0.5(1.0 − 0.218)2 = 0.5(0.782)2 = 0.306.
Now we compute gradients, starting from the loss and working backwards. This is why it's called backpropagation.
Step 1: Loss → output.
Step 2: Output → output weights.
Step 3: Output → hidden activations.
Step 4: Through the activation function. Since zm = σ(am) and σ'(a) = σ(a)(1 − σ(a)):
Step 5: Hidden activations → input weights. Since am = αmx + bm:
Click "Step" to walk through the forward pass, then the backward pass. Watch numbers flow through the network.
Ready — click StepThe loss function L measures how wrong the network's prediction is. Different tasks need different losses. The choice of loss also determines the output activation function.
For predicting a continuous value y ∈ R:
Output activation: g(u) = u (identity). The gradient is simple: ∂L/∂ŷi = −2(yi − ŷi)/N. MSE penalizes large errors quadratically, making it sensitive to outliers.
For predicting y ∈ {0, 1}. The output ŷ = σ(βTz) ∈ (0,1) represents P(y = 1 | x).
When y = 1 and ŷ is near 0, log(ŷ) → −∞, creating a huge loss. This forces the network to be confident about correct predictions. The gradient has a beautiful form: ∂L/∂ŷ = (ŷ − y)/(ŷ(1 − ŷ)), which cancels the sigmoid derivative during backprop, leaving just ŷ − y.
For K classes, the output layer uses softmax:
This converts K raw scores (logits) a1,...,aK into a valid probability distribution: all ŷk > 0 and they sum to 1.
where yik is 1 if sample i has true class k, 0 otherwise (one-hot encoding).
Three classes. Raw logits: a = [2.0, 1.0, 0.5]. Compute softmax:
exp(a) = [7.389, 2.718, 1.649]. Sum = 11.756.
ŷ = [0.629, 0.231, 0.140].
True class is k = 0 (one-hot: [1, 0, 0]). Cross-entropy loss:
L = −log(0.629) = 0.464.
If the network were more confident (logits [5.0, 1.0, 0.5]), ŷ0 would be 0.973, and L = −log(0.973) = 0.027. Higher confidence in the correct class → lower loss.
Adjust the three logits and see how softmax converts them to probabilities. The loss is computed for the highlighted class.
We know how to compute gradients (backprop) and what to minimize (loss function). Now: how do we actually run the optimization?
Compute the gradient using ALL N training samples, then update:
where η is the learning rate. This gives the exact gradient but costs O(N) per step. For N = 1 million images, each update is extremely expensive.
Use a single random sample to estimate the gradient:
where it is randomly chosen. The gradient estimate is noisy (high variance) but unbiased: E[∇L(xi)] = (1/N)∑∇L(xi). Each update is O(1) in dataset size, so we can make many more updates per second.
The practical compromise. Use a random subset B of B samples:
Typical batch sizes: B = 32, 64, 128, 256. The variance of the gradient estimate scales as σ2/B, so increasing B reduces noise. But the cost also scales linearly. B = 32 is a sweet spot for many problems.
Too large: the updates overshoot the minimum, loss oscillates or diverges. Too small: training takes forever and gets stuck in poor local minima.
Common schedule: start with η = 0.01 or 0.001, reduce by a factor of 10 when loss plateaus. More advanced: Adam optimizer automatically adapts per-parameter learning rates.
One epoch = one pass through the entire training set. With N = 50,000 samples and B = 50, one epoch = 1,000 mini-batch updates. Typical training runs: 50–300 epochs.
Watch SGD (noisy, orange) vs. batch gradient descent (smooth, teal) descend the same loss surface. SGD bounces but gets there faster per unit of compute.
You've built the complete neural network story from the ground up: the architecture (Chapter 1), the nonlinearity (Chapter 2), the theoretical guarantee (Chapter 3), the training algorithm (Chapters 4-6). Let's consolidate.
| Component | Formula | Role |
|---|---|---|
| Hidden layer | zm = σ(αmTx) | Learned nonlinear features |
| Output (regression) | ŷ = βTz | Linear combination of features |
| Output (classification) | ŷ = softmax(βTz) | Class probabilities |
| MSE loss | ½(y − ŷ)2 | Regression objective |
| Cross-entropy loss | −y log(ŷ) | Classification objective |
| Backprop | Chain rule layer by layer | Compute all gradients |
| SGD update | θ ← θ − η∇L | Move toward minimum |
Single hidden layer networks are powerful in theory, but in practice, deep networks (many layers) learn hierarchical features more efficiently. The next lecture explores why depth helps and introduces convolutional neural networks.
| Topic | This Lecture | Next Lecture |
|---|---|---|
| Depth | 1 hidden layer | L hidden layers |
| Architecture | Fully connected | Convolutional (CNNs) |
| Regularization | (not covered) | BatchNorm, Dropout |
| Connections | Feed-forward only | Residual (skip) connections |
"What I cannot create, I do not understand." — Richard Feynman