The XOR problem, hidden layers, activation functions, backpropagation, and universal approximation. The foundation of everything that follows.
A linear model can draw a line (or hyperplane) through data. But what if the data is not linearly separable? What if cats and dogs overlap in pixel space, and no single line can separate them?
The answer: stack layers. Each layer transforms the data into a new representation where the problem becomes easier. The first layer might detect edges. The second combines edges into textures. The third assembles textures into object parts. By the final layer, "cat" and "dog" live in well-separated regions of the learned representation.
XOR is the simplest function a linear model cannot learn. Given two binary inputs, XOR returns 1 when exactly one input is 1, and 0 otherwise. Plot the four points: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. No single line can separate the 0s from the 1s.
Figure: a linear model fails to separate XOR (left); one hidden layer with ReLU transforms the space so that a line works (right).
A network with one hidden layer solves XOR easily. The hidden layer learns a new representation h = ReLU(Wx + b) where the transformed points are linearly separable. This is the key idea: hidden layers learn features.
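A minimal NumPy sketch of this idea follows. The weights are hand-picked for illustration (one known solution, not the result of training), and the names `W`, `b`, `w_out`, and `b_out` are assumptions introduced here.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# The four XOR inputs (one per row) and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Hand-picked parameters (illustrative assumption, not learned weights).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])      # hidden-layer weights
b = np.array([0.0, -1.0])       # hidden-layer biases
w_out = np.array([1.0, -2.0])   # output weights
b_out = 0.0                     # output bias

H = relu(X @ W.T + b)           # hidden representation h = ReLU(Wx + b)
y_hat = H @ w_out + b_out       # a linear readout on the new representation

print(H)      # hidden-space coordinates of the four points
print(y_hat)  # matches the XOR targets for these weights
```

In the hidden space the two inputs labeled 1 land on the same point, and the remaining points fall on either side of it, so a single linear readout is enough.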
A feedforward network with one hidden layer computes $h = \sigma(W^{(1)}x + b^{(1)})$, then $y = W^{(2)}h + b^{(2)}$. Here $W^{(1)}$ and $W^{(2)}$ are weight matrices, $b^{(1)}$ and $b^{(2)}$ are bias vectors, and $\sigma$ is a nonlinear activation function.
Why do we need the nonlinearity σ? Because without it, composing linear layers just gives another linear layer: $W^{(2)}(W^{(1)}x) = (W^{(2)}W^{(1)})x$. Including the biases does not change this, since the composition is still a single affine map. No matter how many layers you stack, the result collapses to one matrix multiplication (plus a bias). The activation function breaks this linearity and allows the network to learn complex functions.
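The collapse can be checked numerically. The sketch below uses random matrices (the shapes and seed are arbitrary assumptions) and includes bias vectors, showing that two stacked layers without an activation equal one affine layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 -> 2
x = rng.normal(size=3)

# Two stacked layers with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single affine layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

collapsed = np.allclose(two_layers, one_layer)
print(collapsed)

# With a ReLU between the layers, the output is in general no longer
# reproducible by a single (W, b) pair for all inputs x.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2
```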
The width of a layer is the number of neurons (hidden units). The depth is the number of layers. A wide-and-shallow network can approximate any continuous function on a compact domain (by the universal approximation theorem), but it may need a very large number of units. For some families of functions, a deeper network can represent the same function with exponentially fewer units.
The activation function introduces the nonlinearity that makes deep networks powerful. Over the decades, several have risen and fallen in popularity.
The sigmoid $\sigma(z) = 1/(1 + e^{-z})$ squashes any input to (0, 1). It was the original choice, inspired by biological neurons. But it saturates for large |z|, where the gradient approaches zero. This is the source of the vanishing gradient problem.
The tanh function tanh(z) = 2σ(2z) − 1 maps to (−1, 1). It is zero-centered (which helps gradient flow) but still saturates at the extremes.
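The ReLU (Rectified Linear Unit) g(z) = max(0, z) is the modern default. It does not saturate for positive inputs, so gradients pass through unchanged there: its gradient is exactly 1 for z > 0 and 0 for z < 0 (at z = 0 the derivative is undefined, and implementations conventionally use 0). The downside: a neuron whose pre-activation is ≤ 0 for every training input outputs zero everywhere, receives zero gradient, and stops learning. Such a neuron is called "dead".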
Compare activation functions and their derivatives. The derivative is what flows backward during training.
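The three activations and their derivatives are short enough to write out directly. This is a plain NumPy sketch; the function names are chosen here for illustration, and the ReLU derivative at exactly 0 is set to 0 by convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2  # peaks at 1 when z = 0

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    # Convention (assumption): treat the derivative at z = 0 as 0.
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for name, d in [("sigmoid", d_sigmoid), ("tanh", d_tanh), ("relu", d_relu)]:
    print(name, d(z))
```

At z = ±10 the sigmoid and tanh derivatives are close to zero (saturation), while the ReLU derivative stays at 1 for every positive input.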
The output layer's activation function depends on the task. It must produce output in the right format for the loss function.
Linear output (no activation) for regression: predict a real-valued number. Paired with MSE loss. The network predicts the mean of a Gaussian distribution over the target.
Sigmoid output for binary classification: output a probability P(y=1|x) ∈ (0, 1). Paired with binary cross-entropy loss: L = −[y log ŷ + (1−y) log(1−ŷ)].
Softmax output for multi-class classification: output a vector of K probabilities that sum to 1, with $\mathrm{softmax}(z)_i = \exp(z_i) / \sum_j \exp(z_j)$. Paired with cross-entropy loss: $L = -\sum_k y_k \log \hat{y}_k$.
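The softmax and the two cross-entropy losses above can be sketched in NumPy as follows. Subtracting the maximum logit before exponentiating and adding a small `eps` inside the logarithms are common numerical-stability choices assumed here, not part of the formulas themselves.

```python
import numpy as np

def softmax(z):
    # Shifting by the max leaves the result unchanged but avoids overflow in exp.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # Multi-class loss: -sum_k y_k log y_hat_k
    return -np.sum(y_onehot * np.log(y_hat + eps), axis=-1)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Binary loss: -[y log y_hat + (1 - y) log(1 - y_hat)]
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

logits = np.array([2.0, 1.0, 0.1])           # example scores (arbitrary)
probs = softmax(logits)
loss = cross_entropy(np.array([1.0, 0.0, 0.0]), probs)
print(probs, probs.sum(), loss)
```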
Training a neural network means minimizing a cost function (or loss function). Most cost functions in deep learning are derived from the principle of maximum likelihood: find parameters θ that maximize P(data | θ).
Taking the negative log turns maximizing likelihood into minimizing negative log-likelihood (NLL). This is the cross-entropy between the empirical data distribution and the model distribution. The resulting loss depends on the distribution the model outputs: (1) Gaussian with fixed variance → NLL becomes MSE, up to a scale factor and an additive constant. (2) Bernoulli → NLL becomes binary cross-entropy. (3) Categorical → NLL becomes multi-class cross-entropy.
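The Gaussian case can be worked out in a few lines. The derivation below assumes a fixed variance, written $s^2$ to avoid clashing with the activation symbol σ, and N independent training examples; both are assumptions made for this sketch.

```latex
\begin{aligned}
p(y \mid x; \theta) &= \frac{1}{\sqrt{2\pi s^2}}
    \exp\!\left(-\frac{(y - \hat{y})^2}{2 s^2}\right) \\
-\log p(y \mid x; \theta) &= \frac{(y - \hat{y})^2}{2 s^2}
    + \frac{1}{2}\log\!\left(2\pi s^2\right) \\
\mathrm{NLL}(\theta) &= \frac{1}{2 s^2} \sum_{n=1}^{N} \left(y_n - \hat{y}_n\right)^2
    + \frac{N}{2}\log\!\left(2\pi s^2\right)
\end{aligned}
```

The second term does not depend on θ, and the factor $1/(2s^2)$ only rescales the first, so minimizing the NLL gives the same parameters as minimizing the mean squared error.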
In practice, we add a regularization term to prevent overfitting: $J(\theta) = \mathrm{NLL} + \lambda R(\theta)$. Common choices: L2 regularization $R(\theta) = \|\theta\|_2^2$ (weight decay) or L1 regularization $R(\theta) = \|\theta\|_1$ (sparsity). Chapter 7 covers this in depth.
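As a small sketch of the regularized cost, the helper below adds a penalty to an already-computed NLL value. The function names and the example numbers are hypothetical, and the L2 penalty is taken to be the squared norm.

```python
import numpy as np

def l2_penalty(theta):
    return np.sum(theta ** 2)        # squared L2 norm

def l1_penalty(theta):
    return np.sum(np.abs(theta))     # L1 norm

def regularized_cost(nll, theta, lam, penalty=l2_penalty):
    # J(theta) = NLL + lambda * R(theta)
    return nll + lam * penalty(theta)

theta = np.array([0.5, -2.0, 0.0, 1.5])   # example parameter vector (arbitrary)
print(regularized_cost(1.0, theta, lam=0.1))                      # L2 penalty
print(regularized_cost(1.0, theta, lam=0.1, penalty=l1_penalty))  # L1 penalty
```

Larger values of `lam` push the optimum toward smaller weights at the expense of fit to the training data.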
How does the network learn? By computing the gradient of the loss with respect to every weight, then nudging each weight in the direction that decreases the loss. Backpropagation (backprop) computes all these gradients efficiently using the chain rule.
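The chain rule: if z = f(y) and y = g(x), then dz/dx = (dz/dy)(dy/dx). In a network with L layers, the gradient of the loss J with respect to weights in layer l requires multiplying through all layers from L down to l. Schematically, writing $h^{(k)}$ for the output of layer k:

$$\frac{\partial J}{\partial W^{(l)}} = \frac{\partial J}{\partial h^{(L)}} \cdot \frac{\partial h^{(L)}}{\partial h^{(L-1)}} \cdots \frac{\partial h^{(l+1)}}{\partial h^{(l)}} \cdot \frac{\partial h^{(l)}}{\partial W^{(l)}}$$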
Backprop works in two passes. The forward pass computes activations layer by layer, storing intermediate results. The backward pass propagates the error signal backward through the network, computing gradients at each layer using the stored intermediates.
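The two passes can be sketched for a one-hidden-layer network as follows. This is a minimal illustration under stated assumptions: a tanh hidden layer, a linear output, a halved mean squared error loss, and XOR as toy data. All function names are hypothetical. A finite-difference check compares the backprop gradients against numerical estimates.

```python
import numpy as np

def forward(params, X):
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1.T + b1)     # hidden activations, stored for the backward pass
    Y_hat = H @ W2.T + b2          # linear output, shape (n, 1)
    return Y_hat, (X, H)

def loss_fn(Y_hat, Y):
    return 0.5 * np.mean(np.sum((Y_hat - Y) ** 2, axis=1))

def backward(params, cache, Y_hat, Y):
    W1, b1, W2, b2 = params
    X, H = cache
    dY = (Y_hat - Y) / X.shape[0]  # dJ/dY_hat
    dW2 = dY.T @ H                 # output-layer weight gradient
    db2 = dY.sum(axis=0)
    dH = dY @ W2                   # error signal propagated to the hidden layer
    dZ1 = dH * (1.0 - H ** 2)      # through tanh: tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1.T @ X
    db1 = dZ1.sum(axis=0)
    return dW1, db1, dW2, db2

def numerical_grad(params, X, Y, idx, eps=1e-6):
    # Central finite differences on params[idx], one entry at a time.
    grad = np.zeros_like(params[idx])
    for i in np.ndindex(grad.shape):
        old = params[idx][i]
        params[idx][i] = old + eps
        plus = loss_fn(forward(params, X)[0], Y)
        params[idx][i] = old - eps
        minus = loss_fn(forward(params, X)[0], Y)
        params[idx][i] = old       # restore the original value
        grad[i] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
params = [rng.normal(size=(3, 2)), np.zeros(3), rng.normal(size=(1, 3)), np.zeros(1)]

Y_hat, cache = forward(params, X)
grads = backward(params, cache, Y_hat, Y)
max_err = max(np.max(np.abs(g - numerical_grad(params, X, Y, k)))
              for k, g in enumerate(grads))
print(max_err)  # should be tiny if the backward pass is correct

# One gradient-descent step: nudge each parameter against its gradient.
lr = 0.1
new_params = [p - lr * g for p, g in zip(params, grads)]
```

The backward pass reuses `H` from the forward pass rather than recomputing it, which is the reason intermediates are stored.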
Here is a remarkable result: a feedforward network with a single hidden layer containing enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy. This is the universal approximation theorem.
Think of it this way: each ReLU neuron contributes a "hinge" — a line that bends at a specific point. With enough hinges, you can approximate any curve to any desired precision, like building a smooth curve from many short straight segments.
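The hinge picture can be made concrete. The sketch below builds a sum of ReLU hinges that linearly interpolates a target function between evenly spaced knots. It is a hand-constructed approximation (no training involved), and `fit_hinges` and `hinge_model` are hypothetical helper names.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def fit_hinges(f, lo, hi, n_hinges):
    knots = np.linspace(lo, hi, n_hinges + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)   # slope on each segment
    coeffs = np.diff(slopes, prepend=0.0)       # change in slope at each knot
    return knots[:-1], coeffs, values[0]

def hinge_model(x, knots, coeffs, offset):
    # offset + sum_k coeffs[k] * ReLU(x - knots[k]): one hidden ReLU unit per hinge
    return offset + relu(x[:, None] - knots[None, :]) @ coeffs

x = np.linspace(0.0, 2.0 * np.pi, 1001)
err_8 = np.max(np.abs(hinge_model(x, *fit_hinges(np.sin, 0.0, 2.0 * np.pi, 8)) - np.sin(x)))
err_64 = np.max(np.abs(hinge_model(x, *fit_hinges(np.sin, 0.0, 2.0 * np.pi, 64)) - np.sin(x)))
print(err_8, err_64)  # the error shrinks as hinges are added
```

Each hinge corresponds to one hidden unit, so adding neurons adds segments and tightens the fit.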
This is why depth matters in practice. A single wide hidden layer suffices in theory, but the theorem says nothing about how many neurons are needed, and the count can be impractically large. For some families of functions, deep networks can represent the same function with exponentially fewer parameters. Depth enables compositional representations: complex functions built by composing simple ones, just as a computer program uses subroutines.
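One constructed example of this depth advantage is a sawtooth made by composing a two-unit ReLU "tent" layer with itself. The sketch below is a toy illustration for one specific function family, not a general statement about all functions; `tent` and `deep_sawtooth` are hypothetical names.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # One layer with two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    # Stack the same two-unit layer `depth` times.
    for _ in range(depth):
        x = tent(x)
    return x

depth = 5
x = np.arange(2 ** depth + 1) / 2 ** depth   # grid of breakpoints in [0, 1]
y = deep_sawtooth(x, depth)

n_peaks = int(np.sum(y == 1.0))              # number of sawtooth peaks
n_pieces = int(np.sum(np.diff(y) != 0))      # number of linear pieces
print(n_peaks, n_pieces, 2 * depth)          # peaks, pieces, hidden units used
```

The deep version uses 2 × depth hidden units yet produces 2^depth linear pieces. A one-hidden-layer ReLU network with n units on a scalar input has at most n + 1 linear pieces, so it would need at least 2^depth − 1 units to match.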
Train a small feedforward network on 2D classification datasets. Watch hidden layers transform the space in real time. See how depth, width, and activation choices affect the learned decision boundary.
Feedforward networks are the foundation on which all deep learning architectures are built. Every concept in this chapter reappears:
| Concept | Where It Appears |
|---|---|
| Hidden representations | Embeddings (NLP), feature maps (CNNs), latent spaces (VAEs) |
| Activation functions | ReLU in CNNs (Ch 9), GELU in Transformers, sigmoid in gates (LSTM, Ch 10) |
| Backpropagation | Training every network. Backprop through time for RNNs (Ch 10) |
| Softmax + cross-entropy | The final layer of most classification models |
| Universal approximation | Theoretical justification for why neural networks can represent the functions we care about |
| Depth advantage | ResNets, Transformers — modern architectures are very deep |
Up next: Chapter 7: Regularization — how to prevent these powerful function approximators from memorizing the training data.