Goodfellow et al., Chapter 6

Deep Feedforward Networks

The XOR problem, hidden layers, activation functions, backpropagation, and universal approximation. The foundation of everything that follows.

Prerequisites: Chapters 2-4 (linear algebra, probability, numerical computation).

Chapter 0: Why Depth?

A linear model can draw a line (or hyperplane) through data. But what if the data is not linearly separable? What if cats and dogs overlap in pixel space, and no single line can separate them?

The answer: stack layers. Each layer transforms the data into a new representation where the problem becomes easier. The first layer might detect edges. The second combines edges into textures. The third assembles textures into object parts. By the final layer, "cat" and "dog" live in well-separated regions of the learned representation.

Feedforward networks (also called multilayer perceptrons, or MLPs) are the prototypical deep learning model. Data flows forward through layers — no loops, no feedback. They approximate a function f*(x) by learning parameters θ such that f(x; θ) ≈ f*(x). Every modern architecture (CNNs, Transformers, etc.) is built on this foundation.

Input Layer: raw features (pixels, tokens, etc.)

↓

Hidden Layers: learned representations

↓

Output Layer: prediction (class probabilities, regression value)

Why do neural networks need hidden layers?

(a) To make training slower
(b) To transform data into representations where the problem becomes linearly separable
(c) To store more data

Chapter 1: The XOR Problem

XOR is the simplest function a linear model cannot learn. Given two binary inputs, XOR returns 1 when exactly one input is 1, and 0 otherwise. Plot the four points: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. No single line can separate the 0s from the 1s.
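The claim that no line separates the XOR points can be checked with a short argument. The sketch below assumes a linear classifier that predicts 1 when w₁x₁ + w₂x₂ + b > 0 and 0 otherwise; the symbols w₁, w₂, and b are introduced here for illustration only.

```latex
% Constraints a linear classifier would have to satisfy on the four XOR points
\begin{align*}
(0,0) \mapsto 0 &: \quad b \le 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b > 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b > 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b \le 0
\end{align*}
% Adding the two middle inequalities gives
\[
w_1 + w_2 + 2b > 0
\quad\Longrightarrow\quad
w_1 + w_2 + b > -b \ge 0 ,
\]
% which contradicts the last constraint, so no such (w_1, w_2, b) exists.
```

Flipping which side of the line counts as class 1 leads to the same contradiction with the inequalities reversed.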

Figure: XOR, linear model vs hidden layer. Left: a linear model fails to separate XOR. Right: one hidden layer with ReLU transforms the space so a line works.

A network with one hidden layer solves XOR easily. The hidden layer learns a new representation h = ReLU(Wx + b) where the transformed points are linearly separable. This is the key idea: hidden layers learn features.

Key insight: The hidden layer performs a learned change of coordinates. In the original (x₁, x₂) space, XOR is impossible for a linear classifier. In the hidden representation (h₁, h₂) space, the classes are linearly separable. Deep learning is representation learning.
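As a concrete illustration of the change of coordinates, here is a minimal sketch of a two-unit ReLU network that computes XOR exactly. The weights are hand-picked for the example (an assumption, not the result of training); a trained network may settle on different values.

```python
def relu(z):
    return max(0.0, z)

# Hand-picked parameters (illustrative assumption, not learned).
W = [[1.0, 1.0],   # weights into hidden unit h1
     [1.0, 1.0]]   # weights into hidden unit h2
b = [0.0, -1.0]    # hidden biases
w_out = [1.0, -2.0]
b_out = 0.0

def hidden(x):
    """Hidden representation h = ReLU(Wx + b)."""
    return [relu(W[i][0] * x[0] + W[i][1] * x[1] + b[i]) for i in range(2)]

def predict(x):
    """Linear readout on top of the hidden representation."""
    h = hidden(x)
    return w_out[0] * h[0] + w_out[1] * h[1] + b_out

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "-> h =", hidden(x), "-> y =", predict(x))
```

In hidden space the four inputs land on (0, 0), (1, 0), (1, 0), and (2, 1). The two class-1 inputs collapse onto the same point, and the line h₁ − 2h₂ = 0.5 now separates the classes.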

Why can't a linear model solve XOR?

(a) The four XOR points are not linearly separable: no single line can put 0s on one side and 1s on the other
(b) XOR has too many inputs
(c) Linear models cannot handle binary data

Chapter 2: Hidden Layers

A feedforward network with one hidden layer computes:

h = σ(W⁽¹⁾x + b⁽¹⁾), y = W⁽²⁾h + b⁽²⁾

Here W⁽¹⁾ and W⁽²⁾ are weight matrices, b⁽¹⁾ and b⁽²⁾ are bias vectors, and σ is a nonlinear activation function applied elementwise.

Why do we need the nonlinearity σ? Because without it, composing linear layers just gives another linear layer: W⁽²⁾(W⁽¹⁾x + b⁽¹⁾) + b⁽²⁾ = (W⁽²⁾W⁽¹⁾)x + (W⁽²⁾b⁽¹⁾ + b⁽²⁾). No matter how many layers you stack, the result is a single affine map: one matrix multiplication plus one bias. The activation function breaks this linearity and allows the network to learn complex functions.
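The collapse of stacked linear layers can be checked numerically. The sketch below uses small made-up matrices (the values and the helper names `matvec`, `matmul`, and `vadd` are assumptions for illustration) and compares two layers without an activation against the single merged layer.

```python
def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

# Arbitrary example parameters (assumed values).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 3x2
b1 = [0.5, -1.0, 2.0]
W2 = [[1.0, 0.0, -2.0], [2.0, 1.0, 1.0]]     # 2x3
b2 = [1.0, -3.0]

def two_linear_layers(x):
    # No activation between the layers.
    return vadd(matvec(W2, vadd(matvec(W1, x), b1)), b2)

# Merge the two layers into one: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = vadd(matvec(W2, b1), b2)

def one_linear_layer(x):
    return vadd(matvec(W, x), b)

for x in ([1.0, -2.0], [3.0, 4.0]):
    print(two_linear_layers(x), one_linear_layer(x))
```

Both functions return the same vector for every input, so the second layer adds parameters but no expressive power.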

The width of a layer is the number of neurons (hidden units). The depth is the number of layers. A wide, shallow network can approximate any continuous function on a compact domain (by the universal approximation theorem), but for some families of functions a deeper network reaches the same accuracy with exponentially fewer units.

Design principle: Each hidden layer learns a progressively more abstract representation. Layer 1 might learn edges. Layer 2 combines edges into textures. Layer 3 assembles textures into object parts. Depth enables this compositional hierarchy of features.

What happens if you stack multiple linear layers without activation functions?

(a) The network becomes more powerful
(b) Training becomes unstable
(c) The composition collapses to a single linear transformation, with no benefit from depth

Chapter 3: Activation Functions

The activation function introduces the nonlinearity that makes deep networks powerful. Over the decades, several have risen and fallen in popularity.

The sigmoid σ(z) = 1/(1 + e^(−z)) squashes any input to (0, 1). It was the original choice, inspired by biological neurons. But it saturates for large |z|, where the gradient approaches zero. This is the vanishing gradient problem.

The tanh function tanh(z) = 2σ(2z) − 1 maps to (−1, 1). It is zero-centered (which helps gradient flow) but still saturates at the extremes.

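The ReLU (Rectified Linear Unit) g(z) = max(0, z) is the modern default. It does not saturate for positive inputs, so gradients pass through unchanged there. Its gradient is exactly 1 for z > 0 and 0 for z < 0; at z = 0 it is undefined, and implementations conventionally use 0. The downside: a neuron whose pre-activation is negative for every training input is "dead". It outputs 0, receives zero gradient, and stops learning.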

Activation Functions

Compare activation functions and their derivatives. The derivative is what flows backward during training.

ReLU variants: Leaky ReLU uses g(z) = max(αz, z) with small α (e.g., 0.01) to avoid dead neurons. ELU and GELU are smoother alternatives. GELU is used in modern Transformers. But plain ReLU remains the most common default for CNNs and MLPs.
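The activations discussed above and their derivatives fit in a few lines. This is a minimal scalar sketch using only the standard library; the function names are chosen here for illustration, and the ReLU derivative at z = 0 is set to 0 by convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # peaks at 0.25 when z = 0

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def d_relu(z):
    return 1.0 if z > 0 else 0.0   # convention: derivative 0 at z = 0

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)       # assumes 0 < alpha < 1

def d_leaky_relu(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z={z:6.1f}  sigmoid'={d_sigmoid(z):.5f}  tanh'={d_tanh(z):.5f}  "
          f"relu'={d_relu(z):.1f}  leaky'={d_leaky_relu(z):.2f}")
```

At z = ±10 the sigmoid and tanh derivatives are close to zero, while the ReLU derivative is still exactly 1 on the positive side.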

Why did ReLU largely replace sigmoid as the default activation?

(a) ReLU does not saturate for positive inputs, so gradients flow without vanishing, enabling training of deep networks
(b) ReLU always produces positive outputs
(c) ReLU uses less memory

Chapter 4: Output Units

The output layer's activation function depends on the task. It must produce output in the right format for the loss function.

Linear output (no activation) for regression: predict a real-valued number. Paired with MSE loss. The network predicts the mean of a Gaussian distribution over the target.

Sigmoid output for binary classification: output a probability P(y=1|x) ∈ (0, 1). Paired with binary cross-entropy loss: L = −[y log ŷ + (1−y) log(1−ŷ)].

Softmax output for multi-class classification: output a vector of K probabilities that sum to 1. softmax(z)_i = exp(z_i) / ∑_j exp(z_j). Paired with cross-entropy loss: L = −∑_k y_k log ŷ_k.
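A minimal sketch of softmax and its cross-entropy loss follows. Subtracting the largest logit before exponentiating is a standard numerical-stability trick that leaves the result unchanged; the example logits are made up for illustration.

```python
import math

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    m = max(z)                              # shift so the largest exponent is 0
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_onehot, y_hat):
    """L = -sum_k y_k log(y_hat_k); only the true class contributes."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, y_hat) if y > 0)

logits = [2.0, 1.0, 0.1]                    # assumed example values
probs = softmax(logits)
loss = cross_entropy([1, 0, 0], probs)
print(probs, sum(probs), loss)
```

Without the shift, `math.exp(1000.0)` would overflow; with it, `softmax([1000.0, 1000.0])` returns equal probabilities as expected.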

Key insight: The output activation and loss function are not independent choices. They form a matched pair. Sigmoid + binary cross-entropy, softmax + cross-entropy, linear + MSE. Using the wrong pair (e.g., sigmoid + MSE) causes gradients to vanish when the model is confidently wrong — exactly when learning should be strongest.
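The vanishing-gradient effect of a mismatched pair can be seen directly in the gradient with respect to the pre-activation z of a sigmoid output ŷ = σ(z). With binary cross-entropy the gradient is ŷ − y; with the squared error ½(ŷ − y)² it picks up an extra factor ŷ(1 − ŷ). The sketch below evaluates both at a confidently wrong prediction (the function names are illustrative).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_bce_wrt_logit(z, y):
    """d/dz of -[y log s + (1-y) log(1-s)] with s = sigmoid(z)."""
    return sigmoid(z) - y

def grad_mse_wrt_logit(z, y):
    """d/dz of 0.5 * (s - y)^2 with s = sigmoid(z)."""
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)

z, y = -10.0, 1.0    # the model is confident in class 0, the target is class 1
print("BCE gradient:", grad_bce_wrt_logit(z, y))
print("MSE gradient:", grad_mse_wrt_logit(z, y))
```

At z = −10 the cross-entropy gradient has magnitude close to 1, while the squared-error gradient is on the order of 10⁻⁵, so the mismatched pair barely moves the weights when the error is largest.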

Why should softmax be paired with cross-entropy loss rather than MSE?

(a) MSE is computationally cheaper
(b) Cross-entropy provides strong gradients even when predictions are confidently wrong; MSE gradients vanish near saturation
(c) Softmax requires exactly two classes

Chapter 5: Cost Functions

Training a neural network means minimizing a cost function (or loss function). Most cost functions in deep learning are derived from the principle of maximum likelihood: find parameters θ that maximize P(data | θ).

J(θ) = −E_{(x, y) ~ data} [log p(y | x; θ)]

Taking the negative log turns maximizing likelihood into minimizing negative log-likelihood (NLL). This is the cross-entropy between the empirical data distribution and the model distribution. The familiar losses follow from the choice of output distribution: (1) Gaussian with fixed variance → NLL becomes MSE, up to a scale factor and an additive constant. (2) Bernoulli → NLL becomes binary cross-entropy. (3) Categorical → NLL becomes multi-class cross-entropy.
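The Gaussian case can be verified numerically. For a Gaussian with mean ŷ and fixed standard deviation σ, the per-example NLL is ½ log(2πσ²) + (y − ŷ)²/(2σ²), so the average NLL equals MSE/(2σ²) plus a constant. The sketch below checks this on made-up numbers (the data and function names are assumptions for illustration).

```python
import math

def gaussian_nll(targets, preds, sigma=1.0):
    """Average negative log-likelihood under N(pred, sigma^2)."""
    const = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const + (t - p) ** 2 / (2.0 * sigma ** 2)
               for t, p in zip(targets, preds)) / len(targets)

def mse(targets, preds):
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)

targets = [1.0, 2.0, 3.5, -0.5]     # assumed example data
preds_a = [0.8, 2.1, 3.0, 0.0]
preds_b = [0.0, 0.0, 0.0, 0.0]

for preds in (preds_a, preds_b):
    nll = gaussian_nll(targets, preds)
    rebuilt = 0.5 * mse(targets, preds) + 0.5 * math.log(2.0 * math.pi)
    print(nll, rebuilt)
```

Because the two quantities differ only by a positive scale and an additive constant, any parameters that minimize one also minimize the other.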

Why maximum likelihood? Under standard regularity conditions it is consistent (converges to the true parameters given enough data, provided the model family contains the true distribution) and asymptotically efficient (for large samples, no consistent estimator has lower mean squared error). It also provides a unified framework: you don't need to hand-design loss functions. Choose a probability model, and the loss falls out automatically.

In practice, we add a regularization term to prevent overfitting: J(θ) = NLL + λ R(θ). Common choices: L2 regularization R = ||θ||₂² (weight decay) or L1 regularization R = ||θ||₁ (sparsity). Chapter 7 of the book (Regularization) covers this in depth.

What is the connection between MSE loss and maximum likelihood?

(a) MSE is the negative log-likelihood (up to constants) when assuming Gaussian noise on the targets, so minimizing one minimizes the other
(b) There is no connection
(c) MSE is only used for classification

Chapter 6: Backpropagation

How does the network learn? By computing the gradient of the loss with respect to every weight, then nudging each weight in the direction that decreases the loss. Backpropagation (backprop) computes all these gradients efficiently using the chain rule.

The chain rule: if z = f(y) and y = g(x), then dz/dx = (dz/dy)(dy/dx). In a network with L layers, write a^(l) for the activations of layer l and W^(l) for its weights (written W⁽¹⁾, W⁽²⁾ above). The gradient of the loss J with respect to W^(l) is then a product of local derivatives running from layer L down to layer l:

∂J/∂W^(l) = ∂J/∂a^(L) · ∂a^(L)/∂a^(L−1) · ... · ∂a^(l+1)/∂a^(l) · ∂a^(l)/∂W^(l)

Backprop works in two passes. The forward pass computes activations layer by layer, storing intermediate results. The backward pass propagates the error signal backward through the network, computing gradients at each layer using the stored intermediates.

Why backprop is efficient: Estimating the gradient one weight at a time, for example by finite differences, needs at least one extra forward pass per weight, so O(n) passes for n weights. Backprop shares computation: one forward pass plus one backward pass yields all n gradients. The backward pass costs a small constant multiple of the forward pass, so the total stays within a small constant factor of a single forward pass, however many parameters the network has.
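To make the two passes concrete, here is a minimal sketch of backprop for a tiny network, checked against finite differences. It assumes a tanh hidden layer, a scalar linear output, and the loss J = ½(y − t)²; the parameter values and function names are made up for illustration.

```python
import math

def forward(params, x):
    """Forward pass: returns hidden activations h and output y."""
    W1, b1, w2, b2 = params
    z = [sum(W1[i][j] * x[j] for j in range(len(x))) + b1[i]
         for i in range(len(b1))]
    h = [math.tanh(v) for v in z]
    y = sum(w2[i] * h[i] for i in range(len(h))) + b2
    return h, y

def loss(params, x, t):
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2

def backward(params, x, t):
    """One forward pass plus one backward pass gives every gradient."""
    W1, b1, w2, b2 = params
    h, y = forward(params, x)                  # stored intermediates
    dy = y - t                                 # dJ/dy
    dw2 = [dy * hi for hi in h]                # dJ/dw2
    db2 = dy
    dh = [dy * w for w in w2]                  # error pushed back to h
    dz = [dh[i] * (1.0 - h[i] ** 2) for i in range(len(h))]  # tanh' = 1 - tanh^2
    dW1 = [[dz[i] * xj for xj in x] for i in range(len(h))]
    db1 = dz
    return dW1, db1, dw2, db2

def numeric_grad_W1(params, x, t, i, j, eps=1e-6):
    """Central finite difference: two extra forward passes per weight."""
    W1, b1, w2, b2 = params
    plus = [row[:] for row in W1]
    minus = [row[:] for row in W1]
    plus[i][j] += eps
    minus[i][j] -= eps
    return (loss((plus, b1, w2, b2), x, t)
            - loss((minus, b1, w2, b2), x, t)) / (2.0 * eps)

params = ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],   # W1 (3x2)
          [0.0, 0.1, -0.1],                         # b1
          [0.3, -0.6, 0.9],                         # w2
          0.05)                                     # b2
x, t = [1.0, 2.0], 1.0

dW1, db1, dw2, db2 = backward(params, x, t)
for i in range(3):
    for j in range(2):
        print(dW1[i][j], numeric_grad_W1(params, x, t, i, j))
```

The analytic and numeric columns agree to several decimal places. The numeric check needs two forward passes per weight, whereas `backward` produces all of the gradients from a single pair of passes.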

Forward Pass: compute activations layer by layer, store intermediates

↓

Compute Loss: compare prediction to target

↓

Backward Pass: chain rule from loss back through each layer

↓

Update Weights: w ← w − ε ∇J
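The four steps above (forward pass, loss, backward pass, weight update) can be sketched as a complete training loop on XOR. This is an illustrative implementation under stated assumptions: a tanh hidden layer, a sigmoid output with binary cross-entropy, full-batch gradient descent, and arbitrary choices for `hidden`, `lr`, `steps`, and `seed`. Reaching a perfect XOR fit is typical for this setup but not guaranteed for every random initialization.

```python
import math
import random

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y = [0.0, 1.0, 1.0, 0.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    h = [math.tanh(W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i])
         for i in range(len(b1))]
    p = sigmoid(sum(w2[i] * h[i] for i in range(len(h))) + b2)
    return h, p

def mean_bce(W1, b1, w2, b2):
    total = 0.0
    for x, y in zip(X, Y):
        _, p = forward(x, W1, b1, w2, b2)
        p = min(max(p, 1e-12), 1.0 - 1e-12)     # guard against log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(X)

def train_xor(hidden=8, lr=0.3, steps=5000, seed=0):
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    initial_loss = mean_bce(W1, b1, w2, b2)
    for _ in range(steps):
        gW1 = [[0.0, 0.0] for _ in range(hidden)]
        gb1 = [0.0] * hidden
        gw2 = [0.0] * hidden
        gb2 = 0.0
        for x, y in zip(X, Y):
            h, p = forward(x, W1, b1, w2, b2)   # 1. forward pass
            dlogit = (p - y) / len(X)           # 2. loss gradient at the output logit
            gb2 += dlogit
            for i in range(hidden):             # 3. backward pass (chain rule)
                gw2[i] += dlogit * h[i]
                dz = dlogit * w2[i] * (1.0 - h[i] ** 2)
                gW1[i][0] += dz * x[0]
                gW1[i][1] += dz * x[1]
                gb1[i] += dz
        for i in range(hidden):                 # 4. update: w <- w - lr * grad
            W1[i][0] -= lr * gW1[i][0]
            W1[i][1] -= lr * gW1[i][1]
            b1[i] -= lr * gb1[i]
            w2[i] -= lr * gw2[i]
        b2 -= lr * gb2
    final_loss = mean_bce(W1, b1, w2, b2)
    preds = [forward(x, W1, b1, w2, b2)[1] for x in X]
    return initial_loss, final_loss, preds

initial_loss, final_loss, preds = train_xor()
print("loss before:", initial_loss, "loss after:", final_loss)
print("predictions:", [round(p, 3) for p in preds])
```

The gradient at the output logit is simply p − y because the sigmoid output is paired with binary cross-entropy.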

Why is backpropagation efficient compared to computing each gradient independently?

(a) It uses less memory
(b) It only works on small networks
(c) One forward + one backward pass gives all gradients by sharing intermediate computations via the chain rule

Chapter 7: Universal Approximation

Here is a remarkable result: a feedforward network with a single hidden layer, a suitable nonlinear activation (such as sigmoid or ReLU), and enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy. This is the universal approximation theorem.

Think of it this way: each ReLU neuron contributes a "hinge" — a line that bends at a specific point. With enough hinges, you can approximate any curve to any desired precision, like building a smooth curve from many short straight segments.
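The hinge picture can be turned into a small construction. The sketch below builds a one-hidden-layer ReLU network by hand (no training) that linearly interpolates a target function at evenly spaced knots: each hidden unit is ReLU(x − kᵢ), and its output weight is the change in slope at knot kᵢ. The target sin(x) on [0, 2π] and the helper names are assumptions chosen for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def build_hinge_net(f, a, b, n):
    """Return a network with n ReLU units that interpolates f at n+1 knots on [a, b]."""
    knots = [a + (b - a) * i / n for i in range(n + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / (knots[i + 1] - knots[i])
              for i in range(n)]
    # Output weight of each hinge = change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n)]
    bias = f(a)

    def net(x):
        return bias + sum(c * relu(x - k) for c, k in zip(coeffs, knots[:n]))
    return net

def max_error(f, net, a, b, samples=1000):
    """Largest absolute error on an evenly spaced grid."""
    return max(abs(f(a + (b - a) * i / samples) - net(a + (b - a) * i / samples))
               for i in range(samples + 1))

a, b = 0.0, 2.0 * math.pi
errors = {n: max_error(math.sin, build_hinge_net(math.sin, a, b, n), a, b)
          for n in (4, 16, 64)}
for n, err in errors.items():
    print(f"{n:3d} hinges: max error {err:.5f}")
```

The error shrinks as hinges are added, which is the existence claim in miniature. The construction says nothing about whether gradient descent would find these weights.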

The catch: The theorem says nothing about how to find the right weights (learning), how many neurons you need (could be astronomically many), or how well the network generalizes to new data. Universal approximation is an existence result, not a construction recipe. It tells you networks CAN work; it does not tell you they WILL work.

This is why depth matters in practice. A single wide hidden layer suffices in theory, but for some families of functions a deep network needs exponentially fewer units than a shallow one. Depth enables compositional representations: complex functions built by composing simple ones, just as a computer program uses subroutines.

What is the practical limitation of the universal approximation theorem?

(a) It only applies to ReLU networks
(b) It only works in one dimension
(c) It guarantees existence but says nothing about learning, the number of neurons needed, or generalization

Chapter 8: Network Playground

Train a small feedforward network on 2D classification datasets. Watch hidden layers transform the space in real time. See how depth, width, and activation choices affect the learned decision boundary.

Feedforward Network Trainer (interactive demo): choose a dataset and architecture, then train to watch the decision boundary evolve as the network learns to separate the colored regions. Default settings: hidden size 8, depth 2 layers, learning rate 0.30, dataset XOR.

Experiments to try: (1) XOR with 1 layer, width 2 — barely enough. Increase width to see it learn reliably. (2) Spiral with depth 1 vs 3 — depth dramatically helps complex boundaries. (3) Try very high learning rate on moons — watch it overshoot and oscillate.

When training on the spiral dataset, increasing depth from 1 to 3 helps significantly. Why?

(a) Deeper networks compose multiple nonlinear transformations, enabling them to carve complex, curved decision boundaries that a single layer cannot
(b) Deeper networks always train faster
(c) Spiral data requires exactly 3 layers

Chapter 9: Connections

Feedforward networks are the foundation on which all deep learning architectures are built. Every concept in this chapter reappears:

Concept	Where It Appears
Hidden representations	Embeddings (NLP), feature maps (CNNs), latent spaces (VAEs)
Activation functions	ReLU in CNNs (Ch 9), GELU in Transformers, sigmoid in gates (LSTM, Ch 10)
Backpropagation	Training every network. Backprop through time for RNNs (Ch 10)
Softmax + cross-entropy	The final layer of most multi-class classifiers
Universal approximation	Theoretical justification that networks have enough capacity
Depth advantage	ResNets, Transformers — modern architectures are very deep

What you should take away: A feedforward network is a function approximator built from layers of linear transformations followed by nonlinear activations. Backpropagation computes gradients efficiently. The universal approximation theorem guarantees capacity; depth, data, and optimization determine whether that capacity is realized.

Up next: Chapter 7: Regularization — how to prevent these powerful function approximators from memorizing the training data.

What is the single most important idea in this chapter?

(a) Networks should always use sigmoid
(b) Hidden layers learn representations that transform data into a form where the task becomes easier
(c) More layers always mean better performance