Bishop PRML, Chapter 5

Neural Networks

Feed-forward networks, backpropagation, regularization, and Bayesian neural networks — the parametric powerhouse.

Prerequisites: Chapters 3–4 (linear regression, classification, gradient-based optimization).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why Neural Networks?

Linear models (Chapters 3–4) are limited: they can only represent functions that are linear in the basis functions. We had to choose basis functions by hand. What if we could learn the basis functions from data?

A neural network does exactly this. It composes multiple layers of adaptive basis functions — each layer transforms its input through a linear combination followed by a nonlinear activation. By stacking layers, the network can represent increasingly abstract features.

From fixed to learned features: In Chapter 3, we chose basis functions (polynomials, Gaussians) by hand. A neural network learns the basis functions from data, jointly with the output weights. Each hidden unit becomes an adaptive basis function. This is the fundamental advantage of neural networks over linear models — and the source of their power and their difficulty.

Bishop covers the classical two-layer network (one hidden layer + one output layer) in depth. The ideas — backpropagation, regularization, Bayesian treatment — generalize directly to modern deep networks with many layers.

Check: What is the key advantage of neural networks over linear models with fixed basis functions?

Chapter 1: Feed-Forward Architecture

A two-layer feed-forward network computes:

yk(x, w) = σout( ∑j=0M wkj(2) h( ∑i=0D wji(1) xi ) )

Breaking this down:

StageComputationWhat it does
Hidden pre-activationaj = ∑ wji(1) xiLinear combination of inputs
Hidden activationzj = h(aj)Nonlinear transformation
Output pre-activationak = ∑ wkj(2) zjLinear combination of hidden units
Output activationyk = σout(ak)Depends on task (identity, sigmoid, softmax)
Neural Network Function Approximation

A network with one hidden layer of M units fitting a sine wave. Adjust M to see how more hidden units allow more complex fits.

Hidden units M3

The total number of parameters in a two-layer network with D inputs, M hidden units, and K outputs (including biases) is (D+1)M + (M+1)K. The universal approximation theorem guarantees that a single hidden layer with enough units can approximate any continuous function to arbitrary accuracy. But "enough" can be exponentially many — deeper networks are often more efficient.

Check: What does the universal approximation theorem guarantee?

Chapter 2: Activation Functions

The hidden-layer activation function h(a) must be nonlinear — otherwise stacking layers gives another linear model. Bishop focuses on two classical choices:

Logistic sigmoid: σ(a) = 1/(1 + e−a). Output in (0, 1). Saturates at both extremes. Derivative: σ'(a) = σ(a)(1 − σ(a)).

Tanh: tanh(a) = 2σ(2a) − 1. Output in (−1, 1). Zero-centered, which aids optimization. Derivative: tanh'(a) = 1 − tanh2(a).

The output activation depends on the task:

TaskOutput activationError function
RegressionIdentity (linear)Sum-of-squares
Binary classificationSigmoidCross-entropy
Multi-class classificationSoftmaxMulti-class cross-entropy
Matching outputs to error functions: The choice of output activation and error function are linked through the probabilistic model. Regression with Gaussian noise → identity output + sum-of-squares. Classification with Bernoulli likelihood → sigmoid + cross-entropy. These combinations yield clean gradient forms: the output error is always (yk − tk). This isn't coincidence — it follows from the canonical link functions of generalized linear models.
Check: For a multi-class classification network, what output activation should you use?

Chapter 3: Backpropagation

We need gradients of the error function with respect to every weight to do gradient-based optimization. Backpropagation computes all these gradients efficiently using the chain rule.

Two passes through the network:

PassDirectionComputes
ForwardInput → OutputAll activations aj, zj, yk
BackwardOutput → InputAll error signals δj = ∂E/∂aj

Define the error signal for each unit: δj = ∂E/∂aj. For output units: δk = yk − tk. For hidden units, the chain rule gives:

δj = h'(aj) ∑k wkj δk

The gradient for any weight is: ∂E/∂wji = δj zi (error signal times input to that weight).

Backprop is just the chain rule, applied efficiently. The key insight: instead of computing the chain rule separately for each weight (which would require a separate forward pass), we share computation by propagating error signals backward through the network. The cost of computing ALL gradients is O(W) — proportional to the number of weights, similar to the cost of one forward pass. This efficiency is what makes training neural networks practical.

The gradient tells us the direction of steepest descent. We then update weights: w(new) = w(old) − η ∇E. In practice, we use mini-batch SGD or more sophisticated optimizers (momentum, Adam).

Check: What is the computational cost of backpropagation relative to a forward pass?

Chapter 4: The Hessian

The Hessian H is the matrix of second derivatives: Hij = ∂2E/∂wi∂wj. It captures the curvature of the error surface. Why do we care?

ApplicationHow the Hessian helps
Newton's methodUses H−1 for faster convergence
Laplace approximationH at the mode gives the Gaussian approximation to the posterior
Model comparisondet(H) appears in the evidence approximation
PruningIdentifies weights that can be removed (Optimal Brain Damage)

Computing the full Hessian costs O(W2) which is prohibitive for large networks. Fortunately, several approximations exist:

Diagonal approximation: Keep only the diagonal entries. O(W) cost but ignores correlations between weights.

Outer product approximation: H ≈ ∑n gngnT where gn is the gradient for data point n. Valid near a minimum where the residuals are small.

Fast Hessian-vector products: We often don't need the full Hessian — just its product with a vector: Hv. Pearlmutter (1994) showed this can be computed exactly in O(W) time using a technique called R-propagation. This is the basis of conjugate gradient methods for training neural networks without ever storing the full Hessian.
Check: Why is the full Hessian impractical for large networks?

Chapter 5: Regularization

Neural networks with many parameters overfit easily. Bishop discusses several regularization strategies:

Weight decay (L2 regularization): Add λ/2 · wTw to the error function. Just as in Chapter 3, this is equivalent to a Gaussian prior on the weights. Larger λ → smaller weights → smoother function.

Early stopping: Train on the training set, monitor error on a validation set, and stop when the validation error starts increasing. The number of training steps acts as an implicit regularizer — more steps allow the network to fit more complex patterns.

Early stopping ≈ weight decay: Bishop shows a remarkable connection. For a network trained with gradient descent starting from small random weights, early stopping is approximately equivalent to weight decay with λ proportional to 1/(number of steps). Both methods limit the effective complexity of the model, just in different ways.

Invariances: If we know the output should be invariant to certain input transformations (e.g., digit recognition should be invariant to small rotations), we can: (1) augment the training data with transformed versions, (2) add a regularization term penalizing sensitivity to the transformation, or (3) build the invariance into the network architecture (like convolutional neural networks).

Check: How is early stopping related to weight decay?

Chapter 6: Mixture Density Networks

Standard regression networks predict a single mean and variance for each input. But what if the target distribution is multimodal? For example, a robot arm with multiple valid configurations for the same end-effector position.

A mixture density network (MDN) outputs the parameters of a Gaussian mixture:

p(t|x) = ∑k=1K πk(x) N(t| μk(x), σk2(x))

The network outputs for each component k: mixing coefficients πk (via softmax), means μk (unconstrained), and variances σk2 (via exp to ensure positivity).

Why this matters: MDNs represent a general conditional density, not just a conditional mean. They can model inverse problems where a single input maps to multiple plausible outputs. This is the network saying "there are K possible answers, each with its own probability." Standard regression (which outputs a single prediction) would average over the modes, producing an answer that might actually be impossible.

Training uses the negative log-likelihood of the mixture as the error function. Gradients flow through the mixture parameters back to the network weights via backprop. MDNs are an early example of neural networks parameterizing complex probability distributions — an idea that became central in modern generative models (VAEs, normalizing flows).

Check: When should you use a mixture density network instead of standard regression?

Chapter 7: Bayesian Neural Networks

The full Bayesian approach to neural networks: instead of finding a single weight vector, maintain a posterior distribution over all possible weights:

p(w|D) ∝ p(D|w) p(w)

For predictions, marginalize over the posterior:

p(t|x, D) = ∫ p(t|x, w) p(w|D) dw

The integral is intractable for neural networks (the posterior is highly non-Gaussian and multimodal). Bishop discusses the Laplace approximation: find the MAP weights, then approximate the posterior as Gaussian using the Hessian at the MAP.

Bayesian NNs give uncertainty for free. A standard neural network gives a single prediction. A Bayesian NN gives a predictive distribution — wider where data is sparse (the network is unsure) and tighter where data is plentiful. In safety-critical applications (medical diagnosis, autonomous driving), knowing what the model doesn't know is as important as what it does know.

The evidence framework (Ch 3) extends to neural networks: maximize p(D|α, β) with respect to the hyperparameters α (weight prior precision) and β (noise precision). The effective number of parameters γ again plays a central role.

Modern approaches go beyond the Laplace approximation: variational inference (Ch 10), MC dropout, and Hamiltonian Monte Carlo (Ch 11) provide better approximations to the weight posterior.

Check: What is the main difficulty with Bayesian neural networks?

Chapter 8: Practical Training

Training a neural network involves navigating a complex, non-convex error surface. Practical considerations:

Optimization: The error surface has many local minima, saddle points, and plateaus. Different initialization → different solutions. Techniques to help: momentum (carry velocity from previous steps), adaptive learning rates (Adam, RMSProp), and learning rate schedules.

Weight initialization: Too large → saturation. Too small → vanishing signals. A good rule: initialize weights with variance proportional to 1/(fan-in), so the variance of activations stays roughly constant across layers.

Symmetry breaking: If all weights are initialized identically, all hidden units compute the same function and receive identical gradients. They'd stay identical forever. Random initialization breaks this symmetry, allowing different units to learn different features. This is why we must initialize randomly, not to zero.

Data preprocessing: Standardize inputs to zero mean and unit variance. This makes the error surface more isotropic (similar curvature in all directions), which helps gradient descent converge faster.

Network architecture: How many hidden layers? How many units per layer? Bishop emphasizes using regularization with a large network rather than carefully tuning the size of a small network. Let the regularizer (or the Bayesian prior) control complexity, not the architecture.

Check: Why must neural network weights be initialized randomly rather than to the same value?

Chapter 9: Summary

Chapter 5 introduces the most flexible parametric model in the book. The key ideas:

ConceptWhat it gives you
Feed-forward architectureUniversal function approximation with learned features
BackpropagationEfficient gradient computation — O(W) for all weights
HessianSecond-order info for optimization, pruning, Bayesian treatment
RegularizationWeight decay, early stopping, invariances — control complexity
Mixture density networksModel multimodal conditional distributions
Bayesian NNsUncertainty quantification via posterior over weights
The tradeoff: Neural networks can approximate any function but are hard to train (non-convex optimization), hard to interpret, and require lots of data. The Bayesian framework helps manage complexity but requires approximations. Kernel methods (Ch 6–7) offer a complementary approach: they trade some flexibility for convex optimization and exact Bayesian treatment.

What comes next: Chapter 6 introduces kernel methods and Gaussian processes, where we work in function space directly rather than weight space.

"The number of weight parameters is a poor measure
of model complexity."
— Christopher Bishop, PRML §5.2
Check: Why does Bishop say the number of weights is a poor measure of complexity?