Ch 5: Neural Networks — Bishop PRML

Chapter 0: Why Neural Networks?

Linear models (Chapters 3–4) are limited: they can only represent functions that are linear in the basis functions. We had to choose basis functions by hand. What if we could learn the basis functions from data?

A neural network does exactly this. It composes multiple layers of adaptive basis functions — each layer transforms its input through a linear combination followed by a nonlinear activation. By stacking layers, the network can represent increasingly abstract features.

From fixed to learned features: In Chapter 3, we chose basis functions (polynomials, Gaussians) by hand. A neural network learns the basis functions from data, jointly with the output weights. Each hidden unit becomes an adaptive basis function. This is the fundamental advantage of neural networks over linear models — and the source of their power and their difficulty.

Bishop covers the classical two-layer network (one hidden layer + one output layer) in depth. The ideas — backpropagation, regularization, Bayesian treatment — generalize directly to modern deep networks with many layers.

Check: What is the key advantage of neural networks over linear models with fixed basis functions?

Neural networks learn the basis functions from data rather than requiring manual specification Neural networks are always faster to train Neural networks don't need activation functions

Chapter 1: Feed-Forward Architecture

A two-layer feed-forward network computes:

y_k(x, w) = σ_out( ∑_j=0^M w_kj⁽²⁾ h( ∑_i=0^D w_ji⁽¹⁾ x_i ) )

Breaking this down:

Stage	Computation	What it does
Hidden pre-activation	a_j = ∑ w_ji⁽¹⁾ x_i	Linear combination of inputs
Hidden activation	z_j = h(a_j)	Nonlinear transformation
Output pre-activation	a_k = ∑ w_kj⁽²⁾ z_j	Linear combination of hidden units
Output activation	y_k = σ_out(a_k)	Depends on task (identity, sigmoid, softmax)

Neural Network Function Approximation

A network with one hidden layer of M units fitting a sine wave. Adjust M to see how more hidden units allow more complex fits.

Hidden units M3

The total number of parameters in a two-layer network with D inputs, M hidden units, and K outputs (including biases) is (D+1)M + (M+1)K. The universal approximation theorem guarantees that a single hidden layer with enough units can approximate any continuous function to arbitrary accuracy. But "enough" can be exponentially many — deeper networks are often more efficient.

Check: What does the universal approximation theorem guarantee?

A single hidden layer network can approximate any continuous function given enough hidden units Any network can be trained in polynomial time Deep networks are always better than shallow ones

Chapter 2: Activation Functions

The hidden-layer activation function h(a) must be nonlinear — otherwise stacking layers gives another linear model. Bishop focuses on two classical choices:

Logistic sigmoid: σ(a) = 1/(1 + e^−a). Output in (0, 1). Saturates at both extremes. Derivative: σ'(a) = σ(a)(1 − σ(a)).

Tanh: tanh(a) = 2σ(2a) − 1. Output in (−1, 1). Zero-centered, which aids optimization. Derivative: tanh'(a) = 1 − tanh²(a).

The output activation depends on the task:

Task	Output activation	Error function
Regression	Identity (linear)	Sum-of-squares
Binary classification	Sigmoid	Cross-entropy
Multi-class classification	Softmax	Multi-class cross-entropy

Matching outputs to error functions: The choice of output activation and error function are linked through the probabilistic model. Regression with Gaussian noise → identity output + sum-of-squares. Classification with Bernoulli likelihood → sigmoid + cross-entropy. These combinations yield clean gradient forms: the output error is always (y_k − t_k). This isn't coincidence — it follows from the canonical link functions of generalized linear models.

Check: For a multi-class classification network, what output activation should you use?

Sigmoid on each output independently Identity (linear) Softmax (normalizes across all K outputs to produce a valid probability distribution)

Chapter 3: Backpropagation

We need gradients of the error function with respect to every weight to do gradient-based optimization. Backpropagation computes all these gradients efficiently using the chain rule.

Two passes through the network:

Pass	Direction	Computes
Forward	Input → Output	All activations a_j, z_j, y_k
Backward	Output → Input	All error signals δ_j = ∂E/∂a_j

Define the error signal for each unit: δ_j = ∂E/∂a_j. For output units: δ_k = y_k − t_k. For hidden units, the chain rule gives:

δ_j = h'(a_j) ∑_k w_kj δ_k

The gradient for any weight is: ∂E/∂w_ji = δ_j z_i (error signal times input to that weight).

Backprop is just the chain rule, applied efficiently. The key insight: instead of computing the chain rule separately for each weight (which would require a separate forward pass), we share computation by propagating error signals backward through the network. The cost of computing ALL gradients is O(W) — proportional to the number of weights, similar to the cost of one forward pass. This efficiency is what makes training neural networks practical.

The gradient tells us the direction of steepest descent. We then update weights: w^(new) = w^(old) − η ∇E. In practice, we use mini-batch SGD or more sophisticated optimizers (momentum, Adam).

Check: What is the computational cost of backpropagation relative to a forward pass?

Roughly the same order — O(W) where W is the number of weights Quadratic in the number of weights Exponential in the depth of the network

Chapter 4: The Hessian

The Hessian H is the matrix of second derivatives: H_ij = ∂²E/∂w_i∂w_j. It captures the curvature of the error surface. Why do we care?

Application	How the Hessian helps
Newton's method	Uses H⁻¹ for faster convergence
Laplace approximation	H at the mode gives the Gaussian approximation to the posterior
Model comparison	det(H) appears in the evidence approximation
Pruning	Identifies weights that can be removed (Optimal Brain Damage)

Computing the full Hessian costs O(W²) which is prohibitive for large networks. Fortunately, several approximations exist:

Diagonal approximation: Keep only the diagonal entries. O(W) cost but ignores correlations between weights.

Outer product approximation: H ≈ ∑_n g_ng_n^T where g_n is the gradient for data point n. Valid near a minimum where the residuals are small.

Fast Hessian-vector products: We often don't need the full Hessian — just its product with a vector: Hv. Pearlmutter (1994) showed this can be computed exactly in O(W) time using a technique called R-propagation. This is the basis of conjugate gradient methods for training neural networks without ever storing the full Hessian.

Check: Why is the full Hessian impractical for large networks?

It requires O(W^2) storage and computation, where W can be millions It doesn't provide useful information It can only be computed for linear networks

Chapter 5: Regularization

Neural networks with many parameters overfit easily. Bishop discusses several regularization strategies:

Weight decay (L2 regularization): Add λ/2 · w^Tw to the error function. Just as in Chapter 3, this is equivalent to a Gaussian prior on the weights. Larger λ → smaller weights → smoother function.

Early stopping: Train on the training set, monitor error on a validation set, and stop when the validation error starts increasing. The number of training steps acts as an implicit regularizer — more steps allow the network to fit more complex patterns.

Early stopping ≈ weight decay: Bishop shows a remarkable connection. For a network trained with gradient descent starting from small random weights, early stopping is approximately equivalent to weight decay with λ proportional to 1/(number of steps). Both methods limit the effective complexity of the model, just in different ways.

Invariances: If we know the output should be invariant to certain input transformations (e.g., digit recognition should be invariant to small rotations), we can: (1) augment the training data with transformed versions, (2) add a regularization term penalizing sensitivity to the transformation, or (3) build the invariance into the network architecture (like convolutional neural networks).

Check: How is early stopping related to weight decay?

Early stopping is approximately equivalent to weight decay, where the regularization strength is inversely proportional to training steps They are completely unrelated techniques Early stopping is always better than weight decay

Chapter 6: Mixture Density Networks

Standard regression networks predict a single mean and variance for each input. But what if the target distribution is multimodal? For example, a robot arm with multiple valid configurations for the same end-effector position.

A mixture density network (MDN) outputs the parameters of a Gaussian mixture:

p(t|x) = ∑_k=1^K π_k(x) N(t| μ_k(x), σ_k²(x))

The network outputs for each component k: mixing coefficients π_k (via softmax), means μ_k (unconstrained), and variances σ_k² (via exp to ensure positivity).

Why this matters: MDNs represent a general conditional density, not just a conditional mean. They can model inverse problems where a single input maps to multiple plausible outputs. This is the network saying "there are K possible answers, each with its own probability." Standard regression (which outputs a single prediction) would average over the modes, producing an answer that might actually be impossible.

Training uses the negative log-likelihood of the mixture as the error function. Gradients flow through the mixture parameters back to the network weights via backprop. MDNs are an early example of neural networks parameterizing complex probability distributions — an idea that became central in modern generative models (VAEs, normalizing flows).

Check: When should you use a mixture density network instead of standard regression?

When the target distribution is multimodal (multiple valid outputs for the same input) When the data is linearly separable When there are more features than data points

Chapter 7: Bayesian Neural Networks

The full Bayesian approach to neural networks: instead of finding a single weight vector, maintain a posterior distribution over all possible weights:

p(w|D) ∝ p(D|w) p(w)

For predictions, marginalize over the posterior:

p(t|x, D) = ∫ p(t|x, w) p(w|D) dw

The integral is intractable for neural networks (the posterior is highly non-Gaussian and multimodal). Bishop discusses the Laplace approximation: find the MAP weights, then approximate the posterior as Gaussian using the Hessian at the MAP.

Bayesian NNs give uncertainty for free. A standard neural network gives a single prediction. A Bayesian NN gives a predictive distribution — wider where data is sparse (the network is unsure) and tighter where data is plentiful. In safety-critical applications (medical diagnosis, autonomous driving), knowing what the model doesn't know is as important as what it does know.

The evidence framework (Ch 3) extends to neural networks: maximize p(D|α, β) with respect to the hyperparameters α (weight prior precision) and β (noise precision). The effective number of parameters γ again plays a central role.

Modern approaches go beyond the Laplace approximation: variational inference (Ch 10), MC dropout, and Hamiltonian Monte Carlo (Ch 11) provide better approximations to the weight posterior.

Check: What is the main difficulty with Bayesian neural networks?

The posterior over weights is intractable (non-Gaussian, multimodal), requiring approximations The prior is impossible to specify Backpropagation doesn't work with distributions

Chapter 8: Practical Training

Training a neural network involves navigating a complex, non-convex error surface. Practical considerations:

Optimization: The error surface has many local minima, saddle points, and plateaus. Different initialization → different solutions. Techniques to help: momentum (carry velocity from previous steps), adaptive learning rates (Adam, RMSProp), and learning rate schedules.

Weight initialization: Too large → saturation. Too small → vanishing signals. A good rule: initialize weights with variance proportional to 1/(fan-in), so the variance of activations stays roughly constant across layers.

Symmetry breaking: If all weights are initialized identically, all hidden units compute the same function and receive identical gradients. They'd stay identical forever. Random initialization breaks this symmetry, allowing different units to learn different features. This is why we must initialize randomly, not to zero.

Data preprocessing: Standardize inputs to zero mean and unit variance. This makes the error surface more isotropic (similar curvature in all directions), which helps gradient descent converge faster.

Network architecture: How many hidden layers? How many units per layer? Bishop emphasizes using regularization with a large network rather than carefully tuning the size of a small network. Let the regularizer (or the Bayesian prior) control complexity, not the architecture.

Check: Why must neural network weights be initialized randomly rather than to the same value?

Identical initialization means all hidden units stay identical forever (symmetry), preventing them from learning different features Random initialization is faster to compute The error function requires random starting points

Chapter 9: Summary

Chapter 5 introduces the most flexible parametric model in the book. The key ideas:

Concept	What it gives you
Feed-forward architecture	Universal function approximation with learned features
Backpropagation	Efficient gradient computation — O(W) for all weights
Hessian	Second-order info for optimization, pruning, Bayesian treatment
Regularization	Weight decay, early stopping, invariances — control complexity
Mixture density networks	Model multimodal conditional distributions
Bayesian NNs	Uncertainty quantification via posterior over weights

The tradeoff: Neural networks can approximate any function but are hard to train (non-convex optimization), hard to interpret, and require lots of data. The Bayesian framework helps manage complexity but requires approximations. Kernel methods (Ch 6–7) offer a complementary approach: they trade some flexibility for convex optimization and exact Bayesian treatment.

What comes next: Chapter 6 introduces kernel methods and Gaussian processes, where we work in function space directly rather than weight space.

"The number of weight parameters is a poor measure
of model complexity."
— Christopher Bishop, PRML §5.2

Check: Why does Bishop say the number of weights is a poor measure of complexity?

Because regularization can make a large network behave like a simple model — the effective complexity depends on the regularizer, not just the parameter count Because all networks with the same number of weights behave identically Because complexity is always determined by the depth alone