Feed-forward networks, backpropagation, regularization, and Bayesian neural networks — the parametric powerhouse.
Linear models (Chapters 3–4) are limited: their outputs are linear combinations of a fixed set of basis functions, and we had to choose those basis functions by hand. What if we could learn the basis functions from data?
A neural network does exactly this. It composes multiple layers of adaptive basis functions — each layer transforms its input through a linear combination followed by a nonlinear activation. By stacking layers, the network can represent increasingly abstract features.
Bishop covers the classical two-layer network (one hidden layer + one output layer) in depth. The ideas — backpropagation, regularization, Bayesian treatment — generalize directly to modern deep networks with many layers.
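A two-layer feed-forward network computes $y_k(\mathbf{x}, \mathbf{w}) = \sigma_{\text{out}}\Big(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\Big(\sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\Big) + w_{k0}^{(2)}\Big)$, where $w_{j0}^{(1)}$ and $w_{k0}^{(2)}$ are biases. The biases are often dropped from the notation by fixing $x_0 = 1$ and $z_0 = 1$ and letting the sums start at zero.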
Breaking this down:
| Stage | Computation | What it does |
|---|---|---|
| Hidden pre-activation | $a_j = \sum_i w_{ji}^{(1)} x_i$ | Linear combination of inputs |
| Hidden activation | $z_j = h(a_j)$ | Nonlinear transformation |
| Output pre-activation | $a_k = \sum_j w_{kj}^{(2)} z_j$ | Linear combination of hidden units |
| Output activation | $y_k = \sigma_{\text{out}}(a_k)$ | Depends on task (identity, sigmoid, softmax) |
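The four stages in the table can be sketched in a few lines of NumPy. This is a minimal illustration, not Bishop's code: the function name `forward`, the weight shapes, and the choice to absorb the biases by prepending a constant 1 to the inputs and hidden activations are all assumptions made for the example.

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, out=lambda a: a):
    """Two-layer forward pass (illustrative sketch).

    W1 has shape (M, D + 1) and W2 has shape (K, M + 1); column 0 of each
    holds the biases, matching the convention x_0 = 1 and z_0 = 1.
    """
    x_tilde = np.concatenate(([1.0], x))   # prepend bias input x_0 = 1
    a_hidden = W1 @ x_tilde                # hidden pre-activations a_j
    z = h(a_hidden)                        # hidden activations z_j
    z_tilde = np.concatenate(([1.0], z))   # prepend bias unit z_0 = 1
    a_out = W2 @ z_tilde                   # output pre-activations a_k
    return out(a_out)                      # outputs y_k

rng = np.random.default_rng(0)
D, M, K = 3, 5, 2                          # arbitrary sizes for the demo
W1 = rng.normal(size=(M, D + 1))
W2 = rng.normal(size=(K, M + 1))
y = forward(rng.normal(size=D), W1, W2)    # identity output, as in regression
```

Passing a different `out` (for example a sigmoid or softmax) changes only the last stage; the hidden layer is unaffected.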
Figure: a network with one hidden layer of M units fitting a sine wave. Increasing M allows more complex fits.
The total number of parameters in a two-layer network with D inputs, M hidden units, and K outputs (including biases) is (D+1)M + (M+1)K. The universal approximation theorem guarantees that a single hidden layer with enough units and a suitable nonlinear activation can approximate any continuous function on a compact input domain to arbitrary accuracy. But "enough" can be exponentially many, and the theorem says nothing about whether training will actually find those weights; deeper networks are often more efficient.
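As a quick arithmetic check of the parameter count, the sketch below evaluates the formula for one set of sizes. The helper name `count_parameters` and the sizes (784 inputs, 100 hidden units, 10 outputs) are illustrative assumptions, not values from the text.

```python
def count_parameters(D, M, K):
    """Weights and biases in a two-layer network: (D+1)M + (M+1)K."""
    first_layer = (D + 1) * M    # D weights plus 1 bias for each hidden unit
    second_layer = (M + 1) * K   # M weights plus 1 bias for each output unit
    return first_layer + second_layer

n_params = count_parameters(D=784, M=100, K=10)
print(n_params)  # 78500 + 1010 = 79510
```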
The hidden-layer activation function h(a) must be nonlinear — otherwise stacking layers gives another linear model. Bishop focuses on two classical choices:
Logistic sigmoid: $\sigma(a) = 1/(1 + e^{-a})$. Output in $(0, 1)$. Saturates at both extremes. Derivative: $\sigma'(a) = \sigma(a)(1 - \sigma(a))$.
Tanh: $\tanh(a) = 2\sigma(2a) - 1$. Output in $(-1, 1)$. Zero-centered, which aids optimization. Derivative: $\tanh'(a) = 1 - \tanh^2(a)$.
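The derivative formulas and the identity linking tanh to the sigmoid are easy to verify numerically. The following sketch compares the closed-form derivatives against central finite differences; the function names and the step size `eps` are choices made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)            # sigma'(a) = sigma(a)(1 - sigma(a))

def tanh_prime(a):
    return 1.0 - np.tanh(a) ** 2    # tanh'(a) = 1 - tanh^2(a)

a = np.linspace(-4.0, 4.0, 9)
eps = 1e-6

# Central finite differences as an independent check of the closed forms.
numeric_sigmoid = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
numeric_tanh = (np.tanh(a + eps) - np.tanh(a - eps)) / (2 * eps)

# Largest deviation from the identity tanh(a) = 2 * sigmoid(2a) - 1.
tanh_identity_gap = np.max(np.abs(np.tanh(a) - (2 * sigmoid(2 * a) - 1)))
```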
The output activation depends on the task:
| Task | Output activation | Error function |
|---|---|---|
| Regression | Identity (linear) | Sum-of-squares |
| Binary classification | Sigmoid | Cross-entropy |
| Multi-class classification | Softmax | Multi-class cross-entropy |
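One reason these pairings are convenient is that the derivative of the error with respect to the output pre-activations takes the simple form y − t. The sketch below checks this numerically for the softmax and multi-class cross-entropy pair; the function names and the example values of `a` and `t` are made up for the illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))       # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(a, t):
    """Multi-class cross-entropy for pre-activations a and one-hot target t."""
    return -np.sum(t * np.log(softmax(a)))

a = np.array([0.5, -1.0, 2.0])      # output pre-activations a_k
t = np.array([0.0, 0.0, 1.0])       # one-hot target

# Finite-difference gradient of the error with respect to each a_k.
eps = 1e-6
numeric = np.array([
    (cross_entropy(a + eps * np.eye(3)[k], t)
     - cross_entropy(a - eps * np.eye(3)[k], t)) / (2 * eps)
    for k in range(3)
])

analytic = softmax(a) - t           # the claimed closed form y - t
```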
We need gradients of the error function with respect to every weight to do gradient-based optimization. Backpropagation computes all these gradients efficiently using the chain rule.
Two passes through the network:
| Pass | Direction | Computes |
|---|---|---|
| Forward | Input → Output | All activations $a_j$, $z_j$, $y_k$ |
| Backward | Output → Input | All error signals $\delta_j = \partial E / \partial a_j$ |
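Define the error signal for each unit: $\delta_j = \partial E / \partial a_j$. For output units paired with their matching error function (identity with sum-of-squares, sigmoid or softmax with cross-entropy): $\delta_k = y_k - t_k$. For hidden units, the chain rule gives $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$, so each hidden error signal is a weighted sum of the error signals in the layer above, scaled by the local slope of the activation.
The gradient for any weight is: $\partial E / \partial w_{ji} = \delta_j z_i$ (the error signal at the output end of the weight times the activation at its input end).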
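The forward and backward passes can be sketched for a single data point as follows. This assumes a regression setup (tanh hidden units, identity outputs, sum-of-squares error) with biases stored in column 0 of each weight matrix; the function name `backprop` and the shapes are illustrative. A finite-difference check on one weight confirms the gradient.

```python
import numpy as np

def backprop(x, t, W1, W2):
    """Error and gradients for one data point (illustrative sketch)."""
    # Forward pass: store every activation needed by the backward pass.
    x_tilde = np.concatenate(([1.0], x))
    z = np.tanh(W1 @ x_tilde)
    z_tilde = np.concatenate(([1.0], z))
    y = W2 @ z_tilde
    error = 0.5 * np.sum((y - t) ** 2)

    # Backward pass: output deltas, then hidden deltas via the chain rule.
    delta_out = y - t
    # Skip column 0 of W2: the bias unit z_0 receives no incoming weights.
    delta_hidden = (1.0 - z ** 2) * (W2[:, 1:].T @ delta_out)

    # Gradient for each weight = delta at its output end * input at its input end.
    grad_W2 = np.outer(delta_out, z_tilde)
    grad_W1 = np.outer(delta_hidden, x_tilde)
    return error, grad_W1, grad_W2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))        # 2 inputs + bias -> 4 hidden units
W2 = rng.normal(size=(2, 5))        # 4 hidden units + bias -> 2 outputs
x, t = rng.normal(size=2), rng.normal(size=2)
_, grad_W1, grad_W2 = backprop(x, t, W1, W2)

# Finite-difference check on a single first-layer weight.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[1, 2] += eps
W1_minus[1, 2] -= eps
numeric = (backprop(x, t, W1_plus, W2)[0] - backprop(x, t, W1_minus, W2)[0]) / (2 * eps)
```

The finite-difference check costs one forward pass per weight, which is why it is useful for testing but not for training.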
The negative gradient gives the direction of steepest descent, so we update the weights as $\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \eta \nabla E$, where $\eta > 0$ is the learning rate. In practice, we use mini-batch SGD or more sophisticated optimizers (momentum, Adam).
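A minimal mini-batch SGD loop for a one-hidden-layer network on a sine-wave regression problem might look like the sketch below. Everything specific here is an assumption for illustration: the data set, the sizes, the learning rate `eta`, the batch size, and the row-vector weight layout (inputs multiply weights from the left, with separate bias vectors).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(200, 1))   # toy inputs
T = np.sin(X)                                   # toy targets

M, eta, batch = 10, 0.05, 20                    # hidden units, learning rate, batch size
W1 = rng.normal(scale=1.0, size=(1, M))
b1 = np.zeros(M)
W2 = rng.normal(scale=1.0 / np.sqrt(M), size=(M, 1))
b2 = np.zeros(1)

def mse(X, T):
    """Mean squared error of the current network on (X, T)."""
    return np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - T) ** 2)

initial = mse(X, T)
for epoch in range(200):
    order = rng.permutation(len(X))             # reshuffle every epoch
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        x, t = X[idx], T[idx]
        # Forward pass on the mini-batch.
        z = np.tanh(x @ W1 + b1)
        y = z @ W2 + b2
        # Backward pass: deltas averaged over the mini-batch.
        d_out = (y - t) / len(idx)
        d_hid = (d_out @ W2.T) * (1.0 - z ** 2)
        # Gradient step: w_new = w_old - eta * grad.
        W2 -= eta * (z.T @ d_out)
        b2 -= eta * d_out.sum(axis=0)
        W1 -= eta * (x.T @ d_hid)
        b1 -= eta * d_hid.sum(axis=0)
final = mse(X, T)
```

Note that the hidden deltas are computed before the second-layer weights are updated, so both layers use gradients from the same forward pass.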
The Hessian $\mathbf{H}$ is the matrix of second derivatives: $H_{ij} = \partial^2 E / \partial w_i \partial w_j$. It captures the curvature of the error surface. Why do we care?
| Application | How the Hessian helps |
|---|---|
| Newton's method | Uses $\mathbf{H}^{-1}$ for faster convergence |
| Laplace approximation | H at the mode gives the Gaussian approximation to the posterior |
| Model comparison | det(H) appears in the evidence approximation |
| Pruning | Identifies weights that can be removed (Optimal Brain Damage) |
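Computing the full Hessian costs $O(W^2)$ per data point, where $W$ is the number of weights, which is prohibitive for large networks. Fortunately, several approximations exist:
Diagonal approximation: Keep only the diagonal entries. $O(W)$ cost, but it ignores correlations between weights.
Outer product approximation: $\mathbf{H} \approx \sum_n \mathbf{g}_n \mathbf{g}_n^{\mathsf{T}}$ for a sum-of-squares error, where $\mathbf{g}_n = \nabla_{\mathbf{w}} y_n$ is the gradient of the network output (not of the error) for data point $n$. It drops the term weighted by the residuals $y_n - t_n$, so it is reasonable near a minimum where the residuals are small.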
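The outer product approximation can be assembled directly from per-example output gradients. The sketch below does this for a tiny single-output network, using finite differences to obtain the output gradients so that the example stays short; the network, the packing of weights into one vector, and the function names are assumptions for the illustration.

```python
import numpy as np

def predict(w, x):
    """Tiny 1-3-1 tanh network; w packs [W1 (3), b1 (3), W2 (3), b2 (1)]."""
    W1, b1, W2, b2 = w[0:3], w[3:6], w[6:9], w[9]
    return np.tanh(W1 * x + b1) @ W2 + b2

def output_gradient(w, x, eps=1e-6):
    """Gradient of the network output w.r.t. all weights, by central differences."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (predict(w + step, x) - predict(w - step, x)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w = rng.normal(size=10)
xs = np.linspace(-2.0, 2.0, 25)

# Stack one output gradient per data point, then sum the outer products.
G = np.stack([output_gradient(w, x) for x in xs])   # shape (N, W)
H_outer = G.T @ G                                   # sum_n g_n g_n^T
```

By construction the result is symmetric and positive semi-definite, which the exact Hessian need not be away from a minimum.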
Neural networks with many parameters overfit easily. Bishop discusses several regularization strategies:
Weight decay (L2 regularization): Add $\frac{\lambda}{2} \mathbf{w}^{\mathsf{T}} \mathbf{w}$ to the error function. Just as in Chapter 3, this is equivalent to a zero-mean Gaussian prior on the weights. Larger λ → smaller weights → smoother function.
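In a gradient step, the penalty simply adds λw to the data gradient, which shrinks every weight toward zero. A minimal sketch, with a hypothetical helper `weight_decay_step` and made-up numbers:

```python
import numpy as np

def weight_decay_step(w, grad_E, eta, lam):
    """One gradient step on E(w) + (lam / 2) * w^T w."""
    # The gradient of the penalty term is lam * w.
    return w - eta * (grad_E + lam * w)

w = np.array([2.0, -1.0])
# With a zero data gradient, each weight shrinks by the factor (1 - eta * lam) = 0.95.
w_new = weight_decay_step(w, grad_E=np.zeros(2), eta=0.1, lam=0.5)
```

This multiplicative shrinkage is where the name "weight decay" comes from.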
Early stopping: Train on the training set, monitor error on a validation set, and stop when the validation error starts increasing. The number of training steps acts as an implicit regularizer — more steps allow the network to fit more complex patterns.
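Because validation error is noisy, a common practical variant waits a few steps before stopping and keeps the best step seen so far. The sketch below applies that rule to a recorded validation curve. The `patience` parameter, the function name, and the numbers in `val_curve` are assumptions added for illustration; the text itself only says to stop when validation error starts increasing.

```python
def early_stopping_step(val_errors, patience=3):
    """Return (best_step, best_error), stopping after `patience` steps without improvement."""
    best_step, best_err = 0, float("inf")
    for step, err in enumerate(val_errors):
        if err < best_err:
            best_step, best_err = step, err      # new best: remember this step
        elif step - best_step >= patience:
            break                                # no improvement for `patience` steps
    return best_step, best_err

val_curve = [1.0, 0.6, 0.4, 0.35, 0.37, 0.36, 0.40, 0.45]   # made-up validation errors
best = early_stopping_step(val_curve)
print(best)  # (3, 0.35)
```

In a real training loop one would also save the weights at the best step and restore them after stopping.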
Invariances: If we know the output should be invariant to certain input transformations (e.g., digit recognition should be invariant to small rotations), we can: (1) augment the training data with transformed versions, (2) add a regularization term penalizing sensitivity to the transformation, or (3) build the invariance into the network architecture (like convolutional neural networks).
Standard regression networks trained with a sum-of-squares error predict only the conditional mean of the target, implicitly assuming a unimodal Gaussian around it. But what if the target distribution is multimodal? For example, in inverse kinematics for a robot arm, several joint configurations can reach the same end-effector position.
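A mixture density network (MDN) outputs the parameters of a Gaussian mixture: $p(t \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, \mathcal{N}\big(t \mid \mu_k(\mathbf{x}), \sigma_k^2(\mathbf{x})\big)$.
The network outputs for each component $k$: mixing coefficients $\pi_k$ (via softmax, so they are positive and sum to one), means $\mu_k$ (unconstrained), and variances $\sigma_k^2$ (via exp to ensure positivity).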
Training uses the negative log-likelihood of the mixture as the error function. Gradients flow through the mixture parameters back to the network weights via backprop. MDNs are an early example of neural networks parameterizing complex probability distributions — an idea that became central in modern generative models (VAEs, normalizing flows).
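The error function for a single scalar target can be sketched as below. It maps raw network outputs to mixture parameters (softmax for the mixing coefficients, exp for the variances) and evaluates the negative log-likelihood with a log-sum-exp for numerical stability. The function name `mdn_nll` and the argument names are hypothetical, and the sketch covers only a one-dimensional target.

```python
import numpy as np

def mdn_nll(a_pi, a_mu, a_sigma, t):
    """Negative log-likelihood of scalar t under a K-component Gaussian mixture.

    a_pi, a_mu, a_sigma are raw network outputs, each of shape (K,).
    """
    # Log mixing coefficients via a stable log-softmax.
    shifted = a_pi - np.max(a_pi)
    log_pi = shifted - np.log(np.sum(np.exp(shifted)))
    mu = a_mu                                   # means are unconstrained
    var = np.exp(a_sigma)                       # variances made positive via exp
    # Log density of t under each Gaussian component.
    log_gauss = -0.5 * np.log(2 * np.pi * var) - 0.5 * (t - mu) ** 2 / var
    # Log of the mixture density via log-sum-exp over components.
    s = log_pi + log_gauss
    m = np.max(s)
    return -(m + np.log(np.sum(np.exp(s - m))))

# Sanity check: one standard-normal component evaluated at its mean.
nll_single = mdn_nll(np.zeros(1), np.zeros(1), np.zeros(1), 0.0)
# Two identical components must give the same likelihood as one.
nll_double = mdn_nll(np.zeros(2), np.zeros(2), np.zeros(2), 0.0)
```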
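The full Bayesian approach to neural networks: instead of finding a single weight vector, maintain a posterior distribution over all possible weights: $p(\mathbf{w} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})$.
For predictions, marginalize over the posterior: $p(t \mid \mathbf{x}, \mathcal{D}) = \int p(t \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}$.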
The integral is intractable for neural networks (the posterior is highly non-Gaussian and multimodal). Bishop discusses the Laplace approximation: find the MAP weights, then approximate the posterior as Gaussian using the Hessian at the MAP.
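For the regression case, the Laplace approximation can be written out as follows. This sketch assumes a zero-mean Gaussian prior on the weights with precision $\alpha$, Gaussian noise with precision $\beta$, and $\mathbf{H}$ taken as the Hessian of the sum-of-squares error at $\mathbf{w}_{\text{MAP}}$; the predictive distribution additionally assumes the network output is linearized in the weights around $\mathbf{w}_{\text{MAP}}$.

```latex
q(\mathbf{w} \mid \mathcal{D})
  = \mathcal{N}\big(\mathbf{w} \mid \mathbf{w}_{\text{MAP}}, \mathbf{A}^{-1}\big),
\qquad
\mathbf{A}
  = -\nabla\nabla \ln p(\mathbf{w} \mid \mathcal{D}) \Big|_{\mathbf{w}_{\text{MAP}}}
  = \alpha \mathbf{I} + \beta \mathbf{H}

p(t \mid \mathbf{x}, \mathcal{D})
  \approx \mathcal{N}\big(t \mid y(\mathbf{x}, \mathbf{w}_{\text{MAP}}), \sigma^2(\mathbf{x})\big),
\qquad
\sigma^2(\mathbf{x}) = \beta^{-1} + \mathbf{g}^{\mathsf{T}} \mathbf{A}^{-1} \mathbf{g},
\qquad
\mathbf{g} = \nabla_{\mathbf{w}}\, y(\mathbf{x}, \mathbf{w}) \Big|_{\mathbf{w}_{\text{MAP}}}
```

The predictive variance has two parts: the intrinsic noise $\beta^{-1}$ and a term that grows where the output is sensitive to weights the data have not pinned down.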
The evidence framework (Ch 3) extends to neural networks: maximize p(D|α, β) with respect to the hyperparameters α (weight prior precision) and β (noise precision). The effective number of parameters γ again plays a central role.
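The re-estimation step for the hyperparameters can be sketched as below, assuming the regression setting with a sum-of-squares error and using the eigenvalues $\lambda_i$ of $\beta \mathbf{H}$ to compute $\gamma = \sum_i \lambda_i / (\alpha + \lambda_i)$. The function name, argument names, and toy numbers are hypothetical; in practice this step alternates with retraining to update $\mathbf{w}_{\text{MAP}}$.

```python
import numpy as np

def update_hyperparameters(H, w_map, residuals, alpha, beta):
    """One evidence-framework update of (alpha, beta); illustrative sketch.

    H: Hessian of the sum-of-squares error at w_map (symmetric).
    residuals: y(x_n, w_map) - t_n for every training point.
    """
    lam = beta * np.linalg.eigvalsh(H)            # eigenvalues of beta * H
    gamma = np.sum(lam / (alpha + lam))           # effective number of parameters
    alpha_new = gamma / (w_map @ w_map)
    beta_new = (len(residuals) - gamma) / np.sum(residuals ** 2)
    return gamma, alpha_new, beta_new

# Toy numbers: one well-determined direction (eigenvalue 100), one weakly determined (1).
H = np.diag([1.0, 100.0])
gamma, alpha_new, beta_new = update_hyperparameters(
    H, w_map=np.array([1.0, 1.0]), residuals=np.full(10, 0.1), alpha=1.0, beta=1.0
)
```

Each direction contributes a value between 0 and 1 to `gamma`: close to 1 when the data determine it well relative to the prior, close to 0 when the prior dominates.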
Modern approaches go beyond the Laplace approximation: variational inference (Ch 10), Hamiltonian Monte Carlo (Ch 11), and MC dropout can give better approximations to the weight posterior.
Training a neural network involves navigating a complex, non-convex error surface. Practical considerations:
Optimization: The error surface has many local minima, saddle points, and plateaus. Different initialization → different solutions. Techniques to help: momentum (carry velocity from previous steps), adaptive learning rates (Adam, RMSProp), and learning rate schedules.
Weight initialization: Too large → sigmoidal units saturate and their gradients vanish. Too small → signals shrink layer by layer. A good rule: initialize weights with variance proportional to 1/(fan-in), so the variance of activations stays roughly constant across layers.
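The effect of the 1/(fan-in) rule on a single layer can be checked empirically. The sketch below draws unit-variance inputs and zero-mean Gaussian weights with variance 1/fan-in, then compares the variance of the pre-activations to that of the inputs; the layer sizes and sample count are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, n = 512, 512, 2000

x = rng.normal(size=(n, fan_in))                 # unit-variance inputs
# Zero-mean Gaussian weights with variance 1 / fan_in.
W = rng.normal(scale=np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))
a = x @ W                                        # pre-activations of the next layer

ratio = a.var() / x.var()                        # close to 1 under this scaling

# For contrast: unit-variance weights inflate the variance by roughly fan_in.
W_large = rng.normal(scale=1.0, size=(fan_in, fan_out))
ratio_large = (x @ W_large).var() / x.var()
```

The check covers the linear part of the layer only; a saturating activation such as tanh reduces the variance somewhat afterwards.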
Data preprocessing: Standardize inputs to zero mean and unit variance. This makes the error surface more isotropic (similar curvature in all directions), which helps gradient descent converge faster.
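A minimal standardization helper is sketched below. The function name and the small constant `eps` (guarding against zero-variance features) are assumptions for the example. The statistics are computed on the training set only and then reused for the test set, so no information leaks from test data.

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-12):
    """Scale features to zero mean and unit variance using training statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps              # eps avoids division by zero
    return (X_train - mean) / std, (X_test - mean) / std

rng = np.random.default_rng(0)
# Two features on very different scales.
X_train = rng.normal(loc=[5.0, -200.0], scale=[0.1, 50.0], size=(500, 2))
X_test = rng.normal(loc=[5.0, -200.0], scale=[0.1, 50.0], size=(100, 2))
Z_train, Z_test = standardize(X_train, X_test)
```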
Network architecture: How many hidden layers? How many units per layer? Bishop emphasizes using regularization with a large network rather than carefully tuning the size of a small network. Let the regularizer (or the Bayesian prior) control complexity, not the architecture.
Chapter 5 introduces the most flexible parametric model in the book. The key ideas:
| Concept | What it gives you |
|---|---|
| Feed-forward architecture | Universal function approximation with learned features |
| Backpropagation | Efficient gradient computation: O(W) per data point for all weights |
| Hessian | Second-order info for optimization, pruning, Bayesian treatment |
| Regularization | Weight decay, early stopping, invariances — control complexity |
| Mixture density networks | Model multimodal conditional distributions |
| Bayesian NNs | Uncertainty quantification via posterior over weights |
What comes next: Chapter 6 introduces kernel methods and Gaussian processes, where we work in function space directly rather than weight space.