Bishop PRML, Chapter 3

Linear Models for Regression

Basis functions, maximum likelihood, Bayesian regression, and the evidence framework — everything you need to fit curves with calibrated uncertainty.

Prerequisites: Chapters 1–2 (probability, Gaussian inference).
10
Chapters
2
Simulations
10
Quizzes

Chapter 0: Why Regression?

You have a dataset of input-output pairs: a patient's age and their blood pressure, a house's square footage and its price, a sensor reading and the true temperature. You want to predict the output for new inputs you haven't seen before. This is regression — learning a function from data.

The simplest approach is to fit a straight line. But real relationships are rarely linear. Chapter 3 shows how to stay in the comfortable world of linear algebra while modeling complex, nonlinear relationships. The trick: basis functions.

The key insight: "Linear" in "linear regression" refers to linearity in the parameters, not in the input. We can apply nonlinear transformations to the input (polynomials, Gaussians, sigmoids) and still solve the problem with linear algebra. This gives us the flexibility of nonlinear models with the tractability of linear ones.

This chapter covers a progression from simple to sophisticated:

MethodWhat it doesKey limitation
ML / Least SquaresFinds the single best-fit weight vectorOverfits with complex bases
Regularized LSPenalizes large weightsMust choose regularization strength
Bayesian RegressionMaintains a full posterior over weightsRequires a prior
Evidence FrameworkAutomatically selects hyperparametersApproximate; assumes unimodal posterior

The Bayesian approach is the star of this chapter. Instead of a single best guess for the parameters, we maintain a distribution over all possible parameter values. This gives us uncertainty estimates for free — we know not just what to predict, but how confident we should be.

Check: What does "linear" mean in linear regression?

Chapter 1: Basis Functions

A linear regression model with basis functions takes the form:

y(x, w) = w0 + ∑j=1M−1 wj φj(x) = wTφ(x)

where φj(x) are basis functions that transform the input. The parameter w0 is the bias (offset), and we absorb it by defining φ0(x) = 1.

Common choices of basis functions:

Polynomial: φj(x) = xj. Simple but global — changing a data point at one end affects the fit everywhere.

Gaussian RBF: φj(x) = exp(−(x−μj)2 / 2s2). Local: each basis function only responds near its center μj.

Sigmoidal: φj(x) = σ((x−μj)/s) where σ is the logistic function. Smooth step functions.

Fourier: Sines and cosines at different frequencies. Natural for periodic data.

Design matrix: If we stack all N training inputs, we get the design matrix Φ of size N × M, where Φnj = φj(xn). This matrix encodes all the basis function evaluations. Everything from here on is linear algebra on Φ.

The choice of basis functions is critical — it determines what kinds of patterns the model can capture. Too few basis functions and the model underfits. Too many and it overfits (unless we regularize). This is the bias-variance tradeoff we'll see in Chapter 3.

Check: Why are Gaussian RBF basis functions called "local"?

Chapter 2: Maximum Likelihood & Least Squares

Assume the target t is the true function plus Gaussian noise: t = y(x, w) + ε, where ε ~ N(0, β−1). This gives a likelihood:

p(t|x, w, β) = N(t| wTφ(x), β−1)

For N data points, the log-likelihood is:

ln p(t|w, β) = (N/2) ln β − (N/2) ln(2π) − β/2 · ED(w)

where ED(w) = ∑n (tnwTφ(xn))2 is the sum-of-squares error. Maximizing the likelihood is equivalent to minimizing ED.

Least squares = maximum likelihood under Gaussian noise. The familiar sum-of-squares loss function isn't an arbitrary choice. It falls out naturally from assuming Gaussian noise on the targets. Different noise assumptions give different loss functions (e.g., Laplacian noise gives absolute error).

Setting the gradient to zero gives the normal equations:

wML = (ΦTΦ)−1 ΦT t

The matrix Φ = (ΦTΦ)−1ΦT is the Moore-Penrose pseudo-inverse. The noise precision is estimated as 1/βML = (1/N) ∑ (tnwMLTφn)2.

Sequential learning: We don't need all data at once. Stochastic gradient descent updates weights one sample at a time: w(new) = w(old) − η ∇En. For sum-of-squares: w(new) = w(old) + η(tnwTφn)φn. This is the LMS algorithm (Widrow-Hoff). The update is proportional to the error times the basis function — the same pattern we saw for sequential Bayesian updating in Ch 2.
Check: What assumption makes least squares equivalent to maximum likelihood?

Chapter 3: The Bias-Variance Decomposition

A single maximum likelihood fit can be misleading. If we had a different training set, we'd get a different wML. The bias-variance decomposition formalizes this by splitting the expected loss into three components.

For squared loss, the expected prediction error at a point x decomposes as:

E[(t − y(x))2] = (Bias)2 + Variance + Noise
TermMeaningReduced by
Bias2How far the average prediction is from the true functionMore flexible model
VarianceHow much predictions scatter across different training setsMore data, regularization, simpler model
NoiseIrreducible error inherent in the dataNothing (it's a floor)
The tradeoff: A simple model (few basis functions) has high bias but low variance — it consistently gets things a bit wrong. A complex model (many basis functions) has low bias but high variance — it fits the training data well but is sensitive to which data it saw. The sweet spot minimizes the total error. This is the fundamental tension in all of machine learning.

With a very flexible model and limited data, maximum likelihood overfits: the variance term dominates. We need either more data, or some way to control model complexity. The next two sections address the latter: regularization (a frequentist tool) and Bayesian inference (which controls complexity through the prior).

Important caveat: The bias-variance decomposition is a frequentist concept. It averages over hypothetical repeated datasets. The Bayesian approach sidesteps this entirely by conditioning on the actual observed data and maintaining a posterior over parameters. We'll see this in Chapter 5.
Check: A very complex model with few data points will likely have:

Chapter 4: Regularization

To combat overfitting, we add a penalty term to the error function. The total error becomes:

E(w) = ED(w) + λ EW(w)

where ED is the data-fit term and EW is the regularizer. The hyperparameter λ controls the tradeoff between fitting the data and keeping the weights small.

The most common regularizer is the L2 penalty (weight decay, ridge regression):

EW(w) = ½ wTw = ½ ∑j wj2

With this penalty, the ML solution becomes:

wreg = (ΦTΦ + λI)−1 ΦT t
Regularization = Bayesian prior: L2 regularization is exactly equivalent to placing a zero-mean Gaussian prior on the weights: p(w|α) = N(w|0, α−1I) with λ = α/β. Minimizing the regularized error is the same as finding the MAP (maximum a posteriori) estimate. Regularization isn't an ad hoc trick — it's Bayesian inference in disguise.

More generally, the Lq penalty uses ∑ |wj|q:

q = 2 (Ridge): Shrinks all weights toward zero. Smooth, differentiable. Closed-form solution.

q = 1 (Lasso): Drives some weights exactly to zero → sparse models → feature selection. But no closed-form solution (requires iterative optimization).

Check: L2 regularization is equivalent to which Bayesian operation?

Chapter 5: Bayesian Linear Regression

Instead of finding a single best w, the Bayesian approach maintains a posterior distribution over all possible weight vectors. We start with a Gaussian prior:

p(w|α) = N(w|0, α−1I)

Combined with the Gaussian likelihood from Chapter 2, the posterior is also Gaussian (conjugacy at work!):

p(w|t) = N(w|mN, SN)

with mean and covariance:

mN = β SN ΦT t
SN−1 = αI + β ΦTΦ
Bayesian Linear Regression

Click the canvas to add data points. Watch the posterior over weights sharpen and the predictive uncertainty shrink in regions with data.

Click canvas to add points
Sequential updating: Each new data point (xn, tn) updates the posterior: the previous posterior becomes the new prior. After all N points, we get the same result as processing them all at once. This is identical in spirit to the beta-binomial sequential updating in Ch 2, but now in weight space.

The posterior mean mN equals the regularized least-squares solution with λ = α/β. But the Bayesian approach gives us much more: the full covariance SN tells us how uncertain we are about each weight and how they correlate.

Check: What happens to the weight posterior as we observe more data?

Chapter 6: The Predictive Distribution

The real payoff of Bayesian regression: to predict a new target t for input x, we don't plug in a single weight vector. Instead, we average over all possible weights, weighted by the posterior:

p(t|x, t, α, β) = ∫ p(t|x, w, β) p(w|t, α, β) dw

Because both factors are Gaussian, the integral is tractable:

p(t|x, t) = N(t| mNTφ(x), σN2(x))

where the predictive variance is:

σN2(x) = 1/β + φ(x)T SN φ(x)
Two sources of uncertainty: The predictive variance has two terms. The first, 1/β, is the noise variance — irreducible randomness in the data. The second, φTSNφ, is the epistemic uncertainty — our uncertainty about the weights. This second term shrinks as we see more data. It's large far from training data (where we're unsure) and small near training data (where we're confident).

This is the foundation of modern uncertainty quantification. A model that says "I don't know" in unfamiliar regions is far more trustworthy than one that always gives confident (but potentially wrong) predictions.

Connection to Gaussian processes (Ch 6): As the number of basis functions M goes to infinity, Bayesian linear regression becomes a Gaussian process. Instead of a distribution over weight vectors, we get a distribution over functions. The kernel of the GP is determined by the basis functions. This beautiful limit connects the parametric world of Chapter 3 to the nonparametric world of Chapter 6.
Check: What is the predictive variance composed of?

Chapter 7: The Model Evidence

We've been assuming fixed hyperparameters α (prior precision) and β (noise precision). How do we choose them? Cross-validation is one option, but Bishop introduces a more elegant Bayesian approach: the model evidence (marginal likelihood).

p(t|α, β) = ∫ p(t|w, β) p(w|α) dw

This integrates over all possible weight vectors, weighting each by its prior probability. Models that are too simple can't fit the data (low likelihood). Models that are too complex spread their prior mass too thin (low prior for any particular good w). The evidence automatically balances fit and complexity.

Occam's razor, made quantitative: The model evidence embodies Occam's razor. A simple model concentrates its prior mass on a small set of datasets it can explain. A complex model spreads its mass thinly over many possible datasets. When the simple model can explain the observed data, its evidence is higher despite having fewer parameters. Nature rewards parsimony.

For model comparison, we compute the evidence ratio (Bayes factor):

p(t|M1) / p(t|M2)

A ratio much greater than 1 favors model M1. This is the Bayesian alternative to AIC/BIC, but it follows directly from probability theory rather than being an ad hoc approximation.

Check: Why does the evidence penalize overly complex models?

Chapter 8: The Evidence Approximation

Computing the evidence exactly is tractable for linear regression. We can maximize it with respect to α and β to find optimal hyperparameters. This is the evidence approximation (also called empirical Bayes or type-II maximum likelihood).

For the linear regression model, the log evidence is:

ln p(t|α, β) = (M/2) ln α + (N/2) ln β − E(mN) − ½ ln|A| − (N/2) ln(2π)

where A = αI + βΦTΦ and E(mN) is the regularized error evaluated at the posterior mean.

Maximizing with respect to α gives a beautiful result. Define the eigenvalues λi of βΦTΦ. Then:

αopt = γ / (mNTmN)

where γ = ∑i λi/(α + λi) is the effective number of parameters. This quantity ranges from 0 (all parameters controlled by the prior) to M (all parameters determined by data).

Effective parameters: Not all M parameters are equally "used" by the model. Some are strongly determined by the data (λi >> α, contributing ~1 to γ), while others are mostly determined by the prior (λi << α, contributing ~0). The number γ tells us how many parameters the data actually needs. This is a principled version of model complexity.

In practice, the optimality conditions for α and β are coupled (each depends on the other), so we iterate: (1) initialize α, β; (2) compute posterior and γ; (3) update α, β; (4) repeat until convergence.

Check: What does the "effective number of parameters" γ represent?

Chapter 9: Summary

Chapter 3 is the first complete working example of the Bayesian machine learning pipeline. Here's the progression:

StepFrequentistBayesian
Modely = wTφ(x) + noiseSame
FitMinimize sum-of-squares → point estimate wMLCompute posterior p(w|data) → full distribution
RegularizeAdd λ||w||2 penalty (ad hoc)Prior p(w) (principled)
PredictSingle prediction wMLTφPredictive distribution (with uncertainty)
Select modelCross-validationMaximize evidence
The recipe: Prior × Likelihood = Posterior. Integrate posterior × likelihood = Predictive. Integrate likelihood × prior over weights = Evidence. Three applications of Bayes' theorem at three levels: parameters, predictions, and models. This three-level hierarchy is the backbone of Bayesian machine learning.

What comes next: Chapter 4 tackles classification — predicting discrete labels instead of continuous values. The Gaussian-in-Gaussian-out magic breaks down (the posterior is no longer exactly Gaussian), forcing us to develop approximations like the Laplace approximation.

"The Bayesian treatment of linear regression avoids the
over-fitting problem of maximum likelihood."
— Christopher Bishop, PRML §3.3
Check: What are the three levels of Bayesian inference in linear regression?