Ch 3: Linear Models for Regression

Chapter 0: Why Regression?

You have a dataset of input-output pairs: a patient's age and their blood pressure, a house's square footage and its price, a sensor reading and the true temperature. You want to predict the output for new inputs you haven't seen before. This is regression — learning a function from data.

The simplest approach is to fit a straight line. But real relationships are rarely linear. Chapter 3 shows how to stay in the comfortable world of linear algebra while modeling complex, nonlinear relationships. The trick: basis functions.

The key insight: "Linear" in "linear regression" refers to linearity in the parameters, not in the input. We can apply nonlinear transformations to the input (polynomials, Gaussians, sigmoids) and still solve the problem with linear algebra. This gives us the flexibility of nonlinear models with the tractability of linear ones.

This chapter covers a progression from simple to sophisticated:

Method	What it does	Key limitation
ML / Least Squares	Finds the single best-fit weight vector	Overfits with complex bases
Regularized LS	Penalizes large weights	Must choose regularization strength
Bayesian Regression	Maintains a full posterior over weights	Requires a prior
Evidence Framework	Automatically selects hyperparameters	Approximate; assumes unimodal posterior

The Bayesian approach is the star of this chapter. Instead of a single best guess for the parameters, we maintain a distribution over all possible parameter values. This gives us uncertainty estimates for free — we know not just what to predict, but how confident we should be.

Check: What does "linear" mean in linear regression?

The model is linear in the parameters (weights), not necessarily in the input The model always fits a straight line The input features must be linearly independent

Chapter 1: Basis Functions

A linear regression model with basis functions takes the form:

y(x, w) = w₀ + ∑_j=1^M−1 w_j φ_j(x) = w^Tφ(x)

where φ_j(x) are basis functions that transform the input. The parameter w₀ is the bias (offset), and we absorb it by defining φ₀(x) = 1.

Common choices of basis functions:

Polynomial: φ_j(x) = x^j. Simple but global — changing a data point at one end affects the fit everywhere.

Gaussian RBF: φ_j(x) = exp(−(x−μ_j)² / 2s²). Local: each basis function only responds near its center μ_j.

Sigmoidal: φ_j(x) = σ((x−μ_j)/s) where σ is the logistic function. Smooth step functions.

Fourier: Sines and cosines at different frequencies. Natural for periodic data.

Design matrix: If we stack all N training inputs, we get the design matrix Φ of size N × M, where Φ_nj = φ_j(x_n). This matrix encodes all the basis function evaluations. Everything from here on is linear algebra on Φ.

The choice of basis functions is critical — it determines what kinds of patterns the model can capture. Too few basis functions and the model underfits. Too many and it overfits (unless we regularize). This is the bias-variance tradeoff we'll see in Chapter 3.

Check: Why are Gaussian RBF basis functions called "local"?

Each basis function only responds significantly near its center, unlike polynomials which are global They can only be computed locally on one machine They always produce local minima in optimization

Chapter 2: Maximum Likelihood & Least Squares

Assume the target t is the true function plus Gaussian noise: t = y(x, w) + ε, where ε ~ N(0, β⁻¹). This gives a likelihood:

p(t|x, w, β) = N(t| w^Tφ(x), β⁻¹)

For N data points, the log-likelihood is:

ln p(t|w, β) = (N/2) ln β − (N/2) ln(2π) − β/2 · E_D(w)

where E_D(w) = ∑_n (t_n − w^Tφ(x_n))² is the sum-of-squares error. Maximizing the likelihood is equivalent to minimizing E_D.

Least squares = maximum likelihood under Gaussian noise. The familiar sum-of-squares loss function isn't an arbitrary choice. It falls out naturally from assuming Gaussian noise on the targets. Different noise assumptions give different loss functions (e.g., Laplacian noise gives absolute error).

Setting the gradient to zero gives the normal equations:

w_ML = (Φ^TΦ)⁻¹ Φ^T t

The matrix Φ^† = (Φ^TΦ)⁻¹Φ^T is the Moore-Penrose pseudo-inverse. The noise precision is estimated as 1/β_ML = (1/N) ∑ (t_n − w_ML^Tφ_n)².

Sequential learning: We don't need all data at once. Stochastic gradient descent updates weights one sample at a time: w^(new) = w^(old) − η ∇E_n. For sum-of-squares: w^(new) = w^(old) + η(t_n − w^Tφ_n)φ_n. This is the LMS algorithm (Widrow-Hoff). The update is proportional to the error times the basis function — the same pattern we saw for sequential Bayesian updating in Ch 2.

Check: What assumption makes least squares equivalent to maximum likelihood?

Gaussian noise on the targets Uniform noise on the targets The basis functions are orthogonal

Chapter 3: The Bias-Variance Decomposition

A single maximum likelihood fit can be misleading. If we had a different training set, we'd get a different w_ML. The bias-variance decomposition formalizes this by splitting the expected loss into three components.

For squared loss, the expected prediction error at a point x decomposes as:

E[(t − y(x))²] = (Bias)² + Variance + Noise

Term	Meaning	Reduced by
Bias²	How far the average prediction is from the true function	More flexible model
Variance	How much predictions scatter across different training sets	More data, regularization, simpler model
Noise	Irreducible error inherent in the data	Nothing (it's a floor)

The tradeoff: A simple model (few basis functions) has high bias but low variance — it consistently gets things a bit wrong. A complex model (many basis functions) has low bias but high variance — it fits the training data well but is sensitive to which data it saw. The sweet spot minimizes the total error. This is the fundamental tension in all of machine learning.

With a very flexible model and limited data, maximum likelihood overfits: the variance term dominates. We need either more data, or some way to control model complexity. The next two sections address the latter: regularization (a frequentist tool) and Bayesian inference (which controls complexity through the prior).

Important caveat: The bias-variance decomposition is a frequentist concept. It averages over hypothetical repeated datasets. The Bayesian approach sidesteps this entirely by conditioning on the actual observed data and maintaining a posterior over parameters. We'll see this in Chapter 5.

Check: A very complex model with few data points will likely have:

High bias, low variance Low bias, high variance (overfitting) Low bias, low variance

Chapter 4: Regularization

To combat overfitting, we add a penalty term to the error function. The total error becomes:

E(w) = E_D(w) + λ E_W(w)

where E_D is the data-fit term and E_W is the regularizer. The hyperparameter λ controls the tradeoff between fitting the data and keeping the weights small.

The most common regularizer is the L2 penalty (weight decay, ridge regression):

E_W(w) = ½ w^Tw = ½ ∑_j w_j²

With this penalty, the ML solution becomes:

w_reg = (Φ^TΦ + λI)⁻¹ Φ^T t

Regularization = Bayesian prior: L2 regularization is exactly equivalent to placing a zero-mean Gaussian prior on the weights: p(w|α) = N(w|0, α⁻¹I) with λ = α/β. Minimizing the regularized error is the same as finding the MAP (maximum a posteriori) estimate. Regularization isn't an ad hoc trick — it's Bayesian inference in disguise.

More generally, the L_q penalty uses ∑ |w_j|^q:

q = 2 (Ridge): Shrinks all weights toward zero. Smooth, differentiable. Closed-form solution.

q = 1 (Lasso): Drives some weights exactly to zero → sparse models → feature selection. But no closed-form solution (requires iterative optimization).

Check: L2 regularization is equivalent to which Bayesian operation?

Placing a zero-mean Gaussian prior on the weights and finding the MAP estimate Marginalizing over all possible weights Choosing the prior that maximizes the evidence

Chapter 5: Bayesian Linear Regression

Instead of finding a single best w, the Bayesian approach maintains a posterior distribution over all possible weight vectors. We start with a Gaussian prior:

p(w|α) = N(w|0, α⁻¹I)

Combined with the Gaussian likelihood from Chapter 2, the posterior is also Gaussian (conjugacy at work!):

p(w|t) = N(w|m_N, S_N)

with mean and covariance:

m_N = β S_N Φ^T t

S_N⁻¹ = αI + β Φ^TΦ

Bayesian Linear Regression

Click the canvas to add data points. Watch the posterior over weights sharpen and the predictive uncertainty shrink in regions with data.

Click canvas to add points

Sequential updating: Each new data point (x_n, t_n) updates the posterior: the previous posterior becomes the new prior. After all N points, we get the same result as processing them all at once. This is identical in spirit to the beta-binomial sequential updating in Ch 2, but now in weight space.

The posterior mean m_N equals the regularized least-squares solution with λ = α/β. But the Bayesian approach gives us much more: the full covariance S_N tells us how uncertain we are about each weight and how they correlate.

Check: What happens to the weight posterior as we observe more data?

It becomes narrower (more concentrated), reflecting increased certainty about the weights It becomes wider to accommodate more data It stays the same because the prior dominates

Chapter 6: The Predictive Distribution

The real payoff of Bayesian regression: to predict a new target t for input x, we don't plug in a single weight vector. Instead, we average over all possible weights, weighted by the posterior:

p(t|x, t, α, β) = ∫ p(t|x, w, β) p(w|t, α, β) dw

Because both factors are Gaussian, the integral is tractable:

p(t|x, t) = N(t| m_N^Tφ(x), σ_N²(x))

where the predictive variance is:

σ_N²(x) = 1/β + φ(x)^T S_N φ(x)

Two sources of uncertainty: The predictive variance has two terms. The first, 1/β, is the noise variance — irreducible randomness in the data. The second, φ^TS_Nφ, is the epistemic uncertainty — our uncertainty about the weights. This second term shrinks as we see more data. It's large far from training data (where we're unsure) and small near training data (where we're confident).

This is the foundation of modern uncertainty quantification. A model that says "I don't know" in unfamiliar regions is far more trustworthy than one that always gives confident (but potentially wrong) predictions.

Connection to Gaussian processes (Ch 6): As the number of basis functions M goes to infinity, Bayesian linear regression becomes a Gaussian process. Instead of a distribution over weight vectors, we get a distribution over functions. The kernel of the GP is determined by the basis functions. This beautiful limit connects the parametric world of Chapter 3 to the nonparametric world of Chapter 6.

Check: What is the predictive variance composed of?

Only the noise variance Noise variance (irreducible) + epistemic uncertainty (from weight uncertainty) Only the prior variance

Chapter 7: The Model Evidence

We've been assuming fixed hyperparameters α (prior precision) and β (noise precision). How do we choose them? Cross-validation is one option, but Bishop introduces a more elegant Bayesian approach: the model evidence (marginal likelihood).

p(t|α, β) = ∫ p(t|w, β) p(w|α) dw

This integrates over all possible weight vectors, weighting each by its prior probability. Models that are too simple can't fit the data (low likelihood). Models that are too complex spread their prior mass too thin (low prior for any particular good w). The evidence automatically balances fit and complexity.

Occam's razor, made quantitative: The model evidence embodies Occam's razor. A simple model concentrates its prior mass on a small set of datasets it can explain. A complex model spreads its mass thinly over many possible datasets. When the simple model can explain the observed data, its evidence is higher despite having fewer parameters. Nature rewards parsimony.

For model comparison, we compute the evidence ratio (Bayes factor):

p(t|M₁) / p(t|M₂)

A ratio much greater than 1 favors model M₁. This is the Bayesian alternative to AIC/BIC, but it follows directly from probability theory rather than being an ad hoc approximation.

Check: Why does the evidence penalize overly complex models?

Complex models spread their prior probability over too many possible datasets, so any one dataset gets low probability Complex models always have lower likelihood The evidence function has a built-in parameter count penalty

Chapter 8: The Evidence Approximation

Computing the evidence exactly is tractable for linear regression. We can maximize it with respect to α and β to find optimal hyperparameters. This is the evidence approximation (also called empirical Bayes or type-II maximum likelihood).

For the linear regression model, the log evidence is:

ln p(t|α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − ½ ln|A| − (N/2) ln(2π)

where A = αI + βΦ^TΦ and E(m_N) is the regularized error evaluated at the posterior mean.

Maximizing with respect to α gives a beautiful result. Define the eigenvalues λ_i of βΦ^TΦ. Then:

α_opt = γ / (m_N^Tm_N)

where γ = ∑_i λ_i/(α + λ_i) is the effective number of parameters. This quantity ranges from 0 (all parameters controlled by the prior) to M (all parameters determined by data).

Effective parameters: Not all M parameters are equally "used" by the model. Some are strongly determined by the data (λ_i >> α, contributing ~1 to γ), while others are mostly determined by the prior (λ_i << α, contributing ~0). The number γ tells us how many parameters the data actually needs. This is a principled version of model complexity.

In practice, the optimality conditions for α and β are coupled (each depends on the other), so we iterate: (1) initialize α, β; (2) compute posterior and γ; (3) update α, β; (4) repeat until convergence.

Check: What does the "effective number of parameters" γ represent?

How many of the M parameters are actually determined by the data (versus being controlled by the prior) The total number of parameters in the model The dimensionality of the input space

Chapter 9: Summary

Chapter 3 is the first complete working example of the Bayesian machine learning pipeline. Here's the progression:

Step	Frequentist	Bayesian
Model	y = w^Tφ(x) + noise	Same
Fit	Minimize sum-of-squares → point estimate w_ML	Compute posterior p(w\|data) → full distribution
Regularize	Add λ\|\|w\|\|² penalty (ad hoc)	Prior p(w) (principled)
Predict	Single prediction w_ML^Tφ	Predictive distribution (with uncertainty)
Select model	Cross-validation	Maximize evidence

The recipe: Prior × Likelihood = Posterior. Integrate posterior × likelihood = Predictive. Integrate likelihood × prior over weights = Evidence. Three applications of Bayes' theorem at three levels: parameters, predictions, and models. This three-level hierarchy is the backbone of Bayesian machine learning.

What comes next: Chapter 4 tackles classification — predicting discrete labels instead of continuous values. The Gaussian-in-Gaussian-out magic breaks down (the posterior is no longer exactly Gaussian), forcing us to develop approximations like the Laplace approximation.

"The Bayesian treatment of linear regression avoids the
over-fitting problem of maximum likelihood."
— Christopher Bishop, PRML §3.3

Check: What are the three levels of Bayesian inference in linear regression?

Parameter inference (posterior), prediction (predictive distribution), model comparison (evidence) Training, validation, testing Prior, likelihood, posterior