Basis functions, maximum likelihood, Bayesian regression, and the evidence framework — everything you need to fit curves with calibrated uncertainty.
You have a dataset of input-output pairs: a patient's age and their blood pressure, a house's square footage and its price, a sensor reading and the true temperature. You want to predict the output for new inputs you haven't seen before. This is regression — learning a function from data.
The simplest approach is to fit a straight line. But real relationships are rarely linear. Chapter 3 shows how to stay in the comfortable world of linear algebra while modeling complex, nonlinear relationships. The trick: basis functions.
This chapter covers a progression from simple to sophisticated:
| Method | What it does | Key limitation |
|---|---|---|
| ML / Least Squares | Finds the single best-fit weight vector | Overfits with complex bases |
| Regularized LS | Penalizes large weights | Must choose regularization strength |
| Bayesian Regression | Maintains a full posterior over weights | Requires a prior |
| Evidence Framework | Automatically selects hyperparameters | Approximate; assumes the posterior over hyperparameters is sharply peaked |
The Bayesian approach is the star of this chapter. Instead of a single best guess for the parameters, we maintain a distribution over all possible parameter values. This gives us uncertainty estimates for free — we know not just what to predict, but how confident we should be.
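A linear regression model with basis functions takes the form:
$$y(x, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(x) = \mathbf{w}^\top \boldsymbol{\phi}(x)$$
where $\phi_j(x)$ are basis functions that transform the input. The parameter $w_0$ is the bias (offset), and we absorb it into the sum by defining $\phi_0(x) = 1$.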
Common choices of basis functions:
Polynomial: $\phi_j(x) = x^j$. Simple but global: changing a data point at one end affects the fit everywhere.
Gaussian RBF: $\phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{2s^2}\right)$. Local: each basis function only responds near its center $\mu_j$.
Sigmoidal: $\phi_j(x) = \sigma\left(\frac{x-\mu_j}{s}\right)$, where $\sigma(a) = 1/(1+e^{-a})$ is the logistic function. Smooth step functions.
Fourier: Sines and cosines at different frequencies. Natural for periodic data.
The choice of basis functions is critical: it determines what kinds of patterns the model can capture. With too few basis functions the model underfits; with too many it overfits (unless we regularize). This is the bias-variance tradeoff examined later in this chapter.
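The basis families listed above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the centers, and the width `s=0.1` are assumptions chosen for the example, not values from the text. Each function returns a design matrix with one row per input and one column per basis function, with the constant $\phi_0(x) = 1$ in the first column.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns 1, x, x^2, ..., x^degree (the first column is phi_0 = 1)."""
    return np.vander(x, degree + 1, increasing=True)

def gaussian_basis(x, centers, s):
    """Bias column followed by one Gaussian bump per center."""
    bumps = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), bumps])

def sigmoid_basis(x, centers, s):
    """Bias column followed by one logistic step per center."""
    steps = 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))
    return np.hstack([np.ones((x.size, 1)), steps])

x = np.linspace(0.0, 1.0, 5)
centers = np.array([0.25, 0.5, 0.75])        # hypothetical centers
Phi_poly = polynomial_basis(x, degree=3)
Phi_rbf = gaussian_basis(x, centers, s=0.1)
Phi_sig = sigmoid_basis(x, centers, s=0.1)
print(Phi_poly.shape, Phi_rbf.shape, Phi_sig.shape)
```

Whatever the basis, the model stays linear in the weights, so everything that follows only needs the design matrix.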
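Assume the target $t$ is the true function plus Gaussian noise: $t = y(x, \mathbf{w}) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \beta^{-1})$ and $\beta$ is the noise precision (inverse variance). This gives a likelihood:
$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid \mathbf{w}^\top\boldsymbol{\phi}(x), \beta^{-1}\right)$$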
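For $N$ data points, the log-likelihood is:
$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})$$
where $E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left(t_n - \mathbf{w}^\top\boldsymbol{\phi}(x_n)\right)^2$ is the sum-of-squares error. Maximizing the likelihood with respect to $\mathbf{w}$ is equivalent to minimizing $E_D$.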
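Setting the gradient to zero gives the normal equations:
$$\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t}$$
where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with entries $\Phi_{nj} = \phi_j(x_n)$. The matrix $\boldsymbol{\Phi}^\dagger = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top$ is the Moore-Penrose pseudo-inverse. The noise precision is estimated as $1/\beta_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - \mathbf{w}_{\mathrm{ML}}^\top\boldsymbol{\phi}(x_n)\right)^2$.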
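As a minimal sketch of the maximum likelihood fit, the snippet below solves the normal equations on synthetic data and estimates the noise precision from the residuals. The sine target, the noise level, and the cubic polynomial basis are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(0.0, 1.0, N)
true_sd = 0.1                                  # assumed noise standard deviation
t = np.sin(2 * np.pi * x) + rng.normal(0.0, true_sd, N)

Phi = np.vander(x, 4, increasing=True)         # cubic polynomial basis, N x 4

# Normal equations: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# lstsq minimises the same sum of squares without forming Phi^T Phi
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)        # 1/beta_ML = mean squared residual
print(w_ml, beta_ml)
```

Both routes give the same weights here. In practice a least-squares solver is usually the safer choice, because forming $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ explicitly squares the condition number.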
A single maximum likelihood fit can be misleading. If we had a different training set, we'd get a different wML. The bias-variance decomposition formalizes this by splitting the expected loss into three components.
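For squared loss, the expected prediction error at a point $x$ decomposes as:
$$\mathbb{E}\left[(y(x;\mathcal{D}) - t)^2\right] = \underbrace{\left(\mathbb{E}_{\mathcal{D}}[y(x;\mathcal{D})] - h(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left(y(x;\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(x;\mathcal{D})]\right)^2\right]}_{\text{variance}} + \underbrace{\mathbb{E}\left[(h(x) - t)^2\right]}_{\text{noise}}$$
where $y(x;\mathcal{D})$ is the prediction of a model trained on data set $\mathcal{D}$ and $h(x) = \mathbb{E}[t \mid x]$ is the true regression function.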
| Term | Meaning | Reduced by |
|---|---|---|
| Bias² | How far the average prediction is from the true function | More flexible model |
| Variance | How much predictions scatter across different training sets | More data, regularization, simpler model |
| Noise | Irreducible error inherent in the data | Nothing (it's a floor) |
With a very flexible model and limited data, maximum likelihood overfits: the variance term dominates. We need either more data, or some way to control model complexity. The next two sections address the latter: regularization (a frequentist tool) and Bayesian inference (which controls complexity through the prior).
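The decomposition can be estimated by simulation: draw many training sets from the same process, fit each one, then measure how far the average fit is from the truth (bias²) and how much the fits scatter (variance). The sketch below assumes a sine target, fixed inputs, Gaussian basis functions, and a ridge-style penalty `lam` as the complexity control. All of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 25)                  # fixed inputs shared by all data sets
h = np.sin(2 * np.pi * x)                      # assumed "true" function
noise_sd = 0.3
centers = np.linspace(0.0, 1.0, 9)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))
Phi = np.hstack([np.ones((x.size, 1)), Phi])   # bias column + 9 Gaussian bumps

def bias_variance(lam, n_sets=200):
    """Estimate bias^2 and variance of penalised fits, averaged over x."""
    # The fit is a linear smoother: y = H t, H = Phi (lam I + Phi^T Phi)^{-1} Phi^T
    H = Phi @ np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T)
    T = h + rng.normal(0.0, noise_sd, size=(n_sets, x.size))  # one row per data set
    Y = T @ H.T                                 # fitted curves, one row per data set
    bias2 = np.mean((Y.mean(axis=0) - h) ** 2)
    variance = np.mean(Y.var(axis=0))
    return bias2, variance

for lam in (1e-3, 1.0, 100.0):
    b2, var = bias_variance(lam)
    print(f"lambda={lam:g}  bias^2={b2:.4f}  variance={var:.4f}")
```

A weak penalty should give low bias and high variance, and a strong penalty the reverse.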
To combat overfitting, we add a penalty term to the error function. The total error becomes:
$$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$
where $E_D$ is the data-fit term and $E_W$ is the regularizer. The hyperparameter $\lambda$ controls the tradeoff between fitting the data and keeping the weights small.
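The most common regularizer is the L2 penalty (weight decay, ridge regression), $E_W(\mathbf{w}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w}$, which gives the total error:
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^\top\boldsymbol{\phi}(x_n)\right)^2 + \frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}$$
With this penalty, the minimizer is no longer the maximum likelihood solution but is still available in closed form:
$$\mathbf{w} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t}$$
More generally, the $L_q$ penalty uses $\sum_j |w_j|^q$:
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^\top\boldsymbol{\phi}(x_n)\right)^2 + \frac{\lambda}{2}\sum_{j}|w_j|^q$$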
q = 2 (Ridge): Shrinks all weights toward zero. Smooth, differentiable. Closed-form solution.
q = 1 (Lasso): Drives some weights exactly to zero → sparse models → feature selection. But no closed-form solution (requires iterative optimization).
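The closed-form ridge solution is short enough to write out directly. The sketch below assumes a deliberately over-flexible degree-9 polynomial on 15 noisy points (an illustrative setup) and shows the weight norm shrinking as $\lambda$ grows. The helper name `ridge_fit` is hypothetical.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimise 0.5*||t - Phi w||^2 + 0.5*lam*||w||^2 in closed form."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 15)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.2, x.size)
Phi = np.vander(x, 10, increasing=True)        # degree-9 polynomial, prone to overfit

norms = {lam: np.linalg.norm(ridge_fit(Phi, t, lam)) for lam in (1e-6, 1e-2, 1.0)}
for lam, norm in norms.items():
    print(f"lambda={lam:g}  ||w||={norm:.3f}")
```

The lasso case ($q = 1$) has no such one-line solution, so it is not sketched here.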
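Instead of finding a single best $\mathbf{w}$, the Bayesian approach maintains a posterior distribution over all possible weight vectors. We start with a zero-mean isotropic Gaussian prior with precision $\alpha$:
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$$
Combined with the Gaussian likelihood introduced in the maximum likelihood discussion above, the posterior is also Gaussian (conjugacy at work!):
$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$
with mean and covariance:
$$\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$$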
The posterior mean $\mathbf{m}_N$ equals the regularized least-squares solution with $\lambda = \alpha/\beta$. But the Bayesian approach gives us much more: the full covariance $\mathbf{S}_N$ tells us how uncertain we are about each weight and how the weights correlate.
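A small numerical check makes the link to ridge regression concrete. The sketch below assumes a zero-mean isotropic prior with precision `alpha`, a known noise precision `beta`, and a straight-line model fitted to synthetic data. The specific values and the helper name `posterior` are illustrative.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Gaussian posterior N(w | m_N, S_N) for the prior N(0, alpha^{-1} I)."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

rng = np.random.default_rng(3)
alpha, beta = 2.0, 25.0                        # assumed hyperparameters
x = rng.uniform(-1.0, 1.0, 20)
t = -0.3 + 0.5 * x + rng.normal(0.0, beta ** -0.5, x.size)
Phi = np.vander(x, 2, increasing=True)         # straight line: columns 1 and x

m_N, S_N = posterior(Phi, t, alpha, beta)

# The posterior mean coincides with ridge regression at lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(2) + Phi.T @ Phi, Phi.T @ t)

# Fewer data points leave a wider posterior (larger total variance)
_, S_few = posterior(Phi[:3], t[:3], alpha, beta)
print(m_N, w_ridge, np.trace(S_N), np.trace(S_few))
```

The trace comparison illustrates the sharpening of the posterior: each extra observation adds a positive semi-definite term to the precision $\mathbf{S}_N^{-1}$, so the covariance can only shrink.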
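The real payoff of Bayesian regression: to predict a new target $t$ for input $x$, we don't plug in a single weight vector. Instead, we average over all possible weights, weighted by the posterior:
$$p(t \mid x, \mathbf{t}, \alpha, \beta) = \int p(t \mid x, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w}$$
Because both factors are Gaussian, the integral is tractable:
$$p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}\left(t \mid \mathbf{m}_N^\top\boldsymbol{\phi}(x), \sigma_N^2(x)\right)$$
where the predictive variance is:
$$\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^\top\mathbf{S}_N\boldsymbol{\phi}(x)$$
The first term is the noise in the data; the second reflects the remaining uncertainty in $\mathbf{w}$ and shrinks as more data arrive.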
This is the foundation of modern uncertainty quantification. A model that says "I don't know" in unfamiliar regions is far more trustworthy than one that always gives confident (but potentially wrong) predictions.
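The predictive mean and variance follow directly from the posterior quantities. The sketch below assumes a cubic polynomial basis, training inputs in $[0, 1]$, and fixed hyperparameters, then compares one query point inside the data range with one far outside it. With a polynomial basis the variance grows under extrapolation. Localized bases such as Gaussians behave differently far from their centers, so this is specific to the assumed setup.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 1.0, 25.0                        # assumed hyperparameters
x_train = rng.uniform(0.0, 1.0, 30)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, beta ** -0.5, 30)

def basis(x):
    """Cubic polynomial design matrix (hypothetical choice of basis)."""
    return np.vander(np.atleast_1d(x), 4, increasing=True)

Phi = basis(x_train)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

def predict(x_new):
    """Predictive mean and variance at each new input."""
    phi = basis(x_new)
    mean = phi @ m_N
    var = 1.0 / beta + np.sum((phi @ S_N) * phi, axis=1)   # 1/beta + phi^T S_N phi
    return mean, var

mean, var = predict(np.array([0.5, 3.0]))      # inside vs. far outside the data
print(mean, np.sqrt(var))
```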
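We've been assuming fixed hyperparameters $\alpha$ (prior precision) and $\beta$ (noise precision). How do we choose them? Cross-validation is one option, but Bishop introduces a more elegant Bayesian approach: the model evidence (marginal likelihood) of a model $\mathcal{M}_i$ given data $\mathcal{D}$:
$$p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathcal{M}_i)\, d\mathbf{w}$$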
This integrates over all possible weight vectors, weighting each by its prior probability. Models that are too simple can't fit the data (low likelihood). Models that are too complex spread their prior mass too thin (low prior for any particular good w). The evidence automatically balances fit and complexity.
For model comparison, we compute the evidence ratio (Bayes factor):
$$\frac{p(\mathcal{D} \mid \mathcal{M}_1)}{p(\mathcal{D} \mid \mathcal{M}_2)}$$
A ratio much greater than 1 favors model $\mathcal{M}_1$. This is the Bayesian counterpart of criteria such as AIC and BIC, but it follows directly from the rules of probability rather than from an asymptotic approximation.
Computing the evidence exactly is tractable for linear regression. We can maximize it with respect to α and β to find optimal hyperparameters. This is the evidence approximation (also called empirical Bayes or type-II maximum likelihood).
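For the linear regression model, the log evidence is:
$$\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi)$$
where $\mathbf{A} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ and $E(\mathbf{m}_N) = \frac{\beta}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}_N\|^2 + \frac{\alpha}{2}\mathbf{m}_N^\top\mathbf{m}_N$ is the regularized error evaluated at the posterior mean.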
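The log evidence is cheap to evaluate, which makes it usable as a model-selection score. The sketch below computes it for polynomial models of increasing degree on synthetic data. It assumes the standard closed form for a Gaussian prior with precision `alpha` and Gaussian noise with precision `beta`, both held fixed at illustrative values. The function name and the data are hypothetical.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the Gaussian linear model."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    # Regularised error at the posterior mean
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, 30)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.2, x.size)
alpha, beta = 1.0, 25.0                        # assumed, held fixed for the comparison

# Score polynomial models of degree 0..7 by their log evidence
scores = {d: log_evidence(np.vander(x, d + 1, increasing=True), t, alpha, beta)
          for d in range(8)}
best_degree = max(scores, key=scores.get)
for d, s in scores.items():
    print(f"degree={d}  log evidence={s:.2f}")
```

The difference of two log evidences is the log Bayes factor between the corresponding models, so the table of scores can be read as a set of pairwise comparisons.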
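Maximizing with respect to $\alpha$ gives a beautiful result. Let $\lambda_i$ be the eigenvalues of $\beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$. Then:
$$\alpha = \frac{\gamma}{\mathbf{m}_N^\top\mathbf{m}_N}$$
where $\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}$ is the effective number of parameters. This quantity ranges from 0 (all parameters controlled by the prior) to $M$ (all parameters determined by data).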
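The corresponding condition for $\beta$ is $1/\beta = \frac{1}{N-\gamma}\sum_{n=1}^{N}\left(t_n - \mathbf{m}_N^\top\boldsymbol{\phi}(x_n)\right)^2$. In practice, the optimality conditions for $\alpha$ and $\beta$ are coupled (each depends on the other through $\mathbf{m}_N$ and $\gamma$), so we iterate: (1) initialize $\alpha$, $\beta$; (2) compute the posterior and $\gamma$; (3) update $\alpha$, $\beta$; (4) repeat until convergence.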
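The iteration can be sketched as a short fixed-point loop. The version below assumes the re-estimation formulas $\alpha = \gamma / \mathbf{m}_N^\top\mathbf{m}_N$ and $1/\beta = \frac{1}{N-\gamma}\sum_n (t_n - \mathbf{m}_N^\top\boldsymbol{\phi}(x_n))^2$, a cubic polynomial basis, and synthetic data generated from known weights. The starting values, tolerance, and function name are arbitrary illustrative choices.

```python
import numpy as np

def evidence_maximization(Phi, t, alpha=1.0, beta=1.0, n_iter=200, tol=1e-8):
    """Re-estimate alpha and beta by iterating the fixed-point updates."""
    N, M = Phi.shape
    eig = np.linalg.eigvalsh(Phi.T @ Phi)      # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        # (2) posterior mean and effective number of parameters
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        lam = beta * eig                        # eigenvalues of beta * Phi^T Phi
        gamma = np.sum(lam / (alpha + lam))
        # (3) hyperparameter updates
        alpha_new = gamma / (m_N @ m_N)
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
        converged = (abs(alpha_new - alpha) < tol * alpha
                     and abs(beta_new - beta) < tol * beta)
        alpha, beta = alpha_new, beta_new
        if converged:                           # (4) stop once both have settled
            break
    return alpha, beta, gamma, m_N

rng = np.random.default_rng(6)
x = rng.uniform(-1.0, 1.0, 200)
w_true = np.array([0.5, -1.0, 2.0, 1.5])       # hypothetical generating weights
Phi = np.vander(x, 4, increasing=True)
t = Phi @ w_true + rng.normal(0.0, 0.2, x.size)   # true noise precision is 25

alpha, beta, gamma, m_N = evidence_maximization(Phi, t)
print(f"alpha={alpha:.3f}  beta={beta:.2f}  gamma={gamma:.3f}")
```

Because the data here are generated by the model itself, the estimated $\beta$ should land near the true noise precision, and $\gamma$ should sit just below $M = 4$, indicating that the data determine nearly all the weights.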
Chapter 3 is the first complete working example of the Bayesian machine learning pipeline. Here is how the frequentist and Bayesian treatments compare at each step:
| Step | Frequentist | Bayesian |
|---|---|---|
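| Model | $y = \mathbf{w}^\top\boldsymbol{\phi}(x)$ + noise | Same |
| Fit | Minimize sum-of-squares → point estimate $\mathbf{w}_{\mathrm{ML}}$ | Compute posterior $p(\mathbf{w} \mid \text{data})$ → full distribution |
| Regularize | Add a $\lambda\lVert\mathbf{w}\rVert^2$ penalty (ad hoc) | Prior $p(\mathbf{w})$ (principled) |
| Predict | Single prediction $\mathbf{w}_{\mathrm{ML}}^\top\boldsymbol{\phi}(x)$ | Predictive distribution (with uncertainty) |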
| Select model | Cross-validation | Maximize evidence |
What comes next: Chapter 4 tackles classification — predicting discrete labels instead of continuous values. The Gaussian-in-Gaussian-out magic breaks down (the posterior is no longer exactly Gaussian), forcing us to develop approximations like the Laplace approximation.