Deisenroth et al., Chapter 9

Linear Regression

The workhorse of machine learning — linear in parameters, not necessarily in inputs.

Prerequisites: Chapter 8 (MLE, MAP, Bayesian inference). That's it.
11
Chapters
5+
Simulations
11
Quizzes

Chapter 0: Why Linear Regression

You have 10 data points. Each is a pair: some input x and an output y. Plot them. You see a roughly linear trend with some scatter. You draw a line through the points by eye. Congratulations — you just did linear regression.

Now make it precise. Which line is "best"? How confident should you be? If I give you a new x, what should you predict for y — and how uncertain is that prediction?

Linear regression answers all of these questions. It is simultaneously:

The simplest ML model
A closed-form solution you can compute by hand
Surprisingly powerful
With feature maps, it fits polynomials, radial basis functions, anything
A complete testbed
MLE, MAP, full Bayesian — all have closed-form solutions

The name is misleading. "Linear" does not mean the model can only fit straight lines. It means the model is linear in the parameters θ. You can transform the input x through any nonlinear function φ(x) first, then combine linearly. This lets you fit curves, surfaces, and arbitrarily complex shapes — all with the same linear regression machinery.

Why start here? Every concept from Chapter 8 — MLE, MAP, Bayesian inference, model selection — has a clean, analytic solution in linear regression. No approximations, no iterative optimizers. You can write down the answer, compute it with a matrix inverse, and see exactly what each framework does. This makes linear regression the perfect sandbox for understanding ML at a deep level.
Check: Why is the term "linear" in "linear regression" potentially misleading?

Chapter 1: The Setup

The model assumes each observation is generated by a linear combination of features plus noise:

yn = φ(xn)Tθ + εn,     εn ~ N(0, σ2)

Here φ(x) ∈ RM is a feature vector — a transformation of the raw input x into M features. The parameters θ ∈ RM are the weights we want to learn. The noise ε captures everything the model cannot explain.

Stack all N observations into a matrix. The design matrix Φ ∈ RN×M has one row per data point:

Φ = ⎡ φ(x1)T
 ⎢ ⋮ ⎥
 ⎣ φ(xN)T
∈ RN×M

The full model in matrix notation:

y = Φθ + ε,     ε ~ N(0, σ2I)
Think of Φ as a recipe book. Each row says "to predict yn, mix these M features in proportions θ." The design matrix encodes what information about each data point we feed to the model. The parameters θ say how to combine that information.

A concrete example. Suppose x is a single number and φ(x) = [1, x]. Then M = 2, the design matrix is:

Φ = ⎡ 1   x1
 ⎢ 1   x2
 ⎣ 1   x3
,     θ = ⎡ θ0
 ⎣ θ1

And ŷ = Φθ = [θ0 + θ1x1, θ0 + θ1x2, θ0 + θ1x3]T. This is a line with intercept θ0 and slope θ1. The column of 1's lets us have a bias term.

SymbolShapeMeaning
φ(x)RMFeature vector for one input
θRMParameter vector (weights)
ΦRN×MDesign matrix (all features stacked)
yRNObserved outputs
σ2R+Noise variance
Check: What are the rows of the design matrix Φ?

Chapter 2: The Normal Equations (MLE)

From Chapter 8, we know that MLE with Gaussian noise reduces to minimizing the sum of squared errors. In matrix form:

L(θ) = ||y − Φθ||2 = (y − Φθ)T(y − Φθ)

To find the minimum, take the gradient with respect to θ and set it to zero. Let's expand the loss first:

L(θ) = yTy − 2yTΦθ + θTΦTΦθ

Take the derivative using matrix calculus (Chapter 5):

θ L = −2ΦTy + 2ΦTΦθ = 0

Rearranging gives the normal equations:

ΦTΦθ = ΦTy

If ΦTΦ is invertible (which happens when the columns of Φ are linearly independent — meaning we have at least as many data points as features), the solution is:

θML = (ΦTΦ)−1ΦTy
This is the closed-form MLE solution. No iteration, no gradient descent, no learning rate to tune. One matrix inverse, one matrix multiply, done. The term (ΦTΦ)−1ΦT is called the pseudoinverse of Φ.

Let's verify with a worked numerical example. Three points: (1, 2.1), (2, 3.9), (3, 6.2). Feature map φ(x) = [1, x].

Φ = ⎡ 1   1 ⎤
 ⎢ 1   2 ⎥
 ⎣ 1   3 ⎦
,     y = ⎡ 2.1 ⎤
 ⎢ 3.9 ⎥
 ⎣ 6.2 ⎦
ΦTΦ = ⎡ 3   6 ⎤
 ⎣ 6   14 ⎦
,     ΦTy = ⎡ 12.2 ⎤
 ⎣ 28.4 ⎦
TΦ)−1 = 1/6⎡ 14   −6 ⎤
 ⎣ −6   3 ⎦
θML = 1/6⎡ 14   −6 ⎤
 ⎣ −6   3 ⎦
⎡ 12.2 ⎤
 ⎣ 28.4 ⎦
= ⎡ −0.1 ⎤
 ⎣ 2.05 ⎦

The MLE fit is ŷ = −0.1 + 2.05x. Intercept near zero, slope about 2. Let's verify: at x=2, ŷ = −0.1 + 4.1 = 4.0, and the actual value is 3.9. Close.

python
import numpy as np

X = np.array([1, 2, 3])
y = np.array([2.1, 3.9, 6.2])
Phi = np.column_stack([np.ones_like(X), X])  # [1, x]

theta_ML = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
# array([-0.1, 2.05])
Numerical note: Never actually compute the inverse (ΦTΦ)−1 directly. Use np.linalg.solve instead, which solves the linear system ΦTΦθ = ΦTy more stably and efficiently. Computing inverses is O(M3) and numerically fragile. Solving linear systems is O(M3) too, but with better constants and stability.
Check: What is the closed-form MLE solution for linear regression?

Chapter 3: Feature Maps & Polynomials

A straight line is boring. But linear regression is not limited to straight lines. The trick: apply a feature map φ(x) that transforms the raw input into a richer feature space. The model is still linear in θ — it just operates on transformed features.

The most common example is polynomial features:

φ(x) = [1, x, x2, x3, …, xM−1]

With this feature map, y = φ(x)Tθ = θ0 + θ1x + θ2x2 + … — a polynomial of degree M−1. Yet we solve it with the exact same formula: θML = (ΦTΦ)−1ΦTy. The normal equations do not care what φ is.

This is why it is called "linear" regression. The model is y = φ(x)Tθ = ∑m θm φm(x). It is a linear combination of the features φm(x) with weights θm. The features can be anything — polynomials, sines, Gaussians, indicator functions — and the same closed-form solution applies.

Worked example: fit a degree-2 polynomial to the same 3 points.

φ(x) = [1, x, x2]   ⇒   Φ = ⎡ 1   1   1 ⎤
 ⎢ 1   2   4 ⎥
 ⎣ 1   3   9 ⎦

Now Φ is 3×3 (square). With N = M = 3, the polynomial passes through all 3 points exactly — zero training error. But is zero error desirable? Only if the data is noise-free. If there is noise, a perfect fit through every point is overfitting.

Other feature maps:

• Radial basis: φm(x) = exp(−||x − cm||2 / 2s2)
• Fourier: φm(x) = sin(mπx) or cos(mπx)
• Sigmoidal: φm(x) = σ(amx + bm)

The design choice:

Picking φ encodes your prior knowledge. Use polynomials if you expect smooth curves. Use Fourier features for periodic signals. Use RBFs for local patterns. The choice of features is often more important than the choice of algorithm.

Neural networks as feature maps: A neural network with L layers can be viewed as a learned feature map φ(x) followed by a linear output layer. The difference: in linear regression, φ is fixed and we optimize θ. In neural networks, we optimize φ and θ jointly. This is why deep learning is so powerful — the features adapt to the data.
Check: With φ(x) = [1, x, x2], what kind of curve can the model fit?

Chapter 4: Overfitting

More features means more flexibility. A degree-12 polynomial has 13 parameters and can fit almost any smooth curve. With only 10 data points, it can pass through every single one with zero training error. Sounds great. It is terrible.

The polynomial oscillates wildly between data points. At locations slightly away from the training data, the predictions are absurd — values in the thousands for data that lives in [−1, 1]. This is overfitting: the model has memorized the noise in the training data instead of learning the underlying pattern.

Polynomial Overfitting Explorer

Drag the Degree slider. At low degrees (0–2), the model underfits. At moderate degrees (3–5), it captures the sine shape. Beyond 8, it overfits — wild oscillations between data points. Training RMSE always drops; test RMSE shows the characteristic U-shape.

Degree: 3
Degree3
The bias-variance trade-off in action. Low-degree models have high bias (they cannot capture the true function) but low variance (they are stable across different datasets). High-degree models have low bias but high variance (they change wildly with different training samples). The sweet spot minimizes total error = bias2 + variance.

Notice the diagnostic pattern:

ConditionTraining errorTest errorDiagnosis
UnderfittingHighHighNeed more complexity
Good fitLowLowJust right
OverfittingVery lowHighNeed less complexity / more data / regularization

This is not just an academic concern. Overfitting is the single most common failure mode in ML. Every regularization technique, every validation strategy, every Bayesian method is fundamentally a weapon against overfitting.

Check: How can you detect overfitting?

Chapter 5: MAP & Regularization

From Chapter 8, we know that a Gaussian prior on θ turns MLE into MAP with an L2 penalty. For linear regression, the MAP objective is:

J(θ) = ||y − Φθ||2 + λ||θ||2

where λ = σ22 (noise variance divided by prior variance). Differentiating and setting to zero:

θ J = −2ΦTy + 2ΦTΦθ + 2λθ = 0
TΦ + λI)θ = ΦTy
θMAP = (ΦTΦ + λI)−1ΦTy
Compare MLE vs MAP. The only difference is the λI term added to ΦTΦ. This has two effects: (1) it shrinks the parameters toward zero (regularization), and (2) it makes the matrix always invertible, even when Φ is rank-deficient. Regularization is also a numerical stability trick!

Returning to our 3-point example with λ = 1 and degree-1 features:

ΦTΦ + λI = ⎡ 3+1   6 ⎤
 ⎣ 6   14+1 ⎦
= ⎡ 4   6 ⎤
 ⎣ 6   15 ⎦
θMAP = 1/24⎡ 15   −6 ⎤
 ⎣ −6   4 ⎦
⎡ 12.2 ⎤
 ⎣ 28.4 ⎦
= ⎡ 0.525 ⎤
 ⎣ 1.673 ⎦

Compare: θML = [−0.1, 2.05] vs θMAP = [0.525, 1.673]. The MAP solution has smaller parameters — the prior pulled them toward zero. The slope shrank from 2.05 to 1.67, and the intercept shifted from −0.1 to 0.53.

How much shrinkage? When λ is tiny (weak prior), MAP ≈ MLE. When λ is huge (strong prior), θMAP ≈ 0 regardless of data. The art is choosing λ via cross-validation or the marginal likelihood (Chapter 9).
Check: What does adding λI to ΦTΦ accomplish?

Chapter 6: Bayesian Regression

MLE and MAP give point estimates. Bayesian regression keeps the full posterior distribution over θ. Here is the derivation, step by step.

Step 1: Prior. We place a Gaussian prior on the parameters:

p(θ) = N(θ | m0, S0)

Typically m0 = 0 (we start centered at the origin) and S0 = αI (isotropic, where α controls the prior width).

Step 2: Likelihood. Given θ, the data likelihood is:

p(y | Φ, θ) = N(y | Φθ, σ2I)

Step 3: Posterior. Gaussian prior × Gaussian likelihood = Gaussian posterior. This is the conjugacy magic. After doing the algebra (completing the square in the exponent):

p(θ | Φ, y) = N(θ | mN, SN)

where the posterior parameters are:

SN = (S0−1 + σ−2ΦTΦ)−1
mN = SN(S0−1m0 + σ−2ΦTy)
Read these formulas carefully. The posterior covariance SN combines the prior precision S0−1 (what we knew before) with the data precision σ−2ΦTΦ (what the data tells us). More data → larger ΦTΦ → smaller SN → tighter posterior. The posterior mean mN is a precision-weighted average of the prior mean and the data evidence.

With m0 = 0 (zero prior mean), the formulas simplify to:

SN = (α−1I + σ−2ΦTΦ)−1
mN = σ−2SNΦTy
Connection to MAP: The MAP estimate is θMAP = mN, the posterior mean. But Bayesian inference keeps SN too — the uncertainty. MAP throws away the uncertainty and keeps only the peak. Full Bayesian keeps the whole bell curve.

The posterior has an intuitive interpretation. Before seeing data, our belief is N(0, αI) — a wide ball centered at the origin. Each data point narrows the ball, pulling the center toward values that explain the data. After N data points, the posterior is a tight ellipsoid whose center is near the MLE and whose shape reflects parameter correlations.

python
import numpy as np

def bayesian_regression(Phi, y, sigma2, alpha):
    S0_inv = (1/alpha) * np.eye(Phi.shape[1])
    SN = np.linalg.inv(S0_inv + (1/sigma2) * Phi.T @ Phi)
    mN = (1/sigma2) * SN @ Phi.T @ y
    return mN, SN
Check: What happens to the posterior covariance SN as we observe more data?

Chapter 7: Posterior Predictions

We have the posterior p(θ | data) = N(mN, SN). Now someone gives us a new input x* and asks: what is y*?

The posterior predictive distribution integrates out the parameter uncertainty:

p(y* | x*, data) = ∫ p(y* | x*, θ) p(θ | data) dθ

Both factors are Gaussian, and the integral of a Gaussian times a Gaussian is Gaussian. The result:

p(y* | x*, data) = N(y* | μ*, σ*2)

where:

μ* = φ(x*)TmN
σ*2 = φ(x*)TSNφ(x*) + σ2
The predictive variance has two components. The first term φ(x*)TSNφ(x*) is epistemic uncertainty — uncertainty about which θ is correct. It decreases with more data. The second term σ2 is aleatoric uncertainty — irreducible noise in the data. It stays constant no matter how much data you collect. Total uncertainty = parameter uncertainty + noise.

Key behaviors:

ScenarioEpistemic φTSNφAleatoric σ2Total uncertainty
Near data pointsSmallFixedLow
Far from dataLargeFixedHigh
More data collectedShrinksUnchangedDecreases (approaches σ2)

This is exactly the behavior we want. The model is confident near observed data and uncertain in unexplored regions. As data accumulates, the confidence bands narrow toward the irreducible noise level. No ad-hoc uncertainty calibration needed — it falls out of the math.

Check: What are the two sources of predictive uncertainty in Bayesian linear regression?

Chapter 8: Uncertainty Bands

Now for the payoff. The simulation below shows Bayesian linear regression in action. You start with the prior — wide uncertainty bands and random function samples. Click on the canvas to add data points. Watch the posterior update live: the uncertainty bands narrow around the data, and the sampled functions converge toward the truth.

Bayesian Linear Regression: Live Posterior Updates

Click on the plot to add data points. Orange band = ±2σ predictive uncertainty. Teal lines = sampled functions from the posterior. Watch the band narrow and the functions converge as you add data.

N = 0 points
Prior α2.0
Noise σ0.20
Degree3
What to explore: (1) Add 2–3 points and watch the bands narrow only near those points. (2) Add points across the whole range and watch the bands tighten everywhere. (3) Increase α (wider prior) and see how uncertainty grows. (4) Increase noise σ and notice the minimum band width increases. (5) Increase the degree and see how the model becomes more flexible but also more uncertain between data points.
Why this matters for real ML: Uncertainty quantification is critical in safety-sensitive applications — medical diagnosis, autonomous driving, financial risk. A model that says "I'm not sure" is far more useful than one that confidently gives the wrong answer. Bayesian regression shows the gold standard for honest uncertainty.
Check: In the simulation, why do the uncertainty bands widen far from the data points?

Chapter 9: Marginal Likelihood

We have a polynomial degree to choose, a prior variance α, and a noise level σ. Cross-validation works, but Bayesian inference offers something more principled: the marginal likelihood (or evidence).

p(y | Φ) = ∫ p(y | Φ, θ) p(θ) dθ

For Gaussian likelihood and Gaussian prior, this integral has a closed form:

log p(y | Φ) = −1/2 [ yT2I + ΦS0ΦT)−1y + log|σ2I + ΦS0ΦT| + N log(2π) ]

This expression has three terms, each with a clear interpretation:

Data fit term: yTK−1y (where K = σ2I + ΦS0ΦT). Prefers models that fit the data.

Complexity penalty: log|K|. Grows with model complexity. Penalizes overly flexible models.

Automatic Occam's razor. The marginal likelihood balances fit against complexity without any hyperparameter. Simple models that explain the data well get high evidence. Complex models spread their prior mass over many possible datasets, reducing the probability assigned to any specific dataset. Only if the added complexity genuinely helps explain the data will the evidence increase.

In practice, you compute the marginal likelihood for each candidate model (degree 1, 2, 3, ...) and pick the one with the highest evidence. This is Bayesian model selection — principled, automatic, and often more reliable than cross-validation for small datasets.

Optimizing hyperparameters via evidence: Beyond selecting between discrete models, you can also optimize continuous hyperparameters (α, σ2) by maximizing the marginal likelihood with gradient ascent. This is called empirical Bayes or type-II maximum likelihood. It gives a principled way to set regularization strength without cross-validation.
Check: How does the marginal likelihood avoid selecting overly complex models?

Chapter 10: The Geometric View

There is a beautiful geometric interpretation of MLE linear regression that ties everything together. The prediction ŷ = Φθ must lie in the column space of Φ — the set of all possible linear combinations of Φ's columns.

The MLE solution finds the point in this column space that is closest to the observed y. "Closest" in the least-squares sense means the residual y − ŷ is perpendicular to the column space.

ML = Φ(ΦTΦ)−1ΦTy

The matrix P = Φ(ΦTΦ)−1ΦT is the orthogonal projection matrix onto the column space of Φ. It takes any vector y ∈ RN and projects it onto the M-dimensional subspace spanned by the features.

MLE = orthogonal projection. The predicted values ŷ are the orthogonal projection of y onto the column space of Φ. The residual y − ŷ is perpendicular to every column of Φ. This is why the normal equations are ΦT(y − Φθ) = 0 — they literally say "the residual is orthogonal to the features."

Properties of the projection matrix P:

PropertyMeaning
P2 = PProjecting twice is the same as projecting once (idempotent)
PT = PThe projection is symmetric
rank(P) = MThe projection space has dimension M (number of features)
(I − P)y = y − ŷI − P projects onto the orthogonal complement (the residual space)
Geometric intuition for regularization: Adding λI modifies the projection. Instead of projecting exactly onto the column space, we project onto a "softened" version that pulls the solution toward the origin. The larger λ, the more the projection shrinks toward zero — a shrunken projection rather than an exact one.

This geometric view also explains why N ≥ M is needed for MLE. If N < M (more features than data points), the column space of Φ has dimension N < M, and y already lies in it — the "projection" is y itself (perfect fit, zero residual). This corresponds to the underdetermined case where ΦTΦ is singular and infinitely many θ give zero loss. Regularization resolves this degeneracy by picking the smallest-norm solution.

The story of this chapter in one sentence: Linear regression, viewed through MLE, MAP, Bayesian inference, and geometry, is the simplest model where all of ML's core ideas become visible, calculable, and beautiful.

What comes next: Chapter 10 applies dimensionality reduction (PCA) to find the most informative low-dimensional projections of high-dimensional data. Chapter 11 introduces mixture models and the EM algorithm for unsupervised learning. Both build on the linear algebra and probability you have now mastered.

"The purpose of computing is insight, not numbers."
— Richard Hamming
Check: What does the projection matrix P = Φ(ΦTΦ)−1ΦT do geometrically?