Ch 9: Linear Regression — Deisenroth MML

Chapter 0: Why Linear Regression

You have 10 data points. Each is a pair: some input x and an output y. Plot them. You see a roughly linear trend with some scatter. You draw a line through the points by eye. Congratulations — you just did linear regression.

Now make it precise. Which line is "best"? How confident should you be? If I give you a new x, what should you predict for y — and how uncertain is that prediction?

Linear regression answers all of these questions. It is simultaneously:

• The simplest ML model: a closed-form solution you can compute by hand
• Surprisingly powerful: with feature maps, it fits polynomials, radial basis functions, and other nonlinear functions of the input
• A complete testbed: MLE, MAP, and full Bayesian inference all have closed-form solutions

The name is misleading. "Linear" does not mean the model can only fit straight lines. It means the model is linear in the parameters θ. You can transform the input x through any nonlinear function φ(x) first, then combine linearly. This lets you fit curves, surfaces, and arbitrarily complex shapes — all with the same linear regression machinery.

Why start here? Every concept from Chapter 8 — MLE, MAP, Bayesian inference, model selection — has a clean, analytic solution in linear regression. No approximations, no iterative optimizers. You can write down the answer, compute it with a matrix inverse, and see exactly what each framework does. This makes linear regression the perfect sandbox for understanding ML at a deep level.

Check: Why is the term "linear" in "linear regression" potentially misleading?

• It only works with positive numbers
• "Linear" refers to the parameters, not the inputs; the model can fit nonlinear curves
• It cannot handle more than one input feature

Chapter 1: The Setup

The model assumes each observation is generated by a linear combination of features plus noise:

y_n = φ(x_n)^Tθ + ε_n, ε_n ~ N(0, σ²)

Here φ(x) ∈ R^M is a feature vector — a transformation of the raw input x into M features. The parameters θ ∈ R^M are the weights we want to learn. The noise ε captures everything the model cannot explain.

Stack all N observations into a matrix. The design matrix Φ ∈ R^(N×M) has one row per data point:

$$\Phi = \begin{bmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_N)^T \end{bmatrix} \in \mathbb{R}^{N \times M}$$

The full model in matrix notation:

y = Φθ + ε, ε ~ N(0, σ²I)

Think of Φ as a recipe book. Each row says "to predict y_n, mix these M features in proportions θ." The design matrix encodes what information about each data point we feed to the model. The parameters θ say how to combine that information.

A concrete example. Suppose x is a single number and φ(x) = [1, x]. Then M = 2, the design matrix is:

$$\Phi = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$$

And ŷ = Φθ = [θ₀ + θ₁x₁, θ₀ + θ₁x₂, θ₀ + θ₁x₃]^T. This is a line with intercept θ₀ and slope θ₁. The column of 1's lets us have a bias term.
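As a minimal sketch of this setup, the design matrix for φ(x) = [1, x] can be built and applied as follows. The input values, the parameters, and the helper name `design_matrix` are made up for illustration and do not come from the text.

```python
import numpy as np

def design_matrix(x):
    """Stack phi(x_n) = [1, x_n] as rows (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x])

# Illustrative inputs and parameters (not taken from the text)
x = np.array([0.0, 1.0, 2.0])
theta = np.array([0.5, 2.0])   # [intercept theta_0, slope theta_1]

Phi = design_matrix(x)         # shape (N, M) = (3, 2)
y_hat = Phi @ theta            # theta_0 + theta_1 * x_n for each n
print(Phi.shape, y_hat)
```

Each row of `Phi` is one data point's feature vector, so a single matrix-vector product evaluates the line at every input at once.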

Symbol	Shape	Meaning
φ(x)	R^M	Feature vector for one input
θ	R^M	Parameter vector (weights)
Φ	R^N×M	Design matrix (all features stacked)
y	R^N	Observed outputs
σ²	R⁺	Noise variance

Check: What are the rows of the design matrix Φ?

• The feature vectors φ(x_n)^T for each data point
• The parameter values θ
• The output labels y_n

Chapter 2: The Normal Equations (MLE)

From Chapter 8, we know that MLE with Gaussian noise reduces to minimizing the sum of squared errors. In matrix form:

L(θ) = ||y − Φθ||² = (y − Φθ)^T(y − Φθ)

To find the minimum, take the gradient with respect to θ and set it to zero. Let's expand the loss first:

L(θ) = y^Ty − 2y^TΦθ + θ^TΦ^TΦθ

Take the derivative using matrix calculus (Chapter 5):

∇_θ L = −2Φ^Ty + 2Φ^TΦθ = 0

Rearranging gives the normal equations:

Φ^TΦθ = Φ^Ty

If Φ^TΦ is invertible (which happens exactly when the columns of Φ are linearly independent; this requires at least as many data points as features, N ≥ M, and no feature that is a linear combination of the others), the solution is:

θ_ML = (Φ^TΦ)⁻¹Φ^Ty

This is the closed-form MLE solution. No iteration, no gradient descent, no learning rate to tune. One linear solve and you are done. The term (Φ^TΦ)⁻¹Φ^T is the Moore-Penrose pseudoinverse of Φ when Φ has full column rank.

Let's verify with a worked numerical example. Three points: (1, 2.1), (2, 3.9), (3, 6.2). Feature map φ(x) = [1, x].

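$$\Phi = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \quad y = \begin{bmatrix} 2.1 \\ 3.9 \\ 6.2 \end{bmatrix}$$

$$\Phi^T\Phi = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}, \quad \Phi^Ty = \begin{bmatrix} 12.2 \\ 28.5 \end{bmatrix}$$

The entries of Φ^Ty are 2.1 + 3.9 + 6.2 = 12.2 and 1(2.1) + 2(3.9) + 3(6.2) = 28.5.

$$(\Phi^T\Phi)^{-1} = \frac{1}{6}\begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix}$$

$$\theta_{ML} = \frac{1}{6}\begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix}\begin{bmatrix} 12.2 \\ 28.5 \end{bmatrix} = \frac{1}{6}\begin{bmatrix} -0.2 \\ 12.3 \end{bmatrix} \approx \begin{bmatrix} -0.033 \\ 2.05 \end{bmatrix}$$

The MLE fit is ŷ ≈ −0.033 + 2.05x. Intercept near zero, slope about 2. Let's verify: at x = 2, ŷ ≈ −0.033 + 4.1 = 4.067, and the actual value is 3.9. Close.

python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
Phi = np.column_stack([np.ones_like(X), X])  # rows are phi(x_n) = [1, x_n]

# Solve the normal equations Phi^T Phi theta = Phi^T y
theta_ML = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
# theta_ML ≈ [-0.0333, 2.05]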

Numerical note: Never actually compute the inverse (Φ^TΦ)⁻¹ directly. Use np.linalg.solve instead, which solves the linear system Φ^TΦθ = Φ^Ty more stably and efficiently. Computing inverses is O(M³) and numerically fragile. Solving linear systems is O(M³) too, but with better constants and stability.
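A related option, sketched here as a suggestion rather than something the text prescribes, is to skip forming Φ^TΦ altogether. Forming Φ^TΦ squares the condition number of Φ, whereas `np.linalg.lstsq` and `np.linalg.pinv` work on Φ directly. On the well-conditioned 3-point example all three routes agree:

```python
import numpy as np

Phi = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.1, 3.9, 6.2])

# Normal equations, solved as a linear system (no explicit inverse)
theta_solve = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Least squares directly on Phi, without forming Phi^T Phi
theta_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Moore-Penrose pseudoinverse applied to y
theta_pinv = np.linalg.pinv(Phi) @ y

print(theta_solve, theta_lstsq, theta_pinv)
```

The difference only becomes visible when the columns of Φ are nearly dependent, for example with high-degree polynomial features.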

Check: What is the closed-form MLE solution for linear regression?

• θ = Φy
• θ = (Φ^TΦ)⁻¹Φ^Ty
• θ = Φ^Ty

Chapter 3: Feature Maps & Polynomials

A straight line is boring. But linear regression is not limited to straight lines. The trick: apply a feature map φ(x) that transforms the raw input into a richer feature space. The model is still linear in θ — it just operates on transformed features.

The most common example is polynomial features:

φ(x) = [1, x, x², x³, …, x^(M−1)]

With this feature map, y = φ(x)^Tθ = θ₀ + θ₁x + θ₂x² + … — a polynomial of degree M−1. Yet we solve it with the exact same formula: θ_ML = (Φ^TΦ)⁻¹Φ^Ty. The normal equations do not care what φ is.

This is why it is called "linear" regression. The model is y = φ(x)^Tθ = ∑_m θ_m φ_m(x). It is a linear combination of the features φ_m(x) with weights θ_m. The features can be anything — polynomials, sines, Gaussians, indicator functions — and the same closed-form solution applies.

Worked example: fit a degree-2 polynomial to the same 3 points.

$$\phi(x) = [1, x, x^2] \;\Rightarrow\; \Phi = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{bmatrix}$$

Now Φ is 3×3 (square). With N = M = 3, the polynomial passes through all 3 points exactly — zero training error. But is zero error desirable? Only if the data is noise-free. If there is noise, a perfect fit through every point is overfitting.
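The exact-fit claim can be checked numerically. This is a small sketch using `np.vander` to build the quadratic features; because Φ is square and the three inputs are distinct, the system can be solved directly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])

# Columns [1, x, x^2]; increasing=True puts the constant column first
Phi = np.vander(x, N=3, increasing=True)

# Phi is square and invertible here, so the fit interpolates the data
theta = np.linalg.solve(Phi, y)
residual = y - Phi @ theta

print(theta)      # coefficients [theta_0, theta_1, theta_2]
print(residual)   # zero up to floating-point rounding
```

The parabola reproduces all three targets, which is exactly why zero training error says nothing about how well the curve generalizes.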

Other feature maps:

• Radial basis: φ_m(x) = exp(−||x − c_m||² / (2s²))
• Fourier: φ_m(x) = sin(mπx) or cos(mπx)
• Sigmoidal: φ_m(x) = σ(a_mx + b_m)

The design choice:

Picking φ encodes your prior knowledge. Use polynomials if you expect smooth curves. Use Fourier features for periodic signals. Use RBFs for local patterns. The choice of features is often more important than the choice of algorithm.
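To make the point that only Φ changes, here is a sketch that swaps in radial basis features and reuses the same least-squares machinery. The helper name `rbf_features`, the centers, the width s, and the sine data are all illustrative assumptions.

```python
import numpy as np

def rbf_features(x, centers, s):
    """phi_m(x) = exp(-(x - c_m)^2 / (2 s^2)) for scalar inputs, plus a bias column."""
    x = np.asarray(x, dtype=float)[:, None]               # shape (N, 1)
    centers = np.asarray(centers, dtype=float)[None, :]   # shape (1, K)
    rbf = np.exp(-(x - centers) ** 2 / (2.0 * s ** 2))    # shape (N, K)
    return np.column_stack([np.ones(len(x)), rbf])        # shape (N, K + 1)

# Illustrative data: a noiseless sine sampled on [0, 1]
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x)

Phi = rbf_features(x, centers=np.linspace(0.0, 1.0, 5), s=0.2)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
rmse = np.sqrt(np.mean((y - Phi @ theta) ** 2))
print(Phi.shape, rmse)
```

Replacing `rbf_features` with a polynomial or Fourier map would leave the fitting line untouched, which is the sense in which the model stays linear in θ.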

Neural networks as feature maps: A neural network with L layers can be viewed as a learned feature map φ(x) followed by a linear output layer. The difference: in linear regression, φ is fixed and we optimize θ. In neural networks, we optimize φ and θ jointly. This is why deep learning is so powerful — the features adapt to the data.

Check: With φ(x) = [1, x, x²], what kind of curve can the model fit?

• Only straight lines
• Only exponential curves
• Quadratic curves (parabolas)

Chapter 4: Overfitting

More features means more flexibility. A degree-12 polynomial has 13 parameters and can fit almost any smooth curve. With only 10 data points, it can pass through every single one with zero training error. Sounds great. It is terrible.

The polynomial oscillates wildly between data points. At locations slightly away from the training data, the predictions are absurd — values in the thousands for data that lives in [−1, 1]. This is overfitting: the model has memorized the noise in the training data instead of learning the underlying pattern.

Polynomial Overfitting Explorer

Drag the Degree slider. At low degrees (0–2), the model underfits. At moderate degrees (3–5), it captures the sine shape. Beyond 8, it overfits — wild oscillations between data points. Training RMSE always drops; test RMSE shows the characteristic U-shape.

The bias-variance trade-off in action. Low-degree models have high bias (they cannot capture the true function) but low variance (they are stable across different datasets). High-degree models have low bias but high variance (they change wildly with different training samples). The sweet spot minimizes expected test error = bias² + variance + irreducible noise.

Notice the diagnostic pattern:

Condition	Training error	Test error	Diagnosis
Underfitting	High	High	Need more complexity
Good fit	Low	Low	Just right
Overfitting	Very low	High	Need less complexity / more data / regularization
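The diagnostic pattern in the table can be reproduced with a small experiment. This is a sketch on synthetic data (noisy samples of sin(πx), a fixed random seed, and 10 training points are all assumptions made for illustration), not the code behind the interactive explorer.

```python
import numpy as np

rng = np.random.default_rng(0)

def poly_features(x, degree):
    """Design matrix with columns 1, x, ..., x^degree."""
    return np.vander(x, N=degree + 1, increasing=True)

def rmse(y, y_hat):
    """Root-mean-square error between targets and predictions."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Synthetic data (an assumption for illustration): noisy samples of sin(pi x)
x_train = np.linspace(-1.0, 1.0, 10)
y_train = np.sin(np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = np.sin(np.pi * x_test) + 0.2 * rng.standard_normal(200)

train_err, test_err = [], []
for degree in range(10):
    Phi = poly_features(x_train, degree)
    theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    train_err.append(rmse(y_train, Phi @ theta))
    test_err.append(rmse(y_test, poly_features(x_test, degree) @ theta))
    print(f"degree {degree}: train {train_err[-1]:.3f}, test {test_err[-1]:.3f}")
```

Training error can only go down as the degree grows, because each larger model contains the smaller ones. With 10 points, degree 9 interpolates the training set exactly, so the test column is the one to watch.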

This is not just an academic concern. Overfitting is the single most common failure mode in ML. Every regularization technique, every validation strategy, every Bayesian method is fundamentally a weapon against overfitting.

Check: How can you detect overfitting?

• Training error is high
• Training error is low but test error is much higher
• The model has too few parameters

Chapter 5: MAP & Regularization

From Chapter 8, we know that a Gaussian prior on θ turns MLE into MAP with an L2 penalty. For linear regression, the MAP objective is:

J(θ) = ||y − Φθ||² + λ||θ||²

where λ = σ²/τ² (noise variance divided by prior variance). Differentiating and setting to zero:

∇_θ J = −2Φ^Ty + 2Φ^TΦθ + 2λθ = 0

(Φ^TΦ + λI)θ = Φ^Ty

θ_MAP = (Φ^TΦ + λI)⁻¹Φ^Ty

Compare MLE vs MAP. The only difference is the λI term added to Φ^TΦ. This has two effects: (1) it shrinks the parameters toward zero (regularization), and (2) it makes the matrix always invertible, even when Φ is rank-deficient. Regularization is also a numerical stability trick!

Returning to our 3-point example with λ = 1 and degree-1 features:

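$$\Phi^T\Phi + \lambda I = \begin{bmatrix} 3+1 & 6 \\ 6 & 14+1 \end{bmatrix} = \begin{bmatrix} 4 & 6 \\ 6 & 15 \end{bmatrix}$$

$$\theta_{MAP} = \frac{1}{24}\begin{bmatrix} 15 & -6 \\ -6 & 4 \end{bmatrix}\begin{bmatrix} 12.2 \\ 28.5 \end{bmatrix} = \frac{1}{24}\begin{bmatrix} 12 \\ 40.8 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 1.7 \end{bmatrix}$$

Compare: θ_ML ≈ [−0.033, 2.05] vs θ_MAP = [0.5, 1.7]. The MAP solution has a smaller norm (about 1.77 versus 2.05): the prior pulled the parameter vector toward zero. The slope shrank from 2.05 to 1.7. The intercept moved from −0.033 to 0.5, away from zero, because the L2 penalty shrinks the vector as a whole rather than each coordinate separately, and a larger intercept compensates for the flatter slope.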

How much shrinkage? When λ is tiny (weak prior), MAP ≈ MLE. When λ is huge (strong prior), θ_MAP ≈ 0 regardless of data. The art is choosing λ via cross-validation or the marginal likelihood (Chapter 9 of this lesson).
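The two limits can be seen by sweeping λ on the 3-point example. This is a minimal sketch; `ridge_fit` is a hypothetical helper name and the λ grid is arbitrary.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Solve (Phi^T Phi + lam I) theta = Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

Phi = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.1, 3.9, 6.2])

# lam = 0 recovers the MLE; large lam drives theta toward zero
for lam in [0.0, 1e-3, 1.0, 1e3, 1e6]:
    theta = ridge_fit(Phi, y, lam)
    print(f"lambda={lam:g}: theta={theta}, norm={np.linalg.norm(theta):.4f}")
```

The printed norm decreases steadily as λ grows, which is the shrinkage the prior imposes.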

Check: What does adding λI to Φ^TΦ accomplish?

• Shrinks parameters toward zero AND ensures the matrix is invertible
• Makes the solution exact (zero training error)
• Increases the number of features

Chapter 6: Bayesian Regression

MLE and MAP give point estimates. Bayesian regression keeps the full posterior distribution over θ. Here is the derivation, step by step.

Step 1: Prior. We place a Gaussian prior on the parameters:

p(θ) = N(θ | m₀, S₀)

Typically m₀ = 0 (we start centered at the origin) and S₀ = αI (isotropic, where α is the prior variance and controls the prior width; it plays the role of τ² in Chapter 5).

Step 2: Likelihood. Given θ, the data likelihood is:

p(y | Φ, θ) = N(y | Φθ, σ²I)

Step 3: Posterior. Gaussian prior × Gaussian likelihood = Gaussian posterior. This is the conjugacy magic. After doing the algebra (completing the square in the exponent):

p(θ | Φ, y) = N(θ | m_N, S_N)

where the posterior parameters are:

S_N = (S₀⁻¹ + σ⁻²Φ^TΦ)⁻¹

m_N = S_N(S₀⁻¹m₀ + σ⁻²Φ^Ty)

Read these formulas carefully. The posterior covariance S_N combines the prior precision S₀⁻¹ (what we knew before) with the data precision σ⁻²Φ^TΦ (what the data tells us). More data → larger Φ^TΦ → smaller S_N → tighter posterior. The posterior mean m_N is a precision-weighted average of the prior mean and the data evidence.

With m₀ = 0 (zero prior mean), the formulas simplify to:

S_N = (α⁻¹I + σ⁻²Φ^TΦ)⁻¹

m_N = σ⁻²S_NΦ^Ty

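Connection to MAP: Because the mode of a Gaussian equals its mean, the MAP estimate is θ_MAP = m_N, the posterior mean. With m₀ = 0 and S₀ = αI this is the regularized solution from Chapter 5 with λ = σ²/α. But Bayesian inference keeps S_N too, which is the uncertainty. MAP throws away the uncertainty and keeps only the peak. Full Bayesian keeps the whole bell curve.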

The posterior has an intuitive interpretation. Before seeing data, our belief is N(0, αI) — a wide ball centered at the origin. Each data point narrows the ball, pulling the center toward values that explain the data. After N data points, the posterior is a tight ellipsoid whose center is near the MLE and whose shape reflects parameter correlations.

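python
import numpy as np

def bayesian_regression(Phi, y, sigma2, alpha):
    """Posterior N(mN, SN) over theta for prior N(0, alpha*I) and noise variance sigma2."""
    M = Phi.shape[1]
    # Posterior precision = prior precision + data precision
    precision = np.eye(M) / alpha + Phi.T @ Phi / sigma2
    # The covariance itself is returned, so an explicit inverse is needed here
    SN = np.linalg.inv(precision)
    # Posterior mean, solved from the precision instead of multiplying by SN
    mN = np.linalg.solve(precision, Phi.T @ y) / sigma2
    return mN, SN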
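A short usage sketch on the 3-point example follows. The function is repeated so the snippet runs on its own, and the values σ² = 0.04 and α = 2 are illustrative choices, not taken from the text. It checks two claims made above: the posterior mean matches the regularized solution with λ = σ²/α, and the posterior tightens as data is added.

```python
import numpy as np

def bayesian_regression(Phi, y, sigma2, alpha):
    """Posterior mean and covariance for prior N(0, alpha I), noise variance sigma2."""
    precision = np.eye(Phi.shape[1]) / alpha + Phi.T @ Phi / sigma2
    SN = np.linalg.inv(precision)
    mN = SN @ Phi.T @ y / sigma2
    return mN, SN

Phi = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.1, 3.9, 6.2])
sigma2, alpha = 0.04, 2.0   # illustrative values, not from the text

mN, SN = bayesian_regression(Phi, y, sigma2, alpha)

# The posterior mean equals the regularized solution with lambda = sigma2 / alpha
lam = sigma2 / alpha
theta_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(2), Phi.T @ y)
print(mN, theta_map)

# Using only the first data point leaves a much wider posterior
_, SN_one = bayesian_regression(Phi[:1], y[:1], sigma2, alpha)
print(np.trace(SN_one), np.trace(SN))
```

The trace of S_N is used here as a single-number summary of posterior width; any other measure of the covariance would show the same shrinkage.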

Check: What happens to the posterior covariance S_N as we observe more data?

• It shrinks (becomes smaller), meaning more certainty about θ
• It grows (becomes larger)
• It stays the same

Chapter 7: Posterior Predictions

We have the posterior p(θ | data) = N(m_N, S_N). Now someone gives us a new input x_* and asks: what is y_*?

The posterior predictive distribution integrates out the parameter uncertainty:

p(y_* | x_*, data) = ∫ p(y_* | x_*, θ) p(θ | data) dθ

Both factors are Gaussian, and the integral of a Gaussian times a Gaussian is Gaussian. The result:

p(y_* | x_*, data) = N(y_* | μ_*, σ_*²)

where:

μ_* = φ(x_*)^Tm_N

σ_*² = φ(x_*)^TS_Nφ(x_*) + σ²

The predictive variance has two components. The first term φ(x_*)^TS_Nφ(x_*) is epistemic uncertainty — uncertainty about which θ is correct. It decreases with more data. The second term σ² is aleatoric uncertainty — irreducible noise in the data. It stays constant no matter how much data you collect. Total uncertainty = parameter uncertainty + noise.

Key behaviors:

Scenario	Epistemic φ^TS_Nφ	Aleatoric σ²	Total uncertainty
Near data points	Small	Fixed	Low
Far from data	Large	Fixed	High
More data collected	Shrinks	Unchanged	Decreases (approaches σ²)

This is exactly the behavior we want. The model is confident near observed data and uncertain in unexplored regions. As data accumulates, the confidence bands narrow toward the irreducible noise level. No ad-hoc uncertainty calibration needed — it falls out of the math.
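The predictive formulas translate almost line by line into code. The sketch below uses the 3-point example with φ(x) = [1, x]; the helper names and the values σ² = 0.04 and α = 2 are assumptions for illustration.

```python
import numpy as np

def features(x):
    """phi(x) = [1, x] stacked as rows."""
    return np.column_stack([np.ones_like(x), x])

def posterior(Phi, y, sigma2, alpha):
    """Posterior mean and covariance for prior N(0, alpha I)."""
    precision = np.eye(Phi.shape[1]) / alpha + Phi.T @ Phi / sigma2
    SN = np.linalg.inv(precision)
    return SN @ Phi.T @ y / sigma2, SN

def predict(phi_star, mN, SN, sigma2):
    """Predictive mean and variance for feature rows phi_star of shape (K, M)."""
    mean = phi_star @ mN
    # Row-wise quadratic form phi^T S_N phi (epistemic part)
    epistemic = np.einsum("km,mn,kn->k", phi_star, SN, phi_star)
    return mean, epistemic + sigma2   # add the aleatoric noise variance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
sigma2, alpha = 0.04, 2.0             # illustrative values

mN, SN = posterior(features(x), y, sigma2, alpha)
x_star = np.array([2.0, 10.0])        # inside vs. far outside the data range
mean, var = predict(features(x_star), mN, SN, sigma2)
print(mean, var)
```

The variance at x = 10 comes out much larger than at x = 2, and neither can drop below σ², matching the first two rows of the table.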

Check: What are the two sources of predictive uncertainty in Bayesian linear regression?

• Training error and test error
• Bias and variance
• Parameter uncertainty (epistemic) and observation noise (aleatoric)

Chapter 8: Uncertainty Bands

Now for the payoff. The simulation below shows Bayesian linear regression in action. You start with the prior — wide uncertainty bands and random function samples. Click on the canvas to add data points. Watch the posterior update live: the uncertainty bands narrow around the data, and the sampled functions converge toward the truth.

Bayesian Linear Regression: Live Posterior Updates

Click on the plot to add data points. Orange band = ±2σ predictive uncertainty. Teal lines = sampled functions from the posterior. Watch the band narrow and the functions converge as you add data.

What to explore: (1) Add 2–3 points and watch the bands narrow only near those points. (2) Add points across the whole range and watch the bands tighten everywhere. (3) Increase α (wider prior) and see how uncertainty grows. (4) Increase noise σ and notice the minimum band width increases. (5) Increase the degree and see how the model becomes more flexible but also more uncertain between data points.
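For readers without the interactive canvas, the same quantities can be computed offline. This is a rough sketch of what such a simulation plausibly draws, not its actual source: the settings, the three hypothetical "clicked" points, and the helper names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
degree, alpha, sigma = 3, 2.0, 0.2   # illustrative settings
M = degree + 1

def features(x):
    """Polynomial features 1, x, ..., x^degree."""
    return np.vander(x, N=M, increasing=True)

def posterior(Phi, y, sigma2, alpha):
    """Posterior mean and covariance for prior N(0, alpha I)."""
    precision = np.eye(Phi.shape[1]) / alpha + Phi.T @ Phi / sigma2
    SN = np.linalg.inv(precision)
    SN = 0.5 * (SN + SN.T)           # remove tiny asymmetries from rounding
    return SN @ Phi.T @ y / sigma2, SN

x_grid = np.linspace(-1.0, 1.0, 50)
Phi_grid = features(x_grid)

# N = 0: functions sampled from the prior N(0, alpha I)
prior_theta = rng.multivariate_normal(np.zeros(M), alpha * np.eye(M), size=5)
prior_functions = prior_theta @ Phi_grid.T          # one row per sampled function

# A few hypothetical "clicks" standing in for added data points
x_obs = np.array([-0.5, 0.0, 0.6])
y_obs = np.sin(np.pi * x_obs)
mN, SN = posterior(features(x_obs), y_obs, sigma**2, alpha)
post_functions = rng.multivariate_normal(mN, SN, size=5) @ Phi_grid.T

# Predictive mean and the +/- 2 sigma band on the grid
mean = Phi_grid @ mN
std = np.sqrt(np.einsum("km,mn,kn->k", Phi_grid, SN, Phi_grid) + sigma**2)
lower, upper = mean - 2 * std, mean + 2 * std
print(std.min(), std.max())
```

Plotting `prior_functions`, `post_functions`, and the `lower`/`upper` band against `x_grid` gives a static version of the picture described above.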

Why this matters for real ML: Uncertainty quantification is critical in safety-sensitive applications — medical diagnosis, autonomous driving, financial risk. A model that says "I'm not sure" is far more useful than one that confidently gives the wrong answer. Bayesian regression shows the gold standard for honest uncertainty.

Check: In the simulation, why do the uncertainty bands widen far from the data points?

• Parameter uncertainty (epistemic) is large in regions with no data
• The noise variance increases far from data
• The model has more parameters in those regions

Chapter 9: Marginal Likelihood

We have a polynomial degree to choose, a prior variance α, and a noise level σ. Cross-validation works, but Bayesian inference offers something more principled: the marginal likelihood (or evidence).

p(y | Φ) = ∫ p(y | Φ, θ) p(θ) dθ

For a Gaussian likelihood and a zero-mean Gaussian prior p(θ) = N(θ | 0, S₀), this integral has a closed form:

log p(y | Φ) = −¹/₂ [ y^T(σ²I + ΦS₀Φ^T)⁻¹y + log|σ²I + ΦS₀Φ^T| + N log(2π) ]

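This expression has three terms, each with a clear interpretation:

Data fit term: y^TK⁻¹y (where K = σ²I + ΦS₀Φ^T). It enters with a negative sign, so it is small when the observed y is plausible under the model. Prefers models that fit the data.

Complexity penalty: log|K|. Grows with model complexity. Penalizes overly flexible models.

Normalization constant: N log(2π). It depends only on the number of data points, so it is the same for every candidate model and does not affect the comparison.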

Automatic Occam's razor. The marginal likelihood balances fit against complexity without any hyperparameter. Simple models that explain the data well get high evidence. Complex models spread their prior mass over many possible datasets, reducing the probability assigned to any specific dataset. Only if the added complexity genuinely helps explain the data will the evidence increase.

In practice, you compute the marginal likelihood for each candidate model (degree 1, 2, 3, ...) and pick the one with the highest evidence. This is Bayesian model selection — principled, automatic, and often more reliable than cross-validation for small datasets.
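A sketch of that procedure for polynomial degrees, assuming S₀ = αI and a zero prior mean. The synthetic data, the seed, and the values of σ² and α are illustrative, and `log_evidence` is a hypothetical helper name.

```python
import numpy as np

def log_evidence(Phi, y, sigma2, alpha):
    """log p(y | Phi) for prior N(0, alpha I) and noise variance sigma2."""
    N = len(y)
    K = sigma2 * np.eye(N) + alpha * Phi @ Phi.T   # covariance of y under the model
    _, logdet = np.linalg.slogdet(K)               # complexity penalty log|K|
    data_fit = y @ np.linalg.solve(K, y)           # y^T K^{-1} y without inverting K
    return -0.5 * (data_fit + logdet + N * np.log(2 * np.pi))

# Synthetic data (an assumption for illustration): noisy samples of sin(pi x)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(15)

for degree in range(8):
    Phi = np.vander(x, N=degree + 1, increasing=True)
    print(degree, log_evidence(Phi, y, sigma2=0.04, alpha=2.0))
```

The degree with the largest printed value is the one Bayesian model selection would pick for this dataset and these hyperparameters. `slogdet` is used instead of `log(det(K))` to avoid overflow or underflow in the determinant.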

Optimizing hyperparameters via evidence: Beyond selecting between discrete models, you can also optimize continuous hyperparameters (α, σ²) by maximizing the marginal likelihood with gradient ascent. This is called empirical Bayes or type-II maximum likelihood. It gives a principled way to set regularization strength without cross-validation.

Check: How does the marginal likelihood avoid selecting overly complex models?

• It includes a complexity penalty term log|K| that grows with model flexibility
• It uses cross-validation internally
• It limits the number of parameters

Chapter 10: The Geometric View

There is a beautiful geometric interpretation of MLE linear regression that ties everything together. The prediction ŷ = Φθ must lie in the column space of Φ — the set of all possible linear combinations of Φ's columns.

The MLE solution finds the point in this column space that is closest to the observed y. "Closest" in the least-squares sense means the residual y − ŷ is perpendicular to the column space.

ŷ_ML = Φ(Φ^TΦ)⁻¹Φ^Ty

The matrix P = Φ(Φ^TΦ)⁻¹Φ^T is the orthogonal projection matrix onto the column space of Φ. It takes any vector y ∈ R^N and projects it onto the M-dimensional subspace spanned by the features.

MLE = orthogonal projection. The predicted values ŷ are the orthogonal projection of y onto the column space of Φ. The residual y − ŷ is perpendicular to every column of Φ. This is why the normal equations are Φ^T(y − Φθ) = 0 — they literally say "the residual is orthogonal to the features."

Properties of the projection matrix P:

Property	Meaning
P² = P	Projecting twice is the same as projecting once (idempotent)
P^T = P	The projection is symmetric
rank(P) = M	The projection space has dimension M (number of features), assuming Φ has full column rank
(I − P)y = y − ŷ	I − P projects onto the orthogonal complement (the residual space)
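These properties are easy to confirm numerically. The sketch below builds P for the 3-point example with φ(x) = [1, x], using a linear solve in place of an explicit inverse.

```python
import numpy as np

Phi = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.1, 3.9, 6.2])

# P = Phi (Phi^T Phi)^{-1} Phi^T, built with solve rather than an explicit inverse
P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
y_hat = P @ y
residual = y - y_hat

print(np.allclose(P @ P, P))        # idempotence check
print(np.allclose(P, P.T))          # symmetry check
print(np.linalg.matrix_rank(P))     # rank of P, to compare with M = 2
print(Phi.T @ residual)             # inner products of the residual with each column
```

The last line is the normal equations in disguise: both inner products vanish up to rounding.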

Geometric intuition for regularization: Adding λI modifies the projection. The fitted values become ŷ = Φ(Φ^TΦ + λI)⁻¹Φ^Ty, and this matrix is no longer a true projection: applying it twice shrinks the result further, so it is not idempotent. Instead of landing exactly on the closest point in the column space, the fit lands on a "softened" version pulled toward the origin. The larger λ, the more the fit shrinks toward zero.

This geometric view also explains why N ≥ M is needed for a unique MLE. If N < M (more features than data points), Φ has at most N linearly independent columns. When they span all of R^N, y already lies in the column space and the "projection" is y itself (perfect fit, zero residual). This is the underdetermined case where Φ^TΦ is singular and infinitely many θ give zero loss. Regularization resolves this degeneracy: for any λ > 0 the solution is unique, and as λ → 0 it approaches the smallest-norm θ among those that fit the data exactly.
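The underdetermined case can be illustrated with two data points and three polynomial features. This is a sketch with made-up numbers; it compares the minimum-norm solution from `np.linalg.pinv` with regularized solutions for shrinking λ.

```python
import numpy as np

# N = 2 data points, M = 3 features [1, x, x^2] at x = 1 and x = 2
Phi = np.array([[1.0, 1.0, 1.0], [1.0, 2.0, 4.0]])
y = np.array([2.1, 3.9])

rank = np.linalg.matrix_rank(Phi.T @ Phi)
print(rank)                                   # rank of the 3x3 matrix Phi^T Phi

# Minimum-norm solution among all theta with zero residual
theta_min_norm = np.linalg.pinv(Phi) @ y
print(Phi @ theta_min_norm - y)               # residual of the exact fit

# Regularized solutions for decreasing lambda
for lam in [1.0, 1e-3, 1e-6]:
    theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)
    print(lam, np.linalg.norm(theta_ridge - theta_min_norm))
```

The distance to the minimum-norm solution shrinks with λ, while any positive λ keeps the linear system solvable even though Φ^TΦ alone is singular.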

The story of this chapter in one sentence: Linear regression, viewed through MLE, MAP, Bayesian inference, and geometry, is the simplest model where all of ML's core ideas become visible, calculable, and beautiful.

What comes next: In the book, Chapter 10 applies dimensionality reduction (PCA) to find the most informative low-dimensional projections of high-dimensional data. Chapter 11 introduces mixture models and the EM algorithm for unsupervised learning. Both build on the linear algebra and probability you have now mastered.

"The purpose of computing is insight, not numbers."
— Richard Hamming

Check: What does the projection matrix P = Φ(Φ^TΦ)⁻¹Φ^T do geometrically?

• Rotates the data by 90 degrees
• Orthogonally projects y onto the column space of Φ
• Inverts the design matrix