The workhorse of machine learning — linear in parameters, not necessarily in inputs.
You have 10 data points. Each is a pair: some input x and an output y. Plot them. You see a roughly linear trend with some scatter. You draw a line through the points by eye. Congratulations — you just did linear regression.
Now make it precise. Which line is "best"? How confident should you be? If I give you a new x, what should you predict for y — and how uncertain is that prediction?
Linear regression answers all of these questions. It is simultaneously:
The name is misleading. "Linear" does not mean the model can only fit straight lines. It means the model is linear in the parameters θ. You can transform the input x through any nonlinear function φ(x) first, then combine linearly. This lets you fit curves, surfaces, and arbitrarily complex shapes — all with the same linear regression machinery.
The model assumes each observation is generated by a linear combination of features plus noise:
Here φ(x) ∈ RM is a feature vector — a transformation of the raw input x into M features. The parameters θ ∈ RM are the weights we want to learn. The noise ε captures everything the model cannot explain.
Stack all N observations into a matrix. The design matrix Φ ∈ RN×M has one row per data point:
The full model in matrix notation:
A concrete example. Suppose x is a single number and φ(x) = [1, x]. Then M = 2, the design matrix is:
And ŷ = Φθ = [θ0 + θ1x1, θ0 + θ1x2, θ0 + θ1x3]T. This is a line with intercept θ0 and slope θ1. The column of 1's lets us have a bias term.
| Symbol | Shape | Meaning |
|---|---|---|
| φ(x) | RM | Feature vector for one input |
| θ | RM | Parameter vector (weights) |
| Φ | RN×M | Design matrix (all features stacked) |
| y | RN | Observed outputs |
| σ2 | R+ | Noise variance |
From Chapter 8, we know that MLE with Gaussian noise reduces to minimizing the sum of squared errors. In matrix form:
To find the minimum, take the gradient with respect to θ and set it to zero. Let's expand the loss first:
Take the derivative using matrix calculus (Chapter 5):
Rearranging gives the normal equations:
If ΦTΦ is invertible (which happens when the columns of Φ are linearly independent — meaning we have at least as many data points as features), the solution is:
Let's verify with a worked numerical example. Three points: (1, 2.1), (2, 3.9), (3, 6.2). Feature map φ(x) = [1, x].
The MLE fit is ŷ = −0.1 + 2.05x. Intercept near zero, slope about 2. Let's verify: at x=2, ŷ = −0.1 + 4.1 = 4.0, and the actual value is 3.9. Close.
python import numpy as np X = np.array([1, 2, 3]) y = np.array([2.1, 3.9, 6.2]) Phi = np.column_stack([np.ones_like(X), X]) # [1, x] theta_ML = np.linalg.solve(Phi.T @ Phi, Phi.T @ y) # array([-0.1, 2.05])
np.linalg.solve instead, which solves the linear system ΦTΦθ = ΦTy more stably and efficiently. Computing inverses is O(M3) and numerically fragile. Solving linear systems is O(M3) too, but with better constants and stability.A straight line is boring. But linear regression is not limited to straight lines. The trick: apply a feature map φ(x) that transforms the raw input into a richer feature space. The model is still linear in θ — it just operates on transformed features.
The most common example is polynomial features:
With this feature map, y = φ(x)Tθ = θ0 + θ1x + θ2x2 + … — a polynomial of degree M−1. Yet we solve it with the exact same formula: θML = (ΦTΦ)−1ΦTy. The normal equations do not care what φ is.
Worked example: fit a degree-2 polynomial to the same 3 points.
Now Φ is 3×3 (square). With N = M = 3, the polynomial passes through all 3 points exactly — zero training error. But is zero error desirable? Only if the data is noise-free. If there is noise, a perfect fit through every point is overfitting.
Other feature maps:
• Radial basis: φm(x) = exp(−||x − cm||2 / 2s2)
• Fourier: φm(x) = sin(mπx) or cos(mπx)
• Sigmoidal: φm(x) = σ(amx + bm)
The design choice:
Picking φ encodes your prior knowledge. Use polynomials if you expect smooth curves. Use Fourier features for periodic signals. Use RBFs for local patterns. The choice of features is often more important than the choice of algorithm.
More features means more flexibility. A degree-12 polynomial has 13 parameters and can fit almost any smooth curve. With only 10 data points, it can pass through every single one with zero training error. Sounds great. It is terrible.
The polynomial oscillates wildly between data points. At locations slightly away from the training data, the predictions are absurd — values in the thousands for data that lives in [−1, 1]. This is overfitting: the model has memorized the noise in the training data instead of learning the underlying pattern.
Drag the Degree slider. At low degrees (0–2), the model underfits. At moderate degrees (3–5), it captures the sine shape. Beyond 8, it overfits — wild oscillations between data points. Training RMSE always drops; test RMSE shows the characteristic U-shape.
Notice the diagnostic pattern:
| Condition | Training error | Test error | Diagnosis |
|---|---|---|---|
| Underfitting | High | High | Need more complexity |
| Good fit | Low | Low | Just right |
| Overfitting | Very low | High | Need less complexity / more data / regularization |
This is not just an academic concern. Overfitting is the single most common failure mode in ML. Every regularization technique, every validation strategy, every Bayesian method is fundamentally a weapon against overfitting.
From Chapter 8, we know that a Gaussian prior on θ turns MLE into MAP with an L2 penalty. For linear regression, the MAP objective is:
where λ = σ2/τ2 (noise variance divided by prior variance). Differentiating and setting to zero:
Returning to our 3-point example with λ = 1 and degree-1 features:
Compare: θML = [−0.1, 2.05] vs θMAP = [0.525, 1.673]. The MAP solution has smaller parameters — the prior pulled them toward zero. The slope shrank from 2.05 to 1.67, and the intercept shifted from −0.1 to 0.53.
MLE and MAP give point estimates. Bayesian regression keeps the full posterior distribution over θ. Here is the derivation, step by step.
Step 1: Prior. We place a Gaussian prior on the parameters:
Typically m0 = 0 (we start centered at the origin) and S0 = αI (isotropic, where α controls the prior width).
Step 2: Likelihood. Given θ, the data likelihood is:
Step 3: Posterior. Gaussian prior × Gaussian likelihood = Gaussian posterior. This is the conjugacy magic. After doing the algebra (completing the square in the exponent):
where the posterior parameters are:
With m0 = 0 (zero prior mean), the formulas simplify to:
The posterior has an intuitive interpretation. Before seeing data, our belief is N(0, αI) — a wide ball centered at the origin. Each data point narrows the ball, pulling the center toward values that explain the data. After N data points, the posterior is a tight ellipsoid whose center is near the MLE and whose shape reflects parameter correlations.
python import numpy as np def bayesian_regression(Phi, y, sigma2, alpha): S0_inv = (1/alpha) * np.eye(Phi.shape[1]) SN = np.linalg.inv(S0_inv + (1/sigma2) * Phi.T @ Phi) mN = (1/sigma2) * SN @ Phi.T @ y return mN, SN
We have the posterior p(θ | data) = N(mN, SN). Now someone gives us a new input x* and asks: what is y*?
The posterior predictive distribution integrates out the parameter uncertainty:
Both factors are Gaussian, and the integral of a Gaussian times a Gaussian is Gaussian. The result:
where:
Key behaviors:
| Scenario | Epistemic φTSNφ | Aleatoric σ2 | Total uncertainty |
|---|---|---|---|
| Near data points | Small | Fixed | Low |
| Far from data | Large | Fixed | High |
| More data collected | Shrinks | Unchanged | Decreases (approaches σ2) |
This is exactly the behavior we want. The model is confident near observed data and uncertain in unexplored regions. As data accumulates, the confidence bands narrow toward the irreducible noise level. No ad-hoc uncertainty calibration needed — it falls out of the math.
Now for the payoff. The simulation below shows Bayesian linear regression in action. You start with the prior — wide uncertainty bands and random function samples. Click on the canvas to add data points. Watch the posterior update live: the uncertainty bands narrow around the data, and the sampled functions converge toward the truth.
Click on the plot to add data points. Orange band = ±2σ predictive uncertainty. Teal lines = sampled functions from the posterior. Watch the band narrow and the functions converge as you add data.
We have a polynomial degree to choose, a prior variance α, and a noise level σ. Cross-validation works, but Bayesian inference offers something more principled: the marginal likelihood (or evidence).
For Gaussian likelihood and Gaussian prior, this integral has a closed form:
This expression has three terms, each with a clear interpretation:
Data fit term: yTK−1y (where K = σ2I + ΦS0ΦT). Prefers models that fit the data.
Complexity penalty: log|K|. Grows with model complexity. Penalizes overly flexible models.
In practice, you compute the marginal likelihood for each candidate model (degree 1, 2, 3, ...) and pick the one with the highest evidence. This is Bayesian model selection — principled, automatic, and often more reliable than cross-validation for small datasets.
There is a beautiful geometric interpretation of MLE linear regression that ties everything together. The prediction ŷ = Φθ must lie in the column space of Φ — the set of all possible linear combinations of Φ's columns.
The MLE solution finds the point in this column space that is closest to the observed y. "Closest" in the least-squares sense means the residual y − ŷ is perpendicular to the column space.
The matrix P = Φ(ΦTΦ)−1ΦT is the orthogonal projection matrix onto the column space of Φ. It takes any vector y ∈ RN and projects it onto the M-dimensional subspace spanned by the features.
Properties of the projection matrix P:
| Property | Meaning |
|---|---|
| P2 = P | Projecting twice is the same as projecting once (idempotent) |
| PT = P | The projection is symmetric |
| rank(P) = M | The projection space has dimension M (number of features) |
| (I − P)y = y − ŷ | I − P projects onto the orthogonal complement (the residual space) |
This geometric view also explains why N ≥ M is needed for MLE. If N < M (more features than data points), the column space of Φ has dimension N < M, and y already lies in it — the "projection" is y itself (perfect fit, zero residual). This corresponds to the underdetermined case where ΦTΦ is singular and infinitely many θ give zero loss. Regularization resolves this degeneracy by picking the smallest-norm solution.
The story of this chapter in one sentence: Linear regression, viewed through MLE, MAP, Bayesian inference, and geometry, is the simplest model where all of ML's core ideas become visible, calculable, and beautiful.
What comes next: Chapter 10 applies dimensionality reduction (PCA) to find the most informative low-dimensional projections of high-dimensional data. Chapter 11 introduces mixture models and the EM algorithm for unsupervised learning. Both build on the linear algebra and probability you have now mastered.