Bridging the gap between mathematical models and real-world observations.
You have spent seven chapters sharpening mathematical tools: linear algebra to represent data, calculus to take derivatives, probability to handle uncertainty, and optimization to find the best answer. Beautiful tools. But tools for what?
Here is the question that motivates all of machine learning: given a bunch of observations — house prices, patient outcomes, images of cats — can we find a mathematical function that explains the data well enough to predict new cases we have never seen?
That sounds simple. It is not. Three flavors of learning tackle it differently:
This chapter focuses on supervised learning. We have N data pairs {(x1, y1), …, (xN, yN)} and we want a function that generalizes — that works on data it has never seen.
The central tension: a model that fits the training data perfectly will often perform terribly on new data. This is overfitting — the model has memorized noise instead of learning the signal. A model that is too simple will underfit, missing genuine patterns in the data.
We will build from the simplest idea (pick the function that makes the fewest errors on training data) to the most sophisticated (integrate over all possible parameters weighted by their posterior probability). Each step adds a principled way to fight overfitting.
Before any model can learn, we need to convert the messy real world into numbers. Every observation becomes a feature vector x ∈ RD — a list of D numbers that describe it. A house might become x = [1200, 3, 1975, 2] (square feet, bedrooms, year built, bathrooms).
The collection of N observations forms a data matrix X ∈ RN×D, where each row is one example and each column is one feature. The corresponding outputs stack into a vector y ∈ RN.
But raw features are often on wildly different scales. Square footage ranges from 500 to 5000; the number of bedrooms is 1 to 6. A model that treats these equally will be dominated by whichever feature has the largest numbers.
Two essential transformations fix this:
Centering: subtract the mean from each feature, so the data is centered at the origin.
where μd = (1/N) ∑n xnd
Scaling (standardization): divide by the standard deviation, so each feature has unit variance.
Now every feature lives on the same scale.
Other encodings handle non-numeric data. A categorical feature like "color" with values {red, green, blue} becomes a one-hot vector: red = [1,0,0], green = [0,1,0], blue = [0,0,1]. This converts every observation into a numeric vector without imposing an artificial ordering.
| Feature Type | Encoding | Example |
|---|---|---|
| Numeric | Standardize | sq_ft → (sq_ft − μ) / σ |
| Categorical | One-hot | "red" → [1, 0, 0] |
| Ordinal | Integer code | "low/med/high" → 0, 1, 2 |
| Text | Bag-of-words / embeddings | "good movie" → [0, 1, 0, 1, …] |
With data in hand, we need a machine that takes in an input x and spits out a prediction. There are two very different philosophies about what this machine should look like.
Deterministic predictors give a single crisp answer: f(x) = ŷ. A straight line through a scatter plot. A neural network that outputs one number. Given the same input, the same output, every time.
Probabilistic predictors give a whole distribution: p(y | x, θ). Instead of saying "the house price is $350,000," they say "the house price is normally distributed with mean $350,000 and standard deviation $20,000." This is richer — we get a best guess and our uncertainty about it.
The second approach is strictly more informative. It includes the first as a special case (just take the mean of the distribution). And it gives us a principled way to say "I'm not sure" — which is exactly what we need to fight overfitting.
A model class is the family of functions we consider. Choosing "linear functions" means f(x) = θ0 + θ1x. Choosing "polynomials of degree 5" means f(x) = θ0 + θ1x + … + θ5x5. The model class is a design choice we make before seeing data. It constrains which functions are even possible. A bad choice — too simple or too complex — dooms us before training even begins.
| Approach | Output | What it tells you |
|---|---|---|
| Deterministic: f(x) | Single point ŷ | Best guess |
| Probabilistic: p(y|x,θ) | Distribution over y | Best guess + uncertainty |
We have a predictor with parameters θ. We have data. How do we measure whether θ is any good? We need a loss function — a number that says how bad a prediction is.
The most common choice is squared loss: the square of the difference between the prediction and the truth.
Why squared? Two reasons. First, it penalizes big errors more than small ones (an error of 10 costs 100, not 10). Second, it leads to beautiful closed-form solutions when combined with linear models and Gaussian noise, as we will see in Chapter 6.
A single data point's loss is not enough. We care about performance across all N training examples. The empirical risk is the average loss over the training set:
This is what most ML algorithms actually minimize. Gradient descent, normal equations, Newton's method — they all find the θ that makes Remp as small as possible.
Let's compute a concrete example. Suppose we have 3 data points and a linear predictor f(x) = θx with θ = 2:
| n | xn | yn | ŷn = 2xn | ℓ = (y − ŷ)2 |
|---|---|---|---|---|
| 1 | 1 | 2.3 | 2 | 0.09 |
| 2 | 2 | 3.8 | 4 | 0.04 |
| 3 | 3 | 6.5 | 6 | 0.25 |
That is our empirical risk for θ = 2. Different values of θ would give different risks. The optimal θ minimizes this number over the training set.
Minimizing empirical risk alone has a dangerous failure mode. If your model is flexible enough, it can drive the training loss to zero by memorizing every data point — including the noise. The model contorts itself into wild shapes to hit every dot exactly, then makes absurd predictions on new data.
The fix: add a penalty for complexity. Regularization says: "I want small risk, but I also want simple parameters." We add a regularization term to the objective:
The term ||θ||2 = ∑d θd2 penalizes large parameter values. The hyperparameter λ controls the trade-off: λ = 0 means pure empirical risk (no regularization); large λ forces parameters toward zero (extreme simplicity).
This specific form is called L2 regularization, Tikhonov regularization, or ridge regression (when applied to linear models). It has a deep connection to Bayesian inference that we will uncover in Chapter 7: adding ||θ||2 is equivalent to placing a Gaussian prior on θ.
No regularization (λ = 0):
Model fits noise. Wild oscillations between data points. Great training error, terrible test error.
Too much regularization (λ → ∞):
Model collapses to f(x) ≈ 0. Ignores data entirely. Bad training error, bad test error.
Other regularizers exist. L1 regularization (λ·∑|θd|) encourages sparsity — it drives some parameters exactly to zero, effectively doing feature selection. Dropout in neural networks is another form. All share the same spirit: trade a little training accuracy for better generalization.
We have a knob λ that controls regularization. We have a knob for model complexity (polynomial degree, number of layers). How do we set these hyperparameters? We cannot use the training loss — it always prefers the most complex model.
The idea: pretend some of your training data is "unseen." Train on part of the data, evaluate on the held-out part. This simulates the generalization test.
K-fold cross-validation makes this systematic. Split the data into K equal chunks (folds). Train on K−1 folds, evaluate on the remaining fold. Rotate which fold is held out. Average the K test scores.
20 data points split into K folds. Teal = training, orange = validation. Click Next Fold to rotate, or change K with the slider.
For hyperparameters like λ, use nested cross-validation: an outer loop to estimate generalization error, and an inner loop to select the best λ. This avoids the subtle bias of selecting hyperparameters on the same data used for evaluation.
So far we have been vague about how to set parameters. "Minimize the loss" is intuitive, but where does the loss come from? Maximum likelihood estimation (MLE) gives a principled, probabilistic answer.
The idea: assume the data was generated by a probabilistic model p(y | x, θ). Now ask: for which θ is the observed data most probable?
If the data points are independent, the joint probability factors into a product. Products are annoying to optimize (tiny numbers multiply to underflow), so we take the logarithm. Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood (NLL):
Let's verify with numbers. Suppose σ2 = 1 and our 3 data points from Chapter 3:
| n | yn | f(xn) | −log p(yn|xn,θ) |
|---|---|---|---|
| 1 | 2.3 | 2.0 | 0.5·0.09 + 0.919 = 0.964 |
| 2 | 3.8 | 4.0 | 0.5·0.04 + 0.919 = 0.939 |
| 3 | 6.5 | 6.0 | 0.5·0.25 + 0.919 = 1.044 |
Each row's NLL is (1/2)(yn − ŷn)2 + 1/2log(2π). The constant 0.919 = 1/2log(2π) drops out during optimization, leaving just the squared errors.
Noisy sine data. Use the Degree slider to change polynomial degree (0–12). Watch training RMSE (always decreasing) vs test RMSE (U-shaped). The sweet spot is where test error is minimized.
Maximum likelihood treats all parameter values as equally plausible before seeing data. But sometimes we have prior beliefs. Maybe we think parameters should be small. Maybe we have domain knowledge. Can we incorporate that?
Yes. Maximum a posteriori (MAP) estimation uses Bayes' theorem to combine the likelihood with a prior distribution on θ:
The posterior is proportional to the likelihood times the prior. MAP finds the peak of this posterior:
Now here is the punchline. Suppose the prior is Gaussian: p(θ) = N(0, τ2I). Then:
Let's trace the math explicitly. With Gaussian likelihood and Gaussian prior:
Multiply through by 2σ2:
That is exactly least squares + λ||θ||2 with λ = σ2/τ2. If the prior variance τ2 is small (we strongly believe θ ≈ 0), λ is large — heavy regularization. If τ2 is huge (weak prior), λ ≈ 0 — almost no regularization, and MAP reduces to MLE.
| Prior | Regularizer | Effect | Name |
|---|---|---|---|
| Gaussian N(0, τ2I) | λ||θ||2 | Small parameters | Ridge / L2 |
| Laplace(0, b) | λ||θ||1 | Sparse parameters | LASSO / L1 |
| Uniform (flat) | None | No constraint | MLE |
MLE gives one answer: the single best θ. MAP gives one answer with a prior bias. But both collapse the posterior into a single point. What if we kept the entire posterior distribution?
Bayesian inference does not commit to a single θ. Instead, it computes the full posterior p(θ | X, y), which tells us how plausible each θ is given the data. Predictions average over all parameter values, weighted by their plausibility:
This is the posterior predictive distribution. Instead of plugging in one θ̂, we integrate over all possible θ, weighting each by the posterior. If many different θ values are plausible, the predictive distribution will be wide — reflecting our uncertainty.
Point estimates (MLE, MAP):
Pick one θ. Use it for all predictions. No uncertainty about parameters. Risk of overconfidence.
Bayesian inference:
Keep the full posterior. Average predictions over all θ. Automatic uncertainty. More computation.
The challenge: computing the posterior requires the normalizing constant (the marginal likelihood or evidence):
This integral is often intractable. But for specific choices of likelihood and prior (called conjugate pairs), it has a closed form. The Gaussian likelihood + Gaussian prior is the most important conjugate pair, and it gives us closed-form Bayesian linear regression (Chapter 9).
We have seen how to fit parameters inside a model. But how do we choose between models themselves? A linear model vs. a degree-5 polynomial vs. a degree-20 polynomial — which one is "best"?
Cross-validation is one answer. But there is a more principled Bayesian answer: the marginal likelihood (also called the model evidence):
This is the probability of the observed data under model M, averaging over all possible parameter values. It automatically balances fit against complexity — no hyperparameter tuning needed.
To compare two models M1 and M2, compute their Bayes factor:
BF > 1 favors M1. BF < 1 favors M2. It is the Bayesian equivalent of a hypothesis test, with a natural interpretation: how many times more likely is the data under M1 than under M2?
| Bayes Factor | Evidence strength |
|---|---|
| 1 – 3 | Barely worth mentioning |
| 3 – 10 | Substantial |
| 10 – 30 | Strong |
| 30 – 100 | Very strong |
| > 100 | Decisive |
This chapter built the bridge from pure math to machine learning. Every concept addresses the same question from a different angle: how do we learn from data without being fooled by noise?
| Method | Core idea | Fights overfitting? |
|---|---|---|
| Empirical Risk Min. | Minimize average training loss | No |
| Regularization | Penalize large parameters | Yes (λ tuning) |
| Cross-validation | Evaluate on held-out data | Yes (choose hyperparams) |
| MLE | Maximize data likelihood | No (same as ERM) |
| MAP | MLE + prior on θ | Yes (= regularization) |
| Bayesian | Average over all θ | Yes (automatic) |
| Marginal Likelihood | Score models by evidence | Yes (Occam built in) |
The deep connections we uncovered:
What comes next: Chapter 9 takes these ideas and applies them to the single most important model in all of ML: linear regression. We will derive the closed-form MLE solution, add Bayesian uncertainty bands, and watch the posterior update as data arrives.