Deisenroth et al., Chapter 8

When Models Meet Data

Bridging the gap between mathematical models and real-world observations.

Prerequisites: Chapters 5–7 (vector calculus, probability, optimization). That's it.
11 chapters · 5+ simulations · 11 quizzes

Chapter 0: Why Models Meet Data

You have spent seven chapters sharpening mathematical tools: linear algebra to represent data, calculus to take derivatives, probability to handle uncertainty, and optimization to find the best answer. Beautiful tools. But tools for what?

Here is the question that motivates all of machine learning: given a bunch of observations — house prices, patient outcomes, images of cats — can we find a mathematical function that explains the data well enough to predict new cases we have never seen?

That sounds simple. It is not. Three flavors of learning tackle it differently:

Supervised Learning: given inputs x and outputs y, learn a mapping x → y.
Unsupervised Learning: given only x, find hidden structure (clusters, latent variables).
Reinforcement Learning: learn by trial and error in an environment with rewards.

This chapter focuses on supervised learning. We have N data pairs $\{(x_1, y_1), \dots, (x_N, y_N)\}$ and we want a function that generalizes, that is, one that works on data it has never seen.

The central tension: a model that fits the training data perfectly will often perform terribly on new data. This is overfitting — the model has memorized noise instead of learning the signal. A model that is too simple will underfit, missing genuine patterns in the data.

The core question of this chapter: How do we choose a model, fit it to data, and know whether it will generalize? Every concept that follows — loss functions, regularization, cross-validation, maximum likelihood, MAP, Bayesian inference — is a different angle on this single question.

We will build from the simplest idea (pick the function that makes the fewest errors on training data) to the most sophisticated (integrate over all possible parameters weighted by their posterior probability). Each step adds a principled way to fight overfitting.

Check: What is the fundamental problem that all of machine learning tries to solve?

Chapter 1: Data as Vectors

Before any model can learn, we need to convert the messy real world into numbers. Every observation becomes a feature vector $x \in \mathbb{R}^D$, a list of D numbers that describe it. A house might become x = [1200, 3, 1975, 2] (square feet, bedrooms, year built, bathrooms).

The collection of N observations forms a data matrix $X \in \mathbb{R}^{N \times D}$, where each row is one example and each column is one feature. The corresponding outputs stack into a vector $y \in \mathbb{R}^N$.

$$X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_N^\top \end{bmatrix} \in \mathbb{R}^{N \times D}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^N$$

But raw features are often on wildly different scales. Square footage ranges from 500 to 5000; the number of bedrooms is 1 to 6. A model that treats raw values equally, for example one based on distances or on a penalty over parameter sizes, will be dominated by whichever feature has the largest numeric range.

Two essential transformations fix this:

Centering: subtract the mean from each feature, so the data is centered at the origin.

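$$\tilde{x}_d = x_d - \mu_d$$

where $\mu_d = \frac{1}{N} \sum_{n=1}^{N} x_{nd}$ is the mean of feature d over the N examples.

Scaling (standardization): divide by the standard deviation, so each feature has unit variance.

$$\tilde{x}_d = \frac{x_d - \mu_d}{\sigma_d}$$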

Now every feature lives on the same scale.
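The two transformations above can be sketched in a few lines of NumPy. The house data below is made up for illustration, and the population standard deviation (NumPy's default) is an assumption; some texts divide by N − 1 instead.

```python
import numpy as np

# Hypothetical data matrix: 4 houses, columns = [sq_ft, bedrooms, year_built, bathrooms]
X = np.array([
    [1200.0, 3.0, 1975.0, 2.0],
    [2500.0, 4.0, 2001.0, 3.0],
    [800.0, 2.0, 1960.0, 1.0],
    [1800.0, 3.0, 1990.0, 2.0],
])

mu = X.mean(axis=0)      # per-feature mean, shape (D,)
sigma = X.std(axis=0)    # per-feature standard deviation (population form, ddof=0)

X_centered = X - mu                 # centering only
X_standardized = (X - mu) / sigma   # centering followed by scaling

print(X_standardized.mean(axis=0))  # approximately 0 for every feature
print(X_standardized.std(axis=0))   # approximately 1 for every feature
```

In practice μ and σ are computed on the training data only and then reused to transform any new data, so that no information from held-out examples leaks into training.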

Why centering matters geometrically: Centering moves the data cloud so its center of mass sits at the origin. This makes optimization much better behaved — gradient descent doesn't have to fight through long, narrow valleys caused by a lopsided coordinate system.

Other encodings handle non-numeric data. A categorical feature like "color" with values {red, green, blue} becomes a one-hot vector: red = [1,0,0], green = [0,1,0], blue = [0,0,1]. This converts every observation into a numeric vector without imposing an artificial ordering.
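A minimal sketch of one-hot encoding follows. The helper name `one_hot` and the fixed category order are assumptions made for this example, not part of the chapter.

```python
import numpy as np

categories = ["red", "green", "blue"]   # fixed category order (an assumption)
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    """Return the one-hot vector for a single categorical value."""
    vec = np.zeros(len(categories))
    vec[index[value]] = 1.0
    return vec

colors = ["green", "red", "blue", "green"]
encoded = np.stack([one_hot(c) for c in colors])   # shape (4, 3), one row per observation
print(encoded)
```

Each row contains exactly one 1, so no color is numerically "larger" than another.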

| Feature type | Encoding | Example |
|---|---|---|
| Numeric | Standardize | sq_ft → (sq_ft − μ) / σ |
| Categorical | One-hot | "red" → [1, 0, 0] |
| Ordinal | Integer code | "low/med/high" → 0, 1, 2 |
| Text | Bag-of-words / embeddings | "good movie" → [0, 1, 0, 1, …] |

Check: Why do we standardize features before training a model?

Chapter 2: Predictors & Functions

With data in hand, we need a machine that takes in an input x and spits out a prediction. There are two very different philosophies about what this machine should look like.

Deterministic predictors give a single crisp answer: f(x) = ŷ. A straight line through a scatter plot. A neural network that outputs one number. Given the same input, the same output, every time.

$$\hat{y} = f(x) = \theta^\top x$$

Probabilistic predictors give a whole distribution: p(y | x, θ). Instead of saying "the house price is $350,000," they say "the house price is normally distributed with mean $350,000 and standard deviation $20,000." This is richer — we get a best guess and our uncertainty about it.

$$p(y \mid x, \theta) = \mathcal{N}(y \mid \theta^\top x, \sigma^2)$$

The second approach is strictly more informative. It includes the first as a special case (just take the mean of the distribution). And it gives us a principled way to say "I'm not sure" — which is exactly what we need to fight overfitting.
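To make the contrast concrete, here is a small sketch of both kinds of predictor for a linear model. The parameter values, the noise level, and the convention that prices are in thousands of dollars are all assumptions chosen to echo the house-price example.

```python
import numpy as np

theta = np.array([50.0, 0.25])   # hypothetical parameters: [intercept, slope per sq ft]
sigma = 20.0                     # hypothetical noise standard deviation

def predict_point(x, theta):
    """Deterministic predictor: returns the single value theta^T x."""
    return theta @ x

def predict_distribution(x, theta, sigma):
    """Probabilistic predictor: returns (mean, std) of N(y | theta^T x, sigma^2)."""
    return theta @ x, sigma

def log_density(y, x, theta, sigma):
    """Log of the Gaussian density p(y | x, theta)."""
    mean = theta @ x
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mean) ** 2 / (2 * sigma**2)

x = np.array([1.0, 1200.0])           # the leading 1 multiplies the intercept
y_hat = predict_point(x, theta)       # 50 + 0.25 * 1200 = 350
mean, std = predict_distribution(x, theta, sigma)
print(y_hat, mean, std)

# The probabilistic predictor can also score how plausible an observed price is.
print(log_density(350.0, x, theta, sigma), log_density(400.0, x, theta, sigma))
```

The mean of the distribution recovers the deterministic prediction, while the density assigns lower plausibility to prices further from that mean.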

A predictor is a function with knobs. The parameters θ are the knobs. Training means turning them until the predictions match the data. Different settings of θ give different predictions. The entire game of supervised learning is: which θ is best?

A model class is the family of functions we consider. Choosing "linear functions" means $f(x) = \theta_0 + \theta_1 x$. Choosing "polynomials of degree 5" means $f(x) = \theta_0 + \theta_1 x + \dots + \theta_5 x^5$. The model class is a design choice we make before seeing data. It constrains which functions are even possible. A bad choice, whether too simple or too complex, dooms us before training even begins.

| Approach | Output | What it tells you |
|---|---|---|
| Deterministic: f(x) | Single point ŷ | Best guess |
| Probabilistic: $p(y \mid x, \theta)$ | Distribution over y | Best guess + uncertainty |

Check: What is the advantage of a probabilistic predictor p(y|x,θ) over a deterministic one f(x)?

Chapter 3: Loss & Risk

We have a predictor with parameters θ. We have data. How do we measure whether θ is any good? We need a loss function — a number that says how bad a prediction is.

The most common choice is squared loss: the square of the difference between the prediction and the truth.

$$\ell(y, \hat{y}) = (y - \hat{y})^2$$

Why squared? Two reasons. First, it penalizes big errors more than small ones (an error of 10 costs 100, not 10). Second, it leads to beautiful closed-form solutions when combined with linear models and Gaussian noise, as we will see in Chapter 6.

A single data point's loss is not enough. We care about performance across all N training examples. The empirical risk is the average loss over the training set:

$$R_{\text{emp}}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, f(x_n; \theta))$$

This is what most ML algorithms actually minimize. Gradient descent, normal equations, Newton's method — they all find the θ that makes Remp as small as possible.

Key insight: empirical risk is not the full story. What we really want to minimize is the true risk (or expected risk), which averages over all possible data, not just the training set. The true risk $R(\theta) = \mathbb{E}_{x,y}[\ell(y, f(x;\theta))]$ is unknowable because we never have access to the full data distribution. The empirical risk is our best proxy. The gap between them is the generalization gap.

Let's compute a concrete example. Suppose we have 3 data points and a linear predictor f(x) = θx with θ = 2:

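| n | $x_n$ | $y_n$ | $\hat{y}_n = 2x_n$ | $\ell = (y_n - \hat{y}_n)^2$ |
|---|---|---|---|---|
| 1 | 1 | 2.3 | 2 | 0.09 |
| 2 | 2 | 3.8 | 4 | 0.04 |
| 3 | 3 | 6.5 | 6 | 0.25 |

$R_{\text{emp}} = \frac{1}{3}(0.09 + 0.04 + 0.25) \approx 0.127$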

That is our empirical risk for θ = 2. Different values of θ would give different risks. The optimal θ minimizes this number over the training set.
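The same calculation can be reproduced in code. The sketch below also computes the minimizing θ for this one-parameter model, using the closed form obtained by setting the derivative of the risk to zero; the function name `empirical_risk` is introduced here for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.3, 3.8, 6.5])

def empirical_risk(theta, x, y):
    """Average squared loss of the predictor f(x) = theta * x."""
    return np.mean((y - theta * x) ** 2)

print(empirical_risk(2.0, x, y))   # about 0.127, matching the table

# For this one-parameter model the minimizer has a closed form:
# setting dR/dtheta = 0 gives theta* = sum(x * y) / sum(x^2).
theta_star = np.sum(x * y) / np.sum(x ** 2)
print(theta_star, empirical_risk(theta_star, x, y))
```

Here θ* comes out at 2.1 with a risk of 0.08, lower than the 0.127 obtained for θ = 2.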

Other loss functions: Squared loss is not the only choice. Absolute loss |y − ŷ| is more robust to outliers. Hinge loss max(0, 1 − y·ŷ) is used in SVMs (Chapter 12 of the book). Cross-entropy loss is standard for classification. Each loss encodes different assumptions about what "good" means.
Check: What is the empirical risk Remp(θ)?

Chapter 4: Regularization

Minimizing empirical risk alone has a dangerous failure mode. If your model is flexible enough, it can drive the training loss to zero by memorizing every data point — including the noise. The model contorts itself into wild shapes to hit every dot exactly, then makes absurd predictions on new data.

The fix: add a penalty for complexity. Regularization says: "I want small risk, but I also want simple parameters." We add a regularization term to the objective:

$$J(\theta) = R_{\text{emp}}(\theta) + \lambda \lVert\theta\rVert^2$$

The term $\lVert\theta\rVert^2 = \sum_d \theta_d^2$ penalizes large parameter values. The hyperparameter λ controls the trade-off: λ = 0 means pure empirical risk (no regularization); large λ forces parameters toward zero (extreme simplicity).

λ is a complexity dial. Turn it up: the model becomes simpler, smoother, less prone to overfitting — but might underfit. Turn it down: the model becomes more flexible, fits the data better — but might overfit. Finding the sweet spot is the art of machine learning.
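The dial can be seen numerically for a linear model with squared loss, where the regularized objective has a closed-form minimizer. The sketch below assumes synthetic data and the objective (1/N)‖y − Xθ‖² + λ‖θ‖²; setting its gradient to zero gives (XᵀX + NλI)θ = Xᵀy.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 30, 5
X = rng.normal(size=(N, D))
true_theta = np.array([2.0, -1.0, 0.5, 0.0, 3.0])   # hypothetical ground truth
y = X @ true_theta + 0.5 * rng.normal(size=N)

def ridge_fit(X, y, lam):
    """Minimize (1/N) * ||y - X theta||^2 + lam * ||theta||^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:7.2f}   ||theta|| = {norm:.4f}")
```

As λ grows, the norm of the fitted parameter vector shrinks toward zero, which is the "simpler model" end of the dial.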

This specific form is called L2 regularization or Tikhonov regularization; applied to linear regression it is known as ridge regression. It has a deep connection to Bayesian inference that we will uncover in Chapter 7: adding $\lambda \lVert\theta\rVert^2$ to the objective corresponds to placing a zero-mean Gaussian prior on θ.

No regularization (λ = 0):

Model fits noise. Wild oscillations between data points. Great training error, terrible test error.

Too much regularization (λ → ∞):

Model collapses to f(x) ≈ 0. Ignores data entirely. Bad training error, bad test error.

Other regularizers exist. L1 regularization ($\lambda \sum_d |\theta_d|$) encourages sparsity: it drives some parameters exactly to zero, effectively doing feature selection. Dropout in neural networks is another form. All share the same spirit: trade a little training accuracy for better generalization.

Worked example: With θ = [3, −5, 2] and λ = 0.1:
$\lVert\theta\rVert^2 = 9 + 25 + 4 = 38$
Regularization penalty = 0.1 × 38 = 3.8
If $R_{\text{emp}} = 0.5$, then J(θ) = 0.5 + 3.8 = 4.3. The large $\theta_2 = -5$ contributes the most penalty (2.5 of the 3.8).
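The arithmetic of the worked example can be checked directly. The variable names below are chosen for this sketch only.

```python
import numpy as np

theta = np.array([3.0, -5.0, 2.0])
lam = 0.1
empirical_risk = 0.5                    # value assumed in the worked example

squared_norm = np.sum(theta ** 2)       # 9 + 25 + 4 = 38
penalty = lam * squared_norm            # 3.8
objective = empirical_risk + penalty    # 4.3
per_parameter = lam * theta ** 2        # penalty contributed by each parameter

print(squared_norm, penalty, objective)
print(per_parameter)
```

The per-parameter breakdown shows the second parameter accounting for the largest share of the penalty.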
Check: What does the regularization parameter λ control?

Chapter 5: Cross-Validation

We have a knob λ that controls regularization. We have a knob for model complexity (polynomial degree, number of layers). How do we set these hyperparameters? We cannot use the training loss — it always prefers the most complex model.

The idea: pretend some of your training data is "unseen." Train on part of the data, evaluate on the held-out part. This simulates the generalization test.

K-fold cross-validation makes this systematic. Split the data into K chunks (folds) of roughly equal size. Train on K−1 folds, evaluate on the remaining fold. Rotate which fold is held out so that each fold serves as the validation set exactly once. Average the K validation scores.
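A minimal sketch of the procedure, assuming a simple least-squares line as the model and mean squared error as the score. The data is synthetic and the helper names are made up for this example.

```python
import numpy as np

def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k folds of nearly equal size."""
    return np.array_split(rng.permutation(n), k)

def fit_line(x, y):
    """Least-squares fit of y = a + b * x; returns the coefficients [a, b]."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, x, y):
    """Mean squared error of the fitted line on the points (x, y)."""
    return np.mean((y - (coef[0] + coef[1] * x)) ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=20)   # hypothetical noisy line

K = 5
folds = k_fold_indices(len(x), K, rng)
scores = []
for i in range(K):
    val = folds[i]                                              # held-out fold
    train = np.concatenate([folds[j] for j in range(K) if j != i])
    coef = fit_line(x[train], y[train])                         # train on K-1 folds
    scores.append(mse(coef, x[val], y[val]))                    # score on the held-out fold

cv_score = np.mean(scores)
print(scores, cv_score)
```

Shuffling before splitting is a design choice that guards against any ordering in the data; for time series or grouped data a different splitting scheme would be needed.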

[Interactive figure: K-fold cross-validation. 20 data points split into K folds (default K = 5); teal = training, orange = validation.]

Why K = 5 or K = 10? Small K (like 2) means each model is trained on only half the data, so the cross-validation score tends to underestimate how well a model trained on all the data would perform. Large K (like N, called leave-one-out) is nearly unbiased but expensive, and its estimate can have high variance. K = 5 or K = 10 is a practical sweet spot and a common default.

For hyperparameters like λ, use nested cross-validation: an outer loop to estimate generalization error, and an inner loop to select the best λ. This avoids the subtle bias of selecting hyperparameters on the same data used for evaluation.

Outer fold: hold out the test fold and pass the rest to the inner CV.
Inner CV: for each λ, run K-fold CV on the training portion and pick the best λ.
Evaluate: retrain with the best λ on the full training portion and score on the outer test fold.
Repeat for each outer fold.
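The nested loop can be sketched as follows, assuming ridge regression as the model, a small hypothetical grid of λ values, 5 outer folds, and 4 inner folds. Here the ridge objective is written with a sum rather than an average, so λ is on a different scale than in Chapter 4.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - X theta||^2 + lam * ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(theta, X, y):
    return np.mean((y - X @ theta) ** 2)

def cv_mse(X, y, lam, folds):
    """Average validation MSE of ridge regression over the given folds."""
    scores = []
    for i in range(len(folds)):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        theta = ridge_fit(X[train], y[train], lam)
        scores.append(mse(theta, X[folds[i]], y[folds[i]]))
    return np.mean(scores)

rng = np.random.default_rng(0)
N, D = 60, 8
X = rng.normal(size=(N, D))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=N)   # hypothetical data
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]

outer_folds = np.array_split(rng.permutation(N), 5)
outer_scores, chosen = [], []
for i in range(5):
    test = outer_folds[i]
    train = np.concatenate([f for j, f in enumerate(outer_folds) if j != i])
    # Inner CV sees only the outer training portion, never the outer test fold.
    inner_folds = np.array_split(rng.permutation(len(train)), 4)
    inner = [cv_mse(X[train], y[train], lam, inner_folds) for lam in lambdas]
    best_lam = lambdas[int(np.argmin(inner))]
    # Retrain with the selected lambda on the full outer training portion.
    theta = ridge_fit(X[train], y[train], best_lam)
    outer_scores.append(mse(theta, X[test], y[test]))
    chosen.append(best_lam)

print(chosen, np.mean(outer_scores))
```

The average of `outer_scores` estimates the generalization error of the whole procedure (model fitting plus λ selection), not of one particular λ; different outer folds may select different values.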
Check: In 5-fold cross-validation, how much data is used for training in each fold?

Chapter 6: Maximum Likelihood

So far we have been vague about how to set parameters. "Minimize the loss" is intuitive, but where does the loss come from? Maximum likelihood estimation (MLE) gives a principled, probabilistic answer.

The idea: assume the data was generated by a probabilistic model p(y | x, θ). Now ask: for which θ is the observed data most probable?

$$\theta_{\text{ML}} = \arg\max_{\theta} \; p(y_1, \dots, y_N \mid x_1, \dots, x_N, \theta)$$

If the data points are independent, the joint probability factors into a product. Products are annoying to optimize (tiny numbers multiply to underflow), so we take the logarithm. Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood (NLL):

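$$\text{NLL}(\theta) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta)$$

The Gaussian miracle: Assume $y_n = f(x_n; \theta) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Then $p(y_n \mid x_n, \theta)$ is a Gaussian centered at $f(x_n; \theta)$. The NLL becomes:

$$\text{NLL}(\theta) = \frac{N}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - f(x_n; \theta))^2$$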

The first term is a constant (it doesn't depend on θ). The second term is the sum of squared errors, scaled by the positive constant 1/(2σ²), which does not change the minimizer. Minimizing NLL under Gaussian noise = minimizing squared loss. Least squares is not an arbitrary choice: it falls directly out of maximum likelihood with Gaussian noise.
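For readers who want the intermediate step, the per-example term can be written out from the Gaussian density. This is the standard derivation under the stated noise assumption, with σ² treated as fixed.

```latex
\begin{aligned}
-\log p(y_n \mid x_n, \theta)
  &= -\log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
       \exp\!\left( -\frac{(y_n - f(x_n;\theta))^2}{2\sigma^2} \right) \right] \\
  &= \frac{1}{2}\log(2\pi\sigma^2) + \frac{(y_n - f(x_n;\theta))^2}{2\sigma^2}
\end{aligned}
```

Summing this expression over n = 1, …, N gives the constant N/2 · log(2πσ²) plus the scaled sum of squared errors.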

Let's verify with numbers. Suppose σ² = 1 and take our 3 data points from Chapter 3, with θ = 2:

| n | $y_n$ | $f(x_n)$ | $-\log p(y_n \mid x_n, \theta)$ |
|---|---|---|---|
| 1 | 2.3 | 2.0 | 0.5·0.09 + 0.919 = 0.964 |
| 2 | 3.8 | 4.0 | 0.5·0.04 + 0.919 = 0.939 |
| 3 | 6.5 | 6.0 | 0.5·0.25 + 0.919 = 1.044 |

NLL = 0.964 + 0.939 + 1.044 = 2.947

Each row's NLL is $\frac{1}{2}(y_n - \hat{y}_n)^2 + \frac{1}{2}\log(2\pi)$. The constant $0.919 \approx \frac{1}{2}\log(2\pi)$ drops out during optimization, leaving just the squared errors.
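A short numerical check of these numbers, which also confirms on a grid of candidate θ values that the NLL and the sum of squared errors pick the same parameter. The grid range and resolution are arbitrary choices for this sketch.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.3, 3.8, 6.5])
sigma2 = 1.0

def gaussian_nll(theta, x, y, sigma2):
    """Negative log-likelihood of y_n = theta * x_n + Gaussian noise of variance sigma2."""
    residuals = y - theta * x
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum(residuals ** 2) / (2 * sigma2)

def sum_squared_errors(theta, x, y):
    return np.sum((y - theta * x) ** 2)

print(gaussian_nll(2.0, x, y, sigma2))   # about 2.947, matching the table

# Both criteria select the same theta on a grid of candidates.
grid = np.linspace(1.5, 2.5, 101)
best_nll = grid[np.argmin([gaussian_nll(t, x, y, sigma2) for t in grid])]
best_sse = grid[np.argmin([sum_squared_errors(t, x, y) for t in grid])]
print(best_nll, best_sse)
```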

[Interactive figure: Overfitting / Underfitting Explorer. Noisy sine data fit by a polynomial of adjustable degree (0–12, default 3) with noise σ = 0.20. Training RMSE never increases as the degree grows, while test RMSE is U-shaped; the sweet spot is where test error is minimized.]

What to notice: At degree 0 (constant), the model underfits: it can't capture the sine wave at all. At degree 3–5, the fit is good and test error is low. Beyond degree 8 or so, the polynomial wiggles wildly between data points: training error drops toward zero, but test error explodes. This is overfitting in action.
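The same experiment can be approximated offline. The sketch below is not the page's simulation: it assumes sin(πx) on [−1, 1] as the target, 20 training and 200 test points, degrees 0 to 9, and a Legendre basis (the same polynomial model class as raw powers, but numerically better conditioned).

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
noise = 0.2

def make_data(n):
    """Noisy samples of sin(pi * x) on [-1, 1] (a stand-in for the demo data)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(np.pi * x) + noise * rng.normal(size=n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def rmse(coef, x, y):
    """Root mean squared error of the Legendre series with coefficients coef."""
    return np.sqrt(np.mean((y - legendre.legval(x, coef)) ** 2))

degrees = range(0, 10)
train_rmse, test_rmse = [], []
for d in degrees:
    A = legendre.legvander(x_train, d)                  # design matrix, shape (20, d + 1)
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)  # least-squares polynomial fit
    train_rmse.append(rmse(coef, x_train, y_train))
    test_rmse.append(rmse(coef, x_test, y_test))

for d, tr, te in zip(degrees, train_rmse, test_rmse):
    print(f"degree {d}: train RMSE = {tr:.3f}, test RMSE = {te:.3f}")
```

Training RMSE cannot increase with degree because each model class contains the previous one; how test RMSE behaves at high degree depends on the particular random sample.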
Check: What does maximum likelihood with Gaussian noise reduce to?

Chapter 7: MAP & Priors

Maximum likelihood treats all parameter values as equally plausible before seeing data. But sometimes we have prior beliefs. Maybe we think parameters should be small. Maybe we have domain knowledge. Can we incorporate that?

Yes. Maximum a posteriori (MAP) estimation uses Bayes' theorem to combine the likelihood with a prior distribution on θ:

p(θ | X, y) ∝ p(y | X, θ) · p(θ)

The posterior is proportional to the likelihood times the prior. MAP finds the peak of this posterior:

$$\theta_{\text{MAP}} = \arg\max_{\theta} \left[ \log p(y \mid X, \theta) + \log p(\theta) \right]$$

Now here is the punchline. Suppose the prior is Gaussian: $p(\theta) = \mathcal{N}(0, \tau^2 I)$. Then:

$$\log p(\theta) = -\frac{1}{2\tau^2} \lVert\theta\rVert^2 + \text{const}$$

MAP with a Gaussian prior = MLE + L2 regularization. Setting $\lambda = \sigma^2 / \tau^2$ makes the two objectives identical. Regularization is not a hack: it is the mathematically principled consequence of having a prior belief that parameters should be small. The regularization strength λ is the ratio of noise variance to prior variance.

Let's trace the math explicitly. With Gaussian likelihood and Gaussian prior:

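$$\theta_{\text{MAP}} = \arg\min_{\theta} \left[ \frac{1}{2\sigma^2} \sum_n (y_n - f(x_n;\theta))^2 + \frac{1}{2\tau^2} \lVert\theta\rVert^2 \right]$$

Multiply through by 2σ² (a positive constant, so the minimizer is unchanged):

$$= \arg\min_{\theta} \left[ \sum_n (y_n - f(x_n;\theta))^2 + \frac{\sigma^2}{\tau^2} \lVert\theta\rVert^2 \right]$$

That is exactly least squares + $\lambda \lVert\theta\rVert^2$ with $\lambda = \sigma^2 / \tau^2$. If the prior variance τ² is small (we strongly believe θ ≈ 0), λ is large, which means heavy regularization. If τ² is huge (weak prior), λ ≈ 0, which means almost no regularization, and MAP reduces to MLE.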
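For a linear model f(x; θ) = θᵀx, this equivalence can be checked numerically: the ridge solution with λ = σ²/τ² should make the gradient of the negative log-posterior vanish. The data and the variance values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 40, 3
sigma2, tau2 = 0.25, 4.0            # hypothetical noise and prior variances
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + np.sqrt(sigma2) * rng.normal(size=N)

def neg_log_posterior_grad(theta):
    """Gradient of (1/(2 sigma^2)) ||y - X theta||^2 + (1/(2 tau^2)) ||theta||^2."""
    return -X.T @ (y - X @ theta) / sigma2 + theta / tau2

# Ridge solution of ||y - X theta||^2 + lam * ||theta||^2 with lam = sigma^2 / tau^2.
lam = sigma2 / tau2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# MLE (ordinary least squares) for comparison.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

print(theta_ridge, theta_mle)
print(np.abs(neg_log_posterior_grad(theta_ridge)).max())   # close to zero
```

The MAP estimate has a smaller norm than the MLE, reflecting the prior's pull toward zero.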

Different priors give different regularizers. A Laplace prior $p(\theta) \propto \exp(-\lVert\theta\rVert_1 / b)$ gives L1 regularization, the penalty behind the famous LASSO, which produces sparse solutions. The prior encodes our structural assumption about the parameters. Gaussian prior = smooth solutions. Laplace prior = sparse solutions.

| Prior | Regularizer | Effect | Name |
|---|---|---|---|
| Gaussian $\mathcal{N}(0, \tau^2 I)$ | $\lambda \lVert\theta\rVert^2$ | Small parameters | Ridge / L2 |
| Laplace(0, b) | $\lambda \lVert\theta\rVert_1$ | Sparse parameters | LASSO / L1 |
| Uniform (flat) | None | No constraint | MLE |

Check: What prior distribution on θ leads to L2 regularization?

Chapter 8: Bayesian Inference

MLE gives one answer: the single best θ. MAP gives one answer with a prior bias. But both collapse the posterior into a single point. What if we kept the entire posterior distribution?

Bayesian inference does not commit to a single θ. Instead, it computes the full posterior p(θ | X, y), which tells us how plausible each θ is given the data. Predictions average over all parameter values, weighted by their plausibility:

p(y* | x*, X, y) = ∫ p(y* | x*, θ) p(θ | X, y) dθ

This is the posterior predictive distribution. Instead of plugging in one θ̂, we integrate over all possible θ, weighting each by the posterior. If many different θ values are plausible, the predictive distribution will be wide — reflecting our uncertainty.

Bayesian averaging is the ultimate anti-overfitting weapon. MLE bets everything on one parameter setting. If that setting happened to fit noise, you're doomed. Bayesian inference hedges: it averages over all settings, so no single noisy θ dominates. In regions with little data, the posterior is spread out, and the predictive uncertainty automatically grows. In data-rich regions, the posterior concentrates, and predictions become confident.
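For a linear model with Gaussian noise and a Gaussian prior, the posterior and the posterior predictive are available in closed form, which makes the growth of uncertainty away from the data easy to see. The sketch below assumes features [1, x], synthetic training inputs in [0, 1], and hypothetical variances σ² and τ².

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 0.04, 1.0             # hypothetical noise and prior variances

def features(x):
    """Map scalar inputs to feature vectors [1, x]."""
    x = np.atleast_1d(x)
    return np.column_stack([np.ones_like(x), x])

x_train = rng.uniform(0.0, 1.0, size=15)
y_train = 0.5 + 1.5 * x_train + np.sqrt(sigma2) * rng.normal(size=15)
Phi = features(x_train)

# Posterior p(theta | X, y) = N(m, S) for prior N(0, tau^2 I) and Gaussian noise.
S = np.linalg.inv(np.eye(2) / tau2 + Phi.T @ Phi / sigma2)
m = S @ Phi.T @ y_train / sigma2

def predictive(x_star):
    """Mean and variance of the posterior predictive p(y* | x*, X, y)."""
    phi = features(x_star)
    mean = phi @ m
    var = sigma2 + np.sum((phi @ S) * phi, axis=1)   # noise + parameter uncertainty
    return mean, var

mean, var = predictive(np.array([0.5, 5.0]))
print(mean, var)
```

The predictive variance at x = 5, far outside the training range, is larger than at x = 0.5, and the posterior mean m coincides with the MAP (ridge) estimate for λ = σ²/τ².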

Point estimates (MLE, MAP):

Pick one θ. Use it for all predictions. No uncertainty about parameters. Risk of overconfidence.

Bayesian inference:

Keep the full posterior. Average predictions over all θ. Automatic uncertainty. More computation.

The challenge: computing the posterior requires the normalizing constant (the marginal likelihood or evidence):

p(y | X) = ∫ p(y | X, θ) p(θ) dθ

This integral is often intractable. But for specific choices of likelihood and prior (called conjugate pairs), it has a closed form. The Gaussian likelihood + Gaussian prior is the most important conjugate pair, and it gives us closed-form Bayesian linear regression (Chapter 9 of the book).

When computation is feasible: For linear models with Gaussian priors, everything is analytically tractable. For neural networks, we need approximations: variational inference, MCMC sampling, or Laplace approximation. The Bayesian ideal is clear; the computational challenge varies by model class.
Check: How does Bayesian prediction differ from MLE prediction?

Chapter 9: Model Selection

We have seen how to fit parameters inside a model. But how do we choose between models themselves? A linear model vs. a degree-5 polynomial vs. a degree-20 polynomial — which one is "best"?

Cross-validation is one answer. But there is a more principled Bayesian answer: the marginal likelihood (also called the model evidence):

p(y | X, M) = ∫ p(y | X, θ, M) p(θ | M) dθ

This is the probability of the observed data under model M, averaging over all possible parameter values. It automatically balances fit against complexity — no hyperparameter tuning needed.

Occam's razor, made mathematical. Simple models spread their probability mass over a small set of datasets — the ones they can explain. Complex models spread mass over many datasets. If a simple model can explain the data, it assigns higher probability to that specific dataset than a complex model does. The marginal likelihood naturally prefers the simplest model that explains the data.

To compare two models M1 and M2, compute their Bayes factor:

BF = p(y | X, M1) / p(y | X, M2)

BF > 1 favors M1. BF < 1 favors M2. It is the Bayesian equivalent of a hypothesis test, with a natural interpretation: how many times more likely is the data under M1 than under M2?
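For linear-in-parameters models with a Gaussian prior and Gaussian noise, the marginal likelihood has a closed form: integrating out θ gives y ~ N(0, σ²I + τ²ΦΦᵀ). The sketch below uses that fact to compare polynomial models on synthetic data; the data, the variances, and the helper names are assumptions for this example.

```python
import math
import numpy as np

def log_evidence(Phi, y, sigma2, tau2):
    """log p(y | X, M) for y = Phi theta + noise, theta ~ N(0, tau2 I), noise ~ N(0, sigma2 I)."""
    n = len(y)
    C = sigma2 * np.eye(n) + tau2 * Phi @ Phi.T      # covariance of y after integrating out theta
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * math.log(2 * math.pi) + logdet + y @ np.linalg.solve(C, y))

def poly_features(x, degree):
    """Design matrix with columns 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
y = 2.0 * x + 0.1 * rng.normal(size=30)              # hypothetical data from a linear model
sigma2, tau2 = 0.01, 1.0

log_ev = {d: log_evidence(poly_features(x, d), y, sigma2, tau2) for d in (0, 1, 5)}
log_bf_1_vs_0 = log_ev[1] - log_ev[0]                # log Bayes factor: linear vs. constant
print(log_ev, log_bf_1_vs_0)
```

Working with the log of the Bayes factor avoids overflow when the evidence values differ by many orders of magnitude; a positive value favors the first model.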

| Bayes factor | Evidence strength |
|---|---|
| 1 – 3 | Barely worth mentioning |
| 3 – 10 | Substantial |
| 10 – 30 | Strong |
| 30 – 100 | Very strong |
| > 100 | Decisive |

Why not just use the training loss? Training loss always prefers the most complex model (it can always fit the data better with more parameters). Cross-validation simulates generalization. The marginal likelihood does something deeper: it penalizes complexity inherently, through the integration over θ. Models with more parameters spread their prior probability mass thinner, so they need to fit the data much better to overcome the complexity penalty.
Check: Why does the marginal likelihood automatically penalize overly complex models?

Chapter 10: Summary

This chapter built the bridge from pure math to machine learning. Every concept addresses the same question from a different angle: how do we learn from data without being fooled by noise?

| Method | Core idea | Fights overfitting? |
|---|---|---|
| Empirical Risk Min. | Minimize average training loss | No |
| Regularization | Penalize large parameters | Yes (λ tuning) |
| Cross-validation | Evaluate on held-out data | Yes (choose hyperparams) |
| MLE | Maximize data likelihood | No (same as ERM) |
| MAP | MLE + prior on θ | Yes (= regularization) |
| Bayesian | Average over all θ | Yes (automatic) |
| Marginal Likelihood | Score models by evidence | Yes (Occam built in) |

The hierarchy of sophistication: ERM ⊂ MLE ⊂ MAP ⊂ Full Bayesian. Each step adds a principled mechanism to handle uncertainty. ERM minimizes loss on what you have seen. MLE gives it a probabilistic grounding. MAP adds prior knowledge. Full Bayesian refuses to commit to a single answer, integrating over all possibilities.

The deep connections we uncovered:

Gaussian noise → MLE = least squares
Gaussian prior → MAP = ridge regression
Full posterior → Bayesian inference = averaging over all θ
Integrate out θ → Marginal likelihood = automatic model selection

What comes next: Chapter 9 of the book takes these ideas and applies them to the single most important model in all of ML: linear regression. We will derive the closed-form MLE solution, add Bayesian uncertainty bands, and watch the posterior update as data arrives.

"All models are wrong, but some are useful."
— George Box
Check: What is the relationship between MAP estimation with a Gaussian prior and L2 regularization?