Murphy, Chapter 4

Statistics: Learning from Data

We have distributions. We have data. Now we ask: how do we estimate the parameters? MLE, MAP, and full Bayesian inference — three answers to one question.

Prerequisites: Chapters 2–3 (Probability). You need Bayes' rule, the Gaussian, and the Beta distribution.

Chapter 0: Why Statistics?

In the previous chapters we learned the language of probability: distributions, Bayes' rule, the Gaussian. But we left a critical question unanswered. We know that data comes from some distribution p(y | θ). We observe the data. We do not know θ. How do we find it?

This is the central question of statistics, and Murphy's Chapter 4 gives three increasingly powerful answers.

The three pillars of estimation: (1) Maximum Likelihood Estimation (MLE) finds the single θ that makes the observed data most probable. (2) MAP estimation adds a prior to regularize, finding the mode of the posterior. (3) Full Bayesian inference computes the entire posterior distribution over θ, capturing all uncertainty.

Each approach trades off simplicity against expressiveness. MLE is the simplest and most widely used. MAP adds regularization to prevent overfitting. Full Bayesian gives the richest answer but is often computationally harder.

MLE: find the θ that maximizes p(D | θ)
MAP: find the θ that maximizes p(θ | D) ∝ p(D | θ) p(θ)
Bayesian: compute the entire posterior p(θ | D)
Regularization: bias-variance, cross-validation, weight decay
What is the fundamental question that statistics answers in the context of ML?

Chapter 1: Maximum Likelihood Estimation

You flip a coin 100 times and get 73 heads. What is the probability of heads? Your intuition says 0.73. That intuition is MLE.

Maximum likelihood estimation finds the parameter θ that makes the observed data D most probable. We assume the N data points are independent and identically distributed (iid), so the likelihood factorizes:

p(D | θ) = ∏_{n=1}^{N} p(y_n | θ)

Taking the log converts the product into a sum, which is easier to optimize:

ℓ(θ) = log p(D | θ) = ∑_{n=1}^{N} log p(y_n | θ)

The MLE is the θ that maximizes this, or equivalently minimizes the negative log-likelihood (NLL):

θ̂_mle = argmin_θ NLL(θ) = argmin_θ −∑_{n=1}^{N} log p(y_n | θ)
Why log? Two reasons. First, products of many small numbers cause numerical underflow — logs turn them into sums. Second, the log is monotonic, so the maximizer does not change. Nearly all of ML optimization works with log-likelihoods.
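
The underflow argument is easy to check numerically. The following is a minimal sketch using only the Python standard library; the data (2000 observations, each assigned likelihood 0.01 by the model) are made up purely for illustration.

```python
import math

# Hypothetical setup: 2000 iid observations, each with likelihood 0.01.
probs = [0.01] * 2000

# The raw product is 1e-4000, far below the smallest positive double,
# so it underflows to exactly 0.0.
likelihood = 1.0
for p in probs:
    likelihood *= p

# The sum of logs is an ordinary, well-scaled number.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)       # 0.0
print(log_likelihood)   # roughly -9210.34
```

Once the product has collapsed to 0.0, every candidate θ looks equally bad, so there is nothing left to optimize. The log-likelihood keeps the ordering between candidates intact.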

Murphy shows that MLE has a beautiful information-theoretic justification: it minimizes the KL divergence between the empirical distribution of the data and the model:

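D_KL(p̂_D || p_θ) = const + (1/N) NLL(θ)

Here the constant is the negative entropy of the empirical distribution, which does not depend on θ, and the factor 1/N does not change the minimizer.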

So MLE finds the model distribution closest to the data distribution, in the KL sense.
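
As a quick sanity check of this identity, here is a small sketch for a discrete distribution. The observations and the model probabilities are hypothetical; the point is only that the KL divergence from the empirical distribution splits into a model-independent term plus the per-sample NLL.

```python
import math
from collections import Counter

data = ["a", "a", "b", "a", "c", "b"]       # hypothetical observations
model = {"a": 0.5, "b": 0.3, "c": 0.2}      # hypothetical model p_theta

n = len(data)
empirical = {k: c / n for k, c in Counter(data).items()}

# KL divergence from the empirical distribution to the model.
kl = sum(p * math.log(p / model[k]) for k, p in empirical.items())

# Negative entropy of the empirical distribution (does not involve the model).
neg_entropy = sum(p * math.log(p) for p in empirical.values())

# Unnormalized NLL of the data under the model.
nll = -sum(math.log(model[y]) for y in data)

print(kl, neg_entropy + nll / n)   # the two numbers agree
```

Because `neg_entropy` is fixed by the data, any change to `model` that lowers the NLL lowers the KL divergence by the same amount per sample.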

Why do we minimize the negative log-likelihood rather than maximizing the likelihood directly?

Chapter 2: MLE Examples

Let us work through MLE for the two most important distributions in ML.

Bernoulli MLE. Suppose Y ~ Ber(θ) and we observe N1 heads and N0 tails. The NLL is:

NLL(θ) = −[N1 log θ + N0 log(1 − θ)]

Setting the derivative to zero gives:

θ̂_mle = N1 / (N0 + N1)

Just the fraction of heads. The quantities N1 and N0 are the sufficient statistics — they capture everything the data tells us about θ.

Sufficient statistics: A sufficient statistic T(D) is a function of the data that contains all the information relevant to estimating θ. For the Bernoulli, T(D) = (N1, N0). For the Gaussian, T(D) = (sample mean, sample variance). You can throw away the raw data and lose nothing.
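
The Bernoulli result can be verified by brute force. This sketch (the function name `bernoulli_nll` and the grid resolution are illustrative choices) evaluates the NLL on a grid and compares the minimizer with the closed form. Note that the NLL function only ever sees the counts N1 and N0, which is the sufficient-statistics point in action.

```python
import math

def bernoulli_nll(theta, n1, n0):
    """NLL of n1 heads and n0 tails under Ber(theta); needs only the counts."""
    return -(n1 * math.log(theta) + n0 * math.log(1.0 - theta))

n1, n0 = 73, 27                       # the 100-flip example: 73 heads
theta_closed_form = n1 / (n0 + n1)

# Brute-force check over 999 candidate values strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = min(grid, key=lambda t: bernoulli_nll(t, n1, n0))

print(theta_closed_form, theta_grid)  # 0.73 0.73
```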

Gaussian MLE. For Y ~ N(μ, σ²), the NLL is:

NLL(μ, σ²) = (N/2) log(2πσ²) + (1/(2σ²)) ∑_{n=1}^{N} (y_n − μ)²

Setting derivatives to zero yields:

Parameter | MLE | Meaning
μ̂ | (1/N) ∑_n y_n | Sample mean
σ̂² | (1/N) ∑_n (y_n − μ̂)² | Sample variance (biased)
Biased variance: The MLE divides by N, not N−1, so it systematically underestimates the true variance. This is the MLE's bias. Dividing by N−1 gives the unbiased estimate. The MLE is still common in ML because it is consistent (it converges to the true value as N → ∞), and the two estimates differ only by the factor (N−1)/N, which is negligible for large N.
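
Python's standard `statistics` module exposes both variance conventions, which makes the difference easy to see. The sample below is made up for illustration.

```python
import statistics

y = [2.1, 0.4, 1.7, 3.0, 0.8]            # hypothetical sample, N = 5

mu_hat = statistics.fmean(y)             # Gaussian MLE of the mean
var_mle = statistics.pvariance(y)        # divides by N     (the MLE)
var_unbiased = statistics.variance(y)    # divides by N - 1 (unbiased)

print(mu_hat)                            # sample mean, about 1.6
print(var_mle, var_unbiased)             # the MLE is the smaller of the two
print(var_mle / var_unbiased)            # ratio is (N - 1) / N, about 0.8
```

With only five points the MLE is 20% smaller than the unbiased estimate; with N = 1000 the gap would be 0.1%.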

Linear regression MLE. If p(y | x, w) = N(y | w^T x, σ²), then maximizing the log-likelihood is equivalent to minimizing the residual sum of squares (RSS):

RSS(w) = ∑_n (y_n − w^T x_n)² = ||Xw − y||²

The closed-form solution is the ordinary least squares (OLS) estimate:

ŵ_mle = (X^T X)^{-1} X^T y
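
For a single input plus an intercept, the normal equations reduce to a 2×2 system that can be solved by hand. The sketch below (the function name `ols_line` is hypothetical, and the data are noise-free by construction) applies the OLS closed form without any linear algebra library.

```python
def ols_line(xs, ys):
    """OLS fit of y ≈ w0 + w1 * x by solving (X^T X) w = X^T y for a 2x2 system."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx               # determinant of X^T X
    w0 = (sxx * sy - sx * sxy) / det      # intercept
    w1 = (n * sxy - sx * sy) / det        # slope
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                 # generated exactly from y = 1 + 2x
print(ols_line(xs, ys))                   # (1.0, 2.0)
```

If all x values were identical, `det` would be zero and the division would fail. That is the singular X^T X case.
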
For a Bernoulli distribution, what is the MLE of θ given 80 heads and 20 tails?

Chapter 3: MLE Explorer

Watch maximum likelihood estimation in action. We sample data from a Gaussian with known parameters, then see how the MLE estimates converge as we collect more data.

[Interactive simulation: MLE Convergence. Data are sampled from a Gaussian with adjustable true parameters (shown values: μ = 1.0, σ = 1.5). The MLE estimates (teal) converge to the true values (orange dashed) as samples accumulate; the NLL surface is shown below the plot.]

[Interactive simulation: Bernoulli MLE: Coin Flipping. A biased coin is flipped (shown value: true θ = 0.70) and the MLE of θ converges. Orange = true θ, teal = MLE estimate.]
What to observe: With very few samples, the MLE can be far from the truth. As N grows, it converges. But if the first 3 flips are all heads (N1 = 3, N0 = 0), the MLE is θ = 1, which predicts that tails are impossible. This is the overfitting problem that motivates regularization.
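
The same experiment can be reproduced offline. This is a rough sketch, not the page's simulation code: the seed, the number of flips, and the checkpoints are arbitrary choices.

```python
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
true_theta = 0.7                # same true θ as the coin-flipping demo
flips = [rng.random() < true_theta for _ in range(10_000)]

# The MLE after the first n flips is the running fraction of heads.
running_mle = {n: sum(flips[:n]) / n for n in (3, 10, 100, 1_000, 10_000)}
for n, theta_hat in running_mle.items():
    print(n, theta_hat)

# The degenerate small-sample case from the text: 3 heads, 0 tails.
print(3 / (3 + 0))              # 1.0
```

The early estimates can only take a handful of values (with n = 3: 0, 1/3, 2/3, or 1), so they are necessarily coarse; the later ones settle near 0.7.
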
What happens to the MLE as we collect more data?

Chapter 4: Overfitting & Regularization

MLE has a fatal flaw: it loves the training data too much. Consider fitting a polynomial of degree 20 to 21 data points. The MLE will perfectly interpolate every point — zero training error — but the resulting curve oscillates wildly between points. This is overfitting.

The core problem: the model has enough parameters to memorize the training data, but the empirical distribution is not the true distribution. Perfect fit on training data means no probability left for future data.

Murphy's coin example: We flip a coin 3 times and get 3 heads. The MLE is θ=1. This predicts tails are impossible — clearly absurd. The model has "overfit" to the small sample.

The solution is regularization: add a penalty that discourages overly complex models:

L(θ; λ) = NLL(θ) + λ C(θ)

where λ ≥ 0 controls the strength of regularization and C(θ) penalizes complexity.

The most common choice is ℓ2 regularization (weight decay), which penalizes large weights:

L(w) = NLL(w) + λ ||w||₂²

This shrinks all weights toward zero. In linear regression, this is called ridge regression. The larger λ is, the smoother the fitted function, but too much regularization leads to underfitting.

The connection to priors: ℓ2 regularization is equivalent to MAP estimation with a zero-mean Gaussian prior: p(w) = N(w | 0, (1/(2λ)) I). The penalty term λ||w||² is just −log p(w) up to an additive constant. Regularization and Bayesian priors are two views of the same idea.

ℓ1 regularization (lasso) uses C(w) = ||w||₁ = ∑_d |w_d|. Unlike ℓ2, this drives some weights exactly to zero, performing automatic feature selection. It corresponds to a Laplace prior.

How does ℓ2 regularization prevent overfitting?

Chapter 5: MAP Estimation

Regularization looks ad hoc — why that particular penalty? MAP estimation reveals the principled Bayesian origin. Instead of just maximizing the likelihood, we maximize the posterior:

θ̂_map = argmax_θ log p(θ | D) = argmax_θ [log p(D | θ) + log p(θ)]

The extra term log p(θ) is the log-prior. It encodes our beliefs about θ before seeing data. If p(θ) is uniform (no preference), MAP reduces to MLE.

MAP for the Bernoulli (add-one smoothing): Using a Beta(2, 2) prior, the MAP estimate becomes θ̂map = (N1 + 1) / (N + 2). With 3 heads and 0 tails, this gives 4/5 = 0.8 instead of 1.0. The prior pulls us away from the extreme.
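
The add-one formula is a special case of the posterior mode under a general Beta(α, β) prior, (N1 + α − 1) / (N + α + β − 2). A minimal sketch (the function name `bernoulli_map` is illustrative):

```python
def bernoulli_map(n1, n0, alpha=2.0, beta=2.0):
    """Posterior mode of θ under a Beta(alpha, beta) prior.

    Assumes the posterior Beta(alpha + n1, beta + n0) has an interior or
    boundary mode given by this formula, which holds when both updated
    parameters are at least 1 and their sum exceeds 2.
    """
    return (n1 + alpha - 1) / (n1 + n0 + alpha + beta - 2)

print(bernoulli_map(3, 0))              # 0.8  (Beta(2, 2) prior)
print(bernoulli_map(3, 0, 1.0, 1.0))    # 1.0  (uniform prior recovers the MLE)
```

The second call illustrates the earlier remark that a uniform prior reduces MAP to MLE.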

For linear regression with noise variance σ² and a Gaussian prior p(w) = N(w | 0, τ²I), multiplying the negative log-posterior by 2σ² gives:

ŵ_map = argmin_w ||Xw − y||² + λ||w||² = (X^T X + λI)^{-1} X^T y, where λ = σ²/τ²

This is ridge regression. For any λ > 0, the λI term makes the matrix X^T X + λI invertible, even when X^T X is singular.
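
With a single feature and no intercept, the ridge formula collapses to a scalar expression, which makes the shrinkage and the invertibility claim easy to see. The sketch below uses a hypothetical helper `ridge_1d` and made-up data.

```python
def ridge_1d(xs, ys, lam):
    """Scalar case of (X^T X + λI)^{-1} X^T y for y ≈ w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))   # X^T y
    sxx = sum(x * x for x in xs)               # X^T X
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                           # generated exactly from y = 2x

for lam in (0.0, 1.0, 14.0):
    print(lam, ridge_1d(xs, ys, lam))          # 2.0, then about 1.867, then 1.0

# Degenerate design: every x is zero, so X^T X = 0 is singular.
# With lam = 0 this would divide by zero; with lam > 0 it is well defined.
print(ridge_1d([0.0, 0.0], [1.0, 2.0], 1.0))   # 0.0
```

Larger λ pulls the weight toward zero, and any positive λ rescues the singular case.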

[Interactive simulation: MAP vs MLE: Bernoulli. Compares MLE and MAP estimates for coin flipping; adjusting the prior strength shows how the MAP estimate is pulled toward the prior mean. Controls and shown values: true θ (0.70), prior α (2.0), prior β (2.0).]
What happens to the MAP estimate as we collect more data?

Chapter 6: Bayesian Statistics

Both MLE and MAP give a single point estimate θ̂. But a point estimate throws away information about uncertainty. How confident are we? Full Bayesian inference keeps the entire posterior distribution:

p(θ | D) = p(D | θ) p(θ) / p(D)

The posterior tells us not just the best guess, but the full range of plausible parameter values and how likely each is.

The posterior predictive distribution: Instead of plugging in a single θ̂, the Bayesian averages over all possible θ values, weighted by how likely each is: p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ. This is called Bayes model averaging and is the principled way to make predictions under uncertainty.

The key advantage: Bayesian predictions have wider tails than plug-in predictions. Consider predicting 10 more coin flips after seeing 4 heads and 1 tail. The MLE plug-in uses Bin(10, 0.8) — a narrow spike. The Bayesian posterior predictive (the beta-binomial) is wider, acknowledging our uncertainty about θ.

The marginal likelihood p(D) = ∫ p(D | θ) p(θ) dθ is a byproduct of Bayesian inference. It measures how well the model (as a whole) predicts the data, averaging over all parameter settings. This is used for model comparison and hyperparameter selection.

Conjugate priors make Bayesian inference tractable. When the prior and likelihood are conjugate (e.g., Beta-Bernoulli, Gaussian-Gaussian), the posterior has a closed form. We just update the hyperparameters: Beta(α, β) → Beta(α + N1, β + N0). No integrals needed.
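
The coin-flip comparison above can be computed directly. The text does not say which prior is used for the 4-heads, 1-tail example, so this sketch assumes a uniform Beta(1, 1) prior; the function names are illustrative.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, m, a, b):
    """P(k heads in m future flips) when θ ~ Beta(a, b)."""
    return math.comb(m, k) * math.exp(log_beta(k + a, m - k + b) - log_beta(a, b))

def binomial_pmf(k, m, theta):
    return math.comb(m, k) * theta**k * (1 - theta) ** (m - k)

def variance(pmf):
    mean = sum(k * p for k, p in enumerate(pmf))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

# Conjugate update under the ASSUMED uniform prior: Beta(1, 1) -> Beta(1 + 4, 1 + 1).
a_post, b_post = 1 + 4, 1 + 1
m = 10                                         # number of future flips

plug_in = [binomial_pmf(k, m, 0.8) for k in range(m + 1)]            # MLE plug-in
bayes = [beta_binomial_pmf(k, m, a_post, b_post) for k in range(m + 1)]

print(variance(plug_in))   # about 1.6
print(variance(bayes))     # about 4.34, noticeably wider
```

The posterior predictive spreads its mass over more outcomes because it accounts for not knowing θ after only five flips.
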
What is the key advantage of full Bayesian inference over MAP estimation?

Chapter 7: The Bias-Variance Tradeoff

When we evaluate a model under squared loss, the expected test error decomposes into three terms:

E[loss] = bias² + variance + irreducible noise
Component | Meaning | Reduced by
Bias² | Systematic error: how far the average prediction is from the truth | More flexible model
Variance | Sensitivity to training data: how much predictions change across datasets | More regularization / more data
Noise | Irreducible error inherent in the data | Nothing (it is aleatoric)
The tradeoff: Simple models (high bias, low variance) underfit: they miss real patterns. Complex models (low bias, high variance) overfit: they learn noise. The sweet spot minimizes total error. This is one of the most important concepts in ML.

Murphy illustrates this with polynomial regression (his Figure 4.9): as we increase polynomial degree, training error drops monotonically, but test error is U-shaped. The minimum of the test error curve is the optimal complexity.
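
A simpler setting than polynomial regression already shows the tradeoff. The sketch below is an assumption-laden toy, not Murphy's experiment: it estimates a Gaussian mean with the shrinkage estimator ∑ y_n / (N + λ) and measures bias² and variance over repeated datasets. Because the target is a parameter rather than a noisy observation, there is no irreducible-noise term here.

```python
import random

def simulate(lam, n=20, mu=1.0, sigma=1.5, trials=5_000, seed=0):
    """Monte Carlo estimate of (bias², variance, MSE) for the shrunk mean."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        estimates.append(sum(sample) / (n + lam))   # lam = 0 is the MLE
    mean_est = sum(estimates) / trials
    bias_sq = (mean_est - mu) ** 2
    var = sum((e - mean_est) ** 2 for e in estimates) / trials
    mse = sum((e - mu) ** 2 for e in estimates) / trials
    return bias_sq, var, mse

for lam in (0.0, 5.0, 20.0):
    b2, v, mse = simulate(lam)
    print(lam, round(b2, 4), round(v, 4), round(mse, 4))
```

As λ grows, the variance column falls while the bias² column rises, and the MSE column equals their sum in every row.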

[Interactive simulation: Bias-Variance Tradeoff. Adjusting model complexity (λ) and sample size (N) shows how bias and variance trade off. Low λ = complex model, high λ = simple model. Controls and shown values: regularization λ (1.0), sample size N (20).]

Cross-validation is the practical tool for selecting the right complexity. We split the data into K folds, train on K−1 folds, evaluate on the held-out fold, and repeat so that each fold is held out once. The hyperparameter setting (for example, the value of λ or the polynomial degree) with the lowest average validation error wins.
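
The K-fold procedure can be sketched in a few lines. Everything specific here is an assumption for illustration: the one-feature ridge helper `ridge_1d`, the synthetic data, the candidate λ values, and the choice to form folds by taking every K-th point (reasonable only because the synthetic data are already in random order).

```python
import random

def ridge_1d(xs, ys, lam):
    """One-feature ridge fit for y ≈ w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def kfold_cv_error(xs, ys, lam, k=5):
    """Average held-out squared error over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        val_idx = set(range(fold, n, k))            # every k-th point is held out
        train = [i for i in range(n) if i not in val_idx]
        w = ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        err = sum((ys[i] - w * xs[i]) ** 2 for i in val_idx) / len(val_idx)
        fold_errors.append(err)
    return sum(fold_errors) / k

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(30)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]      # synthetic: slope 2 plus noise

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
cv = {lam: kfold_cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(cv, key=cv.get)
print(best_lam, cv[best_lam])
```

A very large λ shrinks the slope almost to zero and underfits, which shows up as a clearly higher validation error than the small-λ candidates.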

A model has very low training error but very high test error. What is happening?

Chapter 8: MLE vs MAP vs Bayesian

Let us put all three estimation methods side by side to understand their tradeoffs.

[Interactive simulation: Three Estimators Compared. A coin with unknown θ is flipped (shown value: true θ = 0.70). Gray = MLE, teal = MAP with a Beta(2,2) prior, blue = posterior mean, orange = true θ. The estimates differ for small N and converge for large N.]

Method | Formula | Pros | Cons
MLE | θ̂ = N1/N | Simple, consistent, efficient | Overfits with small N, no uncertainty
MAP | θ̂ = (N1+α−1)/(N+α+β−2) | Regularized, closed-form | Still a point estimate, prior-sensitive
Posterior mean | E[θ|D] = (N1+α)/(N+α+β) | Optimal under squared loss, captures uncertainty | Requires computing the posterior
Key insight: As N → ∞, all three converge to the same answer. The differences matter when data is scarce. With N=3, MLE can give absurd answers. MAP is safer. The full posterior is safest, and also tells you how much to trust the estimate.
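
The three formulas can be compared side by side. A minimal sketch with a Beta(2, 2) prior as the default; the function name `coin_estimates` is illustrative.

```python
def coin_estimates(n1, n0, alpha=2.0, beta=2.0):
    """Return (MLE, MAP, posterior mean) for n1 heads and n0 tails."""
    n = n1 + n0
    mle = n1 / n
    map_est = (n1 + alpha - 1) / (n + alpha + beta - 2)
    post_mean = (n1 + alpha) / (n + alpha + beta)
    return mle, map_est, post_mean

print(coin_estimates(3, 0))       # (1.0, 0.8, 0.714...)
print(coin_estimates(700, 300))   # all three within about 0.001 of 0.7
```

With three flips the estimates disagree by almost 0.3; with a thousand flips the prior's pseudo-counts are swamped by the data.
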
When N is very large, which statement is true?

Chapter 9: Empirical Risk Minimization

MLE uses log loss: ℓ(y, θ) = −log p(y | θ). But we can replace this with any loss function to get empirical risk minimization (ERM):

L(θ) = (1/N) ∑_{n=1}^{N} ℓ(y_n, θ; x_n)

The choice of loss function depends on the task:

Loss | Formula | Used for
Log loss (NLL) | −log p(y|x, θ) | Probabilistic classification & regression
0-1 loss | I(y ≠ f(x)) | Misclassification rate
Hinge loss | max(0, 1 − ỹη) | SVMs (max-margin classifiers)
Squared loss | (y − f(x))² | Regression
Surrogate losses: The 0-1 loss is what we actually care about for classification, but it is non-differentiable and NP-hard to optimize in general. So we use surrogate losses that are convex upper bounds on it. Log loss (smooth) and hinge loss (piecewise linear) are both surrogates for 0-1 loss. Because each surrogate upper-bounds the 0-1 loss, driving it down also tends to reduce the 0-1 loss.

Murphy shows (his Figure 4.2) that log loss, hinge loss, and exponential loss are all convex upper bounds on 0-1 loss. The horizontal axis is the margin ỹη — positive means correct classification. All surrogates penalize negative margins and incentivize positive ones.
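
The upper-bound property can be checked numerically as a function of the margin m = ỹη. In this sketch the log loss is taken in base 2, an assumption made so that it equals 1 at zero margin; with the natural log it would be ln 2 ≈ 0.69 there and would not bound the 0-1 loss. A margin of exactly zero is counted as an error, which is also a convention chosen here.

```python
import math

def zero_one(m):
    return 1.0 if m <= 0 else 0.0             # zero margin counted as an error

def hinge(m):
    return max(0.0, 1.0 - m)

def log_loss(m):
    return math.log2(1.0 + math.exp(-m))      # base 2, so log_loss(0) = 1

def exp_loss(m):
    return math.exp(-m)

margins = [i / 10 for i in range(-30, 31)]    # margins from -3.0 to 3.0
for m in (-2.0, 0.0, 2.0):
    print(m, zero_one(m), hinge(m), round(log_loss(m), 3), round(exp_loss(m), 3))

# Every surrogate sits on or above the 0-1 loss at every margin on the grid.
bounds_hold = all(
    min(hinge(m), log_loss(m), exp_loss(m)) >= zero_one(m) for m in margins
)
print(bounds_hold)                            # True
```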

Why do we use surrogate losses instead of directly minimizing 0-1 loss?

Chapter 10: Connections

Statistics is the bridge between the probability foundations (Chapters 2–3) and every model in the rest of Murphy's book. Here is how this chapter connects forward:

Concept from this chapter | Where it leads
MLE | Logistic regression (Ch 10), linear regression (Ch 11), neural net training (Ch 13)
MAP / regularization | Ridge & lasso (Ch 11), weight decay in DNNs (Ch 13)
Bayesian inference | Bayesian linear regression (Ch 11), Gaussian processes (Ch 17)
Bias-variance tradeoff | Model selection throughout the book
Empirical risk minimization | SVMs (Ch 17), decision theory (Ch 5)
Sufficient statistics | Exponential family (Ch 3), data compression
Cross-validation | Hyperparameter tuning everywhere
What we covered: MLE as the workhorse of parameter estimation, its KL-divergence justification, Bernoulli and Gaussian MLE examples, overfitting and regularization, MAP as Bayesian regularization, full Bayesian inference with posterior predictives, the bias-variance tradeoff, and empirical risk minimization with surrogate losses.
What comes next: Chapters 10–11 take these estimation tools and apply them to the two most important linear models: logistic regression for classification and linear regression for prediction. These are the workhorses of practical ML.

"All models are wrong, but some are useful." — George E.P. Box

What is the relationship between ℓ2 regularization and MAP estimation?