We have distributions. We have data. Now we ask: how do we estimate the parameters? MLE, MAP, and full Bayesian inference — three answers to one question.
In the previous chapters we learned the language of probability: distributions, Bayes' rule, the Gaussian. But we left a critical question unanswered. We know that data comes from some distribution p(y | θ). We observe the data. We do not know θ. How do we find it?
This is the central question of statistics, and Murphy's Chapter 4 gives three increasingly powerful answers.
Each approach trades off simplicity against expressiveness. MLE is the simplest and most widely used. MAP adds regularization to prevent overfitting. Full Bayesian gives the richest answer but is often computationally harder.
You flip a coin 100 times and get 73 heads. What is the probability of heads? Your intuition says 0.73. That intuition is MLE.
Maximum likelihood estimation finds the parameter θ that makes the observed data D most probable. We assume the N data points are independent and identically distributed (iid), so the likelihood factorizes:
$$p(D \mid \theta) = \prod_{n=1}^{N} p(y_n \mid \theta)$$
Taking the log converts the product into a sum, which is easier to optimize:
$$\log p(D \mid \theta) = \sum_{n=1}^{N} \log p(y_n \mid \theta)$$
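The MLE is the θ that maximizes this, or equivalently minimizes the negative log-likelihood (NLL):
$$\hat{\theta}_{\text{mle}} = \arg\min_{\theta} \text{NLL}(\theta), \qquad \text{NLL}(\theta) = -\sum_{n=1}^{N} \log p(y_n \mid \theta)$$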
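Murphy shows that MLE has an information-theoretic justification: it minimizes the KL divergence between the empirical distribution of the data, p_D, and the model q(y) = p(y | θ):
$$D_{\text{KL}}(p_D \,\|\, q) = -\mathbb{H}(p_D) - \frac{1}{N}\sum_{n=1}^{N} \log p(y_n \mid \theta) = \text{const} + \frac{1}{N}\,\text{NLL}(\theta)$$
The entropy term does not depend on θ, so minimizing the KL divergence is the same as minimizing the NLL.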
So MLE finds the model distribution closest to the data distribution, in the KL sense.
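The link between KL divergence and NLL can be checked numerically for the coin example. The sketch below is a minimal illustration for discrete data only; the function names are hypothetical and chosen for this example. It shows that the average NLL and the KL divergence differ by a quantity (the entropy of the empirical distribution) that does not change with θ.

```python
import math

def kl_empirical_to_bernoulli(n1, n0, theta):
    """KL(p_D || Ber(theta)), where p_D is the empirical distribution of the flips."""
    n = n1 + n0
    kl = 0.0
    for count, q in ((n1, theta), (n0, 1.0 - theta)):
        p = count / n
        if p > 0:  # the term 0 * log 0 is taken to be 0
            kl += p * math.log(p / q)
    return kl

def avg_nll(n1, n0, theta):
    """Average negative log-likelihood per flip under Ber(theta)."""
    n = n1 + n0
    return -(n1 * math.log(theta) + n0 * math.log(1.0 - theta)) / n

n1, n0 = 73, 27  # 73 heads in 100 flips
thetas = (0.2, 0.5, 0.73, 0.9)

# The gap avg_nll - KL equals the entropy of p_D for every theta.
gaps = [avg_nll(n1, n0, t) - kl_empirical_to_bernoulli(n1, n0, t) for t in thetas]
kl_at_mle = kl_empirical_to_bernoulli(n1, n0, n1 / (n1 + n0))

print(gaps)
print(kl_at_mle)
```

Because the gap is constant, whichever θ minimizes one quantity also minimizes the other, and the KL divergence reaches zero at θ = N1/N.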
Let us work through MLE for the two most important distributions in ML.
Bernoulli MLE. Suppose Y ~ Ber(θ) and we observe N₁ heads and N₀ tails. The NLL is:
$$\text{NLL}(\theta) = -\left[ N_1 \log \theta + N_0 \log(1 - \theta) \right]$$
Setting the derivative to zero gives:
$$\hat{\theta}_{\text{mle}} = \frac{N_1}{N_0 + N_1}$$
Just the fraction of heads. The quantities N₁ and N₀ are the sufficient statistics: they capture everything the data tells us about θ.
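As a sanity check on the Bernoulli result, the sketch below compares the closed-form estimate with a brute-force grid search over the NLL, using the 73-heads-in-100-flips example from earlier. The function names and the grid resolution are assumptions made for this illustration.

```python
import math

def bernoulli_nll(theta, n1, n0):
    """Negative log-likelihood of n1 heads and n0 tails under Ber(theta)."""
    return -(n1 * math.log(theta) + n0 * math.log(1.0 - theta))

def bernoulli_mle(n1, n0):
    """Closed-form MLE: the fraction of heads."""
    return n1 / (n1 + n0)

n1, n0 = 73, 27
theta_hat = bernoulli_mle(n1, n0)

# Brute-force check: scan candidate values in (0, 1) and keep the NLL minimizer.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = min(grid, key=lambda t: bernoulli_nll(t, n1, n0))

print(theta_hat, theta_grid)
```

Only the two counts enter the computation, which is what it means for them to be sufficient statistics.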
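Gaussian MLE. For Y ~ N(μ, σ²), the NLL is:
$$\text{NLL}(\mu, \sigma^2) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mu)^2 + \frac{N}{2} \log(2\pi\sigma^2)$$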
Setting derivatives to zero yields:
| Parameter | MLE | Meaning |
|---|---|---|
| μ̂ | (1/N) ∑ yₙ | Sample mean |
| σ̂² | (1/N) ∑ (yₙ − μ̂)² | Sample variance (biased) |
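The two estimates in the table can be computed directly. The following sketch draws synthetic data (the true parameters, sample size, and seed are arbitrary choices for illustration) and contrasts the MLE of the variance, which divides by N, with the unbiased estimate, which divides by N − 1.

```python
import random
import statistics

random.seed(0)
true_mu, true_sigma = 2.0, 1.5  # illustrative values, not from the text
y = [random.gauss(true_mu, true_sigma) for _ in range(5000)]

n = len(y)
mu_hat = sum(y) / n                               # MLE of the mean
var_hat = sum((v - mu_hat) ** 2 for v in y) / n   # MLE of the variance (divides by N)

var_unbiased = statistics.variance(y)             # divides by N - 1 instead

print(mu_hat, var_hat, var_unbiased)
```

The MLE of the variance is always slightly smaller than the unbiased estimate, by a factor of (N − 1)/N, and the difference vanishes as N grows.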
Linear regression MLE. If p(y | x, w) = N(y | wᵀx, σ²), then maximizing the log-likelihood is equivalent to minimizing the residual sum of squares (RSS):
$$\text{RSS}(w) = \sum_{n=1}^{N} (y_n - w^\top x_n)^2$$
The closed-form solution, valid when XᵀX is invertible, is the ordinary least squares (OLS) estimate:
$$\hat{w}_{\text{mle}} = (X^\top X)^{-1} X^\top y$$
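For a single feature plus an intercept, the normal equations form a 2×2 system that can be solved by hand. The sketch below does that with plain Python on synthetic data; the helper name `ols_line`, the true coefficients, and the noise level are assumptions made for this illustration.

```python
import random

def ols_line(xs, ys):
    """OLS for y ≈ w0 + w1*x by solving the normal equations (X^T X) w = X^T y."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy]; invert the 2x2 matrix.
    det = n * sxx - sx * sx
    w0 = (sxx * sy - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1

rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.3) for x in xs]  # true line: y = 1 + 2x

w0, w1 = ols_line(xs, ys)
residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]

print(w0, w1)
```

At the OLS solution the residuals are orthogonal to every column of X, so both `sum(residuals)` and the sum of `x * residual` are zero up to rounding.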
[Interactive demo: data are sampled from a Gaussian with known parameters, and the MLE estimates (teal) converge to the true values (orange dashed) as more data arrive. The NLL surface is shown alongside.]
[Interactive demo: a biased coin is flipped repeatedly, and the MLE of θ (teal) converges to the true θ (orange).]
MLE has a fatal flaw: it loves the training data too much. Consider fitting a polynomial of degree 20 to 21 data points. The MLE will perfectly interpolate every point — zero training error — but the resulting curve oscillates wildly between points. This is overfitting.
The core problem: the model has enough parameters to memorize the training data, but the empirical distribution is not the true distribution. A model that puts all of its probability mass on the training points leaves none for future data.
The solution is regularization: add a penalty that discourages overly complex models:
$$\hat{\theta} = \arg\min_{\theta} \left[ \text{NLL}(\theta) + \lambda\, C(\theta) \right]$$
where λ ≥ 0 controls the strength of regularization and C(θ) penalizes complexity.
The most common choice is ℓ2 regularization (weight decay), which penalizes large weights:
$$C(w) = \|w\|_2^2 = \sum_{d} w_d^2$$
This shrinks all weights toward zero. In linear regression, this is called ridge regression. The larger λ is, the smoother the fitted function, but too much regularization leads to underfitting.
ℓ1 regularization (lasso) uses $C(w) = \|w\|_1 = \sum_d |w_d|$. Unlike ℓ2, this drives some weights exactly to zero, performing automatic feature selection. It corresponds to a Laplace prior.
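Regularization looks ad hoc: why that particular penalty? MAP estimation reveals the principled Bayesian origin. Instead of just maximizing the likelihood, we maximize the posterior:
$$\hat{\theta}_{\text{map}} = \arg\max_{\theta} \log p(\theta \mid D) = \arg\max_{\theta} \left[ \log p(D \mid \theta) + \log p(\theta) \right]$$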
The extra term log p(θ) is the log-prior. It encodes our beliefs about θ before seeing data. If p(θ) is uniform (no preference), MAP reduces to MLE.
For linear regression with a Gaussian prior p(w) = N(0, τ²I), MAP gives:
$$\hat{w}_{\text{map}} = (X^\top X + \lambda I)^{-1} X^\top y, \qquad \lambda = \sigma^2 / \tau^2$$
This is ridge regression. For λ > 0, the λI term makes the matrix invertible even when XᵀX is singular.
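The invertibility claim can be illustrated with two perfectly collinear features. The sketch below is a toy example under stated assumptions: the helpers `solve_2x2` and `ridge_2d` are hypothetical names, the data are made up, and λ is treated as a free parameter. If the prior and noise variances were known, the usual identification would be λ = σ²/τ².

```python
def solve_2x2(a, b):
    """Solve the 2x2 system a @ w = b by Cramer's rule; raises if a is singular."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ZeroDivisionError("matrix is singular")
    w0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    w1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [w0, w1]

def ridge_2d(X, y, lam):
    """Ridge estimate (X^T X + lam*I)^{-1} X^T y for two features."""
    xtx = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(2)] for i in range(2)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(2)]
    return solve_2x2(xtx, xty)

# Two identical columns, so X^T X is singular and OLS has no unique solution.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
y = [2.0, 4.0, 6.0, 8.0]  # y = 2 * x

try:
    ridge_2d(X, y, lam=0.0)
    unregularized_ok = True
except ZeroDivisionError:
    unregularized_ok = False

w_ridge = ridge_2d(X, y, lam=0.1)
print(unregularized_ok, w_ridge)
```

With λ = 0 the solve fails, while any positive λ yields a unique answer that splits the weight evenly between the two duplicated features.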
[Interactive demo: MLE versus MAP estimates for coin flipping. Increasing the prior strength pulls the MAP estimate toward the prior mean.]
Both MLE and MAP give a single point estimate θ̂. But a point estimate throws away information about uncertainty. How confident are we? Full Bayesian inference keeps the entire posterior distribution:
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
The posterior tells us not just the best guess, but the full range of plausible parameter values and how likely each is.
The key advantage: Bayesian predictions have wider tails than plug-in predictions. Consider predicting 10 more coin flips after seeing 4 heads and 1 tail. The MLE plug-in uses Bin(10, 0.8) — a narrow spike. The Bayesian posterior predictive (the beta-binomial) is wider, acknowledging our uncertainty about θ.
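The comparison can be made concrete by computing both predictive distributions. The text does not state which prior is used, so the sketch below assumes a uniform Beta(1, 1) prior, which gives a Beta(5, 2) posterior after 4 heads and 1 tail. Function names are hypothetical.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def binom_pmf(k, m, theta):
    """Plug-in predictive: Bin(k | m, theta)."""
    return math.comb(m, k) * theta ** k * (1.0 - theta) ** (m - k)

def beta_binom_pmf(k, m, a, b):
    """Posterior predictive when theta ~ Beta(a, b): the beta-binomial."""
    return math.comb(m, k) * math.exp(log_beta(k + a, m - k + b) - log_beta(a, b))

def variance(pmf):
    mean = sum(k * p for k, p in enumerate(pmf))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

n1, n0 = 4, 1            # observed: 4 heads, 1 tail
a, b = 1 + n1, 1 + n0    # posterior Beta(5, 2) under the ASSUMED uniform prior
m = 10                   # number of future flips

plugin = [binom_pmf(k, m, n1 / (n1 + n0)) for k in range(m + 1)]
bayes = [beta_binom_pmf(k, m, a, b) for k in range(m + 1)]

print(variance(plugin), variance(bayes))
print(plugin[0], bayes[0])  # probability of seeing zero heads in 10 flips
```

The beta-binomial has a larger variance and assigns noticeably more probability to outcomes such as zero heads, which the plug-in distribution treats as nearly impossible.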
The marginal likelihood p(D) = ∫ p(D | θ) p(θ) dθ is a byproduct of Bayesian inference. It measures how well the model (as a whole) predicts the data, averaging over all parameter settings. This is used for model comparison and hyperparameter selection.
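For the coin model the integral has a closed form, which makes a small model-comparison example possible. The sketch below compares a fair-coin model (θ fixed at 0.5) with a model that places an assumed uniform Beta(1, 1) prior on θ, for a particular sequence containing 73 heads and 27 tails. The function names and the choice of models are illustrative assumptions.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marglik_beta_bernoulli(n1, n0, a, b):
    """log p(D) for one specific sequence with n1 heads and n0 tails, theta ~ Beta(a, b)."""
    return log_beta(a + n1, b + n0) - log_beta(a, b)

def log_lik_fixed(n1, n0, theta):
    """log p(D | theta) for the same sequence with theta held fixed."""
    return n1 * math.log(theta) + n0 * math.log(1.0 - theta)

n1, n0 = 73, 27
log_ev_fair = log_lik_fixed(n1, n0, 0.5)                        # M0: fair coin
log_ev_unknown = log_marglik_beta_bernoulli(n1, n0, 1.0, 1.0)   # M1: uniform prior on theta

log_bayes_factor = log_ev_unknown - log_ev_fair
print(log_ev_fair, log_ev_unknown, log_bayes_factor)
```

A positive log Bayes factor means the data are better predicted, on average over the prior, by the model with unknown θ than by the fair-coin model.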
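When we evaluate a model under squared loss, the expected test error decomposes into three terms:
$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \text{Bias}^2 + \text{Variance} + \text{Noise}$$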
| Component | Meaning | Reduced by |
|---|---|---|
| Bias² | Systematic error: how far the average prediction is from the truth | More flexible model |
| Variance | Sensitivity to training data: how much predictions change across datasets | More regularization / more data |
| Noise | Irreducible error inherent in the data | Nothing (it is aleatoric) |
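The tradeoff between the first two rows can be simulated with a deliberately simple estimator. The sketch below estimates a Gaussian mean with a shrunk estimate sum(y) / (N + λ), an illustrative stand-in for a regularized model, and measures its bias and variance over many simulated datasets. Because this is the error of a parameter estimate rather than of a prediction, the noise term does not appear. All names and settings are assumptions for this example.

```python
import random

def shrunk_mean(ys, lam):
    """Ridge-style estimate of a mean: sum(y) / (N + lam). lam = 0 gives the MLE."""
    return sum(ys) / (len(ys) + lam)

def simulate(lam, mu=1.0, sigma=2.0, n=10, trials=5000, seed=0):
    """Return (bias^2, variance, mse) of the estimator across simulated datasets."""
    rng = random.Random(seed)
    estimates = [
        shrunk_mean([rng.gauss(mu, sigma) for _ in range(n)], lam)
        for _ in range(trials)
    ]
    mean_est = sum(estimates) / trials
    bias2 = (mean_est - mu) ** 2
    var = sum((e - mean_est) ** 2 for e in estimates) / trials
    mse = sum((e - mu) ** 2 for e in estimates) / trials
    return bias2, var, mse

bias2_0, var_0, mse_0 = simulate(lam=0.0)   # no shrinkage
bias2_5, var_5, mse_5 = simulate(lam=5.0)   # strong shrinkage toward zero

print(bias2_0, var_0, mse_0)
print(bias2_5, var_5, mse_5)
```

Shrinkage raises the bias and lowers the variance, and in each case the mean squared error equals bias² plus variance.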
Murphy illustrates this with polynomial regression (his Figure 4.9): as we increase polynomial degree, training error drops monotonically, but test error is U-shaped. The minimum of the test error curve is the optimal complexity.
[Interactive demo: bias and variance as functions of model complexity (λ) and sample size (N). Low λ corresponds to a complex model, high λ to a simple one.]
Cross-validation is the practical tool for selecting the right complexity. We split the data into K folds. For each candidate hyperparameter setting, we train on K−1 folds, evaluate on the held-out fold, and rotate until every fold has been held out once. The setting with the lowest average validation error wins.
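The procedure can be sketched in a few lines for a one-parameter ridge model. Everything below is an illustrative assumption: the helper names, the synthetic data, the candidate values of λ, and the use of contiguous folds (acceptable here only because the data are generated in random order).

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Ridge estimate for y ≈ w*x (no intercept): w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_error(xs, ys, lam, k=5):
    """Average held-out mean squared error over k folds."""
    n = len(xs)
    fold_errors = []
    for held_out in kfold_indices(n, k):
        held = set(held_out)
        train_x = [xs[i] for i in range(n) if i not in held]
        train_y = [ys[i] for i in range(n) if i not in held]
        w = fit_ridge_1d(train_x, train_y, lam)
        mse = sum((ys[i] - w * xs[i]) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / k

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)

print(scores)
print(best_lam)
```

With real data the rows should be shuffled before splitting, and the final model is usually refit on all of the data with the selected setting.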
Let us put all three estimation methods side by side to understand their tradeoffs.
[Interactive demo: coin flips with unknown θ. Gray = MLE, teal = MAP with a Beta(2,2) prior, blue = posterior mean, orange = true θ. The estimates differ for small N and converge for large N.]
| Method | Formula | Pros | Cons |
|---|---|---|---|
| MLE | θ̂ = N₁/N | Simple, consistent, efficient | Overfits with small N, no uncertainty |
| MAP | θ̂ = (N₁+α−1)/(N+α+β−2) | Regularized, closed-form | Still a point estimate, prior-sensitive |
| Posterior mean | E[θ ∣ D] = (N₁+α)/(N+α+β) | Optimal under squared loss, captures uncertainty | Requires computing the posterior |
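The three formulas in the table are easy to evaluate side by side. The sketch below uses a Beta(2, 2) prior and keeps the observed fraction of heads at 0.8 while the sample size grows; the specific counts are made up for illustration.

```python
def mle(n1, n):
    """Maximum likelihood estimate: fraction of heads."""
    return n1 / n

def map_estimate(n1, n, a, b):
    """Mode of the Beta(a + n1, b + n - n1) posterior."""
    return (n1 + a - 1) / (n + a + b - 2)

def posterior_mean(n1, n, a, b):
    """Mean of the Beta(a + n1, b + n - n1) posterior."""
    return (n1 + a) / (n + a + b)

a, b = 2, 2  # Beta(2, 2) prior
for n1, n in [(4, 5), (40, 50), (400, 500)]:
    print(n, mle(n1, n), map_estimate(n1, n, a, b), posterior_mean(n1, n, a, b))
```

With 5 flips the three estimates are 0.8, about 0.714, and about 0.667. With 500 flips they agree to within about 0.003, because the data overwhelm the prior.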
MLE uses log loss: ℓ(y, θ) = −log p(y | θ). But we can replace this with any loss function to get empirical risk minimization (ERM):
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \theta)$$
The choice of loss function depends on the task:
| Loss | Formula | Used for |
|---|---|---|
| Log loss (NLL) | −log p(y ∣ x, θ) | Probabilistic classification & regression |
| 0-1 loss | I(y ≠ f(x)) | Misclassification rate |
| Hinge loss | max(0, 1 − ỹ η) | SVMs (max-margin classifiers) |
| Squared loss | (y − f(x))² | Regression |
Murphy shows (his Figure 4.2) that log loss, hinge loss, and exponential loss are all convex upper bounds on 0-1 loss. The horizontal axis is the margin ỹη — positive means correct classification. All surrogates penalize negative margins and incentivize positive ones.
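The upper-bound property can be verified numerically on a grid of margins. In the sketch below the log loss is taken in base 2 so that it equals 1 at zero margin. That scaling is an assumption of this example: with the natural log the curve passes through log 2 ≈ 0.69 at zero margin and bounds the 0-1 loss only up to a constant factor. The 0-1 loss is taken to count a zero margin as an error.

```python
import math

def zero_one(m):
    """0-1 loss as a function of the margin; a zero margin counts as an error."""
    return 1.0 if m <= 0 else 0.0

def hinge(m):
    return max(0.0, 1.0 - m)

def log_loss_base2(m):
    """Log loss in base 2, log2(1 + exp(-m)), which equals 1 at m = 0."""
    return math.log2(1.0 + math.exp(-m))

def exp_loss(m):
    return math.exp(-m)

margins = [i / 10 for i in range(-30, 31)]
surrogates = {"hinge": hinge, "log (base 2)": log_loss_base2, "exp": exp_loss}

# For every surrogate, check that it is at least the 0-1 loss at every margin.
is_upper_bound = {
    name: all(f(m) >= zero_one(m) - 1e-12 for m in margins)
    for name, f in surrogates.items()
}
print(is_upper_bound)
```

All three surrogates equal 1 at zero margin and decrease as the margin grows, which is what makes them usable stand-ins for the non-differentiable 0-1 loss.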
Statistics is the bridge between the probability foundations (Chapters 2–3) and every model in the rest of Murphy's book. Here is how this chapter connects forward:
| Concept from this chapter | Where it leads |
|---|---|
| MLE | Logistic regression (Ch 10), linear regression (Ch 11), neural net training (Ch 13) |
| MAP / regularization | Ridge & lasso (Ch 11), weight decay in DNNs (Ch 13) |
| Bayesian inference | Bayesian linear regression (Ch 11), Gaussian processes (Ch 17) |
| Bias-variance tradeoff | Model selection throughout the book |
| Empirical risk minimization | SVMs (Ch 17), decision theory (Ch 5) |
| Sufficient statistics | Exponential family (Ch 3), data compression |
| Cross-validation | Hyperparameter tuning everywhere |
"All models are wrong, but some are useful." — George E.P. Box