Piech, Chapter 14

Parameter Estimation

Learning the numbers that define a distribution — directly from data.

Prerequisites: Chapter 13 (Sampling), basic calculus (derivatives, log rules), Bayes' theorem.
10 chapters · 4 simulations · 10 quizzes

Chapter 0: Why Estimation?

So far we have worked with random variables where someone handed us the parameters. "This coin has probability p = 0.6 of heads." "The arrival rate is λ = 3 per hour." But in the real world, nobody hands you the parameters. You observe data and have to figure them out.

Imagine you are building a spam filter. You need the probability that a spam email contains the word "lottery." You don't know that number. But you have 10,000 labeled emails. Can you estimate p from the data? That is the parameter estimation problem.

This chapter covers the two dominant approaches: Maximum Likelihood Estimation (MLE), which picks the parameters that make the observed data most probable, and Maximum A Posteriori (MAP), which folds in a prior belief about what the parameters should be.

The core idea: Given a model (Bernoulli, Normal, Poisson, etc.) and observed data, parameter estimation finds the values of the model's parameters that best explain the data. MLE asks "what parameters make this data most likely?" MAP asks "what parameters are most likely given this data and my prior beliefs?"

| Distribution | Parameters θ | Example |
| --- | --- | --- |
| Bernoulli(p) | θ = p | Coin flip probability |
| Poisson(λ) | θ = λ | Email arrival rate |
| Uniform(a, b) | θ = [a, b] | Random number range |
| Normal(μ, σ²) | θ = [μ, σ²] | Test score distribution |

Almost all modern machine learning follows this two-step recipe: (1) specify a probabilistic model with parameters, (2) learn the parameter values from data. Parameter estimation is the foundation of that second step.

Both MLE and MAP assume the data are independent and identically distributed (IID) samples: X1, X2, …, Xn. Each data point was generated by the same underlying process, and observing one tells you nothing extra about the others (beyond what the shared parameters already tell you).

Why two methods? MLE is simpler and purely data-driven — no assumptions beyond the model. MAP lets you incorporate prior knowledge ("I believe p is probably near 0.5") which helps when data is scarce. With lots of data, MLE and MAP converge to the same answer. With little data, the prior can prevent wild estimates.
Check: What does parameter estimation require that earlier chapters did not?

Chapter 1: The Likelihood Function

You flip a coin 10 times and get: H, H, T, H, T, H, H, H, T, H. Seven heads, three tails. If p = 0.5, how probable is this exact sequence? If p = 0.7, is the sequence more or less probable? The function that answers this question — "how probable is my data for a given parameter value?" — is called the likelihood function.

We use the notation f(X = x | Θ = θ) for the shared PMF (discrete) or PDF (continuous) of each data point. The conditioning on θ reminds us that different parameter values produce different probabilities for the same data.

Since the data are IID, the likelihood of all n data points is the product of the individual likelihoods:

$$L(\theta) = \prod_{i=1}^{n} f(X_i = x_i \mid \Theta = \theta)$$

This is a function of θ, not of the data. The data are fixed (you already observed them). You are sweeping over different possible parameter values and asking: "which θ makes this particular dataset most probable?"

Likelihood vs. Probability: For discrete distributions, likelihood is the joint PMF of your data. For continuous distributions, likelihood is the joint PDF. The word "likelihood" is used instead of "probability" because we are treating θ as the variable and the data as fixed — the reverse of the usual setup.

Concrete example: You flip a coin 5 times and observe $x_1=1, x_2=0, x_3=1, x_4=1, x_5=0$ (3 heads, 2 tails). Each flip is Bernoulli(p), so:

$$L(p) = p^1(1-p)^0 \cdot p^0(1-p)^1 \cdot p^1(1-p)^0 \cdot p^1(1-p)^0 \cdot p^0(1-p)^1 = p^3(1-p)^2$$

Let's evaluate this at a few values of p:

| p | $L(p) = p^3(1-p)^2$ | Calculation |
| --- | --- | --- |
| 0.3 | 0.01323 | 0.027 × 0.49 = 0.01323 |
| 0.5 | 0.03125 | 0.125 × 0.25 = 0.03125 |
| 0.6 | 0.03456 | 0.216 × 0.16 = 0.03456 |
| 0.7 | 0.03087 | 0.343 × 0.09 = 0.03087 |
| 0.9 | 0.00729 | 0.729 × 0.01 = 0.00729 |

The likelihood peaks around p = 0.6. That makes intuitive sense — we saw 3 heads in 5 flips, and 3/5 = 0.6. The Maximum Likelihood Estimate is the value of p that sits at the top of this curve.
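
The table above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original chapter: the function name `bernoulli_likelihood` and the 0.01-step grid are illustrative choices, and a grid search only finds the best value among the grid points.

```python
def bernoulli_likelihood(p, flips):
    """Likelihood L(p) of a sequence of 0/1 outcomes under Bernoulli(p)."""
    result = 1.0
    for x in flips:
        result *= p if x == 1 else (1.0 - p)
    return result


flips = [1, 0, 1, 1, 0]  # the 5-flip example: 3 heads, 2 tails

# Evaluate L(p) at the same values of p used in the table.
for p in (0.3, 0.5, 0.6, 0.7, 0.9):
    print(f"p = {p:.1f}  L(p) = {bernoulli_likelihood(p, flips):.5f}")

# Sweep a grid over (0, 1) and keep the grid point with the largest likelihood.
grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
best_p = max(grid, key=lambda p: bernoulli_likelihood(p, flips))
print("grid maximizer:", best_p)
```

The data stay fixed while `p` is swept, which is exactly the "function of θ, not of the data" view of the likelihood.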

[Interactive: Likelihood Function Visualizer. Flip the coin to accumulate data. The curve shows $L(p) = p^{\text{heads}}(1-p)^{\text{tails}}$ as a function of p (default true p = 0.6). Watch the peak sharpen as you add more data.]

What to notice: With just a few flips, the likelihood curve is broad — many values of p are plausible. As you accumulate data, the curve narrows dramatically. With 50+ flips, it becomes a sharp spike centered near the true p. More data means more certainty.
Check: If you observed 4 heads and 1 tail, which value of p gives the highest likelihood?

Chapter 2: Log-Likelihood

The likelihood function is a product of many small numbers. With n = 1000 data points, you are multiplying 1000 probabilities together. The result is astronomically tiny — so small that computers lose precision (underflow). We need a trick.

The trick is the logarithm. Since log is a monotonically increasing function, the value of θ that maximizes L(θ) also maximizes log L(θ). But log turns products into sums, which are numerically stable and easier to differentiate.

$$LL(\theta) = \log L(\theta) = \log \prod_{i=1}^{n} f(X_i \mid \theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$$

Key insight: $\arg\max_\theta L(\theta) = \arg\max_\theta LL(\theta)$. We can optimize the log-likelihood instead of the likelihood because log preserves the location of the maximum. Products become sums, exponents become multipliers, and the calculus gets dramatically simpler.

Worked example: Return to our 5-flip example (3 heads, 2 tails). The log-likelihood is:

LL(p) = 3 log(p) + 2 log(1 − p)

Let's evaluate at the same values of p:

| p | LL(p) = 3 log(p) + 2 log(1−p), natural log | Calculation |
| --- | --- | --- |
| 0.3 | −4.33 | 3(−1.204) + 2(−0.357) = −4.33 |
| 0.5 | −3.47 | 3(−0.693) + 2(−0.693) = −3.47 |
| 0.6 | −3.37 | 3(−0.511) + 2(−0.916) = −3.37 |
| 0.7 | −3.48 | 3(−0.357) + 2(−1.204) = −3.48 |
| 0.9 | −4.92 | 3(−0.105) + 2(−2.303) = −4.92 |

The maximum of LL(p) is at p = 0.6, same as the maximum of L(p). But now the numbers are manageable — small negative values instead of tiny decimals like 0.03456.
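
To see the underflow problem concretely, here is a small sketch. The dataset (1400 heads and 600 tails, evaluated at p = 0.7) is hypothetical and chosen only because it is large enough for the raw product to fall below the smallest positive double-precision number.

```python
import math

p = 0.7
flips = [1] * 1400 + [0] * 600  # hypothetical data: 1400 heads, 600 tails

likelihood = 1.0       # running product of probabilities
log_likelihood = 0.0   # running sum of log-probabilities
for x in flips:
    prob = p if x == 1 else 1.0 - p
    likelihood *= prob
    log_likelihood += math.log(prob)

print("product of probabilities:", likelihood)   # underflows to 0.0
print("sum of log-probabilities:", log_likelihood)  # stays finite
```

Once the product has collapsed to zero, every candidate p looks equally bad, so the maximum can no longer be located. The sum of logs keeps the comparison meaningful.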

The MLE recipe is now clear. Write the log-likelihood, take the derivative with respect to θ, set it to zero, and solve:

Step 1: Write the log-likelihood $LL(\theta) = \sum_i \log f(x_i \mid \theta)$.
Step 2: Differentiate with respect to θ and set the derivative to zero: $dLL/d\theta = 0$.
Step 3: Solve for θ̂ (the MLE estimate).

Why sums beat products: Beyond numerical stability, sums are easier to differentiate. The derivative of a sum is the sum of the derivatives, whereas differentiating a product of n factors requires the product rule and yields n terms, each still a product of n − 1 factors. Taking the log turns an unwieldy calculus problem into a simple one.
Check: Why do we maximize log-likelihood instead of likelihood?

Chapter 3: MLE for Bernoulli

Let's apply the full MLE recipe to the Bernoulli distribution. You have n coin flips: X1, X2, …, Xn, each drawn from Bernoulli(p). You want to find the value of p that maximizes the likelihood.

The PMF of a single Bernoulli is $f(x) = p^x(1-p)^{1-x}$. When x = 1, this gives p. When x = 0, this gives 1 − p. Exactly what we want, in one compact formula.

Step 1: Log-likelihood. Let $Y = \sum_{i=1}^{n} x_i$ be the total number of heads.

$$LL(p) = \sum_{i=1}^{n} \left[x_i \log p + (1 - x_i)\log(1-p)\right] = Y \log p + (n - Y)\log(1-p)$$

Step 2: Differentiate and set to zero.

$$\frac{dLL}{dp} = \frac{Y}{p} - \frac{n - Y}{1 - p} = 0$$

Step 3: Solve for p.

$$Y(1 - p) = (n - Y)p \;\Rightarrow\; Y - Yp = np - Yp \;\Rightarrow\; Y = np \;\Rightarrow\; \hat{p} = \frac{Y}{n}$$
The punchline: The MLE estimate for a Bernoulli parameter is the sample mean — the number of successes divided by the total number of trials. All that calculus to derive something that feels obvious! But now we have a principled reason for it: p̂ = Y/n is the value that makes the observed data maximally likely.

Numerical example: You flip a coin 20 times and observe 13 heads (Y = 13, n = 20).

| Step | Calculation | Result |
| --- | --- | --- |
| Log-likelihood at p | 13 log(p) + 7 log(1−p) | Function of p |
| Derivative = 0 | 13/p − 7/(1−p) = 0 | Equation in p |
| Solve | 13(1−p) = 7p ⇒ 13 = 20p | p̂ = 13/20 = 0.65 |
| Verify LL | 13 ln(0.65) + 7 ln(0.35) | −12.95 (max) |

Check: at p = 0.5, LL = 13 ln(0.5) + 7 ln(0.5) = 20 ln(0.5) = −13.86, which is less than −12.95. Our MLE is indeed better.
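
The same check can be run numerically. The sketch below uses hypothetical helper names (`bernoulli_mle`, `bernoulli_log_likelihood`) and assumes the data are summarized by the head count and the number of flips.

```python
import math


def bernoulli_log_likelihood(p, heads, n):
    """LL(p) = Y log p + (n - Y) log(1 - p), using the natural log."""
    return heads * math.log(p) + (n - heads) * math.log(1.0 - p)


def bernoulli_mle(heads, n):
    """Closed-form MLE p_hat = Y / n; undefined when there is no data."""
    if n == 0:
        raise ValueError("MLE is undefined with zero observations")
    return heads / n


heads, n = 13, 20
p_hat = bernoulli_mle(heads, n)

# The derivative of LL should vanish (up to rounding) at the MLE.
score = heads / p_hat - (n - heads) / (1.0 - p_hat)

ll_at_mle = bernoulli_log_likelihood(p_hat, heads, n)
ll_at_fair = bernoulli_log_likelihood(0.5, heads, n)

print("p_hat       =", p_hat)
print("dLL/dp      =", score)
print("LL(p_hat)   =", round(ll_at_mle, 2))
print("LL(0.5)     =", round(ll_at_fair, 2))
```

Raising an error for n = 0 is one possible design choice; it mirrors the fact that 0/0 has no meaningful value as an estimate.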

A subtlety: With n = 0 observations, MLE is undefined (0/0). With n = 1 and one head, MLE says p̂ = 1 — the coin always lands heads! That seems extreme. This sensitivity to small samples is a weakness of MLE that MAP will address later.
Check: You flip a coin 50 times and see 35 heads. What is the MLE for p?

Chapter 4: MLE for the Normal

Now let's tackle a distribution with two parameters. You have n samples X1, …, Xn from a Normal(μ, σ²). You want to estimate both μ and σ² simultaneously.

The PDF of a single Normal observation is:

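$$f(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Step 1: Log-likelihood.

$$LL(\mu, \sigma^2) = \sum_{i=1}^{n} \left[-\log\sqrt{2\pi\sigma^2} - \frac{(x_i - \mu)^2}{2\sigma^2}\right] = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

Step 2a: Differentiate with respect to μ.

$$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \;\Rightarrow\; \sum_{i=1}^{n} x_i - n\mu = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Step 2b: Differentiate with respect to σ².

$$\frac{\partial LL}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$$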
The punchline: The MLE for the Normal mean is the sample mean, and the MLE for the variance is the average squared deviation from the sample mean. Again, the "obvious" answers, but now rigorously derived.

Numerical example: You measure 5 test scores: x = {72, 85, 90, 78, 95}.

| Step | Calculation | Result |
| --- | --- | --- |
| μ̂ | (72 + 85 + 90 + 78 + 95) / 5 = 420 / 5 | 84.0 |
| Deviations | (72−84)²=144, (85−84)²=1, (90−84)²=36, (78−84)²=36, (95−84)²=121 | Sum = 338 |
| σ̂² | 338 / 5 | 67.6 |
| σ̂ | √67.6 | 8.22 |

Bias note: The MLE variance divides by n, not by (n−1). Dividing by n gives a biased estimator — it slightly underestimates the true variance on average. The "unbiased" version divides by (n−1), which is what you see in most statistics packages as the "sample variance." For large n the difference is negligible.
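
The divide-by-n versus divide-by-(n−1) distinction shows up directly in Python's standard library: `statistics.pvariance` divides by n (the MLE form), while `statistics.variance` divides by n − 1. A short sketch on the five test scores, with variable names chosen for illustration:

```python
import math
import statistics

scores = [72, 85, 90, 78, 95]

mu_hat = statistics.fmean(scores)            # sample mean = MLE for mu
var_mle = statistics.pvariance(scores)       # divides by n     (MLE, biased)
var_unbiased = statistics.variance(scores)   # divides by n - 1 (unbiased)
sigma_mle = math.sqrt(var_mle)

print("mu_hat            =", mu_hat)
print("sigma^2 (MLE, /n) =", var_mle)
print("sigma^2 (/(n-1))  =", var_unbiased)
print("sigma (MLE)       =", round(sigma_mle, 2))
```

With only five data points the two variance estimates differ noticeably; the gap shrinks by the factor (n − 1)/n as n grows.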

[Interactive: Normal MLE, watching μ̂ and σ̂² converge. Samples are drawn from Normal(μ=50, σ²=100). Click to add data and watch the MLE estimates converge to the true values.]

Check: You observe samples {10, 20, 30}. What is the MLE for μ?

Chapter 5: MAP Estimation

MLE has a blind spot. If you flip a coin once and get heads, MLE says p̂ = 1. The coin always lands heads? That seems absurd. You know, from years of experience, that most coins are roughly fair. Shouldn't that knowledge count for something?

Maximum A Posteriori (MAP) estimation lets you incorporate a prior belief about the parameters. Instead of asking "what θ makes the data most likely?", MAP asks "what θ is most likely given both the data and my prior?"

Formally, MAP applies Bayes' theorem:

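$$\theta_{\text{MAP}} = \arg\max_\theta f(\Theta = \theta \mid X_1 = x_1, \dots, X_n = x_n)$$

Expanding with Bayes:

$$\theta_{\text{MAP}} = \arg\max_\theta \frac{f(X_1, \dots, X_n \mid \theta)\, f(\theta)}{f(X_1, \dots, X_n)}$$

The denominator does not depend on θ, so we can drop it. And since the data are IID:

$$\theta_{\text{MAP}} = \arg\max_\theta f(\theta) \prod_{i=1}^{n} f(x_i \mid \theta)$$

Taking the log (same trick as MLE):

$$\theta_{\text{MAP}} = \arg\max_\theta \left[\log f(\theta) + \sum_{i=1}^{n} \log f(x_i \mid \theta)\right]$$
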
MAP = MLE + prior. Compare the MAP formula to MLE. They are identical except for the additional log f(θ) term — the log of the prior. The prior pulls the estimate toward values of θ that you believe are likely before seeing data. The likelihood pulls the estimate toward values that explain the data. The MAP estimate balances the two.

Numerical example: Bernoulli with Beta prior. We use a Beta(a, b) prior for p, which has PDF $f(p) \propto p^{a-1}(1-p)^{b-1}$. Take a = b = 5 (prior centered at 0.5, moderate confidence) and data of 3 heads, 2 tails (Y = 3, n = 5). Up to an additive constant that does not depend on p, the log-posterior is:

$$\log f(p \mid \text{data}) = (Y + a - 1)\log p + (n - Y + b - 1)\log(1-p) + \text{const}$$
$$= (3 + 5 - 1)\log p + (2 + 5 - 1)\log(1-p) + \text{const}$$
$$= 7\log p + 6\log(1-p) + \text{const}$$

Differentiating and setting to zero: $7/p - 6/(1-p) = 0$, giving $p_{\text{MAP}} = 7/13 \approx 0.538$.

Compare: MLE gave p̂ = 3/5 = 0.6. The prior (centered at 0.5) pulled the MAP estimate toward 0.5. The general formula for Bernoulli MAP with Beta(a,b) prior is:

$$p_{\text{MAP}} = \frac{Y + a - 1}{n + a + b - 2}$$
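
The closed-form MAP estimate is a one-liner. The sketch below uses hypothetical function names and assumes the posterior mode is interior (Y + a > 1 and n − Y + b > 1), which holds for the example in the text.

```python
def bernoulli_mle(heads, n):
    """MLE for Bernoulli p: fraction of heads."""
    return heads / n


def bernoulli_map(heads, n, a, b):
    """MAP for Bernoulli p under a Beta(a, b) prior.

    Assumes heads + a > 1 and n - heads + b > 1 so the posterior
    mode lies strictly inside (0, 1).
    """
    return (heads + a - 1) / (n + a + b - 2)


heads, n = 3, 5
print("MLE:", bernoulli_mle(heads, n))
print("MAP with Beta(5, 5):", round(bernoulli_map(heads, n, a=5, b=5), 3))
print("MAP with Beta(1, 1):", bernoulli_map(heads, n, a=1, b=1))
```

Passing a = b = 1 is a quick sanity check: the prior contributes zero imaginary flips, so the function returns the same value as the MLE.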
Conjugate priors: When the prior and likelihood combine to produce a posterior of the same family as the prior, we call the prior "conjugate." Beta is conjugate to Bernoulli/Binomial. Normal is conjugate to Normal (for μ). Gamma is conjugate to Poisson. Conjugate priors make the MAP math clean and closed-form.

| Parameter | Conjugate Prior | Intuition |
| --- | --- | --- |
| Bernoulli p | Beta(a, b) | a−1 imaginary heads, b−1 imaginary tails |
| Poisson λ | Gamma(k, θ) | k imaginary events in θ time periods (reading θ as the rate parameter of the Gamma) |
| Normal μ | Normal(μ0, σ0²) | Prior guess of the mean with some uncertainty |
Check: How does the MAP formula differ from MLE?

Chapter 6: The Prior's Influence

How much does the prior actually matter? It depends on two things: the strength of the prior (how confident you are in your initial belief) and the amount of data (how much evidence you have).

Consider estimating a Bernoulli p with a Beta(a, b) prior. The MAP formula is:

$$p_{\text{MAP}} = \frac{Y + a - 1}{n + a + b - 2}$$

Think of a−1 as "imaginary heads" and b−1 as "imaginary tails" from your prior experience. The MAP estimate averages your real data with this imaginary data.

Numerical comparison: True p = 0.7. Prior: Beta(3, 3) centered at 0.5. Observe different amounts of data:

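| n | Y (heads) | MLE = Y/n | MAP = (Y+2)/(n+4) | MAP pulled toward |
| --- | --- | --- | --- | --- |
| 2 | 2 | 1.000 | 4/6 = 0.667 | 0.5 (strong pull) |
| 5 | 4 | 0.800 | 6/9 = 0.667 | 0.5 (moderate pull) |
| 20 | 14 | 0.700 | 16/24 = 0.667 | 0.5 (mild pull) |
| 100 | 70 | 0.700 | 72/104 = 0.692 | 0.5 (tiny pull) |
| 1000 | 700 | 0.700 | 702/1004 = 0.699 | 0.5 (negligible) |
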
The data always wins eventually. With small n, the prior has a big effect — it prevents extreme estimates like p̂ = 1.0. With large n, the data overwhelm the prior and MAP converges to MLE. The prior contributes "a + b − 2 imaginary observations" while the data contributes n real observations. When n ≫ a + b, the data dominate.
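
The comparison can be regenerated programmatically, which makes it easy to try other priors. This is an illustrative sketch: the (n, Y) pairs are the ones used in the text, and the variable names are assumptions.

```python
a = b = 3  # Beta(3, 3) prior: 2 imaginary heads, 2 imaginary tails
observations = [(2, 2), (5, 4), (20, 14), (100, 70), (1000, 700)]  # (n, Y)

table = []
for n, heads in observations:
    mle = heads / n
    map_est = (heads + a - 1) / (n + a + b - 2)
    table.append((n, heads, mle, map_est))
    print(f"n={n:5d}  Y={heads:4d}  MLE={mle:.3f}  MAP={map_est:.3f}  "
          f"gap={abs(mle - map_est):.4f}")
```

The `gap` column makes the "data always wins" claim visible: the distance between MLE and MAP shrinks steadily as n grows while the prior stays fixed.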

Stronger priors: Increasing a and b makes the prior more confident. Beta(2, 2) is a gentle nudge toward 0.5. Beta(50, 50) is a strong conviction that p ≈ 0.5 — you would need a lot of data to override it. The hyperparameters control how stubborn your prior is.

Flat prior recovers MLE: When a = b = 1, the Beta prior is Uniform(0,1), so every value of p is equally likely a priori. Then $p_{\text{MAP}} = (Y + 0)/(n + 0) = Y/n = p_{\text{MLE}}$. A flat prior means "I have no opinion," and MAP reduces exactly to MLE.

Think of it this way: MLE is a special case of MAP with a uniform (flat) prior. MAP is the more general framework. MLE says "let the data speak for themselves." MAP says "let the data speak, but tempered by what I already believe."
Check: With a Beta(1,1) prior (uniform), what does MAP reduce to?

Chapter 7: Showcase — MLE vs MAP

Time to see MLE and MAP side by side. This interactive simulation lets you generate data from a Bernoulli distribution with a true (hidden) parameter p, then watch how MLE and MAP estimates evolve as more data arrives.

The MAP estimate uses a Beta(a, b) prior. Adjust the prior strength slider to control a and b (they are set equal, centering the prior at 0.5). Adjust data size to control how many flips per batch. Watch the key dynamic: with few data points, MAP is more conservative (closer to 0.5). As data accumulates, both estimates converge to the truth.

[Interactive: MLE vs MAP live comparison. Orange curve = MLE estimate over time; teal curve = MAP estimate; dashed line = true p. Defaults: true p = 0.70, prior strength a = b = 5. Watch the estimates converge as n grows.]

Things to try: (1) Set prior strength to 1 (flat prior) and verify that MAP = MLE at every step. (2) Set prior strength to 20 and true p to 0.9 — watch how long it takes MAP to "overcome" the strong prior. (3) Compare behavior at n=5 vs n=100. The simulation reveals the fundamental tradeoff: priors help with small samples but become irrelevant with large ones.
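
For readers without access to the interactive version, the experiment can be approximated offline. The sketch below is not the simulation's actual code; the function `simulate`, its parameters, and the fixed seed are assumptions made for reproducibility.

```python
import random


def simulate(true_p, prior_strength, n_flips, seed=0):
    """Flip a Bernoulli(true_p) coin n_flips times and track both estimates.

    Uses a symmetric Beta(prior_strength, prior_strength) prior.
    Returns a list of (n, mle, map_estimate) tuples, one per flip.
    """
    rng = random.Random(seed)
    a = b = prior_strength
    heads = 0
    history = []
    for n in range(1, n_flips + 1):
        if rng.random() < true_p:
            heads += 1
        mle = heads / n
        map_est = (heads + a - 1) / (n + a + b - 2)
        history.append((n, mle, map_est))
    return history


# Experiment (1): a flat prior should reproduce the MLE at every step.
flat = simulate(true_p=0.7, prior_strength=1, n_flips=200)

# Experiment (2): a strong prior against a lopsided coin.
strong = simulate(true_p=0.9, prior_strength=20, n_flips=200)
for n, mle, map_est in strong:
    if n in (5, 50, 200):
        print(f"n={n:3d}  MLE={mle:.3f}  MAP={map_est:.3f}")
```

Because the MAP estimate is a weighted average of the MLE and the prior center 0.5, it always lies between the two; the weight on the MLE is n / (n + a + b − 2).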

[Interactive: Prior Shape Visualizer. Adjust a and b to see flat, peaked, and skewed Beta(a, b) priors (default Beta(5, 5)). The prior is the dashed curve; the posterior (prior × likelihood) is solid.]

When does the prior help? The prior is most valuable when data is scarce and the prior is reasonable. A Beta(5,5) prior saves you from extreme estimates like p̂ = 0 or p̂ = 1 after just a few flips. But if your prior is wrong (e.g., Beta(50,50) when the true p = 0.9), it can slow down convergence significantly.

The extreme case: With n = 1 and one head, MLE gives p̂ = 1.0. MAP with Beta(2, 2) gives pMAP = (1+1)/(1+2) = 2/3 ≈ 0.667. With Beta(10, 10) the MAP gives (1+9)/(1+18) = 10/19 ≈ 0.526. The stronger the prior, the more it anchors the estimate against the single data point.

The large-data case: With n = 10,000 and 7,000 heads, MLE gives 0.700. MAP with Beta(10, 10) gives 7009/10018 ≈ 0.6996. With 10,000 observations, the prior's 18 imaginary observations contribute less than 0.2% of the total evidence. The data overwhelm any reasonable prior.

Check: As n → ∞, what happens to the MAP estimate relative to MLE?

Chapter 8: Worked Problems

Problem 1: Poisson MLE. A server receives requests. You record the number of requests per minute for 6 minutes: {3, 7, 5, 4, 6, 5}. Assuming a Poisson(λ) model, find the MLE for λ.

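Solution: The Poisson PMF is $f(x \mid \lambda) = e^{-\lambda}\lambda^{x}/x!$. The log-likelihood is:

$$LL(\lambda) = \sum_{i=1}^{n} \left[-\lambda + x_i \log\lambda - \log(x_i!)\right] = -n\lambda + \left(\sum_{i=1}^{n} x_i\right)\log\lambda - \sum_{i=1}^{n}\log(x_i!)$$

Differentiate: $\frac{dLL}{d\lambda} = -n + \frac{1}{\lambda}\sum_{i=1}^{n} x_i = 0$
Solve: $\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i = (3+7+5+4+6+5)/6 = 30/6 = 5.0$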

The MLE for the Poisson rate is the sample mean. (This is a general pattern: MLE for "natural" parameters often reduces to simple sample statistics.)
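
A quick numerical sanity check of Problem 1, as a sketch with illustrative names. It uses `math.lgamma(x + 1)` to compute log(x!) without forming large factorials.

```python
import math


def poisson_log_likelihood(lam, counts):
    """LL(lambda) = sum over data of [-lambda + x log(lambda) - log(x!)]."""
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for x in counts)


counts = [3, 7, 5, 4, 6, 5]  # requests per minute from Problem 1
lam_hat = sum(counts) / len(counts)  # closed-form MLE: the sample mean

print("lambda_hat =", lam_hat)
for lam in (4.0, 4.9, lam_hat, 5.1, 6.0):
    print(f"LL({lam:.1f}) = {poisson_log_likelihood(lam, counts):.4f}")
```

Evaluating the log-likelihood on either side of the sample mean shows it dropping off in both directions, consistent with the derivative being zero there.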

Problem 2: Bernoulli MAP with Beta prior. You are estimating the probability of a biased coin landing heads. Your prior is Beta(10, 10) — you believe the coin is probably fair, with moderate confidence (equivalent to 18 imaginary flips, 9 heads and 9 tails). You flip 8 times and observe 7 heads. Find both MLE and MAP.

Solution:
MLE: p̂ = Y/n = 7/8 = 0.875

MAP with Beta(10,10): pMAP = (Y + a − 1)/(n + a + b − 2) = (7 + 10 − 1)/(8 + 10 + 10 − 2) = 16/26 = 0.615

The MLE says p ≈ 0.88, nearly certain heads. The MAP, tempered by the prior belief that p ≈ 0.5, says p ≈ 0.62. With only 8 data points, the prior (worth 18 imaginary observations) has significant influence. If you had 100 flips with 87 heads, MLE = 0.87, MAP = (87+9)/(100+18) = 96/118 = 0.814 — much closer to MLE.

Problem 3: Normal MLE for μ with known σ². Exam scores (known σ² = 225) for 4 students: {68, 82, 75, 91}. Find the MLE for μ.

Solution:
μ̂ = (68 + 82 + 75 + 91)/4 = 316/4 = 79.0

The MLE for the normal mean is always the sample mean, regardless of the known variance. The variance affects how confident we are in our estimate (the "width" of the likelihood peak), but not the estimate itself.

Problem 4: When MLE = MAP. Under what condition on the Beta(a,b) prior does pMAP = pMLE for Bernoulli data?

Solution:
pMAP = (Y + a − 1)/(n + a + b − 2). For this to equal Y/n for all Y and n, we need a − 1 = 0 and b − 1 = 0, so a = b = 1. This is the Beta(1,1) = Uniform(0,1) prior. A uniform prior carries no information, so MAP defaults to pure MLE.
Check: The MLE for the Poisson rate λ given data {2, 4, 6, 8} is:

Chapter 9: Connections

Parameter estimation is the bridge between probability theory and machine learning. Everything we have built in this course — distributions, Bayes' theorem, independence — converges here into a practical tool for learning from data.

MLE: Maximize likelihood → Foundation of logistic regression, neural networks, many ML models
MAP: Add a prior → Equivalent to regularization (L2 regularization = Gaussian prior on weights)
Full Bayesian: Don't pick one θ; integrate over all θ → Bayesian inference, posterior predictive

Looking backward: Parameter estimation uses almost everything from this course. Bayes' theorem (Chapter 3) is the engine behind MAP. Random variables and distributions (Chapters 5, 7) define the models we estimate. Independence (Chapter 3) justifies the product form of the likelihood. Expectation (Chapter 6) gives us the sample mean as the MLE for many distributions.

Looking forward: Chapter 15 on Machine Learning will use MLE and MAP as the training algorithms for classification and regression. The loss functions you will see — cross-entropy loss for classification, mean squared error for regression — are both negative log-likelihoods in disguise. When you add L2 regularization to a neural network, you are doing MAP with a Gaussian prior on the weights.
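
The link between L2 regularization and a Gaussian prior follows in a few lines from the MAP objective. The derivation below is a sketch under stated assumptions: the weights are written $w = (w_1, \dots, w_d)$, each with an independent Normal(0, τ²) prior, and the symbols $w$, $d$, $\tau$, and $\lambda$ are introduced here for illustration.

```latex
\begin{aligned}
\log f(w) &= \sum_{j=1}^{d} \log\!\left[\frac{1}{\sqrt{2\pi\tau^2}}
             \exp\!\left(-\frac{w_j^2}{2\tau^2}\right)\right]
           = -\frac{1}{2\tau^2}\sum_{j=1}^{d} w_j^2 + \text{const} \\[4pt]
w_{\text{MAP}} &= \arg\max_w \left[\log f(w) + \sum_{i=1}^{n} \log f(x_i \mid w)\right] \\
               &= \arg\min_w \left[-\sum_{i=1}^{n} \log f(x_i \mid w)
                  + \lambda \sum_{j=1}^{d} w_j^2\right],
                  \qquad \lambda = \frac{1}{2\tau^2}
\end{aligned}
```

The first term is the usual negative log-likelihood loss and the second is an L2 penalty. A tighter prior (smaller τ²) corresponds to a larger regularization strength λ.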

The big picture: MLE says "find the single best θ." MAP says "find the single best θ, but be informed by a prior." Full Bayesian inference says "don't pick a single θ at all — maintain a full posterior distribution over θ and average predictions over all possible parameter values." Each step is more principled but more computationally expensive.
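
| Method | Formula | ML Equivalent |
| --- | --- | --- |
| MLE | $\arg\max_\theta \sum_i \log f(x_i \mid \theta)$ | Minimize cross-entropy / MSE loss |
| MAP | $\arg\max_\theta \left[\log f(\theta) + \sum_i \log f(x_i \mid \theta)\right]$ | Loss + L2/L1 regularization |
| Full Bayes | $\int f(x_{\text{new}} \mid \theta)\, f(\theta \mid \text{data})\, d\theta$ | Bayesian neural networks, ensembles |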

A taste of linear regression: In the "linear transform plus noise" model $Y = \theta X + Z$ where $Z \sim \text{Normal}(0, \sigma^2)$, the MLE reduces to minimizing $\sum_i (Y_i - \theta X_i)^2$, the ordinary least squares objective. Every time you fit a line to data using least squares, you are doing MLE under a Gaussian noise assumption. This connection is the entry point to Chapter 15.
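
For this one-parameter model the least squares minimizer has a closed form: setting the derivative of $\sum_i (y_i - \theta x_i)^2$ to zero gives $\hat{\theta} = \sum_i x_i y_i / \sum_i x_i^2$. The sketch below applies it to a small made-up dataset; the data values and function names are hypothetical.

```python
def fit_slope(xs, ys):
    """Least squares (and Gaussian-noise MLE) slope for y = theta * x + noise."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def squared_error(theta, xs, ys):
    """Sum of squared residuals; proportional to the negative log-likelihood
    (up to constants that do not depend on theta)."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0]  # hypothetical inputs
ys = [2.1, 3.9, 6.2, 7.8]  # hypothetical noisy outputs, roughly y = 2x

theta_hat = fit_slope(xs, ys)
print("theta_hat =", round(theta_hat, 4))
print("SSE at theta_hat:", round(squared_error(theta_hat, xs, ys), 4))
print("SSE at theta_hat + 0.1:", round(squared_error(theta_hat + 0.1, xs, ys), 4))
```

Note that σ² never appears in `fit_slope`: the noise variance scales the log-likelihood but does not move the location of its maximum in θ.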

Check: Adding L2 regularization to a neural network is equivalent to which estimation method?