Learning the numbers that define a distribution — directly from data.
So far we have worked with random variables where someone handed us the parameters. "This coin has probability p = 0.6 of heads." "The arrival rate is λ = 3 per hour." But in the real world, nobody hands you the parameters. You observe data and have to figure them out.
Imagine you are building a spam filter. You need the probability that a spam email contains the word "lottery." You don't know that number. But you have 10,000 labeled emails. Can you estimate p from the data? That is the parameter estimation problem.
This chapter covers the two dominant approaches: Maximum Likelihood Estimation (MLE), which picks the parameters that make the observed data most probable, and Maximum A Posteriori (MAP), which folds in a prior belief about what the parameters should be.
| Distribution | Parameters θ | Example |
|---|---|---|
| Bernoulli(p) | θ = p | Coin flip probability |
| Poisson(λ) | θ = λ | Email arrival rate |
| Uniform(a, b) | θ = [a, b] | Random number range |
| Normal(μ, σ²) | θ = [μ, σ²] | Test score distribution |
Almost all modern machine learning follows a two-step recipe: (1) specify a probabilistic model with parameters, (2) learn the parameter values from data. Parameter estimation is the foundation of that second step.
Both MLE and MAP assume the data are independent and identically distributed (IID) samples: X₁, X₂, …, Xₙ. Each data point was generated by the same underlying process, and observing one tells you nothing extra about the others (beyond what the shared parameters already tell you).
You flip a coin 10 times and get: H, H, T, H, T, H, H, H, T, H. Seven heads, three tails. If p = 0.5, how probable is this exact sequence? If p = 0.7, is the sequence more or less probable? The function that answers this question — "how probable is my data for a given parameter value?" — is called the likelihood function.
We use the notation f(X = x | Θ = θ) for the shared PMF (discrete) or PDF (continuous) of each data point. The conditioning on θ reminds us that different parameter values produce different probabilities for the same data.
Since the data are IID, the likelihood of all n data points is the product of the individual likelihoods:

$$L(\theta) = \prod_{i=1}^{n} f(X_i = x_i \mid \Theta = \theta)$$
This is a function of θ, not of the data. The data are fixed (you already observed them). You are sweeping over different possible parameter values and asking: "which θ makes this particular dataset most probable?"
Concrete example: You flip a coin 5 times and observe x₁=1, x₂=0, x₃=1, x₄=1, x₅=0 (3 heads, 2 tails). Each flip is Bernoulli(p), so:

$$L(p) = p \cdot (1-p) \cdot p \cdot p \cdot (1-p) = p^3 (1-p)^2$$
Let's evaluate this at a few values of p:
| p | L(p) = p³(1−p)² | Calculation |
|---|---|---|
| 0.3 | 0.01323 | 0.027 × 0.49 = 0.01323 |
| 0.5 | 0.03125 | 0.125 × 0.25 = 0.03125 |
| 0.6 | 0.03456 | 0.216 × 0.16 = 0.03456 |
| 0.7 | 0.03087 | 0.343 × 0.09 = 0.03087 |
| 0.9 | 0.00729 | 0.729 × 0.01 = 0.00729 |
The likelihood peaks around p = 0.6. That makes intuitive sense — we saw 3 heads in 5 flips, and 3/5 = 0.6. The Maximum Likelihood Estimate is the value of p that sits at the top of this curve.
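The table above can be reproduced in a few lines of Python. This is a minimal sketch; the function name `bernoulli_likelihood` and the variable names are illustrative choices, not something defined in the text.

```python
def bernoulli_likelihood(p, flips):
    """Likelihood of an IID Bernoulli(p) sample: the product of per-flip probabilities."""
    likelihood = 1.0
    for x in flips:
        # Each flip contributes p if it is a head (1) and 1 - p if it is a tail (0).
        likelihood *= p if x == 1 else 1.0 - p
    return likelihood


flips = [1, 0, 1, 1, 0]  # the 5-flip example: 3 heads, 2 tails
candidates = [0.3, 0.5, 0.6, 0.7, 0.9]
table = {p: bernoulli_likelihood(p, flips) for p in candidates}

for p, value in table.items():
    print(f"p = {p:.1f}  L(p) = {value:.5f}")

# Among the candidate values, pick the one with the largest likelihood.
best_p = max(table, key=table.get)
print("largest likelihood among the candidates at p =", best_p)
```

Only five candidate values are compared here, so this identifies the best of those five rather than the exact maximizer.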
Interactive figure: the likelihood curve L(p) = p^heads (1−p)^tails as a function of p, updated as coin flips accumulate. The peak sharpens as more data are added.
The likelihood function is a product of many small numbers. With n = 1000 data points, you are multiplying 1000 probabilities together. The result is astronomically tiny — so small that computers lose precision (underflow). We need a trick.
The trick is the logarithm. Since log is a monotonically increasing function, the value of θ that maximizes L(θ) also maximizes log L(θ). But log turns products into sums, which are numerically stable and easier to differentiate.
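A small experiment makes the underflow problem concrete. The sketch below assumes a hypothetical dataset of 5000 flips (3000 heads), a size chosen so that the raw product really does collapse to zero in 64-bit floating point, and compares it with the sum of logs.

```python
import math

# Hypothetical dataset: 5000 flips with 3000 heads, chosen so the raw product underflows.
flips = [1] * 3000 + [0] * 2000
p = 0.6

# Naive likelihood: multiply 5000 probabilities together.
likelihood = 1.0
for x in flips:
    likelihood *= p if x == 1 else 1.0 - p

# Log-likelihood: add 5000 log-probabilities instead.
log_likelihood = sum(math.log(p if x == 1 else 1.0 - p) for x in flips)

print(likelihood)       # the product has underflowed to 0.0
print(log_likelihood)   # a finite negative number, roughly -3365
```

Once the product reaches 0.0, every candidate p looks equally bad and the maximum cannot be located, while the log-likelihood still separates them.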
Worked example: Return to our 5-flip example (3 heads, 2 tails). The log-likelihood is:

$$LL(p) = \log\big(p^3 (1-p)^2\big) = 3 \log p + 2 \log(1-p)$$
Let's evaluate at the same values of p:
| p | LL(p) = 3 log(p) + 2 log(1−p) | Calculation |
|---|---|---|
| 0.3 | −4.33 | 3(−1.204) + 2(−0.357) = −4.33 |
| 0.5 | −3.47 | 3(−0.693) + 2(−0.693) = −3.47 |
| 0.6 | −3.37 | 3(−0.511) + 2(−0.916) = −3.37 |
| 0.7 | −3.48 | 3(−0.357) + 2(−1.204) = −3.48 |
| 0.9 | −4.92 | 3(−0.105) + 2(−2.303) = −4.92 |
The maximum of LL(p) is at p = 0.6, same as the maximum of L(p). But now the numbers are manageable — small negative values instead of tiny decimals like 0.03456.
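The claim that L(p) and LL(p) share the same maximizer can be checked numerically. The following is a rough sketch using a grid search over p; the grid spacing of 0.01 is an arbitrary assumption, and a finer grid or a proper optimizer would be needed for less convenient data.

```python
import math

heads, tails = 3, 2  # the 5-flip example


def likelihood(p):
    return p**heads * (1 - p)**tails


def log_likelihood(p):
    return heads * math.log(p) + tails * math.log(1 - p)


# Candidate values 0.01, 0.02, ..., 0.99.
# The endpoints 0 and 1 are excluded because log(0) is undefined.
grid = [i / 100 for i in range(1, 100)]

best_by_l = max(grid, key=likelihood)
best_by_ll = max(grid, key=log_likelihood)
print(best_by_l, best_by_ll)
```

Both searches land on the same grid point because the logarithm preserves the ordering of the likelihood values.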
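The MLE recipe is now clear. Write the log-likelihood, take the derivative with respect to θ, set it to zero, and solve:

$$LL(\theta) = \sum_{i=1}^{n} \log f(X_i = x_i \mid \Theta = \theta)$$

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} LL(\theta), \qquad \text{found by solving } \frac{\partial LL(\theta)}{\partial \theta} = 0$$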
Let's apply the full MLE recipe to the Bernoulli distribution. You have n coin flips: X₁, X₂, …, Xₙ, each drawn from Bernoulli(p). You want to find the value of p that maximizes the likelihood.
The PMF of a single Bernoulli is f(x) = p^x (1−p)^(1−x). When x=1, this gives p. When x=0, this gives 1−p. That is exactly what we want, in one compact formula.
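Step 1: Log-likelihood. Let Y = ∑xᵢ be the total number of heads.

$$LL(p) = \sum_{i=1}^{n} \big[x_i \log p + (1 - x_i)\log(1-p)\big] = Y \log p + (n - Y)\log(1-p)$$

Step 2: Differentiate and set to zero.

$$\frac{\partial LL(p)}{\partial p} = \frac{Y}{p} - \frac{n - Y}{1 - p} = 0$$

Step 3: Solve for p.

$$Y(1 - p) = (n - Y)\,p \;\Rightarrow\; \hat{p} = \frac{Y}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$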
Numerical example: You flip a coin 20 times and observe 13 heads (Y = 13, n = 20).
| Step | Calculation | Result |
|---|---|---|
| Log-likelihood at p | 13 log(p) + 7 log(1−p) | Function of p |
| Derivative = 0 | 13/p − 7/(1−p) = 0 | Equation in p |
| Solve | 13(1−p) = 7p ⇒ 13 = 20p | p̂ = 13/20 = 0.65 |
| Verify LL | 13 ln(0.65) + 7 ln(0.35) | −12.95 (max) |
Check: at p = 0.5, LL = 13 ln(0.5) + 7 ln(0.5) = 20 ln(0.5) = −13.86, which is less than −12.95. Our MLE is indeed better.
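The closed-form result and the verification step translate directly into code. This is a minimal sketch; the function names are hypothetical helpers, and natural logarithms are assumed throughout.

```python
import math


def bernoulli_mle(heads, n):
    """Closed-form MLE for Bernoulli p: the fraction of heads."""
    return heads / n


def bernoulli_log_likelihood(p, heads, n):
    """LL(p) = Y log p + (n - Y) log(1 - p), using natural logs."""
    return heads * math.log(p) + (n - heads) * math.log(1 - p)


heads, n = 13, 20
p_hat = bernoulli_mle(heads, n)
ll_at_mle = bernoulli_log_likelihood(p_hat, heads, n)
ll_at_half = bernoulli_log_likelihood(0.5, heads, n)

print(f"p_hat     = {p_hat}")
print(f"LL(p_hat) = {ll_at_mle:.2f}")
print(f"LL(0.5)   = {ll_at_half:.2f}")
```

Evaluating the log-likelihood at any other p in (0, 1) gives a smaller value than at `p_hat`, which is a quick sanity check on the derivation.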
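Now let's tackle a distribution with two parameters. You have n samples X₁, …, Xₙ from a Normal(μ, σ²). You want to estimate both μ and σ² simultaneously.
The PDF of a single Normal observation is:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Step 1: Log-likelihood.

$$LL(\mu, \sigma^2) = \sum_{i=1}^{n} \log f(x_i \mid \mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

Step 2a: Differentiate with respect to μ.

$$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Step 2b: Differentiate with respect to σ².

$$\frac{\partial LL}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$$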
Numerical example: You measure 5 test scores: x = {72, 85, 90, 78, 95}.
| Step | Calculation | Result |
|---|---|---|
| μ̂ | (72 + 85 + 90 + 78 + 95) / 5 = 420 / 5 | 84.0 |
| Deviations | (72−84)²=144, (85−84)²=1, (90−84)²=36, (78−84)²=36, (95−84)²=121 | Sum = 338 |
| σ̂² | 338 / 5 | 67.6 |
| σ̂ | √67.6 | 8.22 |
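The same numbers fall out of Python's standard `statistics` module. One point worth noticing in this sketch: the MLE for σ² divides by n, so it corresponds to `pvariance` (population variance) rather than `variance`, which divides by n − 1.

```python
import math
import statistics

scores = [72, 85, 90, 78, 95]

# MLE for the mean: the sample average.
mu_hat = statistics.fmean(scores)

# pvariance divides by n, which matches the MLE;
# statistics.variance would divide by n - 1 instead.
sigma2_hat = statistics.pvariance(scores, mu=mu_hat)
sigma_hat = math.sqrt(sigma2_hat)

print(mu_hat, sigma2_hat, round(sigma_hat, 2))
```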
Interactive figure: samples are drawn from Normal(μ=50, σ²=100). As data are added, the MLE estimates converge to the true values.
MLE has a blind spot. If you flip a coin once and get heads, MLE says p̂ = 1. The coin always lands heads? That seems absurd. You know, from years of experience, that most coins are roughly fair. Shouldn't that knowledge count for something?
Maximum A Posteriori (MAP) estimation lets you incorporate a prior belief about the parameters. Instead of asking "what θ makes the data most likely?", MAP asks "what θ is most likely given both the data and my prior?"
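Formally, MAP applies Bayes' theorem:

$$\theta_{\text{MAP}} = \arg\max_{\theta} f(\theta \mid x_1, \ldots, x_n)$$

Expanding with Bayes:

$$\theta_{\text{MAP}} = \arg\max_{\theta} \frac{f(x_1, \ldots, x_n \mid \theta)\, f(\theta)}{f(x_1, \ldots, x_n)}$$

The denominator does not depend on θ, so we can drop it. And since the data are IID:

$$\theta_{\text{MAP}} = \arg\max_{\theta} f(\theta) \prod_{i=1}^{n} f(x_i \mid \theta)$$

Taking the log (same trick as MLE):

$$\theta_{\text{MAP}} = \arg\max_{\theta} \Big[\log f(\theta) + \sum_{i=1}^{n} \log f(x_i \mid \theta)\Big]$$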
Numerical example: Bernoulli with Beta prior. We use a Beta(a, b) prior for p, which has PDF f(p) ∝ p^(a−1) (1−p)^(b−1). With a = b = 5 (prior centered at 0.5, moderate confidence), and data: 3 heads, 2 tails (Y=3, n=5), the log-posterior is, up to an additive constant:

$$\log f(p) + LL(p) = (a - 1 + Y)\log p + (b - 1 + n - Y)\log(1 - p) = 7 \log p + 6 \log(1 - p)$$

Differentiating and setting to zero: 7/p − 6/(1−p) = 0, giving p_MAP = 7/13 ≈ 0.538.
Compare: MLE gave p̂ = 3/5 = 0.6. The prior (centered at 0.5) pulled the MAP estimate toward 0.5. The general formula for Bernoulli MAP with Beta(a,b) prior is:

$$p_{\text{MAP}} = \frac{Y + a - 1}{n + a + b - 2}$$
| Parameter | Conjugate Prior | Intuition |
|---|---|---|
| Bernoulli p | Beta(a, b) | a−1 imaginary heads, b−1 imaginary tails |
| Poisson λ | Gamma(k, θ) | k−1 imaginary events in θ time periods (θ as a rate) |
| Normal μ | Normal(μ₀, σ₀²) | Prior guess of the mean with some uncertainty |
How much does the prior actually matter? It depends on two things: the strength of the prior (how confident you are in your initial belief) and the amount of data (how much evidence you have).
Consider estimating a Bernoulli p with a Beta(a, b) prior. The MAP formula is:

$$p_{\text{MAP}} = \frac{Y + (a - 1)}{n + (a - 1) + (b - 1)}$$
Think of a−1 as "imaginary heads" and b−1 as "imaginary tails" from your prior experience. The MAP estimate averages your real data with this imaginary data.
Numerical comparison: True p = 0.7. Prior: Beta(3, 3) centered at 0.5. Observe different amounts of data:
| n | Y (heads) | MLE = Y/n | MAP = (Y+2)/(n+4) | MAP pulled toward |
|---|---|---|---|---|
| 2 | 2 | 1.000 | 4/6 = 0.667 | 0.5 (strong pull) |
| 5 | 4 | 0.800 | 6/9 = 0.667 | 0.5 (moderate pull) |
| 20 | 14 | 0.700 | 16/24 = 0.667 | 0.5 (mild pull) |
| 100 | 70 | 0.700 | 72/104 = 0.692 | 0.5 (tiny pull) |
| 1000 | 700 | 0.700 | 702/1004 = 0.699 | 0.5 (negligible) |
Stronger priors: Increasing a and b makes the prior more confident. Beta(2, 2) is a gentle nudge toward 0.5. Beta(50, 50) is a strong conviction that p ≈ 0.5 — you would need a lot of data to override it. The hyperparameters control how stubborn your prior is.
Flat prior recovers MLE: When a = b = 1, the Beta prior is Uniform(0,1), so every value of p is equally likely a priori. Then p_MAP = (Y + 0)/(n + 0) = Y/n = p_MLE. A flat prior means "I have no opinion," and MAP reduces exactly to MLE.
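The "imaginary flips" view of the Beta prior is easy to express in code. The sketch below uses exact fractions to reproduce the Beta(3, 3) comparison, the flat-prior case, and the earlier Beta(5, 5) example; `bernoulli_map` is a hypothetical helper name.

```python
from fractions import Fraction


def bernoulli_map(heads, n, a, b):
    """MAP estimate of p: real heads plus (a - 1) imaginary heads,
    divided by real flips plus (a - 1) + (b - 1) imaginary flips.
    Assumes a, b >= 1 and at least one real or imaginary flip."""
    return Fraction(heads + a - 1, n + a + b - 2)


# Rows of the comparison table: (n, heads), with a Beta(3, 3) prior.
for n, heads in [(2, 2), (5, 4), (20, 14), (100, 70), (1000, 700)]:
    mle = Fraction(heads, n)
    map_estimate = bernoulli_map(heads, n, a=3, b=3)
    print(f"n = {n:4d}  MLE = {float(mle):.3f}  MAP = {float(map_estimate):.3f}")

# A flat Beta(1, 1) prior adds no imaginary flips, so MAP equals MLE.
flat = bernoulli_map(14, 20, a=1, b=1)

# The earlier worked example: Beta(5, 5) prior, 3 heads in 5 flips.
worked = bernoulli_map(3, 5, a=5, b=5)
print(flat, worked)
```

Using `Fraction` keeps the arithmetic exact, which makes it easy to compare against hand calculations such as 7/13.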
Interactive simulation: MLE and MAP side by side. Data are generated from a Bernoulli distribution with a true (hidden) parameter p, and the MAP estimate uses a Beta(a, b) prior with a = b, which centers the prior at 0.5. Orange curve = MLE estimate over time. Teal curve = MAP estimate. Dashed line = true p. With few data points, MAP is more conservative (closer to 0.5). As data accumulates, both estimates converge to the truth.
Interactive figure: Beta(a, b) priors for different a and b (flat, peaked, and skewed). The prior is the dashed curve; the posterior (prior × likelihood) is solid.
The extreme case: With n = 1 and one head, MLE gives p̂ = 1.0. MAP with Beta(2, 2) gives p_MAP = (1+1)/(1+2) = 2/3 ≈ 0.667. With Beta(10, 10) the MAP gives (1+9)/(1+18) = 10/19 ≈ 0.526. The stronger the prior, the more it anchors the estimate against the single data point.
The large-data case: With n = 10,000 and 7,000 heads, MLE gives 0.700. MAP with Beta(10, 10) gives 7009/10018 ≈ 0.6996. With 10,000 observations the prior's 18 imaginary observations contribute less than 0.2% of the total evidence. The data overwhelm any reasonable prior.
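The convergence of the two estimators can also be watched in a plain simulation. This is a rough sketch under stated assumptions: the true p of 0.7, the Beta(10, 10) prior, the fixed seed, and the `simulate` helper are all illustrative choices.

```python
import random


def simulate(true_p, n_flips, a, b, seed=0):
    """Flip a simulated coin n_flips times and track MLE and MAP after every flip."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    history = []
    for n in range(1, n_flips + 1):
        heads += rng.random() < true_p  # True counts as 1, False as 0
        mle = heads / n
        map_estimate = (heads + a - 1) / (n + a + b - 2)
        history.append((n, mle, map_estimate))
    return history


history = simulate(true_p=0.7, n_flips=10_000, a=10, b=10)
for n, mle, map_estimate in history:
    if n in (1, 10, 100, 1000, 10_000):
        print(f"n = {n:6d}  MLE = {mle:.4f}  MAP = {map_estimate:.4f}")
```

With a symmetric prior the MAP estimate is never farther from 0.5 than the MLE, and the gap between the two shrinks as n grows.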
Problem 1: Poisson MLE. A server receives requests. You record the number of requests per minute for 6 minutes: {3, 7, 5, 4, 6, 5}. Assuming a Poisson(λ) model, find the MLE for λ.
Problem 2: Bernoulli MAP with Beta prior. You are estimating the probability of a biased coin landing heads. Your prior is Beta(10, 10) — you believe the coin is probably fair, with moderate confidence (equivalent to 18 imaginary flips, 9 heads and 9 tails). You flip 8 times and observe 7 heads. Find both MLE and MAP.
Problem 3: Normal MLE for μ with known σ². Exam scores (known σ² = 225) for 4 students: {68, 82, 75, 91}. Find the MLE for μ.
Problem 4: When MLE = MAP. Under what condition on the Beta(a,b) prior does p_MAP = p_MLE for Bernoulli data?
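For readers who want to check their work on Problems 1 to 3, here is a small numerical sketch rather than an official answer key. It relies on two facts: the Poisson log-likelihood ∑(xᵢ log λ − λ − log xᵢ!) has derivative ∑xᵢ/λ − n, so the Poisson MLE is the sample mean, and the Normal MLE for μ is the sample mean whatever the known σ² is. Problem 2 uses the Bernoulli MLE and Beta-prior MAP formulas from this chapter.

```python
from fractions import Fraction
import statistics

# Problem 1: the Poisson MLE is the sample mean of the counts.
requests = [3, 7, 5, 4, 6, 5]
lambda_hat = statistics.fmean(requests)

# Problem 2: 7 heads in 8 flips with a Beta(10, 10) prior.
heads, n, a, b = 7, 8, 10, 10
p_mle = Fraction(heads, n)
p_map = Fraction(heads + a - 1, n + a + b - 2)

# Problem 3: with sigma^2 known, the Normal MLE for mu is still the sample mean.
exam_scores = [68, 82, 75, 91]
mu_hat = statistics.fmean(exam_scores)

print(lambda_hat, p_mle, p_map, mu_hat)
```

In Problem 2 the prior's 18 imaginary flips outnumber the 8 real ones, so the MAP estimate sits much closer to 0.5 than the MLE does.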
Parameter estimation is the bridge between probability theory and machine learning. Everything we have built in this course — distributions, Bayes' theorem, independence — converges here into a practical tool for learning from data.
Looking backward: Parameter estimation uses almost everything from this course. Bayes' theorem (Chapter 3) is the engine behind MAP. Random variables and distributions (Chapters 5, 7) define the models we estimate. Independence (Chapter 3) justifies the product form of the likelihood. Expectation (Chapter 6) gives us the sample mean as the MLE for many distributions.
Looking forward: Chapter 15 on Machine Learning will use MLE and MAP as the training algorithms for classification and regression. The loss functions you will see — cross-entropy loss for classification, mean squared error for regression — are both negative log-likelihoods in disguise. When you add L2 regularization to a neural network, you are doing MAP with a Gaussian prior on the weights.
| Method | Formula | ML Equivalent |
|---|---|---|
| MLE | argmax_θ ∑ log f(xᵢ \| θ) | Minimize cross-entropy / MSE loss |
| MAP | argmax_θ [log f(θ) + ∑ log f(xᵢ \| θ)] | Loss + L2/L1 regularization |
| Full Bayes | ∫ f(x_new \| θ) f(θ \| data) dθ | Bayesian neural networks, ensembles |
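The link between MAP and L2 regularization can be made explicit with a short derivation. The sketch below assumes a single scalar parameter θ with a zero-mean Gaussian prior; the prior variance τ² and the penalty weight α are symbols introduced here for illustration.

```latex
\begin{aligned}
\theta_{\text{MAP}}
  &= \arg\max_{\theta}\Big[\log f(\theta) + \sum_{i=1}^{n} \log f(x_i \mid \theta)\Big],
  \qquad \theta \sim \text{Normal}(0, \tau^2) \\
  &= \arg\max_{\theta}\Big[-\frac{\theta^2}{2\tau^2} + \sum_{i=1}^{n} \log f(x_i \mid \theta)\Big] \\
  &= \arg\min_{\theta}\Big[-\sum_{i=1}^{n} \log f(x_i \mid \theta) + \alpha\,\theta^2\Big],
  \qquad \alpha = \frac{1}{2\tau^2}
\end{aligned}
```

The first term is the usual negative log-likelihood loss and the second is an L2 penalty. A tighter prior (smaller τ²) corresponds to a larger penalty weight α.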
A taste of linear regression: In the "linear transform plus noise" model Y = θX + Z where Z ~ Normal(0, σ²), the MLE reduces to minimizing ∑(Yᵢ − θXᵢ)², which is the ordinary least squares objective. Every time you fit a line to data using least squares, you are doing MLE under a Gaussian noise assumption. This connection is the entry point to Chapter 15.
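For this one-parameter model the least squares solution has a closed form: setting the derivative of ∑(Yᵢ − θXᵢ)² to zero gives θ̂ = ∑XᵢYᵢ / ∑Xᵢ². The sketch below applies it to a small made-up dataset; the data values and function names are assumptions for illustration only.

```python
def fit_slope(xs, ys):
    """Closed-form minimizer of sum((y - theta * x)^2): theta = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def squared_error(theta, xs, ys):
    """The least squares objective for a candidate slope theta."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))


# Made-up data that roughly follows y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

theta_hat = fit_slope(xs, ys)
print(theta_hat)

# Nudging the slope in either direction increases the squared error.
print(squared_error(theta_hat, xs, ys) < squared_error(theta_hat + 0.01, xs, ys))
print(squared_error(theta_hat, xs, ys) < squared_error(theta_hat - 0.01, xs, ys))
```

Note that σ² never appears in the estimate of θ: the noise variance scales the log-likelihood but does not move its maximizer.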