You have data. You want a distribution. This chapter shows three roads from observations to parameters: maximum likelihood, Bayesian updating, and nonparametric density estimation — plus EM for the case when data is incomplete.
You are a doctor studying a rare disease. You test 10 patients and 3 come back positive. What is the probability that the next patient tests positive? Your gut says 3/10 = 0.3. But is that right? What if you had only tested 2 patients and 1 was positive — would you bet heavily on 0.5?
This is parameter learning: given a set of observations (data), estimate the numerical parameters of a probability distribution. In Chapter 2 we learned the structure of Bayesian networks (which variables connect to which). In Chapter 3 we learned to reason with them. Now we face the third challenge: where do the numbers in the CPTs come from?
The challenge is not just finding a number — any idiot can compute 3/10. The challenge is knowing how confident to be. With 3 heads in 10 flips we are uncertain. With 3000 heads in 10000 flips we are much more sure. Uncertainty about the parameters themselves is the central theme of this chapter.
Click "Flip" to generate coin flips from a biased coin (true P(heads) = 0.65). Watch how different estimation strategies evolve. MLE (orange) jumps to the sample fraction. A Bayesian with a uniform prior (teal) starts at 0.5 and gradually converges.
The oldest and most widely-used approach: find the parameter values that make the observed data as probable as possible. Formally, given data D = {x(1), ..., x(m)} and a model parameterized by θ, we want:
The product over m samples collapses to a product because we assume the samples are i.i.d. (independent and identically distributed). The function P(D | θ) viewed as a function of θ (with D fixed) is called the likelihood.
Products are computationally fragile — multiplying 1000 probabilities together produces a number so small it underflows to zero. We always work with the log-likelihood instead, which turns products into sums:
Since log is monotonically increasing, maximizing ℓ(θ) is equivalent to maximizing P(D | θ). The log-likelihood is your workhorse in statistical learning.
Drag the sliders to set the number of heads and tails. The log-likelihood curve shows which θ value the MLE selects. The peak is always at nH/(nH+nT).
For a Bayesian network with known structure, MLE decomposes beautifully. Each CPT row can be estimated independently from the data using the relevant counts. If node X has parents pa(X), then:
where m(x, pa) is the number of times we see X=x with parents pa(X) = pa, and m(pa) is the count of that parent configuration. This is just counting — MLE in Bayesian networks reduces to computing frequency tables from data.
python import numpy as np from collections import defaultdict def mle_cpt(data, child_col, parent_cols): """Estimate CPT via MLE from data (list of dicts).""" counts = defaultdict(lambda: defaultdict(int)) for row in data: pa = tuple(row[p] for p in parent_cols) x = row[child_col] counts[pa][x] += 1 cpt = {} for pa, xcounts in counts.items(): total = sum(xcounts.values()) cpt[pa] = {x: n/total for x, n in xcounts.items()} return cpt # Example: Learn P(Rain | Sprinkler) from observations data = [ {'sprinkler': 0, 'rain': 0}, {'sprinkler': 1, 'rain': 0}, {'sprinkler': 0, 'rain': 1}, ] cpt = mle_cpt(data, 'rain', ['sprinkler']) # cpt[(0,)] = {0: 0.5, 1: 0.5} -- given no sprinkler # cpt[(1,)] = {0: 1.0} -- given sprinkler on (only 1 obs!)
MLE gives you a single number — but after 3 coin flips, that single number carries enormous uncertainty. Bayesian learning treats the parameter θ itself as a random variable with a prior distribution P(θ) reflecting your beliefs before seeing data. Data updates the prior to a posterior via Bayes' rule:
The posterior P(θ | D) captures everything you know about θ after seeing the data. It is a full distribution, not a point estimate — it tells you not just where θ probably lives, but how confident you should be.
For the coin flip problem, a natural prior over θ ∈ [0,1] is the Beta distribution: Beta(α, β). Its density is proportional to θα−1(1−θ)β−1. The parameters α and β act like "pseudo-counts" — α is like seeing α−1 heads before any real data, β like seeing β−1 tails.
After observing nH heads and nT tails, the posterior is exactly:
This is the miracle of conjugate priors: the posterior has the same family as the prior. You just add the counts to the prior parameters. The posterior mean is (α + nH) / (α + β + n), which pulls the MLE toward α/(α+β) when data is scarce but converges to the MLE as n → ∞.
Set prior strength, flip coins (true bias = 0.65), watch the posterior sharpen. Orange curve = posterior density. Teal line = posterior mean. Dimmed line = MLE. Green line = true value.
python import numpy as np from scipy.stats import beta def bayesian_coin_update(prior_alpha, prior_beta, flips): """Update Beta prior with observed coin flips (1=heads, 0=tails).""" n_heads = sum(flips) n_tails = len(flips) - n_heads post_alpha = prior_alpha + n_heads post_beta = prior_beta + n_tails return post_alpha, post_beta # Uniform prior: Beta(1,1) = Uniform(0,1) a0, b0 = 1.0, 1.0 # Observe 7 heads, 3 tails flips = [1]*7 + [0]*3 a1, b1 = bayesian_coin_update(a0, b0, flips) # a1 = 8, b1 = 4 => posterior = Beta(8, 4) post_mean = a1 / (a1 + b1) # 8/12 = 0.667 post_var = a1*b1 / ((a1+b1)**2 * (a1+b1+1)) mle = 7 / 10 # 0.7 print(f"Posterior mean: {post_mean:.3f} MLE: {mle:.3f}") # Posterior mean pulled toward prior mean (0.5) vs MLE (0.7)
The Beta-Binomial is for two outcomes (heads/tails). Real variables often have more: a traffic light has three states (red, yellow, green); a word in a language model has thousands. The generalization is the Dirichlet-Categorical conjugate pair.
A Categorical distribution over k outcomes is parameterized by a probability vector θ = (θ1, ..., θk) with ∑i θi = 1. The conjugate prior over this simplex is the Dirichlet distribution:
Each αi acts as a pseudocount for outcome i. The Dirichlet prior with all αi = 1 is the uniform distribution over the simplex — complete prior ignorance. After observing mi occurrences of outcome i, the posterior is just:
The posterior mean for θi is (αi + mi) / (∑j αj + m), which is called Laplace smoothing in the special case αi = 1. It ensures no outcome gets probability exactly zero even if we never observed it — crucial for language models.
Observe outcomes A, B, C. Solid bars = posterior mean (Bayesian). Faded bars = MLE. The prior concentration α controls resistance to data. When C is never seen, notice MLE gives 0 but Bayesian does not.
python import numpy as np def dirichlet_posterior_mean(alpha_prior, counts): """Posterior mean of Dirichlet after observing counts.""" alpha_post = alpha_prior + counts return alpha_post / alpha_post.sum() # Symmetric Dir(1,1,1) prior = uniform over simplex alpha0 = np.array([1.0, 1.0, 1.0]) counts = np.array([10, 3, 0]) # A seen 10x, B 3x, C never post_mean = dirichlet_posterior_mean(alpha0, counts) # [11/17, 4/17, 1/17] = [0.647, 0.235, 0.059] # C gets 1/17 ~= 0.059, not zero -- Laplace smoothing mle = counts / counts.sum() # [10/13, 3/13, 0/13] -- C gets probability ZERO (dangerous!)
Many real-valued variables are well-modeled by a Gaussian (Normal) distribution N(μ, σ2). The two parameters are the mean μ (center) and variance σ2 (spread). Given i.i.d. data {x(1), ..., x(m)}, what are the MLE estimates?
The log-likelihood of a Gaussian is:
Setting the partial derivatives to zero gives two clean closed-form solutions. For the mean: ∂ℓ/∂μ = 0 gives μ* = (1/m) ∑ x(i) — the sample mean. For the variance: ∂ℓ/∂σ2 = 0 yields:
There is a subtlety: the MLE variance estimator divides by m, not m−1. This makes it biased — it systematically underestimates the true population variance. The unbiased estimator divides by m−1 (Bessel's correction). MLE optimizes likelihood, not unbiasedness.
np.var(x, ddof=0) gives MLE; ddof=1 gives unbiased.True distribution: N(0, 1). Draw samples, watch MLE estimates (μ, σ2) converge to truth. The fitted curve (orange) matches the true curve (teal, dashed) as sample size grows.
Parametric methods (Gaussian, Beta, Categorical) bet on a specific functional form. If the true distribution is bimodal, no single Gaussian will fit it well no matter how good your estimates are. Nonparametric methods make no such bet — they let the data define the shape directly.
Kernel Density Estimation places a small "bump" (the kernel) centered at each data point, then sums all the bumps to estimate the density. The most common kernel is the Gaussian:
Here h is the bandwidth — the width of each kernel bump. Too small: you get a spiky density that overfits to noise. Too large: you get an oversmoothed density that misses true structure. Bandwidth selection is the central challenge of KDE.
k-NN density estimation asks: how densely packed are the k nearest neighbors around point x? Regions where neighbors are close get high density; sparse regions get low density. The estimate at x is:
where Vk(x) is the volume of the ball containing the k nearest neighbors. In 1D, Vk(x) = 2 · dk(x), the distance to the k-th nearest neighbor. Like bandwidth in KDE, k controls the bias-variance tradeoff: small k = noisy, large k = smooth.
Samples from a bimodal distribution (two clusters). Orange = KDE estimate. Teal dashed = true density. Adjust bandwidth to find the sweet spot between overfitting and oversmoothing.
python import numpy as np from scipy.stats import gaussian_kde # Bimodal data (mixture of two Gaussians) np.random.seed(42) data = np.concatenate([ np.random.normal(-2, 0.8, 50), np.random.normal(2, 0.8, 50) ]) # KDE with Scott's rule bandwidth (automatic selection) kde = gaussian_kde(data, bw_method='scott') x_eval = np.linspace(-6, 6, 300) density = kde(x_eval) # Manual KDE with Gaussian kernel h = 0.5 def kde_manual(x, data, h): return np.mean(np.exp(-0.5*((x - data)/h)**2) / (h * np.sqrt(2*np.pi)))
Sometimes your data has missing values. A sensor failed for half the observations. Some patients didn't report their smoking status. You can't ignore the incomplete rows — that biases your estimates. You can't fully use them either — the missing values affect the likelihood.
The Expectation-Maximization (EM) algorithm handles this elegantly. It alternates between two steps:
EM is guaranteed to increase the observed-data log-likelihood at each step (or leave it unchanged at convergence). It cannot decrease it. This makes it safe to run until the likelihood stops improving.
The textbook example is Gaussian mixture models. Suppose you know data comes from K Gaussian components but don't know which component generated each point. The component assignment is the "missing" data. EM alternates between: (E) assigning soft probabilities ("responsibilities") to each component for each point, and (M) re-estimating the Gaussian parameters using weighted counts.
Data from two Gaussians. EM doesn't know which cluster each point came from. Color of dots = responsibility for component 1 (orange → teal). Watch EM discover the two clusters. Dashed lines = estimated means.
Let us work through EM concretely for a Bayesian network with missing data. We observe xobs, the missing variables are z (latent). The complete-data log-likelihood is log P(xobs, z | θ). We can't maximize this directly because z is unknown. Instead, EM works with its expectation over the posterior of z given the observed data.
Define the Q function (the E-step output):
The E-step computes Q. The M-step finds θnew = arg maxθ Q(θ, θold).
Jensen's inequality guarantees that for any distribution q(z):
The right side is a lower bound called the Evidence Lower BOund (ELBO). The E-step sets q(z) = P(z | xobs, θold) which makes the bound tight at the current θ. The M-step maximizes the bound in θ. Since we maximized a lower bound and the likelihood is above it, the likelihood can only increase.
python import numpy as np def em_gaussian_mixture(X, K=2, max_iter=100, tol=1e-6): """EM for K-component Gaussian mixture on 1D data X.""" m = len(X) # Initialize: random means, unit variances, uniform weights mu = X[np.random.choice(m, K, replace=False)] sigma = np.ones(K) pi = np.ones(K) / K log_lik_prev = -np.inf for _ in range(max_iter): # E-step: compute responsibilities r[i,k] = P(z_i=k | x_i, theta) r = np.zeros((m, K)) for k in range(K): r[:, k] = pi[k] * np.exp(-0.5*((X - mu[k])/sigma[k])**2) / (sigma[k]*np.sqrt(2*np.pi)) r /= r.sum(axis=1, keepdims=True) # normalize rows # M-step: re-estimate parameters using soft counts Nk = r.sum(axis=0) mu = (r * X[:, None]).sum(0) / Nk sigma = np.sqrt((r * (X[:,None] - mu)**2).sum(0) / Nk) pi = Nk / m # Check convergence via log-likelihood log_lik = np.sum(np.log( sum(pi[k]*np.exp(-0.5*((X-mu[k])/sigma[k])**2)/(sigma[k]*np.sqrt(2*np.pi)) for k in range(K)))) if abs(log_lik - log_lik_prev) < tol: break log_lik_prev = log_lik return mu, sigma, pi
The true log-likelihood (gray) and the ELBO lower bound (teal) for a simple coin model with 65H/35T. Each EM step tightens the bound at the current θ (orange dot) then maximizes it. Monotone ascent guaranteed.
We have now seen two fundamentally different philosophies for parameter estimation. They give different answers, especially with small data — and understanding the difference helps you choose the right tool.
With lots of data, both approaches converge to the same answer. The likelihood dominates the prior, so the posterior concentrates near the MLE. This is the Bernstein-von Mises theorem: under mild conditions, the posterior converges to a point mass at the true parameter as m → ∞.
They diverge sharply with small data. The MLE gives extreme estimates (0 probability for unseen events, 1.0 for events that always occurred). The Bayesian posterior stays uncertain, pulled toward the prior. With 0 data, the MLE is undefined; the Bayesian estimate is the prior mean.
| Property | MLE (Frequentist) | Bayesian Posterior | MAP |
|---|---|---|---|
| Output | Point estimate θ* | Full distribution P(θ|D) | Point estimate θMAP |
| Prior | None needed | Required | Required (acts as regularizer) |
| Small data | Overconfident, extreme | Uncertain, prior-dominated | Shrunk toward prior mode |
| Large data | Consistent, efficient | Converges to MLE | Converges to MLE |
| Uncertainty | None (point estimate) | Full posterior | None (point estimate) |
| Computation | Optimization | Integration (often hard) | Optimization (easier than full Bayes) |
Same coin data, three estimates plotted on the [0,1] interval. With few flips they diverge; with many, they converge. MAP uses a Beta(2,2) prior. Adjust true P(heads) and see how fast each method tracks it.
Parameter learning does not exist in isolation. Every technique here feeds directly into the broader machinery of probabilistic modeling. Here is where each thread leads.
This chapter assumed the structure of the model was known — we knew it was a coin, a Gaussian, a specific Bayesian network. In practice, you also need to choose the right model family. Chapter 5 (Structure Learning) addresses this: given data, find the best graph structure, not just the best parameters.
| Method | Best When | Fails When | Key Parameter |
|---|---|---|---|
| MLE | Large data, known model family | Small data, zero counts | — |
| Bayesian (Beta/Dirichlet) | Small data, need uncertainty | Prior wrong, intractable posteriors | Pseudocounts α |
| MAP | Need point estimate + regularization | Prior misspecified | Prior strength |
| KDE | Unknown density shape, large data | High dimensions (curse) | Bandwidth h |
| k-NN density | Adaptive local density | High dimensions, boundaries | Neighbors k |
| EM | Missing data, latent variables | Local optima, slow convergence | Init., num. components |
Related micro-lessons:
Continue in this series:
"The purpose of models is not to fit the data but to sharpen the questions." — Samuel Karlin
With parameters learned, the full probabilistic model is specified. The next step is asking whether we chose the right structure — which variables should connect to which. That is Chapter 5.