Kochenderfer, Wheeler & Wray, Chapter 4

Parameter Learning

You have data. You want a distribution. This chapter shows three roads from observations to parameters: maximum likelihood, Bayesian updating, and nonparametric density estimation — plus EM for the case when data is incomplete.

Prerequisites: Chapter 3 (Inference) — conditional probability, Bayesian networks, factor operations.
10
Chapters
6+
Simulations

Chapter 0: The Parameter Learning Problem

You are a doctor studying a rare disease. You test 10 patients and 3 come back positive. What is the probability that the next patient tests positive? Your gut says 3/10 = 0.3. But is that right? What if you had only tested 2 patients and 1 was positive — would you bet heavily on 0.5?

This is parameter learning: given a set of observations (data), estimate the numerical parameters of a probability distribution. In Chapter 2 we learned the structure of Bayesian networks (which variables connect to which). In Chapter 3 we learned to reason with them. Now we face the third challenge: where do the numbers in the CPTs come from?

The challenge is not just finding a number — any idiot can compute 3/10. The challenge is knowing how confident to be. With 3 heads in 10 flips we are uncertain. With 3000 heads in 10000 flips we are much more sure. Uncertainty about the parameters themselves is the central theme of this chapter.

Three philosophies, three chapters. (1) Maximum Likelihood Estimation (MLE): find the parameters that make the data most probable. (2) Bayesian learning: treat parameters as random variables with a prior, update to a posterior. (3) Nonparametric: skip the parametric family entirely, let the data speak for themselves.
The Learning Problem Visualized

Click "Flip" to generate coin flips from a biased coin (true P(heads) = 0.65). Watch how different estimation strategies evolve. MLE (orange) jumps to the sample fraction. A Bayesian with a uniform prior (teal) starts at 0.5 and gradually converges.

0 flips
You observe 1 head in 2 coin flips. The MLE estimate of P(heads) is 0.5. Why might you still be uncertain about the true probability?

Chapter 1: Maximum Likelihood Estimation

The oldest and most widely-used approach: find the parameter values that make the observed data as probable as possible. Formally, given data D = {x(1), ..., x(m)} and a model parameterized by θ, we want:

θ* = arg maxθ P(D | θ) = arg maxθi=1m P(x(i) | θ)

The product over m samples collapses to a product because we assume the samples are i.i.d. (independent and identically distributed). The function P(D | θ) viewed as a function of θ (with D fixed) is called the likelihood.

Products are computationally fragile — multiplying 1000 probabilities together produces a number so small it underflows to zero. We always work with the log-likelihood instead, which turns products into sums:

ℓ(θ) = log P(D | θ) = ∑i=1m log P(x(i) | θ)

Since log is monotonically increasing, maximizing ℓ(θ) is equivalent to maximizing P(D | θ). The log-likelihood is your workhorse in statistical learning.

The coin flip MLE derivation. For a coin with P(heads) = θ, observe nH heads and nT tails. The log-likelihood is ℓ(θ) = nH log θ + nT log(1 − θ). Setting dℓ/dθ = 0 gives θ* = nH / (nH + nT). The MLE is just the sample fraction — exactly what your gut said.
Log-Likelihood Landscape

Drag the sliders to set the number of heads and tails. The log-likelihood curve shows which θ value the MLE selects. The peak is always at nH/(nH+nT).

Heads (nH) 7
Tails (nT) 3

For a Bayesian network with known structure, MLE decomposes beautifully. Each CPT row can be estimated independently from the data using the relevant counts. If node X has parents pa(X), then:

θ*x | pa = m(x, pa) / m(pa)

where m(x, pa) is the number of times we see X=x with parents pa(X) = pa, and m(pa) is the count of that parent configuration. This is just counting — MLE in Bayesian networks reduces to computing frequency tables from data.

python
import numpy as np
from collections import defaultdict

def mle_cpt(data, child_col, parent_cols):
    """Estimate CPT via MLE from data (list of dicts)."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in data:
        pa = tuple(row[p] for p in parent_cols)
        x  = row[child_col]
        counts[pa][x] += 1
    cpt = {}
    for pa, xcounts in counts.items():
        total = sum(xcounts.values())
        cpt[pa] = {x: n/total for x, n in xcounts.items()}
    return cpt

# Example: Learn P(Rain | Sprinkler) from observations
data = [
    {'sprinkler': 0, 'rain': 0},
    {'sprinkler': 1, 'rain': 0},
    {'sprinkler': 0, 'rain': 1},
]
cpt = mle_cpt(data, 'rain', ['sprinkler'])
# cpt[(0,)] = {0: 0.5, 1: 0.5}  -- given no sprinkler
# cpt[(1,)] = {0: 1.0}           -- given sprinkler on (only 1 obs!)
You observe 6 heads and 4 tails. What is the MLE estimate of P(heads), and what mathematical operation finds it?

Chapter 2: Bayesian Parameter Learning

MLE gives you a single number — but after 3 coin flips, that single number carries enormous uncertainty. Bayesian learning treats the parameter θ itself as a random variable with a prior distribution P(θ) reflecting your beliefs before seeing data. Data updates the prior to a posterior via Bayes' rule:

P(θ | D) ∝ P(D | θ) · P(θ)

The posterior P(θ | D) captures everything you know about θ after seeing the data. It is a full distribution, not a point estimate — it tells you not just where θ probably lives, but how confident you should be.

For the coin flip problem, a natural prior over θ ∈ [0,1] is the Beta distribution: Beta(α, β). Its density is proportional to θα−1(1−θ)β−1. The parameters α and β act like "pseudo-counts" — α is like seeing α−1 heads before any real data, β like seeing β−1 tails.

P(θ) = Beta(α, β) ∝ θα−1(1 − θ)β−1

After observing nH heads and nT tails, the posterior is exactly:

P(θ | nH, nT) = Beta(α + nH, β + nT)

This is the miracle of conjugate priors: the posterior has the same family as the prior. You just add the counts to the prior parameters. The posterior mean is (α + nH) / (α + β + n), which pulls the MLE toward α/(α+β) when data is scarce but converges to the MLE as n → ∞.

The showcase simulation below is the main event. Use the sliders to set your prior strength (the "pseudocounts" α and β) and watch the posterior sharpen as you add coin flip data. A flat prior (both = 1) trusts only the data. A strong prior (large values) resists updating. This is the core intuition of Bayesian learning.
Beta Prior Updating — SHOWCASE

Set prior strength, flip coins (true bias = 0.65), watch the posterior sharpen. Orange curve = posterior density. Teal line = posterior mean. Dimmed line = MLE. Green line = true value.

Prior α (pseudo-heads) 1.0
Prior β (pseudo-tails) 1.0
0 flips | 0 H 0 T
python
import numpy as np
from scipy.stats import beta

def bayesian_coin_update(prior_alpha, prior_beta, flips):
    """Update Beta prior with observed coin flips (1=heads, 0=tails)."""
    n_heads = sum(flips)
    n_tails = len(flips) - n_heads
    post_alpha = prior_alpha + n_heads
    post_beta  = prior_beta  + n_tails
    return post_alpha, post_beta

# Uniform prior: Beta(1,1) = Uniform(0,1)
a0, b0 = 1.0, 1.0

# Observe 7 heads, 3 tails
flips = [1]*7 + [0]*3
a1, b1 = bayesian_coin_update(a0, b0, flips)
# a1 = 8, b1 = 4  =>  posterior = Beta(8, 4)

post_mean = a1 / (a1 + b1)      # 8/12 = 0.667
post_var  = a1*b1 / ((a1+b1)**2 * (a1+b1+1))
mle       = 7 / 10             # 0.7

print(f"Posterior mean: {post_mean:.3f}  MLE: {mle:.3f}")
# Posterior mean pulled toward prior mean (0.5) vs MLE (0.7)
After observing 5 heads and 3 tails with a Beta(2, 2) prior, what is the posterior distribution?

Chapter 3: Dirichlet–Categorical Learning

The Beta-Binomial is for two outcomes (heads/tails). Real variables often have more: a traffic light has three states (red, yellow, green); a word in a language model has thousands. The generalization is the Dirichlet-Categorical conjugate pair.

A Categorical distribution over k outcomes is parameterized by a probability vector θ = (θ1, ..., θk) with ∑i θi = 1. The conjugate prior over this simplex is the Dirichlet distribution:

P(θ) = Dir(α1, ..., αk) ∝ ∏i=1k θiαi−1

Each αi acts as a pseudocount for outcome i. The Dirichlet prior with all αi = 1 is the uniform distribution over the simplex — complete prior ignorance. After observing mi occurrences of outcome i, the posterior is just:

P(θ | data) = Dir(α1 + m1, ..., αk + mk)

The posterior mean for θi is (αi + mi) / (∑j αj + m), which is called Laplace smoothing in the special case αi = 1. It ensures no outcome gets probability exactly zero even if we never observed it — crucial for language models.

The empty cell problem. With MLE, if you never see "yellow light" in your training data, you get P(yellow) = 0. Then your inference engine multiplies by zero and all posteriors collapse. Dirichlet priors with αi > 0 give every outcome a floor probability — this is why Bayesian smoothing is standard practice in language models and NLP.
Dirichlet Posterior over Three Outcomes

Observe outcomes A, B, C. Solid bars = posterior mean (Bayesian). Faded bars = MLE. The prior concentration α controls resistance to data. When C is never seen, notice MLE gives 0 but Bayesian does not.

Prior concentration α 1.0
A:0 B:0 C:0
python
import numpy as np

def dirichlet_posterior_mean(alpha_prior, counts):
    """Posterior mean of Dirichlet after observing counts."""
    alpha_post = alpha_prior + counts
    return alpha_post / alpha_post.sum()

# Symmetric Dir(1,1,1) prior = uniform over simplex
alpha0 = np.array([1.0, 1.0, 1.0])
counts = np.array([10, 3, 0])   # A seen 10x, B 3x, C never

post_mean = dirichlet_posterior_mean(alpha0, counts)
# [11/17, 4/17, 1/17] = [0.647, 0.235, 0.059]
# C gets 1/17 ~= 0.059, not zero -- Laplace smoothing

mle = counts / counts.sum()
# [10/13, 3/13, 0/13] -- C gets probability ZERO (dangerous!)
You use a Dir(1,1,1) prior and observe outcomes: A, A, B. What is the posterior mean for outcome C?

Chapter 4: Gaussian MLE

Many real-valued variables are well-modeled by a Gaussian (Normal) distribution N(μ, σ2). The two parameters are the mean μ (center) and variance σ2 (spread). Given i.i.d. data {x(1), ..., x(m)}, what are the MLE estimates?

The log-likelihood of a Gaussian is:

ℓ(μ, σ2) = −m/2 · log(2πσ2) − 1/(2σ2) ∑i (x(i) − μ)2

Setting the partial derivatives to zero gives two clean closed-form solutions. For the mean: ∂ℓ/∂μ = 0 gives μ* = (1/m) ∑ x(i) — the sample mean. For the variance: ∂ℓ/∂σ2 = 0 yields:

μ* = (1/m) ∑i=1m x(i)      σ2* = (1/m) ∑i=1m (x(i) − μ*)2

There is a subtlety: the MLE variance estimator divides by m, not m−1. This makes it biased — it systematically underestimates the true population variance. The unbiased estimator divides by m−1 (Bessel's correction). MLE optimizes likelihood, not unbiasedness.

MLE variance is biased because we estimated μ from the same data. The data used to estimate μ has already "absorbed" some variance. The divisor m−1 corrects for this lost degree of freedom. In practice: use m for MLE (when likelihood is the goal), use m−1 for unbiased estimation (when variance estimation is the goal). NumPy's np.var(x, ddof=0) gives MLE; ddof=1 gives unbiased.
Gaussian MLE from Samples

True distribution: N(0, 1). Draw samples, watch MLE estimates (μ, σ2) converge to truth. The fitted curve (orange) matches the true curve (teal, dashed) as sample size grows.

n=0
Gaussian MLE cheatsheet.
μ* = sample mean
σ2* = (1/m) ∑(xi−μ*)2
Biased toward 0; unbiased uses m−1
When it breaks. Gaussian MLE fails when data is multimodal, heavy-tailed, or has outliers. A single Gaussian can't capture a distribution with two humps. That is where mixture models (EM algorithm, Ch 6) come in.
You compute the MLE variance estimate on 5 samples and get 4.0. The unbiased (Bessel-corrected) estimate would be:

Chapter 5: Nonparametric Methods

Parametric methods (Gaussian, Beta, Categorical) bet on a specific functional form. If the true distribution is bimodal, no single Gaussian will fit it well no matter how good your estimates are. Nonparametric methods make no such bet — they let the data define the shape directly.

Kernel Density Estimation (KDE)

Kernel Density Estimation places a small "bump" (the kernel) centered at each data point, then sums all the bumps to estimate the density. The most common kernel is the Gaussian:

p̂(x) = (1/m) ∑i=1m Kh(x − x(i))   where   Kh(u) = (1/h) · φ(u/h)

Here h is the bandwidth — the width of each kernel bump. Too small: you get a spiky density that overfits to noise. Too large: you get an oversmoothed density that misses true structure. Bandwidth selection is the central challenge of KDE.

KDE bandwidth = bias-variance tradeoff. Small h → low bias (fits data exactly), high variance (wiggly). Large h → high bias (misses peaks), low variance (smooth). The optimal bandwidth by minimizing MISE scales as h* ∼ m−1/5 for Gaussian kernels — a famous result from nonparametric statistics.

k-Nearest-Neighbor Density Estimation

k-NN density estimation asks: how densely packed are the k nearest neighbors around point x? Regions where neighbors are close get high density; sparse regions get low density. The estimate at x is:

p̂(x) = k / (m · Vk(x))

where Vk(x) is the volume of the ball containing the k nearest neighbors. In 1D, Vk(x) = 2 · dk(x), the distance to the k-th nearest neighbor. Like bandwidth in KDE, k controls the bias-variance tradeoff: small k = noisy, large k = smooth.

KDE Bandwidth Explorer

Samples from a bimodal distribution (two clusters). Orange = KDE estimate. Teal dashed = true density. Adjust bandwidth to find the sweet spot between overfitting and oversmoothing.

Bandwidth h 0.50
30 samples
python
import numpy as np
from scipy.stats import gaussian_kde

# Bimodal data (mixture of two Gaussians)
np.random.seed(42)
data = np.concatenate([
    np.random.normal(-2, 0.8, 50),
    np.random.normal(2,  0.8, 50)
])

# KDE with Scott's rule bandwidth (automatic selection)
kde = gaussian_kde(data, bw_method='scott')
x_eval = np.linspace(-6, 6, 300)
density = kde(x_eval)

# Manual KDE with Gaussian kernel
h = 0.5
def kde_manual(x, data, h):
    return np.mean(np.exp(-0.5*((x - data)/h)**2) / (h * np.sqrt(2*np.pi)))
In KDE, what happens when the bandwidth h is set too small?

Chapter 6: The EM Algorithm

Sometimes your data has missing values. A sensor failed for half the observations. Some patients didn't report their smoking status. You can't ignore the incomplete rows — that biases your estimates. You can't fully use them either — the missing values affect the likelihood.

The Expectation-Maximization (EM) algorithm handles this elegantly. It alternates between two steps:

E-step: Expectation
Fill in the missing data probabilistically. Using the current parameter estimate θold, compute the expected value of the complete-data log-likelihood: Q(θ, θold) = E[log P(x, z | θ) | xobs, θold]
M-step: Maximization
Find the parameters that maximize the expected complete-data log-likelihood: θnew = arg maxθ Q(θ, θold)
↻ repeat until convergence

EM is guaranteed to increase the observed-data log-likelihood at each step (or leave it unchanged at convergence). It cannot decrease it. This makes it safe to run until the likelihood stops improving.

The key insight: EM is coordinate ascent in disguise. The observed-data likelihood log P(xobs | θ) is hard to maximize directly because it involves summing over all possible values of missing variables. EM constructs a lower bound (the Q function) that is easier to maximize, then tightens the bound at the next E-step. Repeat until the bound touches the likelihood.

The textbook example is Gaussian mixture models. Suppose you know data comes from K Gaussian components but don't know which component generated each point. The component assignment is the "missing" data. EM alternates between: (E) assigning soft probabilities ("responsibilities") to each component for each point, and (M) re-estimating the Gaussian parameters using weighted counts.

EM for Gaussian Mixture Models

Data from two Gaussians. EM doesn't know which cluster each point came from. Color of dots = responsibility for component 1 (orange → teal). Watch EM discover the two clusters. Dashed lines = estimated means.

Iteration 0
EM is guaranteed to do what at each iteration?

Chapter 7: E-step and M-step Derived

Let us work through EM concretely for a Bayesian network with missing data. We observe xobs, the missing variables are z (latent). The complete-data log-likelihood is log P(xobs, z | θ). We can't maximize this directly because z is unknown. Instead, EM works with its expectation over the posterior of z given the observed data.

Define the Q function (the E-step output):

Q(θ, θold) = Ez | xobs, θold [ log P(xobs, z | θ) ]

The E-step computes Q. The M-step finds θnew = arg maxθ Q(θ, θold).

Why Does EM Work?

Jensen's inequality guarantees that for any distribution q(z):

log P(xobs | θ) ≥ Eq[log P(xobs, z | θ)] − Eq[log q(z)]

The right side is a lower bound called the Evidence Lower BOund (ELBO). The E-step sets q(z) = P(z | xobs, θold) which makes the bound tight at the current θ. The M-step maximizes the bound in θ. Since we maximized a lower bound and the likelihood is above it, the likelihood can only increase.

EM for Bayesian networks with missing data (the textbook algorithm). For each observation i, compute the joint probability of all variable assignments (summing out missing ones). These "soft counts" replace the hard counts of MLE. The M-step is then identical to MLE but with fractional counts instead of integer counts — same formula, softer data.
python
import numpy as np

def em_gaussian_mixture(X, K=2, max_iter=100, tol=1e-6):
    """EM for K-component Gaussian mixture on 1D data X."""
    m = len(X)
    # Initialize: random means, unit variances, uniform weights
    mu    = X[np.random.choice(m, K, replace=False)]
    sigma = np.ones(K)
    pi    = np.ones(K) / K
    log_lik_prev = -np.inf

    for _ in range(max_iter):
        # E-step: compute responsibilities r[i,k] = P(z_i=k | x_i, theta)
        r = np.zeros((m, K))
        for k in range(K):
            r[:, k] = pi[k] * np.exp(-0.5*((X - mu[k])/sigma[k])**2) / (sigma[k]*np.sqrt(2*np.pi))
        r /= r.sum(axis=1, keepdims=True)  # normalize rows

        # M-step: re-estimate parameters using soft counts
        Nk = r.sum(axis=0)
        mu    = (r * X[:, None]).sum(0) / Nk
        sigma = np.sqrt((r * (X[:,None] - mu)**2).sum(0) / Nk)
        pi    = Nk / m

        # Check convergence via log-likelihood
        log_lik = np.sum(np.log(
            sum(pi[k]*np.exp(-0.5*((X-mu[k])/sigma[k])**2)/(sigma[k]*np.sqrt(2*np.pi))
                for k in range(K))))
        if abs(log_lik - log_lik_prev) < tol: break
        log_lik_prev = log_lik

    return mu, sigma, pi
ELBO Lower Bound — EM Convergence

The true log-likelihood (gray) and the ELBO lower bound (teal) for a simple coin model with 65H/35T. Each EM step tightens the bound at the current θ (orange dot) then maximizes it. Monotone ascent guaranteed.

Iteration 0
In the M-step of EM for a Gaussian mixture, how do we compute the new mean μk of component k?

Chapter 8: Bayesian vs Frequentist

We have now seen two fundamentally different philosophies for parameter estimation. They give different answers, especially with small data — and understanding the difference helps you choose the right tool.

Frequentist (MLE). Parameters are fixed but unknown constants. Probability describes the long-run frequency of events. "P(heads) = 0.6" is a claim about the physical world. The MLE is the value that most likely generated the data. No prior needed. Works badly with tiny samples.
Bayesian. Parameters are random variables with distributions. Probability describes degrees of belief. "P(heads) = 0.6" means "I believe the bias is 0.6 with some certainty." The posterior encodes all uncertainty about θ. Prior encodes domain knowledge. Requires a prior (subjective), but gives richer output.

When They Agree and When They Diverge

With lots of data, both approaches converge to the same answer. The likelihood dominates the prior, so the posterior concentrates near the MLE. This is the Bernstein-von Mises theorem: under mild conditions, the posterior converges to a point mass at the true parameter as m → ∞.

They diverge sharply with small data. The MLE gives extreme estimates (0 probability for unseen events, 1.0 for events that always occurred). The Bayesian posterior stays uncertain, pulled toward the prior. With 0 data, the MLE is undefined; the Bayesian estimate is the prior mean.

The MAP estimate bridges both. The Maximum A Posteriori (MAP) estimate finds the mode of the posterior: θMAP = arg maxθ P(θ | D) = arg maxθ [log P(D | θ) + log P(θ)]. It is like MLE with regularization: the log-prior acts as a penalty on extreme values. L2 regularization corresponds to a Gaussian prior; L1 (sparsity) corresponds to a Laplace prior.
PropertyMLE (Frequentist)Bayesian PosteriorMAP
OutputPoint estimate θ*Full distribution P(θ|D)Point estimate θMAP
PriorNone neededRequiredRequired (acts as regularizer)
Small dataOverconfident, extremeUncertain, prior-dominatedShrunk toward prior mode
Large dataConsistent, efficientConverges to MLEConverges to MLE
UncertaintyNone (point estimate)Full posteriorNone (point estimate)
ComputationOptimizationIntegration (often hard)Optimization (easier than full Bayes)
MLE vs MAP vs Posterior Mean

Same coin data, three estimates plotted on the [0,1] interval. With few flips they diverge; with many, they converge. MAP uses a Beta(2,2) prior. Adjust true P(heads) and see how fast each method tracks it.

0 flips
True P(heads) 0.70
The MAP estimate θMAP = arg max P(θ | D) is equivalent to MLE with what modification?

Chapter 9: Connections & What's Next

Parameter learning does not exist in isolation. Every technique here feeds directly into the broader machinery of probabilistic modeling. Here is where each thread leads.

MLE → Deep Learning
Neural network training with cross-entropy loss is MLE under a categorical likelihood. Gradient descent maximizes the log-likelihood. Adam, SGD — all are MLE optimization strategies.
Bayesian Learning → Chapter 5 (Structure Learning)
To score Bayesian network structures, we compute the marginal likelihood P(D | structure) = ∫ P(D | θ, structure) P(θ) dθ. With Dirichlet priors, this integral has a closed form — the Bayesian Dirichlet (BD) score.
KDE → Particle Filters, Generative Models
Particle filters represent beliefs as weighted samples; KDE converts them back to densities. Score-based diffusion models learn the score function ∇ log p(x) — the gradient of a KDE-like density.
EM → Hidden Markov Models, VAEs
The Baum-Welch algorithm for HMMs is exactly EM with the forward-backward algorithm in the E-step. Variational Autoencoders are amortized variational EM — the encoder learns to approximate the E-step.

Limitations of Parameter Learning

This chapter assumed the structure of the model was known — we knew it was a coin, a Gaussian, a specific Bayesian network. In practice, you also need to choose the right model family. Chapter 5 (Structure Learning) addresses this: given data, find the best graph structure, not just the best parameters.

MethodBest WhenFails WhenKey Parameter
MLELarge data, known model familySmall data, zero counts
Bayesian (Beta/Dirichlet)Small data, need uncertaintyPrior wrong, intractable posteriorsPseudocounts α
MAPNeed point estimate + regularizationPrior misspecifiedPrior strength
KDEUnknown density shape, large dataHigh dimensions (curse)Bandwidth h
k-NN densityAdaptive local densityHigh dimensions, boundariesNeighbors k
EMMissing data, latent variablesLocal optima, slow convergenceInit., num. components
The golden rule of parameter learning: More data beats a better algorithm. If you have 10,000 samples, MLE is usually good enough. If you have 10, the prior matters enormously. Know your data regime before choosing your method.

Related micro-lessons:

Continue in this series:

"The purpose of models is not to fit the data but to sharpen the questions." — Samuel Karlin

With parameters learned, the full probabilistic model is specified. The next step is asking whether we chose the right structure — which variables should connect to which. That is Chapter 5.