Parameter Learning — Kochenderfer et al., Chapter 4

Chapter 0: The Parameter Learning Problem

You are a doctor studying a rare disease. You test 10 patients and 3 come back positive. What is the probability that the next patient tests positive? Your gut says 3/10 = 0.3. But is that right? What if you had only tested 2 patients and 1 was positive — would you bet heavily on 0.5?

This is parameter learning: given a set of observations (data), estimate the numerical parameters of a probability distribution. In Chapter 2 we learned the structure of Bayesian networks (which variables connect to which). In Chapter 3 we learned to reason with them. Now we face the third challenge: where do the numbers in the CPTs come from?

The challenge is not just finding a number — any idiot can compute 3/10. The challenge is knowing how confident to be. With 3 heads in 10 flips we are uncertain. With 3000 heads in 10000 flips we are much more sure. Uncertainty about the parameters themselves is the central theme of this chapter.

Three philosophies, three chapters. (1) Maximum Likelihood Estimation (MLE): find the parameters that make the data most probable. (2) Bayesian learning: treat parameters as random variables with a prior, update to a posterior. (3) Nonparametric: skip the parametric family entirely, let the data speak for themselves.

The Learning Problem Visualized

Click "Flip" to generate coin flips from a biased coin (true P(heads) = 0.65). Watch how different estimation strategies evolve. MLE (orange) jumps to the sample fraction. A Bayesian with a uniform prior (teal) starts at 0.5 and gradually converges.

0 flips

You observe 1 head in 2 coin flips. The MLE estimate of P(heads) is 0.5. Why might you still be uncertain about the true probability?

Because 0.5 is never a valid probability Because the sample size is tiny — the true probability could easily be 0.3 or 0.7 and still produce this data Because coins are never fair

Chapter 1: Maximum Likelihood Estimation

The oldest and most widely-used approach: find the parameter values that make the observed data as probable as possible. Formally, given data D = {x⁽¹⁾, ..., x^(m)} and a model parameterized by θ, we want:

θ^* = arg max_θ P(D | θ) = arg max_θ ∏_i=1^m P(x⁽ⁱ⁾ | θ)

The product over m samples collapses to a product because we assume the samples are i.i.d. (independent and identically distributed). The function P(D | θ) viewed as a function of θ (with D fixed) is called the likelihood.

Products are computationally fragile — multiplying 1000 probabilities together produces a number so small it underflows to zero. We always work with the log-likelihood instead, which turns products into sums:

ℓ(θ) = log P(D | θ) = ∑_i=1^m log P(x⁽ⁱ⁾ | θ)

Since log is monotonically increasing, maximizing ℓ(θ) is equivalent to maximizing P(D | θ). The log-likelihood is your workhorse in statistical learning.

The coin flip MLE derivation. For a coin with P(heads) = θ, observe n_H heads and n_T tails. The log-likelihood is ℓ(θ) = n_H log θ + n_T log(1 − θ). Setting dℓ/dθ = 0 gives θ^* = n_H / (n_H + n_T). The MLE is just the sample fraction — exactly what your gut said.

Log-Likelihood Landscape

Drag the sliders to set the number of heads and tails. The log-likelihood curve shows which θ value the MLE selects. The peak is always at n_H/(n_H+n_T).

Heads (n_H) 7

Tails (n_T) 3

For a Bayesian network with known structure, MLE decomposes beautifully. Each CPT row can be estimated independently from the data using the relevant counts. If node X has parents pa(X), then:

θ^*_{x | pa} = m(x, pa) / m(pa)

where m(x, pa) is the number of times we see X=x with parents pa(X) = pa, and m(pa) is the count of that parent configuration. This is just counting — MLE in Bayesian networks reduces to computing frequency tables from data.

python
import numpy as np
from collections import defaultdict

def mle_cpt(data, child_col, parent_cols):
    """Estimate CPT via MLE from data (list of dicts)."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in data:
        pa = tuple(row[p] for p in parent_cols)
        x  = row[child_col]
        counts[pa][x] += 1
    cpt = {}
    for pa, xcounts in counts.items():
        total = sum(xcounts.values())
        cpt[pa] = {x: n/total for x, n in xcounts.items()}
    return cpt

# Example: Learn P(Rain | Sprinkler) from observations
data = [
    {'sprinkler': 0, 'rain': 0},
    {'sprinkler': 1, 'rain': 0},
    {'sprinkler': 0, 'rain': 1},
]
cpt = mle_cpt(data, 'rain', ['sprinkler'])
# cpt[(0,)] = {0: 0.5, 1: 0.5}  -- given no sprinkler
# cpt[(1,)] = {0: 1.0}           -- given sprinkler on (only 1 obs!)

You observe 6 heads and 4 tails. What is the MLE estimate of P(heads), and what mathematical operation finds it?

θ^* = 0.6, found by setting the derivative of the log-likelihood to zero θ^* = 0.5, because that is the fair coin value θ^* = 1.0, because more heads = higher likelihood

Chapter 2: Bayesian Parameter Learning

MLE gives you a single number — but after 3 coin flips, that single number carries enormous uncertainty. Bayesian learning treats the parameter θ itself as a random variable with a prior distribution P(θ) reflecting your beliefs before seeing data. Data updates the prior to a posterior via Bayes' rule:

P(θ | D) ∝ P(D | θ) · P(θ)

The posterior P(θ | D) captures everything you know about θ after seeing the data. It is a full distribution, not a point estimate — it tells you not just where θ probably lives, but how confident you should be.

For the coin flip problem, a natural prior over θ ∈ [0,1] is the Beta distribution: Beta(α, β). Its density is proportional to θ^α−1(1−θ)^β−1. The parameters α and β act like "pseudo-counts" — α is like seeing α−1 heads before any real data, β like seeing β−1 tails.

P(θ) = Beta(α, β) ∝ θ^α−1(1 − θ)^β−1

After observing n_H heads and n_T tails, the posterior is exactly:

P(θ | n_H, n_T) = Beta(α + n_H, β + n_T)

This is the miracle of conjugate priors: the posterior has the same family as the prior. You just add the counts to the prior parameters. The posterior mean is (α + n_H) / (α + β + n), which pulls the MLE toward α/(α+β) when data is scarce but converges to the MLE as n → ∞.

The showcase simulation below is the main event. Use the sliders to set your prior strength (the "pseudocounts" α and β) and watch the posterior sharpen as you add coin flip data. A flat prior (both = 1) trusts only the data. A strong prior (large values) resists updating. This is the core intuition of Bayesian learning.

Beta Prior Updating — SHOWCASE

Set prior strength, flip coins (true bias = 0.65), watch the posterior sharpen. Orange curve = posterior density. Teal line = posterior mean. Dimmed line = MLE. Green line = true value.

Prior α (pseudo-heads) 1.0

Prior β (pseudo-tails) 1.0

0 flips | 0 H 0 T

python
import numpy as np
from scipy.stats import beta

def bayesian_coin_update(prior_alpha, prior_beta, flips):
    """Update Beta prior with observed coin flips (1=heads, 0=tails)."""
    n_heads = sum(flips)
    n_tails = len(flips) - n_heads
    post_alpha = prior_alpha + n_heads
    post_beta  = prior_beta  + n_tails
    return post_alpha, post_beta

# Uniform prior: Beta(1,1) = Uniform(0,1)
a0, b0 = 1.0, 1.0

# Observe 7 heads, 3 tails
flips = [1]*7 + [0]*3
a1, b1 = bayesian_coin_update(a0, b0, flips)
# a1 = 8, b1 = 4  =>  posterior = Beta(8, 4)

post_mean = a1 / (a1 + b1)      # 8/12 = 0.667
post_var  = a1*b1 / ((a1+b1)**2 * (a1+b1+1))
mle       = 7 / 10             # 0.7

print(f"Posterior mean: {post_mean:.3f}  MLE: {mle:.3f}")
# Posterior mean pulled toward prior mean (0.5) vs MLE (0.7)

After observing 5 heads and 3 tails with a Beta(2, 2) prior, what is the posterior distribution?

Beta(7, 5) — add counts to prior parameters Beta(5, 3) — the prior is ignored Beta(2, 2) — the posterior equals the prior

Chapter 3: Dirichlet–Categorical Learning

The Beta-Binomial is for two outcomes (heads/tails). Real variables often have more: a traffic light has three states (red, yellow, green); a word in a language model has thousands. The generalization is the Dirichlet-Categorical conjugate pair.

A Categorical distribution over k outcomes is parameterized by a probability vector θ = (θ₁, ..., θ_k) with ∑_i θ_i = 1. The conjugate prior over this simplex is the Dirichlet distribution:

P(θ) = Dir(α₁, ..., α_k) ∝ ∏_i=1^k θ_i^α_i−1

Each α_i acts as a pseudocount for outcome i. The Dirichlet prior with all α_i = 1 is the uniform distribution over the simplex — complete prior ignorance. After observing m_i occurrences of outcome i, the posterior is just:

P(θ | data) = Dir(α₁ + m₁, ..., α_k + m_k)

The posterior mean for θ_i is (α_i + m_i) / (∑_j α_j + m), which is called Laplace smoothing in the special case α_i = 1. It ensures no outcome gets probability exactly zero even if we never observed it — crucial for language models.

The empty cell problem. With MLE, if you never see "yellow light" in your training data, you get P(yellow) = 0. Then your inference engine multiplies by zero and all posteriors collapse. Dirichlet priors with α_i > 0 give every outcome a floor probability — this is why Bayesian smoothing is standard practice in language models and NLP.

Dirichlet Posterior over Three Outcomes

Observe outcomes A, B, C. Solid bars = posterior mean (Bayesian). Faded bars = MLE. The prior concentration α controls resistance to data. When C is never seen, notice MLE gives 0 but Bayesian does not.

Prior concentration α 1.0

A:0 B:0 C:0

python
import numpy as np

def dirichlet_posterior_mean(alpha_prior, counts):
    """Posterior mean of Dirichlet after observing counts."""
    alpha_post = alpha_prior + counts
    return alpha_post / alpha_post.sum()

# Symmetric Dir(1,1,1) prior = uniform over simplex
alpha0 = np.array([1.0, 1.0, 1.0])
counts = np.array([10, 3, 0])   # A seen 10x, B 3x, C never

post_mean = dirichlet_posterior_mean(alpha0, counts)
# [11/17, 4/17, 1/17] = [0.647, 0.235, 0.059]
# C gets 1/17 ~= 0.059, not zero -- Laplace smoothing

mle = counts / counts.sum()
# [10/13, 3/13, 0/13] -- C gets probability ZERO (dangerous!)

You use a Dir(1,1,1) prior and observe outcomes: A, A, B. What is the posterior mean for outcome C?

0.0 — C was never observed 0.5 — prior always gives equal weight 1/6 — posterior is Dir(3,2,1), mean for C = 1/(3+2+1)

Chapter 4: Gaussian MLE

Many real-valued variables are well-modeled by a Gaussian (Normal) distribution N(μ, σ²). The two parameters are the mean μ (center) and variance σ² (spread). Given i.i.d. data {x⁽¹⁾, ..., x^(m)}, what are the MLE estimates?

The log-likelihood of a Gaussian is:

ℓ(μ, σ²) = −m/2 · log(2πσ²) − 1/(2σ²) ∑_i (x⁽ⁱ⁾ − μ)²

Setting the partial derivatives to zero gives two clean closed-form solutions. For the mean: ∂ℓ/∂μ = 0 gives μ^* = (1/m) ∑ x⁽ⁱ⁾ — the sample mean. For the variance: ∂ℓ/∂σ² = 0 yields:

μ^* = (1/m) ∑_i=1^m x⁽ⁱ⁾ σ^2* = (1/m) ∑_i=1^m (x⁽ⁱ⁾ − μ^*)²

There is a subtlety: the MLE variance estimator divides by m, not m−1. This makes it biased — it systematically underestimates the true population variance. The unbiased estimator divides by m−1 (Bessel's correction). MLE optimizes likelihood, not unbiasedness.

MLE variance is biased because we estimated μ from the same data. The data used to estimate μ has already "absorbed" some variance. The divisor m−1 corrects for this lost degree of freedom. In practice: use m for MLE (when likelihood is the goal), use m−1 for unbiased estimation (when variance estimation is the goal). NumPy's np.var(x, ddof=0) gives MLE; ddof=1 gives unbiased.

Gaussian MLE from Samples

True distribution: N(0, 1). Draw samples, watch MLE estimates (μ, σ²) converge to truth. The fitted curve (orange) matches the true curve (teal, dashed) as sample size grows.

n=0

Gaussian MLE cheatsheet.
μ^* = sample mean
σ^2* = (1/m) ∑(x_i−μ^*)²
Biased toward 0; unbiased uses m−1

When it breaks. Gaussian MLE fails when data is multimodal, heavy-tailed, or has outliers. A single Gaussian can't capture a distribution with two humps. That is where mixture models (EM algorithm, Ch 6) come in.

You compute the MLE variance estimate on 5 samples and get 4.0. The unbiased (Bessel-corrected) estimate would be:

5.0 — multiply by m/(m−1) = 5/4 3.2 — multiply by (m−1)/m = 4/5 4.0 — they are the same thing

Chapter 5: Nonparametric Methods

Parametric methods (Gaussian, Beta, Categorical) bet on a specific functional form. If the true distribution is bimodal, no single Gaussian will fit it well no matter how good your estimates are. Nonparametric methods make no such bet — they let the data define the shape directly.

Kernel Density Estimation (KDE)

Kernel Density Estimation places a small "bump" (the kernel) centered at each data point, then sums all the bumps to estimate the density. The most common kernel is the Gaussian:

p̂(x) = (1/m) ∑_i=1^m K_h(x − x⁽ⁱ⁾) where K_h(u) = (1/h) · φ(u/h)

Here h is the bandwidth — the width of each kernel bump. Too small: you get a spiky density that overfits to noise. Too large: you get an oversmoothed density that misses true structure. Bandwidth selection is the central challenge of KDE.

KDE bandwidth = bias-variance tradeoff. Small h → low bias (fits data exactly), high variance (wiggly). Large h → high bias (misses peaks), low variance (smooth). The optimal bandwidth by minimizing MISE scales as h^* ∼ m^−1/5 for Gaussian kernels — a famous result from nonparametric statistics.

k-Nearest-Neighbor Density Estimation

k-NN density estimation asks: how densely packed are the k nearest neighbors around point x? Regions where neighbors are close get high density; sparse regions get low density. The estimate at x is:

p̂(x) = k / (m · V_k(x))

where V_k(x) is the volume of the ball containing the k nearest neighbors. In 1D, V_k(x) = 2 · d_k(x), the distance to the k-th nearest neighbor. Like bandwidth in KDE, k controls the bias-variance tradeoff: small k = noisy, large k = smooth.

KDE Bandwidth Explorer

Samples from a bimodal distribution (two clusters). Orange = KDE estimate. Teal dashed = true density. Adjust bandwidth to find the sweet spot between overfitting and oversmoothing.

Bandwidth h 0.50

30 samples

python
import numpy as np
from scipy.stats import gaussian_kde

# Bimodal data (mixture of two Gaussians)
np.random.seed(42)
data = np.concatenate([
    np.random.normal(-2, 0.8, 50),
    np.random.normal(2,  0.8, 50)
])

# KDE with Scott's rule bandwidth (automatic selection)
kde = gaussian_kde(data, bw_method='scott')
x_eval = np.linspace(-6, 6, 300)
density = kde(x_eval)

# Manual KDE with Gaussian kernel
h = 0.5
def kde_manual(x, data, h):
    return np.mean(np.exp(-0.5*((x - data)/h)**2) / (h * np.sqrt(2*np.pi)))

In KDE, what happens when the bandwidth h is set too small?

The estimate is spiky and overfits to noise (high variance, low bias) The estimate is oversmoothed and misses true peaks (low variance, high bias) The estimate converges to a Gaussian regardless of data

Chapter 6: The EM Algorithm

Sometimes your data has missing values. A sensor failed for half the observations. Some patients didn't report their smoking status. You can't ignore the incomplete rows — that biases your estimates. You can't fully use them either — the missing values affect the likelihood.

The Expectation-Maximization (EM) algorithm handles this elegantly. It alternates between two steps:

E-step: Expectation

Fill in the missing data probabilistically. Using the current parameter estimate θ^old, compute the expected value of the complete-data log-likelihood: Q(θ, θ^old) = E[log P(x, z | θ) | x^obs, θ^old]

↓

M-step: Maximization

Find the parameters that maximize the expected complete-data log-likelihood: θ^new = arg max_θ Q(θ, θ^old)

↻ repeat until convergence

EM is guaranteed to increase the observed-data log-likelihood at each step (or leave it unchanged at convergence). It cannot decrease it. This makes it safe to run until the likelihood stops improving.

The key insight: EM is coordinate ascent in disguise. The observed-data likelihood log P(x^obs | θ) is hard to maximize directly because it involves summing over all possible values of missing variables. EM constructs a lower bound (the Q function) that is easier to maximize, then tightens the bound at the next E-step. Repeat until the bound touches the likelihood.

The textbook example is Gaussian mixture models. Suppose you know data comes from K Gaussian components but don't know which component generated each point. The component assignment is the "missing" data. EM alternates between: (E) assigning soft probabilities ("responsibilities") to each component for each point, and (M) re-estimating the Gaussian parameters using weighted counts.

EM for Gaussian Mixture Models

Data from two Gaussians. EM doesn't know which cluster each point came from. Color of dots = responsibility for component 1 (orange → teal). Watch EM discover the two clusters. Dashed lines = estimated means.

Iteration 0

EM is guaranteed to do what at each iteration?

Never decrease the observed-data log-likelihood (monotone ascent) Converge to the global maximum of the likelihood Decrease the number of model parameters

Chapter 7: E-step and M-step Derived

Let us work through EM concretely for a Bayesian network with missing data. We observe x^obs, the missing variables are z (latent). The complete-data log-likelihood is log P(x^obs, z | θ). We can't maximize this directly because z is unknown. Instead, EM works with its expectation over the posterior of z given the observed data.

Define the Q function (the E-step output):

Q(θ, θ^old) = E_{z | x^obs, θ^old} [ log P(x^obs, z | θ) ]

The E-step computes Q. The M-step finds θ^new = arg max_θ Q(θ, θ^old).

Why Does EM Work?

Jensen's inequality guarantees that for any distribution q(z):

log P(x^obs | θ) ≥ E_q[log P(x^obs, z | θ)] − E_q[log q(z)]

The right side is a lower bound called the Evidence Lower BOund (ELBO). The E-step sets q(z) = P(z | x^obs, θ^old) which makes the bound tight at the current θ. The M-step maximizes the bound in θ. Since we maximized a lower bound and the likelihood is above it, the likelihood can only increase.

EM for Bayesian networks with missing data (the textbook algorithm). For each observation i, compute the joint probability of all variable assignments (summing out missing ones). These "soft counts" replace the hard counts of MLE. The M-step is then identical to MLE but with fractional counts instead of integer counts — same formula, softer data.

python
import numpy as np

def em_gaussian_mixture(X, K=2, max_iter=100, tol=1e-6):
    """EM for K-component Gaussian mixture on 1D data X."""
    m = len(X)
    # Initialize: random means, unit variances, uniform weights
    mu    = X[np.random.choice(m, K, replace=False)]
    sigma = np.ones(K)
    pi    = np.ones(K) / K
    log_lik_prev = -np.inf

    for _ in range(max_iter):
        # E-step: compute responsibilities r[i,k] = P(z_i=k | x_i, theta)
        r = np.zeros((m, K))
        for k in range(K):
            r[:, k] = pi[k] * np.exp(-0.5*((X - mu[k])/sigma[k])**2) / (sigma[k]*np.sqrt(2*np.pi))
        r /= r.sum(axis=1, keepdims=True)  # normalize rows

        # M-step: re-estimate parameters using soft counts
        Nk = r.sum(axis=0)
        mu    = (r * X[:, None]).sum(0) / Nk
        sigma = np.sqrt((r * (X[:,None] - mu)**2).sum(0) / Nk)
        pi    = Nk / m

        # Check convergence via log-likelihood
        log_lik = np.sum(np.log(
            sum(pi[k]*np.exp(-0.5*((X-mu[k])/sigma[k])**2)/(sigma[k]*np.sqrt(2*np.pi))
                for k in range(K))))
        if abs(log_lik - log_lik_prev) < tol: break
        log_lik_prev = log_lik

    return mu, sigma, pi

ELBO Lower Bound — EM Convergence

The true log-likelihood (gray) and the ELBO lower bound (teal) for a simple coin model with 65H/35T. Each EM step tightens the bound at the current θ (orange dot) then maximizes it. Monotone ascent guaranteed.

Iteration 0

In the M-step of EM for a Gaussian mixture, how do we compute the new mean μ_k of component k?

Weighted average of all data points, with weights = responsibilities r_ik Average of data points assigned to component k with hard assignment The mode of the posterior P(z | x, θ)

Chapter 8: Bayesian vs Frequentist

We have now seen two fundamentally different philosophies for parameter estimation. They give different answers, especially with small data — and understanding the difference helps you choose the right tool.

Frequentist (MLE). Parameters are fixed but unknown constants. Probability describes the long-run frequency of events. "P(heads) = 0.6" is a claim about the physical world. The MLE is the value that most likely generated the data. No prior needed. Works badly with tiny samples.

Bayesian. Parameters are random variables with distributions. Probability describes degrees of belief. "P(heads) = 0.6" means "I believe the bias is 0.6 with some certainty." The posterior encodes all uncertainty about θ. Prior encodes domain knowledge. Requires a prior (subjective), but gives richer output.

When They Agree and When They Diverge

With lots of data, both approaches converge to the same answer. The likelihood dominates the prior, so the posterior concentrates near the MLE. This is the Bernstein-von Mises theorem: under mild conditions, the posterior converges to a point mass at the true parameter as m → ∞.

They diverge sharply with small data. The MLE gives extreme estimates (0 probability for unseen events, 1.0 for events that always occurred). The Bayesian posterior stays uncertain, pulled toward the prior. With 0 data, the MLE is undefined; the Bayesian estimate is the prior mean.

The MAP estimate bridges both. The Maximum A Posteriori (MAP) estimate finds the mode of the posterior: θ^MAP = arg max_θ P(θ | D) = arg max_θ [log P(D | θ) + log P(θ)]. It is like MLE with regularization: the log-prior acts as a penalty on extreme values. L2 regularization corresponds to a Gaussian prior; L1 (sparsity) corresponds to a Laplace prior.

Property	MLE (Frequentist)	Bayesian Posterior	MAP
Output	Point estimate θ^*	Full distribution P(θ\|D)	Point estimate θ^MAP
Prior	None needed	Required	Required (acts as regularizer)
Small data	Overconfident, extreme	Uncertain, prior-dominated	Shrunk toward prior mode
Large data	Consistent, efficient	Converges to MLE	Converges to MLE
Uncertainty	None (point estimate)	Full posterior	None (point estimate)
Computation	Optimization	Integration (often hard)	Optimization (easier than full Bayes)

MLE vs MAP vs Posterior Mean

Same coin data, three estimates plotted on the [0,1] interval. With few flips they diverge; with many, they converge. MAP uses a Beta(2,2) prior. Adjust true P(heads) and see how fast each method tracks it.

0 flips

True P(heads) 0.70

The MAP estimate θ^MAP = arg max P(θ | D) is equivalent to MLE with what modification?

Adding log P(θ) to the log-likelihood — the prior acts as a regularizer Dividing the likelihood by the marginal P(D) Computing the mean of the posterior instead of the mode

Chapter 9: Connections & What's Next

Parameter learning does not exist in isolation. Every technique here feeds directly into the broader machinery of probabilistic modeling. Here is where each thread leads.

MLE → Deep Learning

Neural network training with cross-entropy loss is MLE under a categorical likelihood. Gradient descent maximizes the log-likelihood. Adam, SGD — all are MLE optimization strategies.

↓

Bayesian Learning → Chapter 5 (Structure Learning)

To score Bayesian network structures, we compute the marginal likelihood P(D | structure) = ∫ P(D | θ, structure) P(θ) dθ. With Dirichlet priors, this integral has a closed form — the Bayesian Dirichlet (BD) score.

↓

KDE → Particle Filters, Generative Models

Particle filters represent beliefs as weighted samples; KDE converts them back to densities. Score-based diffusion models learn the score function ∇ log p(x) — the gradient of a KDE-like density.

↓

EM → Hidden Markov Models, VAEs

The Baum-Welch algorithm for HMMs is exactly EM with the forward-backward algorithm in the E-step. Variational Autoencoders are amortized variational EM — the encoder learns to approximate the E-step.

Limitations of Parameter Learning

This chapter assumed the structure of the model was known — we knew it was a coin, a Gaussian, a specific Bayesian network. In practice, you also need to choose the right model family. Chapter 5 (Structure Learning) addresses this: given data, find the best graph structure, not just the best parameters.

Method	Best When	Fails When	Key Parameter
MLE	Large data, known model family	Small data, zero counts	—
Bayesian (Beta/Dirichlet)	Small data, need uncertainty	Prior wrong, intractable posteriors	Pseudocounts α
MAP	Need point estimate + regularization	Prior misspecified	Prior strength
KDE	Unknown density shape, large data	High dimensions (curse)	Bandwidth h
k-NN density	Adaptive local density	High dimensions, boundaries	Neighbors k
EM	Missing data, latent variables	Local optima, slow convergence	Init., num. components

The golden rule of parameter learning: More data beats a better algorithm. If you have 10,000 samples, MLE is usually good enough. If you have 10, the prior matters enormously. Know your data regime before choosing your method.

Related micro-lessons:

Bayesian Estimation — priors, posteriors, credible intervals
Bayesian Networks — the structure these parameters fill
Hidden Markov Models — EM in sequential settings

Continue in this series:

"The purpose of models is not to fit the data but to sharpen the questions." — Samuel Karlin

With parameters learned, the full probabilistic model is specified. The next step is asking whether we chose the right structure — which variables should connect to which. That is Chapter 5.