Bishop PRML, Chapter 9

Mixture Models and EM

K-means, Gaussian mixtures, and the expectation-maximization algorithm — learning models with latent variables.

Prerequisites: Chapters 2, 8 (Gaussians, graphical models).
11
Chapters
2
Simulations
11
Quizzes

Chapter 0: Why Mixtures?

A single Gaussian can model a unimodal, elliptical cluster. But real data is often multimodal — think of handwritten digits (10 clusters), customer segments, or gene expression profiles. A mixture model combines K simpler distributions to model complex, multimodal data:

p(x) = ∑k=1K πk p(x|k)

where πk are mixing coefficients (must be non-negative, sum to 1) and p(x|k) is the k-th component distribution.

Latent variables: The mixture model has a natural interpretation using a latent variable z (a one-hot vector) indicating which component generated each data point. The generative process: sample zk = 1 with probability πk, then sample x from p(x|k). We never observe z directly — it's "hidden" or "latent." This latent variable perspective leads directly to the EM algorithm.

The challenge: with latent variables, the log-likelihood contains a sum inside a logarithm, making direct optimization difficult. The EM algorithm handles this elegantly.

Check: What is the role of latent variables in a mixture model?

Chapter 1: K-Means Clustering

K-means is the simplest clustering algorithm. It assigns each data point to the nearest cluster center, then updates centers to the mean of assigned points.

Algorithm:

1. Initialize K cluster centers μ1, ..., μK.

2. Assign: Each point xn to nearest center: rnk = 1 if k = arg minj ||xnμj||2.

3. Update: μk = ∑n rnk xn / ∑n rnk.

4. Repeat 2-3 until convergence.

The objective minimized is:

J = ∑n=1Nk=1K rnk ||xnμk||2
K-means as hard EM: K-means is a limiting case of EM for Gaussian mixtures. When all covariances are σ2I and we take σ2 → 0, the "soft" responsibilities γ(znk) (probabilities of assignment) become "hard" assignments rnk ∈ {0, 1}. K-means is the zero-temperature limit of a Gaussian mixture model.

Limitations: K-means assumes spherical clusters of equal size, makes hard assignments (no uncertainty), and is sensitive to initialization. Gaussian mixtures address all of these.

Check: How is K-means related to Gaussian mixtures?

Chapter 2: Gaussian Mixture Models

A Gaussian mixture model (GMM) uses K Gaussian components:

p(x) = ∑k=1K πk N(x|μk, Σk)

Parameters: K mixing coefficients πk, K mean vectors μk, and K covariance matrices Σk.

The responsibility of component k for data point xn is the posterior probability:

γ(znk) = p(zk=1|xn) = πk N(xn|μk, Σk) / ∑j πj N(xn|μj, Σj)
Soft vs hard assignment: Unlike K-means where each point belongs to exactly one cluster, the GMM gives each point a probability of belonging to each cluster. Point xn might have responsibilities (0.7, 0.2, 0.1) across three components — 70% likely from component 1, 20% from 2, 10% from 3. This captures uncertainty in cluster membership.
Check: What is a "responsibility" in a Gaussian mixture model?

Chapter 3: Maximum Likelihood Difficulties

The log-likelihood for a GMM is:

ln p(X|π, μ, Σ) = ∑n=1N ln { ∑k=1K πk N(xn|μk, Σk) }

The sum inside the logarithm prevents a closed-form solution. Setting derivatives to zero gives equations that couple all parameters together — no clean solution like single-Gaussian ML.

Singularities: The GMM likelihood has dangerous singularities. If a component "collapses" onto a single data point (μkxn, σk → 0), the Gaussian density N(xn|μk, σk2) → ∞. The log-likelihood goes to infinity! This is a pathology of maximum likelihood — the Bayesian approach (Chapter 10) avoids it by putting priors on the parameters.

Other issues: identifiability (K! equivalent modes from relabeling components), sensitivity to initialization (EM finds local maxima, not the global maximum), and choosing K (model selection).

Check: What pathology can occur in maximum likelihood estimation of GMMs?

Chapter 4: EM for Gaussian Mixtures

The Expectation-Maximization (EM) algorithm iterates between two steps:

E-step: Compute responsibilities using current parameters:

γ(znk) = πk N(xn|μk, Σk) / ∑j πj N(xn|μj, Σj)

M-step: Re-estimate parameters using responsibilities:

μknew = (1/Nk) ∑n γ(znk) xn
Σknew = (1/Nk) ∑n γ(znk) (xnμknew)(xnμknew)T
πknew = Nk / N     where   Nk = ∑n γ(znk)
EM intuition: If we knew the latent assignments z, ML would be easy (just fit a Gaussian to each cluster's points). If we knew the parameters, computing responsibilities would be easy (just Bayes' rule). EM breaks the chicken-and-egg problem by alternating: use current parameters to guess assignments (E), then use guessed assignments to update parameters (M). Each iteration is guaranteed to increase (or maintain) the log-likelihood.

Nk = ∑n γ(znk) is the effective number of points assigned to component k. The new mean is a responsibility-weighted average. The new mixing coefficient is the fraction of "effective points" in each component.

Check: What does the E-step of EM compute?

Chapter 5: EM in Action

Watch the EM algorithm fit a Gaussian mixture model to 2D data. Each step alternates between computing responsibilities (coloring points by cluster membership) and updating the Gaussian parameters.

EM for Gaussian Mixtures

Click "Step" to run one EM iteration. Watch how responsibilities (point colors) and Gaussian ellipses evolve.

Iteration: 0
Check: Is the EM algorithm guaranteed to find the global maximum of the likelihood?

Chapter 6: The ELBO Decomposition

Why does EM work? The log-likelihood can be decomposed for any distribution q(Z) over latent variables:

ln p(X|θ) = L(q, θ) + KL(q || p(Z|X,θ))

where L(q, θ) is the Evidence Lower Bound (ELBO) and KL is the Kullback-Leibler divergence:

L(q, θ) = ∑Z q(Z) ln {p(X,Z|θ) / q(Z)}
The ELBO view of EM:
E-step: Fix θ, maximize L over q. Since KL ≥ 0, L = ln p(X|θ) − KL. Maximizing L means minimizing KL, which is achieved by setting q(Z) = p(Z|X,θ) (the exact posterior). This makes KL = 0 and L = ln p(X|θ).
M-step: Fix q, maximize L over θ. With q set to the old posterior, this is equivalent to maximizing the expected complete-data log-likelihood.

Since KL ≥ 0, we always have L ≤ ln p(X|θ). The ELBO is a lower bound on the log-likelihood. Each EM step increases the ELBO, which guarantees the log-likelihood never decreases.

This decomposition is the foundation of variational inference (Chapter 10), where we restrict q to a tractable family instead of using the exact posterior.

Check: What does the E-step do in terms of the ELBO decomposition?

Chapter 7: EM in General

The EM algorithm applies to any model with latent variables, not just Gaussian mixtures. The general recipe:

E-step: Compute the expected complete-data log-likelihood under the current posterior over latent variables:

Q(θ, θold) = EZ|X,θold [ln p(X, Z|θ)]

M-step: Maximize Q with respect to θ:

θnew = arg maxθ Q(θ, θold)
EM applications across the book: EM appears in many forms — GMMs (this chapter), factor analysis (Ch 12), HMMs (Ch 13, where the E-step is the forward-backward algorithm), probabilistic PCA (Ch 12). The same principle applies every time: if direct ML is hard because of latent variables, introduce them explicitly, compute their posterior (E), and optimize the expected complete-data likelihood (M).

Generalized EM (GEM): Instead of fully maximizing Q in the M-step, we can just increase it. This relaxation is useful when the M-step has no closed form (e.g., we can use gradient ascent). The monotonic convergence guarantee still holds.

Check: Can EM be applied only to Gaussian mixtures?

Chapter 8: Mixtures of Bernoullis

Mixtures aren't limited to Gaussians. For binary data (e.g., binary images of handwritten digits), we use a mixture of Bernoulli distributions:

p(x|μk) = ∏i=1D μkixi (1 − μki)1−xi

where μk is a D-dimensional vector of pixel probabilities for component k.

The EM algorithm follows the same pattern:

E-step: Compute responsibilities γ(znk) using Bernoulli likelihoods.

M-step: μkinew = ∑n γ(znk) xni / Nk — the responsibility-weighted mean of the data.

Bernoulli mixtures for digits: With K = 10 components on MNIST-style binary images, each component learns the "average" image for one digit. Component k's μk is a template: high values where the digit has ink, low values where it's blank. The responsibility tells us which digit template best matches a given image.

The same EM framework extends to mixtures of multinomials (for text), mixtures of Poissons (for count data), or any exponential family member.

Check: What does each component μ_k represent in a Bernoulli mixture for binary images?

Chapter 9: EM for Bayesian Regression

EM can also handle models where latent variables are continuous parameters with priors. Consider Bayesian linear regression with hyperparameters α (weight precision) and β (noise precision). The evidence framework from Ch 3 can be viewed as EM:

E-step: Compute posterior p(w|t, α, β) = N(w|m, Σ).

M-step: Update α and β by maximizing the expected complete-data log-likelihood, giving the re-estimation formulas from Ch 3:

γ = ∑i γi,    αnew = γ / mTm,    (1/β)new = ||tΦm||2 / (N − γ)
EM unifies the evidence framework: The re-estimation equations for α and β from Ch 3 — which seemed like clever algebra — are revealed as the M-step of an EM algorithm. The posterior over weights is the E-step. This shows the deep connection: the evidence framework is EM applied to a Bayesian regression model, treating the weights as latent variables and the hyperparameters as parameters to be estimated.
Check: What is the "latent variable" in the EM view of Bayesian regression?

Chapter 10: Summary

ConceptKey idea
K-meansHard assignment to nearest center; the zero-variance limit of GMM EM
GMMSoft assignment via responsibilities; models multimodal data
EM algorithmE: compute posterior over latent variables. M: maximize expected complete-data likelihood
ELBOln p(X|θ) = L(q,θ) + KL(q||p); EM maximizes L
ConvergenceMonotonically increases log-likelihood; converges to local maximum
EM as a unifying idea: The EM algorithm is one of the most important algorithms in machine learning. It turns hard optimization problems (with sums inside logarithms) into sequences of easier problems. The ELBO decomposition connects EM to variational inference (Ch 10), and the latent variable perspective connects to graphical models (Ch 8). Nearly every generative model in the rest of the book uses EM or a variant of it.

What comes next: Chapter 10 develops variational inference, which generalizes EM: instead of computing the exact posterior q = p(Z|X,θ) in the E-step, we approximate it with a tractable family. This enables inference in models where the exact posterior is intractable.

"The expectation-maximization algorithm, or EM algorithm,
is a general technique for finding maximum likelihood solutions
for probabilistic models having latent variables."
— Christopher Bishop, PRML §9.4
Check: What guarantee does the EM algorithm provide about the log-likelihood?