Ch 9: Mixture Models and EM

Chapter 0: Why Mixtures?

A single Gaussian can model a unimodal, elliptical cluster. But real data is often multimodal — think of handwritten digits (10 clusters), customer segments, or gene expression profiles. A mixture model combines K simpler distributions to model complex, multimodal data:

p(x) = ∑_k=1^K π_k p(x|k)

where π_k are mixing coefficients (must be non-negative, sum to 1) and p(x|k) is the k-th component distribution.

Latent variables: The mixture model has a natural interpretation using a latent variable z (a one-hot vector) indicating which component generated each data point. The generative process: sample z_k = 1 with probability π_k, then sample x from p(x|k). We never observe z directly — it's "hidden" or "latent." This latent variable perspective leads directly to the EM algorithm.

The challenge: with latent variables, the log-likelihood contains a sum inside a logarithm, making direct optimization difficult. The EM algorithm handles this elegantly.

Check: What is the role of latent variables in a mixture model?

They indicate which component generated each data point, but are never observed They store the model parameters They are the input features

Chapter 1: K-Means Clustering

K-means is the simplest clustering algorithm. It assigns each data point to the nearest cluster center, then updates centers to the mean of assigned points.

Algorithm:

1. Initialize K cluster centers μ₁, ..., μ_K.

2. Assign: Each point x_n to nearest center: r_nk = 1 if k = arg min_j ||x_n − μ_j||².

3. Update: μ_k = ∑_n r_nk x_n / ∑_n r_nk.

4. Repeat 2-3 until convergence.

The objective minimized is:

J = ∑_n=1^N ∑_k=1^K r_nk ||x_n − μ_k||²

K-means as hard EM: K-means is a limiting case of EM for Gaussian mixtures. When all covariances are σ²I and we take σ² → 0, the "soft" responsibilities γ(z_nk) (probabilities of assignment) become "hard" assignments r_nk ∈ {0, 1}. K-means is the zero-temperature limit of a Gaussian mixture model.

Limitations: K-means assumes spherical clusters of equal size, makes hard assignments (no uncertainty), and is sensitive to initialization. Gaussian mixtures address all of these.

Check: How is K-means related to Gaussian mixtures?

K-means is the hard-assignment, zero-variance limit of EM for Gaussian mixtures K-means uses Gaussians directly They are completely unrelated

Chapter 2: Gaussian Mixture Models

A Gaussian mixture model (GMM) uses K Gaussian components:

p(x) = ∑_k=1^K π_k N(x|μ_k, Σ_k)

Parameters: K mixing coefficients π_k, K mean vectors μ_k, and K covariance matrices Σ_k.

The responsibility of component k for data point x_n is the posterior probability:

γ(z_nk) = p(z_k=1|x_n) = π_k N(x_n|μ_k, Σ_k) / ∑_j π_j N(x_n|μ_j, Σ_j)

Soft vs hard assignment: Unlike K-means where each point belongs to exactly one cluster, the GMM gives each point a probability of belonging to each cluster. Point x_n might have responsibilities (0.7, 0.2, 0.1) across three components — 70% likely from component 1, 20% from 2, 10% from 3. This captures uncertainty in cluster membership.

Check: What is a "responsibility" in a Gaussian mixture model?

The posterior probability that a particular component generated a given data point The mixing coefficient of a component The distance from a point to a cluster center

Chapter 3: Maximum Likelihood Difficulties

The log-likelihood for a GMM is:

ln p(X|π, μ, Σ) = ∑_n=1^N ln { ∑_k=1^K π_k N(x_n|μ_k, Σ_k) }

The sum inside the logarithm prevents a closed-form solution. Setting derivatives to zero gives equations that couple all parameters together — no clean solution like single-Gaussian ML.

Singularities: The GMM likelihood has dangerous singularities. If a component "collapses" onto a single data point (μ_k → x_n, σ_k → 0), the Gaussian density N(x_n|μ_k, σ_k²) → ∞. The log-likelihood goes to infinity! This is a pathology of maximum likelihood — the Bayesian approach (Chapter 10) avoids it by putting priors on the parameters.

Other issues: identifiability (K! equivalent modes from relabeling components), sensitivity to initialization (EM finds local maxima, not the global maximum), and choosing K (model selection).

Check: What pathology can occur in maximum likelihood estimation of GMMs?

A component can collapse onto a single data point, sending the likelihood to infinity (a singularity) The algorithm always diverges The means converge to zero

Chapter 4: EM for Gaussian Mixtures

The Expectation-Maximization (EM) algorithm iterates between two steps:

E-step: Compute responsibilities using current parameters:

γ(z_nk) = π_k N(x_n|μ_k, Σ_k) / ∑_j π_j N(x_n|μ_j, Σ_j)

M-step: Re-estimate parameters using responsibilities:

μ_k^new = (1/N_k) ∑_n γ(z_nk) x_n

Σ_k^new = (1/N_k) ∑_n γ(z_nk) (x_n − μ_k^new)(x_n − μ_k^new)^T

π_k^new = N_k / N where N_k = ∑_n γ(z_nk)

EM intuition: If we knew the latent assignments z, ML would be easy (just fit a Gaussian to each cluster's points). If we knew the parameters, computing responsibilities would be easy (just Bayes' rule). EM breaks the chicken-and-egg problem by alternating: use current parameters to guess assignments (E), then use guessed assignments to update parameters (M). Each iteration is guaranteed to increase (or maintain) the log-likelihood.

N_k = ∑_n γ(z_nk) is the effective number of points assigned to component k. The new mean is a responsibility-weighted average. The new mixing coefficient is the fraction of "effective points" in each component.

Check: What does the E-step of EM compute?

The responsibilities (posterior probabilities of latent assignments) using current parameters New values for the means and covariances The gradient of the log-likelihood

Chapter 5: EM in Action

Watch the EM algorithm fit a Gaussian mixture model to 2D data. Each step alternates between computing responsibilities (coloring points by cluster membership) and updating the Gaussian parameters.

EM for Gaussian Mixtures

Click "Step" to run one EM iteration. Watch how responsibilities (point colors) and Gaussian ellipses evolve.

Iteration: 0

Check: Is the EM algorithm guaranteed to find the global maximum of the likelihood?

Yes, always No — EM converges to a local maximum (or saddle point), which depends on initialization EM may decrease the likelihood

Chapter 6: The ELBO Decomposition

Why does EM work? The log-likelihood can be decomposed for any distribution q(Z) over latent variables:

ln p(X|θ) = L(q, θ) + KL(q || p(Z|X,θ))

where L(q, θ) is the Evidence Lower Bound (ELBO) and KL is the Kullback-Leibler divergence:

L(q, θ) = ∑_Z q(Z) ln {p(X,Z|θ) / q(Z)}

The ELBO view of EM:
E-step: Fix θ, maximize L over q. Since KL ≥ 0, L = ln p(X|θ) − KL. Maximizing L means minimizing KL, which is achieved by setting q(Z) = p(Z|X,θ) (the exact posterior). This makes KL = 0 and L = ln p(X|θ).
M-step: Fix q, maximize L over θ. With q set to the old posterior, this is equivalent to maximizing the expected complete-data log-likelihood.

Since KL ≥ 0, we always have L ≤ ln p(X|θ). The ELBO is a lower bound on the log-likelihood. Each EM step increases the ELBO, which guarantees the log-likelihood never decreases.

This decomposition is the foundation of variational inference (Chapter 10), where we restrict q to a tractable family instead of using the exact posterior.

Check: What does the E-step do in terms of the ELBO decomposition?

Sets q(Z) to the true posterior p(Z|X,theta), making KL = 0 and the ELBO equal to the log-likelihood Minimizes the ELBO Maximizes the KL divergence

Chapter 7: EM in General

The EM algorithm applies to any model with latent variables, not just Gaussian mixtures. The general recipe:

E-step: Compute the expected complete-data log-likelihood under the current posterior over latent variables:

Q(θ, θ^old) = E_Z|X,θ^old [ln p(X, Z|θ)]

M-step: Maximize Q with respect to θ:

θ^new = arg max_θ Q(θ, θ^old)

EM applications across the book: EM appears in many forms — GMMs (this chapter), factor analysis (Ch 12), HMMs (Ch 13, where the E-step is the forward-backward algorithm), probabilistic PCA (Ch 12). The same principle applies every time: if direct ML is hard because of latent variables, introduce them explicitly, compute their posterior (E), and optimize the expected complete-data likelihood (M).

Generalized EM (GEM): Instead of fully maximizing Q in the M-step, we can just increase it. This relaxation is useful when the M-step has no closed form (e.g., we can use gradient ascent). The monotonic convergence guarantee still holds.

Check: Can EM be applied only to Gaussian mixtures?

Yes, it's specific to GMMs No — EM is a general framework for any model with latent variables (GMMs, HMMs, factor analysis, etc.) Only to exponential family models

Chapter 8: Mixtures of Bernoullis

Mixtures aren't limited to Gaussians. For binary data (e.g., binary images of handwritten digits), we use a mixture of Bernoulli distributions:

p(x|μ_k) = ∏_i=1^D μ_ki^x_i (1 − μ_ki)^1−x_i

where μ_k is a D-dimensional vector of pixel probabilities for component k.

The EM algorithm follows the same pattern:

E-step: Compute responsibilities γ(z_nk) using Bernoulli likelihoods.

M-step: μ_ki^new = ∑_n γ(z_nk) x_ni / N_k — the responsibility-weighted mean of the data.

Bernoulli mixtures for digits: With K = 10 components on MNIST-style binary images, each component learns the "average" image for one digit. Component k's μ_k is a template: high values where the digit has ink, low values where it's blank. The responsibility tells us which digit template best matches a given image.

The same EM framework extends to mixtures of multinomials (for text), mixtures of Poissons (for count data), or any exponential family member.

Check: What does each component μ_k represent in a Bernoulli mixture for binary images?

A template of pixel probabilities — the "average" binary image for that component The variance of each pixel The class label

Chapter 9: EM for Bayesian Regression

EM can also handle models where latent variables are continuous parameters with priors. Consider Bayesian linear regression with hyperparameters α (weight precision) and β (noise precision). The evidence framework from Ch 3 can be viewed as EM:

E-step: Compute posterior p(w|t, α, β) = N(w|m, Σ).

M-step: Update α and β by maximizing the expected complete-data log-likelihood, giving the re-estimation formulas from Ch 3:

γ = ∑_i γ_i, α^new = γ / m^Tm, (1/β)^new = ||t − Φm||² / (N − γ)

EM unifies the evidence framework: The re-estimation equations for α and β from Ch 3 — which seemed like clever algebra — are revealed as the M-step of an EM algorithm. The posterior over weights is the E-step. This shows the deep connection: the evidence framework is EM applied to a Bayesian regression model, treating the weights as latent variables and the hyperparameters as parameters to be estimated.

Check: What is the "latent variable" in the EM view of Bayesian regression?

The weights w — they have a posterior distribution (E-step) while hyperparameters alpha, beta are optimized (M-step) The data points The noise variance

Chapter 10: Summary

Concept	Key idea
K-means	Hard assignment to nearest center; the zero-variance limit of GMM EM
GMM	Soft assignment via responsibilities; models multimodal data
EM algorithm	E: compute posterior over latent variables. M: maximize expected complete-data likelihood
ELBO	ln p(X\|θ) = L(q,θ) + KL(q\|\|p); EM maximizes L
Convergence	Monotonically increases log-likelihood; converges to local maximum

EM as a unifying idea: The EM algorithm is one of the most important algorithms in machine learning. It turns hard optimization problems (with sums inside logarithms) into sequences of easier problems. The ELBO decomposition connects EM to variational inference (Ch 10), and the latent variable perspective connects to graphical models (Ch 8). Nearly every generative model in the rest of the book uses EM or a variant of it.

What comes next: Chapter 10 develops variational inference, which generalizes EM: instead of computing the exact posterior q = p(Z|X,θ) in the E-step, we approximate it with a tractable family. This enables inference in models where the exact posterior is intractable.

"The expectation-maximization algorithm, or EM algorithm,
is a general technique for finding maximum likelihood solutions
for probabilistic models having latent variables."
— Christopher Bishop, PRML §9.4

Check: What guarantee does the EM algorithm provide about the log-likelihood?

It never decreases the log-likelihood — each iteration increases or maintains it, converging to a local maximum It always finds the global maximum It converges in a fixed number of steps