K-means, Gaussian mixtures, and the expectation-maximization algorithm — learning models with latent variables.
A single Gaussian can model a unimodal, elliptical cluster. But real data is often multimodal — think of handwritten digits (10 clusters), customer segments, or gene expression profiles. A mixture model combines K simpler distributions to model complex, multimodal data:
where πk are mixing coefficients (must be non-negative, sum to 1) and p(x|k) is the k-th component distribution.
The challenge: with latent variables, the log-likelihood contains a sum inside a logarithm, making direct optimization difficult. The EM algorithm handles this elegantly.
K-means is the simplest clustering algorithm. It assigns each data point to the nearest cluster center, then updates centers to the mean of assigned points.
Algorithm:
1. Initialize K cluster centers μ1, ..., μK.
2. Assign: Each point xn to nearest center: rnk = 1 if k = arg minj ||xn − μj||2.
3. Update: μk = ∑n rnk xn / ∑n rnk.
4. Repeat 2-3 until convergence.
The objective minimized is:
Limitations: K-means assumes spherical clusters of equal size, makes hard assignments (no uncertainty), and is sensitive to initialization. Gaussian mixtures address all of these.
A Gaussian mixture model (GMM) uses K Gaussian components:
Parameters: K mixing coefficients πk, K mean vectors μk, and K covariance matrices Σk.
The responsibility of component k for data point xn is the posterior probability:
The log-likelihood for a GMM is:
The sum inside the logarithm prevents a closed-form solution. Setting derivatives to zero gives equations that couple all parameters together — no clean solution like single-Gaussian ML.
Other issues: identifiability (K! equivalent modes from relabeling components), sensitivity to initialization (EM finds local maxima, not the global maximum), and choosing K (model selection).
The Expectation-Maximization (EM) algorithm iterates between two steps:
E-step: Compute responsibilities using current parameters:
M-step: Re-estimate parameters using responsibilities:
Nk = ∑n γ(znk) is the effective number of points assigned to component k. The new mean is a responsibility-weighted average. The new mixing coefficient is the fraction of "effective points" in each component.
Watch the EM algorithm fit a Gaussian mixture model to 2D data. Each step alternates between computing responsibilities (coloring points by cluster membership) and updating the Gaussian parameters.
Click "Step" to run one EM iteration. Watch how responsibilities (point colors) and Gaussian ellipses evolve.
Why does EM work? The log-likelihood can be decomposed for any distribution q(Z) over latent variables:
where L(q, θ) is the Evidence Lower Bound (ELBO) and KL is the Kullback-Leibler divergence:
Since KL ≥ 0, we always have L ≤ ln p(X|θ). The ELBO is a lower bound on the log-likelihood. Each EM step increases the ELBO, which guarantees the log-likelihood never decreases.
This decomposition is the foundation of variational inference (Chapter 10), where we restrict q to a tractable family instead of using the exact posterior.
The EM algorithm applies to any model with latent variables, not just Gaussian mixtures. The general recipe:
E-step: Compute the expected complete-data log-likelihood under the current posterior over latent variables:
M-step: Maximize Q with respect to θ:
Generalized EM (GEM): Instead of fully maximizing Q in the M-step, we can just increase it. This relaxation is useful when the M-step has no closed form (e.g., we can use gradient ascent). The monotonic convergence guarantee still holds.
Mixtures aren't limited to Gaussians. For binary data (e.g., binary images of handwritten digits), we use a mixture of Bernoulli distributions:
where μk is a D-dimensional vector of pixel probabilities for component k.
The EM algorithm follows the same pattern:
E-step: Compute responsibilities γ(znk) using Bernoulli likelihoods.
M-step: μkinew = ∑n γ(znk) xni / Nk — the responsibility-weighted mean of the data.
The same EM framework extends to mixtures of multinomials (for text), mixtures of Poissons (for count data), or any exponential family member.
EM can also handle models where latent variables are continuous parameters with priors. Consider Bayesian linear regression with hyperparameters α (weight precision) and β (noise precision). The evidence framework from Ch 3 can be viewed as EM:
E-step: Compute posterior p(w|t, α, β) = N(w|m, Σ).
M-step: Update α and β by maximizing the expected complete-data log-likelihood, giving the re-estimation formulas from Ch 3:
| Concept | Key idea |
|---|---|
| K-means | Hard assignment to nearest center; the zero-variance limit of GMM EM |
| GMM | Soft assignment via responsibilities; models multimodal data |
| EM algorithm | E: compute posterior over latent variables. M: maximize expected complete-data likelihood |
| ELBO | ln p(X|θ) = L(q,θ) + KL(q||p); EM maximizes L |
| Convergence | Monotonically increases log-likelihood; converges to local maximum |
What comes next: Chapter 10 develops variational inference, which generalizes EM: instead of computing the exact posterior q = p(Z|X,θ) in the E-step, we approximate it with a tractable family. This enables inference in models where the exact posterior is intractable.