When one Gaussian is not enough — modelling multi-modal data with soft clustering and EM.
Imagine you measure the heights of people in a room, and you get a beautiful bell curve. A single Gaussian fits perfectly. Now imagine you measure the heights of people and their dogs. You get two bumps — one around 170 cm, one around 50 cm. A single Gaussian cannot capture that shape. It would place its mean somewhere around 110 cm, where almost no actual data points live.
This is the multi-modal problem. Real-world data rarely comes from one clean bell curve. Customer spending has frugal shoppers and big spenders. Image pixels cluster into distinct color groups. Gene expression data groups cells into types. Whenever the underlying population has subgroups, a single Gaussian fails.
The orange curve is a single Gaussian fit. Notice how it smears across the gap between the two clusters, assigning high density where no data exists. Click Toggle Mixture to see how two Gaussians capture the data far better.
The fix is simple in concept: instead of one Gaussian, use several. Each Gaussian captures one cluster. We weight them by how prevalent each cluster is, and the overall density becomes a weighted sum. This is a Gaussian Mixture Model (GMM).
But this raises hard questions. How do we figure out which Gaussian each data point belongs to? How do we estimate the means, covariances, and weights when we don't know the assignments? This chapter introduces the Expectation-Maximization (EM) algorithm — an elegant iterative procedure that solves all of these simultaneously.
A Gaussian Mixture Model is a weighted sum of K Gaussian densities. Each component k has its own mean μk, covariance Σk, and mixing weight πk. The overall density at a point x is:
where N(x | μ, Σ) is the multivariate Gaussian density:
The mixing weights πk satisfy two constraints: each weight is non-negative (πk ≥ 0), and they sum to one (∑ πk = 1). Think of πk as the probability that a randomly drawn data point came from component k. If π1 = 0.7, then 70% of the data was generated by the first Gaussian.
| Parameter | Symbol | Meaning |
|---|---|---|
| Means | μ1, …, μK | Center of each cluster |
| Covariances | Σ1, …, ΣK | Shape & spread of each cluster |
| Weights | π1, …, πK | Relative size of each cluster (sum to 1) |
| Number of components | K | How many Gaussians (chosen by us) |
In 1D, each Gaussian is a bell curve. In 2D, each is an ellipse (the shape determined by its covariance matrix). A GMM in 2D looks like a landscape of overlapping hills, each centered on a different μk, each with its own tilt and spread from Σk.
Given N data points x1, …, xN, we want to find the parameters θ = {πk, μk, Σk}k=1..K that best explain the data. The standard approach: maximum likelihood estimation. We maximize the probability of observing our data under the model.
Assuming data points are i.i.d., the likelihood is a product over data points:
Taking the log (to turn products into sums) gives the log-likelihood:
Here is the problem. In standard maximum likelihood for a single Gaussian, the log passes through to the exponent and everything simplifies to a clean closed-form solution. But here the log sits outside a sum over k. The log of a sum does not simplify. Setting derivatives to zero gives equations where every parameter depends on every other parameter in a tangled mess.
There is a second subtle issue: the likelihood surface has singularities. If a component's mean μk collapses exactly onto a data point and its variance shrinks to zero, the density at that point goes to infinity. The likelihood becomes unbounded. In practice, we avoid this with regularization (adding a small diagonal to Σk) or by detecting degenerate components and resetting them.
At the heart of EM is a simple question: for each data point xn, how much does each component k "claim" it? This is the responsibility rnk — the posterior probability that component k generated point n.
We compute it with Bayes' theorem. The prior probability that point n came from component k is πk. The likelihood of xn given component k is N(xn | μk, Σk). By Bayes' rule:
Each rnk is between 0 and 1, and for each point n, the responsibilities sum to 1 across components: ∑k rnk = 1. A point near the center of cluster k gets rnk close to 1. A point equidistant from two clusters might have rnk ≈ 0.5 for each.
Drag the slider to move a test point along the x-axis. Watch how the responsibility shifts between the two components. Near a cluster center, one component dominates. In between, responsibilities are split.
We also define the effective number of points assigned to component k:
This is a soft count. If all 100 points have rn1 = 0.6, then N1 = 60. It tells us how much data each component "owns." We will use Nk to normalize the M-step updates.
Once we have the responsibilities (the E-step), we can update the parameters to better fit the data — this is the M-step. Let's start with the means. The new mean for component k is the responsibility-weighted average of all data points:
This makes perfect intuitive sense. Each point contributes to the mean of component k in proportion to how much component k claims it. A point with rnk = 0.9 pulls the mean strongly. A point with rnk = 0.01 barely matters. The normalizer Nk = ∑ rnk ensures the weights sum to 1.
Points are colored by responsibility (orange = component 1, teal = component 2). The weighted mean is the larger dot. Drag the slider to shift responsibilities and watch the mean move.
Why does this update increase the likelihood? Setting the derivative of the expected complete-data log-likelihood with respect to μk to zero gives exactly this formula. It is the MLE for a single Gaussian, but with each data point weighted by rnk. If we knew the true assignments, fitting each cluster's mean would be trivial. The responsibilities act as a surrogate for those unknown assignments.
The mean tells us where a cluster is; the covariance tells us its shape. Is it a tight circle? A wide ellipse tilted 45 degrees? The M-step updates each covariance Σk as a responsibility-weighted outer product of the residuals:
Read this carefully. For each data point, we compute its deviation from the new mean, form the outer product (which gives a matrix capturing direction and magnitude of deviation), and weight it by the responsibility. Sum them up, normalize by Nk, and you have the new covariance.
There is a practical concern: if too few points are assigned to a component (Nk very small), the covariance estimate becomes unreliable and can collapse. A common fix is to add a small regularization term: Σknew ← Σknew + ε I, where I is the identity matrix and ε is tiny (e.g. 10−6). This prevents singularities and keeps the matrix invertible.
python def update_covariance(X, mu_k, r_k, N_k, eps=1e-6): # X: (N, D), mu_k: (D,), r_k: (N,), N_k: scalar diff = X - mu_k # (N, D) weighted = r_k[:, None] * diff # (N, D) weighted residuals Sigma_k = (weighted.T @ diff) / N_k # (D, D) Sigma_k += eps * np.eye(X.shape[1]) # regularize return Sigma_k
The mixing weights πk represent how prevalent each cluster is. If cluster 1 claims most of the data (high N1), its weight should be large. The M-step update is the simplest of the three:
That's it. The new weight for component k is just the fraction of the total (soft) data count assigned to it. If N = 100 data points and Nk = 30 (counting soft assignments), then πk = 0.30. The constraint ∑ πk = 1 is automatically satisfied because ∑ Nk = N (every point's responsibilities sum to 1 across components).
Here is the complete set of M-step updates, side by side:
| Parameter | Update Rule | Intuition |
|---|---|---|
| μk | (1/Nk) ∑ rnk xn | Weighted center of mass |
| Σk | (1/Nk) ∑ rnk (xn−μk)(xn−μk)T | Weighted spread around mean |
| πk | Nk / N | Fraction of data owned |
Each formula is a soft-weighted version of the maximum likelihood estimator for a single Gaussian. If the responsibilities were hard (0 or 1), these would be exactly the MLE formulas for each component independently. The soft assignments gracefully interpolate.
Now we put it all together. The Expectation-Maximization algorithm for GMMs alternates between two steps until convergence:
Initialization matters. EM finds a local maximum, not necessarily the global one. Different starting points lead to different solutions. Common strategies:
| Strategy | How |
|---|---|
| Random | Pick K random data points as initial means, use identity covariances |
| K-means++ | Run K-means first, use its centroids as initial means |
| Multiple restarts | Run EM several times, keep the result with highest log-likelihood |
When to stop: Track the log-likelihood after each iteration. When the increase drops below a threshold (e.g. |Δ ln L| < 10−4), declare convergence. Alternatively, stop after a fixed number of iterations (e.g. 100).
python def em_gmm(X, K, max_iter=100, tol=1e-4): N, D = X.shape # Initialize mu = X[np.random.choice(N, K, replace=False)] Sigma = [np.eye(D) for _ in range(K)] pi = np.ones(K) / K for it in range(max_iter): # E-step: compute responsibilities r = np.zeros((N, K)) for k in range(K): r[:, k] = pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X) r /= r.sum(axis=1, keepdims=True) # M-step: update parameters Nk = r.sum(axis=0) for k in range(K): mu[k] = (r[:, k] @ X) / Nk[k] diff = X - mu[k] Sigma[k] = (r[:, k][:, None] * diff).T @ diff / Nk[k] Sigma[k] += 1e-6 * np.eye(D) pi = Nk / N # Check convergence (log-likelihood) ll = np.sum(np.log(r.sum(axis=1))) # simplified if it > 0 and abs(ll - prev_ll) < tol: break prev_ll = ll return mu, Sigma, pi
Time to see it work. Below is a 2D dataset with K = 3 true clusters. The algorithm starts with random initial means (the large circles) and spherical covariances. Click through the E-step and M-step to watch EM converge, or hit Auto-Run to animate the full process.
E-Step recolors each point by its responsibilities (blended colors = uncertain assignment). M-Step moves the means and reshapes the covariance ellipses. The log-likelihood plot below tracks progress.
Try resetting and watching again. Different random initializations lead to different convergence paths — sometimes a component "steals" points from another, sometimes two components start near the same cluster and one drifts away. This is the local-optima issue in practice.
There is a deeper way to think about GMMs that connects to a broad class of models. Introduce a latent variable zn for each data point. zn is a one-hot vector of length K — it indicates which component generated xn. If zn = [0, 1, 0], then point n came from component 2.
The generative story becomes explicit:
The joint distribution of x and z is:
When zk = 1 (point assigned to component k), only the k-th term survives. Marginalizing out z (summing over all possible assignments) recovers the GMM density: p(x) = ∑k πk N(x | μk, Σk). This is exactly where the mixture formula comes from.
This perspective reveals EM as a general algorithm. Any model where some variables are hidden fits the pattern: define the joint p(x, z | θ), compute the expected complete-data log-likelihood (E-step), and maximize it (M-step). The latent variables could be cluster assignments (GMMs), hidden states (HMMs), missing data (imputation), or topic assignments (LDA).
Gaussian Mixture Models sit at a crossroads of several important ideas in machine learning. Let's map the connections.
| Method | Relationship to GMMs |
|---|---|
| K-Means | Hard-assignment limit of EM: as covariances shrink to zero, soft responsibilities become hard assignments |
| Bayesian GMM | Put priors on μ, Σ, π (e.g. Dirichlet on weights). Avoids overfitting, can automatically select K |
| Hidden Markov Models | GMM + temporal structure. States evolve over time; each state emits from a Gaussian |
| Variational Autoencoders | Deep learning generalization: the latent z is continuous, the decoder is a neural network, and EM becomes variational inference |
| Kernel Density Estimation | Extreme limit: K = N (one component per point). No EM needed, but no compression either |
| Factor Analysis | Each component's covariance has low-rank structure, revealing underlying factors |
Strengths of GMMs:
• Soft clustering (probabilistic)
• Models cluster shape via covariances
• Principled density estimation
• EM has convergence guarantees
• Generative: can sample new data
Limitations:
• Must choose K in advance
• EM finds local optima only
• Assumes Gaussian-shaped clusters
• Covariance estimation needs D « N
• Singularities without regularization
Looking ahead: Chapter 12 takes a completely different approach to the classification problem. Instead of modelling the full density p(x), Support Vector Machines find the decision boundary directly. Where GMMs are generative (model how data is produced), SVMs are discriminative (model the boundary between classes). Both perspectives have their place.