Deisenroth et al., Chapter 11

Gaussian Mixture Models

When one Gaussian is not enough — modelling multi-modal data with soft clustering and EM.

Prerequisites: Chapter 6 (Probability & Distributions) + Chapter 8 (When Models Meet Data). That's it.
11
Chapters
5+
Simulations
11
Quizzes

Chapter 0: Why Mixtures?

Imagine you measure the heights of people in a room, and you get a beautiful bell curve. A single Gaussian fits perfectly. Now imagine you measure the heights of people and their dogs. You get two bumps — one around 170 cm, one around 50 cm. A single Gaussian cannot capture that shape. It would place its mean somewhere around 110 cm, where almost no actual data points live.

This is the multi-modal problem. Real-world data rarely comes from one clean bell curve. Customer spending has frugal shoppers and big spenders. Image pixels cluster into distinct color groups. Gene expression data groups cells into types. Whenever the underlying population has subgroups, a single Gaussian fails.

One Gaussian vs. Multi-Modal Data

The orange curve is a single Gaussian fit. Notice how it smears across the gap between the two clusters, assigning high density where no data exists. Click Toggle Mixture to see how two Gaussians capture the data far better.

The fix is simple in concept: instead of one Gaussian, use several. Each Gaussian captures one cluster. We weight them by how prevalent each cluster is, and the overall density becomes a weighted sum. This is a Gaussian Mixture Model (GMM).

But this raises hard questions. How do we figure out which Gaussian each data point belongs to? How do we estimate the means, covariances, and weights when we don't know the assignments? This chapter introduces the Expectation-Maximization (EM) algorithm — an elegant iterative procedure that solves all of these simultaneously.

The core idea: A GMM says "the data was generated by K different Gaussians, but we don't know which Gaussian produced which point." EM alternates between guessing the assignments (E-step) and updating the Gaussians to fit those guesses (M-step). Each iteration increases the likelihood. It is one of the most beautiful algorithms in all of machine learning.
Why does a single Gaussian fail on data with two well-separated clusters?

Chapter 1: The GMM

A Gaussian Mixture Model is a weighted sum of K Gaussian densities. Each component k has its own mean μk, covariance Σk, and mixing weight πk. The overall density at a point x is:

p(x) = ∑k=1K πk · N(x | μk, Σk)

where N(x | μ, Σ) is the multivariate Gaussian density:

N(x | μ, Σ) = (2π)−D/2 |Σ|−1/2 exp(−½ (x − μ)T Σ−1 (x − μ))

The mixing weights πk satisfy two constraints: each weight is non-negative (πk ≥ 0), and they sum to one (∑ πk = 1). Think of πk as the probability that a randomly drawn data point came from component k. If π1 = 0.7, then 70% of the data was generated by the first Gaussian.

Key insight: A GMM is a convex combination of simple distributions. Each component is a standard Gaussian (which we understand well), and the mixture just blends them. The complexity comes not from the components themselves but from figuring out the parameters: K means, K covariances, and K weights.
ParameterSymbolMeaning
Meansμ1, …, μKCenter of each cluster
CovariancesΣ1, …, ΣKShape & spread of each cluster
Weightsπ1, …, πKRelative size of each cluster (sum to 1)
Number of componentsKHow many Gaussians (chosen by us)

In 1D, each Gaussian is a bell curve. In 2D, each is an ellipse (the shape determined by its covariance matrix). A GMM in 2D looks like a landscape of overlapping hills, each centered on a different μk, each with its own tilt and spread from Σk.

Terminology note: The weights πk are sometimes called prior probabilities of each component. In the Bayesian view, before observing x, the probability it came from component k is πk. After observing x, we update this to a posterior — which is what the E-step will compute.
In a GMM with K = 3 components, what must be true about π1 + π2 + π3?

Chapter 2: Log-Likelihood

Given N data points x1, …, xN, we want to find the parameters θ = {πk, μk, Σk}k=1..K that best explain the data. The standard approach: maximum likelihood estimation. We maximize the probability of observing our data under the model.

Assuming data points are i.i.d., the likelihood is a product over data points:

L(θ) = ∏n=1N p(xn | θ) = ∏n=1Nk=1K πk N(xn | μk, Σk)

Taking the log (to turn products into sums) gives the log-likelihood:

ln L(θ) = ∑n=1N ln [ ∑k=1K πk N(xn | μk, Σk) ]

Here is the problem. In standard maximum likelihood for a single Gaussian, the log passes through to the exponent and everything simplifies to a clean closed-form solution. But here the log sits outside a sum over k. The log of a sum does not simplify. Setting derivatives to zero gives equations where every parameter depends on every other parameter in a tangled mess.

Why EM exists: You cannot solve the GMM log-likelihood in closed form. The sum inside the log couples all parameters together. If you knew which component generated each point, the problem would decompose into K independent Gaussian fits — trivially easy. EM exploits this: it pretends to know the assignments (E-step), solves the easy problem (M-step), then updates the pretend assignments. Repeat.

There is a second subtle issue: the likelihood surface has singularities. If a component's mean μk collapses exactly onto a data point and its variance shrinks to zero, the density at that point goes to infinity. The likelihood becomes unbounded. In practice, we avoid this with regularization (adding a small diagonal to Σk) or by detecting degenerate components and resetting them.

Log-likelihood as a fitness score: Think of the log-likelihood as a scorecard. Higher means the model assigns higher probability to the observed data — the model "explains" the data better. EM is guaranteed to increase this score at every step, though it may only reach a local maximum, not the global one.
Why can't we find a closed-form MLE solution for GMM parameters?

Chapter 3: Responsibilities

At the heart of EM is a simple question: for each data point xn, how much does each component k "claim" it? This is the responsibility rnk — the posterior probability that component k generated point n.

We compute it with Bayes' theorem. The prior probability that point n came from component k is πk. The likelihood of xn given component k is N(xn | μk, Σk). By Bayes' rule:

rnk = πk N(xn | μk, Σk) ⁄ ∑j=1K πj N(xn | μj, Σj)

Each rnk is between 0 and 1, and for each point n, the responsibilities sum to 1 across components: ∑k rnk = 1. A point near the center of cluster k gets rnk close to 1. A point equidistant from two clusters might have rnk ≈ 0.5 for each.

Soft assignment vs. hard assignment: In K-means, each point belongs to exactly one cluster. In a GMM, each point has fractional membership in every cluster. A point might be 80% cluster 1 and 20% cluster 2. This is soft clustering, and it is more faithful to reality — boundary points genuinely are ambiguous.
Responsibilities in 1D

Drag the slider to move a test point along the x-axis. Watch how the responsibility shifts between the two components. Near a cluster center, one component dominates. In between, responsibilities are split.

x position 2.0

We also define the effective number of points assigned to component k:

Nk = ∑n=1N rnk

This is a soft count. If all 100 points have rn1 = 0.6, then N1 = 60. It tells us how much data each component "owns." We will use Nk to normalize the M-step updates.

Computing responsibilities is the E-step. Given current parameters (πk, μk, Σk), we evaluate the responsibility formula for every data point and every component. This is a forward pass of Bayes' theorem — prior × likelihood, normalized.
If a point xn sits exactly at the mean μ2 of component 2 and far from all other components, what is rn2 approximately?

Chapter 4: Updating Means

Once we have the responsibilities (the E-step), we can update the parameters to better fit the data — this is the M-step. Let's start with the means. The new mean for component k is the responsibility-weighted average of all data points:

μknew = (1 / Nk) ∑n=1N rnk · xn

This makes perfect intuitive sense. Each point contributes to the mean of component k in proportion to how much component k claims it. A point with rnk = 0.9 pulls the mean strongly. A point with rnk = 0.01 barely matters. The normalizer Nk = ∑ rnk ensures the weights sum to 1.

Compare with K-means: In K-means, the mean update is the simple average of all points assigned to cluster k. The GMM mean update is identical but with soft assignments: rnk instead of hard 0/1 indicators. When responsibilities are nearly binary (points clearly belong to one cluster), the GMM update reduces to K-means.
Responsibility-Weighted Mean

Points are colored by responsibility (orange = component 1, teal = component 2). The weighted mean is the larger dot. Drag the slider to shift responsibilities and watch the mean move.

Boundary shift 0.0

Why does this update increase the likelihood? Setting the derivative of the expected complete-data log-likelihood with respect to μk to zero gives exactly this formula. It is the MLE for a single Gaussian, but with each data point weighted by rnk. If we knew the true assignments, fitting each cluster's mean would be trivial. The responsibilities act as a surrogate for those unknown assignments.

How does the GMM mean update differ from the K-means centroid update?

Chapter 5: Updating Covariances

The mean tells us where a cluster is; the covariance tells us its shape. Is it a tight circle? A wide ellipse tilted 45 degrees? The M-step updates each covariance Σk as a responsibility-weighted outer product of the residuals:

Σknew = (1 / Nk) ∑n=1N rnk (xn − μknew)(xn − μknew)T

Read this carefully. For each data point, we compute its deviation from the new mean, form the outer product (which gives a matrix capturing direction and magnitude of deviation), and weight it by the responsibility. Sum them up, normalize by Nk, and you have the new covariance.

Shape intuition in 2D: If all high-responsibility points lie along a line, the covariance will be elongated (one large eigenvalue, one small). If they form a round cloud, both eigenvalues are similar. The eigenvectors of Σk give the orientation of the ellipse; the eigenvalues give its radii (specifically, the square roots of the eigenvalues).

There is a practical concern: if too few points are assigned to a component (Nk very small), the covariance estimate becomes unreliable and can collapse. A common fix is to add a small regularization term: Σknew ← Σknew + ε I, where I is the identity matrix and ε is tiny (e.g. 10−6). This prevents singularities and keeps the matrix invertible.

Why (x − μ)(x − μ)T? This outer product is a rank-1 matrix that "remembers" which direction x deviated from the mean. Averaging many such matrices gives the covariance — a matrix that encodes the average shape of the spread. It is the same formula as the standard sample covariance, but with soft weights.
python
def update_covariance(X, mu_k, r_k, N_k, eps=1e-6):
    # X: (N, D), mu_k: (D,), r_k: (N,), N_k: scalar
    diff = X - mu_k                       # (N, D)
    weighted = r_k[:, None] * diff        # (N, D) weighted residuals
    Sigma_k = (weighted.T @ diff) / N_k   # (D, D)
    Sigma_k += eps * np.eye(X.shape[1])  # regularize
    return Sigma_k
What does the covariance update formula compute?

Chapter 6: Updating Weights

The mixing weights πk represent how prevalent each cluster is. If cluster 1 claims most of the data (high N1), its weight should be large. The M-step update is the simplest of the three:

πknew = Nk / N

That's it. The new weight for component k is just the fraction of the total (soft) data count assigned to it. If N = 100 data points and Nk = 30 (counting soft assignments), then πk = 0.30. The constraint ∑ πk = 1 is automatically satisfied because ∑ Nk = N (every point's responsibilities sum to 1 across components).

Where does this come from? Maximizing the log-likelihood with respect to πk subject to the constraint ∑ πk = 1 requires a Lagrange multiplier. The constrained optimization yields exactly πk = Nk / N. The Lagrange multiplier enforces the sum-to-one constraint. This is a beautiful example of constrained optimization from Chapter 7 in action.

Here is the complete set of M-step updates, side by side:

ParameterUpdate RuleIntuition
μk(1/Nk) ∑ rnk xnWeighted center of mass
Σk(1/Nk) ∑ rnk (xn−μk)(xn−μk)TWeighted spread around mean
πkNk / NFraction of data owned

Each formula is a soft-weighted version of the maximum likelihood estimator for a single Gaussian. If the responsibilities were hard (0 or 1), these would be exactly the MLE formulas for each component independently. The soft assignments gracefully interpolate.

A sanity check: After the M-step, do the new weights sum to 1? Yes: ∑k πknew = ∑k Nk/N = (1/N) ∑kn rnk = (1/N) ∑n 1 = 1. The inner sum uses the fact that responsibilities sum to 1 for each point.
If component k has Nk = 40 out of N = 200 total (soft-assigned) data points, what is πknew?

Chapter 7: The EM Algorithm

Now we put it all together. The Expectation-Maximization algorithm for GMMs alternates between two steps until convergence:

E-Step (Expectation)
Compute responsibilities rnk using current parameters. "Which component likely generated each point?"
M-Step (Maximization)
Update μk, Σk, πk using responsibilities. "Fit each component to its soft-assigned data."
↻ repeat until log-likelihood converges

Initialization matters. EM finds a local maximum, not necessarily the global one. Different starting points lead to different solutions. Common strategies:

StrategyHow
RandomPick K random data points as initial means, use identity covariances
K-means++Run K-means first, use its centroids as initial means
Multiple restartsRun EM several times, keep the result with highest log-likelihood
Convergence guarantee: EM is guaranteed to increase (or maintain) the log-likelihood at every iteration. The E-step computes the expected complete-data log-likelihood (the "Q function"). The M-step maximizes it. The math ensures that each full E-M cycle never decreases the observed-data log-likelihood. Convergence is monotonic — the score can only go up.

When to stop: Track the log-likelihood after each iteration. When the increase drops below a threshold (e.g. |Δ ln L| < 10−4), declare convergence. Alternatively, stop after a fixed number of iterations (e.g. 100).

python
def em_gmm(X, K, max_iter=100, tol=1e-4):
    N, D = X.shape
    # Initialize
    mu = X[np.random.choice(N, K, replace=False)]
    Sigma = [np.eye(D) for _ in range(K)]
    pi = np.ones(K) / K

    for it in range(max_iter):
        # E-step: compute responsibilities
        r = np.zeros((N, K))
        for k in range(K):
            r[:, k] = pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: update parameters
        Nk = r.sum(axis=0)
        for k in range(K):
            mu[k] = (r[:, k] @ X) / Nk[k]
            diff = X - mu[k]
            Sigma[k] = (r[:, k][:, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(D)
        pi = Nk / N

        # Check convergence (log-likelihood)
        ll = np.sum(np.log(r.sum(axis=1)))  # simplified
        if it > 0 and abs(ll - prev_ll) < tol: break
        prev_ll = ll
    return mu, Sigma, pi
EM beyond GMMs: The EM algorithm is a general framework for any model with latent (hidden) variables. GMMs are the textbook example, but EM also trains Hidden Markov Models (Baum-Welch), topic models (LDA), and many Bayesian inference problems. The pattern is always the same: E-step computes expected latent variables, M-step maximizes parameters.
What does the E-step compute?

Chapter 8: EM in Action

Time to see it work. Below is a 2D dataset with K = 3 true clusters. The algorithm starts with random initial means (the large circles) and spherical covariances. Click through the E-step and M-step to watch EM converge, or hit Auto-Run to animate the full process.

EM Step-by-Step (K = 3)

E-Step recolors each point by its responsibilities (blended colors = uncertain assignment). M-Step moves the means and reshapes the covariance ellipses. The log-likelihood plot below tracks progress.

Iteration 0
Speed400ms
What to watch for: In the first few iterations, the means race toward the cluster centers and the covariance ellipses shrink and rotate to match the data shape. The log-likelihood curve rises steeply at first, then levels off as the algorithm converges. Points near cluster boundaries remain softly colored — genuinely ambiguous assignments.

Try resetting and watching again. Different random initializations lead to different convergence paths — sometimes a component "steals" points from another, sometimes two components start near the same cluster and one drifts away. This is the local-optima issue in practice.

Counting iterations: For well-separated clusters, EM typically converges in 10–30 iterations. For overlapping clusters, it may take 50–100. The per-iteration cost is O(NKD2) for the E-step and O(NKD2) for the M-step (dominated by the covariance updates). For large N, this is quite efficient compared to alternatives.
What happens to the log-likelihood across EM iterations?

Chapter 9: Latent Variables

There is a deeper way to think about GMMs that connects to a broad class of models. Introduce a latent variable zn for each data point. zn is a one-hot vector of length K — it indicates which component generated xn. If zn = [0, 1, 0], then point n came from component 2.

The generative story becomes explicit:

1. Pick a component
Sample zn ~ Categorical(π1, …, πK)
2. Generate data
Sample xn ~ N(μk, Σk) where k is the active component in zn

The joint distribution of x and z is:

p(x, z) = p(z) · p(x | z) = ∏k=1K [ πk N(x | μk, Σk) ]zk

When zk = 1 (point assigned to component k), only the k-th term survives. Marginalizing out z (summing over all possible assignments) recovers the GMM density: p(x) = ∑k πk N(x | μk, Σk). This is exactly where the mixture formula comes from.

Why latent variables matter: The latent variables z are what make the log-likelihood hard (the sum inside the log). If we observed z, the log-likelihood would decompose into K independent terms, each a standard Gaussian MLE problem. EM's trick is to replace the unknown z with its expected value under the current parameters — which is exactly the responsibilities rnk.

This perspective reveals EM as a general algorithm. Any model where some variables are hidden fits the pattern: define the joint p(x, z | θ), compute the expected complete-data log-likelihood (E-step), and maximize it (M-step). The latent variables could be cluster assignments (GMMs), hidden states (HMMs), missing data (imputation), or topic assignments (LDA).

Incomplete data: We observe x but not z. The log-likelihood of the observed data is ln p(x | θ) = ln ∑z p(x, z | θ). This marginal is hard to optimize. EM instead optimizes a lower bound (the ELBO), which is tight at the E-step solution. This connects directly to variational inference and the VAE (Variational Autoencoder) in deep learning.
What is the latent variable zn in a GMM?

Chapter 10: Connections

Gaussian Mixture Models sit at a crossroads of several important ideas in machine learning. Let's map the connections.

MethodRelationship to GMMs
K-MeansHard-assignment limit of EM: as covariances shrink to zero, soft responsibilities become hard assignments
Bayesian GMMPut priors on μ, Σ, π (e.g. Dirichlet on weights). Avoids overfitting, can automatically select K
Hidden Markov ModelsGMM + temporal structure. States evolve over time; each state emits from a Gaussian
Variational AutoencodersDeep learning generalization: the latent z is continuous, the decoder is a neural network, and EM becomes variational inference
Kernel Density EstimationExtreme limit: K = N (one component per point). No EM needed, but no compression either
Factor AnalysisEach component's covariance has low-rank structure, revealing underlying factors
Choosing K: We treated K as given, but in practice you must select it. Common approaches: (1) the Bayesian Information Criterion (BIC), which penalizes model complexity; (2) the elbow method, where you plot log-likelihood vs. K and look for diminishing returns; (3) Bayesian nonparametrics (Dirichlet Process GMMs), which let the data decide K automatically.

Strengths of GMMs:

• Soft clustering (probabilistic)
• Models cluster shape via covariances
• Principled density estimation
• EM has convergence guarantees
• Generative: can sample new data

Limitations:

• Must choose K in advance
• EM finds local optima only
• Assumes Gaussian-shaped clusters
• Covariance estimation needs D « N
• Singularities without regularization

Looking ahead: Chapter 12 takes a completely different approach to the classification problem. Instead of modelling the full density p(x), Support Vector Machines find the decision boundary directly. Where GMMs are generative (model how data is produced), SVMs are discriminative (model the boundary between classes). Both perspectives have their place.

"All models are wrong, but some are useful."
— George E. P. Box
What happens to the GMM's EM algorithm in the limit where responsibilities become hard (0 or 1)?