Ch 11: Gaussian Mixture Models

Chapter 0: Why Mixtures?

Imagine you measure the heights of people in a room, and you get a beautiful bell curve. A single Gaussian fits perfectly. Now imagine you measure the heights of people and their dogs. You get two bumps — one around 170 cm, one around 50 cm. A single Gaussian cannot capture that shape. It would place its mean somewhere around 110 cm, where almost no actual data points live.

This is the multi-modal problem. Real-world data rarely comes from one clean bell curve. Customer spending has frugal shoppers and big spenders. Image pixels cluster into distinct color groups. Gene expression data groups cells into types. Whenever the underlying population has subgroups, a single Gaussian fails.

One Gaussian vs. Multi-Modal Data

The orange curve is a single Gaussian fit. Notice how it smears across the gap between the two clusters, assigning high density where no data exists. Click Toggle Mixture to see how two Gaussians capture the data far better.

The fix is simple in concept: instead of one Gaussian, use several. Each Gaussian captures one cluster. We weight them by how prevalent each cluster is, and the overall density becomes a weighted sum. This is a Gaussian Mixture Model (GMM).

But this raises hard questions. How do we figure out which Gaussian each data point belongs to? How do we estimate the means, covariances, and weights when we don't know the assignments? This chapter introduces the Expectation-Maximization (EM) algorithm — an elegant iterative procedure that solves all of these simultaneously.

The core idea: A GMM says "the data was generated by K different Gaussians, but we don't know which Gaussian produced which point." EM alternates between guessing the assignments (E-step) and updating the Gaussians to fit those guesses (M-step). Each iteration increases the likelihood. It is one of the most beautiful algorithms in all of machine learning.

Why does a single Gaussian fail on data with two well-separated clusters?

Because Gaussians can only model small datasets Because a single bell curve places its peak between the clusters, where no data lives Because Gaussians cannot model continuous data

Chapter 1: The GMM

A Gaussian Mixture Model is a weighted sum of K Gaussian densities. Each component k has its own mean μ_k, covariance Σ_k, and mixing weight π_k. The overall density at a point x is:

p(x) = ∑_k=1^K π_k · N(x | μ_k, Σ_k)

where N(x | μ, Σ) is the multivariate Gaussian density:

N(x | μ, Σ) = (2π)^−D/2 |Σ|^−1/2 exp(−½ (x − μ)^T Σ⁻¹ (x − μ))

The mixing weights π_k satisfy two constraints: each weight is non-negative (π_k ≥ 0), and they sum to one (∑ π_k = 1). Think of π_k as the probability that a randomly drawn data point came from component k. If π₁ = 0.7, then 70% of the data was generated by the first Gaussian.

Key insight: A GMM is a convex combination of simple distributions. Each component is a standard Gaussian (which we understand well), and the mixture just blends them. The complexity comes not from the components themselves but from figuring out the parameters: K means, K covariances, and K weights.

Parameter	Symbol	Meaning
Means	μ₁, …, μ_K	Center of each cluster
Covariances	Σ₁, …, Σ_K	Shape & spread of each cluster
Weights	π₁, …, π_K	Relative size of each cluster (sum to 1)
Number of components	K	How many Gaussians (chosen by us)

In 1D, each Gaussian is a bell curve. In 2D, each is an ellipse (the shape determined by its covariance matrix). A GMM in 2D looks like a landscape of overlapping hills, each centered on a different μ_k, each with its own tilt and spread from Σ_k.

Terminology note: The weights π_k are sometimes called prior probabilities of each component. In the Bayesian view, before observing x, the probability it came from component k is π_k. After observing x, we update this to a posterior — which is what the E-step will compute.

In a GMM with K = 3 components, what must be true about π₁ + π₂ + π₃?

It must equal 1 It must equal K = 3 It can be any positive number

Chapter 2: Log-Likelihood

Given N data points x₁, …, x_N, we want to find the parameters θ = {π_k, μ_k, Σ_k}_k=1..K that best explain the data. The standard approach: maximum likelihood estimation. We maximize the probability of observing our data under the model.

Assuming data points are i.i.d., the likelihood is a product over data points:

L(θ) = ∏_n=1^N p(x_n | θ) = ∏_n=1^N ∑_k=1^K π_k N(x_n | μ_k, Σ_k)

Taking the log (to turn products into sums) gives the log-likelihood:

ln L(θ) = ∑_n=1^N ln [ ∑_k=1^K π_k N(x_n | μ_k, Σ_k) ]

Here is the problem. In standard maximum likelihood for a single Gaussian, the log passes through to the exponent and everything simplifies to a clean closed-form solution. But here the log sits outside a sum over k. The log of a sum does not simplify. Setting derivatives to zero gives equations where every parameter depends on every other parameter in a tangled mess.

Why EM exists: You cannot solve the GMM log-likelihood in closed form. The sum inside the log couples all parameters together. If you knew which component generated each point, the problem would decompose into K independent Gaussian fits — trivially easy. EM exploits this: it pretends to know the assignments (E-step), solves the easy problem (M-step), then updates the pretend assignments. Repeat.

There is a second subtle issue: the likelihood surface has singularities. If a component's mean μ_k collapses exactly onto a data point and its variance shrinks to zero, the density at that point goes to infinity. The likelihood becomes unbounded. In practice, we avoid this with regularization (adding a small diagonal to Σ_k) or by detecting degenerate components and resetting them.

Log-likelihood as a fitness score: Think of the log-likelihood as a scorecard. Higher means the model assigns higher probability to the observed data — the model "explains" the data better. EM is guaranteed to increase this score at every step, though it may only reach a local maximum, not the global one.

Why can't we find a closed-form MLE solution for GMM parameters?

Because the data is not i.i.d. Because the log of the sum over components couples all parameters together Because the Gaussian density has no gradient

Chapter 3: Responsibilities

At the heart of EM is a simple question: for each data point x_n, how much does each component k "claim" it? This is the responsibility r_nk — the posterior probability that component k generated point n.

We compute it with Bayes' theorem. The prior probability that point n came from component k is π_k. The likelihood of x_n given component k is N(x_n | μ_k, Σ_k). By Bayes' rule:

r_nk = π_k N(x_n | μ_k, Σ_k) ⁄ ∑_j=1^K π_j N(x_n | μ_j, Σ_j)

Each r_nk is between 0 and 1, and for each point n, the responsibilities sum to 1 across components: ∑_k r_nk = 1. A point near the center of cluster k gets r_nk close to 1. A point equidistant from two clusters might have r_nk ≈ 0.5 for each.

Soft assignment vs. hard assignment: In K-means, each point belongs to exactly one cluster. In a GMM, each point has fractional membership in every cluster. A point might be 80% cluster 1 and 20% cluster 2. This is soft clustering, and it is more faithful to reality — boundary points genuinely are ambiguous.

Responsibilities in 1D

Drag the slider to move a test point along the x-axis. Watch how the responsibility shifts between the two components. Near a cluster center, one component dominates. In between, responsibilities are split.

x position 2.0

We also define the effective number of points assigned to component k:

N_k = ∑_n=1^N r_nk

This is a soft count. If all 100 points have r_n1 = 0.6, then N₁ = 60. It tells us how much data each component "owns." We will use N_k to normalize the M-step updates.

Computing responsibilities is the E-step. Given current parameters (π_k, μ_k, Σ_k), we evaluate the responsibility formula for every data point and every component. This is a forward pass of Bayes' theorem — prior × likelihood, normalized.

If a point x_n sits exactly at the mean μ₂ of component 2 and far from all other components, what is r_n2 approximately?

0 (the point is unassigned) 0.5 (always split evenly) Close to 1 (component 2 strongly claims it)

Chapter 4: Updating Means

Once we have the responsibilities (the E-step), we can update the parameters to better fit the data — this is the M-step. Let's start with the means. The new mean for component k is the responsibility-weighted average of all data points:

μ_k^new = (1 / N_k) ∑_n=1^N r_nk · x_n

This makes perfect intuitive sense. Each point contributes to the mean of component k in proportion to how much component k claims it. A point with r_nk = 0.9 pulls the mean strongly. A point with r_nk = 0.01 barely matters. The normalizer N_k = ∑ r_nk ensures the weights sum to 1.

Compare with K-means: In K-means, the mean update is the simple average of all points assigned to cluster k. The GMM mean update is identical but with soft assignments: r_nk instead of hard 0/1 indicators. When responsibilities are nearly binary (points clearly belong to one cluster), the GMM update reduces to K-means.

Responsibility-Weighted Mean

Points are colored by responsibility (orange = component 1, teal = component 2). The weighted mean is the larger dot. Drag the slider to shift responsibilities and watch the mean move.

Boundary shift 0.0

Why does this update increase the likelihood? Setting the derivative of the expected complete-data log-likelihood with respect to μ_k to zero gives exactly this formula. It is the MLE for a single Gaussian, but with each data point weighted by r_nk. If we knew the true assignments, fitting each cluster's mean would be trivial. The responsibilities act as a surrogate for those unknown assignments.

How does the GMM mean update differ from the K-means centroid update?

GMM uses responsibility-weighted averaging (soft), K-means uses hard 0/1 assignment K-means uses weighted averaging, GMM uses unweighted They are completely unrelated algorithms

Chapter 5: Updating Covariances

The mean tells us where a cluster is; the covariance tells us its shape. Is it a tight circle? A wide ellipse tilted 45 degrees? The M-step updates each covariance Σ_k as a responsibility-weighted outer product of the residuals:

Σ_k^new = (1 / N_k) ∑_n=1^N r_nk (x_n − μ_k^new)(x_n − μ_k^new)^T

Read this carefully. For each data point, we compute its deviation from the new mean, form the outer product (which gives a matrix capturing direction and magnitude of deviation), and weight it by the responsibility. Sum them up, normalize by N_k, and you have the new covariance.

Shape intuition in 2D: If all high-responsibility points lie along a line, the covariance will be elongated (one large eigenvalue, one small). If they form a round cloud, both eigenvalues are similar. The eigenvectors of Σ_k give the orientation of the ellipse; the eigenvalues give its radii (specifically, the square roots of the eigenvalues).

There is a practical concern: if too few points are assigned to a component (N_k very small), the covariance estimate becomes unreliable and can collapse. A common fix is to add a small regularization term: Σ_k^new ← Σ_k^new + ε I, where I is the identity matrix and ε is tiny (e.g. 10⁻⁶). This prevents singularities and keeps the matrix invertible.

Why (x − μ)(x − μ)^T? This outer product is a rank-1 matrix that "remembers" which direction x deviated from the mean. Averaging many such matrices gives the covariance — a matrix that encodes the average shape of the spread. It is the same formula as the standard sample covariance, but with soft weights.

python
def update_covariance(X, mu_k, r_k, N_k, eps=1e-6):
    # X: (N, D), mu_k: (D,), r_k: (N,), N_k: scalar
    diff = X - mu_k                       # (N, D)
    weighted = r_k[:, None] * diff        # (N, D) weighted residuals
    Sigma_k = (weighted.T @ diff) / N_k   # (D, D)
    Sigma_k += eps * np.eye(X.shape[1])  # regularize
    return Sigma_k

What does the covariance update formula compute?

A responsibility-weighted average of the outer products of deviations from the mean The unweighted sample variance of all data points The inverse of the mean vector

Chapter 6: Updating Weights

The mixing weights π_k represent how prevalent each cluster is. If cluster 1 claims most of the data (high N₁), its weight should be large. The M-step update is the simplest of the three:

π_k^new = N_k / N

That's it. The new weight for component k is just the fraction of the total (soft) data count assigned to it. If N = 100 data points and N_k = 30 (counting soft assignments), then π_k = 0.30. The constraint ∑ π_k = 1 is automatically satisfied because ∑ N_k = N (every point's responsibilities sum to 1 across components).

Where does this come from? Maximizing the log-likelihood with respect to π_k subject to the constraint ∑ π_k = 1 requires a Lagrange multiplier. The constrained optimization yields exactly π_k = N_k / N. The Lagrange multiplier enforces the sum-to-one constraint. This is a beautiful example of constrained optimization from Chapter 7 in action.

Here is the complete set of M-step updates, side by side:

Parameter	Update Rule	Intuition
μ_k	(1/N_k) ∑ r_nk x_n	Weighted center of mass
Σ_k	(1/N_k) ∑ r_nk (x_n−μ_k)(x_n−μ_k)^T	Weighted spread around mean
π_k	N_k / N	Fraction of data owned

Each formula is a soft-weighted version of the maximum likelihood estimator for a single Gaussian. If the responsibilities were hard (0 or 1), these would be exactly the MLE formulas for each component independently. The soft assignments gracefully interpolate.

A sanity check: After the M-step, do the new weights sum to 1? Yes: ∑_k π_k^new = ∑_k N_k/N = (1/N) ∑_k ∑_n r_nk = (1/N) ∑_n 1 = 1. The inner sum uses the fact that responsibilities sum to 1 for each point.

If component k has N_k = 40 out of N = 200 total (soft-assigned) data points, what is π_k^new?

40 0.2 0.4

Chapter 7: The EM Algorithm

Now we put it all together. The Expectation-Maximization algorithm for GMMs alternates between two steps until convergence:

E-Step (Expectation)

Compute responsibilities r_nk using current parameters. "Which component likely generated each point?"

↓

M-Step (Maximization)

Update μ_k, Σ_k, π_k using responsibilities. "Fit each component to its soft-assigned data."

↻ repeat until log-likelihood converges

Initialization matters. EM finds a local maximum, not necessarily the global one. Different starting points lead to different solutions. Common strategies:

Strategy	How
Random	Pick K random data points as initial means, use identity covariances
K-means++	Run K-means first, use its centroids as initial means
Multiple restarts	Run EM several times, keep the result with highest log-likelihood

Convergence guarantee: EM is guaranteed to increase (or maintain) the log-likelihood at every iteration. The E-step computes the expected complete-data log-likelihood (the "Q function"). The M-step maximizes it. The math ensures that each full E-M cycle never decreases the observed-data log-likelihood. Convergence is monotonic — the score can only go up.

When to stop: Track the log-likelihood after each iteration. When the increase drops below a threshold (e.g. |Δ ln L| < 10⁻⁴), declare convergence. Alternatively, stop after a fixed number of iterations (e.g. 100).

python
def em_gmm(X, K, max_iter=100, tol=1e-4):
    N, D = X.shape
    # Initialize
    mu = X[np.random.choice(N, K, replace=False)]
    Sigma = [np.eye(D) for _ in range(K)]
    pi = np.ones(K) / K

    for it in range(max_iter):
        # E-step: compute responsibilities
        r = np.zeros((N, K))
        for k in range(K):
            r[:, k] = pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: update parameters
        Nk = r.sum(axis=0)
        for k in range(K):
            mu[k] = (r[:, k] @ X) / Nk[k]
            diff = X - mu[k]
            Sigma[k] = (r[:, k][:, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(D)
        pi = Nk / N

        # Check convergence (log-likelihood)
        ll = np.sum(np.log(r.sum(axis=1)))  # simplified
        if it > 0 and abs(ll - prev_ll) < tol: break
        prev_ll = ll
    return mu, Sigma, pi

EM beyond GMMs: The EM algorithm is a general framework for any model with latent (hidden) variables. GMMs are the textbook example, but EM also trains Hidden Markov Models (Baum-Welch), topic models (LDA), and many Bayesian inference problems. The pattern is always the same: E-step computes expected latent variables, M-step maximizes parameters.

What does the E-step compute?

New values for the means μ_k Responsibilities r_nk — the posterior probability each component generated each point The number of components K

Chapter 8: EM in Action

Time to see it work. Below is a 2D dataset with K = 3 true clusters. The algorithm starts with random initial means (the large circles) and spherical covariances. Click through the E-step and M-step to watch EM converge, or hit Auto-Run to animate the full process.

EM Step-by-Step (K = 3)

E-Step recolors each point by its responsibilities (blended colors = uncertain assignment). M-Step moves the means and reshapes the covariance ellipses. The log-likelihood plot below tracks progress.

Iteration 0

Speed400ms

What to watch for: In the first few iterations, the means race toward the cluster centers and the covariance ellipses shrink and rotate to match the data shape. The log-likelihood curve rises steeply at first, then levels off as the algorithm converges. Points near cluster boundaries remain softly colored — genuinely ambiguous assignments.

Try resetting and watching again. Different random initializations lead to different convergence paths — sometimes a component "steals" points from another, sometimes two components start near the same cluster and one drifts away. This is the local-optima issue in practice.

Counting iterations: For well-separated clusters, EM typically converges in 10–30 iterations. For overlapping clusters, it may take 50–100. The per-iteration cost is O(NKD²) for the E-step and O(NKD²) for the M-step (dominated by the covariance updates). For large N, this is quite efficient compared to alternatives.

What happens to the log-likelihood across EM iterations?

It monotonically increases (or stays the same) at every iteration It oscillates up and down It always decreases

Chapter 9: Latent Variables

There is a deeper way to think about GMMs that connects to a broad class of models. Introduce a latent variable z_n for each data point. z_n is a one-hot vector of length K — it indicates which component generated x_n. If z_n = [0, 1, 0], then point n came from component 2.

The generative story becomes explicit:

1. Pick a component

Sample z_n ~ Categorical(π₁, …, π_K)

↓

2. Generate data

Sample x_n ~ N(μ_k, Σ_k) where k is the active component in z_n

The joint distribution of x and z is:

p(x, z) = p(z) · p(x | z) = ∏_k=1^K [ π_k N(x | μ_k, Σ_k) ]^z_k

When z_k = 1 (point assigned to component k), only the k-th term survives. Marginalizing out z (summing over all possible assignments) recovers the GMM density: p(x) = ∑_k π_k N(x | μ_k, Σ_k). This is exactly where the mixture formula comes from.

Why latent variables matter: The latent variables z are what make the log-likelihood hard (the sum inside the log). If we observed z, the log-likelihood would decompose into K independent terms, each a standard Gaussian MLE problem. EM's trick is to replace the unknown z with its expected value under the current parameters — which is exactly the responsibilities r_nk.

This perspective reveals EM as a general algorithm. Any model where some variables are hidden fits the pattern: define the joint p(x, z | θ), compute the expected complete-data log-likelihood (E-step), and maximize it (M-step). The latent variables could be cluster assignments (GMMs), hidden states (HMMs), missing data (imputation), or topic assignments (LDA).

Incomplete data: We observe x but not z. The log-likelihood of the observed data is ln p(x | θ) = ln ∑_z p(x, z | θ). This marginal is hard to optimize. EM instead optimizes a lower bound (the ELBO), which is tight at the E-step solution. This connects directly to variational inference and the VAE (Variational Autoencoder) in deep learning.

What is the latent variable z_n in a GMM?

The data point itself A one-hot indicator of which component generated point n The covariance matrix

Chapter 10: Connections

Gaussian Mixture Models sit at a crossroads of several important ideas in machine learning. Let's map the connections.

Method	Relationship to GMMs
K-Means	Hard-assignment limit of EM: as covariances shrink to zero, soft responsibilities become hard assignments
Bayesian GMM	Put priors on μ, Σ, π (e.g. Dirichlet on weights). Avoids overfitting, can automatically select K
Hidden Markov Models	GMM + temporal structure. States evolve over time; each state emits from a Gaussian
Variational Autoencoders	Deep learning generalization: the latent z is continuous, the decoder is a neural network, and EM becomes variational inference
Kernel Density Estimation	Extreme limit: K = N (one component per point). No EM needed, but no compression either
Factor Analysis	Each component's covariance has low-rank structure, revealing underlying factors

Choosing K: We treated K as given, but in practice you must select it. Common approaches: (1) the Bayesian Information Criterion (BIC), which penalizes model complexity; (2) the elbow method, where you plot log-likelihood vs. K and look for diminishing returns; (3) Bayesian nonparametrics (Dirichlet Process GMMs), which let the data decide K automatically.

Strengths of GMMs:

• Soft clustering (probabilistic)
• Models cluster shape via covariances
• Principled density estimation
• EM has convergence guarantees
• Generative: can sample new data

Limitations:

• Must choose K in advance
• EM finds local optima only
• Assumes Gaussian-shaped clusters
• Covariance estimation needs D « N
• Singularities without regularization

Looking ahead: Chapter 12 takes a completely different approach to the classification problem. Instead of modelling the full density p(x), Support Vector Machines find the decision boundary directly. Where GMMs are generative (model how data is produced), SVMs are discriminative (model the boundary between classes). Both perspectives have their place.

"All models are wrong, but some are useful."
— George E. P. Box

What happens to the GMM's EM algorithm in the limit where responsibilities become hard (0 or 1)?

It reduces to the K-means algorithm It diverges It becomes PCA