Murphy, Chapters 20–21

Generative Models & Clustering

Find structure without labels. PCA finds the axes of variation. Autoencoders learn compressed representations. K-means and GMMs discover clusters. The EM algorithm ties it all together.

Prerequisites: Probability (Ch 2–3) + Statistics (Ch 4) + DNNs (Ch 13). You need Gaussians, MLE, and basic neural networks.
12
Chapters
5
Simulations
12
Quizzes

Chapter 0: Why Unsupervised?

Every model so far has used labeled data: (x, y) pairs where y is the target. But most data in the world is unlabeled. Images, text, sensor readings — they arrive without annotations. Unsupervised learning finds structure in data without labels.

Two fundamental tasks: (1) dimensionality reduction — find a compact representation that captures the important variation, and (2) clustering — group similar data points together.

The big picture: PCA and autoencoders learn a low-dimensional representation z of high-dimensional data x. K-means and GMMs discover K groups in the data. VAEs combine both: they learn a generative model p(x | z) and an inference model q(z | x), giving you both representation learning and generation.
Dimensionality Reduction (Ch 20)
PCA, autoencoders, VAEs, t-SNE, manifold learning
Clustering (Ch 21)
K-means, GMMs, EM algorithm, spectral clustering
Generative Models
VAEs learn p(x) = ∫ p(x|z) p(z) dz, generating new data
Representation Learning
Learned features for downstream tasks: classification, retrieval
What is the key difference between supervised and unsupervised learning?

Chapter 1: Principal Component Analysis

High-dimensional data often lies near a lower-dimensional subspace. PCA (Murphy 20.1) finds the directions of maximum variance and projects the data onto them.

Given centered data X (mean subtracted), PCA computes the eigendecomposition of the covariance matrix:

Σ = (1/N) XTX = U Λ UT

The columns of U are the principal components (eigenvectors), ordered by decreasing eigenvalue λd. The first principal component is the direction of maximum variance. The k-th component captures the most remaining variance orthogonal to the first k−1.

Dimensionality reduction: Project each data point onto the top L eigenvectors: z = ULTx, where UL contains only the first L columns. The reconstruction is x̂ = ULz. The reconstruction error is minimized among all linear projections to L dimensions. The fraction of variance explained is ∑d=1L λd / ∑d=1D λd.
Choosing L: Plot the eigenvalues (scree plot). Look for an "elbow" where the eigenvalues drop sharply. Or choose L such that 95% of the variance is explained. In practice, PCA can reduce thousands of dimensions to tens while retaining most of the information.
PropertyDetails
ObjectiveMaximize variance of projections (or minimize reconstruction error)
SolutionEigenvectors of covariance matrix (or SVD of data matrix)
LinearityLinear projection only; cannot capture nonlinear structure
Reconstructionx̂ = ULULTx (projection then lifting)
Probabilistic versionPPCA: p(x | z) = N(Wz + μ, σ²I), z ∼ N(0, I)
What does the first principal component represent?

Chapter 2: Autoencoders

PCA is limited to linear projections. What if the data lies on a curved manifold? Autoencoders (Murphy 20.3) use neural networks to learn nonlinear encodings.

An autoencoder has two parts: an encoder z = fθ(x) that maps the input to a low-dimensional latent code, and a decoder x̂ = gφ(z) that reconstructs the input from the code. The model is trained to minimize reconstruction error:

L(θ, φ) = (1/N) ∑n ||xn − gφ(fθ(xn))||²
Bottleneck forces compression: If the latent dimension L is much smaller than the input dimension D, the network must learn a compact representation that preserves the information needed for reconstruction. The encoder discovers the important features; the decoder shows what can be reconstructed from them.

Variants that add structure to the latent space:

VariantModificationEffect
Bottleneck AEL << DForces compression (like PCA but nonlinear)
Denoising AECorrupt input, reconstruct cleanLearns robust features, ignores noise
Sparse AEPenalize activations of zOnly a few latent units active per input
Contractive AEPenalize ||∂z/∂x||2Representation is locally invariant to input perturbations
PCA is a linear autoencoder: If the encoder and decoder are both linear (no activation functions) and the loss is MSE, the optimal solution recovers the PCA projection. The nonlinear autoencoder strictly generalizes PCA.
What forces an autoencoder to learn a useful representation rather than just copying the input?

Chapter 3: Variational Autoencoders

Standard autoencoders learn a mapping, not a probability distribution. You cannot sample new data from them (the latent space may have "holes" where the decoder produces garbage). Variational autoencoders (VAEs, Kingma & Welling 2014, Murphy 20.3.5) fix this by making the model generative.

The generative model is: sample z ∼ N(0, I), then sample x ∼ pθ(x | z). The decoder is pθ(x | z) (e.g., a Gaussian whose mean is a neural network). We want to learn θ by maximizing the marginal likelihood p(x) = ∫ pθ(x | z) p(z) dz, but this integral is intractable.

The ELBO: Instead of maximizing log p(x) directly, we maximize a lower bound (the Evidence Lower BOund):
ELBO = Eq[log pθ(x | z)] − KL(qφ(z | x) || p(z))
The first term is reconstruction quality (how well the decoder reconstructs x from z). The second is a regularizer that pushes the encoder's output qφ(z | x) toward the prior N(0, I).

The encoder outputs the parameters of qφ(z | x) = N(μφ(x), σφ²(x)). During training, we sample z = μ + σ ⊙ ε, where ε ∼ N(0, I). This reparameterization trick makes the sampling differentiable, allowing backpropagation through the stochastic node.

Why VAEs can generate: The KL term ensures the latent space is smooth and regular (no holes). Any z sampled from N(0, I) should decode to a plausible x. This is what makes VAEs generative — you can sample z ∼ N(0, I) and decode to get new data.
LVAE = −Eq[log p(x | z)] + KL(q(z | x) || p(z))

For Gaussian p(x | z) and q(z | x), both terms have closed-form expressions. The KL term for two Gaussians is: KL = ½ ∑jj² + σj² − log σj² − 1).

What does the KL divergence term in the VAE loss accomplish?

Chapter 4: K-Means Clustering

Given N unlabeled data points, partition them into K groups such that points within each group are similar. K-means (Murphy 21.3) is the simplest and most widely used clustering algorithm.

The algorithm alternates two steps:

E-step:  rnk = 1 if k = argminj ||xn − μj||², else 0
M-step:  μk = ∑n rnk xn / ∑n rnk

E-step: assign each point to the nearest centroid. M-step: update each centroid to the mean of its assigned points. Repeat until convergence.

K-means minimizes distortion: J = ∑nk rnk ||xn − μk||². Each step decreases J (or keeps it the same). The algorithm always converges, but to a local minimum — the result depends on initialization. Run multiple times with different random starts and pick the lowest J.
K-means++ initialization (21.3.4): Choose the first centroid randomly. Each subsequent centroid is chosen with probability proportional to its squared distance from the nearest existing centroid. This spreads out the initial centroids and provably gives an O(log K)-competitive solution.
PropertyK-Means
Cluster shapeSpherical (Voronoi cells)
AssignmentsHard (each point in exactly one cluster)
ObjectiveMinimize total within-cluster squared distance
ConvergenceAlways converges (to a local minimum)
Choosing KElbow method, silhouette score, or gap statistic
Why does K-means always converge?

Chapter 5: Gaussian Mixture Models

K-means makes hard assignments: each point belongs to exactly one cluster. What if a point is ambiguous, lying between two clusters? Gaussian Mixture Models (GMMs, Murphy 21.4) give soft assignments — probabilities of belonging to each cluster.

p(x) = ∑k=1K πk N(x | μk, Σk)

where πk are the mixing proportions (how prevalent each cluster is, ∑πk = 1), and each component k is a Gaussian with its own mean μk and covariance Σk.

Soft assignments (responsibilities): The probability that point xn belongs to cluster k is: rnk = πk N(xn | μk, Σk) / ∑j πj N(xn | μj, Σj). This is just Bayes' rule: prior (πk) times likelihood (N(xn | μk, Σk)), normalized.

GMMs generalize K-means in three ways:

FeatureK-MeansGMM
AssignmentsHard (0 or 1)Soft (probabilities)
Cluster shapeSphericalElliptical (full covariance)
Cluster sizeEqualDifferent (via Σk)
Cluster weightImplicitExplicit (πk)
ObjectiveDistortionLog-likelihood
K-means as a special case: If all covariances are σ²I (spherical, same for all clusters), all mixing weights are equal (πk = 1/K), and we take σ² → 0, then soft assignments become hard assignments and EM for the GMM reduces to the K-means algorithm.
How do GMMs differ from K-means in handling ambiguous data points?

Chapter 6: The EM Algorithm

How do we fit a GMM? We can't use simple MLE because the cluster assignments (latent variables z) are unknown. The Expectation-Maximization (EM) algorithm handles this elegantly.

EM alternates between two steps:

E-step (Expectation): Given current parameters θold, compute the expected value of the latent variables (the responsibilities rnk). "If the parameters were these, which cluster does each point most likely belong to?"
M-step (Maximization): Given the responsibilities, update the parameters to maximize the expected complete-data log-likelihood. "Given these soft assignments, what are the best parameters?"

For GMMs, the M-step updates are:

μk = ∑n rnk xn / Nk    Σk = ∑n rnk (xn − μk)(xn − μk)T / Nk    πk = Nk / N

where Nk = ∑n rnk is the effective number of points in cluster k.

EM guarantees: Each iteration increases the log-likelihood (or keeps it the same). Like K-means, EM converges to a local maximum, not necessarily the global one. Multiple restarts help. EM is more general than GMMs — it applies to any model with latent variables (factor analysis, HMMs, mixture of experts).
PropertyEM for GMMs
E-stepCompute rnk (soft assignments via Bayes' rule)
M-stepUpdate μk, Σk, πk using weighted statistics
ConvergenceMonotonic increase in log-likelihood
InitializationK-means++ for centroids, then one E-step
Why does EM need two steps instead of directly maximizing the log-likelihood?

Chapter 7: Spectral Clustering

K-means and GMMs assume clusters are convex (roughly blob-shaped). What about rings, spirals, or other non-convex shapes? Spectral clustering (Murphy 21.5) handles these by working with a similarity graph rather than raw coordinates.

The algorithm has three steps:

1. Build similarity graph
Wij = exp(−||xi − xj||² / 2σ²) if neighbors, else 0
2. Compute graph Laplacian eigenvectors
L = D − W, find the K smallest eigenvectors of Lsym
3. Cluster in eigenvector space
Run K-means on the N×K matrix of eigenvectors
Why eigenvectors? The graph Laplacian L = D − W (where D is the degree matrix) encodes the connectivity of the data. Its smallest eigenvectors are smooth over the graph: they change slowly between connected nodes and jump across cuts. K-means on these eigenvectors separates clusters even if they are non-convex in the original space.
Normalized cuts (21.5.1): Spectral clustering can be interpreted as approximately solving the normalized cut problem: find a partition that minimizes the edges cut between clusters, normalized by the cluster sizes. This is NP-hard in general, but the spectral relaxation gives a good approximation.

Connection with kernel PCA: Spectral clustering using the graph Laplacian eigenvectors is closely related to kernel PCA with the same kernel. Both use the leading eigenvectors of a kernel matrix to embed the data in a space where linear methods work.

Why can spectral clustering find non-convex clusters that K-means cannot?

Chapter 8: Manifold Learning

PCA assumes the data lives in a linear subspace. Real data often lies on a curved manifold — a smooth, low-dimensional surface embedded in high-dimensional space (Murphy 20.4). Think of a sheet of paper crumpled into a ball: it is 2D intrinsically, but 3D extrinsically.

The manifold hypothesis: Natural high-dimensional data (images, speech, text) actually lies on or near a low-dimensional manifold. A 256×256 image has 65,536 pixel values, but the set of "natural images" occupies a tiny fraction of this space. Manifold learning methods try to discover this structure.
MethodApproachPreserves
MDSFind embedding that preserves pairwise distancesGlobal distances
IsomapUse geodesic (graph) distances instead of EuclideanGeodesic distances
LLEPreserve local linear reconstruction weightsLocal geometry
Laplacian EigenmapsMinimize weighted distances in embeddingLocal connectivity
t-SNEMatch probability distributions over neighborsLocal structure (visualization)
t-SNE (20.4.10): The most popular visualization tool. It converts high-dimensional distances to probabilities: nearby points get high probability, far points get low. Then it finds a 2D embedding with matching probabilities. The "t" comes from using a Student-t distribution in the low-dimensional space, which has heavier tails than the Gaussian used in high dimensions. This prevents the "crowding problem" where moderate-distance points collapse together.

Word embeddings (20.5): Word2Vec and GloVe learn manifold-like structure for words. Each word is a point in a space where distances reflect semantic similarity. Famous example: vector("king") − vector("man") + vector("woman") ≈ vector("queen"). These are essentially nonlinear dimensionality reduction of co-occurrence statistics.

What is the manifold hypothesis?

Chapter 9: K-Means Explorer

Watch K-means clustering in action. Click to place data points, choose K, and step through the algorithm one iteration at a time.

K-Means Clustering

Click to add data points. Then hit Step to run one E-step + M-step iteration. Centroids are shown as large squares. Colors indicate cluster assignments.

0 points | iter 0
K3
What to observe: Place 3 distinct clusters of points. Start with K=3 and step through. Watch how centroids quickly move to cluster centers. Then try K=2 (too few) or K=5 (too many) to see how the algorithm handles misspecified K. Try different initializations to see convergence to different local minima.
Why might K-means converge to different solutions with different random initializations?

Chapter 10: PCA Playground

Visualize PCA in 2D. Place data points and see the principal components (eigenvectors of the covariance matrix). The first PC points in the direction of maximum variance.

PCA: Principal Components

Click to add points. The orange arrow is PC1 (max variance). The teal arrow is PC2. Arrow length is proportional to the eigenvalue (variance explained).

0 points
Try this: Add points in an elongated ellipse. PC1 will align with the long axis. Add points in a circular pattern — both eigenvalues will be roughly equal (no preferred direction). Add an outlier far away and watch how it pulls PC1 toward it.
If a 2D dataset lies along a narrow diagonal stripe, what will the first principal component look like?

Chapter 11: Connections

Unsupervised learning provides the representations and structures that power modern ML systems.

Concept from this chapterWhere it leads
PCAWhitening for training, face recognition (eigenfaces), kernel PCA
AutoencodersPretraining for DNNs, anomaly detection, image compression
VAEsImage generation, drug discovery, disentangled representations, diffusion models
K-meansVector quantization in VQ-VAEs, codebook learning, initialization for GMMs
GMMsSpeaker recognition, anomaly detection, density estimation for GANs
EM algorithmTraining HMMs, learning with missing data, mixture of experts
t-SNEVisualizing neural network embeddings, single-cell genomics
Word embeddingsInput representations for NLP, retrieval, recommendation
What we covered: PCA for linear dimensionality reduction, autoencoders for nonlinear feature learning, VAEs as probabilistic generative models with the ELBO and reparameterization trick, K-means clustering, Gaussian mixture models with soft assignments, the EM algorithm, spectral clustering for non-convex shapes, manifold learning (t-SNE, Isomap), and word embeddings.
The complete picture: Murphy's book builds a tower from probability and statistics (Ch 2–4) through linear models (Ch 10–11) to deep networks (Ch 13–15), nonparametric methods (Ch 16–17), and unsupervised learning (Ch 20–21). Each layer builds on the last. The probabilistic framework — priors, likelihoods, posteriors — unifies them all.

"All models are wrong, but some are useful." — George Box

How does a VAE differ from a standard autoencoder?