Find structure without labels. PCA finds the axes of variation. Autoencoders learn compressed representations. K-means and GMMs discover clusters. The EM algorithm ties it all together.
Every model so far has used labeled data: (x, y) pairs where y is the target. But most data in the world is unlabeled. Images, text, sensor readings — they arrive without annotations. Unsupervised learning finds structure in data without labels.
Two fundamental tasks: (1) dimensionality reduction — find a compact representation that captures the important variation, and (2) clustering — group similar data points together.
High-dimensional data often lies near a lower-dimensional subspace. PCA (Murphy 20.1) finds the directions of maximum variance and projects the data onto them.
Given centered data X (mean subtracted), PCA computes the eigendecomposition of the covariance matrix:
The columns of U are the principal components (eigenvectors), ordered by decreasing eigenvalue λd. The first principal component is the direction of maximum variance. The k-th component captures the most remaining variance orthogonal to the first k−1.
| Property | Details |
|---|---|
| Objective | Maximize variance of projections (or minimize reconstruction error) |
| Solution | Eigenvectors of covariance matrix (or SVD of data matrix) |
| Linearity | Linear projection only; cannot capture nonlinear structure |
| Reconstruction | x̂ = ULULTx (projection then lifting) |
| Probabilistic version | PPCA: p(x | z) = N(Wz + μ, σ²I), z ∼ N(0, I) |
PCA is limited to linear projections. What if the data lies on a curved manifold? Autoencoders (Murphy 20.3) use neural networks to learn nonlinear encodings.
An autoencoder has two parts: an encoder z = fθ(x) that maps the input to a low-dimensional latent code, and a decoder x̂ = gφ(z) that reconstructs the input from the code. The model is trained to minimize reconstruction error:
Variants that add structure to the latent space:
| Variant | Modification | Effect |
|---|---|---|
| Bottleneck AE | L << D | Forces compression (like PCA but nonlinear) |
| Denoising AE | Corrupt input, reconstruct clean | Learns robust features, ignores noise |
| Sparse AE | Penalize activations of z | Only a few latent units active per input |
| Contractive AE | Penalize ||∂z/∂x||2 | Representation is locally invariant to input perturbations |
Standard autoencoders learn a mapping, not a probability distribution. You cannot sample new data from them (the latent space may have "holes" where the decoder produces garbage). Variational autoencoders (VAEs, Kingma & Welling 2014, Murphy 20.3.5) fix this by making the model generative.
The generative model is: sample z ∼ N(0, I), then sample x ∼ pθ(x | z). The decoder is pθ(x | z) (e.g., a Gaussian whose mean is a neural network). We want to learn θ by maximizing the marginal likelihood p(x) = ∫ pθ(x | z) p(z) dz, but this integral is intractable.
The encoder outputs the parameters of qφ(z | x) = N(μφ(x), σφ²(x)). During training, we sample z = μ + σ ⊙ ε, where ε ∼ N(0, I). This reparameterization trick makes the sampling differentiable, allowing backpropagation through the stochastic node.
For Gaussian p(x | z) and q(z | x), both terms have closed-form expressions. The KL term for two Gaussians is: KL = ½ ∑j (μj² + σj² − log σj² − 1).
Given N unlabeled data points, partition them into K groups such that points within each group are similar. K-means (Murphy 21.3) is the simplest and most widely used clustering algorithm.
The algorithm alternates two steps:
E-step: assign each point to the nearest centroid. M-step: update each centroid to the mean of its assigned points. Repeat until convergence.
| Property | K-Means |
|---|---|
| Cluster shape | Spherical (Voronoi cells) |
| Assignments | Hard (each point in exactly one cluster) |
| Objective | Minimize total within-cluster squared distance |
| Convergence | Always converges (to a local minimum) |
| Choosing K | Elbow method, silhouette score, or gap statistic |
K-means makes hard assignments: each point belongs to exactly one cluster. What if a point is ambiguous, lying between two clusters? Gaussian Mixture Models (GMMs, Murphy 21.4) give soft assignments — probabilities of belonging to each cluster.
where πk are the mixing proportions (how prevalent each cluster is, ∑πk = 1), and each component k is a Gaussian with its own mean μk and covariance Σk.
GMMs generalize K-means in three ways:
| Feature | K-Means | GMM |
|---|---|---|
| Assignments | Hard (0 or 1) | Soft (probabilities) |
| Cluster shape | Spherical | Elliptical (full covariance) |
| Cluster size | Equal | Different (via Σk) |
| Cluster weight | Implicit | Explicit (πk) |
| Objective | Distortion | Log-likelihood |
How do we fit a GMM? We can't use simple MLE because the cluster assignments (latent variables z) are unknown. The Expectation-Maximization (EM) algorithm handles this elegantly.
EM alternates between two steps:
For GMMs, the M-step updates are:
where Nk = ∑n rnk is the effective number of points in cluster k.
| Property | EM for GMMs |
|---|---|
| E-step | Compute rnk (soft assignments via Bayes' rule) |
| M-step | Update μk, Σk, πk using weighted statistics |
| Convergence | Monotonic increase in log-likelihood |
| Initialization | K-means++ for centroids, then one E-step |
K-means and GMMs assume clusters are convex (roughly blob-shaped). What about rings, spirals, or other non-convex shapes? Spectral clustering (Murphy 21.5) handles these by working with a similarity graph rather than raw coordinates.
The algorithm has three steps:
Connection with kernel PCA: Spectral clustering using the graph Laplacian eigenvectors is closely related to kernel PCA with the same kernel. Both use the leading eigenvectors of a kernel matrix to embed the data in a space where linear methods work.
PCA assumes the data lives in a linear subspace. Real data often lies on a curved manifold — a smooth, low-dimensional surface embedded in high-dimensional space (Murphy 20.4). Think of a sheet of paper crumpled into a ball: it is 2D intrinsically, but 3D extrinsically.
| Method | Approach | Preserves |
|---|---|---|
| MDS | Find embedding that preserves pairwise distances | Global distances |
| Isomap | Use geodesic (graph) distances instead of Euclidean | Geodesic distances |
| LLE | Preserve local linear reconstruction weights | Local geometry |
| Laplacian Eigenmaps | Minimize weighted distances in embedding | Local connectivity |
| t-SNE | Match probability distributions over neighbors | Local structure (visualization) |
Word embeddings (20.5): Word2Vec and GloVe learn manifold-like structure for words. Each word is a point in a space where distances reflect semantic similarity. Famous example: vector("king") − vector("man") + vector("woman") ≈ vector("queen"). These are essentially nonlinear dimensionality reduction of co-occurrence statistics.
Watch K-means clustering in action. Click to place data points, choose K, and step through the algorithm one iteration at a time.
Click to add data points. Then hit Step to run one E-step + M-step iteration. Centroids are shown as large squares. Colors indicate cluster assignments.
Visualize PCA in 2D. Place data points and see the principal components (eigenvectors of the covariance matrix). The first PC points in the direction of maximum variance.
Click to add points. The orange arrow is PC1 (max variance). The teal arrow is PC2. Arrow length is proportional to the eigenvalue (variance explained).
Unsupervised learning provides the representations and structures that power modern ML systems.
| Concept from this chapter | Where it leads |
|---|---|
| PCA | Whitening for training, face recognition (eigenfaces), kernel PCA |
| Autoencoders | Pretraining for DNNs, anomaly detection, image compression |
| VAEs | Image generation, drug discovery, disentangled representations, diffusion models |
| K-means | Vector quantization in VQ-VAEs, codebook learning, initialization for GMMs |
| GMMs | Speaker recognition, anomaly detection, density estimation for GANs |
| EM algorithm | Training HMMs, learning with missing data, mixture of experts |
| t-SNE | Visualizing neural network embeddings, single-cell genomics |
| Word embeddings | Input representations for NLP, retrieval, recommendation |
"All models are wrong, but some are useful." — George Box