PCA, probabilistic PCA, kernel PCA, factor analysis, ICA, and autoencoders — discovering low-dimensional structure in high-dimensional data.
Real-world data is often high-dimensional (images: millions of pixels, genomes: billions of bases), but the intrinsic dimensionality is much lower. Faces can be described by a few dozen parameters (pose, expression, lighting). Handwritten digits live on a manifold of dimension ~10 in a 784-dimensional pixel space.
The probabilistic perspective: we model x as generated from a latent variable z via a mapping plus noise. This gives us a generative model, uncertainty estimates, and principled handling of missing data.
PCA finds the M directions that capture the most variance. The first principal component u1 maximizes:
where S = (1/N) ∑n (xn − x̄)(xn − x̄)T is the data covariance matrix.
Using Lagrange multipliers: Su1 = λ1u1. The solution is the eigenvector of S with the largest eigenvalue λ1. The variance along this direction is λ1.
An equivalent formulation: PCA finds the M-dimensional subspace that minimizes reconstruction error:
where x̄n is the projection of xn onto the M-dimensional subspace and back to D dimensions.
The minimum distortion is:
— the sum of the discarded eigenvalues. Maximum variance = minimum reconstruction error. The two formulations are equivalent.
Data is generated from an elongated 2D Gaussian. The first principal component (warm) aligns with the direction of maximum variance. Projections (teal) show reconstruction from 1D.
Standard PCA is an algorithm, not a model. Probabilistic PCA (PPCA) gives PCA a generative interpretation:
where z ~ N(0, IM) is the latent variable, W is a D × M loading matrix, and ε ~ N(0, σ2I) is isotropic noise.
The marginal distribution of x is:
Benefits of PPCA over standard PCA: handles missing data (via EM), gives a density model (useful for anomaly detection), enables Bayesian model selection (choosing M), and connects to factor analysis.
The EM algorithm for PPCA avoids computing the D × D covariance matrix, making it efficient for high-dimensional data:
E-step: Compute the posterior over latent variables:
where M = WTW + σ2I is only M × M.
M-step: Update the loading matrix:
How many principal components should we keep? Bayesian PCA answers this automatically by placing priors on the columns of W and using automatic relevance determination (ARD).
Each column wi of W gets its own precision hyperparameter αi:
Factor analysis generalizes PPCA by allowing non-isotropic noise:
where Ψ = diag(ψ1, ..., ψD) is a diagonal (but not scalar) noise matrix. Each observed dimension has its own noise level.
The marginal distribution is p(x) = N(x|μ, WWT + Ψ). Factor analysis decomposes the covariance into a low-rank component (WWT, capturing shared variance from latent factors) plus diagonal noise (uniquenesses).
Kernel PCA extends PCA to nonlinear dimensionality reduction using the kernel trick from Chapter 6.
Standard PCA works with the covariance matrix S = (1/N)ΦTΦ in feature space. The dual formulation uses the kernel matrix Knm = k(xn, xm).
The principal components in feature space are given by the eigenvectors of the centered kernel matrix K̃:
where K̃ = K − (1/N)1K − K1(1/N) + (1/N)1K1(1/N) is the centered kernel matrix.
Limitation: Kernel PCA does not give an explicit mapping back to input space (the pre-image problem). PPCA gives a full generative model; kernel PCA gives only projections.
ICA (Independent Component Analysis) finds components that are statistically independent, not just uncorrelated. The model:
where s = (s1, ..., sM) are independent non-Gaussian sources and A is an unknown mixing matrix.
ICA assumes at most one source is Gaussian (otherwise the mixing is unidentifiable). Common ICA algorithms maximize non-Gaussianity of the recovered sources (kurtosis, negentropy) or minimize mutual information.
An autoassociative neural network (autoencoder) learns a nonlinear low-dimensional representation by training a network to reconstruct its input through a bottleneck.
Architecture: input (x, D units) → encoder (hidden layers) → bottleneck (z, M units) → decoder (hidden layers) → output (x̂, D units).
The connection to PPCA: a linear autoencoder trained with MSE loss is equivalent to PPCA in the limit σ2 → 0. The bottleneck dimension M plays the role of the latent space dimension. VAEs make this connection explicit and add noise modeling.
| Method | Noise model | Mapping | Unique to |
|---|---|---|---|
| PCA | None | Linear | Uncorrelated, max variance |
| PPCA | Isotropic σ2I | Linear | Generative, missing data |
| Factor analysis | Diagonal Ψ | Linear | Per-feature noise, unique solution |
| Bayesian PCA | Isotropic + ARD | Linear | Auto-selects M |
| Kernel PCA | None | Nonlinear (kernel) | Nonlinear manifolds |
| ICA | None | Linear | Independent, non-Gaussian |
| Autoencoder | Learned | Nonlinear (NN) | Flexible nonlinear encoding |
What comes next: Chapter 13 introduces sequential data — time series models (HMMs, Kalman filters) where the latent variables have temporal structure.