Bishop PRML, Chapter 12

Continuous Latent Variables

PCA, probabilistic PCA, kernel PCA, factor analysis, ICA, and autoencoders — discovering low-dimensional structure in high-dimensional data.

Prerequisites: Chapters 2–3, 9 (Gaussians, linear regression, EM).
12 chapters · 2 simulations · 12 quizzes

Chapter 0: Why Latent Variables?

Real-world data is often high-dimensional (images: millions of pixels, genomes: billions of bases), but the intrinsic dimensionality is much lower. Faces can be described by a few dozen parameters (pose, expression, lighting). Handwritten digits live on a manifold of dimension ~10 in a 784-dimensional pixel space.

Dimensionality reduction: Find a low-dimensional representation $z \in \mathbb{R}^M$ that captures the essential structure of high-dimensional data $x \in \mathbb{R}^D$ ($M \ll D$). This enables visualization ($M = 2$), compression, denoising, and more efficient learning. PCA is the simplest approach: project onto the directions of maximum variance.

The probabilistic perspective: we model x as generated from a latent variable z via a mapping plus noise. This gives us a generative model, uncertainty estimates, and principled handling of missing data.

Check: Why is dimensionality reduction possible for real-world data?

Chapter 1: PCA — Maximum Variance

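PCA finds the $M$ directions that capture the most variance. The first principal component $u_1$ solves:

$$\max_{u_1} \; u_1^T S u_1 \quad \text{subject to} \quad u_1^T u_1 = 1$$

where $S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$ is the data covariance matrix and $\bar{x} = \frac{1}{N} \sum_{n} x_n$ is the sample mean.

Introducing a Lagrange multiplier $\lambda_1$ for the constraint gives $S u_1 = \lambda_1 u_1$, so $u_1$ must be an eigenvector of $S$. Left-multiplying by $u_1^T$ shows that the variance along this direction is $u_1^T S u_1 = \lambda_1$, so the maximum is attained by the eigenvector with the largest eigenvalue.

Successive components: The second principal component maximizes variance subject to being orthogonal to the first: $u_2^T u_1 = 0$. This gives the eigenvector with the second-largest eigenvalue. In general, the top $M$ eigenvectors of $S$ give the $M$ principal components. The proportion of variance retained is $\sum_{i=1}^{M} \lambda_i \,/\, \sum_{i=1}^{D} \lambda_i$.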
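The eigenvector recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `pca_eig` and the toy data (a rotated 3-D Gaussian with variances 9, 1, and 0.01) are assumptions made for the example.

```python
import numpy as np

def pca_eig(X, M):
    """PCA via eigendecomposition of the sample covariance (rows of X are data points)."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]             # S = (1/N) sum_n (x_n - x_bar)(x_n - x_bar)^T
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    U = eigvecs[:, :M]                     # top-M principal directions as columns
    retained = eigvals[:M].sum() / eigvals.sum()
    return x_bar, U, eigvals, retained

# Hypothetical toy data: axis-aligned Gaussian, then rotated by a random orthogonal Q
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = (rng.normal(size=(2000, 3)) * np.sqrt([9.0, 1.0, 0.01])) @ Q.T

x_bar, U, eigvals, retained = pca_eig(X, M=2)
Z = (X - x_bar) @ U                        # M-dimensional coordinates of each point
print("fraction of variance retained:", retained)
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is the statement "the variance along $u_i$ is $\lambda_i$" in numerical form.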
Check: What are the principal components of a dataset?

Chapter 2: PCA — Minimum Error

An equivalent formulation: PCA finds the M-dimensional subspace that minimizes reconstruction error:

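$$J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2$$

where $\tilde{x}_n$ is the projection of $x_n$ onto the $M$-dimensional subspace, mapped back to $D$ dimensions.

The minimum distortion is:

$$J_{\min} = \sum_{i=M+1}^{D} \lambda_i$$

that is, the sum of the discarded eigenvalues. Maximizing the retained variance and minimizing the reconstruction error therefore select the same subspace: the two formulations are equivalent.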
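The identity between reconstruction error and discarded eigenvalues can be checked numerically. The sketch below uses made-up correlated data and the arbitrary choice $M = 2$; it compares the measured $J$ with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated data: 500 points in 6 dimensions
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

M = 2
U = eigvecs[:, :M]
X_tilde = x_bar + (Xc @ U) @ U.T                     # project to M dims and back to D dims

J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))      # average squared reconstruction error
J_min = eigvals[M:].sum()                            # sum of discarded eigenvalues
print(J, J_min)
```

The two printed numbers agree up to floating-point error, because $J = \mathrm{Tr}\big((I - U U^T) S\big)$ when the same $\frac{1}{N}$ normalization is used for both $S$ and $J$.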

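High-dimensional PCA trick: When $D > N$ (more features than data points), forming and decomposing the $D \times D$ covariance matrix is expensive, and at most $N - 1$ of its eigenvalues are nonzero anyway. Instead, let $X$ be the $N \times D$ centered data matrix, so that $S = \frac{1}{N} X^T X$, and work with the $N \times N$ matrix $\frac{1}{N} X X^T$. It has the same nonzero eigenvalues $\lambda_i$ as $S$, and each of its eigenvectors $v_i$ maps back to a unit-length principal component $u_i = (N \lambda_i)^{-1/2} X^T v_i$. This reduces the eigendecomposition cost from $O(D^3)$ to $O(N^3)$, which is critical for images and genomics.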
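A small sketch of the $N \times N$ trick, compared against the direct $D \times D$ route on synthetic data. The sizes ($N = 20$, $D = 500$) and the data itself are illustrative assumptions; the data matrix is taken to have one data point per row.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M = 20, 500, 3
X = rng.normal(size=(N, D)) * np.linspace(3.0, 0.5, D)   # hypothetical data with D >> N
Xc = X - X.mean(axis=0)

# Small problem: eigenvectors v_i of the N x N matrix (1/N) Xc Xc^T
G = Xc @ Xc.T / N
lam, V = np.linalg.eigh(G)
lam, V = lam[::-1][:M], V[:, ::-1][:, :M]

# Map back to data space with unit length: u_i = Xc^T v_i / sqrt(N * lambda_i)
U = Xc.T @ V / np.sqrt(N * lam)

# Reference: direct eigendecomposition of the D x D covariance
S = Xc.T @ Xc / N
lam_ref, U_ref = np.linalg.eigh(S)
lam_ref, U_ref = lam_ref[::-1][:M], U_ref[:, ::-1][:, :M]

# Eigenvectors are only defined up to sign, so compare |u_i . u_ref_i|
alignment = np.abs(np.sum(U * U_ref, axis=0))
print(lam, lam_ref, alignment)
```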
Check: What is the minimum reconstruction error in PCA?

Chapter 3: PCA Visualization

PCA: Finding Principal Components

Data is generated from an elongated 2D Gaussian. The first principal component (warm) aligns with the direction of maximum variance. Projections (teal) show reconstruction from 1D.

Simulation controls: angle (shown at 30°) and eccentricity (shown at 5.0).
Check: What does PCA's first principal component capture?

Chapter 4: Probabilistic PCA

Standard PCA is an algorithm, not a model. Probabilistic PCA (PPCA) gives PCA a generative interpretation:

x = Wz + μ + ε

where $z \sim \mathcal{N}(0, I_M)$ is the latent variable, $W$ is a $D \times M$ loading matrix, and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ is isotropic noise.

The marginal distribution of x is:

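$$p(x) = \mathcal{N}(x \mid \mu, \; W W^T + \sigma^2 I)$$

PCA emerges from ML: Maximum likelihood estimation of $W$ recovers the principal subspace. Specifically, $W_{\text{ML}} = U_M (\Lambda_M - \sigma^2 I)^{1/2} R$, where the columns of $U_M$ are the top $M$ eigenvectors of $S$, $\Lambda_M$ is the diagonal matrix of the corresponding eigenvalues, and $R$ is an arbitrary $M \times M$ orthogonal matrix. The maximum likelihood noise variance is the average of the discarded eigenvalues, $\sigma^2_{\text{ML}} = \frac{1}{D - M} \sum_{i=M+1}^{D} \lambda_i$. In the limit $\sigma^2 \to 0$, the posterior mean of $z$ becomes an orthogonal projection onto the principal subspace, which recovers standard PCA. The probabilistic formulation adds noise modeling, a proper likelihood, and enables Bayesian extensions.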

Benefits of PPCA over standard PCA: handles missing data (via EM), gives a density model (useful for anomaly detection), enables Bayesian model selection (choosing M), and connects to factor analysis.
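To make the closed-form solution concrete, here is a hedged sketch that fits PPCA by eigendecomposition (choosing $R = I$) and evaluates the Gaussian log-likelihood that makes PPCA usable as a density model. The helper names `ppca_ml` and `ppca_log_likelihood`, the synthetic data, and the true noise level $\sigma^2 = 0.09$ are assumptions for illustration.

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum-likelihood PPCA fit, taking the rotation R = I."""
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    lam, U = np.linalg.eigh(Xc.T @ Xc / N)
    lam, U = lam[::-1], U[:, ::-1]
    sigma2 = lam[M:].mean()                      # average of the discarded eigenvalues
    W = U[:, :M] * np.sqrt(lam[:M] - sigma2)     # U_M (Lambda_M - sigma^2 I)^{1/2}
    return mu, W, sigma2

def ppca_log_likelihood(X, mu, W, sigma2):
    """Sum over data points of log N(x_n | mu, W W^T + sigma^2 I)."""
    N, D = X.shape
    C = W @ W.T + sigma2 * np.eye(D)
    Xc = X - mu
    _, logdet = np.linalg.slogdet(C)
    quad = np.sum(Xc * np.linalg.solve(C, Xc.T).T)
    return -0.5 * (N * D * np.log(2 * np.pi) + N * logdet + quad)

# Hypothetical data drawn from the PPCA generative model x = W z + mu + eps (mu = 0)
rng = np.random.default_rng(3)
D, M, N = 5, 2, 5000
W_true = rng.normal(size=(D, M))
X = rng.normal(size=(N, M)) @ W_true.T + 0.3 * rng.normal(size=(N, D))

mu, W, sigma2 = ppca_ml(X, M)
ll_fit = ppca_log_likelihood(X, mu, W, sigma2)
ll_true = ppca_log_likelihood(X, np.zeros(D), W_true, 0.09)
print(sigma2, ll_fit, ll_true)
```

The fitted `W` need not resemble `W_true` column by column, since the model is only identified up to the rotation $R$; what can be compared is $W W^T$ or the likelihood.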

Check: How does probabilistic PCA relate to standard PCA?

Chapter 5: EM for PCA

The EM algorithm for PPCA avoids computing the D × D covariance matrix, making it efficient for high-dimensional data:

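E-step: Compute the posterior moments of the latent variables:

$$\mathbb{E}[z_n] = \mathbf{M}^{-1} W^T (x_n - \mu)$$

$$\mathbb{E}[z_n z_n^T] = \sigma^2 \mathbf{M}^{-1} + \mathbb{E}[z_n] \, \mathbb{E}[z_n]^T$$

where $\mathbf{M} = W^T W + \sigma^2 I$ is only $M \times M$ (bold $\mathbf{M}$ denotes this matrix; italic $M$ is the latent dimension).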

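M-step: Update the loading matrix (the noise variance $\sigma^2$ is re-estimated in the same step):

$$W_{\text{new}} = \Big[ \sum_n (x_n - \mu) \, \mathbb{E}[z_n]^T \Big] \Big[ \sum_n \mathbb{E}[z_n z_n^T] \Big]^{-1}$$

Computational advantage: Each EM iteration costs $O(NMD)$, whereas forming the covariance matrix costs $O(ND^2)$ and its full eigendecomposition $O(D^3)$. When $M \ll D$, this is much cheaper. For a 1000×1000 image ($D = 10^6$) with $M = 50$ components, the $D \times D$ covariance matrix would have $10^{12}$ entries, which is impractical even to store, while EM only needs $D \times M$ and $M \times M$ matrices. EM also handles missing data: the missing entries are treated as additional latent variables and marginalized in the E-step.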
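The two steps can be sketched as a short NumPy loop. This is an illustrative implementation under stated assumptions rather than a tuned one: the function name `ppca_em`, the random initialization, the fixed iteration count, and the synthetic data are all choices made for the example. The $\sigma^2$ update used here is the standard PPCA M-step, $\sigma^2_{\text{new}} = \frac{1}{ND} \sum_n \{ \|x_n - \mu\|^2 - 2\, \mathbb{E}[z_n]^T W_{\text{new}}^T (x_n - \mu) + \mathrm{Tr}(\mathbb{E}[z_n z_n^T] W_{\text{new}}^T W_{\text{new}}) \}$.

```python
import numpy as np

def ppca_em(X, M, n_iter=200, seed=0):
    """EM for probabilistic PCA; never forms a D x D matrix."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.normal(size=(D, M))          # arbitrary random initialization
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of z_n
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ M_inv                           # row n is E[z_n]^T
        sum_Ezz = N * sigma2 * M_inv + Ez.T @ Ez      # sum_n E[z_n z_n^T]
        # M-step: update W, then sigma^2 using the new W
        W_new = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(sum_Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return mu, W, sigma2

# Hypothetical data from the PPCA model with true sigma^2 = 0.09
rng = np.random.default_rng(7)
N, D, M = 2000, 5, 2
W_true = np.array([[2.0, 0.0], [0.0, 1.5], [1.0, 1.0], [0.5, -1.0], [-1.0, 0.5]])
X = rng.normal(size=(N, M)) @ W_true.T + 0.3 * rng.normal(size=(N, D))

mu, W, sigma2 = ppca_em(X, M, n_iter=2000)
C_em = W @ W.T + sigma2 * np.eye(D)      # model covariance implied by the EM fit
print(sigma2)
```

On this small problem the EM fit should agree with the closed-form eigenvector solution; the payoff of EM comes when $D$ is too large for the covariance matrix to be formed at all.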
Check: When is EM for PCA preferred over eigendecomposition?

Chapter 6: Bayesian PCA

How many principal components should we keep? Bayesian PCA answers this automatically by placing priors on the columns of W and using automatic relevance determination (ARD).

Each column $w_i$ of $W$ gets its own precision hyperparameter $\alpha_i$:

$$p(W \mid \alpha) = \prod_{i=1}^{M} \left( \frac{\alpha_i}{2\pi} \right)^{D/2} \exp\left( -\tfrac{1}{2} \alpha_i \, w_i^T w_i \right)$$

Automatic dimensionality selection: During variational inference or evidence maximization, some $\alpha_i \to \infty$, which drives the corresponding columns of $W$ to zero. These components are effectively pruned. Start with $M$ larger than necessary; the algorithm finds the effective dimensionality. This is the same ARD mechanism as in the RVM (Ch 7), a recurring Bayesian theme.
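For reference, the re-estimation equations usually quoted for the evidence-maximization route (as in PRML §12.2.3, stated here from the standard derivation and worth checking against the text) interleave a hyperparameter update with a modified EM M-step. Here $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_M)$ is notation introduced for this sketch.

```latex
\alpha_i^{\text{new}} = \frac{D}{w_i^T w_i}
\qquad
W_{\text{new}} =
\Big[ \sum_{n} (x_n - \mu)\, \mathbb{E}[z_n]^T \Big]
\Big[ \sum_{n} \mathbb{E}[z_n z_n^T] + \sigma^2 A \Big]^{-1}
```

A column with small norm gets a large $\alpha_i$, and the extra $\sigma^2 A$ term then shrinks that column further on the next M-step, which is how unneeded components are driven toward zero.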
Check: How does Bayesian PCA determine the number of components?

Chapter 7: Factor Analysis

Factor analysis generalizes PPCA by allowing non-isotropic noise:

x = Wz + μ + ε,    ε ~ N(0, Ψ)

where $\Psi = \mathrm{diag}(\psi_1, \ldots, \psi_D)$ is a diagonal (but not scalar) noise covariance matrix. Each observed dimension has its own noise variance.

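PPCA vs. factor analysis: In PPCA, $\psi_1 = \ldots = \psi_D = \sigma^2$ (the same noise variance in every dimension); in factor analysis, the $\psi_i$ differ per dimension. As a result, PPCA is covariant under rotations of the data space, whereas factor analysis is covariant under independent rescaling of each observed dimension. Factor analysis does not remove the rotational ambiguity in latent space, however: replacing $W$ with $WR$ for any orthogonal $R$ leaves $W W^T$, and hence the likelihood, unchanged, so the factors are determined only up to a rotation. Factor analysis is the better model when some features are noisier than others, which is common in practice.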

The marginal distribution is $p(x) = \mathcal{N}(x \mid \mu, \; W W^T + \Psi)$. Factor analysis decomposes the covariance into a low-rank component ($W W^T$, capturing variance shared through the latent factors) plus diagonal noise ($\Psi$, whose entries are called uniquenesses).
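The covariance decomposition can be illustrated by sampling from the factor analysis model and comparing the sample covariance with $W W^T + \Psi$. The loading matrix, noise variances, and rotation angle below are made-up values for the sketch. It also shows that rotating the latent space, $W \to WR$, leaves the marginal covariance unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
D, M, N = 4, 2, 200_000
W = rng.normal(size=(D, M))              # hypothetical loading matrix
psi = np.array([0.1, 0.5, 1.0, 2.0])     # hypothetical per-dimension noise variances
mu = np.zeros(D)

# Sample from the generative model x = W z + mu + eps, with eps ~ N(0, diag(psi))
Z = rng.normal(size=(N, M))
eps = rng.normal(size=(N, D)) * np.sqrt(psi)
X = Z @ W.T + mu + eps

C_model = W @ W.T + np.diag(psi)         # low-rank part plus diagonal noise
C_sample = np.cov(X, rowvar=False)

# An orthogonal rotation of the latent space does not change the marginal covariance
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C_rotated = (W @ R) @ (W @ R).T + np.diag(psi)

print(np.abs(C_sample - C_model).max(), np.abs(C_rotated - C_model).max())
```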

Check: What is the key difference between PPCA and factor analysis?

Chapter 8: Kernel PCA

Kernel PCA extends PCA to nonlinear dimensionality reduction using the kernel trick from Chapter 6.

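Standard PCA works with the covariance matrix $S = \frac{1}{N} \Phi^T \Phi$ in feature space, where the rows of $\Phi$ are the centered feature vectors $\phi(x_n)^T$. The dual formulation uses the kernel matrix $K_{nm} = k(x_n, x_m)$.

The principal components in feature space are given by the eigenvectors of the centered kernel matrix $\tilde{K}$:

$$\tilde{K} a_i = N \lambda_i a_i$$

where $\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N$ is the centered kernel matrix and $\mathbf{1}_N$ is the $N \times N$ matrix whose entries all equal $1/N$.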

Nonlinear structure from linear PCA: Kernel PCA performs linear PCA in a (potentially infinite-dimensional) feature space, which corresponds to nonlinear dimensionality reduction in input space. A Gaussian kernel gives a smooth manifold; a polynomial kernel gives polynomial features. The kernel determines what "structure" means.

Limitation: Kernel PCA does not give an explicit mapping back to input space (the pre-image problem). PPCA gives a full generative model; kernel PCA gives only projections.
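A compact sketch of kernel PCA on the training points, assuming a Gaussian kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$ and a toy dataset of two concentric rings. The helper names, the value of `gamma`, and the data are illustrative assumptions. The coefficient vectors are scaled so that $N \lambda_i \, a_i^T a_i = 1$, which gives the feature-space eigenvectors unit length.

```python
import numpy as np

def gaussian_kernel(X, gamma):
    """Kernel matrix K_nm = exp(-gamma * ||x_n - x_m||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_pca(K, M):
    """Kernel PCA projections of the training points from a kernel matrix K."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)                 # every entry equals 1/N
    K_c = K - one_N @ K - K @ one_N + one_N @ K @ one_N
    evals, evecs = np.linalg.eigh(K_c)
    evals, evecs = evals[::-1][:M], evecs[:, ::-1][:, :M]
    lam = evals / N                                  # K_c a_i = N lambda_i a_i
    A = evecs / np.sqrt(evals)                       # enforce N lambda_i a_i^T a_i = 1
    Y = K_c @ A                                      # y_i(x_n) = sum_m a_im k_c(x_n, x_m)
    return Y, lam, K_c

# Hypothetical data: two noisy concentric rings, not linearly separable in input space
rng = np.random.default_rng(8)
N = 200
angles = rng.uniform(0.0, 2.0 * np.pi, N)
radii = np.where(np.arange(N) < N // 2, 1.0, 3.0)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X = X + 0.05 * rng.normal(size=(N, 2))

Y, lam, K_c = kernel_pca(gaussian_kernel(X, gamma=0.5), M=2)
print(lam)
```

Passing the linear kernel `X @ X.T` to the same function reproduces ordinary PCA eigenvalues, which is a convenient sanity check on the centering and normalization.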

Check: What does kernel PCA achieve?

Chapter 9: Independent Component Analysis

ICA (Independent Component Analysis) finds components that are statistically independent, not just uncorrelated. The model:

x = As

where $s = (s_1, \ldots, s_M)^T$ are independent non-Gaussian sources and $A$ is an unknown mixing matrix.

PCA vs. ICA: PCA finds uncorrelated components (no linear correlation). ICA finds independent components (no dependence of any kind). For Gaussian sources, uncorrelated implies independent, so independence adds no constraint beyond the decorrelation PCA already provides, and the sources can be recovered only up to an arbitrary rotation. ICA is interesting precisely when sources are non-Gaussian. The classic example is the "cocktail party problem": separating mixed audio signals into individual speakers.

ICA assumes at most one source is Gaussian (otherwise the mixing is unidentifiable). Common ICA algorithms maximize non-Gaussianity of the recovered sources (kurtosis, negentropy) or minimize mutual information.
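A deliberately simple two-source sketch of the "maximize non-Gaussianity" idea: whiten the mixtures (the PCA step), then search over rotation angles for the one that maximizes the total absolute excess kurtosis. The grid search stands in for the fixed-point or gradient updates that practical ICA algorithms use. The uniform sources and the mixing matrix are assumptions made for the demonstration.

```python
import numpy as np

def excess_kurtosis(Y):
    """Column-wise excess kurtosis (zero for a Gaussian)."""
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return np.mean(Y ** 4, axis=0) - 3.0

rng = np.random.default_rng(5)
N = 20000
# Independent, unit-variance, non-Gaussian (uniform) sources
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(N, 2))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])               # hypothetical mixing matrix
X = S @ A.T                              # each row is x = A s

# Step 1: whitening removes correlations but leaves an unknown rotation
Xc = X - X.mean(axis=0)
lam, U = np.linalg.eigh(Xc.T @ Xc / N)
Xw = (Xc @ U) / np.sqrt(lam)

# Step 2: choose the rotation that makes the components as non-Gaussian as possible
def rotate(theta):
    c, s = np.cos(theta), np.sin(theta)
    return Xw @ np.array([[c, -s], [s, c]])

thetas = np.linspace(0.0, np.pi / 2, 181)
scores = [np.abs(excess_kurtosis(rotate(t))).sum() for t in thetas]
S_hat = rotate(thetas[int(np.argmax(scores))])

# Recovered sources match the true ones only up to permutation and sign
corr = np.abs(np.corrcoef(S_hat.T, S.T)[:2, 2:])
print(corr.round(3))
```

Searching angles in $[0, \pi/2]$ is enough in two dimensions because any remaining orthogonal ambiguity is a permutation or sign flip of the sources, which ICA cannot resolve anyway.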

Check: When does ICA give different results from PCA?

Chapter 10: Autoencoders

An autoassociative neural network (autoencoder) learns a nonlinear low-dimensional representation by training a network to reconstruct its input through a bottleneck.

Architecture: input ($x$, $D$ units) → encoder (hidden layers) → bottleneck ($z$, $M$ units) → decoder (hidden layers) → output (reconstruction $\hat{x}$, $D$ units).

Linear autoencoder = PCA: With a single hidden layer, linear activations, and squared-error loss, the optimal autoencoder spans exactly the same subspace as PCA (though not necessarily with the same basis). With nonlinear activations and additional hidden layers, the autoencoder can learn a nonlinear manifold, a more powerful generalization. Modern variational autoencoders (VAEs) can be seen as a nonlinear generalization of probabilistic PCA: the decoder is a neural network rather than the linear map $W$, and an encoder network is trained by variational inference (Ch 10) to approximate the posterior over $z$.

The connection to PPCA: a linear autoencoder trained with MSE loss is equivalent to PPCA in the limit $\sigma^2 \to 0$. The bottleneck dimension $M$ plays the role of the latent space dimension. VAEs make this connection explicit and add noise modeling.
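The subspace claim can be checked with a tiny linear autoencoder trained by plain gradient descent in NumPy. This is a sketch under assumptions: the synthetic data, learning rate, initialization scale, and iteration count are hand-picked for this toy problem and are not general recommendations. The comparison uses projection matrices, since the learned basis is generally not the eigenvector basis.

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, M = 1000, 5, 2
# Hypothetical data: variances 4, 2, 0.1, 0.1, 0.1 along randomly rotated axes
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
X = (rng.normal(size=(N, D)) * np.sqrt([4.0, 2.0, 0.1, 0.1, 0.1])) @ Q.T
Xc = X - X.mean(axis=0)

# Linear autoencoder x_hat = x W_enc W_dec, trained on mean squared error
W_enc = 0.1 * rng.normal(size=(D, M))
W_dec = 0.1 * rng.normal(size=(M, D))
lr = 0.01
for _ in range(10000):
    H = Xc @ W_enc                       # bottleneck activations (N x M)
    R = H @ W_dec - Xc                   # reconstruction residuals (N x D)
    grad_dec = (2.0 / N) * (H.T @ R)
    grad_enc = (2.0 / N) * (Xc.T @ R @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = np.mean(np.sum((Xc @ W_enc @ W_dec - Xc) ** 2, axis=1))

# PCA subspace for comparison
lam, U = np.linalg.eigh(Xc.T @ Xc / N)   # ascending eigenvalues
U_M = U[:, ::-1][:, :M]
P_pca = U_M @ U_M.T                      # projector onto the principal subspace

# Projector onto the span of the decoder rows
Q_ae, _ = np.linalg.qr(W_dec.T)
P_ae = Q_ae @ Q_ae.T

print(np.abs(P_ae - P_pca).max(), mse, lam[:D - M].sum())
```

If training has converged, the two projectors agree and the autoencoder's reconstruction error matches the sum of the discarded eigenvalues.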

Check: What does a linear autoencoder learn?

Chapter 11: Summary

Method          | Noise model            | Mapping            | Distinguishing feature
PCA             | None                   | Linear             | Uncorrelated components, maximum variance
PPCA            | Isotropic $\sigma^2 I$ | Linear             | Generative model, handles missing data
Factor analysis | Diagonal $\Psi$        | Linear             | Per-feature noise variances
Bayesian PCA    | Isotropic + ARD        | Linear             | Selects $M$ automatically
Kernel PCA      | None                   | Nonlinear (kernel) | Nonlinear manifolds
ICA             | None                   | Linear             | Independent, non-Gaussian sources
Autoencoder     | Learned                | Nonlinear (NN)     | Flexible nonlinear encoding

The latent variable perspective: All these methods share a common idea: data x is generated from a simpler latent representation z. The differences lie in the mapping (linear vs nonlinear), the noise model (isotropic vs diagonal vs none), and the prior on z (Gaussian for PCA/FA, non-Gaussian for ICA). This latent variable framework extends to modern deep generative models: VAEs, GANs, and diffusion models.

What comes next: Chapter 13 introduces sequential data — time series models (HMMs, Kalman filters) where the latent variables have temporal structure.

"PCA can be viewed as a limiting case of a linear-Gaussian
latent variable model."
— Christopher Bishop, PRML §12.2
Check: What unifying idea connects PCA, factor analysis, and autoencoders?