Bishop PRML, Chapter 12

Continuous Latent Variables

PCA, probabilistic PCA, kernel PCA, factor analysis, ICA, and autoencoders — discovering low-dimensional structure in high-dimensional data.

Prerequisites: Chapters 2–3, 9 (Gaussians, linear regression, EM).

Chapter 0: Why Latent Variables?

Real-world data is often high-dimensional (images: millions of pixels, genomes: billions of bases), but the intrinsic dimensionality is much lower. Faces can be described by a few dozen parameters (pose, expression, lighting). Handwritten digits live on a manifold of dimension ~10 in a 784-dimensional pixel space.

Dimensionality reduction: Find a low-dimensional representation z ∈ R^M that captures the essential structure of high-dimensional data x ∈ R^D (M ≪ D). This enables visualization (M=2), compression, denoising, and more efficient learning. PCA is the simplest approach: project onto the directions of maximum variance.

The probabilistic perspective: we model x as generated from a latent variable z via a mapping plus noise. This gives us a generative model, uncertainty estimates, and principled handling of missing data.

Check: Why is dimensionality reduction possible for real-world data?

The intrinsic dimensionality is often much lower than the ambient dimensionality: data lies near a low-dimensional manifold
All dimensions are equally important
High-dimensional data has no structure

Chapter 1: PCA — Maximum Variance

PCA finds the M directions that capture the most variance. The first principal component u₁ maximizes:

max_u₁ u₁^T S u₁ subject to u₁^Tu₁ = 1

where S = (1/N) ∑_n (x_n − x̄)(x_n − x̄)^T is the data covariance matrix.

Using Lagrange multipliers: Su₁ = λ₁u₁. The solution is the eigenvector of S with the largest eigenvalue λ₁. The variance along this direction is λ₁.

Successive components: The second principal component maximizes variance subject to being orthogonal to the first: u₂^Tu₁ = 0. This gives the eigenvector with the second-largest eigenvalue. In general, the top M eigenvectors of S give the M principal components. The proportion of variance retained is (∑_{i=1}^{M} λ_i) / (∑_{i=1}^{D} λ_i).
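
The eigenvector recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `pca_eig` and the toy data are assumptions made for the example, and the covariance uses the same 1/N convention as the text.

```python
import numpy as np

def pca_eig(X, M):
    """Return the top-M principal directions, their eigenvalues, and all eigenvalues."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]             # data covariance, 1/N convention
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :M], eigvals[:M], eigvals

rng = np.random.default_rng(0)
# Hypothetical toy data: 500 points in 5 dimensions with unequal variances.
X = rng.normal(size=(500, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

U, lam, all_lam = pca_eig(X, M=2)
variance_retained = lam.sum() / all_lam.sum()

# The variance of the data projected onto u_i equals lambda_i.
projected_variance = np.var((X - X.mean(axis=0)) @ U, axis=0)
```

The last line checks the claim that the variance along each principal direction is the corresponding eigenvalue.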

Check: What are the principal components of a dataset?

The eigenvectors of the data covariance matrix, ordered by decreasing eigenvalue (variance captured)
The mean and variance of each feature
The cluster centers

Chapter 2: PCA — Minimum Error

An equivalent formulation: PCA finds the M-dimensional subspace that minimizes reconstruction error:

J = (1/N) ∑_{n=1}^{N} ||x_n − x̃_n||²

where x̃_n is the reconstruction of x_n: its projection onto the M-dimensional principal subspace, mapped back to D dimensions. (The tilde distinguishes it from the sample mean x̄.)

The minimum distortion is:

J_min = ∑_{i=M+1}^{D} λ_i

— the sum of the discarded eigenvalues. Maximum variance = minimum reconstruction error. The two formulations are equivalent.
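
The equivalence can be checked numerically. The sketch below, on hypothetical random data, reconstructs each point from its projection onto the top M eigenvectors and compares the average squared error with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated toy data: 400 points in 6 dimensions.
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))
N, D = X.shape
M = 2

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

U_M = eigvecs[:, :M]
X_rec = x_bar + (Xc @ U_M) @ U_M.T    # project to M dims, then map back to D dims
J = np.mean(np.sum((X - X_rec) ** 2, axis=1))
J_min = eigvals[M:].sum()             # sum of the discarded eigenvalues
```

Up to floating-point error, `J` and `J_min` agree.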

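High-dimensional PCA trick: When D > N (more features than data points), computing and decomposing the D × D covariance is expensive, and at most N − 1 of its eigenvalues are nonzero anyway. Let X be the N × D centered data matrix whose n-th row is (x_n − x̄)^T, so that S = (1/N)X^TX. Instead of S, work with the N × N matrix (1/N)XX^T, which has the same nonzero eigenvalues λ_i. If v_i is one of its unit eigenvectors, then u_i = X^Tv_i / (Nλ_i)^{1/2} is the corresponding unit-length principal component in D dimensions. This reduces the eigendecomposition cost from O(D³) to O(N³), which is critical for images and genomics.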
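
A small sketch of the trick, assuming the centered data matrix has one data point per row so that the small problem is the N × N matrix (1/N)XX^T. The sizes and random data are hypothetical; the direct D × D decomposition is included only as a reference.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M = 20, 200, 3              # fewer points than dimensions
X = rng.normal(size=(N, D))
Xc = X - X.mean(axis=0)

# Small N x N problem: (1/N) Xc Xc^T v_i = lambda_i v_i
G = Xc @ Xc.T / N
lam, V = np.linalg.eigh(G)
lam, V = lam[::-1][:M], V[:, ::-1][:, :M]

# Map back to D dimensions and normalise: u_i = Xc^T v_i / sqrt(N lambda_i)
U = Xc.T @ V / np.sqrt(N * lam)

# Reference: direct D x D eigendecomposition
S = Xc.T @ Xc / N
lam_ref, U_ref = np.linalg.eigh(S)
lam_ref, U_ref = lam_ref[::-1][:M], U_ref[:, ::-1][:, :M]
```

The recovered directions match the reference up to sign, which is the usual eigenvector ambiguity.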

Check: What is the minimum reconstruction error in PCA?

The sum of the discarded eigenvalues: the variance in the directions we throw away
Always zero
The largest eigenvalue

Chapter 3: PCA Visualization

PCA: Finding Principal Components

Data is generated from an elongated 2D Gaussian. The first principal component (warm) aligns with the direction of maximum variance. Projections (teal) show reconstruction from 1D.

Simulation controls: angle = 30°, eccentricity = 5.0.
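
For readers without the interactive widget, the setup can be approximated offline. The sketch below assumes that "eccentricity" means the ratio of standard deviations along the major and minor axes; the original simulation may define it differently. It samples the elongated Gaussian, recovers the first principal component, and forms the 1D reconstructions.

```python
import numpy as np

angle_deg, eccentricity = 30.0, 5.0    # values shown in the simulation controls
theta = np.deg2rad(angle_deg)

# Assumption: eccentricity = (std along major axis) / (std along minor axis).
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
cov = R @ np.diag([eccentricity ** 2, 1.0]) @ R.T

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=2000)

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(S)
u1 = eigvecs[:, -1]                    # eigh sorts ascending, so the last is largest

# Direction angle modulo 180 degrees, since u1 and -u1 are equivalent.
recovered_deg = np.rad2deg(np.arctan2(u1[1], u1[0])) % 180.0

# Rank-one reconstruction: project onto u1 and map back to 2D.
X_proj = X.mean(axis=0) + np.outer(Xc @ u1, u1)
```

With this many samples the recovered angle should land close to the 30° used to generate the data.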

Check: What does PCA's first principal component capture?

The direction of maximum variance in the data
The direction of minimum variance
The mean of the data

Chapter 4: Probabilistic PCA

Standard PCA is an algorithm, not a model. Probabilistic PCA (PPCA) gives PCA a generative interpretation:

x = Wz + μ + ε

where z ~ N(0, I_M) is the latent variable, W is a D × M loading matrix, and ε ~ N(0, σ²I) is isotropic noise.

The marginal distribution of x is:

p(x) = N(x|μ, WW^T + σ²I)

PCA emerges from ML: Maximum likelihood estimation of W recovers the principal subspace. Specifically, W_ML = U_M (Λ_M − σ²I)^{1/2} R, where the columns of U_M are the top M eigenvectors of S, Λ_M is the diagonal matrix of the corresponding eigenvalues, and R is an arbitrary M × M orthogonal matrix. The ML noise variance σ²_ML is the average of the D − M discarded eigenvalues. In the limit σ² → 0, this reduces to standard PCA. The probabilistic formulation adds noise modeling, a proper likelihood, and enables Bayesian extensions.
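
The closed-form solution can be sketched as follows. The function name `ppca_ml`, the loading matrix `W_true`, and the noise level are assumptions for the example; the arbitrary rotation is taken to be R = I, and the noise variance estimate is the average of the discarded eigenvalues (the standard ML result).

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form ML fit of PPCA, taking the arbitrary rotation R = I."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / N
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]                  # descending order
    sigma2 = lam[M:].mean()                         # average discarded eigenvalue
    W = U[:, :M] * np.sqrt(lam[:M] - sigma2)        # U_M (Lambda_M - sigma^2 I)^{1/2}
    return mu, W, sigma2, S

rng = np.random.default_rng(4)
D, M, N = 6, 2, 5000
# Hypothetical ground-truth loadings and noise level (sigma = 0.3, mu = 0).
W_true = np.array([[2.0, 0.0], [0.0, 1.5], [1.0, 1.0],
                   [1.0, -1.0], [0.5, 0.0], [0.0, 0.5]])
Z = rng.normal(size=(N, M))
X = Z @ W_true.T + 0.3 * rng.normal(size=(N, D))

mu, W, sigma2, S = ppca_ml(X, M)
C = W @ W.T + sigma2 * np.eye(D)                    # fitted marginal covariance
```

The fitted covariance `C` shares its top M eigenvalues with the sample covariance and replaces the remaining ones by `sigma2`.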

Benefits of PPCA over standard PCA: handles missing data (via EM), gives a density model (useful for anomaly detection), enables Bayesian model selection (choosing M), and connects to factor analysis.

Check: How does probabilistic PCA relate to standard PCA?

ML estimation of the PPCA model recovers the standard PCA solution; PPCA adds noise modeling and a proper likelihood
They are completely different methods
PPCA always gives different components

Chapter 5: EM for PCA

The EM algorithm for PPCA avoids computing the D × D covariance matrix, making it efficient for high-dimensional data:

E-step: Compute the posterior over latent variables:

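E[z_n] = M⁻¹W^T(x_n − μ)

E[z_nz_n^T] = σ²M⁻¹ + E[z_n]E[z_n]^T

where the matrix M = W^TW + σ²I is only M × M (the same letter denotes both this matrix and the latent dimension; PRML distinguishes them by writing the matrix in bold).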

M-step: Update the loading matrix:

W^new = [∑_n (x_n − μ) E[z_n]^T] [∑_n E[z_nz_n^T]]⁻¹

Computational advantage: Each EM iteration costs O(NMD), whereas forming the covariance matrix costs O(ND²) and a full eigendecomposition O(D³). When M ≪ D, EM is much cheaper. For a 1000×1000 image (D = 10⁶) with M = 50 components, EM is orders of magnitude faster. EM also handles missing data naturally: missing entries are treated as additional latent variables and marginalized over in the E-step.
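
The two steps can be put together in a short loop. This is a sketch under stated assumptions: the function name `ppca_em`, the initialization, the fixed iteration count, and the toy data are illustrative choices, and the noise-variance update (not written out above) is the standard PPCA M-step for σ². Note that no D × D matrix is ever formed.

```python
import numpy as np

def ppca_em(X, M, n_iter=300, seed=0):
    """EM for probabilistic PCA; only M x M matrices are inverted."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.normal(size=(D, M))     # arbitrary initial loadings
    sigma2 = 1.0                    # arbitrary initial noise variance
    for _ in range(n_iter):
        # E-step: posterior moments of z_n
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv                        # row n holds E[z_n]^T
        sum_Ezz = N * sigma2 * Minv + Ez.T @ Ez   # sum over n of E[z_n z_n^T]
        # M-step: update loadings, then noise variance
        W = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W))
                  + np.trace(sum_Ezz @ W.T @ W)) / (N * D)
    return mu, W, sigma2

rng = np.random.default_rng(5)
D, M, N = 6, 2, 2000
# Hypothetical ground-truth loadings; unit-variance isotropic noise.
W_true = np.array([[2.0, 0.0], [0.0, 1.5], [1.0, 1.0],
                   [1.0, -1.0], [0.5, 0.0], [0.0, 0.5]])
X = rng.normal(size=(N, M)) @ W_true.T + rng.normal(size=(N, D))

mu, W, sigma2 = ppca_em(X, M)
```

At convergence, W spans the principal subspace and σ² approaches the average of the discarded eigenvalues, but the columns of W are not in general the orthonormal eigenvectors themselves because of the rotation ambiguity.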

Check: When is EM for PCA preferred over eigendecomposition?

When D is very large and M is small: EM costs O(NMD) per iteration vs. O(D³) for eigendecomposition
Always
Never: eigendecomposition is always faster

Chapter 6: Bayesian PCA

How many principal components should we keep? Bayesian PCA answers this automatically by placing priors on the columns of W and using automatic relevance determination (ARD).

Each column w_i of W gets its own precision hyperparameter α_i:

p(W|α) = ∏_{i=1}^{M} (α_i/2π)^{D/2} exp(−½ α_i ||w_i||²)

Automatic dimensionality selection: During variational inference or evidence maximization, some α_i → ∞, which drives the corresponding columns of W to zero. These components are automatically pruned. Start with M larger than necessary; the algorithm finds the effective dimensionality. This is the same ARD mechanism as in the RVM (Ch 7) — a recurring Bayesian theme.
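
The pruning behaviour is easier to see from the re-estimation equations. In the evidence-approximation treatment (PRML §12.2.3), the hyperparameters and the loading matrix are updated as below, where A = diag(α_1, ..., α_M) is shorthand introduced here and the expectations are the same posterior moments used in EM for PPCA.

```latex
\alpha_i^{\text{new}} = \frac{D}{\mathbf{w}_i^{\mathsf T}\mathbf{w}_i},
\qquad
\mathbf{W}^{\text{new}} =
\Bigl[\sum_{n=1}^{N} (\mathbf{x}_n - \bar{\mathbf{x}})\,
      \mathbb{E}[\mathbf{z}_n]^{\mathsf T}\Bigr]
\Bigl[\sum_{n=1}^{N} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathsf T}]
      + \sigma^2 \mathbf{A}\Bigr]^{-1}
```

A column w_i with small norm gets a large α_i, which enlarges the corresponding diagonal entry of σ²A and shrinks that column further on the next update; this feedback is what switches unneeded components off.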

Check: How does Bayesian PCA determine the number of components?

ARD priors on the loading columns: irrelevant components are pruned as their precision hyperparameters go to infinity
Cross-validation
The user specifies M

Chapter 7: Factor Analysis

Factor analysis generalizes PPCA by allowing non-isotropic noise:

x = Wz + μ + ε, ε ~ N(0, Ψ)

where Ψ = diag(ψ₁, ..., ψ_D) is a diagonal (but not scalar) noise matrix. Each observed dimension has its own noise level.

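PPCA vs. factor analysis: PPCA: ψ₁ = ... = ψ_D = σ² (same noise everywhere). Factor analysis: ψ_i differ per dimension. Both models share the same latent-space rotation ambiguity: replacing W by WR for any orthogonal R leaves WW^T, and hence p(x), unchanged, so the factor loadings are identified only up to rotation. The difference is in data space: PPCA is covariant under rotations of the data axes, whereas factor analysis is covariant under component-wise rescaling of the data, which is absorbed into Ψ. Unlike PPCA, factor analysis has no closed-form ML solution and is fitted iteratively, e.g. with EM. Factor analysis better models data where some features are noisier than others, which is common in practice.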

The marginal distribution is p(x) = N(x|μ, WW^T + Ψ). Factor analysis decomposes the covariance into a low-rank component (WW^T, capturing shared variance from latent factors) plus diagonal noise (uniquenesses).
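
The covariance decomposition can be illustrated by sampling from the generative model. The loadings `W` and noise variances `psi` below are hypothetical values chosen for the example; the sketch compares the sample covariance with WW^T + Ψ and shows that the off-diagonal entries come from the shared factors alone.

```python
import numpy as np

rng = np.random.default_rng(6)
D, M, N = 5, 2, 20000
# Hypothetical loadings and per-dimension noise variances.
W = np.array([[1.0, 0.0], [0.8, 0.3], [0.0, 1.0], [0.5, -0.5], [0.2, 0.9]])
psi = np.array([0.1, 0.5, 0.2, 1.0, 0.05])
mu = np.zeros(D)

Z = rng.normal(size=(N, M))                      # z ~ N(0, I)
eps = rng.normal(size=(N, D)) * np.sqrt(psi)     # eps ~ N(0, diag(psi))
X = Z @ W.T + mu + eps

C_model = W @ W.T + np.diag(psi)
C_sample = np.cov(X.T, bias=True)

off_diag = ~np.eye(D, dtype=bool)
shared = (W @ W.T)[off_diag]    # covariances between features: factors only
max_abs_err = np.max(np.abs(C_sample - C_model))
```

Only the diagonal of the model covariance is affected by Ψ, which is why the ψ_i are called uniquenesses.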

Check: What is the key difference between PPCA and factor analysis?

Factor analysis allows different noise variances per dimension (diagonal Ψ), while PPCA uses isotropic noise
Factor analysis uses nonlinear mappings
They are identical

Chapter 8: Kernel PCA

Kernel PCA extends PCA to nonlinear dimensionality reduction using the kernel trick from Chapter 6.

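Kernel PCA works with the covariance matrix in feature space, C = (1/N) ∑_n φ(x_n)φ(x_n)^T = (1/N)Φ^TΦ, where the n-th row of Φ is φ(x_n)^T. Because φ may be very high-dimensional or never computed explicitly, the dual formulation uses the N × N kernel matrix K_nm = k(x_n, x_m) = φ(x_n)^Tφ(x_m) instead.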

The principal components in feature space are given by the eigenvectors of the centered kernel matrix K̃:

K̃ a_i = N λ_i a_i

where K̃ = K − 1_N K − K 1_N + 1_N K 1_N is the centered kernel matrix and 1_N denotes the N × N matrix in which every element equals 1/N.

Nonlinear structure from linear PCA: Kernel PCA performs linear PCA in a (potentially infinite-dimensional) feature space, which corresponds to nonlinear dimensionality reduction in input space. A Gaussian kernel gives a smooth manifold; a polynomial kernel gives polynomial features. The kernel determines what "structure" means.
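
A minimal sketch of these steps with a Gaussian kernel. The function name `kernel_pca`, the kernel width `gamma`, and the two-ring toy data are assumptions for the example. The coefficient vectors are scaled so that λ_i N a_i^T a_i = 1, which makes the feature-space eigenvectors unit length.

```python
import numpy as np

def kernel_pca(X, M, gamma=1.0):
    """Kernel PCA with a Gaussian kernel k(x, x') = exp(-gamma ||x - x'||^2)."""
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)
    one_N = np.full((N, N), 1.0 / N)            # every entry equals 1/N
    K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N
    eigvals, eigvecs = np.linalg.eigh(K_tilde)  # eigenvalues are N * lambda_i
    eigvals, eigvecs = eigvals[::-1][:M], eigvecs[:, ::-1][:, :M]
    lam = eigvals / N
    A = eigvecs / np.sqrt(eigvals)              # so that lambda_i N a_i^T a_i = 1
    Y = K_tilde @ A                             # projections of the training points
    return Y, lam, K_tilde

rng = np.random.default_rng(7)
# Hypothetical toy data: two noisy concentric rings (not linearly separable).
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.where(np.arange(200) < 100, 1.0, 3.0) + 0.1 * rng.normal(size=200)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

Y, lam, K_tilde = kernel_pca(X, M=2, gamma=0.5)
```

As in linear PCA, the projections have zero mean and the variance of the i-th projection equals λ_i.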

Limitation: Kernel PCA does not give an explicit mapping back to input space (the pre-image problem). PPCA gives a full generative model; kernel PCA gives only projections.

Check: What does kernel PCA achieve?

Nonlinear dimensionality reduction by performing linear PCA in a kernel-defined feature space
Faster computation of standard PCA
A probabilistic model for the data

Chapter 9: Independent Component Analysis

ICA (Independent Component Analysis) finds components that are statistically independent, not just uncorrelated. The model:

x = As

where s = (s₁, ..., s_M) are independent non-Gaussian sources and A is an unknown mixing matrix.

PCA vs. ICA: PCA finds uncorrelated components (no linear correlation). ICA finds independent components (no dependence of any kind). For Gaussian sources, uncorrelated = independent, so independence adds no constraint beyond decorrelation: the sources are recoverable only up to an arbitrary rotation and ICA can do no better than PCA. ICA is interesting precisely when sources are non-Gaussian. The classic example is the "cocktail party problem": separating mixed audio signals into individual speakers.

ICA assumes at most one source is Gaussian (otherwise the mixing is unidentifiable). Common ICA algorithms maximize non-Gaussianity of the recovered sources (kurtosis, negentropy) or minimize mutual information.
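
To make the "maximize non-Gaussianity" idea concrete, here is a deliberately simple two-source sketch. It is not FastICA or any library algorithm: it whitens the mixtures and then grid-searches the rotation angle that maximizes the summed absolute excess kurtosis. The mixing matrix, the uniform sources, and the grid resolution are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 5000
# Two independent, non-Gaussian (uniform) sources with unit variance.
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(N, 2))
A = np.array([[1.0, 0.6], [0.4, 1.0]])    # hypothetical mixing matrix
X = S @ A.T                                # x_n = A s_n

# Step 1: whiten. This is as far as second-order statistics (PCA) can go.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / N)
X_white = Xc @ eigvecs / np.sqrt(eigvals)

def excess_kurtosis(Y):
    return np.mean(Y ** 4, axis=0) / np.mean(Y ** 2, axis=0) ** 2 - 3.0

def rotate(theta):
    c, s = np.cos(theta), np.sin(theta)
    return X_white @ np.array([[c, -s], [s, c]])

# Step 2: choose the rotation whose outputs are least Gaussian.
thetas = np.linspace(0.0, np.pi / 2.0, 181)
scores = [np.sum(np.abs(excess_kurtosis(rotate(t)))) for t in thetas]
S_hat = rotate(thetas[int(np.argmax(scores))])

# Absolute correlations between true sources (rows) and recovered ones (columns).
corr = np.abs(np.corrcoef(S.T, S_hat.T)[:2, 2:])
```

Each true source should correlate strongly with exactly one recovered component; the order and sign of the components remain arbitrary, as always in ICA.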

Check: When does ICA give different results from PCA?

When the underlying sources are non-Gaussian: ICA finds truly independent components, not just uncorrelated ones
Always
Never: they always give the same result

Chapter 10: Autoencoders

An autoassociative neural network (autoencoder) learns a nonlinear low-dimensional representation by training a network to reconstruct its input through a bottleneck.

Architecture: input (x, D units) → encoder (hidden layers) → bottleneck (z, M units) → decoder (hidden layers) → output (x̂, D units).

Linear autoencoder = PCA: With a single hidden layer and linear activations, the autoencoder trained to minimize squared error learns exactly the same subspace as PCA (though not necessarily the same orthonormal basis). With additional nonlinear hidden layers in the encoder and decoder, the autoencoder learns a nonlinear manifold, a more powerful generalization. Modern variational autoencoders (VAEs) can be viewed as a nonlinear counterpart of probabilistic PCA: a neural-network decoder replaces the linear map W, and variational inference (Ch 10) with an encoder network approximates the posterior over z.

The connection to PPCA: a linear autoencoder trained with MSE loss is equivalent to PPCA in the limit σ² → 0. The bottleneck dimension M plays the role of the latent space dimension. VAEs make this connection explicit and add noise modeling.
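
The subspace claim can be checked with a tiny experiment. The sketch below trains a bias-free linear autoencoder on centered toy data with plain gradient descent and compares the result with PCA; the data, learning rate, step count, and initialization scale are all assumptions chosen so that this small problem converges.

```python
import numpy as np

rng = np.random.default_rng(9)
N, D, M = 500, 5, 2
# Hypothetical toy data with two dominant directions, randomly rotated.
X = rng.normal(size=(N, D)) * np.array([3.0, 2.0, 0.3, 0.3, 0.3])
X = X @ np.linalg.qr(rng.normal(size=(D, D)))[0]
Xc = X - X.mean(axis=0)

W_enc = 0.1 * rng.normal(size=(D, M))    # encoder: z = W_enc^T x
W_dec = 0.1 * rng.normal(size=(M, D))    # decoder: x_hat = W_dec^T z
lr = 0.01
for _ in range(5000):
    Z = Xc @ W_enc
    err = Z @ W_dec - Xc                  # reconstruction residual
    grad_dec = 2.0 * Z.T @ err / N        # gradient of the mean squared error
    grad_enc = 2.0 * Xc.T @ (err @ W_dec.T) / N
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

ae_error = np.mean(np.sum((Xc @ W_enc @ W_dec - Xc) ** 2, axis=1))

# PCA reference: minimum error is the sum of the discarded eigenvalues.
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / N)
U_M = eigvecs[:, ::-1][:, :M]
pca_error = eigvals[::-1][M:].sum()

# Compare subspaces through their orthogonal projectors.
Q, _ = np.linalg.qr(W_dec.T)              # orthonormal basis of the decoder's span
subspace_gap = np.linalg.norm(Q @ Q.T - U_M @ U_M.T)
```

The decoder rows are generally not the eigenvectors themselves, which is why the comparison uses projectors rather than the weight matrices directly.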

Check: What does a linear autoencoder learn?

The same subspace as PCA: a linear autoencoder with MSE loss is equivalent to PCA
A completely different representation
Independent components like ICA

Chapter 11: Summary

Method	Noise model	Mapping	Distinguishing feature
PCA	None	Linear	Uncorrelated, max variance
PPCA	Isotropic σ²I	Linear	Generative, missing data
Factor analysis	Diagonal Ψ	Linear	Per-feature noise, covariant under feature rescaling
Bayesian PCA	Isotropic + ARD	Linear	Auto-selects M
Kernel PCA	None	Nonlinear (kernel)	Nonlinear manifolds
ICA	None	Linear	Independent, non-Gaussian
Autoencoder	None (deterministic)	Nonlinear (NN)	Flexible nonlinear encoding

The latent variable perspective: All these methods share a common idea: data x is generated from a simpler latent representation z. The differences lie in the mapping (linear vs nonlinear), the noise model (isotropic vs diagonal vs none), and the prior on z (Gaussian for PPCA/FA, non-Gaussian for ICA). This latent variable framework extends to modern deep generative models: VAEs, GANs, and diffusion models.

What comes next: Chapter 13 introduces sequential data — time series models (HMMs, Kalman filters) where the latent variables have temporal structure.

"PCA can be viewed as a limiting case of a linear-Gaussian
latent variable model."
— Christopher Bishop, PRML §12.2

Check: What unifying idea connects PCA, factor analysis, and autoencoders?

All model data as generated from a lower-dimensional latent representation, differing in mapping and noise assumptions
They all use the same algorithm
They all require labeled data