Ch 10: Principal Component Analysis — Deisenroth et al. MML

Chapter 0: Why

Imagine you have a dataset of faces. Each image is 64×64 pixels — that is 4,096 numbers per face. But most of those numbers are redundant. Skin color varies smoothly, noses sit in roughly the same place, eyes look mostly alike. The effective dimensionality of face images is far lower than 4,096. Maybe a few hundred numbers are enough to describe any face.

This is the general situation in high-dimensional data: the data looks like it lives in a huge space, but it actually clusters near a much smaller subspace. Points are correlated. Measurements overlap. Information is repeated.

We need a principled way to ask: which directions carry the most information? And once we find them, can we compress the data by keeping only those directions — and later reconstruct an approximation from the compressed version?

The core idea: PCA finds the directions of greatest variance in your data. Those directions are the principal components. Projecting onto them gives the best possible low-dimensional summary. It is a linear compression scheme with a beautiful mathematical foundation: eigenvectors of the covariance matrix.

High Dimensions Are Overcomplete

A cloud of 2D points that is really 1D in disguise. The data spreads along a line — the second axis carries mostly noise. Click New Data to regenerate.

In the widget above, the data lives in 2D, but it stretches mostly along one direction. The second dimension is almost pure noise. If you project every point onto the long axis, you lose very little information. That long axis is the first principal component.

PCA formalizes this idea. It works in any number of dimensions — 2, 200, or 20,000 — and always finds the best linear compression to any target dimensionality you choose. The chapter ahead derives exactly why eigenvectors give the answer, from two different perspectives that turn out to be the same.

What PCA needs	What it produces
Data matrix X (N points, D dimensions)	Principal directions b₁, ..., b_M
Target dimensionality M < D	Low-dimensional codes z = B^Tx
Nothing else — no labels, no model	Reconstructions x̃ = Bz

Check: Why is high-dimensional data often compressible?

• Because computers have limited memory
• Because features are correlated, so effective dimensionality is lower than ambient dimensionality
• Because most features are exactly zero

Chapter 1: The Compression Problem

Let's set up the problem precisely. You have N data points x₁, ..., x_N in D dimensions. You want to represent each point using only M numbers, where M < D. That is dimensionality reduction.

We build a two-step machine. The encoder takes a D-dimensional point and maps it to an M-dimensional code. The decoder takes the code and maps it back to D dimensions, producing a reconstruction. Both steps are linear.

Original: x ∈ R^D
↓ encode: z = B^Tx
Code: z ∈ R^M
↓ decode: x̃ = Bz
Reconstruction: x̃ ∈ R^D

Here B is a D×M matrix whose columns b₁, ..., b_M are the projection directions. The encoder projects x onto these columns: z_m = b_m^Tx. The decoder reconstructs by combining the columns weighted by the codes: x̃ = z₁b₁ + ... + z_Mb_M.

The full round-trip is:

x̃ = B B^T x

This is a projection. The matrix BB^T projects every point onto the M-dimensional subspace spanned by the columns of B. If the columns of B are orthonormal (b_i^Tb_j = δ_ij), then BB^T is an orthogonal projection matrix.
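
The projection properties can be checked numerically. The sketch below is illustrative only: the matrix B is built from a QR factorization of random numbers (an assumption made for the demo, not the PCA solution), which is enough to see that BB^T behaves like an orthogonal projection whenever the columns of B are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 5, 2

# Orthonormal columns from the QR factorization of a random D x M matrix
# (an arbitrary subspace chosen for illustration, not yet the PCA subspace).
B, _ = np.linalg.qr(rng.standard_normal((D, M)))
P = B @ B.T                       # D x D round-trip matrix

x = rng.standard_normal(D)
z = B.T @ x                       # encode: M numbers
x_tilde = B @ z                   # decode: back in R^D

print(np.allclose(B.T @ B, np.eye(M)))        # columns are orthonormal
print(np.allclose(P @ P, P))                  # projecting twice changes nothing
print(np.allclose(P, P.T))                    # symmetric, so the projection is orthogonal
print(np.allclose(B.T @ (x - x_tilde), 0.0))  # residual is orthogonal to the subspace
```

The last check is the geometric content of "orthogonal projection": what the decoder cannot recover is perpendicular to every column of B.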

Key insight: Compression = projection. The encoder finds coordinates in a subspace; the decoder reconstructs from those coordinates. The only question is: which subspace? PCA will answer: the subspace that preserves the most information.

The reconstruction error for a single point is the squared distance between the original and its reconstruction:

||x - x̃||² = ||x - BB^Tx||²

Averaged over all N points, this gives the mean reconstruction error. Our goal: find the columns of B that minimize this average error. Equivalently (as we will show), find the columns of B that maximize the variance of the projected data. These two problems have the same solution.

Encode → Decode Round-Trip

Watch a 2D point get encoded to 1D and decoded back. The projection line is the subspace. The reconstruction error is the perpendicular distance. Drag the slider to rotate the projection direction.

In the widget, notice that when the projection line aligns with the long axis of the data, the reconstruction errors (red segments) are tiny. When the line is perpendicular to the data spread, the errors are huge. PCA finds the angle that minimizes total error.
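
The same experiment can be run by brute force. The following sketch assumes a synthetic correlated 2D cloud and a 0.5° angle grid (both arbitrary choices): it sweeps the projection angle, records the mean reconstruction error, and compares the best angle with the leading eigenvector of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated 2D cloud, centered (illustrative data only)
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.4]])
X -= X.mean(axis=0)

def mean_recon_error(theta):
    """Mean squared distance between the points and their projection onto angle theta."""
    b = np.array([np.cos(theta), np.sin(theta)])   # unit direction
    X_tilde = np.outer(X @ b, b)                   # decode(encode(x)) for every row
    return np.mean(np.sum((X - X_tilde) ** 2, axis=1))

thetas = np.deg2rad(np.arange(0.0, 180.0, 0.5))
errors = np.array([mean_recon_error(t) for t in thetas])
best = thetas[np.argmin(errors)]

# Leading eigenvector of the covariance matrix, expressed as an angle in [0, pi)
S = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(S)               # eigenvalues in ascending order
pc1 = eigvecs[:, -1]
pc1_angle = np.arctan2(pc1[1], pc1[0]) % np.pi

print("best angle on grid (deg):", np.rad2deg(best))
print("PC1 angle (deg):         ", np.rad2deg(pc1_angle))
print("smallest error on grid:  ", errors.min(), "vs smallest eigenvalue:", eigvals[0])
```

Up to the grid resolution, the two angles agree, and the smallest achievable error is the smaller eigenvalue.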

Check: What does the decoder x̃ = Bz produce geometrically?

• The orthogonal projection of x onto the column space of B
• A random point near x
• The original point x unchanged

Chapter 2: Maximum Variance

Here is the first way to derive PCA. We want a single direction b (a unit vector in R^D) such that projecting the data onto b captures the most variance. The projection of a centered data point x onto b is the scalar z = b^Tx. The variance of these projections across the dataset is:

V = (1/N) ∑_n (b^Tx_n)² = b^T S b

where S is the data covariance matrix:

S = (1/N) ∑_n x_n x_n^T = (1/N) X^TX

(assuming centered data with mean zero). We want to maximize b^TSb subject to the constraint ||b|| = 1. Without the constraint, we could make V arbitrarily large by scaling b. The unit-length constraint forces us to pick a direction, not a magnitude.
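
As a quick sanity check of the identity V = b^TSb, and of why the unit-length constraint is needed, here is a minimal sketch on synthetic data (the data and the direction b are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 3)) @ rng.standard_normal((3, 3))
X -= X.mean(axis=0)                 # center, so the projections have mean zero

S = X.T @ X / len(X)                # covariance with the 1/N convention used here

b = rng.standard_normal(3)
b /= np.linalg.norm(b)              # unit vector: a direction, not a magnitude

z = X @ b                           # projections b^T x_n for all n
V_direct = np.mean(z ** 2)          # (1/N) sum_n (b^T x_n)^2
V_quadratic = b @ S @ b             # b^T S b

print(V_direct, V_quadratic)        # the two numbers agree

# Without the constraint, scaling b by 10 inflates the objective by a factor of 100
ratio = ((10 * b) @ S @ (10 * b)) / V_quadratic
print(ratio)
```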

The Lagrangian derivation

This is a constrained optimization problem. We use a Lagrange multiplier λ to enforce the constraint:

L(b, λ) = b^TSb − λ(b^Tb − 1)

We seek a stationary point. Take the derivative with respect to b and set it to zero:

∂L/∂b = 2Sb − 2λb = 0

This simplifies to:

Sb = λb

This is the eigenvalue equation. The direction b that maximizes variance must be an eigenvector of the covariance matrix S, and λ is the corresponding eigenvalue.

The punchline: Plug the eigenvector equation back into the variance: V = b^TSb = b^T(λb) = λb^Tb = λ. The variance captured along eigenvector b equals its eigenvalue. To maximize variance, pick the eigenvector with the largest eigenvalue. That is the first principal component.
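
A numerical spot check of the punchline, on synthetic data: the top eigenvector satisfies Sb = λb, its projected variance equals λ, and none of many random unit directions does better. The random-direction comparison is only a sanity check under these assumptions, not a proof.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((400, 4)) @ rng.standard_normal((4, 4))
X -= X.mean(axis=0)
S = X.T @ X / len(X)

eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns eigenvalues in ascending order
lam1, b1 = eigvals[-1], eigvecs[:, -1]    # largest eigenvalue and its eigenvector

print(np.allclose(S @ b1, lam1 * b1))     # eigenvalue equation S b = lambda b
print(np.isclose(b1 @ S @ b1, lam1))      # variance along b1 equals lambda_1

# Variance along 10,000 random unit directions
dirs = rng.standard_normal((10_000, 4))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
variances = np.einsum("nd,de,ne->n", dirs, S, dirs)   # b^T S b for each row b
print(variances.max(), "<=", lam1)
```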

Let us verify this with a numerical example. Suppose we have a 2D covariance matrix:

S = [[2.0, 1.2], [1.2, 1.0]]

The eigenvalues are λ₁ = 2.8 and λ₂ = 0.2 (the roots of λ² − 3λ + 0.56 = 0). The first eigenvector points along the direction of maximum spread. The second is orthogonal and captures the residual. Together they account for all the variance: λ₁ + λ₂ = 2.8 + 0.2 = 3 = 2.0 + 1.0 = tr(S).
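
The example is small enough to check by machine. A short sketch using NumPy's symmetric eigensolver, which for this matrix returns eigenvalues of approximately 0.2 and 2.8:

```python
import numpy as np

S = np.array([[2.0, 1.2],
              [1.2, 1.0]])

eigvals, eigvecs = np.linalg.eigh(S)   # ascending order: smallest eigenvalue first
lam2, lam1 = eigvals
b1 = eigvecs[:, 1]                     # eigenvector of the largest eigenvalue

print("eigenvalues:", lam1, lam2)
print("sum of eigenvalues:", lam1 + lam2, "trace:", np.trace(S))
print("variance along b1:", b1 @ S @ b1)
```

The leading eigenvector is proportional to (3, 2), so the direction of greatest spread leans toward the first axis, which is also the axis with the larger variance.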

Key insight: The Lagrange multiplier λ is not just a mathematical device — it is the answer. It tells you exactly how much variance the corresponding eigenvector captures. Larger eigenvalue = more important direction.

2D PCA Explorer (Showcase)

A cloud of correlated 2D points with its covariance ellipse and eigenvectors. Drag the projection line to any angle. The bar chart shows variance captured. Snapping to PC1 maximizes the bar. PC2 is orthogonal and captures the rest.

Play with the slider above. As you rotate the projection line toward the eigenvector with the largest eigenvalue, the variance bar grows. At the exact PC1 angle, the bar is maximized. At PC2 (perpendicular), the bar is minimized — you are projecting onto the direction of least variance. Any other angle falls between the two.

Check: Why does maximizing b^TSb subject to ||b||=1 lead to an eigenvalue equation?

• Because we are minimizing a quadratic function
• Because setting the Lagrangian gradient to zero gives Sb = λb
• Because S is always a diagonal matrix

Chapter 3: Finding Principal Components

We found the first principal component: the eigenvector of S with the largest eigenvalue. What about the second? And the third? PCA extracts them sequentially, each one orthogonal to all previous ones.

The m-th principal component is the direction that maximizes variance in the subspace orthogonal to the first m−1 components. By the same Lagrangian argument (now with m−1 additional orthogonality constraints), the solution is again an eigenvector of S — specifically, the one with the m-th largest eigenvalue.

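Fact: Since S is symmetric, the spectral theorem guarantees an orthonormal basis of eigenvectors: eigenvectors belonging to distinct eigenvalues are automatically orthogonal, and those sharing a repeated eigenvalue can be chosen orthogonal. Positive semi-definiteness adds that every eigenvalue is nonnegative, as a variance must be. So the first M eigenvectors automatically satisfy the orthogonality constraints. We don't need to solve M separate constrained optimization problems; we eigendecompose S once and read off the eigenvectors in order of decreasing eigenvalue.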

Let λ₁ ≥ λ₂ ≥ ... ≥ λ_D be the sorted eigenvalues with corresponding eigenvectors b₁, b₂, ..., b_D. The first M principal components are b₁, ..., b_M. Together they form the columns of the projection matrix B.
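
In code, "eigendecompose once and sort" might look like the sketch below. The function name `principal_components` is a hypothetical helper introduced here for illustration; note that `np.linalg.eigh` returns eigenvalues in ascending order, so they have to be reversed.

```python
import numpy as np

def principal_components(X, M):
    """Return all eigenvalues (descending) and the D x M matrix B of leading eigenvectors."""
    Xc = X - X.mean(axis=0)                    # center
    S = Xc.T @ Xc / len(Xc)                    # covariance, 1/N convention
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # indices for descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals, eigvecs[:, :M]

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 6)) @ rng.standard_normal((6, 6))   # synthetic data
lam, B = principal_components(X, M=3)

print(lam)                                     # sorted: lam[0] >= lam[1] >= ...
print(np.allclose(B.T @ B, np.eye(3)))         # columns of B are orthonormal
```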

Component	Direction	Variance captured
PC1	b₁ (eigenvector of largest λ)	λ₁
PC2	b₂ (next largest λ)	λ₂
PC m	b_m	λ_m
...	...	...
PC D	b_D (smallest λ)	λ_D

The total variance captured by the first M components is:

V_M = ∑_{m=1}^{M} λ_m

And the total variance in the data is the sum of all eigenvalues, which equals the trace of S:

V_total = ∑_{d=1}^{D} λ_d = tr(S)

So the fraction of variance retained by M components is V_M / V_total. This gives a crisp criterion for choosing M: pick the smallest M such that this fraction exceeds your threshold (e.g., 95%).

Worked example: A 3D dataset has eigenvalues λ₁ = 5.0, λ₂ = 3.0, λ₃ = 0.2. Total variance = 8.2. With M=2 components: V₂/V_total = 8.0/8.2 = 97.6%. We can reduce from 3D to 2D and lose only 2.4% of the variance. The third principal direction carries almost nothing.
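
The selection rule is easy to automate. A minimal sketch, where `smallest_M` is a hypothetical helper name and the small tolerance guards against floating-point round-off when the threshold is hit exactly:

```python
import numpy as np

def smallest_M(eigvals, threshold=0.95):
    """Smallest M whose cumulative variance fraction reaches the threshold."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending order
    cumulative = np.cumsum(lam) / lam.sum()                 # V_M / V_total for M = 1..D
    # argmax on a boolean array returns the index of the first True
    M = int(np.argmax(cumulative >= threshold - 1e-12)) + 1
    return M, cumulative

M, frac = smallest_M([5.0, 3.0, 0.2], threshold=0.95)
print(M)       # number of components needed
print(frac)    # cumulative fractions for M = 1, 2, 3
```

For the eigenvalues of the worked example, one component retains about 61% of the variance and two retain about 97.6%, so a 95% threshold selects M = 2.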

Sequential Extraction

Eigenvalues of a sample covariance matrix, sorted in decreasing order. Each bar is the variance along one PC. The cumulative line shows how much total variance is captured as you add components.

Check: If the eigenvalues of a 4D covariance matrix are 10, 5, 1, 0.1, what fraction of variance is retained by 2 PCs?

• 15/16.1 = 93.2%
• 10/16.1 = 62.1%
• 5/16.1 = 31.1%

Chapter 4: Reconstruction Error

We derived PCA from the "maximize variance" perspective. Now let's derive it from the other direction: minimize reconstruction error. We will arrive at the same eigenvectors.

The average reconstruction error across the dataset is:

J = (1/N) ∑_n ||x_n − x̃_n||²

Since x̃_n = BB^Tx_n (the projection), the error for each point is the component of x_n orthogonal to the subspace spanned by B. This orthogonal complement is captured by the eigenvectors we didn't keep.

Expand x_n in the full eigenbasis b₁, ..., b_D:

x_n = ∑_{d=1}^{D} (b_d^Tx_n) b_d

The reconstruction keeps only the first M terms. The error is the sum of the remaining D−M terms:

x_n − x̃_n = ∑_{d=M+1}^{D} (b_d^Tx_n) b_d

The average squared error becomes:

J_M = ∑_{d=M+1}^{D} λ_d

The reconstruction error is the sum of the discarded eigenvalues. To minimize J_M, we should discard the eigenvectors with the smallest eigenvalues — i.e., keep the ones with the largest eigenvalues. This is exactly the same answer as maximizing variance.
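
The step from the residual to the sum of eigenvalues is short but worth spelling out. It uses only the orthonormality of the b_d (so cross terms vanish), the definition of S, and the eigenvalue equation Sb_d = λ_d b_d. One way to write it, in the notation above:

```latex
\begin{aligned}
J_M &= \frac{1}{N}\sum_{n=1}^{N} \Big\| \sum_{d=M+1}^{D} (b_d^\top x_n)\, b_d \Big\|^2
     = \frac{1}{N}\sum_{n=1}^{N} \sum_{d=M+1}^{D} (b_d^\top x_n)^2 \\
    &= \sum_{d=M+1}^{D} b_d^\top \Big( \frac{1}{N}\sum_{n=1}^{N} x_n x_n^\top \Big) b_d
     = \sum_{d=M+1}^{D} b_d^\top S\, b_d
     = \sum_{d=M+1}^{D} \lambda_d .
\end{aligned}
```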

Let's verify the bookkeeping. The total variance equals the variance retained plus the variance lost:

∑_{d=1}^{D} λ_d = ∑_{m=1}^{M} λ_m + ∑_{d=M+1}^{D} λ_d

The first sum on the right is the variance retained (V_M). The second is the reconstruction error (J_M). They partition the total variance: V_M + J_M = tr(S). Because tr(S) is fixed by the data, maximizing V_M and minimizing J_M are the same optimization.

Worked example: With eigenvalues 5.0, 3.0, 0.2 and M=2: Variance retained = 5.0 + 3.0 = 8.0. Reconstruction error = 0.2. Total = 8.2 = tr(S). The error is just the discarded eigenvalue.
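
The bookkeeping can also be verified on data rather than on eigenvalues alone. The sketch below uses a synthetic 4D dataset (an assumption for illustration) and, for every M, measures the reconstruction error directly and compares it with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))
X -= X.mean(axis=0)
N, D = X.shape

S = X.T @ X / N
eigvals, eigvecs = np.linalg.eigh(S)
lam, vecs = eigvals[::-1], eigvecs[:, ::-1]            # descending order

rows = []
for M in range(1, D + 1):
    B = vecs[:, :M]                                    # keep the M leading eigenvectors
    X_tilde = X @ B @ B.T                              # reconstructions
    J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))    # measured mean squared error
    V_M = np.mean(np.sum((X @ B) ** 2, axis=1))        # total variance of the codes
    rows.append((M, J, lam[M:].sum(), V_M))

total = np.trace(S)
for M, J, discarded, V_M in rows:
    print(f"M={M}: error={J:.4f}  discarded eigenvalues={discarded:.4f}  "
          f"retained+error={V_M + J:.4f}  tr(S)={total:.4f}")
```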

Check: The reconstruction error J_M equals...

• The sum of all eigenvalues
• The sum of the D−M smallest eigenvalues (those discarded)
• The product of the retained eigenvalues

Chapter 5: Two Views, One Solution

We have now seen PCA from both sides:

View 1: Maximum Variance

max b^TSb s.t. ||b||=1

Find the direction that preserves the most spread in the data.

View 2: Minimum Reconstruction Error

min ||x − BB^Tx||²

Find the subspace where projections are closest to the originals.

Both views lead to the same answer: the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue. This is not a coincidence — it is a direct consequence of the variance decomposition V_M + J_M = tr(S).

Key insight: Maximizing variance retained = minimizing variance lost = minimizing reconstruction error. The eigenvalues partition neatly, so optimizing one quantity automatically optimizes the other. This duality is why PCA is so elegant.

Think of it this way. You have a fixed "budget" of total variance (tr(S)). Keeping M components gives you V_M variance and costs you J_M = tr(S) − V_M in error. Maximizing what you keep is equivalent to minimizing what you lose. Two sides of the same coin.

Variance Retained vs. Error

A stacked bar shows how eigenvalues partition into retained (teal) and discarded (orange). Drag M to see the split change.

In the widget, as you increase M, the teal bar grows (more variance retained) and the orange bar shrinks (less reconstruction error). At M=D (all components), the error is zero and the reconstruction is perfect. At M=1, you get the best single-direction summary possible.

Check: Why are the max-variance and min-error formulations equivalent?

• Because V_M + J_M = tr(S) is constant, so maximizing one minimizes the other
• Because both use gradient descent
• They are not equivalent; they give different answers for M > 1

Chapter 6: The SVD Connection

In practice, we rarely form the covariance matrix S and eigendecompose it. Instead, we use the Singular Value Decomposition (SVD) of the centered data matrix X directly. This is more numerically stable, because forming X^TX squares the condition number of X, and it is often faster.

The SVD of the N×D data matrix X is:

X = U Σ V^T

where U is N×N (left singular vectors), Σ is N×D (singular values on the diagonal), and V is D×D (right singular vectors). The covariance matrix is:

S = (1/N) X^TX = (1/N) V Σ^T U^T U Σ V^T = V · (Σ^TΣ/N) · V^T

Here Σ^TΣ is the D×D diagonal matrix whose entries are the squared singular values σ_d².

Since U^TU = I, this simplifies beautifully. Comparing with the eigendecomposition S = V Λ V^T, we see:

The connection: The right singular vectors V of X are the eigenvectors of S (the principal components). The eigenvalues are λ_d = σ_d² / N, where σ_d are the singular values. We never need to form S explicitly — just SVD the data matrix.

This also connects to the Eckart-Young theorem: the best rank-M approximation to X (in Frobenius norm) is obtained by keeping only the M largest singular values and their vectors. This is exactly PCA.

PCA language	SVD language
Principal components b_m	Right singular vectors v_m
Eigenvalue λ_m	σ_m² / N
Projected coordinates z_n	U_n Σ (first M columns)
Best rank-M data approximation	U_M Σ_M V_M^T

python
import numpy as np

# Generate correlated 2D data
np.random.seed(42)
X = np.random.randn(100, 2) @ [[2, 1], [1, 1.5]]
X -= X.mean(axis=0)  # center

# SVD approach
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
eigenvalues = sigma**2 / len(X)
principal_dirs = Vt.T  # columns = PCs

# Verify: eigendecompose covariance
S = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(S)
print("SVD eigenvalues:", eigenvalues)
print("Cov eigenvalues:", eigvals[::-1])
# They match!
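
The remaining rows of the table can be checked the same way. This sketch, on separate synthetic data, confirms that the codes obtained by projection equal U_M Σ_M, that the PCA reconstruction is the truncated SVD, and that the squared Frobenius error of the truncation is the sum of the discarded σ², in line with the Eckart-Young statement above.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
X -= X.mean(axis=0)                            # center
M = 2

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
B = Vt[:M].T                                   # D x M principal directions

Z_proj = X @ B                                 # codes by projection: z_n = B^T x_n
Z_svd = U[:, :M] * sigma[:M]                   # codes read off the SVD: U_M Sigma_M

X_pca = Z_proj @ B.T                           # PCA reconstruction
X_rank_M = (U[:, :M] * sigma[:M]) @ Vt[:M]     # truncated SVD: U_M Sigma_M V_M^T

err = np.linalg.norm(X - X_rank_M, "fro") ** 2
print(np.allclose(Z_proj, Z_svd))              # same codes
print(np.allclose(X_pca, X_rank_M))            # same reconstruction
print(err, np.sum(sigma[M:] ** 2))             # error = sum of discarded sigma^2
```

Dividing that squared error by N gives the mean reconstruction error, which is the sum of the discarded eigenvalues λ_d = σ_d²/N.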

Check: How do the eigenvalues of S relate to the singular values of X?

• λ = σ
• λ = σ² / N
• λ = N / σ²

Chapter 7: Practical Steps

Here is the PCA recipe, step by step. Each required step earns its place; skip centering, for example, and the results are wrong.

1. Center. Subtract the mean: x_n ← x_n − μ
2. (Optional) Standardize. Divide by std if features have different units
3. Eigendecompose. SVD or eigendecompose S = X^TX/N
4. Choose M. Pick M from scree plot or variance threshold
5. Project. z_n = B^T(x_n − μ)
6. Reconstruct. x̃_n = Bz_n + μ

Step 1 (Center) is crucial. PCA finds directions of maximum variance. If the data is not centered, the mean itself dominates the first component — the algorithm wastes its first PC pointing at the center of mass instead of finding interesting structure.
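
A small experiment makes the effect concrete. The data below is synthetic and deliberately extreme (an assumption chosen to exaggerate the effect): the cloud spreads along the first axis but sits far from the origin along the second.

```python
import numpy as np

rng = np.random.default_rng(7)
# Spread mostly along the first axis, shifted far from the origin along the second
X = rng.standard_normal((500, 2)) * np.array([3.0, 0.5]) + np.array([0.0, 50.0])

def first_direction(A):
    """Leading right singular vector of A (no centering is applied here)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[0]

raw = first_direction(X)                          # centering skipped
centered = first_direction(X - X.mean(axis=0))    # centering applied
mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))

print("uncentered |cos| with mean direction:", abs(raw @ mean_dir))
print("centered   |cos| with first axis:    ", abs(centered @ np.array([1.0, 0.0])))
```

Without centering, the leading direction is almost exactly the direction of the mean; with centering, it is the axis along which the points actually vary.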

Step 2 (Standardize) is needed when features have different units (e.g., height in cm vs. weight in kg). Without standardization, PCA is dominated by whichever feature has the largest numerical variance — which is just a units artifact. Dividing each feature by its standard deviation puts all features on the same scale. When features are already in comparable units (e.g., pixel intensities), skip this step.

Key insight: Centering is mandatory; standardizing is a judgment call. PCA on the raw covariance matrix (centered only) is called "covariance PCA." PCA on the standardized data (correlation matrix) is "correlation PCA." They can give very different results.
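
To see the units artifact in isolation, the sketch below builds one hypothetical height/weight dataset and expresses height in centimetres and in millimetres. All numbers are made up for illustration. Covariance PCA changes its answer with the unit; correlation PCA does not.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 1000
# Hypothetical, correlated measurements (values invented for the demo)
height_cm = 170 + 10 * rng.standard_normal(N)
weight_kg = 70 + 0.8 * (height_cm - 170) + 6 * rng.standard_normal(N)

X_cm = np.column_stack([height_cm, weight_kg])
X_mm = np.column_stack([10 * height_cm, weight_kg])   # same data, height in mm

def first_pc(A, standardize=False):
    """Leading principal direction; absolute values remove the arbitrary sign."""
    A = A - A.mean(axis=0)
    if standardize:
        A = A / A.std(axis=0)          # correlation PCA
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.abs(Vt[0])

cov_cm, cov_mm = first_pc(X_cm), first_pc(X_mm)
corr_cm, corr_mm = first_pc(X_cm, True), first_pc(X_mm, True)

print("covariance PCA, height in cm: ", cov_cm)
print("covariance PCA, height in mm: ", cov_mm)    # dominated by the height axis
print("correlation PCA, height in cm:", corr_cm)
print("correlation PCA, height in mm:", corr_mm)   # unchanged by the unit
```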

Step-by-Step PCA Demo

Walk through each stage: raw data, centered, eigendecompose, project, reconstruct. Click Next Step to advance.

python
import numpy as np

def pca(X, M):
    # 1. Center
    mu = X.mean(axis=0)
    Xc = X - mu
    # 2. SVD (no need to form covariance)
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
    B = Vt[:M].T             # D x M projection matrix
    Z = Xc @ B               # N x M codes
    X_hat = Z @ B.T + mu     # N x D reconstructions
    eigvals = sigma**2 / len(X)
    return B, Z, X_hat, eigvals

# Example: 100 points in 5D, reduce to 2D
X = np.random.randn(100, 5) @ np.random.randn(5, 5)
B, Z, X_hat, lam = pca(X, M=2)
print(f"Variance retained: {sum(lam[:2])/sum(lam)*100:.1f}%")

Check: Why must data be centered before PCA?

• Otherwise the first PC points toward the mean rather than the direction of maximum variance
• Centering makes the computation faster
• The SVD only works on centered data

Chapter 8: The Scree Plot

How do you choose M, the number of components to keep? There is no single right answer, but the scree plot is the standard visual tool. It plots the eigenvalues in decreasing order as a bar chart.

The name comes from geology: "scree" is the rubble that collects at the base of a cliff. In the plot, the eigenvalues drop steeply at first (the cliff), then flatten out (the scree). The elbow — where the steep drop transitions to the flat tail — is a natural place to cut.

The elbow rule: Choose M at the point where adding more components yields diminishing returns. Before the elbow, each new PC captures substantial variance. After the elbow, you are mostly fitting noise. The elbow is where signal ends and noise begins.

A more quantitative approach: pick M such that the cumulative variance exceeds a threshold:

(∑_{m=1}^{M} λ_m) / (∑_{d=1}^{D} λ_d) ≥ 0.95

A 95% threshold is common, but the right value depends on your application. For lossy compression, 90% may be fine. For scientific analysis where small variance matters, 99% is safer.

Interactive Scree Plot

Eigenvalue bar chart (blue) with cumulative variance line (warm). Drag the slider to set M. The dashed threshold at 95% helps locate the cut. Click New Data to generate a different spectrum.

When the elbow is unclear: If eigenvalues decay smoothly without a sharp drop, there is no natural cutoff. In such cases, use the variance threshold criterion, or consider that PCA may not be the right tool — the data may genuinely be high-dimensional with no low-rank structure.

Check: What does the "elbow" in a scree plot indicate?

• The maximum possible number of components
• The point where adding more components stops giving meaningful variance gains
• The smallest eigenvalue

Chapter 9: Probabilistic PCA

Standard PCA is a purely algebraic method — it finds eigenvectors of a matrix. Probabilistic PCA (PPCA) recasts PCA as a generative model: a story about how the data was created. This opens the door to Bayesian inference, handling missing data, and connecting to modern latent variable models.

The generative story is simple. Each data point is generated in two steps:

Step 1. Draw a latent code: z ~ N(0, I_M)
Step 2. Generate the observation: x = Bz + μ + ε, where ε ~ N(0, σ²I)

The latent variable z lives in M dimensions. The matrix B maps it to D dimensions. The noise ε accounts for the variance that M components cannot explain. The marginal distribution of x (integrating out z) is:

x ~ N(μ, BB^T + σ²I)
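
The marginal can be checked by simulation. In the sketch below, B, μ, and σ² are arbitrary values picked for the demo (they are assumptions, not fitted parameters); sampling the two-step story many times should give a sample mean near μ and a sample covariance near BB^T + σ²I.

```python
import numpy as np

rng = np.random.default_rng(9)
D, M, N = 4, 2, 200_000

B = rng.standard_normal((D, M))          # hypothetical loading matrix
mu = np.array([1.0, -2.0, 0.5, 3.0])     # hypothetical mean
sigma2 = 0.3                             # hypothetical noise variance

Z = rng.standard_normal((N, M))                        # step 1: z ~ N(0, I_M)
eps = np.sqrt(sigma2) * rng.standard_normal((N, D))    # eps ~ N(0, sigma^2 I)
X = Z @ B.T + mu + eps                                 # step 2: x = B z + mu + eps

C_model = B @ B.T + sigma2 * np.eye(D)                 # predicted covariance
C_sample = np.cov(X, rowvar=False, bias=True)          # 1/N sample covariance

print("largest covariance discrepancy:", np.abs(C_sample - C_model).max())
print("largest mean discrepancy:      ", np.abs(X.mean(axis=0) - mu).max())
```

Both discrepancies shrink as N grows, at the usual Monte Carlo rate of roughly 1/√N.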

We can fit this model by maximum likelihood. The log-likelihood for N data points is:

log p(X | B, μ, σ²) = −(N/2)[D log(2π) + log|C| + tr(C⁻¹S)]

where C = BB^T + σ²I and S is the data covariance. Tipping and Bishop (1999) showed that the MLE for B has a closed-form solution:

B_ML = V_M (Λ_M − σ²I)^{1/2} R

where V_M contains the M leading eigenvectors of S, Λ_M is the diagonal matrix of their eigenvalues, and R is an arbitrary orthogonal rotation matrix. The noise variance is:

σ²_ML = (1/(D−M)) ∑_{d=M+1}^{D} λ_d

The key result: As σ² → 0, PPCA recovers standard PCA exactly. The noise variance is the average of the discarded eigenvalues — it captures the "average unexplained variance per dimension." PPCA gives PCA a probabilistic meaning: PCA is the zero-noise limit of a latent variable model.
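
The closed-form solution translates into a few lines of NumPy. This is a sketch under stated assumptions: `ppca_mle` is a hypothetical helper name, the rotation R is taken to be the identity, and M < D is required so that at least one eigenvalue is discarded.

```python
import numpy as np

def ppca_mle(X, M):
    """Closed-form PPCA maximum-likelihood estimates with R = I (requires M < D)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    S = Xc.T @ Xc / len(X)                         # data covariance, 1/N convention
    eigvals, eigvecs = np.linalg.eigh(S)
    lam, vecs = eigvals[::-1], eigvecs[:, ::-1]    # descending order
    sigma2 = lam[M:].mean()                        # average discarded eigenvalue
    B = vecs[:, :M] * np.sqrt(lam[:M] - sigma2)    # V_M (Lambda_M - sigma^2 I)^{1/2}
    return B, mu, sigma2, lam

rng = np.random.default_rng(10)
X = rng.standard_normal((2000, 5)) @ rng.standard_normal((5, 5))   # synthetic data
B, mu, sigma2, lam = ppca_mle(X, M=2)

C = B @ B.T + sigma2 * np.eye(5)                   # fitted model covariance
C_eigs = np.sort(np.linalg.eigvalsh(C))[::-1]
print("eigenvalues of S:", lam)
print("eigenvalues of C:", C_eigs)
```

The fitted covariance C reproduces the M leading eigenvalues of S exactly and replaces the remaining D−M by their average σ², which is the sense in which the noise term absorbs the unexplained variance.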

Why this matters: The probabilistic view lets you handle missing data (marginalize over unobserved dimensions), perform model selection (compare likelihoods for different M), and mix PCA with other generative models. It is also the starting point for Variational Autoencoders.

Check: In PPCA, what is the noise variance σ² at the maximum likelihood solution?

• The average of the discarded eigenvalues
• The largest eigenvalue
• Always zero

Chapter 10: The Auto-Encoder View

Step back and look at the PCA pipeline again: encode x into a low-dimensional z via B^T, then decode z back to x̃ via B. This is exactly the structure of an auto-encoder: a network with an encoder, a bottleneck, and a decoder, trained to minimize reconstruction error.

PCA as auto-encoder:

Encoder: z = B^T(x − μ)
Decoder: x̃ = Bz + μ
Loss: ||x − x̃||²
Both layers are linear. The optimal B has PCA eigenvectors as columns.

Deep auto-encoder:

Encoder: z = f_θ(x) (nonlinear)
Decoder: x̃ = g_φ(z) (nonlinear)
Loss: ||x − x̃||²
Multi-layer neural networks with ReLU, etc.

PCA is the linear auto-encoder. Baldi and Hornik (1989) proved that a linear auto-encoder with M hidden units, trained with squared-error loss, learns the PCA subspace. At the optimum the decoder weights span the same subspace as the M leading eigenvectors, but they are determined only up to an invertible M×M linear transformation, so the individual weight vectors need not be the eigenvectors themselves. Going nonlinear (adding activation functions) lets the network learn curved manifolds that PCA cannot capture.

This connection is why PCA appears in deep learning textbooks. It is the simplest possible dimensionality reduction — the case where encoder and decoder are single linear layers with shared (transposed) weights. Every more complex auto-encoder (VAE, VQ-VAE, denoising AE) is a generalization of this idea.
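
The linear case can be tried directly. The sketch below trains an untied linear auto-encoder with plain gradient descent in NumPy and compares the subspace spanned by the decoder with the PCA subspace. The data, initial scale, learning rate, and step count are assumptions tuned for this toy problem, not general recommendations.

```python
import numpy as np

rng = np.random.default_rng(11)
N, D, M = 500, 5, 2

# Synthetic data with a clear gap between the 2nd and 3rd eigenvalue (assumed setup)
rot, _ = np.linalg.qr(rng.standard_normal((D, D)))
X = (rng.standard_normal((N, D)) * np.array([3.0, 2.0, 0.3, 0.2, 0.1])) @ rot.T
X -= X.mean(axis=0)

W_enc = 0.1 * rng.standard_normal((D, M))    # encoder: z = x W_enc
W_dec = 0.1 * rng.standard_normal((M, D))    # decoder: x_hat = z W_dec

lr = 0.01
for _ in range(5000):
    Z = X @ W_enc
    R = Z @ W_dec - X                        # residuals x_hat - x
    grad_dec = 2.0 * Z.T @ R / N             # gradient of the mean squared error
    grad_enc = 2.0 * X.T @ (R @ W_dec.T) / N
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# PCA subspace from the SVD of the same centered data
_, _, Vt = np.linalg.svd(X, full_matrices=False)
B = Vt[:M].T
P_pca = B @ B.T

# Orthonormal basis for the subspace the decoder can output
Q_dec, _ = np.linalg.qr(W_dec.T)
P_ae = Q_dec @ Q_dec.T

print("distance between projectors:", np.linalg.norm(P_ae - P_pca))
print("decoder rows orthonormal?   ", np.allclose(W_dec @ W_dec.T, np.eye(M), atol=1e-2))
```

The projectors agree closely, while the decoder rows themselves are generally not orthonormal eigenvectors: the network finds the subspace, not a particular basis for it.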

Method	Encoder	Decoder	Captures
PCA	Linear (B^T)	Linear (B)	Linear subspaces
Kernel PCA	Nonlinear (kernel trick)	Pre-image	Nonlinear manifolds
Auto-encoder	Neural net	Neural net	Complex manifolds
VAE	Neural net + sampling	Neural net	Manifolds + generation

Linear vs. Nonlinear Compression

The data lies on a curve (nonlinear structure). PCA projects onto a line (best linear fit). A nonlinear encoder could capture the curve. Notice the reconstruction errors are large for PCA on curved data.

Check: What happens when you train a linear auto-encoder with M hidden units and squared-error loss?

• It learns the M largest principal components of the data
• It learns the PCA subspace (eigenvectors, possibly rotated)
• It fails to converge

Chapter 11: Summary

PCA is the foundational dimensionality reduction technique. It finds the directions of maximum variance in the data — the eigenvectors of the covariance matrix — and projects onto them. The result is the best linear compression for any given target dimension.

Concept	Core result
Compression	Encode: z = B^Tx. Decode: x̃ = Bz. Round-trip = projection.
Max variance	Lagrangian → Sb = λb. PC = eigenvector. Variance = eigenvalue.
Min recon error	J_M = ∑ discarded eigenvalues. Same optimal B.
Duality	V_M + J_M = tr(S). Max one = min the other.
SVD	λ_d = σ_d²/N. Right singular vectors = PCs.
PPCA	z ~ N(0,I), x = Bz + μ + ε. MLE recovers PCA as σ → 0.
Auto-encoder	Linear AE = PCA. Nonlinear AE generalizes to curved manifolds.

Limitations: PCA finds linear structure only. If the data lives on a curved manifold (a Swiss roll, a circle), PCA gives a poor summary. It is also sensitive to outliers (they inflate eigenvalues). And it is unsupervised — it ignores labels, so the directions of maximum variance may not be the directions that distinguish classes. For classification, Linear Discriminant Analysis (LDA) is the supervised counterpart.

Strengths:

• Optimal linear compression (provably)
• Fast (eigendecomposition or SVD)
• Almost no hyperparameters (only the target dimension M, a design choice)
• Interpretable (each PC is a direction)

Limitations:

• Linear only (cannot capture curves)
• Sensitive to outliers
• Unsupervised (ignores class labels)
• Requires all data in memory (batch method)
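
The first limitation is easy to reproduce. In this sketch, noisy points on a unit circle (synthetic data, chosen as the simplest curved example) are intrinsically one-dimensional, yet a single principal component retains only about half of the variance.

```python
import numpy as np

rng = np.random.default_rng(12)
t = rng.uniform(0.0, 2.0 * np.pi, size=1000)        # one intrinsic coordinate: the angle
X = np.column_stack([np.cos(t), np.sin(t)]) + 0.01 * rng.standard_normal((1000, 2))
X -= X.mean(axis=0)

_, sigma, Vt = np.linalg.svd(X, full_matrices=False)
lam = sigma**2 / len(X)                              # eigenvalues of the covariance
retained = lam[0] / lam.sum()                        # variance fraction kept by one PC

B = Vt[:1].T                                         # best single direction
X_tilde = X @ B @ B.T
mean_err = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

print("variance retained by 1 PC:", retained)
print("mean reconstruction error:", mean_err, "(points lie at radius of about 1)")
```

No line through the origin summarizes a circle well, which is the gap that kernel PCA and nonlinear auto-encoders aim to fill.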

What comes next: In MML, Chapter 11 (Density Estimation with Gaussian Mixture Models) continues with another latent-variable model, and Chapter 12 (Classification with Support Vector Machines) turns to supervised learning. In the broader ML landscape, PCA connects to kernel methods, auto-encoders, and variational inference; it is the seed from which modern representation learning grew.

Extension	What it adds
Kernel PCA	Nonlinear mapping via the kernel trick
Sparse PCA	L1 penalty for interpretable components
Incremental PCA	Online updates for streaming data
Factor Analysis	Per-feature noise variances (not isotropic)
VAE	Nonlinear + generative + deep

"The first principal component is the direction in which the data
tells the longest story."
— paraphrasing Karl Pearson, 1901

Check: What is the main limitation of PCA that motivates kernel PCA and auto-encoders?

• PCA can only capture linear structure, not curved manifolds
• PCA is too slow for large datasets
• PCA requires labeled data