Deisenroth et al., Chapter 10

Principal Component Analysis

Finding the directions that matter most in high-dimensional data.

Prerequisites: Chapters 3–4 (linear algebra & matrix decompositions). That's it.
12
Chapters
5+
Simulations
12
Quizzes

Chapter 0: Why

Imagine you have a dataset of faces. Each image is 64×64 pixels — that is 4,096 numbers per face. But most of those numbers are redundant. Skin color varies smoothly, noses sit in roughly the same place, eyes look mostly alike. The effective dimensionality of face images is far lower than 4,096. Maybe a few hundred numbers are enough to describe any face.

This is the general situation in high-dimensional data: the data looks like it lives in a huge space, but it actually clusters near a much smaller subspace. Points are correlated. Measurements overlap. Information is repeated.

We need a principled way to ask: which directions carry the most information? And once we find them, can we compress the data by keeping only those directions — and later reconstruct an approximation from the compressed version?

The core idea: PCA finds the directions of greatest variance in your data. Those directions are the principal components. Projecting onto them gives the best possible low-dimensional summary. It is a linear compression scheme with a beautiful mathematical foundation: eigenvectors of the covariance matrix.
High Dimensions Are Overcomplete

A cloud of 2D points that is really 1D in disguise. The data spreads along a line — the second axis carries mostly noise. Click New Data to regenerate.

In the widget above, the data lives in 2D, but it stretches mostly along one direction. The second dimension is almost pure noise. If you project every point onto the long axis, you lose very little information. That long axis is the first principal component.

PCA formalizes this idea. It works in any number of dimensions — 2, 200, or 20,000 — and always finds the best linear compression to any target dimensionality you choose. The chapter ahead derives exactly why eigenvectors give the answer, from two different perspectives that turn out to be the same.

What PCA needsWhat it produces
Data matrix X (N points, D dimensions)Principal directions b1, ..., bM
Target dimensionality M < DLow-dimensional codes z = BTx
Nothing else — no labels, no modelReconstructions x̃ = Bz
Check: Why is high-dimensional data often compressible?

Chapter 1: The Compression Problem

Let's set up the problem precisely. You have N data points x1, ..., xN in D dimensions. You want to represent each point using only M numbers, where M < D. That is dimensionality reduction.

We build a two-step machine. The encoder takes a D-dimensional point and maps it to an M-dimensional code. The decoder takes the code and maps it back to D dimensions, producing a reconstruction. Both steps are linear.

Original
x ∈ RD
↓ encode: z = BTx
Code
z ∈ RM
↓ decode: x̃ = Bz
Reconstruction
x̃ ∈ RD

Here B is a D×M matrix whose columns b1, ..., bM are the projection directions. The encoder projects x onto these columns: zm = bmTx. The decoder reconstructs by combining the columns weighted by the codes: x̃ = z1b1 + ... + zMbM.

The full round-trip is:

x̃ = B BT x

This is a projection. The matrix BBT projects every point onto the M-dimensional subspace spanned by the columns of B. If the columns of B are orthonormal (biTbj = δij), then BBT is an orthogonal projection matrix.

Key insight: Compression = projection. The encoder finds coordinates in a subspace; the decoder reconstructs from those coordinates. The only question is: which subspace? PCA will answer: the subspace that preserves the most information.

The reconstruction error for a single point is the squared distance between the original and its reconstruction:

||x - x̃||2 = ||x - BBTx||2

Averaged over all N points, this gives the mean reconstruction error. Our goal: find the columns of B that minimize this average error. Equivalently (as we will show), find the columns of B that maximize the variance of the projected data. These two problems have the same solution.

Encode → Decode Round-Trip

Watch a 2D point get encoded to 1D and decoded back. The projection line is the subspace. The reconstruction error is the perpendicular distance. Drag the slider to rotate the projection direction.

Angle θ 30°

In the widget, notice that when the projection line aligns with the long axis of the data, the reconstruction errors (red segments) are tiny. When the line is perpendicular to the data spread, the errors are huge. PCA finds the angle that minimizes total error.

Check: What does the decoder x̃ = Bz produce geometrically?

Chapter 2: Maximum Variance

Here is the first way to derive PCA. We want a single direction b (a unit vector in RD) such that projecting the data onto b captures the most variance. The projection of a centered data point x onto b is the scalar z = bTx. The variance of these projections across the dataset is:

V = (1/N) ∑n (bTxn)2 = bT S b

where S is the data covariance matrix:

S = (1/N) ∑n xn xnT = (1/N) XTX

(assuming centered data with mean zero). We want to maximize bTSb subject to the constraint ||b|| = 1. Without the constraint, we could make V arbitrarily large by scaling b. The unit-length constraint forces us to pick a direction, not a magnitude.

The Lagrangian derivation

This is a constrained optimization problem. We use a Lagrange multiplier λ to enforce the constraint:

L(b, λ) = bTSb − λ(bTb − 1)

We seek a stationary point. Take the derivative with respect to b and set it to zero:

∂L/∂b = 2Sb − 2λb = 0

This simplifies to:

Sb = λb

This is the eigenvalue equation. The direction b that maximizes variance must be an eigenvector of the covariance matrix S, and λ is the corresponding eigenvalue.

The punchline: Plug the eigenvector equation back into the variance: V = bTSb = bT(λb) = λbTb = λ. The variance captured along eigenvector b equals its eigenvalue. To maximize variance, pick the eigenvector with the largest eigenvalue. That is the first principal component.

Let us verify this with a numerical example. Suppose we have a 2D covariance matrix:

S = [[2.0, 1.2], [1.2, 1.0]]

The eigenvalues are λ1 ≈ 2.72 and λ2 ≈ 0.28. The first eigenvector points along the direction of maximum spread. The second is orthogonal and captures the residual. Together they account for all the variance: λ1 + λ2 = 2 + 1 = 3 = tr(S).

Key insight: The Lagrange multiplier λ is not just a mathematical device — it is the answer. It tells you exactly how much variance the corresponding eigenvector captures. Larger eigenvalue = more important direction.
2D PCA Explorer (Showcase)

A cloud of correlated 2D points with its covariance ellipse and eigenvectors. Drag the projection line to any angle. The bar chart shows variance captured. Snapping to PC1 maximizes the bar. PC2 is orthogonal and captures the rest.

Projection θ 30°

Play with the slider above. As you rotate the projection line toward the eigenvector with the largest eigenvalue, the variance bar grows. At the exact PC1 angle, the bar is maximized. At PC2 (perpendicular), the bar is minimized — you are projecting onto the direction of least variance. Any other angle falls between the two.

Check: Why does maximizing bTSb subject to ||b||=1 lead to an eigenvalue equation?

Chapter 3: Finding Principal Components

We found the first principal component: the eigenvector of S with the largest eigenvalue. What about the second? And the third? PCA extracts them sequentially, each one orthogonal to all previous ones.

The m-th principal component is the direction that maximizes variance in the subspace orthogonal to the first m−1 components. By the same Lagrangian argument (now with m−1 additional orthogonality constraints), the solution is again an eigenvector of S — specifically, the one with the m-th largest eigenvalue.

Fact: Since S is symmetric and positive semi-definite, its eigenvectors are orthogonal (by the spectral theorem). So the first M eigenvectors automatically satisfy the orthogonality constraints. We don't need to solve M separate constrained optimization problems — we just eigendecompose S once and read off the eigenvectors in order of decreasing eigenvalue.

Let λ1 ≥ λ2 ≥ ... ≥ λD be the sorted eigenvalues with corresponding eigenvectors b1, b2, ..., bD. The first M principal components are b1, ..., bM. Together they form the columns of the projection matrix B.

ComponentDirectionVariance captured
PC1b1 (eigenvector of largest λ)λ1
PC2b2 (next largest λ)λ2
PC mbmλm
.........
PC DbD (smallest λ)λD

The total variance captured by the first M components is:

VM = ∑m=1M λm

And the total variance in the data is the sum of all eigenvalues, which equals the trace of S:

Vtotal = ∑d=1D λd = tr(S)

So the fraction of variance retained by M components is VM / Vtotal. This gives a crisp criterion for choosing M: pick the smallest M such that this fraction exceeds your threshold (e.g., 95%).

Worked example: A 3D dataset has eigenvalues λ1 = 5.0, λ2 = 3.0, λ3 = 0.2. Total variance = 8.2. With M=2 components: V2/Vtotal = 8.0/8.2 = 97.6%. We can reduce from 3D to 2D and lose only 2.4% of the information. The third dimension is nearly pure noise.
Sequential Extraction

Eigenvalues of a sample covariance matrix, sorted in decreasing order. Each bar is the variance along one PC. The cumulative line shows how much total variance is captured as you add components.

Check: If the eigenvalues of a 4D covariance matrix are 10, 5, 1, 0.1, what fraction of variance is retained by 2 PCs?

Chapter 4: Reconstruction Error

We derived PCA from the "maximize variance" perspective. Now let's derive it from the other direction: minimize reconstruction error. We will arrive at the same eigenvectors.

The average reconstruction error across the dataset is:

J = (1/N) ∑n ||xn − x̃n||2

Since x̃n = BBTxn (the projection), the error for each point is the component of xn orthogonal to the subspace spanned by B. This orthogonal complement is captured by the eigenvectors we didn't keep.

Expand xn in the full eigenbasis b1, ..., bD:

xn = ∑d=1D (bdTxn) bd

The reconstruction keeps only the first M terms. The error is the sum of the remaining D−M terms:

xn − x̃n = ∑d=M+1D (bdTxn) bd

The average squared error becomes:

JM = ∑d=M+1D λd
The reconstruction error is the sum of the discarded eigenvalues. To minimize JM, we should discard the eigenvectors with the smallest eigenvalues — i.e., keep the ones with the largest eigenvalues. This is exactly the same answer as maximizing variance.

Let's verify the bookkeeping. The total variance equals the variance retained plus the variance lost:

d=1D λd = m=1M λm + d=M+1D λd

The teal term is the variance retained (VM). The orange term is the reconstruction error (JM). They partition the total variance: VM + JM = tr(S). Maximizing VM and minimizing JM are literally the same optimization.

Worked example: With eigenvalues 5.0, 3.0, 0.2 and M=2: Variance retained = 5.0 + 3.0 = 8.0. Reconstruction error = 0.2. Total = 8.2 = tr(S). The error is just the discarded eigenvalue.
Check: The reconstruction error JM equals...

Chapter 5: Two Views, One Solution

We have now seen PCA from both sides:

View 1: Maximum Variance

max bTSb   s.t. ||b||=1

Find the direction that preserves the most spread in the data.

View 2: Minimum Reconstruction Error

min ||x − BBTx||2

Find the subspace where projections are closest to the originals.

Both views lead to the same answer: the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue. This is not a coincidence — it is a direct consequence of the variance decomposition VM + JM = tr(S).

Key insight: Maximizing variance retained = minimizing variance lost = minimizing reconstruction error. The eigenvalues partition neatly, so optimizing one quantity automatically optimizes the other. This duality is why PCA is so elegant.

Think of it this way. You have a fixed "budget" of total variance (tr(S)). Keeping M components gives you VM variance and costs you JM = tr(S) − VM in error. Maximizing what you keep is equivalent to minimizing what you lose. Two sides of the same coin.

Variance Retained vs. Error

A stacked bar shows how eigenvalues partition into retained (teal) and discarded (orange). Drag M to see the split change.

Components M 2

In the widget, as you increase M, the teal bar grows (more variance retained) and the orange bar shrinks (less reconstruction error). At M=D (all components), the error is zero and the reconstruction is perfect. At M=1, you get the best single-direction summary possible.

Check: Why are the max-variance and min-error formulations equivalent?

Chapter 6: The SVD Connection

In practice, we rarely form the covariance matrix S and eigendecompose it. Instead, we use the Singular Value Decomposition (SVD) of the centered data matrix X directly. This is numerically stabler and often faster.

The SVD of the N×D data matrix X is:

X = U Σ VT

where U is N×N (left singular vectors), Σ is N×D (singular values on the diagonal), and V is D×D (right singular vectors). The covariance matrix is:

S = (1/N) XTX = (1/N) V ΣT UT U Σ VT = V · (Σ2/N) · VT

Since UTU = I, this simplifies beautifully. Comparing with the eigendecomposition S = V Λ VT, we see:

The connection: The right singular vectors V of X are the eigenvectors of S (the principal components). The eigenvalues are λd = σd2 / N, where σd are the singular values. We never need to form S explicitly — just SVD the data matrix.

This also connects to the Eckart-Young theorem: the best rank-M approximation to X (in Frobenius norm) is obtained by keeping only the M largest singular values and their vectors. This is exactly PCA.

PCA languageSVD language
Principal components bmRight singular vectors vm
Eigenvalue λmσm2 / N
Projected coordinates znUn Σ (first M columns)
Best rank-M data approximationUM ΣM VMT
python
import numpy as np

# Generate correlated 2D data
np.random.seed(42)
X = np.random.randn(100, 2) @ [[2, 1], [1, 1.5]]
X -= X.mean(axis=0)  # center

# SVD approach
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
eigenvalues = sigma**2 / len(X)
principal_dirs = Vt.T  # columns = PCs

# Verify: eigendecompose covariance
S = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(S)
print("SVD eigenvalues:", eigenvalues)
print("Cov eigenvalues:", eigvals[::-1])
# They match!
Check: How do the eigenvalues of S relate to the singular values of X?

Chapter 7: Practical Steps

Here is the PCA recipe, step by step. Every step earns its place — skip one and the results are wrong.

1. Center
Subtract the mean: xn ← xn − μ
2. (Optional) Standardize
Divide by std if features have different units
3. Eigendecompose
SVD or eigendecompose S = XTX/N
4. Choose M
Pick M from scree plot or variance threshold
5. Project
zn = BT(xn − μ)
6. Reconstruct
n = Bzn + μ

Step 1 (Center) is crucial. PCA finds directions of maximum variance. If the data is not centered, the mean itself dominates the first component — the algorithm wastes its first PC pointing at the center of mass instead of finding interesting structure.

Step 2 (Standardize) is needed when features have different units (e.g., height in cm vs. weight in kg). Without standardization, PCA is dominated by whichever feature has the largest numerical variance — which is just a units artifact. Dividing each feature by its standard deviation puts all features on the same scale. When features are already in comparable units (e.g., pixel intensities), skip this step.

Key insight: Centering is mandatory; standardizing is a judgment call. PCA on the raw covariance matrix (centered only) is called "covariance PCA." PCA on the standardized data (correlation matrix) is "correlation PCA." They can give very different results.
Step-by-Step PCA Demo

Walk through each stage: raw data, centered, eigendecompose, project, reconstruct. Click Next Step to advance.

Step 0: Raw data
python
import numpy as np

def pca(X, M):
    # 1. Center
    mu = X.mean(axis=0)
    Xc = X - mu
    # 2. SVD (no need to form covariance)
    U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
    B = Vt[:M].T             # D x M projection matrix
    Z = Xc @ B               # N x M codes
    X_hat = Z @ B.T + mu     # N x D reconstructions
    eigvals = sigma**2 / len(X)
    return B, Z, X_hat, eigvals

# Example: 100 points in 5D, reduce to 2D
X = np.random.randn(100, 5) @ np.random.randn(5, 5)
B, Z, X_hat, lam = pca(X, M=2)
print(f"Variance retained: {sum(lam[:2])/sum(lam)*100:.1f}%")
Check: Why must data be centered before PCA?

Chapter 8: The Scree Plot

How do you choose M, the number of components to keep? There is no single right answer, but the scree plot is the standard visual tool. It plots the eigenvalues in decreasing order as a bar chart.

The name comes from geology: "scree" is the rubble that collects at the base of a cliff. In the plot, the eigenvalues drop steeply at first (the cliff), then flatten out (the scree). The elbow — where the steep drop transitions to the flat tail — is a natural place to cut.

The elbow rule: Choose M at the point where adding more components yields diminishing returns. Before the elbow, each new PC captures substantial variance. After the elbow, you are mostly fitting noise. The elbow is where signal ends and noise begins.

A more quantitative approach: pick M such that the cumulative variance exceeds a threshold:

(∑m=1M λm) / (∑d=1D λd) ≥ 0.95

A 95% threshold is common, but the right value depends on your application. For lossy compression, 90% may be fine. For scientific analysis where small variance matters, 99% is safer.

Interactive Scree Plot

Eigenvalue bar chart (blue) with cumulative variance line (warm). Drag the slider to set M. The dashed threshold at 95% helps locate the cut. Click New Data to generate a different spectrum.

Components M 3
When the elbow is unclear: If eigenvalues decay smoothly without a sharp drop, there is no natural cutoff. In such cases, use the variance threshold criterion, or consider that PCA may not be the right tool — the data may genuinely be high-dimensional with no low-rank structure.
Check: What does the "elbow" in a scree plot indicate?

Chapter 9: Probabilistic PCA

Standard PCA is a purely algebraic method — it finds eigenvectors of a matrix. Probabilistic PCA (PPCA) recasts PCA as a generative model: a story about how the data was created. This opens the door to Bayesian inference, handling missing data, and connecting to modern latent variable models.

The generative story is simple. Each data point is generated in two steps:

Step 1
Draw a latent code: z ~ N(0, IM)
Step 2
Generate the observation: x = Bz + μ + ε, where ε ~ N(0, σ2I)

The latent variable z lives in M dimensions. The matrix B maps it to D dimensions. The noise ε accounts for the variance that M components cannot explain. The marginal distribution of x (integrating out z) is:

x ~ N(μ, BBT + σ2I)

We can fit this model by maximum likelihood. The log-likelihood for N data points is:

log p(X | B, μ, σ2) = −(N/2)[D log(2π) + log|C| + tr(C−1S)]

where C = BBT + σ2I and S is the data covariance. Tipping and Bishop (1999) showed that the MLE for B has a closed-form solution:

BML = VMM − σ2I)1/2 R

where VM contains the M leading eigenvectors of S, ΛM is the diagonal matrix of their eigenvalues, and R is an arbitrary orthogonal rotation matrix. The noise variance is:

σ2ML = (1/(D−M)) ∑d=M+1D λd
The key result: As σ2 → 0, PPCA recovers standard PCA exactly. The noise variance is the average of the discarded eigenvalues — it captures the "average unexplained variance per dimension." PPCA gives PCA a probabilistic meaning: PCA is the zero-noise limit of a latent variable model.
Why this matters: The probabilistic view lets you handle missing data (marginalize over unobserved dimensions), perform model selection (compare likelihoods for different M), and mix PCA with other generative models. It is also the starting point for Variational Autoencoders.
Check: In PPCA, what is the noise variance σ2 at the maximum likelihood solution?

Chapter 10: The Auto-Encoder View

Step back and look at the PCA pipeline again: encode x into a low-dimensional z via BT, then decode z back to x̃ via B. This is exactly the structure of an auto-encoder: a network with an encoder, a bottleneck, and a decoder, trained to minimize reconstruction error.

PCA as auto-encoder:

Encoder: z = BT(x − μ)
Decoder: x̃ = Bz + μ
Loss: ||x − x̃||2
Both layers are linear. The optimal B has PCA eigenvectors as columns.

Deep auto-encoder:

Encoder: z = fθ(x) (nonlinear)
Decoder: x̃ = gφ(z) (nonlinear)
Loss: ||x − x̃||2
Multi-layer neural networks with ReLU, etc.

PCA is the linear auto-encoder. Baldi and Hornik (1989) proved that a linear auto-encoder with M hidden units, trained with squared-error loss, learns the PCA subspace. The weights converge to the M leading eigenvectors (up to rotation). Going nonlinear — adding activation functions — lets the network learn curved manifolds that PCA cannot capture.

This connection is why PCA appears in deep learning textbooks. It is the simplest possible dimensionality reduction — the case where encoder and decoder are single linear layers with shared (transposed) weights. Every more complex auto-encoder (VAE, VQ-VAE, denoising AE) is a generalization of this idea.

MethodEncoderDecoderCaptures
PCALinear (BT)Linear (B)Linear subspaces
Kernel PCANonlinear (kernel trick)Pre-imageNonlinear manifolds
Auto-encoderNeural netNeural netComplex manifolds
VAENeural net + samplingNeural netManifolds + generation
Linear vs. Nonlinear Compression

The data lies on a curve (nonlinear structure). PCA projects onto a line (best linear fit). A nonlinear encoder could capture the curve. Notice the reconstruction errors are large for PCA on curved data.

Check: What happens when you train a linear auto-encoder with M hidden units and squared-error loss?

Chapter 11: Summary

PCA is the foundational dimensionality reduction technique. It finds the directions of maximum variance in the data — the eigenvectors of the covariance matrix — and projects onto them. The result is the best linear compression for any given target dimension.

ConceptCore result
CompressionEncode: z = BTx. Decode: x̃ = Bz. Round-trip = projection.
Max varianceLagrangian → Sb = λb. PC = eigenvector. Variance = eigenvalue.
Min recon errorJM = ∑ discarded eigenvalues. Same optimal B.
DualityVM + JM = tr(S). Max one = min the other.
SVDλd = σd2/N. Right singular vectors = PCs.
PPCAz ~ N(0,I), x = Bz + μ + ε. MLE recovers PCA as σ → 0.
Auto-encoderLinear AE = PCA. Nonlinear AE generalizes to curved manifolds.
Limitations: PCA finds linear structure only. If the data lives on a curved manifold (a Swiss roll, a circle), PCA gives a poor summary. It is also sensitive to outliers (they inflate eigenvalues). And it is unsupervised — it ignores labels, so the directions of maximum variance may not be the directions that distinguish classes. For classification, Linear Discriminant Analysis (LDA) is the supervised counterpart.

Strengths:

• Optimal linear compression (provably)
• Fast (eigendecomposition or SVD)
• No hyperparameters (M is a design choice, not tuned)
• Interpretable (each PC is a direction)

Limitations:

• Linear only (cannot capture curves)
• Sensitive to outliers
• Unsupervised (ignores class labels)
• Requires all data in memory (batch method)

What comes next: Chapter 11 (Regression) uses dimensionality reduction as a preprocessing step. Chapter 12 (Classification) introduces supervised methods like LDA. In the broader ML landscape, PCA connects to kernel methods, auto-encoders, and variational inference — it is the seed from which modern representation learning grew.

ExtensionWhat it adds
Kernel PCANonlinear mapping via the kernel trick
Sparse PCAL1 penalty for interpretable components
Incremental PCAOnline updates for streaming data
Factor AnalysisPer-feature noise variances (not isotropic)
VAENonlinear + generative + deep
"The first principal component is the direction in which the data
tells the longest story."
— paraphrasing Karl Pearson, 1901
Check: What is the main limitation of PCA that motivates kernel PCA and auto-encoders?