Finding the directions that matter most in high-dimensional data.
Imagine you have a dataset of faces. Each image is 64×64 pixels — that is 4,096 numbers per face. But most of those numbers are redundant. Skin color varies smoothly, noses sit in roughly the same place, eyes look mostly alike. The effective dimensionality of face images is far lower than 4,096. Maybe a few hundred numbers are enough to describe any face.
This is the general situation in high-dimensional data: the data looks like it lives in a huge space, but it actually clusters near a much smaller subspace. Points are correlated. Measurements overlap. Information is repeated.
We need a principled way to ask: which directions carry the most information? And once we find them, can we compress the data by keeping only those directions — and later reconstruct an approximation from the compressed version?
A cloud of 2D points that is really 1D in disguise. The data spreads along a line — the second axis carries mostly noise. Click New Data to regenerate.
In the widget above, the data lives in 2D, but it stretches mostly along one direction. The second dimension is almost pure noise. If you project every point onto the long axis, you lose very little information. That long axis is the first principal component.
PCA formalizes this idea. It works in any number of dimensions — 2, 200, or 20,000 — and always finds the best linear compression to any target dimensionality you choose. The chapter ahead derives exactly why eigenvectors give the answer, from two different perspectives that turn out to be the same.
| What PCA needs | What it produces |
|---|---|
| Data matrix X (N points, D dimensions) | Principal directions b1, ..., bM |
| Target dimensionality M < D | Low-dimensional codes z = BTx |
| Nothing else — no labels, no model | Reconstructions x̃ = Bz |
Let's set up the problem precisely. You have N data points x1, ..., xN in D dimensions. You want to represent each point using only M numbers, where M < D. That is dimensionality reduction.
We build a two-step machine. The encoder takes a D-dimensional point and maps it to an M-dimensional code. The decoder takes the code and maps it back to D dimensions, producing a reconstruction. Both steps are linear.
Here B is a D×M matrix whose columns b1, ..., bM are the projection directions. The encoder projects x onto these columns: zm = bmTx. The decoder reconstructs by combining the columns weighted by the codes: x̃ = z1b1 + ... + zMbM.
The full round-trip is:
This is a projection. The matrix BBT projects every point onto the M-dimensional subspace spanned by the columns of B. If the columns of B are orthonormal (biTbj = δij), then BBT is an orthogonal projection matrix.
The reconstruction error for a single point is the squared distance between the original and its reconstruction:
Averaged over all N points, this gives the mean reconstruction error. Our goal: find the columns of B that minimize this average error. Equivalently (as we will show), find the columns of B that maximize the variance of the projected data. These two problems have the same solution.
Watch a 2D point get encoded to 1D and decoded back. The projection line is the subspace. The reconstruction error is the perpendicular distance. Drag the slider to rotate the projection direction.
In the widget, notice that when the projection line aligns with the long axis of the data, the reconstruction errors (red segments) are tiny. When the line is perpendicular to the data spread, the errors are huge. PCA finds the angle that minimizes total error.
Here is the first way to derive PCA. We want a single direction b (a unit vector in RD) such that projecting the data onto b captures the most variance. The projection of a centered data point x onto b is the scalar z = bTx. The variance of these projections across the dataset is:
where S is the data covariance matrix:
(assuming centered data with mean zero). We want to maximize bTSb subject to the constraint ||b|| = 1. Without the constraint, we could make V arbitrarily large by scaling b. The unit-length constraint forces us to pick a direction, not a magnitude.
This is a constrained optimization problem. We use a Lagrange multiplier λ to enforce the constraint:
We seek a stationary point. Take the derivative with respect to b and set it to zero:
This simplifies to:
This is the eigenvalue equation. The direction b that maximizes variance must be an eigenvector of the covariance matrix S, and λ is the corresponding eigenvalue.
Let us verify this with a numerical example. Suppose we have a 2D covariance matrix:
The eigenvalues are λ1 ≈ 2.72 and λ2 ≈ 0.28. The first eigenvector points along the direction of maximum spread. The second is orthogonal and captures the residual. Together they account for all the variance: λ1 + λ2 = 2 + 1 = 3 = tr(S).
A cloud of correlated 2D points with its covariance ellipse and eigenvectors. Drag the projection line to any angle. The bar chart shows variance captured. Snapping to PC1 maximizes the bar. PC2 is orthogonal and captures the rest.
Play with the slider above. As you rotate the projection line toward the eigenvector with the largest eigenvalue, the variance bar grows. At the exact PC1 angle, the bar is maximized. At PC2 (perpendicular), the bar is minimized — you are projecting onto the direction of least variance. Any other angle falls between the two.
We found the first principal component: the eigenvector of S with the largest eigenvalue. What about the second? And the third? PCA extracts them sequentially, each one orthogonal to all previous ones.
The m-th principal component is the direction that maximizes variance in the subspace orthogonal to the first m−1 components. By the same Lagrangian argument (now with m−1 additional orthogonality constraints), the solution is again an eigenvector of S — specifically, the one with the m-th largest eigenvalue.
Let λ1 ≥ λ2 ≥ ... ≥ λD be the sorted eigenvalues with corresponding eigenvectors b1, b2, ..., bD. The first M principal components are b1, ..., bM. Together they form the columns of the projection matrix B.
| Component | Direction | Variance captured |
|---|---|---|
| PC1 | b1 (eigenvector of largest λ) | λ1 |
| PC2 | b2 (next largest λ) | λ2 |
| PC m | bm | λm |
| ... | ... | ... |
| PC D | bD (smallest λ) | λD |
The total variance captured by the first M components is:
And the total variance in the data is the sum of all eigenvalues, which equals the trace of S:
So the fraction of variance retained by M components is VM / Vtotal. This gives a crisp criterion for choosing M: pick the smallest M such that this fraction exceeds your threshold (e.g., 95%).
Eigenvalues of a sample covariance matrix, sorted in decreasing order. Each bar is the variance along one PC. The cumulative line shows how much total variance is captured as you add components.
We derived PCA from the "maximize variance" perspective. Now let's derive it from the other direction: minimize reconstruction error. We will arrive at the same eigenvectors.
The average reconstruction error across the dataset is:
Since x̃n = BBTxn (the projection), the error for each point is the component of xn orthogonal to the subspace spanned by B. This orthogonal complement is captured by the eigenvectors we didn't keep.
Expand xn in the full eigenbasis b1, ..., bD:
The reconstruction keeps only the first M terms. The error is the sum of the remaining D−M terms:
The average squared error becomes:
Let's verify the bookkeeping. The total variance equals the variance retained plus the variance lost:
The teal term is the variance retained (VM). The orange term is the reconstruction error (JM). They partition the total variance: VM + JM = tr(S). Maximizing VM and minimizing JM are literally the same optimization.
We have now seen PCA from both sides:
View 1: Maximum Variance
Find the direction that preserves the most spread in the data.
View 2: Minimum Reconstruction Error
Find the subspace where projections are closest to the originals.
Both views lead to the same answer: the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue. This is not a coincidence — it is a direct consequence of the variance decomposition VM + JM = tr(S).
Think of it this way. You have a fixed "budget" of total variance (tr(S)). Keeping M components gives you VM variance and costs you JM = tr(S) − VM in error. Maximizing what you keep is equivalent to minimizing what you lose. Two sides of the same coin.
A stacked bar shows how eigenvalues partition into retained (teal) and discarded (orange). Drag M to see the split change.
In the widget, as you increase M, the teal bar grows (more variance retained) and the orange bar shrinks (less reconstruction error). At M=D (all components), the error is zero and the reconstruction is perfect. At M=1, you get the best single-direction summary possible.
In practice, we rarely form the covariance matrix S and eigendecompose it. Instead, we use the Singular Value Decomposition (SVD) of the centered data matrix X directly. This is numerically stabler and often faster.
The SVD of the N×D data matrix X is:
where U is N×N (left singular vectors), Σ is N×D (singular values on the diagonal), and V is D×D (right singular vectors). The covariance matrix is:
Since UTU = I, this simplifies beautifully. Comparing with the eigendecomposition S = V Λ VT, we see:
This also connects to the Eckart-Young theorem: the best rank-M approximation to X (in Frobenius norm) is obtained by keeping only the M largest singular values and their vectors. This is exactly PCA.
| PCA language | SVD language |
|---|---|
| Principal components bm | Right singular vectors vm |
| Eigenvalue λm | σm2 / N |
| Projected coordinates zn | Un Σ (first M columns) |
| Best rank-M data approximation | UM ΣM VMT |
python import numpy as np # Generate correlated 2D data np.random.seed(42) X = np.random.randn(100, 2) @ [[2, 1], [1, 1.5]] X -= X.mean(axis=0) # center # SVD approach U, sigma, Vt = np.linalg.svd(X, full_matrices=False) eigenvalues = sigma**2 / len(X) principal_dirs = Vt.T # columns = PCs # Verify: eigendecompose covariance S = X.T @ X / len(X) eigvals, eigvecs = np.linalg.eigh(S) print("SVD eigenvalues:", eigenvalues) print("Cov eigenvalues:", eigvals[::-1]) # They match!
Here is the PCA recipe, step by step. Every step earns its place — skip one and the results are wrong.
Step 1 (Center) is crucial. PCA finds directions of maximum variance. If the data is not centered, the mean itself dominates the first component — the algorithm wastes its first PC pointing at the center of mass instead of finding interesting structure.
Step 2 (Standardize) is needed when features have different units (e.g., height in cm vs. weight in kg). Without standardization, PCA is dominated by whichever feature has the largest numerical variance — which is just a units artifact. Dividing each feature by its standard deviation puts all features on the same scale. When features are already in comparable units (e.g., pixel intensities), skip this step.
Walk through each stage: raw data, centered, eigendecompose, project, reconstruct. Click Next Step to advance.
python import numpy as np def pca(X, M): # 1. Center mu = X.mean(axis=0) Xc = X - mu # 2. SVD (no need to form covariance) U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False) B = Vt[:M].T # D x M projection matrix Z = Xc @ B # N x M codes X_hat = Z @ B.T + mu # N x D reconstructions eigvals = sigma**2 / len(X) return B, Z, X_hat, eigvals # Example: 100 points in 5D, reduce to 2D X = np.random.randn(100, 5) @ np.random.randn(5, 5) B, Z, X_hat, lam = pca(X, M=2) print(f"Variance retained: {sum(lam[:2])/sum(lam)*100:.1f}%")
How do you choose M, the number of components to keep? There is no single right answer, but the scree plot is the standard visual tool. It plots the eigenvalues in decreasing order as a bar chart.
The name comes from geology: "scree" is the rubble that collects at the base of a cliff. In the plot, the eigenvalues drop steeply at first (the cliff), then flatten out (the scree). The elbow — where the steep drop transitions to the flat tail — is a natural place to cut.
A more quantitative approach: pick M such that the cumulative variance exceeds a threshold:
A 95% threshold is common, but the right value depends on your application. For lossy compression, 90% may be fine. For scientific analysis where small variance matters, 99% is safer.
Eigenvalue bar chart (blue) with cumulative variance line (warm). Drag the slider to set M. The dashed threshold at 95% helps locate the cut. Click New Data to generate a different spectrum.
Standard PCA is a purely algebraic method — it finds eigenvectors of a matrix. Probabilistic PCA (PPCA) recasts PCA as a generative model: a story about how the data was created. This opens the door to Bayesian inference, handling missing data, and connecting to modern latent variable models.
The generative story is simple. Each data point is generated in two steps:
The latent variable z lives in M dimensions. The matrix B maps it to D dimensions. The noise ε accounts for the variance that M components cannot explain. The marginal distribution of x (integrating out z) is:
We can fit this model by maximum likelihood. The log-likelihood for N data points is:
where C = BBT + σ2I and S is the data covariance. Tipping and Bishop (1999) showed that the MLE for B has a closed-form solution:
where VM contains the M leading eigenvectors of S, ΛM is the diagonal matrix of their eigenvalues, and R is an arbitrary orthogonal rotation matrix. The noise variance is:
Step back and look at the PCA pipeline again: encode x into a low-dimensional z via BT, then decode z back to x̃ via B. This is exactly the structure of an auto-encoder: a network with an encoder, a bottleneck, and a decoder, trained to minimize reconstruction error.
PCA as auto-encoder:
Encoder: z = BT(x − μ)
Decoder: x̃ = Bz + μ
Loss: ||x − x̃||2
Both layers are linear. The optimal B has PCA eigenvectors as columns.
Deep auto-encoder:
Encoder: z = fθ(x) (nonlinear)
Decoder: x̃ = gφ(z) (nonlinear)
Loss: ||x − x̃||2
Multi-layer neural networks with ReLU, etc.
This connection is why PCA appears in deep learning textbooks. It is the simplest possible dimensionality reduction — the case where encoder and decoder are single linear layers with shared (transposed) weights. Every more complex auto-encoder (VAE, VQ-VAE, denoising AE) is a generalization of this idea.
| Method | Encoder | Decoder | Captures |
|---|---|---|---|
| PCA | Linear (BT) | Linear (B) | Linear subspaces |
| Kernel PCA | Nonlinear (kernel trick) | Pre-image | Nonlinear manifolds |
| Auto-encoder | Neural net | Neural net | Complex manifolds |
| VAE | Neural net + sampling | Neural net | Manifolds + generation |
The data lies on a curve (nonlinear structure). PCA projects onto a line (best linear fit). A nonlinear encoder could capture the curve. Notice the reconstruction errors are large for PCA on curved data.
PCA is the foundational dimensionality reduction technique. It finds the directions of maximum variance in the data — the eigenvectors of the covariance matrix — and projects onto them. The result is the best linear compression for any given target dimension.
| Concept | Core result |
|---|---|
| Compression | Encode: z = BTx. Decode: x̃ = Bz. Round-trip = projection. |
| Max variance | Lagrangian → Sb = λb. PC = eigenvector. Variance = eigenvalue. |
| Min recon error | JM = ∑ discarded eigenvalues. Same optimal B. |
| Duality | VM + JM = tr(S). Max one = min the other. |
| SVD | λd = σd2/N. Right singular vectors = PCs. |
| PPCA | z ~ N(0,I), x = Bz + μ + ε. MLE recovers PCA as σ → 0. |
| Auto-encoder | Linear AE = PCA. Nonlinear AE generalizes to curved manifolds. |
Strengths:
• Optimal linear compression (provably)
• Fast (eigendecomposition or SVD)
• No hyperparameters (M is a design choice, not tuned)
• Interpretable (each PC is a direction)
Limitations:
• Linear only (cannot capture curves)
• Sensitive to outliers
• Unsupervised (ignores class labels)
• Requires all data in memory (batch method)
What comes next: Chapter 11 (Regression) uses dimensionality reduction as a preprocessing step. Chapter 12 (Classification) introduces supervised methods like LDA. In the broader ML landscape, PCA connects to kernel methods, auto-encoders, and variational inference — it is the seed from which modern representation learning grew.
| Extension | What it adds |
|---|---|
| Kernel PCA | Nonlinear mapping via the kernel trick |
| Sparse PCA | L1 penalty for interpretable components |
| Incremental PCA | Online updates for streaming data |
| Factor Analysis | Per-feature noise variances (not isotropic) |
| VAE | Nonlinear + generative + deep |