Compress, reconstruct, separate — the art of finding low-dimensional structure in high-dimensional data.
You have 10,000 images, each with 784 pixels (28×28). That's 784 numbers per image. But most of those numbers are redundant — a face image doesn't need all 784 independent values. The actual "face-ness" lives in maybe 20 underlying factors: lighting, pose, expression, identity.
The question: can we learn a mapping that squeezes 784 dimensions down to 20, then expands back to 784 — and get a near-perfect reconstruction? If yes, those 20 numbers are the essence of the data.
Now here's a harder problem: what if your data matrix is corrupted? Not just noisy (small errors everywhere), but corrupted — some entries are completely wrong. Can you separate the clean low-rank structure from the sparse corruption? This is Robust PCA, and it's everywhere: removing shadows from surveillance video, detecting anomalies in networks, cleaning genomic data.
A point in high-dimensional space is projected to 2D (the latent code), then reconstructed. Drag the latent point to see how reconstruction changes.
An autoencoder has two halves. The encoder maps input x to a low-dimensional latent code z. The decoder maps z back to a reconstruction x̂. The loss is the reconstruction error: how different is x̂ from x?
The encoder compresses. The decoder expands. Training minimizes reconstruction error, which forces the bottleneck z to capture the most important information about x.
If z has the same dimension as x, the autoencoder can just copy the input (identity function). That teaches nothing. The bottleneck — making dim(z) << dim(x) — forces the network to discover structure. It must decide: what information is essential, and what can be discarded?
python import torch.nn as nn class Autoencoder(nn.Module): def __init__(self, input_dim=784, z_dim=20): super().__init__() self.encoder = nn.Sequential( nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, z_dim) ) self.decoder = nn.Sequential( nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, input_dim), nn.Sigmoid() # pixel values in [0,1] ) def forward(self, x): z = self.encoder(x) # [B, 784] -> [B, 20] x_hat = self.decoder(z) # [B, 20] -> [B, 784] return x_hat, z
Slide z_dim to see how reconstruction quality depends on bottleneck size. Smaller z = more lossy.
Here's a beautiful fact: if the encoder and decoder are both linear (no ReLU, no nonlinearity), the optimal autoencoder recovers PCA exactly. Let's prove it.
Let the encoder be z = Wx (where W is k×d) and the decoder be x̂ = Dz (where D is d×k). The loss is:
We want to minimize over W and D. Let's write this in matrix form. Stack all data as columns of X (d×n). Then:
The product DW is a d×d matrix of rank at most k (since it factors through k dimensions). So we're looking for the best rank-k approximation to the identity (when applied to X). By the Eckart-Young theorem, the best rank-k approximation to X in Frobenius norm is given by the truncated SVD:
The optimal encoder projects onto the top-k singular vectors: W = UkT. The optimal decoder reconstructs using those same vectors: D = Uk. This is exactly PCA — projection onto the principal components.
2D data projected to 1D (k=1). The orange line is the principal component. Teal dots = reconstructions. The AE learns this same direction.
| Property | PCA | Nonlinear AE |
|---|---|---|
| Encoder | Linear projection W=UkT | Neural network f(x) |
| Decoder | Linear expansion D=Uk | Neural network g(z) |
| Subspace | Linear (hyperplane) | Curved manifold |
| Optimal? | Yes (global, closed form) | Local minimum (SGD) |
| Interpretable? | Yes (eigenvectors) | Not always |
A regular autoencoder maps each input to a single point in latent space. But what if we want to generate new data? We'd need to sample random points in latent space and decode them. The problem: a regular AE's latent space has "holes" — regions that don't correspond to valid data.
The Variational Autoencoder (VAE) fixes this by making the encoder output a distribution instead of a point. Specifically, it outputs μ and σ for a Gaussian, and we sample z from that Gaussian.
We can't backpropagate through a random sample. But we can rewrite sampling as a deterministic function of a separate random variable ε:
Now the gradient flows through μ and σ (deterministic), while ε is just an external random input. This is the reparameterization trick — it makes the sampling step differentiable.
Two terms: (1) reconstruction error (make x̂ close to x), and (2) KL divergence that pushes q(z|x) toward the prior p(z) = N(0,I). The KL term ensures the latent space is smooth and complete — no holes.
Each point is an encoded data sample. Regular AE (left) has gaps. VAE (right) fills the space smoothly. Adjust β to control the KL weight.
Standard PCA is fragile. A single outlier can rotate the principal components dramatically. What if we knew the corruption was sparse — only affecting some entries, but affecting them badly?
Robust PCA decomposes a matrix X into two components:
where L is low-rank (the clean signal) and S is sparse (the corruption). Think of it as: L is the video background (low-rank because every frame looks similar), S is the foreground (sparse because moving objects occupy few pixels).
If you SVD the corrupted matrix X, the corruption contaminates every singular vector. SVD finds the best overall low-rank approximation, but it smears the sparse errors across all components. Robust PCA separates them cleanly.
We want to minimize rank(L) subject to X = L + S with S sparse. But rank minimization is NP-hard. The nuclear norm ||L||* (sum of singular values) is the tightest convex relaxation of rank. Similarly, ||S||1 (sum of absolute values) promotes sparsity.
This is Principal Component Pursuit (PCP). Remarkably, under mild conditions (L is not too sparse, S is not too low-rank), this convex program exactly recovers the true L and S.
A rank-2 matrix (left) corrupted by sparse errors (middle). RPCA recovers both. Adjust corruption level.
How do we actually solve the PCP convex program? Two main approaches: ADMM (Alternating Direction Method of Multipliers) and iterative thresholding.
ADMM introduces a Lagrange multiplier Y and alternates between updating L, S, and Y:
The key operation is SVT: take the SVD of a matrix, subtract a threshold τ from each singular value (setting negatives to zero), then reconstruct. This is the proximal operator for the nuclear norm — it shrinks the singular values, effectively reducing rank.
For the sparse component, soft thresholding shrinks each element toward zero:
This is the proximal operator for the L1 norm. It zeroes out small entries (they're treated as noise) and shrinks large entries (they're treated as true corruption minus bias).
python import numpy as np def robust_pca(X, lam=None, mu=None, tol=1e-6, max_iter=100): """Principal Component Pursuit via ADMM.""" m, n = X.shape if lam is None: lam = 1.0 / np.sqrt(max(m, n)) if mu is None: mu = m * n / (4 * np.abs(X).sum()) L = np.zeros_like(X) S = np.zeros_like(X) Y = np.zeros_like(X) for _ in range(max_iter): # SVT for L U, s, Vt = np.linalg.svd(X - S + Y/mu, full_matrices=False) s_thresh = np.maximum(s - 1/mu, 0) L = U @ np.diag(s_thresh) @ Vt # Soft threshold for S temp = X - L + Y/mu S = np.sign(temp) * np.maximum(np.abs(temp) - lam/mu, 0) # Dual update residual = X - L - S Y = Y + mu * residual if np.linalg.norm(residual) / np.linalg.norm(X) < tol: break return L, S
Watch the algorithm iterate. Each step the residual ||X-L-S|| decreases. Click "Step" to advance one iteration.
Time to see Robust PCA in action. We'll create a low-rank matrix, corrupt it with sparse errors, and watch the ADMM algorithm cleanly separate the two components.
X = L + S. Top row: the ground truth. Bottom row: RPCA's estimate. Color = value (blue negative, orange positive).
Try pushing corruption above 40%. The algorithm starts confusing S with L — the sparse component becomes too dense to be distinguishable from low-rank structure. The theoretical guarantee requires:
In plain English: L must be genuinely low-rank, and S must be genuinely sparse. If either condition fails, no algorithm can separate them (information-theoretically impossible).
Autoencoders and Robust PCA are entry points into a rich landscape of representation learning and matrix decomposition methods.
| Method | What It Finds | When to Use |
|---|---|---|
| PCA / Linear AE | Best linear subspace | Data is roughly Gaussian, linear structure |
| Nonlinear AE | Nonlinear manifold | Data lives on curves/surfaces, not planes |
| VAE | Smooth generative latent space | Need to sample new data, interpolate |
| Robust PCA | Low-rank + sparse decomposition | Sparse corruption or outliers |
| NMF | Non-negative parts | Data is inherently non-negative (spectra, counts) |
| Dictionary Learning | Sparse coding with learned atoms | Signals are sparse in some learned basis |
Next: NMF & Clustering — what happens when you require non-negativity. Also: Dictionary Learning & Sparse Coding — learning overcomplete dictionaries with sparse codes.