Stanford EE269 — Mert Pilanci

Autoencoders & Robust PCA

Compress, reconstruct, separate — the art of finding low-dimensional structure in high-dimensional data.

Prerequisites: Linear algebra (SVD) + Basic calculus. That's it.
8
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: Why Compression?

You have 10,000 images, each with 784 pixels (28×28). That's 784 numbers per image. But most of those numbers are redundant — a face image doesn't need all 784 independent values. The actual "face-ness" lives in maybe 20 underlying factors: lighting, pose, expression, identity.

The question: can we learn a mapping that squeezes 784 dimensions down to 20, then expands back to 784 — and get a near-perfect reconstruction? If yes, those 20 numbers are the essence of the data.

Now here's a harder problem: what if your data matrix is corrupted? Not just noisy (small errors everywhere), but corrupted — some entries are completely wrong. Can you separate the clean low-rank structure from the sparse corruption? This is Robust PCA, and it's everywhere: removing shadows from surveillance video, detecting anomalies in networks, cleaning genomic data.

Two flavors of the same idea: Autoencoders find low-dimensional structure by learning a neural compression. Robust PCA finds it by decomposing a matrix into low-rank + sparse. Both ask: "what's the simplest explanation of this data?"
Compression: 784D → 2D → 784D

A point in high-dimensional space is projected to 2D (the latent code), then reconstructed. Drag the latent point to see how reconstruction changes.

Drag the teal dot
What does an autoencoder learn to do?

Chapter 1: The Encoder-Decoder Architecture

An autoencoder has two halves. The encoder maps input x to a low-dimensional latent code z. The decoder maps z back to a reconstruction x̂. The loss is the reconstruction error: how different is x̂ from x?

z = fenc(x),   x̂ = fdec(z),   L = ||x - x̂||²

The encoder compresses. The decoder expands. Training minimizes reconstruction error, which forces the bottleneck z to capture the most important information about x.

The Bottleneck Is the Point

If z has the same dimension as x, the autoencoder can just copy the input (identity function). That teaches nothing. The bottleneck — making dim(z) << dim(x) — forces the network to discover structure. It must decide: what information is essential, and what can be discarded?

Input x
[784] — e.g., flattened 28×28 image
↓ Encoder: Linear(784,256) → ReLU → Linear(256,z_dim)
Latent z
[z_dim] — compressed representation (e.g., 20)
↓ Decoder: Linear(z_dim,256) → ReLU → Linear(256,784)
Reconstruction x̂
[784] — should match input
python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, z_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, z_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid()  # pixel values in [0,1]
        )

    def forward(self, x):
        z = self.encoder(x)       # [B, 784] -> [B, 20]
        x_hat = self.decoder(z)   # [B, 20] -> [B, 784]
        return x_hat, z
Key insight: The latent dimension z_dim is a design choice. Too small and you can't reconstruct well (underfitting). Too large and the bottleneck doesn't force compression (overfitting / trivial identity). The sweet spot captures the intrinsic dimensionality of your data.
Bottleneck Dimension vs Reconstruction

Slide z_dim to see how reconstruction quality depends on bottleneck size. Smaller z = more lossy.

z_dim5
What happens if the latent dimension equals the input dimension?

Chapter 2: Linear AE = PCA (The Proof)

Here's a beautiful fact: if the encoder and decoder are both linear (no ReLU, no nonlinearity), the optimal autoencoder recovers PCA exactly. Let's prove it.

Setup

Let the encoder be z = Wx (where W is k×d) and the decoder be x̂ = Dz (where D is d×k). The loss is:

L = ∑i ||xi - DWxi||²

We want to minimize over W and D. Let's write this in matrix form. Stack all data as columns of X (d×n). Then:

L = ||X - DWX||F²

The Proof

The product DW is a d×d matrix of rank at most k (since it factors through k dimensions). So we're looking for the best rank-k approximation to the identity (when applied to X). By the Eckart-Young theorem, the best rank-k approximation to X in Frobenius norm is given by the truncated SVD:

X ≈ Uk Σk VkT

The optimal encoder projects onto the top-k singular vectors: W = UkT. The optimal decoder reconstructs using those same vectors: D = Uk. This is exactly PCA — projection onto the principal components.

The punchline: A linear autoencoder with k-dimensional bottleneck learns the same subspace as PCA with k components. Nonlinear autoencoders generalize PCA to curved manifolds — they can capture structure that lives on nonlinear submanifolds of the data space.
Linear AE = PCA Visualization

2D data projected to 1D (k=1). The orange line is the principal component. Teal dots = reconstructions. The AE learns this same direction.

Data angle30°
Noise0.20
PropertyPCANonlinear AE
EncoderLinear projection W=UkTNeural network f(x)
DecoderLinear expansion D=UkNeural network g(z)
SubspaceLinear (hyperplane)Curved manifold
Optimal?Yes (global, closed form)Local minimum (SGD)
Interpretable?Yes (eigenvectors)Not always
Why does a linear autoencoder recover PCA?

Chapter 3: Variational Autoencoders (VAE)

A regular autoencoder maps each input to a single point in latent space. But what if we want to generate new data? We'd need to sample random points in latent space and decode them. The problem: a regular AE's latent space has "holes" — regions that don't correspond to valid data.

The Variational Autoencoder (VAE) fixes this by making the encoder output a distribution instead of a point. Specifically, it outputs μ and σ for a Gaussian, and we sample z from that Gaussian.

Encoder: x → (μ, log σ²),   z = μ + σ · ε,   ε ~ N(0, I)

The Reparameterization Trick

We can't backpropagate through a random sample. But we can rewrite sampling as a deterministic function of a separate random variable ε:

z = μ + σ ⊙ ε,   ε ~ N(0, I)

Now the gradient flows through μ and σ (deterministic), while ε is just an external random input. This is the reparameterization trick — it makes the sampling step differentiable.

The VAE Loss

L = ||x - x̂||² + β · KL(q(z|x) || p(z))

Two terms: (1) reconstruction error (make x̂ close to x), and (2) KL divergence that pushes q(z|x) toward the prior p(z) = N(0,I). The KL term ensures the latent space is smooth and complete — no holes.

Input x
[784]
↓ Encoder network
μ, logσ²
Mean and log-variance of q(z|x)
↓ z = μ + σ·ε (reparameterize)
z ~ q(z|x)
[z_dim] — sampled latent
↓ Decoder network
[784] — reconstructed
Key insight: The KL term acts as a regularizer. Without it, the encoder could map each training point to a distant, isolated spike in latent space (just memorizing). The KL term forces all encodings to cluster near N(0,I), creating a continuous, interpolatable latent space.
VAE Latent Space

Each point is an encoded data sample. Regular AE (left) has gaps. VAE (right) fills the space smoothly. Adjust β to control the KL weight.

β (KL weight)1.0
What does the reparameterization trick enable?

Chapter 4: Robust PCA — Low-Rank + Sparse

Standard PCA is fragile. A single outlier can rotate the principal components dramatically. What if we knew the corruption was sparse — only affecting some entries, but affecting them badly?

Robust PCA decomposes a matrix X into two components:

X = L + S

where L is low-rank (the clean signal) and S is sparse (the corruption). Think of it as: L is the video background (low-rank because every frame looks similar), S is the foreground (sparse because moving objects occupy few pixels).

Why Not Just Use SVD?

If you SVD the corrupted matrix X, the corruption contaminates every singular vector. SVD finds the best overall low-rank approximation, but it smears the sparse errors across all components. Robust PCA separates them cleanly.

The Convex Relaxation

We want to minimize rank(L) subject to X = L + S with S sparse. But rank minimization is NP-hard. The nuclear norm ||L||* (sum of singular values) is the tightest convex relaxation of rank. Similarly, ||S||1 (sum of absolute values) promotes sparsity.

min ||L||* + λ||S||1   subject to   X = L + S

This is Principal Component Pursuit (PCP). Remarkably, under mild conditions (L is not too sparse, S is not too low-rank), this convex program exactly recovers the true L and S.

The miracle: A convex program — something we can solve efficiently — exactly recovers the rank and the sparse component, even though the original problem is combinatorial. This is one of the great results in compressed sensing / convex optimization (Candès et al., 2011).
Matrix Decomposition: X = L + S

A rank-2 matrix (left) corrupted by sparse errors (middle). RPCA recovers both. Adjust corruption level.

Rank of L2
Sparsity of S0.15
What is the nuclear norm ||L||* a convex proxy for?

Chapter 5: Algorithms for Robust PCA

How do we actually solve the PCP convex program? Two main approaches: ADMM (Alternating Direction Method of Multipliers) and iterative thresholding.

ADMM for PCP

ADMM introduces a Lagrange multiplier Y and alternates between updating L, S, and Y:

Update L
L ← SVT1/μ(X - S + Y/μ) — singular value thresholding
Update S
S ← Shrinkλ/μ(X - L + Y/μ) — soft thresholding
Update Y
Y ← Y + μ(X - L - S) — dual ascent
↻ repeat until convergence

Singular Value Thresholding (SVT)

The key operation is SVT: take the SVD of a matrix, subtract a threshold τ from each singular value (setting negatives to zero), then reconstruct. This is the proximal operator for the nuclear norm — it shrinks the singular values, effectively reducing rank.

SVTτ(M) = U · max(Σ - τI, 0) · VT

Soft Thresholding (Shrinkage)

For the sparse component, soft thresholding shrinks each element toward zero:

Shrinkτ(x) = sign(x) · max(|x| - τ, 0)

This is the proximal operator for the L1 norm. It zeroes out small entries (they're treated as noise) and shrinks large entries (they're treated as true corruption minus bias).

python
import numpy as np

def robust_pca(X, lam=None, mu=None, tol=1e-6, max_iter=100):
    """Principal Component Pursuit via ADMM."""
    m, n = X.shape
    if lam is None: lam = 1.0 / np.sqrt(max(m, n))
    if mu is None: mu = m * n / (4 * np.abs(X).sum())

    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)

    for _ in range(max_iter):
        # SVT for L
        U, s, Vt = np.linalg.svd(X - S + Y/mu, full_matrices=False)
        s_thresh = np.maximum(s - 1/mu, 0)
        L = U @ np.diag(s_thresh) @ Vt

        # Soft threshold for S
        temp = X - L + Y/mu
        S = np.sign(temp) * np.maximum(np.abs(temp) - lam/mu, 0)

        # Dual update
        residual = X - L - S
        Y = Y + mu * residual

        if np.linalg.norm(residual) / np.linalg.norm(X) < tol:
            break

    return L, S
Convergence: ADMM converges to the global optimum because the problem is convex. In practice, 50-100 iterations suffice. The main cost per iteration is one SVD — O(mn·min(m,n)).
ADMM Convergence

Watch the algorithm iterate. Each step the residual ||X-L-S|| decreases. Click "Step" to advance one iteration.

Iteration: 0
What does Singular Value Thresholding (SVT) do geometrically?

Chapter 6: Showcase — Matrix Separation

Time to see Robust PCA in action. We'll create a low-rank matrix, corrupt it with sparse errors, and watch the ADMM algorithm cleanly separate the two components.

This is the full interactive demo. You control the rank, sparsity, corruption magnitude, and λ. Watch the heatmaps update as the algorithm converges. Try breaking it: what happens when the corruption is too dense?
Interactive Robust PCA Decomposition

X = L + S. Top row: the ground truth. Bottom row: RPCA's estimate. Color = value (blue negative, orange positive).

True Rank2
Corruption %0.15
Corruption magnitude5.0
λ0.18
Recovery error: —

When Does RPCA Fail?

Try pushing corruption above 40%. The algorithm starts confusing S with L — the sparse component becomes too dense to be distinguishable from low-rank structure. The theoretical guarantee requires:

rank(L) ≤ ρr · min(m,n) / (μ · log²(max(m,n)))
||S||0 ≤ ρs · mn

In plain English: L must be genuinely low-rank, and S must be genuinely sparse. If either condition fails, no algorithm can separate them (information-theoretically impossible).

Chapter 7: Beyond — Connections

Autoencoders and Robust PCA are entry points into a rich landscape of representation learning and matrix decomposition methods.

MethodWhat It FindsWhen to Use
PCA / Linear AEBest linear subspaceData is roughly Gaussian, linear structure
Nonlinear AENonlinear manifoldData lives on curves/surfaces, not planes
VAESmooth generative latent spaceNeed to sample new data, interpolate
Robust PCALow-rank + sparse decompositionSparse corruption or outliers
NMFNon-negative partsData is inherently non-negative (spectra, counts)
Dictionary LearningSparse coding with learned atomsSignals are sparse in some learned basis
The thread: All these methods ask the same question in different disguises: "What is the simplest structure that explains my data?" PCA says "a flat plane." AE says "a curved surface." Robust PCA says "a flat plane, once I remove the outliers." NMF says "a non-negative combination of parts." Dictionary learning says "a sparse combination of learned atoms."

Related Lessons

Next: NMF & Clustering — what happens when you require non-negativity. Also: Dictionary Learning & Sparse Coding — learning overcomplete dictionaries with sparse codes.

What distinguishes Robust PCA from standard PCA?