EE269 Autoencoders — Autoencoders & Robust PCA

Chapter 0: Why Compression?

You have 10,000 images, each with 784 pixels (28×28). That's 784 numbers per image. But most of those numbers are redundant — a face image doesn't need all 784 independent values. The actual "face-ness" lives in maybe 20 underlying factors: lighting, pose, expression, identity.

The question: can we learn a mapping that squeezes 784 dimensions down to 20, then expands back to 784 — and get a near-perfect reconstruction? If yes, those 20 numbers are the essence of the data.

Now here's a harder problem: what if your data matrix is corrupted? Not just noisy (small errors everywhere), but corrupted — some entries are completely wrong. Can you separate the clean low-rank structure from the sparse corruption? This is Robust PCA, and it's everywhere: removing shadows from surveillance video, detecting anomalies in networks, cleaning genomic data.

Two flavors of the same idea: Autoencoders find low-dimensional structure by learning a neural compression. Robust PCA finds it by decomposing a matrix into low-rank + sparse. Both ask: "what's the simplest explanation of this data?"

Compression: 784D → 2D → 784D

A point in high-dimensional space is projected to 2D (the latent code), then reconstructed. Drag the latent point to see how reconstruction changes.

Drag the teal dot

What does an autoencoder learn to do?

Classify images into categories Compress data to a low-dimensional code and reconstruct it Generate random images from noise

Chapter 1: The Encoder-Decoder Architecture

An autoencoder has two halves. The encoder maps input x to a low-dimensional latent code z. The decoder maps z back to a reconstruction x̂. The loss is the reconstruction error: how different is x̂ from x?

z = f_enc(x), x̂ = f_dec(z), L = ||x - x̂||²

The encoder compresses. The decoder expands. Training minimizes reconstruction error, which forces the bottleneck z to capture the most important information about x.

The Bottleneck Is the Point

If z has the same dimension as x, the autoencoder can just copy the input (identity function). That teaches nothing. The bottleneck — making dim(z) << dim(x) — forces the network to discover structure. It must decide: what information is essential, and what can be discarded?

Input x

[784] — e.g., flattened 28×28 image

↓ Encoder: Linear(784,256) → ReLU → Linear(256,z_dim)

Latent z

[z_dim] — compressed representation (e.g., 20)

↓ Decoder: Linear(z_dim,256) → ReLU → Linear(256,784)

Reconstruction x̂

[784] — should match input

python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, z_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, z_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid()  # pixel values in [0,1]
        )

    def forward(self, x):
        z = self.encoder(x)       # [B, 784] -> [B, 20]
        x_hat = self.decoder(z)   # [B, 20] -> [B, 784]
        return x_hat, z

Key insight: The latent dimension z_dim is a design choice. Too small and you can't reconstruct well (underfitting). Too large and the bottleneck doesn't force compression (overfitting / trivial identity). The sweet spot captures the intrinsic dimensionality of your data.

Bottleneck Dimension vs Reconstruction

Slide z_dim to see how reconstruction quality depends on bottleneck size. Smaller z = more lossy.

z_dim5

What happens if the latent dimension equals the input dimension?

The network can learn the identity (no compression), defeating the purpose It always produces better reconstructions Training becomes impossible

Chapter 2: Linear AE = PCA (The Proof)

Here's a beautiful fact: if the encoder and decoder are both linear (no ReLU, no nonlinearity), the optimal autoencoder recovers PCA exactly. Let's prove it.

Setup

Let the encoder be z = Wx (where W is k×d) and the decoder be x̂ = Dz (where D is d×k). The loss is:

L = ∑_i ||x_i - DWx_i||²

We want to minimize over W and D. Let's write this in matrix form. Stack all data as columns of X (d×n). Then:

L = ||X - DWX||_F²

The Proof

The product DW is a d×d matrix of rank at most k (since it factors through k dimensions). So we're looking for the best rank-k approximation to the identity (when applied to X). By the Eckart-Young theorem, the best rank-k approximation to X in Frobenius norm is given by the truncated SVD:

X ≈ U_k Σ_k V_k^T

The optimal encoder projects onto the top-k singular vectors: W = U_k^T. The optimal decoder reconstructs using those same vectors: D = U_k. This is exactly PCA — projection onto the principal components.

The punchline: A linear autoencoder with k-dimensional bottleneck learns the same subspace as PCA with k components. Nonlinear autoencoders generalize PCA to curved manifolds — they can capture structure that lives on nonlinear submanifolds of the data space.

Linear AE = PCA Visualization

2D data projected to 1D (k=1). The orange line is the principal component. Teal dots = reconstructions. The AE learns this same direction.

Data angle30°

Noise0.20

Property	PCA	Nonlinear AE
Encoder	Linear projection W=U_k^T	Neural network f(x)
Decoder	Linear expansion D=U_k	Neural network g(z)
Subspace	Linear (hyperplane)	Curved manifold
Optimal?	Yes (global, closed form)	Local minimum (SGD)
Interpretable?	Yes (eigenvectors)	Not always

Why does a linear autoencoder recover PCA?

Because neural networks are always equivalent to matrix factorization Because minimizing linear reconstruction error with a rank-k bottleneck is exactly the problem PCA solves (Eckart-Young) Because PCA uses gradient descent internally

Chapter 3: Variational Autoencoders (VAE)

A regular autoencoder maps each input to a single point in latent space. But what if we want to generate new data? We'd need to sample random points in latent space and decode them. The problem: a regular AE's latent space has "holes" — regions that don't correspond to valid data.

The Variational Autoencoder (VAE) fixes this by making the encoder output a distribution instead of a point. Specifically, it outputs μ and σ for a Gaussian, and we sample z from that Gaussian.

Encoder: x → (μ, log σ²), z = μ + σ · ε, ε ~ N(0, I)

The Reparameterization Trick

We can't backpropagate through a random sample. But we can rewrite sampling as a deterministic function of a separate random variable ε:

z = μ + σ ⊙ ε, ε ~ N(0, I)

Now the gradient flows through μ and σ (deterministic), while ε is just an external random input. This is the reparameterization trick — it makes the sampling step differentiable.

The VAE Loss

L = ||x - x̂||² + β · KL(q(z|x) || p(z))

Two terms: (1) reconstruction error (make x̂ close to x), and (2) KL divergence that pushes q(z|x) toward the prior p(z) = N(0,I). The KL term ensures the latent space is smooth and complete — no holes.

Input x

[784]

↓ Encoder network

μ, logσ²

Mean and log-variance of q(z|x)

↓ z = μ + σ·ε (reparameterize)

z ~ q(z|x)

[z_dim] — sampled latent

↓ Decoder network

x̂

[784] — reconstructed

Key insight: The KL term acts as a regularizer. Without it, the encoder could map each training point to a distant, isolated spike in latent space (just memorizing). The KL term forces all encodings to cluster near N(0,I), creating a continuous, interpolatable latent space.

VAE Latent Space

Each point is an encoded data sample. Regular AE (left) has gaps. VAE (right) fills the space smoothly. Adjust β to control the KL weight.

β (KL weight)1.0

What does the reparameterization trick enable?

Faster training speed Smaller model size Backpropagation through the sampling step (makes sampling differentiable)

Chapter 4: Robust PCA — Low-Rank + Sparse

Standard PCA is fragile. A single outlier can rotate the principal components dramatically. What if we knew the corruption was sparse — only affecting some entries, but affecting them badly?

Robust PCA decomposes a matrix X into two components:

X = L + S

where L is low-rank (the clean signal) and S is sparse (the corruption). Think of it as: L is the video background (low-rank because every frame looks similar), S is the foreground (sparse because moving objects occupy few pixels).

Why Not Just Use SVD?

If you SVD the corrupted matrix X, the corruption contaminates every singular vector. SVD finds the best overall low-rank approximation, but it smears the sparse errors across all components. Robust PCA separates them cleanly.

The Convex Relaxation

We want to minimize rank(L) subject to X = L + S with S sparse. But rank minimization is NP-hard. The nuclear norm ||L||_* (sum of singular values) is the tightest convex relaxation of rank. Similarly, ||S||₁ (sum of absolute values) promotes sparsity.

min ||L||_* + λ||S||₁ subject to X = L + S

This is Principal Component Pursuit (PCP). Remarkably, under mild conditions (L is not too sparse, S is not too low-rank), this convex program exactly recovers the true L and S.

The miracle: A convex program — something we can solve efficiently — exactly recovers the rank and the sparse component, even though the original problem is combinatorial. This is one of the great results in compressed sensing / convex optimization (Candès et al., 2011).

Matrix Decomposition: X = L + S

A rank-2 matrix (left) corrupted by sparse errors (middle). RPCA recovers both. Adjust corruption level.

Rank of L2

Sparsity of S0.15

What is the nuclear norm ||L||* a convex proxy for?

The rank of matrix L (number of non-zero singular values) The sparsity of L The Frobenius norm of L

Chapter 5: Algorithms for Robust PCA

How do we actually solve the PCP convex program? Two main approaches: ADMM (Alternating Direction Method of Multipliers) and iterative thresholding.

ADMM for PCP

ADMM introduces a Lagrange multiplier Y and alternates between updating L, S, and Y:

Update L

L ← SVT_1/μ(X - S + Y/μ) — singular value thresholding

↓

Update S

S ← Shrink_λ/μ(X - L + Y/μ) — soft thresholding

↓

Update Y

Y ← Y + μ(X - L - S) — dual ascent

↻ repeat until convergence

Singular Value Thresholding (SVT)

The key operation is SVT: take the SVD of a matrix, subtract a threshold τ from each singular value (setting negatives to zero), then reconstruct. This is the proximal operator for the nuclear norm — it shrinks the singular values, effectively reducing rank.

SVT_τ(M) = U · max(Σ - τI, 0) · V^T

Soft Thresholding (Shrinkage)

For the sparse component, soft thresholding shrinks each element toward zero:

Shrink_τ(x) = sign(x) · max(|x| - τ, 0)

This is the proximal operator for the L1 norm. It zeroes out small entries (they're treated as noise) and shrinks large entries (they're treated as true corruption minus bias).

python
import numpy as np

def robust_pca(X, lam=None, mu=None, tol=1e-6, max_iter=100):
    """Principal Component Pursuit via ADMM."""
    m, n = X.shape
    if lam is None: lam = 1.0 / np.sqrt(max(m, n))
    if mu is None: mu = m * n / (4 * np.abs(X).sum())

    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)

    for _ in range(max_iter):
        # SVT for L
        U, s, Vt = np.linalg.svd(X - S + Y/mu, full_matrices=False)
        s_thresh = np.maximum(s - 1/mu, 0)
        L = U @ np.diag(s_thresh) @ Vt

        # Soft threshold for S
        temp = X - L + Y/mu
        S = np.sign(temp) * np.maximum(np.abs(temp) - lam/mu, 0)

        # Dual update
        residual = X - L - S
        Y = Y + mu * residual

        if np.linalg.norm(residual) / np.linalg.norm(X) < tol:
            break

    return L, S

Convergence: ADMM converges to the global optimum because the problem is convex. In practice, 50-100 iterations suffice. The main cost per iteration is one SVD — O(mn·min(m,n)).

ADMM Convergence

Watch the algorithm iterate. Each step the residual ||X-L-S|| decreases. Click "Step" to advance one iteration.

Iteration: 0

What does Singular Value Thresholding (SVT) do geometrically?

Shrinks singular values toward zero, effectively reducing the rank of the matrix Rotates the matrix to align with principal axes Zeroes out entire rows of the matrix

Chapter 6: Showcase — Matrix Separation

Time to see Robust PCA in action. We'll create a low-rank matrix, corrupt it with sparse errors, and watch the ADMM algorithm cleanly separate the two components.

This is the full interactive demo. You control the rank, sparsity, corruption magnitude, and λ. Watch the heatmaps update as the algorithm converges. Try breaking it: what happens when the corruption is too dense?

Interactive Robust PCA Decomposition

X = L + S. Top row: the ground truth. Bottom row: RPCA's estimate. Color = value (blue negative, orange positive).

True Rank2

Corruption %0.15

Corruption magnitude5.0

λ0.18

Recovery error: —

When Does RPCA Fail?

Try pushing corruption above 40%. The algorithm starts confusing S with L — the sparse component becomes too dense to be distinguishable from low-rank structure. The theoretical guarantee requires:

rank(L) ≤ ρ_r · min(m,n) / (μ · log²(max(m,n)))

||S||₀ ≤ ρ_s · mn

In plain English: L must be genuinely low-rank, and S must be genuinely sparse. If either condition fails, no algorithm can separate them (information-theoretically impossible).

Chapter 7: Beyond — Connections

Autoencoders and Robust PCA are entry points into a rich landscape of representation learning and matrix decomposition methods.

Method	What It Finds	When to Use
PCA / Linear AE	Best linear subspace	Data is roughly Gaussian, linear structure
Nonlinear AE	Nonlinear manifold	Data lives on curves/surfaces, not planes
VAE	Smooth generative latent space	Need to sample new data, interpolate
Robust PCA	Low-rank + sparse decomposition	Sparse corruption or outliers
NMF	Non-negative parts	Data is inherently non-negative (spectra, counts)
Dictionary Learning	Sparse coding with learned atoms	Signals are sparse in some learned basis

The thread: All these methods ask the same question in different disguises: "What is the simplest structure that explains my data?" PCA says "a flat plane." AE says "a curved surface." Robust PCA says "a flat plane, once I remove the outliers." NMF says "a non-negative combination of parts." Dictionary learning says "a sparse combination of learned atoms."

Related Lessons

Next: NMF & Clustering — what happens when you require non-negativity. Also: Dictionary Learning & Sparse Coding — learning overcomplete dictionaries with sparse codes.

What distinguishes Robust PCA from standard PCA?

Robust PCA uses neural networks Robust PCA explicitly separates sparse corruption from low-rank structure, rather than treating all errors as Gaussian noise Robust PCA produces higher-rank approximations