Stanford EE269 — Mert Pilanci

Dictionary Learning & Sparse Coding

Learn an overcomplete dictionary where signals have sparse representations — fewer atoms, more expressive power.

Prerequisites: Linear algebra + Basic optimization. That's it.
8
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: Why Sparse Representation?

You have a signal — an image patch, an audio frame, a sensor reading. You want to represent it compactly. PCA gives you a fixed basis (the eigenvectors of the data covariance). But PCA uses all components with dense coefficients. Every signal is a mix of everything.

What if, instead, each signal used only 3-5 elements from a much larger collection of templates? A face patch might be: 0.8 × "nose_left" + 0.3 × "shadow_diagonal" + 0.5 × "skin_smooth". Most coefficients are zero. This is sparse coding: represent signals using few atoms from a learned dictionary.

The dictionary D has more columns than rows (it's overcomplete). This means there are many ways to represent each signal. But we insist on the sparsest one — the fewest atoms. This constraint makes the representation unique, interpretable, and compressible.

The core idea: Learn a dictionary D such that every signal x ≈ Dα where α is sparse (mostly zeros). Unlike PCA's dense coefficients, sparse codes say "this signal is made of these specific 3 atoms" — which is how humans think about composition.
Dense (PCA) vs Sparse (Dictionary) Codes

Left: PCA uses all components (dense α). Right: dictionary uses few atoms (sparse α). Both reconstruct the signal, but sparse is more interpretable.

What does "sparse representation" mean?

Chapter 1: The Sparse Coding Problem

Formally, given a signal x ∈ ℝm and a dictionary D ∈ ℝm×n (with n > m, overcomplete), find the sparsest coefficient vector α such that x = Dα:

min ||α||0   subject to   x = Dα

The L0 "norm" ||α||0 counts the number of non-zero entries. We want the representation with fewest active atoms.

Why Overcomplete?

If D has exactly m columns (a square basis like the DFT or wavelets), there's only one representation. With n > m columns, there are infinitely many solutions to x = Dα. The sparsity constraint selects the "simplest" one.

PropertyOrthogonal basis (n=m)Overcomplete dictionary (n>m)
Columnsm (exactly spans ℝm)n > m (redundant)
RepresentationUnique: α = D-1xInfinitely many solutions
SparsityFixed (data-dependent)Can be much sparser (dictionary adapts)
AdaptivityPre-defined (DCT, wavelet)Learned from data
Key insight: An overcomplete dictionary can have atoms tailored to the specific structures in your data. Natural images have edges at many orientations — a fixed basis (DCT) needs many coefficients to represent an edge. A learned dictionary can have an atom that IS that edge, needing only one coefficient.

The Optimization Landscape

The L0 minimization is NP-hard in general. We need algorithms that find (approximately) sparse solutions efficiently. The next chapters cover the two main approaches: greedy pursuit (Matching Pursuit, OMP) and convex relaxation (LASSO/Basis Pursuit).

Overcomplete Dictionary

Each column of D is an "atom" (shown as a bar pattern). A signal is a sparse combination of atoms. Click atoms to toggle them on/off.

Dictionary size (n)8
Signal dim (m)5
Why is L0 minimization NP-hard?

Chapter 2: Matching Pursuit — Greedy Atom Selection

Matching Pursuit (MP) is the simplest sparse coding algorithm. It's greedy: at each step, find the atom most correlated with the current residual, subtract its contribution, repeat.

Initialize
Residual r = x, code α = 0
Select best atom
k* = argmaxk |⟨r, dk⟩| — most correlated
Update coefficient
αk* += ⟨r, dk*
Update residual
r = r - ⟨r, dk*⟩ · dk*
↻ repeat until ||r|| < ε or sparsity budget reached
python
def matching_pursuit(x, D, n_atoms=5, tol=1e-4):
    """Greedy atom selection. D: [m, n], x: [m]."""
    r = x.copy()           # residual
    alpha = np.zeros(D.shape[1])
    selected = []

    for _ in range(n_atoms):
        # Find most correlated atom
        corrs = D.T @ r     # [n] inner products
        k = np.argmax(np.abs(corrs))
        selected.append(k)

        # Update
        alpha[k] += corrs[k]
        r = r - corrs[k] * D[:, k]

        if np.linalg.norm(r) < tol:
            break

    return alpha, selected
MP's weakness: Once an atom is selected, its coefficient is never revised. If atom 1 and atom 2 are correlated, MP might pick atom 1 first with a slightly-too-large coefficient, then need atom 2 to "correct" it. This leads to sub-optimal sparsity. OMP (next chapter) fixes this.
Matching Pursuit Step-by-Step

Watch MP select atoms one at a time. The residual shrinks with each step. The selected atom is highlighted.

Atoms selected: 0 | Residual: —
At each step, Matching Pursuit selects the atom that:

Chapter 3: OMP — Orthogonal Matching Pursuit

Orthogonal Matching Pursuit (OMP) improves on MP with one key change: after selecting each new atom, it re-solves the least-squares problem over ALL selected atoms. This ensures the residual is orthogonal to the span of selected atoms — hence "orthogonal."

The Key Difference

MP: Only updates the new atom's coefficient. Old coefficients stay fixed. Residual is orthogonal to only the latest atom.

OMP: Re-solves all coefficients jointly (least-squares). Residual is orthogonal to the entire span of selected atoms.

After selecting atoms S = {k1, ..., kt}:  αS = (DSTDS)-1DSTx

This is just the least-squares projection of x onto the columns of D indexed by S. The residual is r = x - DSαS, which is guaranteed orthogonal to span(DS).

python
def omp(x, D, n_atoms=5, tol=1e-4):
    """OMP: re-solve LS after each atom selection."""
    r = x.copy()
    support = []
    alpha = np.zeros(D.shape[1])

    for _ in range(n_atoms):
        # Select: same as MP
        corrs = D.T @ r
        k = np.argmax(np.abs(corrs))
        support.append(k)

        # KEY DIFFERENCE: re-solve LS over entire support
        Ds = D[:, support]
        alpha_s = np.linalg.lstsq(Ds, x, rcond=None)[0]
        for i, idx in enumerate(support):
            alpha[idx] = alpha_s[i]

        # Residual orthogonal to span(Ds)
        r = x - Ds @ alpha_s

        if np.linalg.norm(r) < tol:
            break

    return alpha, support
Why OMP matters: OMP guarantees that once you've selected t atoms, you have the best t-atom approximation within that support. MP doesn't — it can over-represent early atoms and under-represent later ones. OMP typically needs fewer atoms to reach the same residual.
MP vs OMP Comparison

Both algorithms run on the same signal. Compare how quickly the residual drops. OMP (teal) typically converges in fewer steps than MP (orange).

What makes OMP "orthogonal"?

Chapter 4: LASSO — L1 Convex Relaxation

The L0 problem is combinatorial. Greedy methods (MP, OMP) are fast but may not find the globally sparsest solution. Is there a convex relaxation that's both solvable and provably correct?

Yes: replace ||α||0 with ||α||1 = ∑|αi|. This is the LASSO (Least Absolute Shrinkage and Selection Operator):

minα ½||x - Dα||2² + λ||α||1

The L1 norm is the tightest convex relaxation of L0. It promotes sparsity because the L1 ball has corners on the axes — the optimal solution tends to land at a corner where some coordinates are exactly zero.

Why L1 Gives Sparsity

Geometrically: the L1 ball is a diamond (in 2D) or cross-polytope. The level sets of the quadratic loss are ellipsoids. When an ellipsoid first touches the diamond, it typically hits a corner (a sparse point). The L2 ball is a sphere — it touches ellipsoids at arbitrary points, giving non-sparse solutions.

The LASSO guarantee (Candès, Tao, Donoho): If the dictionary satisfies the Restricted Isometry Property (RIP) and the true solution is s-sparse, LASSO recovers the exact support with high probability. The threshold: λ should be O(σ√(log n)) where σ is noise level.

Solving LASSO: ISTA

The Iterative Shrinkage-Thresholding Algorithm (ISTA) alternates gradient step + soft thresholding:

α(t+1) = Shrinkλ/L(t) + (1/L) DT(x - Dα(t)))

where L = largest eigenvalue of DTD, and Shrinkτ(z) = sign(z)max(|z|-τ, 0).

python
def ista(x, D, lam, n_iter=500):
    """ISTA for LASSO: min 0.5*||x-D@alpha||^2 + lam*||alpha||_1"""
    m, n = D.shape
    L = np.linalg.norm(D.T @ D, ord=2)  # Lipschitz constant
    alpha = np.zeros(n)

    for _ in range(n_iter):
        # Gradient step
        grad = D.T @ (D @ alpha - x)
        z = alpha - grad / L

        # Soft thresholding (proximal of L1)
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam/L, 0)

    return alpha
L1 vs L2 Geometry

The loss ellipse touches the L1 diamond at a corner (sparse solution). With L2, it touches the circle at a generic point (dense). Drag λ to see how the constraint ball changes.

λ (regularization)1.0
Why does L1 regularization produce sparse solutions while L2 does not?

Chapter 5: K-SVD — Learning the Dictionary

So far we assumed the dictionary D is given. But the real power comes from learning D from data. The K-SVD algorithm (Aharon, Elad, Bruckstein, 2006) alternates between sparse coding and dictionary update:

minD, Α ||X - DΑ||F²   s.t.   ||αi||0 ≤ s   ∀i

where X = [x1 | ... | xN] are training signals and Α = [α1 | ... | αN] are their sparse codes.

The Two-Stage Algorithm

Stage 1: Sparse Coding
Fix D, solve for Α using OMP (per signal)
Stage 2: Dictionary Update
Fix Α, update each atom dk of D one at a time
↻ repeat 50-100 times

Atom Update (the "SVD" in K-SVD)

To update atom k, collect all signals that use atom k (call this set ωk). The error matrix restricted to these signals, with all other atoms' contributions removed, is:

Ek = Xωk - ∑j≠k dj αjT

We want dk and the k-th row of Α to best approximate Ek. This is a rank-1 approximation problem — solved by the largest singular vector of Ek:

Ek ≈ σ1 · u1 · v1T → dk = u1,   αkT = σ1 v1T
python
def ksvd(X, n_atoms, sparsity, n_iter=50):
    """K-SVD dictionary learning."""
    m, N = X.shape
    # Initialize dictionary (random normalized columns)
    D = np.random.randn(m, n_atoms)
    D /= np.linalg.norm(D, axis=0)

    for _ in range(n_iter):
        # Stage 1: Sparse code each signal with OMP
        Alpha = np.zeros((n_atoms, N))
        for i in range(N):
            Alpha[:, i], _ = omp(X[:, i], D, n_atoms=sparsity)

        # Stage 2: Update each atom
        for k in range(n_atoms):
            # Signals using atom k
            omega = np.where(Alpha[k, :] != 0)[0]
            if len(omega) == 0:
                continue

            # Error without atom k
            E = X[:, omega] - D @ Alpha[:, omega] + \
                np.outer(D[:, k], Alpha[k, omega])

            # Rank-1 SVD
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            Alpha[k, omega] = s[0] * Vt[0, :]

    return D, Alpha
Why K-SVD works well: Each atom update is the globally optimal rank-1 approximation to the error matrix. Unlike gradient-based methods, it doesn't get stuck in shallow local minima for the dictionary. The OMP step guarantees exact sparsity. Together, they converge quickly to good dictionaries.
K-SVD Learning Progress

Watch dictionary atoms emerge from random initialization as K-SVD iterates. Each column is one atom.

Iteration: 0
In K-SVD, how is each dictionary atom updated?

Chapter 6: Showcase — Interactive K-SVD

Watch K-SVD learn a dictionary from synthetic signal patches. You control the sparsity level, number of atoms, and noise. The dictionary starts random and converges to meaningful structure.

This is the full interactive experience. Generate training data (synthetic patches with known structure), run K-SVD, and see the learned atoms. Compare with the ground-truth dictionary that generated the data.
K-SVD Dictionary Learning

Top: ground-truth atoms used to generate data. Middle: current learned dictionary. Bottom: reconstruction error over iterations. Watch the learned atoms converge toward ground truth.

Sparsity (atoms per signal)2
Dictionary size6
Noise level0.10
Iter: 0 | Error: —

Things to Try

Chapter 7: Beyond — Connections

Dictionary learning sits at the crossroads of signal processing, machine learning, and neuroscience.

MethodDictionarySparsity MechanismBest For
K-SVDLearned, overcompleteOMP (greedy, exact L0)Image denoising, inpainting
LASSO / BPFixed or learnedL1 relaxation (convex)Compressed sensing, regression
Online DLLearned, streamingLARS / coordinate descentLarge-scale (Mairal et al.)
Sparse AENeural network encoderL1 penalty on activationsFeature learning, mechanistic interp.
NMFNon-negativeNon-negativity as implicit sparsityAudio, text, images (non-neg data)
Dictionary learning in modern ML: Sparse autoencoders (used in mechanistic interpretability of LLMs) are essentially neural dictionary learning. The encoder learns a dictionary of "features" and the code is forced sparse with an L1 penalty. When researchers probe what neurons in GPT mean, they're doing dictionary learning on activations.

The Neuroscience Connection

Olshausen & Field (1996) showed that sparse coding on natural image patches produces atoms that look like oriented Gabor filters — exactly what V1 neurons in the visual cortex compute. The brain may literally be doing dictionary learning. Sparsity isn't just a computational convenience; it may be a fundamental principle of biological representation.

Related Lessons

Previous: NMF & Clustering — non-negative factorization as an alternative to signed sparse coding.

Previous: Autoencoders & Robust PCA — linear AE = PCA, the dense counterpart to dictionary learning.

What is the modern ML analog of dictionary learning?