EE269 Dictionary Learning — Dictionary Learning & Sparse Coding

Chapter 0: Why Sparse Representation?

You have a signal — an image patch, an audio frame, a sensor reading. You want to represent it compactly. PCA gives you a fixed basis (the eigenvectors of the data covariance). But PCA uses all components with dense coefficients. Every signal is a mix of everything.

What if, instead, each signal used only 3-5 elements from a much larger collection of templates? A face patch might be: 0.8 × "nose_left" + 0.3 × "shadow_diagonal" + 0.5 × "skin_smooth". Most coefficients are zero. This is sparse coding: represent signals using few atoms from a learned dictionary.

The dictionary D has more columns than rows (it's overcomplete). This means there are many ways to represent each signal. But we insist on the sparsest one — the fewest atoms. This constraint makes the representation unique, interpretable, and compressible.

The core idea: Learn a dictionary D such that every signal x ≈ Dα where α is sparse (mostly zeros). Unlike PCA's dense coefficients, sparse codes say "this signal is made of these specific 3 atoms" — which is how humans think about composition.

Dense (PCA) vs Sparse (Dictionary) Codes

Left: PCA uses all components (dense α). Right: dictionary uses few atoms (sparse α). Both reconstruct the signal, but sparse is more interpretable.

What does "sparse representation" mean?

Using a small dictionary with few atoms Representing a signal using few non-zero coefficients from a (possibly large) dictionary Compressing the dictionary itself

Chapter 1: The Sparse Coding Problem

Formally, given a signal x ∈ ℝ^m and a dictionary D ∈ ℝ^m×n (with n > m, overcomplete), find the sparsest coefficient vector α such that x = Dα:

min ||α||₀ subject to x = Dα

The L0 "norm" ||α||₀ counts the number of non-zero entries. We want the representation with fewest active atoms.

Why Overcomplete?

If D has exactly m columns (a square basis like the DFT or wavelets), there's only one representation. With n > m columns, there are infinitely many solutions to x = Dα. The sparsity constraint selects the "simplest" one.

Property	Orthogonal basis (n=m)	Overcomplete dictionary (n>m)
Columns	m (exactly spans ℝ^m)	n > m (redundant)
Representation	Unique: α = D^-1x	Infinitely many solutions
Sparsity	Fixed (data-dependent)	Can be much sparser (dictionary adapts)
Adaptivity	Pre-defined (DCT, wavelet)	Learned from data

Key insight: An overcomplete dictionary can have atoms tailored to the specific structures in your data. Natural images have edges at many orientations — a fixed basis (DCT) needs many coefficients to represent an edge. A learned dictionary can have an atom that IS that edge, needing only one coefficient.

The Optimization Landscape

The L0 minimization is NP-hard in general. We need algorithms that find (approximately) sparse solutions efficiently. The next chapters cover the two main approaches: greedy pursuit (Matching Pursuit, OMP) and convex relaxation (LASSO/Basis Pursuit).

Overcomplete Dictionary

Each column of D is an "atom" (shown as a bar pattern). A signal is a sparse combination of atoms. Click atoms to toggle them on/off.

Dictionary size (n)8

Signal dim (m)5

Why is L0 minimization NP-hard?

Because the dictionary is too large to store Because gradient descent can't handle integer constraints Because finding the support (which entries are non-zero) requires searching over 2^n subsets

Chapter 2: Matching Pursuit — Greedy Atom Selection

Matching Pursuit (MP) is the simplest sparse coding algorithm. It's greedy: at each step, find the atom most correlated with the current residual, subtract its contribution, repeat.

Initialize

Residual r = x, code α = 0

↓

Select best atom

k* = argmax_k |⟨r, d_k⟩| — most correlated

↓

Update coefficient

α_k* += ⟨r, d_k*⟩

↓

Update residual

r = r - ⟨r, d_k*⟩ · d_k*

↻ repeat until ||r|| < ε or sparsity budget reached

python
def matching_pursuit(x, D, n_atoms=5, tol=1e-4):
    """Greedy atom selection. D: [m, n], x: [m]."""
    r = x.copy()           # residual
    alpha = np.zeros(D.shape[1])
    selected = []

    for _ in range(n_atoms):
        # Find most correlated atom
        corrs = D.T @ r     # [n] inner products
        k = np.argmax(np.abs(corrs))
        selected.append(k)

        # Update
        alpha[k] += corrs[k]
        r = r - corrs[k] * D[:, k]

        if np.linalg.norm(r) < tol:
            break

    return alpha, selected

MP's weakness: Once an atom is selected, its coefficient is never revised. If atom 1 and atom 2 are correlated, MP might pick atom 1 first with a slightly-too-large coefficient, then need atom 2 to "correct" it. This leads to sub-optimal sparsity. OMP (next chapter) fixes this.

Matching Pursuit Step-by-Step

Watch MP select atoms one at a time. The residual shrinks with each step. The selected atom is highlighted.

Atoms selected: 0 | Residual: —

At each step, Matching Pursuit selects the atom that:

Has the largest inner product (correlation) with the current residual Has the smallest norm Was not selected in previous steps

Chapter 3: OMP — Orthogonal Matching Pursuit

Orthogonal Matching Pursuit (OMP) improves on MP with one key change: after selecting each new atom, it re-solves the least-squares problem over ALL selected atoms. This ensures the residual is orthogonal to the span of selected atoms — hence "orthogonal."

The Key Difference

MP: Only updates the new atom's coefficient. Old coefficients stay fixed. Residual is orthogonal to only the latest atom.

OMP: Re-solves all coefficients jointly (least-squares). Residual is orthogonal to the entire span of selected atoms.

After selecting atoms S = {k₁, ..., k_t}: α_S = (D_S^TD_S)^-1D_S^Tx

This is just the least-squares projection of x onto the columns of D indexed by S. The residual is r = x - D_Sα_S, which is guaranteed orthogonal to span(D_S).

python
def omp(x, D, n_atoms=5, tol=1e-4):
    """OMP: re-solve LS after each atom selection."""
    r = x.copy()
    support = []
    alpha = np.zeros(D.shape[1])

    for _ in range(n_atoms):
        # Select: same as MP
        corrs = D.T @ r
        k = np.argmax(np.abs(corrs))
        support.append(k)

        # KEY DIFFERENCE: re-solve LS over entire support
        Ds = D[:, support]
        alpha_s = np.linalg.lstsq(Ds, x, rcond=None)[0]
        for i, idx in enumerate(support):
            alpha[idx] = alpha_s[i]

        # Residual orthogonal to span(Ds)
        r = x - Ds @ alpha_s

        if np.linalg.norm(r) < tol:
            break

    return alpha, support

Why OMP matters: OMP guarantees that once you've selected t atoms, you have the best t-atom approximation within that support. MP doesn't — it can over-represent early atoms and under-represent later ones. OMP typically needs fewer atoms to reach the same residual.

MP vs OMP Comparison

Both algorithms run on the same signal. Compare how quickly the residual drops. OMP (teal) typically converges in fewer steps than MP (orange).

What makes OMP "orthogonal"?

The dictionary atoms must be orthogonal The residual is always orthogonal to the span of all selected atoms (via least-squares re-solve) The selected atoms are orthogonalized before use

Chapter 4: LASSO — L1 Convex Relaxation

The L0 problem is combinatorial. Greedy methods (MP, OMP) are fast but may not find the globally sparsest solution. Is there a convex relaxation that's both solvable and provably correct?

Yes: replace ||α||₀ with ||α||₁ = ∑|α_i|. This is the LASSO (Least Absolute Shrinkage and Selection Operator):

min_α ½||x - Dα||₂² + λ||α||₁

The L1 norm is the tightest convex relaxation of L0. It promotes sparsity because the L1 ball has corners on the axes — the optimal solution tends to land at a corner where some coordinates are exactly zero.

Why L1 Gives Sparsity

Geometrically: the L1 ball is a diamond (in 2D) or cross-polytope. The level sets of the quadratic loss are ellipsoids. When an ellipsoid first touches the diamond, it typically hits a corner (a sparse point). The L2 ball is a sphere — it touches ellipsoids at arbitrary points, giving non-sparse solutions.

The LASSO guarantee (Candès, Tao, Donoho): If the dictionary satisfies the Restricted Isometry Property (RIP) and the true solution is s-sparse, LASSO recovers the exact support with high probability. The threshold: λ should be O(σ√(log n)) where σ is noise level.

Solving LASSO: ISTA

The Iterative Shrinkage-Thresholding Algorithm (ISTA) alternates gradient step + soft thresholding:

α^(t+1) = Shrink_λ/L(α^(t) + (1/L) D^T(x - Dα^(t)))

where L = largest eigenvalue of D^TD, and Shrink_τ(z) = sign(z)max(|z|-τ, 0).

python
def ista(x, D, lam, n_iter=500):
    """ISTA for LASSO: min 0.5*||x-D@alpha||^2 + lam*||alpha||_1"""
    m, n = D.shape
    L = np.linalg.norm(D.T @ D, ord=2)  # Lipschitz constant
    alpha = np.zeros(n)

    for _ in range(n_iter):
        # Gradient step
        grad = D.T @ (D @ alpha - x)
        z = alpha - grad / L

        # Soft thresholding (proximal of L1)
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam/L, 0)

    return alpha

L1 vs L2 Geometry

The loss ellipse touches the L1 diamond at a corner (sparse solution). With L2, it touches the circle at a generic point (dense). Drag λ to see how the constraint ball changes.

λ (regularization)1.0

Why does L1 regularization produce sparse solutions while L2 does not?

The L1 ball has corners on the axes, so the optimal point tends to have zero coordinates L1 is computationally simpler L2 only works for overdetermined systems

Chapter 5: K-SVD — Learning the Dictionary

So far we assumed the dictionary D is given. But the real power comes from learning D from data. The K-SVD algorithm (Aharon, Elad, Bruckstein, 2006) alternates between sparse coding and dictionary update:

min_{D, Α} ||X - DΑ||_F² s.t. ||α_i||₀ ≤ s ∀i

where X = [x₁ | ... | x_N] are training signals and Α = [α₁ | ... | α_N] are their sparse codes.

The Two-Stage Algorithm

Stage 1: Sparse Coding

Fix D, solve for Α using OMP (per signal)

↓

Stage 2: Dictionary Update

Fix Α, update each atom d_k of D one at a time

↻ repeat 50-100 times

Atom Update (the "SVD" in K-SVD)

To update atom k, collect all signals that use atom k (call this set ω_k). The error matrix restricted to these signals, with all other atoms' contributions removed, is:

E_k = X_ωk - ∑_j≠k d_j α_j^T

We want d_k and the k-th row of Α to best approximate E_k. This is a rank-1 approximation problem — solved by the largest singular vector of E_k:

E_k ≈ σ₁ · u₁ · v₁^T → d_k = u₁, α_k^T = σ₁ v₁^T

python
def ksvd(X, n_atoms, sparsity, n_iter=50):
    """K-SVD dictionary learning."""
    m, N = X.shape
    # Initialize dictionary (random normalized columns)
    D = np.random.randn(m, n_atoms)
    D /= np.linalg.norm(D, axis=0)

    for _ in range(n_iter):
        # Stage 1: Sparse code each signal with OMP
        Alpha = np.zeros((n_atoms, N))
        for i in range(N):
            Alpha[:, i], _ = omp(X[:, i], D, n_atoms=sparsity)

        # Stage 2: Update each atom
        for k in range(n_atoms):
            # Signals using atom k
            omega = np.where(Alpha[k, :] != 0)[0]
            if len(omega) == 0:
                continue

            # Error without atom k
            E = X[:, omega] - D @ Alpha[:, omega] + \
                np.outer(D[:, k], Alpha[k, omega])

            # Rank-1 SVD
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            Alpha[k, omega] = s[0] * Vt[0, :]

    return D, Alpha

Why K-SVD works well: Each atom update is the globally optimal rank-1 approximation to the error matrix. Unlike gradient-based methods, it doesn't get stuck in shallow local minima for the dictionary. The OMP step guarantees exact sparsity. Together, they converge quickly to good dictionaries.

K-SVD Learning Progress

Watch dictionary atoms emerge from random initialization as K-SVD iterates. Each column is one atom.

Iteration: 0

In K-SVD, how is each dictionary atom updated?

Gradient descent on the reconstruction loss Rank-1 SVD of the error matrix restricted to signals using that atom Random perturbation followed by projection

Chapter 6: Showcase — Interactive K-SVD

Watch K-SVD learn a dictionary from synthetic signal patches. You control the sparsity level, number of atoms, and noise. The dictionary starts random and converges to meaningful structure.

This is the full interactive experience. Generate training data (synthetic patches with known structure), run K-SVD, and see the learned atoms. Compare with the ground-truth dictionary that generated the data.

K-SVD Dictionary Learning

Top: ground-truth atoms used to generate data. Middle: current learned dictionary. Bottom: reconstruction error over iterations. Watch the learned atoms converge toward ground truth.

Sparsity (atoms per signal)2

Dictionary size6

Noise level0.10

Iter: 0 | Error: —

Things to Try

Increase noise to 0.4: The atoms become noisier but still emerge. K-SVD is robust to moderate noise.
Set sparsity to 1: Each signal uses only one atom. K-SVD converges fastest here (essentially k-means on signal directions).
Set dictionary larger than needed: Extra atoms become "dead" (rarely used). K-SVD naturally handles overparameterization.

Chapter 7: Beyond — Connections

Dictionary learning sits at the crossroads of signal processing, machine learning, and neuroscience.

Method	Dictionary	Sparsity Mechanism	Best For
K-SVD	Learned, overcomplete	OMP (greedy, exact L0)	Image denoising, inpainting
LASSO / BP	Fixed or learned	L1 relaxation (convex)	Compressed sensing, regression
Online DL	Learned, streaming	LARS / coordinate descent	Large-scale (Mairal et al.)
Sparse AE	Neural network encoder	L1 penalty on activations	Feature learning, mechanistic interp.
NMF	Non-negative	Non-negativity as implicit sparsity	Audio, text, images (non-neg data)

Dictionary learning in modern ML: Sparse autoencoders (used in mechanistic interpretability of LLMs) are essentially neural dictionary learning. The encoder learns a dictionary of "features" and the code is forced sparse with an L1 penalty. When researchers probe what neurons in GPT mean, they're doing dictionary learning on activations.

The Neuroscience Connection

Olshausen & Field (1996) showed that sparse coding on natural image patches produces atoms that look like oriented Gabor filters — exactly what V1 neurons in the visual cortex compute. The brain may literally be doing dictionary learning. Sparsity isn't just a computational convenience; it may be a fundamental principle of biological representation.

Related Lessons

Previous: NMF & Clustering — non-negative factorization as an alternative to signed sparse coding.

Previous: Autoencoders & Robust PCA — linear AE = PCA, the dense counterpart to dictionary learning.

What is the modern ML analog of dictionary learning?

Sparse autoencoders (learned dictionary with L1-penalized codes, used in mechanistic interpretability) Batch normalization Dropout regularization