Learn an overcomplete dictionary where signals have sparse representations — fewer atoms, more expressive power.
You have a signal — an image patch, an audio frame, a sensor reading. You want to represent it compactly. PCA gives you a fixed basis (the eigenvectors of the data covariance). But PCA uses all components with dense coefficients. Every signal is a mix of everything.
What if, instead, each signal used only 3-5 elements from a much larger collection of templates? A face patch might be: 0.8 × "nose_left" + 0.3 × "shadow_diagonal" + 0.5 × "skin_smooth". Most coefficients are zero. This is sparse coding: represent signals using few atoms from a learned dictionary.
The dictionary D has more columns than rows (it's overcomplete). This means there are many ways to represent each signal. But we insist on the sparsest one — the fewest atoms. This constraint makes the representation unique, interpretable, and compressible.
Left: PCA uses all components (dense α). Right: dictionary uses few atoms (sparse α). Both reconstruct the signal, but sparse is more interpretable.
Formally, given a signal x ∈ ℝm and a dictionary D ∈ ℝm×n (with n > m, overcomplete), find the sparsest coefficient vector α such that x = Dα:
The L0 "norm" ||α||0 counts the number of non-zero entries. We want the representation with fewest active atoms.
If D has exactly m columns (a square basis like the DFT or wavelets), there's only one representation. With n > m columns, there are infinitely many solutions to x = Dα. The sparsity constraint selects the "simplest" one.
| Property | Orthogonal basis (n=m) | Overcomplete dictionary (n>m) |
|---|---|---|
| Columns | m (exactly spans ℝm) | n > m (redundant) |
| Representation | Unique: α = D-1x | Infinitely many solutions |
| Sparsity | Fixed (data-dependent) | Can be much sparser (dictionary adapts) |
| Adaptivity | Pre-defined (DCT, wavelet) | Learned from data |
The L0 minimization is NP-hard in general. We need algorithms that find (approximately) sparse solutions efficiently. The next chapters cover the two main approaches: greedy pursuit (Matching Pursuit, OMP) and convex relaxation (LASSO/Basis Pursuit).
Each column of D is an "atom" (shown as a bar pattern). A signal is a sparse combination of atoms. Click atoms to toggle them on/off.
Matching Pursuit (MP) is the simplest sparse coding algorithm. It's greedy: at each step, find the atom most correlated with the current residual, subtract its contribution, repeat.
python def matching_pursuit(x, D, n_atoms=5, tol=1e-4): """Greedy atom selection. D: [m, n], x: [m].""" r = x.copy() # residual alpha = np.zeros(D.shape[1]) selected = [] for _ in range(n_atoms): # Find most correlated atom corrs = D.T @ r # [n] inner products k = np.argmax(np.abs(corrs)) selected.append(k) # Update alpha[k] += corrs[k] r = r - corrs[k] * D[:, k] if np.linalg.norm(r) < tol: break return alpha, selected
Watch MP select atoms one at a time. The residual shrinks with each step. The selected atom is highlighted.
Orthogonal Matching Pursuit (OMP) improves on MP with one key change: after selecting each new atom, it re-solves the least-squares problem over ALL selected atoms. This ensures the residual is orthogonal to the span of selected atoms — hence "orthogonal."
MP: Only updates the new atom's coefficient. Old coefficients stay fixed. Residual is orthogonal to only the latest atom.
OMP: Re-solves all coefficients jointly (least-squares). Residual is orthogonal to the entire span of selected atoms.
This is just the least-squares projection of x onto the columns of D indexed by S. The residual is r = x - DSαS, which is guaranteed orthogonal to span(DS).
python def omp(x, D, n_atoms=5, tol=1e-4): """OMP: re-solve LS after each atom selection.""" r = x.copy() support = [] alpha = np.zeros(D.shape[1]) for _ in range(n_atoms): # Select: same as MP corrs = D.T @ r k = np.argmax(np.abs(corrs)) support.append(k) # KEY DIFFERENCE: re-solve LS over entire support Ds = D[:, support] alpha_s = np.linalg.lstsq(Ds, x, rcond=None)[0] for i, idx in enumerate(support): alpha[idx] = alpha_s[i] # Residual orthogonal to span(Ds) r = x - Ds @ alpha_s if np.linalg.norm(r) < tol: break return alpha, support
Both algorithms run on the same signal. Compare how quickly the residual drops. OMP (teal) typically converges in fewer steps than MP (orange).
The L0 problem is combinatorial. Greedy methods (MP, OMP) are fast but may not find the globally sparsest solution. Is there a convex relaxation that's both solvable and provably correct?
Yes: replace ||α||0 with ||α||1 = ∑|αi|. This is the LASSO (Least Absolute Shrinkage and Selection Operator):
The L1 norm is the tightest convex relaxation of L0. It promotes sparsity because the L1 ball has corners on the axes — the optimal solution tends to land at a corner where some coordinates are exactly zero.
Geometrically: the L1 ball is a diamond (in 2D) or cross-polytope. The level sets of the quadratic loss are ellipsoids. When an ellipsoid first touches the diamond, it typically hits a corner (a sparse point). The L2 ball is a sphere — it touches ellipsoids at arbitrary points, giving non-sparse solutions.
The Iterative Shrinkage-Thresholding Algorithm (ISTA) alternates gradient step + soft thresholding:
where L = largest eigenvalue of DTD, and Shrinkτ(z) = sign(z)max(|z|-τ, 0).
python def ista(x, D, lam, n_iter=500): """ISTA for LASSO: min 0.5*||x-D@alpha||^2 + lam*||alpha||_1""" m, n = D.shape L = np.linalg.norm(D.T @ D, ord=2) # Lipschitz constant alpha = np.zeros(n) for _ in range(n_iter): # Gradient step grad = D.T @ (D @ alpha - x) z = alpha - grad / L # Soft thresholding (proximal of L1) alpha = np.sign(z) * np.maximum(np.abs(z) - lam/L, 0) return alpha
The loss ellipse touches the L1 diamond at a corner (sparse solution). With L2, it touches the circle at a generic point (dense). Drag λ to see how the constraint ball changes.
So far we assumed the dictionary D is given. But the real power comes from learning D from data. The K-SVD algorithm (Aharon, Elad, Bruckstein, 2006) alternates between sparse coding and dictionary update:
where X = [x1 | ... | xN] are training signals and Α = [α1 | ... | αN] are their sparse codes.
To update atom k, collect all signals that use atom k (call this set ωk). The error matrix restricted to these signals, with all other atoms' contributions removed, is:
We want dk and the k-th row of Α to best approximate Ek. This is a rank-1 approximation problem — solved by the largest singular vector of Ek:
python def ksvd(X, n_atoms, sparsity, n_iter=50): """K-SVD dictionary learning.""" m, N = X.shape # Initialize dictionary (random normalized columns) D = np.random.randn(m, n_atoms) D /= np.linalg.norm(D, axis=0) for _ in range(n_iter): # Stage 1: Sparse code each signal with OMP Alpha = np.zeros((n_atoms, N)) for i in range(N): Alpha[:, i], _ = omp(X[:, i], D, n_atoms=sparsity) # Stage 2: Update each atom for k in range(n_atoms): # Signals using atom k omega = np.where(Alpha[k, :] != 0)[0] if len(omega) == 0: continue # Error without atom k E = X[:, omega] - D @ Alpha[:, omega] + \ np.outer(D[:, k], Alpha[k, omega]) # Rank-1 SVD U, s, Vt = np.linalg.svd(E, full_matrices=False) D[:, k] = U[:, 0] Alpha[k, omega] = s[0] * Vt[0, :] return D, Alpha
Watch dictionary atoms emerge from random initialization as K-SVD iterates. Each column is one atom.
Watch K-SVD learn a dictionary from synthetic signal patches. You control the sparsity level, number of atoms, and noise. The dictionary starts random and converges to meaningful structure.
Top: ground-truth atoms used to generate data. Middle: current learned dictionary. Bottom: reconstruction error over iterations. Watch the learned atoms converge toward ground truth.
Dictionary learning sits at the crossroads of signal processing, machine learning, and neuroscience.
| Method | Dictionary | Sparsity Mechanism | Best For |
|---|---|---|---|
| K-SVD | Learned, overcomplete | OMP (greedy, exact L0) | Image denoising, inpainting |
| LASSO / BP | Fixed or learned | L1 relaxation (convex) | Compressed sensing, regression |
| Online DL | Learned, streaming | LARS / coordinate descent | Large-scale (Mairal et al.) |
| Sparse AE | Neural network encoder | L1 penalty on activations | Feature learning, mechanistic interp. |
| NMF | Non-negative | Non-negativity as implicit sparsity | Audio, text, images (non-neg data) |
Olshausen & Field (1996) showed that sparse coding on natural image patches produces atoms that look like oriented Gabor filters — exactly what V1 neurons in the visual cortex compute. The brain may literally be doing dictionary learning. Sparsity isn't just a computational convenience; it may be a fundamental principle of biological representation.
Previous: NMF & Clustering — non-negative factorization as an alternative to signed sparse coding.
Previous: Autoencoders & Robust PCA — linear AE = PCA, the dense counterpart to dictionary learning.