Factor a matrix into non-negative parts — spectral bases, source activations, and the surprising connection to k-means.
You're looking at a spectrogram of a recording — two people talking at the same time. Each pixel represents energy at a frequency and time. Energy is always non-negative (you can't have -3 decibels of sound in a linear scale). Can you separate the two voices?
PCA won't work. PCA finds orthogonal components that can be negative — you'd get "anti-sounds" that subtract from the mix. But sound energy doesn't subtract. We need a decomposition where everything stays non-negative: the parts, the weights, the reconstruction.
Nonnegative Matrix Factorization (NMF) factors a non-negative matrix V into two non-negative matrices W and H: V ≈ WH, with V, W, H ≥ 0. The columns of W are "parts" (spectral bases), the rows of H are "activations" (when each part is active). For audio: W = what each source sounds like, H = when each source is speaking.
PCA components (left) can be negative — hard to interpret as "parts." NMF components (right) are non-negative — they look like actual spectral shapes.
Given a non-negative matrix V of size m×n (m frequency bins, n time frames), we want:
Here k is the number of components (sources). Each column of W is a spectral basis — the frequency fingerprint of one source. Each row of H is an activation vector — the temporal pattern of that source.
| Matrix | Shape | Interpretation (audio) |
|---|---|---|
| V | [m × n] | Spectrogram: m freq bins, n time frames |
| W | [m × k] | k spectral templates (what each source sounds like) |
| H | [k × n] | k activation patterns (when each source is active) |
The reconstruction of one column (one time frame) is:
Each time frame is a non-negative mixture of the k spectral bases, weighted by the activations. This is exactly source separation: each source contributes its spectral shape scaled by its current loudness.
A 6×8 matrix V factored into W (6×k) and H (k×8). Adjust k to see how more components improve the approximation.
How do we actually compute W and H? We can't just use SVD (it gives negative values). Instead, Lee and Seung (1999) gave beautifully simple multiplicative update rules that guarantee non-negativity and monotonic decrease of the cost function.
We minimize the squared Frobenius norm of the reconstruction error:
The ⊙ is element-wise multiplication, and division is also element-wise. These rules are derived from the gradient, but structured so that:
Take the gradient of ||V-WH||² with respect to H:
Standard gradient descent would be H ← H - η∇H. Lee & Seung choose the step size element-wise as ηij = Hij / (WTWH)ij. Substituting gives the multiplicative rule. The non-negativity of V, W, H ensures the ratio is always non-negative.
python import numpy as np def nmf(V, k, n_iter=200, eps=1e-10): """Multiplicative update NMF. V >= 0, shape (m, n).""" m, n = V.shape W = np.random.rand(m, k) + eps H = np.random.rand(k, n) + eps for _ in range(n_iter): # Update H H *= (W.T @ V) / (W.T @ W @ H + eps) # Update W W *= (V @ H.T) / (W @ H @ H.T + eps) return W, H
Click "Step" to run one iteration of multiplicative updates. Watch the reconstruction error drop and W, H converge.
The killer application of NMF is audio source separation. Given a recording of multiple sounds mixed together, NMF can pull them apart — without knowing what the sources are in advance.
Each source has a distinctive spectral shape (its column in W). A violin has strong harmonics at specific frequencies. A drum has broadband energy in short bursts. NMF discovers these shapes from the mixture alone, because:
A mixture of 2 spectral sources (top). NMF finds individual bases W (bottom). Colors show which source dominates each time-frequency cell.
What if we stack multiple NMF layers? Instead of V ≈ WH, we factorize:
Each layer provides a finer decomposition. The first layer captures broad structure, deeper layers capture finer details. This is Deep NMF — and it connects NMF to deep learning.
A single-layer NMF with k components finds the best rank-k non-negative approximation. But what if the true structure is hierarchical? A face is composed of eyes, nose, mouth (layer 1). Each eye is composed of iris, pupil, eyelid (layer 2). Deep NMF captures this hierarchy naturally.
Even without explicit regularization, depth acts as an implicit constraint. A deep factorization V ≈ W2W1H with the same total rank has a different optimization landscape than shallow V ≈ WH. Deeper factorizations tend to find sparser, more structured solutions — a phenomenon also observed in deep linear networks.
Compare reconstruction quality: 1 layer (rank k) vs 2 layers (each rank k). Deep NMF often finds sparser components.
Here's a surprising connection: k-means clustering is a special case of NMF. When we constrain H to have exactly one non-zero entry per column (a one-hot encoding), NMF reduces to k-means.
In k-means, we assign each data point xj to cluster ci and represent it by the centroid μi. In matrix form:
The columns of W are the cluster centroids. H is the assignment matrix (one-hot per column). The reconstruction VH = centroids of assigned clusters.
| k-means | NMF view |
|---|---|
| Cluster centroids μ1,...,μk | Columns of W |
| Cluster assignment of xj | Column j of H (one-hot) |
| Objective: ∑ ||xj - μc(j)||² | ||V - WH||F² with one-hot H |
Standard NMF relaxes the one-hot constraint: H can have multiple non-zero entries per column. This means each data point can be a mixture of clusters rather than assigned to exactly one. NMF is "soft clustering."
Left: k-means (hard assignment). Right: NMF (soft mixing). Points are colored by their cluster membership weights.
The moment of truth. You'll mix 2-3 synthetic spectral sources with different frequencies and temporal patterns. Then watch NMF separate them in real time.
Top: mixed spectrogram. Middle: recovered spectral bases W. Bottom: activation patterns H. Adjust source parameters and watch separation quality.
Try making Source 1 and Source 2 have similar frequencies (both around 3-4). The separation degrades because NMF can't distinguish sources with identical spectral shapes. Separation works when sources have distinctive frequency fingerprints.
NMF sits at the intersection of matrix factorization, source separation, topic modeling, and deep learning.
| Method | Constraint | Best For |
|---|---|---|
| PCA / SVD | Orthogonality | Variance maximization, general dim-reduction |
| NMF | Non-negativity | Parts-based, interpretable (audio, images, text) |
| Sparse NMF | Non-neg + sparsity on H | Fewer active sources per frame |
| Dictionary Learning | Sparsity on codes | Overcomplete, any-sign dictionaries |
| k-means | One-hot H | Hard clustering |
| LDA (topic model) | Probabilistic (Dirichlet) | Document-topic decomposition |
Previous: Autoencoders & Robust PCA — linear AE = PCA, the non-negative cousin of NMF.
Next: Dictionary Learning & Sparse Coding — what happens when you allow negative values but enforce sparsity.