Stanford EE269 — Mert Pilanci

NMF & Clustering

Factor a matrix into non-negative parts — spectral bases, source activations, and the surprising connection to k-means.

Prerequisites: Matrix multiplication + Basic optimization. That's it.
8
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: Why Non-Negative?

You're looking at a spectrogram of a recording — two people talking at the same time. Each pixel represents energy at a frequency and time. Energy is always non-negative (you can't have -3 decibels of sound in a linear scale). Can you separate the two voices?

PCA won't work. PCA finds orthogonal components that can be negative — you'd get "anti-sounds" that subtract from the mix. But sound energy doesn't subtract. We need a decomposition where everything stays non-negative: the parts, the weights, the reconstruction.

Nonnegative Matrix Factorization (NMF) factors a non-negative matrix V into two non-negative matrices W and H: V ≈ WH, with V, W, H ≥ 0. The columns of W are "parts" (spectral bases), the rows of H are "activations" (when each part is active). For audio: W = what each source sounds like, H = when each source is speaking.

The core idea: When your data is inherently non-negative (pixel intensities, word counts, spectral magnitudes, chemical concentrations), NMF gives you interpretable parts-based decomposition. PCA gives you directions. NMF gives you ingredients.
Parts vs Directions

PCA components (left) can be negative — hard to interpret as "parts." NMF components (right) are non-negative — they look like actual spectral shapes.

Why can't PCA separate audio sources from a spectrogram?

Chapter 1: The Factorization V ≈ WH

Given a non-negative matrix V of size m×n (m frequency bins, n time frames), we want:

V ≈ W · H,   where W ∈ ℝm×k≥0,   H ∈ ℝk×n≥0

Here k is the number of components (sources). Each column of W is a spectral basis — the frequency fingerprint of one source. Each row of H is an activation vector — the temporal pattern of that source.

What Each Dimension Means

MatrixShapeInterpretation (audio)
V[m × n]Spectrogram: m freq bins, n time frames
W[m × k]k spectral templates (what each source sounds like)
H[k × n]k activation patterns (when each source is active)

The reconstruction of one column (one time frame) is:

vj ≈ W · hj = ∑i=1k wi · hij

Each time frame is a non-negative mixture of the k spectral bases, weighted by the activations. This is exactly source separation: each source contributes its spectral shape scaled by its current loudness.

Key insight: Because everything is non-negative, the only way to represent data is by adding parts together. There's no cancellation. This means the parts W must correspond to real, physically meaningful components. NMF is forced to discover the actual sources, not arbitrary linear combinations.
V = W · H Visualized

A 6×8 matrix V factored into W (6×k) and H (k×8). Adjust k to see how more components improve the approximation.

Components k2
In NMF of a spectrogram, what do the columns of W represent?

Chapter 2: Multiplicative Update Rules

How do we actually compute W and H? We can't just use SVD (it gives negative values). Instead, Lee and Seung (1999) gave beautifully simple multiplicative update rules that guarantee non-negativity and monotonic decrease of the cost function.

The Cost Function

We minimize the squared Frobenius norm of the reconstruction error:

minW,H≥0 ||V - WH||F²

The Updates

H ← H ⊙ (WTV) / (WTWH)
W ← W ⊙ (VHT) / (WHHT)

The ⊙ is element-wise multiplication, and division is also element-wise. These rules are derived from the gradient, but structured so that:

  1. Non-negativity is preserved: if W, H start non-negative and V is non-negative, the updates stay non-negative (we're multiplying by non-negative ratios).
  2. Cost monotonically decreases: each update is guaranteed to not increase the objective.
  3. Fixed points are stationary points: when Hij(WTV)ij = Hij(WTWH)ij for all i,j, we've reached a (local) minimum.

Derivation Sketch

Take the gradient of ||V-WH||² with respect to H:

H = -2WTV + 2WTWH = -2WT(V - WH)

Standard gradient descent would be H ← H - η∇H. Lee & Seung choose the step size element-wise as ηij = Hij / (WTWH)ij. Substituting gives the multiplicative rule. The non-negativity of V, W, H ensures the ratio is always non-negative.

python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-10):
    """Multiplicative update NMF. V >= 0, shape (m, n)."""
    m, n = V.shape
    W = np.random.rand(m, k) + eps
    H = np.random.rand(k, n) + eps

    for _ in range(n_iter):
        # Update H
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # Update W
        W *= (V @ H.T) / (W @ H @ H.T + eps)

    return W, H
Why multiplicative? Additive updates (gradient descent) require careful step-size tuning and projection back to non-negative values. Multiplicative updates automatically stay non-negative and need no learning rate. The price: convergence can be slow near zero values.
Watch NMF Converge

Click "Step" to run one iteration of multiplicative updates. Watch the reconstruction error drop and W, H converge.

Iter: 0 | Error: —
Why do multiplicative updates preserve non-negativity?

Chapter 3: Audio Source Separation

The killer application of NMF is audio source separation. Given a recording of multiple sounds mixed together, NMF can pull them apart — without knowing what the sources are in advance.

The Pipeline

Mixed audio x(t)
Sum of sources in time domain
↓ STFT
Magnitude spectrogram V = |STFT|
[freq_bins × time_frames], all ≥ 0
↓ NMF with k sources
W (spectral bases), H (activations)
W: [freq × k], H: [k × time]
↓ Mask: Mi = wihi / (WH)
Separated spectrograms
Vi = Mi ⊙ V for each source i
↓ Inverse STFT
Separated audio signals
Individual sources in time domain

Why It Works

Each source has a distinctive spectral shape (its column in W). A violin has strong harmonics at specific frequencies. A drum has broadband energy in short bursts. NMF discovers these shapes from the mixture alone, because:

Wiener filter masking: After NMF gives us W and H, we create a soft mask for source i: Mi[f,t] = (wi[f] · hi[t]) / ∑j(wj[f] · hj[t]). This mask says "what fraction of the energy at this time-frequency point belongs to source i?" Multiply by the original spectrogram to isolate each source.
Simulated Source Separation

A mixture of 2 spectral sources (top). NMF finds individual bases W (bottom). Colors show which source dominates each time-frequency cell.

Source 1 freq3
Source 2 freq7
In NMF-based source separation, what does the soft mask M_i represent?

Chapter 4: Deep NMF

What if we stack multiple NMF layers? Instead of V ≈ WH, we factorize:

V ≈ WL · WL-1 · ... · W1 · H

Each layer provides a finer decomposition. The first layer captures broad structure, deeper layers capture finer details. This is Deep NMF — and it connects NMF to deep learning.

Why Go Deep?

A single-layer NMF with k components finds the best rank-k non-negative approximation. But what if the true structure is hierarchical? A face is composed of eyes, nose, mouth (layer 1). Each eye is composed of iris, pupil, eyelid (layer 2). Deep NMF captures this hierarchy naturally.

Implicit Regularization

Even without explicit regularization, depth acts as an implicit constraint. A deep factorization V ≈ W2W1H with the same total rank has a different optimization landscape than shallow V ≈ WH. Deeper factorizations tend to find sparser, more structured solutions — a phenomenon also observed in deep linear networks.

Think of it this way: Shallow NMF gives you "ingredients." Deep NMF gives you "ingredients of ingredients" — a hierarchy. Just as a deep neural network builds complex features from simple ones, deep NMF builds complex spectral patterns from elementary ones.
Shallow vs Deep NMF

Compare reconstruction quality: 1 layer (rank k) vs 2 layers (each rank k). Deep NMF often finds sparser components.

Depth (layers)1
Rank per layer3
What advantage does Deep NMF provide over single-layer NMF?

Chapter 5: k-means as NMF

Here's a surprising connection: k-means clustering is a special case of NMF. When we constrain H to have exactly one non-zero entry per column (a one-hot encoding), NMF reduces to k-means.

The Equivalence

In k-means, we assign each data point xj to cluster ci and represent it by the centroid μi. In matrix form:

V ≈ WH,   where Hij ∈ {0,1},   ∑iHij = 1

The columns of W are the cluster centroids. H is the assignment matrix (one-hot per column). The reconstruction VH = centroids of assigned clusters.

k-meansNMF view
Cluster centroids μ1,...,μkColumns of W
Cluster assignment of xjColumn j of H (one-hot)
Objective: ∑ ||xj - μc(j)||²||V - WH||F² with one-hot H

Standard NMF relaxes the one-hot constraint: H can have multiple non-zero entries per column. This means each data point can be a mixture of clusters rather than assigned to exactly one. NMF is "soft clustering."

The spectrum of constraints:
• k-means: H is binary one-hot (hard assignment)
• NMF: H ≥ 0 (soft mixture, non-negative weights)
• PCA: H unrestricted (can subtract components)
Going from k-means to NMF to PCA relaxes constraints progressively.
k-means vs NMF Clustering

Left: k-means (hard assignment). Right: NMF (soft mixing). Points are colored by their cluster membership weights.

k (clusters)3
What constraint on H makes NMF equivalent to k-means?

Chapter 6: Showcase — Interactive Source Separation

The moment of truth. You'll mix 2-3 synthetic spectral sources with different frequencies and temporal patterns. Then watch NMF separate them in real time.

Mix sources using the sliders below. Each source has a characteristic frequency band and temporal pattern. NMF will try to recover the individual sources from the mixture. See how clean the separation is — and what happens when sources overlap in frequency.
NMF Source Separation Demo

Top: mixed spectrogram. Middle: recovered spectral bases W. Bottom: activation patterns H. Adjust source parameters and watch separation quality.

Source 1 amplitude1.0
Source 2 amplitude1.0
Source 3 amplitude0.7
NMF iterations100
Reconstruction error: —

What Makes Separation Hard?

Try making Source 1 and Source 2 have similar frequencies (both around 3-4). The separation degrades because NMF can't distinguish sources with identical spectral shapes. Separation works when sources have distinctive frequency fingerprints.

Chapter 7: Beyond — Connections

NMF sits at the intersection of matrix factorization, source separation, topic modeling, and deep learning.

MethodConstraintBest For
PCA / SVDOrthogonalityVariance maximization, general dim-reduction
NMFNon-negativityParts-based, interpretable (audio, images, text)
Sparse NMFNon-neg + sparsity on HFewer active sources per frame
Dictionary LearningSparsity on codesOvercomplete, any-sign dictionaries
k-meansOne-hot HHard clustering
LDA (topic model)Probabilistic (Dirichlet)Document-topic decomposition
NMF in modern ML: NMF ideas appear everywhere — attention mechanisms (non-negative softmax weights), neural audio codecs (VQ as discretized NMF), topic models (LDA is Bayesian NMF), and recommender systems (user-item matrices are non-negative).

Related Lessons

Previous: Autoencoders & Robust PCA — linear AE = PCA, the non-negative cousin of NMF.

Next: Dictionary Learning & Sparse Coding — what happens when you allow negative values but enforce sparsity.

What is the key difference between NMF and PCA?