EE269 NMF — Nonnegative Matrix Factorization & Clustering

Chapter 0: Why Non-Negative?

You're looking at a spectrogram of a recording — two people talking at the same time. Each pixel represents energy at a frequency and time. Energy is always non-negative (you can't have -3 decibels of sound in a linear scale). Can you separate the two voices?

PCA won't work. PCA finds orthogonal components that can be negative — you'd get "anti-sounds" that subtract from the mix. But sound energy doesn't subtract. We need a decomposition where everything stays non-negative: the parts, the weights, the reconstruction.

Nonnegative Matrix Factorization (NMF) factors a non-negative matrix V into two non-negative matrices W and H: V ≈ WH, with V, W, H ≥ 0. The columns of W are "parts" (spectral bases), the rows of H are "activations" (when each part is active). For audio: W = what each source sounds like, H = when each source is speaking.

The core idea: When your data is inherently non-negative (pixel intensities, word counts, spectral magnitudes, chemical concentrations), NMF gives you interpretable parts-based decomposition. PCA gives you directions. NMF gives you ingredients.

Parts vs Directions

PCA components (left) can be negative — hard to interpret as "parts." NMF components (right) are non-negative — they look like actual spectral shapes.

Why can't PCA separate audio sources from a spectrogram?

PCA is too slow for audio PCA components can be negative, but spectrogram energy is always non-negative PCA requires labeled training data

Chapter 1: The Factorization V ≈ WH

Given a non-negative matrix V of size m×n (m frequency bins, n time frames), we want:

V ≈ W · H, where W ∈ ℝ^m×k_≥0, H ∈ ℝ^k×n_≥0

Here k is the number of components (sources). Each column of W is a spectral basis — the frequency fingerprint of one source. Each row of H is an activation vector — the temporal pattern of that source.

What Each Dimension Means

Matrix	Shape	Interpretation (audio)
V	[m × n]	Spectrogram: m freq bins, n time frames
W	[m × k]	k spectral templates (what each source sounds like)
H	[k × n]	k activation patterns (when each source is active)

The reconstruction of one column (one time frame) is:

v_j ≈ W · h_j = ∑_i=1^k w_i · h_ij

Each time frame is a non-negative mixture of the k spectral bases, weighted by the activations. This is exactly source separation: each source contributes its spectral shape scaled by its current loudness.

Key insight: Because everything is non-negative, the only way to represent data is by adding parts together. There's no cancellation. This means the parts W must correspond to real, physically meaningful components. NMF is forced to discover the actual sources, not arbitrary linear combinations.

V = W · H Visualized

A 6×8 matrix V factored into W (6×k) and H (k×8). Adjust k to see how more components improve the approximation.

Components k2

In NMF of a spectrogram, what do the columns of W represent?

Spectral templates (the frequency fingerprint of each source) Time-domain waveforms of each source Classification labels for each frame

Chapter 2: Multiplicative Update Rules

How do we actually compute W and H? We can't just use SVD (it gives negative values). Instead, Lee and Seung (1999) gave beautifully simple multiplicative update rules that guarantee non-negativity and monotonic decrease of the cost function.

The Cost Function

We minimize the squared Frobenius norm of the reconstruction error:

min_W,H≥0 ||V - WH||_F²

The Updates

H ← H ⊙ (W^TV) / (W^TWH)

W ← W ⊙ (VH^T) / (WHH^T)

The ⊙ is element-wise multiplication, and division is also element-wise. These rules are derived from the gradient, but structured so that:

Non-negativity is preserved: if W, H start non-negative and V is non-negative, the updates stay non-negative (we're multiplying by non-negative ratios).
Cost monotonically decreases: each update is guaranteed to not increase the objective.
Fixed points are stationary points: when H_ij(W^TV)_ij = H_ij(W^TWH)_ij for all i,j, we've reached a (local) minimum.

Derivation Sketch

Take the gradient of ||V-WH||² with respect to H:

∇_H = -2W^TV + 2W^TWH = -2W^T(V - WH)

Standard gradient descent would be H ← H - η∇_H. Lee & Seung choose the step size element-wise as η_ij = H_ij / (W^TWH)_ij. Substituting gives the multiplicative rule. The non-negativity of V, W, H ensures the ratio is always non-negative.

python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-10):
    """Multiplicative update NMF. V >= 0, shape (m, n)."""
    m, n = V.shape
    W = np.random.rand(m, k) + eps
    H = np.random.rand(k, n) + eps

    for _ in range(n_iter):
        # Update H
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # Update W
        W *= (V @ H.T) / (W @ H @ H.T + eps)

    return W, H

Why multiplicative? Additive updates (gradient descent) require careful step-size tuning and projection back to non-negative values. Multiplicative updates automatically stay non-negative and need no learning rate. The price: convergence can be slow near zero values.

Watch NMF Converge

Click "Step" to run one iteration of multiplicative updates. Watch the reconstruction error drop and W, H converge.

Iter: 0 | Error: —

Why do multiplicative updates preserve non-negativity?

They clip negative values after each step They multiply by non-negative ratios, so positive values stay positive They use absolute values of the gradient

Chapter 3: Audio Source Separation

The killer application of NMF is audio source separation. Given a recording of multiple sounds mixed together, NMF can pull them apart — without knowing what the sources are in advance.

The Pipeline

Mixed audio x(t)

Sum of sources in time domain

↓ STFT

Magnitude spectrogram V = |STFT|

[freq_bins × time_frames], all ≥ 0

↓ NMF with k sources

W (spectral bases), H (activations)

W: [freq × k], H: [k × time]

↓ Mask: M_i = w_ih_i / (WH)

Separated spectrograms

V_i = M_i ⊙ V for each source i

↓ Inverse STFT

Separated audio signals

Individual sources in time domain

Why It Works

Each source has a distinctive spectral shape (its column in W). A violin has strong harmonics at specific frequencies. A drum has broadband energy in short bursts. NMF discovers these shapes from the mixture alone, because:

Different sources activate at different times (H is approximately sparse)
Different sources have different spectral shapes (W columns are distinct)
The non-negativity constraint prevents destructive interference artifacts

Wiener filter masking: After NMF gives us W and H, we create a soft mask for source i: M_i[f,t] = (w_i[f] · h_i[t]) / ∑_j(w_j[f] · h_j[t]). This mask says "what fraction of the energy at this time-frequency point belongs to source i?" Multiply by the original spectrogram to isolate each source.

Simulated Source Separation

A mixture of 2 spectral sources (top). NMF finds individual bases W (bottom). Colors show which source dominates each time-frequency cell.

Source 1 freq3

Source 2 freq7

In NMF-based source separation, what does the soft mask M_i represent?

The fraction of energy at each time-frequency point attributed to source i A binary decision of which source owns each pixel The phase angle for reconstruction

Chapter 4: Deep NMF

What if we stack multiple NMF layers? Instead of V ≈ WH, we factorize:

V ≈ W_L · W_L-1 · ... · W₁ · H

Each layer provides a finer decomposition. The first layer captures broad structure, deeper layers capture finer details. This is Deep NMF — and it connects NMF to deep learning.

Why Go Deep?

A single-layer NMF with k components finds the best rank-k non-negative approximation. But what if the true structure is hierarchical? A face is composed of eyes, nose, mouth (layer 1). Each eye is composed of iris, pupil, eyelid (layer 2). Deep NMF captures this hierarchy naturally.

Implicit Regularization

Even without explicit regularization, depth acts as an implicit constraint. A deep factorization V ≈ W₂W₁H with the same total rank has a different optimization landscape than shallow V ≈ WH. Deeper factorizations tend to find sparser, more structured solutions — a phenomenon also observed in deep linear networks.

Think of it this way: Shallow NMF gives you "ingredients." Deep NMF gives you "ingredients of ingredients" — a hierarchy. Just as a deep neural network builds complex features from simple ones, deep NMF builds complex spectral patterns from elementary ones.

Shallow vs Deep NMF

Compare reconstruction quality: 1 layer (rank k) vs 2 layers (each rank k). Deep NMF often finds sparser components.

Depth (layers)1

Rank per layer3

What advantage does Deep NMF provide over single-layer NMF?

Hierarchical decomposition: parts of parts, with implicit regularization toward sparser solutions Guaranteed faster convergence It can handle negative data

Chapter 5: k-means as NMF

Here's a surprising connection: k-means clustering is a special case of NMF. When we constrain H to have exactly one non-zero entry per column (a one-hot encoding), NMF reduces to k-means.

The Equivalence

In k-means, we assign each data point x_j to cluster c_i and represent it by the centroid μ_i. In matrix form:

V ≈ WH, where H_ij ∈ {0,1}, ∑_iH_ij = 1

The columns of W are the cluster centroids. H is the assignment matrix (one-hot per column). The reconstruction VH = centroids of assigned clusters.

k-means	NMF view
Cluster centroids μ₁,...,μ_k	Columns of W
Cluster assignment of x_j	Column j of H (one-hot)
Objective: ∑ \|\|x_j - μ_c(j)\|\|²	\|\|V - WH\|\|_F² with one-hot H

Standard NMF relaxes the one-hot constraint: H can have multiple non-zero entries per column. This means each data point can be a mixture of clusters rather than assigned to exactly one. NMF is "soft clustering."

The spectrum of constraints:
• k-means: H is binary one-hot (hard assignment)
• NMF: H ≥ 0 (soft mixture, non-negative weights)
• PCA: H unrestricted (can subtract components)
Going from k-means to NMF to PCA relaxes constraints progressively.

k-means vs NMF Clustering

Left: k-means (hard assignment). Right: NMF (soft mixing). Points are colored by their cluster membership weights.

k (clusters)3

What constraint on H makes NMF equivalent to k-means?

H must be orthogonal Each column of H must be one-hot (exactly one non-zero entry) H must be symmetric

Chapter 6: Showcase — Interactive Source Separation

The moment of truth. You'll mix 2-3 synthetic spectral sources with different frequencies and temporal patterns. Then watch NMF separate them in real time.

Mix sources using the sliders below. Each source has a characteristic frequency band and temporal pattern. NMF will try to recover the individual sources from the mixture. See how clean the separation is — and what happens when sources overlap in frequency.

NMF Source Separation Demo

Top: mixed spectrogram. Middle: recovered spectral bases W. Bottom: activation patterns H. Adjust source parameters and watch separation quality.

Source 1 amplitude1.0

Source 2 amplitude1.0

Source 3 amplitude0.7

NMF iterations100

Reconstruction error: —

What Makes Separation Hard?

Try making Source 1 and Source 2 have similar frequencies (both around 3-4). The separation degrades because NMF can't distinguish sources with identical spectral shapes. Separation works when sources have distinctive frequency fingerprints.

Chapter 7: Beyond — Connections

NMF sits at the intersection of matrix factorization, source separation, topic modeling, and deep learning.

Method	Constraint	Best For
PCA / SVD	Orthogonality	Variance maximization, general dim-reduction
NMF	Non-negativity	Parts-based, interpretable (audio, images, text)
Sparse NMF	Non-neg + sparsity on H	Fewer active sources per frame
Dictionary Learning	Sparsity on codes	Overcomplete, any-sign dictionaries
k-means	One-hot H	Hard clustering
LDA (topic model)	Probabilistic (Dirichlet)	Document-topic decomposition

NMF in modern ML: NMF ideas appear everywhere — attention mechanisms (non-negative softmax weights), neural audio codecs (VQ as discretized NMF), topic models (LDA is Bayesian NMF), and recommender systems (user-item matrices are non-negative).

Related Lessons

Previous: Autoencoders & Robust PCA — linear AE = PCA, the non-negative cousin of NMF.

Next: Dictionary Learning & Sparse Coding — what happens when you allow negative values but enforce sparsity.

What is the key difference between NMF and PCA?

PCA is faster to compute NMF requires labeled data NMF constrains all factors to be non-negative, giving interpretable parts-based decomposition

NMF & Clustering