When the statistics change or you never had them — learn the optimal filter on the fly, one sample at a time.
You're on a phone call in a noisy coffee shop. Your phone receives a mix of your voice and background clatter. A second microphone picks up mostly the ambient noise. Can we subtract the noise from the voice signal?
We could design a fixed filter if we knew the exact noise characteristics — its power spectrum, how it correlates with the primary signal. But those statistics change constantly: someone starts the espresso machine, a door opens, music volume shifts. By the time we estimate the statistics and design the optimal filter, they've already changed.
The solution: an adaptive filter that learns the optimal weights on the fly, updating with each new sample. No batch statistics needed. No matrix inversion. Just a simple update rule that converges to the optimal solution — and tracks changes when the environment shifts.
A noisy signal (top) is cleaned by an adaptive filter that converges over time. Watch the error shrink.
Before going adaptive, let's find the optimal filter assuming we know the statistics. This is the Wiener filter — the gold standard that adaptive methods try to approximate.
We want to minimize the mean squared error:
Expand this:
Define the autocorrelation matrix and cross-correlation vector:
Then J(w) = σd2 − 2wTrud + wTRuw. Taking the gradient and setting to zero:
J(w) is a quadratic bowl in w-space. The minimum is at wopt, and the minimum MSE is:
The eigenvalues of Ru determine the shape of the bowl. If the eigenvalues are spread (high eigenvalue spread = ratio λmax/λmin), the bowl is elongated — a narrow valley. This makes gradient descent slow, as we'll see.
The contour plot shows the MSE as a function of two filter weights. The star is the Wiener solution.
Suppose u[n] is white noise with σu2 = 1 (so Ru = I2), and the cross-correlation is rud = [0.8, −0.3]T. Then:
The minimum MSE is Jmin = σd2 − rudTwopt. If σd2 = 1, then Jmin = 1 − (0.64 + 0.09) = 0.27. The filter removes 73% of the error power.
For correlated input (non-identity Ru), the contours become elliptical, and the condition number κ = λmax/λmin controls how elongated they are. Gradient descent spirals slowly through elongated valleys.
Instead of inverting Ru, we can slide down the MSE bowl using gradient descent. The gradient of J(w) is:
The steepest descent update:
where μ is the step size (learning rate). This still requires Ru and rud, but it avoids the matrix inverse. Each step costs O(M2) instead of O(M3).
Transform to the eigenvector basis of Ru: let v = QT(w − wopt), where Q diagonalizes Ru. Each component of v evolves independently:
where λk is the k-th eigenvalue of Ru. For convergence, we need |1 − 2μλk| < 1 for all k, which gives:
The convergence time constant for mode k is τk = 1/(2μλk). The slowest mode has τmax = 1/(2μλmin). But μ must stay below 1/λmax, so the worst case is:
where χ(Ru) = λmax/λmin is the condition number. High condition number → slow convergence. This motivates normalized variants of LMS.
Watch the weight vector spiral toward the optimum. High eigenvalue spread creates a slow zig-zag path.
Steepest descent requires the true gradient ∇J = 2Ruw − 2rud, which requires the ensemble statistics Ru and rud. Widrow and Hoff's insight (1960): replace the ensemble average with the instantaneous value.
The true gradient is E[2un(unTw − d[n])] = E[−2une[n]]. The instantaneous (single-sample) approximation is just −2une[n]. This gives the Least Mean Squares (LMS) algorithm:
LMS is stochastic gradient descent (SGD) on the MSE cost. The instantaneous cost at sample n is:
The gradient: ∇Jn = −2une[n]. So the update w[n+1] = w[n] − (μ/2)∇Jn = w[n] + μ·e[n]·un. (We absorb the factor of 2 into μ.)
The gradient estimate is noisy (based on a single sample) but unbiased: E[−2une[n]] = ∇J. Over many iterations, the noise averages out and the algorithm converges to the vicinity of wopt.
python import numpy as np def lms(u, d, M, mu): """LMS adaptive filter. u: input signal, d: desired signal M: filter length, mu: step size""" N = len(u) w = np.zeros(M) # init weights to zero e = np.zeros(N) # error signal y = np.zeros(N) # filter output for n in range(M, N): u_n = u[n:n-M:-1] # [u[n], u[n-1], ..., u[n-M+1]] y[n] = w @ u_n # filter output e[n] = d[n] - y[n] # error w = w + mu * e[n] * u_n # LMS update return y, e, w
M = 2 taps, μ = 0.1, w[0] = [0, 0]. Input: u = [1, 0.5, −0.3, 0.8, ...]. Desired: d = [0.8, −0.1, 0.6, ...].
n = 0: u0 = [1, 0], y[0] = [0,0]·[1,0] = 0, e[0] = 0.8 − 0 = 0.8, w[1] = [0,0] + 0.1·0.8·[1,0] = [0.08, 0].
n = 1: u1 = [0.5, 1], y[1] = [0.08, 0]·[0.5, 1] = 0.04, e[1] = −0.1 − 0.04 = −0.14, w[2] = [0.08, 0] + 0.1·(−0.14)·[0.5, 1] = [0.073, −0.014].
Each step only uses the current input and error. After hundreds of steps, w converges near the Wiener solution — without ever computing Ru or its inverse.
When does LMS converge? And to what? Since the gradient is noisy, LMS doesn't converge to wopt exactly — it wanders around it. The step size μ controls the tradeoff between convergence speed and steady-state noise.
Take expectations of the LMS update: E[w[n+1]] = E[w[n]] + μ E[une[n]]. Under certain assumptions (independence of w[n] and un, which is approximate), this reduces to the same recursion as steepest descent. The convergence condition is the same:
Since we usually don't know λmax, a practical bound uses the trace:
because λmax ≤ tr(Ru) = ∑λk = M · σu2 for white input with variance σu2.
At steady state, the MSE is higher than Jmin by an amount called the excess MSE:
The misadjustment M is the ratio of excess MSE to minimum MSE:
For 10% misadjustment with M = 10 taps and σu2 = 1: μ < 0.02.
Echo canceller with M = 64 taps, input is speech with σu2 ≈ 0.01. The stability bound is μ < 2/(64 · 0.01) = 3.125. For 5% misadjustment: μ < 2 · 0.05 / (64 · 0.01) = 0.156. A typical choice would be μ = 0.05 — well within stability and giving about 1.6% misadjustment.
The convergence time constant is approximately τ ≈ 1/(2μλmin). If the smallest eigenvalue of Ru is 0.001, then τ ≈ 1/(2 · 0.05 · 0.001) = 10,000 samples. At 8 kHz, that's 1.25 seconds to converge. Increasing μ speeds this up but increases steady-state noise.
Define the weight error vector c[n] = w[n] − wopt. The LMS update becomes:
where v[n] = d[n] − woptTun is the optimal error (noise floor). The first term drives c toward zero (convergence). The second term adds noise (the stochastic gradient is imperfect). At steady state, these two effects balance.
A common improvement: normalize the step size by the input energy:
where δ is a small constant for numerical stability. NLMS removes the dependence on input power, making μ ∈ (0, 2) the universal stability range.
Watch the learning curve (MSE vs. iteration). Large μ converges fast but is noisier at steady state.
The classic application of adaptive filtering. You have two inputs:
The adaptive filter takes u[n] as input and tries to make its output y[n] = wTun approximate v[n]. The error e[n] = d[n] − y[n] = s[n] + v[n] − y[n]. When w converges, y[n] ≈ v[n], so e[n] ≈ s[n]. The error IS the clean signal.
The MSE is E[e2[n]] = E[(s[n] + v[n] − y[n])2]. Expand: E[s2] + E[(v − y)2] + 2E[s(v − y)]. Since s is uncorrelated with both v and u (and hence y), the cross term vanishes. So:
The first term is constant (signal power). Minimizing MSE = minimizing E[(v − y)2] = making y[n] approximate v[n]. When the filter converges, y[n] ≈ v[n], so e[n] = s[n] + v[n] − v[n] ≈ s[n]. The proof is elegant: you never told the filter about s[n], yet it recovers it by removing everything correlated with the reference.
For noise cancellation to work: (1) The reference u[n] must be correlated with the noise v[n] but uncorrelated with the signal s[n]. (2) The relationship between u[n] and v[n] must be approximately linear (for LMS to model it). (3) The filter length M must be long enough to capture the noise transfer path. If these are violated, the filter may distort the signal or fail to cancel the noise.
Top: clean signal. Middle: noisy signal (what you hear). Bottom: filter output (cleaned). Watch the filter converge in real time.
When you talk on a speakerphone, the far-end speaker's voice plays from your speaker, bounces around the room, and gets picked up by your microphone. The far-end listener hears their own voice delayed — an echo. This echo can be hundreds of milliseconds long, covering thousands of samples.
The echo path from speaker to microphone is a linear filter (convolution with the room impulse response). Model it as an FIR filter with coefficients h = [h0, h1, ..., hL−1]. The microphone signal is:
where u[n] is the far-end signal (known — we sent it to the speaker) and s[n] is the near-end talker (the voice we want to preserve).
An adaptive filter with input u[n] and desired output d[n] learns to approximate h. Its output y[n] ≈ ∑hku[n−k] is the estimated echo. Subtracting gives the echo-free signal: e[n] = d[n] − y[n] ≈ s[n].
| Algorithm | Update cost | Convergence | Tracks changes |
|---|---|---|---|
| LMS | O(M) | Slow (depends on eigenvalue spread) | Yes, continuously |
| NLMS | O(M) | Faster (normalized step) | Yes |
| RLS | O(M2) | Fast (independent of eigenvalue spread) | Yes (forgetting factor) |
| Block LMS (FFT) | O(M log M) | Same as LMS | Yes, with block delay |
RLS (Recursive Least Squares) maintains a running estimate of Ru−1 and updates it rank-1 at each step using the matrix inversion lemma (Woodbury identity):
where P[n] ≈ Ru−1 and λ ∈ (0, 1] is a forgetting factor (typically 0.99). RLS converges in O(M) iterations regardless of eigenvalue spread, but costs O(M2) per sample. For long echo cancellers (M > 1000), the cost is prohibitive, so NLMS with subband decomposition is typical.
Real echo cancellers use a double-talk detector to freeze adaptation when both parties speak. The idea: if the error power suddenly increases (because near-end speech appeared), stop updating. Common detectors compare the ratio E[e2[n]]/E[d2[n]] to a threshold.
Subband adaptive filtering decomposes the signal into frequency bands, runs a short adaptive filter in each band (lower sampling rate = faster convergence), then reconstructs. This reduces the effective filter length and eigenvalue spread simultaneously.
The true echo path (teal) vs. the adaptive filter's estimate (orange). Watch it converge over time.
Let's consolidate the journey from optimal filtering to online learning.
| Concept | Key Equation | Insight |
|---|---|---|
| Wiener solution | wopt = Ru−1rud | Decorrelate error from input |
| LMS update | w += μ·e[n]·un | SGD on instantaneous MSE |
| Stability bound | μ < 2/λmax | Step size < inverse max eigenvalue |
| Misadjustment | M ≈ μ·M·σu2/2 | Steady-state excess MSE |
| Noise cancel | e[n] = s[n] + v[n] − v̂[n] | Error = cleaned signal |
The entire adaptive filtering framework — from Wiener to LMS to RLS — illustrates a recurring theme in engineering: replace exact computation with iterative approximation. The Wiener filter is beautiful but impractical. Steepest descent is practical but still batch. LMS drops everything except the current sample and somehow still works. This is the power of stochastic methods.
Modern adaptive filtering extends LMS in many directions:
In telecommunications, adaptive filters are ubiquitous. Every phone call uses echo cancellation. Every wireless receiver uses adaptive equalization to undo channel distortion. 5G MIMO systems use hundreds of adaptive filters running in parallel. The LMS algorithm, in some variant, is almost certainly running on the device you're reading this on right now.
"The LMS algorithm is perhaps the most widely used adaptive algorithm in practice. Its simplicity is its greatest strength." — Simon Haykin, Adaptive Filter Theory