EE269 Lecture 22 — Mert Pilanci, Stanford

Adaptive Filters & LMS

When the statistics change or you never had them — learn the optimal filter on the fly, one sample at a time.

Prerequisites: Linear regression (normal equations) + Gradient descent intuition. That's it.
8
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: Why Adaptive?

You're on a phone call in a noisy coffee shop. Your phone receives a mix of your voice and background clatter. A second microphone picks up mostly the ambient noise. Can we subtract the noise from the voice signal?

We could design a fixed filter if we knew the exact noise characteristics — its power spectrum, how it correlates with the primary signal. But those statistics change constantly: someone starts the espresso machine, a door opens, music volume shifts. By the time we estimate the statistics and design the optimal filter, they've already changed.

The solution: an adaptive filter that learns the optimal weights on the fly, updating with each new sample. No batch statistics needed. No matrix inversion. Just a simple update rule that converges to the optimal solution — and tracks changes when the environment shifts.

The adaptive filter framework. At time n:
Input vector: un = [u[n], u[n−1], ..., u[n−M+1]]T (M most recent input samples)
Filter output: y[n] = wTun (weighted sum)
Desired signal: d[n] (what we want the output to be)
Error: e[n] = d[n] − y[n]
Update: adjust w to make e[n] smaller
Adaptive Filter in Action

A noisy signal (top) is cleaned by an adaptive filter that converges over time. Watch the error shrink.

An adaptive filter is needed when:

Chapter 1: The Wiener Filter

Before going adaptive, let's find the optimal filter assuming we know the statistics. This is the Wiener filter — the gold standard that adaptive methods try to approximate.

We want to minimize the mean squared error:

J(w) = E[|e[n]|2] = E[|d[n] − wTun|2]

Expand this:

J(w) = E[d2[n]] − 2wTE[und[n]] + wTE[ununT]w

Define the autocorrelation matrix and cross-correlation vector:

Ru = E[ununT]    (M × M matrix)
rud = E[und[n]]    (M × 1 vector)

Then J(w) = σd2 − 2wTrud + wTRuw. Taking the gradient and setting to zero:

∇J = 2Ruw − 2rud = 0
wopt = Ru−1 rud
Wiener-Hopf equation. This is the same structure as the normal equations from regression. Ru is the Gram matrix of the input, rud is the input-output cross-correlation. The optimal filter decorrelates the error from the input: E[uneopt[n]] = 0.

The MSE Surface

J(w) is a quadratic bowl in w-space. The minimum is at wopt, and the minimum MSE is:

Jmin = σd2 − rudT Ru−1 rud

The eigenvalues of Ru determine the shape of the bowl. If the eigenvalues are spread (high eigenvalue spread = ratio λmaxmin), the bowl is elongated — a narrow valley. This makes gradient descent slow, as we'll see.

MSE Surface (2-tap filter)

The contour plot shows the MSE as a function of two filter weights. The star is the Wiener solution.

Eigenvalue spread 1.0
Why not just compute wopt directly? Two reasons: (1) We often don't know Ru and rud — we'd have to estimate them from data, which requires storing and processing all past samples. (2) Even if we did, the statistics might be non-stationary. Adaptive filters avoid both problems.

Worked Example: 2-Tap Wiener Filter

Suppose u[n] is white noise with σu2 = 1 (so Ru = I2), and the cross-correlation is rud = [0.8, −0.3]T. Then:

wopt = I−1 · [0.8, −0.3]T = [0.8, −0.3]T

The minimum MSE is Jmin = σd2 − rudTwopt. If σd2 = 1, then Jmin = 1 − (0.64 + 0.09) = 0.27. The filter removes 73% of the error power.

For correlated input (non-identity Ru), the contours become elliptical, and the condition number κ = λmaxmin controls how elongated they are. Gradient descent spirals slowly through elongated valleys.

The Wiener solution wopt = Ru−1rud requires knowledge of:

Chapter 2: Gradient Descent on the MSE Surface

Instead of inverting Ru, we can slide down the MSE bowl using gradient descent. The gradient of J(w) is:

∇J(w) = 2Ruw − 2rud

The steepest descent update:

w[n+1] = w[n] − μ ∇J(w[n]) = w[n] − 2μ(Ruw[n] − rud)

where μ is the step size (learning rate). This still requires Ru and rud, but it avoids the matrix inverse. Each step costs O(M2) instead of O(M3).

Convergence in Eigenvector Coordinates

Transform to the eigenvector basis of Ru: let v = QT(w − wopt), where Q diagonalizes Ru. Each component of v evolves independently:

vk[n+1] = (1 − 2μλk) vk[n]

where λk is the k-th eigenvalue of Ru. For convergence, we need |1 − 2μλk| < 1 for all k, which gives:

0 < μ < 1/λmax
The step size bound. Too large a μ makes the algorithm diverge (oscillations grow). Too small makes convergence painfully slow. The fastest convergence for component k is when μ = 1/(2λk), but each eigenvalue wants a different step size. With a single μ, the slowest mode (smallest eigenvalue) determines convergence speed.

The Eigenvalue Spread Problem

The convergence time constant for mode k is τk = 1/(2μλk). The slowest mode has τmax = 1/(2μλmin). But μ must stay below 1/λmax, so the worst case is:

τmax = λmax / (2λmin) = χ(Ru) / 2

where χ(Ru) = λmaxmin is the condition number. High condition number → slow convergence. This motivates normalized variants of LMS.

Gradient Descent Trajectory

Watch the weight vector spiral toward the optimum. High eigenvalue spread creates a slow zig-zag path.

Step size μ 0.050
Eigenvalue spread 3.0
For steepest descent on the MSE surface to converge, the step size must satisfy:

Chapter 3: The LMS Algorithm

Steepest descent requires the true gradient ∇J = 2Ruw − 2rud, which requires the ensemble statistics Ru and rud. Widrow and Hoff's insight (1960): replace the ensemble average with the instantaneous value.

The true gradient is E[2un(unTw − d[n])] = E[−2une[n]]. The instantaneous (single-sample) approximation is just −2une[n]. This gives the Least Mean Squares (LMS) algorithm:

Compute output
y[n] = wT[n] un
Compute error
e[n] = d[n] − y[n]
Update weights
w[n+1] = w[n] + μ · e[n] · un
↻ repeat for next sample
Why LMS is beautiful. Three multiplies and an addition per weight per sample. O(M) per iteration. No matrix operations. No stored statistics. No batch processing. Just the current input vector, the current error, and a step size. It's the simplest possible adaptive algorithm — and it works.

Derivation from Stochastic Gradient Descent

LMS is stochastic gradient descent (SGD) on the MSE cost. The instantaneous cost at sample n is:

Jn(w) = |e[n]|2 = |d[n] − wTun|2

The gradient: ∇Jn = −2une[n]. So the update w[n+1] = w[n] − (μ/2)∇Jn = w[n] + μ·e[n]·un. (We absorb the factor of 2 into μ.)

The gradient estimate is noisy (based on a single sample) but unbiased: E[−2une[n]] = ∇J. Over many iterations, the noise averages out and the algorithm converges to the vicinity of wopt.

Implementation

python
import numpy as np

def lms(u, d, M, mu):
    """LMS adaptive filter.
    u: input signal, d: desired signal
    M: filter length, mu: step size"""
    N = len(u)
    w = np.zeros(M)           # init weights to zero
    e = np.zeros(N)           # error signal
    y = np.zeros(N)           # filter output

    for n in range(M, N):
        u_n = u[n:n-M:-1]      # [u[n], u[n-1], ..., u[n-M+1]]
        y[n] = w @ u_n          # filter output
        e[n] = d[n] - y[n]     # error
        w = w + mu * e[n] * u_n # LMS update

    return y, e, w

Worked Example: LMS Step by Step

M = 2 taps, μ = 0.1, w[0] = [0, 0]. Input: u = [1, 0.5, −0.3, 0.8, ...]. Desired: d = [0.8, −0.1, 0.6, ...].

n = 0: u0 = [1, 0], y[0] = [0,0]·[1,0] = 0, e[0] = 0.8 − 0 = 0.8, w[1] = [0,0] + 0.1·0.8·[1,0] = [0.08, 0].

n = 1: u1 = [0.5, 1], y[1] = [0.08, 0]·[0.5, 1] = 0.04, e[1] = −0.1 − 0.04 = −0.14, w[2] = [0.08, 0] + 0.1·(−0.14)·[0.5, 1] = [0.073, −0.014].

Each step only uses the current input and error. After hundreds of steps, w converges near the Wiener solution — without ever computing Ru or its inverse.

Complexity comparison. Wiener filter: O(M3) once (matrix inverse). LMS: O(M) per sample, runs forever. For a 1024-tap echo canceller at 8 kHz, LMS does 8,000 updates/second, each costing 1024 multiplies. The Wiener solution would require inverting a 1024×1024 matrix — and recomputing whenever the room changes.
The LMS weight update rule is:

Chapter 4: Convergence Analysis

When does LMS converge? And to what? Since the gradient is noisy, LMS doesn't converge to wopt exactly — it wanders around it. The step size μ controls the tradeoff between convergence speed and steady-state noise.

Convergence in the Mean

Take expectations of the LMS update: E[w[n+1]] = E[w[n]] + μ E[une[n]]. Under certain assumptions (independence of w[n] and un, which is approximate), this reduces to the same recursion as steepest descent. The convergence condition is the same:

0 < μ < 2/λmax(Ru)

Since we usually don't know λmax, a practical bound uses the trace:

0 < μ < 2/tr(Ru) = 2/(M · σu2)

because λmax ≤ tr(Ru) = ∑λk = M · σu2 for white input with variance σu2.

The μ dilemma.
Large μ: fast initial convergence, but large steady-state fluctuations (the weight vector jitters around wopt)
Small μ: slow convergence, but small steady-state error (closer to wopt)
You can't have both — this is the fundamental tradeoff of LMS.

Excess MSE and Misadjustment

At steady state, the MSE is higher than Jmin by an amount called the excess MSE:

Jexcess ≈ μ · Jmin · tr(Ru) / 2

The misadjustment M is the ratio of excess MSE to minimum MSE:

M = Jexcess/Jmin ≈ μ · tr(Ru) / 2 = μ · M · σu2 / 2

For 10% misadjustment with M = 10 taps and σu2 = 1: μ < 0.02.

Worked Example: Choosing μ

Echo canceller with M = 64 taps, input is speech with σu2 ≈ 0.01. The stability bound is μ < 2/(64 · 0.01) = 3.125. For 5% misadjustment: μ < 2 · 0.05 / (64 · 0.01) = 0.156. A typical choice would be μ = 0.05 — well within stability and giving about 1.6% misadjustment.

The convergence time constant is approximately τ ≈ 1/(2μλmin). If the smallest eigenvalue of Ru is 0.001, then τ ≈ 1/(2 · 0.05 · 0.001) = 10,000 samples. At 8 kHz, that's 1.25 seconds to converge. Increasing μ speeds this up but increases steady-state noise.

Weight Error Vector Analysis

Define the weight error vector c[n] = w[n] − wopt. The LMS update becomes:

c[n+1] = (I − μ ununT) c[n] + μ un v[n]

where v[n] = d[n] − woptTun is the optimal error (noise floor). The first term drives c toward zero (convergence). The second term adds noise (the stochastic gradient is imperfect). At steady state, these two effects balance.

NLMS: Normalized LMS

A common improvement: normalize the step size by the input energy:

w[n+1] = w[n] + (μ/(unTun + δ)) · e[n] · un

where δ is a small constant for numerical stability. NLMS removes the dependence on input power, making μ ∈ (0, 2) the universal stability range.

LMS Convergence: μ Tradeoff

Watch the learning curve (MSE vs. iteration). Large μ converges fast but is noisier at steady state.

μ (step size) 0.020
Filter length M 8
Increasing the LMS step size μ (within the stable range) causes:

Chapter 5: Noise Cancellation

The classic application of adaptive filtering. You have two inputs:

The adaptive filter takes u[n] as input and tries to make its output y[n] = wTun approximate v[n]. The error e[n] = d[n] − y[n] = s[n] + v[n] − y[n]. When w converges, y[n] ≈ v[n], so e[n] ≈ s[n]. The error IS the clean signal.

The magic of noise cancellation. The error signal is the output! Minimizing MSE forces the filter to cancel the noise component, leaving only the signal. No knowledge of the signal s[n] or noise v[n] is needed — just a reference that's correlated with the noise.

Why Does Minimizing MSE Cancel Noise?

The MSE is E[e2[n]] = E[(s[n] + v[n] − y[n])2]. Expand: E[s2] + E[(v − y)2] + 2E[s(v − y)]. Since s is uncorrelated with both v and u (and hence y), the cross term vanishes. So:

MSE = E[s2[n]] + E[(v[n] − y[n])2]

The first term is constant (signal power). Minimizing MSE = minimizing E[(v − y)2] = making y[n] approximate v[n]. When the filter converges, y[n] ≈ v[n], so e[n] = s[n] + v[n] − v[n] ≈ s[n]. The proof is elegant: you never told the filter about s[n], yet it recovers it by removing everything correlated with the reference.

Practical Requirements

For noise cancellation to work: (1) The reference u[n] must be correlated with the noise v[n] but uncorrelated with the signal s[n]. (2) The relationship between u[n] and v[n] must be approximately linear (for LMS to model it). (3) The filter length M must be long enough to capture the noise transfer path. If these are violated, the filter may distort the signal or fail to cancel the noise.

Adaptive Noise Cancellation Showcase

Top: clean signal. Middle: noisy signal (what you hear). Bottom: filter output (cleaned). Watch the filter converge in real time.

μ 0.010
Filter taps M 8
Noise level 1.50
Experiment ideas:
• Start with high noise level (2.0+) — the noisy signal looks hopeless, but the filter recovers the original
• Increase M — more taps capture more complex noise paths
• Crank μ too high — the filter oscillates and doesn't clean well
• Watch the early samples: the filter hasn't converged yet, so the output is still noisy
In adaptive noise cancellation, the cleaned signal is:

Chapter 6: Echo Cancellation

When you talk on a speakerphone, the far-end speaker's voice plays from your speaker, bounces around the room, and gets picked up by your microphone. The far-end listener hears their own voice delayed — an echo. This echo can be hundreds of milliseconds long, covering thousands of samples.

The Echo Path

The echo path from speaker to microphone is a linear filter (convolution with the room impulse response). Model it as an FIR filter with coefficients h = [h0, h1, ..., hL−1]. The microphone signal is:

d[n] = s[n] + ∑k=0L−1 hk u[n−k]

where u[n] is the far-end signal (known — we sent it to the speaker) and s[n] is the near-end talker (the voice we want to preserve).

An adaptive filter with input u[n] and desired output d[n] learns to approximate h. Its output y[n] ≈ ∑hku[n−k] is the estimated echo. Subtracting gives the echo-free signal: e[n] = d[n] − y[n] ≈ s[n].

Challenges in echo cancellation:
Long echo paths: 200ms at 8kHz = 1600 taps. LMS with M=1600 needs very small μ
Double talk: when both sides talk simultaneously, d[n] contains both echo and near-end speech. The filter must not adapt during double talk (it would try to cancel the near-end speaker)
Non-stationarity: the echo path changes when someone moves or a door opens

Comparison of Algorithms

AlgorithmUpdate costConvergenceTracks changes
LMSO(M)Slow (depends on eigenvalue spread)Yes, continuously
NLMSO(M)Faster (normalized step)Yes
RLSO(M2)Fast (independent of eigenvalue spread)Yes (forgetting factor)
Block LMS (FFT)O(M log M)Same as LMSYes, with block delay

RLS (Recursive Least Squares) maintains a running estimate of Ru−1 and updates it rank-1 at each step using the matrix inversion lemma (Woodbury identity):

P[n] = (1/λ)(P[n−1] − P[n−1]ununTP[n−1] / (λ + unTP[n−1]un))

where P[n] ≈ Ru−1 and λ ∈ (0, 1] is a forgetting factor (typically 0.99). RLS converges in O(M) iterations regardless of eigenvalue spread, but costs O(M2) per sample. For long echo cancellers (M > 1000), the cost is prohibitive, so NLMS with subband decomposition is typical.

Practical Considerations

Real echo cancellers use a double-talk detector to freeze adaptation when both parties speak. The idea: if the error power suddenly increases (because near-end speech appeared), stop updating. Common detectors compare the ratio E[e2[n]]/E[d2[n]] to a threshold.

Subband adaptive filtering decomposes the signal into frequency bands, runs a short adaptive filter in each band (lower sampling rate = faster convergence), then reconstructs. This reduces the effective filter length and eigenvalue spread simultaneously.

Echo Path Estimation

The true echo path (teal) vs. the adaptive filter's estimate (orange). Watch it converge over time.

μ 0.010
In echo cancellation, the adaptive filter's input is:

Chapter 7: Mastery

Let's consolidate the journey from optimal filtering to online learning.

Wiener Filter
wopt = Ru−1rud — optimal but needs statistics
↓ replace E[·] with batch estimate
Steepest Descent
w[n+1] = w[n] − μ∇J — iterative, still needs Ru
↓ replace ∇J with instantaneous estimate
LMS
w[n+1] = w[n] + μe[n]un — O(M), no statistics
↓ normalize by input power
NLMS
Step size adapts to input energy — robust
ConceptKey EquationInsight
Wiener solutionwopt = Ru−1rudDecorrelate error from input
LMS updatew += μ·e[n]·unSGD on instantaneous MSE
Stability boundμ < 2/λmaxStep size < inverse max eigenvalue
MisadjustmentM ≈ μ·M·σu2/2Steady-state excess MSE
Noise cancele[n] = s[n] + v[n] − v̂[n]Error = cleaned signal

Connections

The entire adaptive filtering framework — from Wiener to LMS to RLS — illustrates a recurring theme in engineering: replace exact computation with iterative approximation. The Wiener filter is beautiful but impractical. Steepest descent is practical but still batch. LMS drops everything except the current sample and somehow still works. This is the power of stochastic methods.

What you can now do:
• Derive the Wiener filter from first principles
• Explain why LMS is stochastic gradient descent on the MSE
• State the convergence condition μ < 2/λmax and explain what happens when violated
• Implement LMS in a few lines of code
• Design a noise cancellation system with primary and reference inputs
• Explain the echo cancellation architecture

Beyond LMS

Modern adaptive filtering extends LMS in many directions:

In telecommunications, adaptive filters are ubiquitous. Every phone call uses echo cancellation. Every wireless receiver uses adaptive equalization to undo channel distortion. 5G MIMO systems use hundreds of adaptive filters running in parallel. The LMS algorithm, in some variant, is almost certainly running on the device you're reading this on right now.

Deep learning connection. SGD (stochastic gradient descent) training of neural networks is conceptually identical to LMS: use a single (mini-batch) sample to estimate the gradient, take a step, repeat. The convergence conditions are analogous: the learning rate must be small enough relative to the curvature of the loss landscape. Adam, RMSProp, and other optimizers are the neural network equivalents of NLMS — they normalize by running estimates of gradient magnitude.

"The LMS algorithm is perhaps the most widely used adaptive algorithm in practice. Its simplicity is its greatest strength." — Simon Haykin, Adaptive Filter Theory