EE269 Lecture 22 — Adaptive Filters & LMS

Chapter 0: Why Adaptive?

You're on a phone call in a noisy coffee shop. Your phone receives a mix of your voice and background clatter. A second microphone picks up mostly the ambient noise. Can we subtract the noise from the voice signal?

We could design a fixed filter if we knew the exact noise characteristics — its power spectrum, how it correlates with the primary signal. But those statistics change constantly: someone starts the espresso machine, a door opens, music volume shifts. By the time we estimate the statistics and design the optimal filter, they've already changed.

The solution: an adaptive filter that learns the optimal weights on the fly, updating with each new sample. No batch statistics needed. No matrix inversion. Just a simple update rule that converges to the optimal solution — and tracks changes when the environment shifts.

The adaptive filter framework. At time n:
• Input vector: u_n = [u[n], u[n−1], ..., u[n−M+1]]^T (M most recent input samples)
• Filter output: y[n] = w^Tu_n (weighted sum)
• Desired signal: d[n] (what we want the output to be)
• Error: e[n] = d[n] − y[n]
• Update: adjust w to make e[n] smaller

Adaptive Filter in Action

A noisy signal (top) is cleaned by an adaptive filter that converges over time. Watch the error shrink.

An adaptive filter is needed when:

The signal statistics are known exactly and constant The signal statistics are unknown or change over time The signal is deterministic (no noise)

Chapter 1: The Wiener Filter

Before going adaptive, let's find the optimal filter assuming we know the statistics. This is the Wiener filter — the gold standard that adaptive methods try to approximate.

We want to minimize the mean squared error:

J(w) = E[|e[n]|²] = E[|d[n] − w^Tu_n|²]

Expand this:

J(w) = E[d²[n]] − 2w^TE[u_nd[n]] + w^TE[u_nu_n^T]w

Define the autocorrelation matrix and cross-correlation vector:

R_u = E[u_nu_n^T] (M × M matrix)

r_ud = E[u_nd[n]] (M × 1 vector)

Then J(w) = σ_d² − 2w^Tr_ud + w^TR_uw. Taking the gradient and setting to zero:

∇J = 2R_uw − 2r_ud = 0

w_opt = R_u⁻¹ r_ud

Wiener-Hopf equation. This is the same structure as the normal equations from regression. R_u is the Gram matrix of the input, r_ud is the input-output cross-correlation. The optimal filter decorrelates the error from the input: E[u_ne_opt[n]] = 0.

The MSE Surface

J(w) is a quadratic bowl in w-space. The minimum is at w_opt, and the minimum MSE is:

J_min = σ_d² − r_ud^T R_u⁻¹ r_ud

The eigenvalues of R_u determine the shape of the bowl. If the eigenvalues are spread (high eigenvalue spread = ratio λ_max/λ_min), the bowl is elongated — a narrow valley. This makes gradient descent slow, as we'll see.

MSE Surface (2-tap filter)

The contour plot shows the MSE as a function of two filter weights. The star is the Wiener solution.

Eigenvalue spread 1.0

Why not just compute w_opt directly? Two reasons: (1) We often don't know R_u and r_ud — we'd have to estimate them from data, which requires storing and processing all past samples. (2) Even if we did, the statistics might be non-stationary. Adaptive filters avoid both problems.

Worked Example: 2-Tap Wiener Filter

Suppose u[n] is white noise with σ_u² = 1 (so R_u = I₂), and the cross-correlation is r_ud = [0.8, −0.3]^T. Then:

w_opt = I⁻¹ · [0.8, −0.3]^T = [0.8, −0.3]^T

The minimum MSE is J_min = σ_d² − r_ud^Tw_opt. If σ_d² = 1, then J_min = 1 − (0.64 + 0.09) = 0.27. The filter removes 73% of the error power.

For correlated input (non-identity R_u), the contours become elliptical, and the condition number κ = λ_max/λ_min controls how elongated they are. Gradient descent spirals slowly through elongated valleys.

The Wiener solution w_opt = R_u⁻¹r_ud requires knowledge of:

The input autocorrelation matrix and input-desired cross-correlation Only the input signal u[n] The filter length M only

Chapter 2: Gradient Descent on the MSE Surface

Instead of inverting R_u, we can slide down the MSE bowl using gradient descent. The gradient of J(w) is:

∇J(w) = 2R_uw − 2r_ud

The steepest descent update:

w[n+1] = w[n] − μ ∇J(w[n]) = w[n] − 2μ(R_uw[n] − r_ud)

where μ is the step size (learning rate). This still requires R_u and r_ud, but it avoids the matrix inverse. Each step costs O(M²) instead of O(M³).

Convergence in Eigenvector Coordinates

Transform to the eigenvector basis of R_u: let v = Q^T(w − w_opt), where Q diagonalizes R_u. Each component of v evolves independently:

v_k[n+1] = (1 − 2μλ_k) v_k[n]

where λ_k is the k-th eigenvalue of R_u. For convergence, we need |1 − 2μλ_k| < 1 for all k, which gives:

0 < μ < 1/λ_max

The step size bound. Too large a μ makes the algorithm diverge (oscillations grow). Too small makes convergence painfully slow. The fastest convergence for component k is when μ = 1/(2λ_k), but each eigenvalue wants a different step size. With a single μ, the slowest mode (smallest eigenvalue) determines convergence speed.

The Eigenvalue Spread Problem

The convergence time constant for mode k is τ_k = 1/(2μλ_k). The slowest mode has τ_max = 1/(2μλ_min). But μ must stay below 1/λ_max, so the worst case is:

τ_max = λ_max / (2λ_min) = χ(R_u) / 2

where χ(R_u) = λ_max/λ_min is the condition number. High condition number → slow convergence. This motivates normalized variants of LMS.

Gradient Descent Trajectory

Watch the weight vector spiral toward the optimum. High eigenvalue spread creates a slow zig-zag path.

Step size μ 0.050

Eigenvalue spread 3.0

For steepest descent on the MSE surface to converge, the step size must satisfy:

μ > 1/λ_max 0 < μ < 1/λ_max μ = λ_min

Chapter 3: The LMS Algorithm

Steepest descent requires the true gradient ∇J = 2R_uw − 2r_ud, which requires the ensemble statistics R_u and r_ud. Widrow and Hoff's insight (1960): replace the ensemble average with the instantaneous value.

The true gradient is E[2u_n(u_n^Tw − d[n])] = E[−2u_ne[n]]. The instantaneous (single-sample) approximation is just −2u_ne[n]. This gives the Least Mean Squares (LMS) algorithm:

Compute output

y[n] = w^T[n] u_n

↓

Compute error

e[n] = d[n] − y[n]

↓

Update weights

w[n+1] = w[n] + μ · e[n] · u_n

↻ repeat for next sample

Why LMS is beautiful. Three multiplies and an addition per weight per sample. O(M) per iteration. No matrix operations. No stored statistics. No batch processing. Just the current input vector, the current error, and a step size. It's the simplest possible adaptive algorithm — and it works.

Derivation from Stochastic Gradient Descent

LMS is stochastic gradient descent (SGD) on the MSE cost. The instantaneous cost at sample n is:

J_n(w) = |e[n]|² = |d[n] − w^Tu_n|²

The gradient: ∇J_n = −2u_ne[n]. So the update w[n+1] = w[n] − (μ/2)∇J_n = w[n] + μ·e[n]·u_n. (We absorb the factor of 2 into μ.)

The gradient estimate is noisy (based on a single sample) but unbiased: E[−2u_ne[n]] = ∇J. Over many iterations, the noise averages out and the algorithm converges to the vicinity of w_opt.

Implementation

python
import numpy as np

def lms(u, d, M, mu):
    """LMS adaptive filter.
    u: input signal, d: desired signal
    M: filter length, mu: step size"""
    N = len(u)
    w = np.zeros(M)           # init weights to zero
    e = np.zeros(N)           # error signal
    y = np.zeros(N)           # filter output

    for n in range(M, N):
        u_n = u[n:n-M:-1]      # [u[n], u[n-1], ..., u[n-M+1]]
        y[n] = w @ u_n          # filter output
        e[n] = d[n] - y[n]     # error
        w = w + mu * e[n] * u_n # LMS update

    return y, e, w

Worked Example: LMS Step by Step

M = 2 taps, μ = 0.1, w[0] = [0, 0]. Input: u = [1, 0.5, −0.3, 0.8, ...]. Desired: d = [0.8, −0.1, 0.6, ...].

n = 0: u₀ = [1, 0], y[0] = [0,0]·[1,0] = 0, e[0] = 0.8 − 0 = 0.8, w[1] = [0,0] + 0.1·0.8·[1,0] = [0.08, 0].

n = 1: u₁ = [0.5, 1], y[1] = [0.08, 0]·[0.5, 1] = 0.04, e[1] = −0.1 − 0.04 = −0.14, w[2] = [0.08, 0] + 0.1·(−0.14)·[0.5, 1] = [0.073, −0.014].

Each step only uses the current input and error. After hundreds of steps, w converges near the Wiener solution — without ever computing R_u or its inverse.

Complexity comparison. Wiener filter: O(M³) once (matrix inverse). LMS: O(M) per sample, runs forever. For a 1024-tap echo canceller at 8 kHz, LMS does 8,000 updates/second, each costing 1024 multiplies. The Wiener solution would require inverting a 1024×1024 matrix — and recomputing whenever the room changes.

The LMS weight update rule is:

w[n+1] = w[n] + μ · e[n] · u_n w[n+1] = w[n] − μ · e[n]² w[n+1] = R_u⁻¹ r_ud

Chapter 4: Convergence Analysis

When does LMS converge? And to what? Since the gradient is noisy, LMS doesn't converge to w_opt exactly — it wanders around it. The step size μ controls the tradeoff between convergence speed and steady-state noise.

Convergence in the Mean

Take expectations of the LMS update: E[w[n+1]] = E[w[n]] + μ E[u_ne[n]]. Under certain assumptions (independence of w[n] and u_n, which is approximate), this reduces to the same recursion as steepest descent. The convergence condition is the same:

0 < μ < 2/λ_max(R_u)

Since we usually don't know λ_max, a practical bound uses the trace:

0 < μ < 2/tr(R_u) = 2/(M · σ_u²)

because λ_max ≤ tr(R_u) = ∑λ_k = M · σ_u² for white input with variance σ_u².

The μ dilemma.
• Large μ: fast initial convergence, but large steady-state fluctuations (the weight vector jitters around w_opt)
• Small μ: slow convergence, but small steady-state error (closer to w_opt)
You can't have both — this is the fundamental tradeoff of LMS.

Excess MSE and Misadjustment

At steady state, the MSE is higher than J_min by an amount called the excess MSE:

J_excess ≈ μ · J_min · tr(R_u) / 2

The misadjustment M is the ratio of excess MSE to minimum MSE:

M = J_excess/J_min ≈ μ · tr(R_u) / 2 = μ · M · σ_u² / 2

For 10% misadjustment with M = 10 taps and σ_u² = 1: μ < 0.02.

Worked Example: Choosing μ

Echo canceller with M = 64 taps, input is speech with σ_u² ≈ 0.01. The stability bound is μ < 2/(64 · 0.01) = 3.125. For 5% misadjustment: μ < 2 · 0.05 / (64 · 0.01) = 0.156. A typical choice would be μ = 0.05 — well within stability and giving about 1.6% misadjustment.

The convergence time constant is approximately τ ≈ 1/(2μλ_min). If the smallest eigenvalue of R_u is 0.001, then τ ≈ 1/(2 · 0.05 · 0.001) = 10,000 samples. At 8 kHz, that's 1.25 seconds to converge. Increasing μ speeds this up but increases steady-state noise.

Weight Error Vector Analysis

Define the weight error vector c[n] = w[n] − w_opt. The LMS update becomes:

c[n+1] = (I − μ u_nu_n^T) c[n] + μ u_n v[n]

where v[n] = d[n] − w_opt^Tu_n is the optimal error (noise floor). The first term drives c toward zero (convergence). The second term adds noise (the stochastic gradient is imperfect). At steady state, these two effects balance.

NLMS: Normalized LMS

A common improvement: normalize the step size by the input energy:

w[n+1] = w[n] + (μ/(u_n^Tu_n + δ)) · e[n] · u_n

where δ is a small constant for numerical stability. NLMS removes the dependence on input power, making μ ∈ (0, 2) the universal stability range.

LMS Convergence: μ Tradeoff

Watch the learning curve (MSE vs. iteration). Large μ converges fast but is noisier at steady state.

μ (step size) 0.020

Filter length M 8

Increasing the LMS step size μ (within the stable range) causes:

Faster convergence but larger steady-state fluctuations Slower convergence and smaller fluctuations No change in convergence rate

Chapter 5: Noise Cancellation

The classic application of adaptive filtering. You have two inputs:

Primary: d[n] = s[n] + v[n] — signal s[n] corrupted by noise v[n]
Reference: u[n] — correlated with v[n] but uncorrelated with s[n]

The adaptive filter takes u[n] as input and tries to make its output y[n] = w^Tu_n approximate v[n]. The error e[n] = d[n] − y[n] = s[n] + v[n] − y[n]. When w converges, y[n] ≈ v[n], so e[n] ≈ s[n]. The error IS the clean signal.

The magic of noise cancellation. The error signal is the output! Minimizing MSE forces the filter to cancel the noise component, leaving only the signal. No knowledge of the signal s[n] or noise v[n] is needed — just a reference that's correlated with the noise.

Why Does Minimizing MSE Cancel Noise?

The MSE is E[e²[n]] = E[(s[n] + v[n] − y[n])²]. Expand: E[s²] + E[(v − y)²] + 2E[s(v − y)]. Since s is uncorrelated with both v and u (and hence y), the cross term vanishes. So:

MSE = E[s²[n]] + E[(v[n] − y[n])²]

The first term is constant (signal power). Minimizing MSE = minimizing E[(v − y)²] = making y[n] approximate v[n]. When the filter converges, y[n] ≈ v[n], so e[n] = s[n] + v[n] − v[n] ≈ s[n]. The proof is elegant: you never told the filter about s[n], yet it recovers it by removing everything correlated with the reference.

Practical Requirements

For noise cancellation to work: (1) The reference u[n] must be correlated with the noise v[n] but uncorrelated with the signal s[n]. (2) The relationship between u[n] and v[n] must be approximately linear (for LMS to model it). (3) The filter length M must be long enough to capture the noise transfer path. If these are violated, the filter may distort the signal or fail to cancel the noise.

Adaptive Noise Cancellation Showcase

Top: clean signal. Middle: noisy signal (what you hear). Bottom: filter output (cleaned). Watch the filter converge in real time.

μ 0.010

Filter taps M 8

Noise level 1.50

Experiment ideas:
• Start with high noise level (2.0+) — the noisy signal looks hopeless, but the filter recovers the original
• Increase M — more taps capture more complex noise paths
• Crank μ too high — the filter oscillates and doesn't clean well
• Watch the early samples: the filter hasn't converged yet, so the output is still noisy

In adaptive noise cancellation, the cleaned signal is:

The filter output y[n] The error signal e[n] = d[n] − y[n] The reference input u[n]

Chapter 6: Echo Cancellation

When you talk on a speakerphone, the far-end speaker's voice plays from your speaker, bounces around the room, and gets picked up by your microphone. The far-end listener hears their own voice delayed — an echo. This echo can be hundreds of milliseconds long, covering thousands of samples.

The Echo Path

The echo path from speaker to microphone is a linear filter (convolution with the room impulse response). Model it as an FIR filter with coefficients h = [h₀, h₁, ..., h_L−1]. The microphone signal is:

d[n] = s[n] + ∑_k=0^L−1 h_k u[n−k]

where u[n] is the far-end signal (known — we sent it to the speaker) and s[n] is the near-end talker (the voice we want to preserve).

An adaptive filter with input u[n] and desired output d[n] learns to approximate h. Its output y[n] ≈ ∑h_ku[n−k] is the estimated echo. Subtracting gives the echo-free signal: e[n] = d[n] − y[n] ≈ s[n].

Challenges in echo cancellation:
• Long echo paths: 200ms at 8kHz = 1600 taps. LMS with M=1600 needs very small μ
• Double talk: when both sides talk simultaneously, d[n] contains both echo and near-end speech. The filter must not adapt during double talk (it would try to cancel the near-end speaker)
• Non-stationarity: the echo path changes when someone moves or a door opens

Comparison of Algorithms

Algorithm	Update cost	Convergence	Tracks changes
LMS	O(M)	Slow (depends on eigenvalue spread)	Yes, continuously
NLMS	O(M)	Faster (normalized step)	Yes
RLS	O(M²)	Fast (independent of eigenvalue spread)	Yes (forgetting factor)
Block LMS (FFT)	O(M log M)	Same as LMS	Yes, with block delay

RLS (Recursive Least Squares) maintains a running estimate of R_u⁻¹ and updates it rank-1 at each step using the matrix inversion lemma (Woodbury identity):

P[n] = (1/λ)(P[n−1] − P[n−1]u_nu_n^TP[n−1] / (λ + u_n^TP[n−1]u_n))

where P[n] ≈ R_u⁻¹ and λ ∈ (0, 1] is a forgetting factor (typically 0.99). RLS converges in O(M) iterations regardless of eigenvalue spread, but costs O(M²) per sample. For long echo cancellers (M > 1000), the cost is prohibitive, so NLMS with subband decomposition is typical.

Practical Considerations

Real echo cancellers use a double-talk detector to freeze adaptation when both parties speak. The idea: if the error power suddenly increases (because near-end speech appeared), stop updating. Common detectors compare the ratio E[e²[n]]/E[d²[n]] to a threshold.

Subband adaptive filtering decomposes the signal into frequency bands, runs a short adaptive filter in each band (lower sampling rate = faster convergence), then reconstructs. This reduces the effective filter length and eigenvalue spread simultaneously.

Echo Path Estimation

The true echo path (teal) vs. the adaptive filter's estimate (orange). Watch it converge over time.

μ 0.010

In echo cancellation, the adaptive filter's input is:

The far-end signal u[n] (the signal we sent to the speaker) The near-end speech s[n] The microphone signal d[n]

Chapter 7: Mastery

Let's consolidate the journey from optimal filtering to online learning.

Wiener Filter

w_opt = R_u⁻¹r_ud — optimal but needs statistics

↓ replace E[·] with batch estimate

Steepest Descent

w[n+1] = w[n] − μ∇J — iterative, still needs R_u

↓ replace ∇J with instantaneous estimate

LMS

w[n+1] = w[n] + μe[n]u_n — O(M), no statistics

↓ normalize by input power

NLMS

Step size adapts to input energy — robust

Concept	Key Equation	Insight
Wiener solution	w_opt = R_u⁻¹r_ud	Decorrelate error from input
LMS update	w += μ·e[n]·u_n	SGD on instantaneous MSE
Stability bound	μ < 2/λ_max	Step size < inverse max eigenvalue
Misadjustment	M ≈ μ·M·σ_u²/2	Steady-state excess MSE
Noise cancel	e[n] = s[n] + v[n] − v̂[n]	Error = cleaned signal

Connections

Regression & AR Models — the batch/offline counterpart
RKHS & Kernel Regression — kernel LMS extends adaptive filtering to nonlinear models
Kalman filter — can be seen as an optimal adaptive filter for state-space models
SGD in deep learning — same stochastic gradient idea, applied to nonlinear models with millions of parameters

The entire adaptive filtering framework — from Wiener to LMS to RLS — illustrates a recurring theme in engineering: replace exact computation with iterative approximation. The Wiener filter is beautiful but impractical. Steepest descent is practical but still batch. LMS drops everything except the current sample and somehow still works. This is the power of stochastic methods.

What you can now do:
• Derive the Wiener filter from first principles
• Explain why LMS is stochastic gradient descent on the MSE
• State the convergence condition μ < 2/λ_max and explain what happens when violated
• Implement LMS in a few lines of code
• Design a noise cancellation system with primary and reference inputs
• Explain the echo cancellation architecture

Beyond LMS

Modern adaptive filtering extends LMS in many directions:

Frequency-domain LMS: Use FFT to compute the convolution, reducing per-sample cost to O(M log M / M) = O(log M) amortized
Affine projection: Uses the last K input vectors (not just the current one) for a better gradient estimate. Convergence between LMS and RLS, cost O(KM)
Kernel LMS: Replaces the linear filter with a kernel expansion, enabling nonlinear adaptive filtering (connects directly to RKHS)
Sparse LMS: Adds an L1 penalty to promote sparse filter coefficients, useful when the true impulse response is sparse
Variable step size: Adapts μ over time — large at start for fast convergence, small at steady state for low noise. GASS (gradient adaptive step size) and VSS-LMS automate this

In telecommunications, adaptive filters are ubiquitous. Every phone call uses echo cancellation. Every wireless receiver uses adaptive equalization to undo channel distortion. 5G MIMO systems use hundreds of adaptive filters running in parallel. The LMS algorithm, in some variant, is almost certainly running on the device you're reading this on right now.

Deep learning connection. SGD (stochastic gradient descent) training of neural networks is conceptually identical to LMS: use a single (mini-batch) sample to estimate the gradient, take a step, repeat. The convergence conditions are analogous: the learning rate must be small enough relative to the curvature of the loss landscape. Adam, RMSProp, and other optimizers are the neural network equivalents of NLMS — they normalize by running estimates of gradient magnitude.

"The LMS algorithm is perhaps the most widely used adaptive algorithm in practice. Its simplicity is its greatest strength." — Simon Haykin, Adaptive Filter Theory