EE269 Lecture 10 — Mert Pilanci, Stanford

Wavelet Transforms

Adaptive time-frequency analysis — short windows for high frequencies, long windows for low frequencies.

Prerequisites: EE269 Lecture 8 (STFT) + Basic calculus (integrals). That's it.
8
Chapters
7+
Simulations
0
Assumed Knowledge

Chapter 0: Why Wavelets

In Lecture 8, we built the STFT — a powerful tool that gives us time-frequency analysis. But it has one fundamental limitation: the window size is fixed.

Consider music. A bass drum hit lasts ~100 ms and has frequency ~60 Hz. A hi-hat click lasts ~5 ms and has frequency ~8000 Hz. The STFT forces you to choose ONE window size for both:

• Long window (good for bass drum's frequency): smears the hi-hat's timing

• Short window (good for hi-hat's timing): can't resolve the bass drum's pitch

What we really want: long windows at low frequencies (to resolve pitch) and short windows at high frequencies (to resolve timing). This is exactly what wavelets provide.

The STFT's fixed window is a design choice, not a law of nature. The uncertainty principle says Δt · Δf ≥ 1/(4π) — the AREA of each time-frequency tile is fixed. But nothing says the tiles must be the same shape. Wavelets use tall-narrow tiles at low frequencies and short-wide tiles at high frequencies. Same area, different aspect ratio.

Natural Signals Have Multi-Scale Structure

Nature itself operates at multiple scales:

Speech: Pitch (100-300 Hz) changes slowly; fricatives (2000-8000 Hz) are brief bursts

Music: Bass notes sustain for seconds; percussion is milliseconds

Earthquakes: P-waves (high freq) arrive as sharp spikes; surface waves (low freq) are long undulations

ECG: QRS complex is a quick spike (~100 ms); T-wave is slow and broad (~300 ms)

In all these cases, the "interesting" features at high frequency are SHORT, and the "interesting" features at low frequency are LONG. Wavelets match this natural structure.

STFT vs. Wavelet Tiling

Left: STFT uses uniform tiles. Right: Wavelet uses adaptive tiles. Both cover the same time-frequency plane with the same total area per tile.

A Brief History

Wavelets emerged from multiple fields simultaneously in the 1980s:

Jean Morlet (1982): Geophysicist analyzing seismic signals, needed multi-scale analysis

Alex Grossmann (1984): Physicist who formalized the continuous wavelet transform

Stéphane Mallat (1989): Connected wavelets to filter banks and multi-resolution analysis

Ingrid Daubechies (1988): Constructed compactly-supported orthogonal wavelets — the "db" family

What is the fundamental limitation of the STFT that wavelets solve?

Chapter 1: Mother Wavelet & Scaling

The STFT uses sinusoids (ejωt) modulated by a window. Wavelets use a single prototype function — the mother wavelet ψ(t) — that gets scaled and shifted to analyze the signal at different frequencies and times.

The Mother Wavelet ψ(t)

A mother wavelet is any function ψ(t) that satisfies two conditions:

1. Zero mean: ∫ ψ(t) dt = 0 (it oscillates — has positive and negative parts)

2. Finite energy: ∫ |ψ(t)|2 dt < ∞ (it's localized — decays to zero)

Why "wavelet"? The name means "small wave." Unlike a sinusoid (which oscillates forever), a wavelet is a brief oscillation that dies out. It's localized in BOTH time and frequency — the ideal time-frequency atom.

Scaling: Stretching and Compressing

From the single mother wavelet, we create a family of wavelets at different scales s:

ψs(t) = (1/√s) · ψ(t/s)

What does scaling do?

s > 1 (stretch): Wavelet is wider in time → captures lower frequencies

s < 1 (compress): Wavelet is narrower → captures higher frequencies

s = 1: The mother wavelet itself

The factor 1/√s preserves energy: ||ψs||2 = ||ψ||2 for all s.

Shifting: Moving in Time

To analyze different time positions, we translate:

ψs,τ(t) = (1/√s) · ψ((t - τ)/s)

This is the complete wavelet family: parametrized by scale s (which frequency) and translation τ (which time).

Scale ↔ Frequency Relationship

If the mother wavelet ψ(t) has center frequency f0, then the scaled wavelet ψs(t) has center frequency:

f = f0 / s

Large scale = low frequency. Small scale = high frequency. This is the opposite convention from what you might expect (scale is inversely proportional to frequency).

Worked Example: Scaling a Wavelet

Mother wavelet: Mexican hat with center frequency f0 = 1 Hz and effective duration 1 second.

At scale s = 2: Wavelet stretches to 2 seconds duration, center freq = 0.5 Hz. It's slower and more spread out — suitable for detecting low-frequency features.

At scale s = 0.25: Wavelet compresses to 0.25 seconds, center freq = 4 Hz. It's fast and localized — suitable for detecting sharp transients at 4 Hz.

The energy normalization 1/√s ensures that:

• At s=2: amplitude is 1/√2 ≈ 0.707 (wider but lower)

• At s=0.25: amplitude is 1/√0.25 = 2 (narrower but taller)

• Total area under |ψs|2 is the same for all s

Example: Mexican Hat Wavelet

The Mexican hat wavelet (Ricker wavelet) is the negative second derivative of a Gaussian:

ψ(t) = (2/√3) π-1/4 (1 - t2) e-t2/2

It looks like a sombrero: one positive bump flanked by two negative bumps. Zero mean (positive area = negative area). Localized (Gaussian envelope decays quickly).

Other Common Mother Wavelets

Morlet wavelet: A Gaussian-windowed complex sinusoid: ψ(t) = e0t e-t2/2. Combines the optimal time-frequency localization of the Gaussian with explicit frequency (via ω0). Used extensively in geophysics and neuroscience (EEG analysis).

Haar wavelet: The simplest: +1 on the left half, -1 on the right half, 0 elsewhere. Discontinuous but perfectly localized in time. We'll study it in detail in Chapter 5.

Meyer wavelet: Infinitely smooth, compactly supported in frequency (but not in time). Theoretically beautiful but rarely used in practice due to infinite time support.

Mother Wavelet at Different Scales

The Mexican Hat wavelet stretched (low freq) and compressed (high freq). All have the same energy but different time/frequency spreads.

Scale s 1.0
A mother wavelet has center frequency f₀ = 1000 Hz. At scale s = 4, what frequency does it analyze?

Chapter 2: CWT Definition

Now we have our family of wavelets ψs,τ(t). The Continuous Wavelet Transform (CWT) asks: "How much does the signal f(t) look like the wavelet at scale s and position τ?"

Definition

W(s, τ) = ∫-∞ f(t) · ψ*s,τ(t) dt = <f, ψs,τ>

Expanding the wavelet:

W(s, τ) = (1/√s) ∫-∞ f(t) · ψ*((t - τ)/s) dt
The CWT maps a 1D function f(t) to a 2D function W(s, τ). The input is a signal over time. The output is a surface over (scale, translation) — equivalent to (frequency, time). This is the wavelet analog of the spectrogram, but with adaptive resolution.

Interpretation

W(s, τ) large: The signal at time τ matches the wavelet at scale s well (there's energy at frequency f0/s near time τ)

W(s, τ) ≈ 0: No energy at that scale/time combination

W(s, τ) < 0: Anti-correlation (signal is "opposite" to wavelet at that point)

CWT as Correlation

The CWT at each scale is actually a cross-correlation between f(t) and the scaled wavelet ψs(t). We're sliding the wavelet across the signal and measuring similarity at each position. This is the same template-matching idea from Lecture 9, but with a frequency-adaptive template!

Worked Example

Signal: f(t) = sin(2π · 5t) for 0 < t < 0.5, then sin(2π · 20t) for 0.5 < t < 1.

CWT with Mexican hat wavelet:

• At scale s = f0/5 (tuned to 5 Hz): W is large for τ < 0.5, near zero for τ > 0.5

• At scale s = f0/20 (tuned to 20 Hz): W is near zero for τ < 0.5, large for τ > 0.5

• The CWT correctly localizes which frequency is present at which time!

In Code

python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) wavelet."""
    return (1 - t**2) * np.exp(-t**2 / 2)

def cwt(signal, scales, dt=1.0):
    """Continuous Wavelet Transform.
    Returns (len(scales), len(signal)) coefficient matrix."""
    N = len(signal)
    t = np.arange(N) * dt
    coeffs = np.zeros((len(scales), N))

    for i, s in enumerate(scales):
        # Create wavelet at this scale
        t_wavelet = np.arange(-4*s, 4*s, dt) / s
        wavelet = mexican_hat(t_wavelet) / np.sqrt(s)
        # Correlate with signal
        coeffs[i] = np.correlate(signal, wavelet, mode='same')

    return coeffs
CWT of a Two-Frequency Signal

Top: signal that switches frequency mid-way. Bottom: CWT magnitude (scale vs time). Notice how both the frequency change AND its timing are clearly resolved.

The CWT maps a 1D signal f(t) to a 2D function W(s, τ). What are the two axes?

Chapter 3: Time-Frequency Tiling

The key advantage of wavelets over the STFT becomes crystal clear when we compare their time-frequency tilings.

STFT: Uniform Grid

The STFT divides the time-frequency plane into identical rectangles. Every tile has the same width Δt and height Δf, regardless of frequency. This means:

• At 100 Hz, we have the same time resolution as at 10000 Hz

• At 10000 Hz, we have the same frequency resolution as at 100 Hz

This is wasteful. At 100 Hz, one period is 10 ms — we don't need sub-millisecond timing. But at 10000 Hz, one period is 0.1 ms — we DO need fine timing to resolve individual cycles.

Wavelet: Dyadic Grid

Wavelets use tiles whose width and height depend on scale:

Low frequency (large scale s): Wide in time (Δt = s · Δt0), narrow in freq (Δf = Δf0/s)

High frequency (small scale s): Narrow in time (Δt = s · Δt0), wide in freq (Δf = Δf0/s)

The area of each tile is constant: Δt · Δf = Δt0 · Δf0 (uncertainty principle satisfied).

The wavelet tiling matches how nature works: High-frequency events (clicks, onsets, edges) are brief — wavelets resolve their timing. Low-frequency events (bass notes, trends) are sustained — wavelets resolve their frequency. The STFT compromises everywhere; wavelets are optimal everywhere.

Constant-Q Property

The ratio of center frequency to bandwidth is constant for all wavelet scales:

Q = fcenter / Δf = f0/s ÷ (Δf0/s) = f0/Δf0 = constant

This is called a constant-Q transform. It means the frequency resolution is always the same fraction of the center frequency — exactly matching musical pitch (octaves, semitones). The mel scale in audio ML is motivated by this same idea.

Comparison Table

PropertySTFTCWT
Tile shapeFixed rectanglesAdaptive (scale-dependent)
Δf at low freqSame as at high freqNarrow (good resolution)
Δt at high freqSame as at low freqNarrow (good resolution)
Q factorVaries (Q = f/Δf increases with f)Constant
Basis functionsWindowed sinusoidsScaled wavelets
Musical pitchLinear frequency axisLogarithmic (matches octaves)
STFT vs. Wavelet: Resolving a Multi-Scale Signal

A signal with a low-freq tone + a brief high-freq click. See how each representation handles the two components.

The "constant-Q" property of wavelets means:

Chapter 4: DWT Filter Bank

The CWT is beautiful mathematically, but expensive computationally: it samples scale and translation continuously. The Discrete Wavelet Transform (DWT) samples only at dyadic scales (s = 2j) and achieves the same information content with far less computation.

The Key Insight: Filter Banks

Stéphane Mallat (1989) showed that the DWT can be computed as a cascade of simple filter + downsample operations. No need to explicitly construct wavelets at each scale — just apply two filters repeatedly.

The DWT algorithm: At each level, split the signal into LOW frequencies (approximation) and HIGH frequencies (detail) using a low-pass filter H₀ and a high-pass filter H₁. Then downsample by 2. Repeat on the low-frequency part. This is called a filter bank or subband coding.

One Level of the DWT

Input x[n]
N samples at full resolution
↓ split into two branches
Low-pass H₀
Keep frequencies 0 to fs/4 → downsample ↓2 → approximation a[n] (N/2 samples)
High-pass H₁
Keep frequencies fs/4 to fs/2 → downsample ↓2 → detail d[n] (N/2 samples)

After one level: N/2 approximation coefficients + N/2 detail coefficients = N total. No information lost!

Multi-Level Decomposition

Apply the same split to the approximation coefficients (the low-frequency part) again and again:

Level 1: x → a1 (N/2) + d1 (N/2)

Level 2: a1 → a2 (N/4) + d2 (N/4)

Level 3: a2 → a3 (N/8) + d3 (N/8)

After J levels: aJ (N/2J) + dJ (N/2J) + ... + d2 (N/4) + d1 (N/2) = N coefficients total.

The Filters

For the Haar wavelet: H₀ = [1/√2, 1/√2] (average), H₁ = [1/√2, -1/√2] (difference).

For Daubechies wavelets: longer filters (4, 6, 8... taps) with specific coefficients that guarantee orthogonality and vanishing moments.

Worked Example: One Level of Haar DWT

Signal: x = [4, 6, 10, 12, 8, 6, 5, 5] (8 samples).

Approximation (average pairs):

a = [(4+6)/√2, (10+12)/√2, (8+6)/√2, (5+5)/√2] = [7.07, 15.56, 9.90, 7.07]

Detail (difference of pairs):

d = [(4-6)/√2, (10-12)/√2, (8-6)/√2, (5-5)/√2] = [-1.41, -1.41, 1.41, 0]

Notice: a captures the trend (values going up then down). d captures the local variation (small differences between adjacent samples). The last detail coefficient is 0 because samples 7 and 8 are equal — no local change.

Total: 4 approximation + 4 detail = 8 coefficients = same as input. Perfect reconstruction: x[2k] = (a[k] + d[k])/√2, x[2k+1] = (a[k] - d[k])/√2.

In Code

python
import numpy as np

def dwt_level(x, h0, h1):
    """One level of DWT decomposition."""
    # Convolve and downsample by 2
    approx = np.convolve(x, h0, mode='same')[::2]
    detail = np.convolve(x, h1, mode='same')[::2]
    return approx, detail

def dwt_multilevel(x, h0, h1, levels=3):
    """Multi-level DWT. Returns [aJ, dJ, ..., d2, d1]."""
    details = []
    current = x
    for j in range(levels):
        current, d = dwt_level(current, h0, h1)
        details.append(d)
    return [current] + details[::-1]

# Haar filters
h0 = np.array([1, 1]) / np.sqrt(2)   # low-pass (average)
h1 = np.array([1, -1]) / np.sqrt(2)  # high-pass (difference)
DWT Filter Bank Decomposition

Watch a signal decompose level by level. Each level splits into approximation (low freq) and detail (high freq) with half the samples.

Levels 3
A signal has 1024 samples. After 3 levels of DWT, what are the coefficient counts?

Chapter 5: Haar & Daubechies Wavelets

Different wavelet families are optimized for different signal characteristics. The two most important: Haar (simplest) and Daubechies (most versatile).

Haar Wavelet (db1)

The Haar wavelet (1909, Alfred Haar) is the simplest possible wavelet:

ψ(t) = +1 for 0 ≤ t < 1/2,   -1 for 1/2 ≤ t < 1,   0 otherwise

It's a step function: +1 on the first half, -1 on the second half. The filters are:

• H₀ = [1, 1]/√2 (local average of two samples)

• H₁ = [1, -1]/√2 (local difference of two samples)

The Haar DWT is just averaging and differencing! At each level, take pairs of samples. The average goes to the approximation (smooth trend). The difference goes to the detail (local change). This is the most intuitive wavelet — average removes noise, difference detects edges.

Haar Limitations

The Haar wavelet is discontinuous (step function). This means:

• Poor frequency localization (step has infinite bandwidth)

• Block artifacts in reconstruction (visible staircasing)

• Only 1 vanishing moment (can only represent constants exactly)

Daubechies Wavelets (dbN)

Ingrid Daubechies (1988) constructed a family of orthogonal wavelets with compact support (finite length) and N vanishing moments:

db1 = Haar (2 filter taps, 1 vanishing moment)

db2: 4 filter taps, 2 vanishing moments

db4: 8 filter taps, 4 vanishing moments

db8: 16 filter taps, 8 vanishing moments

Vanishing Moments

A wavelet with N vanishing moments satisfies:

∫ tk ψ(t) dt = 0   for k = 0, 1, ..., N-1

This means it's "blind" to polynomials of degree < N. A wavelet with 4 vanishing moments won't respond to cubic trends — only to features sharper than a cubic. More vanishing moments = better at ignoring smooth background.

WaveletFilter LengthVanishing MomentsBest For
Haar (db1)21Edge detection, binary signals
db242Piecewise-linear signals
db484General audio, EEG
db8168Smooth signals, compression
Wavelet Family Comparison

Compare Haar (blocky) vs. higher-order Daubechies wavelets (smooth). More taps = smoother but wider support.

A signal contains a quadratic trend plus sharp spikes. Which wavelet is best for detecting ONLY the spikes (ignoring the trend)?

Chapter 6: Multi-Resolution Analysis

The DWT's multi-level decomposition gives us something extraordinary: the ability to see a signal at different resolutions simultaneously. Each level peels off one layer of detail, revealing progressively coarser structure.

The MRA Framework

At level j, the signal is represented as:

f(t) = aJ(t) + dJ(t) + dJ-1(t) + ... + d1(t)

Where:

• aJ: Coarsest approximation (overall shape, DC + very low frequencies)

• dj: Detail at level j (captures frequencies in band [fs/2j+1, fs/2j])

This is perfect reconstruction: sum all components and you get back the original signal exactly.

Multi-resolution = zoom levels. Imagine Google Maps. Level 0 = street view (every detail). Level 1 = neighborhood (broad streets, no house numbers). Level 2 = city (only highways). Level 3 = country (only borders). Each "detail" level is what you LOSE going from one zoom to the next. Sum them all = original full-resolution view.

Applications of MRA

Denoising: Noise is typically broadband (present at all scales). Signal features are concentrated at specific scales. Strategy: decompose, threshold small detail coefficients (likely noise), reconstruct.

python
def wavelet_denoise(signal, threshold, levels=4):
    """Denoise via wavelet thresholding."""
    coeffs = dwt_multilevel(signal, h0, h1, levels)
    # Threshold detail coefficients (soft threshold)
    for i in range(1, len(coeffs)):
        coeffs[i] = np.sign(coeffs[i]) * np.maximum(np.abs(coeffs[i]) - threshold, 0)
    # Reconstruct
    return idwt_multilevel(coeffs, g0, g1)

Compression (JPEG2000): Most wavelet coefficients are near zero. Keep only the large ones (say, top 10%). This is why JPEG2000 uses the DWT instead of the DCT (used in JPEG). Wavelets give fewer artifacts at high compression because they handle edges better.

Edge detection: Edges (sharp transitions) produce large detail coefficients at fine scales. Find peaks in d1 = edge locations.

Wavelet Thresholding: Soft vs. Hard

Hard thresholding: Zero out coefficients below threshold λ, keep the rest unchanged.

d̂[n] = d[n] if |d[n]| ≥ λ,   0 otherwise

Soft thresholding: Shrink all coefficients toward zero by λ. Smoother result.

d̂[n] = sign(d[n]) · max(|d[n]| - λ, 0)

Donoho & Johnstone (1994) proved that the universal threshold λ = σ √(2 log N) is nearly minimax-optimal for denoising, where σ is the noise standard deviation and N is the signal length. This remarkable result means wavelet denoising is provably near-optimal with almost no tuning.

Why JPEG2000 Uses Wavelets

JPEG (1992) uses the DCT (block-by-block cosine transform). At high compression, you see blocking artifacts at the 8×8 block boundaries because the DCT processes each block independently.

JPEG2000 (2000) uses the DWT on the entire image. No blocks → no blocking artifacts. At the same bitrate, JPEG2000 gives 2–3 dB better PSNR than JPEG. The advantage is even larger at very low bitrates where JPEG falls apart.

The specific wavelet used is the CDF 9/7 biorthogonal wavelet — chosen because it has good compression performance AND allows reversible (lossless) encoding with an integer variant.

Multi-Resolution Decomposition & Reconstruction

Decompose a signal into levels. Toggle each level on/off to see its contribution. Remove a level = lose those frequencies. This is how wavelet denoising and compression work.

Noise 0.30
In wavelet denoising, we threshold the detail coefficients. Why does this remove noise while preserving signal?

Chapter 7: Mastery

Let's consolidate the wavelet transform and see its place in the signal processing toolkit.

The Complete Wavelet Toolkit

TransformInputOutputUse
CWT1D signal2D (scale × time)Analysis, visualization
DWT1D signal (N samples)N coefficients (multi-level)Compression, denoising
IDWTWavelet coefficientsReconstructed signalSynthesis after modification

Key Formulas

CWT: W(s,τ) = (1/√s) ∫ f(t) ψ*((t-τ)/s) dt
Scaling: ψs,τ(t) = (1/√s) ψ((t-τ)/s)
DWT: a[n] = (x * h₀)[↓2],   d[n] = (x * h₁)[↓2]
Reconstruction: f = aJ + dJ + dJ-1 + ... + d1

Applications Summary

ApplicationHow Wavelets HelpStandard Used
JPEG2000DWT replaces DCT; better edges at high compressionCDF 9/7 wavelet
DenoisingThreshold small coefficients (noise) → reconstructdb4-db8 + soft threshold
Edge detectionLarge detail coefficients at fine scales = edgesHaar or db2
EEG/ECGMulti-scale analysis of brain/heart signalsdb4, Morlet
SeismologyCWT for multi-scale event detectionMorlet wavelet
Speech codingSubband coding for efficient compressionFilter bank (QMF)

DFT vs. STFT vs. Wavelets

DFTSTFTWavelet
Time infoNoneFixed resolutionAdaptive
Freq infoPerfectFixed resolutionAdaptive
BasisSinusoidsWindowed sinusoidsScaled wavelets
Best forStationary signalsSlowly varyingMulti-scale / transients
ComplexityO(N log N)O(N log N)O(N)

Connections

Lecture 8 (STFT): Wavelets overcome the fixed-window limitation

Lecture 9 (Classification): Wavelet coefficients as features for signal classification

Deep Learning: Modern architectures (WaveNet, WavLM) use dilated convolutions — effectively a learned wavelet transform

Vision: Image pyramids in CNNs mirror multi-resolution wavelet analysis

The Bigger Picture: Transform Zoo

The DFT, STFT, and wavelet transform are all members of a larger family of time-frequency representations. Each makes different tradeoffs:

Wigner-Ville distribution: Best possible time-frequency resolution (no uncertainty smearing!) but suffers from cross-term interference (phantom energy between two real components). Quadratic, not linear.

Gabor transform: STFT with Gaussian window. Achieves the uncertainty minimum. The theoretical gold standard, but impractical because discrete Gaussians don't form an orthogonal basis.

Synchrosqueezed wavelet transform: Post-processes the CWT to sharpen ridges. Gives better frequency estimates than raw CWT while maintaining invertibility.

Scattering transform: Cascade of wavelet transforms + modulus. Provably invariant to translations and stable to deformations. A mathematically principled alternative to deep features.

When to Use What

Use DFT when: signal is stationary, you need exact frequency values, speed matters

Use STFT when: signal is slowly varying, you need time-frequency visualization, you want simple reconstruction

Use DWT when: signal has multi-scale structure, you need compression or denoising, you want O(N) speed

Use CWT when: you need detailed time-scale analysis, scale matters more than frequency, visualization is the goal

Ingrid Daubechies (1988): "I had this very strong feeling that wavelets could be really useful... I wanted to construct wavelets that would be as short as possible while still being smooth." — Her dbN family revolutionized signal processing and won her the Wolf Prize in Mathematics (2023).
Why does the DWT have O(N) complexity while the FFT has O(N log N)?