Adaptive time-frequency analysis — short windows for high frequencies, long windows for low frequencies.
In Lecture 8, we built the STFT — a powerful tool that gives us time-frequency analysis. But it has one fundamental limitation: the window size is fixed.
Consider music. A bass drum hit lasts ~100 ms and has frequency ~60 Hz. A hi-hat click lasts ~5 ms and has frequency ~8000 Hz. The STFT forces you to choose ONE window size for both:
• Long window (good for bass drum's frequency): smears the hi-hat's timing
• Short window (good for hi-hat's timing): can't resolve the bass drum's pitch
What we really want: long windows at low frequencies (to resolve pitch) and short windows at high frequencies (to resolve timing). This is exactly what wavelets provide.
Nature itself operates at multiple scales:
• Speech: Pitch (100-300 Hz) changes slowly; fricatives (2000-8000 Hz) are brief bursts
• Music: Bass notes sustain for seconds; percussion is milliseconds
• Earthquakes: P-waves (high freq) arrive as sharp spikes; surface waves (low freq) are long undulations
• ECG: QRS complex is a quick spike (~100 ms); T-wave is slow and broad (~300 ms)
In all these cases, the "interesting" features at high frequency are SHORT, and the "interesting" features at low frequency are LONG. Wavelets match this natural structure.
Left: STFT uses uniform tiles. Right: Wavelet uses adaptive tiles. Both cover the same time-frequency plane with the same total area per tile.
Wavelets emerged from multiple fields simultaneously in the 1980s:
• Jean Morlet (1982): Geophysicist analyzing seismic signals, needed multi-scale analysis
• Alex Grossmann (1984): Physicist who formalized the continuous wavelet transform
• Stéphane Mallat (1989): Connected wavelets to filter banks and multi-resolution analysis
• Ingrid Daubechies (1988): Constructed compactly-supported orthogonal wavelets — the "db" family
The STFT uses sinusoids (ejωt) modulated by a window. Wavelets use a single prototype function — the mother wavelet ψ(t) — that gets scaled and shifted to analyze the signal at different frequencies and times.
A mother wavelet is any function ψ(t) that satisfies two conditions:
1. Zero mean: ∫ ψ(t) dt = 0 (it oscillates — has positive and negative parts)
2. Finite energy: ∫ |ψ(t)|2 dt < ∞ (it's localized — decays to zero)
From the single mother wavelet, we create a family of wavelets at different scales s:
What does scaling do?
• s > 1 (stretch): Wavelet is wider in time → captures lower frequencies
• s < 1 (compress): Wavelet is narrower → captures higher frequencies
• s = 1: The mother wavelet itself
The factor 1/√s preserves energy: ||ψs||2 = ||ψ||2 for all s.
To analyze different time positions, we translate:
This is the complete wavelet family: parametrized by scale s (which frequency) and translation τ (which time).
If the mother wavelet ψ(t) has center frequency f0, then the scaled wavelet ψs(t) has center frequency:
Large scale = low frequency. Small scale = high frequency. This is the opposite convention from what you might expect (scale is inversely proportional to frequency).
Mother wavelet: Mexican hat with center frequency f0 = 1 Hz and effective duration 1 second.
At scale s = 2: Wavelet stretches to 2 seconds duration, center freq = 0.5 Hz. It's slower and more spread out — suitable for detecting low-frequency features.
At scale s = 0.25: Wavelet compresses to 0.25 seconds, center freq = 4 Hz. It's fast and localized — suitable for detecting sharp transients at 4 Hz.
The energy normalization 1/√s ensures that:
• At s=2: amplitude is 1/√2 ≈ 0.707 (wider but lower)
• At s=0.25: amplitude is 1/√0.25 = 2 (narrower but taller)
• Total area under |ψs|2 is the same for all s
The Mexican hat wavelet (Ricker wavelet) is the negative second derivative of a Gaussian:
It looks like a sombrero: one positive bump flanked by two negative bumps. Zero mean (positive area = negative area). Localized (Gaussian envelope decays quickly).
Morlet wavelet: A Gaussian-windowed complex sinusoid: ψ(t) = ejω0t e-t2/2. Combines the optimal time-frequency localization of the Gaussian with explicit frequency (via ω0). Used extensively in geophysics and neuroscience (EEG analysis).
Haar wavelet: The simplest: +1 on the left half, -1 on the right half, 0 elsewhere. Discontinuous but perfectly localized in time. We'll study it in detail in Chapter 5.
Meyer wavelet: Infinitely smooth, compactly supported in frequency (but not in time). Theoretically beautiful but rarely used in practice due to infinite time support.
The Mexican Hat wavelet stretched (low freq) and compressed (high freq). All have the same energy but different time/frequency spreads.
Now we have our family of wavelets ψs,τ(t). The Continuous Wavelet Transform (CWT) asks: "How much does the signal f(t) look like the wavelet at scale s and position τ?"
Expanding the wavelet:
• W(s, τ) large: The signal at time τ matches the wavelet at scale s well (there's energy at frequency f0/s near time τ)
• W(s, τ) ≈ 0: No energy at that scale/time combination
• W(s, τ) < 0: Anti-correlation (signal is "opposite" to wavelet at that point)
The CWT at each scale is actually a cross-correlation between f(t) and the scaled wavelet ψs(t). We're sliding the wavelet across the signal and measuring similarity at each position. This is the same template-matching idea from Lecture 9, but with a frequency-adaptive template!
Signal: f(t) = sin(2π · 5t) for 0 < t < 0.5, then sin(2π · 20t) for 0.5 < t < 1.
CWT with Mexican hat wavelet:
• At scale s = f0/5 (tuned to 5 Hz): W is large for τ < 0.5, near zero for τ > 0.5
• At scale s = f0/20 (tuned to 20 Hz): W is near zero for τ < 0.5, large for τ > 0.5
• The CWT correctly localizes which frequency is present at which time!
python import numpy as np def mexican_hat(t): """Mexican hat (Ricker) wavelet.""" return (1 - t**2) * np.exp(-t**2 / 2) def cwt(signal, scales, dt=1.0): """Continuous Wavelet Transform. Returns (len(scales), len(signal)) coefficient matrix.""" N = len(signal) t = np.arange(N) * dt coeffs = np.zeros((len(scales), N)) for i, s in enumerate(scales): # Create wavelet at this scale t_wavelet = np.arange(-4*s, 4*s, dt) / s wavelet = mexican_hat(t_wavelet) / np.sqrt(s) # Correlate with signal coeffs[i] = np.correlate(signal, wavelet, mode='same') return coeffs
Top: signal that switches frequency mid-way. Bottom: CWT magnitude (scale vs time). Notice how both the frequency change AND its timing are clearly resolved.
The key advantage of wavelets over the STFT becomes crystal clear when we compare their time-frequency tilings.
The STFT divides the time-frequency plane into identical rectangles. Every tile has the same width Δt and height Δf, regardless of frequency. This means:
• At 100 Hz, we have the same time resolution as at 10000 Hz
• At 10000 Hz, we have the same frequency resolution as at 100 Hz
This is wasteful. At 100 Hz, one period is 10 ms — we don't need sub-millisecond timing. But at 10000 Hz, one period is 0.1 ms — we DO need fine timing to resolve individual cycles.
Wavelets use tiles whose width and height depend on scale:
• Low frequency (large scale s): Wide in time (Δt = s · Δt0), narrow in freq (Δf = Δf0/s)
• High frequency (small scale s): Narrow in time (Δt = s · Δt0), wide in freq (Δf = Δf0/s)
The area of each tile is constant: Δt · Δf = Δt0 · Δf0 (uncertainty principle satisfied).
The ratio of center frequency to bandwidth is constant for all wavelet scales:
This is called a constant-Q transform. It means the frequency resolution is always the same fraction of the center frequency — exactly matching musical pitch (octaves, semitones). The mel scale in audio ML is motivated by this same idea.
| Property | STFT | CWT |
|---|---|---|
| Tile shape | Fixed rectangles | Adaptive (scale-dependent) |
| Δf at low freq | Same as at high freq | Narrow (good resolution) |
| Δt at high freq | Same as at low freq | Narrow (good resolution) |
| Q factor | Varies (Q = f/Δf increases with f) | Constant |
| Basis functions | Windowed sinusoids | Scaled wavelets |
| Musical pitch | Linear frequency axis | Logarithmic (matches octaves) |
A signal with a low-freq tone + a brief high-freq click. See how each representation handles the two components.
The CWT is beautiful mathematically, but expensive computationally: it samples scale and translation continuously. The Discrete Wavelet Transform (DWT) samples only at dyadic scales (s = 2j) and achieves the same information content with far less computation.
Stéphane Mallat (1989) showed that the DWT can be computed as a cascade of simple filter + downsample operations. No need to explicitly construct wavelets at each scale — just apply two filters repeatedly.
After one level: N/2 approximation coefficients + N/2 detail coefficients = N total. No information lost!
Apply the same split to the approximation coefficients (the low-frequency part) again and again:
Level 1: x → a1 (N/2) + d1 (N/2)
Level 2: a1 → a2 (N/4) + d2 (N/4)
Level 3: a2 → a3 (N/8) + d3 (N/8)
After J levels: aJ (N/2J) + dJ (N/2J) + ... + d2 (N/4) + d1 (N/2) = N coefficients total.
For the Haar wavelet: H₀ = [1/√2, 1/√2] (average), H₁ = [1/√2, -1/√2] (difference).
For Daubechies wavelets: longer filters (4, 6, 8... taps) with specific coefficients that guarantee orthogonality and vanishing moments.
Signal: x = [4, 6, 10, 12, 8, 6, 5, 5] (8 samples).
Approximation (average pairs):
Detail (difference of pairs):
Notice: a captures the trend (values going up then down). d captures the local variation (small differences between adjacent samples). The last detail coefficient is 0 because samples 7 and 8 are equal — no local change.
Total: 4 approximation + 4 detail = 8 coefficients = same as input. Perfect reconstruction: x[2k] = (a[k] + d[k])/√2, x[2k+1] = (a[k] - d[k])/√2.
python import numpy as np def dwt_level(x, h0, h1): """One level of DWT decomposition.""" # Convolve and downsample by 2 approx = np.convolve(x, h0, mode='same')[::2] detail = np.convolve(x, h1, mode='same')[::2] return approx, detail def dwt_multilevel(x, h0, h1, levels=3): """Multi-level DWT. Returns [aJ, dJ, ..., d2, d1].""" details = [] current = x for j in range(levels): current, d = dwt_level(current, h0, h1) details.append(d) return [current] + details[::-1] # Haar filters h0 = np.array([1, 1]) / np.sqrt(2) # low-pass (average) h1 = np.array([1, -1]) / np.sqrt(2) # high-pass (difference)
Watch a signal decompose level by level. Each level splits into approximation (low freq) and detail (high freq) with half the samples.
Different wavelet families are optimized for different signal characteristics. The two most important: Haar (simplest) and Daubechies (most versatile).
The Haar wavelet (1909, Alfred Haar) is the simplest possible wavelet:
It's a step function: +1 on the first half, -1 on the second half. The filters are:
• H₀ = [1, 1]/√2 (local average of two samples)
• H₁ = [1, -1]/√2 (local difference of two samples)
The Haar wavelet is discontinuous (step function). This means:
• Poor frequency localization (step has infinite bandwidth)
• Block artifacts in reconstruction (visible staircasing)
• Only 1 vanishing moment (can only represent constants exactly)
Ingrid Daubechies (1988) constructed a family of orthogonal wavelets with compact support (finite length) and N vanishing moments:
• db1 = Haar (2 filter taps, 1 vanishing moment)
• db2: 4 filter taps, 2 vanishing moments
• db4: 8 filter taps, 4 vanishing moments
• db8: 16 filter taps, 8 vanishing moments
A wavelet with N vanishing moments satisfies:
This means it's "blind" to polynomials of degree < N. A wavelet with 4 vanishing moments won't respond to cubic trends — only to features sharper than a cubic. More vanishing moments = better at ignoring smooth background.
| Wavelet | Filter Length | Vanishing Moments | Best For |
|---|---|---|---|
| Haar (db1) | 2 | 1 | Edge detection, binary signals |
| db2 | 4 | 2 | Piecewise-linear signals |
| db4 | 8 | 4 | General audio, EEG |
| db8 | 16 | 8 | Smooth signals, compression |
Compare Haar (blocky) vs. higher-order Daubechies wavelets (smooth). More taps = smoother but wider support.
The DWT's multi-level decomposition gives us something extraordinary: the ability to see a signal at different resolutions simultaneously. Each level peels off one layer of detail, revealing progressively coarser structure.
At level j, the signal is represented as:
Where:
• aJ: Coarsest approximation (overall shape, DC + very low frequencies)
• dj: Detail at level j (captures frequencies in band [fs/2j+1, fs/2j])
This is perfect reconstruction: sum all components and you get back the original signal exactly.
Denoising: Noise is typically broadband (present at all scales). Signal features are concentrated at specific scales. Strategy: decompose, threshold small detail coefficients (likely noise), reconstruct.
python def wavelet_denoise(signal, threshold, levels=4): """Denoise via wavelet thresholding.""" coeffs = dwt_multilevel(signal, h0, h1, levels) # Threshold detail coefficients (soft threshold) for i in range(1, len(coeffs)): coeffs[i] = np.sign(coeffs[i]) * np.maximum(np.abs(coeffs[i]) - threshold, 0) # Reconstruct return idwt_multilevel(coeffs, g0, g1)
Compression (JPEG2000): Most wavelet coefficients are near zero. Keep only the large ones (say, top 10%). This is why JPEG2000 uses the DWT instead of the DCT (used in JPEG). Wavelets give fewer artifacts at high compression because they handle edges better.
Edge detection: Edges (sharp transitions) produce large detail coefficients at fine scales. Find peaks in d1 = edge locations.
Hard thresholding: Zero out coefficients below threshold λ, keep the rest unchanged.
Soft thresholding: Shrink all coefficients toward zero by λ. Smoother result.
Donoho & Johnstone (1994) proved that the universal threshold λ = σ √(2 log N) is nearly minimax-optimal for denoising, where σ is the noise standard deviation and N is the signal length. This remarkable result means wavelet denoising is provably near-optimal with almost no tuning.
JPEG (1992) uses the DCT (block-by-block cosine transform). At high compression, you see blocking artifacts at the 8×8 block boundaries because the DCT processes each block independently.
JPEG2000 (2000) uses the DWT on the entire image. No blocks → no blocking artifacts. At the same bitrate, JPEG2000 gives 2–3 dB better PSNR than JPEG. The advantage is even larger at very low bitrates where JPEG falls apart.
The specific wavelet used is the CDF 9/7 biorthogonal wavelet — chosen because it has good compression performance AND allows reversible (lossless) encoding with an integer variant.
Decompose a signal into levels. Toggle each level on/off to see its contribution. Remove a level = lose those frequencies. This is how wavelet denoising and compression work.
Let's consolidate the wavelet transform and see its place in the signal processing toolkit.
| Transform | Input | Output | Use |
|---|---|---|---|
| CWT | 1D signal | 2D (scale × time) | Analysis, visualization |
| DWT | 1D signal (N samples) | N coefficients (multi-level) | Compression, denoising |
| IDWT | Wavelet coefficients | Reconstructed signal | Synthesis after modification |
| Application | How Wavelets Help | Standard Used |
|---|---|---|
| JPEG2000 | DWT replaces DCT; better edges at high compression | CDF 9/7 wavelet |
| Denoising | Threshold small coefficients (noise) → reconstruct | db4-db8 + soft threshold |
| Edge detection | Large detail coefficients at fine scales = edges | Haar or db2 |
| EEG/ECG | Multi-scale analysis of brain/heart signals | db4, Morlet |
| Seismology | CWT for multi-scale event detection | Morlet wavelet |
| Speech coding | Subband coding for efficient compression | Filter bank (QMF) |
| DFT | STFT | Wavelet | |
|---|---|---|---|
| Time info | None | Fixed resolution | Adaptive |
| Freq info | Perfect | Fixed resolution | Adaptive |
| Basis | Sinusoids | Windowed sinusoids | Scaled wavelets |
| Best for | Stationary signals | Slowly varying | Multi-scale / transients |
| Complexity | O(N log N) | O(N log N) | O(N) |
• Lecture 8 (STFT): Wavelets overcome the fixed-window limitation
• Lecture 9 (Classification): Wavelet coefficients as features for signal classification
• Deep Learning: Modern architectures (WaveNet, WavLM) use dilated convolutions — effectively a learned wavelet transform
• Vision: Image pyramids in CNNs mirror multi-resolution wavelet analysis
The DFT, STFT, and wavelet transform are all members of a larger family of time-frequency representations. Each makes different tradeoffs:
• Wigner-Ville distribution: Best possible time-frequency resolution (no uncertainty smearing!) but suffers from cross-term interference (phantom energy between two real components). Quadratic, not linear.
• Gabor transform: STFT with Gaussian window. Achieves the uncertainty minimum. The theoretical gold standard, but impractical because discrete Gaussians don't form an orthogonal basis.
• Synchrosqueezed wavelet transform: Post-processes the CWT to sharpen ridges. Gives better frequency estimates than raw CWT while maintaining invertibility.
• Scattering transform: Cascade of wavelet transforms + modulus. Provably invariant to translations and stable to deformations. A mathematically principled alternative to deep features.
• Use DFT when: signal is stationary, you need exact frequency values, speed matters
• Use STFT when: signal is slowly varying, you need time-frequency visualization, you want simple reconstruction
• Use DWT when: signal has multi-scale structure, you need compression or denoising, you want O(N) speed
• Use CWT when: you need detailed time-scale analysis, scale matters more than frequency, visualization is the goal