EE269 Lecture 11 — Mert Pilanci, Stanford

Wavelet & STFT Applications

Choosing the right wavelet, mastering the STFT, and applying both to real problems — denoising, compression, edge detection.

Prerequisites: EE269 Lecture 10 (Wavelets) + EE269 Lecture 8 (STFT). That's it.
8
Chapters
7+
Simulations
0
Assumed Knowledge

Chapter 0: Why It Matters

You have a recording of a violin playing a note. Unfortunately, the microphone was cheap and the recording is swamped in broadband noise — a constant hiss that buries the music. How do you clean it up?

Your first instinct: use a frequency-domain filter. Compute the DFT, zero out the noisy bins, inverse DFT. But the noise is everywhere in frequency — it overlaps the violin's harmonics. A simple frequency mask would kill the music along with the noise.

Here's the key insight: the violin's harmonics are structured — they persist over time and concentrate in specific frequency bands. The noise is unstructured — it spreads evenly across all times and frequencies. In the wavelet domain, structured signal produces large coefficients at specific scales, while noise produces small coefficients everywhere.

The strategy: Transform to the wavelet domain, keep the big coefficients (signal), throw away the small ones (noise), transform back. This is called wavelet thresholding, and it is one of the most powerful denoising techniques in signal processing.

But to make this work, you need to choose the right wavelet. Not all wavelets are created equal. A Haar wavelet (blocky, discontinuous) would be terrible for a smooth violin signal. A Daubechies-20 (smooth, wide support) would be great for the violin but terrible for detecting a sharp click in the recording.

This lecture is about making those choices. We'll learn what makes a wavelet "good" for a given task, survey the major wavelet families, compare wavelets to Fourier methods, and then apply everything to real problems: denoising, compression, edge detection, and audio analysis.

Noisy Signal: DFT Filter vs. Wavelet Threshold

A clean signal buried in noise. Compare naive frequency filtering (kills signal too) with wavelet thresholding (preserves signal structure).

Click buttons to compare approaches.
Why can't a simple frequency-domain filter remove noise from a violin recording when the noise is broadband?

Chapter 1: Good Wavelet Criteria

In Lecture 10, we introduced the mother wavelet ψ(t) and showed how scaling and shifting it creates a basis for decomposing signals. But we didn't ask: what makes one mother wavelet better than another?

There are four properties that determine how useful a wavelet is for a given application. Let's examine each one.

1. Compact Time Support

A wavelet has compact time support if ψ(t) = 0 outside some finite interval [a, b]. The support length is b - a. Why does this matter?

Short support means each wavelet coefficient depends on only a few samples of the signal. This gives excellent time localization: if a spike appears at sample n = 100, only the coefficients whose wavelet overlaps n = 100 will be affected. Long-support wavelets spread the spike's influence across many coefficients.

Think of it like a flashlight: A short-support wavelet is a narrow beam — it illuminates exactly one feature at a time. A long-support wavelet is a floodlight — it catches everything in a wide area, making it hard to pinpoint where the interesting thing happened.

2. Compact Frequency Support

Dually, we want Ψ(ω) — the Fourier transform of ψ(t) — to be concentrated in a narrow frequency band. This means each scale of the wavelet decomposition isolates a specific frequency range cleanly, with minimal leakage into other bands.

Here's the catch: by the uncertainty principle, you can't have both arbitrarily short time support AND arbitrarily narrow frequency support. Δt · Δω ≥ 1/2. So every wavelet is a compromise.

3. Smoothness (Vanishing Moments)

A wavelet has N vanishing moments if:

∫ tk ψ(t) dt = 0,   for k = 0, 1, ..., N-1

What does this buy you? If the signal locally looks like a polynomial of degree < N, the wavelet coefficients at that location will be zero (or very small). Since smooth signals can be locally approximated by polynomials, more vanishing moments means smoother signals produce sparser wavelet representations.

Worked example: Haar has N = 1 vanishing moment (∫ ψ(t) dt = 0). It zeroes out constant regions. Daubechies-4 (db4) has N = 4 vanishing moments — it zeroes out cubic trends. For a smooth sine wave, db4 gives much sparser coefficients than Haar.

Sparsity is power. Sparser representations mean better compression (fewer nonzero coefficients to store) and better denoising (signal concentrates in a few big coefficients, noise spreads across all of them).

4. Orthogonality

An orthogonal wavelet basis satisfies ⟨ψj,k, ψm,n⟩ = δjmδkn. This means:

• No redundancy — each coefficient carries unique information

• Parseval's theorem holds — energy in signal domain equals energy in wavelet domain

• Perfect reconstruction is trivial — just sum the weighted basis functions

Not all useful wavelets are orthogonal. The Morlet wavelet (used in geophysics) is redundant but very smooth. Orthogonality is most important when you need to preserve energy or do compression.

The Tradeoff Table

PropertyBenefitsTension With
Short time supportGood time localization, fast computationSmoothness, freq. support
Narrow freq. supportClean frequency separationTime support (uncertainty)
Many vanishing momentsSparser for smooth signalsLonger time support needed
OrthogonalityEnergy preservation, no redundancySmoothness constraints
Vanishing Moments vs. Support Length

Drag the slider to increase vanishing moments. Notice how the wavelet gets longer (more oscillations) but smoother.

Vanishing moments N 1
A wavelet with 4 vanishing moments will produce zero coefficients when the local signal looks like:

Chapter 2: Wavelet Families

Knowing what makes a good wavelet is one thing. Having actual wavelets you can use is another. Let's survey the major families. Each is a collection of wavelets indexed by a parameter (usually order N) that controls the tradeoff between time support and vanishing moments.

Haar (db1)

The simplest possible wavelet:

ψ(t) = +1 for 0 ≤ t < 1/2,   -1 for 1/2 ≤ t < 1,   0 elsewhere

Support: 1. Vanishing moments: 1. Smoothness: None (discontinuous).

Haar is fast and simple. It's the only orthogonal wavelet that is symmetric AND compactly supported. Its discontinuity means it's excellent for detecting sharp edges but terrible for smooth signals — it produces large coefficients at every point where the signal isn't piecewise constant.

Daubechies (dbN)

Ingrid Daubechies (1988) constructed a family of wavelets that are orthogonal, compactly supported, and have maximum vanishing moments for a given support length.

Naming: dbN has N vanishing moments and support length 2N - 1. So db1 = Haar (1 moment, length 1), db2 has 2 moments and length 3, db4 has 4 moments and length 7.

Filter coefficients for db2:

h = [(1+√3)/4,   (3+√3)/4,   (3-√3)/4,   (1-√3)/4] ÷ √2

These are the lowpass filter coefficients. The highpass coefficients come from alternating signs: g[n] = (-1)n h[L-1-n]. This guarantees orthogonality between the scaling function and the wavelet.

Daubechies wavelets are the workhorse of signal processing. db4 through db8 cover most practical applications. They're the default choice when you need orthogonality. The tradeoff: they're asymmetric (not symmetric about their center), which can introduce phase distortion in some applications.

Symlets (symN)

Daubechies' construction was modified to produce wavelets that are as nearly symmetric as possible while retaining orthogonality and compact support. Symlets have the same number of vanishing moments as Daubechies wavelets of the same order, but with less phase distortion.

In practice: sym4 through sym8 are preferred over dbN when linear phase (no phase distortion) matters — especially for image processing.

Coiflets (coifN)

Named after Ronald Coifman, who asked Daubechies to construct wavelets where both the wavelet AND the scaling function have vanishing moments. coifN has 2N vanishing moments for the wavelet and 2N-1 for the scaling function, with support length 6N-1.

When do you need scaling function vanishing moments? When you're using the wavelet transform for numerical approximation — representing functions by their scaling coefficients. Coiflets give better approximation of smooth functions at coarse scales.

Family Comparison

FamilyVanishing MomentsSupport LengthSymmetryBest For
Haar (db1)11SymmetricEdge detection, binary signals
Daubechies (dbN)N2N-1AsymmetricGeneral purpose, compression
Symlets (symN)N2N-1Near-symmetricImage processing
Coiflets (coifN)2N6N-1Near-symmetricNumerical analysis
Wavelet Family Explorer

View the shape of different wavelet families. Higher order = more vanishing moments = smoother but wider.

Select a wavelet family above.

MATLAB: wavedec

In MATLAB, the standard function for multilevel wavelet decomposition is wavedec:

matlab
% Decompose signal x to level 4 using db4
[C, L] = wavedec(x, 4, 'db4');
% C = concatenated coefficients [a4 | d4 | d3 | d2 | d1]
% L = bookkeeping vector of lengths

% Extract detail coefficients at level 3
d3 = detcoef(C, L, 3);

% Reconstruct from modified coefficients
x_recon = waverec(C, L, 'db4');

The output C is a single vector that packs all coefficients end-to-end: approximation at the coarsest level, then details from coarse to fine. L tells you how many coefficients belong to each level.

In Python with PyWavelets:

python
import pywt
import numpy as np

# Decompose to level 4 using db4
coeffs = pywt.wavedec(x, 'db4', level=4)
# coeffs = [a4, d4, d3, d2, d1]  (list of arrays)

# Reconstruct
x_recon = pywt.waverec(coeffs, 'db4')
Daubechies db4 has 4 vanishing moments and support length 7. If you needed more vanishing moments but similar support, which family would you choose?

Chapter 3: Fourier vs Wavelet

Both Fourier and wavelet transforms decompose signals into basis functions. But they have fundamentally different strengths. Understanding when to use each is one of the most important skills in signal processing.

Fourier Strengths

The Convolution Theorem: The DFT turns convolution into multiplication. If you need to filter a signal, the DFT route is often the fastest: DFT both signals, multiply, inverse DFT. Wavelets have no simple analogue — convolution in the wavelet domain is complicated.

x[n] * h[n] ↔ X[k] · H[k]

Uniform spectral resolution: Every frequency bin has the same bandwidth: Δf = fs/N. If your signal has energy evenly distributed across frequency (like white noise analysis or spectral estimation), uniform resolution is exactly what you want.

Phase information: The DFT naturally gives you magnitude AND phase. Phase encodes timing information. For applications like communications (where carrier phase matters), Fourier is the natural choice.

Wavelet Strengths

Adaptive resolution: This is Pilanci's key point: "100 Hz resolution at 400 Hz and at 4000 Hz are not the same." At 400 Hz, 100 Hz resolution means you can distinguish 300 Hz from 500 Hz — a 25% difference. At 4000 Hz, 100 Hz resolution means you can distinguish 3900 Hz from 4100 Hz — a 2.5% difference.

Pilanci's insight: Human perception is logarithmic. A musical octave spans a 2:1 frequency ratio regardless of absolute frequency: 200-400 Hz is one octave, 4000-8000 Hz is one octave. Wavelets give constant relative resolution (Δf/f = constant), which matches perception. Fourier gives constant absolute resolution (Δf = constant), which doesn't.

Sparsity for piecewise-smooth signals: Natural images and audio are piecewise-smooth: smooth regions punctuated by edges or transients. Wavelets represent these signals with very few large coefficients (at the edges) and many near-zero coefficients (in smooth regions). The DFT spreads each edge across ALL frequency bins — it's the opposite of sparse.

Time localization: A wavelet coefficient at scale j, position k tells you "there's energy at frequency ~2j near time k." A Fourier coefficient X[k] tells you "there's energy at frequency k across ALL time." For nonstationary signals, wavelet wins.

Head-to-Head Comparison

FeatureFourier (DFT)Wavelet (DWT)
Basis functionsSinusoids (infinite extent)Localized oscillations (finite extent)
Frequency resolutionUniform (Δf = fs/N)Logarithmic (Δf/f = const)
Time localizationNoneYes (adaptive)
Convolution theoremYes — fast filteringNo simple analogue
Best for stationary signalsYesOverkill
Best for transients/edgesPoorExcellent
Compression of natural signalsModerateExcellent (JPEG 2000 uses wavelets)
Computational costO(N log N) via FFTO(N) via filter bank
Fourier vs. Wavelet: Sparse Representations

A piecewise-smooth signal. Compare the coefficient magnitudes: Fourier spreads energy everywhere; wavelet concentrates it at the discontinuities.

When to Use Which

Use Fourier when: The signal is stationary, you need the convolution theorem, you care about exact frequency values, or you're doing spectral estimation.

Use wavelets when: The signal is nonstationary, you need time-frequency analysis, you're doing compression or denoising of natural signals, or you need to detect transients/edges.

Use STFT when: You need a middle ground — some time localization but with the interpretability of frequency bins (Hz on the axis rather than "scale").

Why do wavelets give better compression than Fourier for natural images?

Chapter 4: STFT Revisited

Let's revisit the STFT with fresh eyes. In Lecture 8, we defined it and explored the time-frequency tradeoff. Now we'll focus on the practical details that matter for applications: the exact formula, window choice, and what the output actually looks like as data.

The STFT Formula (Pilanci's Notation)

Given a signal x[n] and a window function w[m] of length M, the STFT at time frame n and frequency bin k is:

X[n, k] = ∑m=0M-1 x[n + m] · w[m] · e-j(2π/N)km

Note the convention: x[n + m] means we start the window at sample n and look forward M samples. The window w[m] tapers the segment. The complex exponential e-j(2π/N)km extracts frequency bin k.

Compare with DFT: The regular DFT is X[k] = ∑ x[m] e-j(2π/N)km. The STFT just adds the window w[m] and the time shift x[n+m]. Every STFT frame is a windowed DFT of a local segment.

The Output: A 2D Complex Array

The STFT produces X[n, k] — a complex number for each (time frame n, frequency bin k) pair. From it we derive:

Magnitude spectrogram: |X[n, k]| — most common visualization

Power spectrogram: |X[n, k]|2 — used in MFCC computation (Lecture 13)

Phase spectrogram: ∠X[n, k] — important for synthesis/reconstruction

Window Functions for STFT

The window w[m] controls the tradeoff between main lobe width (frequency resolution) and side lobe level (spectral leakage). Common choices:

WindowMain Lobe WidthSide Lobe LevelUse Case
RectangularNarrowest (4π/M)Worst (-13 dB)When you need max frequency resolution
HannModerate (8π/M)Good (-31 dB)General audio analysis
HammingModerate (8π/M)Better (-43 dB)Speech processing
BlackmanWide (12π/M)Best (-58 dB)When leakage must be minimal

The Hann window is the most popular default. It's defined as:

w[m] = 0.5 (1 - cos(2πm / (M-1))),   m = 0, 1, ..., M-1

It tapers smoothly to zero at the edges, which prevents the sharp cutoff artifacts that a rectangular window produces.

STFT of a Two-Tone Signal

A signal with a low tone followed by a high tone. Adjust window length to see the time-frequency tradeoff in action.

Window length M 64

In Code

python
import numpy as np

def stft(x, win_len, hop, n_fft=None):
    """Compute STFT of x with given window length and hop size."""
    if n_fft is None:
        n_fft = win_len
    w = np.hanning(win_len)              # Hann window
    n_frames = (len(x) - win_len) // hop + 1
    X = np.zeros((n_frames, n_fft), dtype=complex)
    for n in range(n_frames):
        start = n * hop
        frame = x[start:start+win_len] * w  # window the segment
        X[n] = np.fft.fft(frame, n_fft)     # DFT of windowed frame
    return X

# Power spectrogram (what you usually plot)
S = np.abs(stft(x, 512, 128))**2
In the STFT, what does increasing the window length M do?

Chapter 5: Wavelet Denoising

This is the payoff. Wavelet denoising is one of the most elegant applications of everything we've learned: choose a wavelet, decompose, threshold, reconstruct. Let's build it step by step, then play with it interactively.

The Denoising Pipeline

1. Decompose
Apply DWT to noisy signal y = x + noise. Get wavelet coefficients at each level.
2. Threshold
Apply a threshold λ to detail coefficients. Small coefficients (likely noise) → zero.
3. Reconstruct
Inverse DWT from the thresholded coefficients. Result: denoised estimate of x.

Why This Works

The insight is statistical. If the signal is piecewise-smooth and the noise is white Gaussian:

Signal produces a few LARGE wavelet coefficients (at edges and features)

Noise produces MANY SMALL wavelet coefficients (spread evenly)

A threshold separates them. Coefficients above the threshold are mostly signal. Coefficients below are mostly noise.

Choosing the Threshold

David Donoho and Iain Johnstone (1994) proved that the universal threshold is near-optimal:

λ = σ √(2 ln N)

Where σ is the noise standard deviation and N is the signal length. This threshold guarantees that, with high probability, all pure-noise coefficients fall below it.

How to estimate σ: Use the finest-scale detail coefficients (level 1). These are dominated by noise because high-frequency signal components are rare. The robust estimate is: σ̂ = median(|d1|) / 0.6745. The constant 0.6745 normalizes the median absolute deviation to match the standard deviation for Gaussian noise.

Hard vs. Soft Thresholding

Hard thresholding: Keep coefficients above λ unchanged, zero out the rest.

ηH(c, λ) = c if |c| ≥ λ,   0 if |c| < λ

Soft thresholding: Shrink coefficients above λ toward zero by λ, zero out the rest.

ηS(c, λ) = sign(c) · max(|c| - λ, 0)

Hard thresholding preserves amplitudes but can create artifacts at the threshold boundary. Soft thresholding is smoother but introduces bias (it shrinks everything). In practice, soft thresholding usually gives better visual results.

Worked Example

Signal: x = [4, 4, 4, 4, 0, 0, 0, 0] (step function). Noise: σ = 0.5. Observed: y = [4.3, 3.8, 4.6, 4.1, 0.2, -0.4, 0.3, -0.1].

Haar DWT of y at level 1 gives approximation a1 and detail d1. The detail coefficient at the step edge is large (~4), while detail coefficients in smooth regions are small (~0.3–0.5). Threshold λ = 0.5 · √(2 ln 8) ≈ 1.02. Only the edge coefficient survives. Reconstruction: a clean step function.

Interactive Wavelet Denoiser

A signal buried in noise. Choose wavelet, adjust noise level and threshold, toggle hard/soft thresholding. Watch the denoised output change in real time.

Noise σ 0.5
Threshold λ 1.0
MSE will appear here.

In Code

python
import numpy as np
import pywt

def wavelet_denoise(y, wavelet='db4', level=4, mode='soft'):
    """Denoise signal y using wavelet thresholding."""
    coeffs = pywt.wavedec(y, wavelet, level=level)

    # Estimate noise from finest detail
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    threshold = sigma * np.sqrt(2 * np.log(len(y)))

    # Threshold detail coefficients (skip approx)
    for i in range(1, len(coeffs)):
        coeffs[i] = pywt.threshold(coeffs[i], threshold, mode=mode)

    return pywt.waverec(coeffs, wavelet)

# Usage
x_clean = wavelet_denoise(y_noisy)
In wavelet denoising, why do we estimate σ from the finest-scale detail coefficients (level 1) rather than from the signal directly?

Chapter 6: Image Applications

Everything we've done for 1D signals extends naturally to 2D images. The key idea: apply the wavelet transform separately along rows and columns. This creates a 2D DWT that decomposes an image into four sub-bands at each level.

2D Wavelet Decomposition

For an image I[x, y], one level of the 2D DWT produces four sub-images:

Sub-bandRow FilterCol FilterContent
LL (Approximation)LowpassLowpassBlurred, downsampled version
LH (Horizontal detail)LowpassHighpassHorizontal edges
HL (Vertical detail)HighpassLowpassVertical edges
HH (Diagonal detail)HighpassHighpassDiagonal edges

The LL sub-band is then recursively decomposed for multi-level analysis. This is exactly the 2D analogue of the 1D filter bank from Lecture 10.

JPEG 2000 uses this exact decomposition. The standard applies 5-7 levels of the 2D DWT (using either the CDF 9/7 or Le Gall 5/3 wavelet), then quantizes and entropy-codes the sub-band coefficients. Because natural images are sparse in the wavelet domain, this achieves much better compression than JPEG's block-DCT approach — especially at low bit rates.

Image Denoising

The 1D denoising pipeline extends directly: decompose image into 2D wavelet sub-bands, threshold the detail sub-bands (LH, HL, HH at each level), reconstruct. The universal threshold still applies, estimated from the finest HH sub-band.

Edge Detection

The LH, HL, and HH sub-bands at the finest scale contain the edges. A wavelet-based edge detector is simply:

1. Compute one level of the 2D DWT

2. Take the magnitude of the detail sub-bands: E[x,y] = √(LH2 + HL2 + HH2)

3. Threshold E to get binary edge map

This is more robust than simple gradient-based edge detection because the wavelet's vanishing moments suppress noise while preserving true edges. Higher-order wavelets (more vanishing moments) give cleaner edge maps but with some loss of localization.

2D Wavelet Decomposition Visualizer

A synthetic image decomposed into LL, LH, HL, HH sub-bands. Toggle to see what each sub-band captures.

Compression: Keep the Big Coefficients

Wavelet compression is conceptually simple:

1. Compute the 2D DWT (5-7 levels)

2. Sort all coefficients by magnitude

3. Keep only the top k% largest coefficients, zero out the rest

4. Inverse DWT to reconstruct

For a typical natural image, keeping just 5-10% of wavelet coefficients produces a visually acceptable reconstruction. The DFT would need 20-40% to achieve similar quality. This is why JPEG 2000 (wavelets) beats JPEG (DCT) at low bit rates.

python
import pywt
import numpy as np

def wavelet_compress(img, wavelet='db4', keep_pct=10):
    """Compress image by keeping top keep_pct% of wavelet coefficients."""
    coeffs = pywt.wavedec2(img, wavelet, level=5)

    # Flatten all coefficients
    arr, slices = pywt.coeffs_to_array(coeffs)
    thresh = np.percentile(np.abs(arr), 100 - keep_pct)
    arr[np.abs(arr) < thresh] = 0

    # Reconstruct
    coeffs_t = pywt.array_to_coeffs(arr, slices, output_format='wavedec2')
    return pywt.waverec2(coeffs_t, wavelet)
In a 2D wavelet decomposition, which sub-band captures horizontal edges?

Chapter 7: Mastery

Let's consolidate. This lecture connected the theoretical wavelet machinery from Lecture 10 to practical tools and applications.

The Decision Framework

If Your Signal Is...Use...Why
Stationary, need spectral analysisDFT/FFTUniform resolution, convolution theorem
Nonstationary, need time + freqSTFTFamiliar Hz axis, tunable resolution
Multi-scale, transients + smoothDWTAdaptive resolution, sparse representation
Audio, need onset detectionSTFT (spectrogram)Standard in audio, well-understood
Image compression2D DWTJPEG 2000, excellent low-bitrate quality
Denoising any 1D/2D signalDWT + thresholdingNear-optimal for piecewise-smooth signals

Wavelet Selection Cheat Sheet

TaskRecommended WaveletWhy
Sharp edge detectionHaar (db1)Shortest support, catches discontinuities
General 1D denoisingdb4 – db8Good smoothness/support balance
Image processingsym4 – sym8Near-symmetry avoids phase artifacts
Smooth signal analysisdb8+ or coif4+Many vanishing moments for sparsity
JPEG 2000 compressionCDF 9/7 (lossy), Le Gall 5/3 (lossless)Biorthogonal, standard compliant

Connections

Looking back: Lecture 10 (Wavelets) built the CWT, DWT, and filter bank. Lecture 8 (STFT) established time-frequency analysis. This lecture brought them together for applications.

Looking forward: Lecture 12 (Linear Systems) formalizes the filter interpretation — what it means for a wavelet filter bank to be an LTI system. Lecture 13 (Cepstrum & MFCC) takes the STFT output and builds speech features from it.

Pilanci's message: The tools from Lectures 6-11 — DFT, STFT, wavelets — form a complete toolkit for frequency analysis. The DFT is the foundation. The STFT adds time localization. Wavelets add adaptive resolution. Knowing when to use each is as important as knowing how.
For denoising a piecewise-smooth signal contaminated by white Gaussian noise, which approach and threshold formula is near-optimal?