Peering into the frequency content of signals as they change over time — the spectrogram.
You have a 3-second recording of someone saying "ah" followed by "ss." The "ah" is a vowel — low frequencies, strong harmonics around 100–800 Hz. The "ss" is a fricative — broadband noise above 4000 Hz. These two sounds occupy completely different frequency ranges.
Now you compute the DFT of the entire 3-second clip. What do you get? A single magnitude spectrum that averages the "ah" and the "ss" together. You see energy at both low and high frequencies, but you've lost the crucial information: when each frequency was present.
The DFT is global. It assumes the signal is stationary — that its frequency content doesn't change over time. But real-world signals are almost always nonstationary: speech changes phoneme every 50–100 ms, music changes notes, a siren sweeps frequency continuously.
A chirp is a signal whose frequency increases linearly over time: x(t) = cos(2π(f0 + βt)t). At t = 0 the frequency is f0; at t = T it's f0 + 2βT. Police sirens, bat echolocation, and radar pulses are all chirps.
The DFT of a chirp shows energy spread across all frequencies from f0 to f0 + 2βT — a flat band. You cannot tell that frequency increased linearly. All temporal structure is destroyed.
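A minimal sketch of this effect. The sample rate, duration, f0 and β below are illustrative values chosen for the sketch, not taken from the text. It measures how much of the global spectrum's energy falls inside the swept band, which is all the global DFT can tell you.

```python
import numpy as np

fs = 8000                  # sample rate in Hz (assumed for illustration)
T = 1.0                    # duration in seconds (assumed)
f0, beta = 500.0, 500.0    # start frequency (Hz) and chirp parameter (Hz/s), assumed

t = np.arange(int(fs * T)) / fs
x = np.cos(2 * np.pi * (f0 + beta * t) * t)   # instantaneous frequency f0 + 2*beta*t

# One global DFT over the whole clip
power = np.abs(np.fft.rfft(x)) ** 2
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

# Fraction of total energy inside the swept band [f0, f0 + 2*beta*T]
in_band = (freqs >= f0) & (freqs <= f0 + 2 * beta * T)
band_fraction = power[in_band].sum() / power.sum()
print(f"energy inside {f0:.0f}-{f0 + 2 * beta * T:.0f} Hz: {band_fraction:.1%}")
```

The spectrum says which band was occupied, but nothing in `power` distinguishes a rising sweep from a falling one.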
A chirp (frequency rising over time). The global DFT just shows a flat band. Click "Show Local" to see what windowed analysis reveals.
The fix is beautifully simple. Instead of computing one DFT over the entire signal, we:
1. Cut the signal into short overlapping segments (frames)
2. Multiply each frame by a smooth window function
3. Compute the DFT of each windowed frame
Each frame is short enough (~20–50 ms) that the signal is approximately stationary within it. The result: a grid of spectra over time — a time-frequency representation.
Let's build the STFT from first principles. We start with the DFT you already know, then add one crucial modification: a sliding window.
The Discrete Fourier Transform of a length-N signal x[n]:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1$$
This uses ALL N samples. Every sample contributes equally to every frequency bin X[k]. There is no notion of "local" here — it's a single decomposition of the entire signal.
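To localize in time, we multiply x[n] by a window function g[n] centered at time τ. The window is nonzero only near τ, so it "selects" a short segment of the signal:

$$X_x[\tau, k] = \sum_{n} x[n]\, g[n - \tau]\, e^{-j 2\pi k n / N}$$

This is the Short-Time Fourier Transform. For each time position τ, we get a complete spectrum X_x[τ, k] for k = 0, 1, ..., N-1. (Implementations usually index each frame from its own start, which changes each coefficient only by the phase factor e^{j2πkτ/N}.)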
Window length L: How many samples the window g[n] spans. Typical: 256–4096 samples (16–256 ms at 16 kHz).
Hop size D: How far we slide the window between successive frames. Typical: L/4 to L/2 (75% or 50% overlap). Smaller hop = more frames = denser sampling along the time axis (the time resolution itself is still set by L), at the cost of more computation.
FFT size N: Length of the DFT for each frame. Often N ≥ L (zero-pad the windowed frame for finer frequency sampling).
Signal: x = [1, 2, 3, 4, 5, 6, 7, 8] (N = 8 samples). Window: rectangular, L = 4. Hop: D = 2.
Frame 0 (τ = 0): Extract x[0..3] = [1, 2, 3, 4]. Compute 4-point DFT: X = [10, −2+2j, −2, −2−2j].
Frame 1 (τ = 2): Extract x[2..5] = [3, 4, 5, 6]. Compute 4-point DFT: X = [18, −2+2j, −2, −2−2j].
Frame 2 (τ = 4): Extract x[4..7] = [5, 6, 7, 8]. DFT: X = [26, −2+2j, −2, −2−2j].
Result: a 3×4 matrix (3 frames, 4 frequency bins). The DC component (X[0]) increases across frames — correctly reflecting that the signal's local mean increases over time.
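The worked example can be checked in a few lines of NumPy. This is a sketch that indexes each frame from its own start; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
L, D = 4, 2                      # window length and hop from the example
window = np.ones(L)              # rectangular window

num_frames = (len(x) - L) // D + 1
frames = np.array([x[m * D : m * D + L] * window for m in range(num_frames)])
stft_matrix = np.fft.fft(frames, axis=1)   # one 4-point DFT per row

print(stft_matrix.shape)         # (3, 4)
print(stft_matrix[:, 0].real)    # DC bin per frame: [10. 18. 26.]
```

Only the DC bin differs between frames, because each frame equals the previous one plus a constant offset of 2.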
```python
import numpy as np

def stft(x, window, hop, nfft):
    """Compute the STFT of signal x.

    Returns a complex matrix of shape (num_frames, nfft // 2 + 1).
    """
    L = len(window)
    num_frames = (len(x) - L) // hop + 1
    result = np.zeros((num_frames, nfft // 2 + 1), dtype=complex)
    for m in range(num_frames):
        start = m * hop
        frame = x[start : start + L] * window    # extract and taper one frame
        result[m] = np.fft.rfft(frame, n=nfft)   # zero-pads when nfft > L
    return result
```
Watch the window slide across the signal. Each position produces one column of the STFT. Drag the slider to move the window.
The window g[n] shapes each frame before the DFT. Its choice dramatically affects what you see in the spectrogram. A bad window can mask real spectral features or create phantom ones.
The simplest window is the rectangular window: g[n] = 1 for 0 ≤ n < L, and 0 elsewhere. It just chops out a segment with no tapering. Simple, but problematic.
When you multiply a signal by a rectangle and take the DFT, you're convolving the signal's spectrum with the DFT of the rectangle. That DFT is a periodic sinc (Dirichlet kernel), well approximated near its peak by sin(πx)/(πx), and it has large side lobes: the first is only 13 dB below the main lobe. These side lobes cause spectral leakage: energy from one frequency "leaks" into neighboring bins.
| Window | Formula (0 ≤ n < L) | Main Lobe Width (null-to-null, cycles/sample) | Peak Side Lobe (dB) | Use Case |
|---|---|---|---|---|
| Rectangular | g[n] = 1 | 2/L (narrowest) | -13 dB | Best freq resolution, worst leakage |
| Hann | 0.5(1 - cos(2πn/L)) | 4/L | -31 dB | General purpose |
| Hamming | 0.54 - 0.46cos(2πn/L) | 4/L | -42 dB | Speech processing |
| Blackman | 0.42 - 0.5cos(2πn/L) + 0.08cos(4πn/L) | 6/L | -58 dB | High dynamic range |
| Gaussian | e^(-n²/(2σ²)), n measured from the window center | ∝ 1/σ | none if untruncated | Time-frequency analysis |
Every window faces the same tradeoff:
• Narrow main lobe = better frequency resolution (can resolve two close frequencies)
• Low side lobes = less spectral leakage (weak components aren't masked by strong ones)
You can't have both. The rectangular window has the narrowest main lobe but the worst side lobes (-13 dB). The Blackman window has tiny side lobes (-58 dB) but the widest main lobe. The Hann window is the most common compromise.
```python
import numpy as np

L = 256  # window length
n = np.arange(L)

rect = np.ones(L)
# Symmetric forms (denominator L - 1), matching np.hanning / np.hamming
hann = 0.5 * (1 - np.cos(2 * np.pi * n / (L - 1)))
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))

# Or just use numpy/scipy:
# hann = np.hanning(L)
# hamming = np.hamming(L)
```
Top: time-domain window shape. Bottom: magnitude of its DFT (log scale). Notice how smoother windows have wider main lobes but lower side lobes.
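The side-lobe levels quoted in the table can be estimated numerically. The sketch below zero-pads each window for a finely sampled spectrum and treats the first point where the spectrum starts rising again as the end of the main lobe. The helper name `peak_sidelobe_db` and the choices L = 64 and 64× padding are assumptions for illustration.

```python
import numpy as np

def peak_sidelobe_db(window, pad=64):
    """Highest side lobe relative to the main-lobe peak, in dB."""
    spec = np.abs(np.fft.rfft(window, n=pad * len(window)))
    spec_db = 20 * np.log10(spec / spec[0] + 1e-15)
    # The main lobe ends at the first point where the spectrum starts rising again.
    main_lobe_end = np.where(np.diff(spec_db) > 0)[0][0]
    return spec_db[main_lobe_end:].max()

L = 64
windows = {
    "rectangular": np.ones(L),
    "hann": np.hanning(L),
    "hamming": np.hamming(L),
    "blackman": np.blackman(L),
}
sidelobes = {name: peak_sidelobe_db(w) for name, w in windows.items()}
for name, level in sidelobes.items():
    print(f"{name:12s} peak side lobe = {level:6.1f} dB")
```

The measured values should land close to the table entries; small differences come from the finite window length and the symmetric NumPy definitions.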
Signal: x[n] = sin(2π · 5n/64) for n = 0,...,63. This is exactly 5 cycles in 64 samples — it lands perfectly on DFT bin k = 5.
Rectangular window: the DFT shows sharp spikes at k = 5 and at its mirror bin k = 59 (the signal is real), with zero energy elsewhere. Perfect! No leakage, because the signal is periodic in the window.
Now shift slightly: x[n] = sin(2π · 5.5n/64). The frequency 5.5 falls BETWEEN bins 5 and 6. With a rectangular window, energy leaks into ALL bins (sinc sidelobes). With a Hann window, leakage is concentrated in bins 4–7 (the wider main lobe) but the sidelobes are 18 dB lower.
This is why window choice matters: when your signal's frequency doesn't land exactly on a bin (which is almost always true in practice), the window determines how much energy "smears" into neighboring bins.
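The example can be reproduced numerically. This sketch compares the worst leakage in bins far from the tone (bins 20 to 32, an arbitrary choice) for the on-bin and off-bin cases; the periodic Hann definition is assumed.

```python
import numpy as np

N = 64
n = np.arange(N)
hann = 0.5 * (1 - np.cos(2 * np.pi * n / N))   # periodic Hann (assumed form)

def spectrum_db(freq_bins, window):
    """Magnitude spectrum in dB relative to its own peak."""
    x = np.sin(2 * np.pi * freq_bins * n / N)
    mag = np.abs(np.fft.rfft(x * window))
    return 20 * np.log10(mag / mag.max() + 1e-12)

on_bin_rect = spectrum_db(5.0, np.ones(N))
off_bin_rect = spectrum_db(5.5, np.ones(N))
off_bin_hann = spectrum_db(5.5, hann)

far = np.arange(20, 33)   # bins well away from the tone
print("rect, on-bin  worst far leakage (dB):", on_bin_rect[far].max())
print("rect, off-bin worst far leakage (dB):", off_bin_rect[far].max())
print("hann, off-bin worst far leakage (dB):", off_bin_hann[far].max())
```

The on-bin case sits at the numerical floor, the off-bin rectangular case leaks across the whole spectrum, and the Hann window pushes the far-bin leakage much lower.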
Each window trades off frequency resolution against noise. The equivalent noise bandwidth (ENBW), measured in DFT bins, quantifies how much broadband noise a window lets through compared to a rectangular window of the same length:
• Rectangular: ENBW = 1.0 (reference)
• Hann: ENBW = 1.5 (50% more noise, but much less leakage)
• Hamming: ENBW = 1.36
• Blackman: ENBW = 1.73
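These figures follow from the usual ENBW definition, N · Σw² / (Σw)². A short sketch using NumPy's built-in windows; the length 4096 is arbitrary, and the helper name `enbw_bins` is illustrative.

```python
import numpy as np

def enbw_bins(window):
    """Equivalent noise bandwidth in DFT bins: N * sum(w^2) / (sum(w))^2."""
    w = np.asarray(window, dtype=float)
    return len(w) * np.sum(w ** 2) / np.sum(w) ** 2

N = 4096
enbw = {
    "rectangular": enbw_bins(np.ones(N)),
    "hann": enbw_bins(np.hanning(N)),
    "hamming": enbw_bins(np.hamming(N)),
    "blackman": enbw_bins(np.blackman(N)),
}
for name, value in enbw.items():
    print(f"{name:12s} ENBW = {value:.2f} bins")
```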
• For speech/audio: Hann or Hamming, L = 20–40 ms (320–640 samples at 16 kHz)
• For frequency estimation (resolving close tones): Rectangular or Kaiser (β low)
• For weak signal detection (high dynamic range): Blackman or Kaiser (β high)
• For reconstruction (overlap-add): Hann with 50% or 75% overlap (satisfies COLA condition)
Here is the central dilemma of time-frequency analysis. It's not a limitation of our algorithm — it's a fundamental property of signals themselves.
If we use a short window (say L = 32 samples at 16 kHz = 2 ms):
• Each frame spans only 2 ms → excellent time localization (we know EXACTLY when something happens)
• But 32-point DFT has frequency resolution Δf = fs/L = 16000/32 = 500 Hz → terrible! Two tones 400 Hz apart would merge into one blob.
If we use a long window (L = 2048 samples = 128 ms):
• Frequency resolution is superb: Δf = 16000/2048 ≈ 8 Hz (can resolve tones 10 Hz apart)
• But each frame spans 128 ms → a piano note that starts at t = 50 ms gets smeared across the entire 0–128 ms frame. We lose temporal precision.
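For a window of length L samples at sample rate fs:

$$\Delta t = \frac{L}{f_s}, \qquad \Delta f = \frac{f_s}{L}$$

Their product:

$$\Delta t \cdot \Delta f = \frac{L}{f_s} \cdot \frac{f_s}{L} = 1$$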
No matter what window length you choose, the product Δt · Δf = 1. This is the discrete version of the uncertainty principle (we'll derive the continuous version in Chapter 4).
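As a quick numerical illustration of these two formulas (the helper name `stft_resolution` is made up for this sketch):

```python
def stft_resolution(win_len, fs):
    """Return (delta_t in seconds, delta_f in Hz) for a window of win_len samples."""
    delta_t = win_len / fs
    delta_f = fs / win_len
    return delta_t, delta_f

fs = 16000
for L in (32, 512, 2048):
    dt, df = stft_resolution(L, fs)
    print(f"L={L:5d}: dt={dt * 1e3:6.1f} ms, df={df:7.2f} Hz, product={dt * df:.1f}")
```

Changing L moves resolution between the two axes, and the product column stays at 1.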
Think of the time-frequency plane as a grid of rectangular tiles. Each tile represents one STFT coefficient. The tile at position (τ, k) has width Δt and height Δf. All tiles have the same area Δt · Δf = 1, but their aspect ratio depends on window length:
• Short window: narrow tiles (good time) that are tall (poor freq)
• Long window: wide tiles (poor time) that are short (good freq)
Adjust window length to see how tiles reshape. Area stays constant. A chirp signal is shown — which window resolves it better?
There is no universally correct choice. It depends on your signal:
• Speech: L = 20–40 ms. Phonemes change every ~80 ms, so 20 ms gives good temporal tracking. Fundamental frequency is 80–300 Hz, so Δf ≈ 50 Hz is adequate.
• Music (pitched): L = 50–100 ms. Need good frequency resolution to distinguish notes (semitone spacing at 440 Hz is 26 Hz).
• Transients (drums, clicks): L = 5–10 ms. Need precise onset timing; frequency content is broadband anyway.
Problem: Analyze a male speaker (fundamental f0 = 120 Hz) at sample rate fs = 16 kHz.
To resolve f0, we need Δf < f0/2 = 60 Hz (Rayleigh criterion). Since Δf = fs/L:

$$L > \frac{f_s}{60\ \text{Hz}} = \frac{16000}{60} \approx 267 \text{ samples}$$

But phonemes change every ~80 ms, so we want Δt < 40 ms for decent temporal tracking:

$$L < 0.040\ \text{s} \times 16000\ \text{Hz} = 640 \text{ samples}$$

Sweet spot: L = 400–512 samples (25–32 ms). This resolves the fundamental frequency while tracking phoneme transitions. With L = 512 and hop D = L/4 = 128 samples (8 ms), we get smooth time evolution.
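The two bounds are simple arithmetic. A sketch with the problem's numbers; the variable names are illustrative, and the limits are treated as inclusive for simplicity.

```python
import math

fs = 16000              # sample rate (Hz)
f0 = 120.0              # fundamental frequency (Hz)
max_delta_f = f0 / 2    # frequency-resolution target from the problem (Hz)
max_delta_t_ms = 40     # time-resolution target from the problem (ms)

L_min = math.ceil(fs / max_delta_f)     # smallest L with fs / L at or below 60 Hz
L_max = fs * max_delta_t_ms // 1000     # largest L with L / fs at or below 40 ms
print(L_min, L_max)                     # 267 640
```

Any window length between the two bounds satisfies both targets, and the 400–512 range sits comfortably inside.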
The time-frequency tradeoff we observed isn't just a practical inconvenience — it's a mathematical theorem. No matter how clever your analysis method, you cannot simultaneously achieve perfect time AND perfect frequency resolution.
For any signal x(t) with finite energy, define:
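Time spread (RMS duration around the center of mass t0):

$$\Delta t^2 = \frac{\int (t - t_0)^2\, |x(t)|^2\, dt}{\int |x(t)|^2\, dt}, \qquad t_0 = \frac{\int t\, |x(t)|^2\, dt}{\int |x(t)|^2\, dt}$$

Frequency spread (RMS bandwidth around the center frequency f0):

$$\Delta f^2 = \frac{\int (f - f_0)^2\, |X(f)|^2\, df}{\int |X(f)|^2\, df}, \qquad f_0 = \frac{\int f\, |X(f)|^2\, df}{\int |X(f)|^2\, df}$$

The Heisenberg-Gabor uncertainty principle states:

$$\Delta t \cdot \Delta f \ge \frac{1}{4\pi}$$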
The proof uses three facts:
1. The Fourier transform of tx(t) is (j/2π) dX/df (differentiation property)
2. Parseval's theorem: ∫|x|² dt = ∫|X|² df
3. Cauchy-Schwarz: |⟨u,v⟩|² ≤ ‖u‖² · ‖v‖²
Apply Cauchy-Schwarz with u = tx(t) and v = dx/dt. After simplification (using integration by parts), you get Δt · Δf ≥ 1/(4π). Equality holds when x(t) is a Gaussian: x(t) = e^(−αt²).
The STFT window g[n] is itself a signal. Its time spread Δt_g and frequency spread Δf_g satisfy the uncertainty principle. Since the STFT resolution is determined by the window:
• Time resolution of STFT = Δt_g (duration of the window)
• Frequency resolution of STFT = Δf_g (bandwidth of the window's DFT)
So the STFT inherits the uncertainty principle from its window. No window can beat Δt_g · Δf_g ≥ 1/(4π). The Gaussian window achieves the minimum, which is why Gabor (1946) proposed using Gaussian windows for time-frequency analysis.
The Gabor atom is a Gaussian-windowed complex exponential:
It's localized at time τ with spread σ, and at frequency ω with spread 1/(4πσ). The product Δt · Δf = 1/(4π) — the theoretical minimum. This is the "best possible" time-frequency atom.
Adjust σ of a Gaussian window. As time spread decreases, frequency spread increases — their product stays at the minimum 1/(4π).
Gaussian window with σ = 2 ms at sample rate 16 kHz:
• Time spread: Δt = σ = 2 ms
• Frequency spread: Δf = 1/(4πσ) = 1/(4π × 0.002) ≈ 39.8 Hz
• Product: Δt × Δf = 0.002 × 39.8 = 0.0796 = 1/(4π) ✓
Now double σ to 4 ms:
• Δt = 4 ms (worse time resolution)
• Δf = 19.9 Hz (better frequency resolution)
• Product: 0.004 × 19.9 = 0.0796 (unchanged! Same minimum.)
You can redistribute the "uncertainty budget" between time and frequency, but you can never reduce the total.
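The numbers above can be checked numerically. The sketch below assumes the pulse x(t) = exp(−t²/(4σ²)), chosen so that |x(t)|² has standard deviation σ and therefore Δt = σ as in the example. The grid settings and the helper name `rms_spreads` are illustrative.

```python
import numpy as np

def rms_spreads(sigma, fs=16000, half_width=0.05):
    """Numerically estimate (delta_t, delta_f) of a Gaussian pulse.

    The pulse is exp(-t^2 / (4 sigma^2)), so |x(t)|^2 has standard deviation sigma.
    """
    t = np.arange(-half_width, half_width, 1 / fs)
    x = np.exp(-t ** 2 / (4 * sigma ** 2))

    # RMS duration, weighting time by the energy density |x|^2 (centered at t = 0)
    energy_t = np.abs(x) ** 2
    delta_t = np.sqrt(np.sum(t ** 2 * energy_t) / np.sum(energy_t))

    # RMS bandwidth, weighting frequency by |X|^2 (centered at f = 0)
    X = np.fft.fft(x)
    f = np.fft.fftfreq(len(x), d=1 / fs)
    energy_f = np.abs(X) ** 2
    delta_f = np.sqrt(np.sum(f ** 2 * energy_f) / np.sum(energy_f))
    return delta_t, delta_f

for sigma in (0.002, 0.004):
    dt, df = rms_spreads(sigma)
    print(f"sigma={sigma * 1e3:.0f} ms: dt={dt * 1e3:.2f} ms, df={df:.1f} Hz, "
          f"product={dt * df:.4f} (1/(4*pi)={1 / (4 * np.pi):.4f})")
```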
This is the SAME uncertainty principle as Heisenberg's in quantum mechanics (Δx · Δp ≥ ℏ/2), but for signals. In QM, position and momentum are Fourier pairs. In signal processing, time and frequency are Fourier pairs. The math is identical.
The spectrogram is the squared magnitude of the STFT:

$$S[\tau, k] = \left| X_x[\tau, k] \right|^2$$
It discards phase information and gives a real-valued, non-negative time-frequency energy density. This is what you see when you open audio in Audacity, or analyze speech in Praat, or visualize music in a DAW.
• X-axis: Time (frame index τ, or seconds)
• Y-axis: Frequency (bin index k, or Hz)
• Color/brightness: Energy at that (time, frequency) point. Usually displayed in dB: 10 log₁₀(S[τ,k]).
A pure tone at constant frequency appears as a horizontal line. A chirp appears as a diagonal line. A drum hit appears as a vertical stripe (energy at all frequencies simultaneously). Silence is dark everywhere.
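One way to see the "horizontal line versus diagonal line" difference without plotting is to track the strongest frequency bin in each frame. This is a sketch with assumed test signals (a 1000 Hz tone and a 500 to 1500 Hz chirp at fs = 8000 Hz) and an illustrative helper `peak_track`.

```python
import numpy as np

def peak_track(x, fs, win_len=256, hop=128):
    """Frequency (Hz) of the strongest bin in each Hann-windowed frame."""
    window = np.hanning(win_len)
    num_frames = (len(x) - win_len) // hop + 1
    freqs = np.fft.rfftfreq(win_len, d=1 / fs)
    peaks = np.empty(num_frames)
    for m in range(num_frames):
        frame = x[m * hop : m * hop + win_len] * window
        peaks[m] = freqs[np.argmax(np.abs(np.fft.rfft(frame)))]
    return peaks

fs = 8000
t = np.arange(fs) / fs                               # 1 second of samples
tone = np.cos(2 * np.pi * 1000 * t)                  # constant 1000 Hz
chirp = np.cos(2 * np.pi * (500 + 500 * t) * t)      # sweeps 500 -> 1500 Hz

tone_track = peak_track(tone, fs)
chirp_track = peak_track(chirp, fs)
print("tone  peak range (Hz):", tone_track.min(), tone_track.max())
print("chirp peak range (Hz):", chirp_track.min(), chirp_track.max())
```

The tone's peak stays in one bin for every frame, while the chirp's peak climbs from frame to frame.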
Human pitch perception is logarithmic: the distance between 100 Hz and 200 Hz (one octave) sounds the same as 1000 Hz to 2000 Hz. To match this, we often warp the frequency axis:
• Log-frequency spectrogram: Map frequency bins to log scale
• Mel spectrogram: Apply triangular filter bank on the mel scale (mel(f) = 2595 · log10(1 + f/700))
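The mel formula above and its inverse are easy to write down directly. The sketch below also places ten band centers uniformly on the mel scale between 0 and 8000 Hz; the band count and range are arbitrary choices for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    """mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Center frequencies of 10 bands equally spaced on the mel scale, 0-8000 Hz
n_bands = 10
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), n_bands + 2)
centers_hz = mel_to_hz(mel_points)[1:-1]
print(np.round(centers_hz))
```

Equal steps in mel become progressively wider steps in Hz, which is the warping a mel filter bank applies.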
```python
import numpy as np

def spectrogram(x, win_len=512, hop=128, nfft=512):
    """Compute power spectrogram in dB."""
    window = np.hanning(win_len)
    num_frames = (len(x) - win_len) // hop + 1
    S = np.zeros((nfft // 2 + 1, num_frames))   # rows: frequency bins, columns: frames
    for m in range(num_frames):
        frame = x[m * hop : m * hop + win_len] * window
        X = np.fft.rfft(frame, n=nfft)
        S[:, m] = np.abs(X) ** 2
    # Convert to dB; the small constant avoids log(0)
    S_dB = 10 * np.log10(S + 1e-10)
    return S_dB
```
Choose a signal type and adjust window size and hop. The spectrogram updates in real-time. Notice how window length affects the tradeoff.
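Chirp: With a long window, you see a clean diagonal line (good freq resolution tracks the rising frequency), as long as the frequency changes by less than about Δf within one window; a faster sweep smears the line again. With a short window, the line is thick and blurry (poor freq resolution) but onset is sharp.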
Two Tones: Two horizontal lines at different frequencies. A long window resolves them clearly. A short window may merge them if they're close.
Pulse Train: Vertical stripes (each pulse is impulsive = broadband). Short window gives sharp stripes, long window smears them.
In practice, spectrograms are almost always displayed in decibels (dB) to compress the dynamic range. Human hearing spans ~120 dB (a factor of 10¹² in power), so a linear scale would make quiet components invisible.

$$S_{\mathrm{dB}}[\tau, k] = 10 \log_{10}\!\left( S[\tau, k] + \varepsilon \right)$$

The small constant ε (typically 10⁻¹⁰) prevents log(0). The result is clamped to a display range, typically -80 dB to 0 dB relative to the peak.
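A sketch of the dB conversion with peak normalization and clamping. The function name `power_to_db` and its default arguments are illustrative, not a reference to any particular library.

```python
import numpy as np

def power_to_db(S, eps=1e-10, dynamic_range_db=80.0):
    """Convert a power spectrogram to dB relative to its peak, clamped from below."""
    S = np.asarray(S, dtype=float)
    S_db = 10.0 * np.log10(S + eps)        # eps prevents log(0)
    S_db -= S_db.max()                     # peak becomes 0 dB
    return np.maximum(S_db, -dynamic_range_db)

# Tiny synthetic "spectrogram": powers spanning many orders of magnitude
S = np.array([[1.0, 1e-3],
              [1e-6, 0.0]])
S_db = power_to_db(S)
print(S_db)
```

The zero-power entry would sit at −100 dB because of ε, and the clamp lifts it to the −80 dB display floor.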
Common spectrogram variants in audio ML:
• Linear spectrogram: Linear frequency axis, dB magnitude. Used in speech analysis.
• Mel spectrogram: Frequency warped to mel scale, often 80–128 mel bins. Log-compressed, it is the input to Whisper; wav2vec 2.0 and HuBERT instead consume the raw waveform.
• Log-mel spectrogram: Mel spectrogram in dB. The standard input for most modern audio AI.
• Constant-Q spectrogram: Logarithmic frequency spacing (like wavelets). Good for music analysis.
```python
# Common settings for different domains:

# Speech recognition (Whisper-style)
win_len = 400    # 25 ms at 16 kHz
hop = 160        # 10 ms hop
n_mels = 80      # mel filter banks

# Music analysis
win_len = 2048   # ~93 ms at 22.05 kHz
hop = 512        # ~23 ms hop
n_mels = 128     # more bins for music

# Environmental sound (AudioSet)
win_len = 1024   # 64 ms at 16 kHz
hop = 320        # 20 ms hop
n_mels = 64      # fewer bins sufficient
```
A natural question: can we go back? Given the STFT X_x[τ, k], can we recover the original signal x[n]? The answer is yes, under certain conditions on the window and hop size.
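The simplest reconstruction method is overlap-add (OLA):
1. Inverse-DFT each frame to recover the windowed segment x[n] g[n − mD]
2. Add the segments back at their original positions, so overlapping frames sum
3. Normalize by the sum of the windows covering each sample

$$\hat{x}[n] = \frac{\sum_{m} x[n]\, g[n - mD]}{\sum_{m} g[n - mD]}$$

For perfect reconstruction (x̂[n] = x[n] without the normalization step), the window must satisfy the Constant Overlap-Add (COLA) condition:

$$\sum_{m} g[n - mD] = C \quad \text{for all } n$$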
This means: at every sample position n, the sum of all windows covering that position must be the same constant C. If C = 1, we get perfect reconstruction directly.
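The condition is easy to test numerically: add up hop-shifted copies of the window and check whether the total is flat. The helper names `overlap_sum` and `is_cola` are made up for this sketch.

```python
import numpy as np

def overlap_sum(window, hop, num_frames=16):
    """Sum of hop-shifted copies of the window over the fully covered region."""
    L = len(window)
    total = np.zeros((num_frames - 1) * hop + L)
    for m in range(num_frames):
        total[m * hop : m * hop + L] += window
    return total[L:-L]        # drop start-up and tail, where fewer windows overlap

def is_cola(window, hop, tol=1e-10):
    """Return (is_constant, mean_of_sum) for the given window and hop."""
    s = overlap_sum(window, hop)
    return bool(s.max() - s.min() < tol), float(s.mean())

L = 8
n = np.arange(L)
periodic_hann = 0.5 * (1 - np.cos(2 * np.pi * n / L))
symmetric_hann = np.hanning(L)            # divides by L - 1 instead of L

print(is_cola(periodic_hann, hop=L // 2))    # constant, sum close to 1
print(is_cola(symmetric_hann, hop=L // 2))   # not constant
print(is_cola(np.ones(L), hop=L // 2))       # constant, sum of 2
```

The periodic and symmetric Hann definitions differ only in the denominator, yet only the periodic one is exactly COLA at a hop of L/2.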
In audio processing, we often modify the STFT before reconstructing:
• Noise reduction: Subtract noise spectrum, then reconstruct
• Time stretching: Interpolate STFT frames, then OLA
• Pitch shifting: Shift frequency bins, then OLA
• Source separation: Apply a binary mask to the STFT, then OLA
If the window isn't COLA-compliant, the reconstructed signal will have amplitude modulation artifacts (periodic volume changes at the hop rate).
Hann window of length L = 4: g = [0, 0.75, 0.75, 0] (using the symmetric formula 0.5(1 − cos(2πn/(L − 1))), as returned by np.hanning).
With hop D = L/2 = 2 (50% overlap), the overlapping windows at position n:
• n = 0: covered by window 0 only: g[0] = 0
• n = 1: covered by window 0 only: g[1] = 0.75
• n = 2: covered by windows 0 and 1: g[2] + g[0] = 0.75 + 0 = 0.75
• n = 3: covered by windows 0 and 1: g[3] + g[1] = 0 + 0.75 = 0.75
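After the startup transient, the sum is constant at 0.75 for all n ≥ 1 (the zero endpoints of one window line up with the 0.75 samples of the next). Divide by 0.75 → perfect reconstruction. This works out exactly for L = 4. In general it is the periodic Hann, g[n] = 0.5(1 − cos(2πn/L)), that sums to exactly 1.0 at 50% overlap, so no normalization is needed. The symmetric form returned by np.hanning(L) divides by L − 1 instead and is only approximately constant at hop L/2.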
```python
import numpy as np

def istft_ola(stft_matrix, window, hop):
    """Reconstruct signal from STFT via overlap-add."""
    num_frames, nfreqs = stft_matrix.shape
    nfft = (nfreqs - 1) * 2
    L = len(window)
    output_len = (num_frames - 1) * hop + L
    x = np.zeros(output_len)
    win_sum = np.zeros(output_len)
    for m in range(num_frames):
        # Back to the time domain; keep only the L samples that held the frame
        frame = np.fft.irfft(stft_matrix[m], n=nfft)[:L]
        start = m * hop
        # The window is applied a second time at synthesis,
        # so the normalizer accumulates window ** 2
        x[start : start + L] += frame * window
        win_sum[start : start + L] += window ** 2
    # Normalize where windows overlap; avoid dividing by zero at the edges
    win_sum[win_sum < 1e-8] = 1.0
    return x / win_sum
```
Watch how overlapping windowed frames sum to reconstruct the original signal. Green = original, orange = individual frames, teal = reconstruction.
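A round-trip check makes the reconstruction claim concrete. The sketch below restates compact versions of the forward and inverse transforms so that it runs on its own. It assumes a periodic Hann window, 75% overlap and a random test signal, and compares only interior samples, since the first and last window lengths are covered by fewer frames.

```python
import numpy as np

def stft(x, window, hop, nfft):
    """Frame, window and transform; returns shape (num_frames, nfft // 2 + 1)."""
    L = len(window)
    num_frames = (len(x) - L) // hop + 1
    return np.array([np.fft.rfft(x[m * hop : m * hop + L] * window, n=nfft)
                     for m in range(num_frames)])

def istft(stft_matrix, window, hop, nfft):
    """Weighted overlap-add with the same window used for synthesis."""
    L = len(window)
    num_frames = stft_matrix.shape[0]
    x = np.zeros((num_frames - 1) * hop + L)
    norm = np.zeros_like(x)
    for m in range(num_frames):
        frame = np.fft.irfft(stft_matrix[m], n=nfft)[:L]
        x[m * hop : m * hop + L] += frame * window
        norm[m * hop : m * hop + L] += window ** 2
    norm[norm < 1e-8] = 1.0          # avoid division by zero at the edges
    return x / norm

rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)
L, hop, nfft = 512, 128, 512
window = 0.5 * (1 - np.cos(2 * np.pi * np.arange(L) / L))   # periodic Hann

reconstructed = istft(stft(signal, window, hop, nfft), window, hop, nfft)
interior = slice(L, len(reconstructed) - L)
max_err = np.max(np.abs(reconstructed[interior] - signal[interior]))
print(f"max interior reconstruction error: {max_err:.2e}")
```

With an unmodified STFT the error is at the level of floating-point rounding; any processing applied between the two calls is what changes the output.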
Let's consolidate everything from this lecture and connect it to the broader signal processing landscape.
| Method | Time-Freq Tiling | Strength | Weakness |
|---|---|---|---|
| DFT | No time axis | Perfect freq resolution | No temporal info |
| STFT | Fixed-size tiles | Simple, well-understood, invertible | Fixed resolution at all frequencies |
| Wavelet | Adaptive tiles | Good time @ high freq, good freq @ low freq | More complex, no unique inverse |
| Wigner-Ville | Point-wise | Best resolution theoretically | Cross-terms (interference artifacts) |
• Lecture 7 (Spectral Descriptors): Descriptors are computed per-STFT-frame to get time-varying features.
• Lecture 10 (Wavelets): Overcomes the fixed-window limitation with scale-adaptive analysis.
• Mel spectrograms → deep learning: Many modern speech/audio models (e.g. Whisper, Audio Spectrogram Transformer) take STFT-based log-mel spectrograms as input; others, such as wav2vec 2.0, learn directly from the raw waveform.
• Whisper (OpenAI): Converts audio to 80-channel log-mel spectrogram (25ms window, 10ms hop), then feeds to a Transformer encoder-decoder. The spectrogram IS the input representation.
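• Music source separation (Demucs): The hybrid variants of Demucs process the waveform and its complex STFT in parallel branches of a neural network; the spectral branch returns to the time domain through an inverse STFT.
• Real-time noise suppression (earbuds, conferencing software): A common pipeline is STFT → spectral subtraction or masking → inverse STFT. The algorithmic latency is at least one window length, so low-latency systems use short windows.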