EE269 Lecture 8 — Mert Pilanci, Stanford

Short-Time Fourier Transform

Peering into the frequency content of signals as they change over time — the spectrogram.

Prerequisites: EE269 Lecture 6 (DFT) + Complex exponentials. That's it.
8 Chapters · 6+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Local Spectra

You have a 3-second recording of someone saying "ah" followed by "ss." The "ah" is a vowel — low frequencies, strong harmonics around 100–800 Hz. The "ss" is a fricative — broadband noise above 4000 Hz. These two sounds occupy completely different frequency ranges.

Now you compute the DFT of the entire 3-second clip. What do you get? A single magnitude spectrum that averages the "ah" and the "ss" together. You see energy at both low and high frequencies, but you've lost the crucial information: when each frequency was present.

The DFT is global. It assumes the signal is stationary — that its frequency content doesn't change over time. But real-world signals are almost always nonstationary: speech changes phoneme every 50–100 ms, music changes notes, a siren sweeps frequency continuously.

The fundamental problem: The DFT tells you WHAT frequencies are present, but not WHEN they occur. For nonstationary signals, we need a representation that captures both time and frequency simultaneously.

A Motivating Example: The Chirp

A chirp is a signal whose frequency increases linearly over time: x(t) = cos(2π(f0 + βt)t). At t = 0 the frequency is f0; at t = T it's f0 + 2βT. Police sirens, bat echolocation, and radar pulses are all chirps.

The DFT of a chirp shows energy spread across all frequencies from f0 to f0 + 2βT — a flat band. You cannot tell that frequency increased linearly. All temporal structure is destroyed.
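A minimal NumPy sketch of this effect. The sample rate, duration, f0 and β are illustrative values chosen for this example, not taken from the lecture. It measures how much of the global spectrum's energy sits in the swept band, then compares the peak frequency of a short frame at the start with one at the end.

```python
import numpy as np

fs = 8000                  # sample rate in Hz (assumed for illustration)
T = 1.0                    # duration in seconds (assumed)
f0, beta = 500.0, 750.0    # start frequency (Hz) and chirp rate (Hz/s), assumed

t = np.arange(int(fs * T)) / fs
x = np.cos(2 * np.pi * (f0 + beta * t) * t)   # instantaneous frequency f0 + 2*beta*t

# Global DFT: energy is spread over the whole swept band f0 .. f0 + 2*beta*T.
power = np.abs(np.fft.rfft(x)) ** 2
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
in_band = (freqs >= f0) & (freqs <= f0 + 2 * beta * T)
band_fraction = power[in_band].sum() / power.sum()

def peak_frequency(frame, fs):
    """Frequency (Hz) of the largest DFT magnitude in a Hann-windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    return np.fft.rfftfreq(len(frame), d=1 / fs)[np.argmax(spectrum)]

frame_len = 400            # 50 ms frames
early_peak = peak_frequency(x[:frame_len], fs)
late_peak = peak_frequency(x[-frame_len:], fs)

print(f"fraction of energy inside the swept band: {band_fraction:.3f}")
print(f"peak frequency, first 50 ms: {early_peak:.0f} Hz")
print(f"peak frequency, last 50 ms:  {late_peak:.0f} Hz")
```

The global spectrum only says that the band from 500 Hz to 2000 Hz is occupied. The two short frames show that the low end of the band comes first and the high end last.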

Global DFT Fails on Nonstationary Signals

Figure: a chirp (frequency rising over time). The global DFT shows only a flat band, while windowed (local) analysis reveals the rising frequency.

The Solution: Divide and Conquer

The fix is beautifully simple. Instead of computing one DFT over the entire signal, we:

1. Cut the signal into short overlapping segments (frames)

2. Multiply each frame by a smooth window function

3. Compute the DFT of each windowed frame

Each frame is short enough (~20–50 ms) that the signal is approximately stationary within it. The result: a grid of spectra over time — a time-frequency representation.

Key insight: The Short-Time Fourier Transform trades one long DFT for many short DFTs. Each short DFT captures the local frequency content at a specific moment in time. Together, they form the spectrogram — arguably the most important visualization in audio signal processing.
Why does the standard (global) DFT fail for a chirp signal?

Chapter 1: STFT Definition

Let's build the STFT from first principles. We start with the DFT you already know, then add one crucial modification: a sliding window.

Recall: The DFT

The Discrete Fourier Transform of a length-N signal x[n]:

X[k] = ∑_{n=0}^{N-1} x[n] · e^{-j2πkn/N},   k = 0, 1, ..., N-1

This uses ALL N samples. Every sample contributes equally to every frequency bin X[k]. There is no notion of "local" here — it's a single decomposition of the entire signal.

Adding a Window

To localize in time, we multiply x[n] by a window function g[n] centered at time τ. The window is nonzero only near τ, so it "selects" a short segment of the signal:

X_x[τ, k] = ∑_{n=0}^{N-1} x[n] · g[n - τ] · e^{-j(2π/N)kn}

This is the Short-Time Fourier Transform. For each time position τ, we get a complete spectrum X_x[τ, k] for k = 0, 1, ..., N-1.

The STFT is a 2D function: Input: 1D signal x[n]. Output: 2D array X_x[τ, k] indexed by time τ and frequency k. We've traded one dimension for two, and that is where the extra information (temporal localization) comes from.

Parameters

Window length L: How many samples the window g[n] spans. Typical: 256–4096 samples (16–256 ms at 16 kHz).

Hop size D: How far we slide the window between successive frames. Typical: L/4 to L/2 (75% or 50% overlap). Smaller hop = more frames = smoother time resolution but more computation.

FFT size N: Length of the DFT for each frame. Often N ≥ L (zero-pad the windowed frame for finer frequency sampling).
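To see how the three parameters interact, here is a small helper that summarizes the grid they imply. The function name `stft_layout` and the example values are hypothetical, and the frame count assumes no padding at the signal edges (only frames that fit entirely inside the signal are counted).

```python
def stft_layout(num_samples, fs, win_len, hop, nfft):
    """Summarize the time-frequency grid implied by the STFT parameters."""
    if nfft < win_len:
        raise ValueError("nfft must be at least win_len")
    # Only frames that fit completely inside the signal are counted.
    num_frames = (num_samples - win_len) // hop + 1 if num_samples >= win_len else 0
    return {
        "num_frames": num_frames,
        "overlap_percent": 100.0 * (win_len - hop) / win_len,
        "window_ms": 1000.0 * win_len / fs,
        "hop_ms": 1000.0 * hop / fs,
        "bin_spacing_hz": fs / nfft,   # spacing of DFT samples, set by N, not L
    }

# One second of audio at 16 kHz, 25 ms window, 10 ms hop, zero-padded to N = 512.
layout = stft_layout(num_samples=16000, fs=16000, win_len=400, hop=160, nfft=512)
for key, value in layout.items():
    print(f"{key}: {value}")
```

Zero-padding to N > L makes the bin spacing finer (31.25 Hz here instead of 40 Hz), but it only interpolates the spectrum. The ability to separate two nearby tones is still set by the window length L.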

The Procedure Step by Step

1. Position window: center g[n] at time τ = m·D for frame m.

2. Extract and multiply: y[n] = x[n] · g[n - τ] (zero outside the window).

3. DFT: X_x[τ, k] = DFT{y[n]} for k = 0, ..., N-1.

Then slide the window by D samples and repeat.

Worked Example

Signal: x = [1, 2, 3, 4, 5, 6, 7, 8] (N = 8 samples). Window: rectangular, L = 4. Hop: D = 2.

Frame 0 (τ = 0): Extract x[0..3] = [1, 2, 3, 4]. Compute 4-point DFT:

X[0] = 10,   X[1] = -2+2j,   X[2] = -2,   X[3] = -2-2j

Frame 1 (τ = 2): Extract x[2..5] = [3, 4, 5, 6]. Compute 4-point DFT:

X[0] = 18,   X[1] = -2+2j,   X[2] = -2,   X[3] = -2-2j

Frame 2 (τ = 4): Extract x[4..7] = [5, 6, 7, 8]. DFT:

X[0] = 26,   X[1] = -2+2j,   X[2] = -2,   X[3] = -2-2j

Result: a 3×4 matrix (3 frames, 4 frequency bins). The DC component (X[0]) increases across frames — correctly reflecting that the signal's local mean increases over time.
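The numbers above can be checked with a few lines of NumPy. This sketch takes the DFT of each frame with indices counted from the start of the frame, which is the convention the worked example uses.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
L, D = 4, 2                               # rectangular window length and hop
num_frames = (len(x) - L) // D + 1

# One row per frame; each row is the 4-point DFT of x[m*D : m*D + L].
frames = np.stack([x[m * D : m * D + L] for m in range(num_frames)])
X = np.fft.fft(frames, axis=1)

print(np.round(X, 10))
```

Bins k = 1, 2, 3 are identical in every frame because each frame equals the previous one plus the constant 2, and a constant offset only changes the DC bin.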

In Code

python
import numpy as np

def stft(x, window, hop, nfft):
    """Compute the STFT of signal x.
    Returns complex matrix of shape (num_frames, nfft//2+1)."""
    L = len(window)
    num_frames = (len(x) - L) // hop + 1
    result = np.zeros((num_frames, nfft // 2 + 1), dtype=complex)
    for m in range(num_frames):
        start = m * hop
        frame = x[start : start + L] * window
        result[m] = np.fft.rfft(frame, n=nfft)
    return result

Sliding Window Visualization

Figure: the window slides across the signal. Each position produces one column of the STFT.
If signal length is 1024 samples, window length L = 256, and hop size D = 128, how many STFT frames are produced?

Chapter 2: Window Functions

The window g[n] shapes each frame before the DFT. Its choice dramatically affects what you see in the spectrogram. A bad window can mask real spectral features or create phantom ones.

Why Not Just Use a Rectangle?

The simplest window is the rectangular window: g[n] = 1 for 0 ≤ n < L, and 0 elsewhere. It just chops out a segment with no tapering. Simple, but problematic.

When you multiply a signal by a rectangle and take the DFT, you're convolving the signal's spectrum with the spectrum of the rectangle. That spectrum is a periodic sinc (the Dirichlet kernel), which near its peak is close to sin(πx)/(πx) and has large side lobes. These side lobes cause spectral leakage: energy from one frequency "leaks" into neighboring bins.

Think of it this way: Abruptly chopping a signal creates artificial discontinuities at the frame edges. The DFT interprets these discontinuities as high-frequency content that isn't actually in the signal. Tapering the window to zero at the edges eliminates these artificial edges.

Common Windows

| Window | Formula | Main Lobe Width (cycles/sample) | Peak Side Lobe (dB) | Use Case |
| --- | --- | --- | --- | --- |
| Rectangular | g[n] = 1 | 2/L (narrowest) | -13 dB | Best freq resolution, worst leakage |
| Hann | 0.5(1 - cos(2πn/L)) | 4/L | -31 dB | General purpose |
| Hamming | 0.54 - 0.46cos(2πn/L) | 4/L | -42 dB | Speech processing |
| Blackman | 0.42 - 0.5cos(2πn/L) + 0.08cos(4πn/L) | 6/L | -58 dB | High dynamic range |
| Gaussian | e^{-n²/(2σ²)} | ~4σ/L | -∞ (no lobes) | Time-frequency analysis |

The Tradeoff: Main Lobe vs Side Lobes

Every window faces the same tradeoff:

Narrow main lobe = better frequency resolution (can resolve two close frequencies)

Low side lobes = less spectral leakage (weak components aren't masked by strong ones)

You can't have both. The rectangular window has the narrowest main lobe but the worst side lobes (-13 dB). The Blackman window has tiny side lobes (-58 dB) but the widest main lobe. The Hann window is the most common compromise.

Window Design in Code

python
import numpy as np

L = 256  # window length
n = np.arange(L)

rect    = np.ones(L)
hann    = 0.5 * (1 - np.cos(2 * np.pi * n / (L - 1)))
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))

# Or just use numpy/scipy:
# hann = np.hanning(L)
# hamming = np.hamming(L)

Window Functions & Their Spectra

Top: time-domain window shape. Bottom: magnitude of its DFT (log scale). Notice how smoother windows have wider main lobes but lower side lobes.

Worked Example: Spectral Leakage

Signal: x[n] = sin(2π · 5n/64) for n = 0,...,63. This is exactly 5 cycles in 64 samples — it lands perfectly on DFT bin k = 5.

Rectangular window: DFT shows sharp spikes at k = 5 and at its mirror bin k = 59 (the signal is real), with zero energy elsewhere. Perfect! No leakage because the signal is periodic in the window.

Now shift slightly: x[n] = sin(2π · 5.5n/64). The frequency 5.5 falls BETWEEN bins 5 and 6. With a rectangular window, energy leaks into ALL bins (sinc sidelobes). With a Hann window, leakage is concentrated in bins 4–7 (the wider main lobe) but the sidelobes are 18 dB lower.

This is why window choice matters: when your signal's frequency doesn't land exactly on a bin (which is almost always true in practice), the window determines how much energy "smears" into neighboring bins.
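The worked example can be reproduced numerically. This is a sketch using the periodic Hann form from the table; the choice of bin 20 as a "far away" bin is arbitrary.

```python
import numpy as np

N = 64
n = np.arange(N)
rect = np.ones(N)
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)    # periodic Hann

def spectrum_db(cycles, window):
    """Magnitude spectrum (bins 0..N/2) in dB relative to its own peak."""
    x = np.sin(2 * np.pi * cycles * n / N)
    mag = np.abs(np.fft.rfft(x * window))
    return 20 * np.log10(mag / mag.max() + 1e-12)

on_bin_rect = spectrum_db(5.0, rect)     # exactly 5 cycles: lands on bin 5
off_bin_rect = spectrum_db(5.5, rect)    # 5.5 cycles: falls between bins 5 and 6
off_bin_hann = spectrum_db(5.5, hann)

far_bin = 20
print(f"on-bin, rectangular: largest level away from bin 5 = "
      f"{np.max(np.delete(on_bin_rect, 5)):.0f} dB")
print(f"off-bin, level at bin {far_bin}: rectangular = {off_bin_rect[far_bin]:.1f} dB, "
      f"Hann = {off_bin_hann[far_bin]:.1f} dB")
```

Far from the tone, the Hann-windowed spectrum sits well below the rectangular one, at the cost of a wider peak around bins 5 and 6.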

Equivalent Noise Bandwidth (ENBW)

Each window trades off frequency resolution against noise. The ENBW quantifies how much noise a window lets through compared to a rectangular window:

ENBW = N · ∑ g[n]² / (∑ g[n])²   (in DFT bins, where N is the number of window samples)

• Rectangular: ENBW = 1.0 (reference)

• Hann: ENBW = 1.5 (50% more noise, but much less leakage)

• Hamming: ENBW = 1.36

• Blackman: ENBW = 1.73
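These values follow directly from the formula. A short check using NumPy's built-in (symmetric) windows, with L = 1024 as an arbitrary example length:

```python
import numpy as np

def enbw_bins(window):
    """Equivalent noise bandwidth in DFT bins: N * sum(g^2) / (sum(g))^2."""
    window = np.asarray(window, dtype=float)
    return len(window) * np.sum(window ** 2) / np.sum(window) ** 2

L = 1024
windows = {
    "rectangular": np.ones(L),
    "hann": np.hanning(L),
    "hamming": np.hamming(L),
    "blackman": np.blackman(L),
}
enbw = {name: enbw_bins(g) for name, g in windows.items()}
for name, value in enbw.items():
    print(f"{name:12s} ENBW = {value:.2f} bins")
```

For short windows the results drift slightly from the listed values, because the symmetric NumPy windows differ from the ideal continuous shapes by a factor of roughly L/(L - 1).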

Practical Guidelines

• For speech/audio: Hann or Hamming, L = 20–40 ms (320–640 samples at 16 kHz)

• For frequency estimation (resolving close tones): Rectangular or Kaiser (β low)

• For weak signal detection (high dynamic range): Blackman or Kaiser (β high)

• For reconstruction (overlap-add): Hann with 50% or 75% overlap (satisfies COLA condition)

Why does the Hann window cause less spectral leakage than the rectangular window?

Chapter 3: Time-Frequency Tradeoff

Here is the central dilemma of time-frequency analysis. It's not a limitation of our algorithm — it's a fundamental property of signals themselves.

Short Window: Good Time, Poor Frequency

If we use a short window (say L = 32 samples at 16 kHz = 2 ms):

• Each frame spans only 2 ms → excellent time localization (we know EXACTLY when something happens)

• But 32-point DFT has frequency resolution Δf = fs/L = 16000/32 = 500 Hz → terrible! Two tones 400 Hz apart would merge into one blob.

Long Window: Good Frequency, Poor Time

If we use a long window (L = 2048 samples = 128 ms):

• Frequency resolution is superb: Δf = 16000/2048 ≈ 8 Hz (can resolve tones 10 Hz apart)

• But each frame spans 128 ms → a piano note that starts at t = 50 ms gets smeared across the entire 0–128 ms frame. We lose temporal precision.

The tradeoff: Time resolution Δt and frequency resolution Δf are inversely related. Narrow window → small Δt, large Δf. Wide window → large Δt, small Δf. You cannot have both simultaneously. This is NOT a software limitation — it's a law of nature.

Quantifying the Tradeoff

For a window of length L samples at sample rate fs:

Δt = L / fs   (time resolution, seconds)
Δf = fs / L   (frequency resolution, Hz)

Their product:

Δt · Δf = (L / fs) · (fs / L) = 1

No matter what window length you choose, the product Δt · Δf = 1. This is the discrete version of the uncertainty principle (we'll derive the continuous version in Chapter 4).
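A quick tabulation of the two formulas for a few window lengths at fs = 16 kHz (the specific lengths are examples):

```python
fs = 16000  # Hz

def resolution(win_len, fs):
    """Return (delta_t in seconds, delta_f in Hz) for a window of win_len samples."""
    return win_len / fs, fs / win_len

for win_len in (32, 256, 512, 2048):
    dt, df = resolution(win_len, fs)
    print(f"L = {win_len:5d}: dt = {1000 * dt:6.1f} ms, "
          f"df = {df:6.1f} Hz, product = {dt * df:.1f}")
```

Every row has product 1: halving Δf always doubles Δt.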

Visualizing the Tiles

Think of the time-frequency plane as a grid of rectangular tiles. Each tile represents one STFT coefficient. The tile at position (τ, k) has width Δt and height Δf. All tiles have the same area Δt · Δf = 1, but their aspect ratio depends on window length:

• Short window: narrow tiles (good time) that are tall (poor freq)

• Long window: wide tiles (poor time) that are short (good freq)

Time-Frequency Tiling

Figure: time-frequency tiles for different window lengths, overlaid on a chirp signal. The tiles reshape as the window length changes, but their area stays constant. Which window resolves the chirp better?

The "Right" Window Length

There is no universally correct choice. It depends on your signal:

Speech: L = 20–40 ms. Phonemes change every ~80 ms, so 20 ms gives good temporal tracking. Fundamental frequency is 80–300 Hz, so Δf ≈ 50 Hz is adequate.

Music (pitched): L = 50–100 ms. Need good frequency resolution to distinguish notes (semitone spacing at 440 Hz is 26 Hz).

Transients (drums, clicks): L = 5–10 ms. Need precise onset timing; frequency content is broadband anyway.

Worked Example: Choosing Window Length for Speech

Problem: Analyze a male speaker (fundamental f0 = 120 Hz) at sample rate fs = 16 kHz.

To resolve f0, we need Δf < f0/2 = 60 Hz (Rayleigh criterion). Since Δf = fs/L:

L > fs/60 = 16000/60 ≈ 267 samples (16.7 ms)

But phonemes change every ~80 ms, so we want Δt < 40 ms for decent temporal tracking:

L < 40 ms × 16000 = 640 samples

Sweet spot: L = 400–512 samples (25–32 ms). This resolves the fundamental frequency while tracking phoneme transitions. With L = 512 and hop D = L/4 = 128 samples (8 ms), we get smooth time evolution.
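The two constraints in this example can be wrapped in a small helper. The function name and argument names are hypothetical; the rule Δf < f0/2 is the criterion used above.

```python
import math

def window_length_bounds(fs, f0, max_frame_ms):
    """Bounds on window length L (in samples) from the two constraints above.

    Lower bound: delta_f = fs / L must be below f0 / 2.
    Upper bound: delta_t = L / fs must stay below max_frame_ms.
    """
    min_len = math.ceil(2 * fs / f0)
    max_len = (max_frame_ms * fs) // 1000
    return min_len, max_len

min_len, max_len = window_length_bounds(fs=16000, f0=120, max_frame_ms=40)
print(f"{min_len} <= L <= {max_len} samples")

# Powers of two inside the allowed range are convenient FFT sizes.
candidates = [2 ** p for p in range(4, 14) if min_len <= 2 ** p <= max_len]
print("power-of-two candidates:", candidates)
```

For a higher-pitched speaker the lower bound drops (f0 = 240 Hz needs only about 134 samples), which leaves more room to favor time resolution.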

Preview of wavelets: The STFT uses the SAME window length for all frequencies. Low frequencies need long windows, high frequencies need short windows. Wavelets solve this by using adaptive windows — but that's Lecture 10.
You're analyzing a signal with a 1 ms drum hit followed 5 ms later by a 500 Hz tone. Which window length best reveals both events?

Chapter 4: Uncertainty Principle

The time-frequency tradeoff we observed isn't just a practical inconvenience — it's a mathematical theorem. No matter how clever your analysis method, you cannot simultaneously achieve perfect time AND perfect frequency resolution.

Continuous-Time Statement

For any signal x(t) with finite energy, define:

Time spread (RMS duration around the center of mass):

Δt² = ∫ (t - t0)² |x(t)|² dt  /  ∫ |x(t)|² dt

Frequency spread (RMS bandwidth around center frequency):

Δf² = ∫ (f - f0)² |X(f)|² df  /  ∫ |X(f)|² df

The Heisenberg-Gabor uncertainty principle states:

Δt · Δf ≥ 1/(4π)
This is a theorem, not a conjecture. It follows directly from the Cauchy-Schwarz inequality applied to x(t) and its derivative. No signal can violate it. The Gaussian pulse achieves equality (it's the "minimum uncertainty" signal).

Proof Sketch

The proof uses three facts:

1. The Fourier transform of tx(t) is (j/2π) dX/df (differentiation property)

2. Parseval's theorem: ∫|x|² dt = ∫|X|² df

3. Cauchy-Schwarz: |⟨u,v⟩|² ≤ ‖u‖² · ‖v‖²

Apply Cauchy-Schwarz with u = tx(t) and v = dx/dt. After simplification (using integration by parts), you get Δt · Δf ≥ 1/(4π). Equality holds when x(t) is a Gaussian: x(t) = e^{-αt²}.
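The simplification step can be written out as follows. This is a sketch under the usual simplifying assumptions: t0 = 0 and f0 = 0 (shift and modulate the signal), unit energy ∫|x|² dt = 1, and t|x(t)|² → 0 as |t| → ∞. It uses the derivative property x'(t) ↔ j2πf X(f), which is the dual of fact 1.

```latex
\begin{align*}
\Delta_f^2 &= \int f^2 |X(f)|^2 \, df
            = \frac{1}{4\pi^2} \int |x'(t)|^2 \, dt
  && \text{(Parseval applied to } x'(t) \leftrightarrow j 2\pi f X(f)\text{)} \\
\left| \int t \, x(t) \, \overline{x'(t)} \, dt \right|^2
  &\le \int t^2 |x(t)|^2 \, dt \cdot \int |x'(t)|^2 \, dt
   = \Delta_t^2 \cdot 4\pi^2 \Delta_f^2
  && \text{(Cauchy-Schwarz)} \\
\operatorname{Re} \int t \, x(t) \, \overline{x'(t)} \, dt
  &= \frac{1}{2} \int t \, \frac{d}{dt} |x(t)|^2 \, dt
   = -\frac{1}{2} \int |x(t)|^2 \, dt = -\frac{1}{2}
  && \text{(integration by parts)} \\
\frac{1}{4} &\le 4\pi^2 \, \Delta_t^2 \, \Delta_f^2
  \quad \Longrightarrow \quad \Delta_t \, \Delta_f \ge \frac{1}{4\pi}
\end{align*}
```

The last line follows because the squared magnitude of a complex number is at least the square of its real part. Cauchy-Schwarz is tight only when x'(t) is proportional to t·x(t), which is the differential equation solved by a Gaussian.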

What This Means for the STFT

The STFT window g[n] is itself a signal. Its time spread Δt_g and frequency spread Δf_g satisfy the uncertainty principle. Since the STFT resolution is determined by the window:

• Time resolution of STFT = Δt_g (duration of the window)

• Frequency resolution of STFT = Δf_g (bandwidth of the window's DFT)

So the STFT inherits the uncertainty principle from its window. No window can beat Δt_g · Δf_g ≥ 1/(4π). The Gaussian window achieves the minimum, which is why Gabor (1946) proposed using Gaussian windows for time-frequency analysis.

The Gaussian Window (Gabor Atom)

The Gabor atom is a Gaussian-windowed complex exponential:

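g_{τ,ω}(t) = e^{-(t-τ)²/(4σ²)} · e^{jωt}

Its energy envelope |g|² = e^{-(t-τ)²/(2σ²)} is a Gaussian with standard deviation σ, so the atom is localized at time τ with RMS spread Δt = σ, and at frequency ω/(2π) with RMS spread Δf = 1/(4πσ). The product Δt · Δf = 1/(4π) is the theoretical minimum, which makes this the "best possible" time-frequency atom.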

Uncertainty Principle Visualizer

Figure: a Gaussian window for different values of σ. As time spread decreases, frequency spread increases, and their product stays at the minimum 1/(4π).

Numerical Example

Gaussian window with σ = 2 ms at sample rate 16 kHz:

• Time spread: Δt = σ = 2 ms

• Frequency spread: Δf = 1/(4πσ) = 1/(4π × 0.002) ≈ 39.8 Hz

• Product: Δt × Δf = 0.002 × 39.8 = 0.0796 = 1/(4π) ✓

Now double σ to 4 ms:

• Δt = 4 ms (worse time resolution)

• Δf = 19.9 Hz (better frequency resolution)

• Product: 0.004 × 19.9 = 0.0796 (unchanged! Same minimum.)

You can redistribute the "uncertainty budget" between time and frequency, but you can never reduce the total.
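The same numbers can be obtained by measuring the RMS spreads directly from samples. This sketch assumes the Gaussian is parameterized as exp(-t²/(4σ²)), so that σ equals the RMS spread of |g|² used in the definition of Δt. A Hann window of 40 ms is included for comparison; the helper name `rms_spreads` and the grid size are illustrative choices.

```python
import numpy as np

def rms_spreads(g, dt):
    """Return (delta_t, delta_f): RMS spreads of |g|^2 and |G|^2 for samples g."""
    n = len(g)
    t = (np.arange(n) - n // 2) * dt
    p_t = np.abs(g) ** 2
    p_t /= p_t.sum()                        # normalize to a distribution over time
    t0 = np.sum(t * p_t)
    delta_t = np.sqrt(np.sum((t - t0) ** 2 * p_t))

    f = np.fft.fftfreq(n, d=dt)
    p_f = np.abs(np.fft.fft(g)) ** 2
    p_f /= p_f.sum()                        # normalize to a distribution over frequency
    f0 = np.sum(f * p_f)
    delta_f = np.sqrt(np.sum((f - f0) ** 2 * p_f))
    return delta_t, delta_f

fs = 16000.0
dt = 1.0 / fs
n_samples = 4096
t = (np.arange(n_samples) - n_samples // 2) * dt

sigma = 0.002                               # 2 ms
gauss = np.exp(-t ** 2 / (4 * sigma ** 2))  # |gauss|^2 has standard deviation sigma
gauss_dt, gauss_df = rms_spreads(gauss, dt)

hann = np.zeros(n_samples)                  # 641-sample (40 ms) Hann, zero-padded
hann[n_samples // 2 - 320 : n_samples // 2 + 321] = np.hanning(641)
hann_dt, hann_df = rms_spreads(hann, dt)

bound = 1 / (4 * np.pi)
print(f"bound 1/(4*pi)      = {bound:.4f}")
print(f"Gaussian: dt = {1e3 * gauss_dt:.3f} ms, df = {gauss_df:.2f} Hz, "
      f"product = {gauss_dt * gauss_df:.4f}")
print(f"Hann:     dt = {1e3 * hann_dt:.3f} ms, df = {hann_df:.2f} Hz, "
      f"product = {hann_dt * hann_df:.4f}")
```

The Gaussian sits on the bound, while the Hann window lands slightly above it.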

Comparison to Quantum Mechanics

This is the SAME uncertainty principle as Heisenberg's in quantum mechanics (Δx · Δp ≥ ℏ/2), but for signals. In QM, position and momentum are Fourier pairs. In signal processing, time and frequency are Fourier pairs. The math is identical.

Which window achieves equality in the uncertainty principle (Δt · Δf = 1/(4π))?

Chapter 5: The Spectrogram

The spectrogram is the squared magnitude of the STFT:

S_x[τ, k] = |X_x[τ, k]|²

It discards phase information and gives a real-valued, non-negative time-frequency energy density. This is what you see when you open audio in Audacity, or analyze speech in Praat, or visualize music in a DAW.

Reading a Spectrogram

X-axis: Time (frame index τ, or seconds)

Y-axis: Frequency (bin index k, or Hz)

Color/brightness: Energy at that (time, frequency) point. Usually displayed in dB: 10 log10(S[τ,k]).

A pure tone at constant frequency appears as a horizontal line. A chirp appears as a diagonal line. A drum hit appears as a vertical stripe (energy at all frequencies simultaneously). Silence is dark everywhere.

The spectrogram is perhaps the single most useful representation in audio signal processing. Speech recognition, music analysis, environmental sound classification, sonar, radar, seismology — all rely on spectrograms. Learning to read them is as fundamental as reading waveforms.

Log-Frequency and Mel Spectrograms

Human pitch perception is logarithmic: the distance between 100 Hz and 200 Hz (one octave) sounds the same as 1000 Hz to 2000 Hz. To match this, we often warp the frequency axis:

Log-frequency spectrogram: Map frequency bins to log scale

Mel spectrogram: Apply triangular filter bank on the mel scale (mel(f) = 2595 · log10(1 + f/700))
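The mel formula above and its inverse are easy to write down. The sketch below also places filter center frequencies equally spaced on the mel axis; the values n_mels = 80 and the 0 to 8000 Hz range are illustrative, and building the triangular filters themselves is left out.

```python
import numpy as np

def hz_to_mel(f_hz):
    """mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Center frequencies of n_mels filters, equally spaced on the mel axis.
n_mels, f_min, f_max = 80, 0.0, 8000.0      # illustrative choices
mel_points = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
centers_hz = mel_to_hz(mel_points)[1:-1]    # drop the two band edges

spacing = np.diff(centers_hz)
print(f"mel(1000 Hz) = {float(hz_to_mel(1000.0)):.1f}")
print(f"spacing between first two centers: {spacing[0]:.1f} Hz")
print(f"spacing between last two centers:  {spacing[-1]:.1f} Hz")
```

The filters are packed densely at low frequencies and spread out at high frequencies, which is the warping the mel spectrogram applies to the STFT bins.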

Computing the Spectrogram in Code

python
import numpy as np

def spectrogram(x, win_len=512, hop=128, nfft=512):
    """Compute power spectrogram in dB."""
    window = np.hanning(win_len)
    num_frames = (len(x) - win_len) // hop + 1
    S = np.zeros((nfft // 2 + 1, num_frames))
    for m in range(num_frames):
        frame = x[m*hop : m*hop + win_len] * window
        X = np.fft.rfft(frame, n=nfft)
        S[:, m] = np.abs(X) ** 2
    # Convert to dB
    S_dB = 10 * np.log10(S + 1e-10)
    return S_dB

Interactive Spectrogram

Figure: spectrogram of a selectable test signal for different window sizes and hops. Notice how window length affects the tradeoff.

What to Look For

Chirp: With a long window, you see a clean diagonal line (good freq resolution tracks the rising frequency). With a short window, the line is thick and blurry (poor freq resolution) but onset is sharp.

Two Tones: Two horizontal lines at different frequencies. A long window resolves them clearly. A short window may merge them if they're close.

Pulse Train: Vertical stripes (each pulse is impulsive = broadband). Short window gives sharp stripes, long window smears them.
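The "Two Tones" case can be checked numerically by counting spectral peaks in a single Hann-windowed frame. The tone frequencies (1000 Hz and 1100 Hz), the two window lengths, and the half-of-maximum peak threshold are assumptions made for this sketch.

```python
import numpy as np

fs = 16000
t = np.arange(2048) / fs
x = np.cos(2 * np.pi * 1000 * t) + np.cos(2 * np.pi * 1100 * t)   # tones 100 Hz apart

def count_peaks(x, win_len, nfft=8192, rel_threshold=0.5):
    """Count local maxima above rel_threshold * max in one Hann-windowed frame."""
    n = np.arange(win_len)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / win_len)          # periodic Hann
    mag = np.abs(np.fft.rfft(x[:win_len] * window, n=nfft))       # zero-padded DFT
    mid = mag[1:-1]
    is_peak = (mid > mag[:-2]) & (mid >= mag[2:]) & (mid > rel_threshold * mag.max())
    return int(np.sum(is_peak))

short_peaks = count_peaks(x, win_len=64)      # 4 ms window, fs/L = 250 Hz
long_peaks = count_peaks(x, win_len=1024)     # 64 ms window, fs/L is about 15.6 Hz

print(f"L = 64:   {short_peaks} peak(s)")
print(f"L = 1024: {long_peaks} peak(s)")
```

Zero-padding to 8192 points makes the spectrum smoother to inspect but does not help the short window: its main lobe is still far wider than the 100 Hz spacing, so the two tones appear as one.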

Real-World Spectrograms

In practice, spectrograms are almost always displayed in decibels (dB) to compress the dynamic range. Human hearing spans ~120 dB (factor of 10^12 in power), so a linear scale would make quiet components invisible.

S_dB[τ, k] = 10 log10(|X[τ, k]|² + ε)

The small constant ε (typically 10^-10) prevents log(0). The result is clamped to a display range, typically -80 dB to 0 dB relative to the peak.
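The dB conversion and clamping can be sketched as below. The function name `to_display_db` and the tiny 2×2 test array are made up for illustration.

```python
import numpy as np

def to_display_db(power, eps=1e-10, dynamic_range_db=80.0):
    """Convert a power spectrogram to dB relative to its peak, clamped for display."""
    power = np.asarray(power, dtype=float)
    s_db = 10.0 * np.log10(power + eps)     # eps keeps log10 finite for zero power
    s_db -= s_db.max()                      # the peak becomes 0 dB
    return np.clip(s_db, -dynamic_range_db, 0.0)

# Powers of 1, 1e-3, 1e-6 and exactly 0 (silence).
power = np.array([[1.0, 1e-3], [1e-6, 0.0]])
display = to_display_db(power)
print(np.round(display, 2))
```

The silent cell would be at -100 dB because of ε, and the clamp lifts it to the -80 dB floor so that the color scale is not wasted on inaudible detail.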

Common spectrogram variants in audio ML:

Linear spectrogram: Linear frequency axis, dB magnitude. Used in speech analysis.

Mel spectrogram: Frequency warped to mel scale, often 80–128 mel bins. Whisper takes a log-mel spectrogram as input; wav2vec 2.0 and HuBERT instead operate on the raw waveform.

Log-mel spectrogram: Mel spectrogram in dB. The standard input for most modern audio AI.

Constant-Q spectrogram: Logarithmic frequency spacing (like wavelets). Good for music analysis.

Spectrogram Parameters in Practice

python
# Common settings for different domains:

# Speech recognition (Whisper-style)
win_len = 400   # 25 ms at 16 kHz
hop = 160       # 10 ms hop
n_mels = 80    # mel filter banks

# Music analysis
win_len = 2048  # ~93 ms at 22.05 kHz
hop = 512       # ~23 ms hop
n_mels = 128   # more bins for music

# Environmental sound (AudioSet)
win_len = 1024  # 64 ms at 16 kHz
hop = 320       # 20 ms hop
n_mels = 64    # fewer bins sufficient
In a spectrogram, a drum hit (a brief broadband event) appears as:

Chapter 6: Reconstruction

A natural question: can we go back? Given the STFT X_x[τ, k], can we recover the original signal x[n]? The answer is YES, under certain conditions on the window and hop size.

The Overlap-Add (OLA) Method

The simplest reconstruction method:

1. Inverse DFT: for each frame m (time position τ = m·D), y_m[n] = IDFT{X_x[m·D, k]}.

2. Overlap: position y_m[n] at time m·D in the output.

3. Add: sum all overlapping frames: x̂[n] = ∑_m y_m[n - m·D].

4. Normalize: divide by the sum of the shifted windows: x[n] = x̂[n] / ∑_m g[n - m·D].

The COLA Condition

For perfect reconstruction (x̂[n] = x[n] without the normalization step), the window must satisfy the Constant Overlap-Add (COLA) condition:

∑_{m=-∞}^{∞} g[n - m·D] = C   (constant for all n)

This means: at every sample position n, the sum of all windows covering that position must be the same constant C. If C = 1, we get perfect reconstruction directly.

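COLA-compliant combinations (for the periodic window forms, e.g. Hann g[n] = 0.5(1 - cos(2πn/L))):
• Hann window + 50% overlap (D = L/2): ∑ = 1.0 ✓
• Hann window + 75% overlap (D = L/4): ∑ = 2.0 (normalize by 2) ✓
• Rectangular + 0% overlap (D = L): ∑ = 1.0 ✓
• Hamming + 50% overlap: ∑ = 1.08 (normalize by 1.08) ✓
• Blackman + 50% overlap: ∑ ≠ constant ✗ (NOT COLA!)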
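Whether a window and hop satisfy COLA can be tested by simply adding up shifted copies of the window. The sketch below assumes the periodic window forms (denominator L) and includes NumPy's symmetric `np.hanning(L)` for contrast; the helper name `overlap_sum` is hypothetical.

```python
import numpy as np

def overlap_sum(window, hop, num_frames=16):
    """Sum num_frames copies of window shifted by hop; return the steady-state part."""
    L = len(window)
    total = np.zeros((num_frames - 1) * hop + L)
    for m in range(num_frames):
        total[m * hop : m * hop + L] += window
    return total[L:-L]                      # drop the startup and ending transients

L = 64
n = np.arange(L)
# Periodic window forms (denominator L), an assumption of this sketch.
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / L)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / L)
blackman = 0.42 - 0.5 * np.cos(2 * np.pi * n / L) + 0.08 * np.cos(4 * np.pi * n / L)

cases = {
    "Hann, 50% overlap": overlap_sum(hann, L // 2),
    "Hann, 75% overlap": overlap_sum(hann, L // 4),
    "Hamming, 50% overlap": overlap_sum(hamming, L // 2),
    "Blackman, 50% overlap": overlap_sum(blackman, L // 2),
    "Symmetric np.hanning, 50% overlap": overlap_sum(np.hanning(L), L // 2),
}
for name, s in cases.items():
    print(f"{name:35s} min = {s.min():.3f}, max = {s.max():.3f}")
```

A combination is COLA when min and max agree. The symmetric Hann shows a small ripple at hop L/2 because its cosine period is L - 1 samples rather than L, which is why the periodic form is preferred for overlap-add.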

Why COLA Matters

In audio processing, we often modify the STFT before reconstructing:

Noise reduction: Subtract noise spectrum, then reconstruct

Time stretching: Interpolate STFT frames, then OLA

Pitch shifting: Shift frequency bins, then OLA

Source separation: Apply a binary mask to the STFT, then OLA

If the window isn't COLA-compliant, the reconstructed signal will have amplitude modulation artifacts (periodic volume changes at the hop rate).

Worked Example: Verifying COLA

Hann window of length L = 4: g = [0, 0.75, 0.75, 0] (using the standard formula).

With hop D = L/2 = 2 (50% overlap), the overlapping windows at position n:

• n = 0: covered by window 0 only: g[0] = 0

• n = 1: covered by window 0 only: g[1] = 0.75

• n = 2: covered by windows 0 and 1: g[2] + g[0] = 0.75 + 0 = 0.75

• n = 3: covered by windows 0 and 1: g[3] + g[1] = 0 + 0.75 = 0.75

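After the startup transient, the sum is constant at 0.75 for all n (the endpoints overlap with the midpoints of the next window). Divide by 0.75 → perfect reconstruction. For the periodic Hann g[n] = 0.5(1 - cos(2πn/L)) (in NumPy, np.hanning(L + 1)[:-1]), 50% overlap gives a constant sum of exactly 1.0, so no normalization is needed. The symmetric np.hanning(L) divides by L - 1 instead, which generally leaves a small ripple at hop L/2 (L = 4 above is a special case).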

OLA in Code

python
import numpy as np

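def istft_ola(stft_matrix, window, hop):
    """Reconstruct signal from STFT via weighted overlap-add.

    Each inverse-DFT frame is multiplied by the window again (synthesis
    window), and the sum is normalized by the sum of squared windows.
    This undoes the analysis window applied in stft().
    Assumes nfft is even and nfft >= len(window)."""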
    num_frames, nfreqs = stft_matrix.shape
    nfft = (nfreqs - 1) * 2
    L = len(window)
    output_len = (num_frames - 1) * hop + L
    x = np.zeros(output_len)
    win_sum = np.zeros(output_len)

    for m in range(num_frames):
        frame = np.fft.irfft(stft_matrix[m], n=nfft)[:L]
        start = m * hop
        x[start : start + L] += frame * window
        win_sum[start : start + L] += window ** 2

    # Normalize where windows overlap
    win_sum[win_sum < 1e-8] = 1.0
    return x / win_sum
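
A round-trip check of analysis followed by synthesis. The two functions are restated compactly so that the snippet runs on its own; the random test signal, the periodic Hann window and the 75% overlap are choices made for this sketch.

```python
import numpy as np

def stft(x, window, hop, nfft):
    """One row per frame, nfft // 2 + 1 frequency bins per row."""
    L = len(window)
    num_frames = (len(x) - L) // hop + 1
    return np.stack([np.fft.rfft(x[m * hop : m * hop + L] * window, n=nfft)
                     for m in range(num_frames)])

def istft_ola(stft_matrix, window, hop):
    """Weighted overlap-add: synthesis window, then divide by the sum of window**2."""
    num_frames, nfreqs = stft_matrix.shape
    nfft, L = (nfreqs - 1) * 2, len(window)
    x = np.zeros((num_frames - 1) * hop + L)
    win_sum = np.zeros_like(x)
    for m in range(num_frames):
        frame = np.fft.irfft(stft_matrix[m], n=nfft)[:L]
        x[m * hop : m * hop + L] += frame * window
        win_sum[m * hop : m * hop + L] += window ** 2
    win_sum[win_sum < 1e-8] = 1.0           # avoid dividing by zero at the edges
    return x / win_sum

rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
L, hop = 256, 64                            # 75% overlap
window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(L) / L)   # periodic Hann

X = stft(x, window, hop, nfft=L)
x_rec = istft_ola(X, window, hop)

# Compare away from the edges, where fewer frames overlap.
interior = slice(L, len(x_rec) - L)
max_error = np.max(np.abs(x_rec[interior] - x[interior]))
print(f"frames: {X.shape[0]}, max interior error: {max_error:.2e}")
```

The very first sample cannot be recovered because the Hann window is exactly zero there, which is why the comparison skips one window length at each end.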

Overlap-Add Reconstruction

Figure: overlapping windowed frames sum to reconstruct the original signal. Green = original, orange = individual frames, teal = reconstruction.
What does the COLA condition guarantee?

Chapter 7: Mastery

Let's consolidate everything from this lecture and connect it to the broader signal processing landscape.

The Complete STFT Pipeline

1. Signal x[n]: nonstationary; frequency content changes over time.

2. Window & Segment: multiply by g[n - τ], hop by D samples.

3. DFT per Frame: X_x[τ, k], a complex-valued 2D array.

4. Spectrogram: |X_x[τ, k]|², the time-frequency energy.

5. Analysis / Modification: extract features, apply masks, filter.

6. Reconstruct (OLA): IDFT + overlap-add → modified signal.

STFT vs. Alternatives

| Method | Time-Freq Tiling | Strength | Weakness |
| --- | --- | --- | --- |
| DFT | No time axis | Perfect freq resolution | No temporal info |
| STFT | Fixed-size tiles | Simple, well-understood, invertible | Fixed resolution at all frequencies |
| Wavelet | Adaptive tiles | Good time @ high freq, good freq @ low freq | More complex, no unique inverse |
| Wigner-Ville | Point-wise | Best resolution theoretically | Cross-terms (interference artifacts) |

Key Formulas Summary

X_x[τ, k] = ∑_n x[n] · g[n - τ] · e^{-j2πkn/N}
S[τ, k] = |X_x[τ, k]|²   (spectrogram)
Δt · Δf ≥ 1/(4π)   (uncertainty principle)
∑_m g[n - mD] = C   (COLA for perfect reconstruction)

Connections

Lecture 7 (Spectral Descriptors): Descriptors are computed per-STFT-frame to get time-varying features.

Lecture 10 (Wavelets): Overcomes the fixed-window limitation with scale-adaptive analysis.

Mel spectrograms → deep learning: Many modern speech/audio models (e.g., Whisper) take STFT-based log-mel spectrograms as input; others, such as wav2vec 2.0, learn features directly from the raw waveform.

Real Applications

Whisper (OpenAI): Converts audio to 80-channel log-mel spectrogram (25ms window, 10ms hop), then feeds to a Transformer encoder-decoder. The spectrogram IS the input representation.

Music source separation (e.g., Hybrid Demucs): Operates in part on the complex STFT, modifies it with a neural network, then reconstructs via inverse STFT with COLA-compliant Hann windows.

Real-time noise suppression (e.g., in earbuds and voice calls): STFT → spectral subtraction → inverse STFT, with window and hop chosen to keep latency to a few milliseconds.

Dennis Gabor (1946): "A signal can be simultaneously described in terms of time and frequency only with a certain imprecision." — Nobel laureate who invented holography and formalized time-frequency analysis.
You want to process a speech signal (modify it in the STFT domain) and get a clean output. Which window/overlap combination ensures artifact-free reconstruction?