Separating source from filter in speech — the logarithmic trick that powers all of modern speech recognition.
You record someone saying "ah." The microphone captures a signal y[n]. But that signal isn't just the person's vocal cords vibrating — it's been shaped by their throat, mouth, and nasal cavity. The physics:
• Source signal x[n]: A buzzy periodic pulse train from the vocal cords (the "glottal pulse"). Its frequency is the speaker's pitch (f0 ≈ 100-300 Hz).
• Vocal tract filter w[n]: The throat and mouth form a resonant tube. It amplifies some frequencies (called formants) and attenuates others. The filter shape determines which vowel you hear ("ah" vs "ee" vs "oo").
The speech signal is the source convolved with the filter:
In the DFT domain (by the convolution theorem from Lecture 12):
• Speech recognition: Needs W[k] (what phoneme is being spoken), doesn't care about X[k] (who's speaking it)
• Speaker identification: Needs X[k] (the unique pitch pattern), doesn't care about W[k]
• Pitch shifting: Change X[k] without changing W[k] — make someone sound higher/lower without changing the words
We can't just divide Y by X or Y by W — we don't know either one! We need a trick that separates them using only the observed Y. And that trick is... the logarithm.
The glottal source (periodic pulses) convolved with the vocal tract filter produces speech. In frequency domain, it's multiplication.
The key mathematical insight is embarrassingly simple. In the DFT domain:
Take the logarithm of both sides:
In the log-magnitude spectrum:
• log |X[k]| is the source spectrum. For voiced speech (vocal cord buzz), this is a series of evenly-spaced peaks (harmonics) at multiples of the fundamental frequency f0. These peaks are closely spaced — they look like a rapidly varying pattern.
• log |W[k]| is the vocal tract response. This is a smooth envelope that varies slowly across frequency — it has broad peaks (formants) separated by 1000+ Hz.
So log |Y[k]| = a slowly-varying component (filter) + a rapidly-varying component (source). This is just like a signal with low-frequency content (filter) and high-frequency content (source). We can separate them with... another Fourier transform!
Suppose Y[k] for k = 0, 1, ..., 7:
|Y| = [10, 2, 8, 1.5, 9, 1.8, 7, 2.2]
Notice the alternating pattern? Large values at even k (harmonics), small at odd k (between harmonics). This is the source's periodic structure.
Take log: log|Y| = [2.30, 0.69, 2.08, 0.41, 2.20, 0.59, 1.95, 0.79]
Now this alternating pattern is additive. The "slow" component (average ~1.5) is the filter envelope. The "fast" component (oscillation of ~0.8) is the source harmonics.
Top: |Y[k]| = |X[k]| · |W[k]| (multiplication). Bottom: log|Y[k]| = log|X[k]| + log|W[k]| (addition). In the log domain, you can visually separate the slowly-varying envelope from the rapidly-varying harmonics.
The cepstrum is defined as the inverse DFT (or DCT) of the log magnitude spectrum:
The name "cepstrum" is an anagram of "spectrum" — coined by Bogert, Healy, and Tukey (1963). The independent variable n in the cepstrum is called quefrency (anagram of "frequency"). These playful names remind us that we're in a "domain of a domain."
Since log |Y[k]| = log |X[k]| + log |W[k]|, the cepstrum is:
Where cx[n] = IDFT{log |X[k]|} and cw[n] = IDFT{log |W[k]|}.
The key observation: log |W[k]| is slowly varying (smooth envelope), so its cepstrum cw[n] is concentrated at low quefrencies (small n). Meanwhile, log |X[k]| oscillates rapidly (harmonics), so cx[n] has energy at high quefrencies (specifically, near n = N/f0, the "period" of the harmonic pattern in the log spectrum).
For voiced speech, the cepstrum has a clear peak at quefrency n = fs/f0 (the pitch period in samples). This peak is remarkably robust — it persists even in noisy conditions. Cepstral pitch detection was the standard method for decades before deep learning.
python import numpy as np def cepstrum(y, n_fft=1024): """Compute the real cepstrum of signal y.""" Y = np.fft.fft(y, n_fft) log_mag = np.log(np.abs(Y) + 1e-10) # epsilon avoids log(0) ceps = np.real(np.fft.ifft(log_mag)) return ceps def detect_pitch(y, fs, n_fft=1024): """Find pitch period from cepstral peak.""" ceps = cepstrum(y, n_fft) # Search in plausible pitch range: 50-500 Hz min_q = int(fs / 500) # max frequency max_q = int(fs / 50) # min frequency peak_q = min_q + np.argmax(ceps[min_q:max_q]) f0 = fs / peak_q return f0
Observe how the cepstrum separates pitch (high quefrency peak) from vocal tract (low quefrency shape). The dashed line is the lifter cutoff.
The cepstrum is mathematically elegant but doesn't account for how humans actually perceive frequency. A jump from 100 Hz to 200 Hz sounds like a HUGE change (one octave). A jump from 8000 Hz to 8100 Hz sounds like almost nothing (barely perceptible). Human frequency perception is logarithmic, not linear.
The mel scale is a perceptual frequency scale where equal distances correspond to equal perceived pitch differences. It was defined experimentally by Stevens, Volkmann, and Newman (1937) by asking listeners to adjust tones until they sounded "equally spaced."
And the inverse:
| Frequency (Hz) | Mel | Perception |
|---|---|---|
| 0 | 0 | Silence |
| 200 | 283 | Low male voice pitch |
| 500 | 607 | First formant of "ah" |
| 1000 | 1000 | Reference (1000 mel = 1000 Hz by design) |
| 2000 | 1346 | Second formant region |
| 4000 | 1750 | Consonant energy region |
| 8000 | 2146 | Near hearing limit for older adults |
Notice: from 0 to 1000 Hz spans 1000 mel. From 1000 to 8000 Hz spans only 1146 mel. The mel scale compresses high frequencies drastically — matching our reduced sensitivity there.
If two phonemes differ by the same amount in mel space, they're equally easy for humans to distinguish. So a speech recognition system should allocate equal "attention" to equal mel intervals — more frequency bins at low frequencies (where perception is sharp) and fewer bins at high frequencies (where perception is coarse). The mel filter bank does exactly this.
The mel scale is approximately linear below 700 Hz and logarithmic above. Move the frequency slider to see the mapping.
The mel filter bank is a set of triangular bandpass filters spaced uniformly on the mel scale. Each filter captures the energy in one perceptual frequency band.
Given R filters spanning from flow to fhigh:
1. Convert flow and fhigh to mel: mlow, mhigh
2. Create R+2 equally-spaced points in mel: m0, m1, ..., mR+1
3. Convert back to Hz: f0, f1, ..., fR+1
4. Each filter r is a triangle with center fr+1, left edge fr, right edge fr+2
The r-th filter weight at frequency bin k is:
Apply each filter to the power spectrum |X[k]|2:
This gives R values (typically R = 26 or 40) that represent the signal's energy distribution on a perceptually-motivated frequency scale.
R = 4 filters, flow = 0 Hz, fhigh = 4000 Hz, fs = 8000 Hz.
Mel bounds: mlow = 0, mhigh = 2146. Six points: 0, 358, 715, 1073, 1431, 1789, 2146.
Back to Hz: 0, 233, 508, 841, 1253, 1774, 2400 (approximately).
Filter 0: triangle from 0 to 508 Hz (center 233 Hz) — narrow, 508 Hz wide
Filter 3: triangle from 1253 to 2400 Hz (center 1774 Hz) — wide, 1147 Hz wide
The higher filter is more than twice as wide — lower resolution at high frequencies.
Triangular filters spaced on the mel scale. Notice: narrow at low frequencies, wide at high frequencies. Adjust number of filters.
The Mel-Frequency Cepstral Coefficients combine the mel scale perception model with the cepstral separation idea. They are the standard feature representation for speech recognition. Let's build the full pipeline.
The output: 13 numbers (c[0] through c[12]) that summarize the spectral shape of one frame of speech. These 13 MFCCs are computed every 10 ms, producing a feature matrix for the entire utterance.
Windowing (Step 1): 25 ms frame (speech is quasi-stationary over this duration). Hamming window prevents spectral leakage.
Power spectrum (Step 3): Discards phase (irrelevant for speech recognition). Power rather than magnitude because human loudness perception is closer to power.
Mel filter bank (Step 4): Mimics the frequency resolution of the human cochlea. Compresses high frequencies. Reduces from ~256 spectral bins to ~26 mel bands.
Logarithm (Step 5): Two purposes: (a) mimics human loudness perception (Weber-Fechner law: perceived loudness is proportional to log of intensity); (b) converts the multiplicative source-filter combination into an additive one.
DCT (Step 6): Decorrelates the mel filter bank outputs (adjacent filters overlap, so MF[r] values are correlated). The DCT is like a "cepstral" transform of the mel spectrum. Keeping only the first 13 coefficients is a form of dimensionality reduction that captures the smooth spectral envelope (vocal tract shape) and discards fine harmonic structure.
Suppose a 25 ms frame at 16 kHz sampling rate (= 400 samples). After windowing and 512-point FFT:
Power spectrum P[k] for k = 0 to 256 (first half of symmetric spectrum).
Apply R = 26 mel filters: MF = [0.02, 0.15, 0.43, 0.89, 1.23, 0.98, ..., 0.04]
Take log: S = [-3.91, -1.90, -0.84, -0.12, 0.21, -0.02, ..., -3.22]
Apply DCT (type-II) and keep first 13 coefficients:
MFCC = [c0, c1, ..., c12] = [-5.2, 3.1, -1.8, 0.9, -0.4, 0.2, ...]
c0 ≈ overall energy. c1-c4 ≈ spectral tilt and broad shape. c5-c12 ≈ finer spectral details.
A synthetic speech-like signal processed through the full MFCC pipeline. Click each step to see the data at that stage.
python import numpy as np def compute_mfcc(signal, fs=16000, n_mfcc=13, n_filt=26, frame_len=0.025, frame_step=0.01, n_fft=512): """Compute MFCCs from a speech signal.""" # 1. Framing + windowing frame_samp = int(frame_len * fs) # 400 samples step_samp = int(frame_step * fs) # 160 samples n_frames = (len(signal) - frame_samp) // step_samp + 1 mfccs = [] for i in range(n_frames): frame = signal[i*step_samp : i*step_samp + frame_samp] frame = frame * np.hamming(frame_samp) # window # 2-3. Power spectrum X = np.fft.rfft(frame, n_fft) P = np.abs(X)**2 / n_fft # 4. Mel filter bank mel_energies = mel_filter_bank(P, fs, n_fft, n_filt) # 5. Log log_mel = np.log(mel_energies + 1e-10) # 6. DCT (type-II, first n_mfcc coefficients) mfcc = dct(log_mel)[:n_mfcc] mfccs.append(mfcc) return np.array(mfccs) # shape: (n_frames, 13)
MFCCs have been the dominant feature for speech and audio machine learning for over 30 years. Let's see why.
The classic speech recognition pipeline:
1. Extract 13 MFCCs per frame (every 10 ms)
2. Append delta (velocity) and delta-delta (acceleration) coefficients → 39 features per frame
3. Feed to a classifier: HMM-GMM (traditional) or DNN/RNN (modern)
MFCCs capture "what phoneme" while being mostly invariant to "who's speaking" (because the DCT liftering removes pitch information). They also compress 256+ spectral bins into just 13 numbers — massive dimensionality reduction with minimal information loss for phoneme classification.
Paradoxically, MFCCs are ALSO used for speaker identification — just with different modeling. The same 13 MFCCs capture enough vocal tract information to distinguish speakers, because each person's vocal tract has a unique shape (unique formant pattern).
The difference: speech recognition models the sequence of MFCCs (phoneme transitions). Speaker identification models the distribution of MFCCs (which region of MFCC space this speaker tends to occupy).
MFCCs capture timbre — the "color" of sound that distinguishes a piano from a guitar playing the same note. Applications:
• Genre classification (rock vs. classical vs. jazz)
• Instrument recognition
• Music similarity / recommendation
| Variant | Modification | Use Case |
|---|---|---|
| MFCC + Δ + ΔΔ | Add first and second time derivatives | Standard ASR (captures dynamics) |
| Log-mel spectrogram | Skip the DCT step | Neural network input (Whisper, etc.) |
| PLP (Perceptual LP) | All-pole model of mel spectrum | Noise-robust speech recognition |
| Gammatone filterbank | More physiologically accurate filters | Hearing aid research |
Different vowels occupy distinct regions in MFCC space. Here we plot c1 vs c2 for synthetic vowel-like signals.
Let's consolidate the full story from source-filter model to modern speech features.
| Problem | Solution | Key Insight |
|---|---|---|
| Speech = source * filter (convolved) | Take DFT: Y = X · W (multiplied) | Convolution theorem (Lecture 12) |
| X and W are multiplied | Take log: log Y = log X + log W (added) | Logarithm turns multiplication to addition |
| Need to separate additive components | IDFT of log spectrum = cepstrum | Source (fast) and filter (slow) separate in quefrency |
| Linear frequency scale doesn't match perception | Mel filter bank | Human hearing is logarithmic |
| Need compact, decorrelated features | DCT of log mel energies = MFCC | DCT gives energy compaction |
Looking back: Lecture 12 (LTI Systems) proved the convolution theorem that makes the source-filter separation possible. Lecture 8 (STFT) provided the windowing framework. Lecture 11 (Wavelet Applications) showed that logarithmic frequency resolution matches natural signals.
The bigger picture: MFCCs encode THREE insights: (1) the convolution theorem converts convolution to multiplication; (2) the logarithm converts multiplication to addition; (3) human perception is logarithmic in frequency. Each step has a principled justification — nothing is arbitrary.