EE269 Lecture 13 — Cepstrum & MFCC

Chapter 0: The Convolution Problem

You record someone saying "ah." The microphone captures a signal y[n]. But that signal isn't just the person's vocal cords vibrating — it's been shaped by their throat, mouth, and nasal cavity. The physics:

• Source signal x[n]: A buzzy periodic pulse train from the vocal cords (the "glottal pulse"). Its frequency is the speaker's pitch (f₀ ≈ 100-300 Hz).

• Vocal tract filter w[n]: The throat and mouth form a resonant tube. It amplifies some frequencies (called formants) and attenuates others. The filter shape determines which vowel you hear ("ah" vs "ee" vs "oo").

The speech signal is the source convolved with the filter:

y[n] = x[n] * w[n]

In the DFT domain (by the convolution theorem from Lecture 12):

Y[k] = X[k] · W[k]

The problem: We observe Y[k] = X[k] · W[k]. We want to separate X[k] (pitch information) from W[k] (vowel/timbre information). But they're multiplied together in the frequency domain. How do you "un-multiply" two unknown signals?

Why Separation Matters

• Speech recognition: Needs W[k] (what phoneme is being spoken), doesn't care about X[k] (who's speaking it)

• Speaker identification: Needs X[k] (the unique pitch pattern), doesn't care about W[k]

• Pitch shifting: Change X[k] without changing W[k] — make someone sound higher/lower without changing the words

We can't just divide Y by X or Y by W — we don't know either one! We need a trick that separates them using only the observed Y. And that trick is... the logarithm.

Source-Filter Model of Speech

The glottal source (periodic pulses) convolved with the vocal tract filter produces speech. In frequency domain, it's multiplication.

In the source-filter model of speech, the observed signal y[n] is related to the source x[n] and filter w[n] by:

Convolution in time: y[n] = x[n] * w[n], which becomes multiplication in DFT: Y[k] = X[k] · W[k] Addition: y[n] = x[n] + w[n] Multiplication in time: y[n] = x[n] · w[n]

Chapter 1: The Logarithm Trick

The key mathematical insight is embarrassingly simple. In the DFT domain:

Y[k] = X[k] · W[k]

Take the logarithm of both sides:

log |Y[k]| = log |X[k]| + log |W[k]|

The logarithm turns multiplication into addition! In the log-magnitude spectrum, the source and filter contributions are simply ADDED together. And separating an additive mixture is much easier than separating a multiplicative one — it's just filtering.

Why This Works Geometrically

In the log-magnitude spectrum:

• log |X[k]| is the source spectrum. For voiced speech (vocal cord buzz), this is a series of evenly-spaced peaks (harmonics) at multiples of the fundamental frequency f₀. These peaks are closely spaced — they look like a rapidly varying pattern.

• log |W[k]| is the vocal tract response. This is a smooth envelope that varies slowly across frequency — it has broad peaks (formants) separated by 1000+ Hz.

So log |Y[k]| = a slowly-varying component (filter) + a rapidly-varying component (source). This is just like a signal with low-frequency content (filter) and high-frequency content (source). We can separate them with... another Fourier transform!

Worked Example

Suppose Y[k] for k = 0, 1, ..., 7:

|Y| = [10, 2, 8, 1.5, 9, 1.8, 7, 2.2]

Notice the alternating pattern? Large values at even k (harmonics), small at odd k (between harmonics). This is the source's periodic structure.

Take log: log|Y| = [2.30, 0.69, 2.08, 0.41, 2.20, 0.59, 1.95, 0.79]

Now this alternating pattern is additive. The "slow" component (average ~1.5) is the filter envelope. The "fast" component (oscillation of ~0.8) is the source harmonics.

The Logarithm Separates Multiplicative Signals

Top: |Y[k]| = |X[k]| · |W[k]| (multiplication). Bottom: log|Y[k]| = log|X[k]| + log|W[k]| (addition). In the log domain, you can visually separate the slowly-varying envelope from the rapidly-varying harmonics.

Pitch f₀ 6

After taking log|Y[k]|, the source (harmonic structure) and filter (vocal tract envelope) are related by:

Addition: log|Y[k]| = log|X[k]| + log|W[k]|, making them separable by filtering Multiplication: they're still multiplied Convolution: they're convolved in the log domain

Chapter 2: Cepstrum

The cepstrum is defined as the inverse DFT (or DCT) of the log magnitude spectrum:

c[n] = IDFT{log |Y[k]|}

The name "cepstrum" is an anagram of "spectrum" — coined by Bogert, Healy, and Tukey (1963). The independent variable n in the cepstrum is called quefrency (anagram of "frequency"). These playful names remind us that we're in a "domain of a domain."

What the Cepstrum Contains

Since log |Y[k]| = log |X[k]| + log |W[k]|, the cepstrum is:

c[n] = c_x[n] + c_w[n]

Where c_x[n] = IDFT{log |X[k]|} and c_w[n] = IDFT{log |W[k]|}.

The key observation: log |W[k]| is slowly varying (smooth envelope), so its cepstrum c_w[n] is concentrated at low quefrencies (small n). Meanwhile, log |X[k]| oscillates rapidly (harmonics), so c_x[n] has energy at high quefrencies (specifically, near n = N/f₀, the "period" of the harmonic pattern in the log spectrum).

Separation recipe: Compute the cepstrum. The vocal tract (formant) information lives at low quefrencies (n < 15 or so). The pitch information lives at high quefrencies (near the pitch period). A simple "lifter" (cepstral filter) that keeps only low quefrencies extracts the vocal tract envelope.

The Complete Pipeline

1. DFT

Y[k] = DFT{y[n]}

↓

2. Log magnitude

L[k] = log|Y[k]|

↓

3. IDFT (or DCT)

c[n] = IDFT{L[k]} = cepstrum

↓

4. Lifter

Keep low quefrencies (vocal tract) OR detect peak (pitch)

Pitch Detection via Cepstrum

For voiced speech, the cepstrum has a clear peak at quefrency n = f_s/f₀ (the pitch period in samples). This peak is remarkably robust — it persists even in noisy conditions. Cepstral pitch detection was the standard method for decades before deep learning.

python
import numpy as np

def cepstrum(y, n_fft=1024):
    """Compute the real cepstrum of signal y."""
    Y = np.fft.fft(y, n_fft)
    log_mag = np.log(np.abs(Y) + 1e-10)  # epsilon avoids log(0)
    ceps = np.real(np.fft.ifft(log_mag))
    return ceps

def detect_pitch(y, fs, n_fft=1024):
    """Find pitch period from cepstral peak."""
    ceps = cepstrum(y, n_fft)
    # Search in plausible pitch range: 50-500 Hz
    min_q = int(fs / 500)   # max frequency
    max_q = int(fs / 50)    # min frequency
    peak_q = min_q + np.argmax(ceps[min_q:max_q])
    f0 = fs / peak_q
    return f0

Cepstrum: Separating Source and Filter

Observe how the cepstrum separates pitch (high quefrency peak) from vocal tract (low quefrency shape). The dashed line is the lifter cutoff.

Lifter cutoff 15

In the cepstrum, where does the vocal tract (formant) information concentrate?

At low quefrencies (small n), because the vocal tract envelope is smooth (slowly varying in frequency) At high quefrencies (large n) Uniformly across all quefrencies

Chapter 3: Mel Scale

The cepstrum is mathematically elegant but doesn't account for how humans actually perceive frequency. A jump from 100 Hz to 200 Hz sounds like a HUGE change (one octave). A jump from 8000 Hz to 8100 Hz sounds like almost nothing (barely perceptible). Human frequency perception is logarithmic, not linear.

The mel scale is a perceptual frequency scale where equal distances correspond to equal perceived pitch differences. It was defined experimentally by Stevens, Volkmann, and Newman (1937) by asking listeners to adjust tones until they sounded "equally spaced."

The Mel Formula

m = 2595 · log₁₀(1 + f/700)

And the inverse:

f = 700 · (10^m/2595 - 1)

Worked Examples

Frequency (Hz)	Mel	Perception
0	0	Silence
200	283	Low male voice pitch
500	607	First formant of "ah"
1000	1000	Reference (1000 mel = 1000 Hz by design)
2000	1346	Second formant region
4000	1750	Consonant energy region
8000	2146	Near hearing limit for older adults

Notice: from 0 to 1000 Hz spans 1000 mel. From 1000 to 8000 Hz spans only 1146 mel. The mel scale compresses high frequencies drastically — matching our reduced sensitivity there.

Why 700 Hz? The formula approximates two regions: below ~700 Hz, perception is roughly linear (m ≈ f). Above ~700 Hz, perception is roughly logarithmic (m ≈ 2595 · log f). The constant 700 is the transition point. This matches psychoacoustic experiments closely.

Why Mel Scale for Speech Recognition?

If two phonemes differ by the same amount in mel space, they're equally easy for humans to distinguish. So a speech recognition system should allocate equal "attention" to equal mel intervals — more frequency bins at low frequencies (where perception is sharp) and fewer bins at high frequencies (where perception is coarse). The mel filter bank does exactly this.

Hz to Mel Mapping

The mel scale is approximately linear below 700 Hz and logarithmic above. Move the frequency slider to see the mapping.

Frequency (Hz) 1000

On the mel scale, the perceptual distance between 100 Hz and 200 Hz is _____ the distance between 4000 Hz and 4100 Hz.

Much larger than (because low frequencies span more mel per Hz) The same as Smaller than

Chapter 4: Mel Filter Bank

The mel filter bank is a set of triangular bandpass filters spaced uniformly on the mel scale. Each filter captures the energy in one perceptual frequency band.

Construction

Given R filters spanning from f_low to f_high:

1. Convert f_low and f_high to mel: m_low, m_high

2. Create R+2 equally-spaced points in mel: m₀, m₁, ..., m_R+1

3. Convert back to Hz: f₀, f₁, ..., f_R+1

4. Each filter r is a triangle with center f_r+1, left edge f_r, right edge f_r+2

The r-th filter weight at frequency bin k is:

V_r[k] = 0 if f[k] < f_r or f[k] > f_r+2

V_r[k] = (f[k] - f_r) / (f_r+1 - f_r) if f_r ≤ f[k] ≤ f_r+1

V_r[k] = (f_r+2 - f[k]) / (f_r+2 - f_r+1) if f_r+1 < f[k] ≤ f_r+2

The Mel-Frequency Spectrum

Apply each filter to the power spectrum |X[k]|²:

MF[r] = ∑_k V_r[k] · |X[k]|², r = 0, 1, ..., R-1

This gives R values (typically R = 26 or 40) that represent the signal's energy distribution on a perceptually-motivated frequency scale.

Key properties: Filters are narrow at low frequencies (high resolution where we're sensitive) and wide at high frequencies (low resolution where we're less sensitive). Adjacent filters overlap at their half-power points, ensuring no spectral energy is lost. This is exactly the "logarithmic frequency resolution" we discussed with wavelets — but designed specifically for human hearing.

Worked Example

R = 4 filters, f_low = 0 Hz, f_high = 4000 Hz, f_s = 8000 Hz.

Mel bounds: m_low = 0, m_high = 2146. Six points: 0, 358, 715, 1073, 1431, 1789, 2146.

Back to Hz: 0, 233, 508, 841, 1253, 1774, 2400 (approximately).

Filter 0: triangle from 0 to 508 Hz (center 233 Hz) — narrow, 508 Hz wide

Filter 3: triangle from 1253 to 2400 Hz (center 1774 Hz) — wide, 1147 Hz wide

The higher filter is more than twice as wide — lower resolution at high frequencies.

Mel Filter Bank Visualizer

Triangular filters spaced on the mel scale. Notice: narrow at low frequencies, wide at high frequencies. Adjust number of filters.

Number of filters R 20

Why are mel filter bank triangles wider at high frequencies?

Because human frequency discrimination is coarser at high frequencies (equal mel spacing maps to wider Hz bands), matching our reduced perceptual resolution Because high-frequency sounds are louder Because of computational efficiency

Chapter 5: MFCC Pipeline

The Mel-Frequency Cepstral Coefficients combine the mel scale perception model with the cepstral separation idea. They are the standard feature representation for speech recognition. Let's build the full pipeline.

The Complete MFCC Pipeline

1. Windowing

Extract 20-30 ms frame, apply Hamming window

↓

2. DFT

X[k] = DFT{windowed frame}

↓

3. Power spectrum

P[k] = |X[k]|² / N

↓

4. Mel filter bank

MF[r] = ∑_k V_r[k] · P[k], r = 0,...,R-1

↓

5. Logarithm

S[r] = log(MF[r])

↓

6. DCT

c[n] = DCT{S[r]}, n = 0,...,12

The output: 13 numbers (c[0] through c[12]) that summarize the spectral shape of one frame of speech. These 13 MFCCs are computed every 10 ms, producing a feature matrix for the entire utterance.

Why Each Step

Windowing (Step 1): 25 ms frame (speech is quasi-stationary over this duration). Hamming window prevents spectral leakage.

Power spectrum (Step 3): Discards phase (irrelevant for speech recognition). Power rather than magnitude because human loudness perception is closer to power.

Mel filter bank (Step 4): Mimics the frequency resolution of the human cochlea. Compresses high frequencies. Reduces from ~256 spectral bins to ~26 mel bands.

Logarithm (Step 5): Two purposes: (a) mimics human loudness perception (Weber-Fechner law: perceived loudness is proportional to log of intensity); (b) converts the multiplicative source-filter combination into an additive one.

DCT (Step 6): Decorrelates the mel filter bank outputs (adjacent filters overlap, so MF[r] values are correlated). The DCT is like a "cepstral" transform of the mel spectrum. Keeping only the first 13 coefficients is a form of dimensionality reduction that captures the smooth spectral envelope (vocal tract shape) and discards fine harmonic structure.

Why DCT instead of IDFT? The log mel spectrum is a real, symmetric-ish sequence. The DCT is the natural orthogonal transform for real sequences — it gives better energy compaction (more information in fewer coefficients) than the IDFT. Also, DCT coefficients are all real, which simplifies downstream modeling.

Worked Example with Numbers

Suppose a 25 ms frame at 16 kHz sampling rate (= 400 samples). After windowing and 512-point FFT:

Power spectrum P[k] for k = 0 to 256 (first half of symmetric spectrum).

Apply R = 26 mel filters: MF = [0.02, 0.15, 0.43, 0.89, 1.23, 0.98, ..., 0.04]

Take log: S = [-3.91, -1.90, -0.84, -0.12, 0.21, -0.02, ..., -3.22]

Apply DCT (type-II) and keep first 13 coefficients:

MFCC = [c₀, c₁, ..., c₁₂] = [-5.2, 3.1, -1.8, 0.9, -0.4, 0.2, ...]

c₀ ≈ overall energy. c₁-c₄ ≈ spectral tilt and broad shape. c₅-c₁₂ ≈ finer spectral details.

MFCC Pipeline: Step Through Each Stage

A synthetic speech-like signal processed through the full MFCC pipeline. Click each step to see the data at that stage.

Click pipeline steps to see the data at each stage.

python
import numpy as np

def compute_mfcc(signal, fs=16000, n_mfcc=13, n_filt=26,
                  frame_len=0.025, frame_step=0.01, n_fft=512):
    """Compute MFCCs from a speech signal."""
    # 1. Framing + windowing
    frame_samp = int(frame_len * fs)  # 400 samples
    step_samp = int(frame_step * fs)   # 160 samples
    n_frames = (len(signal) - frame_samp) // step_samp + 1

    mfccs = []
    for i in range(n_frames):
        frame = signal[i*step_samp : i*step_samp + frame_samp]
        frame = frame * np.hamming(frame_samp)  # window

        # 2-3. Power spectrum
        X = np.fft.rfft(frame, n_fft)
        P = np.abs(X)**2 / n_fft

        # 4. Mel filter bank
        mel_energies = mel_filter_bank(P, fs, n_fft, n_filt)

        # 5. Log
        log_mel = np.log(mel_energies + 1e-10)

        # 6. DCT (type-II, first n_mfcc coefficients)
        mfcc = dct(log_mel)[:n_mfcc]
        mfccs.append(mfcc)

    return np.array(mfccs)  # shape: (n_frames, 13)

Why do we apply the DCT to the log mel energies (rather than keeping the log mel values directly as features)?

The DCT decorrelates the overlapping mel filter outputs and provides energy compaction (most information in few coefficients), improving downstream modeling The DCT makes the values positive The DCT is faster to compute

Chapter 6: Applications

MFCCs have been the dominant feature for speech and audio machine learning for over 30 years. Let's see why.

Speech Recognition

The classic speech recognition pipeline:

1. Extract 13 MFCCs per frame (every 10 ms)

2. Append delta (velocity) and delta-delta (acceleration) coefficients → 39 features per frame

3. Feed to a classifier: HMM-GMM (traditional) or DNN/RNN (modern)

MFCCs capture "what phoneme" while being mostly invariant to "who's speaking" (because the DCT liftering removes pitch information). They also compress 256+ spectral bins into just 13 numbers — massive dimensionality reduction with minimal information loss for phoneme classification.

Speaker Identification

Paradoxically, MFCCs are ALSO used for speaker identification — just with different modeling. The same 13 MFCCs capture enough vocal tract information to distinguish speakers, because each person's vocal tract has a unique shape (unique formant pattern).

The difference: speech recognition models the sequence of MFCCs (phoneme transitions). Speaker identification models the distribution of MFCCs (which region of MFCC space this speaker tends to occupy).

Music Information Retrieval

MFCCs capture timbre — the "color" of sound that distinguishes a piano from a guitar playing the same note. Applications:

• Genre classification (rock vs. classical vs. jazz)

• Instrument recognition

• Music similarity / recommendation

The end of MFCCs? Since ~2012, deep learning models (especially CNNs and transformers) can learn directly from spectrograms or even raw waveforms, bypassing hand-crafted features. But MFCCs remain useful: they're computationally cheap, well-understood, and still competitive on limited-data tasks. Models like Whisper use mel spectrograms (steps 1-5) but skip the DCT, letting the neural network learn its own "optimal cepstral transform."

Beyond Standard MFCCs

Variant	Modification	Use Case
MFCC + Δ + ΔΔ	Add first and second time derivatives	Standard ASR (captures dynamics)
Log-mel spectrogram	Skip the DCT step	Neural network input (Whisper, etc.)
PLP (Perceptual LP)	All-pole model of mel spectrum	Noise-robust speech recognition
Gammatone filterbank	More physiologically accurate filters	Hearing aid research

MFCC Space: Different Vowels

Different vowels occupy distinct regions in MFCC space. Here we plot c1 vs c2 for synthetic vowel-like signals.

Click vowel buttons to plot their MFCC features. Different vowels cluster separately.

Why do modern neural networks (like Whisper) use log-mel spectrograms instead of full MFCCs?

The network can learn its own optimal transform (the DCT is fixed and may not be optimal for all tasks), while still benefiting from the perceptual mel scaling and log compression Log-mel spectrograms are faster to compute MFCCs contain too much information

Chapter 7: Mastery

Let's consolidate the full story from source-filter model to modern speech features.

The Conceptual Chain

Problem	Solution	Key Insight
Speech = source * filter (convolved)	Take DFT: Y = X · W (multiplied)	Convolution theorem (Lecture 12)
X and W are multiplied	Take log: log Y = log X + log W (added)	Logarithm turns multiplication to addition
Need to separate additive components	IDFT of log spectrum = cepstrum	Source (fast) and filter (slow) separate in quefrency
Linear frequency scale doesn't match perception	Mel filter bank	Human hearing is logarithmic
Need compact, decorrelated features	DCT of log mel energies = MFCC	DCT gives energy compaction

The Complete MFCC Recipe (13 numbers per 25ms frame)

signal → window → |FFT|² → mel filters → log → DCT → first 13 = MFCC

Connections

Looking back: Lecture 12 (LTI Systems) proved the convolution theorem that makes the source-filter separation possible. Lecture 8 (STFT) provided the windowing framework. Lecture 11 (Wavelet Applications) showed that logarithmic frequency resolution matches natural signals.

The bigger picture: MFCCs encode THREE insights: (1) the convolution theorem converts convolution to multiplication; (2) the logarithm converts multiplication to addition; (3) human perception is logarithmic in frequency. Each step has a principled justification — nothing is arbitrary.

Pilanci's EE269 arc: From the DFT (Lecture 6) through STFT (8), wavelets (10-11), LTI systems (12), to cepstrum/MFCC (13) — the course has built a complete toolkit for signal analysis. The DFT is the foundation; everything else is a specialized adaptation for specific signal types and perceptual requirements.

What is the complete chain of operations to compute MFCCs from a speech frame?

Window → DFT → power spectrum → mel filter bank → log → DCT → keep first 13 coefficients DFT → log → mel filter → DCT Window → wavelet decomposition → threshold → DCT