Signal Processing Workbook — Engineermaxxing

Chapter 0: Discrete Signals

You're building an audio classifier but the first question your reviewer asks is: "What's the energy of your input signal?" You stare blankly. Energy, power, convolution — these are the atoms of signal processing. Every algorithm you'll ever build sits on top of these three computations.

A discrete-time signal x[n] is just a sequence of numbers indexed by integer n. Unlike continuous signals, we can write them out explicitly and compute with them by hand.

Signal energy (finite-length signals):
E = ∑_n |x[n]|²

Signal power (periodic signals with period N):
P = (1/N) ∑_n=0^N-1 |x[n]|²

Linear convolution:
y[n] = (x * h)[n] = ∑_k x[k] · h[n - k]

Energy vs. Power. Energy sums |x[n]|² over all time — it's finite only for signals that die out. Power averages |x[n]|² per period — it's the right measure for signals that repeat forever. A 1-second audio clip has finite energy. A pure sine wave has finite power but infinite energy.

Exercise 0.1: Signal Energy Derive

Compute the energy of x[n] = [1, 2, 3, 2, 1].

E = |1|² + |2|² + |3|² + |2|² + |1|²

energy

Show derivation

E = 1² + 2² + 3² + 2² + 1² = 1 + 4 + 9 + 4 + 1 = 19

Sum of squared magnitudes. For real signals, |x[n]|² = x[n]². For complex signals you'd need |x[n]|² = x[n] · x*[n].

Exercise 0.2: Periodic Signal Power Derive

x[n] is periodic with one period = [3, 1, -1, 1]. What is the average power P?

P = (1/N) ∑ |x[n]|² over one period, N = 4.

power

Show derivation

P = (1/4)(3² + 1² + (-1)² + 1²) = (1/4)(9 + 1 + 1 + 1) = 12/4 = 3.0

Power is energy per period. This signal repeats [3, 1, -1, 1] forever, so it has infinite total energy but finite power = 3.0.

Exercise 0.3: Convolution by Hand Trace

Compute y = [1, 2, 1] * [1, 1, 1] (linear convolution). What is y[2] (the middle/peak value, using 0-indexing)?

y has length len(x) + len(h) - 1 = 5. Compute y[n] = ∑_k x[k] · h[n-k] for n = 2.

y[2]

Show derivation

x = [1, 2, 1], h = [1, 1, 1]. Output length = 3 + 3 - 1 = 5.

y[0] = x[0]·h[0] = 1·1 = 1

y[1] = x[0]·h[1] + x[1]·h[0] = 1·1 + 2·1 = 3

y[2] = x[0]·h[2] + x[1]·h[1] + x[2]·h[0] = 1·1 + 2·1 + 1·1 = 4

y[3] = x[1]·h[2] + x[2]·h[1] = 2·1 + 1·1 = 3

y[4] = x[2]·h[2] = 1·1 = 1

y = [1, 3, 4, 3, 1]

Convolution with [1,1,1] is a 3-point moving average (unnormalized). Notice the output is symmetric — convolving two symmetric signals always produces a symmetric result.

Exercise 0.4: Full Convolution Output Derive

Compute y[4] for x = [2, 0, 1, 3, 1] convolved with h = [1, -1]. (0-indexed.)

Output length = 5 + 2 - 1 = 6. y[n] = ∑_k x[k]·h[n-k].

y[4]

Show derivation

y[4] = x[4]·h[0] + x[3]·h[1] = 1·1 + 3·(-1) = 1 - 3 = -2

Convolution with [1, -1] computes a discrete derivative — the first difference. y[n] = x[n] - x[n-1]. This is a high-pass filter: it responds to changes and ignores constant regions.

Exercise 0.5: Implement convolve() Build

Write a function that computes the full linear convolution of two arrays. Return an array of length len(x) + len(h) - 1.

y[n] = sum over k of x[k] * h[n-k], where out-of-bounds values are 0.

Show solution

javascript
function convolve(x, h) {
  const N = x.length + h.length - 1;
  const y = new Array(N).fill(0);
  for (let n = 0; n < N; n++) {
    for (let k = 0; k < x.length; k++) {
      if (n - k >= 0 && n - k < h.length) {
        y[n] += x[k] * h[n - k];
      }
    }
  }
  return y;
}

Exercise 0.6: Convolution Length Trace

You convolve a 1024-sample audio frame with a 64-tap FIR filter. How long is the output?

1024 samples (same as input) 1088 samples 1087 samples (1024 + 64 - 1) 65536 samples (1024 × 64)

Show reasoning

Output length = len(x) + len(h) - 1 = 1024 + 64 - 1 = 1087

Linear convolution always grows the signal. The extra 63 samples are "tails" where the filter is only partially overlapping the signal. In practice, many systems discard these tails (valid convolution) or zero-pad to keep the same length.

Chapter 1: DFT by Hand

Your spectrogram shows a suspicious peak at bin 3. What frequency does that correspond to? Is it real or an aliasing artifact? You can't debug frequency-domain code without being able to compute the DFT by hand for small examples.

The Discrete Fourier Transform maps N time-domain samples to N frequency-domain coefficients:

Forward DFT:
X[k] = ∑_n=0^N-1 x[n] · e^-j2πkn/N for k = 0, 1, ..., N-1

Twiddle factor:
W_N = e^-j2π/N so e^-j2πkn/N = W_N^kn

Key property: W_N^N = 1 (the twiddle factors are periodic)

What each X[k] means. X[0] is the DC component — the sum (or mean) of the signal. X[k] measures how much of the signal oscillates at frequency k/N cycles per sample. X[N/2] (for even N) is the Nyquist frequency. The magnitude |X[k]| tells you the amplitude; the angle ∠X[k] tells you the phase.

Exercise 1.1: DFT of All-Ones Derive

Compute the 4-point DFT of x = [1, 1, 1, 1]. What is X[0]?

X[0] = ∑_n=0³ x[n] · e^-j·0 = ∑ x[n].

X[0]

Show derivation

X[0] = 1·e⁰ + 1·e⁰ + 1·e⁰ + 1·e⁰ = 4

X[1] = 1·e⁰ + 1·e^-jπ/2 + 1·e^-jπ + 1·e^-j3π/2 = 1 + (-j) + (-1) + j = 0

X[2] = 1 + e^-jπ + e^-j2π + e^-j3π = 1 + (-1) + 1 + (-1) = 0

X[3] = 0 (by same cancellation)

DFT([1,1,1,1]) = [4, 0, 0, 0]

A constant signal has all its energy at DC (frequency 0). No oscillation means no frequency content above DC. This is the frequency-domain definition of "flat."

Exercise 1.2: DFT of Alternating Signal Derive

Compute the 4-point DFT of x = [1, 0, 1, 0]. What is |X[2]|?

Compute X[k] for each k = 0, 1, 2, 3. Then take the magnitude of X[2].

|X[2]|

Show derivation

X[0] = 1 + 0 + 1 + 0 = 2

X[1] = 1·1 + 0 + 1·e^-jπ + 0 = 1 + (-1) = 0

X[2] = 1·1 + 0 + 1·e^-j2π + 0 = 1 + 1 = 2

X[3] = 1·1 + 0 + 1·e^-j3π + 0 = 1 + (-1) = 0

DFT([1,0,1,0]) = [2, 0, 2, 0] ⇒ |X[2]| = 2

[1,0,1,0] oscillates at the Nyquist rate — it flips every sample. That's frequency k=2 in a 4-point DFT (which is N/2 = the Nyquist bin). Energy appears at DC (the mean is 0.5) and at Nyquist.

Exercise 1.3: Magnitude Spectrum Derive

For x = [1, -1, 1, -1], compute |X[0]|. Then compute |X[2]|.

This signal has zero mean. Where does its energy concentrate?

|X[2]|

Show derivation

X[0] = 1 + (-1) + 1 + (-1) = 0 (zero mean ⇒ no DC)

X[1] = 1 + (-1)·(-j) + 1·(-1) + (-1)·(j) = 1 + j - 1 - j = 0

X[2] = 1 + (-1)·(-1) + 1·1 + (-1)·(-1) = 1 + 1 + 1 + 1 = 4

X[3] = 0 (by conjugate symmetry)

[1,-1,1,-1] is a pure Nyquist-rate oscillation. All energy is at bin k=2 (N/2). Compare with [1,0,1,0] which had energy split between DC and Nyquist because its mean was nonzero.

Exercise 1.4: Conjugate Symmetry Trace

For a real signal x[n], why is |X[k]| = |X[N-k]|?

Because the DFT matrix is unitary Because X[N-k] = X*[k] for real x[n] — complex conjugate symmetry means equal magnitudes Because the FFT algorithm processes pairs of frequencies Because zero-padding makes the DFT symmetric

Show reasoning

X[N-k] = ∑_n x[n] · e^{-j2π(N-k)n/N} = ∑_n x[n] · e^-j2πn · e^j2πkn/N

Since e^-j2πn = 1 for integer n: X[N-k] = ∑_n x[n] · e^j2πkn/N

For real x[n]: X[N-k] = X*[k] (complex conjugate)

Therefore |X[N-k]| = |X*[k]| = |X[k]| ■

This is why we only need to plot bins 0 to N/2 — the upper half is a mirror image. The "negative frequencies" X[N-k] are just the conjugates of the positive frequencies X[k]. This also means a real N-point signal has only N/2 + 1 independent frequency components.

Exercise 1.5: DFT of Impulse Derive

Compute |X[3]| for x = [1, 0, 0, 0, 0, 0, 0, 0] (8-point DFT of a unit impulse).

|X[3]|

Show derivation

X[k] = ∑_n=0⁷ x[n] · e^-j2πkn/8 = x[0] · e⁰ = 1 for all k

DFT([1,0,0,0,0,0,0,0]) = [1, 1, 1, 1, 1, 1, 1, 1]

|X[3]| = 1

An impulse contains equal energy at all frequencies — it's the "white" signal of the time domain. This is the Fourier dual of the all-ones result from Exercise 1.1: constant in one domain means impulse in the other.

Exercise 1.6: Parseval's Theorem Derive

Parseval's theorem states: ∑|x[n]|² = (1/N)∑|X[k]|². For x = [1, 0, 1, 0] with X = [2, 0, 2, 0], verify by computing (1/N)∑|X[k]|².

(1/N)∑|X[k]|²

Show derivation

Time-domain energy: |1|² + |0|² + |1|² + |0|² = 2

Freq-domain: (1/4)(|2|² + |0|² + |2|² + |0|²) = (1/4)(4 + 0 + 4 + 0) = 8/4 = 2 ✓

Parseval's theorem guarantees energy is preserved across the DFT. You can compute signal energy in whichever domain is more convenient. The 1/N factor compensates for the DFT's unnormalized convention.

Chapter 2: Quantization

Your ADC (analog-to-digital converter) uses 8 bits. Your colleague says "that's 48 dB of dynamic range." Where does that number come from? Quantization is the bridge between the continuous analog world and the discrete digital world, and its noise is the fundamental floor beneath every digital system.

Uniform quantization:
Range [-X_max, X_max], B bits ⇒ 2^B levels
Step size: Δ = 2X_max / 2^B

Quantization noise (uniform input assumption):
Noise power = Δ² / 12

SNR formula:
SNR_dB = 6.02B + 1.76 dB

The 6 dB per bit rule. Every additional bit of resolution doubles the number of quantization levels, halving the step size Δ. Since noise power ∝ Δ², halving Δ cuts noise power by 4x = 6 dB. So each bit buys you ~6 dB of SNR. This is the single most important number in digital audio and ADC design.

Exercise 2.1: Step Size Derive

8-bit uniform quantizer, range [-1, 1]. What is the step size Δ?

Δ = 2X_max / 2^B = 2 / 256.

Show derivation

Δ = 2 × 1 / 2⁸ = 2 / 256 = 0.0078125

With 256 levels spanning a range of 2, each level covers 0.0078125 units. Any input value within one step is rounded to the same output — this rounding is the quantization error.

Exercise 2.2: Quantization Noise Power Derive

For Δ = 0.0078125 (from Exercise 2.1), compute the quantization noise power.

Noise power = Δ² / 12. The quantization error is uniformly distributed in [-Δ/2, Δ/2], and the variance of a uniform distribution on [-a, a] is a²/3 = (Δ/2)²/3 = Δ²/12.

noise power (×10^-6)

Show derivation

Δ² = 0.0078125² = 6.1035 × 10^-5

Noise power = 6.1035 × 10^-5 / 12 = 5.086 × 10^-6

This is tiny — but remember, it's the noise power. The RMS noise amplitude is √(5.086×10^-6) ≈ 0.00226, which is about 0.23% of the full range. Clearly audible if your signal is quiet.

Exercise 2.3: SNR for 8-Bit Audio Derive

Compute the SNR in dB for an 8-bit quantizer using the formula SNR = 6.02B + 1.76.

Show derivation

SNR = 6.02 × 8 + 1.76 = 48.16 + 1.76 = 49.92 dB

~50 dB is mediocre by modern standards. Old telephone systems used 8-bit μ-law encoding. CD audio uses 16 bits (96 dB), and professional audio uses 24 bits (144 dB).

Exercise 2.4: SNR for 16-Bit Audio Derive

Compute the SNR for a 16-bit quantizer. By how many dB does it exceed 8-bit?

dB (16-bit SNR)

Show derivation

SNR₁₆ = 6.02 × 16 + 1.76 = 96.32 + 1.76 = 98.08 dB

Improvement = 98.08 - 49.92 = 48.16 dB = 6.02 × 8 additional bits

Doubling the bit depth adds 6.02 × 8 = 48.16 dB. This is why CD audio (16-bit, 98 dB) sounds dramatically better than old 8-bit systems (~50 dB). The human ear has a dynamic range of about 120 dB, so 16 bits covers most of it.

Exercise 2.5: Find the Bug Debug

This SNR calculator gives wrong results. Click the buggy line.

function snr_db(bits, xmax) {
  const levels = Math.pow(2, bits);
  const delta = 2 * xmax / levels;
  const noisePower = delta * delta / 12;
  const signalPower = xmax * xmax;
  return 10 * Math.log10(signalPower / noisePower);
}

Show explanation

Line 5 is the bug. The signal power for a full-scale sinusoid is X_max²/2 (the RMS squared), not X_max² (the peak squared). Using peak power instead of RMS power inflates the SNR by 3 dB. The corrected line should be const signalPower = xmax * xmax / 2;.

This is a classic mistake. The 6.02B + 1.76 formula already assumes a full-scale sinusoid with power = X_max²/2. If you derive SNR from scratch, you must use RMS signal power.

Exercise 2.6: Bits Needed Trace

You need at least 90 dB of SNR for a professional recording. Using SNR = 6.02B + 1.76, what is the minimum number of bits?

12 bits (SNR = 74 dB) 14 bits (SNR = 86 dB) 15 bits (SNR = 92 dB) — first integer that clears 90 dB 16 bits (SNR = 98 dB)

Show reasoning

B ≥ (90 - 1.76) / 6.02 = 88.24 / 6.02 = 14.66

Since B must be an integer: B = 15 ⇒ SNR = 6.02(15) + 1.76 = 92.06 dB ✓

In practice nobody uses 15-bit ADCs. You'd use 16 bits and get 98 dB — the extra headroom is cheap and useful. But the math shows that 15 bits is the theoretical minimum for 90 dB.

Chapter 3: Lloyd-Max Quantizer

Uniform quantization wastes resolution. If your input signal is Gaussian (as most audio and sensor data roughly is), values near zero are far more likely than values near the extremes. A non-uniform quantizer packs more levels where the PDF is dense and fewer where it's sparse — like variable-length coding for amplitude.

Lloyd-Max conditions (optimal quantizer):
1. Nearest-neighbor rule: Each input x is assigned to the reconstruction level r_i closest to it. Boundary b_i = (r_i + r_i+1) / 2.

2. Centroid rule: Each reconstruction level is the centroid of its region:
r_i = E[x | b_i-1 ≤ x < b_i] = ∫_{b_i-1}^b_i x · f(x) dx / ∫_{b_i-1}^b_i f(x) dx

Algorithm: Alternate between updating boundaries and centroids until convergence.

Lloyd-Max is k-means for 1D. The algorithm is identical to k-means clustering: assign each point to the nearest center (boundaries), then update centers to be the mean of assigned points (centroids). Repeat. It always converges, but may find a local minimum.

Exercise 3.1: Optimal 2-Level Boundary Derive

For a Gaussian N(0,1) input with a 2-level quantizer (1 bit), the two reconstruction levels are symmetric: r₁ = -a, r₂ = +a. The optimal boundary is at b = 0 (by symmetry). What is the optimal reconstruction level a = E[X | X > 0] for N(0,1)?

For a standard normal, E[X | X > 0] = √(2/π) ≈ 0.7979. This is a standard result from the half-normal distribution.

a (4 decimal places)

Show derivation

E[X | X > 0] = ∫₀^∞ x · f(x) dx / ∫₀^∞ f(x) dx

Numerator: ∫₀^∞ x · (1/√2π) · e^-x²/2 dx = 1/√(2π)

Denominator: P(X > 0) = 1/2

a = (1/√(2π)) / (1/2) = 2/√(2π) = √(2/π) ≈ 0.7979

The optimal 1-bit quantizer for Gaussian maps everything below 0 to -0.7979 and everything above 0 to +0.7979. The distortion is σ² - 2/π ≈ 0.3634, or about 36.3% of the signal variance. Terrible, but it's the best 1 bit can do.

Exercise 3.2: Lloyd-Max Iteration Trace

You're running Lloyd-Max on data points [−3, −2, −1, 0, 1, 2, 4] with initial reconstruction levels r₁ = −1, r₂ = 2. Step 1: Compute the boundary b = (r₁ + r₂)/2. Step 2: Update r₁ = mean of points below b, r₂ = mean of points above b.

What is the updated r₂ after one iteration?

r₂ (after 1 iteration)

Show derivation

Step 1: b = (-1 + 2) / 2 = 0.5

Points below 0.5: {-3, -2, -1, 0} ⇒ r₁ = (-3-2-1+0)/4 = -6/4 = -1.5

Points above 0.5: {1, 2, 4} ⇒ r₂ = (1+2+4)/3 = 7/3 ≈ 2.333

After one iteration: r₁ = -1.5, r₂ = 2.333. The next step would compute b = (-1.5+2.333)/2 = 0.417, shifting the boundary left because the negative cluster is denser. You keep iterating until the levels stop changing.

Exercise 3.3: Distortion After Quantization Derive

Using the updated levels from 3.2 (r₁=-1.5, r₂=2.333) with boundary b=0.5, compute the mean squared quantization error (MSE distortion) for the data [−3, −2, −1, 0, 1, 2, 4].

D = (1/N) ∑ (x_i - q(x_i))² where q(x) maps to the nearest reconstruction level. Points ≤ 0.5 map to r₁, points > 0.5 map to r₂.

MSE distortion

Show derivation

Points mapped to r₁ = -1.5: {-3, -2, -1, 0}

(-3-(-1.5))² + (-2-(-1.5))² + (-1-(-1.5))² + (0-(-1.5))² = 2.25 + 0.25 + 0.25 + 2.25 = 5.0

Points mapped to r₂ = 2.333: {1, 2, 4}

(1-2.333)² + (2-2.333)² + (4-2.333)² = 1.778 + 0.111 + 2.778 = 4.667

MSE = (5.0 + 4.667) / 7 = 9.667 / 7 ≈ 1.381

Compare with the distortion before the update (using original levels r₁=-1, r₂=2): MSE = (4+1+0+1+1+0+4)/7 = 11/7 ≈ 1.571. One Lloyd-Max iteration reduced distortion from 1.571 to 1.381 — a 12% improvement. The algorithm converges when the levels and boundaries stop changing.

Exercise 3.4: Uniform vs Lloyd-Max Trace

For a Gaussian input, a 4-level (2-bit) Lloyd-Max quantizer achieves about 9.24 dB SQNR, while a 4-level uniform quantizer achieves about 9.25 dB. Why are they almost identical?

Because the Gaussian PDF is symmetric Because 2 bits is too few for the non-uniformity to help With so few levels, both quantizers are forced into nearly the same configuration — the Gaussian is "almost uniform" over the range that matters with only 4 levels They're identical by definition at 2 bits

Show reasoning

At very low bit rates, the Gaussian PDF within the active range [-2σ, 2σ] doesn't vary enough for non-uniform spacing to help significantly. The Lloyd-Max gain over uniform grows with the number of levels. At 8 bits (256 levels), Lloyd-Max can gain ~1-2 dB over uniform for Gaussian input. The real payoff of non-uniform quantization is for highly non-uniform distributions (like speech amplitude, which follows a Laplacian).

Exercise 3.5: Companding as Approximate Lloyd-Max Trace

μ-law companding applies a logarithmic compression before uniform quantization: y = sign(x) · ln(1 + μ|x|) / ln(1 + μ). Why does this approximate a Lloyd-Max quantizer?

It doesn't — companding is a completely different technique The log compression maps the Gaussian-like PDF to a more uniform PDF, so uniform quantization of the compressed signal approximates non-uniform quantization of the original Because μ-law uses 256 levels, which matches Lloyd-Max

Show reasoning

Companding = compression + expanding. The log compressor makes small values bigger and large values smaller, effectively spreading the dense-near-zero PDF into a more uniform shape. Uniform quantization of this spread-out signal is equivalent to non-uniform quantization of the original — more levels near zero, fewer at the extremes. It's a practical approximation to Lloyd-Max that doesn't require knowledge of the exact input distribution.

Chapter 4: STFT & Windowing

You run an FFT on 3 seconds of music and get one frequency spectrum. But music changes over time — there's a note at t=0.5s and a different one at t=2s. The FFT smears them together. The Short-Time Fourier Transform solves this by chopping the signal into overlapping frames, windowing each frame, and computing the FFT per frame. The tradeoff: longer windows give better frequency resolution but worse time resolution.

Frequency resolution:
Δf = f_s / N (Hz per bin, where N = FFT size, f_s = sample rate)

Window mainlobe widths:
Rectangular: 4π/N (in rad/sample) = 2·f_s/N Hz
Hann: 8π/N (2x rectangular) = 4·f_s/N Hz

Sidelobe levels:
Rectangular: -13 dB (terrible — spectral leakage everywhere)
Hann: -31 dB (much better sidelobe suppression)

The time-frequency tradeoff. Longer window N = better frequency resolution (Δf = f_s/N gets smaller) but worse time resolution (the window spans more time = N/f_s seconds). You cannot have both arbitrarily good time and frequency resolution simultaneously. This is the Heisenberg uncertainty principle of signal processing.

Exercise 4.1: Frequency Resolution Derive

f_s = 44100 Hz, N = 1024. What is the frequency resolution Δf in Hz?

Show derivation

Δf = f_s / N = 44100 / 1024 ≈ 43.07 Hz

Each FFT bin spans ~43 Hz. The lowest piano note (A0) is 27.5 Hz and the next semitone (A#0) is 29.1 Hz — a gap of only 1.6 Hz. At 43 Hz resolution, you can't distinguish adjacent notes in the bass register. You'd need N ≥ 44100/1.6 ≈ 27563 for semitone resolution at A0.

Exercise 4.2: Time Resolution Derive

Same setup: f_s = 44100, N = 1024. What is the time duration of one window in milliseconds?

Show derivation

T_window = N / f_s = 1024 / 44100 ≈ 0.02322 s = 23.22 ms

23 ms is near the "sweet spot" for speech — long enough to capture a few pitch periods (speech fundamental is ~80-300 Hz) but short enough that the vocal tract shape doesn't change much within one frame. Most speech processing uses 20-30 ms windows.

Exercise 4.3: Hann vs Rectangular Mainlobe Derive

For N = 512 and f_s = 16000 Hz, compute the mainlobe width for a Hann window in Hz. (Hann mainlobe = 4 · f_s/N.)

Show derivation

Hann mainlobe = 4 · f_s / N = 4 × 16000 / 512 = 125 Hz

Rectangular mainlobe = 2 · f_s / N = 2 × 16000 / 512 = 62.5 Hz

The Hann window has 2x the mainlobe width of rectangular — you lose frequency resolution. But the sidelobes drop from -13 dB (rectangular) to -31 dB (Hann), dramatically reducing spectral leakage. In practice, you almost always prefer the Hann (or Hamming) window because leakage from strong signals masks weak signals more than a slightly wider mainlobe does.

Exercise 4.4: Number of STFT Frames Derive

Audio signal: 2 seconds at f_s = 16000 Hz. Window = 512 samples, hop = 256 samples. How many STFT frames?

Frames = floor((total_samples - window_size) / hop) + 1

frames

Show derivation

Total samples = 2 × 16000 = 32000

Frames = floor((32000 - 512) / 256) + 1 = floor(31488 / 256) + 1 = 123 + 1 = 124

With 50% overlap (hop = N/2), you get roughly 2 × (total_samples / N) frames. The overlap helps because each sample contributes to two adjacent frames, improving the time-domain smoothness of the spectrogram.

Exercise 4.5: Spectrogram Dimensions Derive

From Exercise 4.4 (124 frames, N=512), the spectrogram is a 2D matrix. What are its dimensions? (Only positive frequencies + DC.)

frequency bins (rows)

Show derivation

Frequency bins = N/2 + 1 = 512/2 + 1 = 257

Spectrogram shape = [257 freq bins, 124 time frames]

We only keep the positive frequencies because of conjugate symmetry (Chapter 1). The spectrogram is a 257 × 124 matrix, often displayed as a heatmap with frequency on the y-axis and time on the x-axis. Each cell is |X[k, m]|² (power) or 10·log10(|X[k,m]|²) (dB).

Exercise 4.6: Arrange the STFT Pipeline Design

Put these STFT steps in the correct order.

→

Extract overlapping frames Apply window function Compute FFT per frame Take magnitude squared Convert to dB (log scale)

Show answer

Frame → Window → FFT → Magnitude² → Log (dB)

Framing chops the signal into overlapping chunks. Windowing tapers each frame to reduce spectral leakage. FFT converts each windowed frame to the frequency domain. Taking |X|² gives the power spectrum. Log scaling compresses the dynamic range for display and downstream processing.

Chapter 5: Wavelets

The STFT uses the same window size for all frequencies — 43 Hz resolution everywhere. But music has wide notes in the bass (fundamentals separated by ~1 Hz at 27 Hz) and narrow notes in the treble (separated by ~100 Hz at 4 kHz). Wavelets solve this with multi-resolution analysis: long windows for low frequencies, short windows for high frequencies.

The simplest wavelet is the Haar wavelet. It decomposes a signal into two halves: an approximation (low-frequency, averaged) and detail (high-frequency, differenced):

Haar wavelet decomposition (one level):
Given x = [x₀, x₁, x₂, x₃]

Approximation coefficients: a[k] = (x[2k] + x[2k+1]) / √2
Detail coefficients: d[k] = (x[2k] - x[2k+1]) / √2

Reconstruction:
x[2k] = (a[k] + d[k]) / √2
x[2k+1] = (a[k] - d[k]) / √2

Averages and differences — that's all a wavelet transform is. The approximation coefficients capture the smooth, slowly-varying part. The detail coefficients capture the jumps and edges. At each level you halve the resolution. It's like squinting at the signal progressively harder — each level blurs the details but shows the larger structure.

Exercise 5.1: Haar Decomposition Derive

Decompose x = [4, 6, 10, 12] using one level of Haar wavelet. What is the first approximation coefficient a[0]?

a[0] = (x[0] + x[1]) / √2 = (4 + 6) / √2.

a[0]

Show derivation

a[0] = (4 + 6) / √2 = 10 / 1.4142 ≈ 7.071

a[1] = (10 + 12) / √2 = 22 / 1.4142 ≈ 15.556

d[0] = (4 - 6) / √2 = -2 / 1.4142 ≈ -1.414

d[1] = (10 - 12) / √2 = -2 / 1.4142 ≈ -1.414

Haar([4, 6, 10, 12]) = approx: [7.071, 15.556], detail: [-1.414, -1.414]

The approximation captures the local averages (scaled by √2 for energy preservation). The detail coefficients are equal here because both pairs differ by 2. The √2 normalization ensures energy is preserved: ∑|x|² = ∑|a|² + ∑|d|².

Exercise 5.2: Energy Preservation Derive

Verify energy preservation for x = [4, 6, 10, 12]. Compute ∑|x|² and ∑|a|² + ∑|d|². Are they equal?

∑|x|²

Show derivation

∑|x|² = 16 + 36 + 100 + 144 = 296

∑|a|² = 7.071² + 15.556² = 50 + 242 = 292

∑|d|² = 1.414² + 1.414² = 2 + 2 = 4

∑|a|² + ∑|d|² = 292 + 4 = 296 = ∑|x|² ✓

Parseval's theorem for wavelets. The √2 normalization is precisely what makes this work. Without it (using simple averages and differences), you'd need to track scaling factors manually.

Exercise 5.3: Reconstruction Derive

Given a = [7.071, 15.556] and d = [-1.414, -1.414], reconstruct x[0].

x[0] = (a[0] + d[0]) / √2.

x[0]

Show derivation

x[0] = (a[0] + d[0]) / √2 = (7.071 + (-1.414)) / 1.4142 = 5.657 / 1.4142 = 4.0 ✓

x[1] = (a[0] - d[0]) / √2 = (7.071 - (-1.414)) / 1.4142 = 8.485 / 1.4142 = 6.0 ✓

Perfect reconstruction. This is the key property of orthogonal wavelets — you can decompose, process in the wavelet domain (e.g., zero out small detail coefficients for denoising), and reconstruct without losing information.

Exercise 5.4: Implement haarDecompose() Build

Write a function that performs one level of Haar wavelet decomposition. Return [approx, detail] as two arrays.

Use the formula: a[k] = (x[2k] + x[2k+1])/sqrt(2), d[k] = (x[2k] - x[2k+1])/sqrt(2).

Show solution

javascript
function haarDecompose(x) {
  const s = Math.SQRT2;
  const approx = [], detail = [];
  for (let k = 0; k < x.length; k += 2) {
    approx.push((x[k] + x[k+1]) / s);
    detail.push((x[k] - x[k+1]) / s);
  }
  return [approx, detail];
}

Exercise 5.5: Wavelet Denoising Trace

You decompose a noisy signal and set all detail coefficients below a threshold to zero ("hard thresholding"), then reconstruct. Why does this denoise the signal?

Because noise is always low-frequency Because wavelets compress noise into fewer coefficients Noise spreads evenly across all wavelet coefficients (small magnitudes), while the signal concentrates energy in a few large coefficients — thresholding kills noise and preserves signal Because the Haar wavelet is orthogonal to Gaussian noise

Show reasoning

The key insight is sparsity. Real signals (edges, tones, transients) are sparse in the wavelet domain — their energy concentrates in a few large coefficients. White noise has equal power at all frequencies, so in the wavelet domain it spreads into many small coefficients. Thresholding keeps the large (signal) coefficients and zeros out the small (noise) ones. This is the foundation of JPEG2000 and many medical image denoising algorithms.

Chapter 6: MFCC Pipeline

Every speech recognition system — from 1990s GMM-HMM to modern Whisper — starts with the same question: how do we represent a chunk of audio as a feature vector? The Mel-Frequency Cepstral Coefficients (MFCCs) are the answer. They mimic how the human ear processes sound: emphasizing perceptual frequency differences over physical ones.

Mel scale:
m = 2595 · log₁₀(1 + f / 700) (Hz to mel)
f = 700 · (10^m/2595 - 1) (mel to Hz)

MFCC pipeline:
Signal → pre-emphasis → frame → window → FFT → mel filterbank → log → DCT → MFCCs

Why mel scale? Humans perceive pitch on a roughly logarithmic scale — the difference between 100 Hz and 200 Hz (one octave) sounds as big as the difference between 1000 Hz and 2000 Hz (also one octave). The mel scale models this: equal mel intervals sound equally spaced to a human listener. Below ~1000 Hz, mel ≈ Hz (nearly linear). Above ~1000 Hz, mel ≈ log(Hz).

Exercise 6.1: Hz to Mel Derive

Convert 1000 Hz to the mel scale. m = 2595 · log₁₀(1 + f/700).

mel

Show derivation

m = 2595 · log₁₀(1 + 1000/700) = 2595 · log₁₀(2.4286)

= 2595 × 0.3855 ≈ 999.98 mel

1000 Hz ≈ 1000 mel. This is by design — the mel scale was calibrated so that 1000 mel ≈ 1000 Hz. Below this point, mel tracks Hz almost linearly. Above it, the relationship becomes logarithmic.

Exercise 6.2: Mel Scale at High Frequencies Derive

Convert 8000 Hz to mel. Then compute how many mels span 0-1000 Hz vs 1000-8000 Hz.

mel (for 8000 Hz)

Show derivation

m(8000) = 2595 · log₁₀(1 + 8000/700) = 2595 · log₁₀(12.43) = 2595 × 1.0943 ≈ 2840 mel

0-1000 Hz spans: 1000 - 0 = 1000 mel

1000-8000 Hz spans: 2840 - 1000 = 1840 mel

7000 Hz of physical bandwidth (1000-8000) maps to only 1840 mel — less than 2x the 1000 mel that the first 1000 Hz gets. This compression at high frequencies is why mel filterbanks place many narrow filters below 1000 Hz and fewer wide filters above it.

Exercise 6.3: Mel Filterbank Design Derive

For 8 kHz audio (f_s = 16000, Nyquist = 8000 Hz) with 26 mel filters: what is the mel spacing between adjacent filter centers?

The mel range is [m(0), m(8000)]. We need 26 filters, which means 28 points (26 centers + 2 endpoints). Spacing = total mel range / 27.

mel between centers

Show derivation

m(0) = 2595 · log₁₀(1) = 0

m(8000) = 2840 mel (from Ex 6.2)

We need 28 equally-spaced mel points for 26 filters.

Spacing = 2840 / 27 ≈ 105.2 mel

In Hz, the first few filters are spaced ~100 Hz apart (near-linear below 1000 Hz), but the last few are spaced ~600+ Hz apart (logarithmic above 1000 Hz). This perceptually-motivated non-uniform spacing is the key feature of mel filterbanks.

Exercise 6.4: Pre-emphasis Filter Trace

Pre-emphasis applies y[n] = x[n] - α·x[n-1] (typically α = 0.97). What kind of filter is this?

Low-pass filter (boosts low frequencies) Band-pass filter (selects a frequency range) High-pass filter — it's a first difference with a leak, boosting high frequencies that speech naturally lacks energy in All-pass filter (changes phase only)

Show reasoning

y[n] = x[n] - 0.97·x[n-1] is approximately a first difference (derivative), which is a high-pass filter. Speech has a natural spectral tilt of about -6 dB/octave (the glottal pulse spectrum rolls off). Pre-emphasis flattens this tilt, giving the FFT and mel filters a more uniform energy distribution across frequencies. Without it, the low-frequency bins dominate and the high-frequency formant information is buried in noise.

Exercise 6.5: Why DCT After Log Mel? Trace

After applying the mel filterbank and taking the log, we apply a DCT to get MFCCs. Why DCT instead of just using the log-mel energies directly?

The DCT makes the features periodic The DCT decorrelates the log-mel energies (adjacent mel bands are correlated), and compacts energy into fewer coefficients — typically only the first 13 of 26 DCT coefficients are kept The DCT converts back to the time domain The DCT removes phase information

Show reasoning

Adjacent mel filters overlap, so their outputs are correlated. Correlated features are wasteful — a diagonal Gaussian classifier (common in GMM-HMMs) needs uncorrelated features. The DCT is like a PCA for this specific case: it decorrelates the log-mel features and concentrates most of the information in the first few coefficients. Keeping only 13 of 26 MFCCs reduces dimensionality by 50% with minimal information loss. Modern neural nets (Whisper, wav2vec) often skip the DCT and use log-mel features directly because they can learn to decorrelate internally.

Exercise 6.6: Arrange the MFCC Pipeline Design

Put the MFCC pipeline steps in order.

→

Pre-emphasis Frame + Window FFT (power spectrum) Mel filterbank Log compression DCT

Show answer

Pre-emphasis → Frame+Window → FFT → Mel filterbank → Log → DCT

This pipeline has been standard since Davis & Mermelstein (1980). Modern systems sometimes add delta and delta-delta (velocity and acceleration) features, and may replace DCT with neural feature learning. But the core pipeline — pre-emphasis through mel filterbank — remains universal.

Chapter 7: Classification

You have feature vectors — maybe MFCCs from Chapter 6, maybe sensor readings. Now you need to decide: which class does this observation belong to? The Bayes classifier is the gold standard — it's provably optimal if you know the true distributions. Fisher's linear discriminant finds the best projection direction when you don't.

Bayes classifier (2-class):
Decide class 1 if: P(C₁|x) > P(C₂|x)
By Bayes' theorem: decide C₁ if p(x|C₁)P(C₁) > p(x|C₂)P(C₂)

Fisher discriminant (2-class):
Maximize: J(w) = (w^TS_Bw) / (w^TS_Ww)
where S_B = (μ₁ - μ₂)(μ₁ - μ₂)^T (between-class scatter)
S_W = Σ₁ + Σ₂ (within-class scatter)

Solution: w* ∝ S_W^-1(μ₁ - μ₂)

Fisher = "maximize the signal-to-noise ratio" in 1D. After projecting onto direction w, the two class means are separated by w^T(μ₁-μ₂) and the variance is w^TS_Ww. Fisher finds the direction that maximizes separation relative to spread — exactly the 1D SNR. If the classes overlap in 2D, there may be a magic 1D projection where they separate cleanly.

Exercise 7.1: Bayes Posterior Derive

Two classes with equal priors P(C₁) = P(C₂) = 0.5. Class 1: N(μ₁=2, σ²=1). Class 2: N(μ₂=5, σ²=1). Compute P(C₁|x=3).

P(C₁|x) = p(x|C₁)P(C₁) / [p(x|C₁)P(C₁) + p(x|C₂)P(C₂)]. With equal priors, this simplifies to p(x|C₁) / [p(x|C₁) + p(x|C₂)].

P(C₁|x=3)

Show derivation

p(x=3|C₁) = (1/√2π) · exp(-(3-2)²/2) = (1/√2π) · e^-0.5 ≈ 0.2420

p(x=3|C₂) = (1/√2π) · exp(-(3-5)²/2) = (1/√2π) · e^-2 ≈ 0.0540

P(C₁|x=3) = 0.2420 / (0.2420 + 0.0540) = 0.2420 / 0.2960 ≈ 0.818

x=3 is 1σ from μ₁=2 but 2σ from μ₂=5. The Gaussian drops much faster at 2σ (e^-2 vs e^-0.5), so C₁ is strongly favored. The Bayes decision boundary is at x = 3.5 (the midpoint, since variances are equal and priors are equal).

Exercise 7.2: Decision Boundary Shift Derive

Same setup but now P(C₁) = 0.3, P(C₂) = 0.7. Where is the decision boundary? (Find x where P(C₁|x) = P(C₂|x), i.e., where p(x|C₁)·0.3 = p(x|C₂)·0.7.)

With equal variances, the log-likelihood ratio is linear. Solve: -(x-2)²/2 + ln(0.3) = -(x-5)²/2 + ln(0.7).

x (boundary)

Show derivation

Set p(x|C₁)·P(C₁) = p(x|C₂)·P(C₂). Take log of both sides:

-(x-2)²/2 + ln(0.3) = -(x-5)²/2 + ln(0.7)

Expand: (-x²+4x-4)/2 + (x²-10x+25)/2 = ln(7/3)

(-6x + 21)/2 = 0.8473

-6x + 21 = 1.6946 ⇒ x = 19.305/6 ≈ 3.22

The boundary shifted from 3.5 (equal priors) to 3.22 — toward C₁. This makes sense: since C₂ is more likely a priori (0.7 vs 0.3), we give C₂ more territory. A point at x=3.3 would be classified as C₂ even though it's closer to μ₁=2 than to μ₂=5, because C₂'s higher prior tips the balance.

Exercise 7.3: Fisher Discriminant Direction Derive

Two 2D classes: μ₁ = [1, 2], μ₂ = [4, 6]. Both have the identity covariance matrix (S_W = 2I). What is the Fisher discriminant direction w*?

w* ∝ S_W^-1(μ₁ - μ₂). With S_W = 2I, S_W^-1 = (1/2)I.

w₂/w₁ (direction ratio)

Show derivation

μ₁ - μ₂ = [1-4, 2-6] = [-3, -4]

w* ∝ S_W^-1(μ₁ - μ₂) = (1/2)I · [-3, -4] = [-1.5, -2]

Direction: ∝ [3, 4] (or [-3, -4], sign doesn't matter)

w₂/w₁ = 4/3 ≈ 1.333

When S_W is a scaled identity, the Fisher direction is simply the line connecting the class means. With equal spherical covariance, there's no need to "de-correlate" — the mean direction is already optimal. The ratio 4:3 means we project onto a line tilted at arctan(4/3) ≈ 53.1° from the x-axis.

Exercise 7.4: Fisher with Unequal Covariance Trace

If the within-class scatter S_W has large variance along one axis and small along another, how does Fisher's direction differ from simply connecting the means?

It doesn't — Fisher always points along the mean difference Fisher rotates away from the high-variance axis — projecting along a direction where within-class spread is small gives better separation, even if the means are closer in that direction Fisher only considers the means, not the covariance

Show reasoning

The S_W^-1 term is what distinguishes Fisher from naive "connect the means." If S_W has eigenvalue 10 along one axis and eigenvalue 1 along another, S_W^-1 has eigenvalues 0.1 and 1 — it shrinks the high-variance direction and stretches the low-variance one. This tips the Fisher direction toward the axis where the classes are narrow, even if the means are more separated along the wide axis. It's the same logic as Mahalanobis distance: an inch of separation in a narrow direction is worth more than a foot of separation in a wide one.

Exercise 7.5: Find the Bug Debug

This Bayes classifier computes the wrong posterior. Click the buggy line.

function bayesPosterior(x, mu1, mu2, sigma, prior1) {
  const prior2 = 1 - prior1;
  const d1 = x - mu1;
  const d2 = x - mu2;
  const p1 = Math.exp(-d1*d1 / (2*sigma)) * prior1;
  const p2 = Math.exp(-d2*d2 / (2*sigma)) * prior2;
  return p1 / (p1 + p2);
}

Show explanation

Line 5 (and line 6) have the bug. The Gaussian exponent should be -d²/(2σ²), but the code uses 2*sigma instead of 2*sigma*sigma. The parameter sigma is the standard deviation, so the variance is σ². Dividing by 2σ instead of 2σ² changes the effective variance and gives wrong posteriors whenever σ ≠ 1.

When σ = 1, the bug is invisible because 2σ = 2σ² = 2. This is a classic unit test failure: the test passed with σ=1 but the code is wrong for all other variances. Always test with σ ≠ 1.

Chapter 8: SVM Margins

You've found a separating hyperplane. Your colleague found a different one. Both classify the training data perfectly. Which is better? The Support Vector Machine answers: the one with the widest margin — the largest gap between the decision boundary and the nearest data point. This maximum-margin principle gives SVMs their legendary generalization.

Decision hyperplane: w · x + b = 0
Margin width: 2 / ||w|| (distance between the two support planes w·x+b = ±1)
Support vectors: points where |w·x + b| = 1 (they "touch" the margin)

Maximize margin = minimize ||w||² / 2
subject to: y_i(w · x_i + b) ≥ 1 for all i

Why maximum margin works. A wider margin means the classifier is more robust to perturbations. Vapnik showed that the generalization error is bounded by a function of the margin — wider margin = tighter bound = fewer mistakes on unseen data. The support vectors (the few points touching the margin) are the only training points that matter; move any other point and the hyperplane doesn't change.

Exercise 8.1: Margin Width Derive

A 2D SVM finds w = [3, 4], b = -2. What is the margin width?

Margin = 2 / ||w|| where ||w|| = √(w₁² + w₂²).

margin width

Show derivation

||w|| = √(9 + 16) = √25 = 5

Margin = 2 / 5 = 0.4

A margin of 0.4 means the nearest point to the boundary is 0.2 units away on each side. If we scaled w to [6, 8] and b to -4, the hyperplane would be identical but ||w|| = 10, giving a margin of 0.2. The margin depends on the normalized distance, so we minimize ||w|| to maximize the margin.

Exercise 8.2: Point-to-Hyperplane Distance Derive

Hyperplane: 3x₁ + 4x₂ - 2 = 0. What is the distance from point (1, 1) to this hyperplane?

Distance = |w·x + b| / ||w|| = |3(1) + 4(1) - 2| / 5.

distance

Show derivation

w · x + b = 3(1) + 4(1) + (-2) = 5

Distance = |5| / 5 = 1.0

The point (1,1) is exactly 1 unit from the hyperplane. Since the margin is 0.4 (half-margin = 0.2), this point is well outside the margin — it's NOT a support vector. A support vector would be at distance 1/||w|| = 0.2 from the hyperplane.

Exercise 8.3: Is It a Support Vector? Trace

For hyperplane w = [3, 4], b = -2, which point is a support vector (satisfies |w·x + b| = 1)?

(0, 0): |3(0)+4(0)-2| = |-2| = 2 ≠ 1 (1, 0): |3(1)+4(0)-2| = |1| = 1 ✓ (1, 1): |3(1)+4(1)-2| = |5| = 5 ≠ 1 (2, 2): |3(2)+4(2)-2| = |12| = 12 ≠ 1

Show reasoning

(1, 0) gives w·x + b = 3 + 0 - 2 = 1, so |w·x + b| = 1. This point lies exactly on the positive margin plane (w·x + b = +1). It's a support vector — remove it and the hyperplane might change. All other points have |w·x + b| > 1, meaning they're safely inside their class region and don't constrain the optimization.

Exercise 8.4: Kernel Trick Derive

The polynomial kernel φ(x) = [x₁², √2·x₁x₂, x₂²] maps 2D to 3D. Compute φ([1, 2]) · φ([3, 1]).

This is the kernel value K(a, b) = (φ(a))·(φ(b)). Also verify: K(a,b) = (a·b)².

φ(a) · φ(b)

Show derivation

φ([1,2]) = [1², √2·1·2, 2²] = [1, 2√2, 4]

φ([3,1]) = [9, 3√2, 1]

φ(a) · φ(b) = 1·9 + 2√2 · 3√2 + 4·1 = 9 + 12 + 4 = 25

Check: (a · b)² = (1·3 + 2·1)² = 5² = 25 ✓

The kernel trick: instead of computing the expensive mapping φ and then the dot product in 3D, we compute (a·b)² directly in the original 2D space. Same result, much cheaper. For degree-d polynomials in n dimensions, the mapped space has O(n^d) dimensions, but the kernel computation is still O(n). This is why SVMs can operate in effectively infinite-dimensional spaces (e.g., RBF kernel).

Exercise 8.5: Implement margin() Build

Write a function that computes the SVM margin given a weight vector w (as an array).

||w|| = sqrt(sum of w[i]^2).

Show solution

javascript
function margin(w) {
  const norm = Math.sqrt(w.reduce((s, v) => s + v * v, 0));
  return 2 / norm;
}

Exercise 8.6: RBF Kernel Derive

The RBF (Gaussian) kernel is K(x, y) = exp(-||x-y||² / (2σ²)). Compute K([0,0], [1,1]) with σ = 1.

K value

Show derivation

||x - y||² = (0-1)² + (0-1)² = 2

K = exp(-2 / (2 × 1)) = exp(-1) = e^-1 ≈ 0.3679

The RBF kernel measures similarity — K=1 for identical points, K→0 as distance grows. With σ=1, points at distance √2 ≈ 1.41 apart have K ≈ 0.37. The parameter σ controls the "reach": larger σ = smoother decision boundary, smaller σ = more wiggly (potentially overfitting).

Chapter 9: Capstone: Audio Feature Pipeline

You're designing the front-end for a speech recognition system. Your audio comes in at 16 kHz, and you need to produce MFCC feature vectors for each frame. This chapter ties together everything: framing, windowing, spectral analysis, mel filterbanks, and feature dimensions. Every number must be computable from the system parameters.

End-to-end thinking. A real audio pipeline is a chain of transformations, each with specific input/output dimensions. One miscalculation — wrong hop size, wrong number of FFT bins, wrong mel spacing — and the entire system silently produces garbage features. The exercises below force you to trace dimensions through the complete pipeline.

Exercise 9.1: Total Samples Derive

3 seconds of audio at f_s = 16000 Hz. How many samples?

samples

Show derivation

Samples = duration × f_s = 3 × 16000 = 48,000

Exercise 9.2: Frame Size in Samples Derive

Frame length = 25 ms at f_s = 16000 Hz. How many samples per frame?

samples

Show derivation

Frame samples = 0.025 × 16000 = 400

25 ms at 16 kHz gives exactly 400 samples. This is a common choice in speech processing. The frame should be long enough to capture 2-3 pitch periods (human pitch: 80-300 Hz, period: 3.3-12.5 ms) but short enough that the vocal tract is roughly stationary.

Exercise 9.3: Hop Size Derive

Hop = 10 ms at f_s = 16000 Hz. How many samples per hop?

samples

Show derivation

Hop samples = 0.010 × 16000 = 160

The overlap is frame_size - hop = 400 - 160 = 240 samples, which is 60% overlap. This is more aggressive than the typical 50% overlap from Chapter 4, giving smoother temporal features at the cost of more frames to process.

Exercise 9.4: Number of Frames Derive

48000 samples, frame = 400, hop = 160. How many frames?

Frames = floor((total - frame_size) / hop) + 1.

frames

Show derivation

Frames = floor((48000 - 400) / 160) + 1 = floor(47600 / 160) + 1 = 297 + 1 = 298

3 seconds of audio at 10 ms hop gives approximately 300 frames. Each frame produces one feature vector, so our downstream classifier sees a sequence of 298 feature vectors — one every 10 ms.

Exercise 9.5: FFT Size and Frequency Bins Derive

We zero-pad each 400-sample frame to N = 512 for the FFT (next power of 2). How many unique frequency bins in the magnitude spectrum?

frequency bins

Show derivation

Unique bins = N/2 + 1 = 512/2 + 1 = 257

Zero-padding from 400 to 512 doesn't add information — it just interpolates the spectrum more finely (smaller spacing between bins). The frequency resolution is still determined by the original 400 samples: Δf = f_s/400 = 40 Hz. But the FFT bins are spaced at f_s/512 = 31.25 Hz.

Exercise 9.6: MFCC Feature Vector Size Derive

We apply 26 mel filters and keep the first 13 DCT coefficients. With delta and delta-delta features appended, what is the dimension of each feature vector?

Delta features = first difference of MFCCs across time. Delta-delta = second difference. Same dimension as MFCCs.

feature dimension

Show derivation

13 MFCCs + 13 deltas + 13 delta-deltas = 39

39-dimensional feature vectors are the classic MFCC configuration used by virtually every pre-deep-learning speech recognizer (HTK, Kaldi). Each of the 298 frames produces a 39-D vector, so the complete feature matrix for 3 seconds of audio is 298 × 39.

Exercise 9.7: Total Feature Matrix Size Derive

298 frames × 39 dimensions per frame, stored as 32-bit floats. How many kilobytes is the feature matrix?

Show derivation

Elements = 298 × 39 = 11,622

Bytes = 11,622 × 4 = 46,488 bytes

KB = 46,488 / 1024 ≈ 45.4 KB

3 seconds of raw audio at 16 kHz, 16-bit = 96 KB. After MFCC extraction: 45.4 KB. That's only 2.3x compression — MFCCs aren't about compression, they're about representation. The 39 dimensions capture the perceptually relevant information (vocal tract shape, pitch dynamics) while discarding speaker-specific details and noise.

Exercise 9.8: Implement featurePipelineDims() Build

Write a function that computes the output dimensions of a speech feature pipeline given audio parameters.

nFrames = floor((totalSamples - frameSamples) / hopSamples) + 1. featDim = nMfcc * (deltas ? 3 : 1).

Show solution

javascript
function featurePipelineDims(durationSec, fs, frameSec, hopSec, nfft, nMel, nMfcc, deltas) {
  const totalSamples = Math.round(durationSec * fs);
  const frameSamples = Math.round(frameSec * fs);
  const hopSamples = Math.round(hopSec * fs);
  const nFrames = Math.floor((totalSamples - frameSamples) / hopSamples) + 1;
  const featDim = nMfcc * (deltas ? 3 : 1);
  return { nFrames, featDim, matrixSize: nFrames * featDim };
}

Exercise 9.9: Frequency Resolution vs Mel Resolution Trace

With N=512 and f_s=16000, the FFT resolution is 31.25 Hz/bin. Our first mel filter center is at ~133 Hz and the second is at ~238 Hz — spanning about 3 FFT bins. The last mel filter spans ~900 Hz — about 29 FFT bins. Why is this mismatch actually desirable?

It's not desirable — we should use a constant-Q transform instead More FFT bins per mel filter at high frequencies gives better noise averaging It matches human perception — we need fine frequency resolution where our ears are sensitive (low frequencies) and averaging is fine where we're less sensitive (high frequencies) It makes the FFT computation faster

Show reasoning

The mel filterbank deliberately blurs high frequencies because the human auditory system does the same. Below ~1 kHz, we can distinguish frequencies ~3 Hz apart. Above ~4 kHz, our resolution drops to ~100+ Hz. Narrow mel filters at low frequencies capture the fine pitch and formant structure that matters for speech intelligibility. Wide mel filters at high frequencies average out spectral detail that humans can't perceive anyway. The result is a compact, perceptually relevant representation.