From discrete convolution to mel-frequency cepstral coefficients. Every computation a signal processing engineer must derive by hand — quantization noise, DFT symmetry, wavelet decomposition, feature pipelines — all solvable in-browser with instant feedback.
You're building an audio classifier but the first question your reviewer asks is: "What's the energy of your input signal?" You stare blankly. Energy, power, convolution — these are the atoms of signal processing. Every algorithm you'll ever build sits on top of these three computations.
A discrete-time signal x[n] is just a sequence of numbers indexed by integer n. Unlike continuous signals, we can write them out explicitly and compute with them by hand.
Compute the energy of x[n] = [1, 2, 3, 2, 1].
E = |1|² + |2|² + |3|² + |2|² + |1|²
Sum of squared magnitudes. For real signals, |x[n]|² = x[n]². For complex signals you'd need |x[n]|² = x[n] · x*[n].
x[n] is periodic with one period = [3, 1, -1, 1]. What is the average power P?
P = (1/N) ∑ |x[n]|² over one period, N = 4.
Power is energy per period. This signal repeats [3, 1, -1, 1] forever, so it has infinite total energy but finite power = 3.0.
Compute y = [1, 2, 1] * [1, 1, 1] (linear convolution). What is y[2] (the middle/peak value, using 0-indexing)?
y has length len(x) + len(h) - 1 = 5. Compute y[n] = ∑k x[k] · h[n-k] for n = 2.
x = [1, 2, 1], h = [1, 1, 1]. Output length = 3 + 3 - 1 = 5.
Convolution with [1,1,1] is a 3-point moving average (unnormalized). Notice the output is symmetric — convolving two symmetric signals always produces a symmetric result.
Compute y[4] for x = [2, 0, 1, 3, 1] convolved with h = [1, -1]. (0-indexed.)
Output length = 5 + 2 - 1 = 6. y[n] = ∑k x[k]·h[n-k].
Convolution with [1, -1] computes a discrete derivative — the first difference. y[n] = x[n] - x[n-1]. This is a high-pass filter: it responds to changes and ignores constant regions.
Write a function that computes the full linear convolution of two arrays. Return an array of length len(x) + len(h) - 1.
javascript function convolve(x, h) { const N = x.length + h.length - 1; const y = new Array(N).fill(0); for (let n = 0; n < N; n++) { for (let k = 0; k < x.length; k++) { if (n - k >= 0 && n - k < h.length) { y[n] += x[k] * h[n - k]; } } } return y; }
Linear convolution always grows the signal. The extra 63 samples are "tails" where the filter is only partially overlapping the signal. In practice, many systems discard these tails (valid convolution) or zero-pad to keep the same length.
Your spectrogram shows a suspicious peak at bin 3. What frequency does that correspond to? Is it real or an aliasing artifact? You can't debug frequency-domain code without being able to compute the DFT by hand for small examples.
The Discrete Fourier Transform maps N time-domain samples to N frequency-domain coefficients:
Compute the 4-point DFT of x = [1, 1, 1, 1]. What is X[0]?
X[0] = ∑n=03 x[n] · e-j·0 = ∑ x[n].
A constant signal has all its energy at DC (frequency 0). No oscillation means no frequency content above DC. This is the frequency-domain definition of "flat."
Compute the 4-point DFT of x = [1, 0, 1, 0]. What is |X[2]|?
Compute X[k] for each k = 0, 1, 2, 3. Then take the magnitude of X[2].
[1,0,1,0] oscillates at the Nyquist rate — it flips every sample. That's frequency k=2 in a 4-point DFT (which is N/2 = the Nyquist bin). Energy appears at DC (the mean is 0.5) and at Nyquist.
For x = [1, -1, 1, -1], compute |X[0]|. Then compute |X[2]|.
This signal has zero mean. Where does its energy concentrate?
[1,-1,1,-1] is a pure Nyquist-rate oscillation. All energy is at bin k=2 (N/2). Compare with [1,0,1,0] which had energy split between DC and Nyquist because its mean was nonzero.
This is why we only need to plot bins 0 to N/2 — the upper half is a mirror image. The "negative frequencies" X[N-k] are just the conjugates of the positive frequencies X[k]. This also means a real N-point signal has only N/2 + 1 independent frequency components.
Compute |X[3]| for x = [1, 0, 0, 0, 0, 0, 0, 0] (8-point DFT of a unit impulse).
An impulse contains equal energy at all frequencies — it's the "white" signal of the time domain. This is the Fourier dual of the all-ones result from Exercise 1.1: constant in one domain means impulse in the other.
Parseval's theorem states: ∑|x[n]|² = (1/N)∑|X[k]|². For x = [1, 0, 1, 0] with X = [2, 0, 2, 0], verify by computing (1/N)∑|X[k]|².
Parseval's theorem guarantees energy is preserved across the DFT. You can compute signal energy in whichever domain is more convenient. The 1/N factor compensates for the DFT's unnormalized convention.
Your ADC (analog-to-digital converter) uses 8 bits. Your colleague says "that's 48 dB of dynamic range." Where does that number come from? Quantization is the bridge between the continuous analog world and the discrete digital world, and its noise is the fundamental floor beneath every digital system.
8-bit uniform quantizer, range [-1, 1]. What is the step size Δ?
Δ = 2Xmax / 2B = 2 / 256.
With 256 levels spanning a range of 2, each level covers 0.0078125 units. Any input value within one step is rounded to the same output — this rounding is the quantization error.
For Δ = 0.0078125 (from Exercise 2.1), compute the quantization noise power.
Noise power = Δ² / 12. The quantization error is uniformly distributed in [-Δ/2, Δ/2], and the variance of a uniform distribution on [-a, a] is a²/3 = (Δ/2)²/3 = Δ²/12.
This is tiny — but remember, it's the noise power. The RMS noise amplitude is √(5.086×10-6) ≈ 0.00226, which is about 0.23% of the full range. Clearly audible if your signal is quiet.
Compute the SNR in dB for an 8-bit quantizer using the formula SNR = 6.02B + 1.76.
~50 dB is mediocre by modern standards. Old telephone systems used 8-bit μ-law encoding. CD audio uses 16 bits (96 dB), and professional audio uses 24 bits (144 dB).
Compute the SNR for a 16-bit quantizer. By how many dB does it exceed 8-bit?
Doubling the bit depth adds 6.02 × 8 = 48.16 dB. This is why CD audio (16-bit, 98 dB) sounds dramatically better than old 8-bit systems (~50 dB). The human ear has a dynamic range of about 120 dB, so 16 bits covers most of it.
This SNR calculator gives wrong results. Click the buggy line.
function snr_db(bits, xmax) { const levels = Math.pow(2, bits); const delta = 2 * xmax / levels; const noisePower = delta * delta / 12; const signalPower = xmax * xmax; return 10 * Math.log10(signalPower / noisePower); }
Line 5 is the bug. The signal power for a full-scale sinusoid is Xmax²/2 (the RMS squared), not Xmax² (the peak squared). Using peak power instead of RMS power inflates the SNR by 3 dB. The corrected line should be const signalPower = xmax * xmax / 2;.
This is a classic mistake. The 6.02B + 1.76 formula already assumes a full-scale sinusoid with power = Xmax²/2. If you derive SNR from scratch, you must use RMS signal power.
In practice nobody uses 15-bit ADCs. You'd use 16 bits and get 98 dB — the extra headroom is cheap and useful. But the math shows that 15 bits is the theoretical minimum for 90 dB.
Uniform quantization wastes resolution. If your input signal is Gaussian (as most audio and sensor data roughly is), values near zero are far more likely than values near the extremes. A non-uniform quantizer packs more levels where the PDF is dense and fewer where it's sparse — like variable-length coding for amplitude.
For a Gaussian N(0,1) input with a 2-level quantizer (1 bit), the two reconstruction levels are symmetric: r1 = -a, r2 = +a. The optimal boundary is at b = 0 (by symmetry). What is the optimal reconstruction level a = E[X | X > 0] for N(0,1)?
For a standard normal, E[X | X > 0] = √(2/π) ≈ 0.7979. This is a standard result from the half-normal distribution.
The optimal 1-bit quantizer for Gaussian maps everything below 0 to -0.7979 and everything above 0 to +0.7979. The distortion is σ² - 2/π ≈ 0.3634, or about 36.3% of the signal variance. Terrible, but it's the best 1 bit can do.
You're running Lloyd-Max on data points [−3, −2, −1, 0, 1, 2, 4] with initial reconstruction levels r1 = −1, r2 = 2. Step 1: Compute the boundary b = (r1 + r2)/2. Step 2: Update r1 = mean of points below b, r2 = mean of points above b.
What is the updated r2 after one iteration?
After one iteration: r1 = -1.5, r2 = 2.333. The next step would compute b = (-1.5+2.333)/2 = 0.417, shifting the boundary left because the negative cluster is denser. You keep iterating until the levels stop changing.
Using the updated levels from 3.2 (r1=-1.5, r2=2.333) with boundary b=0.5, compute the mean squared quantization error (MSE distortion) for the data [−3, −2, −1, 0, 1, 2, 4].
D = (1/N) ∑ (xi - q(xi))² where q(x) maps to the nearest reconstruction level. Points ≤ 0.5 map to r1, points > 0.5 map to r2.
Compare with the distortion before the update (using original levels r1=-1, r2=2): MSE = (4+1+0+1+1+0+4)/7 = 11/7 ≈ 1.571. One Lloyd-Max iteration reduced distortion from 1.571 to 1.381 — a 12% improvement. The algorithm converges when the levels and boundaries stop changing.
At very low bit rates, the Gaussian PDF within the active range [-2σ, 2σ] doesn't vary enough for non-uniform spacing to help significantly. The Lloyd-Max gain over uniform grows with the number of levels. At 8 bits (256 levels), Lloyd-Max can gain ~1-2 dB over uniform for Gaussian input. The real payoff of non-uniform quantization is for highly non-uniform distributions (like speech amplitude, which follows a Laplacian).
Companding = compression + expanding. The log compressor makes small values bigger and large values smaller, effectively spreading the dense-near-zero PDF into a more uniform shape. Uniform quantization of this spread-out signal is equivalent to non-uniform quantization of the original — more levels near zero, fewer at the extremes. It's a practical approximation to Lloyd-Max that doesn't require knowledge of the exact input distribution.
You run an FFT on 3 seconds of music and get one frequency spectrum. But music changes over time — there's a note at t=0.5s and a different one at t=2s. The FFT smears them together. The Short-Time Fourier Transform solves this by chopping the signal into overlapping frames, windowing each frame, and computing the FFT per frame. The tradeoff: longer windows give better frequency resolution but worse time resolution.
fs = 44100 Hz, N = 1024. What is the frequency resolution Δf in Hz?
Each FFT bin spans ~43 Hz. The lowest piano note (A0) is 27.5 Hz and the next semitone (A#0) is 29.1 Hz — a gap of only 1.6 Hz. At 43 Hz resolution, you can't distinguish adjacent notes in the bass register. You'd need N ≥ 44100/1.6 ≈ 27563 for semitone resolution at A0.
Same setup: fs = 44100, N = 1024. What is the time duration of one window in milliseconds?
23 ms is near the "sweet spot" for speech — long enough to capture a few pitch periods (speech fundamental is ~80-300 Hz) but short enough that the vocal tract shape doesn't change much within one frame. Most speech processing uses 20-30 ms windows.
For N = 512 and fs = 16000 Hz, compute the mainlobe width for a Hann window in Hz. (Hann mainlobe = 4 · fs/N.)
The Hann window has 2x the mainlobe width of rectangular — you lose frequency resolution. But the sidelobes drop from -13 dB (rectangular) to -31 dB (Hann), dramatically reducing spectral leakage. In practice, you almost always prefer the Hann (or Hamming) window because leakage from strong signals masks weak signals more than a slightly wider mainlobe does.
Audio signal: 2 seconds at fs = 16000 Hz. Window = 512 samples, hop = 256 samples. How many STFT frames?
Frames = floor((total_samples - window_size) / hop) + 1
With 50% overlap (hop = N/2), you get roughly 2 × (total_samples / N) frames. The overlap helps because each sample contributes to two adjacent frames, improving the time-domain smoothness of the spectrogram.
From Exercise 4.4 (124 frames, N=512), the spectrogram is a 2D matrix. What are its dimensions? (Only positive frequencies + DC.)
We only keep the positive frequencies because of conjugate symmetry (Chapter 1). The spectrogram is a 257 × 124 matrix, often displayed as a heatmap with frequency on the y-axis and time on the x-axis. Each cell is |X[k, m]|² (power) or 10·log10(|X[k,m]|²) (dB).
Put these STFT steps in the correct order.
Frame → Window → FFT → Magnitude² → Log (dB)
Framing chops the signal into overlapping chunks. Windowing tapers each frame to reduce spectral leakage. FFT converts each windowed frame to the frequency domain. Taking |X|² gives the power spectrum. Log scaling compresses the dynamic range for display and downstream processing.
The STFT uses the same window size for all frequencies — 43 Hz resolution everywhere. But music has wide notes in the bass (fundamentals separated by ~1 Hz at 27 Hz) and narrow notes in the treble (separated by ~100 Hz at 4 kHz). Wavelets solve this with multi-resolution analysis: long windows for low frequencies, short windows for high frequencies.
The simplest wavelet is the Haar wavelet. It decomposes a signal into two halves: an approximation (low-frequency, averaged) and detail (high-frequency, differenced):
Decompose x = [4, 6, 10, 12] using one level of Haar wavelet. What is the first approximation coefficient a[0]?
a[0] = (x[0] + x[1]) / √2 = (4 + 6) / √2.
The approximation captures the local averages (scaled by √2 for energy preservation). The detail coefficients are equal here because both pairs differ by 2. The √2 normalization ensures energy is preserved: ∑|x|² = ∑|a|² + ∑|d|².
Verify energy preservation for x = [4, 6, 10, 12]. Compute ∑|x|² and ∑|a|² + ∑|d|². Are they equal?
Parseval's theorem for wavelets. The √2 normalization is precisely what makes this work. Without it (using simple averages and differences), you'd need to track scaling factors manually.
Given a = [7.071, 15.556] and d = [-1.414, -1.414], reconstruct x[0].
x[0] = (a[0] + d[0]) / √2.
Perfect reconstruction. This is the key property of orthogonal wavelets — you can decompose, process in the wavelet domain (e.g., zero out small detail coefficients for denoising), and reconstruct without losing information.
Write a function that performs one level of Haar wavelet decomposition. Return [approx, detail] as two arrays.
javascript function haarDecompose(x) { const s = Math.SQRT2; const approx = [], detail = []; for (let k = 0; k < x.length; k += 2) { approx.push((x[k] + x[k+1]) / s); detail.push((x[k] - x[k+1]) / s); } return [approx, detail]; }
The key insight is sparsity. Real signals (edges, tones, transients) are sparse in the wavelet domain — their energy concentrates in a few large coefficients. White noise has equal power at all frequencies, so in the wavelet domain it spreads into many small coefficients. Thresholding keeps the large (signal) coefficients and zeros out the small (noise) ones. This is the foundation of JPEG2000 and many medical image denoising algorithms.
Every speech recognition system — from 1990s GMM-HMM to modern Whisper — starts with the same question: how do we represent a chunk of audio as a feature vector? The Mel-Frequency Cepstral Coefficients (MFCCs) are the answer. They mimic how the human ear processes sound: emphasizing perceptual frequency differences over physical ones.
Convert 1000 Hz to the mel scale. m = 2595 · log10(1 + f/700).
1000 Hz ≈ 1000 mel. This is by design — the mel scale was calibrated so that 1000 mel ≈ 1000 Hz. Below this point, mel tracks Hz almost linearly. Above it, the relationship becomes logarithmic.
Convert 8000 Hz to mel. Then compute how many mels span 0-1000 Hz vs 1000-8000 Hz.
7000 Hz of physical bandwidth (1000-8000) maps to only 1840 mel — less than 2x the 1000 mel that the first 1000 Hz gets. This compression at high frequencies is why mel filterbanks place many narrow filters below 1000 Hz and fewer wide filters above it.
For 8 kHz audio (fs = 16000, Nyquist = 8000 Hz) with 26 mel filters: what is the mel spacing between adjacent filter centers?
The mel range is [m(0), m(8000)]. We need 26 filters, which means 28 points (26 centers + 2 endpoints). Spacing = total mel range / 27.
In Hz, the first few filters are spaced ~100 Hz apart (near-linear below 1000 Hz), but the last few are spaced ~600+ Hz apart (logarithmic above 1000 Hz). This perceptually-motivated non-uniform spacing is the key feature of mel filterbanks.
y[n] = x[n] - 0.97·x[n-1] is approximately a first difference (derivative), which is a high-pass filter. Speech has a natural spectral tilt of about -6 dB/octave (the glottal pulse spectrum rolls off). Pre-emphasis flattens this tilt, giving the FFT and mel filters a more uniform energy distribution across frequencies. Without it, the low-frequency bins dominate and the high-frequency formant information is buried in noise.
Adjacent mel filters overlap, so their outputs are correlated. Correlated features are wasteful — a diagonal Gaussian classifier (common in GMM-HMMs) needs uncorrelated features. The DCT is like a PCA for this specific case: it decorrelates the log-mel features and concentrates most of the information in the first few coefficients. Keeping only 13 of 26 MFCCs reduces dimensionality by 50% with minimal information loss. Modern neural nets (Whisper, wav2vec) often skip the DCT and use log-mel features directly because they can learn to decorrelate internally.
Put the MFCC pipeline steps in order.
Pre-emphasis → Frame+Window → FFT → Mel filterbank → Log → DCT
This pipeline has been standard since Davis & Mermelstein (1980). Modern systems sometimes add delta and delta-delta (velocity and acceleration) features, and may replace DCT with neural feature learning. But the core pipeline — pre-emphasis through mel filterbank — remains universal.
You have feature vectors — maybe MFCCs from Chapter 6, maybe sensor readings. Now you need to decide: which class does this observation belong to? The Bayes classifier is the gold standard — it's provably optimal if you know the true distributions. Fisher's linear discriminant finds the best projection direction when you don't.
Two classes with equal priors P(C1) = P(C2) = 0.5. Class 1: N(μ1=2, σ²=1). Class 2: N(μ2=5, σ²=1). Compute P(C1|x=3).
P(C1|x) = p(x|C1)P(C1) / [p(x|C1)P(C1) + p(x|C2)P(C2)]. With equal priors, this simplifies to p(x|C1) / [p(x|C1) + p(x|C2)].
x=3 is 1σ from μ1=2 but 2σ from μ2=5. The Gaussian drops much faster at 2σ (e-2 vs e-0.5), so C1 is strongly favored. The Bayes decision boundary is at x = 3.5 (the midpoint, since variances are equal and priors are equal).
Same setup but now P(C1) = 0.3, P(C2) = 0.7. Where is the decision boundary? (Find x where P(C1|x) = P(C2|x), i.e., where p(x|C1)·0.3 = p(x|C2)·0.7.)
With equal variances, the log-likelihood ratio is linear. Solve: -(x-2)²/2 + ln(0.3) = -(x-5)²/2 + ln(0.7).
The boundary shifted from 3.5 (equal priors) to 3.22 — toward C1. This makes sense: since C2 is more likely a priori (0.7 vs 0.3), we give C2 more territory. A point at x=3.3 would be classified as C2 even though it's closer to μ1=2 than to μ2=5, because C2's higher prior tips the balance.
Two 2D classes: μ1 = [1, 2], μ2 = [4, 6]. Both have the identity covariance matrix (SW = 2I). What is the Fisher discriminant direction w*?
w* ∝ SW-1(μ1 - μ2). With SW = 2I, SW-1 = (1/2)I.
When SW is a scaled identity, the Fisher direction is simply the line connecting the class means. With equal spherical covariance, there's no need to "de-correlate" — the mean direction is already optimal. The ratio 4:3 means we project onto a line tilted at arctan(4/3) ≈ 53.1° from the x-axis.
The SW-1 term is what distinguishes Fisher from naive "connect the means." If SW has eigenvalue 10 along one axis and eigenvalue 1 along another, SW-1 has eigenvalues 0.1 and 1 — it shrinks the high-variance direction and stretches the low-variance one. This tips the Fisher direction toward the axis where the classes are narrow, even if the means are more separated along the wide axis. It's the same logic as Mahalanobis distance: an inch of separation in a narrow direction is worth more than a foot of separation in a wide one.
This Bayes classifier computes the wrong posterior. Click the buggy line.
function bayesPosterior(x, mu1, mu2, sigma, prior1) { const prior2 = 1 - prior1; const d1 = x - mu1; const d2 = x - mu2; const p1 = Math.exp(-d1*d1 / (2*sigma)) * prior1; const p2 = Math.exp(-d2*d2 / (2*sigma)) * prior2; return p1 / (p1 + p2); }
Line 5 (and line 6) have the bug. The Gaussian exponent should be -d²/(2σ²), but the code uses 2*sigma instead of 2*sigma*sigma. The parameter sigma is the standard deviation, so the variance is σ². Dividing by 2σ instead of 2σ² changes the effective variance and gives wrong posteriors whenever σ ≠ 1.
When σ = 1, the bug is invisible because 2σ = 2σ² = 2. This is a classic unit test failure: the test passed with σ=1 but the code is wrong for all other variances. Always test with σ ≠ 1.
You've found a separating hyperplane. Your colleague found a different one. Both classify the training data perfectly. Which is better? The Support Vector Machine answers: the one with the widest margin — the largest gap between the decision boundary and the nearest data point. This maximum-margin principle gives SVMs their legendary generalization.
A 2D SVM finds w = [3, 4], b = -2. What is the margin width?
Margin = 2 / ||w|| where ||w|| = √(w1² + w2²).
A margin of 0.4 means the nearest point to the boundary is 0.2 units away on each side. If we scaled w to [6, 8] and b to -4, the hyperplane would be identical but ||w|| = 10, giving a margin of 0.2. The margin depends on the normalized distance, so we minimize ||w|| to maximize the margin.
Hyperplane: 3x1 + 4x2 - 2 = 0. What is the distance from point (1, 1) to this hyperplane?
Distance = |w·x + b| / ||w|| = |3(1) + 4(1) - 2| / 5.
The point (1,1) is exactly 1 unit from the hyperplane. Since the margin is 0.4 (half-margin = 0.2), this point is well outside the margin — it's NOT a support vector. A support vector would be at distance 1/||w|| = 0.2 from the hyperplane.
(1, 0) gives w·x + b = 3 + 0 - 2 = 1, so |w·x + b| = 1. This point lies exactly on the positive margin plane (w·x + b = +1). It's a support vector — remove it and the hyperplane might change. All other points have |w·x + b| > 1, meaning they're safely inside their class region and don't constrain the optimization.
The polynomial kernel φ(x) = [x1², √2·x1x2, x2²] maps 2D to 3D. Compute φ([1, 2]) · φ([3, 1]).
This is the kernel value K(a, b) = (φ(a))·(φ(b)). Also verify: K(a,b) = (a·b)².
The kernel trick: instead of computing the expensive mapping φ and then the dot product in 3D, we compute (a·b)² directly in the original 2D space. Same result, much cheaper. For degree-d polynomials in n dimensions, the mapped space has O(nd) dimensions, but the kernel computation is still O(n). This is why SVMs can operate in effectively infinite-dimensional spaces (e.g., RBF kernel).
Write a function that computes the SVM margin given a weight vector w (as an array).
javascript function margin(w) { const norm = Math.sqrt(w.reduce((s, v) => s + v * v, 0)); return 2 / norm; }
The RBF (Gaussian) kernel is K(x, y) = exp(-||x-y||² / (2σ²)). Compute K([0,0], [1,1]) with σ = 1.
The RBF kernel measures similarity — K=1 for identical points, K→0 as distance grows. With σ=1, points at distance √2 ≈ 1.41 apart have K ≈ 0.37. The parameter σ controls the "reach": larger σ = smoother decision boundary, smaller σ = more wiggly (potentially overfitting).
You're designing the front-end for a speech recognition system. Your audio comes in at 16 kHz, and you need to produce MFCC feature vectors for each frame. This chapter ties together everything: framing, windowing, spectral analysis, mel filterbanks, and feature dimensions. Every number must be computable from the system parameters.
3 seconds of audio at fs = 16000 Hz. How many samples?
Frame length = 25 ms at fs = 16000 Hz. How many samples per frame?
25 ms at 16 kHz gives exactly 400 samples. This is a common choice in speech processing. The frame should be long enough to capture 2-3 pitch periods (human pitch: 80-300 Hz, period: 3.3-12.5 ms) but short enough that the vocal tract is roughly stationary.
Hop = 10 ms at fs = 16000 Hz. How many samples per hop?
The overlap is frame_size - hop = 400 - 160 = 240 samples, which is 60% overlap. This is more aggressive than the typical 50% overlap from Chapter 4, giving smoother temporal features at the cost of more frames to process.
48000 samples, frame = 400, hop = 160. How many frames?
Frames = floor((total - frame_size) / hop) + 1.
3 seconds of audio at 10 ms hop gives approximately 300 frames. Each frame produces one feature vector, so our downstream classifier sees a sequence of 298 feature vectors — one every 10 ms.
We zero-pad each 400-sample frame to N = 512 for the FFT (next power of 2). How many unique frequency bins in the magnitude spectrum?
Zero-padding from 400 to 512 doesn't add information — it just interpolates the spectrum more finely (smaller spacing between bins). The frequency resolution is still determined by the original 400 samples: Δf = fs/400 = 40 Hz. But the FFT bins are spaced at fs/512 = 31.25 Hz.
We apply 26 mel filters and keep the first 13 DCT coefficients. With delta and delta-delta features appended, what is the dimension of each feature vector?
Delta features = first difference of MFCCs across time. Delta-delta = second difference. Same dimension as MFCCs.
39-dimensional feature vectors are the classic MFCC configuration used by virtually every pre-deep-learning speech recognizer (HTK, Kaldi). Each of the 298 frames produces a 39-D vector, so the complete feature matrix for 3 seconds of audio is 298 × 39.
298 frames × 39 dimensions per frame, stored as 32-bit floats. How many kilobytes is the feature matrix?
3 seconds of raw audio at 16 kHz, 16-bit = 96 KB. After MFCC extraction: 45.4 KB. That's only 2.3x compression — MFCCs aren't about compression, they're about representation. The 39 dimensions capture the perceptually relevant information (vocal tract shape, pitch dynamics) while discarding speaker-specific details and noise.
Write a function that computes the output dimensions of a speech feature pipeline given audio parameters.
javascript function featurePipelineDims(durationSec, fs, frameSec, hopSec, nfft, nMel, nMfcc, deltas) { const totalSamples = Math.round(durationSec * fs); const frameSamples = Math.round(frameSec * fs); const hopSamples = Math.round(hopSec * fs); const nFrames = Math.floor((totalSamples - frameSamples) / hopSamples) + 1; const featDim = nMfcc * (deltas ? 3 : 1); return { nFrames, featDim, matrixSize: nFrames * featDim }; }
The mel filterbank deliberately blurs high frequencies because the human auditory system does the same. Below ~1 kHz, we can distinguish frequencies ~3 Hz apart. Above ~4 kHz, our resolution drops to ~100+ Hz. Narrow mel filters at low frequencies capture the fine pitch and formant structure that matters for speech intelligibility. Wide mel filters at high frequencies average out spectral detail that humans can't perceive anyway. The result is a compact, perceptually relevant representation.