Extracting meaningful features from the DFT — centroid, spread, kurtosis, entropy, flatness, and flux.
You have the DFT of an audio frame — say, 512 complex numbers X[0], X[1], ..., X[511]. Each X[k] tells you the amplitude and phase of frequency k. That's 512 dimensions of information. Too many for most downstream tasks.
Imagine you're building a music genre classifier. You don't need to know that bin 147 has magnitude 0.032 and bin 148 has magnitude 0.029. You need summary statistics: Is the sound bright or dark? Narrow-band or wide-band? Smooth or noisy? Changing rapidly or stable?
Spectral descriptors are scalar summary statistics computed from the magnitude spectrum. They reduce 512 numbers to 5–15 meaningful features. Each captures one perceptual or physical property of the sound.
Define the spectral density (the normalized power spectrum): sₖ = |X[k]|² / ∑ⱼ |X[j]|².
Now sₖ ≥ 0 and ∑ₖ sₖ = 1, so s is a valid probability distribution. The "random variable" is frequency fₖ = k · fₛ/N (where fₛ is the sampling rate and N is the DFT size). We can compute moments of this distribution just like moments of any probability distribution.
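As a rough illustration of this normalization, the sketch below turns one frame into a frequency axis and a distribution that sums to 1. The helper name `spectral_distribution` and the 1 kHz test tone are assumptions made for the example, not part of the original material.

```python
import numpy as np

def spectral_distribution(frame, fs, nfft):
    """Return (freqs, s): bin frequencies in Hz and the normalized power spectrum."""
    X = np.fft.rfft(frame, n=nfft)                 # one-sided DFT: nfft//2 + 1 bins
    power = np.abs(X) ** 2
    freqs = np.arange(len(power)) * fs / nfft      # f_k = k * fs / N
    total = power.sum()
    if total == 0:                                 # silent frame: nothing to normalize
        return freqs, np.zeros_like(power)
    return freqs, power / total

# Hypothetical test signal: a 1 kHz sine sampled at 16 kHz (lands exactly on bin 32)
fs, nfft = 16000, 512
t = np.arange(nfft) / fs
freqs, s = spectral_distribution(np.sin(2 * np.pi * 1000 * t), fs, nfft)
print(s.sum(), freqs[np.argmax(s)])
```

Because the test tone sits exactly on a bin centre, almost all of the probability mass ends up in that single bin.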
• Speaker identification: Different speakers have different formant patterns → different centroids and spreads.
• Instrument recognition: A flute (pure tone, low spread) vs. cymbals (broadband, high spread).
• Music genre: Classical (low centroid, smooth) vs. metal (high centroid, rough).
• Voice activity detection: Speech has specific spectral shape (centroid ~500–3000 Hz, moderate spread) vs. background noise (flat spectrum, high entropy).
• Mood detection: Bright/happy (high centroid) vs. dark/sad (low centroid).
• Audio quality assessment: Compression artifacts change spectral shape (bandwidth reduction lowers rolloff frequency).
• Environmental sound classification: Rain (broadband, high entropy) vs. bird song (narrow-band, low entropy) vs. traffic (low centroid, moderate spread).
```python
import numpy as np

def extract_all_descriptors(frame, fs=16000, nfft=512):
    """Extract all spectral descriptors from one frame."""
    windowed = frame * np.hanning(len(frame))
    X = np.fft.rfft(windowed, n=nfft)
    power = np.abs(X) ** 2
    total = power.sum()
    if total < 1e-10:
        return {}
    s = power / total
    freqs = np.arange(len(s)) * fs / nfft

    # Moments
    centroid = (freqs * s).sum()
    spread = np.sqrt(((freqs - centroid)**2 * s).sum())
    z = (freqs - centroid) / max(spread, 1e-8)
    skew = (z**3 * s).sum()
    kurt = (z**4 * s).sum()

    # Info-theoretic
    s_safe = np.maximum(s, 1e-10)
    entropy = -(s_safe * np.log2(s_safe)).sum()
    flatness = np.exp(np.log(power + 1e-10).mean()) / power.mean()
    crest = power.max() / power.mean()

    # Slope (linear regression)
    f_mean, s_mean = freqs.mean(), power.mean()
    slope = ((freqs-f_mean)*(power-s_mean)).sum() / ((freqs-f_mean)**2).sum()

    # Rolloff (85%)
    cumsum = np.cumsum(s)
    rolloff_idx = np.searchsorted(cumsum, 0.85)
    rolloff = freqs[min(rolloff_idx, len(freqs)-1)]

    return {'centroid': centroid, 'spread': spread, 'skewness': skew,
            'kurtosis': kurt, 'entropy': entropy, 'flatness': flatness,
            'crest': crest, 'slope': slope, 'rolloff': rolloff}
```
This function returns a dictionary of 9 descriptors per frame. For a full audio file, call this on overlapping frames (25 ms frame, 10 ms hop) to get a feature matrix of shape (num_frames, 9).
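The framing step mentioned here can be sketched as follows. The helper `frame_signal` and the stand-in descriptor `centroid_of_frame` are hypothetical names introduced for illustration, and the input is synthetic noise rather than real audio.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, shape (num_frames, frame_len)."""
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds samples [i*hop, i*hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return x[idx]

def centroid_of_frame(frame, fs, nfft):
    """Stand-in descriptor: power-weighted mean frequency of one windowed frame."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=nfft)) ** 2
    freqs = np.arange(len(power)) * fs / nfft
    total = power.sum()
    return (freqs * power).sum() / total if total > 0 else 0.0

fs = 16000
x = np.random.default_rng(0).standard_normal(fs)      # 1 s of synthetic noise
frames = frame_signal(x, frame_len=400, hop=160)       # 25 ms frames, 10 ms hop
features = np.array([[centroid_of_frame(f, fs, 512)] for f in frames])
print(frames.shape, features.shape)
```

With nine descriptors per frame instead of the single stand-in column, the same loop would yield a matrix with nine columns. Trailing samples that do not fill a whole frame are dropped in this sketch; padding the signal is an equally reasonable choice.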
Spectral descriptors were formalized in the MPEG-7 Audio standard (2001), which defines a set of "low-level descriptors" for audio search and retrieval. Before MPEG-7, similar features were used informally in speech processing (1970s), music information retrieval (1990s), and hearing research. Pilanci's slides draw from this rich history to show how simple statistics of the DFT power spectrum capture perceptually meaningful properties.
Two different spectra with the same total energy but different "shapes." Their spectral descriptors differ dramatically.
The spectral centroid is the "center of mass" of the spectrum: the power-weighted average frequency. High centroid = bright sound. Low centroid = dark sound.
μ₁ = ∑ₖ fₖ |X[k]|² / ∑ₖ |X[k]|²
Since we already normalized sₖ to sum to 1, this simplifies to:
μ₁ = ∑ₖ fₖ sₖ
where fₖ = k · fₛ/N is the frequency (in Hz) of bin k. This is literally the expected value of frequency under the spectral distribution.
Suppose N = 8, fs = 8000 Hz. The frequency bins are at fk = k · 1000 Hz for k = 0, 1, 2, 3, 4.
Power spectrum: |X|² = [0, 4, 1, 0, 0] (energy concentrated at 1000 Hz with a bit at 2000 Hz).
Total power = 5. Normalized: s = [0, 0.8, 0.2, 0, 0].
μ₁ = 0.8 · 1000 + 0.2 · 2000 = 800 + 400 = 1200 Hz.
The centroid is 1200 Hz, pulled slightly above the 1000 Hz fundamental by the harmonic at 2000 Hz. If we add more high-frequency harmonics, the centroid rises (brighter sound).
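The arithmetic of this worked example can be checked with a few lines of NumPy. This is only a numerical restatement of the numbers given above; the variable names are chosen for the example.

```python
import numpy as np

fs, N = 8000, 8
power = np.array([0.0, 4.0, 1.0, 0.0, 0.0])   # |X[k]|^2 for k = 0..4
freqs = np.arange(len(power)) * fs / N        # [0, 1000, 2000, 3000, 4000] Hz
s = power / power.sum()                       # [0, 0.8, 0.2, 0, 0]
centroid = (freqs * s).sum()                  # 0.8*1000 + 0.2*2000
print(centroid)
```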
```python
import numpy as np

def spectral_centroid(magnitude_spectrum, fs, N):
    """Compute spectral centroid in Hz."""
    K = len(magnitude_spectrum)
    freqs = np.arange(K) * fs / N  # frequency of each bin
    power = magnitude_spectrum ** 2
    total = power.sum()
    if total == 0:
        return 0.0
    return (freqs * power).sum() / total
```
Notice: we use the power spectrum (|X[k]|²), not the magnitude (|X[k]|). Some implementations use magnitude directly; the difference is whether loud frequencies are weighted linearly or quadratically. Power weighting (squared) emphasizes dominant components more heavily.
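To make the difference between the two weightings concrete, here is a small sketch that reuses the five-bin example, assuming magnitudes [0, 2, 1, 0, 0] (whose squares give the power values used earlier). The helper `weighted_mean_frequency` is a name invented for this illustration.

```python
import numpy as np

freqs = np.array([0.0, 1000.0, 2000.0, 3000.0, 4000.0])
magnitude = np.array([0.0, 2.0, 1.0, 0.0, 0.0])   # |X[k]|; squares give [0, 4, 1, 0, 0]

def weighted_mean_frequency(weights):
    """Mean frequency under the given non-negative weights."""
    return (freqs * weights).sum() / weights.sum()

centroid_power = weighted_mean_frequency(magnitude ** 2)   # quadratic weighting
centroid_magnitude = weighted_mean_frequency(magnitude)    # linear weighting
print(centroid_power, centroid_magnitude)
```

Power weighting gives 1200 Hz, while magnitude weighting gives about 1333 Hz, because the weaker 2000 Hz component counts for relatively more when it is not squared.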
In practice, we compute the centroid for each short-time frame (10–30 ms of audio). The sequence of centroids over time is a time series that tracks how brightness changes. When a drummer hits a hi-hat, the centroid spikes (sudden high-frequency energy). During a bass note, it drops.
Psychoacoustic studies show that spectral centroid correlates with perceived "brightness" (r > 0.8 in most experiments). This makes it one of the most reliable audio features for perceptual tasks. In music information retrieval, centroid alone can distinguish:
• Classical piano (centroid ~1000–2000 Hz) from rock guitar (centroid ~3000–5000 Hz)
• Male speech (centroid ~800 Hz) from female speech (centroid ~1200 Hz)
• Muffled recordings (low centroid) from bright/tinny recordings (high centroid)
Draw a spectrum shape by clicking. The centroid (orange vertical line) updates in real time. Try making it high-frequency heavy vs. low-frequency heavy.
The centroid tells you WHERE the spectral energy is centered. The spectral spread tells you how WIDE the distribution is around that center. A narrow spread means a pure, tonal sound. A wide spread means a noisy, broadband sound.
μ₂ = √(∑ₖ (fₖ − μ₁)² sₖ)
This is the standard deviation of the spectral distribution. Units are Hz. It measures how far, in the root-mean-square sense, the spectral energy sits from the centroid.
From Chapter 1: s = [0, 0.8, 0.2, 0, 0], μ₁ = 1200 Hz, f = [0, 1000, 2000, 3000, 4000] Hz.
μ₂² = 0.8 · (1000 − 1200)² + 0.2 · (2000 − 1200)² = 32,000 + 128,000 = 160,000 Hz², so μ₂ = √160,000 = 400 Hz.
The spread is 400 Hz. Compare to a pure tone at 1000 Hz, which would have s = [0, 1, 0, 0, 0] and spread = 0. Our signal has some energy at 2000 Hz, giving it a nonzero spread.
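The same numbers can be reproduced numerically. The snippet below is a minimal check of this example plus the pure-tone comparison; the variable names are chosen for illustration.

```python
import numpy as np

freqs = np.array([0.0, 1000.0, 2000.0, 3000.0, 4000.0])
s = np.array([0.0, 0.8, 0.2, 0.0, 0.0])

centroid = (freqs * s).sum()
variance = ((freqs - centroid) ** 2 * s).sum()   # 0.8*200^2 + 0.2*800^2
spread = np.sqrt(variance)

# Pure tone at 1000 Hz: all mass in one bin, so the spread collapses to zero
tone = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
tone_centroid = (freqs * tone).sum()
tone_spread = np.sqrt(((freqs - tone_centroid) ** 2 * tone).sum())
print(spread, tone_spread)
```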
| Sound | Centroid | Spread |
|---|---|---|
| Pure sine (flute) | = fundamental freq | ≈ 0 (very narrow) |
| Vowel "ah" | ~800 Hz | ~300–500 Hz |
| Snare drum | ~3000 Hz | ~2000 Hz (broad) |
| White noise | = fₛ/4 | = fₛ/(4√3) ≈ 0.144 fₛ (broad) |
| Hi-hat | ~6000–10000 Hz | ~3000 Hz |
Some literature uses bandwidth instead of spread. They're related but not identical. The −3 dB bandwidth is the width of the band over which the power stays within half of its peak value. The spectral spread is the standard deviation of the full distribution. For a Gaussian-shaped power spectrum, the −3 dB bandwidth is the full width at half maximum, ≈ 2.355 · μ₂. For non-Gaussian spectra, they can differ significantly.
```python
import numpy as np

def spectral_spread(magnitude_spectrum, fs, N):
    """Compute spectral spread (standard deviation around the centroid) in Hz."""
    K = len(magnitude_spectrum)
    freqs = np.arange(K) * fs / N  # frequency of each bin
    power = magnitude_spectrum ** 2
    total = power.sum()
    if total == 0:
        return 0.0
    centroid = (freqs * power).sum() / total
    spread_sq = ((freqs - centroid) ** 2 * power).sum() / total
    return np.sqrt(spread_sq)
```
Centroid alone can't distinguish all sound types. A narrow-band signal at 2000 Hz (a whistle) and a broad-band signal centered at 2000 Hz (a snare drum) have the same centroid but completely different timbres. Spread captures this difference: the whistle has near-zero spread, the snare has spread > 1000 Hz.
In a feature space with centroid on one axis and spread on the other, different instrument families form distinct clusters. This is why both features together are far more discriminative than either alone.
Adjust the center frequency and width of a Gaussian spectrum. Watch how centroid (orange) and spread (teal band) change.
The centroid and spread are the first two moments. But spectra can have complex shapes that two numbers can't capture. The third and fourth moments — skewness and kurtosis — describe asymmetry and peakedness.
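μ₃ = ∑ₖ ((fₖ − μ₁)/μ₂)³ sₖ
This measures the asymmetry of the spectral shape around the centroid:
• μ₃ > 0: Right-skewed. The bulk of the energy sits at or below the centroid, with a long tail towards high frequencies.
• μ₃ < 0: Left-skewed. The bulk of the energy sits at or above the centroid, with a long tail towards low frequencies.
• μ₃ = 0: No net asymmetry (for example, a spectrum that is symmetric around the centroid)
A guitar note with strong low harmonics and progressively weaker high harmonics has positive skewness: the energy is bunched at low frequencies and the weak upper harmonics form a long high-frequency tail. A spectrum dominated by high-frequency energy, with a tail reaching down towards low frequencies, has negative skewness.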
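μ₄ = ∑ₖ ((fₖ − μ₁)/μ₂)⁴ sₖ
This measures the peakedness (or "tailedness") of the spectral distribution:
• μ₄ > 3: Leptokurtic (sharper peak than Gaussian, heavier tails). A strong tonal peak on a weak broadband floor has very high kurtosis: almost all the energy is concentrated at one point, and the little that remains lies far from it. (For an ideal single-bin tone, μ₂ = 0 and the ratio is undefined.)
• μ₄ = 3: Mesokurtic (Gaussian-shaped spectrum).
• μ₄ < 3: Platykurtic (flatter than Gaussian, lighter tails). A flat (white-noise) spectrum has kurtosis close to 1.8.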
Let's compute moments for a simple 5-bin spectrum: s = [0.1, 0.2, 0.4, 0.2, 0.1] at frequencies f = [1000, 2000, 3000, 4000, 5000] Hz.
Centroid: μ1 = 1000(0.1) + 2000(0.2) + 3000(0.4) + 4000(0.2) + 5000(0.1) = 100 + 400 + 1200 + 800 + 500 = 3000 Hz
Spread: μ₂² = (1000−3000)²(0.1) + (2000−3000)²(0.2) + 0 + (4000−3000)²(0.2) + (5000−3000)²(0.1) = 400,000 + 200,000 + 0 + 200,000 + 400,000 = 1,200,000. μ₂ = √1,200,000 ≈ 1095 Hz.
Skewness: The spectrum is symmetric around the centroid, so the positive and negative cubed terms cancel and μ₃ = 0.
Kurtosis: μ₄ = ∑ₖ ((fₖ−3000)/1095)⁴ · sₖ = (1.826)⁴(0.1) + (0.913)⁴(0.2) + 0 + (0.913)⁴(0.2) + (1.826)⁴(0.1) = 11.11(0.1) + 0.694(0.2) + 0 + 0.694(0.2) + 11.11(0.1) = 1.111 + 0.139 + 0.139 + 1.111 = 2.50
Kurtosis = 2.50 < 3 → platykurtic (flatter than Gaussian). This makes sense: our symmetric 5-bin spectrum is more "uniform-like" than a peaked Gaussian.
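The four moments of this 5-bin example can be verified with a short sketch. The helper `spectral_moments` is an illustrative name, and it assumes a nonzero spread. Carried out without rounding the intermediate standardized distances, the kurtosis evaluates to exactly 2.5.

```python
import numpy as np

def spectral_moments(freqs, s):
    """Centroid, spread, skewness, kurtosis of a normalized spectrum (assumes spread > 0)."""
    mu1 = (freqs * s).sum()
    mu2 = np.sqrt(((freqs - mu1) ** 2 * s).sum())
    z = (freqs - mu1) / mu2                     # standardized distance from the centroid
    return mu1, mu2, (z ** 3 * s).sum(), (z ** 4 * s).sum()

freqs = np.array([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])
s = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
centroid, spread, skewness, kurtosis = spectral_moments(freqs, s)
print(centroid, spread, skewness, kurtosis)
```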
| Moment | Name | Measures | Unit |
|---|---|---|---|
| μ1 | Centroid | Average frequency (brightness) | Hz |
| μ2 | Spread | Width (bandwidth) | Hz |
| μ3 | Skewness | Asymmetry | dimensionless |
| μ4 | Kurtosis | Peakedness/tailedness | dimensionless |
Choose a spectral shape and see all four moments computed. Compare symmetric vs. skewed, peaked vs. flat.
Beyond moments, we can compute information-theoretic and ratio-based descriptors that capture different aspects of spectral "shape."
H = −∑ₖ sₖ log₂ sₖ
This is the Shannon entropy of the spectral distribution. It measures how "spread out" or "uncertain" the frequency content is:
• Low entropy: Energy concentrated at few bins (tonal, predictable). A pure tone has H ≈ 0 (all mass at one bin).
• High entropy: Energy spread evenly across all bins (noise-like, unpredictable). A perfectly flat spectrum over K bins has H = log₂(K), the maximum possible.
Spectral flatness is the ratio of the geometric mean to the arithmetic mean of the power spectrum: SF = (∏ₖ |X[k]|²)^(1/K) / ((1/K) ∑ₖ |X[k]|²). By the AM-GM inequality, SF ∈ [0, 1]:
• SF = 1: perfectly flat spectrum (white noise)
• SF → 0: highly peaked spectrum (pure tone)
Also called Wiener entropy or tonality coefficient. It distinguishes tonal from noise-like signals without computing pitch.
The spectral crest is the ratio of the peak to the average of the power spectrum: C = maxₖ |X[k]|² / ((1/K) ∑ₖ |X[k]|²). High crest = one dominant frequency (tonal). Low crest = no single peak dominates (noise-like). This is the spectral analog of the crest factor in the time domain (peak/RMS).
Shannon entropy H measures the "surprise" or "unpredictability" of a distribution. Applied to the spectral distribution, it answers: "If I pick a random frequency weighted by the spectrum, how surprised am I by the result?"
For a pure tone (all energy at one bin): H = −1·log₂(1) = 0 bits. No surprise: you always know which frequency it will be.
For white noise (K equal bins): H = −K·(1/K)·log₂(1/K) = log₂(K) bits. Maximum surprise: every frequency is equally likely.
For 256 frequency bins: maximum entropy = log₂(256) = 8 bits. A speech frame typically has entropy around 5–6 bits. Background noise: 7+ bits (nearly maximum).
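A simple but effective VAD algorithm: compute the spectral entropy H of each frame and label the frame as speech if H < T, and as noise otherwise.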
This works because speech concentrates energy in formant regions (3-5 peaks in the spectrum, moderate entropy), while noise distributes energy uniformly (high entropy). The threshold T is calibrated to the specific noise environment.
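A minimal sketch of such an entropy-threshold detector is shown below. The function names, the two-tone stand-in for a speech-like frame, and the threshold of 6 bits are all assumptions made for illustration; a real threshold would be calibrated on recordings from the target environment.

```python
import numpy as np

def spectral_entropy(power):
    """Shannon entropy (bits) of a power spectrum after normalization."""
    total = power.sum()
    if total <= 0:
        return 0.0
    s = power / total
    s = s[s > 0]                                  # treat 0 * log(0) as 0
    return float(-(s * np.log2(s)).sum())

def entropy_vad(frames, nfft, threshold_bits):
    """Flag frames whose spectral entropy falls below the threshold."""
    window = np.hanning(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, n=nfft, axis=1)) ** 2
    H = np.array([spectral_entropy(p) for p in power])
    return H < threshold_bits, H

fs, n = 16000, 400
t = np.arange(n) / fs
rng = np.random.default_rng(0)
peaky = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)  # few spectral peaks
noise = rng.standard_normal(n)                                            # flat spectrum
is_speech, H = entropy_vad(np.stack([peaky, noise]), nfft=512, threshold_bits=6.0)
print(is_speech, H)
```

Note that a silent frame has entropy 0 in this sketch and would be flagged as speech, so a practical detector would presumably combine the entropy test with an energy gate.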
Spectrum A (tonal): s = [0.9, 0.05, 0.03, 0.02] (4 bins)
H = −(0.9 log₂ 0.9 + 0.05 log₂ 0.05 + 0.03 log₂ 0.03 + 0.02 log₂ 0.02)
H = −(−0.137 − 0.216 − 0.152 − 0.113) = 0.618 bits
Maximum possible = log₂(4) = 2 bits. So normalized entropy = 0.618/2 = 0.31 (very tonal).
Spectrum B (noisy): s = [0.27, 0.23, 0.25, 0.25]
H = −(0.27 log₂ 0.27 + 0.23 log₂ 0.23 + 0.25 log₂ 0.25 + 0.25 log₂ 0.25)
H = −(−0.510 − 0.488 − 0.500 − 0.500) ≈ 1.998 bits. Normalized ≈ 0.999 (nearly maximum = noise-like).
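Both entropy calculations can be reproduced in a few lines. The helper `entropy_bits` is an illustrative name; it skips zero-probability bins so that log₂(0) never appears.

```python
import numpy as np

def entropy_bits(s):
    """Shannon entropy in bits of a discrete distribution."""
    s = np.asarray(s, dtype=float)
    nz = s[s > 0]
    return float(-(nz * np.log2(nz)).sum())

A = [0.9, 0.05, 0.03, 0.02]      # tonal spectrum
B = [0.27, 0.23, 0.25, 0.25]     # noisy spectrum
H_max = np.log2(len(A))          # 2 bits for 4 bins

H_A, H_B = entropy_bits(A), entropy_bits(B)
norm_A, norm_B = H_A / H_max, H_B / H_max
print(H_A, norm_A, H_B, norm_B)
```

This gives about 0.62 bits for spectrum A and just under 2 bits for spectrum B.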
Why is spectral flatness always ≤ 1? This follows from the arithmetic-geometric mean inequality: for non-negative numbers x₁, …, xₙ, the geometric mean never exceeds the arithmetic mean, (x₁ · x₂ ⋯ xₙ)^(1/n) ≤ (x₁ + x₂ + … + xₙ)/n. Equality holds if and only if all numbers are equal, that is, for a flat spectrum.
So SF = GM/AM ≤ 1, with equality iff the spectrum is perfectly flat. In practice, implementations add a small floor (10⁻¹⁰) to all bins to avoid GM = 0 from a single zero bin.
| Descriptor | Pure Tone | White Noise | Range |
|---|---|---|---|
| Entropy | ≈ 0 bits | log₂(K) bits | [0, log₂(K)] |
| Flatness | ≈ 0 | 1.0 | [0, 1] |
| Crest | K (num bins) | ≈ 1 | [1, K] |
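The two extreme columns of this table can be checked numerically. In the sketch below, `shape_descriptors` is a hypothetical helper, K = 256 is an arbitrary choice, and the floor is applied to both the geometric and the arithmetic mean so that the flatness ratio cannot exceed 1 (one possible design choice among several).

```python
import numpy as np

def shape_descriptors(power, floor=1e-10):
    """Return (entropy in bits, flatness, crest) of a power spectrum."""
    power = np.asarray(power, dtype=float)
    s = power / power.sum()
    nz = s[s > 0]
    entropy = float(-(nz * np.log2(nz)).sum())
    floored = power + floor                       # avoid log(0) in the geometric mean
    flatness = float(np.exp(np.log(floored).mean()) / floored.mean())
    crest = float(power.max() / power.mean())
    return entropy, flatness, crest

K = 256
tone = np.zeros(K)
tone[40] = 1.0                                    # all energy in a single bin
flat = np.ones(K)                                 # perfectly flat spectrum

tone_entropy, tone_flatness, tone_crest = shape_descriptors(tone)
flat_entropy, flat_flatness, flat_crest = shape_descriptors(flat)
print(tone_entropy, tone_flatness, tone_crest)
print(flat_entropy, flat_flatness, flat_crest)
```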
Mix between a pure tone and white noise. Watch how entropy, flatness, and crest respond.
All the descriptors so far describe a single frame. But many audio features are about change over time. Is the spectrum stable (sustained note) or rapidly evolving (speech, drum hits)?
Spectral flux measures how much the spectrum changes between consecutive frames. It's the squared Euclidean distance between successive magnitude spectra: Fₜ = ∑ₖ (|Xₜ[k]| − |Xₜ₋₁[k]|)².
• Low flux: Sustained note, steady noise — spectrum doesn't change.
• High flux: Onset of a note, percussive hit, speech transitions — spectrum changes rapidly.
In practice, we often care only about increases in spectral energy (onsets), not decreases (offsets). The half-wave rectified flux is:
This ignores energy that disappears and only responds to energy that appears. Much more robust for onset detection.
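Both variants can be sketched in a few lines. The rectified version below clips negative bin-wise differences to zero before squaring, which is one common formulation and is assumed here rather than taken from the source; the function name and the toy spectra are illustrative.

```python
import numpy as np

def spectral_flux(mags, rectify=False):
    """Flux between consecutive rows of mags (num_frames, num_bins)."""
    diff = np.diff(mags, axis=0)                 # |X_t[k]| - |X_{t-1}[k]|
    if rectify:
        diff = np.maximum(diff, 0.0)             # keep only increases (onsets)
    return (diff ** 2).sum(axis=1)

mags = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],   # unchanged spectrum
    [1.0, 2.0, 0.0],   # onset: energy appears in bin 1
    [0.0, 2.0, 0.0],   # offset: energy disappears from bin 0
])
flux = spectral_flux(mags)                       # responds to onset and offset
onset_flux = spectral_flux(mags, rectify=True)   # responds to the onset only
print(flux, onset_flux)
```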
This is the linear regression slope of the spectrum (magnitude vs. frequency). It describes the overall tilt of the spectrum:
• Negative slope: energy decreases with frequency (typical of most natural sounds — "pink" or "red" spectra)
• Near-zero slope: flat spectrum (white noise)
• Positive slope: energy increases with frequency (unusual, high-frequency dominated)
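Most speech and music have negative spectral slope (roughly −3 to −6 dB/octave on average). The tilt helps distinguish voiced vowels (energy concentrated at low frequencies, clearly negative slope) from fricatives such as /s/ (energy concentrated at high frequencies, flat or even positive slope).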
A related descriptor: the spectral rolloff is the frequency below which a certain percentage (typically 85% or 95%) of the total spectral energy lies:
High rolloff → energy extends to high frequencies (bright). Low rolloff → energy concentrated at low frequencies (dark). Similar to centroid but less sensitive to outlier high-frequency peaks.
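One common reading of the rolloff definition is "the first bin at which the cumulative normalized power reaches the chosen fraction". The sketch below follows that reading; the function name and its parameters are assumptions made for illustration.

```python
import numpy as np

def spectral_rolloff(power, fs, nfft, fraction=0.85):
    """Frequency (Hz) of the first bin where cumulative power >= fraction of the total."""
    power = np.asarray(power, dtype=float)
    total = power.sum()
    if total == 0:
        return 0.0
    cumulative = np.cumsum(power) / total
    k = int(np.searchsorted(cumulative, fraction))   # first index with cumulative >= fraction
    k = min(k, len(power) - 1)                       # guard against rounding at the top end
    return k * fs / nfft

power = np.array([0.0, 4.0, 1.0, 0.0, 0.0])          # 80% at 1000 Hz, 20% at 2000 Hz
r85 = spectral_rolloff(power, fs=8000, nfft=8, fraction=0.85)
r50 = spectral_rolloff(power, fs=8000, nfft=8, fraction=0.50)
print(r85, r50)
```

With 80% of the energy at 1000 Hz, the 50% rolloff stays at 1000 Hz, while the 85% rolloff has to reach up to the 2000 Hz bin.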
Raw flux depends on signal loudness — a louder signal has larger magnitude changes. To compare flux across segments of different volume, normalize:
where ‖Xₜ‖ = √(∑ₖ |Xₜ[k]|²). This gives volume-independent flux that responds to spectral shape changes, not loudness changes.
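A plausible way to implement this normalization is to scale each magnitude spectrum to unit Euclidean norm before differencing. The sketch below assumes that form; the function name and the `eps` guard against silent frames are illustrative additions.

```python
import numpy as np

def normalized_flux(mags, eps=1e-12):
    """Flux between consecutive unit-norm magnitude spectra, shape (num_frames - 1,)."""
    mags = np.asarray(mags, dtype=float)
    norms = np.sqrt((mags ** 2).sum(axis=1, keepdims=True))   # ||X_t|| per frame
    unit = mags / np.maximum(norms, eps)                      # eps avoids division by zero
    return (np.diff(unit, axis=0) ** 2).sum(axis=1)

quiet = np.array([1.0, 2.0, 2.0])
mags = np.stack([quiet, 10 * quiet, np.array([2.0, 2.0, 1.0])])
nf = normalized_flux(mags)
print(nf)
```

The first transition only changes loudness (same shape, ten times louder), so its normalized flux is zero. The second transition changes the spectral shape and gives a flux of 2/9.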
The slope computation is ordinary least-squares regression of log-magnitude vs. log-frequency (for dB/octave measurement) or magnitude vs. frequency (for a simpler linear measure). Pilanci uses the linear form:
```python
import numpy as np

def spectral_slope(magnitude_spectrum, fs, N):
    """Least-squares slope of magnitude vs. frequency (magnitude units per Hz)."""
    K = len(magnitude_spectrum)
    freqs = np.arange(K) * fs / N  # frequency of each bin
    f_mean = freqs.mean()
    s_mean = magnitude_spectrum.mean()
    num = ((freqs - f_mean) * (magnitude_spectrum - s_mean)).sum()
    den = ((freqs - f_mean) ** 2).sum()
    return num / den if den > 0 else 0.0
```
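Note that dB-per-octave or dB-per-decade figures refer to the log-log form, not to the linear slope returned by the function above. In the log-log form, the long-term spectrum of speech falls by very roughly 6 dB/octave (about 20 dB per factor-10 increase in frequency). Music varies more: bass-heavy genres have steeper negative slopes; bright electronic music has flatter slopes.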
In Pilanci's slides, he shows the spectral centroid time series for a drum loop. When the hi-hat hits:
• Centroid jumps from ~1000 Hz (kick/snare region) to ~8000 Hz (hi-hat region)
• Duration of the jump: ~50 ms (the hi-hat's sustain)
• Spread simultaneously increases (hi-hat is broadband)
• Flatness spikes (hi-hat is noise-like, not tonal)
This pattern — simultaneous centroid spike + spread increase + flatness increase — is so characteristic that a simple threshold detector on these three features can locate every hi-hat hit with >95% accuracy, even in a complex mix.
Contrast with a snare hit:
• Centroid jumps to ~3000–4000 Hz (lower than hi-hat)
• Spread is large but less than hi-hat
• Has both tonal (drum head and shell resonance) and noise (snare wire rattle) components
• Flatness is moderate (between pure tone and pure noise)
A simulated sequence of spectral frames. Flux spikes at "onset" frames where the spectrum changes abruptly.
Now let's put it all together. In a real audio analysis system, you compute ALL spectral descriptors for each frame, producing a feature vector that characterizes the sound.
Typical values in audio analysis systems:
| Parameter | Speech | Music |
|---|---|---|
| Sample rate fs | 16,000 Hz | 44,100 Hz |
| Frame length | 25 ms (400 samples) | 23 ms (1024 samples) |
| Frame hop | 10 ms (160 samples) | 11.6 ms (512 samples) |
| FFT size N | 512 (zero-padded) | 2048 (zero-padded) |
| Window function | Hamming | Hann |
| Frequency bins used | N/2 + 1 = 257 | N/2 + 1 = 1025 |
| Descriptors per frame | 7–13 | 7–13 |
The choice of window function (Hann, Hamming, Blackman) affects spectral leakage: energy from one true frequency "leaks" into adjacent bins. Longer windows give better frequency resolution but worse time resolution. This is the time-frequency uncertainty principle — you cannot have arbitrarily precise measurements in both time and frequency simultaneously.
Draw any spectrum shape on the canvas below. All descriptors are computed in real time. Experiment: make it narrow (low spread), shift it left/right (change centroid), add a second peak (watch how spread and kurtosis respond).
Spectral descriptors reduce high-dimensional spectra to interpretable features. They're the bridge between raw Fourier coefficients and machine learning classifiers.
| Descriptor | Formula Essence | Measures | Bright Sound | Dark Sound |
|---|---|---|---|---|
| Centroid | ∑ fₖ·sₖ | Average frequency | High | Low |
| Spread | std(f under s) | Bandwidth | Moderate | Narrow |
| Skewness | 3rd standardized moment | Asymmetry | — | — |
| Kurtosis | 4th standardized moment | Peakedness | Low (noisy) | High (tonal) |
| Entropy | −∑ sₖ log₂ sₖ | Disorder | High | Low |
| Flatness | geom/arith mean | Tonalness | High (noise) | Low (tone) |
| Crest | max/mean | Dominance | Low | High |
| Flux | ∑(Δ\|X[k]\|)² | Change over time | — | — |
| Slope | regression on s vs. f | Tilt | Less negative | More negative |
• Lecture 6 (DFT): Spectral descriptors require the DFT as input. No DFT, no spectral features.
• MFCCs (coming): Mel-frequency cepstral coefficients are another spectrum summarization, using a perceptual frequency scale and cepstral analysis. They capture formant structure better than raw moments.
• ML connection: Modern audio classifiers (speech recognition, music tagging) often use learned features (spectrograms + CNNs), but spectral descriptors remain useful as hand-crafted baselines and for interpretable systems.
| Task | Key Descriptors | Why |
|---|---|---|
| Music genre | Centroid, spread, flux, flatness | Brightness, bandwidth, rhythm, tonalness |
| Speaker ID | MFCCs (13 coefficients) | Formant structure encodes vocal tract shape |
| Onset detection | Spectral flux (half-rectified) | Change = new event |
| Voice activity | Entropy, flatness, centroid | Speech vs. noise distinction |
| Instrument recognition | All moments + flux + slope | Each instrument has unique spectral fingerprint |
| Mood/valence | Centroid, slope, entropy | Bright/smooth = happy; dark/rough = sad |
Modern deep learning often bypasses hand-crafted descriptors entirely, feeding raw spectrograms (time-frequency images) into CNNs or transformers. However:
• Spectral descriptors remain useful as baselines — any learned system should beat a simple SVM on these features
• They provide interpretability — a centroid time series is human-readable, a CNN embedding is not
• For low-resource tasks (limited training data), hand-crafted features often outperform learned features
• They're computationally cheap — a microcontroller can compute centroid in real time
• They serve as sanity checks — if a learned model disagrees with simple descriptors, investigate why
The ideal system often combines both: spectral descriptors as an input layer alongside raw spectrogram features, letting the model learn when each is useful.
In practice, temporal derivatives of spectral features are often as informative as the features themselves. The delta (first derivative) and delta-delta (second derivative) features capture rate-of-change:
If you extract 9 base descriptors per frame, adding deltas and delta-deltas gives 27 features per frame. Speech recognition systems (before deep learning) typically used 13 MFCCs + 13 delta + 13 delta-delta = 39 features per frame.
Delta-centroid captures whether the sound is getting brighter (positive) or darker (negative). Delta-flux captures whether onsets are accelerating or decelerating. These temporal dynamics are crucial for distinguishing sounds that have similar static spectra but different temporal envelopes (e.g., a plucked vs. bowed string).
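As a rough sketch of how deltas and delta-deltas extend a feature matrix, the example below uses `np.gradient` (a simple central difference along the frame axis). This is a stand-in chosen for brevity: speech toolkits often compute deltas with a regression over several neighbouring frames instead. The function name and the synthetic feature matrix are hypothetical.

```python
import numpy as np

def add_deltas(features):
    """Stack features with first and second differences: (T, D) -> (T, 3*D)."""
    delta = np.gradient(features, axis=0)     # per-frame rate of change
    delta2 = np.gradient(delta, axis=0)       # rate of change of the rate of change
    return np.hstack([features, delta, delta2])

# Hypothetical input: 11 frames x 9 descriptors, each rising by 100 per frame
base = np.linspace(1000.0, 2000.0, 11)[:, None] * np.ones((1, 9))
full = add_deltas(base)
print(full.shape)
```

For this linearly rising input, every delta equals 100 per frame and every delta-delta is zero, and the 9 base columns grow to 27.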
Some descriptors are highly correlated (measuring similar things differently):
• Entropy and flatness: both measure "noisiness" — correlation r > 0.9
• Centroid and rolloff: both measure "brightness" — correlation r > 0.85
• Crest and 1/flatness: nearly inverse of each other
In practice, using all descriptors together introduces multicollinearity in linear models. Options:
• PCA on the feature matrix (decorrelate, reduce dimensions)
• Select a subset (centroid + spread + flatness + flux covers most variance)
• Use non-linear models (random forests, neural nets) that handle correlation naturally
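The PCA option can be sketched with plain NumPy via the singular value decomposition. The feature matrix below is synthetic: two hidden factors stand in for "noisiness" and "brightness", and each drives a pair of nearly duplicate columns. The function name and all data are assumptions made for illustration.

```python
import numpy as np

def pca_decorrelate(features, num_components):
    """Standardize columns, then project onto the leading principal directions."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0                                   # leave constant columns unscaled
    z = (features - mean) / std
    _, singular_values, vt = np.linalg.svd(z, full_matrices=False)
    explained = singular_values ** 2 / (singular_values ** 2).sum()
    return z @ vt[:num_components].T, explained[:num_components]

rng = np.random.default_rng(0)
noisiness = rng.standard_normal(200)
brightness = rng.standard_normal(200)
features = np.column_stack([
    noisiness + 0.1 * rng.standard_normal(200),    # entropy-like column
    noisiness + 0.1 * rng.standard_normal(200),    # flatness-like column
    brightness + 0.1 * rng.standard_normal(200),   # centroid-like column
    brightness + 0.1 * rng.standard_normal(200),   # rolloff-like column
])
scores, explained = pca_decorrelate(features, num_components=2)
print(scores.shape, explained.sum())
```

In this synthetic setup, two components capture nearly all of the variance of the four correlated columns, and the resulting scores are uncorrelated with each other.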
Pilanci recommends starting with the minimal set {centroid, spread, flatness, flux} and adding descriptors only if classification accuracy demands it. These four capture brightness, bandwidth, tonalness, and temporal change — the four perceptual axes most humans can discriminate.