EE269 Lecture 7 — Mert Pilanci, Stanford

Spectral Descriptors

Extracting meaningful features from the DFT — centroid, spread, kurtosis, entropy, flatness, and flux.

Prerequisites: EE269 Lecture 6 (DFT) + Basic statistics (mean, variance). That's it.
8 Chapters · 7+ Simulations · 0 Assumed Knowledge

Chapter 0: From DFT to Features

You have the DFT of an audio frame — say, 512 complex numbers X[0], X[1], ..., X[511]. Each X[k] tells you the amplitude and phase of frequency k. That's 512 dimensions of information. Too many for most downstream tasks.

Imagine you're building a music genre classifier. You don't need to know that bin 147 has magnitude 0.032 and bin 148 has magnitude 0.029. You need summary statistics: Is the sound bright or dark? Narrow-band or wide-band? Smooth or noisy? Changing rapidly or stable?

Spectral descriptors are scalar summary statistics computed from the magnitude spectrum. They reduce 512 numbers to 5–15 meaningful features. Each captures one perceptual or physical property of the sound.

The key idea: Treat the power spectrum $|X[k]|^2$ (or the magnitude spectrum $|X[k]|$) as a probability distribution over frequency. Then compute statistics of that distribution: mean (centroid), standard deviation (spread), skewness, kurtosis, entropy. Each statistic maps to a perceivable quality of the sound.

The Magnitude Spectrum as a Distribution

Define the spectral density (normalized magnitude spectrum):

$$s_k = \frac{|X[k]|^2}{\sum_{m=0}^{N/2} |X[m]|^2}$$

Now $s_k \ge 0$ and $\sum_k s_k = 1$, so it is a valid probability distribution. The "random variable" is frequency $f_k = k \cdot f_s / N$ (where $f_s$ is the sampling rate). We can compute moments of this distribution just like moments of any probability distribution.

Applications

Speaker identification: Different speakers have different formant patterns → different centroids and spreads.

Instrument recognition: A flute (pure tone, low spread) vs. cymbals (broadband, high spread).

Music genre: Classical (low centroid, smooth) vs. metal (high centroid, rough).

Voice activity detection: Speech has specific spectral shape (centroid ~500–3000 Hz, moderate spread) vs. background noise (flat spectrum, high entropy).

Mood detection: Bright/happy (high centroid) vs. dark/sad (low centroid).

Audio quality assessment: Compression artifacts change spectral shape (bandwidth reduction lowers rolloff frequency).

Environmental sound classification: Rain (broadband, high entropy) vs. bird song (narrow-band, low entropy) vs. traffic (low centroid, moderate spread).

The Full Pipeline in Code

python
import numpy as np

def extract_all_descriptors(frame, fs=16000, nfft=512):
    """Extract all spectral descriptors from one frame."""
    windowed = frame * np.hanning(len(frame))
    X = np.fft.rfft(windowed, n=nfft)
    power = np.abs(X) ** 2
    total = power.sum()
    if total < 1e-10:
        return {}
    s = power / total
    freqs = np.arange(len(s)) * fs / nfft

    # Moments
    centroid = (freqs * s).sum()
    spread = np.sqrt(((freqs - centroid)**2 * s).sum())
    z = (freqs - centroid) / max(spread, 1e-8)
    skew = (z**3 * s).sum()
    kurt = (z**4 * s).sum()

    # Info-theoretic
    s_safe = np.maximum(s, 1e-10)
    entropy = -(s_safe * np.log2(s_safe)).sum()
    flatness = np.exp(np.log(power + 1e-10).mean()) / power.mean()
    crest = power.max() / power.mean()

    # Slope (linear regression)
    f_mean, s_mean = freqs.mean(), power.mean()
    slope = ((freqs-f_mean)*(power-s_mean)).sum() / ((freqs-f_mean)**2).sum()

    # Rolloff (85%)
    cumsum = np.cumsum(s)
    rolloff_idx = np.searchsorted(cumsum, 0.85)
    rolloff = freqs[min(rolloff_idx, len(freqs)-1)]

    return {'centroid': centroid, 'spread': spread,
            'skewness': skew, 'kurtosis': kurt,
            'entropy': entropy, 'flatness': flatness,
            'crest': crest, 'slope': slope,
            'rolloff': rolloff}

This function returns a dictionary of 9 descriptors per frame. For a full audio file, call this on overlapping frames (25 ms frame, 10 ms hop) to get a feature matrix of shape (num_frames, 9).
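The framing step described above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: `frame_signal` and `centroid_and_spread` are hypothetical helper names, the input is a synthetic 1 kHz tone, and only two descriptors are computed per frame to keep the block short (a full extractor would slot in the same way and give 9 columns).

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (no padding; leftover tail samples are dropped)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def centroid_and_spread(frame, fs, nfft):
    """Two illustrative descriptors for one frame (Hann window, power weighting)."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=nfft)) ** 2
    total = power.sum()
    if total < 1e-10:
        return np.zeros(2)
    s = power / total
    freqs = np.arange(len(s)) * fs / nfft
    centroid = (freqs * s).sum()
    spread = np.sqrt(((freqs - centroid) ** 2 * s).sum())
    return np.array([centroid, spread])

fs = 16000
t = np.arange(fs) / fs                              # 1 second of samples
x = np.sin(2 * np.pi * 1000 * t)                    # synthetic 1 kHz test tone (assumption)
frames = frame_signal(x, frame_len=400, hop=160)    # 25 ms frames, 10 ms hop at 16 kHz
features = np.stack([centroid_and_spread(f, fs, 512) for f in frames])
print(features.shape)                               # (num_frames, 2)
```

Each row of `features` is one frame, so the matrix can be passed directly to a classifier or plotted column by column as a time series.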

The History

Spectral descriptors were formalized in the MPEG-7 Audio standard (2001), which defines a set of "low-level descriptors" for audio search and retrieval. Before MPEG-7, similar features were used informally in speech processing (1970s), music information retrieval (1990s), and hearing research. Pilanci's slides draw from this rich history to show how simple statistics of the DFT power spectrum capture perceptually meaningful properties.

Spectrum → Single Number

Two different spectra with the same total energy but different "shapes." Their spectral descriptors differ dramatically.

Why do we normalize the magnitude spectrum to sum to 1 before computing spectral descriptors?

Chapter 1: Spectral Centroid

The spectral centroid is the "center of mass" of the spectrum. It tells you where the average frequency is. High centroid = bright sound. Low centroid = dark sound.

Definition

$$\mu_1 = \frac{\sum_{k=0}^{N/2} f_k s_k}{\sum_{k=0}^{N/2} s_k}$$

Since we already normalized $s_k$ to sum to 1, this simplifies to:

$$\mu_1 = \sum_{k=0}^{N/2} f_k s_k$$

where $f_k = k \cdot f_s / N$ is the frequency (in Hz) of bin k. This is literally the expected value of frequency under the spectral distribution.

Physical meaning: If you balanced the magnitude spectrum on a see-saw, the centroid is the balance point. A flute playing A4 (440 Hz) has centroid near 440 Hz because most energy is at the fundamental. A cymbal crash might have centroid at 5000+ Hz because energy is spread across high frequencies.

Worked Example

Suppose N = 8, $f_s$ = 8000 Hz. The frequency bins are at $f_k = k \cdot 1000$ Hz for k = 0, 1, 2, 3, 4.

Power spectrum: $|X|^2 = [0, 4, 1, 0, 0]$ (energy concentrated at 1000 Hz with a bit at 2000 Hz).

Total power = 5. Normalized: s = [0, 0.8, 0.2, 0, 0].

$$\mu_1 = 0 \cdot 0 + 1000 \cdot 0.8 + 2000 \cdot 0.2 + 3000 \cdot 0 + 4000 \cdot 0 = 1200 \text{ Hz}$$

The centroid is 1200 Hz — pulled slightly above the fundamental by the harmonic at 2000 Hz. If we add more high-frequency harmonics, the centroid rises (brighter sound).

Centroid in Code

python
import numpy as np

def spectral_centroid(magnitude_spectrum, fs, N):
    """Compute spectral centroid in Hz."""
    K = len(magnitude_spectrum)
    freqs = np.arange(K) * fs / N   # frequency of each bin
    power = magnitude_spectrum ** 2
    total = power.sum()
    if total == 0:
        return 0.0
    return (freqs * power).sum() / total

Notice: we use the power spectrum ($|X[k]|^2$), not the magnitude ($|X[k]|$). Some implementations use magnitude directly; the difference is whether loud frequencies are weighted linearly or quadratically. Power weighting (squared) emphasizes dominant components more heavily.
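To make the weighting difference concrete, here is a small sketch that applies both conventions to the 5-bin spectrum from the worked example. The variable names are illustrative only.

```python
import numpy as np

freqs = np.array([0.0, 1000.0, 2000.0, 3000.0, 4000.0])   # N = 8, fs = 8000 Hz
power = np.array([0.0, 4.0, 1.0, 0.0, 0.0])               # |X[k]|^2 from the worked example
magnitude = np.sqrt(power)                                 # |X[k]| = [0, 2, 1, 0, 0]

# Same formula, different weights
centroid_power = (freqs * power).sum() / power.sum()
centroid_magnitude = (freqs * magnitude).sum() / magnitude.sum()

print(centroid_power)       # 1200.0
print(centroid_magnitude)   # about 1333.3
```

With magnitude weighting the weaker 2000 Hz component counts for one third of the total instead of one fifth, so the centroid moves up by roughly 133 Hz.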

Centroid Over Time

In practice, we compute the centroid for each short-time frame (10–30 ms of audio). The sequence of centroids over time is a time series that tracks how brightness changes. When a drummer hits a hi-hat, the centroid spikes (sudden high-frequency energy). During a bass note, it drops.

Perceptual Correlation

Psychoacoustic studies show that spectral centroid correlates with perceived "brightness" (r > 0.8 in most experiments). This makes it one of the most reliable audio features for perceptual tasks. In music information retrieval, centroid alone can distinguish:

• Classical piano (centroid ~1000–2000 Hz) from rock guitar (centroid ~3000–5000 Hz)

• Male speech (centroid ~800 Hz) from female speech (centroid ~1200 Hz)

• Muffled recordings (low centroid) from bright/tinny recordings (high centroid)

Spectral Centroid: Interactive

Draw a spectrum shape by clicking. The centroid (orange vertical line) updates in real time. Try making it high-frequency heavy vs. low-frequency heavy.

Click on the canvas to draw spectrum bars. Centroid shown as orange line.
A pure sine wave at 1000 Hz has most energy at a single frequency. What is its spectral centroid approximately?

Chapter 2: Spectral Spread & Bandwidth

The centroid tells you WHERE the spectral energy is centered. The spectral spread tells you how WIDE the distribution is around that center. A narrow spread means a pure, tonal sound. A wide spread means a noisy, broadband sound.

Definition

$$\mu_2 = \sqrt{\sum_k (f_k - \mu_1)^2 s_k}$$

This is the standard deviation of the spectral distribution. Units are Hz. It measures how far, on average, the spectral energy sits from the centroid.

Analogy: The centroid is the mean of the "frequency random variable." The spread is its standard deviation. Together, they give you the first two moments of the spectral shape — location and width.

Worked Example (continued)

From Chapter 1: s = [0, 0.8, 0.2, 0, 0], $\mu_1$ = 1200 Hz, f = [0, 1000, 2000, 3000, 4000] Hz.

$$\mu_2^2 = (0-1200)^2 \cdot 0 + (1000-1200)^2 \cdot 0.8 + (2000-1200)^2 \cdot 0.2 + 0 + 0 = 40000 \cdot 0.8 + 640000 \cdot 0.2 = 32000 + 128000 = 160000$$

$$\mu_2 = \sqrt{160000} = 400 \text{ Hz}$$

The spread is 400 Hz. Compare to a pure tone at 1000 Hz which would have s = [0, 1, 0, 0, 0] and spread = 0. Our signal has some energy at 2000 Hz, giving it a nonzero spread.

Interpretation Table

| Sound | Centroid | Spread |
| --- | --- | --- |
| Pure sine (flute) | = fundamental freq | ≈ 0 (very narrow) |
| Vowel "ah" | ~800 Hz | ~300–500 Hz |
| Snare drum | ~3000 Hz | ~2000 Hz (broad) |
| White noise | = fs/4 | = fs/(4√3) ≈ 0.144·fs (very broad) |
| Hi-hat | ~6000–10000 Hz | ~3000 Hz |

Bandwidth vs. Spread

Some literature uses bandwidth instead of spread. They're related but not identical. The −3 dB bandwidth is the width of the band over which the power stays within a factor of two of its peak. The spectral spread is the standard deviation of the full distribution. For a Gaussian-shaped power spectrum, bandwidth ≈ 2.35 · $\mu_2$ (the full width at half maximum of a Gaussian is $2\sqrt{2 \ln 2} \approx 2.355$ standard deviations). For non-Gaussian spectra, they can differ significantly.
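The Gaussian relationship can be checked numerically. The sketch below builds a synthetic Gaussian-shaped power spectrum on a 1 Hz grid (the grid, center, and width are arbitrary assumptions chosen for illustration), then measures both quantities.

```python
import numpy as np

freqs = np.linspace(0.0, 8000.0, 8001)          # 1 Hz grid (assumption)
sigma_true, center = 400.0, 3000.0              # arbitrary illustrative values
power = np.exp(-0.5 * ((freqs - center) / sigma_true) ** 2)   # Gaussian-shaped power spectrum

# Spectral spread: standard deviation of the normalized distribution
s = power / power.sum()
centroid = (freqs * s).sum()
spread = np.sqrt(((freqs - centroid) ** 2 * s).sum())

# -3 dB bandwidth: width of the region where power is at least half its peak
above_half = freqs[power >= 0.5 * power.max()]
bandwidth_3db = above_half.max() - above_half.min()

ratio = bandwidth_3db / spread
print(spread, bandwidth_3db, ratio)
```

The ratio comes out close to 2.35, limited only by the 1 Hz grid resolution. Swapping in a two-peak or flat spectrum shows how quickly the two measures diverge for non-Gaussian shapes.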

Spread Computation in Code

python
import numpy as np

def spectral_spread(magnitude_spectrum, fs, N):
    """Compute spectral spread (power-weighted standard deviation) in Hz."""
    K = len(magnitude_spectrum)
    freqs = np.arange(K) * fs / N   # frequency of each bin
    power = magnitude_spectrum ** 2
    total = power.sum()
    if total == 0:
        return 0.0                  # silent frame: nothing to measure
    centroid = (freqs * power).sum() / total
    spread_sq = ((freqs - centroid) ** 2 * power).sum() / total
    return np.sqrt(spread_sq)

Why Spread Matters for Classification

Centroid alone can't distinguish all sound types. A narrow-band signal at 2000 Hz (a whistle) and a broad-band signal centered at 2000 Hz (a snare drum) have the same centroid but completely different timbres. Spread captures this difference: the whistle has near-zero spread, the snare has spread > 1000 Hz.

In a feature space with centroid on one axis and spread on the other, different instrument families form distinct clusters. This is why both features together are far more discriminative than either alone.

Centroid & Spread Visualizer

Adjust the center frequency and width of a Gaussian spectrum. Watch how centroid (orange) and spread (teal band) change.

White noise has a flat spectrum (equal energy at all frequencies). What is its spectral spread?

Chapter 3: Higher-Order Moments

The centroid and spread are the first two moments. But spectra can have complex shapes that two numbers can't capture. The third and fourth moments — skewness and kurtosis — describe asymmetry and peakedness.

Spectral Skewness ($\mu_3$)

$$\mu_3 = \sum_k \left(\frac{f_k - \mu_1}{\mu_2}\right)^3 s_k$$

This measures the asymmetry of the spectral shape around the centroid:

• $\mu_3 > 0$: right-skewed. Most of the energy is packed at the low end, with a long tail towards high frequencies.

• $\mu_3 < 0$: left-skewed. Most of the energy is packed at the high end, with a long tail towards low frequencies.

• $\mu_3 = 0$: Symmetric spectrum

A guitar note with strong low harmonics and weak high harmonics has positive skewness: the weak upper harmonics form a long high-frequency tail. More generally, a signal that drops off sharply below the centroid but has a long high-frequency tail has positive skewness, and the mirror-image shape has negative skewness.
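The sign convention is easy to get backwards, so here is a quick numerical check. It is a sketch on a made-up harmonic spectrum (200 Hz fundamental, 10 harmonics, power falling as 1/n²); `spectral_skewness` is a hypothetical helper name.

```python
import numpy as np

def spectral_skewness(freqs, power):
    """Third standardized moment of the normalized power spectrum."""
    s = power / power.sum()
    centroid = (freqs * s).sum()
    spread = np.sqrt(((freqs - centroid) ** 2 * s).sum())
    return (((freqs - centroid) / spread) ** 3 * s).sum()

harmonics = 200.0 * np.arange(1, 11)        # hypothetical 200 Hz fundamental, 10 harmonics
decaying = 1.0 / np.arange(1, 11) ** 2      # strong low harmonics, weak high ones
rising = decaying[::-1]                     # mirror image: energy piled up at the top

skew_decaying = spectral_skewness(harmonics, decaying)
skew_rising = spectral_skewness(harmonics, rising)
print(skew_decaying, skew_rising)
```

The decaying spectrum gives a positive value (tail towards high frequencies), and its mirror image gives the same magnitude with the opposite sign.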

Spectral Kurtosis ($\mu_4$)

$$\mu_4 = \sum_k \left(\frac{f_k - \mu_1}{\mu_2}\right)^4 s_k$$

This measures the peakedness (or "tailedness") of the spectral distribution:

• $\mu_4 > 3$: Leptokurtic (sharper peak than Gaussian, heavier tails). A near-pure tone, with almost all energy at one point and a little energy far from it, has very high kurtosis.

• $\mu_4 = 3$: Mesokurtic (Gaussian-shaped spectrum).

• $\mu_4 < 3$: Platykurtic (flatter than Gaussian, lighter tails). Uniform/white noise has kurtosis close to 1.8.

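Kurtosis detects non-Gaussianity: high kurtosis of the spectral shape indicates narrow-band components standing out of a broad background. A related quantity in machinery diagnostics, also called spectral kurtosis, is computed differently (as the kurtosis over time of the magnitude in each frequency band). It is used in bearing fault detection, where a damaged bearing produces periodic impulses that show up as high kurtosis in specific frequency bands.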

Worked Example: Computing All Four Moments

Let's compute moments for a simple 5-bin spectrum: s = [0.1, 0.2, 0.4, 0.2, 0.1] at frequencies f = [1000, 2000, 3000, 4000, 5000] Hz.

Centroid: $\mu_1 = 1000(0.1) + 2000(0.2) + 3000(0.4) + 4000(0.2) + 5000(0.1) = 100 + 400 + 1200 + 800 + 500 = 3000$ Hz

Spread: $\mu_2^2 = (1000-3000)^2(0.1) + (2000-3000)^2(0.2) + 0 + (4000-3000)^2(0.2) + (5000-3000)^2(0.1) = 400000 + 200000 + 0 + 200000 + 400000 = 1{,}200{,}000$, so $\mu_2 \approx 1095$ Hz.

Skewness: All terms are symmetric around the centroid (the spectrum is symmetric), so $\mu_3 = 0$.

Kurtosis: $\mu_4 = \sum_k ((f_k - 3000)/1095.4)^4 s_k = (1.826)^4(0.1) + (0.913)^4(0.2) + 0 + (0.913)^4(0.2) + (1.826)^4(0.1) \approx 11.11(0.1) + 0.694(0.2) + 0 + 0.694(0.2) + 11.11(0.1) = 1.111 + 0.139 + 0.139 + 1.111 = 2.50$

Kurtosis = 2.50 < 3 → platykurtic (flatter than Gaussian). This makes sense: our symmetric 5-bin spectrum is more "uniform-like" than a peaked Gaussian.
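The hand calculation can be cross-checked in a few lines. This sketch recomputes all four moments for the same 5-bin spectrum; at full precision the kurtosis is exactly 2.5, and rounding the intermediate z-values by hand can shift the last digit slightly.

```python
import numpy as np

s = np.array([0.1, 0.2, 0.4, 0.2, 0.1])                     # normalized spectrum
f = np.array([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])      # bin frequencies in Hz

centroid = (f * s).sum()
spread = np.sqrt(((f - centroid) ** 2 * s).sum())
z = (f - centroid) / spread          # standardized frequency
skewness = (z ** 3 * s).sum()
kurtosis = (z ** 4 * s).sum()

print(centroid, spread, skewness, kurtosis)
```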

Summary of Moments

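| Moment | Name | Measures | Unit |
| --- | --- | --- | --- |
| $\mu_1$ | Centroid | Average frequency (brightness) | Hz |
| $\mu_2$ | Spread | Width (bandwidth) | Hz |
| $\mu_3$ | Skewness | Asymmetry | dimensionless |
| $\mu_4$ | Kurtosis | Peakedness/tailedness | dimensionless |
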
Skewness & Kurtosis Explorer

Choose a spectral shape and see all four moments computed. Compare symmetric vs. skewed, peaked vs. flat.

A pure sine wave has all its spectral energy at one frequency. What describes its spectral kurtosis?

Chapter 4: Entropy, Flatness & Crest

Beyond moments, we can compute information-theoretic and ratio-based descriptors that capture different aspects of spectral "shape."

Spectral Entropy

$$H = -\sum_k s_k \log_2 s_k$$

This is Shannon entropy of the spectral distribution. It measures how "spread out" or "uncertain" the frequency content is:

Low entropy: Energy concentrated at few bins (tonal, predictable). A pure tone has H ≈ 0 (all mass at one bin).

High entropy: Energy spread evenly across all bins (noise-like, unpredictable). A perfectly flat spectrum over K bins has $H = \log_2 K$ (maximum possible); with K = N/2 + 1 bins this is about $\log_2(N/2)$.

Application — Voice Activity Detection: Speech has moderate entropy (energy in formant regions). Background noise has high entropy (flat spectrum). Thresholding spectral entropy is a simple, effective VAD: low entropy → speech present.

Spectral Flatness

$$SF = \frac{\text{geometric mean of } s_k}{\text{arithmetic mean of } s_k} = \frac{\exp\left(\frac{1}{K}\sum_k \ln s_k\right)}{\frac{1}{K}\sum_k s_k}$$

Spectral flatness is the ratio of geometric to arithmetic mean of the power spectrum. By the AM-GM inequality, SF ∈ [0, 1]:

• SF = 1: perfectly flat spectrum (white noise)

• SF → 0: highly peaked spectrum (pure tone)

Also called Wiener entropy or tonality coefficient. It distinguishes tonal from noise-like signals without computing pitch.

Spectral Crest

$$SC = \frac{\max_k s_k}{\frac{1}{K}\sum_k s_k}$$

The ratio of peak to average. High crest = one dominant frequency (tonal). Low crest = no single peak dominates (noise-like). This is the spectral analog of the crest factor in the time domain (peak/RMS).

Flatness vs. Crest: They're nearly inverse measures. Flatness is high for noise, low for tones. Crest is high for tones, low for noise. Both capture "tonalness" but from different angles. Flatness uses all bins (geometric mean), while crest uses only the maximum.

Spectral Entropy in Detail

Shannon entropy H measures the "surprise" or "unpredictability" of a distribution. Applied to the spectral distribution, it answers: "If I pick a random frequency weighted by the spectrum, how surprised am I by the result?"

For a pure tone (all energy at one bin): $H = -1 \cdot \log_2(1) = 0$ bits. No surprise: you always know which frequency it will be.

For white noise (K equal bins): $H = -K \cdot \frac{1}{K} \cdot \log_2\frac{1}{K} = \log_2 K$ bits. Maximum surprise: every frequency is equally likely.

For 256 frequency bins: maximum entropy = $\log_2 256$ = 8 bits. A speech frame typically has entropy around 5–6 bits. Background noise: 7+ bits (nearly maximum).

Voice Activity Detection via Entropy

A simple but effective VAD algorithm:

1. Compute entropy H for each frame, from the power spectrum distribution.
2. Threshold: voice if H < T, with T ≈ 0.7 × $\log_2 K$ (tunable).
3. Smooth decisions with a median filter over 5-10 frames.

This works because speech concentrates energy in formant regions (3-5 peaks in the spectrum, moderate entropy), while noise distributes energy uniformly (high entropy). The threshold T is calibrated to the specific noise environment.
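The three steps can be sketched as below. This is a minimal illustration under stated assumptions, not a production detector: `entropy_vad` is a hypothetical name, the test spectra are synthetic, the median filter uses edge padding, and an energy gate (`energy_floor`) is added because silent frames would otherwise also register as low entropy.

```python
import numpy as np

def entropy_vad(power_frames, ratio=0.7, smooth=5, energy_floor=1e-10):
    """power_frames: (num_frames, K) power spectra. Returns one boolean voice flag per frame."""
    totals = power_frames.sum(axis=1, keepdims=True)
    s = power_frames / np.maximum(totals, 1e-12)        # normalize each frame to sum to 1
    s_safe = np.maximum(s, 1e-12)                       # avoid log2(0)
    H = -(s_safe * np.log2(s_safe)).sum(axis=1)         # step 1: entropy per frame
    threshold = ratio * np.log2(power_frames.shape[1])  # step 2: T = ratio * log2(K)
    raw = (H < threshold) & (totals[:, 0] > energy_floor)
    pad = smooth // 2                                   # step 3: median filter (odd length)
    padded = np.pad(raw.astype(int), pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, smooth)
    return np.median(windows, axis=1) > 0.5

K = 256
noise_like = np.ones(K)                      # flat spectrum, H = 8 bits
speech_like = np.full(K, 0.001)
speech_like[[20, 60, 100]] = 1.0             # three strong peaks standing in for formants
frames = np.array([noise_like] * 20 + [speech_like] * 20)
frames[5] = speech_like                      # one isolated false alarm in the noise segment
flags = entropy_vad(frames)
print(flags.astype(int))
```

The isolated low-entropy frame at index 5 is removed by the median filter, while the sustained run of peaked frames is kept.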

Worked Example: Entropy of Two Spectra

Spectrum A (tonal): s = [0.9, 0.05, 0.03, 0.02] (4 bins)

$$H = -(0.9 \log_2 0.9 + 0.05 \log_2 0.05 + 0.03 \log_2 0.03 + 0.02 \log_2 0.02)$$

$$H = -(-0.137 - 0.216 - 0.152 - 0.113) = 0.618 \text{ bits}$$

Maximum possible = $\log_2 4$ = 2 bits. So normalized entropy = 0.618/2 = 0.31 (very tonal).

Spectrum B (noisy): s = [0.27, 0.23, 0.25, 0.25]

$$H = -(0.27 \log_2 0.27 + 0.23 \log_2 0.23 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25)$$

H ≈ 1.998 bits. Normalized ≈ 0.999 (nearly maximum = noise-like).
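A short sketch that recomputes both entropies, useful for checking the arithmetic. `spectral_entropy` is a hypothetical helper; it drops zero-probability bins, following the usual convention that 0 · log 0 = 0.

```python
import numpy as np

def spectral_entropy(s):
    """Shannon entropy in bits of a normalized spectrum."""
    s = np.asarray(s, dtype=float)
    nz = s[s > 0]                        # 0 * log2(0) is taken as 0
    return -(nz * np.log2(nz)).sum()

spectrum_a = [0.9, 0.05, 0.03, 0.02]     # tonal
spectrum_b = [0.27, 0.23, 0.25, 0.25]    # noisy

h_a = spectral_entropy(spectrum_a)
h_b = spectral_entropy(spectrum_b)
h_max = np.log2(len(spectrum_a))         # 2 bits for 4 bins

print(h_a, h_a / h_max)
print(h_b, h_b / h_max)
```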

The AM-GM Inequality and Flatness

Why is spectral flatness always ≤ 1? This follows from the arithmetic-geometric mean inequality: for non-negative numbers, the geometric mean never exceeds the arithmetic mean. Equality holds if and only if all numbers are equal — a flat spectrum.

$$GM = \left(\prod_k s_k\right)^{1/K} \le \frac{1}{K}\sum_k s_k = AM$$

So SF = GM/AM ≤ 1, with equality iff the spectrum is perfectly flat. In practice, implementations add a small floor ($10^{-10}$) to all bins to avoid GM = 0 from a single zero bin.
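The two limiting cases can be checked with a short sketch. `flatness_and_crest` is a hypothetical helper name; the floor value matches the one mentioned above, and the two test spectra are idealized (an exactly flat spectrum and a single nonzero bin).

```python
import numpy as np

def flatness_and_crest(power, floor=1e-10):
    """Return (flatness, crest) of a power spectrum; the floor keeps log() finite."""
    p = np.asarray(power, dtype=float) + floor
    flatness = np.exp(np.log(p).mean()) / p.mean()   # geometric mean / arithmetic mean
    crest = p.max() / p.mean()                       # peak / average
    return flatness, crest

K = 256
flat = np.ones(K)                 # idealized white-noise spectrum
tone = np.zeros(K)
tone[40] = 1.0                    # idealized pure tone: one nonzero bin

flat_sf, flat_sc = flatness_and_crest(flat)
tone_sf, tone_sc = flatness_and_crest(tone)
print(flat_sf, flat_sc)
print(tone_sf, tone_sc)
```

The flat spectrum gives flatness and crest both equal to 1; the single-bin spectrum gives flatness near 0 and crest near K.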

Comparison

| Descriptor | Pure Tone | White Noise | Range |
| --- | --- | --- | --- |
| Entropy | ≈ 0 bits | $\log_2 K$ bits | [0, $\log_2 K$] |
| Flatness | ≈ 0 | 1.0 | [0, 1] |
| Crest | K (num bins) | ≈ 1 | [1, K] |

Entropy, Flatness, Crest

Mix between a pure tone and white noise. Watch how entropy, flatness, and crest respond.

A spectral flatness of 0.95 indicates:

Chapter 5: Spectral Flux & Slope

All the descriptors so far describe a single frame. But many audio features are about change over time. Is the spectrum stable (sustained note) or rapidly evolving (speech, drum hits)?

Spectral Flux

$$F(t) = \sum_k \left(|X_t[k]| - |X_{t-1}[k]|\right)^2$$

Spectral flux measures how much the spectrum changes between consecutive frames. It's the squared Euclidean distance between successive magnitude spectra.

Low flux: Sustained note, steady noise — spectrum doesn't change.

High flux: Onset of a note, percussive hit, speech transitions — spectrum changes rapidly.

Application — Onset Detection: Peaks in spectral flux indicate note onsets. When a new note starts, the spectrum changes abruptly (new frequencies appear, old ones vanish). Beat tracking algorithms use flux peaks to find drum hits and note attacks.

Half-Wave Rectified Flux

In practice, we often care only about increases in spectral energy (onsets), not decreases (offsets). The half-wave rectified flux is:

$$F^+(t) = \sum_k \max\left(0,\ |X_t[k]| - |X_{t-1}[k]|\right)^2$$

This ignores energy that disappears and only responds to energy that appears. Much more robust for onset detection.
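Both variants fit in one small function. The sketch below uses a hand-made 4-frame, 3-bin magnitude sequence (purely illustrative) containing one onset and one offset; `spectral_flux` is a hypothetical name.

```python
import numpy as np

def spectral_flux(mags, rectify=False):
    """mags: (num_frames, K) magnitude spectra. Returns flux for frames 1..num_frames-1."""
    diff = np.diff(mags, axis=0)            # |X_t[k]| - |X_{t-1}[k]|
    if rectify:
        diff = np.maximum(diff, 0.0)        # keep only increases in magnitude
    return (diff ** 2).sum(axis=1)

mags = np.array([[1.0, 0.0, 0.0],    # steady
                 [1.0, 0.0, 0.0],    # steady
                 [1.0, 2.0, 0.0],    # onset: a new component appears
                 [0.0, 2.0, 0.0]])   # offset: the first component disappears

flux = spectral_flux(mags)                      # [0., 4., 1.]
flux_plus = spectral_flux(mags, rectify=True)   # [0., 4., 0.]
print(flux, flux_plus)
```

Plain flux responds to both the onset and the offset, while the rectified version responds only to the onset.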

Spectral Slope

$$\text{slope} = \frac{\sum_k (f_k - \bar{f})(s_k - \bar{s})}{\sum_k (f_k - \bar{f})^2}$$

This is the linear regression slope of the spectrum (magnitude vs. frequency). It describes the overall tilt of the spectrum:

• Negative slope: energy decreases with frequency (typical of most natural sounds — "pink" or "red" spectra)

• Near-zero slope: flat spectrum (white noise)

• Positive slope: energy increases with frequency (unusual, high-frequency dominated)

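Most speech and music have negative spectral slope (−3 to −6 dB/octave). The steepness helps separate voiced vowels, whose energy falls off steadily with frequency (steeper negative slope), from fricatives such as "s", whose energy sits mostly at high frequencies (flatter or even positive slope).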

Spectral Rolloff

A related descriptor: the spectral rolloff is the frequency below which a certain percentage (typically 85% or 95%) of the total spectral energy lies:

$$\sum_{k=0}^{k_{\text{rolloff}}} s_k = 0.85$$

High rolloff → energy extends to high frequencies (bright). Low rolloff → energy concentrated at low frequencies (dark). Similar to centroid but less sensitive to outlier high-frequency peaks.

Flux Normalization

Raw flux depends on signal loudness — a louder signal has larger magnitude changes. To compare flux across segments of different volume, normalize:

$$F_{\text{norm}}(t) = \sum_k \left(\frac{|X_t[k]|}{\|X_t\|} - \frac{|X_{t-1}[k]|}{\|X_{t-1}\|}\right)^2$$

where $\|X_t\| = \sqrt{\sum_k |X_t[k]|^2}$. This gives volume-independent flux that responds to spectral shape changes, not loudness changes.
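A minimal sketch of the normalized version, with a tiny made-up example: the second frame is the first one played 10 times louder, and the third has a different shape. `normalized_flux` and the `eps` guard against all-zero frames are assumptions of this sketch.

```python
import numpy as np

def normalized_flux(mags, eps=1e-12):
    """Flux between consecutive frames after scaling each frame to unit L2 norm."""
    norms = np.linalg.norm(mags, axis=1, keepdims=True)
    unit = mags / np.maximum(norms, eps)     # eps guards against silent frames
    return (np.diff(unit, axis=0) ** 2).sum(axis=1)

shape = np.array([3.0, 4.0, 0.0])
mags = np.stack([shape,                  # reference frame
                 10.0 * shape,           # same shape, 10x louder
                 [0.0, 4.0, 3.0]])       # same loudness as frame 0, different shape

nf = normalized_flux(mags)               # [0.0, 0.72]
print(nf)
```

The loudness-only change produces zero flux, and the shape change produces a clear nonzero value.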

Spectral Slope in Practice

The slope computation is ordinary least-squares regression of log-magnitude vs. log-frequency (for dB/octave measurement) or magnitude vs. frequency (for a simpler linear measure). Pilanci uses the linear form:

python
import numpy as np

def spectral_slope(magnitude_spectrum, fs, N):
    """Least-squares slope of magnitude vs. frequency (units: magnitude per Hz)."""
    K = len(magnitude_spectrum)
    freqs = np.arange(K) * fs / N   # frequency of each bin
    f_mean = freqs.mean()
    s_mean = magnitude_spectrum.mean()
    num = ((freqs - f_mean) * (magnitude_spectrum - s_mean)).sum()
    den = ((freqs - f_mean) ** 2).sum()
    return num / den if den > 0 else 0.0

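On a log-log fit, a typical speech frame has slope ≈ −30 dB/decade (energy drops 30 dB per factor-10 frequency increase, or about −9 dB/octave). The linear form above has different units (magnitude per Hz), so its values are not directly comparable to dB figures. Music varies more: bass-heavy genres have steeper negative slopes; bright electronic music has flatter slopes.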
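For the log-log variant, a possible sketch is shown below. It is not taken from the slides: `slope_db_per_octave` is a hypothetical name, the DC bin is dropped because log-frequency is undefined there, and the test spectrum is synthetic (magnitude proportional to 1/f, which should give about −6.02 dB/octave).

```python
import numpy as np

def slope_db_per_octave(magnitude_spectrum, fs, N, eps=1e-12):
    """Least-squares slope of level in dB vs. log2(frequency), skipping the DC bin."""
    mag = np.asarray(magnitude_spectrum, dtype=float)
    freqs = np.arange(len(mag)) * fs / N
    keep = freqs > 0                              # log2(0) is undefined
    octaves = np.log2(freqs[keep])
    level_db = 20.0 * np.log10(mag[keep] + eps)   # eps avoids log10(0)
    slope, _intercept = np.polyfit(octaves, level_db, 1)
    return slope

fs, N = 16000, 512
freqs = np.arange(N // 2 + 1) * fs / N
mag = np.zeros_like(freqs)
mag[1:] = 1.0 / freqs[1:]                 # synthetic spectrum: magnitude halves per octave

slope = slope_db_per_octave(mag, fs, N)
print(slope)                              # about -6.02 dB/octave
```

Multiplying the result by log2(10) ≈ 3.32 converts it to dB/decade.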

Pilanci's Hi-Hat Example

In Pilanci's slides, he shows the spectral centroid time series for a drum loop. When the hi-hat hits:

• Centroid jumps from ~1000 Hz (kick/snare region) to ~8000 Hz (hi-hat region)

• Duration of the jump: ~50 ms (the hi-hat's sustain)

• Spread simultaneously increases (hi-hat is broadband)

• Flatness spikes (hi-hat is noise-like, not tonal)

This pattern — simultaneous centroid spike + spread increase + flatness increase — is so characteristic that a simple threshold detector on these three features can locate every hi-hat hit with >95% accuracy, even in a complex mix.

Contrast with a snare hit:

• Centroid jumps to ~3000–4000 Hz (lower than hi-hat)

• Spread is large but less than hi-hat

• Has both tonal (drum head and shell resonance) and noise (snare wire rattle) components

• Flatness is moderate (between pure tone and pure noise)

Spectral Flux: Onset Detection

A simulated sequence of spectral frames. Flux spikes at "onset" frames where the spectrum changes abruptly.

Orange bars = flux values. Spikes indicate spectral transitions (onsets).
When would spectral flux be highest?

Chapter 6: Feature Extraction Pipeline

Now let's put it all together. In a real audio analysis system, you compute ALL spectral descriptors for each frame, producing a feature vector that characterizes the sound.

The Pipeline

1. Audio Signal: continuous waveform, sampled at $f_s$
2. Frame & Window: split into overlapping frames, apply Hann window
3. DFT (via FFT): compute magnitude spectrum $|X[k]|$
4. Normalize: $s_k = |X[k]|^2 / \sum_m |X[m]|^2$
5. Compute Descriptors: $\mu_1$, $\mu_2$, skew, kurt, entropy, flatness, crest, flux, slope
6. Feature Vector: 9+ numbers per frame → downstream ML

Real-World Pipeline Parameters

Typical values in audio analysis systems:

| Parameter | Speech | Music |
| --- | --- | --- |
| Sample rate $f_s$ | 16,000 Hz | 44,100 Hz |
| Frame length | 25 ms (400 samples) | 23 ms (1024 samples) |
| Frame hop | 10 ms (160 samples) | 11.6 ms (512 samples) |
| FFT size N | 512 (zero-padded) | 2048 (zero-padded) |
| Window function | Hamming | Hann |
| Frequency bins used | N/2 + 1 = 257 | N/2 + 1 = 1025 |
| Descriptors per frame | 7–13 | 7–13 |

The choice of window function (Hann, Hamming, Blackman) affects spectral leakage: energy from one true frequency "leaks" into adjacent bins. Longer windows give better frequency resolution but worse time resolution. This is the time-frequency uncertainty principle — you cannot have arbitrarily precise measurements in both time and frequency simultaneously.
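The arithmetic behind the speech column can be sketched as follows. `frame_parameters` is a hypothetical helper, and rounding the FFT size up to the next power of two is an assumption of this sketch (the music column's 2048-point FFT uses more zero-padding than that rule would give).

```python
def frame_parameters(fs, frame_ms, hop_ms):
    """Convert frame/hop durations to sample counts and derive FFT size and bin spacing."""
    frame_len = int(round(fs * frame_ms / 1000))
    hop_len = int(round(fs * hop_ms / 1000))
    nfft = 1 << (frame_len - 1).bit_length()      # next power of two >= frame_len
    return {
        "frame_len": frame_len,
        "hop_len": hop_len,
        "nfft": nfft,
        "bins": nfft // 2 + 1,                    # non-negative frequency bins
        "bin_spacing_hz": fs / nfft,              # spacing of the zero-padded FFT grid
    }

speech = frame_parameters(fs=16000, frame_ms=25, hop_ms=10)
print(speech)
```

Doubling `frame_ms` halves the width of each spectral peak but smears events over twice as much time, which is the trade-off described above. Zero-padding only makes the frequency grid finer; it does not improve the underlying resolution.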

Showcase: Draw Your Spectrum

Draw any spectrum shape on the canvas below. All descriptors are computed in real time. Experiment: make it narrow (low spread), shift it left/right (change centroid), add a second peak (and watch how spread and kurtosis respond).

Complete Spectral Feature Extractor

Click/drag to draw a magnitude spectrum. All descriptors update live. Try different shapes!

Which combination of descriptors best distinguishes a hi-hat (bright, noisy) from a bass drum (dark, tonal)?

Chapter 7: Mastery

Spectral descriptors reduce high-dimensional spectra to interpretable features. They're the bridge between raw Fourier coefficients and machine learning classifiers.

Complete Reference

| Descriptor | Formula Essence | Measures | Bright Sound | Dark Sound |
| --- | --- | --- | --- | --- |
| Centroid | $\sum_k f_k s_k$ | Average frequency | High | Low |
| Spread | std(f under s) | Bandwidth | Moderate | Narrow |
| Skewness | 3rd standardized moment | Asymmetry | | |
| Kurtosis | 4th standardized moment | Peakedness | Low (noisy) | High (tonal) |
| Entropy | $-\sum_k s_k \log s_k$ | Disorder | High | Low |
| Flatness | geom/arith mean | Tonalness | High (noise) | Low (tone) |
| Crest | max/mean | Dominance | Low | High |
| Flux | $\sum_k (\Delta |X[k]|)^2$ | Change over time | | |
| Slope | regression on s vs. f | Tilt | Less negative | More negative |

Connections

Lecture 6 (DFT): Spectral descriptors require the DFT as input. No DFT, no spectral features.

MFCCs (coming): Mel-frequency cepstral coefficients are another spectrum summarization, using a perceptual frequency scale and cepstral analysis. They capture formant structure better than raw moments.

ML connection: Modern audio classifiers (speech recognition, music tagging) often use learned features (spectrograms + CNNs), but spectral descriptors remain useful as hand-crafted baselines and for interpretable systems.

Feature Selection: Which Descriptors for Which Task?

| Task | Key Descriptors | Why |
| --- | --- | --- |
| Music genre | Centroid, spread, flux, flatness | Brightness, bandwidth, rhythm, tonalness |
| Speaker ID | MFCCs (13 coefficients) | Formant structure encodes vocal tract shape |
| Onset detection | Spectral flux (half-rectified) | Change = new event |
| Voice activity | Entropy, flatness, centroid | Speech vs. noise distinction |
| Instrument recognition | All moments + flux + slope | Each instrument has unique spectral fingerprint |
| Mood/valence | Centroid, slope, entropy | Bright/smooth = happy; dark/rough = sad |

Beyond Hand-Crafted Features

Modern deep learning often bypasses hand-crafted descriptors entirely, feeding raw spectrograms (time-frequency images) into CNNs or transformers. However:

• Spectral descriptors remain useful as baselines — any learned system should beat a simple SVM on these features

• They provide interpretability — a centroid time series is human-readable, a CNN embedding is not

• For low-resource tasks (limited training data), hand-crafted features often outperform learned features

• They're computationally cheap — a microcontroller can compute centroid in real time

• They serve as sanity checks — if a learned model disagrees with simple descriptors, investigate why

The ideal system often combines both: spectral descriptors as an input layer alongside raw spectrogram features, letting the model learn when each is useful.

Delta Features

In practice, temporal derivatives of spectral features are often as informative as the features themselves. The delta (first derivative) and delta-delta (second derivative) features capture rate-of-change:

Δf(t) = (f(t+1) − f(t−1)) / 2
ΔΔf(t) = f(t+1) − 2f(t) + f(t−1)

If you extract 9 base descriptors per frame, adding deltas and delta-deltas gives 27 features per frame. Speech recognition systems (before deep learning) typically used 13 MFCCs + 13 delta + 13 delta-delta = 39 features per frame.

Delta-centroid captures whether the sound is getting brighter (positive) or darker (negative). Delta-flux captures whether onsets are accelerating or decelerating. These temporal dynamics are crucial for distinguishing sounds that have similar static spectra but different temporal envelopes (e.g., a plucked vs. bowed string).
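The two difference formulas above translate directly into array operations. In this sketch the edge frames are handled by repeating the first and last rows, which is one common choice and an assumption here; the test features are a synthetic t² ramp so the expected derivatives are easy to verify by hand.

```python
import numpy as np

def delta(features):
    """Central difference (f(t+1) - f(t-1)) / 2 along time (axis 0), with edge padding."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def delta_delta(features):
    """Second difference f(t+1) - 2 f(t) + f(t-1) along time, with edge padding."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return padded[2:] - 2.0 * padded[1:-1] + padded[:-2]

t = np.arange(10, dtype=float)
base = (t ** 2)[:, None] * np.ones((1, 9))      # 10 frames x 9 synthetic descriptors
stacked = np.hstack([base, delta(base), delta_delta(base)])
print(stacked.shape)                            # (10, 27)
```

For the interior frames of the t² ramp, the delta is 2t and the delta-delta is 2, matching the analytic derivatives. Only the first and last frames are affected by the padding choice.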

Descriptor Correlations

Some descriptors are highly correlated (measuring similar things differently):

• Entropy and flatness: both measure "noisiness" — correlation r > 0.9

• Centroid and rolloff: both measure "brightness" — correlation r > 0.85

• Crest and 1/flatness: nearly inverse of each other

In practice, using all descriptors together introduces multicollinearity in linear models. Options:

• PCA on the feature matrix (decorrelate, reduce dimensions)

• Select a subset (centroid + spread + flatness + flux covers most variance)

• Use non-linear models (random forests, neural nets) that handle correlation naturally
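The PCA option can be sketched with plain NumPy. This is one possible implementation, not a prescribed one: `pca_decorrelate` is a hypothetical name, features are standardized before the SVD, and the demo matrix is random synthetic data with two deliberately correlated columns standing in for centroid and rolloff.

```python
import numpy as np

def pca_decorrelate(features, n_components):
    """Standardize columns, then project onto the top principal components via SVD."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    z = (features - mean) / np.where(std > 0, std, 1.0)   # guard constant columns
    _u, sing, vt = np.linalg.svd(z, full_matrices=False)
    explained = sing ** 2 / (sing ** 2).sum()             # variance fraction per component
    return z @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
brightness = rng.normal(size=200)
features = np.column_stack([
    brightness,                                     # stand-in for centroid
    2.0 * brightness + 0.1 * rng.normal(size=200),  # stand-in for rolloff (highly correlated)
    rng.normal(size=200),                           # stand-in for flux (independent)
])

scores, explained = pca_decorrelate(features, n_components=2)
print(scores.shape, explained)
```

The two correlated columns collapse into a single component carrying about two thirds of the standardized variance, and the resulting score columns are uncorrelated with each other.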

Pilanci recommends starting with the minimal set {centroid, spread, flatness, flux} and adding descriptors only if classification accuracy demands it. These four capture brightness, bandwidth, tonalness, and temporal change — the four perceptual axes most humans can discriminate.

Pilanci's perspective: "These descriptors reduce a 512-dimensional spectrum to <10 numbers. That's lossy compression. The art is choosing descriptors that preserve the information relevant to your task. For music genre, centroid + flatness + flux might suffice. For speaker ID, you need finer spectral detail (MFCCs). Know your application, choose your features."
Which spectral descriptor directly measures whether a signal is more "tonal" or "noise-like"?
Final thought: "The purpose of computing is insight, not numbers." — Richard Hamming. Spectral descriptors embody this: they transform raw numbers (DFT bins) into insight (bright/dark, tonal/noisy, stable/changing).