Before any model can hear, sound must become numbers it can use. The journey from a raw pressure wave to the log-mel spectrogram every audio model eats — sampling, the Fourier transform, the spectrogram, and the perceptual scales that make it work.
Sound is a wave of air pressure — it pushes and pulls on your eardrum, on a microphone’s membrane. A microphone turns that motion into a fluctuating voltage, a single value that wiggles up and down over time. That continuous wiggle is the waveform: one number (amplitude) as a function of time. Everything in audio ML starts here, and the entire field is about turning this raw wiggle into something a model can actually use.
But the raw waveform is a hard thing to learn from directly. It’s enormously long (tens of thousands of values per second), and the information humans care about — pitch, timbre, which vowel, which instrument — isn’t obvious in the up-and-down of pressure. It’s hidden in the frequencies the wave contains. So the whole pipeline of this lesson is a series of transforms that turn “pressure over time” into “which frequencies, over time, on a scale that matches human hearing” — the log-mel spectrogram that Whisper and nearly every audio model consume.
A sound as a microphone sees it: amplitude (pressure) over time. Drag to mix two tones — notice that even a simple sum looks like a complicated wiggle. The frequencies are in there, but hidden.
A computer can’t store a continuous wiggle — it stores numbers. So we sample: measure the waveform’s amplitude at regular instants, many thousands of times per second. The sample rate is how often we measure. CDs use 44,100 samples per second; speech models like Whisper use 16,000. Each measurement is one number; a 10-second clip at 16 kHz is 160,000 numbers.
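As a minimal sketch of what sampling means in code, the snippet below measures a sine wave at regular instants and counts the resulting numbers. The 440 Hz test tone and the helper name `sample_tone` are assumptions chosen purely for illustration; only the 16 kHz rate and the 10-second duration come from the text.

```python
import math

SAMPLE_RATE = 16_000   # samples per second (the speech rate quoted above)
DURATION_S = 10.0      # clip length in seconds
TONE_HZ = 440.0        # hypothetical test tone, chosen only for illustration


def sample_tone(freq_hz, sample_rate, duration_s):
    """Measure a unit-amplitude sine wave at evenly spaced instants."""
    n_samples = int(round(sample_rate * duration_s))
    # Sample k is taken at time k / sample_rate seconds.
    return [math.sin(2.0 * math.pi * freq_hz * k / sample_rate)
            for k in range(n_samples)]


samples = sample_tone(TONE_HZ, SAMPLE_RATE, DURATION_S)
print(len(samples))   # 160000
```

Each list entry is one amplitude measurement, so the clip really is just 160,000 numbers in a row.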
How fast must you sample? The Nyquist rule: to capture a frequency, you must sample at more than twice that frequency. Sample too slowly and high frequencies don’t just disappear: they alias, masquerading as false low frequencies (the audio version of the wagon-wheel effect). Humans hear up to ~20 kHz, so 44.1 kHz (just over twice that) captures everything audible. Speech energy lives mostly below 8 kHz, so 16 kHz suffices, which is why speech models downsample, saving data with little loss of intelligibility.
A 1,000 Hz tone, sampled at 16,000 Hz: you get 16 samples per cycle, plenty to trace the wave faithfully. The same tone sampled at 1,500 Hz: only 1.5 samples per cycle, below the 2,000 Hz minimum that Nyquist demands, so the samples connect into a slow, wrong wave (a 500 Hz alias). The cure is always the same: low-pass filter to remove frequencies above half the sample rate before sampling.
A smooth tone (teal) sampled at dots. Lower the sample rate below twice the tone’s frequency and the dots start tracing a false, slower wave (orange) — aliasing. Keep it above Nyquist and the samples capture the truth.
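The 1,000 Hz example can be checked numerically. The sketch below samples a 1,000 Hz cosine at 1,500 Hz and compares it with a 500 Hz cosine sampled at the same rate; the two sets of samples come out the same, which is what aliasing means. The helper names are hypothetical, and a cosine is used (rather than a sine) so the two sequences match without a sign flip.

```python
import math

TONE_HZ = 1_000.0


def sample_cosine(freq_hz, sample_rate, n_samples):
    """Sample a unit-amplitude cosine at evenly spaced instants."""
    return [math.cos(2.0 * math.pi * freq_hz * k / sample_rate)
            for k in range(n_samples)]


def alias_frequency(freq_hz, sample_rate):
    """Frequency in [0, sample_rate / 2] that the samples appear to have."""
    folded = freq_hz % sample_rate
    return min(folded, sample_rate - folded)


fast = 16_000.0
slow = 1_500.0

samples_per_cycle_fast = fast / TONE_HZ   # 16 samples per cycle
samples_per_cycle_slow = slow / TONE_HZ   # 1.5 samples per cycle

# At the slow rate the tone folds down to a lower apparent frequency.
apparent_hz = alias_frequency(TONE_HZ, slow)

true_samples = sample_cosine(TONE_HZ, slow, 30)
alias_samples = sample_cosine(apparent_hz, slow, 30)

# Largest difference between the two sample sequences (floating-point noise).
max_gap = max(abs(a - b) for a, b in zip(true_samples, alias_samples))
print(apparent_hz, max_gap)
```

Once the samples are taken, nothing distinguishes the real tone from its alias, which is why the low-pass filter has to come before sampling rather than after.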
The central idea of all audio analysis: any sound can be built by adding up pure sine waves of different frequencies, amplitudes, and phases. A flute playing a note isn’t one frequency — it’s a fundamental plus a stack of harmonics (integer multiples) whose particular mix gives the flute its timbre. A vowel is a pattern of resonant peaks. A drum is a burst of many frequencies at once.
This is Fourier’s insight, and it’s why “which frequencies are present” is the natural language of sound. Two sounds can look completely different as waveforms yet share structure that’s obvious once you list their frequencies. So the goal becomes: take a chunk of waveform and find its recipe of frequencies — the amount of each sine wave needed to reconstruct it. That recipe is called the spectrum.
Add harmonics one at a time (faint sines) and watch them sum into a richer waveform (teal) — the same way a real instrument’s timbre is built. More harmonics, more character.
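Building a tone from harmonics can be sketched in a few lines of NumPy. The 200 Hz fundamental and the halving amplitudes below are made-up values for illustration, not measurements of any real instrument.

```python
import numpy as np

SAMPLE_RATE = 16_000
FUNDAMENTAL_HZ = 200.0
# Hypothetical amplitudes for harmonics 1, 2, 3, 4 (illustrative only).
HARMONIC_AMPLITUDES = [1.0, 0.5, 0.25, 0.125]


def additive_tone(fundamental_hz, amplitudes, sample_rate, duration_s):
    """Sum sine waves at integer multiples of the fundamental."""
    n_samples = int(round(sample_rate * duration_s))
    t = np.arange(n_samples) / sample_rate
    wave = np.zeros(n_samples)
    for k, amp in enumerate(amplitudes, start=1):
        # Harmonic k sits at k times the fundamental frequency.
        wave += amp * np.sin(2.0 * np.pi * k * fundamental_hz * t)
    return wave


wave = additive_tone(FUNDAMENTAL_HZ, HARMONIC_AMPLITUDES, SAMPLE_RATE, 0.05)

# Every harmonic completes a whole number of cycles per fundamental period,
# so the summed wave still repeats once per period of the fundamental.
period = int(SAMPLE_RATE / FUNDAMENTAL_HZ)   # 80 samples
print(np.allclose(wave[:period], wave[period:2 * period]))
```

Changing the amplitude list changes the shape of each cycle (the timbre) while the repetition rate, and so the perceived pitch, stays at the fundamental.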
The Fourier transform is the machine that extracts that recipe. Feed it a chunk of waveform; it returns, for every frequency, how much of that frequency is present. The way it works is elegantly simple: to measure how much of frequency f is in a signal, you multiply the signal by a sine wave of frequency f and add up the result (in practice by both a sine and a cosine, so the answer doesn’t depend on the wave’s phase). If the signal contains that frequency, the products reinforce and the sum is large; if it doesn’t, they cancel to near zero. Do this for every frequency and you have the full spectrum.
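The multiply-and-add idea is short enough to write out directly. This sketch correlates a signal with a sine and a cosine at the probe frequency and combines the two sums, so the result does not depend on where in its cycle the tone happens to start. The function name `strength`, the 0.8 amplitude, and the 0.3 radian phase are assumptions for illustration.

```python
import math

SAMPLE_RATE = 16_000
N = 400   # chunk length: 25 ms at 16 kHz


def strength(signal, freq_hz, sample_rate):
    """Estimate the amplitude of one frequency by correlating with sinusoids."""
    sin_sum = 0.0
    cos_sum = 0.0
    for k, x in enumerate(signal):
        angle = 2.0 * math.pi * freq_hz * k / sample_rate
        sin_sum += x * math.sin(angle)
        cos_sum += x * math.cos(angle)
    # Combining both sums makes the estimate independent of the tone's phase.
    return 2.0 * math.hypot(sin_sum, cos_sum) / len(signal)


# A 440 Hz tone with amplitude 0.8 and an arbitrary starting phase.
signal = [0.8 * math.sin(2.0 * math.pi * 440.0 * k / SAMPLE_RATE + 0.3)
          for k in range(N)]

present = strength(signal, 440.0, SAMPLE_RATE)   # close to 0.8
absent = strength(signal, 600.0, SAMPLE_RATE)    # close to 0
print(present, absent)
```

Both probe frequencies fit a whole number of cycles into the 400-sample chunk, which is why the cancellation for 600 Hz is essentially exact here; for frequencies that do not fit evenly, some energy leaks into neighbouring frequencies.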
In practice we use the Fast Fourier Transform (FFT), a clever algorithm that computes this for all frequencies at once in n log n time instead of n², making it cheap enough to run constantly. The output is a set of frequency bins, each holding the strength of one frequency band. The number of bins is set by the chunk size: a real-valued chunk of n samples gives n/2 + 1 distinct bins, so a 400-sample chunk gives 201 (at 16 kHz, one every 40 Hz from 0 to 8 kHz).
Top: a waveform made of a few tones. Bottom: its spectrum — the FFT’s output, a spike at each frequency present. Add tones and watch new spikes appear exactly where they belong.
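The same picture can be reproduced with NumPy’s FFT routines. This is a sketch under simple assumptions: a 400-sample chunk at 16 kHz containing two made-up tones (440 Hz and 1,000 Hz) that both land exactly on bin centres, so each shows up as a single clean spike.

```python
import numpy as np

SAMPLE_RATE = 16_000
N = 400   # chunk size in samples

t = np.arange(N) / SAMPLE_RATE
# Two illustrative tones: a strong one at 440 Hz, a weaker one at 1,000 Hz.
chunk = 1.0 * np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 1000.0 * t)

spectrum = np.fft.rfft(chunk)                          # complex value per bin
freqs = np.fft.rfftfreq(N, d=1.0 / SAMPLE_RATE)        # centre frequency per bin
magnitude = np.abs(spectrum) * 2.0 / N                 # rescale to tone amplitude

# Indices of the two strongest bins, in ascending frequency order.
strongest = np.sort(np.argsort(magnitude)[-2:])
peak_freqs = freqs[strongest]

print(len(spectrum))   # 201
print(peak_freqs)
```

`np.fft.rfft` is the real-input variant of the FFT: it returns only the non-redundant half of the spectrum, which is where the n/2 + 1 bin count comes from.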
One Fourier transform of a whole clip tells you which frequencies are present, but not when. For speech and music, when is everything — a word is a sequence of changing sounds. The fix is the Short-Time Fourier Transform (STFT): instead of one transform of the whole signal, slide a short window across it and take an FFT of each little window. Now you know the frequency content moment by moment.
This forces a famous trade-off, the time-frequency uncertainty. A short window pinpoints when something happened but blurs which frequency (few samples = coarse frequency bins). A long window nails the frequency but smears the timing. You can’t have perfect resolution in both — you choose. Whisper’s 25-millisecond window with a 10-millisecond hop is a typical speech compromise: fine enough in time to catch fast consonants, long enough to resolve pitch.
Slide the window length. Short windows (left) give sharp timing but fat, blurry frequency bands; long windows (right) give crisp frequencies but smeared timing. There is no free lunch — pick your trade.
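A bare-bones STFT can be sketched as "cut into overlapping frames, window each frame, FFT each frame". The version below uses the 25 ms window and 10 ms hop mentioned above (400 and 160 samples at 16 kHz) with a Hann window and no padding; production implementations usually pad the edges and may differ in other details, so treat this as an illustration rather than a description of any particular library. The test signal, which switches from 440 Hz to 1,000 Hz halfway through, is made up.

```python
import numpy as np

SAMPLE_RATE = 16_000
WIN = 400   # 25 ms window
HOP = 160   # 10 ms hop


def stft_magnitude(signal, win=WIN, hop=HOP):
    """Magnitude STFT with shape (n_frames, win // 2 + 1); no edge padding."""
    n_frames = 1 + (len(signal) - win) // hop
    # Row i holds the sample indices of frame i: hop * i ... hop * i + win - 1.
    idx = hop * np.arange(n_frames)[:, None] + np.arange(win)[None, :]
    frames = signal[idx] * np.hanning(win)   # taper each frame's edges
    return np.abs(np.fft.rfft(frames, axis=1))


# One second of audio: 440 Hz for the first half, 1,000 Hz for the second.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
signal = np.where(t < 0.5,
                  np.sin(2 * np.pi * 440.0 * t),
                  np.sin(2 * np.pi * 1000.0 * t))

mag = stft_magnitude(signal)
spectrogram = mag.T ** 2          # power, laid out as (frequency, time)

bin_hz = SAMPLE_RATE / WIN        # frequency resolution: 40 Hz per bin
first_peak_hz = mag[0].argmax() * bin_hz
last_peak_hz = mag[-1].argmax() * bin_hz
print(mag.shape, first_peak_hz, last_peak_hz)
```

The trade-off is visible in `bin_hz = SAMPLE_RATE / WIN`: halving the window to 200 samples would double the time resolution but widen each bin to 80 Hz.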
Stack the STFT’s per-window spectra side by side and you get the spectrogram — a 2-D image of sound. Time runs across the horizontal axis (one column per window), frequency climbs the vertical axis (one row per frequency bin), and brightness shows the energy at that time and frequency. This is the single most important picture in audio: speech, music, birdsong — all become legible visual patterns.
In a spectrogram you can see structure that’s invisible in the waveform: the horizontal stripes of a sustained vowel’s harmonics, the rising sweep of a question’s intonation, the vertical spikes of percussion, the formant bands that distinguish “ee” from “oo.” This is exactly why the spectrogram — not the raw waveform — is what gets fed to most audio models, including Whisper’s encoder. It re-organizes sound into the form its information actually takes.
A spectrogram of a little “utterance.” Change the sound type: a steady tone (horizontal lines), a sweep (diagonal), a click (vertical). Learn to read time × frequency × brightness.
A raw spectrogram is good; a log-mel spectrogram is what models actually use, and the two extra steps both come from human hearing. First, the mel scale. We don’t perceive frequency linearly: the step from 100 to 200 Hz is a full octave to the ear, while 5,000 to 5,100 Hz is a barely noticeable nudge. So we warp the frequency axis: combine the linear FFT bins into a smaller set of mel bands, with many narrow bands down low where we hear detail and few wide bands up high. Whisper uses 80 mel bands. This spends resolution where perception lives, and shrinks the representation.
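To put rough numbers on the warping, the sketch below uses one widely used closed-form approximation of the mel scale, mel = 2595 · log10(1 + f / 700). Several variants of the formula exist, and this one is an assumption for illustration, not necessarily the variant any particular model uses.

```python
import math


def hz_to_mel(freq_hz):
    """One common mel approximation (other variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step, taken low and high on the frequency axis.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(5100.0) - hz_to_mel(5000.0)

print(low_step, high_step)
```

On this scale the low-frequency step covers several times more mels than the high-frequency one, which is the asymmetry the mel bands are built to respect.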
Second, the log. Loudness spans a vast range and we perceive it logarithmically (hence decibels). Taking the log of each band’s energy compresses that range so a quiet consonant and a loud vowel are both visible, rather than the loud parts dominating. The result — warp to mel, take the log — is the log-mel spectrogram: compact, perceptually shaped, and the standard input to Whisper, audio classifiers, and beyond.
The triangular filters that fold many linear FFT bins into fewer mel bands — narrow and dense at low frequency, wide and sparse up high. Drag the number of mel bands to see the warping.
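The two steps (fold FFT bins into mel bands, then take the log) can be sketched as a matrix multiply followed by a logarithm. This is a simplified construction under stated assumptions: triangles with unit peak height, edges spaced evenly on the mel approximation 2595 · log10(1 + f / 700), and random numbers standing in for a real power spectrogram. Real libraries differ in the mel formula, filter normalization, and log scaling.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters with shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_mels triangles need n_mels + 2 edge points, evenly spaced in mel.
    edges_hz = mel_to_hz(
        np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, center, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (fft_freqs - left) / (center - left)      # up-slope
        falling = (right - fft_freqs) / (right - center)   # down-slope
        # The triangle is the lower of the two slopes, clipped at zero.
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


bank = mel_filterbank(n_mels=80, n_fft=400, sample_rate=16_000)

# Stand-in power spectrogram (201 frequency bins x 98 frames), random values.
rng = np.random.default_rng(0)
power = rng.random((201, 98))

mel_power = bank @ power                              # (80, 98)
log_mel = np.log10(np.maximum(mel_power, 1e-10))      # floor avoids log(0)
print(bank.shape, log_mel.shape)
```

The small floor inside the logarithm is a common guard against silent frames, where the band energy would otherwise be zero and the log undefined.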
Watch a sound travel the whole chain: waveform → sampled → windowed → FFT per window → spectrogram → mel-warped → log. Each knob changes what the model would see. This is exactly what happens in the first few milliseconds inside Whisper, before a single transformer layer runs.
The waveform (top) becomes a log-mel spectrogram (bottom) through your chosen window length and mel-band count. Change the sound, the window, and the bands — and see the representation the model receives transform in real time.
That bottom image — compact, perceptual, organized as time × mel × log-energy — is the universal currency of audio ML. Whisper, audio classifiers, even many TTS and music models start from some version of it. Master this picture and you understand the front-end of the entire field.
Before deep learning, the dominant feature was the MFCC (Mel-Frequency Cepstral Coefficients). Take the log-mel spectrogram and apply one more transform, a discrete cosine transform, across the mel bands of each frame. This decorrelates the mel bands; keeping only the first ~13 coefficients then captures the broad shape of the spectrum (the vocal-tract filter) while discarding fine detail. MFCCs were tiny and powerful, the workhorse of classical speech recognition for decades.
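The extra MFCC step is small enough to sketch by hand. The code below builds an unnormalized type-II DCT as a matrix of cosines, applies it across the mel bands of each frame, and keeps the first 13 coefficients. The random input stands in for a real log-mel spectrogram, and note that it is laid out as (frames, mel bands). Library implementations typically add scaling and other refinements that are omitted here.

```python
import numpy as np


def dct_ii(x):
    """Unnormalized type-II DCT along the last axis."""
    n = x.shape[-1]
    k = np.arange(n)[:, None]   # output coefficient index
    j = np.arange(n)[None, :]   # input (mel band) index
    basis = np.cos(np.pi * k * (2 * j + 1) / (2 * n))
    return x @ basis.T


def mfcc_from_log_mel(log_mel_frames, n_coeffs=13):
    """Keep the first n_coeffs DCT coefficients of every frame."""
    return dct_ii(log_mel_frames)[..., :n_coeffs]


# Stand-in log-mel spectrogram: 98 frames x 80 mel bands of random values.
rng = np.random.default_rng(0)
log_mel_frames = rng.standard_normal((98, 80))

mfcc = mfcc_from_log_mel(log_mel_frames)
print(mfcc.shape)   # (98, 13)
```

Coefficient 0 is just the sum of the frame’s log energies, and each higher coefficient measures a progressively finer ripple across the mel bands, which is why truncating to the first few keeps the broad spectral shape.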
Deep models mostly dropped MFCCs in favor of the full log-mel spectrogram — because a neural network would rather learn its own features from the richer representation than be handed hand-engineered, lossy coefficients. The trend is toward less hand-engineering: log-mel for most models, and increasingly raw-waveform front-ends (learned convolutional filters that discover their own spectrogram-like features, as in Wav2Vec) or neural codec tokens for generative audio. The arc is the familiar one: from hand-designed features toward learned ones — but the spectrogram’s intuition underlies them all.
A spectrum of choices: MFCC (most hand-engineered, tiny), log-mel (the standard), raw-waveform learned filters, and neural-codec tokens (most learned). Drag to see what each keeps and discards.
| Term | Meaning |
|---|---|
| Sample rate | measurements per second (16k speech, 44.1k music) |
| Nyquist | sample > 2× the highest frequency or it aliases |
| FFT | fast algorithm for the frequency spectrum of a chunk |
| STFT | FFT of sliding windows → frequencies over time |
| Window trade-off | short = sharp time/blurry freq; long = vice versa |
| Mel + log | perceptual frequency warp + loudness compression |
→ Whisper — the speech model that eats the log-mel spectrogram
→ Neural Audio Codecs — turning audio into discrete tokens
→ TTS Architectures — going the other way: text → audio
→ Spectral Analysis — the signal-processing deep dive