Audio & Speech

Audio Representations

Before any model can hear, sound must become numbers it can use. This lesson follows the journey from a raw pressure wave to the log-mel spectrogram that most audio models consume: sampling, the Fourier transform, the spectrogram, and the perceptual scales that make it work.

Prerequisites: Sound is a vibration in the air + A sine wave has a frequency. That’s it.

Chapter 0: Sound as Numbers

Sound is a wave of air pressure — it pushes and pulls on your eardrum, on a microphone’s membrane. A microphone turns that motion into a fluctuating voltage, a single value that wiggles up and down over time. That continuous wiggle is the waveform: one number (amplitude) as a function of time. Everything in audio ML starts here, and the entire field is about turning this raw wiggle into something a model can actually use.

But the raw waveform is a hard thing to learn from directly. It’s enormously long (tens of thousands of values per second), and the information humans care about — pitch, timbre, which vowel, which instrument — isn’t obvious in the up-and-down of pressure. It’s hidden in the frequencies the wave contains. So the whole pipeline of this lesson is a series of transforms that turn “pressure over time” into “which frequencies, over time, on a scale that matches human hearing” — the log-mel spectrogram that Whisper and nearly every audio model consume.

The trap: “just feed the raw waveform to the network.” You can — some models do — but it’s like handing someone a seismograph trace and asking what song was playing. The information is technically there, but in a form that wastes the model’s capacity. The representations in this lesson pre-package sound the way it’s actually structured, and that’s why they remain the backbone of audio ML.

The waveform

A sound as a microphone sees it: amplitude (pressure) over time. Drag to mix two tones — notice that even a simple sum looks like a complicated wiggle. The frequencies are in there, but hidden.

second tone strength: 0.50

What does a raw waveform represent?

The frequencies present in the sound, directly
Air pressure (amplitude) as a function of time, with the frequencies hidden inside it
The loudness in decibels

Chapter 1: Sampling & the Sample Rate

A computer can’t store a continuous wiggle — it stores numbers. So we sample: measure the waveform’s amplitude at regular instants, many thousands of times per second. The sample rate is how often we measure. CDs use 44,100 samples per second; speech models like Whisper use 16,000. Each measurement is one number; a 10-second clip at 16 kHz is 160,000 numbers.
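A minimal sketch of sampling, using only the Python standard library. The helper name `sample_tone` and the 440 Hz test tone are illustrative assumptions; the 16 kHz rate and 10-second duration are the figures from the paragraph above.

```python
import math

def sample_tone(freq_hz, sample_rate, duration_s, amplitude=1.0):
    """Measure a pure sine tone at the regular instants t = n / sample_rate."""
    n_samples = round(sample_rate * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

# 440 Hz is an arbitrary illustrative tone; 16 kHz and 10 s come from the text.
clip = sample_tone(freq_hz=440.0, sample_rate=16_000, duration_s=10.0)
print(len(clip))  # 160000
```

Each list entry is one measurement of the waveform, so the clip length is simply sample rate times duration.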

How fast must you sample? The Nyquist rule: to capture a frequency, you must sample at more than twice that frequency. Sample too slowly and high frequencies don’t just disappear; they alias, masquerading as false low frequencies (the wagon-wheel effect, in sound). Humans hear up to ~20 kHz, so 44.1 kHz (just over twice that) captures everything audible. Speech energy lives mostly below 8 kHz, so 16 kHz suffices, which is why speech models downsample: far less data, with little loss of intelligibility.

Worked example by hand

A 1,000 Hz tone, sampled at 16,000 Hz: you get 16 samples per cycle, plenty to trace the wave faithfully. The same tone sampled at 1,500 Hz: only 1.5 samples per cycle, because 1,500 Hz falls short of the 2,000 Hz that Nyquist demands for this tone. The samples connect into a slow, wrong wave: an alias at 1,500 − 1,000 = 500 Hz. The cure is always the same: low-pass filter to remove frequencies above half the sample rate before sampling.
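The worked example can be checked numerically. The sketch below (helper names `sample_tone` and `alias_frequency` are hypothetical) samples a 1,000 Hz tone at 1,500 Hz and confirms that the resulting numbers are indistinguishable from those of a 500 Hz tone with inverted sign, which is what aliasing means.

```python
import math

def sample_tone(freq_hz, sample_rate, n_samples):
    """Samples of a unit sine at freq_hz, measured sample_rate times per second."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def alias_frequency(freq_hz, sample_rate):
    """Apparent frequency of a pure tone after sampling (folded into 0..sample_rate/2)."""
    return abs(freq_hz - sample_rate * round(freq_hz / sample_rate))

too_slow = sample_tone(1000, 1500, 12)   # the 1,000 Hz tone, undersampled
impostor = sample_tone(500, 1500, 12)    # a genuine 500 Hz tone, same rate

# Sample for sample, the undersampled tone equals the 500 Hz tone flipped in sign.
identical = all(abs(a + b) < 1e-9 for a, b in zip(too_slow, impostor))
print(identical)                      # True
print(alias_frequency(1000, 1500))    # 500
print(alias_frequency(1000, 16000))   # 1000 (sampled fast enough, no alias)
```

Once the samples are taken, nothing in the data can tell the two tones apart, which is why the low-pass filter has to come before sampling rather than after.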

Sampling and aliasing

A smooth tone (teal) sampled at dots. Lower the sample rate below twice the tone’s frequency and the dots start tracing a false, slower wave (orange) — aliasing. Keep it above Nyquist and the samples capture the truth.

sample rate: 16

The Nyquist rule says to capture a frequency you must sample at:

exactly that frequency
more than twice that frequency, or it aliases into a false lower frequency
half that frequency

Chapter 2: Sound Is Made of Frequencies

The central idea of all audio analysis: any sound can be built by adding up pure sine waves of different frequencies, amplitudes, and phases. A flute playing a note isn’t one frequency — it’s a fundamental plus a stack of harmonics (integer multiples) whose particular mix gives the flute its timbre. A vowel is a pattern of resonant peaks. A drum is a burst of many frequencies at once.

This is Fourier’s insight, and it’s why “which frequencies are present” is the natural language of sound. Two sounds can look completely different as waveforms yet share structure that’s obvious once you list their frequencies. So the goal becomes: take a chunk of waveform and find its recipe of frequencies — the amount of each sine wave needed to reconstruct it. That recipe is called the spectrum.
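To make the "recipe" idea concrete, here is a small additive-synthesis sketch. The function name `harmonic_tone`, the 220 Hz fundamental, and the amplitude list are illustrative assumptions, not measurements of any real instrument.

```python
import math

def harmonic_tone(fundamental_hz, harmonic_amps, sample_rate, n_samples):
    """Sum sine waves at integer multiples of the fundamental.

    harmonic_amps[k] is the amplitude of harmonic k + 1
    (index 0 is the fundamental itself).
    """
    wave = []
    for n in range(n_samples):
        t = n / sample_rate
        wave.append(sum(
            amp * math.sin(2 * math.pi * (k + 1) * fundamental_hz * t)
            for k, amp in enumerate(harmonic_amps)
        ))
    return wave

# A made-up recipe: fundamental plus two progressively weaker harmonics.
recipe = [1.0, 0.5, 0.25]
wave = harmonic_tone(220.0, recipe, sample_rate=16_000, n_samples=160)
print(len(wave))  # 160
```

Changing the amplitudes in `recipe` changes the shape of the waveform (the timbre) without changing the fundamental (the pitch).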

Building a sound from sine waves

Add harmonics one at a time (faint sines) and watch them sum into a richer waveform (teal) — the same way a real instrument’s timbre is built. More harmonics, more character.

harmonics added: 3

Fourier’s insight is that any sound can be:

compressed to a single number
built by adding up pure sine waves of various frequencies and amplitudes
represented only as text

Chapter 3: The Fourier Transform

The Fourier transform is the machine that extracts that recipe. Feed it a chunk of waveform; it returns, for every frequency, how much of that frequency is present. The way it works is elegantly simple: to measure how much of frequency f is in a signal, you multiply the signal by a sine wave of frequency f and add up the result. If the signal contains that frequency, the products reinforce and the sum is large; if it doesn’t, they cancel to near zero. (In practice the test is run with both a sine and a cosine of frequency f, so a tone is detected whatever its phase.) Do this for every frequency and you have the full spectrum.
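The multiply-and-sum test can be written in a few lines. This is a sketch for a single frequency, not a full transform. It correlates against a cosine as well as a sine so that the result does not depend on the tone's phase; the function name `strength_at` and the test signal (amplitude 0.8, phase 0.3) are assumptions for illustration.

```python
import math

def strength_at(signal, freq_hz, sample_rate):
    """Correlate the signal with a test cosine and sine at freq_hz.

    Returns the amplitude of that frequency, which is exact when the
    chunk holds a whole number of cycles of every tone involved.
    """
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(signal))
    return 2 * math.hypot(re, im) / n

sr = 16_000
# 400 samples of a 1,000 Hz tone (25 whole cycles) with arbitrary amplitude and phase.
signal = [0.8 * math.sin(2 * math.pi * 1000 * i / sr + 0.3) for i in range(400)]

print(round(strength_at(signal, 1000, sr), 6))  # 0.8  (products reinforce)
print(round(strength_at(signal, 2000, sr), 6))  # 0.0  (products cancel)
```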

In practice we use the Fast Fourier Transform (FFT), a clever algorithm that computes this for all frequencies at once in n log n time instead of n², making it cheap enough to run constantly. The output is a set of frequency bins, each holding the strength of one frequency band. The number of bins is set by the chunk size: a 400-sample chunk gives 201 useful bins (n/2 + 1, because the upper half of a real signal’s spectrum mirrors the lower half), each 40 Hz wide at a 16 kHz sample rate.

Concept → realization: the “multiply by a test sine and sum” trick is a correlation — you’re asking “how much does my signal look like this pure tone?” A high answer means “lots of this frequency.” The FFT just does this for a whole comb of test frequencies simultaneously and efficiently. That’s the entire idea behind turning a wiggle into a spectrum.
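The sketch below puts the direct "one correlation per bin" transform next to a textbook recursive radix-2 FFT, which gives the same numbers with far fewer operations. It is a teaching version, not what production libraries ship: it needs a power-of-two length, so the chunk here is 512 samples rather than 400, and the two-tone test signal is an assumption.

```python
import cmath
import math

def dft(x):
    """Direct O(n^2) transform: one multiply-and-sum correlation per bin."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out

sr, n = 16_000, 512
x = [math.sin(2 * math.pi * 1000 * t / sr) + 0.5 * math.sin(2 * math.pi * 3000 * t / sr)
     for t in range(n)]

spectrum = fft(x)
useful = spectrum[: n // 2 + 1]   # bins above n/2 mirror the ones below
bin_hz = sr / n                   # width of one bin: 31.25 Hz
peak_bin = max(range(len(useful)), key=lambda k: abs(useful[k]))
print(len(useful), bin_hz, peak_bin * bin_hz)  # 257 31.25 1000.0
```

The strongest bin lands on the louder of the two tones, and a second, smaller spike sits at the 3,000 Hz bin.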

From waveform to spectrum

Top: a waveform made of a few tones. Bottom: its spectrum — the FFT’s output, a spike at each frequency present. Add tones and watch new spikes appear exactly where they belong.

number of tones: 2

How does the Fourier transform measure how much of frequency f is in a signal?

By counting the samples
By multiplying the signal by a test sine of frequency f and summing: large if that frequency is present, ~0 if not
By taking the average amplitude

Chapter 4: The STFT — frequencies over time

One Fourier transform of a whole clip tells you which frequencies are present, but not when. For speech and music, when is everything — a word is a sequence of changing sounds. The fix is the Short-Time Fourier Transform (STFT): instead of one transform of the whole signal, slide a short window across it and take an FFT of each little window. Now you know the frequency content moment by moment.
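A minimal STFT sketch: slide a window along the signal, taper each chunk with a Hann window (a common choice, assumed here), and transform each chunk on its own. For readability it uses a direct transform per frame instead of an FFT. The test signal, which switches from 400 Hz to 2,000 Hz halfway through, is made up to show that the STFT recovers when each tone occurs.

```python
import cmath
import math

def stft(signal, window_len, hop):
    """Return one list of complex frequency bins per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / window_len)
              for i in range(window_len)]                      # Hann taper
    twiddle = [cmath.exp(-2j * cmath.pi * m / window_len)
               for m in range(window_len)]                     # reusable test waves
    n_bins = window_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - window_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(window_len)]
        frames.append([
            sum(chunk[t] * twiddle[(k * t) % window_len] for t in range(window_len))
            for k in range(n_bins)
        ])
    return frames

sr, window_len, hop = 16_000, 400, 160   # 25 ms window, 10 ms hop at 16 kHz
half = 800                               # 0.05 s of each tone
signal = ([math.sin(2 * math.pi * 400 * n / sr) for n in range(half)]
          + [math.sin(2 * math.pi * 2000 * n / sr) for n in range(half)])

frames = stft(signal, window_len, hop)
peak_hz = [max(range(len(f)), key=lambda k: abs(f[k])) * sr / window_len
           for f in frames]
print(len(frames))               # 8
print(peak_hz[0], peak_hz[-1])   # 400.0 2000.0
```

A single transform of the whole clip would report both tones but could not say which came first; the per-frame peaks can.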

This forces a famous trade-off, the time-frequency uncertainty. A short window pinpoints when something happened but blurs which frequency (few samples = coarse frequency bins). A long window nails the frequency but smears the timing. You can’t have perfect resolution in both — you choose. Whisper’s 25-millisecond window with a 10-millisecond hop is a typical speech compromise: fine enough in time to catch fast consonants, long enough to resolve pitch.
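The trade-off is plain arithmetic. In the sketch below (the function name `stft_resolution` is hypothetical), the 400-sample window and 160-sample hop are simply the 25 ms and 10 ms figures above expressed at a 16 kHz sample rate; the other window lengths are arbitrary points of comparison.

```python
def stft_resolution(window_len, hop, sample_rate):
    """Time and frequency granularity implied by an STFT configuration."""
    return {
        "window_ms": 1000 * window_len / sample_rate,  # how much time one frame covers
        "hop_ms": 1000 * hop / sample_rate,            # spacing between frames
        "bin_hz": sample_rate / window_len,            # width of one frequency bin
        "n_bins": window_len // 2 + 1,                 # useful bins per frame
    }

for window_len in (100, 400, 1600):
    r = stft_resolution(window_len, hop=160, sample_rate=16_000)
    print(window_len, r["window_ms"], r["bin_hz"])
# 100 6.25 160.0
# 400 25.0 40.0
# 1600 100.0 10.0
```

Quadrupling the window makes the bins four times narrower and the time smear four times wider; the window duration in seconds times the bin width in Hz is always 1.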

The time–frequency trade-off

Slide the window length. Short windows (left) give sharp timing but fat, blurry frequency bands; long windows (right) give crisp frequencies but smeared timing. There is no free lunch — pick your trade.

window length: 0.40

Why use the Short-Time Fourier Transform instead of one big FFT?

It is more accurate overall
It reveals how the frequency content changes over time, by FFT-ing short sliding windows
It removes the need for sampling

Chapter 5: The Spectrogram

Stack the STFT’s per-window spectra side by side and you get the spectrogram — a 2-D image of sound. Time runs across the horizontal axis (one column per window), frequency climbs the vertical axis (one row per frequency bin), and brightness shows the energy at that time and frequency. This is the single most important picture in audio: speech, music, birdsong — all become legible visual patterns.

In a spectrogram you can see structure that’s invisible in the waveform: the horizontal stripes of a sustained vowel’s harmonics, the rising sweep of a question’s intonation, the vertical spikes of percussion, the formant bands that distinguish “ee” from “oo.” This is exactly why the spectrogram — not the raw waveform — is what gets fed to most audio models, including Whisper’s encoder. It re-organizes sound into the form its information actually takes.

Reading a spectrogram

A spectrogram of a little “utterance.” Change the sound type: a steady tone (horizontal lines), a sweep (diagonal), a click (vertical). Learn to read time × frequency × brightness.

sound type: tone

In a spectrogram, the three axes/quantities are:

time, loudness, and language
time (horizontal), frequency (vertical), and energy (brightness)
amplitude, phase, and duration only

Chapter 6: The Mel Scale & Log Loudness

A raw spectrogram is good; a log-mel spectrogram is what models actually use, and the two extra steps both come from human hearing. First, the mel scale. We don’t perceive frequency linearly: the step from 100 to 200 Hz is a full octave to the ear, while the step from 5,000 to 5,100 Hz is barely noticeable. So we warp the frequency axis: combine the linear FFT bins into a smaller set of mel bands, many narrow bands down low where we hear detail, few wide bands up high. Whisper uses 80 mel bands (128 in its large-v3 model). This spends resolution where perception lives, and shrinks the representation.
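A sketch of the warp, assuming one widely used formula for the mel scale, mel = 2595 · log10(1 + f/700). Several variants exist, and this is not necessarily the one any particular model uses. The helper names and the choice of 12 bands are illustrative.

```python
import math

def hz_to_mel(freq_hz):
    """One common mel formula; other variants differ slightly."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min, f_max):
    """n_bands + 2 edge frequencies, equally spaced in mel.

    Band i starts at edges[i], peaks at edges[i + 1], ends at edges[i + 2].
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + (hi - lo) * i / (n_bands + 1))
            for i in range(n_bands + 2)]

# The same 100 Hz step is a big move low down and a small one high up.
low_step = hz_to_mel(200) - hz_to_mel(100)      # roughly 133 mel
high_step = hz_to_mel(5100) - hz_to_mel(5000)   # roughly 20 mel
print(round(low_step), round(high_step))

edges = mel_band_edges(n_bands=12, f_min=0.0, f_max=8000.0)
widths = [edges[i + 2] - edges[i] for i in range(12)]
print(round(widths[0]), round(widths[-1]))  # first band is narrow, last band is wide
```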

Second, the log. Loudness spans a vast range and we perceive it logarithmically (hence decibels). Taking the log of each band’s energy compresses that range so a quiet consonant and a loud vowel are both visible, rather than the loud parts dominating. The result — warp to mel, take the log — is the log-mel spectrogram: compact, perceptually shaped, and the standard input to Whisper, audio classifiers, and beyond.
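The effect of the log is easy to see with two made-up energy values. The function name `to_db` and the small floor (there only to avoid taking the log of zero) are assumptions of this sketch.

```python
import math

def to_db(energy, floor=1e-10):
    """Convert an energy value to decibels, clamping at a small floor."""
    return 10.0 * math.log10(max(energy, floor))

loud_vowel = 1.0          # illustrative band energies, not measured data
quiet_consonant = 1e-4    # 10,000 times weaker

print(loud_vowel / quiet_consonant)                         # 10000.0
print(round(to_db(loud_vowel) - to_db(quiet_consonant), 6))  # 40.0
```

On a linear scale the consonant is lost next to the vowel; on the log scale the two sit a manageable 40 dB apart.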

Linear bins folded into mel bands

The triangular filters that fold many linear FFT bins into fewer mel bands — narrow and dense at low frequency, wide and sparse up high. Drag the number of mel bands to see the warping.

mel bands: 12

Why convert a spectrogram to log-mel?

To make it bigger
Mel-warping and log-scaling match human hearing, concentrating resolution where we perceive detail and compressing the loudness range
To remove all the frequencies

Chapter 7: The Full Pipeline, Live (showcase)

Watch a sound travel the whole chain: waveform → sampled → windowed → FFT per window → spectrogram → mel-warped → log. Each knob changes what the model would see. This is exactly what happens in the first few milliseconds inside Whisper, before a single transformer layer runs.
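The whole chain fits in a short standard-library sketch. It is an illustration under stated assumptions, not Whisper's actual front-end: it uses a direct transform instead of an FFT, one common mel formula, 20 mel bands instead of 80 to keep it quick, a plain log10 with a floor, and a synthetic 440 Hz tone as input. All function names are hypothetical.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_edges(n_mels, sr):
    """n_mels + 2 band edges in Hz, equally spaced in mel from 0 to sr / 2."""
    hi = hz_to_mel(sr / 2)
    return [mel_to_hz(hi * i / (n_mels + 1)) for i in range(n_mels + 2)]

def power_frames(signal, n_fft, hop):
    """Windowed power spectrum of each frame: the spectrogram's columns."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n_fft) for i in range(n_fft)]
    twiddle = [cmath.exp(-2j * cmath.pi * m / n_fft) for m in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(n_fft)]
        frames.append([
            abs(sum(chunk[t] * twiddle[(k * t) % n_fft] for t in range(n_fft))) ** 2
            for k in range(n_fft // 2 + 1)
        ])
    return frames

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters: row m weights each FFT bin's share of mel band m."""
    edges = mel_edges(n_mels, sr)
    bank = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * sr / n_fft
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def log_mel(signal, sr, n_fft=400, hop=160, n_mels=20, floor=1e-10):
    """waveform -> frames -> power spectrum -> mel bands -> log."""
    bank = mel_filterbank(n_mels, n_fft, sr)
    return [
        [math.log10(max(sum(w * p for w, p in zip(row, spectrum)), floor))
         for row in bank]
        for spectrum in power_frames(signal, n_fft, hop)
    ]

sr = 16_000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(1600)]  # 0.1 s
features = log_mel(tone, sr)
print(len(features), len(features[0]))  # 8 20  (frames x mel bands)
```

Each row of `features` is one column of the log-mel image: 8 time steps, each described by 20 numbers instead of 400 raw samples.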

Waveform → log-mel, end to end

The waveform (top) becomes a log-mel spectrogram (bottom) through your chosen window length and mel-band count. Change the sound, the window, and the bands — and see the representation the model receives transform in real time.

sound: vowel

window length: 0.40

mel bands: 16

That bottom image — compact, perceptual, organized as time × mel × log-energy — is the universal currency of audio ML. Whisper, audio classifiers, even many TTS and music models start from some version of it. Master this picture and you understand the front-end of the entire field.

Chapter 8: MFCCs & Other Representations

Before deep learning, the dominant feature was the MFCC (Mel-Frequency Cepstral Coefficients). Take the log-mel spectrogram and apply one more transform, a discrete cosine transform (DCT), across the mel bands of each frame. The DCT decorrelates the mel bands, and keeping only the first ~13 coefficients captures the broad shape of the spectrum (the vocal-tract filter) while discarding fine detail. MFCCs were tiny and powerful: the workhorse of classical speech recognition for decades.
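A sketch of that last step, assuming an unnormalised DCT-II (real toolkits usually add a scaling convention on top) and a synthetic log-mel frame rather than real speech. The function names are hypothetical.

```python
import math

def dct_ii(values):
    """Unnormalised DCT-II of one frame."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]

def mfcc_from_log_mel(log_mel_frame, n_coeffs=13):
    """Keep only the first few DCT coefficients of a log-mel frame."""
    return dct_ii(log_mel_frame)[:n_coeffs]

# A made-up 80-band log-mel frame: a smooth spectral tilt plus fine ripple.
frame = [-2.0 - 0.02 * i + 0.1 * math.cos(math.pi * 40 * (i + 0.5) / 80)
         for i in range(80)]

coeffs = mfcc_from_log_mel(frame)
print(len(frame), len(coeffs))  # 80 13
```

The smooth tilt shows up in the low-order coefficients that are kept, while the fast ripple lives in a high-order coefficient (number 40 in this toy frame) that is thrown away.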

Deep models mostly dropped MFCCs in favor of the full log-mel spectrogram — because a neural network would rather learn its own features from the richer representation than be handed hand-engineered, lossy coefficients. The trend is toward less hand-engineering: log-mel for most models, and increasingly raw-waveform front-ends (learned convolutional filters that discover their own spectrogram-like features, as in Wav2Vec) or neural codec tokens for generative audio. The arc is the familiar one: from hand-designed features toward learned ones — but the spectrogram’s intuition underlies them all.

Representations, from hand-made to learned

A spectrum of choices: MFCC (most hand-engineered, tiny), log-mel (the standard), raw-waveform learned filters, and neural-codec tokens (most learned). Drag to see what each keeps and discards.

representation: log-mel

Why have deep models largely replaced MFCCs with log-mel spectrograms (or learned front-ends)?

MFCCs are too large to compute
Networks prefer to learn their own features from a richer representation rather than be handed lossy, hand-engineered coefficients
MFCCs cannot represent speech

Chapter 9: Cheat Sheet & Connections

waveform

air pressure (amplitude) over time

↓ sample (Nyquist: > 2× max freq)

samples

e.g. 16,000/sec for speech

↓ STFT (FFT of sliding windows)

spectrogram

time × frequency × energy image

↓ mel-warp + log

log-mel spectrogram

perceptual, compact → the standard model input

↓ (optional) DCT

MFCCs

classical compact features (largely superseded)

Term	Meaning
Sample rate	measurements per second (16k speech, 44.1k music)
Nyquist	sample > 2× the highest frequency or it aliases
FFT	fast algorithm for the frequency spectrum of a chunk
STFT	FFT of sliding windows → frequencies over time
Window trade-off	short = sharp time/blurry freq; long = vice versa
Mel + log	perceptual frequency warp + loudness compression

Keep exploring

→ Whisper — the speech model that eats the log-mel spectrogram
→ Neural Audio Codecs — turning audio into discrete tokens
→ TTS Architectures — going the other way: text → audio
→ Spectral Analysis — the signal-processing deep dive

“What I cannot create, I do not understand.” You just rebuilt the audio front-end from a pressure wave: sample it (mind Nyquist), find its frequencies with the Fourier transform, track them over time with the STFT, stack them into a spectrogram, and warp to mel and log to match the ear. Every audio model you’ll meet begins right here.