AI Architectures

Whisper & Audio Transformers

How a transformer learned to transcribe almost any speech, in any accent, through noise — by turning sound into a picture, and training on 680,000 hours of the messy real internet.

Prerequisites: a transformer encoder-decoder maps a sequence to a sequence, and sound is a wave. That’s it.

Chapter 0: Hearing Is Hard

For decades, speech recognition was brittle. A system trained on clean American English read aloud in a studio would collapse on a Scottish accent, a noisy café, a technical lecture, or a phone call. Every new domain meant collecting labeled audio and fine-tuning. The models were specialists, and specialists shatter outside their lane.

Whisper (OpenAI, 2022) broke that pattern. Out of the box — no fine-tuning — it transcribes dozens of languages, survives heavy noise and thick accents, even translates and adds timestamps. It approaches human robustness on messy real-world audio. And it does it with a completely standard transformer. The magic isn’t a clever new architecture; it’s two ideas: turn sound into a picture a transformer can read, and train on an enormous, diverse, messy pile of real audio instead of a small clean one.

The trap: “robust speech recognition needs a specialized audio architecture and pristine labeled data.” Whisper’s lesson is the opposite — a vanilla transformer plus 680,000 hours of weakly-labeled internet audio beats hand-engineered specialists, because diversity at scale is what teaches robustness. We’ll build every piece, starting with how sound becomes something a transformer can see.

Specialist vs. robust recognizer

Accuracy across conditions. A fine-tuned specialist (orange) is great on its home turf and falls off a cliff elsewhere; Whisper (teal) stays usable across accents, noise, and domains. Slide the “distance from training conditions.”

What two ideas give Whisper its robustness, despite using a standard transformer?

A new attention variant and reinforcement learning
Turning audio into a spectrogram image + training on a huge, diverse, weakly-labeled dataset
Fine-tuning separately for every accent and domain

Chapter 1: Turning Sound Into a Picture

Raw audio is a long, fast wiggle — at the standard 16,000 samples per second, even a 30-second clip is 480,000 numbers, mostly redundant. A transformer can’t attend over that, and the raw waveform hides what matters. The fix is the spectrogram: a picture of which frequencies are present, over time.

The recipe: slide a short window across the audio (Whisper uses a 25-millisecond window, stepping every 10 milliseconds). For each window, run a Fourier transform to ask “how much of each frequency is in this slice?” Stack those frequency profiles side by side and you get a 2-D image: time across the horizontal, frequency up the vertical, brightness = energy. Speech, which was an incomprehensible wiggle, becomes legible bands and sweeps: the harmonics of vowels, the bursts of consonants. Spectrograms of this kind are the front-end of many modern audio models, though some learn their features directly from the raw waveform instead.
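
The recipe above can be sketched in plain Python. This is a minimal illustration, not Whisper's actual front-end: it assumes a toy sample rate of 1,600 Hz so that a naive DFT stays fast, keeps the same 25 ms window and 10 ms step (40 and 16 samples here; at 16 kHz they would be 400 and 160), and tapers each window with a Hann window, which is a common choice assumed for the example. The function names are hypothetical.

```python
import cmath
import math

def frame_signal(samples, win_len, hop_len):
    """Slice a 1-D signal into overlapping windows (no padding)."""
    return [samples[start:start + win_len]
            for start in range(0, len(samples) - win_len + 1, hop_len)]

def power_spectrum(frame):
    """Naive DFT of one Hann-tapered frame: energy per bin 0..N/2."""
    n = len(frame)
    tapered = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
               for i, x in enumerate(frame)]
    energies = []
    for k in range(n // 2 + 1):
        acc = sum(tapered[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        energies.append(abs(acc) ** 2)
    return energies

def spectrogram(samples, win_len, hop_len):
    """One column of frequency energies per window position."""
    return [power_spectrum(f) for f in frame_signal(samples, win_len, hop_len)]

# Toy input: 0.1 s of a pure 200 Hz tone at an assumed 1,600 Hz sample rate.
SAMPLE_RATE = 1600
WIN = SAMPLE_RATE * 25 // 1000   # 25 ms -> 40 samples
HOP = SAMPLE_RATE * 10 // 1000   # 10 ms -> 16 samples
tone = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(160)]

spec = spectrogram(tone, WIN, HOP)
bin_hz = SAMPLE_RATE / WIN       # frequency spacing between bins
peak_bins = [max(range(len(col)), key=col.__getitem__) for col in spec]
peak_hz = [b * bin_hz for b in peak_bins]
```

Every column peaks in the bin that corresponds to 200 Hz, which is the "bright band" you would see in the picture. A real pipeline would use an FFT library rather than this quadratic-time loop.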

Concept → realization: the spectrogram makes the sequence short enough to attend over. A 30-second clip’s 480,000 raw samples become about 3,000 time-frames, each a vector of a few dozen frequency bins. The total count of numbers shrinks only modestly; what collapses is the sequence length, by a factor of 160, and the result is organized exactly the way the information lives (speech is patterns of frequency over time). You traded an unreadable 1-D signal for a readable 2-D one.
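
The arithmetic behind those numbers is worth checking once. The sketch below uses only figures quoted in this tutorial (16 kHz, 30-second chunks, 25 ms window, 10 ms step, 80 frequency bins); counting frames as samples divided by hop is a simplifying assumption, since padding conventions vary between implementations.

```python
SAMPLE_RATE = 16_000   # samples per second
CHUNK_SECONDS = 30
WINDOW_MS = 25
HOP_MS = 10
N_BINS = 80            # frequency bins per frame

n_samples = SAMPLE_RATE * CHUNK_SECONDS      # 480000 raw samples
win_len = SAMPLE_RATE * WINDOW_MS // 1000    # 400 samples per window
hop_len = SAMPLE_RATE * HOP_MS // 1000       # 160 samples per step
n_frames = n_samples // hop_len              # 3000 time-frames

grid_values = n_frames * N_BINS              # 240000 numbers in the image
length_reduction = n_samples // n_frames     # sequence is 160x shorter
```

The grid still holds 240,000 numbers, so the win is not raw size. It is that the transformer now attends over 3,000 positions instead of 480,000.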

Waveform → spectrogram

Top: the raw waveform. Bottom: its spectrogram — each column is one short window’s frequency content. Change the sound’s pitch and watch the bright bands slide up or down. That picture is what the model actually reads.

Why convert the waveform into a spectrogram before the transformer?

To make the file smaller on disk
It compresses the long raw signal into a compact time×frequency image organized the way speech information actually lives
Transformers can only read color images

Chapter 2: The Mel Scale & Log Loudness

Whisper doesn’t use a raw spectrogram — it uses a log-mel spectrogram. Two tweaks, both borrowed from how human hearing works, and both make the picture more informative per pixel.

First, the mel scale. We don’t hear frequency linearly — the gap between 100 and 200 Hz sounds huge, but 5,000 to 5,100 Hz is imperceptible. So instead of evenly-spaced frequency bins, Whisper warps them onto the mel scale: many fine bins down low where we hear detail, few coarse bins up high. Whisper uses 80 mel bins. The picture now devotes resolution where it matters perceptually.
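
To see the warping in numbers, here is a small sketch using one widely used mel formula, mel = 2595 · log10(1 + f / 700). That formula is an assumption for illustration: several mel variants exist, and the filterbank in any given Whisper implementation may differ in detail. The 0 to 8,000 Hz range is the band a 16 kHz signal can represent; the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (one common formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_bin_centers(n_mels, f_min, f_max):
    """Centers of n_mels bins spaced evenly on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]

# The same 100 Hz gap is large down low and tiny up high.
low_gap = hz_to_mel(200.0) - hz_to_mel(100.0)
high_gap = hz_to_mel(5100.0) - hz_to_mel(5000.0)

# How many of 80 bins land below 1 kHz (one eighth of the 0-8 kHz range)?
centers = mel_bin_centers(80, 0.0, 8000.0)
bins_below_1k = sum(1 for c in centers if c < 1000.0)
```

Evenly spaced linear bins would put 10 of 80 below 1 kHz. The mel spacing puts well over twice that many there, which is the "many fine bins down low" in the text.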

Second, the log. Loudness spans a vast range, and we perceive it logarithmically (that’s what decibels are). Taking the log of the energy compresses that range, so a quiet consonant and a loud vowel are both visible, instead of the loud parts washing everything else out. The result — an 80-bin, log-scaled, mel-warped spectrogram — is the exact input Whisper’s encoder sees.
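
A tiny sketch of what the log buys you. The energy values and the floor of 1e-10 (used to avoid taking the log of zero) are made-up assumptions for illustration, and real implementations typically add further clamping and rescaling on top of this.

```python
import math

def log_compress(energy, floor=1e-10):
    """Base-10 log of an energy value, with a floor to avoid log(0)."""
    return math.log10(max(energy, floor))

quiet_consonant = 1e-4   # hypothetical energy of a soft sound
loud_vowel = 1e2         # hypothetical energy of a loud sound

linear_ratio = loud_vowel / quiet_consonant                          # a million to one
log_span = log_compress(loud_vowel) - log_compress(quiet_consonant)  # about 6
silence = log_compress(0.0)                                          # floored, not -inf
```

On a linear scale the vowel is a million times brighter than the consonant, so the consonant is invisible. After the log they sit about six units apart, and both show up in the picture.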

Linear vs. mel frequency bins

Same sound, two binnings. Linear (orange) spreads bins evenly and wastes resolution on highs we barely distinguish; mel (teal) packs bins into the low/mid range where speech and hearing live. Toggle and compare.

Why does Whisper use a log-MEL spectrogram rather than a raw linear one?

It looks nicer
Mel-warping and log-scaling match human hearing, concentrating resolution where we perceive detail and compressing loudness so quiet and loud sounds are both visible
Raw spectrograms are impossible to compute

Chapter 3: The Encoder

Now the standard machinery takes over. Whisper processes audio in 30-second chunks. The 80-bin log-mel spectrogram of a chunk is about 3,000 time-frames wide. It first passes through two small convolution layers — the second with stride 2, which halves the time length to ~1,500 frames and gives a gentle local smoothing. Add positional encodings, and you have a sequence of ~1,500 audio “tokens.”

That sequence goes into a standard transformer encoder — self-attention plus MLPs, exactly like a text transformer, but over audio frames. Each frame attends to all others, so the encoder builds a rich representation where every moment knows its acoustic context. The output is ~1,500 context-rich vectors summarizing what was said and how.

log-mel [80 × 3000] (30s chunk)
↓ 2 conv layers (stride 2)
frames [1500 × d] (+ positional encoding)
↓ transformer encoder (self-attn × N)
audio features [1500 × d] (context-rich per-frame vectors)

Note what didn’t happen: no audio-specific architecture, no recurrent network, no hand-crafted phoneme model. Just “treat spectrogram frames as a sequence and run a transformer.” The audio became, structurally, just another sequence.
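
The shape bookkeeping in the conv stem can be checked with the standard output-length formula for a 1-D convolution. The kernel size of 3 and padding of 1 are assumptions not stated in this chapter; only the strides (1, then 2) and the 3,000 → 1,500 result come from the text above.

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1-D convolution (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1

mel_frames = 3000
after_conv1 = conv1d_out_len(mel_frames, kernel=3, stride=1, padding=1)   # 3000
after_conv2 = conv1d_out_len(after_conv1, kernel=3, stride=2, padding=1)  # 1500

ms_per_token = 30_000 / after_conv2          # each audio token spans 20 ms
pairs_without_stride = mel_frames ** 2       # attention pairs at 3000 frames
pairs_with_stride = after_conv2 ** 2         # attention pairs at 1500 frames
```

Halving the length cuts the number of frame pairs that self-attention has to score by a factor of four, which is a large part of why the stride is there.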

Encoder data flow

Trace the shapes: a 30-second mel spectrogram is convolved down to ~1500 frames, then a transformer encoder mixes them with self-attention. Hover the stages.

What does the encoder’s convolution stem (with stride 2) accomplish?

It converts text to audio
It halves the time length (~3000→~1500 frames) and locally smooths before the transformer self-attends
It removes all frequency information

Chapter 4: The Decoder

The decoder is a standard autoregressive text transformer — it generates the transcript one token at a time, just like a language model. Its one special power: every layer also does cross-attention into the encoder’s audio features. So as it predicts each next word, it can “look back” at the relevant moments of audio and ask “what sound supports this word?”

This is the same encoder-decoder design as the original Transformer for translation — except the “source language” is audio frames and the “target language” is text tokens. The decoder’s self-attention keeps the transcript fluent and grammatical (it’s a language model, after all), while cross-attention keeps it faithful to what was actually said. Fluency from the text side, faithfulness from the audio side — the two attentions divide the labor.
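
Cross-attention itself is small enough to write out. The sketch below is a simplification under stated assumptions: one decoder query, four made-up 2-D audio frames, and the frames used directly as both keys and values. A real model applies learned projections to produce queries, keys, and values, and uses many heads.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(width)]
    return weights, context

# Hypothetical encoder output: 4 audio frames, 2 features each.
audio_frames = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0], [0.0, 0.5]]
# Hypothetical decoder state while predicting the next text token.
decoder_query = [2.0, 0.0]

weights, context = cross_attend(decoder_query, audio_frames, audio_frames)
most_attended = max(range(len(weights)), key=weights.__getitem__)
```

The weights are the "line weights" in the simulation below: they sum to 1, and here almost all of the mass lands on the third frame, the one that best matches what the decoder is looking for.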

Why an autoregressive decoder helps accuracy: because it’s a language model, it knows “recognize speech” is far more likely than “wreck a nice beach,” even when the audio is ambiguous. The language prior fills gaps that the acoustics leave open — exactly how humans use context to disambiguate mumbled speech.
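
A toy version of that disambiguation, with loud caveats: the probabilities are invented, and Whisper does not keep separate acoustic and language scores. Its decoder learns both jointly in one network. Splitting them into two numbers here is only a way to show why a language prior can overturn a slightly better acoustic match.

```python
import math

# Invented log-probabilities for two readings of the same ambiguous audio.
candidates = {
    "recognize speech": {
        "acoustic": math.log(0.45),     # fits the sound slightly worse
        "language": math.log(0.02),     # but is a plausible phrase
    },
    "wreck a nice beach": {
        "acoustic": math.log(0.55),     # fits the sound slightly better
        "language": math.log(0.0001),   # but is a very unlikely phrase
    },
}

def combined_score(scores):
    """Sum of log-probabilities, i.e. the product of the two probabilities."""
    return scores["acoustic"] + scores["language"]

best_by_audio = max(candidates, key=lambda t: candidates[t]["acoustic"])
best_overall = max(candidates, key=lambda t: combined_score(candidates[t]))
```

Judged by sound alone, the beach wins. Once the language prior is added, "recognize speech" wins comfortably.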

Decoder cross-attends to audio

As each text token is generated (bottom), it cross-attends to the audio frames (top) — line weight shows which moments of sound it’s drawing on. Step through the tokens.

What is the role of cross-attention in Whisper’s decoder?

It generates the spectrogram
It lets each predicted text token look back at the relevant audio frames, keeping the transcript faithful to the sound
It removes the need for a language model

Chapter 5: One Model, Many Tasks

Whisper doesn’t just transcribe. The same weights also translate speech to English, detect the language, and add timestamps. How does one model do all that? Through special tokens in the decoder’s prompt. Before it generates the transcript, the decoder is fed a little sequence of control tokens that tell it what to do.

<|startoftranscript|> (begin)
→ <|es|> (language = Spanish)
→ <|transcribe|> or <|translate|> (the task)
→ <|notimestamps|>? (timestamps on/off)

Flip <|transcribe|> to <|translate|> and the very same model now outputs an English translation of the Spanish audio instead of a Spanish transcript. Set the language token and it transcribes that language; leave it out and the model predicts the language first. This is the same trick as instruction-prompting an LLM — the task is specified in the input sequence, not baked into the architecture — which is why one set of weights covers a whole family of speech tasks.
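
Assembling that control sequence is a few lines of code. The helper below is a hypothetical sketch that only builds the token strings shown in this chapter; it does not call any real tokenizer or library. It takes the language as a required argument, since leaving the language out means letting the model predict that token itself.

```python
def build_prompt(language, task="transcribe", timestamps=True):
    """Return the decoder's control tokens as strings (illustrative only)."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

spanish_transcript = build_prompt("es")
english_translation = build_prompt("es", task="translate", timestamps=False)
```

The two prompts differ by a single task token (plus the optional timestamp switch), yet they ask the same weights for two different jobs on the same audio.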

Build the task prompt

Toggle the control tokens and watch what the same model produces from the same audio. Task and language live in the prompt, not in separate models.

How does one Whisper model perform transcription, translation, language-ID, and timestamps?

Four separate models are swapped in
Special control tokens in the decoder prompt select the task and language; the architecture is unchanged
It retrains itself for each task

Chapter 6: Weak Supervision at Scale

Here is the real engine of Whisper’s robustness, and it’s about data, not architecture. Whisper was trained on 680,000 hours of audio paired with transcripts scraped from the internet (podcasts, videos, lectures), spanning close to 100 languages. The transcripts are weakly supervised: noisy, imperfect, and cleaned only by automated filters (for example, to drop machine-generated captions) rather than by hand. The bet was that scale and diversity beat cleanliness.

And it paid off spectacularly. Because the data spans a very wide range of accents, recording qualities, background noise, and domains, the model learns features that generalize instead of overfitting to one clean distribution. It needs no fine-tuning because its training set already contained much of the diversity of the real world. A model trained on 1,000 hours of pristine studio audio is a specialist; a model trained on 680,000 hours of the chaotic internet is a generalist.

Common misconception: “noisy labels ruin training.” At small scale, yes. At massive scale, the noise averages out while the diversity remains — and diversity is exactly what robustness requires. Whisper deliberately chose 680k noisy hours over a small clean set, and it was the right call. This is one of the defining lessons of the scaling era: more diverse data, even imperfect, often beats less perfect data.
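
"Averages out" has a simple statistical core, sketched below. This is a toy, not a model of Whisper's training: it assumes each label is the true value plus independent, zero-mean noise, in which case the error of the average shrinks like 1/√n. Systematic errors (the same mistake repeated everywhere) would not shrink this way. Function names and numbers are made up for the example.

```python
import math
import random

def standard_error(noise_std, n_examples):
    """Expected error of an average of n independent noisy labels."""
    return noise_std / math.sqrt(n_examples)

def simulated_error(true_value, noise_std, n_examples, seed=0):
    """Average n noisy labels and report how far off the average is."""
    rng = random.Random(seed)
    total = sum(true_value + rng.gauss(0.0, noise_std)
                for _ in range(n_examples))
    return abs(total / n_examples - true_value)

small_set = standard_error(noise_std=1.0, n_examples=100)
large_set = standard_error(noise_std=1.0, n_examples=1_000_000)
observed = simulated_error(true_value=1.0, noise_std=1.0, n_examples=200_000)
```

With the same per-label noise, going from a hundred examples to a million cuts the expected error by a factor of one hundred. The diversity in those extra examples comes along for free.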

Data scale vs. real-world robustness

Robustness on diverse real audio as training hours grow. A small clean dataset (orange dot) is accurate in-domain but brittle; pile on diverse weakly-labeled hours (teal) and robustness climbs steeply. Slide the data scale.

Why did Whisper train on 680,000 hours of noisy, weakly-labeled web audio instead of a small clean dataset?

Clean data was unavailable
At massive scale, the label noise averages out while the diversity teaches robustness, generalizing across accents, noise, and domains without fine-tuning
Noisy data trains faster

Chapter 7: Transcribing, End to End (showcase)

Watch the whole pipeline run: speech comes in, becomes a log-mel spectrogram, the encoder turns it into audio features, and the decoder generates the transcript token by token, cross-attending to the audio. Add noise and the transcript holds up — the payoff of scale-driven robustness. Switch the task token to see translation instead.

The full Whisper pipeline

Press Listen. Audio → mel spectrogram → encoder → decoder emits tokens. Crank the noise: a brittle recognizer’s output garbles, Whisper’s stays readable. Flip the task to translate and the output language changes — same model, same audio.

Every stage you see is something we built: the spectrogram front-end, the mel/log perceptual warping, the conv-plus-transformer encoder, the cross-attending autoregressive decoder, the task tokens, and the scale-bought robustness that keeps the output clean through noise. A standard transformer — pointed at sound made visible, trained on the whole messy world.

Chapter 8: Decoding, Limits & the Audio Family

A few practical realities and honest limits:

Long audio: Whisper only sees 30 seconds at a time, so long recordings are chunked and stitched — with the previous transcript fed back as context to keep continuity across boundaries.
Hallucination on silence: because the decoder is a language model, on silence or non-speech it can “hallucinate” plausible-sounding text from nothing. Voice-activity detection and temperature fallback mitigate it.
Decoding tricks: beam search for accuracy, temperature fallback when confidence drops, and timestamp tokens for alignment — the same kind of decoding care any autoregressive model needs.
Not real-time by default: it’s built for accuracy on chunks, not low-latency streaming (though streaming variants exist).
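
Temperature fallback, mentioned twice in the list above, can be sketched as a retry loop. Everything concrete here is an assumption for illustration: the temperature schedule, the two thresholds, the use of a zlib compression ratio as a repetition detector, and the `decode_fn` callback, which stands in for a real decoder and is hypothetical.

```python
import zlib

TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)   # assumed schedule

def compression_ratio(text):
    """Highly repetitive text compresses well, giving a large ratio."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

def decode_with_fallback(decode_fn, max_ratio=2.4, min_avg_logprob=-1.0):
    """Retry at higher temperatures until the output looks healthy.

    decode_fn(temperature) must return (text, average_log_probability).
    """
    result = None
    for temperature in TEMPERATURES:
        text, avg_logprob = decode_fn(temperature)
        result = (text, avg_logprob, temperature)
        too_repetitive = compression_ratio(text) > max_ratio
        too_unsure = avg_logprob < min_avg_logprob
        if not (too_repetitive or too_unsure):
            break   # accept the first output that passes both checks
    return result

def fake_decode(temperature):
    """Stand-in decoder: loops at low temperature, recovers at 0.4."""
    if temperature < 0.4:
        return ("la " * 200, -0.3)
    return ("the meeting starts at noon", -0.4)

text, avg_logprob, used_temperature = decode_with_fallback(fake_decode)
```

The stand-in decoder gets stuck repeating itself at low temperature, the repetition check rejects those attempts, and the loop settles on the first temperature that produces a sane sentence.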

Whisper is one member of a fast-growing family of audio transformers. The same “spectrogram → transformer” recipe powers audio classification (the Audio Spectrogram Transformer). Close relatives swap the front-end: self-supervised speech models (Wav2Vec 2.0, HuBERT) learn convolutional features directly from the raw waveform, and neural codecs turn audio into discrete tokens for generative audio models. Whisper is the recognition flagship; the pattern you learned here, short frames of audio fed to a transformer, is shared across all of them.

Chunking long audio

Long recordings are split into 30s windows; each is transcribed with the previous text as context, then stitched. Drag the recording length to see the chunks.
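
The chunk-and-stitch loop can be sketched as follows. It assumes fixed, non-overlapping 30-second windows, which is a simplification: an implementation may instead move each window's start based on the timestamps it predicted. `transcribe_chunk` is a hypothetical callback standing in for the model.

```python
def chunk_bounds(total_seconds, chunk_seconds=30.0):
    """Return (start, end) times of consecutive windows covering the audio."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds

def transcribe_long(total_seconds, transcribe_chunk):
    """Transcribe window by window, feeding the previous text as context."""
    pieces = []
    context = ""
    for start, end in chunk_bounds(total_seconds):
        text = transcribe_chunk(start, end, context)
        pieces.append(text)
        context = text   # the next window sees what was just said
    return " ".join(pieces)

def fake_transcribe(start, end, context):
    """Stand-in for the model: just reports which window it was given."""
    return f"[{int(start)}-{int(end)}s]"

two_minutes = chunk_bounds(120)
stitched = transcribe_long(75, fake_transcribe)
```

A 120-second recording yields four windows. A 75-second one yields three, the last only 15 seconds long, which a real pipeline would pad out to the full 30.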

Why can Whisper “hallucinate” text on a silent clip?

Its microphone is broken
Its decoder is a language model, so with no real acoustic evidence it can still generate plausible-sounding text
The spectrogram is too large

Chapter 9: Cheat Sheet & Connections

waveform: 16 kHz audio, 30s chunks
↓ short-window FFT
log-mel spectrogram: 80 mel bins × ~3000 frames; mel-warped + log (human hearing)
↓ 2 conv (stride 2) + encoder
audio features: ~1500 frames, transformer self-attention
↓ autoregressive decoder + cross-attn
text tokens: task/language set by special prompt tokens

Piece	What it does	Why
Log-mel front-end	sound → time×freq image	compact, perceptual, transformer-readable
Conv stem	downsample ×2 + smooth	shorter sequence, local features
Encoder	self-attention over frames	context-rich audio representation
Decoder	autoregressive text + cross-attn	fluent (LM) + faithful (audio)
Task tokens	prompt control	one model, many tasks
680k weak hours	diverse training data	robustness without fine-tuning

Keep exploring

→ The Transformer — the encoder-decoder Whisper reuses wholesale
→ Attention Variants — self- vs cross-attention in depth
→ Embedding Layers — how the mel frames become tokens
→ Vision-Language Models — the same cross-attention bridge, for images

“What I cannot create, I do not understand.” You just rebuilt Whisper from a wave: make sound visible as a log-mel picture, run a vanilla transformer encoder-decoder over it with cross-attention, control the task with prompt tokens, and train on 680,000 hours of the messy real world. No new architecture — just sound, seen, at scale.