Audio & Speech

Self-Supervised Speech

BERT for audio. How Wav2Vec 2.0 and HuBERT learn rich speech representations from unlabeled sound — by masking spans and predicting hidden “speech units” — so that ten minutes of transcripts can train a recognizer that used to need thousands of hours.

Prerequisites: masked language modeling (hide words, then predict them) and transformers (which build contextual representations). That’s it.


Chapter 0: Labels Are Scarce

To train a speech recognizer the old way, you need transcribed audio — thousands of hours of someone painstakingly typing out what was said. That labeling is slow, expensive, and barely exists for most of the world’s 7,000 languages. Meanwhile, unlabeled audio — podcasts, videos, radio, recordings — is essentially infinite and free. The mismatch is enormous: a trickle of labels, an ocean of raw sound.

Self-supervised learning for speech closes that gap. The idea, borrowed from how BERT learns language: pretrain on the ocean of unlabeled audio with a task that needs no labels, learning what speech is — its sounds, its structure. Then fine-tune on a tiny pile of labeled data for the actual task (recognition). Wav2Vec 2.0 showed you could reach strong recognition with as little as 10 minutes of labeled speech — a task that used to demand thousands of hours. This lesson builds that machinery.

The trap: “to recognize speech, you must train on transcribed speech.” You must train on transcribed speech eventually — but only a little, and only at the end. The hard part — learning the structure of sound — can be done first, for free, from unlabeled audio. Labels are then just a thin layer that maps already-learned representations to words.

The label gap

Available audio for a typical language: a vast pool of unlabeled sound (teal) vs a sliver of transcribed audio (orange). Self-supervision learns from the pool first, then uses the sliver. Drag to a low-resource language and watch the orange vanish.


What problem does self-supervised speech learning solve?

Audio files are too large to store
Transcribed (labeled) audio is scarce and expensive, while unlabeled audio is abundant, so learn structure from unlabeled audio first, then fine-tune on a little labeled data
Microphones are too noisy

Chapter 1: Pretrain, Then Fine-tune

The recipe has two phases. Pretraining: on huge unlabeled audio, train the model on a pretext task — a self-made puzzle (predict hidden parts of the audio) that requires no human labels but forces the model to learn speech structure. Fine-tuning: take that pretrained model and train it briefly on a small labeled dataset for the real task (e.g. with a CTC loss mapping representations to letters), reusing everything it already learned.

This is transfer learning, the same paradigm behind BERT and GPT for text and ImageNet pretraining for vision — now for speech. The pretrained model is a reusable foundation: one expensive pretraining run, then cheap fine-tuning for many downstream tasks (recognition, speaker ID, emotion, language ID). The expensive part (learning what speech is) happens once; adapting to a task is cheap.

PRETRAIN: huge unlabeled audio + pretext task (mask & predict) → speech representations
↓ reuse the pretrained model (frozen or further trained)
FINE-TUNE: tiny labeled set + task head (CTC) → recognizer
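The CTC head in the fine-tuning step emits one label per ~20 ms frame, so several consecutive frames usually carry the same letter. As a minimal sketch of how such per-frame output becomes text (greedy decoding only, not the CTC loss itself), assuming a made-up three-letter vocabulary and a blank symbol at index 0:

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol
# Hypothetical toy vocabulary; a real model defines its own.
VOCAB = {1: "c", 2: "a", 3: "t"}

def ctc_greedy_decode(frame_ids):
    """Collapse runs of repeated frame labels, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(VOCAB[i] for i in collapsed if i != BLANK)

# Per-frame argmax labels for a short clip (made-up values).
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames))  # cat
```

The blank is what lets CTC emit a doubled letter: `[1, 0, 1]` decodes to "cc", while `[1, 1]` decodes to "c".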

In the self-supervised recipe, what happens during pretraining?

The model is trained on transcripts to recognize words
The model learns speech structure from unlabeled audio via a pretext task (predict hidden parts), with no human labels
Nothing: pretraining is skipped

Chapter 2: The Feature Encoder

Both Wav2Vec 2.0 and HuBERT start by turning the raw waveform into a sequence of feature vectors — but notably, they skip the spectrogram. Instead of a hand-designed mel front-end, a stack of strided 1-D convolutions reads the raw waveform directly and learns its own features, producing roughly one latent vector every 20 milliseconds (about 50 per second). The model discovers what aspects of the raw signal matter, rather than being handed mel bands.

This is the “learned front-end” trend from the Audio Representations lesson made real: rather than the fixed pipeline of FFT → mel → log, a convolutional encoder learns a representation tuned to the pretraining objective. These latent feature vectors — one per ~20 ms frame — are the raw material the rest of the model works on: they get masked, quantized, and contextualized.
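The ~20 ms figure follows from the strides of the conv stack. The sketch below does that arithmetic, assuming the kernel widths and strides reported for Wav2Vec 2.0 (which HuBERT reuses) and 16 kHz input; the constants are assumptions taken from the published configuration, not something this lesson specifies.

```python
# Kernel widths and strides of the seven conv layers, as reported for
# Wav2Vec 2.0 (assumed here; HuBERT reuses the same encoder layout).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz, assumed input rate

def num_frames(num_samples):
    """Output length of the unpadded conv stack for a given input length."""
    length = num_samples
    for k, s in zip(KERNELS, STRIDES):
        length = (length - k) // s + 1
    return length

def hop_and_receptive_field():
    """Total stride (hop) and receptive field, both in samples."""
    hop, field = 1, 1
    for k, s in zip(KERNELS, STRIDES):
        field += (k - 1) * hop  # each layer widens the view by (k - 1) hops
        hop *= s                # and multiplies the spacing between outputs
    return hop, field

hop, field = hop_and_receptive_field()
print(hop * 1000 / SAMPLE_RATE)    # 20.0 ms between frames
print(field * 1000 / SAMPLE_RATE)  # 25.0 ms of audio seen by each frame
print(num_frames(SAMPLE_RATE))     # 49 frames for one second of audio
```

One second yields 49 frames rather than exactly 50 because the convolutions are unpadded, so a little audio at the edges is consumed by the receptive field.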

Raw waveform → learned features

A conv stack reads the waveform (top) and emits a latent feature vector every ~20 ms (bottom row). No spectrogram — the features are learned. Drag the stride to see the frame rate change.


How do Wav2Vec 2.0 / HuBERT turn the waveform into features?

A fixed mel spectrogram, like Whisper
A stack of strided 1-D convolutions that learns its own features directly from the raw waveform (~50 frames/sec)
A lookup table of phonemes

Chapter 3: Masking — the pretext task

Here’s the self-made puzzle, lifted straight from BERT. Take the sequence of latent feature frames and mask out spans of them — replace chunks with a mask token, hiding that audio from the model. Then feed the masked sequence through a transformer (the context network), and ask it to figure out what was in the masked spans, using only the surrounding context. No labels needed — the answer is the audio you just hid.

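To solve this, the model must learn the structure of speech: coarticulation (how sounds blend), phonotactics (which sounds follow which), prosody, the identity of the speaker. To fill a masked gap with the right sound, you have to understand the sounds around it, which is exactly the kind of contextual understanding that makes a representation useful downstream. Wav2Vec 2.0 masks spans rather than isolated frames: it picks about 6.5% of frames as span starts and hides the 10 frames beginning at each one (roughly 200 ms), so about half the sequence ends up masked once overlapping spans are counted. Spans that long cannot be filled in by simple interpolation from neighboring frames, which makes the task hard enough to force real learning.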

Why masking works as a teacher: predicting a hidden sound from context is impossible without a model of how speech is structured. The pretext task is a forcing function — the only way to get good at it is to internalize the regularities of speech. Those internalized regularities are the representation we’re after; recognition is then a short hop away.
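Span masking as described above can be sketched in a few lines. The defaults (6.5% of frames as span starts, spans of 10 frames) are the values reported for Wav2Vec 2.0; the function name and the fixed seed are illustrative choices, and a real implementation would operate on batched tensors.

```python
import random

def sample_span_mask(num_frames, start_prob=0.065, span=10, seed=0):
    """Return a list of booleans, True where a frame is masked.

    A fraction `start_prob` of frames is chosen as span starts, and the
    `span` frames beginning at each start are masked. Spans may overlap,
    and spans near the end are clipped to the sequence length.
    """
    rng = random.Random(seed)
    num_starts = int(start_prob * num_frames)
    starts = rng.sample(range(num_frames), num_starts)
    mask = [False] * num_frames
    for start in starts:
        for t in range(start, min(start + span, num_frames)):
            mask[t] = True
    return mask

mask = sample_span_mask(num_frames=2000)
print(sum(mask) / len(mask))  # roughly one half, because spans overlap
```

Without overlap the masked share would be 0.065 × 10 = 65%; overlapping spans bring it down to about half.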

Mask spans, predict the hidden audio

Feature frames with random spans masked (orange). The transformer must reconstruct what belonged in the masked spans from the surrounding context. Drag the mask amount: too little is trivial, too much is impossible.


What is the pretext task in self-supervised speech learning?

Transcribe the audio into text
Mask spans of the feature frames and predict what was hidden from the surrounding context (no labels needed)
Classify the speaker’s emotion

Chapter 4: Speech Units — discretizing the target

There’s a wrinkle. The masked frames are continuous vectors, and an exact continuous vector makes a vague prediction target: when several sounds are plausible for a gap, the error-minimizing guess is their average, a blur that matches none of them. The fix, shared by both methods: turn the prediction target into a small set of discrete speech units. Quantize the continuous features into a finite codebook (a learned alphabet of, in effect, “phoneme-like” sound categories) and have the model predict the right unit for each masked frame, not the exact vector.

This is the same vector-quantization idea as neural codecs: snap a continuous vector to the nearest of a finite set of learned prototypes, and the “answer” becomes a category index. Predicting a discrete category is a crisp, well-posed classification problem — far easier and more stable than regressing a fuzzy continuous vector. And remarkably, these learned units often correspond loosely to real phonetic categories, discovered from raw audio with no phonetic labels at all. The model invents its own alphabet of speech.
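The "snap to the nearest prototype" step is small enough to write out. This is a sketch of nearest-neighbor quantization only, with a hand-written 2-D codebook standing in for a learned one; it is not how either model trains its codebook.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(frame, codebook):
    """Return the index of the codebook entry closest to `frame`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(frame, codebook[i]))

# Hypothetical 2-D codebook with three units; real codebooks are learned.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.2), (0.9, 0.2), (0.2, 0.7)]
print([quantize(f, codebook) for f in frames])  # [0, 1, 2]
```

Each continuous frame is replaced by a category index, and that index is what the model is asked to predict.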

Continuous features cluster into discrete units

Continuous feature frames (dots) cluster into a finite set of discrete speech units (colored regions). The prediction target is “which unit,” not the exact vector. Drag the number of units — the model’s self-discovered phoneme-like alphabet.


Why discretize the prediction target into “speech units”?

To save disk space
Predicting a discrete category is a crisp, well-posed problem; regressing the exact continuous vector is vague and prone to blur/averaging
Because transformers can’t read continuous values

Chapter 5: Wav2Vec 2.0 — the contrastive way

Wav2Vec 2.0 (Facebook AI, now Meta, 2020) solves the masked task contrastively. At each masked position, the transformer produces a contextual representation. The model is then asked: among a set of candidate quantized units (the true one for this frame plus distractors sampled from other masked positions in the same utterance; the paper uses 100), pick the true one. It’s a multiple-choice question: “which discrete unit really belongs in this gap?” The model gets it right by making the context representation most similar to the true unit and dissimilar from the distractors.

The quantization uses a Gumbel-softmax codebook so it’s differentiable, and the contrastive loss (the same InfoNCE idea from contrastive learning) pulls the masked context toward its true unit and pushes it away from the distractors. Pretraining on tens of thousands of hours this way produces representations so good that a thin recognition head, fine-tuned on minutes of labeled audio, reaches accuracy that used to require thousands of labeled hours. The contrastive choice is elegant but has a subtlety: it depends on having good distractors and a codebook that doesn’t collapse onto a few entries, which is why the paper adds a diversity penalty that rewards using the whole codebook.
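The multiple-choice objective can be written as a softmax over similarity scores. The sketch below computes the loss for a single masked frame, assuming cosine similarity and a temperature of 0.1 as in the Wav2Vec 2.0 paper; the vectors are toy values, and the diversity penalty and Gumbel-softmax quantizer are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_loss(context, true_unit, distractors, temperature=0.1):
    """Negative log-probability of the true unit among all candidates."""
    candidates = [true_unit] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]

# Toy 2-D example: one true unit and two distractors (made-up values).
true_unit = (1.0, 0.0)
distractors = [(0.0, 1.0), (-1.0, 0.0)]
good = contrastive_loss((0.9, 0.1), true_unit, distractors)
bad = contrastive_loss((0.1, 0.9), true_unit, distractors)
print(good < bad)  # True
```

A context vector that points toward the true unit gets a loss near zero; one that points toward a distractor is penalized heavily, and the low temperature sharpens that difference.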

Contrastive: pick the true unit among distractors

At a masked frame, the model scores candidates — the true unit (teal) and distractors (gray) — and must rank the true one highest. Drag the number of distractors: more makes the multiple-choice harder and the learning stronger.


How does Wav2Vec 2.0 train on masked frames?

It regresses the exact waveform samples
Contrastively: at each masked frame, identify the true quantized unit among distractors (InfoNCE)
It uses the transcript as the target

Chapter 6: HuBERT — cluster, then predict

HuBERT (Facebook AI, now Meta, 2021) reaches the same goal in a simpler, more stable way. Instead of a contrastive task with distractors, it does two steps. First, cluster: run k-means on acoustic features (initially simple MFCCs) to assign every frame a cluster ID, which gives cheap, automatic pseudo-labels. Second, predict: mask spans and train the transformer to predict the cluster ID of the masked frames. That is a plain classification (cross-entropy) problem, exactly like BERT predicting a masked word from a fixed vocabulary.

The clever part is iteration. The first round’s clusters (from crude MFCCs) are rough, but the model trained to predict them learns decent representations. So you re-cluster on those learned representations, getting better pseudo-labels, and train again. Each round’s representations improve the next round’s targets, which improve the representations — a bootstrapping spiral toward genuinely phoneme-like units. HuBERT avoids the contrastive method’s fiddliness (no distractor sampling, no Gumbel codebook) and tends to be more robust, which is why it’s widely used.
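The two HuBERT ingredients, k-means pseudo-labels and a cross-entropy loss over masked frames, can be sketched in plain Python. Everything here is illustrative: the frames are toy 2-D points standing in for MFCCs, the logits are hand-made stand-ins for a transformer's output, and the function names are hypothetical.

```python
import math
import random

def kmeans_labels(frames, k, iters=10, seed=0):
    """Plain k-means; returns one cluster ID (pseudo-label) per frame."""
    rng = random.Random(seed)
    centers = rng.sample(frames, k)
    labels = [0] * len(frames)
    for _ in range(iters):
        # Assign each frame to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(f, centers[j]))
                  for f in frames]
        # Move each center to the mean of its assigned frames.
        for j in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def masked_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over masked frames only."""
    total, count = 0.0, 0
    for frame_logits, label, is_masked in zip(logits, labels, mask):
        if not is_masked:
            continue
        top = max(frame_logits)
        log_sum = top + math.log(sum(math.exp(x - top) for x in frame_logits))
        total += log_sum - frame_logits[label]
        count += 1
    return total / max(count, 1)

# Toy "MFCC" frames: two well-separated blobs (made-up data).
frames = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
          (5.0, 5.1), (5.1, 5.0), (4.9, 5.2)]
labels = kmeans_labels(frames, k=2)
mask = [False, True, False, False, True, False]
# Hypothetical model outputs: one score per cluster for each frame.
logits = [[2.0, 0.0] if lab == 0 else [0.0, 2.0] for lab in labels]
loss = masked_cross_entropy(logits, labels, mask)
print(labels[:3], labels[3:], round(loss, 3))
```

In a later round, `frames` would be features taken from the model trained in the previous round rather than MFCCs, and `kmeans_labels` would be rerun to produce the refined targets.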

The contrast in one line: Wav2Vec asks “which unit, vs these distractors?” (contrastive, one stage, differentiable codebook). HuBERT asks “which cluster ID?” (classification, with offline k-means pseudo-labels, refined over iterations). Same masked-prediction spirit; different, arguably simpler, machinery.

Cluster → pseudo-label → predict → re-cluster

Each iteration: cluster the current features into pseudo-labels, train to predict them, then re-cluster on the improved features. Step through iterations and watch the clusters sharpen into clean, phoneme-like groups.


How does HuBERT differ from Wav2Vec 2.0?

It uses transcripts during pretraining
It k-means-clusters features into pseudo-labels and predicts the masked cluster ID (classification), iterating to refine, with no contrastive distractors
It doesn’t use masking

Chapter 7: Data Efficiency, Live (showcase)

The payoff is data efficiency. Watch recognition error fall as you add labeled fine-tuning data — for a model trained from scratch versus one pretrained self-supervised on unlabeled audio. The pretrained model starts far lower and needs a tiny fraction of the labels to reach good accuracy. This is the curve that made self-supervision the default for speech.

Word error rate vs. labeled hours

Drag the amount of labeled fine-tuning data. From-scratch (orange) needs thousands of hours to get good; the self-supervised pretrained model (teal) reaches strong accuracy with minutes-to-hours. The gap at low labels is the whole point.


At 10 minutes of labels, the from-scratch model is hopeless while the pretrained one is already usable. That’s because pretraining already taught it what speech is — the labels only need to teach it the mapping to words. The expensive learning was done for free, on unlabeled audio.

Chapter 8: Impact & the Bigger Picture

Low-resource languages: the biggest win. Pretrain on whatever unlabeled audio exists (even multilingual, as in XLS-R), fine-tune on the handful of transcripts available — bringing speech tech to languages that could never afford thousands of labeled hours.
Reusable features: the pretrained encoder feeds many tasks — recognition, speaker verification, emotion, language ID — each with a small head. One foundation, many uses.
Semantic tokens for audio LMs: discrete units from self-supervised models serve as the “semantic tokens” of generative systems (HuBERT units in GSLM, w2v-BERT units in AudioLM). They capture content, complementing the codec’s acoustic tokens. SSL units and codec tokens together power generative speech.

How does this relate to Whisper? Whisper went the other way — massive supervised (weakly-labeled) data, no self-supervised pretext task, betting that 680k labeled hours buys robustness directly. Self-supervised models bet that you can learn most of what you need from unlabeled audio and add labels cheaply. Both are valid; SSL shines when labels are scarce (most languages) and as a feature extractor, while Whisper-style supervision shines when you can amass huge labeled corpora. The field increasingly combines them.

Two philosophies: SSL vs. supervised-at-scale

Self-supervised (Wav2Vec/HuBERT): learn from unlabeled, fine-tune on little — wins when labels are scarce. Whisper: learn from massive weakly-labeled data — wins when labels are plentiful. Drag to see where each dominates.


Where do self-supervised speech models shine most relative to Whisper-style supervision?

Only when millions of labeled hours are available
When labeled data is scarce (most languages) and as a reusable feature extractor for many tasks
Only for music generation

Chapter 9: Cheat Sheet & Connections

unlabeled audio (abundant, free)
↓ conv feature encoder (learned, no mel)
feature frames (~50/sec, continuous)
↓ mask spans + transformer + discrete units
pretext task (Wav2Vec: contrastive, pick true unit · HuBERT: predict cluster ID)
↓ fine-tune on tiny labeled set (CTC)
recognizer (strong ASR from minutes of labels)

	Wav2Vec 2.0	HuBERT
pretext	contrastive (InfoNCE)	masked cluster-ID prediction
target	quantized units + distractors	k-means pseudo-labels
codebook	differentiable (Gumbel)	offline k-means, iterated
vibe	elegant, distractor-dependent	simple, stable, widely used

Keep exploring

→ Audio Representations — the learned-front-end idea
→ Neural Audio Codecs — acoustic tokens (vs SSL semantic tokens)
→ Contrastive Learning — the InfoNCE behind Wav2Vec
→ Whisper — the supervised-at-scale alternative

“What I cannot create, I do not understand.” You just rebuilt self-supervised speech: learn features from raw audio with a conv encoder, mask spans, discretize the targets into self-discovered speech units, and predict them — contrastively (Wav2Vec) or by clustering (HuBERT). Pretrain on the ocean of unlabeled sound, then teach it words with a teaspoon of labels.