BERT for audio. How Wav2Vec 2.0 and HuBERT learn rich speech representations from unlabeled sound — by masking spans and predicting hidden “speech units” — so that ten minutes of transcripts can train a recognizer that used to need thousands of hours.
To train a speech recognizer the old way, you need transcribed audio — thousands of hours of someone painstakingly typing out what was said. That labeling is slow, expensive, and barely exists for most of the world’s 7,000 languages. Meanwhile, unlabeled audio — podcasts, videos, radio, recordings — is essentially infinite and free. The mismatch is enormous: a trickle of labels, an ocean of raw sound.
Self-supervised learning for speech closes that gap. The idea is borrowed from how BERT learns language: pretrain on the ocean of unlabeled audio with a task that needs no labels, learning what speech is (its sounds, its structure), then fine-tune on a tiny pile of labeled data for the actual task (recognition). Wav2Vec 2.0 showed that, after pretraining on about 53,000 hours of unlabeled audio, as little as 10 minutes of labeled speech was enough to reach word error rates of 4.8% / 8.2% on the LibriSpeech clean / other test sets, a task that used to demand thousands of labeled hours. This lesson builds that machinery.
Available audio for a typical language: a vast pool of unlabeled sound (teal) vs a sliver of transcribed audio (orange). Self-supervision learns from the pool first, then uses the sliver. Drag to a low-resource language and watch the orange vanish.
The recipe has two phases. Pretraining: on huge unlabeled audio, train the model on a pretext task — a self-made puzzle (predict hidden parts of the audio) that requires no human labels but forces the model to learn speech structure. Fine-tuning: take that pretrained model and train it briefly on a small labeled dataset for the real task (e.g. with a CTC loss mapping representations to letters), reusing everything it already learned.
This is transfer learning, the same paradigm behind BERT and GPT for text and ImageNet pretraining for vision — now for speech. The pretrained model is a reusable foundation: one expensive pretraining run, then cheap fine-tuning for many downstream tasks (recognition, speaker ID, emotion, language ID). The expensive part (learning what speech is) happens once; adapting to a task is cheap.
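To make the fine-tuning step a little more concrete: a CTC head emits one symbol per ~20 ms frame, including a special "blank", and the transcript is read off by collapsing repeats and dropping blanks. The sketch below shows only that greedy decoding rule, not the CTC loss itself. The four-symbol vocabulary and the frame predictions are made up for illustration.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then drop blank symbols."""
    decoded = []
    previous = None
    for symbol in frame_ids:
        # Keep a symbol only when it starts a new run and is not the blank.
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return decoded


# Hypothetical vocabulary: index 0 is the CTC blank.
VOCAB = ["_", "c", "a", "t"]

# Hypothetical per-frame argmax predictions from a fine-tuned model.
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
text = "".join(VOCAB[i] for i in ctc_greedy_decode(frames))
print(text)  # cat
```

The blank is what lets the model emit a genuine double letter: `[1, 0, 1]` decodes to two symbols, while `[1, 1]` decodes to one.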
Both Wav2Vec 2.0 and HuBERT start by turning the raw waveform into a sequence of feature vectors — but notably, they skip the spectrogram. Instead of a hand-designed mel front-end, a stack of strided 1-D convolutions reads the raw waveform directly and learns its own features, producing roughly one latent vector every 20 milliseconds (about 50 per second). The model discovers what aspects of the raw signal matter, rather than being handed mel bands.
This is the “learned front-end” trend from the Audio Representations lesson made real: rather than the fixed pipeline of FFT → mel → log, a convolutional encoder learns a representation tuned to the pretraining objective. These latent feature vectors — one per ~20 ms frame — are the raw material the rest of the model works on: they get masked, quantized, and contextualized.
A conv stack reads the waveform (top) and emits a latent feature vector every ~20 ms (bottom row). No spectrogram — the features are learned. Drag the stride to see the frame rate change.
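The "one vector every ~20 ms" figure follows directly from the strides of the conv stack. The sketch below does that arithmetic, assuming the seven-layer configuration reported for Wav2Vec 2.0 (kernel widths 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2), unpadded convolutions, and 16 kHz input. The function names are illustrative, not from any library.

```python
# Assumed configuration: the seven conv layers reported for Wav2Vec 2.0.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # samples per second


def num_frames(num_samples):
    """Frames produced by the stack (assumes at least 400 input samples)."""
    n = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        n = (n - kernel) // stride + 1  # unpadded 1-D convolution
    return n


def total_stride():
    """Input samples between the starts of consecutive output frames."""
    hop = 1
    for stride in STRIDES:
        hop *= stride
    return hop


def receptive_field():
    """Input samples that influence a single output frame."""
    field, jump = 1, 1
    for kernel, stride in zip(KERNELS, STRIDES):
        field += (kernel - 1) * jump
        jump *= stride
    return field


hop_ms = 1000 * total_stride() / SAMPLE_RATE        # 20.0 ms between frames
window_ms = 1000 * receptive_field() / SAMPLE_RATE  # 25.0 ms seen per frame
frames_per_second = num_frames(SAMPLE_RATE)         # 49 frames for 1 s
print(hop_ms, window_ms, frames_per_second)
```

Under these assumptions each frame sees 25 ms of audio and frames are spaced 20 ms apart, which is where "about 50 per second" comes from.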
Here’s the self-made puzzle, lifted straight from BERT. Take the sequence of latent feature frames and mask out spans of them — replace chunks with a mask token, hiding that audio from the model. Then feed the masked sequence through a transformer (the context network), and ask it to figure out what was in the masked spans, using only the surrounding context. No labels needed — the answer is the audio you just hid.
To solve this, the model must learn the structure of speech: coarticulation (how sounds blend), phonotactics (which sounds follow which), prosody, the identity of the speaker. To fill a masked gap with the right sound, you have to understand the sounds around it, which is exactly the kind of contextual understanding that makes a representation useful downstream. Wav2Vec 2.0 masks spans rather than isolated frames: each span covers 10 consecutive frames (roughly 200 ms of audio), and the spans together hide about half of the sequence, which makes the task hard enough to force real learning.
Feature frames with random spans masked (orange). The transformer must reconstruct what belonged in the masked spans from the surrounding context. Drag the mask amount: too little is trivial, too much is impossible.
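A minimal sketch of span masking, assuming the scheme described for Wav2Vec 2.0: pick a fraction of frames as span starts (0.065 in the paper) and mask the 10 frames from each start, letting spans overlap. The function name and the `seed` argument are illustrative; real implementations differ in details such as batching and minimum mask counts.

```python
import random


def sample_span_mask(num_frames, start_prob=0.065, span_len=10, seed=None):
    """Return a list of booleans, True where a frame is masked."""
    rng = random.Random(seed)
    # Choose span starts without replacement, then extend each into a span.
    num_starts = int(start_prob * num_frames)
    starts = rng.sample(range(num_frames), num_starts)
    mask = [False] * num_frames
    for start in starts:
        for t in range(start, min(start + span_len, num_frames)):
            mask[t] = True  # overlapping spans simply merge
    return mask


mask = sample_span_mask(1000, seed=0)
masked_fraction = sum(mask) / len(mask)
print(f"masked {masked_fraction:.0%} of frames")
```

Because spans overlap, the masked fraction lands well below `start_prob * span_len` (65%) and typically near one half.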
There’s a wrinkle. The masked frames are continuous vectors, and regressing an exact continuous vector is a poorly posed target: when several continuations are plausible, the lowest-error guess is their blurry average. The fix, shared by both methods: turn the prediction target into a small set of discrete speech units. Quantize the continuous features into a finite codebook (in effect, a learned alphabet of “phoneme-like” sound categories) and have the model predict the right unit for each masked frame, not the exact vector.
This is the same vector-quantization idea as neural codecs: snap a continuous vector to the nearest of a finite set of learned prototypes, and the “answer” becomes a category index. Predicting a discrete category is a crisp, well-posed classification problem — far easier and more stable than regressing a fuzzy continuous vector. And remarkably, these learned units often correspond loosely to real phonetic categories, discovered from raw audio with no phonetic labels at all. The model invents its own alphabet of speech.
Continuous feature frames (dots) cluster into a finite set of discrete speech units (colored regions). The prediction target is “which unit,” not the exact vector. Drag the number of units — the model’s self-discovered phoneme-like alphabet.
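The "snap to the nearest prototype" step is small enough to write out. This is a toy sketch of plain nearest-neighbor vector quantization with a hand-written 2-D codebook. It is not the Gumbel-softmax quantizer Wav2Vec 2.0 actually trains, and the codebook and frame values are invented for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (unit index, prototype) for the nearest codebook entry."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


# Toy codebook of three "speech units" in a 2-D feature space.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Toy continuous feature frames.
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]
unit_ids = [quantize(frame, CODEBOOK)[0] for frame in frames]
print(unit_ids)  # [0, 1, 2]
```

The prediction target for each frame is the integer in `unit_ids`, which turns the task into classification over the codebook size.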
Wav2Vec 2.0 (Meta, 2020) solves the masked task contrastively. At each masked position, the transformer produces a contextual representation. The model is then asked: among a set of candidate quantized units (the true one for this frame plus distractors sampled from other masked positions in the same utterance), pick the true one. It’s a multiple-choice question: “which discrete unit really belongs in this gap?” The model gets it right by making the context representation most similar to the true unit and dissimilar from the distractors.
The quantization uses a Gumbel-softmax codebook so it’s differentiable, and the contrastive loss (the same InfoNCE idea from contrastive learning) pulls the masked context toward its true unit and pushes it away from distractors. Pretraining this way on tens of thousands of hours of unlabeled audio produces representations so good that a thin recognition head, fine-tuned on minutes of labeled audio, reaches what used to require thousands of labeled hours. The contrastive choice is elegant but has a subtlety: it depends on having good distractors and a codebook that doesn’t collapse onto a few entries, which is why Wav2Vec 2.0 adds a diversity penalty that encourages all codebook entries to be used.
At a masked frame, the model scores candidates — the true unit (teal) and distractors (gray) — and must rank the true one highest. Drag the number of distractors: more makes the multiple-choice harder and the learning stronger.
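The multiple-choice scoring can be sketched as an InfoNCE-style loss for a single masked frame: cosine similarity between the context vector and each candidate, divided by a temperature, then softmax cross-entropy with the true unit as the correct class. The temperature of 0.1 matches the value reported for Wav2Vec 2.0. The 2-D vectors are toy values, and the function assumes non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, true_unit, distractors, temperature=0.1):
    """-log softmax probability of the true unit among all candidates."""
    candidates = [true_unit] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over the candidates.
    top = max(logits)
    log_total = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_total - logits[0]


true_unit = (1.0, 0.0)
distractors = [(0.0, 1.0), (-1.0, 0.0)]

good = contrastive_loss((1.0, 0.0), true_unit, distractors)  # points at truth
bad = contrastive_loss((0.0, 1.0), true_unit, distractors)   # points at a distractor
print(f"aligned: {good:.4f}  misaligned: {bad:.4f}")
```

The loss is near zero when the context vector points at the true unit and large when it points at a distractor, which is the gradient signal that shapes the representations.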
HuBERT (Meta, 2021) reaches the same goal in a simpler, more stable way. Instead of a contrastive task with distractors, it does two steps. First, cluster: run k-means on acoustic features (initially simple MFCCs) to assign every frame a cluster ID, giving cheap, automatic pseudo-labels. Second, predict: mask spans and train the transformer to predict the cluster ID of the masked frames. That is a plain classification (cross-entropy) problem, exactly like BERT predicting a masked word from a fixed vocabulary.
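The "predict" step reduces to ordinary cross-entropy, evaluated only where frames were hidden. A minimal sketch, assuming per-frame logits over the cluster vocabulary are already available from some model. The HuBERT paper also describes an optional term for unmasked frames, which is omitted here, and all values below are toy data.

```python
import math


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked frames only."""
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # visible frames contribute nothing in this sketch
        top = max(frame_logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in frame_logits))
        total += log_z - frame_logits[target]
        count += 1
    return total / count if count else 0.0


# Toy example: 3 frames, 4 clusters, only the middle frame is masked.
logits = [
    [9.0, 0.0, 0.0, 0.0],   # visible, ignored
    [0.0, 0.0, 0.0, 0.0],   # masked, model is maximally unsure
    [0.0, 0.0, 9.0, 0.0],   # visible, ignored
]
targets = [0, 3, 2]          # k-means cluster IDs used as pseudo-labels
mask = [False, True, False]

loss = masked_cross_entropy(logits, targets, mask)
print(f"{loss:.4f}")  # log(4), the loss of a uniform guess over 4 clusters
```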
The clever part is iteration. The first round’s clusters (from crude MFCCs) are rough, but the model trained to predict them learns decent representations. So you re-cluster on those learned representations, getting better pseudo-labels, and train again. Each round’s representations improve the next round’s targets, which improve the representations — a bootstrapping spiral toward genuinely phoneme-like units. HuBERT avoids the contrastive method’s fiddliness (no distractor sampling, no Gumbel codebook) and tends to be more robust, which is why it’s widely used.
Each iteration: cluster the current features into pseudo-labels, train to predict them, then re-cluster on the improved features. Step through iterations and watch the clusters sharpen into clean, phoneme-like groups.
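The clustering half of each iteration is ordinary k-means. Below is a small pure-Python sketch that turns a list of feature vectors into per-frame cluster IDs. The evenly spaced initialization and the toy 2-D features are simplifications chosen for readability. In a HuBERT-style loop the same routine would first be run on MFCCs and later on features taken from the partially trained transformer, and that training step is not shown here.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    return min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )


def kmeans_labels(features, k, num_iters=20):
    """Return (labels, centroids) after a few rounds of Lloyd's algorithm."""
    # Naive deterministic init: take evenly spaced frames as centroids.
    step = len(features) // k
    centroids = [tuple(features[i * step]) for i in range(k)]
    for _ in range(num_iters):
        labels = [nearest(f, centroids) for f in features]
        for c in range(k):
            members = [f for f, label in zip(features, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return [nearest(f, centroids) for f in features], centroids


# Toy "frames": two well-separated groups in a 2-D feature space.
features = [
    (0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
    (5.0, 5.1), (5.1, 4.9), (4.9, 5.0),
]
pseudo_labels, _ = kmeans_labels(features, k=2)
print(pseudo_labels)  # [0, 0, 0, 1, 1, 1]
```

These integer labels are what the masked-prediction step treats as its vocabulary. Re-running the routine on better features yields a better vocabulary for the next round.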
The payoff is data efficiency. Watch recognition error fall as you add labeled fine-tuning data — for a model trained from scratch versus one pretrained self-supervised on unlabeled audio. The pretrained model starts far lower and needs a tiny fraction of the labels to reach good accuracy. This is the curve that made self-supervision the default for speech.
Drag the amount of labeled fine-tuning data. From-scratch (orange) needs thousands of hours to get good; the self-supervised pretrained model (teal) reaches strong accuracy with minutes-to-hours. The gap at low labels is the whole point.
At 10 minutes of labels, the from-scratch model is hopeless while the pretrained one is already usable. That’s because pretraining already taught it what speech is; the labels only need to teach it the mapping from sounds to text. The expensive learning needed no labels at all, only unlabeled audio.
How does this relate to Whisper? Whisper went the other way — massive supervised (weakly-labeled) data, no self-supervised pretext task, betting that 680k labeled hours buys robustness directly. Self-supervised models bet that you can learn most of what you need from unlabeled audio and add labels cheaply. Both are valid; SSL shines when labels are scarce (most languages) and as a feature extractor, while Whisper-style supervision shines when you can amass huge labeled corpora. The field increasingly combines them.
Self-supervised (Wav2Vec/HuBERT): learn from unlabeled, fine-tune on little — wins when labels are scarce. Whisper: learn from massive weakly-labeled data — wins when labels are plentiful. Drag to see where each dominates.
| | Wav2Vec 2.0 | HuBERT |
|---|---|---|
| pretext | contrastive (InfoNCE) | masked cluster-ID prediction |
| target | quantized units + distractors | k-means pseudo-labels |
| codebook | differentiable (Gumbel) | offline k-means, iterated |
| vibe | elegant, distractor-dependent | simple, stable, widely used |
→ Audio Representations — the learned-front-end idea
→ Neural Audio Codecs — acoustic tokens (vs SSL semantic tokens)
→ Contrastive Learning — the InfoNCE behind Wav2Vec
→ Whisper — the supervised-at-scale alternative