How EnCodec and SoundStream squeeze sound into a stream of discrete tokens — the trick that lets a language model generate audio the same way it generates text. Encoder, vector quantization, residual codebooks, and the GAN that keeps it sounding real.
Language models are spectacular at one thing: predicting the next token in a sequence of discrete symbols. Text is already discrete — a finite vocabulary of words and word-pieces. So GPT-style models slot right in. But audio is continuous: a waveform is a stream of real-valued samples, sixteen thousand per second. There’s no vocabulary, no “next word.” To generate audio with the language-model recipe, we first need to turn sound into a sequence of discrete tokens — a finite alphabet of audio.
That is exactly what a neural audio codec does. EnCodec (Meta), SoundStream (Google), and their kin compress a waveform into a short sequence of integer tokens, and can decode those tokens back into high-quality audio. Once sound is “just tokens,” everything language models can do (generate, continue, translate, condition on text) becomes possible for audio. This single bridge underlies much of modern text-to-speech, music generation (MusicGen), and audio language models (AudioLM).
Top: a waveform — thousands of continuous samples per second, no vocabulary. Bottom: the codec’s output — a short row of discrete token IDs a language model can predict. Drag to see the compression ratio.
A codec is, at heart, an autoencoder for sound. An encoder (a stack of strided 1-D convolutions) reads the raw waveform and compresses it down in time, producing a much shorter sequence of latent vectors — each one summarizing a small slice of audio. A decoder (mirror-image transposed convolutions) takes those latents and reconstructs the waveform. Train the pair to make the output sound like the input, and you have learned compression.
The encoder might turn 16,000 samples per second into, say, 75 latent vectors per second: roughly one latent for every 213 samples, a big temporal squeeze. Those latents are continuous, though. They’re a great compressed representation, but they’re not yet tokens: a language model needs a finite set of discrete symbols, not arbitrary real vectors. So the codec adds one crucial step in the middle, between encoder and decoder, that turns continuous latents into discrete codes. That step is quantization, and it’s the heart of the whole idea.
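To make the temporal squeeze concrete, here is a minimal NumPy sketch of stacked strided convolutions. The stride schedule (2, 4, 5, 8), the kernel sizes, the random weights, and the single channel are all illustrative assumptions, not the configuration of any particular codec; a real encoder learns its weights and carries many channels.

```python
import numpy as np

def strided_conv1d(x, kernel, stride):
    """Valid 1-D convolution (cross-correlation) that emits one output per `stride` inputs."""
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + k], kernel) for i in range(n_out)])

rng = np.random.default_rng(0)
waveform = rng.standard_normal(16_000)  # one second at 16 kHz (noise standing in for audio)

# Assumed stride schedule: total downsampling = 2 * 4 * 5 * 8 = 320.
strides = [2, 4, 5, 8]

h = waveform
for s in strides:
    kernel = rng.standard_normal(2 * s) / (2 * s)  # random weights; a real encoder learns these
    h = np.pad(h, (s // 2, s - s // 2))            # pad so the length shrinks by exactly s
    h = strided_conv1d(h, kernel, s)
    print(f"stride {s}: {len(h)} steps")

frames_per_second = len(h)
```

With these assumed strides the total downsampling is 320×, so one second at 16 kHz becomes 50 frames. A different stride schedule or sample rate gives a different frame rate, such as the 75 frames per second used as the example figure in the text.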
Vector quantization (VQ) is how we discretize. Keep a fixed list of, say, 1,024 reference vectors — the codebook. To quantize a continuous latent vector, find the nearest codebook entry and replace the latent with it. The token is simply the index of that entry — an integer from 0 to 1,023. The decoder later looks up that index to get the vector back. Continuous vector in, integer out: that’s the discretization.
It’s exactly like rounding, but in many dimensions and to a learned set of allowed values instead of to the integers. The codebook entries are the “allowed sounds” — the alphabet. A latent that lands between two entries snaps to the closer one. There’s some loss in that snapping (quantization error), and a key design goal is to make the codebook rich enough that the snapping barely hurts the reconstruction.
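The snapping step can be sketched in a few lines of NumPy. This is a toy under stated assumptions: the codebook here is random rather than learned, and the latent dimension of 8 is chosen only for illustration.

```python
import numpy as np

def vq_encode(latents, codebook):
    """Return, for each latent, the index of its nearest codebook entry."""
    # dists[i, j] = squared Euclidean distance between latent i and entry j
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def vq_decode(tokens, codebook):
    """Look each token back up in the codebook."""
    return codebook[tokens]

rng = np.random.default_rng(0)
codebook = rng.standard_normal((1024, 8))  # 1,024 entries; dimension 8 is an assumption
latents = rng.standard_normal((75, 8))     # one second of latents at 75 frames per second

tokens = vq_encode(latents, codebook)      # integers in [0, 1023]
quantized = vq_decode(tokens, codebook)    # what the decoder would see
quantization_error = np.mean((latents - quantized) ** 2)
print(tokens[:10], quantization_error)
```

The tokens are the only thing that needs to be stored or transmitted; the decoder holds its own copy of the codebook and recovers the vectors by lookup.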
Codebook entries are fixed points (teal). Drag the continuous latent (orange) around — it always snaps to the nearest entry, and the token is that entry’s index. The gap between them is the quantization error.
The codebook is the codec’s vocabulary, and its size sets a fundamental trade-off. A bigger codebook (more entries) means finer quantization — each latent snaps to a closer match, so reconstruction is more faithful. But it also means more bits per token (you need “more letters” to name a larger alphabet) and a harder job for the downstream language model, which now must choose among more options. A smaller codebook is cheaper and easier to model, but quantizes more coarsely — audio quality drops.
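A small experiment illustrates both sides of the trade-off. It is only a sketch: the 2-D latents and the random (untrained) codebook are assumptions, and the smaller codebooks are taken as prefixes of the largest one so that adding entries can never make the snapping error worse.

```python
import math
import numpy as np

def bits_per_token(codebook_size):
    """Bits needed to name one entry of a codebook with this many entries."""
    return math.ceil(math.log2(codebook_size))

def mean_snap_error(latents, codebook):
    """Mean squared distance from each latent to its nearest codebook entry."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.min(axis=1).mean()

rng = np.random.default_rng(0)
latents = rng.standard_normal((500, 2))        # toy 2-D latents
big_codebook = rng.standard_normal((1024, 2))  # random entries, not learned

sizes = [4, 16, 64, 256, 1024]
errors = [mean_snap_error(latents, big_codebook[:k]) for k in sizes]

for k, err in zip(sizes, errors):
    print(f"{k:5d} entries  {bits_per_token(k):2d} bits/token  mean error {err:.4f}")
```

Each quadrupling of the codebook costs two more bits per token, while the error keeps shrinking.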
There’s also a classic failure: codebook collapse, where the model only ever uses a handful of entries and the rest go dead — wasting the vocabulary. Codecs fight this with tricks like restarting unused entries and using exponential-moving-average updates so the whole codebook stays active. A healthy codebook spreads its entries to tile the space of real audio latents evenly.
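The two tricks named above can be sketched together. This is a simplified illustration, not the exact procedure of any particular codec: the function name `ema_update`, the `dead_below` threshold, and the fast decay of 0.5 are assumptions chosen so the effect shows up within a few steps.

```python
import numpy as np

def ema_update(codebook, ema_count, ema_sum, latents, decay=0.99, dead_below=0.01, rng=None):
    """One EMA codebook update followed by a restart of rarely used entries."""
    if rng is None:
        rng = np.random.default_rng()
    k = len(codebook)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)
    one_hot = np.eye(k)[tokens]  # (batch, k) assignment matrix

    # Running averages: how many latents hit each entry, and the sum of those latents.
    ema_count = decay * ema_count + (1 - decay) * one_hot.sum(axis=0)
    ema_sum = decay * ema_sum + (1 - decay) * (one_hot.T @ latents)

    # Each entry moves to the running mean of the latents assigned to it.
    codebook = ema_sum / np.maximum(ema_count, 1e-8)[:, None]

    # Restart: entries whose usage has decayed away are re-seeded from the current batch.
    dead = ema_count < dead_below
    if dead.any():
        picks = rng.integers(0, len(latents), size=int(dead.sum()))
        codebook[dead] = latents[picks]
        ema_sum[dead] = latents[picks]
        ema_count[dead] = 1.0
    return codebook, ema_count, ema_sum, tokens

rng = np.random.default_rng(0)
# 16 entries: 2 near the data, 14 stranded far away where no latent ever lands.
codebook = np.vstack([rng.standard_normal((2, 2)), rng.standard_normal((14, 2)) + 50.0])
ema_count = np.ones(16)
ema_sum = codebook.copy()

for step in range(30):
    batch = rng.standard_normal((256, 2))
    codebook, ema_count, ema_sum, tokens = ema_update(
        codebook, ema_count, ema_sum, batch, decay=0.5, rng=rng
    )

used = len(np.unique(tokens))
print(f"entries used in the last batch: {used} of 16")
```

At the start only the two nearby entries can ever be selected. Once the stranded entries' usage counts decay below the threshold they are re-seeded onto real latents and begin receiving assignments.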
More codebook entries (drag right) tile the latent space more finely — smaller snapping error, better audio — but cost more bits and are harder to model. Watch the reconstruction error shrink as entries multiply.
One codebook isn’t enough for high-fidelity audio — you’d need an astronomically large one. The key innovation in EnCodec/SoundStream is Residual Vector Quantization (RVQ): quantize in layers. Quantize the latent with the first codebook; that leaves a residual (what the first code missed). Quantize that residual with a second codebook. Quantize the new residual with a third. And so on — each codebook cleans up the error of the previous one.
So each time-step is now represented by several tokens, one per RVQ level (say 8 levels, 8 tokens). The first level captures the coarse structure; later levels add finer and finer detail. This is wildly more efficient: 8 codebooks of 1,024 entries can express 1,024^8 = 2^80 (about 10^24) distinct combinations, the reach of an impossibly huge single codebook, while storing only 8 × 1,024 = 8,192 vectors. It also gives a beautiful property, scalable bitrate: keep only the first few RVQ levels for low-bitrate, lower-quality audio, or all of them for high fidelity. You can dial quality by choosing how many levels to keep.
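The layered procedure can be written as a short loop. The sketch below makes several assumptions for the sake of a runnable toy: the codebooks are random rather than learned, each level's codebook is shrunk by a factor of 0.4 to roughly match the shrinking residual, and the latent dimension is 4.

```python
import numpy as np

def nearest(x, codebook):
    """Index of the nearest codebook entry for each row of x."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def rvq_encode(latents, codebooks):
    """Quantize level by level; each level sees what the previous levels missed."""
    residual = latents.copy()
    tokens = []
    for cb in codebooks:
        idx = nearest(residual, cb)
        tokens.append(idx)
        residual = residual - cb[idx]
    return np.stack(tokens), residual  # tokens has shape (levels, frames)

def rvq_decode(tokens, codebooks, n_levels=None):
    """Sum the chosen entries of the first n_levels codebooks (all levels by default)."""
    n_levels = len(codebooks) if n_levels is None else n_levels
    return sum(codebooks[l][tokens[l]] for l in range(n_levels))

rng = np.random.default_rng(0)
dim, levels, frames = 4, 8, 75
# Random stand-ins for learned codebooks, scaled down at each level (an assumption).
codebooks = [rng.standard_normal((1024, dim)) * 0.4 ** l for l in range(levels)]
latents = rng.standard_normal((frames, dim))

tokens, final_residual = rvq_encode(latents, codebooks)
errors = [np.mean((latents - rvq_decode(tokens, codebooks, n)) ** 2)
          for n in range(1, levels + 1)]
for n, err in enumerate(errors, start=1):
    print(f"keep {n} level(s): mean squared error {err:.6f}")
```

Decoding with `n_levels` smaller than the full depth is the scalable-bitrate property: the same token grid yields a coarser reconstruction when the later rows are dropped.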
Each level quantizes what the previous levels missed (the residual). Add levels and watch the reconstruction (teal) close in on the target (dashed) — coarse first, fine detail last. Each level = one more token per step.
If you train the codec only to minimize reconstruction error on the waveform or spectrogram, you get the same problem as image autoencoders: muffled, over-smoothed audio. Squared error rewards hedging, so the decoder blurs fine detail — the result sounds dull and underwater. High-fidelity codecs fix this exactly as image generators do: with a GAN.
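A two-line experiment shows why squared error rewards hedging. Suppose (as a deliberately tiny assumption) that a sample could plausibly be either -1 or +1 with equal probability; the prediction that minimizes squared error is neither of them.

```python
import numpy as np

# Two equally plausible "sharp" values for the same sample.
targets = np.array([-1.0, 1.0])

# Try every candidate prediction on a grid and score it by mean squared error.
candidates = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((targets - c) ** 2) for c in candidates])

best = candidates[int(np.argmin(mse))]
print(f"MSE-optimal prediction: {best:.2f}")
```

The optimum sits at the average of the plausible values, a flat compromise that matches neither. Repeated across every sample of a waveform, that averaging is what produces the muffled sound.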
Alongside the reconstruction loss, a set of discriminators is trained to tell real audio from the codec’s output. The codec (the generator) is trained to fool them. This adversarial loss pushes the decoder to produce crisp, realistic detail — the sharp transients, the natural texture — that plain reconstruction loss smooths away. Codecs typically use multiple discriminators looking at the signal in different ways (different resolutions, the waveform and the spectrogram) to catch artifacts at every scale. The combination — reconstruction loss for accuracy, adversarial loss for realism — is what makes EnCodec sound genuinely good at low bitrates.
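As a rough sketch of how the two objectives combine, the snippet below uses a hinge formulation, one common choice for adversarial training. The L1 reconstruction term, the weights `lambda_rec` and `lambda_adv`, and the function names are assumptions for illustration; real codecs typically add further terms (for example spectrogram and feature-matching losses) and tune the weights carefully.

```python
import numpy as np

def discriminator_hinge_loss(real_scores, fake_scores):
    """Discriminator wants scores >= 1 on real audio and <= -1 on codec output."""
    return (np.mean(np.maximum(0.0, 1.0 - real_scores))
            + np.mean(np.maximum(0.0, 1.0 + fake_scores)))

def generator_loss(x, x_hat, fake_scores, lambda_rec=1.0, lambda_adv=1.0):
    """Codec (generator) loss: stay close to the input AND fool the discriminators."""
    reconstruction = np.mean(np.abs(x - x_hat))                # accuracy term (L1)
    adversarial = np.mean(np.maximum(0.0, 1.0 - fake_scores))  # realism term
    return lambda_rec * reconstruction + lambda_adv * adversarial

x = np.array([0.0, 0.5, -0.5, 1.0])

# A confident discriminator: real audio scored high, codec output scored low.
d_loss = discriminator_hinge_loss(np.array([2.0, 3.0]), np.array([-2.0, -1.5]))

# Perfect reconstruction that also fools the discriminator: zero loss.
g_fooled = generator_loss(x, x, fake_scores=np.array([2.0]))

# Off by 1 everywhere and caught by the discriminator: both terms are penalized.
g_caught = generator_loss(x, x + 1.0, fake_scores=np.array([-2.0]))

print(d_loss, g_fooled, g_caught)
```

With several discriminators, `fake_scores` would simply collect the outputs of all of them, so an artifact visible at any one resolution raises the adversarial term.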
Reconstruction loss alone (orange) over-smooths, leaving fine detail muffled. Adding the GAN discriminator (teal) restores crisp transients and texture. Toggle to compare the two, shown visually as spectrogram sharpness.
Let’s ground it. Suppose the encoder produces 75 frames per second, RVQ uses 8 levels, and each codebook has 1,024 entries (so log2(1,024) = 10 bits per token). The bitrate is frames × levels × bits = 75 × 8 × 10 = 6,000 bits per second, or 6 kbps. Compare raw 16 kHz audio at 16 bits per sample: 16,000 × 16 = 256,000 bits per second, or 256 kbps. The codec achieves roughly 43× compression (256 / 6 ≈ 42.7) while keeping perceived quality high. That’s a remarkable feat, and the EnCodec and SoundStream papers report better listening-test quality than older hand-designed codecs such as Opus at comparable bitrates.
And because of RVQ, you choose the operating point after training, provided the codec was trained to cope with missing levels (SoundStream, for example, randomly drops the later quantizer levels during training). Keep all 8 levels for 6 kbps high quality; keep 4 levels for 3 kbps; keep 2 for 1.5 kbps lower quality. For audio language models, fewer levels also means a shorter, easier token sequence to predict, so there’s a direct trade between audio fidelity and how hard the generation task is. Generative systems often model the coarse (first) levels with one model and fill in the fine levels with another.
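The bitrate arithmetic above fits in one function. The numbers are the ones used in the text (75 frames per second, 1,024-entry codebooks, 16 kHz 16-bit raw audio); the function name is just for illustration.

```python
import math

def codec_bitrate(frames_per_second, levels, codebook_size):
    """Bits per second = frames x levels x bits per token."""
    return frames_per_second * levels * math.log2(codebook_size)

raw_bps = 16_000 * 16                  # 16 kHz, 16 bits per sample
full_bps = codec_bitrate(75, 8, 1024)  # all 8 RVQ levels
compression = raw_bps / full_bps

print(f"raw: {raw_bps / 1000:.0f} kbps, codec: {full_bps / 1000:.0f} kbps, "
      f"ratio: {compression:.1f}x")

# Operating points chosen after training by keeping fewer levels.
ladder = {}
for kept in (2, 4, 8):
    bps = codec_bitrate(75, kept, 1024)
    tokens_per_second = 75 * kept
    ladder[kept] = (bps, tokens_per_second)
    print(f"{kept} levels: {bps / 1000:.1f} kbps, {tokens_per_second} tokens/s")
```

The tokens-per-second column is the cost a downstream language model pays: halving the levels halves both the bitrate and the sequence it must predict.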
Drag the RVQ levels: the bitrate and token count per second scale linearly, and the quality climbs then saturates. See the compression ratio vs. raw audio update live.
Watch a waveform travel the full codec: encode to latents, residual-quantize into a grid of tokens (time × RVQ level), and decode back to audio. Change the codebook size and the number of RVQ levels and see the reconstruction quality and bitrate respond — the exact dials that define a codec like EnCodec.
Top: input waveform. Middle: the RVQ token grid (rows = levels, columns = time frames) — this is what a language model would generate. Bottom: the reconstruction. Drag levels and codebook size; watch fidelity and bitrate trade off.
That middle grid is the whole point: a small, discrete, finite-vocabulary representation of sound. Once audio looks like that — a sequence of integers — a transformer can predict it, and audio generation becomes language modeling.
Here’s the payoff that motivated everything. With a codec, generating audio becomes next-token prediction on codec tokens. Train a transformer on sequences of audio tokens and it learns to continue or generate sound. Condition it on text tokens and you get text-to-speech or text-to-music. This is the recipe behind a wave of systems, including the AudioLM and MusicGen models mentioned earlier.
The RVQ structure shapes how these models work: because there are several token levels per step, they often predict the coarse levels first and the fine levels after (or use clever interleaving), since naively flattening all levels makes the sequence very long. The codec is the quiet enabler — it turned the hardest part (continuous audio) into the thing transformers do best (discrete sequences), and unlocked the entire generative-audio era.
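Two ways of laying out the token grid for a transformer can be compared directly. Naive flattening emits every level of every frame in turn. The delayed layout shifts level l to the right by l steps so that one step carries one token per level; an interleaving of this kind is described for MusicGen. The function names and the padding value of -1 are hypothetical, chosen for this sketch.

```python
PAD = -1  # placeholder for "no token yet" (an assumed convention)

def flatten(grid):
    """Frame-major flattening: all levels of frame 0, then all levels of frame 1, ..."""
    levels, frames = len(grid), len(grid[0])
    return [grid[l][t] for t in range(frames) for l in range(levels)]

def delay_pattern(grid, pad=PAD):
    """Shift level l right by l steps; each column then holds one token per level."""
    levels, frames = len(grid), len(grid[0])
    steps = frames + levels - 1
    out = [[pad] * steps for _ in range(levels)]
    for l in range(levels):
        for t in range(frames):
            out[l][t + l] = grid[l][t]
    return out

# Toy grid: 3 RVQ levels x 4 frames; token value = 10 * level + frame for readability.
grid = [[10 * l + t for t in range(4)] for l in range(3)]

flat = flatten(grid)
delayed = delay_pattern(grid)

print("flattened length:", len(flat))           # levels x frames sequence positions
print("delayed steps:   ", len(delayed[0]))     # frames + levels - 1 sequence positions
for row in delayed:
    print(row)
```

Flattening multiplies the sequence length by the number of levels, while the delayed layout adds only a few steps. The shift also means that when a fine-level token is predicted, the coarser tokens for the same frame have already been produced.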
Text tokens condition a transformer that predicts codec tokens one at a time; the codec decoder turns them into sound. Step through generation and watch the audio-token sequence grow, then decode.
| Piece | Role |
|---|---|
| Encoder/decoder | conv autoencoder: compress & reconstruct waveform |
| Vector quantization | snap latent to nearest codebook entry → integer token |
| Codebook | the learned audio alphabet (size = fidelity vs. cost) |
| Residual VQ | layered codebooks, coarse→fine; scalable bitrate |
| GAN loss | discriminators → crisp, realistic audio (no muffling) |
| Audio LMs | predict codec tokens → TTS, music, audio generation |
→ Audio Representations — spectrograms, the other audio front-end
→ VAE / VQ-VAE — vector quantization in full detail
→ TTS Architectures — text → speech, including codec-token TTS
→ GPT — the next-token prediction that generates the tokens