Audio & Speech

Neural Audio Codecs

How EnCodec and SoundStream squeeze sound into a stream of discrete tokens — the trick that lets a language model generate audio the same way it generates text. Encoder, vector quantization, residual codebooks, and the GAN that keeps it sounding real.

Prerequisites: an autoencoder compresses an input and then reconstructs it, and a language model predicts the next token. That’s it.

Chapter 0: Why Tokens?

Language models are spectacular at one thing: predicting the next token in a sequence of discrete symbols. Text is already discrete — a finite vocabulary of words and word-pieces. So GPT-style models slot right in. But audio is continuous: a waveform is a stream of real-valued samples, sixteen thousand per second. There’s no vocabulary, no “next word.” To generate audio with the language-model recipe, we first need to turn sound into a sequence of discrete tokens — a finite alphabet of audio.

That is exactly what a neural audio codec does. EnCodec (Meta), SoundStream (Google), and their kin compress a waveform into a short sequence of integer tokens, and can decode those tokens back into high-quality audio. Once sound is “just tokens,” everything language models can do — generate, continue, translate, condition on text — becomes possible for audio. This single bridge underlies modern text-to-speech, music generation (MusicGen), and audio language models (AudioLM).

The trap: “to generate audio, predict the next waveform sample.” A waveform has 16,000 real numbers per second, and predicting them one by one is slow and hard to keep coherent over long spans. The codec’s job is to crush that into roughly 75 frames per second, each described by a few tokens from a small vocabulary and each summarizing a short slice of sound: a sequence short and discrete enough for a transformer to model.

Continuous samples vs. discrete tokens

Top: a waveform — thousands of continuous samples per second, no vocabulary. Bottom: the codec’s output — a short row of discrete token IDs a language model can predict. Drag to see the compression ratio.

Why do we need a neural audio codec to generate audio with a language model?

To make audio files smaller for storage only
Language models predict discrete tokens; the codec turns continuous audio into a short sequence of discrete tokens they can model
To remove all the noise from audio

Chapter 1: The Autoencoder Backbone

A codec is, at heart, an autoencoder for sound. An encoder (a stack of strided 1-D convolutions) reads the raw waveform and compresses it down in time, producing a much shorter sequence of latent vectors — each one summarizing a small slice of audio. A decoder (mirror-image transposed convolutions) takes those latents and reconstructs the waveform. Train the pair to make the output sound like the input, and you have learned compression.

The encoder might turn 16,000 samples per second into, say, 75 latent vectors per second — a big temporal squeeze. Those latents are continuous, though. They’re a great compressed representation, but they’re not yet tokens: a language model needs a finite set of discrete symbols, not arbitrary real vectors. So the codec adds one crucial step in the middle, between encoder and decoder, that turns continuous latents into discrete codes. That step is quantization, and it’s the heart of the whole idea.
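To make the temporal squeeze concrete: each strided convolution keeps roughly one output step per `stride` input steps, so the overall hop is the product of the strides. The sketch below is a minimal illustration only. The strides (2, 4, 5, 8) and the 24 kHz sample rate are assumed example values, not figures stated in this lesson, and `frame_rate` is a hypothetical helper.

```python
from math import prod

def frame_rate(sample_rate_hz, strides):
    """Latent frames per second after a stack of strided convolutions.

    The total hop (input samples per latent frame) is the product
    of the per-layer strides.
    """
    hop = prod(strides)
    return sample_rate_hz / hop

# Assumed, illustrative values (not taken from this lesson).
strides = (2, 4, 5, 8)  # total hop = 320 samples per latent frame
print(frame_rate(24_000, strides))  # frames/sec for a 24 kHz input
print(frame_rate(16_000, strides))  # frames/sec for a 16 kHz input
```

With these assumed strides, a 24 kHz input gives 75 frames per second and a 16 kHz input gives 50; the exact rate depends on the sample rate and the stride choice.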

waveform (16k samples/sec)
↓ encoder (strided convs)
latent vectors (~75/sec, continuous)
↓ quantize (the new part)
discrete tokens (integer codes)
↓ decoder (transposed convs)
reconstructed waveform (sounds like the input)

What is the backbone of a neural audio codec?

A decision tree
An autoencoder: a conv encoder compresses the waveform to latents, a decoder reconstructs it
A single linear layer

Chapter 2: Quantization — continuous to discrete

Vector quantization (VQ) is how we discretize. Keep a fixed list of, say, 1,024 reference vectors — the codebook. To quantize a continuous latent vector, find the nearest codebook entry and replace the latent with it. The token is simply the index of that entry — an integer from 0 to 1,023. The decoder later looks up that index to get the vector back. Continuous vector in, integer out: that’s the discretization.

It’s exactly like rounding, but in many dimensions and to a learned set of allowed values instead of to the integers. The codebook entries are the “allowed sounds” — the alphabet. A latent that lands between two entries snaps to the closer one. There’s some loss in that snapping (quantization error), and a key design goal is to make the codebook rich enough that the snapping barely hurts the reconstruction.
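The snapping step described above can be sketched in a few lines. This is a toy illustration, assuming a hypothetical four-entry, two-dimensional codebook; real codebooks are learned and much larger, and `quantize` is an illustrative name rather than any library's API.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latent, codebook):
    """Return (token, quantized_vector) for one latent vector."""
    token = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return token, codebook[token]

# Hypothetical 2-D codebook with four entries (real ones are learned).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latent = (0.9, 0.2)

token, snapped = quantize(latent, codebook)
error = squared_distance(latent, snapped) ** 0.5  # quantization error
print(token, snapped, round(error, 3))
```

The integer `token` is all that gets stored or modeled; the decoder recovers `snapped` by looking the index up in the same codebook.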

The training subtlety: “pick the nearest entry” isn’t differentiable, so you can’t take a gradient through a hard nearest-neighbor lookup. The fix (from VQ-VAE) is the straight-through estimator: on the forward pass you snap to the nearest code; on the backward pass you pretend the snap was the identity, copying the gradient straight through to the encoder. A small auxiliary loss also pulls codebook entries toward the latents they’re matched with, so the alphabet learns to cover the data, and a commitment term keeps the encoder’s outputs close to the codes they select.
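Here is one way to picture the two gradient paths for a single latent, with the gradients written by hand. In practice an autograd framework does this with a stop-gradient, so treat this as a sketch under assumptions: `vq_step`, the learning rate, and all numbers are hypothetical, and the commitment term is left out.

```python
def nearest(latent, codebook):
    return min(range(len(codebook)),
               key=lambda i: sum((x - y) ** 2
                                 for x, y in zip(latent, codebook[i])))

def vq_step(latent, codebook, grad_from_decoder, lr=0.1):
    """One hand-written training step for a single latent.

    grad_from_decoder: dLoss/d(quantized vector), as the decoder supplies it.
    Returns (grad_for_encoder, updated_codebook).
    """
    idx = nearest(latent, codebook)  # forward pass: hard snap
    # Straight-through: treat the snap as the identity on the backward
    # pass, so the encoder receives the decoder's gradient unchanged.
    grad_for_encoder = list(grad_from_decoder)
    # Codebook loss ||latent - entry||^2 with the latent held constant:
    # its gradient w.r.t. the entry is 2 * (entry - latent).
    entry = codebook[idx]
    new_entry = [e - lr * 2 * (e - z) for e, z in zip(entry, latent)]
    updated = [list(c) for c in codebook]
    updated[idx] = new_entry
    return grad_for_encoder, updated

codebook = [[0.0, 0.0], [1.0, 1.0]]
latent = [0.8, 0.6]
grad_enc, new_codebook = vq_step(latent, codebook,
                                 grad_from_decoder=[0.3, -0.1])
print(grad_enc, new_codebook)
```

Only the selected entry moves, and it moves toward the latent that chose it; the unselected entry is untouched.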

Snapping a latent to the nearest codebook entry

Codebook entries are fixed points (teal). Drag the continuous latent (orange) around — it always snaps to the nearest entry, and the token is that entry’s index. The gap between them is the quantization error.

In vector quantization, what is the “token”?

The raw latent vector itself
The integer index of the nearest codebook entry to the latent
The reconstruction error

Chapter 3: The Codebook & Its Size

The codebook is the codec’s vocabulary, and its size sets a fundamental trade-off. A bigger codebook (more entries) means finer quantization — each latent snaps to a closer match, so reconstruction is more faithful. But it also means more bits per token (you need “more letters” to name a larger alphabet) and a harder job for the downstream language model, which now must choose among more options. A smaller codebook is cheaper and easier to model, but quantizes more coarsely — audio quality drops.
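The trade-off can be made concrete with a deliberately simplified toy: a one-dimensional codebook whose entries are evenly spaced on [0, 1]. That setup is an assumption chosen so the error has a closed form. Real codebooks are learned and high-dimensional, so only the trend carries over.

```python
from math import log2

def bits_per_token(codebook_size):
    """Bits needed to name one entry of the codebook."""
    return log2(codebook_size)

def worst_case_error(codebook_size):
    """Largest snapping error for evenly spaced entries on [0, 1].

    Entries sit at the centers of `codebook_size` equal cells, so no
    point is farther than half a cell width from its nearest entry.
    """
    return 1.0 / (2 * codebook_size)

for size in (16, 64, 256, 1024):
    print(size, bits_per_token(size), worst_case_error(size))
```

Each doubling of the codebook halves the worst-case error in this toy, but costs one more bit per token and gives the downstream model twice as many options to choose from.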

There’s also a classic failure: codebook collapse, where the model only ever uses a handful of entries and the rest go dead — wasting the vocabulary. Codecs fight this with tricks like restarting unused entries and using exponential-moving-average updates so the whole codebook stays active. A healthy codebook spreads its entries to tile the space of real audio latents evenly.
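A simple way to spot collapse is to measure how many entries are ever selected, and one restart heuristic is to overwrite unused entries with recent encoder outputs. The sketch below illustrates that idea only. The function names and toy data are hypothetical, and the exponential-moving-average update mentioned above is not shown.

```python
import random
from collections import Counter

def usage_fraction(tokens, codebook_size):
    """Fraction of codebook entries used at least once in `tokens`."""
    return len(set(tokens)) / codebook_size

def restart_dead_entries(codebook, tokens, recent_latents, rng):
    """Replace entries never selected in `tokens` with random recent latents."""
    counts = Counter(tokens)
    return [
        entry if counts[i] > 0 else list(rng.choice(recent_latents))
        for i, entry in enumerate(codebook)
    ]

rng = random.Random(0)
codebook = [[0.0], [1.0], [2.0], [3.0]]
tokens = [0, 0, 1, 0, 1, 1]  # entries 2 and 3 are never used
recent_latents = [[0.1], [0.9], [0.4]]

print(usage_fraction(tokens, len(codebook)))
codebook = restart_dead_entries(codebook, tokens, recent_latents, rng)
print(codebook)
```

After the restart, the dead entries sit where real latents actually fall, so they have a chance of being selected on the next batch.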

Codebook size: fidelity vs. cost

More codebook entries (drag right) tile the latent space more finely — smaller snapping error, better audio — but cost more bits and are harder to model. Watch the reconstruction error shrink as entries multiply.

What is the trade-off in codebook size?

Bigger is always strictly better with no downside
Bigger = finer quantization/better audio but more bits and harder for the language model; smaller = cheaper but coarser
Size has no effect on quality

Chapter 4: Residual Vector Quantization

One codebook isn’t enough for high-fidelity audio — you’d need an astronomically large one. The key innovation in EnCodec/SoundStream is Residual Vector Quantization (RVQ): quantize in layers. Quantize the latent with the first codebook; that leaves a residual (what the first code missed). Quantize that residual with a second codebook. Quantize the new residual with a third. And so on — each codebook cleans up the error of the previous one.

So each time-step is now represented by several tokens, one per RVQ level (say 8 levels, 8 tokens). The first level captures the coarse structure; later levels add finer and finer detail. This is wildly more efficient: 8 codebooks of 1,024 entries store only 8 × 1,024 = 8,192 vectors, yet together they can express 1,024^8 = 2^80 distinct combinations, far more than any single codebook could hold. It also gives a useful property, scalable bitrate: keep only the first few RVQ levels for low-bitrate, lower-quality audio, or all of them for high fidelity. You can dial quality by choosing how many levels to keep.

Concept → realization: RVQ is coarse-to-fine, like sketching then refining. Level 1 is the rough outline of the sound; each subsequent level adds a layer of texture the previous levels couldn’t capture. Because later tokens only encode residual detail, dropping them degrades gracefully rather than catastrophically — the foundation of variable-bitrate codecs.
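The level-by-level procedure can be sketched as a short loop: quantize, subtract, repeat. The codebooks here are hypothetical 3×3 grids at shrinking scales, chosen only so the coarse-to-fine behavior is easy to see; real codebooks are learned. `rvq_encode`, `rvq_decode`, and `grid_codebook` are illustrative names.

```python
def nearest(vec, codebook):
    return min(range(len(codebook)),
               key=lambda i: sum((x - y) ** 2
                                 for x, y in zip(vec, codebook[i])))

def rvq_encode(latent, codebooks):
    """Quantize `latent` level by level; return one token per level."""
    residual = list(latent)
    tokens = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        tokens.append(idx)
        # What this level missed becomes the next level's input.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the chosen entries; pass fewer tokens for a lower bitrate."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def grid_codebook(scale):
    # Hypothetical 3x3 grid of 2-D entries; real codebooks are learned.
    return [(a * scale, b * scale) for a in (-1, 0, 1) for b in (-1, 0, 1)]

codebooks = [grid_codebook(1.0), grid_codebook(1 / 3), grid_codebook(1 / 9)]
latent = (0.8, -0.35)
tokens = rvq_encode(latent, codebooks)

errors = []
for levels in range(1, len(codebooks) + 1):
    approx = rvq_decode(tokens[:levels], codebooks)
    errors.append(sum((x - y) ** 2 for x, y in zip(latent, approx)) ** 0.5)
print(tokens, errors)
```

Decoding with `tokens[:1]`, `tokens[:2]`, and `tokens[:3]` uses the same stored codes at three different bitrates, and in this toy the error shrinks with each added level.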

Residual quantization, level by level

Each level quantizes what the previous levels missed (the residual). Add levels and watch the reconstruction (teal) close in on the target (dashed) — coarse first, fine detail last. Each level = one more token per step.

What does Residual Vector Quantization do?

Uses one enormous codebook
Quantizes in layers: each codebook cleans up the residual error of the previous one, giving many tokens/step and scalable bitrate
Removes quantization entirely

Chapter 5: Making It Sound Real

If you train the codec only to minimize reconstruction error on the waveform or spectrogram, you get the same problem as image autoencoders: muffled, over-smoothed audio. Squared error rewards hedging, so the decoder blurs fine detail — the result sounds dull and underwater. High-fidelity codecs fix this exactly as image generators do: with a GAN.

Alongside the reconstruction loss, a set of discriminators is trained to tell real audio from the codec’s output. The codec (the generator) is trained to fool them. This adversarial loss pushes the decoder to produce crisp, realistic detail — the sharp transients, the natural texture — that plain reconstruction loss smooths away. Codecs typically use multiple discriminators looking at the signal in different ways (different resolutions, the waveform and the spectrogram) to catch artifacts at every scale. The combination — reconstruction loss for accuracy, adversarial loss for realism — is what makes EnCodec sound genuinely good at low bitrates.
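The two-part objective can be sketched with plain numbers standing in for discriminator outputs. The hinge formulation used here is an assumption, since this lesson does not say which adversarial objective is used. The weights `w_rec` and `w_adv` and all scores are hypothetical too. With several discriminators, one would average these terms across them.

```python
def mean(values):
    return sum(values) / len(values)

def discriminator_hinge_loss(real_scores, fake_scores):
    """Push scores on real audio above +1 and on decoded audio below -1."""
    return (mean([max(0.0, 1.0 - s) for s in real_scores])
            + mean([max(0.0, 1.0 + s) for s in fake_scores]))

def generator_loss(recon_error, fake_scores, w_rec=1.0, w_adv=1.0):
    """Reconstruction term plus an adversarial term that rewards fooling
    the discriminator (higher scores on decoded audio lower the loss)."""
    adv = mean([max(0.0, 1.0 - s) for s in fake_scores])
    return w_rec * recon_error + w_adv * adv

# Hypothetical discriminator scores on a batch of three clips.
real_scores = [1.2, 0.4, 2.0]
fake_scores = [-1.5, 0.1, -0.3]
d_loss = discriminator_hinge_loss(real_scores, fake_scores)
g_loss = generator_loss(recon_error=0.25, fake_scores=fake_scores)
print(d_loss, g_loss)
```

The reconstruction term keeps the output faithful to the input, while the adversarial term only vanishes once the discriminator rates decoded audio as confidently real.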

Reconstruction-only vs. + adversarial

Reconstruction loss alone (orange) over-smooths, leaving fine detail muffled. Adding the GAN discriminator (teal) restores crisp transients and texture. Toggle to compare the two, shown visually as spectrogram sharpness.

Why do high-fidelity codecs add an adversarial (GAN) loss?

To make training faster
Reconstruction loss alone over-smooths audio; discriminators push the decoder to produce crisp, realistic detail
To increase the codebook size automatically

Chapter 6: Bitrate & the Numbers

Let’s ground it. Suppose the encoder produces 75 frames per second, RVQ uses 8 levels, and each codebook has 1,024 entries (so log2(1,024) = 10 bits per token). The bitrate is frames × levels × bits = 75 × 8 × 10 = 6,000 bits per second, or 6 kbps. Compare raw 16 kHz audio at 16 bits per sample: 16,000 × 16 = 256,000 bits per second, or 256 kbps. The codec achieves roughly 43× compression (256 / 6 ≈ 42.7) while staying near-transparent in quality. That’s a remarkable feat, and far better than older hand-designed codecs at the same bitrate.

And because of RVQ, you can choose the operating point after training, provided the codec was trained to cope with a varying number of levels (SoundStream, for example, randomly drops later quantizer levels during training). Keep all 8 levels for 6 kbps high quality; keep 4 levels for 3 kbps; keep 2 for 1.5 kbps lower quality. For audio language models, fewer levels also means a shorter, easier token sequence to predict, so there’s a direct trade between audio fidelity and how hard the generation task is. Generative systems often model the coarse (first) levels with one model and fill in the fine levels with another.
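The arithmetic in this chapter fits in one small function. The sketch below uses the chapter's own numbers (75 frames per second, 1,024-entry codebooks, 16 kHz 16-bit raw audio); `bitrate_bps` is an illustrative helper name.

```python
from math import log2

def bitrate_bps(frames_per_sec, levels, codebook_size):
    """Bits per second = frames x levels x bits per token."""
    return frames_per_sec * levels * log2(codebook_size)

raw_bps = 16_000 * 16  # 16 kHz samples at 16 bits each

for levels in (8, 4, 2):
    bps = bitrate_bps(75, levels, 1024)
    tokens_per_sec = 75 * levels
    compression = raw_bps / bps
    print(levels, bps / 1000, tokens_per_sec, round(compression, 1))
```

Each row reports the levels kept, the bitrate in kbps, the number of tokens a language model would have to predict per second, and the compression ratio against raw audio.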

Bitrate from frames × levels × bits

Drag the RVQ levels: the bitrate and token count per second scale linearly, and the quality climbs then saturates. See the compression ratio vs. raw audio update live.

With 75 frames/sec, 8 RVQ levels, and 10 bits per token, the bitrate is:

256 kbps
6 kbps (75 × 8 × 10 = 6,000 bits/sec)
75 kbps

Chapter 7: The Codec, End to End (showcase)

Watch a waveform travel the full codec: encode to latents, residual-quantize into a grid of tokens (time × RVQ level), and decode back to audio. Change the codebook size and the number of RVQ levels and see the reconstruction quality and bitrate respond — the exact dials that define a codec like EnCodec.

Encode → tokenize → decode

Top: input waveform. Middle: the RVQ token grid (rows = levels, columns = time frames) — this is what a language model would generate. Bottom: the reconstruction. Drag levels and codebook size; watch fidelity and bitrate trade off.

That middle grid is the whole point: a small, discrete, finite-vocabulary representation of sound. Once audio looks like that — a sequence of integers — a transformer can predict it, and audio generation becomes language modeling.

Chapter 8: Audio Language Models

Here’s the payoff that motivated everything. With a codec, generating audio becomes next-token prediction on codec tokens. Train a transformer on sequences of audio tokens and it learns to continue or generate sound. Condition it on text tokens and you get text-to-speech or text-to-music. This is the recipe behind a wave of systems:

AudioLM — generates speech/music by predicting codec tokens, with a hierarchy from coarse semantic tokens to fine acoustic tokens.
MusicGen — generates music from a text prompt by autoregressively predicting EnCodec tokens.
VALL-E — text-to-speech as codec-token language modeling, able to clone a voice from a few seconds of audio.
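The shared recipe, next-token prediction over codec tokens, reduces to a loop like the one below. This is a toy sketch: `toy_model` is a hypothetical stand-in for a trained transformer, the vocabulary size is made up, and greedy argmax is used for simplicity where real systems typically sample and also condition on text.

```python
def generate(next_token_scores, prompt, n_new):
    """Greedy autoregressive decoding: repeatedly append the top-scoring token."""
    tokens = list(prompt)
    for _ in range(n_new):
        scores = next_token_scores(tokens)
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens

VOCAB_SIZE = 8  # hypothetical; a codec level might have 1,024 entries

def toy_model(tokens):
    """Stand-in for a trained transformer: favors (last token + 1) mod vocab."""
    favored = (tokens[-1] + 1) % VOCAB_SIZE
    return [1.0 if t == favored else 0.0 for t in range(VOCAB_SIZE)]

audio_tokens = generate(toy_model, prompt=[5], n_new=6)
print(audio_tokens)
```

In a real pipeline the resulting integers would be handed to the codec's decoder, which turns them back into a waveform.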

The RVQ structure shapes how these models work: because there are several token levels per step, they often predict the coarse levels first and the fine levels after (or use clever interleaving), since naively flattening all levels makes the sequence very long. The codec is the quiet enabler — it turned the hardest part (continuous audio) into the thing transformers do best (discrete sequences), and unlocked the entire generative-audio era.
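The length problem, and one interleaving answer to it, can be seen on a tiny token grid. The sketch compares naive flattening with a delay-style layout in which level k is shifted right by k steps, so all levels advance in parallel. This is offered as an illustration of the idea and not as any particular system's exact scheme. `PAD`, `flatten`, and `delay_interleave` are hypothetical names.

```python
PAD = -1  # hypothetical placeholder for "no token yet"

def flatten(grid):
    """Frame by frame, level by level: length = frames * levels."""
    return [grid[level][t]
            for t in range(len(grid[0]))
            for level in range(len(grid))]

def delay_interleave(grid):
    """Shift level k right by k steps so all levels advance in parallel.

    Returns one row per level, each of length frames + levels - 1.
    """
    levels, frames = len(grid), len(grid[0])
    steps = frames + levels - 1
    return [
        [grid[k][s - k] if 0 <= s - k < frames else PAD
         for s in range(steps)]
        for k in range(levels)
    ]

# Toy token grid: 3 RVQ levels (rows) x 4 frames (columns).
grid = [
    [10, 11, 12, 13],  # level 1 (coarse)
    [20, 21, 22, 23],  # level 2
    [30, 31, 32, 33],  # level 3 (fine)
]
flat = flatten(grid)
delayed = delay_interleave(grid)
print(len(flat))
for row in delayed:
    print(row)
```

Flattening costs frames × levels sequential steps, while the delayed layout needs only frames + levels − 1 steps if the model emits one token per level at each step. In the delayed layout the coarse token for a frame is always produced before that frame's finer tokens.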

Audio generation = next-token prediction

Text tokens condition a transformer that predicts codec tokens one at a time; the codec decoder turns them into sound. Step through generation and watch the audio-token sequence grow, then decode.

How does a neural codec enable audio generation by language models?

It makes the audio louder
It turns audio into discrete tokens, so generation becomes next-token prediction (AudioLM, MusicGen, VALL-E)
It trains the language model directly on waveforms

Chapter 9: Cheat Sheet & Connections

waveform (continuous, 16k samples/sec)
↓ conv encoder (compress in time)
latents (~75/sec, continuous)
↓ residual vector quantization
tokens (several discrete codes per frame, coarse→fine)
↓ decoder + GAN (realism)
reconstructed audio (~40× compression, near-transparent)

Piece	Role
Encoder/decoder	conv autoencoder: compress & reconstruct waveform
Vector quantization	snap latent to nearest codebook entry → integer token
Codebook	the learned audio alphabet (size = fidelity vs. cost)
Residual VQ	layered codebooks, coarse→fine; scalable bitrate
GAN loss	discriminators → crisp, realistic audio (no muffling)
Audio LMs	predict codec tokens → TTS, music, audio generation

Keep exploring

→ Audio Representations — spectrograms, the other audio front-end
→ VAE / VQ-VAE — vector quantization in full detail
→ TTS Architectures — text → speech, including codec-token TTS
→ GPT — the next-token prediction that generates the tokens

“What I cannot create, I do not understand.” You just rebuilt the neural audio codec: compress sound with a conv autoencoder, discretize the latents by snapping to a learned codebook, refine coarse-to-fine with residual quantization, and keep it crisp with a GAN. The result — sound as a short sequence of integers — is what let language models learn to speak and sing.