Bai, Likhomanenko, Zhang, Gu, Aldeneh, Jaitly — Apple, 2024

dMel: Speech Tokenization Made Simple

A training-free discrete speech representation that bins mel-spectrogram values into integers — no neural codec needed.

Prerequisites: Mel Spectrograms + Transformers + Basic signal processing

Chapter 0: Why New Speech Tokens?

Large language models process text as discrete tokens. To handle speech with the same machinery, you need to convert a continuous audio waveform into a sequence of integers. The dominant approach in 2023-2024 uses neural audio codecs — models like EnCodec and SoundStream that compress audio into discrete codes via a learned encoder-decoder with vector quantization.

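These codecs are powerful, but they come with serious baggage: they must be trained, they produce complex multi-codebook (RVQ) hierarchies, they often cannot process small chunks in isolation, and they are fragile under domain shift.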

The core question: Do we actually need a learned encoder to discretize speech? Mel spectrograms already capture the essential structure of speech. What if we just... binned the mel values directly into integers? No training, no codec, no RVQ hierarchy.

This is exactly what dMel proposes. The result is shockingly competitive: a completely training-free tokenization that matches or beats EnCodec on both TTS and ASR tasks, while being streamable, domain-robust, and trivial to implement.

Codec vs dMel Pipeline

Top: traditional neural codec path (train encoder, quantize, decode). Bottom: dMel path (compute mel, bin values, done).

What is the main disadvantage of neural audio codecs like EnCodec for speech tokenization?

Neural codecs need training, create complex multi-codebook hierarchies, often cannot process small chunks in isolation, and are fragile to domain shift — all problems dMel avoids entirely.

Chapter 1: Mel Spectrograms 101

Before we can discretize mel spectrograms, we need to understand what they are and why they are the dominant representation for speech processing.

From Waveform to Spectrogram

Raw audio is a 1D signal: amplitude values sampled at 16,000 Hz (for speech). A 10-second utterance is just 160,000 floating-point numbers. This raw waveform is hard for models to work with directly — the relevant structure (phonemes, formants, pitch) is encoded in frequency patterns that are invisible in the time domain.

The Short-Time Fourier Transform (STFT) slides a window across the waveform and computes the frequency content at each position. With a 25ms window and a 10ms hop, a 10-second signal produces about 1000 time frames, each containing the magnitudes of 513 frequency bins (a 1024-point FFT yields 1024/2 + 1 bins). This is a spectrogram: a 2D matrix of [time_frames x frequency_bins].
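The frame and bin counts above can be checked with a minimal NumPy sketch. The function name `stft_magnitude`, the Hann window, and the 440 Hz test tone are illustrative assumptions, not details from the paper. Without padding at the edges, a 10-second signal gives 998 frames rather than exactly 1000.

```python
import numpy as np

def stft_magnitude(wave, win_length=400, hop_length=160, n_fft=1024):
    """Magnitude STFT with a Hann window; returns [frames, n_fft // 2 + 1]."""
    n_frames = 1 + (len(wave) - win_length) // hop_length
    window = np.hanning(win_length)
    frames = np.stack([
        wave[i * hop_length : i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    # rfft zero-pads each 400-sample frame up to n_fft points.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

sample_rate = 16_000                      # 25 ms = 400 samples, 10 ms = 160 samples
t = np.arange(10 * sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440.0 * t)      # 10-second test tone (assumed input)
spec = stft_magnitude(wave)
peak_bin = int(spec.mean(axis=0).argmax())
print(spec.shape, peak_bin * sample_rate / 1024)
```

The energy of the tone concentrates in the bin nearest 440 Hz, which is the kind of frequency structure that is hard to see in the raw waveform.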

The Mel Scale

Humans perceive pitch logarithmically: the difference between 100 Hz and 200 Hz sounds the same as between 1000 Hz and 2000 Hz (both are one octave). The mel scale warps the linear frequency axis to match this perception. A mel filterbank is a set of triangular filters spaced evenly on the mel scale, applied to the linear spectrogram magnitudes.

Standard configuration for speech: 16 kHz audio, 25ms window, 10ms hop, 80 mel bands, log magnitude.

The result: an 80-dimensional vector every 10ms. A 10-second utterance becomes a [1000 x 80] matrix. Each column is a mel-frequency channel; each row is a time frame.
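As a rough illustration of how a mel filterbank turns 513 linear bins into 80 bands, here is a sketch using the common HTK-style mel formula. The formula choice, the helper names, and the random stand-in spectrogram are assumptions; production code would normally rely on a library implementation.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (one common convention, assumed here).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, sample_rate=16_000):
    """Triangular filters evenly spaced on the mel scale: [n_mels, n_fft // 2 + 1]."""
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, mid, hi = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        # Triangle: rises from lo to mid, falls from mid to hi, zero elsewhere.
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank, hz_edges

bank, hz_edges = mel_filterbank()
# Stand-in for a real magnitude spectrogram of shape [time_frames, 513].
spec = np.abs(np.random.default_rng(0).normal(size=(1000, 513)))
log_mel = np.log(spec @ bank.T + 1e-6)    # [1000, 80] log-mel features
band_widths = hz_edges[2:] - hz_edges[:-2]
```

The band widths grow with frequency, which is the "warping" described above: fine resolution at low frequencies, coarse resolution at high frequencies.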

Why mel spectrograms are ideal for speech: They compress the 513 frequency bins of a linear spectrogram into 80 perceptually meaningful bands. They discard phase information (which is largely irrelevant for speech content). They are the input to widely used ASR systems such as Whisper and the output target of TTS systems such as Tacotron and FastSpeech. dMel's insight is that this well-understood representation is already "good enough": we do not need to learn a better one.

Mel Spectrogram Anatomy

A synthetic mel spectrogram showing 80 frequency bands over time. Each cell is a continuous float value representing energy at that frequency and time.

Why does the mel spectrogram use 80 bands instead of the full 513 frequency bins from the STFT?

The mel scale groups high frequencies into wider bands (since we cannot distinguish fine differences up there) while preserving detail in low frequencies where speech formants live. 80 bands is the sweet spot: enough resolution for speech, compact enough for efficient processing.

Chapter 2: The Discretization Trick

Here is the entire dMel algorithm. It is almost comically simple:

Step 1
Compute 80-band mel spectrogram from raw audio (16kHz, 10ms hop, 25ms window, log magnitude).
Step 2
Per-channel normalization: subtract channel mean μc and divide by channel std σc (computed from training set statistics).
Step 3
Clip values to [−4, 4] (nearly all normalized values fall here).
Step 4
Linearly map [−4, 4] to [0, B−1] and round to nearest integer. Each value becomes a bin index in {0, 1, ..., B−1}.
Result
Each time frame is 80 integers, each in [0, B−1]. Vocabulary size = B (typically 256).

Mathematically, for channel c at time t:

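$$\hat{x}_{c,t} = \frac{\mathrm{mel}_{c,t} - \mu_c}{\sigma_c}$$

$$\mathrm{bin}_{c,t} = \mathrm{round}\!\left( \mathrm{clip}(\hat{x}_{c,t},\, -4,\, 4) \times \frac{B-1}{8} + \frac{B-1}{2} \right)$$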

That is it. No neural network. No codebook learning. No VQ-VAE. Just normalization, clipping, and uniform quantization.
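The four steps can be sketched in a few lines of NumPy. The function name `dmel_quantize` and the random stand-in features are assumptions for illustration; in particular, the statistics here are computed from the same array, whereas the text describes statistics taken from the training set.

```python
import numpy as np

def dmel_quantize(log_mel, mean, std, num_bins=256, clip=4.0):
    """Map log-mel values [T, C] to integer bins in [0, num_bins - 1]."""
    z = (log_mel - mean) / std                  # step 2: per-channel normalization
    z = np.clip(z, -clip, clip)                 # step 3: clip to [-4, 4]
    scaled = z * (num_bins - 1) / (2 * clip) + (num_bins - 1) / 2
    return np.rint(scaled).astype(np.int64)     # step 4: round to nearest bin

rng = np.random.default_rng(0)
log_mel = rng.normal(loc=-5.0, scale=2.0, size=(1000, 80))  # stand-in features
mean = log_mel.mean(axis=0)   # stand-in for training-set channel means
std = log_mel.std(axis=0)     # stand-in for training-set channel stds
tokens = dmel_quantize(log_mel, mean, std)
print(tokens.shape, tokens.min(), tokens.max())
```

Values at or beyond the clip limits land on the first and last bins, so the output always stays inside the B-entry vocabulary.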

The Token Sequence

Each time frame produces 80 bin indices. These 80 tokens are not flattened into one long sequence (that would make it 80x longer than text). Instead, they are processed in parallel — each channel has its own embedding, and all 80 are combined at each time step. We will cover this parallel processing in Chapter 4.

Why 256 Bins?

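The paper experiments with B ∈ {64, 128, 256, 512, 1024}. Key finding: more bins give better reconstruction but make the prediction task harder to learn, and the sweet spot is 128-256 (see Chapter 7).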

Core insight: The mel spectrogram is already a highly compressed, perceptually-motivated representation. After per-channel normalization, the values cluster in a predictable range. Uniform quantization with 256 levels (8-bit precision) preserves enough information for both synthesis and recognition. The "trick" is that there is no trick — the simplest possible discretization works.

Interactive Binning Visualization

Continuous mel values snap to discrete bins. The number of bins sets the quantization resolution: fewer bins = more quantization error.

What are the steps of the dMel tokenization algorithm?

dMel is purely deterministic: mel spectrogram, per-channel normalization using training set statistics, clip outliers, and uniform quantization. No learning involved.

Chapter 3: Why This Works

It seems almost too simple. Why should naive uniform quantization of mel values compete with a carefully trained neural codec? Three reasons:

1. Mel spectrograms are already information-optimal for speech

The mel filterbank was designed over decades of psychoacoustics research to capture exactly the information humans use to perceive speech. It discards phase (irrelevant for content), compresses high frequencies (where humans have low resolution), and preserves formant structure (the core of phonetic identity). A neural codec that takes raw audio as input must rediscover this structure through training. dMel starts with it for free.

2. Per-channel normalization puts every channel on the same scale

Different mel bands have wildly different energy distributions. Low-frequency bands (fundamental frequency, first formant) have high energy and large variance. High-frequency bands (fricatives, noise) have low energy. After subtracting the per-channel mean and dividing by the per-channel std, every channel has zero mean and unit variance. The values become approximately Gaussian with most mass between [-3, 3]. Clipping at [-4, 4] loses almost nothing (<0.01% of values).

This normalization is crucial: without it, uniform quantization would waste bins on the low-energy high-frequency channels (all bins near zero) while saturating the high-energy low-frequency channels.
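A small synthetic experiment illustrates both claims: the clipped fraction under a Gaussian assumption, and the wasted bins when one global scale is used. The two channels below (a loud, high-variance one and a quiet, low-variance one) are made-up numbers chosen only to show the effect.

```python
import math
import numpy as np

# P(|z| > 4) for a standard Gaussian: the fraction lost to clipping.
clipped_fraction = math.erfc(4.0 / math.sqrt(2.0))

rng = np.random.default_rng(0)
loud = rng.normal(loc=2.0, scale=3.0, size=50_000)     # hypothetical low band
quiet = rng.normal(loc=-8.0, scale=0.3, size=50_000)   # hypothetical high band
x = np.stack([loud, quiet], axis=1)                    # [samples, 2 channels]

def bins_used(x, mean, std, num_bins=256):
    """Count how many distinct bins each channel occupies after quantization."""
    z = np.clip((x - mean) / std, -4.0, 4.0)
    tokens = np.rint(z * (num_bins - 1) / 8.0 + (num_bins - 1) / 2.0).astype(int)
    return [len(np.unique(tokens[:, c])) for c in range(tokens.shape[1])]

global_usage = bins_used(x, x.mean(), x.std())                    # one shared scale
per_channel_usage = bins_used(x, x.mean(axis=0), x.std(axis=0))   # per-channel scale
print(f"clipped: {clipped_fraction:.2e}")
print("global:", global_usage, "per-channel:", per_channel_usage)
```

With a shared scale, the quiet channel is squeezed into a small handful of bins; with per-channel statistics, both channels spread over most of the range.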

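3. 256 bins = 8 bits per value, finer than typical perceptual limits

After normalization, the range [-4, 4] is divided into 256 bins. With the mapping from Chapter 2, adjacent bin centers are 8/255 ≈ 0.031 normalized units apart, that is, about 3% of one channel standard deviation. For comparison, the just-noticeable difference in audio loudness is about 1 dB. A bin only reaches that size if a channel's log-energy standard deviation exceeds roughly 32 dB, which would be unusually large, so the quantization step stays below the perceptual threshold and the information loss should be inaudible.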

Information-theoretic perspective: A 10-second utterance at 100 fps with 80 channels and 256 bins = 1000 × 80 × 8 bits = 640,000 bits, or 64 kbps. EnCodec at 6 kbps achieves higher compression but loses information that matters for high-fidelity reconstruction. dMel trades higher bitrate for near-transparent quantization of the mel features and zero training cost. The transformer language model then provides the "compression" by learning the statistical structure of these token sequences.
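The bitrate arithmetic generalizes to other bin counts. The helper name `dmel_bitrate_kbps` is an assumption; the 100 fps and 80 channels come from the text, and the result is the raw token bitrate before any entropy coding.

```python
import math

def dmel_bitrate_kbps(num_bins, channels=80, frames_per_second=100):
    """Raw (uncompressed) token bitrate in kilobits per second."""
    bits_per_token = math.log2(num_bins)
    return frames_per_second * channels * bits_per_token / 1000.0

bitrates = {b: dmel_bitrate_kbps(b) for b in (64, 128, 256, 512, 1024)}
ratio_vs_6kbps = bitrates[256] / 6.0   # compared with a 6 kbps codec setting
print(bitrates, round(ratio_vs_6kbps, 1))
```

Each doubling of the bin count adds only one bit per value, so the bitrate grows by 8 kbps per doubling while the quantization step halves.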

Quantization Error Distribution

After normalization, mel values are approximately Gaussian. Uniform quantization with B bins produces quantization error (difference between original and reconstructed value). More bins = smaller error.
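For a uniform quantizer with step Δ, the rounding error is roughly uniform on [−Δ/2, Δ/2], which gives an RMS error of about Δ/√12. The sketch below checks this against simulated Gaussian values. The sample is clipped to [−4, 4] beforehand so that only rounding error is measured, and the function name is an illustrative assumption.

```python
import numpy as np

def roundtrip_rms_error(z, num_bins):
    """RMS difference between normalized values and their dequantized bins."""
    scale = (num_bins - 1) / 8.0
    tokens = np.rint(np.clip(z, -4.0, 4.0) * scale + (num_bins - 1) / 2.0)
    z_hat = (tokens - (num_bins - 1) / 2.0) / scale
    return float(np.sqrt(np.mean((z - z_hat) ** 2)))

rng = np.random.default_rng(0)
z = np.clip(rng.normal(size=200_000), -4.0, 4.0)   # simulated normalized mel values

results = {}
for b in (64, 128, 256, 512, 1024):
    step = 8.0 / (b - 1)
    results[b] = (roundtrip_rms_error(z, b), step / np.sqrt(12.0))
    print(b, results[b])
```

Doubling B roughly halves the RMS error, which matches the "more bins = smaller error" trend in the figure.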

Why is per-channel normalization essential before uniform quantization?

Without normalization, high-energy channels would saturate bins while low-energy channels would cluster near zero, wasting most of the quantization resolution. Normalization puts every channel on the same scale so that each one spreads over the bin range in a comparable way.

Chapter 4: Parallel Encoding/Decoding

Each dMel time frame produces 80 discrete tokens (one per mel channel). Naively flattening these into a single sequence would give 80x the length: at 100 frames per second, a 10-second utterance would become 80,000 tokens, impractically long for a standard transformer. dMel uses a parallel channel architecture instead.

Encoding: From 80 Bins to One Embedding

Each of the 80 channels has its own embedding table of size [B x d], where B is the number of bins and d is the model dimension. At each time step:

  1. Look up the embedding for each channel's bin index: 80 embeddings of dimension d
  2. Sum (or concatenate + project) the 80 embeddings into a single d-dimensional vector
  3. This single vector represents the full spectral content at that time step

The sequence length seen by the transformer is just T (number of time frames), not 80T. This is the key to efficiency.

Data flow (encoding): Audio [160000 samples] → Mel spectrogram [1000 x 80] floats → dMel quantization [1000 x 80] integers in [0, 255] → Channel embeddings [1000 x 80 x d] → Sum across channels [1000 x d] → Transformer input sequence of length 1000.
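The shape bookkeeping in this data flow can be sketched with NumPy indexing. The table initialization, the model dimension d = 32, and the random tokens are assumptions for illustration; this shows the sum variant, not the concatenate-and-project variant.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, B, d = 1000, 80, 256, 32            # frames, channels, bins, model dim (d assumed)

tables = rng.normal(scale=0.02, size=(C, B, d))   # one [B x d] table per channel
tokens = rng.integers(0, B, size=(T, C))          # stand-in dMel tokens

channel_idx = np.arange(C)
# Advanced indexing: for every (t, c), pick row tokens[t, c] of table c.
per_channel = tables[channel_idx, tokens]         # [T, C, d]
frame_embeddings = per_channel.sum(axis=1)        # [T, d]: one vector per frame

# Cross-check the first frame against an explicit loop.
manual = sum(tables[c, tokens[0, c]] for c in range(C))
print(per_channel.shape, frame_embeddings.shape)
```

The transformer input has length T = 1000 regardless of the number of channels, which is the efficiency argument made above.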

Decoding: From One Hidden State to 80 Predictions

For TTS (generating speech), the transformer outputs a hidden state ht of dimension d at each time step. This must be decoded back into 80 bin indices. The decoding is also parallel:

  1. Apply 80 separate classification heads (linear layers): ht → [B] logits, one per channel
  2. Each head independently predicts a categorical distribution over B bins for its channel
  3. At inference: take argmax (or sample) from each channel's distribution
  4. The 80 predicted bins reconstruct the mel spectrogram frame

Training loss: sum of 80 cross-entropy losses (one per channel). This treats the 80 bins of a frame as conditionally independent given the context, so the joint distribution p(bin1, ..., bin80 | context) factorizes into a product of per-channel distributions p(binc | context). The independence assumption is approximate but works well in practice because the transformer hidden state ht already captures cross-channel correlations.
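A minimal NumPy sketch of the parallel heads and the summed cross-entropy follows. Random hidden states stand in for transformer outputs, and the weight scale and dimensions are assumptions; a real model would train these heads with an autodiff framework.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
T, C, B, d = 50, 80, 256, 32
hidden = rng.normal(size=(T, d))                  # stand-in for transformer states h_t
head_w = rng.normal(scale=0.02, size=(C, d, B))   # one linear head per channel
head_b = np.zeros((C, B))
targets = rng.integers(0, B, size=(T, C))         # stand-in ground-truth bins

logits = np.einsum("td,cdb->tcb", hidden, head_w) + head_b   # [T, C, B]
log_probs = log_softmax(logits)
picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]  # [T, C]
loss_per_frame = -picked.sum(axis=1)      # sum of 80 cross-entropies per frame
loss = float(loss_per_frame.mean())
predicted = logits.argmax(axis=-1)        # greedy decoding: [T, C] bin indices
print(loss, C * np.log(B))
```

With untrained heads the loss sits near 80 × ln(256), the value for uniform guessing over 256 bins in each of the 80 channels.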

Why parallel decoding works: The mel spectrogram channels are not independent — harmonic structure creates strong correlations across frequency bands. But the transformer's hidden state has already captured these correlations via self-attention over the temporal sequence. Conditioning each channel head on the same rich hidden state allows them to produce correlated outputs without explicit cross-channel modeling at decode time.

From Predicted Bins Back to Audio

After predicting 80 bin indices per frame, reconstruction is:

  1. Map bin indices back to normalized mel values: value = (bin - (B-1)/2) × 8 / (B-1)
  2. Denormalize: melc,t = value × σc + μc
  3. Apply a vocoder (HiFi-GAN) or flow-matching model to convert mel spectrogram to waveform
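Steps 1 and 2 invert the quantizer and can be sketched as below; the vocoder in step 3 is a trained model and is not covered. Function names and the synthetic statistics are assumptions. For values inside the clip range, the round-trip error in normalized units should not exceed half a bin step, 4/(B−1).

```python
import numpy as np

def dmel_quantize(log_mel, mean, std, num_bins=256, clip=4.0):
    z = np.clip((log_mel - mean) / std, -clip, clip)
    return np.rint(z * (num_bins - 1) / (2 * clip) + (num_bins - 1) / 2).astype(np.int64)

def dmel_dequantize(tokens, mean, std, num_bins=256, clip=4.0):
    """Map bin indices back to log-mel values (bin centers)."""
    z = (tokens - (num_bins - 1) / 2) * (2 * clip) / (num_bins - 1)   # step 1
    return z * std + mean                                             # step 2

rng = np.random.default_rng(0)
mean = rng.normal(size=80)                  # hypothetical channel means
std = rng.uniform(0.5, 2.0, size=80)        # hypothetical channel stds
z_true = np.clip(rng.normal(size=(1000, 80)), -4.0, 4.0)
log_mel = z_true * std + mean

tokens = dmel_quantize(log_mel, mean, std)
recon = dmel_dequantize(tokens, mean, std)
max_error = float(np.abs((recon - log_mel) / std).max())
half_step = 4.0 / 255
print(max_error, half_step)
```

The reconstructed array has the same [frames x 80] shape as the original log-mel features, so it can be handed to whichever vocoder is used.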

Parallel Channel Architecture

80 channel embeddings are summed into one vector per time step. The transformer operates on this compressed sequence. Decoding fans back out to 80 independent heads.

How does dMel avoid creating an 80x longer token sequence?

Each channel has its own embedding table. The 80 per-frame embeddings are summed (or projected) into one d-dimensional vector. The transformer sees T tokens, not 80T.

Chapter 5: RichTTS & RichASR

dMel is not just a tokenization scheme — it enables a unified architecture for both speech synthesis (TTS) and speech recognition (ASR) using the exact same transformer + dMel framework.

RichTTS: Text-to-Speech

Architecture: an autoregressive transformer that takes text tokens as prefix and generates dMel tokens frame-by-frame.

Input
Text token sequence: "Hello world" → [BOS, H, e, l, l, o, _, w, o, r, l, d, SEP]
Transformer
Decoder-only, causal attention. After SEP token, begins generating dMel frames autoregressively.
Output
80 bin predictions per frame via parallel heads. Stop when EOS predicted.
Vocoder
Predicted mel → HiFi-GAN or flow-matching vocoder → waveform

Training: teacher-forcing on (text, speech) pairs. Loss = sum of 80 cross-entropy losses per frame + next-token prediction loss on text prefix.

RichASR: Automatic Speech Recognition

Same architecture, reversed direction: dMel tokens as prefix, generate text tokens.

Input
dMel frame sequence from audio: [frame1, frame2, ..., frameT, SEP]
Transformer
Same decoder-only model. After SEP, generates text tokens autoregressively.
Output
Text tokens: "Hello world" (standard LM decoding: beam search or greedy)

The same model can even do both tasks in a single training run by mixing TTS and ASR examples with appropriate prefixes. This is the "Rich" in RichTTS/RichASR — a unified rich-context model.

Why unification matters: With codec-based approaches, TTS and ASR typically call for different architectures because the codec tokens have different properties than text tokens (multi-stream RVQ, different sequence lengths, different generation patterns). With dMel, speech tokens behave like text tokens: same vocabulary type (integers), same embedding approach, same autoregressive generation. A single model, a single loss, a single framework.

Results Snapshot

  • TTS quality: RichTTS with dMel achieves speaker similarity and naturalness comparable to VALL-E (which uses EnCodec), with better robustness on out-of-domain text
  • ASR accuracy: RichASR with dMel comes close to Whisper-base on LibriSpeech (5.3% vs. 5.0% WER on test-clean), without any supervised speech encoder
  • Zero-shot voice cloning: 3-second speech prompt → generate in that voice. Works because dMel preserves speaker characteristics in the mel representation

Interactive Mel → dMel → Reconstruction

Left: original mel spectrogram. Middle: discretized dMel tokens (color = bin index). Right: reconstructed mel from bins. Reconstruction quality degrades at low bin counts.

How does dMel enable a unified TTS + ASR model?

With dMel, speech tokens and text tokens are both integer sequences. TTS = text → dMel, ASR = dMel → text. Same model, same loss, just different prefix/target ordering.

Chapter 6: Streaming & Real-Time

One of dMel's most practical advantages: it is inherently streamable. You can tokenize audio in small chunks (200ms in the examples here, and in principle a single frame) without context from distant past or future audio.

Why Codecs Struggle with Streaming

Neural codecs typically use:

  • Global normalization: computing statistics over the full utterance before encoding
  • Non-causal convolutions: encoder convolutions that look ahead in time
  • Padding artifacts: the first and last frames behave differently due to padding

To stream with a codec built this way, you must either accept quality degradation at chunk boundaries or add complex buffering logic.

Why dMel is Naturally Streamable

dMel's tokenization depends only on:

  1. The STFT of the current window (25ms of audio centered on the current frame)
  2. Pre-computed channel statistics (μc, σc from the training set, which are constants)
  3. The quantization formula (a fixed mathematical function)

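None of these require looking beyond the current 25ms window. You can tokenize a 200ms chunk (20 frames) without any look-ahead and get the exact same tokens you would get processing the full utterance, as long as every frame sees its full 25ms window. Because consecutive windows overlap by 15ms, the only state to carry between chunks is the last few milliseconds of samples. No learned context, no edge effects, no boundary artifacts.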
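The claim that chunked processing reproduces the full-utterance result can be checked at the framing level, since everything after framing (FFT, mel filterbank, normalization, binning) acts on one frame at a time. The sketch below uses a hypothetical `StreamingFramer` that keeps leftover samples between chunks; it assumes frames are taken without center padding.

```python
import numpy as np

WIN, HOP = 400, 160   # 25 ms window and 10 ms hop at 16 kHz

def frame_signal(wave):
    """Cut a waveform into overlapping frames without padding: [n_frames, WIN]."""
    n_frames = max(0, 1 + (len(wave) - WIN) // HOP)
    if n_frames == 0:
        return np.empty((0, WIN))
    return np.stack([wave[i * HOP : i * HOP + WIN] for i in range(n_frames)])

class StreamingFramer:
    """Emits complete frames as chunks arrive, carrying leftover samples over."""
    def __init__(self):
        self.buffer = np.empty(0)

    def push(self, chunk):
        self.buffer = np.concatenate([self.buffer, chunk])
        frames = frame_signal(self.buffer)
        # Drop only the samples no future frame needs; at least WIN - HOP remain.
        self.buffer = self.buffer[len(frames) * HOP :]
        return frames

rng = np.random.default_rng(0)
wave = rng.normal(size=16_000)            # 1 second of stand-in audio
offline = frame_signal(wave)

framer = StreamingFramer()
chunk = 3_200                             # 200 ms chunks
streamed = np.concatenate(
    [framer.push(wave[i : i + chunk]) for i in range(0, len(wave), chunk)]
)
print(offline.shape, streamed.shape, np.array_equal(offline, streamed))
```

Because the frames are identical, any frame-wise tokenizer applied to them yields identical tokens in both modes.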

Latency calculation: A frame can be emitted once its 25ms window is complete, and a new frame follows every 10ms, so the algorithmic latency is roughly one window plus one hop: 25ms + 10ms = 35ms. In practice, 200ms chunks (20 frames) provide a good compute-efficiency tradeoff. Compare this to EnCodec's minimum context of ~320ms (due to encoder receptive field) or streaming Whisper's 2-second chunks.

Implications for Real-Time Applications

  • Voice assistants: Start processing speech-to-text immediately, without waiting for utterance end
  • Live captioning: Transcribe 200ms at a time with consistent quality
  • Interactive TTS: Generate speech in 200ms increments for low-latency dialogue
  • On-device: No GPU needed for tokenization (it is just mel + binning). Tokenize on CPU/DSP, send tokens to cloud model

Streaming Tokenization

Audio arrives in 200ms chunks. Each chunk is tokenized as it arrives, with no look-ahead and no boundary effects.

Why is dMel inherently streamable while neural codecs are not?

The mel-to-bin conversion is a purely local operation: each frame is independently normalized by fixed constants and quantized. No look-ahead, no global statistics needed at inference time.

Chapter 7: Comparison & Results

How does this training-free tokenization compare to learned approaches? The results are surprisingly strong.

TTS Results (RichTTS with dMel vs. Baselines)

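| System | Tokenization | WER ↓ | Speaker Sim ↑ | Training-free? |
| --- | --- | --- | --- | --- |
| VALL-E | EnCodec (8 RVQ layers) | 5.9% | 0.68 | No |
| VoiceCraft | EnCodec (single layer) | 4.8% | 0.67 | No |
| RichTTS (dMel) | dMel (256 bins) | 4.2% | 0.66 | Yes |
| Ground Truth | N/A | 2.2% | 1.00 | N/A |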

dMel achieves the lowest Word Error Rate (measuring intelligibility of generated speech) while staying within 0.02 of the baselines on speaker similarity, and it requires zero codec training.

ASR Results (RichASR with dMel vs. Baselines)

| System | Input Features | LS test-clean WER ↓ | LS test-other WER ↓ |
| --- | --- | --- | --- |
| Whisper-base | Mel spectrogram (continuous) | 5.0% | 12.6% |
| RichASR (EnCodec) | EnCodec tokens | 7.2% | 15.8% |
| RichASR (dMel-256) | dMel 256 bins | 5.3% | 13.1% |
| RichASR (dMel-128) | dMel 128 bins | 5.5% | 13.4% |

dMel clearly outperforms EnCodec tokens for ASR (5.3% vs. 7.2% WER on test-clean) and nearly matches Whisper-base (which uses continuous mel features with a trained encoder). This suggests that discretization loses very little speech content.

Key Takeaways

dMel vs. EnCodec — the tradeoff:
Information density: EnCodec at 6 kbps is 10x more compressed than dMel at 64 kbps. For bandwidth-limited transmission, codecs win.
Quality ceiling: dMel preserves more information (less lossy), giving better reconstruction when bitrate is not the bottleneck.
Robustness: dMel uses no learned components in tokenization (only fixed channel statistics), so it has no encoder that can overfit to the training domain. Codec quality can degrade on unseen acoustic conditions.
Simplicity: dMel is ~20 lines of Python. EnCodec is a full neural network requiring training infrastructure.
Streaming: dMel is trivially streamable. EnCodec requires engineering effort.

Out-of-Domain Robustness

When tested on noisy audio, accented speech, and singing (domains not in the training set):

  • EnCodec WER degrades by 40-60% relative
  • dMel WER degrades by only 10-15% relative

The reason: EnCodec's learned encoder has internalized assumptions about clean speech that break down OOD. dMel's deterministic transform has no such assumptions — it faithfully represents whatever audio you feed it.

Bin Count vs. Quality Tradeoff

Reconstruction SNR and WER as a function of bin count B. More bins = better quality but harder to learn. The sweet spot is 128-256.

In what scenario does dMel most clearly outperform neural codecs?

dMel's biggest advantage is robustness: it has no learned encoder that can overfit to the training domain. The deterministic mel-to-bin mapping treats any audio the same way, so WER degrades only 10-15% relative on out-of-domain inputs, versus 40-60% relative for EnCodec.

Chapter 8: Connections

dMel sits at an interesting intersection of ideas in modern ML:

Related Concepts

  • VQ-VAE / Neural Codecs: EnCodec, SoundStream, and DAC use learned vector quantization to compress audio. dMel shows that for speech, you can skip the learning entirely. But for music and general audio (where mel spectrograms are less optimal), learned codecs still have advantages.
  • Flow Matching (for vocoding): The paper uses flow-matching vocoders to convert predicted mel spectrograms back to waveforms. Flow matching provides higher quality than GAN-based vocoders (HiFi-GAN) at the cost of slower inference.
  • Whisper: Whisper already uses mel spectrograms as input to a trained encoder-decoder. dMel takes this further: instead of a learned mel encoder, just discretize the mel directly and treat it as a token sequence for a decoder-only LM.
  • Multimodal LLMs: Models like GPT-4o process speech, text, and images in a unified framework. dMel's simple tokenization makes it trivial to add speech to any text-based LLM — just expand the vocabulary by B tokens and add channel embeddings.
  • Visual Tokenization (VAR, VQGAN): The same question arises for images: do you need a learned tokenizer? Approaches like pixel quantization and patch-based methods mirror dMel's philosophy for vision.

When to Use dMel vs. Codecs

| Criterion | Use dMel | Use Neural Codec |
| --- | --- | --- |
| Domain robustness needed | Yes: no learned encoder to fail OOD | Risky |
| Streaming required | Yes: trivially streamable | Requires engineering |
| No training budget for codec | Yes: zero training cost | Needs GPU-hours |
| Bandwidth-constrained | 64 kbps is too high | Yes: 1.5-6 kbps achievable |
| Music / general audio | Mel is less optimal | Yes: can learn domain-specific structure |
| Multi-speaker expressiveness | Good (mel preserves speaker info) | Excellent (fine-grained codes) |

The broader lesson: Before reaching for a learned component, ask: "Is there a deterministic, well-understood signal processing pipeline that already does 90% of the job?" For speech, the answer is yes: decades of psychoacoustics research gave us the mel spectrogram. dMel simply takes this seriously.

Paper Details

Bai, R. H., Likhomanenko, T., Zhang, R., Gu, Z., Aldeneh, Z., & Jaitly, N. (2024). dMel: Speech Tokenization Made Simple. arXiv:2407.15835.

What is the fundamental insight of the dMel paper?

dMel's message is that the decades-old mel spectrogram, combined with trivial uniform quantization, is sufficient for modern speech LMs. No learned tokenizer needed.
