dMel: Speech Tokenization Made Simple

Chapter 3: Why This Works

It seems almost too simple. Why should naive uniform quantization of mel values compete with a carefully trained neural codec? Three reasons:

1. Mel spectrograms already keep what matters for speech

The mel filterbank grew out of decades of psychoacoustics research and is built to keep the information listeners rely on to perceive speech. It discards phase (largely irrelevant for linguistic content), compresses high frequencies (where human frequency resolution is coarse), and preserves formant structure (the core of phonetic identity). A neural codec that takes raw audio as input must rediscover this structure through training. dMel starts with it for free.

2. Per-channel normalization puts every channel on the same scale

Different mel bands have wildly different energy distributions. Low-frequency bands (fundamental frequency, first formant) have high energy and large variance. High-frequency bands (fricatives, noise) have low energy. After subtracting the per-channel mean and dividing by the per-channel standard deviation, every channel has zero mean and unit variance. The values become approximately Gaussian, with most of the mass inside [-3, 3] (about 99.7% for an exact Gaussian). Clipping at [-4, 4] then loses almost nothing (<0.01% of values under the same Gaussian assumption).

This normalization is crucial: without it, a single uniform grid would leave the low-energy high-frequency channels crowded into a handful of bins near zero while the high-energy low-frequency channels saturate the top of the range.
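The normalization step can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function names, the synthetic data, and the small epsilon added to the standard deviation are all assumptions made for the example.

```python
import numpy as np

def fit_channel_stats(log_mel):
    """Per-channel mean and std over all frames; log_mel has shape [T, C]."""
    mu = log_mel.mean(axis=0)
    sigma = log_mel.std(axis=0) + 1e-8  # assumed guard against silent channels
    return mu, sigma

def normalize_and_clip(log_mel, mu, sigma, clip=4.0):
    """Standardize each channel, then clip to [-clip, clip]."""
    z = (log_mel - mu) / sigma
    return np.clip(z, -clip, clip)

# Synthetic stand-in for training data: 80 channels whose mean and spread
# shrink with frequency (made-up numbers, only to give channels different scales).
rng = np.random.default_rng(0)
T, C = 5000, 80
channel_mean = np.linspace(0.0, -40.0, C)
channel_std = np.linspace(12.0, 3.0, C)
log_mel = channel_mean + channel_std * rng.standard_normal((T, C))

mu, sigma = fit_channel_stats(log_mel)
z = normalize_and_clip(log_mel, mu, sigma)

# Fraction of values that the clip at +/-4 actually touches.
clipped_fraction = np.mean(np.abs((log_mel - mu) / sigma) > 4.0)
print(z.mean(axis=0)[:3], z.std(axis=0)[:3], clipped_fraction)
```

In a real pipeline the statistics would be computed once on the training set and then frozen, so that tokenization at inference time involves no data-dependent state.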

3. 256 bins = 8 bits per value, which exceeds perceptual limits

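After normalization, the range [-4, 4] is divided into 256 bins. Each bin spans 8/256 = 0.03125 units of normalized mel energy, that is, 0.03125 standard deviations of that channel. For comparison, the just-noticeable difference in audio loudness is about 1 dB. If the mel energies are expressed in dB, one bin is narrower than that threshold as long as the channel's standard deviation is below 32 dB (since 0.03125 × 32 dB = 1 dB), so we are quantizing below the perceptual threshold and the information loss should be inaudible.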

Information-theoretic perspective: A 10-second utterance at 100 fps with 80 channels and 256 bins takes 1000 frames × 80 channels × 8 bits = 640,000 bits, i.e. 64 kbps. EnCodec at 6 kbps achieves higher compression but loses information that matters for high-fidelity reconstruction. dMel trades a higher bitrate for near-transparent quality and zero training cost. The transformer language model then provides the "compression" by learning the statistical structure of these token sequences.
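The bitrate arithmetic above is easy to reproduce. The helper below is a hypothetical name introduced for illustration; it counts raw bits per value and ignores any entropy coding.

```python
def dmel_bitrate_bps(frames_per_second, n_channels, n_bins):
    """Raw bitrate of a dMel stream in bits per second (no entropy coding)."""
    bits_per_value = (n_bins - 1).bit_length()  # 256 bins -> 8 bits
    return frames_per_second * n_channels * bits_per_value

bitrate = dmel_bitrate_bps(frames_per_second=100, n_channels=80, n_bins=256)
total_bits = bitrate * 10            # bits in a 10-second utterance
ratio_vs_6kbps = bitrate / 6000      # compared with a 6 kbps codec

print(bitrate, total_bits, round(ratio_vs_6kbps, 1))  # 64000 640000 10.7
```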

Quantization Error Distribution

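After normalization, mel values are approximately Gaussian. Uniform quantization with B bins over [-4, 4] introduces a quantization error (the difference between the original and the reconstructed value) of at most half a bin width, roughly 4/B. Doubling the number of bins therefore halves the worst-case error.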
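A short experiment makes the relationship between bin count and error concrete. This is a sketch under assumptions: it rounds to the nearest of B evenly spaced levels with the end levels at -4 and +4, and it uses synthetic Gaussian values in place of real normalized mel data.

```python
import numpy as np

CLIP = 4.0

def quantize(z, n_bins):
    """Map values in [-CLIP, CLIP] to the nearest of n_bins evenly spaced levels."""
    z = np.clip(z, -CLIP, CLIP)
    step = 2 * CLIP / (n_bins - 1)
    return np.rint((z + CLIP) / step).astype(np.int64)

def dequantize(bins, n_bins):
    """Inverse mapping from bin index back to the level it represents."""
    return (bins - (n_bins - 1) / 2) * (2 * CLIP) / (n_bins - 1)

rng = np.random.default_rng(0)
z = np.clip(rng.standard_normal(200_000), -CLIP, CLIP)

errors = {}
for n_bins in (16, 64, 256):
    step = 2 * CLIP / (n_bins - 1)
    err = z - dequantize(quantize(z, n_bins), n_bins)
    max_err = np.abs(err).max()
    rms_err = np.sqrt(np.mean(err ** 2))
    errors[n_bins] = (max_err, rms_err, step)
    print(f"B={n_bins:3d}  step={step:.4f}  max|err|={max_err:.4f}  rms={rms_err:.4f}")
```

The maximum error stays at or below half a step, and for fine grids the RMS error approaches step/√12, the standard result for uniformly distributed rounding error.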

Why is per-channel normalization essential before uniform quantization?

Without normalization, high-energy channels would saturate bins while low-energy channels would cluster near zero, wasting most of the quantization resolution. Normalization makes every channel use the full bin range equally.

Chapter 4: Parallel Encoding/Decoding

Each dMel time frame produces 80 discrete tokens (one per mel channel). Naively flattening these into a single sequence would give 80x the length: a 10-second utterance (1000 frames) would become 80,000 tokens, impractically long for a standard transformer, whose attention cost grows quadratically with sequence length. dMel uses a parallel channel architecture instead.

Encoding: From 80 Bins to One Embedding

Each of the 80 channels has its own embedding table of size [B x d], where B is the number of bins and d is the model dimension. At each time step:

1. Look up the embedding for each channel's bin index: 80 embeddings of dimension d
2. Sum (or concatenate + project) the 80 embeddings into a single d-dimensional vector
3. This single vector represents the full spectral content at that time step

The sequence length seen by the transformer is just T (number of time frames), not 80T. This is the key to efficiency.

Data flow (encoding): Audio [160000 samples] → Mel spectrogram [1000 x 80] floats → dMel quantization [1000 x 80] integers in [0, 255] → Channel embeddings [1000 x 80 x d] → Sum across channels [1000 x d] → Transformer input sequence of length 1000.
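The lookup-and-sum step of this data flow can be sketched with plain NumPy indexing. The model dimension d = 64, the random initialization, and the function name are assumptions for illustration; a real model would learn the tables, and the sketch covers only the "sum" variant, not "concatenate + project".

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, B, d = 1000, 80, 256, 64   # frames, channels, bins, model dim (d is illustrative)

# One [B x d] embedding table per channel, stacked into a [C, B, d] array.
tables = (rng.standard_normal((C, B, d)) * 0.02).astype(np.float32)

def embed_frames(bins, tables):
    """bins: [T, C] integers in [0, B). Returns [T, d]: lookups summed over channels."""
    channel_idx = np.arange(tables.shape[0])   # [C], broadcast against bins [T, C]
    per_channel = tables[channel_idx, bins]    # [T, C, d]
    return per_channel.sum(axis=1)             # [T, d]

bins = rng.integers(0, B, size=(T, C))
x = embed_frames(bins, tables)
print(x.shape)  # (1000, 64)
```

The transformer input has one row per frame, so the sequence length is T = 1000 regardless of the number of channels.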

Decoding: From One Hidden State to 80 Predictions

For TTS (generating speech), the transformer outputs a hidden state h_t of dimension d at each time step. This must be decoded back into 80 bin indices. The decoding is also parallel:

1. Apply 80 separate classification heads (linear layers): h_t → [B] logits, one per channel
2. Each head independently predicts a categorical distribution over B bins for its channel
3. At inference: take argmax (or sample) from each channel's distribution
4. The 80 predicted bins reconstruct the mel spectrogram frame

Training loss: sum of 80 cross-entropy losses (one per channel), which decomposes the joint distribution p(bin₁, ..., bin₈₀ | context) into a product of independent conditionals. This independence assumption is approximate but works well in practice because the transformer hidden state h_t already captures cross-channel correlations.

Why parallel decoding works: The mel spectrogram channels are not independent — harmonic structure creates strong correlations across frequency bands. But the transformer's hidden state has already captured these correlations via self-attention over the temporal sequence. Conditioning each channel head on the same rich hidden state allows them to produce correlated outputs without explicit cross-channel modeling at decode time.
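The parallel heads and the summed cross-entropy loss can be written compactly. The sketch below uses NumPy with random weights and tiny sizes purely for illustration; the names (channel_logits, frame_loss) and the choice to average the loss over frames are assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, B, d = 4, 80, 256, 32   # small illustrative sizes

W = rng.standard_normal((C, d, B)) * 0.02   # one linear head per channel
b = np.zeros((C, B))

def channel_logits(h, W, b):
    """h: [T, d] hidden states -> [T, C, B] logits, B logits for each channel."""
    return np.einsum("td,cdb->tcb", h, W) + b

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def frame_loss(logits, target_bins):
    """Sum of the C per-channel cross-entropies, averaged over frames."""
    logp = log_softmax(logits)                                             # [T, C, B]
    picked = np.take_along_axis(logp, target_bins[..., None], axis=-1)[..., 0]  # [T, C]
    return -picked.sum(axis=1).mean()

h = rng.standard_normal((T, d))
targets = rng.integers(0, B, size=(T, C))

logits = channel_logits(h, W, b)
predicted_bins = logits.argmax(axis=-1)   # greedy decoding, shape [T, C]
loss = frame_loss(logits, targets)
print(predicted_bins.shape, float(loss))
```

All 80 heads read the same h_t, which is where the cross-channel correlation in the predictions comes from; the heads themselves never see each other's outputs.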

From Predicted Bins Back to Audio

After predicting 80 bin indices per frame, reconstruction is:

1. Map bin indices back to normalized mel values: value = (bin - (B-1)/2) × 8 / (B-1). This places bin 0 at -4 and bin B-1 at +4, so adjacent levels are 8/(B-1) apart
2. Denormalize: mel_c,t = value × σ_c + μ_c
3. Apply a vocoder (HiFi-GAN) or flow-matching model to convert the mel spectrogram to a waveform
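The first two reconstruction steps are a direct transcription of the formulas above. In this sketch the channel statistics are made-up numbers for two channels, and the function name is hypothetical; the vocoder step is a trained model and is not shown.

```python
import numpy as np

def bins_to_mel(bins, mu, sigma, n_bins=256, clip=4.0):
    """bins: [T, C] integers; mu, sigma: [C] channel statistics. Returns [T, C] floats."""
    # Step 1: bin index -> normalized value in [-clip, clip]
    value = (bins - (n_bins - 1) / 2) * (2 * clip) / (n_bins - 1)
    # Step 2: undo the per-channel normalization
    return value * sigma + mu

mu = np.array([-10.0, -25.0])     # made-up statistics for two channels
sigma = np.array([8.0, 4.0])
bins = np.array([[0, 255],
                 [128, 127]])

mel = bins_to_mel(bins, mu, sigma)
print(mel)
```

Bin 0 maps to μ_c - 4σ_c and bin 255 to μ_c + 4σ_c, so the first row comes out as -42 and -9 for these statistics.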

Parallel Channel Architecture

80 channel embeddings are summed into one vector per time step. The transformer operates on this compressed sequence. Decoding fans back out to 80 independent heads.

How does dMel avoid creating an 80x longer token sequence?

Each channel has its own embedding table. The 80 per-frame embeddings are summed (or projected) into one d-dimensional vector. The transformer sees T tokens, not 80T.

Chapter 5: RichTTS & RichASR

dMel is not just a tokenization scheme — it enables a unified architecture for both speech synthesis (TTS) and speech recognition (ASR) using the exact same transformer + dMel framework.

RichTTS: Text-to-Speech

Architecture: an autoregressive transformer that takes text tokens as prefix and generates dMel tokens frame-by-frame.

Input

Text token sequence: "Hello world" → [BOS, H, e, l, l, o, _, w, o, r, l, d, SEP]

↓

Transformer

Decoder-only, causal attention. After SEP token, begins generating dMel frames autoregressively.

↓

Output

80 bin predictions per frame via parallel heads. Stop when EOS predicted.

↓

Vocoder

Predicted mel → HiFi-GAN or flow-matching vocoder → waveform

Training: teacher-forcing on (text, speech) pairs. Loss = sum of 80 cross-entropy losses per frame + next-token prediction loss on text prefix.

RichASR: Automatic Speech Recognition

Same architecture, reversed direction: dMel tokens as prefix, generate text tokens.

Input

dMel frame sequence from audio: [frame₁, frame₂, ..., frame_T, SEP]

↓

Transformer

Same decoder-only model. After SEP, generates text tokens autoregressively.

↓

Output

Text tokens: "Hello world" (standard LM decoding: beam search or greedy)

The same model can even do both tasks in a single training run by mixing TTS and ASR examples with appropriate prefixes. This is the "Rich" in RichTTS/RichASR — a unified rich-context model.
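One way to picture the two prefix/target orderings is as a single example builder. Everything here is a hypothetical sketch: the special-token strings, the function name, the EOS on the ASR target, and the three-channel frames (instead of 80) are assumptions chosen for readability.

```python
BOS, SEP, EOS = "<BOS>", "<SEP>", "<EOS>"

def build_example(task, text_tokens, dmel_frames):
    """Return (prefix, target) for one training example.

    text_tokens: list of text tokens (characters here).
    dmel_frames: list of frames, each a tuple of per-channel bin indices.
    """
    if task == "tts":   # text prefix, speech target
        return [BOS, *text_tokens, SEP], [*dmel_frames, EOS]
    if task == "asr":   # speech prefix, text target
        return [*dmel_frames, SEP], [*text_tokens, EOS]
    raise ValueError(f"unknown task: {task!r}")

text = list("hi")
frames = [(12, 200, 97), (13, 198, 95)]   # 3 channels instead of 80, for readability

tts_prefix, tts_target = build_example("tts", text, frames)
asr_prefix, asr_target = build_example("asr", text, frames)
print(tts_prefix, tts_target)
print(asr_prefix, asr_target)
```

Mixing both kinds of example in one batch stream is then just a matter of sampling the task per example.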

Why unification matters: With codec-based approaches, TTS and ASR require different architectures because the codec tokens have different properties than text tokens (multi-stream RVQ, different sequence lengths, different generation patterns). With dMel, speech tokens behave like text tokens — same vocabulary type (integers), same embedding approach, same autoregressive generation. A single model, a single loss, a single framework.

Results Snapshot

• TTS quality: RichTTS with dMel achieves speaker similarity and naturalness comparable to VALL-E (which uses EnCodec), with better robustness on out-of-domain text
• ASR accuracy: RichASR with dMel comes close to Whisper-base on LibriSpeech (5.3% vs. 5.0% WER on test-clean), without any supervised speech encoder
• Zero-shot voice cloning: a 3-second speech prompt is enough to generate in that voice. This works because dMel preserves speaker characteristics in the mel representation

Interactive Mel → dMel → Reconstruction

Left: original mel spectrogram. Middle: discretized dMel tokens (color = bin index). Right: reconstructed mel from bins (shown at 256 bins). Lowering the bin count degrades reconstruction quality.

How does dMel enable a unified TTS + ASR model?

With dMel, speech tokens and text tokens are both integer sequences. TTS = text → dMel, ASR = dMel → text. Same model, same loss, just different prefix/target ordering.

Chapter 6: Streaming & Real-Time

One of dMel's most practical advantages: it is inherently streamable. You can tokenize audio in small chunks (200ms is a practical size, and a single frame is the lower limit) without any context beyond the analysis window of each frame.

Why Codecs Struggle with Streaming

Neural codecs often rely on one or more of the following:

• Global normalization: computing statistics over the full utterance before encoding
• Non-causal convolutions: encoder convolutions that look ahead in time
• Padding artifacts: the first and last frames behave differently due to padding

To stream a codec built this way, you must either accept quality degradation at chunk boundaries or add complex buffering logic.

Why dMel is Naturally Streamable

dMel's tokenization depends only on:

• The STFT of the current window (25ms of audio centered on the current frame)
• Pre-computed channel statistics (μ_c, σ_c from the training set, which are constants)
• The quantization formula (a fixed mathematical function)

None of these require looking beyond the current 25ms window. Because consecutive windows overlap (25ms windows every 10ms), a streaming tokenizer only has to carry the last 15ms of each chunk over to the next one. With that small carry-over, tokenizing 200ms chunks (20 frames) yields exactly the same tokens as processing the full utterance: no edge effects, no boundary artifacts.

Latency calculation: The minimum chunk size is 1 frame. A frame covers one 10ms hop and needs a full 25ms window, so the algorithmic latency is bounded by 10ms + 25ms = 35ms. In practice, 200ms chunks (20 frames) provide a good compute-efficiency tradeoff. Compare this to the ~320ms minimum context quoted for EnCodec (due to its encoder receptive field) or the 2-second chunks used in streaming Whisper setups.
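The claim that chunked processing reproduces the full-utterance result can be checked with a small framing experiment. This is a sketch under assumptions: 16 kHz audio, left-aligned windows with no padding, and a log power spectrum standing in for the log-mel frame (the mel filterbank and quantizer are per-frame functions, so they would not change the comparison). The streaming class carries unconsumed samples, including the window overlap, from one chunk to the next.

```python
import numpy as np

SR, WIN, HOP = 16_000, 400, 160   # 25 ms window, 10 ms hop at 16 kHz

def frame_features(samples):
    """Log power spectrum of every full window in `samples` (stand-in for log-mel)."""
    n_frames = 1 + (len(samples) - WIN) // HOP if len(samples) >= WIN else 0
    if n_frames == 0:
        return np.zeros((0, WIN // 2 + 1))
    window = np.hanning(WIN)
    frames = np.stack([samples[i * HOP : i * HOP + WIN] * window for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + 1e-10)

class StreamingFramer:
    """Accepts audio chunk by chunk and keeps unconsumed samples for the next call."""
    def __init__(self):
        self.buffer = np.zeros(0)

    def push(self, chunk):
        self.buffer = np.concatenate([self.buffer, chunk])
        feats = frame_features(self.buffer)
        consumed = feats.shape[0] * HOP        # each emitted frame advances by one hop
        self.buffer = self.buffer[consumed:]   # keeps the window overlap and any remainder
        return feats

rng = np.random.default_rng(0)
audio = rng.standard_normal(SR * 2)            # 2 s of noise as a test signal
offline = frame_features(audio)

framer = StreamingFramer()
chunk = SR // 5                                 # 200 ms chunks
streamed = np.concatenate(
    [framer.push(audio[i : i + chunk]) for i in range(0, len(audio), chunk)]
)
print(offline.shape, streamed.shape, np.allclose(offline, streamed))
```

With centered or padded windows the bookkeeping changes slightly, but the principle is the same: the only state a streaming tokenizer needs is a fraction of one analysis window.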

Implications for Real-Time Applications

• Voice assistants: Start processing speech-to-text immediately, without waiting for utterance end
• Live captioning: Transcribe 200ms at a time with consistent quality
• Interactive TTS: Generate speech in 200ms increments for low-latency dialogue
• On-device: No GPU needed for tokenization (it is just mel + binning). Tokenize on CPU/DSP, send tokens to cloud model

Streaming Tokenization

Audio arrives in 200ms chunks and each chunk is tokenized as it arrives, with no look-ahead beyond the current analysis window and no boundary effects.

Why is dMel inherently streamable while neural codecs are not?

The mel-to-bin conversion is a purely local operation: each frame is independently normalized by fixed constants and quantized. No look-ahead, no global statistics needed at inference time.

Chapter 7: Comparison & Results

How does this training-free tokenization compare to learned approaches? The results are surprisingly strong.

TTS Results (RichTTS with dMel vs. Baselines)

System	Tokenization	WER ↓	Speaker Sim ↑	Training-free?
VALL-E	EnCodec (8 RVQ layers)	5.9%	0.68	No
VoiceCraft	EnCodec (single layer)	4.8%	0.67	No
RichTTS (dMel)	dMel (256 bins)	4.2%	0.66	Yes
Ground Truth	N/A	2.2%	1.00	N/A

dMel achieves the lowest Word Error Rate (measuring intelligibility of generated speech) of the three systems, with speaker similarity within 0.02 of the EnCodec-based baselines (0.66 vs. 0.67-0.68), and it requires zero codec training.

ASR Results (RichASR with dMel vs. Baselines)

System	Input Features	LS test-clean WER ↓	LS test-other WER ↓
Whisper-base	Mel spectrogram (continuous)	5.0%	12.6%
RichASR (EnCodec)	EnCodec tokens	7.2%	15.8%
RichASR (dMel-256)	dMel 256 bins	5.3%	13.1%
RichASR (dMel-128)	dMel 128 bins	5.5%	13.4%

dMel clearly outperforms EnCodec tokens for ASR (5.3% vs. 7.2% WER on test-clean) and comes within 0.3 to 0.5 points of Whisper-base (which uses continuous mel features with a trained encoder). This suggests that discretization loses very little speech content.
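To put the table in relative terms, the gaps can be computed directly from the reported test-clean and test-other numbers. The helper name is an assumption; the WER values are copied from the table above.

```python
def relative_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, as a fraction."""
    return (value - baseline) / baseline

# (test-clean WER %, test-other WER %) as listed in the ASR table
wer = {
    "Whisper-base": (5.0, 12.6),
    "RichASR (EnCodec)": (7.2, 15.8),
    "RichASR (dMel-256)": (5.3, 13.1),
    "RichASR (dMel-128)": (5.5, 13.4),
}

dmel_vs_encodec = relative_change(wer["RichASR (EnCodec)"][0], wer["RichASR (dMel-256)"][0])
dmel_vs_whisper = relative_change(wer["Whisper-base"][0], wer["RichASR (dMel-256)"][0])
print(f"{dmel_vs_encodec:+.1%}, {dmel_vs_whisper:+.1%}")  # -26.4%, +6.0%
```

On test-clean, dMel-256 has about 26% lower WER than the EnCodec-token variant and about 6% higher WER than Whisper-base.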

Key Takeaways

dMel vs. EnCodec — the tradeoff:
• Information density: EnCodec at 6 kbps is roughly 10x more compressed than dMel at 64 kbps. For bandwidth-limited transmission, codecs win.
• Quality ceiling: dMel preserves more information (less lossy), giving better reconstruction when bitrate is not the bottleneck.
• Robustness: dMel uses no learned components in tokenization (only fixed per-channel statistics), so it has no encoder that can overfit to the training domain. Codec quality degrades on unseen acoustic conditions.
• Simplicity: dMel is ~20 lines of Python. EnCodec is a full neural network requiring training infrastructure.
• Streaming: dMel is trivially streamable. EnCodec requires engineering effort.

Out-of-Domain Robustness

When tested on noisy audio, accented speech, and singing (domains not in the training set):

• EnCodec WER degrades by 40-60% relative
• dMel WER degrades by only 10-15% relative

The reason: EnCodec's learned encoder has internalized assumptions about clean speech that break down OOD. dMel's deterministic transform has no such assumptions — it faithfully represents whatever audio you feed it.

Bin Count vs. Quality Tradeoff

Reconstruction SNR and WER as a function of bin count B. More bins = better quality but harder to learn. The sweet spot is 128-256.

In what scenario does dMel most clearly outperform neural codecs?

dMel's biggest advantage is robustness: it has no learned encoder that can overfit to the training domain. Under out-of-domain conditions its WER degrades by only 10-15% relative, while EnCodec's degrades by 40-60% relative.

Chapter 8: Connections

dMel sits at an interesting intersection of ideas in modern ML:

Related Concepts

• VQ-VAE / Neural Codecs: EnCodec, SoundStream, and DAC use learned vector quantization to compress audio. dMel shows that for speech, you can skip the learning entirely. But for music and general audio (where mel spectrograms are less optimal), learned codecs still have advantages.
• Flow Matching (for vocoding): The paper uses flow-matching vocoders to convert predicted mel spectrograms back to waveforms. Flow matching provides higher quality than GAN-based vocoders (HiFi-GAN) at the cost of slower inference.
• Whisper: Whisper already uses mel spectrograms as input to a trained encoder-decoder. dMel takes this further: instead of a learned mel encoder, just discretize the mel directly and treat it as a token sequence for a decoder-only LM.
• Multimodal LLMs: Models like GPT-4o process speech, text, and images in a unified framework. dMel's simple tokenization makes it straightforward to add speech to a text-based LLM: add the per-channel embedding tables (80 tables of B entries each) on the input side and the 80 classification heads on the output side.
• Visual Tokenization (VAR, VQGAN): The same question arises for images: do you need a learned tokenizer? Approaches like pixel quantization and patch-based methods mirror dMel's philosophy for vision.

When to Use dMel vs. Codecs

Criterion	Use dMel	Use Neural Codec
Domain robustness needed	Yes — no learned encoder to fail OOD	Risky
Streaming required	Yes — trivially streamable	Requires engineering
No training budget for codec	Yes — zero training cost	Needs GPU-hours
Bandwidth-constrained	64 kbps is too high	Yes — 1.5-6 kbps achievable
Music / general audio	Mel is less optimal	Yes — can learn domain-specific structure
Multi-speaker expressiveness	Good (mel preserves speaker info)	Excellent (fine-grained codes)

The broader lesson: Before reaching for a learned component, ask: "Is there a deterministic, well-understood signal processing pipeline that already does 90% of the job?" For speech, the answer is yes — decades of psychoacoustics research gave us the mel spectrogram. dMel simply takes this seriously.

Paper Details

Bai, R. H., Likhomanenko, T., Zhang, R., Gu, Z., Aldeneh, Z., & Jaitly, N. (2024). dMel: Speech Tokenization Made Simple. arXiv:2407.15835.

What is the fundamental insight of the dMel paper?

dMel's message is that the decades-old mel spectrogram, combined with trivial uniform quantization, is sufficient for modern speech LMs. No learned tokenizer needed.
