A training-free discrete speech representation that bins mel-spectrogram values into integers — no neural codec needed.
Large language models process text as discrete tokens. To handle speech with the same machinery, you need to convert a continuous audio waveform into a sequence of integers. The dominant approach in 2023-2024 uses neural audio codecs — models like EnCodec and SoundStream that compress audio into discrete codes via a learned encoder-decoder with vector quantization.
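These codecs are powerful, but they come with serious baggage: they must be trained, they produce complex multi-codebook hierarchies, they often cannot process small chunks in isolation, and they are fragile under domain shift.
What if you skipped the codec and simply binned mel-spectrogram values into integers? This is exactly what dMel proposes. The result is shockingly competitive: a completely training-free tokenization that matches or beats EnCodec on both TTS and ASR tasks, while being streamable, domain-robust, and trivial to implement.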
Top: traditional neural codec path (train encoder, quantize, decode). Bottom: dMel path (compute mel, bin values, done).
What is the main disadvantage of neural audio codecs like EnCodec for speech tokenization?
Neural codecs need training, create complex multi-codebook hierarchies, often cannot process small chunks in isolation, and are fragile to domain shift — all problems dMel avoids entirely.
Before we can discretize mel spectrograms, we need to understand what they are and why they are the dominant representation for speech processing.
Raw audio is a 1D signal: amplitude values sampled at 16,000 Hz (for speech). A 10-second utterance is just 160,000 floating-point numbers. This raw waveform is hard for models to work with directly — the relevant structure (phonemes, formants, pitch) is encoded in frequency patterns that are invisible in the time domain.
The Short-Time Fourier Transform (STFT) slides a window across the waveform and computes the frequency content at each position. With a 25 ms window and a 10 ms hop, a 10-second signal produces 1000 time frames, each containing the magnitudes of 513 frequency bins (for a 1024-point FFT: 1024/2 + 1). This is a spectrogram: a 2D matrix of [time_frames x frequency_bins].
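The frame and bin counts above are plain arithmetic. Here is a minimal sketch, assuming one frame per hop (that is, the signal is padded at the edges) and a hypothetical helper name `stft_shape`:

```python
def stft_shape(duration_s, sample_rate=16_000, hop_ms=10, n_fft=1024):
    """Return (time_frames, frequency_bins) for a one-sided STFT.

    Assumes one frame per hop, i.e. the signal is padded at the edges.
    """
    hop = sample_rate * hop_ms // 1000           # 160 samples at 16 kHz
    n_samples = int(duration_s * sample_rate)    # 160,000 samples for 10 s
    frames = n_samples // hop
    bins = n_fft // 2 + 1                        # one-sided spectrum of a real signal
    return frames, bins


print(stft_shape(10))  # (1000, 513)
```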
Humans perceive pitch roughly logarithmically: the difference between 100 Hz and 200 Hz sounds the same as between 1000 Hz and 2000 Hz (both are one octave). The mel scale warps the linear frequency axis to approximate this perception; it is close to linear below about 1 kHz and close to logarithmic above. A mel filterbank is a set of triangular filters spaced evenly on the mel scale, applied to the linear spectrogram magnitudes.
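Standard configuration for speech, using the numbers quoted in this chapter: 16 kHz audio, a 25 ms window, a 10 ms hop, and 80 mel bands.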
The result: an 80-dimensional vector every 10ms. A 10-second utterance becomes a [1000 x 80] matrix. Each column is a mel-frequency channel; each row is a time frame.
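To make the pipeline concrete, here is a NumPy-only sketch of a log-mel front end. Several details are assumptions the text does not specify: the HTK-style mel formula, the Hann window, the natural log with a small floor, and the padding scheme. Real libraries differ on each of these, so treat it as an illustration, not a reference implementation.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels=80, n_fft=1024, sample_rate=16_000):
    """Triangular filters spaced evenly on the mel scale: [n_mels, n_fft//2 + 1]."""
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    return np.clip(np.minimum(rising, falling), 0.0, None)


def log_mel(wave, sample_rate=16_000, win_ms=25, hop_ms=10, n_fft=1024, n_mels=80):
    """Return a [time_frames, n_mels] log-mel matrix."""
    win = sample_rate * win_ms // 1000            # 400 samples
    hop = sample_rate * hop_ms // 1000            # 160 samples
    n_frames = len(wave) // hop                   # one frame per hop
    padded = np.pad(wave, (0, win))               # zero-pad so the last frames are full
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = padded[idx] * np.hanning(win)        # [T, win]
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))        # [T, 513]
    mel = magnitude @ mel_filterbank(n_mels, n_fft, sample_rate).T  # [T, 80]
    return np.log(mel + 1e-5)                     # floor avoids log(0)


# Ten seconds of a 220 Hz tone as stand-in audio.
t = np.arange(10 * 16_000) / 16_000
features = log_mel(np.sin(2 * np.pi * 220.0 * t))
print(features.shape)  # (1000, 80)
```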
A synthetic mel spectrogram showing 80 frequency bands over time. Each cell is a continuous float value representing energy at that frequency and time.
Why does the mel spectrogram use 80 bands instead of the full 513 frequency bins from the STFT?
The mel scale groups high frequencies into wider bands (since we cannot distinguish fine differences up there) while preserving detail in low frequencies where speech formants live. 80 bands is the sweet spot: enough resolution for speech, compact enough for efficient processing.
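Here is the entire dMel algorithm, and it is almost comically simple: compute the mel spectrogram, normalize each channel using training-set statistics, clip outliers, and uniformly quantize the result into B bins.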
Mathematically, for channel c at time t:
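One way to write the steps down, assuming a clip range of [-4, 4] (the range used later in this article) and floor-based binning. The rounding convention and the symbols m, μ, σ, z, and k are assumptions made here for illustration:

```latex
z_{t,c} = \operatorname{clip}\!\left(\frac{m_{t,c} - \mu_c}{\sigma_c},\; -4,\; 4\right),
\qquad
k_{t,c} = \min\!\left(\left\lfloor \frac{z_{t,c} + 4}{8}\, B \right\rfloor,\; B - 1\right)
```

Here m_{t,c} is the mel value of channel c at frame t, μ_c and σ_c are that channel's training-set mean and standard deviation, and k_{t,c} in {0, ..., B-1} is the resulting token.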
That is it. No neural network. No codebook learning. No VQ-VAE. Just normalization, clipping, and uniform quantization.
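The three steps fit in a few lines of NumPy. This is a sketch of the procedure as this article describes it, not the authors' code. The function name `dmel_tokenize`, the floor-based binning, and the synthetic input are assumptions.

```python
import numpy as np


def dmel_tokenize(mel, mean, std, n_bins=256, clip=4.0):
    """Map a [T, C] mel matrix to [T, C] integer bin indices in [0, n_bins - 1].

    mean and std are per-channel [C] statistics, fixed ahead of time.
    """
    z = (mel - mean) / std                       # 1. per-channel normalization
    z = np.clip(z, -clip, clip)                  # 2. clip outliers
    scaled = (z + clip) / (2 * clip) * n_bins    # 3. map [-clip, clip] to [0, n_bins]
    return np.minimum(scaled.astype(np.int64), n_bins - 1)


# Synthetic stand-in for a [1000 x 80] mel matrix. In practice mean/std would
# come from the training set; here they are taken from the data itself.
rng = np.random.default_rng(0)
mel = rng.normal(loc=-5.0, scale=2.0, size=(1000, 80))
tokens = dmel_tokenize(mel, mel.mean(axis=0), mel.std(axis=0))
print(tokens.shape, tokens.dtype)  # (1000, 80) int64
```

The final `np.minimum` only matters for values sitting exactly on the upper clip boundary, which would otherwise land in a nonexistent bin `n_bins`.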
Each time frame produces 80 bin indices. These 80 tokens are not flattened into one long sequence (that would make it 80x longer than text). Instead, they are processed in parallel — each channel has its own embedding, and all 80 are combined at each time step. We will cover this parallel processing in Chapter 4.
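The paper experiments with B = {64, 128, 256, 512, 1024}. The key finding: more bins give better reconstruction quality but make the tokens harder to learn, and the sweet spot is 128-256.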
Continuous mel values snap to discrete bins. The number of bins sets the quantization resolution: fewer bins = more quantization error.
What are the steps of the dMel tokenization algorithm?
dMel is purely deterministic: mel spectrogram, per-channel normalization using training set statistics, clip outliers, and uniform quantization. No learning involved.
It seems almost too simple. Why should naive uniform quantization of mel values compete with a carefully trained neural codec? Three reasons:
The mel filterbank was designed over decades of psychoacoustics research to capture exactly the information humans use to perceive speech. It discards phase (irrelevant for content), compresses high frequencies (where humans have low resolution), and preserves formant structure (the core of phonetic identity). A neural codec that takes raw audio as input must rediscover this structure through training. dMel starts with it for free.
Different mel bands have wildly different energy distributions. Low-frequency bands (fundamental frequency, first formant) have high energy and large variance. High-frequency bands (fricatives, noise) have low energy. After subtracting the per-channel mean and dividing by the per-channel std, every channel has zero mean and unit variance. If the normalized values are approximately Gaussian, most of the mass lies in [-3, 3], and clipping at [-4, 4] loses almost nothing (<0.01% of values).
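The clipping claim is easy to check under the idealized assumption that normalized values are exactly standard normal (real mel statistics are only approximately so). The helper name is hypothetical:

```python
import math


def gaussian_tail_fraction(k):
    """Fraction of a standard normal distribution lying outside [-k, k]."""
    return math.erfc(k / math.sqrt(2.0))


for k in (3, 4):
    print(f"outside [-{k}, {k}]: {100 * gaussian_tail_fraction(k):.4f}%")
```

For k = 4 the fraction comes out at roughly 0.006%, consistent with the "<0.01%" figure above. For k = 3 it is about 0.27%.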
This normalization is crucial: without it, uniform quantization would waste bins on the low-energy high-frequency channels (all bins near zero) while saturating the high-energy low-frequency channels.
After normalization, the range [-4, 4] is divided into 256 bins. Each bin spans 8/256 = 0.03125 standard deviations of a channel's mel energy. For comparison, the just-noticeable difference in audio loudness is about 1 dB. One bin stays narrower than that as long as a channel's standard deviation is under 32 dB (1 / 0.03125), so in that regime we are quantizing below the perceptual threshold and the information loss is inaudible.
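A quick way to see how bin count controls error is to quantize synthetic Gaussian values and reconstruct them from bin centers. This sketch assumes floor-based binning and bin-center reconstruction, neither of which the text pins down. The function name is hypothetical.

```python
import numpy as np


def quantization_error(n_bins, clip=4.0, n_samples=100_000, seed=0):
    """Return (bin_width, max_abs_error) for uniform quantization of N(0, 1) samples."""
    rng = np.random.default_rng(seed)
    z = np.clip(rng.standard_normal(n_samples), -clip, clip)
    width = 2 * clip / n_bins
    idx = np.minimum(((z + clip) / width).astype(np.int64), n_bins - 1)
    recon = -clip + (idx + 0.5) * width          # reconstruct at bin centers
    return width, float(np.abs(z - recon).max())


for b in (64, 128, 256, 512, 1024):
    width, err = quantization_error(b)
    print(f"B={b:5d}  bin width={width:.6f}  max error={err:.6f}")
```

With bin-center reconstruction the error is bounded by half a bin width, so doubling B halves the worst-case error.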
After normalization, mel values are approximately Gaussian. Uniform quantization with B bins produces quantization error (difference between original and reconstructed value). More bins = smaller error.
Why is per-channel normalization essential before uniform quantization?
Without normalization, high-energy channels would saturate bins while low-energy channels would cluster near zero, wasting most of the quantization resolution. Normalization makes every channel use the full bin range equally.
Each dMel time frame produces 80 discrete tokens (one per mel channel). Naively flattening these into a single sequence would give 80x the length — a 10-second utterance would become 80,000 tokens, far too long for any transformer. dMel uses a parallel channel architecture instead.
Each of the 80 channels has its own embedding table of size [B x d], where B is the number of bins and d is the model dimension. At each time step:
The sequence length seen by the transformer is just T (number of time frames), not 80T. This is the key to efficiency.
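A sketch of the input side, assuming the per-channel embeddings are combined by summation (one of the two options this article mentions). The embedding tables are random placeholders, and d = 32 is an arbitrary illustrative size:

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, B, d = 1000, 80, 256, 32                   # frames, channels, bins, model dim

tables = rng.standard_normal((C, B, d)) * 0.02   # one [B x d] table per channel
tokens = rng.integers(0, B, size=(T, C))         # dMel tokens for one utterance


def embed_frames(tokens, tables):
    """Look up each channel's token in its own table and sum over channels."""
    channels = np.arange(tables.shape[0])        # [C], broadcasts against [T, C]
    per_channel = tables[channels, tokens]       # [T, C, d]
    return per_channel.sum(axis=1)               # [T, d]


x = embed_frames(tokens, tables)
print(x.shape)  # (1000, 32)
```

The transformer input has one row per frame, so its sequence length is T regardless of how many channels feed each row.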
For TTS (generating speech), the transformer outputs a hidden state h_t of dimension d at each time step. This must be decoded back into 80 bin indices. The decoding is also parallel:
Training loss: sum of 80 cross-entropy losses (one per channel), which decomposes the joint distribution p(bin_1, ..., bin_80 | context) into a product of independent conditionals. This independence assumption is approximate but works well in practice because the transformer hidden state h_t already captures cross-channel correlations.
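The output side can be sketched the same way. The assumption here is that each channel gets its own linear head over the B bins; the text only says that decoding is parallel and the loss is a sum of 80 cross-entropies. Weights and sizes are placeholders.

```python
import numpy as np


def channel_logits(h, W, b):
    """h: [T, d] hidden states; W: [C, d, B]; b: [C, B]. Returns logits [T, C, B]."""
    return np.einsum("td,cdb->tcb", h, W) + b


def summed_cross_entropy(logits, targets):
    """Sum of per-channel cross-entropies, averaged over frames."""
    shifted = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return float(-picked.sum(axis=1).mean())                     # sum over C, mean over T


rng = np.random.default_rng(0)
T, C, B, d = 4, 80, 256, 16
h = rng.standard_normal((T, d))
W = rng.standard_normal((C, d, B)) * 0.02
b = np.zeros((C, B))
targets = rng.integers(0, B, size=(T, C))

logits = channel_logits(h, W, b)
print(logits.shape)                        # (4, 80, 256)
print(summed_cross_entropy(logits, targets))
predicted = logits.argmax(axis=-1)         # greedy decode: 80 bin indices per frame
```

With uninformative (all-zero) logits the loss equals 80 · ln(256), about 443.6, which is a handy sanity check for an untrained model.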
After predicting 80 bin indices per frame, reconstruction is:
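A minimal sketch of the inverse mapping, assuming each bin index is decoded to its bin center and then de-normalized with the same per-channel statistics used for tokenization. Turning the recovered mel spectrogram back into a waveform is a separate step not shown here, and the function name is hypothetical.

```python
import numpy as np


def dmel_detokenize(tokens, mean, std, n_bins=256, clip=4.0):
    """Map [T, C] bin indices back to approximate mel values."""
    width = 2 * clip / n_bins
    z = -clip + (tokens + 0.5) * width     # bin index -> bin center in [-clip, clip]
    return z * std + mean                  # undo per-channel normalization


tokens = np.array([[0, 128, 255]])
print(dmel_detokenize(tokens, mean=np.zeros(3), std=np.ones(3)))
# [[-3.984375  0.015625  3.984375]]
```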
80 channel embeddings are summed into one vector per time step. The transformer operates on this compressed sequence. Decoding fans back out to 80 independent heads.
How does dMel avoid creating an 80x longer token sequence?
Each channel has its own embedding table. The 80 per-frame embeddings are summed (or projected) into one d-dimensional vector. The transformer sees T tokens, not 80T.
dMel is not just a tokenization scheme — it enables a unified architecture for both speech synthesis (TTS) and speech recognition (ASR) using the exact same transformer + dMel framework.
Architecture: an autoregressive transformer that takes text tokens as prefix and generates dMel tokens frame-by-frame.
Training: teacher-forcing on (text, speech) pairs. Loss = sum of 80 cross-entropy losses per frame + next-token prediction loss on text prefix.
Same architecture, reversed direction: dMel tokens as prefix, generate text tokens.
The same model can even do both tasks in a single training run by mixing TTS and ASR examples with appropriate prefixes. This is the "Rich" in RichTTS/RichASR — a unified rich-context model.
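The "same model, different ordering" idea can be illustrated with a toy data-preparation helper. Everything here (the `Example` class, `make_example`, the task names) is hypothetical and only shows how one pair of sequences yields both training directions:

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    task: str
    prefix: List      # what the model conditions on
    target: List      # what the model is trained to generate


def make_example(task: str, text_tokens: Sequence[int],
                 dmel_frames: Sequence[Sequence[int]]) -> Example:
    """Build a TTS (text -> dMel) or ASR (dMel -> text) example from one pair."""
    if task == "tts":
        return Example(task, list(text_tokens), list(dmel_frames))
    if task == "asr":
        return Example(task, list(dmel_frames), list(text_tokens))
    raise ValueError(f"unknown task: {task}")


text = [12, 7, 99]                          # placeholder text token ids
frames = [[0] * 80, [1] * 80]               # two placeholder dMel frames
tts = make_example("tts", text, frames)
asr = make_example("asr", text, frames)
print(len(tts.prefix), len(tts.target))     # 3 2
print(len(asr.prefix), len(asr.target))     # 2 3
```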
Left: original mel spectrogram. Middle: discretized dMel tokens (color = bin index). Right: reconstructed mel from bins. Quality degrades at low bin resolution.
How does dMel enable a unified TTS + ASR model?
With dMel, speech tokens and text tokens are both integer sequences. TTS = text → dMel, ASR = dMel → text. Same model, same loss, just different prefix/target ordering.
One of dMel's most practical advantages: it is inherently streamable. You can tokenize audio in small chunks (as low as 200ms) without any context from past or future audio.
Neural codecs typically use:
To make EnCodec streamable, you must either accept quality degradation at chunk boundaries or add complex buffering logic.
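dMel's tokenization depends only on the mel values of the current frame, the fixed per-channel normalization constants computed on the training set, and the fixed bin boundaries.
None of these requires looking beyond the current 25 ms window. You can tokenize the 20 mel frames of a 200 ms chunk in complete isolation, with zero buffering at the tokenizer, and get the exact same tokens you would get processing the full utterance. No edge effects. No boundary artifacts.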
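The equivalence is easy to demonstrate on already-computed mel frames. This sketch uses synthetic frames and placeholder statistics. Framing the waveform itself happens upstream and is not modeled here.

```python
import numpy as np


def tokenize(mel, mean, std, n_bins=256, clip=4.0):
    """Frame-local dMel-style tokenization: normalize, clip, uniformly quantize."""
    z = np.clip((mel - mean) / std, -clip, clip)
    return np.minimum(((z + clip) / (2 * clip) * n_bins).astype(np.int64), n_bins - 1)


rng = np.random.default_rng(0)
mel = rng.standard_normal((1000, 80))           # 10 s of synthetic mel frames
mean, std = np.zeros(80), np.ones(80)           # placeholder training-set statistics

full = tokenize(mel, mean, std)                 # whole utterance at once

chunk_frames = 20                               # 200 ms at a 10 ms hop
streamed = np.concatenate([
    tokenize(mel[i:i + chunk_frames], mean, std)
    for i in range(0, len(mel), chunk_frames)
])

print(np.array_equal(full, streamed))  # True
```

Because every operation is elementwise with fixed constants, the chunk size has no effect on the output.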
Audio arrives in 200 ms chunks. Each chunk is independently tokenized. No buffering, no boundary effects.
Why is dMel inherently streamable while neural codecs are not?
The mel-to-bin conversion is a purely local operation: each frame is independently normalized by fixed constants and quantized. No look-ahead, no global statistics needed at inference time.
How does this training-free tokenization compare to learned approaches? The results are surprisingly strong.
| System | Tokenization | WER ↓ | Speaker Sim ↑ | Training-free? |
|---|---|---|---|---|
| VALL-E | EnCodec (8 RVQ layers) | 5.9% | 0.68 | No |
| VoiceCraft | EnCodec (single layer) | 4.8% | 0.67 | No |
| RichTTS (dMel) | dMel (256 bins) | 4.2% | 0.66 | Yes |
| Ground Truth | N/A | 2.2% | 1.00 | N/A |
dMel achieves the lowest Word Error Rate (measuring intelligibility of generated speech) while staying within 0.02 of the best speaker similarity, and it requires zero codec training.
| System | Input Features | LS test-clean WER ↓ | LS test-other WER ↓ |
|---|---|---|---|
| Whisper-base | Mel spectrogram (continuous) | 5.0% | 12.6% |
| RichASR (EnCodec) | EnCodec tokens | 7.2% | 15.8% |
| RichASR (dMel-256) | dMel 256 bins | 5.3% | 13.1% |
| RichASR (dMel-128) | dMel 128 bins | 5.5% | 13.4% |
dMel significantly outperforms EnCodec tokens for ASR and nearly matches Whisper-base (which uses continuous mel features with a trained encoder). This demonstrates that discretization loses very little speech content.
When tested on noisy audio, accented speech, and singing (domains not in the training set), codec quality degrades by 40-60%, while dMel's deterministic mapping behaves the same as on in-domain audio.
The reason: EnCodec's learned encoder has internalized assumptions about clean speech that break down out of domain. dMel's deterministic transform has no such assumptions; it faithfully represents whatever audio you feed it.
Reconstruction SNR and WER as a function of bin count B. More bins = better quality but harder to learn. The sweet spot is 128-256.
In what scenario does dMel most clearly outperform neural codecs?
dMel's biggest advantage is robustness: it has no learned encoder that can overfit to training domain. The deterministic mel-to-bin mapping works equally well on any audio, while codec quality degrades 40-60% on out-of-domain inputs.
dMel sits at an interesting intersection of ideas in modern ML:
| Criterion | Use dMel | Use Neural Codec |
|---|---|---|
| Domain robustness needed | Yes — no learned encoder to fail OOD | Risky |
| Streaming required | Yes — trivially streamable | Requires engineering |
| No training budget for codec | Yes — zero training cost | Needs GPU-hours |
| Bandwidth-constrained | 64 kbps is too high | Yes — 1.5-6 kbps achievable |
| Music / general audio | Mel is less optimal | Yes — can learn domain-specific structure |
| Multi-speaker expressiveness | Good (mel preserves speaker info) | Excellent (fine-grained codes) |
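The 64 kbps figure in the bandwidth row follows from the token rate: 80 channels per frame, log2(B) bits per channel, and 100 frames per second at a 10 ms hop. A small sketch, with a hypothetical helper name and no entropy coding assumed:

```python
import math


def dmel_bitrate_kbps(n_bins, n_channels=80, frames_per_second=100):
    """Raw bitrate of dMel tokens, with no entropy coding."""
    bits_per_frame = n_channels * math.log2(n_bins)
    return bits_per_frame * frames_per_second / 1000


for b in (64, 128, 256, 512, 1024):
    print(b, dmel_bitrate_kbps(b))
# 256 bins -> 64.0 kbps
```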
Bai, R. H., Likhomanenko, T., Zhang, R., Gu, Z., Aldeneh, Z., & Jaitly, N. (2024). dMel: Speech Tokenization Made Simple. arXiv:2407.15835.
What is the fundamental insight of the dMel paper?
dMel's message is that the decades-old mel spectrogram, combined with trivial uniform quantization, is sufficient for modern speech LMs. No learned tokenizer needed.