From robotic to indistinguishable from human: how text-to-speech evolved. The lesson covers the text frontend, the acoustic model, the vocoder, the alignment problem, and the one-to-many trick, traced through Tacotron, FastSpeech, VITS, and modern diffusion and codec-token TTS.
Reading text aloud sounds trivial — until you try to make a machine do it naturally. Old TTS sounded robotic for deep reasons, not just bad engineering. Three hard problems lurk in “turn this text into speech.”
Modern TTS — the kind that fools you on a phone call — solves all three. This lesson traces the architecture from the classic two-stage pipeline through Tacotron, the alignment fixes of FastSpeech, the end-to-end elegance of VITS, and today’s diffusion and codec-token systems. By the end you could diagram the model behind any voice assistant or audiobook narrator.
Naturalness (mean opinion score) over the eras: concatenative → parametric → Tacotron+neural vocoder → VITS/end-to-end → diffusion/codec TTS, closing on real human speech. Slide through the years.
The classic architecture splits the impossible “text → waveform” leap into two manageable stages, bridged by the mel spectrogram (the time×frequency picture from the Audio Representations lesson). Stage one: an acoustic model turns text into a mel spectrogram — deciding what frequencies, when. Stage two: a vocoder turns that mel spectrogram into an actual waveform — filling in the fine sample-level detail the mel discarded.
Why the mel intermediate? Because it splits the problem along its natural seam. The acoustic model handles the linguistic part (which sounds, what prosody) on a compact, low-rate representation. The vocoder handles the signal part (turning a coarse mel into 16,000+ crisp samples per second) and can be trained once and reused across voices. Modern end-to-end models (like VITS) fuse both stages, but the conceptual split — linguistic then signal — still organizes how everyone thinks about TTS.
Before any neural net runs, raw text needs cleaning — the text frontend, the most underrated part of TTS. Two jobs. First, normalization: expand everything that isn’t spelled how it’s spoken. “Dr. Smith paid $2.50 on 3/4” must become “Doctor Smith paid two dollars and fifty cents on March fourth.” Numbers, dates, currency, abbreviations, units — all expanded by rules and models. Get this wrong and the cleanest voice says “dollar two point five oh.”
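The normalization step can be sketched with a few rules. This is a minimal illustration, not a production frontend: the `ABBREVIATIONS` table, the helper names, and the restriction to amounts below 100 are all assumptions made for the example, and dates such as “3/4” are deliberately left out because they need locale and context.

```python
import re

# Hypothetical, tiny rule set; real frontends combine far larger rule sets
# with learned models.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def expand_currency(match: re.Match) -> str:
    """Turn a matched '$D' or '$D.CC' into spoken words."""
    dollars = int(match.group(1))
    cents = int(match.group(2) or 0)
    parts = [f"{number_to_words(dollars)} dollar{'' if dollars == 1 else 's'}"]
    if cents:
        parts.append(f"{number_to_words(cents)} cent{'' if cents == 1 else 's'}")
    return " and ".join(parts)


def normalize(text: str) -> str:
    """Expand known abbreviations, then dollar amounts."""
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)
    return re.sub(r"\$(\d+)(?:\.(\d{2}))?", expand_currency, text)


print(normalize("Dr. Smith paid $2.50"))
```

Even this toy version shows why a lookup is not enough: “Dr.” can also mean “Drive,” and the right reading of a number depends on whether it is a price, a date, or a phone number.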
Second, grapheme-to-phoneme (G2P): convert letters to the sounds they make. English spelling is treacherous: “through,” “though,” and “tough” share letters but not sounds, and “read” is pronounced differently in the present and past tense. G2P maps text to phonemes (the atomic speech sounds), which are often the actual input to the acoustic model because they are far more regular than spelling. Homographs like “lead” (the metal vs. to guide) need context to disambiguate, a genuinely hard sub-problem.
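A toy G2P lookup makes the homograph problem concrete. The sketch below assumes ARPAbet-style phoneme symbols without stress marks and a hand-written lexicon; the `sense` argument stands in for whatever context model a real system would use to pick a pronunciation, and its labels are invented for the example.

```python
# Hand-written mini lexicon (ARPAbet-style symbols, stress marks omitted).
LEXICON = {
    "through": ["TH", "R", "UW"],
    "though": ["DH", "OW"],
    "tough": ["T", "AH", "F"],
}

# Homographs: same spelling, pronunciation depends on the sense.
# The sense labels are hypothetical; a real system infers them from context.
HOMOGRAPHS = {
    "read": {"present": ["R", "IY", "D"], "past": ["R", "EH", "D"]},
    "lead": {"guide": ["L", "IY", "D"], "metal": ["L", "EH", "D"]},
}


def g2p(word, sense=None):
    """Return the phoneme list for a word, using `sense` for homographs."""
    word = word.lower()
    if word in HOMOGRAPHS:
        senses = HOMOGRAPHS[word]
        if sense not in senses:
            raise ValueError(f"'{word}' is ambiguous; pick one of {sorted(senses)}")
        return senses[sense]
    if word in LEXICON:
        return LEXICON[word]
    raise KeyError(f"'{word}' is not in this toy lexicon")


print(g2p("tough"))
print(g2p("read", sense="past"))
```

The function refuses to guess when a homograph arrives without a sense, which is the honest behavior for a lookup table; production systems replace that error with a learned disambiguator and fall back to a neural G2P model for out-of-vocabulary words.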
Pick an input and watch the frontend transform it: messy text → normalized words → phonemes. Notice how “$2.50” and “read” need real work, not a lookup.
Tacotron (Google, 2017) was the breakthrough acoustic model: a sequence-to-sequence model with attention, just like neural machine translation, but translating phonemes into mel spectrogram frames (the original paper fed characters; many implementations feed phonemes). An encoder reads the phonemes; a decoder generates the mel one frame at a time; and attention decides, for each output frame, which input phoneme it should be voicing right now. Tacotron 2 paired this with a WaveNet vocoder and reached near-human quality, the moment TTS stopped sounding robotic.
The attention here is special: it must be monotonic. Speech goes left-to-right — you voice phoneme 1, then 2, then 3, never jumping back to 1 or skipping to 5. So a healthy Tacotron attention map is a clean diagonal line: as output frames advance, the attended phoneme advances steadily. When training works, you literally see that diagonal form — it’s the alignment being learned.
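One way to make “clean diagonal” testable is to follow the most-attended phoneme per frame and flag any step that goes backward or jumps ahead by more than one. This is a diagnostic sketch under assumptions: the attention map is a plain list of rows (one row per output frame, one weight per phoneme), and the function names are illustrative rather than taken from any Tacotron codebase.

```python
def alignment_path(attention):
    """Index of the most-attended phoneme for each output frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in attention]


def find_breaks(path):
    """List (frame, kind) for every step that violates monotonic alignment.

    'backward' means attention returned to an earlier phoneme (audible as a
    repeat); 'skip' means it jumped over at least one phoneme.
    """
    breaks = []
    for t in range(1, len(path)):
        step = path[t] - path[t - 1]
        if step < 0:
            breaks.append((t, "backward"))
        elif step > 1:
            breaks.append((t, "skip"))
    return breaks


# Rows are output frames, columns are input phonemes.
healthy = [[0.9, 0.1, 0.0],
           [0.8, 0.2, 0.0],
           [0.1, 0.8, 0.1],
           [0.0, 0.2, 0.8]]
broken = [[0.9, 0.1, 0.0],
          [0.0, 0.1, 0.9],   # jumps from phoneme 0 to phoneme 2
          [0.8, 0.1, 0.1]]   # then falls back to phoneme 0

print(find_breaks(alignment_path(healthy)))
print(find_breaks(alignment_path(broken)))
```

Staying on the same phoneme for several frames is normal (that is simply its duration), so the check only flags backward moves and jumps.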
The attention map: input phonemes (vertical) vs. output mel frames (horizontal); bright = attended. A healthy model traces a clean diagonal. Drag toward “unstable” to see it stall (repeat) and jump (skip) — the classic Tacotron failures.
The acoustic model gives a mel spectrogram, but a mel isn’t playable — it threw away the phase and the sample-level detail. The vocoder reconstructs the full waveform from the mel, inventing the fine structure. This is hard: one mel frame must become hundreds of waveform samples, and they must be smooth and natural.
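The “hundreds of samples per frame” figure follows from the hop length used when the mel was computed. The arithmetic below assumes a 22,050 Hz sample rate and a hop of 256 samples; these are common settings, but they are assumptions here, not values stated in this lesson.

```python
def samples_for_mel(n_frames: int, hop_length: int = 256) -> int:
    """Waveform samples the vocoder must produce for n_frames mel frames."""
    return n_frames * hop_length


def frames_per_second(sample_rate: int = 22050, hop_length: int = 256) -> float:
    """How many mel frames cover one second of audio."""
    return sample_rate / hop_length


# With these assumed settings, each mel frame expands to 256 samples,
# and one second of speech is roughly 86 frames.
print(samples_for_mel(1))
print(frames_per_second())
```

So the vocoder upsamples by a factor equal to the hop length, and everything the mel did not record at that finer resolution has to be synthesized plausibly.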
The history is a speed story. WaveNet (2016) generated audio one sample at a time, autoregressively: stunning quality, but agonizingly slow, since one second of 24 kHz audio means predicting 24,000 samples in sequence. Then came GAN vocoders like HiFi-GAN (2020): a generator that produces the whole waveform in parallel from the mel, trained adversarially against discriminators so it sounds real (the same GAN-for-crispness trick as neural codecs). HiFi-GAN can run hundreds of times faster than WaveNet at comparable quality, which is what made real-time, scalable TTS practical. Today, neural codec decoders can also serve as vocoders.
WaveNet (orange) generates samples one-by-one — quality but painfully serial. HiFi-GAN (teal) generates the whole waveform in parallel from the mel. Drag the clip length and watch the time gap explode for the autoregressive vocoder.
Tacotron’s autoregressive attention was both slow (one frame at a time) and fragile (skip/repeat failures). FastSpeech (2019) fixed both with a different idea: a non-autoregressive model with an explicit duration predictor. Instead of learning alignment implicitly through attention, FastSpeech predicts how many mel frames each phoneme should last, then expands the phonemes accordingly and generates all the frames in parallel.
So “cat” might be predicted as: C for 5 frames, A for 12 frames, T for 4 frames. You repeat each phoneme’s representation that many times to build the frame sequence, then a parallel network fills in the mel. Because alignment is now explicit and enforced (each phoneme gets a definite span), words can’t be skipped or repeated — robust by construction. And because it’s parallel, it’s fast. The durations are learned from a teacher model or a forced aligner. This also gives a bonus: you can control speed by scaling all durations, and prosody by adjusting individual ones.
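The expansion step (often called a length regulator) is short enough to sketch directly. In a real model the repeated items are phoneme hidden-state vectors; here plain labels stand in for them, and the `speed` parameter and the minimum of one frame per phoneme are assumptions made for the illustration.

```python
def length_regulate(phonemes, durations, speed=1.0):
    """Repeat each phoneme for its predicted number of frames.

    `speed` > 1 shortens every duration (faster speech); `speed` < 1
    lengthens it. Each phoneme keeps at least one frame.
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if speed <= 0:
        raise ValueError("speed must be positive")
    frames = []
    for phoneme, duration in zip(phonemes, durations):
        n_frames = max(1, round(duration / speed))
        frames.extend([phoneme] * n_frames)
    return frames


# The example from the text: C for 5 frames, A for 12, T for 4.
frames = length_regulate(["C", "A", "T"], [5, 12, 4])
fast = length_regulate(["C", "A", "T"], [5, 12, 4], speed=2.0)
print(len(frames), len(fast))
```

Because every phoneme is guaranteed a contiguous span in left-to-right order, the output cannot skip or revisit an input, and the whole frame sequence is available at once for a parallel decoder.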
Each phoneme gets a predicted duration (number of frames). Drag the speed: scale all durations down for faster speech, up for slower — the same control real TTS exposes. No attention to break.
Here’s the subtlest issue, and the reason early TTS sounded flat even when it was intelligible. The same text has many valid spoken renditions — different pitch contours, rhythms, emphases. A model trained with plain regression (predict the mel, minimize error) learns the average of all those renditions. And the average of many expressive prosodies is a flat, monotone, lifeless one — exactly the blur problem from JEPA and diffusion, now in pitch and rhythm.
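The flattening effect is easy to reproduce numerically. The sketch below invents four equally valid “renditions” as sine-shaped pitch contours that differ only in where the peak falls; the 200 Hz base, 40 Hz swing, and evenly spread phases are all made-up values, chosen so the cancellation is complete (real data would cancel only partly).

```python
import math


def contour(phase, n_frames=50, base_hz=200.0, swing_hz=40.0):
    """One synthetic pitch contour: a full sine cycle shifted by `phase`."""
    return [base_hz + swing_hz * math.sin(2.0 * math.pi * t / n_frames + phase)
            for t in range(n_frames)]


def pitch_range(values):
    """Distance between the highest and lowest pitch in a contour."""
    return max(values) - min(values)


# Four renditions of the "same sentence", each expressive on its own.
renditions = [contour(k * math.pi / 2.0) for k in range(4)]

# What a squared-error regressor converges to: the per-frame mean.
average = [sum(frame_values) / len(frame_values)
           for frame_values in zip(*renditions)]

print([round(pitch_range(r), 1) for r in renditions])
print(round(pitch_range(average), 6))
```

Each rendition swings by roughly 80 Hz, while their average barely moves from 200 Hz: every individual contour is lively, and the mean of all of them is a monotone.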
The cure is to make the model stochastic: let it sample a specific prosody rather than average over all of them. Different architectures do this differently — a variational latent (VAE) capturing prosodic variation, normalizing flows modeling the full distribution, a stochastic duration predictor for rhythm variety, or diffusion sampling expressive mels. The common thread: inject controlled randomness so each generation commits to one lively rendition instead of the dead average.
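The idea of committing to one rendition can be shown with a deliberately tiny stand-in for a latent-variable model. Here the “latent” is a single random phase decoded into a sine-shaped pitch contour; the function name, the 200 Hz base, and the 40 Hz swing are illustrative assumptions, and real systems sample far richer latents through a VAE, flow, or diffusion process.

```python
import math
import random


def sample_contour(seed, n_frames=50, base_hz=200.0, swing_hz=40.0):
    """Draw one latent value (a phase) and decode it into a pitch contour."""
    rng = random.Random(seed)                # seeded so each draw is reproducible
    phase = rng.uniform(0.0, 2.0 * math.pi)  # the latent: which rendition to commit to
    return [base_hz + swing_hz * math.sin(2.0 * math.pi * t / n_frames + phase)
            for t in range(n_frames)]


first = sample_contour(seed=0)
second = sample_contour(seed=1)

# Two different renditions, each with a full expressive swing.
print(round(max(first) - min(first), 1), round(max(second) - min(second), 1))
print(first == second)
```

Every sample keeps its full pitch swing because the model decodes one latent at a time instead of averaging over all of them.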
Many valid pitch contours for one sentence (teal). A deterministic model predicts their average (orange), a flat monotone. A stochastic model samples one expressive contour. Press resample to see the variety a stochastic model captures.
VITS (2021) unified everything into one end-to-end model: text straight to waveform, no separate vocoder. It combines a variational autoencoder (to model variation), a normalizing flow (to model the full distribution richly), a stochastic duration predictor (for rhythmic variety), and adversarial training (for crisp audio). The result: natural, expressive, fast speech from a single network — and the template for much of modern TTS.
Phonemes → predicted durations → flow/VAE samples a prosody → waveform. Adjust speed, pitch, and which speaker. Press Speak to generate (visualized as a pitch contour + waveform); resample to get a different valid rendition of the same text.
Notice you can change speed and pitch without re-running everything, get a different prosody each resample, and switch speaker identity by a single embedding — all the controls a production voice needs, in one model. That fusion of explicit duration, stochastic prosody, and adversarial realism is why VITS-style systems sound alive.
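The frontier keeps moving, in two big directions: diffusion and flow-matching TTS, which generate speech by iterative sampling, and codec-token language models such as VALL-E, which generate neural-codec tokens autoregressively and enable zero-shot voice cloning.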
Voice cloning generally works by conditioning on a speaker embedding (or a reference clip): a vector capturing timbre and style, fed into the model so the output sounds like that person. The same architecture, different speaker vector, different voice — which is exactly how multi-voice TTS services (like the one narrating these lessons) offer many voices from one model. The arc of TTS mirrors the rest of generative AI: toward end-to-end models, learned representations, and generation-as-sampling.
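Conditioning on a speaker vector can be sketched as adding that vector to every frame of text features. This is one common pattern, shown here as an assumption; the lesson does not specify how the embedding is injected, and the speaker names, three-dimensional vectors, and function name below are invented for the example.

```python
# Hypothetical speaker embeddings. Real systems learn much larger vectors
# from data or compute them from a reference clip.
SPEAKERS = {
    "alice": [0.9, -0.2, 0.1],
    "bob": [-0.5, 0.7, 0.3],
}


def condition_on_speaker(text_frames, speaker_vector):
    """Add the same speaker vector to every frame of text features."""
    for frame in text_frames:
        if len(frame) != len(speaker_vector):
            raise ValueError("frame and speaker vector sizes must match")
    return [[feature + offset for feature, offset in zip(frame, speaker_vector)]
            for frame in text_frames]


# Stand-in for the encoder output of one short utterance (2 frames, 3 features).
text_frames = [[0.0, 1.0, 0.5], [0.2, 0.8, 0.4]]

as_alice = condition_on_speaker(text_frames, SPEAKERS["alice"])
as_bob = condition_on_speaker(text_frames, SPEAKERS["bob"])
print(as_alice != as_bob, len(as_alice) == len(text_frames))
```

The text features and the downstream network are identical for both calls; only the vector changes, which is the sense in which one model serves many voices.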
The same text + model, conditioned on different speaker embeddings (or reference clips), yields different voices. Drag through speakers and watch the timbre “fingerprint” and resulting waveform change — one model, many voices.
| Model | Alignment | Speed | One-to-many |
|---|---|---|---|
| Tacotron 2 | attention (can break) | slow (AR) | limited |
| FastSpeech 2 | duration predictor | fast (parallel) | some (variance adaptors) |
| VITS | monotonic + stochastic dur | fast, end-to-end | yes (VAE + flow) |
| Diffusion TTS | various | medium (few-step flow) | yes (sampling) |
| Codec-LM (VALL-E) | implicit (LM) | AR over tokens | yes; zero-shot cloning |
→ Audio Representations — the mel spectrogram TTS produces
→ Neural Audio Codecs — the tokens behind VALL-E-style TTS
→ Whisper — the reverse problem, speech → text
→ Diffusion / Flow Matching — modern TTS generation engines