Audio & Speech

TTS Architectures

From robotic to indistinguishable-from-human. How text-to-speech evolved — text frontend, acoustic model, vocoder, the alignment problem, and the one-to-many trick — through Tacotron, FastSpeech, VITS, and modern diffusion & codec-token TTS.

Prerequisites: a spectrogram is a time×frequency picture of sound, and seq2seq maps one sequence to another. That’s it.
10 chapters · 9+ simulations · 0 assumed knowledge

Chapter 0: Speaking Is Hard

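Reading text aloud sounds trivial until you try to make a machine do it naturally. Old TTS sounded robotic for deep reasons, not just bad engineering. Three hard problems lurk in “turn this text into speech”: bridging short, discrete text and a long, continuous waveform; getting the timing right (alignment); and choosing one of the many valid ways to say the same words.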

Modern TTS — the kind that fools you on a phone call — solves all three. This lesson traces the architecture from the classic two-stage pipeline through Tacotron, the alignment fixes of FastSpeech, the end-to-end elegance of VITS, and today’s diffusion and codec-token systems. By the end you could diagram the model behind any voice assistant or audiobook narrator.

The trap: “just predict the waveform from the text.” Text is short and discrete; a waveform is long and continuous, and the mapping between them is one-to-many. You can’t bridge that gap in one naive step. You need an intermediate representation (the mel spectrogram), a way to handle timing (alignment), and a source of variation (stochasticity). Those three ideas organize the entire field.

The march toward natural speech

Naturalness (mean opinion score) over the eras: concatenative → parametric → Tacotron+neural vocoder → VITS/end-to-end → diffusion/codec TTS, closing on real human speech. Slide through the years.

Which is NOT one of the three core hard problems of TTS?

Chapter 1: The Two-Stage Pipeline

The classic architecture splits the impossible “text → waveform” leap into two manageable stages, bridged by the mel spectrogram (the time×frequency picture from the Audio Representations lesson). Stage one: an acoustic model turns text into a mel spectrogram — deciding what frequencies, when. Stage two: a vocoder turns that mel spectrogram into an actual waveform — filling in the fine sample-level detail the mel discarded.

text: “hello there”
  ↓ text frontend (normalize, phonemes)
phonemes: HH AH L OW …
  ↓ acoustic model
mel spectrogram: what frequencies, when
  ↓ vocoder
waveform: playable audio

Why the mel intermediate? Because it splits the problem along its natural seam. The acoustic model handles the linguistic part (which sounds, what prosody) on a compact, low-rate representation. The vocoder handles the signal part (turning a coarse mel into 16,000+ crisp samples per second) and can be trained once and reused across voices. Modern end-to-end models (like VITS) fuse both stages, but the conceptual split — linguistic then signal — still organizes how everyone thinks about TTS.
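
To make “compact, low-rate” concrete, here is a back-of-envelope sketch. The sample rate (22,050 Hz), hop size (256 samples), and 80 mel bins are common choices assumed purely for illustration; this lesson does not specify them.

```python
# Assumed, commonly used settings (not specified in this lesson).
SAMPLE_RATE = 22_050   # waveform samples per second
HOP_LENGTH = 256       # waveform samples between consecutive mel frames
N_MELS = 80            # mel bins per frame


def waveform_samples(duration_s: float) -> int:
    """Number of waveform samples the vocoder must produce."""
    return int(duration_s * SAMPLE_RATE)


def mel_frames(duration_s: float) -> int:
    """Approximate number of mel frames covering a clip of this length."""
    return waveform_samples(duration_s) // HOP_LENGTH + 1


frames = mel_frames(1.0)
samples = waveform_samples(1.0)
print(f"{frames} mel frames vs {samples} waveform samples for one second")
print(f"each mel frame ({N_MELS} numbers) stands in for {HOP_LENGTH} samples")
```

Under these assumptions the acoustic model works on a sequence a couple of hundred times shorter than the one the vocoder has to emit, which is the seam the two-stage split exploits.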

In the classic two-stage TTS pipeline, what bridges the two stages?

Chapter 2: The Text Frontend

Before any neural net runs, raw text needs cleaning — the text frontend, the most underrated part of TTS. Two jobs. First, normalization: expand everything that isn’t spelled how it’s spoken. “Dr. Smith paid $2.50 on 3/4” must become “Doctor Smith paid two dollars and fifty cents on March fourth.” Numbers, dates, currency, abbreviations, units — all expanded by rules and models. Get this wrong and the cleanest voice says “dollar two point five oh.”
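
A toy sketch of rule-based normalization, covering only a title abbreviation and small dollar amounts. The function names and rules here are illustrative assumptions; a production frontend handles far more cases (dates like “3/4” are deliberately left out because they need locale and context).

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out 0..99 only; a real frontend covers much larger ranges."""
    if not 0 <= n <= 99:
        raise ValueError("toy example only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def expand_currency(match: re.Match) -> str:
    """Turn a matched '$D.CC' into spoken words."""
    dollars, cents = int(match.group(1)), int(match.group(2))
    d_unit = "dollar" if dollars == 1 else "dollars"
    c_unit = "cent" if cents == 1 else "cents"
    return (f"{number_to_words(dollars)} {d_unit} and "
            f"{number_to_words(cents)} {c_unit}")


def normalize(text: str) -> str:
    """Apply the two toy rules: 'Dr.' as a title, then small dollar amounts."""
    text = re.sub(r"\bDr\.", "Doctor", text)
    text = re.sub(r"\$(\d{1,2})\.(\d{2})", expand_currency, text)
    return text


print(normalize("Dr. Smith paid $2.50"))
# Doctor Smith paid two dollars and fifty cents
```

Even this tiny version shows why normalization is not a lookup: the currency rule has to parse structure, pick singular or plural units, and spell out numbers.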

Second, grapheme-to-phoneme (G2P): convert letters to the sounds they make. English spelling is treacherous — “through,” “though,” “tough” share letters but not sounds; “read” is two different words. G2P maps text to phonemes (the atomic speech sounds), often the actual input to the acoustic model, because phonemes are far more regular than spelling. Homographs like “lead” (metal vs. guide) need context to disambiguate — a genuinely hard sub-problem.
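
A minimal sketch of dictionary-based G2P with homograph handling. The lexicon below is a tiny hand-written sample in ARPAbet-style symbols with stress marks omitted, and the `sense` tag stands in for whatever context model a real system would use; both are assumptions for illustration.

```python
# Tiny hand-written lexicon (illustrative only, stress marks omitted).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "tough": ["T", "AH", "F"],
    "though": ["DH", "OW"],
    "through": ["TH", "R", "UW"],
}

# Homographs: one spelling, several pronunciations keyed by a sense tag.
HOMOGRAPHS = {
    "read": {"present": ["R", "IY", "D"], "past": ["R", "EH", "D"]},
    "lead": {"verb": ["L", "IY", "D"], "metal": ["L", "EH", "D"]},
}


def g2p(word, sense=None):
    """Look up phonemes; homographs require a sense chosen from context."""
    word = word.lower()
    if word in HOMOGRAPHS:
        variants = HOMOGRAPHS[word]
        if sense not in variants:
            raise ValueError(
                f"'{word}' is ambiguous; pass one of {sorted(variants)}")
        return variants[sense]
    if word in LEXICON:
        return LEXICON[word]
    # A real system would fall back to a learned G2P model here.
    raise KeyError(f"'{word}' is not in the toy lexicon")


for w in ("tough", "though", "through"):
    print(w, "->", " ".join(g2p(w)))
print("lead (metal) ->", " ".join(g2p("lead", sense="metal")))
```

The three “-ough” words share most of their letters and almost none of their phonemes, and “lead” cannot be resolved without knowing which sense is meant. That is the work the frontend takes off the acoustic model.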

Why phonemes, not letters: a model fed raw letters must also learn English’s chaotic spelling rules, wasting capacity and failing on rare or new words. Feeding phonemes hands it the actual pronunciation, so it can focus on prosody and timbre. (Some modern end-to-end models do learn from characters directly, given enough data, but a phoneme frontend is still the robust default, especially for names and numbers.)

Normalization & grapheme-to-phoneme

Pick an input and watch the frontend transform it: messy text → normalized words → phonemes. Notice how “$2.50” and “read” need real work, not a lookup.

Why do most TTS systems convert text to phonemes before the acoustic model?

Chapter 3: Tacotron & Attention Alignment

Tacotron (Google, 2017) was the breakthrough acoustic model: a sequence-to-sequence model with attention, just like neural machine translation, but translating text into spectrogram frames. An encoder reads the input symbols (characters in the original paper, phonemes in many later systems); a decoder generates the spectrogram one frame at a time; and attention decides, for each output frame, which input symbol it should be voicing right now. Tacotron 2 paired a mel-predicting version of this model with a WaveNet vocoder and reached near-human quality: the moment TTS stopped sounding robotic.

The attention here is special: it must be monotonic. Speech goes left-to-right — you voice phoneme 1, then 2, then 3, never jumping back to 1 or skipping to 5. So a healthy Tacotron attention map is a clean diagonal line: as output frames advance, the attended phoneme advances steadily. When training works, you literally see that diagonal form — it’s the alignment being learned.

The failure mode: because the attention is learned and autoregressive, it can break. The diagonal can stall (the model repeats a sound, as in “the the the”), jump ahead (it skips words), or collapse (it babbles). These attention failures plagued early Tacotron and are exactly what the next generation of models was designed to eliminate.

The alignment diagonal

The attention map: input phonemes (vertical) vs. output mel frames (horizontal); bright = attended. A healthy model traces a clean diagonal. Drag toward “unstable” to see it stall (repeat) and jump (skip) — the classic Tacotron failures.
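
One way to make the “clean diagonal” checkable is to reduce the attention map to a path and inspect it. This is a diagnostic sketch, not how Tacotron itself works: the orientation `attention[frame][phoneme]`, the function names, and the `max_stall` threshold are all assumptions made for the example.

```python
def attended_path(attention):
    """For each output frame (row), the index of the most-attended phoneme."""
    return [max(range(len(row)), key=row.__getitem__) for row in attention]


def diagnose(path, num_phonemes, max_stall=40):
    """Flag stall / skip / jump-back patterns on an argmax alignment path.

    max_stall is a hypothetical limit on how many consecutive frames may sit
    on one phoneme before we call it a stall.
    """
    issues = []
    run = 1
    for prev, cur in zip(path, path[1:]):
        if cur < prev:
            issues.append(f"jumps back {prev}->{cur} (repeat)")
        elif cur > prev + 1:
            issues.append(f"skips {prev}->{cur}")
        run = run + 1 if cur == prev else 1
        if run == max_stall + 1:
            issues.append(f"stalls on phoneme {cur}")
    if path and path[-1] != num_phonemes - 1:
        issues.append("never reaches final phoneme")
    return issues


healthy = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]
print(diagnose(attended_path(healthy), num_phonemes=3))

broken_path = [0, 1, 1, 1, 1, 1, 3]  # lingers on phoneme 1, then skips 2
print(diagnose(broken_path, num_phonemes=4, max_stall=3))
```

A healthy path only ever stays put or advances by one, and ends on the last phoneme. Anything else corresponds to one of the failures described above.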

A healthy Tacotron attention map looks like a clean diagonal because:

Chapter 4: The Vocoder

The acoustic model gives a mel spectrogram, but a mel isn’t playable — it threw away the phase and the sample-level detail. The vocoder reconstructs the full waveform from the mel, inventing the fine structure. This is hard: one mel frame must become hundreds of waveform samples, and they must be smooth and natural.

The history is a speed story. WaveNet (2016) generated audio one sample at a time, autoregressively: stunning quality, but agonizingly slow (one second of 24 kHz audio means predicting 24,000 samples in sequence). Then came GAN vocoders like HiFi-GAN (2020): a generator that produces the whole waveform in parallel from the mel, trained adversarially against discriminators so it sounds real (the same GAN-for-crispness trick as neural codecs). HiFi-GAN is hundreds of times faster than WaveNet at comparable quality, which is what made real-time, scalable TTS practical. Today, neural codec decoders can also serve as vocoders.
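
The gap is easiest to see by counting sequential steps rather than total work. The sketch below is only that count, using the 24,000 samples-per-second figure quoted above; it ignores per-step cost and hardware, so it is not a wall-clock benchmark.

```python
SAMPLE_RATE = 24_000  # samples per second, matching the figure quoted above


def sequential_steps_autoregressive(seconds):
    """One network evaluation per sample, each waiting on the previous one."""
    return round(seconds * SAMPLE_RATE)


def sequential_steps_parallel(seconds):
    """One forward pass emits every sample: more work, same sequential depth."""
    return 1


for seconds in (0.5, 1.0, 10.0):
    ar = sequential_steps_autoregressive(seconds)
    par = sequential_steps_parallel(seconds)
    print(f"{seconds:>5.1f} s  autoregressive: {ar:>7} steps   parallel: {par} step")
```

The autoregressive count grows linearly with clip length while the parallel count stays fixed, which is why longer clips widen the gap.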

Autoregressive vs. parallel vocoding

WaveNet (orange) generates samples one-by-one — quality but painfully serial. HiFi-GAN (teal) generates the whole waveform in parallel from the mel. Drag the clip length and watch the time gap explode for the autoregressive vocoder.

Why did GAN vocoders like HiFi-GAN largely replace WaveNet for production TTS?

Chapter 5: Solving Alignment — FastSpeech

Tacotron’s autoregressive attention was both slow (one frame at a time) and fragile (skip/repeat failures). FastSpeech (2019) fixed both with a different idea: a non-autoregressive model with an explicit duration predictor. Instead of learning alignment implicitly through attention, FastSpeech predicts how many mel frames each phoneme should last, then expands the phonemes accordingly and generates all the frames in parallel.

So “cat” (phonemes K AE T) might be predicted as: K for 5 frames, AE for 12 frames, T for 4 frames. You repeat each phoneme’s representation that many times to build the frame sequence, then a parallel network fills in the mel. Because alignment is now explicit and enforced (each phoneme gets a definite span), words can’t be skipped or repeated: robust by construction. And because it’s parallel, it’s fast. The durations are learned from a teacher model or a forced aligner. This also gives a bonus: you can control speed by scaling all durations, and prosody by adjusting individual ones.
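
The expansion step is simple enough to sketch directly. This toy version repeats phoneme symbols; a real model repeats each phoneme’s hidden vector instead. The function name, the `speed` parameter, and the rounding rule are assumptions for illustration.

```python
def length_regulate(phonemes, durations, speed=1.0):
    """Repeat each phoneme by its speed-scaled duration to get one item per frame."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if speed <= 0:
        raise ValueError("speed must be positive")
    frames = []
    for phoneme, duration in zip(phonemes, durations):
        # Keep at least one frame so a short phoneme never disappears entirely.
        n = max(1, round(duration / speed))
        frames.extend([phoneme] * n)
    return frames


phonemes = ["K", "AE", "T"]      # the "cat" example, in ARPAbet-style symbols
durations = [5, 12, 4]           # predicted frames per phoneme

normal = length_regulate(phonemes, durations)
fast = length_regulate(phonemes, durations, speed=2.0)
print(len(normal), "frames at normal speed;", len(fast), "frames at 2x")
print(" ".join(normal))
```

Every phoneme appears exactly once, in order, for a definite number of frames, so there is nothing for the model to skip or repeat. Speed control falls out of the same step by scaling the durations.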

The key swap: attention “discovers” alignment and can get it wrong; a duration predictor states the alignment up front and guarantees it. Trading implicit learned alignment for explicit predicted alignment is what made TTS reliable enough to ship at scale.

Duration predictor: expanding phonemes to frames

Each phoneme gets a predicted duration (number of frames). Drag the speed: scale all durations down for faster speech, up for slower — the same control real TTS exposes. No attention to break.

How does FastSpeech avoid the skip/repeat failures of Tacotron?

Chapter 6: The One-to-Many Problem

Here’s the subtlest issue, and the reason early TTS sounded flat even when it was intelligible. The same text has many valid spoken renditions — different pitch contours, rhythms, emphases. A model trained with plain regression (predict the mel, minimize error) learns the average of all those renditions. And the average of many expressive prosodies is a flat, monotone, lifeless one — exactly the blur problem from JEPA and diffusion, now in pitch and rhythm.

The cure is to make the model stochastic: let it sample a specific prosody rather than average over all of them. Different architectures do this differently — a variational latent (VAE) capturing prosodic variation, normalizing flows modeling the full distribution, a stochastic duration predictor for rhythm variety, or diffusion sampling expressive mels. The common thread: inject controlled randomness so each generation commits to one lively rendition instead of the dead average.
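
The averaging effect can be reproduced with synthetic numbers. In this sketch each “rendition” is a made-up sinusoidal pitch contour with a random phase; the contour shape, the 200 Hz base, and the 40 Hz swing are assumptions chosen only to make the effect visible.

```python
import math
import random
import statistics


def pitch_contour(phase, n_frames=50, base_hz=200.0, swing_hz=40.0):
    """One plausible rendition: pitch rising and falling around base_hz."""
    return [base_hz + swing_hz * math.sin(2 * math.pi * t / n_frames + phase)
            for t in range(n_frames)]


def pitch_range(contour):
    """How far the pitch moves: a crude stand-in for expressiveness."""
    return max(contour) - min(contour)


rng = random.Random(0)
renditions = [pitch_contour(rng.uniform(0, 2 * math.pi)) for _ in range(200)]

# What plain regression is pulled toward: the frame-by-frame mean.
average = [statistics.fmean(values) for values in zip(*renditions)]

# What a stochastic model does instead: commit to a single rendition.
sample = rng.choice(renditions)

print(f"sampled rendition range: {pitch_range(sample):.1f} Hz")
print(f"averaged contour range:  {pitch_range(average):.1f} Hz")
```

Every individual rendition swings widely, yet their mean barely moves, because the peaks of one contour cancel the valleys of another. Sampling keeps the swing; averaging throws it away.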

Why deterministic TTS sounds flat

Many valid pitch contours for one sentence (teal). A deterministic model predicts their average (orange) — a flat monotone. A stochastic model samples one expressive contour. Press resample to hear (see) the variety a stochastic model captures.

Why does a deterministic (plain-regression) TTS model sound flat and monotone?

Chapter 7: VITS, End to End (showcase)

VITS (2021) unified everything into one end-to-end model: text straight to waveform, no separate vocoder. It combines a variational autoencoder (to model variation), a normalizing flow (to model the full distribution richly), a stochastic duration predictor (for rhythmic variety), and adversarial training (for crisp audio). The result: natural, expressive, fast speech from a single network — and the template for much of modern TTS.
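
The “resample to get a different rendition” behavior comes from drawing a latent rather than computing one. Below is a minimal sketch of a reparameterized Gaussian draw, the basic sampling step in a VAE-style model. The numbers, the `temperature` knob, and the function name are hypothetical, and the flow, duration predictor, and decoder of a full model are omitted.

```python
import random


def sample_latent(mean, std, temperature=1.0, rng=random):
    """Reparameterized draw: z = mean + temperature * std * eps, eps ~ N(0, 1)."""
    return [m + temperature * s * rng.gauss(0.0, 1.0)
            for m, s in zip(mean, std)]


prior_mean = [0.2, -0.1, 0.4]   # hypothetical per-dimension mean for one frame
prior_std = [0.5, 0.3, 0.8]     # hypothetical per-dimension standard deviation

rng = random.Random(7)
take_1 = sample_latent(prior_mean, prior_std, rng=rng)
take_2 = sample_latent(prior_mean, prior_std, rng=rng)   # a second, different take
flat = sample_latent(prior_mean, prior_std, temperature=0.0, rng=rng)

print("take 1:", [round(z, 3) for z in take_1])
print("take 2:", [round(z, 3) for z in take_2])
print("temperature 0 collapses to the mean:", flat == prior_mean)
```

Each draw is a distinct latent that a decoder could turn into a distinct rendition, while turning the noise down to zero recovers the single averaged answer a deterministic model would give.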

VITS pipeline, interactive

Phonemes → predicted durations → flow/VAE samples a prosody → waveform. Adjust speed, pitch, and which speaker. Press Speak to generate (visualized as a pitch contour + waveform); resample to get a different valid rendition of the same text.

Notice you can change speed and pitch without re-running everything, get a different prosody each resample, and switch speaker identity by a single embedding — all the controls a production voice needs, in one model. That fusion of explicit duration, stochastic prosody, and adversarial realism is why VITS-style systems sound alive.

Chapter 8: Modern TTS & Voice Cloning

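The frontier keeps moving, in two big directions: diffusion and flow-matching models that generate speech by iterative sampling, and codec-token language models (VALL-E-style) that predict neural-codec tokens autoregressively and make zero-shot voice cloning possible.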

Voice cloning generally works by conditioning on a speaker embedding (or a reference clip): a vector capturing timbre and style, fed into the model so the output sounds like that person. The same architecture, different speaker vector, different voice — which is exactly how multi-voice TTS services (like the one narrating these lessons) offer many voices from one model. The arc of TTS mirrors the rest of generative AI: toward end-to-end models, learned representations, and generation-as-sampling.
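
A minimal sketch of what “conditioning on a speaker embedding” can mean mechanically: the same vector is added to every frame of the text-side features. This is one simple choice among several (real systems may concatenate, modulate normalization layers, or prompt with a reference clip), and all names and numbers below are made up for illustration.

```python
def condition_on_speaker(frame_features, speaker_embedding):
    """Add one speaker vector to every frame: same text features, new voice."""
    dim = len(speaker_embedding)
    if any(len(frame) != dim for frame in frame_features):
        raise ValueError("speaker embedding must match the feature dimension")
    return [[x + s for x, s in zip(frame, speaker_embedding)]
            for frame in frame_features]


# Hypothetical text-side features: 2 frames, 3 dimensions each.
text_features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Hypothetical learned (or reference-derived) speaker vectors.
SPEAKERS = {"A": [1.0, 0.0, -1.0], "B": [-0.5, 0.5, 0.0]}

as_a = condition_on_speaker(text_features, SPEAKERS["A"])
as_b = condition_on_speaker(text_features, SPEAKERS["B"])
print("speaker A:", as_a)
print("speaker B:", as_b)
```

The text features and the function never change; only the vector does. In zero-shot cloning that vector would come from an encoder run on a short reference clip rather than from a fixed table.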

Voice cloning via speaker embedding

The same text + model, conditioned on different speaker embeddings (or reference clips), yields different voices. Drag through speakers and watch the timbre “fingerprint” and resulting waveform change — one model, many voices.

How does zero-shot voice cloning typically work?

Chapter 9: Cheat Sheet & Connections

text: normalize + grapheme-to-phoneme → phonemes
  ↓ acoustic model (Tacotron / FastSpeech / VITS)
alignment + mel: attention diagonal OR duration predictor; stochastic for prosody
  ↓ vocoder (WaveNet → HiFi-GAN) or end-to-end
waveform: + speaker embedding for voice identity/cloning

| Model | Alignment | Speed | One-to-many |
| --- | --- | --- | --- |
| Tacotron 2 | attention (can break) | slow (AR) | limited |
| FastSpeech 2 | duration predictor | fast (parallel) | some (variance adaptors) |
| VITS | monotonic + stochastic duration | fast, end-to-end | yes (VAE + flow) |
| Diffusion TTS | various | medium (few-step flow) | yes (sampling) |
| Codec-LM (VALL-E) | implicit (LM) | AR over tokens | yes; zero-shot cloning |

Keep exploring

Audio Representations — the mel spectrogram TTS produces
Neural Audio Codecs — the tokens behind VALL-E-style TTS
Whisper — the reverse problem, speech → text
Diffusion / Flow Matching — modern TTS generation engines

“What I cannot create, I do not understand.” You just rebuilt the TTS stack: clean the text into phonemes, turn phonemes into a mel (handling alignment with attention or a duration predictor), sample a lively prosody instead of the flat average, vocode the mel into a waveform, and steer the voice with a speaker embedding. That’s how machines learned to speak — and to sound human.