Audio & Speech

TTS Architectures

From robotic to indistinguishable-from-human. How text-to-speech evolved — text frontend, acoustic model, vocoder, the alignment problem, and the one-to-many trick — through Tacotron, FastSpeech, VITS, and modern diffusion & codec-token TTS.

Prerequisites: a spectrogram is a time×frequency picture of sound, and seq2seq maps one sequence to another. That’s it.

Chapter 0: Speaking Is Hard

Reading text aloud sounds trivial — until you try to make a machine do it naturally. Old TTS sounded robotic for deep reasons, not just bad engineering. Three hard problems lurk in “turn this text into speech.”

One-to-many: the same sentence can be said a thousand valid ways — fast or slow, excited or bored, rising or falling. There is no single “correct” waveform.
Alignment: text and audio run at different rates. “Cat” is 3 letters but maybe 300 milliseconds of sound; a pause has no letters at all. Which sound goes with which letter, and for how long?
Prosody & identity: the melody of speech — pitch, rhythm, stress — carries meaning and emotion, and every voice is different.

Modern TTS — the kind that fools you on a phone call — solves all three. This lesson traces the architecture from the classic two-stage pipeline through Tacotron, the alignment fixes of FastSpeech, the end-to-end elegance of VITS, and today’s diffusion and codec-token systems. By the end you could diagram the model behind any voice assistant or audiobook narrator.

The trap: “just predict the waveform from the text.” Text is short and discrete; a waveform is long, continuous, and one-to-many. You can’t bridge that gap in one naive step — you need an intermediate representation (the mel spectrogram), a way to handle timing (alignment), and a source of variation (stochasticity). Those three ideas organize the entire field.

The march toward natural speech

Naturalness (mean opinion score) over the eras: concatenative → parametric → Tacotron+neural vocoder → VITS/end-to-end → diffusion/codec TTS, closing on real human speech. Slide through the years.

Which is NOT one of the three core hard problems of TTS?

One-to-many (same text, many valid renditions)
Alignment (text and audio run at different rates)
Running out of vocabulary words to say

Chapter 1: The Two-Stage Pipeline

The classic architecture splits the impossible “text → waveform” leap into two manageable stages, bridged by the mel spectrogram (the time×frequency picture from the Audio Representations lesson). Stage one: an acoustic model turns text into a mel spectrogram — deciding what frequencies, when. Stage two: a vocoder turns that mel spectrogram into an actual waveform — filling in the fine sample-level detail the mel discarded.

text: “hello there”
↓ text frontend (normalize, phonemes)
phonemes: HH AH L OW …
↓ acoustic model
mel spectrogram: what frequencies, when
↓ vocoder
waveform: playable audio

Why the mel intermediate? Because it splits the problem along its natural seam. The acoustic model handles the linguistic part (which sounds, what prosody) on a compact, low-rate representation. The vocoder handles the signal part (turning a coarse mel into 16,000+ crisp samples per second) and can be trained once and reused across voices. Modern end-to-end models (like VITS) fuse both stages, but the conceptual split — linguistic then signal — still organizes how everyone thinks about TTS.
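As a shape-only sketch of the split described above, the three stages can be written as three composed functions. Everything here is a hypothetical stand-in (the function names, the two-word lexicon, the 80 mel bands, the 200-sample hop, the fixed 8 frames per phoneme). No real model is involved; the point is only what each stage consumes and produces.

```python
SAMPLE_RATE = 16_000  # assumed samples per second
HOP_LENGTH = 200      # assumed waveform samples per mel frame (12.5 ms)
N_MELS = 80           # assumed number of mel bands per frame

# Hypothetical two-word lexicon; a real frontend normalizes text and runs G2P.
TOY_LEXICON = {"hello": ["HH", "AH", "L", "OW"], "there": ["DH", "EH", "R"]}


def text_frontend(text):
    """Text -> phoneme list (stand-in: dictionary lookup only)."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON[word])
    return phonemes


def acoustic_model(phonemes, frames_per_phoneme=8):
    """Phonemes -> mel spectrogram, here an all-zero frames x N_MELS grid."""
    n_frames = len(phonemes) * frames_per_phoneme
    return [[0.0] * N_MELS for _ in range(n_frames)]


def vocoder(mel):
    """Mel -> waveform, here HOP_LENGTH silent samples per mel frame."""
    return [0.0] * (len(mel) * HOP_LENGTH)


phonemes = text_frontend("hello there")
mel = acoustic_model(phonemes)
waveform = vocoder(mel)

print(len(phonemes), "phonemes")
print(len(mel), "mel frames of", N_MELS, "bands")
print(len(waveform), "samples =", len(waveform) / SAMPLE_RATE, "seconds")
```

Note how the representation grows at each step: a handful of phonemes, then dozens of mel frames, then thousands of samples. The acoustic model works at the low frame rate; only the vocoder touches the sample rate.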

In the classic two-stage TTS pipeline, what bridges the two stages?

The raw waveform
The mel spectrogram: the acoustic model produces it, the vocoder turns it into a waveform
The text embedding only

Chapter 2: The Text Frontend

Before any neural net runs, raw text needs cleaning — the text frontend, the most underrated part of TTS. Two jobs. First, normalization: expand everything that isn’t spelled how it’s spoken. “Dr. Smith paid $2.50 on 3/4” must become “Doctor Smith paid two dollars and fifty cents on March fourth.” Numbers, dates, currency, abbreviations, units — all expanded by rules and models. Get this wrong and the cleanest voice says “dollar two point five oh.”
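A toy version of the normalization step, to make “expanded by rules” concrete. This is a minimal sketch under loud assumptions: the abbreviation table, the helper names, and the 0 to 99 number range are all invented for illustration, and dates such as “3/4” are deliberately left out because they need locale and context. Production frontends use far larger rule sets plus learned models.

```python
import re

# Hypothetical abbreviation table; blind replacement would mis-expand
# cases like "Dr." meaning "Drive", which real systems resolve by context.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def expand_currency(match):
    """Turn a '$D.CC' regex match into spoken dollars and cents."""
    dollars, cents = int(match.group(1)), int(match.group(2))
    dollar_unit = "dollar" if dollars == 1 else "dollars"
    cent_unit = "cent" if cents == 1 else "cents"
    return (f"{number_to_words(dollars)} {dollar_unit} and "
            f"{number_to_words(cents)} {cent_unit}")


def normalize(text):
    """Expand known abbreviations, then dollar amounts below $100."""
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)
    return re.sub(r"\$(\d{1,2})\.(\d{2})", expand_currency, text)


print(normalize("Dr. Smith paid $2.50"))
# Doctor Smith paid two dollars and fifty cents
```

Even this tiny case needs a pattern, a number speller, and pluralization. That is why normalization is “real work, not a lookup.”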

Second, grapheme-to-phoneme (G2P): convert letters to the sounds they make. English spelling is treacherous — “through,” “though,” “tough” share letters but not sounds; “read” is two different words. G2P maps text to phonemes (the atomic speech sounds), often the actual input to the acoustic model, because phonemes are far more regular than spelling. Homographs like “lead” (metal vs. guide) need context to disambiguate — a genuinely hard sub-problem.

Why phonemes, not letters: a model fed raw letters must also learn English’s chaotic spelling rules — wasting capacity and failing on rare/new words. Feeding phonemes hands it the actual pronunciation, so it can focus on prosody and timbre. (Some modern end-to-end models do learn from characters directly, given enough data — but a phoneme frontend is still the robust default, especially for names and numbers.)
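The G2P step can be sketched as a dictionary lookup plus a special case for homographs. The lexicon below is a hypothetical handful of entries in ARPAbet-style symbols with stress marks omitted, and the `tense` argument is an assumed stand-in for the context model a real system would use to pick between pronunciations.

```python
# Hypothetical mini-lexicon (ARPAbet-style symbols, stress marks omitted).
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "tough": ["T", "AH", "F"],
    "though": ["DH", "OW"],
    "through": ["TH", "R", "UW"],
}

# Homographs: one spelling, several pronunciations chosen by context.
HOMOGRAPHS = {
    "read": {"present": ["R", "IY", "D"], "past": ["R", "EH", "D"]},
}


def g2p(words, tense="present"):
    """Map each word to a phoneme list; `tense` fakes the context signal."""
    result = []
    for word in words:
        key = word.lower()
        if key in HOMOGRAPHS:
            result.append(HOMOGRAPHS[key][tense])
        elif key in LEXICON:
            result.append(LEXICON[key])
        else:
            # Real systems fall back to a learned G2P model here.
            raise KeyError(f"no pronunciation for {word!r}")
    return result


for word in ["tough", "though", "through"]:
    print(word, "->", " ".join(g2p([word])[0]))
print("read (present) ->", " ".join(g2p(["read"], tense="present")[0]))
print("read (past)    ->", " ".join(g2p(["read"], tense="past")[0]))
```

The three “-ough” words share five letters and no vowel sound, which is exactly the irregularity an acoustic model fed raw letters would have to learn on its own.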

Normalization & grapheme-to-phoneme

Pick an input and watch the frontend transform it: messy text → normalized words → phonemes. Notice how “$2.50” and “read” need real work, not a lookup.

Why do most TTS systems convert text to phonemes before the acoustic model?

Phonemes are shorter to store
Phonemes are far more regular than chaotic English spelling, so the model can focus on prosody/timbre instead of relearning spelling rules
Phonemes remove the need for a vocoder

Chapter 3: Tacotron & Attention Alignment

Tacotron (Google, 2017) was the breakthrough acoustic model: a sequence-to-sequence model with attention, just like neural machine translation, but translating phonemes into mel spectrogram frames. An encoder reads the phonemes; a decoder generates the mel one frame at a time; and attention decides, for each output frame, which input phoneme it should be voicing right now. Tacotron 2 paired this with a WaveNet vocoder and reached near-human quality: the moment TTS stopped sounding robotic.

The attention here is special: it must be monotonic. Speech goes left-to-right — you voice phoneme 1, then 2, then 3, never jumping back to 1 or skipping to 5. So a healthy Tacotron attention map is a clean diagonal line: as output frames advance, the attended phoneme advances steadily. When training works, you literally see that diagonal form — it’s the alignment being learned.

The failure mode: because the attention is learned and autoregressive, it can break. The diagonal can stall (the model repeats a sound — “the the the”), jump ahead (it skips words), or collapse (it babbles). These attention failures plagued early Tacotron and are exactly what the next generation of models was designed to eliminate.
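The stall and skip failures can be spotted mechanically from an attention map. The sketch below is a hypothetical diagnostic, not part of Tacotron: it takes the most-attended phoneme at each output frame and flags frames where that index moves backward, jumps by more than one, or sits still for longer than an arbitrary `max_stay` threshold (an assumed value; real phonemes can legitimately last many frames).

```python
def attended_path(attention):
    """attention[t][i] is the weight on phoneme i at output frame t.
    Returns the index of the most-attended phoneme for every frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in attention]


def diagnose(path, max_stay=3):
    """Flag non-monotonic or stalled steps in an attended-phoneme path."""
    issues = []
    run = 1  # consecutive frames spent on the current phoneme
    for t in range(1, len(path)):
        step = path[t] - path[t - 1]
        if step < 0:
            issues.append((t, "jump back"))
        elif step > 1:
            issues.append((t, "skip"))
        run = run + 1 if step == 0 else 1
        if run == max_stay + 1:
            issues.append((t, "stall"))
    return issues


healthy = [
    [0.9, 0.1, 0.0],
    [0.6, 0.4, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
]
print(attended_path(healthy))            # [0, 0, 1, 2]
print(diagnose(attended_path(healthy)))  # []

broken_path = [0, 0, 0, 0, 0, 2, 1]
print(diagnose(broken_path))
# [(3, 'stall'), (5, 'skip'), (6, 'jump back')]
```

A clean diagonal corresponds to a path that only ever stays or advances by one. Anything else is one of the failures described above.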

The alignment diagonal

The attention map: input phonemes (vertical) vs. output mel frames (horizontal); bright = attended. A healthy model traces a clean diagonal. Drag toward “unstable” to see it stall (repeat) and jump (skip) — the classic Tacotron failures.

A healthy Tacotron attention map looks like a clean diagonal because:

attention is random
speech is monotonic: output frames voice the phonemes in order, so the attended phoneme advances steadily with the frames
diagonals compress better

Chapter 4: The Vocoder

The acoustic model gives a mel spectrogram, but a mel isn’t playable — it threw away the phase and the sample-level detail. The vocoder reconstructs the full waveform from the mel, inventing the fine structure. This is hard: one mel frame must become hundreds of waveform samples, and they must be smooth and natural.

The history is a speed story. WaveNet (2016) generated audio one sample at a time, autoregressively: stunning quality, but agonizingly slow (at a 24 kHz sample rate, 24,000 sequential predictions for one second of audio). Then came GAN vocoders like HiFi-GAN: a generator that produces the whole waveform in parallel from the mel, trained adversarially against discriminators so it sounds real (the same GAN-for-crispness trick as neural codecs). HiFi-GAN is hundreds of times faster than WaveNet at comparable quality, which is what made real-time, scalable TTS practical. Today, neural codec decoders can also serve as vocoders.
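The arithmetic behind “one mel frame must become hundreds of samples” and behind the serial bottleneck is short enough to write down. The 24 kHz sample rate matches the figure in the text; the 12.5 ms frame shift is an assumed, typical value, and both function names are hypothetical.

```python
def samples_per_frame(sample_rate=24_000, frame_shift_ms=12.5):
    """Waveform samples the vocoder must produce for each mel frame."""
    return round(sample_rate * frame_shift_ms / 1000)


def autoregressive_steps(seconds, sample_rate=24_000):
    """Sequential predictions a sample-by-sample vocoder needs."""
    return round(seconds * sample_rate)


print(samples_per_frame())         # 300 samples per mel frame
print(autoregressive_steps(1.0))   # 24000 sequential steps for 1 second
print(autoregressive_steps(10.0))  # 240000 sequential steps for 10 seconds
```

The serial step count grows linearly with clip length, and each step has to wait for the previous one. A parallel vocoder produces all the samples in a single forward pass whatever the length, so its cost is dominated by hardware throughput rather than by sequential depth.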

Autoregressive vs. parallel vocoding

WaveNet (orange) generates samples one-by-one — quality but painfully serial. HiFi-GAN (teal) generates the whole waveform in parallel from the mel. Drag the clip length and watch the time gap explode for the autoregressive vocoder.

Why did GAN vocoders like HiFi-GAN largely replace WaveNet for production TTS?

They need no training
They generate the whole waveform in parallel (vs WaveNet’s sample-by-sample), making them hundreds of times faster at comparable quality
They produce text instead of audio

Chapter 5: Solving Alignment — FastSpeech

Tacotron’s autoregressive attention was both slow (one frame at a time) and fragile (skip/repeat failures). FastSpeech (2019) fixed both with a different idea: a non-autoregressive model with an explicit duration predictor. Instead of learning alignment implicitly through attention, FastSpeech predicts how many mel frames each phoneme should last, then expands the phonemes accordingly and generates all the frames in parallel.

So “cat” (phonemes K AE T) might be predicted as: K for 5 frames, AE for 12 frames, T for 4 frames. You repeat each phoneme’s representation that many times to build the frame sequence, then a parallel network fills in the mel. Because alignment is now explicit and enforced (each phoneme gets a definite span), words can’t be skipped or repeated: the model is robust by construction. And because it’s parallel, it’s fast. The durations are learned from a teacher model or a forced aligner. This also gives a bonus: you can control speed by scaling all durations, and prosody by adjusting individual ones.
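The expansion step for the “cat” example (durations 5, 12, 4) can be sketched in a few lines. This is an illustration, not FastSpeech’s code: the name `length_regulate`, the `speed` parameter, and the use of the ARPAbet labels K, AE, T as stand-ins for each phoneme’s hidden vector are all assumptions made for readability.

```python
def length_regulate(phoneme_vectors, durations, speed=1.0):
    """Repeat each phoneme's representation for its predicted frame count.

    speed > 1 shortens every duration (faster speech); speed < 1 lengthens.
    Every phoneme keeps at least one frame, so none can be skipped.
    """
    if len(phoneme_vectors) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for vector, duration in zip(phoneme_vectors, durations):
        n_frames = max(1, round(duration / speed))
        frames.extend([vector] * n_frames)
    return frames


phonemes = ["K", "AE", "T"]   # stand-ins for per-phoneme hidden vectors
durations = [5, 12, 4]        # predicted frames per phoneme

frames = length_regulate(phonemes, durations)
print(len(frames))            # 21 frames at normal speed
print("".join(f"{p:<3}" for p in frames))

slow = length_regulate(phonemes, durations, speed=0.5)
print(len(slow))              # 42 frames at half speed
```

The output order is fixed by the input order and every phoneme owns a contiguous span, so the repeat and skip failures of learned attention have no way to occur.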

The key swap: attention “discovers” alignment and can get it wrong; a duration predictor states the alignment up front and guarantees it. Trading implicit learned alignment for explicit predicted alignment is what made TTS reliable enough to ship at scale.

Duration predictor: expanding phonemes to frames

Each phoneme gets a predicted duration (number of frames). Drag the speed: scale all durations down for faster speech, up for slower — the same control real TTS exposes. No attention to break.

How does FastSpeech avoid the skip/repeat failures of Tacotron?

It uses a bigger attention head
It explicitly predicts each phoneme’s duration and expands phonemes to frames, so alignment is enforced (not discovered), and it runs in parallel
It generates text instead of mel

Chapter 6: The One-to-Many Problem

Here’s the subtlest issue, and the reason early TTS sounded flat even when it was intelligible. The same text has many valid spoken renditions — different pitch contours, rhythms, emphases. A model trained with plain regression (predict the mel, minimize error) learns the average of all those renditions. And the average of many expressive prosodies is a flat, monotone, lifeless one — exactly the blur problem from JEPA and diffusion, now in pitch and rhythm.

The cure is to make the model stochastic: let it sample a specific prosody rather than average over all of them. Different architectures do this differently — a variational latent (VAE) capturing prosodic variation, normalizing flows modeling the full distribution, a stochastic duration predictor for rhythm variety, or diffusion sampling expressive mels. The common thread: inject controlled randomness so each generation commits to one lively rendition instead of the dead average.

Why deterministic TTS sounds flat

Many valid pitch contours for one sentence (teal). A deterministic model predicts their average (orange) — a flat monotone. A stochastic model samples one expressive contour. Press resample to hear (see) the variety a stochastic model captures.

Why does a deterministic (plain-regression) TTS model sound flat and monotone?

It uses too few parameters
The same text has many valid prosodies; minimizing error makes it predict their average, which is a lifeless monotone
It skips the vocoder

Chapter 7: VITS, End to End (showcase)

VITS (2021) unified everything into one end-to-end model: text straight to waveform, no separate vocoder. It combines a variational autoencoder (to model variation), a normalizing flow (to model the full distribution richly), a stochastic duration predictor (for rhythmic variety), and adversarial training (for crisp audio). The result: natural, expressive, fast speech from a single network — and the template for much of modern TTS.

VITS pipeline, interactive

Phonemes → predicted durations → flow/VAE samples a prosody → waveform. Adjust speed, pitch, and which speaker. Press Speak to generate (visualized as a pitch contour + waveform); resample to get a different valid rendition of the same text.

Notice you can change speed and pitch without re-running everything, get a different prosody each resample, and switch speaker identity by a single embedding — all the controls a production voice needs, in one model. That fusion of explicit duration, stochastic prosody, and adversarial realism is why VITS-style systems sound alive.

Chapter 8: Modern TTS & Voice Cloning

The frontier keeps moving, in two big directions.

Diffusion / flow TTS (Grad-TTS, Voicebox, NaturalSpeech 2, E2 TTS): generate the mel (or audio) by denoising, which naturally captures the one-to-many distribution and produces extremely natural prosody. Flow-matching variants make it fast.
Codec-token LM TTS (VALL-E and kin): from the Neural Audio Codecs lesson — treat speech as codec tokens and let a language model generate them from text. This unlocked zero-shot voice cloning: give it 3 seconds of a target voice as a prompt, and it speaks new text in that voice, because the LM conditions on the acoustic tokens of the sample.

Voice cloning generally works by conditioning on a speaker embedding (or a reference clip): a vector capturing timbre and style, fed into the model so the output sounds like that person. The same architecture, different speaker vector, different voice — which is exactly how multi-voice TTS services (like the one narrating these lessons) offer many voices from one model. The arc of TTS mirrors the rest of generative AI: toward end-to-end models, learned representations, and generation-as-sampling.
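One simple way to picture “same architecture, different speaker vector” is to add the speaker embedding to every frame’s hidden state before decoding. This is an assumed, simplified conditioning scheme for illustration (systems also concatenate the vector or use it to modulate layers), and the two-dimensional vectors and the function name are hypothetical.

```python
def condition_on_speaker(frame_states, speaker_embedding):
    """Add the same speaker vector to every frame's hidden state."""
    return [
        [h + s for h, s in zip(frame, speaker_embedding)]
        for frame in frame_states
    ]


# Hidden states for the same text (2 frames, 2 dimensions), shared by all voices.
text_states = [[0.0, 1.0], [2.0, 3.0]]

speaker_a = [0.5, -0.5]   # hypothetical embedding for voice A
speaker_b = [-1.0, 2.0]   # hypothetical embedding for voice B

print(condition_on_speaker(text_states, speaker_a))
# [[0.5, 0.5], [2.5, 2.5]]
print(condition_on_speaker(text_states, speaker_b))
# [[-1.0, 3.0], [1.0, 5.0]]
```

The text-derived states never change; only the added vector does. In zero-shot cloning, that vector (or an equivalent prompt) is computed from a short reference clip rather than looked up from a table of trained speakers.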

Voice cloning via speaker embedding

The same text + model, conditioned on different speaker embeddings (or reference clips), yields different voices. Drag through speakers and watch the timbre “fingerprint” and resulting waveform change — one model, many voices.

How does zero-shot voice cloning typically work?

By retraining the whole model on the target voice
By conditioning the model on a speaker embedding / short reference clip that captures the target’s timbre, so it speaks new text in that voice
By slowing down a recording

Chapter 9: Cheat Sheet & Connections

text: normalize + grapheme-to-phoneme → phonemes
↓ acoustic model (Tacotron / FastSpeech / VITS)
alignment + mel: attention diagonal OR duration predictor; stochastic for prosody
↓ vocoder (WaveNet → HiFi-GAN) or end-to-end
waveform: + speaker embedding for voice identity/cloning

Model	Alignment	Speed	One-to-many
Tacotron 2	attention (can break)	slow (AR)	limited
FastSpeech 2	duration predictor	fast (parallel)	some (variance adaptors)
VITS	monotonic + stochastic dur	fast, end-to-end	yes (VAE + flow)
Diffusion TTS	various	medium (few-step flow)	yes (sampling)
Codec-LM (VALL-E)	implicit (LM)	AR over tokens	yes; zero-shot cloning

Keep exploring

→ Audio Representations — the mel spectrogram TTS produces
→ Neural Audio Codecs — the tokens behind VALL-E-style TTS
→ Whisper — the reverse problem, speech → text
→ Diffusion / Flow Matching — modern TTS generation engines

“What I cannot create, I do not understand.” You just rebuilt the TTS stack: clean the text into phonemes, turn phonemes into a mel (handling alignment with attention or a duration predictor), sample a lively prosody instead of the flat average, vocode the mel into a waveform, and steer the voice with a speaker embedding. That’s how machines learned to speak — and to sound human.