Audio & Speech

Music Generation

How AudioLM-style token models, MusicGen, and Stable Audio turn a text prompt into a song: tokenizing audio, modeling long-range musical structure with a hierarchy of semantic and acoustic tokens, and conditioning on text in a shared embedding space.

Prerequisites: Neural codecs turn audio into discrete tokens + A transformer predicts the next token. That’s it.
10 chapters · 9+ simulations · 0 assumed knowledge

Chapter 0: Harder Than Speech

Generating music is generating audio — but it’s dramatically harder than speech, for three reasons. Long-range structure: a song has motifs, verses, choruses, a key, a tempo — coherence that must hold over minutes, not the seconds of an utterance. Lose the thread and you get aimless noodling. High fidelity: music is full-band, often 44.1 kHz stereo, many instruments at once — far richer than 16 kHz mono speech. Vague prompts: “an upbeat lo-fi track with a jazzy piano” specifies almost nothing precisely, yet must yield something musical.

The breakthrough systems (AudioLM, MusicLM, MusicGen, Stable Audio, and the consumer tools built on these ideas) tackle all three by combining the pieces you already know: neural codecs (audio → tokens), transformers (predict the tokens), self-supervised units (capture structure), and text conditioning. This lesson assembles them into a music generator.

The trap: “music generation is just speech generation, longer.” The killer difference is structure over time. A speech model only needs local coherence (the next few words). A music model must remember a melody from 90 seconds ago and resolve it — demanding a representation that captures long-range musical content, not just local sound. That demand is what shapes the whole architecture.
Why structure is the hard part

A coherent song (teal) repeats and develops motifs over time — verse, chorus, callback. Incoherent generation (orange) wanders aimlessly. Slide the “structure memory” and watch coherence appear or dissolve over the timeline.

What makes music generation harder than speech generation?

Chapter 1: Music as Tokens

The dominant approach treats music generation as language modeling on audio tokens. From the Neural Audio Codecs lesson: a codec like EnCodec turns a waveform into a sequence of discrete tokens, and can decode tokens back to audio. So train a transformer to predict the next audio token — exactly like GPT predicts the next word — and you can generate or continue music. Condition it on a text prompt, and you get text-to-music.

This reuse is the whole elegance: once audio is tokens, all the language-model machinery applies. The codec handles fidelity (turning tokens back into rich, full-band sound at whatever sample rate the codec was trained on); the transformer handles composition (which tokens come next). The two hard sub-problems, sounding good and being musical, are cleanly separated. The remaining challenge is the one from Chapter 0: making those token predictions structurally coherent over a long song, not just locally plausible.
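
The generate-then-decode loop described above can be sketched in a few lines. This is a minimal illustration, not any real system's code: `toy_next_token_logits` and `toy_codec_decode` are hypothetical stand-ins for a trained transformer and a codec decoder, and the vocabulary size and samples-per-token values are assumptions chosen only so the loop runs.

```python
import numpy as np

VOCAB_SIZE = 1024  # assumed codebook size, for illustration only


def toy_next_token_logits(prefix, text_embedding):
    """Hypothetical stand-in for a trained transformer.

    A real model would attend over the prefix and the text condition.
    This stub derives deterministic pseudo-random logits from them.
    """
    seed = (sum(prefix) * 31 + len(prefix) + int(text_embedding.sum() * 1000)) % (2**32)
    return np.random.default_rng(seed).normal(size=VOCAB_SIZE)


def generate(text_embedding, num_tokens, temperature=1.0, seed=0):
    """Sample audio tokens one by one, each conditioned on all previous ones."""
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(num_tokens):
        logits = toy_next_token_logits(tokens, text_embedding) / temperature
        probs = np.exp(logits - logits.max())  # softmax, numerically stable
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens


def toy_codec_decode(tokens, samples_per_token=320):
    """Hypothetical codec decoder: returns silence of the right length."""
    return np.zeros(len(tokens) * samples_per_token, dtype=np.float32)


tokens = generate(np.ones(8), num_tokens=20)
waveform = toy_codec_decode(tokens)
```

The point of the sketch is the division of labor: `generate` decides which tokens come next, and the decoder alone is responsible for turning them into sound.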

text prompt
“upbeat lo-fi jazz”
↓ condition
transformer
predict audio tokens, one by one
↓ codec decoder
music waveform
full-fidelity audio
How does the dominant approach frame music generation?

Chapter 2: The Long-Structure Problem

Here’s why naive token prediction isn’t enough. Codec tokens are acoustic: they describe the fine sound at each instant. At ~75 codec frames per second, a 90-second song is about 6,750 frames, and each frame carries several RVQ tokens, so the full sequence is several times longer still. A transformer predicting acoustic tokens locally can keep the texture consistent (it still sounds like a piano) but easily loses the musical thread: it forgets the melody it stated, drifts out of key, never resolves the phrase. The result sounds like plausible audio that isn’t actually a composition.
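
To make the sequence length concrete, here is the arithmetic as a small sketch. The 75 frames per second and the 90-second song come from the paragraph above; the number of RVQ levels (4) is an assumption for illustration, since real codecs vary.

```python
def token_budget(seconds, frame_rate_hz=75, rvq_levels=4):
    """Return (codec frames, total tokens if all RVQ levels are flattened)."""
    frames = seconds * frame_rate_hz
    return frames, frames * rvq_levels


frames, flattened = token_budget(90)
print(frames)     # 6750
print(flattened)  # 27000
```

Even under these modest assumptions, a melody stated in the first ten seconds sits thousands of tokens behind the position where it should return.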

The reason: acoustic tokens carry too much fine detail and too little abstraction. Predicting them is like writing a novel by predicting individual letters — you can keep spelling words, but plot and theme dissolve. What you need is a representation at the level of musical content — melody, harmony, rhythm — abstract enough to model long-range structure, with the fine acoustic detail filled in separately. That separation — structure first, detail second — is the key idea behind coherent music models.

Concept → realization: the fix mirrors how a composer works — sketch the structure (the chord progression, the melody) first, then orchestrate the details. Music models split the problem the same way: a model for long-range content, then a model for local sound. Trying to do both at once, at the acoustic level, is what produces incoherent noodling.
Acoustic-only loses the thread

Predicting only acoustic tokens keeps local texture but drifts: the melody stated early (teal motif) fails to return; the key wanders. Slide the song length — the longer it goes, the more an acoustic-only model loses coherence.

Why does predicting only acoustic codec tokens lose long-range musical coherence?

Chapter 3: Two Kinds of Tokens

The solution uses two different tokenizations of the same audio, capturing different things. Semantic tokens come from a self-supervised model (like w2v-BERT or HuBERT; see the Self-Supervised Speech lesson). They’re coarse, low-rate, and capture content and structure: melody, harmony, rhythm, the “what” of the music. Acoustic tokens come from a neural codec (such as SoundStream, which AudioLM uses, or EnCodec). They’re high-rate and capture fine sound: timbre, texture, the exact waveform detail, the “how it sounds.”

The division of labor is exactly the composer’s: semantic tokens are the score — abstract, long-range, structural; acoustic tokens are the recording — concrete, detailed, high-fidelity. Because semantic tokens are abstract and low-rate, a transformer can model long-range structure over them coherently. And because acoustic tokens are detailed, they can render that structure into rich sound. Neither alone suffices — semantic tokens sound thin, acoustic tokens drift — but together, with semantic guiding acoustic, you get music that is both coherent and high-fidelity.
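
A toy sketch of the two tokenizations, under stated assumptions: semantic ids are taken as the nearest cluster centroid of a self-supervised feature vector, and acoustic ids come from residual vector quantization. The features, centroids, and codebooks below are random placeholders rather than trained models, and the frame rates (25 and 75 per second) and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def nearest_code(vectors, codebook):
    """Index of the closest codebook entry for each row of `vectors`."""
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def semantic_tokens(ssl_features, centroids):
    """One token per frame: the id of the nearest centroid."""
    return nearest_code(ssl_features, centroids)


def acoustic_tokens(latents, codebooks):
    """Residual VQ: each level quantizes what the previous levels missed."""
    residual = latents.copy()
    levels = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        levels.append(idx)
        residual = residual - codebook[idx]
    return np.stack(levels, axis=0)  # shape (K, T)


seconds = 10
ssl_features = rng.normal(size=(seconds * 25, 16))  # assumed 25 frames/s
centroids = rng.normal(size=(64, 16))
latents = rng.normal(size=(seconds * 75, 16))       # assumed 75 frames/s
codebooks = rng.normal(size=(4, 32, 16))            # assumed K = 4 levels

sem = semantic_tokens(ssl_features, centroids)
aco = acoustic_tokens(latents, codebooks)
print(sem.shape, aco.shape)  # (250,) (4, 750)
```

For the same ten seconds, the semantic stream here has 250 tokens and the acoustic stream has 3,000, which is why long-range structure is far easier to model on the former.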

Semantic vs. acoustic tokens

Top: semantic tokens — sparse, abstract, capturing the melodic/structural “score.” Bottom: acoustic tokens — dense, detailed, capturing fine sound. The same music, two representations. Toggle to compare what each holds.

What do semantic vs. acoustic tokens each capture?

Chapter 4: The AudioLM Hierarchy

AudioLM (Google, 2022) chains these into a three-stage hierarchy, each stage a transformer. Stage 1 — semantic modeling: generate semantic tokens autoregressively, establishing the long-range structure (melody, progression). Stage 2 — coarse acoustic: conditioned on the semantic tokens, generate the coarse (first few RVQ levels) acoustic tokens — rough timbre and dynamics that follow the structure. Stage 3 — fine acoustic: conditioned on the coarse tokens, generate the remaining fine acoustic levels — the last layer of high-frequency detail.

The hierarchy is coarse-to-fine in abstraction: structure, then rough sound, then fine sound. Each stage solves a tractable problem conditioned on the level above. This is what gives AudioLM (and its text-conditioned successor MusicLM) coherent, high-quality long-form audio — the semantic stage holds the thread across the whole piece, while the acoustic stages render it faithfully. It’s the same divide-and-conquer that makes cascaded diffusion and other hierarchical generators work: model the hard abstract part first, fill in detail conditionally.
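
The data flow between the three stages can be sketched with stubs. Everything here is hypothetical: each `stage*` function stands in for a trained transformer and just samples random ids, and the vocabulary sizes, level counts, and frame-rate ratio are illustrative assumptions. What the sketch does show is which stage consumes which output and the shapes that result.

```python
import numpy as np

SEMANTIC_VOCAB = 1024       # illustrative sizes, not taken from the text
CODEBOOK_SIZE = 1024
COARSE_LEVELS = 4
FINE_LEVELS = 8
ACOUSTIC_PER_SEMANTIC = 2   # assumed acoustic frames per semantic frame


def stage1_semantic(num_frames, rng):
    """Stub semantic model: returns (T_s,) structure tokens."""
    return rng.integers(0, SEMANTIC_VOCAB, size=num_frames)


def stage2_coarse(semantic, rng):
    """Stub coarse acoustic model, conditioned on the semantic tokens.

    A real model would attend to `semantic`; the stub only uses its length.
    """
    num_acoustic = len(semantic) * ACOUSTIC_PER_SEMANTIC
    return rng.integers(0, CODEBOOK_SIZE, size=(COARSE_LEVELS, num_acoustic))


def stage3_fine(coarse, rng):
    """Stub fine acoustic model, conditioned on the coarse tokens."""
    return rng.integers(0, CODEBOOK_SIZE, size=(FINE_LEVELS, coarse.shape[1]))


def generate_hierarchy(num_semantic_frames, seed=0):
    rng = np.random.default_rng(seed)
    semantic = stage1_semantic(num_semantic_frames, rng)
    coarse = stage2_coarse(semantic, rng)
    fine = stage3_fine(coarse, rng)
    # All RVQ levels together are what a codec decoder would consume.
    return semantic, np.concatenate([coarse, fine], axis=0)


semantic, acoustic = generate_hierarchy(50)
print(semantic.shape, acoustic.shape)  # (50,) (12, 100)
```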

Stage 1: semantic
long-range structure (melody/harmony)
↓ condition
Stage 2: coarse acoustic
rough timbre/dynamics following the structure
↓ condition
Stage 3: fine acoustic
high-frequency detail → decode to waveform
Three stages: structure → coarse → fine

Step through the stages: first the abstract structure forms (sparse semantic), then coarse acoustic detail follows it, then fine detail completes it. Each stage conditions on the one above.

Why does AudioLM use a semantic→coarse→fine hierarchy?

Chapter 5: MusicGen & the RVQ Token Problem

There’s a practical headache with codec tokens: residual quantization gives several tokens per timestep (one per RVQ level). Naively flattening them into one stream makes the sequence K× longer — brutal for a transformer. AudioLM splits coarse/fine into separate stages. MusicGen (Meta, 2023) took a slicker single-stage route with a clever delay/interleaving pattern: it offsets the RVQ levels in time so a single transformer predicts all levels together efficiently, without a K× blowup, in one model.

MusicGen is a single autoregressive transformer over EnCodec tokens, text-conditioned, that generates music directly. It is simpler than AudioLM’s three stages and still high quality. Its delay pattern is the key trick: instead of predicting all K levels of a timestep in one step (which ignores how each RVQ level depends on the levels before it) or flattening them (too long), it staggers them so that level 1 of step t, level 2 of step t−1, and so on are predicted at the same sequence position. A sequence of T timesteps then takes T + K − 1 steps instead of K × T. It shows the field’s two styles: hierarchical multi-stage (AudioLM) vs. clever single-stage (MusicGen).
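
The staggering is easy to see on a small array. The sketch below is an illustration of the idea, not MusicGen's actual code: it shifts RVQ level k right by k positions and fills the gaps with a `PAD` value, which is an assumption of this sketch.

```python
import numpy as np

PAD = -1  # placeholder for positions with no token yet (sketch assumption)


def apply_delay(codes):
    """Shift RVQ level k right by k steps: (K, T) -> (K, T + K - 1)."""
    num_levels, num_steps = codes.shape
    delayed = np.full((num_levels, num_steps + num_levels - 1), PAD, dtype=codes.dtype)
    for k in range(num_levels):
        delayed[k, k:k + num_steps] = codes[k]
    return delayed


def undo_delay(delayed):
    """Inverse of apply_delay: realign the levels before codec decoding."""
    num_levels, total = delayed.shape
    num_steps = total - num_levels + 1
    return np.stack([delayed[k, k:k + num_steps] for k in range(num_levels)])


codes = np.arange(12).reshape(3, 4)  # K = 3 levels, T = 4 timesteps
delayed = apply_delay(codes)
print(delayed)
# [[ 0  1  2  3 -1 -1]
#  [-1  4  5  6  7 -1]
#  [-1 -1  8  9 10 11]]
```

Each column is one transformer step. Column 2 holds level 1 of timestep 2, level 2 of timestep 1, and level 3 of timestep 0 (zero-indexed timesteps), so the model takes 6 steps here instead of the 12 that flattening would need.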

The delay pattern for RVQ levels

Each timestep has K RVQ tokens (rows). Flattening (left) makes the sequence K× long. The delay pattern (drag right) staggers the levels diagonally so one transformer predicts them efficiently without the blowup.

What problem does MusicGen’s delay pattern solve?

Chapter 6: Text-to-Music Conditioning

How does “a melancholic cello piece in D minor” steer the tokens? The elegant answer (used by MusicLM) is a joint text–audio embedding space. A model like MuLan (or CLAP) is trained contrastively, like CLIP, so that a piece of music and its text description land at nearby points in a shared embedding space. Then the generator conditions on the text embedding, and because text and audio share the space, it knows what audio that text implies.

This is the CLIP idea applied to music: learn a shared space where “upbeat jazz piano” (text) sits near actual upbeat-jazz-piano audio. At generation, encode the prompt into that space and let it condition the token transformer (via cross-attention or a prefix). The vagueness of music prompts is handled precisely because the embedding captures fuzzy semantic similarity, not exact words. Some systems, MusicGen among them, instead use a generic text encoder (like T5) with cross-attention, but the joint-embedding approach directly bridges the text–music gap.
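
A minimal sketch of the contrastive objective behind such a shared space, written in NumPy. It is a generic CLIP-style symmetric loss, not the exact MuLan or CLAP training code; the embeddings are random placeholders and the temperature value is an assumption.

```python
import numpy as np


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric cross-entropy over the text-audio similarity matrix."""
    text_emb, audio_emb = l2_normalize(text_emb), l2_normalize(audio_emb)
    logits = text_emb @ audio_emb.T / temperature  # (N, N), diagonal = true pairs

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def nearest_audio(prompt_emb, audio_emb):
    """Index of the audio clip whose embedding is closest to the prompt."""
    sims = l2_normalize(audio_emb) @ l2_normalize(prompt_emb)
    return int(sims.argmax())


rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
text_aligned = audio + 0.01 * rng.normal(size=(4, 8))  # captions near their clips
text_shuffled = text_aligned[[1, 2, 3, 0]]             # captions paired with wrong clips

loss_aligned = contrastive_loss(text_aligned, audio)
loss_shuffled = contrastive_loss(text_shuffled, audio)
```

Training pushes the loss toward the aligned case, where each caption is closest to its own clip. At generation time the prompt embedding then carries information about the region of audio space it sits in.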

Shared text–music embedding space

Text descriptions (squares) and music clips (dots) live in one space, trained so matching pairs sit close. Drag a text prompt around — the nearest music region is what conditioning will steer toward. This is CLIP, for music.

How do text-to-music models connect a vague text prompt to audio?

Chapter 7: Generating a Track, Live (showcase)

Watch the whole pipeline: a text prompt is embedded, a transformer generates the structural (semantic) tokens that hold the melody together, then acoustic tokens render the sound, and the codec decodes to a waveform. Adjust the prompt style and length, and watch coherence hold across the timeline — the payoff of the semantic-then-acoustic hierarchy.

Text → structure → sound

Press Generate: the prompt seeds semantic structure (a motif that recurs), then acoustic detail fills in, then it decodes to a waveform with a recurring theme. Change style and length; notice the motif returns even in long clips — structure preserved.


Notice the recurring motif in the structure track — that’s the semantic stage doing its job, keeping the piece musically coherent even as the acoustic detail varies. Without the semantic layer, the waveform would still sound like music locally, but the theme would never come back. Structure first, sound second.

Chapter 8: History, the Landscape & the Live Frontier

Where it began: WaveNet (2016). Long before codec tokens, DeepMind’s WaveNet was the foundational neural audio generator. It modeled the raw waveform directly, one sample at a time, autoregressively, using stacked dilated causal convolutions to reach far back in time cheaply. It produced startlingly natural speech and piano music, showing that a neural net could generate convincing audio from scratch. Its fatal flaw was speed: every second of audio meant generating 16,000 or more samples in strict sequence, which made it far too slow for real use. Everything since (faster vocoders, codec tokens, diffusion) is in some sense a response to WaveNet’s “great quality, impossible speed” trade-off. (WaveNet, van den Oord et al., 2016, arXiv:1609.03499.)
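
Why dilation reaches "far back in time cheaply" comes down to a short calculation. The sketch below computes the receptive field of a stack of causal convolutions. The kernel size of 2 and the doubling dilations 1, 2, 4, ..., 512 follow the WaveNet paper's description, while the function name is just illustrative.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples (current plus past) one output can see."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


doubling = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
print(receptive_field(doubling))        # 1024
print(receptive_field([1] * 10))        # 11
```

Ten layers with doubling dilations see 1,024 samples; ten undilated layers see only 11. The context grows exponentially with depth while the cost grows linearly.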

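Token-LM isn’t the only modern path. A second family uses diffusion: Stable Audio runs latent diffusion, denoising a timing-conditioned audio latent, while Riffusion runs diffusion on spectrograms, generating a mel spectrogram as an image.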

The consumer wave (Suno, Udio) blends these ideas at scale. And the honest limits remain: controllability (precise musical control — “modulate to the relative minor in bar 16” — is still hard), long-form coherence (multi-minute songs strain even hierarchies), originality & copyright (models trained on copyrighted music raise unresolved legal and ethical questions), and vocals/lyrics (coherent sung lyrics are especially hard). Music generation sits at the intersection of everything in the audio track — codecs, SSL units, transformers, diffusion, and joint embeddings.

The live frontier — real-time, interactive music (Lyria, 2025). Everything so far generates a fixed clip: prompt in, finished track out. The newest class, live music models (Google DeepMind’s Lyria Team, “Live Music Models,” 2025, arXiv:2508.04651), breaks that mold — they produce a continuous stream of music in real time that you steer as it plays. Two were released: Magenta RealTime (open weights) and Lyria RealTime (an API with richer controls). You change a text or audio prompt and the ongoing generation morphs toward the new style on the fly — a human-in-the-loop instrument for live performance, not a render-and-wait tool.

The technical leap is from offline to streaming: the model must generate audio faster than real time, in a causal, chunk-by-chunk way (it can only use the past, like WaveNet’s causal convolutions — a nice full-circle), while staying coherent and responding to control changes within a fraction of a second. It’s the same shift you saw between batch transcription and streaming ASR, now for generation: latency and interactivity become first-class constraints. Live models point at music AI as a real-time creative partner — jamming with a model the way you’d jam with a bandmate — which is a fitting place to end the audio track.
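
The shape of a streaming loop can be sketched as a generator. This is a hypothetical illustration, not the Lyria or Magenta RealTime API: the model call is replaced by a string label, and the context length and the timing numbers in the usage lines are assumptions. It shows the three constraints named above: only past chunks are visible, the style can change mid-stream, and each chunk must be produced faster than it plays.

```python
from collections import deque
from itertools import islice


def stream_music(style_schedule, context_chunks=5):
    """Yield (chunk, context) forever, one chunk at a time.

    style_schedule maps a chunk index to a new style prompt. The most
    recent prompt at or before the current chunk stays active.
    """
    context = deque(maxlen=context_chunks)  # rolling window of past chunks only
    style = None
    index = 0
    while True:
        style = style_schedule.get(index, style)
        # Hypothetical model call: a real system would generate audio tokens
        # from `context` and `style`. This stub returns a label instead.
        chunk = f"chunk{index}:{style}"
        yield chunk, list(context)
        context.append(chunk)
        index += 1


def keeps_up(generation_seconds, chunk_seconds):
    """Real-time constraint: a chunk must be generated faster than it plays."""
    return generation_seconds < chunk_seconds


first_five = list(islice(stream_music({0: "lo-fi", 3: "techno"}), 5))
print(first_five[3][0])  # chunk3:techno
```

The style switch at chunk 3 takes effect immediately, while the context passed to the model still contains the earlier lo-fi chunks, which is what lets the output morph instead of cutting abruptly.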

Two families: token-LM vs. diffusion

Token-LM (MusicGen/AudioLM): autoregressive over codec tokens, strong structure via hierarchy. Diffusion (Stable Audio): denoise an audio latent, strong fidelity + timing control. Drag to compare their strengths.

Which is a real, current limitation of music generation?

Chapter 9: Cheat Sheet & Connections

text prompt
embedded in a joint text–music space (MuLan/CLAP) or via a text encoder
↓ semantic stage
semantic tokens
long-range structure (melody/harmony) from SSL units
↓ acoustic stage(s)
acoustic tokens
codec RVQ tokens; coarse→fine, or MusicGen delay pattern
↓ codec decoder
music waveform
full-fidelity audio with coherent structure
System | Approach | Key idea
AudioLM | token-LM, 3 stages | semantic → coarse → fine hierarchy
MusicLM | AudioLM + text | MuLan joint text–music embedding
MusicGen | token-LM, single-stage | RVQ delay/interleave pattern
Stable Audio | latent diffusion | timing-conditioned audio latent
Riffusion | spectrogram diffusion | generate mel as an image
WaveNet (2016) | autoregressive raw waveform | the origin; dilated causal convs; slow
Lyria / Magenta RealTime | live music models | real-time streaming + interactive control

Keep exploring

Neural Audio Codecs — the acoustic tokens music LMs predict
Self-Supervised Speech — where semantic tokens come from
CLIP — the joint-embedding idea behind MuLan
Diffusion — the other generation family (Stable Audio)
WaveNet (2016) — the foundational raw-audio generator
Live Music Models (Lyria, 2025) — real-time, interactive generation

“What I cannot create, I do not understand.” You just rebuilt music generation: tokenize audio, split it into semantic structure and acoustic detail, model the structure first so the melody holds over minutes, render it with codec tokens (a hierarchy or a delay pattern), and steer it all with a text prompt in a shared embedding space. Tokens in, a song out.