How AudioLM, MusicGen, and Stable Audio turn a text prompt into a song — tokenizing audio, modeling long-range musical structure with a hierarchy of semantic and acoustic tokens, and conditioning on text in a shared embedding space.
Generating music is generating audio — but it’s dramatically harder than speech, for three reasons. Long-range structure: a song has motifs, verses, choruses, a key, a tempo — coherence that must hold over minutes, not the seconds of an utterance. Lose the thread and you get aimless noodling. High fidelity: music is full-band, often 44.1 kHz stereo, many instruments at once — far richer than 16 kHz mono speech. Vague prompts: “an upbeat lo-fi track with a jazzy piano” specifies almost nothing precisely, yet must yield something musical.
The breakthrough systems (AudioLM, MusicLM, MusicGen, Stable Audio, and the consumer tools built on these ideas) address all three by combining the pieces you already know: neural codecs (audio → tokens), transformers (predict the tokens), self-supervised units (capture structure), and text conditioning. This lesson assembles them into a music generator.
A coherent song (teal) repeats and develops motifs over time — verse, chorus, callback. Incoherent generation (orange) wanders aimlessly. Slide the “structure memory” and watch coherence appear or dissolve over the timeline.
The dominant approach treats music generation as language modeling on audio tokens. From the Neural Audio Codecs lesson: a codec like EnCodec turns a waveform into a sequence of discrete tokens, and can decode tokens back to audio. So train a transformer to predict the next audio token — exactly like GPT predicts the next word — and you can generate or continue music. Condition it on a text prompt, and you get text-to-music.
This reuse is the whole elegance: once audio is tokens, all the language-model machinery applies. The codec handles fidelity (turning tokens back into rich, full-band sound); the transformer handles composition (which tokens come next). The two hard sub-problems, sounding good and being musical, are cleanly separated. The remaining challenge is the one from Chapter 0: making those token predictions structurally coherent over a long song, not just locally plausible.
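The "predict the next audio token" loop can be sketched in a few lines. This is a minimal illustration, not any system's actual code: `toy_next_token_logits` is a hypothetical stand-in for a trained transformer, and the vocabulary size and prompt tokens are made up.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_next_token_logits(context, vocab_size):
    """Hypothetical stand-in for a transformer: favours (last token + 1)."""
    last = context[-1]
    return [3.0 if tok == (last + 1) % vocab_size else 0.0
            for tok in range(vocab_size)]

def generate(prompt_tokens, n_new, vocab_size, seed=0):
    """Autoregressive sampling: each new token is drawn given all previous ones."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        probs = softmax(toy_next_token_logits(tokens, vocab_size))
        next_tok = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_tok)
    return tokens

# "Continue" a three-token audio prompt by eight more tokens.
continuation = generate([0, 1, 2], n_new=8, vocab_size=16)
```

In a real system the returned token sequence would be handed to the codec decoder to become a waveform; here it is just a list of integers.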
Here’s why naive token prediction isn’t enough. Codec tokens are acoustic: they describe the fine sound at each instant. At ~75 codec frames per second, a 90-second song is about 6,750 timesteps, each carrying several RVQ tokens, so the sequence runs to many thousands of tokens. A transformer predicting acoustic tokens locally can keep the texture consistent (it still sounds like a piano) but easily loses the musical thread: it forgets the melody it stated, drifts out of key, never resolves the phrase. The result sounds like plausible audio that isn’t actually a composition.
The reason: acoustic tokens carry too much fine detail and too little abstraction. Predicting them is like writing a novel by predicting individual letters — you can keep spelling words, but plot and theme dissolve. What you need is a representation at the level of musical content — melody, harmony, rhythm — abstract enough to model long-range structure, with the fine acoustic detail filled in separately. That separation — structure first, detail second — is the key idea behind coherent music models.
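To make the scale concrete, here is the sequence-length arithmetic for the 90-second example. The frame rate of 75 per second comes from the text; the choice of 4 RVQ levels per frame is an assumption for illustration only.

```python
def sequence_lengths(seconds, frame_rate_hz, n_codebooks):
    """Count codec timesteps, and tokens if every RVQ level is flattened into one stream."""
    frames = seconds * frame_rate_hz
    return {"frames": frames, "flattened_tokens": frames * n_codebooks}

# n_codebooks=4 is an assumed value, not taken from the lesson.
lengths = sequence_lengths(seconds=90, frame_rate_hz=75, n_codebooks=4)
print(lengths)  # {'frames': 6750, 'flattened_tokens': 27000}
```

Every one of those tokens describes only a few milliseconds of sound, which is why a melody stated at the start is so far away, in token terms, by the time it should return.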
Predicting only acoustic tokens keeps local texture but drifts: the melody stated early (teal motif) fails to return; the key wanders. Slide the song length — the longer it goes, the more an acoustic-only model loses coherence.
The solution uses two different tokenizations of the same audio, capturing different things. Semantic tokens come from a self-supervised model (like w2v-BERT or HuBERT — the Self-Supervised Speech lesson). They’re coarse, low-rate, and capture content and structure: melody, harmony, rhythm, the “what” of the music. Acoustic tokens come from a neural codec (EnCodec). They’re high-rate and capture fine sound: timbre, texture, the exact waveform detail, the “how it sounds.”
The division of labor is exactly the composer’s: semantic tokens are the score — abstract, long-range, structural; acoustic tokens are the recording — concrete, detailed, high-fidelity. Because semantic tokens are abstract and low-rate, a transformer can model long-range structure over them coherently. And because acoustic tokens are detailed, they can render that structure into rich sound. Neither alone suffices — semantic tokens sound thin, acoustic tokens drift — but together, with semantic guiding acoustic, you get music that is both coherent and high-fidelity.
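A rough sketch of how the two tokenizations differ mechanically, assuming the common recipe of nearest-centroid (k-means style) quantization for semantic tokens and residual vector quantization (RVQ) for acoustic tokens. The centroids, codebooks, and the 2-D feature vector are made-up toy values; real systems use learned codebooks over high-dimensional features.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def semantic_token(centroids, feature):
    """One coarse token per feature frame: the id of the nearest centroid."""
    return nearest(centroids, feature)

def rvq_tokens(codebooks, vec):
    """Several tokens per frame: each level quantizes what the previous levels missed."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

# Toy values (hypothetical): one feature frame, a 2-entry centroid table,
# and a 2-level RVQ with a coarse and a fine codebook.
frame = [0.9, 1.2]
centroids = [[0.0, 0.0], [1.0, 1.0]]
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],                  # level 1: coarse
    [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.2]],    # level 2: fine residual
]

sem = semantic_token(centroids, frame)   # 1
aco = rvq_tokens(codebooks, frame)       # [1, 2]
```

The semantic side emits one id per frame and throws detail away on purpose; the acoustic side keeps adding levels until the residual is small.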
Top: semantic tokens — sparse, abstract, capturing the melodic/structural “score.” Bottom: acoustic tokens — dense, detailed, capturing fine sound. The same music, two representations. Toggle to compare what each holds.
AudioLM (Google, 2022) chains these into a three-stage hierarchy, each stage a transformer. Stage 1 — semantic modeling: generate semantic tokens autoregressively, establishing the long-range structure (melody, progression). Stage 2 — coarse acoustic: conditioned on the semantic tokens, generate the coarse (first few RVQ levels) acoustic tokens — rough timbre and dynamics that follow the structure. Stage 3 — fine acoustic: conditioned on the coarse tokens, generate the remaining fine acoustic levels — the last layer of high-frequency detail.
The hierarchy is coarse-to-fine in abstraction: structure, then rough sound, then fine sound. Each stage solves a tractable problem conditioned on the level above. This is what gives AudioLM (and its text-conditioned successor MusicLM) coherent, high-quality long-form audio — the semantic stage holds the thread across the whole piece, while the acoustic stages render it faithfully. It’s the same divide-and-conquer that makes cascaded diffusion and other hierarchical generators work: model the hard abstract part first, fill in detail conditionally.
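The data flow through the three stages can be sketched with toy stand-ins. Nothing here is a real model: each `generate_*` function is a hypothetical placeholder that only mimics the conditioning pattern (semantic, then coarse given semantic, then fine given coarse), and the rates and level counts are invented.

```python
import random

def generate_semantic(n_steps, vocab, rng):
    """Stage 1 stand-in: sample a 4-token motif and repeat it (toy 'structure')."""
    motif = [rng.randrange(vocab) for _ in range(4)]
    return [motif[i % 4] for i in range(n_steps)]

def generate_coarse(semantic, frames_per_semantic, n_coarse, vocab, rng):
    """Stage 2 stand-in: coarse RVQ levels per acoustic frame, conditioned on
    the semantic token that covers that frame."""
    frames = []
    for s in semantic:
        for _ in range(frames_per_semantic):
            frames.append([(s + rng.randrange(2)) % vocab for _ in range(n_coarse)])
    return frames

def generate_fine(coarse, n_fine, vocab, rng):
    """Stage 3 stand-in: append the remaining fine levels, conditioned on the
    coarse tokens of the same frame."""
    return [frame + [(frame[-1] + rng.randrange(3)) % vocab for _ in range(n_fine)]
            for frame in coarse]

rng = random.Random(0)
semantic = generate_semantic(n_steps=8, vocab=32, rng=rng)
coarse = generate_coarse(semantic, frames_per_semantic=3, n_coarse=2, vocab=32, rng=rng)
full = generate_fine(coarse, n_fine=2, vocab=32, rng=rng)
# full: 24 acoustic frames x 4 RVQ levels, ready for a codec decoder.
```

The point of the sketch is the shape of the pipeline: a short, low-rate sequence fixes the structure, and each later stage only has to fill in detail for a plan that already exists.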
Step through the stages: first the abstract structure forms (sparse semantic), then coarse acoustic detail follows it, then fine detail completes it. Each stage conditions on the one above.
There’s a practical headache with codec tokens: residual quantization gives several tokens per timestep (one per RVQ level). Naively flattening them into one stream makes the sequence K× longer — brutal for a transformer. AudioLM splits coarse/fine into separate stages. MusicGen (Meta, 2023) took a slicker single-stage route with a clever delay/interleaving pattern: it offsets the RVQ levels in time so a single transformer predicts all levels together efficiently, without a K× blowup, in one model.
MusicGen is a single autoregressive transformer over EnCodec tokens, text-conditioned, that generates music directly: simpler than AudioLM’s three stages, and high quality. Its delay pattern is the key trick. Instead of predicting all K levels of a timestep independently in one step (which ignores the dependencies between levels) or flattening them (too long), it staggers them so that level 1 of step t, level 2 of step t−1, and so on are predicted at the same sequence position. Each level therefore still sees the lower levels of its own timestep, and the sequence grows by only K−1 steps rather than K×. It shows the field’s two styles: hierarchical multi-stage (AudioLM) vs. clever single-stage (MusicGen).
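The staggering described above can be written down directly. This is a minimal sketch of the idea, not MusicGen's implementation: `PAD` is a hypothetical placeholder value for positions that hold no token yet, and the token ids are made up.

```python
PAD = -1  # hypothetical placeholder for "no token at this position"

def apply_delay_pattern(codes):
    """codes[k][t] is the token of RVQ level k at timestep t.
    Returns K rows of length T + K - 1, with level k shifted right by k steps."""
    n_levels, n_steps = len(codes), len(codes[0])
    delayed = [[PAD] * (n_steps + n_levels - 1) for _ in range(n_levels)]
    for k in range(n_levels):
        for t in range(n_steps):
            delayed[k][t + k] = codes[k][t]
    return delayed

def undo_delay_pattern(delayed):
    """Inverse shift: recover the aligned K x T grid for the codec decoder."""
    n_levels = len(delayed)
    n_steps = len(delayed[0]) - n_levels + 1
    return [[delayed[k][t + k] for t in range(n_steps)] for k in range(n_levels)]

codes = [
    [10, 11, 12, 13],   # level 1
    [20, 21, 22, 23],   # level 2
    [30, 31, 32, 33],   # level 3
]
delayed = apply_delay_pattern(codes)
# [[10, 11, 12, 13, -1, -1],
#  [-1, 20, 21, 22, 23, -1],
#  [-1, -1, 30, 31, 32, 33]]
restored = undo_delay_pattern(delayed)
```

Each column of `delayed` is one transformer step. Here 4 timesteps with 3 levels take 6 steps, where flattening would take 12.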
Each timestep has K RVQ tokens (rows). Flattening (left) makes the sequence K× long. The delay pattern (drag right) staggers the levels diagonally so one transformer predicts them efficiently without the blowup.
How does “a melancholic cello piece in D minor” steer the tokens? One elegant answer, used by MusicLM, is a joint text–audio embedding space. A model like MuLan (or CLAP) is trained contrastively, like CLIP, so that a piece of music and its text description land at nearby points in a shared embedding space. The generator then conditions on the text embedding, and because text and audio share the space, it knows what audio that text implies.
This is the CLIP idea applied to music: learn a shared space where “upbeat jazz piano” (text) sits near actual upbeat-jazz-piano audio. At generation, encode the prompt into that space and let it condition the token transformer (via cross-attention or prefix). The vagueness of music prompts is handled exactly because the embedding captures fuzzy semantic similarity, not exact words. Some systems instead use a generic text encoder (like T5) with cross-attention — MusicGen does — but the joint-embedding approach directly bridges the text–music gap.
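Here is a small sketch of the contrastive idea: cosine similarity between text and audio embeddings, plus a CLIP-style symmetric loss that is low when matching pairs sit on the diagonal. The 3-D embeddings are invented toy values and the temperature of 0.1 is an assumption; this is not the MuLan or CLAP API.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def similarity_matrix(text_embs, audio_embs):
    """sim[i][j] = cosine similarity between text i and audio clip j."""
    t = [normalize(v) for v in text_embs]
    a = [normalize(v) for v in audio_embs]
    return [[sum(x * y for x, y in zip(tv, av)) for av in a] for tv in t]

def contrastive_loss(sim, temperature=0.1):
    """Symmetric cross-entropy: text i should pick audio i, and audio i text i."""
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(c) for c in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(columns))

# Toy embeddings (hypothetical): text i describes audio clip i.
text_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
audio_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.0, 0.1, 1.1]]

sim = similarity_matrix(text_embs, audio_embs)
nearest_audio = [max(range(len(row)), key=row.__getitem__) for row in sim]
matched_loss = contrastive_loss(sim)
shuffled_loss = contrastive_loss(
    similarity_matrix(text_embs, audio_embs[1:] + audio_embs[:1]))
```

With these values each prompt retrieves its own clip, and the loss for the correctly paired batch is lower than for the shuffled one, which is the pressure that pulls matching text and music together during training.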
Text descriptions (squares) and music clips (dots) live in one space, trained so matching pairs sit close. Drag a text prompt around — the nearest music region is what conditioning will steer toward. This is CLIP, for music.
Watch the whole pipeline: a text prompt is embedded, a transformer generates the structural (semantic) tokens that hold the melody together, then acoustic tokens render the sound, and the codec decodes to a waveform. Adjust the prompt style and length, and watch coherence hold across the timeline — the payoff of the semantic-then-acoustic hierarchy.
Press Generate: the prompt seeds semantic structure (a motif that recurs), then acoustic detail fills in, then it decodes to a waveform with a recurring theme. Change style and length; notice the motif returns even in long clips — structure preserved.
Notice the recurring motif in the structure track — that’s the semantic stage doing its job, keeping the piece musically coherent even as the acoustic detail varies. Without the semantic layer, the waveform would still sound like music locally, but the theme would never come back. Structure first, sound second.
Where it began: WaveNet (2016). Long before codec tokens, DeepMind’s WaveNet was the foundational neural audio generator. It modeled the raw waveform directly, one sample at a time, autoregressively, using stacked dilated causal convolutions to reach far back in time cheaply. It produced startlingly natural speech and piano music, showing that a neural net could generate convincing raw audio sample by sample. Its fatal flaw was speed: every second of audio required 16,000 or more strictly sequential sampling steps, which made it far too slow for real use. Everything since (faster vocoders, codec tokens, diffusion) is in some sense a response to WaveNet’s “great quality, impossible speed” trade-off. (WaveNet, van den Oord et al., 2016, arXiv:1609.03499.)
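Why dilation reaches "far back in time cheaply" comes down to simple arithmetic. The sketch below assumes kernel size 2 and dilations that double from 1 to 512, stacked in three blocks; treat those settings as illustrative rather than as the exact published configuration.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of past samples (including the current one) that a stack of
    dilated causal convolutions can see."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

one_block = [2 ** i for i in range(10)]        # 1, 2, 4, ..., 512
rf_one = receptive_field(one_block)            # 1024 samples from 10 layers
rf_three = receptive_field(one_block * 3)      # 3070 samples from 30 layers

# The speed problem: one sequential model step per output sample.
steps_for_ten_seconds = 16_000 * 10            # 160000 steps at 16 kHz
```

Ten layers cover 1,024 samples where ten undilated layers of the same kernel size would cover only 11, yet generation still needs one sequential step per sample.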
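Token-LM isn’t the only modern path. A second family uses diffusion: Stable Audio denoises a timing-conditioned audio latent, and Riffusion generates a mel spectrogram as an image.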
The consumer wave (Suno, Udio) blends these ideas at scale. And the honest limits remain: controllability (precise musical control — “modulate to the relative minor in bar 16” — is still hard), long-form coherence (multi-minute songs strain even hierarchies), originality & copyright (models trained on copyrighted music raise unresolved legal and ethical questions), and vocals/lyrics (coherent sung lyrics are especially hard). Music generation sits at the intersection of everything in the audio track — codecs, SSL units, transformers, diffusion, and joint embeddings.
Live music models such as Lyria and Magenta RealTime add real-time streaming and interactive control. The technical leap is from offline to streaming: the model must generate audio at least as fast as real time, in a causal, chunk-by-chunk way (it can only use the past, like WaveNet’s causal convolutions, a nice full circle), while staying coherent and responding to control changes within a fraction of a second. It’s the same shift you saw between batch transcription and streaming ASR, now for generation: latency and interactivity become first-class constraints. Live models point at music AI as a real-time creative partner, jamming with a model the way you’d jam with a bandmate, which is a fitting place to end the audio track.
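The streaming constraint can be sketched as a causal chunk loop plus a real-time check. Everything here is hypothetical: `generate_chunk` stands in for a model call, the context window is counted in chunks, and the timings are made-up numbers.

```python
def stream_chunks(generate_chunk, n_chunks, context_chunks):
    """Causal streaming: each chunk is generated from at most the last
    `context_chunks` chunks (assumed >= 1), never from the future."""
    history = []
    for _ in range(n_chunks):
        context = history[-context_chunks:]
        chunk = generate_chunk(context)
        history.append(chunk)
        yield chunk   # could be played while the next chunk is computed

def real_time_factor(compute_seconds, audio_seconds):
    """Below 1.0 means audio is produced faster than it is played back."""
    return compute_seconds / audio_seconds

# Toy "model": the next value is 1 plus the sum of what it can still see.
toy_model = lambda context: [sum(c[0] for c in context) + 1]
chunks = list(stream_chunks(toy_model, n_chunks=5, context_chunks=2))
# [[1], [2], [4], [7], [12]]

rtf = real_time_factor(compute_seconds=1.25, audio_seconds=2.0)   # 0.625
```

A bounded context keeps the per-chunk cost constant, which is one way to hold the real-time factor under 1.0 for an arbitrarily long session, at the price of forgetting anything older than the window.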
Token-LM (MusicGen/AudioLM): autoregressive over codec tokens, strong structure via hierarchy. Diffusion (Stable Audio): denoise an audio latent, strong fidelity + timing control. Drag to compare their strengths.
| System | Approach | Key idea |
|---|---|---|
| AudioLM | token-LM, 3 stages | semantic → coarse → fine hierarchy |
| MusicLM | AudioLM + text | MuLan joint text–music embedding |
| MusicGen | token-LM, single-stage | RVQ delay/interleave pattern |
| Stable Audio | latent diffusion | timing-conditioned audio latent |
| Riffusion | spectrogram diffusion | generate mel as an image |
| WaveNet (2016) | autoregressive raw waveform | the origin; dilated causal convs; slow |
| Lyria / Magenta RealTime | live music models | real-time streaming + interactive control |
→ Neural Audio Codecs — the acoustic tokens music LMs predict
→ Self-Supervised Speech — where semantic tokens come from
→ CLIP — the joint-embedding idea behind MuLan
→ Diffusion — the other generation family (Stable Audio)
→ WaveNet (2016) — the foundational raw-audio generator
→ Live Music Models (Lyria, 2025) — real-time, interactive generation