Music Generation — From Tokens to Songs

Chapter 0: Harder Than Speech

Generating music is generating audio — but it’s dramatically harder than speech, for three reasons. Long-range structure: a song has motifs, verses, choruses, a key, a tempo — coherence that must hold over minutes, not the seconds of an utterance. Lose the thread and you get aimless noodling. High fidelity: music is full-band, often 44.1 kHz stereo, many instruments at once — far richer than 16 kHz mono speech. Vague prompts: “an upbeat lo-fi track with a jazzy piano” specifies almost nothing precisely, yet must yield something musical.

The breakthrough systems — AudioLM, MusicLM, MusicGen, Stable Audio, and the consumer tools built on these ideas — solve all three by combining the pieces you already know: neural codecs (audio → tokens), transformers (predict the tokens), self-supervised units (capture structure), and text conditioning. This lesson assembles them into a music generator.

The trap: “music generation is just speech generation, longer.” The killer difference is structure over time. A speech model only needs local coherence (the next few words). A music model must remember a melody from 90 seconds ago and resolve it — demanding a representation that captures long-range musical content, not just local sound. That demand is what shapes the whole architecture.

Why structure is the hard part

A coherent song (teal) repeats and develops motifs over time — verse, chorus, callback. Incoherent generation (orange) wanders aimlessly. Slide the “structure memory” and watch coherence appear or dissolve over the timeline.

What makes music generation harder than speech generation?

Music files are larger
Long-range structure (motifs/sections over minutes), high fidelity, and vague prompts; coherence must hold far longer than in speech
Music has no frequencies

Chapter 1: Music as Tokens

The dominant approach treats music generation as language modeling on audio tokens. From the Neural Audio Codecs lesson: a codec like EnCodec turns a waveform into a sequence of discrete tokens, and can decode tokens back to audio. So train a transformer to predict the next audio token — exactly like GPT predicts the next word — and you can generate or continue music. Condition it on a text prompt, and you get text-to-music.

This reuse is the whole elegance: once audio is tokens, all the language-model machinery applies. The codec handles fidelity (turning tokens back into rich 44.1 kHz sound); the transformer handles composition (which tokens come next). The two hard sub-problems — sounding good and being musical — are cleanly separated. The remaining challenge is the one from Chapter 0: making those token predictions structurally coherent over a long song, not just locally plausible.
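
The next-token loop described above can be sketched in a few lines. This is a toy under stated assumptions: `toy_next_token_probs` is a hypothetical stand-in for a trained transformer, `prompt_id` stands in for the text conditioning, and the codebook size of 8 is arbitrary. Only the shape of the sampling loop is the point.

```python
import random

VOCAB_SIZE = 8  # hypothetical codebook size, chosen small for readability


def toy_next_token_probs(prompt_id, history):
    """Stand-in for a trained transformer: returns a probability
    distribution over the next audio token, given the conditioning
    and the tokens generated so far."""
    last = history[-1] if history else prompt_id % VOCAB_SIZE
    weights = [1.0] * VOCAB_SIZE
    weights[last] += 3.0                     # favor repeating the last token
    weights[(last + 1) % VOCAB_SIZE] += 3.0  # or stepping up by one
    total = sum(weights)
    return [w / total for w in weights]


def generate_tokens(prompt_id, num_tokens, seed=0):
    """Sample tokens one at a time; each step sees all earlier tokens."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(num_tokens):
        probs = toy_next_token_probs(prompt_id, tokens)
        next_token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


tokens = generate_tokens(prompt_id=3, num_tokens=16)
print(tokens)  # a codec decoder would turn these ids back into a waveform
```

In a real system the stand-in function is the transformer's forward pass and the final list is handed to the codec decoder; the loop itself stays the same.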

text prompt

“upbeat lo-fi jazz”

↓ condition

transformer

predict audio tokens, one by one

↓ codec decoder

music waveform

full-fidelity audio

How does the dominant approach frame music generation?

As predicting raw waveform samples directly
As language modeling on audio tokens: a transformer predicts the next codec token, conditioned on text
As classifying the genre of existing songs

Chapter 2: The Long-Structure Problem

Here’s why naive token prediction isn’t enough. Codec tokens are acoustic: they describe the fine sound at each instant. At ~75 token frames per second, a 90-second song is about 6,750 frames, and several times that many tokens once every RVQ level of each frame is counted. A transformer predicting acoustic tokens locally can keep the texture consistent (it still sounds like a piano) but easily loses the musical thread: it forgets the melody it stated, drifts out of key, never resolves the phrase. The result sounds like plausible audio that isn’t actually a composition.
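
To make the sequence length concrete, here is the arithmetic as a small helper. The frame rate of 75 per second comes from the text; the default of 4 RVQ levels per frame is an illustrative assumption, since the number varies by codec.

```python
def token_counts(seconds, frames_per_second=75, rvq_levels=4):
    """Return (frames, total tokens if every RVQ level is flattened)."""
    frames = seconds * frames_per_second
    return frames, frames * rvq_levels


frames, flattened = token_counts(90)
print(frames, flattened)  # 6750 27000
```

Even before any question of musical structure, a model over flattened acoustic tokens has to attend across tens of thousands of positions for a 90-second clip.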

The reason: acoustic tokens carry too much fine detail and too little abstraction. Predicting them is like writing a novel by predicting individual letters — you can keep spelling words, but plot and theme dissolve. What you need is a representation at the level of musical content — melody, harmony, rhythm — abstract enough to model long-range structure, with the fine acoustic detail filled in separately. That separation — structure first, detail second — is the key idea behind coherent music models.

Concept → realization: the fix mirrors how a composer works — sketch the structure (the chord progression, the melody) first, then orchestrate the details. Music models split the problem the same way: a model for long-range content, then a model for local sound. Trying to do both at once, at the acoustic level, is what produces incoherent noodling.

Acoustic-only loses the thread

Predicting only acoustic tokens keeps local texture but drifts: the melody stated early (teal motif) fails to return; the key wanders. Slide the song length — the longer it goes, the more an acoustic-only model loses coherence.

Why does predicting only acoustic codec tokens lose long-range musical coherence?

Acoustic tokens are too few
They carry fine local detail but little abstraction; like writing a novel letter-by-letter, the structure/theme dissolves over time
Transformers can’t process audio at all

Chapter 3: Two Kinds of Tokens

The solution uses two different tokenizations of the same audio, capturing different things. Semantic tokens come from a self-supervised model (like w2v-BERT or HuBERT; see the Self-Supervised Speech lesson). They’re coarse, low-rate, and capture content and structure: melody, harmony, rhythm, the “what” of the music. Acoustic tokens come from a neural codec (SoundStream in AudioLM, EnCodec in MusicGen). They’re high-rate and capture fine sound: timbre, texture, the exact waveform detail, the “how it sounds.”

The division of labor is exactly the composer’s: semantic tokens are the score — abstract, long-range, structural; acoustic tokens are the recording — concrete, detailed, high-fidelity. Because semantic tokens are abstract and low-rate, a transformer can model long-range structure over them coherently. And because acoustic tokens are detailed, they can render that structure into rich sound. Neither alone suffices — semantic tokens sound thin, acoustic tokens drift — but together, with semantic guiding acoustic, you get music that is both coherent and high-fidelity.

Semantic vs. acoustic tokens

Top: semantic tokens — sparse, abstract, capturing the melodic/structural “score.” Bottom: acoustic tokens — dense, detailed, capturing fine sound. The same music, two representations. Toggle to compare what each holds.

What do semantic vs. acoustic tokens each capture?

Both capture exactly the same thing
Semantic (from SSL) = content/structure (the score); acoustic (from codec) = fine sound/timbre (the recording)
Semantic = loudness, acoustic = tempo

Chapter 4: The AudioLM Hierarchy

AudioLM (Google, 2022) chains these into a three-stage hierarchy, each stage a transformer. Stage 1 — semantic modeling: generate semantic tokens autoregressively, establishing the long-range structure (melody, progression). Stage 2 — coarse acoustic: conditioned on the semantic tokens, generate the coarse (first few RVQ levels) acoustic tokens — rough timbre and dynamics that follow the structure. Stage 3 — fine acoustic: conditioned on the coarse tokens, generate the remaining fine acoustic levels — the last layer of high-frequency detail.

The hierarchy is coarse-to-fine in abstraction: structure, then rough sound, then fine sound. Each stage solves a tractable problem conditioned on the level above. This is what gives AudioLM (and its text-conditioned successor MusicLM) coherent, high-quality long-form audio — the semantic stage holds the thread across the whole piece, while the acoustic stages render it faithfully. It’s the same divide-and-conquer that makes cascaded diffusion and other hierarchical generators work: model the hard abstract part first, fill in detail conditionally.
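
The data flow of the three stages can be sketched as three functions, each consuming the output of the one before. Everything here is a toy: the stage functions are hypothetical stand-ins for trained transformers, and the vocabulary sizes and the upsampling factor of 2 are arbitrary assumptions. The sketch shows only how conditioning passes down the hierarchy.

```python
import random


def semantic_stage(num_steps, rng):
    """Stage 1 stand-in: a low-rate sequence that carries the structure.
    A 4-token motif is repeated so the recurrence is easy to see."""
    motif = [rng.randrange(16) for _ in range(4)]
    return [motif[i % len(motif)] for i in range(num_steps)]


def coarse_stage(semantic, upsample, rng):
    """Stage 2 stand-in: several coarse acoustic tokens per semantic token,
    each derived from the semantic token it is conditioned on."""
    coarse = []
    for s in semantic:
        for _ in range(upsample):
            coarse.append(s * 4 + rng.randrange(4))  # values in 0..63
    return coarse


def fine_stage(coarse, rng):
    """Stage 3 stand-in: one fine token per coarse token, conditioned on it."""
    return [(c + rng.randrange(8)) % 64 for c in coarse]


def generate(num_semantic_steps=8, upsample=2, seed=0):
    rng = random.Random(seed)
    semantic = semantic_stage(num_semantic_steps, rng)
    coarse = coarse_stage(semantic, upsample, rng)
    fine = fine_stage(coarse, rng)
    return semantic, coarse, fine


semantic, coarse, fine = generate()
print(len(semantic), len(coarse), len(fine))  # 8 16 16
```

The motif lives entirely in the short semantic sequence; the two acoustic stages are longer but never have to invent structure, only follow it.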

Stage 1: semantic

long-range structure (melody/harmony)

↓ condition

Stage 2: coarse acoustic

rough timbre/dynamics following the structure

↓ condition

Stage 3: fine acoustic

high-frequency detail → decode to waveform

Three stages: structure → coarse → fine

Step through the stages: first the abstract structure forms (sparse semantic), then coarse acoustic detail follows it, then fine detail completes it. Each stage conditions on the one above.

Why does AudioLM use a semantic→coarse→fine hierarchy?

To use more GPUs
So the semantic stage holds long-range structure while the acoustic stages render it faithfully: divide-and-conquer from abstract to detailed
To avoid using a codec

Chapter 5: MusicGen & the RVQ Token Problem

There’s a practical headache with codec tokens: residual quantization gives several tokens per timestep (one per RVQ level). Naively flattening them into one stream makes the sequence K× longer — brutal for a transformer. AudioLM splits coarse/fine into separate stages. MusicGen (Meta, 2023) took a slicker single-stage route with a clever delay/interleaving pattern: it offsets the RVQ levels in time so a single transformer predicts all levels together efficiently, without a K× blowup, in one model.
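
As a reminder of where the K tokens per timestep come from, here is a minimal residual quantization sketch. It is deliberately simplified: real codecs quantize vectors with learned codebooks, whereas this toy quantizes a single number with three hand-written codebooks (all values are illustrative assumptions).

```python
def rvq_encode(value, codebooks):
    """Each level picks the entry nearest to what the earlier levels
    left unexplained, then passes the remaining residual on."""
    indices = []
    residual = value
    for codebook in codebooks:
        idx = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        indices.append(idx)
        residual -= codebook[idx]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the chosen entries across levels."""
    return sum(codebook[i] for i, codebook in zip(indices, codebooks))


codebooks = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],      # level 1: coarse
    [-0.2, -0.1, 0.0, 0.1, 0.2],      # level 2: refines level 1
    [-0.04, -0.02, 0.0, 0.02, 0.04],  # level 3: refines level 2
]

indices = rvq_encode(0.62, codebooks)
print(indices)  # [3, 3, 3] -> one token per level for this single timestep
print(round(rvq_decode(indices, codebooks), 6))  # 0.62
```

One timestep therefore yields K indices (here K = 3), which is exactly the multi-token-per-step structure a music language model has to cope with.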

MusicGen is a single autoregressive, text-conditioned transformer over EnCodec tokens that generates music directly: simpler than AudioLM’s three stages, and still high quality. Its delay pattern is the key trick. Predicting all K levels of a timestep in one step would ignore the dependence of each level on the levels before it, and flattening them makes the sequence K× longer. The delay pattern instead shifts level k by k−1 steps, so that at one sequence position the model predicts level 1 of step t, level 2 of step t−1, and so on. Each level still sees the coarser levels of its own timestep, and the sequence grows only from T to T + K − 1 positions. It shows the field’s two styles: hierarchical multi-stage (AudioLM) vs. clever single-stage (MusicGen).
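
The staggering can be shown on a small grid. This is a minimal sketch of the idea, not MusicGen's actual code: the function names and the `PAD` marker are hypothetical, and the token values are made up so that each row and column is easy to recognize.

```python
PAD = -1  # hypothetical marker for positions that hold no token


def apply_delay_pattern(codes):
    """codes[k][t] is the token of RVQ level k at timestep t (K rows, T columns).
    Returns a K x (T + K - 1) grid in which row k is shifted right by k."""
    num_levels = len(codes)
    num_steps = len(codes[0])
    width = num_steps + num_levels - 1
    delayed = [[PAD] * width for _ in range(num_levels)]
    for k, row in enumerate(codes):
        for t, token in enumerate(row):
            delayed[k][t + k] = token
    return delayed


def undo_delay_pattern(delayed):
    """Shift every row back to recover the original K x T grid."""
    num_levels = len(delayed)
    num_steps = len(delayed[0]) - num_levels + 1
    return [delayed[k][k:k + num_steps] for k in range(num_levels)]


codes = [
    [10, 11, 12, 13],  # level 1, timesteps 0..3
    [20, 21, 22, 23],  # level 2
    [30, 31, 32, 33],  # level 3
]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
# [10, 11, 12, 13, -1, -1]
# [-1, 20, 21, 22, 23, -1]
# [-1, -1, 30, 31, 32, 33]
```

Each column is one transformer step. With K = 3 and T = 4 the model runs 6 steps instead of the 12 that flattening would need, and `undo_delay_pattern` restores the grid the codec decoder expects.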

The delay pattern for RVQ levels

Each timestep has K RVQ tokens (rows). Flattening (left) makes the sequence K× long. The delay pattern (drag right) staggers the levels diagonally so one transformer predicts them efficiently without the blowup.

What problem does MusicGen’s delay pattern solve?

Text being too long
RVQ gives K tokens per timestep; the delay pattern staggers the levels so one transformer predicts them efficiently, avoiding a K× longer sequence
Audio being too quiet

Chapter 6: Text-to-Music Conditioning

How does “a melancholic cello piece in D minor” steer the tokens? One elegant answer, used by MusicLM, is a joint text–audio embedding space. A model like MuLan (or CLAP) is trained contrastively, like CLIP, so that a piece of music and its text description land at nearby points in a shared embedding space. The generator then conditions on the text embedding, and because text and audio share the space, it knows what audio that text implies.

This is the CLIP idea applied to music: learn a shared space where “upbeat jazz piano” (text) sits near actual upbeat-jazz-piano audio. At generation time, encode the prompt into that space and let it condition the token transformer (via cross-attention or as a prefix). Vague prompts work precisely because the embedding captures fuzzy semantic similarity, not exact words. Some systems, MusicGen among them, instead use a generic text encoder (like T5) with cross-attention, but the joint-embedding approach directly bridges the text–music gap.
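
The contrastive training signal can be sketched with a generic CLIP-style symmetric loss. This is not claimed to be the exact objective of MuLan or CLAP; the 2-dimensional embeddings, the temperature of 0.1, and the function names are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(text_embs, audio_embs, temperature=0.1):
    """Symmetric cross-entropy over the similarity matrix: text i should
    score highest against audio i, and audio i against text i."""
    n = len(text_embs)
    sims = [[cosine(t, a) / temperature for a in audio_embs] for t in text_embs]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            log_norm = math.log(sum(math.exp(s) for s in row))
            total += log_norm - row[i]  # negative log-softmax of the match
        return total / n

    text_to_audio = cross_entropy(sims)
    audio_to_text = cross_entropy([list(col) for col in zip(*sims)])
    return (text_to_audio + audio_to_text) / 2


text = [[1.0, 0.0], [0.0, 1.0]]           # two text descriptions
aligned_audio = [[0.9, 0.1], [0.1, 0.9]]  # each clip near its own text
swapped_audio = [[0.1, 0.9], [0.9, 0.1]]  # each clip near the wrong text

print(contrastive_loss(text, aligned_audio) < contrastive_loss(text, swapped_audio))  # True
```

Minimizing this loss pulls matching text and music together and pushes mismatched pairs apart, which is what lets a text embedding stand in for "the kind of audio this describes" at generation time.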

Shared text–music embedding space

Text descriptions (squares) and music clips (dots) live in one space, trained so matching pairs sit close. Drag a text prompt around — the nearest music region is what conditioning will steer toward. This is CLIP, for music.

How do text-to-music models connect a vague text prompt to audio?

They match exact words to song titles
A joint text–audio embedding (MuLan/CLAP), trained contrastively like CLIP, so matching text and music sit nearby; the generator conditions on the text embedding
They ignore the text

Chapter 7: Generating a Track, Live (showcase)

Watch the whole pipeline: a text prompt is embedded, a transformer generates the structural (semantic) tokens that hold the melody together, then acoustic tokens render the sound, and the codec decodes to a waveform. Adjust the prompt style and length, and watch coherence hold across the timeline — the payoff of the semantic-then-acoustic hierarchy.

Text → structure → sound

Press Generate: the prompt seeds semantic structure (a motif that recurs), then acoustic detail fills in, then it decodes to a waveform with a recurring theme. Change style and length; notice the motif returns even in long clips — structure preserved.

Notice the recurring motif in the structure track — that’s the semantic stage doing its job, keeping the piece musically coherent even as the acoustic detail varies. Without the semantic layer, the waveform would still sound like music locally, but the theme would never come back. Structure first, sound second.

Chapter 8: History, the Landscape & the Live Frontier

Where it began: WaveNet (2016). Long before codec tokens, DeepMind’s WaveNet was the foundational neural audio generator. It modeled the raw waveform directly, one sample at a time, autoregressively, using stacked dilated causal convolutions to reach far back in time cheaply. It produced startlingly natural speech and piano music, showing that a neural net could generate convincing audio from scratch. Its fatal flaw was speed: every second of audio meant generating 16,000 or more samples strictly one after another, far too slow for real use. Everything since (faster vocoders, codec tokens, diffusion) is in some sense a response to WaveNet’s “great quality, impossible speed” trade-off. (WaveNet, van den Oord et al., 2016, arXiv:1609.03499.)
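
Why dilation reaches "far back in time cheaply" is a matter of simple arithmetic, sketched below. The stack of ten layers with dilations doubling from 1 to 512 and a kernel size of 2 is an illustrative configuration, not a claim about any specific released model.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of samples (current one included) that a single output can see
    after stacking causal convolutions with the given dilations."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


dilated = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
plain = [1] * 10                       # same depth, no dilation

print(receptive_field(dilated))  # 1024
print(receptive_field(plain))    # 11
```

Ten dilated layers see 1,024 past samples where ten ordinary layers see 11. At 16 kHz, 1,024 samples is still only 64 ms of audio, which hints at why raw-waveform models struggle with structure that spans seconds or minutes.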

Token-LM isn’t the only modern path. A second family uses diffusion:

Spectrogram diffusion (Riffusion): generate a mel spectrogram image with an image-diffusion model, then vocode to audio — clever but limited in length/quality.
Latent audio diffusion (Stable Audio, AudioLDM): diffusion in a learned audio latent, often with timing conditioning so you can specify duration and structure — high fidelity, good for sound effects and music, and fast with flow-matching variants.

The consumer wave (Suno, Udio) blends these ideas at scale. And the honest limits remain: controllability (precise musical control — “modulate to the relative minor in bar 16” — is still hard), long-form coherence (multi-minute songs strain even hierarchies), originality & copyright (models trained on copyrighted music raise unresolved legal and ethical questions), and vocals/lyrics (coherent sung lyrics are especially hard). Music generation sits at the intersection of everything in the audio track — codecs, SSL units, transformers, diffusion, and joint embeddings.

The live frontier — real-time, interactive music (Lyria, 2025). Everything so far generates a fixed clip: prompt in, finished track out. The newest class, live music models (Google DeepMind’s Lyria Team, “Live Music Models,” 2025, arXiv:2508.04651), breaks that mold — they produce a continuous stream of music in real time that you steer as it plays. Two were released: Magenta RealTime (open weights) and Lyria RealTime (an API with richer controls). You change a text or audio prompt and the ongoing generation morphs toward the new style on the fly — a human-in-the-loop instrument for live performance, not a render-and-wait tool.

The technical leap is from offline to streaming: the model must generate audio faster than real time, in a causal, chunk-by-chunk way (it can only use the past, like WaveNet’s causal convolutions — a nice full-circle), while staying coherent and responding to control changes within a fraction of a second. It’s the same shift you saw between batch transcription and streaming ASR, now for generation: latency and interactivity become first-class constraints. Live models point at music AI as a real-time creative partner — jamming with a model the way you’d jam with a bandmate — which is a fitting place to end the audio track.
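
The two streaming constraints, causal chunking and keeping up with playback, can be sketched as follows. All names are hypothetical, the string "chunks" are stand-ins for audio, and the timings in the usage lines are made-up numbers.

```python
def stream_chunks(num_chunks, get_style):
    """Causal chunk-by-chunk generation: each chunk depends only on chunks
    already produced and on the control value read at that moment."""
    history = []
    for index in range(num_chunks):
        style = get_style(index)           # control may change between chunks
        chunk = f"{style}:{len(history)}"  # stand-in for a chunk of audio
        history.append(chunk)
        yield chunk


def keeps_up(chunk_seconds, compute_seconds):
    """A live model must finish each chunk before the previous one ends."""
    return compute_seconds < chunk_seconds


def style_at(index):
    """Simulated performer: switches the prompt after two chunks."""
    return "lo-fi" if index < 2 else "techno"


print(list(stream_chunks(4, style_at)))
# ['lo-fi:0', 'lo-fi:1', 'techno:2', 'techno:3']
print(keeps_up(chunk_seconds=2.0, compute_seconds=1.25))  # True
```

Because the generator yields each chunk as soon as it exists, a style change takes effect at the next chunk boundary, so the chunk length sets a floor on how quickly the music can respond.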

Two families: token-LM vs. diffusion

Token-LM (MusicGen/AudioLM): autoregressive over codec tokens, strong structure via hierarchy. Diffusion (Stable Audio): denoise an audio latent, strong fidelity + timing control. Drag to compare their strengths.

Which is a real, current limitation of music generation?

It cannot produce any sound at all
Precise musical control, long-form coherence, coherent vocals/lyrics, and originality/copyright remain hard, unresolved challenges
It only works on classical music

Chapter 9: Cheat Sheet & Connections

text prompt

embedded in a joint text–music space (MuLan/CLAP) or via a text encoder

↓ semantic stage

semantic tokens

long-range structure (melody/harmony) from SSL units

↓ acoustic stage(s)

acoustic tokens

codec RVQ tokens; coarse→fine, or MusicGen delay pattern

↓ codec decoder

music waveform

full-fidelity audio with coherent structure

System	Approach	Key idea
AudioLM	token-LM, 3 stages	semantic → coarse → fine hierarchy
MusicLM	AudioLM + text	MuLan joint text–music embedding
MusicGen	token-LM, single-stage	RVQ delay/interleave pattern
Stable Audio	latent diffusion	timing-conditioned audio latent
Riffusion	spectrogram diffusion	generate mel as an image
WaveNet (2016)	autoregressive raw waveform	the origin; dilated causal convs; slow
Lyria / Magenta RealTime	live music models	real-time streaming + interactive control

Keep exploring

→ Neural Audio Codecs — the acoustic tokens music LMs predict
→ Self-Supervised Speech — where semantic tokens come from
→ CLIP — the joint-embedding idea behind MuLan
→ Diffusion — the other generation family (Stable Audio)
→ WaveNet (2016) — the foundational raw-audio generator
→ Live Music Models (Lyria, 2025) — real-time, interactive generation

“What I cannot create, I do not understand.” You just rebuilt music generation: tokenize audio, split it into semantic structure and acoustic detail, model the structure first so the melody holds over minutes, render it with codec tokens (a hierarchy or a delay pattern), and steer it all with a text prompt in a shared embedding space. Tokens in, a song out.