Generating music with long-term structure by introducing relative position representations into the Transformer — enabling it to capture the patterns of repetition, transposition, and development that make music coherent over minutes, not just seconds.
Listen to any piece of classical piano music — say, a Chopin Ballade. Within the first minute, a theme is introduced: a melody with a particular rhythm and contour. Over the next eight minutes, that theme returns — sometimes in the same key, sometimes transposed, sometimes varied but recognizable. The ending recalls the opening. Phrases echo phrases heard minutes earlier. This is long-range structure, and it's what makes music feel like music rather than random notes.
Now try to generate music with an LSTM. The LSTM processes events one at a time, carrying information through its hidden state. After 100 events (~10 seconds of music), the hidden state has been overwritten and rewritten so many times that information from the beginning is practically gone. The generated music might sound locally coherent — nice chords, reasonable rhythms — but it wanders aimlessly, never returning to a theme, never building tension, never resolving.
The Transformer can see every previous event directly through attention. But the original Transformer uses absolute positional encodings: each position gets a fixed vector that encodes "I am at position 42" or "I am at position 317." This creates a problem for music: a melody at position 10-20 and the same melody repeated at position 200-210 look completely different to absolute position encodings. The model has to independently learn that the same musical pattern can occur at different absolute positions.
A piano roll showing a musical phrase (teal) that repeats later (warm). LSTMs lose the connection after ~100 events. Standard Transformers see both occurrences but don't recognize them as "the same pattern shifted in time." Relative attention solves this.
Before the Music Transformer, most of the leading neural music generators were RNN-based:
| Model | Year | Architecture | Limitation |
|---|---|---|---|
| DeepBach | 2017 | RNN + Gibbs sampling | Limited to Bach chorales style |
| Performance RNN | 2017 | LSTM | Wanders after 10-15 seconds |
| MusicVAE | 2018 | Hierarchical VAE + LSTM | Fixed-length, interpolation focus |
| Coconet | 2017 | CNN (non-autoregressive) | Short pieces only (16 bars) |
All these models struggled with the same fundamental issue: maintaining coherent structure beyond about 10 seconds of music. For the recurrent models, the hidden-state bottleneck was the main constraint.
Huang et al. proposed the Music Transformer: a Transformer with relative position representations instead of absolute positional encodings. Instead of "I am at position 42," each attention computation encodes "the key is 5 positions before the query." This lets the model learn that "a note followed by the same note 4 steps later" is a consistent pattern, regardless of where in the sequence it occurs.
The result: piano performances with coherent long-range structure — themes that repeat, develop, and resolve over sequences of 2,000+ events. For the first time, a neural network could generate music that sounded like it had a plan.
Musical structure operates at multiple timescales. Understanding these levels helps you appreciate what the Music Transformer actually achieved:
| Level | Time Scale | Musical Term | Example |
|---|---|---|---|
| Note | ~100ms | Intervals, ornaments | A trill, a grace note |
| Beat | ~500ms | Rhythm, meter | Waltz (3/4 time) |
| Phrase | ~4s | Melody, motif | The first 4 bars of Für Elise |
| Section | ~30s | Verse, chorus, development | The A section returns after B |
| Form | ~3min | Sonata, ABA, rondo | Exposition-Development-Recap |
LSTMs can handle note and beat-level structure — they produce locally coherent rhythms and harmonies. But phrase-level and beyond? The hidden state has been overwritten too many times. The Music Transformer, with relative attention, captures structure up to the section level — phrases repeat and develop in ways that sound intentional.
Before we can feed music to a Transformer, we need to represent it as a sequence of discrete tokens. The Music Transformer uses an event-based MIDI representation developed by Oore et al. (2018). Instead of a piano roll (a 2D grid of time vs pitch), the music is encoded as a 1D sequence of events, like text.
The vocabulary consists of 388 event types across four categories:
| Event Type | Count | Description |
|---|---|---|
| NOTE_ON | 128 | Start playing note (pitch 0-127, where 60 = middle C) |
| NOTE_OFF | 128 | Stop playing note (same pitch range) |
| TIME_SHIFT | 100 | Advance time by 10ms increments (10ms to 1000ms) |
| VELOCITY | 32 | Set velocity (volume) for subsequent notes (0-127, quantized to 32 bins) |
A simple example: playing middle C for half a second at medium volume takes four events: a VELOCITY event for the chosen volume, NOTE_ON_60, TIME_SHIFT_50 (50 × 10 ms = 500 ms), and NOTE_OFF_60.
Polyphonic texture (multiple notes at once) is handled naturally by this event representation. A chord is represented by consecutive NOTE_ON events with no TIME_SHIFT between them, and the model learns to treat such a run as a chord.
For example, the sequence VELOCITY_28, NOTE_ON_60, NOTE_ON_64, NOTE_ON_67, TIME_SHIFT_100, NOTE_OFF_60, NOTE_OFF_64, NOTE_OFF_67 encodes a C major chord (C-E-G, MIDI 60-64-67) held for 1 second at forte volume.
A piano roll (top) and its MIDI event sequence (bottom). Events are color-coded: teal = NOTE_ON, blue = NOTE_OFF, warm = TIME_SHIFT, purple = VELOCITY.
The event representation has several properties that make it well suited to Transformer modeling. Two of them, dynamics and timing, are what let it capture an expressive performance rather than just a score.
The VELOCITY events capture dynamics — how loudly or softly notes are played. A skilled pianist varies velocity constantly: a melody note might be played forte (loud, VELOCITY_28) while the accompanying chord is piano (soft, VELOCITY_8). The 32 velocity bins are coarse but sufficient to capture the basic shape of a performance's dynamics. The ordering convention is that a VELOCITY event sets the velocity for all subsequent NOTE_ON events until the next VELOCITY event appears.
TIME_SHIFT events with 10ms granularity capture timing nuance. A note slightly before the beat (anticipation) vs slightly after (laid-back feel) is a difference of 20-30ms, or just 2-3 steps of the 10ms grid (for example TIME_SHIFT_48 instead of TIME_SHIFT_50). This granularity is enough for the model to learn micro-timing patterns that distinguish a mechanical MIDI playback from an expressive human performance.
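A single TIME_SHIFT event covers at most one second, so longer gaps need more than one event. The sketch below assumes that longer gaps are emitted as a chain of TIME_SHIFT events, which is a common convention for this representation but is not spelled out in the text above. `time_shift_tokens` is a hypothetical helper name, and the offset 256 follows the token layout used in this article.

```python
TIME_SHIFT_OFFSET = 256   # first TIME_SHIFT token id in this article's layout
MAX_BINS = 100            # one event covers at most 100 * 10 ms = 1000 ms

def time_shift_tokens(ms):
    """Split a gap in milliseconds into a list of TIME_SHIFT token ids."""
    bins = round(ms / 10)             # quantize to 10 ms steps
    tokens = []
    while bins > 0:
        step = min(bins, MAX_BINS)    # largest shift a single event can carry
        tokens.append(TIME_SHIFT_OFFSET + step - 1)
        bins -= step
    return tokens

print(time_shift_tokens(30))     # [258]
print(time_shift_tokens(2350))   # [355, 355, 290]
```

A 30 ms nudge is one small token, while a 2.35 s pause becomes two full-second shifts followed by a 350 ms shift.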
```python
# MIDI event vocabulary
VOCAB_SIZE = 388
# NOTE_ON:    0-127   (128 events)
# NOTE_OFF:   128-255 (128 events)
# TIME_SHIFT: 256-355 (100 events, 10ms to 1s)
# VELOCITY:   356-387 (32 events)

def encode_note_on(pitch):
    return pitch  # 0-127

def encode_note_off(pitch):
    return 128 + pitch  # 128-255

def encode_time_shift(ms):
    # Quantize to 10ms bins, max 1000ms
    bins = min(100, max(1, round(ms / 10)))
    return 256 + bins - 1  # 256-355

def encode_velocity(vel):
    # Quantize 0-127 to 32 bins
    bin_idx = min(31, vel // 4)
    return 356 + bin_idx  # 356-387

# C major chord for 1s at medium velocity:
sequence = [
    encode_velocity(80),      # VELOCITY_20
    encode_note_on(60),       # NOTE_ON_60 (C4)
    encode_note_on(64),       # NOTE_ON_64 (E4)
    encode_note_on(67),       # NOTE_ON_67 (G4)
    encode_time_shift(1000),  # TIME_SHIFT_100
    encode_note_off(60),      # NOTE_OFF_60
    encode_note_off(64),      # NOTE_OFF_64
    encode_note_off(67),      # NOTE_OFF_67
]
```
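Going the other way, from token ids back to event names, is handy when inspecting model output. This is a minimal sketch that assumes the same token layout as the encoder above. `decode_event` is a hypothetical helper, and velocity bin 16 is an arbitrary stand-in for "medium volume".

```python
def decode_event(token):
    """Map a token id (0-387) back to a readable event name."""
    if not 0 <= token < 388:
        raise ValueError(f"token {token} is outside the 388-event vocabulary")
    if token < 128:
        return f"NOTE_ON_{token}"
    if token < 256:
        return f"NOTE_OFF_{token - 128}"
    if token < 356:
        return f"TIME_SHIFT_{token - 255}"   # 1..100, in units of 10 ms
    return f"VELOCITY_{token - 356}"          # bin 0..31

# Middle C for half a second (velocity bin 16 is an assumed choice)
single_note = [356 + 16, 60, 256 + 50 - 1, 128 + 60]
print([decode_event(t) for t in single_note])
# ['VELOCITY_16', 'NOTE_ON_60', 'TIME_SHIFT_50', 'NOTE_OFF_60']
```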
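The original Transformer (Vaswani et al., 2017) uses absolute positional encodings: a fixed vector is added to each token embedding that encodes "I am at position t." Sinusoidal encodings use:

$$PE_{(t,\,2k)} = \sin\!\left(\frac{t}{10000^{2k/d_{model}}}\right), \qquad PE_{(t,\,2k+1)} = \cos\!\left(\frac{t}{10000^{2k/d_{model}}}\right)$$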
This tells the model "this token is at absolute position 42." But consider a musical sequence: a C-E-G chord at positions 10-12, and the same C-E-G chord at positions 200-202. With absolute encodings:
| Event | Absolute position | What the model sees |
|---|---|---|
| NOTE_ON_60 | 10 | embedding(60) + PE(10) |
| NOTE_ON_64 | 11 | embedding(64) + PE(11) |
| NOTE_ON_60 | 200 | embedding(60) + PE(200) |
| NOTE_ON_64 | 201 | embedding(64) + PE(201) |
The C at position 10 and the C at position 200 have the same content embedding but completely different positional vectors. The model must learn, separately for every absolute position at which the pattern can start, that "a C followed by an E" is a major third. With 2000+ possible positions, this is an enormous learning burden.
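The contrast can be made concrete with a few lines of plain Python. This is an illustrative sketch, not code from the paper: `sinusoidal_pe` follows the standard sinusoidal formula with an arbitrarily small `d_model=8`, and `offsets` is a hypothetical helper that lists the relative distances inside one occurrence of the pattern.

```python
import math

def sinusoidal_pe(pos, d_model=8):
    """Sinusoidal encoding of one absolute position (Vaswani et al., 2017)."""
    pe = []
    for k in range(0, d_model, 2):
        angle = pos / (10000 ** (k / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

def offsets(positions):
    """Matrix of relative offsets i - j between every pair of positions."""
    return [[i - j for j in positions] for i in positions]

first, second = [10, 11, 12], [200, 201, 202]

# Absolute view: the same chord gets different position vectors.
gap = max(abs(a - b) for a, b in zip(sinusoidal_pe(10), sinusoidal_pe(200)))
print(f"largest difference between PE(10) and PE(200): {gap:.3f}")

# Relative view: the offset pattern inside each occurrence is identical.
print(offsets(first) == offsets(second))   # True
```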
Relative position representations (Shaw et al., 2018) replace "I am at position 42" with "the key is 5 positions before me." The attention computation now depends on the distance between query and key, not their absolute locations.
Two occurrences of the same musical pattern (C-E-G) at different positions. With absolute encoding (left), they look completely different. With relative encoding (right), the internal structure is identical: only the relative distances matter.
A similar benefit applies to transposition. A melody in C major and the same melody in G major are the same interval pattern, shifted by 7 semitones. Note that this is a shift along the pitch axis, not the time axis: relative position over the event sequence handles repetition in time, while transposition invariance needs relative information about pitch. With a relative pitch encoding, the model doesn't need to relearn the pattern for every possible key; it learns the interval structure once.
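The invariance itself is easy to check on raw MIDI pitches. The snippet below only illustrates why interval structure is the right thing to learn; it is not the model's mechanism, and the five-note melody is made up.

```python
def intervals(pitches):
    """Semitone steps between consecutive notes of a melody."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

melody_c = [60, 62, 64, 65, 67]            # C D E F G (MIDI pitches)
melody_g = [p + 7 for p in melody_c]       # same melody transposed up 7 semitones

print(intervals(melody_c))                           # [2, 2, 1, 2]
print(intervals(melody_c) == intervals(melody_g))    # True
```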
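Let's be precise. In the standard Transformer, the attention logit (before scaling) between query at position i and key at position j is:

$$A_{ij} = (x_i + PE_i)\,W^Q \left((x_j + PE_j)\,W^K\right)^\top$$

Expanding this product gives four terms:

$$A_{ij} = x_i W^Q (x_j W^K)^\top + x_i W^Q (PE_j W^K)^\top + PE_i W^Q (x_j W^K)^\top + PE_i W^Q (PE_j W^K)^\top$$

The last three terms all depend on absolute positions. In the relative formulation, these are replaced by terms that depend only on the offset (i-j). The substitution $PE_j W^K \to E^r[i-j]$ converts the key-side position dependence from absolute to relative.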
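The four-term expansion is plain bilinearity, which can be confirmed numerically. The sketch below uses random stand-ins for the embeddings, positional vectors, and projection matrices; all names and the tiny dimension `d = 4` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x_i, x_j = rng.normal(size=d), rng.normal(size=d)      # content embeddings
pe_i, pe_j = rng.normal(size=d), rng.normal(size=d)    # stand-ins for PE_i, PE_j
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Left-hand side: project (content + position), then take the dot product.
logit = ((x_i + pe_i) @ W_q) @ ((x_j + pe_j) @ W_k)

# Right-hand side: the four terms of the expansion.
terms = {
    "content-content":   (x_i @ W_q) @ (x_j @ W_k),
    "content-position":  (x_i @ W_q) @ (pe_j @ W_k),
    "position-content":  (pe_i @ W_q) @ (x_j @ W_k),
    "position-position": (pe_i @ W_q) @ (pe_j @ W_k),
}
print(np.isclose(logit, sum(terms.values())))   # True
```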
```python
# Absolute vs relative attention
import torch
import torch.nn.functional as F

# Absolute: position info added to embeddings
def absolute_attention(X, W_q, W_k, W_v, PE):
    # Add absolute position to input
    X = X + PE  # PE[t] encodes "I am at position t"
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / (d ** 0.5)
    return F.softmax(scores, dim=-1) @ V

# Relative: position info injected into attention scores
def relative_attention(X, W_q, W_k, W_v, E_r):
    # E_r[k] encodes "key is k steps before the query", k = 0..L-1
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    L, d = Q.shape
    # Content-based scores (same as standard)
    content_scores = Q @ K.T
    # Position-based scores: entry (i, j) pairs query i with E_r[i - j].
    # Naive gather for clarity; offsets for future keys (j > i) are clamped to 0.
    idx = torch.arange(L)
    offsets = (idx[:, None] - idx[None, :]).clamp(min=0)  # [L, L]
    position_scores = torch.einsum('id,ijd->ij', Q, E_r[offsets])
    scores = (content_scores + position_scores) / (d ** 0.5)
    return F.softmax(scores, dim=-1) @ V
```
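Let's formalize relative attention. In standard attention, the score between query at position i and key at position j is:

$$e_{ij} = \frac{x_i W^Q (x_j W^K)^\top}{\sqrt{d}}$$

Shaw et al. (2018) extended this by adding a learned relative position embedding $a_{ij}$ that depends on the distance (i - j):

$$e_{ij} = \frac{x_i W^Q (x_j W^K + a_{ij})^\top}{\sqrt{d}}$$

Where $a_{ij} = E^r[\mathrm{clip}(i-j, -K, K)]$ is a learned d-dimensional embedding for the relative offset (i-j), clipped to a maximum distance K. This is the form the Music Transformer builds on. A fuller decomposition (from Dai et al., 2019, the Transformer-XL paper) expands the unscaled attention score into four terms:

$$A_{ij} = x_i W^Q (x_j W^K)^\top + x_i W^Q (E^r[i-j])^\top + u\,(x_j W^K)^\top + v\,(E^r[i-j])^\top$$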
Let's understand each term:
| Term | Name | Depends on | What it captures |
|---|---|---|---|
| $x_i W^Q (x_j W^K)^\top$ | Content-Content | Content at i and j | "These two notes are related" (same as standard attention) |
| $x_i W^Q (E^r[i-j])^\top$ | Content-Position | Content at i, offset (i-j) | "This query content likes keys that are 3 steps back" |
| $u\,(x_j W^K)^\top$ | Global content bias | Content at j | "This key content is generally important" (content saliency) |
| $v\,(E^r[i-j])^\top$ | Global position bias | Offset (i-j) | "Keys 1 step back are generally important" (recency bias) |
The vectors u and v are learned global bias vectors (shared across all queries), replacing the absolute position encoding of the query. Er is a lookup table of learned embeddings indexed by the relative distance (i-j).
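The lookup into Er is driven by a matrix of clipped offsets. A small sketch of that index matrix follows, assuming the table stores offsets -K..K in rows 0..2K. `relative_index` is a hypothetical helper name, and the shift by K is just one way to turn signed offsets into row indices.

```python
def relative_index(L, K):
    """Row i, column j holds the table index for offset clip(i - j, -K, K)."""
    table = []
    for i in range(L):
        row = []
        for j in range(L):
            offset = max(-K, min(K, i - j))   # clip to [-K, K]
            row.append(offset + K)            # shift so indices start at 0
        table.append(row)
    return table

for row in relative_index(L=5, K=2):
    print(row)
# [2, 1, 0, 0, 0]
# [3, 2, 1, 0, 0]
# [4, 3, 2, 1, 0]
# [4, 4, 3, 2, 1]
# [4, 4, 4, 3, 2]
```

Each diagonal carries a single index, which is the structure the skewing trick later exploits.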
The attention score between query i and key j is the sum of four terms. The heatmap shows each term's contribution to the total attention pattern for a 6-position sequence; brighter means a higher attention score.
You might wonder: why not just replace Q·KT with (Q + position)·(K + position)T? The decomposition into four terms falls out of this expansion, but the key insight is that the position-dependent terms should use different representations than the content-dependent terms. In the full formulation, content keys are the projected inputs xjWK, while position keys are the separate learned embeddings Er[i-j].
These serve different roles: content keys encode what is at a position, while position embeddings encode where it is relative to the query. Separating them lets the model independently learn content-based and position-based attention patterns. A head might learn to attend to "the closest NOTE_ON event" (position-based) regardless of which note it is, or to "any occurrence of middle C" (content-based) regardless of where it appears.
Here's the catch: a naive implementation of relative attention requires storing the full L × L × d tensor of relative position embeddings, where L is the sequence length and d is the head dimension. For a 2048-token music sequence with d = 64, that is 2048 × 2048 × 64 ≈ 268 million entries per head.
This is O(L²D) memory (with D the head dimension), much worse than the O(L²) of standard attention. With 8 heads, that's over 2 billion entries just for the relative position embeddings. This made the approach impractical for long sequences... until the skewing trick.
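The arithmetic behind these figures is short enough to write out. The byte count assumes float32 storage, which the text does not specify.

```python
L, d, heads = 2048, 64, 8
bytes_per_float = 4   # assuming float32 storage

naive_per_head = L * L * d          # one d-vector per (query, key) pair
naive_all_heads = naive_per_head * heads
table_per_head = L * d              # one d-vector per relative offset

print(f"{naive_per_head:,}")        # 268,435,456
print(f"{naive_all_heads:,}")       # 2,147,483,648
print(naive_per_head * bytes_per_float / 2**30, "GiB per head")   # 1.0 GiB per head
print(f"{table_per_head:,}")        # 131,072
```

The last line is the size of a table with one embedding per offset, which is all that actually needs to be stored.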
This is the key technical contribution of the Music Transformer paper. The naive computation of relative attention gathers, for every query-key pair (i, j), the embedding Er[i-j] into an intermediate L × L × D tensor before multiplying by the queries. That intermediate tensor is what costs O(L²D) memory.
Huang et al. observed that this tensor has a specific, Toeplitz-like structure: the embedding used at entry (i, j) depends only on (i-j), so the same embedding vector is reused along each diagonal. (The resulting scores still differ along a diagonal, because each row is multiplied by a different query.) They exploit this structure with a reshaping trick.
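Instead of creating L separate embeddings for each query, we use the fact that the relative offset (i-j) ranges from 0 to L-1 (in the causal case). We only need L unique embeddings, stored in Er of shape [L, d]. For the reshaping trick to work, the rows are stored from farthest to nearest: row k (written $e_k$) holds the embedding for offset L-1-k, so the last row is offset 0.
Compute the product Srel = Q ErT, which has shape [L, L]:

$$S^{rel}_{raw}[i, k] = q_i^\top e_k$$

But we need entry (i, j) to use the embedding for offset (i-j), which is $e_{L-1-(i-j)}$, not $e_k$ for a fixed column k. Each row therefore has to be shifted by a different amount so that column j lines up with offset (i-j).
The trick: pad the raw matrix with one column of zeros on the left, reshape it from [L, L+1] to [L+1, L], then slice off the first row. In the resulting [L, L] matrix, every causal entry (j ≤ i) is $q_i^\top e_{L-1-(i-j)}$, the score for offset (i-j). Entries above the diagonal hold leftover values and are removed by the causal mask.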
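The pad, reshape, slice sequence can be checked against a direct lookup. This NumPy sketch assumes the embedding table is stored so that row L-1-k holds offset k (last row = offset 0), which is the layout under which left-padding produces the right alignment. `skew` is a hypothetical helper name, and the sizes are arbitrary.

```python
import numpy as np

def skew(S):
    """Pad one zero column on the left, reshape, and drop the first row."""
    L = S.shape[0]
    padded = np.pad(S, ((0, 0), (1, 0)))      # [L, L+1]
    return padded.reshape(L + 1, L)[1:]       # [L, L]

L, d = 6, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(L, d))
E_r = rng.normal(size=(L, d))   # assumed layout: row L-1-k holds offset k

skewed = skew(Q @ E_r.T)

# Naive reference: look up the embedding for every causal (i, j) pair.
naive = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        naive[i, j] = Q[i] @ E_r[L - 1 - (i - j)]

causal = np.tril(np.ones((L, L), dtype=bool))
print(np.allclose(skewed[causal], naive[causal]))   # True
```

Only the lower triangle is compared, because entries above the diagonal are masked out before the softmax anyway.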
The matrix transforms through the skewing operation in four steps. Step 1: the raw Q·ErT matrix (wrong indexing). Step 2: pad with zeros. Step 3: reshape. Step 4: slice to get the correct relative-position attention matrix.
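Let's trace through with L=4. The raw matrix Q · ErT has entry (i,k) = qi · ek:

$$\begin{bmatrix} q_0e_0 & q_0e_1 & q_0e_2 & q_0e_3 \\ q_1e_0 & q_1e_1 & q_1e_2 & q_1e_3 \\ q_2e_0 & q_2e_1 & q_2e_2 & q_2e_3 \\ q_3e_0 & q_3e_1 & q_3e_2 & q_3e_3 \end{bmatrix}$$

With the storage order the trick requires (last row = offset 0), e3 is offset 0, e2 is offset 1, and so on. So we want entry (i,j) to use $e_{3-(i-j)}$: position (0,0) should use e3, position (1,0) should use e2, position (2,1) should use e2, etc. After padding a zero column on the left:

$$\begin{bmatrix} 0 & q_0e_0 & q_0e_1 & q_0e_2 & q_0e_3 \\ 0 & q_1e_0 & q_1e_1 & q_1e_2 & q_1e_3 \\ 0 & q_2e_0 & q_2e_1 & q_2e_2 & q_2e_3 \\ 0 & q_3e_0 & q_3e_1 & q_3e_2 & q_3e_3 \end{bmatrix}$$

Reshape [4, 5] → [5, 4] by reading elements in row-major order and filling the new shape:

$$\begin{bmatrix} 0 & q_0e_0 & q_0e_1 & q_0e_2 \\ q_0e_3 & 0 & q_1e_0 & q_1e_1 \\ q_1e_2 & q_1e_3 & 0 & q_2e_0 \\ q_2e_1 & q_2e_2 & q_2e_3 & 0 \\ q_3e_0 & q_3e_1 & q_3e_2 & q_3e_3 \end{bmatrix}$$

Slice off row 0. The resulting [4, 4] matrix is:

$$\begin{bmatrix} q_0e_3 & 0 & q_1e_0 & q_1e_1 \\ q_1e_2 & q_1e_3 & 0 & q_2e_0 \\ q_2e_1 & q_2e_2 & q_2e_3 & 0 \\ q_3e_0 & q_3e_1 & q_3e_2 & q_3e_3 \end{bmatrix}$$

Every entry on or below the diagonal is $q_i e_{3-(i-j)}$, as required. The entries above the diagonal are leftovers from the reshape and are discarded by the causal mask.
The implementation handles the indexing through the reshape. The key insight is that skewing adds no multiply-adds: it only pads and reshapes the [L, L] score matrix, which costs at most O(L²) data movement.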
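The L=4 case can also be traced symbolically, using string labels in place of numbers. This plain-Python sketch assumes the embedding rows are ordered so that e3 is offset 0 and e0 is offset 3; the labels `q{i}e{k}` are purely illustrative.

```python
L = 4
# Raw product: entry (i, k) is q_i . e_k, written here as a label.
raw = [[f"q{i}e{k}" for k in range(L)] for i in range(L)]

padded = [["0"] + row for row in raw]                        # [L, L+1]
flat = [cell for row in padded for cell in row]              # row-major order
reshaped = [flat[r * L:(r + 1) * L] for r in range(L + 1)]   # [L+1, L]
skewed = reshaped[1:]                                        # drop row 0 -> [L, L]

for row in skewed:
    print(row)
# ['q0e3', '0', 'q1e0', 'q1e1']
# ['q1e2', 'q1e3', '0', 'q2e0']
# ['q2e1', 'q2e2', 'q2e3', '0']
# ['q3e0', 'q3e1', 'q3e2', 'q3e3']
```

On and below the diagonal, row i only contains products with q_i, and the embedding index falls by one for each extra step back in time.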
In the multi-head setting, each head has its own relative position embeddings Er. This lets different heads specialize: one head might learn to attend at beat intervals (every 8 positions), another at bar intervals (every 32 positions), and another at phrase intervals (every 128 positions). The skewing trick is applied independently per head.
```python
# Multi-head relative attention with skewing
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeMultiHeadAttn(nn.Module):
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        # Per-head relative position embeddings.
        # Last row = offset 0; row max_len-1-k = k steps back.
        self.E_r = nn.Parameter(torch.randn(n_heads, max_len, self.d_k))
        # Global bias vectors (u, v)
        self.u = nn.Parameter(torch.randn(n_heads, self.d_k))
        self.v = nn.Parameter(torch.randn(n_heads, self.d_k))

    def forward(self, x):
        B, L, D = x.shape
        # Project to Q, K, V: [B, L, H, d_k]
        Q = self.W_q(x).view(B, L, self.n_heads, self.d_k)
        K = self.W_k(x).view(B, L, self.n_heads, self.d_k)
        V = self.W_v(x).view(B, L, self.n_heads, self.d_k)
        # Content terms: (Q + u) @ K.T
        Qu = Q + self.u  # broadcast u over B, L
        S_content = torch.einsum('blhd,bmhd->bhlm', Qu, K)
        # Position terms: (Q + v) @ E_r.T, then skew
        Qv = Q + self.v
        E = self.E_r[:, -L:]  # [H, L, d_k], the L nearest offsets
        S_pos = torch.einsum('blhd,hmd->bhlm', Qv, E)
        S_pos = self._skew(S_pos)  # apply skewing trick
        # Combine, mask, softmax
        scores = (S_content + S_pos) / (self.d_k ** 0.5)
        mask = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        out = torch.einsum('bhlm,bmhd->blhd', weights, V).reshape(B, L, D)
        return self.W_o(out)

    def _skew(self, S):
        # S: [B, H, L, L] -> skewed [B, H, L, L]
        B, H, L, _ = S.shape
        S = F.pad(S, (1, 0))        # [B, H, L, L+1]
        S = S.reshape(B, H, L + 1, L)
        S = S[:, :, 1:, :]          # [B, H, L, L]
        return S
```
| Approach | Memory for relative positions | For L=2048, d=64 |
|---|---|---|
| Naive | O(L²D), full tensor | 268M floats |
| Skewing trick | O(LD), only L embeddings needed | 131K floats |
| Savings | L× reduction | 2048× less memory |
```python
import torch
import torch.nn.functional as F

def relative_attention_skew(Q, K, V, E_r):
    # Q, K, V: [L, d]
    # E_r: [L, d], relative position embeddings; row L-1-k holds offset k
    L, d = Q.shape
    # Content-to-content scores (standard attention)
    S_content = Q @ K.T  # [L, L]
    # Content-to-position scores (needs skewing)
    S_rel = Q @ E_r.T  # [L, L], but wrong indexing!
    # THE SKEWING TRICK:
    # Step 1: Pad with zero column on left
    S_rel = F.pad(S_rel, (1, 0))  # [L, L+1]
    # Step 2: Reshape to shift rows
    S_rel = S_rel.reshape(L + 1, L)  # [L+1, L]
    # Step 3: Slice off first row
    S_rel = S_rel[1:]  # [L, L], correctly indexed on and below the diagonal
    # Combine and scale
    scores = (S_content + S_rel) / (d ** 0.5)
    # Apply causal mask (also removes the leftover upper-triangle entries)
    mask = torch.triu(
        torch.ones(L, L, dtype=torch.bool, device=Q.device), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))
    # Standard softmax and value aggregation
    weights = F.softmax(scores, dim=-1)
    return weights @ V  # [L, d]
```
The Music Transformer was trained on the J.S. Bach Chorales dataset (382 four-part chorales) and the Piano-e-Competition dataset (1,573 virtuoso piano performances). The results demonstrated a clear advantage for relative attention in generating music with long-term structure.
| Model | NLL (nats, lower=better) | Long-range coherence |
|---|---|---|
| LSTM baseline | 5.67 | Poor — wanders after ~10s |
| Transformer (absolute position) | 5.52 | Medium — local coherence, weak repetition |
| Music Transformer (relative) | 5.36 | Best — themes repeat, motifs develop |
| Transformer-XL | 5.49 | Good, but requires segment-level recurrence |
The negative log-likelihood improvement from 5.52 to 5.36 might seem small, but in autoregressive models, even small NLL improvements translate to noticeably better generation quality. More importantly, the qualitative difference was dramatic.
The Music Transformer was trained on the Piano-e-Competition dataset — 1,573 performances by virtuoso pianists, totaling ~172 hours of music. Each performance was converted to the MIDI event representation, giving sequences of 1,000-3,000 events. Training used:
| Hyperparameter | Value |
|---|---|
| d_model | 256 |
| Attention heads | 8 |
| Layers | 6 |
| Head dimension | 32 (256/8) |
| FFN inner dimension | 1024 |
| Max sequence length | 2048 |
| Optimizer | Adam |
| Learning rate | Noam schedule, warmup 4000 |
| Dropout | 0.1 |
The model is relatively small by modern standards (~10M parameters), but this was sufficient because the vocabulary is tiny (388 events) and the task is well-structured. Larger models didn't significantly improve quality — the bottleneck was training data, not model capacity.
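The "Noam schedule" in the table refers to the warmup-then-decay learning rate from Vaswani et al. (2017). A minimal sketch follows, assuming the standard formula with no extra scale factor; the table only gives the schedule name and the warmup length, so anything beyond that is an assumption.

```python
def noam_lr(step, d_model=256, warmup=4000):
    """Learning rate at a given step (step >= 1) under the Noam schedule."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, f"{noam_lr(step):.2e}")
```

The rate rises linearly for the first 4000 steps, peaks at the end of warmup, and then decays with the inverse square root of the step count.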
Evaluating music generation is notoriously difficult: there's no BLEU score for music. The paper's evaluation included negative log-likelihood (NLL), a human listening study, and attention visualizations.
The human evaluation was critical. NLL measures density estimation quality, but a model with good NLL might generate boring, repetitive music (the "safe average"). The listening study confirmed that the Music Transformer's outputs were not just statistically good but actually musical — with structure, development, and emotional arc.
Human evaluators consistently preferred the Music Transformer's outputs.
What the Music Transformer actually learns to attend to. Relative attention heads develop specialized patterns: some attend to the beat (regular intervals), some to the melody (pitch-similar events), some to harmony (events within the same chord).
The paper includes attention visualizations on real performances from the Piano-e-Competition dataset. These visualizations confirm that the model learns musically meaningful patterns:
| Observation | What the model learned |
|---|---|
| Strong diagonal lines at regular intervals | Beat structure — the model attends to events at beat positions |
| Vertical stripes at specific positions | Key events (e.g., first notes of phrases) that provide global context |
| Diagonal bands of varying width | Rhythmic patterns of different note durations |
| Sparse distant attention | Theme recurrence — later events attending to matching earlier phrases |
The attention patterns reveal something fascinating: different heads specialize for different musical functions. Some heads learn a strong diagonal pattern (attending to events at regular beat intervals — discovering the meter without being told). Other heads attend to events with similar pitch content (tracking the melody). Still others attend to the most recent events (local context for harmony). This emergent specialization is possible precisely because relative position encodings allow the model to express "attend to events N steps back" as a learned pattern.
```python
# The Music Transformer architecture
import torch.nn as nn

class RelativeAttentionBlock(nn.Module):
    # Decoder block: relative self-attention followed by a feed-forward network.
    # The pre-norm layout here is an assumption; the attention module is the
    # RelativeMultiHeadAttn class defined earlier.
    def __init__(self, d_model, n_heads, max_len, d_ff=1024, dropout=0.1):
        super().__init__()
        self.attn = RelativeMultiHeadAttn(d_model, n_heads, max_len)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, h):
        h = h + self.drop(self.attn(self.ln1(h)))
        return h + self.drop(self.ffn(self.ln2(h)))

class MusicTransformer(nn.Module):
    def __init__(self, vocab=388, d_model=256, n_heads=8,
                 n_layers=6, max_len=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        # Each block owns its relative position embeddings
        # (inside its attention module)
        self.layers = nn.ModuleList([
            RelativeAttentionBlock(d_model, n_heads, max_len)
            for _ in range(n_layers)
        ])
        self.ln = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x):
        # x: [B, L], event token ids
        h = self.embed(x)  # [B, L, D]
        # NO absolute position added!
        for layer in self.layers:
            h = layer(h)
        h = self.ln(h)
        return self.head(h)  # [B, L, 388]
```
To see how the Music Transformer captures musical structure, consider a simulated 32-step musical sequence shown as a piano roll. The model generates each step by attending to previous steps (shown as attention weights below the roll).
Relative attention creates recurring patterns: the warm highlights show which past events influence the current generation step.
The three levels of musical structure that relative attention captures:
| Level | Time Scale | What Attention Captures | Example |
|---|---|---|---|
| Local | 1-8 events | Note-to-note transitions, chord voicing | C-E-G chord pattern |
| Phrase | 8-32 events | Melodic motifs, rhythmic patterns | A 4-bar melody repeating |
| Section | 32-200 events | Theme recurrence, key changes | ABA form, verse-chorus |
The LSTM can handle local structure well. The Transformer with absolute position handles local and some phrase-level structure. The Music Transformer with relative attention handles all three levels — and this is what makes its generated music sound structured rather than wandering.
Here's the mechanism: when the model learns that "a NOTE_ON event should attend strongly to the event 32 steps back" (via the Content-Position term in the attention score), it creates a natural copying mechanism. If the event 32 steps back was a C, the model is biased toward producing a C again. If the attention pattern says "attend at offset -32 and offset -64," the model creates a repeating pattern with period 32. This is exactly how musical phrases repeat — and the model discovers it without being told anything about musical structure.
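The copying behavior can be reproduced with a toy generator whose attention depends only on the relative offset. This is a deliberately simplified sketch, not the trained model: `generate_by_offset`, the `sharpness` value, and the four-note motif are all made up for illustration, and the period is 4 rather than 32 to keep the output short.

```python
import math

def generate_by_offset(primer, steps, period, sharpness=5.0):
    """Toy generator: attention depends only on the relative offset i - j."""
    seq = list(primer)
    for _ in range(steps):
        i = len(seq)                                   # position being generated
        # Position-only logits: highest for the key exactly `period` steps back.
        logits = [-sharpness * abs((i - j) - period) for j in range(i)]
        top = max(logits)
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        # Attention-weighted vote over the tokens seen so far.
        votes = {}
        for j, w in enumerate(weights):
            votes[seq[j]] = votes.get(seq[j], 0.0) + w / total
        seq.append(max(votes, key=votes.get))
    return seq

motif = [60, 64, 67, 72]                               # a four-note figure
out = generate_by_offset(motif, steps=8, period=4)
print(out)   # [60, 64, 67, 72, 60, 64, 67, 72, 60, 64, 67, 72]
```

A single offset preference is enough to make the motif repeat. Nothing in the code refers to absolute positions.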
The best way to understand the Music Transformer's advantage is to compare outputs at different time horizons:
| Model | First 15 seconds | 30-60 seconds | 60-120 seconds |
|---|---|---|---|
| LSTM | Clear rhythm, melody | Rhythm drifts, melody wanders | Formless noodling |
| Transformer (abs) | Clear rhythm, melody | Rhythm steady, locally coherent | No repetition, new material only |
| Music Transformer | Clear rhythm, melody | Rhythm steady, themes develop | Themes return, feel composed |
The difference at the 60-120 second mark is the most telling. LSTMs have completely lost the thread. Absolute-position Transformers produce locally coherent music but without structural backbone. Only the Music Transformer with relative attention generates music that sounds like it has a plan — because relative attention enables the "copy from N steps back" behavior that creates repetition and development.
During generation, the model's output logits are divided by a temperature parameter before softmax. Lower temperature (0.7-0.9) sharpens the distribution and makes the model more deterministic, favoring the most likely events and producing safer, more conventional music. Higher temperature (above 1.0, for example 1.1-1.3) flattens the distribution and adds randomness, producing more creative but potentially less coherent output. The paper found that temperature around 1.0 gave the best balance of coherence and creativity for piano performances.
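The effect of temperature is easiest to see on a small distribution. The logits below are made-up scores for four candidate events, and the entropy is printed only as a rough measure of how spread out the sampling distribution is.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for four candidate events
for t in (0.7, 1.0, 1.3):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

As the temperature rises, probability mass moves from the top candidate to the others and the entropy increases.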
```python
# Generating music with the Music Transformer
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, primer, length, temperature=1.0, max_len=2048):
    # primer: initial MIDI events (e.g., first bar)
    model.eval()
    generated = list(primer)
    for _ in range(length):
        # Keep only the most recent max_len events so the input fits the model
        x = torch.tensor([generated[-max_len:]])  # [1, T]
        logits = model(x)[0, -1]  # [388], last position
        # Temperature scaling
        probs = F.softmax(logits / temperature, dim=-1)
        # Sample from distribution
        next_event = torch.multinomial(probs, 1).item()
        generated.append(next_event)
    return generated  # List of MIDI event tokens
```
The Music Transformer's contributions extend far beyond music generation. Relative position representations became a standard component of modern Transformers, and music generation evolved into a broader field of audio AI.
| Paper | Year | Position Approach | Used in |
|---|---|---|---|
| Original Transformer | 2017 | Sinusoidal absolute | Translation (BERT later used learned absolute embeddings) |
| Shaw et al. | 2018 | Learned relative (clipped) | Translation improvements |
| Music Transformer | 2019 | Relative + skewing trick | Music generation |
| Transformer-XL | 2019 | Relative + segment recurrence | Language modeling (XLNet) |
| RoPE (Su et al.) | 2021 | Rotary position embedding | Llama, GPT-NeoX, modern LLMs |
| ALiBi (Press et al.) | 2022 | Linear bias (no learned params) | BLOOM, MPT |
| System | Year | Modality | Key Innovation |
|---|---|---|---|
| Music Transformer | 2019 | Symbolic (MIDI) | Relative attention for structure |
| Jukebox (OpenAI) | 2020 | Raw audio (waveform) | VQ-VAE + Transformer on audio tokens |
| MusicLM (Google) | 2023 | Raw audio | Hierarchical audio tokens, text-conditioned |
| MusicGen (Meta) | 2023 | Raw audio | Codebook interleaving, efficient generation |
| Stable Audio (Stability) | 2023 | Raw audio | Latent diffusion for long-form audio |
| Suno v3 | 2024 | Raw audio + vocals | Full song generation with lyrics |
The progression from symbolic MIDI generation to full audio generation.
The biggest shift: moving from symbolic (MIDI) to raw audio. The Music Transformer generates MIDI events — a score, like sheet music. To hear it, you need a synthesizer. Modern models like MusicLM and MusicGen generate raw audio directly — including timbre, dynamics, and even vocals. This required new tokenization schemes (VQ-VAE codebooks instead of MIDI events) and much more compute.
The progression in audio tokenization mirrors the progression from character-level to subword-level models in NLP:
| Approach | Token Type | Tokens per second | Quality |
|---|---|---|---|
| Music Transformer (2019) | MIDI events | ~10-50 | Score only (need synth) |
| Jukebox VQ-VAE (2020) | Discrete audio codes | ~340 | Raw audio (44.1kHz) |
| SoundStream/EnCodec (2022) | Multi-codebook codes | ~50 per codebook | High quality at low bitrate |
| w2v-BERT tokens (2023) | Semantic audio tokens | ~25 | Semantic, not acoustic |
The key innovation: neural audio codecs like SoundStream (Google) and EnCodec (Meta) that compress audio into discrete tokens at ~50 tokens/second — compact enough for a Transformer to model. MusicLM and MusicGen both build on these codecs, using the Transformer to model sequences of audio tokens rather than MIDI events.
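Token rate translates directly into sequence length, which is what the Transformer has to fit in its context. The rough comparison below uses a 30-second clip and the rates from the table above. The codebook count of 4 is an assumed value for illustration, and `tokens_needed` is a hypothetical helper.

```python
def tokens_needed(seconds, tokens_per_second, codebooks=1):
    """Total tokens a Transformer must model for a clip of the given length."""
    return round(seconds * tokens_per_second * codebooks)

clip = 30  # seconds
print(tokens_needed(clip, 50))                 # MIDI events at the upper ~50/s: 1500
print(tokens_needed(clip, 340))                # Jukebox-style codes: 10200
print(tokens_needed(clip, 50, codebooks=4))    # codec tokens, 4 codebooks assumed: 6000
```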
The Music Transformer had notable limitations:
| Limitation | Impact | How later work addressed it |
|---|---|---|
| Piano only | Can't generate orchestral or vocal music | Jukebox: multi-instrument raw audio |
| MIDI output | Need synthesizer to hear result | MusicGen: direct audio generation |
| ~2 min max | Context window limits piece length | Segment-level recurrence, infinite generation |
| No conditioning | Can't specify genre, mood, style | MusicLM: text-conditioned generation |
| No evaluation standard | Hard to compare across papers | FAD, CLAP scores for audio quality |
```python
# The progression of music AI architectures

# 2019: Music Transformer
#   Input:  MIDI events (388 vocab, ~2000 tokens/piece)
#   Model:  Transformer decoder + relative attention
#   Output: MIDI events → render with synthesizer

# 2020: Jukebox
#   Input:  raw audio → VQ-VAE → discrete codes
#   Model:  Sparse Transformer (3 levels of hierarchy)
#   Output: discrete codes → VQ-VAE decoder → raw audio

# 2023: MusicGen
#   Input:  text description + (optional) melody
#   Model:  Transformer decoder over interleaved codebooks
#   Output: EnCodec tokens → EnCodec decoder → raw audio

# Key: the Transformer stayed constant; only the
# tokenization evolved (MIDI → VQ-VAE → EnCodec)
```
But the core architecture remained: the Transformer with some form of relative position encoding. The Music Transformer proved that attention could capture long-range musical structure. Everything since has been about scaling that insight to richer audio representations.
The most widely used relative position encoding today is Rotary Position Embedding (RoPE, Su et al., 2021). RoPE encodes position by rotating the query and key vectors, treating pairs of dimensions as points in the complex plane. The dot product between a rotated query and a rotated key then depends on position only through the offset (i-j):

$$(R_i q_i)^\top (R_j k_j) = q_i^\top R_i^\top R_j\, k_j = q_i^\top R_{j-i}\, k_j$$
Where Rt is a rotation matrix that rotates by angle t×θ. The key insight: the rotation makes the dot product depend only on the difference (i-j), achieving the same relative position effect as the Music Transformer but more elegantly — no skewing trick needed, no separate position embedding table.
| Property | Music Transformer | RoPE |
|---|---|---|
| Position info | Additive (Er[i-j] added to score) | Multiplicative (rotation of Q/K) |
| Extra parameters | L × d learned embeddings | None (uses fixed rotations) |
| Extrapolation | Limited to trained lengths | Limited without interpolation or scaling methods |
| Implementation | Skewing trick needed | Simple complex multiply |
| Used in | Music generation | Llama, Gemma, Mistral, GPT-NeoX |
```python
# RoPE: the modern descendant of relative attention
import torch

def apply_rope(q, k, positions, d_model, theta=10000):
    # q, k: [B, T, d] with d == d_model (even); positions: [T]
    # Compute rotation frequencies: [d/2]
    freqs = 1.0 / (theta ** (torch.arange(0, d_model, 2) / d_model))
    # Build rotation angles: [T, d/2]
    angles = positions[:, None] * freqs[None, :]
    # Apply rotation (complex multiply)
    cos_a, sin_a = torch.cos(angles), torch.sin(angles)
    # Rotate pairs of dimensions
    q1, q2 = q[..., ::2], q[..., 1::2]
    k1, k2 = k[..., ::2], k[..., 1::2]
    # Outputs are laid out as [first components, second components];
    # q and k share the layout, so their dot product is unaffected
    q_rot = torch.cat([q1*cos_a - q2*sin_a, q1*sin_a + q2*cos_a], dim=-1)
    k_rot = torch.cat([k1*cos_a - k2*sin_a, k1*sin_a + k2*cos_a], dim=-1)
    return q_rot, k_rot

# Now q_rot[i] . k_rot[j] depends on position only through (i-j)
```
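The relative-position property can be verified without any tensor library. This is a small pure-Python sketch of the same rotation applied to one vector at a time; `rope` and `dot` are hypothetical helpers, and the vectors are arbitrary.

```python
import math

def rope(vec, pos, theta=10000.0):
    """Rotate consecutive pairs (vec[0], vec[1]), (vec[2], vec[3]), ... by pos * freq."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        angle = pos * theta ** (-k / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[k], vec[k + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

near = dot(rope(q, 5), rope(k, 2))        # offset 3, early in the sequence
far = dot(rope(q, 105), rope(k, 102))     # offset 3, a hundred steps later
other = dot(rope(q, 5), rope(k, 1))       # offset 4

print(math.isclose(near, far, rel_tol=1e-9, abs_tol=1e-9))   # True
print(math.isclose(near, other))                               # False
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset changes it.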
Despite the enormous progress since 2019, several challenges from the Music Transformer era remain open:
| Challenge | Status (2025) |
|---|---|
| True musical form (sonata, fugue) | No model reliably generates multi-minute form |
| Emotional arc (tension/release) | Models capture mood but not narrative |
| Controllable structure | Text controls style/mood but not form |
| Multi-instrument orchestration | Raw audio models don't understand parts |
| Musical understanding | Models generate plausible sounds but may not "understand" harmony |
The Music Transformer showed that relative attention could capture phrase-level repetition (~30 seconds). Getting to movement-level structure (~5 minutes) and piece-level form (~20 minutes) remains an open challenge. The attention mechanism can theoretically handle these timescales if the context window is long enough, but training data with clear large-scale structure is scarce, and evaluation is subjective.
Perhaps the most important open question is: does generating coherent musical form require understanding music (in some computational sense), or can it emerge purely from pattern matching on enough training data? The Music Transformer suggests the latter — its relative attention heads discovered beats, phrases, and repetition structure purely from statistical patterns in MIDI data, with no explicit musical knowledge. If this scaling hypothesis holds, then sufficiently large models trained on enough music data might eventually generate pieces with genuine formal coherence. Whether this constitutes "understanding" music is a philosophical question the paper wisely leaves open.
```python
# Summary: Music Transformer contributions

# 1. Relative attention for sequences
#    - Replaces absolute PE with relative offset embeddings
#    - Attention(i,j) depends on (i-j), not on i and j separately
#    - Enables pattern recognition invariant to position

# 2. The skewing trick for efficient computation
#    - Reduces memory from O(L²D) to O(LD)
#    - No extra multiply-adds (just pad + reshape)
#    - Makes relative attention practical for long sequences

# 3. Event-based MIDI representation
#    - 388-token vocabulary captures full piano performance
#    - 1000x compression vs piano roll
#    - Includes dynamics (velocity) and timing nuance

# 4. Proof that attention captures musical structure
#    - Heads discover beats, phrases, harmony automatically
#    - Long-range coherence up to 2 minutes
#    - Preferred by human listeners over LSTM 70% of the time

# 5. Influence on later position encodings
#    - Transformer-XL (2019): relative + segment recurrence
#    - RoPE (2021): rotary encoding, used in Llama/Gemma
#    - ALiBi (2022): linear bias, no learned params
#    - Most modern LLMs use some relative position scheme

# The Music Transformer was a small model (10M params)
# on a small dataset (172 hours of piano).
# But its ideas (relative attention, the skewing trick,
# event-based tokenization) influenced much of what
# came after. Sometimes the most important papers are not
# the biggest, but the ones that ask the right question.
```