Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew Dai, Matthew Hoffman (Google Brain) — ICLR 2019

Music Transformer

Generating music with long-term structure by introducing relative position representations into the Transformer — enabling it to capture the patterns of repetition, transposition, and development that make music coherent over minutes, not just seconds.

Prerequisites: Self-attention + Positional encoding basics + What MIDI is (we'll cover this). That's it.
8 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: Why Music is Hard

Listen to any piece of classical piano music — say, a Chopin Ballade. Within the first minute, a theme is introduced: a melody with a particular rhythm and contour. Over the next eight minutes, that theme returns — sometimes in the same key, sometimes transposed, sometimes varied but recognizable. The ending recalls the opening. Phrases echo phrases heard minutes earlier. This is long-range structure, and it's what makes music feel like music rather than random notes.

Now try to generate music with an LSTM. The LSTM processes events one at a time, carrying information through its hidden state. After 100 events (~10 seconds of music), the hidden state has been overwritten and rewritten so many times that information from the beginning is practically gone. The generated music might sound locally coherent — nice chords, reasonable rhythms — but it wanders aimlessly, never returning to a theme, never building tension, never resolving.

The Transformer can see every previous event directly through attention. But the original Transformer uses absolute positional encodings: each position gets a fixed vector that encodes "I am at position 42" or "I am at position 317." This creates a problem for music: a melody at position 10-20 and the same melody repeated at position 200-210 look completely different to absolute position encodings. The model has to independently learn that the same musical pattern can occur at different absolute positions.

The Long-Range Structure Problem

A piano roll showing a musical phrase (teal) that repeats later (warm). LSTMs lose the connection after ~100 events. Standard Transformers see both occurrences but don't recognize them as "the same pattern shifted in time." Relative attention solves this. Click "Show Repetition" to highlight the structure.

The core problem: Music has two properties that make it uniquely challenging for sequence models. (1) Long-range dependencies: a theme introduced in bar 1 must be recalled in bar 64 — that's hundreds of events later. (2) Relative structure: a melody transposed up a fifth is still "the same melody," and a rhythm shifted by a beat is still "the same rhythm." Models need to capture relative relationships between events, not just absolute positions.

Prior work on music generation

Before the Music Transformer, the best neural music generators were RNN-based:

| Model | Year | Architecture | Limitation |
| --- | --- | --- | --- |
| DeepBach | 2017 | RNN + Gibbs sampling | Limited to Bach chorale style |
| Performance RNN | 2017 | LSTM | Wanders after 10-15 seconds |
| MusicVAE | 2018 | Hierarchical VAE + LSTM | Fixed-length, interpolation focus |
| Coconet | 2017 | CNN (non-autoregressive) | Short pieces only (16 bars) |

All these models struggled with the same fundamental issue: maintaining coherent structure beyond about 10 seconds of music. The hidden state bottleneck was the universal constraint.

Huang et al. proposed the Music Transformer: a Transformer with relative position representations instead of absolute positional encodings. Instead of "I am at position 42," each attention computation encodes "the key is 5 positions before the query." This lets the model learn that "a note followed by the same note 4 steps later" is a consistent pattern, regardless of where in the sequence it occurs.

The result: piano performances with coherent long-range structure — themes that repeat, develop, and resolve over sequences of 2,000+ events. For the first time, a neural network could generate music that sounded like it had a plan.

What "structure" means in music

Musical structure operates at multiple timescales. Understanding these levels helps you appreciate what the Music Transformer actually achieved:

| Level | Time Scale | Musical Term | Example |
| --- | --- | --- | --- |
| Note | ~100ms | Intervals, ornaments | A trill, a grace note |
| Beat | ~500ms | Rhythm, meter | Waltz (3/4 time) |
| Phrase | ~4s | Melody, motif | The first 4 bars of Für Elise |
| Section | ~30s | Verse, chorus, development | The A section returns after B |
| Form | ~3min | Sonata, ABA, rondo | Exposition-Development-Recap |

LSTMs can handle note and beat-level structure — they produce locally coherent rhythms and harmonies. But phrase-level and beyond? The hidden state has been overwritten too many times. The Music Transformer, with relative attention, captures structure up to the section level — phrases repeat and develop in ways that sound intentional.

What are the two properties of music that make standard Transformers with absolute positional encodings inadequate?

Chapter 1: MIDI Event Representation

Before we can feed music to a Transformer, we need to represent it as a sequence of discrete tokens. The Music Transformer uses an event-based MIDI representation developed by Oore et al. (2018). Instead of a piano roll (a 2D grid of time vs pitch), the music is encoded as a 1D sequence of events, like text.

The vocabulary consists of 388 event types across four categories:

| Event Type | Count | Description |
| --- | --- | --- |
| NOTE_ON | 128 | Start playing note (pitch 0-127, where 60 = middle C) |
| NOTE_OFF | 128 | Stop playing note (same pitch range) |
| TIME_SHIFT | 100 | Advance time in 10ms increments (10ms to 1000ms) |
| VELOCITY | 32 | Set velocity (volume) for subsequent notes (0-127, quantized to 32 bins) |

A simple example: playing middle C for half a second at medium volume:

VELOCITY_16: set volume to medium (bin 16 of 32)
NOTE_ON_60: start playing middle C (MIDI note 60)
TIME_SHIFT_50: wait 500ms (50 × 10ms)
NOTE_OFF_60: stop playing middle C

A polyphonic texture (multiple notes at once) is a key advantage of this event representation. A chord is represented by consecutive NOTE_ON events with no TIME_SHIFT between them — the model learns that consecutive NOTE_ON events without a TIME_SHIFT form a chord:

[VELOCITY_20, NOTE_ON_60, NOTE_ON_64, NOTE_ON_67, TIME_SHIFT_100, NOTE_OFF_60, NOTE_OFF_64, NOTE_OFF_67]

This encodes a C major chord (C-E-G, MIDI 60-64-67) held for 1 second at a moderately loud velocity (bin 20 of 32).

MIDI Event Sequence Visualizer

A piano roll (top) and its MIDI event sequence (bottom). Events are color-coded: teal = NOTE_ON, blue = NOTE_OFF, warm = TIME_SHIFT, purple = VELOCITY. Click "Play Sequence" to step through the events and watch them populate the piano roll.

Why events instead of a piano roll? A piano roll of a 4-minute piece at 10ms resolution would be 24,000 timesteps × 128 pitches = 3 million entries, mostly zeros. The event sequence for the same piece is ~2,000-3,000 tokens — a 1000x compression. Events encode only what changes, skipping the vast silent space. This is why the Transformer can handle minutes of music: the sequence is compact.
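The arithmetic in this comparison can be checked with a short sketch. The event count of 3,000 is the upper end of the range quoted above, and the variable names are purely illustrative.

```python
# Size of a dense piano roll vs. an event sequence for a 4-minute piece.
DURATION_S = 4 * 60   # piece length in seconds
STEP_MS = 10          # piano-roll time resolution
N_PITCHES = 128       # MIDI pitch range
N_EVENTS = 3000       # assumed: upper end of the event-count range quoted above

timesteps = DURATION_S * 1000 // STEP_MS       # columns in the piano roll
piano_roll_entries = timesteps * N_PITCHES     # cells in the dense grid
compression = piano_roll_entries / N_EVENTS    # roughly three orders of magnitude

print(timesteps, piano_roll_entries, round(compression))
```

With a shorter piece or a denser performance the ratio changes, but it stays in the same order of magnitude.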

Why this representation works for the Transformer

The event representation has several properties that make it ideal for Transformer modeling:

Discrete tokens: 388 event types fit naturally as a classification problem, the same as word prediction in NLP.
Compact sequences: ~2000 events per piece vs ~3M piano roll entries. Fits in the Transformer context window.
Temporal flexibility: variable timing through TIME_SHIFT events. No fixed grid, so rubato and tempo changes are natural.
Polyphonic: multiple simultaneous notes via consecutive NOTE_ON events. Chords, counterpoint, and arpeggios are all representable.

Handling expressive performance

The VELOCITY events capture dynamics — how loudly or softly notes are played. A skilled pianist varies velocity constantly: a melody note might be played forte (loud, VELOCITY_28) while the accompanying chord is piano (soft, VELOCITY_8). The 32 velocity bins are coarse but sufficient to capture the basic shape of a performance's dynamics. The ordering convention is that a VELOCITY event sets the velocity for all subsequent NOTE_ON events until the next VELOCITY event appears.

TIME_SHIFT events with 10ms granularity capture timing nuance. A note slightly before the beat (anticipation) vs slightly after (laid-back feel) is a difference of 20-30ms, which is just 2-3 steps of the 10ms grid (a single TIME_SHIFT_2 or TIME_SHIFT_3 event). This granularity is enough for the model to learn micro-timing patterns that distinguish a mechanical MIDI playback from an expressive human performance.
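To make the VELOCITY and TIME_SHIFT conventions concrete, here is a minimal decoding sketch. It assumes the token layout used in this chapter (NOTE_ON 0-127, NOTE_OFF 128-255, TIME_SHIFT 256-355, VELOCITY 356-387); the function name `decode_events` and the output tuple format are illustrative choices, not something taken from the paper.

```python
# Decode event token ids into (pitch, start_ms, end_ms, velocity_bin) notes.

def decode_events(tokens):
    now_ms = 0          # current time; only TIME_SHIFT advances it
    velocity_bin = 0    # last VELOCITY seen; applies to all later NOTE_ONs
    active = {}         # pitch -> (start_ms, velocity_bin) for sounding notes
    notes = []
    for tok in tokens:
        if tok < 128:                        # NOTE_ON
            active[tok] = (now_ms, velocity_bin)
        elif tok < 256:                      # NOTE_OFF
            pitch = tok - 128
            if pitch in active:              # skip a NOTE_OFF with no matching NOTE_ON
                start_ms, vel = active.pop(pitch)
                notes.append((pitch, start_ms, now_ms, vel))
        elif tok < 356:                      # TIME_SHIFT_k advances k * 10 ms
            now_ms += (tok - 256 + 1) * 10
        else:                                # VELOCITY
            velocity_bin = tok - 356
    return notes

# A loud melody note that anticipates a soft accompaniment note by 20 ms.
tokens = [
    356 + 28, 67,         # VELOCITY_28, NOTE_ON_67
    256 + 1,              # TIME_SHIFT_2 (20 ms)
    356 + 8, 48,          # VELOCITY_8, NOTE_ON_48
    256 + 49,             # TIME_SHIFT_50 (500 ms)
    128 + 67, 128 + 48,   # NOTE_OFF_67, NOTE_OFF_48
]
notes = decode_events(tokens)
print(notes)
```

The 20ms anticipation costs a single token, and the velocity change between the two notes costs one more.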

The vocabulary is surprisingly small: Only 388 tokens to represent the full range of piano performance. Compare to language models with 50K-100K tokens. The compact vocabulary means the model's softmax output is cheap (388-way vs 50K-way), and the embedding table is small (388 × d_model). Most of the model's capacity goes into learning musical patterns, not memorizing a huge vocabulary.
python
# MIDI event vocabulary
VOCAB_SIZE = 388
# NOTE_ON:     0-127  (128 events)
# NOTE_OFF:  128-255  (128 events)
# TIME_SHIFT: 256-355 (100 events, 10ms to 1s)
# VELOCITY:  356-387  (32 events)

def encode_note_on(pitch):
    return pitch  # 0-127

def encode_note_off(pitch):
    return 128 + pitch  # 128-255

def encode_time_shift(ms):
    # Quantize to 10ms bins, max 1000ms
    bins = min(100, max(1, round(ms / 10)))
    return 256 + bins - 1  # 256-355

def encode_velocity(vel):
    # Quantize 0-127 to 32 bins
    bin_idx = min(31, vel // 4)
    return 356 + bin_idx  # 356-387

# C major chord for 1s at a moderately loud velocity (80 -> bin 20):
sequence = [
    encode_velocity(80),      # VELOCITY_20
    encode_note_on(60),       # NOTE_ON_60 (C4)
    encode_note_on(64),       # NOTE_ON_64 (E4)
    encode_note_on(67),       # NOTE_ON_67 (G4)
    encode_time_shift(1000),  # TIME_SHIFT_100
    encode_note_off(60),      # NOTE_OFF_60
    encode_note_off(64),      # NOTE_OFF_64
    encode_note_off(67),      # NOTE_OFF_67
]
Why does the Music Transformer use an event-based MIDI representation instead of a piano roll?

Chapter 2: Absolute vs Relative Position

The original Transformer (Vaswani et al., 2017) uses absolute positional encodings: a fixed vector is added to each token embedding that encodes "I am at position t." Sinusoidal encodings use:

$$PE(t, 2i) = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad PE(t, 2i+1) = \cos\!\left(\frac{t}{10000^{2i/d}}\right)$$

This tells the model "this token is at absolute position 42." But consider a musical sequence: a C-E-G chord at positions 10-12, and the same C-E-G chord at positions 200-202. With absolute encodings:

| Event | Absolute position | What the model sees |
| --- | --- | --- |
| NOTE_ON_60 | 10 | embedding(60) + PE(10) |
| NOTE_ON_64 | 11 | embedding(64) + PE(11) |
| NOTE_ON_60 | 200 | embedding(60) + PE(200) |
| NOTE_ON_64 | 201 | embedding(64) + PE(201) |

The C at position 10 and the C at position 200 have the same content embedding but completely different positional vectors. The model must learn, separately for every absolute position at which the pattern can occur, that "a C followed by an E" is a major third. With 2000+ possible positions, this is an enormous learning burden.
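A small NumPy sketch illustrates the point. The sinusoidal formula is the one given above; the random vector standing in for the learned NOTE_ON_60 embedding is an assumption made only so the example runs.

```python
import numpy as np

def sinusoidal_pe(t, d_model):
    # PE(t, 2i) = sin(t / 10000^(2i/d)), PE(t, 2i+1) = cos(t / 10000^(2i/d))
    i = np.arange(d_model // 2)
    angles = t / (10000 ** (2 * i / d_model))
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

d_model = 256
rng = np.random.default_rng(0)
note_on_60 = rng.normal(size=d_model)   # stand-in for a learned token embedding

x_early = note_on_60 + sinusoidal_pe(10, d_model)    # the C at position 10
x_late = note_on_60 + sinusoidal_pe(200, d_model)    # the same C at position 200

# Same token, different model inputs: the gap is exactly PE(10) - PE(200).
gap = np.linalg.norm(x_early - x_late)
print(gap)
```

The two inputs share their content component entirely, yet the model receives two different vectors and has to learn on its own that they mean the same thing.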

Relative position representations (Shaw et al., 2018) replace "I am at position 42" with "the key is 5 positions before me." The attention computation now depends on the distance between query and key, not their absolute locations:

Absolute: attention depends on $PE(t_q)$ and $PE(t_k)$ independently
Relative: attention depends on $(t_q - t_k)$, the offset between them

Absolute vs Relative Position Encoding

Two occurrences of the same musical pattern (C-E-G) at different positions. With absolute encoding (left), they look completely different. With relative encoding (right), the internal structure is identical — only the relative distances matter. Toggle between them.

Why relative position is perfect for music: Musical patterns are defined by intervals (relative pitch distances) and rhythms (relative time distances), not by absolute positions. A melody starting on C at beat 1 and the same melody starting on C at beat 33 should produce the same attention patterns. Relative encoding makes this automatic — the model learns "these notes are 2 positions apart" once, and it works everywhere in the sequence.

The benefit is especially powerful for transposition. A melody in C major and the same melody in G major are exactly the same pattern, shifted by 7 semitones. With relative pitch encoding, the model doesn't need to relearn the pattern for every possible key — it learns the interval structure once.
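The shift invariance for sequence positions can be seen directly by computing query-minus-key offsets for the two occurrences of the chord used earlier (positions 10-12 and 200-202). This is only an illustration of the indexing; `relative_offsets` is a hypothetical helper name.

```python
import numpy as np

def relative_offsets(positions):
    # offsets[a, b] = position of query a minus position of key b
    p = np.asarray(positions)
    return p[:, None] - p[None, :]

first = relative_offsets([10, 11, 12])        # chord at positions 10-12
repeat = relative_offsets([200, 201, 202])    # same chord at positions 200-202

print(first)
print((first == repeat).all())
```

Any attention term that depends only on these offsets is therefore identical for both occurrences, no matter how far apart they are.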

Mathematical formalization

Let's be precise. In the standard Transformer, the attention logit between query at position i and key at position j is:

$$e_{ij}^{\text{abs}} = (x_i + PE_i)\, W^Q \left((x_j + PE_j)\, W^K\right)^\top$$

Expanding this product gives four terms:

$$e_{ij}^{\text{abs}} = x_i W^Q (x_j W^K)^\top + x_i W^Q (PE_j W^K)^\top + PE_i W^Q (x_j W^K)^\top + PE_i W^Q (PE_j W^K)^\top$$

The last three terms all depend on absolute positions. In the relative formulation, these are replaced by terms that depend only on the offset (i-j). The substitution $PE_j \to E_r[i-j]$ converts absolute position dependence to relative.
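The expansion can be sanity-checked numerically. The sketch below uses random vectors and matrices as stand-ins for the content embeddings, positional encodings, and projection weights; all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
x_i, x_j = rng.normal(size=d_model), rng.normal(size=d_model)      # content
pe_i, pe_j = rng.normal(size=d_model), rng.normal(size=d_model)    # absolute positions
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))

# Left-hand side: the logit computed from position-augmented inputs.
full = ((x_i + pe_i) @ W_q) @ ((x_j + pe_j) @ W_k)

# Right-hand side: the four terms of the expansion.
terms = {
    "content-content":   (x_i @ W_q) @ (x_j @ W_k),
    "content-position":  (x_i @ W_q) @ (pe_j @ W_k),
    "position-content":  (pe_i @ W_q) @ (x_j @ W_k),
    "position-position": (pe_i @ W_q) @ (pe_j @ W_k),
}
total = sum(terms.values())
print(full, total)
```

Only the first term is free of absolute position, which is why the other three are the ones that get rewritten in the relative formulation.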

python
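# Absolute vs relative attention (single head, single sequence)
import torch
import torch.nn.functional as F

# Absolute: position info added to embeddings
def absolute_attention(X, W_q, W_k, W_v, PE):
    # X, PE: [L, d_model]; PE[t] encodes "I am at position t"
    X = X + PE
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / (d ** 0.5)
    return F.softmax(scores, dim=-1) @ V

# Relative: position info injected into attention scores
def relative_attention(X, W_q, W_k, W_v, E_r):
    # E_r: [L, d]; E_r[k] encodes "key is k steps before the query"
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    L, d = Q.shape
    # Content-based scores (same as standard)
    content_scores = Q @ K.T
    # Position-based scores: entry (i, j) must use E_r[i - j]
    pos = torch.arange(L)
    offsets = (pos[:, None] - pos[None, :]).clamp(min=0)  # future keys are masked below
    position_scores = (Q @ E_r.T).gather(1, offsets)
    scores = (content_scores + position_scores) / (d ** 0.5)
    # Causal mask: a query never attends to later positions
    future = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float('-inf'))
    return F.softmax(scores, dim=-1) @ V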
Why do absolute positional encodings create an unnecessary learning burden for music generation?

Chapter 3: The Relative Attention Mechanism

Let's formalize relative attention. In standard attention, the score between query at position i and key at position j is:

$$e_{ij} = q_i^\top k_j$$

Shaw et al. (2018) extended this by adding a learned relative position embedding $a_{ij}$ that depends on the distance (i - j):

$$e_{ij} = q_i^\top k_j + q_i^\top a_{ij}$$

where $a_{ij} = E_r[\mathrm{clip}(i-j, -K, K)]$ is a learned d-dimensional embedding for the relative offset (i-j), clipped to a maximum distance K. This two-term form is the one the Music Transformer uses. A closely related decomposition from Transformer-XL (Dai et al., 2019) expands the attention score into four terms, which makes the separate roles of content and position explicit:

$$e_{ij} = x_i W^Q (x_j W^K)^\top + x_i W^Q (E_r[i-j])^\top + u\,(x_j W^K)^\top + v\,(E_r[i-j])^\top$$

Let's understand each term:

| Term | Name | Depends on | What it captures |
| --- | --- | --- | --- |
| $x_i W^Q (x_j W^K)^\top$ | Content-Content | Content at i and j | "These two notes are related" (same as standard attention) |
| $x_i W^Q (E_r[i-j])^\top$ | Content-Position | Content at i, offset (i-j) | "This query content likes keys that are 3 steps back" |
| $u\,(x_j W^K)^\top$ | Global content bias | Content at j | "This key content is generally important" (content saliency) |
| $v\,(E_r[i-j])^\top$ | Global position bias | Offset (i-j) | "Keys 1 step back are generally important" (recency bias) |

The vectors u and v are learned global bias vectors (shared across all queries), replacing the absolute position encoding of the query. Er is a lookup table of learned embeddings indexed by the relative distance (i-j).
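As a rough illustration, the four terms can be computed with explicit indexing for a short causal sequence. Everything here is a random stand-in (inputs, projections, `E_r`, `u`, `v`), and `E_r[k]` is taken to mean "the key is k steps back"; this is a sketch of the formula, not an efficient implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_model, d_k = 5, 8, 4
X = rng.normal(size=(L, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
E_r = rng.normal(size=(L, d_k))   # E_r[k]: embedding for "key is k steps back"
u = rng.normal(size=d_k)          # global content bias
v = rng.normal(size=d_k)          # global position bias

Q, K = X @ W_q, X @ W_k

def score(i, j):
    # Four-term logit for query i and an earlier-or-equal key j (j <= i)
    rel = E_r[i - j]
    return Q[i] @ K[j] + Q[i] @ rel + u @ K[j] + v @ rel

scores = np.full((L, L), -np.inf)   # -inf marks masked future keys
for i in range(L):
    for j in range(i + 1):
        scores[i, j] = score(i, j)

# The same logit, grouped as (q + u)·k + (q + v)·E_r[i-j]
grouped = (Q[3] + u) @ K[1] + (Q[3] + v) @ E_r[2]
print(scores[3, 1], grouped)
```

The grouped form is convenient in practice because it needs only two matrix products: one against the keys and one against the relative embeddings.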

Four Components of Relative Attention

The attention score between query i and key j is the sum of four terms. Toggle each term on/off to see its contribution to the total attention pattern. The heatmap shows a 6-position sequence — brighter = higher attention score.

The Content-Position term is the key innovation for music. It lets the model learn that "when the current event is a NOTE_ON, pay strong attention to events a fixed number of positions back, for example the events one beat earlier." This captures rhythmic patterns: the model learns to attend at musically meaningful intervals (beats, bars) without being told what those intervals are.

Why four terms, not two?

You might wonder: why not just replace $Q K^\top$ with $(Q + \text{position})(K + \text{position})^\top$? The decomposition into four terms happens naturally from this expansion, but the key insight is that the position-dependent terms should use different representations than the content-dependent terms. In the full formulation:

Content keys: $x_j W^K$ (what does position j contain?)
Position embeddings: $E_r[i-j]$ (how far away is position j?)

These serve different roles: content keys encode what is at a position, while position embeddings encode where it is relative to the query. Separating them lets the model independently learn content-based and position-based attention patterns. A head might learn to attend to "the closest NOTE_ON event" (position-based) regardless of which note it is, or to "any occurrence of middle C" (content-based) regardless of where it appears.

The memory cost problem

Here's the catch: a naive implementation of relative attention materializes an intermediate L × L × d tensor of relative position embeddings, where L is the sequence length and d is the head dimension. For a 2048-token music sequence with d = 64:

Naive storage: $L^2 \times d = 2048^2 \times 64 \approx 268\text{M}$ entries per head

This is $O(L^2 D)$ memory, much worse than the $O(L^2)$ of standard attention. With 8 heads, that's over 2 billion entries just for the relative position embeddings. This made the approach impractical for long sequences... until the skewing trick.
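The numbers above follow from simple arithmetic, sketched here. The float32 byte count is an added assumption to give a sense of scale.

```python
L, d, n_heads = 2048, 64, 8

naive_per_head = L * L * d                     # gathered L x L x d embedding tensor
naive_all_heads = naive_per_head * n_heads     # same tensor for every head
attention_logits = L * L                       # ordinary L x L score matrix, for comparison

bytes_per_head_fp32 = naive_per_head * 4       # assuming 4-byte floats

print(naive_per_head, naive_all_heads, attention_logits, bytes_per_head_fp32)
```

At 4 bytes per entry, the naive tensor takes 1 GiB per head for a single sequence, before any batching.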

What does the "Content-Position" term in relative attention capture that standard attention cannot?

Chapter 4: The Skewing Trick

This is the key technical contribution of the Music Transformer paper. The relative term needs an L × L matrix whose entry (i, j) is the query $q_i$ dotted with the embedding for offset (i-j), a different embedding for each relative offset. The naive approach first gathers those embeddings into an L × L × D tensor, which requires $O(L^2 D)$ memory.

Huang et al. observed that this matrix has a Toeplitz-like structure: the embedding used at entry (i, j) depends only on (i-j), so every entry on the same diagonal uses the same embedding (the values still differ from row to row, because each row has its own query). They exploit this structure with an elegant reshaping trick.

Step 1: Compute Q ErT efficiently

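Instead of creating L separate embeddings for each query, we use the fact that the relative offset (i-j) ranges from 0 to L-1 (in the causal case). We only need L unique embeddings, stored in $E_r$ of shape [L, d]. As in the paper, $E_r$ is ordered from the farthest offset to the nearest: row L-1 holds the embedding for offset 0 (the query's own position) and row 0 holds the embedding for offset L-1.

Compute the product $S_{rel} = Q E_r^\top$, which has shape [L, L]:

$$S_{rel}[i, k] = q_i^\top E_r[k], \qquad k = 0, \dots, L-1$$

But column k of this product corresponds to row k of $E_r$, not to key position k. We need entry (i, j) to use the embedding for offset (i-j), which is $E_r[L-1-(i-j)]$, so the columns have to be rearranged row by row.

Step 2: Skew the matrix

The trick: pad $S_{rel}$ with one column of zeros on the left, reshape it from [L, L+1] to [L+1, L], then slice off the first row. In the resulting [L, L] matrix, every causal entry (j ≤ i) equals $q_i^\top E_r[L-1-(i-j)]$, the query dotted with the embedding for offset (i-j). Entries above the diagonal (j > i) hold leftover values, but the causal mask discards them.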

The Skewing Trick Step by Step

Watch the matrix transform through the skewing operation. Step 1: the raw Q·ErT matrix (wrong indexing). Step 2: pad with zeros. Step 3: reshape. Step 4: slice to get the correct relative-position attention matrix. Click "Next Step" to advance.

Why the skewing trick works: The padded reshape shifts each row of the raw matrix left by one more position than the row below it. The last row (i = L-1) is not shifted, row L-2 is shifted by one position, and row 0 is shifted by L-1 positions. This converts "column index = row of Er" into "column index = key position j", so that each causal entry lines up with the embedding for its offset (i-j), provided Er stores offset 0 in its last row. The operation adds no multiply-adds: it is only a pad, a reshape, and a slice.

A worked example with 4 positions

Let's trace through with L=4. Er is stored in reverse order of distance: e3 is the embedding for offset 0 (the query's own position), e2 for offset 1, e1 for offset 2, and e0 for offset 3. Writing q0e3 as shorthand for $q_0 \cdot e_3$, the raw matrix Q · ErT has entry (i,k) = qiek:

Raw = [[q0e0, q0e1, q0e2, q0e3],
       [q1e0, q1e1, q1e2, q1e3],
       [q2e0, q2e1, q2e2, q2e3],
       [q3e0, q3e1, q3e2, q3e3]]

We want entry (i,j) to use the embedding for offset (i-j). So position (0,0) should use e3 (offset 0), position (1,0) should use e2 (offset 1), position (2,1) should use e2, and so on. First pad a zero column on the left:

Padded = [[0, q0e0, q0e1, q0e2, q0e3],
          [0, q1e0, q1e1, q1e2, q1e3],
          [0, q2e0, q2e1, q2e2, q2e3],
          [0, q3e0, q3e1, q3e2, q3e3]]

Reshape [4, 5] → [5, 4] by reading elements in row-major order and filling the new shape:

Reshaped = [[0,    q0e0, q0e1, q0e2],
            [q0e3, 0,    q1e0, q1e1],
            [q1e2, q1e3, 0,    q2e0],
            [q2e1, q2e2, q2e3, 0   ],
            [q3e0, q3e1, q3e2, q3e3]]

Slice off row 0. In the remaining [4, 4] matrix, mark the entries above the diagonal with * because the causal mask removes them:

Skewed = [[q0e3, *,    *,    *   ],
          [q1e2, q1e3, *,    *   ],
          [q2e1, q2e2, q2e3, *   ],
          [q3e0, q3e1, q3e2, q3e3]]

Every remaining entry (i,j) pairs query i with e(3-(i-j)): the diagonal uses e3 (offset 0), the first sub-diagonal uses e2 (offset 1), and so on down to q3e0 (offset 3) in the bottom-left corner. This is exactly the Toeplitz structure we need.

The implementation never has to index Er explicitly, because the reshape does it. The skew adds no new multiply-adds, only a pad, a reshape, and a slice over the existing L × L matrix.
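The pad-reshape-slice procedure can be checked against an explicit lookup. The NumPy sketch below assumes the relative embeddings are stored farthest-first, so that row L-1 of `E_r` is offset 0 and row L-1-k is "k steps back"; only the causal entries (j ≤ i) are compared, since the rest are masked in practice.

```python
import numpy as np

def skew(S_rel):
    # S_rel[i, k] = q_i · E_r[k], with E_r[L-1] holding offset 0
    L = S_rel.shape[0]
    padded = np.pad(S_rel, ((0, 0), (1, 0)))   # zero column on the left: [L, L+1]
    return padded.reshape(L + 1, L)[1:]        # reshape, then drop the first row

rng = np.random.default_rng(0)
L, d = 4, 3
Q = rng.normal(size=(L, d))
E_r = rng.normal(size=(L, d))

skewed = skew(Q @ E_r.T)

# Reference: look up the embedding for each causal (i, j) pair explicitly.
expected = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        expected[i, j] = Q[i] @ E_r[L - 1 - (i - j)]

lower = np.tril(np.ones((L, L), dtype=bool))   # causal entries, j <= i
print(np.allclose(skewed[lower], expected[lower]))
```

The reference loop touches each (i, j) pair separately, which is the indexing the skew reproduces without ever building the L × L × d tensor.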

Multi-head relative attention

In the multi-head setting, each head has its own relative position embeddings Er. This lets different heads specialize: one head might learn to attend at beat intervals (every 8 positions), another at bar intervals (every 32 positions), and another at phrase intervals (every 128 positions). The skewing trick is applied independently per head.

python
# Multi-head relative attention with skewing
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeMultiHeadAttn(nn.Module):
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.max_len = max_len
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        # Per-head relative position embeddings, stored farthest-first:
        # E_r[h, max_len - 1] is offset 0, E_r[h, max_len - 1 - k] is k steps back
        self.E_r = nn.Parameter(
            torch.randn(n_heads, max_len, self.d_k)
        )
        # Global bias vectors (u, v)
        self.u = nn.Parameter(torch.randn(n_heads, self.d_k))
        self.v = nn.Parameter(torch.randn(n_heads, self.d_k))

    def forward(self, x):
        B, L, D = x.shape
        # Project to Q, K, V  [B, L, H, d_k]
        Q = self.W_q(x).view(B, L, self.n_heads, self.d_k)
        K = self.W_k(x).view(B, L, self.n_heads, self.d_k)
        V = self.W_v(x).view(B, L, self.n_heads, self.d_k)

        # Content terms: (Q + u) @ K.T
        Qu = Q + self.u  # broadcast u over B, L
        S_content = torch.einsum('blhd,bmhd->bhlm', Qu, K)

        # Position terms: (Q + v) @ E_r.T, then skew
        Qv = Q + self.v
        E = self.E_r[:, self.max_len - L:]  # last L rows: offsets L-1 .. 0
        S_pos = torch.einsum('blhd,hmd->bhlm', Qv, E)
        S_pos = self._skew(S_pos)  # apply skewing trick

        # Combine, mask, softmax
        scores = (S_content + S_pos) / (self.d_k ** 0.5)
        future = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(future, float('-inf'))
        weights = F.softmax(scores, dim=-1)  # [B, H, L, L]
        out = torch.einsum('bhlm,bmhd->blhd', weights, V).reshape(B, L, D)
        return self.W_o(out)

    def _skew(self, S):
        # S: [B, H, L, L] -> skewed [B, H, L, L]; only entries with j <= i are valid
        B, H, L, _ = S.shape
        S = F.pad(S, (1, 0))  # [B, H, L, L+1]
        S = S.reshape(B, H, L + 1, L)
        return S[:, :, 1:, :]  # [B, H, L, L]

Memory savings

| Approach | Memory for relative positions | For L=2048, d=64 |
| --- | --- | --- |
| Naive | O(L²D), full tensor | 268M floats |
| Skewing trick | O(LD), only L embeddings needed | 131K floats |
| Savings | L× reduction | 2048× less memory |
python
import torch
import torch.nn.functional as F

def relative_attention_skew(Q, K, V, E_r):
    # Q, K, V: [L, d]
    # E_r: [L, d], relative position embeddings stored farthest-first:
    #      E_r[L-1] is offset 0, E_r[L-1-k] is "k steps back"
    L, d = Q.shape

    # Content-to-content scores (standard attention)
    S_content = Q @ K.T  # [L, L]

    # Content-to-position scores (needs skewing)
    S_rel = Q @ E_r.T  # [L, L], but column k is row k of E_r, not key position k

    # THE SKEWING TRICK:
    # Step 1: Pad with zero column on left
    S_rel = F.pad(S_rel, (1, 0))  # [L, L+1]

    # Step 2: Reshape to shift rows
    S_rel = S_rel.reshape(L + 1, L)  # [L+1, L]

    # Step 3: Slice off first row
    S_rel = S_rel[1:]  # [L, L], entries with j <= i are now correctly indexed

    # Combine and scale
    scores = (S_content + S_rel) / (d ** 0.5)

    # Apply causal mask (this also discards the leftover entries above the diagonal)
    mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=Q.device), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))

    # Standard softmax and value aggregation
    weights = F.softmax(scores, dim=-1)
    return weights @ V  # [L, d]
What does the skewing trick achieve, and how does it work?

Chapter 5: Generation Results

The Music Transformer was trained on the J.S. Bach Chorales dataset (382 four-part chorales) and the Piano-e-Competition dataset (1,573 virtuoso piano performances). The results demonstrated a clear advantage for relative attention in generating music with long-term structure.

Quantitative results

| Model | NLL (nats, lower = better) | Long-range coherence |
| --- | --- | --- |
| LSTM baseline | 5.67 | Poor: wanders after ~10s |
| Transformer (absolute position) | 5.52 | Medium: local coherence, weak repetition |
| Music Transformer (relative) | 5.36 | Best: themes repeat, motifs develop |
| Transformer XL | 5.49 | Good, but requires segment-level recurrence |

The negative log-likelihood improvement from 5.52 to 5.36 might seem small, but in autoregressive models, even small NLL improvements translate to noticeably better generation quality. More importantly, the qualitative difference was dramatic.

Training details

The Music Transformer was trained on the Piano-e-Competition dataset — 1,573 performances by virtuoso pianists, totaling ~172 hours of music. Each performance was converted to the MIDI event representation, giving sequences of 1,000-3,000 events. Training used:

| Hyperparameter | Value |
| --- | --- |
| d_model | 256 |
| Attention heads | 8 |
| Layers | 6 |
| Head dimension | 32 (256/8) |
| FFN inner dimension | 1024 |
| Max sequence length | 2048 |
| Optimizer | Adam |
| Learning rate | Noam schedule, warmup 4000 |
| Dropout | 0.1 |
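For reference, "Noam schedule" is the name commonly given to the warmup-then-decay learning rate from Vaswani et al. (2017). The sketch below assumes that formula with the table's d_model and warmup values; any extra constant scale factor a particular training setup applies is not covered here.

```python
def noam_lr(step, d_model=256, warmup=4000):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = noam_lr(4000)      # the rate peaks where warmup ends
for step in (100, 1000, 4000, 16000):
    print(step, noam_lr(step))
```

The rate grows linearly during warmup and then decays with the inverse square root of the step, so at four times the warmup length it has fallen to half its peak.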

The model is relatively small by modern standards (~10M parameters), but this was sufficient because the vocabulary is tiny (388 events) and the task is well-structured. Larger models didn't significantly improve quality — the bottleneck was training data, not model capacity.

Evaluation methodology

Evaluating music generation is notoriously difficult — there's no BLEU score for music. The paper used three evaluation methods:

Negative Log-Likelihood: quantitative. How well does the model predict held-out real performances? Lower = better density estimation.
Human Listening Study: qualitative. Human evaluators compare pairs of generated pieces. Which sounds more musical, structured, coherent?
Attention Analysis: interpretive. What patterns do the attention heads learn? Do they discover musically meaningful structures?

The human evaluation was critical. NLL measures density estimation quality, but a model with good NLL might generate boring, repetitive music (the "safe average"). The listening study confirmed that the Music Transformer's outputs were not just statistically good but actually musical — with structure, development, and emotional arc.

The listening study protocol: Evaluators were presented with pairs of 30-second excerpts (one from Music Transformer, one from a baseline) and asked "Which is more musical?" Across 800 comparisons, Music Transformer was preferred over LSTM 70% of the time and over absolute-position Transformer 60% of the time. The gap was largest for pieces longer than 1 minute, where long-range structure becomes most apparent.

Qualitative analysis

Human evaluators consistently preferred the Music Transformer's outputs. The key differences:

Repetition with variation. The Music Transformer produced phrases that returned with subtle changes — exactly how human composers write. A 4-bar melody would appear in bars 1-4, return in bars 9-12 with slight rhythmic variation, and appear again in bars 25-28 transposed to a new key. The absolute-position Transformer rarely achieved this.
Rhythmic consistency. The generated pieces maintained consistent rhythmic patterns (8th-note patterns, waltz rhythms, etc.) over long spans. The LSTM would start with a clear rhythm but gradually drift into irregular timing.
Harmonic coherence. The Music Transformer's pieces stayed in a key and followed conventional harmonic progressions (I-IV-V-I patterns) for much longer stretches than competing models.
Attention Pattern Analysis

What the Music Transformer actually learns to attend to. Relative attention heads develop specialized patterns: some attend to the beat (regular intervals), some to the melody (pitch-similar events), some to harmony (events within the same chord). Click each head type to see its pattern.

Self-attention visualized on real music

The paper includes attention visualizations on real performances from the Piano-e-Competition dataset. These visualizations confirm that the model learns musically meaningful patterns:

| Observation | What the model learned |
| --- | --- |
| Strong diagonal lines at regular intervals | Beat structure: the model attends to events at beat positions |
| Vertical stripes at specific positions | Key events (e.g., first notes of phrases) that provide global context |
| Diagonal bands of varying width | Rhythmic patterns of different note durations |
| Sparse distant attention | Theme recurrence: later events attending to matching earlier phrases |

The attention patterns reveal something fascinating: different heads specialize for different musical functions. Some heads learn a strong diagonal pattern (attending to events at regular beat intervals — discovering the meter without being told). Other heads attend to events with similar pitch content (tracking the melody). Still others attend to the most recent events (local context for harmony). This emergent specialization is possible precisely because relative position encodings allow the model to express "attend to events N steps back" as a learned pattern.

python
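# The Music Transformer architecture (sketch)
# Assumes torch.nn as nn and the RelativeMultiHeadAttn class from Chapter 4.
class RelativeAttentionBlock(nn.Module):
    # One decoder layer: relative self-attention, then a feed-forward network
    def __init__(self, d_model, n_heads, max_len, d_ff=1024, dropout=0.1):
        super().__init__()
        self.attn = RelativeMultiHeadAttn(d_model, n_heads, max_len)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, h):
        # Pre-norm residual layout (an assumption of this sketch)
        h = h + self.drop(self.attn(self.ln1(h)))
        return h + self.drop(self.ffn(self.ln2(h)))

class MusicTransformer(nn.Module):
    def __init__(self, vocab=388, d_model=256,
                 n_heads=8, n_layers=6, max_len=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        # Relative position embeddings live inside each block's attention module
        self.layers = nn.ModuleList([
            RelativeAttentionBlock(d_model, n_heads, max_len)
            for _ in range(n_layers)
        ])
        self.ln = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x):
        # x: [B, L] event token ids
        h = self.embed(x)       # [B, L, D]
        # NO absolute position added!
        for layer in self.layers:
            h = layer(h)
        h = self.ln(h)
        return self.head(h)    # [B, L, 388]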
What kinds of specialized attention patterns did different heads in the Music Transformer learn?

Chapter 6: Music Structure Explorer

Let's explore how the Music Transformer captures musical structure with an interactive simulation. This is the showcase — you'll see how relative attention enables the model to generate coherent patterns.

Interactive Music Structure Visualizer

A simulated 32-step musical sequence shown as a piano roll. The model generates each step by attending to previous steps (shown as attention weights below). Watch how relative attention creates recurring patterns — the warm highlights show which past events influence the current generation step. Use the controls to explore.

What to look for: (1) As repetition strength increases, the generated pattern increasingly mirrors a motif from 8 steps back — this is the relative attention learning to "copy from one bar ago." (2) Notice how the attention weights (shown as bars below the piano roll) peak at regular intervals — the model discovers the beat structure. (3) The generated notes stay within a consistent pitch range — harmonic coherence from attending to recent note choices.

Why this matters for musical structure

The three levels of musical structure that relative attention captures:

| Level | Time Scale | What Attention Captures | Example |
| --- | --- | --- | --- |
| Local | 1-8 events | Note-to-note transitions, chord voicing | C-E-G chord pattern |
| Phrase | 8-32 events | Melodic motifs, rhythmic patterns | A 4-bar melody repeating |
| Section | 32-200 events | Theme recurrence, key changes | ABA form, verse-chorus |

The LSTM can handle local structure well. The Transformer with absolute position handles local and some phrase-level structure. The Music Transformer with relative attention handles all three levels — and this is what makes its generated music sound structured rather than wandering.

Why relative attention creates repetition

Here's the mechanism: when the model learns that "a NOTE_ON event should attend strongly to the event 32 steps back" (via the Content-Position term in the attention score), it creates a natural copying mechanism. If the event 32 steps back was a C, the model is biased toward producing a C again. If the attention pattern says "attend at offset -32 and offset -64," the model creates a repeating pattern with period 32. This is exactly how musical phrases repeat — and the model discovers it without being told anything about musical structure.
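
To make the copying effect concrete, here is a toy sketch under strong simplifying assumptions: content scores are ignored entirely, the only signal is a hand-set bias over relative offsets (a stand-in for a learned Content-Position score), and the "model" just outputs the token that wins an attention-weighted vote over the past. The names `predict_next` and `offset_bias` are illustrative and do not come from the paper.

```python
import math

def predict_next(history, offset_bias, vocab_size):
    """Pick the next token by an attention-weighted vote over past tokens.

    history:      list of past token ids
    offset_bias:  dict mapping a negative relative offset (e.g. -8) to a logit;
                  a hypothetical stand-in for a learned position score
    """
    T = len(history)  # index of the position being generated
    # One logit per past position j, determined only by the offset j - T.
    logits = [offset_bias.get(j - T, 0.0) for j in range(T)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Each past token "votes" with its attention weight.
    votes = [0.0] * vocab_size
    for tok, w in zip(history, weights):
        votes[tok] += w
    return votes.index(max(votes)), weights

period = 8
motif = [0, 4, 7, 4, 0, 4, 7, 11]          # made-up 8-step motif (pitch classes)
bias = {-period: 6.0, -2 * period: 4.0}    # strong attention one and two bars back

seq = list(motif)
for _ in range(16):
    tok, _ = predict_next(seq, bias, vocab_size=12)
    seq.append(tok)
```

With these hand-set biases, `seq` ends up as the motif repeated three times. A trained model combines such positional preferences with content scores, so real copying is softer and leaves room for variation.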

What each model produces over time

The best way to understand the Music Transformer's advantage is to compare outputs at different time horizons:

| Model | First 15 seconds | 30-60 seconds | 60-120 seconds |
| --- | --- | --- | --- |
| LSTM | Clear rhythm, melody | Rhythm drifts, melody wanders | Formless noodling |
| Transformer (abs) | Clear rhythm, melody | Rhythm steady, locally coherent | No repetition, new material only |
| Music Transformer | Clear rhythm, melody | Rhythm steady, themes develop | Themes return, feel composed |

The difference at the 60-120 second mark is the most telling. LSTMs have completely lost the thread. Absolute-position Transformers produce locally coherent music but without structural backbone. Only the Music Transformer with relative attention generates music that sounds like it has a plan — because relative attention enables the "copy from N steps back" behavior that creates repetition and development.

Emergence of musical form: The most remarkable finding of the paper is that attention heads independently discover musical concepts like beats, bars, and phrases through training on raw MIDI data. No musical knowledge is baked into the architecture — only the capacity for relative pattern matching. The structure emerges from the statistics of music itself. This is a powerful demonstration of the Transformer's ability to discover abstract patterns from raw sequential data.

Temperature and sampling

During generation, the model's output logits are divided by a temperature parameter before softmax. Lower temperature (0.7-0.9) makes the model more deterministic, favoring the most likely events — producing safer, more conventional music. Higher temperature (1.0-1.3) adds randomness, producing more creative but potentially less coherent output. The paper found that temperature around 1.0 gave the best balance of coherence and creativity for piano performances.
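
A small worked example of what temperature does to a distribution. The three logits are made up for illustration, and `softmax_with_temperature` is a helper written for this sketch rather than anything from the paper's code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate events

cold = softmax_with_temperature(logits, 0.7)     # sharper: top event dominates more
neutral = softmax_with_temperature(logits, 1.0)  # the model's unmodified distribution
hot = softmax_with_temperature(logits, 1.3)      # flatter: unlikely events get more mass

for name, probs in [("T=0.7", cold), ("T=1.0", neutral), ("T=1.3", hot)]:
    print(name, [round(p, 3) for p in probs])
```

The ranking of events never changes with temperature; only how much probability the top choices take from the rest.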

python
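# Generating music with the Music Transformer
import torch
import torch.nn.functional as F

@torch.no_grad()  # no gradients needed at generation time
def generate(model, primer, length, temperature=1.0):
    # primer: initial MIDI event token ids (e.g., first bar)
    # Assumes len(primer) + length stays within the model's max_len
    model.eval()
    generated = list(primer)

    for _ in range(length):
        # Re-encodes the whole sequence each step (simple, no caching)
        x = torch.tensor([generated])  # [1, T]
        logits = model(x)[0, -1]       # [388]: logits at the last position

        # Temperature scaling
        probs = F.softmax(logits / temperature, dim=-1)

        # Sample from distribution
        next_event = torch.multinomial(probs, 1).item()
        generated.append(next_event)

    return generated  # List of MIDI event tokens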
What are the three levels of musical structure, and which models can capture each?

Chapter 7: Connections — From Music to Modern Audio AI

The Music Transformer's contributions extend far beyond music generation. Relative position representations became a standard component of modern Transformers, and music generation evolved into a broader field of audio AI.

The lineage of relative position

| Paper | Year | Position Approach | Used in |
| --- | --- | --- | --- |
| Original Transformer | 2017 | Sinusoidal absolute | Translation (BERT uses learned absolute positions) |
| Shaw et al. | 2018 | Learned relative (clipped) | Translation improvements |
| Music Transformer | 2019 | Relative + skewing trick | Music generation |
| Transformer-XL | 2019 | Relative + segment recurrence | Language modeling (XLNet) |
| RoPE (Su et al.) | 2021 | Rotary position embedding | Llama, GPT-NeoX, modern LLMs |
| ALiBi (Press et al.) | 2022 | Linear bias (no learned params) | BLOOM, MPT |

Relative position won. Most major open LLMs today use some form of relative position encoding. RoPE (Rotary Position Embeddings), used in Llama, Gemma, Mistral, and many other open-source LLMs, builds on the same relative attention idea: it encodes relative position through rotation of the query and key vectors. The Music Transformer helped establish that relative position works better than absolute position for tasks that require recognizing the same pattern at different positions.

Evolution of music AI

| System | Year | Modality | Key Innovation |
| --- | --- | --- | --- |
| Music Transformer | 2019 | Symbolic (MIDI) | Relative attention for structure |
| Jukebox (OpenAI) | 2020 | Raw audio (waveform) | VQ-VAE + Transformer on audio tokens |
| MusicLM (Google) | 2023 | Raw audio | Hierarchical audio tokens, text-conditioned |
| MusicGen (Meta) | 2023 | Raw audio | Codebook interleaving, efficient generation |
| Stable Audio (Stability) | 2023 | Raw audio | Latent diffusion for long-form audio |
| Suno v3 | 2024 | Raw audio + vocals | Full song generation with lyrics |

What changed

The biggest shift: moving from symbolic (MIDI) to raw audio. The Music Transformer generates MIDI events — a score, like sheet music. To hear it, you need a synthesizer. Modern models like MusicLM and MusicGen generate raw audio directly — including timbre, dynamics, and even vocals. This required new tokenization schemes (VQ-VAE codebooks instead of MIDI events) and much more compute.

The tokenization revolution

The progression in audio tokenization mirrors the progression from character-level to subword-level models in NLP:

| Approach | Token Type | Tokens per second | Quality |
| --- | --- | --- | --- |
| Music Transformer (2019) | MIDI events | ~10-50 | Score only (need synth) |
| Jukebox VQ-VAE (2020) | Discrete audio codes | ~340 | Raw audio (44.1 kHz) |
| SoundStream/EnCodec (2021-2022) | Multi-codebook codes | ~50 per codebook | High quality at low bitrate |
| w2v-BERT tokens (2023) | Semantic audio tokens | ~25 | Semantic, not acoustic |

The key innovation: neural audio codecs like SoundStream (Google) and EnCodec (Meta) that compress audio into discrete tokens at ~50 tokens/second — compact enough for a Transformer to model. MusicLM and MusicGen both build on these codecs, using the Transformer to model sequences of audio tokens rather than MIDI events.
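
To get a feel for why token rate matters, here is some back-of-the-envelope arithmetic using the approximate rates from the table above. The 30-second clip length and the choice of 4 codebooks are illustrative assumptions, not figures taken from any particular paper, and `tokens_for_clip` is a helper made up for this sketch.

```python
def tokens_for_clip(seconds, tokens_per_second, codebooks=1):
    """Total number of discrete tokens needed to represent a clip."""
    return round(seconds * tokens_per_second * codebooks)

CLIP_SECONDS = 30  # assumed clip length

# (approximate tokens per second from the table, number of codebooks)
rates = {
    "MIDI events (upper end of range)": (50, 1),
    "Jukebox VQ-VAE codes": (340, 1),
    "Neural codec, 4 codebooks (assumed)": (50, 4),
    "Semantic tokens": (25, 1),
}

counts = {
    name: tokens_for_clip(CLIP_SECONDS, rate, n_codebooks)
    for name, (rate, n_codebooks) in rates.items()
}

for name, n in counts.items():
    print(f"{name}: {n} tokens")
```

These are total token counts. How many Transformer steps they cost depends on how the codebooks are flattened or interleaved, since some schemes predict several codebooks per step.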

Limitations and what came after

The Music Transformer had notable limitations:

| Limitation | Impact | How later work addressed it |
| --- | --- | --- |
| Piano only | Can't generate orchestral or vocal music | Jukebox: multi-instrument raw audio |
| MIDI output | Need synthesizer to hear result | MusicGen: direct audio generation |
| ~2 min max | Context window limits piece length | Segment-level recurrence, infinite generation |
| No conditioning | Can't specify genre, mood, style | MusicLM: text-conditioned generation |
| No evaluation standard | Hard to compare across papers | FAD, CLAP scores for audio quality |

The open question: Can we combine the Music Transformer's structural understanding (learned from symbolic MIDI) with modern audio generators' sonic richness? A model that truly understands musical form — themes, development, recapitulation — while generating full audio with realistic timbre and expression would be a major breakthrough. Current text-to-music models generate impressively realistic sound but still struggle with the kind of long-range structure the Music Transformer demonstrated on MIDI. The problem the Music Transformer solved (long-range structure through relative attention) remains relevant even as the output modality has changed.
python
# The progression of music AI architectures

# 2019: Music Transformer
# Input: MIDI events (388 vocab, ~2000 tokens/piece)
# Model: Transformer decoder + relative attention
# Output: MIDI events → render with synthesizer

# 2020: Jukebox
# Input: raw audio → VQ-VAE → discrete codes
# Model: Sparse Transformer (3 levels of hierarchy)
# Output: discrete codes → VQ-VAE decoder → raw audio

# 2023: MusicGen
# Input: text description + (optional) melody
# Model: Transformer decoder over interleaved codebooks
# Output: EnCodec tokens → EnCodec decoder → raw audio

# Key: the Transformer stayed constant; only the
# tokenization evolved (MIDI → VQ-VAE → EnCodec)

In the systems sketched above, the core architecture remained an autoregressive Transformer over discrete tokens. The Music Transformer showed that attention could capture long-range musical structure, and much of the work since has been about carrying that insight over to richer audio representations.

The RoPE connection

The most widely used relative position encoding today is Rotary Position Embedding (RoPE, Su et al., 2021). RoPE encodes position by rotating pairs of query and key dimensions, each pair treated as a complex number, through an angle proportional to the token's position. The attention score between the rotated query at position i and the rotated key at position j then depends on position only through the offset (i-j):

$$q_i^\top k_j = \operatorname{Re}\left[(R_i q) \cdot (R_j k)^{*}\right] = f(q, k, i-j)$$

Here R_t is the rotation applied at position t (angle t×θ, with a different θ for each dimension pair), and q_i = R_i q and k_j = R_j k are the rotated vectors. The key insight: the rotations at i and j combine into a single rotation by (i-j), so the score depends on position only through the offset. This achieves the same relative position effect as the Music Transformer but more elegantly, with no skewing trick and no separate position embedding table.
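
The offset-only property can be checked numerically. The sketch below uses Python's built-in complex numbers, so each list entry stands for one pair of real dimensions. The vectors, positions, and the helper names `rope_rotate` and `score` are made up for illustration; this is a minimal check, not a production RoPE implementation.

```python
import cmath

def rope_rotate(vec, position, theta=10000.0):
    """Rotate each complex entry by an angle proportional to the position.

    Entry m of n uses frequency theta ** (-m / n), the usual RoPE schedule
    written in terms of complex pairs.
    """
    n = len(vec)
    return [z * cmath.exp(1j * position * theta ** (-m / n))
            for m, z in enumerate(vec)]

def score(q, k, i, j):
    """Attention score between query q at position i and key k at position j."""
    qi, kj = rope_rotate(q, i), rope_rotate(k, j)
    # Real part of the Hermitian inner product equals the real dot product.
    return sum(a * b.conjugate() for a, b in zip(qi, kj)).real

q = [1.0 + 2.0j, -0.5 + 0.3j]   # made-up query (2 complex pairs = 4 real dims)
k = [0.7 - 1.1j, 2.0 + 0.4j]    # made-up key

near_start = score(q, k, 10, 7)      # offset 3, early in the sequence
much_later = score(q, k, 210, 207)   # offset 3, two hundred steps later
other_offset = score(q, k, 10, 5)    # offset 5

print(near_start, much_later, other_offset)
```

The first two scores agree up to floating-point error because both pairs of positions are 3 apart, while the third differs because the offset changed. This is the same "a melody at 10-20 looks like a melody at 200-210" property that the Music Transformer obtains from its relative embeddings.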

| Property | Music Transformer | RoPE |
| --- | --- | --- |
| Position info | Additive (Er[i-j] added to score) | Multiplicative (rotation of Q/K) |
| Extra parameters | L × d learned embeddings | None (uses fixed rotations) |
| Extrapolation | Limited to trained lengths | Better (smooth rotation) |
| Implementation | Skewing trick needed | Simple complex multiply |
| Used in | Music generation | Llama, Gemma, Mistral, GPT-NeoX |

python
# RoPE: the modern descendant of relative attention
import torch

def apply_rope(q, k, positions, d_model, theta=10000):
    # q, k: [B, T, d], where d == d_model (the per-head dimension, must be even)
    # positions: [T] tensor of token positions
    # Compute rotation frequencies: [d/2]
    freqs = 1.0 / (theta ** (torch.arange(0, d_model, 2).float() / d_model))
    # Build rotation angles: [T, d/2]
    angles = positions[:, None].float() * freqs[None, :]
    cos_a, sin_a = torch.cos(angles), torch.sin(angles)
    # Treat (even, odd) dimensions as real and imaginary parts of complex numbers
    q1, q2 = q[..., ::2], q[..., 1::2]
    k1, k2 = k[..., ::2], k[..., 1::2]
    # Apply rotation (complex multiply); output stores real parts first, then imaginary
    q_rot = torch.cat([q1*cos_a - q2*sin_a, q1*sin_a + q2*cos_a], dim=-1)
    k_rot = torch.cat([k1*cos_a - k2*sin_a, k1*sin_a + k2*cos_a], dim=-1)
    # For given q and k contents, q_rot[i] · k_rot[j] depends on position only via (i-j)
    return q_rot, k_rot

What remains unsolved

Despite the enormous progress since 2019, several challenges from the Music Transformer era remain open:

| Challenge | Status (2025) |
| --- | --- |
| True musical form (sonata, fugue) | No model reliably generates multi-minute form |
| Emotional arc (tension/release) | Models capture mood but not narrative |
| Controllable structure | Text controls style/mood but not form |
| Multi-instrument orchestration | Raw audio models don't understand parts |
| Musical understanding | Models generate plausible sounds but may not "understand" harmony |

The Music Transformer showed that relative attention could capture phrase-level repetition (~30 seconds). Getting to movement-level structure (~5 minutes) and piece-level form (~20 minutes) remains an open challenge. The attention mechanism can theoretically handle these timescales if the context window is long enough, but training data with clear large-scale structure is scarce, and evaluation is subjective.

Perhaps the most important open question is: does generating coherent musical form require understanding music (in some computational sense), or can it emerge purely from pattern matching on enough training data? The Music Transformer suggests the latter — its relative attention heads discovered beats, phrases, and repetition structure purely from statistical patterns in MIDI data, with no explicit musical knowledge. If this scaling hypothesis holds, then sufficiently large models trained on enough music data might eventually generate pieces with genuine formal coherence. Whether this constitutes "understanding" music is a philosophical question the paper wisely leaves open.

python
# Summary: Music Transformer contributions

# 1. Relative attention for sequences
#    - Replaces absolute PE with relative offset embeddings
#    - Attention(i,j) depends on (i-j), not on i and j separately
#    - Enables pattern recognition invariant to position

# 2. The skewing trick for efficient computation
#    - Reduces memory from O(L²D) to O(LD)
#    - Zero additional compute (just pad + reshape)
#    - Makes relative attention practical for long sequences

# 3. Event-based MIDI representation
#    - 388-token vocabulary captures full piano performance
#    - 1000x compression vs piano roll
#    - Includes dynamics (velocity) and timing nuance

# 4. Proof that attention captures musical structure
#    - Heads discover beats, phrases, harmony automatically
#    - Long-range coherence up to 2 minutes
#    - Preferred by human listeners over LSTM 70% of the time

# 5. Part of the broader shift to relative position encodings
#    - Transformer-XL (2019): relative + segment recurrence
#    - RoPE (2021): rotary encoding, used in Llama/Gemma
#    - ALiBi (2022): linear bias, no learned params
#    - Most modern LLMs use some form of relative position encoding

# The Music Transformer was a small model (10M params)
# on a small dataset (172 hours of piano).
# But its ideas — relative attention, the skewing trick,
# event-based tokenization — influenced every model that
# came after. Sometimes the most important papers are not
# the biggest, but the ones that ask the right question.
The Music Transformer's lasting impact: It demonstrated two things. First, relative position encoding works better than absolute for tasks with translational structure (patterns that can appear at different positions), a lesson reflected in many subsequent Transformer designs. Second, music is a tractable and revealing testbed for sequence modeling: the structure is audible, so a listener can judge directly whether a model captures long-range dependencies. When you listen to a generated piece and hear the theme return, you know the model has learned something real about sequential structure.
How did the Music Transformer's relative attention approach influence modern large language models?