Music Transformer (Huang et al. 2019)

Chapter 0: Why Music is Hard

Listen to any piece of classical piano music — say, a Chopin Ballade. Within the first minute, a theme is introduced: a melody with a particular rhythm and contour. Over the next eight minutes, that theme returns — sometimes in the same key, sometimes transposed, sometimes varied but recognizable. The ending recalls the opening. Phrases echo phrases heard minutes earlier. This is long-range structure, and it's what makes music feel like music rather than random notes.

Now try to generate music with an LSTM. The LSTM processes events one at a time, carrying information through its hidden state. After 100 events (~10 seconds of music), the hidden state has been overwritten and rewritten so many times that information from the beginning is practically gone. The generated music might sound locally coherent — nice chords, reasonable rhythms — but it wanders aimlessly, never returning to a theme, never building tension, never resolving.

The Transformer can see every previous event directly through attention. But the original Transformer uses absolute positional encodings: each position gets a fixed vector that encodes "I am at position 42" or "I am at position 317." This creates a problem for music: a melody at position 10-20 and the same melody repeated at position 200-210 look completely different to absolute position encodings. The model has to independently learn that the same musical pattern can occur at different absolute positions.

The Long-Range Structure Problem

A piano roll showing a musical phrase (teal) that repeats later (warm). LSTMs lose the connection after ~100 events. Standard Transformers see both occurrences but don't recognize them as "the same pattern shifted in time." Relative attention solves this. Click "Show Repetition" to highlight the structure.


The core problem: Music has two properties that make it uniquely challenging for sequence models. (1) Long-range dependencies: a theme introduced in bar 1 must be recalled in bar 64 — that's hundreds of events later. (2) Relative structure: a melody transposed up a fifth is still "the same melody," and a rhythm shifted by a beat is still "the same rhythm." Models need to capture relative relationships between events, not just absolute positions.

Prior work on music generation

Before the Music Transformer, the best-known neural music generators were mostly RNN-based:

Model	Year	Architecture	Limitation
DeepBach	2017	RNN + Gibbs sampling	Limited to Bach chorale style
Performance RNN	2017	LSTM	Wanders after 10-15 seconds
MusicVAE	2018	Hierarchical VAE + LSTM	Fixed-length, interpolation focus
Coconet	2017	CNN (non-autoregressive)	Short pieces only (16 bars)

All these models struggled with the same fundamental issue: maintaining coherent structure beyond about 10 seconds of music, or beyond a short fixed length. For the RNN-based models, the hidden state bottleneck was the main constraint.

Huang et al. proposed the Music Transformer: a Transformer with relative position representations instead of absolute positional encodings. Instead of "I am at position 42," each attention computation encodes "the key is 5 positions before the query." This lets the model learn that "a note followed by the same note 4 steps later" is a consistent pattern, regardless of where in the sequence it occurs.

The result: piano performances with coherent long-range structure — themes that repeat, develop, and resolve over sequences of 2,000+ events. For the first time, a neural network could generate music that sounded like it had a plan.

What "structure" means in music

Musical structure operates at multiple timescales. Understanding these levels helps you appreciate what the Music Transformer actually achieved:

Level	Time Scale	Musical Term	Example
Note	~100ms	Intervals, ornaments	A trill, a grace note
Beat	~500ms	Rhythm, meter	Waltz (3/4 time)
Phrase	~4s	Melody, motif	The first 4 bars of Für Elise
Section	~30s	Verse, chorus, development	The A section returns after B
Form	~3min	Sonata, ABA, rondo	Exposition-Development-Recap

LSTMs can handle note and beat-level structure — they produce locally coherent rhythms and harmonies. But phrase-level and beyond? The hidden state has been overwritten too many times. The Music Transformer, with relative attention, captures structure up to the section level — phrases repeat and develop in ways that sound intentional.

What are the two properties of music that make standard Transformers with absolute positional encodings inadequate?

(1) Long-range dependencies: themes must be recalled hundreds of events later, and (2) relative structure: the same melody transposed or shifted is still "the same pattern," but absolute positions make different occurrences look unrelated

Music has too many possible notes and too many possible rhythms

Music is continuous while Transformers only handle discrete tokens

Chapter 1: MIDI Event Representation

Before we can feed music to a Transformer, we need to represent it as a sequence of discrete tokens. The Music Transformer uses an event-based MIDI representation developed by Oore et al. (2018). Instead of a piano roll (a 2D grid of time vs pitch), the music is encoded as a 1D sequence of events, like text.

The vocabulary consists of 388 event types across four categories:

Event Type	Count	Description
NOTE_ON	128	Start playing note (pitch 0-127, where 60 = middle C)
NOTE_OFF	128	Stop playing note (same pitch range)
TIME_SHIFT	100	Advance time by 10ms increments (10ms to 1000ms)
VELOCITY	32	Set velocity (volume) for subsequent notes (0-127, quantized to 32 bins)

A simple example: playing middle C for half a second at medium volume:

VELOCITY_16

Set volume to medium (bin 16 of 32)

↓

NOTE_ON_60

Start playing middle C (MIDI note 60)

↓

TIME_SHIFT_50

Wait 500ms (50 × 10ms)

↓

NOTE_OFF_60

Stop playing middle C

Polyphony (multiple notes sounding at once) is handled naturally by this event representation. A chord is a run of consecutive NOTE_ON events with no TIME_SHIFT between them, a convention the model picks up from the data:

[VELOCITY_20, NOTE_ON_60, NOTE_ON_64, NOTE_ON_67, TIME_SHIFT_100, NOTE_OFF_60, NOTE_OFF_64, NOTE_OFF_67]

This encodes a C major chord (C-E-G, MIDI 60-64-67) held for 1 second at a moderately loud velocity (bin 20 of 32).
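To make the chord convention concrete, here is a minimal sketch of a decoder that walks an event list and recovers each note's onset, release, and velocity bin. The tuple-based event format and the `events_to_notes` helper are illustrative assumptions, not code from the paper.

```python
def events_to_notes(events):
    """Turn (kind, value) events into (pitch, start_ms, end_ms, velocity_bin) notes.

    Hypothetical helper: events are tuples such as ("NOTE_ON", 60) or
    ("TIME_SHIFT", 100), where a TIME_SHIFT value counts 10 ms steps.
    """
    now_ms = 0
    velocity_bin = 0
    active = {}   # pitch -> (start_ms, velocity_bin) for notes still sounding
    notes = []
    for kind, value in events:
        if kind == "VELOCITY":
            velocity_bin = value          # applies to all later NOTE_ON events
        elif kind == "TIME_SHIFT":
            now_ms += value * 10          # only TIME_SHIFT moves the clock
        elif kind == "NOTE_ON":
            active[value] = (now_ms, velocity_bin)
        elif kind == "NOTE_OFF" and value in active:
            start_ms, vel = active.pop(value)
            notes.append((value, start_ms, now_ms, vel))
    return notes


chord = [
    ("VELOCITY", 20),
    ("NOTE_ON", 60), ("NOTE_ON", 64), ("NOTE_ON", 67),
    ("TIME_SHIFT", 100),
    ("NOTE_OFF", 60), ("NOTE_OFF", 64), ("NOTE_OFF", 67),
]
notes = events_to_notes(chord)
```

All three notes come back with the same start time, because no TIME_SHIFT separates their NOTE_ON events.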

MIDI Event Sequence Visualizer

A piano roll (top) and its MIDI event sequence (bottom). Events are color-coded: teal = NOTE_ON, blue = NOTE_OFF, warm = TIME_SHIFT, purple = VELOCITY. Click "Play Sequence" to step through the events and watch them populate the piano roll.


Why events instead of a piano roll? A piano roll of a 4-minute piece at 10ms resolution would be 24,000 timesteps × 128 pitches = 3 million entries, mostly zeros. The event sequence for the same piece is ~2,000-3,000 tokens — a 1000x compression. Events encode only what changes, skipping the vast silent space. This is why the Transformer can handle minutes of music: the sequence is compact.
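The compression estimate can be checked with a few lines of arithmetic. This sketch uses only the figures quoted above (10 ms resolution, 128 pitches, 2,000 to 3,000 events); the function name is made up for illustration.

```python
def piano_roll_cells(duration_s, step_ms=10, n_pitches=128):
    """Number of cells in a dense piano roll at the given time resolution."""
    timesteps = duration_s * 1000 // step_ms
    return timesteps * n_pitches


cells = piano_roll_cells(4 * 60)      # 4-minute piece
ratio_at_3000 = cells / 3000          # compression if the piece has 3,000 events
ratio_at_2000 = cells / 2000          # compression if the piece has 2,000 events
```

Under these numbers the ratio lands between roughly 1,000x and 1,500x, consistent with the "about 1000x" figure.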

Why this representation works for the Transformer

The event representation has several properties that make it ideal for Transformer modeling:

Discrete tokens

388 event types fit naturally as a classification problem — same as word prediction in NLP

↓

Compact sequences

~2000 events per piece vs ~3M piano roll entries. Fits in Transformer context window.

↓

Temporal flexibility

Variable timing through TIME_SHIFT events. No fixed grid — rubato and tempo changes are natural.

↓

Polyphonic

Multiple simultaneous notes via consecutive NOTE_ON events. Chords, counterpoint, and arpeggios all representable.

Handling expressive performance

The VELOCITY events capture dynamics — how loudly or softly notes are played. A skilled pianist varies velocity constantly: a melody note might be played forte (loud, VELOCITY_28) while the accompanying chord is piano (soft, VELOCITY_8). The 32 velocity bins are coarse but sufficient to capture the basic shape of a performance's dynamics. The ordering convention is that a VELOCITY event sets the velocity for all subsequent NOTE_ON events until the next VELOCITY event appears.

TIME_SHIFT events with 10ms granularity capture timing nuance. A note slightly before the beat (anticipation) vs slightly after (laid-back feel) is a difference of 20-30ms, which is 2-3 steps on the 10ms grid (a single TIME_SHIFT_2 or TIME_SHIFT_3 event). This granularity is enough for the model to learn micro-timing patterns that distinguish a mechanical MIDI playback from an expressive human performance.
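A single TIME_SHIFT token covers at most one second, so longer gaps need more than one token. One plausible way to handle this, assumed here rather than stated in the text above, is to emit several TIME_SHIFT events in a row. The helper name `time_shift_steps` is hypothetical.

```python
def time_shift_steps(gap_ms, step_ms=10, max_steps=100):
    """Split a time gap into TIME_SHIFT step counts, each between 1 and max_steps."""
    steps = round(gap_ms / step_ms)
    chunks = []
    while steps > 0:
        chunk = min(steps, max_steps)   # one token can advance at most 1000 ms
        chunks.append(chunk)
        steps -= chunk
    return chunks


long_gap = time_shift_steps(2350)   # a 2.35 s rest
micro_gap = time_shift_steps(30)    # a 30 ms timing nuance
```

Each returned number n stands for one TIME_SHIFT_n event; a gap that rounds to zero steps produces no event at all.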

The vocabulary is surprisingly small: Only 388 tokens to represent the full range of piano performance. Compare to language models with 50K-100K tokens. The compact vocabulary means the model's softmax output is cheap (388-way vs 50K-way), and the embedding table is small (388 × d_model). Most of the model's capacity goes into learning musical patterns, not memorizing a huge vocabulary.

python
# MIDI event vocabulary
VOCAB_SIZE = 388
# NOTE_ON:     0-127  (128 events)
# NOTE_OFF:  128-255  (128 events)
# TIME_SHIFT: 256-355 (100 events, 10ms to 1s)
# VELOCITY:  356-387  (32 events)

def encode_note_on(pitch):
    return pitch  # 0-127

def encode_note_off(pitch):
    return 128 + pitch  # 128-255

def encode_time_shift(ms):
    # Quantize to 10ms bins, max 1000ms
    bins = min(100, max(1, round(ms / 10)))
    return 256 + bins - 1  # 256-355

def encode_velocity(vel):
    # Quantize 0-127 to 32 bins
    bin_idx = min(31, vel // 4)
    return 356 + bin_idx  # 356-387

# C major chord for 1s at MIDI velocity 80 (bin 20 of 32):
sequence = [
    encode_velocity(80),      # VELOCITY_20
    encode_note_on(60),       # NOTE_ON_60 (C4)
    encode_note_on(64),       # NOTE_ON_64 (E4)
    encode_note_on(67),       # NOTE_ON_67 (G4)
    encode_time_shift(1000),  # TIME_SHIFT_100
    encode_note_off(60),      # NOTE_OFF_60
    encode_note_off(64),      # NOTE_OFF_64
    encode_note_off(67),      # NOTE_OFF_67
]
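When inspecting model output it helps to go the other way, from token ids back to event names. The sketch below inverts the id layout used in the block above (NOTE_ON at 0-127, NOTE_OFF at 128-255, TIME_SHIFT at 256-355, VELOCITY at 356-387); `decode_token` is an illustrative name.

```python
def decode_token(token_id):
    """Map a token id in [0, 388) back to a readable event name."""
    if not 0 <= token_id < 388:
        raise ValueError(f"token id out of range: {token_id}")
    if token_id < 128:
        return f"NOTE_ON_{token_id}"
    if token_id < 256:
        return f"NOTE_OFF_{token_id - 128}"
    if token_id < 356:
        return f"TIME_SHIFT_{token_id - 256 + 1}"   # 1..100, in 10 ms steps
    return f"VELOCITY_{token_id - 356}"             # bin 0..31


# Token ids of the C major chord sequence from the block above.
names = [decode_token(t) for t in (376, 60, 64, 67, 355, 188, 192, 195)]
```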

Why does the Music Transformer use an event-based MIDI representation instead of a piano roll?

Because event sequences are ~1000x more compact: a 4-minute piece is ~2000 events vs ~3 million piano roll entries (mostly zeros). This compression makes Transformer sequence lengths tractable for minutes of music.

Because MIDI files are smaller than WAV files

Because piano rolls can't represent chords

Chapter 2: Absolute vs Relative Position

The original Transformer (Vaswani et al., 2017) uses absolute positional encodings: a fixed vector is added to each token embedding that encodes "I am at position t." Sinusoidal encodings use:

PE(t, 2i) = sin(t / 10000^(2i/d))

PE(t, 2i+1) = cos(t / 10000^(2i/d))

This tells the model "this token is at absolute position 42." But consider a musical sequence: a C-E-G chord at positions 10-12, and the same C-E-G chord at positions 200-202. With absolute encodings:

Event	Absolute position	What the model sees
NOTE_ON_60	10	embedding(60) + PE(10)
NOTE_ON_64	11	embedding(64) + PE(11)
NOTE_ON_60	200	embedding(60) + PE(200)
NOTE_ON_64	201	embedding(64) + PE(201)

The C at position 10 and the C at position 200 have the same content embedding but different positional vectors. Nothing in the input tells the model that the two occurrences are the same pattern: it has to learn, across many different absolute positions, that "a C followed by an E" is a major third. With 2000+ possible positions, this is a substantial learning burden.
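The effect is easy to see numerically. The sketch below evaluates the sinusoidal formula at positions 10 and 200; the tiny `d_model=8` is an assumption chosen only to keep the vectors readable.

```python
import math


def sinusoidal_pe(t, d_model=8):
    """Sinusoidal encoding for position t, interleaving sin and cos per frequency."""
    pe = []
    for i in range(d_model // 2):
        angle = t / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


pe_10 = sinusoidal_pe(10)
pe_200 = sinusoidal_pe(200)
distance = math.dist(pe_10, pe_200)   # Euclidean distance between the two encodings
```

The same NOTE_ON_60 embedding receives two clearly different positional vectors, so the summed inputs at positions 10 and 200 differ even though the musical content is identical.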

Relative position representations (Shaw et al., 2018) replace "I am at position 42" with "the key is 5 positions before me." The attention computation now depends on the distance between query and key, not their absolute locations:

Absolute: attention depends on PE(t_q) and PE(t_k) independently

Relative: attention depends on (t_q - t_k) — the offset between them
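A small sketch of why offsets help: the matrix of query-minus-key distances is the same for both occurrences of the chord. The helper name `offset_matrix` is illustrative.

```python
def offset_matrix(positions):
    """Entry [q][k] is the query position minus the key position."""
    return [[q - k for k in positions] for q in positions]


first = offset_matrix([10, 11, 12])       # C-E-G at positions 10-12
second = offset_matrix([200, 201, 202])   # the same chord at positions 200-202
```

Because `first` and `second` are equal, any attention term that depends only on the offset treats the two occurrences identically.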

Absolute vs Relative Position Encoding

Two occurrences of the same musical pattern (C-E-G) at different positions. With absolute encoding (left), they look completely different. With relative encoding (right), the internal structure is identical — only the relative distances matter. Toggle between them.

Why relative position is perfect for music: Musical patterns are defined by intervals (relative pitch distances) and rhythms (relative time distances), not by absolute positions. A melody starting on C at beat 1 and the same melody starting on C at beat 33 should produce the same attention patterns. Relative encoding makes this automatic — the model learns "these notes are 2 positions apart" once, and it works everywhere in the sequence.

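A similar argument applies to transposition, with one caveat. A melody in C major and the same melody in G major share the same interval pattern, shifted by 7 semitones. Relative position attention as described here works on distances between sequence positions, not on pitch, so it does not by itself make the model transposition-invariant. That requires applying the same relative idea to pitch, so that the interval structure is learned once rather than once per key.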

Mathematical formalization

Let's be precise. In the standard Transformer, the attention logit between query at position i and key at position j is:

e_ij^abs = (x_i + PE_i)W_Q · ((x_j + PE_j)W_K)^T

Expanding this product gives four terms:

= x_iW_Q(x_jW_K)^T + x_iW_Q(PE_jW_K)^T + PE_iW_Q(x_jW_K)^T + PE_iW_Q(PE_jW_K)^T

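The last three terms all depend on absolute positions. In the relative formulation, these are replaced by terms that depend only on the offset (i-j): the key-side encoding PE_j W_K becomes a learned embedding E_r[i-j], and the query-side encoding PE_i W_Q becomes a learned vector that is the same for every query position (the vectors u and v in Chapter 3).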
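The four-term expansion is plain distributivity, which a quick numerical check confirms. This is a sketch with random stand-in vectors; the dimension `d = 4` and the NumPy setup are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x_i, x_j = rng.normal(size=d), rng.normal(size=d)       # content embeddings
pe_i, pe_j = rng.normal(size=d), rng.normal(size=d)     # stand-ins for PE_i, PE_j
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Left-hand side: the absolute-position attention logit.
full = ((x_i + pe_i) @ W_Q) @ ((x_j + pe_j) @ W_K)

# Right-hand side: the four terms of the expansion.
terms = [
    (x_i @ W_Q) @ (x_j @ W_K),     # content-content
    (x_i @ W_Q) @ (pe_j @ W_K),    # content-position
    (pe_i @ W_Q) @ (x_j @ W_K),    # position-content
    (pe_i @ W_Q) @ (pe_j @ W_K),   # position-position
]
matches = bool(np.isclose(full, sum(terms)))
```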

python
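# Absolute vs relative attention (single head, no batch dimension)
import torch
import torch.nn.functional as F

# Absolute: position info added to embeddings
def absolute_attention(X, W_q, W_k, W_v, PE):
    # X, PE: [L, d_model]; PE[t] encodes "I am at position t"
    X = X + PE
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / (Q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ V

# Relative: position info injected into attention scores
def relative_attention(X, W_q, W_k, W_v, E_r):
    # E_r: [L, d]; E_r[r] encodes "key is r steps before the query"
    L = X.shape[0]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Content-based scores (same as standard)
    content_scores = Q @ K.T
    # Position-based scores: entry (i, j) = q_i . E_r[i - j]
    idx = torch.arange(L)
    offsets = (idx[:, None] - idx[None, :]).clamp(min=0)  # future keys are masked below
    position_scores = (Q @ E_r.T).gather(1, offsets)
    scores = (content_scores + position_scores) / (Q.shape[-1] ** 0.5)
    # Causal mask: E_r only covers non-negative offsets, so hide future keys
    future = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float('-inf'))
    return F.softmax(scores, dim=-1) @ V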

Why do absolute positional encodings create an unnecessary learning burden for music generation?

Because the same musical pattern at different positions in the sequence gets completely different positional vectors — the model must learn "C followed by E is a third interval" separately for every possible absolute position, rather than learning it once as a relative relationship Because absolute positions use too many parameters Because music doesn't have any positional structure

Chapter 3: The Relative Attention Mechanism

Let's formalize relative attention. In standard attention, the score between query at position i and key at position j is:

e_ij = q_i^T k_j

Shaw et al. (2018) extended this by adding a learned relative position embedding a_ij that depends on the distance (i - j):

e_ij = q_i^T k_j + q_i^T a_ij

Where a_ij = E_r[clip(i-j, -K, K)] is a learned d-dimensional embedding for the relative offset (i-j), clipped to a maximum distance K. The Music Transformer uses this two-term form. A closely related four-term decomposition, introduced for Transformer-XL (Dai et al., 2019), makes the roles of content and position explicit and is the version analyzed in the rest of this chapter:

e_ij = x_iW_Q(x_jW_K)^T + x_iW_Q(E_r[i-j])^T + u(x_jW_K)^T + v(E_r[i-j])^T

Let's understand each term:

Term	Name	Depends on	What it captures
x_iW_Q(x_jW_K)^T	Content-Content	Content at i and j	"These two notes are related" (same as standard attention)
x_iW_Q(E_r[i-j])^T	Content-Position	Content at i, offset (i-j)	"This query content likes keys that are 3 steps back"
u(x_jW_K)^T	Global content bias	Content at j	"This key content is generally important" (content saliency)
v(E_r[i-j])^T	Global position bias	Offset (i-j)	"Keys 1 step back are generally important" (recency bias)

The vectors u and v are learned global bias vectors (shared across all queries), replacing the absolute position encoding of the query. E_r is a lookup table of learned embeddings indexed by the relative distance (i-j).
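To see the four terms side by side, here is a deliberately naive NumPy sketch that computes each one as its own [L, L] matrix. It assumes the causal case (offsets 0 to L-1), gathers the embeddings explicitly, and uses random stand-in weights; the function name and shapes are illustrative.

```python
import numpy as np


def four_term_scores(X, W_Q, W_K, E_r, u, v):
    """Return the four [L, L] score terms; only entries with j <= i are meaningful.

    E_r[r] is the embedding for offset r = i - j (0 is the query's own position).
    """
    L = X.shape[0]
    q, k = X @ W_Q, X @ W_K
    i, j = np.indices((L, L))
    rel = E_r[np.clip(i - j, 0, L - 1)]                 # [L, L, d], naive gather
    content_content = q @ k.T
    content_position = np.einsum("id,ijd->ij", q, rel)
    global_content = np.broadcast_to(u @ k.T, (L, L))   # same row for every query
    global_position = rel @ v                           # constant along diagonals
    return content_content, content_position, global_content, global_position


rng = np.random.default_rng(0)
L, d = 5, 4
X = rng.normal(size=(L, d))
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
E_r = rng.normal(size=(L, d))
u, v = rng.normal(size=d), rng.normal(size=d)
cc, cp, gc, gp = four_term_scores(X, W_Q, W_K, E_r, u, v)
```

The explicit `rel` tensor has shape [L, L, d], which is fine for L = 5 but becomes the memory problem discussed later in this chapter.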

Four Components of Relative Attention

The attention score between query i and key j is the sum of four terms. Toggle each term on/off to see its contribution to the total attention pattern. The heatmap shows a 6-position sequence — brighter = higher attention score.

The Content-Position term is the key innovation for music. It lets the model learn that "when the current event is a NOTE_ON, pay strong attention to events a particular number of steps back." When the texture is regular, a fixed number of events corresponds roughly to a beat or a bar, so this term can capture rhythmic patterns: the model learns to attend at musically meaningful intervals without being told what those intervals are.

Why four terms, not two?

You might wonder: why not just replace Q·K^T with (Q + position)·(K + position)^T? The decomposition into four terms happens naturally from this expansion, but the key insight is that the position-dependent terms should use different representations than the content-dependent terms. In the full formulation:

Content keys

x_jW_K — what does position j contain?

Position embeddings

E_r[i-j] — how far away is position j?

These serve different roles: content keys encode what is at a position, while position embeddings encode where it is relative to the query. Separating them lets the model independently learn content-based and position-based attention patterns. A head might learn to attend to "the closest NOTE_ON event" (position-based) regardless of which note it is, or to "any occurrence of middle C" (content-based) regardless of where it appears.

The memory cost problem

Here's the catch: a naive implementation of relative attention requires storing the full L × L × d tensor of relative position embeddings, where L is the sequence length and d is the head dimension. For a 2048-token music sequence with d = 64:

Naive storage: L² × d = 2048² × 64 ≈ 268M entries per head

This is O(L²D) memory, a factor of D worse than the O(L²) of the standard attention matrix. With 8 heads, that's over 2 billion entries (about 8 GiB in 32-bit floats) just for the gathered relative position embeddings. This made the approach impractical for long sequences... until the skewing trick.
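The numbers above follow from straightforward multiplication. A short sketch, assuming 32-bit floats for the byte count:

```python
def naive_relative_entries(seq_len, head_dim, n_heads=1):
    """Entries in the gathered [L, L, d] tensor of relative embeddings."""
    return seq_len * seq_len * head_dim * n_heads


per_head = naive_relative_entries(2048, 64)
all_heads = naive_relative_entries(2048, 64, n_heads=8)
gib_fp32 = all_heads * 4 / 2**30      # 4 bytes per float32 entry
table_entries = 2048 * 64             # the L x d embedding table itself
```

The embedding table that actually needs to be learned is only `table_entries` values per head; the blow-up comes entirely from gathering one embedding per (query, key) pair.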

What does the "Content-Position" term in relative attention capture that standard attention cannot?

It captures distance-dependent patterns, "when this query content appears, attend strongly to keys at a specific relative distance," allowing the model to learn that NOTE_ON events should attend to events a fixed distance back, regardless of absolute position in the sequence

It makes attention faster to compute

It stores the key content at each position

Chapter 4: The Skewing Trick

This is the key technical contribution of the Music Transformer paper. The naive computation of relative attention first gathers an intermediate tensor R of shape [L, L, D], where R[i, j] = E_r[i-j], and then multiplies each query q_i with its own row of embeddings. Materializing R requires O(L²D) memory.

Huang et al. observed that R has a Toeplitz-like structure: the embedding used at entry (i, j) depends only on (i-j), so every entry on the same diagonal uses the same embedding. That means there are only L distinct embeddings in the causal case, and the products with all of them can be computed in a single matrix multiplication Q E_r^T. What remains is to rearrange that product so each entry lands in the right column, which is done with an elegant reshaping trick.

Step 1: Compute Q E_r^T efficiently

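Instead of creating L separate embeddings for each query, we use the fact that the relative offset (i-j) ranges from 0 to L-1 (in the causal case). We only need L unique embeddings, stored in a matrix of shape [L, d]. The storage order matters for the next step: row k of the stored matrix holds the embedding for offset L-1-k, so the most distant offset comes first and offset 0 (the query's own position) comes last.

Compute the product S_rel = Q E_r^T, which has shape [L, L]:

S_rel[i, k] = q_i^T E_r[L-1-k] where k ranges from 0 to L-1 and E_r[r] denotes the embedding for offset r

But we need S_rel[i, j] = q_i^T E_r[i-j]: column j should index the key position, not the stored row. The entry we want for (i, j) currently sits in column k = L-1-(i-j), so each row has to be shifted by a different amount.

Step 2: Skew the matrix

The trick: pad S_rel with one column of zeros on the left, reshape it from [L, L+1] to [L+1, L], then slice off the first row. In the resulting [L, L] matrix, every entry on or below the diagonal (j ≤ i) equals q_i^T E_r[i-j]. Entries above the diagonal contain leftover values, but they correspond to future keys and are removed by the causal mask.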

The Skewing Trick Step by Step

Watch the matrix transform through the skewing operation. Step 1: the raw Q·E_r^T matrix (wrong indexing). Step 2: pad with zeros. Step 3: reshape. Step 4: slice to get the correct relative-position attention matrix. Click "Next Step" to advance.


Why the skewing trick works: Each padded row has L+1 entries, but the reshaped rows have only L, so every row boundary slips by one position relative to the row above. The last row ends up unshifted, the row above it is shifted by one position, and row 0 is shifted by L-1 positions. This converts "column index = row of the stored embedding matrix" into "column index = key position j, with the correct relative offset (i-j)" for every causal entry (j ≤ i). It is nearly free: a pad, a reshape, and a slice, with no extra multiply-adds.
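The claim can be verified against a naive gather. This NumPy sketch assumes the embeddings are stored most-distant-first (the last row is offset 0) and compares only the causal entries, since the entries above the diagonal are masked anyway. Names such as `E_by_offset` are illustrative.

```python
import numpy as np


def skew(raw):
    """Pad one zero column on the left, reshape [L, L+1] -> [L+1, L], drop row 0."""
    L = raw.shape[0]
    padded = np.pad(raw, ((0, 0), (1, 0)))
    return padded.reshape(L + 1, L)[1:]


rng = np.random.default_rng(0)
L, d = 6, 4
Q = rng.normal(size=(L, d))
E_by_offset = rng.normal(size=(L, d))   # E_by_offset[r] = embedding for offset r
E_stored = E_by_offset[::-1]            # most distant first: row L-1 is offset 0

skewed = skew(Q @ E_stored.T)

# Naive reference: look up the embedding for every causal (i, j) pair explicitly.
reference = np.array([[Q[i] @ E_by_offset[i - j] if j <= i else 0.0
                       for j in range(L)] for i in range(L)])
causal = np.tril(np.ones((L, L), dtype=bool))
agrees = bool(np.allclose(skewed[causal], reference[causal]))
```

If the embeddings were stored nearest-first instead, the same pad-and-reshape would place the wrong offsets on each diagonal, which is why the storage order has to be fixed up front.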

A worked example with 4 positions

Let's trace through with L=4. Write e_k for row k of the stored embedding matrix, ordered most-distant-first: e₃ is offset 0, e₂ is offset 1, e₁ is offset 2, and e₀ is offset 3. The raw matrix Q · E_r^T has entry (i,k) = q_i · e_k, abbreviated qiek below:

Raw = [[q0e0, q0e1, q0e2, q0e3],
         [q1e0, q1e1, q1e2, q1e3],
         [q2e0, q2e1, q2e2, q2e3],
         [q3e0, q3e1, q3e2, q3e3]]

We want entry (i,j) to use the embedding for offset i-j, which is e_(3-(i-j)). So position (0,0) should use e₃, position (1,0) should use e₂, position (2,1) should use e₂, and so on. First pad a zero column on the left:

Padded = [[0, q0e0, q0e1, q0e2, q0e3],
             [0, q1e0, q1e1, q1e2, q1e3],
             [0, q2e0, q2e1, q2e2, q2e3],
             [0, q3e0, q3e1, q3e2, q3e3]]

Reshape [4, 5] → [5, 4] by reading elements in row-major order and filling the new shape:

Reshaped = [[0, q0e0, q0e1, q0e2],
                [q0e3, 0, q1e0, q1e1],
                [q1e2, q1e3, 0, q2e0],
                [q2e1, q2e2, q2e3, 0],
                [q3e0, q3e1, q3e2, q3e3]]

Slice off row 0:

Skewed = [[q0e3, 0, q1e0, q1e1],
             [q1e2, q1e3, 0, q2e0],
             [q2e1, q2e2, q2e3, 0],
             [q3e0, q3e1, q3e2, q3e3]]

On and below the diagonal, entry (i,j) is q_i · e_(3-(i-j)): the diagonal uses e₃ (offset 0), the first subdiagonal uses e₂ (offset 1), and the bottom-left corner uses e₀ (offset 3). The entries above the diagonal are leftovers from the padding and from the following row; the causal mask discards them.

The skew adds no multiply-adds: it costs one padded copy of the [L, L] matrix plus a reshape and a slice, which is negligible next to the matrix product that produced S_rel.

Multi-head relative attention

In the multi-head setting, each head has its own relative position embeddings E_r. This lets different heads specialize: for illustration, one head might learn to attend at beat-sized offsets (say, every 8 positions), another at bar-sized offsets (every 32 positions), and another at phrase-sized offsets (every 128 positions). The skewing trick is applied independently per head.

python
# Multi-head relative attention with skewing
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeMultiHeadAttn(nn.Module):
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        # Per-head relative position embeddings, stored most-distant-first:
        # row max_len-1 is offset 0, row 0 is offset max_len-1
        self.E_r = nn.Parameter(
            torch.randn(n_heads, max_len, self.d_k)
        )
        # Global bias vectors (u, v)
        self.u = nn.Parameter(torch.randn(n_heads, self.d_k))
        self.v = nn.Parameter(torch.randn(n_heads, self.d_k))

    def forward(self, x):
        B, L, D = x.shape  # requires L <= max_len
        # Project to Q, K, V  [B, L, H, d_k]
        Q = self.W_q(x).view(B, L, self.n_heads, self.d_k)
        K = self.W_k(x).view(B, L, self.n_heads, self.d_k)
        V = self.W_v(x).view(B, L, self.n_heads, self.d_k)

        # Content-content plus global content bias: (Q + u) @ K.T
        Qu = Q + self.u  # broadcast u over B, L
        S_content = torch.einsum('blhd,bmhd->bhlm', Qu, K)

        # Content-position plus global position bias: (Q + v) @ E_r.T, then skew
        Qv = Q + self.v
        E = self.E_r[:, -L:]  # [H, L, d_k], offsets L-1 down to 0
        S_pos = torch.einsum('blhd,hmd->bhlm', Qv, E)
        S_pos = self._skew(S_pos)  # apply skewing trick

        # Combine, mask future keys, softmax
        scores = (S_content + S_pos) / (self.d_k ** 0.5)
        future = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(future, float('-inf'))
        weights = F.softmax(scores, dim=-1)  # [B, H, L, L]

        # Weighted sum of values, then merge heads back to [B, L, D]
        out = torch.einsum('bhlm,bmhd->blhd', weights, V)
        return self.W_o(out.reshape(B, L, D))

    def _skew(self, S):
        # S: [B, H, L, L] → skewed [B, H, L, L]
        B, H, L, _ = S.shape
        S = F.pad(S, (1, 0))  # zero column on the left: [B, H, L, L+1]
        S = S.reshape(B, H, L+1, L)
        S = S[:, :, 1:, :]  # drop first row: [B, H, L, L]
        return S

Memory savings

Approach	Memory for relative positions	For L=2048, d=64
Naive	O(L²D) — full tensor	268M floats
Skewing trick	O(LD) — only L embeddings needed	131K floats
Savings	L× reduction	2048× less memory

python
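import torch
import torch.nn.functional as F

def relative_attention_skew(Q, K, V, E_r):
    # Q, K, V: [L, d]
    # E_r: [L, d] relative position embeddings, stored most-distant-first
    #      (row L-1 is offset 0, row 0 is offset L-1)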
    L, d = Q.shape

    # Content-to-content scores (standard attention)
    S_content = Q @ K.T  # [L, L]

    # Content-to-position scores (needs skewing)
    S_rel = Q @ E_r.T  # [L, L] — but wrong indexing!

    # THE SKEWING TRICK:
    # Step 1: Pad with zero column on left
    S_rel = F.pad(S_rel, (1, 0))  # [L, L+1]

    # Step 2: Reshape to shift rows
    S_rel = S_rel.reshape(L + 1, L)  # [L+1, L]

    # Step 3: Slice off first row
    S_rel = S_rel[1:]  # [L, L] — now correctly indexed!

    # Combine and scale
    scores = (S_content + S_rel) / (d ** 0.5)

    # Apply causal mask
    mask = torch.triu(torch.ones(L, L), diagonal=1).bool()
    scores.masked_fill_(mask, float('-inf'))

    # Standard softmax and value aggregation
    weights = F.softmax(scores, dim=-1)
    return weights @ V  # [L, d]

What does the skewing trick achieve, and how does it work?

It converts the Q·E_r^T matrix (where column index = position in E_r) into the correct relative attention matrix (where column index = key position j) by padding with zeros and reshaping, which shifts each row by a different amount. This reduces memory from O(L²D) to O(LD), a 2048x savings for typical music sequences.

It makes the attention matrix smaller by removing unneeded entries

It replaces matrix multiplication with a faster operation

Chapter 5: Generation Results

The Music Transformer was trained on the J.S. Bach Chorales dataset (382 four-part chorales) and the Piano-e-Competition dataset (1,573 virtuoso piano performances). The results demonstrated a clear advantage for relative attention in generating music with long-term structure.

Quantitative results

Model	NLL (nats, lower=better)	Long-range coherence
LSTM baseline	5.67	Poor — wanders after ~10s
Transformer (absolute position)	5.52	Medium — local coherence, weak repetition
Music Transformer (relative)	5.36	Best — themes repeat, motifs develop
Transformer XL	5.49	Good — but requires segment-level recurrence

The negative log-likelihood improvement from 5.52 to 5.36 might seem small, but NLL is averaged per event, so the gap accumulates: 0.16 nats per event over a 2,000-event piece is a difference of about 320 nats in total log-likelihood. More importantly, the qualitative difference was dramatic.

Training details

The Music Transformer was trained on the Piano-e-Competition dataset — 1,573 performances by virtuoso pianists, totaling ~172 hours of music. Each performance was converted to the MIDI event representation, giving sequences of 1,000-3,000 events. Training used:

Hyperparameter	Value
d_model	256
Attention heads	8
Layers	6
Head dimension	32 (256/8)
FFN inner dimension	1024
Max sequence length	2048
Optimizer	Adam
Learning rate	Noam schedule, warmup 4000
Dropout	0.1
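For reference, the Noam schedule named in the table is the warmup-then-decay rule from the original Transformer paper. The sketch below plugs in the table's d_model and warmup values; whether the training run applied an extra scale factor is not stated, so none is assumed.

```python
def noam_lr(step, d_model=256, warmup=4000):
    """Learning rate at a given step (step >= 1) under the Noam schedule."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


early = noam_lr(100)      # linear warmup region
peak = noam_lr(4000)      # the rate peaks at step == warmup
late = noam_lr(40000)     # inverse-square-root decay region
```

With these settings the peak learning rate is just under 1e-3.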

The model is relatively small by modern standards (~10M parameters), but this was sufficient because the vocabulary is tiny (388 events) and the task is well-structured. Larger models didn't significantly improve quality — the bottleneck was training data, not model capacity.
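A back-of-envelope count gives a feel for where the parameters sit. This sketch counts weight matrices only and assumes per-head, per-layer relative embeddings and an untied output projection; those choices are assumptions, so the total is a rough estimate rather than the paper's figure.

```python
def approx_param_count(vocab=388, d_model=256, n_heads=8, n_layers=6,
                       d_ff=1024, max_len=2048):
    """Weight-matrix parameters only; biases and LayerNorm are ignored."""
    d_k = d_model // n_heads
    attention = 4 * d_model * d_model      # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff               # two feed-forward matrices
    relative = n_heads * max_len * d_k     # assumed per-head, per-layer E_r
    embeddings = 2 * vocab * d_model       # input embedding + output projection
    return n_layers * (attention + ffn + relative) + embeddings


total = approx_param_count()
```

Under these assumptions the total is about 8M, the same order of magnitude as the figure quoted above, and the vocabulary-dependent part is only about 0.2M of it.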

Evaluation methodology

Evaluating music generation is notoriously difficult — there's no BLEU score for music. The paper used three evaluation methods:

Negative Log-Likelihood

Quantitative — how well does the model predict held-out real performances? Lower = better density estimation.

↓

Human Listening Study

Qualitative — human evaluators compare pairs of generated pieces. Which sounds more musical, structured, coherent?

↓

Attention Analysis

Interpretive — what patterns do the attention heads learn? Do they discover musically meaningful structures?

The human evaluation was critical. NLL measures density estimation quality, but a model with good NLL might generate boring, repetitive music (the "safe average"). The listening study confirmed that the Music Transformer's outputs were not just statistically good but actually musical — with structure, development, and emotional arc.

The listening study protocol: Evaluators were presented with pairs of 30-second excerpts (one from Music Transformer, one from a baseline) and asked "Which is more musical?" Across 800 comparisons, Music Transformer was preferred over LSTM 70% of the time and over absolute-position Transformer 60% of the time. The gap was largest for pieces longer than 1 minute, where long-range structure becomes most apparent.

Qualitative analysis

Human evaluators consistently preferred the Music Transformer's outputs. The key differences:

Repetition with variation. The Music Transformer produced phrases that returned with subtle changes — exactly how human composers write. A 4-bar melody would appear in bars 1-4, return in bars 9-12 with slight rhythmic variation, and appear again in bars 25-28 transposed to a new key. The absolute-position Transformer rarely achieved this.

Rhythmic consistency. The generated pieces maintained consistent rhythmic patterns (8th-note patterns, waltz rhythms, etc.) over long spans. The LSTM would start with a clear rhythm but gradually drift into irregular timing.

Harmonic coherence. The Music Transformer's pieces stayed in a key and followed conventional harmonic progressions (I-IV-V-I patterns) for much longer stretches than competing models.

Attention Pattern Analysis

What the Music Transformer actually learns to attend to. Relative attention heads develop specialized patterns: some attend to the beat (regular intervals), some to the melody (pitch-similar events), some to harmony (events within the same chord). Click each head type to see its pattern.

Self-attention visualized on real music

The paper includes attention visualizations on real performances from the Piano-e-Competition dataset. These visualizations confirm that the model learns musically meaningful patterns:

Observation	What the model learned
Strong diagonal lines at regular intervals	Beat structure — the model attends to events at beat positions
Vertical stripes at specific positions	Key events (e.g., first notes of phrases) that provide global context
Diagonal bands of varying width	Rhythmic patterns of different note durations
Sparse distant attention	Theme recurrence — later events attending to matching earlier phrases

The attention patterns reveal something fascinating: different heads specialize for different musical functions. Some heads learn a strong diagonal pattern (attending to events at regular beat intervals — discovering the meter without being told). Other heads attend to events with similar pitch content (tracking the melody). Still others attend to the most recent events (local context for harmony). This emergent specialization is possible precisely because relative position encodings allow the model to express "attend to events N steps back" as a learned pattern.
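One simple way to probe a head for this kind of pattern is to average its attention weights along each diagonal, giving a profile of attention by offset. This is an illustrative analysis sketch on a synthetic attention matrix, not the visualization method used in the paper.

```python
import numpy as np


def attention_by_offset(weights):
    """Average attention weight at each causal offset r = i - j."""
    L = weights.shape[0]
    return np.array([np.diagonal(weights, offset=-r).mean() for r in range(L)])


# Synthetic head that always looks exactly 2 events back (when it can).
L = 8
weights = np.zeros((L, L))
for i in range(L):
    weights[i, max(i - 2, 0)] = 1.0

profile = attention_by_offset(weights)
dominant_offset = int(profile.argmax())
```

A head that tracks a regular pulse would show peaks at multiples of the pulse length in `profile`, while a head focused on local context would put most of its mass at small offsets.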

python
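# The Music Transformer architecture
import torch.nn as nn

class RelativeAttentionBlock(nn.Module):
    # Decoder block in an assumed pre-norm layout; reuses the
    # RelativeMultiHeadAttn class from Chapter 4
    def __init__(self, d_model, n_heads, max_len, d_ff=1024, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = RelativeMultiHeadAttn(d_model, n_heads, max_len)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, h):
        h = h + self.drop(self.attn(self.ln1(h)))
        return h + self.drop(self.ffn(self.ln2(h)))

class MusicTransformer(nn.Module):
    def __init__(self, vocab=388, d_model=256,
                 n_heads=8, n_layers=6, max_len=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        # Each block owns its relative position embeddings (inside its attention)
        self.layers = nn.ModuleList([
            RelativeAttentionBlock(d_model, n_heads, max_len)
            for _ in range(n_layers)
        ])
        self.ln = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x):
        # x: [B, L] event token ids
        h = self.embed(x)       # [B, L, D]
        # NO absolute position added!
        for layer in self.layers:
            h = layer(h)
        h = self.ln(h)
        return self.head(h)    # [B, L, 388] logits over the event vocabulary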

What kinds of specialized attention patterns did different heads in the Music Transformer learn?

Some heads learned to attend at regular beat intervals (discovering the meter), others attended to pitch-similar events (tracking melody), and others focused on recent events (local harmony): an emergent specialization enabled by relative position encodings
All heads learned the same uniform attention pattern
The heads learned to attend only to the first event in the sequence

Chapter 6: Music Structure Explorer

Let's explore how the Music Transformer captures musical structure with an interactive simulation. This is the showcase — you'll see how relative attention enables the model to generate coherent patterns.

Interactive Music Structure Visualizer

A simulated 32-step musical sequence shown as a piano roll. The model generates each step by attending to previous steps (shown as attention weights below). Relative attention creates recurring patterns: the warm highlights show which past events influence the current generation step.

What to look for: (1) As repetition strength increases, the generated pattern increasingly mirrors a motif from 8 steps back — this is the relative attention learning to "copy from one bar ago." (2) Notice how the attention weights (shown as bars below the piano roll) peak at regular intervals — the model discovers the beat structure. (3) The generated notes stay within a consistent pitch range — harmonic coherence from attending to recent note choices.

Why this matters for musical structure

The three levels of musical structure that relative attention captures:

Level	Time Scale	What Attention Captures	Example
Local	1-8 events	Note-to-note transitions, chord voicing	C-E-G chord pattern
Phrase	8-32 events	Melodic motifs, rhythmic patterns	A 4-bar melody repeating
Section	32-200 events	Theme recurrence, key changes	ABA form, verse-chorus

The LSTM can handle local structure well. The Transformer with absolute position handles local and some phrase-level structure. The Music Transformer with relative attention handles all three levels — and this is what makes its generated music sound structured rather than wandering.

Why relative attention creates repetition

Here's the mechanism: when the model learns that "a NOTE_ON event should attend strongly to the event 32 steps back" (via the Content-Position term in the attention score), it creates a natural copying mechanism. If the event 32 steps back was a C, the model is biased toward producing a C again. If the attention pattern says "attend at offset -32 and offset -64," the model creates a repeating pattern with period 32. This is exactly how musical phrases repeat — and the model discovers it without being told anything about musical structure.
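
The copying mechanism can be sketched in a few lines. This is a deliberately simplified toy, not the paper's model: it keeps only a position-dependent score (a hypothetical learned bias per offset) and drops the content term, and the names `next_note` and `offset_bias` are assumptions for illustration. A period of 4 is used instead of 32 to keep the output short.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def next_note(history, offset_bias):
    """Pick the next note by attending over `history` with scores that
    depend only on the relative offset (1 = the previous event)."""
    t = len(history)
    # history[0] is t steps back, history[-1] is 1 step back;
    # offsets without a learned bias score 0
    scores = [offset_bias.get(off, 0.0) for off in range(t, 0, -1)]
    weights = softmax(scores)
    # Each past note votes for itself with its attention weight
    votes = {}
    for note, w in zip(history, weights):
        votes[note] = votes.get(note, 0.0) + w
    return max(votes, key=votes.get)


motif = ["C4", "E4", "G4", "E4"]
bias = {4: 5.0, 8: 5.0}  # hypothetical bias: "look 4 and 8 steps back"
sequence = list(motif)
for _ in range(8):
    sequence.append(next_note(sequence, bias))
print(sequence)  # the 4-note motif repeated three times
```

Because the bias is indexed by offset rather than by absolute position, the same two numbers produce the repetition wherever the motif happens to sit in the sequence.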

What each model produces over time

The best way to understand the Music Transformer's advantage is to compare outputs at different time horizons:

Model	First 15 seconds	30-60 seconds	60-120 seconds
LSTM	Clear rhythm, melody	Rhythm drifts, melody wanders	Formless noodling
Transformer (abs)	Clear rhythm, melody	Rhythm steady, locally coherent	No repetition, new material only
Music Transformer	Clear rhythm, melody	Rhythm steady, themes develop	Themes return, feel composed

The difference at the 60-120 second mark is the most telling. LSTMs have completely lost the thread. Absolute-position Transformers produce locally coherent music but without structural backbone. Only the Music Transformer with relative attention generates music that sounds like it has a plan — because relative attention enables the "copy from N steps back" behavior that creates repetition and development.

Emergence of musical form: The most remarkable finding of the paper is that attention heads independently discover musical concepts like beats, bars, and phrases through training on raw MIDI data. No musical knowledge is baked into the architecture — only the capacity for relative pattern matching. The structure emerges from the statistics of music itself. This is a powerful demonstration of the Transformer's ability to discover abstract patterns from raw sequential data.

Temperature and sampling

During generation, the model's output logits are divided by a temperature parameter before softmax. A temperature below 1.0 (e.g., 0.7-0.9) sharpens the distribution, favoring the most likely events and producing safer, more conventional music. A temperature of 1.0 samples from the model's distribution unchanged, and values above 1.0 (e.g., 1.1-1.3) flatten it, adding randomness that yields more surprising but potentially less coherent output. The paper found that temperature around 1.0 gave the best balance of coherence and creativity for piano performances.

python
# Generating music with the Music Transformer
import torch
import torch.nn.functional as F

@torch.no_grad()  # sampling needs no gradients
def generate(model, primer, length, temperature=1.0, max_len=2048):
    # primer: initial MIDI event token ids (e.g., first bar)
    model.eval()
    generated = list(primer)

    for _ in range(length):
        # Feed only the most recent max_len events (the context window)
        x = torch.tensor([generated[-max_len:]])  # [1, T]
        logits = model(x)[0, -1]     # [vocab], last position

        # Temperature scaling (temperature must be > 0)
        probs = F.softmax(logits / temperature, dim=-1)

        # Sample from distribution
        next_event = torch.multinomial(probs, 1).item()
        generated.append(next_event)

    return generated  # List of MIDI event tokens
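
To see what the temperature does to the sampling distribution, here is a small standalone sketch in plain Python. The four logits are made-up toy values, and `softmax_with_temperature` and `entropy` are illustrative helpers rather than anything from the paper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.5, 0.0]  # toy scores for four candidate events
for t in (0.7, 1.0, 1.3):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top prob {max(probs):.3f}, entropy {entropy(probs):.3f}")
```

As the temperature rises, the top probability falls and the entropy grows, which is the "more random, less safe" behavior described above.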

What are the three levels of musical structure, and which models can capture each?

Local (note transitions, 1-8 events), phrase (melodic motifs, 8-32 events), and section (theme recurrence, 32-200 events). LSTMs handle local; absolute-position Transformers handle local + some phrase; relative-attention Music Transformer handles all three, enabling coherent long-range structure.
Melody, harmony, and rhythm, all captured equally by LSTMs
Pitch, duration, and volume, only captured by convolutional models

Chapter 7: Connections — From Music to Modern Audio AI

The Music Transformer's contributions extend far beyond music generation. Relative position representations became a standard component of modern Transformers, and music generation evolved into a broader field of audio AI.

The lineage of relative position

Paper	Year	Position Approach	Used in
Original Transformer	2017	Sinusoidal absolute	Translation (BERT later used learned absolute embeddings)
Shaw et al.	2018	Learned relative (clipped)	Translation improvements
Music Transformer	2019	Relative + skewing trick	Music generation
Transformer-XL	2019	Relative + segment recurrence	Language modeling (XLNet)
RoPE (Su et al.)	2021	Rotary position embedding	Llama, GPT-NeoX, modern LLMs
ALiBi (Press et al.)	2022	Linear bias (no learned params)	BLOOM, MPT

Relative position largely won. Most major open LLMs today use some form of relative position encoding. RoPE (Rotary Position Embeddings), used in Llama, Gemma, Mistral, and most open-source LLMs, builds on the same relative attention idea: it encodes relative position through rotation of the query and key vectors. The Music Transformer helped establish that relative position works better than absolute for tasks that require recognizing the same pattern at different positions.

Evolution of music AI

System	Year	Modality	Key Innovation
Music Transformer	2019	Symbolic (MIDI)	Relative attention for structure
Jukebox (OpenAI)	2020	Raw audio (waveform)	VQ-VAE + Transformer on audio tokens
MusicLM (Google)	2023	Raw audio	Hierarchical audio tokens, text-conditioned
MusicGen (Meta)	2023	Raw audio	Codebook interleaving, efficient generation
Stable Audio (Stability)	2023	Raw audio	Latent diffusion for long-form audio
Suno v3	2024	Raw audio + vocals	Full song generation with lyrics

Evolution of Music AI

The progression from symbolic MIDI generation to full audio generation.

What changed

The biggest shift: moving from symbolic (MIDI) to raw audio. The Music Transformer generates MIDI events — a score, like sheet music. To hear it, you need a synthesizer. Modern models like MusicLM and MusicGen generate raw audio directly — including timbre, dynamics, and even vocals. This required new tokenization schemes (VQ-VAE codebooks instead of MIDI events) and much more compute.

The tokenization revolution

The progression in audio tokenization mirrors the progression from character-level to subword-level models in NLP:

Approach	Token Type	Tokens per second	Quality
Music Transformer (2019)	MIDI events	~10-50	Score only (need synth)
Jukebox VQ-VAE (2020)	Discrete audio codes	~340	Raw audio (44.1kHz)
SoundStream/EnCodec (2021-2022)	Multi-codebook codes	~50 per codebook	High quality at low bitrate
w2v-BERT tokens (2023)	Semantic audio tokens	~25	Semantic, not acoustic

The key innovation: neural audio codecs like SoundStream (Google) and EnCodec (Meta) that compress audio into discrete tokens at ~50 tokens/second — compact enough for a Transformer to model. MusicLM and MusicGen both build on these codecs, using the Transformer to model sequences of audio tokens rather than MIDI events.
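
A rough calculation shows why the token rate matters so much. The sketch below uses the approximate rates from the table above and asks how many seconds of music fit into a fixed context of 2048 tokens. The helper `seconds_covered` is hypothetical, and the 4-codebook setting is an assumed example of flattening several codebooks into one stream, not a claim about any specific model's configuration.

```python
CONTEXT_TOKENS = 2048

# Approximate tokens per second, taken from the table above
TOKENS_PER_SECOND = {
    "MIDI events (sparse passage)": 10,
    "MIDI events (dense passage)": 50,
    "Jukebox VQ-VAE codes": 340,
    "Neural codec, one codebook": 50,
    "Semantic tokens": 25,
}


def seconds_covered(context_tokens, tokens_per_second, n_codebooks=1):
    """Seconds of music that fit in a context of `context_tokens` tokens."""
    return context_tokens / (tokens_per_second * n_codebooks)


for name, rate in TOKENS_PER_SECOND.items():
    print(f"{name}: {seconds_covered(CONTEXT_TOKENS, rate):.1f} s")

# Assumed example: 4 codebooks flattened into a single token stream
print(f"{seconds_covered(CONTEXT_TOKENS, 50, n_codebooks=4):.1f} s")
```

At roughly 340 tokens per second the same context covers only about six seconds of audio, which is why compact codecs and hierarchical or interleaved codebook schemes were needed before Transformers could model raw audio over musically useful spans.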

Limitations and what came after

The Music Transformer had notable limitations:

Limitation	Impact	How later work addressed it
Piano only	Can't generate orchestral or vocal music	Jukebox: multi-instrument raw audio
MIDI output	Need synthesizer to hear result	MusicGen: direct audio generation
~2 min max	Context window limits piece length	Segment-level recurrence, infinite generation
No conditioning	Can't specify genre, mood, style	MusicLM: text-conditioned generation
No evaluation standard	Hard to compare across papers	FAD, CLAP scores for audio quality

The open question: Can we combine the Music Transformer's structural understanding (learned from symbolic MIDI) with modern audio generators' sonic richness? A model that truly understands musical form — themes, development, recapitulation — while generating full audio with realistic timbre and expression would be a major breakthrough. Current text-to-music models generate impressively realistic sound but still struggle with the kind of long-range structure the Music Transformer demonstrated on MIDI. The problem the Music Transformer solved (long-range structure through relative attention) remains relevant even as the output modality has changed.

python
# The progression of music AI architectures

# 2019: Music Transformer
# Input: MIDI events (388 vocab, ~2000 tokens/piece)
# Model: Transformer decoder + relative attention
# Output: MIDI events → render with synthesizer

# 2020: Jukebox
# Input: raw audio → VQ-VAE → discrete codes
# Model: Sparse Transformer (3 levels of hierarchy)
# Output: discrete codes → VQ-VAE decoder → raw audio

# 2023: MusicGen
# Input: text description + (optional) melody
# Model: Transformer decoder over interleaved codebooks
# Output: EnCodec tokens → EnCodec decoder → raw audio

# Key: the Transformer stayed constant; only the
# tokenization evolved (MIDI → VQ-VAE → EnCodec)

But the core architecture remained: an autoregressive Transformer over discrete tokens. The Music Transformer showed that attention could capture long-range musical structure, and much of the work since has scaled that insight to richer audio representations.

The RoPE connection

The most widely used relative position encoding today is Rotary Position Embedding (RoPE, Su et al., 2021). RoPE encodes relative position by rotating the query and key vectors: each pair of dimensions is treated as a point in the complex plane and rotated by an angle proportional to the token's position. The dot product of a rotated query at position i and a rotated key at position j then depends on the positions only through the offset (i-j):

RoPE: (R_i q)^T (R_j k) = q^T R_(j-i) k = f(q, k, i-j)

Where R_t is a block-diagonal rotation matrix that rotates dimension pair m by angle t×θ_m. The key insight: because R_i^T R_j = R_(j-i), the absolute positions cancel and only the difference (i-j) remains. This achieves the same relative position effect as the Music Transformer but more elegantly, with no skewing trick and no separate position embedding table.

Property	Music Transformer	RoPE
Position info	Additive (E_r[i-j] added to score)	Multiplicative (rotation of Q/K)
Extra parameters	L × d learned embeddings	None (uses fixed rotations)
Extrapolation	Limited to trained lengths	Somewhat better, though long contexts usually need frequency rescaling
Implementation	Skewing trick needed	Simple complex multiply
Used in	Music generation	Llama, Gemma, Mistral, GPT-NeoX

python
# RoPE: the modern descendant of relative attention
import torch

def apply_rope(q, k, positions, d, theta=10000):
    # q, k: [B, T, d] with d even (per-head dimension); positions: [T]
    # Rotation frequencies, one per pair of dimensions: [d/2]
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2).float() / d))
    # Rotation angles: [T, d/2]
    angles = positions[:, None].float() * freqs[None, :]
    cos_a, sin_a = torch.cos(angles), torch.sin(angles)
    # Split into (even, odd) pairs and rotate each pair (a complex multiply)
    q1, q2 = q[..., ::2], q[..., 1::2]
    k1, k2 = k[..., ::2], k[..., 1::2]
    # Output layout is [real parts | imaginary parts]; q and k share it,
    # so their dot product is unaffected by the reordering.
    q_rot = torch.cat([q1*cos_a - q2*sin_a, q1*sin_a + q2*cos_a], dim=-1)
    k_rot = torch.cat([k1*cos_a - k2*sin_a, k1*sin_a + k2*cos_a], dim=-1)
    # q_rot[i] · k_rot[j] now depends on positions only through (i - j)
    return q_rot, k_rot
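
The offset property is easy to check numerically. The sketch below is a plain-Python re-implementation for a single vector, written so that it runs without PyTorch. The helper `rope` and the toy vectors are assumptions for illustration, not the batched function above.

```python
import math


def rope(vec, position, theta=10000.0):
    """Rotate consecutive pairs (vec[2m], vec[2m+1]) by position * freq_m."""
    d = len(vec)
    out = []
    for m in range(d // 2):
        freq = 1.0 / (theta ** (2 * m / d))
        angle = position * freq
        x, y = vec[2 * m], vec[2 * m + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (i - j = 3) at two different absolute positions
score_early = dot(rope(q, 5), rope(k, 2))
score_late = dot(rope(q, 205), rope(k, 202))
# A different offset (i - j = 4) for comparison
score_other = dot(rope(q, 5), rope(k, 1))

print(abs(score_early - score_late) < 1e-9)   # True: same offset, same score
print(abs(score_early - score_other) > 1e-3)  # True: offset changes the score
```

Shifting both positions by 200 leaves the score unchanged up to floating-point error, which is the same shift invariance that lets a melody at positions 10-20 and its repeat at 200-210 look alike to the model.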

What remains unsolved

Despite the enormous progress since 2019, several challenges from the Music Transformer era remain open:

Challenge	Status (2025)
True musical form (sonata, fugue)	No model reliably generates multi-minute form
Emotional arc (tension/release)	Models capture mood but not narrative
Controllable structure	Text controls style/mood but not form
Multi-instrument orchestration	Raw audio models don't understand parts
Musical understanding	Models generate plausible sounds but may not "understand" harmony

The Music Transformer showed that relative attention could capture phrase-level repetition (~30 seconds). Getting to movement-level structure (~5 minutes) and piece-level form (~20 minutes) remains an open challenge. The attention mechanism can theoretically handle these timescales if the context window is long enough, but training data with clear large-scale structure is scarce, and evaluation is subjective.

Perhaps the most important open question is: does generating coherent musical form require understanding music (in some computational sense), or can it emerge purely from pattern matching on enough training data? The Music Transformer suggests the latter — its relative attention heads discovered beats, phrases, and repetition structure purely from statistical patterns in MIDI data, with no explicit musical knowledge. If this scaling hypothesis holds, then sufficiently large models trained on enough music data might eventually generate pieces with genuine formal coherence. Whether this constitutes "understanding" music is a philosophical question the paper wisely leaves open.

python
# Summary: Music Transformer contributions

# 1. Relative attention for sequences
#    - Replaces absolute PE with relative offset embeddings
#    - Attention(i,j) depends on (i-j), not on i and j separately
#    - Enables pattern recognition invariant to position

# 2. The skewing trick for efficient computation
#    - Reduces memory from O(L²D) to O(LD)
#    - Zero additional compute (just pad + reshape)
#    - Makes relative attention practical for long sequences

# 3. Event-based MIDI representation
#    - 388-token vocabulary captures full piano performance
#    - 1000x compression vs piano roll
#    - Includes dynamics (velocity) and timing nuance

# 4. Proof that attention captures musical structure
#    - Heads discover beats, phrases, harmony automatically
#    - Long-range coherence up to 2 minutes
#    - Preferred by human listeners over LSTM 70% of the time

# 5. Part of the lineage of relative position encodings
#    - Transformer-XL (2019): relative + segment recurrence
#    - RoPE (2021): rotary encoding, used in Llama/Gemma
#    - ALiBi (2022): linear bias, no learned params
#    - Most modern open LLMs use a relative position scheme

# The Music Transformer was a small model (10M params)
# on a small dataset (172 hours of piano).
# But its ideas (relative attention, the skewing trick,
# event-based tokenization) fed into much of the work that
# came after. Sometimes the most important papers are not
# the biggest, but the ones that ask the right question.

The Music Transformer's lasting impact: It showed two things. First, relative position encoding works better than absolute for tasks with translational structure (patterns that can appear at different positions), an idea that carried into later Transformer designs. Second, music is a tractable and revealing testbed for sequence modeling: the structure is audible, making it easy to evaluate whether a model truly captures long-range dependencies. When you listen to a generated piece and hear the theme return, you know the model has learned something real about sequential structure.

How did the Music Transformer's relative attention approach influence modern large language models?

It had no influence: modern LLMs use completely different architectures
It proved that absolute position encoding is better for language
It helped establish that relative position encoding is superior to absolute for pattern recognition across positions, leading to RoPE (Rotary Position Embeddings), now used in Llama, Gemma, Mistral, and most modern LLMs, which encode relative position through rotation of query/key vectors
