Why transformers can't tell order from chaos — and the rotation trick that fixed it, from sinusoidal waves to the RoPE inside every modern LLM.
Take the sentence "The dog bit the man" and rearrange it to "The man bit the dog." Very different meanings. A dog biting a human is Tuesday in the park. A human biting a dog is front-page news. Now feed both sentences to a transformer with no positional encoding. Each word gets the same output vector in both. Exactly the same. The model cannot tell the sentences apart.
This isn't a subtle theoretical concern. It's a catastrophic blind spot baked into the core mechanism of the transformer. Self-attention — the operation that gives transformers their power — treats its input as a set, not a sequence. It has no concept of "first," "second," or "third." Every token might as well arrive simultaneously in a bag.
Let's prove this. Not with hand-waving, but with actual math.
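Here's what self-attention does. Given a sequence of token embeddings $X = [x_1, x_2, \ldots, x_n]$, it computes three matrices: Queries ($Q = XW_Q$), Keys ($K = XW_K$), and Values ($V = XW_V$). Then it computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$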
Now imagine you permute the input. Swap rows of X: put "dog" where "man" was. Q, K, and V are just linear transformations of X, so their rows get permuted in exactly the same way. The attention score matrix $QK^\top$ is a row/column permutation of the original: same values, same weighted sums, just reordered. The final output is the original output with its rows permuted identically.
In math: if P is a permutation matrix and you feed PX instead of X, the output is P · Attention(X). The values are identical — they just come out in the new order. The attention mechanism has zero awareness that you rearranged anything.
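The equivariance claim can be checked numerically with arbitrary projection matrices. The sketch below is illustrative only: the sizes, the random seed, and the helper name `self_attention` are assumptions, not part of any particular model.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (no mask, no position info)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)        # fixed seed, illustrative only
n, d = 5, 8                           # hypothetical sequence length and width
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
P = np.eye(n)[perm]                   # permutation matrix: (P @ X)[i] == X[perm[i]]

out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(P @ X, W_q, W_k, W_v)
equivariant = np.allclose(out_perm, P @ out)
print(equivariant)
```

The check passes for any choice of weights, because the projections act on each row independently and softmax is applied row by row.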
Let's make this concrete with the smallest possible example. Three tokens A, B, C with 2D embeddings. We'll compute self-attention for two different orderings and prove the outputs are identical (up to reordering).
Order 1: A, B, C
X = [[1,0], [0,1], [1,1]]. To keep the arithmetic small, take $W_Q = W_K = W_V = I$, so Q = K = V = X. Compute $QK^\top / \sqrt{2}$:
| | A=[1,0] | B=[0,1] | C=[1,1] |
|---|---|---|---|
| A=[1,0] | 1/√2 = 0.707 | 0/√2 = 0.000 | 1/√2 = 0.707 |
| B=[0,1] | 0/√2 = 0.000 | 1/√2 = 0.707 | 1/√2 = 0.707 |
| C=[1,1] | 1/√2 = 0.707 | 1/√2 = 0.707 | 2/√2 = 1.414 |
Apply softmax row-wise (each row sums to 1, up to rounding):
| | A | B | C |
|---|---|---|---|
| Row A | 0.401 | 0.198 | 0.401 |
| Row B | 0.198 | 0.401 | 0.401 |
| Row C | 0.248 | 0.248 | 0.503 |
Multiply by V (which equals X) to get outputs: out(A) = [0.802, 0.599], out(B) = [0.599, 0.802], out(C) = [0.752, 0.752].
Order 2: B, C, A (permuted input)
X' = [[0,1], [1,1], [1,0]]. The score matrix $Q'K'^\top / \sqrt{2}$:
| | B=[0,1] | C=[1,1] | A=[1,0] |
|---|---|---|---|
| B=[0,1] | 0.707 | 0.707 | 0.000 |
| C=[1,1] | 0.707 | 1.414 | 0.707 |
| A=[1,0] | 0.000 | 0.707 | 0.707 |
Look carefully. Row B in the permuted version has the same values as row B in the original, just in a different column order. Same for every row. After softmax and multiplying by V, the outputs are out(B) = [0.599, 0.802], out(C) = [0.752, 0.752], out(A) = [0.802, 0.599].
Every token gets the exact same output vector regardless of its position in the sequence. The outputs are just reordered to match the new input order. Permutation-equivariance: permuting the input permutes the output, but doesn't change any values.
The simulation below shows two sentences. Drag tokens to rearrange them. With positional encoding OFF, the output values (shown as colored bars) are identical regardless of word order — they just shuffle position. Toggle positional encoding ON, and the outputs genuinely change when you reorder words. Position makes the model care about order.
A few lines of Python prove the point. We compute attention for the original order and a permuted order. The outputs are identical (just reordered).

```python
import numpy as np

def attention(X):
    # Simplification: Q = K = V = X; scale scores by sqrt(d_k)
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # tokens A, B, C
X_perm = X[[1, 2, 0]]  # reorder: B, C, A

out1 = attention(X)
out2 = attention(X_perm)

print("Original :", np.round(out1, 3))
print("Permuted :", np.round(out2, 3))
print("Same values?", np.allclose(out1[[1, 2, 0]], out2))  # True
```

Run it. out1[[1,2,0]] (the original output, reordered) matches out2 exactly. The attention operation literally cannot distinguish order. It operates on a set.
The fix is beautifully simple: add position information to the input embeddings before they enter the attention mechanism. If token "dog" at position 1 has a different vector than "dog" at position 3, the attention scores change. The model can now distinguish "dog bites man" from "man bites dog" because the Q, K, V matrices see different vectors depending on where each token sits.
The question is: what position information should we add? An integer? A learned vector? A pattern of waves? The answer — and its surprising elegance — is what the rest of this lesson is about.
We need to give each position a unique identity. The simplest idea: just use the integer position itself. Position 0 gets 0, position 1 gets 1, position 512 gets 512. But this creates a problem — position 512 is a huge number that would completely dominate the token embedding (which typically has values between -1 and 1). The model would pay more attention to position than content.
What about normalizing? Divide by the max length, so positions range from 0 to 1. Now the problem flips: the values are bounded, but with a max length of 500, adjacent positions differ by only 0.002 (0.000 vs 0.002 at the start, 0.998 vs 1.000 at the end), a step that shrinks further as the max length grows. And worse, the encoding changes meaning if you increase the max length: 0.5 always means "halfway through," but the token it points to moves from index 250 to index 2048 when the max length goes from 500 to 4096.
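A small sketch of the normalization problem, under the assumption of two hypothetical maximum lengths (500 and 4096). It prints the gap between adjacent positions and the token index that the value 0.5 refers to in each case.

```python
def normalized_position(p, max_len):
    """Scale an integer position into [0, 1] by dividing by the maximum length."""
    return p / max_len

for max_len in (500, 4096):   # hypothetical maximum lengths
    step = normalized_position(1, max_len) - normalized_position(0, max_len)
    token_at_half = int(0.5 * max_len)
    print(f"max_len={max_len}: adjacent step={step:.6f}, value 0.5 is token {token_at_half}")
```

The same encoded value points at different tokens depending on a setting the model never sees, which is why this scheme is not used.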
The Vaswani et al. 2017 solution is elegant: encode position as a pattern of sine and cosine waves at different frequencies. Each position gets a unique "fingerprint" that is bounded, meaningful, and theoretically capable of expressing relative distances.
Here's the idea. We have a model dimension $d_{\text{model}}$ (say, 512). For each position p in the sequence, we create a $d_{\text{model}}$-dimensional vector. Each pair of dimensions (2i, 2i+1) uses a sine and cosine at a specific frequency:

$$PE(p, 2i) = \sin\!\left(\frac{p}{10000^{2i/d_{\text{model}}}}\right), \qquad PE(p, 2i+1) = \cos\!\left(\frac{p}{10000^{2i/d_{\text{model}}}}\right)$$

Let's unpack this. The denominator $10000^{2i/d_{\text{model}}}$ controls the frequency of the wave. When i = 0 (the first dimension pair), the denominator is $10000^0 = 1$, so the wave advances one radian per position and completes one full cycle every 2π ≈ 6.28 positions. When i is large (near $d_{\text{model}}/2$), the denominator approaches 10000, and the wave oscillates extremely slowly: one cycle every 10000 × 2π ≈ 62,832 positions.
Think of it like a clock. The fast-ticking dimensions are the second hand — they change rapidly between nearby positions, giving fine-grained local discrimination. The slow-ticking dimensions are the hour hand — they change gradually, giving broad positional context over thousands of positions. Together, they form a unique binary-like code for each position.
Let's compute every value by hand. With $d_{\text{model}} = 4$, we have two frequency bands: i = 0 (fast) and i = 1 (slow).
Frequency band i = 0: denominator = $10000^{0/4} = 10000^0 = 1$.
Frequency band i = 1: denominator = $10000^{2/4} = 10000^{0.5} = 100$.
Now compute for each position:
| Position | Dim 0: sin(p) | Dim 1: cos(p) | Dim 2: sin(p/100) | Dim 3: cos(p/100) |
|---|---|---|---|---|
| p = 0 | 0.000 | 1.000 | 0.000 | 1.000 |
| p = 1 | 0.841 | 0.540 | 0.010 | 1.000 |
| p = 2 | 0.909 | −0.416 | 0.020 | 1.000 |
| p = 3 | 0.141 | −0.990 | 0.030 | 1.000 |
Notice the pattern. Dimensions 0-1 (the fast band) change dramatically between positions — sin(0) = 0, sin(1) = 0.841, sin(2) = 0.909. They give fine-grained local discrimination. Dimensions 2-3 (the slow band) barely change — sin(0/100) = 0, sin(1/100) = 0.01, sin(2/100) = 0.02. They evolve over hundreds of positions.
Each row is unique. No two positions produce the same 4D vector. And this holds for $d_{\text{model}} = 512$: with 256 frequency bands spanning wavelengths from 2π up to nearly 62,832, the encoding is effectively unique for any reasonable sequence length.
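The uniqueness claim can be spot-checked numerically. This is a minimal sketch: the sizes (1024 positions, 64 dimensions) are arbitrary choices to keep the check fast, and it only covers the positions it actually generates.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model, base=10000.0):
    """Sinusoidal table: sin in even columns, cos in odd columns."""
    pos = np.arange(max_len)[:, None]
    freq = base ** (-np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

pe = sinusoidal_pe(1024, 64)      # sizes are arbitrary, chosen to keep the check fast

# Pairwise squared distances between all position vectors
sq_norms = (pe ** 2).sum(axis=1)
d2 = sq_norms[:, None] + sq_norms[None, :] - 2 * pe @ pe.T
np.fill_diagonal(d2, np.inf)      # ignore each position's distance to itself
print(d2.min() > 0)               # no two positions share an encoding
```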
The base 10000 sets the wavelength range. The fastest dimension has wavelength 2π ≈ 6.28 positions — enough to distinguish adjacent tokens. The slowest dimension has wavelength 10000 × 2π ≈ 62,832 positions — enough to uniquely identify positions in sequences up to ~60K tokens long.
If you used a smaller base (say 100), the slowest wavelength would be only 628 positions. Sequences longer than that would see position encodings repeat — positions 0 and 628 would get nearly identical encodings, confusing the model. The choice of 10000 gives headroom for long sequences while keeping the fast dimensions discriminative.
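To see how the base sets the wavelength range, the sketch below computes the slowest band's wavelength for two bases. The helper name and the default $d_{\text{model}} = 512$ are assumptions; with a finite number of bands the slowest wavelength lands slightly under base × 2π.

```python
import math

def slowest_wavelength(base, d_model=512):
    """Wavelength (in positions) of the slowest sin/cos pair, i = d_model/2 - 1."""
    i = d_model // 2 - 1
    return 2 * math.pi * base ** (2 * i / d_model)

for base in (100.0, 10000.0):
    print(f"base={base:>7.0f}: slowest wavelength is about {slowest_wavelength(base):,.0f} positions")
```

The fastest band is unaffected by the base (its wavelength is always 2π), so raising the base only stretches the slow end of the spectrum.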
Many modern open-weights models (Llama, Mistral) don't use sinusoidal encoding; they use RoPE, which we'll cover later. But the frequency-band intuition carries over directly: RoPE's original formulation uses the same 10000 base and the same geometric spacing of frequencies. Understanding sinusoidal encoding is the foundation for everything that follows.
The simulation below shows a heatmap of sinusoidal encodings. Each row is a position (0 at the top). Each column is a dimension. Color represents the encoding value: warm/orange for positive, teal for negative. You can see the fast-oscillating dimensions on the left and the slow ones on the right. Hover to see exact values. Use the slider to control how many positions are visible.
```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2), holds the even indices 2i
    freq = 1.0 / (10000 ** (i / d_model))      # geometric spacing
    pe[:, 0::2] = np.sin(pos * freq)           # even dims: sine
    pe[:, 1::2] = np.cos(pos * freq)           # odd dims: cosine
    return pe

pe = sinusoidal_pe(128, 512)
print(pe.shape)                    # (128, 512)

# d_model = 4 reproduces the hand-computed table above
pe_small = sinusoidal_pe(4, 4)
print(np.round(pe_small[0], 3))    # approx. [0.000, 1.000, 0.000, 1.000] (position 0)
print(np.round(pe_small[1], 3))    # approx. [0.841, 0.540, 0.010, 1.000] (position 1)
```
The PyTorch equivalent uses the same logic but wraps it in a buffer so the encoding is stored with the model (on the right device) but not updated by the optimizer:
```python
import math

import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()  # (max_len, 1)
        # exp(-2i * ln(10000) / d_model) == 1 / 10000^(2i / d_model)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]  # add PE to embeddings
```
Sinusoidal encodings are elegant and parameter-free. They require zero training — you compute them once from a formula and they work for any position. But they're also completely fixed. The model can't adapt them. What if the optimal position encoding for language isn't a pattern of sine waves? What if the model could learn a better one?
BERT (2018) and GPT-2 (2019) answered this question with brute force: just learn the position encodings. Create a lookup table of shape (max_positions, $d_{\text{model}}$), initialize it randomly, and let gradient descent figure out the best encoding for each position. Simple, effective, and, as we'll see, often similar to sinusoidal encodings after training.
Create an embedding matrix $E_{\text{pos}}$ of shape (max_seq_len, $d_{\text{model}}$). This is a regular model parameter: a block of learnable numbers just like any weight matrix. Position p gets the p-th row: $E_{\text{pos}}[p]$.
During the forward pass, you look up the position embedding for each token's position and add it to the token embedding. That's it. The position embeddings are randomly initialized (usually from a normal distribution with small standard deviation) and updated by gradient descent during training, just like every other parameter in the model.
In PyTorch, this is an nn.Embedding(max_len, d_model) layer: the same mechanism used for token embeddings. The only difference is the input: token embeddings take token IDs (integers from the vocabulary), position embeddings take position indices (integers 0 to max_len-1).
Let's trace through a concrete example. max_len = 4, $d_{\text{model}} = 3$. After random initialization, the position embedding table looks like:
| Position | Dim 0 | Dim 1 | Dim 2 |
|---|---|---|---|
| p = 0 | 0.12 | −0.34 | 0.56 |
| p = 1 | −0.23 | 0.45 | −0.11 |
| p = 2 | 0.67 | −0.12 | 0.33 |
| p = 3 | −0.45 | 0.78 | −0.22 |
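Suppose the token embedding for "cat" is $e_{\text{cat}}$ = [0.80, 0.30, −0.10]. If "cat" appears at position 2:

$$e_{\text{cat}} + E_{\text{pos}}[2] = [0.80 + 0.67,\ 0.30 - 0.12,\ -0.10 + 0.33] = [1.47,\ 0.18,\ 0.23]$$

If the same "cat" appears at position 0 instead:

$$e_{\text{cat}} + E_{\text{pos}}[0] = [0.80 + 0.12,\ 0.30 - 0.34,\ -0.10 + 0.56] = [0.92,\ -0.04,\ 0.46]$$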
Different positions produce different input vectors for the same token. Now the attention mechanism sees different Q, K, V values depending on where "cat" sits in the sequence. Problem solved — at least for positions the model was trained on.
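The lookup-and-add step from this example can be reproduced directly. The numbers are copied from the table above; the helper name `embed` is only illustrative.

```python
import numpy as np

# Position table and token vector copied from the worked example.
E_pos = np.array([
    [ 0.12, -0.34,  0.56],   # p = 0
    [-0.23,  0.45, -0.11],   # p = 1
    [ 0.67, -0.12,  0.33],   # p = 2
    [-0.45,  0.78, -0.22],   # p = 3
])
e_cat = np.array([0.80, 0.30, -0.10])

def embed(token_vec, position):
    """Input to the first layer: token embedding plus that position's row."""
    return token_vec + E_pos[position]

print(embed(e_cat, 2))   # "cat" at position 2
print(embed(e_cat, 0))   # "cat" at position 0
```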
Here's the remarkable thing: after training, learned position embeddings often look strikingly similar to sinusoidal encodings. When researchers visualize the learned embedding matrix as a heatmap, they see wave-like patterns — low-frequency oscillations in some dimensions, high-frequency in others. The model independently rediscovers that multi-frequency waves are a good way to encode position.
This makes sense. Gradient descent optimizes the position embeddings to maximize task performance. It turns out that multi-frequency wave patterns are a highly efficient way to give the model both local (nearby token) and global (document-level) position information. Sinusoidal encoding just happens to be close to the optimum that gradient descent finds naturally.
The simulation below shows two heatmaps side by side. Left: sinusoidal encoding (fixed formula). Right: learned position embeddings (representative patterns from a trained model). Toggle between them to see how similar they are — and where they differ. The learned version has smoother gradients in some frequency bands and sharper transitions in others.
Here's the critical flaw. Learned embeddings have a hard maximum sequence length. If the model was trained with max_seq_len = 512 (like BERT), there are exactly 512 rows in the position embedding table, indexed 0 to 511. A 513th position (index 512) simply doesn't exist. There's no embedding for it. The model literally cannot process a sequence with more than 512 tokens.
What happens if you try? Depending on the implementation, you get an index-out-of-bounds error, or the model wraps around to position 0 (which makes no sense), or it uses an untrained random vector (which produces garbage). None of these are acceptable.
Sinusoidal encodings don't have this problem. The formula produces a valid encoding for any position — 0, 512, 10000, or 1 million. Whether those encodings work well beyond the trained range is a separate question (they haven't been optimized for those positions), but at least they produce a reasonable, unique, bounded vector.
| Property | Sinusoidal (Fixed) | Learned |
|---|---|---|
| Parameters | 0 (computed from formula) | max_len × $d_{\text{model}}$ |
| Flexibility | Fixed pattern, model cannot adapt | Fully flexible, optimized by gradient descent |
| Length generalization | Produces values for any position (quality degrades) | Hard ceiling — crashes or garbage beyond max_len |
| Performance | Slightly worse on benchmarks | Slightly better on in-distribution data |
| Relative position | Theoretically expressible via linear transform | Model must learn relative from absolute (harder) |
| Used in | Original Transformer (2017) | BERT, GPT-2, GPT-3 (2018-2020) |
```python
import torch
import torch.nn as nn

class LearnedPE(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)  # learnable table

    def forward(self, x):
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)  # [0, 1, ..., seq_len-1]
        return x + self.pe(positions)  # lookup + add

# Usage (a random tensor stands in for token_embed(input_ids)):
pe = LearnedPE(max_len=512, d_model=768)
embeddings = torch.randn(2, 128, 768)   # (batch, seq_len, 768)
embeddings_with_pos = pe(embeddings)    # adds position info
# A sequence with seq_len=513 fails: index 512 is out of range
```
That's it. The simplicity is the appeal. One extra embedding layer, three extra lines of code, and the model gets position information. The limitation — a hard maximum length — is the price.
Consider the sentence "The cat sat on the mat." The subject-verb relationship between "cat" and "sat" is the same whether this sentence starts at position 0 or position 5000 in a long document. "Cat" is always one token before "sat." The grammatical relationship depends on the distance between tokens, not their absolute indices.
Both sinusoidal and learned embeddings encode absolute position: position 0 gets one vector, position 1 gets another, position 5000 gets yet another. The model must learn that the interaction between position 5 and position 7 encodes the same "distance-2" relationship as the interaction between position 500 and position 502. And between position 3000 and position 3002. And every other pair at distance 2.
That's a lot of redundant learning. Relative position encoding says: just encode the distance directly.
Let's count. With a context length of 4096 and absolute position embeddings, how many distinct position pairs encode "distance = 2"? Positions (0,2), (1,3), (2,4), ..., (4093,4095). That's 4094 pairs. For the model to learn that "distance-2" means "adjective modifies noun" (for example), it must see examples at enough of these 4094 pairs to generalize. The model doesn't know that position 5 and position 500 encode the same relative relationship; that's an emergent pattern it must discover from data.
Now consider what happens at test time. If the model trained on sequences up to 512 tokens, it has seen distance-2 pairs at positions (0,2) through (509,511). If it now encounters the pair (4093,4095), same distance but at absolute positions it has never seen, the absolute position embeddings for 4093 and 4095 are either undefined (learned) or untested (sinusoidal). The model has no guarantee that it will handle this pair correctly.
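The counting argument is simple enough to write down. A minimal sketch, assuming 0-indexed positions and a hypothetical helper name:

```python
def pairs_at_distance(context_len, k):
    """Number of (i, i + k) position pairs that fit inside a context of this length."""
    return max(context_len - k, 0)

print(pairs_at_distance(4096, 2))   # pairs a 4096-token context offers
print(pairs_at_distance(512, 2))    # pairs available when training stops at 512
```

With absolute encodings every one of those pairs is a separate case to learn; a scheme that encodes distance directly needs a single term for all of them.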
Let's make the absolute-position problem concrete. Take sinusoidal encoding with $d_{\text{model}} = 4$. We'll compare two pairs that are both at distance 2: positions (3, 5) and positions (103, 105).
Using the formulas from Chapter 1 (frequencies 1 and 1/100):
| Position | Dim 0: sin(p) | Dim 1: cos(p) | Dim 2: sin(p/100) | Dim 3: cos(p/100) |
|---|---|---|---|---|
| p = 3 | 0.141 | −0.990 | 0.030 | 1.000 |
| p = 5 | −0.959 | 0.284 | 0.050 | 0.999 |
| p = 103 | 0.623 | −0.782 | 0.857 | 0.515 |
| p = 105 | −0.971 | −0.241 | 0.867 | 0.498 |
Look at the raw vectors. PE(3) = [0.141, −0.990, 0.030, 1.000] and PE(103) = [0.623, −0.782, 0.857, 0.515]. These are completely different vectors, even though both are the "start" of a distance-2 pair.
The model computes attention scores using dot products: $q_3 \cdot k_5$ and $q_{103} \cdot k_{105}$. Because the absolute encodings differ wildly, these dot products will generally be different numbers once token content and the learned projections are mixed in, even though the underlying relationship ("2 tokens apart") is identical.
In principle, the learned attention weights $W_Q$ and $W_K$ could learn to extract the relative offset from the absolute encodings. Sinusoidal encodings even have a theoretical property that makes this possible: PE(p + k) can be expressed as a linear function of PE(p). But the model must discover and exploit this relationship through training. It's an extra burden that relative methods eliminate entirely.
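The linear-shift property can be made explicit for the $d_{\text{model}} = 4$ example. The sketch below builds a block-diagonal matrix that depends only on the offset k and maps PE(p) to PE(p + k); the names `pe` and `shift_matrix` are illustrative.

```python
import numpy as np

FREQS = np.array([1.0, 0.01])          # the two bands of the d_model = 4 example

def pe(p):
    """Sinusoidal encoding [sin(w0 p), cos(w0 p), sin(w1 p), cos(w1 p)]."""
    return np.stack([np.sin(p * FREQS), np.cos(p * FREQS)], axis=-1).ravel()

def shift_matrix(k):
    """Block-diagonal matrix M_k with M_k @ pe(p) == pe(p + k) for every p."""
    M = np.zeros((4, 4))
    for b, w in enumerate(FREQS):
        c, s = np.cos(w * k), np.sin(w * k)
        # Angle-addition identities for one (sin, cos) pair
        M[2 * b:2 * b + 2, 2 * b:2 * b + 2] = [[c, s], [-s, c]]
    return M

M2 = shift_matrix(2)                   # depends on the offset only, not on p
print(np.allclose(M2 @ pe(3), pe(5)))
print(np.allclose(M2 @ pe(103), pe(105)))
```

The same matrix works at both locations, which is the sense in which relative offsets are "theoretically expressible" from absolute sinusoidal encodings.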
The core idea of relative position encoding: instead of adding position vectors to the input, add a position-dependent bias to the attention scores based on the distance between the query and key tokens.
In standard attention, the score between positions i and j is:

$$e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}$$

Shaw et al. (2018) proposed making attention depend on learned relative-position terms. In its simplest form, this adds a learned bias that depends only on the relative distance (i − j):

$$e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} + b_{i-j}$$

Here $b_{i-j}$ is a learned scalar indexed by the distance between positions. If i − j = 2, we look up $b_2$: the same value regardless of whether i = 5 or i = 5000. The model learns one bias per distance, not one per position.
The distance is usually clipped to a window: $b_k$ for k in [−K, K] where K is a maximum distance (e.g., 128). Beyond that window, all far-away tokens share the same bias. This keeps the parameter count small: 2K + 1 learnable scalars instead of max_len × $d_{\text{model}}$.
The simulation below shows two tokens on a number line. In absolute mode, each token's encoding depends on its position. Slide the pair along the number line (keeping the distance fixed) and watch the encoding vectors change dramatically. In relative mode, only the distance matters — slide the pair and the relative encoding stays constant.
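This is the clincher. If a model only saw sequences up to length 512 during training, absolute encodings for later positions are either missing (learned) or untested (sinusoidal), whereas a relative scheme reuses the distance terms it already learned: a distance-2 pair looks the same at position 4,000 as at position 4.
This is why most modern open-weights LLMs (Llama, Mistral, Gemma) use some form of relative position encoding. The shift from absolute to relative was one of the most important architectural changes in the transformer's evolution, and it helped enable the jump from 512-token contexts to the 128K+ token contexts that we see today.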
```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    def __init__(self, max_dist=128):
        super().__init__()
        # One learnable bias per distance in [-max_dist, max_dist]
        self.bias = nn.Embedding(2 * max_dist + 1, 1)
        self.max_dist = max_dist

    def forward(self, seq_len):
        # Build distance matrix: dist[i, j] = i - j
        pos = torch.arange(seq_len, device=self.bias.weight.device)
        dist = pos[:, None] - pos[None, :]                    # (L, L)
        dist = dist.clamp(-self.max_dist, self.max_dist)
        dist = dist + self.max_dist                           # shift to [0, 2*max_dist]
        return self.bias(dist).squeeze(-1)                    # (L, L) bias matrix

# Usage: add to attention scores (random q, k stand in for real projections)
seq_len, d_k = 64, 32
q, k = torch.randn(seq_len, d_k), torch.randn(seq_len, d_k)
rpb = RelativePositionBias(max_dist=128)
attn_scores = q @ k.transpose(-2, -1) / d_k**0.5
attn_scores = attn_scores + rpb(seq_len)  # position-aware!
# Works for seq_len=64 AND seq_len=4096: distance is all that matters
```
Every method so far adds something: a sinusoidal vector or a learned lookup added to the embedding, or a bias added to the attention scores. RoPE does something fundamentally different: it rotates the query and key vectors before computing attention.
Position 0 gets no rotation. Position 1 gets a small rotation. Position 100 gets a large rotation. And here's the magic: when two rotated vectors are dotted together, the rotation angles subtract, leaving only the distance between positions.
RoPE was introduced by Jianlin Su et al. in 2021 and immediately became the default in nearly every open-weights LLM: Llama, Mistral, Gemma, Qwen, Phi. The reason? It gives you relative position encoding for free through the attention mechanism, with no extra parameters and no extra memory.
Start with a single 2D vector [x, y]. To encode that this vector lives at position p, we rotate it by an angle proportional to p. The rotation angle is p · θ, where θ is a fixed frequency constant.
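The standard 2D rotation matrix does this:

$$R(p\theta)\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \cos p\theta & -\sin p\theta \\ \sin p\theta & \cos p\theta \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}$$

Now here's why this is brilliant. Suppose you have a query vector q at position m and a key vector k at position n. Both get rotated before the dot product. The dot product of two 2D vectors rotated by different angles has a beautiful property:

$$\big(R(m\theta)\,q\big) \cdot \big(R(n\theta)\,k\big) = q^\top R(m\theta)^\top R(n\theta)\,k = q^\top R\big((n-m)\theta\big)\,k$$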
The result depends on (m−n) — the relative distance — not on m or n individually. Move both the query and key to positions 1000 and 998? Same dot product as positions 2 and 0. The rotation angles cancel out, leaving only the gap.
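Let's work through a concrete example. We have two tokens: a query q = [1.0, 0.5] at position m = 3 and a key k = [0.8, 0.3] at position n = 1, with rotation frequency θ = π/8.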
Step 1: Rotate the query. q gets rotated by mθ = 3π/8 ≈ 1.178 radians.
Rotated query: [-0.079, 1.115]
Step 2: Rotate the key. k gets rotated by nθ = 1 × π/8 ≈ 0.393 radians.
Rotated key: [0.624, 0.583]
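Step 3: Dot product. (−0.079)(0.624) + (1.115)(0.583) ≈ −0.049 + 0.650 = 0.601.
Step 4: The magic test. Now shift both positions by 100: query at position 103, key at position 101. Same relative distance of 2. Recompute: the rotated query becomes [−1.115, −0.079], the rotated key becomes [−0.583, 0.624], and the dot product is (−1.115)(−0.583) + (−0.079)(0.624) ≈ 0.650 − 0.049 = 0.601. Identical.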
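The four steps can be replayed in a few lines. In this sketch the unrotated vectors q = [1.0, 0.5] and k = [0.8, 0.3] are assumed values, chosen because rotating them by 3π/8 and π/8 reproduces the rotated query and key quoted in Steps 1 and 2.

```python
import numpy as np

def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

def rope_score(q, k, m, n, theta):
    """Dot product of q rotated to position m with k rotated to position n."""
    return rotate(q, m * theta) @ rotate(k, n * theta)

theta = np.pi / 8
q = np.array([1.0, 0.5])   # assumed: reproduces the rotated query in Step 1
k = np.array([0.8, 0.3])   # assumed: reproduces the rotated key in Step 2

near = rope_score(q, k, 3, 1, theta)       # positions 3 and 1
far = rope_score(q, k, 103, 101, theta)    # same gap, shifted by 100
print(round(near, 3), round(far, 3))
```

Both scores agree because only the angle difference (m − n)θ survives the dot product.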
Watch query and key vectors rotate as position increases. The dot product stays constant when both positions shift together — proof of relative position encoding.
In practice, each attention head has many dimensions (64, 128, or more). RoPE splits a head's d dimensions into d/2 pairs, where each pair forms an independent 2D subspace. Each subspace uses a different rotation frequency:

$$\theta_i = 10000^{-2i/d}, \qquad i = 0, 1, \ldots, d/2 - 1$$
This is the same base-10000 formula used in sinusoidal encodings — and for the same reason. Low-index pairs (small i) get high frequencies — they rotate fast and encode fine-grained position differences. High-index pairs (large i) get low frequencies — they rotate slowly and encode coarse, long-range relative position. The full spectrum lets the model attend at multiple scales simultaneously.
For a head dimension d=64, you get 32 pairs. Pair 0 has $\theta_0 = 1.0$ (rotates one full radian per position). Pair 31 has $\theta_{31} = 1/10000^{62/64} \approx 0.00013$ (barely rotates, even over thousands of positions). Together they give the model both a high-resolution local clock and a slowly-ticking global clock.
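The per-pair frequencies for this d = 64 example can be listed directly. A minimal sketch, with `rope_frequencies` as an illustrative helper name:

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    """theta_i = base^(-2i/head_dim) for each of the head_dim/2 rotation pairs."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

theta = rope_frequencies(64)
print(len(theta))     # 32 pairs
print(theta[0])       # fastest pair: 1 radian per position
print(theta[-1])      # slowest pair: on the order of 1e-4 radians per position
```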
```python
import torch

def precompute_rope_freqs(dim, max_seq_len, base=10000.0):
    # dim = head dimension (e.g., 64)
    # Each pair of dims gets a frequency: theta_i = 1 / base^(2i/dim)
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # freqs shape: [dim/2]

    # Positions: [0, 1, 2, ..., max_seq_len-1]
    positions = torch.arange(max_seq_len).float()
    # positions shape: [max_seq_len]

    # Outer product: angle at each (position, frequency pair)
    angles = torch.outer(positions, freqs)
    # angles shape: [max_seq_len, dim/2]

    # Precompute cos and sin for efficiency
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos_cached, sin_cached):
    # x shape: [batch, seq_len, n_heads, dim]
    # Split into even/odd pairs: [x0,x1], [x2,x3], ...
    x_even = x[..., ::2]   # shape: [batch, seq, heads, dim/2]
    x_odd = x[..., 1::2]   # shape: [batch, seq, heads, dim/2]

    seq_len = x.shape[1]
    cos = cos_cached[:seq_len].unsqueeze(0).unsqueeze(2)  # [1, seq, 1, dim/2]
    sin = sin_cached[:seq_len].unsqueeze(0).unsqueeze(2)  # [1, seq, 1, dim/2]

    # 2D rotation: [x*cos - y*sin, x*sin + y*cos]
    out_even = x_even * cos - x_odd * sin
    out_odd = x_even * sin + x_odd * cos

    # Interleave back: [x0', x1', x2', x3', ...]
    out = torch.stack([out_even, out_odd], dim=-1).flatten(-2)
    return out
```
What if position encoding didn't touch the embeddings at all? What if, instead of modifying Q, K, or the input, you just subtracted a penalty from the attention score — a penalty proportional to how far apart two tokens are?
Tokens nearby pay no penalty and attend freely. Distant tokens pay a steep penalty and get nearly zero attention weight. That's ALiBi (Attention with Linear Biases), introduced by Press, Smith, and Lewis in 2022.
No position embedding. No rotation. No extra parameters at all. Just a simple bias subtracted from attention scores. It's almost offensively simple — and it works remarkably well.
For head h with slope $m_h$, the attention score between a query at position i and a key at position j becomes:

$$\text{score}_h(i, j) = \frac{q_i \cdot k_j}{\sqrt{d_k}} - m_h\,|i - j|$$

That's it. The raw dot-product score, minus a linear penalty proportional to distance. The slope $m_h$ controls how aggressive the penalty is. A large slope means "pay attention mostly to nearby tokens." A small slope means "distance barely matters, attend broadly."
The slopes are not learned. They're set geometrically, fixed before training and never updated. For H attention heads:

$$m_h = 2^{-8h/H}, \qquad h = 1, 2, \ldots, H$$
This gives a geometric series of slopes. Head 1 has the steepest slope (strong locality). The last head has the gentlest slope (wide attention reach). The model learns to route local information through steep-slope heads and global information through gentle-slope heads.
Let's compute the ALiBi biases for a model with H = 4 heads.
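Step 1: Compute the slopes. With H = 4, $m_h = 2^{-8h/4} = 2^{-2h}$, so $m_1 = 0.25$, $m_2 = 0.0625$, $m_3 \approx 0.0156$, $m_4 \approx 0.0039$.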
Head 1's slope is 64× steeper than head 4's. They see the world at completely different scales.
Step 2: Build the bias for head 1 (m=0.25).
Consider a 6-token sequence. For the query at position 5 (the last token), the bias to each key position is:
| Key pos j | Distance \|i−j\| | Bias = −m · \|i−j\| |
|---|---|---|
| 0 | 5 | −0.25 × 5 = −1.25 |
| 1 | 4 | −0.25 × 4 = −1.00 |
| 2 | 3 | −0.25 × 3 = −0.75 |
| 3 | 2 | −0.25 × 2 = −0.50 |
| 4 | 1 | −0.25 × 1 = −0.25 |
| 5 | 0 | −0.25 × 0 = 0.00 |
If the raw attention score to position 0 was 3.0, it becomes 3.0 − 1.25 = 1.75 after the ALiBi bias. The nearby position 5 keeps its full score of 3.0. After softmax, this distance penalty translates to dramatically lower attention weights for far-away tokens in this steep-slope head.
Step 3: Compare with head 4 (m=0.0039).
Same query at position 5, same key at position 0. Bias = −0.0039 × 5 = −0.0195. A raw score of 3.0 becomes 2.98. Head 4 barely notices the distance. It attends almost uniformly across the sequence — a global attention head.
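The three steps can be sketched end to end. The assumption here, labeled in the code, is that every key has the same raw score of 3.0, so any difference in attention weight comes from the ALiBi bias alone; the helper names are illustrative.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Slopes m_h = 2^(-8h / n_heads) for h = 1..n_heads."""
    return np.array([2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)])

def alibi_weights(raw_scores, query_pos, slope):
    """Softmax over raw scores after subtracting slope * |query_pos - key_pos|."""
    key_pos = np.arange(len(raw_scores))
    biased = raw_scores - slope * np.abs(query_pos - key_pos)
    e = np.exp(biased - biased.max())      # subtract max for numerical stability
    return e / e.sum()

slopes = alibi_slopes(4)                   # [0.25, 0.0625, 0.015625, 0.00390625]
raw = np.full(6, 3.0)                      # assumption: every key has raw score 3.0
w_head1 = alibi_weights(raw, query_pos=5, slope=slopes[0])
w_head4 = alibi_weights(raw, query_pos=5, slope=slopes[3])
print(np.round(w_head1, 3))                # weight grows toward the nearby keys
print(np.round(w_head4, 3))                # close to uniform (about 1/6 each)
```

With identical raw scores, head 1 gives the adjacent key roughly 3.5 times the weight of the farthest key, while head 4 stays within a couple of percent of uniform.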
Visualize the attention bias matrix for each head. Steep slopes create sharp diagonal patterns (local attention). Gentle slopes create nearly uniform patterns (global attention).
ALiBi's killer feature: since the bias is just a linear function of distance, it works at any sequence length — even lengths never seen during training. Train on 1024 tokens, deploy at 8192: the bias formula is the same, just applied to larger distances. No retraining needed. No fine-tuning. No interpolation tricks.
This was revolutionary when ALiBi was published. Learned embeddings fail catastrophically beyond training length (no embedding exists for position 1025). Sinusoidal encodings degrade. RoPE starts to break at 2−4× training length. ALiBi just... works. The linear penalty scales naturally because distance is distance, whether it's 5 tokens or 5000.
```python
import math

import torch

def get_alibi_slopes(n_heads):
    # Geometric series: 1/2^(1*8/H), 1/2^(2*8/H), ...
    ratio = 2 ** (8 / n_heads)
    slopes = [1.0 / (ratio ** i) for i in range(1, n_heads + 1)]
    return torch.tensor(slopes)

def build_alibi_bias(n_heads, max_seq_len):
    # slopes shape: [n_heads]
    slopes = get_alibi_slopes(n_heads)  # e.g., [0.25, 0.0625, ...]

    # Distance matrix: |i - j| for all query-key pairs
    positions = torch.arange(max_seq_len)
    dist = (positions.unsqueeze(1) - positions.unsqueeze(0)).abs().float()
    # dist shape: [seq_len, seq_len]

    # Bias = -slope * distance, per head
    bias = -slopes.view(-1, 1, 1) * dist.unsqueeze(0)
    # bias shape: [n_heads, seq_len, seq_len]
    return bias

def alibi_attention(Q, K, V, alibi_bias):
    # Q, K shape: [batch, n_heads, seq_len, d_k]
    d_k = Q.shape[-1]
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # scores shape: [batch, n_heads, seq_len, seq_len]

    # Add ALiBi bias (broadcasts over batch)
    scores = scores + alibi_bias[:, :Q.shape[2], :K.shape[2]]
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
```
You've trained a model on sequences of length 512. Now someone pastes a 4096-token document. What happens?
The answer depends entirely on your position encoding choice. Some methods fail catastrophically the instant you exceed the training length. Others degrade gracefully. One barely notices. This simulation lets you see every failure mode — and every survival strategy — side by side.
Train on short sequences, test on long ones. Which position encodings survive?
Learned embeddings: instant catastrophe. Beyond the training length, there simply is no embedding for position 513. The model receives random, untrained vectors. Attention patterns become pure noise. This is not graceful degradation — it's a cliff.
Sinusoidal: mild degradation. The sin/cos functions are defined for all positions, so the model doesn't crash. But the attention patterns it learned during training assumed certain frequency relationships that become less reliable at unseen positions. You get blurriness, not static.
RoPE (vanilla): gradual breakdown at 2−4×. The rotation frequencies are all mathematically valid at longer positions. But the low-frequency (slow) dimensions, which never complete a full turn during training, reach rotation angles the model never encountered. The model has never seen these particular combinations of rotations and doesn't know what they mean. Attention patterns become increasingly incoherent.
ALiBi: graceful to 8× and beyond. The linear penalty is the same function at any distance. A penalty of −m · 1000 is just a bigger version of −m · 10. The model learned to use these biases during training, and the extrapolation is just a natural extension. Only at extreme multiples (16×+) do the distant-token penalties become so large that information flow is completely blocked.
RoPE + NTK scaling: the rescue strategy. By increasing the rotation base, NTK scaling slows down the low-frequency dimensions that cause trouble while leaving the fast ones almost untouched. The result: RoPE that works reliably at 4−8× the training length. Base scaling of this kind is part of how Llama-family models grew from 4K to 128K context. More on NTK in the next chapter.
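The behaviours above can be sketched in a few lines of plain Python. This is a toy illustration, not a trained model: the helper names (`learned_lookup`, `sinusoidal`, `rope_pairs_with_unseen_angles`, `alibi_penalty`) are made up for this sketch, the learned table is a stand-in list rather than a real embedding layer, and the "unseen angle" test assumes a RoPE pair is at risk when it never completes a full 2π turn within the training length.

```python
import math

TRAIN_LEN = 512

# Learned: one row per trained position (a stand-in for an embedding table).
learned_table = [[0.0] * 4 for _ in range(TRAIN_LEN)]

def learned_lookup(pos):
    """Return the row for `pos`; raises IndexError past the trained range."""
    return learned_table[pos]

def sinusoidal(pos, dim=4, base=10000.0):
    """Sinusoidal encoding: defined for any position, always within [-1, 1]."""
    out = []
    for i in range(dim // 2):
        angle = pos / base ** (2 * i / dim)
        out += [math.sin(angle), math.cos(angle)]
    return out

def rope_pairs_with_unseen_angles(dim, train_len, base=10000.0):
    """Pairs whose largest training angle stays below 2*pi.

    These slow pairs never finish a full turn during training, so longer
    positions push them to angles the model has not seen.
    """
    return [i for i in range(dim // 2)
            if base ** (-2 * i / dim) * (train_len - 1) < 2 * math.pi]

def alibi_penalty(slope, distance):
    """ALiBi bias: the same linear function at every distance."""
    return -slope * distance

try:
    learned_lookup(TRAIN_LEN)          # first position past training
    learned_ok = True
except IndexError:
    learned_ok = False

print("learned lookup past training works:", learned_ok)
print("sinusoidal at 4096:", [round(v, 3) for v in sinusoidal(4096)])
print("RoPE pairs at risk (d=64, 512 tokens):", rope_pairs_with_unseen_angles(64, 512))
print("ALiBi penalty at 10 vs 1000:", alibi_penalty(0.25, 10), alibi_penalty(0.25, 1000))
```

The learned table has nothing to return, the sinusoid keeps producing bounded values, the slow half of the RoPE pairs is flagged as at risk, and the ALiBi penalty simply scales with distance.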
"Just train on longer sequences" sounds like a solution, but sequence length has a quadratic cost in attention. Training on 8K is 16× more expensive than 2K. Training on 32K is 256× more. If your position encoding can extrapolate reliably, you train on affordable short sequences and deploy at the long context you actually need.
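The multipliers quoted here are just the square of the length ratio. A minimal sketch, assuming we count only the seq_len × seq_len attention score matrix and ignore the parts of the model whose cost grows linearly (`attention_cost_ratio` is a name invented for this example):

```python
def attention_cost_ratio(new_len, old_len):
    """Relative cost of the attention score matrix when sequence length changes."""
    return (new_len / old_len) ** 2

print(attention_cost_ratio(8192, 2048))    # 8K vs 2K
print(attention_cost_ratio(32768, 2048))   # 32K vs 2K
```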
This is exactly what happened in practice. Llama 2 trained on 4K context. By raising the RoPE base frequency (the same lever NTK scaling pulls) and fine-tuning on longer sequences, Llama 2 Long extended to 32K. Code Llama went from 4K to 100K with the same kind of base change. The position encoding's extrapolation ability was the enabling technology.
RoPE works beautifully within the training length. But push it to 2× or 4× and attention patterns start to break. We saw this in the arena. Now let's understand exactly why it breaks and how NTK-aware scaling fixes it.
This is the trick that let models jump from 4K to 128K context windows. It was discovered not by a major lab, but by a pseudonymous researcher on Reddit (u/bloc97) in 2023. Within weeks, it had spread widely through the open-source LLM community.
Recall that RoPE splits the head dimension into d/2 pairs, each rotating at a different frequency:

$$\theta_i = \text{base}^{-2i/d}, \qquad i = 0, 1, \dots, d/2 - 1$$

For d=64 and base=10000, pair 0 has $\theta_0$ = 1.0: it rotates one full radian per position. Pair 31 has $\theta_{31}$ ≈ 0.00013: it barely moves. At the training length of, say, 4096, pair 0 has swept 4096 rad (hundreds of full turns, so the model has seen every angle many times), while pair 31 has reached only about 0.55 rad, a small slice of the circle.

Now extend to 8192 (2× training length): pair 0 keeps wrapping through angles it already knows, but pair 31 climbs to about 1.09 rad, and everything between 0.55 and 1.09 rad is territory the model never saw.

The fix is elegant: increase the rotation base. Instead of base = 10000, use:

$$\text{base}' = \text{base} \cdot \text{scale}^{d/(d-2)}$$

where scale = test_length / train_length. This changes every frequency:

$$\theta'_i = (\text{base}')^{-2i/d} = \theta_i \cdot \text{scale}^{-2i/(d-2)}$$

The key insight: because the exponent is proportional to 2i/d, low-index dimensions (fast rotators) are barely affected, since the scale factor is raised to a small power. High-index dimensions (slow rotators) see a larger power, so the base increase hits them harder, slowing them down proportionally more. The slowest pair (i = d/2 − 1) is slowed by exactly a factor of scale.

The result: fine-grained position discrimination (fast dimensions) is almost unchanged, while the slowest dimensions are pulled back toward the training range. It's a nonlinear frequency adjustment, not uniform stretching.

Setup: d=64, base=10000, training length=4096, test length=8192, so scale=2.

Without scaling (vanilla RoPE):

| Pair | $\theta_i$ | Angle at pos 4096 | Angle at pos 8192 | New territory? |
|---|---|---|---|---|
| 0 (fast) | 1.0 | 4096 rad | 8192 rad | No, wraps |
| 8 | 0.1 | 409.6 rad | 819.2 rad | No, wraps |
| 16 | 0.01 | 40.96 rad | 81.92 rad | No, wraps |
| 24 | 0.001 | 4.096 rad | 8.192 rad | Yes (4.10 to 6.28 rad is unseen) |
| 31 (slow) | 0.000133 | 0.546 rad | 1.092 rad | Yes |

With NTK scaling:

$$\text{base}' = 10000 \cdot 2^{64/62} \approx 10000 \times 2.0452 \approx 20{,}452$$

| Pair | $\theta'_i$ | Change | Angle at 8192 | Still in range? |
|---|---|---|---|---|
| 0 (fast) | 1.0 | Unchanged | 8192 rad | Yes, wraps |
| 8 | 0.0836 | −16% | 685.0 rad | Yes, wraps |
| 16 | 0.00699 | −30% | 57.3 rad | Yes, wraps |
| 24 | 0.000585 | −42% | 4.79 rad | Mostly (training reached 4.10 rad) |
| 31 (slow) | 0.0000667 | −50% | 0.546 rad | Yes |

With NTK, pair 31's angle at position 8192 is 0.546 rad, exactly the maximum it reached during training. Pair 24 shrinks from 8.19 rad to 4.79 rad, much closer to the 4.10 rad it saw in training, though still slightly past it, which leaves a small residual of unseen angles in the middle pairs. The slowest dimensions have been pulled back into familiar territory while pair 0 is untouched.
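The numbers in this worked example can be recomputed with a few lines of plain Python. This is only a sketch under the setup stated above (d=64, base=10000, train 4096, test 8192); `rope_theta` and `ntk_base` are helper names chosen for this example, and the frequency formula assumed is $\theta_i = \text{base}^{-2i/d}$.

```python
def rope_theta(i, dim, base=10000.0):
    """Rotation frequency of pair i (radians per position)."""
    return base ** (-2 * i / dim)

def ntk_base(base, scale, dim):
    """NTK-aware base: base * scale^(d/(d-2))."""
    return base * scale ** (dim / (dim - 2))

dim, base, train_len, test_len = 64, 10000.0, 4096, 8192
scale = test_len / train_len
new_base = ntk_base(base, scale, dim)

rows = []
for i in (0, 8, 16, 24, 31):
    theta = rope_theta(i, dim, base)
    theta_ntk = rope_theta(i, dim, new_base)
    rows.append((i, theta, theta * train_len, theta * test_len,
                 theta_ntk, theta_ntk * test_len))

print(f"NTK base: {new_base:.0f}")
for i, theta, a_train, a_test, theta_ntk, a_ntk in rows:
    print(f"pair {i:2d}: theta={theta:.3g}  train={a_train:.4g} rad  "
          f"test={a_test:.4g} rad  ntk_theta={theta_ntk:.3g}  ntk_test={a_ntk:.4g} rad")
```

One property worth checking in the output: for the slowest pair, the NTK-scaled angle at the test length matches the vanilla angle at the training length, because that pair is slowed by exactly the scale factor.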
Compare rotation frequencies across dimension pairs. Watch how NTK scaling selectively slows high-index (slow) dimensions while leaving low-index (fast) dimensions untouched.
NTK-aware scaling was the breakthrough, but several refinements followed:
| Method | How it works | Pros | Cons |
|---|---|---|---|
| Linear Interpolation (Chen et al., 2023) | Divide position by scale factor: use position p/s instead of p. Uniformly slows ALL frequencies. | Dead simple. One line of code. | Slows fast dims too, hurting fine-grained local discrimination. Needs short fine-tuning to recover. |
| NTK-Aware (bloc97, 2023) | Increase base to slow frequencies nonlinearly: more for slow dims, less for fast dims. | Preserves local resolution. Works well at 4−8×. | Needs a known scale factor. Moderate quality loss at extreme scales. |
| Dynamic NTK (emozilla, 2023) | Compute scale factor dynamically: scale = max(1, current_seq_len / train_len). Adjusts on the fly. | No fixed scale needed. Adapts to actual input length. | Slightly more compute. Edge effects at the transition point. |
| YaRN (Peng et al., 2023) | Split dims into 3 regions: don't scale fast dims, interpolate slow dims, NTK-scale middle dims. Also adds temperature scaling. | Best-in-class quality. Minimal fine-tuning needed. | More hyperparameters. Complex implementation. |
In practice, the industry converged on YaRN or Dynamic NTK for production deployments. Llama 3.1 uses a YaRN-inspired approach to achieve 128K context from an 8K training length. Mistral uses a sliding-window variant combined with RoPE extension.
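To make the first three rows of the table concrete, here is a small plain-Python sketch of how each method changes the RoPE frequencies. The function names are invented for this example, the dynamic variant follows the simple `max(1, current_len / train_len)` rule from the table rather than any particular library's formula, and YaRN is left out because its region boundaries and temperature are extra hyperparameters not specified here.

```python
def rope_freqs(dim, base=10000.0):
    """Standard RoPE frequencies, one per dimension pair."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def linear_interpolation_freqs(dim, scale, base=10000.0):
    """Using position p/scale is the same as dividing every frequency by scale."""
    return [f / scale for f in rope_freqs(dim, base)]

def ntk_freqs(dim, scale, base=10000.0):
    """NTK-aware: keep positions, enlarge the base instead."""
    return rope_freqs(dim, base * scale ** (dim / (dim - 2)))

def dynamic_ntk_freqs(dim, current_len, train_len, base=10000.0):
    """Dynamic NTK: pick the scale from the sequence actually being processed."""
    scale = max(1.0, current_len / train_len)
    return ntk_freqs(dim, scale, base)

std = rope_freqs(64)
lin = linear_interpolation_freqs(64, scale=4.0)
ntk = ntk_freqs(64, scale=4.0)

print("fastest pair :", std[0], lin[0], ntk[0])
print("slowest pair :", std[-1], lin[-1], ntk[-1])
```

The fastest pair shows the tradeoff: linear interpolation slows it by the full scale factor, while NTK leaves it alone. On the slowest pair the two methods agree.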
```python
import torch

def ntk_rope_freqs(dim, max_seq_len, base=10000.0,
                   train_len=4096, target_len=32768):
    # Scale factor: how many times beyond training?
    scale = max(1.0, target_len / train_len)

    # NTK-aware base adjustment
    # base' = base * scale^(d/(d-2))
    ntk_base = base * (scale ** (dim / (dim - 2)))
    # For scale=8, dim=64: base goes from 10000 to ~85,550

    # Recompute frequencies with new base
    freqs = 1.0 / (ntk_base ** (torch.arange(0, dim, 2).float() / dim))

    # Same outer product as standard RoPE
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, freqs)
    return torch.cos(angles), torch.sin(angles)

# Compare: standard vs NTK-scaled
# (precompute_rope_freqs is assumed to be the standard RoPE helper from the earlier RoPE chapter)
cos_std, sin_std = precompute_rope_freqs(64, 32768)
cos_ntk, sin_ntk = ntk_rope_freqs(64, 32768, train_len=4096, target_len=32768)

# Fast dims (pair 0): frequencies nearly identical
# Slow dims (pair 31): NTK frequency is much smaller (1/8 of the original)
# → slow dims stay in the trained angle range
```
You now know five position encoding strategies: sinusoidal, learned, relative bias, RoPE, and ALiBi. Plus the extension tricks — NTK scaling, YaRN, linear interpolation. So when you're building or fine-tuning a model, which one do you actually pick?
The answer depends on three things: what kind of model you're building, how long your sequences need to be, and whether you need to extrapolate beyond training length. Here's the decision framework.
| Method | Parameters | Where applied | Relative? | Extrapolates? | Used in |
|---|---|---|---|---|---|
| Sinusoidal | 0 | Added to embeddings | In theory | Somewhat | Original Transformer (2017) |
| Learned | L × d | Added to embeddings | No | No — hard crash | BERT, GPT-2, GPT-3 |
| Relative Bias | 2K+1 per head | Added to attention scores | Yes | Yes (clipped) | T5, DeBERTa |
| RoPE | 0 | Rotates Q and K | Yes | Needs NTK/YaRN | Llama, Mistral, Gemma, Qwen |
| ALiBi | 0 | Bias on attention scores | Yes | Excellent | BLOOM, MPT |
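The "Parameters" column is easy to turn into concrete numbers. A rough sketch, assuming one bias table per attention head for the relative-bias case and counting a single layer (`position_param_count` and its argument names are made up for this example):

```python
def position_param_count(method, max_len=0, d_model=0, clip_k=0, n_heads=0):
    """Learned parameters added by each position encoding, per the table above."""
    if method in ("sinusoidal", "rope", "alibi"):
        return 0                               # fixed functions, nothing to learn
    if method == "learned":
        return max_len * d_model               # L x d table
    if method == "relative_bias":
        return (2 * clip_k + 1) * n_heads      # 2K+1 buckets per head
    raise ValueError(f"unknown method: {method}")

print(position_param_count("learned", max_len=512, d_model=768))
print(position_param_count("relative_bias", clip_k=16, n_heads=12))
print(position_param_count("rope"))
```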
```python
import torch

def build_rope(dim, max_len, base=10000.0, device=None):
    # Standard RoPE for modern LLMs
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, device=device).float() / dim))
    t = torch.arange(max_len, device=device).float()
    angles = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(angles), angles)  # complex exp

def apply_rope(x, rope_cache):
    # x: [batch, seq, heads, dim]
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rope = rope_cache[:x.shape[1]].unsqueeze(0).unsqueeze(2)
    return torch.view_as_real(x_complex * rope).flatten(-2).type_as(x)
```
```python
def build_ntk_rope(dim, max_len, base=10000.0,
                   train_len=4096, target_len=32768, device=None):
    scale = max(1.0, target_len / train_len)
    ntk_base = base * (scale ** (dim / (dim - 2)))  # THE key line
    return build_rope(dim, max_len, base=ntk_base, device=device)
```
You now understand the complete positional encoding toolkit — from sinusoidal waves to RoPE rotations. This chapter is your practical reference. No new concepts. Just the formulas, the decision guide, and the connections to where you go next.
| Symbol | Meaning | Typical values |
|---|---|---|
| p | Absolute position index in the sequence | 0 to L-1 |
| $d_{\text{model}}$ | Model embedding dimension | 512–8192 |
| $d_k$ | Head dimension ($d_{\text{model}}$ / n_heads) | 64–128 |
| i | Dimension pair index (0 to d/2-1) | 0–63 |
| $\theta_i$ | RoPE rotation frequency for pair i | 1.0 to ~0.0001 |
| $m_h$ | ALiBi slope for head h | 0.25 to ~0.004 |
| base | RoPE base frequency | 10000 (standard) |
| scale | Length extension ratio (test/train) | 1× to 32× |
| Formula | What it says in words |
|---|---|
| $PE(p, 2i) = \sin\left(p / 10000^{2i/d}\right)$ | The even dimension of position p oscillates like a wave. Fast for small i, slow for large i. |
| $PE(p, 2i+1) = \cos\left(p / 10000^{2i/d}\right)$ | The odd dimension is the same wave, phase-shifted by 90 degrees. |
| $q' = R(m\theta)\, q$ | RoPE: rotate the query vector by an angle proportional to its position m. |
| $q' \cdot k' = f(m - n)$ | The dot product after rotation depends ONLY on the distance between positions (query at m, key at n). |
| $\text{score} - m_h \lvert i - j \rvert$ | ALiBi: subtract a distance penalty from the attention score. Nearby tokens pay less. |
| $\text{base}' = \text{base} \cdot s^{d/(d-2)}$ | NTK scaling: increase the base to slow down frequencies, more for slow dims than fast dims. |
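The RoPE rows are the least obvious ones in this table, so here is a tiny numeric check of the relative-position property on a single 2D pair. It is a sketch in plain Python with arbitrary example vectors and an arbitrary frequency `theta=0.1`; `rotate` and `rope_score` are names invented for this illustration.

```python
import math

def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, m, n, theta=0.1):
    """Dot product of q rotated to position m and k rotated to position n."""
    q_rot = rotate(q, m * theta)
    k_rot = rotate(k, n * theta)
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]

q, k = (1.0, 2.0), (0.5, -1.5)
score_a = rope_score(q, k, m=5, n=2)        # distance 3, near the start
score_b = rope_score(q, k, m=105, n=102)    # distance 3, a hundred tokens later
score_c = rope_score(q, k, m=5, n=3)        # distance 2

print(score_a, score_b, score_c)
```

The first two scores agree up to floating-point error because only m − n matters; the third differs because the distance changed.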
| If you want to learn about... | Go to... |
|---|---|
| How attention works (Q, K, V) | Attention & Transformers |
| Multi-head attention, cross-attention, GQA | Attention Variants |
| Normalization (BatchNorm to RMSNorm) | Normalization |
| Optimizers (SGD to AdamW) | Optimizers |
| Loss functions (cross-entropy, focal, etc.) | Loss Functions |
| The full GPT architecture | GPT — From Zero to Hero |
| Paper | Year | Contribution |
|---|---|---|
| Vaswani et al., "Attention Is All You Need" | 2017 | Introduced sinusoidal position encoding |
| Devlin et al., BERT | 2018 | Popularized learned position embeddings |
| Shaw et al., "Self-Attention with Relative Position Representations" | 2018 | First relative position bias method |
| Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" | 2021 | Introduced RoPE |
| Press et al., "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" | 2022 | Introduced ALiBi |
| bloc97, "NTK-Aware Scaled RoPE" | 2023 | NTK-aware frequency scaling for RoPE extension |
| Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models" | 2023 | State-of-the-art RoPE extension |