Positional Encoding

Chapter 0: The Bag-of-Words Problem

Take the sentence "The dog bit the man" and rearrange it to "The man bit the dog." Very different meanings. A dog biting a human is Tuesday in the park. A human biting a dog is front-page news. Now feed both sentences to a transformer with no positional encoding. Each word gets exactly the same output vector in both sentences; the vectors just come out in a different order. The model cannot tell the two sentences apart.

This isn't a subtle theoretical concern. It's a catastrophic blind spot baked into the core mechanism of the transformer. Self-attention — the operation that gives transformers their power — treats its input as a set, not a sequence. It has no concept of "first," "second," or "third." Every token might as well arrive simultaneously in a bag.

Let's prove this. Not with hand-waving, but with actual math.

Why Attention Is Permutation-Equivariant

Here's what self-attention does. Given a sequence of token embeddings X = [x₁, x₂, ..., x_n], it computes three matrices — Queries (Q = XW_Q), Keys (K = XW_K), and Values (V = XW_V). Then it computes:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

Now imagine you permute the input. Swap rows of X — put "dog" where "man" was. Q, K, and V are just linear transformations of X, so their rows get permuted in exactly the same way. The attention score matrix Q·K^T is a row/column permutation of the original — same values, same weighted sums, just reordered. The final output is the original output with its rows permuted identically.

In math: if P is a permutation matrix and you feed PX instead of X, the output is P · Attention(X). The values are identical — they just come out in the new order. The attention mechanism has zero awareness that you rearranged anything.
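
The claim can be checked in a few lines of algebra. This is a sketch that relies on two standard facts: a permutation matrix satisfies P^T P = I, and a row-wise softmax commutes with permuting rows and columns together, because that only relabels the entries.

```latex
\begin{aligned}
\mathrm{Attention}(PX)
  &= \mathrm{softmax}\!\left(\frac{(PQ)(PK)^{\top}}{\sqrt{d_k}}\right) PV
   = \mathrm{softmax}\!\left(P\,\frac{QK^{\top}}{\sqrt{d_k}}\,P^{\top}\right) PV \\
  &= P\,\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) P^{\top} P\, V
   = P\,\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
   = P\,\mathrm{Attention}(X).
\end{aligned}
```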

Token embeddings don't carry position. You might think the embeddings themselves carry position info since they're different vectors. They don't — embedding("the") is the same vector at position 0 and position 5. The embedding table is a dictionary lookup, not a position-aware function. The word "the" gets vector #4217 (or whatever its index is) regardless of where it appears in the sentence. Position is invisible.

Hand Calculation: Three Tokens, Two Orderings

Let's make this concrete with the smallest possible example. Three tokens A, B, C with 2D embeddings. We'll compute self-attention for two different orderings and prove the outputs are identical (up to reordering).

Setup. Token embeddings: A = [1, 0], B = [0, 1], C = [1, 1]. For simplicity, W_Q = W_K = W_V = I (identity matrix). So Q = K = V = X. Scale factor √d_k = √2.

Order 1: A, B, C

X = [[1,0], [0,1], [1,1]]. Compute Q·K^T / √2:

	A=[1,0]	B=[0,1]	C=[1,1]
A=[1,0]	1/√2 = 0.707	0/√2 = 0.000	1/√2 = 0.707
B=[0,1]	0/√2 = 0.000	1/√2 = 0.707	1/√2 = 0.707
C=[1,1]	1/√2 = 0.707	1/√2 = 0.707	2/√2 = 1.414

Apply softmax row-wise (each row sums to 1, up to rounding):

	A	B	C
Row A	0.401	0.198	0.401
Row B	0.198	0.401	0.401
Row C	0.248	0.248	0.503

Multiply by V (which equals X) to get outputs:

Output for A: 0.401×[1,0] + 0.198×[0,1] + 0.401×[1,1] = [0.401+0+0.401, 0+0.198+0.401] = [0.802, 0.599]
Output for B: 0.198×[1,0] + 0.401×[0,1] + 0.401×[1,1] = [0.198+0+0.401, 0+0.401+0.401] = [0.599, 0.802]
Output for C: 0.248×[1,0] + 0.248×[0,1] + 0.503×[1,1] ≈ [0.752, 0.752] (using the unrounded weights 0.2483 and 0.5035)

Order 2: B, C, A (permuted input)

X' = [[0,1], [1,1], [1,0]]. The score matrix Q'K'^T / √2:

	B=[0,1]	C=[1,1]	A=[1,0]
B=[0,1]	0.707	0.707	0.000
C=[1,1]	0.707	1.414	0.707
A=[1,0]	0.000	0.707	0.707

Look carefully. Row B in the permuted version has the same values as row B in the original, just in a different column order. Same for every row. After softmax and multiplying by V:

Output for B: [0.599, 0.802] (same as before)
Output for C: [0.752, 0.752] (same as before)
Output for A: [0.802, 0.599] (same as before)

Every token gets the exact same output vector regardless of its position in the sequence. The outputs are just reordered to match the new input order. Permutation-equivariance: permuting the input permutes the output, but doesn't change any values.

The consequence is devastating. "Dog bites man" and "Man bites dog" produce the same set of output vectors. The model cannot distinguish them. Subject and object are interchangeable. Word order — the backbone of grammar in every human language — is completely invisible.

See It: Drag Tokens, Watch Outputs

The simulation below shows two sentences. Drag tokens to rearrange them. With positional encoding OFF, the output values (shown as colored bars) are identical regardless of word order — they just shuffle position. Toggle positional encoding ON, and the outputs genuinely change when you reorder words. Position makes the model care about order.

Permutation Invariance Demo

Click tokens in the bottom row to swap their positions. Watch how outputs change (or don't) based on the Position Encoding toggle.

From Scratch: Proving It in Code

A few lines of Python that prove the point. We compute attention for the original order and a permuted order. The outputs are identical (just reordered).

python
import numpy as np

def attention(X):
    # Q=K=V=X, scale by sqrt(d_k)
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ X

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_perm = X[[1, 2, 0]]  # reorder: B, C, A

out1 = attention(X)
out2 = attention(X_perm)

print("Original  :", np.round(out1, 3))
print("Permuted  :", np.round(out2, 3))
print("Same values?", np.allclose(out1[[1,2,0]], out2))  # True

Run it. out1[[1,2,0]] (the original output, reordered) matches out2 exactly. The attention operation literally cannot distinguish order. It operates on a set.

So How Do We Fix This?

The fix is beautifully simple: add position information to the input embeddings before they enter the attention mechanism. If token "dog" at position 1 has a different vector than "dog" at position 3, the attention scores change. The model can now distinguish "dog bites man" from "man bites dog" because the Q, K, V matrices see different vectors depending on where each token sits.
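
To see the fix take effect, here is a minimal sketch that reuses the simplified attention (Q = K = V = X) and adds one vector per slot before attention. The position vectors in `pos` are arbitrary, made-up values chosen only for illustration; they are not any particular encoding scheme.

```python
import numpy as np

def attention(X):
    # Same simplification as in the earlier example: Q = K = V = X
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ X

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # tokens A, B, C
perm = [1, 2, 0]                                     # reorder: B, C, A

# Hypothetical position vectors, one per slot (values are arbitrary)
pos = np.array([[0.0, 0.5], [0.5, 0.0], [-0.5, 0.5]])

out_orig = attention(X + pos)        # A at slot 0, B at slot 1, C at slot 2
out_perm = attention(X[perm] + pos)  # B at slot 0, C at slot 1, A at slot 2

# Without positions the reordered outputs match; with positions they do not
same_without_pos = np.allclose(attention(X)[perm], attention(X[perm]))
same_with_pos = np.allclose(out_orig[perm], out_perm)
print(same_without_pos, same_with_pos)
```

Once each slot contributes its own vector, token B at slot 1 and token B at slot 0 are different inputs, so the per-token outputs no longer survive a shuffle.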

The question is: what position information should we add? An integer? A learned vector? A pattern of waves? The answer — and its surprising elegance — is what the rest of this lesson is about.

If you shuffle all input tokens of a transformer with no positional encoding, what happens to the output?

The model produces random garbage
The outputs are completely different vectors
The outputs are the same values, just shuffled in the same way
The model ignores the shuffled tokens and uses the original order

Chapter 1: Sinusoidal Position Encoding

We need to give each position a unique identity. The simplest idea: just use the integer position itself. Position 0 gets 0, position 1 gets 1, position 512 gets 512. But this creates a problem — position 512 is a huge number that would completely dominate the token embedding (which typically has values between -1 and 1). The model would pay more attention to position than content.

What about normalizing? Divide by the max length, so positions range from 0 to 1. Now the problem flips: adjacent positions become almost indistinguishable. With a max length of 500, positions 0 and 1 map to 0.000 and 0.002, and positions 499 and 500 map to 0.998 and 1.000, a step of only 0.002 that shrinks further as the max length grows. And worse, the encoding changes meaning if you increase the max length: the value 0.5 used to mean "halfway through" a 500-token sequence but now points at a different token.
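
The arithmetic behind the normalization problem is easy to reproduce. This is a small sketch; `normalized_position` is a hypothetical helper, and the two maximum lengths are arbitrary examples.

```python
def normalized_position(p, max_len):
    # Scale an integer position into the range [0, 1]
    return p / max_len

# The step between adjacent positions shrinks as the maximum length grows
step_500 = normalized_position(1, 500) - normalized_position(0, 500)
step_4000 = normalized_position(1, 4000) - normalized_position(0, 4000)

# The same encoded value points at different tokens under different max lengths
pos_for_half_500 = 0.5 * 500
pos_for_half_4000 = 0.5 * 4000

print(step_500, step_4000)
print(pos_for_half_500, pos_for_half_4000)
```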

The Vaswani et al. 2017 solution is elegant: encode position as a pattern of sine and cosine waves at different frequencies. Each position gets a unique "fingerprint" that is bounded, meaningful, and theoretically capable of expressing relative distances.

Building the Encoding From Scratch

Here's the idea. We have a model dimension d_model (say, 512). For each position p in the sequence, we create a d_model-dimensional vector. Each pair of dimensions (2i, 2i+1) uses a sine and cosine at a specific frequency:

PE(p, 2i) = sin(p / 10000^(2i/d_model))

PE(p, 2i+1) = cos(p / 10000^(2i/d_model))

Let's unpack this. The denominator 10000^(2i/d_model) controls the frequency of the wave. When i = 0 (the first dimension pair), the denominator is 10000⁰ = 1, so the wave oscillates at frequency 1: it completes one full cycle every 2π ≈ 6.28 positions. When i is large (near d_model/2), the denominator approaches 10000, and the wave oscillates extremely slowly: one cycle every 10000 × 2π ≈ 62,832 positions.

Think of it like a clock. The fast-ticking dimensions are the second hand — they change rapidly between nearby positions, giving fine-grained local discrimination. The slow-ticking dimensions are the hour hand — they change gradually, giving broad positional context over thousands of positions. Together, they form a unique binary-like code for each position.

Why sine AND cosine? A sine value alone is ambiguous, because two different angles in each cycle share it; pairing it with the cosine pins down a single point on the unit circle, so the phase at that frequency is fully determined. More importantly, the pair enables a key mathematical property: the encoding at position p+k can be expressed as a linear transformation of the encoding at position p. This means the model can, in principle, learn to compute relative position from absolute encodings using only linear operations, which is the kind of operation the query and key projections in attention perform.
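
The linear-transformation property can be checked numerically for a single (sin, cos) pair. This is a sketch under assumptions: `pe_pair` and `shift_matrix` are hypothetical helper names, and the frequency and offset are arbitrary. The matrix follows from the angle-addition identities for sine and cosine.

```python
import numpy as np

def pe_pair(p, omega):
    # One (sin, cos) pair of the encoding at angular frequency omega
    return np.array([np.sin(p * omega), np.cos(p * omega)])

def shift_matrix(k, omega):
    # 2x2 matrix that maps the pair at position p to the pair at p + k.
    # It depends on the offset k and the frequency, not on p itself.
    c, s = np.cos(k * omega), np.sin(k * omega)
    return np.array([[c, s], [-s, c]])

omega = 1.0 / 100.0   # illustrative frequency
k = 7                 # illustrative offset
M = shift_matrix(k, omega)

# The same matrix works at every starting position p
checks = [np.allclose(M @ pe_pair(p, omega), pe_pair(p + k, omega))
          for p in (0, 3, 250, 4000)]
print(checks)
```

For the full encoding, one such 2x2 block per frequency band sits on the diagonal of a larger matrix.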

Hand Calculation: d_model = 4, Positions 0-3

Let's compute every value by hand. With d_model = 4, we have two frequency bands: i = 0 (fast) and i = 1 (slow).

Frequency band i = 0: denominator = 10000^(0/4) = 10000⁰ = 1.

Dim 0: sin(p / 1) = sin(p)
Dim 1: cos(p / 1) = cos(p)

Frequency band i = 1: denominator = 10000^(2/4) = 10000^0.5 = 100.

Dim 2: sin(p / 100)
Dim 3: cos(p / 100)

Now compute for each position:

Position	Dim 0: sin(p)	Dim 1: cos(p)	Dim 2: sin(p/100)	Dim 3: cos(p/100)
p = 0	0.000	1.000	0.000	1.000
p = 1	0.841	0.540	0.010	1.000
p = 2	0.909	−0.416	0.020	1.000
p = 3	0.141	−0.990	0.030	1.000

Notice the pattern. Dimensions 0-1 (the fast band) change dramatically between positions — sin(0) = 0, sin(1) = 0.841, sin(2) = 0.909. They give fine-grained local discrimination. Dimensions 2-3 (the slow band) barely change — sin(0/100) = 0, sin(1/100) = 0.01, sin(2/100) = 0.02. They evolve over hundreds of positions.

Each row is unique. No two positions produce the same 4D vector. And this holds for d_model = 512: with 256 frequency bands spanning wavelengths from 2π to 62,832, the encoding is effectively unique for any reasonable sequence length.

The binary analogy. Think of the dimensions like bits in a binary counter. The lowest bit flips every step (0, 1, 0, 1, ...). The next bit flips every 2 steps. The next every 4 steps. Sinusoidal encoding is a continuous version of this: the fastest dimension pair cycles every ~6 positions, and each following pair is slower by a constant factor of 10000^(2/d_model) (about 1.04 for d_model = 512, rather than the factor of 2 between bits), up to a wavelength of ~62,832. Each position has a unique "binary" fingerprint.

Why 10000 as the Base?

The base 10000 sets the wavelength range. The fastest dimension has wavelength 2π ≈ 6.28 positions — enough to distinguish adjacent tokens. The slowest dimension has wavelength 10000 × 2π ≈ 62,832 positions — enough to uniquely identify positions in sequences up to ~60K tokens long.

If you used a smaller base (say 100), the slowest wavelength would be only about 628 positions. In sequences longer than that, even the slowest dimensions would wrap around, so positions 0 and 628 would look nearly identical in those dimensions and the encoding would lose its coarse sense of where a token sits. The choice of 10000 gives headroom for long sequences while keeping the fast dimensions discriminative.
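
The wavelength range can be computed directly from the formula. This is a minimal sketch; `wavelengths` is a hypothetical helper, and d_model = 512 is used only as an example. Note that the longest band falls slightly short of base × 2π, because the largest exponent is (d_model − 2)/d_model rather than 1.

```python
import numpy as np

def wavelengths(d_model, base=10000.0):
    # Wavelength (in positions) of each sin/cos pair: 2*pi * base^(2i/d_model)
    two_i = np.arange(0, d_model, 2)
    return 2 * np.pi * base ** (two_i / d_model)

w = wavelengths(512)
print(len(w))         # number of frequency bands
print(w[0], w[-1])    # shortest and longest wavelength
print(w[1] / w[0])    # constant ratio between neighbouring bands

w_small = wavelengths(512, base=100.0)
print(w_small[-1])    # longest wavelength with the smaller base
```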

Many modern open-weight models (Llama, Mistral) don't use sinusoidal encoding; they use RoPE, which we'll cover later. But the frequency-band intuition carries over directly: RoPE as originally proposed uses the same 10000 base and the same geometric spacing of frequencies. Understanding sinusoidal encoding is the foundation for everything that follows.

See It: The Wave Heatmap

The simulation below shows a heatmap of sinusoidal encodings. Each row is a position (0 at the top). Each column is a dimension. Color represents the encoding value: warm/orange for positive, teal for negative. You can see the fast-oscillating dimensions on the left and the slow ones on the right. Hover to see exact values. Use the slider to control how many positions are visible.

Sinusoidal Position Encoding Heatmap

Each row is a sequence position. Each column is a model dimension. Color encodes the value: warm = positive, teal = negative. Notice the wave patterns — fast on the left, slow on the right.

From Scratch: NumPy and PyTorch

python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]       # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2)
    freq = 1.0 / (10000 ** (i / d_model))    # geometric spacing
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

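pe = sinusoidal_pe(128, 512)
print(pe.shape)          # (128, 512)

# Reproduce the hand calculation above (d_model = 4)
pe4 = sinusoidal_pe(4, 4)
print(np.round(pe4[0], 3))   # ≈ [0.000, 1.000, 0.000, 1.000]  (position 0)
print(np.round(pe4[1], 3))   # ≈ [0.841, 0.540, 0.010, 1.000]  (position 1)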

The PyTorch equivalent uses the same logic but wraps it in a buffer so the encoding is stored with the model (on the right device) but not updated by the optimizer:

python
import math

import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        # exp(-2i * ln(10000) / d_model) == 1 / 10000^(2i/d_model)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]  # add PE to embeddings

Added, not concatenated. Sinusoidal encoding is ADDED to the token embedding, not concatenated. The model learns to use both signals — content and position — in the same vector space. This means each dimension pulls double duty: carrying both what the token is and where it is. The model learns to disentangle these during training. Concatenation would be cleaner (separate channels for content and position), but it doubles d_model and wastes parameters. Addition is the pragmatic choice.

Why do sinusoidal encodings use multiple frequencies instead of just one sine wave?

A single frequency can only encode a limited number of positions before repeating
Different frequencies encode different scales — high frequencies distinguish nearby positions, low frequencies distinguish distant ones
Multiple frequencies make the math easier to compute on GPUs
A single sine wave would interfere with the token embeddings

Chapter 2: Learned Position Embeddings

Sinusoidal encodings are elegant and parameter-free. They require zero training — you compute them once from a formula and they work for any position. But they're also completely fixed. The model can't adapt them. What if the optimal position encoding for language isn't a pattern of sine waves? What if the model could learn a better one?

BERT (2018) and GPT-2 (2019) answered this question with brute force: just learn the position encodings. Create a lookup table of shape (max_positions, d_model), initialize it randomly, and let gradient descent figure out the best encoding for each position. Simple, effective, and — as we'll see — remarkably similar to sinusoidal encodings after training.

How It Works

Create an embedding matrix E_pos of shape (max_seq_len, d_model). This is a regular model parameter — a block of learnable numbers just like any weight matrix. Position p gets the p-th row: E_pos[p].

During the forward pass, you look up the position embedding for each token's position and add it to the token embedding. That's it. The position embeddings are randomly initialized (usually from a normal distribution with small standard deviation) and updated by gradient descent during training, just like every other parameter in the model.

It's literally nn.Embedding. In PyTorch, learned position embeddings are just an nn.Embedding(max_len, d_model) layer. The same mechanism used for token embeddings. The only difference is the input: token embeddings take token IDs (integers from the vocabulary), position embeddings take position indices (integers 0 to max_len-1).

Hand Calculation: Token + Position

Let's trace through a concrete example. max_len = 4, d_model = 3. After random initialization, the position embedding table looks like:

Position	Dim 0	Dim 1	Dim 2
p = 0	0.12	−0.34	0.56
p = 1	−0.23	0.45	−0.11
p = 2	0.67	−0.12	0.33
p = 3	−0.45	0.78	−0.22

Suppose the token embedding for "cat" is e_cat = [0.80, 0.30, −0.10]. If "cat" appears at position 2:

input = e_cat + E_pos[2] = [0.80, 0.30, −0.10] + [0.67, −0.12, 0.33]

= [1.47, 0.18, 0.23]

If the same "cat" appears at position 0 instead:

input = e_cat + E_pos[0] = [0.80, 0.30, −0.10] + [0.12, −0.34, 0.56]

= [0.92, −0.04, 0.46]

Different positions produce different input vectors for the same token. Now the attention mechanism sees different Q, K, V values depending on where "cat" sits in the sequence. Problem solved — at least for positions the model was trained on.
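
The two additions above can be reproduced with a plain array lookup. This is a sketch using NumPy rather than a trainable layer; `embed` is a hypothetical helper, and the table values are the ones from the worked example.

```python
import numpy as np

# Position table from the worked example (max_len = 4, d_model = 3)
E_pos = np.array([
    [ 0.12, -0.34,  0.56],
    [-0.23,  0.45, -0.11],
    [ 0.67, -0.12,  0.33],
    [-0.45,  0.78, -0.22],
])
e_cat = np.array([0.80, 0.30, -0.10])

def embed(token_vec, position):
    # Input to the first layer: token embedding plus position embedding
    return token_vec + E_pos[position]

at_2 = embed(e_cat, 2)
at_0 = embed(e_cat, 0)
print(np.round(at_2, 2))
print(np.round(at_0, 2))
```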

What Gets Learned?

Here's the remarkable thing: after training, learned position embeddings often look strikingly similar to sinusoidal encodings. When researchers visualize the learned embedding matrix as a heatmap, they see wave-like patterns — low-frequency oscillations in some dimensions, high-frequency in others. The model independently rediscovers that multi-frequency waves are a good way to encode position.

This makes sense. Gradient descent optimizes the position embeddings to maximize task performance. It turns out that multi-frequency wave patterns are a highly efficient way to give the model both local (nearby token) and global (document-level) position information. Sinusoidal encoding just happens to be close to the optimum that gradient descent finds naturally.

So why bother learning? Because "close to optimal" isn't "optimal." Vaswani et al. reported nearly identical results for the two schemes, and in practice learned embeddings tend to match or slightly outperform sinusoidal ones on in-distribution benchmarks. The model can fine-tune the encoding for its specific task, emphasizing certain position relationships that matter more for language modeling (like adjacent-token patterns or sentence boundaries). The gains are modest at best, but the approach is simple, which helps explain why BERT, GPT-2, GPT-3, and many models before 2022 used learned embeddings.

See It: Sinusoidal vs Learned

The simulation below shows two heatmaps side by side. Left: sinusoidal encoding (fixed formula). Right: learned position embeddings (representative patterns from a trained model). Toggle between them to see how similar they are — and where they differ. The learned version has smoother gradients in some frequency bands and sharper transitions in others.

Sinusoidal vs Learned Embeddings

Compare fixed sinusoidal encodings (left) with learned position embeddings (right). Notice the similar wave structure — gradient descent rediscovers the multi-frequency pattern.

The Length Limit Problem

Here's the critical flaw. Learned embeddings have a hard maximum sequence length. If the model was trained with max_seq_len = 512 (like BERT), there are exactly 512 rows in the position embedding table, indexed 0 through 511. Position 512 (the 513th token) simply doesn't exist. There's no embedding for it. The model literally cannot process a sequence with more than 512 tokens.

What happens if you try? Depending on the implementation, you get an index-out-of-bounds error, or the model wraps around to position 0 (which makes no sense), or it uses an untrained random vector (which produces garbage). None of these are acceptable.

Sinusoidal encodings don't have this problem. The formula produces a valid encoding for any position — 0, 512, 10000, or 1 million. Whether those encodings work well beyond the trained range is a separate question (they haven't been optimized for those positions), but at least they produce a reasonable, unique, bounded vector.
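
The contrast can be sketched with a random table standing in for trained embeddings. Everything here is illustrative: `learned_table`, `learned_pe`, and `sinusoidal_pe_at` are hypothetical names, and d_model = 8 is chosen only to keep the example small.

```python
import numpy as np

max_len, d_model = 512, 8   # small d_model, chosen only for illustration

# Stand-in for a trained table: one row per position seen in training
rng = np.random.default_rng(0)
learned_table = rng.normal(scale=0.02, size=(max_len, d_model))

def learned_pe(position):
    # Plain row lookup; there is nothing to return past the last row
    return learned_table[position]

def sinusoidal_pe_at(position, d_model):
    # Formula-based encoding for a single position
    two_i = np.arange(0, d_model, 2)
    angles = position / (10000 ** (two_i / d_model))
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

try:
    learned_pe(512)          # first index past the table
    lookup_failed = False
except IndexError:
    lookup_failed = True

far = sinusoidal_pe_at(1_000_000, d_model)
print(lookup_failed, np.abs(far).max() <= 1.0)
```

The formula always returns a bounded vector, but that says nothing about whether a model trained on short sequences knows how to use it.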

Learned embeddings are NOT strictly better. You might think learned embeddings are always preferable because they're more flexible. In practice, they learn patterns very similar to sinusoidal encodings. The real advantage is slight performance gains on in-distribution lengths. The real disadvantage is catastrophic failure beyond the trained length. For tasks requiring long or variable-length contexts, this limitation is a dealbreaker — and it's a major reason the field moved to relative and rotary methods.

Comparison: Fixed vs Learned

Property	Sinusoidal (Fixed)	Learned
Parameters	0 (computed from formula)	max_len × d_model
Flexibility	Fixed pattern, model cannot adapt	Fully flexible, optimized by gradient descent
Length generalization	Produces values for any position (quality degrades)	Hard ceiling — crashes or garbage beyond max_len
Performance	Slightly worse on benchmarks	Slightly better on in-distribution data
Relative position	Theoretically expressible via linear transform	Model must learn relative from absolute (harder)
Used in	Original Transformer (2017)	BERT, GPT-2, GPT-3 (2018-2020)

From Scratch: Three Lines of PyTorch

python
import torch
import torch.nn as nn

class LearnedPE(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)  # learnable table

    def forward(self, x):
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)  # [0, 1, ..., seq_len-1]
        return x + self.pe(positions)  # lookup + add

# Usage (vocabulary size and batch shape are illustrative):
token_embed = nn.Embedding(30000, 768)
input_ids = torch.randint(0, 30000, (2, 16))  # (batch, seq_len)
pe = LearnedPE(max_len=512, d_model=768)
embeddings = token_embed(input_ids)         # (batch, seq_len, 768)
embeddings_with_pos = pe(embeddings)        # adds position info
# A 513-token input would look up index 512 → IndexError: index out of range

That's it. The simplicity is the appeal. One extra embedding layer, three extra lines of code, and the model gets position information. The limitation — a hard maximum length — is the price.

What happens when a model with learned position embeddings of max length 512 encounters a sequence of 600 tokens?

The model interpolates between the nearest known positions
It fails — positions 512+ have no embedding and produce an error or undefined behavior
The model wraps around, using position 0's embedding for position 512
The extra tokens are simply ignored

Chapter 3: Absolute vs Relative Position

Consider the sentence "The cat sat on the mat." The subject-verb relationship between "cat" and "sat" is the same whether this sentence starts at position 0 or position 5000 in a long document. "Cat" is always one token before "sat." The grammatical relationship depends on the distance between tokens, not their absolute indices.

Both sinusoidal and learned embeddings encode absolute position: position 0 gets one vector, position 1 gets another, position 5000 gets yet another. The model must learn that the interaction between position 5 and position 7 encodes the same "distance-2" relationship as the interaction between position 500 and position 502. And between position 3000 and position 3002. And every other pair at distance 2.

That's a lot of redundant learning. Relative position encoding says: just encode the distance directly.

The Inefficiency of Absolute Position

Let's count. With a context length of 4096 and absolute position embeddings, how many distinct position pairs encode "distance = 2"? Positions (0,2), (1,3), (2,4), ..., (4093,4095). That's 4094 pairs. For the model to learn that "distance-2" means "adjective modifies noun" (for example), it must see examples at enough of these 4094 pairs to generalize. The model doesn't know that position 5 and position 500 encode the same relative relationship; that's an emergent pattern it must discover from data.

Now consider what happens at test time. If the model trained on sequences up to 512 tokens, it has seen distance-2 pairs at positions (0,2) through (509,511). If it now encounters the pair (4093,4095), the same distance but at absolute positions it has never seen, the absolute position embeddings for 4093 and 4095 are either undefined (learned) or untested (sinusoidal). The model has no guarantee that it will handle this pair correctly.

The key insight. Absolute position forces the model to learn O(L²) pairwise relationships (every position pair in the context window). Relative position reduces this to O(L) relationships (one per distance value: 0, 1, 2, ..., L-1). This is not just more efficient — it's the reason relative methods generalize to longer sequences. If the model learns "distance = 2" once, it works at any absolute position.
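
The counting argument is small enough to enumerate. This is a sketch with a hypothetical helper, `pairs_at_distance`, using 0-indexed positions inside a 4096-token window.

```python
def pairs_at_distance(context_len, distance):
    # All (i, j) index pairs with j - i == distance inside the window
    return [(i, i + distance) for i in range(context_len - distance)]

L = 4096
pairs = pairs_at_distance(L, 2)
print(len(pairs), pairs[0], pairs[-1])

ordered_pairs = L * L   # position pairs an absolute scheme can distinguish
distances = L           # non-negative offsets 0 .. L-1 a relative scheme needs
print(ordered_pairs, distances)
```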

Hand Calculation: Same Distance, Different Encodings

Let's make the absolute-position problem concrete. Take sinusoidal encoding with d_model = 4. We'll compare two pairs that are both at distance 2: positions (3, 5) and positions (103, 105).

Using the formulas from Chapter 1 (frequencies 1 and 1/100):

Position	Dim 0: sin(p)	Dim 1: cos(p)	Dim 2: sin(p/100)	Dim 3: cos(p/100)
p = 3	0.141	−0.990	0.030	1.000
p = 5	−0.959	0.284	0.050	0.999
p = 103	0.623	−0.782	0.857	0.515
p = 105	−0.971	−0.241	0.867	0.498

Look at the raw vectors. PE(3) = [0.141, −0.990, 0.030, 1.000] and PE(103) = [0.623, −0.782, 0.857, 0.515]. These are completely different vectors, even though both are the "start" of a distance-2 pair.

The model computes attention scores using dot products: q₃ · k₅ and q₁₀₃ · k₁₀₅. The queries and keys are built from token content plus these absolute encodings, passed through the learned projections W_Q and W_K. Because the absolute encodings differ, the two scores are generally different numbers, even though the underlying relationship ("2 tokens apart") is identical.

In principle, the learned attention weights W_Q and W_K could learn to extract the relative offset from the absolute encodings. Sinusoidal encodings even have a theoretical property that makes this possible: PE(p + k) can be expressed as a linear function of PE(p). But the model must discover and exploit this relationship through training. It's an extra burden that relative methods eliminate entirely.
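
Both halves of this argument can be checked numerically. In the sketch below, `pe` is the d_model = 4 encoding from the table, and `W_Q` and `W_K` are random matrices standing in for learned projections (an assumption made purely for illustration; trained weights would differ).

```python
import numpy as np

def pe(p):
    # d_model = 4 encoding from Chapter 1: frequencies 1 and 1/100
    return np.array([np.sin(p), np.cos(p), np.sin(p / 100), np.cos(p / 100)])

# Raw encodings: the dot product depends only on the offset between positions
raw_near = pe(3) @ pe(5)
raw_far = pe(103) @ pe(105)

# Hypothetical projections (random stand-ins for learned W_Q and W_K)
rng = np.random.default_rng(0)
W_Q = rng.normal(size=(4, 4))
W_K = rng.normal(size=(4, 4))
proj_near = (pe(3) @ W_Q) @ (pe(5) @ W_K)
proj_far = (pe(103) @ W_Q) @ (pe(105) @ W_K)

print(np.isclose(raw_near, raw_far))    # raw scores agree for equal offsets
print(np.isclose(proj_near, proj_far))  # arbitrary projections usually break this
```

The raw dot product equals cos(2) + cos(0.02) in both cases, which is the relative structure hiding inside the encoding. Arbitrary projections do not preserve it, so the model has to learn projections that do.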

What Relative Encoding Looks Like

The core idea of relative position encoding: instead of adding position vectors to the input, add a position-dependent bias to the attention scores based on the distance between the query and key tokens.

In standard attention, the score between positions i and j is:

score(i, j) = q_i · k_j / √d_k

Shaw et al. (2018) proposed making the score depend on the relative distance (i − j), using learned relative-position vectors added to the keys. A simplified scalar form of the same idea, close to what T5 later adopted, adds a learned bias:

score(i, j) = (q_i · k_j + b_{i−j}) / √d_k

Here b_{i−j} is a learned scalar indexed by the distance between positions. If i − j = 2, we look up b₂, the same value regardless of whether i = 5 or i = 5000. The model learns one bias per distance, not one per position.

The distance is usually clipped to a window: b_k for k in [−K, K] where K is a maximum distance (e.g., 128). Beyond that window, all far-away tokens share the same bias. This keeps the parameter count small: 2K + 1 learnable scalars instead of max_len × d_model.
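
Clipping is easiest to see on a tiny example. The sketch below builds the lookup index for a 5-token sequence with K = 2; `clipped_relative_index` is a hypothetical helper, and `biases` is a made-up stand-in for the 2K + 1 learned scalars.

```python
import numpy as np

def clipped_relative_index(seq_len, K):
    # dist[i, j] = i - j, clipped to [-K, K], then shifted to [0, 2K]
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]
    return np.clip(dist, -K, K) + K

idx = clipped_relative_index(seq_len=5, K=2)
print(idx)
# [[2 1 0 0 0]
#  [3 2 1 0 0]
#  [4 3 2 1 0]
#  [4 4 3 2 1]
#  [4 4 4 3 2]]

biases = np.linspace(-1.0, 1.0, 2 * 2 + 1)  # stand-in for 2K+1 learned scalars
bias_matrix = biases[idx]                    # (5, 5) matrix added to the scores
```

Every entry along a diagonal shares one index, and everything more than K steps away collapses onto the two end values.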

Multiple flavors. Shaw et al. (2018) introduced learned relative position representations. T5 (2020) simplified them to scalar biases in each attention head. ALiBi (2022) used non-learned linear biases (just penalize distance). RoPE (2021) encoded relative position directly into the Q/K vectors via rotation; it is the most popular modern approach and the subject of our next chapter. All share the same principle: encode distance, not index.

See It: Absolute vs Relative

The simulation below shows two tokens on a number line. In absolute mode, each token's encoding depends on its position. Slide the pair along the number line (keeping the distance fixed) and watch the encoding vectors change dramatically. In relative mode, only the distance matters — slide the pair and the relative encoding stays constant.

Absolute vs Relative Position

Drag the slider to move a pair of tokens along the sequence. The distance between them stays fixed at the value you set. In absolute mode, the encoding vectors change as you slide. In relative mode, they don't.

Why This Matters for Length Generalization

This is the clincher. If a model only saw sequences up to length 512 during training:

Absolute (learned): Positions 512 and beyond have no embedding. The model crashes or produces garbage. Complete failure.
Absolute (sinusoidal): Positions 512 and beyond produce valid vectors, but the model never learned how to use them. Quality degrades unpredictably.
Relative: The model learned that "distance = 2" encodes a certain relationship. At position 5000, distance-2 still means distance-2. The bias b₂ is the same. The model generalizes naturally to any absolute position, because it never depended on absolute position in the first place.

This is why most modern open-weight LLMs (Llama, Mistral, Gemma) use some form of relative position encoding. The shift from absolute to relative was one of the most important architectural changes in the transformer's evolution, and it helped enable the jump from 512-token contexts to the 128K+ token contexts that we see today.

From Scratch: A Scalar Relative Position Bias

python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    def __init__(self, max_dist=128):
        super().__init__()
        # One learnable bias per distance in [-max_dist, max_dist]
        self.bias = nn.Embedding(2 * max_dist + 1, 1)
        self.max_dist = max_dist

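    def forward(self, seq_len):
        # Build distance matrix: dist[i,j] = i - j (on the same device as the table)
        pos = torch.arange(seq_len, device=self.bias.weight.device)
        dist = pos[:, None] - pos[None, :]          # (L, L)
        dist = dist.clamp(-self.max_dist, self.max_dist)
        dist = dist + self.max_dist                   # shift to [0, 2*max_dist]
        return self.bias(dist).squeeze(-1)              # (L, L) bias matrix

# Usage: add to attention scores (random q and k stand in for real projections)
seq_len, d_k = 64, 32
q = torch.randn(1, seq_len, d_k)
k = torch.randn(1, seq_len, d_k)
rpb = RelativePositionBias(max_dist=128)
attn_scores = q @ k.transpose(-2, -1) / d_k**0.5
# Bias is added after scaling here; for a learned scalar this only rescales b
attn_scores = attn_scores + rpb(seq_len)  # position-aware!
# Works for seq_len=64 AND seq_len=4096: distance is all that matters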

Sinusoidal encoding DOES have a relative property — in theory. The encoding at position p+k can be expressed as a linear transformation of the encoding at position p: PE(p+k) = M_k · PE(p), where M_k is a rotation matrix that depends only on k. This means the dot product PE(p) · PE(p+k) is theoretically a function of k alone. But in practice, the model has to discover and exploit this through the learned W_Q and W_K matrices. RoPE (next chapter) bakes this rotation directly into the attention computation — making relative position a first-class citizen instead of an emergent property.

Why do relative position methods generalize to longer sequences better than absolute methods?

They use fewer parameters, reducing overfitting
They encode token-pair distances, not absolute indices, so any distance learned during training works at any position
They clip distances to a maximum, ignoring distant tokens
They use sinusoidal functions which produce valid values at any position

Chapter 4: Rotary Position Embeddings (RoPE)

Every method so far adds something: a sinusoidal vector or a learned lookup added to the embedding, or a bias added to the attention scores. RoPE does something fundamentally different: it rotates the query and key vectors before computing attention.

Position 0 gets no rotation. Position 1 gets a small rotation. Position 100 gets a large rotation. And here's the magic: when two rotated vectors are dotted together, the rotation angles subtract, leaving only the distance between positions.

RoPE was introduced by Jianlin Su et al. in 2021 and has since become the default in most open-weights LLMs: Llama, Mistral, Gemma, Qwen, Phi. The reason? It gives you relative position encoding directly through the attention mechanism, with no extra learned parameters.

The Rotation Intuition

Start with a single 2D vector [x, y]. To encode that this vector lives at position p, we rotate it by an angle proportional to p. The rotation angle is p · θ, where θ is a fixed frequency constant.

The standard 2D rotation matrix does this:

[x', y'] = [x · cos(pθ) - y · sin(pθ), x · sin(pθ) + y · cos(pθ)]

Now here's why this is brilliant. Suppose you have a query vector q at position m and a key vector k at position n. Both get rotated before the dot product. The dot product of two 2D vectors rotated by different angles has a beautiful property:

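q_rot · k_rot = |q| · |k| · cos((m − n) · θ + φ)

Here φ is the fixed angle between the unrotated q and k (measured from k to q). It comes from content, not from position.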

The result depends on (m−n) — the relative distance — not on m or n individually. Move both the query and key to positions 1000 and 998? Same dot product as positions 2 and 0. The rotation angles cancel out, leaving only the gap.

The core insight: rotation subtraction creates relative position encoding. Adding position vectors to embeddings gives you absolute position. Rotating Q and K gives you relative position, because the dot product only sees the difference in rotation angles. No extra parameters. No extra memory. Just a matrix multiply before attention.
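For readers who want the missing algebra, here is a short derivation sketch. It assumes R(α) denotes the standard 2D rotation matrix by angle α, and it introduces φ for the angle from the unrotated k to the unrotated q (φ = 0 when they point the same way). It uses two facts: R(α)ᵀ = R(−α) and R(a)R(b) = R(a+b).

```latex
\begin{aligned}
q_{\mathrm{rot}} \cdot k_{\mathrm{rot}}
  &= \big(R(m\theta)\,q\big)^{\top} \big(R(n\theta)\,k\big) \\
  &= q^{\top} R(m\theta)^{\top} R(n\theta)\, k \\
  &= q^{\top} R\big((n-m)\theta\big)\, k \\
  &= |q|\,|k|\,\cos\big((m-n)\theta + \varphi\big)
\end{aligned}
```

The absolute positions m and n disappear in the third line. Only their difference survives, which is why the score is a function of relative distance.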

Hand Calculation: RoPE in 2D

Let's work through a concrete example. We have two tokens:

Query q = [1.0, 0.5] at position m = 3
Key k = [0.8, 0.3] at position n = 1
Base frequency: θ = π/8

Step 1: Rotate the query. q gets rotated by mθ = 3π/8 ≈ 1.178 radians.

cos(3π/8) ≈ 0.383, sin(3π/8) ≈ 0.924
q'[0] = 1.0 × 0.383 - 0.5 × 0.924 = 0.383 - 0.462 = -0.079
q'[1] = 1.0 × 0.924 + 0.5 × 0.383 = 0.924 + 0.191 = 1.115

Rotated query: [-0.079, 1.115]

Step 2: Rotate the key. k gets rotated by nθ = 1 × π/8 ≈ 0.393 radians.

cos(π/8) ≈ 0.924, sin(π/8) ≈ 0.383
k'[0] = 0.8 × 0.924 - 0.3 × 0.383 = 0.739 - 0.115 = 0.624
k'[1] = 0.8 × 0.383 + 0.3 × 0.924 = 0.306 + 0.277 = 0.583

Rotated key: [0.624, 0.583]

Step 3: Dot product.

q' · k' = (-0.079)(0.624) + (1.115)(0.583) = -0.049 + 0.650 = 0.601

Step 4: The magic test. Now shift both positions by 100 — query at position 103, key at position 101. Same relative distance of 2. Recompute:

Rotate q by 103θ = 103π/8. Since cos and the dot product formula depend only on (m-n)θ, the dot product is still governed by (103-101)θ = 2π/8 = π/4.
The dot product is identical: 0.601.

Verified: relative position invariance. Shifting both positions by any amount leaves the dot product unchanged. The attention score between these two tokens is the same whether they're at positions 3&1, 103&101, or 10003&10001. RoPE is a true relative position encoding.
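The hand calculation can be replayed in a few lines of plain Python. This is a minimal sketch; `rotate` and `rope_score` are helper names invented here.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    x, y = v
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, m, n, theta):
    """Dot product of q rotated to position m and k rotated to position n."""
    qr = rotate(q, m * theta)
    kr = rotate(k, n * theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

theta = math.pi / 8
q, k = (1.0, 0.5), (0.8, 0.3)

base_score = rope_score(q, k, 3, 1, theta)         # positions 3 and 1
shifted_score = rope_score(q, k, 103, 101, theta)  # same gap, shifted by 100
far_score = rope_score(q, k, 10003, 10001, theta)  # same gap, shifted by 10000
other_gap = rope_score(q, k, 3, 0, theta)          # different gap (3 instead of 2)

print(round(base_score, 3))                    # 0.601
print(abs(base_score - shifted_score) < 1e-9)  # True
print(abs(base_score - far_score) < 1e-9)      # True
print(abs(base_score - other_gap) > 1e-3)      # True: the gap changed, so the score changed
```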

Interactive figure: RoPE Rotation Visualizer. Watch query and key vectors rotate as position increases; the dot product stays constant when both positions shift together. Controls: query position (m), key position (n), shift applied to both.

The Multi-Frequency Extension

In practice, each attention head has many dimensions: 64, 128, or more. RoPE splits these d dimensions into d/2 pairs, where each pair forms an independent 2D subspace. Each subspace uses a different rotation frequency:

θ_i = 1 / 10000^(2i/d) for i = 0, 1, ..., d/2 − 1

This is the same base-10000 formula used in sinusoidal encodings — and for the same reason. Low-index pairs (small i) get high frequencies — they rotate fast and encode fine-grained position differences. High-index pairs (large i) get low frequencies — they rotate slowly and encode coarse, long-range relative position. The full spectrum lets the model attend at multiple scales simultaneously.

For a head dimension d=64, you get 32 pairs. Pair 0 has θ₀ = 1.0 (rotates one full radian per position). Pair 31 has θ₃₁ = 1/10000^(62/64) ≈ 0.000133 (barely rotates, even over thousands of positions). Together they give the model both a high-resolution local clock and a slowly-ticking global clock.
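To make the frequency spectrum concrete, the sketch below evaluates the formula for d=64 and converts each frequency to a wavelength (positions per full turn). The function name `rope_frequencies` is an assumption of this example.

```python
import math

def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2i/d) for each of the d/2 dimension pairs."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

freqs = rope_frequencies(64)
wavelengths = [2 * math.pi / f for f in freqs]  # positions needed for one full turn

print(len(freqs))                 # 32
print(freqs[0])                   # 1.0
print(f"{freqs[31]:.6f}")         # 0.000133
print(round(wavelengths[0], 2))   # 6.28 positions per turn for the fastest pair
print(round(wavelengths[31]))     # tens of thousands of positions for the slowest pair
```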

Implementation: RoPE from Scratch

python
import torch
import math

def precompute_rope_freqs(dim, max_seq_len, base=10000.0):
    # dim = head dimension (e.g., 64)
    # Each pair of dims gets a frequency: theta_i = 1 / base^(2i/dim)
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # freqs shape: [dim/2]

    # Positions: [0, 1, 2, ..., max_seq_len-1]
    positions = torch.arange(max_seq_len).float()
    # positions shape: [max_seq_len]

    # Outer product: angle at each (position, frequency pair)
    angles = torch.outer(positions, freqs)
    # angles shape: [max_seq_len, dim/2]

    # Precompute cos and sin for efficiency
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos_cached, sin_cached):
    # x shape: [batch, seq_len, n_heads, dim]
    # Split into even/odd pairs: [x0,x1], [x2,x3], ...
    x_even = x[..., ::2]   # shape: [batch, seq, heads, dim/2]
    x_odd  = x[..., 1::2]   # shape: [batch, seq, heads, dim/2]

    seq_len = x.shape[1]
    cos = cos_cached[:seq_len].unsqueeze(0).unsqueeze(2)  # [1, seq, 1, dim/2]
    sin = sin_cached[:seq_len].unsqueeze(0).unsqueeze(2)  # [1, seq, 1, dim/2]

    # 2D rotation: [x*cos - y*sin, x*sin + y*cos]
    out_even = x_even * cos - x_odd * sin
    out_odd  = x_even * sin + x_odd * cos

    # Interleave back: [x0', x1', x2', x3', ...]
    out = torch.stack([out_even, out_odd], dim=-1).flatten(-2)
    return out

RoPE is NOT applied to the embeddings. This is a critical distinction. RoPE rotates Q and K after the linear projection, before the dot product. The value vectors V are never rotated. Position only affects how tokens find each other (the Q·K attention score). It does not affect what information they carry (the V vectors). This keeps the value pathway position-free — the model can transport content independently of where it sits in the sequence.
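As a sanity check on the multi-frequency version, here is a dependency-free sketch of the same interleaved-pair rotation applied to plain Python lists. The names `rope_rotate` and `score` and the example vectors are made up for illustration. It checks two properties: the Q·K score depends only on the position gap, and rotation leaves vector length unchanged.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs (vec[0], vec[1]), (vec[2], vec[3]), ... by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, m, n):
    """Attention logit (unscaled) between q at position m and k at position n."""
    qr, kr = rope_rotate(q, m), rope_rotate(k, n)
    return sum(a * b for a, b in zip(qr, kr))

q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.5, 0.8, 0.6, -0.3, 0.4, 0.7]

s_gap3 = score(q, k, 7, 4)
s_gap3_shifted = score(q, k, 507, 504)  # same gap, both positions shifted by 500
s_gap5 = score(q, k, 7, 2)              # different gap

norm_before = sum(x * x for x in q)
norm_after = sum(x * x for x in rope_rotate(q, 123))

print(abs(s_gap3 - s_gap3_shifted) < 1e-9)  # True
print(abs(s_gap3 - s_gap5) > 1e-3)          # True
print(abs(norm_before - norm_after) < 1e-9) # True
```

V never passes through `rope_rotate` here, matching the point above that only the Q·K pathway carries position.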

Why does the dot product between RoPE-rotated vectors depend only on relative position?

Because RoPE adds position-dependent vectors that cancel in the dot product
Because rotation angles subtract in the dot product: angle(m) - angle(n) = θ·(m-n), depending only on distance
Because RoPE normalizes vectors to unit length before computing attention
Because the base frequency 10000 was specifically chosen to create relative encodings

Chapter 5: ALiBi — Attention with Linear Biases

What if position encoding didn't touch the embeddings at all? What if, instead of modifying Q, K, or the input, you just subtracted a penalty from the attention score — a penalty proportional to how far apart two tokens are?

Nearby tokens pay almost no penalty and attend freely. Distant tokens pay a steep penalty and get nearly zero attention weight. That's ALiBi (Attention with Linear Biases), introduced by Press, Smith, and Lewis in 2022.

No position embedding. No rotation. No extra parameters at all. Just a simple bias subtracted from attention scores. It's almost offensively simple — and it works remarkably well.

The Mechanism

For head h with slope m_h, the attention score between a query at position i and a key at position j becomes:

score(i, j) = q_i · k_j / √d_k − m_h · |i − j|

That's it. The raw dot-product score, minus a linear penalty proportional to distance. The slope m_h controls how aggressive the penalty is. A large slope means "pay attention mostly to nearby tokens." A small slope means "distance barely matters — attend broadly."

The slopes are not learned. They're set geometrically, fixed before training and never updated. For H attention heads:

m_h = 1 / 2^{h · 8/H} for h = 1, 2, ..., H

This gives a geometric series of slopes. Head 1 has the steepest slope (strong locality). The last head has the gentlest slope (wide attention reach). The model learns to route local information through steep-slope heads and global information through gentle-slope heads.
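The slope formula is short enough to evaluate directly. This sketch follows the formula exactly as stated in this lesson; the function name `alibi_slopes` is chosen here, and head counts that are not powers of two may be handled differently in real implementations.

```python
def alibi_slopes(n_heads):
    """m_h = 1 / 2^(h * 8 / H) for h = 1..H, as given in the text."""
    return [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]

slopes4 = alibi_slopes(4)
slopes8 = alibi_slopes(8)

print(slopes4)      # [0.25, 0.0625, 0.015625, 0.00390625]
print(slopes8[0])   # 0.5
print(slopes8[-1])  # 0.00390625
```

With more heads the series starts steeper and steps down more gently, but the gentlest slope is 1/256 in both cases.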

Hand Calculation: ALiBi Bias Matrix

Let's compute the ALiBi biases for a model with H = 4 heads.

Step 1: Compute the slopes.

m₁ = 1/2^(1·8/4) = 1/2² = 1/4 = 0.25
m₂ = 1/2^(2·8/4) = 1/2⁴ = 1/16 = 0.0625
m₃ = 1/2^(3·8/4) = 1/2⁶ = 1/64 ≈ 0.0156
m₄ = 1/2^(4·8/4) = 1/2⁸ = 1/256 ≈ 0.0039

Head 1's slope is 64× steeper than head 4's. They see the world at completely different scales.

Step 2: Build the bias for head 1 (m=0.25).

Consider a 6-token sequence. For the query at position 5 (the last token), the bias to each key position is:

Key pos j	|i−j|	Bias = −m · |i−j|
0	5	−0.25 × 5 = −1.25
1	4	−0.25 × 4 = −1.00
2	3	−0.25 × 3 = −0.75
3	2	−0.25 × 2 = −0.50
4	1	−0.25 × 1 = −0.25
5	0	−0.25 × 0 = 0.00

If the raw attention score to position 0 was 3.0, it becomes 3.0 − 1.25 = 1.75 after the ALiBi bias. With the same raw score, the query's own position 5 keeps the full 3.0. After softmax, that 1.25-point gap leaves position 0 with about e^(−1.25) ≈ 0.29 times the weight of position 5, and the gap keeps widening linearly with distance in this steep-slope head.

Step 3: Compare with head 4 (m=0.0039).

Same query at position 5, same key at position 0. Bias = −0.0039 × 5 = −0.0195. A raw score of 3.0 becomes 2.98. Head 4 barely notices the distance. It attends almost uniformly across the sequence — a global attention head.
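To see how the two heads differ after softmax, the sketch below applies each slope to six identical raw scores. The equal raw scores are an assumption made purely to isolate the effect of the bias, and `softmax` and `alibi_weights` are helper names invented here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def alibi_weights(raw_scores, query_pos, slope):
    """Subtract slope * |i - j| from each raw score, then normalize with softmax."""
    biased = [s - slope * abs(query_pos - j) for j, s in enumerate(raw_scores)]
    return softmax(biased)

raw = [3.0] * 6  # identical content scores, so only distance differs

head1 = alibi_weights(raw, query_pos=5, slope=0.25)     # steep head
head4 = alibi_weights(raw, query_pos=5, slope=1 / 256)  # gentle head

ratio1 = head1[5] / head1[0]  # e^(1.25), roughly 3.5x more weight on the nearest key
ratio4 = head4[5] / head4[0]  # e^(5/256), roughly 1.02x: almost uniform

print([round(w, 3) for w in head1])
print([round(w, 3) for w in head4])
print(round(ratio1, 2), round(ratio4, 2))
```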

Multi-scale attention by design. ALiBi's geometric slopes naturally create a spectrum: some heads are local (steep penalty, attend only to neighbors), others are global (gentle penalty, attend across the full sequence). The model doesn't have to learn this division — it's baked in from the architecture.

Interactive figure: ALiBi Bias Heatmap. Visualizes the attention bias matrix for each head; steep slopes create sharp diagonal patterns (local attention), gentle slopes create nearly uniform patterns (global attention). Controls: number of heads (H), selected head, sequence length.

Length Extrapolation

ALiBi's killer feature: since the bias is just a linear function of distance, it works at any sequence length — even lengths never seen during training. Train on 1024 tokens, deploy at 8192: the bias formula is the same, just applied to larger distances. No retraining needed. No fine-tuning. No interpolation tricks.

This was revolutionary when ALiBi was published. Learned embeddings fail catastrophically beyond training length (no embedding exists for position 1025). Sinusoidal encodings degrade. RoPE starts to break at 2–4× training length. ALiBi just... works. The linear penalty scales naturally because distance is distance, whether it's 5 tokens or 5000.

Implementation: ALiBi from Scratch

python
import torch
import math

def get_alibi_slopes(n_heads):
    # Geometric series: 1/2^(1*8/H), 1/2^(2*8/H), ...
    ratio = 2 ** (8 / n_heads)
    slopes = [1.0 / (ratio ** i) for i in range(1, n_heads + 1)]
    return torch.tensor(slopes)

def build_alibi_bias(n_heads, max_seq_len):
    # slopes shape: [n_heads]
    slopes = get_alibi_slopes(n_heads)  # e.g., [0.25, 0.0625, ...]

    # Distance matrix: |i - j| for all query-key pairs
    positions = torch.arange(max_seq_len)
    dist = (positions.unsqueeze(1) - positions.unsqueeze(0)).abs().float()
    # dist shape: [seq_len, seq_len]

    # Bias = -slope * distance, per head
    bias = -slopes.view(-1, 1, 1) * dist.unsqueeze(0)
    # bias shape: [n_heads, seq_len, seq_len]
    return bias

def alibi_attention(Q, K, V, alibi_bias):
    # Q, K shape: [batch, n_heads, seq_len, d_k]
    d_k = Q.shape[-1]
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # scores shape: [batch, n_heads, seq_len, seq_len]

    # Add ALiBi bias (broadcasts over batch)
    scores = scores + alibi_bias[:, :Q.shape[2], :K.shape[2]]

    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V)

ALiBi doesn't learn position — it ASSUMES locality. The biases are fixed (not learned) and they always penalize distance. This is a strong inductive bias: "nearby tokens are more important." For most language tasks, this is true. But for tasks requiring long-range exact retrieval — like finding a specific fact buried at the start of a 100K document — ALiBi's linear penalty can be too aggressive. The relevant token might get a bias of −500, effectively zeroing its attention weight regardless of content relevance.
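A quick calculation shows how large these penalties get. The sketch below assumes H = 8 heads and a key 100,000 tokens away from the query; both numbers are illustrative choices, and `alibi_slopes` reuses the slope formula from earlier in this chapter.

```python
import math

def alibi_slopes(n_heads):
    """m_h = 1 / 2^(h * 8 / H) for h = 1..H."""
    return [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]

distance = 100_000
penalties = [-m * distance for m in alibi_slopes(8)]

print(penalties[0])   # -50000.0  (steepest head)
print(penalties[-1])  # -390.625  (gentlest head)

# Relative softmax weight versus an equally scored key at distance 0.
print(math.exp(penalties[-1]) < 1e-100)  # True: effectively zero even in the gentlest head
```

Even the gentlest head would need a raw score advantage of several hundred points to retrieve that token, which ordinary dot-product scores do not provide.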

How does ALiBi encode position information?

By rotating query and key vectors before the dot product
By adding a learned position embedding to the input tokens
By subtracting a distance-proportional penalty from attention scores — no embedding modification needed
By multiplying attention weights by a distance-dependent decay factor after softmax

Chapter 6: Length Extrapolation Arena

You've trained a model on sequences of length 512. Now someone pastes a 4096-token document. What happens?

The answer depends entirely on your position encoding choice. Some methods fail catastrophically the instant you exceed the training length. Others degrade gracefully. One barely notices. This simulation lets you see every failure mode — and every survival strategy — side by side.

How to use the Arena. Set a training length, then drag the test length beyond it. Toggle each position encoding method on/off to compare. Watch the attention heatmap change: clean structure means the model still works; noisy static means it's broken. The quality bars below give a quantitative summary. Try these experiments: (1) Learned at 2× training length — instant death. (2) RoPE at 4× — noticeable degradation. (3) ALiBi at 8× — barely affected. (4) Turn on NTK scaling to rescue RoPE.

Interactive figure: Length Extrapolation Showdown. Train on short sequences, test on long ones, and compare which position encodings survive. Toggles: Sinusoidal, Learned, RoPE, ALiBi, RoPE+NTK. Controls: training length, test length.

What the Arena Reveals

Learned embeddings: instant catastrophe. Beyond the training length, there simply is no embedding for position 513. The model receives random, untrained vectors. Attention patterns become pure noise. This is not graceful degradation — it's a cliff.

Sinusoidal: mild degradation. The sin/cos functions are defined for all positions, so the model doesn't crash. But the attention patterns it learned during training assumed certain frequency relationships that become less reliable at unseen positions. You get blurriness, not static.

RoPE (vanilla): gradual breakdown at 2–4×. The rotation frequencies are all mathematically valid at longer positions. But the low-frequency dimensions reach rotation angles the model never encountered during training, while the high-frequency dimensions have already wrapped around many times. The model has never seen these particular combinations of rotations and doesn't know what they mean. Attention patterns become increasingly incoherent.

ALiBi: graceful to 8× and beyond. The linear penalty is the same function at any distance. A penalty of −m · 1000 is just a bigger version of −m · 10. The model learned to use these biases during training, and the extrapolation is just a natural extension. Only at extreme multiples (16×+) do the distant-token penalties become so large that information flow is completely blocked.

RoPE + NTK scaling: the rescue strategy. By increasing the rotation base, NTK scaling slows down the low-frequency dimensions that cause trouble and leaves the high-frequency ones nearly untouched. The result: RoPE that works reliably at 4–8× the training length. Techniques in this family are how Llama models reached context windows of up to 128K. More on NTK in the next chapter.

The extrapolation hierarchy is clear: Learned < Sinusoidal < RoPE < RoPE+NTK < ALiBi. But extrapolation isn't everything. RoPE gives better in-distribution quality than ALiBi because rotation preserves the full attention structure without assuming locality. That's why most production LLMs use RoPE + NTK scaling rather than ALiBi — you get both quality and extrapolation.

Why Extrapolation Matters

"Just train on longer sequences" sounds like a solution, but sequence length has a quadratic cost in attention. Training on 8K is 16× more expensive than 2K. Training on 32K is 256× more. If your position encoding can extrapolate reliably, you train on affordable short sequences and deploy at the long context you actually need.

This is exactly what happened in practice. Llama 2 trained on 4K context. With RoPE + NTK scaling (via fine-tuning), Llama 2 Long extended to 32K. Code Llama went from 4K to 100K. The position encoding's extrapolation ability was the enabling technology.
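The quadratic-cost figures quoted in this section can be checked in a couple of lines. This is a rough sketch that counts only the seq_len × seq_len attention-score term; `attention_cost_ratio` is a name invented here, and real training cost also includes terms that grow linearly with length.

```python
def attention_cost_ratio(long_len, short_len):
    """Ratio of attention-score work: the score matrix has seq_len * seq_len entries."""
    return (long_len / short_len) ** 2

print(attention_cost_ratio(8 * 1024, 2 * 1024))   # 16.0
print(attention_cost_ratio(32 * 1024, 2 * 1024))  # 256.0
```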

Chapter 7: NTK-Aware Scaling & Modern Tricks

RoPE works beautifully within the training length. But push it to 2× or 4× and attention patterns start to break. We saw this in the arena. Now let's understand exactly why it breaks and how NTK-aware scaling fixes it.

This is the trick that let models jump from 4K to 128K context windows. It was discovered not by a major lab, but by a pseudonymous researcher on Reddit (u/bloc97) in 2023. Within weeks, every open-source LLM had adopted it.

The Frequency Spectrum Problem

Recall that RoPE splits the head dimension into d/2 pairs, each rotating at a different frequency:

θ_i = 1 / 10000^(2i/d)

For d=64, pair 0 has θ₀ = 1.0: it rotates one full radian per position. Pair 31 has θ₃₁ ≈ 0.000133: it barely moves. At the training length of, say, 4096:

Pair 0 (fast): total rotation = 4096 × 1.0 = 4096 radians. That's about 652 full circles. The model has seen every possible angle many times.
Pair 31 (slow): total rotation = 4096 × 0.000133 ≈ 0.55 radians. That's about 31 degrees. This dimension has only ever seen a tiny arc of the rotation circle.

Now extend to 8192 (2× training length):

Pair 0: 8192 radians. Still fine: it cycles through the same angles as before, just more times. No new territory.
Pair 31: 1.09 radians (about 63 degrees). This is twice the angle it ever saw during training. The model has no idea what a rotation of 1.09 radians means in this dimension. It's extrapolating into unknown territory.

The problem is asymmetric across dimensions. Fast-rotating dimensions are fine — they've already wrapped around many times, so there are no "new" angles. Slow-rotating dimensions see genuinely new angles beyond the training length. The high-frequency dimensions are safe; the low-frequency dimensions break. Any fix must be dimension-aware.
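One way to see where the dividing line falls is to list the pairs that never complete a full turn during training. The criterion used below (total rotation under 2π over the training length) is a simplifying assumption of this sketch, and the function names are invented here.

```python
import math

def rope_frequencies(d, base=10000.0):
    return [base ** (-2 * i / d) for i in range(d // 2)]

def pairs_with_unseen_angles(d, train_len, base=10000.0):
    """Pairs whose total rotation over train_len positions is less than one full turn."""
    return [i for i, f in enumerate(rope_frequencies(d, base))
            if f * train_len < 2 * math.pi]

unseen = pairs_with_unseen_angles(64, 4096)
print(unseen)  # [23, 24, 25, 26, 27, 28, 29, 30, 31]
```

Under this criterion, 9 of the 32 pairs for d=64 and a 4096-token training length still have unexplored angles. Those are the dimensions any extension method has to handle.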

NTK-Aware Scaling

The fix is elegant: increase the rotation base. Instead of base = 10000, use:

base' = 10000 · (scale)^(d/(d−2))

where scale = test_length / train_length. This changes every frequency:

θ'_i = 1 / base'^(2i/d)

The key insight: because the exponent is 2i/d, low-index dimensions (fast rotators) are barely affected — they're raised to a small power. High-index dimensions (slow rotators) are raised to a larger power, so the base increase hits them harder, slowing them down proportionally more.

The result: fine-grained position discrimination (fast dimensions) is almost unchanged, while the dangerous slow dimensions are pulled back into the training range. It's a nonlinear frequency adjustment — not uniform stretching.

Hand Calculation: NTK vs No Scaling

Setup: d=64, base=10000, training length=4096, test length=8192, so scale=2.

Without scaling (vanilla RoPE):

Pair	θ_i	Angle at pos 4096	Angle at pos 8192	New territory?
0 (fast)	1.0	4096 rad	8192 rad	No (wraps)
8	0.1	409.6 rad	819.2 rad	No (wraps)
16	0.01	40.96 rad	81.92 rad	No (wraps)
24	0.001	4.096 rad	8.192 rad	Yes
31 (slow)	0.000133	0.546 rad	1.092 rad	Yes

With NTK scaling:

base' = 10000 · 2^(64/62) ≈ 10000 × 2.0452 = 20,452.

Pair	θ'_i	Change	Angle at pos 8192	Versus trained range
0 (fast)	1.0	Unchanged	8192 rad	Wraps, as before
8	0.0836	−16%	685.0 rad	Wraps, as before
16	0.00699	−30%	57.28 rad	Wraps, as before
24	0.000585	−42%	4.790 rad	Slightly beyond 4.096 rad (vanilla: 8.192 rad)
31 (slow)	0.0000667	−50%	0.546 rad	Exactly the trained maximum

With NTK, pair 31's angle at position 8192 is 0.546 rad, exactly the maximum it saw during training: the d/(d−2) exponent is chosen so that the slowest pair is slowed by the full scale factor. Pair 24 reaches 4.79 rad, modestly past the 4.096 rad it saw in training and far short of the 8.192 rad vanilla RoPE would produce. The slow dimensions have been pulled back into, or close to, familiar territory.

NTK scaling preserves fast dimensions while rescuing slow ones. Pair 0 (fast) is completely unchanged: base' raised to the power 0/64 is still 1. Pair 31 (slow) is slowed by 50%, the full scale factor. The transition is smooth, not a cliff: each pair in between is slowed by an intermediate amount.
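The per-pair numbers for this setup can be recomputed directly from the two formulas. This is a minimal sketch; `rope_freq` and `ntk_base` are helper names chosen here.

```python
def rope_freq(i, d, base):
    """theta_i = base^(-2i/d)."""
    return base ** (-2 * i / d)

def ntk_base(base, scale, d):
    """base' = base * scale^(d/(d-2))."""
    return base * scale ** (d / (d - 2))

d, base, train_len, test_len = 64, 10000.0, 4096, 8192
scale = test_len / train_len
new_base = ntk_base(base, scale, d)
print(round(new_base))  # 20452

for i in (0, 8, 16, 24, 31):
    old, new = rope_freq(i, d, base), rope_freq(i, d, new_base)
    print(f"pair {i:2d}: theta {old:.6f} -> {new:.6f} "
          f"({(new / old - 1) * 100:+.0f}%), "
          f"trained max {old * train_len:.3f} rad, "
          f"angle at {test_len}: {new * test_len:.3f} rad")
```

For the slowest pair the last two columns coincide, which is the property the d/(d−2) exponent is built to deliver.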

Interactive figure: Frequency Spectrum Visualizer. Compares rotation frequencies across dimension pairs, showing how NTK scaling selectively slows high-index (slow) dimensions while leaving low-index (fast) dimensions nearly untouched. Series: Original, NTK Scaled, Linear Interp. Controls: scale factor, head dimension (d).

Other Extension Tricks

NTK-aware scaling was the breakthrough, but several refinements followed:

Method	How it works	Pros	Cons
Linear Interpolation (Chen et al., 2023)	Divide position by scale factor: use position p/s instead of p. Uniformly slows ALL frequencies.	Dead simple. One line of code.	Slows fast dims too, hurting fine-grained local discrimination. Needs short fine-tuning to recover.
NTK-Aware (bloc97, 2023)	Increase base to slow frequencies nonlinearly: more for slow dims, less for fast dims.	Preserves local resolution. Works well at 4–8×.	Needs a known scale factor. Moderate quality loss at extreme scales.
Dynamic NTK (emozilla, 2023)	Compute scale factor dynamically: scale = max(1, current_seq_len / train_len). Adjusts on the fly.	No fixed scale needed. Adapts to actual input length.	Slightly more compute. Edge effects at the transition point.
YaRN (Peng et al., 2023)	Split dims into 3 regions: don't scale fast dims, interpolate slow dims, NTK-scale middle dims. Also adds temperature scaling.	Best-in-class quality. Minimal fine-tuning needed.	More hyperparameters. Complex implementation.

In practice, the industry converged on YaRN or Dynamic NTK for production deployments. Llama 3.1 uses a YaRN-inspired approach to achieve 128K context from an 8K training length. Mistral uses a sliding-window variant combined with RoPE extension.
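The difference between the first three rows of the table comes down to how much each dimension pair is slowed. The sketch below expresses that as a per-pair multiplier on the original frequency. All three function names are invented here, and `dynamic_scale` follows the table's description rather than any particular library's formula, which may differ in detail.

```python
def slowdown_linear(i, d, scale):
    """Linear interpolation: position p -> p / scale, so every pair slows by 1/scale."""
    return 1.0 / scale

def slowdown_ntk(i, d, scale):
    """NTK-aware: base -> base * scale^(d/(d-2)), so pair i slows by scale^(-2i/(d-2))."""
    return scale ** (-2 * i / (d - 2))

def dynamic_scale(current_len, train_len):
    """Dynamic NTK: only start scaling once the input exceeds the training length."""
    return max(1.0, current_len / train_len)

d, scale = 64, 8.0
for i in (0, 16, 31):
    print(i, slowdown_linear(i, d, scale), round(slowdown_ntk(i, d, scale), 4))
# Linear: 0.125 for every pair. NTK: 1.0 for pair 0, falling to 0.125 for pair 31.

print(dynamic_scale(2048, 4096))   # 1.0  (within training length: no change)
print(dynamic_scale(16384, 4096))  # 4.0
```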

Implementation: NTK-Aware RoPE

python
import torch

def ntk_rope_freqs(dim, max_seq_len, base=10000.0,
                      train_len=4096, target_len=32768):
    # Scale factor: how many times beyond training?
    scale = max(1.0, target_len / train_len)

    # NTK-aware base adjustment
    # base' = base * scale^(d/(d-2))
    ntk_base = base * (scale ** (dim / (dim - 2)))
    # For scale=8, dim=64: base goes from 10000 to ~85,550

    # Recompute frequencies with new base
    freqs = 1.0 / (ntk_base ** (torch.arange(0, dim, 2).float() / dim))

    # Same outer product as standard RoPE
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, freqs)

    return torch.cos(angles), torch.sin(angles)

# Compare: standard vs NTK-scaled
cos_std, sin_std = precompute_rope_freqs(64, 32768)
cos_ntk, sin_ntk = ntk_rope_freqs(64, 32768,
                                    train_len=4096,
                                    target_len=32768)

# Fast dims (pair 0): frequencies nearly identical
# Slow dims (pair 31): NTK frequency is much smaller
# → slow dims stay in the trained angle range

NTK scaling is not uniform slowing. Linear interpolation divides every position by the scale factor, which reduces ALL frequencies by the same amount and blurs fine-grained position discrimination. NTK-aware scaling changes only the base, but because each frequency is the base raised to the power −2i/d, the same base change lands very differently across dimensions: small i (fast rotators, small exponent) are barely affected, while large i (slow rotators, large exponent) are strongly slowed. The selectivity comes from the 2i/d exponent that RoPE already has. The d/(d−2) power simply picks the new base so that the slowest pair is slowed by exactly the scale factor.

What is the core problem NTK-aware scaling solves?

RoPE uses too much memory at long sequence lengths
Low-frequency (slow-rotating) RoPE dimensions see rotation angles outside the training distribution at extended sequence lengths
The attention computation becomes numerically unstable beyond the training length
RoPE cannot represent positions larger than the base frequency parameter

Chapter 8: Which Method to Use

You now know five position encoding strategies: sinusoidal, learned, relative bias, RoPE, and ALiBi. Plus the extension tricks — NTK scaling, YaRN, linear interpolation. So when you're building or fine-tuning a model, which one do you actually pick?

The answer depends on three things: what kind of model you're building, how long your sequences need to be, and whether you need to extrapolate beyond training length. Here's the decision framework.

The Decision Flowchart

What are you building?

Start here. The architecture constrains your options.

↓

Decoder-only LLM (GPT-style)

Use RoPE. It's the default in Llama, Mistral, Gemma, Qwen, Phi. If you need >4× training length, add NTK or YaRN scaling.

↓

Encoder-only (BERT-style)

Use Learned position embeddings. BERT and RoBERTa both use learned PE. Max length is fixed anyway (512 or 1024).

↓

Need zero-shot length extrapolation?

Use ALiBi. No fine-tuning needed to extend. But in-distribution quality is slightly lower than RoPE.

↓

Encoder-decoder (T5-style)

Use Relative position bias (T5 style). Each head learns a small bias table indexed by distance.

↓

Vision Transformer (ViT)?

Use Learned 2D position embeddings or sinusoidal. Image patches have fixed count at fixed resolution. RoPE-2D is gaining traction for variable-resolution ViTs.

Comparison Table

Method	Parameters	Where applied	Relative?	Extrapolates?	Used in
Sinusoidal	0	Added to embeddings	In theory	Somewhat	Original Transformer (2017)
Learned	L × d	Added to embeddings	No	No — hard crash	BERT, GPT-2, GPT-3
Relative Bias	2K+1 per head	Added to attention scores	Yes	Yes (clipped)	T5, DeBERTa
RoPE	0	Rotates Q and K	Yes	Needs NTK/YaRN	Llama, Mistral, Gemma, Qwen
ALiBi	0	Bias on attention scores	Yes	Excellent	BLOOM, MPT

See It: Method Configurator

Interactive widget: Position Encoding Configurator. Select a scenario to see the recommended position encoding and its key tradeoffs. Each scenario represents a real-world use case.

RoPE from Scratch: Production Pattern

python
import torch

def build_rope(dim, max_len, base=10000.0, device=None):
    # Standard RoPE for modern LLMs
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, device=device).float() / dim))
    t = torch.arange(max_len, device=device).float()
    angles = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(angles), angles)  # complex exp

def apply_rope(x, rope_cache):
    # x: [batch, seq, heads, dim]
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rope = rope_cache[:x.shape[1]].unsqueeze(0).unsqueeze(2)
    return torch.view_as_real(x_complex * rope).flatten(-2).type_as(x)

NTK-Aware Extension: One-Line Fix

python
def build_ntk_rope(dim, max_len, base=10000.0,
                      train_len=4096, target_len=32768, device=None):
    scale = max(1.0, target_len / train_len)
    ntk_base = base * (scale ** (dim / (dim - 2)))  # THE key line
    return build_rope(dim, max_len, base=ntk_base, device=device)

The industry has converged on RoPE + extensions. As of 2024-2025, virtually every open-weights LLM uses RoPE. The debate isn't RoPE vs ALiBi anymore — it's which RoPE extension to use. YaRN and Dynamic NTK are the frontrunners for production deployments. ALiBi remains relevant for models that need effortless length generalization without any fine-tuning.

You're building a decoder-only LLM that will be trained on 8K sequences but must handle 32K at inference. Which position encoding strategy is best?

Learned position embeddings with max_len=32K
Sinusoidal encoding (works for any length)
RoPE with NTK-aware or YaRN scaling (train at 8K, extend to 32K)
ALiBi (zero-shot extrapolation)

Chapter 9: Cheat Sheet & Connections

You now understand the complete positional encoding toolkit — from sinusoidal waves to RoPE rotations. This chapter is your practical reference. No new concepts. Just the formulas, the decision guide, and the connections to where you go next.

Symbol Glossary

Symbol	Meaning	Typical values
p	Absolute position index in the sequence	0 to L-1
d_model	Model embedding dimension	512–8192
d_k	Head dimension (d_model / n_heads)	64–128
i	Dimension pair index (0 to d/2-1)	0–63
θ_i	RoPE rotation frequency for pair i	1.0 to ~0.0001
m_h	ALiBi slope for head h	0.25 to ~0.004
base	RoPE base frequency	10000 (standard)
scale	Length extension ratio (test/train)	1× to 32×

Every Formula in Plain English

Formula	What it says in words
PE(p, 2i) = sin(p / 10000^(2i/d))	The even dimension of position p oscillates like a wave. Fast for small i, slow for large i.
PE(p, 2i+1) = cos(p / 10000^(2i/d))	The odd dimension is the same wave, phase-shifted by 90 degrees.
q' = R(mθ) · q	RoPE: rotate the query vector by an angle proportional to its position.
q' · k' = f(m − n)	The dot product after rotation depends ONLY on the distance between positions.
score − m_h · |i−j|	ALiBi: subtract a distance penalty from the attention score. Nearby tokens pay less.
base' = base · s^(d/(d−2))	NTK scaling: increase the base to slow down frequencies, more for slow dims than fast dims.

The Timeline

2017 — Sinusoidal

Vaswani et al. Fixed sine/cosine waves. The original. Still teaches the key ideas.

↓

2018 — Learned Embeddings

BERT, GPT. Just learn the position table. Slightly better in-distribution, hard length limit.

↓

2018 — Relative Bias

Shaw et al. Encode distance, not index. First relative position method.

↓

2021 — RoPE

Su et al. Rotate Q and K. Relative position from rotation angle subtraction. The modern default.

↓

2022 — ALiBi

Press et al. Just subtract distance from scores. Zero parameters. Best zero-shot extrapolation.

↓

2023 — NTK / YaRN

bloc97, Peng et al. Extend RoPE to longer contexts by adjusting the frequency base. Enabled 128K contexts.

Where to Go Next

If you want to learn about...	Go to...
How attention works (Q, K, V)	Attention & Transformers
Multi-head attention, cross-attention, GQA	Attention Variants
Normalization (BatchNorm to RMSNorm)	Normalization
Optimizers (SGD to AdamW)	Optimizers
Loss functions (cross-entropy, focal, etc.)	Loss Functions
The full GPT architecture	GPT — From Zero to Hero

Key Papers

Paper	Year	Contribution
Vaswani et al. — "Attention Is All You Need"	2017	Introduced sinusoidal position encoding
Devlin et al. — BERT	2018	Popularized learned position embeddings
Shaw et al. — "Self-Attention with Relative Position Representations"	2018	First relative position bias method
Su et al. — "RoFormer: Enhanced Transformer with Rotary Position Embedding"	2021	Introduced RoPE
Press et al. — "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation"	2022	Introduced ALiBi
bloc97 — "NTK-Aware Scaled RoPE"	2023	NTK-aware frequency scaling for RoPE extension
Peng et al. — "YaRN: Efficient Context Window Extension of Large Language Models"	2023	State-of-the-art RoPE extension

The big picture. Positional encoding is one of only four fundamental components of a transformer layer: embedding + position, attention, feedforward, and normalization. You've now mastered one. Each of the others has its own lesson in the Training Foundations series. The path to understanding the full transformer is: position (you are here) → attention → normalization → the complete GPT architecture.

Which statement best summarizes the evolution of positional encoding?

Fixed formulas (sinusoidal) → learned tables → fixed formulas again (RoPE uses the same frequencies)
Absolute position (sinusoidal, learned) → relative position (bias, RoPE, ALiBi) → extended relative (NTK, YaRN)
The field moved from absolute to relative position encoding, then developed scaling tricks to extend relative methods to longer contexts
Each new method completely replaced the previous one, so only RoPE matters today