Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu — Zhuiyi Technology, 2021

RoFormer: Rotary Position Embedding

Encode position as rotation in 2D subspaces. Relative position information emerges naturally from the angle between rotated vectors, with no additive terms and no learned parameters. The position encoding behind LLaMA, Mistral, Gemma, and most other modern open-weight LLMs.

Prerequisites: Transformer attention + Complex numbers / 2D rotation. That's it.

Chapter 0: The Position Problem

A transformer processes tokens in parallel. Unlike RNNs, it has no inherent notion of order. Without position information, "the cat sat on the mat" and "mat the on sat cat the" produce the same per-token outputs, just in a different order: attention sees a bag of tokens. The model literally can't tell which word comes first.
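
The order-blindness described above can be checked on a toy example. The sketch below is a minimal single-head self-attention in plain Python with identity projections and no position signal; the three 2D "embeddings" are made-up values chosen only for illustration.

```python
import math

def attention(xs):
    """Single-head self-attention with identity projections and no position signal."""
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in xs]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, xs))
                    for j in range(len(q))])
    return out

# Toy 2D embeddings for three tokens (made-up values).
tokens = {"cat": [1.0, 0.2], "sat": [0.3, 0.9], "mat": [0.8, 0.5]}

forward = attention([tokens[w] for w in ["cat", "sat", "mat"]])
backward = attention([tokens[w] for w in ["mat", "sat", "cat"]])

# Each token receives the same output vector in both orders:
# reversing the input only reverses the list of outputs.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_f, row_b in zip(forward, reversed(backward))
    for a, b in zip(row_f, row_b)
)
print(same)
```

Reversing the input simply reverses the outputs, so nothing in the computation distinguishes one ordering from another.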

Position encoding solves this by injecting order information into the token representations. But HOW you encode position matters enormously for the model's ability to understand language.

Previous approaches and their limitations

| Method | How It Works | Limitation |
|---|---|---|
| Sinusoidal (Vaswani 2017) | Add fixed sin/cos embeddings to input | No learnable structure. Absolute, not relative. |
| Learned absolute | Learn a separate embedding for each position | Fixed max length. Position 513 is undefined if trained on 512. |
| Relative position bias (T5) | Add learned bias to attention scores based on distance | Adds parameters. Doesn't compose with keys/queries naturally. |
| ALiBi | Linear attention bias that decays with distance | Fixed decay rate. No per-head learning. |

The ideal position encoding should: (1) Encode relative position: what matters is the distance between tokens, not their absolute positions. (2) Be parameter-free: no extra learned weights. (3) Generalize to unseen lengths: work at position 10,000 even if trained on 2,048. (4) Integrate with attention naturally: modify Q and K so that attention scores automatically reflect position. RoPE is designed around all four. It delivers (1), (2), and (4) by construction; for (3) it is defined at every position, but quality beyond the training length usually needs the frequency adjustments covered in Chapter 5.

The key insight is geometric: instead of adding position information to the embedding (which mixes content and position), rotate the embedding by an angle proportional to position. The attention score between two rotated vectors naturally depends on their relative angle — which is their relative position.

[Interactive figure: Position Encoding Comparison. Sinusoidal and learned encodings add position to the embedding; RoPE rotates the embedding.]

What are the four desirable properties of a position encoding that RoPE achieves?

Chapter 1: Rotation as Position

Start with the simplest case: 2D vectors. Imagine the query vector q for the token at position m is a 2D vector $[q_0, q_1]$. To encode position m, RoPE rotates this vector by angle $m\theta$:

$$q_m = R(m\theta)\, q = \big[\, q_0 \cos(m\theta) - q_1 \sin(m\theta),\;\; q_0 \sin(m\theta) + q_1 \cos(m\theta) \,\big]$$

This is just a 2D rotation matrix applied to the vector. The angle is proportional to the position: position 0 gets no rotation, position 1 gets rotation θ, position 2 gets rotation 2θ, and so on.
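
As a small illustration of the rotation just described, here is a plain-Python sketch. The helper name `rotate_2d` and the step of 30 degrees per position are assumptions picked for readability, not values from the paper.

```python
import math

def rotate_2d(vec, position, theta=1.0):
    """Rotate a 2D vector counter-clockwise by position * theta radians."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return [vec[0] * c - vec[1] * s, vec[0] * s + vec[1] * c]

q = [1.0, 0.0]
theta = math.pi / 6  # illustrative step: 30 degrees per position

for m in range(4):
    x, y = rotate_2d(q, m, theta)
    # Columns: position, rotated x, rotated y, vector length
    print(m, round(x, 3), round(y, 3), round(math.hypot(x, y), 3))
```

The length column stays at 1 for every position: rotation changes the direction of the vector, never its norm, so the content magnitude is untouched.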

Why rotation encodes relative position

The attention score between the query at position m and the key at position n is their dot product $q_m \cdot k_n$. When both vectors are rotated:

$$q_m \cdot k_n = R(m\theta)\,q \cdot R(n\theta)\,k = q \cdot R\big((n-m)\theta\big)\,k$$

The dot product of two rotated vectors depends only on their relative rotation (n-m)θ, not on their absolute positions m and n. This is the mathematical magic: rotation naturally produces relative position encoding.

The rotation trick: When you rotate vector A by angle α and vector B by angle β, their dot product depends only on the angle difference (β - α). This is because rotation preserves dot products: rotating both vectors by the same amount doesn't change their dot product. So the "extra" information in the dot product is exactly the relative rotation — which encodes relative position.
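
A quick numerical check of this property, as a sketch: the vectors, the angle step `theta`, and the positions below are arbitrary illustrative values.

```python
import math

def rotate_2d(vec, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [vec[0] * c - vec[1] * s, vec[0] * s + vec[1] * c]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.3                       # arbitrary angle per position
q, k = [0.7, -1.2], [1.5, 0.4]    # arbitrary query and key content

def score(m, n):
    """Dot product of the query rotated to position m and the key rotated to n."""
    return dot(rotate_2d(q, m * theta), rotate_2d(k, n * theta))

s_near = score(3, 7)       # offset n - m = 4
s_far = score(103, 107)    # same offset, shifted by 100 positions
s_other = score(3, 8)      # different offset

print(math.isclose(s_near, s_far, abs_tol=1e-9))    # same offset, same score
print(math.isclose(s_near, s_other, abs_tol=1e-9))  # different offset, different score
```

Shifting both positions by 100 leaves the score unchanged up to floating-point error, while changing the offset by one changes it.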

Complex number perspective

RoPE is even simpler in complex notation. Treat each 2D pair as a complex number: $q = q_0 + i q_1$. Rotation by angle $\theta$ is multiplication by $e^{i\theta}$:

$$q_m = q \, e^{i m \theta}$$

The attention dot product (in complex notation, using the conjugate) becomes:

$$q_m \, k_n^* = q \, k^* \, e^{i(m-n)\theta}$$

The real part of this product is the ordinary 2D dot product of the rotated vectors. The position information appears as a phase factor $e^{i(m-n)\theta}$ that depends only on the relative position $(m-n)$.
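
The complex-number view can be verified with Python's built-in complex type. This is a sketch with arbitrary values for `q`, `k`, and `theta`; the helper name `rope` is illustrative.

```python
import cmath
import math

theta = 0.3
q = complex(0.7, -1.2)   # q0 + i*q1
k = complex(1.5, 0.4)    # k0 + i*k1

def rope(z, position):
    """Rotate a complex number by position * theta radians."""
    return z * cmath.exp(1j * position * theta)

m, n = 3, 7
qm, kn = rope(q, m), rope(k, n)

lhs = qm * kn.conjugate()
rhs = q * k.conjugate() * cmath.exp(1j * (m - n) * theta)

# Real dot product of the two rotated vectors, viewed as 2D pairs.
real_dot = qm.real * kn.real + qm.imag * kn.imag

print(cmath.isclose(lhs, rhs, abs_tol=1e-12))
print(math.isclose(lhs.real, real_dot, abs_tol=1e-12))
```

Both checks print `True`: the product carries only the phase for (m - n), and its real part is the attention score.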

python
# RoPE in 2D: rotation-matrix implementation
import math
import torch

def apply_rope_2d(x, position, theta=1.0):
    """
    x: [batch, 2], a 2D query or key vector
    position: int, token position (0, 1, 2, ...)
    theta: rotation angle per position, in radians
    """
    angle = position * theta  # rotation angle m * theta
    cos_a = math.cos(angle)
    sin_a = math.sin(angle)

    # Apply 2D rotation matrix
    x_rot = torch.stack([
        x[..., 0] * cos_a - x[..., 1] * sin_a,
        x[..., 0] * sin_a + x[..., 1] * cos_a
    ], dim=-1)
    return x_rot

[Interactive figure: 2D Rotation Visualizer. A 2D vector rotated by different positions: position 0 means no rotation, and higher positions mean more rotation. The relative angle between any two positions is always (m-n)θ.]

Why does rotating query and key vectors by position-proportional angles produce relative position encoding?

Chapter 2: The RoPE Formula

In practice, embedding dimensions are much larger than 2 — typically d = 128 per head. RoPE extends the 2D rotation to d dimensions by pairing up consecutive dimensions and rotating each pair with a different frequency.

The block diagonal rotation

For a d-dimensional vector, RoPE pairs dimensions (0,1), (2,3), (4,5), ..., (d-2, d-1). Each pair gets its own rotation frequency $\theta_i$. The rotation matrix is block-diagonal:

$$R(m) = \mathrm{diag}\big(R_2(m\theta_0),\; R_2(m\theta_1),\; \ldots,\; R_2(m\theta_{d/2-1})\big)$$

Where $R_2(\cdot)$ is the standard 2x2 rotation matrix and the frequencies follow a geometric progression:

$$\theta_i = 10000^{-2i/d}, \qquad i = 0, 1, \ldots, d/2 - 1$$

This is the same frequency schedule as the original sinusoidal position encoding. Low-frequency dimensions (large i) rotate slowly — they encode long-range positional information. High-frequency dimensions (small i) rotate quickly — they encode fine-grained positional information.
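
Before the tensor version, the same construction can be spelled out in plain Python. This is a minimal sketch: the function names are illustrative, and a tiny dimension of 8 is used so the values are easy to read.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """theta_i = base ** (-2i / dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate each consecutive pair (2i, 2i+1) of vec by position * theta_i."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(vec), base)):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(position * theta), math.sin(position * theta)
        out += [a * c - b * s, a * s + b * c]
    return out

freqs = rope_frequencies(8)
print([round(f, 4) for f in freqs])  # each entry is 10x smaller than the last

x = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
rotated = apply_rope(x, position=5)
print([round(v, 3) for v in rotated])
```

At position 5 the first pair has turned by 5 radians while the last has turned by only 0.005 radians, which is the fast-to-slow spread the block-diagonal matrix encodes.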

python
# Full RoPE implementation
import torch

def compute_rope_freqs(dim, max_seq_len, theta=10000.0):
    """Precompute rotation frequencies for all positions."""
    # Frequency for each dimension pair
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    # freqs: [dim/2] — one frequency per pair

    # Position indices
    positions = torch.arange(max_seq_len).float()  # [seq_len]

    # Outer product: angle = position * frequency
    angles = torch.outer(positions, freqs)  # [seq_len, dim/2]

    # Convert to complex rotation factors
    return torch.polar(torch.ones_like(angles), angles)
    # [seq_len, dim/2] complex: e^(i * position * freq)

def apply_rope(x, rope_freqs):
    """Apply RoPE to queries or keys."""
    # x: [batch, n_heads, seq_len, dim], with dim even
    # rope_freqs: [max_seq_len, dim/2] complex, from compute_rope_freqs
    seq_len = x.shape[-2]

    # Reshape to pairs: [batch, n_heads, seq_len, dim/2, 2]
    # (view_as_complex needs float32/float64, hence .float())
    x_pairs = x.float().reshape(*x.shape[:-1], -1, 2)

    # View as complex: [batch, n_heads, seq_len, dim/2]
    x_complex = torch.view_as_complex(x_pairs)

    # Keep only the positions in use; [seq_len, dim/2] broadcasts over batch and heads
    x_rotated = x_complex * rope_freqs[:seq_len]  # complex multiply = rotation

    # Back to real: [batch, n_heads, seq_len, dim]
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)
RoPE modifies Q and K, not V. RoPE is applied only to queries and keys — the vectors used to compute attention scores. Values are left unchanged. This means position information affects WHERE the model attends (via Q·K scores) but not WHAT information flows (via V). This is a clean separation of "where to look" from "what to say."

[Interactive figure: Multi-Frequency Rotation. Different dimension pairs rotate at different frequencies as the position changes: low-index pairs rotate fast (fine position), high-index pairs rotate slowly (coarse position).]

Why does RoPE use different frequencies for different dimension pairs?

Chapter 3: Relative Position Emerges

Let's verify the key property mathematically. The attention score between position m (query) and position n (key) is:

$$\mathrm{score}(m,n) = (R(m)\,q)^\top (R(n)\,k) = q^\top R(m)^\top R(n)\, k = q^\top R(n-m)\, k$$

The last step uses the fact that rotation matrices satisfy $R(a)^\top R(b) = R(b-a)$. The score depends only on q, k (content), and (n-m) (relative position). Absolute positions m and n have disappeared.
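
The identity used in the last step can be checked on a single 2x2 block; because R(m) is block-diagonal, it then holds block by block. A short derivation, writing $R_2(\alpha)$ for the rotation by angle $\alpha$:

```latex
\begin{aligned}
R_2(\alpha)^\top
  &= \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}
   = R_2(-\alpha), \\[4pt]
R_2(-\alpha)\,R_2(\beta)
  &= \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}
     \begin{pmatrix} \cos\beta & -\sin\beta \\ \sin\beta & \cos\beta \end{pmatrix} \\[4pt]
  &= \begin{pmatrix}
       \cos\alpha\cos\beta + \sin\alpha\sin\beta & \sin\alpha\cos\beta - \cos\alpha\sin\beta \\
       \cos\alpha\sin\beta - \sin\alpha\cos\beta & \sin\alpha\sin\beta + \cos\alpha\cos\beta
     \end{pmatrix} \\[4pt]
  &= \begin{pmatrix} \cos(\beta-\alpha) & -\sin(\beta-\alpha) \\ \sin(\beta-\alpha) & \cos(\beta-\alpha) \end{pmatrix}
   = R_2(\beta-\alpha).
\end{aligned}
```

Setting $\alpha = m\theta_i$ and $\beta = n\theta_i$ in every block gives $R(m)^\top R(n) = R(n-m)$.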

Relative position for free. RoPE doesn't explicitly compute relative positions or add bias terms. Relative position information emerges automatically from the mathematics of rotation. This is more elegant than T5's relative position bias (which adds learned offsets to attention scores) and more parameter-efficient (zero extra parameters).

What the attention score "sees"

Expanding the rotated dot product for the 2D case with relative position $\Delta = n - m$:

$$\mathrm{score} = (q_0 k_0 + q_1 k_1)\cos(\Delta\theta) + (q_1 k_0 - q_0 k_1)\sin(\Delta\theta)$$

The score is a function of content (the $q_i k_j$ terms) modulated by relative position (cos and sin of $\Delta\theta$). Content determines the base attention. Position modulates it.
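
The split into a content part and a position part can be confirmed numerically. The sketch below compares the directly rotated dot product with the closed form (q0·k0 + q1·k1)·cos(Δθ) + (q1·k0 − q0·k1)·sin(Δθ), where Δ = n − m and rotations are counter-clockwise; all numeric values are arbitrary.

```python
import math

def rotate_2d(vec, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [vec[0] * c - vec[1] * s, vec[0] * s + vec[1] * c]

def direct_score(q, k, m, n, theta):
    """Rotate q to position m and k to position n, then take the dot product."""
    qm, kn = rotate_2d(q, m * theta), rotate_2d(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

def closed_form_score(q, k, delta, theta):
    """Content terms modulated by cos/sin of the relative angle delta * theta."""
    content = q[0] * k[0] + q[1] * k[1]   # plain dot product q . k
    cross = q[1] * k[0] - q[0] * k[1]     # antisymmetric content term
    return content * math.cos(delta * theta) + cross * math.sin(delta * theta)

q, k, theta = [0.7, -1.2], [1.5, 0.4], 0.25
m, n = 3, 7

direct = direct_score(q, k, m, n, theta)
closed = closed_form_score(q, k, n - m, theta)
zero_offset = closed_form_score(q, k, 0, theta)

print(math.isclose(direct, closed, abs_tol=1e-12))
print(zero_offset)  # at delta = 0 only the plain dot product q . k remains
```

With zero offset the score reduces to the unrotated dot product, so position acts purely as a modulation on top of content.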

[Interactive figure: Relative Position Proof. With positions m and n set independently, the attention score depends only on the difference (n-m): the same difference gives the same score, regardless of absolute position.]

How does relative position information emerge from RoPE without explicit computation?

Chapter 4: Multi-Dimensional Extension

For a head dimension d = 128 (typical in modern LLMs), RoPE creates 64 pairs, each rotating at a different frequency. The full rotation is a block-diagonal matrix with 64 independent 2x2 rotation blocks.

The frequency spectrum

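The frequencies $\theta_i = 10000^{-2i/d}$ form a geometric sequence, and pair $i$ completes one full turn every $2\pi/\theta_i$ positions. For d = 128:

| Pair index i | Frequency θ_i | Wavelength (positions) | What it encodes |
|---|---|---|---|
| 0 (fastest) | 1.0 | ~6 | Adjacent token relationships |
| 16 | 0.1 | ~63 | Sentence-level structure |
| 32 | 0.01 | ~628 | Paragraph-level structure |
| 63 (slowest) | ~1.2e-4 | ~54,000 | Document-level structure; nearly constant over short contexts |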

This multi-scale representation is what makes RoPE so effective. The model simultaneously tracks fine-grained position (which word is next?) and coarse position (which paragraph am I in?).
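
The scales involved can be recomputed from the frequency formula. In this sketch the wavelength of pair i is taken as 2π/θ_i, the number of positions needed for one full turn; the helper name `wavelength` is illustrative.

```python
import math

def wavelength(i, dim=128, base=10000.0):
    """Positions needed for pair i to complete one full rotation."""
    theta = base ** (-2.0 * i / dim)
    return 2 * math.pi / theta

for i in (0, 16, 32, 48, 63):
    theta = 10000.0 ** (-2.0 * i / 128)
    # Columns: pair index, frequency theta_i, wavelength in positions
    print(i, f"{theta:.2e}", round(wavelength(i)))
```

Consecutive pairs differ by a constant factor of 10000^(1/64), about 1.155, so the 64 pairs cover roughly four orders of magnitude in wavelength.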

Why the base matters. The base (10000 in the formula above) sets the frequency range. Increasing it (e.g., to 500000, as in Llama 3) stretches the wavelengths of the slower pairs, which helps the model handle longer contexts. Adjusting the frequency spectrum in this way is the idea behind NTK-aware scaling, YaRN, and other long-context adaptations of RoPE.
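
To see what a larger base does, the sketch below compares the slowest pair's wavelength for the two base values mentioned here, assuming d = 128. The helper name is illustrative.

```python
import math

def slowest_wavelength(dim, base):
    """Wavelength, in positions, of the last pair i = dim/2 - 1."""
    i = dim // 2 - 1
    theta = base ** (-2.0 * i / dim)
    return 2 * math.pi / theta

dim = 128
for base in (10_000.0, 500_000.0):
    print(int(base), round(slowest_wavelength(dim, base)))

# How much longer the slowest wavelength becomes when the base grows 50x.
ratio = slowest_wavelength(dim, 500_000.0) / slowest_wavelength(dim, 10_000.0)
print(round(ratio, 1))
```

The fastest pair is unaffected by the base (its frequency is always 1), so raising the base widens the spectrum toward long range without blurring adjacent-token resolution.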

Computational cost

RoPE is almost free computationally. It's an element-wise operation (multiply by precomputed sin/cos values) applied to Q and K. The rotation factors are computed once and cached. The overhead compared to no position encoding is negligible — a few element-wise multiplications per token.
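
The "compute once, reuse everywhere" pattern looks roughly like this in plain Python. It is a sketch of the caching idea only; `build_cache` and `rotate_with_cache` are hypothetical names, and real implementations keep these tables as tensors.

```python
import math

def build_cache(dim, max_len, base=10000.0):
    """Precompute cos/sin tables of shape [max_len][dim/2], once per model."""
    freqs = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    cos = [[math.cos(p * f) for f in freqs] for p in range(max_len)]
    sin = [[math.sin(p * f) for f in freqs] for p in range(max_len)]
    return cos, sin

def rotate_with_cache(vec, position, cos, sin):
    """Apply RoPE using table lookups: 4 multiplies and 2 adds per pair."""
    out = []
    for i in range(len(vec) // 2):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = cos[position][i], sin[position][i]
        out += [a * c - b * s, a * s + b * c]
    return out

cos_table, sin_table = build_cache(dim=8, max_len=16)
x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, 1.0]

unchanged = rotate_with_cache(x, 0, cos_table, sin_table)
rotated = rotate_with_cache(x, 9, cos_table, sin_table)
print(unchanged == x)
print(round(sum(v * v for v in rotated), 6), round(sum(v * v for v in x), 6))
```

No trigonometric function is evaluated at attention time; the per-token cost is a handful of multiplications per dimension pair.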

python
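# In practice: RoPE in a transformer layer
# (no causal mask or output projection, for brevity)
import math
import torch
import torch.nn as nn

class RoPEAttention(nn.Module):
    def __init__(self, dim, n_heads, max_len=8192):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Precompute RoPE rotation factors; a buffer follows the module across devices
        freqs = compute_rope_freqs(self.head_dim, max_len)
        self.register_buffer("rope_freqs", freqs, persistent=False)

    def forward(self, x):
        b, s, _ = x.shape  # x: [batch, seq_len, dim]
        # Split into heads: [batch, n_heads, seq_len, head_dim]
        split = lambda t: t.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        freqs = self.rope_freqs[:s]  # rotation factors for the positions in use
        q = apply_rope(q, freqs)  # rotate Q
        k = apply_rope(k, freqs)  # rotate K; V is NOT rotated

        # Standard attention: scores depend on relative position
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        out = torch.softmax(scores, dim=-1) @ v
        # Merge heads back: [batch, seq_len, dim]
        return out.transpose(1, 2).reshape(b, s, -1)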

[Interactive figure: Frequency Spectrum. The full RoPE frequency spectrum for d=128, one bar per dimension pair. Fast-rotating pairs (left) encode fine position; slow-rotating pairs (right) encode coarse position.]

Why does RoPE apply different frequencies to different dimension pairs?

Chapter 5: Long-Range Decay

RoPE has a natural long-range decay property: the attention score between distant tokens is lower on average than between nearby tokens. This emerges from the rotation geometry without being explicitly designed.

Why decay happens

As the relative distance Δ increases, the rotation angle Δθi grows for high-frequency pairs. These fast-rotating dimensions contribute oscillating cos/sin terms that average to zero over large Δ. Only the slowest-rotating dimensions contribute stable signal at long distances.

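$$\mathbb{E}[\mathrm{score}(\Delta)] \approx \sum_i (q_i k_i)\cos(\Delta\theta_i)$$

The sum shrinks on average as $\Delta$ grows, because more and more of its terms fall out of phase with one another. The decay is a trend with oscillations on top, not a strictly monotonic curve.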

This is desirable: language models should pay more attention to nearby tokens (local syntax) than distant ones (remote context), with a smooth decay in between.

Implicit recency bias. RoPE's decay gives the model a built-in recency bias: nearby tokens naturally get higher attention scores. This is similar to ALiBi's linear decay but emerges from the rotation mathematics rather than being manually designed. The decay rate depends on the frequency spectrum, which can be tuned by changing the base frequency θ.
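
The decay can be illustrated with a simplified stand-in for the expected score: set every content term q_i·k_i to 1 and average cos(Δθ_i) over all pairs. That choice is an assumption made purely for illustration, as is the function name.

```python
import math

def mean_cosine(delta, dim=128, base=10000.0):
    """Average of cos(delta * theta_i) over all pairs (every q_i * k_i set to 1)."""
    n_pairs = dim // 2
    total = 0.0
    for i in range(n_pairs):
        theta = base ** (-2.0 * i / dim)
        total += math.cos(delta * theta)
    return total / n_pairs

for delta in (0, 1, 10, 100, 1000):
    print(delta, round(mean_cosine(delta), 3))
```

At Δ = 0 every pair is in phase and the average is exactly 1. As Δ grows, the fast pairs start to cancel and only the slow pairs still contribute, so the average trends downward with some oscillation.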

Length extrapolation

Because RoPE is defined for any integer position, it can technically handle sequences of any length. However, the model hasn't been trained on very long sequences, so the rotation angles for positions beyond the training length are "out of distribution." Modern techniques like YaRN and NTK-aware scaling adjust the frequencies to handle longer contexts.
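
One way such an adjustment can look is sketched below. The rule used here, base' = base · s^(d/(d−2)) for a context extension factor s, is a commonly cited form of NTK-aware scaling; it is an assumption brought in for illustration and is not taken from this text. The function names are hypothetical.

```python
def rope_frequencies(dim, base=10000.0):
    """theta_i = base ** (-2i / dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def ntk_scaled_base(base, scale, dim):
    """Assumed NTK-aware rule: base' = base * scale ** (dim / (dim - 2))."""
    return base * scale ** (dim / (dim - 2))

dim, scale = 128, 4.0   # illustrative: stretch the usable context by 4x
original = rope_frequencies(dim)
stretched = rope_frequencies(dim, ntk_scaled_base(10000.0, scale, dim))

fast_ratio = stretched[0] / original[0]    # fastest pair
slow_ratio = original[-1] / stretched[-1]  # slowest pair
print(fast_ratio, round(slow_ratio, 6))
```

Under this rule the fastest pair keeps its frequency, so local resolution is preserved, while the slowest pair is slowed by the full factor s and the pairs in between are slowed by intermediate amounts.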

[Interactive figure: Long-Range Decay. Expected attention score versus distance: nearby tokens score high, and distant tokens score lower as the high-frequency oscillations cancel out.]

Why do RoPE attention scores naturally decay with distance?

Chapter 6: Rotation Visualizer

This interactive visualization shows RoPE in action. Watch vectors rotate as their position changes, see how attention scores depend on relative position, and explore the frequency spectrum.

[Interactive figure: Full RoPE Rotation Visualizer. Two tokens at positions m and n, with their query and key vectors rotating. The attention score depends only on the relative position (n-m), so shifting both positions by the same amount leaves it unchanged.]

In the visualizer, what happens to the attention score when you shift both m and n by the same amount (e.g., m=5,n=10 vs m=15,n=20)?

Chapter 7: Connections

RoPE is now the default position encoding for nearly every modern LLM. Its mathematical elegance and practical effectiveness made it the clear winner in the position encoding wars.

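| Model | Year | Position Encoding |
|---|---|---|
| GPT-2/3 | 2019-20 | Learned absolute |
| T5 | 2020 | Relative position bias |
| BLOOM | 2022 | ALiBi |
| LLaMA 1/2/3 | 2023-24 | RoPE |
| Mistral/Mixtral | 2023-24 | RoPE |
| Gemma | 2024 | RoPE |
| Qwen | 2024 | RoPE |
| DeepSeek | 2024-25 | RoPE |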

RoPE's legacy

Universality. RoPE is used in LLaMA, Mistral, Gemma, Qwen, DeepSeek, CodeLlama, and virtually every modern open-weight model. It's the de facto standard.

Long context extensions. RoPE's frequency-based design enabled a family of context extension methods: YaRN, NTK-aware scaling, dynamic NTK, and LongRoPE, which extend models from 4K to 128K+ context by modifying the frequency spectrum.

What RoPE got right

Elegance. Zero extra parameters. Works through rotation alone. Relative position emerges mathematically.

Efficiency. Element-wise operations only. Negligible overhead.

What it left open

Length generalization. RoPE doesn't automatically generalize to lengths much longer than training. The frequency spectrum must be adjusted for longer contexts.

From sinusoidal to rotary. Position encoding evolved from Vaswani's sinusoidal (additive, absolute) to Su's RoPE (multiplicative, relative). The key insight: don't ADD position to the embedding, ROTATE it. This produces relative positions for free and became the foundation of most modern LLMs.
Position Encoding Evolution
Why did RoPE become the universal standard for position encoding in modern LLMs?