RetNet — One Computation, Three Forms

Chapter 0: The Impossible Triangle

When you design a sequence model, you want three things at once. Training parallelism, so you can use GPUs efficiently and scale. Cheap inference, ideally constant cost per generated token with no growing memory. And strong performance: quality that matches the best models. For years, the brutal rule was: pick two. RetNet (short for Retentive Network), from Microsoft in 2023, is named for its retention mechanism and was designed to grab all three at once, breaking what the paper called the “impossible triangle.”

Why you could only pick two

Transformers get parallel training and strong performance — but inference is expensive: the cost per token grows with the sequence, and the KV cache grows without bound. They miss the “cheap inference” corner.
Classic RNNs get cheap, constant-cost inference and (in principle) strong performance — but they can't train in parallel (each step waits for the last). They miss the “parallel training” corner.
Earlier linear attention got parallel training and cheap inference — but often gave up performance. It missed the “strong performance” corner.

Each known approach sat on one edge of the triangle, covering two corners but not the third. RetNet's claim was to reach the center — all three corners at once — through one mechanism called retention, which can be computed in three mathematically equivalent ways, each optimal for a different corner.

The one-sentence version. RetNet replaces attention with retention — attention without the softmax, plus a built-in distance decay — and retention can be run three ways from the same weights: a parallel form for fast training, a recurrent form for cheap constant-cost inference, and a chunkwise form that blends both for long sequences. One model, three computation modes, one for each corner of the triangle.

The trick: one mechanism, three forms

This is RetNet's signature idea, and what makes it distinct from its cousins. It's not that RetNet picks a clever middle point — it's that retention is a computation you can literally evaluate three different ways that all produce the identical result. Train with the parallel form (GPU-efficient). Deploy with the recurrent form (constant memory, constant per-token cost). Process very long sequences with the chunkwise form (parallel within chunks, recurrent across them). You switch forms to fit the job, never retraining. The rest of this lesson builds each form and shows they're the same thing.

See it: the triangle

The widget shows the three corners. Click each architecture to see which corners it covers — transformers and RNNs each light up two, leaving one dark. Then click RetNet and watch all three light up. That “all three” is the goal; the three computation forms are how it's reached.

The Impossible Triangle

Three corners: parallel training, cheap inference, strong performance. Click each model to see which corners it covers. RetNet aims for all three via its three computation forms.

Common misconception. “Three forms means three models you train separately.” No — there is one set of weights and one retention computation. The three forms are just different orders of doing the same arithmetic, like the associativity reordering in linear attention. They provably give the same output, so you train once (parallel form) and freely switch to whichever form is cheapest for what you're doing. That equivalence is the whole magic.

What is RetNet's core claim about the “impossible triangle”?

That you must always pick two of the three corners
That retention, computed in three equivalent forms (parallel/recurrent/chunkwise) from the same weights, reaches all three corners: parallel training, cheap inference, and strong performance
That training and inference must use different models

Chapter 1: Retention — Attention With a Decay

The heart of RetNet is retention. The quickest way to understand it: take attention, remove the softmax, and add a fixed decay so that each token weights earlier tokens less the further back they are. Recent tokens count fully; distant tokens fade. That single change (softmax out, distance decay in) is what makes retention expressible in those three forms.

The decay by distance

In retention, how much token n attends to an earlier token m depends on two things: their query-key similarity (as in attention), and a decay factor that shrinks with the gap between them. The decay is a fixed number, call it gamma, between 0 and 1, raised to the power of the distance. So a token one step back is weighted by gamma; two steps back by gamma-squared; k steps back by gamma-to-the-k. The weighting falls off geometrically with distance — an exponential forgetting built right into the mechanism, not learned per token.

Why a fixed decay is the key enabler. Softmax couples all the scores together (it normalizes across them), which is exactly what forced the quadratic matrix in attention. Retention's per-distance decay is different: gamma-to-the-distance factorizes — the weight between positions n and m splits into a part that depends only on n and a part that depends only on m. That factorization is what lets the same computation be rearranged into a recurrent state (Chapter 3) and into chunks (Chapter 4). The decay isn't just a forgetting heuristic; it's the mathematical hinge that makes all three forms possible.
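
The factorization can be written out explicitly. The symbols below (row vectors q_n, k_m, v_m and the output o_n) are notation assumed for this sketch rather than taken from the lesson, and any scaling or normalization is left out:

```latex
\gamma^{\,n-m} = \gamma^{n}\,\gamma^{-m}
\quad\Longrightarrow\quad
o_n = \sum_{m \le n} \gamma^{\,n-m}\,\bigl(q_n k_m^{\top}\bigr)\, v_m
    = \bigl(\gamma^{n} q_n\bigr) \sum_{m \le n} \bigl(\gamma^{-m} k_m\bigr)^{\top} v_m
```

The sum on the right no longer depends on n except through its upper limit, so it can be carried forward as a running total. Since gamma^{-m} grows quickly with m, the literal split is mostly useful as an argument; the multiply-by-gamma update of Chapter 3 expresses the same idea without large intermediate values.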

Worked example: the decay weights

Take gamma = 0.9. How much does the current token retain from tokens at various distances back?

distance back	weight (0.9 ^ distance)
0 (current)	0.9⁰ = 1.00
1	0.9¹ = 0.90
5	0.9⁵ ≈ 0.59
10	0.9¹⁰ ≈ 0.35
50	0.9⁵⁰ ≈ 0.005

By 50 steps back, the weight has collapsed to half a percent — effectively forgotten. Lower gamma (say 0.5) forgets much faster; higher gamma (0.99) remembers far longer. This single number sets the model's “memory horizon.” And the genius is that this geometric decay is exactly the form that turns into a simple multiply-by-gamma in the recurrent state — remember the past, scaled down by gamma, then add the new token. The decay you see here is the recurrence, in disguise.
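
The table can be reproduced in a few lines. This is a minimal sketch: the helper names `decay_weight` and `horizon`, and the 1% cutoff used to define a "memory horizon", are assumptions made for illustration.

```python
import math

def decay_weight(gamma: float, distance: int) -> float:
    """Retention weight for a token `distance` steps back."""
    return gamma ** distance

def horizon(gamma: float, threshold: float = 0.01) -> int:
    """Smallest distance at which the weight falls below `threshold` (assumed cutoff)."""
    return math.floor(math.log(threshold) / math.log(gamma)) + 1

# Reproduce the worked example for gamma = 0.9.
for d in (0, 1, 5, 10, 50):
    print(f"distance {d:>2}: weight {decay_weight(0.9, d):.3f}")

# Compare how far back different gammas "remember" under the assumed cutoff.
for g in (0.5, 0.9, 0.99):
    print(f"gamma {g}: weight drops below 1% after {horizon(g)} steps")
```

Changing `threshold` shifts the numbers but not the ordering: a larger gamma always gives a longer horizon.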

See it: retention weights vs. distance

The widget plots how much the current token retains from each past token, under the decay. Drag gamma: a low value makes a sharp, recency-focused curve (short memory); a high value makes a long, slow tail (long memory). This is the shape of retention — and, as later chapters show, the shape of the recurrent state's forgetting.

Retention Decay by Distance

Retention weight (gamma^distance) for tokens further back. Drag gamma: low = short memory (recency), high = long memory. This decay is what makes retention expressible as a recurrence.

Common misconception. “Retention is just linear attention with a decay slapped on.” They're closely related — both drop softmax and become recurrences — but the explicit, fixed, per-distance decay is RetNet's defining choice, and it's what cleanly enables the chunkwise form (Chapter 4) that's RetNet's real practical contribution. The decay isn't cosmetic; it's the structural ingredient that makes the three-way equivalence exact and efficient.

What two changes turn attention into retention, and why does the decay matter beyond “forgetting”?

Add more heads and a bigger softmax; the decay just saves memory
Remove the softmax and add a fixed per-distance decay (gamma^distance); the decay factorizes the position weighting, which is what lets the same computation become recurrent and chunkwise
Replace queries with keys; the decay prevents overfitting

Chapter 2: The Parallel Form — Train Like a Transformer

The first of retention's three forms is the parallel form, and it's the one you use for training. It looks almost exactly like attention — compute all the pairwise scores at once as a matrix — which means it's just as GPU-friendly and parallelizable. The only difference from attention is what fills the matrix.

The decay mask

In attention, you compute the query-key score matrix and apply softmax. In retention's parallel form, you compute the query-key score matrix and instead multiply it, elementwise, by a decay mask: a matrix that encodes gamma-to-the-distance for every pair of positions, and is zero for future positions (so a token can't see ahead). Entry (n, m) of the mask is gamma raised to the power (n minus m) when m is at or before n, and 0 when m is after n. So the mask is lower-triangular, with 1 on the diagonal (distance 0), and within the triangle it decays as you move away from the diagonal.

Softmax mask vs. decay mask. Causal attention uses a mask that's just 1 on and below the diagonal and 0 above (you can see the present and the past, not the future), then softmax-normalizes. Retention's mask is richer: it's gamma-to-the-distance on and below the diagonal, so not a flat 1 but a value that fades as you go further back, and there's no softmax. So the decay mask takes over the role of the softmax pattern: it directly scales how much each past token's score contributes, with no normalization step. Same matrix machinery as attention, a different (and softmax-free) mask.

Worked example: the decay mask for 4 tokens

With gamma = 0.9 and 4 tokens, the decay mask (rows = current token, columns = attended token) looks like this — 1 on the diagonal, decaying down-left, zero above:

	t0	t1	t2	t3
t0	1.00	0	0	0
t1	0.90	1.00	0	0
t2	0.81	0.90	1.00	0
t3	0.73	0.81	0.90	1.00

Read row t3: it retains its own token fully (1.00), the previous one at 0.90, two back at 0.81, three back at 0.73 — each step further is multiplied by another 0.9. The zeros above the diagonal enforce causality. Multiply this mask elementwise into the query-key scores, then multiply by the values, and you have the retention output for every token — computed all at once, in parallel, exactly like attention. The cost is the same as attention (it builds the full matrix), but that's fine for training, where parallelism is what you want.
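
The mask and the parallel form can be sketched in NumPy as follows. This is an illustrative single-head version under simplifying assumptions: the function names are hypothetical, and the score scaling, normalization, and rotary component that a real model would include are omitted.

```python
import numpy as np

def decay_mask(n: int, gamma: float) -> np.ndarray:
    """D[i, j] = gamma**(i - j) for j <= i, and 0 for future positions j > i."""
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]            # i - j for every pair
    return np.where(dist >= 0, gamma ** np.maximum(dist, 0), 0.0)

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T, masked elementwise by the decay) times V."""
    scores = Q @ K.T                              # all pairwise query-key scores
    return (scores * decay_mask(len(Q), gamma)) @ V

# The 4-token mask from the worked example.
print(np.round(decay_mask(4, 0.9), 2))

# Random toy inputs: 4 tokens, key/value width 3.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 3)) for _ in range(3))
print(retention_parallel(Q, K, V, 0.9).shape)
```

Every row of the output is produced by the same two matrix products, which is why this form parallelizes as well as attention does.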

See it: the parallel decay mask

The widget shows the decay mask as a grid — bright on the diagonal, fading down-left, black above (the future). Drag gamma and watch the triangle's fade change: low gamma makes a tight bright band near the diagonal (short memory); high gamma spreads brightness deep into the past (long memory). This single matrix, multiplied into the scores, is the entire parallel form.

The Parallel Decay Mask

Each cell (row n, col m) = gamma^(n-m) for the past, 0 for the future. Bright = strong retention. Drag gamma to reshape the fade. Multiply this into query-key scores — that's retention, in parallel.

Common misconception. “If the parallel form is O(n²) like attention, RetNet didn't fix anything.” The parallel form is quadratic, but you only use it for training, where you want maximum GPU parallelism and sequences are cut to a fixed training length anyway. For inference, you switch to the recurrent form (next chapter), which is constant-cost per token. RetNet's win isn't a single cheap form; it's having the right form for each job, all from the same weights.

How does retention's parallel form differ from standard causal attention?

It uses no query-key scores at all
Instead of a flat causal mask + softmax, it multiplies the scores by a decay mask (gamma^distance below the diagonal, 0 above) and skips softmax: same parallel matrix machinery, different mask
It processes tokens strictly one at a time

Chapter 3: The Recurrent Form — Deploy Like an RNN

Now the second form, and the one that wins the “cheap inference” corner of the triangle. The exact same retention computation can be rewritten as a recurrent state update — constant cost per token, constant memory, no growing cache. This is what you use at inference time, when generating one token at a time.
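
To make "constant memory, no growing cache" concrete, here is a rough count of the numbers each approach has to keep around during generation. The model shape (24 layers, width 2048, 8 heads) is a hypothetical example, not a configuration from the lesson, and the formulas ignore implementation details such as precision or extra buffers.

```python
def kv_cache_floats(n_tokens: int, n_layers: int, d_model: int) -> int:
    """Transformer KV cache: one key and one value vector per token per layer."""
    return 2 * n_layers * n_tokens * d_model

def retention_state_floats(n_layers: int, n_heads: int, d_key: int, d_value: int) -> int:
    """Recurrent retention: one d_key x d_value state matrix per head per layer."""
    return n_layers * n_heads * d_key * d_value

# Hypothetical model shape, for illustration only.
LAYERS, D_MODEL, HEADS = 24, 2048, 8
D_HEAD = D_MODEL // HEADS

state = retention_state_floats(LAYERS, HEADS, D_HEAD, D_HEAD)
for n in (1_000, 10_000, 100_000):
    cache = kv_cache_floats(n, LAYERS, D_MODEL)
    print(f"{n:>7} tokens: KV cache {cache:>14,} floats, retention state {state:,} floats")
```

The cache column grows in proportion to the number of tokens generated, while the state column does not change at all.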

The state that decays and accumulates

Here's the rewrite, and it falls right out of the per-distance decay from Chapter 1. Keep a running state: a small matrix. At each step: first multiply the state by gamma (this is the decay; everything already stored fades by one step's worth), then add the new token's contribution (the outer product of its key and its value, which is a matrix the same size as the state). To produce the output, multiply the current token's query against the state. That's it: decay, add, read. Two lines, executed once per token.

Why this equals the parallel form. Look at what the recurrence produces. After processing token n, the state holds the sum over all past tokens of (gamma-to-the-distance) times (key times value) — because each older contribution has been multiplied by gamma once per step since it was added. That's exactly the decay-weighted sum the parallel form's decay mask computes, just accumulated incrementally instead of all at once. The geometric decay is what makes “multiply the whole state by gamma each step” correctly reproduce gamma-to-the-distance for every individual past token. Same math, rolled up into a recurrence.

Worked example: three steps

Start with an empty state. Token 1 arrives: state becomes k₁v₁ (nothing to decay yet). Token 2: first decay — state becomes gamma times k₁v₁ — then add k₂v₂, so state = gamma·k₁v₁ + k₂v₂. Token 3: decay again (everything ×gamma) and add k₃v₃, giving gamma²·k₁v₁ + gamma·k₂v₂ + k₃v₃.

Look at token 1's contribution in the final state: it's been multiplied by gamma twice — once at step 2, once at step 3 — giving gamma-squared, which is exactly gamma-to-the-distance (it's 2 steps back). The recurrence reproduces the decay mask's weights perfectly. And critically: the state is a fixed size, the work per token is constant, and you never look back at past tokens. Generate a millionth token at the same cost as the first.
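
The decay-add-read loop, and its agreement with the parallel form, can be checked numerically. This is a minimal single-head sketch with hypothetical function names; it leaves out the scaling, normalization, and rotary component of a full model.

```python
import numpy as np

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: decay the state, add key (outer) value, read with the query."""
    n, d_k = K.shape
    d_v = V.shape[1]
    state = np.zeros((d_k, d_v))                  # fixed size, independent of n
    out = np.empty((n, d_v))
    for t in range(n):
        state = gamma * state + np.outer(K[t], V[t])   # decay, then add
        out[t] = Q[t] @ state                          # read
    return out

def retention_parallel(Q, K, V, gamma):
    """Reference parallel form with the lower-triangular decay mask."""
    idx = np.arange(len(Q))
    dist = idx[:, None] - idx[None, :]
    mask = np.where(dist >= 0, gamma ** np.maximum(dist, 0), 0.0)
    return ((Q @ K.T) * mask) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
same = np.allclose(retention_recurrent(Q, K, V, 0.9), retention_parallel(Q, K, V, 0.9))
print("recurrent matches parallel:", same)
```

Note that the loop never indexes into earlier tokens: everything it needs from the past is already folded into `state`.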

See it: the recurrent state

Step through tokens and watch the recurrent retention state. Each step, the whole state fades by gamma (watch existing entries dim) and the new token's contribution is added (a fresh bright entry). The state never grows. Adjust gamma to see fast vs slow forgetting in the running state — the same gamma that shaped the decay mask now controls how quickly the state forgets.

The Recurrent Retention State

Step through tokens. Each step: the whole state ×gamma (everything fades), then add the new key×value. Output = query × state. Fixed size, constant cost per token — the cheap-inference corner.

Common misconception. “The recurrent form gives a different (worse) result than the parallel form.” They are provably identical — same weights, same output, to numerical precision. That's the entire point of the “one mechanism, three forms” design: you train with the parallel form and deploy with the recurrent form, confident the model behaves exactly the same. You're not approximating — you're re-associating the same arithmetic, just like linear attention's parallel/recurrent duality.

In retention's recurrent form, what are the two operations performed at each token, and why does this equal the parallel form?

Re-attend to all past tokens, then softmax; same cost as attention
Multiply the state by gamma (decay), then add the new key×value; because each old contribution gets ×gamma once per step, it ends up weighted by gamma^distance, exactly the decay mask
Store every token's full vector in a growing list

Chapter 4: The Chunkwise Form — RetNet’s Best Idea

This is the form that makes RetNet genuinely practical, and the one its cousins only gesture at. The parallel form is great for training but quadratic. The recurrent form is great for inference but sequential (slow on GPUs because each step waits for the last). For long sequences during training, neither is ideal — quadratic blows up, sequential wastes the GPU. The chunkwise form is the brilliant compromise: parallel inside chunks, recurrent across them.

The hybrid recipe

Split the long sequence into chunks of, say, 512 tokens each. Then:

Within each chunk: use the parallel form — compute retention over the chunk's tokens all at once, with the decay mask. Each chunk is small, so this small quadratic is cheap and fully GPU-parallel.
Across chunks: use the recurrent form — carry a single retention state from one chunk to the next, summarizing everything before this chunk. Each chunk also attends to that carried-in state, so it sees the full history, not just its own tokens.

So each chunk does two things: it computes the parallel retention within itself, and it adds in the contribution from the recurrent state passed from all previous chunks (decayed appropriately). The output is correct — identical to running the whole sequence in the parallel or recurrent form — but the cost is dramatically better: roughly linear in sequence length, while keeping most of the GPU parallelism. It's the third equivalent form, tuned for the long-sequence regime.
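
The two-part recipe can be sketched as follows: a masked parallel computation inside each chunk, plus a read from a carried state that is then decayed and updated. This is an illustrative single-head version with hypothetical names, omitting scaling, normalization, and the rotary component; the final lines compare it against the plain parallel form.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Reference: full-sequence parallel form."""
    idx = np.arange(len(Q))
    dist = idx[:, None] - idx[None, :]
    mask = np.where(dist >= 0, gamma ** np.maximum(dist, 0), 0.0)
    return ((Q @ K.T) * mask) @ V

def retention_chunkwise(Q, K, V, gamma, chunk_size):
    """Parallel inside each chunk, recurrent state carried across chunks."""
    n, d_k = K.shape
    d_v = V.shape[1]
    state = np.zeros((d_k, d_v))                  # summary of all earlier chunks
    out = np.empty((n, d_v))
    for start in range(0, n, chunk_size):
        q = Q[start:start + chunk_size]
        k = K[start:start + chunk_size]
        v = V[start:start + chunk_size]
        b = len(q)                                # last chunk may be shorter
        pos = np.arange(b)
        dist = pos[:, None] - pos[None, :]
        mask = np.where(dist >= 0, gamma ** np.maximum(dist, 0), 0.0)
        inner = ((q @ k.T) * mask) @ v            # within-chunk, parallel
        # Token at local position j is j + 1 steps past the end of the carried state.
        cross = (q @ state) * (gamma ** (pos + 1))[:, None]
        out[start:start + b] = inner + cross
        # Decay the old state by the chunk length, then add this chunk's tokens,
        # each decayed by its distance to the end of the chunk.
        state = gamma ** b * state + k.T @ (v * (gamma ** (b - 1 - pos))[:, None])
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 4)) for _ in range(3))
ref = retention_parallel(Q, K, V, 0.9)
for size in (1, 3, 4, 10):
    print(size, np.allclose(retention_chunkwise(Q, K, V, 0.9, size), ref))
```

The chunk sizes in the loop include 1 and the full length on purpose, since those are the two limiting cases of the same routine.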

The chunk size is a dial between the two extremes. Set the chunk size to the whole sequence and chunkwise becomes the pure parallel form (one big chunk). Set the chunk size to 1 and it becomes the pure recurrent form (every chunk is one token). In between, you trade off: bigger chunks mean more parallelism but more quadratic work per chunk; smaller chunks mean less quadratic work but more sequential steps across chunks. The sweet spot (a few hundred tokens) captures nearly all the GPU efficiency at near-linear cost. Chunkwise isn't a fourth thing; it's a tunable blend of the two forms you already have.
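
A back-of-the-envelope cost model shows the dial in action. The formula is a deliberately crude assumption (one head of width `d`, constant factors ignored, no hardware effects), and `chunkwise_cost` is a hypothetical helper, so treat the output as a trend rather than a measurement.

```python
def chunkwise_cost(n: int, chunk_size: int, d: int) -> tuple[int, int]:
    """Return (approximate multiply-adds, number of sequential chunk steps)."""
    num_chunks = -(-n // chunk_size)              # ceiling division
    inner = num_chunks * chunk_size ** 2 * d      # masked scores inside each chunk
    cross = num_chunks * chunk_size * d ** 2      # reading and updating the d x d state
    return inner + cross, num_chunks

N, D = 8192, 64                                   # hypothetical sequence length and head width
for size in (1, 64, 512, 8192):
    ops, steps = chunkwise_cost(N, size, D)
    print(f"chunk {size:>5}: ~{ops:>13,} multiply-adds, {steps:>5} sequential steps")
```

Under this model, doubling the sequence length at a fixed chunk size doubles the work, whereas with a single full-length chunk the dominant term quadruples. The second number in each row is the price of small chunks: more steps that must run one after another.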

Why this is the whole point

Without chunkwise, the “three forms” story would be incomplete: you'd train with an expensive quadratic form and only save at inference. Chunkwise is what lets RetNet train efficiently on long sequences — the regime where the quadratic transformer hurts most. This is the same trick the whole linear-attention family eventually adopted (the Linear Attention lesson mentioned it), but RetNet's clean decay structure makes the chunkwise form especially natural and exact. It's the practical heart of the architecture.

See it: chunkwise processing

The widget shows a sequence split into chunks. Within each chunk, tokens are processed in parallel (a small bright block); between chunks, a state arrow carries the summary forward. Drag the chunk size: at maximum it's one big parallel block (pure parallel); at minimum it's a long chain of single tokens (pure recurrent). Watch the cost readout interpolate between quadratic and linear as you slide.

Chunkwise: Parallel Within, Recurrent Across

The sequence in chunks. Each chunk computes in parallel (block) and passes a recurrent state to the next (arrow). Drag chunk size between pure-parallel (one big chunk) and pure-recurrent (size 1).

Common misconception. “Chunkwise is an approximation that trades accuracy for speed.” It's exact — it produces the identical output to the parallel and recurrent forms, because the decay structure lets the within-chunk and across-chunk pieces combine perfectly. You're not approximating the computation; you're factoring it into a cheap-to-evaluate shape. That exactness is why it's called a third form of the same retention, not a separate, lossy method.

How does the chunkwise form combine the best of parallel and recurrent?

It runs the parallel and recurrent forms separately and averages them
Parallel within each chunk (cheap small quadratic, GPU-friendly) and recurrent across chunks (a carried state), giving near-linear cost with most of the parallelism; exact, not approximate
It only processes the first chunk and ignores the rest

Chapter 5: Multi-Scale Retention — Many Memory Horizons

A single decay gamma forces one memory horizon: either you forget fast (good for local patterns, blind to long range) or slow (good for long range, fuzzy on local detail). Real language needs both at once. RetNet's answer is multi-scale retention: like multi-head attention, it has many parallel heads — but each head uses a different gamma. Some heads forget quickly; some remember for a long time. Together they cover a spectrum of timescales.

Each head a different clock

The idea maps cleanly onto multi-head attention's structure, which is why it's a drop-in. In standard multi-head attention, each head learns to attend to different content. In multi-scale retention, each head additionally has a fixed, distinct decay rate: a different time horizon. Head 1 might use gamma near 0.5 (sharp, recency-focused, captures the last few tokens); head 8 might use gamma near 0.99 (long memory, tracks information from hundreds of tokens back; reaching thousands of tokens needs gamma even closer to 1). The decay rates are spread across the heads on a fixed schedule. The model can then route different kinds of dependencies to the head with the matching timescale.

Why fixed, spread-out decays work so well. By assigning each head a different fixed gamma, RetNet guarantees coverage of many timescales without having to learn them — the spectrum is built in. Short-horizon heads handle local syntax (agreement, punctuation); long-horizon heads carry document-level themes or a name introduced pages ago. It's a clean division of temporal labor. And since each head's gamma is fixed, all the three-form machinery (parallel, recurrent, chunkwise) applies per head unchanged — multi-scale doesn't break the equivalence, it just runs it in parallel across several decay rates.
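
A fixed per-head schedule can be as simple as one line. The sketch below assumes the form gamma_h = 1 - 2^(-5-h), modeled on the schedule reported for RetNet; treat the exact constants as an assumption, since any increasing sequence of values in (0, 1) illustrates the same point. The `half_life` helper is hypothetical and just converts a gamma into "steps until the weight halves".

```python
import numpy as np

def head_decays(num_heads: int) -> np.ndarray:
    """Assumed schedule: gamma_h = 1 - 2**(-5 - h) for h = 0 .. num_heads - 1."""
    return 1.0 - 2.0 ** (-5.0 - np.arange(num_heads))

def half_life(gamma):
    """Distance at which gamma**distance has fallen to one half."""
    return np.log(0.5) / np.log(gamma)

gammas = head_decays(8)
for h, (g, hl) in enumerate(zip(gammas, half_life(gammas))):
    print(f"head {h}: gamma = {g:.6f}, half-life ~ {hl:.0f} tokens")
```

Each successive head roughly doubles the half-life of the previous one, so a handful of heads spans timescales from tens to thousands of tokens.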

The gated, normalized head

On top of multi-scale, RetNet wraps the heads with two refinements borrowed from the modern transformer toolkit: a group normalization on each head's output (stabilizing the unnormalized, softmax-free retention values), and a swish gate — a learned gate that modulates the combined output, adding expressiveness much like a gated linear unit. So the full operation is “gated multi-scale multi-head retention,” but the core is just: several retention heads, each with its own decay clock, normalized and gated.
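
Putting the pieces together, a gated multi-scale retention layer might look like the sketch below. It is a simplified illustration under stated assumptions: the weight names (`Wq`, `Wk`, `Wv`, `Wg`, `Wo`) are hypothetical, the group normalization has no learnable scale or shift, the parallel form is used for every head, and score scaling and the rotary component are omitted.

```python
import numpy as np

def swish(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def gated_multiscale_retention(X, Wq, Wk, Wv, Wg, Wo, gammas, eps=1e-5):
    """One retention sublayer: per-head decay, per-head normalization, swish gate."""
    n, d_model = X.shape
    d_head = d_model // len(gammas)
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]
    heads = []
    for i, gamma in enumerate(gammas):
        cols = slice(i * d_head, (i + 1) * d_head)
        q, k, v = X @ Wq[:, cols], X @ Wk[:, cols], X @ Wv[:, cols]
        mask = np.where(dist >= 0, gamma ** np.maximum(dist, 0), 0.0)
        y = ((q @ k.T) * mask) @ v                # this head's retention output
        # Group norm with one group per head: normalize each token's head output.
        y = (y - y.mean(axis=1, keepdims=True)) / np.sqrt(y.var(axis=1, keepdims=True) + eps)
        heads.append(y)
    Y = np.concatenate(heads, axis=1)             # back to width d_model
    return (swish(X @ Wg) * Y) @ Wo               # gate, then output projection

rng = np.random.default_rng(0)
n, d_model = 5, 8
X = rng.standard_normal((n, d_model))
Wq, Wk, Wv, Wg, Wo = (rng.standard_normal((d_model, d_model)) * 0.3 for _ in range(5))
out = gated_multiscale_retention(X, Wq, Wk, Wv, Wg, Wo, gammas=[0.9, 0.99])
print(out.shape)
```

The normalization matters because retention outputs are not bounded the way softmax-weighted averages are; rescaling each head keeps heads with very different gammas on a comparable scale before they are mixed.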

See it: the spectrum of decays

The widget overlays the decay curves of several heads, each with a different gamma. The fast-decay heads (steep curves) capture nearby tokens; the slow-decay heads (long tails) reach far back. Together they tile the timescales. Adjust the number of heads to see the spectrum get denser — more heads, more memory horizons covered.

Multi-Scale: Each Head a Different Decay

Several retention heads, each a different gamma. Steep curves = short memory (local); long tails = long memory (global). Together they cover many timescales at once. Adjust the head count.

Common misconception. “Different decay rates per head means the heads are trained differently.” The decay rates are typically fixed (set by a schedule across heads), not learned — that's a feature, not a limitation. Fixing them guarantees timescale coverage and keeps the three-form math clean. What the heads learn is the usual query/key/value projections — what content to attend to. The gamma just sets how far back each head looks. Content is learned, timescale is assigned.

What does multi-scale retention add over a single-decay retention?

It makes the model quadratic again
Multiple heads each with a different fixed decay rate (gamma), so the model simultaneously covers short memory (local patterns) and long memory (distant dependencies): a spectrum of timescales
It removes the recurrent form

Chapter 6: Three Forms, One Answer

This is the payoff that defines RetNet. The same retention — the same weights, the same input — computed three ways: parallel, recurrent, chunkwise. The simulator runs all three on the same little sequence and shows you two things: that they produce the identical output (the equivalence that makes the whole idea work), and that each one has a different cost profile suited to a different corner of the impossible triangle.

Toggle between the forms and watch the output vector stay the same while the cost and parallelism change. This is the “one mechanism, three forms” idea made concrete — pick the form that fits the job, never retrain:

Parallel — high parallelism, quadratic cost. Use for training.
Recurrent — constant cost per token, sequential. Use for inference.
Chunkwise — near-linear cost, mostly parallel. Use for long-sequence training.

The Same Retention, Computed Three Ways

Toggle the form. The output (top) is identical across all three — that's the equivalence. The cost, memory, and parallelism (bars) differ — that's why each form wins a different corner of the triangle.

What to take away. The output bars don't move when you switch forms — because all three compute the same retention, exactly. What moves is the cost profile. This is the entire elegance of RetNet: you don't choose a form when you design the model, you choose it per task. Train with parallel/chunkwise, serve with recurrent, all from one trained network. That's how it reaches for all three corners of the “impossible” triangle at once.

Common misconception. “Surely the recurrent and parallel outputs drift apart in practice (floating point, etc).” They're the same computation re-associated, so they agree to numerical precision — the tiny differences are the same rounding you'd get re-summing numbers in a different order, not a behavior change. You can train in one form and deploy in another with full confidence the model acts identically. That guarantee is what makes the form-switching practical, not just theoretical.

No quiz — the simulator is the test. If you can explain why the output is identical while the cost differs across the three forms, you understand RetNet's core contribution.

Chapter 7: The RetNet Block

We have retention (the three forms) and multi-scale heads. How is the full model assembled? Exactly as you'd now expect: RetNet reuses the transformer's block structure, swapping the attention sublayer for a retention sublayer. If you know a transformer block, you already know a RetNet block — with one component replaced.

Two sublayers, like a transformer

A RetNet block has two sublayers, each wrapped in a residual connection and normalization (pre-norm, like modern transformers):

The retention sublayer — gated multi-scale retention. This replaces multi-head attention. Tokens are projected to queries, keys, values; multi-scale retention is computed (in whichever of the three forms is appropriate); the heads are group-normalized and combined through a swish gate; then an output projection.
The feed-forward sublayer — identical to a transformer's FFN: a position-wise two-layer MLP that mixes across features. Unchanged from the transformer.

Stack many of these blocks, add token and positional handling, and you have a RetNet model — structurally a transformer with retention in place of attention. This is the same “swap the token mixer, keep the scaffolding” pattern shared by RWKV, Mamba, and xLSTM. The residual + norm backbone that makes deep transformers trainable carries over unchanged, so RetNet inherits all that hard-won training stability.

Why retention doesn't need positional encodings the usual way. Here's a neat consequence of the decay. Standard attention needs explicit positional encodings (sinusoidal, RoPE) because attention itself is order-blind. But retention's decay is inherently positional — gamma-to-the-distance already encodes how far apart tokens are. The decay does double duty: it controls memory and injects relative-position information. RetNet still uses a rotary-style component on the queries and keys, but the decay is itself a strong positional signal, woven into the mechanism rather than added on top.

See it: the block diagram

The widget shows a RetNet block. Notice the familiar transformer skeleton — two residual sublayers with normalization — with gated multi-scale retention where attention would be. Compare it mentally to a transformer block: everything is the same except the token-mixing sublayer.

A RetNet Block

Two residual sublayers (norm + residual), transformer-style: gated multi-scale retention replaces attention; the feed-forward network is unchanged. Click the retention sublayer to expand its internals.

Common misconception. “RetNet must be wildly different from a transformer to get its benefits.” Architecturally it's remarkably close — same blocks, same residuals, same FFN, same norm placement — with one sublayer swapped. That closeness is deliberate and valuable: it means RetNet plugs into existing transformer training infrastructure and benefits from years of transformer engineering. The radical part (the three-form retention) is contained inside one sublayer; everything around it is familiar.

How does a RetNet block relate to a transformer block, and what does the decay let it skip?

It's a completely different structure with no FFN
Same two-sublayer residual structure, but gated multi-scale retention replaces attention; and since the decay is inherently positional, it leans less on add-on positional encodings
It removes residual connections and normalization

Chapter 8: RetNet Among the Family

RetNet shares the 2023–2024 recurrent-revival DNA with RWKV, Mamba, and xLSTM, all fixed-state, linear-cost, constant-memory-inference models. Having built RetNet in detail, you can now see exactly what it shares with its cousins and what sets it apart. The honest summary: same family, distinctive framing.

What it shares

Like the whole family, RetNet drops softmax, carries a fixed-size recurrent state, and can train in parallel while inferring cheaply. Its retention state — accumulate key×value with a decay, query to retrieve — is the same matrix-valued associative memory as linear attention's KV state, the mLSTM's matrix memory, and (with different parameterization) Mamba's SSM state. The grand convergence from the xLSTM lesson includes RetNet: yet another lineage arriving at data-dependent... well, almost — which brings us to the difference.

Model	Decay/forget	Named forms
RetNet	fixed per-head decay (multi-scale)	parallel, recurrent, chunkwise (explicit)
RWKV	learned per-channel decay (data-dep in v6)	parallel / recurrent
Mamba	input-dependent (selective)	parallel scan / recurrent
xLSTM	exponential gating (data-dep)	parallel (mLSTM) / recurrent
Transformer	none (full attention)	parallel only (quadratic)

RetNet's distinctive choices. Two things set RetNet apart. First, its decay is fixed (set per head on a schedule), not data-dependent. That is simpler than Mamba's selectivity or RWKV-6's data-dependent decay, and it is a choice that trades some adaptivity for clean math and the exact three-form equivalence. Second, it explicitly names and uses three computation forms; the chunkwise form, in particular, is RetNet's clearest contribution, made central rather than an afterthought. RetNet's pitch is less “smartest state update” and more “cleanest unification of how to compute a retention state three ways.”

The fixed-decay tradeoff

Is fixed decay a weakness? It's a tradeoff. Data-dependent forgetting (Mamba, RWKV-6) can adapt what to remember based on content, which helps on tasks needing precise, selective recall. RetNet's fixed multi-scale decay is less adaptive but simpler, more stable, and keeps the three-form equivalence exact and efficient. In practice, the family has largely moved toward data-dependent decay (it tends to win on quality), but RetNet's clean three-form framework — especially the chunkwise insight — influenced everyone. It's a landmark for how to think about these models, even where its specific fixed-decay choice was superseded.

See it: the family by forms & decay

Select a model to see its decay type and which computation forms it offers. RetNet stands out for explicitly providing all three named forms; the others share the parallel/recurrent duality but treat chunkwise as an implementation detail. On the decay axis, RetNet's “fixed” sits apart from the family's drift toward “data-dependent.”

The Family: Forms & Decay Type

Select a model. See its decay type (fixed vs data-dependent) and which computation forms it exposes. RetNet is the one that makes all three forms explicit.

Common misconception. “RetNet lost, so it doesn't matter.” Even where data-dependent models pulled ahead on benchmarks, RetNet's framing — the impossible triangle, and especially the explicit parallel/recurrent/chunkwise trichotomy — became standard vocabulary for the whole field. Understanding RetNet gives you the cleanest mental model of how all these models square parallel training with cheap inference. Its ideas outlived its leaderboard position.

What most distinguishes RetNet from RWKV-6 and Mamba?

RetNet is quadratic and they are linear
RetNet uses a fixed (per-head, multi-scale) decay rather than data-dependent forgetting, and explicitly provides three named computation forms, with chunkwise as a central contribution
RetNet has no recurrent form

Chapter 9: Connections & Cheat Sheet

You now understand RetNet completely: the impossible triangle it targets, retention as softmax-free attention with a per-distance decay, the parallel form for training, the recurrent form for inference, the chunkwise form for long sequences, multi-scale heads for many timescales, the transformer-style block, and where it sits in the family. The thread: one retention mechanism, three equivalent computation forms, each winning a different corner of the parallel/cheap/strong triangle.

The cheat sheet

Impossible triangle: parallel training + cheap inference + strong performance (pick all three)

Retention: attention minus softmax, plus a fixed decay gamma^distance

Parallel form: score matrix × decay mask (gamma^dist below diagonal); quadratic; for TRAINING

Recurrent form: state ← gamma·state + key×value; output = query·state; O(1)/token; for INFERENCE

Chunkwise form: parallel within chunks + recurrent across; near-linear; for LONG-SEQ training

All three are exactly equal — same weights, same output, different cost

Multi-scale: each head a different fixed gamma → many memory horizons at once

Distinctive vs family: fixed decay (not data-dependent) + explicit three-form framework

A decision guide

Training the model?

Parallel form (or chunkwise for long sequences) — maximize GPU use.

↓

Serving / generating?

Recurrent form — constant cost and memory per token.

↓

Very long training sequences?

Chunkwise — tune chunk size between parallel and recurrent.

↓

Need content-adaptive memory?

Consider data-dependent cousins (Mamba, RWKV-6) — RetNet's decay is fixed.

Where this connects

Linear Attention & RWKV — the parallel/recurrent duality and the chunkwise trick are shared family-wide; retention is a decayed linear attention.
xLSTM — another lineage reaching the same matrix-valued recurrent state; the grand convergence.
SSM / Mamba — the data-dependent-decay cousin that RetNet's fixed decay contrasts with.
Attention Variants — retention is attention with the softmax removed and a decay added; the chunkwise idea echoes FlashAttention's tiling.
Positional Encoding — retention's decay doubles as relative-position information, lessening reliance on add-on encodings.
Transformer — the architecture RetNet mirrors block-for-block, swapping attention for retention.

The one thing to remember. RetNet's big idea isn't a single clever trick — it's that one computation, retention, can be evaluated three provably-equivalent ways, and you pick the form to fit the job: parallel to train fast, recurrent to serve cheap, chunkwise to scale to long sequences. That “one mechanism, three forms” framing is how the whole field now reasons about squaring transformer-style training with RNN-style inference. The fixed decay was later often superseded by data-dependent variants — but the three-form lens, and the chunkwise insight, are RetNet's lasting gift.

You've trained a RetNet and want to deploy it as a low-latency, constant-memory text generator. Which form do you use, and will it behave like the trained model?

Retrain a separate recurrent model from scratch
Keep using the parallel form; it's the only accurate one
Switch to the recurrent form (constant cost/memory per token) using the same trained weights; it produces the identical output to the parallel form it was trained with

“The same truth can be told three ways — choose the telling that fits the moment.”