Linear Attention & RWKV — Escaping the Quadratic Wall

Chapter 0: The Quadratic Wall

The transformer's superpower — attention — is also its curse. In attention, every token looks at every other token. That “every×every” is the source of both its brilliance and its scaling problem: for a sequence of length n, attention does work proportional to n squared. Double the context and the cost quadruples. This is the quadratic wall, and it's the single biggest obstacle to long-context models.

The numbers get brutal fast. A 1,000-token prompt needs about a million attention scores — fine. A 100,000-token prompt (a long document) needs ten billion. A million-token context needs a trillion. The cost doesn't just grow; it explodes. Memory for the attention matrix grows the same way. This is why long context was, for years, fabulously expensive — you were fighting a square.

So a natural question drove a whole research direction: can we get the benefits of attention — mixing information across a sequence — at a cost that grows only linearly with length, like n instead of n squared? An architecture that's O(n) could handle million-token contexts, run cheaply on edge devices, and process streaming data forever. That family of answers is linear attention, and its most prominent member is RWKV — a model that's a transformer when training and an RNN when running.

The one-sentence version. Standard attention costs grow with the square of the sequence length because every token attends to every other. Linear-attention models — RWKV, Mamba, RetNet, and others — restructure the computation so cost grows only linearly, by summarizing the past into a fixed-size running state instead of re-examining every past token.

Linear vs. quadratic: why it matters so much

The difference between O(n) and O(n squared) isn't a minor speedup — at scale it's the difference between possible and impossible. At 1,000 tokens, quadratic is only 1,000× more than linear — annoying but survivable. At a million tokens, quadratic is a million times more. Linear attention doesn't make models a bit faster; it changes which problems are feasible at all. Long documents, high-resolution sequences, lifelong streaming — these need the curve to be a line, not a parabola.
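
The ratios quoted above follow from a very small cost model. As a minimal sketch (the function name `work_units` is made up for illustration, and charging one unit per token versus one unit per token pair is a deliberately crude assumption), the following Python counts both:

```python
def work_units(n):
    """Return (linear, quadratic, ratio) for a sequence of n tokens.

    Illustrative cost model only: one unit of work per token for the
    linear case, one unit per (token, token) pair for the quadratic case.
    """
    linear = n
    quadratic = n * n
    return linear, quadratic, quadratic // linear


for n in (1_000, 100_000, 1_000_000):
    linear, quadratic, ratio = work_units(n)
    print(f"n={n:>9,}  linear={linear:>9,}  quadratic={quadratic:>17,}  ratio={ratio:,}x")
```

In this model the ratio is simply n, which is why the gap is 1,000× at a thousand tokens and a million× at a million tokens.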

See it: the curves diverge

The widget plots cost against sequence length for quadratic attention and linear attention. At short lengths they're close. Drag the length up and watch the quadratic curve rocket away while the linear one stays nearly flat. That widening gap is the entire motivation for this lesson — and the prize that linear attention reaches for.

Quadratic vs. Linear Cost

Cost vs. sequence length. The quadratic curve (softmax attention) explodes; the linear curve stays manageable. Drag the length and watch the gap become a chasm.

Common misconception. “Linear attention is just a faster, equivalent version of attention.” It is faster, but it is not equivalent — you give something up to escape the square (Chapter 4 is all about what). Softmax attention's quadratic cost buys it a sharp, content-based ability to retrieve any past token exactly. Linear attention trades some of that precision for its speed. It's a different point on the cost–capability curve, not a free lunch.

Why does standard attention cost grow with the square of the sequence length?

Because it uses softmax, which is slow
Because every token computes an attention score with every other token, so n tokens produce n×n scores
Because the model has quadratically many parameters

Chapter 1: Where the Square Comes From

To escape the quadratic cost, we have to see exactly where it comes from. Let's trace standard attention and find the culprit — it's one specific matrix.

The attention matrix

Each token produces three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (what I'll contribute if attended to). To decide how much token i should attend to token j, you take the dot product of i's query with j's key. Do that for every pair (i, j) and you get an n×n attention matrix — one score for every pair of tokens. That matrix is the quadratic monster: n tokens, n scores each, n×n total. Then softmax normalizes each row, and the matrix multiplies the values to produce the output.
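
To make the all-pairs structure concrete, here is a minimal pure-Python sketch of the steps just described. It is an illustration under simplifying assumptions, not a production implementation: it omits the usual scaling of the scores and any causal mask, neither of which this section discusses, and the tiny `Q`, `K`, `V` values are invented.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_attention(Q, K, V):
    """Attention that explicitly builds the full n x n matrix.

    Q, K, V are lists of n vectors. Returns (outputs, weights), where
    weights is the n x n attention matrix after the row-wise softmax.
    """
    # One dot product per (query, key) pair: n * n scores in total.
    scores = [[sum(qi * kj for qi, kj in zip(q, k)) for k in K] for q in Q]
    # Each row is normalized across all keys.
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of the value vectors.
    d_v = len(V[0])
    outputs = [[sum(w * v[c] for w, v in zip(row, V)) for c in range(d_v)]
               for row in weights]
    return outputs, weights


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[10.0], [20.0], [30.0]]
outputs, weights = softmax_attention(Q, K, V)
print(len(weights), "x", len(weights[0]), "attention matrix")
```

The nested list `scores` is the n×n object: its size depends only on the number of tokens, not on the vector dimension.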

The square is the all-pairs matrix. The n×n attention matrix is where both the time and the memory blow up. Computing it is n×n dot products; storing it is n×n numbers. At 100k tokens that matrix has 10 billion entries — far too big to even hold in memory, let alone compute. Everything about escaping the quadratic wall comes down to one goal: never explicitly form that n×n matrix.

Why the softmax forces the matrix

Here's the subtle part — and the crux of the whole lesson. You might think you could avoid the matrix by being clever with the order of multiplications. But the softmax stands in the way. Softmax operates on each row of the attention matrix (normalizing the scores for one query across all keys), which means you must first compute the full row of scores before you can normalize. The softmax glues the query-key step and the value step together, forcing you to materialize the n×n scores in between. Remove the softmax, and — as the next chapter shows — the whole computation can be reordered to avoid the square entirely.

Worked example: counting the cost

Take a sequence of n = 1,000 tokens, each with a vector of dimension d = 64. The attention matrix is 1,000×1,000 = 1,000,000 scores, each costing 64 multiply-adds, so about 64 million operations just to build it — plus a million numbers to store. Now n = 10,000: the matrix is 100,000,000 scores — 100× more for only 10× more tokens. That's the square in action: every time you grow the sequence by a factor, the cost grows by that factor squared. The value step has the same problem. The square is unavoidable as long as you form that matrix.

See it: the matrix filling in

Watch the attention matrix build as tokens arrive. Each new token adds a whole new row and column of scores — so the total fills in as a square. Drag the token count and watch the cell count (and the cost number) grow quadratically. This square grid is precisely what linear attention refuses to build.

The n×n Attention Matrix

Each cell is one query-key score. Add tokens and the grid grows as a square — n tokens means n×n cells. The cell count is the quadratic cost.

Common misconception. “FlashAttention already solved this — it's not quadratic anymore.” FlashAttention is a brilliant memory optimization — it computes attention without ever storing the full matrix, cutting memory from quadratic to linear. But the compute is still quadratic: it still performs n×n dot products, just in a memory-clever way. Linear attention attacks the computation itself, not just its memory footprint. They're complementary, not the same. (See the Attention Variants lesson for FlashAttention.)

What specifically forces standard attention to build the full n×n matrix?

The value vectors are too large
The softmax normalizes each query's scores across all keys, so the full row of scores must be computed before anything else, gluing the steps together and forcing the matrix
The positional encodings require it

Chapter 2: The Linear Trick — Reorder the Multiplication

Here is the beautiful idea at the core of linear attention, and it hinges on one of the oldest facts in algebra: matrix multiplication is associative. You can choose the order in which you multiply three matrices, and the answer is the same — but the cost can be wildly different. Linear attention exploits exactly this freedom, once the softmax is out of the way.

Drop the softmax, then regroup

Standard attention computes, roughly, (scores) × values, where scores come from queries times keys. Written as matrices, it's (Q Kᵀ) V — first multiply queries by keys to get the n×n score matrix, then multiply by values. The parentheses force the giant n×n matrix into existence first.

But if we remove the softmax (or replace it with a simpler feature map), the parentheses become free to move. We can instead compute Q (Kᵀ V) — first multiply keys by values, then queries by that result. And here's the magic: Kᵀ V is a d×d matrix — its size depends only on the feature dimension d, not on the sequence length n. The enormous n×n matrix never appears. The cost drops from proportional to n squared down to proportional to n.

Same answer, different cost. (Q Kᵀ) V and Q (Kᵀ V) give the identical result — associativity guarantees it. But the first builds an n×n matrix (quadratic); the second builds a d×d matrix (linear in n, since d is a fixed small constant). The softmax was the only thing forcing the bad grouping, because it had to act on the n×n scores in between. Remove it, regroup, and the square evaporates. That's linear attention in one move.
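
The equivalence of the two groupings is easy to check numerically. The sketch below uses small invented matrices (n = 4 tokens, d = 2 features) with no softmax and no feature map; the helper names `matmul` and `transpose` are illustrative, not from any library.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]


# n = 4 tokens, d = 2 features (made-up numbers).
Q = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0], [2.0, 2.0]]
K = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Grouping 1: (Q K^T) V. The intermediate is n x n.
scores = matmul(Q, transpose(K))
quadratic_out = matmul(scores, V)

# Grouping 2: Q (K^T V). The intermediate is d x d.
state = matmul(transpose(K), V)
linear_out = matmul(Q, state)

print("intermediate sizes:", len(scores), "x", len(scores[0]),
      "vs", len(state), "x", len(state[0]))
print("same result:", quadratic_out == linear_out)
```

Doubling the number of rows in `Q`, `K`, and `V` would double both sides of `scores` but leave `state` at 2×2.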

Worked example: the two groupings

Let n = 1,000 tokens and d = 64. Compare the two orders:

Grouping	First product	Size of first product	Rough cost (multiply-adds)
(Q Kᵀ) V	Q Kᵀ (scores)	1000 × 1000	~n²d = 64 million
Q (Kᵀ V)	Kᵀ V (a state)	64 × 64	~n·d² = 4 million

The two give the same output, but the second is roughly 16× cheaper here, and the gap grows with n because the ratio of the two costs is n/d. At n = 10,000, the first balloons to 6.4 billion while the second only grows to about 41 million. The first scales as n squared; the second scales as n. The Kᵀ V product is the key: a fixed-size d×d summary of all the keys and values, regardless of how many tokens there are. That little matrix is the whole sequence, compressed into constant size.
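
The figures in this worked example can be reproduced with a few lines of arithmetic. This is a rough sketch that counts only the first product of each grouping, as the table does; the function name `grouping_costs` is invented for illustration.

```python
def grouping_costs(n, d):
    """Rough multiply-add counts for the first product of each grouping.

    (Q K^T) first: n * n scores, each a length-d dot product.
    (K^T V) first: d * d entries, each a length-n dot product.
    """
    quadratic = n * n * d
    linear = n * d * d
    return quadratic, linear


for n in (1_000, 10_000):
    quadratic, linear = grouping_costs(n, d=64)
    print(f"n={n:>6,}  (QK^T)V ~ {quadratic:>13,}  Q(K^TV) ~ {linear:>10,}  "
          f"ratio = {quadratic / linear:.1f}x")
```

The ratio works out to n/d, so with d held at 64 it grows tenfold each time the sequence gets ten times longer.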

What replaces the softmax

We can't just delete the softmax and leave nothing — the softmax provided a nonlinearity and kept the weights positive. Linear attention replaces it with a feature map: a function applied to the queries and keys (something as simple as an elu plus one, or other positivity-preserving maps) that approximates softmax's behavior while allowing the regrouping. The choice of feature map is where different linear-attention variants differ — but the core move, the associativity regrouping, is shared by all of them.
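
As a sketch of how a feature map slots into the regrouped computation, the code below uses elu(x) + 1 on queries and keys. One assumption goes beyond the text above: it also keeps a running sum of the mapped keys so that each query's weights can be normalized to sum to one, which is a common formulation but not the only one. The function names and inputs are illustrative.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, applied elementwise. Always positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Non-causal linear attention with a positive feature map.

    Builds a d x d_v state and a length-d normalizer in one pass over the
    keys and values, then reads both once per query. No n x n matrix.
    """
    d, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d)]   # sum over tokens of phi(k) v^T
    norm = [0.0] * d                          # sum over tokens of phi(k)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            norm[a] += fk[a]
            for c in range(d_v):
                state[a][c] += fk[a] * v[c]
    outputs = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * norm[a] for a in range(d))  # assumed normalizer
        outputs.append([sum(fq[a] * state[a][c] for a in range(d)) / denom
                        for c in range(d_v)])
    return outputs


Q = [[1.0, -1.0], [0.5, 2.0]]
K = [[0.0, 1.0], [-2.0, 0.5], [1.0, 1.0]]
V = [[1.0], [2.0], [4.0]]
outputs = linear_attention(Q, K, V)
print(outputs)
```

Because every mapped feature is positive, each output is a weighted average of the values with positive weights, which is the property of softmax that the feature map is meant to preserve.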

See it: the regrouping

Toggle between the two multiplication orders and watch the intermediate matrix. In (QKᵀ)V order, the intermediate is the huge n×n matrix (it grows as you add tokens). In Q(KᵀV) order, the intermediate is the small fixed d×d state (it stays the same size no matter how many tokens). Add tokens and see one intermediate explode while the other holds steady.

Two Ways to Multiply Three Matrices

Same result, different intermediate. (QKᵀ)V makes an n×n matrix (grows with tokens); Q(KᵀV) makes a fixed d×d state. Toggle the order and add tokens.

Common misconception. “If it's just reordering and gives the same answer, why didn't everyone always do it?” Because the softmax genuinely blocks it — with softmax in the middle, the two orders do not give the same answer, and you're stuck building the matrix. The price of linear attention is removing the softmax, which changes the model's behavior (Chapter 4). The regrouping is free only after you've paid that price. There's no reordering trick that keeps softmax and avoids the square.

How does linear attention avoid building the n×n matrix?

It uses a smaller batch size
By dropping the softmax and using associativity to compute Kᵀ·V first (a fixed d×d state) instead of Q·Kᵀ first (an n×n matrix): same result, cost linear in n
By skipping most of the tokens

Chapter 3: The Recurrent View — Attention as an RNN

The d×d state matrix from the last chapter has a second, even more remarkable property. It can be built up incrementally, one token at a time. And that turns linear attention into something that looks exactly like an old-fashioned recurrent neural network — with all the streaming, constant-memory benefits that implies. This dual nature is the secret to why these models are so practical.

The state that accumulates

Remember Kᵀ V — the d×d summary of all keys and values. Watch what happens when a new token arrives. Its contribution to that summary is just its own key times its own value, added on. So you can keep a running state: when token t arrives, update the state by adding token t's key-value contribution, then read the output by multiplying token t's query against the current state. No looking back at past tokens — everything you need about the past is already summarized in the state. Process the next token, update the state again, and so on.

The past lives in a fixed-size state. A softmax transformer, to generate token 1,000, must look back at all 999 previous tokens (the growing KV cache). A linear-attention model carries a single fixed-size state that already encodes everything relevant about those 999 tokens. To produce each new token it does a constant amount of work and uses a constant amount of memory — no matter how long the sequence is. That's the dream for streaming and long context: the per-token cost never grows.
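
A back-of-the-envelope memory comparison shows what "growing cache versus fixed state" means in bytes. Everything about the model here is hypothetical: 32 layers, model width 4096, 64-dimensional heads, and 2 bytes per stored number are assumed purely for illustration, and the formulas ignore cache-shrinking techniques that real systems may use.

```python
def kv_cache_bytes(n_tokens, n_layers, d_model, bytes_per_value=2):
    """Keys and values: two d_model-sized vectors per token per layer."""
    return 2 * n_layers * n_tokens * d_model * bytes_per_value


def fixed_state_bytes(n_layers, d_model, head_dim, bytes_per_value=2):
    """One head_dim x head_dim state matrix per head per layer."""
    n_heads = d_model // head_dim
    return n_layers * n_heads * head_dim * head_dim * bytes_per_value


MIB = 2 ** 20
for n in (1_000, 100_000, 1_000_000):
    cache = kv_cache_bytes(n, n_layers=32, d_model=4096)
    state = fixed_state_bytes(n_layers=32, d_model=4096, head_dim=64)
    print(f"n={n:>9,}  KV cache = {cache / MIB:>10,.0f} MiB   "
          f"fixed state = {state / MIB:,.0f} MiB")
```

Under these assumptions the cache grows by half a MiB per token, while the state stays at 16 MiB regardless of length.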

The best of both worlds: dual form

Here's the cleverest part, and why these models are trainable at scale. Linear attention has two mathematically equivalent forms:

Parallel form (for training): process the whole sequence at once with big matrix multiplies, using GPUs efficiently — just like a transformer. Fast on the hardware that trains models.
Recurrent form (for inference): process one token at a time, updating the fixed state — just like an RNN. Constant memory and constant per-token cost, ideal for generating text or streaming.

You train in parallel form (fast, GPU-friendly) and deploy in recurrent form (cheap, streaming). The same model, the same weights, two views of the same computation. This solves the historical curse of RNNs — they were slow to train because they were inherently sequential. Linear attention gets RNN-style cheap inference and transformer-style parallel training. That combination is the whole point.
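
The claim that the two forms compute the same thing can be checked on a toy example. The sketch below is simplified on purpose: no feature map, no normalization, no decay, and a naive parallel form that still builds masked pairwise scores, so it shows only that the two views agree, not how an efficient training kernel works. Function names and inputs are invented.

```python
def parallel_form(Q, K, V):
    """Training-style view: all pairwise scores at once, with a causal mask."""
    n, d_v = len(Q), len(V[0])
    scores = [[sum(a * b for a, b in zip(Q[i], K[j])) if j <= i else 0.0
               for j in range(n)] for i in range(n)]
    return [[sum(scores[i][j] * V[j][c] for j in range(n)) for c in range(d_v)]
            for i in range(n)]


def recurrent_form(Q, K, V):
    """Inference-style view: one token at a time, carrying a d x d_v state."""
    d, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d)]
    outputs = []
    for q, k, v in zip(Q, K, V):
        # Add this token's key-value contribution to the state.
        for a in range(d):
            for c in range(d_v):
                state[a][c] += k[a] * v[c]
        # Read the output by multiplying the query against the state.
        outputs.append([sum(q[a] * state[a][c] for a in range(d))
                        for c in range(d_v)])
    return outputs


Q = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0], [2.0, 2.0]]
K = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(parallel_form(Q, K, V) == recurrent_form(Q, K, V))
```

The recurrent function never indexes into earlier tokens; everything it needs from the past is in `state`.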

Worked example: two tokens, by state

Start with an empty state (all zeros). Token 1 arrives with key k₁ and value v₁: update the state to the outer product k₁v₁ᵀ, a d×d matrix. Its output is q₁ times the state. Token 2 arrives with k₂, v₂: update the state to k₁v₁ᵀ + k₂v₂ᵀ (just add the new contribution). Its output is q₂ times this updated state, which already contains both tokens' info. Token 2 never re-examined token 1; it just read the accumulated state. Extend to a million tokens and the state never grows: you only ever store and update that one fixed-size matrix.

See it: the running state

Step through a sequence and watch the recurrent state update. Each new token adds its contribution to the fixed-size state (which never changes size), and its output is read from the current state. Notice the contrast with Chapter 1's growing matrix: here the memory is flat no matter how many tokens you process.

Linear Attention as a Recurrent State

Step through tokens. Each adds its key×value to the fixed-size state; the output reads the state with the query. The state never grows — constant memory, however long the sequence.

Common misconception. “If it's an RNN, it must be slow to train like old RNNs.” That was exactly the historical problem — and the breakthrough is that linear attention is also expressible as parallel matrix multiplies for training. You don't run it sequentially during training; you use the parallel form. Only at inference, where you generate one token at a time anyway, do you switch to the recurrent form. RNN inference economics, transformer training economics, same model.

What is the key practical advantage of linear attention's recurrent form at inference time?

It's more accurate than the parallel form
The past is summarized in a fixed-size state, so generating each new token takes constant time and memory regardless of sequence length (vs a transformer's ever-growing KV cache)
It uses more GPUs

Chapter 4: The Catch — What You Give Up

If linear attention were strictly better — same quality, lower cost — transformers would already be extinct. They aren't, and the reason is a real, fundamental tradeoff. Compressing the past into a fixed-size state is exactly what makes linear attention cheap, and it's also exactly what limits it. You cannot losslessly cram an unbounded history into a constant amount of memory.

The fixed-state bottleneck

A softmax transformer keeps every past token available (the KV cache grows with the sequence), so it can reach back and retrieve any specific one exactly — “what was the name mentioned 50,000 tokens ago?” A linear-attention model has only its fixed-size state, which is a lossy summary of everything that came before. New information overwrites or blurs old information. Ask it to recall a precise detail from far back, and it often can't — the detail was compressed away to make room. The fixed state is a bottleneck: finite memory, infinite history.

The sharpness gap. Softmax can put almost all its attention weight on a single token — a sharp, precise spike that retrieves one specific past item. This is what makes transformers so good at “needle in a haystack” recall. Linear attention's weighting is inherently smoother — it blends the state rather than spiking on one entry — so it's worse at exact, content-addressed retrieval. The very softmax we removed for speed was doing important work: its exponential made sharp, selective focus possible. Take it away and focus gets blurry.
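
A small numerical comparison illustrates the sharpness gap. The setup is invented for this sketch: ten keys, one of which (index 7) matches the query strongly, with softmax weights proportional to exp(q·k) and linear-attention weights proportional to φ(q)·φ(k) using the elu(x) + 1 feature map. Real models learn their queries and keys, so the exact numbers carry no significance beyond the contrast.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(raw):
    """Scale positive weights so they sum to one."""
    total = sum(raw)
    return [r / total for r in raw]


query = [2.0, 0.0]
keys = [[0.5, 0.0] for _ in range(10)]
keys[7] = [2.0, 0.0]   # the one token the query is looking for

softmax_weights = normalize([math.exp(dot(query, k)) for k in keys])
linear_weights = normalize([dot(phi(query), phi(k)) for k in keys])

print(f"softmax weight on token 7: {softmax_weights[7]:.3f}")
print(f"linear  weight on token 7: {linear_weights[7]:.3f}")
```

With these inputs softmax puts roughly 0.69 of its weight on the target, while the linear version puts roughly 0.17 there and spreads the rest over the other nine tokens. The exponential turns a score gap of 3 into a factor of about 20; the feature map turns it into a factor of less than 2.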

The classic failure: associative recall

The benchmark that exposes this is associative recall: show the model pairs like “A→1, B→7, C→3,” then later ask “B→?” The right answer is 7. Softmax attention nails this — it finds the “B” token and reads off its “7.” Pure linear attention struggles, especially with many pairs, because the fixed state can't keep every pair cleanly separated — they interfere. This single task drove much of the research that followed: how to keep linear attention's speed while recovering some of softmax's precise recall.
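
The interference can be reproduced in a few lines. This sketch makes several simplifying assumptions: keys are hand-picked 3-dimensional vectors, values are single numbers, and there is no feature map or normalization. In three dimensions only three keys can be mutually orthogonal, so a fourth key has to overlap with at least one of the others.

```python
def store(pairs, d):
    """Accumulate scalar values into a fixed-size state: sum of value * key."""
    state = [0.0] * d
    for key, value in pairs:
        for a in range(d):
            state[a] += value * key[a]
    return state


def recall(state, query):
    """Read the state with a query (here, the key we want to look up)."""
    return sum(q * s for q, s in zip(query, state))


A = [1.0, 0.0, 0.0]
B = [0.0, 1.0, 0.0]
C = [0.0, 0.0, 1.0]
D = [0.6, 0.8, 0.0]   # unit length, but it cannot be orthogonal to A, B and C

three = store([(A, 1.0), (B, 7.0), (C, 3.0)], d=3)
four = store([(A, 1.0), (B, 7.0), (C, 3.0), (D, 5.0)], d=3)

print("B with 3 pairs stored:", recall(three, B))   # clean lookup
print("B with 4 pairs stored:", recall(four, B))    # polluted by D
```

With three orthogonal keys the lookup for B returns exactly 7. Once a fourth pair is stored, the same lookup returns about 11, because D's value leaks in through its overlap with B. The state has the same three numbers either way; there is simply nowhere to put the extra pair without touching the others.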

How the field fights back

The whole modern wave of architectures (next chapters) is largely a response to this catch. The fixes share a theme: make the state smarter about what to keep and what to forget. Add a decay or gating mechanism so the state can selectively down-weight old information (RWKV's time-decay, Mamba's selectivity), or enlarge and structure the state. None fully closes the gap with softmax on pure recall — but they narrow it enough that, combined with the massive efficiency win, linear models become genuinely competitive on many tasks.

See it: sharp vs. smooth attention

The widget shows attention weights over past tokens for a query that “wants” token 7. Softmax can spike almost entirely on token 7 (precise retrieval). Linear attention spreads weight more smoothly — it leans toward 7 but can't isolate it. Drag the “sharpness” toward softmax and back toward linear to feel the retrieval-vs-cost tradeoff.

Sharp (softmax) vs. Smooth (linear) Attention

Attention weight over past tokens; the target is token 7. Softmax spikes on it (exact recall); linear spreads out (blurry). Slide between them to see the precision you trade for speed.

Common misconception. “Linear models are just worse, so they're a dead end.” They're not worse everywhere — they're worse at precise long-range recall and competitive or better on many other tasks, at a fraction of the cost. And for very long sequences, a linear model that can actually fit the context beats a quadratic one that runs out of memory before it even gets there. The question is never “which is better” but “better for what, at what length, at what budget.”

What fundamental limitation is the price of linear attention's efficiency?

It uses fewer layers
It compresses the entire past into a fixed-size state: cheap, but lossy, so it can't precisely retrieve arbitrary specific past tokens the way softmax (with its growing KV cache) can
It can only process short sequences

Chapter 5: The Cost Simulator — Watch the Gap Grow

This brings the previous chapters together in motion. Two models process the same growing sequence, token by token: a softmax transformer and a linear-attention model. Watch their cumulative compute and their memory diverge in real time, and see, concretely, the fixed-state vs growing-cache difference that defines the whole tradeoff.

Press Run and watch as each token is processed:

The transformer must attend to all previous tokens for each new one, so its per-token cost grows with position, and its KV-cache memory grows without bound — total compute curves upward (quadratic).
The linear model updates a fixed-size state and reads it — constant cost per token, constant memory — so total compute is a straight line.
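
The two behaviors above can be imitated with a toy cost model. The assumptions are crude and stated in the code: the transformer is charged t·d work for token t (one length-d dot product per cached token), the linear model is charged d·d work per token, and d = 64 is an arbitrary choice. Real costs depend on constant factors and hardware that this sketch ignores.

```python
def simulate(n_tokens, d=64):
    """Toy cost model for processing a sequence token by token."""
    transformer_compute = 0
    linear_compute = 0
    for t in range(1, n_tokens + 1):
        transformer_compute += t * d   # attend to all t cached tokens
        linear_compute += d * d        # update and read a d x d state
    return {
        "transformer_compute": transformer_compute,
        "linear_compute": linear_compute,
        "kv_cache_entries": 2 * n_tokens * d,   # keys and values for every token
        "state_entries": d * d,                 # fixed, independent of n_tokens
    }


for n in (100, 1_000, 10_000):
    r = simulate(n)
    print(f"n={n:>6,}  transformer={r['transformer_compute']:>13,}  "
          f"linear={r['linear_compute']:>11,}  "
          f"cache={r['kv_cache_entries']:>9,}  state={r['state_entries']:,}")
```

In this toy model the linear model is actually the more expensive of the two at 100 tokens, and the curves cross near n ≈ 2d (about 127 tokens for d = 64). After that the transformer's total pulls away quadratically.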

Transformer vs. Linear: Cost & Memory Over a Sequence

Top: cumulative compute (transformer curves up, linear stays straight). Bottom: memory (KV cache grows vs fixed state). Run it and watch the gap widen with every token.

What to take away. Early on, the two models cost about the same; for short sequences, who cares? The gap only matters as the sequence grows, and then it matters enormously: the transformer's total compute curves upward and its memory keeps growing, while the linear model's total compute climbs along a straight line and its memory stays flat. That diverging picture is why linear attention exists. For a chatbot reply it's irrelevant; for a million-token document or an endless stream, it's the difference between feasible and impossible.

Common misconception. “Linear models are always cheaper, so always use them.” At short sequences the constant factors can make a well-optimized transformer (with FlashAttention) just as fast or faster — and it's more accurate. The linear model's win is asymptotic: it pays off at long sequences. Below the crossover length, the quadratic model is often the better choice. Know where your crossover is.

No quiz — the simulator is the test. If you can explain why the two compute curves start together and then diverge, you understand the heart of linear attention.

Chapter 6: RWKV — The Receptance-Weighted Key-Value

RWKV (pronounced “RwaKuv”) is the most prominent linear-attention model, and a genuine attempt to build a from-scratch transformer alternative that's competitive at scale. Its name spells out its four ingredients: Receptance, Weight, Key, Value. Let's see what's different about how it mixes tokens, and how it puts a clever twist on the linear-attention recipe.

Token shift: mix with the previous token

RWKV's first trick is token shift. Before computing anything, each token's representation is blended with the previous token's, via a learned mix. It's cheap — just a weighted average of the current and previous position — but it gives every position a little built-in window onto its immediate past, helping the model capture local patterns without attention. A small idea that does a lot of work.
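
Token shift as described is just a per-channel weighted average, which the following sketch spells out. The mix weights `mu` are fixed numbers here, whereas in a trained model they would be learned, and blending the first token with zeros is an assumption made for this example. The function name `token_shift` is illustrative.

```python
def token_shift(xs, mu):
    """Blend each position with the previous one, channel by channel.

    xs: list of token vectors.
    mu: per-channel mix weights in [0, 1]; 1.0 keeps only the current token,
        0.0 keeps only the previous token.
    The first token has no predecessor, so it is blended with zeros
    (an assumption of this sketch).
    """
    previous = [0.0] * len(mu)
    shifted = []
    for x in xs:
        shifted.append([m * cur + (1.0 - m) * prev
                        for m, cur, prev in zip(mu, x, previous)])
        previous = x
    return shifted


tokens = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
mu = [0.5, 1.0]   # channel 0 mixes half-and-half, channel 1 ignores the past
print(token_shift(tokens, mu))
```

Channel 0 of each output is the average of the current and previous token, while channel 1 passes the current token through unchanged, showing how different channels can look back by different amounts.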

The WKV operator: linear attention with decay

The heart of RWKV is the WKV operation — its version of the recurrent state from Chapter 3, with a crucial addition: a learned time-decay. Each channel has its own decay rate that controls how fast old information fades from the state. Recent tokens count for more; distant tokens fade away — but at a learned, per-channel rate, so some channels can hold onto information for a long time while others forget quickly. This directly attacks the Chapter 4 problem: the decay lets RWKV selectively manage what its fixed state remembers, recovering some of the focus that pure linear attention lacks.
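
The effect of a per-channel decay on a recurrent state can be sketched as follows. This is not the actual WKV operator, which has additional pieces that differ between RWKV versions; it is only the decay idea in isolation, with made-up decay values and a hypothetical function name `decayed_update`.

```python
def decayed_update(state, k, v, decay):
    """One recurrent step: shrink each key channel's row, then add k v^T.

    state[a][c] <- decay[a] * state[a][c] + k[a] * v[c]
    """
    return [[decay[a] * state[a][c] + k[a] * v[c] for c in range(len(v))]
            for a in range(len(k))]


decay = [0.5, 0.9]        # channel 0 forgets fast, channel 1 forgets slowly
state = [[0.0], [0.0]]

# One token writes the value 1.0 into both channels.
state = decayed_update(state, k=[1.0, 1.0], v=[1.0], decay=decay)
history = [[row[0] for row in state]]

# Three later tokens write nothing, so we only see the old entry fade.
for _ in range(3):
    state = decayed_update(state, k=[0.0, 0.0], v=[0.0], decay=decay)
    history.append([row[0] for row in state])

for steps_back, (fast, slow) in enumerate(history):
    print(f"{steps_back} steps back: fast channel {fast:.3f}, slow channel {slow:.3f}")
```

After three steps the fast channel retains 0.5³ = 0.125 of the original contribution and the slow channel retains 0.9³ ≈ 0.729. A token's remaining weight is the decay raised to the number of steps since it was written.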

Receptance is a gate. The “R” in RWKV — receptance — is a learned gate (a sigmoid) that controls how much of the WKV state's output is actually let through at each position. It's the model deciding, per token and per channel, “how receptive am I to this retrieved information right now?” Combined with the time-decay (how fast the past fades) and token shift (local mixing), receptance gives RWKV fine control over its fixed-size memory — the toolkit for working around the linear-attention bottleneck.

Channel mixing: the feed-forward part

Alongside the WKV “time-mixing” block (which mixes information across positions), RWKV has a channel-mixing block that mixes information across features — analogous to a transformer's feed-forward network, and also using token shift. So an RWKV block, like a transformer block, alternates a token-mixing stage (WKV, the attention analogue) with a channel-mixing stage (the MLP analogue). Same two-stage rhythm as a transformer; completely different, linear-cost token-mixer.

The evolution

RWKV has iterated fast. RWKV-4 proved the concept at GPT-scale. RWKV-5 (codenamed Eagle) moved to a multi-head, matrix-valued state, and RWKV-6 (codenamed Finch) made the decay data-dependent: the forget rate adapts to the input rather than being fixed, much like Mamba's selectivity, improving recall. RWKV-7 pushes further with a more expressive state update. Each generation narrows the gap with transformers while keeping the linear cost and the train-parallel / infer-recurrent dual form.

See it: time-decay shaping memory

The widget shows how much a past token still contributes to the current state, as a function of how far back it is — the time-decay curve. Drag the decay rate: a fast decay means the model is dominated by recent tokens (sharp, local memory); a slow decay means distant tokens linger (long memory, but more interference). RWKV learns a different decay per channel, getting both at once.

RWKV Time-Decay: How the Past Fades

Contribution of a past token to the current state vs how many steps back it is. Fast decay = recency focus; slow decay = long memory. RWKV learns one decay per channel.

Common misconception. “RWKV is just an RNN, so it's a step backward.” RWKV is a modern RNN-transformer hybrid: it trains in parallel like a transformer (no slow sequential training), uses learned per-channel decay and gating that classic RNNs never had, and scales to billions of parameters. It keeps the good parts of RNNs (constant-memory streaming inference) while shedding the bad parts (untrainable at scale, vanishing gradients). It's a reinvention, not a regression.

What does RWKV's learned time-decay accomplish that pure linear attention lacks?

It makes training sequential and slow
Per-channel decay lets the model selectively control how fast old information fades from its fixed state, recovering some of the focus/recall that plain linear attention loses
It removes the need for a value vector

Chapter 7: The Linear-Attention Family

RWKV isn't alone. Since around 2023 there's been a genuine renaissance of sub-quadratic architectures, and the striking thing is how related they all are. Under the hood, they're variations on one theme: summarize the past into a fixed-size state that updates per token. They differ mainly in how the state is updated and forgotten. Knowing the family helps you see the unity.

The major members

RWKV — per-channel time-decay + receptance gating + token shift. An explicit RNN/transformer hybrid. (This lesson's focus.)
Mamba / State Space Models — frame the state update as a continuous-time state-space model, with the key innovation of selectivity: the update and forget gates depend on the input, so the model can choose what to remember based on content. Mamba was the result that convinced many people linear models could match transformers. (See the SSM / Mamba lesson.)
RetNet (Retentive Networks) — Microsoft's “retention” mechanism, with an explicit decay and a clean three-way form: parallel (training), recurrent (inference), and a chunkwise form that blends both for efficiency on long sequences.
Gated Linear Attention (GLA) — adds data-dependent gates to linear attention and, crucially, hardware-efficient chunked algorithms that make it fast on real GPUs.

The common thread: data-dependent forgetting. The first wave of linear attention had a fixed way of accumulating the state, and it was weak (the Chapter 4 catch). The breakthrough across this whole family is making the state update input-dependent — the model learns to selectively remember and forget based on what it's reading, not on a fixed schedule. Mamba calls it selectivity; RWKV-6 made its decay data-dependent; GLA uses data-dependent gates. Same insight, different dress: a smart, content-aware fixed state beats a dumb one, and closes much of the gap with attention.
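
A deliberately tiny sketch can show the difference between a fixed forgetting schedule and an input-dependent one. It uses a single scalar state, and each token carries a hand-written "gate signal" that stands in for what a real model would compute from the token with learned weights. None of this corresponds to a specific architecture's equations; the names `fixed_gate`, `input_gate`, and `run` are invented.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fixed_gate(_signal):
    """Same forgetting factor for every token, whatever it contains."""
    return 0.9


def input_gate(signal):
    """Forgetting factor computed from the token itself."""
    return sigmoid(signal)


def run(tokens, gate):
    """state <- gate(signal) * state + write, for each (write, signal) token."""
    state = 0.0
    for write, signal in tokens:
        state = gate(signal) * state + write
    return state


# (value to write, gate signal). The third token says "forget what came before".
tokens = [(1.0, 10.0), (1.0, 10.0), (0.0, -10.0), (2.0, 10.0)]

print(f"fixed decay:     {run(tokens, fixed_gate):.3f}")
print(f"input-dependent: {run(tokens, input_gate):.3f}")
```

With the fixed decay, the two early writes are still mixed into the final state (about 3.54). With the input-dependent gate, the third token drives the gate close to zero, clears the state, and the final value is almost exactly the last write (about 2.0). The state size is identical in both cases; only the rule for what to keep has changed.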

The chunkwise middle ground

One more shared trick worth knowing: chunkwise processing. Instead of pure parallel (quadratic within the whole sequence) or pure recurrent (sequential, slow on GPUs), you split the sequence into chunks — compute attention within each chunk in parallel (cheap, since chunks are small), and pass a recurrent state between chunks. This gets near-parallel speed and linear cost, and it's how most of these models actually run efficiently on hardware. It's the practical bridge between the two dual forms from Chapter 3.
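
The chunkwise idea can be written down directly for plain causal linear attention. The sketch below assumes no feature map, no decay, and no normalization, and it uses Python loops where a real implementation would use batched matrix multiplies, so it demonstrates the bookkeeping rather than the speed. The function name and inputs are illustrative.

```python
def chunkwise_linear_attention(Q, K, V, chunk_size):
    """Causal linear attention computed one chunk at a time."""
    d, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d)]   # summary of all earlier chunks
    outputs = []
    for start in range(0, len(Q), chunk_size):
        q_c = Q[start:start + chunk_size]
        k_c = K[start:start + chunk_size]
        v_c = V[start:start + chunk_size]
        for i, q in enumerate(q_c):
            # Cross-chunk part: read everything before this chunk from the state.
            out = [sum(q[a] * state[a][c] for a in range(d)) for c in range(d_v)]
            # Within-chunk part: small causal attention over tokens 0..i of the chunk.
            for j in range(i + 1):
                score = sum(x * y for x, y in zip(q, k_c[j]))
                for c in range(d_v):
                    out[c] += score * v_c[j][c]
            outputs.append(out)
        # Fold the finished chunk into the state before moving on.
        for k, v in zip(k_c, v_c):
            for a in range(d):
                for c in range(d_v):
                    state[a][c] += k[a] * v[c]
    return outputs


Q = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0], [2.0, 2.0]]
K = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

for size in (1, 2, 4):
    print(size, chunkwise_linear_attention(Q, K, V, chunk_size=size))
```

A chunk size of 1 reduces to the purely recurrent form, and a chunk size equal to the sequence length reduces to the fully parallel causal form. Every setting in between gives the same outputs, and the pairwise work is confined to one chunk at a time.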

See it: the family on shared axes

Select a model and see where it sits on the two axes that matter: state type (how it stores the past) and selectivity (whether forgetting is data-dependent). Softmax attention is the outlier — it keeps everything (no compression) at quadratic cost. The linear family clusters together, differing mainly in how cleverly they manage their fixed state.

The Family on Two Axes

Horizontal: how data-dependent the forgetting is. Vertical: state size (fixed vs growing). Select a model to place it. Note how softmax sits alone (growing state, quadratic) while the linear family clusters by selectivity.

Common misconception. “Mamba/SSMs are a totally different thing from linear attention.” They look different on the surface (continuous-time state-space math vs the kernel/associativity story), but they're deeply connected — both are fixed-state recurrences with data-dependent updates, and recent work shows the math largely unifies them. Don't memorize four unrelated architectures; understand the one idea (input-dependent fixed-state recurrence) and see each as a different parameterization of it.

What's the common breakthrough shared by Mamba, RWKV-6, and gated linear attention that earlier linear attention lacked?

They removed the state entirely
Data-dependent (input-dependent) forgetting/gating: the model selectively decides what to keep in its fixed state based on content, instead of a fixed accumulation schedule
They became quadratic again

Chapter 8: When to Use Which — and Hybrids

So which do you reach for — quadratic softmax attention or a linear model? The honest answer is “it depends,” and being precise about what it depends on is the practical payoff of this whole lesson. Two factors dominate: how long your sequences are, and how much you need precise recall of specific past details.

Where linear wins

Very long contexts — where quadratic cost is simply infeasible. A linear model that can fit a million tokens beats a transformer that runs out of memory at a hundred thousand.
Streaming / real-time — constant per-token cost and memory mean you can process an endless stream (audio, sensor data, live transcription) forever, never accumulating a growing cache.
Edge / on-device — the fixed, small memory footprint fits phones and embedded hardware where a growing KV cache would blow the memory budget.

Where softmax still wins

Precise long-range recall — “find the exact value associated with this key from far back.” The growing KV cache and sharp attention make transformers hard to beat here (the Chapter 4 catch).
Short sequences — below the crossover length, quadratic cost is negligible and the accuracy edge favors softmax. Most chat turns are short.
Tasks needing exact in-context lookup — copying, retrieval, certain reasoning — where the fixed state's lossy compression bites hardest.

The winning move is often a hybrid. Why choose? Hybrid models interleave a few attention layers among many linear/SSM layers (Jamba mixes attention with Mamba layers; Griffin mixes local attention with gated linear recurrences). The linear layers carry the cost-efficient bulk of the sequence processing; the occasional attention layer provides the precise recall that pure linear models lack. You get most of the efficiency and most of the recall: a small number of expensive layers buys back the capability the cheap layers gave up. This is increasingly the dominant pattern at the frontier of long-context models.

The deeper truth: a fundamental tradeoff

There's no free lunch hiding here. Fixed memory cannot losslessly store unbounded history — that's information theory, not an engineering shortcoming. Any architecture that's truly O(n) with constant memory must forget something. The question is only how cleverly it chooses what to forget (selectivity, decay, gating) and whether a few attention layers can patch the gaps. Linear attention isn't “attention but cheaper” — it's a different bargain with memory, and you pick the bargain that fits your task.
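A small numerical experiment makes the tradeoff concrete. The sketch below assumes the simplest possible fixed state, a d×d matrix that sums outer products of random unit-norm keys with their values, with no decay or gating, and then reads every key back. The function name and the sizes are illustrative choices.

```python
import numpy as np


def recall_error(n_pairs, dim, seed=0):
    """Store n_pairs (key, value) pairs in one dim x dim state and
    return the relative error when reading every value back."""
    rng = np.random.default_rng(seed)
    keys = rng.standard_normal((n_pairs, dim))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
    values = rng.standard_normal((n_pairs, dim))

    state = keys.T @ values   # (dim, dim): sum over i of outer(k_i, v_i)
    recalled = keys @ state   # row j is v_j plus crosstalk from other keys
    return np.linalg.norm(recalled - values) / np.linalg.norm(values)


if __name__ == "__main__":
    for n in (1, 8, 64, 512):
        print(f"{n:>4} pairs in a 64x64 state: relative error {recall_error(n, 64):.2f}")
```

With a single stored pair the read-back is exact; as more pairs share the same state, crosstalk between keys grows and the error rises. Perfectly orthogonal keys would allow up to d pairs to be stored exactly, but never more than d, which is the fixed-memory limit in miniature.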

See it: the recommendation map

Set your sequence length and how much precise recall your task needs. The map recommends softmax, linear, or hybrid — and shows why. Push to long sequences and the recommendation shifts toward linear or hybrid; demand high recall and it pulls back toward softmax or hybrid.

[Interactive figure: Which Architecture for Your Task? Controls: sequence length and precise-recall importance. The shaded regions recommend an architecture, the marker is your task, and hybrids occupy the sensible middle.]

Common misconception. “Transformers are obsolete — linear models will replace them.” The momentum is real, but the most likely future isn't pure-linear; it's hybrid. Attention's precise recall is too valuable to abandon, and a handful of attention layers is cheap. The interesting question isn't “which wins” but “what's the right ratio of attention to linear layers” for a given context length and task — and that ratio is an active research frontier.

Why are hybrid models (a few attention layers among many linear/SSM layers) increasingly popular?

They're simpler to implement than pure models
The linear layers give cost-efficient bulk processing while the few attention layers restore the precise recall pure linear models lack, giving most of the efficiency and most of the capability
They use no memory at all

Chapter 9: Connections & Cheat Sheet

You now understand the whole arc: why attention is quadratic, where exactly the square comes from, how removing softmax lets you reorder the multiplication into linear cost, how that same computation becomes a fixed-state recurrence (train parallel, infer recurrent), what you give up (precise recall), how RWKV's decay and gating fight back, how the whole family relates, and when to pick which — including hybrids. The thread: trade a growing, perfect memory for a fixed, lossy one that updates per token — and get linear cost in exchange for some recall.

The cheat sheet

The wall: softmax attention is O(n²) — every token attends to every other

The culprit: the n×n score matrix, forced into existence by the softmax

The trick: drop softmax → associativity → compute KᵀV (d×d) first → O(n)

Recurrent form: KᵀV builds incrementally → fixed-size state, memory stays constant as the sequence grows

Dual form: parallel for training (GPU-fast), recurrent for inference (streaming-cheap)

The catch: fixed state is lossy → weaker precise recall than softmax's growing cache

RWKV: token-shift + WKV with per-channel time-decay + receptance gating + channel-mix

The family: Mamba, RetNet, GLA. All are fixed-state recurrences that forget over time; RetNet uses a fixed decay, while Mamba and GLA make the forgetting data-dependent

Best practice: hybrids — mostly linear layers + a few attention layers for recall
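The "trick", "recurrent form", and "dual form" entries of the cheat sheet can be checked numerically. This is a minimal sketch assuming the plainest variant of linear attention: no feature map, no normalization, and no decay. The function names are made up for the example.

```python
import numpy as np


def linear_attention_parallel(q, k, v):
    """Causal linear attention, parallel (training-style) form:
    build the n x n scores, mask the future, skip the softmax."""
    scores = np.tril(q @ k.T)   # (n, n); lower triangle keeps only the past
    return scores @ v


def linear_attention_recurrent(q, k, v):
    """The same computation as a recurrence over a fixed (d, d_v) state."""
    state = np.zeros((k.shape[1], v.shape[1]))
    out = np.empty((q.shape[0], v.shape[1]))
    for t in range(q.shape[0]):
        state += np.outer(k[t], v[t])   # fold token t into the running summary
        out[t] = q[t] @ state           # read the summary with the current query
    return out


rng = np.random.default_rng(0)
n, d = 16, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

# Dual form: the parallel and recurrent computations agree.
assert np.allclose(linear_attention_parallel(q, k, v),
                   linear_attention_recurrent(q, k, v))

# The reordering trick (non-causal case): (Q K^T) V == Q (K^T V).
assert np.allclose((q @ k.T) @ v, q @ (k.T @ v))
```

The recurrent loop never holds more than one d×d state, which is why its memory does not depend on n. Practical models add feature maps, normalization, decay, or gating on top of this skeleton.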

A decision guide

Short sequences, need exact recall?

Softmax attention (with FlashAttention).

↓

Very long context or streaming, recall not critical?

Linear / SSM (RWKV, Mamba) — fixed memory, linear cost.

↓

Long context AND need decent recall?

Hybrid — mostly linear layers with a few attention layers.

↓

Need it on a phone / embedded?

Linear — the fixed-state footprint fits constrained memory.
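The decision guide above can be written down as a tiny function. This is only a sketch: the cutoffs (32,000 tokens for "long", 0.5 for "recall matters") are placeholder assumptions, not measured crossover points, and the function name is hypothetical.

```python
def recommend(seq_len, recall_importance, on_device=False,
              long_threshold=32_000, recall_threshold=0.5):
    """Map a task to 'softmax', 'linear', or 'hybrid'.

    recall_importance is a number in [0, 1]; both thresholds are
    illustrative defaults that should be tuned per model and task.
    """
    if not 0.0 <= recall_importance <= 1.0:
        raise ValueError("recall_importance must be in [0, 1]")
    if on_device:
        return "linear"    # fixed-state footprint fits constrained memory
    is_long = seq_len >= long_threshold
    needs_recall = recall_importance >= recall_threshold
    if is_long and needs_recall:
        return "hybrid"    # mostly linear layers plus a few attention layers
    if is_long:
        return "linear"    # fixed memory, linear cost
    return "softmax"       # short sequences: quadratic cost is negligible


if __name__ == "__main__":
    print(recommend(2_000, 0.9))        # short chat turn, exact recall
    print(recommend(1_000_000, 0.1))    # endless stream, recall not critical
    print(recommend(1_000_000, 0.9))    # long document with lookups
    print(recommend(500, 0.2, on_device=True))
```

Treating on-device deployment as an override mirrors the last step of the guide. A real choice would also weigh hardware, latency targets, and available pretrained checkpoints.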

Where this connects

Attention Variants — MQA/GQA/FlashAttention attack attention's cost from the memory side; linear attention attacks the compute side.
SSM / Mamba — the state-space cousin of RWKV; deeply related fixed-state recurrence with selectivity.
Transformer — the quadratic baseline these models are trying to replace or augment.
Mixture of Experts — the other big efficiency lever; MoE cuts FFN cost, linear attention cuts attention cost (often combined).
GPT — RWKV and friends are drop-in alternatives to the attention block inside a GPT-style stack.
Test-Time Compute — long reasoning chains need long context cheaply, where linear models shine.

The one thing to remember. Attention's power — every token sees every other — is also its quadratic curse. Linear attention escapes the curse by replacing “remember everything perfectly” with “summarize the past into a fixed-size state that updates as you go.” That makes it cheap and streamable, at the cost of precise recall — a cost the field is steadily buying back with smarter, data-dependent state updates and hybrid designs. It's the leading answer to the most important scaling question in sequence modeling: how do we make context long and cheap?

You need a model to transcribe an endless live audio stream on a device with limited memory. Which architecture, and why?

A standard softmax transformer: it's the most accurate
A bigger transformer with a longer context window
A linear-attention model (RWKV/Mamba): its fixed-size state gives constant memory and constant per-token cost for unbounded streaming, ideal for on-device real-time use

“You cannot remember everything forever — so the art is choosing, wisely and cheaply, what to carry forward.”