Attention costs grow with the square of the sequence — a wall at long context. This is how a clever reordering, and a transformer that thinks like an RNN, knock it down to a straight line.
The transformer's superpower — attention — is also its curse. In attention, every token looks at every other token. That “every×every” is the source of both its brilliance and its scaling problem: for a sequence of length n, attention does work proportional to n squared. Double the context and the cost quadruples. This is the quadratic wall, and it's the single biggest obstacle to long-context models.
The numbers get brutal fast. A 1,000-token prompt needs about a million attention scores — fine. A 100,000-token prompt (a long document) needs ten billion. A million-token context needs a trillion. The cost doesn't just grow; it explodes. Memory for the attention matrix grows the same way. This is why long context was, for years, fabulously expensive — you were fighting a square.
So a natural question drove a whole research direction: can we get the benefits of attention — mixing information across a sequence — at a cost that grows only linearly with length, like n instead of n squared? An architecture that's O(n) could handle million-token contexts, run cheaply on edge devices, and process streaming data forever. That family of answers is linear attention, and its most prominent member is RWKV — a model that's a transformer when training and an RNN when running.
The difference between O(n) and O(n squared) isn't a minor speedup — at scale it's the difference between possible and impossible. At 1,000 tokens, quadratic is only 1,000× more than linear — annoying but survivable. At a million tokens, quadratic is a million times more. Linear attention doesn't make models a bit faster; it changes which problems are feasible at all. Long documents, high-resolution sequences, lifelong streaming — these need the curve to be a line, not a parabola.
The widget plots cost against sequence length for quadratic attention and linear attention. At short lengths they're close. Drag the length up and watch the quadratic curve rocket away while the linear one stays nearly flat. That widening gap is the entire motivation for this lesson — and the prize that linear attention reaches for.
Cost vs. sequence length. The quadratic curve (softmax attention) explodes; the linear curve stays manageable. Drag the length and watch the gap become a chasm.
To escape the quadratic cost, we have to see exactly where it comes from. Let's trace standard attention and find the culprit — it's one specific matrix.
Each token produces three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (what I'll contribute if attended to). To decide how much token i should attend to token j, you take the dot product of i's query with j's key. Do that for every pair (i, j) and you get an n×n attention matrix — one score for every pair of tokens. That matrix is the quadratic monster: n tokens, n scores each, n×n total. Then softmax normalizes each row, and the matrix multiplies the values to produce the output.
Here's the subtle part — and the crux of the whole lesson. You might think you could avoid the matrix by being clever with the order of multiplications. But the softmax stands in the way. Softmax operates on each row of the attention matrix (normalizing the scores for one query across all keys), which means you must first compute the full row of scores before you can normalize. The softmax glues the query-key step and the value step together, forcing you to materialize the n×n scores in between. Remove the softmax, and — as the next chapter shows — the whole computation can be reordered to avoid the square entirely.
Take a sequence of n = 1,000 tokens, each with a vector of dimension d = 64. The attention matrix is 1,000×1,000 = 1,000,000 scores, each costing 64 multiply-adds, so about 64 million operations just to build it — plus a million numbers to store. Now n = 10,000: the matrix is 100,000,000 scores — 100× more for only 10× more tokens. That's the square in action: every time you grow the sequence by a factor, the cost grows by that factor squared. The value step has the same problem. The square is unavoidable as long as you form that matrix.
Watch the attention matrix build as tokens arrive. Each new token adds a whole new row and column of scores — so the total fills in as a square. Drag the token count and watch the cell count (and the cost number) grow quadratically. This square grid is precisely what linear attention refuses to build.
Each cell is one query-key score. Add tokens and the grid grows as a square — n tokens means n×n cells. The cell count is the quadratic cost.
Here is the beautiful idea at the core of linear attention, and it hinges on one of the oldest facts in algebra: matrix multiplication is associative. You can choose the order in which you multiply three matrices, and the answer is the same — but the cost can be wildly different. Linear attention exploits exactly this freedom, once the softmax is out of the way.
Standard attention computes, roughly, (scores) × values, where scores come from queries times keys. Written as matrices, it's (Q Kᵀ) V — first multiply queries by keys to get the n×n score matrix, then multiply by values. The parentheses force the giant n×n matrix into existence first.
But if we remove the softmax (or replace it with a simpler feature map), the parentheses become free to move. We can instead compute Q (Kᵀ V) — first multiply keys by values, then queries by that result. And here's the magic: Kᵀ V is a d×d matrix — its size depends only on the feature dimension d, not on the sequence length n. The enormous n×n matrix never appears. The cost drops from proportional to n squared down to proportional to n.
Let n = 1,000 tokens and d = 64. Compare the two orders:
| order | first product | its size | rough cost |
|---|---|---|---|
| (Q Kᵀ) V | Q Kᵀ (scores) | 1000 × 1000 | ~n²d = 64 million |
| Q (Kᵀ V) | Kᵀ V (a state) | 64 × 64 | ~n·d² = 4 million |
The two give the same output, but the second is 16× cheaper here — and the gap grows with n. At n = 10,000, the first balloons to 6.4 billion while the second only grows to 40 million. The first scales as n squared; the second scales as n. The Kᵀ V product is the key: a fixed-size d×d summary of all the keys and values, regardless of how many tokens there are. That little matrix is the whole sequence, compressed into constant size.
We can't just delete the softmax and leave nothing — the softmax provided a nonlinearity and kept the weights positive. Linear attention replaces it with a feature map: a function applied to the queries and keys (something as simple as an elu plus one, or other positivity-preserving maps) that approximates softmax's behavior while allowing the regrouping. The choice of feature map is where different linear-attention variants differ — but the core move, the associativity regrouping, is shared by all of them.
Toggle between the two multiplication orders and watch the intermediate matrix. In (QKᵀ)V order, the intermediate is the huge n×n matrix (it grows as you add tokens). In Q(KᵀV) order, the intermediate is the small fixed d×d state (it stays the same size no matter how many tokens). Add tokens and see one intermediate explode while the other holds steady.
Same result, different intermediate. (QKᵀ)V makes an n×n matrix (grows with tokens); Q(KᵀV) makes a fixed d×d state. Toggle the order and add tokens.
The d×d state matrix from the last chapter has a second, even more remarkable property. It can be built up incrementally, one token at a time. And that turns linear attention into something that looks exactly like an old-fashioned recurrent neural network — with all the streaming, constant-memory benefits that implies. This dual nature is the secret to why these models are so practical.
Remember Kᵀ V — the d×d summary of all keys and values. Watch what happens when a new token arrives. Its contribution to that summary is just its own key times its own value, added on. So you can keep a running state: when token t arrives, update the state by adding token t's key-value contribution, then read the output by multiplying token t's query against the current state. No looking back at past tokens — everything you need about the past is already summarized in the state. Process the next token, update the state again, and so on.
Here's the cleverest part, and why these models are trainable at scale. Linear attention has two mathematically equivalent forms:
You train in parallel form (fast, GPU-friendly) and deploy in recurrent form (cheap, streaming). The same model, the same weights, two views of the same computation. This solves the historical curse of RNNs — they were slow to train because they were inherently sequential. Linear attention gets RNN-style cheap inference and transformer-style parallel training. That combination is the whole point.
Start with an empty state (all zeros). Token 1 arrives with key k₁ and value v₁: update the state to k₁v₁. Its output is q₁ times the state. Token 2 arrives with k₂, v₂: update the state to k₁v₁ + k₂v₂ (just add the new contribution). Its output is q₂ times this updated state — which already contains both tokens' info. Token 2 never re-examined token 1; it just read the accumulated state. Extend to a million tokens and the state never grows — you only ever store and update that one fixed-size matrix.
Step through a sequence and watch the recurrent state update. Each new token adds its contribution to the fixed-size state (which never changes size), and its output is read from the current state. Notice the contrast with Chapter 1's growing matrix: here the memory is flat no matter how many tokens you process.
Step through tokens. Each adds its key×value to the fixed-size state; the output reads the state with the query. The state never grows — constant memory, however long the sequence.
If linear attention were strictly better — same quality, lower cost — transformers would already be extinct. They aren't, and the reason is a real, fundamental tradeoff. Compressing the past into a fixed-size state is exactly what makes linear attention cheap, and it's also exactly what limits it. You cannot losslessly cram an unbounded history into a constant amount of memory.
A softmax transformer keeps every past token available (the KV cache grows with the sequence), so it can reach back and retrieve any specific one exactly — “what was the name mentioned 50,000 tokens ago?” A linear-attention model has only its fixed-size state, which is a lossy summary of everything that came before. New information overwrites or blurs old information. Ask it to recall a precise detail from far back, and it often can't — the detail was compressed away to make room. The fixed state is a bottleneck: finite memory, infinite history.
The benchmark that exposes this is associative recall: show the model pairs like “A→1, B→7, C→3,” then later ask “B→?” The right answer is 7. Softmax attention nails this — it finds the “B” token and reads off its “7.” Pure linear attention struggles, especially with many pairs, because the fixed state can't keep every pair cleanly separated — they interfere. This single task drove much of the research that followed: how to keep linear attention's speed while recovering some of softmax's precise recall.
The whole modern wave of architectures (next chapters) is largely a response to this catch. The fixes share a theme: make the state smarter about what to keep and what to forget. Add a decay or gating mechanism so the state can selectively down-weight old information (RWKV's time-decay, Mamba's selectivity), or enlarge and structure the state. None fully closes the gap with softmax on pure recall — but they narrow it enough that, combined with the massive efficiency win, linear models become genuinely competitive on many tasks.
The widget shows attention weights over past tokens for a query that “wants” token 7. Softmax can spike almost entirely on token 7 (precise retrieval). Linear attention spreads weight more smoothly — it leans toward 7 but can't isolate it. Drag the “sharpness” toward softmax and back toward linear to feel the retrieval-vs-cost tradeoff.
Attention weight over past tokens; the target is token 7. Softmax spikes on it (exact recall); linear spreads out (blurry). Slide between them to see the precision you trade for speed.
This brings the first four chapters together in motion. Two models process the same growing sequence, token by token: a softmax transformer and a linear-attention model. Watch their cumulative compute and their memory diverge in real time — and see, concretely, the fixed-state vs growing-cache difference that defines the whole tradeoff.
Press Run and watch as each token is processed:
Top: cumulative compute (transformer curves up, linear stays straight). Bottom: memory (KV cache grows vs fixed state). Run it and watch the gap widen with every token.
No quiz — the simulator is the test. If you can explain why the two compute curves start together and then diverge, you understand the heart of linear attention.
RWKV (pronounced “RwaKuv”) is the most prominent linear-attention model, and a genuine attempt to build a from-scratch transformer alternative that's competitive at scale. Its name spells out its four ingredients: Receptance, Weight, Key, Value. Let's see what's different about how it mixes tokens, and how it puts a clever twist on the linear-attention recipe.
RWKV's first trick is token shift. Before computing anything, each token's representation is blended with the previous token's, via a learned mix. It's cheap — just a weighted average of the current and previous position — but it gives every position a little built-in window onto its immediate past, helping the model capture local patterns without attention. A small idea that does a lot of work.
The heart of RWKV is the WKV operation — its version of the recurrent state from Chapter 3, with a crucial addition: a learned time-decay. Each channel has its own decay rate that controls how fast old information fades from the state. Recent tokens count for more; distant tokens fade away — but at a learned, per-channel rate, so some channels can hold onto information for a long time while others forget quickly. This directly attacks the Chapter 4 problem: the decay lets RWKV selectively manage what its fixed state remembers, recovering some of the focus that pure linear attention lacks.
Alongside the WKV “time-mixing” block (which mixes information across positions), RWKV has a channel-mixing block that mixes information across features — analogous to a transformer's feed-forward network, and also using token shift. So an RWKV block, like a transformer block, alternates a token-mixing stage (WKV, the attention analogue) with a channel-mixing stage (the MLP analogue). Same two-stage rhythm as a transformer; completely different, linear-cost token-mixer.
RWKV has iterated fast. RWKV-4 proved the concept at GPT-scale. RWKV-5/6 (codenamed Eagle and Finch) made the decay data-dependent — the forget rate adapts to the input rather than being fixed, much like Mamba's selectivity, sharply improving recall. RWKV-7 pushes further with a more expressive state update. Each generation narrows the gap with transformers while keeping the linear cost and the train-parallel / infer-recurrent dual form.
The widget shows how much a past token still contributes to the current state, as a function of how far back it is — the time-decay curve. Drag the decay rate: a fast decay means the model is dominated by recent tokens (sharp, local memory); a slow decay means distant tokens linger (long memory, but more interference). RWKV learns a different decay per channel, getting both at once.
Contribution of a past token to the current state vs how many steps back it is. Fast decay = recency focus; slow decay = long memory. RWKV learns one decay per channel.
RWKV isn't alone. Since around 2023 there's been a genuine renaissance of sub-quadratic architectures, and the striking thing is how related they all are. Under the hood, they're variations on one theme: summarize the past into a fixed-size state that updates per token. They differ mainly in how the state is updated and forgotten. Knowing the family helps you see the unity.
One more shared trick worth knowing: chunkwise processing. Instead of pure parallel (quadratic within the whole sequence) or pure recurrent (sequential, slow on GPUs), you split the sequence into chunks — compute attention within each chunk in parallel (cheap, since chunks are small), and pass a recurrent state between chunks. This gets near-parallel speed and linear cost, and it's how most of these models actually run efficiently on hardware. It's the practical bridge between the two dual forms from Chapter 3.
Select a model and see where it sits on the two axes that matter: state type (how it stores the past) and selectivity (whether forgetting is data-dependent). Softmax attention is the outlier — it keeps everything (no compression) at quadratic cost. The linear family clusters together, differing mainly in how cleverly they manage their fixed state.
Horizontal: how data-dependent the forgetting is. Vertical: state size (fixed vs growing). Select a model to place it. Note how softmax sits alone (growing state, quadratic) while the linear family clusters by selectivity.
So which do you reach for — quadratic softmax attention or a linear model? The honest answer is “it depends,” and being precise about what it depends on is the practical payoff of this whole lesson. Two factors dominate: how long your sequences are, and how much you need precise recall of specific past details.
There's no free lunch hiding here. Fixed memory cannot losslessly store unbounded history — that's information theory, not an engineering shortcoming. Any architecture that's truly O(n) with constant memory must forget something. The question is only how cleverly it chooses what to forget (selectivity, decay, gating) and whether a few attention layers can patch the gaps. Linear attention isn't “attention but cheaper” — it's a different bargain with memory, and you pick the bargain that fits your task.
Set your sequence length and how much precise recall your task needs. The map recommends softmax, linear, or hybrid — and shows why. Push to long sequences and the recommendation shifts toward linear or hybrid; demand high recall and it pulls back toward softmax or hybrid.
Drag sequence length and recall-importance. The shaded regions recommend an architecture; the marker is your task. Hybrids occupy the sensible middle.
You now understand the whole arc: why attention is quadratic, where exactly the square comes from, how removing softmax lets you reorder the multiplication into linear cost, how that same computation becomes a fixed-state recurrence (train parallel, infer recurrent), what you give up (precise recall), how RWKV's decay and gating fight back, how the whole family relates, and when to pick which — including hybrids. The thread: trade a growing, perfect memory for a fixed, lossy one that updates per token — and get linear cost in exchange for some recall.
“You cannot remember everything forever — so the art is choosing, wisely and cheaply, what to carry forward.”