The architecture that ruled sequences before transformers, revived in 2024 by one of its inventors, with two old flaws finally fixed.
Before transformers took over, one architecture ruled sequence modeling for two decades: the LSTM, the Long Short-Term Memory network, invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997. It powered the first great machine translation, speech recognition, and text generation systems. Then attention arrived, and the LSTM was swept aside — not because it was bad, but because it had two specific, fatal flaws that transformers didn't.
In 2024, Hochreiter, one of the LSTM's two original inventors, came back with xLSTM (extended LSTM), asking a pointed question: what if we fix those two flaws? Could a modernized LSTM compete with transformers again? The answer turned out to be yes, surprisingly well, and it arrived right as the whole field rediscovered fixed-state recurrent models (RWKV, Mamba). xLSTM is the LSTM's entry in that revival, carrying forward decades of recurrent-network wisdom.
What exactly held the LSTM back? Two things, and xLSTM is essentially the fix for each: its saturating sigmoid gates made it hard to revise what it had stored, and its step-by-step recurrence made it impossible to train in parallel.
xLSTM lands in the same wave as RWKV and Mamba (the Linear Attention lesson covers the family): all are fixed-state recurrences chasing the same prize — transformer-level quality with linear-cost, constant-memory inference. xLSTM's distinctive angle is that it starts from the LSTM's gating philosophy rather than from linear attention or state-space math, and asks how far that classic idea can go once its two flaws are removed. It's the same destination by a different, historically-rooted road.
The widget sketches the arc: the LSTM era, its eclipse by transformers, and the 2024 revival of recurrent models including xLSTM. Click the eras to see what each got right and wrong, and why the pendulum is swinging back toward constant-memory recurrence for long-context and efficiency.
Click an era to see its strengths and weaknesses. The story: LSTMs (recurrent, but flawed) → transformers (parallel, but quadratic) → the 2024 revival fixing recurrence's flaws.
To understand what xLSTM fixes, we need the classic LSTM clear in mind. Its core idea, brilliant for 1997, was a protected memory cell — a value that flows along through time mostly untouched, with carefully controlled points where information can be added or removed. This protected highway is what let LSTMs remember things across long gaps, where plain RNNs forgot everything.
The LSTM carries a cell state — its long-term memory — through time. Three gates, each a value between 0 and 1 produced by a sigmoid, control it:
Each step: the forget gate scales down the old memory, the input gate adds new information, and the output gate reads out a filtered view. The cell state is updated by multiplying by the forget gate and adding the gated input — a simple, repeated update that, crucially, lets gradients flow back through many steps without vanishing (the famous “constant error carousel”).
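The update just described can be written out as a minimal NumPy sketch. The stacked parameter layout (`W`, `U`, `b`) and the function name `lstm_step` are assumptions made for illustration, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a textbook LSTM cell (hypothetical parameter layout).

    W: (4*d, n_in), U: (4*d, d), b: (4*d,), stacked as
    [forget, input, output, candidate] blocks of size d each.
    """
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # gates see the previous hidden output
    f = sigmoid(z[0 * d:1 * d])         # forget gate: how much old memory survives
    i = sigmoid(z[1 * d:2 * d])         # input gate: how much new info is written
    o = sigmoid(z[2 * d:3 * d])         # output gate: how much memory is read out
    g = np.tanh(z[3 * d:4 * d])         # candidate content to write
    c = f * c_prev + i * g              # scale old memory, add gated input
    h = o * np.tanh(c)                  # filtered view of the cell state
    return h, c

# Forget gate pinned open, input gate pinned shut: memory should persist.
d, n_in = 2, 3
W = np.zeros((4 * d, n_in))
U = np.zeros((4 * d, d))
b = np.concatenate([np.full(d, 50.0), np.full(d, -50.0), np.zeros(2 * d)])
c = np.array([1.0, -2.0])
h = np.zeros(d)
for _ in range(5):
    h, c = lstm_step(np.ones(n_in), h, c, W, U, b)
print(c)
```

The line `c = f * c_prev + i * g` is the whole memory mechanism: one multiply and one add per step, on a state whose size never changes with sequence length.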
What the LSTM got gloriously right: it processes sequences in linear time with constant memory — one fixed-size cell state, updated per token, regardless of sequence length. (Sound familiar? It's the same fixed-state recurrence the Linear Attention lesson celebrated.) For inference and streaming, that's ideal — the LSTM was “linear attention” decades before the name existed.
But the gates are sigmoids, capped between 0 and 1, and the memory is a single vector. Those two design choices, sigmoid gates and vector memory, are the seeds of the revision and capacity problems. The next chapter shows why the sigmoid hurts, and the chapters after show how xLSTM replaces each.
The widget is a single LSTM cell. Drag the forget and input gates and watch the cell state update over a few steps: the forget gate fades the old value, the input gate writes the new. Set the forget gate to 1 and input to 0 and watch memory persist perfectly; lower the forget gate and watch it leak away. This is the valve control at the heart of every LSTM.
The cell state over time as new inputs arrive. Drag the gates: forget controls how much old memory survives each step; input controls how much new info is written. See memory persist or fade.
Let's nail down the first flaw precisely, because the fix (next chapter) flows directly from understanding it. The classic LSTM struggles to revise a storage decision: once it has committed something to memory, it's hard to override that when a more important item appears later. The culprit is the humble sigmoid that produces every gate.
A sigmoid squashes any input into the range 0 to 1. That's its job, and it's also its trap. To make a gate truly “fully open,” you'd need an output of exactly 1, which requires an infinitely large input. In practice the sigmoid saturates — it flattens out near 0 and near 1, so its output barely changes no matter how strong the signal gets. Once a gate is near-saturated, pushing it further has almost no effect, and the gradient through it nearly vanishes, so the network can barely learn to change it either.
Suppose the LSTM has stored an item with importance 5, and a vastly more important item (importance 50) now arrives. To properly replace the old one, the input gate should open enormously more for the new item. But the sigmoid caps the gate at 1, no matter how large the input. The old item had gate value, say, 0.9; the new, far-more-important item also gets only about 0.99 — barely more. The model cannot express “this new thing is ten times more important, so overwrite decisively.” Both get squashed into nearly the same near-1 gate. The relative importance is lost to saturation.
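To put rough numbers on the saturation, here is a small sketch that feeds a few signal strengths straight into a sigmoid. Treating the "importance" values 5 and 50 as the raw gate input is an assumption for illustration; in a trained network the pre-activation is a learned function of the input.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), which shrinks toward 0 as s nears 1.
    s = sigmoid(x)
    return s * (1.0 - s)

for signal in (2.0, 5.0, 50.0):
    print(f"signal={signal:5.1f}  gate={sigmoid(signal):.6f}  grad={sigmoid_grad(signal):.2e}")

# The signal grew tenfold (5 -> 50) but the gate barely moved.
gate_gap = sigmoid(50.0) - sigmoid(5.0)
print("gate difference between signal 5 and signal 50:", gate_gap)
```

Both gates land within a hair of 1, and the gradient at the stronger signal is effectively zero, so training has almost nothing to push on.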
The widget plots the sigmoid gate's output against the strength of the “store this” signal. Push the signal strength up: the output climbs, then flattens against the ceiling of 1 and stops responding. Two very different signal strengths (a mildly important item and a critically important one) produce nearly the same gate value once both are in the saturated zone. That flattening is exactly why the LSTM can't revise — it can't tell “important” from “far more important.”
Gate output vs. signal strength. Past a point the sigmoid flattens at 1 — stronger signals can't open it further. Two markers show how a strong and a much-stronger signal collapse to nearly the same gate.
xLSTM's fix for the revision problem is wonderfully direct: replace the sigmoid gates with exponential ones. An exponential function has no ceiling — it can grow arbitrarily large. So a gate can now open as wide as needed, and a far-more-important item can produce a gate value far larger than a merely-important one. The relative importance that saturation crushed is preserved. The cell can decisively revise. This is the core of the sLSTM (the scalar xLSTM variant).
Think back to the stuck-gate example. With a sigmoid, importance-5 and importance-50 items both got gate values near 1, indistinguishable. With an exponential gate, importance-50 produces a gate value vastly larger than importance-5. When the cell combines old and new with these gates, the new, far-more-important item dominates and effectively overwrites the old. The model can now express “this is far more important, so weight it far more,” which is exactly the revision capability the sigmoid denied. Unbounded gates mean unbounded ability to re-prioritize.
Alongside the cell state, the sLSTM tracks a second running quantity: the normalizer, which accumulates the total gate magnitude over time. At read-out, the cell state is divided by this normalizer. So even though the raw gates can be huge, the ratio stays bounded — what matters is each item's gate relative to the total, not its absolute size. It's exactly the normalization a softmax does (divide by the sum), but computed in a running, recurrent way. The exponential gives unbounded expressiveness; the normalizer divides it back into a stable, well-behaved output. Together they're “softmax-like” selection inside a recurrence.
There's one more numerical safeguard: a stabilizer that tracks a running maximum of the gate pre-activations (the values about to be exponentiated) and subtracts it before exponentiating, the same max-subtraction trick a numerically-stable softmax uses. This keeps the exponentials from overflowing while leaving the ratios unchanged. With exponential gates, the normalizer, and the stabilizer, the sLSTM gets the revision power of unbounded gating without the instability.
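The three pieces (exponential input gate, normalizer, stabilizer) fit together in a few lines. This is a deliberately stripped-down scalar sketch, not the full sLSTM: the gate pre-activations are passed in directly, the forget gate is a single constant `log_forget`, and the output gate and the hidden-to-hidden connections are omitted. All names are hypothetical.

```python
import numpy as np

def exp_gated_readout(values, input_preacts, log_forget=0.0):
    """Running, stabilised 'softmax-like' average with exponential input gates."""
    c = 0.0          # cell state: gate-weighted sum of values
    n = 0.0          # normalizer: running sum of gate magnitudes
    m = -np.inf      # stabilizer: running max of the log-space gate terms
    outputs = []
    for v, i_pre in zip(values, input_preacts):
        m_new = max(log_forget + m, i_pre)
        f = np.exp(log_forget + m - m_new)   # rescaled forget gate (0 on the first step)
        i = np.exp(i_pre - m_new)            # rescaled input gate, always <= 1
        c = f * c + i * v
        n = f * n + i
        m = m_new
        outputs.append(c / n)                # the common rescaling cancels in the ratio
    return outputs

values = [1.0, 2.0, 3.0]
preacts = [5.0, 0.0, 50.0]                   # the last item is far more important
out = exp_gated_readout(values, preacts)
print(out[-1])

# Pre-activations this large would overflow a naive exp(); the stabilised form stays finite.
big = exp_gated_readout([1.0, 2.0], [1000.0, 1005.0])
print(big)
```

With `log_forget=0.0` the final read-out is the softmax-weighted average of the values, computed one token at a time. The stabilizer changes only the scale of `c` and `n`, never their ratio.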
The widget overlays the two gating functions. The sigmoid (red) flattens at 1 — the saturation from Chapter 2. The exponential (purple) keeps climbing — no ceiling, so it preserves the difference between “important” and “far more important.” Toggle the normalizer to see how dividing by the running total keeps the exponential's output bounded even as the raw gate grows — expressiveness without explosion.
Red = sigmoid (caps at 1, saturates). Purple = exponential (unbounded, preserves relative importance). Toggle the normalizer to see how the output stays stable despite the unbounded raw gate.
Exponential gating fixed revision. But there's a second limitation hiding in the classic LSTM: its memory is just a vector, a single list of numbers. That's a small amount of storage, and everything the model wants to remember has to be crammed into it, so stored items interfere with one another. xLSTM's second variant, the mLSTM (matrix LSTM), upgrades the cell from a vector to a full matrix: vastly more memory capacity, and a structure that holds associations.
Here's the elegant idea, and it should feel familiar from the Linear Attention lesson. Instead of just accumulating values, the mLSTM cell stores key-value associations: when a token arrives with a key and a value, the cell adds their outer product (the key times the value, forming a matrix) into its memory. To retrieve, you multiply a query against the matrix — and out comes the value associated with the most similar stored key. The matrix memory is an associative memory: ask with a key-like query, get back the matching value.
A vector of size d holds d numbers. A matrix of size d-by-d holds d-squared numbers: for d = 64, that's 64 values versus 4,096. But it's not just raw count: the matrix structure lets the cell store many distinct, separable key-value pairs that don't interfere as badly, because different keys map to different directions. The vector LSTM had to overwrite or blend memories that competed for its few slots; the matrix mLSTM can keep many associations side by side. More capacity, less interference, directly attacking the recall weakness of fixed-state models.
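The outer-product write and query read can be sketched in NumPy. The example assumes perfectly orthonormal keys, an idealised case chosen so that recall is exact; gates and the normalizer are left out, and the names `C` and `recall` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 64, 8

# Idealised assumption: keys are orthonormal, so each one points in its own direction.
Q, _ = np.linalg.qr(rng.standard_normal((d, n_pairs)))
keys = Q.T                                   # shape (n_pairs, d), rows orthonormal
values = rng.standard_normal((n_pairs, d))

# Write: add the outer product of each value with its key into the matrix memory.
C = np.zeros((d, d))
for k, v in zip(keys, values):
    C += np.outer(v, k)

def recall(memory, query):
    # Read: multiply the memory by a key-like query.
    return memory @ query

retrieved = recall(C, keys[3])
print(np.allclose(retrieved, values[3]))
```

Because `C @ k_j` equals the sum of each stored value weighted by its key's dot product with `k_j`, orthogonal keys give back exactly one value. Real keys are only approximately orthogonal, so retrieval picks up some cross-talk from the other pairs, and that cross-talk grows as more pairs are stored.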
The widget stores a series of key-value pairs, then tests recall. With vector memory, new pairs overwrite old ones — recall degrades fast as you add more. With matrix memory, the outer products coexist — recall holds up far better as the number of stored pairs grows. Add pairs and watch the vector memory's recall collapse while the matrix memory's stays strong.
Store N key-value pairs, then measure recall accuracy. Vector memory saturates and interferes; matrix memory holds many associations. Drag N up and watch the gap.
We've fixed revision (exponential gating) and capacity (matrix memory). Now the flaw that actually killed the LSTM commercially: it couldn't train in parallel. The mLSTM's design deliberately solves this, and understanding how reveals a sharp tradeoff between the two xLSTM variants.
In a classic LSTM, each step's gates depend on the previous step's hidden output — a recurrent connection from one timestep to the next. So you literally cannot compute step 5 until you've finished step 4, which needs step 3, and so on. The computation is a chain you must walk one link at a time. On a GPU — which is built to do thousands of things at once — this is agony: the hardware sits mostly idle, waiting. That's why LSTMs were slow to train and couldn't scale to the sizes that made transformers dominant.
The mLSTM makes one crucial sacrifice: it removes the hidden-to-hidden recurrence in the gates. Its gates depend only on the current input, not on the previous step's output. With that dependency gone, the per-token computations no longer form an unbreakable chain; they can be computed in parallel, all at once, exactly like attention. The mLSTM has the same dual nature as linear attention (Chapter 3 of the Linear Attention lesson): a parallel form for fast GPU training and a recurrent form for cheap streaming inference. The second flaw is fixed.
Like the rest of the family, the mLSTM in practice uses chunkwise processing: split the sequence into chunks, compute within each chunk in parallel, and pass a recurrent state between chunks. This gets near-parallel training speed while keeping the linear cost — the same hardware-efficient trick the Linear Attention lesson described. It's how the mLSTM actually runs fast on real GPUs.
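The recurrent, parallel, and chunkwise forms compute the same thing, and a small sketch can check that numerically. It uses a simplified matrix-memory recurrence: the log-gates `log_f` and `log_i` are passed in as precomputed arrays, standing in for gates that depend only on the current input, and the normalizer and stabilizer are omitted. Function names and shapes are assumptions for illustration.

```python
import numpy as np

def recurrent_form(q, k, v, log_f, log_i):
    """Token by token: C_t = f_t * C_{t-1} + i_t * outer(v_t, k_t); h_t = C_t @ q_t."""
    T, d = q.shape
    C = np.zeros((d, d))
    out = np.zeros((T, d))
    for t in range(T):
        C = np.exp(log_f[t]) * C + np.exp(log_i[t]) * np.outer(v[t], k[t])
        out[t] = C @ q[t]
    return out

def decay_matrix(log_f, log_i):
    """D[t, s] = i_s * f_{s+1} * ... * f_t for s <= t, and 0 for s > t."""
    cum = np.cumsum(log_f)
    log_d = cum[:, None] - cum[None, :] + log_i[None, :]
    causal = np.tril(np.ones((len(cum), len(cum)), dtype=bool))
    return np.exp(np.where(causal, log_d, -np.inf))

def parallel_form(q, k, v, log_f, log_i):
    """All tokens at once, attention-style: (D * Q K^T) V."""
    return (decay_matrix(log_f, log_i) * (q @ k.T)) @ v

def chunkwise_form(q, k, v, log_f, log_i, chunk=4):
    """Parallel inside each chunk, with a recurrent state C carried between chunks."""
    T, d = q.shape
    C = np.zeros((d, d))
    out = np.zeros((T, d))
    for start in range(0, T, chunk):
        sl = slice(start, start + chunk)
        qc, kc, vc, lf, li = q[sl], k[sl], v[sl], log_f[sl], log_i[sl]
        cum = np.cumsum(lf)
        inter = np.exp(cum)[:, None] * (qc @ C.T)           # contribution of earlier chunks
        intra = (decay_matrix(lf, li) * (qc @ kc.T)) @ vc   # contribution within this chunk
        out[sl] = inter + intra
        to_end = np.exp(cum[-1] - cum + li)                 # each token's weight at chunk end
        C = np.exp(cum[-1]) * C + (vc * to_end[:, None]).T @ kc
    return out

rng = np.random.default_rng(0)
T, d = 10, 4
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
log_f = -np.logaddexp(0.0, -rng.standard_normal(T))   # log of a sigmoid forget gate
log_i = 0.5 * rng.standard_normal(T)                  # log of an exponential input gate

ref = recurrent_form(q, k, v, log_f, log_i)
print(np.allclose(ref, parallel_form(q, k, v, log_f, log_i)))
print(np.allclose(ref, chunkwise_form(q, k, v, log_f, log_i, chunk=4)))
```

The parallel form builds a T-by-T matrix, so it costs quadratic work in sequence length. The chunkwise form keeps each such matrix at chunk size and walks the chunks with a fixed-size state, which is where the linear overall cost comes from.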
The widget shows tokens being processed two ways. In sequential mode (sLSTM, classic LSTM), each token must wait for the previous one — a chain, processed one at a time. In parallel mode (mLSTM), with the recurrent gate-link cut, all tokens compute at once — the GPU does them simultaneously. Watch the wall-clock time: sequential grows with length; parallel stays flat.
Press Run. Sequential processes tokens one-by-one (each waits for the last); parallel does them all at once. Watch how training time diverges as the sequence grows.
This is the payoff — the revision problem and its fix, side by side, interactive. A stream of items arrives, each with an importance. The task is simple: by the end, the memory should hold the single most important item, no matter when it appeared. A sigmoid-gated LSTM fails this when the most important item comes late (it can't override what's stored); an exponential-gated xLSTM succeeds. Watch both, on the same stream.
Drag the importance of each item, and especially try making a late item the most important. Toggle between sigmoid and exponential gating and see which item each model's memory ends up holding. The sigmoid model gets “stuck” on whatever it committed early; the exponential model correctly revises to the true maximum.
Each bar is an item's importance, arriving left to right. The task: end holding the most important one. Toggle gating and adjust importances — make a late item the biggest and watch the sigmoid model fail while the exponential one revises correctly.
No quiz — the lab is the test. If you can predict, before toggling, which item each model will keep when the maximum arrives late, you understand exponential gating.
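A toy version of the lab can be run offline to check your predictions. It rests on a strong simplifying assumption: both models accumulate gate-weighted items and divide by the running gate total, and they differ only in the gate function. The widget's actual models are not specified here, so in this toy the sigmoid model's failure shows up as blending items together instead of literally sticking to an early one. `memory_share` is a hypothetical helper.

```python
import numpy as np

def memory_share(importances, gate):
    """Fraction of the final normalised memory owned by each item."""
    s = np.asarray(importances, dtype=float)
    if gate == "sigmoid":
        g = 1.0 / (1.0 + np.exp(-s))       # capped at 1, saturates
    elif gate == "exp":
        g = np.exp(s - s.max())            # unbounded; max-subtraction keeps ratios intact
    else:
        raise ValueError(f"unknown gate: {gate}")
    return g / g.sum()

stream = [5.0, 1.0, 2.0, 50.0]             # the most important item arrives last

sig = memory_share(stream, "sigmoid")
exp = memory_share(stream, "exp")
print("sigmoid shares:", np.round(sig, 3))
print("exp shares:    ", np.round(exp, 3))
```

Under sigmoid gating the late, most important item owns well under half of the memory, because every gate was already near its ceiling. Under exponential gating it owns essentially all of it.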
We have the pieces — exponential gating, matrix memory, the parallelizable mLSTM, the expressive sLSTM. Now how are they assembled into a full model? The answer is reassuringly familiar: xLSTM borrows the residual block structure that makes transformers trainable, and slots its recurrent cells in where attention would go.
An xLSTM model is a stack of residual blocks, exactly like a transformer. Each block wraps its core operation with a normalization layer and a residual connection (add the input back to the output). This is the same machinery that lets deep transformers train without vanishing gradients (see the Skip Connections and Normalization lessons). The novelty isn't the backbone — it's what sits inside each block: an xLSTM layer instead of an attention layer.
A real xLSTM architecture interleaves the two block types. mLSTM blocks — parallelizable, matrix-memory — do most of the work: efficient, scalable bulk sequence processing. sLSTM blocks — with their recurrent memory mixing — are sprinkled in where the extra state-tracking expressiveness helps. The ratio is a design choice, much like the attention-to-linear ratio in the hybrid models from the Linear Attention lesson. You tune how many of each based on your tasks and your compute budget.
There's one more structural detail. xLSTM blocks often up-project to a wider dimension before the recurrent operation and project back down after — giving the cell more room to work — and use additional gating (like a gated linear unit) around the core. These are the same kinds of engineering refinements that tuned transformers over the years (wider FFNs, GLU variants); xLSTM inherits that accumulated wisdom. The recurrent cell is the heart, but it's wrapped in the modern block design that makes large models train well.
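The block layout described above (normalize, up-project, run the recurrent core, gate, project back down, add the residual) can be sketched as follows. This is a structural illustration only: `causal_mean_core` is a stand-in placeholder for a real mLSTM or sLSTM layer, the weights are random, and every name is hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

def causal_mean_core(u):
    """Placeholder for an mLSTM or sLSTM layer: a causal running mean over time."""
    steps = np.arange(1, u.shape[0] + 1)[:, None]
    return np.cumsum(u, axis=0) / steps

def xlstm_style_block(x, p, core=causal_mean_core):
    h = layer_norm(x)                      # normalization before the core
    u = h @ p["W_up"]                      # up-project to a wider dimension
    g = silu(h @ p["W_gate"])              # GLU-style gate branch
    y = (core(u) * g) @ p["W_down"]        # gated core output, projected back down
    return x + y                           # residual connection

def init_params(rng, d_model, d_inner):
    return {
        "W_up": rng.standard_normal((d_model, d_inner)) / np.sqrt(d_model),
        "W_gate": rng.standard_normal((d_model, d_inner)) / np.sqrt(d_model),
        "W_down": rng.standard_normal((d_inner, d_model)) / np.sqrt(d_inner),
    }

def run_stack(x, blocks):
    for p in blocks:
        x = xlstm_style_block(x, p)
    return x

rng = np.random.default_rng(0)
T, d_model, d_inner = 6, 8, 16
blocks = [init_params(rng, d_model, d_inner) for _ in range(4)]
x = rng.standard_normal((T, d_model))
y = run_stack(x, blocks)
print(y.shape)
```

Swapping the `core` argument per block is how a mix of mLSTM and sLSTM blocks would slot into the same wrapper. Everything outside `core` acts on each token independently, so the core is the only place where information moves across time.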
The widget shows an xLSTM model as a stack of residual blocks. Toggle a block between mLSTM (parallel, matrix memory) and sLSTM (recurrent, expressive) and see the mix. Notice every block has the same residual + norm wrapper — the transformer scaffolding — with the recurrent cell as the interchangeable core.
Click a block to flip it between mLSTM (teal, parallel) and sLSTM (purple, expressive). Every block shares the residual + norm wrapper; only the recurrent core changes.
xLSTM didn't arrive in a vacuum — it's part of the 2024 wave of sub-quadratic models alongside RWKV and Mamba. They're all chasing the same prize and they're all, underneath, the same kind of thing: a fixed-state recurrence with constant-memory inference. Seeing where xLSTM fits clarifies the whole landscape.
Every model in this family — xLSTM, RWKV, Mamba, RetNet, gated linear attention — carries a fixed-size state that it updates one token at a time, giving linear cost and constant-memory streaming inference. And they all converged on the key fix from the Linear Attention lesson: data-dependent control of the state (selective forgetting and writing based on content). They differ mainly in the mathematical lineage they came from and the exact form of their state update.
| Model | Came from | State / key mechanism |
|---|---|---|
| xLSTM | the LSTM | matrix memory + exponential gating + normalizer |
| RWKV | RNN/attention hybrid | WKV state + per-channel time-decay + receptance |
| Mamba | state-space models | continuous-time SSM + input-dependent selectivity |
| RetNet | attention | retention with explicit decay; chunkwise form |
| Transformer | — | growing KV cache, quadratic, sharp recall (the baseline) |
What makes xLSTM stand out within the family? Its emphasis on exponential gating with a normalizer (an explicit, principled fix for the revision/recall problem, descended directly from the LSTM's gate philosophy) and its two-variant design (the parallel mLSTM and the more-expressive sLSTM), letting you trade efficiency for state-tracking power block by block. It's the most LSTM-faithful member of the family, which is fitting: it comes from the group of Sepp Hochreiter, co-inventor of the LSTM.
Select a model to place it on two axes: its lineage (where it came from) and its state type. Notice how the linear family clusters — same fixed matrix-state region — while the transformer sits apart with its growing cache. xLSTM, RWKV, and Mamba are close neighbors despite their different origins.
Select a model to highlight it. The linear family clusters in the fixed-matrix-state region; the transformer is the outlier with a growing cache. Different origins, converging design.
You now understand xLSTM end to end: why the LSTM fell behind, the saturating-sigmoid revision problem, exponential gating with a normalizer as the fix, the matrix memory that holds key-value associations, the mLSTM's parallelizability, the residual-block architecture, and where it sits in the recurrent revival. The thread: take the LSTM's timeless gated-memory idea, fix its two flaws (saturation and sequentiality), and you get a competitive, constant-memory alternative to attention.
“An old idea is not a dead idea — sometimes it was just waiting for its two flaws to be fixed.”