xLSTM — The LSTM Strikes Back

Chapter 0: The LSTM Strikes Back

Before transformers took over, one architecture ruled sequence modeling for two decades: the LSTM, the Long Short-Term Memory network, invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997. It powered the first great machine translation, speech recognition, and text generation systems. Then attention arrived, and the LSTM was swept aside — not because it was bad, but because it had two specific, fatal flaws that transformers didn't.

In 2024, Hochreiter — the original inventor — came back with xLSTM (extended LSTM), asking a pointed question: what if we fix those two flaws? Could a modernized LSTM compete with transformers again? The answer turned out to be yes, surprisingly well — and it arrived right as the whole field rediscovered fixed-state recurrent models (RWKV, Mamba). xLSTM is the LSTM's entry in that revival, carrying forward decades of recurrent-network wisdom.

The two fatal flaws

What exactly held the LSTM back? Two things, and xLSTM is essentially the fix for each:

Flaw 1: It couldn't revise its decisions. Once the LSTM committed something to memory, it struggled to override that choice when a more important item showed up later. Its gates saturated. This crippled tasks like “find the most relevant earlier token.”
Flaw 2: It wasn't parallelizable. The LSTM processes one token at a time, each step depending on the last. That's fine for inference but agonizing for training — you can't use a GPU's parallelism. Transformers train in parallel; LSTMs couldn't, so they couldn't scale.

The one-sentence version. xLSTM modernizes the LSTM with two fixes: exponential gating (so it can revise what it stored, solving flaw 1) and a matrix memory with a parallelizable form (so it trains on GPUs like a transformer, solving flaw 2). The result is a recurrent model with constant-memory inference that's competitive with transformers at scale.

Why this matters now

xLSTM lands in the same wave as RWKV and Mamba (the Linear Attention lesson covers the family): all are fixed-state recurrences chasing the same prize — transformer-level quality with linear-cost, constant-memory inference. xLSTM's distinctive angle is that it starts from the LSTM's gating philosophy rather than from linear attention or state-space math, and asks how far that classic idea can go once its two flaws are removed. It's the same destination by a different, historically-rooted road.

See it: the comeback timeline

The widget sketches the arc: the LSTM era, its eclipse by transformers, and the 2024 revival of recurrent models including xLSTM. Hover the eras to see what each got right and wrong — and why the pendulum is swinging back toward constant-memory recurrence for long-context and efficiency.

The Sequence-Model Pendulum

Click an era to see its strengths and weaknesses. The story: LSTMs (recurrent, but flawed) → transformers (parallel, but quadratic) → the 2024 revival fixing recurrence's flaws.

Common misconception. “LSTMs lost to transformers because they were worse at learning.” Not quite — they lost mostly because they couldn't be trained in parallel, so they couldn't ride the GPU-scaling wave that made transformers huge. The quality gap was real but secondary to the scaling gap. Fix the parallelism (and the revision flaw), and the recurrent approach becomes competitive again — which is exactly the xLSTM bet.

What are the two flaws that xLSTM fixes in the classic LSTM?

Too few parameters and too slow inference
It couldn't revise stored decisions (saturating gates) and it couldn't be trained in parallel on GPUs
It used too much memory and overfit easily

Chapter 1: The LSTM, Quickly

To understand what xLSTM fixes, we need the classic LSTM clear in mind. Its core idea, brilliant for 1997, was a protected memory cell — a value that flows along through time mostly untouched, with carefully controlled points where information can be added or removed. This protected highway is what let LSTMs remember things across long gaps, where plain RNNs forgot everything.

The cell and its three gates

The LSTM carries a cell state — its long-term memory — through time. Three gates, each a value between 0 and 1 produced by a sigmoid, control it:

Forget gate: how much of the old cell state to keep (1 = keep all, 0 = erase). It decides what to discard.
Input gate: how much of the new candidate information to write into the cell. It decides what to store.
Output gate: how much of the cell to expose as this step's output. It decides what to reveal.

Each step: the forget gate scales down the old memory, the input gate adds new information, and the output gate reads out a filtered view. The cell state is updated by multiplying by the forget gate and adding the gated input — a simple, repeated update that, crucially, lets gradients flow back through many steps without vanishing (the famous “constant error carousel”).
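
The update just described fits in a few lines. What follows is a minimal sketch of one step of a scalar cell, not a full LSTM layer: the gate pre-activations are passed in directly as an illustrative shortcut (a real LSTM computes them from the current input and the previous output using learned weights), and the names `lstm_step`, `f_pre`, `i_pre`, `o_pre`, and `z_pre` are made up for this example.

```python
import math

def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(c_prev, f_pre, i_pre, o_pre, z_pre):
    """One step of a scalar LSTM cell.

    The four *_pre arguments are pre-activations. In a full LSTM they come
    from learned weights applied to the current input and the previous
    output; here they are passed in directly (an illustrative shortcut).
    """
    f = sigmoid(f_pre)        # forget gate: how much old memory to keep
    i = sigmoid(i_pre)        # input gate: how much new information to write
    o = sigmoid(o_pre)        # output gate: how much of the cell to reveal
    z = math.tanh(z_pre)      # candidate information
    c = f * c_prev + i * z    # multiply by forget gate, add gated input
    h = o * math.tanh(c)      # output is a filtered view of the cell
    return c, h

# Forget gate near 1, input gate near 0: the stored value persists.
c, h = lstm_step(c_prev=0.8, f_pre=10.0, i_pre=-10.0, o_pre=0.0, z_pre=0.5)
print(c, h)
```

With the forget gate almost fully open and the input gate almost closed, the new cell state stays within a rounding error of the old value 0.8, which is the "protected highway" in action.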

The gates are valves on a memory pipe. Picture the cell state as water flowing through a pipe. The forget gate is a valve that lets some flow through and drains the rest; the input gate is a valve that adds new water; the output gate is a valve on a side tap that lets you sample the contents. The LSTM learns to open and close these valves at the right moments — remember this, forget that, reveal this now. That valve-control idea is timeless; xLSTM keeps it and upgrades the valves.

The good and the limits

What the LSTM got gloriously right: it processes sequences in linear time with constant memory — one fixed-size cell state, updated per token, regardless of sequence length. (Sound familiar? It's the same fixed-state recurrence the Linear Attention lesson celebrated.) For inference and streaming, that's ideal — the LSTM was “linear attention” decades before the name existed.

But the gates are sigmoids, capped between 0 and 1, and the memory is a single vector. Those two design choices — sigmoid gates and vector memory — are exactly the seeds of the two flaws. The next chapter shows why they hurt, and the chapters after show how xLSTM replaces each.

See it: the gated cell

The widget is a single LSTM cell. Drag the forget and input gates and watch the cell state update over a few steps: the forget gate fades the old value, the input gate writes the new. Set the forget gate to 1 and input to 0 and watch memory persist perfectly; lower the forget gate and watch it leak away. This is the valve control at the heart of every LSTM.

A Single LSTM Cell

The cell state over time as new inputs arrive. Drag the gates: forget controls how much old memory survives each step; input controls how much new info is written. See memory persist or fade.

Common misconception. “The LSTM's cell state is its output.” No — the cell state is the protected internal memory; the output is a gated, filtered view of it (via the output gate and a squashing function). Keeping the memory separate from the output is part of what protects it: the cell can hold something quietly for many steps without being forced to expose it until the output gate decides the moment is right.

What lets a classic LSTM remember information across long gaps where a plain RNN forgets?

It has more layers
A protected cell state updated by multiply-by-forget-gate and add-gated-input, which lets gradients flow across many steps without vanishing
It attends to all previous tokens

Chapter 2: Flaw One — The Cell That Couldn’t Change Its Mind

Let's nail down the first flaw precisely, because the fix (next chapter) flows directly from understanding it. The classic LSTM struggles to revise a storage decision: once it has committed something to memory, it's hard to override that when a more important item appears later. The culprit is the humble sigmoid that produces every gate.

Why sigmoids saturate

A sigmoid squashes any input into the range 0 to 1. That's its job, and it's also its trap. To make a gate truly “fully open,” you'd need an output of exactly 1, which requires an infinitely large input. In practice the sigmoid saturates — it flattens out near 0 and near 1, so its output barely changes no matter how strong the signal gets. Once a gate is near-saturated, pushing it further has almost no effect, and the gradient through it nearly vanishes, so the network can barely learn to change it either.
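
To put numbers on the flattening, here is a small sketch that prints the gate value and its slope at a few signal strengths. The slope formula s * (1 - s) is the standard derivative of the sigmoid; the helper names are chosen for this example only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Gate value and slope as the "store this" signal gets stronger.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"signal={x:5.1f}  gate={sigmoid(x):.5f}  slope={sigmoid_grad(x):.6f}")
```

The slope is 0.25 at a signal of 0 and has already dropped below 0.01 by a signal of 5. Doubling the signal from 5 to 10 moves the gate by less than 0.01, which is the saturation the text describes.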

The nearest-neighbor failure. Here's the task that exposes it. Scan a sequence and keep the item most similar to a query — you must overwrite your stored best whenever a better match appears. An LSTM has to slam its input gate wide open to store the new best while forgetting the old. But to strongly favor a much better later item, it needs the gate to swing harder than the saturated sigmoid allows. It can't decisively override an already-stored value. The memory gets “stuck” on early commitments. This single limitation showed up across many tasks where revision matters.

Worked example: the stuck gate

Suppose the LSTM has stored an item with importance 5, and a vastly more important item (importance 50) now arrives. To properly replace the old one, the input gate should open enormously more for the new item. But the sigmoid caps the gate at 1, no matter how large the input. Treating the importance as the gate's input, the old item got a gate of about 0.993 (the sigmoid of 5); the new, far-more-important item gets a gate that rounds to 1.0, barely more. The model cannot express “this new thing is ten times more important, so overwrite decisively.” Both get squashed into nearly the same near-1 gate. The relative importance is lost to saturation.

See it: the sigmoid ceiling

The widget plots the sigmoid gate's output against the strength of the “store this” signal. Push the signal strength up: the output climbs, then flattens against the ceiling of 1 and stops responding. Two very different signal strengths (a mildly important item and a critically important one) produce nearly the same gate value once both are in the saturated zone. That flattening is exactly why the LSTM can't revise — it can't tell “important” from “far more important.”

The Saturating Sigmoid Gate

Gate output vs. signal strength. Past a point the sigmoid flattens at 1 — stronger signals can't open it further. Two markers show how a strong and a much-stronger signal collapse to nearly the same gate.

Common misconception. “Saturation is just a small numerical nuisance.” It's a real expressiveness limit. A gate capped at 1 can't represent “store this much more strongly than that.” And saturated sigmoids have near-zero gradient, so the model can't even learn its way out. The fix isn't a tweak to training — it's changing the gate function itself so it can grow unboundedly and revise decisively. That's exponential gating.

Why does a sigmoid gate prevent the LSTM from decisively overriding an earlier stored value?

Sigmoids are too slow to compute
The sigmoid saturates (caps at 1), so a far-more-important item produces nearly the same gate value as a moderately important one; the model can't express "open much wider" to overwrite
The forget gate is disabled

Chapter 3: Exponential Gating — Gates That Can Grow

xLSTM's fix for the revision problem is wonderfully direct: replace the sigmoid gates with exponential ones. An exponential function has no ceiling — it can grow arbitrarily large. So a gate can now open as wide as needed, and a far-more-important item can produce a gate value far larger than a merely-important one. The relative importance that saturation crushed is preserved. The cell can decisively revise. This is the core of the sLSTM (the scalar xLSTM variant).

Why exponential solves revision

Think back to the stuck-gate example. With a sigmoid, importance-5 and importance-50 items both got gate values near 1, so they were indistinguishable. With an exponential gate, importance-50 produces a gate value vastly larger than importance-5. When the cell combines old and new with these gates, the new, far-more-important item dominates and effectively overwrites the old. The model can now express “this matters far more, so weight it far more heavily,” which is exactly the revision capability the sigmoid denied. Unbounded gates mean unbounded ability to re-prioritize.

But unbounded gates explode! If gates can grow without limit, the cell state can blow up to infinity — numerical disaster. This is the obvious objection, and xLSTM's answer is the second half of the trick: a normalizer state.

The normalizer state

Alongside the cell state, the sLSTM tracks a second running quantity: the normalizer, which accumulates the total gate magnitude over time. At read-out, the cell state is divided by this normalizer. So even though the raw gates can be huge, the ratio stays bounded — what matters is each item's gate relative to the total, not its absolute size. It's exactly the normalization a softmax does (divide by the sum), but computed in a running, recurrent way. The exponential gives unbounded expressiveness; the normalizer divides it back into a stable, well-behaved output. Together they're “softmax-like” selection inside a recurrence.
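
In symbols, the normalizer idea can be written as follows. This is a sketch in notation chosen for this lesson (tildes mark gate pre-activations, z_t is the candidate value, o_t the output gate), not a quotation of any particular implementation.

```latex
\begin{aligned}
i_t &= \exp(\tilde{i}_t), \qquad f_t = \sigma(\tilde{f}_t)\ \text{or}\ \exp(\tilde{f}_t) \\
c_t &= f_t\, c_{t-1} + i_t\, z_t \\
n_t &= f_t\, n_{t-1} + i_t \\
h_t &= o_t \cdot \frac{c_t}{n_t}
\end{aligned}
```

With the forget gate held at 1, the ratio c_t / n_t reduces to a softmax-weighted average of the stored values: each z_s is weighted by exp(ĩ_s) divided by the sum of all the exponentials so far.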

There's one more numerical safeguard: a stabilizer that tracks the largest gate seen so far and subtracts it before exponentiating (the same max-subtraction trick a numerically-stable softmax uses). This keeps the exponentials from overflowing while leaving the ratios unchanged. With exponential gates, the normalizer, and the stabilizer, the sLSTM gets the revision power of unbounded gating without the instability.
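
Here is a minimal sketch of how the three pieces (exponential gates, normalizer, stabilizer) can fit together for a single scalar cell. It assumes an exponential forget gate, leaves out the output gate, and takes the gate pre-activations as plain lists; the function name `slstm_readout` is made up for this example.

```python
import math

def slstm_readout(i_pre, f_pre, z):
    """Run a scalar exponential-gated cell over a sequence.

    i_pre, f_pre: input and forget gate pre-activations (log-space gates).
    z: candidate values to store.
    Returns the normalized read-out c_t / n_t at every step.
    """
    c, n = 0.0, 0.0        # cell state and normalizer
    m = -math.inf          # stabilizer: running maximum in log space
    outputs = []
    for it, ft, zt in zip(i_pre, f_pre, z):
        m_new = max(ft + m, it)            # largest log-gate in play
        i_gate = math.exp(it - m_new)      # always <= 1, so no overflow
        f_gate = math.exp(ft + m - m_new)  # always <= 1, so no overflow
        c = f_gate * c + i_gate * zt
        n = f_gate * n + i_gate
        m = m_new
        outputs.append(c / n)              # the shared scale cancels in the ratio
    return outputs

# A pre-activation of 1000 would overflow a raw exp(); the stabilized form copes.
print(slstm_readout([5.0, 1000.0], [0.0, 0.0], [1.0, 2.0]))
```

Both c and n are stored scaled by the same factor, so their ratio matches what the unstabilized recurrence would give. In the example the second item's gate dwarfs the first, and the read-out moves from 1.0 to 2.0: the cell has revised what it holds.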

See it: exponential vs. sigmoid

The widget overlays the two gating functions. The sigmoid (red) flattens at 1 — the saturation from Chapter 2. The exponential (purple) keeps climbing — no ceiling, so it preserves the difference between “important” and “far more important.” Toggle the normalizer to see how dividing by the running total keeps the exponential's output bounded even as the raw gate grows — expressiveness without explosion.

Exponential vs. Sigmoid Gating

Red = sigmoid (caps at 1, saturates). Purple = exponential (unbounded, preserves relative importance). Toggle the normalizer to see how the output stays stable despite the unbounded raw gate.

This is the LSTM rediscovering softmax. Exponential gating plus a normalizer is, in essence, a recurrent softmax: exponentiate the scores, divide by their running sum. It's the same mechanism that gives attention its sharp, selective focus — now built into an LSTM cell that updates one token at a time. xLSTM gives the recurrent cell the selective sharpness that the sigmoid LSTM lacked, which is precisely the capability gap from the Linear Attention lesson, attacked from the LSTM side.

Exponential gates can grow without bound, which risks blowing up the cell state. How does the sLSTM stay stable?

It clips the gates back to 1, undoing the benefit
A normalizer state accumulates total gate magnitude and divides the cell state by it (plus a max-subtracting stabilizer), like a running softmax, so ratios stay bounded
It uses a smaller learning rate

Chapter 4: Matrix Memory — A Bigger Cell

Exponential gating fixed revision. But there's a second limitation hiding in the classic LSTM: its memory is just a vector, a single list of numbers. That's a small amount of storage, and everything the model wants to remember has to be crammed into it, so stored items interfere with one another. xLSTM's second variant, the mLSTM (matrix LSTM), upgrades the cell from a vector to a full matrix, with vastly more memory capacity and a structure that holds associations.

From storing values to storing key-value pairs

Here's the elegant idea, and it should feel familiar from the Linear Attention lesson. Instead of just accumulating values, the mLSTM cell stores key-value associations: when a token arrives with a key and a value, the cell adds their outer product (the key times the value, forming a matrix) into its memory. To retrieve, you multiply a query against the matrix, and out comes a mix of the stored values, each weighted by how well its key matches the query, so the best-matching key dominates. The matrix memory is an associative memory: ask with a key-like query, get back the matching value.
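
A tiny sketch makes the store-and-retrieve cycle concrete. It leaves out the gates and the normalizer entirely and uses plain Python lists as matrices; `store` and `retrieve` are names invented for this example.

```python
def store(memory, key, value):
    """Return memory plus the outer product value * key^T."""
    return [[memory[a][b] + value[a] * key[b] for b in range(len(key))]
            for a in range(len(value))]

def retrieve(memory, query):
    """Multiply the memory matrix by the query vector."""
    return [sum(m_ab * q_b for m_ab, q_b in zip(row, query)) for row in memory]

d = 3
memory = [[0.0] * d for _ in range(d)]

# Two associations with orthogonal keys, so they do not interfere.
memory = store(memory, key=[1.0, 0.0, 0.0], value=[0.2, 0.5, 0.9])
memory = store(memory, key=[0.0, 1.0, 0.0], value=[0.7, 0.1, 0.3])

recalled = retrieve(memory, query=[0.0, 1.0, 0.0])
print(recalled)
```

Querying with the second key returns the second value, [0.7, 0.1, 0.3]. The recall is exact here only because the two keys are orthogonal; when keys overlap, the result is a blend of the stored values weighted by key-query similarity.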

The mLSTM cell is linear attention's state. That outer-product accumulation — add key-times-value for each token, query to retrieve — is exactly the fixed-size key-value state from the Linear Attention lesson. The mLSTM and linear attention arrive at the same matrix-valued recurrent memory from opposite directions: one by upgrading the LSTM cell, the other by reordering attention's matrix multiply. This convergence is the big story of the 2024 architectures — different lineages, same fixed-state associative memory. The mLSTM adds exponential gating on top, giving it the revision power linear attention lacked.

Why a matrix holds so much more

A vector of size d holds d numbers. A matrix of size d-by-d holds d-squared numbers; for d = 64, that's 64 numbers versus 4,096. But it's not just raw count: the matrix structure lets the cell store many distinct, separable key-value pairs that don't interfere as badly, because different keys map to different directions. The vector LSTM had to overwrite or blend memories that competed for its few slots; the matrix mLSTM can keep many associations side by side. More capacity, less interference, directly attacking the recall weakness of fixed-state models.

See it: vector vs. matrix capacity

The widget stores a series of key-value pairs, then tests recall. With vector memory, new pairs overwrite old ones — recall degrades fast as you add more. With matrix memory, the outer products coexist — recall holds up far better as the number of stored pairs grows. Add pairs and watch the vector memory's recall collapse while the matrix memory's stays strong.

Vector vs. Matrix Memory Capacity

Store N key-value pairs, then measure recall accuracy. Vector memory saturates and interferes; matrix memory holds many associations. Drag N up and watch the gap.

Common misconception. “A matrix memory means quadratic cost again.” No — the matrix is d-by-d, fixed by the feature dimension, not n-by-n. It does not grow with sequence length. Storing a million tokens still uses the same fixed d-by-d matrix; each token just adds its outer product into it. The cost stays linear in sequence length and the memory stays constant — the matrix is bigger than a vector, but still a fixed size, exactly like linear attention's state.

How does the mLSTM's matrix memory relate to linear attention?

They're unrelated; matrix memory is a new invention
It's the same fixed-size key-value state: accumulate key×value outer products, query to retrieve. The mLSTM reaches it by upgrading the LSTM cell, linear attention by reordering attention
The matrix grows with sequence length, making it quadratic

Chapter 5: Parallelizable — Fixing the Second Flaw

We've fixed revision (exponential gating) and capacity (matrix memory). Now the flaw that actually killed the LSTM commercially: it couldn't train in parallel. The mLSTM's design deliberately solves this, and understanding how reveals a sharp tradeoff between the two xLSTM variants.

Why the classic LSTM is stuck in sequence

In a classic LSTM, each step's gates depend on the previous step's hidden output — a recurrent connection from one timestep to the next. So you literally cannot compute step 5 until you've finished step 4, which needs step 3, and so on. The computation is a chain you must walk one link at a time. On a GPU — which is built to do thousands of things at once — this is agony: the hardware sits mostly idle, waiting. That's why LSTMs were slow to train and couldn't scale to the sizes that made transformers dominant.

The mLSTM cuts the recurrent link

The mLSTM makes one crucial sacrifice: it removes the hidden-to-hidden recurrence in the gates. Its gates depend only on the current input, not on the previous step's output. With that dependency gone, the per-token computations no longer form an unbreakable chain: they can be computed in parallel, all at once, exactly like attention. The mLSTM has the same dual nature as linear attention (Chapter 3 of the Linear Attention lesson): a parallel form for fast GPU training and a recurrent form for cheap streaming inference. The second flaw is fixed.
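
The dual nature can be checked on a toy example. The sketch below is a simplification, not the full mLSTM: it keeps a scalar forget gate per token and drops the exponential input gate, normalizer, and stabilizer. The forget gates are simply given as numbers, standing in for values computed from each token's own input. All function names are made up for this illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recurrent_form(q, k, v, f):
    """Walk the sequence one token at a time, carrying a matrix state C."""
    d_k, d_v = len(k[0]), len(v[0])
    C = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q_t, k_t, v_t, f_t in zip(q, k, v, f):
        # Decay the old state, then add this token's value-key outer product.
        C = [[f_t * C[a][b] + v_t[a] * k_t[b] for b in range(d_k)]
             for a in range(d_v)]
        outputs.append([dot(row, q_t) for row in C])
    return outputs

def parallel_form(q, k, v, f):
    """Compute every output from key-query scores and a decay table."""
    n, d_v = len(q), len(v[0])
    # decay[t][s] = f[s+1] * ... * f[t]: how much of token s survives at step t.
    decay = [[0.0] * n for _ in range(n)]
    for t in range(n):
        w = 1.0
        for s in range(t, -1, -1):
            decay[t][s] = w
            w *= f[s]
    outputs = []
    for t in range(n):  # no output here depends on any other output
        scores = [decay[t][s] * dot(k[s], q[t]) for s in range(t + 1)]
        outputs.append([sum(scores[s] * v[s][a] for s in range(t + 1))
                        for a in range(d_v)])
    return outputs

q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
v = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [2.0, 0.0]]
f = [0.9, 0.5, 0.8, 0.7]  # forget gates, assumed to depend on the current input only

sequential = recurrent_form(q, k, v, f)
parallel = parallel_form(q, k, v, f)
```

Both functions produce the same outputs up to rounding. The parallel version never reads a previous output, which is only possible because the gates do not depend on one; on a GPU the decay table and scores would be computed as matrix operations rather than Python loops.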

The sLSTM–mLSTM tradeoff. This is why xLSTM has two variants. The mLSTM drops recurrent memory mixing to gain parallelism (and adds the matrix memory) — fast, scalable, the workhorse. The sLSTM keeps the recurrent memory mixing — which makes it not fully parallelizable, but more expressive for certain state-tracking tasks that genuinely need step-to-step interaction. xLSTM models mix both kinds of blocks: mLSTM blocks for efficient bulk processing, sLSTM blocks where the extra expressiveness pays off. You choose the ratio.

The chunkwise compromise

Like the rest of the family, the mLSTM in practice uses chunkwise processing: split the sequence into chunks, compute within each chunk in parallel, and pass a recurrent state between chunks. This gets near-parallel training speed while keeping the linear cost — the same hardware-efficient trick the Linear Attention lesson described. It's how the mLSTM actually runs fast on real GPUs.
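
A sketch of the chunkwise idea, under simplifying assumptions: no gates at all (a plain sum of outer products), plain Python lists, and invented function names. The point is only to show the bookkeeping, namely a state carried between chunks plus a causal sum inside each chunk.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recurrent(q, k, v):
    """Reference: update a matrix state one token at a time."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        for a in range(d_v):
            for b in range(d_k):
                state[a][b] += v_t[a] * k_t[b]
        outputs.append([dot(row, q_t) for row in state])
    return outputs

def chunkwise(q, k, v, chunk_size):
    """Same outputs, computed chunk by chunk with a carried state."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_k for _ in range(d_v)]  # summary of all earlier chunks
    outputs = []
    for start in range(0, len(q), chunk_size):
        end = min(start + chunk_size, len(q))
        # Inside a chunk, each output needs only the carried state and the
        # tokens of the same chunk, so these could all be computed together.
        for t in range(start, end):
            inter = [dot(row, q[t]) for row in state]
            intra = [sum(dot(k[s], q[t]) * v[s][a] for s in range(start, t + 1))
                     for a in range(d_v)]
            outputs.append([x + y for x, y in zip(inter, intra)])
        # Fold the whole chunk into the state before moving on.
        for s in range(start, end):
            for a in range(d_v):
                for b in range(d_k):
                    state[a][b] += v[s][a] * k[s][b]
    return outputs

q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0], [0.2, 0.8]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [0.3, 0.3]]
v = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [2.0, 0.0], [1.5, 1.0]]

reference = recurrent(q, k, v)
chunked = chunkwise(q, k, v, chunk_size=2)
```

Whatever the chunk size, the outputs agree with the token-by-token reference up to rounding, and the carried state stays the same fixed size throughout.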

See it: sequential vs. parallel computation

The widget shows tokens being processed two ways. In sequential mode (sLSTM, classic LSTM), each token must wait for the previous one: a chain, processed one at a time. In parallel mode (mLSTM), with the recurrent gate-link cut, all tokens compute at once, and the GPU does them simultaneously. Watch the wall-clock time: sequential grows with length; parallel stays nearly flat for as long as the hardware has spare parallel capacity.

Sequential (sLSTM) vs. Parallel (mLSTM) Training

Press Run. Sequential processes tokens one-by-one (each waits for the last); parallel does them all at once. Watch how training time diverges as the sequence grows.

Common misconception. “If the mLSTM is parallelizable, why keep the slow sLSTM at all?” Because parallelism comes at a cost in capability. Removing the recurrent memory mixing slightly limits what the model can compute: certain state-tracking problems (like tracking nested structure) genuinely benefit from step-to-step interaction. The sLSTM trades training speed for that expressiveness. xLSTM keeps both tools and lets you blend them, rather than forcing one compromise on everything.

How does the mLSTM become parallelizable (unlike the classic LSTM)?

By using more GPUs
It removes the recurrent dependency of the gates on the previous step's output, so per-token computations no longer form a sequential chain and can be computed all at once (like attention)
By making the sequence shorter

Chapter 6: The Revision Lab — See the Fix Work

This is the payoff — the revision problem and its fix, side by side, interactive. A stream of items arrives, each with an importance. The task is simple: by the end, the memory should hold the single most important item, no matter when it appeared. A sigmoid-gated LSTM fails this when the most important item comes late (it can't override what's stored); an exponential-gated xLSTM succeeds. Watch both, on the same stream.

Drag the importance of each item, and especially try making a late item the most important. Toggle between sigmoid and exponential gating and see which item each model's memory ends up holding. The sigmoid model gets “stuck” on whatever it committed early; the exponential model correctly revises to the true maximum.

Keep-the-Maximum: Sigmoid vs. Exponential Gating

Each bar is an item's importance, arriving left to right. The task: end holding the most important one. Toggle gating and adjust importances — make a late item the biggest and watch the sigmoid model fail while the exponential one revises correctly.

What to take away. Make the last item the most important and watch the sigmoid model cling to an earlier choice — its gate is already saturated, so it can't open wide enough to overwrite. The exponential model swings its gate as large as needed and correctly grabs the late maximum. This tiny task is a microcosm of why exponential gating matters: real sequences constantly require revising an earlier “best so far” when something better shows up. xLSTM can; the classic LSTM couldn't.
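
The lab's task can be imitated in a few lines. This is a deliberately stripped-down toy, not the widget's code: the forget gate is fixed at 1, each item's importance is used directly as the gate's input, and both models read out a gate-weighted average so that the only difference between them is the gate function. The name `final_memory` is invented for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def final_memory(importances, values, gate):
    """Gate-weighted average of the values after the whole stream has passed.

    Toy assumptions: forget gate fixed at 1, importance used as the gate
    input, and the same c / n read-out for both gate types.
    """
    if gate == "exp":
        m = max(importances)                              # max-subtraction for stability
        weights = [math.exp(s - m) for s in importances]
    else:
        weights = [sigmoid(s) for s in importances]
    c = sum(w * v for w, v in zip(weights, values))       # accumulated cell state
    n = sum(weights)                                      # accumulated gate mass
    return c / n

importances = [5.0, 3.0, 50.0]   # the most important item arrives last
values = [1.0, 2.0, 3.0]         # each item's content, labelled by position

held_by_exp = final_memory(importances, values, gate="exp")
held_by_sigmoid = final_memory(importances, values, gate="sigmoid")
print(held_by_exp, held_by_sigmoid)
```

The exponential model ends up holding essentially 3.0, the late maximum. The sigmoid model ends near 2.0, an even blend of all three items, because every gate was squashed to nearly the same value and the late item could not stand out.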

Common misconception. “The sigmoid model fails because it's smaller or undertrained.” No — even a perfectly-trained sigmoid LSTM hits this wall, because the limitation is in the gate function, not the parameters. Saturation caps the gate at 1 regardless of how the weights are set. It's an architectural ceiling, which is exactly why the fix had to change the architecture (the gate function), not just train harder.

No quiz — the lab is the test. If you can predict, before toggling, which item each model will keep when the maximum arrives late, you understand exponential gating.

Chapter 7: The xLSTM Block — Assembling the Model

We have the pieces — exponential gating, matrix memory, the parallelizable mLSTM, the expressive sLSTM. Now how are they assembled into a full model? The answer is reassuringly familiar: xLSTM borrows the residual block structure that makes transformers trainable, and slots its recurrent cells in where attention would go.

The residual backbone

An xLSTM model is a stack of residual blocks, exactly like a transformer. Each block wraps its core operation with a normalization layer and a residual connection (add the input back to the output). This is the same machinery that lets deep transformers train without vanishing gradients (see the Skip Connections and Normalization lessons). The novelty isn't the backbone — it's what sits inside each block: an xLSTM layer instead of an attention layer.

Drop-in replacement for attention. The cleanest way to see xLSTM: take a transformer, and replace the attention sub-layer in each block with an xLSTM layer (mLSTM or sLSTM), keeping the residual connections, the normalization, and the feed-forward parts. The result is a model with the same trainable deep-stack structure but linear-cost, constant-memory token mixing instead of quadratic attention. This “swap the mixer, keep the scaffolding” pattern is shared across the whole linear-model family — RWKV, Mamba, and xLSTM all plug into the transformer-style residual backbone.
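
The "swap the mixer, keep the scaffolding" pattern is easy to sketch. The code below assumes a pre-norm layout (normalize, mix, then add the input back), omits learned parameters and the feed-forward part, and uses a causal running mean as a stand-in mixer; that stand-in is purely hypothetical and is not an mLSTM or sLSTM.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def residual_block(tokens, mixer):
    """Pre-norm residual block: output = input + mixer(norm(input)).

    tokens: list of feature vectors, one per position.
    mixer: any function mapping a sequence to a same-shaped sequence;
           this is the slot where attention or an xLSTM layer would sit.
    """
    normed = [layer_norm(x) for x in tokens]
    mixed = mixer(normed)
    return [[a + b for a, b in zip(x, y)] for x, y in zip(tokens, mixed)]

def running_mean_mixer(seq):
    """Stand-in recurrent mixer (hypothetical): causal mean over a fixed-size state."""
    state = [0.0] * len(seq[0])
    out = []
    for t, x in enumerate(seq, start=1):
        state = [s + xi for s, xi in zip(state, x)]
        out.append([s / t for s in state])
    return out

tokens = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
out = residual_block(tokens, running_mean_mixer)
```

Because the block only requires a sequence-in, sequence-out function, changing the token mixer means passing a different `mixer`; the normalization and residual wrapper stay untouched.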

Mixing sLSTM and mLSTM blocks

A real xLSTM architecture interleaves the two block types. mLSTM blocks — parallelizable, matrix-memory — do most of the work: efficient, scalable bulk sequence processing. sLSTM blocks — with their recurrent memory mixing — are sprinkled in where the extra state-tracking expressiveness helps. The ratio is a design choice, much like the attention-to-linear ratio in the hybrid models from the Linear Attention lesson. You tune how many of each based on your tasks and your compute budget.

Inside a block: up-projection and gating

There's one more structural detail. xLSTM blocks often up-project to a wider dimension before the recurrent operation and project back down after — giving the cell more room to work — and use additional gating (like a gated linear unit) around the core. These are the same kinds of engineering refinements that tuned transformers over the years (wider FFNs, GLU variants); xLSTM inherits that accumulated wisdom. The recurrent cell is the heart, but it's wrapped in the modern block design that makes large models train well.

See it: building the stack

The widget shows an xLSTM model as a stack of residual blocks. Toggle a block between mLSTM (parallel, matrix memory) and sLSTM (recurrent, expressive) and see the mix. Notice every block has the same residual + norm wrapper — the transformer scaffolding — with the recurrent cell as the interchangeable core.

An xLSTM Stack (residual blocks)

Click a block to flip it between mLSTM (teal, parallel) and sLSTM (purple, expressive). Every block shares the residual + norm wrapper; only the recurrent core changes.

Common misconception. “xLSTM is a totally new kind of network needing new training machinery.” It deliberately reuses the transformer's proven residual-block scaffolding — normalization, residual connections, up/down projections, gated units — so it inherits everything the field learned about training deep models. The genuinely new part is small and surgical: the recurrent cell (exponential gating + matrix memory) that replaces attention. That's why it could be scaled up quickly — most of the recipe was already battle-tested.

How is a full xLSTM model structured?

As a single giant recurrent cell with no layers
As a stack of transformer-style residual blocks (norm + residual + projections), but with mLSTM/sLSTM layers replacing the attention sub-layer
As a convolutional network with pooling

Chapter 8: xLSTM Among the Recurrent Revival

xLSTM didn't arrive in a vacuum — it's part of the 2024 wave of sub-quadratic models alongside RWKV and Mamba. They're all chasing the same prize and they're all, underneath, the same kind of thing: a fixed-state recurrence with constant-memory inference. Seeing where xLSTM fits clarifies the whole landscape.

What they share

Every model in this family — xLSTM, RWKV, Mamba, RetNet, gated linear attention — carries a fixed-size state that it updates one token at a time, giving linear cost and constant-memory streaming inference. And they all converged on the key fix from the Linear Attention lesson: data-dependent control of the state (selective forgetting and writing based on content). They differ mainly in the mathematical lineage they came from and the exact form of their state update.

Model	Came from	State / key mechanism
xLSTM	the LSTM	matrix memory + exponential gating + normalizer
RWKV	RNN/attention hybrid	WKV state + per-channel time-decay + receptance
Mamba	state-space models	continuous-time SSM + input-dependent selectivity
RetNet	attention	retention with explicit decay; chunkwise form
Transformer	—	growing KV cache, quadratic, sharp recall (the baseline)

The grand convergence. Here's the deep point. The mLSTM's matrix memory (accumulate key×value outer products, query to retrieve) is the same object as linear attention's key-value state, RetNet's retention state, and (with different parameterization) Mamba's SSM state. Four research lineages (LSTMs, attention, retention, state-space models) climbed different mountains and met at the same summit: a data-dependent, fixed-size, matrix-valued recurrent memory. xLSTM's distinctive contribution is reaching it through the LSTM's gating philosophy, proving that classic recurrent ideas, modernized, belong at that summit too.

xLSTM's distinctive angle

What makes xLSTM stand out within the family? Its emphasis on exponential gating with a normalizer — an explicit, principled fix for the revision/recall problem, descended directly from the LSTM's gate philosophy — and its two-variant design (the parallel mLSTM and the more-expressive sLSTM), letting you trade efficiency for state-tracking power block by block. It's the most LSTM-faithful member of the family, which is fitting: it comes from the people who invented the LSTM in the first place.

See it: the family on a map

Select a model to place it on two axes: its lineage (where it came from) and its state type. Notice how the linear family clusters — same fixed matrix-state region — while the transformer sits apart with its growing cache. xLSTM, RWKV, and Mamba are close neighbors despite their different origins.

The Recurrent-Revival Family Map

Select a model to highlight it. The linear family clusters in the fixed-matrix-state region; the transformer is the outlier with a growing cache. Different origins, converging design.

Common misconception. “These are four competing camps and one will win.” They're more like four dialects of one language — data-dependent fixed-state recurrence — and the research increasingly flows between them (techniques from one improve the others, and the math partly unifies them). Rather than “which wins,” the useful question is which formulation is easiest to scale, train, and tune — and, as the Linear Attention lesson noted, whether to hybridize any of them with a few attention layers for recall.

What do xLSTM (from LSTMs), RWKV (from RNN/attention), and Mamba (from state-space models) fundamentally share?

They all use a growing KV cache like a transformer
They're all data-dependent, fixed-size recurrent memories (linear cost, constant-memory inference): different lineages converging on the same matrix-valued state
They all require quadratic attention internally

Chapter 9: Connections & Cheat Sheet

You now understand xLSTM end to end: why the LSTM fell behind, the saturating-sigmoid revision problem, exponential gating with a normalizer as the fix, the matrix memory that holds key-value associations, the mLSTM's parallelizability, the residual-block architecture, and where it sits in the recurrent revival. The thread: take the LSTM's timeless gated-memory idea, fix its two flaws (saturation and sequentiality), and you get a competitive, constant-memory alternative to attention.

The cheat sheet

The two flaws: (1) sigmoid gates saturate → can't revise; (2) recurrent gates → can't train in parallel

Exponential gating: unbounded gates → can open as wide as needed → can revise decisively

Normalizer + stabilizer: divide by running gate sum (a recurrent softmax) → bounded, stable output

sLSTM: scalar/vector memory, keeps recurrent mixing → expressive, NOT parallel

mLSTM: matrix (key-value) memory, drops recurrent gate-link → parallelizable, high capacity

Matrix memory = linear attention's KV state: accumulate key×value, query to retrieve

Architecture: transformer-style residual blocks, mLSTM/sLSTM layers replacing attention

Family: xLSTM, RWKV, Mamba — data-dependent fixed-state recurrence, different lineages

A decision guide

Need constant-memory streaming inference?

Any recurrent-family model (xLSTM, RWKV, Mamba) fits.

↓

Want maximum training throughput?

mLSTM blocks (parallelizable) for the bulk of the stack.

↓

Task needs hard state-tracking?

Add sLSTM blocks for their extra recurrent expressiveness.

↓

Need precise long-range recall?

Hybridize with a few attention layers (as in the Linear Attention lesson).

Where this connects

Linear Attention & RWKV — the sibling family; xLSTM's matrix memory IS linear attention's KV state, reached from the LSTM side.
SSM / Mamba — the state-space cousin; shares data-dependent fixed-state recurrence.
Skip Connections & Normalization — the residual-block scaffolding xLSTM reuses from transformers.
Gradient Flow — the LSTM's “constant error carousel” was an early answer to vanishing gradients.
Transformer — the architecture xLSTM aims to replace or complement as the token mixer.
Loss Functions — the normalizer's running-softmax is the same exponential-normalize idea behind cross-entropy and attention.

The one thing to remember. The LSTM was never a bad idea — it was a great idea with two specific, fixable flaws. xLSTM fixes them: exponential gating restores the ability to revise what's stored, and the parallelizable matrix-memory mLSTM restores trainability at scale. The payoff is a recurrent model with the constant-memory, linear-cost inference that attention can't match — proof that decades-old recurrent wisdom, modernized, still belongs at the frontier. Sometimes the way forward is to go back and fix what you abandoned.

A colleague says “LSTMs are obsolete, xLSTM is just nostalgia.” What's the strongest rebuttal?

LSTMs were never actually used in practice
xLSTM uses attention internally, so it's really a transformer
xLSTM fixes the LSTM's two real flaws (saturating gates → exponential gating; non-parallel → parallelizable matrix memory), yielding constant-memory linear-cost inference competitive with transformers

“An old idea is not a dead idea — sometimes it was just waiting for its two flaws to be fixed.”