Transformers have a quadratic bottleneck. State space models process sequences in linear time — and Mamba makes them input-dependent. Here's the full story.
Attention is powerful but expensive. For a sequence of length n, self-attention computes a score between every pair of tokens: that's n² operations. At 1,000 tokens, that's 1 million pairs. At 100,000 tokens, it's 10 billion. This quadratic cost is the fundamental bottleneck of Transformers.
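The arithmetic above can be checked in a few lines — this sketch just counts operations, assuming one score per (query, key) pair for attention and one constant-time state update per token for a recurrence:

```python
# Pairwise attention scores vs. linear-recurrence steps, as a function of n
for n in [1_000, 10_000, 100_000]:
    attention_ops = n * n    # one score per (query, key) pair
    recurrence_ops = n       # one constant-time update per token
    print(f"n={n:>7,}: attention={attention_ops:>18,}  recurrence={recurrence_ops:>9,}")
```

At n = 100,000 the gap is five orders of magnitude — the quadratic term dominates everything else in the layer.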
What if we could process sequences without ever computing pairwise interactions? What if each token was processed in constant time, regardless of sequence length? This is the promise of state space models.
Drag the slider to see how compute scales with sequence length. The gap grows fast.
The simplest alternative to attention is a linear recurrence. At each step, we maintain a hidden state h and update it: ht = A · ht−1 + B · xt.
The matrix A controls how the previous state is transformed (memory). The matrix B controls how the new input is mixed in. The output is yt = C · ht. This takes constant time per step — no matter how long the sequence, each new token costs the same.
Watch the hidden state h evolve as input tokens arrive. A controls decay (memory); B controls input mixing.
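A minimal sketch of this recurrence, using a scalar state for clarity (a = 0.5 plays the role of A, b and c of B and C — illustrative values, not learned ones):

```python
# Scalar linear recurrence: h_t = a*h_{t-1} + b*x_t, output y_t = c*h_t
def linear_recurrence(xs, a=0.5, b=1.0, c=1.0):
    h, ys = 0.0, []
    for x in xs:           # one constant-time update per token
        h = a * h + b * x  # a decays the old state, b mixes in the new input
        ys.append(c * h)   # readout
    return ys

# An impulse at step 0 decays geometrically — this is the fixed "memory" of a
print(linear_recurrence([1.0, 0.0, 0.0, 0.0]))  # → [1.0, 0.5, 0.25, 0.125]
```

Note that the loop body does the same constant amount of work no matter how long `xs` is — exactly the O(1)-per-token cost in the table below.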
| Approach | Cost/Token | Total (n tokens) | Can Select? |
|---|---|---|---|
| Self-Attention | O(n) | O(n²) | Yes (dynamic) |
| Linear Recurrence | O(1) | O(n) | No (fixed A, B) |
A state space model (SSM) comes from control theory. It starts in continuous time: h′(t) = A · h(t) + B · x(t), with output y(t) = C · h(t).
This is a differential equation: h′ is the derivative of the hidden state. To use it on discrete sequences (like text), we discretize it using a step size Δ. The most common method (zero-order hold) gives Ā = exp(ΔA) and B̄ = (ΔA)⁻¹(exp(ΔA) − I) · ΔB, so the discrete update becomes ht = Ā · ht−1 + B̄ · xt.
A continuous signal (teal) is sampled at discrete steps. Smaller Δ = more faithful representation but more steps.
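For a scalar state the zero-order-hold formulas collapse to ā = exp(Δa) and b̄ = (exp(Δa) − 1)/a · b, which is easy to compute directly. A sketch (a = −1 is an arbitrary stable choice):

```python
import math

# Zero-order-hold discretization of a scalar SSM h'(t) = a*h(t) + b*x(t)
def discretize_zoh(a, b, delta):
    a_bar = math.exp(delta * a)    # exact solution of the homogeneous ODE over one step
    b_bar = (a_bar - 1.0) / a * b  # integral of exp(a*s) ds over the step, times b
    return a_bar, b_bar

# Smaller delta -> a_bar closer to 1: the state changes less per step,
# i.e. the discrete system tracks the continuous one more faithfully
for delta in (1.0, 0.1, 0.01):
    a_bar, b_bar = discretize_zoh(a=-1.0, b=1.0, delta=delta)
    print(f"delta={delta}: a_bar={a_bar:.4f}, b_bar={b_bar:.4f}")
```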
The breakthrough of S4 (Structured State Spaces for Sequence Modeling) was solving the long-range memory problem. Previous SSMs forgot quickly — information from 1000 steps ago had decayed to nearly zero. S4 fixed this with two key ideas:
Instead of random initialization, the A matrix is set to the HiPPO (High-Order Polynomial Projection Operator) matrix. This special matrix is designed to optimally compress the history of a continuous signal into a fixed-size state. Each element of ht stores a coefficient of a polynomial approximation of the entire input history.
Computing Ā = exp(ΔA) naively for an N×N matrix costs O(N³). S4 decomposes A into diagonal + low-rank form, reducing this to O(N). This makes training practical for state sizes of 64 or more.
A signal arrives at step 0. Watch how much the hidden state remembers it over time. HiPPO retains far more information.
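A toy way to see why initialization matters (this is *not* the real HiPPO matrix — just a diagonal stand-in): compare a state whose channels all decay fast against one whose channels span many timescales, as structured initializations like HiPPO arrange, and measure how much of an impulse survives:

```python
import math

# Toy illustration of long-range memory: mean remaining magnitude of an
# impulse after `steps` updates, for a diagonal state with the given decays.
def retention(decays, steps):
    return sum(abs(d) ** steps for d in decays) / len(decays)

N = 16
naive = [0.5] * N                                           # every channel decays fast
structured = [math.exp(-1.0 / 2 ** k) for k in range(N)]    # timescales 1, 2, 4, ... steps

for steps in (10, 100, 1000):
    print(f"t={steps:>4}: naive={retention(naive, steps):.2e}, "
          f"structured={retention(structured, steps):.2e}")
```

The fast-decay state retains essentially nothing after 100 steps, while the multi-timescale state still holds a measurable fraction of the signal after 1000 — the slow channels act as long-term memory.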
S4 has a critical limitation: A, B, C are the same for every input token. The model can't decide to "pay more attention" to important tokens or "forget" irrelevant ones. Mamba (Gu & Dao, 2023) fixes this by making B, C, and Δ functions of the input.
This is the selective in "selective scan." When the model sees an important token, it can increase Δ (larger step = more input mixed in) and adjust B to write strongly to the state. For irrelevant tokens, it can shrink Δ and effectively skip them.
Watch how Mamba's input-dependent gates open (green) for important tokens and close (dim) for irrelevant ones. Click tokens to toggle importance.
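A reference (sequential) selective scan for a single scalar channel — the function shapes here are illustrative toy choices, not Mamba's learned projections; in particular the `x - 3.0` shift inside the softplus is an arbitrary gate threshold picked so that near-zero tokens get a tiny Δ:

```python
import math

# Sequential selective scan, scalar state. B and delta depend on the input.
def selective_scan(xs, a=-1.0):
    h, ys = 0.0, []
    for x in xs:
        delta = math.log1p(math.exp(x - 3.0))  # softplus gate: big input -> big step (toy choice)
        B = x                                  # input-dependent write strength (toy choice)
        C = 1.0                                # readout, kept fixed for clarity
        a_bar = math.exp(delta * a)            # per-token ZOH discretization
        b_bar = (a_bar - 1.0) / a * B
        h = a_bar * h + b_bar * x              # tiny delta -> a_bar ~ 1, b_bar ~ 0: token skipped
        ys.append(C * h)
    return ys

# A salient token (x=5) writes strongly to the state; the zeros barely touch it
print(selective_scan([5.0, 0.0, 0.0, 0.0]))
```

Contrast this with the fixed recurrence: there, every token decays the state by the same factor; here, the decay and write strength are decided per token.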
| Model | A, B, C | Convolution? | Content-aware? |
|---|---|---|---|
| S4 | Fixed (same for all inputs) | Yes (parallel training) | No |
| Mamba | Input-dependent | No (must use scan) | Yes |
Making B, C, Δ input-dependent means we can't use convolution for training. The naive approach — a sequential for-loop — would be unbearably slow on GPUs. Mamba uses a parallel scan algorithm that computes the full recurrence in O(log n) parallel steps instead of O(n) sequential ones.
The key insight: each step ht = at ht-1 + bt is an affine map of the previous state, and composing affine maps is an associative operation. Like addition, we can regroup the order of computation without changing the result. Instead of left-to-right, we use a binary tree:
Click "Step" to advance through the parallel scan. Each level halves the remaining work.
A Mamba block looks quite different from a Transformer block. It has no attention mechanism at all. Instead, it uses a combination of linear projections, 1D convolution, and the selective SSM:
Click "Step" to trace data through the Mamba block. Watch the two branches split and merge.
Pure Mamba models are fast but sometimes struggle with tasks requiring precise retrieval of specific tokens from the past (like "what was the 7th word?"). Pure Transformers are great at retrieval but expensive. The natural solution: combine both.
| Model | Architecture | Key Idea |
|---|---|---|
| Jamba (AI21) | Mamba + Attention + MoE | Alternate Mamba and attention layers. MoE for capacity. |
| Zamba (Zyphra) | Mamba + shared Attention | One shared attention layer interleaved every few Mamba blocks. |
| Griffin (DeepMind) | Gated linear recurrence + local attention | Use recurrence for global, windowed attention for local. |
| Mamba-2 | Structured state space duality | Show SSMs and attention are dual views of the same operation. |
Select an architecture to see how Mamba and attention layers are interleaved.
Mamba models are trained with the exact same objective as Transformers: next-token prediction with cross-entropy loss. The training data, tokenizer, and optimizer are all the same. The difference is entirely in the architecture.
During training, the entire sequence is available. The selective SSM uses the parallel scan to process all tokens simultaneously. This is analogous to how Transformers compute all attention weights at once.
During generation, we switch to sequential recurrence. At each step:

1. Compute Δ, B, C from the new token.
2. Discretize: Ā = exp(ΔA), B̄ from the zero-order hold.
3. Update the state: ht = Ā · ht−1 + B̄ · xt.
4. Read out the output: yt = C · ht.

The state stays a fixed size; nothing accumulates as the sequence grows.
Compare memory usage during generation. Transformer's KV cache grows linearly; Mamba's state is constant.
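The gap is easy to quantify with back-of-the-envelope numbers. The sizes below (7B-class model, fp16, state size 16) are illustrative assumptions, not measurements from any particular model:

```python
# Generation-time memory: Transformer KV cache vs. Mamba's fixed state
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per=2):
    # 2 = one key + one value vector per head, per layer, per token
    return seq_len * n_layers * 2 * n_heads * head_dim * bytes_per

def mamba_state_bytes(n_layers=32, d_model=4096, state_size=16, bytes_per=2):
    # no seq_len argument: the state does not grow with the sequence
    return n_layers * d_model * state_size * bytes_per

for n in (1_000, 100_000):
    print(f"n={n:>7,}: KV cache = {kv_cache_bytes(n) / 2**30:.2f} GiB, "
          f"Mamba state = {mamba_state_bytes() / 2**20:.2f} MiB")
```

Under these assumptions the KV cache reaches tens of GiB at 100k tokens, while the recurrent state stays at a few MiB regardless of length.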
Neither architecture is strictly better. Each has fundamental strengths and weaknesses rooted in their core mechanisms.
| Dimension | Transformer | SSM / Mamba |
|---|---|---|
| Training cost | O(n²) per layer | O(n) per layer |
| Inference memory | KV cache grows with n | Fixed-size state |
| Inference time/token | Grows with n (attend to all KV) | Constant |
| Exact retrieval | Excellent (direct access) | Weaker (compressed state) |
| Long-range context | Struggles past training length | Naturally extends |
| In-context learning | Strong (induction heads) | Emerging but less understood |
| Ecosystem maturity | Very mature (5+ years) | Rapidly growing |
| Hardware optimization | Flash Attention, GQA | Parallel scan, SRAM-aware |
Select a task to see which architecture is better suited and why.
You now understand both the Transformer and its most promising challenger. From continuous-time state spaces to selective scans, from HiPPO to hardware-aware algorithms — this is the frontier of sequence modeling.