The Complete Beginner's Path

Understand SSMs & Mamba

Transformers have a quadratic bottleneck. State space models process sequences in linear time — and Mamba makes them input-dependent. Here's the full story.

Prerequisites: Basic linear algebra + Intuition for sequences. Familiarity with Transformers helps but isn't required.
10 Chapters · 10+ Simulations · 0 Assumed ML Knowledge

Chapter 0: Sequences Without Attention

Attention is powerful but expensive. For a sequence of length n, self-attention computes a score between every pair of tokens: that's n² operations. At 1,000 tokens, that's 1 million pairs. At 100,000 tokens, it's 10 billion. This quadratic cost is the fundamental bottleneck of Transformers.
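To make the gap concrete, here is the arithmetic from the paragraph above as a few lines of Python:

```python
# Number of pairwise attention scores vs. linear-time steps at several lengths.
for n in [1_000, 10_000, 100_000]:
    print(f"n = {n:>7,}  pairs = {n * n:>14,}  linear steps = {n:>7,}")
# At n = 100,000 the pair count reaches 10,000,000,000 (ten billion).
```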

What if we could process sequences without ever computing pairwise interactions? What if each token were processed in constant time, regardless of sequence length? This is the promise of state space models.

Interactive: Attention Cost vs Linear Cost

Drag the slider to see how compute scales with sequence length. The gap grows fast.

Sequence length: 1,000
The core question: Attention lets every token see every other token. That's why it works so well. But do we really need O(n²) interactions? Maybe we can compress the past into a fixed-size state and update it one token at a time — like a rolling summary.
Check: Why is attention expensive for long sequences?

Chapter 1: Linear Recurrences

The simplest alternative to attention is a linear recurrence. At each step, we maintain a hidden state h and update it:

hₜ = A · hₜ₋₁ + B · xₜ

The matrix A controls how the previous state is transformed (memory). The matrix B controls how the new input is mixed in. The output is yₜ = C · hₜ. This takes constant time per step — no matter how long the sequence, each new token costs the same.
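The recurrence above can be sketched in a few lines of NumPy. This is a toy example with a 2-dimensional state; the matrix values are illustrative, not trained parameters:

```python
import numpy as np

def linear_recurrence(A, B, C, xs):
    """y_t = C h_t where h_t = A h_{t-1} + B x_t, for scalar inputs xs."""
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:              # one constant-time update per token
        h = A @ h + B * x     # mix decayed memory with the new input
        ys.append(C @ h)      # read out
    return np.array(ys)

# Toy 2-dimensional state: each dimension decays at its own rate.
A = np.diag([0.9, 0.5])
B = np.ones(2)
C = np.ones(2)
ys = linear_recurrence(A, B, C, [1.0, 0.0, 0.0, 0.0])
# An impulse at t=0 decays geometrically: ys ≈ [2.0, 1.4, 1.06, 0.854]
```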

Interactive: Linear Recurrence

Watch the hidden state h evolve as input tokens arrive. A controls decay (memory); B controls input mixing.

A (decay): 0.90
B (input): 0.50
The problem: A and B are fixed for every input. The same transformation is applied regardless of whether the current token is important or irrelevant. An SSM with fixed parameters is like reading a book with the same level of attention on every word — you can't speed-read the boring parts or slow down for key details.
Approach          | Cost/Token | Total (n tokens) | Can Select?
Self-Attention    | O(n)       | O(n²)            | Yes (dynamic)
Linear Recurrence | O(1)       | O(n)             | No (fixed A, B)
Check: What is the main limitation of a linear recurrence with fixed A and B?

Chapter 2: The State Space Model

A state space model (SSM) comes from control theory. It starts in continuous time:

h'(t) = A h(t) + B x(t)      y(t) = C h(t) + D x(t)

This is a differential equation: h' is the derivative of the hidden state. To use it on discrete sequences (like text), we discretize it using a step size Δ. The most common method (zero-order hold) gives:

Ā = exp(ΔA)     B̄ = (ΔA)⁻¹(Ā − I) · ΔB
Continuous
h'(t) = Ah(t) + Bx(t) — differential equation
↓ discretize with step Δ
Discrete
hₜ = Ā hₜ₋₁ + B̄ xₜ — recurrence
↓ unroll
Convolution
y = K * x where Kᵢ = C Āⁱ B̄ — parallel!
Interactive: Discretization

A continuous signal (teal) is sampled at discrete steps. Smaller Δ = more faithful representation but more steps.

Step Δ: 0.15
The dual view: An SSM can be computed as either a recurrence (sequential, O(n)) or a convolution (parallel, O(n log n) via FFT). Training uses convolution mode for parallelism. Inference uses recurrence mode for constant memory per token.
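The equivalence of the two views can be checked numerically. The sketch below discretizes a scalar-state SSM with zero-order hold and verifies that the recurrence and the convolution with kernel Kᵢ = C Āⁱ B̄ produce identical outputs (all parameter values here are arbitrary):

```python
import numpy as np

# Scalar-state SSM: check that the recurrence and convolution views agree.
A, B, C, dt = -0.5, 1.0, 1.0, 0.1
A_bar = np.exp(dt * A)                  # zero-order hold
B_bar = (A_bar - 1.0) / A * B           # (ΔA)^-1 (Ā - 1) ΔB, scalar case

x = np.random.default_rng(0).normal(size=16)

# Recurrence view: sequential, O(n).
h, y_rec = 0.0, []
for xt in x:
    h = A_bar * h + B_bar * xt
    y_rec.append(C * h)

# Convolution view: y = K * x with K_i = C * A_bar**i * B_bar, parallelizable.
K = C * A_bar ** np.arange(len(x)) * B_bar
y_conv = np.convolve(x, K)[:len(x)]     # keep the causal part

assert np.allclose(y_rec, y_conv)
```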
Check: Why start with a continuous-time formulation?

Chapter 3: The S4 Trick

The breakthrough of S4 (Structured State Spaces for Sequence Modeling) was solving the long-range memory problem. Previous SSMs forgot quickly — information from 1000 steps ago had decayed to nearly zero. S4 fixed this with two key ideas:

1. HiPPO Initialization

Instead of random initialization, the A matrix is set to the HiPPO (High-Order Polynomial Projection Operator) matrix. This special matrix is designed to optimally compress the history of a continuous signal into a fixed-size state. Each element of hₜ stores a coefficient of a polynomial approximation of the entire input history.

Aₙₖ = −(2n+1)^(1/2) (2k+1)^(1/2)   if n > k,   −(n+1)   if n = k,   0   if n < k
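As a sketch, this matrix can be built directly from the formula (0-indexed; HiPPO-LegS is lower-triangular, so entries with n < k are zero):

```python
import numpy as np

def hippo_legs(N):
    """The N x N HiPPO-LegS A matrix (0-indexed), per the formula above."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                A[n, k] = -(n + 1)   # entries above the diagonal stay 0
    return A

A = hippo_legs(4)
# Lower-triangular with -(n+1) on the diagonal, e.g. A[0, 0] == -1.
```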

2. Diagonal + Low-Rank Structure

Computing Ā = exp(ΔA) naively for an N×N matrix costs O(N³). S4 decomposes A into diagonal + low-rank form, reducing this to O(N). This makes training practical for state sizes of 64 or more.
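With a diagonal Ā (eigenvalues λ), the convolution kernel Kᵢ = C Āⁱ B̄ can be computed without any matrix powers at all. A minimal sketch with toy values (this is the idea, not the actual S4 parameterization):

```python
import numpy as np

def ssm_kernel_diag(lam, B_bar, C, L):
    """Kernel K_i = C Ā^i B̄ for diagonal Ā with eigenvalues lam.

    lam, B_bar, C: shape (N,); returns K of shape (L,). Cost O(N*L)."""
    powers = lam[None, :] ** np.arange(L)[:, None]   # (L, N): lam_j ** i
    return powers @ (B_bar * C)                      # sum_j C_j lam_j**i B_j

K = ssm_kernel_diag(np.array([0.9, 0.5]), np.ones(2), np.ones(2), L=4)
# K[i] = 0.9**i + 0.5**i  ->  [2.0, 1.4, 1.06, 0.854]
```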

Interactive: Memory Decay — Random vs HiPPO

A signal arrives at step 0. Watch how much the hidden state remembers it over time. HiPPO retains far more information.

Why it matters: Before S4, SSMs couldn't match Transformers on tasks requiring long-range dependencies (e.g., understanding context 4000 tokens ago). S4 was the first SSM to achieve competitive performance on the Long Range Arena benchmark, matching or beating Transformers.
Check: What problem does HiPPO initialization solve?

Chapter 4: Selective Scan (Mamba)

S4 has a critical limitation: A, B, C are the same for every input token. The model can't decide to "pay more attention" to important tokens or "forget" irrelevant ones. Mamba (Gu & Dao, 2023) fixes this by making B, C, and Δ functions of the input.

Bₜ = Linear(xₜ)    Cₜ = Linear(xₜ)    Δₜ = softplus(Linear(xₜ))

This is the selective in "selective scan." When the model sees an important token, it can increase Δ (larger step = more input mixed in) and adjust B to write strongly to the state. For irrelevant tokens, it can shrink Δ and effectively skip them.
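A naive sequential version of the selective update can be sketched as follows. This is a simplified single-channel toy: the weights w_B, w_C, w_dt are illustrative stand-ins for the learned linear layers, and B̄ is approximated as Δ·B rather than the full zero-order-hold expression:

```python
import numpy as np

def selective_scan(x, w_B, w_C, w_dt, A_diag):
    """Naive sequential selective scan for a single channel.

    x: (T,) scalar inputs; A_diag: (N,) diagonal of A (negative entries);
    w_B, w_C: (N,) and w_dt: scalar -- stand-ins for learned projections."""
    h = np.zeros(A_diag.shape[0])
    ys = []
    for xt in x:
        dt = np.log1p(np.exp(w_dt * xt))   # softplus: positive step size
        B_t = w_B * xt                     # input-dependent write vector
        C_t = w_C * xt                     # input-dependent read vector
        A_bar = np.exp(dt * A_diag)        # per-token discretized transition
        h = A_bar * h + dt * B_t * xt      # Euler approximation: B_bar ~ dt*B
        ys.append(C_t @ h)
    return np.array(ys)

ys = selective_scan(np.array([1.0, 0.0, 2.0]),
                    w_B=0.5 * np.ones(4), w_C=0.5 * np.ones(4),
                    w_dt=1.0, A_diag=-np.ones(4))
# For x_t = 0 the gates close: B_t = 0, so nothing is written to the state.
```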

Interactive: Selective Gates

Watch how Mamba's input-dependent gates open (green) for important tokens and close (dim) for irrelevant ones. Click tokens to toggle importance.

The breakthrough: Making parameters input-dependent breaks the convolution view — you can no longer precompute a single kernel K. But it makes the model vastly more expressive. Mamba showed that with careful GPU implementation, the recurrence can be nearly as fast as the convolution.
Model | A, B, C                     | Convolution?            | Content-aware?
S4    | Fixed (same for all inputs) | Yes (parallel training) | No
Mamba | Input-dependent             | No (must use scan)      | Yes
Check: What is the key innovation of Mamba over S4?

Chapter 5: Hardware-Aware Scan

Making B, C, Δ input-dependent means we can't use convolution for training. The naive approach — a sequential for-loop — would be unbearably slow on GPUs. Mamba uses a parallel scan algorithm that computes the full recurrence in O(log n) parallel steps instead of O(n) sequential ones.

The Parallel Scan

The key insight: each step of the linear recurrence hₜ = aₜ hₜ₋₁ + bₜ is an affine map h ↦ aₜh + bₜ, and composing affine maps is an associative operation. Like addition, we can rearrange the order in which the maps are combined. Instead of left-to-right, we use a binary tree:

Step 1
Compute pairs: (h0,h1), (h2,h3), (h4,h5), ...
Step 2
Combine pairs: (h0..1,h2..3), (h4..5,h6..7), ...
Step 3
Combine again: (h0..3,h4..7), ...
↓ log2(n) steps total
Done
All h0..n computed in parallel
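The tree above can be implemented as a log-step scan. The sketch below uses a Hillis-Steele style formulation and checks it against the sequential recurrence (the real Mamba kernel is a fused GPU implementation; this shows only the algorithmic idea):

```python
import numpy as np

def scan_sequential(a, b):
    """Reference: h_t = a_t h_{t-1} + b_t, with initial state 0."""
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.array(out)

def scan_parallel(a, b):
    """Hillis-Steele style scan: log2(n) vectorized passes.

    Each position carries an affine map h -> a*h + b; the combine
    (a1, b1) then (a2, b2) -> (a2*a1, a2*b1 + b2) is associative."""
    a, b = a.astype(float).copy(), b.astype(float).copy()
    n, shift = len(a), 1
    while shift < n:
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = a * a_prev, a * b_prev + b   # RHS uses the old a, b
        shift *= 2
    return b   # with zero initial state, the accumulated b equals h_t

rng = np.random.default_rng(1)
a, b = rng.uniform(0.5, 1.0, 8), rng.normal(size=8)
assert np.allclose(scan_parallel(a, b), scan_sequential(a, b))
```

This variant does O(n log n) total work for simplicity; a work-efficient (Blelloch) scan brings total work back to O(n) while keeping O(log n) depth.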
Interactive: Parallel Scan Tree

Click "Step" to advance through the parallel scan. Each level halves the remaining work.

Step 0 / 3: Initial values
GPU optimization: Mamba also keeps the full state in SRAM (fast on-chip memory) and avoids writing intermediate results to slower HBM. This is the same philosophy as Flash Attention — restructure the algorithm to match GPU memory hierarchy.
Check: How does the parallel scan reduce O(n) sequential steps?

Chapter 6: The Mamba Block

A Mamba block looks quite different from a Transformer block. It has no attention mechanism at all. Instead, it uses a combination of linear projections, 1D convolution, and the selective SSM:

Input
x ∈ ℝ^(n×d)
Linear Projection
Expand: d → 2·d_inner (split into two branches)
↓ Branch A      ↓ Branch B
Conv1D (Branch A)
Short causal convolution (kernel size ~4)
Selective SSM
Input-dependent scan with generated B, C, Δ
↓ ⊗ SiLU(Branch B)
Gated Merge
Element-wise multiply with SiLU-activated Branch B
Linear Projection
Project back: d_inner → d
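The data flow above can be sketched in NumPy. Everything here is schematic: `ssm_step` is a placeholder for the selective SSM, and the parameter names and shapes are illustrative:

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def mamba_block(x, params, ssm_step):
    """Schematic Mamba block over x of shape (T, d).

    params["W_in"]: (d, 2*d_inner), params["W_out"]: (d_inner, d),
    params["conv_k"]: (k,) shared causal conv kernel (illustrative)."""
    W_in, W_out, conv_k = params["W_in"], params["W_out"], params["conv_k"]
    T = x.shape[0]
    u = x @ W_in                                  # expand: d -> 2*d_inner
    a, b = np.split(u, 2, axis=-1)                # two branches
    k = len(conv_k)                               # short causal conv, branch A
    a_pad = np.concatenate([np.zeros((k - 1, a.shape[1])), a])
    a_conv = np.stack([a_pad[t:t + k].T @ conv_k for t in range(T)])
    a_ssm = ssm_step(silu(a_conv))                # selective SSM (placeholder)
    y = a_ssm * silu(b)                           # gated merge with branch B
    return y @ W_out                              # project back: d_inner -> d

rng = np.random.default_rng(0)
params = {"W_in": rng.normal(size=(4, 16)),
          "W_out": rng.normal(size=(8, 4)),
          "conv_k": np.ones(4) / 4}
out = mamba_block(rng.normal(size=(5, 4)), params, ssm_step=lambda u: u)
assert out.shape == (5, 4)
```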
Interactive: Mamba Block Data Flow

Click "Step" to trace data through the Mamba block. Watch the two branches split and merge.

Click "Step" to trace the data flow
Why Conv1D? The short convolution provides local context mixing — it lets the model see a few neighboring tokens before the SSM processes the global sequence. It's like a tiny receptive field that helps the SSM know "what's nearby" before deciding how to update the state.
Check: What role does the SiLU-gated branch play?

Chapter 7: Hybrid Models

Pure Mamba models are fast but sometimes struggle with tasks requiring precise retrieval of specific tokens from the past (like "what was the 7th word?"). Pure Transformers are great at retrieval but expensive. The natural solution: combine both.

Key Hybrid Architectures

Model              | Architecture                         | Key Idea
Jamba (AI21)       | Mamba + Attention + MoE              | Alternate Mamba and attention layers; MoE for capacity.
Zamba (Zyphra)     | Mamba + shared Attention             | One shared attention layer interleaved every few Mamba blocks.
Griffin (DeepMind) | Gated linear recurrence + local attn | Recurrence for global context, windowed attention for local.
Mamba-2            | Structured state space duality       | Shows SSMs and attention are dual views of the same operation.
Interactive: Hybrid Architecture Viewer

Select an architecture to see how Mamba and attention layers are interleaved.

Why hybrids work: Mamba excels at long-range, smooth information flow (like understanding overall topic and style). Attention excels at precise, content-based retrieval (like copying a name from 5000 tokens ago). Combining both gives you the best of both worlds with a fraction of the attention cost.
Check: What weakness of pure Mamba models do hybrid architectures address?

Chapter 8: Training & Inference

Mamba models are trained with the exact same objective as Transformers: next-token prediction with cross-entropy loss. The training data, tokenizer, and optimizer are all the same. The difference is entirely in the architecture.

Training: Parallel Mode

During training, the entire sequence is available. The selective SSM uses the parallel scan to process all tokens simultaneously. This is analogous to how Transformers compute all attention weights at once.

Inference: Recurrence Mode

During generation, we switch to sequential recurrence. At each step:

Get token
Receive xₜ from previous prediction
Update state
hₜ = Āₜ hₜ₋₁ + B̄ₜ xₜ — constant time!
Output
yₜ = Cₜ hₜ — predict next token
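The generation loop above amounts to carrying one fixed-size state between steps. A schematic sketch, with a toy `step_fn` standing in for the real model:

```python
def generate(step_fn, h0, x0, n_tokens):
    """step_fn(h, x) -> (h_new, y). Only the fixed-size state h is carried."""
    h, x, outputs = h0, x0, []
    for _ in range(n_tokens):
        h, y = step_fn(h, x)      # O(1) time and memory per token
        outputs.append(y)
        x = y                     # feed the prediction back in
    return outputs

def decay_step(h, x, A_bar=0.9, B_bar=0.1, C=1.0):
    h = A_bar * h + B_bar * x     # h_t = A_bar h_{t-1} + B_bar x_t
    return h, C * h               # y_t = C h_t

outs = generate(decay_step, h0=0.0, x0=1.0, n_tokens=5)
# len(outs) == 5, and the state stays a single scalar throughout.
```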
Interactive: Inference Memory Comparison

Compare memory usage during generation. Transformer's KV cache grows linearly; Mamba's state is constant.

Generated tokens: 100
Constant memory per token: Unlike the Transformer's KV cache, which grows with every generated token, the SSM hidden state is always the same size. At 100K tokens, a Transformer might need 40+ GB for the KV cache. Mamba needs the same fixed state it had at token 1.
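The 40+ GB figure can be checked with back-of-envelope arithmetic. The configuration below is hypothetical (a 7B-class model with 32 layers and d_model = 4096 in fp16), chosen only to illustrate the scaling:

```python
# Hypothetical 7B-class config; fp16 storage for the cache/state only.
layers, d_model, bytes_fp16 = 32, 4096, 2
seq_len = 100_000

# Transformer: keys and values for every past token, in every layer.
kv_cache = 2 * layers * d_model * seq_len * bytes_fp16

# SSM: a fixed state of N values per channel per layer; no seq_len term.
state_n = 16
ssm_state = layers * d_model * state_n * bytes_fp16

print(f"KV cache:  {kv_cache / 1e9:.1f} GB")   # ~52.4 GB
print(f"SSM state: {ssm_state / 1e6:.1f} MB")  # ~4.2 MB
```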
Check: How does Mamba's inference memory scale with sequence length?

Chapter 9: SSM vs Transformer Tradeoffs

Neither architecture is strictly better. Each has fundamental strengths and weaknesses rooted in their core mechanisms.

Dimension             | Transformer                     | SSM / Mamba
Training cost         | O(n²) per layer                 | O(n) per layer
Inference memory      | KV cache grows with n           | Fixed-size state
Inference time/token  | Grows with n (attend to all KV) | Constant
Exact retrieval       | Excellent (direct access)       | Weaker (compressed state)
Long-range context    | Struggles past training length  | Naturally extends
In-context learning   | Strong (induction heads)        | Emerging but less understood
Ecosystem maturity    | Very mature (5+ years)          | Rapidly growing
Hardware optimization | Flash Attention, GQA            | Parallel scan, SRAM-aware
Interactive: Task Suitability Comparison

Select a task to see which architecture is better suited and why.

When to use which: For very long contexts (100K+ tokens, audio, genomics), SSMs shine. For retrieval-heavy tasks (RAG, QA over documents), Transformers are more reliable. For production chat models, hybrids are increasingly the best choice.
"The next generation of sequence models will likely combine the best of both worlds."
— The emerging consensus

You now understand both the Transformer and its most promising challenger. From continuous-time state spaces to selective scans, from HiPPO to hardware-aware algorithms — this is the frontier of sequence modeling.