State Space Models
SSM / Mamba
Linear-time sequence modeling — the O(n) challenger to the transformer throne
What Is a State Space Model?
A State Space Model (SSM) is a sequence model rooted in continuous-time dynamical systems. Instead of attending over every previous token (like a transformer), it maintains a compact hidden state vector that gets updated at each time step. The governing equations come straight from control theory: x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t). Discretize these for discrete tokens and you get a linear recurrence that can process sequences in O(n) time instead of O(n²).
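The continuous equations and their discrete counterpart can be sketched in a few lines of NumPy. This is a toy scalar-input SSM with illustrative sizes (state dimension N=4, step size dt=0.1 are arbitrary choices, not values from any real model); it shows zero-order-hold discretization followed by the O(n) linear recurrence:

```python
import numpy as np

# Toy SSM with illustrative sizes: state dim N = 4, scalar input/output.
N = 4
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 1.5, N))   # stable diagonal A (illustrative)
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
D = np.zeros((1, 1))
dt = 0.1                                  # discretization step size

# Zero-order hold (ZOH): A_bar = exp(dt*A), B_bar = A^-1 (A_bar - I) B
A_bar = np.diag(np.exp(dt * np.diag(A)))
B_bar = np.linalg.inv(A) @ (A_bar - np.eye(N)) @ B

def ssm_scan(u):
    """Linear recurrence x_k = A_bar x_{k-1} + B_bar u_k;  y_k = C x_k + D u_k."""
    x = np.zeros((N, 1))
    ys = []
    for u_k in u:                         # one pass over the sequence: O(n)
        x = A_bar @ x + B_bar * u_k
        ys.append((C @ x + D * u_k).item())
    return ys

y = ssm_scan([1.0, 0.0, 0.0, 0.0])        # impulse response of the system
```

Note that the memory cost is the state x alone, independent of how many tokens have been processed.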
The breakthrough from S4 (Gu et al., 2021) showed that structuring the A matrix as a diagonal plus low-rank (HiPPO initialization) lets the model remember long-range dependencies. But S4's parameters were fixed — the same A, B, C for every input, making it a linear time-invariant (LTI) system. This limited its ability to do content-based reasoning.
Mamba (Gu & Dao, 2023) solved this with a selective scan: the step size Δ and the B and C matrices become functions of the input (and the discretized Ā inherits that input dependence through Δ), making the model time-varying. Combined with a hardware-aware parallel scan algorithm and a simplified block design, Mamba matches or beats transformers of similar size, especially on long sequences, while using linear time and constant memory per generated token.
Architecture Diagram
Discrete (ZOH): x_k = Ā·x_{k−1} + B̄·u_k,  y_k = C·x_k + D·u_k
Mamba twist: Δ, B, C = f(input), making the scan selective (input-dependent)
Core Mechanisms
Selective Scan
Unlike LTI models with fixed dynamics, Mamba computes the step size Δ and the B and C matrices as projections of the input. This lets the model decide what to remember and what to forget at each step: content-based gating, analogous to how attention selects relevant tokens.
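A minimal NumPy sketch of that selectivity, with hypothetical shapes and a simplified discretization (ZOH for A, an Euler-style step for B; all projection matrices and sizes here are illustrative, not the real Mamba parameterization):

```python
import numpy as np

# Per-token dt, B, C are linear projections of the input,
# so the recurrence dynamics vary with content.
d, N, L = 2, 3, 5                         # model dim, state dim, sequence length
rng = np.random.default_rng(1)
A = -np.exp(rng.normal(size=(d, N)))      # fixed, negative (log-parameterized) A
W_dt = rng.normal(size=(d, d)) * 0.1      # projections that make the scan selective
W_B  = rng.normal(size=(d, N)) * 0.1
W_C  = rng.normal(size=(d, N)) * 0.1

def selective_scan(u):                    # u: (L, d) token features
    x = np.zeros((d, N))                  # one N-dim state per channel
    ys = np.zeros((L, d))
    for k in range(L):
        dt  = np.log1p(np.exp(u[k] @ W_dt))   # softplus -> positive step size
        B_k = u[k] @ W_B                       # input-dependent B, shape (N,)
        C_k = u[k] @ W_C                       # input-dependent C, shape (N,)
        A_bar = np.exp(dt[:, None] * A)        # per-token discretized A
        x = A_bar * x + (dt[:, None] * B_k[None, :]) * u[k][:, None]
        ys[k] = x @ C_k
    return ys

ys = selective_scan(rng.normal(size=(L, d)))
```

A large dt lets a channel overwrite its state with the current token; a small dt makes it ignore the token and hold its memory, which is exactly the forget/remember gating described above.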
Linear Recurrence
Sequential mode: process one token at a time with O(1) memory per step — ideal for inference. Parallel mode: use a prefix-sum (scan) to process all tokens at once during training, achieving near-transformer throughput on GPUs.
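The reason the recurrence parallelizes is that composing two linear-recurrence steps is associative, so a prefix scan applies. A small sketch (the fold below is sequential for clarity; on a GPU the same combine operator runs as a tree-structured scan in O(log n) depth):

```python
import numpy as np

# A first-order linear recurrence x_k = a_k*x_{k-1} + b_k composes
# associatively: step (a1,b1) followed by (a2,b2) is (a1*a2, a2*b1 + b2).
def combine(s1, s2):
    a1, b1 = s1
    a2, b2 = s2
    return (a1 * a2, a2 * b1 + b2)

def scan_states(a, b):
    """Inclusive scan over (a_k, b_k) pairs; returns every state x_k (x_0 = 0)."""
    out, acc = [], (1.0, 0.0)
    for step in zip(a, b):
        acc = combine(acc, step)          # associative -> tree-parallelizable
        out.append(acc[1])
    return out

a, b = [0.5, 0.9, 0.1], [1.0, 2.0, 3.0]
seq, x = [], 0.0
for a_k, b_k in zip(a, b):                # sequential reference recurrence
    x = a_k * x + b_k
    seq.append(x)
assert np.allclose(scan_states(a, b), seq)
```

The scan and the sequential loop compute identical states, which is why training (parallel) and inference (sequential) can share the same parameters.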
Hardware-Aware Algorithm
Mamba fuses the discretization, selective scan, and output multiplication into a single GPU kernel. The state is kept in fast SRAM (not HBM), minimizing memory I/O. This is why Mamba is 3–5x faster than a naive scan implementation.
Mamba vs Transformer
Transformer (Attention): attends over every previous token; O(n²) compute in sequence length; KV cache grows with context; strong at exact recall.
Mamba (SSM): compresses history into a fixed-size state; O(n) compute; constant memory per generated token; weaker at exact recall over long spans.
Hybrid Architectures
Pure Mamba excels at long-context efficiency but can struggle with tasks requiring exact recall over long spans (e.g., "find the phone number mentioned 10k tokens ago"). Attention is great at recall but expensive. The solution: combine both.
Jamba
52B MoE (12B active). Alternates Mamba layers with Transformer attention layers in a ratio of ~7:1. 256k context window. Gets SSM throughput with attention-level recall.
Zamba
7B parameters. Mamba backbone with a shared attention layer injected every N blocks. The single attention module is reused, keeping parameter count low while gaining recall ability.
Best of Both Worlds
Mamba handles the long-range bulk — summarizing, compressing, routing information through its state. Attention handles precision recall — exact copying, lookup, associative memory. Together they cover all bases.
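As a concrete sketch, the interleaving can be expressed as a layer schedule. The function name and the default of one attention layer per eight total layers are illustrative, following the ~7:1 Jamba-style ratio described above:

```python
# Hypothetical layer schedule for a Jamba-style hybrid:
# one attention layer for every seven Mamba layers.
def hybrid_schedule(n_layers, attn_every=8):
    return ["attention" if i % attn_every == attn_every - 1 else "mamba"
            for i in range(n_layers)]

layers = hybrid_schedule(16)
# -> 14 "mamba" layers and 2 "attention" layers, interleaved 7:1
```

Because attention layers are the minority, overall compute and memory stay close to a pure SSM while the few attention layers supply exact-recall capability.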
Training
Next-Token Prediction
Standard autoregressive language modeling: predict the next token, cross-entropy loss, Adam optimizer. Mamba uses the same training objective and data pipeline as GPT-style models.
LoRA & Adapters
LoRA works on SSM projection matrices just like on attention projections. No architectural changes needed. Same PEFT tooling applies.
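A minimal sketch of what LoRA does to one such projection (pure NumPy; the dimensions and the idea of attaching the adapter to an SSM input projection are illustrative, not a specific library's API):

```python
import numpy as np

# LoRA on a frozen projection W: learn a low-rank update B @ A so that
# only r * (d_in + d_out) parameters train instead of d_in * d_out.
d_in, d_out, r = 8, 16, 2
rng = np.random.default_rng(3)
W = rng.normal(size=(d_out, d_in))          # frozen pretrained projection
A_lora = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B_lora = np.zeros((d_out, r))               # zero init: delta starts at zero

def lora_proj(x):
    return W @ x + B_lora @ (A_lora @ x)    # base output + low-rank delta

x = rng.normal(size=d_in)
assert np.allclose(lora_proj(x), W @ x)     # identical to base before training
```

Nothing here depends on the projection feeding attention rather than an SSM block, which is why the same PEFT tooling carries over.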
Parallel Scan in Training
During training, the recurrence is unrolled as a parallel prefix scan across the sequence length. This makes training throughput competitive with transformers on modern GPUs.
Inference
This is where Mamba truly shines. At inference time, the model operates in sequential recurrence mode: each new token updates a fixed-size state vector and produces an output. There is no KV cache that grows with sequence length.
O(d·N) Memory per Token
The state is a d×N matrix, where d is the model dimension and N is the state expansion factor (often 16). Its size is fixed regardless of sequence length: generating the 100,000th token uses the same memory as the first.
O(1) per Step
No quadratic attention computation per step: each new token requires only a few small matrix and elementwise operations on the fixed-size state. This makes Mamba ideal for very long generation tasks (coding, document synthesis, chat with long history).
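A constant-memory decoding loop, sketched with illustrative sizes (and a fixed, non-selective transition for brevity): the state is updated in place, so the footprint is identical at step 1 and step 1,000:

```python
import numpy as np

# Constant-memory decoding sketch: each step consumes one token and updates
# a fixed (d, N) state in place -- no cache that grows with the sequence.
d, N = 2, 4
rng = np.random.default_rng(2)
A_bar = np.exp(-rng.uniform(0.1, 1.0, size=(d, N)))  # fixed decays, illustrative
B_bar = rng.normal(size=(d, N)) * 0.1
C = rng.normal(size=(N,))

def step(state, u_k):
    """One decode step: same cost whether it is token 1 or token 100,000."""
    state = A_bar * state + B_bar * u_k[:, None]
    return state, state @ C               # new (d, N) state, (d,) output

state = np.zeros((d, N))
for _ in range(1000):                     # memory footprint never grows
    state, y = step(state, rng.normal(size=d))
```

Contrast this with attention, where each generated token appends a key/value pair and the cache grows linearly with the history.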
Model Zoo
Mamba-2
Improved selective scan built on SSD (structured state space duality), which connects SSMs and attention; faster training kernel.
Jamba
AI21. Mamba + Attention hybrid. 256k context. First production SSM hybrid.
Zamba
Zyphra. Mamba backbone with shared attention block. Efficient and competitive.
Falcon Mamba
TII. Pure Mamba architecture trained on high-quality data. Strong benchmarks for size.