Architecture Atlas · 02

State Space Models
SSM / Mamba

Linear-time sequence modeling — the O(n) challenger to the transformer throne

Year: 2023
Creators: Gu & Dao
Category: Sequence Model
Complexity: O(n·d)

What Is a State Space Model?

A State Space Model (SSM) is a sequence model rooted in continuous-time dynamical systems. Instead of attending over every previous token (like a transformer), it maintains a compact hidden state vector that is updated at each time step. The governing equations come straight from control theory: x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t). Discretize them for token sequences and you get a linear recurrence that processes a length-n sequence in O(n) time instead of O(n²).
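The discretized recurrence fits in a few lines of numpy. This is a minimal sketch with a diagonal A; the sizes, values, and step size below are illustrative, not taken from any trained model:

```python
import numpy as np

# Single-channel SSM with diagonal A, discretized by zero-order hold (ZOH):
# A_bar = exp(dt*A), B_bar = (A_bar - I) * A^-1 * B (elementwise for diagonal A).
# All shapes and values here are illustrative.
N = 4                        # state size
A = -np.arange(1.0, N + 1)   # diagonal of A (negative: stable dynamics)
B = np.ones(N)
C = np.ones(N) / N
D = 0.0
dt = 0.1                     # step size (delta)

A_bar = np.exp(dt * A)                # ZOH discretization of A
B_bar = (A_bar - 1.0) / A * B         # elementwise, valid for diagonal A

def ssm_scan(u):
    """Linear recurrence: x_k = A_bar*x_{k-1} + B_bar*u_k, y_k = C.x_k + D*u_k."""
    x = np.zeros(N)
    ys = []
    for u_k in u:
        x = A_bar * x + B_bar * u_k   # O(N) work per token, O(N) memory total
        ys.append(C @ x + D * u_k)
    return np.array(ys)

y = ssm_scan(np.ones(8))              # O(n) in sequence length
```

For a constant input the output rises monotonically toward a steady state, exactly the behavior of a stable linear filter.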

The breakthrough from S4 (Gu et al., 2021) showed that structuring the A matrix as a diagonal plus low-rank (HiPPO initialization) lets the model remember long-range dependencies. But S4's parameters were fixed — the same A, B, C for every input, making it a linear time-invariant (LTI) system. This limited its ability to do content-based reasoning.

Mamba (Gu & Dao, 2023) solved this with selective scan: the step size Δ and the B and C matrices become functions of the input, making the model input-dependent (time-varying). A itself stays fixed, but it enters through Ā = exp(Δ·A), which therefore also varies per token. Combined with a hardware-aware parallel scan algorithm and a simplified block design, Mamba matches or beats transformers of similar size — especially on long sequences — while using linear time and constant memory per step.


Architecture Diagram

Mamba Block — Internal Flow
◇ State Space Equations
Continuous:  x'(t) = A·x(t) + B·u(t)  |  y(t) = C·x(t) + D·u(t)
Discrete (ZOH):  xₖ = Ā·xₖ₋₁ + B̄·uₖ  |  yₖ = C·xₖ + D·uₖ
Mamba twist:  Δ, B, C = f(input) — selective (input-dependent)

Core Mechanisms

Key Innovation

Selective Scan

Unlike LTI models with fixed dynamics, Mamba computes the step size Δ and the B and C matrices as projections of the input. This lets the model decide what to remember and what to forget at each step: content-based gating, analogous to how attention selects relevant tokens.
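A toy selective scan in numpy illustrates the idea. The projection weights, sizes, and exact discretization below are illustrative stand-ins, not the trained Mamba parameterization:

```python
import numpy as np

# Toy selective scan: delta, B, C are computed from the input at every step,
# so the recurrence is time-varying. Shapes follow the Mamba paper loosely;
# weights are random placeholders, purely for illustration.
rng = np.random.default_rng(0)
d, N, n = 2, 4, 6                       # model dim, state size, seq len
A = -np.exp(rng.normal(size=(d, N)))    # diagonal A, negative for stability
W_dt = rng.normal(size=(d, d)) * 0.1    # input -> per-channel step size delta
W_B = rng.normal(size=(d, N)) * 0.1     # input -> B_k
W_C = rng.normal(size=(d, N)) * 0.1     # input -> C_k

def selective_scan(u):                  # u: (n, d)
    x = np.zeros((d, N))
    ys = np.zeros((n, d))
    for k, u_k in enumerate(u):
        dt = np.log1p(np.exp(u_k @ W_dt))   # softplus keeps delta > 0
        B_k = u_k @ W_B                      # content-based: what to write
        C_k = u_k @ W_C                      # content-based: what to read
        A_bar = np.exp(dt[:, None] * A)      # effective dynamics vary per token
        x = A_bar * x + dt[:, None] * B_k[None, :] * u_k[:, None]
        ys[k] = x @ C_k
    return ys

y = selective_scan(rng.normal(size=(n, d)))
```

Because Δ, B, and C depend on u_k, each token modulates how strongly it is written into the state and how the state is read out, which is the gating behavior the card describes.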

Dual Mode

Linear Recurrence

Sequential mode: process one token at a time with O(1) memory per step — ideal for inference. Parallel mode: use a prefix-sum (scan) to process all tokens at once during training, achieving near-transformer throughput on GPUs.
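Both modes compute the same thing. Here is a toy check for a scalar recurrence; the closed-form cumprod version stands in for a real log-depth parallel scan, with made-up inputs:

```python
import numpy as np

# The diagonal recurrence x_k = a_k*x_{k-1} + b_k evaluated two ways:
# sequentially (inference mode) and via cumulative products (a stand-in
# for the parallel prefix scan used during training).
rng = np.random.default_rng(1)
n = 16
a = rng.uniform(0.5, 1.0, size=n)   # per-step decay (input-dependent in Mamba)
b = rng.normal(size=n)              # per-step input contribution

def sequential(a, b):               # O(n) steps, O(1) state: inference mode
    x, xs = 0.0, []
    for a_k, b_k in zip(a, b):
        x = a_k * x + b_k
        xs.append(x)
    return np.array(xs)

def scan_form(a, b):                # x_k = sum_j (prod_{i=j+1..k} a_i) * b_j
    P = np.cumprod(a)               # P_k = a_1 * ... * a_k
    return P * np.cumsum(b / P)     # (P_k / P_j) * b_j summed over j <= k

x_seq = sequential(a, b)
x_par = scan_form(a, b)
assert np.allclose(x_seq, x_par)    # both modes agree
```

Note the division by P can underflow for long sequences with strong decay; production kernels use a numerically stable blocked or log-space scan rather than this naive closed form.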

Performance

Hardware-Aware Algorithm

Mamba fuses the discretization, selective scan, and output multiplication into a single GPU kernel. The state is kept in fast SRAM (not HBM), minimizing memory I/O. This is why Mamba is 3–5x faster than a naive scan implementation.

Selective Scan — Input-Dependent Gating

Mamba vs Transformer

Transformer (Attention)

Time: O(n²·d)
Memory: O(n²), or O(n) with FlashAttention
Context: global (full pairwise)
Mode: fully parallel
KV cache: grows with sequence length
Recall: excellent

Mamba (SSM)

Time: O(n·d)
Memory: O(d) per step
Context: selective (compressed)
Mode: sequential + parallel scan
State: fixed-size vector
Recall: good (not exact)
Compute Scaling — Attention vs SSM

Hybrid Architectures

Pure Mamba excels at long-context efficiency but can struggle with tasks requiring exact recall over long spans (e.g., "find the phone number mentioned 10k tokens ago"). Attention is great at recall but expensive. The solution: combine both.

AI21 · 2024

Jamba

52B MoE (12B active). Alternates Mamba layers with Transformer attention layers in a ratio of ~7:1. 256k context window. Gets SSM throughput with attention-level recall.

Zyphra · 2024

Zamba

7B parameters. Mamba backbone with a shared attention layer injected every N blocks. The single attention module is reused, keeping parameter count low while gaining recall ability.

Why Hybrids Work

Best of Both Worlds

Mamba handles the long-range bulk — summarizing, compressing, routing information through its state. Attention handles precision recall — exact copying, lookup, associative memory. Together they cover all bases.


Training

Same As Transformers

Next-Token Prediction

Standard autoregressive language modeling: predict the next token, cross-entropy loss, Adam optimizer. Mamba uses the same training objective and data pipeline as GPT-style models.

Fine-Tuning

LoRA & Adapters

LoRA works on SSM projection matrices just like on attention projections. No architectural changes needed. Same PEFT tooling applies.
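As a sketch of the idea (not the PEFT library API): LoRA adds a trainable low-rank update alongside a frozen projection, and nothing about this cares whether the projection feeds attention or an SSM. Shapes and rank below are illustrative:

```python
import numpy as np

# LoRA on a single projection matrix, as it would apply to an SSM input
# projection just as to an attention projection. Sizes are illustrative.
rng = np.random.default_rng(2)
d_in, d_out, r = 512, 1024, 8
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # B starts at zero: update is a no-op

def forward(x):                          # x: (d_in,)
    return W @ x + B @ (A @ x)          # frozen path + low-rank update

x = rng.normal(size=d_in)
assert np.allclose(forward(x), W @ x)   # at init, output matches the base model
trainable = A.size + B.size             # 8*512 + 1024*8 = 12,288
full = W.size                           # 1024*512 = 524,288
```

Only A and B are trained: about 2% of the parameters of the full matrix in this toy configuration.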

Key Difference

Parallel Scan in Training

During training, the recurrence is unrolled as a parallel prefix scan across the sequence length. This makes training throughput competitive with transformers on modern GPUs.
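The reason a prefix scan applies at all: each step x ↦ a·x + b can be packaged as the pair (a, b), and composing such pairs is associative, which is exactly the property a parallel scan needs. A minimal check with hypothetical values:

```python
import numpy as np

def combine(f, g):
    """Apply step f, then step g: composes x -> a*x + b maps."""
    (a1, b1), (a2, b2) = f, g
    return (a1 * a2, a2 * b1 + b2)

rng = np.random.default_rng(3)
steps = [tuple(rng.normal(size=2)) for _ in range(3)]
f, g, h = steps

# Associativity: the property that lets a prefix scan run in log depth.
left = combine(combine(f, g), h)
right = combine(f, combine(g, h))
assert np.allclose(left, right)

# Folding the pairs reproduces the sequential recurrence from x_0 = 0.
x, acc = 0.0, (1.0, 0.0)                 # (1, 0) is the identity step
for step in steps:
    acc = combine(acc, step)
    x = step[0] * x + step[1]
    assert np.isclose(acc[0] * 0.0 + acc[1], x)
```

Frameworks expose exactly this pattern (e.g. an associative-scan primitive); Mamba's kernel fuses it with the discretization for speed.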


Inference

This is where Mamba truly shines. At inference time, the model operates in sequential recurrence mode: each new token updates a fixed-size state vector and produces an output. There is no KV cache that grows with sequence length.

Constant Memory

O(d) per Token

The state has size d×N, where d is the model dimension and N is the state expansion factor (often 16). This is fixed regardless of sequence length — generating the 100,000th token uses the same memory as the first.

Linear Generation

O(1) per Step

No quadratic attention computation per step. Each token is a single matrix-vector multiply through the state. This makes Mamba ideal for very long generation tasks (coding, document synthesis, chat with long history).
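A toy generation loop, with made-up sizes and random weights standing in for a trained model, shows the constant-memory property directly:

```python
import numpy as np

# Inference-mode sketch: the fixed-size state is the entire "cache".
# Memory does not grow with position. All sizes/weights are illustrative.
rng = np.random.default_rng(4)
d, N = 8, 16                               # model dim, state expansion
A_bar = np.exp(-rng.uniform(0.1, 1.0, size=(d, N)))   # contractive dynamics
B_bar = rng.normal(size=(d, N)) * 0.1
C = rng.normal(size=(d, N)) * 0.1

def step(x, u):
    """One generation step: O(d*N) work, no attention over past tokens."""
    x = A_bar * x + B_bar * u[:, None]
    return x, (C * x).sum(-1)

x = np.zeros((d, N))                       # the whole recurrent memory: d*N floats
for t in range(1000):                      # token 1 and token 1000 cost the same
    u = rng.normal(size=d)
    x, y = step(x, u)
```

Contrast with a transformer, where each generated token appends a new key/value pair and attends over everything before it.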

State Evolution Over Sequence

Model Zoo

Mamba-2
2.8B params

Improved selective scan (SSD), structured state space duality. Faster training kernel.

Jamba
52B MoE (12B active)

AI21. Mamba + Attention hybrid. 256k context. First production SSM hybrid.

Zamba
7B params

Zyphra. Mamba backbone with shared attention block. Efficient and competitive.

Falcon Mamba
7B params

TII. Pure Mamba architecture trained on high-quality data. Strong benchmarks for size.


Connections