microMamba — State Space Models & Beyond Attention

Chapter 1: Linear Recurrences

The simplest alternative to attention is a linear recurrence. At each step, we maintain a hidden state h and update it:

h_t = A · h_t−1 + B · x_t

The matrix A controls how the previous state decays and mixes (it's the "memory" operator). The matrix B controls how much of the new input is written into the state. The output y_t = C · h_t reads a projection of the state. Processing each token costs constant time: one matrix-vector product for A·h, one for B·x, one for C·h. No matter how long the sequence, each new token costs the same.

If the state has dimension N and the input has dimension D, the per-step cost is O(N² + ND) for the matrix multiplications. Compare this to attention's O(L · D) per token at generation (where L is the sequence length so far). For long sequences, the SSM wins because N is fixed (typically 16–64), while L keeps growing.
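To make the constant per-token cost concrete, here is a minimal sketch of the recurrence with a scalar state (N = D = 1). The function name and the values a = 0.9, b = 0.5 are illustrative assumptions, not part of any library.

```python
def linear_recurrence(xs, a=0.9, b=0.5, c=1.0, h0=0.0):
    """Scalar linear recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x      # same amount of work for every token
        ys.append(c * h)
    return ys, h

# An impulse at step 0 is written with weight b, then decays by a factor a per step.
ys, h_final = linear_recurrence([1.0, 0.0, 0.0, 0.0])
print(ys)
```

The loop body never looks at earlier inputs, only at the current state, which is why the cost per token does not depend on the position in the sequence.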

Interactive: Linear Recurrence

Watch the hidden state h evolve as input tokens arrive. A controls decay (memory); B controls input mixing.


The problem: A and B are fixed for every input. The same transformation is applied regardless of whether the current token is important or irrelevant. An SSM with fixed parameters is like reading a book with the same level of attention on every word — you can't speed-read the boring parts or slow down for key details.

Approach	Cost/Token	Total (n tokens)	Can Select?
Self-Attention	O(n)	O(n²)	Yes (dynamic)
Linear Recurrence	O(1)	O(n)	No (fixed A, B)

Check: What is the main limitation of a linear recurrence with fixed A and B?

It applies the same transformation regardless of the input content

It's too slow to be practical

It can't process sequences at all

Chapter 2: The State Space Model

A state space model (SSM) comes from control theory. It starts in continuous time:

h'(t) = A h(t) + B x(t)
y(t) = C h(t) + D x(t)

This is a differential equation: h' is the derivative of the hidden state. To use it on discrete sequences (like text), we discretize it using a step size Δ. The most common method (zero-order hold) gives:

Ā = exp(ΔA)
B̄ = (ΔA)⁻¹(Ā − I) · ΔB

Continuous

h'(t) = Ah(t) + Bx(t) — differential equation

↓ discretize with step Δ

Discrete

h_t = Ā h_t-1 + B̄ x_t — recurrence

↓ unroll

Convolution

y = K * x where K_i = C Āⁱ B̄ — parallel!

Interactive: Discretization

A continuous signal (teal) is sampled at discrete steps. Smaller Δ = more faithful representation but more steps.


Why O(L), Not O(L²)?

The recurrence h_k = Ā h_k-1 + B̄ x_k processes each token in constant time — one matrix-vector multiply, regardless of how long the sequence is. For a sequence of length L, the total cost is O(L). Attention, by contrast, computes pairwise scores between all L tokens: O(L²). The difference is enormous at scale.

Unrolling the recurrence reveals something beautiful. Substituting repeatedly:

y_k = C · (Ā^k B̄ x₀ + Ā^{k−1} B̄ x₁ + … + B̄ x_k) = ∑_{j=0}^{k} (C Ā^{k−j} B̄) · x_j

This is a convolution: y = K * x where the kernel K_i = C Āⁱ B̄. When A, B, C are fixed, this kernel is precomputable. You convolve it with the input using FFT in O(L log L). Training uses this parallel path; inference uses the sequential recurrence for constant memory.

The dual view: An SSM can be computed as either a recurrence (sequential, O(n)) or a convolution (parallel, O(n log n) via FFT). Training uses convolution mode for parallelism. Inference uses recurrence mode for constant memory per token.
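The equivalence of the two modes can be checked numerically. The sketch below assumes a single input channel, a small diagonal Ā with entries below 1, and random B̄ and C; all names and sizes are illustrative choices, and NumPy's FFT stands in for whatever a real implementation would use.

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, x):
    """Sequential mode: h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k, with zero initial state."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_k in x:
        h = A_bar @ h + B_bar * x_k
        ys.append(C @ h)
    return np.array(ys)

def ssm_kernel(A_bar, B_bar, C, L):
    """Kernel K_i = C A_bar^i B_bar for i = 0..L-1."""
    K = np.empty(L)
    v = B_bar.copy()              # holds A_bar^i B_bar
    for i in range(L):
        K[i] = C @ v
        v = A_bar @ v
    return K

def ssm_convolutional(A_bar, B_bar, C, x):
    """Parallel mode: causal convolution y = K * x computed with FFTs."""
    L = len(x)
    K = ssm_kernel(A_bar, B_bar, C, L)
    n = 2 * L                     # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(x, n), n)
    return y[:L]

rng = np.random.default_rng(0)
N, L = 4, 32
A_bar = np.diag(rng.uniform(0.5, 0.95, N))   # stable toy transition matrix
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
x = rng.normal(size=L)

y_rec = ssm_recurrent(A_bar, B_bar, C, x)
y_conv = ssm_convolutional(A_bar, B_bar, C, x)
print("max abs difference:", np.max(np.abs(y_rec - y_conv)))
```

The kernel is built once and reused for any input of length L, which is only possible because Ā, B̄, and C do not depend on x.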

Check: Why start with a continuous-time formulation?

Because neural networks are analog

Because it's simpler to implement

It provides principled discretization and connects to control theory's long-range modeling

🔨 Derivation: Zero-Order Hold Discretization

Given the continuous SSM: h'(t) = A h(t) + B x(t), and the assumption that x(t) is constant over each interval [kΔ, (k+1)Δ] (the zero-order hold assumption).

Your task: Derive the discrete matrices Ā = exp(ΔA) and B̄ = (exp(ΔA) − I) A⁻¹ B from first principles. Why does B̄ have that form and not simply ΔB?

The general solution to a linear ODE h'(t) = Ah(t) + Bu (with u constant) is h(t) = exp(At) h(0) + ∫₀^t exp(A(t−s)) B u ds. This is the variation-of-constants formula from ODE theory.

When u is constant over [0, Δ], the integral becomes: ∫₀^Δ exp(A(Δ−s)) ds · B · u. Substitute τ = Δ−s to get ∫₀^Δ exp(Aτ) dτ · B · u.

∫₀^Δ exp(Aτ) dτ = A⁻¹(exp(AΔ) − I), assuming A is invertible. This is the matrix analog of ∫₀^Δ e^{aτ} dτ = (e^{aΔ} − 1)/a for scalars.

Full derivation:

Start with h'(t) = Ah(t) + Bx(t). Over interval [kΔ, (k+1)Δ], assume x(t) = x_k (constant). The general solution at time (k+1)Δ given h(kΔ) = h_k is:

h_{k+1} = exp(AΔ) h_k + [∫₀^Δ exp(Aτ) dτ] B x_k

Evaluating the integral: ∫₀^Δ exp(Aτ) dτ = A⁻¹(exp(AΔ) − I)

Therefore: Ā = exp(ΔA) and B̄ = A⁻¹(exp(ΔA) − I) B = (ΔA)⁻¹(Ā − I) · ΔB

The key insight: B̄ ≠ ΔB because the input isn't applied instantaneously — it's integrated over the full time step while the state is simultaneously evolving under A. The factor A⁻¹(exp(AΔ) − I) captures this interaction. For small Δ, it approaches ΔI (recovering the Euler approximation), but for larger Δ the correction matters significantly.
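The gap between the exact ZOH form and the Euler approximation is easy to see in the scalar case. This is a minimal sketch with hypothetical values a = −1, b = 1; the function names are illustrative.

```python
import math

def zoh_scalar(a, b, delta):
    """Zero-order-hold discretization of the scalar system h' = a*h + b*x (a != 0)."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b      # scalar case of A^{-1}(exp(ΔA) - I) B
    return a_bar, b_bar

def euler_scalar(a, b, delta):
    """First-order (Euler) approximation: a_bar ≈ 1 + Δa, b_bar ≈ Δb."""
    return 1.0 + delta * a, delta * b

a, b = -1.0, 1.0
for delta in (0.01, 0.1, 1.0):
    _, b_zoh = zoh_scalar(a, b, delta)
    _, b_euler = euler_scalar(a, b, delta)
    print(f"delta={delta}: ZOH b_bar={b_zoh:.4f}, Euler b_bar={b_euler:.4f}")
```

For small Δ the two values nearly coincide; at Δ = 1 the Euler form noticeably overestimates how much input is written.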

🔗 Pattern Recognition

SSM = Kalman Filter Without the Noise Model

This Lesson (SSM)

h_t = Ā h_t−1 + B̄ x_t
y_t = C h_t

Kalman Filter

x̂_{t|t−1} = F x̂_{t−1} + B u_t
ẑ_t = H x̂_{t|t−1}

Both are linear state space models with identical structure: a transition matrix evolves a hidden state, and an observation matrix reads it out. The Kalman filter adds noise covariance tracking (Q, R) and optimal gain computation (K). An SSM in a neural network learns its matrices from data instead of deriving them from physics. But the core operation — compress history into a fixed-size state, update linearly — is identical.

If the SSM learned Q and R matrices and computed a Kalman gain, what would that correspond to in the neural network? (Answer: an attention-like weighting mechanism over the state update.)

Checkpoint — Before you move on

Explain in your own words: why does the discrete SSM have TWO computation paths (recurrence and convolution), and why do we use different paths for training vs inference?


Model Answer

When A, B, C are fixed (not input-dependent), the recurrence h_k = Āh_k−1 + B̄x_k can be unrolled into a convolution y = K * x where K_i = CĀⁱB̄. This kernel K is precomputable because it doesn't depend on the input.

Training uses convolution mode (O(n log n) via FFT) because GPUs excel at parallel computation — processing all tokens simultaneously is vastly faster than a sequential loop.

Inference uses recurrence mode because we generate one token at a time anyway. The recurrence gives O(1) per token with O(N) state memory — no need to recompute over the entire history. The KV-cache-free nature of SSMs comes directly from this recurrence view.

This dual view breaks the moment the parameters become input-dependent (Mamba makes B, C, and Δ functions of the input, so Ā and B̄ change at every step), because you can no longer precompute K. That's why Mamba needs the parallel scan instead of FFT convolution.

Chapter 3: The S4 Trick

The breakthrough of S4 (the Structured State Space sequence model) was solving the long-range memory problem. Previous SSMs forgot quickly: information from 1000 steps ago had decayed to nearly zero. S4 fixed this with two key ideas:

1. HiPPO Initialization

Instead of random initialization, the A matrix is set to the HiPPO (High-Order Polynomial Projection Operator) matrix. This special matrix is designed to optimally compress the history of a continuous signal into a fixed-size state. Each element of h_t stores a coefficient of a polynomial approximation of the entire input history.

A_nk = −(2n+1)^{1/2} (2k+1)^{1/2} if n > k
A_nk = −(n+1) if n = k
A_nk = 0 if n < k

2. Diagonal + Low-Rank Structure

Computing Ā = exp(ΔA) naively for an N×N matrix costs O(N³). S4 decomposes A into diagonal + low-rank form, reducing this to O(N). This makes training practical for state sizes of 64 or more.

Interactive: Memory Decay — Random vs HiPPO

A signal arrives at step 0. Watch how much the hidden state remembers it over time. HiPPO retains far more information.

What S4 Actually Computes

S4 decomposes A into Diagonal Plus Low-Rank (DPLR) form: A = Λ − PQ*, where Λ is diagonal and P, Q are low-rank matrices. This is critical because computing exp(ΔA) for a general N×N matrix costs O(N³). With DPLR structure, it reduces to O(N). The convolution kernel K can then be computed via a Cauchy kernel formula, making the entire training step O(L log L) via FFT.

In subsequent work, S4D (diagonal state spaces) simplified this further: just use a diagonal A matrix, initialized from the HiPPO eigenvalues. This drops the low-rank components entirely, making implementation much simpler while retaining most of the performance. Mamba uses diagonal A by default.

The HiPPO matrix specifically captures a polynomial projection: each element of the state h_t stores a Legendre polynomial coefficient of the input history. Element h_t[k] approximates the k-th Legendre coefficient of x(s) over [0, t]. This means the state literally stores a polynomial approximation of everything it has seen — a lossless-ish compression of the entire history into a fixed-size vector.
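As a concrete reference, the HiPPO matrix from the formula earlier in this chapter can be built in a few lines. This sketch uses plain Python lists and the negative sign convention of that formula; the function name is an illustrative choice.

```python
import math

def hippo_legs(N):
    """Build the N x N HiPPO-LegS matrix (negative sign convention, lower-triangular)."""
    A = [[0.0] * N for _ in range(N)]
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n][k] = -math.sqrt(2 * n + 1) * math.sqrt(2 * k + 1)
            elif n == k:
                A[n][k] = -float(n + 1)
            # entries above the diagonal (n < k) stay zero
    return A

for row in hippo_legs(4):
    print(["%7.3f" % v for v in row])
```

The zeros above the diagonal reflect that coefficient n is only coupled to coefficients k ≤ n.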

Why it matters: Before S4, SSMs couldn't match Transformers on tasks requiring long-range dependencies (e.g., understanding context 4000 tokens ago). S4 was the first SSM to achieve competitive performance on the Long Range Arena benchmark, matching or beating Transformers.

Check: What problem does HiPPO initialization solve?

It prevents the hidden state from forgetting distant inputs too quickly

It speeds up the forward pass

It reduces the number of parameters

🔨 Derivation: Why HiPPO Uses Legendre Polynomials

The HiPPO framework asks: "What is the optimal way to compress the history of a signal x(s) for s ∈ [0, t] into a fixed-size state vector c(t) ∈ R^N?"

Your task: Show why projecting onto Legendre polynomials gives the HiPPO-LegS matrix A_nk = −(2n+1)^{1/2}(2k+1)^{1/2} for n > k. What makes this a "good" compression?

We want c_n(t) to be the n-th coefficient of the best L² approximation of x(s) on [0, t] using an orthogonal polynomial basis. Legendre polynomials P_n are orthogonal on [−1, 1], so we rescale to [0, t]: c_n(t) = (2n+1)/t ∫₀^t P_n(2s/t − 1) x(s) ds.

Taking d/dt of c_n(t) involves the Leibniz rule (the integral's upper bound depends on t) AND the fact that the basis functions themselves shift as t grows (because we're projecting onto [0,t], not a fixed interval). This produces a coupling between all coefficients c_k with k ≤ n, yielding the matrix ODE c'(t) = −(1/t) A c(t) + (1/t) B x(t).

The derivative of P_n can be expressed as a linear combination of P_k for k < n (a classical identity for Legendre polynomials). This means updating coefficient n only requires knowing coefficients k ≤ n, so A is lower-triangular. The specific entries come from the Legendre recurrence relation coefficients scaled by the normalization factors (2n+1)^{1/2}.

Full derivation:

1. Define the online polynomial approximation: c_n(t) = ∫₀^t w_n(s, t) x(s) ds where w_n(s, t) = (2n+1)/t · P_n(2s/t − 1) is the Legendre weight rescaled to [0, t].

2. Differentiate with respect to t using Leibniz rule and the chain rule on the rescaled argument 2s/t − 1:

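c'_n(t) = (2n+1)/t · x(t) · P_n(1) − ∑_{k=0}^{n} (A_nk / t) · c_k(t)

3. Using P_n(1) = 1 and the Legendre derivative identities, the coupling matrix works out to A_nk = (2n+1)^{1/2}(2k+1)^{1/2} for n > k, and A_nn = n + 1.

4. The system is c'(t) = −(1/t) A c(t) + (1/t) B x(t) where B_n = (2n+1)^{1/2}. Here A is written with positive entries and the minus sign is explicit; absorbing that sign into the matrix gives the negative entries quoted for the S4 state matrix.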

The key insight: This specific A matrix is the unique linear system that maintains the optimal polynomial approximation of the entire history as new data arrives. A randomly initialized A has no such guarantee — it just exponentially forgets. HiPPO's A is structured so that information doesn't decay; it gets redistributed across polynomial modes. This is why S4 can model dependencies over thousands of steps.

Chapter 4: Selective Scan (Mamba)

S4 has a critical limitation: A, B, C are the same for every input token. The model can't decide to "pay more attention" to important tokens or "forget" irrelevant ones. Mamba (Gu & Dao, 2023) fixes this by making B, C, and Δ functions of the input.

B_t = Linear(x_t)
C_t = Linear(x_t)
Δ_t = softplus(Linear(x_t))

This is the selective in "selective scan." When the model sees an important token, it can increase Δ (larger step = more input mixed in) and adjust B to write strongly to the state. For irrelevant tokens, it can shrink Δ and effectively skip them.

Interactive: Selective Gates

Watch how Mamba's input-dependent gates open (green) for important tokens and close (dim) for irrelevant ones. Click tokens to toggle importance.

What "Selective" Actually Means

Think of Δ_t as a gate width. When Δ is large, exp(ΔA) shrinks the old state more and B̄ grows — the new input floods in. When Δ is small, the old state is preserved and the input barely registers. The model learns to open the gate wide for important tokens (names, keywords, punctuation) and keep it narrow for filler words.
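The gate behaviour can be seen numerically for a single scalar channel. This sketch assumes a = −1 and b = 1 and feeds three hypothetical pre-activation values through softplus; the names are illustrative.

```python
import math

def softplus(z):
    """softplus(z) = log(1 + e^z), always positive."""
    return math.log1p(math.exp(z))

def gate(delta, a=-1.0, b=1.0):
    """ZOH discretization of one scalar channel with a < 0."""
    a_bar = math.exp(delta * a)        # fraction of the old state that survives
    b_bar = (a_bar - 1.0) / a * b      # strength with which the new input is written
    return a_bar, b_bar

rows = []
for pre_activation in (-4.0, 0.0, 4.0):    # hypothetical outputs of Linear(x_t)
    delta = softplus(pre_activation)
    a_bar, b_bar = gate(delta)
    rows.append((delta, a_bar, b_bar))
    print(f"delta={delta:.3f}  keep old state={a_bar:.3f}  write input={b_bar:.3f}")
```

With these particular values, b̄ = 1 − ā, so the step behaves like a convex blend between the old state and the new input, controlled by Δ.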

B_t controls what is written to the state, C_t controls what is read out, and Δ_t controls how much the state changes. All three are computed from linear projections of the current input x_t (Δ_t additionally passes through a softplus), which makes the entire SSM dynamics content-dependent. This is what S4 couldn't do.

The breakthrough: Making parameters input-dependent breaks the convolution view — you can no longer precompute a single kernel K. But it makes the model vastly more expressive. Mamba showed that with careful GPU implementation, the recurrence can be nearly as fast as the convolution.

Model	Parameters	Convolution?	Content-aware?
S4	A, B, C, Δ fixed (same for all inputs)	Yes (parallel training)	No
Mamba	B, C, Δ input-dependent (so Ā, B̄ vary per token)	No (must use scan)	Yes

The parameter generation is lightweight. For each token x_t in R^d_inner: B_t = W_B x_t where W_B is [d_state, d_inner], producing B_t in R^d_state. Similarly for C_t. For Δ_t: a linear projection to R^d_inner followed by softplus ensures Δ > 0. The total parameter overhead for selectivity is minimal — just three small linear layers per block.

Check: What is the key innovation of Mamba over S4?

Mamba uses a larger hidden state

Mamba makes B, C, and Δ input-dependent, enabling content-aware selection

Mamba replaces the recurrence with attention

Checkpoint — Before you move on

Making B, C, Δ input-dependent breaks the convolution view. Explain why — what specific property of the kernel K_i = CĀⁱB̄ is violated when these matrices change at every timestep?


Model Answer

The convolution kernel K_i = CĀⁱB̄ can be precomputed ONLY because C, Ā, B̄ are the same for every token. The kernel depends only on the lag i (how far back we're looking), not on which token is at position i.

When B_t = Linear(x_t), C_t = Linear(x_t), and Δ_t = softplus(Linear(x_t)), the effective "kernel" at position k looking back j steps becomes C_k · (∏_{i=k−j+1}^{k} Ā_i) · B̄_{k−j}. This product depends on every input in the range, so it is no longer shift-invariant. You can't factor it into a single convolution.

This is the fundamental cost of selectivity: you gain content-awareness but lose the O(n log n) FFT training path. The parallel scan (next chapter) is the solution — it's not as fast as FFT, but it's O(n) work in O(log n) parallel depth, which is practical on GPUs.
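The loss of shift-invariance can be checked directly on a scalar recurrence. In this sketch j denotes the source position (not the lag), the per-step decays are made-up numbers, and the function name is illustrative.

```python
def effective_weight(a, b, c, k, j):
    """Weight of input x_j in output y_k, for h_t = a[t]*h_{t-1} + b[t]*x_t and y_t = c[t]*h_t."""
    w = c[k] * b[j]
    for i in range(j + 1, k + 1):
        w *= a[i]              # decay applied at every step between j and k
    return w

n = 6
ones = [1.0] * n
fixed_a = [0.9] * n                           # S4-like: same decay at every step
varying_a = [0.9, 0.1, 0.9, 0.9, 0.5, 0.9]    # Mamba-like: decay depends on the token

# Same lag (2 steps back), evaluated at two different positions.
fixed_early = effective_weight(fixed_a, ones, ones, k=3, j=1)
fixed_late = effective_weight(fixed_a, ones, ones, k=5, j=3)
varying_early = effective_weight(varying_a, ones, ones, k=3, j=1)
varying_late = effective_weight(varying_a, ones, ones, k=5, j=3)

print("fixed:  ", fixed_early, fixed_late)
print("varying:", varying_early, varying_late)
```

With fixed decays the weight depends only on the lag, so one kernel describes every position. With token-dependent decays the same lag gives different weights at different positions, so no single kernel exists.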

💻 Build It: Implement the Selective SSM Step

Implement one step of Mamba's selective SSM: given the current input x_t, produce the output y_t and update the hidden state. Include the discretization (ZOH), the input-dependent parameter generation, and the state update.

signature
def selective_ssm_step(x_t, h_prev, A, W_B, W_C, W_delta, bias_delta):
    """
    One step of Mamba's selective SSM.

    Args:
        x_t: input vector [d_inner]
        h_prev: previous hidden state [d_inner, d_state]
        A: diagonal state matrix [d_inner, d_state] (log-space)
        W_B: projection for B [d_state, d_inner]
        W_C: projection for C [d_state, d_inner]
        W_delta: projection for delta [d_inner, d_inner]
        bias_delta: bias for delta [d_inner]

    Returns:
        y_t: output [d_inner]
        h_t: new hidden state [d_inner, d_state]
    """

Test case

With d_inner=4, d_state=2, x_t=ones(4), h_prev=zeros(4,2), A=-ones(4,2):
delta_t should be positive (softplus output), B_bar should scale with delta,
and y_t should have shape [4] with non-zero values (since B_t writes x_t into fresh state).

In practice, Mamba uses a first-order approximation for B̄: B̄_t ≈ Δ_t · B_t (element-wise scaling). The exact ZOH form A⁻¹(e^{ΔA} − I)B is cheap to evaluate for diagonal A, but the simplified form is what's actually implemented in the CUDA kernel. For Ā: A is stored in log-space as A_log, with A_real = −exp(A_log) so that A_real is always negative, and Ā = exp(Δ · A_real).

python
import torch
import torch.nn.functional as F

def selective_ssm_step(x_t, h_prev, A, W_B, W_C, W_delta, bias_delta):
    # 1. Generate input-dependent parameters
    B_t = W_B @ x_t                         # [d_state]
    C_t = W_C @ x_t                         # [d_state]
    delta_t = F.softplus(W_delta @ x_t + bias_delta)  # [d_inner], > 0

    # 2. Discretize (A is in log-space, shape [d_inner, d_state])
    A_real = -torch.exp(A)                  # negative reals, so A_bar stays in (0, 1)
    A_bar = torch.exp(delta_t.unsqueeze(1) * A_real)  # [d_inner, d_state]
    B_bar = delta_t.unsqueeze(1) * B_t.unsqueeze(0)   # [d_inner, d_state]

    # 3. Update state
    h_t = A_bar * h_prev + B_bar * x_t.unsqueeze(1)  # [d_inner, d_state]

    # 4. Read output
    y_t = (h_t * C_t.unsqueeze(0)).sum(dim=1)  # [d_inner]

    return y_t, h_t

Bonus challenge: Extend this to process an entire sequence of L tokens. Can you see why a naive for-loop would be slow on GPU? How would you batch the parameter generation (steps 1-2) across all tokens in parallel, then use a scan for step 3?

Chapter 5: Hardware-Aware Scan

Making B, C, Δ input-dependent means we can't use convolution for training. The naive approach — a sequential for-loop — would be unbearably slow on GPUs. Mamba uses a parallel scan algorithm that computes the full recurrence in O(log n) parallel steps instead of O(n) sequential ones.

The Parallel Scan

The key insight: each step h_t = a_t h_{t−1} + b_t is an affine map of the previous state, and composing such maps is associative. As with addition, we can regroup the computation (though not reorder it). Instead of left-to-right, we use a binary tree:

Step 1

Compute pairs: (h₀,h₁), (h₂,h₃), (h₄,h₅), ...

↓

Step 2

Combine pairs: (h_0..1,h_2..3), (h_4..5,h_6..7), ...

↓

Step 3

Combine again: (h_0..3,h_4..7), ...

↓ log₂(n) steps total

Done

All h_0..n computed in parallel

Interactive: Parallel Scan Tree

Click "Step" to advance through the parallel scan. Each level halves the remaining work.


The Fused CUDA Kernel

Mamba doesn't just use a parallel scan — it fuses the entire SSM computation (discretization, scan, output projection) into a single GPU kernel. The key bottleneck on modern GPUs isn't compute but memory bandwidth. Reading and writing intermediate tensors of shape [B, L, d_inner, d_state] to global memory (HBM) would dominate runtime.

Instead, Mamba's kernel keeps the state (d_state=16 floats per channel) entirely in SRAM (the GPU's fast on-chip scratchpad). The associative scan runs in registers, and only the final output [B, L, d_inner] is written back to HBM. This reduces memory I/O by a factor proportional to d_state, making the selective SSM nearly as fast as a simple matrix multiply.
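The I/O saving can be estimated with simple byte counts. The batch size and sequence length below are illustrative assumptions, as is the choice of 2 bytes per element.

```python
def tensor_bytes(*shape, bytes_per_elem=2):
    """Size in bytes of a dense tensor with the given shape."""
    n = bytes_per_elem
    for dim in shape:
        n *= dim
    return n

B, L, d_inner, d_state = 1, 2048, 5120, 16    # B and L are hypothetical values

full_states = tensor_bytes(B, L, d_inner, d_state)   # every h_t written to HBM
outputs_only = tensor_bytes(B, L, d_inner)           # only y written to HBM
ratio = full_states // outputs_only

print(f"materialized states: {full_states / 2**20:.0f} MiB")
print(f"outputs only:        {outputs_only / 2**20:.0f} MiB")
print(f"reduction factor:    {ratio}")
```

The reduction factor equals d_state, which matches the "factor proportional to d_state" stated above.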

GPU optimization: Keeping the full state in SRAM (fast on-chip memory) and avoiding writes of intermediate results to slower HBM is the same philosophy as Flash Attention: restructure the algorithm to match the GPU memory hierarchy.

Check: How does the parallel scan reduce O(n) sequential steps?

By skipping unimportant tokens

By using attention instead

By exploiting associativity to combine pairs in a binary tree, taking O(log n) parallel steps

🔨 Derivation: Proving the Scan is Associative

The linear recurrence h_t = a_t · h_t−1 + b_t can be written as a binary operation on tuples: (a₂, b₂) • (a₁, b₁) = (a₂ · a₁, a₂ · b₁ + b₂).

Your task: Prove this operation is associative: ((a₃, b₃) • (a₂, b₂)) • (a₁, b₁) = (a₃, b₃) • ((a₂, b₂) • (a₁, b₁)). Then explain why associativity enables O(log n) parallel depth.

Left side: first compute (a₃, b₃) • (a₂, b₂) = (a₃a₂, a₃b₂ + b₃). Then apply • (a₁, b₁): = (a₃a₂ · a₁, a₃a₂ · b₁ + a₃b₂ + b₃).

Right side: first compute (a₂, b₂) • (a₁, b₁) = (a₂a₁, a₂b₁ + b₂). Then apply (a₃, b₃) •: = (a₃ · a₂a₁, a₃(a₂b₁ + b₂) + b₃) = (a₃a₂a₁, a₃a₂b₁ + a₃b₂ + b₃). Same as left side!

If the operation is associative, computing the "prefix sum" (cumulative application from left to right) can be restructured into a balanced binary tree. Instead of ((a•b)•c)•d (3 sequential steps), we compute (a•b) and (c•d) in parallel, then combine. For n elements: log₂(n) parallel steps instead of n−1 sequential steps.

Full proof:

Define •: (a₂, b₂) • (a₁, b₁) = (a₂a₁, a₂b₁ + b₂)

Left: ((a₃,b₃)•(a₂,b₂))•(a₁,b₁) = (a₃a₂, a₃b₂+b₃)•(a₁,b₁) = (a₃a₂a₁, a₃a₂b₁ + a₃b₂ + b₃)

Right: (a₃,b₃)•((a₂,b₂)•(a₁,b₁)) = (a₃,b₃)•(a₂a₁, a₂b₁+b₂) = (a₃·a₂a₁, a₃(a₂b₁+b₂)+b₃) = (a₃a₂a₁, a₃a₂b₁ + a₃b₂ + b₃)

Left = Right. QED. The operation is associative (but NOT commutative — order matters!).

Why this enables the parallel scan: Folding • over (a₁,b₁), ..., (a_n,b_n) gives only the final state h_n, but the SSM needs every intermediate prefix h₁, ..., h_n, which is a scan (prefix computation). With associativity, we can use the Blelloch scan algorithm: an up-sweep (reduce phase) followed by a down-sweep (propagate phase), each taking log₂(n) steps. Total work is O(n), parallel depth is O(log n).

The key insight: This tuple representation encodes the entire linear recurrence. The first element tracks the cumulative "decay" (how much of the initial state survives), and the second element tracks the cumulative "input contribution" (everything that's been added along the way). It's a matrix-free way to represent h_t = (∏ a_i)h₀ + accumulated inputs.
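The tuple operator can be exercised with a small scan. For brevity this sketch uses a doubling (Hillis-Steele style) scan rather than the Blelloch algorithm: it also finishes in about log₂(n) rounds, but does O(n log n) total work instead of O(n). The list comprehension stands in for what would be one parallel step on a GPU; the input values are made up.

```python
def combine(later, earlier):
    """(a2, b2) • (a1, b1) = (a2*a1, a2*b1 + b2): apply `earlier` first, then `later`."""
    a2, b2 = later
    a1, b1 = earlier
    return (a2 * a1, a2 * b1 + b2)

def scan_sequential(pairs, h0=0.0):
    """Reference: run h_t = a_t*h_{t-1} + b_t left to right."""
    h, out = h0, []
    for a, b in pairs:
        h = a * h + b
        out.append(h)
    return out

def scan_doubling(pairs, h0=0.0):
    """Inclusive scan in ceil(log2 n) rounds; all updates within a round are independent."""
    prefix = list(pairs)
    offset = 1
    rounds = 0
    while offset < len(prefix):
        prefix = [
            combine(prefix[i], prefix[i - offset]) if i >= offset else prefix[i]
            for i in range(len(prefix))
        ]
        offset *= 2
        rounds += 1
    # prefix[t] = (cumulative decay, accumulated inputs); apply it to the initial state
    return [a * h0 + b for a, b in prefix], rounds

pairs = [(0.9, 1.0), (0.5, 2.0), (0.8, -1.0), (0.1, 0.5), (0.7, 3.0)]
expected = scan_sequential(pairs, h0=2.0)
result, rounds = scan_doubling(pairs, h0=2.0)
print(expected)
print(result, "in", rounds, "rounds")
```

Because the operator is not commutative, the later element is always passed as the first argument to `combine`.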

Chapter 6: The Mamba Block

A Mamba block looks quite different from a Transformer block. It has no attention mechanism at all. Instead, it uses a combination of linear projections, 1D convolution, and the selective SSM:

Input

x ∈ R^{n×d}

↓

Linear Projection

Expand: d → 2·d_inner (split into two branches)

↓ Branch A ↓ Branch B

Conv1D (Branch A)

Short causal convolution (kernel size ~4)

↓

Selective SSM

Input-dependent scan with generated B, C, Δ

↓ ⊗ SiLU(Branch B)

Gated Merge

Element-wise multiply with SiLU-activated Branch B

↓

Linear Projection

Project back: d_inner → d

Interactive: Mamba Block Data Flow

Click "Step" to trace data through the Mamba block. Watch the two branches split and merge.


Concrete Shapes (Mamba-2.8B)

For Mamba-2.8B: d_model=2560, d_state=16, d_conv=4, expand_factor=2, so inner_dim=5120. Here's the exact data flow through one block:

Stage	Shape	Operation
Input	[B, L, 2560]	—
Linear expand	[B, L, 10240]	Project to 2 × inner_dim, split into two branches
Branch A: Conv1D	[B, L, 5120]	Causal conv, kernel=4, groups=5120 (depthwise)
Generate B, C, Δ	B, C: [B, L, 16] each; Δ: [B, L, 5120]	Linear projections from Branch A (softplus on Δ)
SSM scan	[B, L, 5120]	Parallel scan with d_state=16 per channel
Gate merge	[B, L, 5120]	SSM output ⊗ SiLU(Branch B)
Output projection	[B, L, 2560]	Linear back to d_model

The SSM state has shape [B, 5120, 16] — each of the 5120 channels maintains its own 16-dimensional state vector. During inference, this state is all you carry forward. No KV cache needed.

Why Conv1D? The short convolution provides local context mixing — it lets the model see a few neighboring tokens before the SSM processes the global sequence. It's like a tiny receptive field that helps the SSM know "what's nearby" before deciding how to update the state.
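The data flow in the table can be traced end to end with a toy block. This is a sketch under several assumptions: tiny dimensions, no batch axis, random weights, a sequential loop in place of the fused scan, and no normalization or residual connection. It follows the diagram above and is not the reference Mamba implementation; all function and parameter names are illustrative.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))

def causal_depthwise_conv(u, w):
    """u: [L, d_inner], w: [d_conv, d_inner]. Each channel mixes only current and past tokens."""
    d_conv = w.shape[0]
    padded = np.concatenate([np.zeros((d_conv - 1, u.shape[1])), u], axis=0)
    return sum(w[j] * padded[j : j + u.shape[0]] for j in range(d_conv))

def init_params(d_model, expand=2, d_state=4, d_conv=4, seed=0):
    rng = np.random.default_rng(seed)
    d_inner = expand * d_model
    return {
        "W_in": 0.1 * rng.normal(size=(d_model, 2 * d_inner)),
        "w_conv": 0.1 * rng.normal(size=(d_conv, d_inner)),
        "W_B": 0.1 * rng.normal(size=(d_inner, d_state)),
        "W_C": 0.1 * rng.normal(size=(d_inner, d_state)),
        "W_delta": 0.1 * rng.normal(size=(d_inner, d_inner)),
        "A_log": np.zeros((d_inner, d_state)),      # A_real = -exp(A_log) = -1
        "W_out": 0.1 * rng.normal(size=(d_inner, d_model)),
    }

def toy_mamba_block(x, p):
    """x: [L, d_model] -> (output [L, d_model], final state [d_inner, d_state])."""
    branch_a, branch_b = np.split(x @ p["W_in"], 2, axis=-1)   # each [L, d_inner]
    u = causal_depthwise_conv(branch_a, p["w_conv"])           # [L, d_inner]
    B_t = u @ p["W_B"]                                         # [L, d_state]
    C_t = u @ p["W_C"]                                         # [L, d_state]
    delta = softplus(u @ p["W_delta"])                         # [L, d_inner]
    A = -np.exp(p["A_log"])                                    # [d_inner, d_state]
    h = np.zeros_like(A)
    y = np.empty_like(u)
    for t in range(x.shape[0]):                                # sequential stand-in for the scan
        A_bar = np.exp(delta[t][:, None] * A)
        B_bar = delta[t][:, None] * B_t[t][None, :]
        h = A_bar * h + B_bar * u[t][:, None]
        y[t] = h @ C_t[t]                                      # [d_inner]
    gated = y * silu(branch_b)                                 # gate merge
    return gated @ p["W_out"], h

params = init_params(d_model=8)
x = np.random.default_rng(1).normal(size=(6, 8))
out, h_final = toy_mamba_block(x, params)
print(out.shape, h_final.shape)
```

The final state returned alongside the output is the only thing that would need to be carried into the next token during generation.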

Check: What role does the SiLU-gated branch play?

It provides the key and value vectors

It acts as a multiplicative gate controlling information flow (like GLU)

It computes attention weights

💥 Break-It Lab: What Dies When You Remove Mamba Components?

A working Mamba block processes a sequence of tokens through selective SSM with HiPPO initialization. The canvas shows how well the model retains and selectively processes information. Toggle components off to see what breaks.

Remove Selectivity (revert to LTI)

Failure mode: Without input-dependent B, C, Δ, the model treats every token identically. The "important" tokens (green) get the same state update as filler tokens. The model can no longer selectively retain relevant information — it either remembers everything (state overflows) or forgets everything (information washes out). On the copying task, perplexity increases by 2-4x because the model can't "open the gate" for tokens it needs to recall.

Remove HiPPO Init (random A)

Failure mode: Random initialization of A causes exponential decay of past information. After ~50 tokens, the state has effectively forgotten everything. The model becomes a "short-range" processor, unable to maintain context beyond a small window. Long-range benchmarks (Path-X, Long Range Arena) drop from 90%+ to near-random. The polynomial compression structure is lost.

Remove Gated Branch (no SiLU gate)

Failure mode: Without the multiplicative gate from Branch B, the SSM output passes directly to the output projection. The model loses its ability to suppress or amplify SSM outputs based on local context. Training becomes unstable (gradients are less well-conditioned), and the model underperforms by ~1 perplexity point. The gate acts as an information bottleneck that regularizes the SSM signal.

Chapter 7: Hybrid Models

Pure Mamba models are fast but sometimes struggle with tasks requiring precise retrieval of specific tokens from the past (like "what was the 7th word?"). Pure Transformers are great at retrieval but expensive. The natural solution: combine both.

Key Hybrid Architectures

Model	Architecture	Key Idea
Jamba (AI21)	Mamba + Attention + MoE	Alternate Mamba and attention layers. MoE for capacity.
Zamba (Zyphra)	Mamba + shared Attention	One shared attention layer interleaved every few Mamba blocks.
Griffin (DeepMind)	Gated linear recurrence + local attention	Use recurrence for global, windowed attention for local.
Mamba-2	Structured state space duality	Show SSMs and attention are dual views of the same operation.

Interactive: Hybrid Architecture Viewer

Select an architecture to see how Mamba and attention layers are interleaved.

How Jamba Interleaves

Jamba (AI21, 2024) uses a concrete pattern: for every 7 Mamba layers, insert 1 attention layer. This means only 1 in 8 layers (12.5%) is attention, but that's enough for strong retrieval performance. Combined with Mixture-of-Experts (MoE) layers, Jamba-1.5 fits 256K context in a single 80GB GPU, which is hard to match with a pure Transformer of equivalent quality.

Zamba (Zyphra) goes further: it uses a single shared attention layer whose weights are reused every N Mamba blocks. This sharply reduces the number of attention parameters while still providing the retrieval capability where needed.
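The 7:1 interleaving described above can be written down as a layer schedule. Placing the attention layer at the end of each group of 8 is an assumption made for this sketch, not a statement about Jamba's actual layout; the function name is illustrative.

```python
def hybrid_layer_pattern(n_layers, mamba_per_attention=7):
    """List of layer types with one attention layer after every `mamba_per_attention` Mamba layers."""
    period = mamba_per_attention + 1
    return ["attention" if (i + 1) % period == 0 else "mamba" for i in range(n_layers)]

pattern = hybrid_layer_pattern(32)
attention_fraction = pattern.count("attention") / len(pattern)

print(pattern[:8])
print("attention layers:", pattern.count("attention"), "of", len(pattern))
print("attention fraction:", attention_fraction)
```

A Zamba-style variant would keep the same schedule but point every "attention" entry at one shared set of weights.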

Why hybrids work: Mamba excels at long-range, smooth information flow (like understanding overall topic and style). Attention excels at precise, content-based retrieval (like copying a name from 5000 tokens ago). Combining both gives you the best of both worlds with a fraction of the attention cost.

Check: What weakness of pure Mamba models do hybrid architectures address?

Difficulty with precise retrieval of specific tokens from the past

Being too slow at inference

Using too much memory

Chapter 8: Training & Inference

Mamba models are trained with the exact same objective as Transformers: next-token prediction with cross-entropy loss. The training data, tokenizer, and optimizer are all the same. The difference is entirely in the architecture.

Training: Parallel Mode

During training, the entire sequence is available. The selective SSM uses the parallel scan to process all tokens simultaneously. This is analogous to how Transformers compute all attention weights at once.

Inference: Recurrence Mode

During generation, we switch to sequential recurrence. At each step:

Get token

Receive x_t from previous prediction

↓

Update state

h_t = Ā_t h_t-1 + B̄_t x_t — constant time!

↓

Output

y_t = C_t h_t — predict next token

Interactive: Inference Memory Comparison

Compare memory usage during generation. Transformer's KV cache grows linearly; Mamba's state is constant.


Memory: A Concrete Example

Consider a 2.8B model generating a 100K-token sequence:

	Transformer (2.8B)	Mamba (2.8B)
KV cache at 1K tokens	~160 MB	0 (no KV cache)
KV cache at 100K tokens	~16 GB	0
SSM state (all layers)	n/a	~7.5 MB (fixed)
Time per token at 100K	Grows (must attend to all KV)	Constant

The Mamba state is tiny: d_inner × d_state × n_layers × 2 bytes = 5120 × 16 × 48 × 2 ≈ 7.5 MB per sequence. This is why Mamba shines for long-context applications like book-length analysis, genomics, and continuous audio processing.

Constant memory per token: Unlike the Transformer's KV cache, which grows with every generated token, the SSM hidden state is always the same size. At 100K tokens, the Transformer in the table above needs roughly 16 GB for the KV cache, and larger models or longer contexts need more. Mamba needs the same fixed state it had at token 1.
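The two growth curves can be tabulated directly. This sketch takes the table's ~160 MB per 1K tokens as roughly 160 KB of KV cache per token (an assumption derived from that row) and uses the state-size formula given above; the function name is illustrative.

```python
def memory_trace(n_tokens, kv_bytes_per_token, state_bytes):
    """(KV cache bytes, SSM state bytes) after each generated token."""
    return [((t + 1) * kv_bytes_per_token, state_bytes) for t in range(n_tokens)]

# Assumption: ~160 MB per 1K tokens from the table, i.e. ~160 KB per token.
KV_BYTES_PER_TOKEN = 160_000
# d_inner x d_state x n_layers x 2 bytes; 7,864,320 bytes is 7.5 MiB (about 7.9 MB decimal).
STATE_BYTES = 5120 * 16 * 48 * 2

trace = memory_trace(100_000, KV_BYTES_PER_TOKEN, STATE_BYTES)
for t in (1, 1_000, 100_000):
    kv, state = trace[t - 1]
    print(f"{t:>7} tokens: KV cache {kv / 1e6:10.2f} MB, SSM state {state / 1e6:.2f} MB")
```

The first column grows linearly with the number of generated tokens while the second never changes.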

Check: How does Mamba's inference memory scale with sequence length?

It grows quadratically

It grows linearly (like KV cache)

It stays constant (fixed-size hidden state)

Chapter 9: SSM vs Transformer Tradeoffs

Neither architecture is strictly better. Each has fundamental strengths and weaknesses rooted in their core mechanisms.

Dimension	Transformer	SSM / Mamba
Training cost	O(n²) per layer	O(n) per layer
Inference memory	KV cache grows with n	Fixed-size state
Inference time/token	Grows with n (attend to all KV)	Constant
Exact retrieval	Excellent (direct access)	Weaker (compressed state)
Long-range context	Struggles past training length	Naturally extends
In-context learning	Strong (induction heads)	Emerging but less understood
Ecosystem maturity	Very mature (5+ years)	Rapidly growing
Hardware optimization	Flash Attention, GQA	Parallel scan, SRAM-aware

Interactive: Task Suitability Comparison

Select a task to see which architecture is better suited and why.

The Copying Test

One of the simplest ways to expose SSM vs Transformer differences: the copying task. Present the model with "Copy this: ABCDE. Output: " and check if it reproduces ABCDE exactly. Transformers solve this easily with induction heads, because they can attend directly to the tokens to copy. SSMs struggle because the information must pass through the compressed state bottleneck. The state has a fixed size (16 dimensions per channel), so its capacity is bounded, and exact recall of arbitrary tokens degrades as the string to copy gets longer.

This is precisely why hybrid models exist: the attention layers handle exact recall, and the SSM layers handle everything else (fluency, long-range context, style). In Jamba's design, the few attention layers disproportionately handle retrieval-like subtasks while the many Mamba layers handle the bulk of language modeling.

When to use which: For very long contexts (100K+ tokens, audio, genomics), SSMs shine. For retrieval-heavy tasks (RAG, QA over documents), Transformers are more reliable. For production chat models, hybrids are increasingly the best choice.

⚔ Adversarial: Mamba has no attention matrix. How does it handle the copying task?

You give Mamba the prompt: "Remember: X7Q2M. Now repeat: ". The model must output "X7Q2M" exactly. But the hidden state is only 16 dimensions per channel. There's no attention matrix to "look back" at the original tokens.

Mamba perfectly stores all 5 characters in the 16-dim state because the HiPPO basis is lossless

Mamba uses the Conv1D layer to directly access past tokens like a sliding window

Mamba struggles with exact copying: it must compress the tokens into a lossy state, and short arbitrary strings may not be perfectly reconstructible

🏗 Design Challenge

You're the Architect: 1M Token Context at 7B Scale

Your team needs to ship a 7B-parameter model that handles 1 million token context at inference on a single A100 (80GB). The model must support both long-document summarization (smooth context) AND retrieval-augmented generation (exact recall from injected documents). Design the architecture.

GPU Memory: 80 GB (A100)

Model Size: 7B parameters (~14 GB in FP16)

Context Length: 1,000,000 tokens

Latency Target: < 50ms per token at generation

Key Tasks: Summarization + RAG retrieval

1. Pure Transformer with 1M context: what's the KV cache size? Can it fit? (Hint: compute n_layers × 2 × seq_len × d_head × n_kv_heads × 2 bytes, where the first 2 counts K and V and the last 2 is bytes per FP16 value)

2. Pure Mamba: what's the state size? Will retrieval work for RAG?

3. Hybrid: how many attention layers do you need? Where do you place them? What attention window do you use?

4. What's your prefill strategy? (Chunked? Which layers get full context vs windowed?)

The memory math:

Pure Transformer (7B, 32 layers, GQA with 8 KV heads, d_head=128): KV cache = 32 × 2 × 1M × 128 × 8 × 2 bytes = ~128 GB (taking 1M as 2^20 tokens). Doesn't fit on one A100. Even with 4-bit KV quantization, the cache alone is ~32 GB; after the ~14 GB of FP16 weights, that leaves ~34 GB for activations and everything else. Marginal at best.

Pure Mamba (7B, 48 layers, d_inner=5120, d_state=16): State = 48 × 5120 × 16 × 2 bytes = ~7.5 MB. Trivially fits. But RAG retrieval from compressed state is unreliable.

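A Jamba/Zamba-style hybrid: use ~12.5% attention layers (4 out of 32). For the 1M-token budget here, also restrict each attention layer to a limited window (4096-8192 tokens); treat the windowing as a design choice for this challenge rather than a description of those models. The attention layers handle precise retrieval within their window, while Mamba layers propagate long-range context. For 1M tokens with 4 attention layers at 4096 window: KV cache = 4 × 2 × 4096 × 128 × 8 × 2 bytes = ~64 MB. Adding the Mamba state (at most the ~7.5 MB of the pure-Mamba case) gives a total of ~72 MB. Fits trivially with room to spare.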
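The memory math above can be checked with a few lines of arithmetic. This is a minimal sketch using the layer counts and dimensions quoted in the text; the helper names `kv_cache_bytes` and `ssm_state_bytes` are made up for illustration, "1M" is read as 2^20 tokens, and sizes are printed in binary units (GiB/MiB), which is what the rounded figures in the text correspond to.

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, d_head, bytes_per_value=2):
    """KV cache size: one K and one V tensor per attention layer."""
    return n_layers * 2 * seq_len * n_kv_heads * d_head * bytes_per_value


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    """Recurrent state size: one (d_inner x d_state) state per SSM layer."""
    return n_layers * d_inner * d_state * bytes_per_value


GIB = 2 ** 30
MIB = 2 ** 20
ONE_M = 2 ** 20  # assumption: "1M tokens" taken as 2^20

full = kv_cache_bytes(32, ONE_M, n_kv_heads=8, d_head=128)
full_4bit = kv_cache_bytes(32, ONE_M, n_kv_heads=8, d_head=128,
                           bytes_per_value=0.5)
mamba = ssm_state_bytes(48, d_inner=5120, d_state=16)
windowed = kv_cache_bytes(4, 4096, n_kv_heads=8, d_head=128)

print(f"pure Transformer, FP16 KV : {full / GIB:.1f} GiB")
print(f"pure Transformer, 4-bit KV: {full_4bit / GIB:.1f} GiB")
print(f"pure Mamba state          : {mamba / MIB:.1f} MiB")
print(f"hybrid, 4 windowed layers : {windowed / MIB:.1f} MiB")
```

The sketch counts only the KV cache and the SSM state. It ignores weights, activations, and Mamba's small convolution buffer, so it gives a lower bound on memory rather than a full budget.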

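For RAG specifically: order and chunk the retrieved documents so the most relevant passages sit inside the attention window, i.e. within the last 4096-8192 tokens before the answer is generated. Attention can then recall those passages exactly, while anything outside the window is available only through the lossy Mamba state.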
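One way to sketch that packing step is shown below. It assumes the retriever returns (relevance score, text) pairs and that whitespace splitting is an acceptable stand-in for a real tokenizer; `pack_for_window` is a hypothetical helper, not part of any library.

```python
def pack_for_window(passages, question, window=4096,
                    count_tokens=lambda text: len(text.split())):
    """Order prompt segments so the best passages land inside the window.

    passages: list of (relevance_score, text) pairs from a retriever.
    Returns the text segments in prompt order.
    """
    budget = window - count_tokens(question)
    inside, outside = [], []
    # Greedily admit passages in descending relevance while they still fit.
    for score, text in sorted(passages, key=lambda p: p[0], reverse=True):
        cost = count_tokens(text)
        if cost <= budget:
            inside.append(text)
            budget -= cost
        else:
            outside.append(text)
    # Overflow goes first (reachable only through the recurrent state).
    # Admitted passages follow, least relevant first, so the most relevant
    # one sits right next to the question.
    return outside + inside[::-1] + [question]


segments = pack_for_window(
    [(0.9, "a b c"), (0.1, "d e f g"), (0.5, "h i")],
    question="q ?",
    window=8,
)
print(segments)
```

In this example the two highest-scoring passages and the question total 7 tokens and fit in the 8-token window, while the lowest-scoring passage is pushed to the front. In practice the window also has to cover the tokens being generated, so the budget should leave headroom for the expected answer length.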

🔗 Pattern Recognition

Attention and Recurrence: The Same Tradeoff Everywhere

SSM / Mamba

Compress history into fixed state.
O(1) per token, but lossy.
Can't retrieve arbitrary past tokens.

Transformer

Store all past tokens in KV cache.
O(n) per token, but lossless.
Direct access to any past token.

This is the compression vs direct access tradeoff that appears everywhere in CS. A Bloom filter (fixed size, O(1) membership test, but lossy: false positives are possible) vs a sorted array (O(log n) lookup, preserves all info). A JPEG (compact, lossy) vs a bitmap (scales with resolution, lossless). In sequence modeling, a recurrent state is the "JPEG of the past": it captures the gist but loses details. Attention is the "bitmap": it keeps everything, but the cost scales with size.

Where else in ML do you see this exact tradeoff between a fixed-size bottleneck and full storage? (Hint: think about VAE latents, pooling layers, and knowledge distillation.)

"The next generation of sequence models will likely combine the best of both worlds."

— The emerging consensus

You now understand both the Transformer and its most promising challenger. From continuous-time state spaces to selective scans, from HiPPO to hardware-aware algorithms — this is the frontier of sequence modeling.

Understand SSMs & Mamba

Chapter 0: Sequences Without Attention

Chapter 1: Linear Recurrences

Chapter 2: The State Space Model

Why O(L), Not O(L²)?

Chapter 3: The S4 Trick

1. HiPPO Initialization

2. Diagonal + Low-Rank Structure

What S4 Actually Computes

Chapter 4: Selective Scan (Mamba)

What "Selective" Actually Means

Chapter 5: Hardware-Aware Scan

The Parallel Scan

The Fused CUDA Kernel

Chapter 6: The Mamba Block

Concrete Shapes (Mamba-2.8B)

Chapter 7: Hybrid Models

Key Hybrid Architectures

How Jamba Interleaves

Chapter 8: Training & Inference

Training: Parallel Mode

Inference: Recurrence Mode

Memory: A Concrete Example

Chapter 9: SSM vs Transformer Tradeoffs

The Copying Test
