Lahoti*, Li*, Chen*, Wang*, Bick, Kolter, Dao, Gu — CMU / Princeton / Together AI / Cartesia AI, 2026

Mamba-3 Improved Sequence Modeling Using State Space Principles

Three SSM-principled upgrades (exponential-trapezoidal discretization, complex-valued states, and MIMO) push the Pareto frontier of quality vs. inference speed, matching Mamba-2's perplexity at half the state size.

Prerequisites: State Space Models (Mamba-1/2) + Linear Recurrences + Numerical ODE Methods
10 Chapters, 8+ Simulations

Chapter 0: The Problem

You are deploying a language model. Users send prompts, and the model generates tokens one at a time. Each new token requires attending to every previous token — that is how Transformers work. The KV cache holding those past token representations grows linearly with sequence length. The attention computation itself grows quadratically.

At 1K tokens, this is tolerable. At 8K, the KV cache consumes gigabytes. At 32K, you need multi-GPU setups just to hold the cache. And as chain-of-thought reasoning and agentic workflows push inference-time compute to new heights, the cost of generating each additional token becomes the dominant expense. Inference is no longer an afterthought — it is the bottleneck.

The inference wall: Transformers have $O(L)$ memory (the KV cache grows with sequence length) and $O(L^2)$ compute (attention over all pairs). For generation, where you produce one token at a time and each token must attend to all previous ones, the cost per token grows linearly with the number of tokens generated so far. Double the output length, double the per-token cost.

Sub-quadratic models — state space models (SSMs) like Mamba-1 and Mamba-2, and linear attention variants like Gated DeltaNet — promise a fix. They maintain a fixed-size hidden state that summarizes the entire past. Generating a new token requires only updating this state and reading from it: O(1) memory, O(1) compute per token. No KV cache. No quadratic attention. Linear in sequence length for the full forward pass.

Sounds perfect. So why isn't everyone using them?

Three Unsolved Problems

Problem 1: Quality gap. Mamba-2 trades expressivity for training speed. Its state transition is a scalar times identity — every state dimension evolves the same way. This simplification enables hardware-efficient matrix multiplications during training, but it means the recurrence is less expressive than Mamba-1's diagonal transitions. At matched parameter counts, Mamba-2 slightly underperforms Gated DeltaNet on downstream tasks.

Problem 2: State-tracking failure. Ask Mamba-2 to compute the parity of a binary sequence (count the number of 1s, report whether it is even or odd). It fails — not at 95% accuracy, but at chance level. This is because parity requires a "rotation" in state space — flipping between two states on every 1 — and Mamba-2's real-valued, non-negative eigenvalue transitions cannot express rotational dynamics. More formally, Grazzi et al. (2025) proved that real-valued scalar transitions cannot represent functions requiring modular counting.

Problem 3: Hardware underutilization. Even though SSMs are theoretically efficient, their decode step has an arithmetic intensity of only about 2.5 ops/byte. An H100 GPU needs roughly 295 ops/byte on matrix multiplications before its tensor cores, rather than memory bandwidth, become the limit. So during SSM decoding the tensor cores sit mostly idle while memory bandwidth is saturated. The model is fast in theory but leaves about 99% of compute capacity unused.

[Interactive figure: The Three Problems of Current SSMs. For each problem, it shows the gap between ideal and actual behavior, framing the three axes Mamba-3 targets.]

Mamba-3 addresses all three problems simultaneously. Not with heuristic patches, but with principled improvements derived from the state space model perspective: a better discretization rule for expressivity, complex-valued states for rotation capability, and a MIMO formulation for hardware utilization. Let us see how.

Why does Mamba-2 fail at computing the parity of a binary string?

Chapter 1: The Key Insight

Mamba-3's thesis is simple: the right improvements come from returning to state space model fundamentals. Not from the linear attention viewpoint. Not from the test-time regression viewpoint. From the original continuous-time ODE that defines SSMs. Each of the three innovations arises naturally from asking "what does the SSM perspective suggest here?"

Axis 1: Exponential-Trapezoidal Discretization
Mamba-2's discretization is a first-order Euler approximation with $O(\Delta_t)$ global error. The trapezoidal rule is second-order, with $O(\Delta_t^2)$ global error. It adds a $\beta$ term to the recurrence, effectively convolving the state-input with a width-2 kernel inside the recurrence. Combined with B, C biases, this empirically removes the need for the external short convolution that prior models relied on.
Axis 2: Complex-Valued States
Making the SSM state complex-valued introduces rotational dynamics. The imaginary part of the eigenvalue acts as a rotation angle. Discretization turns the complex SSM into a real SSM with block-diagonal 2×2 rotation matrices in the transition — equivalent to data-dependent rotary position embeddings (RoPE) on B and C. This enables parity, modular arithmetic, and other state-tracking tasks that real-valued SSMs provably cannot solve.
Axis 3: MIMO
Standard SSMs are SISO: the state update $h_t \leftarrow \alpha_t h_{t-1} + B_t x_t^\top$ uses an outer product. MIMO generalizes B and x to have an extra rank dimension R, turning the outer product into a matrix multiply. FLOPs increase by roughly R× but memory traffic barely changes, so arithmetic intensity rises from ~2.5 ops/byte toward R× that. The tensor cores wake up. Decode latency stays nearly the same because the matmul overlaps with the memory-bound state read/write.
Why these three belong together: The trapezoidal discretization improves quality but doesn't fix state tracking — that requires complex eigenvalues. Complex states improve capability but don't help hardware utilization — that requires MIMO. MIMO improves utilization but without the quality gains from better discretization and complex states, the extra FLOPs buy less. The three innovations are complementary, not redundant.

The result at 1.5B scale: Mamba-3 SISO beats Gated DeltaNet (the previous best SSM) by +0.6 points on average downstream accuracy. Mamba-3 MIMO adds another +1.2 points for a total +1.8 point gain. On perplexity, Mamba-3 with state size 64 matches Mamba-2 with state size 128 — same quality at half the inference latency. And Mamba-3 solves parity perfectly, while Mamba-2 scores at chance.

Concept → Realization: Concept: "Return to SSM fundamentals to derive principled improvements." Realization: Three specific modifications to the Mamba-2 recurrence: $h_t = \alpha_t h_{t-1} + \beta_t B_{t-1} x_{t-1} + \gamma_t B_t x_t$ (trapezoidal), with a block-diagonal rotation $R_t$ in the transition (complex states), and $B_t \in \mathbb{R}^{N \times R}$, $x_t \in \mathbb{R}^{P \times R}$ instead of $\mathbb{R}^N$, $\mathbb{R}^P$ (MIMO). Each has fast CUDA/Triton/CuTe kernels.

What the paper does NOT say

Mamba-3 does not solve retrieval. On tasks like extracting facts from semi-structured data (SWDE, FDA), it still substantially underperforms Transformers. The fixed-size state fundamentally limits how much verbatim information can be recalled. The authors are candid about this: they predict SSMs will primarily be used in hybrid architectures, mixing with sparse attention layers to handle retrieval.

Also, the MIMO variant increases training FLOPs by approximately R× (with R=4). Training becomes R times slower per step. The decode speed stays the same, but you pay upfront during training for the inference-time free lunch.

Why does Mamba-3 MIMO not increase decode latency despite increasing FLOPs by 4x?

Chapter 2: SSM Foundations

Before we can understand what Mamba-3 changes, we need to understand what it starts from. State space models describe a continuous-time dynamical system with three matrices — A, B, C — that govern how a hidden state evolves in response to input.

The Continuous-Time System

The SSM is defined by a first-order ODE for the hidden state $h(t) \in \mathbb{R}^N$ and a linear readout for the output $y(t) \in \mathbb{R}$:

$$h'(t) = A(t)\,h(t) + B(t)\,x(t)$$
$$y(t) = C(t)^\top h(t)$$

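Let's trace each component:

h(t) is the hidden state: N numbers that summarize everything seen so far.
A(t) is the state transition: it sets how quickly each state dimension decays.
B(t) is the input projection: it writes the input x(t) into the state.
C(t) is the output projection: it reads the state to produce the output y(t).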

Analogy: Think of h(t) as a whiteboard. A controls how fast writing fades (exponential decay). B is the marker — it writes new information from the input. C is the camera — it reads selective parts of the whiteboard to produce the output. The key constraint: the whiteboard has finite space (N dimensions), so old information must fade to make room for new information.

Why Data-Dependent? (Selective SSMs)

In classical SSMs (S4, S5), A, B, C are fixed parameters learned during training. Every token sees the same dynamics. This is called linear time-invariant (LTI).

Mamba-1 made A, B, C data-dependent: they are projected from the current token. Now each token controls its own write pattern (B), read pattern (C), and decay rate (A). This is linear time-varying (LTV). The model can decide, for each token: "This is important, remember it" (low decay, strong write) or "This is noise, forget it" (high decay, weak write).

The cost: LTV systems lose the efficient global convolution trick of LTI systems. The recurrence must be computed step by step (or in hardware-efficient chunks via SSD).

Discretization: From ODE to Recurrence

We have discrete tokens $x_1, x_2, \ldots, x_T$, but the SSM is defined in continuous time. Discretization converts the continuous ODE into a discrete recurrence that we can compute on tokens.

Start with the ODE $h'(t) = A(t)h(t) + B(t)x(t)$. The exact solution between time $\tau_{t-1}$ and $\tau_t$ (with step size $\Delta_t = \tau_t - \tau_{t-1}$) is:

$$h(\tau_t) = \exp\!\Big(\int_{\tau_{t-1}}^{\tau_t} A(s)\,ds\Big)\, h(\tau_{t-1}) + \int_{\tau_{t-1}}^{\tau_t} \exp\!\Big(\int_{\tau}^{\tau_t} A(s)\,ds\Big)\, B(\tau)\,x(\tau)\,d\tau$$

The first term (the homogeneous solution) has a closed form once A(s) is approximated as a constant $A_t$ over the interval: $\exp(\Delta_t A_t)$. The second term (the particular integral) is harder, because it requires approximating $B(\tau)x(\tau)$ over the interval. This is where different discretization methods diverge.

Mamba-2's Discretization (Exponential-Euler)

Mamba-2 uses Euler's rule: approximate the integrand by its value at the right endpoint $\tau_t$. This gives:

$$h_t = e^{\Delta_t A_t} h_{t-1} + \Delta_t B_t x_t$$

Define $\alpha_t := e^{\Delta_t A_t} \in (0,1)$ and $\gamma_t := \Delta_t$. The recurrence becomes:

$$h_t = \alpha_t h_{t-1} + \gamma_t B_t x_t$$
$$y_t = C_t^\top h_t$$

This is Mamba-2's core. The step size $\Delta_t$ controls both forgetting ($\alpha_t$) and input scaling ($\gamma_t$): a large $\Delta_t$ forgets faster (smaller $\alpha_t$) but up-weights the current token (larger $\gamma_t$). A small $\Delta_t$ retains more past state but adds less from the new token.

The hidden assumption: Euler's rule assumes $B(\tau)x(\tau)$ is approximately constant over the entire interval $[\tau_{t-1}, \tau_t]$, equal to its value at the right endpoint. This is a first-order approximation with local truncation error $O(\Delta_t^2)$ and global error $O(\Delta_t)$. If B and x vary rapidly between tokens, the right-endpoint value is a poor stand-in for the integral. Mamba-3 addresses this.

Worked Example: Euler Discretization

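Let's trace a concrete step with N=2, $\Delta_t$=0.5, $A_t$=-1, using the same values as the Chapter 3 worked example: $h_{t-1}$ = [0.8, 0.3], $B_t$ = [1.0, 0.5], $x_t$ = 2.0.

$$\alpha_t = e^{0.5 \cdot (-1)} \approx 0.607, \qquad \gamma_t = \Delta_t = 0.5$$
$$h_t \approx 0.607 \cdot [0.8, 0.3] + 0.5 \cdot 2.0 \cdot [1.0, 0.5] \approx [0.485, 0.182] + [1.0, 0.5] = [1.485, 0.682]$$

The old state is scaled by 0.607 (a 39.3% decay), and the new token contributes with weight 0.5. Now if $C_t$ = [0.3, 0.7]:

$$y_t = 0.3 \cdot 1.485 + 0.7 \cdot 0.682 \approx 0.923$$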
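As a numeric check of this Euler step, here is a minimal NumPy sketch. The values of `h_prev`, `B`, and `x` are assumptions borrowed from the Chapter 3 worked example, and `euler_step` is a hypothetical helper name, not part of any Mamba codebase.

```python
import numpy as np

def euler_step(h_prev, A, delta, B, x):
    """One exponential-Euler step: h_t = alpha * h_prev + gamma * B * x."""
    alpha = np.exp(delta * A)   # decay factor, in (0, 1) when A < 0
    gamma = delta               # input weight under Euler's rule
    return alpha * h_prev + gamma * B * x, alpha, gamma

# Assumed illustrative values (taken from the Chapter 3 worked example).
h_prev = np.array([0.8, 0.3])
B = np.array([1.0, 0.5])
x = 2.0
C = np.array([0.3, 0.7])

h, alpha, gamma = euler_step(h_prev, A=-1.0, delta=0.5, B=B, x=x)
y = float(C @ h)   # scalar readout y_t = C_t^T h_t
print(alpha, h, y)
```

With these inputs the decay factor is about 0.607, so the old state keeps roughly 60.7% of its value.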

[Interactive figure: SSM Recurrence, Step by Step. Shows the hidden state evolving token by token, with sliders for $\Delta_t$ and $A_t$. High $\Delta$ forgets fast but writes strongly; low $\Delta$ retains but writes weakly.]

State Space Duality (SSD)

Mamba-2 showed that the recurrence above can be written in matrix form for parallel training:

$$Y = (L \odot C B^\top)\, X$$

where L is a structured mask (lower-triangular with decay terms), B and C play the roles of keys and queries, and X is the value matrix. This is the state space duality: SSMs and (masked) linear attention are the same computation in different forms. The recurrent form is efficient for inference (O(1) per token). The matrix form is efficient for training (parallelizable via matmul).
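The duality can be checked numerically on a tiny example. The sketch below assumes the mask entries are the product of decays from step j+1 to step i, times the input weight at step j, which is what unrolling the two-term recurrence gives. The function names and random test sizes are illustrative only, and this is a dense reference, not the chunked SSD algorithm.

```python
import numpy as np

def ssm_recurrent(alpha, gamma, B, C, X):
    """Sequential form: h_t = alpha_t h_{t-1} + gamma_t B_t x_t^T, y_t = C_t^T h_t."""
    T, N = B.shape
    P = X.shape[1]
    h = np.zeros((N, P))
    Y = np.zeros((T, P))
    for t in range(T):
        h = alpha[t] * h + gamma[t] * np.outer(B[t], X[t])
        Y[t] = C[t] @ h
    return Y

def ssm_matrix(alpha, gamma, B, C, X):
    """Parallel form: Y = (L * (C B^T)) X with a lower-triangular decay mask L."""
    log_cum = np.cumsum(np.log(alpha))
    # decay[i, j] = alpha_{j+1} * ... * alpha_i for j <= i (1 on the diagonal)
    decay = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
    L = decay * gamma[None, :]
    return (L * (C @ B.T)) @ X

rng = np.random.default_rng(0)
T, N, P = 6, 4, 3
alpha = rng.uniform(0.5, 0.99, size=T)
gamma = rng.uniform(0.1, 1.0, size=T)
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
X = rng.normal(size=(T, P))

Y_rec = ssm_recurrent(alpha, gamma, B, C, X)
Y_mat = ssm_matrix(alpha, gamma, B, C, X)
print(np.allclose(Y_rec, Y_mat))
```

The dense mask costs quadratic memory in T, so it is only useful as a correctness reference for small sequences.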

Mamba-3 will generalize L to include the trapezoidal terms, making it a richer mask while preserving this duality.

In Mamba-2's recurrence $h_t = \alpha_t h_{t-1} + \gamma_t B_t x_t$, what does a large $\Delta_t$ value mean?

Chapter 3: Exponential-Trapezoidal Discretization

Euler's rule approximates an integral using just one endpoint. A student of numerical methods knows there is a better option: the trapezoidal rule, which averages both endpoints. It is second-order accurate instead of first-order, meaning the error shrinks quadratically instead of linearly as the step size decreases.

But we are not in the standard setting. Our ODE has a time-varying exponential decay, and we have already solved the homogeneous part analytically. What remains is the integral of the state-input B(τ)x(τ) weighted by the exponential decay. Let us derive the trapezoidal version from scratch.

The Derivation

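Recall the solution from Chapter 2, with A(s) held constant at $A_t$ over the interval:

$$h_t \approx e^{\Delta_t A_t} h_{t-1} + \int_{\tau_{t-1}}^{\tau_t} e^{(\tau_t - \tau) A_t}\, B(\tau)\, x(\tau)\, d\tau$$

The first term needs no further approximation (we solved the homogeneous ODE analytically). The second term, the integral of the state-input weighted by exponential decay, needs to be approximated.

Step 1: Define $g(\tau) := e^{(\tau_t - \tau) A_t} B(\tau) x(\tau)$. This is the integrand we need to approximate.

Step 2: The generalized trapezoidal rule approximates the integral with a convex combination of the two endpoints:

$$\int_{\tau_{t-1}}^{\tau_t} g(\tau)\, d\tau \approx \Delta_t \big[(1 - \lambda_t)\, g(\tau_{t-1}) + \lambda_t\, g(\tau_t)\big]$$

where $\lambda_t \in [0,1]$ is a data-dependent mixing parameter. When $\lambda_t = 1$, this recovers Euler (right endpoint only). When $\lambda_t = 1/2$, this is the classical trapezoid (equal average of both endpoints).

Step 3: Evaluate the endpoints:

$$g(\tau_{t-1}) = e^{\Delta_t A_t} B_{t-1} x_{t-1}, \qquad g(\tau_t) = B_t x_t$$

Step 4: Substitute back:

$$h_t = e^{\Delta_t A_t} h_{t-1} + (1 - \lambda_t)\, \Delta_t\, e^{\Delta_t A_t} B_{t-1} x_{t-1} + \lambda_t\, \Delta_t\, B_t x_t$$

Defining the three coefficients:

$$\alpha_t := e^{\Delta_t A_t}, \qquad \beta_t := (1 - \lambda_t)\, \Delta_t\, e^{\Delta_t A_t}, \qquad \gamma_t := \lambda_t\, \Delta_t$$

The Mamba-3 recurrence becomes:

$$h_t = \alpha_t h_{t-1} + \beta_t B_{t-1} x_{t-1} + \gamma_t B_t x_t$$

The punchline: Mamba-3's recurrence has three terms instead of Mamba-2's two. The new $\beta$ term brings in the previous token's state-input, weighted by the left-endpoint evaluation. With $\lambda_t = 1/2$ this is a second-order approximation: local truncation error $O(\Delta_t^3)$ vs. Euler's $O(\Delta_t^2)$, global error $O(\Delta_t^2)$ vs. Euler's $O(\Delta_t)$.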
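The claimed orders of accuracy can be observed on a toy problem. The sketch below is an illustration under stated assumptions, not an experiment from the paper: it integrates the scalar ODE h' = a·h + sin(t) with h(0) = 0 and a fixed a = -1 (so B = 1 and x(t) = sin(t)), then compares the error at 100 and 200 steps. Halving the step should roughly halve a first-order error and quarter a second-order one.

```python
import numpy as np

def integrate(n_steps, lam, a=-1.0, T=2.0):
    """Integrate h' = a*h + sin(t), h(0) = 0, with the generalized rule.

    lam = 1.0 is exponential-Euler; lam = 0.5 is the classical trapezoid.
    """
    dt = T / n_steps
    alpha = np.exp(dt * a)
    beta = (1.0 - lam) * dt * alpha
    gamma = lam * dt
    h = 0.0
    for t in range(1, n_steps + 1):
        u_prev, u_curr = np.sin((t - 1) * dt), np.sin(t * dt)
        h = alpha * h + beta * u_prev + gamma * u_curr
    return h

def exact(a=-1.0, T=2.0):
    # Closed-form solution of h' = a*h + sin(t) with h(0) = 0.
    return (np.exp(a * T) - a * np.sin(T) - np.cos(T)) / (a * a + 1.0)

ratios = {}
for name, lam in [("euler", 1.0), ("trapezoid", 0.5)]:
    err_coarse = abs(integrate(100, lam) - exact())
    err_fine = abs(integrate(200, lam) - exact())
    ratios[name] = err_coarse / err_fine
    print(f"{name:10s} err(100)={err_coarse:.2e} err(200)={err_fine:.2e} "
          f"ratio={ratios[name]:.2f}")
```

An error ratio near 2 indicates first-order convergence and a ratio near 4 indicates second-order convergence.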

Worked Example: Trapezoidal vs. Euler

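Same setup as before: N=2, $\Delta_t$=0.5, $A_t$=-1, $\lambda_t$=0.5 (classical trapezoid).

With $h_{t-1}$ = [0.8, 0.3], $B_t$ = [1.0, 0.5], $x_t$ = 2.0, and the previous step's $B_{t-1}$ = [0.7, 0.9], $x_{t-1}$ = 1.5:

$$\alpha_t = e^{-0.5} \approx 0.607, \qquad \beta_t = 0.5 \cdot 0.5 \cdot 0.607 \approx 0.152, \qquad \gamma_t = 0.5 \cdot 0.5 = 0.25$$
$$h_t \approx [0.485, 0.182] + 0.152 \cdot 1.5 \cdot [0.7, 0.9] + 0.25 \cdot 2.0 \cdot [1.0, 0.5] \approx [1.144, 0.637]$$

Compare to Euler: $h_t$ = [0.485 + 1.0, 0.182 + 0.5] = [1.485, 0.682]. The trapezoidal version splits the input weight between the current and previous token, giving a smoother approximation. The $\beta$ term acts as a bridge between consecutive tokens.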
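The same worked example can be reproduced in a few lines. This is a minimal sketch using the values stated above; `trapezoid_coeffs` and `trapezoid_step` are hypothetical helper names. Setting `lam=1.0` zeroes the β term and recovers the Euler step.

```python
import numpy as np

def trapezoid_coeffs(A, delta, lam):
    """Return (alpha, beta, gamma) for the exponential-trapezoidal rule."""
    alpha = np.exp(delta * A)
    beta = (1.0 - lam) * delta * alpha   # weight on the previous state-input
    gamma = lam * delta                  # weight on the current state-input
    return alpha, beta, gamma

def trapezoid_step(h_prev, B_prev, x_prev, B, x, A, delta, lam):
    alpha, beta, gamma = trapezoid_coeffs(A, delta, lam)
    return alpha * h_prev + beta * B_prev * x_prev + gamma * B * x

h_prev = np.array([0.8, 0.3])
B_prev, x_prev = np.array([0.7, 0.9]), 1.5
B, x = np.array([1.0, 0.5]), 2.0

h_trap = trapezoid_step(h_prev, B_prev, x_prev, B, x, A=-1.0, delta=0.5, lam=0.5)
h_euler = trapezoid_step(h_prev, B_prev, x_prev, B, x, A=-1.0, delta=0.5, lam=1.0)
print(np.round(h_trap, 3), np.round(h_euler, 3))
```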

The Implicit Convolution Interpretation

There is an elegant alternative view: the trapezoidal discretization is equivalent to applying a width-2 convolution on the state-input $v_t = B_t x_t$ before feeding it into the linear recurrence.

In a standard SSM, you first compute $v_t = B_t x_t$, then apply the recurrence $h_t = \alpha_t h_{t-1} + \gamma_t v_t$. In Mamba-3, you first convolve: $v'_t = \beta_t v_{t-1} + \gamma_t v_t$, then apply the recurrence $h_t = \alpha_t h_{t-1} + v'_t$.
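The two views can be compared step by step. The sketch below draws random coefficients and state-inputs (all values are illustrative) and assumes there is no state-input before the first token, so the convolution is zero-padded on the left.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 8, 4
alpha = rng.uniform(0.5, 0.99, size=T)
beta = rng.uniform(0.0, 0.5, size=T)
gamma = rng.uniform(0.0, 0.5, size=T)
v = rng.normal(size=(T, N))   # state-inputs v_t = B_t x_t for scalar x_t

# View 1: three-term recurrence h_t = alpha h_{t-1} + beta v_{t-1} + gamma v_t.
h = np.zeros(N)
H_three_term = []
for t in range(T):
    v_prev = v[t - 1] if t > 0 else np.zeros(N)   # assumed zero before t = 0
    h = alpha[t] * h + beta[t] * v_prev + gamma[t] * v[t]
    H_three_term.append(h)

# View 2: width-2 causal convolution first, then the plain two-term recurrence.
v_shifted = np.vstack([np.zeros((1, N)), v[:-1]])
v_conv = beta[:, None] * v_shifted + gamma[:, None] * v
h = np.zeros(N)
H_conv = []
for t in range(T):
    h = alpha[t] * h + v_conv[t]
    H_conv.append(h)

print(np.allclose(H_three_term, H_conv))
```

Note that the kernel weights here vary per token, unlike a conventional short convolution with fixed learned taps.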

Why this matters for architecture: Mamba-2 and most linear models use an external short causal convolution (typically kernel size 4) before the SSM, applied to the raw input xt and sometimes Bt, Ct. This was believed essential for performance. Mamba-3's implicit convolution operates inside the recurrence, on the state-input Btxt rather than on xt alone. Combined with B,C biases, this internal convolution empirically replaces the external one. Mamba-3 is the first competitive SSM without an external short convolution.

The Mask Matrix View

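In the SSD parallel form $Y = (L \odot C B^\top) X$, Mamba-2 has mask L with entries $L_{ij} = \big(\prod_{k=j+1}^{i} \alpha_k\big)\, \gamma_j$ for $j \le i$ (decay product times input weight). Mamba-3's mask is richer: it factors as $L = L_1 L_2$, where $L_1$ is the lower-triangular matrix of decay products (the same decay structure as in Mamba-2) and $L_2$ is a two-band matrix with $\gamma$ on the diagonal and $\beta$ on the first subdiagonal. This two-band structure is the "convolution" in mask form.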
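To make the factorization concrete, the sketch below builds the decay matrix and the two-band matrix explicitly and checks the resulting mask against the three-term recurrence. It assumes the bands hold γ on the diagonal and β on the first subdiagonal, which follows from unrolling the recurrence, and that no state-input exists before the first token. The names and sizes are illustrative.

```python
import numpy as np

def trapezoid_recurrent(alpha, beta, gamma, B, C, X):
    """Reference: h_t = alpha h_{t-1} + beta B_{t-1} x_{t-1}^T + gamma B_t x_t^T."""
    T, N = B.shape
    P = X.shape[1]
    h = np.zeros((N, P))
    v_prev = np.zeros((N, P))   # assumed: no input before the first token
    Y = np.zeros((T, P))
    for t in range(T):
        v = np.outer(B[t], X[t])
        h = alpha[t] * h + beta[t] * v_prev + gamma[t] * v
        Y[t] = C[t] @ h
        v_prev = v
    return Y

def trapezoid_mask(alpha, beta, gamma):
    """Mask L = L1 @ L2: decay products times a two-band (beta, gamma) matrix."""
    log_cum = np.cumsum(np.log(alpha))
    L1 = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
    L2 = np.diag(gamma) + np.diag(beta[1:], k=-1)
    return L1 @ L2

rng = np.random.default_rng(1)
T, N, P = 6, 4, 3
alpha = rng.uniform(0.5, 0.99, size=T)
beta = rng.uniform(0.0, 0.5, size=T)
gamma = rng.uniform(0.1, 0.5, size=T)
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
X = rng.normal(size=(T, P))

Y_rec = trapezoid_recurrent(alpha, beta, gamma, B, C, X)
Y_mat = (trapezoid_mask(alpha, beta, gamma) * (C @ B.T)) @ X
print(np.allclose(Y_rec, Y_mat))
```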

[Interactive figure: Euler vs. Trapezoidal Approximation. Compares how Euler (right endpoint only) and the trapezoidal rule (average of endpoints) approximate the integral, with sliders for the mixing parameter $\lambda_t$ (1.0 is pure Euler, 0.5 is the classical trapezoid) and the step size $\Delta_t$.]

On the λ Parameterization

For second-order accuracy, theory requires $\lambda_t = 1/2 + O(\Delta_t)$. But the Mamba-3 authors found that a data-dependent $\lambda_t = \sigma(u_t)$ (sigmoid of a learned projection of the input token) performs best empirically, even though it does not satisfy the second-order constraint. Fixing $\lambda = 1/2$ gives 15.76 perplexity vs. 15.72 for the learned gate. Setting $\lambda = 1$ (Euler, no trapezoidal term) gives 15.81. The improvement is real but modest: about 0.1 perplexity points from the trapezoidal term alone.

The bigger win comes from the interaction with B,C biases: combining trapezoidal discretization with biases gives 15.72, down from 16.68 without either. And it removes the need for the external short convolution, simplifying the architecture.

What is the extra term in Mamba-3's recurrence compared to Mamba-2, and where does it come from?

Chapter 4: Complex-Valued States

Here is a task that sounds trivial: given a binary string like 1 0 1 1 0 1, compute the parity (even or odd number of 1s). A human can do it by keeping a running count modulo 2. An RNN with a single bit of state can do it. But Mamba-2 provably cannot.

Why? Parity requires the state to flip between two values on every 1 and stay unchanged on every 0. In state space terms, processing a "1" should rotate the state by 180 degrees: $h \to -h$. But Mamba-2's transition is $\alpha_t \in (0, 1)$, which only shrinks the state. It can never flip the sign. You would need a negative $\alpha$ or, more generally, complex eigenvalues.

The formal barrier (Grazzi et al., 2024): A linear recurrence $h_t = \alpha_t h_{t-1} + \gamma_t B_t x_t$ with $\alpha_t \in \mathbb{R}_{\ge 0}$ cannot represent functions that require modular counting. The eigenvalue spectrum of the transition matrix must include complex (or negative real) values to express rotational dynamics. Parity is the simplest such function: it requires a rotation by $\pi$ for each "1" input.

The Complex SSM

Mamba-3 makes the hidden state complex: $h(t) \in \mathbb{C}^{N/2}$ instead of $\mathbb{R}^N$. The continuous ODE becomes:

$$h'(t) = \mathrm{Diag}\big(A(t) + i\theta(t)\big)\, h(t) + \big(B(t) + i\hat{B}(t)\big)\, x(t)$$
$$y(t) = \mathrm{Re}\Big[\big(C(t) + i\hat{C}(t)\big)^\top h(t)\Big]$$

The key addition is $\theta(t)$, a data-dependent rotation angle for each state dimension. While A(t) controls decay (real part), $\theta(t)$ controls rotation (imaginary part). Now the state can oscillate, not just decay.

Complex to Real Equivalence (Proposition 2)

Complex arithmetic is expensive on GPUs. But there is a beautiful equivalence: a complex SSM with state dimension N/2 is exactly equal to a real-valued SSM with state dimension N whose transition matrix contains 2×2 rotation blocks.

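Let us derive this for a single complex state dimension. Write the complex state as $h_t + i\hat{h}_t$, with real part $h_t$ and imaginary part $\hat{h}_t$. The complex update:

$$h_t + i\hat{h}_t = e^{\Delta_t (A_t + i\theta_t)} \big(h_{t-1} + i\hat{h}_{t-1}\big) + \Delta_t \big(B_t + i\hat{B}_t\big)\, x_t$$

Expand the exponential: $e^{\Delta_t (A_t + i\theta_t)} = e^{\Delta_t A_t} \big(\cos(\Delta_t \theta_t) + i \sin(\Delta_t \theta_t)\big)$.

Separate real and imaginary parts. In matrix form, this gives:

$$\begin{bmatrix} h_t \\ \hat{h}_t \end{bmatrix} = e^{\Delta_t A_t}\, R(\Delta_t \theta_t) \begin{bmatrix} h_{t-1} \\ \hat{h}_{t-1} \end{bmatrix} + \Delta_t \begin{bmatrix} B_t \\ \hat{B}_t \end{bmatrix} x_t$$

where $R(\varphi)$ is the 2×2 rotation matrix:

$$R(\varphi) = \begin{bmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{bmatrix}$$

For N/2 complex state dimensions, we get N real state dimensions with a block-diagonal transition made of N/2 such rotation blocks, each scaled by $e^{\Delta_t A_t}$.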
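The equivalence for one complex state dimension is easy to confirm numerically. The sketch below runs the complex update and the real 2×2 rotation form side by side on random, purely illustrative parameters. The function names are hypothetical.

```python
import numpy as np

def complex_step(h, A, theta, delta, B, x):
    """Complex update for one state dimension (h and B are complex scalars)."""
    return np.exp(delta * (A + 1j * theta)) * h + delta * B * x

def rotation(phi):
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

def real_step(h2, A, theta, delta, B2, x):
    """Same update on the real pair [Re h, Im h] using a scaled 2x2 rotation."""
    return np.exp(delta * A) * (rotation(delta * theta) @ h2) + delta * B2 * x

rng = np.random.default_rng(0)
h_c = 0.0 + 0.0j
h_r = np.zeros(2)
for _ in range(5):
    A = -rng.uniform(0.1, 1.0)
    theta = rng.uniform(-np.pi, np.pi)
    delta = rng.uniform(0.1, 1.0)
    x = rng.normal()
    B = rng.normal() + 1j * rng.normal()
    h_c = complex_step(h_c, A, theta, delta, B, x)
    h_r = real_step(h_r, A, theta, delta, np.array([B.real, B.imag]), x)

print(np.allclose([h_c.real, h_c.imag], h_r))
```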

The rotation enables parity: To flip the state on every "1", set $\theta_t = \pi \cdot x_t$ and pick the step size so that $\Delta_t \theta_t = \pi$ whenever $x_t = 1$ (for example, $\Delta_t = 1$). The rotation matrix then becomes $R(\pi) = [\cos\pi, -\sin\pi;\ \sin\pi, \cos\pi] = [-1, 0;\ 0, -1]$. The state flips sign. On $x_t = 0$, the rotation angle is 0 and the state is unchanged. This is exactly the parity computation.
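Here is a hand-built illustration of that parity mechanism. It is not a trained model: the sketch assumes no decay (a decay factor of 1), a rotation angle of π per "1" bit, and a fixed starting state [1, 0]. Parity is read off the sign of the first state coordinate.

```python
import numpy as np

def rotation(phi):
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

def parity_via_rotation(bits):
    """Track parity with a 2D state rotated by pi on each 1 and by 0 on each 0."""
    h = np.array([1.0, 0.0])           # assumed initial state
    for b in bits:
        h = rotation(np.pi * b) @ h    # angle Delta_t * theta_t = pi * x_t
    return 0 if h[0] > 0 else 1        # +1 means even, -1 means odd

bits = [1, 0, 1, 1, 0, 1]              # the example string from this chapter
print(parity_via_rotation(bits), sum(bits) % 2)
```

A real-valued transition in (0, 1) could only shrink `h[0]` toward zero, so its sign would never carry the parity.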

The RoPE Trick (Proposition 3)

Computing the block-diagonal rotation explicitly requires storing and multiplying N×N matrices at each step. Expensive. But there is a trick.

Unroll the recurrence for T steps. At step t, $B_t$ gets multiplied by the cumulative product of all rotations from step 0 to t: $\big(\prod_{i=0}^{t} R_i^\top\big) B_t$. Similarly, $C_t$ gets $\big(\prod_{i=0}^{t} R_i^\top\big) C_t$.

This is precisely a rotary position embedding (RoPE) applied to B and C! In the SSD duality, B corresponds to keys and C to queries. So the complex SSM is equivalent to applying data-dependent RoPE to the Q and K components of the attention-like computation.

The crucial difference from standard RoPE: the rotation angles are data-dependent ($\theta_t$ is projected from the input token), not fixed ($\theta_i = 10000^{-2i/N}$ in vanilla RoPE). Data-dependent rotations can modulate based on content, enabling state tracking.

In practice, the RoPE computation is trivially efficient: pair up the state dimensions, compute cumulative sums of angles, and apply rotations with the standard RoPE formula. This adds negligible overhead.
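The sketch below checks the RoPE trick on a small random example. It compares a recurrence that rotates the state explicitly at each step against a rotation-free recurrence whose B and C have been rotated by the negated cumulative angle, which is the transposed cumulative rotation. It uses the two-term Euler recurrence with scalar inputs for brevity. All names, shapes, and values are illustrative assumptions, not the paper's kernel.

```python
import numpy as np

def rotate_pairs(v, phi):
    """Rotate each consecutive pair (v[2k], v[2k+1]) by the angle phi[k]."""
    a, b = v[0::2], v[1::2]
    out = np.empty_like(v)
    out[0::2] = np.cos(phi) * a - np.sin(phi) * b
    out[1::2] = np.sin(phi) * a + np.cos(phi) * b
    return out

def explicit_rotation(alpha, gamma, phi, B, C, x):
    """h_t = alpha_t R_t h_{t-1} + gamma_t B_t x_t, with the rotation in the transition."""
    T, N = B.shape
    h, y = np.zeros(N), np.zeros(T)
    for t in range(T):
        h = alpha[t] * rotate_pairs(h, phi[t]) + gamma[t] * B[t] * x[t]
        y[t] = C[t] @ h
    return y

def rope_form(alpha, gamma, phi, B, C, x):
    """Rotation-free recurrence on B, C rotated by minus the cumulative angle."""
    T, N = B.shape
    cum_angle = np.cumsum(phi, axis=0)          # shape (T, N/2)
    h, y = np.zeros(N), np.zeros(T)
    for t in range(T):
        B_rot = rotate_pairs(B[t], -cum_angle[t])
        C_rot = rotate_pairs(C[t], -cum_angle[t])
        h = alpha[t] * h + gamma[t] * B_rot * x[t]
        y[t] = C_rot @ h
    return y

rng = np.random.default_rng(0)
T, N = 7, 4
alpha = rng.uniform(0.5, 0.99, size=T)
gamma = rng.uniform(0.1, 1.0, size=T)
phi = rng.uniform(-np.pi, np.pi, size=(T, N // 2))   # per-step angles Delta_t * theta_t
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
x = rng.normal(size=T)

y_explicit = explicit_rotation(alpha, gamma, phi, B, C, x)
y_rope = rope_form(alpha, gamma, phi, B, C, x)
print(np.allclose(y_explicit, y_rope))
```

The only extra work in the second form is a cumulative sum of angles and an elementwise rotation of B and C, which is why the overhead is small.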

Experimental Proof

The results are stark (Table 3b of the paper):

| Model | Parity | Arith (no brackets) | Arith (brackets) |
|---|---|---|---|
| Mamba-3 | 100.00% | 98.51% | 87.75% |
| Mamba-3 (std RoPE) | 1.56% | 20.70% | 2.62% |
| Mamba-3 (no RoPE) | 2.27% | 1.49% | 0.72% |
| Mamba-2 | 0.90% | 47.81% | 0.88% |
| GDN [-1,1] | 100.00% | 99.25% | 93.50% |

Mamba-3 with data-dependent RoPE: perfect parity, near-perfect modular arithmetic. Without it (or with standard position-based RoPE): chance level. This confirms the theory: complex eigenvalues enable rotational dynamics that real eigenvalues cannot.

[Interactive figure: Parity via Rotation, Real vs. Complex SSM. A real-valued SSM (left) and a complex SSM (right) attempt parity on the same binary string. The real SSM's state decays monotonically and cannot flip. The complex SSM rotates by $\pi$ on each "1", tracking parity exactly.]

Complex + Trapezoidal Together (Proposition 4)

Combining the complex SSM with exponential-trapezoidal discretization, the full Mamba-3 recurrence is:

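$$h_t = \alpha_t h_{t-1} + \beta_t \Big(\prod_{i=0}^{t-1} R_i^\top\Big) B_{t-1} x_{t-1} + \gamma_t \Big(\prod_{i=0}^{t} R_i^\top\Big) B_t x_t$$

$$y_t = \Big[\Big(\prod_{i=0}^{t} R_i^\top\Big) C_t\Big]^\top h_t$$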

Both the β term (left endpoint) and γ term (right endpoint) get their own rotation applied. The output also gets rotated. In practice, B and C are rotated via the RoPE trick before the recurrence, so the recurrence itself is identical in form to the real-valued case — just with rotated inputs.

What does the imaginary part θ(t) of the complex SSM eigenvalue control, and why is data-dependent θ essential?

Chapter 5: Multi-Input, Multi-Output (MIMO)

We have improved quality (trapezoidal discretization) and capability (complex states). Now let us fix hardware utilization.

The Underutilization Problem

Consider a single head of a SISO (single-input, single-output) SSM with head dimension P and state size N. The decode step involves:

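  1. State update: $h_t \leftarrow \alpha_t h_{t-1} + B_t x_t^\top$, where $h_t \in \mathbb{R}^{N \times P}$, $B_t \in \mathbb{R}^N$, $x_t \in \mathbb{R}^P$
  2. Output: $y_t = C_t^\top h_t$, where $C_t \in \mathbb{R}^N$, $y_t \in \mathbb{R}^P$

The state update uses an outer product $B_t x_t^\top$ (N × P FLOPs). But the memory traffic is dominated by loading $h_{t-1}$ (N × P elements) and storing $h_t$ (N × P elements). So:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes}} \approx \frac{5NP - P}{2(1 + 2N + NP + P)} \approx 2.5 \text{ ops/byte}$$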

The H100's tensor cores can sustain ~295 ops/byte for bfloat16 matmul. We are at 2.5. That means 99% of compute is idle during the decode step — the GPU waits for memory, not arithmetic.

The paradox of efficient models: SSMs are "efficient" because they use fewer FLOPs. But on modern GPUs, the bottleneck is memory bandwidth, not FLOPs. Using fewer FLOPs actually hurts because we cannot overlap computation with memory access. We are memory-bound when we want to be compute-bound.

The MIMO Solution

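MIMO adds a rank dimension R to the state-input computation. Instead of:

$$B_t \in \mathbb{R}^{N}, \quad x_t \in \mathbb{R}^{P}, \quad B_t x_t^\top \in \mathbb{R}^{N \times P} \text{ (outer product)}$$

We use:

$$B_t \in \mathbb{R}^{N \times R}, \quad x_t \in \mathbb{R}^{P \times R}, \quad B_t x_t^\top \in \mathbb{R}^{N \times P} \text{ (matrix product)}$$

Similarly, $C_t$ goes from $\mathbb{R}^N$ to $\mathbb{R}^{N \times R}$, and $y_t$ from $\mathbb{R}^P$ to $\mathbb{R}^{P \times R}$.

The key insight: memory traffic barely increases. The state $h_t \in \mathbb{R}^{N \times P}$ stays the same size because the MIMO rank R does not expand the state. What increases are the projections $B_t$, $C_t$, $x_t$, but these are small compared to the state. So FLOPs go up by roughly R× while memory I/O stays roughly constant. Arithmetic intensity increases by roughly R×.

$$\text{MIMO Arithmetic Intensity} \approx \frac{4NPR + NP - PR}{2(1 + 2NR + NP + PR)} = \Theta(R) \quad \text{for } R \ll N, P$$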

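With R=4: arithmetic intensity rises from ~2.5 to roughly 7-10 ops/byte (the formula above gives about 7 at N = P = 64, because the B, C, and x traffic also grows with R). That is still far from the 295 peak, but 3-4× better utilization is meaningful, and it comes nearly for free in wall-clock time because the extra matmul FLOPs overlap with the largely unchanged memory I/O.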
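The two intensity formulas above are easy to evaluate directly. The sketch below plugs in N = P = 64 (an illustrative assumption matching the simulation defaults on this page) and sweeps R. The function names are hypothetical.

```python
def siso_intensity(N, P):
    """SISO decode arithmetic intensity in ops/byte, per the formula above."""
    return (5 * N * P - P) / (2 * (1 + 2 * N + N * P + P))

def mimo_intensity(N, P, R):
    """MIMO decode arithmetic intensity in ops/byte, per the formula above."""
    return (4 * N * P * R + N * P - P * R) / (2 * (1 + 2 * N * R + N * P + P * R))

N, P = 64, 64
base = siso_intensity(N, P)
print(f"SISO: {base:.2f} ops/byte")
for R in (1, 2, 4, 8):
    value = mimo_intensity(N, P, R)
    print(f"MIMO R={R}: {value:.2f} ops/byte ({value / base:.2f}x SISO)")
```

For these sizes the ratio at R = 4 comes out near 3× rather than a full 4×, because the projection traffic in the denominator also scales with R. At R = 1 the MIMO formula reduces to the SISO one.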

The Signal Processing View

Why "multi-input, multi-output"? In signal processing, a SISO system has one input channel and one output channel. The SSM takes scalar xt in and produces scalar yt out, replicated across P head dimensions by stacking. A MIMO system has R input channels and R output channels. Each of the R inputs has its own B projection and contributes to the state independently. Each of the R outputs reads the state through its own C projection.

Formally, a MIMO SSM with rank R is equivalent to R SISO SSMs that share the same decay $\alpha_t$ and whose states are summed into one shared state:

$$h_t^{(j)} \leftarrow \alpha_t h_{t-1}^{(j)} + B_t^{(j)} \big(x_t^{(j)}\big)^\top, \qquad h_t = \sum_{j=0}^{R-1} h_t^{(j)}$$
$$y_t^{(i)} = \big(C_t^{(i)}\big)^\top h_t$$

The outer product becomes a matrix product. This is computationally friendly because the matmul can use tensor cores, which were sitting idle in the SISO case.
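A short NumPy sketch makes the "sum of R outer products equals one matmul" point concrete. The shapes follow the text (B and C are N×R, x is P×R); the random values, the fixed decay of 0.9, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, R = 8, 6, 4
alpha = 0.9                        # shared scalar decay (illustrative value)
h_prev = rng.normal(size=(N, P))
B = rng.normal(size=(N, R))
X = rng.normal(size=(P, R))
C = rng.normal(size=(N, R))

# MIMO update as one matrix product: (N x R) @ (R x P) -> (N x P).
h_mimo = alpha * h_prev + B @ X.T

# Equivalent view: R rank-1 (SISO-style) outer products summed into one state.
h_sum = alpha * h_prev + sum(np.outer(B[:, j], X[:, j]) for j in range(R))

# R outputs, each reading the shared state through its own C column.
Y = h_mimo.T @ C                   # shape (P, R); column i is (C^(i))^T h_t

print(np.allclose(h_mimo, h_sum), h_mimo.shape, Y.shape)
```

The state shape stays N×P regardless of R, which is why the state read and write traffic does not grow with the rank.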

[Interactive figure: SISO vs. MIMO Arithmetic Intensity. Left: SISO decode step, where the outer product uses few FLOPs relative to memory traffic. Right: MIMO decode step, where the matrix multiply uses R× more FLOPs with barely more memory traffic. Sliders: MIMO rank R (default 4), state size N (default 64), head dimension P (default 64). A dashed line marks where tensor cores become meaningfully utilized.]

Training Cost of MIMO

MIMO is nearly free at decode time but not at training time. The sequence-to-sequence form requires calling the SISO training algorithm $R^2$ times naively (all R×R cross terms). However, by exploiting the chunked SSD algorithm structure, this reduces to R× overhead: the intra-chunk computation maintains the SISO chunk size while increasing the number of chunks by R. With R=4, training is about 4× slower per step for the SSM layers.

To keep total parameter count constant, Mamba-3 shrinks the MLP hidden dimension to compensate for the extra MIMO projections. At 1.5B: SISO MLP dim is 4096, MIMO MLP dim is 3824. The parameter counts match, but the MIMO variant distributes more capacity into the SSM and less into the MLP.

Kernel Implementation

The decode kernel is implemented in CuTe-DSL (NVIDIA's tensor core DSL), not Triton, because the MIMO matrix multiply requires explicit tensor core scheduling. The forward (prefill) path uses Triton (SISO) or TileLang (MIMO). The kernel fusion structure fuses the RoPE rotation, SSM update, gating, and MIMO projection into a single kernel to minimize memory round-trips.

Kernel fusion summary: Forward SISO: IP → SSM+Rotary → Gate+OP (Triton). Forward MIMO: IP → SSM+Rotary → Gate+OP (TileLang). Decode SISO: IP → Rotary → SSM+Gate → OP (CuTe + Triton). Decode MIMO: IP → Rotary → SSM+Gate+MIMO → OP (CuTe + Triton). IP = input projection, OP = output projection.
Why does MIMO with rank R=4 not significantly increase decode latency despite 4x more FLOPs?

Chapter 6: Architecture

We have the three core innovations. Now let us assemble the full Mamba-3 block and see how it differs from its predecessor.

Overall Model Structure

The macro-architecture follows Llama: alternating sequence-mixing blocks and feedforward (SwiGLU MLP) blocks with pre-norm (RMSNorm). This is the same skeleton used by most modern language models — the innovation is in what goes inside the sequence-mixing block.

Mamba-3 Block vs. Mamba-2 Block

Mamba-2 Block
Input → Linear projection (expand 2×) → Short Conv (k=4) → SiLU activation → Δ,B,C projection → SSD (Euler recurrence) → Gate → RMSNorm → Output projection
↓ becomes ↓
Mamba-3 Block
Input → Linear projection (expand 2×) → Δ,A,B,C,λ projection → BCNorm → B,C Biases → Data-dependent RoPE → Exp-Trapezoidal SSM (+ optional MIMO) → Gate → Output projection

Notice what is removed: the short causal convolution and its SiLU activation, and the post-gate RMSNorm. Notice what is added: BCNorm, B/C biases, data-dependent RoPE (complex SSM), and the trapezoidal β term.

Key Architectural Details

BCNorm (QKNorm): RMSNorm is applied to B and C after projection, before biases. This mirrors the Q/K normalization used in modern Transformers (since B ↔ K and C ↔ Q in the SSD duality). BCNorm stabilizes training enough that the post-gate RMSNorm can be removed in pure Mamba-3 models. However, for hybrid models (Mamba-3 + attention layers), a pre-gate grouped RMSNorm is reintroduced for better length generalization.

B, C Biases: Learnable, head-specific, channel-wise bias vectors added to B and C after BCNorm. Initialized to all-ones. These biases create data-independent components in B and C, making the SSM behave partly like a fixed convolution alongside its data-dependent behavior. Yu & Erichson (2025) proved that adding channel-specific bias to B grants universal approximation capabilities to block-biased Mamba.

Data-dependent A: Mamba-3 makes At data-dependent (projected from the token) rather than a fixed parameter. This keeps all SSM parameters consistently data-dependent. Empirically, this does not change performance, but it simplifies the parameterization.

No more short convolution: The combination of (1) exponential-trapezoidal discretization (implicit width-2 convolution on state-input) and (2) B,C biases (data-independent component) empirically makes the external short causal convolution redundant. Ablation: Mamba-3 with conv gets 15.85 ppl, without conv gets 15.72 ppl. The conv actually hurts. This is architecturally significant — the short conv was previously considered essential for all recurrent models.

Data Flow Through the Block

Let us trace a single token $x \in \mathbb{R}^D$ through the Mamba-3 block (D = model dimension, H = number of heads, P = head dimension, N = state size):

  1. Input projection: $x \to [z, x']$ via $W_{in} \in \mathbb{R}^{D \times (D + \text{expand} \cdot D)}$. $z \in \mathbb{R}^D$ is the gate. $x' \in \mathbb{R}^{\text{expand} \cdot D}$ is the SSM input (expand=2 by default).
  2. SSM parameter projection: $x' \to \Delta_t, A_t, B_t, C_t, \lambda_t$ via separate linear layers. $B_t, C_t \in \mathbb{R}^{H \times N}$ (per-head). $\Delta_t, A_t, \lambda_t \in \mathbb{R}^H$ (scalar per head).
  3. BCNorm: $B_t \to \mathrm{RMSNorm}(B_t)$, $C_t \to \mathrm{RMSNorm}(C_t)$ (per-head normalization).
  4. Biases: $B_t \to B_t + b_B$, $C_t \to C_t + b_C$ (learnable, head-specific, shape $\mathbb{R}^{H \times N}$).
  5. RoPE rotation: Compute cumulative rotation angles $\Theta_t = \sum_{i=0}^{t} \Delta_i \theta_i$. Apply rotation to $B_t$ and $C_t$ using the standard RoPE formula (pair up state dimensions, apply 2D rotation).
  6. SSM recurrence: $h_t = \alpha_t h_{t-1} + \beta_t B_{t-1} x'_{t-1} + \gamma_t B_t x'_t$ (where x' is reshaped to [H, P] across heads). Output: $y_t = C_t^\top h_t \in \mathbb{R}^{H \times P}$.
  7. Gate and output: $o_t = \mathrm{SiLU}(z) \odot \mathrm{flatten}(y_t)$. Output projection: $W_{out} \cdot o_t \in \mathbb{R}^D$.

For MIMO, step 6 changes: $B_t \in \mathbb{R}^{H \times N \times R}$, $x_t \in \mathbb{R}^{H \times P \times R}$, the outer product becomes a matmul, and intermediate outputs go through a SiLU residual before the final output projection.

[Interactive figure: Mamba-3 Architecture Diagram. A block diagram showing tensor shapes and data types per component. Blue = linear projection, green = SSM-specific, orange = new in Mamba-3.]

Model Configurations

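| Scale | Layers | d_model | Heads | d_state | Head dim | MLP dim (SISO) | MLP dim (MIMO R=4) |
|---|---|---|---|---|---|---|---|
| 180M | 12 | 768 | 6 | 64 | 128 | 1,500 | 1,264 |
| 440M | 24 | 1024 | 8 | 64 | 128 | 2,048 | 1,792 |
| 880M | 32 | 1536 | 12 | 64 | 128 | 3,072 | 2,800 |
| 1.5B | 24 | 2048 | 16 | 64 | 128 | 4,096 | 3,824 |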

All models use expand factor 2, Llama-3.1 tokenizer, 2K context length during pre-training, and 100B tokens from FineWeb-Edu. Note the MIMO MLP dimension is smaller to match parameter counts with SISO.

What two changes in Mamba-3 make the external short causal convolution unnecessary?

Chapter 7: Results

Numbers matter. Mamba-3 was evaluated at four scales (180M, 440M, 880M, 1.5B) on language modeling perplexity and seven downstream tasks. Let us walk through the key findings.

Downstream Language Modeling (Table 1)

At 1.5B scale (the largest), the average downstream accuracy across seven tasks:

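| Model | PPL ↓ | LAMBADA | HellaSwag | PIQA | ARC-E | ARC-C | WinoGr. | OBQA | Avg ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Transformer | 10.51 | 50.3 | 60.6 | 73.8 | 74.0 | 40.4 | 58.7 | 29.6 | 55.4 |
| Gated DeltaNet | 10.45 | 49.2 | 61.3 | 74.3 | 75.3 | 41.2 | 58.0 | 31.6 | 55.8 |
| Mamba-2 | 10.47 | 47.8 | 61.4 | 73.6 | 75.3 | 41.8 | 57.5 | 32.6 | 55.7 |
| Mamba-3 SISO | 10.35 | 49.4 | 61.9 | 73.6 | 75.9 | 42.7 | 59.4 | 32.0 | 56.4 |
| Mamba-3 MIMO | 10.24 | 51.7 | 62.3 | 75.3 | 76.5 | 44.5 | 60.6 | 32.6 | 57.6 |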

Key takeaways:

The Pareto Frontier: State Size vs. Quality

The most impactful result is Figure 2 of the paper: plotting perplexity against state size (a proxy for inference speed). Mamba-3 with state size 64 matches Mamba-2 with state size 128 — same quality at half the state size, meaning half the memory and half the per-token latency.

MIMO pushes the frontier further: even at state size 64, MIMO R=4 achieves better perplexity than Mamba-2 at state size 128. More quality for less inference cost.

State Size vs. Perplexity Pareto Frontier

Each point is a model variant. Lower-left is better (lower perplexity, smaller state). Mamba-3 shifts the entire curve down and left. MIMO shifts it further down without moving right (same state size, better quality).

Kernel Latency (Table 4)

Practical decode speed matters as much as theoretical efficiency. At 1.5B with batch 128 on a single H100:

Model | BF16, d_state=64 | BF16, d_state=128
Mamba-2 | 0.127 ms | 0.203 ms
Gated DeltaNet | 0.176 ms | 0.257 ms
Mamba-3 SISO | 0.110 ms | 0.156 ms
Mamba-3 MIMO R=4 | 0.137 ms | 0.179 ms

Mamba-3 SISO is the fastest in both configurations, faster even than Mamba-2's reference kernels. MIMO R=4 adds roughly 15-25% latency over SISO (0.137 vs. 0.110 ms at d_state=64, 0.179 vs. 0.156 ms at d_state=128) despite 4× more FLOPs, which supports the memory-bound argument from Chapter 5.
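The relative overheads can be recomputed directly from the kernel latency table. The snippet below is only arithmetic on the reported numbers; the dictionary layout and function name are illustrative.

```python
# Kernel latency in ms, copied from the table above (1.5B, batch 128, BF16).
latency_ms = {
    "Mamba-2": {64: 0.127, 128: 0.203},
    "Gated DeltaNet": {64: 0.176, 128: 0.257},
    "Mamba-3 SISO": {64: 0.110, 128: 0.156},
    "Mamba-3 MIMO R=4": {64: 0.137, 128: 0.179},
}

def relative_overhead(model, baseline, d_state):
    """Fractional extra latency of `model` over `baseline` at a given state size."""
    return latency_ms[model][d_state] / latency_ms[baseline][d_state] - 1.0

for d_state in (64, 128):
    fastest = min(latency_ms, key=lambda name: latency_ms[name][d_state])
    mimo = relative_overhead("Mamba-3 MIMO R=4", "Mamba-3 SISO", d_state)
    print(f"d_state={d_state}: fastest={fastest}, MIMO overhead over SISO={mimo:.1%}")
```

The MIMO overhead comes out a little under 25% at d_state=64 and a little under 15% at d_state=128, well below the 4× increase in FLOPs.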

End-to-End Latency (Table 12)

At 4096 tokens, total decode time for 1.5B models:

All recurrent models are 1.5-1.7× faster than the attention baseline. Mamba-3 SISO is the fastest. At 16K tokens, the gap widens to 2.3×.

Retrieval: The Honest Weakness

On real-world retrieval (SWDE, FDA), Mamba-3 scores 28.5 and 23.4 vs. the Transformer's 48.9 and 58.4. This is a fundamental limitation of fixed-state-size models — they cannot store and recall arbitrary key-value pairs from long context the way attention can.

The mitigation: hybrid models. A 5:1 ratio of Mamba-3 layers to NoPE attention layers recovers most retrieval capability (58.5 on SWDE) while keeping the efficiency benefits. This is how the authors expect Mamba-3 to be deployed in practice.
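A 5:1 hybrid can be described as a simple layer schedule. The sketch below is one possible layout, assuming the attention layer closes each group of six. The text above gives only the ratio, so the placement, the function name, and the layer labels are all assumptions made for illustration.

```python
def hybrid_schedule(n_layers, ssm_per_attention=5):
    """Label each layer so that every (ssm_per_attention + 1)-th layer is attention.

    Placement of the attention layer inside each group is an assumption;
    only the 5:1 ratio comes from the text.
    """
    period = ssm_per_attention + 1
    return ["attention" if (i + 1) % period == 0 else "mamba3"
            for i in range(n_layers)]

schedule = hybrid_schedule(24)
print(schedule.count("mamba3"), "Mamba-3 layers,",
      schedule.count("attention"), "attention layers")
```

With 24 layers this yields 20 Mamba-3 layers and 4 attention layers, so only the attention layers carry a KV cache that grows with sequence length.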

The honest picture: Mamba-3 beats Mamba-2 and GDN on perplexity and average downstream accuracy, though not on every individual task. It matches or beats the Transformer on both of those aggregate measures. But it still needs attention layers for retrieval-heavy tasks. The future is hybrid, not pure SSM.

At the 1.5B scale, what is the relationship between Mamba-3 (d_state=64) and Mamba-2 (d_state=128)?

Chapter 8: State Tracking & Capabilities

Chapter 4 showed the theory: complex eigenvalues enable rotational dynamics. Chapter 7 showed Mamba-3 beats baselines on standard benchmarks. This chapter digs deeper into what Mamba-3 can do that its predecessors cannot — the qualitative capability gains from the complex SSM.

The Chomsky Hierarchy Tests

Following Grazzi et al. (2025), Mamba-3 was evaluated on three tasks from the Chomsky hierarchy of formal languages — each requiring increasingly sophisticated state-tracking:

1. Parity (regular language): Given a binary string, output whether the number of 1s is even or odd. Requires: a single bit of state that flips on every "1". Mamba-3: 100%. Mamba-2: 0.9% (random chance). The complex SSM can represent a 180-degree rotation (sign flip) that real-valued eigenvalues cannot.

2. Modular Arithmetic without brackets (regular language): Evaluate expressions like "3 + 2 × 4 mod 5" left-to-right. Requires: maintaining a running total modulo 5 and applying operations sequentially. Mamba-3: 98.51%. Mamba-2: 47.81%. The rotation enables modular counting, but the intermediate operations (addition, multiplication) also need to be tracked in the state. Complex eigenvalues allow richer state transitions for this.

3. Modular Arithmetic with brackets (context-free language): Evaluate "(3 + (2 × 4)) mod 5" with nested parentheses. Requires: a stack to track nesting depth and intermediate results. Mamba-3: 87.75%. GDN [-1,1]: 93.50%. This is the hardest task. Mamba-3 nearly closes the gap but doesn't quite match GDN, which benefits from its delta-rule-style memory update for handling nesting.
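The link between rotation and modular counting in task 2 can be made concrete with a toy example. It tracks a running sum modulo 5 by rotating a unit vector (written as a complex number) by 2π·d/5 for each digit d. This is a hand-built illustration of the mechanism, not what a trained model is claimed to learn, and it covers addition only. The function name is hypothetical.

```python
import cmath
import math

def running_sum_mod(digits, m=5):
    """Track sum(digits) mod m purely through rotations of a unit vector."""
    state = 1 + 0j  # unit vector at angle 0
    for d in digits:
        # data-dependent rotation: the angle depends on the token value d
        state *= cmath.exp(2j * math.pi * d / m)
    angle = cmath.phase(state) % (2 * math.pi)    # map phase into [0, 2*pi)
    return round(angle * m / (2 * math.pi)) % m   # nearest multiple of 2*pi/m

print(running_sum_mod([3, 4, 2]))  # 3 + 4 + 2 = 9, and 9 mod 5 = 4
```

The state never grows: however long the digit sequence is, it stays a single point on the unit circle, which is why this kind of counting fits in a fixed-size state.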

Why Standard RoPE Fails

A natural question: why not just apply standard RoPE (fixed-frequency rotation) to Mamba-2? The answer is in the table: Mamba-3 with standard RoPE scores 1.56% on parity, in the same chance-level range as Mamba-2's 0.9%. Standard RoPE rotates by fixed amounts at each position (determined by the position index, not the token value). This is useful for encoding position but useless for encoding content-dependent state transitions.

Parity requires rotating by π when the token is "1" and by 0 when it is "0". The rotation must be data-dependent: the angle depends on what the token is, not where it appears. This is exactly what Mamba-3's complex SSM provides: θt is projected from the input token, so the model learns to set θ = π/Δ for "1" and θ = 0 for "0".
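The parity strategy described above can be written out with an explicit 2×2 rotation. The sketch assumes a fixed step size Δ and a hand-set token-to-angle mapping (θ = π/Δ for "1", θ = 0 for "0") standing in for the learned projection. All names are hypothetical.

```python
import math

def rotate(v, phi):
    """Apply the 2D rotation R(phi) to the vector v = (v0, v1)."""
    c, s = math.cos(phi), math.sin(phi)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def parity_by_rotation(bits, delta=0.1):
    """Return 0 for an even number of 1s and 1 for an odd number."""
    theta_for = {0: 0.0, 1: math.pi / delta}  # hand-set stand-in for the learned projection
    state = (1.0, 0.0)
    for b in bits:
        state = rotate(state, delta * theta_for[b])  # rotate by pi on "1", by 0 on "0"
    return 0 if state[0] > 0 else 1  # the sign of the first coordinate encodes parity

print(parity_by_rotation([1, 0, 1, 1]))  # three 1s, so odd
```

Because each "1" is an exact sign flip, the same rule works at any sequence length, including lengths far beyond those used to choose it.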

The critical distinction: Standard RoPE = position-dependent, data-independent rotation. Complex SSM = data-dependent, position-independent rotation. State tracking requires data-dependent rotations. Position encoding requires position-dependent rotations. These are fundamentally different capabilities that happen to use the same mathematical mechanism (2D rotation matrices).

Length Generalization

The models were trained on short sequences (minimum 3, maximum 40-160 via curriculum) and evaluated at length 256. Mamba-3 generalizes well: 100% parity at 256 tokens despite never seeing sequences that long during training. This suggests the learned rotation strategy (rotate by π on "1") is position-independent and composition-friendly.

Context Length Extrapolation

On standard language modeling (Figure 4), Mamba-3 extrapolates gracefully from its 2K training length to 32K tokens, with perplexity improving steadily. Mamba-2 degrades after 8K. The complex state's rotation (which embeds positional information through data-dependent angles rather than fixed position encodings) may contribute to this robustness.

State Tracking: Modular Arithmetic Simulator

Simulation of how the SSM state evolves while evaluating a modular arithmetic expression. The complex SSM tracks the running value modulo 5 via rotations, with the real-valued SSM shown for comparison.

What Mamba-3 Still Cannot Do

Nested modular arithmetic (with brackets) reaches 87.75% but not 100%. The stack-like memory required for deep nesting strains the fixed-size state. True stack simulation requires unbounded memory, which no fixed-state model can provide.

On practical tasks: retrieval from semi-structured data (SWDE: 28.5%, FDA: 23.4%) remains weak. The model can track abstract state (parity, arithmetic) but struggles to recall specific facts verbatim from long context. This aligns with the theoretical distinction: state tracking is about abstracting information (reducing a sequence to a summary), while retrieval is about preserving information (storing exact values for later recall). SSMs excel at the former; attention excels at the latter.

Why does applying standard (position-dependent) RoPE to Mamba-2 fail to solve parity, while Mamba-3's data-dependent rotation succeeds?

Chapter 9: Connections

Cheat Sheet: Every Key Equation

Equation | What it means | When to use
h_t = α_t h_{t-1} + γ_t B_t x_t | Mamba-2 recurrence (Euler) | Baseline comparison
h_t = α_t h_{t-1} + β_t B_{t-1} x_{t-1} + γ_t B_t x_t | Mamba-3 recurrence (trapezoidal) | Core recurrence
α_t = e^{Δ_t A_t} | Decay factor | α ∈ (0,1), controls forgetting
β_t = (1 - λ_t) Δ_t e^{Δ_t A_t} | Left-endpoint weight (new in Mamba-3) | λ=1 recovers Euler
γ_t = λ_t Δ_t | Right-endpoint weight | λ=0.5 is classical trapezoid
R(φ) = [cos φ, -sin φ; sin φ, cos φ] | 2×2 rotation matrix | Complex SSM state transition
R_t = Block{R(Δ_t θ_t[i])} | Block-diagonal rotation | N/2 independent 2D rotations
B̃_t = (Π_{i=0}^{t} R_i^T) B_t | RoPE trick: cumulative rotation on B | Efficient complex SSM impl
AI = FLOPs / Bytes | Arithmetic intensity | SISO ~2.5, MIMO ~R×2.5
Y = (L ⊙ C B^T) X | SSD parallel form | Training (L = L_1·L_2 for Mamba-3)
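The RoPE-trick row can be checked numerically on a toy case. The sketch below uses one pair of state dimensions (a 2D state written as a complex number), a scalar input, and the two-term recurrence without the β term, which keeps it short. It compares rotating the state at every step against leaving the state unrotated and applying the cumulative inverse rotation to B and C instead. Function names are hypothetical.

```python
import cmath

def direct_rotation(alphas, phis, gammas, bs, cs, xs):
    """h_t = alpha_t * R(phi_t) h_{t-1} + gamma_t * B_t x_t, y_t = C_t^T h_t."""
    h, ys = 0j, []
    for a, phi, g, b, c, x in zip(alphas, phis, gammas, bs, cs, xs):
        h = a * cmath.exp(1j * phi) * h + g * b * x
        ys.append((c.conjugate() * h).real)  # real 2D dot product of C and h
    return ys

def rope_trick(alphas, phis, gammas, bs, cs, xs):
    """Same outputs with an unrotated state: rotate B and C by the cumulative R^T."""
    h, total, ys = 0j, 0.0, []
    for a, phi, g, b, c, x in zip(alphas, phis, gammas, bs, cs, xs):
        total += phi
        rot = cmath.exp(-1j * total)   # product of R_i^T for i = 0..t
        h = a * h + g * (rot * b) * x  # scalar decay only, no rotation of h
        ys.append(((rot * c).conjugate() * h).real)
    return ys

alphas = [0.9, 0.8, 0.95, 0.7]
phis = [0.3, 1.2, -0.5, 2.0]
gammas = [0.1, 0.2, 0.05, 0.3]
bs = [1 + 2j, 0.5 - 1j, -1 + 0.3j, 2 + 0j]
cs = [0.2 + 1j, 1 - 1j, 0.7 + 0.7j, -0.4 + 0.9j]
xs = [1.0, -2.0, 0.5, 3.0]

y_direct = direct_rotation(alphas, phis, gammas, bs, cs, xs)
y_trick = rope_trick(alphas, phis, gammas, bs, cs, xs)
print(max(abs(u - v) for u, v in zip(y_direct, y_trick)))
```

The two output sequences agree up to floating-point error, which is the point of the trick: the state update keeps the scalar-decay form that the SSD parallel form relies on, and the rotation is absorbed into B and C.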

Symbol Glossary

Symbol | Meaning | Typical Value
T | Sequence length | 2048
D | Model dimension | 2048 (1.5B)
N | SSM state size | 64 or 128
P | Head dimension | 128
H | Number of heads | 16 (1.5B)
R | MIMO rank | 4
Δ_t | Step size (data-dependent) | Scalar per head
A_t | State decay (data-dependent) | Scalar per head, negative
θ_t | Rotation angle (data-dependent) | R^{N/2} per head
λ_t | Trapezoidal mixing parameter | σ(u_t), scalar per head

Related Work & Lessons

How This Idea Was Likely Discovered

The paper reads like a clean narrative: discretization → complex states → MIMO. But the discovery process was almost certainly messier. Probable sequence:

  1. The authors noticed Mamba-2's simplified scalar transition lost expressivity vs. Mamba-1's diagonal transitions. They asked: "Can we recover expressivity without sacrificing Mamba-2's training efficiency?"
  2. Revisiting the discretization derivation revealed the Euler approximation was heuristic. Formalizing it led to the exponential-adjusted framework, and the trapezoidal rule was the natural next step.
  3. The parity failure was likely already known from Grazzi et al. (2025). The fix (complex eigenvalues / block-diagonal rotations) was suggested by classical SSM theory (S4 used complex-valued NPLR matrices). The RoPE trick was the crucial implementation insight that made complex SSMs practical.
  4. The MIMO idea likely came from profiling: observing the low arithmetic intensity during decode and asking "how can we add FLOPs without increasing memory traffic?" The signal processing connection (SISO → MIMO) provided the framework.

Open Questions

What is the fundamental difference between Mamba-3's data-dependent RoPE and standard position-encoding RoPE?