Three SSM-principled upgrades (exponential-trapezoidal discretization, complex-valued states, and MIMO) push the Pareto frontier of quality vs. inference speed, matching Mamba-2's perplexity at half the state size.
You are deploying a language model. Users send prompts, and the model generates tokens one at a time. Each new token requires attending to every previous token; that is how Transformers work. The KV cache holding those past token representations grows linearly with sequence length, and the attention computation summed over the whole sequence grows quadratically.
At 1K tokens, this is tolerable. At 8K, the KV cache consumes gigabytes. At 32K, you need multi-GPU setups just to hold the cache. And as chain-of-thought reasoning and agentic workflows push inference-time compute to new heights, the cost of generating each additional token becomes the dominant expense. Inference is no longer an afterthought — it is the bottleneck.
Sub-quadratic models — state space models (SSMs) like Mamba-1 and Mamba-2, and linear attention variants like Gated DeltaNet — promise a fix. They maintain a fixed-size hidden state that summarizes the entire past. Generating a new token requires only updating this state and reading from it: O(1) memory, O(1) compute per token. No KV cache. No quadratic attention. Linear in sequence length for the full forward pass.
Sounds perfect. So why isn't everyone using them?
Problem 1: Quality gap. Mamba-2 trades expressivity for training speed. Its state transition is a scalar times identity — every state dimension evolves the same way. This simplification enables hardware-efficient matrix multiplications during training, but it means the recurrence is less expressive than Mamba-1's diagonal transitions. At matched parameter counts, Mamba-2 slightly underperforms Gated DeltaNet on downstream tasks.
Problem 2: State-tracking failure. Ask Mamba-2 to compute the parity of a binary sequence (count the number of 1s, report whether it is even or odd). It fails — not at 95% accuracy, but at chance level. This is because parity requires a "rotation" in state space — flipping between two states on every 1 — and Mamba-2's real-valued, non-negative eigenvalue transitions cannot express rotational dynamics. More formally, Grazzi et al. (2025) proved that real-valued scalar transitions cannot represent functions requiring modular counting.
Problem 3: Hardware underutilization. Even though SSMs are theoretically efficient, their decode step has an arithmetic intensity of only about 2.5 ops/byte. The H100 GPU's tensor cores can sustain ~295 ops/byte for matrix multiplications. That means during SSM decoding, the tensor cores sit idle while memory bandwidth is fully saturated. The model is fast in theory but leaves 99% of compute capacity unused.
[Interactive figure: the gap between ideal and actual behavior for each of the three problems, which are the three axes Mamba-3 targets.]
Mamba-3 addresses all three problems simultaneously. Not with heuristic patches, but with principled improvements derived from the state space model perspective: a better discretization rule for expressivity, complex-valued states for rotation capability, and a MIMO formulation for hardware utilization. Let us see how.
Mamba-3's thesis is simple: the right improvements come from returning to state space model fundamentals. Not from the linear attention viewpoint. Not from the test-time regression viewpoint. From the original continuous-time ODE that defines SSMs. Each of the three innovations arises naturally from asking "what does the SSM perspective suggest here?"
The result at 1.5B scale: Mamba-3 SISO beats Gated DeltaNet (the strongest sub-quadratic baseline in the comparison) by +0.6 points on average downstream accuracy. Mamba-3 MIMO adds another +1.2 points for a total +1.8 point gain. On perplexity, Mamba-3 with state size 64 matches Mamba-2 with state size 128: the same quality at half the state size. And Mamba-3 solves parity perfectly, while Mamba-2 scores at chance.
Mamba-3 does not solve retrieval. On tasks like extracting facts from semi-structured data (SWDE, FDA), it still substantially underperforms Transformers. The fixed-size state fundamentally limits how much verbatim information can be recalled. The authors are candid about this: they predict SSMs will primarily be used in hybrid architectures, mixing with sparse attention layers to handle retrieval.
Also, the MIMO variant increases training FLOPs by approximately R× (with R=4), so the SSM layers train roughly R times slower per step. Decode latency rises only modestly (about 15-25% in the kernel benchmarks reported later), so most of the extra compute is hidden at inference time, but you pay for it upfront during training.
Before we can understand what Mamba-3 changes, we need to understand what it starts from. State space models describe a continuous-time dynamical system with three matrices — A, B, C — that govern how a hidden state evolves in response to input.
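The SSM is defined by a first-order ODE for the hidden state $h(t) \in \mathbb{R}^N$ and a linear readout for the output $y(t) \in \mathbb{R}$:
$$h'(t) = A(t)\,h(t) + B(t)\,x(t), \qquad y(t) = C(t)^\top h(t)$$
Let's trace each component: $x(t)$ is the scalar input; $h(t)$ is the hidden state that summarizes the past; $A(t)$ governs how the state decays; $B(t)$ controls how the input is written into the state; and $C(t)$ controls how the state is read out to produce $y(t)$.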
In classical SSMs (S4, S5), A, B, C are fixed parameters learned during training. Every token sees the same dynamics. This is called linear time-invariant (LTI).
Mamba-1 made the dynamics data-dependent: B, C, and the step size Δ are projected from the current token, so the effective decay $e^{\Delta_t A}$ varies per token even though A itself remains a learned parameter. Now each token controls its own write pattern (B), read pattern (C), and decay rate (through Δ). This is linear time-varying (LTV). The model can decide, for each token: "This is important, remember it" (low decay, strong write) or "This is noise, forget it" (high decay, weak write).
The cost: LTV systems lose the efficient global convolution trick of LTI systems. The recurrence must be computed step by step (or in hardware-efficient chunks via SSD).
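We have discrete tokens $x_1, x_2, \ldots, x_T$, but the SSM is defined in continuous time. Discretization converts the continuous ODE into a discrete recurrence that we can compute on tokens.
Start with the ODE $h'(t) = A(t)h(t) + B(t)x(t)$. The exact solution between time $\tau_{t-1}$ and $\tau_t$ (with step size $\Delta_t = \tau_t - \tau_{t-1}$) is:
$$h(\tau_t) = \exp\!\left(\int_{\tau_{t-1}}^{\tau_t} A(s)\,ds\right) h(\tau_{t-1}) + \int_{\tau_{t-1}}^{\tau_t} \exp\!\left(\int_{\tau}^{\tau_t} A(s)\,ds\right) B(\tau)\,x(\tau)\,d\tau$$
The first term (the homogeneous solution) has a closed form once $A(s)$ is approximated as the constant $A_t$ over the interval: the factor becomes $\exp(\Delta_t A_t)$. The second term (the particular integral) is harder because it requires approximating $B(\tau)x(\tau)$ over the interval. This is where different discretization methods diverge.
Mamba-2 uses Euler's rule: approximate the integral by the step size times the integrand's value at the right endpoint $\tau_t$. This gives:
$$h_t = e^{\Delta_t A_t} h_{t-1} + \Delta_t B_t x_t$$
Define $\alpha_t := e^{\Delta_t A_t} \in (0,1)$ and $\gamma_t := \Delta_t$. The recurrence becomes:
$$h_t = \alpha_t h_{t-1} + \gamma_t B_t x_t, \qquad y_t = C_t^\top h_t$$
This is Mamba-2's core. The step size $\Delta_t$ controls both forgetting ($\alpha_t$) and input scaling ($\gamma_t$): large $\Delta_t$ forgets faster (smaller $\alpha$) but up-weights the current token (larger $\gamma$). Small $\Delta_t$ retains more past state but adds less from the new token.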
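Let's trace a concrete step with $N=2$, $\Delta_t=0.5$, $A_t=-1$, so $\alpha_t = e^{-0.5} \approx 0.607$ and $\gamma_t = 0.5$. Taking $h_{t-1} = [0.8, 0.3]$, $B_t = [1.0, 0.5]$, and $x_t = 2.0$ (the values reused in the trapezoidal example later):
$$h_t = 0.607 \cdot [0.8, 0.3] + 0.5 \cdot [1.0, 0.5] \cdot 2.0 \approx [0.485 + 1.0,\; 0.182 + 0.5] = [1.485, 0.682]$$
The old state is scaled to 60.7% of its value (a 39.3% decay), and the new token contributes with weight 0.5. Now if $C_t = [0.3, 0.7]$:
$$y_t = C_t^\top h_t = 0.3 \cdot 1.485 + 0.7 \cdot 0.682 \approx 0.923$$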
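The Euler step can be checked numerically. The sketch below is a minimal NumPy illustration, not the paper's implementation; it assumes the state, input, and projection values quoted in the trapezoidal worked example later in the text ($h_{t-1} = [0.8, 0.3]$, $B_t = [1.0, 0.5]$, $x_t = 2.0$) together with $C_t = [0.3, 0.7]$.

```python
import numpy as np

# Assumed example values (taken from the worked examples in the text).
delta, A = 0.5, -1.0
h_prev = np.array([0.8, 0.3])
B_t = np.array([1.0, 0.5])
C_t = np.array([0.3, 0.7])
x_t = 2.0

alpha = np.exp(delta * A)  # decay factor applied to the old state
gamma = delta              # Euler input weight

h_t = alpha * h_prev + gamma * B_t * x_t  # Mamba-2 style state update
y_t = float(C_t @ h_t)                    # linear readout

print(f"alpha = {alpha:.4f}")
print("h_t   =", np.round(h_t, 3))
print(f"y_t   = {y_t:.3f}")
```

Because `alpha` and `gamma` both derive from `delta`, increasing `delta` in this sketch shrinks the contribution of `h_prev` while enlarging the contribution of the new token.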
[Interactive figure: the hidden state evolving token by token as Δ varies. High Δ forgets fast but writes strongly; low Δ retains but writes weakly.]
Mamba-2 showed that the recurrence above can be written in matrix form for parallel training:
$$Y = (L \odot CB^\top)\,X$$
where L is a structured mask (lower-triangular with decay terms), B and C play the roles of keys and queries, and X is the value matrix. This is the state space duality — SSMs and (masked) linear attention are the same computation in different forms. The recurrent form is efficient for inference (O(1) per token). The matrix form is efficient for training (parallelizable via matmul).
Mamba-3 will generalize L to include the trapezoidal terms, making it a richer mask while preserving this duality.
Euler's rule approximates an integral using just one endpoint. A student of numerical methods knows there is a better option: the trapezoidal rule, which averages both endpoints. It is second-order accurate instead of first-order, meaning the error shrinks quadratically instead of linearly as the step size decreases.
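The difference in convergence order is easy to observe numerically. The following sketch is a generic numerical-methods illustration rather than anything from the paper: it integrates $e^t$ over $[0, 1]$ with a composite right-endpoint rule and a composite trapezoidal rule, then doubles the number of steps. The helper names `right_endpoint` and `trapezoid` are chosen here for illustration.

```python
import numpy as np

def right_endpoint(f, a, b, n):
    """Composite right-endpoint (Euler-style) rule with n equal steps."""
    h = (b - a) / n
    t = a + h * np.arange(1, n + 1)
    return h * np.sum(f(t))

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal steps."""
    h = (b - a) / n
    y = f(a + h * np.arange(0, n + 1))
    return h * (0.5 * y[0] + np.sum(y[1:-1]) + 0.5 * y[-1])

exact = np.e - 1.0  # integral of exp(t) over [0, 1]
ns = [10, 20, 40]
euler_err = [abs(right_endpoint(np.exp, 0.0, 1.0, n) - exact) for n in ns]
trap_err = [abs(trapezoid(np.exp, 0.0, 1.0, n) - exact) for n in ns]

for n, e, t in zip(ns, euler_err, trap_err):
    print(f"n={n:3d}  right-endpoint error={e:.2e}  trapezoid error={t:.2e}")
```

Halving the step size should roughly halve the right-endpoint error and roughly quarter the trapezoidal error, which is what first-order versus second-order accuracy means in practice.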
But we are not in the standard setting. Our ODE has a time-varying exponential decay, and we have already solved the homogeneous part analytically. What remains is the integral of the state-input B(τ)x(τ) weighted by the exponential decay. Let us derive the trapezoidal version from scratch.
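Recall the exact solution (from Chapter 2), with $A(s)$ held constant at $A_t$ over the interval:
$$h_t = e^{\Delta_t A_t} h_{t-1} + \int_{\tau_{t-1}}^{\tau_t} e^{(\tau_t - \tau) A_t}\, B(\tau)\,x(\tau)\,d\tau$$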
The first term is exact (we solved the homogeneous ODE analytically). The second term — the integral of the state-input weighted by exponential decay — needs to be approximated.
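Step 1: Define $g(\tau) := e^{(\tau_t - \tau) A_t}\, B(\tau)\,x(\tau)$. This is the integrand we need to approximate.
Step 2: The generalized trapezoidal rule approximates the integral with a convex combination of the two endpoints:
$$\int_{\tau_{t-1}}^{\tau_t} g(\tau)\,d\tau \approx \Delta_t \left[(1-\lambda_t)\,g(\tau_{t-1}) + \lambda_t\,g(\tau_t)\right]$$
where $\lambda_t \in [0,1]$ is a data-dependent mixing parameter. When $\lambda_t = 1$, this recovers Euler (right endpoint only). When $\lambda_t = 1/2$, this is the classical trapezoid (equal average of both endpoints).
Step 3: Evaluate the endpoints:
$$g(\tau_t) = B_t x_t, \qquad g(\tau_{t-1}) = e^{\Delta_t A_t} B_{t-1} x_{t-1}$$
Step 4: Substitute back:
$$h_t = e^{\Delta_t A_t} h_{t-1} + (1-\lambda_t)\,\Delta_t\,e^{\Delta_t A_t} B_{t-1} x_{t-1} + \lambda_t\,\Delta_t\,B_t x_t$$
Defining the three coefficients:
$$\alpha_t := e^{\Delta_t A_t}, \qquad \beta_t := (1-\lambda_t)\,\Delta_t\,e^{\Delta_t A_t}, \qquad \gamma_t := \lambda_t\,\Delta_t$$
The Mamba-3 recurrence becomes:
$$h_t = \alpha_t h_{t-1} + \beta_t B_{t-1} x_{t-1} + \gamma_t B_t x_t$$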
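Same setup as before: $N=2$, $\Delta_t=0.5$, $A_t=-1$, $\lambda_t=0.5$ (classical trapezoid), so $\alpha_t \approx 0.607$, $\beta_t = 0.5 \cdot 0.5 \cdot 0.607 \approx 0.152$, and $\gamma_t = 0.25$.
With $h_{t-1} = [0.8, 0.3]$, $B_t = [1.0, 0.5]$, $x_t = 2.0$, and the previous step's $B_{t-1} = [0.7, 0.9]$, $x_{t-1} = 1.5$:
$$h_t = 0.607 \cdot [0.8, 0.3] + 0.152 \cdot [0.7, 0.9] \cdot 1.5 + 0.25 \cdot [1.0, 0.5] \cdot 2.0 \approx [0.485 + 0.159 + 0.5,\; 0.182 + 0.205 + 0.25] = [1.144, 0.637]$$
Compare to Euler: $h_t = [0.485 + 1.0,\; 0.182 + 0.5] = [1.485, 0.682]$. The trapezoidal version distributes the input weight between the current and previous token, resulting in a smoother approximation. The $\beta$ term acts as a bridge between consecutive tokens.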
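The two updates can be compared side by side in a few lines. This is a minimal sketch using the example values stated in the text; the variable names are illustrative.

```python
import numpy as np

# Example values from the text.
delta, A, lam = 0.5, -1.0, 0.5
h_prev = np.array([0.8, 0.3])
B_t, x_t = np.array([1.0, 0.5]), 2.0
B_prev, x_prev = np.array([0.7, 0.9]), 1.5

alpha = np.exp(delta * A)           # decay
beta = (1.0 - lam) * delta * alpha  # left-endpoint weight
gamma = lam * delta                 # right-endpoint weight

# Trapezoidal update: previous and current tokens both write to the state.
h_trap = alpha * h_prev + beta * B_prev * x_prev + gamma * B_t * x_t
# Euler update (lam = 1): only the current token writes, with weight delta.
h_euler = alpha * h_prev + delta * B_t * x_t

print("trapezoidal:", np.round(h_trap, 3))
print("euler:      ", np.round(h_euler, 3))
```

Setting `lam = 1.0` makes `beta` zero and `gamma` equal to `delta`, so the trapezoidal line reduces to the Euler line.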
There is an elegant alternative view: the trapezoidal discretization is equivalent to applying a width-2 convolution on the state-input vt = Btxt before feeding it into the linear recurrence.
In a standard SSM, you first compute $v_t = B_t x_t$, then apply the recurrence $h_t = \alpha_t h_{t-1} + \gamma_t v_t$. In Mamba-3, you first convolve, $v'_t = \beta_t v_{t-1} + \gamma_t v_t$, then apply the recurrence $h_t = \alpha_t h_{t-1} + v'_t$.
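The equivalence between the three-term recurrence and "convolve, then recur" can be verified on random data. The sketch below assumes $v_{-1} = 0$ for the first token and uses randomly drawn coefficients; the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4
alpha = rng.uniform(0.5, 0.95, size=T)
beta = rng.uniform(0.0, 0.3, size=T)
gamma = rng.uniform(0.1, 0.5, size=T)
v = rng.normal(size=(T, N))  # v_t = B_t x_t (assumption: v_{-1} = 0)

def direct_recurrence(alpha, beta, gamma, v):
    """h_t = alpha_t h_{t-1} + beta_t v_{t-1} + gamma_t v_t."""
    h = np.zeros(v.shape[1])
    v_prev = np.zeros(v.shape[1])
    states = []
    for t in range(len(v)):
        h = alpha[t] * h + beta[t] * v_prev + gamma[t] * v[t]
        v_prev = v[t]
        states.append(h.copy())
    return np.stack(states)

def conv_then_recurrence(alpha, beta, gamma, v):
    """Width-2 convolution on v, followed by h_t = alpha_t h_{t-1} + v'_t."""
    v_shifted = np.vstack([np.zeros((1, v.shape[1])), v[:-1]])
    v_conv = beta[:, None] * v_shifted + gamma[:, None] * v
    h = np.zeros(v.shape[1])
    states = []
    for t in range(len(v)):
        h = alpha[t] * h + v_conv[t]
        states.append(h.copy())
    return np.stack(states)

H_direct = direct_recurrence(alpha, beta, gamma, v)
H_conv = conv_then_recurrence(alpha, beta, gamma, v)
print(np.allclose(H_direct, H_conv))
```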
In the SSD parallel form $Y = (L \odot CB^\top)X$, Mamba-2 has mask $L$ with entries $L_{ij} = \alpha_i \cdots \alpha_{j+1}\,\gamma_j$ for $i \ge j$ (decay product times input weight) and zero above the diagonal. Mamba-3's mask is richer: it factors as $L = L_1 L_2$, where $L_1$ is the lower-triangular matrix of decay products and $L_2$ is a two-band matrix with entries from $\beta$ and $\gamma$. This two-band structure is the "convolution" in mask form.
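The mask factorization can be checked against the step-by-step recurrence. The sketch below is an illustration under stated assumptions, not the paper's kernel: it uses a scalar decay per step, a state of shape $N \times P$, and places $\gamma$ on the diagonal of $L_2$ and $\beta_{t}$ on its first subdiagonal (row $t$, column $t-1$), which is the arrangement implied by the recurrence.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, P = 5, 3, 2
alpha = rng.uniform(0.5, 0.95, T)
beta = rng.uniform(0.0, 0.3, T)
gamma = rng.uniform(0.1, 0.5, T)
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
X = rng.normal(size=(T, P))

def recurrent_form(alpha, beta, gamma, B, C, X):
    """Token-by-token trapezoidal recurrence with readout y_t = h_t^T C_t."""
    h = np.zeros((B.shape[1], X.shape[1]))
    prev = np.zeros_like(h)
    Y = np.zeros_like(X)
    for t in range(len(X)):
        cur = np.outer(B[t], X[t])
        h = alpha[t] * h + beta[t] * prev + gamma[t] * cur
        prev = cur
        Y[t] = C[t] @ h
    return Y

def decay_matrix(alpha):
    """L1[i, j] = alpha_{j+1} * ... * alpha_i for i >= j (empty product = 1)."""
    T = len(alpha)
    L1 = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1):
            L1[i, j] = np.prod(alpha[j + 1 : i + 1])
    return L1

L1 = decay_matrix(alpha)
L2 = np.diag(gamma) + np.diag(beta[1:], k=-1)  # two-band matrix
L = L1 @ L2

Y_mask = (L * (C @ B.T)) @ X  # parallel (attention-like) form
Y_rec = recurrent_form(alpha, beta, gamma, B, C, X)
print(np.allclose(Y_mask, Y_rec))
```

With `beta` set to zero, `L2` becomes diagonal and `L` reduces to the Mamba-2 mask of decay products times `gamma`.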
[Interactive figure: Euler (right endpoint only) vs. trapezoidal (average of endpoints) approximation of the integral, with λ adjustable from 1.0 (pure Euler) to 0.5 (classical trapezoid). The trapezoidal rule captures the shape of the curve more accurately.]
For second-order accuracy, theory requires $\lambda_t = 1/2 + O(\Delta_t)$. But the Mamba-3 authors found that a data-dependent $\lambda_t = \sigma(u_t)$ (sigmoid of a learned projection of the input token) performs best empirically, even though it does not satisfy the second-order constraint. Fixing $\lambda = 1/2$ gives 15.76 perplexity vs. 15.72 for the learned gate. Setting $\lambda = 1$ (Euler, no trapezoidal) gives 15.81. The improvement is real but modest: about 0.1 perplexity points from the trapezoidal term alone.
The bigger win comes from the interaction with B,C biases: combining trapezoidal discretization with biases gives 15.72, down from 16.68 without either. And it removes the need for the external short convolution, simplifying the architecture.
Here is a task that sounds trivial: given a binary string like 1 0 1 1 0 1, compute the parity (even or odd number of 1s). A human can do it by keeping a running count modulo 2. An RNN with a single bit of state can do it. But Mamba-2 provably cannot.
Why? Parity requires the state to flip between two values on every 1 and stay unchanged on every 0. In state space terms, processing a "1" should rotate the state by 180 degrees: $h \to -h$. But Mamba-2's transition is $\alpha_t \in (0, 1)$, which only shrinks the state. It can never flip the sign. You would need $\alpha < 0$ or, more generally, complex eigenvalues.
Mamba-3 makes the hidden state complex: $h(t) \in \mathbb{C}^{N/2}$ instead of $\mathbb{R}^N$. The continuous ODE becomes:
$$h'(t) = \mathrm{Diag}\!\left(A(t) + i\,\theta(t)\right) h(t) + B(t)\,x(t)$$
The key addition is θ(t) — a data-dependent rotation angle for each state dimension. While A(t) controls decay (real part), θ(t) controls rotation (imaginary part). Now the state can oscillate, not just decay.
Complex arithmetic is expensive on GPUs. But there is a beautiful equivalence: a complex SSM with state dimension N/2 is exactly equal to a real-valued SSM with state dimension N whose transition matrix contains 2×2 rotation blocks.
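Let us derive this for a single complex state dimension. Write $h_t = h_t^{\mathrm{re}} + i\,h_t^{\mathrm{im}}$. The homogeneous part of the complex update is:
$$h_t = e^{\Delta_t (A_t + i\theta_t)}\, h_{t-1}$$
Expand the exponential: $e^{\Delta_t(A_t + i\theta_t)} = e^{\Delta_t A_t}\left(\cos(\Delta_t\theta_t) + i \sin(\Delta_t\theta_t)\right)$.
Separate real and imaginary parts. In matrix form, this gives:
$$\begin{bmatrix} h_t^{\mathrm{re}} \\ h_t^{\mathrm{im}} \end{bmatrix} = e^{\Delta_t A_t}\, R(\Delta_t\theta_t) \begin{bmatrix} h_{t-1}^{\mathrm{re}} \\ h_{t-1}^{\mathrm{im}} \end{bmatrix}$$
where $R(\phi)$ is the 2×2 rotation matrix:
$$R(\phi) = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}$$
The input term is added component-wise to the real and imaginary parts. For $N/2$ complex state dimensions, we get $N$ real state dimensions with a block-diagonal transition made of $N/2$ such rotation blocks, each scaled by $e^{\Delta_t A_t}$.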
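The complex-to-real equivalence can be confirmed for a single state dimension. This is a minimal sketch with arbitrarily chosen numbers; the complex input term `b_c` stands in for the state-input contribution and is an illustrative assumption.

```python
import numpy as np

def rot(phi):
    """2x2 rotation matrix R(phi)."""
    return np.array([[np.cos(phi), -np.sin(phi)],
                     [np.sin(phi),  np.cos(phi)]])

# Arbitrary illustrative values.
delta, A, theta = 0.5, -1.0, 2.0
h_c = 0.8 + 0.3j  # complex state
b_c = 1.0 + 0.5j  # complex state-input term (assumed for illustration)

# Complex update: multiply by exp(delta * (A + i*theta)), then add the input.
h_next_c = np.exp(delta * (A + 1j * theta)) * h_c + b_c

# Real update: stack (re, im), apply decay-scaled rotation, then add the input.
h_r = np.array([h_c.real, h_c.imag])
b_r = np.array([b_c.real, b_c.imag])
h_next_r = np.exp(delta * A) * (rot(delta * theta) @ h_r) + b_r

print(np.allclose([h_next_c.real, h_next_c.imag], h_next_r))
```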
Computing the block-diagonal rotation explicitly requires storing and multiplying N×N matrices at each step. Expensive. But there is a trick.
Unroll the recurrence for $T$ steps. At step $t$, $B_t$ gets multiplied by the cumulative product of all rotations from step 0 to $t$: $\left(\prod_{i=0}^{t} R_i^\top\right) B_t$. Similarly, $C_t$ gets $\left(\prod_{i=0}^{t} R_i^\top\right) C_t$.
This is precisely a rotary position embedding (RoPE) applied to B and C! In the SSD duality, B corresponds to keys and C to queries. So the complex SSM is equivalent to applying data-dependent RoPE to the Q and K components of the attention-like computation.
The crucial difference from standard RoPE: the rotation angles are data-dependent ($\theta_t$ is projected from the input token), not fixed ($\theta_i = 10000^{-2i/N}$ in vanilla RoPE). Data-dependent rotations can modulate based on content, enabling state tracking.
In practice, the RoPE computation is trivially efficient: pair up the state dimensions, compute cumulative sums of angles, and apply rotations with the standard RoPE formula. This adds negligible overhead.
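The recipe above (pair up dimensions, accumulate angles, rotate B and C) can be sketched and checked against the explicit rotating recurrence. The code below is an illustrative NumPy version with a scalar decay per step and random per-step angles standing in for $\Delta_t\theta_t$; it is not the fused kernel described in the paper, and `rotate_pairs` is a helper defined here.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 8, 4  # N real dimensions = N/2 rotation pairs
alpha = rng.uniform(0.6, 0.95, T)
phi = rng.uniform(-np.pi, np.pi, size=(T, N // 2))  # stands in for delta_t * theta_t
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
x = rng.normal(size=T)

def rotate_pairs(v, ang):
    """Apply R(ang[k]) to each consecutive pair (v[2k], v[2k+1])."""
    a, b = v[0::2], v[1::2]
    out = np.empty_like(v)
    out[0::2] = np.cos(ang) * a - np.sin(ang) * b
    out[1::2] = np.sin(ang) * a + np.cos(ang) * b
    return out

def rotating_recurrence():
    """Explicit form: the state itself is rotated at every step."""
    h, y = np.zeros(N), np.zeros(T)
    for t in range(T):
        h = alpha[t] * rotate_pairs(h, phi[t]) + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def rope_recurrence():
    """RoPE form: rotate B and C by cumulative angles; state is never rotated."""
    cum = np.cumsum(phi, axis=0)  # cumulative angle from step 0 to t
    h, y = np.zeros(N), np.zeros(T)
    for t in range(T):
        B_bar = rotate_pairs(B[t], -cum[t])  # transpose = negative angle
        C_bar = rotate_pairs(C[t], -cum[t])
        h = alpha[t] * h + B_bar * x[t]
        y[t] = C_bar @ h
    return y

y_explicit = rotating_recurrence()
y_rope = rope_recurrence()
print(np.allclose(y_explicit, y_rope))
```

The check relies on 2D rotations commuting, so the cumulative product of rotation blocks collapses to a rotation by the cumulative sum of angles.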
The results are stark (Table 3b of the paper):
| Model | Parity | Arith (no brackets) | Arith (brackets) |
|---|---|---|---|
| Mamba-3 | 100.00% | 98.51% | 87.75% |
| Mamba-3 (std RoPE) | 1.56% | 20.70% | 2.62% |
| Mamba-3 (no RoPE) | 2.27% | 1.49% | 0.72% |
| Mamba-2 | 0.90% | 47.81% | 0.88% |
| GDN [-1,1] | 100.00% | 99.25% | 93.50% |
Mamba-3 with data-dependent RoPE: perfect parity and near-perfect bracket-free modular arithmetic. Without it (or with standard position-based RoPE), parity drops to chance level. This confirms the theory: complex eigenvalues enable rotational dynamics that non-negative real eigenvalues cannot.
[Interactive figure: a real-valued SSM (left) and a complex SSM (right) attempting parity on the same binary string. The real SSM's state decays monotonically and cannot flip; the complex SSM rotates by π on each "1", tracking parity exactly.]
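Combining the complex SSM with exponential-trapezoidal discretization, the full Mamba-3 recurrence is:
$$h_t = \alpha_t h_{t-1} + \beta_t \bar{B}_{t-1} x_{t-1} + \gamma_t \bar{B}_t x_t, \qquad y_t = \bar{C}_t^\top h_t$$
where $\bar{B}_t = \left(\prod_{i=0}^{t} R_i^\top\right) B_t$ and $\bar{C}_t = \left(\prod_{i=0}^{t} R_i^\top\right) C_t$ are the rotated projections.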
Both the β term (left endpoint) and γ term (right endpoint) get their own rotation applied. The output also gets rotated. In practice, B and C are rotated via the RoPE trick before the recurrence, so the recurrence itself is identical in form to the real-valued case — just with rotated inputs.
We have improved quality (trapezoidal discretization) and capability (complex states). Now let us fix hardware utilization by looking at how MIMO changes the arithmetic intensity of the decode step.
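Consider a single head of a SISO (single-input, single-output) SSM with head dimension $P$ and state size $N$. The decode step involves a state update, $h_t = \alpha_t h_{t-1} + \gamma_t B_t x_t^\top$ with $h_t \in \mathbb{R}^{N \times P}$, and a readout, $y_t = h_t^\top C_t$.
The state update uses an outer product $B_t x_t^\top$ (N × P FLOPs). But the memory traffic is dominated by loading $h_{t-1}$ (N × P elements) and storing $h_t$ (N × P elements). So the arithmetic intensity, FLOPs divided by bytes moved, is only about 2.5 ops/byte.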
The H100's tensor cores can sustain ~295 ops/byte for bfloat16 matmul. We are at 2.5. That means 99% of compute is idle during the decode step — the GPU waits for memory, not arithmetic.
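MIMO adds a rank dimension $R$ to the state-input computation. Instead of the rank-1 outer product (input weights omitted for brevity):
$$h_t = \alpha_t h_{t-1} + B_t x_t^\top, \qquad B_t \in \mathbb{R}^{N},\; x_t \in \mathbb{R}^{P}$$
We use a rank-$R$ matrix product:
$$h_t = \alpha_t h_{t-1} + B_t x_t^\top, \qquad B_t \in \mathbb{R}^{N \times R},\; x_t \in \mathbb{R}^{P \times R}$$
Similarly, $C_t$ goes from $\mathbb{R}^{N}$ to $\mathbb{R}^{N \times R}$, and $y_t$ from $\mathbb{R}^{P}$ to $\mathbb{R}^{P \times R}$.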
The key insight: memory traffic barely increases. The state $h_t \in \mathbb{R}^{N \times P}$ stays the same size because the MIMO rank $R$ does not expand the state. What increases are the projections $B_t$, $C_t$, $x_t$, but these are small compared to the state. So FLOPs go up by R× while memory I/O stays roughly constant. Arithmetic intensity increases by roughly R×.
With R=4: arithmetic intensity goes from ~2.5 to ~10 ops/byte. Still far from the 295 peak, but 4× better utilization is meaningful, and it comes nearly for free in wall-clock time because the extra matmul FLOPs largely overlap with the unchanged memory I/O.
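The scaling argument can be made concrete with a rough counting model. The sketch below is an assumption-laden illustration: it counts only the rank-$R$ update and readout as FLOPs, assumes 2 bytes per element, and charges one read and one write of the state plus the projections. Its absolute values depend on these counting conventions and are not expected to reproduce the 2.5 ops/byte figure; the point is how the ratio moves with $R$.

```python
def decode_arithmetic_intensity(N, P, R=1, bytes_per_elem=2):
    """Rough FLOPs-per-byte estimate for one decode step of one head.

    Counting model (an assumption for illustration):
      - FLOPs: rank-R state update (2*N*P*R) plus rank-R readout (2*N*P*R).
      - Bytes: read h_{t-1} and write h_t (N*P each), plus B, C (N*R each)
        and x, y (P*R each).
    """
    flops = 4 * N * P * R
    state_bytes = 2 * N * P * bytes_per_elem
    proj_bytes = (2 * N * R + 2 * P * R) * bytes_per_elem
    return flops / (state_bytes + proj_bytes)

siso = decode_arithmetic_intensity(N=64, P=128, R=1)
mimo = decode_arithmetic_intensity(N=64, P=128, R=4)
print(f"SISO: {siso:.2f} ops/byte, MIMO R=4: {mimo:.2f} ops/byte, "
      f"ratio: {mimo / siso:.2f}x")
```

Under this model the ratio lands somewhat below 4× because the projection traffic grows with $R$ while the state traffic does not, which matches the "roughly R×" wording.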
Why "multi-input, multi-output"? In signal processing, a SISO system has one input channel and one output channel. The SSM takes scalar xt in and produces scalar yt out, replicated across P head dimensions by stacking. A MIMO system has R input channels and R output channels. Each of the R inputs has its own B projection and contributes to the state independently. Each of the R outputs reads the state through its own C projection.
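Formally, a MIMO SSM with rank $R$ is equivalent to $R$ independent SISO SSMs sharing the same decay $\alpha_t$ and the same state, with their contributions summed:
$$h_t = \alpha_t h_{t-1} + \sum_{r=1}^{R} b_t^{(r)} \left(x_t^{(r)}\right)^\top = \alpha_t h_{t-1} + B_t x_t^\top$$
where $b_t^{(r)}$ and $x_t^{(r)}$ denote the $r$-th columns of $B_t$ and $x_t$.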
The outer product becomes a matrix product. This is computationally friendly because the matmul can use tensor cores, which were sitting idle in the SISO case.
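A short check makes the "sum of outer products equals one matmul" statement concrete. This is a minimal sketch with random data and a scalar decay; shapes follow the text ($B_t, C_t \in \mathbb{R}^{N \times R}$, input in $\mathbb{R}^{P \times R}$), and input weights are omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
N, P, R = 8, 4, 4
alpha = 0.9
h_prev = rng.normal(size=(N, P))
B = rng.normal(size=(N, R))
X = rng.normal(size=(P, R))
C = rng.normal(size=(N, R))

# MIMO update: one (N x R) @ (R x P) matrix product.
h_mimo = alpha * h_prev + B @ X.T

# Equivalent view: R rank-1 (SISO-style) outer products written into one state.
h_sum = alpha * h_prev + sum(np.outer(B[:, r], X[:, r]) for r in range(R))

# Readout: each of the R outputs reads the shared state through its own C column.
Y = h_mimo.T @ C  # shape (P, R)

print(np.allclose(h_mimo, h_sum), h_mimo.shape, Y.shape)
```

The state keeps shape `(N, P)` for any `R`, which is why the memory traffic for the state does not grow with the rank.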
[Interactive figure: SISO decode step (left), where the outer product uses few FLOPs relative to memory traffic, vs. MIMO decode step (right), where the matrix multiply uses R× more FLOPs with barely more memory traffic. Arithmetic intensity climbs with R; a dashed line marks where tensor cores become meaningfully utilized.]
MIMO is nearly free at decode time but not at training time. The sequence-to-sequence form requires calling the SISO training algorithm $R^2$ times naively (all R×R cross terms). However, by exploiting the chunked SSD algorithm structure, this reduces to R× overhead: the intra-chunk computation maintains the SISO chunk size while increasing the number of chunks by R. With R=4, training is about 4× slower per step for the SSM layers.
To keep total parameter count constant, Mamba-3 shrinks the MLP hidden dimension to compensate for the extra MIMO projections. At 1.5B: SISO MLP dim is 4096, MIMO MLP dim is 3824. The parameter counts match, but the MIMO variant distributes more capacity into the SSM and less into the MLP.
The decode kernel is implemented in CuTe-DSL (NVIDIA's tensor core DSL), not Triton, because the MIMO matrix multiply requires explicit tensor core scheduling. The forward (prefill) path uses Triton (SISO) or TileLang (MIMO). The kernel fusion structure fuses the RoPE rotation, SSM update, gating, and MIMO projection into a single kernel to minimize memory round-trips.
We have the three core innovations. Now let us assemble the full Mamba-3 block and see how it differs from its predecessor.
The macro-architecture follows Llama: alternating sequence-mixing blocks and feedforward (SwiGLU MLP) blocks with pre-norm (RMSNorm). This is the same skeleton used by most modern language models — the innovation is in what goes inside the sequence-mixing block.
Notice what is removed: the short causal convolution and its SiLU activation, and the post-gate RMSNorm. Notice what is added: BCNorm, B/C biases, data-dependent RoPE (complex SSM), and the trapezoidal β term.
BCNorm (QKNorm): RMSNorm is applied to B and C after projection, before biases. This mirrors the Q/K normalization used in modern Transformers (since B ↔ K and C ↔ Q in the SSD duality). BCNorm stabilizes training enough that the post-gate RMSNorm can be removed in pure Mamba-3 models. However, for hybrid models (Mamba-3 + attention layers), a pre-gate grouped RMSNorm is reintroduced for better length generalization.
B, C Biases: Learnable, head-specific, channel-wise bias vectors added to B and C after BCNorm. Initialized to all-ones. These biases create data-independent components in B and C, making the SSM behave partly like a fixed convolution alongside its data-dependent behavior. Yu & Erichson (2025) proved that adding channel-specific bias to B grants universal approximation capabilities to block-biased Mamba.
Data-dependent A: Mamba-3 makes At data-dependent (projected from the token) rather than a fixed parameter. This keeps all SSM parameters consistently data-dependent. Empirically, this does not change performance, but it simplifies the parameterization.
Let us trace a single token $x \in \mathbb{R}^{D}$ through the Mamba-3 block ($D$ = model dimension, $H$ = number of heads, $P$ = head dimension, $N$ = state size):
For MIMO, step 6 changes: $B_t \in \mathbb{R}^{H \times N \times R}$, $x_t \in \mathbb{R}^{H \times P \times R}$, the outer product becomes a matmul, and intermediate outputs go through a SiLU residual before the final output projection.
[Interactive block diagram of the Mamba-3 block showing each component's tensor shapes and data types. Blue = linear projection, green = SSM-specific, orange = new in Mamba-3.]
| Scale | Layers | d_model | Heads | d_state | Head dim | MLP dim (SISO) | MLP dim (MIMO R=4) |
|---|---|---|---|---|---|---|---|
| 180M | 12 | 768 | 6 | 64 | 128 | 1,500 | 1,264 |
| 440M | 24 | 1024 | 8 | 64 | 128 | 2,048 | 1,792 |
| 880M | 32 | 1536 | 12 | 64 | 128 | 3,072 | 2,800 |
| 1.5B | 24 | 2048 | 16 | 64 | 128 | 4,096 | 3,824 |
All models use expand factor 2, Llama-3.1 tokenizer, 2K context length during pre-training, and 100B tokens from FineWeb-Edu. Note the MIMO MLP dimension is smaller to match parameter counts with SISO.
Numbers matter. Mamba-3 was evaluated at four scales (180M, 440M, 880M, 1.5B) on language modeling perplexity and seven downstream tasks. Let us walk through the key findings.
At 1.5B scale (the largest), the average downstream accuracy across seven tasks:
| Model | PPL ↓ | LAMBADA | HellaSwag | PIQA | ARC-E | ARC-C | WinoGr. | OBQA | Avg ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Transformer | 10.51 | 50.3 | 60.6 | 73.8 | 74.0 | 40.4 | 58.7 | 29.6 | 55.4 |
| Gated DeltaNet | 10.45 | 49.2 | 61.3 | 74.3 | 75.3 | 41.2 | 58.0 | 31.6 | 55.8 |
| Mamba-2 | 10.47 | 47.8 | 61.4 | 73.6 | 75.3 | 41.8 | 57.5 | 32.6 | 55.7 |
| Mamba-3 SISO | 10.35 | 49.4 | 61.9 | 73.6 | 75.9 | 42.7 | 59.4 | 32.0 | 56.4 |
| Mamba-3 MIMO | 10.24 | 51.7 | 62.3 | 75.3 | 76.5 | 44.5 | 60.6 | 32.6 | 57.6 |
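Key takeaways: Mamba-3 SISO has the best average among the non-MIMO models (56.4, versus 55.8 for Gated DeltaNet and 55.7 for Mamba-2); Mamba-3 MIMO raises the average to 57.6 and has the lowest perplexity in the table (10.24); and both Mamba-3 variants exceed the Transformer baseline's average of 55.4.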
The most impactful result is Figure 2 of the paper: plotting perplexity against state size (a proxy for inference speed). Mamba-3 with state size 64 matches Mamba-2 with state size 128 — same quality at half the state size, meaning half the memory and half the per-token latency.
MIMO pushes the frontier further: even at state size 64, MIMO R=4 achieves better perplexity than Mamba-2 at state size 128. More quality for less inference cost.
Each point is a model variant. Lower-left is better (lower perplexity, smaller state). Mamba-3 shifts the entire curve down and left. MIMO shifts it further down without moving right (same state size, better quality).
Practical decode speed matters as much as theoretical efficiency. At 1.5B with batch 128 on a single H100:
| Model | BF16, d_state=64 | BF16, d_state=128 |
|---|---|---|
| Mamba-2 | 0.127 ms | 0.203 ms |
| Gated DeltaNet | 0.176 ms | 0.257 ms |
| Mamba-3 SISO | 0.110 ms | 0.156 ms |
| Mamba-3 MIMO R=4 | 0.137 ms | 0.179 ms |
Mamba-3 SISO is the fastest in both configurations, even faster than Mamba-2's reference kernels. MIMO adds only about 15-25% latency over SISO (0.137 vs. 0.110 ms at d_state=64, 0.179 vs. 0.156 ms at d_state=128) despite 4× more FLOPs, confirming the memory-bound argument from Chapter 5.
At 4096 tokens, total decode time for 1.5B models:
All recurrent models are 1.5-1.7× faster than the attention baseline. Mamba-3 SISO is the fastest. At 16K tokens, the gap widens to 2.3×.
On real-world retrieval (SWDE, FDA), Mamba-3 scores 28.5 and 23.4 vs. the Transformer's 48.9 and 58.4. This is a fundamental limitation of fixed-state-size models — they cannot store and recall arbitrary key-value pairs from long context the way attention can.
The mitigation: hybrid models. A 5:1 ratio of Mamba-3 layers to NoPE attention layers recovers most retrieval capability (58.5 on SWDE) while keeping the efficiency benefits. This is how the authors expect Mamba-3 to be deployed in practice.
Chapter 4 showed the theory: complex eigenvalues enable rotational dynamics. Chapter 7 showed Mamba-3 beats baselines on standard benchmarks. This chapter digs deeper into what Mamba-3 can do that its predecessors cannot — the qualitative capability gains from the complex SSM.
Following Grazzi et al. (2025), Mamba-3 was evaluated on three tasks from the Chomsky hierarchy of formal languages — each requiring increasingly sophisticated state-tracking:
1. Parity (regular language): Given a binary string, output whether the number of 1s is even or odd. Requires: a single bit of state that flips on every "1". Mamba-3: 100%. Mamba-2: 0.9% (random chance). The complex SSM can represent a 180-degree rotation (sign flip) that real-valued eigenvalues cannot.
2. Modular Arithmetic without brackets (regular): Evaluate expressions like "3 + 2 × 4 mod 5" left-to-right. Requires: maintaining a running total modulo 5 and applying operations sequentially. Mamba-3: 98.51%. Mamba-2: 47.81%. The rotation enables modular counting, but the intermediate operations (addition, multiplication) also need to be tracked in the state, and complex eigenvalues allow richer state transitions for this.
3. Modular Arithmetic with brackets (deterministic context-free): Evaluate "(3 + (2 × 4)) mod 5" with nested parentheses. Requires: a stack to track nesting depth and intermediate results. Mamba-3: 87.75%. GDN [-1,1]: 93.50%. This is the hardest task. Mamba-3 nearly closes the gap but doesn't quite match GDN, which benefits from its delta-rule-style memory update for handling nesting.
A natural question: why not just apply standard RoPE (fixed-frequency rotation) to Mamba-2? The answer is in the table: Mamba-3 with standard RoPE scores 1.56% on parity, no better than chance. Standard RoPE rotates by fixed amounts at each position (determined by the position index, not the token value). This is useful for encoding position but useless for encoding content-dependent state transitions.
Parity requires rotating by π when the token is "1" and by 0 when it is "0". The rotation must be data-dependent: the angle depends on what the token is, not where it appears. This is exactly what Mamba-3's complex SSM provides: θt is projected from the input token, so the model learns to set θ = π/Δ for "1" and θ = 0 for "0".
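The rotation strategy described here can be written out directly. The sketch below hard-codes the angle rule (π for "1", 0 for "0") rather than learning it, so it illustrates the mechanism only; the function names are illustrative. It also shows why a positive scalar transition cannot do the same job: the sign of the state never changes.

```python
import numpy as np

def parity_by_rotation(bits):
    """Track parity with a 2D state rotated by pi on every 1, by 0 on every 0."""
    h = np.array([1.0, 0.0])
    for b in bits:
        phi = np.pi * b  # data-dependent angle (hard-coded here, learned in a model)
        R = np.array([[np.cos(phi), -np.sin(phi)],
                      [np.sin(phi),  np.cos(phi)]])
        h = R @ h
    return 0 if h[0] > 0 else 1  # sign of the first coordinate encodes parity

def positive_decay_state(bits, alpha=0.9):
    """Scalar state with a transition in (0, 1): it shrinks but never flips sign."""
    h = 1.0
    for b in bits:
        h = alpha * h if b else h
    return h

bits = [1, 0, 1, 1, 0, 1]
print(parity_by_rotation(bits), sum(bits) % 2)  # both report even parity
print(positive_decay_state(bits) > 0)           # state stays positive
```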
The models were trained on short sequences (minimum 3, maximum 40-160 via curriculum) and evaluated at length 256. Mamba-3 generalizes well: 100% parity at 256 tokens despite never seeing sequences that long during training. This suggests the learned rotation strategy (rotate by π on "1") is position-independent and composition-friendly.
On standard language modeling (Figure 4), Mamba-3 extrapolates gracefully from its 2K training length to 32K tokens, with perplexity improving steadily. Mamba-2 degrades after 8K. The complex state's rotation (which embeds positional information through data-dependent angles rather than fixed position encodings) may contribute to this robustness.
[Interactive figure: SSM state evolution on a modular arithmetic expression. The complex SSM tracks the running value modulo 5 via rotations; the real-valued SSM is shown for comparison.]
Nested modular arithmetic (with brackets) reaches 87.75% but not 100%. The stack-like memory required for deep nesting strains the fixed-size state. True stack simulation requires unbounded memory, which no fixed-state model can provide.
On practical tasks: retrieval from semi-structured data (SWDE: 28.5%, FDA: 23.4%) remains weak. The model can track abstract state (parity, arithmetic) but struggles to recall specific facts verbatim from long context. This aligns with the theoretical distinction: state tracking is about abstracting information (reducing a sequence to a summary), while retrieval is about preserving information (storing exact values for later recall). SSMs excel at the former; attention excels at the latter.
| Equation | What it means | When to use |
|---|---|---|
| $h_t = \alpha_t h_{t-1} + \gamma_t B_t x_t$ | Mamba-2 recurrence (Euler) | Baseline comparison |
| $h_t = \alpha_t h_{t-1} + \beta_t B_{t-1} x_{t-1} + \gamma_t B_t x_t$ | Mamba-3 recurrence (trapezoidal) | Core recurrence |
| $\alpha_t = e^{\Delta_t A_t}$ | Decay factor | $\alpha \in (0,1)$, controls forgetting |
| $\beta_t = (1-\lambda_t)\,\Delta_t\,e^{\Delta_t A_t}$ | Left-endpoint weight (new in Mamba-3) | $\lambda = 1$ recovers Euler |
| $\gamma_t = \lambda_t \Delta_t$ | Right-endpoint weight | $\lambda = 0.5$ is classical trapezoid |
| $R(\phi) = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}$ | 2×2 rotation matrix | Complex SSM state transition |
| $R_t = \mathrm{Block}\{R(\Delta_t \theta_t[i])\}$ | Block-diagonal rotation | $N/2$ independent 2D rotations |
| $\bar{B}_t = \left(\prod_{i=0}^{t} R_i^\top\right) B_t$ | RoPE trick: cumulative rotation on B | Efficient complex SSM impl |
| $\mathrm{AI} = \mathrm{FLOPs} / \mathrm{Bytes}$ | Arithmetic intensity | SISO ~2.5, MIMO ~R × 2.5 |
| $Y = (L \odot CB^\top)\,X$ | SSD parallel form | Training ($L = L_1 L_2$ for Mamba-3) |
| Symbol | Meaning | Typical Value |
|---|---|---|
| T | Sequence length | 2048 |
| D | Model dimension | 2048 (1.5B) |
| N | SSM state size | 64 or 128 |
| P | Head dimension | 128 |
| H | Number of heads | 16 (1.5B) |
| R | MIMO rank | 4 |
| $\Delta_t$ | Step size (data-dependent) | Scalar per head |
| $A_t$ | State decay (data-dependent) | Scalar per head, negative |
| $\theta_t$ | Rotation angle (data-dependent) | $\mathbb{R}^{N/2}$ per head |
| $\lambda_t$ | Trapezoidal mixing parameter | $\sigma(u_t)$, scalar per head |
The paper reads like a clean narrative: discretization → complex states → MIMO. But the discovery process was almost certainly messier. Probable sequence: