Mamba-3 — Veanors

Chapter 0: The Problem

You are deploying a language model. Users send prompts, and the model generates tokens one at a time. Each new token requires attending to every previous token — that is how Transformers work. The KV cache holding those past token representations grows linearly with sequence length. The attention computation itself grows quadratically.

At 1K tokens, this is tolerable. At 8K, the KV cache of a large model can consume gigabytes per sequence. At 32K, serving a batch of requests can require multiple GPUs just to hold the cache. And as chain-of-thought reasoning and agentic workflows push inference-time compute to new heights, the cost of generating each additional token becomes the dominant expense. Inference is no longer an afterthought: it is the bottleneck.

The inference wall: Transformers have O(L) memory (KV cache grows with sequence length) and O(L²) compute (attention over all pairs). For generation, where you produce one token at a time and each token must attend to all previous ones, the cost per token grows linearly with the number of tokens generated so far. Double the output length, double the per-token cost.

Sub-quadratic models — state space models (SSMs) like Mamba-1 and Mamba-2, and linear attention variants like Gated DeltaNet — promise a fix. They maintain a fixed-size hidden state that summarizes the entire past. Generating a new token requires only updating this state and reading from it: O(1) memory, O(1) compute per token. No KV cache. No quadratic attention. Linear in sequence length for the full forward pass.
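To make the O(L) versus O(1) memory contrast concrete, here is a rough sketch. Every model dimension in it (32 layers, 32 KV heads of size 128, 16-bit values, and an SSM with 64 heads, head dimension 64, state size 128) is an illustrative assumption, not the configuration of any model discussed here.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (hypothetical config)."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


def ssm_state_bytes(n_layers=32, n_heads=64, head_dim=64, state_size=128, bytes_per_value=2):
    """Bytes for a fixed-size recurrent state (hypothetical config). No seq_len term."""
    return n_layers * n_heads * head_dim * state_size * bytes_per_value


for seq_len in (1_024, 8_192, 32_768):
    kv_gib = kv_cache_bytes(seq_len) / 2**30
    ssm_mib = ssm_state_bytes() / 2**20
    print(f"L={seq_len:>6}: KV cache {kv_gib:6.2f} GiB, SSM state {ssm_mib:6.2f} MiB")
```

Under these assumed dimensions the cache grows in proportion to the sequence length, while the recurrent state stays the same size however long the sequence gets.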

Sounds perfect. So why isn't everyone using them?

Three Unsolved Problems

Problem 1: Quality gap. Mamba-2 trades expressivity for training speed. Its state transition is a scalar times identity — every state dimension evolves the same way. This simplification enables hardware-efficient matrix multiplications during training, but it means the recurrence is less expressive than Mamba-1's diagonal transitions. At matched parameter counts, Mamba-2 slightly underperforms Gated DeltaNet on downstream tasks.

Problem 2: State-tracking failure. Ask Mamba-2 to compute the parity of a binary sequence (count the number of 1s, report whether it is even or odd). It fails — not at 95% accuracy, but at chance level. This is because parity requires a "rotation" in state space — flipping between two states on every 1 — and Mamba-2's real-valued, non-negative eigenvalue transitions cannot express rotational dynamics. More formally, Grazzi et al. (2025) proved that real-valued scalar transitions cannot represent functions requiring modular counting.

Problem 3: Hardware underutilization. SSMs are efficient in theory, but their decode step has an arithmetic intensity of only about 2.5 ops/byte. An H100 GPU needs roughly 295 ops/byte to keep its tensor cores busy on bfloat16 matrix multiplications (peak compute divided by memory bandwidth). During SSM decoding, then, the tensor cores sit mostly idle while memory bandwidth is saturated. The model is fast in theory but leaves about 99% of compute capacity unused.

Figure (interactive): The Three Problems of Current SSMs. For each problem, the figure shows the gap between ideal and actual behavior; together they frame the three axes Mamba-3 targets.

Mamba-3 addresses all three problems simultaneously. Not with heuristic patches, but with principled improvements derived from the state space model perspective: a better discretization rule for expressivity, complex-valued states for rotation capability, and a MIMO formulation for hardware utilization. Let us see how.

Why does Mamba-2 fail at computing the parity of a binary string?

Its hidden state is too small to store the string
Its real-valued, non-negative eigenvalue transition cannot represent the rotational dynamics needed for modular counting
It lacks a causal convolution to pre-process the input

Chapter 1: The Key Insight

Mamba-3's thesis is simple: the right improvements come from returning to state space model fundamentals. Not from the linear attention viewpoint. Not from the test-time regression viewpoint. From the original continuous-time ODE that defines SSMs. Each of the three innovations arises naturally from asking "what does the SSM perspective suggest here?"

Axis 1: Exponential-Trapezoidal Discretization

Mamba-2's discretization is a first-order Euler approximation with O(Δt) global error. The trapezoidal rule is second-order: O(Δt²) global error. It adds a β term to the recurrence, effectively convolving the state-input with a width-2 kernel inside the recurrence. Combined with biases on B and C, this empirically removes the need for the external short convolution that prior models relied on.


Axis 2: Complex-Valued States

Making the SSM state complex-valued introduces rotational dynamics. The imaginary part of the eigenvalue acts as a rotation angle. Discretization turns the complex SSM into a real SSM with block-diagonal 2×2 rotation matrices in the transition — equivalent to data-dependent rotary position embeddings (RoPE) on B and C. This enables parity, modular arithmetic, and other state-tracking tasks that real-valued SSMs provably cannot solve.


Axis 3: MIMO

Standard SSMs are SISO: the state update h_t ← α_t h_{t-1} + B_t x_t^T uses an outer product. MIMO generalizes B and x to have an extra rank dimension R, turning the outer product into a matrix multiply. FLOPs increase by R× but memory traffic barely changes, so arithmetic intensity rises from ~2.5 ops/byte to roughly R× that. The tensor cores wake up. Decode speed stays essentially the same because the matmul overlaps with the memory-bound state read/write.

Why these three belong together: The trapezoidal discretization improves quality but doesn't fix state tracking — that requires complex eigenvalues. Complex states improve capability but don't help hardware utilization — that requires MIMO. MIMO improves utilization but without the quality gains from better discretization and complex states, the extra FLOPs buy less. The three innovations are complementary, not redundant.

The result at 1.5B scale: Mamba-3 SISO beats Gated DeltaNet (the previous best SSM) by +0.6 points on average downstream accuracy. Mamba-3 MIMO adds another +1.2 points for a total +1.8 point gain. On perplexity, Mamba-3 with state size 64 matches Mamba-2 with state size 128 — same quality at half the inference latency. And Mamba-3 solves parity perfectly, while Mamba-2 scores at chance.

Concept → Realization: Concept: "Return to SSM fundamentals to derive principled improvements." Realization: Three specific modifications to the Mamba-2 recurrence: h_t = α_t h_{t-1} + β_t B_{t-1} x_{t-1} + γ_t B_t x_t (trapezoidal), with block-diagonal rotation R_t in the transition (complex states), and B_t ∈ R^{N×R}, x_t ∈ R^{P×R} instead of R^N, R^P (MIMO). Each has fast CUDA/Triton/CuTe kernels.

What the paper does NOT say

Mamba-3 does not solve retrieval. On tasks like extracting facts from semi-structured data (SWDE, FDA), it still substantially underperforms Transformers. The fixed-size state fundamentally limits how much verbatim information can be recalled. The authors are candid about this: they predict SSMs will primarily be used in hybrid architectures, mixing with sparse attention layers to handle retrieval.

Also, the MIMO variant increases training FLOPs in the SSM layers by approximately R× (with R=4), so those layers train roughly R times slower per step. The decode speed stays the same, but you pay upfront during training for the inference-time free lunch.

Why does Mamba-3 MIMO not increase decode latency despite increasing FLOPs by 4x?

It uses model parallelism to distribute the extra computation
It skips the extra computation during inference
Decoding is memory-bound, not compute-bound: the extra matmul FLOPs overlap with the memory I/O that was already the bottleneck

Chapter 2: SSM Foundations

Before we can understand what Mamba-3 changes, we need to understand what it starts from. State space models describe a continuous-time dynamical system with three matrices — A, B, C — that govern how a hidden state evolves in response to input.

The Continuous-Time System

The SSM is defined by a first-order ODE for the hidden state h(t) ∈ R^N and a linear readout for the output y(t) ∈ R:

h'(t) = A(t) h(t) + B(t) x(t)
y(t) = C(t)^T h(t)

Let's trace each component:

h(t) ∈ R^N: the hidden state. Think of it as a memory vector that summarizes everything the system has seen up to time t. N is the state size (typically 64 or 128).
x(t) ∈ R: the scalar input at time t. In language models, this is one dimension of the D-dimensional token embedding — each SSM operates independently on a single channel.
A(t) ∈ R^N×N: the state-transition matrix. Controls how the state evolves on its own. Negative eigenvalues mean the state decays (forgets). Imaginary eigenvalues mean the state rotates (oscillates). In Mamba-2, this is simplified to a scalar × identity: A(t) = a(t)·I.
B(t) ∈ R^N: the input matrix. Controls how the input gets injected into each state dimension. "Which dimensions of memory does this token write to, and how much?"
C(t) ∈ R^N: the output matrix. Controls how the state gets read out to produce the output. "Which dimensions of memory are relevant to the current query?"

Analogy: Think of h(t) as a whiteboard. A controls how fast writing fades (exponential decay). B is the marker — it writes new information from the input. C is the camera — it reads selective parts of the whiteboard to produce the output. The key constraint: the whiteboard has finite space (N dimensions), so old information must fade to make room for new information.

Why Data-Dependent? (Selective SSMs)

In classical SSMs (S4, S5), A, B, C are fixed parameters learned during training. Every token sees the same dynamics. This is called linear time-invariant (LTI).

Mamba-1 made the dynamics data-dependent: Δ, B, and C are projected from the current token, so the effective decay e^{Δ_t A} changes per token even though A itself remains a learned parameter. Now each token controls its own write pattern (B), read pattern (C), and decay rate (through Δ). This is linear time-varying (LTV). The model can decide, for each token: "This is important, remember it" (low decay, strong write) or "This is noise, forget it" (high decay, weak write).

The cost: LTV systems lose the efficient global convolution trick of LTI systems. The recurrence must be computed step by step (or in hardware-efficient chunks via SSD).

Discretization: From ODE to Recurrence

We have discrete tokens x₁, x₂, ..., x_T, but the SSM is defined in continuous time. Discretization converts the continuous ODE into a discrete recurrence that we can compute on tokens.

Start with the ODE h'(t) = A(t)h(t) + B(t)x(t). The exact solution between times τ_{t-1} and τ_t (with step size Δ_t = τ_t - τ_{t-1}) is:

h(τ_t) = exp(∫_{τ_{t-1}}^{τ_t} A(s) ds) · h(τ_{t-1}) + ∫_{τ_{t-1}}^{τ_t} exp(∫_{τ}^{τ_t} A(s) ds) B(τ) x(τ) dτ

The first term (the homogeneous solution) has a closed form once A(s) is approximated by a constant A_t over the interval: exp(Δ_t A_t) h(τ_{t-1}). The second term (the particular integral) is harder: it requires approximating B(τ)x(τ) over the interval. This is where different discretization methods diverge.

Mamba-2's Discretization (Exponential-Euler)

Mamba-2 uses Euler's rule: approximate the integrand by its value at the right endpoint τ_t. This gives:

h_t = e^{Δ_t A_t} h_{t-1} + Δ_t B_t x_t

Define α_t := e^{Δ_t A_t} ∈ (0,1) and γ_t := Δ_t. The recurrence becomes:

h_t = α_t h_{t-1} + γ_t B_t x_t
y_t = C_t^T h_t

This is Mamba-2's core. The step size Δ_t controls both forgetting (α_t) and input scaling (γ_t): large Δ_t forgets faster (smaller α) but up-weights the current token (larger γ). Small Δ_t retains more past state but adds less from the new token.

The hidden assumption: Euler's rule assumes B(τ)x(τ) is approximately constant over the entire interval [τ_{t-1}, τ_t], equal to its value at the right endpoint. This is a first-order approximation with local truncation error O(Δt²) and global error O(Δt). If B and x vary rapidly between tokens, the right-endpoint value misrepresents the integral. Mamba-3 addresses this.

Worked Example: Euler Discretization

Let's trace a concrete step with N=2, Δ_t=0.5, A_t=-1:

α_t = e^{0.5 × (-1)} = e^-0.5 ≈ 0.607
γ_t = 0.5
If h_t-1 = [0.8, 0.3], B_t = [1.0, 0.5], x_t = 2.0:
h_t = 0.607 × [0.8, 0.3] + 0.5 × [1.0, 0.5] × 2.0
h_t = [0.486, 0.182] + [1.0, 0.5] = [1.486, 0.682]

The old state decays to 60.7% of its previous value, and the new token contributes with weight 0.5. Now if C_t = [0.3, 0.7]:

y_t = 0.3 × 1.486 + 0.7 × 0.682 = 0.446 + 0.477 = 0.923
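The worked example above can be reproduced with a few lines of Python. This is a minimal sketch; the function names `euler_step` and `readout` are made up for illustration. Because the code keeps α_t at full precision instead of rounding it to 0.607, the last digit of some values can differ slightly from the hand calculation.

```python
import math


def euler_step(h_prev, B, x, delta, A):
    """One exponential-Euler step: h_t = alpha * h_prev + gamma * B * x."""
    alpha = math.exp(delta * A)  # decay factor applied to the old state
    gamma = delta                # weight on the current token's input
    return [alpha * h + gamma * b * x for h, b in zip(h_prev, B)]


def readout(C, h):
    """Output y_t = C^T h_t."""
    return sum(c * hi for c, hi in zip(C, h))


h_t = euler_step(h_prev=[0.8, 0.3], B=[1.0, 0.5], x=2.0, delta=0.5, A=-1.0)
y_t = readout([0.3, 0.7], h_t)
print([round(v, 3) for v in h_t], round(y_t, 3))
```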

Figure (interactive): SSM Recurrence, Step by Step. Shows how the hidden state evolves token by token. Controls: Δ_t and A_t. High Δ forgets fast but writes strongly. Low Δ retains but writes weakly.

State Space Duality (SSD)

Mamba-2 showed that the recurrence above can be written in matrix form for parallel training:

Y = (L ⊙ CB^T) X

where L is a structured mask (lower-triangular with decay terms), B and C play the roles of keys and queries, and X is the value matrix. This is the state space duality — SSMs and (masked) linear attention are the same computation in different forms. The recurrent form is efficient for inference (O(1) per token). The matrix form is efficient for training (parallelizable via matmul).
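The duality can be checked numerically on a tiny example. The sketch below treats a single scalar channel with state size N = 2 and assumes the mask entries are L_ij = (α_{j+1} ⋯ α_i) γ_j for i ≥ j, which is what unrolling the Euler recurrence gives. All numeric values are arbitrary illustrations.

```python
import math


def ssm_recurrent(alpha, gamma, B, C, x):
    """Recurrent form: h_t = alpha_t h_{t-1} + gamma_t B_t x_t, y_t = C_t . h_t."""
    h = [0.0] * len(B[0])
    ys = []
    for t in range(len(x)):
        h = [alpha[t] * hn + gamma[t] * b * x[t] for hn, b in zip(h, B[t])]
        ys.append(sum(c * hn for c, hn in zip(C[t], h)))
    return ys


def ssm_matrix(alpha, gamma, B, C, x):
    """Matrix form: y_i = sum over j <= i of L[i][j] * (C_i . B_j) * x_j."""
    ys = []
    for i in range(len(x)):
        total = 0.0
        for j in range(i + 1):
            decay = math.prod(alpha[j + 1 : i + 1])  # empty product is 1 when i == j
            L_ij = decay * gamma[j]
            cb = sum(c * b for c, b in zip(C[i], B[j]))
            total += L_ij * cb * x[j]
        ys.append(total)
    return ys


# Illustrative values (assumptions): T = 4 tokens, N = 2, A = -1.
deltas = [0.5, 1.0, 0.2, 0.8]
alpha = [math.exp(-d) for d in deltas]
gamma = deltas
B = [[1.0, 0.5], [0.7, 0.9], [0.2, -0.4], [1.1, 0.3]]
C = [[0.3, 0.7], [0.5, -0.2], [1.0, 0.1], [0.4, 0.6]]
x = [2.0, 1.5, -1.0, 0.5]

y_rec = ssm_recurrent(alpha, gamma, B, C, x)
y_mat = ssm_matrix(alpha, gamma, B, C, x)
print(max(abs(a - b) for a, b in zip(y_rec, y_mat)))
```

The two functions agree up to floating-point rounding. The recurrent one touches each token once, while the matrix one is written as sums that a real implementation would batch into matmuls.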

Mamba-3 will generalize L to include the trapezoidal terms, making it a richer mask while preserving this duality.

In Mamba-2's recurrence h_t = α_t h_{t-1} + γ_t B_t x_t, what does a large Δ_t value mean?

The model forgets faster (α = e^{ΔA} becomes smaller) and up-weights the current token (γ = Δ becomes larger), emphasizing the present over the past
The model retains more of the past state
The hidden state dimension increases

Chapter 3: Exponential-Trapezoidal Discretization

Euler's rule approximates an integral using just one endpoint. A student of numerical methods knows there is a better option: the trapezoidal rule, which averages both endpoints. It is second-order accurate instead of first-order, meaning the error shrinks quadratically instead of linearly as the step size decreases.

But we are not in the standard setting. Our ODE has a time-varying exponential decay, and we have already solved the homogeneous part analytically. What remains is the integral of the state-input B(τ)x(τ) weighted by the exponential decay. Let us derive the trapezoidal version from scratch.

The Derivation

Recall the solution from Chapter 2, with A(s) approximated by the constant A_t over the interval:

h_t ≈ e^{Δ_t A_t} h_{t-1} + ∫_{τ_{t-1}}^{τ_t} e^{(τ_t - τ) A_t} B(τ) x(τ) dτ

Under that approximation the first term needs no further work (the homogeneous ODE is solved analytically). The second term, the integral of the state-input weighted by exponential decay, still needs to be approximated.

Step 1: Define g(τ) := e^{(τ_t - τ)A_t} B(τ)x(τ). This is the integrand we need to approximate.

Step 2: The generalized trapezoidal rule approximates the integral with a convex combination of the two endpoints:

∫_{τ_{t-1}}^{τ_t} g(τ) dτ ≈ Δ_t [(1 - λ_t) g(τ_{t-1}) + λ_t g(τ_t)]

where λ_t ∈ [0,1] is a data-dependent mixing parameter. When λ_t = 1, this recovers Euler (right endpoint only). When λ_t = 1/2, this is the classical trapezoid (equal average of both endpoints).

Step 3: Evaluate the endpoints:

At τ = τ_t (right endpoint): g(τ_t) = e⁰ B_t x_t = B_t x_t
At τ = τ_{t-1} (left endpoint): g(τ_{t-1}) = e^{Δ_t A_t} B_{t-1} x_{t-1}

Step 4: Substitute back:

h_t = e^{Δ_t A_t} h_{t-1} + (1 - λ_t) Δ_t e^{Δ_t A_t} B_{t-1} x_{t-1} + λ_t Δ_t B_t x_t

Defining the three coefficients:

α_t := e^{Δ_t A_t} (the decay factor, same as Euler)
β_t := (1 - λ_t) Δ_t e^{Δ_t A_t} (the new left-endpoint weight)
γ_t := λ_t Δ_t (the right-endpoint weight, same as Euler when λ=1)

The Mamba-3 recurrence becomes:

h_t = α_t h_{t-1} + β_t B_{t-1} x_{t-1} + γ_t B_t x_t

The punchline: Mamba-3's recurrence has three terms instead of Mamba-2's two. The new β term brings in the previous token's state-input, weighted by the left-endpoint evaluation. With the classical choice λ_t = 1/2 this is a second-order approximation: local truncation error O(Δt³) vs. Euler's O(Δt²), global error O(Δt²) vs. Euler's O(Δt).
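The difference in global error order can be observed on a toy problem. The sketch below is an illustration under stated assumptions, not an experiment from the paper: it integrates h' = -h + cos(t) with h(0) = 0 (so A = -1 and B = 1 are constant), whose closed-form solution is known, and compares the error at t = 1 when the step size is halved. A first-order method should roughly halve its error, and a second-order method should roughly quarter it.

```python
import math


def exact(t):
    """Closed-form solution of h' = -h + cos(t) with h(0) = 0."""
    return 0.5 * (math.cos(t) + math.sin(t) - math.exp(-t))


def integrate(n_steps, lam, t_end=1.0, A=-1.0):
    """Run h_k = alpha h_{k-1} + beta x_{k-1} + gamma x_k with a fixed step and fixed lam."""
    delta = t_end / n_steps
    alpha = math.exp(delta * A)
    beta = (1.0 - lam) * delta * alpha  # left-endpoint weight (zero for Euler)
    gamma = lam * delta                 # right-endpoint weight
    h = 0.0
    for k in range(1, n_steps + 1):
        h = alpha * h + beta * math.cos((k - 1) * delta) + gamma * math.cos(k * delta)
    return h


def error(n_steps, lam):
    return abs(integrate(n_steps, lam) - exact(1.0))


euler_ratio = error(100, 1.0) / error(200, 1.0)  # lam = 1: right endpoint only
trap_ratio = error(100, 0.5) / error(200, 0.5)   # lam = 1/2: classical trapezoid
print(f"Euler error ratio when halving the step: {euler_ratio:.2f}")
print(f"Trapezoid error ratio when halving the step: {trap_ratio:.2f}")
```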

Worked Example: Trapezoidal vs. Euler

Same setup as before: N=2, Δ_t=0.5, A_t=-1, λ_t=0.5 (classical trapezoid).

α_t = e^-0.5 ≈ 0.607
β_t = (1 - 0.5) × 0.5 × 0.607 = 0.152
γ_t = 0.5 × 0.5 = 0.25

With h_t-1 = [0.8, 0.3], B_t = [1.0, 0.5], x_t = 2.0, and the previous step's B_t-1 = [0.7, 0.9], x_t-1 = 1.5:

Decay: α_t × h_t-1 = 0.607 × [0.8, 0.3] = [0.486, 0.182]
Left endpoint: β_t × B_t-1 × x_t-1 = 0.152 × [0.7, 0.9] × 1.5 = 0.152 × [1.05, 1.35] = [0.160, 0.205]
Right endpoint: γ_t × B_t × x_t = 0.25 × [1.0, 0.5] × 2.0 = [0.5, 0.25]
Total: h_t = [0.486 + 0.160 + 0.5, 0.182 + 0.205 + 0.25] = [1.146, 0.637]

Compare to Euler: h_t = [0.486 + 1.0, 0.182 + 0.5] = [1.486, 0.682]. The trapezoidal version distributes the input weight between the current and previous token, resulting in a smoother approximation. The β term acts as a bridge between consecutive tokens.
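The same numbers can be reproduced in code. This is a minimal sketch; `trapezoidal_step` is a hypothetical helper name. Setting `lam=1.0` makes β vanish and recovers the Euler step. The code keeps α_t at full precision, so the last digit can differ slightly from the hand-rounded values above.

```python
import math


def trapezoidal_step(h_prev, B_prev, x_prev, B, x, delta, A, lam):
    """One step of h_t = alpha h_{t-1} + beta B_{t-1} x_{t-1} + gamma B_t x_t."""
    alpha = math.exp(delta * A)
    beta = (1.0 - lam) * delta * alpha  # weight on the previous token's state-input
    gamma = lam * delta                 # weight on the current token's state-input
    return [
        alpha * h + beta * bp * x_prev + gamma * b * x
        for h, bp, b in zip(h_prev, B_prev, B)
    ]


args = dict(h_prev=[0.8, 0.3], B_prev=[0.7, 0.9], x_prev=1.5,
            B=[1.0, 0.5], x=2.0, delta=0.5, A=-1.0)
h_trap = trapezoidal_step(lam=0.5, **args)
h_euler = trapezoidal_step(lam=1.0, **args)
print([round(v, 3) for v in h_trap], [round(v, 3) for v in h_euler])
```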

The Implicit Convolution Interpretation

There is an elegant alternative view: the trapezoidal discretization is equivalent to applying a width-2 convolution on the state-input v_t = B_t x_t before feeding it into the linear recurrence.

In a standard SSM, you first compute v_t = B_t x_t, then apply the recurrence h_t = α_t h_{t-1} + γ_t v_t. In Mamba-3, you first convolve: v'_t = β_t v_{t-1} + γ_t v_t, then apply the recurrence h_t = α_t h_{t-1} + v'_t.
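The equivalence of the two orderings is easy to confirm for a single state dimension. The sketch below assumes v_{-1} = 0 before the first token and uses arbitrary illustrative coefficients; the function names are made up for this example.

```python
def direct_recurrence(alpha, beta, gamma, v):
    """Three-term form: h_t = alpha_t h_{t-1} + beta_t v_{t-1} + gamma_t v_t."""
    h, v_prev, out = 0.0, 0.0, []
    for a, b, g, v_t in zip(alpha, beta, gamma, v):
        h = a * h + b * v_prev + g * v_t
        v_prev = v_t
        out.append(h)
    return out


def conv_then_recurrence(alpha, beta, gamma, v):
    """Width-2 convolution first, then a plain two-term recurrence."""
    padded = [0.0] + list(v)  # padded[t] is v_{t-1}, padded[t + 1] is v_t
    v_conv = [b * padded[t] + g * padded[t + 1]
              for t, (b, g) in enumerate(zip(beta, gamma))]
    h, out = 0.0, []
    for a, vc in zip(alpha, v_conv):
        h = a * h + vc
        out.append(h)
    return out


# Arbitrary per-token coefficients and state-inputs (assumptions for illustration).
alpha = [0.6, 0.4, 0.8, 0.5]
beta = [0.15, 0.10, 0.02, 0.12]
gamma = [0.25, 0.70, 0.18, 0.48]
v = [2.0, 1.5, -1.0, 0.5]

h_direct = direct_recurrence(alpha, beta, gamma, v)
h_conv = conv_then_recurrence(alpha, beta, gamma, v)
print(max(abs(a - b) for a, b in zip(h_direct, h_conv)))
```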

Why this matters for architecture: Mamba-2 and most linear models use an external short causal convolution (typically kernel size 4) before the SSM, applied to the raw input x_t and sometimes B_t, C_t. This was believed essential for performance. Mamba-3's implicit convolution operates inside the recurrence, on the state-input B_tx_t rather than on x_t alone. Combined with B,C biases, this internal convolution empirically replaces the external one. Mamba-3 is the first competitive SSM without an external short convolution.

The Mask Matrix View

In the SSD parallel form Y = (L ⊙ CB^T)X, Mamba-2 has mask L with entries L_ij = (α_{j+1} ⋯ α_i) γ_j for i ≥ j (decay product times input weight) and zero above the diagonal. Mamba-3's mask is richer: it factors as L = L₁ · L₂, where L₁ is the lower-triangular decay matrix (the same decay products as in Mamba-2) and L₂ is a two-band matrix with γ on the diagonal and β on the first subdiagonal. This two-band structure is the "convolution" in mask form.
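A small numerical check of the factorization follows. It is a sketch under assumptions: a single state dimension with B = C = 1 (so CB^T is all ones and drops out), L₁ holding the decay products, and L₂ holding γ_t on the diagonal and β_t in row t of the first subdiagonal. The helper names are hypothetical.

```python
import math


def decay_matrix(alpha):
    """L1[i][j] = alpha_{j+1} * ... * alpha_i for i >= j, and 0 above the diagonal."""
    T = len(alpha)
    return [[math.prod(alpha[j + 1 : i + 1]) if i >= j else 0.0 for j in range(T)]
            for i in range(T)]


def two_band_matrix(beta, gamma):
    """L2 has gamma_t on the diagonal and beta_t on the first subdiagonal."""
    T = len(gamma)
    L2 = [[0.0] * T for _ in range(T)]
    for t in range(T):
        L2[t][t] = gamma[t]
        if t > 0:
            L2[t][t - 1] = beta[t]
    return L2


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def trapezoidal_recurrence(alpha, beta, gamma, v):
    """Reference: h_t = alpha_t h_{t-1} + beta_t v_{t-1} + gamma_t v_t, with v_{-1} = 0."""
    h, v_prev, out = 0.0, 0.0, []
    for a, b, g, v_t in zip(alpha, beta, gamma, v):
        h = a * h + b * v_prev + g * v_t
        v_prev = v_t
        out.append(h)
    return out


deltas = [0.5, 1.0, 0.2, 0.8]  # illustrative step sizes, A = -1
lams = [0.5, 0.7, 0.9, 0.6]    # illustrative mixing parameters
alpha = [math.exp(-d) for d in deltas]
beta = [(1 - l) * d * a for l, d, a in zip(lams, deltas, alpha)]
gamma = [l * d for l, d in zip(lams, deltas)]
v = [2.0, 1.5, -1.0, 0.5]

L = matmul(decay_matrix(alpha), two_band_matrix(beta, gamma))
h_mask = [sum(L[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
h_rec = trapezoidal_recurrence(alpha, beta, gamma, v)
print(max(abs(a - b) for a, b in zip(h_mask, h_rec)))
```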

Figure (interactive): Euler vs. Trapezoidal Approximation. Compares how Euler (right endpoint only) and trapezoidal (average of endpoints) approximate the integral as λ_t moves from 1.0 (pure Euler) to 0.5 (classical trapezoid); the trapezoidal rule follows the shape of the curve more closely. Controls: λ_t (mixing), Δ_t (step size).

On the λ Parameterization

For second-order accuracy, theory requires λ_t = 1/2 + O(Δt). But the Mamba-3 authors found that a data-dependent λ_t = σ(u_t) (sigmoid of a learned projection of the input token) performs best empirically, even though it does not satisfy the second-order constraint. Fixing λ = 1/2 gives 15.76 perplexity vs. 15.72 for the learned gate. Setting λ = 1 (Euler, no trapezoidal) gives 15.81. The improvement is real but modest — about 0.1 perplexity points from the trapezoidal term alone.

The bigger win comes from the interaction with B,C biases: combining trapezoidal discretization with biases gives 15.72, down from 16.68 without either. And it removes the need for the external short convolution, simplifying the architecture.

What is the extra term in Mamba-3's recurrence compared to Mamba-2, and where does it come from?

The β_t B_{t-1} x_{t-1} term, which comes from evaluating the state-input integral at the left endpoint of the time interval (trapezoidal rule uses both endpoints, Euler uses only the right)
An attention term that looks back at all previous tokens
A normalization factor that stabilizes training

Chapter 4: Complex-Valued States

Here is a task that sounds trivial: given a binary string like 1 0 1 1 0 1, compute the parity (even or odd number of 1s). A human can do it by keeping a running count modulo 2. An RNN with a single bit of state can do it. But Mamba-2 provably cannot.

Why? Parity requires the state to flip between two values on every 1 and stay unchanged on every 0. In state space terms, processing a "1" should rotate the state by 180 degrees: h → -h. But Mamba-2's transition is α_t ∈ (0, 1), which only shrinks the state. It can never flip the sign. You would need α < 0 or, more generally, complex eigenvalues, of which a negative real value is the special case of rotation by π.

The formal barrier (Grazzi et al., 2025): A linear recurrence h_t = α_t h_{t-1} + γ_t B_t x_t with α_t ∈ R_≥0 cannot represent functions that require modular counting. The eigenvalue spectrum of the transition matrix must include complex (or negative real) values to express rotational dynamics. Parity is the simplest such function: it requires a rotation by π for each "1" input.

The Complex SSM

Mamba-3 makes the hidden state complex: h(t) ∈ C^{N/2} instead of R^N. The continuous ODE becomes:

h'(t) = Diag(A(t) + iθ(t)) h(t) + (B(t) + iB̂(t)) x(t)
y(t) = Re[(C(t) + iĈ(t))^T h(t)]

The key addition is θ(t) — a data-dependent rotation angle for each state dimension. While A(t) controls decay (real part), θ(t) controls rotation (imaginary part). Now the state can oscillate, not just decay.

Complex to Real Equivalence (Proposition 2)

Complex arithmetic is expensive on GPUs. But there is a beautiful equivalence: a complex SSM with state dimension N/2 is exactly equal to a real-valued SSM with state dimension N whose transition matrix contains 2×2 rotation blocks.

Let us derive this for a single complex state dimension. Write the complex state as h_t + iĥ_t, where h_t and ĥ_t now denote its real and imaginary parts. The complex update:

(h_t + iĥ_t) = e^{Δ_t(A_t + iθ_t)} (h_{t-1} + iĥ_{t-1}) + Δ_t (B_t + iB̂_t) x_t

Expand the exponential: e^{Δ_t(A_t + iθ_t)} = e^{Δ_t A_t} (cos(Δ_t θ_t) + i sin(Δ_t θ_t)).

Separate real and imaginary parts. In matrix form, this gives:

[h_t; ĥ_t] = e^{Δ_t A_t} · R(Δ_t θ_t) · [h_{t-1}; ĥ_{t-1}] + Δ_t [B_t; B̂_t] x_t

where R(φ) is the 2×2 rotation matrix:

R(φ) = [cos(φ), -sin(φ); sin(φ), cos(φ)]

For N/2 complex state dimensions, we get N real state dimensions with a block-diagonal transition made of N/2 such rotation blocks, each scaled by e^{Δ_t A_t}.
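The equivalence can be verified numerically for one complex state dimension. The sketch below runs the complex update with Python's `cmath` and the real update with an explicit scaled 2×2 rotation, using the Euler input term Δ_t B_t x_t as in the derivation. The step values are arbitrary illustrations and the function names are made up.

```python
import cmath
import math


def complex_step(h, x, delta, A, theta, B):
    """Complex update: h <- exp(delta * (A + i*theta)) * h + delta * B * x."""
    return cmath.exp(delta * complex(A, theta)) * h + delta * B * x


def real_step(h_pair, x, delta, A, theta, B_pair):
    """Same update on (real, imaginary) parts: decay times a 2x2 rotation, plus input."""
    decay = math.exp(delta * A)
    c, s = math.cos(delta * theta), math.sin(delta * theta)
    re, im = h_pair
    return (decay * (c * re - s * im) + delta * B_pair[0] * x,
            decay * (s * re + c * im) + delta * B_pair[1] * x)


# Each tuple is (delta, A, theta, B, x); values are illustrative assumptions.
steps = [
    (0.5, -1.0, 2.0, 1.0 + 0.5j, 2.0),
    (1.0, -0.3, -1.2, 0.7 - 0.9j, 1.5),
    (0.2, -2.0, 0.4, -0.2 + 0.1j, -1.0),
]

h_complex = 0j
h_real = (0.0, 0.0)
for delta, A, theta, B, x in steps:
    h_complex = complex_step(h_complex, x, delta, A, theta, B)
    h_real = real_step(h_real, x, delta, A, theta, (B.real, B.imag))

print(h_complex, h_real)
```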

The rotation enables parity: To flip the state on every "1", choose θ_t so that the per-step rotation angle is Δ_t θ_t = π · x_t (for example, θ_t = π · x_t with Δ_t = 1). Whenever x_t = 1, the rotation matrix becomes [cos(π), -sin(π); sin(π), cos(π)] = [-1, 0; 0, -1] and the state flips sign. On x_t = 0, the rotation angle is 0 and the state is unchanged. This is exactly the parity computation.
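A stripped-down version of this construction fits in a few lines. The sketch makes simplifying assumptions that go beyond the text: no decay (α_t = 1), Δ_t = 1, no input injection, and an initial state of (1, 0). Parity is then read from the sign of the first state component.

```python
import math


def parity_by_rotation(bits):
    """Track parity by rotating a 2D state by pi on each 1 and by 0 on each 0."""
    h = (1.0, 0.0)  # assumed initial state; a positive first component means "even"
    for bit in bits:
        phi = math.pi * bit
        c, s = math.cos(phi), math.sin(phi)
        h = (c * h[0] - s * h[1], s * h[0] + c * h[1])
    return 0 if h[0] > 0 else 1


example = [1, 0, 1, 1, 0, 1]
print(parity_by_rotation(example), sum(example) % 2)
```

A transition restricted to α_t ∈ (0, 1) can only shrink the state toward zero, so the sign carrying the parity bit would never change.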

The RoPE Trick (Proposition 3)

Computing the block-diagonal rotation explicitly requires storing and multiplying N×N matrices at each step. Expensive. But there is a trick.

Unroll the recurrence for T steps. At step t, B_t gets multiplied by the cumulative product of all rotations from step 0 to t: (Π_{i=0}^{t} R_i^T) B_t. Similarly, C_t gets (Π_{i=0}^{t} R_i^T) C_t.

This is precisely a rotary position embedding (RoPE) applied to B and C! In the SSD duality, B corresponds to keys and C to queries. So the complex SSM is equivalent to applying data-dependent RoPE to the Q and K components of the attention-like computation.

The crucial difference from standard RoPE: the rotation angles are data-dependent (θ_t is projected from the input token), not fixed (θ_i = 10000^{-2i/N} in vanilla RoPE). Data-dependent rotations can modulate based on content, enabling state tracking.

In practice, the RoPE computation is trivially efficient: pair up the state dimensions, compute cumulative sums of angles, and apply rotations with the standard RoPE formula. This adds negligible overhead.
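The trick can be checked on one 2×2 rotation block. In the sketch below, `phi[t]` stands for the per-step angle Δ_t θ_t, and the input term uses the simpler Euler form γ_t B_t x_t. One function rotates the state explicitly at every step; the other rotates B_t and C_t by the transpose of the cumulative rotation and runs a rotation-free recurrence. All values and function names are illustrative assumptions.

```python
import math


def rotate(vec, phi):
    """Apply the 2x2 rotation R(phi) to a 2-vector."""
    c, s = math.cos(phi), math.sin(phi)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])


def explicit_rotation_ssm(alpha, phi, gamma, B, C, x):
    """h_t = alpha_t R(phi_t) h_{t-1} + gamma_t B_t x_t, y_t = C_t . h_t."""
    h, ys = (0.0, 0.0), []
    for t in range(len(x)):
        r = rotate(h, phi[t])
        h = (alpha[t] * r[0] + gamma[t] * B[t][0] * x[t],
             alpha[t] * r[1] + gamma[t] * B[t][1] * x[t])
        ys.append(C[t][0] * h[0] + C[t][1] * h[1])
    return ys


def rope_ssm(alpha, phi, gamma, B, C, x):
    """Rotate B_t and C_t by the cumulative angle, then run a scalar-decay recurrence."""
    h, ys, cumulative = (0.0, 0.0), [], 0.0
    for t in range(len(x)):
        cumulative += phi[t]
        B_rot = rotate(B[t], -cumulative)  # transpose of the cumulative rotation
        C_rot = rotate(C[t], -cumulative)
        h = (alpha[t] * h[0] + gamma[t] * B_rot[0] * x[t],
             alpha[t] * h[1] + gamma[t] * B_rot[1] * x[t])
        ys.append(C_rot[0] * h[0] + C_rot[1] * h[1])
    return ys


alpha = [0.6, 0.4, 0.8, 0.5]
phi = [0.3, math.pi, -1.1, 2.0]
gamma = [0.5, 1.0, 0.2, 0.8]
B = [(1.0, 0.5), (0.7, 0.9), (0.2, -0.4), (1.1, 0.3)]
C = [(0.3, 0.7), (0.5, -0.2), (1.0, 0.1), (0.4, 0.6)]
x = [2.0, 1.5, -1.0, 0.5]

y_explicit = explicit_rotation_ssm(alpha, phi, gamma, B, C, x)
y_rope = rope_ssm(alpha, phi, gamma, B, C, x)
print(max(abs(a - b) for a, b in zip(y_explicit, y_rope)))
```

The outputs match because rotations preserve dot products: rotating both C_t and the state by the same cumulative angle leaves C_t^T h_t unchanged.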

Experimental Proof

The results are stark (Table 3b of the paper):

Model	Parity	Arith (no brackets)	Arith (brackets)
Mamba-3	100.00%	98.51%	87.75%
Mamba-3 (std RoPE)	1.56%	20.70%	2.62%
Mamba-3 (no RoPE)	2.27%	1.49%	0.72%
Mamba-2	0.90%	47.81%	0.88%
GDN [-1,1]	100.00%	99.25%	93.50%

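Mamba-3 with data-dependent RoPE reaches 100% on parity and 98.51% on arithmetic without brackets, with 87.75% on the harder bracketed variant. Without RoPE, or with standard position-based RoPE, parity accuracy drops to about 2% and bracketed arithmetic to below 3%. GDN with eigenvalues in [-1, 1] also solves parity, as expected for a model that allows negative eigenvalues. This supports the theory: complex eigenvalues enable rotational dynamics that real, non-negative eigenvalues cannot.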

Figure (interactive): Parity via Rotation, Real vs. Complex SSM. A real-valued SSM (left) and a complex SSM (right) attempt parity on the same random binary string. The real SSM's state decays monotonically and cannot flip. The complex SSM rotates by π on each "1", tracking parity exactly.

Complex + Trapezoidal Together (Proposition 4)

Combining the complex SSM with exponential-trapezoidal discretization, the full Mamba-3 recurrence is:

h_t = α_t h_{t-1} + β_t (Π_{i=0}^{t-1} R_i^T) B_{t-1} x_{t-1} + γ_t (Π_{i=0}^{t} R_i^T) B_t x_t

y_t = [(Π_{i=0}^{t} R_i^T) C_t]^T h_t

Both the β term (left endpoint) and γ term (right endpoint) get their own rotation applied. The output also gets rotated. In practice, B and C are rotated via the RoPE trick before the recurrence, so the recurrence itself is identical in form to the real-valued case — just with rotated inputs.

What does the imaginary part θ(t) of the complex SSM eigenvalue control, and why is data-dependent θ essential?

θ controls the rotation angle of the state. Data-dependent θ lets the model rotate by different amounts for different tokens (e.g., π for "1" and 0 for "0" in parity), enabling state-tracking tasks that fixed or absent rotation cannot solve
θ controls the learning rate of the SSM parameters
θ controls the output gate strength

Chapter 5: Multi-Input, Multi-Output (MIMO)

We have improved quality (trapezoidal discretization) and capability (complex states). Now let us fix hardware utilization.

The Underutilization Problem

Consider a single head of a SISO (single-input, single-output) SSM with head dimension P and state size N. The decode step involves:

State update: h_t ← α_t h_{t-1} + B_t x_t^T, where h_t ∈ R^{N×P}, B_t ∈ R^N, x_t ∈ R^P
Output: y_t = C_t^T h_t, where C_t ∈ R^N, y_t ∈ R^P

The state update uses an outer product B_t x_t^T (N × P FLOPs). But the memory traffic is dominated by loading h_{t-1} (N × P elements) and storing h_t (N × P elements). So:

Arithmetic Intensity = FLOPs / Bytes ≈ (5NP - P) / (2(1 + 2N + NP + P)) ≈ 2.5 ops/byte

On an H100, bfloat16 matmuls need roughly 295 ops/byte (peak tensor-core throughput divided by memory bandwidth) to become compute-bound. We are at 2.5. That means about 99% of compute is idle during the decode step: the GPU waits for memory, not arithmetic.

The paradox of efficient models: SSMs are "efficient" because they use fewer FLOPs. But on modern GPUs, the decode bottleneck is memory bandwidth, not FLOPs. Once memory traffic sets the pace, doing fewer FLOPs per byte saves no time; it only leaves compute units idle that could have done useful work while waiting on memory. We are memory-bound when we want to be compute-bound.

The MIMO Solution

MIMO adds a rank dimension R to the state-input computation. Instead of:

B_t ∈ R^N, x_t ∈ R^P (outer product: N × P FLOPs)

We use:

B_t ∈ R^{N×R}, x_t ∈ R^{P×R} (matrix multiply: N × P × R FLOPs)

Similarly, C_t goes from R^N to R^{N×R}, and y_t from R^P to R^{P×R}.

The key insight: memory traffic barely increases. The state h_t ∈ R^N×P stays the same size because the MIMO rank R does not expand the state. What increases are the projection vectors B_t, C_t, x_t — but these are small compared to the state. So FLOPs go up by R× while memory I/O stays roughly constant. Arithmetic intensity increases by roughly R×.

MIMO Arithmetic Intensity ≈ (4NPR + NP - PR) / (2(1 + 2NR + NP + PR)) = Θ(R) for R ≪ N, P

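With R=4: arithmetic intensity rises from ~2.5 toward roughly 4× that. Evaluating the formula above at N = P = 64 gives about 7 ops/byte, while the idealized R× scaling would give ~10. Still far from the 295 needed to saturate the tensor cores, but a severalfold gain in utilization is meaningful, and it comes almost for free in wall-clock time because the extra matmul FLOPs overlap with the nearly unchanged memory I/O.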
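The two intensity formulas are simple enough to evaluate directly. The sketch below transcribes them as written and tabulates a few ranks; the choice N = P = 64 is an assumption for illustration. At that size the formulas give about 2.4 ops/byte for SISO and about 7.1 ops/byte for MIMO with R = 4.

```python
def siso_intensity(N, P):
    """SISO decode arithmetic intensity in ops/byte, as given by the formula in the text."""
    return (5 * N * P - P) / (2 * (1 + 2 * N + N * P + P))


def mimo_intensity(N, P, R):
    """MIMO decode arithmetic intensity in ops/byte for rank R, as given in the text."""
    return (4 * N * P * R + N * P - P * R) / (2 * (1 + 2 * N * R + N * P + P * R))


N, P = 64, 64  # illustrative state size and head dimension
print(f"SISO: {siso_intensity(N, P):.2f} ops/byte")
for R in (1, 2, 4, 8):
    print(f"MIMO R={R}: {mimo_intensity(N, P, R):.2f} ops/byte")
```

Setting R = 1 in the MIMO formula reproduces the SISO value, which is a quick consistency check on the two expressions.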

The Signal Processing View

Why "multi-input, multi-output"? In signal processing, a SISO system has one input channel and one output channel. The SSM takes scalar x_t in and produces scalar y_t out, replicated across P head dimensions by stacking. A MIMO system has R input channels and R output channels. Each of the R inputs has its own B projection and contributes to the state independently. Each of the R outputs reads the state through its own C projection.

Formally, a MIMO SSM with rank R is equivalent to R SISO SSMs that share the same decay α_t, with their states summed into a single shared state:

h_t^(j) ← α_t h_{t-1}^(j) + B_t^(j) (x_t^(j))^T
h_t = Σ_{j=0}^{R-1} h_t^(j)
y_t^(i) = (C_t^(i))^T h_t

The outer product becomes a matrix product. This is computationally friendly because the matmul can use tensor cores, which were sitting idle in the SISO case.
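The claim that the matrix product equals a sum of R outer products can be checked with plain nested lists. In this sketch the state is N × P, B is N × R, and X is P × R, following the shapes in the text; the function names and numbers are illustrative assumptions, and a real kernel would issue a single tensor-core matmul rather than Python loops.

```python
def mimo_update_matmul(h, alpha, B, X):
    """h_new = alpha * h + B @ X^T, with h: N x P, B: N x R, X: P x R."""
    N, P, R = len(h), len(h[0]), len(B[0])
    return [[alpha * h[n][p] + sum(B[n][r] * X[p][r] for r in range(R))
             for p in range(P)] for n in range(N)]


def mimo_update_sum_of_outer(h, alpha, B, X):
    """Same update written as R rank-1 outer products added to the decayed state."""
    N, P, R = len(h), len(h[0]), len(B[0])
    new = [[alpha * value for value in row] for row in h]
    for r in range(R):
        b_r = [B[n][r] for n in range(N)]  # r-th input projection, length N
        x_r = [X[p][r] for p in range(P)]  # r-th input channel, length P
        for n in range(N):
            for p in range(P):
                new[n][p] += b_r[n] * x_r[p]
    return new


# Tiny illustrative shapes: N = 2, P = 3, R = 2.
h = [[0.8, 0.3, -0.1], [0.2, 0.0, 0.5]]
B = [[1.0, 0.5], [0.7, -0.9]]
X = [[2.0, 1.0], [1.5, -0.5], [-1.0, 0.25]]

h_matmul = mimo_update_matmul(h, 0.6, B, X)
h_outer = mimo_update_sum_of_outer(h, 0.6, B, X)
print(h_matmul)
```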

Figure (interactive): SISO vs. MIMO Arithmetic Intensity. Left panel: SISO decode step, where the outer product uses few FLOPs relative to memory traffic. Right panel: MIMO decode step, where the matrix multiply uses R× more FLOPs with barely more memory traffic. Controls: MIMO rank R (default 4), state size N (default 64), head dimension P (default 64). A dashed line marks where tensor cores become meaningfully utilized.

Training Cost of MIMO

MIMO is free at decode time but not at training time. The sequence-to-sequence form requires calling the SISO training algorithm R² times naively (all R×R cross terms). However, by exploiting the chunked SSD algorithm structure, this reduces to R× overhead: the intra-chunk computation maintains the SISO chunk size while increasing the number of chunks by R. With R=4, training is about 4× slower per step for the SSM layers.

To keep total parameter count constant, Mamba-3 shrinks the MLP hidden dimension to compensate for the extra MIMO projections. At 1.5B: SISO MLP dim is 4096, MIMO MLP dim is 3824. The parameter counts match, but the MIMO variant distributes more capacity into the SSM and less into the MLP.

Kernel Implementation

The decode kernel is implemented in CuTe-DSL (NVIDIA's tensor core DSL), not Triton, because the MIMO matrix multiply requires explicit tensor core scheduling. The forward (prefill) path uses Triton (SISO) or TileLang (MIMO). The kernel fusion structure fuses the RoPE rotation, SSM update, gating, and MIMO projection into a single kernel to minimize memory round-trips.

Kernel fusion summary (IP = input projection, OP = output projection):
Forward SISO: IP → SSM+Rotary → Gate+OP (Triton)
Forward MIMO: IP → SSM+Rotary → Gate+OP (TileLang)
Decode SISO: IP → Rotary → SSM+Gate → OP (CuTe + Triton)
Decode MIMO: IP → Rotary → SSM+Gate+MIMO → OP (CuTe + Triton)

Why does MIMO with rank R=4 not significantly increase decode latency despite 4x more FLOPs?

Because SISO decoding is memory-bound (2.5 ops/byte vs. 295 peak), the 4x extra FLOPs from the MIMO matmul overlap with memory I/O. The GPU was already waiting for memory, so computing more during that wait costs almost no additional time.
Because MIMO reduces the state size by 4x
Because the extra computation is done on CPU instead of GPU

Chapter 6: Architecture

We have the three core innovations. Now let us assemble the full Mamba-3 block and see how it differs from its predecessor.

Overall Model Structure

The macro-architecture follows Llama: alternating sequence-mixing blocks and feedforward (SwiGLU MLP) blocks with pre-norm (RMSNorm). This is the same skeleton used by most modern language models — the innovation is in what goes inside the sequence-mixing block.

Mamba-3 Block vs. Mamba-2 Block

Mamba-2 Block

Input → Linear projection (expand 2×) → Short Conv (k=4) → SiLU activation → Δ,B,C projection → SSD (Euler recurrence) → Gate → RMSNorm → Output projection

↓ becomes ↓

Mamba-3 Block

Input → Linear projection (expand 2×) → Δ,A,B,C,λ projection → BCNorm → B,C Biases → Data-dependent RoPE → Exp-Trapezoidal SSM (+ optional MIMO) → Gate → Output projection

Notice what is removed: the short causal convolution and its SiLU activation, and the post-gate RMSNorm. Notice what is added: BCNorm, B/C biases, data-dependent RoPE (complex SSM), and the trapezoidal β term.

Key Architectural Details

BCNorm (QKNorm): RMSNorm is applied to B and C after projection, before biases. This mirrors the Q/K normalization used in modern Transformers (since B ↔ K and C ↔ Q in the SSD duality). BCNorm stabilizes training enough that the post-gate RMSNorm can be removed in pure Mamba-3 models. However, for hybrid models (Mamba-3 + attention layers), a pre-gate grouped RMSNorm is reintroduced for better length generalization.

B, C Biases: Learnable, head-specific, channel-wise bias vectors added to B and C after BCNorm. Initialized to all-ones. These biases create data-independent components in B and C, making the SSM behave partly like a fixed convolution alongside its data-dependent behavior. Yu & Erichson (2025) proved that adding channel-specific bias to B grants universal approximation capabilities to block-biased Mamba.

Data-dependent A: Mamba-3 makes A_t data-dependent (projected from the token) rather than a fixed parameter. This keeps all SSM parameters consistently data-dependent. Empirically, this does not change performance, but it simplifies the parameterization.

No more short convolution: The combination of (1) exponential-trapezoidal discretization (implicit width-2 convolution on state-input) and (2) B,C biases (data-independent component) empirically makes the external short causal convolution redundant. Ablation: Mamba-3 with conv gets 15.85 ppl, without conv gets 15.72 ppl. The conv actually hurts. This is architecturally significant — the short conv was previously considered essential for all recurrent models.
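To make the "implicit width-2 convolution" concrete, here is a minimal scalar sketch. It uses the α, β, γ definitions from this lesson; the function names and toy inputs are illustrative assumptions, not the paper's kernels. It shows that the trapezoidal update gives the same states as first filtering the input u_t = B_t x_t with a two-tap, data-dependent filter and then running a plain single-input recurrence.

```python
import math

def coeffs(delta, a, lam):
    """Return (alpha, beta, gamma) for one step, all scalars."""
    alpha = math.exp(delta * a)          # decay, in (0, 1) when a < 0
    beta = (1.0 - lam) * delta * alpha   # weight on the previous input
    gamma = lam * delta                  # weight on the current input
    return alpha, beta, gamma

def trapezoid_scan(deltas, a_vals, lams, u):
    """h_t = alpha_t h_{t-1} + beta_t u_{t-1} + gamma_t u_t."""
    h, u_prev, out = 0.0, 0.0, []
    for delta, a, lam, u_t in zip(deltas, a_vals, lams, u):
        alpha, beta, gamma = coeffs(delta, a, lam)
        h = alpha * h + beta * u_prev + gamma * u_t
        u_prev = u_t
        out.append(h)
    return out

def filtered_single_input_scan(deltas, a_vals, lams, u):
    """Stage 1: two-tap filter over u. Stage 2: h_t = alpha_t h_{t-1} + v_t."""
    filtered, u_prev = [], 0.0
    for delta, a, lam, u_t in zip(deltas, a_vals, lams, u):
        _, beta, gamma = coeffs(delta, a, lam)
        filtered.append(beta * u_prev + gamma * u_t)
        u_prev = u_t
    h, out = 0.0, []
    for delta, a, v_t in zip(deltas, a_vals, filtered):
        h = math.exp(delta * a) * h + v_t
        out.append(h)
    return out

# Toy values (assumptions chosen only for illustration).
deltas = [0.5, 0.2, 0.8, 0.1]
a_vals = [-1.0, -0.5, -2.0, -1.0]
lams = [0.5, 0.7, 0.3, 1.0]
u = [1.0, -2.0, 0.5, 3.0]

direct = trapezoid_scan(deltas, a_vals, lams, u)
two_stage = filtered_single_input_scan(deltas, a_vals, lams, u)
print(all(math.isclose(p, q, abs_tol=1e-12) for p, q in zip(direct, two_stage)))
```

Setting lam to 1 makes beta zero, so the filter collapses to a single tap and the update reduces to the Euler-style recurrence used by Mamba-2.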

Data Flow Through the Block

Let us trace a single token x ∈ R^D through the Mamba-3 block (D = model dimension, H = number of heads, P = head dimension, N = state size):

1. Input projection: x → [z, x'] via W_in ∈ R^{D × (D + expand·D)}. z ∈ R^D is the gate. x' ∈ R^{expand·D} is the SSM input (expand = 2 by default).
2. SSM parameter projection: x' → Δ_t, A_t, B_t, C_t, λ_t via separate linear layers. B_t, C_t ∈ R^{H×N} (per head). Δ_t, A_t, λ_t ∈ R^H (scalar per head).
3. BCNorm: B_t → RMSNorm(B_t), C_t → RMSNorm(C_t) (per-head normalization).
4. Biases: B_t → B_t + b_B, C_t → C_t + b_C (learnable, head-specific, shape R^{H×N}).
5. RoPE rotation: Compute cumulative rotation angles Θ_t = Σ_{i=0}^{t} Δ_i θ_i. Apply rotation to B_t and C_t using the standard RoPE formula (pair up state dimensions, apply 2D rotation).
6. SSM recurrence: h_t = α_t h_{t-1} + β_t B_{t-1} x'_{t-1} + γ_t B_t x'_t (where x' is reshaped to [H, P] across heads). Output: y_t = C_t^T h_t ∈ R^{H×P}.
7. Gate and output: o_t = SiLU(z) ⊙ flatten(y_t). Output projection: W_out · o_t ∈ R^D.

For MIMO, step 6 changes: B_t ∈ R^{H×N×R}, x_t ∈ R^{H×P×R}, the outer product becomes a matmul, and intermediate outputs go through a SiLU residual before the final output projection.
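The SISO data flow above, from BCNorm through the gate, can be sketched for a single head and a single decode step in plain Python. This is a hedged illustration, not the reference kernel: the helper names (`new_state`, `head_step`, `gate`), the rotation sign convention, the omission of a learned RMSNorm scale, and the choice to cache the previous B x product in rotated form are all assumptions. The input and output projections are left out.

```python
import math

def rms_norm(v, eps=1e-6):
    """RMSNorm without a learned scale (omitted for brevity)."""
    scale = math.sqrt(sum(e * e for e in v) / len(v) + eps)
    return [e / scale for e in v]

def rotate_pairs(v, angles):
    """Rotate each pair (v[2i], v[2i+1]) by angles[i], as in standard RoPE."""
    out = []
    for i, phi in enumerate(angles):
        a, b = v[2 * i], v[2 * i + 1]
        out.append(a * math.cos(phi) - b * math.sin(phi))
        out.append(a * math.sin(phi) + b * math.cos(phi))
    return out

def silu(z):
    return z / (1.0 + math.exp(-z))

def new_state(n, p):
    """Zero state for one head: h and the cached B x are N x P, angles are N/2."""
    zeros = lambda: [[0.0] * p for _ in range(n)]
    return {"h": zeros(), "u_prev": zeros(), "cum_angle": [0.0] * (n // 2)}

def head_step(state, x, b_raw, c_raw, delta, a, lam, theta, b_bias, c_bias):
    """One decode step. x has length P; b_raw, c_raw, biases have length N."""
    # BCNorm followed by the learnable biases.
    b = [e + bias for e, bias in zip(rms_norm(b_raw), b_bias)]
    c = [e + bias for e, bias in zip(rms_norm(c_raw), c_bias)]
    # Data-dependent RoPE: accumulate delta * theta, then rotate B and C.
    state["cum_angle"] = [t + delta * th for t, th in zip(state["cum_angle"], theta)]
    angles = [-t for t in state["cum_angle"]]  # sign convention is an assumption
    b, c = rotate_pairs(b, angles), rotate_pairs(c, angles)
    # Exponential-trapezoidal coefficients.
    alpha = math.exp(delta * a)
    beta = (1.0 - lam) * delta * alpha
    gamma = lam * delta
    # Recurrence on the N x P state, using the outer product B_t x_t.
    u = [[bn * xp for xp in x] for bn in b]
    state["h"] = [
        [alpha * hv + beta * up + gamma * uv for hv, up, uv in zip(h_row, p_row, u_row)]
        for h_row, p_row, u_row in zip(state["h"], state["u_prev"], u)
    ]
    state["u_prev"] = u
    # Readout y = C^T h, a vector of length P.
    return [sum(c[n] * state["h"][n][p] for n in range(len(c))) for p in range(len(x))]

def gate(z, y):
    """Output gate: SiLU(z) multiplied elementwise with y."""
    return [silu(zi) * yi for zi, yi in zip(z, y)]

state = new_state(n=2, p=2)
y = head_step(state, x=[1.0, 2.0], b_raw=[1.0, 1.0], c_raw=[1.0, 1.0],
              delta=1.0, a=0.0, lam=1.0, theta=[0.3],
              b_bias=[0.0, 0.0], c_bias=[0.0, 0.0])
print(y)
```

On the first step B and C are rotated by the same angle, so their dot product is unchanged and the rotation has no visible effect on y. It only matters once the state holds contributions written at earlier cumulative angles.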

Mamba-3 Architecture Diagram

[Interactive block diagram omitted. Color key: blue = linear projection, green = SSM-specific, orange = new in Mamba-3.]

Model Configurations

Scale	Layers	d_model	Heads	d_state	Head dim	MLP dim (SISO)	MLP dim (MIMO R=4)
180M	12	768	6	64	128	1,500	1,264
440M	24	1024	8	64	128	2,048	1,792
880M	32	1536	12	64	128	3,072	2,800
1.5B	24	2048	16	64	128	4,096	3,824

All models use expand factor 2, Llama-3.1 tokenizer, 2K context length during pre-training, and 100B tokens from FineWeb-Edu. Note the MIMO MLP dimension is smaller to match parameter counts with SISO.
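As a rough worked example of what "fixed-size state" means for these configurations, the sketch below counts recurrent state elements per sequence as layers × heads × d_state × head_dim, using the table values. It assumes one SSM state per listed layer and 2 bytes per element (BF16), and it ignores the small extra buffers a trapezoidal update would cache. Those assumptions are mine, so treat the output as an order-of-magnitude estimate.

```python
# Values copied from the configuration table above.
CONFIGS = {
    "180M": {"layers": 12, "heads": 6, "d_state": 64, "head_dim": 128},
    "440M": {"layers": 24, "heads": 8, "d_state": 64, "head_dim": 128},
    "880M": {"layers": 32, "heads": 12, "d_state": 64, "head_dim": 128},
    "1.5B": {"layers": 24, "heads": 16, "d_state": 64, "head_dim": 128},
}

def state_elements(cfg, d_state=None):
    """Number of recurrent state elements for one sequence."""
    n = cfg["d_state"] if d_state is None else d_state
    return cfg["layers"] * cfg["heads"] * n * cfg["head_dim"]

def state_mib(cfg, bytes_per_element=2, d_state=None):
    """State size in MiB; bytes_per_element=2 assumes BF16 storage."""
    return state_elements(cfg, d_state) * bytes_per_element / 2**20

for name, cfg in CONFIGS.items():
    print(f"{name}: {state_elements(cfg):,} elements, {state_mib(cfg):.3f} MiB per sequence")
```

Unlike a KV cache, this figure does not depend on how many tokens have been generated. Doubling d_state to 128 doubles it, which is why matching quality at d_state=64 matters for decode cost.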

What two changes in Mamba-3 make the external short causal convolution unnecessary?

Exponential-trapezoidal discretization (implicit width-2 convolution on B_t x_t inside the recurrence) and B,C biases (data-independent component that provides convolution-like behavior)
MIMO projections and data-dependent RoPE
Larger model dimension and more heads

Chapter 7: Results

Numbers matter. Mamba-3 was evaluated at four scales (180M, 440M, 880M, 1.5B) on language modeling perplexity and seven downstream tasks. Let us walk through the key findings.

Downstream Language Modeling (Table 1)

At 1.5B scale (the largest), the average downstream accuracy across seven tasks:

Model	PPL ↓	LAMBADA	HellaSwag	PIQA	ARC-E	ARC-C	WinoGr.	OBQA	Avg ↑
Transformer	10.51	50.3	60.6	73.8	74.0	40.4	58.7	29.6	55.4
Gated DeltaNet	10.45	49.2	61.3	74.3	75.3	41.2	58.0	31.6	55.8
Mamba-2	10.47	47.8	61.4	73.6	75.3	41.8	57.5	32.6	55.7
Mamba-3 SISO	10.35	49.4	61.9	73.6	75.9	42.7	59.4	32.0	56.4
Mamba-3 MIMO	10.24	51.7	62.3	75.3	76.5	44.5	60.6	32.6	57.6

Key takeaways:

Mamba-3 SISO has the best average among the non-MIMO models: +0.6 points over GDN (the prior best sub-quadratic model) and +1.0 over the Transformer.
MIMO adds another +1.2 points, for a total of +2.2 over the Transformer and +1.8 over GDN.
The perplexity improvements are consistent: Mamba-3 SISO has the lowest PPL among non-MIMO models at every scale tested.
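The averages and gaps quoted above can be rechecked directly from the table. The snippet below is just that arithmetic; the per-task numbers are copied from the table, and the small mismatch for some rows (for example 55.3 recomputed vs. 55.4 reported for the Transformer) presumably comes from the paper averaging unrounded scores, which is an assumption on my part.

```python
# Per-task accuracies in table order:
# LAMBADA, HellaSwag, PIQA, ARC-E, ARC-C, WinoGrande, OBQA.
SCORES = {
    "Transformer":    [50.3, 60.6, 73.8, 74.0, 40.4, 58.7, 29.6],
    "Gated DeltaNet": [49.2, 61.3, 74.3, 75.3, 41.2, 58.0, 31.6],
    "Mamba-2":        [47.8, 61.4, 73.6, 75.3, 41.8, 57.5, 32.6],
    "Mamba-3 SISO":   [49.4, 61.9, 73.6, 75.9, 42.7, 59.4, 32.0],
    "Mamba-3 MIMO":   [51.7, 62.3, 75.3, 76.5, 44.5, 60.6, 32.6],
}
REPORTED_AVG = {
    "Transformer": 55.4, "Gated DeltaNet": 55.8, "Mamba-2": 55.7,
    "Mamba-3 SISO": 56.4, "Mamba-3 MIMO": 57.6,
}

def mean(values):
    return sum(values) / len(values)

def gap(model, baseline):
    """Difference in reported average accuracy, rounded to one decimal."""
    return round(REPORTED_AVG[model] - REPORTED_AVG[baseline], 1)

for name, values in SCORES.items():
    print(f"{name}: recomputed {mean(values):.2f}, reported {REPORTED_AVG[name]}")

print("SISO vs GDN:", gap("Mamba-3 SISO", "Gated DeltaNet"))
print("SISO vs Transformer:", gap("Mamba-3 SISO", "Transformer"))
print("MIMO vs SISO:", gap("Mamba-3 MIMO", "Mamba-3 SISO"))
```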

The Pareto Frontier: State Size vs. Quality

The most impactful result is Figure 2 of the paper, which plots perplexity against state size (a proxy for inference speed). Mamba-3 with state size 64 matches Mamba-2 with state size 128: the same quality at half the state size, which means half the state memory and roughly half the per-token decode latency (0.110 ms vs. 0.203 ms in the kernel timings below).

MIMO pushes the frontier further: even at state size 64, MIMO R=4 achieves better perplexity than Mamba-2 at state size 128. More quality for less inference cost.

State Size vs. Perplexity Pareto Frontier

Each point is a model variant. Lower-left is better (lower perplexity, smaller state). Mamba-3 shifts the entire curve down and left. MIMO shifts it further down without moving right (same state size, better quality).

Kernel Latency (Table 4)

Practical decode speed matters as much as theoretical efficiency. At 1.5B with batch 128 on a single H100:

Model	BF16, d_state=64	BF16, d_state=128
Mamba-2	0.127 ms	0.203 ms
Gated DeltaNet	0.176 ms	0.257 ms
Mamba-3 SISO	0.110 ms	0.156 ms
Mamba-3 MIMO R=4	0.137 ms	0.179 ms

Mamba-3 SISO is the fastest in both configurations, faster even than Mamba-2's reference kernels. MIMO adds only about 25% latency at d_state=64 (0.137 vs. 0.110 ms) and about 15% at d_state=128 (0.179 vs. 0.156 ms) despite 4× more FLOPs, which supports the memory-bound argument from Chapter 5.
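A back-of-the-envelope roofline model shows why extra FLOPs can be nearly free here. The sketch assumes perfect overlap of compute and memory traffic, takes the 2.5 ops/byte and 295 ops/byte figures from Chapter 5, and assumes MIMO multiplies FLOPs by R while leaving bytes moved unchanged. Real kernels overlap imperfectly, so the model predicts zero overhead where the table shows 15 to 25%.

```python
def roofline_time(flops, bytes_moved, peak_flops, bandwidth):
    """Idealized roofline: time is set by whichever resource saturates first."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

# Normalized units: 1 byte per unit time, so peak_flops equals the ridge point.
RIDGE = 295.0    # ops/byte where the GPU turns compute-bound (Chapter 5)
SISO_AI = 2.5    # ops/byte for SISO decode (Chapter 5)

def decode_time(rank):
    # Assumption: rank-R MIMO does R times the FLOPs over the same bytes.
    return roofline_time(SISO_AI * rank, 1.0, RIDGE, 1.0)

predicted_overhead = decode_time(4) / decode_time(1) - 1.0
compute_bound_rank = RIDGE / SISO_AI

# Measured kernel latencies in ms from the table above: (SISO, MIMO R=4).
MEASURED_MS = {64: (0.110, 0.137), 128: (0.156, 0.179)}
measured_overhead = {n: mimo / siso - 1.0 for n, (siso, mimo) in MEASURED_MS.items()}

print(f"roofline-predicted MIMO overhead: {predicted_overhead:.0%}")
for n, overhead in measured_overhead.items():
    print(f"measured overhead at d_state={n}: {overhead:.0%}")
print(f"idealized rank where decode turns compute-bound: {compute_bound_rank:.0f}")
```

Under these idealized assumptions decode would stay memory-bound up to a rank near 118. In practice the crossover should arrive much earlier, since the measured overhead at R=4 is already nonzero.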

End-to-End Latency (Table 12)

At 4096 tokens, total decode time for 1.5B models:

vLLM Transformer (Llama-3.2-1B): 58.64 seconds
Gated DeltaNet: 36.14 seconds
Mamba-2: 36.94 seconds
Mamba-3 SISO: 34.83 seconds
Mamba-3 MIMO: 37.57 seconds

All recurrent models are roughly 1.6-1.7× faster than the attention baseline, and Mamba-3 SISO is the fastest. At 16K tokens, the gap widens to 2.3×.
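The speedup range follows from dividing the Transformer's time by each recurrent model's time. A short check, using only the numbers listed above:

```python
# Total decode time in seconds at 4096 tokens, copied from the list above.
DECODE_SECONDS = {
    "Gated DeltaNet": 36.14,
    "Mamba-2": 36.94,
    "Mamba-3 SISO": 34.83,
    "Mamba-3 MIMO": 37.57,
}
TRANSFORMER_SECONDS = 58.64  # vLLM Transformer (Llama-3.2-1B)

speedup = {name: TRANSFORMER_SECONDS / t for name, t in DECODE_SECONDS.items()}
for name, s in sorted(speedup.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.2f}x faster than the Transformer baseline")
```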

Retrieval: The Honest Weakness

On real-world retrieval (SWDE, FDA), Mamba-3 scores 28.5 and 23.4 vs. the Transformer's 48.9 and 58.4. This is a fundamental limitation of fixed-state-size models — they cannot store and recall arbitrary key-value pairs from long context the way attention can.

The mitigation: hybrid models. A 5:1 ratio of Mamba-3 layers to NoPE attention layers recovers most retrieval capability (58.5 on SWDE) while keeping the efficiency benefits. This is how the authors expect Mamba-3 to be deployed in practice.

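The honest picture: Mamba-3 beats Mamba-2 and GDN on perplexity and average downstream accuracy, though not on every individual measurement (GDN edges out Mamba-3 SISO on PIQA, and GDN [-1,1] is ahead on bracketed modular arithmetic in Chapter 8). It matches or beats Transformers on downstream accuracy and perplexity. But it still needs attention layers for retrieval-heavy tasks. The future is hybrid, not pure SSM.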

At the 1.5B scale, what is the relationship between Mamba-3 (d_state=64) and Mamba-2 (d_state=128)?

They achieve similar perplexity, meaning Mamba-3 delivers the same quality with half the state size, and therefore approximately half the decode latency
Mamba-3 is much worse because it has a smaller state
They have the same decode latency because state size does not affect speed

Chapter 8: State Tracking & Capabilities

Chapter 4 showed the theory: complex eigenvalues enable rotational dynamics. Chapter 7 showed Mamba-3 beats baselines on standard benchmarks. This chapter digs deeper into what Mamba-3 can do that its predecessors cannot — the qualitative capability gains from the complex SSM.

The Chomsky Hierarchy Tests

Following Grazzi et al. (2025), Mamba-3 was evaluated on three tasks from the Chomsky hierarchy of formal languages — each requiring increasingly sophisticated state-tracking:

1. Parity (regular language): Given a binary string, output whether the number of 1s is even or odd. Requires: a single bit of state that flips on every "1". Mamba-3: 100%. Mamba-2: 0.9% (random chance). The complex SSM can represent a 180-degree rotation (sign flip), which Mamba-2's positive real decay α ∈ (0,1) cannot.

2. Modular Arithmetic without brackets (regular language): Evaluate expressions like "3 + 2 × 4 mod 5" left-to-right. Requires: maintaining a running total modulo 5 and applying operations sequentially, which needs only a finite amount of state. Mamba-3: 98.51%. Mamba-2: 47.81%. The rotation enables modular counting, but the intermediate operations (addition, multiplication) also need to be tracked in the state, and complex eigenvalues allow richer state transitions for this.

3. Modular Arithmetic with brackets (context-free): Evaluate "(3 + (2 × 4)) mod 5" with nested parentheses. Requires: a stack to track nesting depth and intermediate results. Mamba-3: 87.75%. GDN [-1,1]: 93.50%. This is the hardest task. Mamba-3 nearly closes the gap but doesn't quite match GDN, which benefits from its delta-rule-style memory update for handling nesting.
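The counting idea behind the first two tasks can be sketched with a single unit-length complex number as the state. Adding k modulo m becomes a rotation by 2πk/m, and parity is the special case m = 2. This toy covers only the additive part. It does not model multiplication, brackets, or how a trained model actually encodes the expression, and the function name and rounding-based readout are my own assumptions.

```python
import cmath
import math

def running_sum_mod(digits, m):
    """Track (sum of digits) mod m as the phase of a unit complex state."""
    h = 1 + 0j
    for k in digits:
        h *= cmath.exp(2j * math.pi * k / m)      # rotate by 2*pi*k/m
    angle = cmath.phase(h) % (2 * math.pi)        # map phase into [0, 2*pi)
    return round(angle * m / (2 * math.pi)) % m   # nearest multiple of 2*pi/m

print(running_sum_mod([3, 2, 4], 5))     # 3 + 2 + 4 = 9, and 9 mod 5 = 4
print(running_sum_mod([1, 0, 1, 1], 2))  # three ones, so parity is 1
```

Because the state stores only an angle, the same update works at any sequence length, which matches the position-independent behavior the benchmark rewards.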

Why Standard RoPE Fails

A natural question: why not just apply standard RoPE (fixed-frequency rotation) to Mamba-2? The ablation answers this: Mamba-3 with standard RoPE scores 1.56% on parity, no better than the chance-level 0.9% that Mamba-2 reaches. Standard RoPE rotates by fixed amounts at each position (determined by the position index, not the token value). This is useful for encoding position but useless for encoding content-dependent state transitions.

Parity requires rotating by π when the token is "1" and by 0 when it is "0". The rotation must be data-dependent: the angle depends on what the token is, not where it appears. This is exactly what Mamba-3's complex SSM provides: θ_t is projected from the input token, so the model learns to set θ = π/Δ for "1" and θ = 0 for "0".

The critical distinction: Standard RoPE = position-dependent, data-independent rotation. Complex SSM = data-dependent, position-independent rotation. State tracking requires data-dependent rotations. Position encoding requires position-dependent rotations. These are fundamentally different capabilities that happen to use the same mathematical mechanism (2D rotation matrices).
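The distinction can be seen in a toy that isolates the state transition. In a real model the inputs still enter additively through B_t x_t; here that term is dropped on purpose, and the function names, the fixed frequency 0.1, and the decay values are illustrative assumptions. A data-dependent rotation solves parity at any length, a position-only rotation ends in the same state whatever the bits were, and a positive real decay can never flip the sign.

```python
import cmath
import math

def parity_data_dependent(bits):
    """Rotate by pi on each 1 and by 0 on each 0, then read the sign."""
    h = 1 + 0j
    for b in bits:
        h *= cmath.exp(1j * math.pi * b)
    return 1 if h.real < 0 else 0

def final_state_positional(bits, freq=0.1):
    """Position-only rotation: the angle never looks at the token value."""
    h = 1 + 0j
    for _ in bits:
        h *= cmath.exp(1j * freq)
    return h

def final_state_positive_decay(bits, alpha_one=0.5, alpha_zero=0.9):
    """Data-dependent but positive real transition: the sign cannot change."""
    h = 1.0
    for b in bits:
        h *= alpha_one if b else alpha_zero
    return h

short = [1, 0, 1, 1]
long_seq = [1] + [0] * 255   # length 256, one "1"

print(parity_data_dependent(short), parity_data_dependent(long_seq))
print(final_state_positional([1, 0, 0]) == final_state_positional([0, 0, 0]))
print(final_state_positive_decay(short) > 0)
```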

Length Generalization

The models were trained on short sequences (minimum 3, maximum 40-160 via curriculum) and evaluated at length 256. Mamba-3 generalizes well: 100% parity at 256 tokens despite never seeing sequences that long during training. This suggests the learned rotation strategy (rotate by π on "1") is position-independent and composition-friendly.

Context Length Extrapolation

On standard language modeling (Figure 4), Mamba-3 extrapolates gracefully from its 2K training length to 32K tokens, with perplexity improving steadily. Mamba-2 degrades after 8K. The complex state's rotation (which embeds positional information through data-dependent angles rather than fixed position encodings) may contribute to this robustness.

State Tracking: Modular Arithmetic Simulator

[Interactive simulator omitted. It shows how the complex SSM tracks the running value modulo 5 via rotations, compared with a real-valued state.]

What Mamba-3 Still Cannot Do

Nested modular arithmetic (with brackets) reaches 87.75% but not 100%. The stack-like memory required for deep nesting strains the fixed-size state. True stack simulation requires unbounded memory, which no fixed-state model can provide.

On practical tasks: retrieval from semi-structured data (SWDE: 28.5%, FDA: 23.4%) remains weak. The model can track abstract state (parity, arithmetic) but struggles to recall specific facts verbatim from long context. This aligns with the theoretical distinction: state tracking is about abstracting information (reducing a sequence to a summary), while retrieval is about preserving information (storing exact values for later recall). SSMs excel at the former; attention excels at the latter.

Why does applying standard (position-dependent) RoPE to Mamba-2 fail to solve parity, while Mamba-3's data-dependent rotation succeeds?

Standard RoPE rotates by a fixed angle at each position regardless of the token value, so it cannot flip the state specifically on "1" tokens. Data-dependent rotation sets the angle based on the token value (a rotation of π for "1", 0 for "0"), enabling content-dependent state transitions
Standard RoPE uses smaller rotation angles
Standard RoPE is only applied during training, not inference

Chapter 9: Connections

Cheat Sheet: Every Key Equation

Equation	What it means	When to use
h_t = α_t h_{t-1} + γ_t B_t x_t	Mamba-2 recurrence (Euler)	Baseline comparison
h_t = α_t h_{t-1} + β_t B_{t-1} x_{t-1} + γ_t B_t x_t	Mamba-3 recurrence (trapezoidal)	Core recurrence
α_t = exp(Δ_t A_t)	Decay factor	α ∈ (0,1), controls forgetting
β_t = (1 - λ_t) Δ_t exp(Δ_t A_t)	Left-endpoint weight (new in Mamba-3)	λ=1 recovers Euler
γ_t = λ_t Δ_t	Right-endpoint weight	λ=0.5 is classical trapezoid
R(φ) = [cos φ, -sin φ; sin φ, cos φ]	2×2 rotation matrix	Complex SSM state transition
R_t = Block{R(Δ_t θ_t[i])}	Block-diagonal rotation	N/2 independent 2D rotations
B̄_t = (Π_{i=0}^{t} R_i^T) B_t	RoPE trick: cumulative rotation on B	Efficient complex SSM impl
AI = FLOPs / Bytes	Arithmetic intensity	SISO ~2.5, MIMO ~R×2.5
Y = (L ⊙ CB^T) X	SSD parallel form	Training (L = L₁·L₂ for Mamba-3)
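The RoPE-trick row can be checked numerically on a single 2D state block, written as one complex number so that R(φ) becomes multiplication by e^{iφ}. The sketch compares a recurrence that rotates the state explicitly with one that leaves the state unrotated and applies the cumulative inverse rotation to B and C instead. For brevity it uses the single-input (Euler-style) update, scalar inputs, and toy values; the function names and sign convention are assumptions.

```python
import cmath

def explicit_rotation_scan(alphas, phis, b, c, x):
    """h_t = alpha_t R(phi_t) h_{t-1} + B_t x_t, then y_t = C_t^T h_t."""
    h, ys = 0j, []
    for alpha, phi, b_t, c_t, x_t in zip(alphas, phis, b, c, x):
        h = alpha * cmath.exp(1j * phi) * h + b_t * x_t
        ys.append((c_t.conjugate() * h).real)   # real dot product of 2-vectors
    return ys

def rope_trick_scan(alphas, phis, b, c, x):
    """Same outputs with a scalar-decay state and rotated B and C."""
    h, cum, ys = 0j, 0.0, []
    for alpha, phi, b_t, c_t, x_t in zip(alphas, phis, b, c, x):
        cum += phi
        rot = cmath.exp(-1j * cum)              # product of R_i^T for i <= t
        h = alpha * h + (rot * b_t) * x_t       # no rotation applied to the state
        ys.append(((rot * c_t).conjugate() * h).real)
    return ys

# Toy values; phis plays the role of delta_t * theta_t.
alphas = [0.9, 0.8, 0.95]
phis = [0.3, cmath.pi, -1.2]
b = [1 + 2j, -0.5 + 0.1j, 0.3 - 0.7j]
c = [0.2 - 1j, 1 + 1j, -0.4 + 0.6j]
x = [1.0, -2.0, 0.5]

ys_explicit = explicit_rotation_scan(alphas, phis, b, c, x)
ys_trick = rope_trick_scan(alphas, phis, b, c, x)
print(max(abs(p - q) for p, q in zip(ys_explicit, ys_trick)))
```

The second form is the useful one in practice because the state update keeps the scalar decay structure of Mamba-2, and all of the rotation work moves into cheap elementwise operations on B and C.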

Symbol Glossary

Symbol	Meaning	Typical Value
T	Sequence length	2048
D	Model dimension	2048 (1.5B)
N	SSM state size	64 or 128
P	Head dimension	128
H	Number of heads	16 (1.5B)
R	MIMO rank	4
Δ_t	Step size (data-dependent)	Scalar per head
A_t	State decay (data-dependent)	Scalar per head, negative
θ_t	Rotation angle (data-dependent)	R^{N/2} per head
λ_t	Trapezoidal mixing parameter	σ(u_t), scalar per head

Related Work & Lessons

SSMs & Mamba: The Gleams lesson SSM & Mamba covers the foundations — S4, discretization, selective SSMs, and the original Mamba architecture.
VideoMamba: The Veanors lesson VideoMamba applies Mamba to video understanding — bidirectional scanning, spatiotemporal tokens, and O(L) video processing.
Gated DeltaNet: The primary competitor. Uses a delta rule with gated updates, allowing negative eigenvalues by design (eigenvalue range [-1,1]). Achieves state tracking via explicit negative eigenvalues rather than complex rotation.
Linear Attention: The SSD framework connects SSMs to causal linear attention. Mamba-3's mask L generalizes Mamba-2's mask, which itself generalizes the causal triangular mask of linear attention.
RoPE & Positional Embeddings: Standard RoPE (Su et al., 2023) uses fixed-frequency rotations for position encoding. Mamba-3's data-dependent rotation is mathematically identical but semantically different: it encodes content, not position.
Hybrid Architectures: Mamba-3 layers are designed for hybrid models (5:1 ratio with NoPE attention). This mirrors industry practice: Jamba (AI21), NVIDIA Nemotron, Kimi Linear, and Hunyuan all mix attention with SSM or linear-attention layers.
Test-Time Training (TTT): An alternative sub-quadratic approach where the "state" is a set of trainable weights updated via gradient descent at inference. Conceptually different from SSMs but solving the same problem.

How This Idea Was Likely Discovered

The paper reads like a clean narrative: discretization → complex states → MIMO. But the discovery process was almost certainly messier. Probable sequence:

The authors noticed Mamba-2's simplified scalar transition lost expressivity vs. Mamba-1's diagonal transitions. They asked: "Can we recover expressivity without sacrificing Mamba-2's training efficiency?"
Revisiting the discretization derivation revealed the Euler approximation was heuristic. Formalizing it led to the exponential-adjusted framework, and the trapezoidal rule was the natural next step.
The parity failure was likely already known from Grazzi et al. (2025). The fix (complex eigenvalues / block-diagonal rotations) was suggested by classical SSM theory (S4 used complex-valued NPLR matrices). The RoPE trick was the crucial implementation insight that made complex SSMs practical.
The MIMO idea likely came from profiling: observing the low arithmetic intensity during decode and asking "how can we add FLOPs without increasing memory traffic?" The signal processing connection (SISO → MIMO) provided the framework.

Open Questions

Can MIMO rank be increased beyond 4 without training cost becoming prohibitive? At what R does decode become compute-bound instead of memory-bound?
What is the optimal ratio of Mamba-3 to attention layers in hybrid models? The 5:1 ratio was borrowed from prior work — is it optimal for Mamba-3 specifically?
Can the complex SSM learn more general finite automata beyond parity and modular arithmetic? What about tasks requiring unbounded counters or stacks?
How does Mamba-3 perform on truly long sequences (100K+)? The evaluations went up to 32K for perplexity and 4K for retrieval.

What is the fundamental difference between Mamba-3's data-dependent RoPE and standard position-encoding RoPE?

Standard RoPE rotates by fixed angles determined by position index (θ_i = 10000^{-2i/N}), encoding where a token is. Mamba-3's rotation angles are projected from the input token, encoding what a token is. Same math, different semantics: one tracks position, the other tracks content-dependent state
They are the same thing with different names
Mamba-3 uses larger rotation angles

Mamba-3: Improved Sequence Modeling Using State Space Principles