Why pick between attention and state-space models when you can interleave them? Jamba weaves Transformer and Mamba layers, sprinkles in experts, and runs a quarter-million-token context on a single GPU.
By now you've met two camps. Attention (transformers): precise, content-addressed recall — it can fetch any past token exactly — but quadratic cost and a KV cache that grows without bound. State-space models like Mamba (and the whole RWKV / RetNet / xLSTM family): efficient, linear-cost, constant-memory — but a fixed-size state that struggles with precise long-range recall. Each is excellent at what the other is bad at. So the obvious question: why pick one? Jamba (AI21 Labs, 2024) was the first production-scale model to answer “don't” — interleave both in one network.
The Linear Attention and RetNet lessons kept pointing at this: the best move is often a hybrid. Here's why it's not a cop-out but the genuinely right answer. Most of what a model does over a long sequence is cheap, local, broad-strokes processing — perfect for efficient SSM layers. But occasionally it needs to reach back and grab a specific detail with precision — that's what attention is for. So build a network that's mostly efficient SSM layers, with just a few attention layers sprinkled in for the moments precise recall matters. You get most of the efficiency and most of the recall.
Pure-SSM models were exciting in research but hesitant in production because of the recall gap (the “needle in a haystack” weakness). Pure transformers were proven but expensive at long context. Jamba showed the hybrid isn't a compromise that's worse than both — it's better than either on the metric that matters most for deployment: quality-per-dollar at long context. A handful of attention layers is cheap to add and buys back the recall, while the SSM majority keeps the cost and memory low. That combination is why hybrids became the dominant pattern for new long-context models.
The widget plots three architectures on the two axes that matter: precise-recall ability and efficiency-at-long-context. Pure attention sits high on recall, low on efficiency. Pure SSM sits high on efficiency, lower on recall. Click the hybrid and watch it land in the strong region of both — not maxing either, but high on both, which is exactly what deployment wants.
Click each architecture. Pure attention: great recall, poor long-context efficiency. Pure SSM: efficient, weaker recall. Hybrid (Jamba): strong on both — the sweet spot for deployment.
To understand the hybrid, let's put its two layer types side by side — their strengths and weaknesses are precisely complementary, which is the whole reason the combination works. (Both have full lessons of their own; here we just need their tradeoff profiles.)
An attention layer lets every token look at every other and fetch information with content-based precision. Its superpower is exact recall: ask “what was the name mentioned 50,000 tokens ago?” and attention can find it, because it keeps every past token available (the KV cache) and can spike sharply on the relevant one. Its curse: cost grows with the square of the sequence, and the KV cache grows linearly — at long context, both the compute and the memory become punishing.
A Mamba layer is a selective state-space model — a data-dependent fixed-state recurrence (the SSM lesson covers it). Its superpower is efficiency: linear cost, constant memory, no growing cache, so it sails through arbitrarily long sequences. Its weakness: that fixed-size state is a lossy summary, so it's worse at precisely retrieving an arbitrary specific past detail — the recall gap we keep meeting. It handles the broad flow of a sequence beautifully but can fumble the pinpoint lookup.
Crucially, both layer types fit the same transformer-style block structure: a token-mixing sublayer (either attention or Mamba) plus a feed-forward network, wrapped in residual connections and normalization. Because they share this scaffolding, you can swap one for the other layer by layer without redesigning anything — the token-mixing sublayer is just attention in some blocks and Mamba in others. That interchangeability is what makes interleaving them trivial, and it's the same “swap the mixer” modularity every architecture in this series relies on.
The widget shows attention and Mamba on four metrics: recall, efficiency, memory, and parallelism. Toggle between them and notice the mirror-image profiles — where attention's bar is high, Mamba's is low. That mirror is the foundation of the hybrid: combine them and you can have a high bar on every metric, because each layer type covers the other's low bars.
Four metrics for each layer type. Toggle between them: where one is strong, the other is weak. That complementarity is exactly what a hybrid exploits.
Now the actual recipe. Jamba stacks its layers in a repeating pattern that's mostly Mamba, with the occasional attention layer. The specific design uses blocks where, out of every several layers, only one is attention and the rest are Mamba. Jamba's published ratio is roughly one attention layer for every seven Mamba layers — attention is a small minority of the stack. Getting this ratio right is the central design decision.
Concretely, you define a small repeating block — say 8 layers — and within it, you place exactly one attention layer and seven Mamba layers (interleaved with feed-forward / MoE sublayers). Then you repeat that block to whatever depth you want. So a deep Jamba model is mostly Mamba layers (cheap, linear, long-context) punctuated at regular intervals by single attention layers (the recall checkpoints). The attention layers are spread through the depth, not clustered, so precise recall is available at multiple stages of processing.
Imagine 32 layers. Pure transformer: all 32 are attention — 32 quadratic layers, 32 layers' worth of growing KV cache. Jamba 1:7: only 4 of the 32 are attention (one per 8-layer block), the other 28 are linear Mamba. So you pay the quadratic cost and KV-cache memory for just 4 layers instead of 32 — an 8× reduction in the expensive part — while the 28 Mamba layers cost almost nothing per token. The model is overwhelmingly made of cheap layers, with a few expensive ones exactly where recall needs them. That's where the efficiency comes from: you slashed the count of quadratic layers by 8×.
The widget shows a Jamba stack. Drag the attention-to-Mamba ratio: at one extreme, every layer is attention (a pure transformer — expensive); at the other, all Mamba (pure SSM — weak recall); in between, the Jamba sweet spot of mostly-Mamba with a few attention layers. Watch the cost and the “recall checkpoints” count change as you slide.
Layers from bottom to top: blue = attention (recall, expensive), teal = Mamba (efficient). Drag the ratio. Jamba's choice is ~1 attention per 7 Mamba — mostly cheap, a few recall checkpoints.
The whole hybrid bet rests on a surprising empirical fact: you need very few attention layers to recover most of attention's recall. If recall scaled linearly with the number of attention layers, the hybrid would be pointless — you'd need lots of them and lose the efficiency. But it doesn't. Recall benefit saturates quickly: the first attention layer helps enormously, the second a bit more, and beyond a handful, adding more barely moves recall. That diminishing-returns curve is what makes the 1-in-8 ratio work.
Think about why. A single attention layer, placed in the stack, already gives the model a way to do precise content-based lookup — tokens can retrieve specific past information at that layer, and the result propagates through the surrounding Mamba layers. A second attention layer adds another lookup opportunity, useful but redundant with the first for many tasks. By the time you have a few spread through the depth, the model can do precise recall whenever it needs to, and additional attention layers mostly just add cost. The recall curve rises steeply then flattens.
There's a second, even sharper reason to minimize attention layers: the KV cache. Only attention layers have a KV cache, and it's the cache — not the compute — that usually limits how long a context you can fit in memory. With 4 attention layers instead of 32, your KV cache is 8× smaller, which directly translates to fitting an 8×-longer context (or the same context with far less memory). This is why the saturation matters so much: every attention layer you don't need is a big chunk of memory you get back for longer context. (The headline 256K-token-on-one-GPU number comes mostly from here.)
The widget plots recall (saturating) and cost (rising) against the number of attention layers. Drag the count: watch recall shoot up with the first few layers then flatten, while cost keeps climbing. The shaded sweet spot is where recall is nearly maxed but cost is still low — just a few attention layers, exactly Jamba's regime.
Recall (teal) saturates after a few attention layers; cost (red) rises roughly linearly. The sweet spot (shaded) is a small number of attention layers — almost all the recall, little of the cost.
Jamba has one more trick, and it's the one that makes it a three-way hybrid: Mixture of Experts. On top of interleaving attention and Mamba layers, Jamba replaces some of the feed-forward networks with MoE layers — many expert FFNs with a router that sends each token to just a few. (The MoE lesson covers this in depth; here's how it fits into the hybrid.)
This is the elegant part: Jamba pulls two independent efficiency levers at once. The attention/Mamba interleaving cuts the cost of token mixing (the attention-replacement problem). MoE cuts the cost of the feed-forward layers (the capacity problem). These are different sublayers of the block, so the two techniques compose cleanly — Mamba handles “mix tokens cheaply,” MoE handles “add parameters cheaply,” and together they attack both of the places where a transformer spends its compute. A Jamba block is: a token-mixer (attention or Mamba) + a feed-forward (dense or MoE), each chosen for efficiency.
Jamba doesn't make every feed-forward an MoE — that would blow up the total parameter count and memory. It alternates: some blocks use a normal dense FFN, others use an MoE FFN, on a schedule (roughly every other layer). This keeps the total parameter count manageable while still getting much of MoE's capacity benefit. It's the same “use the expensive thing sparingly” philosophy as the attention layers — a few MoE layers, like a few attention layers, sprinkled through a mostly-cheap stack. Sparsity everywhere: sparse attention placement, sparse expert activation, sparse MoE placement.
The widget shows Jamba's three efficiency levers and their combined effect on total vs active parameters. Toggle each lever (Mamba layers, MoE, few-attention) on or off and watch the active-parameter cost drop while total capacity stays high. With all three on, you get a large-capacity model at a small per-token cost — the Jamba payoff.
Toggle each lever. Total params (capacity, purple) vs active params per token (cost, teal). Each lever lowers the active cost; together they make a big model cheap to run.
Now build your own hybrid and watch the tradeoffs in real time. You set how many attention layers (vs Mamba) and how much MoE, and the simulator shows the resulting model on four axes: recall, throughput (speed), max context (set by KV-cache memory), and quality-per-cost. Try the presets, then explore — you'll feel why Jamba's particular mix is a sweet spot, not an arbitrary choice.
Run these experiments:
Set the attention ratio and MoE level. The bars show recall, throughput, max context, and quality-per-cost. Use the presets, then explore why Jamba's mix balances all four.
No quiz — the builder is the test. If you can predict how each bar moves when you change the attention ratio or MoE level, you understand hybrid design.
Jamba's headline number was a 256,000-token context running on a single 80GB GPU — where a comparable pure transformer would need many times the memory or simply couldn't fit. The reason isn't mainly the compute savings; it's the KV cache. Understanding this makes clear why the “few attention layers” design is so powerful for long context specifically.
During generation, a transformer must remember the keys and values of every past token, at every attention layer — that's the KV cache, and it grows linearly with both context length and the number of attention layers. At long context, this cache, not the compute, is what fills up GPU memory. A 256K-token context in a 32-layer transformer means storing keys and values for 256,000 tokens across all 32 layers — tens of gigabytes, often more than the model weights themselves.
Say each attention layer's KV cache costs some amount per token. A 32-layer pure transformer at 256K tokens pays that cost × 32 layers × 256,000 tokens — the full bill. Jamba 1:7 has only 4 attention layers, so it pays × 4 layers × 256,000 tokens — one eighth of the cache. Same context length, one-eighth the KV memory. Flip it around: with a fixed memory budget, Jamba fits roughly 8× the context the transformer could. The 28 Mamba layers carry the sequence with their tiny constant-size states, contributing essentially nothing to the cache.
The widget plots KV-cache memory against context length for a pure transformer and for Jamba. Both grow linearly with context, but Jamba's line is far shallower (only its few attention layers cache). Drag the context up: watch the transformer hit a memory ceiling (the GPU limit) while Jamba sails far past it. The gap between the lines is the extra context Jamba fits in the same memory.
KV-cache memory vs context length. Both grow with context, but Jamba (few attention layers) grows ~8× slower. The dashed line is a GPU memory limit — see how much further Jamba reaches.
Choosing the attention-to-Mamba ratio (and where to add MoE) is the core engineering of a hybrid, and it's genuinely a design space, not a single right answer. Jamba's 1-in-8 is one well-chosen point; understanding the curve it sits on lets you reason about when to move along it.
Plot quality against throughput as you vary the attention fraction, and you get a frontier. At the high-attention end: top quality (especially recall), low throughput, short max context. At the low-attention end: high throughput and context, lower quality. In between is a curve, and you want to sit at the point on it that matches your priorities. Jamba's 1-in-8 is chosen to be near the “knee” — the point where you've captured most of the quality while keeping throughput and context high. Push toward more attention only if your task is recall-critical enough to justify the throughput and memory cost.
Beyond the ratio, details matter. Placement: spreading attention layers evenly through the depth (as Jamba does) generally beats clustering them, so precise recall is available at every processing stage. Which layers get MoE: alternating dense and MoE FFNs balances capacity against total memory. The first/last layers: some designs treat the boundary layers specially. These are the fine-tuning details that separate a good hybrid from a great one — the kind of empirical know-how that Jamba and its successors established.
The widget plots the quality-vs-throughput frontier as you sweep the attention fraction. Drag along it: toward more attention, quality rises but throughput falls; toward less, the reverse. The marked knee is roughly where Jamba sits — most of the quality, most of the throughput. There's no point that's best on both; you choose your spot on the curve.
Each point is a different attention fraction. More attention → higher quality, lower throughput. The knee (marked) is the balanced sweet spot — roughly Jamba's choice. Drag to explore the frontier.
Jamba wasn't a one-off — it announced a shift. After it, hybrids of attention with efficient sequence layers became one of the dominant patterns for new long-context models. Seeing the broader landscape shows that Jamba's specific recipe is one instance of a general, now-mainstream idea: don't replace attention, dilute it.
Several notable hybrids followed, each mixing attention with a different efficient layer:
The reason hybrids became dominant rather than either pure approach is pragmatic. Pure attention is too expensive at long context. Pure efficient-layer models haven't fully closed the recall gap and carried deployment risk. The hybrid sidesteps both: it's nearly as cheap as the efficient models (most layers are efficient) and nearly as capable as the transformer (the few attention layers restore recall). For the practitioner who needs long context, low cost, and reliable quality, the hybrid is simply the best available tradeoff — and that practical dominance is why it spread so fast.
The widget shows several hybrids and what each interleaves. Select one to see its composition — the efficient backbone, the attention component, and whether it uses MoE. Notice the shared skeleton across all of them: efficient majority + attention minority. They're dialects of one design.
Select a model to see what it mixes: its efficient backbone, its attention component, and MoE usage. All share the “efficient majority + attention minority” recipe.
You now understand Jamba completely: why hybrids beat either pure approach, the complementary attention/Mamba ingredients, interleaving at a 1-in-8 ratio, why so few attention layers suffice (recall saturates), how MoE adds a second efficiency lever, the KV-cache memory win that unlocks 256K context, the ratio design space, and the hybrid era it launched. The thread: don't replace attention — dilute it; a backbone of efficient layers with a few attention checkpoints and some experts gives long context, low cost, and strong recall at once.
“The strongest design is rarely the purest one — it is the wisest blend.”