Jamba — The Hybrid That Works

Chapter 0: Why Choose? The Case for Hybrids

By now you've met two camps. Attention (transformers): precise, content-addressed recall — it can fetch any past token exactly — but quadratic cost and a KV cache that grows without bound. State-space models like Mamba (and the whole RWKV / RetNet / xLSTM family): efficient, linear-cost, constant-memory — but a fixed-size state that struggles with precise long-range recall. Each is excellent at what the other is bad at. So the obvious question: why pick one? Jamba (AI21 Labs, 2024) was the first production-scale model to answer “don't” — interleave both in one network.

The recall–efficiency tension

The Linear Attention and RetNet lessons kept pointing at this: the best move is often a hybrid. Here's why it's not a cop-out but the genuinely right answer. Most of what a model does over a long sequence is cheap, local, broad-strokes processing — perfect for efficient SSM layers. But occasionally it needs to reach back and grab a specific detail with precision — that's what attention is for. So build a network that's mostly efficient SSM layers, with just a few attention layers sprinkled in for the moments precise recall matters. You get most of the efficiency and most of the recall.

The one-sentence version. Jamba interleaves a majority of Mamba (SSM) layers with a minority of Transformer attention layers, and adds Mixture-of-Experts on some feed-forward layers. The SSM layers give linear-cost, constant-memory processing and huge context; the few attention layers restore precise recall; the experts add capacity cheaply. The result: a 256,000-token context that fits on a single GPU, with quality competitive with pure transformers.

Why this is the production answer

Pure-SSM models were exciting in research but saw hesitant adoption in production because of the recall gap (the “needle in a haystack” weakness). Pure transformers were proven but expensive at long context. Jamba showed that the hybrid isn't a compromise that's worse than both. It beats either pure design on the metric that matters most for deployment: quality-per-dollar at long context. A handful of attention layers is cheap to add and buys back the recall, while the SSM majority keeps the cost and memory low. That combination is why hybrids became one of the dominant patterns for new long-context models.

See it: three designs on two axes

The widget plots three architectures on the two axes that matter: precise-recall ability and efficiency-at-long-context. Pure attention sits high on recall, low on efficiency. Pure SSM sits high on efficiency, lower on recall. Click the hybrid and watch it land in the strong region of both — not maxing either, but high on both, which is exactly what deployment wants.

Recall vs. Efficiency: Three Designs

Click each architecture. Pure attention: great recall, poor long-context efficiency. Pure SSM: efficient, weaker recall. Hybrid (Jamba): strong on both — the sweet spot for deployment.

Common misconception. “A hybrid must be a mediocre average of its parts.” That would be true if you blended them everywhere. Jamba doesn't average — it assigns: SSM layers handle the bulk efficiently, attention layers handle recall precisely, each doing what it's best at. The composition is more than the average because the two layer types are complementary, not competing. You get attention's recall and SSM's efficiency, not a watered-down middle of each.

What is the core idea of a hybrid model like Jamba?

Average attention and SSM outputs at every layer
Interleave mostly efficient SSM (Mamba) layers with a few attention layers (plus MoE), so each layer type does what it's best at: efficiency from SSM, precise recall from the attention layers
Use attention for training and SSM for inference

Chapter 1: The Two Ingredients

To understand the hybrid, let's put its two layer types side by side — their strengths and weaknesses are precisely complementary, which is the whole reason the combination works. (Both have full lessons of their own; here we just need their tradeoff profiles.)

Ingredient one: attention

An attention layer lets every token look at every other and fetch information with content-based precision. Its superpower is exact recall: ask “what was the name mentioned 50,000 tokens ago?” and attention can find it, because it keeps every past token available (the KV cache) and can spike sharply on the relevant one. Its curse: cost grows with the square of the sequence, and the KV cache grows linearly — at long context, both the compute and the memory become punishing.

Ingredient two: Mamba (SSM)

A Mamba layer is a selective state-space model — a data-dependent fixed-state recurrence (the SSM lesson covers it). Its superpower is efficiency: linear cost, constant memory, no growing cache, so it sails through arbitrarily long sequences. Its weakness: that fixed-size state is a lossy summary, so it's worse at precisely retrieving an arbitrary specific past detail — the recall gap we keep meeting. It handles the broad flow of a sequence beautifully but can fumble the pinpoint lookup.
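The memory contrast between the two ingredients can be sketched with a toy. This is only an illustration under stated assumptions: the decay-based update below is a hypothetical stand-in for a fixed-size state, not Mamba's actual selective SSM, and the function names are made up for this example.

```python
# Toy contrast: a growing per-token cache vs. a fixed-size recurrent state.
# The decay update is a simplified stand-in, NOT Mamba's selective SSM.

def run_cache(tokens):
    """Attention-style memory: keep every token, so lookup is exact."""
    cache = []
    for t in tokens:
        cache.append(t)  # memory grows by one entry per token
    return cache

def run_state(tokens, decay=0.9):
    """Recurrence-style memory: fold every token into one running number."""
    state = 0.0
    for t in tokens:
        state = decay * state + (1.0 - decay) * t  # memory stays one float
    return state

tokens = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
cache = run_cache(tokens)
state = run_state(tokens)

print(len(cache))  # one entry per token seen so far
print(cache[2])    # exact recall of the third token
print(state)       # a single blended summary of the whole sequence
```

The cache can return any past token exactly but grows with the sequence; the state never grows, but individual tokens cannot be read back out of it.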

The complementarity is exact. Lay them side by side and the strengths and weaknesses line up like puzzle pieces. Attention: strong recall, weak efficiency. Mamba: weak recall, strong efficiency. Where one is strong, the other is weak, and vice versa. This is the ideal setup for a hybrid: you're not combining two things that are good at the same stuff (redundant) or bad at the same stuff (no help). You're combining two things whose strengths exactly cover each other's weaknesses. That's why Jamba works and a hybrid of, say, two similar layer types wouldn't.

The shared scaffolding

Crucially, both layer types fit the same transformer-style block structure: a token-mixing sublayer (either attention or Mamba) plus a feed-forward network, wrapped in residual connections and normalization. Because they share this scaffolding, you can swap one for the other layer by layer without redesigning anything — the token-mixing sublayer is just attention in some blocks and Mamba in others. That interchangeability is what makes interleaving them trivial, and it's the same “swap the mixer” modularity every architecture in this series relies on.
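A minimal sketch of the "swap the mixer" idea, assuming scalar tokens and heavily simplified sublayers. The two mixers are stand-ins (uniform causal averaging rather than content-based attention, a plain decay recurrence rather than a selective SSM), normalization is omitted, and every name here is hypothetical.

```python
# Toy block: token mixer + feed-forward, each wrapped in a residual connection.
# Both mixers are simplified stand-ins; normalization is left out for brevity.

def attention_mixer(xs):
    """Stand-in for causal attention: average of the current and earlier positions."""
    return [sum(xs[: i + 1]) / (i + 1) for i in range(len(xs))]

def recurrent_mixer(xs, decay=0.5):
    """Stand-in for an SSM: one running state updated token by token."""
    out, state = [], 0.0
    for x in xs:
        state = decay * state + (1.0 - decay) * x
        out.append(state)
    return out

def feed_forward(xs):
    """Stand-in per-token FFN: acts on each position independently."""
    return [max(0.0, 2.0 * x - 1.0) for x in xs]

def block(xs, mixer):
    h = [x + m for x, m in zip(xs, mixer(xs))]          # residual around the mixer
    return [a + f for a, f in zip(h, feed_forward(h))]  # residual around the FFN

def run_stack(xs, mixers):
    for mixer in mixers:
        xs = block(xs, mixer)
    return xs

# One attention block among recurrent ones; only the `mixer` argument differs.
stack = [recurrent_mixer, recurrent_mixer, attention_mixer, recurrent_mixer]
out = run_stack([1.0, 2.0, 3.0, 4.0], stack)
print(len(out))
```

Because `block` only sees a function from a sequence to a sequence, changing the layer type is a one-argument change, which is the modularity the paragraph describes.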

See it: the tradeoff profiles

The widget shows attention and Mamba on four metrics: recall, efficiency, memory, and parallelism. Toggle between them and notice the mirror-image profiles — where attention's bar is high, Mamba's is low. That mirror is the foundation of the hybrid: combine them and you can have a high bar on every metric, because each layer type covers the other's low bars.

Attention vs. Mamba: Mirror-Image Profiles

Four metrics for each layer type. Toggle between them: where one is strong, the other is weak. That complementarity is exactly what a hybrid exploits.

Common misconception. “Mamba is just strictly better than attention now, so why keep attention at all?” Mamba is better on efficiency and long context, but it has not closed the precise-recall gap with attention — on tasks needing exact lookup of arbitrary details, attention still wins. That stubborn gap is exactly why Jamba keeps a few attention layers rather than going pure-Mamba. The two are complementary, not one-dominates-the-other.

Why are attention and Mamba ideal partners for a hybrid?

They're identical, so combining them is easy
Their strengths and weaknesses are mirror images: attention has strong recall but poor efficiency; Mamba has strong efficiency but weaker recall. Each covers the other's weakness
They both excel at exactly the same tasks

Chapter 2: Interleaving — The Layer Ratio

Now the actual recipe. Jamba stacks its layers in a repeating pattern that's mostly Mamba, with the occasional attention layer. The specific design uses blocks where, out of every several layers, only one is attention and the rest are Mamba. Jamba's published ratio is roughly one attention layer for every seven Mamba layers — attention is a small minority of the stack. Getting this ratio right is the central design decision.

The repeating block

Concretely, you define a small repeating block — say 8 layers — and within it, you place exactly one attention layer and seven Mamba layers (interleaved with feed-forward / MoE sublayers). Then you repeat that block to whatever depth you want. So a deep Jamba model is mostly Mamba layers (cheap, linear, long-context) punctuated at regular intervals by single attention layers (the recall checkpoints). The attention layers are spread through the depth, not clustered, so precise recall is available at multiple stages of processing.
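The repeating block can be written down as a small layout builder. This is a sketch, not Jamba's code: `build_layout` and its parameters are hypothetical, and `attn_offset` (where the attention layer sits inside each block) is an assumption chosen for illustration, since the text only says one attention layer per block.

```python
def build_layout(n_blocks, block_size=8, attn_per_block=1, attn_offset=4):
    """Return one mixer name per layer for a stack of repeating blocks.

    attn_offset is an assumed position for the first attention layer inside
    each block; the lesson only specifies "one attention layer per block".
    """
    if not 0 <= attn_per_block <= block_size:
        raise ValueError("attn_per_block must be between 0 and block_size")
    layout = []
    for _ in range(n_blocks):
        block = ["mamba"] * block_size
        for k in range(attn_per_block):
            block[(attn_offset + k) % block_size] = "attention"
        layout.extend(block)
    return layout

layout = build_layout(n_blocks=4)
attn_positions = [i for i, name in enumerate(layout) if name == "attention"]

print(len(layout), layout.count("attention"), layout.count("mamba"))
print(attn_positions)  # evenly spaced through the depth, one per block
```

Setting `attn_per_block=8` gives a pure transformer layout and `attn_per_block=0` gives a pure Mamba layout, the two extremes of the ratio.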

Why mostly Mamba, not 50/50? Because the attention layers are the expensive ones: each adds quadratic cost and a growing KV cache. You want as few as you can get away with while still restoring recall. It turns out you can get away with very few: a 1-in-8 ratio captures most of attention's recall benefit at a fraction of its cost. Going to 50/50 would mean four times as many attention layers as 1-in-8, and four times the KV cache, for little extra recall. The art is the minimum attention that restores recall, and that minimum is surprisingly small (next chapter shows why).

Worked example: cost of the ratio

Imagine 32 layers. Pure transformer: all 32 are attention, so 32 quadratic layers and 32 layers' worth of growing KV cache. Jamba 1:7: only 4 of the 32 are attention (one per 8-layer block); the other 28 are linear Mamba. So you pay the quadratic cost and KV-cache memory for just 4 layers instead of 32, an 8× reduction in the expensive part, while the 28 Mamba layers add only a cost that grows linearly with sequence length and no per-token cache. The model is overwhelmingly made of cheap layers, with a few expensive ones exactly where recall needs them. That's where the efficiency comes from: you slashed the count of quadratic layers by 8×.
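The arithmetic of this example can be checked with a toy cost model. The model is an assumption made for illustration: attention is charged `seq_len^2 * d_model` per layer and Mamba `seq_len * d_model * d_state`, with constants and projection costs ignored, and the default sizes are placeholders rather than Jamba's real dimensions.

```python
def mixing_cost(n_attn, n_mamba, seq_len, d_model=4096, d_state=16):
    """Rough token-mixing cost over a whole sequence, in arbitrary units.

    Assumed toy model: attention ~ seq_len^2 * d_model per layer,
    Mamba ~ seq_len * d_model * d_state per layer. Constants are ignored.
    """
    attn = n_attn * seq_len**2 * d_model
    mamba = n_mamba * seq_len * d_model * d_state
    return attn, mamba

for seq_len in (4_000, 64_000, 256_000):
    pure_attn, _ = mixing_cost(32, 0, seq_len)
    hyb_attn, hyb_mamba = mixing_cost(4, 28, seq_len)
    attn_reduction = pure_attn / hyb_attn            # quadratic part only
    total_ratio = (hyb_attn + hyb_mamba) / pure_attn  # hybrid vs. pure, whole stack
    print(seq_len, attn_reduction, round(total_ratio, 4))
```

Under these assumptions the quadratic part shrinks by exactly 8×, and the 28 Mamba layers add a term that becomes a vanishing share of the total as the sequence grows.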

See it: the interleaved stack

The widget shows a Jamba stack. Drag the attention-to-Mamba ratio: at one extreme, every layer is attention (a pure transformer — expensive); at the other, all Mamba (pure SSM — weak recall); in between, the Jamba sweet spot of mostly-Mamba with a few attention layers. Watch the cost and the “recall checkpoints” count change as you slide.

The Interleaved Jamba Stack

Layers from bottom to top: blue = attention (recall, expensive), teal = Mamba (efficient). Drag the ratio. Jamba's choice is ~1 attention per 7 Mamba — mostly cheap, a few recall checkpoints.

Common misconception. “The attention layers should go at the top (or bottom).” Jamba spreads them through the depth, one per repeating block, rather than clustering them. Recall is useful at every level of processing — early layers, middle, late — so distributing the attention checkpoints evenly gives the model access to precise lookup throughout its computation, not just at one stage. Placement, not just count, is part of the design.

What is Jamba's approximate layer ratio, and why is attention the minority?

50/50 attention and Mamba, for balance
~1 attention per 7 Mamba layers; attention is the minority because it's the expensive part (quadratic + growing KV cache), and a few suffice to restore recall
Mostly attention with a few Mamba layers for speed

Chapter 3: Why a Few Attention Layers Suffice

The whole hybrid bet rests on a surprising empirical fact: you need very few attention layers to recover most of attention's recall. If recall scaled linearly with the number of attention layers, the hybrid would be pointless — you'd need lots of them and lose the efficiency. But it doesn't. Recall benefit saturates quickly: the first attention layer helps enormously, the second a bit more, and beyond a handful, adding more barely moves recall. That diminishing-returns curve is what makes the 1-in-8 ratio work.

The saturation curve

Think about why. A single attention layer, placed in the stack, already gives the model a way to do precise content-based lookup — tokens can retrieve specific past information at that layer, and the result propagates through the surrounding Mamba layers. A second attention layer adds another lookup opportunity, useful but redundant with the first for many tasks. By the time you have a few spread through the depth, the model can do precise recall whenever it needs to, and additional attention layers mostly just add cost. The recall curve rises steeply then flattens.

The economics of the sweet spot. Two curves cross here. Recall rises with attention layers but saturates fast (steep then flat). Cost rises roughly linearly with attention layers (each adds quadratic compute and KV cache). So the best value is right where recall has mostly saturated but cost is still low — a small number of attention layers. Add fewer and you sacrifice recall; add more and you pay rising cost for negligible recall gain. Jamba's 1-in-8 sits near that knee: almost all the recall, almost none of the attention cost.
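To make the two curves concrete, here is a purely illustrative sketch. The exponential shape and the constant `tau` are assumptions picked to mimic "steep then flat"; they are not measurements from Jamba or any other model, and both function names are hypothetical.

```python
import math

def toy_recall(n_attn, tau=1.5):
    """Illustrative saturating curve. Shape and tau are assumptions, not data."""
    return 1.0 - math.exp(-n_attn / tau)

def toy_cost(n_attn, per_layer=1.0):
    """Cost modeled as linear in the number of attention layers."""
    return per_layer * n_attn

prev = 0.0
for n in range(0, 9):
    r = toy_recall(n)
    # Marginal recall gain shrinks with every added layer; cost per layer does not.
    print(f"{n} attention layers: recall={r:.2f} gain={r - prev:+.2f} cost={toy_cost(n):.0f}")
    prev = r
```

With any curve of this shape, the recall gained per unit of cost falls with each extra attention layer, which is the reasoning behind stopping at a small count.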

The memory angle too

There's a second, even sharper reason to minimize attention layers: the KV cache. Only attention layers have a KV cache, and it's the cache — not the compute — that usually limits how long a context you can fit in memory. With 4 attention layers instead of 32, your KV cache is 8× smaller, which directly translates to fitting an 8×-longer context (or the same context with far less memory). This is why the saturation matters so much: every attention layer you don't need is a big chunk of memory you get back for longer context. (The headline 256K-token-on-one-GPU number comes mostly from here.)

See it: recall and cost vs. attention count

The widget plots recall (saturating) and cost (rising) against the number of attention layers. Drag the count: watch recall shoot up with the first few layers then flatten, while cost keeps climbing. The shaded sweet spot is where recall is nearly maxed but cost is still low — just a few attention layers, exactly Jamba's regime.

Recall Saturates; Cost Keeps Rising

Recall (teal) saturates after a few attention layers; cost (red) rises roughly linearly. The sweet spot (shaded) is a small number of attention layers — almost all the recall, little of the cost.

Common misconception. “More attention layers always mean better quality.” For recall specifically, the returns diminish fast — and each added attention layer costs you quadratic compute and, crucially, KV-cache memory that caps your context length. Past the knee, adding attention layers makes the model slower and more memory-hungry for almost no recall gain. The skill is finding the minimum attention that restores recall, not piling it on.

Why can Jamba get away with so few attention layers?

Attention layers are free
Recall benefit saturates quickly with the number of attention layers (the first few restore most of it), while each attention layer adds cost and KV-cache memory, so the sweet spot is a small number
Mamba layers also have a KV cache

Chapter 4: Adding Experts — The Third Ingredient

Jamba has one more trick, and it's the one that makes it a three-way hybrid: Mixture of Experts. On top of interleaving attention and Mamba layers, Jamba replaces some of the feed-forward networks with MoE layers — many expert FFNs with a router that sends each token to just a few. (The MoE lesson covers this in depth; here's how it fits into the hybrid.)
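A router that "sends each token to just a few" experts can be sketched as top-k gating. This is a generic illustration, assuming softmax scores renormalized over the selected experts (one common variant); it is not taken from Jamba's implementation, and the experts here are trivial scalar functions invented for the example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_ffn(x, router_scores, experts, top_k=2):
    """Weighted sum of the chosen experts' outputs; the other experts never run."""
    weights = route(router_scores, top_k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Four hypothetical experts, each a trivial scalar function.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
scores = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token

weights = route(scores)
print(sorted(weights))               # indices of the two selected experts
print(moe_ffn(3.0, scores, experts))
```

Only the two selected experts are evaluated for this token, which is why per-token compute stays low no matter how many experts exist in total.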

Two orthogonal efficiency levers

This is the elegant part: Jamba pulls two independent efficiency levers at once. The attention/Mamba interleaving cuts the cost of token mixing (the attention-replacement problem). MoE cuts the cost of the feed-forward layers (the capacity problem). These are different sublayers of the block, so the two techniques compose cleanly — Mamba handles “mix tokens cheaply,” MoE handles “add parameters cheaply,” and together they attack both of the places where a transformer spends its compute. A Jamba block is: a token-mixer (attention or Mamba) + a feed-forward (dense or MoE), each chosen for efficiency.

Total vs. active parameters, in a hybrid. Recall MoE's split: total parameters (capacity, memory) vs active parameters (per-token compute). Jamba uses this to be huge in capacity but cheap to run. Its flagship has on the order of 50 billion total parameters but only about 12 billion active per token, because the router sends each token to only a couple of experts per MoE layer. So Jamba is efficient on three fronts: Mamba layers make token-mixing linear-cost, MoE makes the FFN capacity cheap, and the few-attention design keeps the KV cache tiny. Three levers, all pulling toward “big model, small per-token cost.”
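The total-versus-active split is simple bookkeeping, sketched below. All sizes are hypothetical round numbers chosen for illustration, not Jamba's actual parameter breakdown, and the expert count and top-k are placeholders.

```python
def moe_param_counts(shared, per_expert, n_moe_layers, n_experts, top_k):
    """Return (total, active) parameter counts.

    shared: parameters every token uses (mixers, dense FFNs, embeddings).
    per_expert: parameters in one expert FFN.
    """
    total = shared + n_moe_layers * n_experts * per_expert
    active = shared + n_moe_layers * top_k * per_expert
    return total, active

# Hypothetical sizes, NOT Jamba's real breakdown.
total, active = moe_param_counts(
    shared=4_000_000_000,
    per_expert=200_000_000,
    n_moe_layers=16,
    n_experts=16,
    top_k=2,
)
print(f"total={total / 1e9:.1f}B active={active / 1e9:.1f}B")
```

Total parameters scale with the number of experts, while active parameters scale only with how many experts each token is routed to.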

Where the experts go

Jamba doesn't make every feed-forward an MoE — that would blow up the total parameter count and memory. It alternates: some blocks use a normal dense FFN, others use an MoE FFN, on a schedule (roughly every other layer). This keeps the total parameter count manageable while still getting much of MoE's capacity benefit. It's the same “use the expensive thing sparingly” philosophy as the attention layers — a few MoE layers, like a few attention layers, sprinkled through a mostly-cheap stack. Sparsity everywhere: sparse attention placement, sparse expert activation, sparse MoE placement.
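A short sketch of why alternating matters for memory, assuming MoE on every second layer and a placeholder expert count. The helper names are hypothetical, and which of the two alternating layers carries the MoE is an arbitrary choice here.

```python
def ffn_schedule(n_layers, moe_every=2):
    """Mark every `moe_every`-th layer's FFN as MoE and the rest as dense."""
    return ["moe" if (i + 1) % moe_every == 0 else "dense" for i in range(n_layers)]

def stored_ffn_count(schedule, n_experts):
    """Number of FFN weight sets that must be kept in memory for a schedule."""
    return sum(n_experts if kind == "moe" else 1 for kind in schedule)

n_experts = 16  # placeholder expert count
alternating = ffn_schedule(32, moe_every=2)
everywhere = ffn_schedule(32, moe_every=1)

print(alternating[:4])
print(stored_ffn_count(alternating, n_experts), stored_ffn_count(everywhere, n_experts))
```

With these placeholder numbers, putting MoE on every layer nearly doubles the FFN weights that must be stored compared with the alternating schedule.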

See it: the three levers

The widget shows Jamba's three efficiency levers and their combined effect on total vs active parameters. Toggle each lever (Mamba layers, MoE, few-attention) on or off and watch the active-parameter cost drop while total capacity stays high. With all three on, you get a large-capacity model at a small per-token cost — the Jamba payoff.

Three Efficiency Levers Compose

Toggle each lever. Total params (capacity, purple) vs active params per token (cost, teal). Each lever lowers the active cost; together they make a big model cheap to run.

Common misconception. “MoE and Mamba do the same efficiency job, so using both is redundant.” They target different sublayers. Mamba (or attention) is the token-mixing sublayer — how tokens share information. MoE is the feed-forward sublayer — how each token's features are transformed, independent of other tokens. A transformer spends compute in both places, so cutting only one leaves the other expensive. Jamba cuts both, which is why it composes them rather than choosing.

How does MoE complement the attention/Mamba interleaving in Jamba?

It replaces the Mamba layers entirely
They target different sublayers: interleaving cuts token-mixing cost, MoE cuts feed-forward cost (high total params, low active per token). These are two orthogonal efficiency levers that compose
It adds a KV cache to the Mamba layers

Chapter 5: The Hybrid Stack Builder

Now build your own hybrid and watch the tradeoffs in real time. You set how many attention layers (vs Mamba) and how much MoE, and the simulator shows the resulting model on four axes: recall, throughput (speed), max context (set by KV-cache memory), and quality-per-cost. Try the presets, then explore — you'll feel why Jamba's particular mix is a sweet spot, not an arbitrary choice.

Run these experiments:

Pure transformer (all attention) — top recall, but throughput and max-context tank as the KV cache balloons.
Pure Mamba (no attention) — huge context and fast, but recall drops.
Jamba (1-in-8 attention + MoE) — strong on all four; the balanced sweet spot.
Crank MoE up — quality-per-cost rises (more capacity, same active compute), but total memory grows.

Build a Hybrid: Four-Axis Tradeoff

Set the attention ratio and MoE level. The bars show recall, throughput, max context, and quality-per-cost. Use the presets, then explore why Jamba's mix balances all four.

What to take away. Notice there's no setting that maxes all four bars — there are real tradeoffs. Pure transformer wins recall but loses context and throughput; pure Mamba wins those but loses recall. Jamba's 1-in-8-plus-MoE doesn't top any single axis but is strong on all four, which is what a deployable long-context model needs. The art of hybrid design is finding that balanced point for your priorities — and Jamba showed a specific point that works remarkably well in production.

Common misconception. “There's an optimal hybrid ratio for all tasks.” The best ratio depends on what you're optimizing: heavy long-context retrieval wants a few more attention layers; maximum throughput and context length want fewer. Jamba's 1-in-8 is a strong general default, not a universal law. The builder makes the point: you're choosing a position in a tradeoff space, and the right position depends on the job.

No quiz — the builder is the test. If you can predict how each bar moves when you change the attention ratio or MoE level, you understand hybrid design.

Chapter 6: The Memory Win — 256K on One GPU

Jamba's headline number was a 256,000-token context running on a single 80GB GPU — where a comparable pure transformer would need many times the memory or simply couldn't fit. The reason isn't mainly the compute savings; it's the KV cache. Understanding this makes clear why the “few attention layers” design is so powerful for long context specifically.

The KV cache is the real bottleneck

During generation, a transformer must remember the keys and values of every past token, at every attention layer — that's the KV cache, and it grows linearly with both context length and the number of attention layers. At long context, this cache, not the compute, is what fills up GPU memory. A 256K-token context in a 32-layer transformer means storing keys and values for 256,000 tokens across all 32 layers — tens of gigabytes, often more than the model weights themselves.
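The linear growth described above follows from a standard size formula: two tensors (keys and values) per token per attention layer. The head count, head dimension, and 16-bit precision below are assumed placeholder values, not a specific model's configuration.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Keys + values stored for every token at every attention layer (batch size 1).

    n_kv_heads, head_dim and bytes_per_value are assumed placeholder settings.
    """
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value
    return seq_len * n_attn_layers * per_token_per_layer

GIB = 1024**3
for seq_len in (8_000, 64_000, 256_000):
    size = kv_cache_bytes(seq_len, n_attn_layers=32)
    print(seq_len, "tokens:", round(size / GIB, 2), "GiB")
```

Doubling either the context length or the number of attention layers doubles the cache; nothing else in the formula depends on the sequence.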

Mamba layers have no KV cache. Here's the crux. A Mamba layer carries a small fixed-size state — it does not store a per-token cache that grows with context. So in Jamba, only the attention layers contribute to the KV cache, and there are very few of them (4 of 32 in the 1-in-8 design). The KV cache is therefore 8× smaller than a pure transformer's, because 28 of the 32 layers store nothing that grows. That 8× memory saving is what lets the same GPU hold an 8×-longer context. The few-attention design isn't just about compute — it's primarily what unlocks the long context.

Worked example: the cache shrinks

Say each attention layer's KV cache costs some amount per token. A 32-layer pure transformer at 256K tokens pays that cost × 32 layers × 256,000 tokens — the full bill. Jamba 1:7 has only 4 attention layers, so it pays × 4 layers × 256,000 tokens — one eighth of the cache. Same context length, one-eighth the KV memory. Flip it around: with a fixed memory budget, Jamba fits roughly 8× the context the transformer could. The 28 Mamba layers carry the sequence with their tiny constant-size states, contributing essentially nothing to the cache.
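Flipping the example around, a fixed memory budget can be turned into a maximum context length. This sketch assumes placeholder head settings and a hypothetical 16 GiB reserved for the KV cache; it ignores model weights, activations, and the small constant-size Mamba states.

```python
def kv_bytes_per_token(n_attn_layers, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV-cache bytes added per token (assumed placeholder head settings)."""
    return n_attn_layers * 2 * n_kv_heads * head_dim * bytes_per_value

def max_context(budget_bytes, n_attn_layers):
    """Longest context whose KV cache fits in budget_bytes.

    Weights, activations, and Mamba's constant-size states are excluded.
    """
    return budget_bytes // kv_bytes_per_token(n_attn_layers)

budget = 16 * 1024**3  # hypothetical 16 GiB reserved for the KV cache
pure = max_context(budget, n_attn_layers=32)
hybrid = max_context(budget, n_attn_layers=4)

print(pure, hybrid, hybrid / pure)
```

Because the per-token cost is proportional to the number of attention layers, cutting them from 32 to 4 raises the context that fits in the same budget by the same factor of 8.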

See it: KV cache vs. context length

The widget plots KV-cache memory against context length for a pure transformer and for Jamba. Both grow linearly with context, but Jamba's line is far shallower (only its few attention layers cache). Drag the context up: watch the transformer hit a memory ceiling (the GPU limit) while Jamba sails far past it. The gap between the lines is the extra context Jamba fits in the same memory.

KV Cache: Pure Transformer vs. Jamba

KV-cache memory vs context length. Both grow with context, but Jamba (few attention layers) grows ~8× slower. The dashed line is a GPU memory limit — see how much further Jamba reaches.

Common misconception. “Long context is mainly a compute problem.” For generation, it's often a memory problem first — the KV cache fills the GPU before compute becomes the limit. That's why removing attention layers (and their caches) is so effective for long context: you're attacking the binding constraint. Jamba's efficiency story is really two stories — the Mamba layers cut compute, and the absence of their KV caches cuts memory, with the memory win being what makes the dramatic context lengths possible.

Why can Jamba fit a far longer context than a pure transformer on the same GPU?

Its weights are smaller
Only attention layers have a (growing) KV cache, and Jamba has very few of them; the many Mamba layers carry the sequence with tiny fixed-size states, so the KV cache is ~8× smaller
It compresses the context with a separate model

Chapter 7: Designing the Ratio

Choosing the attention-to-Mamba ratio (and where to add MoE) is the core engineering of a hybrid, and it's genuinely a design space, not a single right answer. Jamba's 1-in-8 is one well-chosen point; understanding the curve it sits on lets you reason about when to move along it.

The quality–throughput frontier

Plot quality against throughput as you vary the attention fraction, and you get a frontier. At the high-attention end: top quality (especially recall), low throughput, short max context. At the low-attention end: high throughput and context, lower quality. In between is a curve, and you want to sit at the point on it that matches your priorities. Jamba's 1-in-8 is chosen to be near the “knee” — the point where you've captured most of the quality while keeping throughput and context high. Push toward more attention only if your task is recall-critical enough to justify the throughput and memory cost.

Three knobs, not one. The hybrid design space has at least three dials: (1) the attention fraction (how many attention layers), (2) attention placement (spread evenly, or clustered), and (3) the MoE fraction and expert count. They interact — e.g., more MoE capacity can partly compensate for fewer attention layers on some tasks. Jamba's contribution wasn't just “use a hybrid” but demonstrating a specific, tuned configuration of all three that works at production scale. Later hybrids tune these knobs differently for different goals.

Placement and other subtleties

Beyond the ratio, details matter. Placement: spreading attention layers evenly through the depth (as Jamba does) generally beats clustering them, so precise recall is available at every processing stage. Which layers get MoE: alternating dense and MoE FFNs balances capacity against total memory. The first/last layers: some designs treat the boundary layers specially. These are the fine-grained details that separate a good hybrid from a great one, the kind of empirical know-how that Jamba and its successors established.

See it: the design frontier

The widget plots the quality-vs-throughput frontier as you sweep the attention fraction. Drag along it: toward more attention, quality rises but throughput falls; toward less, the reverse. The marked knee is roughly where Jamba sits — most of the quality, most of the throughput. There's no point that's best on both; you choose your spot on the curve.

The Quality–Throughput Frontier

Each point is a different attention fraction. More attention → higher quality, lower throughput. The knee (marked) is the balanced sweet spot — roughly Jamba's choice. Drag to explore the frontier.

Common misconception. “Once you pick a hybrid ratio, you're locked in.” The ratio is a design choice made per model, and different models target different points: some lean more toward attention for recall-heavy use, others toward Mamba for maximum context and throughput. The “right” ratio also shifts as Mamba-style layers improve — as their recall gap narrows, you can use even fewer attention layers. The frontier itself moves over time with better components.

What does choosing a hybrid's attention ratio amount to?

Finding the one optimal ratio that's best for everything
Picking a point on a quality-vs-throughput frontier: more attention trades throughput/context for recall quality; Jamba's 1-in-8 sits near the balanced knee
Setting the learning rate

Chapter 8: The Hybrid Era

Jamba wasn't a one-off — it announced a shift. After it, hybrids of attention with efficient sequence layers became one of the dominant patterns for new long-context models. Seeing the broader landscape shows that Jamba's specific recipe is one instance of a general, now-mainstream idea: don't replace attention, dilute it.

A growing family of hybrids

Several notable hybrids followed, each mixing attention with a different efficient layer:

Jamba — attention + Mamba + MoE; the production trailblazer (256K context).
Griffin / Hawk (Google DeepMind) — attention + gated linear recurrences (the RG-LRU); Hawk is the pure-recurrent variant, Griffin the hybrid.
Samba (Microsoft) — a simple, effective interleaving of Mamba with sliding-window attention.
Zamba — Mamba backbone with a shared attention block, for parameter efficiency.
And increasingly, frontier labs ship hybrid or hybrid-influenced architectures for their long-context models, even when not advertised.

The unifying recipe. Strip away the names and they're variations on one theme: a backbone of efficient layers (Mamba, gated recurrences, linear attention) for the bulk of the work, plus a minority of attention layers for precise recall, optionally with MoE for cheap capacity. The specifics differ — which efficient layer, what ratio, what attention variant, where the experts go — but the structure is shared. Jamba showed this composition works at scale; the field then explored the design space. The era of “pure transformer vs pure SSM” gave way to “what's the best hybrid recipe.”

Why hybrids won

The reason hybrids, rather than either pure approach, became dominant is pragmatic. Pure attention is too expensive at long context. Pure efficient-layer models haven't fully closed the recall gap, and so carry deployment risk. The hybrid sidesteps both: it's nearly as cheap as the efficient models (most layers are efficient) and nearly as capable as the transformer (the few attention layers restore recall). For the practitioner who needs long context, low cost, and reliable quality, the hybrid is simply the best available tradeoff, and that practical dominance is why it spread so fast.

See it: the hybrid family

The widget shows several hybrids and what each interleaves. Select one to see its composition — the efficient backbone, the attention component, and whether it uses MoE. Notice the shared skeleton across all of them: efficient majority + attention minority. They're dialects of one design.

The Hybrid Family

Select a model to see what it mixes: its efficient backbone, its attention component, and MoE usage. All share the “efficient majority + attention minority” recipe.

Common misconception. “Hybrids are a transitional hack until pure SSMs catch up.” That's one view, but the more likely reality is that hybrids are a durable design point. Attention and efficient recurrence have genuinely complementary strengths rooted in their math (exact-but-expensive vs compressed-but-cheap), and combining them captures both. Even if efficient layers keep improving, a few attention layers are so cheap and so useful for recall that there's little reason to drop them entirely. Hybrids may well be the long-term answer, not a stopgap.

What unifying recipe do Jamba, Griffin, and Samba share?

They're all pure state-space models with no attention
A backbone of efficient layers (Mamba / gated recurrences) for the bulk + a minority of attention layers for recall, optionally with MoE: efficient majority, attention minority
They all use only sliding-window attention everywhere

Chapter 9: Connections & Cheat Sheet

You now understand Jamba completely: why hybrids beat either pure approach, the complementary attention/Mamba ingredients, interleaving at a 1-in-8 ratio, why so few attention layers suffice (recall saturates), how MoE adds a second efficiency lever, the KV-cache memory win that unlocks 256K context, the ratio design space, and the hybrid era it launched. The thread: don't replace attention — dilute it; a backbone of efficient layers with a few attention checkpoints and some experts gives long context, low cost, and strong recall at once.

The cheat sheet

The bet: attention (recall, expensive) + Mamba (efficient, weaker recall) are complementary

Interleave: ~1 attention per 7 Mamba layers, attention spread evenly through depth

Few suffice: recall saturates fast with attention count; cost rises — sweet spot is small

+ MoE: on some FFNs — cheap capacity (high total params, low active per token)

Three levers: Mamba (token-mix cost), MoE (FFN cost), few-attention (KV-cache memory)

Memory win: only attention layers cache; few of them → ~8× smaller KV cache → 256K context

Jamba sizes: ~50B total, ~12B active; 256K context on one 80GB GPU

Family: Griffin, Samba, Zamba — all “efficient majority + attention minority”

A decision guide

Need long context + low cost + reliable quality? A hybrid (Jamba-style) is usually the best tradeoff.
Recall-critical (retrieval, copying)? Lean toward a few more attention layers.
Maximum context / throughput? Fewer attention layers, more Mamba; mind the recall floor.
Need big capacity cheaply? Add MoE to the feed-forward layers (orthogonal to the token-mixer choice).

Where this connects

SSM / Mamba — the efficient backbone layer Jamba is mostly built from.
Mixture of Experts — Jamba's third lever, applied to the feed-forward sublayers.
Attention Variants — the precise-recall layers Jamba sprinkles in; some hybrids use sliding-window attention.
Linear Attention & RWKV & RetNet — the family that pointed toward hybrids; any of them could be the efficient backbone.
Hyena — another efficient layer that could play the backbone role in a hybrid.
Transformer — the pure-attention baseline Jamba dilutes rather than discards.

The one thing to remember. Jamba's lesson is that the answer to “attention or efficient recurrence?” is “both, in the right proportions.” A backbone of cheap Mamba layers carries the sequence; a few attention layers restore precise recall; MoE adds capacity at little extra per-token compute (though it still costs memory). The three levers are orthogonal, so they compose into a model that's long-context, fast, memory-light, and capable: the combination deployment actually needs. Jamba proved this at scale, and the hybrid recipe it established is now one of the dominant patterns for new long-context models. Sometimes the best architecture isn't a new idea but the right mixture of old ones.

You need to serve a model with 200K-token context, on limited GPU memory, with strong retrieval quality. What architecture and why?

A pure transformer: best quality
A pure Mamba model: best efficiency
A Jamba-style hybrid: mostly Mamba layers (linear cost, tiny KV cache → fits 200K) + a few attention layers (restore retrieval recall) + MoE (cheap capacity), giving long context, low memory, and strong recall together

“The strongest design is rarely the purest one — it is the wisest blend.”