How a trillion-parameter model can run as cheaply as a small one — by waking up only the few experts each word actually needs.
Bigger language models tend to be more capable, and the cleanest lever we have for raising capability is to add more parameters. But there's a brutal catch with the standard “dense” transformer: every parameter is used for every token. Double the parameters and you roughly double the compute for every single word the model reads or writes. Capability and cost rise in lockstep. That's the wall.
Look at where the parameters live. In a transformer block, the feed-forward network (the FFN, the big two-layer MLP after attention) typically holds about two-thirds of the block's weights. For every token, the full FFN fires: a huge matrix multiply, then another. Make the FFN four times wider for more capacity, and the FFN's share of every token's cost rises fourfold. There has to be a smarter way to spend parameters.
Here's the key observation that breaks the wall. Do you really need all of that giant FFN to process the word “the”? Probably not. Different tokens — a punctuation mark, a rare technical term, a number — might be best served by different specialized sub-networks. What if we had many FFNs, each an “expert” in some kind of input, and each token only used the one or two experts it actually needs?
This splits the idea of “model size” into two numbers that, in a dense model, were always equal. Total parameters: every weight the model stores; this sets capacity and memory. Active parameters: the weights actually used for a given token; this sets the compute cost. In a MoE, total can be ten or more times larger than active. Mixtral 8×7B has about 47 billion total parameters but activates only ~13 billion per token. DeepSeek-V3 has 671 billion total, ~37 billion active. You pay for the small number, you benefit from the large one.
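To make the total-versus-active split concrete, here is a rough weight count for the FFN part of a single MoE layer. It is a sketch under stated assumptions: each expert is a plain two-matrix FFN with biases ignored, the router is one linear layer, and the function name and example sizes are illustrative rather than taken from any real model.

```python
def moe_ffn_params(dim, hidden, n_experts, k):
    """Rough weight count for one MoE layer (biases ignored).

    Assumes each expert is a two-matrix FFN: dim*hidden + hidden*dim weights.
    """
    per_expert = 2 * dim * hidden
    router = dim * n_experts                 # one score per expert
    total = n_experts * per_expert + router  # what must be stored
    active = k * per_expert + router         # what one token actually uses
    return total, active


# Illustrative sizes only (hypothetical, not a real model's config).
total, active = moe_ffn_params(dim=1024, hidden=4096, n_experts=8, k=2)
print(f"total={total:,}  active={active:,}  ratio={total / active:.2f}x")
```

For a whole model the ratio comes out lower than `n_experts / k`, because attention and embedding weights are used by every token. That is one reason Mixtral's roughly 47B total against 13B active is less than 8 / 2 = 4.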
The widget shows a transformer's FFN compute. Drag the number of experts. In a dense model (1 expert), total and active parameters rise together — the wall. Switch on sparse routing and add experts: total parameters (capacity) shoot up while active parameters per token (cost) stay flat. That gap is the entire promise of MoE.
Add experts and watch total parameters (capacity, purple) soar while active parameters per token (cost, teal) stay flat. In a dense model they'd be the same bar.
Let's build the layer. We start from a normal transformer block, find the feed-forward network, and replace it with a Mixture of Experts layer. That layer has two ingredients: a set of experts — several independent feed-forward networks, each identical in shape to the one we replaced — and a router (also called the gate), a tiny network that decides which experts each token should visit.
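Follow a single token's vector through the layer:
1. The router scores the token against every expert.
2. Top-k selection keeps the k highest-scoring experts.
3. A softmax over those k scores turns them into gating weights.
4. The token is processed by those k experts, and only those.
5. The k expert outputs are combined into a weighted sum using the gating weights.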
The crucial part is step 4: the token is processed by only k experts (often 1 or 2), not all N. The other experts sit idle for this token — their parameters are never touched, so they cost nothing. A different token, with different router scores, lights up a different pair of experts. The model is enormous, but each token's journey through it is cheap.
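```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFN(nn.Module):
    """Assumed stand-in for one expert: a standard two-layer MLP."""
    def __init__(self, dim, hidden_mult=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim),
            nn.GELU(),
            nn.Linear(hidden_mult * dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class MoELayer(nn.Module):
    def __init__(self, dim, n_experts=8, k=2):
        super().__init__()                           # required before registering submodules
        self.experts = nn.ModuleList([FFN(dim) for _ in range(n_experts)])
        self.router = nn.Linear(dim, n_experts)      # scores each expert
        self.k = k

    def forward(self, x):                            # x: (tokens, dim)
        scores = self.router(x)                      # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # pick top-k per token
        weights = F.softmax(weights, dim=-1)         # normalize the chosen k
        out = torch.zeros_like(x)
        for i in range(self.k):                      # i-th choice of each token
            for e in range(len(self.experts)):
                m = (idx[:, i] == e)                 # tokens routed to expert e
                if m.any():
                    out[m] += weights[m, i:i+1] * self.experts[e](x[m])
        return out
```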
Notice the gather/scatter: tokens are grouped by which expert they chose, each expert processes only its assigned tokens, and the results are scattered back and weighted. Only the experts that were selected ever run. (In production this loop becomes a single batched, parallelized operation, but the logic is exactly this.)
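The gather/scatter pattern can be isolated from the neural-network details. The sketch below uses plain Python with toy scalar "experts" and top-1 assignments (both assumptions made for brevity; the helper names are hypothetical) to show tokens being grouped by expert, processed per group, and written back to their original positions.

```python
from collections import defaultdict


def gather_by_expert(assignments):
    """Group token indices by the expert each one chose (top-1 for simplicity)."""
    groups = defaultdict(list)
    for token_idx, expert_idx in enumerate(assignments):
        groups[expert_idx].append(token_idx)
    return dict(groups)


def run_experts(tokens, assignments, experts):
    """Each expert sees only its own tokens; results are scattered back in order."""
    out = [None] * len(tokens)
    for expert_idx, token_ids in gather_by_expert(assignments).items():
        batch = [tokens[t] for t in token_ids]             # gather
        results = [experts[expert_idx](x) for x in batch]  # one expert, one batch
        for t, y in zip(token_ids, results):               # scatter
            out[t] = y
    return out


# Hypothetical toy experts acting on scalars instead of vectors.
experts = [lambda x: x + 1, lambda x: x * 10, lambda x: -x]
tokens = [1.0, 2.0, 3.0, 4.0]
assignments = [1, 0, 1, 2]        # expert chosen by each token
outputs = run_experts(tokens, assignments, experts)
print(outputs)                    # [10.0, 3.0, 30.0, -4.0]
```

Grouping first means each expert is called once per batch rather than once per token, which is the property a batched implementation relies on.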
Click a token on the left to watch it route: the router scores all experts, the top-k light up, the token flows through only those, and their outputs combine. Different tokens take different paths. Change top-k to send each token to more experts.
Click a token (left). Its router scores all experts; the top-k (bright) process it and their outputs combine on the right. Each token routes independently.
The router is the brain of a MoE layer. It's almost embarrassingly simple — usually a single linear layer — yet getting it to behave is where all the difficulty lives. Let's see exactly what it computes, then why its simplicity hides a subtle problem.
For each token, the router produces one score per expert: a dot product of the token vector with a learned weight vector for that expert. So with 8 experts you get 8 numbers. Then two steps: top-k selection keeps only the k highest scores (the chosen experts), and a softmax over just those k scores turns them into gating weights that sum to 1. Those weights are how much each chosen expert's output counts in the final weighted sum.
A token arrives at a layer with 6 experts. The router outputs these scores, and we use top-2 (k = 2):
| expert | E0 | E1 | E2 | E3 | E4 | E5 |
|---|---|---|---|---|---|---|
| score | 0.5 | 2.1 | 0.9 | 1.7 | −0.3 | 0.2 |
The two highest scores are E1 (2.1) and E3 (1.7). We discard the rest and softmax just these two. Exponentiate: e^2.1 ≈ 8.17 and e^1.7 ≈ 5.47. Their sum is 13.64. So the gating weights are 8.17 / 13.64 ≈ 0.599 for E1 and 5.47 / 13.64 ≈ 0.401 for E3.
The token is sent through experts E1 and E3 only. Their output vectors are combined as 0.599 times E1's output plus 0.401 times E3's output. Experts E0, E2, E4, E5 never run for this token. Another token with different scores picks a different pair — maybe E0 and E2 — and the work spreads across the experts.
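The worked example can be checked in a few lines. This is a minimal sketch in plain Python (the function name `top_k_gating` is made up for illustration) that selects the top-k scores and applies a softmax over only those.

```python
import math


def top_k_gating(scores, k):
    """Return {expert_index: gating_weight} for the k highest-scoring experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = {i: math.exp(scores[i]) for i in top}
    z = sum(exps.values())                    # normalizer over the chosen k only
    return {i: e / z for i, e in exps.items()}


scores = [0.5, 2.1, 0.9, 1.7, -0.3, 0.2]      # E0..E5 from the table
gates = top_k_gating(scores, k=2)
for expert, weight in gates.items():
    print(f"E{expert}: {weight:.3f}")
# E1: 0.599
# E3: 0.401
```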
Here's the subtle problem. “Pick the top-k” is a hard, discrete choice — you can't take a smooth gradient through “which experts got selected.” So how does the router learn which experts to choose? The trick: the gradient flows through the gating weights, which are continuous. If expert E1's output helped reduce the loss, the gradient pushes the router to give E1 a higher score next time (raising its weight); if it hurt, the score drops. The router learns to route well not by differentiating the selection, but by adjusting the soft weights of whatever it did select. Over training, good experts for a token get higher and higher scores.
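One way to see where the learning signal comes from is to nudge each router score slightly and measure how the layer output moves. The sketch below uses finite differences instead of autograd, and made-up scalar expert outputs, so it is an illustration of the idea rather than how a framework computes gradients.

```python
import math


def moe_output(scores, expert_outputs, k=2):
    """Top-k gate, then weighted sum of the selected experts' (scalar) outputs."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    z = sum(math.exp(scores[i]) for i in top)
    return sum(math.exp(scores[i]) / z * expert_outputs[i] for i in top)


def sensitivity(scores, expert_outputs, i, eps=1e-4):
    """Finite-difference estimate of d(output) / d(score_i)."""
    bumped = list(scores)
    bumped[i] += eps
    return (moe_output(bumped, expert_outputs) - moe_output(scores, expert_outputs)) / eps


scores = [0.5, 2.1, 0.9, 1.7, -0.3, 0.2]
expert_outputs = [1.0, 3.0, -2.0, 0.5, 4.0, 0.0]   # hypothetical scalar outputs
grads = [sensitivity(scores, expert_outputs, i) for i in range(len(scores))]
print([round(g, 3) for g in grads])
```

Only the two selected experts (E1 and E3) show a nonzero sensitivity. The unselected scores get no signal at all from this token, which is why exploration and balancing matter.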
Early on, this creates a danger: the router might, by luck, favor a few experts, send them more tokens, improve them, and favor them even more — a rich-get-richer spiral that leaves most experts untrained. Some routers add noise to the scores before top-k (“noisy top-k gating”) to encourage exploration early. But the real fix for that spiral is the subject of the next chapter: load balancing.
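A minimal sketch of noisy top-k selection follows. It assumes a fixed Gaussian noise scale, which is a simplification chosen here for clarity (published variants may learn the noise scale), and the helper name is hypothetical.

```python
import random


def noisy_top_k(scores, k, noise_std, rng):
    """Add Gaussian noise to each score, then return the indices of the top k."""
    noisy = [s + rng.gauss(0.0, noise_std) for s in scores]
    return sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)[:k]


rng = random.Random(0)                       # seeded for repeatability
scores = [0.5, 2.1, 0.9, 1.7, -0.3, 0.2]
counts = [0] * len(scores)
for _ in range(1000):
    for i in noisy_top_k(scores, k=2, noise_std=1.0, rng=rng):
        counts[i] += 1
print(counts)   # how often each expert was picked across 1000 noisy draws
```

With `noise_std=0.0` the selection is always E1 and E3. With noise, lower-scoring experts occasionally win and so receive tokens and gradient updates they would otherwise never see.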
Drag the expert scores and watch top-k selection and the softmax gating weights update. Raise k to use more experts. Add router noise and resample to see how exploration can change which experts win when scores are close.
Bars = router scores per expert. The top-k (teal) are selected; their softmax gating weights appear above. Adjust k and the noise, and resample.
We hinted at the disease at the end of Chapter 2. Left to its own devices, a MoE router collapses: it learns to send almost every token to a small handful of experts, while the rest sit unused, untrained, and useless. This is the single biggest failure mode in training a MoE, and it has a self-reinforcing logic that makes it almost inevitable without a fix.
It's a feedback loop. Early in training, by pure chance, the router slightly favors a few experts. Those experts get more tokens, so they get more gradient updates, so they get better. Being better, they reduce the loss more, so the router favors them even more. Meanwhile the neglected experts get few tokens, barely train, stay bad, and get avoided. Within a few thousand steps, you've effectively got a dense model using two experts and 62 dead ones — you paid for a huge model and got a tiny one.
The solution is to add a second loss term that punishes imbalance, summed alongside the normal language-modeling loss. The idea: measure how unevenly tokens are distributed across experts, and add a penalty that grows when the distribution is lopsided. The router now has two pressures — route tokens to good experts (main loss) and route tokens evenly (auxiliary loss) — and it must balance them.
The standard formulation multiplies two quantities per expert: the fraction of tokens routed to that expert, and the average router probability assigned to it. The auxiliary loss is the sum of these products across experts, scaled up by the number of experts. Minimizing it pushes both quantities toward uniform — every expert getting its fair share of tokens and probability. A small coefficient (often around 0.01) keeps it from overwhelming the real objective; just enough nudge to keep the experts all in the game.
Suppose 100 tokens and 4 experts. A perfectly balanced router sends 25 tokens to each — fraction 0.25 apiece. A collapsing router might send 70, 25, 4, 1. The balancing loss multiplies each expert's token-fraction by its average gating probability and sums. When everything is uniform (0.25 each), that sum is at its minimum. When it's lopsided (0.70, 0.25, 0.04, 0.01), the product for the overloaded expert (0.70 × its high probability) dominates and the sum shoots up — a big penalty. Gradient descent on that penalty pushes the router to even out the 70 back toward 25, reviving the starved experts before they die.
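The balancing penalty described above can be computed directly for the 4-expert example. In this sketch the average router probabilities are assumed to mirror the token fractions, which the text does not specify, and the coefficient 0.01 is the typical value mentioned earlier.

```python
def load_balancing_loss(token_fractions, mean_probs, coeff=0.01):
    """coeff * N * sum_i f_i * P_i, where f_i is the fraction of tokens routed
    to expert i and P_i is the average router probability assigned to it."""
    n = len(token_fractions)
    return coeff * n * sum(f * p for f, p in zip(token_fractions, mean_probs))


balanced = [0.25, 0.25, 0.25, 0.25]
lopsided = [0.70, 0.25, 0.04, 0.01]

# Assumption: the router's average probabilities mirror the token fractions.
loss_balanced = load_balancing_loss(balanced, balanced)
loss_lopsided = load_balancing_loss(lopsided, lopsided)
print(f"balanced: {loss_balanced:.4f}   lopsided: {loss_lopsided:.4f}")
# balanced: 0.0100   lopsided: 0.0222
```

Under this assumption the lopsided router pays a bit more than twice the penalty of the balanced one, and almost all of the excess comes from the 0.70 expert.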
Press run to stream tokens through 8 experts. With the balancing loss off, watch the rich-get-richer spiral: a couple of experts swell while the rest flatline. Toggle the balancing loss on and re-run: the load spreads evenly across all experts. The imbalance meter quantifies the difference.
Bars = cumulative tokens routed to each expert. Without the aux loss, a few experts dominate (collapse). With it, load spreads evenly. Toggle and re-run.
There's a hardware reality that load balancing alone doesn't solve. To run experts efficiently on a GPU, you need fixed-size buffers — you allocate, ahead of time, room for a specific number of tokens per expert. But routing is dynamic: on any given batch, some experts get more tokens than others, even with balancing. What happens when an expert receives more tokens than its buffer can hold?
Each expert gets a capacity: the maximum number of tokens it will process this batch. It's set by a capacity factor — a multiplier on the “fair share” each expert would get under perfect balance. A capacity factor of 1.0 means each expert can hold exactly its even share; 1.25 gives 25% headroom for the inevitable imbalance.
If more tokens route to an expert than its capacity allows, the overflow tokens are dropped — that expert simply doesn't process them. A dropped token isn't lost entirely: thanks to the residual connection around the MoE layer, it passes through unchanged, as if the layer were skipped for it. But it misses out on the expert computation it was supposed to get. Too much dropping hurts quality.
A batch has 512 tokens, 8 experts, top-1 routing. Under perfect balance each expert would get 512 / 8 = 64 tokens. With a capacity factor of 1.25, each expert's buffer holds 64 × 1.25 = 80 tokens. Now suppose, despite balancing, one expert is assigned 88 tokens this batch. It processes the first 80 and drops the last 8 — those 8 tokens skip the layer via the residual. Meanwhile an underused expert assigned only 50 tokens leaves 30 buffer slots empty (wasted compute). That's the daily reality of running a MoE: a constant, managed tension between dropping and waste.
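The arithmetic in this example is simple enough to script. The sketch below reproduces it; the `k` multiplier for top-k routing is an assumption added for generality (the example itself is top-1), and the function names are illustrative.

```python
def expert_capacity(n_tokens, n_experts, capacity_factor, k=1):
    """Buffer size per expert: fair share times the capacity factor."""
    return int(n_tokens * k / n_experts * capacity_factor)


def buffer_report(assigned, capacity):
    """Return (processed, dropped, empty_slots) for one expert's buffer."""
    processed = min(assigned, capacity)
    return processed, assigned - processed, capacity - processed


capacity = expert_capacity(n_tokens=512, n_experts=8, capacity_factor=1.25)
print(capacity)                       # 80
print(buffer_report(88, capacity))    # (80, 8, 0)
print(buffer_report(50, capacity))    # (50, 0, 30)
```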
Tokens stream into 6 experts, each with a capacity buffer (the outlined box). Watch buffers fill; when one overflows, the extra tokens turn red and drop. Raise the capacity factor to add headroom and reduce drops — but notice the growing empty space (wasted compute) in the under-filled experts.
Each box is an expert's buffer. Filled = processed tokens, red = dropped overflow, empty = wasted slots. Adjust the capacity factor and re-run.
Early MoE work assumed you needed at least top-2 routing — send each token to two experts — because comparing two options seemed necessary for the router to get a useful learning signal. In 2021, Google's Switch Transformer made a bold simplification: route each token to exactly one expert. Top-1. And it worked beautifully, scaling MoE to a trillion parameters.
Going from top-2 to top-1 sounds like a minor change. It isn't: it roughly halves both the number of expert-FFN evaluations per token and the cross-machine communication needed to reach those experts.
With top-1, the whole layer collapses to something clean: the router picks the single best expert for each token, the token goes through that one expert, and the output is the expert's result times its (single) gating weight. There's no softmax-over-k, no weighted combination of multiple outputs. The gating weight still matters — it lets gradients flow to the router so it learns which expert to pick — but the data path is just “one token, one expert.” Modern models often go back to top-2 or higher (Mixtral uses top-2, DeepSeek top-8 of fine-grained experts), but Switch proved top-1 is viable and established the modern recipe of balancing loss plus capacity factor.
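A top-1 data path can be sketched in a few lines. The gate here is assumed to be the softmax probability of the chosen expert taken over all experts, since a softmax over a single selected score would always equal 1 and carry no signal. The scalar experts are toy stand-ins.

```python
import math


def switch_route(scores):
    """Top-1 routing: softmax over all experts, keep the single best one."""
    z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]
    best = max(range(len(scores)), key=lambda i: probs[i])
    return best, probs[best]


def switch_layer(x, scores, experts):
    """Output = gate * expert(x) for the one selected expert."""
    best, gate = switch_route(scores)
    return gate * experts[best](x)


def expert_evaluations(n_tokens, k):
    """Each token visits k experts, so FFN evaluations scale linearly with k."""
    return n_tokens * k


experts = [lambda x: x + 1.0, lambda x: 2.0 * x]   # hypothetical scalar experts
best, gate = switch_route([0.0, 1.0])
y = switch_layer(3.0, [0.0, 1.0], experts)
print(best, round(gate, 3), round(y, 3))
print(expert_evaluations(512, 1), expert_evaluations(512, 2))   # 512 1024
```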
Toggle between Switch (top-1) and top-2 routing on the same stream of tokens. Watch the total expert-FFN evaluations and the cross-machine communication count: top-1 roughly halves both. The capability difference is modest; the cost difference is not.
Same tokens, two routing schemes. Bars show total expert evaluations and communication. Top-1 (Switch) does roughly half the work of top-2.
Everything comes together here. This is a MoE layer training live: a stream of tokens routes through 8 experts, batch after batch, while the router learns. You control top-k, the load-balancing loss, and the capacity factor — the three dials that decide whether your expensive sparse model thrives or collapses. Watch all three effects at once: the load distribution, the imbalance, and the token drop rate.
Run these experiments to cement the whole lesson:
Bars = per-batch tokens per expert; the dashed line is capacity (overflow = red drops). Toggle the aux loss, adjust capacity and top-k, and watch imbalance and drop-rate respond in real time.
No quiz — the simulator is the test. If you can predict what happens to drop rate when you turn off the aux loss and lower capacity, you understand how a MoE layer really runs.
The ideas so far — experts, router, top-k, balancing, capacity — are the foundation. Today's best open MoE models add two refinements that meaningfully improve the basic recipe: fine-grained experts and shared experts. Both come from asking “how do we make each expert's specialization more useful?”
Mistral's Mixtral 8×7B (2023) was the model that made MoE a practical, open reality. Eight experts per layer, top-2 routing. About 47 billion total parameters, but only ~13 billion active per token — so it runs at the speed of a ~13B model while matching or beating much larger dense models. It proved the MoE recipe (top-2, balancing loss, capacity) works at the scale people actually deploy, not just in research labs.
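DeepSeek's MoE models pushed the design further with two changes: fine-grained experts, which replace a few big experts with many smaller ones and route each token to a larger handful of them, and a shared expert that every token always passes through, so common knowledge lives in one place and the routed experts are free to specialize.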
Toggle between the classic Mixtral-style layer (a few big experts, top-2) and the DeepSeek-style layer (one always-on shared expert plus many fine-grained routed experts). Watch how a token's path changes: in the modern design it always passes through the shared expert plus a larger handful of small specialists.
A token's routing under two designs. Classic: top-2 of a few big experts. Modern: a shared always-on expert (gold) plus top-k of many small routed experts (teal). Toggle.
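The shared-plus-routed design can be sketched as a small function: the shared expert always runs, and the gated outputs of the top-k routed experts are added on top. The softmax over the selected scores and the plain sum are assumptions made for this illustration, and real DeepSeek gating differs in its details. All experts are toy scalar functions.

```python
import math


def shared_plus_routed(x, scores, shared_expert, routed_experts, k):
    """Shared expert always runs; top-k routed experts add their gated outputs."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    z = sum(math.exp(scores[i]) for i in top)
    routed = sum(math.exp(scores[i]) / z * routed_experts[i](x) for i in top)
    return shared_expert(x) + routed, top


# Hypothetical scalar experts: 16 small routed experts plus one shared expert.
routed_experts = [(lambda x, s=i: x * s) for i in range(16)]
shared_expert = lambda x: x + 100.0
scores = [0.1 * i for i in range(16)]       # made-up router scores
y, chosen = shared_plus_routed(1.0, scores, shared_expert, routed_experts, k=4)
print(chosen)                               # [15, 14, 13, 12]
```

Whatever the router decides, the `shared_expert` term is always present in the output, while the routed term changes from token to token.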
MoE is as much a systems innovation as a modeling one. A trillion-parameter model's experts can't fit on one GPU — not even close. So the experts are spread across many GPUs, and that physical distribution creates the defining engineering challenge of MoE: moving tokens to wherever their chosen experts live.
The standard layout is expert parallelism: put different experts on different GPUs. With 64 experts across 64 GPUs, each GPU holds one expert. Now consider what routing means physically. A batch of tokens is spread across all those GPUs (each GPU processed some tokens through attention). But the router might send a token sitting on GPU 0 to an expert living on GPU 47. That token's data must be physically shipped across the network to GPU 47, processed, and shipped back.
Since every token might need to go to any GPU, this is an all-to-all communication: every GPU sends some tokens to every other GPU, simultaneously. It happens twice per MoE layer, once to dispatch tokens to their experts and once to gather the results back. All-to-all is one of the most expensive communication patterns in distributed computing, and in a large MoE it can dominate the runtime, with GPUs spending more time shuffling tokens than computing.
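The dispatch step can be simulated without any GPUs by counting how many tokens each device must send to each other device. The layout below (4 GPUs, one expert each, top-1 routing, hand-picked assignments) is entirely hypothetical.

```python
def dispatch_matrix(token_gpu, token_expert, expert_gpu, n_gpus):
    """send[src][dst] = number of tokens GPU src must ship to GPU dst."""
    send = [[0] * n_gpus for _ in range(n_gpus)]
    for src, expert in zip(token_gpu, token_expert):
        send[src][expert_gpu[expert]] += 1
    return send


n_gpus = 4
expert_gpu = [0, 1, 2, 3]                    # one expert per GPU
token_gpu = [0, 0, 1, 1, 2, 2, 3, 3]         # where each token currently sits
token_expert = [2, 0, 3, 1, 2, 0, 1, 3]      # router's top-1 choice per token
send = dispatch_matrix(token_gpu, token_expert, expert_gpu, n_gpus)
cross_gpu = sum(send[s][d] for s in range(n_gpus) for d in range(n_gpus) if s != d)
for row in send:
    print(row)
print("tokens crossing the network:", cross_gpu)   # 4
```

The gather step sends the same counts back along the transposed matrix. In practice frameworks expose this exchange as a collective operation, for example `torch.distributed.all_to_all_single` in PyTorch.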
Despite the communication cost, the economics are compelling. The compute (the matrix multiplies) scales with active parameters, which you hold fixed. The capacity scales with total parameters, which you grow by adding GPUs (and experts). So you can keep making the model more capable by buying more GPUs to hold more experts, without each token getting more expensive to compute. The ceiling becomes communication bandwidth and memory, not per-token FLOPs — a fundamentally better scaling curve than dense models.
Set the number of experts and top-k. The calculator shows total vs active parameters and the sparsity ratio. The diagram shows tokens being shipped across GPUs (all-to-all) to reach their experts — the more experts spread across more GPUs, the more cross-GPU traffic. Watch the communication grow with top-k.
Experts live on different GPUs; arrows show tokens shipped to reach them (all-to-all). The readout quantifies the capacity-vs-cost split. Adjust experts and top-k.
You now understand the whole machine: why dense scaling hits a wall, how a MoE layer swaps one FFN for many experts plus a router, how top-k gating works and learns despite being non-differentiable, why routers collapse and how the balancing loss prevents it, how capacity and token dropping manage fixed buffers, the Switch top-1 simplification, the Mixtral and DeepSeek refinements, and why the whole thing is really a distributed-systems problem. The thread: spend parameters where they're needed, activate only a few per token, and pay constant compute for ever-growing capacity — if you can keep the router balanced and the tokens flowing.
| Term | What it means |
|---|---|
| Expert | one of several FFNs; each token uses only a few |
| Router / gate | tiny linear layer that scores experts per token |
| Top-k | how many experts each token is sent to (1, 2, 8...) |
| Total vs active params | capacity/memory vs per-token compute |
| Aux load-balancing loss | penalty that prevents router collapse |
| Capacity factor | buffer headroom per expert; controls token dropping |
| Shared expert | always-on expert for common knowledge (DeepSeek) |
| Expert parallelism | experts on different GPUs; needs all-to-all comm |
“The secret to scale isn’t doing more for every input — it’s knowing which small part of a vast mind each input needs to wake.”