Mixture of Experts — Scaling Without Paying for It

Chapter 0: The Dense Scaling Wall

Bigger language models are smarter. The cleanest lever we have for raising capability is to add more parameters. But there's a brutal catch with the standard “dense” transformer: every parameter is used for every token. Double the parameters and you double the compute for every single word the model reads or writes. Capability and cost rise together, locked in lockstep. That's the wall.

Look at where the parameters live. In a transformer block, the feed-forward network (the FFN — the big two-layer MLP after attention) holds about two-thirds of the model's weights. For every token, the full FFN fires: a huge matrix multiply, then another. Make the FFN four times wider for more capacity, and every token now costs four times as much to process. There has to be a smarter way to spend parameters.

Here's the key observation that breaks the wall. Do you really need all of that giant FFN to process the word “the”? Probably not. Different tokens — a punctuation mark, a rare technical term, a number — might be best served by different specialized sub-networks. What if we had many FFNs, each an “expert” in some kind of input, and each token only used the one or two experts it actually needs?

The one-sentence version. A Mixture of Experts replaces one big feed-forward network with many smaller expert networks plus a router that sends each token to just a few of them. The model can hold a huge total number of parameters, but each token only activates a tiny fraction — so you scale capacity without scaling the per-token compute.

Total vs. active parameters

This splits the idea of “model size” into two numbers that, in a dense model, were always equal. Total parameters: every weight the model stores — this sets capacity and memory. Active parameters: the weights actually used for a given token — this sets the compute cost. In a dense model these are identical. In a MoE, total can be 10× or more larger than active. Mixtral 8×7B has about 47 billion total parameters but activates only ~13 billion per token. DeepSeek-V3 has 671 billion total, ~37 billion active. You pay for the small number, you benefit from the large one.
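To make the split concrete, here is a minimal sketch of the bookkeeping. The function name `moe_param_counts` and the example sizes are illustrative assumptions, not the configuration of any real model. It only assumes that non-expert weights (attention, embeddings) are always active and that each token uses k of the n experts in every layer.

```python
def moe_param_counts(non_expert_params, params_per_expert, n_experts, k, n_layers):
    """Return (total, active) parameter counts for a simple MoE transformer.

    Assumes every layer has n_experts identically sized expert FFNs and
    each token is routed to k of them; all other weights are always used.
    """
    total = non_expert_params + n_layers * n_experts * params_per_expert
    active = non_expert_params + n_layers * k * params_per_expert
    return total, active


# Hypothetical sizes, chosen only for illustration.
total, active = moe_param_counts(
    non_expert_params=2_000_000_000,
    params_per_expert=150_000_000,
    n_experts=8,
    k=2,
    n_layers=32,
)
print(f"total:  {total / 1e9:.1f}B")   # total:  40.4B
print(f"active: {active / 1e9:.1f}B")  # active: 11.6B
```

Setting `n_experts=1, k=1` recovers the dense case, where the two numbers coincide.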

See it: decoupling capacity from cost

The widget shows a transformer's FFN compute. Drag the number of experts. In a dense model (1 expert), total and active parameters rise together — the wall. Switch on sparse routing and add experts: total parameters (capacity) shoot up while active parameters per token (cost) stay flat. That gap is the entire promise of MoE.

Total Capacity vs. Per-Token Cost

Add experts and watch total parameters (capacity, purple) soar while active parameters per token (cost, teal) stay flat. In a dense model they'd be the same bar.

Common misconception. “A 47-billion-parameter MoE is as smart as a 47-billion-parameter dense model.” Not quite. MoE buys capacity cheaply, but each token still only sees a fraction of the network, so a sparse model is generally a bit weaker per total parameter than a dense one — while being far cheaper to run. The right comparison is on active compute: a MoE delivers much more capability than a dense model of the same per-token cost. That's the win.

What does a Mixture of Experts decouple that a dense model keeps locked together?

Training time and inference time
Total parameters (capacity/memory) from active parameters per token (compute cost); you can grow one without growing the other
The number of layers from the number of heads

Chapter 1: The MoE Layer — Many FFNs and a Router

Let's build the layer. We start from a normal transformer block, find the feed-forward network, and replace it with a Mixture of Experts layer. That layer has two ingredients: a set of experts — several independent feed-forward networks, each identical in shape to the one we replaced — and a router (also called the gate), a tiny network that decides which experts each token should visit.

The data flow, traced

Follow a single token's vector through the layer:

1. Token in: a vector, e.g. 4096-dim, arrives from attention.
2. Router: a small linear layer scores all N experts for this token.
3. Pick top-k: keep the k highest-scoring experts (e.g. k=2) and softmax their scores into weights.
4. Run experts: send the token through only those k expert FFNs.
5. Combine: take the weighted sum of the k expert outputs; this is the layer's output.

The crucial part is step 4: the token is processed by only k experts (often 1 or 2), not all N. The other experts sit idle for this token: their parameters are never touched, so they cost no compute (they still occupy memory). A different token, with different router scores, lights up a different pair of experts. The model is enormous, but each token's journey through it is cheap.

Why this is “sparse.” “Sparse” here means most of the network is inactive for any given token. A dense layer is a single road every token walks. A MoE layer is a building with many specialized rooms; the router is the receptionist who sends each visitor to just the two rooms they need. The building can have a thousand rooms, but each visitor only ever enters two.

From scratch: a MoE layer

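```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    # Not defined in the original; a plain two-layer MLP with 4x hidden width is assumed.
    def __init__(self, dim, hidden=None):
        super().__init__()
        hidden = hidden or 4 * dim
        self.up, self.down = nn.Linear(dim, hidden), nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class MoELayer(nn.Module):
    def __init__(self, dim, n_experts=8, k=2):
        super().__init__()                         # must run before submodules are assigned
        self.experts = nn.ModuleList([FFN(dim) for _ in range(n_experts)])
        self.router  = nn.Linear(dim, n_experts)   # scores each expert
        self.k = k

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # pick top-k per token
        weights = F.softmax(weights, dim=-1)   # normalize the chosen k
        out = torch.zeros_like(x)
        for i in range(self.k):                # i-th choice of each token
            for e in range(len(self.experts)):
                m = (idx[:, i] == e)               # tokens routed to expert e
                if m.any():
                    out[m] += weights[m, i:i+1] * self.experts[e](x[m])
        return out
```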

Notice the gather/scatter: tokens are grouped by which expert they chose, each expert processes only its assigned tokens, and the results are scattered back and weighted. Only the experts that were selected ever run. (In production this loop becomes a single batched, parallelized operation, but the logic is exactly this.)

See it: tokens routing through the layer

Click a token on the left to watch it route: the router scores all experts, the top-k light up, the token flows through only those, and their outputs combine. Different tokens take different paths. Change top-k to send each token to more experts.

A Token's Path Through the MoE Layer

Click a token (left). Its router scores all experts; the top-k (bright) process it and their outputs combine on the right. Each token routes independently.

Common misconception. “Experts specialize by topic — one for math, one for French.” Mostly no. Learned experts rarely map to human-interpretable categories; they specialize in subtle, often syntactic or token-level patterns that don't have clean names. The router and experts co-adapt to whatever division of labor minimizes loss — which is usually not the tidy “topic experts” people imagine. Don't expect to open up an expert and find “the biology neuron.”

In a top-2 MoE layer with 8 experts, how many experts process a given token, and what do the other experts cost?

All 8 process it; none are skipped
2 process it; the other 6 are never run for that token, so they cost no compute
1 processes it; the rest vote

Chapter 2: The Router — Choosing Who Does the Work

The router is the brain of a MoE layer. It's almost embarrassingly simple — usually a single linear layer — yet getting it to behave is where all the difficulty lives. Let's see exactly what it computes, then why its simplicity hides a subtle problem.

Scores, top-k, and gating weights

For each token, the router produces one score per expert: the dot product of the token vector with a learned weight vector for that expert. So with 8 experts you get 8 numbers. Then two steps: top-k selection keeps only the k highest scores (the chosen experts), and a softmax over just those k scores turns them into gating weights that sum to 1. Those weights are how much each chosen expert's output counts in the final weighted sum.

Why softmax over only the top-k, not all experts. If you softmaxed over all N experts and then picked the top-k, the dropped experts' probability mass would leak out of the weights and they wouldn't sum to 1. By selecting top-k first and softmaxing only those, the gating weights of the experts you actually use form a clean probability distribution. The token's output is a proper weighted average of exactly the experts that ran.
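A quick numerical check of this point, sketched in plain Python. The helper names (`softmax`, `top_k_indices`) and the scores are illustrative assumptions. It shows that the kept probabilities from a full softmax sum to less than 1, and that renormalizing them gives the same weights as softmaxing only the top-k scores.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


scores = [1.0, 3.0, 0.5, 2.0]   # hypothetical router scores for 4 experts
k = 2
chosen = top_k_indices(scores, k)          # [1, 3]

# Option A: softmax over all experts, then keep the top-k entries.
full = softmax(scores)
kept = [full[i] for i in chosen]
print(sum(kept))                           # less than 1: mass leaked to dropped experts

# Option B: softmax over only the chosen scores.
gates = softmax([scores[i] for i in chosen])
print(sum(gates))                          # 1.0 up to rounding

# Renormalizing option A recovers option B.
renormalized = [p / sum(kept) for p in kept]
```

So the two orderings differ only by a renormalization step; selecting first simply makes that step unnecessary.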

Worked example: routing one token

A token arrives at a layer with 6 experts. The router outputs these scores, and we use top-2 (k = 2):

expert	E0	E1	E2	E3	E4	E5
score	0.5	2.1	0.9	1.7	−0.3	0.2

The two highest scores are E1 (2.1) and E3 (1.7). We discard the rest and softmax just these two. Exponentiate: e to the 2.1 is 8.17, e to the 1.7 is 5.47. Their sum is 13.64. So the gating weights are:

w(E1) = 8.17 / 13.64 = 0.599, w(E3) = 5.47 / 13.64 = 0.401

The token is sent through experts E1 and E3 only. Their output vectors are combined as 0.599 times E1's output plus 0.401 times E3's output. Experts E0, E2, E4, E5 never run for this token. Another token with different scores picks a different pair — maybe E0 and E2 — and the work spreads across the experts.
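The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch; `route_token` is a hypothetical helper name, and it returns the gating weights only, without running any experts.

```python
import math


def route_token(scores, k):
    """Return {expert_index: gating_weight} for the k highest-scoring experts."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}


scores = [0.5, 2.1, 0.9, 1.7, -0.3, 0.2]   # E0..E5 from the table above
gates = route_token(scores, k=2)
for expert, weight in gates.items():
    print(f"E{expert}: {weight:.3f}")
# E1: 0.599
# E3: 0.401
```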

The non-differentiable wrinkle

Here's the subtle problem. “Pick the top-k” is a hard, discrete choice — you can't take a smooth gradient through “which experts got selected.” So how does the router learn which experts to choose? The trick: the gradient flows through the gating weights, which are continuous. If expert E1's output helped reduce the loss, the gradient pushes the router to give E1 a higher score next time (raising its weight); if it hurt, the score drops. The router learns to route well not by differentiating the selection, but by adjusting the soft weights of whatever it did select. Over training, good experts for a token get higher and higher scores.
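The claim can be made precise with a short derivation. As a sketch, write o_i for the output of chosen expert i, s_i for its router score, w_i for its gating weight, and S for the set of selected experts (this notation is introduced here for illustration). Treating the selection S as fixed:

```latex
y = \sum_{i \in S} w_i\, o_i, \qquad
w_i = \frac{e^{s_i}}{\sum_{j \in S} e^{s_j}}

\frac{\partial w_i}{\partial s_j} = w_i\,(\delta_{ij} - w_j)
\quad\Longrightarrow\quad
\frac{\partial y}{\partial s_j} = w_j\,(o_j - y)

\frac{\partial \mathcal{L}}{\partial s_j}
  = w_j \left\langle \frac{\partial \mathcal{L}}{\partial y},\; o_j - y \right\rangle
```

If moving the layer output y toward o_j would lower the loss, the inner product is negative, so gradient descent raises s_j. Experts outside S contribute nothing to y, so their scores receive no gradient from this token.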

Early on, this creates a danger: the router might, by luck, favor a few experts, send them more tokens, improve them, and favor them even more — a rich-get-richer spiral that leaves most experts untrained. Some routers add noise to the scores before top-k (“noisy top-k gating”) to encourage exploration early. But the real fix for that spiral is the subject of the next chapter: load balancing.
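Here is a minimal sketch of the noise idea. It is a simplification: the noise standard deviation is a fixed number here, whereas noisy top-k gating as originally proposed learns an input-dependent noise scale per expert. The function name and scores are illustrative.

```python
import random


def noisy_top_k(scores, k, noise_std, rng):
    """Pick k experts after adding Gaussian noise to each score.

    Simplification: noise_std is a fixed constant in this sketch.
    """
    noisy = [s + rng.gauss(0.0, noise_std) for s in scores]
    return sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]


rng = random.Random(0)
scores = [2.0, 1.9, 1.8, -1.0]   # hypothetical: three experts nearly tied

# Without noise the same two experts win every time.
print(noisy_top_k(scores, k=2, noise_std=0.0, rng=rng))   # [0, 1]

# With noise, the near-tied expert 2 gets picked some of the time.
counts = [0] * len(scores)
for _ in range(1000):
    for e in noisy_top_k(scores, k=2, noise_std=0.5, rng=rng):
        counts[e] += 1
print(counts)
```

Noise only changes the outcome when scores are close; the clearly worse expert 3 is still almost never selected.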

See it: router scores to gating weights

Drag the expert scores and watch top-k selection and the softmax gating weights update. Raise k to use more experts. Add router noise and resample to see how exploration can change which experts win when scores are close.

Router: Scores → Top-k → Gating Weights

Bars = router scores per expert. The top-k (teal) are selected; their softmax gating weights appear above. Adjust k and the noise, and resample.

Common misconception. “The router is a big, smart network deciding routing.” It's typically just one linear layer: with a 4096-dim token vector and 8 experts that is about 33 thousand weights per layer, against billions in the experts. That tiny router holds enormous power (it decides where every token's compute goes) and is famously finicky to train. Most of the engineering in MoE is about keeping this little gate well-behaved, not about the experts.

Top-k selection is a hard, discrete choice with no smooth gradient. How does the router still learn good routing?

It uses reinforcement learning with a separate reward model
Gradients flow through the continuous gating weights of the chosen experts: if an expert helped, its score is pushed up; the selection itself isn't differentiated
It doesn't learn; routing is fixed at initialization

Chapter 3: Load Balancing — Stopping the Rich-Get-Richer Spiral

We hinted at the disease at the end of Chapter 2. Left to its own devices, a MoE router collapses: it learns to send almost every token to a small handful of experts, while the rest sit unused, untrained, and useless. This is the single biggest failure mode in training a MoE, and it has a self-reinforcing logic that makes it almost inevitable without a fix.

Why collapse happens

It's a feedback loop. Early in training, by pure chance, the router slightly favors a few experts. Those experts get more tokens, so they get more gradient updates, so they get better. Being better, they reduce the loss more, so the router favors them even more. Meanwhile the neglected experts get few tokens, barely train, stay bad, and get avoided. Within a few thousand steps, a layer with 64 experts can end up behaving like a dense model that uses two experts and carries 62 dead ones: you paid for a huge model and got a tiny one.

The tragedy of collapse. The whole point of MoE is to spread capacity across many experts. Collapse defeats that point entirely: capacity concentrates in a few experts and the rest are wasted parameters. And it's not a rare edge case — it's the default behavior. A MoE without an explicit balancing mechanism will almost always collapse. So balancing isn't an optional tweak; it's load-bearing.

The fix: an auxiliary load-balancing loss

The solution is to add a second loss term that punishes imbalance, summed alongside the normal language-modeling loss. The idea: measure how unevenly tokens are distributed across experts, and add a penalty that grows when the distribution is lopsided. The router now has two pressures — route tokens to good experts (main loss) and route tokens evenly (auxiliary loss) — and it must balance them.

The standard formulation multiplies two quantities per expert: the fraction of tokens routed to that expert, and the average router probability assigned to it. The auxiliary loss is the sum of these products across experts, scaled up by the number of experts. Minimizing it pushes both quantities toward uniform — every expert getting its fair share of tokens and probability. A small coefficient (often around 0.01) keeps it from overwhelming the real objective; just enough nudge to keep the experts all in the game.
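Written out, with N experts, f_i the fraction of tokens routed to expert i, P_i the average router probability for expert i, and α the coefficient (symbols chosen here for illustration), the description above corresponds to:

```latex
\mathcal{L}_{\text{aux}} = \alpha \, N \sum_{i=1}^{N} f_i \, P_i
```

Under perfectly uniform routing, f_i = P_i = 1/N, so the sum is N · N · (1/N²) = 1 and the auxiliary loss equals α. Note that f_i comes from the discrete top-k assignment, so the gradient reaches the router through P_i.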

Worked example: spotting imbalance

Suppose 100 tokens and 4 experts. A perfectly balanced router sends 25 tokens to each, a fraction of 0.25 apiece. A collapsing router might send 70, 25, 4, 1. The balancing loss multiplies each expert's token fraction by its average router probability and sums. When everything is uniform (0.25 each), that sum is at its minimum. When it's lopsided (0.70, 0.25, 0.04, 0.01), the product for the overloaded expert (0.70 × its high probability) dominates and the sum shoots up: a big penalty. Gradient descent on that penalty pushes the router to even out the 70 back toward 25, reviving the starved experts before they die.
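To put numbers on that comparison, here is a small sketch. The text does not give the router probabilities, so this example assumes each expert's average router probability equals its token fraction; that assumption, and the function name, are illustrative only.

```python
def load_balancing_loss(token_fractions, router_probs, coeff=0.01):
    """Auxiliary loss: coeff * N * sum_i(f_i * P_i)."""
    n_experts = len(token_fractions)
    dot = sum(f * p for f, p in zip(token_fractions, router_probs))
    return coeff * n_experts * dot


balanced = [0.25, 0.25, 0.25, 0.25]
lopsided = [0.70, 0.25, 0.04, 0.01]

# Assumption for illustration: average router probability == token fraction.
print(load_balancing_loss(balanced, balanced, coeff=1.0))   # 1.0
print(load_balancing_loss(lopsided, lopsided, coeff=1.0))   # about 2.22
```

With the coefficient set to 1 for readability, the lopsided routing costs a bit more than twice the balanced one, and almost all of the excess comes from the first expert's 0.70 × 0.70 term.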

See it: collapse vs. balanced

Press run to stream tokens through 8 experts. With the balancing loss off, watch the rich-get-richer spiral: a couple of experts swell while the rest flatline. Toggle the balancing loss on and re-run: the load spreads evenly across all experts. The imbalance meter quantifies the difference.

The Collapse Spiral vs. Balanced Routing

Bars = cumulative tokens routed to each expert. Without the aux loss, a few experts dominate (collapse). With it, load spreads evenly. Toggle and re-run.

Common misconception. “Load balancing forces every token to a random expert, hurting quality.” The auxiliary loss has a small coefficient — it's a gentle nudge toward balance, not a hard constraint. The router can still send a token to its best expert; it just can't send everything to a favorite. The tiny quality cost of slightly imperfect routing is dwarfed by the benefit of keeping all experts alive and trained. Without it, you'd lose most of your model.

Why does a MoE router collapse to a few experts without an explicit balancing mechanism?

Experts are initialized identically
A feedback loop: favored experts get more tokens → more training → become better → get favored even more, starving the rest
The softmax temperature is too high

Chapter 4: Expert Capacity — The Buffer Problem

There's a hardware reality that load balancing alone doesn't solve. To run experts efficiently on a GPU, you need fixed-size buffers — you allocate, ahead of time, room for a specific number of tokens per expert. But routing is dynamic: on any given batch, some experts get more tokens than others, even with balancing. What happens when an expert receives more tokens than its buffer can hold?

Capacity factor and token dropping

Each expert gets a capacity: the maximum number of tokens it will process this batch. It's set by a capacity factor — a multiplier on the “fair share” each expert would get under perfect balance. A capacity factor of 1.0 means each expert can hold exactly its even share; 1.25 gives 25% headroom for the inevitable imbalance.

If more tokens route to an expert than its capacity allows, the overflow tokens are dropped — that expert simply doesn't process them. A dropped token isn't lost entirely: thanks to the residual connection around the MoE layer, it passes through unchanged, as if the layer were skipped for it. But it misses out on the expert computation it was supposed to get. Too much dropping hurts quality.

The capacity-factor tradeoff. Raise the capacity factor and fewer tokens drop — but every expert's buffer is bigger, so you waste compute and memory on empty buffer slots for the experts that didn't fill up. Lower it and you save memory but drop more tokens, hurting quality. The capacity factor is a dial between wasted compute (too high) and dropped tokens (too low). Good load balancing lets you run a lower capacity factor safely, because the loads are already even — another reason balancing matters.

Worked example: computing capacity

A batch has 512 tokens, 8 experts, top-1 routing. Under perfect balance each expert would get 512 / 8 = 64 tokens. With a capacity factor of 1.25, each expert's buffer holds 64 × 1.25 = 80 tokens. Now suppose, despite balancing, one expert is assigned 88 tokens this batch. It processes the first 80 and drops the last 8 — those 8 tokens skip the layer via the residual. Meanwhile an underused expert assigned only 50 tokens leaves 30 buffer slots empty (wasted compute). That's the daily reality of running a MoE: a constant, managed tension between dropping and waste.
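The worked example translates directly into code. This is a sketch under stated assumptions: the function names are hypothetical, rounding the capacity up is a choice made here, and the `k` multiplier for top-k greater than 1 is one common convention rather than a universal rule.

```python
import math


def expert_capacity(n_tokens, n_experts, capacity_factor, k=1):
    """Buffer size per expert: fair share under perfect balance times the factor.

    The k multiplier (each token occupies k slots) is an assumption for
    top-k > 1; conventions vary between implementations.
    """
    fair_share = n_tokens * k / n_experts
    return math.ceil(fair_share * capacity_factor)


def buffer_outcome(assigned, capacity):
    """Return (processed, dropped, empty_slots) for one expert."""
    processed = min(assigned, capacity)
    return processed, assigned - processed, capacity - processed


cap = expert_capacity(n_tokens=512, n_experts=8, capacity_factor=1.25)
print(cap)                        # 80
print(buffer_outcome(88, cap))    # (80, 8, 0)
print(buffer_outcome(50, cap))    # (50, 0, 30)
```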

See it: filling the buffers

Tokens stream into 6 experts, each with a capacity buffer (the outlined box). Watch buffers fill; when one overflows, the extra tokens turn red and drop. Raise the capacity factor to add headroom and reduce drops — but notice the growing empty space (wasted compute) in the under-filled experts.

Expert Capacity: Drops vs. Wasted Compute

Each box is an expert's buffer. Filled = processed tokens, red = dropped overflow, empty = wasted slots. Adjust the capacity factor and re-run.

Common misconception. “A dropped token is a bug / lost data.” It's a deliberate design tradeoff, and the residual connection means the token still flows to the next layer; it just skips this expert's refinement. At inference, systems often raise the capacity factor (or drop less aggressively) since memory pressure differs from training. Some newer routers (“expert choice”) flip the problem around: each expert picks its top tokens, which guarantees perfect balance and no buffer overflow by construction, though a token that no expert picks still passes through on the residual alone.

An expert receives more tokens than its capacity buffer holds. What happens to the overflow, and why isn't it catastrophic?

Training crashes with an out-of-memory error
The overflow tokens are dropped (skip the expert), but the residual connection passes them through unchanged to the next layer, so they're not lost
They're sent to a random other expert

Chapter 5: The Switch Transformer — Just Pick One

Early MoE work assumed you needed at least top-2 routing — send each token to two experts — because comparing two options seemed necessary for the router to get a useful learning signal. In 2021, Google's Switch Transformer made a bold simplification: route each token to exactly one expert. Top-1. And it worked beautifully, scaling MoE to a trillion parameters.

Why top-1 is a big deal

Going from top-2 to top-1 sounds like a minor change. It isn't — it roughly halves several costs at once:

Expert compute per token halves — one FFN runs instead of two.
Communication halves — in distributed training, each token's data is sent to one expert's machine instead of two (more on this in Chapter 8). Communication is often the bottleneck, so this is huge.
Router logic simplifies — no need to combine two expert outputs; the chosen expert's output (scaled by its gating weight) is the layer output.

The insight that made it safe. The worry with top-1 was training instability — with no second expert as a backup, a bad routing decision has no cushion. Switch made it work with careful engineering: a well-tuned load-balancing loss, selective use of higher-precision arithmetic in the router for stability, and a capacity factor with token dropping. The lesson: the “you need top-2” assumption was wrong; top-1 plus good balancing is simpler, cheaper, and scales further. Sometimes the bold simplification beats the careful complication.

The simplified layer

With top-1, the whole layer collapses to something clean: the router picks the single best expert for each token, the token goes through that one expert, and the output is the expert's result times its (single) gating weight. There's no weighted combination of multiple outputs. The gating weight still matters: Switch computes it as the chosen expert's probability under a softmax over all N experts (a softmax over a single selected score would always equal 1 and carry no gradient), and that is what lets gradients flow to the router so it learns which expert to pick. The data path, though, is just “one token, one expert.” Modern models often go back to top-2 or higher (Mixtral uses top-2, DeepSeek top-8 of fine-grained experts), but Switch proved top-1 is viable and established the modern recipe of balancing loss plus capacity factor.
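A minimal sketch of that top-1 data path, assuming the gate is the chosen expert's probability under a softmax over all experts (the Switch Transformer convention). The scalar toy experts and the function names are illustrative only.

```python
import math


def switch_route(scores):
    """Top-1 routing: return (expert_index, gate).

    The gate is the chosen expert's probability under a softmax over ALL
    experts, so it is below 1 and changes when any score changes.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, exps[best] / total


def switch_layer(x, scores, experts):
    """Layer output = gate * (single chosen expert applied to x)."""
    best, gate = switch_route(scores)
    return gate * experts[best](x)


# Toy scalar "experts", purely illustrative.
experts = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: -x]
scores = [0.1, 1.2, -0.4]            # hypothetical router scores
print(switch_route(scores))           # expert 1, with a gate between 0 and 1
print(switch_layer(3.0, scores, experts))
```

Because the gate depends on every score, the loss gradient reaches the whole router even though only one expert ran.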

See it: top-1 vs top-2 cost

Toggle between Switch (top-1) and top-2 routing on the same stream of tokens. Watch the total expert-FFN evaluations and the cross-machine communication count: top-1 roughly halves both. The capability difference is modest; the cost difference is not.

Switch (top-1) vs top-2: Halving the Cost

Same tokens, two routing schemes. Bars show total expert evaluations and communication. Top-1 (Switch) does roughly half the work of top-2.

Common misconception. “More experts per token always means better quality, so top-1 must be much worse.” The quality gap between top-1 and top-2 is real but small, while the cost difference is large. For a fixed compute budget, top-1 with more total experts can beat top-2 with fewer. The best k is an engineering tradeoff, not “higher is better” — and Switch showed top-1 is firmly on the table.

What did the Switch Transformer change, and why does it roughly halve costs?

It doubled the number of experts
It routed each token to a single expert (top-1) instead of two, halving expert compute and cross-machine communication per token
It removed the router entirely

Chapter 6: The Router Simulator — Run a MoE Layer Live

Everything comes together here. This is a MoE layer training live: a stream of tokens routes through 8 experts, batch after batch, while the router learns. You control top-k, the load-balancing loss, and the capacity factor — the three dials that decide whether your expensive sparse model thrives or collapses. Watch all three effects at once: the load distribution, the imbalance, and the token drop rate.

Run these experiments to cement the whole lesson:

Aux loss OFF — watch the rich-get-richer collapse: a few experts swell, drops spike as they overflow capacity, most experts go idle. Your big model becomes a small one.
Aux loss ON — the load evens out, imbalance drops near 1.0, and drops fall to almost zero. This is a healthy MoE.
Lower the capacity factor with aux off — drops explode (overloaded experts have no room). With aux on, you can run a low capacity factor safely.
Raise top-k — more experts per token: better quality signal, but proportionally more compute and communication.

Live MoE Layer: Routing, Balancing & Capacity

Bars = per-batch tokens per expert; the dashed line is capacity (overflow = red drops). Toggle the aux loss, adjust capacity and top-k, and watch imbalance and drop-rate respond in real time.

What to take away. A MoE doesn't just work because you added experts — it works because the router stays balanced and capacity is tuned so drops stay low. Flip the aux loss off and you can watch a billion-dollar model degrade into a fraction of itself in seconds. The three dials here are the same ones MoE engineers obsess over in real trillion-parameter training runs.

Common misconception. “Once trained, routing is solved.” Balance must be maintained throughout training, and the distribution of tokens shifts as the model learns, data changes, or you fine-tune. A model balanced during pretraining can drift toward imbalance during fine-tuning. The dials never fully go away.

No quiz — the simulator is the test. If you can predict what happens to drop rate when you turn off the aux loss and lower capacity, you understand how a MoE layer really runs.

Chapter 7: Modern MoE — Mixtral & DeepSeek

The ideas so far — experts, router, top-k, balancing, capacity — are the foundation. Today's best open MoE models add two refinements that meaningfully improve the basic recipe: fine-grained experts and shared experts. Both come from asking “how do we make each expert's specialization more useful?”

Mixtral: MoE goes mainstream

Mistral's Mixtral 8×7B (2023) was the model that made MoE a practical, open reality. Eight experts per layer, top-2 routing. About 47 billion total parameters, but only ~13 billion active per token, so its per-token compute is that of a ~13B model (though all 47 billion parameters must still sit in memory) while matching or beating much larger dense models. It proved the MoE recipe (top-2, balancing loss, capacity) works at the scale people actually deploy, not just in research labs.

DeepSeek's two refinements

DeepSeek's MoE models pushed the design further with two changes:

Fine-grained experts. Instead of a few big experts, use many smaller ones: split each expert into several thinner ones and route to more of them (e.g. top-8 of 32 quarter-size experts instead of top-2 of 8 big ones). Same total parameters and same active compute, but far more combinations of experts per token, so specialization is more flexible and precise. It's the difference between a few generalists and many narrow specialists you can mix freely.
Shared experts. Designate one or two experts that every token always uses, in addition to its routed experts. These shared experts capture the common knowledge every token needs (general grammar, basic facts), freeing the routed experts to specialize purely on the distinctive parts. It stops every routed expert from having to re-learn the same common basics.

Why shared experts are clever. Without them, every routed expert must independently learn the boring common-knowledge that all tokens need — wasteful redundancy across experts. A shared, always-on expert handles that baseline once, so the routed experts can spend their capacity on what makes tokens different. It's separation of concerns: shared = what everyone needs, routed = what makes you special. DeepSeek-V3 scaled this to 671 billion total parameters with only ~37 billion active.
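As a sketch of how the two kinds of expert combine, the function below adds the always-on shared experts to a gated top-k sum over routed experts. The gating shown is the top-k softmax described earlier in this document, which is an assumption here and not necessarily DeepSeek's exact gating; the scalar toy experts and all names are illustrative.

```python
import math


def shared_plus_routed(x, scores, routed_experts, shared_experts, k):
    """Output = sum of always-on shared experts + gated top-k routed experts."""
    # Shared experts run for every token; no routing decision is involved.
    out = sum(expert(x) for expert in shared_experts)

    # Routed experts: top-k selection, then softmax over the chosen scores.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in chosen]
    total = sum(exps)
    for i, e in zip(chosen, exps):
        out += (e / total) * routed_experts[i](x)
    return out, chosen


shared = [lambda x: 0.5 * x]                                   # one shared expert
routed = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]  # four routed experts
scores = [0.0, 1.0, 0.0, 1.0]                                  # hypothetical router scores
out, chosen = shared_plus_routed(2.0, scores, routed, shared, k=2)
print(chosen)   # [1, 3]
print(out)      # 7.0
```

The shared term (1.0 here) is identical for every token; only the routed term (0.5 × 4.0 + 0.5 × 8.0) depends on the router.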

See it: classic vs. shared+fine-grained

Toggle between the classic Mixtral-style layer (a few big experts, top-2) and the DeepSeek-style layer (one always-on shared expert plus many fine-grained routed experts). Watch how a token's path changes: in the modern design it always passes through the shared expert plus a larger handful of small specialists.

Classic MoE vs. Shared + Fine-Grained

A token's routing under two designs. Classic: top-2 of a few big experts. Modern: a shared always-on expert (gold) plus top-k of many small routed experts (teal). Toggle.

Common misconception. “Fine-grained experts cost more because there are more of them.” Total parameters can stay the same — you're splitting the same capacity into smaller pieces and routing to more of them. Active compute is governed by how many parameters fire per token, not how many experts exist. Many small experts at top-8 can use the same active compute as a few big ones at top-2, while offering vastly more ways to combine specializations.
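The arithmetic behind this is easy to check. In the sketch below, expert size is measured in arbitrary units (1.0 = one big expert), and the split of 8 big experts into 32 quarter-size ones is a hypothetical example chosen so that total and active parameters both stay fixed.

```python
import math


def layer_stats(n_experts, k, expert_size):
    """Return (total_params, active_params, n_combinations) for one MoE layer.

    expert_size is in arbitrary units (1.0 = one "big" expert).
    """
    total = n_experts * expert_size
    active = k * expert_size
    combos = math.comb(n_experts, k)   # distinct expert subsets a token can get
    return total, active, combos


# Hypothetical split: each of 8 big experts becomes 4 quarter-size experts.
print(layer_stats(n_experts=8, k=2, expert_size=1.0))     # (8.0, 2.0, 28)
print(layer_stats(n_experts=32, k=8, expert_size=0.25))   # (8.0, 2.0, 10518300)
```

Same capacity, same per-token cost, but the number of possible expert combinations grows from 28 to over ten million.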

What problem do DeepSeek's “shared experts” (always-on for every token) solve?

They make the router unnecessary
They capture common knowledge every token needs, so the routed experts don't each have to redundantly re-learn the same basics and can specialize more
They double the active parameter count

Chapter 8: Systems — Why MoE Is a Distributed-Computing Problem

MoE is as much a systems innovation as a modeling one. A trillion-parameter model's experts can't fit on one GPU — not even close. So the experts are spread across many GPUs, and that physical distribution creates the defining engineering challenge of MoE: moving tokens to wherever their chosen experts live.

Expert parallelism and all-to-all

The standard layout is expert parallelism: put different experts on different GPUs. With 64 experts across 64 GPUs, each GPU holds one expert. Now consider what routing means physically. A batch of tokens is spread across all those GPUs (each GPU processed some tokens through attention). But the router might send a token sitting on GPU 0 to an expert living on GPU 47. That token's data must be physically shipped across the network to GPU 47, processed, and shipped back.

Since every token might need to go to any GPU, this is an all-to-all communication: every GPU sends some tokens to every other GPU, simultaneously. It happens twice per MoE layer — once to dispatch tokens to their experts, once to gather the results back. All-to-all is one of the most expensive communication patterns in distributed computing, and in a large MoE it often dominates the runtime — the GPUs spend more time shuffling tokens than computing.
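For a feel of the volumes involved, here is a back-of-the-envelope estimate. It is a rough sketch, not a model of any real system: it assumes one expert per GPU, tokens and routing spread uniformly, 2 bytes per value, and it ignores overlap with compute and any network topology effects. The function name and batch sizes are hypothetical.

```python
def all_to_all_bytes(n_tokens, dim, k, n_gpus, bytes_per_value=2):
    """Rough bytes crossing the network for one MoE layer (dispatch + combine).

    Assumptions: one expert per GPU, tokens and routing spread uniformly,
    so a fraction (n_gpus - 1) / n_gpus of token copies leave their GPU.
    """
    copies = n_tokens * k                      # each token is sent to k experts
    remote_fraction = (n_gpus - 1) / n_gpus
    one_way = copies * remote_fraction * dim * bytes_per_value
    return 2 * one_way                         # there and back


top1 = all_to_all_bytes(n_tokens=65_536, dim=4096, k=1, n_gpus=64)
top2 = all_to_all_bytes(n_tokens=65_536, dim=4096, k=2, n_gpus=64)
print(f"top-1: {top1 / 1e9:.2f} GB per layer")   # top-1: 1.06 GB per layer
print(f"top-2: {top2 / 1e9:.2f} GB per layer")   # top-2: 2.11 GB per layer
```

Under these assumptions the traffic scales linearly with k, and it repeats in every MoE layer of the model.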

This reframes everything. Now you see why top-1 routing (Switch) was such a big deal: it halves the all-to-all traffic. Why load balancing matters for systems, not just quality: an overloaded expert's GPU becomes a straggler everyone waits for. Why capacity factors exist: fixed buffers make the all-to-all a predictable, fixed-size transfer. The modeling choices and the systems constraints are inseparable — MoE is co-designed with the hardware.

The payoff: active vs. total, quantified

Despite the communication cost, the economics are compelling. The compute (the matrix multiplies) scales with active parameters, which you hold fixed. The capacity scales with total parameters, which you grow by adding GPUs (and experts). So you can keep making the model more capable by buying more GPUs to hold more experts, without each token getting more expensive to compute. The ceiling becomes communication bandwidth and memory, not per-token FLOPs — a fundamentally better scaling curve than dense models.

See it: the active/total calculator + all-to-all

Set the number of experts and top-k. The calculator shows total vs active parameters and the sparsity ratio. The diagram shows tokens being shipped across GPUs (all-to-all) to reach their experts — the more experts spread across more GPUs, the more cross-GPU traffic. Watch the communication grow with top-k.

Active vs Total Parameters & All-to-All Traffic

Experts live on different GPUs; arrows show tokens shipped to reach them (all-to-all). The readout quantifies the capacity-vs-cost split. Adjust experts and top-k.

Common misconception. “MoE is free capacity, so just keep adding experts.” The compute is cheap, but the communication and memory are not. Every expert must be stored (memory) and reachable (bandwidth). At some point all-to-all traffic and GPU memory, not FLOPs, cap how big you can go. MoE shifts the bottleneck from compute to communication — it doesn't remove the bottleneck, it relocates it to where bandwidth lives.

Why is communication (all-to-all) often the dominant cost in a large distributed MoE?

Because the router is slow to compute
Experts live on different GPUs, so each token must be physically shipped to its expert's GPU and back — and any token can go to any GPU, twice per layer
Because experts use more FLOPs than a dense layer

Chapter 9: Connections & Cheat Sheet

You now understand the whole machine: why dense scaling hits a wall, how a MoE layer swaps one FFN for many experts plus a router, how top-k gating works and learns despite being non-differentiable, why routers collapse and how the balancing loss prevents it, how capacity and token dropping manage fixed buffers, the Switch top-1 simplification, the Mixtral and DeepSeek refinements, and why the whole thing is really a distributed-systems problem. The thread: spend parameters where they're needed, activate only a few per token, and pay constant compute for ever-growing capacity — if you can keep the router balanced and the tokens flowing.

The key terms

Term	What it means
Expert	one of several FFNs; each token uses only a few
Router / gate	tiny linear layer that scores experts per token
Top-k	how many experts each token is sent to (1, 2, 8...)
Total vs active params	capacity/memory vs per-token compute
Aux load-balancing loss	penalty that prevents router collapse
Capacity factor	buffer headroom per expert; controls token dropping
Shared expert	always-on expert for common knowledge (DeepSeek)
Expert parallelism	experts on different GPUs; needs all-to-all comm

The cheat sheet

MoE layer: replace the FFN with N experts + a router; route each token to top-k

Gating: router scores → pick top-k → softmax over those k → weighted sum of their outputs

Why it learns: gradients flow through the (continuous) gating weights, not the (discrete) top-k pick

Collapse: rich-get-richer; the DEFAULT without an aux balancing loss (coeff ~0.01)

Capacity factor: ~1.0–1.25; too low = drops, too high = wasted compute

Sizes: Mixtral 8×7B (47B total, ~13B active); DeepSeek-V3 (671B total, ~37B active)

Systems: experts across GPUs → all-to-all twice per layer → communication is the bottleneck
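The gating line of the cheat sheet can be sketched in a few lines of plain Python for a single token. This is a toy under stated assumptions: the router scores are passed in directly rather than computed by a linear layer, and the "experts" are hypothetical functions that just scale the input.

```python
import math

def top_k_gate(scores, k):
    """Pick the k highest-scoring experts and softmax over just those k.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)               # subtract max for stability
    exps = [math.exp(scores[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def moe_forward(x, router_scores, experts, k):
    """Weighted sum of the chosen experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for idx, w in top_k_gate(router_scores, k):
        y = experts[idx](x)                       # only k experts ever run
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): each scales the input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, -1.0, 2.0]
gates = top_k_gate(scores, k=2)
y = moe_forward([1.0, -1.0], scores, experts, k=2)
print(gates)  # experts 1 and 3, weight 0.5 each
print(y)      # [3.0, -3.0]
```

The weights are ordinary differentiable numbers multiplying the expert outputs, which is where gradient reaches the router; the choice of which two experts run is the discrete part.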

A decision guide

Want more capability at fixed inference cost?

Use MoE — grow total params, keep active params (and top-k) fixed.

↓

Tight on communication bandwidth?

Lower top-k (Switch-style top-1 halves all-to-all traffic compared with top-2).

↓

Experts collapsing in training?

Add/raise the aux balancing loss; check capacity factor and drop rate.

↓

Want more specialization per active FLOP?

Fine-grained experts + a shared always-on expert (DeepSeek style).
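For the "experts collapsing" branch of the guide, it can help to see what the auxiliary term measures. The sketch below assumes the Switch-style form, coeff × N × Σ f_i × P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i; it uses top-1 assignments and toy numbers, and the function name is illustrative.

```python
def load_balancing_loss(router_probs, assignments, num_experts, coeff=0.01):
    """Switch-style auxiliary loss: coeff * N * sum_i f_i * P_i.

    router_probs: per-token softmax over all experts (list of lists).
    assignments:  per-token chosen expert id (top-1 for simplicity).
    f_i = fraction of tokens sent to expert i; P_i = mean router prob for i.
    """
    n = len(assignments)
    f = [sum(1 for a in assignments if a == i) / n for i in range(num_experts)]
    p = [sum(row[i] for row in router_probs) / n for i in range(num_experts)]
    return coeff * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing vs. a collapsed router (toy numbers: 4 tokens, 2 experts).
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
print(balanced, collapsed)
```

The term is smallest when both the token counts and the router probabilities are spread evenly, and it rises as routing concentrates on one expert, so watching it alongside the drop rate gives an early signal of collapse.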

Where this connects

Transformer — MoE replaces the FFN sub-layer inside each transformer block; everything else stays the same.
Attention Variants — attention variants (MQA/GQA/Flash) shrink the attention cost; MoE grows FFN capacity without growing per-token FFN cost. Together they are the two big efficiency levers.
Skip Connections — the residual around the MoE layer is what lets dropped tokens pass through unharmed.
Loss Functions — the auxiliary load-balancing loss is added to the main loss; routing also echoes the gating/softmax ideas there.
GPT & SSM / Mamba — MoE is orthogonal; it can be dropped into transformer or SSM blocks alike.
Infra Scaling & ML Inference — expert parallelism and all-to-all are core to serving MoE at scale.

The one thing to remember. A dense model forces every token to pay for every parameter. A Mixture of Experts breaks that bargain: it holds far more knowledge than any token touches, and a little router decides which slivers each token needs. The price is a finicky router you must keep balanced and a mountain of cross-GPU communication you must keep flowing. Master those, and you get a model that's enormous in capacity but cheap in compute — the dominant recipe for frontier-scale models today.

A team wants a model with 5× the capability of their current dense model but the same per-token inference cost. What's the soundest plan?

Make the dense FFN 5× wider
Add 5× more layers
Convert FFNs to MoE: many experts (high total params) with small top-k (fixed active params), trained with a load-balancing loss and a tuned capacity factor

“The secret to scale isn’t doing more for every input — it’s knowing which small part of a vast mind each input needs to wake.”