Mixture of Experts, token routing, all-to-all communication — distributing specialized sub-networks across GPUs.
So far, every token in our model passes through every parameter. A 70B model means every token touches all 70 billion weights. But do we really need all that compute for every token?
Consider a sentence like "The cat sat on the matrix." The word "cat" probably needs different processing than "matrix." What if we had specialized sub-networks — experts — and routed each token to only the most relevant ones?
This is the Mixture of Experts (MoE) paradigm, used in Mixtral and DeepSeek-V3/R1, and reportedly in GPT-4. The key question for distributed training: how do we parallelize a model where different tokens go to different sub-networks?
In a standard transformer, each layer has one MLP (feedforward) block. In an MoE transformer, that single MLP is replaced by N expert MLP blocks plus a router (also called a gate).
| Component | Dense Model | MoE Model |
|---|---|---|
| Attention | Same | Same (unchanged) |
| MLP | 1 FFN block per layer | N expert FFN blocks per layer |
| Router | None | Small network that assigns tokens to experts |
| Active params/token | All parameters | Only top-k experts' parameters |
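The router is typically a simple linear layer followed by softmax. For each token, it produces a probability distribution over all N experts and selects the top-k (commonly k=1 or k=2, though some models use more; DeepSeek-V3 activates 8 routed experts per token). Only those k experts process the token, and their outputs are combined using the router's weights.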
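The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names, the toy weights, and the choice to renormalize the gate weights over the selected experts are all assumptions made for the example.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(hidden, router_weights, k=2):
    """Return [(expert_id, gate_weight), ...] for one token.

    hidden:         list of floats, length d_model
    router_weights: one row of d_model floats per expert (N rows)
    """
    # Linear layer: one logit per expert.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)
    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    # Assumption: renormalize so the k gate weights sum to 1.
    norm = sum(probs[e] for e in top)
    return [(e, probs[e] / norm) for e in top]

# Toy example: 4 experts, d_model = 3 (made-up numbers).
router_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
token = [0.5, 2.0, -1.0]
chosen = route_token(token, router_weights, k=2)
print(chosen)  # experts 1 and 3 are selected for this token
```

The token's final MLP output would then be the gate-weighted sum of the two selected experts' outputs; the other experts never see this token.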
The attention layers are unchanged — MoE only replaces the MLP blocks. This means all our previous parallelism strategies (TP for attention, CP for long sequences) still apply to the non-expert parts of the model.
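To make the gap between total and active parameters concrete, here is a rough counting sketch. All sizes are hypothetical, and the count ignores embeddings and normalization layers; it only illustrates that attention is shared by every token while each token uses just k of the N experts.

```python
def moe_param_counts(attn_params, expert_params, router_params,
                     n_experts, top_k, n_layers):
    """Return (total_params, active_params_per_token) for a stack of MoE layers."""
    # Every layer stores attention, the router, and all N experts.
    total_per_layer = attn_params + router_params + n_experts * expert_params
    # A single token only runs through attention, the router, and k experts.
    active_per_layer = attn_params + router_params + top_k * expert_params
    return n_layers * total_per_layer, n_layers * active_per_layer

# Hypothetical layer sizes, chosen only for illustration.
total, active = moe_param_counts(
    attn_params=50_000_000,
    expert_params=150_000_000,
    router_params=40_000,
    n_experts=8,
    top_k=2,
    n_layers=32,
)
print(f"total:  {total / 1e9:.1f}B")   # total:  40.0B
print(f"active: {active / 1e9:.1f}B")  # active: 11.2B
```

In this made-up configuration the model stores about 40B parameters but each token only exercises about 11B of them.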
The router determines which expert(s) process each token. This introduces a unique communication pattern: all-to-all.
Unlike TP's all-reduce (every GPU contributes a tensor of the same shape and receives the same reduced result) or PP's point-to-point (one stage sends activations to the next), the all-to-all pattern is data-dependent: which tokens go where, and how many, depends on what the router decides at runtime.
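The dispatch half of this pattern can be simulated without any GPUs. The sketch below uses top-1 routing, string token names, and a contiguous expert-to-rank mapping, all of which are simplifying assumptions; a real system would exchange tensors through a collective library instead of Python lists.

```python
def all_to_all(send_buffers):
    """Simulated all-to-all exchange.

    send_buffers[src][dst] is the list that rank `src` sends to rank `dst`.
    Returns recv where recv[dst][src] is what rank `dst` received from `src`.
    """
    world = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(world)]
            for dst in range(world)]

def dispatch(tokens_per_rank, expert_of_token, experts_per_rank):
    """Group each rank's tokens by destination rank, then exchange them."""
    world = len(tokens_per_rank)
    send = [[[] for _ in range(world)] for _ in range(world)]
    for src, tokens in enumerate(tokens_per_rank):
        for tok in tokens:
            expert = expert_of_token[tok]
            # Assumption: experts are placed on ranks in contiguous blocks.
            dst = expert // experts_per_rank
            send[src][dst].append(tok)
    return all_to_all(send)

# Two ranks, four experts (two per rank), made-up top-1 routing decisions.
tokens_per_rank = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]
expert_of_token = {"a0": 0, "a1": 3, "a2": 2, "b0": 1, "b1": 1, "b2": 3}
received = dispatch(tokens_per_rank, expert_of_token, experts_per_rank=2)
print(received[0])  # [['a0'], ['b0', 'b1']]
print(received[1])  # [['a1', 'a2'], ['b2']]
```

Note that the buffer sizes differ per pair of ranks and change with every batch, which is what distinguishes this from the fixed-shape collectives used by TP and PP. After the experts run, a second all-to-all in the opposite direction returns each output to the rank that owns the token.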
DeepSeek-V3 adds an additional constraint: each token can be routed to at most M nodes (4 in their case). This keeps tokens local to a small group of machines, reducing inter-node communication overhead.
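The effect of such a node limit can be illustrated with a simplified sketch. The node-scoring rule used here (rank each node by its single best expert score) is an assumption chosen for brevity and is not claimed to be DeepSeek-V3's exact rule; the function name and the scores are also hypothetical.

```python
def node_limited_top_k(scores, experts_per_node, k, max_nodes):
    """Pick k experts for one token while touching at most `max_nodes` nodes.

    scores: router score per expert; expert e lives on node e // experts_per_node.
    """
    n_nodes = len(scores) // experts_per_node

    def node_score(node):
        # Assumption: a node is scored by its best expert for this token.
        start = node * experts_per_node
        return max(scores[start:start + experts_per_node])

    # Step 1: choose the nodes this token is allowed to reach.
    allowed = set(sorted(range(n_nodes), key=node_score, reverse=True)[:max_nodes])
    # Step 2: ordinary top-k, restricted to experts on the allowed nodes.
    candidates = [e for e in range(len(scores))
                  if e // experts_per_node in allowed]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:k]

# 3 nodes x 2 experts, made-up scores for a single token.
scores = [0.30, 0.05, 0.25, 0.02, 0.20, 0.18]
picked = node_limited_top_k(scores, experts_per_node=2, k=3, max_nodes=2)
print(picked)  # [0, 2, 1]
```

Unrestricted top-3 routing would have picked experts 0, 2, and 4, spreading the token over three nodes. With the limit, expert 4 is replaced by the lower-scoring expert 1 so that the token stays on two nodes.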
Expert parallelism (EP) is straightforward in concept: put each expert on a different GPU. Since experts are fully independent feedforward networks, there is no need to split a matrix multiplication like TP does — we simply route tokens to the right GPU.
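As a small illustration of "each expert on a different GPU", the mapping from expert to rank can be written as follows. Contiguous block placement is an assumption made for this sketch (other layouts are possible), and `place_experts` and `owner_rank` are hypothetical helper names.

```python
def place_experts(n_experts, ep_size):
    """Assign experts to EP ranks in contiguous blocks."""
    if n_experts % ep_size != 0:
        raise ValueError("n_experts must be divisible by ep_size")
    per_rank = n_experts // ep_size
    return {rank: list(range(rank * per_rank, (rank + 1) * per_rank))
            for rank in range(ep_size)}

def owner_rank(expert_id, n_experts, ep_size):
    """Rank that hosts `expert_id` under the block placement above."""
    return expert_id // (n_experts // ep_size)

placement = place_experts(n_experts=8, ep_size=4)
print(placement)            # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
print(owner_rank(5, 8, 4))  # 2
```

Each expert's weights live whole on one rank, so no matrix multiplication is split; the only cross-GPU work is moving tokens to the rank returned by `owner_rank`.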
| Parallelism | What is split | Communication pattern |
|---|---|---|
| TP | Weight matrices (hidden dim) | All-reduce within each layer |
| PP | Layers (depth) | Point-to-point between stages |
| EP | Expert sub-networks | All-to-all for token routing |
However, EP only affects the MoE layers. The attention blocks, LayerNorm, and embedding layers are untouched by EP. If we only used EP, these non-expert components would be fully replicated across all GPUs — redundant computation!
This is why EP is almost always combined with other parallelism strategies. The non-expert blocks need their own parallelism (TP, DP, or PP) to avoid redundant work.
The most natural combination is EP + DP. Data parallelism handles the non-expert blocks (attention, LayerNorm) by splitting the batch, while expert parallelism distributes the expert blocks across GPUs.
Think of it this way: during the attention phase, GPUs act as DP replicas (each processing different data through the same attention weights). During the MoE phase, GPUs act as expert hosts (each holding different experts, receiving tokens via all-to-all).
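A rough per-GPU parameter count shows what each GPU holds under EP + DP. The sizes are hypothetical and the helper is only a sketch: non-expert weights are replicated on every GPU (as in plain DP), while the experts are divided among the EP ranks.

```python
def params_per_gpu(non_expert_params, expert_params, n_experts, ep_size):
    """Parameters stored on one GPU under EP + DP.

    Non-expert weights (attention, norms) are replicated on every GPU;
    each GPU hosts n_experts / ep_size of the experts.
    """
    if n_experts % ep_size != 0:
        raise ValueError("n_experts must be divisible by ep_size")
    experts_here = n_experts // ep_size
    return non_expert_params + experts_here * expert_params

# Hypothetical model: 2B non-expert parameters, 8 experts of 1B each.
no_ep = params_per_gpu(2_000_000_000, 1_000_000_000, n_experts=8, ep_size=1)
with_ep = params_per_gpu(2_000_000_000, 1_000_000_000, n_experts=8, ep_size=8)
print(no_ep)    # 10000000000
print(with_ep)  # 3000000000
```

The replicated 2B non-expert parameters are the part that EP alone does not shrink, which is why they are handled by DP here and by TP or PP in larger setups.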
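In practice, large MoE training runs can combine up to five parallelism dimensions. Not every model uses all of them: DeepSeek-V3, for example, reports training with PP, EP, and ZeRO-1 DP, and without TP.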
| Dimension | What it parallelizes | Applied to |
|---|---|---|
| DP + ZeRO-1 | Batch + optimizer states | Entire model |
| TP | Hidden dimension | Attention + non-expert MLP |
| PP | Layers | Entire model depth |
| CP | Sequence length | Attention (for long sequences) |
| EP | Expert sub-networks | MoE layers only |
How evenly tokens are distributed across experts matters. Imbalance means wasted GPU time: GPUs hosting lightly loaded experts sit idle while the busiest expert finishes its tokens.
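One simple way to quantify this is the ratio between the busiest expert's load and the average load. The metric and the helper name below are illustrative choices, not a standard from any specific framework, and the routing assignments are made up.

```python
from collections import Counter

def load_stats(assignments, n_experts):
    """Return (tokens per expert, max load / mean load).

    assignments: one expert id per routed token (top-1 routing for simplicity).
    A ratio of 1.0 means perfectly balanced experts.
    """
    counts = Counter(assignments)
    loads = [counts.get(e, 0) for e in range(n_experts)]
    mean = len(assignments) / n_experts
    return loads, max(loads) / mean

# Eight tokens routed to four experts.
loads, imbalance = load_stats([0, 0, 0, 0, 1, 2, 2, 3], n_experts=4)
print(loads)      # [4, 1, 2, 1]
print(imbalance)  # 2.0
```

If each expert sits on its own GPU, an imbalance of 2.0 means the busiest GPU handles twice the average number of tokens, and the others wait for it before the combine step can finish.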
| Aspect | Expert Parallelism |
|---|---|
| What it shards | Expert sub-networks across GPUs |
| Communication | All-to-all for token routing + combine |
| Scope | Only MoE layers (attention is unchanged) |
| Must combine with | DP (for non-expert blocks), often also TP and PP |
| Key challenge | Load balancing — routing tokens evenly across experts |