Tazi et al., Chapter 7

Expert Parallelism

Mixture of Experts, token routing, all-to-all communication — distributing specialized sub-networks across GPUs.

Prerequisites: Chapter 6 (Pipeline Parallelism). Understanding of MLP layers and data parallelism.
7 chapters, 2 simulations, 7 quizzes

Chapter 0: Why Experts?

So far, every token in our model passes through every parameter. A 70B model means every token touches all 70 billion weights. But do we really need all that compute for every token?

Consider a sentence like "The cat sat on the matrix." The word "cat" probably needs different processing than "matrix." What if we had specialized sub-networks — experts — and routed each token to only the most relevant ones?

The MoE promise: Scale model capacity (total parameters) without scaling compute per token. A model with 256 experts and 8B active parameters might have hundreds of billions of total parameters, but each token only activates a small fraction of them.
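The gap between total and active parameters can be made concrete with a little arithmetic. The sketch below is a simplified model, assuming every MoE layer has the same number of equally sized experts; the function name `moe_param_counts` and the configuration values are hypothetical and do not describe any real model.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k, num_moe_layers):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared_params     : parameters every token uses (attention, embeddings, routers)
    params_per_expert : size of one expert FFN in one layer
    """
    # Every expert exists in memory, so all of them count toward the total.
    total = shared_params + num_moe_layers * num_experts * params_per_expert
    # Each token only runs through top_k experts per layer.
    active = shared_params + num_moe_layers * top_k * params_per_expert
    return total, active


# Hypothetical configuration chosen for round numbers.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=50_000_000,
    num_experts=64,
    top_k=4,
    num_moe_layers=30,
)
print(f"total = {total / 1e9:.1f}B, active = {active / 1e9:.1f}B")
# total = 98.0B, active = 8.0B
```

In this toy configuration the model stores 98B parameters but each token only computes with 8B of them.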

This is the Mixture of Experts (MoE) paradigm, used in Mixtral and DeepSeek-V3/R1 and reportedly in GPT-4 (OpenAI has not published its architecture). The key question for distributed training: how do we parallelize a model where different tokens go to different sub-networks?

Check: What is the core advantage of MoE over dense models?

Chapter 1: MoE Architecture

In a standard transformer, each layer has one MLP (feedforward) block. In an MoE transformer, that single MLP is replaced by N expert MLP blocks plus a router (also called a gate).

| Component | Dense Model | MoE Model |
|---|---|---|
| Attention | Same | Same (unchanged) |
| MLP | 1 FFN block per layer | N expert FFN blocks per layer |
| Router | None | Small network that assigns tokens to experts |
| Active params/token | All parameters | Only top-k experts' parameters |

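The router is typically a simple linear layer followed by softmax. For each token, it produces a probability distribution over all N experts and selects the k highest-scoring ones (top-k). Early MoE models used k=1 or k=2; some recent models use larger values such as k=8. Only those k experts process the token, and their outputs are combined using the router scores as weights.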
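A minimal sketch of such a router in plain Python is shown here. It assumes a bias-free linear layer and renormalizes the selected probabilities so the gate weights sum to 1; that renormalization is one common choice, not the only one. The names `route_token` and `router_weights` are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route_token(hidden, router_weights, top_k):
    """Pick top_k experts for one token.

    hidden         : list of d floats (the token's hidden state)
    router_weights : N rows of d floats, one row per expert
    Returns {expert_id: gate_weight} with gate weights summing to 1.
    """
    # Linear layer: one logit per expert.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)
    # Keep the k experts with the highest probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize over the chosen experts (an assumption of this sketch).
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


# Toy example: 4 experts, hidden size 2. The logits come out as [2, 1, 0, -1].
hidden = [1.0, 0.0]
router_weights = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
gates = route_token(hidden, router_weights, top_k=2)
print(gates)  # experts 0 and 1 are selected, with expert 0 weighted more heavily
```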

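Key numbers: DeepSeek-V3 has 256 routed experts per MoE layer with top-8 routing (671B total parameters, about 37B active per token). Mixtral 8x7B has 8 experts with top-2 routing (about 47B total, about 13B active). The total parameter count is much larger than the active count per token, which gives MoE models their efficiency advantage.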

The attention layers are unchanged — MoE only replaces the MLP blocks. This means all our previous parallelism strategies (TP for attention, CP for long sequences) still apply to the non-expert parts of the model.

Check: Which component does MoE replace in a standard transformer layer?

Chapter 2: Token Routing

The router determines which expert(s) process each token. This introduces a unique communication pattern: all-to-all.

1. Router decides: for each token, compute router scores and select the top-k experts.
2. All-to-all dispatch: send each token's hidden state to the GPU hosting its assigned expert.
3. Expert compute: each expert processes its assigned tokens independently.
4. All-to-all combine: send processed tokens back to their original GPUs.
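The four steps can be traced in a single-process simulation. This is a sketch under simplifying assumptions: top-1 routing, router decisions given as input, scalar "tokens", and plain Python lists standing in for GPUs and for the real all-to-all collective. The name `moe_layer_ep` and the toy experts are hypothetical.

```python
def moe_layer_ep(tokens_per_gpu, assignments_per_gpu, expert_to_gpu, expert_fns):
    """Simulate dispatch, expert compute, and combine for top-1 routing.

    tokens_per_gpu[g]      : token values held by GPU g
    assignments_per_gpu[g] : expert id the router chose for each of those tokens (step 1)
    expert_to_gpu[e]       : GPU hosting expert e
    expert_fns[e]          : callable standing in for expert e's FFN
    """
    num_gpus = len(tokens_per_gpu)

    # Step 2: all-to-all dispatch. Each destination GPU collects
    # (source GPU, position, expert, token) for the tokens routed to it.
    inbox = [[] for _ in range(num_gpus)]
    for src, (tokens, experts) in enumerate(zip(tokens_per_gpu, assignments_per_gpu)):
        for pos, (tok, e) in enumerate(zip(tokens, experts)):
            inbox[expert_to_gpu[e]].append((src, pos, e, tok))

    # Step 3: expert compute on the hosting GPU.
    # Step 4: all-to-all combine, writing each result back to its original slot.
    outputs = [[None] * len(tokens) for tokens in tokens_per_gpu]
    for dst in range(num_gpus):
        for src, pos, e, tok in inbox[dst]:
            outputs[src][pos] = expert_fns[e](tok)

    tokens_received = [len(box) for box in inbox]
    return outputs, tokens_received


# Toy setup: 2 GPUs, 4 experts (experts 0-1 on GPU 0, experts 2-3 on GPU 1).
expert_to_gpu = [0, 0, 1, 1]
expert_fns = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x * x]
tokens_per_gpu = [[1, 2, 3], [4, 5]]
assignments_per_gpu = [[2, 0, 3], [1, 2]]

outputs, tokens_received = moe_layer_ep(
    tokens_per_gpu, assignments_per_gpu, expert_to_gpu, expert_fns
)
print(outputs)          # [[-1, 3, 9], [8, -5]]
print(tokens_received)  # [2, 3]
```

Note that GPU 1 receives three tokens and GPU 0 only two: the amount of data each GPU sends and receives is decided by the router, not fixed in advance.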

Unlike TP's all-reduce (every GPU contributes a tensor of the same shape and receives the same reduced result) or PP's point-to-point (one GPU sends to the next), the all-to-all pattern is data-dependent: which tokens go where, and how many, depends on what the router decides at runtime.

Load balancing is critical: If the router sends most tokens to a few experts, those GPUs are overloaded while others idle. MoE models use auxiliary loss functions and routing constraints to encourage balanced token distribution across experts.
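One widely used form of auxiliary loss comes from the Switch Transformer: for N experts, it multiplies the fraction of tokens dispatched to each expert by the mean router probability for that expert, sums over experts, and scales by N and a small coefficient. The sketch below assumes top-1 routing; it illustrates that particular formulation and is not necessarily the loss used by any model named in this chapter. The coefficient `alpha=0.01` is an illustrative default.

```python
def load_balancing_loss(router_probs, assignments, num_experts, alpha=0.01):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_probs[t][e] : router probability of expert e for token t
    assignments[t]     : expert actually chosen for token t
    """
    num_tokens = len(assignments)
    # f[e]: fraction of tokens dispatched to expert e.
    f = [0.0] * num_experts
    for e in assignments:
        f[e] += 1.0 / num_tokens
    # p[e]: mean router probability assigned to expert e.
    p = [sum(row[e] for row in router_probs) / num_tokens for e in range(num_experts)]
    # Minimized (value alpha) when both f and p are uniform.
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced routing over 2 experts.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
# Collapsed routing: every token goes to expert 0.
skewed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
print(balanced, skewed)  # the skewed case yields the larger penalty
```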

DeepSeek-V3 adds an additional constraint: each token can be routed to at most M nodes (4 in their case). This keeps tokens local to a small group of machines, reducing inter-node communication overhead.
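A node limit of this kind can be sketched as a two-stage selection: first keep the M most promising nodes, then take the top-k experts among the experts hosted on those nodes. The node-scoring rule used here (sum of each node's best k/M expert scores) and the contiguous expert-to-node layout are assumptions of this sketch, and `node_limited_topk` is a hypothetical name.

```python
def node_limited_topk(scores, experts_per_node, top_k, max_nodes):
    """Select top_k experts for one token while touching at most max_nodes nodes.

    scores[e] : router score of expert e; experts are laid out contiguously,
                so node n hosts experts [n * experts_per_node, (n + 1) * experts_per_node).
    Returns (sorted chosen expert ids, sorted kept node ids).
    """
    num_nodes = len(scores) // experts_per_node
    per_node = max(1, top_k // max_nodes)

    def node_score(n):
        # Assumed rule: sum of the node's highest `per_node` expert scores.
        local = sorted(scores[n * experts_per_node:(n + 1) * experts_per_node], reverse=True)
        return sum(local[:per_node])

    # Stage 1: keep the max_nodes highest-scoring nodes.
    kept_nodes = sorted(range(num_nodes), key=node_score, reverse=True)[:max_nodes]
    # Stage 2: ordinary top-k, restricted to experts on the kept nodes.
    candidates = [
        e
        for n in kept_nodes
        for e in range(n * experts_per_node, (n + 1) * experts_per_node)
    ]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
    return sorted(chosen), sorted(kept_nodes)


# Toy example: 4 nodes with 2 experts each, top-4 routing, at most 2 nodes.
scores = [0.9, 0.3, 0.8, 0.1, 0.7, 0.6, 0.05, 0.2]
chosen, nodes = node_limited_topk(scores, experts_per_node=2, top_k=4, max_nodes=2)
print(chosen, nodes)  # [0, 1, 4, 5] [0, 2]
```

Unconstrained top-4 would pick experts 0, 2, 4, and 5, which live on three nodes. The constrained version gives up expert 2 in exchange for touching only two nodes.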

Check: What type of communication does EP use to move tokens to their assigned experts?

Chapter 3: Expert Parallelism

Expert parallelism (EP) is straightforward in concept: place different experts on different GPUs, one or several per GPU. Since experts are fully independent feedforward networks, there is no need to split a matrix multiplication as TP does; we simply route tokens to the right GPU.

| Parallelism | What is split | Communication pattern |
|---|---|---|
| TP | Weight matrices (hidden dim) | All-reduce within each layer |
| PP | Layers (depth) | Point-to-point between stages |
| EP | Expert sub-networks | All-to-all for token routing |

Key insight: EP is much more lightweight than TP. TP must split matrix multiplications and synchronize partial results. EP just needs to move complete token hidden states to the right place; the computation itself is embarrassingly parallel once tokens arrive at their expert.

However, EP only affects the MoE layers. The attention blocks, LayerNorm, and embedding layers are untouched by EP. If we only used EP, these non-expert components would be fully replicated across all GPUs — redundant computation!

This is why EP is almost always combined with other parallelism strategies. The non-expert blocks need their own parallelism (TP, DP, or PP) to avoid redundant work.
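A rough per-GPU parameter count shows how much of the model EP leaves replicated. The sketch assumes experts divide evenly across GPUs and that nothing but EP is applied; `params_per_gpu_ep_only` and the sizes used are hypothetical.

```python
def params_per_gpu_ep_only(non_expert_params, params_per_expert, num_experts, num_gpus):
    """Per-GPU parameter footprint when EP is the only parallelism used.

    params_per_expert is the size of one expert summed over all MoE layers.
    Returns (params held by each GPU, fraction of that which is a replicated copy).
    """
    assert num_experts % num_gpus == 0, "sketch assumes an even split of experts"
    experts_per_gpu = num_experts // num_gpus
    sharded = experts_per_gpu * params_per_expert  # different on every GPU
    replicated = non_expert_params                 # identical copy on every GPU
    per_gpu = sharded + replicated
    return per_gpu, replicated / per_gpu


# Hypothetical sizes: 2B non-expert parameters, 8 experts of 1B each, 4 GPUs.
per_gpu, replicated_fraction = params_per_gpu_ep_only(
    non_expert_params=2_000_000_000,
    params_per_expert=1_000_000_000,
    num_experts=8,
    num_gpus=4,
)
print(per_gpu, replicated_fraction)  # 4000000000 0.5
```

Here half of every GPU's parameters are an identical copy of the non-expert blocks, which is the part that DP, TP, or PP must put to use.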

Check: Why must EP be combined with other parallelism strategies?

Chapter 4: EP + DP Combined

The most natural combination is EP + DP. Data parallelism handles the non-expert blocks (attention, LayerNorm) by splitting the batch, while expert parallelism distributes the expert blocks across GPUs.

Think of it this way: during the attention phase, GPUs act as DP replicas (each processing different data through the same attention weights). During the MoE phase, GPUs act as expert hosts (each holding different experts, receiving tokens via all-to-all).
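The two roles correspond to different communication groups. The sketch below lays out rank groups for one common arrangement, assuming consecutive ranks form an EP group; the function name and the layout are illustrative, and real frameworks may order ranks differently.

```python
def build_groups(world_size, ep_size):
    """Return (ep_groups, expert_dp_groups, dense_dp_group) as lists of ranks.

    ep_groups        : ranks that exchange tokens via all-to-all; together they
                       hold one full set of experts
    expert_dp_groups : ranks that hold the SAME experts and therefore
                       all-reduce expert gradients with each other
    dense_dp_group   : ranks that all-reduce attention/LayerNorm gradients
    """
    assert world_size % ep_size == 0
    ep_groups = [list(range(start, start + ep_size)) for start in range(0, world_size, ep_size)]
    expert_dp_groups = [list(range(r, world_size, ep_size)) for r in range(ep_size)]
    dense_dp_group = list(range(world_size))
    return ep_groups, expert_dp_groups, dense_dp_group


ep_groups, expert_dp_groups, dense_dp_group = build_groups(world_size=8, ep_size=4)
print(ep_groups)         # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(expert_dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(dense_dp_group)    # [0, 1, 2, 3, 4, 5, 6, 7]
```

With 8 GPUs and an EP size of 4, the non-expert weights are synchronized across all 8 ranks, while each expert's gradients only need to be synchronized between the 2 ranks that hold a copy of it.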

Similarity to DP: Some implementations consider EP a subset of DP. Both process different input data on each GPU. The key difference is that DP uses identical model copies, while EP uses specialized expert sub-networks with dynamic routing.

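In practice, large MoE training setups draw on up to five parallelism dimensions. Not every run uses all of them: DeepSeek-V3, for example, reports training with PP, EP, and ZeRO-1 DP while avoiding TP.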

| Dimension | What it parallelizes | Applied to |
|---|---|---|
| DP + ZeRO-1 | Batch + optimizer states | Entire model |
| TP | Hidden dimension | Attention + non-expert MLP |
| PP | Layers | Entire model depth |
| CP | Sequence length | Attention (for long sequences) |
| EP | Expert sub-networks | MoE layers only |

Communication locality: DeepSeek-V3 constrains each token to at most 4 nodes during expert routing. This keeps the all-to-all communication local, dramatically reducing inter-node traffic. Clever routing constraints are just as important as clever parallelism strategies.

Check: When combining EP + DP, what role does each GPU play during the attention phase vs. the MoE phase?

Chapter 5: MoE Routing Simulator

Watch how tokens are routed to experts across GPUs. Each token is assigned to top-k experts by the router. Click "Route" to see a new random routing assignment.

[Interactive simulation: MoE Token Routing. Tokens on the left are routed to experts on GPUs on the right; color indicates expert assignment.]

[Interactive simulation: Expert Load Balance. Shows how evenly tokens are distributed across experts; imbalance means wasted GPU time.]
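The same experiment can be approximated offline. The sketch below assigns each token to top-k distinct experts uniformly at random, which stands in for a learned router (an assumption: real routers are not uniform), then reports per-expert load, per-GPU load, and an imbalance ratio. All names and sizes are illustrative.

```python
import random
from collections import Counter


def simulate_routing(num_tokens, num_experts, top_k, num_gpus, seed=0):
    """Randomly route tokens and measure load balance.

    Returns (tokens per expert, tokens per GPU, max GPU load / mean GPU load).
    An imbalance of 1.0 means perfectly even load.
    """
    rng = random.Random(seed)
    experts_per_gpu = num_experts // num_gpus
    expert_load = Counter()
    for _ in range(num_tokens):
        # Each token picks top_k distinct experts (uniform random stand-in for a router).
        for e in rng.sample(range(num_experts), top_k):
            expert_load[e] += 1
    loads = [expert_load[e] for e in range(num_experts)]
    # Experts are placed contiguously: GPU g hosts experts_per_gpu consecutive experts.
    gpu_loads = [
        sum(loads[g * experts_per_gpu:(g + 1) * experts_per_gpu])
        for g in range(num_gpus)
    ]
    mean_load = sum(gpu_loads) / num_gpus
    return loads, gpu_loads, max(gpu_loads) / mean_load


loads, gpu_loads, imbalance = simulate_routing(
    num_tokens=256, num_experts=8, top_k=2, num_gpus=4
)
for g, load in enumerate(gpu_loads):
    print(f"GPU {g}: {'#' * (load // 8)} {load}")
print(f"imbalance (max / mean): {imbalance:.2f}")
```

Changing `seed` plays the role of clicking "Route" again. The GPU with the highest load determines how long the expert-compute step takes, so the imbalance ratio is a rough measure of wasted time on the other GPUs.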

Check: Why is load balancing important in MoE models?

Chapter 6: Summary

| Aspect | Expert Parallelism |
|---|---|
| What it shards | Expert sub-networks across GPUs |
| Communication | All-to-all for token routing + combine |
| Scope | Only MoE layers (attention is unchanged) |
| Must combine with | DP (for non-expert blocks), often also TP and PP |
| Key challenge | Load balancing: routing tokens evenly across experts |

What comes next: We have now seen all five parallelism dimensions: DP, TP, CP, PP, and EP. Chapter 8 brings them together by comparing, combining, and understanding their interactions as "5D Parallelism."

EP vs. TP: TP splits individual weight matrices and requires high-bandwidth communication on the critical path. EP keeps expert computations fully independent and only needs all-to-all routing. This makes EP more lightweight, but it requires an MoE model architecture.

Check: What is the fundamental communication pattern of EP?