Tazi et al., Chapter 8

5D Parallelism in a Nutshell

All five dimensions combined — DP, TP, CP, PP, EP — when to use each, how they interact, and what each one costs.

Prerequisites: Chapters 3–7. All five parallelism strategies.
8 chapters · 2 simulations · 8 quizzes

Chapter 0: The Full Picture

Congratulations — you have now seen all five parallelism strategies for scaling model training:

| # | Strategy | Parallel dimension |
|---|----------|--------------------|
| 1 | Data parallelism (DP) | Batch dimension |
| 2 | Tensor parallelism (TP) | Hidden dimension |
| 3 | Sequence / context parallelism (SP/CP) | Sequence dimension |
| 4 | Pipeline parallelism (PP) | Model layers (depth) |
| 5 | Expert parallelism (EP) | Expert sub-networks |

Plus the three ZeRO strategies that reduce memory within DP:

| ZeRO stage | What is sharded |
|------------|-----------------|
| ZeRO-1 | Optimizer states |
| ZeRO-2 | Optimizer states + gradients |
| ZeRO-3 | Optimizer states + gradients + parameters |
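To make the table concrete, here is a minimal sketch of the per-GPU memory each stage leaves for parameters, gradients, and optimizer states. It assumes mixed-precision training with Adam (2 bytes per parameter for bf16 weights, 2 for bf16 gradients, 12 for the fp32 master weights plus the two Adam moments) and ignores activations and temporary buffers. The function name and byte counts are illustrative assumptions, not figures taken from this chapter.

```python
def zero_memory_bytes(num_params: int, dp: int, stage: int) -> float:
    """Approximate per-GPU bytes for params + grads + optimizer states.

    Assumed layout per parameter: bf16 weight (2 B), bf16 gradient (2 B),
    fp32 Adam state (4 B master weight + 4 B momentum + 4 B variance = 12 B).
    Activations are not included.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    if dp < 1:
        raise ValueError("dp must be >= 1")
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        optim /= dp   # ZeRO-1: shard optimizer states across DP ranks
    if stage >= 2:
        grads /= dp   # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= dp  # ZeRO-3: also shard parameters
    return params + grads + optim


if __name__ == "__main__":
    n, dp = 8_000_000_000, 64  # hypothetical 8B-parameter model on 64 DP ranks
    for stage in range(4):
        gib = zero_memory_bytes(n, dp, stage) / 2**30
        print(f"ZeRO-{stage}: {gib:.1f} GiB per GPU")
```

Stage 0 here stands for plain DP with no sharding, included only as a baseline for comparison.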

The question now: how do these strategies compare and combine? Which ones complement each other, and which ones are redundant?

No silver bullet: Each strategy has strengths and limitations. The art of distributed training is finding the right combination for your model size, sequence length, number of GPUs, and network topology.
Check: How many parallelism dimensions have we covered?

Chapter 1: PP vs. ZeRO-3

PP and ZeRO-3 solve the same fundamental problem — distributing model weights across GPUs — but they take opposite approaches.

| Aspect | ZeRO-3 | Pipeline Parallelism |
|--------|--------|----------------------|
| Each GPU stores | Only a fraction of each layer | Full layers (but fewer of them) |
| Communication transfers | Weights (all-gather before compute) | Activations (between pipeline stages) |
| Orchestration | Model-agnostic | Model-agnostic |
| Scaling preference | Prefers a large batch to hide comms | Prefers many micro-batches (large M) to hide the bubble |
| Implementation | Complex model partitioning | Complex schedule design |

Key insight: ZeRO-3 communicates weights, PP communicates activations. They are rarely combined in practice because doing so requires a very large global batch size to amortize both types of communication overhead.
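The "large M" preference in the comparison can be quantified with a small sketch. It assumes a simple non-interleaved schedule (all-forward-all-backward or 1F1B), for which the idle time relative to useful compute is commonly estimated as (p − 1) / M, with p pipeline stages and M micro-batches. That formula is a standard estimate brought in for illustration, not something this section states.

```python
def bubble_ratio(pp_stages: int, micro_batches: int) -> float:
    """Idle time divided by useful compute time.

    Assumes a non-interleaved pipeline schedule; interleaved schedules
    shrink the bubble further and are not modeled here.
    """
    if pp_stages < 1 or micro_batches < 1:
        raise ValueError("pp_stages and micro_batches must be >= 1")
    return (pp_stages - 1) / micro_batches


if __name__ == "__main__":
    for m in (4, 8, 32, 128):
        print(f"p=8, M={m}: bubble = {bubble_ratio(8, m):.2%} of compute time")
```

Raising M at a fixed micro-batch size also raises the global batch size, which is one reason stacking PP on top of ZeRO-3 pushes the batch size up so quickly.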

However, ZeRO-1 and ZeRO-2 (which only shard optimizer states and gradients, not parameters) combine easily with PP. DeepSeek-V3's training used PP + ZeRO-1.

If you do combine PP + ZeRO-3, configure ZeRO-3 to keep weights in memory during the series of PP micro-batches. Otherwise, you would re-gather weights for every micro-batch — unnecessarily multiplying communication.
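A rough count shows why re-gathering per micro-batch is costly. The sketch below assumes one parameter all-gather per layer per pass (forward and backward), and that a pipeline stage has enough memory to keep its gathered layers resident. Both are simplifying assumptions, and the function name is hypothetical.

```python
def allgathers_per_step(layers_on_stage: int, micro_batches: int,
                        keep_weights_resident: bool) -> int:
    """Number of per-layer parameter all-gathers in one optimizer step."""
    if keep_weights_resident:
        # Gather each layer once, then reuse it for every micro-batch
        # in both the forward and the backward pass.
        return layers_on_stage
    passes = 2  # forward + backward each need the full weights
    return layers_on_stage * micro_batches * passes


if __name__ == "__main__":
    naive = allgathers_per_step(4, 8, keep_weights_resident=False)
    resident = allgathers_per_step(4, 8, keep_weights_resident=True)
    print(f"re-gather every micro-batch: {naive} all-gathers")
    print(f"keep weights resident:       {resident} all-gathers")
```

The trade-off is memory: keeping weights resident gives back part of the parameter sharding that ZeRO-3 provides for the duration of the micro-batch series.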

Check: What is the key difference between how PP and ZeRO-3 distribute the model?

Chapter 2: TP Complements Both

Tensor parallelism (with sequence parallelism) is naturally complementary to both PP and ZeRO-3. TP relies on the distributive property of matrix multiplication — weights and activations can be sharded and computed independently before being combined.
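The sharding property can be checked on a toy example. The sketch below uses plain Python lists in place of GPU tensors and simulates the "ranks" with a loop. It shows the two usual ways of splitting a weight matrix: by columns (outputs are concatenated) and by rows (partial outputs are summed). Function names are illustrative, and the dimensions are assumed to be divisible by the TP degree.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_parallel(x, w, tp):
    """Split W by columns; each rank produces a slice of the output columns."""
    cols = len(w[0]) // tp
    shards = [[row[r * cols:(r + 1) * cols] for row in w] for r in range(tp)]
    partial = [matmul(x, shard) for shard in shards]  # independent per rank
    # Stand-in for an all-gather: concatenate the column slices row by row.
    return [sum((p[i] for p in partial), []) for i in range(len(x))]


def row_parallel(x, w, tp):
    """Split W by rows and X by columns; each rank produces a partial sum."""
    rows = len(w) // tp
    partial = []
    for r in range(tp):
        x_shard = [row[r * rows:(r + 1) * rows] for row in x]
        w_shard = w[r * rows:(r + 1) * rows]
        partial.append(matmul(x_shard, w_shard))
    # Stand-in for an all-reduce: sum the partial outputs elementwise.
    return [[sum(p[i][j] for p in partial) for j in range(len(w[0]))]
            for i in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2, 3, 4], [5, 6, 7, 8]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1], [1, 3, 1, 0]]
    assert column_parallel(x, w, 2) == matmul(x, w)
    assert row_parallel(x, w, 2) == matmul(x, w)
    print("sharded results match the unsharded product")
```

In the row-parallel case the combining step is a sum across ranks, which is where the all-reduce on the critical path comes from.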

But TP has two practical limitations:

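1. Communication on the critical path: a collective (all-reduce, or all-gather / reduce-scatter when SP is used) in every layer is hard to overlap with compute, so throughput degrades as the TP degree grows, especially once a TP group spans more than one node.
2. Model-specific sharding: requires careful handling of activation shapes, which are sharded sometimes along the hidden dimension (TP) and sometimes along the sequence dimension (SP).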
The practical rule: Keep TP intra-node (where NVLink bandwidth is high). Use PP or ZeRO-3 for inter-node parallelism, where their communication patterns are more tolerant of lower bandwidth.

A typical large-scale configuration:

| Level | Strategy | Communication |
|-------|----------|---------------|
| Within a node (8 GPUs) | TP=8 (with SP) | NVLink (fast all-reduce) |
| Across nodes | PP or ZeRO-3 + DP | InfiniBand (lower bandwidth, but more tolerant patterns) |
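One way to get this layout is to make TP the fastest-varying index when mapping global ranks to parallel coordinates, so that each TP group consists of consecutive ranks. The sketch below assumes ranks are numbered consecutively within a node (rank // 8 is the node index) and picks PP as the next index and DP as the slowest. That ordering is an illustrative assumption, and real frameworks may order the outer dimensions differently.

```python
def rank_to_coords(rank: int, tp: int, pp: int, dp: int):
    """Map a global rank to (dp_idx, pp_idx, tp_idx), TP fastest-varying."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range for this layout")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def tp_group(rank: int, tp: int):
    """Global ranks that share a TP group with `rank` (consecutive ranks)."""
    start = rank - rank % tp
    return list(range(start, start + tp))


def node_of(rank: int, gpus_per_node: int = 8) -> int:
    """Node index, assuming ranks are assigned node by node."""
    return rank // gpus_per_node


if __name__ == "__main__":
    group = tp_group(13, tp=8)
    print("TP group of rank 13:", group)
    print("nodes touched:", {node_of(r) for r in group})
    print("coords of rank 13:", rank_to_coords(13, tp=8, pp=2, dp=2))
```

With TP=8 and 8 GPUs per node, every TP group falls inside one node, so only PP and DP traffic crosses the slower inter-node links.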
Check: Why is TP typically kept within a single node?

Chapter 3: CP and EP

Context parallelism and expert parallelism are complementary to TP and address different bottlenecks:

| Strategy | Targets | Relevant when |
|----------|---------|---------------|
| CP | Sequence length bottleneck | 128K+ tokens, attention memory exceeds GPU capacity |
| EP | Expert capacity scaling | MoE architectures with many experts (e.g., 256) |

CP shards activations along the sequence dimension. Most modules (MLP, LayerNorm) process tokens independently — no communication needed. Only attention requires cross-GPU K/V exchange via Ring Attention. CP is valuable at extreme sequence lengths where attention memory alone would overflow a single GPU.
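The ring exchange can be sketched as a schedule: each rank keeps the queries for its own chunk of the sequence, and at every step passes the K/V block it currently holds to its neighbor. The toy code below only tracks which block each rank holds at each step. It does not compute attention, and it ignores causal-mask load balancing and communication/compute overlap. Names are hypothetical.

```python
def token_range(rank: int, cp: int, seq_len: int):
    """Half-open [start, end) token range owned by `rank`.

    Assumes seq_len is divisible by cp and a contiguous split.
    """
    chunk = seq_len // cp
    return rank * chunk, (rank + 1) * chunk


def ring_schedule(cp: int):
    """schedule[step][rank] = index of the K/V block that rank holds.

    At step 0 every rank holds its own block; at each later step it holds
    the block its left neighbor held one step earlier.
    """
    return [[(rank - step) % cp for rank in range(cp)] for step in range(cp)]


if __name__ == "__main__":
    cp = 4
    print("rank 2 owns tokens", token_range(2, cp, seq_len=131072))
    for step, held in enumerate(ring_schedule(cp)):
        print(f"step {step}: K/V block held by each rank = {held}")
```

After `cp` steps every rank has seen every K/V block exactly once, which is what lets each rank finish attention for its own queries without ever storing the full sequence.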

EP shards expert sub-networks. It uses all-to-all communication for token routing. Since each token is only processed by top-k experts, inference and training compute scale sub-linearly with total expert count. EP is mandatory for large MoE models and can be combined with all other strategies without conflict.
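The all-to-all dispatch can be illustrated by grouping token-to-expert assignments by the rank that hosts each expert. The sketch below takes the router's top-k choices as given (the router itself is not modeled) and assumes experts are placed contiguously, with `num_experts / ep` experts per rank. Both the placement and the function names are assumptions made for illustration.

```python
def dispatch(token_experts, num_experts: int, ep: int):
    """Group (token, expert) pairs by destination EP rank.

    token_experts[i] lists the expert ids chosen for token i (its top-k).
    Returns one bucket per EP rank; bucket r is what would be sent to rank r
    in the all-to-all.
    """
    if num_experts % ep:
        raise ValueError("num_experts must be divisible by ep")
    per_rank = num_experts // ep
    buckets = [[] for _ in range(ep)]
    for token, experts in enumerate(token_experts):
        for expert in experts:
            buckets[expert // per_rank].append((token, expert))
    return buckets


def active_fraction(num_experts: int, top_k: int) -> float:
    """Share of experts that process any single token."""
    return top_k / num_experts


if __name__ == "__main__":
    choices = [[0, 5], [2, 3], [7, 1]]  # hypothetical top-2 routing for 3 tokens
    for rank, bucket in enumerate(dispatch(choices, num_experts=8, ep=4)):
        print(f"EP rank {rank} receives {bucket}")
    print("active expert fraction with 256 experts, top-8:",
          active_fraction(256, 8))
```

Bucket sizes depend on the routing decisions, so an unbalanced router leads directly to unbalanced all-to-all traffic and unbalanced expert compute.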

No conflicts: CP and EP can be combined with each other and with TP, PP, and DP without any particular issues. CP handles the sequence dimension, EP handles the expert dimension — they are orthogonal.
Check: What dimensions do CP and EP parallelize, respectively?

Chapter 4: Scope & Focus

Each parallelism strategy has a different scope — which sub-parts of the model it affects most:

| Strategy | Scope | Communication type | Model-specific? |
|----------|-------|--------------------|-----------------|
| TP + SP | Entire model (weights + activations) | All-reduce for matmuls | Yes (needs sharding patterns) |
| CP | Primarily attention layers | Ring Attention K/V exchange | Mostly model-agnostic |
| EP | MoE layers only | All-to-all for token routing | MoE-specific |
| PP | No specific submodule | Point-to-point activations | Model-agnostic (but needs balanced layers) |
| DP + ZeRO | No specific submodule | All-reduce gradients / all-gather weights | Model-agnostic |

Key insight: TP affects computation throughout the entire model. CP primarily impacts attention. EP only affects MoE layers. PP and ZeRO are not specific to any submodule. Understanding this scope helps you choose which strategies to combine.

For example, if your bottleneck is attention memory on long sequences, adding CP helps but adding more PP does not. If your bottleneck is total model size, PP or ZeRO-3 helps but CP does not.

Check: Which strategy specifically targets attention layers?

Chapter 5: The Master Table

Here is the complete comparison of all strategies — what each one saves, what dimension it parallelizes, and its primary disadvantage:

| Method | Memory savings on | Parallel dimension | Disadvantage |
|--------|-------------------|--------------------|--------------|
| DP | Activations (via smaller local batch) | Batch | Limited by max batch size |
| PP | Model parameters | Model layers | Pipeline bubble + complex schedules |
| TP + SP | Parameters and activations | Hidden / sequence | Requires high-bandwidth communication |
| CP | Activations | Sequence length | Communication overhead in attention |
| EP | Expert parameters | Experts | Requires MoE layers + routing overhead |
| ZeRO-1 | Optimizer states | Sharded among DP replicas | Parameter communication overhead |
| ZeRO-2 | Optimizer states + gradients | Sharded among DP replicas | Parameter communication overhead |
| ZeRO-3 | Optimizer states + gradients + parameters | Sharded among DP replicas | Parameter communication overhead |

The bottom line: None of these techniques is a silver bullet. We will often need to combine them. Chapter 9 provides a step-by-step recipe for choosing the right combination.
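The DP limit in the table follows from how the global batch size is composed: micro-batch size × gradient accumulation steps × DP degree. A minimal sketch, with hypothetical function names and example numbers:

```python
def global_batch_size(micro_batch_size: int, grad_accum_steps: int,
                      dp: int) -> int:
    """Samples consumed per optimizer step across all DP replicas."""
    return micro_batch_size * grad_accum_steps * dp


def max_dp_degree(max_global_batch: int, micro_batch_size: int,
                  grad_accum_steps: int = 1) -> int:
    """Largest DP degree that keeps the global batch within the cap."""
    return max_global_batch // (micro_batch_size * grad_accum_steps)


if __name__ == "__main__":
    print("global batch:", global_batch_size(2, 4, 64))
    print("max DP at cap 1024, mbs=2:", max_dp_degree(1024, 2))
    print("max DP at cap 1024, mbs=2, grad_acc=4:", max_dp_degree(1024, 2, 4))
```

Once the DP degree reaches that bound, additional GPUs have to be absorbed by another dimension (TP, PP, CP, or EP) rather than by more replicas.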
Check: Which strategy is limited by the maximum global batch size?

Chapter 6: 5D Parallelism Simulator

Explore how different parallelism strategies partition a transformer layer. Toggle each dimension to see what is sharded, replicated, or communicated.

[Interactive simulations, not reproduced here: the 5D Parallelism Visualizer (toggle parallelism dimensions to see how a transformer layer is distributed across GPUs) and the Memory Savings Overview (how each strategy reduces different memory categories).]

Check: Which parallelism dimension should be kept within a single node?

Chapter 7: Summary

The 5D recipe:
• TP intra-node for fast weight/activation sharding
• PP or ZeRO-3 inter-node for model size scaling
• DP for batch scaling (with ZeRO-1/2 for optimizer memory)
• CP for long sequences
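• EP for MoE expert distribution

| Combination | When to use |
|-------------|-------------|
| TP + DP | Most common baseline for multi-GPU training |
| TP + PP | Large models that do not fit on one node |
| TP + DP + PP | Very large models at scale (1024+ GPUs) |
| TP + PP + CP | Long-sequence training (128K+ tokens) |
| TP + PP + EP + DP | Full MoE training (DeepSeek-V3 used a close variant: PP + EP + DP with ZeRO-1, and its technical report states that TP was avoided) |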
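A candidate combination can be sanity-checked before launching anything. The sketch below is a hypothetical helper with three rules. The product of the degrees must match the GPU count, TP should fit inside a node, and EP must divide DP. The last rule assumes EP groups are carved out of the DP dimension, so EP does not multiply the world size. Some implementations work that way, but this chapter does not state it, so treat it as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    tp: int = 1
    pp: int = 1
    dp: int = 1
    cp: int = 1
    ep: int = 1

    def world_size(self) -> int:
        # Assumption: EP reuses ranks from the DP dimension, so it is not
        # part of the product.
        return self.tp * self.pp * self.dp * self.cp


def check(config: ParallelConfig, num_gpus: int, gpus_per_node: int = 8):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    if config.world_size() != num_gpus:
        problems.append(
            f"tp*pp*dp*cp = {config.world_size()} but {num_gpus} GPUs given")
    if config.tp > gpus_per_node:
        problems.append("TP group would span more than one node")
    if config.dp % config.ep:
        problems.append("EP does not divide DP (required by this sketch)")
    return problems


if __name__ == "__main__":
    print(check(ParallelConfig(tp=8, pp=4, dp=4), num_gpus=128))
    print(check(ParallelConfig(tp=16, pp=1, dp=8), num_gpus=128))
```

A check like this catches only layout inconsistencies. Whether the configuration fits in memory or reaches good throughput still has to be measured.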
What comes next: Now that we know all the dimensions, Chapter 9 provides a practical step-by-step recipe: first fit the model in memory, then hit the target batch size, then optimize throughput. Theory meets practice.
Check: What combination did DeepSeek-V3 use for training?