All five dimensions combined — DP, TP, CP, PP, EP — when to use each, how they interact, and what each one costs.
Congratulations — you have now seen all five parallelism strategies for scaling model training:
| # | Strategy | Parallel dimension |
|---|---|---|
| 1 | Data parallelism (DP) | Batch dimension |
| 2 | Tensor parallelism (TP) | Hidden dimension |
| 3 | Sequence / context parallelism (SP/CP) | Sequence dimension |
| 4 | Pipeline parallelism (PP) | Model layers (depth) |
| 5 | Expert parallelism (EP) | Expert sub-networks |
In addition, the three ZeRO stages reduce memory within DP by sharding model state across the DP replicas:
| ZeRO stage | What is sharded |
|---|---|
| ZeRO-1 | Optimizer states |
| ZeRO-2 | Optimizer states + gradients |
| ZeRO-3 | Optimizer states + gradients + parameters |
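To make the table concrete, the sketch below estimates per-GPU memory for model state under each stage. It assumes mixed-precision Adam at 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for the fp32 master copy plus the two Adam moments), which is the accounting used in the ZeRO paper. Activations and temporary buffers are ignored, and the function name is illustrative.

```python
def zero_model_state_bytes(num_params: float, dp: int, stage: int) -> float:
    """Approximate per-GPU bytes for parameters, gradients, and optimizer states.

    Assumption: mixed-precision Adam with 2 B/param weights, 2 B/param
    gradients, and 12 B/param optimizer states (fp32 master copy, momentum,
    variance). Stage 0 means plain DP with no sharding.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if dp < 1:
        raise ValueError("dp must be at least 1")
    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if stage >= 1:
        optim /= dp   # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= dp   # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= dp  # ZeRO-3: also shard parameters
    return params + grads + optim


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_model_state_bytes(7.5e9, dp=64, stage=stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, ZeRO-1 and ZeRO-2 approach a floor set by the unsharded parameters, while ZeRO-3 keeps shrinking as the DP degree grows.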
The question now: how do these strategies compare and combine? Which ones complement each other, and which ones are redundant?
PP and ZeRO-3 solve the same fundamental problem — distributing model weights across GPUs — but they take opposite approaches.
| Aspect | ZeRO-3 | Pipeline Parallelism |
|---|---|---|
| Each GPU stores | Only a fraction of each layer | Full layers (but fewer of them) |
| Communication transfers | Weights (all-gather before compute) | Activations (between pipeline stages) |
| Orchestration | Model-agnostic | Model-agnostic |
| Scaling preference | Prefers a large batch to hide communication | Prefers many micro-batches (large M) to hide the bubble |
| Implementation | Complex model partitioning | Complex schedule design |
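The "large M" preference in the table can be quantified with the standard bubble estimate, where M is the number of micro-batches per step. The sketch below assumes equal-cost stages and a non-interleaved schedule (GPipe or 1F1B); the function name is illustrative.

```python
def bubble_fraction(pp_stages: int, micro_batches: int) -> float:
    """Fraction of a training step that each GPU sits idle in the pipeline bubble.

    Assumption: equal-cost stages and a non-interleaved schedule. A step then
    spans (M + P - 1) micro-batch slots, of which (P - 1) are idle.
    """
    if pp_stages < 1 or micro_batches < 1:
        raise ValueError("pp_stages and micro_batches must be at least 1")
    return (pp_stages - 1) / (micro_batches + pp_stages - 1)


if __name__ == "__main__":
    for m in (8, 32, 128):
        print(f"P=8, M={m}: {bubble_fraction(8, m):.1%} idle")
```

With the number of stages fixed, raising M is the only way to shrink the bubble in this model, which is why PP favors many micro-batches.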
Because both techniques shard the model weights, combining PP with ZeRO-3 is largely redundant. ZeRO-1 and ZeRO-2, however, shard only optimizer states and gradients, not parameters, so they combine easily with PP. DeepSeek-V3's training used PP + ZeRO-1.
If you do combine PP + ZeRO-3, configure ZeRO-3 to keep weights in memory during the series of PP micro-batches. Otherwise, you would re-gather weights for every micro-batch — unnecessarily multiplying communication.
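A rough way to see the cost of re-gathering is to count the bytes each GPU receives from weight all-gathers. The sketch below is a simplification: it covers only one forward sweep of a pipeline stage over all its micro-batches, and it assumes each all-gather delivers the `(dp - 1) / dp` share of the stage's weights that a GPU does not already own. The function and parameter names are hypothetical.

```python
def zero3_allgather_bytes(stage_param_bytes: float, dp: int,
                          micro_batches: int, keep_weights: bool) -> float:
    """Bytes received per GPU from weight all-gathers in one forward sweep.

    keep_weights=True  -> weights are gathered once and reused for every
                          micro-batch.
    keep_weights=False -> weights are released and re-gathered for each
                          micro-batch.
    """
    if dp < 1 or micro_batches < 1:
        raise ValueError("dp and micro_batches must be at least 1")
    gathers = 1 if keep_weights else micro_batches
    return gathers * stage_param_bytes * (dp - 1) / dp


if __name__ == "__main__":
    naive = zero3_allgather_bytes(8e9, dp=8, micro_batches=16, keep_weights=False)
    kept = zero3_allgather_bytes(8e9, dp=8, micro_batches=16, keep_weights=True)
    print(f"re-gather: {naive / 1e9:.0f} GB, keep: {kept / 1e9:.0f} GB")
```

In this model the traffic grows linearly with the number of micro-batches unless the gathered weights are kept resident, at the price of holding the full stage weights in memory for the duration of the sweep.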
Tensor parallelism (with sequence parallelism) is naturally complementary to both PP and ZeRO-3. TP relies on the distributive property of matrix multiplication — weights and activations can be sharded and computed independently before being combined.
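The distributive property can be checked directly. The sketch below uses plain Python lists in place of tensors and simulates the TP ranks in a loop. Column-parallel sharding splits the weight matrix by columns and concatenates the partial outputs; row-parallel sharding splits it by rows and sums the partial outputs, which is the step an all-reduce performs on real hardware. All function names are illustrative.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_parallel(x, w, tp):
    """Shard W by columns; each rank computes X @ W_r, outputs are concatenated."""
    n = len(w[0])
    assert n % tp == 0, "output dimension must divide evenly across ranks"
    shard = n // tp
    parts = [matmul(x, [row[r * shard:(r + 1) * shard] for row in w])
             for r in range(tp)]
    # Concatenate each row's pieces along the hidden dimension (an all-gather).
    return [sum((p[i] for p in parts), []) for i in range(len(x))]


def row_parallel(x, w, tp):
    """Shard W by rows and X by columns; partial outputs are summed."""
    k = len(w)
    assert k % tp == 0, "input dimension must divide evenly across ranks"
    shard = k // tp
    parts = [matmul([row[r * shard:(r + 1) * shard] for row in x],
                    w[r * shard:(r + 1) * shard])
             for r in range(tp)]
    # Element-wise sum of the partial results (an all-reduce).
    return [[sum(p[i][j] for p in parts) for j in range(len(w[0]))]
            for i in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2, 3, 4], [5, 6, 7, 8]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 1, 0]]
    print(column_parallel(x, w, tp=2) == matmul(x, w))
    print(row_parallel(x, w, tp=2) == matmul(x, w))
```

Pairing a column-parallel layer with a following row-parallel layer is a common arrangement because the intermediate activation never has to be gathered; only the final sum is communicated.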
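But TP has two practical limitations. First, it requires high-bandwidth communication, so it is usually confined to the GPUs within a single node. Second, it is model-specific: each layer type needs its own sharding pattern, unlike PP and ZeRO, which are model-agnostic.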
A typical large-scale configuration:
| Level | Strategy | Communication |
|---|---|---|
| Within a node (8 GPUs) | TP=8 (with SP) | NVLink — fast all-reduce |
| Across nodes | PP or ZeRO-3 + DP | InfiniBand: lower bandwidth, but these communication patterns tolerate it better |
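One way to realize this layout is to order global ranks so that TP varies fastest. With TP equal to the number of GPUs per node, every TP group then falls on a single node. The sketch below assumes that ordering (TP innermost, then PP, then DP); real frameworks make the order configurable, and the names here are illustrative.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    dp: int
    pp: int
    tp: int


def rank_to_coord(rank: int, tp: int, pp: int, dp: int) -> Coord:
    """Map a global rank to (dp, pp, tp) indices, with TP varying fastest."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank is outside the tp * pp * dp world size")
    return Coord(dp=rank // (tp * pp), pp=(rank // tp) % pp, tp=rank % tp)


def tp_group(rank: int, tp: int) -> List[int]:
    """Global ranks that share this rank's TP group (consecutive ranks)."""
    start = rank - rank % tp
    return list(range(start, start + tp))


if __name__ == "__main__":
    tp, pp, dp, gpus_per_node = 8, 4, 2, 8
    print(rank_to_coord(13, tp, pp, dp))
    # Every member of the TP group lives on the same node.
    print({r // gpus_per_node for r in tp_group(13, tp)})
```

Because TP groups are made of consecutive ranks, their collectives stay on the intra-node link, while PP and DP traffic crosses nodes.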
Context parallelism and expert parallelism are complementary to TP and address different bottlenecks:
| Strategy | Targets | Relevant when |
|---|---|---|
| CP | Sequence length bottleneck | 128K+ tokens, attention memory exceeds GPU capacity |
| EP | Expert capacity scaling | MoE architectures with many experts (e.g., 256) |
CP shards activations along the sequence dimension. Most modules (MLP, LayerNorm) process tokens independently — no communication needed. Only attention requires cross-GPU K/V exchange via Ring Attention. CP is valuable at extreme sequence lengths where attention memory alone would overflow a single GPU.
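The K/V exchange can be sketched as a ring: each rank keeps its own query chunk, and the key/value blocks rotate one hop per step until every rank has seen all of them. The simulation below is a simplification with several assumptions: a single head with scalar (dimension 1) queries, keys, and values, no causal mask, no score scaling, and ranks simulated in a loop rather than on separate devices. Partial results are merged with a running-max softmax so the output matches ordinary attention. Function names are illustrative.

```python
import math


def full_attention(q, k, v):
    """Reference softmax attention over scalar queries, keys, and values."""
    out = []
    for qi in q:
        scores = [qi * kj for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        out.append(sum(wi * vj for wi, vj in zip(w, v)) / sum(w))
    return out


def ring_attention(q_chunks, k_chunks, v_chunks):
    """Simulate CP ranks: Q chunks stay put while K/V blocks travel the ring."""
    world = len(q_chunks)
    # Per query: (running max, running denominator, running weighted sum).
    state = [[(-math.inf, 0.0, 0.0) for _ in qc] for qc in q_chunks]
    kv = list(zip(k_chunks, v_chunks))  # kv[r] is the block rank r holds now
    for _ in range(world):
        for r in range(world):
            k_blk, v_blk = kv[r]
            for i, qi in enumerate(q_chunks[r]):
                m, denom, acc = state[r][i]
                for kj, vj in zip(k_blk, v_blk):
                    s = qi * kj
                    m_new = max(m, s)
                    scale = math.exp(m - m_new)  # rescale earlier partial sums
                    p = math.exp(s - m_new)
                    denom = denom * scale + p
                    acc = acc * scale + p * vj
                    m = m_new
                state[r][i] = (m, denom, acc)
        # Each rank passes its K/V block to the next rank in the ring.
        kv = [kv[(r - 1) % world] for r in range(world)]
    return [[acc / denom for (_, denom, acc) in rank_state] for rank_state in state]


if __name__ == "__main__":
    q = [0.1, -0.3, 0.5, 0.2]
    k = [1.0, 0.5, -1.0, 2.0]
    v = [1.0, 2.0, 3.0, 4.0]
    ring = ring_attention([q[:2], q[2:]], [k[:2], k[2:]], [v[:2], v[2:]])
    print(ring, full_attention(q, k, v))
```

Each rank only ever holds one K/V block at a time, which is what keeps attention memory bounded by the chunk size rather than the full sequence length.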
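EP shards expert sub-networks across GPUs and uses all-to-all communication to route each token to the GPUs that hold its experts. Since each token is processed only by its top-k experts, per-token compute depends on k rather than on the total expert count, so training and inference compute grow sub-linearly as experts are added. EP is effectively required for large MoE models whose experts do not fit on a single GPU, and it can be combined with all other strategies without conflict.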
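The routing step can be sketched as follows: pick the top-k experts per token from the router scores, then group the (token, expert) pairs by the GPU that owns each expert. The resulting per-GPU lists are what an all-to-all would exchange. The sketch assumes experts are assigned to GPUs in contiguous blocks and ignores capacity limits and load balancing; the function name is illustrative.

```python
from typing import Dict, List, Tuple


def route_tokens(router_scores: List[List[float]], top_k: int,
                 experts_per_gpu: int) -> Dict[int, List[Tuple[int, int]]]:
    """Group (token, expert) assignments by the GPU that hosts each expert.

    router_scores[t][e] is the router score of expert e for token t.
    Assumption: expert e lives on GPU e // experts_per_gpu.
    """
    dispatch: Dict[int, List[Tuple[int, int]]] = {}
    for t, scores in enumerate(router_scores):
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        for e in ranked[:top_k]:
            dispatch.setdefault(e // experts_per_gpu, []).append((t, e))
    return dispatch


if __name__ == "__main__":
    scores = [
        [0.10, 0.70, 0.15, 0.05],
        [0.40, 0.30, 0.20, 0.10],
        [0.05, 0.05, 0.30, 0.60],
    ]
    plan = route_tokens(scores, top_k=2, experts_per_gpu=2)
    for gpu, pairs in sorted(plan.items()):
        print(f"GPU {gpu} receives {pairs}")
```

The total number of assignments is always tokens times k, regardless of how many experts exist, which is the reason compute does not grow in step with expert count.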
Each parallelism strategy has a different scope — which sub-parts of the model it affects most:
| Strategy | Scope | Communication type | Model-specific? |
|---|---|---|---|
| TP + SP | Entire model (weights + activations) | All-reduce for matmuls | Yes — needs sharding patterns |
| CP | Primarily attention layers | Ring Attention K/V exchange | Mostly model-agnostic |
| EP | MoE layers only | All-to-all for token routing | MoE-specific |
| PP | No specific submodule | Point-to-point activations | Model-agnostic (but needs balanced layers) |
| DP + ZeRO | No specific submodule | All-reduce gradients / all-gather weights | Model-agnostic |
For example, if your bottleneck is attention memory on long sequences, adding CP helps but adding more PP does not. If your bottleneck is total model size, PP or ZeRO-3 helps but CP does not.
Here is the complete comparison of all strategies — what each one saves, what dimension it parallelizes, and its primary disadvantage:
| Method | Memory savings on | Parallel dimension | Disadvantage |
|---|---|---|---|
| DP | Activations (via smaller local batch) | Batch | Limited by max batch size |
| PP | Model parameters | Model layers | Pipeline bubble + complex schedules |
| TP + SP | Parameters and activations | Hidden / sequence | Requires high-bandwidth communication |
| CP | Activations | Sequence length | Communication overhead in attention |
| EP | Expert parameters | Experts | Requires MoE layers + routing overhead |
| ZeRO-1 | Optimizer states | Sharded among DP replicas | Parameter communication overhead |
| ZeRO-2 | Optimizer states + gradients | Sharded among DP replicas | Parameter communication overhead |
| ZeRO-3 | Optimizer states + gradients + parameters | Sharded among DP replicas | Parameter communication overhead |
Interactive figure: how each parallelism dimension shards, replicates, or communicates the parts of a transformer layer across GPUs, and which memory categories each strategy reduces.
| Combination | When to use |
|---|---|
| TP + DP | Most common baseline for multi-GPU training |
| TP + PP | Large models that do not fit on one node |
| TP + DP + PP | Very large models at scale (1024+ GPUs) |
| TP + PP + CP | Long-sequence training (128K+ tokens) |
| TP + PP + EP + DP | Full MoE training (e.g., DeepSeek-V3) |