A step-by-step recipe: fit in memory, hit the target batch size, optimize throughput — then benchmark thousands of configs.
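We now have all the tools: data parallelism (DP), tensor parallelism (TP), context parallelism (CP), pipeline parallelism (PP), expert parallelism (EP), and ZeRO-1/2/3. The question is: which ones do we use, and with what settings?
There is no universal answer. The right configuration depends on your model size, sequence length, number of GPUs, network topology, and target batch size. But we can follow a systematic three-step process: fit a training step in memory, reach the target global batch size, then optimize training throughput.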
Before optimizing anything, we need a training step that does not crash. This means fitting the model, gradients, optimizer states, and activations into GPU memory.
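To get a feel for the static part of that budget (weights, gradients, and optimizer states, but not activations), here is a minimal sketch. It assumes bf16 mixed-precision training with Adam, which is commonly counted as 2 + 2 + 12 bytes per parameter, and it assumes TP and PP split the parameters evenly. The function name `static_memory_gb` and its arguments are hypothetical, chosen only for illustration.

```python
def static_memory_gb(n_params, zero_stage=0, dp=1, tp=1, pp=1):
    """Approximate per-GPU memory (GB) for weights, gradients and Adam states.

    Assumptions (not taken from the text above): bf16 weights and gradients,
    fp32 master weights plus Adam momentum and variance, even sharding.
    Activations, buffers and fragmentation are ignored.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2 or 3")

    # Assumption: TP and PP divide the parameters evenly across tp * pp ranks.
    local_params = n_params / (tp * pp)

    param_bytes = 2 * local_params   # bf16 weights
    grad_bytes = 2 * local_params    # bf16 gradients
    optim_bytes = 12 * local_params  # fp32 master weights + Adam momentum + variance

    # Each ZeRO stage shards one more component across the DP ranks.
    if zero_stage >= 1:
        optim_bytes /= dp
    if zero_stage >= 2:
        grad_bytes /= dp
    if zero_stage >= 3:
        param_bytes /= dp

    return (param_bytes + grad_bytes + optim_bytes) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        print(f"ZeRO-{stage}: {static_memory_gb(7.5e9, stage, dp=64):.1f} GB")
```

For a 7.5B-parameter model on 64 DP ranks, this estimate drops from about 120 GB per GPU without ZeRO to under 2 GB with ZeRO-3. Activation memory comes on top and often dominates at long sequence lengths.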
GPU-rich case (plenty of GPUs available):
| Model size | Recommended starting config |
|---|---|
| < 10B params | Single strategy: TP=8 or ZeRO-3 with full recompute on 8 GPUs |
| 10B–100B params | Combine: TP=8 + PP, or TP=8 + DP(ZeRO-3), or pure ZeRO-3 |
| 100B+ params | 3D: TP=8 + PP + DP(ZeRO-2) |
GPU-poor case (limited resources):
| Technique | Effect |
|---|---|
| Full activation recomputation | Trade compute for memory — slower but fits larger models |
| Increase gradient accumulation | Process larger effective batches with less memory per step |
| Reduce micro-batch size | Less activation memory per forward pass |
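The effect of the first and third rows can be estimated with a short sketch. It uses a widely cited per-layer approximation for transformer activation memory in 16-bit precision, `s*b*h*(34 + 5*a*s/h)` bytes, and assumes that full recomputation keeps only each layer's input (`2*s*b*h` bytes). It ignores FlashAttention, TP, and sequence parallelism, all of which lower the real numbers. The function name is hypothetical.

```python
def activation_memory_gb(seq_len, micro_batch, hidden, n_heads, n_layers,
                         full_recompute=False):
    """Rough activation memory (GB) for one micro-batch on one GPU.

    Assumptions: 16-bit activations, standard (non-fused) attention,
    no tensor or sequence parallelism.
    """
    tokens_x_hidden = seq_len * micro_batch * hidden
    if full_recompute:
        # Only each layer's 16-bit input is stored; the rest is recomputed
        # during the backward pass.
        per_layer = 2 * tokens_x_hidden
    else:
        # 34*s*b*h for linear/norm/dropout tensors, 5*a*s^2*b for attention.
        per_layer = tokens_x_hidden * (34 + 5 * n_heads * seq_len / hidden)
    return n_layers * per_layer / 1e9


if __name__ == "__main__":
    shape = dict(seq_len=4096, hidden=4096, n_heads=32, n_layers=32)
    for mbs in (1, 2):
        for recompute in (False, True):
            gb = activation_memory_gb(micro_batch=mbs, full_recompute=recompute, **shape)
            print(f"mbs={mbs} full_recompute={recompute}: {gb:.1f} GB")
```

Under these assumptions, activation memory grows linearly with micro-batch size, and full recomputation cuts it by roughly two orders of magnitude at the cost of an extra forward pass.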
Special cases:
| Scenario | Add this |
|---|---|
| Very long sequences (128K+) | Context parallelism across nodes |
| MoE architecture | Expert parallelism across nodes |
After fitting in memory, our current global batch size might be too small or too large for good convergence. Time to adjust.
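To increase global batch size:
| Approach | How |
|---|---|
| Increase DP or gradient accumulation | Add DP replicas or accumulate gradients over more micro-batches |
| Increase CP | Raise the CP degree for long sequences |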
To decrease global batch size:
| Approach | How |
|---|---|
| Reduce DP | Replace DP replicas with other parallelism (TP, PP) |
| Reduce CP | Lower the CP degree for long sequences |
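The arithmetic behind these adjustments is simple: the global batch size in tokens is the product of micro-batch size, gradient accumulation steps, DP degree, and sequence length. The sketch below solves for the gradient accumulation steps needed to hit a target; the function names are hypothetical, and "1M tokens" is taken to mean 1,048,576 as an assumption.

```python
def global_batch_tokens(micro_batch, grad_acc, dp, seq_len):
    """Tokens consumed per optimizer step across all DP replicas."""
    return micro_batch * grad_acc * dp * seq_len


def grad_acc_for_target(target_tokens, micro_batch, dp, seq_len):
    """Gradient accumulation steps needed to reach target_tokens exactly."""
    tokens_per_micro_step = micro_batch * dp * seq_len
    if target_tokens % tokens_per_micro_step != 0:
        raise ValueError(
            f"target {target_tokens} is not a multiple of "
            f"{tokens_per_micro_step} tokens per micro-step"
        )
    return target_tokens // tokens_per_micro_step


if __name__ == "__main__":
    # Assumed example: 1,048,576-token target, sequence length 4096.
    steps = grad_acc_for_target(1_048_576, micro_batch=2, dp=32, seq_len=4096)
    print("gradient accumulation steps:", steps)
```

Doubling DP halves the required accumulation steps, which is why adding replicas at a fixed target batch size eventually leaves nothing to accumulate and forces a switch to other parallelism.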
The model fits, the batch size is right. Now: are we training as fast as possible? The goal is to maximize Model FLOPs Utilization (MFU) — the fraction of theoretical peak GPU compute actually used.
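A rough way to put a number on MFU is sketched below. It assumes the common approximation of 6 FLOPs per parameter per token for a forward plus backward pass (attention FLOPs ignored), and a peak of about 989 TFLOPS dense BF16 per H100 SXM. Both are assumptions to replace with figures for your own model and hardware; the function name is hypothetical.

```python
H100_BF16_PEAK_FLOPS = 989e12  # assumed dense BF16 peak per H100 SXM GPU


def mfu(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu=H100_BF16_PEAK_FLOPS):
    """Model FLOPs Utilization: achieved model FLOPs/s over theoretical peak."""
    # Approximation: ~6 FLOPs per parameter per token (forward + backward).
    achieved_flops_per_second = 6 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


if __name__ == "__main__":
    # Hypothetical throughput for an 8B model on 64 GPUs.
    print(f"MFU: {mfu(8e9, tokens_per_second=5e5, n_gpus=64):.1%}")
```

Because recomputed activations are not counted in the numerator, enabling full recomputation lowers MFU even when the GPUs are just as busy.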
Optimization strategies, in order of priority:
| # | Strategy | Why it helps |
|---|---|---|
| 1 | Scale up TP (within node) | Uses fast intra-node NVLink; lets you lower the degree of other parallelism |
| 2 | Increase DP + ZeRO-3 | Keep target batch size while adding GPUs |
| 3 | Switch to PP when DP comms bottleneck | PP's point-to-point communication needs less inter-node bandwidth |
| 4 | Experiment with micro-batch sizes | Balance compute per step, memory, and communication |
| 5 | Try different parallelism combos | The optimal mix depends on your specific hardware |
As a rule of thumb: keep TP intra-node for the highest bandwidth, then choose between PP and ZeRO-3 for inter-node based on your model size and available batch size.
The Hugging Face team ran thousands of distributed configurations on their cluster (1–64 nodes of 8xH100s) to produce real-world data on what works best.
Key findings from their benchmarks (4,096 sequence length, 1M token global batch size):
| Observation | Implication |
|---|---|
| Efficiency decreases with more nodes | Especially for small models with low compute-to-model-size ratio |
| Larger models are more GPU-hungry | 80B on 4 nodes barely fits and runs inefficiently near memory limits |
| Implementation quality matters enormously | Optimized PP beat unoptimized TP; then optimized TP beat PP again |
The benchmark results showed that for each model size and node count, there is an optimal combination of DP, TP, PP, gradient accumulation steps, micro-batch size, and ZeRO stage. The optimal config varies significantly across the grid — there is no one-size-fits-all answer.
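To see why the search space is so large, the sketch below enumerates candidate configurations for a given GPU count. It is not the benchmark harness the team used. It assumes TP stays within a node, that TP × PP × DP must equal the GPU count, that the token target is hit exactly, and that ZeRO only makes sense when DP > 1. All names and default values are hypothetical.

```python
from itertools import product


def divisors(n):
    """All positive divisors of n, in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def candidate_configs(n_gpus, gpus_per_node=8, seq_len=4096,
                      target_tokens=1_048_576,
                      micro_batches=(1, 2, 4, 8), zero_stages=(0, 1, 2, 3)):
    """Enumerate (tp, pp, dp, mbs, grad_acc, zero) combos that hit the target."""
    configs = []
    for tp in divisors(gpus_per_node):       # keep TP inside one node
        if n_gpus % tp != 0:
            continue
        for pp in divisors(n_gpus // tp):
            dp = n_gpus // (tp * pp)
            for mbs, zero in product(micro_batches, zero_stages):
                if dp == 1 and zero > 0:     # nothing to shard across
                    continue
                tokens_per_micro_step = mbs * dp * seq_len
                if target_tokens % tokens_per_micro_step != 0:
                    continue
                configs.append({
                    "tp": tp, "pp": pp, "dp": dp, "mbs": mbs,
                    "grad_acc": target_tokens // tokens_per_micro_step,
                    "zero": zero,
                })
    return configs


if __name__ == "__main__":
    for nodes in (1, 8, 64):
        print(nodes, "nodes:", len(candidate_configs(nodes * 8)), "candidates")
```

Many of these candidates will not fit in memory or will be obviously slow, so in practice a memory estimate would prune the list before anything is launched.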
Running thousands of distributed training experiments taught the Hugging Face team hard lessons about the gap between theory and practice:
| Challenge | What happened |
|---|---|
| Process cleanup failures | PyTorch processes would not terminate cleanly, leaving GPUs in bad states |
| Slurm job manager issues | Forceful termination caused node failures, requiring restart |
| Jobs stretching | Simple benchmarks that should take minutes ran for hours |
| Indefinite hangs | Some configurations would hang forever without error messages |
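One common guard against hangs and stretched jobs is a hard wall-clock limit around each benchmark launch. The sketch below is a generic POSIX-only pattern, not the tooling the team built: it starts the command in its own process group, asks the group to stop with `SIGTERM` once the limit is hit, and escalates to `SIGKILL` only after a grace period. The function name and return labels are hypothetical.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, grace_s=10):
    """Run cmd; return "ok", "failed" or "timeout" (POSIX only)."""
    # start_new_session=True makes the child a session and group leader,
    # so its process group ID equals its PID and covers its own children.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return_code = proc.wait(timeout=timeout_s)
        return "ok" if return_code == 0 else "failed"
    except subprocess.TimeoutExpired:
        # Ask the whole group to stop first; escalate only if it ignores us.
        os.killpg(proc.pid, signal.SIGTERM)
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()
        return "timeout"


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(sleeper, timeout_s=1))
```

This only covers processes on the local machine. Under a scheduler such as Slurm, the job's own time limit and cancellation settings would play the same role, and a wrapper like this does nothing for GPUs already left in a bad state.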
Working around these failures cost the team significant engineering time.
The three-step process visualized.
| Scale | Typical configuration |
|---|---|
| 8 GPUs (1 node) | TP=8 or ZeRO-3 |
| 16–64 GPUs | TP=8 + PP or TP=8 + ZeRO-3 |
| 128–512 GPUs | TP=8 + PP + DP(ZeRO-1/2) |
| 1024+ GPUs | Full 5D: TP=8 + PP + DP(ZeRO-2) + CP (if long seq) + EP (if MoE) |