Tazi et al., Chapter 9

Finding the Best Configuration

A step-by-step recipe: fit in memory, hit the target batch size, optimize throughput — then benchmark thousands of configs.

Prerequisites: Chapter 8 (5D Parallelism). All five parallelism strategies and their interactions.
8 chapters, 2 simulations, 8 quizzes

Chapter 0: The Decision Process

We now have all the tools: DP, TP, CP, PP, EP, and ZeRO-1/2/3. The question is which ones to use, and with what settings.

There is no universal answer. The right configuration depends on your model size, sequence length, number of GPUs, network topology, and target batch size. But we can follow a systematic three-step process:

Step 1: Fit in Memory. Get a single training step to run without running out of memory (OOM).
Step 2: Hit Target Batch Size. Adjust DP, gradient accumulation, and CP to reach the desired global batch size (GBS).
Step 3: Optimize Throughput. Scale up parallelism dimensions to maximize GPU utilization.
Reality check: You will always need to run experiments. Theory gives you a starting point, but every cluster has its own quirks — network bandwidth, memory per GPU, CUDA version, NCCL behavior. Benchmarking is essential.
Check: What is the first priority when configuring distributed training?

Chapter 1: Step 1 — Fit in Memory

Before optimizing anything, we need a training step that does not crash. This means fitting the model, gradients, optimizer states, and activations into GPU memory.
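As a rough first check on whether the model states fit, the sketch below estimates per-GPU memory for parameters, gradients, and optimizer states under the ZeRO stages. It assumes mixed-precision Adam at 16 bytes per parameter (2 for bf16 parameters, 2 for bf16 gradients, 12 for fp32 master weights plus the two Adam moments). It ignores activations as well as TP/PP sharding, so treat the result as a lower bound rather than a guarantee. The function name and signature are hypothetical.

```python
def model_state_bytes_per_gpu(num_params, dp_degree=1, zero_stage=0):
    """Approximate bytes per GPU for params, grads, and optimizer states.

    Assumption: mixed-precision Adam, i.e. 2 + 2 + 12 bytes per parameter.
    Activations and TP/PP sharding are NOT modeled here.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params   # bf16 parameters
    grads = 2.0 * num_params    # bf16 gradients
    optim = 12.0 * num_params   # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:         # ZeRO-1 shards optimizer states across DP ranks
        optim /= dp_degree
    if zero_stage >= 2:         # ZeRO-2 additionally shards gradients
        grads /= dp_degree
    if zero_stage >= 3:         # ZeRO-3 additionally shards parameters
        params /= dp_degree
    return params + grads + optim


if __name__ == "__main__":
    for stage in (0, 1, 2, 3):
        gb = model_state_bytes_per_gpu(8e9, dp_degree=8, zero_stage=stage) / 1e9
        print(f"8B params, DP=8, ZeRO-{stage}: {gb:.0f} GB of model states per GPU")
```

Under these assumptions an 8B-parameter model needs about 128 GB of model states unsharded, which does not fit on an 80 GB GPU, and about 16 GB per GPU with ZeRO-3 across 8 ranks, before counting activations.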

GPU-rich case (plenty of GPUs available):

| Model size | Recommended starting config |
| --- | --- |
| < 10B params | Single strategy: TP=8, or ZeRO-3 with full recompute, on 8 GPUs |
| 10B–100B params | Combine: TP=8 + PP, or TP=8 + DP(ZeRO-3), or pure ZeRO-3 |
| 100B+ params | 3D parallelism: TP=8 + PP + DP(ZeRO-2) |

GPU-poor case (limited resources):

| Technique | Effect |
| --- | --- |
| Full activation recomputation | Trades compute for memory: slower, but fits larger models |
| Increase gradient accumulation | Process larger effective batches with less memory per step |
| Reduce micro-batch size | Less activation memory per forward pass |

Key insight: At 512+ GPUs, pure ZeRO-3 starts to become inefficient because of communication costs. At that scale, it is usually better to combine DP with TP or PP. At 1024+ GPUs, the recommended setup is TP=8 + DP(ZeRO-2) + PP.

Special cases:

| Scenario | Add this |
| --- | --- |
| Very long sequences (128K+) | Context parallelism across nodes |
| MoE architecture | Expert parallelism across nodes |

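The tables and thresholds in this chapter can be read as a small lookup. The sketch below encodes them as a hypothetical helper, `suggest_starting_config`, which is not part of any library. It returns starting points to benchmark, not a final answer, and the strings are just labels for the options listed above.

```python
def suggest_starting_config(params_b, num_gpus, seq_len=4096, is_moe=False):
    """Return candidate starting configs and add-ons (rule of thumb only).

    params_b: model size in billions of parameters.
    """
    if params_b < 10:
        options = ["TP=8", "ZeRO-3 + full recompute on 8 GPUs"]
    elif params_b <= 100:
        options = ["TP=8 + PP", "TP=8 + DP(ZeRO-3)", "pure ZeRO-3"]
    else:
        options = ["TP=8 + PP + DP(ZeRO-2)"]

    # Scale-based overrides from the "Key insight" callout.
    if num_gpus >= 1024:
        options = ["TP=8 + DP(ZeRO-2) + PP"]
    elif num_gpus >= 512:
        options = [o for o in options if o != "pure ZeRO-3"]

    # Special cases: long sequences and MoE architectures.
    add_ons = []
    if seq_len >= 128 * 1024:
        add_ons.append("CP across nodes")
    if is_moe:
        add_ons.append("EP across nodes")
    return {"options": options, "add_ons": add_ons}


if __name__ == "__main__":
    print(suggest_starting_config(params_b=70, num_gpus=256))
```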
Check: For a 70B model on many GPUs, what is a good starting configuration?

Chapter 2: Step 2 — Batch Size

After fitting in memory, our current global batch size might be too small or too large for good convergence. Time to adjust.

To increase global batch size:

Scale up DP: add more data-parallel replicas, each processing its own micro-batch, or
Increase gradient accumulation: each GPU processes multiple micro-batches before syncing gradients, or
Use CP for long sequences: context parallelism lets you process more tokens per sample.

To decrease global batch size:

| Approach | How |
| --- | --- |
| Reduce DP | Replace DP replicas with other parallelism (TP, PP) |
| Reduce CP | Fewer CP groups for long sequences |

Recall: Global batch size = micro_batch_size × DP_degree × gradient_accumulation_steps. Adjusting any of these three knobs changes the effective batch size.
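To make the formula concrete, the sketch below solves it for the number of gradient accumulation steps given a target global batch size in tokens. The helper name is hypothetical, and the example assumes that "1M tokens" means 2^20 tokens, so that it divides evenly by a 4,096-token sequence length.

```python
def grad_accum_steps(target_gbs_tokens, seq_len, micro_batch_size, dp_degree):
    """Solve GBS = micro_batch_size * dp_degree * grad_accum_steps for the steps.

    The global batch size is given in tokens and converted to samples first.
    """
    samples, rem = divmod(target_gbs_tokens, seq_len)
    if rem:
        raise ValueError("target_gbs_tokens must be a multiple of seq_len")
    samples_per_micro_step = micro_batch_size * dp_degree
    steps, rem = divmod(samples, samples_per_micro_step)
    if rem or steps == 0:
        raise ValueError("samples per batch must be a positive multiple of mbs * dp")
    return steps


if __name__ == "__main__":
    # 2**20 tokens / 4096 tokens per sample = 256 samples per global batch.
    print(grad_accum_steps(2**20, seq_len=4096, micro_batch_size=2, dp_degree=16))
```

With a micro-batch size of 2 and DP=16, each micro-step covers 32 samples, so 8 accumulation steps reach the 256-sample target.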
Check: How does gradient accumulation help with batch size?

Chapter 3: Step 3 — Throughput

The model fits, the batch size is right. Now: are we training as fast as possible? The goal is to maximize Model FLOPs Utilization (MFU) — the fraction of theoretical peak GPU compute actually used.
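MFU can be estimated from measured token throughput. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a forward plus backward pass, which ignores attention FLOPs and any recomputation. The default peak of 989e12 FLOP/s is an assumption (roughly the dense BF16 figure for an H100 SXM), so substitute the value for your own hardware.

```python
def mfu(tokens_per_second, num_params, num_gpus, peak_flops_per_gpu=989e12):
    """Estimate Model FLOPs Utilization from measured throughput.

    Assumptions: ~6 * num_params FLOPs per token (forward + backward,
    attention ignored); peak_flops_per_gpu is the dense peak of your GPU.
    """
    model_flops_per_token = 6.0 * num_params
    achieved_flops_per_second = tokens_per_second * model_flops_per_token
    return achieved_flops_per_second / (num_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Round illustrative numbers, not measurements: a 1B-parameter model at
    # 1000 tokens/s on one GPU with a 12 TFLOP/s peak uses half of the peak.
    print(mfu(tokens_per_second=1000.0, num_params=1e9, num_gpus=1,
              peak_flops_per_gpu=12e12))
```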

Optimization strategies, in order of priority:

| # | Strategy | Why it helps |
| --- | --- | --- |
| 1 | Scale up TP (within node) | Fast intra-node NVLink; reduces other parallelism overhead |
| 2 | Increase DP + ZeRO-3 | Keep target batch size while adding GPUs |
| 3 | Switch to PP when DP comms bottleneck | PP's point-to-point is more bandwidth-friendly |
| 4 | Experiment with micro-batch sizes | Balance compute per step, memory, and communication |
| 5 | Try different parallelism combos | The optimal mix depends on your specific hardware |

The throughput paradox: Adding more GPUs does not always increase throughput. Communication overhead grows with scale. At some point, adding another node hurts more than it helps. This is why benchmarking is non-negotiable.
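One simple way to see the paradox in your own benchmarks is to compare per-GPU throughput against the smallest run. The helper below is a hypothetical sketch: it takes measured tokens per second keyed by GPU count and returns scaling efficiency relative to the smallest configuration. The numbers in the usage example are made up purely to show the calculation.

```python
def scaling_efficiency(tokens_per_s_by_gpus):
    """Per-GPU throughput relative to the smallest measured GPU count.

    tokens_per_s_by_gpus: dict mapping GPU count -> measured total tokens/s.
    A value of 1.0 means perfect linear scaling; lower means overhead.
    """
    base_gpus = min(tokens_per_s_by_gpus)
    base_per_gpu = tokens_per_s_by_gpus[base_gpus] / base_gpus
    return {
        n: (tps / n) / base_per_gpu
        for n, tps in sorted(tokens_per_s_by_gpus.items())
    }


if __name__ == "__main__":
    # Made-up throughput values for illustration only, not benchmark results.
    made_up = {8: 80.0, 16: 144.0, 32: 224.0}
    for n, eff in scaling_efficiency(made_up).items():
        print(f"{n} GPUs: {eff:.0%} of linear scaling")
```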

As a rule of thumb: keep TP intra-node for the highest bandwidth, then choose between PP and ZeRO-3 for inter-node based on your model size and available batch size.

Check: What should you scale up first for throughput optimization?

Chapter 4: Benchmarking at Scale

The Hugging Face team ran thousands of distributed configurations on their cluster (1–64 nodes of 8xH100s) to produce real-world data on what works best.

Key findings from their benchmarks (4,096 sequence length, 1M token global batch size):

| Observation | Implication |
| --- | --- |
| Efficiency decreases with more nodes | Especially for small models with a low compute-to-model-size ratio |
| Larger models are more GPU-hungry | 80B on 4 nodes barely fits and runs inefficiently near memory limits |
| Implementation quality matters enormously | Optimized PP overtook the initial TP implementation; improved TP is expected to retake the lead |

Key insight: The "best" configuration changes as implementations improve. When the Hugging Face team first implemented both strategies, TP outperformed PP. After optimizing PP code, PP won. After improving TP's communication overlap, TP was expected to retake the lead. Never assume one strategy is always superior.

The benchmark results showed that for each model size and node count, there is an optimal combination of DP, TP, PP, gradient accumulation steps, micro-batch size, and ZeRO stage. The optimal config varies significantly across the grid — there is no one-size-fits-all answer.
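A sweep like this starts from an enumeration of valid configurations. The sketch below is an assumption about how such a grid could be generated, not the team's actual tooling. It keeps only combinations where DP × TP × PP equals the number of GPUs, TP stays within one node, and micro-batch size × DP × gradient accumulation steps equals the target batch size in samples. All names and default choices are hypothetical.

```python
def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_configs(world_size, gbs_samples, gpus_per_node=8,
                      mbs_choices=(1, 2, 4, 8), zero_stages=(0, 1, 2, 3)):
    """List configs with dp*tp*pp == world_size and mbs*dp*grad_acc == gbs_samples."""
    configs = []
    for tp in divisors(world_size):
        if tp > gpus_per_node:          # keep TP inside a node (fast NVLink)
            continue
        for pp in divisors(world_size // tp):
            dp = world_size // (tp * pp)
            for mbs in mbs_choices:
                if gbs_samples % (mbs * dp):
                    continue            # cannot hit the target batch size exactly
                grad_acc = gbs_samples // (mbs * dp)
                for zero in zero_stages:
                    if zero > 0 and dp == 1:
                        continue        # ZeRO shards across DP ranks; nothing to shard
                    configs.append({"dp": dp, "tp": tp, "pp": pp, "mbs": mbs,
                                    "grad_acc": grad_acc, "zero": zero})
    return configs


if __name__ == "__main__":
    grid = enumerate_configs(world_size=64, gbs_samples=256)
    print(len(grid), "candidate configs; first:", grid[0])
```

Even with these few knobs the grid grows quickly with the GPU count, which is why the search has to be automated and each candidate actually measured.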

Check: Why can't we just pick one parallelism strategy and use it for everything?

Chapter 5: Lessons Learned

Running thousands of distributed training experiments taught the Hugging Face team hard lessons about the gap between theory and practice:

| Challenge | What happened |
| --- | --- |
| Process cleanup failures | PyTorch processes would not terminate cleanly, leaving GPUs in bad states |
| Slurm job manager issues | Forceful termination caused node failures, requiring restart |
| Jobs stretching | Simple benchmarks that should take minutes ran for hours |
| Indefinite hangs | Some configurations would hang forever without error messages |

The team spent significant engineering time on:

Cluster management: minimizing restart times and reducing idle time between experiments.
Debugging NCCL: analyzing detailed NCCL debug logs to understand communication failures.
Memory profiling: understanding CUDA memory allocator behavior and fragmentation.
Multi-node PP fixes: improving pipeline parallelism performance specifically for multi-node setups.
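Hangs and stretched jobs are easier to live with when every benchmark run has a hard time limit. The sketch below is one possible single-machine approach, not the team's actual harness. It launches a command in its own process group, enables NCCL's debug logging through the `NCCL_DEBUG=INFO` environment variable, and kills the whole group on timeout. It is POSIX-only, the function name is hypothetical, and it does not cover cleanup through a job scheduler such as Slurm.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, extra_env=None):
    """Run cmd; return (status, output) with status in {"ok", "failed", "timeout"}."""
    env = dict(os.environ, NCCL_DEBUG="INFO")  # verbose NCCL logs for debugging
    if extra_env:
        env.update(extra_env)
    proc = subprocess.Popen(
        cmd, env=env, text=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        start_new_session=True,  # new process group, so workers can be killed too
    )
    try:
        output, _ = proc.communicate(timeout=timeout_s)
        status = "ok" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # kill the launcher and its children
        output, _ = proc.communicate()       # reap the process, collect partial logs
        status = "timeout"
    return status, output


if __name__ == "__main__":
    # Stand-in for a training launch command: a script that sleeps too long.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, timeout_s=1)[0])
```

Recording the status and the captured log per configuration lets a sweep move on past a hung run and leaves the NCCL output available for later inspection.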
The real lesson: Reproducing theoretical results on real hardware is hard. What looks simple in a paper often requires careful attention to many moving parts. This is why open-source projects like Nanotron and Picotron are so valuable — they encode hard-won operational knowledge.
Check: What is one of the biggest practical challenges in distributed training benchmarking?

Chapter 6: Config Explorer

This chapter contains two interactive tools for exploring how configuration choices affect memory usage and GPU count requirements.

[Interactive: Configuration Explorer. Set your model size and see recommended parallelism strategies.]

[Interactive: Decision Flowchart. The three-step process visualized.]

Check: At 512+ GPUs with pure ZeRO-3, what typically happens?

Chapter 7: Summary

The three-step recipe:
1. Fit in memory: Start with TP=8 for small models, add PP or ZeRO-3 for larger ones. Use activation recomputation if needed.
2. Hit target batch size: Scale DP and gradient accumulation. Add CP for long sequences.
3. Optimize throughput: Scale TP first (intra-node), then experiment with PP vs. ZeRO-3 inter-node. Benchmark extensively.

| Scale | Typical configuration |
| --- | --- |
| 8 GPUs (1 node) | TP=8 or ZeRO-3 |
| 16–64 GPUs | TP=8 + PP or TP=8 + ZeRO-3 |
| 128–512 GPUs | TP=8 + PP + DP(ZeRO-1/2) |
| 1024+ GPUs | Full 5D: TP=8 + PP + DP(ZeRO-2) + CP (if long seq) + EP (if MoE) |

Final thought: The knowledge to efficiently train at scale used to be locked inside a handful of big labs. This book — and the open-source tools behind it (Nanotron, Picotron) — aim to make distributed training accessible to everyone. Now go build something amazing.
Check: What is the single most important lesson from this chapter?