The three fundamental challenges of distributed LLM training: memory, compute efficiency, and communication.
Thousands of GPUs humming in perfect harmony. That is what it takes to train today's most powerful AI models — a symphony of computing power that until recently was the exclusive domain of elite research labs.
Open source has transformed the landscape. You can download Llama or DeepSeek models. You can read their technical reports. But the most challenging part — the training code, the knowledge and techniques necessary to coordinate GPUs to train these massive systems — remains shrouded in complexity and spread across disconnected papers and private codebases.
This book changes that. Starting from the basics, it walks you through scaling the training of large language models from one GPU to tens, hundreds, and even thousands of GPUs, illustrating theory with practical code examples and reproducible benchmarks.
As the size of GPU clusters has grown, various techniques have been invented to ensure GPUs are highly utilized at all times. This significantly reduces training time and makes the most efficient use of expensive hardware. These distributed training techniques are not only important for building initial models but have also become essential for fine-tuning large models on specialized data.
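Before we scale to many GPUs, let us review what happens on a single GPU. Model training consists of three steps that repeat over and over: a forward pass that runs inputs through the model to produce outputs, a backward pass that computes the gradients, and an optimization step that uses those gradients to update the parameters.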
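The forward pass, backward pass, and optimizer step can be sketched in plain Python on a toy one-parameter model. This is only an illustration under simplifying assumptions: the model `y = w * x`, the squared-error loss, the learning rate, the data, and the function name `train` are all hypothetical, and real training would rely on a framework with automatic differentiation rather than a hand-written gradient.

```python
def train(samples, lr=0.1, steps=50):
    """Fit y = w * x by repeating forward, backward, and optimizer steps."""
    w = 0.0  # single trainable parameter (toy stand-in for model weights)
    loss = float("inf")
    n = len(samples)
    for _ in range(steps):
        # 1. Forward pass: compute predictions and the mean squared error.
        preds = [w * x for x, _ in samples]
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, samples)) / n
        # 2. Backward pass: gradient of the loss with respect to w.
        grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, samples)) / n
        # 3. Optimizer step: plain gradient descent update.
        w -= lr * grad
    return w, loss


# Hypothetical batch of (input, target) pairs generated by y = 2x.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, final_loss = train(batch)
```

Here the whole dataset is used as one batch at every step; the number of samples consumed per step is exactly what the batch size controls.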
The batch size is one of the most important hyperparameters. It affects both model convergence and throughput.
A small batch size can be useful early in training to quickly explore the loss landscape. Further along, however, small batches keep gradients noisy, and the model may fail to converge to its best performance. A large batch size gives accurate gradient estimates, but beyond a certain point each training token contributes less to the update, so convergence per token slows and compute is wasted.
The batch size in tokens can be computed from the batch size in samples as batch_size_tokens = batch_size_samples × seq_len, where seq_len is the input sequence length (e.g., 4096 tokens).
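As a quick check of this relationship (tokens per batch equal samples per batch times sequence length), here is a minimal sketch. The function name and the sample count of 1024 are illustrative assumptions; only the 4096-token sequence length comes from the text.

```python
def batch_size_tokens(batch_size_samples: int, seq_len: int) -> int:
    """Number of tokens processed per optimizer step."""
    return batch_size_samples * seq_len


# Hypothetical example: 1024 samples of 4096 tokens each.
tokens_per_step = batch_size_tokens(batch_size_samples=1024, seq_len=4096)
print(f"{tokens_per_step:,} tokens per step")  # 4,194,304 tokens per step
```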
We run into our first challenge when scaling to large batch sizes: out-of-memory (OOM) issues. What should we do when our GPU does not have enough memory?
Every technique in this book tackles one or several of the following three key challenges. Think of them as the "holy trinity" of distributed training:
| Challenge | Nature | Consequence |
|---|---|---|
| Memory usage | Hard limitation | If a training step does not fit in memory, training cannot proceed at all |
| Compute efficiency | Optimization target | GPUs should spend most of their time computing, not waiting or transferring data |
| Communication overhead | Scaling bottleneck | Idle GPUs waiting for data from other GPUs waste expensive resources |
Let's make these concrete. On a single H100 GPU with 80 GB of memory:
A 7B-parameter model trained in mixed precision needs ~112 GB (about 16 bytes per parameter) just for weights, gradients, and optimizer states, which already exceeds a single GPU. A 70B model needs ~1,120 GB, so at least 14 H100s are required for the model state alone, before any activations.
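These figures can be reproduced with a short calculation. The sketch below assumes one common accounting for mixed-precision training with the Adam optimizer: 16 bytes per parameter, split as shown in the dictionary. That breakdown is an assumption, since the text gives only the totals. It also takes 1 GB to mean 10^9 bytes, and the function names are illustrative.

```python
import math

# Assumed bytes per parameter (mixed precision + Adam); not stated in the text.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def model_state_gb(num_params: float) -> float:
    """Memory for weights, gradients, and optimizer states, in GB (10^9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


def min_gpus_for_state(num_params: float, gpu_mem_gb: float = 80) -> int:
    """Lower bound on GPUs needed to hold the model state, ignoring activations."""
    return math.ceil(model_state_gb(num_params) / gpu_mem_gb)


for name, n in [("7B", 7e9), ("70B", 70e9)]:
    print(name, model_state_gb(n), "GB ->", min_gpus_for_state(n), "x 80 GB GPUs")
```

The GPU counts are lower bounds only: activations, temporary buffers, and memory fragmentation all come on top of the model state.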
The communication challenge emerges once we split work across GPUs. Intra-node communication (NVLink) is fast (~900 GB/s), but inter-node communication (InfiniBand/EFA) is 5–10x slower. Any technique that requires frequent cross-node communication will suffer.
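To get a feel for what this bandwidth gap means, the rough estimate below computes how long it takes to move one full set of gradients over each kind of link. It is a back-of-the-envelope sketch under loud assumptions: gradients for a 7B-parameter model stored at 2 bytes per parameter, a single point-to-point transfer at peak bandwidth, and no latency, collective algorithm overhead, or overlap with compute. The function name is illustrative.

```python
def transfer_seconds(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time: payload size divided by peak bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


grad_bytes = 7e9 * 2   # assumed: 7B parameters, 2-byte gradients
nvlink_gb_s = 900      # intra-node figure quoted in the text

intra_node = transfer_seconds(grad_bytes, nvlink_gb_s)
inter_node_5x = transfer_seconds(grad_bytes, nvlink_gb_s / 5)    # 5x slower link
inter_node_10x = transfer_seconds(grad_bytes, nvlink_gb_s / 10)  # 10x slower link

print(f"intra-node: {intra_node * 1e3:.1f} ms")
print(f"inter-node: {inter_node_5x * 1e3:.1f} to {inter_node_10x * 1e3:.1f} ms")
```

Even in this idealized form, a transfer that takes tens of milliseconds inside a node takes several times longer across nodes, and it recurs at every training step.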
To tackle these three challenges, the distributed training community has developed a rich toolbox of techniques, which this book covers one by one.
The book builds on three foundations:
1. Theory and concepts — understand each method at a high level before code.
2. Clear code implementations — the Picotron repository implements concepts in single, self-contained files. Nanotron provides production-ready code.
3. Real benchmarks — over 4,100 distributed experiments with up to 512 GPUs were run to scan many possible distributed training layouts.
Let's preview the journey. Each technique addresses specific challenges, and they can be combined into what the community calls 5D parallelism:
| Dimension | Splits along | Primary benefit | Communication |
|---|---|---|---|
| Data (DP) | Batch | Scale throughput | All-reduce gradients |
| Tensor (TP) | Hidden dim | Fit large layers | All-reduce / all-gather within layers |
| Context (CP) | Sequence length | Long sequences | Ring Attention for K/V |
| Pipeline (PP) | Model layers | Fit large models | Send activations between stages |
| Expert (EP) | MoE experts | Scale capacity | All-to-all token routing |
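One way to picture how the dimensions combine is to arrange the GPUs in a grid whose size is the product of the parallel degrees. The sketch below is illustrative only: the `Layout` class, its field names, and the rank ordering (TP varying fastest, so that TP peers get adjacent ranks) are assumptions rather than the convention of any particular framework. Expert parallelism is left out of the product because how it shares GPUs with the other dimensions is framework-specific.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical 4D grid of GPUs: data, pipeline, context, tensor."""
    dp: int = 1
    pp: int = 1
    cp: int = 1
    tp: int = 1

    @property
    def world_size(self) -> int:
        # Total GPUs required is the product of all parallel degrees.
        return self.dp * self.pp * self.cp * self.tp

    def coords(self, rank: int) -> dict:
        """Map a global rank to its position along each dimension."""
        if not 0 <= rank < self.world_size:
            raise ValueError(f"rank {rank} outside world of size {self.world_size}")
        # Assumed ordering: TP varies fastest, then CP, then PP, then DP.
        tp_rank = rank % self.tp
        rank //= self.tp
        cp_rank = rank % self.cp
        rank //= self.cp
        pp_rank = rank % self.pp
        dp_rank = rank // self.pp
        return {"dp": dp_rank, "pp": pp_rank, "cp": cp_rank, "tp": tp_rank}


layout = Layout(dp=4, pp=2, cp=1, tp=8)
print(layout.world_size)  # 64
print(layout.coords(9))   # {'dp': 0, 'pp': 1, 'cp': 0, 'tp': 1}
```

With `tp=8` and eight GPUs per node, this ordering keeps each tensor-parallel group inside a single node, where the fast interconnect is available.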
Let's crystallize what we have learned. Distributed training is about making thousands of GPUs work together efficiently on the same model. The three challenges — memory, compute, communication — recur at every scale.
| If you have... | The bottleneck is... | Start with... |
|---|---|---|
| 1 GPU, model fits | Throughput | Gradient accumulation, mixed precision |
| 1 node, model fits | Throughput | Data parallelism (DP=8) |
| 1 node, model too big | Memory | Tensor parallelism or ZeRO-3 |
| Multiple nodes | Communication | PP across nodes + TP within node |
| Very long sequences | Activation memory | Context parallelism |
| MoE architecture | Expert capacity | Expert parallelism |
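The table can also be read as a small decision procedure. The function below is a direct, simplified encoding of its rows, not a tuning recipe: the function name, its parameters, and the default of 8 GPUs per node are illustrative assumptions, and the one case the table does not cover (a single GPU that cannot hold the model) is rejected rather than guessed.

```python
def suggest_start(num_gpus, model_fits_on_one_gpu, gpus_per_node=8,
                  very_long_sequences=False, moe=False):
    """Return the table's starting techniques for a given setup."""
    if num_gpus > gpus_per_node:
        suggestions = ["PP across nodes + TP within node"]
    elif not model_fits_on_one_gpu:
        if num_gpus == 1:
            raise ValueError("case not covered by the table")
        suggestions = ["Tensor parallelism or ZeRO-3"]
    elif num_gpus > 1:
        suggestions = ["Data parallelism"]
    else:
        suggestions = ["Gradient accumulation, mixed precision"]

    # The last two rows apply on top of whichever row matched above.
    if very_long_sequences:
        suggestions.append("Context parallelism")
    if moe:
        suggestions.append("Expert parallelism")
    return suggestions


print(suggest_start(num_gpus=8, model_fits_on_one_gpu=False))
print(suggest_start(num_gpus=64, model_fits_on_one_gpu=False, moe=True))
```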
The authors ran over 4,100 experiments with up to 512 GPUs. The benchmarks show that no single technique dominates — the optimal configuration depends on model size, batch size, sequence length, and hardware topology. Chapter 9 gives a step-by-step recipe.