Tazi et al., Chapter 1

Overview

The three fundamental challenges of distributed LLM training: memory, compute efficiency, and communication.

Prerequisites: Basic familiarity with deep learning training loops (forward pass, backward pass, optimizer step).

Contents: 6 chapters, 1 simulation, 6 quizzes

Chapter 0: Why Scale?

Thousands of GPUs humming in perfect harmony. That is what it takes to train today's most powerful AI models — a symphony of computing power that until recently was the exclusive domain of elite research labs.

Open source has transformed the landscape. You can download Llama or DeepSeek models. You can read their technical reports. But the most challenging part — the training code, the knowledge and techniques necessary to coordinate GPUs to train these massive systems — remains shrouded in complexity and spread across disconnected papers and private codebases.

This book changes that. Starting from the basics, it walks you through scaling the training of large language models from one GPU to tens, hundreds, and even thousands of GPUs, illustrating theory with practical code examples and reproducible benchmarks.

The core promise: By the end of this book you will understand every major distributed training technique — data parallelism, tensor parallelism, pipeline parallelism, context parallelism, expert parallelism, and ZeRO — and how they all combine into the "5D parallelism" framework used to train frontier models.

As the size of GPU clusters has grown, various techniques have been invented to ensure GPUs are highly utilized at all times. This significantly reduces training time and makes the most efficient use of expensive hardware. These distributed training techniques are not only important for building initial models but have also become essential for fine-tuning large models on specialized data.

Check: What is the main barrier to training LLMs at scale today?

Chapter 1: The Training Loop

Before we scale to many GPUs, let us review what happens on a single GPU. Model training consists of three steps that repeat over and over:

1. Forward pass: pass the input through the model to get predictions.
2. Backward pass: compute gradients of the loss with respect to every parameter.
3. Optimizer step: use the gradients to update the parameters (e.g., with Adam).
↻ repeat
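
The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the one-parameter linear model, the hand-written gradient, the learning rate, and the use of plain SGD instead of Adam are all choices made here for brevity, not something taken from the book's code.

```python
# Toy single-device training loop: fit y = w * x with plain SGD.
# Model, data, and hyperparameters are illustrative assumptions.

def forward(w, xs):
    """Forward pass: compute predictions for a batch of inputs."""
    return [w * x for x in xs]


def backward(xs, ys, preds):
    """Backward pass: gradient of the mean squared error with respect to w."""
    n = len(xs)
    return sum(2.0 * (p - y) * x for x, y, p in zip(xs, ys, preds)) / n


def optimizer_step(w, grad, lr):
    """Optimizer step: plain SGD update (Adam would also track moment estimates)."""
    return w - lr * grad


def train(xs, ys, w=0.0, lr=0.05, steps=200):
    for _ in range(steps):
        preds = forward(w, xs)            # 1. forward pass
        grad = backward(xs, ys, preds)    # 2. backward pass
        w = optimizer_step(w, grad, lr)   # 3. optimizer step
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]  # targets generated with a true weight of 3.0
w_fit = train(xs, ys)
print(round(w_fit, 4))
```

In a real LLM run the same loop shape holds, but each step operates on billions of parameters, which is where the memory and communication questions in the following chapters come from.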

The batch size is one of the most important hyperparameters. It affects both model convergence and throughput.

A small batch size can be useful early in training to move quickly through the loss landscape. Further along, however, small batches keep gradients noisy, and the model may fail to converge to its best performance. A large batch size gives accurate gradient estimates, but past a certain point each training token contributes less to the update, so convergence slows and compute is wasted.

Key insight: In LLM pretraining, batch sizes are commonly reported in tokens rather than samples. A sweet spot is typically 4–60 million tokens per batch. Llama 1 used ~4M tokens per batch; DeepSeek-V3 used ~60M tokens per batch.

The batch size in tokens can be computed from the batch size in samples as:

$$\mathrm{BS}_{\text{tokens}} = \mathrm{BS}_{\text{samples}} \times \text{seq\_len}$$

where seq_len is the input sequence length (e.g., 4096 tokens).
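
As a quick worked example of this conversion, the sketch below computes the token batch size and, in the other direction, how many samples fit a given token budget. The function names and the example values (1,024 samples, a 4M-token budget) are hypothetical and chosen only to land near the low end of the range quoted above.

```python
def batch_size_in_tokens(bs_samples: int, seq_len: int) -> int:
    """Tokens processed per optimizer step."""
    return bs_samples * seq_len


def samples_for_token_budget(target_tokens: int, seq_len: int) -> int:
    """Largest whole number of samples that stays within a token budget."""
    return target_tokens // seq_len


tokens = batch_size_in_tokens(bs_samples=1024, seq_len=4096)
print(tokens)  # 4194304, i.e. roughly 4M tokens

samples = samples_for_token_budget(target_tokens=4_000_000, seq_len=4096)
print(samples)  # 976
```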

We run into our first challenge when scaling to large batch sizes: out-of-memory (OOM) issues. What should we do when our GPU does not have enough memory?

Check: Why are LLM batch sizes measured in tokens rather than samples?

Chapter 2: Three Challenges

Every technique in this book tackles one or several of the following three key challenges. Think of them as the "holy trinity" of distributed training:

| Challenge | Nature | Consequence |
| --- | --- | --- |
| Memory usage | Hard limitation | If a training step does not fit in memory, training cannot proceed at all |
| Compute efficiency | Optimization target | GPUs should spend most time computing, not waiting or transferring data |
| Communication overhead | Scaling bottleneck | Idle GPUs waiting for data from other GPUs waste expensive resources |

The fundamental trade-off: In many places, you can trade one of these three resources (computation, communication, memory) against another. For example, activation recomputation spends extra compute to save memory, and tensor parallelism accepts more communication in exchange for spreading memory across GPUs. Finding the right balance is the key to scaling training.

Let's make these concrete. On a single H100 GPU with 80 GB of memory:

A 7B-parameter model trained in mixed precision needs roughly 16 bytes per parameter, or ~112 GB, just for weights, gradients, and optimizer states, which already exceeds a single GPU. A 70B model needs ~1,120 GB, requiring at least 14 H100s for the model state alone, before any activations.
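
The arithmetic behind these figures can be reproduced with a short sketch. The 16 bytes per parameter is an assumption about how the totals break down (BF16 weights and gradients, an FP32 master copy of the weights, and two FP32 Adam moments); it matches the ~112 GB and ~1,120 GB quoted above, but other precision setups would give different numbers. The function names are hypothetical.

```python
import math

# Assumed per-parameter cost in bytes for mixed-precision training with Adam:
# BF16 weights (2) + BF16 gradients (2) + FP32 master weights (4)
# + FP32 Adam momentum (4) + FP32 Adam variance (4) = 16
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4


def model_state_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory for weights, gradients, and optimizer states, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_model_state(num_params: int, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count, ignoring activations and fragmentation."""
    return math.ceil(model_state_gb(num_params) / gpu_memory_gb)


print(model_state_gb(7_000_000_000))             # 112.0
print(model_state_gb(70_000_000_000))            # 1120.0
print(min_gpus_for_model_state(70_000_000_000))  # 14
```

The GPU count is only a floor: activations, temporary buffers, and communication workspace all come on top of the model state.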

The communication challenge emerges once we split work across GPUs. Intra-node communication (NVLink) is fast (~900 GB/s), but inter-node communication (InfiniBand/EFA) is 5–10x slower. Any technique that requires frequent cross-node communication will suffer.
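
To get a feel for the gap, the sketch below estimates how long it takes just to move one copy of a 7B model's BF16 gradients over each kind of link. It is a deliberately crude model: it assumes ~900 GB/s for NVLink as stated above and an inter-node link of ~100 GB/s (a value chosen inside the 5–10x range), and it ignores latency, the all-reduce algorithm, and any overlap with compute. The function name is hypothetical.

```python
def transfer_seconds(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Idealized time to push num_bytes through a link of the given bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# One full set of BF16 gradients for a 7B-parameter model (2 bytes each).
grad_bytes = 7_000_000_000 * 2

intra_node = transfer_seconds(grad_bytes, bandwidth_gb_per_s=900)  # NVLink
inter_node = transfer_seconds(grad_bytes, bandwidth_gb_per_s=100)  # assumed inter-node link

print(f"intra-node: {intra_node * 1000:.1f} ms")
print(f"inter-node: {inter_node * 1000:.1f} ms")
print(f"slowdown:   {inter_node / intra_node:.1f}x")
```

Even in this idealized form, the same payload takes about nine times longer across nodes, which is why communication-heavy techniques are usually kept inside a node.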

Speed hierarchy: GPU compute (~1000 TFLOPS) >> intra-node bandwidth (~900 GB/s via NVLink) >> inter-node bandwidth (~100 GB/s via InfiniBand). Distributed training strategies must respect this hierarchy.

Check: Which of the three challenges is a hard limitation (versus an optimization target)?

Chapter 3: The Toolbox

To tackle these three challenges, the distributed training community has developed a rich toolbox of techniques. Here is a bird's-eye view of what we will cover in this book:

Chapter 2: Single-GPU Techniques
Activation recomputation, gradient accumulation, mixed precision
↓ not enough memory?
Chapter 3: Data Parallelism + ZeRO
Replicate model, split data. ZeRO shards optimizer/grads/params
↓ model too large for one GPU?
Chapter 4: Tensor Parallelism
Split weight matrices across GPUs within a node
↓ sequences too long?
Chapter 5: Context Parallelism
Split sequences across GPUs, Ring Attention for KV exchange
↓ model spans multiple nodes?
Chapter 6: Pipeline Parallelism
Split layers across nodes, minimize idle bubbles
↓ MoE architecture?
Chapter 7: Expert Parallelism
Distribute expert FFN blocks across GPUs
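
The data-parallel entry in the overview above (replicate the model, split the data) can be illustrated without any GPUs. The sketch below is a single-process simulation under stated assumptions: each "replica" is just a function call on its shard of the batch, and the all-reduce is replaced by a hypothetical `all_reduce_mean` helper that averages a Python list. It shows that, with equal-sized shards and a mean loss, averaging the per-replica gradients reproduces the full-batch gradient.

```python
def local_gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x on one shard."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_gradients(w, xs, ys, num_replicas):
    """Split the batch evenly, compute per-replica gradients, then synchronize."""
    if len(xs) % num_replicas != 0:
        raise ValueError("this sketch assumes the batch divides evenly")
    shard = len(xs) // num_replicas
    grads = [
        local_gradient(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(num_replicas)
    ]
    return all_reduce_mean(grads)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]

synced = data_parallel_gradients(0.5, xs, ys, num_replicas=4)
full = local_gradient(0.5, xs, ys)
print(synced[0], full)
```

Because every replica ends up with the same averaged gradient, all copies of the model take the same optimizer step and stay identical.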

The book builds on three foundations:

1. Theory and concepts — understand each method at a high level before code.

2. Clear code implementations — the Picotron repository implements concepts in single, self-contained files. Nanotron provides production-ready code.

3. Real benchmarks — over 4,100 distributed experiments with up to 512 GPUs were run to scan many possible distributed training layouts.

Check: If your model weights do not fit on a single GPU node, which technique addresses this most directly?

Chapter 4: Roadmap

Let's preview the journey. Each technique addresses specific challenges, and they can be combined into what the community calls 5D parallelism:

| Dimension | Splits along | Primary benefit | Communication |
| --- | --- | --- | --- |
| Data (DP) | Batch | Scale throughput | All-reduce gradients |
| Tensor (TP) | Hidden dim | Fit large layers | All-reduce / all-gather within layers |
| Context (CP) | Sequence length | Long sequences | Ring Attention for K/V |
| Pipeline (PP) | Model layers | Fit large models | Send activations between stages |
| Expert (EP) | MoE experts | Scale capacity | All-to-all token routing |

Why 5D? No single technique is a silver bullet. Data parallelism hits communication limits at scale. Tensor parallelism depends on fast intra-node links. Pipeline parallelism introduces idle bubbles. The art of distributed training is combining them so their strengths cover each other's weaknesses.
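
One way to picture how dimensions combine is to treat the GPUs as a grid and give each rank a coordinate per dimension. The sketch below does this for three of the five dimensions (DP, PP, TP) to keep it short. The ordering, with TP varying fastest so that a TP group occupies consecutive ranks, is an assumption made here for illustration; real frameworks choose their own layouts, and the names `Coord`, `rank_to_coord`, and `tp_group` are hypothetical.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    dp: int  # which data-parallel replica
    pp: int  # which pipeline stage
    tp: int  # which tensor-parallel shard


def rank_to_coord(rank: int, dp_size: int, pp_size: int, tp_size: int) -> Coord:
    """Map a flat rank to grid coordinates, with TP as the fastest-varying axis."""
    world_size = dp_size * pp_size * tp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world size {world_size}")
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return Coord(dp=dp, pp=pp, tp=tp)


def tp_group(rank: int, tp_size: int) -> List[int]:
    """Ranks that share this rank's DP and PP coordinates (its TP peers)."""
    start = rank - rank % tp_size
    return list(range(start, start + tp_size))


# 16 GPUs arranged as 2 replicas x 2 pipeline stages x 4 tensor shards.
print(rank_to_coord(13, dp_size=2, pp_size=2, tp_size=4))  # Coord(dp=1, pp=1, tp=1)
print(tp_group(13, tp_size=4))                             # [12, 13, 14, 15]
```

If ranks are assigned node by node, keeping TP peers on consecutive ranks tends to place the most communication-heavy group on the fastest links, in line with the speed hierarchy described earlier.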

[Interactive figure: 5D Parallelism Overview. Each of the five parallelism dimensions can be toggled to show what it splits and what communication it requires.]

Check: Which parallelism dimension splits along the model's layer depth?

Chapter 5: Summary

Let's crystallize what we have learned. Distributed training is about making thousands of GPUs work together efficiently on the same model. The three challenges — memory, compute, communication — recur at every scale.

| If you have... | The bottleneck is... | Start with... |
| --- | --- | --- |
| 1 GPU, model fits | Throughput | Gradient accumulation, mixed precision |
| 1 node, model fits | Throughput | Data parallelism (DP=8) |
| 1 node, model too big | Memory | Tensor parallelism or ZeRO-3 |
| Multiple nodes | Communication | PP across nodes + TP within node |
| Very long sequences | Activation memory | Context parallelism |
| MoE architecture | Expert capacity | Expert parallelism |

The story ahead: Chapter 2 tackles single-GPU optimization. Then we progressively add one parallelism dimension at a time until we reach the full 5D framework. Each technique builds on what came before.

The authors ran over 4,100 experiments with up to 512 GPUs. The benchmarks show that no single technique dominates — the optimal configuration depends on model size, batch size, sequence length, and hardware topology. Chapter 9 gives a step-by-step recipe.

Check: Can a single parallelism technique solve all scaling challenges?