Tazi et al., Chapter 3

Data Parallelism

Replicate the model, split the data. All-reduce, gradient bucketing, and ZeRO — the workhorse of distributed training.

Prerequisites: Chapter 2 (Training on One GPU). Understanding of gradient accumulation and memory categories.
9
Chapters
3
Simulations
9
Quizzes

Chapter 0: Why Parallelism?

In the last chapter we learned gradient accumulation: run multiple micro-batches sequentially, accumulate their gradients, then update the model. It works, but it is slow — each micro-batch waits for the previous one to finish.

Here is the key observation: the forward and backward passes for each micro-batch are completely independent. They use different input data but the same model weights. If we have 8 GPUs, why not run 8 micro-batches at the same time?

Data parallelism in one sentence: Replicate the entire model on every GPU, give each GPU a different slice of the batch, compute gradients in parallel, and average them before the optimizer step.

This is the simplest and most widely used form of distributed training. It is also the conceptual foundation for everything else in this book. But there are subtleties: how do we average gradients across GPUs efficiently? What if the model state does not fit on one GPU? That is where all-reduce and ZeRO come in.

Check: Data parallelism speeds up training by...

Chapter 1: The All-Reduce

After each GPU finishes its backward pass, it has a set of local gradients. To keep all model replicas in sync, we need to average these gradients across all GPUs before the optimizer step.

The operation that does this is called all-reduce. It takes a tensor from each GPU, sums (or averages) them, and distributes the result back to every GPU. After all-reduce, every GPU has the identical averaged gradient.

GPU 0: grad0
Computed from micro-batch 0
GPU 1: grad1
Computed from micro-batch 1
GPU 2: grad2
Computed from micro-batch 2
↓ all-reduce ↓
All GPUs: (grad0 + grad1 + grad2) / 3
Identical averaged gradient on every GPU

A naive implementation would wait for the entire backward pass to finish, then do one big all-reduce. But that means GPUs sit idle during communication. In distributed training, sequential compute then communicate is a BIG NO-NO.

Key insight: An all-reduce can be decomposed into a reduce-scatter (each GPU gets a different partial sum) followed by an all-gather (every GPU gets the full result). This decomposition will be crucial for understanding ZeRO and tensor parallelism later.
Check: What does all-reduce guarantee after it completes?

Chapter 2: Overlap & Bucketing

The trick to efficient data parallelism is to overlap communication with computation. As soon as gradients for a layer are ready during the backward pass, start communicating them — do not wait for the entire backward pass to finish.

There are three key optimizations that make this work:

OptimizationHow it works
Backward hooksRegister a hook on each parameter that fires as soon as its gradient is computed. The hook starts the all-reduce immediately.
Gradient bucketingGroup small gradients into larger buckets before all-reducing. This amortizes the fixed cost of launching a communication operation.
OverlapWhile the all-reduce for layer N's gradients is in progress, the backward pass for layer N-1 continues computing on the GPU.
The golden rule: Never let a GPU sit idle waiting for communication. Start communicating the moment data is ready, and keep computing while the communication runs in the background.

In a profiler trace, you would see the backward computation on one CUDA stream and the all-reduce communication on another, running in parallel. The bucket size is tunable — too small and you pay overhead per communication call, too large and you delay the overlap.

Check: Why do we use gradient bucketing instead of all-reducing each parameter individually?

Chapter 3: Batch Size Formula

With data parallelism and gradient accumulation combined, the global batch size formula becomes:

BSglobal = BSmicro × DP × grad_acc_steps

where DP is the number of data-parallel replicas (GPUs used for data parallelism), BSmicro is the micro-batch size per GPU, and grad_acc_steps is the number of gradient accumulation steps.

To get the global batch size in tokens, multiply by the sequence length:

BStokens = BSmicro × DP × grad_acc_steps × seq_len
Scaling knobs: You have three levers to hit your target batch size: micro-batch size (limited by GPU memory), DP degree (limited by number of GPUs), and gradient accumulation steps (unlimited but sequential). In practice, maximize DP first since it is parallel, then use grad_acc_steps to fill the gap.

Example: To reach 4M tokens with seq_len=4096, you need 1024 samples. With DP=8 and micro-batch=4, that is 32 samples per step, so you need grad_acc_steps = 1024 / 32 = 32.

Check: With DP=16, micro-batch=2, grad_acc=4, seq_len=4096, what is the global batch size in tokens?

Chapter 4: ZeRO Stage 1

Basic data parallelism replicates everything on every GPU: weights, gradients, and optimizer states. For a 7B model, each GPU needs ~112 GB just for the model state. That is extremely wasteful — we have N copies of the same thing.

The Zero Redundancy Optimizer (ZeRO) eliminates this redundancy by sharding the model state across data-parallel GPUs. There are three stages, each more aggressive than the last.

ZeRO Stage 1 shards the optimizer states across DP replicas.

Without ZeRO
Each GPU: all weights + all gradients + all optimizer states (m, v)
↓ shard optimizer states ↓
ZeRO-1
Each GPU: all weights + all gradients + 1/N of optimizer states

Since Adam's momentum and variance (8 bytes/param) are the largest per-parameter cost, sharding them across N GPUs saves significant memory. After each optimizer step, each GPU updates only its shard of parameters, then broadcasts the updates.

Memory savings for ZeRO-1: For a 7B model with DP=8, optimizer states drop from 56 GB to 7 GB per GPU. Weights (14 GB) and gradients (14 GB) are still replicated.
Check: What does ZeRO Stage 1 shard across GPUs?

Chapter 5: ZeRO Stages 2 & 3

ZeRO Stage 2 shards both optimizer states and gradients. Instead of all-reducing all gradients to every GPU, we use reduce-scatter: each GPU ends up with only 1/N of the averaged gradients — exactly the shard it needs for its optimizer step.

ZeRO Stage 3 goes all the way: it shards optimizer states, gradients, and model weights across GPUs. No GPU holds a complete copy of anything. Before each forward or backward pass through a layer, the GPU gathers that layer's weights from the other GPUs, uses them, then discards them.

StageShardsMemory per GPUCommunication
ZeRO-0 (none)Nothing16P bytesAll-reduce gradients
ZeRO-1Optimizer states~8P + 4P/N bytesAll-reduce grads + broadcast params
ZeRO-2Optimizer + grads~4P + 8P/N bytesReduce-scatter grads + broadcast params
ZeRO-3Everything~16P/N bytesAll-gather weights before each layer
The ZeRO-3 trade-off: Memory is near-perfect — 1/N of everything. But communication is heavy: we must gather every layer's weights before using them. This works well when communication is fast (intra-node NVLink) but becomes a bottleneck across nodes.
ZeRO-3 vs. Pipeline Parallelism: Both distribute the model across GPUs. ZeRO-3 gathers weights on demand; PP sends activations between pipeline stages. ZeRO-3 is model-agnostic; PP requires scheduling. The right choice depends on whether weight communication or activation communication is cheaper for your setup.
Check: ZeRO-3 achieves near-perfect memory scaling. What is the cost?

Chapter 6: ZeRO in Action

Let us see how much memory ZeRO actually saves. The interactive diagram below shows per-GPU memory for a model across different ZeRO stages and DP degrees.

ZeRO Memory Savings

Adjust model size and DP degree. Click ZeRO stages to compare.

Model (B params) 7
DP degree 8

Notice how ZeRO-1 already cuts optimizer memory dramatically, while ZeRO-3 divides everything by the DP degree. At DP=64, even a 70B model's state fits on a single GPU.

Practical note: ZeRO-3 (FSDP in PyTorch) can prefetch the next layer's weights while computing the current layer, hiding most of the all-gather latency. This overlap is critical for keeping throughput high.
Check: With ZeRO-3 and DP=8, how much model-state memory does each GPU hold for a 7B model?

Chapter 7: DP Scaling Simulator

Data parallelism scaling is not free. As we add more GPUs, the all-reduce communication grows. Let us simulate how throughput scales with DP degree.

Data Parallelism Scaling

Watch how total throughput grows with DP, and how per-GPU efficiency drops due to communication overhead.

DP degree 8
Comm. overhead % 5
Key insight: With perfect overlap, communication overhead is hidden and scaling is near-linear. But at very high DP degrees (512+ GPUs), even small per-GPU overhead accumulates. This is one reason why pure DP does not scale infinitely, and we need to combine it with other parallelism methods.
Check: At very high DP degrees, what limits further scaling?

Chapter 8: Summary

Data parallelism is the workhorse of distributed training. Combined with ZeRO, it handles most training scenarios up to moderate scale.

TechniqueWhat it doesWhen to use
Basic DPReplicate model, split data, all-reduce gradientsModel fits on one GPU
DP + ZeRO-1Shard optimizer statesOptimizer states are the memory bottleneck
DP + ZeRO-2Shard optimizer + gradientsNeed more memory savings
DP + ZeRO-3 (FSDP)Shard everythingModel too large for one GPU
What comes next: Data parallelism and ZeRO can shard model state across GPUs, but activations remain largely unsharded. When models get so large that even a single layer's activations and weights exceed what a single node can handle, we need tensor parallelism — splitting individual weight matrices across GPUs.
Check: When would you choose ZeRO-3 over basic data parallelism?