One GPU can't train a 70B model. Data parallelism, tensor parallelism, pipeline parallelism, ZeRO — how to split computation across hundreds of GPUs without wasting most of them.
You want to train a 70 billion parameter language model. Each parameter needs 2 bytes in fp16, so the weights alone take 140 GB. A top-of-the-line NVIDIA A100 has 80 GB of memory. The model doesn't even fit on a single GPU — and that's before gradients, optimizer states, or activations.
But the problem goes deeper than memory. LLMs are growing roughly 10x per year. GPU FLOPS double every ~2.5 years. The gap between what models demand and what a single chip can deliver is widening exponentially. No amount of waiting for better hardware will close it.
The only solution is parallelism: splitting the work across many GPUs. But how you split matters enormously. Naive approaches waste most of your GPUs' time waiting on each other. This lesson covers the three fundamental strategies — data, tensor, and pipeline parallelism — plus the memory optimization tricks (ZeRO) that make modern LLM training possible.
LLM parameters (orange) grow ~10x/year. GPU memory (teal) grows ~2x every 2.5 years. The gap is the reason parallelism exists.
Data parallelism is the simplest and most widely used strategy. The idea: replicate the entire model on every GPU, but give each GPU a different slice of the training batch. If you have 4 GPUs and a batch of 64 examples, each GPU gets 16 examples, runs forward and backward passes independently, then the GPUs synchronize their gradients before updating the weights.
Because every GPU has the same model and applies the same averaged gradient, the weights stay perfectly synchronized. The effective batch size is N × (per-GPU batch size), so adding GPUs directly increases throughput.
Each GPU (column) holds a full copy of the model. Data is split across GPUs. Watch gradients flow during AllReduce.
| Pros | Cons |
|---|---|
| Simple to implement (PyTorch DDP built-in) | Every GPU must hold the entire model in memory |
| Near-linear throughput scaling | Communication-intensive: all gradients synchronized every step |
| No code changes to model architecture | Can't train models larger than single-GPU memory |
Let's compute the memory footprint for a 7 billion parameter model using mixed-precision training with the Adam optimizer:
| Component | Formula | Bytes | GB |
|---|---|---|---|
| fp16 parameters | 2 × 7B | 14 × 109 | 14 GB |
| fp16 gradients | 2 × 7B | 14 × 109 | 14 GB |
| fp32 optimizer copy | 4 × 7B | 28 × 109 | 28 GB |
| fp32 momentum (Adam) | 4 × 7B | 28 × 109 | 28 GB |
| fp32 variance (Adam) | 4 × 7B | 28 × 109 | 28 GB |
| Total | 16 × P | 112 × 109 | 112 GB |
An 80 GB A100 can't hold 112 GB. And this is before activations (which can easily add another 30-60 GB depending on batch size and sequence length). Data parallelism alone can't train this model.
Data parallelism requires every GPU to end up with the same averaged gradient. The operation that achieves this is called AllReduce. It takes N arrays (one per GPU), sums them element-wise, and distributes the result back to every GPU. Understanding AllReduce is critical because it's the dominant communication cost in distributed training.
The most efficient AllReduce for large messages is ring AllReduce. Arrange the N GPUs in a logical ring. The algorithm runs in two phases, each taking N-1 steps:
Total steps: 2 × (N-1). Each step transfers exactly X/N bytes, where X is the total data size per GPU.
Let's derive the total communication time from first principles:
Where does this come from? In each of the 2(N-1) steps, a GPU sends exactly X/N bytes (one chunk). Multiply steps by chunk size.
If the network bandwidth is B bytes/second, the communication time is:
For large N, the factor (N-1)/N approaches 1, so Tcomm ≈ 2X/B. This is remarkable: the communication time is nearly independent of the number of GPUs. Doubling GPUs doesn't double communication cost. This is why ring AllReduce scales well.
Suppose we have N = 8 GPUs, each holding X = 2 GB of gradients (a 1B parameter model in fp16). Network bandwidth B = 25 GB/s (NVLink).
Over a slower 12.5 GB/s interconnect (PCIe Gen4 x16), the same operation takes 0.28 seconds — twice as long, but still tractable. The key insight: bandwidth matters more than latency for large transfers.
Watch data chunks flow around the ring. Phase 1 (reduce-scatter) adds values. Phase 2 (allgather) distributes the result. Each GPU is a node on the ring.
Data parallelism requires every GPU to hold the entire model. When the model is too large for a single GPU, we need model parallelism: splitting the model's weights across GPUs.
There are two ways to slice a model. You can slice vertically — assign different layers to different GPUs (we'll call this pipeline parallelism, covered in the next chapter). Or you can slice horizontally — split individual layers across GPUs. This horizontal slicing is called tensor parallelism, and the dominant approach for transformers is Megatron-style parallelism.
A transformer layer has two main computations: self-attention and the MLP (feed-forward network). Each involves large matrix multiplications like Y = XW. The key insight: matrix multiplication can be partitioned along the column or row dimension.
Column-parallel: Split weight matrix W into columns [W1, W2] across 2 GPUs. Each GPU computes Yi = X × Wi. The outputs are different slices of Y, so we concatenate them.
Row-parallel: Split W into rows. Each GPU computes a partial sum. The outputs are added with an AllReduce.
Watch how a single matrix multiply Y = X × W gets split across 2 GPUs using column parallelism. Each GPU computes half the output columns.
Tensor parallelism requires an AllReduce after every attention block and every MLP block. For a transformer with L layers, that's 2L AllReduces per forward pass (and another 2L in the backward pass). This is very frequent — which means tensor parallelism only works well when GPUs are connected by extremely fast links like NVLink (900 GB/s) rather than PCIe (64 GB/s) or network (25-100 GB/s).
| Pros | Cons |
|---|---|
| Reduces per-GPU memory (weights split across GPUs) | Very frequent AllReduces (2 per layer per pass) |
| High GPU utilization (all GPUs active simultaneously) | Needs extremely fast interconnect (NVLink) |
| Mathematically equivalent to single-GPU compute | More complex implementation than data parallelism |
What if you don't have NVLink connecting your GPUs? You still need to split a model that doesn't fit in one GPU's memory. The answer is pipeline parallelism: assign different layers to different GPUs, and pass activations forward and gradients backward between them.
The simplest version: put layers 1-8 on GPU 0, layers 9-16 on GPU 1, etc. Feed a batch through. GPU 0 processes its layers, sends activations to GPU 1, then... sits idle while GPU 1 works. GPU 1 finishes, sends to GPU 2, and now both GPU 0 and GPU 1 are idle. This is called the pipeline bubble — the idle time where GPUs wait for data.
With P stages (GPUs) and 1 micro-batch, the bubble fraction is (P-1)/P. With 4 GPUs, that's 75% idle time. Catastrophic.
The solution is to split each batch into M micro-batches and inject them into the pipeline one after another. While GPU 1 processes micro-batch 0, GPU 0 can start on micro-batch 1. The pipeline stays full (mostly).
With M micro-batches and P stages, the bubble fraction becomes:
If M = 4 and P = 4: bubble = 3/7 ≈ 43%. If M = 16 and P = 4: bubble = 3/19 ≈ 16%. More micro-batches = smaller bubble, but each micro-batch is smaller = less compute per chunk = lower arithmetic intensity.
There are two main schedules for when each GPU runs forward and backward passes:
Watch micro-batches flow through pipeline stages. Teal = forward, orange = backward, gray = idle (bubble). Toggle between GPipe and 1F1B to see how scheduling reduces bubbles and memory.
Unlike tensor parallelism, pipeline parallelism only requires point-to-point communication: GPU i sends activations to GPU i+1 (forward), and GPU i+1 sends gradients back to GPU i (backward). No AllReduce needed. This makes pipeline parallelism well-suited for cross-node communication where bandwidth is limited.
| Pros | Cons |
|---|---|
| Reduces per-GPU memory (only holds a few layers) | Pipeline bubbles waste GPU time |
| Minimal communication (point-to-point only) | Complex scheduling (GPipe, 1F1B, interleaved) |
| Works across slow networks | Requires careful load balancing across stages |
Data parallelism replicates everything on every GPU: weights, gradients, and optimizer states. With the 16P rule, a 7B model needs 112 GB per GPU. But think about it: if you have 8 GPUs, you're storing 8 copies of the optimizer states. That's 7 copies of pure redundancy.
ZeRO (Zero Redundancy Optimizer) eliminates this waste by sharding (partitioning) different components across GPUs. It has three stages, each removing more redundancy:
Each GPU only stores 1/N of the optimizer states (fp32 params, momentum, variance). After the backward pass, each GPU updates only its assigned slice of parameters. Then an AllGather broadcasts the updated parameters to all GPUs.
For N=8 GPUs and P=7B: Memory = 4(7) + 12(7)/8 = 28 + 10.5 = 38.5 GB per GPU. Down from 112 GB!
Each GPU only needs the gradients for the parameters it's responsible for updating. Instead of AllReducing the full gradient, we use ReduceScatter — each GPU receives only its 1/N slice of the reduced gradients.
For N=8: Memory = 2(7) + 14(7)/8 = 14 + 12.25 = 26.25 GB per GPU.
The final stage: don't even store all the parameters on every GPU. Each GPU holds only 1/N of the weights. When a layer needs weights that aren't local, an AllGather fetches them just-in-time for the forward or backward pass.
For N=8: Memory = 16(7)/8 = 14 GB per GPU. We went from 112 GB to 14 GB — an 8x reduction.
Adjust model size and GPU count to see memory per GPU at each ZeRO stage. The dashed line shows 80 GB A100 capacity.
| Stage | Shards | Memory (bytes) | 7B @ 8 GPUs |
|---|---|---|---|
| Baseline | Nothing | 16P | 112 GB |
| ZeRO-1 | Optimizer states | 4P + 12P/N | 38.5 GB |
| ZeRO-2 | + Gradients | 2P + 14P/N | 26.25 GB |
| ZeRO-3 | + Parameters | 16P/N | 14 GB |
FSDP(module) and PyTorch handles the sharding, AllGather, and ReduceScatter automatically. It's the easiest way to train models that don't fit on a single GPU.python import torch from torch.distributed.fsdp import FullyShardedDataParallel as FSDP torch.cuda.set_device(device_id) model = MyTransformer() # Define your model model = FSDP(model) # Wrap with FSDP (ZeRO-3) optim = torch.optim.Adam(model.parameters(), lr=1e-4) for batch in dataloader: loss = model(batch).sum() # Forward (AllGather params on-demand) loss.backward() # Backward (ReduceScatter gradients) optim.step() # Update local shard only
Each parallelism strategy has its sweet spot: data parallelism scales throughput but requires full model replicas; tensor parallelism splits layers but needs fast interconnect; pipeline parallelism works across slow networks but has bubbles. The insight behind 3D parallelism (also called PTD: Pipeline-Tensor-Data) is that you can combine all three.
Consider training a model on 64 GPUs organized as a 4 × 4 × 4 grid:
| Dimension | Degree | Scope | Interconnect |
|---|---|---|---|
| Tensor parallelism | 4-way | Within a node (NVLink) | Fastest |
| Pipeline parallelism | 4-way | Across nodes in a rack | Medium (IB) |
| Data parallelism | 4-way | Across racks / clusters | Slowest |
Each node has 4 GPUs connected by NVLink — perfect for tensor parallelism (frequent AllReduces). Nodes are connected by InfiniBand — good for pipeline parallelism (occasional point-to-point). Data parallelism runs across the widest, slowest links because its communication (one AllReduce per step) is the most tolerant of latency.
Hover over a GPU to see its parallelism role. Tensor (green) = splits layers within a node. Pipeline (orange) = different layer groups. Data (blue) = same layers, different data.
Each strategy alone hits limits. Data parallelism alone can't fit large models. Tensor parallelism alone doesn't scale beyond 8 GPUs (NVLink domain). Pipeline parallelism alone has too many bubbles. By combining them, you exploit the hardware topology at every level.
The theory of parallelism is clean. The practice is full of devils. Let's walk through the challenges that engineers face when actually deploying distributed training at scale.
Sometimes you want a large effective batch size but can't fit even a single micro-batch on a GPU. The solution: gradient accumulation. Run K forward-backward passes on tiny micro-batches, accumulating gradients locally, then synchronize once. This reduces communication from K AllReduces to 1, at the cost of K times more local compute per sync.
During the forward pass, we save activations for the backward pass. For a 96-layer transformer with large batch sizes, this can consume 50+ GB. Activation checkpointing (also called gradient checkpointing) trades compute for memory: only save activations at selected layers, and recompute the others during backward. A common strategy: checkpoint every sqrt(L) layers, reducing activation memory from O(L) to O(sqrt(L)) at ~33% more compute.
Not all GPU-to-GPU connections are equal. Within a DGX node, NVLink provides ~900 GB/s bidirectional bandwidth. Across nodes, InfiniBand typically gives 200-400 GB/s. Across racks, bandwidth drops further. A parallelism strategy that ignores topology will bottleneck on the slowest links.
| Link Type | Bandwidth | Best For |
|---|---|---|
| NVLink (intra-node) | ~900 GB/s | Tensor parallelism (frequent AllReduce) |
| InfiniBand (inter-node) | 200-400 GB/s | Pipeline parallelism (point-to-point) |
| Ethernet (cross-rack) | 25-100 GB/s | Data parallelism (infrequent AllReduce) |
With thousands of GPUs running for weeks, hardware failures are inevitable. A single GPU failure can kill the entire training run. Mitigations include: checkpointing model state to disk every N steps (so you can resume from the last checkpoint), elastic training (dynamically removing failed nodes without restarting), and redundant computation (running the same shard on multiple GPUs).
If stage 0 takes 10ms and stage 3 takes 20ms, the whole pipeline runs at the speed of the slowest stage. Uneven layer assignment creates artificial bubbles. Good implementations profile each layer's compute cost and assign layers to stages to minimize the imbalance.
Let's bring it all together. You now understand the three fundamental strategies for splitting computation across GPUs, plus the memory optimization that makes data parallelism viable at scale.
| Strategy | What it splits | Communication | Best link | Memory savings |
|---|---|---|---|---|
| Data parallelism | Input data (batch) | AllReduce (gradients) | Any | None (full replica) |
| Tensor parallelism | Layer weights (horizontal) | AllReduce (activations) | NVLink | Proportional to degree |
| Pipeline parallelism | Layer groups (vertical) | Point-to-point | InfiniBand | Proportional to degree |
| ZeRO-1 | Optimizer states | AllGather (params) | Any | ~3x vs baseline |
| ZeRO-2 | + Gradients | ReduceScatter + AllGather | Any | ~4x vs baseline |
| ZeRO-3 / FSDP | + Parameters | AllGather + ReduceScatter | Any | ~Nx (linear in GPUs) |
| Formula | What it gives you |
|---|---|
| 16P bytes | Baseline memory (mixed-precision Adam, P = params) |
| 16P/N bytes | Memory with ZeRO-3, N GPUs |
| 2(N−1) × X / (NB) | Ring AllReduce time (X = data size, B = bandwidth) |
| (P−1)/(M+P−1) | Pipeline bubble fraction (P stages, M micro-batches) |
| Paper | Contribution |
|---|---|
| Horovod (Sergeev & Balso, 2018) | Ring AllReduce for distributed training |
| Megatron-LM (Shoeybi et al., 2019) | Tensor model parallelism for transformers |
| GPipe (Huang et al., 2019) | Micro-batch pipeline parallelism |
| PipeDream (Narayanan et al., 2019) | 1F1B pipeline scheduling |
| ZeRO (Rajbhandari et al., 2020) | Sharded optimizer states, gradients, parameters |
| Megatron-Turing NLG (Smith et al., 2022) | 3D parallelism at 530B scale |
| Alpa (Zheng et al., OSDI 2022) | Automatic parallelism strategy search |