CS 229s — Systems for Machine Learning

Parallelism Fundamentals

One GPU can't train a 70B model. Data parallelism, tensor parallelism, pipeline parallelism, ZeRO — how to split computation across hundreds of GPUs without wasting most of them.

Prerequisites: Neural network training + Basic distributed concepts. That's it.
9
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: Why Parallelism?

You want to train a 70 billion parameter language model. Each parameter needs 2 bytes in fp16, so the weights alone take 140 GB. A top-of-the-line NVIDIA A100 has 80 GB of memory. The model doesn't even fit on a single GPU — and that's before gradients, optimizer states, or activations.

But the problem goes deeper than memory. LLMs are growing roughly 10x per year. GPU FLOPS double every ~2.5 years. The gap between what models demand and what a single chip can deliver is widening exponentially. No amount of waiting for better hardware will close it.

The only solution is parallelism: splitting the work across many GPUs. But how you split matters enormously. Naive approaches waste most of your GPUs' time waiting on each other. This lesson covers the three fundamental strategies — data, tensor, and pipeline parallelism — plus the memory optimization tricks (ZeRO) that make modern LLM training possible.

The scaling gap: GPT-3 (2020) had 175B parameters. GPT-4 is estimated at over 1 trillion. Training compute doubled roughly every 6 months from 2020-2024. No single GPU — not even the next generation — can keep up alone.
The Scaling Gap

LLM parameters (orange) grow ~10x/year. GPU memory (teal) grows ~2x every 2.5 years. The gap is the reason parallelism exists.

Three ways to split: You can split the data (each GPU trains on different examples), split the model (each GPU holds different weights), or split the pipeline (each GPU handles different layers at different times). Real systems combine all three.
Check: Why can't we just use a single GPU to train a 70B parameter model?

Chapter 1: Data Parallelism

Data parallelism is the simplest and most widely used strategy. The idea: replicate the entire model on every GPU, but give each GPU a different slice of the training batch. If you have 4 GPUs and a batch of 64 examples, each GPU gets 16 examples, runs forward and backward passes independently, then the GPUs synchronize their gradients before updating the weights.

How it works, step by step

1. Replicate
Copy all model weights W to every GPU
2. Partition data
Split batch B into sub-batches B0, B1, ..., BN-1
3. Forward + backward
Each GPU computes loss and gradients on its sub-batch independently
4. AllReduce gradients
Average gradients across all GPUs (every GPU ends up with the same averaged gradient)
5. Update
Each GPU applies the identical gradient update → weights stay synchronized
↻ repeat

Because every GPU has the same model and applies the same averaged gradient, the weights stay perfectly synchronized. The effective batch size is N × (per-GPU batch size), so adding GPUs directly increases throughput.

The math is identical to single-GPU training. Averaging gradients across N GPUs with sub-batches of size B/N gives the same result as computing the gradient on the full batch B. The only difference is communication cost — every gradient must be shared.
Data Parallelism Visualizer

Each GPU (column) holds a full copy of the model. Data is split across GPUs. Watch gradients flow during AllReduce.

Pros and Cons

ProsCons
Simple to implement (PyTorch DDP built-in)Every GPU must hold the entire model in memory
Near-linear throughput scalingCommunication-intensive: all gradients synchronized every step
No code changes to model architectureCan't train models larger than single-GPU memory

Worked Example: Memory for a 7B Model

Let's compute the memory footprint for a 7 billion parameter model using mixed-precision training with the Adam optimizer:

ComponentFormulaBytesGB
fp16 parameters2 × 7B14 × 10914 GB
fp16 gradients2 × 7B14 × 10914 GB
fp32 optimizer copy4 × 7B28 × 10928 GB
fp32 momentum (Adam)4 × 7B28 × 10928 GB
fp32 variance (Adam)4 × 7B28 × 10928 GB
Total16 × P112 × 109112 GB

An 80 GB A100 can't hold 112 GB. And this is before activations (which can easily add another 30-60 GB depending on batch size and sequence length). Data parallelism alone can't train this model.

The 16P rule: With mixed-precision Adam training, total memory per GPU = 16 × P bytes (where P = number of parameters). For a 7B model, that's 112 GB. For 70B, it's 1.12 TB. This is the fundamental bottleneck that ZeRO and model parallelism address.
Check: In data parallelism with 8 GPUs and a global batch size of 256, how many examples does each GPU process?

Chapter 2: AllReduce & Communication

Data parallelism requires every GPU to end up with the same averaged gradient. The operation that achieves this is called AllReduce. It takes N arrays (one per GPU), sums them element-wise, and distributes the result back to every GPU. Understanding AllReduce is critical because it's the dominant communication cost in distributed training.

The Ring AllReduce Algorithm

The most efficient AllReduce for large messages is ring AllReduce. Arrange the N GPUs in a logical ring. The algorithm runs in two phases, each taking N-1 steps:

Phase 1: Reduce-Scatter
N-1 steps. Each GPU sends one chunk to its neighbor, receives one chunk, and adds the received chunk to its own. After this phase, each GPU holds 1/N of the fully reduced result.
Phase 2: AllGather
N-1 steps. Each GPU sends its completed chunk to its neighbor, receives a completed chunk, and replaces its local copy. After this phase, every GPU has the complete reduced result.

Total steps: 2 × (N-1). Each step transfers exactly X/N bytes, where X is the total data size per GPU.

Deriving the Communication Cost

Let's derive the total communication time from first principles:

Total data sent per GPU = 2 × (N−1) × (X / N) bytes

Where does this come from? In each of the 2(N-1) steps, a GPU sends exactly X/N bytes (one chunk). Multiply steps by chunk size.

If the network bandwidth is B bytes/second, the communication time is:

Tcomm = 2 × (N−1) × X / (N × B) seconds

For large N, the factor (N-1)/N approaches 1, so Tcomm ≈ 2X/B. This is remarkable: the communication time is nearly independent of the number of GPUs. Doubling GPUs doesn't double communication cost. This is why ring AllReduce scales well.

Why ring AllReduce is bandwidth-optimal: A naive approach (send everything to one GPU, reduce, broadcast back) has O(NX) total data through one bottleneck. Ring AllReduce spreads the load evenly: every GPU sends and receives the same amount. The total communication time is O(X/B) regardless of N.

Worked Example

Suppose we have N = 8 GPUs, each holding X = 2 GB of gradients (a 1B parameter model in fp16). Network bandwidth B = 25 GB/s (NVLink).

Tcomm = 2 × 7 × 2 / (8 × 25) = 28 / 200 = 0.14 seconds

Over a slower 12.5 GB/s interconnect (PCIe Gen4 x16), the same operation takes 0.28 seconds — twice as long, but still tractable. The key insight: bandwidth matters more than latency for large transfers.

Ring AllReduce Simulator

Watch data chunks flow around the ring. Phase 1 (reduce-scatter) adds values. Phase 2 (allgather) distributes the result. Each GPU is a node on the ring.

Ready. 4 GPUs, 4 chunks each.
Check: Why does ring AllReduce scale well to many GPUs?

Chapter 3: Tensor (Model) Parallelism

Data parallelism requires every GPU to hold the entire model. When the model is too large for a single GPU, we need model parallelism: splitting the model's weights across GPUs.

There are two ways to slice a model. You can slice vertically — assign different layers to different GPUs (we'll call this pipeline parallelism, covered in the next chapter). Or you can slice horizontally — split individual layers across GPUs. This horizontal slicing is called tensor parallelism, and the dominant approach for transformers is Megatron-style parallelism.

How Megatron Splits a Transformer Layer

A transformer layer has two main computations: self-attention and the MLP (feed-forward network). Each involves large matrix multiplications like Y = XW. The key insight: matrix multiplication can be partitioned along the column or row dimension.

Column-parallel: Split weight matrix W into columns [W1, W2] across 2 GPUs. Each GPU computes Yi = X × Wi. The outputs are different slices of Y, so we concatenate them.

Row-parallel: Split W into rows. Each GPU computes a partial sum. The outputs are added with an AllReduce.

Megatron's trick: In the MLP block (two linear layers with a GeLU in between), Megatron uses column-parallel on the first layer and row-parallel on the second. This means the GeLU activation can be computed locally on each GPU (no communication). Only one AllReduce is needed after the second layer. The attention block uses a similar trick — split heads across GPUs, then AllReduce after the output projection.
Tensor Parallelism: Splitting a Matrix Multiply

Watch how a single matrix multiply Y = X × W gets split across 2 GPUs using column parallelism. Each GPU computes half the output columns.

Communication Pattern

Tensor parallelism requires an AllReduce after every attention block and every MLP block. For a transformer with L layers, that's 2L AllReduces per forward pass (and another 2L in the backward pass). This is very frequent — which means tensor parallelism only works well when GPUs are connected by extremely fast links like NVLink (900 GB/s) rather than PCIe (64 GB/s) or network (25-100 GB/s).

ProsCons
Reduces per-GPU memory (weights split across GPUs)Very frequent AllReduces (2 per layer per pass)
High GPU utilization (all GPUs active simultaneously)Needs extremely fast interconnect (NVLink)
Mathematically equivalent to single-GPU computeMore complex implementation than data parallelism
Rule of thumb: Tensor parallelism is used within a single server node where GPUs share NVLink. Across nodes (connected by slower InfiniBand or Ethernet), you use pipeline or data parallelism instead. This is why the choice of parallelism strategy is tightly coupled to your network topology.
Check: Why does tensor parallelism require fast interconnect between GPUs?

Chapter 4: Pipeline Parallelism

What if you don't have NVLink connecting your GPUs? You still need to split a model that doesn't fit in one GPU's memory. The answer is pipeline parallelism: assign different layers to different GPUs, and pass activations forward and gradients backward between them.

The Naive Approach and the Bubble Problem

The simplest version: put layers 1-8 on GPU 0, layers 9-16 on GPU 1, etc. Feed a batch through. GPU 0 processes its layers, sends activations to GPU 1, then... sits idle while GPU 1 works. GPU 1 finishes, sends to GPU 2, and now both GPU 0 and GPU 1 are idle. This is called the pipeline bubble — the idle time where GPUs wait for data.

With P stages (GPUs) and 1 micro-batch, the bubble fraction is (P-1)/P. With 4 GPUs, that's 75% idle time. Catastrophic.

The Fix: Micro-batching

The solution is to split each batch into M micro-batches and inject them into the pipeline one after another. While GPU 1 processes micro-batch 0, GPU 0 can start on micro-batch 1. The pipeline stays full (mostly).

With M micro-batches and P stages, the bubble fraction becomes:

Bubble fraction = (P − 1) / (M + P − 1)

If M = 4 and P = 4: bubble = 3/7 ≈ 43%. If M = 16 and P = 4: bubble = 3/19 ≈ 16%. More micro-batches = smaller bubble, but each micro-batch is smaller = less compute per chunk = lower arithmetic intensity.

Scheduling: GPipe vs 1F1B

There are two main schedules for when each GPU runs forward and backward passes:

GPipe schedule: Run ALL forward passes for all micro-batches first, then ALL backward passes. Simple to implement, but requires storing activations for all micro-batches simultaneously — huge memory cost.
1F1B (one-forward-one-backward): After the pipeline fills up, alternate between one forward and one backward pass. This limits the number of in-flight micro-batches, drastically reducing peak memory. Most modern systems (like Megatron-LM) use 1F1B.
Why 1F1B wins on memory: GPipe must hold activations for ALL M micro-batches before backward starts. 1F1B only holds activations for at most P micro-batches at any time. For M=32 and P=4, that's 8x less activation memory.
Pipeline Parallelism Showcase

Watch micro-batches flow through pipeline stages. Teal = forward, orange = backward, gray = idle (bubble). Toggle between GPipe and 1F1B to see how scheduling reduces bubbles and memory.

Micro-batches M4
Stages: 4 Bubble: 43%

Communication Pattern

Unlike tensor parallelism, pipeline parallelism only requires point-to-point communication: GPU i sends activations to GPU i+1 (forward), and GPU i+1 sends gradients back to GPU i (backward). No AllReduce needed. This makes pipeline parallelism well-suited for cross-node communication where bandwidth is limited.

ProsCons
Reduces per-GPU memory (only holds a few layers)Pipeline bubbles waste GPU time
Minimal communication (point-to-point only)Complex scheduling (GPipe, 1F1B, interleaved)
Works across slow networksRequires careful load balancing across stages
Check: With 4 pipeline stages and 8 micro-batches, what is the bubble fraction?

Chapter 5: ZeRO — Sharding the Redundancy

Data parallelism replicates everything on every GPU: weights, gradients, and optimizer states. With the 16P rule, a 7B model needs 112 GB per GPU. But think about it: if you have 8 GPUs, you're storing 8 copies of the optimizer states. That's 7 copies of pure redundancy.

ZeRO (Zero Redundancy Optimizer) eliminates this waste by sharding (partitioning) different components across GPUs. It has three stages, each removing more redundancy:

ZeRO Stage 1: Shard Optimizer States

Each GPU only stores 1/N of the optimizer states (fp32 params, momentum, variance). After the backward pass, each GPU updates only its assigned slice of parameters. Then an AllGather broadcasts the updated parameters to all GPUs.

Memory = 2P + 2P + 12P/N = 4P + 12P/N

For N=8 GPUs and P=7B: Memory = 4(7) + 12(7)/8 = 28 + 10.5 = 38.5 GB per GPU. Down from 112 GB!

ZeRO Stage 2: + Shard Gradients

Each GPU only needs the gradients for the parameters it's responsible for updating. Instead of AllReducing the full gradient, we use ReduceScatter — each GPU receives only its 1/N slice of the reduced gradients.

Memory = 2P + 2P/N + 12P/N = 2P + 14P/N

For N=8: Memory = 2(7) + 14(7)/8 = 14 + 12.25 = 26.25 GB per GPU.

ZeRO Stage 3: + Shard Parameters

The final stage: don't even store all the parameters on every GPU. Each GPU holds only 1/N of the weights. When a layer needs weights that aren't local, an AllGather fetches them just-in-time for the forward or backward pass.

Memory = 16P/N

For N=8: Memory = 16(7)/8 = 14 GB per GPU. We went from 112 GB to 14 GB — an 8x reduction.

The tradeoff: ZeRO-3 adds extra communication (1.5x vs ZeRO-1). Each forward and backward pass requires AllGather calls to fetch parameters on-demand. But the memory savings are transformative — it lets you train models that are N times larger.
ZeRO Memory Calculator

Adjust model size and GPU count to see memory per GPU at each ZeRO stage. The dashed line shows 80 GB A100 capacity.

Parameters (B)7B
GPUs (N)8

Summary Table

StageShardsMemory (bytes)7B @ 8 GPUs
BaselineNothing16P112 GB
ZeRO-1Optimizer states4P + 12P/N38.5 GB
ZeRO-2+ Gradients2P + 14P/N26.25 GB
ZeRO-3+ Parameters16P/N14 GB
PyTorch FSDP (Fully Sharded Data Parallel) implements ZeRO-3 natively. Wrap your model with FSDP(module) and PyTorch handles the sharding, AllGather, and ReduceScatter automatically. It's the easiest way to train models that don't fit on a single GPU.
python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

torch.cuda.set_device(device_id)
model = MyTransformer()           # Define your model
model = FSDP(model)               # Wrap with FSDP (ZeRO-3)
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

for batch in dataloader:
    loss = model(batch).sum()     # Forward (AllGather params on-demand)
    loss.backward()               # Backward (ReduceScatter gradients)
    optim.step()                  # Update local shard only
Check: A 13B parameter model with mixed-precision Adam uses 208 GB baseline. With ZeRO-3 on 16 GPUs, how much per GPU?

Chapter 6: 3D Parallelism — Combining Everything

Each parallelism strategy has its sweet spot: data parallelism scales throughput but requires full model replicas; tensor parallelism splits layers but needs fast interconnect; pipeline parallelism works across slow networks but has bubbles. The insight behind 3D parallelism (also called PTD: Pipeline-Tensor-Data) is that you can combine all three.

How 3D Parallelism Assigns GPUs

Consider training a model on 64 GPUs organized as a 4 × 4 × 4 grid:

DimensionDegreeScopeInterconnect
Tensor parallelism4-wayWithin a node (NVLink)Fastest
Pipeline parallelism4-wayAcross nodes in a rackMedium (IB)
Data parallelism4-wayAcross racks / clustersSlowest

Each node has 4 GPUs connected by NVLink — perfect for tensor parallelism (frequent AllReduces). Nodes are connected by InfiniBand — good for pipeline parallelism (occasional point-to-point). Data parallelism runs across the widest, slowest links because its communication (one AllReduce per step) is the most tolerant of latency.

The Megatron-Turing NLG recipe: Microsoft and NVIDIA trained their 530B parameter model using 3D parallelism. 8-way tensor parallelism within each DGX node (8 A100s connected by NVLink). 35-way pipeline parallelism across nodes. 4-way data parallelism across node groups. Total: 8 × 35 × 4 = 1,120 GPUs.
3D Parallelism Grid

Hover over a GPU to see its parallelism role. Tensor (green) = splits layers within a node. Pipeline (orange) = different layer groups. Data (blue) = same layers, different data.

Why Not Just Use One Strategy?

Each strategy alone hits limits. Data parallelism alone can't fit large models. Tensor parallelism alone doesn't scale beyond 8 GPUs (NVLink domain). Pipeline parallelism alone has too many bubbles. By combining them, you exploit the hardware topology at every level.

Alpa automates this. Choosing the right combination of parallelism degrees is a combinatorial optimization problem. Alpa (OSDI 2022) automatically searches over the space of possible strategies for a given model and cluster, finding near-optimal configurations without manual tuning.
Check: In 3D parallelism, which strategy is typically used within a single multi-GPU node?

Chapter 7: Practical Challenges

The theory of parallelism is clean. The practice is full of devils. Let's walk through the challenges that engineers face when actually deploying distributed training at scale.

1. Gradient Accumulation

Sometimes you want a large effective batch size but can't fit even a single micro-batch on a GPU. The solution: gradient accumulation. Run K forward-backward passes on tiny micro-batches, accumulating gradients locally, then synchronize once. This reduces communication from K AllReduces to 1, at the cost of K times more local compute per sync.

Effective batch = NGPUs × micro-batch × Kaccum-steps

2. Activation Checkpointing

During the forward pass, we save activations for the backward pass. For a 96-layer transformer with large batch sizes, this can consume 50+ GB. Activation checkpointing (also called gradient checkpointing) trades compute for memory: only save activations at selected layers, and recompute the others during backward. A common strategy: checkpoint every sqrt(L) layers, reducing activation memory from O(L) to O(sqrt(L)) at ~33% more compute.

3. Network Topology Awareness

Not all GPU-to-GPU connections are equal. Within a DGX node, NVLink provides ~900 GB/s bidirectional bandwidth. Across nodes, InfiniBand typically gives 200-400 GB/s. Across racks, bandwidth drops further. A parallelism strategy that ignores topology will bottleneck on the slowest links.

Link TypeBandwidthBest For
NVLink (intra-node)~900 GB/sTensor parallelism (frequent AllReduce)
InfiniBand (inter-node)200-400 GB/sPipeline parallelism (point-to-point)
Ethernet (cross-rack)25-100 GB/sData parallelism (infrequent AllReduce)

4. Fault Tolerance

With thousands of GPUs running for weeks, hardware failures are inevitable. A single GPU failure can kill the entire training run. Mitigations include: checkpointing model state to disk every N steps (so you can resume from the last checkpoint), elastic training (dynamically removing failed nodes without restarting), and redundant computation (running the same shard on multiple GPUs).

Real failure rates: Meta reported that during Llama 2 training, they experienced on average one hardware failure every ~2 days across their cluster. Without robust checkpointing and restart mechanisms, large-scale training would be practically impossible.

5. Load Balancing in Pipeline Parallelism

If stage 0 takes 10ms and stage 3 takes 20ms, the whole pipeline runs at the speed of the slowest stage. Uneven layer assignment creates artificial bubbles. Good implementations profile each layer's compute cost and assign layers to stages to minimize the imbalance.

The big picture: Parallelism is not just an algorithm problem — it's a systems engineering problem. The optimal strategy depends on model architecture, hardware topology, failure rates, memory constraints, and training budget. Tools like Alpa, DeepSpeed, and Megatron-LM exist because manually configuring these systems is extraordinarily difficult.
Check: Why is activation checkpointing useful for pipeline parallelism?

Chapter 8: Connections

Let's bring it all together. You now understand the three fundamental strategies for splitting computation across GPUs, plus the memory optimization that makes data parallelism viable at scale.

Cheat Sheet

StrategyWhat it splitsCommunicationBest linkMemory savings
Data parallelismInput data (batch)AllReduce (gradients)AnyNone (full replica)
Tensor parallelismLayer weights (horizontal)AllReduce (activations)NVLinkProportional to degree
Pipeline parallelismLayer groups (vertical)Point-to-pointInfiniBandProportional to degree
ZeRO-1Optimizer statesAllGather (params)Any~3x vs baseline
ZeRO-2+ GradientsReduceScatter + AllGatherAny~4x vs baseline
ZeRO-3 / FSDP+ ParametersAllGather + ReduceScatterAny~Nx (linear in GPUs)

Decision Tree

Does the model fit on one GPU?
Yes → Use data parallelism (simplest). No → continue.
Do you have fast intra-node links (NVLink)?
Yes → Use tensor parallelism within the node. No → continue.
Multiple nodes available?
Yes → Use pipeline parallelism across nodes + data parallelism across groups.
Still running out of memory?
Use ZeRO (FSDP) to shard optimizer states, gradients, and/or parameters.
At massive scale?
Combine all three: 3D parallelism (tensor + pipeline + data) + ZeRO.

Key Formulas

FormulaWhat it gives you
16P bytesBaseline memory (mixed-precision Adam, P = params)
16P/N bytesMemory with ZeRO-3, N GPUs
2(N−1) × X / (NB)Ring AllReduce time (X = data size, B = bandwidth)
(P−1)/(M+P−1)Pipeline bubble fraction (P stages, M micro-batches)

Seminal Papers

PaperContribution
Horovod (Sergeev & Balso, 2018)Ring AllReduce for distributed training
Megatron-LM (Shoeybi et al., 2019)Tensor model parallelism for transformers
GPipe (Huang et al., 2019)Micro-batch pipeline parallelism
PipeDream (Narayanan et al., 2019)1F1B pipeline scheduling
ZeRO (Rajbhandari et al., 2020)Sharded optimizer states, gradients, parameters
Megatron-Turing NLG (Smith et al., 2022)3D parallelism at 530B scale
Alpa (Zheng et al., OSDI 2022)Automatic parallelism strategy search
Where to go next: The CS 229s course continues with expert parallelism (Mixture of Experts), sequence parallelism (splitting the sequence dimension), and advanced scheduling like interleaved 1F1B. These build directly on the foundations covered here.
Final check: You have a 70B model, 256 A100-80GB GPUs across 32 nodes (8 GPUs/node with NVLink). Which parallelism combination makes sense?