Distributed Training Workbook

Chapter 0: Ring AllReduce

You have 8 GPUs, each holding a copy of the model. After the backward pass, each GPU has its own gradient tensor. You need every GPU to end up with the sum of all 8 gradient tensors. The naive approach — send everything to GPU 0, sum, broadcast back — creates a bottleneck where GPU 0 must receive and send 7x the data. Ring AllReduce is the elegant solution.

In a ring AllReduce, the N GPUs are arranged in a logical ring. The algorithm runs in two phases:

Phase 1: Reduce-Scatter (N-1 steps)
Each GPU splits its data into N chunks. In each step, GPU_i sends chunk_k to GPU_{(i+1) mod N} and receives from GPU_{(i-1) mod N}, accumulating into its local chunk.
After N-1 steps, each GPU holds the fully reduced version of exactly one chunk.

Phase 2: AllGather (N-1 steps)
Same ring pattern, but now GPUs forward the completed chunks around the ring.
After N-1 more steps, every GPU has all N reduced chunks.

Data transferred per GPU:
Each phase: (N-1) steps × M/N bytes per step = M(N-1)/N bytes
Total per GPU = 2M(N-1)/N

The bandwidth-optimal result. Ring AllReduce transfers exactly 2M(N-1)/N bytes per GPU regardless of N. As N grows, this approaches 2M — each GPU sends and receives roughly twice the gradient size. Compare to the naive approach where GPU 0 must handle (N-1)×M bytes in and N×M bytes out. Ring AllReduce distributes the load perfectly.

Exercise 0.1: 8-GPU AllReduce Volume Derive

N=8 GPUs, each has a gradient tensor of M=1 GB. How many GB does each GPU transfer in total during a ring AllReduce?

GB per GPU

Show derivation

Total per GPU = 2 × M × (N-1)/N

= 2 × 1 × (8-1)/8 = 2 × 7/8 = 1.75 GB

Each GPU sends 0.875 GB in the reduce-scatter phase and 0.875 GB in the allgather phase. Even with 1000 GPUs, total per GPU would be 2 × 999/1000 = 1.998 GB — barely more than 8 GPUs. This is the magic of ring AllReduce.

Exercise 0.2: Ring AllReduce Time Derive

N=8 GPUs connected in a ring with 50 GB/s bidirectional bandwidth per link. Gradient size M=2 GB. What is the AllReduce time? (Ignore latency — just bandwidth.)

Time = data per GPU / bandwidth = 2M(N-1)/N / BW. But remember: the ring only uses one link per direction at a time.

Show derivation

Data per GPU = 2 × 2 × 7/8 = 3.5 GB

Time = 3.5 GB / 50 GB/s = 0.07 s = 70 ms

In practice, latency adds a per-step overhead. With 2(N-1) = 14 steps and ~5 μs latency per step, that adds only 70 μs — negligible compared to 70 ms. Bandwidth dominates for large tensors.

Exercise 0.3: Bandwidth Utilization Trace

A cluster has 8 GPUs with NVLink providing 300 GB/s per GPU (bidirectional). You measure a ring AllReduce of 4 GB finishing in 30 ms. What fraction of peak bandwidth are you utilizing?

~30% — typical for naive implementations ~60% — decent but room for improvement ~78% — good utilization ~95% — near perfect

Show derivation

Data per GPU = 2 × 4 × 7/8 = 7 GB

Achieved BW = 7 GB / 0.030 s = 233 GB/s

Utilization = 233 / 300 = 77.8%

~78% bus bandwidth utilization is typical for well-tuned NCCL AllReduce. The gap comes from protocol overhead, chunk size quantization, and PCIe hops. NCCL reports "bus bandwidth" = data / time × (2(N-1)/N) to normalize for ring overhead.

Exercise 0.4: Ring vs Tree AllReduce Trace

A tree AllReduce (binary tree, reduce to root then broadcast) with N=8 GPUs has ⌈log₂(8)⌉ = 3 steps for reduce and 3 for broadcast. At each step the root link carries the full M bytes. For M=4 GB and BW=50 GB/s per link, which is faster?

Tree is faster: 6 × 80 ms = 480 ms vs ring's 280 ms. Wait, ring is faster. Ring: 140 ms. Tree: 480 ms. Ring is ~3.4x faster because it saturates all links simultaneously. They are identical — both transfer 2M per GPU. Tree is faster for small N because it has fewer steps.

Show derivation

Ring time = 2M(N-1)/(N × BW) = 2 × 4 × 7 / (8 × 50) = 0.14 s = 140 ms

Tree reduce: 3 steps, but root receives M bytes each step: 3 × 4/50 = 240 ms

Tree broadcast: same = 240 ms. Total tree = 480 ms

The tree has fewer steps (6 vs 14) but creates a bottleneck at the root — only one link carries the full payload each step. The ring has more steps but each step moves only M/N bytes per link, saturating all links equally. For large messages, ring always wins. Trees are only better for small messages where per-step latency dominates.

Exercise 0.5: Implement ringAllReduceTime() Build

Write a function that computes the ring AllReduce time in milliseconds, including both bandwidth and latency terms.

Show solution

javascript
function ringAllReduceTime(M_gb, N, bw_gbs, latency_us) {
  const steps = 2 * (N - 1);
  const chunkGB = M_gb / N;
  const bw_time_ms = steps * (chunkGB / bw_gbs) * 1000;
  const lat_time_ms = steps * latency_us / 1000;
  return bw_time_ms + lat_time_ms;
}

Exercise 0.6: Find the Bug Debug

This AllReduce time calculator is returning values that are 2x too large. Click the buggy line.

function allReduceTime(M, N, BW) {
  const dataPerGpu = 2 * M * (N - 1) / N;
  const steps = 2 * (N - 1);
  const perStepData = M / N;
  const perStepTime = perStepData / BW;
  const totalTime = 2 * steps * perStepTime;
  return totalTime;
}

Show explanation

Line 6 is the bug. The variable steps already accounts for both phases (2(N-1) steps total). Multiplying by 2 again on line 6 double-counts. It should be const totalTime = steps * perStepTime;. Alternatively, you can compute it directly as dataPerGpu / BW — both give the same answer.

Chapter 1: ZeRO Stages

You're training a 7B parameter model. In FP16, the model weights alone are 14 GB. But the optimizer (Adam) keeps two additional state tensors per parameter — the first moment (mean) and second moment (variance), both in FP32. Plus you need the FP32 master copy of the weights. With standard data parallelism, every GPU stores all of this. ZeRO (Zero Redundancy Optimizer) eliminates this duplication.

Memory per parameter (mixed-precision Adam):
FP16 weights: 2 bytes FP16 gradients: 2 bytes
FP32 master weights: 4 bytes FP32 momentum: 4 bytes FP32 variance: 4 bytes
Total = 16 bytes per parameter (on each GPU, without ZeRO)

ZeRO Stage 1: Shard optimizer states across N GPUs
Each GPU stores: 2 (wt) + 2 (grad) + 12/N (opt) = 4 + 12/N bytes/param

ZeRO Stage 2: + Shard gradients
Each GPU stores: 2 (wt) + 2/N (grad) + 12/N (opt) = 2 + 14/N bytes/param

ZeRO Stage 3: + Shard parameters
Each GPU stores: 2/N (wt) + 2/N (grad) + 12/N (opt) = 16/N bytes/param

The key insight. Without ZeRO, doubling your GPU count cuts computation in half but saves zero memory — every GPU is a full clone. With ZeRO-3, doubling GPUs also halves the memory per GPU. This is what makes it possible to train models far larger than a single GPU's memory.

Exercise 1.1: 7B No ZeRO Derive

7B parameters, mixed-precision Adam, no ZeRO. How much memory for model states (weights + gradients + optimizer) per GPU?

Show derivation

16 bytes/param × 7 × 10⁹ params = 112 × 10⁹ bytes = 112 GB

112 GB for model states alone — that exceeds a single A100's 80 GB even before accounting for activations. Without ZeRO, you can't even load the optimizer state on one GPU.

Exercise 1.2: 7B with ZeRO Stage 1 on 64 GPUs Derive

7B params, ZeRO Stage 1, N=64 GPUs. Memory per GPU for model states?

Show derivation

Per param = 4 + 12/64 = 4 + 0.1875 = 4.1875 bytes

Total = 4.1875 × 7 × 10⁹ = 29.3 × 10⁹ bytes ≈ 29.3 GB

Stage 1 alone cuts memory from 112 GB to 29.3 GB. The 4 bytes/param for full weights + gradients is the floor — you can't avoid keeping the local copy of parameters and the full gradient for the backward pass.

Exercise 1.3: 7B with ZeRO Stage 2 on 64 GPUs Derive

Same model, ZeRO Stage 2, N=64. Memory per GPU?

Show derivation

Per param = 2 + 14/64 = 2 + 0.21875 = 2.21875 bytes

Total = 2.21875 × 7 × 10⁹ = 15.53 × 10⁹ bytes ≈ 15.5 GB

Stage 2 halves the Stage 1 cost by also sharding gradients. Now 15.5 GB fits comfortably on an 80 GB A100, leaving ~65 GB for activations. This is the default in most FSDP deployments.

Exercise 1.4: 7B with ZeRO Stage 3 on 64 GPUs Derive

Same model, ZeRO Stage 3, N=64. Memory per GPU?

Show derivation

Per param = 16/64 = 0.25 bytes

Total = 0.25 × 7 × 10⁹ = 1.75 × 10⁹ bytes = 1.75 GB

1.75 GB! But there's a catch: Stage 3 must gather parameters on-the-fly before each forward/backward computation, then discard them. This adds AllGather communication at every layer. The trade-off is clear: memory scales perfectly (1/N) but communication overhead increases.

Exercise 1.5: ZeRO Stage 3 Communication Cost Derive

ZeRO-3 does an AllGather before each layer's forward and backward pass, then a ReduceScatter after backward. For a 7B model (FP16 weights = 14 GB), 64 GPUs, and 100 GB/s interconnect, how much extra communication per training step?

Forward: 1 AllGather (parameters). Backward: 1 AllGather (parameters) + 1 ReduceScatter (gradients). Each AllGather/ReduceScatter transfers M(N-1)/N ≈ M bytes per GPU.

GB total communication per GPU

Show derivation

Each AllGather/ReduceScatter ≈ 14 × (64-1)/64 ≈ 13.78 GB per GPU

3 collectives per step: 3 × 13.78 ≈ 41.3 GB ≈ 42 GB

At 100 GB/s, that's ~420 ms of pure communication per step. Compare to ZeRO-1/2 which only does one AllReduce of gradients (~28 GB). ZeRO-3 trades 1.5x more communication for 8-17x less memory. Worth it when the model literally can't fit otherwise.

Exercise 1.6: Which ZeRO Stage? Trace

You have 8 A100-80GB GPUs and want to train a 30B model (mixed-precision Adam). Activations will need ~20 GB per GPU. Which is the minimum ZeRO stage that fits?

ZeRO-1: 4 + 12/8 = 5.5 bytes/param → 165 GB per GPU. Too much. ZeRO-2: 2 + 14/8 = 3.75 bytes/param → 112.5 GB per GPU. Still too much. ZeRO-3: 16/8 = 2 bytes/param → 60 GB + 20 GB activations = 80 GB. Just fits. No ZeRO stage fits — you need more GPUs or model parallelism.

Show reasoning

ZeRO-3 with 8 GPUs gives 16/8 = 2 bytes/param × 30B = 60 GB for model states. Add 20 GB for activations = 80 GB — exactly the A100 limit. In practice you'd want some headroom, so you might add activation checkpointing to reduce the 20 GB, or use 16 GPUs with ZeRO-2.

Chapter 2: Gradient Accumulation

You want an effective batch size of 2048 sequences, but each GPU can only fit 4 sequences in memory at once. With 8 GPUs, that's 32 per step — you're 64x short. Gradient accumulation bridges this gap: run multiple micro-batches, accumulate their gradients, and only update the optimizer after enough micro-batches.

Effective batch size:
B_eff = micro_batch × accum_steps × DP_degree

Memory impact: Same as running micro_batch alone — you just add gradients in-place.
Compute impact: Linear in accum_steps — each micro-batch is a full forward + backward pass.
Communication: Only one AllReduce per accumulation cycle (not per micro-batch).

The free batch size multiplier. Gradient accumulation costs zero extra memory and has near-zero communication overhead (you defer the AllReduce). The only cost is wall-clock time — more micro-batches mean more sequential forward/backward passes. It's the simplest trick to reach any target batch size.

Exercise 2.1: Accumulation Steps Derive

Target effective batch = 2048. Micro-batch per GPU = 4. DP degree (number of GPUs doing data parallelism) = 8. How many gradient accumulation steps?

steps

Show derivation

accum_steps = B_eff / (micro_batch × DP) = 2048 / (4 × 8) = 2048 / 32 = 64

64 micro-batches per optimizer step. That means 64 forward+backward passes between each weight update. The AllReduce only happens once at the end, so communication is 1/64th of what it would be without accumulation.

Exercise 2.2: Communication Savings Derive

Without gradient accumulation, you AllReduce after every micro-batch (accum=1, so you'd need 64 optimizer steps for the same number of samples). With accumulation, you AllReduce once per 64 micro-batches. If each AllReduce takes 50 ms, how much communication time do you save per 2048 samples?

ms saved

Show derivation

Without accumulation: 64 AllReduces × 50 ms = 3200 ms

With accumulation: 1 AllReduce × 50 ms = 50 ms

Savings = 3200 - 50 = 3150 ms

3.15 seconds saved per optimizer step! Note however that without accumulation you'd also get 64 optimizer updates instead of 1 — the training dynamics differ. Gradient accumulation gives you a specific batch size, not just speed. Large batch training converges differently from small batch training.

Exercise 2.3: Throughput Impact Derive

A micro-batch of 4 sequences takes 120 ms for forward+backward on each GPU. AllReduce takes 50 ms. With accumulation=64: what is the total wall-clock time per optimizer step, and what is the throughput in sequences/second?

sequences/second (across all 8 GPUs)

Show derivation

Compute = 64 micro-batches × 120 ms = 7680 ms

Communication = 1 AllReduce = 50 ms

Optimizer step = ~1 ms (negligible)

Total = 7731 ms ≈ 7.73 s per step

Throughput = 2048 sequences / 7.73 s ≈ 265 seq/s

Communication is only 50/7731 = 0.6% of the step time. Gradient accumulation makes the training almost perfectly compute-bound. Without it, communication would be 3200/(64×120 + 3200) = 29% of total time.

Exercise 2.4: Implement effectiveBatch() Build

Write a function that computes gradient accumulation steps from the target batch size, and vice versa.

Show solution

javascript
function effectiveBatch(microBatch, accumSteps, dpDegree) {
  return microBatch * accumSteps * dpDegree;
}
function requiredAccumSteps(targetBatch, microBatch, dpDegree) {
  return Math.ceil(targetBatch / (microBatch * dpDegree));
}

Exercise 2.5: Memory vs Speed Tradeoff Trace

You can fit micro_batch=8 with activation checkpointing OFF, or micro_batch=16 with activation checkpointing ON (which recomputes activations, adding ~33% compute overhead). Both achieve the same effective batch via accumulation. Which is faster for B_eff=2048 on 8 GPUs?

micro=8, no checkpoint: 32 accum steps × 100ms = 3200ms. micro=16, checkpoint: 16 steps × 266ms = 4256ms. No checkpoint wins. Checkpointing always wins because larger micro-batches have better GPU utilization. They are identical — 33% more compute per step is offset by 50% fewer steps.

Show reasoning

With micro=8: 2048/(8×8) = 32 accumulation steps. If each step takes T ms, total = 32T.

With micro=16 + checkpoint: 2048/(16×8) = 16 steps. Each step processes 2x tokens but takes ~2×1.33 = 2.66x time (doubled data, 33% recompute overhead). Total = 16 × 2.66T = 42.6T.

32T < 42.6T. Larger micro-batch with checkpointing is ~33% slower. Avoid activation checkpointing when you have enough accumulation steps to hit your batch size. Only use it when you literally can't fit the micro-batch you need.

Chapter 3: Learning Rate Scaling

You've tuned your LR to 3e-4 at batch size 256. Now you're scaling to 8 nodes with batch size 2048. If you keep the same LR, training diverges. If you scale it linearly, training also diverges. The relationship between batch size and learning rate is one of the most practically important (and surprisingly subtle) topics in distributed training.

Linear scaling rule (Goyal et al., 2017):
LR_new = LR_base × (B_new / B_base)

Square root scaling (Hoffer et al., 2017 / McCandlish et al., 2018):
LR_new = LR_base × √(B_new / B_base)

Warmup: Start from LR≈0 and ramp to target over W steps.
Rule of thumb: W ≈ 5-10% of total steps, or W ≈ B_new/B_base × 1000 steps.

Why linear scaling works (intuitively). A batch of size B averages B gradient samples. The variance of this average is σ²/B. Doubling B halves the variance, so the gradient estimate is "twice as good" — you can afford twice the step size. Linear scaling preserves the expected distance traveled per sample seen. But it breaks down at very large B because the noise floor vanishes and you start overshooting.

Exercise 3.1: Linear Scaling Derive

Base LR = 3e-4 at batch 256. New batch = 2048. What LR under linear scaling?

learning rate

Show derivation

LR = 3 × 10^-4 × (2048 / 256) = 3 × 10^-4 × 8 = 2.4 × 10^-3

An 8x increase in batch size gives an 8x increase in LR. This is the rule used in the original ResNet large-batch paper (Goyal et al.). It works well up to batch sizes around 8K-32K for vision and ~1M-4M tokens for LLMs.

Exercise 3.2: Square Root Scaling Derive

Same setup: base LR = 3e-4 at batch 256, new batch = 2048. What LR under square root scaling?

learning rate

Show derivation

LR = 3 × 10^-4 × √(2048 / 256) = 3 × 10^-4 × √8 = 3 × 10^-4 × 2.828 ≈ 8.49 × 10^-4

Square root scaling is more conservative — 2.83x instead of 8x. It's derived from the noise-signal ratio analysis and tends to work better at very large batch sizes (>32K). Most modern LLM training runs use something between linear and sqrt.

Exercise 3.3: Warmup Steps Derive

You're training for 100,000 steps with batch=2048 (scaled up from batch=256). Using the heuristic warmup = (B_new/B_base) × 1000 steps, how many warmup steps?

warmup steps

Show derivation

Warmup = (2048 / 256) × 1000 = 8 × 1000 = 8000 steps

8000/100000 = 8% of training, which falls in the 5-10% sweet spot. Warmup is critical at large batch sizes because the early gradient estimates are very noisy (the model is randomly initialized), and a large LR amplifies that noise. Ramping up slowly lets the loss landscape stabilize before taking big steps.

Exercise 3.4: Critical Batch Size Trace

The "critical batch size" B_crit is where the gradient noise equals the signal. Beyond B_crit, doubling the batch barely improves the gradient estimate but halves the number of optimizer steps per epoch. What is the practical consequence?

Training speed doubles for every doubling of batch above B_crit. Beyond B_crit, scaling the batch size wastes compute — you get diminishing returns in wall-clock time while using proportionally more GPUs. The model diverges immediately when B > B_crit. B_crit only affects vision models, not LLMs.

Show reasoning

Below B_crit, doubling the batch nearly doubles training speed (fewer noisy steps needed). Above B_crit, the gradient is already clean, so more samples per batch don't help — you just burn more compute per step without reaching the minimum faster. For GPT-3 scale models, B_crit ≈ 1-2M tokens. This is why Llama 3 uses batch rampup: start at 4M tokens/batch, scale to 16M as the loss landscape becomes more predictable.

Exercise 3.5: LR Schedule Interaction Trace

You use cosine LR decay with a peak of 3e-4 and train for 100K steps. Your colleague doubles the batch size and applies linear scaling (peak 6e-4). They also halve the total steps to 50K (same tokens seen). Should the cosine schedule decay over 50K or 100K steps?

50K steps — the decay should match the actual training duration, not the original. 100K steps — keep the same schedule to match per-sample LR exposure. Doesn't matter — cosine decay is robust to schedule length.

Show reasoning

The cosine schedule should decay over the actual 50K steps. The key quantity is per-token LR exposure: at step t, each token in the batch sees LR(t). With 2x batch and 2x LR, each token gets twice the update magnitude. But you have half the steps, so the integrated LR exposure (area under the LR curve) stays roughly the same. The cosine must decay over the actual step count to hit the minimum LR at the right time.

Chapter 4: Communication Overhead

You've profiled your training run and 40% of each step is spent waiting on AllReduce. Is that expected? Can you hide it behind computation? Understanding when communication dominates vs when it can be overlapped is critical for achieving high GPU utilization (called MFU — model FLOP utilization).

AllReduce time (ring, bandwidth-limited):
T_comm = 2S(N-1) / (N × BW)
where S = gradient size in bytes, N = GPUs, BW = per-link bandwidth

Overlap potential:
During backward pass, gradients become available layer-by-layer (last layer first).
You can start AllReduce for layer L_k while computing gradients for L_k-1.
Overlap fraction ≈ 1 - T_{comm_last_layer} / T_backward

Communication-computation overlap is the key to scaling. If the backward pass for one layer takes longer than the AllReduce for the previous layer's gradients, you can completely hide the communication behind computation. NCCL does this by default with bucketed AllReduce — grouping parameters into ~25 MB buckets and launching AllReduce as soon as each bucket's gradients are complete.

Exercise 4.1: AllReduce Time for 10B Model Derive

10B parameter model, FP16 gradients (2 bytes/param). 64 GPUs, NVLink at 600 GB/s per GPU (bidirectional). Ring AllReduce time?

Show derivation

S = 10 × 10⁹ × 2 = 20 GB

T = 2 × 20 × (64-1) / (64 × 600) = 2 × 20 × 63 / 38400

= 2520 / 38400 = 0.0656 s ≈ 65.6 ms

~66 ms for a single AllReduce. For reference, a forward+backward pass of a 10B model on 64 GPUs with decent micro-batch takes 200-400 ms. So communication is 15-30% of the step — significant but overlappable.

Exercise 4.2: Overlap Analysis Derive

Backward pass takes 200 ms total across 40 layers (5 ms per layer). AllReduce per layer's gradients (bucketed) takes 1.5 ms. What fraction of communication is overlapped?

The first 39 layers' AllReduces overlap with the next layer's backward. Only the last layer's AllReduce cannot overlap.

% overlapped

Show derivation

Total AllReduce = 40 layers × 1.5 ms = 60 ms

Non-overlapped = last layer's AllReduce = 1.5 ms

Overlapped fraction = (60 - 1.5) / 60 = 97.5%

97.5% overlap means only 1.5 ms of the 60 ms AllReduce is visible as overhead. The effective communication cost is 1.5 ms on top of a 200 ms backward — less than 1% overhead. This is why bucketed, overlapped AllReduce is so effective: even though total data transfer is huge, almost all of it hides behind the backward pass.

Exercise 4.3: Inter-Node Bottleneck Derive

8 GPUs per node (NVLink, 600 GB/s within node). 8 nodes connected via InfiniBand (100 GB/s per link). For a 10B model AllReduce across all 64 GPUs, the cross-node links are the bottleneck. Effective AllReduce time?

In hierarchical AllReduce: first reduce within each node (fast), then AllReduce across nodes (slow), then broadcast within nodes (fast). The cross-node step dominates.

ms (cross-node step only)

Show derivation

Cross-node AllReduce: 8 nodes in a ring, S = 20 GB

T = 2 × 20 × (8-1) / (8 × 100) = 280/800 = 0.35 s = 350 ms

350 ms for the cross-node step alone — 5x slower than the full NVLink AllReduce. The intra-node steps add maybe 10 ms total. This is why modern clusters use fat InfiniBand fabrics (400 Gb/s = 50 GB/s per port, 8 ports per node = 400 GB/s aggregate) and why tensor parallelism is always kept within a node.

Exercise 4.4: Compute vs Communication Ratio Trace

For a 10B model doing a forward+backward pass: ~120 TFLOPs of compute. On an H100 at 50% MFU = 495 TFLOPS, that takes ~242 ms. With 350 ms AllReduce (no overlap), what is the communication-to-compute ratio?

0.5:1 — compute dominates easily 1:1 — they're roughly balanced 1.45:1 — communication actually dominates, GPUs idle 60% of the time 3:1 — communication is hopelessly dominant

Show reasoning

Ratio = T_comm / T_compute = 350 / 242 = 1.45

Communication exceeds compute! The effective MFU drops from 50% to 50% × 242/(242+350) = 20%. This is why you never do pure data parallelism across slow inter-node links for large models. Instead, combine TP within the node (high-BW NVLink) + DP across nodes, or use gradient accumulation to amortize the communication cost.

Exercise 4.5: Implement commComputeRatio() Build

Given model and cluster parameters, compute the communication-to-compute ratio.

Show solution

javascript
function commComputeRatio(params_B, N, bw_gbs, flops_per_step_T, gpu_tflops) {
  const S = params_B * 2;  // FP16 grads in GB
  const commTime = 2 * S * (N - 1) / (N * bw_gbs);
  const computeTime = flops_per_step_T / gpu_tflops;
  return commTime / computeTime;
}

Chapter 5: Pipeline Parallelism

Your 70B model has 80 layers. You split them across 4 GPUs: 20 layers each. GPU 0 computes the first 20 layers, passes activations to GPU 1 for the next 20, and so on. Simple — but there's a devastating problem: while GPU 0 computes, GPUs 1-3 sit idle. This idle time is the pipeline bubble.

Naive pipeline (GPipe):
For PP=p stages, m microbatches:
Pipeline bubble fraction = (p - 1) / (p - 1 + m)

1F1B (one forward, one backward) schedule:
Warmup: p-1 forward passes, then alternate 1F + 1B.
Bubble fraction = (p - 1) / m // same total bubble, but lower peak memory
Memory: only p-1 in-flight micro-batches (vs m for GPipe).

Interleaved 1F1B:
Each GPU holds v non-contiguous "virtual stages." Bubble = (p - 1) / (m × v).
Costs more communication (v times more point-to-point transfers).

Pipeline bubble = idle GPUs = wasted money. With PP=4 and m=12 microbatches, the bubble is 3/12 = 25% — one quarter of your GPU-hours are wasted. The cure: more microbatches. With m=32, bubble = 3/32 = 9.4%. With interleaved stages (v=4), bubble = 3/128 = 2.3%. There is always a tension: more microbatches mean smaller micro-batches, which can reduce GPU utilization per micro-batch.

Exercise 5.1: Pipeline Bubble Derive

PP=4 stages, m=12 microbatches, 1F1B schedule. What is the pipeline bubble fraction?

Show derivation

Bubble = (p - 1) / m = (4 - 1) / 12 = 3/12 = 0.25 = 25%

The bubble represents the warmup (first p-1 forward-only passes) and drain (last p-1 backward-only passes). During these phases, at least one stage is idle. With 12 microbatches, the steady-state (where all stages are busy) covers only 9/12 = 75% of the time.

Exercise 5.2: Activation Memory for 1F1B Derive

In 1F1B, each stage stashes activations for p-1 in-flight microbatches during warmup. If each micro-batch's activations per stage are 2 GB, PP=4 stages: how much activation memory per GPU?

Show derivation

In-flight micro-batches = p - 1 = 3

Activation memory = 3 × 2 GB = 6 GB

Compare to GPipe which stashes all m=12 micro-batches: 12 × 2 = 24 GB. 1F1B uses 4x less activation memory because it starts backward passes as soon as possible, freeing stashed activations early. This is why 1F1B is universally preferred over GPipe in practice.

Exercise 5.3: Interleaved 1F1B Derive

Same setup (PP=4, m=12) but with v=4 virtual stages per GPU (interleaved schedule). New bubble fraction?

Show derivation

Bubble = (p - 1) / (m × v) = 3 / (12 × 4) = 3/48 = 6.25%

4x reduction in bubble — from 25% to 6.25%. The cost is 4x more point-to-point sends between stages (each virtual stage boundary requires a send/recv). With fast NVLink this is usually worth it. Megatron-LM's interleaved schedule uses v=number_of_layers / (PP × layers_per_chunk).

Exercise 5.4: Point-to-Point Communication Derive

Each microbatch's activations between stages are 500 MB (FP16). PP=4, m=12, v=4 virtual stages. How much total point-to-point data per step across all stages?

Each virtual stage boundary requires one send per microbatch for forward and one for backward. Total boundaries = PP × v - 1.

GB total

Show derivation

Virtual stage boundaries = PP × v - 1 = 4 × 4 - 1 = 15

Per boundary per microbatch: 0.5 GB forward + 0.5 GB backward = 1 GB

Total = 15 × 12 × 1 = 180 GB

180 GB of point-to-point traffic per step. At 600 GB/s NVLink, the raw transfer time is 300 ms. But these transfers are pipelined and overlapped with computation, so the actual overhead is much smaller. The key constraint is that adjacent virtual stages must be on different GPUs connected by fast links.

Exercise 5.5: Find the Bug Debug

This bubble fraction calculator has a subtle error. Click the buggy line.

function bubbleFraction(pp, microbatches, virtualStages) {
  if (virtualStages === undefined) virtualStages = 1;
  const bubbleSteps = pp - 1;
  const totalSteps = microbatches * virtualStages;
  return bubbleSteps / (bubbleSteps + totalSteps);
}

Show explanation

Line 5 is the bug. The 1F1B bubble fraction formula is (p-1)/m for standard and (p-1)/(m×v) for interleaved — both have the total in the denominator alone, NOT (p-1) / ((p-1) + m×v). The GPipe formula uses (p-1) / (p-1+m), but the comment says "1F1B." Line 5 should be return bubbleSteps / totalSteps;

Exercise 5.6: Optimal Microbatch Count Trace

PP=8, target bubble < 5%. No interleaving (v=1). What is the minimum number of microbatches?

m=64 (bubble = 7/64 = 10.9%) m=100 (bubble = 7/100 = 7%) m=140 (bubble = 7/140 = 5%) — exactly at threshold; m=141 for strictly <5% m=256 (bubble = 7/256 = 2.7%)

Show reasoning

(p-1)/m < 0.05 ⇒ m > (p-1)/0.05 = 7/0.05 = 140

m=140 gives exactly 5%, so you need m ≥ 141 for strictly less than 5%. In practice with PP=8, you'd use interleaved stages (v=2 gives bubble < 5% at m=70, v=4 at m=35) rather than cramming 140+ micro-batches into a step.

Chapter 6: Tensor Parallelism

Pipeline parallelism splits the model by layers. Tensor parallelism (TP) splits individual layers across GPUs. A matrix multiply Y = XW can be split by partitioning W along columns (each GPU gets a vertical slice) or along rows (each GPU gets a horizontal slice). The trick is choosing the split direction so that you minimize communication.

Column-parallel linear: Split W into [d, d/t] on each of t GPUs.
Input X: [s, d] — replicated on all GPUs.
Each GPU computes: Y_i = X × W_i → [s, d/t]
No communication needed — each GPU has a partial result.

Row-parallel linear: Split W into [d/t, d] on each of t GPUs.
Input X_i: [s, d/t] — partitioned across GPUs (from column-parallel output).
Each GPU computes: Z_i = X_i × W_i → [s, d]
AllReduce needed: Z = ∑ Z_i to get the correct sum.

Per transformer layer: 2 AllReduces (one after attention, one after FFN).

Column then row — the Megatron pattern. In a transformer FFN, the first linear (up-projection) is column-parallel and the second (down-projection) is row-parallel. This arrangement means you only need an AllReduce after the down-projection — the column-parallel output naturally partitions the input for the row-parallel. Same pattern for attention: Q/K/V projections are column-parallel (split heads across GPUs), output projection is row-parallel.

Exercise 6.1: Weight Split Shapes Derive

d=8192, TP=8. Column-parallel split of a [d, d] weight matrix. What shape does each GPU hold?

columns per GPU

Show derivation

Each GPU holds: [d, d/TP] = [8192, 8192/8] = [8192, 1024]

Each GPU stores 8192 × 1024 = 8.39M parameters instead of 67.1M. For attention, this naturally maps to splitting heads: with 64 heads and TP=8, each GPU handles 8 heads.

Exercise 6.2: Communication per Layer Derive

TP=8, d=8192, sequence length s=4096, micro-batch b=1, FP16. Each transformer layer needs 2 AllReduces of the activation tensor [b×s, d]. How much data per AllReduce per GPU? (Use the ring AllReduce formula.)

MB per AllReduce per GPU

Show derivation

Activation tensor = [4096, 8192] in FP16 = 4096 × 8192 × 2 bytes = 64 MB

AllReduce per GPU = 2 × 64 × (8-1)/8 = 2 × 64 × 7/8 = 112 MB

Per AllReduce per GPU = 112/2 = 56 MB

56 MB per AllReduce per GPU. With 2 AllReduces per layer and 80 layers, total TP communication = 160 × 56 = 8960 MB ≈ 8.75 GB per GPU per step (forward only, double for backward). This is why TP must use NVLink — at 600 GB/s, each AllReduce takes 56 MB / 600 GB/s ≈ 0.09 ms. Over InfiniBand (100 GB/s), it would be 0.56 ms per AllReduce × 160 = 90 ms overhead.

Exercise 6.3: TP Memory Savings Derive

A 70B model with d=8192, 80 layers. Each layer has ~12d² params = 805M params = 1.61 GB in FP16. With TP=8, how much weight memory per GPU for all layers?

Show derivation

Total weight memory = 80 × 1.61 GB = 128.8 GB

Per GPU with TP=8: 128.8 / 8 = 16.1 GB

TP=8 splits parameters perfectly — each GPU holds 1/8 of every layer's weights. Unlike ZeRO-3, there's no need to AllGather parameters before computation because each GPU already has its shard of the weight matrix and can compute its portion of the output directly. The trade-off is the 2 AllReduces per layer during both forward and backward.

Exercise 6.4: TP + Attention Heads Trace

Llama 70B has 64 attention heads and 8 KV heads (GQA). With TP=8, how many Q heads and KV heads per GPU?

64 Q heads and 8 KV heads per GPU (no split) 8 Q heads and 1 KV head per GPU — clean division 8 Q heads and 8 KV heads per GPU (KV heads are replicated) TP=8 is incompatible with GQA-8

Show reasoning

With TP=8 and 64 Q heads: 64/8 = 8 Q heads per GPU. With 8 KV heads: 8/8 = 1 KV head per GPU. This divides cleanly because the number of KV heads (8) is divisible by TP (8). Each GPU has 8 Q heads that all share 1 KV head — the GQA ratio of 8:1 is preserved locally. If TP were 16, you'd need to replicate KV heads (16/8 = 2 copies each), or use sequence parallelism instead.

Exercise 6.5: TP Scaling Limit Derive

Each TP AllReduce takes 0.09 ms at NVLink speeds. A 70B model with TP=8 has 80 layers × 2 AllReduces × 2 (fwd+bwd) = 320 AllReduces per step. The forward+backward compute takes 300 ms. What is the TP communication overhead as a percentage?

Show derivation

TP comm time = 320 × 0.09 ms = 28.8 ms

Overhead = 28.8 / 300 = 9.6%

~10% overhead with NVLink is acceptable. But TP AllReduces are on the critical path — they cannot be overlapped because the next layer's computation depends on the AllReduce result. This is why TP is capped at 8 GPUs (one node) in most training configs. Going to TP=16 across nodes with 100 GB/s InfiniBand would make each AllReduce 6x slower, pushing overhead to ~58%.

Chapter 7: Checkpoint & Recovery

Your 70B model training run crashes at step 45,000 out of 100,000. How much work did you lose? The answer depends entirely on your checkpointing strategy. Too frequent: you waste time writing to disk. Too infrequent: you lose days of work on a crash. Finding the optimal frequency is a real operations problem.

Checkpoint size (full, 70B model with Adam):
FP16 weights: 70B × 2 = 140 GB
FP32 master weights: 70B × 4 = 280 GB
FP32 optimizer (momentum + variance): 70B × 8 = 560 GB
RNG states: ~1 KB per GPU (negligible)
Total ≈ 980 GB

Optimal checkpoint interval:
If MTBF (mean time between failures) = F hours, and checkpoint takes C hours:
Optimal interval ≈ √(2 × C × F) hours (Young's formula)

Nearly 1 TB per checkpoint for a 70B model. That's the full optimizer state. If you checkpoint every 1000 steps and each step takes 10 seconds, you're writing 1 TB every 2.8 hours. At 10 GB/s to NFS, that's 98 seconds of I/O — 1% overhead, acceptable. But with async checkpointing (write to local SSD first, flush to NFS in background), overhead drops to near zero.

Exercise 7.1: Checkpoint Size for 70B Derive

70B parameters, mixed-precision Adam. Compute the exact checkpoint size including weights (FP16 + FP32 master), optimizer states, and step counter/LR (8 bytes total, negligible).

Show derivation

FP16 weights = 70 × 10⁹ × 2 = 140 GB

FP32 master weights = 70 × 10⁹ × 4 = 280 GB

FP32 momentum = 70 × 10⁹ × 4 = 280 GB

FP32 variance = 70 × 10⁹ × 4 = 280 GB

Total = 140 + 280 + 280 + 280 = 980 GB

Exercise 7.2: Write Time Derive

980 GB checkpoint, NFS write speed 10 GB/s. How long to write one checkpoint?

seconds

Show derivation

Time = 980 GB / 10 GB/s = 98 seconds

98 seconds = 1.6 minutes per checkpoint. With distributed checkpointing (each GPU writes its own shard in parallel via ZeRO), the effective write speed can be N × 10 GB/s, reducing time to 98/N seconds. With 64 GPUs: 1.5 seconds!

Exercise 7.3: Optimal Checkpoint Frequency Derive

MTBF = 24 hours (a realistic large cluster value). Checkpoint takes C = 98 seconds = 0.0272 hours. Using Young's formula: optimal interval = √(2CF). Answer in minutes.

minutes

Show derivation

Interval = √(2 × 0.0272 × 24) = √(1.307) = 1.143 hours ≈ 68.5 minutes

Checkpoint every ~68 minutes. If each step takes 10 seconds, that's every ~410 steps. On a crash, you lose on average half the interval = 34 minutes of work. The checkpoint overhead is 98 seconds per 68.5 minutes = 2.4% — a small price for crash recovery.

Exercise 7.4: Wasted Compute on Crash Derive

Your run uses 512 H100 GPUs at $2/GPU-hour. The MTBF is 24 hours. Checkpoints every 68 minutes. Each checkpoint takes 98 seconds. On average, how much money is wasted per crash (lost compute between last checkpoint and crash, plus recovery time)?

dollars

Show derivation

Average lost work = interval / 2 = 34.25 minutes

Recovery time (reload checkpoint + warmup) ≈ 2 minutes

Total lost time per crash ≈ 36.25 minutes = 0.604 hours

Cost = 512 GPUs × $2/GPU-hr × 0.604 hrs = $618

~$600 per crash. With MTBF = 24 hours over a 30-day training run, that's ~30 crashes × $618 = ~$18,500 in wasted compute. This is why Meta and Google invest heavily in faster checkpointing, fault-tolerant training (elastic training that can continue with fewer GPUs), and hardware reliability.

Exercise 7.5: Implement optimalCheckpointInterval() Build

Given MTBF and checkpoint duration, compute optimal interval and expected waste percentage.

Show solution

javascript
function optimalCheckpointInterval(mtbf_hours, ckpt_seconds) {
  const C = ckpt_seconds / 3600;  // hours
  const interval = Math.sqrt(2 * C * mtbf_hours); // hours
  const interval_min = interval * 60;
  // Waste per crash: avg lost = interval/2, plus ckpt overhead per interval
  const crashes = 1;  // per MTBF period
  const lost = interval / 2;  // hours lost per crash
  const overhead = C * (mtbf_hours / interval); // total ckpt time in mtbf
  const waste = (lost + overhead) / mtbf_hours * 100;
  return [interval_min, waste];
}

Chapter 8: Mixed Precision Training

FP16 has only 5 exponent bits, giving a range of roughly 6 × 10^-8 to 65504. Gradients during training can be much smaller than 10^-8, especially for early layers in deep networks. Without intervention, these small gradients underflow to zero and the model stops learning. Dynamic loss scaling is the fix.

Dynamic loss scaling protocol:
1. Start with scale S = 2¹⁶ = 65536.
2. Multiply loss by S before backward pass → all gradients scaled by S.
3. After backward: divide gradients by S to restore original scale.
4. Check for overflow (NaN/Inf in gradients):
If overflow: skip optimizer step, halve S → S = S / 2.
If no overflow for W steps (e.g., W=2000): double S → S = S × 2.

FP8 training (H100+):
E4M3: 4 exponent, 3 mantissa. Range [−448, 448]. Better precision, less range.
E5M2: 5 exponent, 2 mantissa. Range [−57344, 57344]. Less precision, more range.
Convention: E4M3 for forward pass (weights, activations), E5M2 for backward (gradients).

Loss scaling is invisible but critical. Without it, ~5-15% of gradient values underflow in FP16 training of deep networks. The model trains but converges to a worse solution because fine-grained updates are silently lost. AMP (Automatic Mixed Precision) handles this automatically in PyTorch via torch.cuda.amp.GradScaler.

Exercise 8.1: FP16 Range Derive

FP16 has 5 exponent bits with bias 15. The smallest positive normal number is 2^(1-15) = 2^-14. What is this in scientific notation?

× 10^-5

Show derivation

2^-14 = 1 / 16384 = 6.1035 × 10^-5

Anything smaller than 6.1 × 10^-5 underflows to zero (or subnormal with reduced precision). Typical gradient magnitudes for a well-trained deep network are 10^-4 to 10^-7. So without loss scaling, the bottom end of the gradient distribution is silently lost.

Exercise 8.2: Trace Loss Scaling Trace

Scale starts at S=65536 (2¹⁶). Step 1: no overflow. Step 2: overflow detected. Step 3: no overflow. Step 4: overflow. Step 5: no overflow. What is S after step 5? (W=2000, so no doubling happens in 5 steps.)

S = 65536 (unchanged — overflows cancel out) S = 16384 (halved twice for the two overflows: 65536 → 32768 → 16384) S = 32768 (halved once, doubled once) S = 8192 (halved three times)

Show trace

Step 1: no overflow, good_count=1, S=65536

Step 2: overflow! S = 65536/2 = 32768, good_count=0

Step 3: no overflow, good_count=1, S=32768

Step 4: overflow! S = 32768/2 = 16384, good_count=0

Step 5: no overflow, good_count=1, S=16384

S = 16384 = 2¹⁴. Each overflow halves S. Doubling requires W=2000 consecutive good steps, so it never triggers here. In practice, frequent overflows early in training cause S to drop quickly until it finds a stable range. A healthy training run sees S stabilize after a few hundred steps.

Exercise 8.3: Memory Savings from FP16 Derive

A 13B model: FP32 would use 13B × 4 = 52 GB for weights alone. With mixed precision (FP16 weights + FP32 master copy + FP32 optimizer), how much total memory? What's the ratio compared to pure FP32 training (FP32 weights + FP32 optimizer)?

ratio (mixed / pure-FP32)

Show derivation

Mixed precision: 2 (FP16 wt) + 2 (FP16 grad) + 4 (FP32 master) + 4 (mom) + 4 (var) = 16 bytes/param

Pure FP32: 4 (wt) + 4 (grad) + 4 (mom) + 4 (var) = 16 bytes/param

Ratio = 16/16 = 1.0

Surprise: mixed precision doesn't save optimizer memory! The savings come from activation memory (FP16 activations are 2x smaller) and communication (FP16 gradients are 2x smaller to AllReduce). The FP32 master copy + optimizer states still cost 12 bytes/param either way. This is why ZeRO is orthogonal to mixed precision — you need both.

Exercise 8.4: FP8 E4M3 vs E5M2 Trace

E4M3 has max value 448. E5M2 has max value 57344. During training, gradient magnitudes typically span 6 orders of magnitude. Why use E4M3 for forward activations and E5M2 for backward gradients?

It's arbitrary — either format works for both. Forward activations need precision (3 mantissa bits) to represent intermediate values accurately. Gradients need range (5 exponent bits) because they span a wider dynamic range with possible large spikes. Forward pass is faster with E4M3, backward is faster with E5M2. H100 hardware only supports this specific combination.

Show reasoning

Activations during forward pass tend to be well-behaved (bounded by activation functions like GELU, LayerNorm), so they benefit from more precision bits (3 mantissa = 8 representable fractions per exponent). Gradients can have extreme outliers (especially during loss spikes or early training), so they need wider dynamic range (5 exponent bits = range up to 57344). The precision loss from only 2 mantissa bits is acceptable for gradients because the optimizer averages over many steps.

Exercise 8.5: FP8 Memory Savings Derive

A 70B model's forward pass stores activations for backward. With sequence length 4096, batch 4, 80 layers: each layer stores ~2 × b × s × d bytes (input to attention + input to FFN). d=8192. Compare FP16 vs FP8 activation memory.

GB saved by FP8

Show derivation

Elements per layer = 2 × 4 × 4096 × 8192 = 268,435,456

All layers = 80 × 268,435,456 = 21,474,836,480 elements

FP16: 21.47B × 2 bytes = 42.95 GB × 2 (fwd activations for bwd) ≈ 343 GB

FP8: 21.47B × 1 byte = 21.47 GB × 2 ≈ 172 GB

Savings = 343 - 172 ≈ 172 GB

172 GB saved — that's more than 2 full A100-80GB GPUs worth of memory. This is a major advantage of FP8: even though the optimizer still uses FP32, the activation memory (which dominates at long sequences) is halved compared to FP16.

Exercise 8.6: Implement dynamicLossScale() Build

Simulate the dynamic loss scaling protocol. Process an array of step results (true = overflow, false = no overflow).

Show solution

javascript
function dynamicLossScale(steps, initScale, growInterval) {
  let scale = initScale;
  let goodCount = 0;
  for (const overflow of steps) {
    if (overflow) {
      scale = scale / 2;
      goodCount = 0;
    } else {
      goodCount++;
      if (goodCount >= growInterval) {
        scale = scale * 2;
        goodCount = 0;
      }
    }
  }
  return scale;
}

Chapter 9: Capstone — Train Llama 405B

You are the distributed training lead at Meta. Your job: train Llama 3.1 405B on 16,384 H100 GPUs for 15.6 trillion tokens. You must choose the parallelism strategy (TP, PP, DP, CP), compute the memory budget per GPU, estimate communication volume, and project total training time. Every number must add up or you burn millions of dollars.

Llama 3.1 405B specs:
d = 16384, L = 126 layers, h = 128 heads, h_kv = 8, d_ff = 53248 (SwiGLU)
V = 128256, Sequence length = 8192 (phase 1), context parallel for 128K

Meta's actual config (from the paper):
TP = 8 (within node), PP = 16 (across 16 stages), DP = 128 (16384 / 8 / 16)
Micro-batch = 1, Gradient accumulation = 4
Effective batch = 1 × 4 × 128 = 512 sequences = 4M tokens/batch

The three-axis parallelism puzzle. TP × PP × DP must equal total GPUs (16384). TP is capped at 8 (one NVLink node). PP is set by memory constraints (126 layers / PP stages must fit per stage). DP fills the rest. The art is balancing pipeline bubbles (more PP = more bubble) against memory pressure (less PP = more layers per GPU) against communication cost (more DP = more AllReduce traffic).

Exercise 9.1: Verify TP × PP × DP Derive

TP=8, PP=16, total GPUs=16384. What is the DP degree?

DP degree

Show derivation

DP = Total / (TP × PP) = 16384 / (8 × 16) = 16384 / 128 = 128

Exercise 9.2: Layers per Pipeline Stage Derive

126 layers across PP=16 stages. How many layers per stage? (Note: 126 is not divisible by 16. Some stages get more layers. What is the distribution?)

layers for the stages with MORE layers

Show derivation

126 / 16 = 7.875

14 stages get 8 layers, 2 stages get 7 layers: 14×8 + 2×7 = 112 + 14 = 126

In practice, Meta assigns the first and last stages fewer layers because they also handle the embedding layer and LM head respectively. The embedding (V×d = 128256×16384 ≈ 2.1B params) is non-trivial at this scale and must be accounted for in the memory budget of stage 0.

Exercise 9.3: Memory per GPU Derive

Each layer has: attention (4d² = 4 × 16384² ≈ 1.07B params) + SwiGLU FFN (3 × d × d_ff = 3 × 16384 × 53248 ≈ 2.62B) ≈ 3.69B params/layer. With TP=8, each GPU holds 1/8 of 8 layers. What's the weight memory in FP16?

Show derivation

Params per GPU = 8 layers × 3.69B / 8 (TP) = 3.69B params

FP16 memory = 3.69B × 2 bytes = 7.38 GB

7.4 GB for weights alone, leaving ~72 GB on an 80 GB H100 for optimizer states, activations, and communication buffers. With ZeRO-1 across DP=128, optimizer states add 12/128 = 0.094 bytes/param × 3.69B = 0.35 GB. The main memory pressure comes from activations.

Exercise 9.4: Total FLOPs for Training Derive

The standard estimate: 6 × N × D FLOPs for training a model with N parameters on D tokens. N=405B, D=15.6T tokens. Total petaFLOP-days?

1 petaFLOP-day = 10¹⁵ × 86400 = 8.64 × 10¹⁹ FLOPs.

petaFLOP-days

Show derivation

Total FLOPs = 6 × 405 × 10⁹ × 15.6 × 10¹² = 3.79 × 10²⁵

PF-days = 3.79 × 10²⁵ / 8.64 × 10¹⁹ = 438,657 ≈ 438,750 PF-days

~439K petaFLOP-days. For comparison, GPT-4 is estimated at ~100K PF-days. Llama 3.1 405B is roughly 4x the compute of GPT-4. At Meta's scale, this represents tens of millions of dollars in electricity and GPU-hours.

Exercise 9.5: Training Time Estimate Derive

16,384 H100 GPUs, each at 989 TFLOPS (BF16). Assume 40% MFU (realistic for 3D parallelism). How many days to process 15.6T tokens?

days

Show derivation

Effective FLOPS = 16384 × 989 × 10¹² × 0.40 = 6.48 × 10¹⁸ FLOPS

Total FLOPs = 3.79 × 10²⁵

Time = 3.79 × 10²⁵ / 6.48 × 10¹⁸ = 5.85 × 10⁶ seconds

= 5,850,000 / 86400 ≈ 67.7 days

~68 days of continuous training. Meta reported ~54 days for pre-training (implying ~48% MFU). With crashes and restarts, the actual wall-clock time was ~77 days. At 16,384 GPUs × 77 days × 24 hours × ~$2/GPU-hr, the GPU cost alone is ~$60 million.

Exercise 9.6: Pipeline Bubble for 405B Derive

PP=16, gradient accumulation=4 (so m=4 microbatches per step per DP rank). 1F1B schedule. What is the pipeline bubble fraction? What if Meta used interleaved stages with v=2?

% bubble (interleaved v=2)

Show derivation

Standard 1F1B: bubble = (16-1)/4 = 15/4 = 375% — wait, that's >100%!

This means the pipeline bubble exceeds the useful work. With only m=4 and PP=16, the formula gives (p-1)/m > 1, meaning more bubble than compute.

Interleaved v=2: bubble = (16-1)/(4×2) = 15/8 = 187.5%

Still terrible! This is why the 405B training is so challenging. Meta reports using a customized pipeline schedule and overlapping communication with computation to bring the effective bubble down to ~18-20%. They also experimented with larger accumulation steps. The raw 1F1B formula shows why PP=16 is aggressive — you need many microbatches or clever scheduling to overcome the bubble.

Exercise 9.7: DP AllReduce Volume Derive

Each DP group AllReduces gradients for their pipeline stage's parameters. Each stage has ~8 layers × 3.69B params (already TP-sharded by 8) = 3.69B params in FP16. DP=128 GPUs. Total data per DP AllReduce per GPU?

GB per GPU

Show derivation

Gradient size = 3.69B params × 2 bytes = 7.38 GB

AllReduce per GPU = 2 × 7.38 × (128-1)/128 = 2 × 7.38 × 0.992 = 14.65 GB

Wait — the question asks for data transferred, which is 2 × S × (N-1)/N = 14.65 GB

But each GPU already shards by TP, so S = 7.38 GB. Per GPU = 2 × 7.38 × 127/128 = 7.27 GB

Wait, let me recalculate. The ring formula gives total data per GPU = 2S(N-1)/N. With S=7.38 GB and N=128: 2 × 7.38 × 127/128 = 14.63 GB per GPU total. But this is split across 2(N-1) steps. At 400 Gb/s = 50 GB/s InfiniBand, time = 14.63/50 ≈ 293 ms. This is overlapped with backward compute via bucketed AllReduce.

Exercise 9.8: Design the Parallelism Design

Order the parallelism dimensions from innermost (fastest interconnect) to outermost (slowest interconnect).

→

TP=8 (NVLink within node) PP=16 (NVLink + NVSwitch across nodes) DP=128 (InfiniBand, AllReduce) CP (context parallel, ring attention)

Show reasoning

From innermost to outermost: TP → PP → CP → DP.

TP needs the highest bandwidth (AllReduce on every layer's forward/backward — on the critical path). PP needs fast point-to-point (activation transfers between stages — pipelined). CP (context parallelism) uses ring attention which requires fast ring bandwidth but is less frequent. DP uses AllReduce of gradients which can be fully overlapped with backward — so it tolerates the slowest interconnect.

Exercise 9.9: Electricity Cost Derive

16,384 H100 GPUs, each consuming 700W. Training for 77 days (including crash recovery). Electricity at $0.05/kWh (hyperscaler rate). Total electricity cost?

dollars

Show derivation

Power = 16384 × 700 W = 11,468,800 W = 11,469 kW

Energy = 11,469 kW × 77 × 24 h = 21,233,448 kWh

Cost = 21,233,448 × $0.05 = $1,061,672

~$1.06 million in electricity alone. This is just the GPU power — add ~40% for cooling, networking, storage, and other infrastructure (PUE ≈ 1.4), bringing total energy cost to ~$1.5M. The GPU rental cost ($2/GPU-hr × 16384 × 77 × 24 ≈ $60.5M) dwarfs the electricity. This is why MFU optimization matters so much — every 1% improvement in MFU saves ~$600K at this scale.