Every calculation a distributed training engineer needs to derive from scratch. Ring AllReduce, ZeRO sharding, pipeline bubbles, tensor parallelism, mixed precision — all solvable in-browser with instant feedback.
You have 8 GPUs, each holding a copy of the model. After the backward pass, each GPU has its own gradient tensor. You need every GPU to end up with the sum of all 8 gradient tensors. The naive approach — send everything to GPU 0, sum, broadcast back — creates a bottleneck where GPU 0 must receive and send 7x the data. Ring AllReduce is the elegant solution.
In a ring AllReduce, the N GPUs are arranged in a logical ring. The algorithm runs in two phases:
N=8 GPUs, each has a gradient tensor of M=1 GB. How many GB does each GPU transfer in total during a ring AllReduce?
Each GPU sends 0.875 GB in the reduce-scatter phase and 0.875 GB in the allgather phase. Even with 1000 GPUs, total per GPU would be 2 × 999/1000 = 1.998 GB — barely more than 8 GPUs. This is the magic of ring AllReduce.
N=8 GPUs connected in a ring with 50 GB/s bidirectional bandwidth per link. Gradient size M=2 GB. What is the AllReduce time? (Ignore latency — just bandwidth.)
Time = data per GPU / bandwidth = 2M(N-1)/N / BW. But remember: the ring only uses one link per direction at a time.
In practice, latency adds a per-step overhead. With 2(N-1) = 14 steps and ~5 μs latency per step, that adds only 70 μs — negligible compared to 70 ms. Bandwidth dominates for large tensors.
~78% bus bandwidth utilization is typical for well-tuned NCCL AllReduce. The gap comes from protocol overhead, chunk size quantization, and PCIe hops. NCCL reports "bus bandwidth" = data / time × (2(N-1)/N) to normalize for ring overhead.
The tree has fewer steps (6 vs 14) but creates a bottleneck at the root — only one link carries the full payload each step. The ring has more steps but each step moves only M/N bytes per link, saturating all links equally. For large messages, ring always wins. Trees are only better for small messages where per-step latency dominates.
Write a function that computes the ring AllReduce time in milliseconds, including both bandwidth and latency terms.
javascript function ringAllReduceTime(M_gb, N, bw_gbs, latency_us) { const steps = 2 * (N - 1); const chunkGB = M_gb / N; const bw_time_ms = steps * (chunkGB / bw_gbs) * 1000; const lat_time_ms = steps * latency_us / 1000; return bw_time_ms + lat_time_ms; }
This AllReduce time calculator is returning values that are 2x too large. Click the buggy line.
function allReduceTime(M, N, BW) { const dataPerGpu = 2 * M * (N - 1) / N; const steps = 2 * (N - 1); const perStepData = M / N; const perStepTime = perStepData / BW; const totalTime = 2 * steps * perStepTime; return totalTime; }
Line 6 is the bug. The variable steps already accounts for both phases (2(N-1) steps total). Multiplying by 2 again on line 6 double-counts. It should be const totalTime = steps * perStepTime;. Alternatively, you can compute it directly as dataPerGpu / BW — both give the same answer.
You're training a 7B parameter model. In FP16, the model weights alone are 14 GB. But the optimizer (Adam) keeps two additional state tensors per parameter — the first moment (mean) and second moment (variance), both in FP32. Plus you need the FP32 master copy of the weights. With standard data parallelism, every GPU stores all of this. ZeRO (Zero Redundancy Optimizer) eliminates this duplication.
7B parameters, mixed-precision Adam, no ZeRO. How much memory for model states (weights + gradients + optimizer) per GPU?
112 GB for model states alone — that exceeds a single A100's 80 GB even before accounting for activations. Without ZeRO, you can't even load the optimizer state on one GPU.
7B params, ZeRO Stage 1, N=64 GPUs. Memory per GPU for model states?
Stage 1 alone cuts memory from 112 GB to 29.3 GB. The 4 bytes/param for full weights + gradients is the floor — you can't avoid keeping the local copy of parameters and the full gradient for the backward pass.
Same model, ZeRO Stage 2, N=64. Memory per GPU?
Stage 2 halves the Stage 1 cost by also sharding gradients. Now 15.5 GB fits comfortably on an 80 GB A100, leaving ~65 GB for activations. This is the default in most FSDP deployments.
Same model, ZeRO Stage 3, N=64. Memory per GPU?
1.75 GB! But there's a catch: Stage 3 must gather parameters on-the-fly before each forward/backward computation, then discard them. This adds AllGather communication at every layer. The trade-off is clear: memory scales perfectly (1/N) but communication overhead increases.
ZeRO-3 does an AllGather before each layer's forward and backward pass, then a ReduceScatter after backward. For a 7B model (FP16 weights = 14 GB), 64 GPUs, and 100 GB/s interconnect, how much extra communication per training step?
Forward: 1 AllGather (parameters). Backward: 1 AllGather (parameters) + 1 ReduceScatter (gradients). Each AllGather/ReduceScatter transfers M(N-1)/N ≈ M bytes per GPU.
At 100 GB/s, that's ~420 ms of pure communication per step. Compare to ZeRO-1/2 which only does one AllReduce of gradients (~28 GB). ZeRO-3 trades 1.5x more communication for 8-17x less memory. Worth it when the model literally can't fit otherwise.
ZeRO-3 with 8 GPUs gives 16/8 = 2 bytes/param × 30B = 60 GB for model states. Add 20 GB for activations = 80 GB — exactly the A100 limit. In practice you'd want some headroom, so you might add activation checkpointing to reduce the 20 GB, or use 16 GPUs with ZeRO-2.
You want an effective batch size of 2048 sequences, but each GPU can only fit 4 sequences in memory at once. With 8 GPUs, that's 32 per step — you're 64x short. Gradient accumulation bridges this gap: run multiple micro-batches, accumulate their gradients, and only update the optimizer after enough micro-batches.
Target effective batch = 2048. Micro-batch per GPU = 4. DP degree (number of GPUs doing data parallelism) = 8. How many gradient accumulation steps?
64 micro-batches per optimizer step. That means 64 forward+backward passes between each weight update. The AllReduce only happens once at the end, so communication is 1/64th of what it would be without accumulation.
Without gradient accumulation, you AllReduce after every micro-batch (accum=1, so you'd need 64 optimizer steps for the same number of samples). With accumulation, you AllReduce once per 64 micro-batches. If each AllReduce takes 50 ms, how much communication time do you save per 2048 samples?
3.15 seconds saved per optimizer step! Note however that without accumulation you'd also get 64 optimizer updates instead of 1 — the training dynamics differ. Gradient accumulation gives you a specific batch size, not just speed. Large batch training converges differently from small batch training.
A micro-batch of 4 sequences takes 120 ms for forward+backward on each GPU. AllReduce takes 50 ms. With accumulation=64: what is the total wall-clock time per optimizer step, and what is the throughput in sequences/second?
Communication is only 50/7731 = 0.6% of the step time. Gradient accumulation makes the training almost perfectly compute-bound. Without it, communication would be 3200/(64×120 + 3200) = 29% of total time.
Write a function that computes gradient accumulation steps from the target batch size, and vice versa.
javascript function effectiveBatch(microBatch, accumSteps, dpDegree) { return microBatch * accumSteps * dpDegree; } function requiredAccumSteps(targetBatch, microBatch, dpDegree) { return Math.ceil(targetBatch / (microBatch * dpDegree)); }
With micro=8: 2048/(8×8) = 32 accumulation steps. If each step takes T ms, total = 32T.
With micro=16 + checkpoint: 2048/(16×8) = 16 steps. Each step processes 2x tokens but takes ~2×1.33 = 2.66x time (doubled data, 33% recompute overhead). Total = 16 × 2.66T = 42.6T.
32T < 42.6T. Larger micro-batch with checkpointing is ~33% slower. Avoid activation checkpointing when you have enough accumulation steps to hit your batch size. Only use it when you literally can't fit the micro-batch you need.
You've tuned your LR to 3e-4 at batch size 256. Now you're scaling to 8 nodes with batch size 2048. If you keep the same LR, training diverges. If you scale it linearly, training also diverges. The relationship between batch size and learning rate is one of the most practically important (and surprisingly subtle) topics in distributed training.
Base LR = 3e-4 at batch 256. New batch = 2048. What LR under linear scaling?
An 8x increase in batch size gives an 8x increase in LR. This is the rule used in the original ResNet large-batch paper (Goyal et al.). It works well up to batch sizes around 8K-32K for vision and ~1M-4M tokens for LLMs.
Same setup: base LR = 3e-4 at batch 256, new batch = 2048. What LR under square root scaling?
Square root scaling is more conservative — 2.83x instead of 8x. It's derived from the noise-signal ratio analysis and tends to work better at very large batch sizes (>32K). Most modern LLM training runs use something between linear and sqrt.
You're training for 100,000 steps with batch=2048 (scaled up from batch=256). Using the heuristic warmup = (Bnew/Bbase) × 1000 steps, how many warmup steps?
8000/100000 = 8% of training, which falls in the 5-10% sweet spot. Warmup is critical at large batch sizes because the early gradient estimates are very noisy (the model is randomly initialized), and a large LR amplifies that noise. Ramping up slowly lets the loss landscape stabilize before taking big steps.
Below Bcrit, doubling the batch nearly doubles training speed (fewer noisy steps needed). Above Bcrit, the gradient is already clean, so more samples per batch don't help — you just burn more compute per step without reaching the minimum faster. For GPT-3 scale models, Bcrit ≈ 1-2M tokens. This is why Llama 3 uses batch rampup: start at 4M tokens/batch, scale to 16M as the loss landscape becomes more predictable.
The cosine schedule should decay over the actual 50K steps. The key quantity is per-token LR exposure: at step t, each token in the batch sees LR(t). With 2x batch and 2x LR, each token gets twice the update magnitude. But you have half the steps, so the integrated LR exposure (area under the LR curve) stays roughly the same. The cosine must decay over the actual step count to hit the minimum LR at the right time.
You've profiled your training run and 40% of each step is spent waiting on AllReduce. Is that expected? Can you hide it behind computation? Understanding when communication dominates vs when it can be overlapped is critical for achieving high GPU utilization (called MFU — model FLOP utilization).
10B parameter model, FP16 gradients (2 bytes/param). 64 GPUs, NVLink at 600 GB/s per GPU (bidirectional). Ring AllReduce time?
~66 ms for a single AllReduce. For reference, a forward+backward pass of a 10B model on 64 GPUs with decent micro-batch takes 200-400 ms. So communication is 15-30% of the step — significant but overlappable.
Backward pass takes 200 ms total across 40 layers (5 ms per layer). AllReduce per layer's gradients (bucketed) takes 1.5 ms. What fraction of communication is overlapped?
The first 39 layers' AllReduces overlap with the next layer's backward. Only the last layer's AllReduce cannot overlap.
97.5% overlap means only 1.5 ms of the 60 ms AllReduce is visible as overhead. The effective communication cost is 1.5 ms on top of a 200 ms backward — less than 1% overhead. This is why bucketed, overlapped AllReduce is so effective: even though total data transfer is huge, almost all of it hides behind the backward pass.
8 GPUs per node (NVLink, 600 GB/s within node). 8 nodes connected via InfiniBand (100 GB/s per link). For a 10B model AllReduce across all 64 GPUs, the cross-node links are the bottleneck. Effective AllReduce time?
In hierarchical AllReduce: first reduce within each node (fast), then AllReduce across nodes (slow), then broadcast within nodes (fast). The cross-node step dominates.
350 ms for the cross-node step alone — 5x slower than the full NVLink AllReduce. The intra-node steps add maybe 10 ms total. This is why modern clusters use fat InfiniBand fabrics (400 Gb/s = 50 GB/s per port, 8 ports per node = 400 GB/s aggregate) and why tensor parallelism is always kept within a node.
Communication exceeds compute! The effective MFU drops from 50% to 50% × 242/(242+350) = 20%. This is why you never do pure data parallelism across slow inter-node links for large models. Instead, combine TP within the node (high-BW NVLink) + DP across nodes, or use gradient accumulation to amortize the communication cost.
Given model and cluster parameters, compute the communication-to-compute ratio.
javascript function commComputeRatio(params_B, N, bw_gbs, flops_per_step_T, gpu_tflops) { const S = params_B * 2; // FP16 grads in GB const commTime = 2 * S * (N - 1) / (N * bw_gbs); const computeTime = flops_per_step_T / gpu_tflops; return commTime / computeTime; }
Your 70B model has 80 layers. You split them across 4 GPUs: 20 layers each. GPU 0 computes the first 20 layers, passes activations to GPU 1 for the next 20, and so on. Simple — but there's a devastating problem: while GPU 0 computes, GPUs 1-3 sit idle. This idle time is the pipeline bubble.
PP=4 stages, m=12 microbatches, 1F1B schedule. What is the pipeline bubble fraction?
The bubble represents the warmup (first p-1 forward-only passes) and drain (last p-1 backward-only passes). During these phases, at least one stage is idle. With 12 microbatches, the steady-state (where all stages are busy) covers only 9/12 = 75% of the time.
In 1F1B, each stage stashes activations for p-1 in-flight microbatches during warmup. If each micro-batch's activations per stage are 2 GB, PP=4 stages: how much activation memory per GPU?
Compare to GPipe which stashes all m=12 micro-batches: 12 × 2 = 24 GB. 1F1B uses 4x less activation memory because it starts backward passes as soon as possible, freeing stashed activations early. This is why 1F1B is universally preferred over GPipe in practice.
Same setup (PP=4, m=12) but with v=4 virtual stages per GPU (interleaved schedule). New bubble fraction?
4x reduction in bubble — from 25% to 6.25%. The cost is 4x more point-to-point sends between stages (each virtual stage boundary requires a send/recv). With fast NVLink this is usually worth it. Megatron-LM's interleaved schedule uses v=number_of_layers / (PP × layers_per_chunk).
Each microbatch's activations between stages are 500 MB (FP16). PP=4, m=12, v=4 virtual stages. How much total point-to-point data per step across all stages?
Each virtual stage boundary requires one send per microbatch for forward and one for backward. Total boundaries = PP × v - 1.
180 GB of point-to-point traffic per step. At 600 GB/s NVLink, the raw transfer time is 300 ms. But these transfers are pipelined and overlapped with computation, so the actual overhead is much smaller. The key constraint is that adjacent virtual stages must be on different GPUs connected by fast links.
This bubble fraction calculator has a subtle error. Click the buggy line.
function bubbleFraction(pp, microbatches, virtualStages) { if (virtualStages === undefined) virtualStages = 1; const bubbleSteps = pp - 1; const totalSteps = microbatches * virtualStages; return bubbleSteps / (bubbleSteps + totalSteps); }
Line 5 is the bug. The 1F1B bubble fraction formula is (p-1)/m for standard and (p-1)/(m×v) for interleaved — both have the total in the denominator alone, NOT (p-1) / ((p-1) + m×v). The GPipe formula uses (p-1) / (p-1+m), but the comment says "1F1B." Line 5 should be return bubbleSteps / totalSteps;
m=140 gives exactly 5%, so you need m ≥ 141 for strictly less than 5%. In practice with PP=8, you'd use interleaved stages (v=2 gives bubble < 5% at m=70, v=4 at m=35) rather than cramming 140+ micro-batches into a step.
Pipeline parallelism splits the model by layers. Tensor parallelism (TP) splits individual layers across GPUs. A matrix multiply Y = XW can be split by partitioning W along columns (each GPU gets a vertical slice) or along rows (each GPU gets a horizontal slice). The trick is choosing the split direction so that you minimize communication.
d=8192, TP=8. Column-parallel split of a [d, d] weight matrix. What shape does each GPU hold?
Each GPU stores 8192 × 1024 = 8.39M parameters instead of 67.1M. For attention, this naturally maps to splitting heads: with 64 heads and TP=8, each GPU handles 8 heads.
TP=8, d=8192, sequence length s=4096, micro-batch b=1, FP16. Each transformer layer needs 2 AllReduces of the activation tensor [b×s, d]. How much data per AllReduce per GPU? (Use the ring AllReduce formula.)
56 MB per AllReduce per GPU. With 2 AllReduces per layer and 80 layers, total TP communication = 160 × 56 = 8960 MB ≈ 8.75 GB per GPU per step (forward only, double for backward). This is why TP must use NVLink — at 600 GB/s, each AllReduce takes 56 MB / 600 GB/s ≈ 0.09 ms. Over InfiniBand (100 GB/s), it would be 0.56 ms per AllReduce × 160 = 90 ms overhead.
A 70B model with d=8192, 80 layers. Each layer has ~12d² params = 805M params = 1.61 GB in FP16. With TP=8, how much weight memory per GPU for all layers?
TP=8 splits parameters perfectly — each GPU holds 1/8 of every layer's weights. Unlike ZeRO-3, there's no need to AllGather parameters before computation because each GPU already has its shard of the weight matrix and can compute its portion of the output directly. The trade-off is the 2 AllReduces per layer during both forward and backward.
With TP=8 and 64 Q heads: 64/8 = 8 Q heads per GPU. With 8 KV heads: 8/8 = 1 KV head per GPU. This divides cleanly because the number of KV heads (8) is divisible by TP (8). Each GPU has 8 Q heads that all share 1 KV head — the GQA ratio of 8:1 is preserved locally. If TP were 16, you'd need to replicate KV heads (16/8 = 2 copies each), or use sequence parallelism instead.
Each TP AllReduce takes 0.09 ms at NVLink speeds. A 70B model with TP=8 has 80 layers × 2 AllReduces × 2 (fwd+bwd) = 320 AllReduces per step. The forward+backward compute takes 300 ms. What is the TP communication overhead as a percentage?
~10% overhead with NVLink is acceptable. But TP AllReduces are on the critical path — they cannot be overlapped because the next layer's computation depends on the AllReduce result. This is why TP is capped at 8 GPUs (one node) in most training configs. Going to TP=16 across nodes with 100 GB/s InfiniBand would make each AllReduce 6x slower, pushing overhead to ~58%.
Your 70B model training run crashes at step 45,000 out of 100,000. How much work did you lose? The answer depends entirely on your checkpointing strategy. Too frequent: you waste time writing to disk. Too infrequent: you lose days of work on a crash. Finding the optimal frequency is a real operations problem.
70B parameters, mixed-precision Adam. Compute the exact checkpoint size including weights (FP16 + FP32 master), optimizer states, and step counter/LR (8 bytes total, negligible).
980 GB checkpoint, NFS write speed 10 GB/s. How long to write one checkpoint?
98 seconds = 1.6 minutes per checkpoint. With distributed checkpointing (each GPU writes its own shard in parallel via ZeRO), the effective write speed can be N × 10 GB/s, reducing time to 98/N seconds. With 64 GPUs: 1.5 seconds!
MTBF = 24 hours (a realistic large cluster value). Checkpoint takes C = 98 seconds = 0.0272 hours. Using Young's formula: optimal interval = √(2CF). Answer in minutes.
Checkpoint every ~68 minutes. If each step takes 10 seconds, that's every ~410 steps. On a crash, you lose on average half the interval = 34 minutes of work. The checkpoint overhead is 98 seconds per 68.5 minutes = 2.4% — a small price for crash recovery.
Your run uses 512 H100 GPUs at $2/GPU-hour. The MTBF is 24 hours. Checkpoints every 68 minutes. Each checkpoint takes 98 seconds. On average, how much money is wasted per crash (lost compute between last checkpoint and crash, plus recovery time)?
~$600 per crash. With MTBF = 24 hours over a 30-day training run, that's ~30 crashes × $618 = ~$18,500 in wasted compute. This is why Meta and Google invest heavily in faster checkpointing, fault-tolerant training (elastic training that can continue with fewer GPUs), and hardware reliability.
Given MTBF and checkpoint duration, compute optimal interval and expected waste percentage.
javascript function optimalCheckpointInterval(mtbf_hours, ckpt_seconds) { const C = ckpt_seconds / 3600; // hours const interval = Math.sqrt(2 * C * mtbf_hours); // hours const interval_min = interval * 60; // Waste per crash: avg lost = interval/2, plus ckpt overhead per interval const crashes = 1; // per MTBF period const lost = interval / 2; // hours lost per crash const overhead = C * (mtbf_hours / interval); // total ckpt time in mtbf const waste = (lost + overhead) / mtbf_hours * 100; return [interval_min, waste]; }
FP16 has only 5 exponent bits, giving a range of roughly 6 × 10-8 to 65504. Gradients during training can be much smaller than 10-8, especially for early layers in deep networks. Without intervention, these small gradients underflow to zero and the model stops learning. Dynamic loss scaling is the fix.
torch.cuda.amp.GradScaler.FP16 has 5 exponent bits with bias 15. The smallest positive normal number is 2(1-15) = 2-14. What is this in scientific notation?
Anything smaller than 6.1 × 10-5 underflows to zero (or subnormal with reduced precision). Typical gradient magnitudes for a well-trained deep network are 10-4 to 10-7. So without loss scaling, the bottom end of the gradient distribution is silently lost.
S = 16384 = 214. Each overflow halves S. Doubling requires W=2000 consecutive good steps, so it never triggers here. In practice, frequent overflows early in training cause S to drop quickly until it finds a stable range. A healthy training run sees S stabilize after a few hundred steps.
A 13B model: FP32 would use 13B × 4 = 52 GB for weights alone. With mixed precision (FP16 weights + FP32 master copy + FP32 optimizer), how much total memory? What's the ratio compared to pure FP32 training (FP32 weights + FP32 optimizer)?
Surprise: mixed precision doesn't save optimizer memory! The savings come from activation memory (FP16 activations are 2x smaller) and communication (FP16 gradients are 2x smaller to AllReduce). The FP32 master copy + optimizer states still cost 12 bytes/param either way. This is why ZeRO is orthogonal to mixed precision — you need both.
Activations during forward pass tend to be well-behaved (bounded by activation functions like GELU, LayerNorm), so they benefit from more precision bits (3 mantissa = 8 representable fractions per exponent). Gradients can have extreme outliers (especially during loss spikes or early training), so they need wider dynamic range (5 exponent bits = range up to 57344). The precision loss from only 2 mantissa bits is acceptable for gradients because the optimizer averages over many steps.
A 70B model's forward pass stores activations for backward. With sequence length 4096, batch 4, 80 layers: each layer stores ~2 × b × s × d bytes (input to attention + input to FFN). d=8192. Compare FP16 vs FP8 activation memory.
172 GB saved — that's more than 2 full A100-80GB GPUs worth of memory. This is a major advantage of FP8: even though the optimizer still uses FP32, the activation memory (which dominates at long sequences) is halved compared to FP16.
Simulate the dynamic loss scaling protocol. Process an array of step results (true = overflow, false = no overflow).
javascript function dynamicLossScale(steps, initScale, growInterval) { let scale = initScale; let goodCount = 0; for (const overflow of steps) { if (overflow) { scale = scale / 2; goodCount = 0; } else { goodCount++; if (goodCount >= growInterval) { scale = scale * 2; goodCount = 0; } } } return scale; }
You are the distributed training lead at Meta. Your job: train Llama 3.1 405B on 16,384 H100 GPUs for 15.6 trillion tokens. You must choose the parallelism strategy (TP, PP, DP, CP), compute the memory budget per GPU, estimate communication volume, and project total training time. Every number must add up or you burn millions of dollars.
TP=8, PP=16, total GPUs=16384. What is the DP degree?
126 layers across PP=16 stages. How many layers per stage? (Note: 126 is not divisible by 16. Some stages get more layers. What is the distribution?)
In practice, Meta assigns the first and last stages fewer layers because they also handle the embedding layer and LM head respectively. The embedding (V×d = 128256×16384 ≈ 2.1B params) is non-trivial at this scale and must be accounted for in the memory budget of stage 0.
Each layer has: attention (4d² = 4 × 16384² ≈ 1.07B params) + SwiGLU FFN (3 × d × dff = 3 × 16384 × 53248 ≈ 2.62B) ≈ 3.69B params/layer. With TP=8, each GPU holds 1/8 of 8 layers. What's the weight memory in FP16?
7.4 GB for weights alone, leaving ~72 GB on an 80 GB H100 for optimizer states, activations, and communication buffers. With ZeRO-1 across DP=128, optimizer states add 12/128 = 0.094 bytes/param × 3.69B = 0.35 GB. The main memory pressure comes from activations.
The standard estimate: 6 × N × D FLOPs for training a model with N parameters on D tokens. N=405B, D=15.6T tokens. Total petaFLOP-days?
1 petaFLOP-day = 1015 × 86400 = 8.64 × 1019 FLOPs.
~439K petaFLOP-days. For comparison, GPT-4 is estimated at ~100K PF-days. Llama 3.1 405B is roughly 4x the compute of GPT-4. At Meta's scale, this represents tens of millions of dollars in electricity and GPU-hours.
16,384 H100 GPUs, each at 989 TFLOPS (BF16). Assume 40% MFU (realistic for 3D parallelism). How many days to process 15.6T tokens?
~68 days of continuous training. Meta reported ~54 days for pre-training (implying ~48% MFU). With crashes and restarts, the actual wall-clock time was ~77 days. At 16,384 GPUs × 77 days × 24 hours × ~$2/GPU-hr, the GPU cost alone is ~$60 million.
PP=16, gradient accumulation=4 (so m=4 microbatches per step per DP rank). 1F1B schedule. What is the pipeline bubble fraction? What if Meta used interleaved stages with v=2?
Still terrible! This is why the 405B training is so challenging. Meta reports using a customized pipeline schedule and overlapping communication with computation to bring the effective bubble down to ~18-20%. They also experimented with larger accumulation steps. The raw 1F1B formula shows why PP=16 is aggressive — you need many microbatches or clever scheduling to overcome the bubble.
Each DP group AllReduces gradients for their pipeline stage's parameters. Each stage has ~8 layers × 3.69B params (already TP-sharded by 8) = 3.69B params in FP16. DP=128 GPUs. Total data per DP AllReduce per GPU?
Wait, let me recalculate. The ring formula gives total data per GPU = 2S(N-1)/N. With S=7.38 GB and N=128: 2 × 7.38 × 127/128 = 14.63 GB per GPU total. But this is split across 2(N-1) steps. At 400 Gb/s = 50 GB/s InfiniBand, time = 14.63/50 ≈ 293 ms. This is overlapped with backward compute via bucketed AllReduce.
Order the parallelism dimensions from innermost (fastest interconnect) to outermost (slowest interconnect).
From innermost to outermost: TP → PP → CP → DP.
TP needs the highest bandwidth (AllReduce on every layer's forward/backward — on the critical path). PP needs fast point-to-point (activation transfers between stages — pipelined). CP (context parallelism) uses ring attention which requires fast ring bandwidth but is less frequent. DP uses AllReduce of gradients which can be fully overlapped with backward — so it tolerates the slowest interconnect.
16,384 H100 GPUs, each consuming 700W. Training for 77 days (including crash recovery). Electricity at $0.05/kWh (hyperscaler rate). Total electricity cost?
~$1.06 million in electricity alone. This is just the GPU power — add ~40% for cooling, networking, storage, and other infrastructure (PUE ≈ 1.4), bringing total energy cost to ~$1.5M. The GPU rental cost ($2/GPU-hr × 16384 × 77 × 24 ≈ $60.5M) dwarfs the electricity. This is why MFU optimization matters so much — every 1% improvement in MFU saves ~$600K at this scale.