Training LLMs on GPU clusters. From one GPU to thousands — every parallelism strategy derived, benchmarked, and made interactive.
The three key challenges: memory, compute efficiency, and communication overhead.
Memory breakdown, activation recomputation, gradient accumulation, mixed precision.
Column & row splitting, sequence parallelism, intra-node communication.
Ring Attention, Zig-Zag Attention, ultra-long sequences.
All-forward-all-backward (AFAB), one-forward-one-backward (1F1B), interleaved stages, zero-bubble schedules.
Mixture of Experts, token routing, all-to-all communication.