Tazi, Mom, Zhao, Nguyen, Mekkouri, Von Werra & Wolf, 2025

The Ultra-Scale Playbook

Training LLMs on GPU clusters. From one GPU to thousands — every parallelism strategy derived, benchmarked, and made interactive.

9 Chapters
12+ Simulations
50+ Quizzes

Part I: Foundations
Chapter 1

Overview

The three key challenges: memory, compute efficiency, and communication overhead.

Chapter 2

Training on One GPU

Memory breakdown, activation recomputation, gradient accumulation, mixed precision.

Part II: Data & Sharding Parallelism
Chapter 3

Data Parallelism

All-reduce, gradient bucketing, ZeRO-1/2/3, FSDP.

Part III: Model Parallelism
Chapter 4

Tensor Parallelism

Column & row splitting, sequence parallelism, intra-node communication.

Chapter 5

Context Parallelism

Ring Attention, Zig-Zag Attention, ultra-long sequences.

Chapter 6

Pipeline Parallelism

AFAB, 1F1B, interleaved stages, zero-bubble schedules.

Chapter 7

Expert Parallelism

Mixture of Experts, token routing, all-to-all communication.

Part IV: Putting It All Together
Chapter 8

5D Parallelism in a Nutshell

Combining all five dimensions, trade-offs, memory savings overview.

Chapter 9

Finding the Best Config

Step-by-step recipe for memory, batch size, and throughput optimization.