Acknowledgments, a curated reading list organized by topic, and research frontiers in distributed training.
Let us crystallize the core lessons from the entire book. If you remember nothing else, remember these:
| Challenge | Nature | Primary Tools |
|---|---|---|
| Memory | Hard limit — if it does not fit, training cannot proceed | Sharding (ZeRO, TP, PP), activation recomputation, mixed precision |
| Compute efficiency | Optimization target — minimize idle hardware | Large batch sizes, fusions, collective matmuls, pipeline schedules |
| Communication | Scaling bottleneck — GPUs/TPUs waiting on each other | Overlapping comms with compute, topology-aware sharding, quantized gradients |
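
The memory row can be illustrated with a back-of-the-envelope check of whether the training state fits on one device. This is a minimal sketch, not the book's own accounting: the 16 bytes per parameter assume a common mixed-precision Adam setup (bf16 weights and gradients, fp32 master weights, two fp32 Adam moments), activations are ignored, and the function names, the 7B-parameter model, and the 80 GB device are all hypothetical.

```python
# Bytes per parameter under an assumed mixed-precision Adam setup:
# bf16 weights and gradients, fp32 master weights, two fp32 Adam moments.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_grads": 2,
    "fp32_master_weights": 4,
    "fp32_adam_m": 4,
    "fp32_adam_v": 4,
}


def training_state_bytes(num_params: int, shard_degree: int = 1) -> float:
    """Per-device bytes for weights, gradients, and optimizer state.

    shard_degree is the number of devices the state is split evenly across
    (1 means fully replicated). Activations are deliberately excluded.
    """
    if shard_degree < 1:
        raise ValueError("shard_degree must be >= 1")
    return num_params * sum(BYTES_PER_PARAM.values()) / shard_degree


def fits(num_params: int, hbm_bytes: float, shard_degree: int = 1) -> bool:
    """True if the training state alone fits in one device's memory."""
    return training_state_bytes(num_params, shard_degree) <= hbm_bytes


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    hbm = 80e9              # hypothetical 80 GB of device memory
    for degree in (1, 2, 8):
        gb = training_state_bytes(params, degree) / 1e9
        print(f"shard_degree={degree}: {gb:.0f} GB, fits={fits(params, hbm, degree)}")
```

Because memory is a hard limit, a check like this is worth running before any throughput tuning. A real budget would also need to account for activations and temporary buffers.
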
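The single most important skill from this book is roofline reasoning: for any operation, calculate a lower bound on its runtime as the larger of its compute time (FLOPs divided by peak FLOP/s) and its data-movement time (bytes moved divided by memory or network bandwidth).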
If your measured time matches this, there is nothing left to optimize. If it is far off, something is wrong and the profiler will tell you what.
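
Roofline reasoning of this kind can be sketched in a few lines. The example below is illustrative only: it takes the lower bound on runtime to be the larger of compute time and data-movement time, and it counts a matmul's traffic as reading both inputs and writing the output once. The function names and the hardware numbers (1e15 FLOP/s, 1e12 bytes/s) are hypothetical, not taken from any specific chip.

```python
def matmul_cost(m: int, k: int, n: int, bytes_per_elem: int = 2) -> tuple:
    """FLOPs and bytes moved for an (m, k) @ (k, n) matmul.

    Assumes each input is read once and the output is written once.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops, bytes_moved


def roofline(flops: float, bytes_moved: float,
             peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> dict:
    """Lower bound on runtime and which resource sets it."""
    t_compute = flops / peak_flops_per_s
    t_memory = bytes_moved / bandwidth_bytes_per_s
    return {
        "t_compute": t_compute,
        "t_memory": t_memory,
        "lower_bound": max(t_compute, t_memory),
        "bound": "compute" if t_compute >= t_memory else "memory",
    }


if __name__ == "__main__":
    PEAK = 1e15  # hypothetical peak FLOP/s
    BW = 1e12    # hypothetical memory bandwidth in bytes/s
    for m in (8, 8192):
        result = roofline(*matmul_cost(m, 8192, 8192), PEAK, BW)
        print(f"m={m}: {result['bound']}-bound, "
              f"lower bound {result['lower_bound']:.2e} s")
```

Comparing `lower_bound` against a profiled time shows how much headroom remains. With these assumed numbers, the small-batch matmul is limited by memory bandwidth and the large one by compute.
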
| Dimension | Splits | Best For | Communication Cost |
|---|---|---|---|
| Data (DP) | Batch | Scaling throughput | AllReduce gradients |
| Tensor (TP) | Hidden dim | Fitting large layers (intra-node) | AllReduce / AllGather per layer |
| Context (CP) | Sequence | Long sequences | Ring Attention for K/V |
| Pipeline (PP) | Layers | Multi-node models | Activations between stages |
| Expert (EP) | MoE experts | Scaling capacity cheaply | AllToAll token routing |
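
To make the "Splits" column concrete, here is a small sketch that computes what each device holds under a given combination of parallelism degrees. The `Mesh` class and `per_device_shapes` function are hypothetical helpers, not an API from JAX or any other library. Expert parallelism is left out because how it composes with the other axes varies between systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mesh:
    """Parallelism degrees along each axis (hypothetical helper)."""
    dp: int = 1  # data parallelism: splits the batch
    tp: int = 1  # tensor parallelism: splits the hidden dimension
    cp: int = 1  # context parallelism: splits the sequence
    pp: int = 1  # pipeline parallelism: splits the layers

    @property
    def num_devices(self) -> int:
        return self.dp * self.tp * self.cp * self.pp


def per_device_shapes(batch: int, seq_len: int, hidden: int,
                      num_layers: int, mesh: Mesh) -> dict:
    """Size of each dimension held by one device, assuming even splits."""
    splits = {
        "batch": (batch, mesh.dp),
        "seq_len": (seq_len, mesh.cp),
        "hidden": (hidden, mesh.tp),
        "layers": (num_layers, mesh.pp),
    }
    shapes = {}
    for name, (size, degree) in splits.items():
        if size % degree != 0:
            raise ValueError(f"{name}={size} is not divisible by degree {degree}")
        shapes[name] = size // degree
    return shapes


if __name__ == "__main__":
    mesh = Mesh(dp=4, tp=8, cp=2, pp=2)
    print(mesh.num_devices, "devices")
    print(per_device_shapes(batch=64, seq_len=8192, hidden=4096,
                            num_layers=32, mesh=mesh))
```

The product of the degrees must equal the number of devices, and each dimension must divide evenly by its degree. That is why the sketch raises an error on uneven splits.
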
This book represents a significant collective investment from many people at Google DeepMind. The authors wish to acknowledge key contributors:
| Contributor | Role |
|---|---|
| James Bradbury, Reiner Pope, Blake Hechtman | Originally derived many core ideas; early pioneers of the systems view of the Transformer |
| Sholto Douglas | Wrote the first version; responsible for the overall narrative and kicking off the project |
| Jacob Austin | Led the transformation from rough notes to polished artifact; editing, formatting, release coordination |
| Anselm Levskaya, Charlie Chen | Created most figures and animations |
| Charlie Chen | Wrote the inference section and drew inference figures |
| Roy Frostig | Publication, editing, and many other steps |
Additional reviewers who provided critical feedback include Zak Stone, Nikhil Sethi, Caitlin Stanton, Alek Dimitriev, Sridhar Lakshmanamurthy, Albert Magyar, Diwakar Gupta, Jeff Dean, Corry Wang, Matt Johnson, Peter Hawkins, and many others. Ruiqi Gao helped with HTML formatting.
The authors curate a reading list organized by area. Each resource offers a different perspective or deeper dive on topics covered in this book.
| Resource | Focus |
|---|---|
| TPU Deep Dive | In-depth TPU architecture in the spirit of this book |
| Domain Specific Architectures for AI Inference | Hardware and model deep dive for inference workloads |
| A Domain-Specific Supercomputer for Training DNNs | One of the original TPU papers with details about the Google TPU program |
| Resource | Focus |
|---|---|
| Making Deep Learning Go Brrrr From First Principles | GPU/PyTorch-focused rooflines and performance engineering |
| How to Optimize a CUDA Matmul Kernel | Step-by-step CUDA kernel optimization worklog; excellent for GPU vs TPU contrast |
| Rafi Witten's High Performance LLMs 2024 | Course on TPU performance engineering with slides on GitHub |
| Resource | Focus |
|---|---|
| Writing TPU Kernels with Pallas | Custom TPU kernels, lower-level details not covered here |
| Distributed Arrays and Automatic Parallelism | Guide to parallelism APIs in JAX; good for implementing ideas from this book |
| Resource | Focus |
|---|---|
| Efficiently Scaling Transformer Inference (arXiv 2211.05102) | Detailed mathematics of Transformer inference; major inspiration for this book |
| HuggingFace Ultra-Scale Playbook | GPU analog covering PyTorch parallelism and memory-saving techniques |
| Transformer Inference Arithmetic | Blog with many of the same ideas and excellent illustrations |
| Stanford CS336 | Fantastic Stanford course on LLM training/serving with exercises (Assignments 1 & 2 especially relevant) |
| Stas Bekman's ML Engineering Handbook | Practical guide: negotiating with cloud providers, cluster management, empirical throughput measurements |
The field of distributed training is rapidly evolving. Here are some areas where active research is pushing the boundaries:
| Approach | Key Idea | Status |
|---|---|---|
| Wafer-scale (Cerebras) | Entire wafer as one chip; massive on-chip SRAM eliminates HBM bottleneck | Production, specialized |
| LPU (Groq) | Deterministic, SRAM-only architecture; eliminates memory hierarchy complexity | Production inference |
| MatX | Custom silicon optimized for dense linear algebra with focus on simplicity | In development |
| Photonic computing | Use light for matrix multiplication, with the promise of much lower energy per operation | Research stage |
Thank you for reading this book. Let us close with a decision table that captures the practical wisdom of distributed training:
| Your Situation | Primary Bottleneck | Start With |
|---|---|---|
| 1 GPU, model fits | Throughput | Gradient accumulation, mixed precision, activation recomputation |
| 1 node (8 GPUs/TPUs), model fits | Throughput | Data parallelism (DP=8) |
| 1 node, model too big | Memory | Tensor parallelism or ZeRO-3 |
| Multiple nodes | Communication | PP across nodes + TP within node |
| Very long sequences | Activation memory | Context parallelism (Ring Attention) |
| MoE architecture | Expert capacity | Expert parallelism (AllToAll routing) |
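
The table above can also be read as a simple lookup. The sketch below encodes its rows as a function. The order in which the rows are checked, the function name `suggest_strategy`, and its parameters are assumptions made for illustration. The table itself does not specify how overlapping situations should be combined.

```python
def suggest_strategy(num_nodes: int, devices_per_node: int,
                     model_fits_on_one_device: bool,
                     long_sequences: bool = False,
                     is_moe: bool = False) -> list:
    """Return the table's 'Start With' suggestions for a given situation."""
    if num_nodes < 1 or devices_per_node < 1:
        raise ValueError("num_nodes and devices_per_node must be >= 1")

    suggestions = []
    if num_nodes > 1:
        # Multiple nodes: communication is the primary bottleneck.
        suggestions.append("PP across nodes + TP within node")
    elif not model_fits_on_one_device:
        if devices_per_node == 1:
            # The table has no row for this case.
            raise ValueError("model does not fit and only one device is available")
        # One node, model too big: memory is the primary bottleneck.
        suggestions.append("Tensor parallelism or ZeRO-3")
    elif devices_per_node > 1:
        # One node, model fits: scale throughput with data parallelism.
        suggestions.append(f"Data parallelism (DP={devices_per_node})")
    else:
        # One device, model fits.
        suggestions.append(
            "Gradient accumulation, mixed precision, activation recomputation")

    # These rows describe the workload rather than the cluster, so this
    # sketch treats them as additions to the base suggestion.
    if long_sequences:
        suggestions.append("Context parallelism (Ring Attention)")
    if is_moe:
        suggestions.append("Expert parallelism (AllToAll routing)")
    return suggestions


if __name__ == "__main__":
    print(suggest_strategy(num_nodes=1, devices_per_node=8,
                           model_fits_on_one_device=True))
    print(suggest_strategy(num_nodes=4, devices_per_node=8,
                           model_fits_on_one_device=False, is_moe=True))
```

Treat the output as a starting point only. The right configuration still has to be confirmed with roofline estimates and profiling on the actual hardware.
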
The book's code implementations are available in two repositories:
For questions or feedback, reach the corresponding author Jacob Austin at jacobaustin123 [at] gmail [dot] com, or contribute via GitHub issues and pull requests.