Austin et al., Chapter 11

Conclusions & Further Reading

Acknowledgments, a curated reading list organized by topic, and research frontiers in distributed training.

Prerequisites: Having read Chapters 0–10 (or at least skimmed them).
5 Chapters, 0 Simulations, 4 Quizzes

Chapter 0: Key Takeaways

Let us crystallize the core lessons from the entire book. If you remember nothing else, remember these:

The Three Challenges

| Challenge | Nature | Primary Tools |
|---|---|---|
| Memory | Hard limit: if it does not fit, training cannot proceed | Sharding (ZeRO, TP, PP), activation recomputation, mixed precision |
| Compute efficiency | Optimization target: minimize idle hardware | Large batch sizes, fusions, collective matmuls, pipeline schedules |
| Communication | Scaling bottleneck: GPUs/TPUs waiting on each other | Overlapping comms with compute, topology-aware sharding, quantized gradients |
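As a rough illustration of why memory is a hard limit, the sketch below estimates model-state memory per device and checks it against device capacity. The 16 bytes per parameter figure is an assumption (mixed-precision Adam, following the accounting in the ZeRO paper); the function names, the 8B-parameter model, and the 80 GB device are hypothetical, and activations are ignored entirely.

```python
# Assumption: mixed-precision Adam, using the accounting from the ZeRO paper.
BYTES_WEIGHTS = 2      # 16-bit weights
BYTES_GRADS = 2        # 16-bit gradients
BYTES_OPTIMIZER = 12   # fp32 master weights + Adam momentum + Adam variance
BYTES_PER_PARAM = BYTES_WEIGHTS + BYTES_GRADS + BYTES_OPTIMIZER  # 16


def model_state_bytes_per_device(num_params: int, num_shards: int = 1) -> float:
    """Model-state bytes per device if every state is sharded evenly (ZeRO-3 style)."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    return num_params * BYTES_PER_PARAM / num_shards


def fits_in_memory(num_params: int, device_bytes: float, num_shards: int = 1) -> bool:
    """True if model states alone fit; activations and temporary buffers are ignored."""
    return model_state_bytes_per_device(num_params, num_shards) <= device_bytes


if __name__ == "__main__":
    params = 8_000_000_000   # hypothetical 8B-parameter model
    device_bytes = 80e9      # hypothetical 80 GB accelerator
    for shards in (1, 2, 8):
        gb = model_state_bytes_per_device(params, shards) / 1e9
        ok = fits_in_memory(params, device_bytes, shards)
        print(f"shards={shards}: {gb:.0f} GB per device, fits={ok}")
```

Under these assumptions the unsharded model does not fit at all, which is what makes memory a constraint rather than a tuning knob; sharding across more devices is what brings it under the limit.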

The Roofline Mindset

The single most important skill from this book is roofline reasoning: for any operation, calculate:

$$T = \max\left(T_\text{compute},\ T_\text{communication}\right) = \max\left(\frac{\text{FLOPs}}{\text{peak FLOPs/s}},\ \frac{\text{bytes}}{\text{bandwidth}}\right)$$

This is a lower bound that assumes compute and communication overlap perfectly. If your measured time matches it, there is nothing left to optimize for that operation. If it is far off, something is wrong, and the profiler will tell you what.
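The roofline estimate can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function names are hypothetical, the hardware numbers are made up, and the matmul cost model (2·b·d·f FLOPs, each operand read or written once) is a simplifying assumption.

```python
def roofline_time(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Return (lower-bound time in seconds, name of the dominant term)."""
    t_compute = flops / peak_flops_per_s
    t_communication = bytes_moved / bandwidth_bytes_per_s
    bound = "compute" if t_compute >= t_communication else "communication"
    return max(t_compute, t_communication), bound


def matmul_cost(b, d, f, bytes_per_element=2):
    """FLOPs and bytes moved for a [b, d] x [d, f] matmul.

    Assumes both inputs are read once and the output is written once.
    """
    flops = 2 * b * d * f
    bytes_moved = bytes_per_element * (b * d + d * f + b * f)
    return flops, bytes_moved


if __name__ == "__main__":
    PEAK_FLOPS_PER_S = 1e15      # hypothetical accelerator
    BANDWIDTH_BYTES_PER_S = 1e12  # hypothetical memory bandwidth
    for batch in (512, 4096):
        flops, nbytes = matmul_cost(batch, 8192, 8192)
        t, bound = roofline_time(flops, nbytes, PEAK_FLOPS_PER_S, BANDWIDTH_BYTES_PER_S)
        print(f"batch={batch}: T >= {t:.2e} s ({bound}-bound)")
```

With these made-up numbers the small batch is limited by bytes moved and the larger batch by FLOPs, which is the usual argument for large batch sizes.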

The 5D Parallelism Framework

| Dimension | Splits | Best For | Communication Cost |
|---|---|---|---|
| Data (DP) | Batch | Scaling throughput | AllReduce gradients |
| Tensor (TP) | Hidden dim | Fitting large layers (intra-node) | AllReduce / AllGather per layer |
| Context (CP) | Sequence | Long sequences | Ring Attention for K/V |
| Pipeline (PP) | Layers | Multi-node models | Activations between stages |
| Expert (EP) | MoE experts | Scaling capacity cheaply | AllToAll token routing |

The fundamental trade-off: You can often trade one of {computation, communication, memory} for another. Activation recomputation trades extra compute for less memory. Tensor parallelism trades more communication for distributed memory. Finding the right balance is the art of distributed training.
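To make the trade-off concrete for the Data (DP) row of the table above, the following sketch compares per-step compute time with gradient AllReduce time on one device. It rests on two assumptions that are not taken from this page: roughly 6 FLOPs per parameter per token for a forward and backward pass, and a ring AllReduce in which each device sends about 2(N-1)/N times the gradient size. The function name and all hardware numbers are hypothetical.

```python
def dp_step_times(num_params, tokens_per_device, num_devices,
                  peak_flops_per_s, link_bytes_per_s, grad_bytes_per_param=2):
    """Return (compute seconds, gradient AllReduce seconds) for one training step.

    Assumes ~6 FLOPs per parameter per token and a ring AllReduce.
    """
    t_compute = 6 * num_params * tokens_per_device / peak_flops_per_s
    grad_bytes = num_params * grad_bytes_per_param
    # Ring AllReduce: reduce-scatter + all-gather, each moving (N-1)/N of the data.
    bytes_sent = 2 * (num_devices - 1) / num_devices * grad_bytes
    t_communication = bytes_sent / link_bytes_per_s
    return t_compute, t_communication


if __name__ == "__main__":
    for tokens in (1_000, 10_000):
        t_c, t_m = dp_step_times(
            num_params=1_000_000_000,   # hypothetical 1B-parameter model
            tokens_per_device=tokens,
            num_devices=8,
            peak_flops_per_s=1e15,      # hypothetical
            link_bytes_per_s=1e11,      # hypothetical
        )
        bound = "compute" if t_c >= t_m else "communication"
        print(f"tokens/device={tokens}: compute={t_c:.3f}s, comm={t_m:.3f}s ({bound}-bound)")
```

In this toy setting the AllReduce cost is independent of batch size, so raising the per-device batch is what moves the step from communication-bound to compute-bound.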
Check: Which of the three challenges (memory, compute efficiency, communication) is a hard limit rather than just an optimization target?

Chapter 1: Acknowledgments

This book represents a significant collective investment from many people at Google DeepMind. The authors wish to acknowledge key contributors:

| Contributor | Role |
|---|---|
| James Bradbury, Reiner Pope, Blake Hechtman | Originally derived many core ideas; early pioneers of the systems view of the Transformer |
| Sholto Douglas | Wrote the first version; responsible for the overall narrative and kicking off the project |
| Jacob Austin | Led the transformation from rough notes to polished artifact; editing, formatting, release coordination |
| Anselm Levskaya, Charlie Chen | Created most figures and animations |
| Charlie Chen | Wrote the inference section and drew inference figures |
| Roy Frostig | Publication, editing, and many other steps |

Additional reviewers who provided critical feedback include Zak Stone, Nikhil Sethi, Caitlin Stanton, Alek Dimitriev, Sridhar Lakshmanamurthy, Albert Magyar, Diwakar Gupta, Jeff Dean, Corry Wang, Matt Johnson, Peter Hawkins, and many others. Ruiqi Gao helped with HTML formatting.

Open-source roots: This book grew out of internal Google DeepMind documentation about TPU performance engineering. The authors made a deliberate choice to open-source it, believing that the knowledge of how to train models at scale should not remain behind closed doors.
Check: Who originally derived many of the core ideas about the systems view of the Transformer that this book is built on?

Chapter 2: Further Reading

The authors curate a reading list organized by area. Each resource offers a different perspective or deeper dive on topics covered in this book.

Hardware Deep Dives

| Resource | Focus |
|---|---|
| TPU Deep Dive | In-depth TPU architecture in the spirit of this book |
| Domain Specific Architectures for AI Inference | Hardware and model deep dive for inference workloads |
| A Domain-Specific Supercomputer for Training DNNs | One of the original TPU papers with details about the Google TPU program |

Performance Engineering

| Resource | Focus |
|---|---|
| Making Deep Learning Go Brrrr From First Principles | GPU/PyTorch-focused rooflines and performance engineering |
| How to Optimize a CUDA Matmul Kernel | Step-by-step CUDA kernel optimization worklog; excellent for GPU vs TPU contrast |
| Rafi Witten's High Performance LLMs 2024 | Stanford course on TPU performance engineering with slides on GitHub |

JAX Programming

| Resource | Focus |
|---|---|
| Writing TPU Kernels with Pallas | Custom TPU kernels, lower-level details not covered here |
| Distributed Arrays and Automatic Parallelism | Guide to parallelism APIs in JAX; good for implementing ideas from this book |

Broader Resources

| Resource | Focus |
|---|---|
| Efficiently Scaling Transformer Inference (arXiv 2211.05102) | Detailed mathematics of Transformer inference; major inspiration for this book |
| HuggingFace Ultra-Scale Playbook | GPU analog covering PyTorch parallelism and memory-saving techniques |
| Transformer Inference Arithmetic | Blog with many of the same ideas and excellent illustrations |
| Stanford CS336 | Fantastic Stanford course on LLM training/serving with exercises (Assignments 1 & 2 especially relevant) |
| Stas Bekman's ML Engineering Handbook | Practical guide: negotiating with cloud providers, cluster management, empirical throughput measurements |

The authors note: "There remains a lot of room for comprehensive writing in this area, so we hope this manuscript encourages more of it! We also believe that this is a fruitful area to study and research. In many cases, it can be done even without having many hardware accelerators on hand."
Check: Which resource is described as the "GPU analog" to this (TPU-focused) scaling book?

Chapter 3: Research Frontiers

The field of distributed training is rapidly evolving. Here are some areas where active research is pushing the boundaries:

Hardware Evolution

Algorithmic Frontiers

Systems Frontiers

Inference and Serving Frontiers

Emerging Hardware Paradigms

| Approach | Key Idea | Status |
|---|---|---|
| Wafer-scale (Cerebras) | Entire wafer as one chip; massive on-chip SRAM eliminates HBM bottleneck | Production, specialized |
| LPU (Groq) | Deterministic, SRAM-only architecture; eliminates memory hierarchy complexity | Production inference |
| MatX | Custom silicon optimized for dense linear algebra with focus on simplicity | In development |
| Photonic computing | Use light for matrix multiplication at near-zero energy | Research stage |

The key open question: As models get larger and clusters get bigger, will the current 5D parallelism framework suffice, or will we need fundamentally new approaches? The interplay between hardware design, compiler technology, and training algorithms continues to be one of the most exciting areas in ML systems research.
Check: What new memory type does the Blackwell GPU architecture (B200) introduce to feed its larger Tensor Cores?

Chapter 4: Final Thoughts

Thank you for reading this book. Let us close with a decision tree that captures the practical wisdom of distributed training:

| Your Situation | Primary Bottleneck | Start With |
|---|---|---|
| 1 GPU, model fits | Throughput | Gradient accumulation, mixed precision, activation recomputation |
| 1 node (8 GPUs/TPUs), model fits | Throughput | Data parallelism (DP=8) |
| 1 node, model too big | Memory | Tensor parallelism or ZeRO-3 |
| Multiple nodes | Communication | PP across nodes + TP within node |
| Very long sequences | Activation memory | Context parallelism (Ring Attention) |
| MoE architecture | Expert capacity | Expert parallelism (AllToAll routing) |

The authors ran over 4,100 experiments with up to 512 GPUs. The benchmarks consistently show that no single technique dominates: the optimal configuration depends on model size, batch size, sequence length, and hardware topology. There is no substitute for understanding the fundamentals and reasoning from first principles.
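The decision table above can be sketched as a small lookup function. This is only one reading of the table: the `Setup` fields, the function name, and the precedence between rows (multi-node first, then memory, then device count, with long-sequence and MoE suggestions appended) are assumptions made for illustration, and "model fits" is interpreted as fitting on a single device.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Setup:
    """Hypothetical description of a training setup."""
    num_nodes: int = 1
    devices_per_node: int = 1
    model_fits_on_one_device: bool = True
    very_long_sequences: bool = False
    is_moe: bool = False


def starting_points(setup: Setup) -> List[str]:
    """Map a setup to the 'Start With' column of the decision table."""
    suggestions = []
    if setup.num_nodes > 1:
        suggestions.append("PP across nodes + TP within node")
    elif not setup.model_fits_on_one_device:
        suggestions.append("Tensor parallelism or ZeRO-3")
    elif setup.devices_per_node > 1:
        suggestions.append(f"Data parallelism (DP={setup.devices_per_node})")
    else:
        suggestions.append(
            "Gradient accumulation, mixed precision, activation recomputation"
        )
    # The last two rows describe workload properties, so they are added on top.
    if setup.very_long_sequences:
        suggestions.append("Context parallelism (Ring Attention)")
    if setup.is_moe:
        suggestions.append("Expert parallelism (AllToAll routing)")
    return suggestions


if __name__ == "__main__":
    print(starting_points(Setup(num_nodes=1, devices_per_node=8)))
    print(starting_points(Setup(num_nodes=4, devices_per_node=8, is_moe=True)))
```

A lookup like this is only a starting point; as the paragraph above says, the best configuration still depends on model size, batch size, sequence length, and topology.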

The book's code implementations are available in two repositories:

For questions or feedback, reach the corresponding author Jacob Austin at jacobaustin123 [at] gmail [dot] com, or contribute via GitHub issues and pull requests.

Citation:
Austin et al., "How to Scale Your Model", Google DeepMind, online, 2025.
Check: If your model does not fit on a single node (e.g., 8 GPUs), what is the recommended parallelism strategy?