Austin, Douglas, Levskaya, Chen, Frostig et al. (Google DeepMind), 2025

How To Scale Your Model

From roofline models to 5D parallelism — a complete guide to training and serving LLMs on TPUs and GPUs, with profiling, JAX programming, and real benchmarks.

12 Chapters · 15+ Simulations · 60+ Quizzes
Part I: Preliminaries
Chapter 1

Intro to Rooflines

Arithmetic intensity, compute vs memory bound, and back-of-the-envelope performance modeling.

Chapter 2

All About TPUs

TPU architecture: MXU, VPU, HBM, VMEM, ICI interconnects, and chip-to-chip topologies.

Chapter 3

Sharded Matmuls

1D and 2D weight sharding, collective communication, and the GSPMD framework.

Part II: Transformers at Scale
Chapter 4

Transformers

Attention, MLPs, and full Transformer roofline analysis across sharding strategies.

Chapter 5

Training

Optimizer states, activation memory, gradient checkpointing, and data parallelism.

Chapter 6

Training LLaMA

End-to-end analysis of training a real LLM: memory budget, sharding plan, throughput.

Chapter 7

Inference

Prefill vs decode, KV cache, batched inference, and memory-bound decode analysis.

Chapter 8

Serving LLaMA

Serving a real LLM at scale: throughput, latency, and optimizing for production.

Part III: Practical Tutorials
Chapter 9

How to Profile TPU Code

The JAX profiler, TensorBoard traces, reading HLO ops, memory profiles, worked examples.

Chapter 10

Programming TPUs in JAX

Auto, explicit, and manual sharding modes. shard_map, collective matmuls, worked problems.

Part IV: Conclusions & Bonus
Chapter 11

Conclusions & Further Reading

Acknowledgments, curated reading list, and research frontiers in distributed training.

Chapter 12

How to Think About GPUs

GPU architecture, memory hierarchy, NVLink networking, collectives, and GPU vs TPU rooflines.