Three bounds rule every computation: how fast the chip does math, how fast it moves bytes, and how fast it talks to neighbours. Roofline analysis turns these into sharp predictions.
You run an algorithm. It takes 50 ms. Why not 5 ms or 500 ms? What physical processes are actually consuming that time?
There are exactly three bottlenecks any computation can hit, and every single operation in deep learning is bounded by at least one of them.
Computation time depends purely on the number of floating-point operations and the chip's peak throughput:
For example, an NVIDIA H100 can do about 9.89e14 bfloat16 FLOPs/s and a TPU v6e can do 9.1e14. So 1e12 FLOPs takes about 1e12 / 9.89e14 = 1.01 ms on an H100.
Communication time (whether within a chip or between chips) depends on how many bytes move and how fast the link is:
Within a chip, the HBM bandwidth is about 3.35 TB/s on H100 and 1.6 TB/s on TPU v6e. Between chips, the links are much slower.
Since Tupper ≤ 2 × Tlower, these bounds are never more than 2x apart. In practice we optimize against the max (the lower bound) and can usually get close to it.
When Tmath > Tcomms, the chip spends most of its time doing useful math. We call this compute-bound. When Tcomms > Tmath, the chip sits idle waiting for data. We call this bandwidth-bound (also called memory-bound or comms-bound).
How do we tell which regime an operation will land in, without running it? We use a single ratio called the arithmetic intensity:
This measures "FLOPs per byte moved." A high-intensity operation does a lot of math per byte of data — it will tend to be compute-bound. A low-intensity operation loads many bytes but does little math — it will be bandwidth-bound.
Let's derive this. We are compute-bound when Tmath > Tcomms:
For the TPU v5e MXU, the critical intensity is 1.97e14 / 8.2e11 = 240 FLOPs/byte. Any algorithm with intensity below 240 will be bandwidth-bound on this chip.
To compute x · y where both are bf16 vectors of length N, we load 2N bytes for x, 2N bytes for y, perform N multiplies + (N−1) adds = 2N−1 FLOPs, and write 2 bytes back.
An intensity of 0.5 is far below 240. The dot product is hopelessly bandwidth-bound. No matter how large N gets, intensity stays at ~0.5. This is why we love matrix multiplication — it has much better scaling.
A roofline plot is the standard way to visualize when an algorithm is compute-bound vs. bandwidth-bound. The x-axis shows arithmetic intensity (FLOPs/byte); the y-axis shows the peak achievable throughput (FLOPs/s).
The achievable FLOPs/s for an algorithm with intensity I is:
In the bandwidth-bound region, doubling the bandwidth doubles throughput. In the compute-bound region, increasing bandwidth has zero effect — you are already saturating the compute units.
Drag the sliders to change hardware specs and see where different operations land.
The plot shows several key operations:
| Operation | Typical Intensity | Regime |
|---|---|---|
| Dot product | ~0.5 | Always BW-bound |
| LayerNorm | ~5 | Usually BW-bound |
| Matmul (B=32) | ~32 | BW-bound on most chips |
| Matmul (B=256) | ~256 | Compute-bound on TPU v5e |
| Matmul (B=1024) | ~1024 | Solidly compute-bound |
Matrix multiplication is the most important operation in deep learning. Let's derive its arithmetic intensity from scratch.
Consider X[B, D] × Y[D, F] → Z[B, F]. We need to:
| What | Amount |
|---|---|
| Load X from HBM | 2BD bytes (bf16) |
| Load Y from HBM | 2DF bytes |
| Compute | 2BDF FLOPs |
| Write Z to HBM | 2BF bytes |
Now here is the key simplification. In a typical Transformer, D and F are large (often > 8000) while B (the per-replica token batch size) is smaller. When B ≪ D and B ≪ F:
Since the critical intensity of the TPU v5e is 240, we need B > 240 tokens per replica to be compute-bound. On an H100, the critical intensity is 9.89e14 / 3.35e12 ≈ 295, so we need B > 295.
Notice something remarkable: the FLOPs of matmul scale as O(N3) while the data transfer scales as O(N2). As we make matrices bigger, the intensity grows. This is why deep learning is dominated by matrix multiplication — it is one of the few algorithms where scaling up makes the hardware more efficient.
Let's make the matmul roofline concrete with actual hardware numbers. We want to know: at what batch size does a bf16 matmul become compute-bound?
Using the TPU v5e MXU: 1.97e14 bf16 FLOPs/s and 8.2e11 bytes/s HBM bandwidth.
Important caveats:
1. "Batch size" means tokens per replica. If you do model sharding, you scale both FLOPs and bandwidth by the same factor, so the critical batch size remains the same per independent copy of the weights.
2. Tile sizes matter. In practice, a large matmul is broken into tiles that fit in VMEM/SMEM. This changes the effective bytes loaded. For an (m,k) · (k,n) matmul with tile sizes bm, bk, bn, the effective intensity is roughly bm · bn / (bm + bn), which is similar to B when tiles are reasonable.
3. Quantization changes the rules. If weights are int8 but compute is bf16, we load half the weight bytes but do the same FLOPs. This shifts the critical batch size — we'll work through this in the exercises.
See how achieved FLOPs/s scales with batch size for different D/F values.
Everything so far has been about memory bandwidth within a single chip. But in distributed training, the bottleneck is often the network between chips. The same roofline framework applies, just with different bandwidth numbers.
Consider a concrete example: multiplying X[B, D] × Y[D, F] split across 2 TPUs along the D dimension. Each TPU holds half of D.
Step 1: Each TPU computes a partial matmul: X[:, :D//2] @ Y[:D//2, :] on TPU 0, and the other half on TPU 1.
Step 2: Copy the partial sums (shape [B, F]) to the other TPU and add them.
The math time is halved since each TPU does half the work:
The comms time is now about the inter-chip network:
(We send the partial sum Z[B, F] which is 2BF bytes in bf16.)
We become compute-bound when:
This is just one example, but the pattern is general: when you split computation across chips, the roofline depends on different dimensions than the single-chip case. Understanding these rooflines is essential for choosing good parallelism strategies.
Real models often use quantized weights. Let's work through the roofline for an int8 matmul step by step.
We want to compute X[B, D] · Y[D, F] → Z[B, F] in int8 (1 byte per parameter) instead of bf16 (2 bytes). Assume HBM bandwidth = 8.2e11 bytes/s and int8 peak OPs/s = 3.94e14 (roughly 2x bf16).
In int8, each parameter is 1 byte: X costs BD bytes, Y costs DF bytes, Z costs BF bytes to write back.
The number of multiplies and adds is the same regardless of precision: 2BDF OPs.
With B ≪ D, F: intensity ≈ 2B.
In practice, we often keep activations in bf16 but quantize only the weights to int8. So: bf16[B, D] * int8[D, F] → bf16[B, F].
Now the bytes are: 2BD (activations in bf16) + DF (weights in int8) + 2BF (output in bf16). With B ≪ D, F:
Let's plot peak FLOPs/s vs. batch size for the mixed-precision case (int8 weights, bf16 activations/compute) with different model sizes.
The exact intensity (not the approximation) is:
And the achieved FLOPs/s is:
For D=F=4096, we reach peak around B=160. For D=F=1024, the critical batch size nearly doubles because the weight matrix is smaller relative to the activation bytes. The interactive sim in Chapter 4 shows this clearly.
What if each batch element has its own weight matrix? i.e., int8[B, D] · int8[B, D, F] → int8[B, F].
FLOPs: still 2BDF (B independent [D] × [D, F] products).
Bytes: BD + BDF + BF. Since BDF dominates:
From the NVIDIA H100 SXM spec sheet: reported bf16 FLOPs/s is 1.979e15 but this assumes structured sparsity. The true dense value is 9.89e14. Memory bandwidth is 3.35e12 bytes/s.
Very similar to the TPU v5e's 240. The H100 has both higher FLOPs and higher bandwidth, keeping the ratio similar.
Let's crystallize the key ideas from this chapter.
| Concept | Key Formula | Takeaway |
|---|---|---|
| Compute time | Tmath = FLOPs / (FLOPs/s) | Depends on algorithm size and chip speed |
| Comms time | Tcomms = Bytes / (BW) | Depends on data volume and link speed |
| Lower bound | max(Tmath, Tcomms) | With perfect overlap |
| Arithmetic intensity | FLOPs / Bytes | FLOPs per byte moved |
| Critical intensity | (FLOPs/s) / (BW) | ~240 for TPU v5e, ~295 for H100 |
| Matmul intensity | ≈ B (batch size in tokens) | When D, F ≫ B |
| Mixed-precision | int8 weights + bf16 compute | Critical B drops to ~120 |
| Network roofline | Critical quantity is D, not B | Different bottleneck for multi-chip |
In the next chapter, we'll look at how TPU hardware actually implements these computations — the MXU, systolic arrays, HBM, VMEM, and the ICI network that connects chips together.