Austin et al., Part 1

A Brief Intro to Roofline Analysis

Three bounds rule every computation: how fast the chip does math, how fast it moves bytes, and how fast it talks to neighbours. Roofline analysis turns these into sharp predictions.

Prerequisites: Basic linear algebra (matrix multiplication) and comfort with scientific notation (1e14, etc.).
9
Chapters
2
Simulations
9
Quizzes

Chapter 0: Where Does the Time Go?

You run an algorithm. It takes 50 ms. Why not 5 ms or 500 ms? What physical processes are actually consuming that time?

There are exactly three bottlenecks any computation can hit, and every single operation in deep learning is bounded by at least one of them.

1. Compute
How fast the chip does floating-point math (FLOPs/s)
2. Memory Bandwidth
How fast data moves between HBM and the compute cores (bytes/s)
3. Network Bandwidth
How fast data moves between chips (bytes/s)

Computation time depends purely on the number of floating-point operations and the chip's peak throughput:

Tmath = FLOPs / (Accelerator FLOPs/s)

For example, an NVIDIA H100 can do about 9.89e14 bfloat16 FLOPs/s and a TPU v6e can do 9.1e14. So 1e12 FLOPs takes about 1e12 / 9.89e14 = 1.01 ms on an H100.

Communication time (whether within a chip or between chips) depends on how many bytes move and how fast the link is:

Tcomms = Bytes / (Bandwidth bytes/s)

Within a chip, the HBM bandwidth is about 3.35 TB/s on H100 and 1.6 TB/s on TPU v6e. Between chips, the links are much slower.

The fundamental insight: Typically, computation within a single chip can be overlapped with communication. This means the total time is bounded below by the maximum of compute and comms time, and bounded above by their sum.
Tlower = max(Tmath, Tcomms)
Tupper = Tmath + Tcomms

Since Tupper ≤ 2 × Tlower, these bounds are never more than 2x apart. In practice we optimize against the max (the lower bound) and can usually get close to it.

An operation does 1e13 FLOPs and moves 1e10 bytes on a chip with 1e15 FLOPs/s and 1e12 bytes/s bandwidth. What is the lower-bound time?

Chapter 1: Arithmetic Intensity

When Tmath > Tcomms, the chip spends most of its time doing useful math. We call this compute-bound. When Tcomms > Tmath, the chip sits idle waiting for data. We call this bandwidth-bound (also called memory-bound or comms-bound).

How do we tell which regime an operation will land in, without running it? We use a single ratio called the arithmetic intensity:

Arithmetic Intensity = FLOPs / Bytes

This measures "FLOPs per byte moved." A high-intensity operation does a lot of math per byte of data — it will tend to be compute-bound. A low-intensity operation loads many bytes but does little math — it will be bandwidth-bound.

The crossover point: The chip itself has a "critical intensity" equal to its peak FLOPs/s divided by its bandwidth. When an algorithm's intensity exceeds the chip's critical intensity, the operation is compute-bound.

Let's derive this. We are compute-bound when Tmath > Tcomms:

FLOPs / (FLOPs/s) > Bytes / (BW bytes/s)
⇔ FLOPs / Bytes > (FLOPs/s) / (BW bytes/s)
⇔ Intensity(algorithm) > Intensity(hardware)

For the TPU v5e MXU, the critical intensity is 1.97e14 / 8.2e11 = 240 FLOPs/byte. Any algorithm with intensity below 240 will be bandwidth-bound on this chip.

Worked example: the dot product

To compute x · y where both are bf16 vectors of length N, we load 2N bytes for x, 2N bytes for y, perform N multiplies + (N−1) adds = 2N−1 FLOPs, and write 2 bytes back.

Intensity(dot) = (2N − 1) / (4N + 2) → 1/2 as N → ∞

An intensity of 0.5 is far below 240. The dot product is hopelessly bandwidth-bound. No matter how large N gets, intensity stays at ~0.5. This is why we love matrix multiplication — it has much better scaling.

An element-wise addition of two bf16 vectors of length N has arithmetic intensity closest to:

Chapter 2: Roofline Plots

A roofline plot is the standard way to visualize when an algorithm is compute-bound vs. bandwidth-bound. The x-axis shows arithmetic intensity (FLOPs/byte); the y-axis shows the peak achievable throughput (FLOPs/s).

Reading the plot: On a log-log roofline, there is a diagonal line (slope = bandwidth) and a horizontal line (peak FLOPs/s). Where they meet is the critical intensity. Below that knee, you are bandwidth-bound. Above it, you are compute-bound.

The achievable FLOPs/s for an algorithm with intensity I is:

Throughput = min(I × BW, Peak FLOPs/s)

In the bandwidth-bound region, doubling the bandwidth doubles throughput. In the compute-bound region, increasing bandwidth has zero effect — you are already saturating the compute units.

Interactive Roofline Plot

Drag the sliders to change hardware specs and see where different operations land.

The plot shows several key operations:

OperationTypical IntensityRegime
Dot product~0.5Always BW-bound
LayerNorm~5Usually BW-bound
Matmul (B=32)~32BW-bound on most chips
Matmul (B=256)~256Compute-bound on TPU v5e
Matmul (B=1024)~1024Solidly compute-bound
On a roofline plot, if you double the memory bandwidth while keeping peak FLOPs/s constant, what happens?

Chapter 3: The Matmul Roofline

Matrix multiplication is the most important operation in deep learning. Let's derive its arithmetic intensity from scratch.

Consider X[B, D] × Y[D, F] → Z[B, F]. We need to:

WhatAmount
Load X from HBM2BD bytes (bf16)
Load Y from HBM2DF bytes
Compute2BDF FLOPs
Write Z to HBM2BF bytes
Intensity(matmul) = 2BDF / (2BD + 2DF + 2BF) = BDF / (BD + DF + BF)

Now here is the key simplification. In a typical Transformer, D and F are large (often > 8000) while B (the per-replica token batch size) is smaller. When B ≪ D and B ≪ F:

BDF / (BD + DF + BF) ≈ BDF / DF = B
Beautiful result: When your weight dimensions are large, the arithmetic intensity of a matmul is approximately equal to the batch size B. This means the batch size alone determines whether you're compute-bound.

Since the critical intensity of the TPU v5e is 240, we need B > 240 tokens per replica to be compute-bound. On an H100, the critical intensity is 9.89e14 / 3.35e12 ≈ 295, so we need B > 295.

Notice something remarkable: the FLOPs of matmul scale as O(N3) while the data transfer scales as O(N2). As we make matrices bigger, the intensity grows. This is why deep learning is dominated by matrix multiplication — it is one of the few algorithms where scaling up makes the hardware more efficient.

For a bf16 matmul [512, 8192] × [8192, 8192], the arithmetic intensity is approximately:

Chapter 4: The B > 240 Rule

Let's make the matmul roofline concrete with actual hardware numbers. We want to know: at what batch size does a bf16 matmul become compute-bound?

Using the TPU v5e MXU: 1.97e14 bf16 FLOPs/s and 8.2e11 bytes/s HBM bandwidth.

Critical intensity = 1.97e14 / 8.2e11 = 240 FLOPs/byte
Since Intensity(matmul) ≈ B, we need B > 240
Key rule: For a bfloat16 matmul to be compute-bound on most TPUs, you need your per-replica token batch size to be greater than ~240. Note this is tokens, not sequences. If you have a batch of 8 sequences of length 4096, that is B = 32,768 tokens — well above the threshold.

Important caveats:

1. "Batch size" means tokens per replica. If you do model sharding, you scale both FLOPs and bandwidth by the same factor, so the critical batch size remains the same per independent copy of the weights.

2. Tile sizes matter. In practice, a large matmul is broken into tiles that fit in VMEM/SMEM. This changes the effective bytes loaded. For an (m,k) · (k,n) matmul with tile sizes bm, bk, bn, the effective intensity is roughly bm · bn / (bm + bn), which is similar to B when tiles are reasonable.

3. Quantization changes the rules. If weights are int8 but compute is bf16, we load half the weight bytes but do the same FLOPs. This shifts the critical batch size — we'll work through this in the exercises.

Matmul Roofline vs. Batch Size

See how achieved FLOPs/s scales with batch size for different D/F values.

You have 512 sequences of length 4096 across 2048 GPUs. What is your per-replica token batch size?

Chapter 5: Network Communication Rooflines

Everything so far has been about memory bandwidth within a single chip. But in distributed training, the bottleneck is often the network between chips. The same roofline framework applies, just with different bandwidth numbers.

Consider a concrete example: multiplying X[B, D] × Y[D, F] split across 2 TPUs along the D dimension. Each TPU holds half of D.

Step 1: Each TPU computes a partial matmul: X[:, :D//2] @ Y[:D//2, :] on TPU 0, and the other half on TPU 1.

Step 2: Copy the partial sums (shape [B, F]) to the other TPU and add them.

The math time is halved since each TPU does half the work:

Tmath = 2BDF / (2 × FLOPs/s) = BDF / 1.97e14

The comms time is now about the inter-chip network:

Tcomms = 2BF / (Network BW) = 2BF / 4.5e10

(We send the partial sum Z[B, F] which is 2BF bytes in bf16.)

We become compute-bound when:

BDF / (2BF) > 1.97e14 / 4.5e10
D / 2 > 4377
D > 8755
Notice the switch: For the single-chip memory roofline, the critical quantity was B (the batch size). For the network roofline, the critical quantity is D (the model dimension). This is because D determines how much useful work each chip does per byte sent over the network.

This is just one example, but the pattern is general: when you split computation across chips, the roofline depends on different dimensions than the single-chip case. Understanding these rooflines is essential for choosing good parallelism strategies.

For the 2-chip matmul above, if D=4096, the operation is:

Chapter 6: Exercise — int8 Matmul

Real models often use quantized weights. Let's work through the roofline for an int8 matmul step by step.

Setup

We want to compute X[B, D] · Y[D, F] → Z[B, F] in int8 (1 byte per parameter) instead of bf16 (2 bytes). Assume HBM bandwidth = 8.2e11 bytes/s and int8 peak OPs/s = 3.94e14 (roughly 2x bf16).

Step 1: Bytes loaded

In int8, each parameter is 1 byte: X costs BD bytes, Y costs DF bytes, Z costs BF bytes to write back.

Total bytes = BD + DF + BF

Step 2: FLOPs

The number of multiplies and adds is the same regardless of precision: 2BDF OPs.

Step 3: Intensity

Intensity = 2BDF / (BD + DF + BF)

With B ≪ D, F: intensity ≈ 2B.

Step 4: Critical batch size

Hardware intensity = 3.94e14 / 8.2e11 = 480
2B > 480 ⇒ B > 240
Surprising result: The critical batch size is unchanged at B > 240. Even though int8 halves the bytes and doubles the OPs/s, these effects cancel perfectly. The roofline shifts but the crossover stays in the same place.

Mixed precision: int8 weights, bf16 activations

In practice, we often keep activations in bf16 but quantize only the weights to int8. So: bf16[B, D] * int8[D, F] → bf16[B, F].

Now the bytes are: 2BD (activations in bf16) + DF (weights in int8) + 2BF (output in bf16). With B ≪ D, F:

Intensity ≈ 2BDF / DF = 2B
2B > 240 ⇒ B > 120
Practical win: Mixed-precision quantization (int8 weights, bf16 compute) cuts the critical batch size in half to B > 120. Even though the FLOPs are still bf16-speed, we load half the weight bytes, making it easier to be compute-bound.
For int4 weights with bf16 activations and compute, the critical batch size is approximately:

Chapter 7: Exercises — Roofline Plots & GPUs

Exercise: Roofline plot for mixed-precision matmul

Let's plot peak FLOPs/s vs. batch size for the mixed-precision case (int8 weights, bf16 activations/compute) with different model sizes.

The exact intensity (not the approximation) is:

Intensity = 2BDF / (2BD + DF + 2BF)

And the achieved FLOPs/s is:

Achieved = 2BDF / max(2BDF / 1.97e14, (2BD + DF + 2BF) / 8.2e11)

For D=F=4096, we reach peak around B=160. For D=F=1024, the critical batch size nearly doubles because the weight matrix is smaller relative to the activation bytes. The interactive sim in Chapter 4 shows this clearly.

Exercise: Batched int8 matmul

What if each batch element has its own weight matrix? i.e., int8[B, D] · int8[B, D, F] → int8[B, F].

FLOPs: still 2BDF (B independent [D] × [D, F] products).

Bytes: BD + BDF + BF. Since BDF dominates:

Intensity ≈ 2BDF / BDF = 2
Bad news: With per-element weights, the intensity is constant at 2 regardless of batch size. You are essentially always bandwidth-bound. This is why shared weight matrices are so important for efficiency.

Exercise: GPU memory roofline

From the NVIDIA H100 SXM spec sheet: reported bf16 FLOPs/s is 1.979e15 but this assumes structured sparsity. The true dense value is 9.89e14. Memory bandwidth is 3.35e12 bytes/s.

Bcrit(H100) = 9.89e14 / 3.35e12 = 295

Very similar to the TPU v5e's 240. The H100 has both higher FLOPs and higher bandwidth, keeping the ratio similar.

A per-batch-element weight matrix makes matmul intensity constant at ~2. Why is this problematic?

Chapter 8: Summary

Let's crystallize the key ideas from this chapter.

ConceptKey FormulaTakeaway
Compute timeTmath = FLOPs / (FLOPs/s)Depends on algorithm size and chip speed
Comms timeTcomms = Bytes / (BW)Depends on data volume and link speed
Lower boundmax(Tmath, Tcomms)With perfect overlap
Arithmetic intensityFLOPs / BytesFLOPs per byte moved
Critical intensity(FLOPs/s) / (BW)~240 for TPU v5e, ~295 for H100
Matmul intensity≈ B (batch size in tokens)When D, F ≫ B
Mixed-precisionint8 weights + bf16 computeCritical B drops to ~120
Network rooflineCritical quantity is D, not BDifferent bottleneck for multi-chip
The big picture: The roofline framework gives you a sharp mental model for predicting performance. Before profiling, before benchmarking — just count FLOPs, count bytes, and compare their ratio to your hardware's critical intensity. If you are below the critical intensity, no amount of code optimization will help; you need to change your algorithm or hardware configuration.

In the next chapter, we'll look at how TPU hardware actually implements these computations — the MXU, systolic arrays, HBM, VMEM, and the ICI network that connects chips together.

You discover an operation is bandwidth-bound. Which strategy will NOT help?