Austin et al., Chapter 12

How to Think About GPUs

GPU architecture from SMs to NVLink fabrics — memory hierarchy, collectives, rooflines, and how GPUs compare to TPUs for LLM scaling.

Prerequisites: Roofline analysis (Chapter 1), TPU architecture (Chapter 2), and collectives (Chapter 3).

Chapter 0: What Is a GPU?

A modern ML GPU (H100, B200) is fundamentally a collection of compute cores that specialize in matrix multiplication — called Streaming Multiprocessors (SMs) — connected to a stick of fast memory called HBM.

If you squint, a GPU looks a lot like a TPU. Both have matrix multiplication units, vector arithmetic units, fast on-chip cache, and HBM. The key difference: a TPU has 1-2 big Tensor Cores, while a GPU has 100+ small SMs. More modularity, more flexibility, harder to reason about.

Each SM, like a TPU's Tensor Core, contains:

| GPU Component | TPU Equivalent | What It Does |
| --- | --- | --- |
| Tensor Core | MXU | Dedicated matrix multiplication unit (vast majority of FLOPs/s) |
| Warp Scheduler + CUDA Cores | VPU | Vector arithmetic: ReLUs, pointwise ops, reductions |
| SMEM (L1 Cache) | VMEM | Fast on-chip cache for activations and TC inputs |
| Register File | VRegs | Fastest per-core storage |

An H100 has 132 SMs. A B200 has 148 SMs. Each SM is more or less independent, so a GPU can execute hundreds of separate tasks concurrently. This is fundamentally different from a TPU, which has only 1-2 Tensor Cores running in lockstep.

[Interactive figure: GPU Architecture Overview]

Check: How many independent SMs does an H100 GPU have?

Chapter 1: Inside an SM

Each H100 SM is divided into 4 identical quadrants (NVIDIA calls these "SM subpartitions"), each containing:

| Component | Per Subpartition | Per SM | Per H100 Chip |
| --- | --- | --- | --- |
| Tensor Core | 1 | 4 | 528 |
| fp32 CUDA Cores | 32 | 128 | 16,896 |
| Register File | 16,384 × 32-bit | 256 kB | 33 MB |
| Warp Scheduler | 1 | 4 | 528 |

Tensor Cores

The Tensor Core is the GPU's matrix multiplication unit — analogous to the TPU's MXU. It represents the vast majority of the GPU's FLOPs/s:

H100: 990 bf16 TC TFLOPS vs. 66 TFLOPS from CUDA cores (15x ratio)

Working backward from the specs: 990 TFLOPS with 132 SMs at 1.76 GHz means each TC can do roughly 1024 bf16 FLOPs/cycle, approximately an 8×8×8 matmul. Each GPU generation since Volta has increased TC size. In Blackwell (B200), the TC has gotten so large it can no longer fit its inputs in SMEM, requiring a new memory space called TMEM.
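The back-of-envelope estimate above can be reproduced in a few lines. This is a minimal sketch that uses only the H100 figures quoted in this chapter; the function and constant names are illustrative, not part of any library.

```python
# Back out per-Tensor-Core throughput from the H100 headline figures quoted above.
PEAK_BF16_FLOPS = 990e12   # bf16 Tensor Core FLOPs/s
NUM_SMS = 132
TCS_PER_SM = 4             # one Tensor Core per SM subpartition
CLOCK_HZ = 1.76e9


def flops_per_tc_per_cycle(peak_flops, num_sms, tcs_per_sm, clock_hz):
    """FLOPs each Tensor Core must retire per cycle to reach the chip's peak."""
    return peak_flops / (num_sms * tcs_per_sm * clock_hz)


per_tc = flops_per_tc_per_cycle(PEAK_BF16_FLOPS, NUM_SMS, TCS_PER_SM, CLOCK_HZ)

# An m x k x n matmul costs 2*m*k*n FLOPs (one multiply and one add per term).
flops_8x8x8 = 2 * 8 * 8 * 8

print(f"{per_tc:.0f} FLOPs/cycle per TC; an 8x8x8 matmul is {flops_8x8x8} FLOPs")
```

The result lands slightly above 1024, which is why the text describes the Tensor Core as doing roughly one 8×8×8 matmul per cycle.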

CUDA Cores and SIMT

CUDA cores are more flexible than a TPU's VPU. They use a SIMT (Single Instruction Multiple Threads) model, compared to the TPU's SIMD model:

| Property | GPU (SIMT) | TPU (SIMD) |
| --- | --- | --- |
| Same instruction per cycle? | Yes, within a warp (32 threads) | Yes, across all VPU lanes |
| Per-thread state? | Yes: each thread has its own instruction pointer | No: all lanes share control |
| Divergent branches? | Supported (via masking), but wastes cycles | Not supported |
| Memory access pattern | Each thread can access arbitrary addresses | Must be contiguous blocks |

Warp divergence: When threads in the same warp need to do different things (e.g., an if/else branch), the GPU executes both paths and masks out inactive threads. Performance degrades silently: the code still produces correct results, but the warp pays for both paths. TPUs avoid this problem by design, at the cost of less flexibility.
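To make the cost of divergence concrete, here is a toy cost model, not a description of real hardware timing. It assumes each branch has a fixed cycle cost (the values are hypothetical) and that a warp pays for every branch that at least one of its threads takes.

```python
WARP_SIZE = 32


def warp_cycles(takes_if, if_cost, else_cost):
    """Cycles a warp spends on an if/else under a simple masking model.

    takes_if: one bool per thread; True means the thread takes the `if` path.
    A path costs its full cycle count whenever at least one thread takes it,
    because the other threads are masked out rather than skipped.
    """
    assert len(takes_if) == WARP_SIZE
    cycles = 0
    if any(takes_if):
        cycles += if_cost
    if not all(takes_if):
        cycles += else_cost
    return cycles


uniform = [True] * WARP_SIZE                         # every thread agrees
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]   # threads alternate paths

print(warp_cycles(uniform, if_cost=10, else_cost=10))    # one path executed
print(warp_cycles(divergent, if_cost=10, else_cost=10))  # both paths executed
```

Under this model a single dissenting thread is enough to make the whole warp pay for both branches.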

Warp Scheduling

SMs operate like multi-threaded CPUs: they can have up to 64 resident warps but only execute one per subpartition per cycle. The scheduler automatically switches between warps to hide memory latency — while one warp waits for a load, another computes. TPUs are generally single-threaded by comparison.

Check: What is the ratio of Tensor Core FLOPs to CUDA Core FLOPs on an H100?

Chapter 2: Memory Hierarchy

GPUs have a deep memory hierarchy, from fastest/smallest to slowest/largest:

Registers (256 kB/SM, 33 MB total)
Fastest. 16,384 × 32-bit per subpartition. Each CUDA core accesses up to 256 registers.
↓ slower, larger
SMEM / L1 Cache (256 kB/SM, 33 MB total)
Per-SM. Programmer-controlled or hardware-managed. Feeds Tensor Core inputs.
↓ slower, larger
L2 Cache (~50 MB on H100, ~126 MB on B200)
Shared across all SMs. ~5.5 TB/s measured bandwidth. NOT programmer controlled.
↓ slower, larger
HBM (80 GB on H100, 192 GB on B200)
Main GPU memory. 3.35 TB/s on H100, 8-9 TB/s on B200. Stores weights, gradients, activations.

Key Comparisons to TPU

| Property | H100 GPU | TPU v5p |
| --- | --- | --- |
| Total fast cache (SMEM/VMEM) | 33 MB | 128 MB |
| Fast cache bandwidth | ~5.5 TB/s (L2, measured) | ~40 TB/s (VMEM) |
| HBM capacity | 80 GB | 96 GB |
| HBM bandwidth | 3.35 TB/s | 2.3 TB/s |
| Programmer-controlled cache? | Partially (SMEM yes, L2 no) | Yes (VMEM fully controlled) |

The L2 cache problem: The L2 cache is shared across all 132 SMs but is not programmer-controlled. This leads to "spooky action at a distance": one kernel's memory access pattern can evict another kernel's cached data. The programmer must modify access patterns to ensure the L2 is well-used, despite having no direct control over it.

By contrast, a TPU v5p's VMEM (128 MB) is more than 2x larger than the H100's L2 (50 MB), about 7x faster (~40 TB/s vs. ~5.5 TB/s), and fully programmer-controlled. This can make TPUs faster for LLM inference if you can consistently store or prefetch model weights into VMEM.

Blackwell Additions

B200 GPUs add 256 kB of TMEM per SM — a new memory space specifically for feeding the larger Blackwell Tensor Cores. The TC has grown so large that its accumulator no longer fits in registers or SMEM.

Check: Why is the GPU's L2 cache harder to optimize for than the TPU's VMEM?

Chapter 3: GPU Specs Across Generations

Here is a comprehensive comparison of recent NVIDIA GPU generations. These numbers are essential for roofline calculations.

Memory Capacity

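| GPU | Generation | SMs | SMEM/SM | L2 | HBM |
| --- | --- | --- | --- | --- | --- |
| V100 | Volta | 80 | 96 kB | 6 MB | 32 GB |
| A100 | Ampere | 108 | 192 kB | 40 MB | 80 GB |
| H100 | Hopper | 132 | 256 kB | 50 MB | 80 GB |
| H200 | Hopper | 132 | 256 kB | 50 MB | 141 GB |
| B200 | Blackwell | 148 | 256 kB | 126 MB | 192 GB |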

Compute and Bandwidth

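| GPU | HBM BW | bf16 TFLOPS | fp8 TFLOPS | fp4 TFLOPS |
| --- | --- | --- | --- | --- |
| A100 | 2.0 TB/s | 312 | 624 (int8; the A100 has no fp8 support) | n/a |
| H100 | 3.35 TB/s | 990 | 1,979 | n/a |
| H200 | 4.8 TB/s | 990 | 1,979 | n/a |
| B200 | 8.0 TB/s | 2,250 | 4,500 | 9,000 |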

Critical Matmul Intensity

The critical intensity is the ratio of FLOPs/s to bytes/s — it tells you the minimum batch size needed to be compute-bound:

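$$I_{\text{critical}} = \frac{\text{FLOPs/s}}{\text{bytes/s}}$$

| GPU | bf16 Intensity | fp8 Intensity | Meaning |
| --- | --- | --- | --- |
| H100 | 990 / 3.35 = 295 | 1979 / 3.35 = 590 | Need batch ≥ 295 tokens to saturate bf16 TCs |
| B200 | 2250 / 8.0 = 281 | 4500 / 8.0 = 562 | Similar to H100, since FLOPs/s and bandwidth each grew by more than 2x |

Interesting pattern: The critical intensity stays roughly constant from H100 to B200 (~280-295 for bf16), because FLOPs/s and HBM bandwidth grew by nearly the same factor. The same batch size rules of thumb therefore carry over between these two generations. The A100 figures in the table above give a lower ratio (312 / 2.0 = 156), so the pattern does not extend back to Ampere.
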
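The ratios can be recomputed directly from the spec tables. The sketch below uses the bf16 and HBM figures listed earlier in this chapter; the dictionary and function names are illustrative.

```python
# (bf16 TFLOPs/s, HBM bandwidth in TB/s), taken from the spec tables above.
SPECS = {
    "A100": (312, 2.0),
    "H100": (990, 3.35),
    "B200": (2250, 8.0),
}


def critical_intensity(tflops, tb_per_s):
    """FLOPs per byte of HBM traffic needed to become compute-bound."""
    return tflops / tb_per_s  # the 1e12 factors cancel


for name, (tflops, bandwidth) in SPECS.items():
    print(f"{name}: {critical_intensity(tflops, bandwidth):.0f} FLOPs/byte")
```

The A100 row is included to show that the ratio roughly doubled from Ampere to Hopper before levelling off between Hopper and Blackwell.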
Check: What is the approximate bf16 critical intensity on an H100 (the minimum batch size to be compute-bound)?

Chapter 4: GPU vs TPU

Let us make the comparison precise with a 1:1 mapping of components:

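| GPU Component | TPU Component | H100 Count | TPU v5p Count |
| --- | --- | --- | --- |
| SM | Tensor Core | 132 | 2 |
| Warp Scheduler | VPU slots | 528 | 8 |
| SMEM (L1) | VMEM | 33 MB | 128 MB |
| Registers | VRegs | 33 MB | 256 kB |
| Tensor Core | MXU | 528 | 8 |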

Modularity Trade-off

The fundamental difference: GPUs are more modular (100+ independent SMs vs. 2 Tensor Cores), which makes them more flexible but harder to reason about.

Historical context: GPUs were designed for video games — rendering millions of triangles with diverse shader programs. They became ML accelerators because matrix multiplication turned out to look a lot like the parallel computation games needed. TPUs were designed specifically for ML from day one, which is why they look simpler and are more compiler-friendly.

Cost Comparison

| Metric | H200 GPU | TPU v5p |
| --- | --- | --- |
| bf16 FLOPs/s | 990 TFLOPS | ~460 TFLOPS |
| HBM | 141 GB | 96 GB |
| Cloud price (approx.) | ~$10/hour | ~$4/hour |
| FLOPs per dollar | ~99 TFLOPS per $/hr | ~115 TFLOPS per $/hr |

Individual GPUs are more powerful but more expensive. TPUs rely more heavily on networking multiple chips together to compete.

TPU advantage (fast cache): TPUs have 4x more VMEM than GPUs have SMEM, and VMEM bandwidth (~40 TB/s) is 7x higher than the GPU's measured L2 bandwidth (~5.5 TB/s). This makes TPUs particularly strong for inference, where model weights can be prefetched into VMEM.

Check: Why can TPUs often get closer to peak roofline performance with less effort?

Chapter 5: NVLink Networking

GPU networking is fundamentally different from TPU networking. TPUs use a 2D/3D torus where each chip connects only to its neighbors. GPUs use a hierarchical tree with two levels of interconnect:

| Level | Interconnect | H100 BW (full-duplex) | B200 BW |
| --- | --- | --- | --- |
| Intra-node (8 GPUs) | NVLink via NVSwitches | 450 GB/s per GPU | 900 GB/s |
| Inter-node | InfiniBand (IB) via NICs | 50 GB/s per GPU (400 Gbps) | 50 GB/s |

The GPU Node

A node is a set of 8 GPUs (up to 72 for GB200 NVL72) connected with all-to-all, full-bandwidth NVLink interconnects. Each H100 node has 4 NVSwitches connecting all 8 GPUs in a 5+4+4+5 link pattern.

NVLink bandwidth has grown with each generation:

| NVLink Gen | GPU Gen | BW/Link | Links/GPU | Total/GPU | Node Size |
| --- | --- | --- | --- | --- | --- |
| 3.0 | Ampere | 25 GB/s | 12 | 300 GB/s | 8 |
| 4.0 | Hopper | 25 GB/s | 18 | 450 GB/s | 8 |
| 5.0 | Blackwell | 50 GB/s | 18 | 900 GB/s | 8 or 72 |

The 9x jump for GB200 NVL72: NVIDIA's GB200 NVL72 puts 72 GPUs in a single NVLink domain with 900 GB/s per GPU. The node egress bandwidth to the IB fabric jumps to 3.6 TB/s (9x more than an H100 node). This fundamentally changes the roofline for cross-node communication.
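The totals in the table and the 9x egress figure follow from simple multiplication. The sketch below assumes, as the interconnect table in this chapter does, that every GPU egresses 50 GB/s (one 400 Gbps NIC) to the InfiniBand fabric; the names are illustrative.

```python
# (bandwidth per link in GB/s, links per GPU), from the NVLink table above.
NVLINK_GENERATIONS = {
    "3.0 (Ampere)": (25, 12),
    "4.0 (Hopper)": (25, 18),
    "5.0 (Blackwell)": (50, 18),
}

IB_PER_GPU_GBS = 50  # assumption: one 400 Gbps NIC per GPU = 50 GB/s


def nvlink_total_gbs(bw_per_link, links_per_gpu):
    """Aggregate NVLink bandwidth per GPU in GB/s."""
    return bw_per_link * links_per_gpu


def node_egress_gbs(gpus_per_node):
    """Aggregate InfiniBand egress of one NVLink domain in GB/s."""
    return gpus_per_node * IB_PER_GPU_GBS


for gen, (bw, links) in NVLINK_GENERATIONS.items():
    print(f"NVLink {gen}: {nvlink_total_gbs(bw, links)} GB/s per GPU")

h100_egress = node_egress_gbs(8)
nvl72_egress = node_egress_gbs(72)
print(f"H100 node: {h100_egress} GB/s, NVL72: {nvl72_egress} GB/s, "
      f"ratio {nvl72_egress / h100_egress:.0f}x")
```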

Contrast with TPU

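TPU interconnect is simpler but scales differently: each chip connects directly to its neighbors in a 2D/3D torus, rather than through a hierarchy of NVSwitches and InfiniBand switches.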

Check: What is the GPU-to-GPU NVLink bandwidth on an H100 within a single node?

Chapter 6: Beyond the Node

Beyond the 8-GPU node, GPUs are connected using a fat tree topology built with InfiniBand switches. The reference architecture for 1024 H100 GPUs (a "DGX SuperPod") looks like this:

Node: 8 GPUs
4 NVSwitches, 450 GB/s per GPU via NVLink. 8 × 400Gbps IB NICs egressing.
↓ InfiniBand
Scalable Unit (SU): 32 nodes = 256 GPUs
8 leaf IB switches. 400 GB/s node egress bandwidth.
↓ InfiniBand
SuperPod: 4 SUs = 1024 GPUs
16 spine IB switches. Full bisection bandwidth maintained.

The key property of this fat tree: it provides full bisection bandwidth at every level. If you split the GPUs in half at any point, each half can egress 400 GB/s per node to the other half.

| Level | GPUs | Switches | GPU-to-GPU BW | Fat Tree BW |
| --- | --- | --- | --- | --- |
| Node | 8 | 4 NVSwitches | 450 GB/s | 450 GB/s |
| Leaf (SU) | 256 | 8 IB switches | 50 GB/s | 400 GB/s |
| Spine | 1024 | 16 IB switches | 50 GB/s | 400 GB/s |

Takeaway: Within an H100 node, you have 450 GB/s per GPU. Beyond the node, the effective collective bandwidth drops to ~400 GB/s per node, which is 50 GB/s per GPU. This 9:1 per-GPU ratio between intra-node and inter-node bandwidth is critical for choosing which parallelism dimensions to assign to which levels.

Scaling Beyond 1024 GPUs

For 2048 GPUs, you add more spine switches. For 4096+, you need a third level of switches ("core switches"). Meta trained LLaMA-3 on a custom Ethernet fabric with a 3-layer switch topology and an oversubscribed top-level switch.

Check: What bandwidth does a fat tree topology guarantee between any two equal halves of the network?

Chapter 7: Collectives on GPUs

GPUs support the same collectives as TPUs (AllGather, ReduceScatter, AllReduce, AllToAll), but the cost structure changes depending on whether they run intra-node or cross-node. These are implemented in NVIDIA's NCCL ("nickel") library.

Intra-Node Collectives

AllGather / ReduceScatter: Ring reduction around all 8 GPUs, using full GPU-to-GPU NVLink bandwidth:

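$$T_{\text{AG or RS}} = \frac{\text{bytes} \cdot (N - 1)}{N \cdot W_{\text{GPU egress}}} \approx \frac{\text{bytes}}{W_{\text{GPU egress}}}$$

An AllReduce costs 2x the above, since it is a ReduceScatter followed by an AllGather. On an H100 node, an AllGather or ReduceScatter therefore takes T ≈ bytes / 450 GB/s, and an AllReduce takes about twice that.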
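The formulas above translate directly into a small bandwidth-only cost model. This is a sketch that ignores latency and protocol overhead; the 1 GB array size is a hypothetical example and the function names are illustrative.

```python
W_NVLINK = 450e9  # H100 per-GPU NVLink egress, bytes/s


def all_gather_time(total_bytes, n_gpus, egress_bw):
    """Ring AllGather or ReduceScatter time in seconds (bandwidth-bound)."""
    return total_bytes * (n_gpus - 1) / (n_gpus * egress_bw)


def all_reduce_time(total_bytes, n_gpus, egress_bw):
    """AllReduce modelled as a ReduceScatter followed by an AllGather."""
    return 2 * all_gather_time(total_bytes, n_gpus, egress_bw)


size = 1e9  # hypothetical 1 GB array
exact = all_gather_time(size, 8, W_NVLINK)
approx = size / W_NVLINK  # the (N - 1) / N factor dropped

print(f"AllGather: {exact * 1e3:.2f} ms exact, {approx * 1e3:.2f} ms approximate")
print(f"AllReduce: {all_reduce_time(size, 8, W_NVLINK) * 1e3:.2f} ms")
```

With 8 GPUs the approximation overestimates by a factor of 8/7, which is usually well inside the error of a roofline estimate.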

AllToAll: GPUs have all-to-all connectivity within a node, making this simpler than on TPUs:

$$T_{\text{AllToAll}} = \frac{\text{bytes} \cdot (N - 1)}{N^2 \cdot W} \approx \frac{\text{bytes}}{N \cdot W}$$

Within a node: T ≈ bytes / (8 × 450 GB/s). For top-k MoE routing, cost is further reduced by k/N.
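The AllToAll cost can be sketched the same way. The top-k helper simply applies the k/N scaling quoted above; the array size, expert count, and function names are hypothetical.

```python
def all_to_all_time(total_bytes, n_gpus, bw):
    """Intra-node AllToAll time in seconds (bandwidth-bound model)."""
    return total_bytes * (n_gpus - 1) / (n_gpus**2 * bw)


def top_k_all_to_all_time(total_bytes, n_gpus, bw, top_k):
    """AllToAll time scaled by k/N, the reduction quoted for top-k MoE routing."""
    return all_to_all_time(total_bytes, n_gpus, bw) * top_k / n_gpus


size = 1e9  # hypothetical 1 GB of routed activations
dense = all_to_all_time(size, 8, 450e9)
sparse = top_k_all_to_all_time(size, 8, 450e9, top_k=2)

print(f"dense: {dense * 1e6:.0f} us, top-2 of 8: {sparse * 1e6:.0f} us")
```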

NVIDIA SHARP

Since Hopper, NVIDIA switches support SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) — in-network reductions. The switch itself performs the reduce and broadcasts the result, theoretically halving AllReduce cost.

Theory vs practice: SHARP should nearly halve AllReduce cost (from 2B/W to B/W). In practice, the measured improvement is only about 30%, bringing effective bandwidth from ~370 GB/s to ~480 GB/s. Still useful, but far from the theoretical 2x.

Cross-Node Collectives

Because of the fat tree topology with full bisection bandwidth, the cost of cross-node AllGather/ReduceScatter is approximately:

$$T_{\text{AG or RS}}^{\text{cross-node}} \approx \frac{\text{bytes}}{W_{\text{node egress}}} = \frac{\text{bytes}}{400\ \text{GB/s}} \quad \text{(on H100)}$$

AllToAll cross-node is much worse: Since AllToAll is not hierarchical, it cannot take advantage of intra-node reduction. The effective bandwidth drops from 450 GB/s to roughly 50 GB/s per GPU — a 9x degradation.
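A short comparison shows why hierarchical collectives lose little across nodes while per-GPU traffic loses a lot. This sketch uses the H100 figures from this chapter and a hypothetical 1 GB array.

```python
W_NVLINK_PER_GPU = 450e9  # bytes/s, intra-node
W_NODE_EGRESS = 400e9     # bytes/s, per 8-GPU node over InfiniBand
GPUS_PER_NODE = 8

size = 1e9  # hypothetical 1 GB array

# AllGather / ReduceScatter: the whole node shares its egress bandwidth.
t_intra = size / W_NVLINK_PER_GPU
t_cross = size / W_NODE_EGRESS

# Per-GPU view: what each GPU gets of the node's InfiniBand egress.
per_gpu_ib = W_NODE_EGRESS / GPUS_PER_NODE
per_gpu_ratio = W_NVLINK_PER_GPU / per_gpu_ib

print(f"AllGather: {t_intra * 1e3:.2f} ms intra-node, {t_cross * 1e3:.2f} ms cross-node")
print(f"Per-GPU bandwidth: {per_gpu_ib / 1e9:.0f} GB/s over IB, {per_gpu_ratio:.0f}x below NVLink")
```

The AllGather slows down only by 450/400, whereas traffic that must leave each GPU individually sees the full 9x gap.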

Practical Measurements

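Although NVIDIA claims 450 GB/s NVLink bandwidth, practical measurements fall short of it. As noted in the SHARP discussion above, AllReduce reaches roughly 370 GB/s of effective bandwidth without SHARP, and the shortfall is larger for small messages, where latency dominates.

TPU advantage here: TPUs achieve peak bandwidth at much smaller message sizes. This is because TPU ICI has lower latency and less protocol overhead than NVLink.

Check: How much does AllToAll bandwidth degrade when going from intra-node to cross-node on H100?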

Chapter 8: LLM Rooflines on GPUs

With all the hardware specs and networking numbers in hand, let us derive the rooflines for LLM parallelism strategies on GPUs.

Data Parallelism

With DP, each GPU holds a full model copy and processes a shard of the batch. Communication is an AllReduce on gradients after each step.

$$T_{\text{DP comms}} \approx \frac{2 \cdot \text{model\_bytes}}{W_{\text{collective}}}$$

For a 7B model in bf16 (14 GB of gradients) on an 8-GPU H100 node: T ≈ 2 × 14 GB / 450 GB/s ≈ 62 ms. With SHARP: ~48 ms.
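The 7B example above works out as follows. The SHARP speedup is taken as the roughly 30% measured improvement quoted earlier in this chapter, not the theoretical 2x; the names are illustrative.

```python
def dp_all_reduce_time(n_params, bytes_per_param, collective_bw):
    """Gradient AllReduce time in seconds: 2 * model_bytes / bandwidth."""
    model_bytes = n_params * bytes_per_param
    return 2 * model_bytes / collective_bw


SHARP_SPEEDUP = 1.3  # assumption: ~30% measured gain, as quoted above

t_plain = dp_all_reduce_time(n_params=7e9, bytes_per_param=2, collective_bw=450e9)
t_sharp = t_plain / SHARP_SPEEDUP

print(f"without SHARP: {t_plain * 1e3:.0f} ms, with SHARP: {t_sharp * 1e3:.0f} ms")
```

To decide whether this cost is hidden, compare it against the compute time of one training step at your batch size.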

Tensor Parallelism

TP splits weight matrices across GPUs within a node. Each layer requires an AllReduce (or AllGather + ReduceScatter) with communication proportional to the activation size, not the weight size.

TP should stay within the node. TP requires per-layer communication, so it needs the highest bandwidth. NVLink at 450 GB/s within a node is 9x faster than InfiniBand at 50 GB/s cross-node. Putting TP across nodes is almost always a mistake.

Pipeline Parallelism

PP splits model layers across nodes. Communication is point-to-point: send activations from one stage to the next. The volume is small (one batch of activations), so the 400 GB/s cross-node bandwidth is usually sufficient. The bottleneck is pipeline bubbles, not communication.

Expert Parallelism

EP requires AllToAll communication for token routing. This is particularly sensitive to the intra-node vs cross-node distinction:

| Scope | AllToAll time | Impact |
| --- | --- | --- |
| EP within node (8-way) | bytes / (8 × 450 GB/s) | Fast: all-to-all connectivity over NVLink |
| EP across 2 nodes (16-way) | bytes / (2 × 400 GB/s) | ~4.5x slower: limited by IB |
| EP across 4+ nodes (M = number of nodes) | bytes / (M × 400 GB/s) | Even slower per GPU |
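The rows of the table can be evaluated with one helper. This is a sketch of the bandwidth-only model above with H100 defaults; the function name and the 1 GB payload are hypothetical.

```python
def ep_all_to_all_time(total_bytes, n_nodes, gpus_per_node=8,
                       nvlink_bw=450e9, node_egress_bw=400e9):
    """AllToAll time in seconds for expert parallelism spanning n_nodes nodes."""
    if n_nodes == 1:
        # All traffic stays on NVLink.
        return total_bytes / (gpus_per_node * nvlink_bw)
    # Traffic is limited by each node's InfiniBand egress.
    return total_bytes / (n_nodes * node_egress_bw)


size = 1e9  # hypothetical 1 GB of routed tokens
one_node = ep_all_to_all_time(size, 1)
two_nodes = ep_all_to_all_time(size, 2)

print(f"1 node: {one_node * 1e6:.0f} us, 2 nodes: {two_nodes * 1e6:.0f} us, "
      f"slowdown {two_nodes / one_node:.1f}x")
```

Under this model the two-node case comes out at about 4.5x the single-node time.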

Summary for GPU LLM scaling:
TP: Always within a node (NVLink). Typically 2-way or 8-way.
PP: Across nodes (IB). Minimize pipeline stages to reduce bubble overhead.
DP: Across all remaining GPUs. Use ZeRO for memory savings.
EP: Preferably within a node. Cross-node EP is expensive due to AllToAll degradation.
CP (context parallelism): Within or across nodes depending on sequence length.

Check: Why should tensor parallelism be confined within a single GPU node?

Chapter 9: Worked Problems

Problem 1: CUDA Core Count

Q: How many fp32 CUDA cores does an H100 have? A B200? How does this compare to a TPU v5p?

Answer: H100: 132 SMs × 4 subpartitions × 32 fp32 cores = 16,896. B200: 148 × 4 × 32 = 18,944. TPU v5p: 2 TensorCores × 4 × 8 × 128 ALUs = 8,192. An H100 therefore has roughly 2x the vector lanes of a TPU v5p, at similar clock frequencies.

Problem 2: Vector FLOPs

Q: How many vector fp32 FLOPs/s can an H100 do? How does this compare to Tensor Core FLOPs?

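Answer: 132 × 4 × 32 × 1.59 GHz = 26.9 TFLOPS (33.5 at boost clock), counting one op per core per cycle. NVIDIA's reported figure counts each FMA (fused multiply-add) as 2 FLOPs, which doubles this to roughly 67 TFLOPS at boost. Tensor Cores do 990 bf16 TFLOPS: about 30x the single-op boost figure, or about 15x the FMA-counted one.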
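The arithmetic can be checked with a few lines. The 1.98 GHz boost clock is an assumption chosen to match the 33.5 TFLOPS boost figure in the answer; the other numbers come from the text.

```python
NUM_SMS, SUBPARTITIONS, CORES_PER_SUBPARTITION = 132, 4, 32
BASE_CLOCK_HZ = 1.59e9
BOOST_CLOCK_HZ = 1.98e9   # assumption: implied by the 33.5 TFLOPS boost figure
TC_BF16_FLOPS = 990e12

cores = NUM_SMS * SUBPARTITIONS * CORES_PER_SUBPARTITION


def vector_flops(clock_hz, flops_per_core_per_cycle=1):
    """Peak fp32 vector FLOPs/s; pass 2 to count an FMA as two FLOPs."""
    return cores * clock_hz * flops_per_core_per_cycle


base = vector_flops(BASE_CLOCK_HZ)
boost = vector_flops(BOOST_CLOCK_HZ)
boost_fma = vector_flops(BOOST_CLOCK_HZ, flops_per_core_per_cycle=2)

print(f"{cores} cores: {base / 1e12:.1f} TFLOPS base, {boost / 1e12:.1f} boost, "
      f"{boost_fma / 1e12:.1f} boost with FMA")
print(f"Tensor Core advantage: {TC_BF16_FLOPS / boost:.0f}x single-op, "
      f"{TC_BF16_FLOPS / boost_fma:.0f}x FMA-counted")
```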

Problem 3: Matmul Runtime

Q: How long should fp16[64, 4096] × fp16[4096, 8192] take on a B200? What about fp16[512, 4096] × fp16[4096, 8192]?

Answer: Critical intensity on B200 is ~281. Batch=64 is memory-bound: read/write 2×64×4096 + 2×4096×8192 + 2×64×8192 = 69 MB. T = 69 MB / 8 TB/s = 8.6 μs (practically ~10-12 μs with partial bandwidth). Batch=512 is compute-bound: T = 2×512×4096×8192 / (2.3×10^15) = 15 μs (practically ~20 μs).
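The same roofline reasoning can be written as a small function that takes the larger of the compute time and the memory time. This sketch uses the B200 figures from the spec table (2,250 TFLOPS and 8.0 TB/s) and assumes fp16 runs at the bf16 peak rate; the function name is illustrative.

```python
B200_FLOPS = 2.25e15   # bf16 Tensor Core FLOPs/s, from the spec table
B200_HBM_BW = 8.0e12   # bytes/s


def matmul_time(m, k, n, flops_per_s=B200_FLOPS, hbm_bw=B200_HBM_BW,
                bytes_per_elem=2):
    """Roofline estimate for [m, k] x [k, n]: (seconds, limiting resource)."""
    flops = 2 * m * k * n
    # Read both inputs and write the output once.
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    t_compute = flops / flops_per_s
    t_memory = bytes_moved / hbm_bw
    bound = "compute" if t_compute >= t_memory else "memory"
    return max(t_compute, t_memory), bound


for batch in (64, 512):
    t, bound = matmul_time(batch, 4096, 8192)
    print(f"batch={batch}: {t * 1e6:.1f} us ({bound}-bound)")
```

These are lower bounds; as the answer notes, measured times are typically somewhat higher.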

Problem 4: L1 Cache Capacity

Q: What is the total L1/SMEM + register capacity on an H100? How does it compare to TPU VMEM?

Answer: (256 kB SMEM + 256 kB registers) per SM × 132 SMs = 66 MB total. A TPU v5p has ~128 MB VMEM plus 256 kB of vector registers. The TPU has roughly 2x the fast-cache capacity, with much lower latency access to VMEM.

Problem 5: AllGather Time

Q: How long does AllGather(bf16[1024, 16384]) take on an 8xH100 node?

Answer: Total bytes = 2 × 1024 × 16384 = 32 MiB (about 33.6 MB). T ≈ 33.6 MB × 7/8 / 450 GB/s ≈ 65 μs. In practice closer to 75 μs due to latency effects at this message size.

Problem 6: B200 Clock Frequency

Q: NVIDIA reports 80 TFLOPS fp32 vector compute for B200. Given FMA (2 FLOPs/cycle per core), estimate the clock speed.

Answer: 148 × 4 × 32 = 18,944 cores. With FMA: 18,944 × 2 = 37,888 FLOPs/cycle. Clock = 80×10^12 / 37,888 ≈ 2.1 GHz. This is plausible for a liquid-cooled chip.

Check: On an H100, which is larger: the total SMEM across all SMs, or the L2 cache?