GPU architecture from SMs to NVLink fabrics — memory hierarchy, collectives, rooflines, and how GPUs compare to TPUs for LLM scaling.
A modern ML GPU (H100, B200) is fundamentally a collection of compute cores that specialize in matrix multiplication — called Streaming Multiprocessors (SMs) — connected to a stick of fast memory called HBM.
Each SM, like a TPU's Tensor Core, contains:
| GPU Component | TPU Equivalent | What It Does |
|---|---|---|
| Tensor Core | MXU | Dedicated matrix multiplication unit (vast majority of FLOPs/s) |
| Warp Scheduler + CUDA Cores | VPU | Vector arithmetic: ReLUs, pointwise ops, reductions |
| SMEM (L1 Cache) | VMEM | Fast on-chip cache for activations and TC inputs |
| Register File | VRegs | Fastest per-core storage |
An H100 has 132 SMs. A B200 has 148 SMs. Each SM is more or less independent, so a GPU can execute hundreds of separate tasks concurrently. This is fundamentally different from a TPU, which has only 1-2 Tensor Cores running in lockstep.
Each H100 SM is divided into 4 identical quadrants, which NVIDIA calls "SM subpartitions", each containing:
| Component | Per Subpartition | Per SM | Per H100 Chip |
|---|---|---|---|
| Tensor Core | 1 | 4 | 528 |
| fp32 CUDA Cores | 32 | 128 | 16,896 |
| Register File | 16,384 × 32-bit (64 kB) | 256 kB | 33 MB |
| Warp Scheduler | 1 | 4 | 528 |
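The per-chip column follows directly from the per-subpartition counts. A minimal sketch of that arithmetic, using only the numbers in the table (the variable names are illustrative):

```python
# Per-chip totals for an H100, derived from the per-subpartition counts above.
SMS_PER_H100 = 132
SUBPARTITIONS_PER_SM = 4

tensor_cores = SMS_PER_H100 * SUBPARTITIONS_PER_SM * 1
cuda_cores = SMS_PER_H100 * SUBPARTITIONS_PER_SM * 32

# Each subpartition holds 16,384 registers of 32 bits (4 bytes) each.
regfile_bytes_per_sm = SUBPARTITIONS_PER_SM * 16_384 * 4
regfile_mib_per_chip = SMS_PER_H100 * regfile_bytes_per_sm / 2**20

print(f"Tensor Cores per chip: {tensor_cores}")
print(f"fp32 CUDA cores per chip: {cuda_cores}")
print(f"Register file per SM: {regfile_bytes_per_sm // 1024} kB")
print(f"Register file per chip: {regfile_mib_per_chip:.0f} MB")
```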
The Tensor Core is the GPU's matrix multiplication unit — analogous to the TPU's MXU. It represents the vast majority of the GPU's FLOPs/s:
Working backward from the specs: 990 TFLOPS with 132 SMs at 1.76 GHz means each TC can do roughly 1024 bf16 FLOPs/cycle, approximately an 8×8×8 matmul. Each GPU generation since Volta has increased TC size. In Blackwell (B200), the TC has gotten so large it can no longer fit its inputs in SMEM, requiring a new memory space called TMEM.
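The "working backward" step can be checked in a few lines. This is a sketch using the figures quoted above (990 TFLOPS, 132 SMs, 4 Tensor Cores per SM, 1.76 GHz); the variable names are illustrative only:

```python
# Estimate bf16 FLOPs per cycle per Tensor Core from the H100 peak spec.
peak_bf16_flops = 990e12   # FLOPs/s
num_sms = 132
tcs_per_sm = 4
clock_hz = 1.76e9

flops_per_tc_per_cycle = peak_bf16_flops / (num_sms * tcs_per_sm * clock_hz)

# An M x N x K matmul costs 2*M*N*K FLOPs (one multiply and one add per term).
flops_8x8x8 = 2 * 8 * 8 * 8

print(f"FLOPs per TC per cycle: {flops_per_tc_per_cycle:.0f}")
print(f"FLOPs in an 8x8x8 matmul: {flops_8x8x8}")
```

The estimate lands slightly above 1024, which is why the text describes the Tensor Core as "approximately" an 8×8×8 matmul unit per cycle.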
CUDA cores are more flexible than a TPU's VPU. They use a SIMT (Single Instruction Multiple Threads) model, compared to the TPU's SIMD model:
| Property | GPU (SIMT) | TPU (SIMD) |
|---|---|---|
| Same instruction per cycle? | Yes, within a warp (32 threads) | Yes, across all VPU lanes |
| Per-thread state? | Yes — each thread has its own instruction pointer | No — all lanes share control |
| Divergent branches? | Supported (via masking), but wastes cycles | Not supported |
| Memory access pattern | Each thread can access arbitrary addresses | Must be contiguous blocks |
SMs operate like multi-threaded CPUs: they can have up to 64 resident warps but only execute one per subpartition per cycle. The scheduler automatically switches between warps to hide memory latency — while one warp waits for a load, another computes. TPUs are generally single-threaded by comparison.
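GPUs have a deep memory hierarchy. From fastest and smallest to slowest and largest, it runs: registers, SMEM (L1), L2 cache, and HBM. The table below compares the H100's hierarchy with that of a TPU v5p: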
| Property | H100 GPU | TPU v5p |
|---|---|---|
| Total fast cache (SMEM/VMEM) | 33 MB | 128 MB |
| Fast cache bandwidth | ~5.5 TB/s (L2, measured) | ~40 TB/s (VMEM) |
| HBM capacity | 80 GB | 96 GB |
| HBM bandwidth | 3.35 TB/s | 2.3 TB/s |
| Programmer-controlled cache? | Partially (SMEM yes, L2 no) | Yes (VMEM fully controlled) |
Compared with the H100's 50 MB L2 cache, a TPU's VMEM is more than 2x larger (128 MB), about 7x faster (~40 TB/s vs. ~5.5 TB/s measured), and fully programmer-controlled. This can make TPUs faster for LLM inference if you can consistently store or prefetch model weights into VMEM.
B200 GPUs add 256 kB of TMEM per SM — a new memory space specifically for feeding the larger Blackwell Tensor Cores. The TC has grown so large that its accumulator no longer fits in registers or SMEM.
Here is a comprehensive comparison of recent NVIDIA GPU generations. These numbers are essential for roofline calculations.
| GPU | Generation | SMs | SMEM/SM | L2 | HBM |
|---|---|---|---|---|---|
| V100 | Volta | 80 | 96 kB | 6 MB | 32 GB |
| A100 | Ampere | 108 | 192 kB | 40 MB | 80 GB |
| H100 | Hopper | 132 | 256 kB | 50 MB | 80 GB |
| H200 | Hopper | 132 | 256 kB | 50 MB | 141 GB |
| B200 | Blackwell | 148 | 256 kB | 126 MB | 192 GB |
| GPU | HBM BW | bf16 TFLOPS | fp8 TFLOPS | fp4 TFLOPS |
|---|---|---|---|---|
| A100 | 2.0 TB/s | 312 | 624 (int8; no fp8 support) | — |
| H100 | 3.35 TB/s | 990 | 1,979 | — |
| H200 | 4.8 TB/s | 990 | 1,979 | — |
| B200 | 8.0 TB/s | 2,250 | 4,500 | 9,000 |
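The critical intensity is the ratio of peak FLOPs/s to HBM bytes/s. A kernel is compute-bound only if its arithmetic intensity (FLOPs per byte moved) exceeds this ratio. For a weight-dominated bf16 matmul the arithmetic intensity is roughly the token batch size, so the bf16 ratio doubles as the minimum batch size needed to be compute-bound: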
| GPU | bf16 Intensity | fp8 Intensity | Meaning |
|---|---|---|---|
| H100 | 990 / 3.35 = 295 | 1979 / 3.35 = 590 | Need batch ≥ 295 tokens to saturate bf16 TCs |
| B200 | 2250 / 8.0 = 281 | 4500 / 8.0 = 562 | Similar to H100: FLOPs/s and HBM bandwidth both grew by more than 2x |
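The table entries can be reproduced with a small helper. This is a sketch, not a profiler: `critical_intensity` and `bf16_matmul_is_compute_bound` are hypothetical names, and the batch-size check assumes a weight-dominated bf16 matmul whose arithmetic intensity is approximately the token batch size.

```python
# Peak specs copied from the tables above (TFLOPS and TB/s).
SPECS = {
    "H100": {"hbm_tb_s": 3.35, "bf16_tflops": 990, "fp8_tflops": 1979},
    "B200": {"hbm_tb_s": 8.0, "bf16_tflops": 2250, "fp8_tflops": 4500},
}


def critical_intensity(tflops: float, hbm_tb_s: float) -> float:
    """FLOPs the chip can execute per byte it can load from HBM."""
    return tflops / hbm_tb_s


def bf16_matmul_is_compute_bound(batch_tokens: int, gpu: str) -> bool:
    # Assumption: weight-dominated bf16 matmul, so the arithmetic
    # intensity is approximately equal to the token batch size.
    spec = SPECS[gpu]
    return batch_tokens >= critical_intensity(spec["bf16_tflops"], spec["hbm_tb_s"])


for name, spec in SPECS.items():
    bf16 = critical_intensity(spec["bf16_tflops"], spec["hbm_tb_s"])
    fp8 = critical_intensity(spec["fp8_tflops"], spec["hbm_tb_s"])
    print(f"{name}: bf16 intensity {bf16:.1f}, fp8 intensity {fp8:.1f}")

print(bf16_matmul_is_compute_bound(256, "H100"))
print(bf16_matmul_is_compute_bound(512, "H100"))
```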
Let us make the comparison precise with a 1:1 mapping of components:
| GPU Component | TPU Component | H100 Count | TPU v5p Count |
|---|---|---|---|
| SM | Tensor Core | 132 | 2 |
| Warp Scheduler | VPU slots | 528 | 8 |
| SMEM (L1) | VMEM | 33 MB | 128 MB |
| Registers | VRegs | 33 MB | 256 kB |
| Tensor Core | MXU | 528 | 8 |
The fundamental difference: GPUs are more modular (100+ independent SMs vs. 2 Tensor Cores). At the chip level, the two compare as follows:
| Metric | H200 GPU | TPU v5p |
|---|---|---|
| bf16 FLOPs/s | 990 TFLOPS | ~460 TFLOPS |
| HBM | 141 GB | 96 GB |
| Cloud price (approx.) | ~$10/hour | ~$4/hour |
| FLOPs per dollar | ~99 TFLOPS/$/hr | ~115 TFLOPS/$/hr |
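The last row is simply peak throughput divided by hourly price. A minimal sketch, treating the approximate cloud prices in the table as given (they vary by provider and over time):

```python
# Peak bf16 TFLOPS and approximate hourly price, copied from the table above.
chips = {
    "H200": {"bf16_tflops": 990, "usd_per_hour": 10.0},
    "TPU v5p": {"bf16_tflops": 460, "usd_per_hour": 4.0},
}

# TFLOPS obtained per dollar-per-hour of spend.
tflops_per_dollar = {
    name: c["bf16_tflops"] / c["usd_per_hour"] for name, c in chips.items()
}

for name, value in tflops_per_dollar.items():
    print(f"{name}: {value:.0f} TFLOPS per $/hr")
```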
Individual GPUs are more powerful but more expensive. TPUs rely more heavily on networking multiple chips together to compete.
GPU networking is fundamentally different from TPU networking. TPUs use a 2D/3D torus where each chip connects only to its neighbors. GPUs use a hierarchical tree with two levels of interconnect:
| Level | Interconnect | H100 BW (full-duplex) | B200 BW |
|---|---|---|---|
| Intra-node (8 GPUs) | NVLink via NVSwitches | 450 GB/s per GPU | 900 GB/s |
| Inter-node | InfiniBand (IB) via NICs | 50 GB/s per GPU (400 Gbps) | 50 GB/s |
A node is a set of 8 GPUs (up to 72 for GB200 NVL72) connected with all-to-all, full-bandwidth NVLink interconnects. Each H100 node has 4 NVSwitches connecting all 8 GPUs in a 5+4+4+5 link pattern.
NVLink bandwidth has grown with each generation:
| NVLink Gen | GPU Gen | BW/Link | Links/GPU | Total/GPU | Node Size |
|---|---|---|---|---|---|
| 3.0 | Ampere | 25 GB/s | 12 | 300 GB/s | 8 |
| 4.0 | Hopper | 25 GB/s | 18 | 450 GB/s | 8 |
| 5.0 | Blackwell | 50 GB/s | 18 | 900 GB/s | 8 or 72 |
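The "Total/GPU" column is the per-link bandwidth multiplied by the link count. A quick sketch of that check, using only the table values:

```python
# (GB/s per link, links per GPU) for each NVLink generation in the table above.
NVLINK = {
    "3.0": (25, 12),
    "4.0": (25, 18),
    "5.0": (50, 18),
}

# Total per-GPU NVLink bandwidth in GB/s.
totals = {gen: bw_per_link * links for gen, (bw_per_link, links) in NVLINK.items()}

for gen, total in totals.items():
    print(f"NVLink {gen}: {total} GB/s per GPU")
```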
TPU interconnect is simpler but scales differently:
Beyond the 8-GPU node, GPUs are connected using a fat tree topology built with InfiniBand switches. The reference architecture for 1024 H100 GPUs (a "DGX SuperPod") looks like this:
The key property of this fat tree: it provides full bisection bandwidth at every level. If you split the GPUs in half at any point, each half can egress 400 GB/s per node to the other half.
| Level | GPUs | Switches | GPU-to-GPU BW | Fat Tree BW |
|---|---|---|---|---|
| Node | 8 | 4 NVSwitches | 450 GB/s | 450 GB/s |
| Leaf (SU) | 256 | 8 IB switches | 50 GB/s | 400 GB/s |
| Spine | 1024 | 16 IB switches | 50 GB/s | 400 GB/s |
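The fat-tree numbers in the table fit together as follows. This sketch only re-derives quantities from the figures above (8 GPUs per node, 50 GB/s of InfiniBand per GPU, 256 GPUs per SU, 1024 GPUs per pod); the variable names are illustrative.

```python
GPUS_PER_NODE = 8
IB_GB_S_PER_GPU = 50        # 400 Gbps NIC per GPU
NVLINK_GB_S_PER_GPU = 450   # intra-node bandwidth per GPU

# Each node can send this much into the InfiniBand fabric.
node_egress_gb_s = GPUS_PER_NODE * IB_GB_S_PER_GPU

# Node and SU counts implied by the table.
nodes_per_su = 256 // GPUS_PER_NODE
sus_per_pod = 1024 // 256

# How much slower a GPU's cross-node link is than its NVLink connection.
nvlink_to_ib_ratio = NVLINK_GB_S_PER_GPU / IB_GB_S_PER_GPU

print(f"Node egress: {node_egress_gb_s} GB/s")
print(f"Nodes per SU: {nodes_per_su}, SUs per pod: {sus_per_pod}")
print(f"NVLink is {nvlink_to_ib_ratio:.0f}x faster per GPU than InfiniBand")
```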
For 2048 GPUs, you add more spine switches. For 4096+, you need a third level of switches ("core switches"). Meta trained LLaMA-3 on a custom Ethernet fabric with a 3-layer switch topology and an oversubscribed top-level switch.
GPUs support the same collectives as TPUs (AllGather, ReduceScatter, AllReduce, AllToAll), but the cost structure changes depending on whether they run intra-node or cross-node. These are implemented in NVIDIA's NCCL ("nickel") library.
AllGather / ReduceScatter: run a ring around all 8 GPUs, using full GPU-to-GPU NVLink bandwidth. On an H100 node: T ≈ bytes / 450 GB/s.
AllReduce: 2x the AllGather cost (= ReduceScatter + AllGather), so on an H100 node T ≈ 2 × bytes / 450 GB/s.
AllToAll: GPUs have all-to-all connectivity within a node, making this simpler than on TPUs:
Within a node: T ≈ bytes / (8 × 450 GB/s). For top-k MoE routing, cost is further reduced by k/N.
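The intra-node approximations above can be collected into a few helpers. This is a bandwidth-only sketch that ignores latency and achieved-vs-peak bandwidth; the function names are hypothetical and are not NCCL APIs.

```python
H100_NVLINK_BW = 450e9  # bytes/s per GPU, from the table above


def allgather_time_s(num_bytes: float, gpu_bw: float = H100_NVLINK_BW) -> float:
    """AllGather or ReduceScatter within a node: T ~ bytes / W."""
    return num_bytes / gpu_bw


def allreduce_time_s(num_bytes: float, gpu_bw: float = H100_NVLINK_BW) -> float:
    """AllReduce = ReduceScatter + AllGather, so twice the cost."""
    return 2 * allgather_time_s(num_bytes, gpu_bw)


def alltoall_time_s(num_bytes: float, n_gpus: int = 8,
                    gpu_bw: float = H100_NVLINK_BW) -> float:
    """AllToAll within a node: T ~ bytes / (N * W)."""
    return num_bytes / (n_gpus * gpu_bw)


one_gb = 1e9
print(f"AllGather 1 GB: {allgather_time_s(one_gb) * 1e3:.2f} ms")
print(f"AllReduce 1 GB: {allreduce_time_s(one_gb) * 1e3:.2f} ms")
print(f"AllToAll  1 GB: {alltoall_time_s(one_gb) * 1e3:.2f} ms")
```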
Since Hopper, NVIDIA switches support SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) — in-network reductions. The switch itself performs the reduce and broadcasts the result, theoretically halving AllReduce cost.
Because the fat tree provides full bisection bandwidth, the cost of a cross-node AllGather/ReduceScatter is set approximately by each node's 400 GB/s egress bandwidth: T ≈ bytes / 400 GB/s.
Cross-node AllToAll is much worse. Because AllToAll is not hierarchical, it cannot take advantage of intra-node reduction, so the effective bandwidth drops from 450 GB/s to roughly 50 GB/s per GPU, a 9x degradation.
Although NVIDIA claims 450 GB/s NVLink bandwidth, practical measurements show:
With all the hardware specs and networking numbers in hand, let us derive the rooflines for LLM parallelism strategies on GPUs.
With DP, each GPU holds a full model copy and processes a shard of the batch. Communication is an AllReduce on gradients after each step.
For a 7B model in bf16 (14 GB of gradients) on an 8-GPU H100 node: T ≈ 2 × 14 GB / 450 GB/s ≈ 62 ms. With SHARP: ~48 ms, which is about a 1.3x improvement rather than the theoretical 2x.
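The data-parallel estimate can be reproduced as below. This sketch uses the bandwidth-only AllReduce model (2 × bytes / W) and takes the ~48 ms SHARP figure from the text as a given input rather than deriving it.

```python
params = 7e9
bytes_per_param = 2          # bf16 gradients
nvlink_bw = 450e9            # bytes/s per GPU on an H100 node

grad_bytes = params * bytes_per_param

# AllReduce = ReduceScatter + AllGather, so the bytes cross the link twice.
t_allreduce_ms = 2 * grad_bytes / nvlink_bw * 1e3

# The SHARP time quoted in the text; the speedup it implies over the ring estimate.
sharp_quoted_ms = 48.0
implied_speedup = t_allreduce_ms / sharp_quoted_ms

print(f"Gradient bytes: {grad_bytes / 1e9:.0f} GB")
print(f"AllReduce time: {t_allreduce_ms:.1f} ms")
print(f"Implied SHARP speedup: {implied_speedup:.2f}x")
```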
TP splits weight matrices across GPUs within a node. Each layer requires an AllReduce (or AllGather + ReduceScatter) with communication proportional to the activation size, not the weight size.
PP splits model layers across nodes. Communication is point-to-point: send activations from one stage to the next. The volume is small (one batch of activations), so the 400 GB/s cross-node bandwidth is usually sufficient. The bottleneck is pipeline bubbles, not communication.
EP requires AllToAll communication for token routing. This is particularly sensitive to the intra-node vs cross-node distinction:
| Scope | AllToAll time | Impact |
|---|---|---|
| EP within node (8-way) | bytes / (8 × 450 GB/s) | Fast, thanks to all-to-all connectivity |
| EP across 2 nodes (16-way) | bytes / (2 × 400 GB/s) | ~4.5x slower, limited by IB |
| EP across 4+ nodes | bytes / (M × 400 GB/s) | Even slower per GPU |
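A small sketch of the expert-parallel table, using the same bandwidth-only model. `ep_alltoall_time_s` is a hypothetical helper; it assumes the same total byte count is exchanged regardless of how many nodes participate.

```python
def ep_alltoall_time_s(num_bytes: float, num_nodes: int, gpus_per_node: int = 8,
                       nvlink_bw: float = 450e9, node_ib_bw: float = 400e9) -> float:
    """Bandwidth-only AllToAll estimate for expert parallelism."""
    if num_nodes == 1:
        # All traffic stays on NVLink: T ~ bytes / (N * W_gpu).
        return num_bytes / (gpus_per_node * nvlink_bw)
    # Traffic is limited by InfiniBand egress: T ~ bytes / (M * W_node).
    return num_bytes / (num_nodes * node_ib_bw)


payload = 1e9  # 1 GB of routed tokens, chosen only for illustration
t_one_node = ep_alltoall_time_s(payload, num_nodes=1)
t_two_nodes = ep_alltoall_time_s(payload, num_nodes=2)
slowdown = t_two_nodes / t_one_node

print(f"1 node:  {t_one_node * 1e3:.3f} ms")
print(f"2 nodes: {t_two_nodes * 1e3:.3f} ms")
print(f"Slowdown going from 1 to 2 nodes: {slowdown:.1f}x")
```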
Q: How many fp32 CUDA cores does an H100 have? A B200? How does this compare to a TPU v5p?
Q: How many vector fp32 FLOPs/s can an H100 do? How does this compare to Tensor Core FLOPs?
Q: How long should fp16[64, 4096] × fp16[4096, 8192] take on a B200? What about fp16[512, 4096] × fp16[4096, 8192]?
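One way to approach the matmul question is a simple roofline estimate: take the larger of the compute time and the HBM transfer time. The sketch below assumes the B200 figures from the spec tables (8.0 TB/s, 2,250 bf16 TFLOPS), assumes fp16 runs at the bf16 rate, and ignores kernel launch overhead; `matmul_roofline` is a hypothetical helper.

```python
def matmul_roofline(b: int, d: int, f: int, bytes_per_elem: int = 2,
                    hbm_bw: float = 8.0e12, peak_flops: float = 2.25e15):
    """Return (time in seconds, limiting resource) for [b, d] x [d, f]."""
    flops = 2 * b * d * f
    # Read both inputs and write the output once each.
    bytes_moved = bytes_per_elem * (b * d + d * f + b * f)
    t_compute = flops / peak_flops
    t_memory = bytes_moved / hbm_bw
    bound = "compute" if t_compute > t_memory else "memory"
    return max(t_compute, t_memory), bound


for batch in (64, 512):
    t, bound = matmul_roofline(batch, 4096, 8192)
    print(f"batch={batch}: {t * 1e6:.1f} us ({bound}-bound)")
```

Real kernels typically achieve only a fraction of peak bandwidth and FLOPs/s, so measured times will be somewhat higher than these lower bounds.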
A: Batch=64 is memory-bound, so the time is set by the bytes moved: 2×64×4096 + 2×4096×8192 + 2×64×8192 ≈ 69 MB, giving T = 69 MB / 8 TB/s = 8.6 μs (practically ~10-12 μs with partial bandwidth). Batch=512 is compute-bound: T = 2×512×4096×8192 / (2.3×10¹⁵) = 15 μs (practically ~20 μs).
Q: What is the total L1/SMEM + register capacity on an H100? How does it compare to TPU VMEM?
Q: How long does AllGather(bf16[1024, 16384]) take on an 8xH100 node?
Q: NVIDIA reports 80 TFLOPS fp32 vector compute for B200. Given FMA (2 FLOPs/cycle per core), estimate the clock speed.
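A sketch of how the clock-speed question above can be worked. It assumes, as an assumption not stated in the text, that a B200 SM has the same 128 fp32 CUDA cores (4 subpartitions × 32) as an H100 SM; the 148-SM count comes from the spec table.

```python
reported_fp32_flops = 80e12   # FLOPs/s, from the question
num_sms = 148                 # B200 SM count from the spec table
cores_per_sm = 4 * 32         # ASSUMPTION: same per-SM layout as H100
flops_per_core_per_cycle = 2  # one fused multiply-add counts as 2 FLOPs

total_cores = num_sms * cores_per_sm
clock_hz = reported_fp32_flops / (total_cores * flops_per_core_per_cycle)

print(f"fp32 CUDA cores: {total_cores}")
print(f"Estimated clock: {clock_hz / 1e9:.2f} GHz")
```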