A TPU is a matrix-multiply machine bolted to fast memory. Understanding its memory hierarchy and networking is the key to writing efficient distributed code.
Strip away the marketing and a TPU is remarkably simple. It is a compute core that specializes in matrix multiplication (called a TensorCore) attached to a stack of fast memory (called HBM).
The TensorCore has three key units: the MXU (Matrix Multiply Unit), the VPU (Vector Processing Unit), and VMEM (Vector Memory).
Why is matmul so special? Because it uses O(n³) compute for O(n²) bytes, so the work per byte grows with n and a large enough matmul is limited by compute rather than by memory bandwidth. No other common operation has this property, which is why architectures dominated by matmul are so amenable to scaling.
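The scaling argument can be made concrete with a rough sketch. It assumes square n×n bf16 matrices (2 bytes per element), counts 2n³ FLOPs, and counts bytes as two input reads plus one output write; the function name `matmul_intensity` is illustrative only.

```python
def matmul_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for an (n x n) @ (n x n) matmul."""
    flops = 2 * n**3                          # one multiply and one add per inner-product term
    bytes_moved = 3 * n * n * bytes_per_elem  # read two inputs, write one output
    return flops / bytes_moved


for n in (128, 1024, 8192):
    print(n, round(matmul_intensity(n), 1))
```

Under these assumptions the intensity is n/3 FLOPs per byte, so doubling the matrix size doubles the work available per byte of memory traffic.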
At the core of the MXU is a 128×128 systolic array (256×256 on TPU v6e). That is 16,384 ALUs, each capable of a multiply-and-add per cycle.
Imagine a grid of 128×128 processing elements (PEs). Weights are loaded from the top (the "RHS"), filling the array diagonally. Activations are fed from the left (the "LHS"), also diagonally. Each PE multiplies its activation with its weight, adds the result to the partial sum passed from above, and passes the new partial sum down.
After the initial pipeline bubble (while weights load diagonally), each subsequent cycle produces a valid output element. New inputs and weights can be streamed in without additional bubbles — the array stays fully saturated.
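The flow described above can be sketched as a small cycle-by-cycle simulation. This is a simplified model, not the actual hardware: it assumes the weights are already resident in the PEs (the diagonal weight load is skipped), activations enter row k with a skew of k cycles, and partial sums move down one row per cycle. The function name `systolic_matmul` and the list-of-lists layout are illustrative choices.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a simulated K x N grid of PEs holding W[k][j]."""
    B, K, N = len(X), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]  # activation held by each PE
    p_reg = [[0] * N for _ in range(K)]  # partial sum held by each PE
    Y = [[0] * N for _ in range(B)]

    for t in range(B + K + N - 2):       # cycles until the last output drains
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for j in range(N):
                if j == 0:
                    b = t - k            # row k is fed k cycles late (diagonal skew)
                    a_in = X[b][k] if 0 <= b < B else 0
                else:
                    a_in = a_reg[k][j - 1]               # activation from the left
                p_in = p_reg[k - 1][j] if k > 0 else 0   # partial sum from above
                new_a[k][j] = a_in
                new_p[k][j] = p_in + a_in * W[k][j]
        a_reg, p_reg = new_a, new_p
        for j in range(N):               # bottom row emits finished outputs
            b = t - (K - 1) - j
            if 0 <= b < B:
                Y[b][j] = p_reg[K - 1][j]
    return Y


print(systolic_matmul([[1, 2], [3, 4], [5, 6]], [[1, 0, 2], [0, 1, 3]]))
```

In this model the first output appears after K − 1 cycles of fill, and after that each bottom-row PE emits one finished element per cycle for as long as new activation rows keep arriving.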
Performance: When fully saturated, the systolic array performs one bf16[8,128] × bf16[128,128] → f32[8,128] multiply every 8 cycles. At 1.5 GHz on TPU v5e, that is about 5e13 bf16 FLOPs/s per MXU. Most TensorCores have 2 or 4 MXUs, giving a total of 2e14 bf16 FLOPs/s per TPU v5e chip.
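The throughput figure can be reproduced with a few lines of arithmetic. The sketch below assumes 2 FLOPs per multiply-add and 4 MXUs per TPU v5e core (an inference from the per-MXU and per-chip numbers quoted above); the function name is illustrative.

```python
def mxu_flops_per_s(clock_hz, rows=8, k=128, n=128, cycles=8):
    """FLOPs/s for one MXU doing a [rows, k] x [k, n] multiply every `cycles` cycles."""
    flops_per_tile = 2 * rows * k * n   # multiply + add for each of rows*k*n terms
    return flops_per_tile / cycles * clock_hz


per_mxu = mxu_flops_per_s(1.5e9)        # TPU v5e clock from the text
per_chip = 4 * per_mxu                  # assumption: 4 MXUs on the single v5e core
print(f"{per_mxu:.2e} per MXU, {per_chip:.2e} per chip")
```

This gives about 4.9e13 FLOPs/s per MXU and about 1.97e14 per chip, matching the spec-table value for v5e.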
Lower precision means higher throughput. Recent TPUs (v5e, v5p, v6e) can do int8 OPs roughly 2x faster than bf16 FLOPs, and int4 at 4x; on v3 and v4p the int8 and bf16 rates are the same.
The VPU (Vector Processing Unit) handles everything that is not a matmul: activations like ReLU/GELU, element-wise operations, reductions (sums). It is an 8×128 SIMD unit where each (lane, sublane) pair contains 4 independent ALUs.
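At 1.75 GHz on TPU v5p, the VPU achieves roughly 8 × 128 × 4 × 1.75e9 ≈ 7.2e12 operations/s per core, assuming each ALU completes one operation per cycle.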
That is about 30x slower than the MXU (~2e14 per core). This is why we try to express as much of the computation as matmul.
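As a sanity check on the VPU-versus-MXU gap, here is a rough estimate. It assumes every one of the 8 × 128 × 4 ALUs completes one operation per cycle, which is an idealization; the function name is illustrative.

```python
def vpu_ops_per_s(clock_hz, sublanes=8, lanes=128, alus_per_pair=4):
    """Peak VPU operations/s if every ALU retires one op per cycle (assumed)."""
    return sublanes * lanes * alus_per_pair * clock_hz


vpu = vpu_ops_per_s(1.75e9)     # TPU v5p clock from the text
mxu_per_core = 2e14             # approximate MXU FLOPs/s per core from the text
print(f"VPU: {vpu:.2e} ops/s, MXU/VPU ratio: {mxu_per_core / vpu:.0f}x")
```

The estimate lands at about 7.2e12 operations/s, a ratio of roughly 28x, consistent with the "about 30x" figure.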
VMEM (Vector Memory) is the on-chip scratchpad, sitting between HBM and the compute units. Think of it as a programmer-controlled L1/L2 cache — but much larger (128 MiB on TPU v5e).
| Memory | Capacity (TPU v5e) | Bandwidth | Role |
|---|---|---|---|
| HBM | 16 GiB | 8.2e11 B/s | Main storage for all tensors |
| VMEM | 128 MiB | ~22x HBM ≈ 1.8e13 B/s | Scratchpad; data must pass through here |
| VREGs | ~256 KiB/core | Cycle-speed | Registers for VPU/MXU input/output |
For example, during attention we might prefetch the large FFN weights into VMEM. If the weights fit (or are sharded small enough), the following FFN matmul runs at much higher efficiency.
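One way to see why VMEM residency helps is to compare the arithmetic intensity needed to stay compute-bound when operands stream from HBM versus from VMEM. The sketch below uses the TPU v5e numbers from the table and the usual roofline rule (peak FLOPs/s divided by bandwidth); the function name is illustrative.

```python
PEAK_FLOPS = 1.97e14        # TPU v5e bf16 FLOPs/s
HBM_BW = 8.2e11             # bytes/s
VMEM_BW = 22 * HBM_BW       # "~22x HBM" from the table


def critical_intensity(peak_flops, bandwidth):
    """FLOPs per byte above which an op is compute-bound rather than memory-bound."""
    return peak_flops / bandwidth


hbm_ci = critical_intensity(PEAK_FLOPS, HBM_BW)
vmem_ci = critical_intensity(PEAK_FLOPS, VMEM_BW)
print(f"HBM: {hbm_ci:.0f} FLOPs/byte, VMEM: {vmem_ci:.0f} FLOPs/byte")
```

With these numbers a matmul fed from HBM needs about 240 FLOPs per byte to saturate the MXU, while one fed from VMEM needs only about 11, so small-batch matmuls that would be memory-bound from HBM can still run near peak.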
HBM (High Bandwidth Memory) is the big chunk of fast memory that stores all tensors. Capacity is typically tens of GiB (16 GiB on TPU v5e, 96 GiB on TPU v5p).
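The data flow for a matmul X · A → Y mirrors the hierarchy above: chunks of X and A are copied from HBM into VMEM, loaded into VREGs and fed to the MXU, and the resulting chunks of Y travel back through VREGs and VMEM to HBM.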
Element-wise operations (VPU) follow the same pipeline: data streams from HBM → VMEM → VREGs → VPU → VREGs → VMEM → HBM, with partial results pipelined without waiting for the full array.
The CPU host connects to its TPU tray via PCIe, which is about 1.6e10 bytes/s per TPU (3.2e10 on v6e). That is ~100x slower than HBM bandwidth. Loading data from host RAM to HBM is a bottleneck best avoided.
A TPU chip typically has 2 TensorCores that share memory and act as one accelerator (called "megacore" configuration since TPU v4). Exception: inference chips like TPU v5e have just 1 core per chip.
Chips sit on trays. A tray holds 4 chips connected to a single CPU host via PCIe. For TPU v5e, each host has 2 trays (8 chips = 8 cores). For training chips like v5p, each host has a 2×2×1 topology of 4 chips.
| Level | What | Example (v5e) |
|---|---|---|
| Core | 1 TensorCore (MXU + VPU + VMEM) | 1 per chip |
| Chip | 1–2 cores + HBM | 16 GiB HBM, 1.97e14 FLOPs/s |
| Tray | 4 chips | Connected via ICI |
| Host | CPU + 1–2 trays via PCIe | 8 chips per host |
| Slice | ICI-connected chips | Up to 16×16 = 256 chips |
| Pod/Superpod | Maximum ICI topology | v5p: 16×20×28 = 8960 chips |
ICI (Inter-Chip Interconnect) is the direct chip-to-chip link that forms the TPU's communication fabric. It does NOT go through the CPU host.
TPU v5e and v6e use a 2D torus (4 nearest neighbours per chip). TPU v4 and v5p use a 3D torus (6 nearest neighbours per chip). The toroidal wraparound reduces the maximum distance between any two chips along an axis of N chips from about N hops to N/2.
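The effect of the wraparound links can be checked with a short sketch that measures hop counts along a single axis of N chips. The helper names are illustrative, and the model ignores routing details beyond shortest-path distance.

```python
def axis_hops(a, b, n, wraparound):
    """Shortest hop count between positions a and b on an axis of n chips."""
    d = abs(a - b)
    return min(d, n - d) if wraparound else d


def max_axis_hops(n, wraparound):
    """Worst-case hop count over all pairs of positions on one axis."""
    return max(axis_hops(a, b, n, wraparound) for a in range(n) for b in range(n))


for n in (4, 16):
    print(n, max_axis_hops(n, wraparound=False), max_axis_hops(n, wraparound=True))
```

Without wraparound the worst case is N − 1 hops; with it, the worst case is N // 2. On a multi-dimensional torus the per-axis distances add.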
| Link | Bandwidth (per chip) | Relative Speed |
|---|---|---|
| HBM ↔ TensorCore | ~1–3 TB/s | Fastest |
| ICI (per axis) | 45–90 GB/s unidirectional | ~10–30x slower than HBM |
| PCIe (host ↔ chip) | ~16 GB/s | ~100x slower than HBM |
| DCN (between hosts) | ~6 GB/s | Slowest |
Multi-slice training: ICI-connected chips form a "slice." Different slices connect via DCN (data-center networking), which is much slower. DCN goes host-to-host, requiring PCIe transfers on both ends. Minimizing DCN traffic is critical for multi-slice training.
Here are the key numbers you need for roofline calculations. Memorize the order of magnitude — exact values change slightly between sources.
| Model | Pod Size | HBM/chip | HBM BW/chip | bf16 FLOPs/s/chip | int8 OPs/s/chip |
|---|---|---|---|---|---|
| v3 | 32×32 | 32 GB | 9.0e11 | 1.4e14 | 1.4e14 |
| v4p | 16×16×16 | 32 GB | 1.2e12 | 2.75e14 | 2.75e14 |
| v5p | 16×20×28 | 96 GB | 2.8e12 | 4.59e14 | 9.18e14 |
| v5e | 16×16 | 16 GB | 8.2e11 | 1.97e14 | 3.94e14 |
| v6e | 16×16 | 32 GB | 1.6e12 | 9.20e14 | 1.84e15 |
| Model | ICI BW/link (1-way) | ICI BW/link (bidi) |
|---|---|---|
| v3 | 1.0e11 | 2.0e11 |
| v4p | 4.5e10 | 9.0e10 |
| v5p | 9.0e10 | 1.8e11 |
| v5e | 4.5e10 | 9.0e10 |
| v6e | 9.0e10 | 1.8e11 |
PCIe: ~1.6e10 bytes/s per TPU (3.2e10 for v6e).
DCN: ~6.25e9 bytes/s per TPU (12.5e9 for v6e, 3.125e9 for v5e).
You want to sample from a 200B parameter model in bf16 split across 32 TPU v4p. How long to load all parameters from HBM?
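A back-of-the-envelope answer can be computed as follows. The sketch assumes the parameters are sharded evenly across the 32 chips and that each chip streams its shard from its own HBM at the full bandwidth listed in the table (1.2e12 bytes/s for v4p); the function name is illustrative.

```python
def hbm_load_time_s(n_params, bytes_per_param, n_chips, hbm_bw):
    """Time for every chip to read its equal share of the parameters from HBM."""
    bytes_per_chip = n_params * bytes_per_param / n_chips
    return bytes_per_chip / hbm_bw


t = hbm_load_time_s(n_params=200e9, bytes_per_param=2, n_chips=32, hbm_bw=1.2e12)
print(f"{t * 1e3:.1f} ms")
```

Under these assumptions each chip holds 1.25e10 bytes and reads them in roughly 10 ms, which is a lower bound on the time per sampling step when every parameter must be read once.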
TPU v5e pod (16×16): 256 chips, 1 core each = 256 TensorCores. Hosts: 256/8 = 32 hosts. Total FLOPs/s: 256 × 1.97e14 = 5.04e16. Total HBM: 256 × 16 = 4 TB.
TPU v5p pod (16×20×28): 8960 chips, 2 cores each = 17,920 TensorCores. Hosts: 8960/4 = 2240 hosts. Total FLOPs/s: 8960 × 4.59e14 = 4.1e18. Total HBM: 8960 × 96 = 860 TB.
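The pod totals above follow directly from the topology and the per-chip numbers. The sketch below recomputes them; the function name and dictionary keys are illustrative, and HBM is reported in the same GB units as the spec table.

```python
from math import prod


def pod_totals(shape, cores_per_chip, chips_per_host, flops_per_chip, hbm_gb_per_chip):
    """Aggregate chip, core, host, FLOPs/s and HBM counts for a pod topology."""
    chips = prod(shape)
    return {
        "chips": chips,
        "tensorcores": chips * cores_per_chip,
        "hosts": chips // chips_per_host,
        "flops_per_s": chips * flops_per_chip,
        "hbm_gb": chips * hbm_gb_per_chip,
    }


v5e = pod_totals((16, 16), 1, 8, 1.97e14, 16)
v5p = pod_totals((16, 20, 28), 2, 4, 4.59e14, 96)
print(v5e)
print(v5p)
```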
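Suppose weights bf16[D, F] and activations bf16[B, D] are stored in host DRAM and multiplied on a single TPU v6e, with B ≪ D and F = 4D. The matmul costs 2BDF FLOPs, while loading the weights over PCIe moves about 2DF bytes (the activations are negligible when B ≪ D). Compute time exceeds load time only when B > (FLOPs/s) / (PCIe bytes/s) = 9.2e14 / 1.6e10.
You need a batch of ~57,500 tokens before computation outpaces PCIe loading at ~1.6e10 bytes/s. This is why we keep tensors in HBM, not host RAM.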
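The batch threshold can be checked numerically. The sketch below assumes the weight transfer (2·D·F bytes in bf16) dominates PCIe traffic because B ≪ D, and compares it with the 2·B·D·F FLOPs of the matmul; D and F cancel, leaving peak FLOPs/s divided by PCIe bandwidth. The function name is illustrative.

```python
def pcie_batch_threshold(peak_flops, pcie_bw):
    """Smallest B for which 2*B*D*F / peak_flops exceeds 2*D*F / pcie_bw."""
    return peak_flops / pcie_bw


V6E_FLOPS = 9.2e14
print(round(pcie_batch_threshold(V6E_FLOPS, 1.6e10)))  # generic per-TPU PCIe figure
print(round(pcie_batch_threshold(V6E_FLOPS, 3.2e10)))  # v6e-specific PCIe figure
```

The ~57,500 figure corresponds to the generic 1.6e10 bytes/s PCIe number; with the 3.2e10 bytes/s listed for v6e the threshold halves to about 28,750, which does not change the conclusion.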
Send bf16[8, 128, 8192] from TPU{0,0} to TPU{3,3} on a v5e 4×4 slice (no wraparounds).
First byte arrives in ~6 μs. Full transfer completes in ~188 μs.
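These two numbers can be approximated with a simple model. The assumptions here are not from a spec: about 1 μs of latency per hop (chosen to match the ~6 μs figure), shortest-path routing with no wraparound, and the payload split across one outgoing link per axis that has to be traversed, each at the v5e one-way ICI rate of 4.5e10 bytes/s. The function name is illustrative.

```python
from math import prod


def ici_transfer(shape, src, dst, link_bw=4.5e10, hop_latency_s=1e-6, bytes_per_elem=2):
    """Estimate (hops, bytes, first-byte latency, streaming time) for a point-to-point send."""
    hops = sum(abs(a - b) for a, b in zip(src, dst))    # Manhattan distance, no wraparound
    nbytes = prod(shape) * bytes_per_elem
    links = sum(1 for a, b in zip(src, dst) if a != b)  # assumed: one usable link per axis
    first_byte_s = hops * hop_latency_s                 # assumed ~1 us per hop
    stream_s = nbytes / (links * link_bw)
    return hops, nbytes, first_byte_s, stream_s


hops, nbytes, first_byte_s, stream_s = ici_transfer((8, 128, 8192), (0, 0), (3, 3))
print(hops, nbytes, f"{first_byte_s * 1e6:.0f} us", f"{stream_s * 1e6:.0f} us")
```

With these assumptions the path is 6 hops, the payload is about 1.7e7 bytes, and streaming takes roughly 186 μs, in line with the ~188 μs quoted above.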
| Concept | Key Takeaway |
|---|---|
| TPU architecture | MXU (systolic array for matmul) + VPU (element-wise) + VMEM (scratchpad) + HBM (main memory) |
| MXU | 128×128 systolic array (256×256 on v6e), ~2e14 bf16 FLOPs/s per v5e chip |
| VMEM | ~22x faster than HBM but only ~128 MiB. Prefetching weights here enables better efficiency |
| Padding | Matrices must be ≥128 in both dims to fill the MXU |
| Speed hierarchy | HBM >> ICI >> PCIe >> DCN |
| TPU vs GPU | TPUs: nearest-neighbor ICI, cheaper, scales larger. GPUs: NVLink switches, richer connectivity per node |
| Multi-slice | Slices (ICI) connect via DCN (host-to-host). Minimize DCN traffic |