Hardware-Aware Algorithm Design — CS 229s Systems for ML

Chapter 0: Why Hardware Matters

You have a shiny new GPU. The spec sheet says it can do 312 trillion floating-point operations per second (312 TFLOPS). You write a PyTorch layer — a simple element-wise ReLU over a million numbers. That's a million operations. Should take about 3 nanoseconds, right?

You profile it. It takes 50 microseconds. That's 17,000× slower than the theoretical peak. Your GPU spent 99.994% of its time doing nothing, waiting for data to arrive from memory.

This is not a bug. This is the memory wall — the defining bottleneck of modern computing. Processors can crunch numbers astronomically fast, but they can only feed those numbers from memory at a fraction of that speed. If you don't understand this gap, you'll write code that leaves 90% of your hardware idle.

The core problem: Compute speed has grown exponentially. Memory bandwidth has not kept up. The gap between "how fast we can calculate" and "how fast we can move data" is the single biggest performance bottleneck in modern ML. Every optimization technique in this lesson exists to work around this gap.

The Compute-Memory Gap

Watch how compute speed (FLOPS) has outpaced memory bandwidth over generations. The gap widens every year.

This lesson is based on Stanford CS 229s, Lecture 3. By the end, you'll be able to look at any ML operation — a matmul, a softmax, a layer norm — and immediately know whether it's compute-bound or memory-bound, and what to do about it.

Check: An element-wise ReLU on a GPU runs far slower than the theoretical peak FLOPS would suggest. Why?

ReLU is too complex for GPUs to compute The GPU spends most of its time waiting for data from memory, not computing PyTorch adds too much overhead

Chapter 1: CPU vs GPU

Before we can understand the memory wall, we need to understand what's behind the wall. Let's look at the two processors you'll encounter in ML: the CPU and the GPU. They solve fundamentally different problems.

The CPU: A Few Very Smart Workers

A CPU (Central Processing Unit) is designed to execute a sequence of instructions with minimum latency — the time from "start this task" to "task is done." Think of it as one or two brilliant workers who can each solve complex, branching problems very quickly.

A CPU core has a big control unit that fetches the next instruction and figures out what to do. It has ALU/FPU units (Arithmetic Logic Unit / Floating Point Unit) that do the actual math. And it has large caches that store data close to the processor so it doesn't have to wait for slow main memory.

Most of the transistors in a CPU are dedicated to data caching and flow control — branch prediction, out-of-order execution, speculative execution. The actual compute units are a small fraction of the chip. This makes CPUs brilliant at complex, sequential logic.

The GPU: Thousands of Simple Workers

A GPU (Graphics Processing Unit) takes the opposite approach. Instead of a few smart workers, it has thousands of simple ones. Each worker (called a CUDA core on NVIDIA GPUs) can only do basic arithmetic, but there are so many of them running in parallel that the total throughput — operations per second — is enormous.

GPUs are organized into Streaming Multiprocessors (SMs). An NVIDIA A100 has 108 SMs. Each SM has its own instruction unit, registers, and shared memory. It manages hundreds of threads simultaneously using a model called SIMT (Single Instruction, Multiple Thread): 32 threads are grouped into a warp, and every thread in a warp executes the same instruction at the same time, just on different data.

Key insight: CPUs minimize latency (time per task). GPUs maximize throughput (tasks per second). ML workloads — matrix multiplies, convolutions, element-wise ops — are massively parallel: the same operation on millions of numbers. GPUs win because they have thousands of workers all doing the same thing simultaneously.

CPU vs GPU Architecture

Left: CPU with few powerful cores, big caches. Right: GPU with many simple cores, small caches. The green area is compute; the blue area is cache/control.

Why GPUs Dominate Deep Learning

The core building block of deep learning is the matrix multiplication. A matmul of two 4096×4096 matrices requires about 137 billion multiply-add operations. Every single one is independent — you can compute any output element without waiting for any other. This is a perfect fit for the GPU's thousands of parallel workers.

Property	CPU (e.g., Intel Xeon)	GPU (e.g., NVIDIA A100)
Cores	8–64 complex cores	6,912 CUDA cores (108 SMs × 64)
Clock speed	3–5 GHz	1.4 GHz
Peak FP16	~1 TFLOPS	312 TFLOPS
Strength	Low-latency sequential logic	High-throughput parallel compute
Transistor budget	Cache & control dominant	Compute dominant

Check: Why are GPUs faster than CPUs for matrix multiplication?

GPUs have higher clock speeds GPUs use less power Matmul is massively parallel, and GPUs have thousands of cores to exploit that parallelism

Chapter 2: The Memory Hierarchy

Now that we know GPUs have thousands of cores, the question becomes: how do we feed those cores with data? This is where the memory hierarchy comes in — a cascade of storage levels, each trading capacity for speed.

The Memory Ladder

Think of it like a workshop. Your workbench (registers) holds what you're working on right now — tiny capacity, instant access. The toolbox on the shelf (L1/shared memory) is a step away. The supply closet down the hall (L2 cache) takes a bit longer. And the warehouse across town (HBM/DRAM) has everything you could ever need, but getting something from there takes a real trip.

On an NVIDIA A100 GPU, here are the actual numbers:

Level	Capacity	Bandwidth	Latency	Analogy
Registers	256 KB / SM	~20 TB/s	~1 cycle	Hands on workbench
Shared Memory / L1	192 KB / SM	~19 TB/s	~20 cycles	Toolbox on the shelf
L2 Cache	40 MB	~5 TB/s	~200 cycles	Supply closet down the hall
HBM (DRAM)	40 / 80 GB	1.9 TB/s (2 TB/s)	~400 cycles	Warehouse across town

The critical numbers: Registers are ~10× faster than shared memory, which is ~4× faster than L2, which is ~3× faster than HBM. From registers to HBM is roughly a 10× bandwidth drop at each level. When your code accesses HBM, it's like sending a truck across town for every calculation. That truck trip is what kills your performance.

GPU Memory Hierarchy

Hover over each level to see its specs. Width represents bandwidth; height represents capacity.

The Memory Wall, Quantified

The A100 can do 312 TFLOPS of FP16 compute. But it can only move data from HBM at 2 TB/s. Let's convert to the same units. Each FP16 number is 2 bytes. So 2 TB/s = 1 trillion FP16 numbers per second. But we can process 312 trillion operations per second. That's a ~160:1 ratio. For every byte we read from HBM, we need to do at least 160 floating-point operations, or the compute units sit idle.

Compute-to-Memory Ratio = Peak FLOPS / Peak Bandwidth = 312 TFLOPS / 2 TB/s ≈ 160 FLOPs/Byte

This ratio — 160 FLOPs per byte — is the hardware's arithmetic intensity threshold. If your operation does fewer than 160 FLOPs per byte of data it moves, the compute units will be starved. We'll formalize this in Chapter 4.

Check: Why is HBM the bottleneck, even though it has 40–80 GB of capacity?

It's too expensive Its bandwidth (~2 TB/s) is far slower than the compute speed (312 TFLOPS), creating a ~160:1 gap It can only store integers, not floats

Chapter 3: Compute-Bound vs Memory-Bound

Now we arrive at the most important concept in hardware-aware design: the classification of every operation into one of two categories. Understanding this distinction is the single most impactful skill you can develop for writing fast ML code.

The Toy Processor

Imagine a simple processor that can do 8 operations per second (compute bandwidth) and move 4 items per second to/from memory (memory bandwidth). Suppose we need to:

Load 8 items from memory
Compute 1 operation on each item
Write 8 results back to memory

Loading 8 items at 4 items/s takes 2 seconds. Computing 8 operations at 8 ops/s takes 1 second. Writing 8 items at 4 items/s takes 2 seconds. If compute overlaps with I/O (which modern GPUs do), the total time at steady state is dominated by memory: the processor can do 8 ops/s but only receives 4 items/s. It does 4 ops per second — 50% compute utilization. The processor is sitting idle half the time, waiting for data. This is memory-bound.

Key insight: Doubling the compute speed to 16 ops/s doesn't help — we're still bottlenecked by 4 items/s from memory. But doubling memory bandwidth to 8 items/s gives 100% utilization. The bottleneck determines which resource to improve.

Now Change the Algorithm

Same processor (8 ops/s, 4 items/s), but now our algorithm does 4 operations per item instead of 1:

Load 4 items from memory
Compute 4 ops on each item (16 ops total)
Write 4 results back

Loading 4 items: 1 second. Computing 16 ops at 8 ops/s: 2 seconds. Writing 4 items: 1 second. Now compute takes longer than memory. The processor is busy the whole time, and memory has idle moments. This is compute-bound. And now, doubling compute speed to 16 ops/s does help — it cuts the total time.

Compute vs Memory Bound Simulator

Adjust the compute speed, memory bandwidth, and operations per item. Watch which resource becomes the bottleneck.

Compute (ops/s)8

Memory BW (items/s)4

Ops per item1

The Formal Test

For any operation, we can compute two times:

T_math = Total FLOPs / Compute Bandwidth

T_mem = Total Bytes Accessed / Memory Bandwidth

If T_math > T_mem, the operation is compute-bound: compute is the bottleneck. If T_mem > T_math, it's memory-bound: memory is the bottleneck. With overlapped compute and I/O, total time ≈ max(T_math, T_mem).

Check: A processor does 16 ops/s and has 4 items/s memory bandwidth. An operation loads 8 items, does 1 op each, and writes 8 items back. Is it compute-bound or memory-bound?

Compute-bound (math takes longer) Memory-bound (memory transfer takes longer) Perfectly balanced

Chapter 4: Arithmetic Intensity

In the last chapter, we compared T_math and T_mem to decide the bottleneck. There's a more elegant way to express this — a single number that characterizes any operation: the arithmetic intensity.

Arithmetic Intensity = Total FLOPs / Total Bytes Accessed

This measures how much work you do for each byte you move. High arithmetic intensity means you're doing a lot of computation on a small amount of data — the processor stays busy. Low arithmetic intensity means you're constantly shuffling data around with little computation — the processor starves.

The Threshold

We derived the A100's compute-to-memory ratio: ~160 FLOPs/Byte. This is the ridge point — the dividing line between compute-bound and memory-bound:

If Arithmetic Intensity > 160 FLOPs/Byte → compute-bound (compute is the bottleneck)
If Arithmetic Intensity < 160 FLOPs/Byte → memory-bound (memory bandwidth is the bottleneck)

Why this matters: This single number tells you exactly what to optimize. Memory-bound? Reduce data movement (fusion, tiling, caching). Compute-bound? Use faster algorithms (mixed precision, Tensor Cores, algorithmic tricks). Optimizing the wrong bottleneck wastes effort.

Worked Example: Matrix Multiplication

Let's compute the arithmetic intensity of multiplying A [N × K] by B [K × M] to get C [N × M]. We'll count FLOPs and bytes.

FLOPs: Each entry C_ij is a dot product of a row of A and a column of B: K multiplications + K additions = 2K FLOPs. There are N × M entries in C. Total: 2NMK FLOPs.

Bytes: We read A (N×K values), read B (K×M values), and write C (N×M values). In FP16 (2 bytes each): 2(NK + KM + NM) bytes.

AI_matmul = 2NMK / 2(NK + KM + NM) = NMK / (NK + KM + NM)

Case 1: M = K = 8192, N = 128 (a tall-skinny matrix times a big square one — like a small batch through a large linear layer):

AI = (128 × 8192 × 8192) / (128×8192 + 8192×8192 + 128×8192) = 8,589,934,592 / 69,206,016 ≈ 124

Since 124 < 160, this is memory-bound. The batch is too small to keep the GPU busy.

Case 2: M = K = N = 8192 (three large square matrices):

AI = (8192 × 8192 × 8192) / (8192×8192 + 8192×8192 + 8192×8192) = 8192³ / (3 × 8192²) = 8192/3 ≈ 2,731

Since 2,731 >> 160, this is compute-bound. Plenty of work per byte. The GPU is fully utilized.

Arithmetic Intensity Calculator

Set the matrix dimensions for C = A × B. See the arithmetic intensity and whether it's compute- or memory-bound on an A100.

N (rows of A)128

K (shared dim)4096

M (cols of B)4096

Common Operations: A Cheat Sheet

Operation	FLOPs	Bytes (FP16)	Intensity	Bound
MatMul (N=K=M=4096)	2N³ ≈ 137B	6N² ≈ 201M	~682	Compute
Element-wise (ReLU, GELU)	N	4N (read+write)	0.25	Memory
Softmax (row of N)	~5N	4N	~1.25	Memory
LayerNorm	~5N	4N	~1.25	Memory
Dropout	N	4N	0.25	Memory

Notice the pattern: MatMul is the only common operation that's compute-bound. Everything else — activations, normalization, dropout — is memory-bound. This is why fusing these operations (Chapter 7) is so important: we can piggyback them onto matmuls to avoid separate memory round-trips.

Check: A matrix multiply with N=64, K=4096, M=4096 has arithmetic intensity of about 63. On an A100 (ridge point ~160), is this compute-bound or memory-bound?

Compute-bound (63 > 160) Memory-bound (63 < 160) Perfectly balanced (63 = 160)

Chapter 5: The Roofline Model

We've been asking "is this operation compute-bound or memory-bound?" one operation at a time. The roofline model puts ALL operations on a single diagram and makes the answer visual. It's the most important performance analysis tool in systems engineering.

How It Works

The x-axis is arithmetic intensity (FLOPs/Byte) on a log scale. The y-axis is achievable performance (GFLOPS, also log scale). Two lines form a "roofline":

Memory bandwidth ceiling: Performance = Arithmetic Intensity × Memory Bandwidth. This is a diagonal line — the more work you do per byte, the more total FLOPS you can achieve.
Compute ceiling: Performance = Peak FLOPS. This is a horizontal line — no matter how high your arithmetic intensity, you can't exceed the hardware's peak compute.

The two lines intersect at the ridge point: the arithmetic intensity where memory bandwidth and compute are perfectly balanced.

Ridge Point = Peak FLOPS / Peak Bandwidth

Achievable FLOPS = min(Peak FLOPS, AI × Peak Bandwidth)

Derivation: Why is the memory-bound region a diagonal? If your operation has arithmetic intensity I (FLOPs per byte), and you can move B bytes per second, then you do I × B FLOPs per second. That's a linear function of I on a log-log plot — a straight line with slope 1. Once I × B exceeds the compute peak, more arithmetic intensity doesn't help — you hit the flat ceiling.

Interactive Roofline Model (A100 GPU)

Drag the sliders to change hardware specs. Click the operation buttons to place them on the roofline. Hover operations for details.

Peak FLOPS (TFLOPS)312

Memory BW (TB/s)2.0

Reading the Diagram

Every operation lands somewhere on the roofline. If it's on the diagonal (left of the ridge), it's memory-bound — you improve it by reducing data movement. If it's on the flat ceiling (right of the ridge), it's compute-bound — you improve it by using faster math (Tensor Cores, mixed precision, better algorithms).

Operations that land below the roofline are not reaching their theoretical peak — there's room for optimization. The gap between the roofline and the actual performance tells you how much room you have.

The takeaway: Most operations in a transformer — softmax, layer norm, dropout, residual add, GELU — are far to the left of the ridge. Only matmuls are to the right. This is why custom CUDA kernels like FlashAttention fuse everything together: they eliminate the memory round-trips for all those memory-bound operations.

Check: On a roofline diagram, an operation lands on the diagonal line (left of the ridge point). What does this mean?

It's memory-bound: performance is limited by how fast we can move data It's compute-bound: the math takes too long It's perfectly optimized

Chapter 6: Tiling & Data Reuse

We've diagnosed the problem: most operations are memory-bound. How do we fix it? The first technique is tiling (also called blocking) — breaking a large operation into small tiles that fit in fast memory (registers/shared memory), so we read from slow HBM once and reuse the data many times.

Naive Matrix Multiply: A Disaster

Consider computing one element of C = A × B. A naive implementation loads one row of A and one column of B, computes the dot product (2K FLOPs), and writes one output. In FP16:

Bytes = 2(2K + 1) FLOPs = 2K AI = 2K / 2(2K+1) ≈ 0.5

An arithmetic intensity of 0.5. On an A100 with a ridge point of 160, we'd achieve less than 0.3% of peak compute. Catastrophic.

1D Tiling: Reuse Across a Row

What if one thread computes a tile of T_c outputs in the same row of C? It loads 1 row of A and T_c columns of B, and reuses the row of A across all T_c dot products.

Bytes = 2(K + T_cK + T_c) FLOPs = 2KT_c AI = 2KT_c / 2(K + T_cK + T_c) ≈ T_c / (1 + T_c)

For T_c = 2: AI ≈ 4K / (3K+2) ≈ 1.33. Better, but still far from 160.

2D Tiling: The Real Win

Now each thread computes a T_r × T_c tile of C. It loads T_r rows of A and T_c columns of B, reusing both across the tile.

Bytes = 2(T_rK + T_cK + T_rT_c) FLOPs = 2T_rT_cK AI = T_rT_cK / (T_rK + T_cK + T_rT_c)

For T_r = T_c = 128, K = 4096: AI = 128 × 128 × 4096 / (128×4096 + 128×4096 + 128×128) ≈ 63.5. Now we're getting somewhere. With larger tiles we can push past the ridge point.

The key insight: Tiling increases arithmetic intensity by reusing data. Loading T_r rows and T_c columns, we do T_r×T_c dot products but only move T_r+T_c vectors. The bigger the tile, the better the data reuse ratio. The limit is how much data fits in fast memory (registers + shared memory).

Tiling Visualizer

Adjust the tile size and watch how arithmetic intensity changes. The dashed line marks the A100 ridge point (160 FLOPs/Byte).

Tile rows (T_r)1

Tile cols (T_c)1

K (shared dim)4096

Tiling in Practice

In real GPU programs (CUDA), tiling maps directly to the hardware. A thread block is assigned a tile of the output matrix. The block cooperatively loads its tile of A and B into shared memory (the fast, per-SM cache), then each thread computes its portion from shared memory. This is exactly what NVIDIA's cuBLAS library does under the hood.

Modern Tensor Cores on the A100 compute 4×4 or 8×8 tiles in a single clock cycle, making tiling a hardware-level primitive. When you call torch.matmul, cuBLAS is tiling for you. But for fused operations (Chapter 7), you need to tile manually — which is what FlashAttention does.

Check: Why does increasing tile size improve arithmetic intensity?

Larger tiles let us reuse loaded data for more computations, increasing the FLOP-to-byte ratio Larger tiles use more memory, which makes memory faster Larger tiles reduce the total number of FLOPs needed

Chapter 7: Operator Fusion

Tiling helps us reuse data within a single operation. Operator fusion goes further: it combines multiple operations into a single kernel, so intermediate results never touch slow HBM.

The Problem with Separate Kernels

In a standard PyTorch transformer layer, a sequence of operations might look like this: matmul → bias add → GELU → dropout. Without fusion, each operation is a separate GPU kernel. Each kernel reads its input from HBM, computes, and writes its output back to HBM. The next kernel reads that output from HBM.

Unfused (4 HBM round-trips)

MatMul → write to HBM → read → Bias Add → write to HBM → read → GELU → write to HBM → read → Dropout → write to HBM

↓ fuse into single kernel

Fused (1 HBM round-trip)

Read input from HBM → MatMul + Bias + GELU + Dropout in registers → write output to HBM

The unfused version does 4 separate round-trips to HBM. Bias add, GELU, and dropout are all memory-bound (AI < 1). Their compute time is negligible — they're pure overhead from the memory transfers. Fusing them into the matmul kernel eliminates 3 out of 4 HBM accesses. The intermediate values live in fast registers and shared memory, never touching HBM.

The key insight: Fusion doesn't reduce the total FLOPs. It reduces the total bytes moved to/from HBM. Since the fused operations (bias, activation, dropout) are memory-bound, eliminating their memory traffic gives us those operations essentially "for free" — they happen while the data is already in registers.

FlashAttention: Fusion's Greatest Hit

The standard attention computation is: Q×K¹ → scale → mask → softmax → dropout → ×V. Without fusion, each step reads/writes full [seq_len × seq_len] matrices to HBM. For a sequence length of 2048, that's 2048² × 2 bytes = 8 MB per intermediate matrix, several times.

FlashAttention fuses the entire attention computation into a single kernel. It tiles the Q, K, V matrices into blocks that fit in shared memory (SRAM), computes attention tile-by-tile, and accumulates the output without ever materializing the full attention matrix in HBM. The result: 2–4× speedup and dramatically less memory usage.

Fusion vs Unfused Data Flow

Watch data flow through fused vs unfused operations. Red arrows = HBM transfers (slow). Green arrows = register/SRAM transfers (fast). Toggle fusion to see the difference.

Unfused: 4 HBM round-trips

When to Fuse?

Fusion helps most when the operations being fused are memory-bound. Fusing two compute-bound operations doesn't help much — they're already utilizing the hardware well. The sweet spot is fusing a chain of memory-bound operations (bias add, activation, norm, dropout) with a compute-bound one (matmul), so the memory-bound ops ride along "for free."

python
# PyTorch without fusion: 4 separate HBM round-trips
x = torch.matmul(input, weight)     # write to HBM
x = x + bias                        # read + write HBM
x = torch.nn.functional.gelu(x)     # read + write HBM
x = torch.nn.functional.dropout(x)  # read + write HBM

# With torch.compile or a custom kernel: 1 HBM round-trip
# All intermediate results stay in registers
x = fused_matmul_bias_gelu_dropout(input, weight, bias)

Check: Operator fusion improves performance primarily by reducing what?

The total number of FLOPs computed The number of HBM read/write round-trips for intermediate results The size of the model's parameters

Chapter 8: Connections

Let's pull it all together. This lesson covered the fundamental hardware-software interface for ML systems: why hardware matters, how CPUs and GPUs differ, the memory hierarchy, arithmetic intensity, the roofline model, tiling, and operator fusion. These concepts are the foundation of every performance optimization in modern deep learning.

The Principles, Summarized

Principle	What It Means	When to Apply
Arithmetic Intensity	FLOPs / Bytes moved. The single number that determines your bottleneck.	Always. Compute this first for any operation you want to optimize.
Tiling / Blocking	Break work into tiles that fit in fast memory. Reuse loaded data.	Matmul, convolutions — any operation where data can be reused across outputs.
Operator Fusion	Combine kernels to eliminate intermediate HBM writes.	Chains of memory-bound ops (activation, norm, dropout) following a compute-bound op (matmul).
Parallelization	Expose enough independent work to fill all SMs.	Always. Need at least 108 independent output tiles to saturate an A100.
Caching vs Recomputation	Cache intermediates if compute-bound; recompute if memory-bound.	KV caching in autoregressive generation (save compute). Gradient checkpointing in training (save memory).
Pipelining	Overlap compute with memory I/O to hide latency.	Large batch processing where next batch's data can be prefetched during current batch's compute.

Real-World Impact

Technique	Example	Speedup
FlashAttention (tiling + fusion)	Attention in transformers	2–4×
KV Caching	Autoregressive generation	O(N) → O(1) per token for K,V computation
torch.compile (fusion)	General PyTorch models	1.3–2×
Mixed precision (FP16/BF16)	Training and inference	2× compute, 2× memory

The one-sentence takeaway: The fastest algorithm on paper can be the slowest in practice if it ignores how data moves through the memory hierarchy. Hardware awareness is not an optimization — it's a prerequisite.

Where to Go From Here

CS 229s Lecture 5: FlashAttention deep dive — how tiling and fusion combine to make attention hardware-efficient
CS 229s Lecture 4: Profiling tools — how to measure arithmetic intensity and identify bottlenecks in real code
Transformer lesson — the architecture whose performance these techniques optimize

“The purpose of computing is insight, not numbers.” — Richard Hamming. But to get those insights fast enough to matter, you need to understand the machine that computes them.

Final check: An operation has arithmetic intensity of 50 FLOPs/Byte on an A100 (ridge point ~160). Which optimization strategy is most appropriate?

Reduce data movement (fusion, tiling, caching) since it's memory-bound Use faster compute (mixed precision, Tensor Cores) since it's compute-bound No optimization possible