CS 229s — Systems for Machine Learning

Analyzing Transformer Efficiency

Count every FLOP, trace every byte. Training costs, inference bottlenecks, KV caching, and speculative decoding — the systems view of transformers.

Prerequisites: Transformer basics + Matrix multiply. That's it.
9
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: Why Analyze Performance?

You've just trained a transformer model. It works. But it takes 14 hours per epoch on 8 GPUs, and inference serves 40 tokens per second. Your boss wants 200. Where is the time going? Which part do you optimize?

Most ML engineers can write a transformer from scratch. Far fewer can answer: how many floating-point operations does a single forward pass cost? Or: why is autoregressive decoding 100x slower than what the hardware should theoretically support?

The answers live in two numbers: FLOPs (floating-point operations — the computational work) and memory bandwidth (bytes moved per second between GPU memory and compute cores). Every bottleneck in training and inference comes down to one of these being the limiting factor.

The systems perspective: An algorithm isn't fast because it has few FLOPs. It's fast when the hardware isn't waiting. If compute cores sit idle while data loads from memory, you're memory-bound. If memory is idle while cores churn through math, you're compute-bound. Knowing which regime you're in tells you what to fix.
Compute vs Memory: Where's the Bottleneck?

Drag the batch size slider. Small batches are memory-bound (GPU starves for data). Large batches are compute-bound (math is the bottleneck). The crossover is the arithmetic intensity threshold.

Batch size16

In the chapters ahead, we'll count every FLOP in training (forward + backward), trace every byte in memory, dissect inference into its two phases (prefill and decode), and learn the tricks — KV caching, speculative decoding, continuous batching — that close the gap between theoretical and actual throughput.

Check: A GPU can theoretically do 312 TFLOPS, but your model only achieves 40 tokens/sec. What's the most likely bottleneck during autoregressive decoding?

Chapter 1: Forward Pass FLOPs

Let's count. Every matrix multiply has a precise FLOP cost, and a transformer is just a stack of matrix multiplies with nonlinearities between them. Once you know the rule, you can compute the cost of any model on a napkin.

The Matrix Multiply FLOP Rule

Multiplying a matrix of shape [M, K] by one of shape [K, N] produces a [M, N] output. Each output element requires K multiplications and K−1 additions. We approximate this as 2MKN FLOPs (the factor of 2 counts both multiplies and adds).

FLOPs(A × B) = 2 · M · K · N
where A is [M, K] and B is [K, N]
Why the factor of 2? Each element of the output is a dot product: multiply K pairs, then sum K results. That's K multiplies + (K−1) adds ≈ 2K FLOPs per element. There are M×N elements, so total = 2MKN. This is the universal FLOP-counting rule for dense matrix multiply.

A Single MLP Layer

Consider one MLP layer in a transformer. The input is a batch of B sequences, each of length S, with hidden dimension H. The MLP first projects up to 4H (the intermediate size), applies an activation, then projects back down:

Linear 1: [BS, H] × [H, 4H]
2 · BS · H · 4H = 8BSH² FLOPs
Activation (GELU)
≈ 0 FLOPs (negligible vs matmul)
Linear 2: [BS, 4H] × [4H, H]
2 · BS · 4H · H = 8BSH² FLOPs

Total MLP FLOPs per layer: 16BSH².

The Attention Layer

Attention has four projections (Q, K, V, and output) plus the attention score computation. Each of Q, K, V, O is a [H, H] weight matrix applied to [BS, H] input:

Q, K, V projections (3×)
3 × 2BSH² = 6BSH² FLOPs
Attention scores: QKT
2BS²H FLOPs (per head, summed)
Attention × V
2BS²H FLOPs
Output projection
2BSH² FLOPs

Total attention FLOPs per layer: 8BSH² + 4BS²H.

Full Transformer Forward Pass

A transformer with L layers has total forward FLOPs:

FLOPsfwd = L × (24BSH² + 4BS²H)
= L × (MLP: 16BSH² + Attn projections: 8BSH² + Score/Value: 4BS²H)

Plus the final vocabulary projection: 2BSH·V (where V is vocabulary size). For most models, the 24BSH² term dominates, so the quick rule of thumb is:

Napkin rule: Forward FLOPs ≈ 2 × (number of parameters) × (tokens processed). For a model with P parameters and T = B×S tokens: FLOPsfwd ≈ 2PT. This works because each parameter participates in one multiply and one add per token.

Worked Example: GPT-2 Small

ParameterGPT-2 Small
Hidden dim H768
Layers L12
Vocab V50,257
Seq length S1,024
Batch B1

Let's compute step by step. Tokens T = B×S = 1,024.

MLP per layer: 16 × 1,024 × 768² = 16 × 1,024 × 589,824 = 9.66 × 109 FLOPs

Attention projections per layer: 8 × 1,024 × 768² = 4.83 × 109 FLOPs

Attention scores per layer: 4 × 1,024² × 768 = 3.22 × 109 FLOPs

Per layer total: 9.66 + 4.83 + 3.22 = 17.7 × 109 FLOPs

All 12 layers: 12 × 17.7 × 109 = 212 GFLOPs

Vocab projection: 2 × 1,024 × 768 × 50,257 = 79.0 GFLOPs

Total forward:291 GFLOPs per batch of 1,024 tokens.

Quick check via the napkin rule: GPT-2 Small has ~124M parameters. 2 × 124M × 1,024 ≈ 254 GFLOPs. Close enough — the difference is the attention score term (4BS²H) which the napkin rule ignores.

FLOP Calculator

Adjust model dimensions and see FLOPs break down by component. The bar chart shows where compute goes.

Hidden dim H768
Layers L12
Seq length S1024
Check: For a matrix multiply [M, K] × [K, N], how many FLOPs does it cost?

Chapter 2: Backward Pass — Why It's 2× Forward

Training isn't just the forward pass. After computing the loss, we backpropagate gradients through every layer to update weights. The backward pass is always roughly twice the FLOPs of the forward pass. Let's see exactly why.

Backprop Through a Single Linear Layer

Consider a linear layer: z = x · W, where x is [B, H] and W is [H, N]. The forward pass costs 2BHN FLOPs. During backprop, we receive dL/dz (same shape as z: [B, N]) and need two things:

Gradient w.r.t. weights
dL/dW = xT · dL/dz → [H, B] × [B, N] = 2BHN FLOPs
Gradient w.r.t. input
dL/dx = dL/dz · WT → [B, N] × [N, H] = 2BHN FLOPs

Two matrix multiplies, each costing 2BHN. Total backward FLOPs for this layer: 4BHN. Compare with the forward pass: 2BHN. The backward pass is exactly the forward pass for a linear layer.

Why 2×? Forward does one matmul. Backward does two: one to compute the weight gradient (for the optimizer), and one to propagate the error signal to the previous layer (for the chain rule). Both matmuls have the same cost as the forward one. Hence: backward = 2 × forward.

The Computational Reuse Trick

Look at the weight gradient formula: dL/dW = xT · dL/dz. We need x — the activation from the forward pass. This means during forward, we must save every intermediate activation to reuse during backward. This is why training is so memory-hungry: we store activations for every layer.

The chain rule also means intermediate gradients like dL/dv are reused across multiple weight updates. An efficient backprop implementation computes each gradient once and caches it — never recomputing the same thing twice.

Full Training FLOPs

Since the backward pass is 2× forward, the total training cost per step is:

FLOPstrain = FLOPsfwd + FLOPsbwd = FLOPsfwd + 2 × FLOPsfwd = 3 × FLOPsfwd

For our GPT-2 Small example: 3 × 291 GFLOPs = 873 GFLOPs per training step (batch size 1, sequence length 1024).

With the napkin rule: training FLOPs ≈ 6PT (6 × parameters × tokens), since forward ≈ 2PT and we multiply by 3.

The 6PT rule: Total training FLOPs ≈ 6 × P × T. This is the single most useful back-of-envelope formula in ML systems. Chinchilla-optimal training of a 70B model on 1.4T tokens: 6 × 70×109 × 1.4×1012 = 5.88×1023 FLOPs — about 1,000 A100-days.
Forward vs Backward FLOPs

This visualization shows a computation graph for L = 2(w1w2)w3. Watch the forward pass compute values left-to-right, then backprop compute gradients right-to-left, reusing cached values.

Check: Why does the backward pass require about 2× the FLOPs of the forward pass?

Chapter 3: Memory — The Real Bottleneck

FLOPs tell you how much work a GPU does. Memory tells you whether the work fits. In practice, memory is almost always the binding constraint in training. Let's trace exactly where every byte goes.

Model Parameters

A transformer with L layers and hidden dimension H has roughly:

P ≈ 12 · L · H²
(4H² for attention: Q,K,V,O × H²; plus 8H² for MLP: H×4H + 4H×H)

In fp16 (2 bytes per parameter), GPT-2 Small: 12 × 12 × 768² = 84.9M weight-params ≈ 170 MB. Add embeddings and you get ~248 MB for 124M params.

Optimizer State (Adam)

Adam stores two extra copies of every parameter: the first moment (mean of gradients) and the second moment (mean of squared gradients). Both are kept in fp32 for numerical stability. Plus we store a fp32 master copy of the weights:

ComponentBytes per paramGPT-2 Small (124M params)
Weights (fp16)2248 MB
Gradients (fp16)2248 MB
Master weights (fp32)4496 MB
Adam m (fp32)4496 MB
Adam v (fp32)4496 MB
Total161.98 GB

That's 16 bytes per parameter just for optimizer state + weights + gradients. A 7B model: 7×109 × 16 = 112 GB — already more than a single A100-80GB can hold!

16 bytes per parameter: This is the number that makes large-model training impossible without parallelism. The model weights themselves are just 2 bytes/param in fp16, but Adam's state quadruples the memory footprint. This is exactly what ZeRO (DeepSpeed) and FSDP (PyTorch) target: they shard optimizer state across GPUs.

Activations

During training, we cache intermediate activations for backprop. Per layer, we store the input to each linear layer, the attention scores, and the MLP intermediate. For a transformer layer with batch B, sequence S, hidden H:

Activations per layer ≈ BS × (34H + 5S × nheads) bytes (in fp16)

For GPT-2 Small with B=1, S=1024, H=768, nheads=12: about 54 MB per layer, or 648 MB for all 12 layers. This is why activation checkpointing (recompute activations during backward instead of storing them) is essential for long sequences.

Training Memory Breakdown

Adjust model size and see where GPU memory goes. The stacked bars show weights, optimizer state, gradients, and activations.

Parameters (M)124M
Seq length1024
Check: With Adam optimizer in mixed-precision training, how many bytes of GPU memory does each model parameter require?

Chapter 4: Prefill vs Decode

Training is expensive but predictable. Inference is where things get weird. An LLM serving a chat request has two distinct phases, and they have completely different performance characteristics.

Phase 1: Prefill

The user sends a prompt of N tokens. The model processes them all at once in a single forward pass, just like training. This is compute-bound — there's enough parallel work to keep the GPU busy. The output: a KV cache and the first generated token.

Phase 2: Decode (Autoregressive Generation)

Now the model generates tokens one at a time. Each step processes a single token. The Q, K, V projections are tiny: [1, H] × [H, H]. The arithmetic intensity (FLOPs per byte loaded) plummets. We must load the entire model's weights from memory to produce one token.

The decode paradox: During decode, an A100 with 312 TFLOPS of compute only achieves a few hundred tokens/sec. Why? Because generating one token requires loading ~3 GB of weights (GPT-2-XL, fp16) from HBM. At 2 TB/s memory bandwidth, that's 1.5ms just for the load. The actual computation takes a fraction of that. The GPU's compute cores are mostly idle.

Arithmetic Intensity: The Key Metric

Arithmetic intensity is the ratio of FLOPs to bytes moved:

AI = FLOPs / Bytes moved

An A100 has a compute-to-bandwidth ratio of 312 TFLOPS / 2 TB/s = 156 FLOP/byte. If your operation has AI < 156, you're memory-bound. If AI > 156, you're compute-bound.

For a matmul [B, H] × [H, H]: FLOPs = 2BH², bytes loaded ≈ 2H² (the weight matrix in fp16). So AI = 2BH² / 2H² = B. When B = 1 (single-token decode), AI = 1 — we're catastrophically memory-bound. We need B ≥ 156 to saturate compute!

Prefill vs Decode Timeline

Watch a prompt get prefilled in parallel, then tokens decoded one-by-one. The orange bar shows GPU utilization — notice how it drops during decode.

PropertyPrefillDecode
Tokens processedAll prompt tokens at onceOne token per step
Arithmetic intensityHigh (batch=N)Low (batch=1)
BottleneckCompute-boundMemory-bandwidth-bound
GPU utilizationHigh (>50%)Low (<5% of peak FLOPS)
LatencyFast (parallel)Slow (sequential)
Check: During autoregressive decode with batch size 1, the arithmetic intensity is approximately 1. On an A100, you'd need to process how many tokens in parallel to become compute-bound?

Chapter 5: KV Cache — Trade Memory for Compute

Here's the core insight of KV caching: during autoregressive generation, the keys and values for previously generated tokens don't change. Token 5's K and V only depend on tokens 1–5. When we generate token 6, token 5's K and V are the same as before. So why recompute them?

How It Works

Without caching, generating token N requires computing Q, K, V for all N tokens, then computing attention. That's O(N × H²) for projections alone. With caching, we only compute Q, K, V for the new token, then append its K, V to the cache and attend over the full sequence:

Without KV Cache (step N)
Compute Q,K,V for ALL N tokens: 3 × 2Nd² = 6Nd² FLOPs
vs
With KV Cache (step N)
Compute Q,K,V for 1 token: 3 × 2d² = 6d² FLOPs ← N× cheaper!

The savings are massive. At step 1000, we do 6d² FLOPs instead of 6000d². But there's a cost: we must store all those cached K and V vectors.

KV Cache Memory: The Exact Calculation

For each token in the sequence, each layer stores a K vector and a V vector, each of dimension dmodel. In fp16 (2 bytes per value):

KV cache per token = 2 × 2 × L × dmodel bytes
= 2 (K and V) × 2 (bytes in fp16) × L (layers) × d (hidden dim)

Worked Example: 7B Model on A100-40GB

Model: 7B parameters, L=32 layers, dmodel=4096, sequence length 1024.

Model weights: 7B × 2 bytes = 14 GB

KV cache per token: 2 × 2 × 32 × 4096 = 524,288 bytes ≈ 0.5 MB

KV cache for 1024 tokens: 0.5 MB × 1024 = 512 MB

Remaining for batching: 40 − 14 = 26 GB free. At 0.5 MB/token, that's 26 GB / 0.5 MB = ~52,000 tokens of KV cache. With S=1024, we can serve a batch of ~50 requests simultaneously.

The KV cache tradeoff: KV caching trades memory for compute. It's worth it when you're memory-bandwidth-bound (i.e., decode phase), because the extra memory traffic from reading the cache is tiny compared to recomputing all projections. But the cache can be huge — for a 70B model with 32K context, it's ~40 GB per request!

When Does KV Caching Make Sense?

KV caching adds memory traffic (reading the cached K,V) but saves compute. It makes sense when we're compute-bound — but during decode we're memory-bound. The resolution: KV caching saves us from becoming even more memory-bound. Without it, we'd load all N weight matrices N times; with it, we load them once and read the (smaller) cache.

More precisely, recomputing K,V for all N tokens costs 2 × 2Nd² FLOPs and loads 2d² × 2 bytes of weights. Caching costs 0 extra FLOPs for projection but reads 2 × 2 × N × d bytes of cached values. For large d, the cache read is much smaller than the weight load for recomputation.

KV Cache Explorer (SHOWCASE)

Adjust sequence length, number of layers, and hidden dimension to see the KV cache grow. The visualization shows memory allocation on a 40GB GPU. Red means the cache exceeds available memory.

Seq length2048
Layers32
Hidden dim4096
Batch size1
Check: Why don't the K and V vectors for existing tokens change during autoregressive generation?

Chapter 6: Speculative Decoding

Autoregressive decode is slow because we generate one token at a time, and each token requires loading the entire model from memory. The GPU has compute to spare but nothing to do with it. Speculative decoding asks: what if we guessed the next several tokens and verified them all at once?

The Core Idea

Use a small, fast draft model to guess the next K tokens. Then run all K tokens through the large target model in a single forward pass (like prefill). Check which guesses match. Accept the correct prefix, discard the rest, and repeat.

Step 1: Draft
Small model generates K candidate tokens autoregressively (fast, cheap)
Step 2: Verify
Large model scores all K tokens in ONE forward pass (batched, efficient)
Step 3: Accept
Accept longest prefix where draft matches target. Discard the rest.
↻ repeat

Why Does This Work?

Remember: during decode, the GPU is memory-bound. Processing 1 token or K tokens takes almost the same wall-clock time, because the bottleneck is loading weights, not computing. If we batch K guessed tokens, the extra compute is "free" — we're just using idle compute capacity.

The speedup depends on the acceptance rate — what fraction of draft tokens the target model agrees with. In practice, a draft model ~15× smaller achieves 70–80% acceptance on natural language, giving 2–3× speedup.

It's just like branch prediction! CPUs guess which branch an if-statement will take and start executing speculatively. If the guess is right, free speedup. If wrong, small penalty. Speculative decoding is the same idea applied to LLM inference: guess future tokens, execute speculatively, discard if wrong.

Three Guessing Strategies

StrategyHow it worksProsCons
Draft modelSmaller LLM trained on same dataGood agreement (~70-80%), easy to implementExtra model to manage, extra memory
Medusa headsExtra prediction heads on the target model itselfNear-zero cost, no extra modelGuesses degrade quickly (mean-field approx); ~2× max speedup
Lossy optimizationQuantized/pruned version of target modelHighly correlated with target, no extra modelComplex code, under-explored
Speculative Decoding Simulator

Watch the draft model generate guesses (fast, purple) and the target model verify them (orange). Green = accepted, red = rejected. Adjust acceptance rate and speculation depth.

Acceptance rate75%
Speculation depth K5
Check: Why is processing K speculative tokens nearly free during decode?

Chapter 7: Batching — Throughput vs Latency

We've established that single-token decode is memory-bound. The natural fix: batch multiple requests together. If we're loading the weights anyway, why not process 32 users' tokens at once?

Throughput vs Latency

Two very different optimization targets:

Latency

Time for ONE request to complete. Critical for interactive applications (chatbots, Copilot). A user waiting 2 seconds feels snappy; 10 seconds feels broken. Batching doesn't help individual latency — it can even hurt.

Throughput

Total tokens generated per second across ALL requests. Critical for serving many users or processing offline batches (summarization, data analysis). Batching directly increases throughput by amortizing weight loads.

Bandwidth vs throughput: Bandwidth is a property of the hardware (how many bytes/sec the memory bus can move). Throughput is a property of your system (how many tokens/sec you actually produce). If your throughput is below what bandwidth allows, you're leaving performance on the table.

How Batch Size Affects the Bottleneck

With batch size B, the arithmetic intensity of a matmul becomes B (since FLOPs scale with B but weight loads don't). As we increase B:

Batch size BArithmetic intensityRegimeBottleneck
11Memory-boundWeight loading
3232Memory-boundWeight loading (less idle compute)
156+≥156Compute-boundArithmetic units

Maximum Batch Size: KV Cache Limits

But we can't batch infinitely! Each request in the batch needs its own KV cache. From Chapter 5, a 7B model with S=1024 uses ~512 MB of KV cache per request. On a 40GB GPU with 14GB for weights, we have 26GB for KV caches: 26GB / 512MB = ~50 concurrent requests.

Continuous Batching

In naive batching, all requests in a batch must finish before new ones enter. If one request needs 500 tokens and another needs 10, the GPU sits idle for 490 tokens on the short request.

Continuous batching (also called iteration-level batching) replaces finished requests immediately. As soon as request A finishes, request D takes its slot in the next forward pass. This keeps the batch full and the GPU busy.

PagedAttention

The KV cache for each request is a contiguous block of memory. If we pre-allocate for max sequence length, most of it is wasted. PagedAttention (from vLLM) borrows the idea of virtual memory pages from operating systems: allocate KV cache in small blocks, linked via a page table. This eliminates fragmentation and lets us serve 2–4× more concurrent requests.

The serving stack: KV caching + continuous batching + PagedAttention is the modern LLM serving trifecta. Together they push throughput from ~50 tokens/sec (naive) to thousands of tokens/sec on the same hardware.
Batch Size vs GPU Utilization

Increase batch size and watch the GPU move from memory-bound to compute-bound. The teal line is memory time, the orange line is compute time. The bottleneck is whichever is higher.

Batch size1
Check: Why can't we just use a batch size of 10,000 to maximize throughput?

Chapter 8: Connections

You now have the systems vocabulary to analyze any transformer workload. Let's consolidate.

Cheat Sheet

QuantityFormulaGPT-2 Small (124M)
Forward FLOPs≈ 2PT254 GFLOPs
Training FLOPs (per step)≈ 6PT762 GFLOPs
Params memory (fp16)2P bytes248 MB
Training memory (Adam)16P bytes1.98 GB
KV cache per token4Ld bytes73.7 KB
Arithmetic intensity (decode)≈ batch size1 (for B=1)
Backward/Forward ratio

The Decision Tree

Are you training or serving?
Training: optimize FLOPs (6PT) + memory (16P bytes). Serving: read on ↓
Is it prefill or decode?
Prefill = compute-bound. Decode = memory-bound.
Can you increase batch size?
Yes → continuous batching + PagedAttention. No → speculative decoding.
Still not fast enough?
Quantize (int8/int4), distill, prune, or use MQA/GQA to shrink the KV cache.

What We Didn't Cover

TopicWhat it isWhere to learn
FlashAttentionFused kernel that avoids materializing the full attention matrixCS 229s Lecture 03
Tensor parallelismSplitting individual layers across GPUsCS 229s Lecture 05
Pipeline parallelismPutting different layers on different GPUsCS 229s Lecture 05
QuantizationReducing precision (int8, int4, GPTQ, AWQ)CS 229s Lecture 06
Multi-Query AttentionSharing K,V heads to shrink the KV cacheShazeer 2019
The big picture: A transformer's efficiency is not one number. It's a landscape shaped by batch size, sequence length, model size, and hardware. The systems engineer's job is to know which regime they're in and choose the right tool. FLOPs matter during training and prefill. Memory bandwidth matters during decode. Memory capacity limits batch size. Every optimization trades one resource for another.

"The purpose of computing is insight, not numbers." — Richard Hamming

Final check: You're serving a 7B model and decode throughput is low. GPU compute utilization is 3%. What's the most impactful optimization?