Analyzing Transformer Efficiency

Chapter 0: Why Analyze Performance?

You've just trained a transformer model. It works. But it takes 14 hours per epoch on 8 GPUs, and inference serves 40 tokens per second. Your boss wants 200. Where is the time going? Which part do you optimize?

Most ML engineers can write a transformer from scratch. Far fewer can answer: how many floating-point operations does a single forward pass cost? Or: why is autoregressive decoding 100x slower than what the hardware should theoretically support?

The answers live in two numbers: FLOPs (floating-point operations — the computational work) and memory bandwidth (bytes moved per second between GPU memory and compute cores). Every bottleneck in training and inference comes down to one of these being the limiting factor.

The systems perspective: An algorithm isn't fast because it has few FLOPs. It's fast when the hardware isn't waiting. If compute cores sit idle while data loads from memory, you're memory-bound. If memory is idle while cores churn through math, you're compute-bound. Knowing which regime you're in tells you what to fix.

Compute vs Memory: Where's the Bottleneck?

Drag the batch size slider. Small batches are memory-bound (GPU starves for data). Large batches are compute-bound (math is the bottleneck). The crossover is the arithmetic intensity threshold.

Batch size16

In the chapters ahead, we'll count every FLOP in training (forward + backward), trace every byte in memory, dissect inference into its two phases (prefill and decode), and learn the tricks — KV caching, speculative decoding, continuous batching — that close the gap between theoretical and actual throughput.

Check: A GPU can theoretically do 312 TFLOPS, but your model only achieves 40 tokens/sec. What's the most likely bottleneck during autoregressive decoding?

The model has too many parameters Memory bandwidth — the GPU spends most of its time loading weights, not computing Python is too slow

Chapter 1: Forward Pass FLOPs

Let's count. Every matrix multiply has a precise FLOP cost, and a transformer is just a stack of matrix multiplies with nonlinearities between them. Once you know the rule, you can compute the cost of any model on a napkin.

The Matrix Multiply FLOP Rule

Multiplying a matrix of shape [M, K] by one of shape [K, N] produces a [M, N] output. Each output element requires K multiplications and K−1 additions. We approximate this as 2MKN FLOPs (the factor of 2 counts both multiplies and adds).

FLOPs(A × B) = 2 · M · K · N
where A is [M, K] and B is [K, N]

Why the factor of 2? Each element of the output is a dot product: multiply K pairs, then sum K results. That's K multiplies + (K−1) adds ≈ 2K FLOPs per element. There are M×N elements, so total = 2MKN. This is the universal FLOP-counting rule for dense matrix multiply.

A Single MLP Layer

Consider one MLP layer in a transformer. The input is a batch of B sequences, each of length S, with hidden dimension H. The MLP first projects up to 4H (the intermediate size), applies an activation, then projects back down:

Linear 1: [BS, H] × [H, 4H]

2 · BS · H · 4H = 8BSH² FLOPs

↓

Activation (GELU)

≈ 0 FLOPs (negligible vs matmul)

↓

Linear 2: [BS, 4H] × [4H, H]

2 · BS · 4H · H = 8BSH² FLOPs

Total MLP FLOPs per layer: 16BSH².

The Attention Layer

Attention has four projections (Q, K, V, and output) plus the attention score computation. Each of Q, K, V, O is a [H, H] weight matrix applied to [BS, H] input:

Q, K, V projections (3×)

3 × 2BSH² = 6BSH² FLOPs

↓

Attention scores: QK^T

2BS²H FLOPs (per head, summed)

↓

Attention × V

2BS²H FLOPs

↓

Output projection

2BSH² FLOPs

Total attention FLOPs per layer: 8BSH² + 4BS²H.

Full Transformer Forward Pass

A transformer with L layers has total forward FLOPs:

FLOPs_fwd = L × (24BSH² + 4BS²H)
= L × (MLP: 16BSH² + Attn projections: 8BSH² + Score/Value: 4BS²H)

Plus the final vocabulary projection: 2BSH·V (where V is vocabulary size). For most models, the 24BSH² term dominates, so the quick rule of thumb is:

Napkin rule: Forward FLOPs ≈ 2 × (number of parameters) × (tokens processed). For a model with P parameters and T = B×S tokens: FLOPs_fwd ≈ 2PT. This works because each parameter participates in one multiply and one add per token.

Worked Example: GPT-2 Small

Parameter	GPT-2 Small
Hidden dim H	768
Layers L	12
Vocab V	50,257
Seq length S	1,024
Batch B	1

Let's compute step by step. Tokens T = B×S = 1,024.

MLP per layer: 16 × 1,024 × 768² = 16 × 1,024 × 589,824 = 9.66 × 10⁹ FLOPs

Attention projections per layer: 8 × 1,024 × 768² = 4.83 × 10⁹ FLOPs

Attention scores per layer: 4 × 1,024² × 768 = 3.22 × 10⁹ FLOPs

Per layer total: 9.66 + 4.83 + 3.22 = 17.7 × 10⁹ FLOPs

All 12 layers: 12 × 17.7 × 10⁹ = 212 GFLOPs

Vocab projection: 2 × 1,024 × 768 × 50,257 = 79.0 GFLOPs

Total forward: ≈ 291 GFLOPs per batch of 1,024 tokens.

Quick check via the napkin rule: GPT-2 Small has ~124M parameters. 2 × 124M × 1,024 ≈ 254 GFLOPs. Close enough — the difference is the attention score term (4BS²H) which the napkin rule ignores.

FLOP Calculator

Adjust model dimensions and see FLOPs break down by component. The bar chart shows where compute goes.

Hidden dim H768

Layers L12

Seq length S1024

Check: For a matrix multiply [M, K] × [K, N], how many FLOPs does it cost?

M × K × N 2 × M × K × N (one multiply + one add per element per dot-product step) M × N (just the output size)

Chapter 2: Backward Pass — Why It's 2× Forward

Training isn't just the forward pass. After computing the loss, we backpropagate gradients through every layer to update weights. The backward pass is always roughly twice the FLOPs of the forward pass. Let's see exactly why.

Backprop Through a Single Linear Layer

Consider a linear layer: z = x · W, where x is [B, H] and W is [H, N]. The forward pass costs 2BHN FLOPs. During backprop, we receive dL/dz (same shape as z: [B, N]) and need two things:

Gradient w.r.t. weights

dL/dW = x^T · dL/dz → [H, B] × [B, N] = 2BHN FLOPs

↓

Gradient w.r.t. input

dL/dx = dL/dz · W^T → [B, N] × [N, H] = 2BHN FLOPs

Two matrix multiplies, each costing 2BHN. Total backward FLOPs for this layer: 4BHN. Compare with the forward pass: 2BHN. The backward pass is exactly 2× the forward pass for a linear layer.

Why 2×? Forward does one matmul. Backward does two: one to compute the weight gradient (for the optimizer), and one to propagate the error signal to the previous layer (for the chain rule). Both matmuls have the same cost as the forward one. Hence: backward = 2 × forward.

The Computational Reuse Trick

Look at the weight gradient formula: dL/dW = x^T · dL/dz. We need x — the activation from the forward pass. This means during forward, we must save every intermediate activation to reuse during backward. This is why training is so memory-hungry: we store activations for every layer.

The chain rule also means intermediate gradients like dL/dv are reused across multiple weight updates. An efficient backprop implementation computes each gradient once and caches it — never recomputing the same thing twice.

Full Training FLOPs

Since the backward pass is 2× forward, the total training cost per step is:

FLOPs_train = FLOPs_fwd + FLOPs_bwd = FLOPs_fwd + 2 × FLOPs_fwd = 3 × FLOPs_fwd

For our GPT-2 Small example: 3 × 291 GFLOPs = 873 GFLOPs per training step (batch size 1, sequence length 1024).

With the napkin rule: training FLOPs ≈ 6PT (6 × parameters × tokens), since forward ≈ 2PT and we multiply by 3.

The 6PT rule: Total training FLOPs ≈ 6 × P × T. This is the single most useful back-of-envelope formula in ML systems. Chinchilla-optimal training of a 70B model on 1.4T tokens: 6 × 70×10⁹ × 1.4×10¹² = 5.88×10²³ FLOPs — about 1,000 A100-days.

Forward vs Backward FLOPs

This visualization shows a computation graph for L = 2(w₁w₂)w₃. Watch the forward pass compute values left-to-right, then backprop compute gradients right-to-left, reusing cached values.

Check: Why does the backward pass require about 2× the FLOPs of the forward pass?

Because gradients are harder to compute than activations Because each layer needs two matmuls (weight gradient + input gradient) vs one in forward Because we use double-precision floats in backward

Chapter 3: Memory — The Real Bottleneck

FLOPs tell you how much work a GPU does. Memory tells you whether the work fits. In practice, memory is almost always the binding constraint in training. Let's trace exactly where every byte goes.

Model Parameters

A transformer with L layers and hidden dimension H has roughly:

P ≈ 12 · L · H²
(4H² for attention: Q,K,V,O × H²; plus 8H² for MLP: H×4H + 4H×H)

In fp16 (2 bytes per parameter), GPT-2 Small: 12 × 12 × 768² = 84.9M weight-params ≈ 170 MB. Add embeddings and you get ~248 MB for 124M params.

Optimizer State (Adam)

Adam stores two extra copies of every parameter: the first moment (mean of gradients) and the second moment (mean of squared gradients). Both are kept in fp32 for numerical stability. Plus we store a fp32 master copy of the weights:

Component	Bytes per param	GPT-2 Small (124M params)
Weights (fp16)	2	248 MB
Gradients (fp16)	2	248 MB
Master weights (fp32)	4	496 MB
Adam m (fp32)	4	496 MB
Adam v (fp32)	4	496 MB
Total	16	1.98 GB

That's 16 bytes per parameter just for optimizer state + weights + gradients. A 7B model: 7×10⁹ × 16 = 112 GB — already more than a single A100-80GB can hold!

16 bytes per parameter: This is the number that makes large-model training impossible without parallelism. The model weights themselves are just 2 bytes/param in fp16, but Adam's state quadruples the memory footprint. This is exactly what ZeRO (DeepSpeed) and FSDP (PyTorch) target: they shard optimizer state across GPUs.

Activations

During training, we cache intermediate activations for backprop. Per layer, we store the input to each linear layer, the attention scores, and the MLP intermediate. For a transformer layer with batch B, sequence S, hidden H:

Activations per layer ≈ BS × (34H + 5S × n_heads) bytes (in fp16)

For GPT-2 Small with B=1, S=1024, H=768, n_heads=12: about 54 MB per layer, or 648 MB for all 12 layers. This is why activation checkpointing (recompute activations during backward instead of storing them) is essential for long sequences.

Training Memory Breakdown

Adjust model size and see where GPU memory goes. The stacked bars show weights, optimizer state, gradients, and activations.

Parameters (M)124M

Seq length1024

Check: With Adam optimizer in mixed-precision training, how many bytes of GPU memory does each model parameter require?

2 bytes (just fp16 weights) 8 bytes (weights + gradients + one moment) 16 bytes (fp16 weights + fp16 grads + fp32 master + fp32 m + fp32 v)

Chapter 4: Prefill vs Decode

Training is expensive but predictable. Inference is where things get weird. An LLM serving a chat request has two distinct phases, and they have completely different performance characteristics.

Phase 1: Prefill

The user sends a prompt of N tokens. The model processes them all at once in a single forward pass, just like training. This is compute-bound — there's enough parallel work to keep the GPU busy. The output: a KV cache and the first generated token.

Phase 2: Decode (Autoregressive Generation)

Now the model generates tokens one at a time. Each step processes a single token. The Q, K, V projections are tiny: [1, H] × [H, H]. The arithmetic intensity (FLOPs per byte loaded) plummets. We must load the entire model's weights from memory to produce one token.

The decode paradox: During decode, an A100 with 312 TFLOPS of compute only achieves a few hundred tokens/sec. Why? Because generating one token requires loading ~3 GB of weights (GPT-2-XL, fp16) from HBM. At 2 TB/s memory bandwidth, that's 1.5ms just for the load. The actual computation takes a fraction of that. The GPU's compute cores are mostly idle.

Arithmetic Intensity: The Key Metric

Arithmetic intensity is the ratio of FLOPs to bytes moved:

AI = FLOPs / Bytes moved

An A100 has a compute-to-bandwidth ratio of 312 TFLOPS / 2 TB/s = 156 FLOP/byte. If your operation has AI < 156, you're memory-bound. If AI > 156, you're compute-bound.

For a matmul [B, H] × [H, H]: FLOPs = 2BH², bytes loaded ≈ 2H² (the weight matrix in fp16). So AI = 2BH² / 2H² = B. When B = 1 (single-token decode), AI = 1 — we're catastrophically memory-bound. We need B ≥ 156 to saturate compute!

Prefill vs Decode Timeline

Watch a prompt get prefilled in parallel, then tokens decoded one-by-one. The orange bar shows GPU utilization — notice how it drops during decode.

Property	Prefill	Decode
Tokens processed	All prompt tokens at once	One token per step
Arithmetic intensity	High (batch=N)	Low (batch=1)
Bottleneck	Compute-bound	Memory-bandwidth-bound
GPU utilization	High (>50%)	Low (<5% of peak FLOPS)
Latency	Fast (parallel)	Slow (sequential)

Check: During autoregressive decode with batch size 1, the arithmetic intensity is approximately 1. On an A100, you'd need to process how many tokens in parallel to become compute-bound?

About 156 (the A100's compute-to-bandwidth ratio) About 10 About 1,000,000

Chapter 5: KV Cache — Trade Memory for Compute

Here's the core insight of KV caching: during autoregressive generation, the keys and values for previously generated tokens don't change. Token 5's K and V only depend on tokens 1–5. When we generate token 6, token 5's K and V are the same as before. So why recompute them?

How It Works

Without caching, generating token N requires computing Q, K, V for all N tokens, then computing attention. That's O(N × H²) for projections alone. With caching, we only compute Q, K, V for the new token, then append its K, V to the cache and attend over the full sequence:

Without KV Cache (step N)

Compute Q,K,V for ALL N tokens: 3 × 2Nd² = 6Nd² FLOPs

With KV Cache (step N)

Compute Q,K,V for 1 token: 3 × 2d² = 6d² FLOPs ← N× cheaper!

The savings are massive. At step 1000, we do 6d² FLOPs instead of 6000d². But there's a cost: we must store all those cached K and V vectors.

KV Cache Memory: The Exact Calculation

For each token in the sequence, each layer stores a K vector and a V vector, each of dimension d_model. In fp16 (2 bytes per value):

KV cache per token = 2 × 2 × L × d_model bytes
= 2 (K and V) × 2 (bytes in fp16) × L (layers) × d (hidden dim)

Worked Example: 7B Model on A100-40GB

Model: 7B parameters, L=32 layers, d_model=4096, sequence length 1024.

Model weights: 7B × 2 bytes = 14 GB

KV cache per token: 2 × 2 × 32 × 4096 = 524,288 bytes ≈ 0.5 MB

KV cache for 1024 tokens: 0.5 MB × 1024 = 512 MB

Remaining for batching: 40 − 14 = 26 GB free. At 0.5 MB/token, that's 26 GB / 0.5 MB = ~52,000 tokens of KV cache. With S=1024, we can serve a batch of ~50 requests simultaneously.

The KV cache tradeoff: KV caching trades memory for compute. It's worth it when you're memory-bandwidth-bound (i.e., decode phase), because the extra memory traffic from reading the cache is tiny compared to recomputing all projections. But the cache can be huge — for a 70B model with 32K context, it's ~40 GB per request!

When Does KV Caching Make Sense?

KV caching adds memory traffic (reading the cached K,V) but saves compute. It makes sense when we're compute-bound — but during decode we're memory-bound. The resolution: KV caching saves us from becoming even more memory-bound. Without it, we'd load all N weight matrices N times; with it, we load them once and read the (smaller) cache.

More precisely, recomputing K,V for all N tokens costs 2 × 2Nd² FLOPs and loads 2d² × 2 bytes of weights. Caching costs 0 extra FLOPs for projection but reads 2 × 2 × N × d bytes of cached values. For large d, the cache read is much smaller than the weight load for recomputation.

KV Cache Explorer (SHOWCASE)

Adjust sequence length, number of layers, and hidden dimension to see the KV cache grow. The visualization shows memory allocation on a 40GB GPU. Red means the cache exceeds available memory.

Seq length2048

Layers32

Hidden dim4096

Batch size1

Check: Why don't the K and V vectors for existing tokens change during autoregressive generation?

Because each token's K,V only depend on the tokens before it (causal masking), which haven't changed Because K and V are random Because the weights are frozen during inference

Chapter 6: Speculative Decoding

Autoregressive decode is slow because we generate one token at a time, and each token requires loading the entire model from memory. The GPU has compute to spare but nothing to do with it. Speculative decoding asks: what if we guessed the next several tokens and verified them all at once?

The Core Idea

Use a small, fast draft model to guess the next K tokens. Then run all K tokens through the large target model in a single forward pass (like prefill). Check which guesses match. Accept the correct prefix, discard the rest, and repeat.

Step 1: Draft

Small model generates K candidate tokens autoregressively (fast, cheap)

↓

Step 2: Verify

Large model scores all K tokens in ONE forward pass (batched, efficient)

↓

Step 3: Accept

Accept longest prefix where draft matches target. Discard the rest.

↻ repeat

Why Does This Work?

Remember: during decode, the GPU is memory-bound. Processing 1 token or K tokens takes almost the same wall-clock time, because the bottleneck is loading weights, not computing. If we batch K guessed tokens, the extra compute is "free" — we're just using idle compute capacity.

The speedup depends on the acceptance rate — what fraction of draft tokens the target model agrees with. In practice, a draft model ~15× smaller achieves 70–80% acceptance on natural language, giving 2–3× speedup.

It's just like branch prediction! CPUs guess which branch an if-statement will take and start executing speculatively. If the guess is right, free speedup. If wrong, small penalty. Speculative decoding is the same idea applied to LLM inference: guess future tokens, execute speculatively, discard if wrong.

Three Guessing Strategies

Strategy	How it works	Pros	Cons
Draft model	Smaller LLM trained on same data	Good agreement (~70-80%), easy to implement	Extra model to manage, extra memory
Medusa heads	Extra prediction heads on the target model itself	Near-zero cost, no extra model	Guesses degrade quickly (mean-field approx); ~2× max speedup
Lossy optimization	Quantized/pruned version of target model	Highly correlated with target, no extra model	Complex code, under-explored

Speculative Decoding Simulator

Watch the draft model generate guesses (fast, purple) and the target model verify them (orange). Green = accepted, red = rejected. Adjust acceptance rate and speculation depth.

Acceptance rate75%

Speculation depth K5

Check: Why is processing K speculative tokens nearly free during decode?

Because we're memory-bound — loading weights takes the same time for 1 or K tokens, so the extra compute is free Because speculative tokens are smaller than normal tokens Because the draft model does all the work

Chapter 7: Batching — Throughput vs Latency

We've established that single-token decode is memory-bound. The natural fix: batch multiple requests together. If we're loading the weights anyway, why not process 32 users' tokens at once?

Throughput vs Latency

Two very different optimization targets:

Latency

Time for ONE request to complete. Critical for interactive applications (chatbots, Copilot). A user waiting 2 seconds feels snappy; 10 seconds feels broken. Batching doesn't help individual latency — it can even hurt.

Throughput

Total tokens generated per second across ALL requests. Critical for serving many users or processing offline batches (summarization, data analysis). Batching directly increases throughput by amortizing weight loads.

Bandwidth vs throughput: Bandwidth is a property of the hardware (how many bytes/sec the memory bus can move). Throughput is a property of your system (how many tokens/sec you actually produce). If your throughput is below what bandwidth allows, you're leaving performance on the table.

How Batch Size Affects the Bottleneck

With batch size B, the arithmetic intensity of a matmul becomes B (since FLOPs scale with B but weight loads don't). As we increase B:

Batch size B	Arithmetic intensity	Regime	Bottleneck
1	1	Memory-bound	Weight loading
32	32	Memory-bound	Weight loading (less idle compute)
156+	≥156	Compute-bound	Arithmetic units

Maximum Batch Size: KV Cache Limits

But we can't batch infinitely! Each request in the batch needs its own KV cache. From Chapter 5, a 7B model with S=1024 uses ~512 MB of KV cache per request. On a 40GB GPU with 14GB for weights, we have 26GB for KV caches: 26GB / 512MB = ~50 concurrent requests.

Continuous Batching

In naive batching, all requests in a batch must finish before new ones enter. If one request needs 500 tokens and another needs 10, the GPU sits idle for 490 tokens on the short request.

Continuous batching (also called iteration-level batching) replaces finished requests immediately. As soon as request A finishes, request D takes its slot in the next forward pass. This keeps the batch full and the GPU busy.

PagedAttention

The KV cache for each request is a contiguous block of memory. If we pre-allocate for max sequence length, most of it is wasted. PagedAttention (from vLLM) borrows the idea of virtual memory pages from operating systems: allocate KV cache in small blocks, linked via a page table. This eliminates fragmentation and lets us serve 2–4× more concurrent requests.

The serving stack: KV caching + continuous batching + PagedAttention is the modern LLM serving trifecta. Together they push throughput from ~50 tokens/sec (naive) to thousands of tokens/sec on the same hardware.

Batch Size vs GPU Utilization

Increase batch size and watch the GPU move from memory-bound to compute-bound. The teal line is memory time, the orange line is compute time. The bottleneck is whichever is higher.

Batch size1

Check: Why can't we just use a batch size of 10,000 to maximize throughput?

Because the model weights get too large Because each request needs its own KV cache, and GPU memory is finite Because Python can't handle that many requests

Chapter 8: Connections

You now have the systems vocabulary to analyze any transformer workload. Let's consolidate.

Cheat Sheet

Quantity	Formula	GPT-2 Small (124M)
Forward FLOPs	≈ 2PT	254 GFLOPs
Training FLOPs (per step)	≈ 6PT	762 GFLOPs
Params memory (fp16)	2P bytes	248 MB
Training memory (Adam)	16P bytes	1.98 GB
KV cache per token	4Ld bytes	73.7 KB
Arithmetic intensity (decode)	≈ batch size	1 (for B=1)
Backward/Forward ratio	2×	—

The Decision Tree

Are you training or serving?

Training: optimize FLOPs (6PT) + memory (16P bytes). Serving: read on ↓

↓

Is it prefill or decode?

Prefill = compute-bound. Decode = memory-bound.

↓

Can you increase batch size?

Yes → continuous batching + PagedAttention. No → speculative decoding.

↓

Still not fast enough?

Quantize (int8/int4), distill, prune, or use MQA/GQA to shrink the KV cache.

What We Didn't Cover

Topic	What it is	Where to learn
FlashAttention	Fused kernel that avoids materializing the full attention matrix	CS 229s Lecture 03
Tensor parallelism	Splitting individual layers across GPUs	CS 229s Lecture 05
Pipeline parallelism	Putting different layers on different GPUs	CS 229s Lecture 05
Quantization	Reducing precision (int8, int4, GPTQ, AWQ)	CS 229s Lecture 06
Multi-Query Attention	Sharing K,V heads to shrink the KV cache	Shazeer 2019

The big picture: A transformer's efficiency is not one number. It's a landscape shaped by batch size, sequence length, model size, and hardware. The systems engineer's job is to know which regime they're in and choose the right tool. FLOPs matter during training and prefill. Memory bandwidth matters during decode. Memory capacity limits batch size. Every optimization trades one resource for another.

"The purpose of computing is insight, not numbers." — Richard Hamming

Final check: You're serving a 7B model and decode throughput is low. GPU compute utilization is 3%. What's the most impactful optimization?

Use a bigger GPU with more FLOPs Increase batch size via continuous batching (3% utilization means memory-bound, so batch more to use idle compute) Reduce the number of layers