Why generating tokens is memory-bound, and every trick we use to make it fast.
You have a 70-billion parameter language model. You want it to generate text. Each new token requires multiplying against essentially ALL model weights. How fast can you go?
Let's do the arithmetic. 70B parameters in FP16 = 70 × 109 × 2 bytes = 140 GB of weights. An NVIDIA A100 has 2 TB/s of memory bandwidth. Time to read all weights once = 140 GB ÷ 2000 GB/s = 70 ms. That's ~14 tokens per second — regardless of how much compute the GPU has.
This is the memory wall. During autoregressive decoding, each token generation is a giant matrix-vector multiply (batch size 1). The arithmetic intensity is so low that the GPU's compute cores sit idle, waiting for data to arrive from HBM.
Let's quantify this. The arithmetic intensity is the ratio of compute operations to memory bytes transferred. For a matrix-vector multiply with an [N, N] weight matrix:
An A100 has ~312 TFLOPS of FP16 compute but only 2 TB/s of bandwidth. To fully utilize compute, you'd need an arithmetic intensity of 312/2 = 156 ops/byte. We have 1. We're using 0.6% of available compute.
Here's the timeline for generating a single token:
The compute is ~0.4ms but we wait ~70ms for memory reads. The GPU spends 99.4% of its time waiting for data.
Watch tokens generate one at a time. The orange bar is memory read time, teal is compute. Notice how compute is invisible.
Key insight from the widget: Increasing batch size doesn't increase memory reads (we read the same weights once for all sequences in the batch). But it increases compute proportionally. At batch size 32, we finally start utilizing the GPU's compute. This is why batching is so critical for serving.
python # Memory wall calculation def tokens_per_second(params_B, dtype_bytes=2, bw_TBps=2.0): """Max throughput for single-sequence decoding.""" weight_bytes = params_B * 1e9 * dtype_bytes time_per_token = weight_bytes / (bw_TBps * 1e12) # seconds return 1.0 / time_per_token # Examples: print(tokens_per_second(7)) # ~143 tok/s (7B model) print(tokens_per_second(70)) # ~14.3 tok/s (70B model) print(tokens_per_second(405)) # ~2.5 tok/s (Llama 405B)
We just learned that each token requires reading all model weights. But there's a second memory problem hiding in the attention mechanism. When generating token t, standard attention computes Q, K, V for ALL previous tokens 1 through t-1, then does softmax(QKT)V. That means for every new token, we're recomputing K and V vectors for the entire history.
The KV cache is the fix: store the K and V projections for all previously generated tokens. When generating token t, we only compute Q, K, V for the new token, then concatenate its K and V to the cache. The attention operation becomes: one new query attending to the full cached key/value history.
Let's compute the KV cache size for Llama-2-70B at sequence length 4096:
For Llama-2-70B: layers=80, kv_heads=8 (GQA), head_dim=128, seq_len=4096, FP16 (2 bytes):
Wait — that's with GQA (8 KV heads). Original multi-head attention has 64 KV heads:
10.7 GB for a single sequence! With an 80GB A100, after loading the 140GB model across GPUs, you can barely fit a few concurrent sequences. This is why KV cache optimization (GQA, quantization, PagedAttention) is so critical for serving.
python def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype=2): """KV cache memory for ONE sequence.""" return 2 * layers * kv_heads * head_dim * seq_len * dtype # Llama-2-70B with GQA (8 kv heads) size = kv_cache_bytes(80, 8, 128, 4096) print(f"GQA: {size / 1e9:.2f} GB") # 1.34 GB # Same model with MHA (64 kv heads) size = kv_cache_bytes(80, 64, 128, 4096) print(f"MHA: {size / 1e9:.2f} GB") # 10.74 GB # At 128K context (Llama-3.1-70B) size = kv_cache_bytes(80, 8, 128, 131072) print(f"128K ctx: {size / 1e9:.2f} GB") # 42.9 GB per sequence!
Adjust sequence length and see how KV cache memory grows. The orange region is KV cache, teal is model weights. The dashed line is total GPU memory.
In practice, the KV cache is managed like virtual memory. vLLM's PagedAttention allocates cache in fixed-size "pages" (blocks) rather than contiguous memory. This eliminates fragmentation — sequences of different lengths don't waste memory on padding.
| Model | KV Heads | Layers | Cache/seq @4K | Cache/seq @128K |
|---|---|---|---|---|
| Llama-3-8B | 8 (GQA) | 32 | 0.5 GB | 16 GB |
| Llama-3-70B | 8 (GQA) | 80 | 1.3 GB | 43 GB |
| GPT-3 175B (MHA) | 96 | 96 | 18 GB | 590 GB |
We just saw that KV cache memory is the ceiling on batch size. Three innovations attack this from different angles: reducing what we store (MQA/GQA), reducing how we compute (FlashAttention), and reducing precision (KV cache quantization).
Standard Multi-Head Attention (MHA) has separate K and V projections per head. With 64 heads, that's 64 K matrices and 64 V matrices in the cache. Multi-Query Attention shares a single K and single V across ALL query heads. Each head still has its own Q projection, so they "ask different questions" but look at the same keys and values.
For a 64-head model, MQA reduces KV cache by 64×. The cost? Slightly lower quality — about 0.5-1% degradation on benchmarks.
GQA is the compromise. Instead of 1 KV head (MQA) or N KV heads (MHA), use G groups where G divides N. Each group of N/G query heads shares one KV head. Llama-2-70B uses G=8 groups for its 64 query heads.
With 8 groups instead of 64 heads: 8× reduction in KV cache with negligible quality loss.
Standard attention materializes the full T × T attention matrix (softmax(QKT)) before multiplying by V. For T=128K, that matrix is 128K × 128K × 2 bytes = 32 GB. FlashAttention never materializes it.
The trick: tile the computation into blocks that fit in SRAM (on-chip memory, ~20MB on A100). For each tile, compute partial softmax(QKT)V, accumulate online using the log-sum-exp trick. The result is exact — not an approximation.
Compare KV cache sizes across attention variants. Model: 80 layers, 64 query heads, head_dim=128.
python # FlashAttention pseudocode (simplified) def flash_attention(Q, K, V, block_size=256): """Tiled attention that never materializes T x T matrix.""" T, d = Q.shape O = zeros_like(Q) # output accumulator L = zeros(T) # log-sum-exp accumulator for j in range(0, T, block_size): # tile over K,V Kj = K[j:j+block_size] Vj = V[j:j+block_size] for i in range(0, T, block_size): # tile over Q Qi = Q[i:i+block_size] # Compute block attention in SRAM S_ij = Qi @ Kj.T / sqrt(d) # block_size x block_size (fits in SRAM!) # Online softmax update m_new = max(L[i:i+block_size], S_ij.max(axis=-1)) P_ij = exp(S_ij - m_new[:, None]) # Accumulate output with rescaling O[i:i+block_size] = O[i:i+block_size] * exp(L[i:i+block_size] - m_new)[:, None] O[i:i+block_size] += P_ij @ Vj L[i:i+block_size] = m_new + log(P_ij.sum(axis=-1)) return O
| Method | KV Cache Reduction | Quality Impact | Used By |
|---|---|---|---|
| MHA (baseline) | 1× | — | GPT-3, OPT |
| GQA (8 groups) | 8× | <0.1% loss | Llama-2/3, Mistral |
| MQA (1 group) | 64× | 0.5-1% loss | PaLM, Falcon |
| FlashAttention | Memory: T² → T | Exact (0 loss) | Nearly universal |
We established that batching helps — reading weights once for N sequences gets us closer to compute-bound. But how do we batch in practice? Requests arrive at different times and have different output lengths. A user asking "What's 2+2?" finishes in 5 tokens. A user asking for an essay generates 500 tokens.
Collect requests into a batch. Run the entire batch until the LONGEST sequence finishes. Short sequences pad with <eos> tokens, wasting GPU cycles. New requests must wait until the entire batch completes.
The problem: if one request generates 500 tokens and the rest finish in 50, you're wasting GPU for 450 tokens worth of computation on padding. And incoming requests pile up waiting.
Continuous batching operates at the token granularity: after EACH token generation step, check if any sequence has finished. If so, evict it immediately and slot in a waiting request. The batch composition changes every single step.
The efficiency gain depends on output length variance. If all requests have similar length, static and continuous perform similarly. But in real workloads, output lengths vary by 10-100×, making continuous batching essential.
There's a subtlety: the first forward pass for a new request (the prefill) processes all input tokens in parallel — it's compute-bound. Subsequent steps (the decode) generate one token at a time — memory-bound. Good serving systems separate these:
| Phase | Tokens Processed | Bottleneck | Duration |
|---|---|---|---|
| Prefill | All input tokens (parallel) | Compute | ~100ms for 1K tokens |
| Decode | 1 token per step | Memory BW | ~70ms per token (70B) |
Time-to-first-token (TTFT) is dominated by prefill time. Time-per-output-token (TPOT) is dominated by decode time. Users care about both — TTFT for responsiveness, TPOT for streaming speed.
Watch requests arrive and get served. Left: static batching (must wait for batch). Right: continuous batching (slots fill immediately). Compare GPU utilization below.
python # Continuous batching scheduler (simplified) class ContinuousBatcher: def __init__(self, max_batch=32): self.max_batch = max_batch self.active = [] # currently generating self.waiting = [] # queue of pending requests def step(self): # 1. Evict finished sequences self.active = [r for r in self.active if not r.done] # 2. Fill empty slots from waiting queue while len(self.active) < self.max_batch and self.waiting: req = self.waiting.pop(0) req.prefill() # compute KV cache for input self.active.append(req) # 3. One decode step for all active sequences if self.active: self.decode_batch(self.active) def decode_batch(self, batch): # Single forward pass generates 1 token per sequence for req in batch: token = model.generate_one(req.kv_cache) req.output.append(token) if token == EOS: req.done = True
If the memory wall comes from reading too many bytes, the most direct solution is: make each parameter smaller. A 70B model in FP16 is 140 GB. In INT4, it's 35 GB. That's 4× less memory to read — so 4× more tokens per second from the bandwidth formula.
Quantization maps high-precision values (FP16, 16 bits) to low-precision (INT8 = 8 bits, INT4 = 4 bits). The challenge: do this without destroying model quality.
For a group of weights w1...wg (group size typically 128):
Worked example: Quantize the weights [0.12, -0.83, 0.45, -0.21] to INT4 (range -7 to +7):
The error comes from rounding. Smaller group sizes reduce error (more scales, better approximation) but increase overhead.
GPTQ quantizes weights to INT4 while keeping activations in FP16. It uses a calibration dataset to minimize the output error layer-by-layer via Hessian-based compensation: when you round one weight, you adjust its neighbors to compensate for the error.
AWQ observes that not all weights are equally important. Weights that multiply large activations matter more. AWQ scales important weight channels UP before quantizing (increasing their effective precision), then scales the corresponding activations DOWN to compensate. Net effect: salient weights get more precision "budget."
GGUF format allows different layers and different weight types (attention vs MLP) to use different bit widths. Typically: attention layers get Q5 or Q6 (higher precision), MLP layers get Q4 (lower). The first and last layers often stay in FP16 because they disproportionately affect quality.
See how bit width affects memory, throughput, and quality. The teal bar is throughput gain, red dot is quality loss.
| Format | Bits | Memory (70B) | Tok/s (A100) | Quality Loss |
|---|---|---|---|---|
| FP16 | 16 | 140 GB | ~14 | 0% |
| INT8 | 8 | 70 GB | ~28 | <0.1% |
| GPTQ/AWQ INT4 | 4 | 35 GB | ~55 | 0.5-1% |
| GGUF Q4_K_M | ~4.8 | 42 GB | ~45 | 0.3-0.7% |
| GGUF Q2_K | ~2.6 | 23 GB | ~85 | 3-8% |
python # Simulate symmetric quantization import numpy as np def quantize_symmetric(weights, bits=4, group_size=128): """Quantize weights with per-group symmetric scaling.""" qmax = 2**(bits-1) - 1 # 7 for 4-bit n = weights.shape[0] q_weights = np.zeros_like(weights, dtype=np.int8) scales = [] for i in range(0, n, group_size): group = weights[i:i+group_size] scale = np.max(np.abs(group)) / qmax scales.append(scale) q_weights[i:i+group_size] = np.round(group / scale).astype(np.int8) return q_weights, np.array(scales) def dequantize(q_weights, scales, group_size=128): """Dequantize back to float.""" result = np.zeros(q_weights.shape, dtype=np.float32) for i, s in enumerate(scales): start = i * group_size result[start:start+group_size] = q_weights[start:start+group_size] * s return result # Test: quantize 1024 random weights w = np.random.randn(1024).astype(np.float16) * 0.5 q, scales = quantize_symmetric(w, bits=4) w_hat = dequantize(q, scales) print(f"Max error: {np.max(np.abs(w - w_hat)):.4f}") print(f"Mean error: {np.mean(np.abs(w - w_hat)):.4f}") print(f"Memory: {w.nbytes} -> {q.nbytes + scales.nbytes} bytes")
Here's a counterintuitive idea: what if we could check multiple candidate tokens at once, instead of generating one at a time? The big model must verify, but verification of k tokens costs about the same as generating 1 token (it's one forward pass with k+1 positions).
Speculative decoding uses a small, fast draft model (e.g., 1B parameters) to generate k candidate tokens. Then the big target model (e.g., 70B) verifies all k in a single forward pass. If the first m tokens match what the target model would have generated, we accept them all. The rest are rejected.
Draft model proposes k=5 tokens: ["The", "cat", "sat", "in", "the"]. Target model verifies:
Result: accept 3 tokens + sample "on" from target = 4 tokens produced in 1 target forward pass.
With k=5, acceptance_rate=0.7 (3.5 accepted on avg), cost_ratio=0.07:
tokens_produced = 3.5 + 1 = 4.5
cost = 1 + 5×0.07 = 1.35 target-passes equivalent
Speedup = 4.5 / 1.35 = 3.3×
To maintain the exact target distribution (no quality loss!), we use rejection sampling. For each draft token with draft probability q and target probability p:
If q is high but p is low (draft is confident but wrong), likely rejected. If p ≥ q (target agrees or likes it even more), always accepted. This guarantees the output distribution is exactly the target model's distribution.
Watch the draft-verify-accept cycle. Teal = draft tokens. Green = accepted. Red = rejected. Orange = target correction.
python # Speculative decoding (simplified) def speculative_decode(target, draft, prompt, k=5): tokens = list(prompt) while not done: # 1. Draft k tokens draft_tokens = [] draft_probs = [] for _ in range(k): p = draft.get_probs(tokens + draft_tokens) t = sample(p) draft_tokens.append(t) draft_probs.append(p[t]) # 2. Target verifies all at once target_logits = target.forward(tokens + draft_tokens) # 1 pass! # 3. Accept/reject with rejection sampling accepted = 0 for i in range(k): p_target = softmax(target_logits[len(tokens) + i]) p_t = p_target[draft_tokens[i]] q_t = draft_probs[i] if random() <= min(1, p_t / q_t): accepted += 1 else: break # 4. Accept prefix + sample correction tokens.extend(draft_tokens[:accepted]) correction = sample(adjusted_dist(target_logits, accepted)) tokens.append(correction) return tokens
A 70B model in FP16 = 140 GB. A single A100 has 80 GB. The model literally doesn't fit on one GPU. We must split it across multiple devices. But HOW we split determines communication cost, memory efficiency, and whether we get pipeline bubbles.
Tensor Parallelism splits individual weight matrices across GPUs. For a linear layer W ∈ Rd×d with TP=4, each GPU stores Wi ∈ Rd×(d/4). Each GPU computes a partial result, then an all-reduce operation sums them across GPUs.
Benefits: memory per GPU drops by N (for TP=N). All GPUs active simultaneously. Latency for one token is the same as one GPU (minus communication).
Cost: requires all-reduce after every layer — 2 all-reduces per transformer block (one for attention, one for MLP). This needs high-bandwidth interconnect (NVLink at 900 GB/s, not PCIe at 64 GB/s).
Pipeline Parallelism splits layers across GPUs. GPU 0 gets layers 0-19, GPU 1 gets layers 20-39, etc. Data flows sequentially through the pipeline.
Benefits: minimal communication (send activations once per microbatch, not per layer). Works over PCIe or even network links.
Cost: pipeline bubbles — when GPU 0 finishes its chunk, it sits idle while waiting for downstream GPUs. Microbatching helps fill bubbles, but with batch=1 (interactive serving), PP has up to (N-1)/N idle time.
For Mixture-of-Experts (MoE) models like Mixtral, experts live on different GPUs. Only 2 of 8 experts activate per token, so each GPU only computes its assigned experts. Communication: route tokens to the correct GPU (all-to-all operation).
| Strategy | Splits | Communication | Best For |
|---|---|---|---|
| TP | Weight matrices | All-reduce per layer | Within NVLink node |
| PP | Layer groups | Activations between stages | Across nodes |
| EP | MoE experts | All-to-all token routing | MoE models |
| TP+PP | Both | Both | Very large clusters |
See how TP (left) and PP (right) distribute work across 4 GPUs. Colors represent layers/chunks. Watch for idle time (bubbles) in PP.
python # Memory calculation for parallelism strategies def memory_per_gpu(params_B, tp=1, pp=1, dtype=2): """Memory per GPU in GB.""" total_bytes = params_B * 1e9 * dtype # TP splits all layers, PP splits layer groups per_gpu = total_bytes / (tp * pp) return per_gpu / 1e9 # Llama-70B FP16 on different configs: print(f"1 GPU: {memory_per_gpu(70):.0f} GB") # 140 GB (doesn't fit!) print(f"TP=2: {memory_per_gpu(70, tp=2):.0f} GB") # 70 GB (tight on A100) print(f"TP=4: {memory_per_gpu(70, tp=4):.0f} GB") # 35 GB (comfortable) print(f"TP=8: {memory_per_gpu(70, tp=8):.0f} GB") # 17.5 GB (room for KV cache) # Communication cost (TP all-reduce per layer) def tp_comm_time(hidden_dim, tp, bw_GBps=450): """All-reduce time per layer in microseconds.""" # Ring all-reduce: 2*(N-1)/N * msg_size / bandwidth msg_bytes = hidden_dim * 2 # FP16 time_s = 2 * (tp-1)/tp * msg_bytes / (bw_GBps * 1e9) return time_s * 1e6 # microseconds print(f"TP=4 comm per layer: {tp_comm_time(8192, 4):.1f} us") # ~0.05 us (negligible on NVLink)
Now we combine everything: KV cache management, continuous batching, quantization, speculative decoding, and model parallelism into a single serving pipeline. This is what vLLM, TGI, and TensorRT-LLM do under the hood.
Requests arrive as a Poisson process (random arrivals). Each has an input length and expected output length. The system must balance throughput (total tokens/sec) against latency (how long each user waits). Your job: configure the system to maximize throughput while keeping latency reasonable.
Requests arrive, get batched, processed, completed. Metrics update in real-time. Toggle optimizations to see their impact.
Key observations to try:
Now that we understand the optimizations, let's look at the systems that implement them. Each makes different tradeoffs between ease-of-use, performance, and hardware requirements.
vLLM is the most popular open-source serving engine. Its key innovation: PagedAttention — managing KV cache like virtual memory with page tables. Instead of pre-allocating contiguous memory per sequence, it allocates fixed-size blocks (pages) on demand. Benefits:
TGI by HuggingFace: production-ready with token streaming, built-in load balancing, and HuggingFace model hub integration. Written in Rust for the HTTP layer. Supports Flash Attention, GQA, GPTQ/AWQ quantization, and TP. Slightly lower raw throughput than vLLM but easier deployment.
TensorRT-LLM by NVIDIA: maximum performance through aggressive kernel fusion, INT4/INT8 quantization with custom CUDA kernels, and pipelined execution. Requires NVIDIA GPUs and a compilation step (model → optimized engine). 20-40% faster than vLLM on NVIDIA hardware, but vendor-locked and harder to use.
llama.cpp: runs on CPU, Apple Metal, and CUDA. GGUF format with per-layer mixed quantization. Key use case: local inference on consumer hardware (MacBooks, gaming PCs). Performance: ~30-100 tok/s for 7B models on M2/M3 Macs, ~5-15 tok/s for 70B models with offloading.
At cloud GPU pricing, the economics are:
| Setup | GPU Cost | Throughput | $/1M tokens |
|---|---|---|---|
| 1×A100 80GB, vLLM, Llama-70B INT4 | $2.50/hr | ~800 tok/s | $0.87 |
| 8×A100 80GB, vLLM, Llama-70B FP16 | $20/hr | ~3000 tok/s | $1.85 |
| 8×H100, TRT-LLM, Llama-70B FP8 | $32/hr | ~8000 tok/s | $1.11 |
| M2 Max MacBook, llama.cpp, 70B Q4 | $0 (owned) | ~8 tok/s | $0 (but slow) |
Compare frameworks by throughput and cost efficiency. Bar height = throughput, label = cost per 1M tokens.
python # Quick cost calculator def cost_per_million(gpu_cost_hr, throughput_tok_s): """Cost per 1M output tokens.""" tok_per_hour = throughput_tok_s * 3600 return gpu_cost_hr / tok_per_hour * 1e6 # Compare setups setups = [ ("1xA100 vLLM INT4", 2.50, 800), ("8xA100 vLLM FP16", 20.0, 3000), ("8xH100 TRT-LLM FP8", 32.0, 8000), ] for name, cost, tput in setups: print(f"{name}: ${cost_per_million(cost, tput):.2f}/1M tokens")
You now understand every major optimization in the LLM inference stack. Let's consolidate with a decision framework, then test your mastery with real-world challenges.
| Technique | Reduces | Tradeoff | When to Use |
|---|---|---|---|
| KV Cache | Redundant compute | More memory | Always (it's standard) |
| GQA | KV cache size | Slight quality | Always (train with it) |
| FlashAttention | Memory + IO | None (exact) | Always |
| Continuous Batching | Idle GPU time | Scheduling complexity | Any serving system |
| INT4 Quantization | Weight memory 4× | 0.5-1% quality | When memory-constrained |
| Speculative Decoding | Per-token latency | Extra draft compute | Low-batch, high-latency models |
| Tensor Parallelism | Per-GPU memory | All-reduce overhead | Model > 1 GPU (intra-node) |
| Pipeline Parallelism | Per-GPU memory | Bubbles, latency | Multi-node deployments |
| PagedAttention | Memory fragmentation | Page table overhead | Variable-length serving |
Derive the optimal draft length k for speculative decoding.
Given: acceptance probability per token = α, draft cost ratio = r (time_draft / time_target).
For α=0.8, r=0.05: optimal k ≈ 5-7 (diminishing returns beyond that because αk shrinks fast).
Serve Llama-70B at 100 requests/sec with P99 latency < 1 second.
Constraints: average input 500 tokens, average output 200 tokens.
| If you liked... | Read next |
|---|---|
| Memory bandwidth analysis | Transformer (attention mechanism internals) |
| Quantization theory | VAE & VQ-VAE (vector quantization) |
| Systems optimization | SSM & Mamba (O(1) per-token inference) |
| Parallel algorithms | GPT (training parallelism) |