Systems & Serving

LLM Inference
Optimization

Why generating tokens is memory-bound, and every trick we use to make it fast.

Prerequisites: Basic transformer knowledge + Attention mechanism intuition. That's it.
10
Chapters
10+
Simulations
0
Assumed Knowledge

Chapter 0: The Memory Wall

You have a 70-billion parameter language model. You want it to generate text. Each new token requires multiplying against essentially ALL model weights. How fast can you go?

Let's do the arithmetic. 70B parameters in FP16 = 70 × 109 × 2 bytes = 140 GB of weights. An NVIDIA A100 has 2 TB/s of memory bandwidth. Time to read all weights once = 140 GB ÷ 2000 GB/s = 70 ms. That's ~14 tokens per second — regardless of how much compute the GPU has.

This is the memory wall. During autoregressive decoding, each token generation is a giant matrix-vector multiply (batch size 1). The arithmetic intensity is so low that the GPU's compute cores sit idle, waiting for data to arrive from HBM.

The fundamental bottleneck: LLM inference during token generation is memory-bandwidth-bound, not compute-bound. The GPU can multiply numbers faster than it can read them from memory. Every optimization in this lesson attacks this bottleneck from a different angle.

Let's quantify this. The arithmetic intensity is the ratio of compute operations to memory bytes transferred. For a matrix-vector multiply with an [N, N] weight matrix:

Arithmetic Intensity = (2N² ops) / (2N² bytes) = 1 op/byte

An A100 has ~312 TFLOPS of FP16 compute but only 2 TB/s of bandwidth. To fully utilize compute, you'd need an arithmetic intensity of 312/2 = 156 ops/byte. We have 1. We're using 0.6% of available compute.

The roofline model: If arithmetic intensity < (peak FLOPS / peak bandwidth), you're memory-bound. For A100: threshold = 156 ops/byte. Autoregressive decoding at ~1 op/byte is deep in memory-bound territory. The only way to increase utilization is to do more work per byte read — which means batching multiple requests together.

Here's the timeline for generating a single token:

Read Weights
Load all 140GB from HBM to registers (~70ms)
Compute
Matrix-vector multiply (~0.4ms of actual math)
Write Output
Store next-token logits (~negligible)
↻ repeat for next token

The compute is ~0.4ms but we wait ~70ms for memory reads. The GPU spends 99.4% of its time waiting for data.

Memory Wall Visualizer

Watch tokens generate one at a time. The orange bar is memory read time, teal is compute. Notice how compute is invisible.

Model Size (B params) 70B
Batch Size 1

Key insight from the widget: Increasing batch size doesn't increase memory reads (we read the same weights once for all sequences in the batch). But it increases compute proportionally. At batch size 32, we finally start utilizing the GPU's compute. This is why batching is so critical for serving.

python
# Memory wall calculation
def tokens_per_second(params_B, dtype_bytes=2, bw_TBps=2.0):
    """Max throughput for single-sequence decoding."""
    weight_bytes = params_B * 1e9 * dtype_bytes
    time_per_token = weight_bytes / (bw_TBps * 1e12)  # seconds
    return 1.0 / time_per_token

# Examples:
print(tokens_per_second(7))    # ~143 tok/s (7B model)
print(tokens_per_second(70))   # ~14.3 tok/s (70B model)
print(tokens_per_second(405))  # ~2.5 tok/s (Llama 405B)
An H100 has 3.35 TB/s bandwidth and ~1979 TFLOPS FP16. A 70B FP16 model does a matmul with batch=1. What's the bottleneck?

Chapter 1: KV Cache

We just learned that each token requires reading all model weights. But there's a second memory problem hiding in the attention mechanism. When generating token t, standard attention computes Q, K, V for ALL previous tokens 1 through t-1, then does softmax(QKT)V. That means for every new token, we're recomputing K and V vectors for the entire history.

The KV cache is the fix: store the K and V projections for all previously generated tokens. When generating token t, we only compute Q, K, V for the new token, then concatenate its K and V to the cache. The attention operation becomes: one new query attending to the full cached key/value history.

The tradeoff: KV cache converts redundant compute into memory usage. Instead of recomputing O(t²) attention every step, we store O(t) vectors and do O(t) attention per new token. But that stored cache can be enormous.

Let's compute the KV cache size for Llama-2-70B at sequence length 4096:

KV size = 2 × layers × kv_heads × head_dim × seq_len × dtype_bytes

For Llama-2-70B: layers=80, kv_heads=8 (GQA), head_dim=128, seq_len=4096, FP16 (2 bytes):

= 2 × 80 × 8 × 128 × 4096 × 2 = 1.34 GB per sequence

Wait — that's with GQA (8 KV heads). Original multi-head attention has 64 KV heads:

= 2 × 80 × 64 × 128 × 4096 × 2 = 10.7 GB per sequence (MHA)

10.7 GB for a single sequence! With an 80GB A100, after loading the 140GB model across GPUs, you can barely fit a few concurrent sequences. This is why KV cache optimization (GQA, quantization, PagedAttention) is so critical for serving.

The batch size ceiling: Maximum concurrent sequences = (GPU memory − model weights) / KV cache per sequence. With MHA at long context, this can be as low as 2-4 sequences. KV cache IS the limiting factor for batch size in production serving.
python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype=2):
    """KV cache memory for ONE sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype

# Llama-2-70B with GQA (8 kv heads)
size = kv_cache_bytes(80, 8, 128, 4096)
print(f"GQA: {size / 1e9:.2f} GB")  # 1.34 GB

# Same model with MHA (64 kv heads)
size = kv_cache_bytes(80, 64, 128, 4096)
print(f"MHA: {size / 1e9:.2f} GB")  # 10.74 GB

# At 128K context (Llama-3.1-70B)
size = kv_cache_bytes(80, 8, 128, 131072)
print(f"128K ctx: {size / 1e9:.2f} GB")  # 42.9 GB per sequence!
KV Cache Growth

Adjust sequence length and see how KV cache memory grows. The orange region is KV cache, teal is model weights. The dashed line is total GPU memory.

Seq Length 4096
Concurrent Sequences 4

In practice, the KV cache is managed like virtual memory. vLLM's PagedAttention allocates cache in fixed-size "pages" (blocks) rather than contiguous memory. This eliminates fragmentation — sequences of different lengths don't waste memory on padding.

ModelKV HeadsLayersCache/seq @4KCache/seq @128K
Llama-3-8B8 (GQA)320.5 GB16 GB
Llama-3-70B8 (GQA)801.3 GB43 GB
GPT-3 175B (MHA)969618 GB590 GB
You have an 80GB GPU with a 14GB model loaded. KV cache per sequence at your context length is 5GB. What's the maximum batch size?

Chapter 2: Attention Optimizations

We just saw that KV cache memory is the ceiling on batch size. Three innovations attack this from different angles: reducing what we store (MQA/GQA), reducing how we compute (FlashAttention), and reducing precision (KV cache quantization).

Multi-Query Attention (MQA)

Standard Multi-Head Attention (MHA) has separate K and V projections per head. With 64 heads, that's 64 K matrices and 64 V matrices in the cache. Multi-Query Attention shares a single K and single V across ALL query heads. Each head still has its own Q projection, so they "ask different questions" but look at the same keys and values.

MHA cache: 2 × n_heads × head_dim × seq_len
MQA cache: 2 × 1 × head_dim × seq_len

For a 64-head model, MQA reduces KV cache by 64×. The cost? Slightly lower quality — about 0.5-1% degradation on benchmarks.

Grouped-Query Attention (GQA)

GQA is the compromise. Instead of 1 KV head (MQA) or N KV heads (MHA), use G groups where G divides N. Each group of N/G query heads shares one KV head. Llama-2-70B uses G=8 groups for its 64 query heads.

GQA cache: 2 × n_groups × head_dim × seq_len

With 8 groups instead of 64 heads: 8× reduction in KV cache with negligible quality loss.

Why GQA won: MQA was too aggressive — noticeable quality drops for large models. GQA hits the sweet spot: 8× memory savings with <0.1% quality difference from MHA. Every modern large model (Llama-3, Mistral, Gemma) uses GQA.

FlashAttention

Standard attention materializes the full T × T attention matrix (softmax(QKT)) before multiplying by V. For T=128K, that matrix is 128K × 128K × 2 bytes = 32 GB. FlashAttention never materializes it.

The trick: tile the computation into blocks that fit in SRAM (on-chip memory, ~20MB on A100). For each tile, compute partial softmax(QKT)V, accumulate online using the log-sum-exp trick. The result is exact — not an approximation.

Standard Attention
Compute S = QKT (T×T matrix in HBM) → softmax → multiply by V
vs
FlashAttention
Load Q,K,V tiles into SRAM → compute partial softmax → accumulate → never store T×T
FlashAttention's real win: It doesn't reduce FLOPs — it reduces HBM reads/writes. Standard attention does O(T²) HBM accesses. FlashAttention does O(T²/M) where M is SRAM size. 2-4× faster for long sequences, with O(T) memory instead of O(T²).
KV Cache Memory: MHA vs GQA vs MQA

Compare KV cache sizes across attention variants. Model: 80 layers, 64 query heads, head_dim=128.

Sequence Length 4096
python
# FlashAttention pseudocode (simplified)
def flash_attention(Q, K, V, block_size=256):
    """Tiled attention that never materializes T x T matrix."""
    T, d = Q.shape
    O = zeros_like(Q)        # output accumulator
    L = zeros(T)             # log-sum-exp accumulator

    for j in range(0, T, block_size):  # tile over K,V
        Kj = K[j:j+block_size]
        Vj = V[j:j+block_size]
        for i in range(0, T, block_size):  # tile over Q
            Qi = Q[i:i+block_size]
            # Compute block attention in SRAM
            S_ij = Qi @ Kj.T / sqrt(d)  # block_size x block_size (fits in SRAM!)
            # Online softmax update
            m_new = max(L[i:i+block_size], S_ij.max(axis=-1))
            P_ij = exp(S_ij - m_new[:, None])
            # Accumulate output with rescaling
            O[i:i+block_size] = O[i:i+block_size] * exp(L[i:i+block_size] - m_new)[:, None]
            O[i:i+block_size] += P_ij @ Vj
            L[i:i+block_size] = m_new + log(P_ij.sum(axis=-1))
    return O
MethodKV Cache ReductionQuality ImpactUsed By
MHA (baseline)GPT-3, OPT
GQA (8 groups)<0.1% lossLlama-2/3, Mistral
MQA (1 group)64×0.5-1% lossPaLM, Falcon
FlashAttentionMemory: T² → TExact (0 loss)Nearly universal
A model has 32 query heads and uses GQA with 4 KV groups. How much smaller is its KV cache vs standard MHA?

Chapter 3: Continuous Batching

We established that batching helps — reading weights once for N sequences gets us closer to compute-bound. But how do we batch in practice? Requests arrive at different times and have different output lengths. A user asking "What's 2+2?" finishes in 5 tokens. A user asking for an essay generates 500 tokens.

Static Batching (Naive Approach)

Collect requests into a batch. Run the entire batch until the LONGEST sequence finishes. Short sequences pad with <eos> tokens, wasting GPU cycles. New requests must wait until the entire batch completes.

The problem: if one request generates 500 tokens and the rest finish in 50, you're wasting GPU for 450 tokens worth of computation on padding. And incoming requests pile up waiting.

Continuous (In-Flight) Batching

Continuous batching operates at the token granularity: after EACH token generation step, check if any sequence has finished. If so, evict it immediately and slot in a waiting request. The batch composition changes every single step.

The insight: Don't think of a "batch" as a fixed set of requests. Think of it as a set of slots. Each slot holds one active sequence. When a slot frees up, immediately fill it. GPU utilization jumps from ~30% (static) to ~90% (continuous).
Static Batching
Batch starts → ALL must finish → new batch starts. Short sequences waste GPU.
vs
Continuous Batching
Each step: evict finished → admit new → generate next token. Slots always full.

The efficiency gain depends on output length variance. If all requests have similar length, static and continuous perform similarly. But in real workloads, output lengths vary by 10-100×, making continuous batching essential.

Prefill vs Decode Phases

There's a subtlety: the first forward pass for a new request (the prefill) processes all input tokens in parallel — it's compute-bound. Subsequent steps (the decode) generate one token at a time — memory-bound. Good serving systems separate these:

PhaseTokens ProcessedBottleneckDuration
PrefillAll input tokens (parallel)Compute~100ms for 1K tokens
Decode1 token per stepMemory BW~70ms per token (70B)

Time-to-first-token (TTFT) is dominated by prefill time. Time-per-output-token (TPOT) is dominated by decode time. Users care about both — TTFT for responsiveness, TPOT for streaming speed.

Static vs Continuous Batching

Watch requests arrive and get served. Left: static batching (must wait for batch). Right: continuous batching (slots fill immediately). Compare GPU utilization below.

python
# Continuous batching scheduler (simplified)
class ContinuousBatcher:
    def __init__(self, max_batch=32):
        self.max_batch = max_batch
        self.active = []       # currently generating
        self.waiting = []      # queue of pending requests

    def step(self):
        # 1. Evict finished sequences
        self.active = [r for r in self.active if not r.done]

        # 2. Fill empty slots from waiting queue
        while len(self.active) < self.max_batch and self.waiting:
            req = self.waiting.pop(0)
            req.prefill()  # compute KV cache for input
            self.active.append(req)

        # 3. One decode step for all active sequences
        if self.active:
            self.decode_batch(self.active)

    def decode_batch(self, batch):
        # Single forward pass generates 1 token per sequence
        for req in batch:
            token = model.generate_one(req.kv_cache)
            req.output.append(token)
            if token == EOS:
                req.done = True
You have a batch of 8 requests. 7 finish after 20 tokens, 1 needs 500 tokens. With static batching, how many GPU steps are "wasted" on padding?

Chapter 4: Quantization for LLMs

If the memory wall comes from reading too many bytes, the most direct solution is: make each parameter smaller. A 70B model in FP16 is 140 GB. In INT4, it's 35 GB. That's 4× less memory to read — so 4× more tokens per second from the bandwidth formula.

Quantization maps high-precision values (FP16, 16 bits) to low-precision (INT8 = 8 bits, INT4 = 4 bits). The challenge: do this without destroying model quality.

How Quantization Works

For a group of weights w1...wg (group size typically 128):

scale = max(|w|) / (2bits-1 - 1)
w_quant = round(w / scale)
w_dequant = w_quant × scale

Worked example: Quantize the weights [0.12, -0.83, 0.45, -0.21] to INT4 (range -7 to +7):

1. Find scale
max(|w|) = 0.83, scale = 0.83/7 = 0.1186
2. Quantize
round([0.12, -0.83, 0.45, -0.21] / 0.1186) = [1, -7, 4, -2]
3. Dequantize
[1, -7, 4, -2] × 0.1186 = [0.119, -0.830, 0.474, -0.237]
4. Error
|error| = [0.001, 0.000, 0.024, 0.027] — max error = 0.027

The error comes from rounding. Smaller group sizes reduce error (more scales, better approximation) but increase overhead.

GPTQ (Weight-Only, Post-Training)

GPTQ quantizes weights to INT4 while keeping activations in FP16. It uses a calibration dataset to minimize the output error layer-by-layer via Hessian-based compensation: when you round one weight, you adjust its neighbors to compensate for the error.

AWQ (Activation-Aware)

AWQ observes that not all weights are equally important. Weights that multiply large activations matter more. AWQ scales important weight channels UP before quantizing (increasing their effective precision), then scales the corresponding activations DOWN to compensate. Net effect: salient weights get more precision "budget."

GGUF (llama.cpp Mixed Precision)

GGUF format allows different layers and different weight types (attention vs MLP) to use different bit widths. Typically: attention layers get Q5 or Q6 (higher precision), MLP layers get Q4 (lower). The first and last layers often stay in FP16 because they disproportionately affect quality.

The quality/speed tradeoff: INT8 is nearly lossless for all models. INT4 loses 0.5-2% on benchmarks for most models >13B. INT3 and below cause significant degradation. For models <7B, even INT4 hurts noticeably — smaller models need higher precision because they have fewer parameters to be redundant with.
Quantization Explorer

See how bit width affects memory, throughput, and quality. The teal bar is throughput gain, red dot is quality loss.

Model Size (B) 70B
Bit Width 4-bit
FormatBitsMemory (70B)Tok/s (A100)Quality Loss
FP1616140 GB~140%
INT8870 GB~28<0.1%
GPTQ/AWQ INT4435 GB~550.5-1%
GGUF Q4_K_M~4.842 GB~450.3-0.7%
GGUF Q2_K~2.623 GB~853-8%
python
# Simulate symmetric quantization
import numpy as np

def quantize_symmetric(weights, bits=4, group_size=128):
    """Quantize weights with per-group symmetric scaling."""
    qmax = 2**(bits-1) - 1  # 7 for 4-bit
    n = weights.shape[0]
    q_weights = np.zeros_like(weights, dtype=np.int8)
    scales = []

    for i in range(0, n, group_size):
        group = weights[i:i+group_size]
        scale = np.max(np.abs(group)) / qmax
        scales.append(scale)
        q_weights[i:i+group_size] = np.round(group / scale).astype(np.int8)

    return q_weights, np.array(scales)

def dequantize(q_weights, scales, group_size=128):
    """Dequantize back to float."""
    result = np.zeros(q_weights.shape, dtype=np.float32)
    for i, s in enumerate(scales):
        start = i * group_size
        result[start:start+group_size] = q_weights[start:start+group_size] * s
    return result

# Test: quantize 1024 random weights
w = np.random.randn(1024).astype(np.float16) * 0.5
q, scales = quantize_symmetric(w, bits=4)
w_hat = dequantize(q, scales)
print(f"Max error: {np.max(np.abs(w - w_hat)):.4f}")
print(f"Mean error: {np.mean(np.abs(w - w_hat)):.4f}")
print(f"Memory: {w.nbytes} -> {q.nbytes + scales.nbytes} bytes")
You quantize a 70B model from FP16 to INT4. What's the new memory footprint and approximate speedup for memory-bound generation?

Chapter 5: Speculative Decoding

Here's a counterintuitive idea: what if we could check multiple candidate tokens at once, instead of generating one at a time? The big model must verify, but verification of k tokens costs about the same as generating 1 token (it's one forward pass with k+1 positions).

Speculative decoding uses a small, fast draft model (e.g., 1B parameters) to generate k candidate tokens. Then the big target model (e.g., 70B) verifies all k in a single forward pass. If the first m tokens match what the target model would have generated, we accept them all. The rest are rejected.

Why this works: The draft model is 10-100× faster per token (smaller, less memory to read). If it gets 3 out of 5 tokens right, we get 4 "accepted" tokens (3 speculated + 1 from the target's correction) for the cost of 1 target forward pass + 5 cheap draft passes. Net speedup ≈ 2-3×.

The Algorithm

1. Draft
Small model generates k candidate tokens: t1, t2, ..., tk
2. Verify
Big model forward pass on [context, t1, ..., tk] — gets logits at each position
3. Accept/Reject
Compare target logits to draft choices. Accept longest matching prefix.
4. Correct
Sample one token from target's distribution at the first rejected position.
↻ repeat from step 1

Worked Example

Draft model proposes k=5 tokens: ["The", "cat", "sat", "in", "the"]. Target model verifies:

Result: accept 3 tokens + sample "on" from target = 4 tokens produced in 1 target forward pass.

Speedup = tokens_produced / (1 + k × cost_ratio)
where cost_ratio = time_draft / time_target ≈ 0.05-0.1

With k=5, acceptance_rate=0.7 (3.5 accepted on avg), cost_ratio=0.07:
tokens_produced = 3.5 + 1 = 4.5
cost = 1 + 5×0.07 = 1.35 target-passes equivalent
Speedup = 4.5 / 1.35 = 3.3×

When speculative decoding helps most: (1) When draft model has high acceptance rate (easy, predictable text). (2) When target model is very large (high per-token cost). (3) When batch size is 1 (memory-bound, so draft's extra compute is "free"). At large batch sizes, the target is already compute-utilized, and speculation adds overhead without benefit.

Acceptance Criterion (Rejection Sampling)

To maintain the exact target distribution (no quality loss!), we use rejection sampling. For each draft token with draft probability q and target probability p:

Accept if: uniform_random ≤ min(1, p/q)

If q is high but p is low (draft is confident but wrong), likely rejected. If p ≥ q (target agrees or likes it even more), always accepted. This guarantees the output distribution is exactly the target model's distribution.

Speculative Decoding Simulator

Watch the draft-verify-accept cycle. Teal = draft tokens. Green = accepted. Red = rejected. Orange = target correction.

Draft Length (k) 5
Acceptance Rate 0.70
python
# Speculative decoding (simplified)
def speculative_decode(target, draft, prompt, k=5):
    tokens = list(prompt)
    while not done:
        # 1. Draft k tokens
        draft_tokens = []
        draft_probs = []
        for _ in range(k):
            p = draft.get_probs(tokens + draft_tokens)
            t = sample(p)
            draft_tokens.append(t)
            draft_probs.append(p[t])

        # 2. Target verifies all at once
        target_logits = target.forward(tokens + draft_tokens)  # 1 pass!

        # 3. Accept/reject with rejection sampling
        accepted = 0
        for i in range(k):
            p_target = softmax(target_logits[len(tokens) + i])
            p_t = p_target[draft_tokens[i]]
            q_t = draft_probs[i]
            if random() <= min(1, p_t / q_t):
                accepted += 1
            else:
                break

        # 4. Accept prefix + sample correction
        tokens.extend(draft_tokens[:accepted])
        correction = sample(adjusted_dist(target_logits, accepted))
        tokens.append(correction)

    return tokens
Your draft model proposes k=8 tokens. The target accepts the first 5 and rejects position 6. How many new tokens are added to the output?

Chapter 6: Model Parallelism

A 70B model in FP16 = 140 GB. A single A100 has 80 GB. The model literally doesn't fit on one GPU. We must split it across multiple devices. But HOW we split determines communication cost, memory efficiency, and whether we get pipeline bubbles.

Tensor Parallelism (TP)

Tensor Parallelism splits individual weight matrices across GPUs. For a linear layer W ∈ Rd×d with TP=4, each GPU stores Wi ∈ Rd×(d/4). Each GPU computes a partial result, then an all-reduce operation sums them across GPUs.

y = xW = x[W1 | W2 | W3 | W4] (column-parallel)
y = [y1 | y2 | y3 | y4] → all-reduce → final y

Benefits: memory per GPU drops by N (for TP=N). All GPUs active simultaneously. Latency for one token is the same as one GPU (minus communication).

Cost: requires all-reduce after every layer — 2 all-reduces per transformer block (one for attention, one for MLP). This needs high-bandwidth interconnect (NVLink at 900 GB/s, not PCIe at 64 GB/s).

Pipeline Parallelism (PP)

Pipeline Parallelism splits layers across GPUs. GPU 0 gets layers 0-19, GPU 1 gets layers 20-39, etc. Data flows sequentially through the pipeline.

GPU 0: layers 0-19 → send activations → GPU 1: layers 20-39 → ... → GPU 3: layers 60-79

Benefits: minimal communication (send activations once per microbatch, not per layer). Works over PCIe or even network links.

Cost: pipeline bubbles — when GPU 0 finishes its chunk, it sits idle while waiting for downstream GPUs. Microbatching helps fill bubbles, but with batch=1 (interactive serving), PP has up to (N-1)/N idle time.

TP vs PP decision rule: Use TP within a node (GPUs connected by NVLink). Use PP across nodes (connected by slower network). For 8-GPU serving of a 70B model: TP=8 within one node is fastest (one all-reduce per layer is fine over NVLink). For 32 GPUs: TP=8 within node, PP=4 across nodes.

Expert Parallelism (EP)

For Mixture-of-Experts (MoE) models like Mixtral, experts live on different GPUs. Only 2 of 8 experts activate per token, so each GPU only computes its assigned experts. Communication: route tokens to the correct GPU (all-to-all operation).

StrategySplitsCommunicationBest For
TPWeight matricesAll-reduce per layerWithin NVLink node
PPLayer groupsActivations between stagesAcross nodes
EPMoE expertsAll-to-all token routingMoE models
TP+PPBothBothVery large clusters
Parallelism Strategies

See how TP (left) and PP (right) distribute work across 4 GPUs. Colors represent layers/chunks. Watch for idle time (bubbles) in PP.

Strategy TP only
python
# Memory calculation for parallelism strategies
def memory_per_gpu(params_B, tp=1, pp=1, dtype=2):
    """Memory per GPU in GB."""
    total_bytes = params_B * 1e9 * dtype
    # TP splits all layers, PP splits layer groups
    per_gpu = total_bytes / (tp * pp)
    return per_gpu / 1e9

# Llama-70B FP16 on different configs:
print(f"1 GPU:  {memory_per_gpu(70):.0f} GB")       # 140 GB (doesn't fit!)
print(f"TP=2:   {memory_per_gpu(70, tp=2):.0f} GB")  # 70 GB (tight on A100)
print(f"TP=4:   {memory_per_gpu(70, tp=4):.0f} GB")  # 35 GB (comfortable)
print(f"TP=8:   {memory_per_gpu(70, tp=8):.0f} GB")  # 17.5 GB (room for KV cache)

# Communication cost (TP all-reduce per layer)
def tp_comm_time(hidden_dim, tp, bw_GBps=450):
    """All-reduce time per layer in microseconds."""
    # Ring all-reduce: 2*(N-1)/N * msg_size / bandwidth
    msg_bytes = hidden_dim * 2  # FP16
    time_s = 2 * (tp-1)/tp * msg_bytes / (bw_GBps * 1e9)
    return time_s * 1e6  # microseconds

print(f"TP=4 comm per layer: {tp_comm_time(8192, 4):.1f} us")  # ~0.05 us (negligible on NVLink)
You're serving Llama-70B on a single 8×A100 node with NVLink. Which parallelism strategy minimizes latency for a single request?

Chapter 7: Serving System Simulator

Now we combine everything: KV cache management, continuous batching, quantization, speculative decoding, and model parallelism into a single serving pipeline. This is what vLLM, TGI, and TensorRT-LLM do under the hood.

Requests arrive as a Poisson process (random arrivals). Each has an input length and expected output length. The system must balance throughput (total tokens/sec) against latency (how long each user waits). Your job: configure the system to maximize throughput while keeping latency reasonable.

The serving simulator below models a vLLM-style system. Toggle optimizations on/off and observe their impact on throughput, latency, and utilization. Try to serve Llama-70B at high throughput without blowing latency SLAs.
LLM Serving Simulator

Requests arrive, get batched, processed, completed. Metrics update in real-time. Toggle optimizations to see their impact.

GPUs 4
Request Rate (/s) 10
Max Batch 32
Quantization INT8

Key observations to try:

The fundamental tradeoff: Higher batch sizes increase throughput but also increase latency (more sequences share the same bandwidth). Serving systems tune batch size dynamically based on SLA targets. If P99 latency is too high, reduce batch size even if it lowers throughput.

Chapter 8: Deployment Systems

Now that we understand the optimizations, let's look at the systems that implement them. Each makes different tradeoffs between ease-of-use, performance, and hardware requirements.

vLLM

vLLM is the most popular open-source serving engine. Its key innovation: PagedAttention — managing KV cache like virtual memory with page tables. Instead of pre-allocating contiguous memory per sequence, it allocates fixed-size blocks (pages) on demand. Benefits:

TGI (Text Generation Inference)

TGI by HuggingFace: production-ready with token streaming, built-in load balancing, and HuggingFace model hub integration. Written in Rust for the HTTP layer. Supports Flash Attention, GQA, GPTQ/AWQ quantization, and TP. Slightly lower raw throughput than vLLM but easier deployment.

TensorRT-LLM

TensorRT-LLM by NVIDIA: maximum performance through aggressive kernel fusion, INT4/INT8 quantization with custom CUDA kernels, and pipelined execution. Requires NVIDIA GPUs and a compilation step (model → optimized engine). 20-40% faster than vLLM on NVIDIA hardware, but vendor-locked and harder to use.

llama.cpp

llama.cpp: runs on CPU, Apple Metal, and CUDA. GGUF format with per-layer mixed quantization. Key use case: local inference on consumer hardware (MacBooks, gaming PCs). Performance: ~30-100 tok/s for 7B models on M2/M3 Macs, ~5-15 tok/s for 70B models with offloading.

Cost Analysis

At cloud GPU pricing, the economics are:

Cost per 1M tokens = (GPU cost per hour) / (throughput in tok/hour) × 106
SetupGPU CostThroughput$/1M tokens
1×A100 80GB, vLLM, Llama-70B INT4$2.50/hr~800 tok/s$0.87
8×A100 80GB, vLLM, Llama-70B FP16$20/hr~3000 tok/s$1.85
8×H100, TRT-LLM, Llama-70B FP8$32/hr~8000 tok/s$1.11
M2 Max MacBook, llama.cpp, 70B Q4$0 (owned)~8 tok/s$0 (but slow)
When to use what:
vLLM: default choice for most serving. Best batch throughput, great community.
TGI: when you want HuggingFace ecosystem integration and simple deployment.
TensorRT-LLM: when you need absolute maximum throughput on NVIDIA GPUs and can afford the setup cost.
llama.cpp: local/edge inference, privacy-sensitive use cases, prototyping.
Cost & Throughput Comparison

Compare frameworks by throughput and cost efficiency. Bar height = throughput, label = cost per 1M tokens.

Model Size 70B
python
# Quick cost calculator
def cost_per_million(gpu_cost_hr, throughput_tok_s):
    """Cost per 1M output tokens."""
    tok_per_hour = throughput_tok_s * 3600
    return gpu_cost_hr / tok_per_hour * 1e6

# Compare setups
setups = [
    ("1xA100 vLLM INT4", 2.50, 800),
    ("8xA100 vLLM FP16", 20.0, 3000),
    ("8xH100 TRT-LLM FP8", 32.0, 8000),
]
for name, cost, tput in setups:
    print(f"{name}: ${cost_per_million(cost, tput):.2f}/1M tokens")
You need to serve Llama-70B with <100ms time-to-first-token, handling 50 req/s. Which is the best starting point?

Chapter 9: Mastery & Connections

You now understand every major optimization in the LLM inference stack. Let's consolidate with a decision framework, then test your mastery with real-world challenges.

Inference Optimization Cheat Sheet

TechniqueReducesTradeoffWhen to Use
KV CacheRedundant computeMore memoryAlways (it's standard)
GQAKV cache sizeSlight qualityAlways (train with it)
FlashAttentionMemory + IONone (exact)Always
Continuous BatchingIdle GPU timeScheduling complexityAny serving system
INT4 QuantizationWeight memory 4×0.5-1% qualityWhen memory-constrained
Speculative DecodingPer-token latencyExtra draft computeLow-batch, high-latency models
Tensor ParallelismPer-GPU memoryAll-reduce overheadModel > 1 GPU (intra-node)
Pipeline ParallelismPer-GPU memoryBubbles, latencyMulti-node deployments
PagedAttentionMemory fragmentationPage table overheadVariable-length serving

Decision Tree

Model fits on 1 GPU?
Yes → use vLLM + quantization. No → next step.
Multiple GPUs, same node?
Use Tensor Parallelism (TP = num_GPUs)
Multiple nodes needed?
TP within node + PP across nodes
Latency SLA tight?
Add speculative decoding + FP8 quantization
Throughput ceiling hit?
Scale horizontally: multiple replicas behind load balancer

Derivation Challenge

Derive the optimal draft length k for speculative decoding.

Given: acceptance probability per token = α, draft cost ratio = r (time_draft / time_target).

Expected accepted tokens = α + α² + ... + αk = α(1-αk)/(1-α)
Total tokens per cycle = expected_accepted + 1 (correction)
Cost per cycle = 1 (target verify) + k×r (draft generation)
Throughput = tokens_per_cycle / cost_per_cycle
Maximize over k: d(throughput)/dk = 0

For α=0.8, r=0.05: optimal k ≈ 5-7 (diminishing returns beyond that because αk shrinks fast).

Design Challenge

Serve Llama-70B at 100 requests/sec with P99 latency < 1 second.

Constraints: average input 500 tokens, average output 200 tokens.

Solution sketch:
• Hardware: 4 nodes × 8×H100 (TP=8 per node, 4 replicas)
• Quantization: FP8 (fits comfortably, minimal quality loss on H100)
• Batching: continuous, max_batch=64 per replica → 25 req/s per replica
• TTFT budget: prefill 500 tokens @ FP8 on 8×H100 ≈ 30ms ✓
• TPOT: 200 tokens @ ~15ms/tok = 3s... too slow!
• Fix: with batch=64, throughput per token-step = 64 tokens in ~20ms → effective 0.3ms/tok/req
• Total output time: 200 × 0.3ms = 60ms per request at steady state ✓
• P99 includes queuing: at 25 req/s with batch=64, queue time < 100ms ✓
• Total P99 ≈ 30ms (prefill) + 60ms (decode) + 100ms (queue) = 190ms < 1s ✓

Connections

If you liked...Read next
Memory bandwidth analysisTransformer (attention mechanism internals)
Quantization theoryVAE & VQ-VAE (vector quantization)
Systems optimizationSSM & Mamba (O(1) per-token inference)
Parallel algorithmsGPT (training parallelism)
"The purpose of computing is insight, not numbers." — Richard Hamming. Every optimization here exists because someone deeply understood WHERE the bottleneck was, not just THAT things were slow. Profile first. Optimize the binding constraint. Repeat.
You're serving a 7B model on a single A100. Latency is fine but you want 3× more throughput. What's the most impactful single change?