Systems & Serving Workbook — Engineermaxxing

Chapter 0: Tensor Parallelism

You have a single weight matrix that's too large for one GPU's memory. Or you need the forward pass to run faster than one GPU can manage. The solution: tensor parallelism (TP) — slice each weight matrix across multiple GPUs so they compute in parallel.

The core idea is simple. Take a weight matrix W of shape [d, d]. With TP=4, you split it column-wise into 4 shards: each GPU holds W_i of shape [d, d/4]. The input x is broadcast to all GPUs. Each GPU computes y_i = x · W_i. Then you need an allreduce to combine partial results.

Column-parallel split (used for W_Q, W_K, W_V, FFN W₁):
W ∈ ℝ^{d × d} → W_i ∈ ℝ^{d × d/TP} on GPU i
Each GPU computes: y_i = x · W_i    // shape [B, d/TP]

Row-parallel split (used for W_O, FFN W₂):
W ∈ ℝ^{d × d} → W_i ∈ ℝ^{d/TP × d} on GPU i
Each GPU computes: y_i = x_i · W_i    // shape [B, d]
Then: y = ∑ y_i    // allreduce (sum across GPUs)

The TP communication cost. Each transformer layer needs exactly 2 allreduce operations in the forward pass: one after attention's W_O, one after FFN's W₂. An allreduce on a vector of size d transfers 2d bytes (with ring allreduce). So the per-layer communication is 4d × bytes_per_element × batch_size. This is why TP is only practical within a single node over NVLink (900 GB/s), not across nodes over InfiniBand (~400 GB/s).

Exercise 0.1: Per-GPU Weight Size Derive

A transformer layer has d=8192 and uses standard 4d² attention + 8d² FFN = 12d² parameters. With TP=8, each GPU holds 1/8 of every weight matrix. How many parameters does each GPU store per layer?

million params per GPU per layer

Show derivation

Total params per layer = 12 × 8192² = 12 × 67,108,864 = 805,306,368

Per GPU = 805,306,368 / 8 = 100,663,296 ≈ 100.7M

Each GPU stores 1/TP of every weight matrix. Since all weight matrices in a layer are [d, d] or [d, 4d] (and their transposes), every one divides evenly by TP. The 12d²/TP rule is exact.

Exercise 0.2: Allreduce Data Volume Derive

For one transformer layer with d=8192, batch size B=32, sequence length S=2048, using FP16 (2 bytes per element): how much data does each GPU send during the 2 allreduce operations in the forward pass?

In a ring allreduce on TP GPUs, each GPU sends (TP-1)/TP × message_size bytes. For simplicity, approximate this as ~1× the message size (accurate when TP is large). Each allreduce sends a tensor of shape [B, S, d].

GB per GPU per layer (forward only)

Show derivation

One allreduce message = B × S × d × 2 bytes = 32 × 2048 × 8192 × 2 = 1,073,741,824 bytes ≈ 1.0 GB

Two allreduces per layer = 2 × 1.0 GB = 2.0 GB

At 2 GB per layer, an 80-layer model sends 160 GB through the interconnect per forward pass. On NVLink at 900 GB/s, that's ~178 ms of pure communication time. On InfiniBand at 400 GB/s, it's 400 ms — which is why TP is restricted to intra-node.

Exercise 0.3: When Does TP Help? Trace

You have 8 GPUs connected via NVLink within one node. You're serving a 13B model that fits on a single GPU with 80 GB HBM. Should you use TP=8?

Yes — more GPUs always means faster inference It depends — TP reduces per-GPU compute but adds communication overhead. For latency-sensitive serving, TP helps. For throughput, data parallelism may be better. No — if the model fits on one GPU, parallelism never helps

Show reasoning

TP=8 cuts each GPU's compute by 8× but adds 2 allreduce per layer. For latency (time-to-first-token for one request), TP helps because each GPU does less math. For throughput (requests per second), 8 independent copies (DP=8) would each serve a request simultaneously — 8× throughput with zero communication overhead.

The rule: use TP to reduce latency when the model is too slow on one GPU, use DP to increase throughput when the model already fits.

Exercise 0.4: TP Forward Pass Pipeline Design

Arrange the steps of a tensor-parallel forward pass for one transformer layer (attention + FFN).

→

Column-split Q,K,V projections Local attention (per GPU) Row-split W_O + allreduce Column-split FFN W₁ + activation Row-split FFN W₂ + allreduce Add residual + LayerNorm

Show correct order

1. Column-split Q,K,V projections (each GPU computes its head subset).
2. Local attention — each GPU runs full attention on its heads (no communication needed since heads are independent).
3. Row-split W_O + allreduce — combine head outputs back to full hidden dimension.
4. Column-split FFN W₁ + activation — each GPU computes part of the expanded hidden state.
5. Row-split FFN W₂ + allreduce — combine back to full d.
6. Add residual + LayerNorm — on the now-complete hidden state.

Exercise 0.5: 70B Model Weight Memory with TP Derive

A 70B parameter model stored in FP16 (2 bytes per parameter). With TP=8 across 8 GPUs, how much weight memory does each GPU need?

GB per GPU

Show derivation

Total weight memory = 70 × 10⁹ × 2 bytes = 140 GB

Per GPU with TP=8: 140 / 8 = 17.5 GB

17.5 GB of weights leaves plenty of room on an 80 GB H100 for KV cache, activations, and CUDA overhead. This is exactly why TP=8 is the standard for serving 70B models — one 8-GPU node handles it comfortably.

Chapter 1: Pipeline Parallelism

Tensor parallelism splits each layer across GPUs. Pipeline parallelism (PP) does something different: it assigns entire layers to different GPUs. GPU 0 gets layers 0-7, GPU 1 gets layers 8-15, and so on. The forward pass flows from GPU to GPU like an assembly line.

The problem? The assembly line has a bubble. When GPU 0 is computing its layers for microbatch 1, GPUs 1-3 are idle — they're waiting for GPU 0's output. This warmup phase wastes compute. The bubble fraction tells you how much time is lost.

Pipeline bubble fraction:
bubble = (PP - 1) / (PP - 1 + M)

Where PP = number of pipeline stages, M = number of microbatches.
With M = 1: bubble = (PP - 1) / PP // worst case
With M » PP: bubble → 0 // more microbatches hide the bubble

Microbatches are the cure. Instead of sending one big batch through the pipeline, split it into M microbatches. GPU 0 processes microbatch 1, then immediately starts microbatch 2 while GPU 1 handles microbatch 1. With enough microbatches, every GPU stays busy and the bubble shrinks to nearly zero.

Exercise 1.1: Bubble Fraction, PP=4 Derive

With PP=4 and M=1 (single microbatch), what fraction of total GPU time is wasted in the pipeline bubble?

fraction (0 to 1)

Show derivation

bubble = (PP - 1) / PP = (4 - 1) / 4 = 3/4 = 0.75

75% of total GPU time is wasted. Each GPU is idle for 3 out of 4 time steps during warmup. This is catastrophic — you're paying for 4 GPUs but getting the throughput of 1.

Exercise 1.2: Microbatches to the Rescue Derive

Same PP=4 setup. How many microbatches M do you need to reduce the bubble fraction below 10%?

Solve: (PP - 1) / (PP - 1 + M) < 0.10

minimum M

Show derivation

3 / (3 + M) < 0.10

3 < 0.10 × (3 + M) = 0.3 + 0.1M

2.7 < 0.1M

M > 27

You need at least 27 microbatches. With M=27, bubble = 3/30 = 10%. With M=32 (a more practical power-of-two choice), bubble = 3/35 = 8.6%. This is why training configs often use large microbatch counts.

Exercise 1.3: Layers per Stage Derive

A 32-layer model with PP=4 and PP=8. How many layers per GPU in each case? Which PP degree would you expect to be more memory-efficient per GPU?

layers per GPU when PP=8

Show derivation

PP=4: 32 / 4 = 8 layers per GPU

PP=8: 32 / 8 = 4 layers per GPU

PP=8 uses less memory per GPU (4 layers vs 8) but has a larger bubble (7/(7+M) vs 3/(3+M)). There's always a tradeoff between memory savings and pipeline efficiency.

Exercise 1.4: Bubble for PP=16 Derive

PP=16, M=32. What is the bubble fraction?

fraction (0 to 1)

Show derivation

bubble = (16 - 1) / (16 - 1 + 32) = 15 / 47 ≈ 0.319

Nearly a third of compute is wasted. To get this below 5%, you'd need M > 285. This is why PP=16 is rarely used — the bubble is too large to hide with reasonable microbatch counts.

Exercise 1.5: Implement pipelineBubbleFraction() Build

Write a function that computes the pipeline bubble fraction given PP and M.

Return the fraction of wasted time (0 to 1).

Show solution

javascript
function pipelineBubbleFraction(pp, m) {
  return (pp - 1) / (pp - 1 + m);
}

Exercise 1.6: PP Communication Trace

In pipeline parallelism, what data is communicated between stages?

The full model weights for each stage Gradients averaged across all stages Activations (hidden states) — the output of the last layer on GPU k becomes the input to the first layer on GPU k+1 Only the loss value, passed backward from the last stage

Show reasoning

PP communicates activations (forward) and activation gradients (backward) between adjacent stages. The activation tensor shape is [B, S, d] — much smaller than the weight matrices. This is a point-to-point send (GPU k to GPU k+1), not an allreduce, which is why PP works well across nodes over InfiniBand.

Chapter 2: Data Parallelism & FSDP

The simplest form of parallelism: give every GPU a full copy of the model, feed each GPU different data, and average the gradients. That's Distributed Data Parallel (DDP). Simple, but memory-wasteful — every GPU stores the full model weights, gradients, and optimizer states.

Fully Sharded Data Parallel (FSDP) fixes the memory problem. Instead of replicating everything, FSDP shards the model parameters, gradients, and optimizer states across GPUs. Before each layer's forward pass, GPUs allgather the parameters they need, compute, then discard the non-local shards.

Memory per GPU:
DDP: model + gradients + optimizer = P × (2 + 2 + 12) = 16P bytes // FP16 weights + FP16 grad + Adam states
FSDP: (model + gradients + optimizer) / N = 16P / N bytes // everything sharded across N GPUs

Adam optimizer states per parameter (FP32):
FP32 master weight (4B) + momentum (4B) + variance (4B) = 12 bytes

Why 16 bytes per parameter? In mixed-precision training: 2B for the FP16 weight, 2B for the FP16 gradient, plus Adam stores three FP32 copies — master weight (4B), first moment (4B), second moment (4B). Total: 2 + 2 + 4 + 4 + 4 = 16 bytes per parameter. This is the number that determines your GPU memory needs.

Exercise 2.1: DDP Memory for 7B Derive

A 7B parameter model trained with DDP using mixed precision (FP16 weights/gradients + FP32 Adam optimizer). How much memory does each GPU need just for model state (weights + gradients + optimizer)?

GB per GPU

Show derivation

Memory per GPU = 7 × 10⁹ × 16 bytes = 112 × 10⁹ bytes = 112 GB

112 GB exceeds even an H100's 80 GB HBM. This doesn't even include activations. A 7B model in DDP won't fit on a single 80 GB GPU — you need either FSDP, DeepSpeed ZeRO, or gradient checkpointing.

Exercise 2.2: FSDP Memory for 7B Derive

Same 7B model, now with FSDP across 32 GPUs. How much model state memory does each GPU need?

GB per GPU

Show derivation

FSDP memory per GPU = 112 GB / 32 = 3.5 GB

From 112 GB to 3.5 GB — a 32× reduction. Each GPU only stores 1/32 of the weights, gradients, and optimizer states. The rest is reconstructed on-the-fly via allgather before each layer's computation. This is the magic of FSDP.

Exercise 2.3: Communication Overhead Derive

In DDP, each GPU allreduces the full gradient (2 bytes per param in FP16). In FSDP, each GPU allgathers the full model weights before each forward layer and reduce-scatters gradients after backward. For a 7B model in FP16, what is the total communication volume per training step for DDP?

Allreduce transfers ~2× the message size (reduce-scatter + allgather).

GB per step (DDP)

Show derivation

DDP gradient allreduce = 2 × 7B × 2 bytes = 28 GB

FSDP has higher communication: it allgathers weights before forward (14 GB), allgathers again before backward (14 GB), and reduce-scatters gradients after backward (14 GB) — about 42 GB total. FSDP trades 1.5× more communication for dramatically less memory per GPU.

Exercise 2.4: ZeRO Stages Trace

DeepSpeed ZeRO has three stages. ZeRO-1 shards optimizer states only. ZeRO-2 shards optimizer + gradients. ZeRO-3 shards everything (= FSDP). For a 7B model on 8 GPUs, what is the per-GPU memory for ZeRO-1?

112 GB — same as DDP, ZeRO-1 doesn't help ~30 GB — optimizer sharded (12B/8=1.5B per GPU = 10.5 GB), plus full weights (14 GB) + full gradients (14 GB) = ~39 GB 14 GB — only weights remain unsharded 3.5 GB — same as FSDP

Show derivation

Optimizer per GPU = 12 × 7B / 8 = 10.5 GB

Weights (full) = 2 × 7B = 14 GB

Gradients (full) = 2 × 7B = 14 GB

Total = 10.5 + 14 + 14 = 38.5 GB ≈ 39 GB

ZeRO-1 is the sweet spot for many setups: it shards the biggest memory consumer (optimizer states are 12/16 = 75% of total) while requiring no extra communication beyond DDP's gradient allreduce.

Exercise 2.5: Implement fsdpMemoryPerGPU() Build

Write a function that returns memory per GPU (in GB) for FSDP, given model size in billions and number of GPUs. Use the 16 bytes/param rule.

Remember: 16 bytes/param total, divided evenly across GPUs.

Show solution

javascript
function fsdpMemoryPerGPU(paramsBillions, numGPUs) {
  return (paramsBillions * 16) / numGPUs;
}

Chapter 3: 3D Parallelism

Real large-scale training combines all three: tensor parallelism within a node (fast NVLink), pipeline parallelism across nodes (point-to-point activations), and data parallelism for throughput scaling. The total GPU count factors cleanly:

Total GPUs = TP × PP × DP

TP = 8    // within node (8 GPUs per node via NVLink)
PP = 4    // across 4 nodes (layers split into 4 stages)
DP = 8    // 8 independent pipeline replicas
Total = 8 × 4 × 8 = 256 GPUs    // 32 nodes of 8 GPUs each

The hierarchy rule. TP goes inside the node (NVLink). PP goes across adjacent nodes (point-to-point, low latency). DP goes across all remaining nodes (gradient allreduce, bandwidth-tolerant). Violate this hierarchy and your training slows to a crawl — putting TP across nodes saturates the slower interconnect on every single layer.

Exercise 3.1: Verify 256 GPUs Derive

You have 256 GPUs (32 nodes, 8 GPUs each). You want to train a 70B model with TP=8, PP=4. What is the DP degree?

DP degree

Show derivation

DP = Total / (TP × PP) = 256 / (8 × 4) = 256 / 32 = 8

8 data-parallel replicas, each consisting of 4 pipeline stages of 8 TP-sharded GPUs. Each replica spans 4 nodes (32 GPUs). The 8 replicas process 8 different microbatches simultaneously.

Exercise 3.2: Per-GPU Weight Memory Derive

70B model, TP=8, PP=4, 80 layers. In FP16 (2 bytes/param), how much weight memory per GPU?

Each GPU holds 1/TP of 1/PP of the layers. With 80 layers and PP=4, each stage has 20 layers.

GB per GPU

Show derivation

Total weight memory = 70B × 2 bytes = 140 GB

Per PP stage = 140 / 4 = 35 GB

Per GPU (TP=8) = 35 / 8 = 4.375 GB

Only 4.4 GB of weights per GPU. But for training, you also need optimizer states (~4.375 × 6 = ~26 GB if not using FSDP), activations, and gradient memory. This is why even with TP+PP, FSDP (or ZeRO) is often still needed for the optimizer states.

Exercise 3.3: Why TP Inside the Node? Trace

Why is tensor parallelism placed within a node (NVLink) rather than across nodes (InfiniBand)?

TP requires 2 allreduce per layer — extremely frequent, latency-sensitive communication. NVLink (900 GB/s, ~1 μs latency) handles this; InfiniBand (400 GB/s, ~5 μs latency) would bottleneck every layer. TP doesn't work across nodes due to hardware limitations TP across nodes would use more total memory It's just a convention, there's no technical reason

Show reasoning

TP is the most communication-intensive form of parallelism. With an 80-layer model, you perform 160 allreduce operations per forward pass. Even at NVLink speeds, this takes hundreds of milliseconds. On InfiniBand, the higher latency and lower bandwidth would increase this by 2-5×, making TP the bottleneck. PP, by contrast, only sends point-to-point activations between adjacent stages — much less frequent.

Exercise 3.4: 405B Configuration Derive

Llama 3.1 405B was trained on 16,384 H100 GPUs. If TP=8 and PP=16, what is the DP degree?

DP degree

Show derivation

DP = 16,384 / (8 × 16) = 16,384 / 128 = 128

128-way data parallelism means 128 independent replicas each processing a different microbatch. With a global batch size of ~16M tokens, each replica handles ~128K tokens per step. This is the scale of frontier training.

Exercise 3.5: Effective Batch Size Derive

With DP=128, each replica processes a microbatch of 4 sequences × 4096 tokens per sequence. What is the global batch size in tokens?

million tokens

Show derivation

Tokens per replica = 4 × 4096 = 16,384

Global batch = 128 × 16,384 = 2,097,152 ≈ 2.1M tokens

~2M tokens per gradient step. With gradient accumulation (say 8 accumulation steps), the effective batch can reach 16M+ tokens. LLaMA 3.1's training report mentions a 16M token batch.

Exercise 3.6: Find the Bug Debug

This function computes per-GPU weight memory for 3D parallelism but gives the wrong answer. Click the buggy line.

function gpuWeightMemGB(totalParams, tp, pp, dp) {
  const bytesPerParam = 2;  // FP16
  const totalBytes = totalParams * bytesPerParam;
  const perGPU = totalBytes / (tp * pp * dp);
  return perGPU / 1e9;  // convert to GB
}

Show explanation

Line 4 divides by TP × PP × DP, but DP does not shard weights. In data parallelism, every replica holds a full copy of the model (or its TP×PP shard). Dividing by DP would only be correct for FSDP. The correct line is: const perGPU = totalBytes / (tp * pp);

This is a common mistake: confusing "GPU count" with "sharding degree." DP replicates; only TP and PP shard the weights.

Chapter 4: Continuous Batching

You're serving a language model. 32 requests come in simultaneously. In static batching, you pad all sequences to the same length and wait until every request finishes. If one request generates 20 tokens and another generates 500 tokens, the 20-token request sits idle in the batch for 480 steps — pure waste.

Continuous batching (also called iteration-level scheduling) fixes this: the moment a request finishes generating, its slot is immediately freed and a new request from the queue takes its place. No padding, no waiting.

Static batching throughput:
Tokens/sec = B × avg_output_len / (max_output_len × time_per_step)
Utilization = avg_output_len / max_output_len

Continuous batching throughput:
Tokens/sec ≈ B × 1 / time_per_step // every slot produces a token every step
Utilization ≈ 1.0 // slots always filled from queue

The utilization gap is dramatic. If your average output is 200 tokens but the max is 512, static batching wastes 61% of compute. Continuous batching eliminates this entirely. This single optimization can 2-3× your serving throughput without any model changes.

Exercise 4.1: Static Batching Utilization Derive

32 slots, average output length = 200 tokens, max output length = 512 tokens. What is the utilization (fraction of useful compute) in static batching?

utilization (0 to 1)

Show derivation

Utilization = avg / max = 200 / 512 = 0.391

Only 39.1% utilization. Over 60% of GPU cycles are spent computing attention and FFN on padding tokens or idle slots. This is the fundamental problem continuous batching solves.

Exercise 4.2: Throughput Improvement Derive

With the same setup (avg=200, max=512), what is the throughput multiplier of continuous batching over static batching? Assume the request queue is never empty (always saturated).

× improvement

Show derivation

Static throughput ∝ avg/max = 200/512

Continuous throughput ∝ 1.0 (every slot always active)

Improvement = 1.0 / (200/512) = 512/200 = 2.56×

2.56× more tokens per second from the same hardware. If you're paying $2/hr per GPU, continuous batching effectively cuts your cost from $2/hr to $0.78/hr for the same throughput.

Exercise 4.3: When Continuous Batching Doesn't Help Trace

In which scenario does continuous batching provide the LEAST improvement over static batching?

avg=50, max=2048 (summarization tasks with variable length) avg=100, max=512 (chatbot with moderate variation) avg=500, max=512 (code generation with uniform output lengths) avg=200, max=4096 (creative writing with high variance)

Show reasoning

When avg ≈ max, all requests finish around the same time. Static batching wastes almost nothing because there's no padding. With avg=500 and max=512, static utilization is 500/512 = 97.7%. Continuous batching only gives a 1.02× improvement — barely noticeable. The bigger the gap between avg and max, the more continuous batching helps.

Exercise 4.4: Implement slotUtilization() Build

Write a function that computes the slot utilization for static batching. Utilization = average output length / max output length.

Simple ratio — but think about edge cases.

Show solution

javascript
function slotUtilization(avgLen, maxLen) {
  return avgLen / maxLen;
}

Exercise 4.5: Queue Depth Matters Derive

Continuous batching can only fill freed slots if there are requests waiting. If your server has 32 slots and receives an average of 10 requests/second, each taking ~5 seconds to generate, what is the average queue occupancy? Is the server saturated?

average concurrent requests

Show derivation

Little's Law: L = λ × W = 10 req/s × 5 s = 50 concurrent requests

50 requests in the system but only 32 slots — the server is over-saturated by 18 requests. Those 18 sit in the queue, adding latency. You need either more GPUs (more slots) or faster generation (shorter W) to keep up.

Chapter 5: PagedAttention

During autoregressive generation, each request accumulates a KV cache — the key and value tensors from every previous token. In a 70B model with 80 layers, 64 heads, and head_dim=128, each token's KV cache is:

KV cache per token:
2 (K and V) × L (layers) × n_kv_heads × d_head × bytes_per_element

Example (LLaMA 70B, FP16):
2 × 80 × 8 (GQA groups) × 128 × 2 bytes = 327,680 bytes ≈ 320 KB per token

For a 4K context: 320 KB × 4096 = 1.28 GB per request. With 32 concurrent requests: 41 GB just for KV cache. Traditional systems pre-allocate max_seq_len for every request, wasting memory when most requests are shorter. PagedAttention (from vLLM) solves this by managing KV cache like virtual memory pages.

Pages = blocks of token slots. Instead of a contiguous 4K-token allocation per request, PagedAttention allocates small blocks (e.g., 16 tokens each). As a request generates more tokens, it grabs new pages. When it finishes, pages are freed instantly. No fragmentation, no pre-allocation waste.

Exercise 5.1: KV Cache per Token Derive

LLaMA 70B: 80 layers, 8 GQA KV heads, d_head=128, FP16 (2 bytes). How many KB per token for the full KV cache?

KB per token

Show derivation

Per token = 2 × 80 × 8 × 128 × 2 = 327,680 bytes = 320 KB

GQA with 8 KV heads (vs 64 Q heads) saves 8× on KV cache compared to MHA. Without GQA, this would be 2,560 KB per token — 10× more. GQA is one of the most impactful memory optimizations for serving.

Exercise 5.2: Total KV Cache for 32 Requests Derive

Using 320 KB/token, 32 concurrent requests, each with 4096 context tokens. How much memory for KV cache?

GB total KV cache

Show derivation

Per request = 320 KB × 4096 = 1,310,720 KB = 1.25 GB

32 requests = 32 × 1.25 GB = 40 GB

40 GB for KV cache alone. Combined with 17.5 GB of weights (TP=8), that's 57.5 GB out of 80 GB per GPU. At longer contexts (16K, 128K), KV cache becomes the dominant memory consumer, dwarfing the model weights.

Exercise 5.3: Page Count Derive

Total KV cache = 40 GB. Page size = 16 tokens × 320 KB/token = 5,120 KB = 5 MB per page. How many pages needed?

pages

Show derivation

Total tokens = 32 × 4096 = 131,072

Pages = 131,072 / 16 = 8,192 pages

Verification: 8,192 × 5 MB = 40,960 MB = 40 GB ✓

8,192 pages tracked by a page table. Each request's KV cache is a list of page pointers, not a contiguous allocation. This enables sharing (prefix caching) and eliminates fragmentation.

Exercise 5.4: Internal Fragmentation Derive

A request generates 200 tokens. Pages hold 16 tokens. How many pages are allocated, and what percentage of the last page is wasted (internal fragmentation)?

% wasted in last page

Show derivation

Pages needed = ceil(200/16) = ceil(12.5) = 13 pages

Tokens stored = 13 × 16 = 208 slots, 200 used, 8 wasted

Fragmentation in last page = 8/16 = 50%

Average internal fragmentation is half a page per request = 8 tokens × 320 KB/token = 2.5 MB per request. Across 32 requests: 80 MB wasted — a tiny 0.2% of the 40 GB total. Compare this to pre-allocating full max_seq_len: (4096-200) × 320 KB = 1.22 GB wasted per request!

Exercise 5.5: Memory Savings Derive

Without PagedAttention: pre-allocate 4096 tokens per request. With PagedAttention: allocate only what's needed. If average sequence length is 800 tokens, what is the memory savings ratio?

× savings (pre-alloc / paged)

Show derivation

Pre-allocated: 4096 tokens per request

PagedAttention: ~800 tokens per request (+ 0.5 page fragmentation ≈ 808)

Savings = 4096 / 800 = 5.12×

5× less KV cache memory means you can serve 5× more concurrent requests on the same hardware, or use 5× less hardware for the same concurrency. This is why vLLM (which introduced PagedAttention) became the default serving framework.

Chapter 6: Speculative Decoding

Autoregressive generation is slow because each token depends on the previous one — you can't parallelize. Speculative decoding cheats: a small, fast draft model generates k candidate tokens. The large target model then verifies all k tokens in a single forward pass (same cost as generating 1 token, since it's just a batch of k+1 positions). Accepted tokens are free; rejected tokens cost nothing extra.

Expected tokens per target model call (simplified):
E[tokens] = (1 - α^k+1) / (1 - α) // geometric series

For rough estimation: E[tokens] ≈ kα + 1 // approximate when α is moderate

Where α = acceptance rate (probability draft token matches target),
k = number of speculated tokens per step

Why does verification cost the same as one token? In autoregressive generation, the target model already computes attention over all previous tokens. Verifying k draft tokens is equivalent to a forward pass on a sequence of k+1 positions — the same cost as generating the next token for a k-token-longer context. The key insight: generation is sequential, but verification is parallel.

Exercise 6.1: Expected Tokens per Step Derive

Draft model generates k=5 tokens, acceptance rate α=0.7. Using the approximation E = kα + 1, how many tokens do you expect per target model forward pass?

tokens per step

Show derivation

E = kα + 1 = 5 × 0.7 + 1 = 3.5 + 1 = 4.5 tokens

Instead of 1 token per forward pass, you get 4.5 tokens. That's a 4.5× improvement in tokens-per-step. But we haven't accounted for the draft model's cost yet — the actual wall-clock speedup is less.

Exercise 6.2: Wall-Clock Speedup Derive

Target model latency: 50 ms/forward pass. Draft model: 5 ms/forward pass (10× faster). With k=5, α=0.7: what is the wall-clock speedup for tokens per second?

Time per speculative step = k × draft_time + 1 × target_time. Compare tokens/time: speculative vs standard (1 token per 50 ms).

× speedup

Show derivation

Standard: 1 token / 50 ms = 20 tokens/s

Speculative step time = 5 × 5 ms + 50 ms = 25 + 50 = 75 ms

Speculative tokens/step = 4.5

Speculative: 4.5 tokens / 75 ms = 60 tokens/s

Speedup = 60 / 20 = 3.0×

3× wall-clock speedup despite a 4.5× token-per-step improvement, because the draft model takes 25 ms (5 sequential forward passes). If the draft model ran on a separate GPU in parallel, you'd get closer to 4.5×.

Exercise 6.3: Break-Even Acceptance Rate Derive

With k=5, draft at 5 ms and target at 50 ms (same GPU, sequential): what is the minimum acceptance rate α for speculative decoding to be faster than standard decoding?

Standard: 1 token/50ms. Speculative: (kα+1) tokens / (5k+50) ms. Set speculative rate > standard rate and solve for α.

minimum α

Show derivation

(kα + 1) / (5k + 50) > 1/50

(5α + 1) / 75 > 1/50

50(5α + 1) > 75

250α + 50 > 75

250α > 25

α > 0.10

The break-even is surprisingly low — only 10% acceptance rate. This means speculative decoding is almost always worth it when the draft model is 10× faster. Even a terrible draft model (20% acceptance) gives a 1.47× speedup.

Exercise 6.4: When Speculative Decoding Hurts Trace

In which scenario is speculative decoding NOT worth the complexity?

Low-latency chatbot with 70B target model and 1B draft model High-throughput batch processing where the GPU is already compute-saturated with large batch sizes — speculative decoding adds draft model overhead without reducing latency Code generation with 70B target and specialized 3B code draft model Real-time translation with strict latency requirements

Show reasoning

Speculative decoding helps with latency (tokens/second for a single request). But in high-throughput batch mode, the GPU is already fully utilized computing attention for a large batch. Adding a draft model doesn't help because the bottleneck is memory bandwidth (loading weights), not sequential token generation. With batch size 64+, each forward pass already produces 64 tokens — speculative decoding's benefit disappears.

Exercise 6.5: Exact Expected Tokens Derive

Using the exact formula E = (1 - α^k+1) / (1 - α) with k=5, α=0.7: what is the exact expected number of accepted tokens per step? (Hint: calculate 0.7⁶ first.)

tokens per step

Show derivation

α⁶ = 0.7⁶ = 0.117649

E = (1 - 0.117649) / (1 - 0.7) = 0.882351 / 0.3 = 2.941

The exact formula gives 2.94, compared to the approximation kα+1 = 4.5. The approximation significantly overestimates because it ignores the geometric decay — the probability of accepting all k tokens is α^k = 0.7⁵ = 16.8%, so most of the time some tokens are rejected.

Chapter 7: Quantization for Serving

Autoregressive decoding is memory-bandwidth bound: each generated token requires loading the entire model from HBM to the compute units. At FP16, a 70B model = 140 GB of weights. An H100 has 3.35 TB/s HBM bandwidth. Loading 140 GB takes 42 ms — that's your minimum per-token latency, no matter how fast the compute is.

Quantization shrinks the model: INT8 halves it to 70 GB, INT4 quarters it to 35 GB. Smaller weights = faster loading = faster token generation.

Memory per model (70B params):
FP16: 70B × 2 bytes = 140 GB
INT8: 70B × 1 byte = 70 GB
INT4: 70B × 0.5 bytes = 35 GB

Minimum decode latency (H100, 3.35 TB/s):
FP16: 140 / 3350 = 41.8 ms/token → 23.9 tok/s
INT8: 70 / 3350 = 20.9 ms/token → 47.8 tok/s
INT4: 35 / 3350 = 10.4 ms/token → 96.2 tok/s

Quantization is a free lunch for decode throughput. In the memory-bandwidth-bound regime (batch size = 1), halving the model size exactly doubles the tokens per second. INT4 gives ~4× the throughput of FP16. The quality tradeoff is small: INT8 loses ~0.1 perplexity, INT4 with GPTQ/AWQ loses ~0.3-0.5.

Exercise 7.1: FP16 Model Size Derive

A 70B parameter model in FP16. How many GB of weight memory?

Show derivation

70 × 10⁹ params × 2 bytes/param = 140 × 10⁹ bytes = 140 GB

Exercise 7.2: INT4 Model Size Derive

Same 70B model quantized to INT4 (4 bits per weight = 0.5 bytes). How many GB?

Show derivation

70 × 10⁹ × 0.5 bytes = 35 × 10⁹ bytes = 35 GB

35 GB fits on a single 80 GB H100 with plenty of room for KV cache. Without quantization, you'd need at least TP=2 to fit the model. Quantization can literally halve your GPU cost for serving.

Exercise 7.3: Decode Throughput at INT8 Derive

70B model, INT8 quantized (70 GB), H100 with 3.35 TB/s HBM bandwidth, batch size 1 (pure bandwidth-bound). What is the maximum decode throughput in tokens/second?

tokens/second

Show derivation

Time to load model = 70 GB / 3350 GB/s = 0.02090 s = 20.9 ms

Throughput = 1 / 0.02090 = 47.9 tokens/s

47.9 tok/s vs 23.9 tok/s at FP16. Exactly 2× faster because the model is exactly 2× smaller. This is the bandwidth-bound regime: throughput scales linearly with model compression.

Exercise 7.4: Batch Size for Compute-Bound Derive

At what batch size does a 70B INT8 model on H100 transition from bandwidth-bound to compute-bound? H100 has 990 TFLOPS INT8 and 3.35 TB/s bandwidth. The arithmetic intensity crossover happens when FLOPs/byte = hardware_FLOPS/bandwidth.

Each token requires ~2P FLOPs (forward pass through 70B params = 140 GFLOPs) and loads P bytes (70 GB for INT8). Compute-bound when: B × 140 GFLOPs / 70 GB > 990 TFLOPS / 3.35 TB/s.

batch size

Show derivation

Operational intensity per token = 2 FLOPs/byte (each weight used in one multiply-add)

With batch B: operational intensity = 2B FLOPs/byte

Machine balance = 990 × 10¹² / 3.35 × 10¹² = 295.5 FLOPs/byte

Crossover: 2B = 295.5 → B = 147.8 ≈ 148

Below batch ~148, you're bandwidth-bound and quantization's throughput benefit is linear. Above 148, you're compute-bound and quantization helps less (INT8 compute is still faster than FP16, but the scaling is sublinear). Most serving scenarios operate well below this crossover.

Exercise 7.5: Quality vs Speed Tradeoff Trace

A 70B model has perplexity 5.2 at FP16. After INT8 quantization: ~5.3. After INT4 (GPTQ, 128 group size): ~5.7. You're building a code assistant where correctness matters more than speed. Which quantization should you use?

FP16 — no quality loss, use more GPUs for speed INT8 — negligible quality loss (0.1 ppl), 2× throughput, best balance of quality and cost INT4 — 4× throughput is worth the 0.5 ppl degradation INT2 — maximum speed, quality is fine

Show reasoning

INT8 is the sweet spot for quality-sensitive applications. The 0.1 perplexity increase is almost imperceptible in code generation quality, but you get 2× throughput (or half the GPUs, half the cost). INT4's 0.5 ppl increase can noticeably affect code correctness — missing edge cases, incorrect variable names. FP16 is wasteful when INT8 is essentially lossless.

Exercise 7.6: GPU Count with Quantization Derive

To serve a 70B model, how many H100 GPUs (80 GB each) do you need at FP16 vs INT4? (Weights only — ignore KV cache for now.)

GPUs needed at FP16

Show derivation

FP16: 140 GB / 80 GB per GPU = 1.75 → need 2 GPUs (TP=2)

INT4: 35 GB / 80 GB per GPU = 0.44 → need 1 GPU

INT4 quantization cuts your serving cost in half by fitting the model on a single GPU. Plus, single-GPU serving has zero communication overhead — no TP allreduce latency. This is a 2× cost reduction AND lower latency.

Chapter 8: Cost Optimization

You've optimized the model — now optimize the bill. Serving LLMs is a business: your revenue per token must exceed your cost per token. The math is surprisingly simple once you know the hardware costs and throughput numbers.

Cost per million tokens:
$/1M tokens = (GPU_cost_per_hour / tokens_per_hour) × 1,000,000

Break-even utilization:
utilization = spot_price / on_demand_price
(Below this utilization, on-demand is cheaper than reserving hardware)

The utilization trap. An H100 at $3.50/hr costs $2,520/month. If it's only 30% utilized, your effective cost is $3.50/0.30 = $11.67/GPU-hr. At 90% utilization: $3.89/GPU-hr. The difference between 30% and 90% utilization is a 3× cost difference for the same hardware.

Exercise 8.1: Cost per Million Tokens Derive

An H100 costs $3.50/hr (on-demand). You're serving a 70B INT8 model on 2 GPUs (TP=2), achieving 3,000 output tokens/second. What is the cost per 1M output tokens?

$/1M tokens

Show derivation

GPU cost/hr = 2 GPUs × $3.50 = $7.00/hr

Tokens/hr = 3,000 tok/s × 3,600 s/hr = 10,800,000

$/1M tokens = $7.00 / 10.8 = $0.648 ≈ $0.65

$0.65 per million output tokens. For comparison, OpenAI charges $15/1M output tokens for GPT-4. The raw compute cost is ~23× less than the API price — the markup covers engineering, infrastructure, redundancy, and margin.

Exercise 8.2: Spot vs On-Demand Derive

H100 spot price: $2.00/hr. On-demand: $3.50/hr. If your spot instances get interrupted 15% of the time (85% availability), is spot cheaper for serving?

Effective spot cost = spot_price / availability. Compare to on-demand.

effective $/hr (spot)

Show derivation

Effective spot cost = $2.00 / 0.85 = $2.35/hr

On-demand = $3.50/hr

Spot saves: $3.50 - $2.35 = $1.15/hr per GPU = 33% savings

Spot is cheaper even with 15% interruptions. However, for production serving with SLAs, interruptions cause dropped requests and latency spikes. Spot is better for batch inference (reprocessable work) than real-time serving.

Exercise 8.3: Break-Even Utilization Derive

A 1-year reserved H100 costs $2.20/hr (committed). On-demand is $3.50/hr. What utilization rate makes reserved cheaper than on-demand?

Reserved cost is fixed (you pay even when idle). On-demand only when used. Break-even: reserved_price = on_demand_price × utilization.

utilization fraction

Show derivation

Break-even: $2.20 = $3.50 × utilization

utilization = $2.20 / $3.50 = 0.629

If you use the GPU more than 62.9% of the time, reserved is cheaper. Most production serving systems targeting 70%+ utilization should use reserved instances. For dev/staging at <30% utilization, on-demand is clearly better.

Exercise 8.4: Multi-Model Serving Derive

You serve both a 7B model (1 GPU, INT4, 15K tok/s) and a 70B model (2 GPUs, INT8, 3K tok/s). Traffic is 80% to 7B and 20% to 70B. Total request rate: 1,000 requests/s with average 200 output tokens each. How many GPUs do you need for each model?

Tokens/s demand = requests/s × output_length × traffic_fraction.

total GPUs needed

Show derivation

7B demand = 0.80 × 1000 × 200 = 160,000 tok/s

7B GPUs = 160,000 / 15,000 = 10.67 → 11 GPUs

70B demand = 0.20 × 1000 × 200 = 40,000 tok/s

70B replicas = 40,000 / 3,000 = 13.33 → 14 replicas

70B GPUs = 14 × 2 = 28 GPUs

Total = 11 + 28 = 39 GPUs

Even though 80% of traffic goes to 7B, the 70B model needs 28/39 = 72% of the GPUs. The large model is 5× slower per token AND uses 2× more GPUs per replica. Routing more traffic to the smaller model has a massive cost impact.

Exercise 8.5: Monthly Serving Bill Derive

39 GPUs at $3.50/hr (on-demand), running 24/7 for 30 days. What's the monthly bill?

$/month

Show derivation

Hours/month = 24 × 30 = 720

Cost = 39 × $3.50 × 720 = $98,280/month

~$100K/month to serve 1,000 requests/second. At 200 tokens/request, that's 200K tok/s = 17.28B tokens/month. Cost per million tokens = $98,280 / 17,280 = $5.69/M tokens (blended across both models).

Exercise 8.6: Quantization Savings Trace

You could switch the 70B model from INT8 (2 GPUs/replica) to INT4 (1 GPU/replica, 5K tok/s). How much would the monthly bill drop?

Saves ~$5K/month Saves ~$30K/month — 70B drops from 28 GPUs to 8 GPUs, total from 39 to 19 GPUs Saves ~$50K/month No savings — quantization doesn't reduce GPU count

Show derivation

INT4 70B replicas = 40,000 / 5,000 = 8 replicas × 1 GPU = 8 GPUs

New total = 11 + 8 = 19 GPUs

New cost = 19 × $3.50 × 720 = $47,880/month

Savings = $98,280 - $47,880 = $50,400/month

Actually ~$50K/month savings — even more than option C. INT4 reduces the 70B serving from 28 GPUs to 8 GPUs (3.5× reduction). The total infrastructure cost is nearly halved. This is why aggressive quantization is standard for production serving.

Chapter 9: Capstone — Design a Serving System

Time to put it all together. You're the ML infrastructure lead. The task: serve a 70B model to 10,000 concurrent users. Context window: 4K tokens. Target: 30 tokens/second per user. p99 latency: first token within 2 seconds. Your budget needs to be defensible to the VP of Engineering. Every number must have a derivation.

This is a real system design interview. Each exercise builds on the previous one. By the end, you'll have a complete, costed serving architecture. No hand-waving — every GPU count, memory budget, and dollar figure comes from calculation.

Exercise 9.1: Total Throughput Requirement Derive

10,000 concurrent users, each receiving 30 tok/s. But not all users are actively generating at once — assume 30% are in generation at any moment (the rest are reading, typing, or idle). What is the required aggregate output throughput?

tokens/second

Show derivation

Active users = 10,000 × 0.30 = 3,000

Total throughput = 3,000 × 30 tok/s = 90,000 tok/s

90K tokens per second. This is the number that sizes everything else — GPUs, replicas, cost.

Exercise 9.2: Quantization Choice Trace

For this 70B serving system, which quantization strategy minimizes GPU count while maintaining quality for a general-purpose assistant?

FP16 — no quality compromise INT8 — minimal quality loss, fits on 1 GPU with TP=1 INT4 (AWQ) — 35 GB weights on 1 GPU, leaves ~45 GB for KV cache. Quality loss is small (~0.3 ppl). Best cost/quality tradeoff at scale. INT2 — maximum compression

Show reasoning

INT4 (AWQ) is the right choice. 35 GB weights leave 45 GB for KV cache on an 80 GB H100. At 320 KB/token, that's 144K tokens of KV cache — enough for ~36 concurrent 4K-context requests per GPU. INT8 would leave less KV cache room, requiring more GPUs. FP16 doesn't fit on 1 GPU at all.

Exercise 9.3: Per-GPU Throughput Derive

With INT4 on a single H100 (3.35 TB/s bandwidth), the bandwidth-bound decode throughput at batch=1 is ~96 tok/s. With continuous batching and batch size 32, throughput scales nearly linearly (bandwidth-bound). What is the per-GPU throughput at batch=32?

tokens/second per GPU

Show derivation

At batch=1 (bandwidth-bound): 35 GB / 3350 GB/s = 10.45 ms/token → 95.7 tok/s

At batch=32: same weight load, 32 tokens produced → 32 × 95.7 = 3,062 tok/s

~3,000 tok/s per GPU at batch=32. This is still bandwidth-bound (batch 32 << crossover batch ~300 for INT4). Each decode step loads 35 GB of weights once and produces 32 tokens — amortizing the bandwidth cost across the batch.

Exercise 9.4: Number of GPUs Derive

Required: 90,000 tok/s. Per GPU: ~3,000 tok/s. How many GPUs? Add 20% headroom for load spikes and maintenance.

GPUs

Show derivation

Base GPUs = 90,000 / 3,000 = 30 GPUs

With 20% headroom = 30 × 1.2 = 36 GPUs

36 H100 GPUs. That's about 4.5 nodes of 8 GPUs each — round to 5 nodes (40 GPUs) for clean hardware allocation. The extra 4 GPUs provide additional failover capacity.

Exercise 9.5: KV Cache Memory Budget Derive

Each GPU has 80 GB. Weights = 35 GB. CUDA overhead ≈ 5 GB. Remaining for KV cache? With 320 KB/token and 4K context, how many concurrent requests per GPU?

max concurrent requests per GPU

Show derivation

Available KV memory = 80 - 35 - 5 = 40 GB

KV per request = 320 KB × 4096 = 1,310,720 KB = 1.25 GB

Max requests = 40 / 1.25 = 32

32 concurrent requests per GPU with PagedAttention. At batch=32, we estimated ~3K tok/s — this matches our throughput calculation. The system is KV-cache-limited, not compute-limited. To serve more requests per GPU, you'd need either shorter contexts, smaller KV heads (more aggressive GQA), or KV cache quantization (INT8 KV halves the per-request cost).

Exercise 9.6: Latency Check Trace

The p99 TTFT (time-to-first-token) target is 2 seconds. For a 4K input prompt, the prefill phase processes all input tokens. With INT4 at ~990 TFLOPS, prefill for 4096 tokens through a 70B model takes approximately: 2 × 70B × 4096 FLOPs / 990 TFLOPS. Is the latency target achievable?

Yes — prefill takes ~580 ms, well under 2 seconds even with queuing delay No — prefill alone takes ~4 seconds Impossible to tell without knowing the batch size

Show derivation

Prefill FLOPs = 2 × 70 × 10⁹ × 4096 = 573.7 × 10¹² = 573.7 TFLOPS

Time = 573.7 / 990 = 0.580 s = 580 ms

580 ms for prefill, leaving ~1.4 seconds for queuing, scheduling, and network overhead. Achievable at p99 with proper queue management. Note: this assumes the GPU is dedicated to this request during prefill — in practice, prefill is often separated from decode ("disaggregated serving") to avoid head-of-line blocking.

Exercise 9.7: Monthly Cost Derive

36 H100 GPUs, reserved pricing at $2.20/hr. What's the monthly cost?

$/month

Show derivation

Hours/month = 24 × 30 = 720

Monthly cost = 36 × $2.20 × 720 = $57,024

~$57K/month to serve 10,000 users. That's $5.70 per user per month. If each user pays $20/month, your gross margin on compute is ($20 - $5.70) / $20 = 71.5%. This is why LLM API businesses are viable — the per-user compute cost is surprisingly low at scale.

Exercise 9.8: Cost per Million Tokens (Final) Derive

Total cost: $57,024/month. Total throughput: 90,000 tok/s. What is the blended cost per 1M output tokens?

$/1M tokens

Show derivation

Tokens/month = 90,000 × 3600 × 24 × 30 = 233,280,000,000 = 233.28B

$/1M tokens = $57,024 / 233,280 = $0.244

$0.24 per million output tokens at full utilization. This is the raw infrastructure cost for a well-optimized 70B INT4 serving system. Any API pricing above this (plus engineering costs and margin) is profitable.

Exercise 9.9: Design the Pipeline Design

Put the serving system components in the correct order, from user request to generated response.

→

Load balancer routes to GPU Prefill: process input prompt Allocate KV cache pages Decode: autoregressive generation Free KV pages, return slot Stream tokens to user

Show correct order

1. Load balancer routes to GPU (least-loaded scheduling).
2. Allocate KV cache pages (PagedAttention reserves blocks).
3. Prefill: process input prompt (compute-bound, fills KV cache).
4. Decode: autoregressive generation (bandwidth-bound, token by token).
5. Stream tokens to user (SSE/WebSocket, as they're generated).
6. Free KV pages, return slot (continuous batching fills the slot immediately).

Topic	Lesson
Transformer fundamentals	Transformer — From Absolute Zero
Parameter counting	Transformer Math Workbook
Distributed training	Distributed Training — From Absolute Zero
ML inference	ML Inference Engineer — Day In The Life
Scaling laws	Scaling Book Workbook

Systems & Serving

Chapter 0: Tensor Parallelism

Chapter 1: Pipeline Parallelism

Chapter 2: Data Parallelism & FSDP

Chapter 3: 3D Parallelism

Chapter 4: Continuous Batching

Chapter 5: PagedAttention

Chapter 6: Speculative Decoding

Chapter 7: Quantization for Serving

Chapter 8: Cost Optimization

Chapter 9: Capstone — Design a Serving System

Related Lessons