Ch 8: Serving LLaMA 3 — Scaling Book

Chapter 0: Hardware Choice

The first question for serving is: which hardware? The answer is usually whichever gives the most FLOPs per dollar. For inference, where we are often bandwidth-bound, we also care about HBM bandwidth per dollar.

Chip	bf16 FLOPs/s	HBM	HBM BW	USD/hr (per chip)	FLOPs/$
H100	9.9e14	80 GB	3350 GB/s	$10.80	3.3e17
v5p	4.59e14	96 GB	2765 GB/s	$4.20	3.9e17
v5e	1.97e14	16 GB	820 GB/s	$1.20	5.8e17

TPU v5e wins on FLOPs per dollar at 5.8e17, beating v5p (3.9e17) and H100 (3.3e17). That is why v5e is the standard choice for inference at Google — you get 76% more FLOPs per dollar than H100.

The catch: v5e only has 16 GB HBM. A 70B model in bf16 needs 140 GB just for weights. That means aggressive sharding across many chips, or quantization, or both.

Why not use v5p for inference? v5p has 96 GB HBM and much higher FLOPs, but costs 3.5x as much per chip. Since inference is usually bandwidth-bound (not compute-bound), the extra FLOPs of v5p go to waste. The more relevant metric is HBM bandwidth per dollar of hourly rental price:

v5e: 820 GB/s / $1.20 per hr = 683 GB/s per $/hr

v5p: 2765 GB/s / $4.20 per hr = 658 GB/s per $/hr

H100: 3350 GB/s / $10.80 per hr = 310 GB/s per $/hr

Even on bandwidth per dollar, v5e wins slightly. And for compute-bound workloads (large batch), v5e's FLOPs per dollar dominance makes it even more attractive.
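The two per-dollar metrics can be recomputed directly from the table. The sketch below is illustrative only: the dictionary layout and function names are made up for this example, and the prices are the table's hourly figures. Exact arithmetic gives about 5.9e17 FLOPs/$ for v5e, slightly above the table's rounded 5.8e17.

```python
# Chip specs copied from the table above; prices are USD per chip-hour.
CHIPS = {
    "H100": {"flops": 9.9e14, "hbm_bw_gbs": 3350.0, "usd_per_hr": 10.80},
    "v5p": {"flops": 4.59e14, "hbm_bw_gbs": 2765.0, "usd_per_hr": 4.20},
    "v5e": {"flops": 1.97e14, "hbm_bw_gbs": 820.0, "usd_per_hr": 1.20},
}


def flops_per_dollar(chip):
    """FLOPs bought by one dollar: FLOPs/s times the seconds one dollar rents."""
    return chip["flops"] * 3600.0 / chip["usd_per_hr"]


def bandwidth_per_dollar_hour(chip):
    """HBM bandwidth in GB/s per dollar-per-hour of rental price."""
    return chip["hbm_bw_gbs"] / chip["usd_per_hr"]


for name, chip in CHIPS.items():
    print(f"{name}: {flops_per_dollar(chip):.2e} FLOPs/$, "
          f"{bandwidth_per_dollar_hour(chip):.0f} GB/s per $/hr")
```

Ranking the chips by either function puts v5e first, which is the point the comparison above makes.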

The critical batch size on v5e with bf16 weights:

B* = C / W_HBM = 1.97e14 / 8.2e11 = 240

With int8 weights (half the bytes, same FLOPs):

B* = 120

What does this mean practically? Below BS=240 (bf16), we are wasting compute cycles. The MXU is idle for (1 - B/240) fraction of each step, waiting for HBM to feed it data. At BS=1, we use 1/240 = 0.4% of the MXU — 99.6% wasted!
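A minimal sketch of the critical batch size calculation, assuming the per-parameter accounting used above: each step does 2 × B FLOPs per parameter and loads `bytes_per_param` bytes per parameter. The function names are illustrative, not taken from the book.

```python
def critical_batch_size(flops_per_s, hbm_bw_bytes_per_s, bytes_per_param=2.0):
    """Batch size at which matmul time equals weight-loading time.

    Compute time per parameter is 2 * B / C; load time is bytes_per_param / W.
    Setting them equal gives B* = bytes_per_param * C / (2 * W).
    """
    return bytes_per_param * flops_per_s / (2.0 * hbm_bw_bytes_per_s)


def mxu_utilization(batch_size, b_star):
    """Fraction of peak matmul throughput used below the critical batch size."""
    return min(1.0, batch_size / b_star)


V5E_FLOPS = 1.97e14   # bf16 FLOPs/s per chip
V5E_HBM_BW = 8.2e11   # bytes/s per chip

b_star_bf16 = critical_batch_size(V5E_FLOPS, V5E_HBM_BW, bytes_per_param=2)
b_star_int8 = critical_batch_size(V5E_FLOPS, V5E_HBM_BW, bytes_per_param=1)
print(f"B* bf16: {b_star_bf16:.0f}, B* int8: {b_star_int8:.0f}")
print(f"MXU utilization at BS=1 (bf16): {mxu_utilization(1, b_star_bf16):.2%}")
```

For bf16 the formula reduces to C / W_HBM, matching the expression above; halving the bytes per parameter halves B*.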

This is the fundamental reason why batching is so important for LLM inference. Every batch element shares the cost of loading the model weights. The weight-loading cost is fixed regardless of batch size — it is the "cover charge" for each decode step. More sequences per batch amortize this fixed cost.

The analogy: Think of decode like a bus route. The bus (model weights) makes the same trip regardless of how many passengers (batch elements) it carries. One passenger pays the full fuel cost. 240 passengers split it 240 ways. The difference between batch size 1 and batch size 240 is like the difference between a private taxi and a packed bus: same destination, up to 240x cheaper per passenger.

Check: Why is TPU v5e preferred for inference despite having only 16 GB HBM?

It has the best FLOPs per dollar ratio (5.8e17), making it cheapest per token generated
It has the highest HBM bandwidth
It supports more parallelism strategies

Chapter 1: KV Cache Sizing

Before choosing a topology, we need to know how much memory the KV cache will use. For LLaMA 3-70B with int8 KV caches:

KV bytes/token = 2 × K × H × L = 2 × 8 × 128 × 80 = 163,840 bytes ≈ 160 kB

At different sequence lengths:

Sequence length	KV cache per sequence	BS=32	BS=240
2,048	328 MB	10.5 GB	78.6 GB
8,192	1.31 GB	41.9 GB	314 GB
32,768	5.24 GB	168 GB	1.26 TB

At batch size 240 and 8K context, the KV cache alone uses 314 GB. That is 4.5x the model parameter size (70 GB in int8). On v5e with 16 GB per chip, we need at least ceil((70 + 314) / 16) = 24 chips just for memory — and that is before any working memory for activations.
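The KV sizing arithmetic can be checked with a few lines of Python. This is a sketch under the text's assumptions (K=8 KV heads, head dimension H=128, L=80 layers, one byte per value for int8); the helper names are illustrative. The exact per-token size is 163,840 bytes, while the table uses the rounded 160 kB, so the loop below uses the rounded figure to stay comparable with the table.

```python
def kv_bytes_per_token(n_kv_heads, head_dim, n_layers, bytes_per_value=1):
    """Bytes of K and V stored per token across all layers."""
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_value


def kv_cache_gb(seq_len, batch_size, per_token_bytes):
    """Total KV cache size in GB (1 GB = 1e9 bytes)."""
    return seq_len * batch_size * per_token_bytes / 1e9


exact = kv_bytes_per_token(n_kv_heads=8, head_dim=128, n_layers=80)
print(f"exact per-token size: {exact} bytes")

ROUNDED = 160e3  # the text's rounded per-token figure
for seq_len in (2048, 8192, 32768):
    row = [kv_cache_gb(seq_len, bs, ROUNDED) for bs in (1, 32, 240)]
    print(seq_len, [f"{gb:.2f} GB" for gb in row])
```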

This reveals a fundamental tension: we want batch size 240 (to saturate compute), but the KV cache at that batch size may require more chips than we can efficiently shard across (due to ICI communication limits).

Practical approach: pick the batch size that fills available HBM on a given topology, then check if the resulting throughput is acceptable.

The memory equation for inference:

Total HBM = param_bytes + B × S × KV_per_token + working_memory

Working memory includes activation buffers for the forward pass. With Flash Attention, this is small (a few hundred MB). Without it, materializing the full attention score matrix during prefill takes B × N_heads × S × S × 2 bytes per layer. For B=32, S=8K, N=64: that is 32 × 64 × 8192 × 8192 × 2 = 275 GB! This is why Flash Attention (or equivalent) is mandatory for long-context serving.
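The memory equation and the score-matrix estimate can be written out as follows. This is a rough sketch: the function names are illustrative, and the 0.5 GB working-memory figure is an assumed stand-in for "a few hundred MB".

```python
def attention_scores_bytes(batch, n_heads, seq_len, bytes_per_value=2):
    """Size of a fully materialized B x N x S x S attention score tensor."""
    return batch * n_heads * seq_len * seq_len * bytes_per_value


def total_hbm_bytes(param_bytes, batch, seq_len, kv_per_token, working_bytes):
    """Total HBM = params + B * S * KV_per_token + working memory."""
    return param_bytes + batch * seq_len * kv_per_token + working_bytes


scores = attention_scores_bytes(batch=32, n_heads=64, seq_len=8192)
print(f"materialized scores: {scores / 1e9:.0f} GB")

# Assumed working memory of 0.5 GB, standing in for the Flash Attention case.
total = total_hbm_bytes(param_bytes=70e9, batch=32, seq_len=8192,
                        kv_per_token=160e3, working_bytes=0.5e9)
print(f"total HBM with Flash Attention: {total / 1e9:.1f} GB")
```

The comparison makes the scale difference concrete: the materialized scores alone are larger than the weights and KV cache combined.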

Practical implication: When computing max batch size, always leave 10-20% HBM headroom for working memory, CUDA/XLA overhead, and memory fragmentation. A "full" GPU/TPU at 95% HBM utilization will often OOM due to fragmentation.

Check: For LLaMA 70B with 32K context and BS=32 in int8, how large is the total KV cache?

5.24 GB
168 GB (5.24 GB × 32 sequences)
70 GB (same as model weights)

Chapter 2: Minimum Topology

What is the smallest TPU v5e slice that can serve LLaMA 3-70B? This depends entirely on quantization and KV cache needs:

Quantization	Param size	KV/token (KV in weight dtype)	Min TPU v5e chips	Actual min slice	Remaining HBM for KV
bf16	140 GB	324 kB	9	4×4 = 16	116 GB
int8	70 GB	162 kB	5	4×2 = 8	58 GB
int4	35 GB	81 kB	3	2×2 = 4	29 GB

How many KV caches fit in the remaining HBM? At 8K context in int8: KV per sequence = 1.31 GB. So on a 4×2 with 58 GB remaining: 58 / 1.31 = 44 sequences max. That is our maximum batch size on this topology.

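Let us compute the max batch size for each configuration at 8K context. To isolate the effect of weight quantization, the KV cache is held at int8 (1.31 GB per sequence) in all three cases: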

bf16 on 4×4: 116 GB / 1.31 GB = 88 sequences

int8 on 4×2: 58 GB / 1.31 GB = 44 sequences

int4 on 2×2: 29 GB / 1.31 GB = 22 sequences

None of these reach the critical batch size of 240 (bf16) or 120 (int8). So on minimum topologies, we are always memory-bandwidth-bound. Reaching compute-saturation requires larger slices.
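The topology sizing above follows a simple recipe: round the parameter size up to whole chips, pick a slice, and divide the leftover HBM by the per-sequence KV size. A minimal sketch, assuming 16 GB per v5e chip and the 1.31 GB int8 KV cache per 8K sequence used above; `usable_fraction` is an extra, hypothetical knob for reserving headroom and defaults to using all HBM.

```python
import math

HBM_PER_CHIP_GB = 16.0
KV_PER_SEQ_GB = 1.31  # int8 KV cache at 8K context, from the text


def min_chips(param_gb, hbm_per_chip_gb=HBM_PER_CHIP_GB):
    """Fewest chips whose combined HBM can hold the parameters."""
    return math.ceil(param_gb / hbm_per_chip_gb)


def max_batch_size(n_chips, param_gb, kv_per_seq_gb=KV_PER_SEQ_GB,
                   usable_fraction=1.0, hbm_per_chip_gb=HBM_PER_CHIP_GB):
    """Whole sequences whose KV caches fit in the HBM left after parameters."""
    free_gb = n_chips * hbm_per_chip_gb * usable_fraction - param_gb
    return max(0, math.floor(free_gb / kv_per_seq_gb))


for name, param_gb, slice_chips in (("bf16", 140, 16), ("int8", 70, 8), ("int4", 35, 4)):
    print(f"{name}: min chips {min_chips(param_gb)}, slice {slice_chips}, "
          f"max BS {max_batch_size(slice_chips, param_gb)}")
```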

An important subtlety: these numbers are optimistic. We assumed all remaining HBM goes to KV caches. In reality, we need:

Memory component	Size
Model parameters	70 GB (int8)
KV caches	Variable
Activation working memory	~0.5-2 GB (with Flash Attention)
XLA/CUDA runtime overhead	~0.5-1 GB per chip
Memory fragmentation	~5-15% of total

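A safe rule of thumb: assume 80% of nominal HBM is actually usable. On v5e: 0.8 × 16 = 12.8 GB effective per chip. On the minimum topologies above, where parameters already occupy more than half the HBM, this cuts max batch sizes by roughly 45%: for int8 on a 4×2, usable HBM for KV drops from 58 GB to 8 × 12.8 - 70 = 32.4 GB, or 24 sequences instead of 44.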

What about LLaMA 3-8B serving? Let us compute the same table for the smaller model:

Quantization	Param size	KV/token (K=8, H=128, L=32)	Min v5e chips	KV at 8K (per seq)
bf16	16 GB	131 kB	1 (no room left for KV)	1.07 GB
int8	8 GB	65.5 kB	1	537 MB
int4	4 GB	32.8 kB	1	268 MB

LLaMA 8B in int8 fits on a single v5e chip with 8 GB remaining for KV caches. At 8K context: max BS = 8 GB / 537 MB = 14 sequences. At BS=14 with int8 weights we are far below the critical batch size (B* = 240 for bf16 weights, 120 for int8 weights, since the matmuls still run at bf16 FLOPs), so a single chip is firmly bandwidth-bound.

For production, a 2x2 (4 chips) in int8 gives: 4 × 16 - 8 = 56 GB for KV, supporting BS=104 at 8K context. That is close to the critical batch size of 120, so the MLP is nearly compute-saturated. This is a very efficient configuration.

What about LLaMA 3-405B? This is a more extreme case:

int8 params: 405 GB. Min v5e chips: ceil(405/16) = 26 chips (round to 4×8 = 32)

KV per token: 2 × 8 × 128 × 126 × 1 = 258 kB (int8)

Remaining HBM: 32 × 16 - 405 = 107 GB

At 8K context: KV per seq = 258e3 × 8192 = 2.11 GB. Max BS = 107/2.11 = 50

Critical BS (int8 on v5e) = 120. We are at 50 < 120 ⇒ bandwidth-bound.

Step time = (405 + 50 × 2.11) / (32 × 820) ≈ 19.5 ms

Even the 405B model has a step time of ~20 ms on 32 v5e chips. The user-facing latency for generating 512 tokens: 0.0195 × 512 ≈ 10 seconds. Acceptable for most applications.

The cost per 1M tokens for 405B at BS=50 on 32 v5e:

Tok/s = 50 / 0.0195 = 2564. Cost/s = 32 × $1.20/3600 = $0.0107/s

$/1M tok = $0.0107 / 2564 × 1e6 = $4.17

About 5-6x more expensive per token than the 70B model. The model has 5.8x more parameters, so the scaling is roughly linear — as expected since both are bandwidth-bound.
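The 405B chain of estimates (KV size, max batch, step time, cost) can be strung together in one function. This is a sketch under the text's bandwidth-bound assumption, with v5e figures of 16 GB HBM, 820 GB/s, and $1.20 per chip-hour; the function name and return layout are illustrative. Because it skips intermediate rounding, the cost comes out a cent or two below the figure above.

```python
import math


def serving_estimate(param_gb, kv_per_token_bytes, seq_len, n_chips,
                     hbm_gb=16.0, hbm_bw_gbs=820.0, usd_per_chip_hr=1.20):
    """Bandwidth-bound estimate of max batch, step time, and cost per 1M tokens."""
    kv_per_seq_gb = kv_per_token_bytes * seq_len / 1e9
    free_gb = n_chips * hbm_gb - param_gb
    max_bs = math.floor(free_gb / kv_per_seq_gb)
    # Every step reads all parameters plus every active KV cache from HBM.
    step_s = (param_gb + max_bs * kv_per_seq_gb) / (n_chips * hbm_bw_gbs)
    tokens_per_s = max_bs / step_s
    usd_per_s = n_chips * usd_per_chip_hr / 3600.0
    return {
        "max_bs": max_bs,
        "step_ms": step_s * 1e3,
        "tokens_per_s": tokens_per_s,
        "usd_per_million": usd_per_s / tokens_per_s * 1e6,
    }


kv_405b = 2 * 8 * 128 * 126  # int8 bytes per token: K and V, 8 heads, 128 dims, 126 layers
est = serving_estimate(param_gb=405, kv_per_token_bytes=kv_405b, seq_len=8192, n_chips=32)
print({k: round(v, 2) for k, v in est.items()})
```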

The smallest topology is not the most cost-effective. On a 4×2 with int8 at BS=44, we only use 44/120 = 37% of compute capacity. Doubling to 4×4 gives us 16 × 16 - 70 = 186 GB remaining for KV caches, fitting BS=140 (int8), which is past the critical batch size of 120.

Check: Can LLaMA 70B in int4 fit on a TPU v5e 2x2 (4 chips)?

Yes: 35 GB params fits in 4 × 16 GB = 64 GB, with 29 GB left for KV caches
No: the KV cache will not fit
No: int4 is not supported on v5e

Chapter 3: Decode Latency

Now let us compute actual decode latency for LLaMA 70B. On a TPU v5e 4×2 (8 chips) with int8 params, BS=32, and 8K context:

Total memory to load = param_size + KV_cache = 70 GB + 41.9 GB = 112 GB

T_step = 112 GB / (8 × 820 GB/s) = 112 / 6560 = 17 ms

At BS=32: throughput = 32 / 0.017 = 1882 tokens/s, or 235 tokens/s/chip.

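A useful reference point: reading one chip's entire 16 GB of HBM takes 16 GB / 820 GB/s = 19.5 ms. This is the step time when HBM is completely full of parameters and KV caches, so it is a ceiling on the bandwidth-bound step time rather than a floor. Our 17 ms step is below it because the 4×2 slice is not quite full (112 GB of 128 GB). The true floor, set by parameter loading alone, is computed later in this chapter.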

On a 4×4 (16 chips):

T_step = 112 GB / (16 × 820 GB/s) = 8.5 ms

Latency halves, but throughput per chip stays the same (BS is unchanged). The benefit is purely lower latency.

Let us check if we are ICI-bound at these topologies. With 2 ICI axes for model parallelism:

Y_max (compute-bound) = F × 2 / 2200 = 28672 × 2 / 2200 = 26

So 8 and 16 are both well within the ICI limit. Good — ICI is not a bottleneck.

When to use a larger topology: If you have a latency SLA (e.g., <15 ms/step), you may need more chips even though throughput per chip stays constant. The 4×4 gives 8.5 ms vs 17 ms on 4×2.

Latency breakdown for the 4×2 config:

Component	Time	Fraction
Parameter loading (70 GB / 6560 GB/s)	10.7 ms	63%
KV cache loading (41.9 GB / 6560 GB/s)	6.4 ms	37%
Compute (2 × 70e9 × 32 / (8 × 1.97e14))	2.8 ms	(hidden behind HBM)
Total (HBM bound)	17.1 ms

Parameter loading dominates at this batch size (63% of step time). As batch size increases, KV cache loading grows and eventually dominates at very long contexts. The compute is completely hidden behind the HBM loading — we are solidly in the bandwidth-bound regime.
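The breakdown table can be reproduced with a small helper. A sketch only: the function name and dictionary keys are illustrative, and the defaults are the v5e figures used throughout (820 GB/s and 1.97e14 FLOPs/s per chip).

```python
def decode_step_breakdown(param_gb, kv_gb, n_params, batch_size, n_chips,
                          hbm_bw_gbs=820.0, flops_per_chip=1.97e14):
    """Split one decode step into HBM and compute time, all in milliseconds."""
    aggregate_bw = n_chips * hbm_bw_gbs  # GB/s across the whole slice
    param_ms = param_gb / aggregate_bw * 1e3
    kv_ms = kv_gb / aggregate_bw * 1e3
    compute_ms = 2 * n_params * batch_size / (n_chips * flops_per_chip) * 1e3
    hbm_ms = param_ms + kv_ms
    return {
        "param_ms": param_ms,
        "kv_ms": kv_ms,
        "compute_ms": compute_ms,
        "hbm_ms": hbm_ms,
        "hbm_bound": hbm_ms > compute_ms,
    }


b = decode_step_breakdown(param_gb=70, kv_gb=41.9, n_params=70e9,
                          batch_size=32, n_chips=8)
print(f"params {b['param_ms']:.1f} ms, KV {b['kv_ms']:.1f} ms, "
      f"compute {b['compute_ms']:.1f} ms, HBM-bound: {b['hbm_bound']}")
```

Rerunning it with `n_chips=16` halves every term, which is the 4×4 case discussed above.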

What is the absolute fastest decode we can achieve? The minimum is when we load only model parameters (KV cache is negligible, e.g., at BS=1 with short context):

Min latency (int8, 4×2) = 70 GB / (8 × 820 GB/s) = 10.7 ms

Min latency (int8, 4×4) = 70 GB / (16 × 820 GB/s) = 5.3 ms

Min latency (int4, 4×4) = 35 GB / (16 × 820 GB/s) = 2.7 ms

So the theoretical minimum decode step for LLaMA 70B in int4 on 16 v5e chips is about 2.7 ms, or ~370 tokens/s for a single sequence. In practice, expect 60-70% of this due to LayerNorm, activation functions, and other overhead.

Check: If decode step time is 17ms at BS=32 on 8 chips, what is the per-chip throughput?

32 / 0.017 / 8 = 235 tokens/s/chip
32 / 0.017 = 1882 tokens/s/chip
1 / 0.017 = 59 tokens/s/chip

Chapter 4: Throughput Optimization

To maximize throughput, we want to push the batch size as high as possible — ideally past the critical batch size B* = 120 (int8 on v5e). But each extra sequence needs KV cache memory.

The strategy: pick the batch size that fills all available HBM. On a fully loaded topology, the step time is simply:

T_step = HBM_total / (N × W_HBM) = 16 GB / 820 GB/s = 19.5 ms

This is independent of how the HBM is split between params and KV caches! Loading a full chip's worth of HBM always takes 19.5 ms.

With median decode length of 512 tokens, throughput in queries per second per chip:

QPS/chip = B / (T_step × median_decode × N)

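Quantization	Topology	Max BS (8K ctx)	QPS/chip
bf16	4×4	88	0.55
int8	4×2	44	0.55
int4	2×2	22	0.55

With the KV cache held at int8, all three minimum topologies land on the same throughput per chip. Each slice is full, so every step takes 19.5 ms, and the batch size shrinks in exact proportion to the chip count (88/16 = 44/8 = 22/4 = 5.5 sequences per chip). Quantization still matters for two reasons: (1) the model fits on fewer chips, so the smallest deployable replica is 4x cheaper in int4 than in bf16, and (2) on a fixed, larger topology, smaller weights leave more HBM for KV caches and therefore a larger batch. On a 4×4, for example, int8 weights fit BS=140 where bf16 fits only 88.

What about doubling the topology? On a 4×8 (32 chips) in bf16:

Remaining HBM = 32 × 16 - 140 = 372 GB

Max BS = 372 / 1.31 = 283 (with 8K context)

QPS/chip = 283 / (0.0195 × 512 × 32) = 0.89

That is 1.6x better throughput per chip than the minimum 4×4! The extra chips pay for themselves by enabling larger batch sizes.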

Cost per 1M output tokens: This is the key business metric. Let us compute it for each configuration.

Cost/s = N × $/hr / 3600

Tokens/s = B / T_step

Cost per 1M tokens = (Cost/s / Tokens/s) × 1e6

Config	Chips	Max BS	Tok/s	Cost per 1M tokens
bf16, 4×4	16	88	4513	$1.18
int8, 4×2	8	44	2256	$1.18
int4, 2×2	4	22	1128	$1.18
int8, 4×4	16	140	7179	$0.74
bf16, 4×8	32	283	14513	$0.73

Surprising result: On minimum topologies, the cost per token is nearly identical regardless of quantization! This is because on a minimum topology, HBM is fully loaded (19.5 ms/step) regardless of how it is split between params and KV caches. The wins from quantization appear when using larger topologies where the batch size increase outpaces the chip cost increase.
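The cost table follows from the three formulas above. A minimal sketch, assuming the rounded full-HBM step time of 19.5 ms and $1.20 per v5e chip-hour; the function name and the `CONFIGS` list are illustrative.

```python
FULL_HBM_STEP_S = 0.0195   # rounded 16 GB / 820 GB/s, as used in the text
V5E_USD_PER_HR = 1.20


def cost_per_million_tokens(n_chips, batch_size, step_time_s=FULL_HBM_STEP_S,
                            usd_per_chip_hr=V5E_USD_PER_HR):
    """Dollars per 1M output tokens for one replica at a given batch size."""
    cost_per_s = n_chips * usd_per_chip_hr / 3600.0
    tokens_per_s = batch_size / step_time_s
    return cost_per_s / tokens_per_s * 1e6


CONFIGS = [
    ("bf16, 4x4", 16, 88),
    ("int8, 4x2", 8, 44),
    ("int4, 2x2", 4, 22),
    ("int8, 4x4", 16, 140),
    ("bf16, 4x8", 32, 283),
]
for name, chips, bs in CONFIGS:
    print(f"{name}: ${cost_per_million_tokens(chips, bs):.2f} per 1M tokens")
```

With a fixed step time the cost depends only on chips per sequence, which is why the three minimum topologies (all at 5.5 sequences per chip) come out identical.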

The optimal serving configuration is not the smallest possible — it is the one that minimizes cost per token. Larger topologies enable larger batches, better compute utilization, and lower per-token cost. This is counterintuitive: spending more on hardware can reduce total serving cost.

Roofline code for computing these numbers: The book provides a simple Python script that computes the full Pareto frontier. Here is the key logic:

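num_chips = 16
param_count = 70e9     # LLaMA 3-70B parameters
bytes_per_param = 1    # int8 weights
param_bytes = param_count * bytes_per_param
hbm_bw = 8.2e11        # bytes/s per v5e chip
flops = 1.97e14        # bf16 FLOPs/s per v5e chip
kv_per_token = 160e3   # bytes per token, int8 KV cache

def step_time(bs, seq_len):
    """Roofline estimate of one decode step, in seconds."""
    # Attention: always bandwidth-bound, every step reads the whole KV cache.
    kv_time = kv_per_token * seq_len * bs / (num_chips * hbm_bw)
    # MLP: the slower of loading the weights and doing the matmul FLOPs.
    param_time = param_bytes / (num_chips * hbm_bw)
    # FLOPs scale with the parameter count, not the quantized byte size.
    flops_time = 2 * param_count * bs / (num_chips * flops)
    return max(flops_time, param_time) + kv_time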

Notice the elegant structure: the MLP term is max(compute, bandwidth), and the attention term is always bandwidth. The total step time is just their sum. This simple formula captures 90% of what you need to reason about inference performance.

The Pareto frontier in practice: Let us compute several points on the throughput-latency curve for a 4×4 (16 chips) in int8 at 8K context:

Batch size	KV total (GB)	Step time (ms)	Tok/s total	Tok/s/chip	$/1M tok
1	1.3	5.4	185	12	$28.80
8	10.5	6.1	1311	82	$4.07
32	41.9	8.5	3765	235	$1.42
64	83.9	11.7	5470	342	$0.97
120	157.3	17.3	6936	434	$0.77

The cost per 1M tokens drops by 37x going from BS=1 to BS=120, while latency only increases by 3.2x. This is the most important insight in all of LLM serving economics.
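The table can be regenerated by sweeping batch size through the roofline model. The sketch below restates that model so it runs standalone; the constant names and the cost helper are illustrative, and the $1.20 per chip-hour price is the v5e figure from the hardware table. Values differ from the table in the last digit because the table rounds the step time first.

```python
NUM_CHIPS = 16
PARAM_COUNT = 70e9        # LLaMA 3-70B parameters
PARAM_BYTES = 70e9        # int8 weights: one byte per parameter
HBM_BW = 8.2e11           # bytes/s per v5e chip
FLOPS = 1.97e14           # bf16 FLOPs/s per v5e chip
KV_PER_TOKEN = 160e3      # bytes per token, int8 KV cache
USD_PER_CHIP_HR = 1.20


def step_time(bs, seq_len):
    """One decode step in seconds: max(compute, weight load) plus KV load."""
    kv_time = KV_PER_TOKEN * seq_len * bs / (NUM_CHIPS * HBM_BW)
    param_time = PARAM_BYTES / (NUM_CHIPS * HBM_BW)
    flops_time = 2 * PARAM_COUNT * bs / (NUM_CHIPS * FLOPS)
    return max(flops_time, param_time) + kv_time


def usd_per_million_tokens(bs, seq_len):
    """Replica cost per second divided by tokens per second, scaled to 1M tokens."""
    tokens_per_s = bs / step_time(bs, seq_len)
    usd_per_s = NUM_CHIPS * USD_PER_CHIP_HR / 3600.0
    return usd_per_s / tokens_per_s * 1e6


for bs in (1, 8, 32, 64, 120):
    t = step_time(bs, 8192)
    print(f"BS={bs:>3}: step {t * 1e3:5.1f} ms, {bs / t:6.0f} tok/s, "
          f"${usd_per_million_tokens(bs, 8192):.2f} per 1M tok")
```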

Interpreting the cost column: At BS=120 on 16 v5e chips, the cost is $0.77 per 1M output tokens. For reference:

A typical GPT-4-class response is ~500 tokens

Cost per response = 500 / 1e6 × $0.77 = $0.000385 ≈ $0.04 per 100 responses

That is less than a penny per 25 responses. At this cost, the dominant expense for a chatbot is not inference — it is everything else (engineering, safety, infrastructure, customer support).

How does this compare to training cost? We estimated ~$40M to train LLaMA 70B. At $0.77 per 1M output tokens, the training cost is equivalent to generating:

$40M / $0.77 × 1M = 51.9 trillion output tokens

LLaMA 3 was trained on 15T tokens. So the training cost equals the inference cost of generating roughly 3.5x the training dataset. For a popular model serving millions of users daily, this crossover happens within months. This is why the "inference-aware" scaling philosophy (train smaller models on more data) makes economic sense — reducing model size directly reduces per-token inference cost.

Sensitivity analysis — what matters most for serving cost?

Change	Effect on $/1M tok	Why
int8 → int4 weights	-40%	Half the param loading, more room for KV batch
2K → 32K context	+200%	KV loading dominates at long context
GQA-8 → MHA-64	+400%	8x more KV per token, smaller batch
v5e → H100	+75%	Higher cost per chip despite higher BW
BS=32 → BS=120	-50%	Better MXU utilization

GQA is the single biggest architectural win for inference cost. Going from MHA-64 to GQA-8 reduces KV cache by 8x, enabling 8x larger batch sizes, which in turn gives roughly 4-5x better throughput per chip.

The complete serving checklist for LLaMA 70B:

1. Choose hardware

v5e for best FLOPs/$. H100 if you need more HBM per chip.

↓

2. Choose quantization

int8 weights + int8 KV is the sweet spot. int4 if quality permits.

↓

3. Size the topology

Min chips = ceil(param_bytes / HBM). Add chips until max BS ≥ B*.

↓

4. Check ICI limits

TP degree < F × M_Y / 2200 (compute-bound) or F/(8B) (BW-bound).

↓

5. Decide on disaggregation

If prefill latency > 200ms (long prompts), separate prefill and decode servers.

↓

6. Scale with replicas

Multiply replicas for throughput. Load-balance with autoscaling.

This chapter has given you all the tools to make each of these decisions quantitatively. The formulas are simple — the art is knowing which ones to apply in which order.

A final worked problem — serving LLaMA 70B for 10K QPS:

Config per replica: 16 v5e, int8, BS=64, 8K context

Step time ≈ 11.7 ms (the BS=64 row of the Pareto table above). Decode per query (512 tok) = 5.99 s

QPS per replica = 64 / 5.99 = 10.68

Replicas needed = 10,000 / 10.68 ≈ 936

Total chips = 936 × 16 = 14,976 TPU v5e

Cost = 14,976 × $1.20/hr = $17,971/hour = $431K/day

Cost per query = $17,971 / (10,000 × 3600) ≈ $0.0005. For 10K QPS over 24 hours: 864M queries at $431K/day.

This is the scale of a major consumer-facing LLM product. The infrastructure cost alone is ~$157M/year. This explains why LLM API pricing is what it is, and why every percentage point of efficiency improvement matters at this scale.
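The fleet sizing can be packaged as a function of the target QPS and the per-replica step time. This is a sketch with illustrative names; the step time is an input, and the example passes 11.7 ms, the BS=64 value from the Pareto table in this chapter, so the result moves with whatever step time is assumed.

```python
import math


def fleet_size(target_qps, batch_size, step_time_s, decode_tokens,
               chips_per_replica, usd_per_chip_hr=1.20):
    """Replicas, chips, and hourly cost needed to sustain a target query rate."""
    seconds_per_query = step_time_s * decode_tokens
    qps_per_replica = batch_size / seconds_per_query
    # Round before ceil so floating-point noise does not add a spurious replica.
    replicas = math.ceil(round(target_qps / qps_per_replica, 6))
    chips = replicas * chips_per_replica
    usd_per_hr = chips * usd_per_chip_hr
    return {
        "qps_per_replica": qps_per_replica,
        "replicas": replicas,
        "chips": chips,
        "usd_per_hr": usd_per_hr,
        "usd_per_query": usd_per_hr / (target_qps * 3600.0),
    }


plan = fleet_size(target_qps=10_000, batch_size=64, step_time_s=0.0117,
                  decode_tokens=512, chips_per_replica=16)
print(plan)
```

The per-query cost is the hourly bill divided by queries per hour, so it stays near $0.0005 regardless of how many replicas the target QPS requires.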

Check: Why does doubling the topology increase per-chip throughput?

More total HBM means more KV cache space, enabling a higher batch size that better saturates the compute
Doubling chips doubles bandwidth, halving step time
The model runs faster on more chips due to less communication

Chapter 5: Sharding for Inference

During generation, we use pure model parallelism (tensor parallelism). There is no data parallelism within a serving replica — all chips work together on the same batch.

The key question: how far can we scale model parallelism before ICI becomes the bottleneck?

In the compute-bound regime (large batch):

Y_max = F × M_Y / 2200

For LLaMA 70B (F=28672) with 2 axes: Y_max = 26.

In the bandwidth-bound regime (small batch), we can overlap ICI with HBM loading. The limit becomes:

Y_max = F / (8B)

At BS=64: Y_max = 28672 / 512 = 56. We can scale further when bandwidth-bound!
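Both limits are one-line formulas, sketched here with illustrative function names. The constant 2200 is taken from the text as given; it is close to the v5e ratio of compute to ICI bandwidth (1.97e14 / 9e10 ≈ 2189), which is presumably where it comes from.

```python
def max_tp_compute_bound(d_ff, n_ici_axes, compute_to_ici_ratio=2200):
    """Largest tensor-parallel degree before ICI limits a compute-bound MLP."""
    return d_ff * n_ici_axes / compute_to_ici_ratio


def max_tp_bandwidth_bound(d_ff, batch_size):
    """Largest tensor-parallel degree when ICI can hide behind HBM loading."""
    return d_ff / (8 * batch_size)


F = 28672  # LLaMA 3-70B MLP hidden dimension
print(f"compute-bound limit (2 axes): {max_tp_compute_bound(F, 2):.0f}")
for bs in (32, 64, 128):
    print(f"bandwidth-bound limit at BS={bs}: {max_tp_bandwidth_bound(F, bs):.0f}")
```

The bandwidth-bound limit shrinks as batch size grows, so the extra headroom only exists at small batch sizes.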

Sanity check for a 4×8 (32 chips) at BS=64. Let us compute each time component for one MLP matmul:
T_ICI = 2BD / W_ICI = 2 × 64 × 8192 / 9e10 ≈ 12 μs
T_HBM = 2DF / (Y × W_HBM) = 2 × 8192 × 28672 / (32 × 8.2e11) ≈ 18 μs
T_math = 2BDF / (Y × C) = 2 × 64 × 8192 × 28672 / (32 × 1.97e14) ≈ 5 μs
Since T_HBM > T_ICI > T_math, we are HBM bandwidth-bound — the ideal regime!
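The three-way comparison can be scripted so it is easy to rerun at other batch sizes or TP degrees. A sketch using the same formulas and v5e constants as the sanity check; the function name and dictionary keys are illustrative.

```python
def mlp_matmul_times_us(batch, d_model, d_ff, tp_degree,
                        w_ici=9e10, w_hbm=8.2e11, flops=1.97e14):
    """ICI, HBM, and compute time in microseconds for one sharded MLP matmul."""
    t_ici = 2 * batch * d_model / w_ici                         # bf16 activations
    t_hbm = 2 * d_model * d_ff / (tp_degree * w_hbm)            # bf16 weight shard
    t_math = 2 * batch * d_model * d_ff / (tp_degree * flops)   # matmul FLOPs
    return {"ici": t_ici * 1e6, "hbm": t_hbm * 1e6, "math": t_math * 1e6}


t = mlp_matmul_times_us(batch=64, d_model=8192, d_ff=28672, tp_degree=32)
bottleneck = max(t, key=t.get)
print({k: round(v, 1) for k, v in t.items()}, "bottleneck:", bottleneck)
```

Raising `batch` grows the ICI and compute terms while the HBM term stays fixed, which shows where the bandwidth-bound regime ends.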

For the KV cache sharding: LLaMA 3-70B has K=8 KV heads. With 8-way TP, each chip stores exactly 1 KV head. Beyond 8-way TP, we must split individual heads across chips, adding complexity.

KV head sharding in detail: With K=8 KV heads and Y-way TP:

TP degree	KV heads per chip	Split needed?	Extra ICI for attention?
4	2	No	None (each chip has full heads)
8	1	No	None
16	0.5	Yes (split across 2 chips)	AllGather within head-shard groups
32	0.25	Yes (split across 4 chips)	More AllGather overhead

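At 16-way TP, each KV head is split across 2 chips, requiring an AllGather within each head-shard group. A worst-case estimate moves volume proportional to B × S × H per layer per TP group. For LLaMA 70B at BS=64, S=8192: that is 64 × 8192 × 128 × 2 = 134 MB per layer. At 90 GB/s ICI, this is 1.5 ms per layer, or about 120 ms across 80 layers if none of it is overlapped. That is far more than the 8.5 ms HBM-bound step time, so treat this figure as an upper bound on the cost of gathering the full per-head cache: it is the reason splitting KV heads beyond 8-way TP needs care.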

Topology	TP degree	ICI status	Best for
4×2	8	Well within limit	Cost-optimized serving
4×4	16	Within limit	Low latency
4×8	32	Marginal (need small B)	Ultra-low latency

Check: Why can we scale to more model parallelism in the bandwidth-bound regime than the compute-bound regime?

In the bandwidth-bound regime, HBM loading dominates, and ICI comms can be hidden behind the slower HBM reads
The ICI bandwidth increases at small batch sizes
Smaller batches require less ICI communication

Chapter 6: Disaggregated Serving

Prefill and decode have different bottlenecks. Prefill is compute-bound; decode is bandwidth-bound. Running them on the same hardware means one phase is always under-utilizing the hardware.

Disaggregated serving runs prefill on compute-optimized chips and decode on bandwidth-optimized chips. The KV cache is transferred between them after prefill completes.

Prefill servers

High FLOPs/s, large batch of prompts. Process in parallel. Ship KV caches to decode servers.

↓ KV cache transfer

Decode servers

High HBM bandwidth, batch of active sequences. Generate tokens autoregressively.

Let us size the prefill:decode ratio. Assume median prefill = 8192 tokens, median decode = 512 tokens:

Prefill latency = 0.91 s (16 v5e chips, 40% MFU)

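Decode latency per seq = 0.019 × 512 = 9.7 s (BS=32, 16 chips, using the ~19 ms full-HBM step time as a conservative figure)

Prefill feeds: 1 / 0.91 = 1.1 sequences/s per prefill server

Decode consumes: 32 / 9.7 = 3.3 sequences/s per decode server

Ratio: we need 3.3 / 1.1 = 3 prefill servers per decode server

Prefill is the bottleneck. Each prefill server finishes one 8192-token prompt in just under a second, while a decode server carries 32 sequences at once and completes them all in about 10 seconds. We therefore need about 3x as many prefill servers as decode servers. This makes intuitive sense: prefill processes 16x more tokens per request (8192 versus 512), and even though it is compute-bound and far faster per token than decode, it is not fast enough to make up the difference. The ratio is sensitive to the decode step time: with the 8.5 ms step computed earlier for BS=32 on 16 chips, decode drains sequences more than twice as fast and the ratio rises to roughly 7:1.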

The transfer cost: shipping 8192 tokens of KV cache = 8192 × 160 kB = 1.3 GB. Over inter-node network at ~50 GB/s, this takes 26 ms — negligible compared to the seconds-scale compute.
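The ratio calculation generalizes to other prompt lengths and step times. A sketch with illustrative function names, assuming 2 FLOPs per parameter per token for prefill, the text's 40% MFU, a 19 ms decode step, and a 50 GB/s network for the KV transfer.

```python
def prefill_seconds(n_params, prompt_tokens, n_chips,
                    flops_per_chip=1.97e14, mfu=0.4):
    """Compute-bound prefill time for one prompt."""
    return 2 * n_params * prompt_tokens / (n_chips * flops_per_chip * mfu)


def prefill_to_decode_ratio(prefill_s, decode_step_s, decode_tokens, decode_batch):
    """Prefill servers needed to keep one decode server fed."""
    prefill_rate = 1.0 / prefill_s                                 # sequences/s produced
    decode_rate = decode_batch / (decode_step_s * decode_tokens)   # sequences/s consumed
    return decode_rate / prefill_rate


def kv_transfer_seconds(prompt_tokens, kv_per_token_bytes, network_bytes_per_s):
    """Time to ship one prompt's KV cache from a prefill to a decode server."""
    return prompt_tokens * kv_per_token_bytes / network_bytes_per_s


p = prefill_seconds(n_params=70e9, prompt_tokens=8192, n_chips=16)
ratio = prefill_to_decode_ratio(p, decode_step_s=0.019, decode_tokens=512, decode_batch=32)
transfer = kv_transfer_seconds(8192, 160e3, 50e9)
print(f"prefill {p:.2f} s, ratio {ratio:.1f}:1, KV transfer {transfer * 1e3:.0f} ms")
```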

When does disaggregation help?

Scenario	Disaggregated?	Reason
Long prompts, short outputs	Yes	Prefill dominates; separate pools let you scale prefill independently
Short prompts, long outputs	Maybe not	Decode dominates; KV transfer overhead may not be worth it
Mixed traffic	Yes	Prevents long prefills from blocking active decode batches
Very small deployment	No	Not enough traffic to justify two separate server pools

The interference problem: Without disaggregation, a long prefill request (e.g., 32K tokens) arriving in a batch of actively decoding sequences will stall all decode slots while the prefill runs. This causes latency spikes for in-flight decode requests. Disaggregation eliminates this interference entirely.

Worked example — cost of the interference problem:

Decode batch: 32 sequences actively generating, step time = 17 ms

New request arrives: 32K token prompt needs prefill

Prefill time for 32K tokens = 2 × 70e9 × 32768 / (16 × 1.97e14 × 0.4) = 3.6 seconds!

All 32 active sequences are blocked for 3.6 seconds

Decode steps missed: 3.6 s / 0.017 s ≈ 212 steps

That is 212 × 32 ≈ 6,776 tokens of decode throughput lost!

This is catastrophic for latency SLAs. Users with active conversations experience a multi-second pause. Disaggregation or chunked prefill is not optional — it is a requirement for production serving.

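Chunked prefill as an alternative: Instead of disaggregating, we can break the 32K prefill into 64 chunks of 512 tokens each. Each decode step processes one chunk alongside the normal decode batch. The prefill FLOPs do not go away: at the same 40% MFU, one 512-token chunk costs 3.6 s / 64 ≈ 57 ms of compute, so a step that carries a chunk takes roughly 57 ms instead of 17 ms, and the new request still needs about 64 × 57 ms ≈ 3.6 seconds to finish prefill. What changes is that decode is slowed rather than blocked: active sequences keep emitting a token every ~57 ms instead of pausing for 3.6 seconds. Smaller chunks shrink the per-step slowdown further, at the cost of a longer time to first token for the new request.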

Check: In disaggregated serving for LLaMA 70B, which phase needs more server replicas?

Prefill: it is slower per request and we need ~3x more prefill servers
Decode: it processes one token at a time
They need equal replicas

Chapter 7: Cost / Throughput Calculator

Explore the latency-throughput Pareto frontier for LLaMA 70B serving on TPU v5e. Adjust batch size and sequence length to see the dramatic tradeoff between cost and latency.

[Interactive widget: Serving Cost Explorer. LLaMA 70B (int8 params, int8 KV) on 16x TPU v5e, with controls for batch size (default 32), sequence length (default 8192), and chips (default 16).]

Check: At the cost of doubling per-token latency, roughly how much can you reduce per-token cost?

Roughly 30x, by batching many sequences together (BS=1 to BS=64 in the Pareto table)
2x: doubling latency halves cost
No change: cost is independent of batch size

Chapter 8: Takeaways

Serving LLaMA 3-70B on TPUs teaches us the essential economics of LLM inference:

Key finding	Value
Best hardware (FLOPs/$)	TPU v5e at 5.8e17 FLOPs/$
KV cache per token	160 kB (int8)
Min topology (int8)	4×2 = 8 chips
Critical batch (int8 on v5e)	B* = 120
Step time with full HBM	HBM/W = 19.5 ms
Prefill:Decode ratio	~3:1 prefill servers needed
Max useful TP	~16-26 (depends on batch and ICI axes)

The three levers for serving cost:
1. Quantize aggressively. int8 halves param memory, halves critical batch size, doubles effective throughput per chip. int4 goes further.
2. Maximize batch size. Fill HBM with KV caches. The throughput improvement from BS=1 to BS=120 is ~37x while latency rises only ~3.2x.
3. Right-size the topology. The minimum topology is not the cheapest per token. Extra chips enable larger batches for better hardware utilization.

The latency-throughput tradeoff is extreme: at BS=1 on 16 chips, latency is ~5.4 ms but per-chip throughput is only 12 tok/s. At BS=120, latency rises to ~17 ms (3.2x) but per-chip throughput jumps to 434 tok/s (about 37x). Almost all production systems operate at high batch sizes because the cost savings are overwhelming.

Comparison with real-world API pricing:

Provider	Model	Price per 1M output tokens
OpenAI	GPT-4o	$10-15
Google	Gemini 1.5 Pro	$5-10
Meta (self-hosted)	LLaMA 70B	$0.70-1.20 (our estimate)

Our back-of-the-envelope estimate of $0.70-1.20 per 1M tokens for LLaMA 70B on v5e is in line with third-party hosting costs from providers like Together.ai and Anyscale. The premium charged by OpenAI and Google reflects API margins, safety infrastructure, and the amortized cost of training.

The road to cheaper inference:

Better quantization

FP8/INT4 with calibration can reduce param size by 2-4x with minimal quality loss

↓

Architectural improvements

MQA/GQA reduces KV cache. Smaller models with better training data match larger ones.

↓

Hardware improvements

Higher HBM bandwidth (HBM3e), larger HBM capacity, better FLOPs/$

↓

System optimizations

Speculative decoding, prefix caching, continuous batching, disaggregated serving

Key formula summary:
Step latency: T = (params + KV) / (N × W_HBM)
Throughput: B / T_step
QPS/chip: B / (T_step × median_decode × N)
Min chips: ceil(param_bytes / HBM_per_chip)
Max BS: (N × HBM - params) / KV_per_seq

Check: What is the single most impactful optimization for reducing LLaMA 70B serving cost?

Using a larger topology
Weight quantization (int8/int4): it halves memory, enables larger batches, and doubles effective throughput
Pipeline parallelism across pods