Cost per million tokens, hardware selection, sharding strategies, and the latency-throughput Pareto frontier — all derived from first principles.
The first question for serving is: which hardware? The answer is usually whichever gives the most FLOPs per dollar. For inference, where we are often bandwidth-bound, we also care about HBM bandwidth per dollar.
| Chip | bf16 FLOPs/s | HBM | HBM BW | USD/hr | FLOPs/$ |
|---|---|---|---|---|---|
| H100 | 9.9e14 | 80 GB | 3350 GB/s | $10.80 | 3.3e17 |
| v5p | 4.59e14 | 96 GB | 2765 GB/s | $4.20 | 3.9e17 |
| v5e | 1.97e14 | 16 GB | 820 GB/s | $1.20 | 5.8e17 |
The catch: v5e only has 16 GB HBM. A 70B model in bf16 needs 140 GB just for weights. That means aggressive sharding across many chips, or quantization, or both.
Why not use v5p for inference? v5p has 96 GB HBM and much higher FLOPs, but costs 3.5x more per chip. Since inference is usually bandwidth-bound (not compute-bound), the extra FLOPs of v5p go to waste. The relevant metric is HBM bandwidth per dollar, not FLOPs per dollar:
Even on bandwidth per dollar, v5e wins slightly. And for compute-bound workloads (large batch), v5e's FLOPs per dollar dominance makes it even more attractive.
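The comparison can be checked with a few lines of Python. This is a minimal sketch that only reuses the specs and hourly prices from the table above; the helper name `per_dollar` is made up for illustration.

```python
# Specs from the hardware table: (bf16 FLOPs/s, HBM bytes/s, USD per chip-hour).
CHIPS = {
    "H100": (9.9e14, 3.35e12, 10.80),
    "v5p": (4.59e14, 2.765e12, 4.20),
    "v5e": (1.97e14, 8.2e11, 1.20),
}

def per_dollar(rate_per_second, usd_per_hour):
    """Work delivered per dollar, given a per-second rate and an hourly price."""
    return rate_per_second * 3600 / usd_per_hour

flops_per_dollar = {n: per_dollar(f, usd) for n, (f, _, usd) in CHIPS.items()}
bytes_per_dollar = {n: per_dollar(bw, usd) for n, (_, bw, usd) in CHIPS.items()}

for name in CHIPS:
    print(f"{name}: {flops_per_dollar[name]:.2e} FLOPs/$, "
          f"{bytes_per_dollar[name]:.2e} HBM bytes/$")
```

Under these prices v5e comes out ahead of v5p by only a few percent on bandwidth per dollar, while H100 trails both by roughly a factor of two.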
The critical batch size on v5e with bf16 weights:
With int8 weights (half the bytes, same FLOPs):
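Both values follow from a simple roofline balance, sketched below under the usual assumption that a decode step performs about 2 FLOPs per parameter per sequence and streams every weight from HBM once. Setting compute time equal to weight-loading time gives the critical batch size. The function name is illustrative.

```python
V5E_FLOPS = 1.97e14   # bf16 FLOPs/s per chip (from the hardware table)
V5E_HBM_BW = 8.2e11   # HBM bytes/s per chip (820 GB/s)

def critical_batch_size(bytes_per_param, flops=V5E_FLOPS, hbm_bw=V5E_HBM_BW):
    # compute time = 2 * params * batch / flops
    # loading time = params * bytes_per_param / hbm_bw
    # The two are equal when batch = (flops / hbm_bw) * bytes_per_param / 2.
    return flops / hbm_bw * bytes_per_param / 2

b_bf16 = critical_batch_size(2)   # bf16 weights: 2 bytes per parameter
b_int8 = critical_batch_size(1)   # int8 weights: 1 byte per parameter
print(round(b_bf16), round(b_int8))
```

This gives roughly 240 for bf16 weights and 120 for int8 weights.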
What does this mean practically? Below BS=240 (bf16), we are wasting compute cycles. The MXU is idle for (1 - B/240) fraction of each step, waiting for HBM to feed it data. At BS=1, we use 1/240 = 0.4% of the MXU — 99.6% wasted!
This is the fundamental reason why batching is so important for LLM inference. Every batch element shares the cost of loading the model weights. The weight-loading cost is fixed regardless of batch size — it is the "cover charge" for each decode step. More sequences per batch amortize this fixed cost.
The analogy: Think of decode like a bus route. The bus (model weights) makes the same trip regardless of how many passengers (batch elements) it carries. One passenger pays the full fuel cost. 240 passengers split it 240 ways. The difference between batch size 1 and batch size 240 is like the difference between a private taxi and a packed bus: same destination, up to 240x cheaper per passenger.
Before choosing a topology, we need to know how much memory the KV cache will use. For LLaMA 3-70B with int8 KV caches:
At different sequence lengths:
| Sequence length | KV cache per sequence | BS=32 | BS=240 |
|---|---|---|---|
| 2,048 | 328 MB | 10.5 GB | 78.6 GB |
| 8,192 | 1.31 GB | 41.9 GB | 314 GB |
| 32,768 | 5.24 GB | 168 GB | 1.26 TB |
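The table can be reproduced with a short sketch. It assumes the published LLaMA 3-70B shape (80 layers, 8 KV heads, head dimension 128) and one byte per value for the int8 cache. The exact figure is 163,840 bytes per token; the table appears to use the rounded value of 160 kB, so the sketch does the same.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Assumed LLaMA 3-70B shape with an int8 cache.
exact = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=1)
ROUNDED = 160e3  # rounded per-token figure that matches the table

def kv_cache_gb(seq_len, batch, per_token=ROUNDED):
    return per_token * seq_len * batch / 1e9

for seq_len in (2048, 8192, 32768):
    row = [kv_cache_gb(seq_len, b) for b in (1, 32, 240)]
    print(seq_len, [f"{gb:.2f} GB" for gb in row])
```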
This reveals a fundamental tension: we want batch size 240 (to saturate compute), but the KV cache at that batch size may require more chips than we can efficiently shard across (due to ICI communication limits).
Practical approach: pick the batch size that fills available HBM on a given topology, then check if the resulting throughput is acceptable.
The memory equation for inference:
Working memory includes activation buffers for the forward pass. With Flash Attention, this is small (a few hundred MB). Without it, the attention score matrix can be large: B × Nheads × S × S × 2 bytes. For B=32, S=8K, N=64: that is 32 × 64 × 8192 × 8192 × 2 = 275 GB! This is why Flash Attention (or equivalent) is mandatory for long-context serving.
Practical implication: When computing max batch size, always leave 10-20% HBM headroom for working memory, CUDA/XLA overhead, and memory fragmentation. A "full" GPU/TPU at 95% HBM utilization will often OOM due to fragmentation.
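A quick sketch of both numbers in this passage: the size of an unfused attention score matrix, and the HBM left after reserving headroom. The `headroom` parameter is a hypothetical knob set inside the 10-20% range suggested above.

```python
def attention_scores_bytes(batch, n_heads, seq_len, bytes_per_value=2):
    # Full S x S score matrix per head, as materialized without a fused kernel.
    return batch * n_heads * seq_len * seq_len * bytes_per_value

def usable_hbm_gb(n_chips, hbm_per_chip_gb=16, headroom=0.15):
    # headroom: fraction of HBM reserved for working memory, runtime, fragmentation.
    return n_chips * hbm_per_chip_gb * (1 - headroom)

scores_gb = attention_scores_bytes(batch=32, n_heads=64, seq_len=8192) / 1e9
print(f"unfused attention scores: {scores_gb:.0f} GB")
print(f"usable HBM on 8 v5e chips: {usable_hbm_gb(8):.1f} GB")
```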
What is the smallest TPU v5e slice that can serve LLaMA 3-70B? This depends entirely on quantization and KV cache needs:
| Quantization | Param size | KV/token | Min TPU v5e chips | Actual min slice | Remaining HBM for KV |
|---|---|---|---|---|---|
| bf16 | 140 GB | 324 kB | 9 | 4×4 = 16 | 116 GB |
| int8 | 70 GB | 162 kB | 5 | 4×2 = 8 | 58 GB |
| int4 | 35 GB | 81 kB | 3 | 2×2 = 4 | 29 GB |
Let us compute the max batch size for each configuration at 8K context:
None of these reach the critical batch size of 240 (bf16) or 120 (int8). So on minimum topologies, we are always memory-bandwidth-bound. Reaching compute-saturation requires larger slices.
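The batch sizes can be estimated by dividing the HBM left after parameters by the KV cache of one 8K sequence. The sketch below is one way to do this from the table's numbers. It evaluates two assumptions: the KV cache stored in the same precision as the weights (the table's KV/token column), and an int8 KV cache at 160 kB per token regardless of weight precision.

```python
HBM_PER_CHIP_GB = 16   # TPU v5e
SEQ_LEN = 8192

# (param size in GB, KV bytes per token, chips in the minimum slice), from the table.
CONFIGS = {
    "bf16": (140, 324e3, 16),
    "int8": (70, 162e3, 8),
    "int4": (35, 81e3, 4),
}

def max_batch(param_gb, kv_per_token, n_chips, seq_len=SEQ_LEN, usable=1.0):
    # usable < 1.0 models HBM headroom; 1.0 is the optimistic case.
    kv_budget_gb = n_chips * HBM_PER_CHIP_GB * usable - param_gb
    kv_per_seq_gb = kv_per_token * seq_len / 1e9
    return int(kv_budget_gb // kv_per_seq_gb)

for name, (param_gb, kv_tok, chips) in CONFIGS.items():
    same_dtype = max_batch(param_gb, kv_tok, chips)
    int8_kv = max_batch(param_gb, 160e3, chips)
    print(f"{name}: KV in weight dtype -> {same_dtype}, int8 KV -> {int8_kv}")
```

With the KV cache in the weight dtype, all three slices hold about 43 sequences. With an int8 KV cache throughout, they hold 88, 44, and 22. Either way, every value is below the critical batch size.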
An important subtlety: these numbers are optimistic. We assumed all remaining HBM goes to KV caches. In reality, we need:
| Memory component | Size |
|---|---|
| Model parameters | 70 GB (int8) |
| KV caches | Variable |
| Activation working memory | ~0.5-2 GB (with Flash Attention) |
| XLA/CUDA runtime overhead | ~0.5-1 GB per chip |
| Memory fragmentation | ~5-15% of total |
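A safe rule of thumb: assume 80% of nominal HBM is actually usable. On v5e: 0.8 × 16 = 12.8 GB effective per chip. Because the parameters are a fixed cost, the whole 20% comes out of the KV cache budget. On the int8 4×2 slice, KV space drops from 58 GB to 0.8 × 128 - 70 = 32.4 GB, which cuts the max batch size by about 44%.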
What about LLaMA 3-8B serving? Let us compute the same table for the smaller model:
| Quantization | Param size | KV/token (K=8, H=128, L=32) | Min v5e chips | KV at 8K (per seq) |
|---|---|---|---|---|
| bf16 | 16 GB | 131 kB | 2 | 1.07 GB |
| int8 | 8 GB | 65.5 kB | 1 | 537 MB |
| int4 | 4 GB | 32.8 kB | 1 | 268 MB |
The KV/token column is 2 × K × H × L × bytes per value, the same formula used for the 70B model. In bf16 the 16 GB of parameters fill a single chip's HBM on their own, so two chips are the practical minimum.
LLaMA 8B in int8 fits on a single v5e chip with 8 GB remaining for KV caches. At 8K context: max BS = 8 GB / 537 MB = 14 sequences. That is far below the critical batch size (B* = 240 for bf16 weights, 120 for int8), so a single chip is firmly bandwidth-bound.
For production, a 2x2 (4 chips) in int8 gives: 4 × 16 - 8 = 56 GB for KV, supporting BS=104 at 8K context, close to B* = 120. With an int4 KV cache (268 MB per sequence) the same slice holds roughly 208 sequences, enough to be fully compute-bound in the MLP. This is a very efficient configuration.
What about LLaMA 3-405B? This is a more extreme case:
Even the 405B model has a step time of ~20 ms on 32 v5e chips. The user-facing latency for generating 512 tokens: 0.02 × 512 = 10.2 seconds. Acceptable for most applications.
The cost per 1M tokens for 405B at BS=50 on 32 v5e:
About 5-6x more expensive per token than the 70B model. The model has 5.8x more parameters, so the scaling is roughly linear — as expected since both are bandwidth-bound.
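The 405B figures can be approximated with the sketch below. It assumes the 32-chip slice is essentially full (405 GB of int8 weights plus KV caches in 512 GB of HBM), so each decode step streams about one full HBM per chip, and it uses the v5e price from the hardware table.

```python
N_CHIPS = 32
USD_PER_CHIP_HOUR = 1.20
HBM_GB, HBM_BW_GBPS = 16, 820

# Assumption: HBM is full, so one step reads each chip's HBM once (about 19.5 ms).
step_s = HBM_GB / HBM_BW_GBPS
batch = 50

tokens_per_s = batch / step_s
usd_per_million = N_CHIPS * USD_PER_CHIP_HOUR / 3600 / tokens_per_s * 1e6
latency_512 = 512 * step_s

print(f"step: {step_s * 1e3:.1f} ms, 512 tokens: {latency_512:.1f} s")
print(f"cost: ${usd_per_million:.2f} per 1M tokens")
```

Under these assumptions the cost lands a little above $4 per 1M tokens, which is where the 5-6x ratio against the 70B model at $0.77 comes from.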
Now let us compute actual decode latency for LLaMA 70B. On a TPU v5e 4×2 (8 chips) with int8 params, BS=32, and 8K context:
At BS=32: throughput = 32 / 0.017 = 1882 tokens/s, or 235 tokens/s/chip.
On a 4×4 (16 chips):
Latency halves, but throughput per chip stays the same (BS is unchanged). The benefit is purely lower latency.
Let us check if we are ICI-bound at these topologies. With 2 ICI axes for model parallelism:
So 8 and 16 are both well within the ICI limit. Good — ICI is not a bottleneck.
Latency breakdown for the 4×2 config:
| Component | Time | Fraction |
|---|---|---|
| Parameter loading (70 GB / 6560 GB/s) | 10.7 ms | 63% |
| KV cache loading (41.9 GB / 6560 GB/s) | 6.4 ms | 37% |
| Compute (2 × 70e9 × 32 / (8 × 1.97e14)) | 2.8 ms | (hidden behind HBM) |
| Total (HBM bound) | 17.1 ms | 100% |
Parameter loading dominates at this batch size (63% of step time). As batch size increases, KV cache loading grows and eventually dominates at very long contexts. The compute is completely hidden behind the HBM loading — we are solidly in the bandwidth-bound regime.
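The breakdown for both topologies can be sketched as follows. It uses the same simplification as the table: compute is assumed to overlap fully with HBM traffic, so the step time is the larger of the two. The constants come from earlier in the section; the function name is illustrative.

```python
HBM_BW = 8.2e11       # bytes/s per v5e chip
FLOPS = 1.97e14       # bf16 FLOPs/s per v5e chip
N_PARAMS = 70e9
PARAM_BYTES = 70e9    # int8 weights
KV_PER_TOKEN = 160e3  # int8 KV cache bytes per token (rounded)

def decode_breakdown(n_chips, batch, seq_len):
    param_s = PARAM_BYTES / (n_chips * HBM_BW)
    kv_s = KV_PER_TOKEN * seq_len * batch / (n_chips * HBM_BW)
    compute_s = 2 * N_PARAMS * batch / (n_chips * FLOPS)
    # Simplification: compute hides behind HBM loading (or vice versa).
    total_s = max(param_s + kv_s, compute_s)
    return {"param_ms": param_s * 1e3, "kv_ms": kv_s * 1e3,
            "compute_ms": compute_s * 1e3, "total_ms": total_s * 1e3}

for chips in (8, 16):
    b = decode_breakdown(chips, batch=32, seq_len=8192)
    per_chip = 32 / (b["total_ms"] / 1e3) / chips
    print(chips, {k: round(v, 1) for k, v in b.items()}, f"{per_chip:.0f} tok/s/chip")
```

Doubling the chip count halves every term, which is why latency halves while tokens per second per chip stays put.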
What is the absolute fastest decode we can achieve? The minimum is when we load only model parameters (KV cache is negligible, e.g., at BS=1 with short context):
So the theoretical minimum decode step for LLaMA 70B in int4 on 16 v5e chips is about 2.7 ms, or ~370 tokens/s for a single sequence. In practice, expect 60-70% of this due to LayerNorm, activation functions, and other overhead.
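The lower bound is just the time to stream the weights once across the slice. A minimal sketch, assuming int4 weights (35 GB) on 16 v5e chips and ignoring KV cache traffic:

```python
def min_decode_step_s(param_bytes, n_chips, hbm_bw=8.2e11):
    # Every weight is read from HBM once per decode step.
    return param_bytes / (n_chips * hbm_bw)

t = min_decode_step_s(param_bytes=35e9, n_chips=16)
print(f"{t * 1e3:.2f} ms per step, {1 / t:.0f} tokens/s for one sequence")
```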
To maximize throughput, we want to push the batch size as high as possible — ideally past the critical batch size B* = 120 (int8 on v5e). But each extra sequence needs KV cache memory.
The strategy: pick the batch size that fills all available HBM. On a fully loaded topology, the step time is simply:
This is independent of how the HBM is split between params and KV caches! Loading a full chip's worth of HBM always takes 19.5 ms.
With median decode length of 512 tokens, throughput in queries per second per chip:
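| Quantization | Topology | Max BS (8K ctx, int8 KV) | QPS/chip |
|---|---|---|---|
| bf16 | 4×4 | 88 | 0.55 |
| int8 | 4×2 | 44 | 0.55 |
| int4 | 2×2 | 22 | 0.55 |
QPS/chip is BS / (0.0195 × 512 × chips). All three minimum slices come out the same: each one is full, each step takes 19.5 ms, and the batch size scales with the number of chips.
What about doubling the topology? On a 4×8 (32 chips) in bf16, 512 - 140 = 372 GB is left for KV caches, which holds BS=283. That gives 283 / (0.0195 × 512 × 32) = 0.89 QPS/chip.
That is about 1.6x better throughput per chip than the minimum 4×4! The extra chips pay for themselves by enabling larger batch sizes.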
Cost per 1M output tokens: This is the key business metric. Let us compute it for each configuration.
| Config | Chips | Max BS | Tok/s | Cost per 1M tokens |
|---|---|---|---|---|
| bf16, 4×4 | 16 | 88 | 4513 | $1.18 |
| int8, 4×2 | 8 | 44 | 2256 | $1.18 |
| int4, 2×2 | 4 | 22 | 1128 | $1.18 |
| int8, 4×4 | 16 | 140 | 7179 | $0.74 |
| bf16, 4×8 | 32 | 283 | 14513 | $0.73 |
The optimal serving configuration is not the smallest possible — it is the one that minimizes cost per token. Larger topologies enable larger batches, better compute utilization, and lower per-token cost. This is counterintuitive: spending more on hardware can reduce total serving cost.
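The cost column can be reproduced from the batch sizes in the table. This sketch assumes every configuration runs with full HBM, so each step takes 16 GB / 820 GB/s (about 19.5 ms), and it uses the $1.20 per chip-hour v5e price.

```python
USD_PER_CHIP_HOUR = 1.20
STEP_S = 16e9 / 8.2e11   # full-HBM decode step, about 19.5 ms

def usd_per_million_tokens(n_chips, batch, step_s=STEP_S):
    tokens_per_hour = batch / step_s * 3600
    return n_chips * USD_PER_CHIP_HOUR / tokens_per_hour * 1e6

# (label, chips, max batch size) taken from the table above.
CONFIGS = [
    ("bf16, 4x4", 16, 88),
    ("int8, 4x2", 8, 44),
    ("int4, 2x2", 4, 22),
    ("int8, 4x4", 16, 140),
    ("bf16, 4x8", 32, 283),
]

for label, chips, bs in CONFIGS:
    print(f"{label}: {bs / STEP_S:.0f} tok/s, "
          f"${usd_per_million_tokens(chips, bs):.2f} per 1M tokens")
```

The first three rows cost the same because batch size and chip count shrink together. Only the larger slices, where parameters take a smaller share of HBM, lower the cost.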
Roofline code for computing these numbers: The book provides a simple Python script that computes the full Pareto frontier. Here is the key logic:
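```python
num_chips = 16
param_bytes = 70e9     # int8: one byte per parameter
hbm_bw = 8.2e11        # v5e HBM bandwidth per chip (bytes/s)
flops = 1.97e14        # v5e bf16 FLOPs/s per chip
kv_per_token = 160e3   # int8 KV cache bytes per token


def step_time(bs, seq_len):
    """Estimated seconds per decode step for `bs` sequences of length `seq_len`."""
    kv_total = kv_per_token * seq_len * bs
    kv_time = kv_total / (num_chips * hbm_bw)        # attention: always bandwidth-bound
    param_time = param_bytes / (num_chips * hbm_bw)  # time to stream the weights
    # 2 FLOPs per parameter per sequence; param_bytes equals the
    # parameter count only because the weights are int8.
    flops_time = 2 * param_bytes * bs / (num_chips * flops)
    mlp_time = max(flops_time, param_time)           # MLP: max(compute, bandwidth)
    return mlp_time + kv_time
```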
Notice the elegant structure: the MLP term is max(compute, bandwidth), and the attention term is always bandwidth. The total step time is just their sum. This simple formula captures 90% of what you need to reason about inference performance.
The Pareto frontier in practice: Let us compute several points on the throughput-latency curve for a 4×4 (16 chips) in int8 at 8K context:
| Batch size | KV total (GB) | Step time (ms) | Tok/s total | Tok/s/chip | $/1M tok |
|---|---|---|---|---|---|
| 1 | 1.3 | 5.4 | 185 | 12 | $28.80 |
| 8 | 10.5 | 6.1 | 1311 | 82 | $4.07 |
| 32 | 41.9 | 8.5 | 3765 | 235 | $1.42 |
| 64 | 83.9 | 11.7 | 5470 | 342 | $0.97 |
| 120 | 157.3 | 17.3 | 6936 | 434 | $0.77 |
The cost per 1M tokens drops by 37x going from BS=1 to BS=120, while latency only increases by 3.2x. This is the most important insight in all of LLM serving economics.
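A sweep over batch sizes reproduces the table to within rounding. The sketch below restates the step-time model so that it runs on its own, and assumes the $1.20 per chip-hour v5e price for the cost column.

```python
NUM_CHIPS = 16
PARAM_BYTES = 70e9     # int8 weights
HBM_BW = 8.2e11        # bytes/s per chip
FLOPS = 1.97e14        # FLOPs/s per chip
KV_PER_TOKEN = 160e3   # int8 KV cache bytes per token
USD_PER_CHIP_HOUR = 1.20

def step_time(bs, seq_len):
    kv_time = KV_PER_TOKEN * seq_len * bs / (NUM_CHIPS * HBM_BW)
    param_time = PARAM_BYTES / (NUM_CHIPS * HBM_BW)
    flops_time = 2 * PARAM_BYTES * bs / (NUM_CHIPS * FLOPS)
    return max(flops_time, param_time) + kv_time

def pareto_point(bs, seq_len=8192):
    t = step_time(bs, seq_len)
    tok_s = bs / t
    usd_per_m = NUM_CHIPS * USD_PER_CHIP_HOUR / 3600 / tok_s * 1e6
    return t * 1e3, tok_s, tok_s / NUM_CHIPS, usd_per_m

for bs in (1, 8, 32, 64, 120):
    ms, tok_s, per_chip, usd = pareto_point(bs)
    print(f"BS={bs:>3}: {ms:5.1f} ms, {tok_s:6.0f} tok/s, "
          f"{per_chip:4.0f} tok/s/chip, ${usd:.2f}/1M")
```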
Interpreting the cost column: At BS=120 on 16 v5e chips, the cost is $0.77 per 1M output tokens. For reference:
That is less than a penny per 25 responses. At this cost, the dominant expense for a chatbot is not inference — it is everything else (engineering, safety, infrastructure, customer support).
How does this compare to training cost? We estimated ~$40M to train LLaMA 70B. At $0.77 per 1M output tokens, the training cost is equivalent to generating:
LLaMA 3 was trained on 15T tokens. So the training cost equals the inference cost of generating roughly 3.5x the training dataset. For a popular model serving millions of users daily, this crossover happens within months. This is why the "inference-aware" scaling philosophy (train smaller models on more data) makes economic sense — reducing model size directly reduces per-token inference cost.
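The arithmetic behind these comparisons, as a short sketch. The 500-token response length is a hypothetical value chosen for illustration; the other numbers come from the text.

```python
TRAIN_COST_USD = 40e6      # estimated training cost from the text
USD_PER_M_TOKENS = 0.77    # decode cost at BS=120 on 16 v5e chips
TRAIN_TOKENS = 15e12       # LLaMA 3 training set size

equiv_tokens = TRAIN_COST_USD / USD_PER_M_TOKENS * 1e6
ratio = equiv_tokens / TRAIN_TOKENS

TOKENS_PER_RESPONSE = 500  # assumption, not from the source
cost_25_responses = 25 * TOKENS_PER_RESPONSE * USD_PER_M_TOKENS / 1e6

print(f"training cost = {equiv_tokens / 1e12:.0f}T generated tokens "
      f"({ratio:.1f}x the training set)")
print(f"25 responses cost ${cost_25_responses:.4f}")
```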
Sensitivity analysis — what matters most for serving cost?
| Change | Effect on $/1M tok | Why |
|---|---|---|
| int8 → int4 weights | -40% | Half the param loading, more room for KV batch |
| 2K → 32K context | +200% | KV loading dominates at long context |
| GQA-8 → MHA-64 | +400% | 8x more KV per token, smaller batch |
| v5e → H100 | +75% | Higher cost per chip despite higher BW |
| BS=32 → BS=120 | -50% | Better MXU utilization |
GQA is the single biggest architectural win for inference cost. Going from MHA-64 to GQA-8 reduces KV cache by 8x, enabling 8x larger batch sizes, which in turn gives roughly 4-5x better throughput per chip.
The complete serving checklist for LLaMA 70B:
This chapter has given you all the tools to make each of these decisions quantitatively. The formulas are simple — the art is knowing which ones to apply in which order.
A final worked problem — serving LLaMA 70B for 10K QPS:
This is the scale of a major consumer-facing LLM product. The infrastructure cost alone is ~$150M/year. This explains why LLM API pricing is what it is — and why every percentage point of efficiency improvement matters at this scale.
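A rough sizing sketch for this problem, under stated assumptions: 512 output tokens per query, 434 tokens/s per chip (the BS=120 point on a 4×4 slice), $1.20 per chip-hour, and perfect utilization around the clock. It ignores prefill and provisioning headroom, so it should land somewhat below the ~$150M/year quoted above.

```python
import math

QPS = 10_000
TOKENS_PER_QUERY = 512        # median decode length used in this section
TOK_PER_S_PER_CHIP = 434      # BS=120 on 16 v5e chips, int8
USD_PER_CHIP_HOUR = 1.20
CHIPS_PER_REPLICA = 16

tokens_per_s = QPS * TOKENS_PER_QUERY
chips = math.ceil(tokens_per_s / TOK_PER_S_PER_CHIP)
replicas = math.ceil(chips / CHIPS_PER_REPLICA)
annual_usd = chips * USD_PER_CHIP_HOUR * 24 * 365

print(f"{tokens_per_s / 1e6:.2f}M tok/s -> {chips} chips ({replicas} replicas)")
print(f"decode-only cost: ${annual_usd / 1e6:.0f}M per year")
```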
During generation, we use pure model parallelism (tensor parallelism). There is no data parallelism within a serving replica — all chips work together on the same batch.
The key question: how far can we scale model parallelism before ICI becomes the bottleneck?
In the compute-bound regime (large batch):
For LLaMA 70B (F=28672) with 2 axes: Ymax = 26.
In the bandwidth-bound regime (small batch), we can overlap ICI with HBM loading. The limit becomes:
At BS=64: Ymax = 28672 / 512 = 56. We can scale further when bandwidth-bound!
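Both limits can be written as small helpers. This is a sketch under assumptions: the compute-bound limit is taken as n_axes × F × W_ici / C with W_ici = 90 GB/s (the ICI figure used elsewhere in this section), which matches Ymax = 26. The bandwidth-bound limit is taken as F / (8 × B); the factor 8 is read off the worked number above (28672 / 512 at BS=64) and is not derived here.

```python
V5E_FLOPS = 1.97e14   # bf16 FLOPs/s per chip
ICI_BW = 9e10         # bytes/s, the 90 GB/s ICI figure

def max_tp_compute_bound(d_ff, n_axes, flops=V5E_FLOPS, ici_bw=ICI_BW):
    # Communication hides behind compute while Y < n_axes * F / (flops / ici_bw).
    return n_axes * d_ff * ici_bw / flops

def max_tp_bandwidth_bound(d_ff, batch, factor=8):
    # factor=8 is inferred from the worked example, not derived.
    return d_ff / (factor * batch)

F = 28672  # LLaMA 70B MLP hidden dimension
print(int(max_tp_compute_bound(F, n_axes=2)))       # compute-bound limit
print(int(max_tp_bandwidth_bound(F, batch=64)))     # bandwidth-bound limit at BS=64
```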
For the KV cache sharding: LLaMA 3-70B has K=8 KV heads. With 8-way TP, each chip stores exactly 1 KV head. Beyond 8-way TP, we must split individual heads across chips, adding complexity.
KV head sharding in detail: With K=8 KV heads and Y-way TP:
| TP degree | KV heads per chip | Split needed? | Extra ICI for attention? |
|---|---|---|---|
| 4 | 2 | No | None (each chip has full heads) |
| 8 | 1 | No | None |
| 16 | 0.5 | Yes (split across 2 chips) | AllGather within head-shard groups |
| 32 | 0.25 | Yes (split across 4 chips) | More AllGather overhead |
At 16-way TP, each KV head is split across 2 chips, requiring an AllGather of partial attention outputs. This adds ICI communication proportional to B × S × H per layer per TP group. For LLaMA 70B at BS=64, S=8192: the extra volume is 64 × 8192 × 128 × 2 = 134 MB per layer. At 90 GB/s ICI, this is 1.5 ms — noticeable but manageable.
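The head-splitting table and the extra AllGather cost can be sketched as below. The volume formula B × S × H × 2 bytes and the 90 GB/s ICI bandwidth are taken from the paragraph above; the function names are illustrative.

```python
def kv_heads_per_chip(n_kv_heads, tp_degree):
    return n_kv_heads / tp_degree

def head_split_allgather(batch, seq_len, head_dim, bytes_per_value=2, ici_bw=9e10):
    # Extra per-layer volume when one KV head is split across chips.
    volume_bytes = batch * seq_len * head_dim * bytes_per_value
    return volume_bytes, volume_bytes / ici_bw * 1e3   # (bytes, milliseconds)

for tp in (4, 8, 16, 32):
    heads = kv_heads_per_chip(8, tp)
    print(f"TP={tp}: {heads} KV heads per chip, split needed: {heads < 1}")

volume, ms = head_split_allgather(batch=64, seq_len=8192, head_dim=128)
print(f"{volume / 1e6:.0f} MB per layer, {ms:.1f} ms at 90 GB/s")
```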
| Topology | TP degree | ICI status | Best for |
|---|---|---|---|
| 4×2 | 8 | Well within limit | Cost-optimized serving |
| 4×4 | 16 | Within limit | Low latency |
| 4×8 | 32 | Marginal (need small B) | Ultra-low latency |
Prefill and decode have different bottlenecks. Prefill is compute-bound; decode is bandwidth-bound. Running them on the same hardware means one phase is always under-utilizing the hardware.
Disaggregated serving runs prefill on compute-optimized chips and decode on bandwidth-optimized chips. The KV cache is transferred between them after prefill completes.
Let us size the prefill:decode ratio. Assume median prefill = 8192 tokens, median decode = 512 tokens:
The transfer cost: shipping 8192 tokens of KV cache = 8192 × 160 kB = 1.3 GB. Over inter-node network at ~50 GB/s, this takes 26 ms — negligible compared to the seconds-scale compute.
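The transfer estimate as a sketch, using the rounded 160 kB per token and the ~50 GB/s network figure from the text:

```python
KV_PER_TOKEN = 160e3   # int8 KV cache bytes per token
NETWORK_BW = 50e9      # bytes/s between prefill and decode servers (approximate)

def kv_transfer(prefill_tokens, kv_per_token=KV_PER_TOKEN, bw=NETWORK_BW):
    size_bytes = prefill_tokens * kv_per_token
    return size_bytes, size_bytes / bw

size_bytes, seconds = kv_transfer(8192)
print(f"{size_bytes / 1e9:.2f} GB, {seconds * 1e3:.0f} ms")
```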
When does disaggregation help?
| Scenario | Disaggregated? | Reason |
|---|---|---|
| Long prompts, short outputs | Yes | Prefill dominates; separate pools let you scale prefill independently |
| Short prompts, long outputs | Maybe not | Decode dominates; KV transfer overhead may not be worth it |
| Mixed traffic | Yes | Prevents long prefills from blocking active decode batches |
| Very small deployment | No | Not enough traffic to justify two separate server pools |
The interference problem: Without disaggregation, a long prefill request (e.g., 32K tokens) arriving in a batch of actively decoding sequences will stall all decode slots while the prefill runs. This causes latency spikes for in-flight decode requests. Disaggregation eliminates this interference entirely.
Worked example — cost of the interference problem:
This is catastrophic for latency SLAs. Users with active conversations experience a multi-second pause. Disaggregation or chunked prefill is not optional — it is a requirement for production serving.
Chunked prefill as an alternative: Instead of disaggregating, we can break the 32K prefill into 64 chunks of 512 tokens each. Each decode step processes one chunk alongside the normal decode batch. This adds ~0.5 ms per decode step (from the 512-token prefill chunk), but the new request takes 64 × 17.5 ms = 1.12 seconds to fully prefill — versus 3.6 seconds for a full blocking prefill. The decode batch is only slightly slowed, not blocked.
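The chunked-prefill arithmetic can be sketched as follows. The 17.5 ms step time (a 17 ms decode step plus about 0.5 ms for the chunk) is the figure used above; the chunk size is a tunable parameter, not a fixed property of the technique.

```python
import math

def chunked_prefill(prompt_tokens, chunk_tokens, step_s):
    # One chunk rides along with each decode step.
    n_chunks = math.ceil(prompt_tokens / chunk_tokens)
    return n_chunks, n_chunks * step_s

n_chunks, total_s = chunked_prefill(prompt_tokens=32768, chunk_tokens=512, step_s=0.0175)
print(f"{n_chunks} chunks, {total_s:.2f} s until the prompt is fully prefilled")
```

Smaller chunks add less latency to each decode step but stretch out the time to first token for the new request.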
Serving LLaMA 3-70B on TPUs teaches us the essential economics of LLM inference:
| Key finding | Value |
|---|---|
| Best hardware (FLOPs/$) | TPU v5e at 5.8e17 FLOPs/$ |
| KV cache per token | 160 kB (int8) |
| Min topology (int8) | 4×2 = 8 chips |
| Critical batch (int8 on v5e) | B* = 120 |
| Decode lower bound | HBM/W = 19.5 ms per chip |
| Prefill:Decode ratio | ~3:1 prefill servers needed |
| Max useful TP | ~16-26 (depends on batch and ICI axes) |
The latency-throughput tradeoff is extreme: at BS=1 on 16 chips, latency is ~5.4 ms but per-chip throughput is only 12 tok/s. At BS=120, latency rises to ~17.3 ms (3.2x) but per-chip throughput jumps to 434 tok/s (37x). Almost all production systems operate at high batch sizes because the cost savings are overwhelming.
Comparison with real-world API pricing:
| Provider | Model | Price per 1M output tokens |
|---|---|---|
| OpenAI | GPT-4o | $10-15 |
| Google | Gemini 1.5 Pro | $5-10 |
| Meta (self-hosted) | LLaMA 70B | $0.70-1.20 (our estimate) |
Our back-of-the-envelope estimate of $0.70-1.20 per 1M tokens for LLaMA 70B on v5e is in line with third-party hosting costs from providers like Together.ai and Anyscale. The premium charged by OpenAI and Google reflects API margins, safety infrastructure, and the amortized cost of training.
The road to cheaper inference: