Austin et al., Part 8

Serving LLaMA 3-70B on TPUs

Cost per million tokens, hardware selection, sharding strategies, and the latency-throughput Pareto frontier — all derived from first principles.

Prerequisites: Inference fundamentals (Ch 7: prefill vs decode, arithmetic intensity, KV cache), LLaMA 3 architecture (Ch 6).

Chapter 0: Hardware Choice

The first question for serving is: which hardware? The answer is usually whichever gives the most FLOPs per dollar. For inference, where we are often bandwidth-bound, we also care about HBM bandwidth per dollar.

Chip | bf16 FLOPs/s | HBM | HBM BW | USD/hr | FLOPs/$
H100 | 9.9e14 | 80 GB | 3350 GB/s | $10.80 | 3.3e17
v5p | 4.59e14 | 96 GB | 2765 GB/s | $4.20 | 3.9e17
v5e | 1.97e14 | 16 GB | 820 GB/s | $1.20 | 5.8e17

TPU v5e wins on FLOPs per dollar at 5.8e17, beating v5p (3.9e17) and H100 (3.3e17). That is why v5e is the standard choice for inference at Google — you get 76% more FLOPs per dollar than H100.

The catch: v5e only has 16 GB HBM. A 70B model in bf16 needs 140 GB just for weights. That means aggressive sharding across many chips, or quantization, or both.

Why not use v5p for inference? v5p has 96 GB HBM and much higher FLOPs, but costs 3.5x more per chip. Since inference is usually bandwidth-bound (not compute-bound), the extra FLOPs of v5p go to waste. The relevant metric is HBM bandwidth per dollar, not FLOPs per dollar:

v5e: 820 GB/s / $1.20 = 683 GB/s/$
v5p: 2765 GB/s / $4.20 = 658 GB/s/$
H100: 3350 GB/s / $10.80 = 310 GB/s/$

Even on bandwidth per dollar, v5e wins slightly. And for compute-bound workloads (large batch), v5e's FLOPs per dollar dominance makes it even more attractive.
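The two per-dollar metrics above can be recomputed with a short sketch. The dictionary and function names here are illustrative, and the specs are simply copied from the table; because the table rounds its entries, the computed FLOPs/$ values can differ from it by a few percent.

```python
# Per-chip specs copied from the hardware table:
# (bf16 FLOPs/s, HBM bandwidth in bytes/s, USD per chip-hour).
CHIPS = {
    "H100": (9.9e14, 3.35e12, 10.80),
    "v5p": (4.59e14, 2.765e12, 4.20),
    "v5e": (1.97e14, 8.2e11, 1.20),
}

def per_dollar(flops_per_s, hbm_bytes_per_s, usd_per_hour):
    """Return (FLOPs per dollar, GB/s of HBM bandwidth per dollar-per-hour)."""
    flops_per_dollar = flops_per_s * 3600 / usd_per_hour
    bw_per_dollar = hbm_bytes_per_s / 1e9 / usd_per_hour
    return flops_per_dollar, bw_per_dollar

for name, spec in CHIPS.items():
    f, bw = per_dollar(*spec)
    print(f"{name}: {f:.2e} FLOPs/$, {bw:.0f} GB/s per $/hr")

best_flops = max(CHIPS, key=lambda n: per_dollar(*CHIPS[n])[0])
best_bw = max(CHIPS, key=lambda n: per_dollar(*CHIPS[n])[1])
```

Both `best_flops` and `best_bw` come out as v5e, which matches the ranking in the text.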

The critical batch size on v5e with bf16 weights:

B* = C / W_HBM = 1.97e14 / 8.2e11 = 240

With int8 weights (half the bytes, same FLOPs):

B* = 120

What does this mean practically? Below BS=240 (bf16), we are wasting compute cycles. The MXU is idle for (1 - B/240) fraction of each step, waiting for HBM to feed it data. At BS=1, we use 1/240 = 0.4% of the MXU — 99.6% wasted!

This is the fundamental reason why batching is so important for LLM inference. Every batch element shares the cost of loading the model weights. The weight-loading cost is fixed regardless of batch size — it is the "cover charge" for each decode step. More sequences per batch amortize this fixed cost.

The analogy: Think of decode like a bus route. The bus (model weights) makes the same trip regardless of how many passengers (batch elements) it carries. One passenger pays the full fuel cost. 240 passengers split it 240 ways. The difference between batch size 1 and batch size 240 is like the difference between a private taxi and a packed bus: same destination, up to 240x cheaper per passenger (ignoring each passenger's own KV cache traffic).
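A minimal sketch of the critical batch size calculation, assuming the simple model used above: each parameter costs 2 FLOPs per sequence and is read from HBM once per step. The function names are illustrative.

```python
def critical_batch_size(flops_per_s, hbm_bytes_per_s, bytes_per_param=2):
    """Batch size at which matmul time equals weight-loading time.

    Compute time per step: 2 * B * P / C.
    Weight-loading time per step: bytes_per_param * P / W_HBM.
    Setting them equal gives B* = (C / W_HBM) * bytes_per_param / 2.
    """
    return flops_per_s / hbm_bytes_per_s * bytes_per_param / 2

def mxu_utilization(batch_size, b_star):
    """Fraction of peak FLOPs used while bandwidth-bound (capped at 1)."""
    return min(1.0, batch_size / b_star)

b_star_bf16 = critical_batch_size(1.97e14, 8.2e11, bytes_per_param=2)
b_star_int8 = critical_batch_size(1.97e14, 8.2e11, bytes_per_param=1)
print(round(b_star_bf16), round(b_star_int8))
print(f"{mxu_utilization(1, b_star_bf16):.1%}")
```

With bf16 weights the `bytes_per_param / 2` factor is 1, which recovers B* = C / W_HBM.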

Check: Why is TPU v5e preferred for inference despite having only 16 GB HBM?

Chapter 1: KV Cache Sizing

Before choosing a topology, we need to know how much memory the KV cache will use. For LLaMA 3-70B with int8 KV caches:

KV bytes/token = 2 × K × H × L = 2 × 8 × 128 × 80 = 163,840 bytes ≈ 160 kB

At different sequence lengths:

Sequence length | KV cache per sequence | BS=32 | BS=240
2,048 | 328 MB | 10.5 GB | 78.6 GB
8,192 | 1.31 GB | 41.9 GB | 314 GB
32,768 | 5.24 GB | 168 GB | 1.26 TB

At batch size 240 and 8K context, the KV cache alone uses 314 GB. That is 4.5x the model parameter size (70 GB in int8). On v5e with 16 GB per chip, we need at least ceil((70 + 314) / 16) = 24 chips just for memory — and that is before any working memory for activations.
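The sizing arithmetic can be sketched as two small helpers (names are illustrative). Note that the exact per-token size is 163,840 bytes, while the table uses the rounded 160 kB, so the example passes the rounded value to reproduce the table.

```python
def kv_bytes_per_token(n_kv_heads, head_dim, n_layers, bytes_per_value=1):
    """Keys and values (factor 2) for every KV head in every layer."""
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_value

def kv_cache_gb(seq_len, batch_size, per_token_bytes):
    """Total KV cache in GB for a batch of equal-length sequences."""
    return seq_len * batch_size * per_token_bytes / 1e9

exact = kv_bytes_per_token(n_kv_heads=8, head_dim=128, n_layers=80)  # int8
print(exact)

ROUNDED = 160e3  # the rounded per-token figure used in the table
for seq_len in (2048, 8192, 32768):
    print(seq_len,
          round(kv_cache_gb(seq_len, 1, ROUNDED), 2),
          round(kv_cache_gb(seq_len, 32, ROUNDED), 1),
          round(kv_cache_gb(seq_len, 240, ROUNDED), 1))
```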

This reveals a fundamental tension: we want batch size 240 (to saturate compute), but the KV cache at that batch size may require more chips than we can efficiently shard across (due to ICI communication limits).

Practical approach: pick the batch size that fills available HBM on a given topology, then check if the resulting throughput is acceptable.

The memory equation for inference:

Total HBM = param_bytes + B × S × KV_per_token + working_memory

Working memory includes activation buffers for the forward pass. With Flash Attention, this is small (a few hundred MB). Without it, the attention score matrix can be large: B × N_heads × S × S × 2 bytes. For B=32, S=8K, N_heads=64: that is 32 × 64 × 8192 × 8192 × 2 = 275 GB! This is why Flash Attention (or equivalent) is mandatory for long-context serving.

Practical implication: When computing max batch size, always leave 10-20% HBM headroom for working memory, CUDA/XLA overhead, and memory fragmentation. A "full" GPU/TPU at 95% HBM utilization will often OOM due to fragmentation.

Check: For LLaMA 70B with 32K context and BS=32 in int8, how large is the total KV cache?

Chapter 2: Minimum Topology

What is the smallest TPU v5e slice that can serve LLaMA 3-70B? This depends entirely on quantization and KV cache needs:

Quantization | Param size | KV/token | Min TPU v5e chips | Actual min slice | Remaining HBM for KV
bf16 | 140 GB | 324 kB | 9 | 4×4 = 16 | 116 GB
int8 | 70 GB | 162 kB | 5 | 4×2 = 8 | 58 GB
int4 | 35 GB | 81 kB | 3 | 2×2 = 4 | 29 GB

How many KV caches fit in the remaining HBM? At 8K context in int8: KV per sequence = 1.31 GB. So on a 4×2 with 58 GB remaining: 58 / 1.31 = 44 sequences max. That is our maximum batch size on this topology.

Let us compute the max batch size for each configuration at 8K context, assuming an int8 KV cache (1.31 GB per sequence) in every case:

bf16 on 4×4: 116 GB / 1.31 GB = 88 sequences
int8 on 4×2: 58 GB / 1.31 GB = 44 sequences
int4 on 2×2: 29 GB / 1.31 GB = 22 sequences

None of these reach the critical batch size of 240 (bf16) or 120 (int8). So on minimum topologies, we are always memory-bandwidth-bound. Reaching compute-saturation requires larger slices.
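The topology and batch-size arithmetic above can be sketched as follows. The list of slice sizes is an assumption made for illustration (1, 4, 8, 16, ... chips), and `usable_fraction` is a hypothetical knob for HBM headroom; neither comes from the source.

```python
import math

HBM_PER_CHIP_GB = 16
# Assumed set of available v5e slice sizes, for illustration only.
V5E_SLICES = [1, 4, 8, 16, 32, 64, 128, 256]

def min_slice(param_gb):
    """Smallest assumed slice whose total HBM can hold the parameters."""
    chips_needed = math.ceil(param_gb / HBM_PER_CHIP_GB)
    return next(s for s in V5E_SLICES if s >= chips_needed)

def max_batch(param_gb, n_chips, kv_per_seq_gb, usable_fraction=1.0):
    """Number of whole sequences whose KV cache fits next to the parameters."""
    free_gb = n_chips * HBM_PER_CHIP_GB * usable_fraction - param_gb
    return max(0, int(free_gb // kv_per_seq_gb))

for name, param_gb in (("bf16", 140), ("int8", 70), ("int4", 35)):
    chips = min_slice(param_gb)
    print(name, chips, max_batch(param_gb, chips, kv_per_seq_gb=1.31))
```

Passing `usable_fraction=0.8` shows how quickly headroom eats into the batch, since the parameters keep their full share and only the KV space shrinks.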

An important subtlety: these numbers are optimistic. We assumed all remaining HBM goes to KV caches. In reality, we need:

Memory component | Size
Model parameters | 70 GB (int8)
KV caches | Variable
Activation working memory | ~0.5-2 GB (with Flash Attention)
XLA/CUDA runtime overhead | ~0.5-1 GB per chip
Memory fragmentation | ~5-15% of total

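A safe rule of thumb: assume 80% of nominal HBM is actually usable. On v5e: 0.8 × 16 = 12.8 GB effective per chip. Because the parameters still need their full share, the entire loss comes out of KV space: on a 4×2 in int8, 0.8 × 128 - 70 = 32.4 GB remains, which cuts the max batch at 8K context from 44 to about 24.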

What about LLaMA 3-8B serving? Let us compute the same table for the smaller model:

Quantization | Param size | KV/token (K=8, H=128, L=32) | Min v5e chips | KV at 8K (per seq)
bf16 | 16 GB | 131 kB | 1 | 1.07 GB
int8 | 8 GB | 65.5 kB | 1 | 537 MB
int4 | 4 GB | 32.8 kB | 1 | 268 MB

(KV/token = 2 × 8 × 128 × 32 = 65,536 values, stored at 2, 1, or 0.5 bytes each.)

LLaMA 8B in int8 fits on a single v5e chip with 8 GB remaining for KV caches. At 8K context: max BS = 8 GB / 537 MB = 14 sequences. That is far below the int8 critical batch size of B* = 120 on v5e, so a single chip is firmly bandwidth-bound.

For production, a 2×2 (4 chips) in int8 gives: 4 × 16 - 8 = 56 GB for KV, supporting BS=104 at 8K context. That is close to B* = 120, where the MLP becomes compute-bound, so this is a very efficient configuration.

What about LLaMA 3-405B? This is a more extreme case:

int8 params: 405 GB. Min v5e chips: ceil(405/16) = 26 chips (round to 4×8 = 32)
KV per token: 2 × 8 × 128 × 126 × 1 = 258 kB (int8)
Remaining HBM: 32 × 16 - 405 = 107 GB
At 8K context: KV per seq = 258e3 × 8192 = 2.11 GB. Max BS = 107/2.11 = 50
Critical BS (int8 on v5e) = 120. We are at 50 < 120 ⇒ bandwidth-bound.
Step time = (405 + 50 × 2.11) / (32 × 820) ≈ 19.5 ms

Even the 405B model has a step time of ~20 ms on 32 v5e chips. The user-facing latency for generating 512 tokens: 0.02 × 512 = 10.2 seconds. Acceptable for most applications.

The cost per 1M tokens for 405B at BS=50 on 32 v5e:

Tok/s = 50 / 0.0195 = 2564. Cost/s = 32 × $1.20/3600 = $0.0107/s
$/1M tok = $0.0107 / 2564 × 1e6 = $4.17

About 5-6x more expensive per token than the 70B model. The model has 5.8x more parameters, so the scaling is roughly linear — as expected since both are bandwidth-bound.

The smallest topology is not the most cost-effective. On a 4×2 with int8 at BS=44, we only use 44/120 = 37% of compute capacity. Doubling to 4×4 leaves 16 × 16 - 70 = 186 GB for KV caches, fitting BS≈140 (int8), enough to reach the critical batch size of 120.

Check: Can LLaMA 70B in int4 fit on a TPU v5e 2x2 (4 chips)?

Chapter 3: Decode Latency

Now let us compute actual decode latency for LLaMA 70B. On a TPU v5e 4×2 (8 chips) with int8 params, BS=32, and 8K context:

Total memory to load = param_size + KV_cache = 70 GB + 41.9 GB = 112 GB
T_step = 112 GB / (8 × 820 GB/s) = 112 / 6560 = 17 ms

At BS=32: throughput = 32 / 0.017 = 1882 tokens/s, or 235 tokens/s/chip.

Step time with a full HBM: When the batch is sized so that parameters plus KV caches fill every chip, each decode step reads all 16 GB of each chip's HBM: 16 GB / 820 GB/s = 19.5 ms. This is the step time of a fully loaded topology, whatever its size. A step is faster only when each chip reads less than its full HBM, as in the BS=32 example above, where 112 GB is spread over 8 × 16 = 128 GB of HBM.

On a 4×4 (16 chips):

T_step = 112 GB / (16 × 820 GB/s) = 8.5 ms

Latency halves, but throughput per chip stays the same (BS is unchanged). The benefit is purely lower latency.

Let us check if we are ICI-bound at these topologies. With 2 ICI axes for model parallelism:

Y_max (compute-bound) = F × 2 / 2200 = 28672 × 2 / 2200 = 26

So 8 and 16 are both well within the ICI limit. Good — ICI is not a bottleneck.

When to use a larger topology: If you have a latency SLA (e.g., <15 ms/step), you may need more chips even though throughput per chip stays constant. The 4×4 gives 8.5 ms vs 17 ms on 4×2.
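A small sketch of the bandwidth-bound step-time estimate and of picking a slice for a latency SLA. The helper names and the candidate slice list are illustrative assumptions, and the model ignores compute and ICI time, as the text does in this regime.

```python
V5E_HBM_GBPS = 820  # per-chip HBM bandwidth from the hardware table

def decode_step_ms(param_gb, kv_gb, n_chips):
    """Bandwidth-bound step time: params and KV cache are each read once per step."""
    return (param_gb + kv_gb) / (n_chips * V5E_HBM_GBPS) * 1e3

def smallest_slice_for_sla(param_gb, kv_gb, sla_ms, slices=(8, 16, 32)):
    """Return the first candidate slice meeting the SLA, or None if none does."""
    for n in slices:
        if decode_step_ms(param_gb, kv_gb, n) <= sla_ms:
            return n
    return None

# int8 params (70 GB) plus the BS=32, 8K-context KV cache (41.9 GB).
for chips in (8, 16):
    step = decode_step_ms(70, 41.9, chips)
    print(chips, round(step, 1), "ms,", round(32 / (step / 1e3) / chips), "tok/s/chip")

print(smallest_slice_for_sla(70, 41.9, sla_ms=15))
```

Per-chip throughput is the same for both slices because the batch size is held fixed; only the latency changes.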

Latency breakdown for the 4×2 config:

Component | Time | Fraction
Parameter loading (70 GB / 6560 GB/s) | 10.7 ms | 63%
KV cache loading (41.9 GB / 6560 GB/s) | 6.4 ms | 37%
Compute (2 × 70e9 × 32 / (8 × 1.97e14)) | 2.8 ms | (hidden behind HBM)
Total (HBM bound) | 17.1 ms |

Parameter loading dominates at this batch size (63% of step time). As batch size increases, KV cache loading grows and eventually dominates at very long contexts. The compute is completely hidden behind the HBM loading — we are solidly in the bandwidth-bound regime.

What is the absolute fastest decode we can achieve? The minimum is when we load only model parameters (KV cache is negligible, e.g., at BS=1 with short context):

Min latency (int8, 4×2) = 70 GB / (8 × 820 GB/s) = 10.7 ms
Min latency (int8, 4×4) = 70 GB / (16 × 820 GB/s) = 5.3 ms
Min latency (int4, 4×4) = 35 GB / (16 × 820 GB/s) = 2.7 ms

So the theoretical minimum decode step for LLaMA 70B in int4 on 16 v5e chips is about 2.7 ms, or ~370 tokens/s for a single sequence. In practice, expect to reach only 60-70% of this rate because of LayerNorm, activation functions, and other overhead.

Check: If decode step time is 17ms at BS=32 on 8 chips, what is the per-chip throughput?

Chapter 4: Throughput Optimization

To maximize throughput, we want to push the batch size as high as possible — ideally past the critical batch size B* = 120 (int8 on v5e). But each extra sequence needs KV cache memory.

The strategy: pick the batch size that fills all available HBM. On a fully loaded topology, the step time is simply:

T_step = HBM_total / (N × W_HBM) = 16 GB / 820 GB/s = 19.5 ms

This is independent of how the HBM is split between params and KV caches! Loading a full chip's worth of HBM always takes 19.5 ms.

With median decode length of 512 tokens, throughput in queries per second per chip:

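QPS/chip = B / (T_step × median_decode × N)

The table below assumes the KV cache is quantized to the same dtype as the weights (324, 162, and 81 kB per token), so each minimum topology fits about 44 sequences at 8K context:

Quantization | Topology | Max BS (8K ctx) | QPS/chip
bf16 | 4×4 | ≈44 | 0.27
int8 | 4×2 | ≈44 | 0.55
int4 | 2×2 | ≈44 | 1.11

int4 gives 4x the throughput per chip of bf16! This comes from two effects: (1) the model fits on fewer chips, (2) the remaining HBM is used for KV caches at a higher per-chip density. Quantization is the single most impactful optimization for serving cost.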

What about doubling the topology? On a 4×8 (32 chips) in bf16:

Remaining HBM = 32 × 16 - 140 = 372 GB
Max BS = 372 / 1.31 = 283 (8K context, int8 KV cache)
QPS/chip = 283 / (0.0195 × 512 × 32) = 0.89

With the same int8 KV cache, the minimum 4×4 fits 88 sequences and reaches 88 / (0.0195 × 512 × 16) = 0.55 QPS/chip, so the 4×8 is about 1.6x better per chip. The extra chips pay for themselves by enabling larger batch sizes.

Cost per 1M output tokens: This is the key business metric. Let us compute it for each configuration.

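Cost/s = N × $/hr / 3600
Tokens/s = B / T_step
Cost per 1M tokens = (Cost/s / Tokens/s) × 1e6

Using the int8 KV cache batch sizes computed earlier (1.31 GB per sequence at 8K context) and the full-HBM step time of 19.5 ms:

Config | Chips | Max BS | Tok/s | Cost per 1M tokens
bf16, 4×4 | 16 | 88 | 4513 | $1.18
int8, 4×2 | 8 | 44 | 2256 | $1.18
int4, 2×2 | 4 | 22 | 1128 | $1.18
int8, 4×4 | 16 | 140 | 7179 | $0.74
bf16, 4×8 | 32 | 283 | 14513 | $0.73

Surprising result: On minimum topologies, the cost per token is nearly identical regardless of weight quantization! This is because on a minimum topology, HBM is fully loaded (19.5 ms/step) regardless of how it is split between params and KV caches, and with the KV cache format held fixed, each minimum slice ends up with the same 5.5 sequences per chip. The wins from weight quantization appear when using larger topologies where the batch size increase outpaces the chip cost increase.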

The optimal serving configuration is not the smallest possible — it is the one that minimizes cost per token. Larger topologies enable larger batches, better compute utilization, and lower per-token cost. This is counterintuitive: spending more on hardware can reduce total serving cost.
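The three cost formulas can be combined into one helper. This is a sketch using the $1.20 per chip-hour v5e price from the hardware table; the function name is illustrative.

```python
def cost_per_million_tokens(n_chips, batch_size, step_time_s, usd_per_chip_hour=1.20):
    """Dollars per 1M output tokens for one replica running at a fixed batch size."""
    cost_per_s = n_chips * usd_per_chip_hour / 3600
    tokens_per_s = batch_size / step_time_s
    return cost_per_s / tokens_per_s * 1e6

FULL_HBM_STEP_S = 0.0195  # 16 GB / 820 GB/s

configs = [
    ("bf16, 4x4", 16, 88),
    ("int8, 4x2", 8, 44),
    ("int4, 2x2", 4, 22),
    ("int8, 4x4", 16, 140),
    ("bf16, 4x8", 32, 283),
]
for name, chips, bs in configs:
    print(name, round(cost_per_million_tokens(chips, bs, FULL_HBM_STEP_S), 2))
```

The first three rows come out equal because each has 5.5 sequences per chip at the same step time.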

Roofline code for computing these numbers: The book provides a simple Python script that computes the full Pareto frontier. Here is the key logic:

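num_chips = 16
param_bytes = 70e9    # int8 weights: 1 byte per parameter
hbm_bw = 8.2e11       # v5e HBM bandwidth per chip, bytes/s
flops = 1.97e14       # v5e bf16 FLOPs/s per chip
kv_per_token = 160e3  # int8 KV cache, bytes per token

def step_time(bs, seq_len):
    """Estimated decode step time in seconds at batch size bs and context seq_len."""
    # Attention: reading the KV cache is always bandwidth-bound.
    kv_total = kv_per_token * seq_len * bs
    kv_time = kv_total / (num_chips * hbm_bw)
    # MLP: limited by either weight loading or matmul FLOPs.
    param_time = param_bytes / (num_chips * hbm_bw)
    # 2 FLOPs per parameter per sequence; with int8, param_bytes equals the parameter count.
    flops_time = 2 * param_bytes * bs / (num_chips * flops)
    mlp_time = max(flops_time, param_time)
    return mlp_time + kv_time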

Notice the elegant structure: the MLP term is max(compute, bandwidth), and the attention term is always bandwidth. The total step time is just their sum. This simple formula captures 90% of what you need to reason about inference performance.

The Pareto frontier in practice: Let us compute several points on the throughput-latency curve for a 4×4 (16 chips) in int8 at 8K context:

Batch size | KV total (GB) | Step time (ms) | Tok/s total | Tok/s/chip | $/1M tok
1 | 1.3 | 5.4 | 185 | 12 | $28.80
8 | 10.5 | 6.1 | 1311 | 82 | $4.07
32 | 41.9 | 8.5 | 3765 | 235 | $1.42
64 | 83.9 | 11.7 | 5470 | 342 | $0.97
120 | 157.3 | 17.3 | 6936 | 434 | $0.77

The cost per 1M tokens drops by 37x going from BS=1 to BS=120, while latency only increases by 3.2x. This is the most important insight in all of LLM serving economics.
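A sweep along these lines reproduces the table, up to rounding. It is a self-contained sketch that restates the roofline step-time model with the same constants (16 chips, int8 params, 160 kB of KV per token); `pareto_row` is an illustrative name.

```python
NUM_CHIPS = 16
PARAM_BYTES = 70e9        # int8 weights
HBM_BW = 8.2e11           # bytes/s per chip
FLOPS = 1.97e14           # FLOPs/s per chip
KV_PER_TOKEN = 160e3      # bytes, int8 KV cache
USD_PER_CHIP_HOUR = 1.20

def step_time(bs, seq_len):
    """MLP term is max(compute, weight loading); attention term is KV loading."""
    kv_time = KV_PER_TOKEN * seq_len * bs / (NUM_CHIPS * HBM_BW)
    param_time = PARAM_BYTES / (NUM_CHIPS * HBM_BW)
    flops_time = 2 * PARAM_BYTES * bs / (NUM_CHIPS * FLOPS)
    return max(flops_time, param_time) + kv_time

def pareto_row(bs, seq_len=8192):
    t = step_time(bs, seq_len)
    tok_s = bs / t
    usd_per_m = NUM_CHIPS * USD_PER_CHIP_HOUR / 3600 / tok_s * 1e6
    return {"bs": bs, "step_ms": t * 1e3, "tok_s": tok_s,
            "tok_s_chip": tok_s / NUM_CHIPS, "usd_per_m": usd_per_m}

rows = [pareto_row(bs) for bs in (1, 8, 32, 64, 120)]
for r in rows:
    print(f"{r['bs']:>4} {r['step_ms']:6.1f} ms {r['tok_s']:8.0f} tok/s ${r['usd_per_m']:.2f}")

cost_ratio = rows[0]["usd_per_m"] / rows[-1]["usd_per_m"]
latency_ratio = rows[-1]["step_ms"] / rows[0]["step_ms"]
```

`cost_ratio` and `latency_ratio` are the two numbers that summarize the tradeoff: a large drop in cost for a modest rise in step time.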

Interpreting the cost column: At BS=120 on 16 v5e chips, the cost is $0.77 per 1M output tokens. For reference:

A typical GPT-4-class response is ~500 tokens
Cost per response = 500 / 1e6 × $0.77 = $0.000385 ≈ $0.04 per 100 responses

That is less than a penny per 25 responses. At this cost, the dominant expense for a chatbot is not inference — it is everything else (engineering, safety, infrastructure, customer support).

How does this compare to training cost? We estimated ~$40M to train LLaMA 70B. At $0.77 per 1M output tokens, the training cost is equivalent to generating:

$40M / $0.77 × 1M = 51.9 trillion output tokens

LLaMA 3 was trained on 15T tokens. So the training cost equals the inference cost of generating roughly 3.5x the training dataset. For a popular model serving millions of users daily, this crossover happens within months. This is why the "inference-aware" scaling philosophy (train smaller models on more data) makes economic sense — reducing model size directly reduces per-token inference cost.

Sensitivity analysis — what matters most for serving cost?

Change | Effect on $/1M tok | Why
int8 → int4 weights | -40% | Half the param loading, more room for KV batch
2K → 32K context | +200% | KV loading dominates at long context
GQA-8 → MHA-64 | +400% | 8x more KV per token, smaller batch
v5e → H100 | +75% | Higher cost per chip despite higher BW
BS=32 → BS=120 | -50% | Better MXU utilization

GQA is the single biggest architectural win for inference cost. Going from MHA-64 to GQA-8 reduces KV cache by 8x, enabling 8x larger batch sizes, which in turn gives roughly 4-5x better throughput per chip.

The complete serving checklist for LLaMA 70B:

1. Choose hardware. v5e for best FLOPs/$. H100 if you need more HBM per chip.
2. Choose quantization. int8 weights + int8 KV is the sweet spot. int4 if quality permits.
3. Size the topology. Min chips = ceil(param_bytes / HBM). Add chips until max BS ≥ B*.
4. Check ICI limits. TP degree < F × M_Y / 2200 (compute-bound) or F / (8B) (BW-bound).
5. Decide on disaggregation. If prefill latency > 200ms (long prompts), separate prefill and decode servers.
6. Scale with replicas. Multiply replicas for throughput. Load-balance with autoscaling.

This chapter has given you all the tools to make each of these decisions quantitatively. The formulas are simple — the art is knowing which ones to apply in which order.

A final worked problem — serving LLaMA 70B for 10K QPS:

Config per replica: 16 v5e, int8, BS=64, 8K context
Step time ≈ 11.2 ms. Decode per query (512 tok) = 5.73 s
QPS per replica = 64 / 5.73 = 11.2
Replicas needed = 10,000 / 11.2 = 893
Total chips = 893 × 16 = 14,288 TPU v5e
Cost = 14,288 × $1.20/hr = $17,146/hour = $411K/day
Cost per query ≈ $0.0005. For 10K QPS over 24 hours: 864M queries for the daily cost computed above.

This is the scale of a major consumer-facing LLM product. The infrastructure cost alone is ~$150M/year. This explains why LLM API pricing is what it is — and why every percentage point of efficiency improvement matters at this scale.
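The fleet-sizing steps can be sketched as one function. The name `size_fleet` and its return keys are illustrative, and the step time is passed in rather than derived. Exact arithmetic gives slightly different replica counts than the hand calculation, which rounds QPS per replica to 11.2 before dividing.

```python
import math

def size_fleet(target_qps, batch_size, step_time_s, decode_tokens,
               chips_per_replica, usd_per_chip_hour=1.20):
    """Replicas, chips, and cost needed to sustain target_qps of decode traffic."""
    seconds_per_query = step_time_s * decode_tokens
    qps_per_replica = batch_size / seconds_per_query
    replicas = math.ceil(target_qps / qps_per_replica)
    chips = replicas * chips_per_replica
    usd_per_day = chips * usd_per_chip_hour * 24
    usd_per_query = usd_per_day / (target_qps * 86_400)
    return {"replicas": replicas, "chips": chips,
            "usd_per_day": usd_per_day, "usd_per_query": usd_per_query}

plan = size_fleet(target_qps=10_000, batch_size=64, step_time_s=0.0112,
                  decode_tokens=512, chips_per_replica=16)
print(plan)
```

This counts decode capacity only; prefill servers, if disaggregated, would be sized separately.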

Check: Why does doubling the topology improve per-chip throughput?

Chapter 5: Sharding for Inference

During generation, we use pure model parallelism (tensor parallelism). There is no data parallelism within a serving replica — all chips work together on the same batch.

The key question: how far can we scale model parallelism before ICI becomes the bottleneck?

In the compute-bound regime (large batch):

Y_max = F × M_Y / 2200

For LLaMA 70B (F=28672) with 2 axes: Y_max = 26.

In the bandwidth-bound regime (small batch), we can overlap ICI with HBM loading. The limit becomes:

Y_max = F / (8B)

At BS=64: Y_max = 28672 / 512 = 56. We can scale further when bandwidth-bound!

Sanity check for a 4×8 (32 chips) at BS=64. Let us compute each time component for one MLP matmul:

T_ICI = 2BD / W_ICI = 2 × 64 × 8192 / 9e10 ≈ 12 μs
T_HBM = 2DF / (Y × W_HBM) = 2 × 8192 × 28672 / (32 × 8.2e11) ≈ 18 μs
T_math = 2BDF / (Y × C) = 2 × 64 × 8192 × 28672 / (32 × 1.97e14) ≈ 5 μs

Since T_HBM > T_ICI > T_math, we are HBM bandwidth-bound, which is the ideal regime!
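The sanity check and the two tensor-parallel limits can be written out as a sketch. Function names are illustrative; the constants (9e10 B/s ICI, 8.2e11 B/s HBM, 1.97e14 FLOPs/s, and the 2200 and 8B factors) are taken from the formulas above.

```python
def mlp_matmul_times_us(batch, d_model, d_ff, tp,
                        w_ici=9e10, w_hbm=8.2e11, flops=1.97e14):
    """Per-matmul time in microseconds for ICI traffic, HBM weight loading, and math."""
    return {
        "ici": 2 * batch * d_model / w_ici * 1e6,
        "hbm": 2 * d_model * d_ff / (tp * w_hbm) * 1e6,
        "math": 2 * batch * d_model * d_ff / (tp * flops) * 1e6,
    }

def y_max_compute_bound(d_ff, ici_axes):
    """TP limit when compute-bound: F * M_Y / 2200."""
    return d_ff * ici_axes / 2200

def y_max_bandwidth_bound(d_ff, batch):
    """TP limit when bandwidth-bound: F / (8B)."""
    return d_ff / (8 * batch)

times = mlp_matmul_times_us(batch=64, d_model=8192, d_ff=28672, tp=32)
bottleneck = max(times, key=times.get)
print({k: round(v, 1) for k, v in times.items()}, bottleneck)
print(int(y_max_compute_bound(28672, 2)), y_max_bandwidth_bound(28672, 64))
```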

For the KV cache sharding: LLaMA 3-70B has K=8 KV heads. With 8-way TP, each chip stores exactly 1 KV head. Beyond 8-way TP, we must split individual heads across chips, adding complexity.

KV head sharding in detail: With K=8 KV heads and Y-way TP:

TP degree | KV heads per chip | Split needed? | Extra ICI for attention?
4 | 2 | No | None (each chip has full heads)
8 | 1 | No | None
16 | 0.5 | Yes (split across 2 chips) | AllGather within head-shard groups
32 | 0.25 | Yes (split across 4 chips) | More AllGather overhead

At 16-way TP, each KV head is split across 2 chips, requiring an AllGather of partial attention outputs. This adds ICI communication proportional to B × S × H per layer per TP group. For LLaMA 70B at BS=64, S=8192: the extra volume is 64 × 8192 × 128 × 2 = 134 MB per layer. At 90 GB/s ICI, this is 1.5 ms — noticeable but manageable.

Topology | TP degree | ICI status | Best for
4×2 | 8 | Well within limit | Cost-optimized serving
4×4 | 16 | Within limit | Low latency
4×8 | 32 | Marginal (need small B) | Ultra-low latency

Check: Why can we scale to more model parallelism in the bandwidth-bound regime than the compute-bound regime?

Chapter 6: Disaggregated Serving

Prefill and decode have different bottlenecks. Prefill is compute-bound; decode is bandwidth-bound. Running them on the same hardware means one phase is always under-utilizing the hardware.

Disaggregated serving runs prefill on compute-optimized chips and decode on bandwidth-optimized chips. The KV cache is transferred between them after prefill completes.

Prefill servers: High FLOPs/s, large batch of prompts. Process in parallel. Ship KV caches to decode servers.
↓ KV cache transfer
Decode servers: High HBM bandwidth, batch of active sequences. Generate tokens autoregressively.

Let us size the prefill:decode ratio. Assume median prefill = 8192 tokens, median decode = 512 tokens:

Prefill latency = 0.91 s (16 v5e chips, 40% MFU)
Decode latency per seq = 0.019 × 512 = 9.7 s (BS=32, 16 chips, taking ~19 ms per step)
Prefill feeds: 1 / 0.91 = 1.1 sequences/s per prefill server
Decode consumes: 32 / 9.7 = 3.3 sequences/s per decode server
Ratio: we need 3.3 / 1.1 = 3 prefill servers per decode server

Prefill is the bottleneck. Each prefill takes nearly a second, while decode processes a batch of 32 over ~10 seconds. We need 3x more prefill capacity than decode capacity. This makes intuitive sense: each request brings 8192 prompt tokens versus 512 generated tokens, and a prefill server handles one prompt at a time while a decode server advances 32 sequences per step.

The transfer cost: shipping 8192 tokens of KV cache = 8192 × 160 kB = 1.3 GB. Over inter-node network at ~50 GB/s, this takes 26 ms — negligible compared to the seconds-scale compute.
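The ratio and transfer estimates can be sketched as follows. Function names are illustrative; the 40% MFU, the 19 ms decode step, and the ~50 GB/s network figure are the assumptions already used in the text.

```python
def prefill_seconds(n_params, prompt_tokens, n_chips, flops=1.97e14, mfu=0.4):
    """Compute-bound prefill time: 2 FLOPs per parameter per prompt token."""
    return 2 * n_params * prompt_tokens / (n_chips * flops * mfu)

def prefill_servers_per_decode_server(prefill_s, decode_batch, decode_step_s, decode_tokens):
    """How many prefill servers are needed to keep one decode server fed."""
    prefill_rate = 1 / prefill_s                                 # sequences/s produced
    decode_rate = decode_batch / (decode_step_s * decode_tokens)  # sequences/s consumed
    return decode_rate / prefill_rate

def kv_transfer_ms(prompt_tokens, kv_bytes_per_token=160e3, network_bytes_per_s=50e9):
    """Time to ship one sequence's KV cache from a prefill to a decode server."""
    return prompt_tokens * kv_bytes_per_token / network_bytes_per_s * 1e3

t_prefill = prefill_seconds(70e9, 8192, 16)
ratio = prefill_servers_per_decode_server(t_prefill, 32, 0.019, 512)
print(round(t_prefill, 2), round(ratio, 1), round(kv_transfer_ms(8192)))
```

Changing the prompt or decode lengths in the call shows how the ratio shifts with the traffic mix.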

When does disaggregation help?

Scenario | Disaggregated? | Reason
Long prompts, short outputs | Yes | Prefill dominates; separate pools let you scale prefill independently
Short prompts, long outputs | Maybe not | Decode dominates; KV transfer overhead may not be worth it
Mixed traffic | Yes | Prevents long prefills from blocking active decode batches
Very small deployment | No | Not enough traffic to justify two separate server pools

The interference problem: Without disaggregation, a long prefill request (e.g., 32K tokens) arriving in a batch of actively decoding sequences will stall all decode slots while the prefill runs. This causes latency spikes for in-flight decode requests. Disaggregation eliminates this interference entirely.

Worked example — cost of the interference problem:

Decode batch: 32 sequences actively generating, step time = 17 ms
New request arrives: 32K token prompt needs prefill
Prefill time for 32K tokens = 2 × 70e9 × 32768 / (16 × 1.97e14 × 0.4) = 3.6 seconds!
All 32 active sequences are blocked for 3.6 seconds
Decode steps lost: 3.6 s / 0.017 s ≈ 212 steps
That is 212 × 32 ≈ 6,776 tokens of decode throughput lost!

This is catastrophic for latency SLAs. Users with active conversations experience a multi-second pause. Disaggregation or chunked prefill is not optional — it is a requirement for production serving.
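A rough sketch of the stall arithmetic, under the same assumptions as the worked example (one blocking prefill, a fixed decode step time, and a fixed decode batch). `stalled_decode` is an illustrative name.

```python
def stalled_decode(prefill_s, step_s, batch_size):
    """Decode steps and tokens that a blocking prefill displaces."""
    steps_lost = prefill_s / step_s          # steps that would have run meanwhile
    tokens_lost = steps_lost * batch_size    # one token per sequence per step
    return steps_lost, tokens_lost

steps, tokens = stalled_decode(prefill_s=3.6, step_s=0.017, batch_size=32)
print(round(steps), round(tokens))
```

Each lost step corresponds to one token per active sequence, so every user in the batch waits for the full prefill duration before seeing the next token.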

Chunked prefill as an alternative: Instead of disaggregating, we can break the 32K prefill into 64 chunks of 512 tokens each. Each decode step processes one chunk alongside the normal decode batch. The chunk is not free: at the same 40% MFU its compute takes about 3.6 s / 64 ≈ 57 ms, so a step that carries a chunk runs for tens of milliseconds rather than 17 ms, and the new request still needs roughly 3.6 seconds of compute to prefill. What changes is that the decode batch keeps emitting a token on every step: active sequences are slowed for a few seconds instead of frozen.

Check: In disaggregated serving for LLaMA 70B, which phase needs more server replicas?

Chapter 7: Cost / Throughput Calculator

Explore the latency-throughput Pareto frontier for LLaMA 70B serving on TPU v5e. Adjust batch size and sequence length to see the dramatic tradeoff between cost and latency.

Serving Cost Explorer

LLaMA 70B (int8 params, int8 KV) on 16x TPU v5e.

Check: At the cost of doubling per-token latency, roughly how much can you reduce per-token cost?

Chapter 8: Takeaways

Serving LLaMA 3-70B on TPUs teaches us the essential economics of LLM inference:

Key finding | Value
Best hardware (FLOPs/$) | TPU v5e at 5.8e17 FLOPs/$
KV cache per token | 160 kB (int8)
Min topology (int8) | 4×2 = 8 chips
Critical batch (int8 on v5e) | B* = 120
Step time with full HBM | HBM/W = 19.5 ms
Prefill:Decode ratio | ~3:1 prefill servers needed
Max useful TP | ~16-26 (depends on batch and ICI axes)

The three levers for serving cost:
1. Quantize aggressively. int8 halves param memory, halves critical batch size, doubles effective throughput per chip. int4 goes further.
2. Maximize batch size. Fill HBM with KV caches. On 16 chips, going from BS=1 to BS=120 improves throughput by ~37x while latency rises only ~3x.
3. Right-size the topology. The minimum topology is not the cheapest per token. Extra chips enable larger batches for better hardware utilization.

The latency-throughput tradeoff is extreme: at BS=1 on 16 chips, latency is ~5.4 ms but per-chip throughput is only 12 tok/s. At BS=120, latency rises to ~17 ms (3.2x) but throughput jumps to 434 tok/s per chip (about 37x). Almost all production systems operate at high batch sizes because the cost savings are overwhelming.

Comparison with real-world API pricing:

Provider | Model | Price per 1M output tokens
OpenAI | GPT-4o | $10-15
Google | Gemini 1.5 Pro | $5-10
Meta (self-hosted) | LLaMA 70B | $0.70-1.20 (our estimate)

Our back-of-the-envelope estimate of $0.70-1.20 per 1M tokens for LLaMA 70B on v5e is in line with third-party hosting costs from providers like Together.ai and Anyscale. The premium charged by OpenAI and Google reflects API margins, safety infrastructure, and the amortized cost of training.

The road to cheaper inference:

Better quantization: FP8/INT4 with calibration can reduce param size by 2-4x with minimal quality loss
Architectural improvements: MQA/GQA reduces KV cache. Smaller models with better training data match larger ones.
Hardware improvements: Higher HBM bandwidth (HBM3e), larger HBM capacity, better FLOPs/$
System optimizations: Speculative decoding, prefix caching, continuous batching, disaggregated serving

Key formula summary:

Step latency: T_step = (params + KV) / (N × W_HBM)
Throughput: B / T_step
QPS/chip: B / (T_step × median_decode × N)
Min chips: ceil(param_bytes / HBM_per_chip)
Max BS: (N × HBM_per_chip - params) / KV_per_seq

Check: What is the single most impactful optimization for reducing LLaMA 70B serving cost?