Parallelism strategies, serving optimization, batch scheduling, and cost analysis — every calculation an ML infrastructure engineer needs to derive cold.
You have a single weight matrix that's too large for one GPU's memory. Or you need the forward pass to run faster than one GPU can manage. The solution: tensor parallelism (TP) — slice each weight matrix across multiple GPUs so they compute in parallel.
The core idea is simple. Take a weight matrix W of shape [d, d]. With TP=4, you split it column-wise into 4 shards: each GPU holds Wi of shape [d, d/4]. The input x is broadcast to all GPUs. Each GPU computes yi = x · Wi. Then you need an allreduce to combine partial results.
A transformer layer has d=8192 and uses standard 4d² attention + 8d² FFN = 12d² parameters. With TP=8, each GPU holds 1/8 of every weight matrix. How many parameters does each GPU store per layer?
Each GPU stores 1/TP of every weight matrix. Since all weight matrices in a layer are [d, d] or [d, 4d] (and their transposes), every one divides evenly by TP. The 12d²/TP rule is exact.
For one transformer layer with d=8192, batch size B=32, sequence length S=2048, using FP16 (2 bytes per element): how much data does each GPU send during the 2 allreduce operations in the forward pass?
In a ring allreduce on TP GPUs, each GPU sends (TP-1)/TP × message_size bytes. For simplicity, approximate this as ~1× the message size (accurate when TP is large). Each allreduce sends a tensor of shape [B, S, d].
At 2 GB per layer, an 80-layer model sends 160 GB through the interconnect per forward pass. On NVLink at 900 GB/s, that's ~178 ms of pure communication time. On InfiniBand at 400 GB/s, it's 400 ms — which is why TP is restricted to intra-node.
TP=8 cuts each GPU's compute by 8× but adds 2 allreduce per layer. For latency (time-to-first-token for one request), TP helps because each GPU does less math. For throughput (requests per second), 8 independent copies (DP=8) would each serve a request simultaneously — 8× throughput with zero communication overhead.
The rule: use TP to reduce latency when the model is too slow on one GPU, use DP to increase throughput when the model already fits.
Arrange the steps of a tensor-parallel forward pass for one transformer layer (attention + FFN).
1. Column-split Q,K,V projections (each GPU computes its head subset).
2. Local attention — each GPU runs full attention on its heads (no communication needed since heads are independent).
3. Row-split WO + allreduce — combine head outputs back to full hidden dimension.
4. Column-split FFN W1 + activation — each GPU computes part of the expanded hidden state.
5. Row-split FFN W2 + allreduce — combine back to full d.
6. Add residual + LayerNorm — on the now-complete hidden state.
A 70B parameter model stored in FP16 (2 bytes per parameter). With TP=8 across 8 GPUs, how much weight memory does each GPU need?
17.5 GB of weights leaves plenty of room on an 80 GB H100 for KV cache, activations, and CUDA overhead. This is exactly why TP=8 is the standard for serving 70B models — one 8-GPU node handles it comfortably.
Tensor parallelism splits each layer across GPUs. Pipeline parallelism (PP) does something different: it assigns entire layers to different GPUs. GPU 0 gets layers 0-7, GPU 1 gets layers 8-15, and so on. The forward pass flows from GPU to GPU like an assembly line.
The problem? The assembly line has a bubble. When GPU 0 is computing its layers for microbatch 1, GPUs 1-3 are idle — they're waiting for GPU 0's output. This warmup phase wastes compute. The bubble fraction tells you how much time is lost.
With PP=4 and M=1 (single microbatch), what fraction of total GPU time is wasted in the pipeline bubble?
75% of total GPU time is wasted. Each GPU is idle for 3 out of 4 time steps during warmup. This is catastrophic — you're paying for 4 GPUs but getting the throughput of 1.
Same PP=4 setup. How many microbatches M do you need to reduce the bubble fraction below 10%?
Solve: (PP - 1) / (PP - 1 + M) < 0.10
You need at least 27 microbatches. With M=27, bubble = 3/30 = 10%. With M=32 (a more practical power-of-two choice), bubble = 3/35 = 8.6%. This is why training configs often use large microbatch counts.
A 32-layer model with PP=4 and PP=8. How many layers per GPU in each case? Which PP degree would you expect to be more memory-efficient per GPU?
PP=8 uses less memory per GPU (4 layers vs 8) but has a larger bubble (7/(7+M) vs 3/(3+M)). There's always a tradeoff between memory savings and pipeline efficiency.
PP=16, M=32. What is the bubble fraction?
Nearly a third of compute is wasted. To get this below 5%, you'd need M > 285. This is why PP=16 is rarely used — the bubble is too large to hide with reasonable microbatch counts.
Write a function that computes the pipeline bubble fraction given PP and M.
javascript function pipelineBubbleFraction(pp, m) { return (pp - 1) / (pp - 1 + m); }
PP communicates activations (forward) and activation gradients (backward) between adjacent stages. The activation tensor shape is [B, S, d] — much smaller than the weight matrices. This is a point-to-point send (GPU k to GPU k+1), not an allreduce, which is why PP works well across nodes over InfiniBand.
The simplest form of parallelism: give every GPU a full copy of the model, feed each GPU different data, and average the gradients. That's Distributed Data Parallel (DDP). Simple, but memory-wasteful — every GPU stores the full model weights, gradients, and optimizer states.
Fully Sharded Data Parallel (FSDP) fixes the memory problem. Instead of replicating everything, FSDP shards the model parameters, gradients, and optimizer states across GPUs. Before each layer's forward pass, GPUs allgather the parameters they need, compute, then discard the non-local shards.
A 7B parameter model trained with DDP using mixed precision (FP16 weights/gradients + FP32 Adam optimizer). How much memory does each GPU need just for model state (weights + gradients + optimizer)?
112 GB exceeds even an H100's 80 GB HBM. This doesn't even include activations. A 7B model in DDP won't fit on a single 80 GB GPU — you need either FSDP, DeepSpeed ZeRO, or gradient checkpointing.
Same 7B model, now with FSDP across 32 GPUs. How much model state memory does each GPU need?
From 112 GB to 3.5 GB — a 32× reduction. Each GPU only stores 1/32 of the weights, gradients, and optimizer states. The rest is reconstructed on-the-fly via allgather before each layer's computation. This is the magic of FSDP.
In DDP, each GPU allreduces the full gradient (2 bytes per param in FP16). In FSDP, each GPU allgathers the full model weights before each forward layer and reduce-scatters gradients after backward. For a 7B model in FP16, what is the total communication volume per training step for DDP?
Allreduce transfers ~2× the message size (reduce-scatter + allgather).
FSDP has higher communication: it allgathers weights before forward (14 GB), allgathers again before backward (14 GB), and reduce-scatters gradients after backward (14 GB) — about 42 GB total. FSDP trades 1.5× more communication for dramatically less memory per GPU.
ZeRO-1 is the sweet spot for many setups: it shards the biggest memory consumer (optimizer states are 12/16 = 75% of total) while requiring no extra communication beyond DDP's gradient allreduce.
Write a function that returns memory per GPU (in GB) for FSDP, given model size in billions and number of GPUs. Use the 16 bytes/param rule.
javascript function fsdpMemoryPerGPU(paramsBillions, numGPUs) { return (paramsBillions * 16) / numGPUs; }
Real large-scale training combines all three: tensor parallelism within a node (fast NVLink), pipeline parallelism across nodes (point-to-point activations), and data parallelism for throughput scaling. The total GPU count factors cleanly:
You have 256 GPUs (32 nodes, 8 GPUs each). You want to train a 70B model with TP=8, PP=4. What is the DP degree?
8 data-parallel replicas, each consisting of 4 pipeline stages of 8 TP-sharded GPUs. Each replica spans 4 nodes (32 GPUs). The 8 replicas process 8 different microbatches simultaneously.
70B model, TP=8, PP=4, 80 layers. In FP16 (2 bytes/param), how much weight memory per GPU?
Each GPU holds 1/TP of 1/PP of the layers. With 80 layers and PP=4, each stage has 20 layers.
Only 4.4 GB of weights per GPU. But for training, you also need optimizer states (~4.375 × 6 = ~26 GB if not using FSDP), activations, and gradient memory. This is why even with TP+PP, FSDP (or ZeRO) is often still needed for the optimizer states.
TP is the most communication-intensive form of parallelism. With an 80-layer model, you perform 160 allreduce operations per forward pass. Even at NVLink speeds, this takes hundreds of milliseconds. On InfiniBand, the higher latency and lower bandwidth would increase this by 2-5×, making TP the bottleneck. PP, by contrast, only sends point-to-point activations between adjacent stages — much less frequent.
Llama 3.1 405B was trained on 16,384 H100 GPUs. If TP=8 and PP=16, what is the DP degree?
128-way data parallelism means 128 independent replicas each processing a different microbatch. With a global batch size of ~16M tokens, each replica handles ~128K tokens per step. This is the scale of frontier training.
With DP=128, each replica processes a microbatch of 4 sequences × 4096 tokens per sequence. What is the global batch size in tokens?
~2M tokens per gradient step. With gradient accumulation (say 8 accumulation steps), the effective batch can reach 16M+ tokens. LLaMA 3.1's training report mentions a 16M token batch.
This function computes per-GPU weight memory for 3D parallelism but gives the wrong answer. Click the buggy line.
function gpuWeightMemGB(totalParams, tp, pp, dp) { const bytesPerParam = 2; // FP16 const totalBytes = totalParams * bytesPerParam; const perGPU = totalBytes / (tp * pp * dp); return perGPU / 1e9; // convert to GB }
Line 4 divides by TP × PP × DP, but DP does not shard weights. In data parallelism, every replica holds a full copy of the model (or its TP×PP shard). Dividing by DP would only be correct for FSDP. The correct line is: const perGPU = totalBytes / (tp * pp);
This is a common mistake: confusing "GPU count" with "sharding degree." DP replicates; only TP and PP shard the weights.
You're serving a language model. 32 requests come in simultaneously. In static batching, you pad all sequences to the same length and wait until every request finishes. If one request generates 20 tokens and another generates 500 tokens, the 20-token request sits idle in the batch for 480 steps — pure waste.
Continuous batching (also called iteration-level scheduling) fixes this: the moment a request finishes generating, its slot is immediately freed and a new request from the queue takes its place. No padding, no waiting.
32 slots, average output length = 200 tokens, max output length = 512 tokens. What is the utilization (fraction of useful compute) in static batching?
Only 39.1% utilization. Over 60% of GPU cycles are spent computing attention and FFN on padding tokens or idle slots. This is the fundamental problem continuous batching solves.
With the same setup (avg=200, max=512), what is the throughput multiplier of continuous batching over static batching? Assume the request queue is never empty (always saturated).
2.56× more tokens per second from the same hardware. If you're paying $2/hr per GPU, continuous batching effectively cuts your cost from $2/hr to $0.78/hr for the same throughput.
When avg ≈ max, all requests finish around the same time. Static batching wastes almost nothing because there's no padding. With avg=500 and max=512, static utilization is 500/512 = 97.7%. Continuous batching only gives a 1.02× improvement — barely noticeable. The bigger the gap between avg and max, the more continuous batching helps.
Write a function that computes the slot utilization for static batching. Utilization = average output length / max output length.
javascript function slotUtilization(avgLen, maxLen) { return avgLen / maxLen; }
Continuous batching can only fill freed slots if there are requests waiting. If your server has 32 slots and receives an average of 10 requests/second, each taking ~5 seconds to generate, what is the average queue occupancy? Is the server saturated?
50 requests in the system but only 32 slots — the server is over-saturated by 18 requests. Those 18 sit in the queue, adding latency. You need either more GPUs (more slots) or faster generation (shorter W) to keep up.
During autoregressive generation, each request accumulates a KV cache — the key and value tensors from every previous token. In a 70B model with 80 layers, 64 heads, and head_dim=128, each token's KV cache is:
For a 4K context: 320 KB × 4096 = 1.28 GB per request. With 32 concurrent requests: 41 GB just for KV cache. Traditional systems pre-allocate max_seq_len for every request, wasting memory when most requests are shorter. PagedAttention (from vLLM) solves this by managing KV cache like virtual memory pages.
LLaMA 70B: 80 layers, 8 GQA KV heads, d_head=128, FP16 (2 bytes). How many KB per token for the full KV cache?
GQA with 8 KV heads (vs 64 Q heads) saves 8× on KV cache compared to MHA. Without GQA, this would be 2,560 KB per token — 10× more. GQA is one of the most impactful memory optimizations for serving.
Using 320 KB/token, 32 concurrent requests, each with 4096 context tokens. How much memory for KV cache?
40 GB for KV cache alone. Combined with 17.5 GB of weights (TP=8), that's 57.5 GB out of 80 GB per GPU. At longer contexts (16K, 128K), KV cache becomes the dominant memory consumer, dwarfing the model weights.
Total KV cache = 40 GB. Page size = 16 tokens × 320 KB/token = 5,120 KB = 5 MB per page. How many pages needed?
8,192 pages tracked by a page table. Each request's KV cache is a list of page pointers, not a contiguous allocation. This enables sharing (prefix caching) and eliminates fragmentation.
A request generates 200 tokens. Pages hold 16 tokens. How many pages are allocated, and what percentage of the last page is wasted (internal fragmentation)?
Average internal fragmentation is half a page per request = 8 tokens × 320 KB/token = 2.5 MB per request. Across 32 requests: 80 MB wasted — a tiny 0.2% of the 40 GB total. Compare this to pre-allocating full max_seq_len: (4096-200) × 320 KB = 1.22 GB wasted per request!
Without PagedAttention: pre-allocate 4096 tokens per request. With PagedAttention: allocate only what's needed. If average sequence length is 800 tokens, what is the memory savings ratio?
5× less KV cache memory means you can serve 5× more concurrent requests on the same hardware, or use 5× less hardware for the same concurrency. This is why vLLM (which introduced PagedAttention) became the default serving framework.
Autoregressive generation is slow because each token depends on the previous one — you can't parallelize. Speculative decoding cheats: a small, fast draft model generates k candidate tokens. The large target model then verifies all k tokens in a single forward pass (same cost as generating 1 token, since it's just a batch of k+1 positions). Accepted tokens are free; rejected tokens cost nothing extra.
Draft model generates k=5 tokens, acceptance rate α=0.7. Using the approximation E = kα + 1, how many tokens do you expect per target model forward pass?
Instead of 1 token per forward pass, you get 4.5 tokens. That's a 4.5× improvement in tokens-per-step. But we haven't accounted for the draft model's cost yet — the actual wall-clock speedup is less.
Target model latency: 50 ms/forward pass. Draft model: 5 ms/forward pass (10× faster). With k=5, α=0.7: what is the wall-clock speedup for tokens per second?
Time per speculative step = k × draft_time + 1 × target_time. Compare tokens/time: speculative vs standard (1 token per 50 ms).
3× wall-clock speedup despite a 4.5× token-per-step improvement, because the draft model takes 25 ms (5 sequential forward passes). If the draft model ran on a separate GPU in parallel, you'd get closer to 4.5×.
With k=5, draft at 5 ms and target at 50 ms (same GPU, sequential): what is the minimum acceptance rate α for speculative decoding to be faster than standard decoding?
Standard: 1 token/50ms. Speculative: (kα+1) tokens / (5k+50) ms. Set speculative rate > standard rate and solve for α.
The break-even is surprisingly low — only 10% acceptance rate. This means speculative decoding is almost always worth it when the draft model is 10× faster. Even a terrible draft model (20% acceptance) gives a 1.47× speedup.
Speculative decoding helps with latency (tokens/second for a single request). But in high-throughput batch mode, the GPU is already fully utilized computing attention for a large batch. Adding a draft model doesn't help because the bottleneck is memory bandwidth (loading weights), not sequential token generation. With batch size 64+, each forward pass already produces 64 tokens — speculative decoding's benefit disappears.
Using the exact formula E = (1 - αk+1) / (1 - α) with k=5, α=0.7: what is the exact expected number of accepted tokens per step? (Hint: calculate 0.76 first.)
The exact formula gives 2.94, compared to the approximation kα+1 = 4.5. The approximation significantly overestimates because it ignores the geometric decay — the probability of accepting all k tokens is αk = 0.75 = 16.8%, so most of the time some tokens are rejected.
Autoregressive decoding is memory-bandwidth bound: each generated token requires loading the entire model from HBM to the compute units. At FP16, a 70B model = 140 GB of weights. An H100 has 3.35 TB/s HBM bandwidth. Loading 140 GB takes 42 ms — that's your minimum per-token latency, no matter how fast the compute is.
Quantization shrinks the model: INT8 halves it to 70 GB, INT4 quarters it to 35 GB. Smaller weights = faster loading = faster token generation.
A 70B parameter model in FP16. How many GB of weight memory?
Same 70B model quantized to INT4 (4 bits per weight = 0.5 bytes). How many GB?
35 GB fits on a single 80 GB H100 with plenty of room for KV cache. Without quantization, you'd need at least TP=2 to fit the model. Quantization can literally halve your GPU cost for serving.
70B model, INT8 quantized (70 GB), H100 with 3.35 TB/s HBM bandwidth, batch size 1 (pure bandwidth-bound). What is the maximum decode throughput in tokens/second?
47.9 tok/s vs 23.9 tok/s at FP16. Exactly 2× faster because the model is exactly 2× smaller. This is the bandwidth-bound regime: throughput scales linearly with model compression.
At what batch size does a 70B INT8 model on H100 transition from bandwidth-bound to compute-bound? H100 has 990 TFLOPS INT8 and 3.35 TB/s bandwidth. The arithmetic intensity crossover happens when FLOPs/byte = hardware_FLOPS/bandwidth.
Each token requires ~2P FLOPs (forward pass through 70B params = 140 GFLOPs) and loads P bytes (70 GB for INT8). Compute-bound when: B × 140 GFLOPs / 70 GB > 990 TFLOPS / 3.35 TB/s.
Below batch ~148, you're bandwidth-bound and quantization's throughput benefit is linear. Above 148, you're compute-bound and quantization helps less (INT8 compute is still faster than FP16, but the scaling is sublinear). Most serving scenarios operate well below this crossover.
INT8 is the sweet spot for quality-sensitive applications. The 0.1 perplexity increase is almost imperceptible in code generation quality, but you get 2× throughput (or half the GPUs, half the cost). INT4's 0.5 ppl increase can noticeably affect code correctness — missing edge cases, incorrect variable names. FP16 is wasteful when INT8 is essentially lossless.
To serve a 70B model, how many H100 GPUs (80 GB each) do you need at FP16 vs INT4? (Weights only — ignore KV cache for now.)
INT4 quantization cuts your serving cost in half by fitting the model on a single GPU. Plus, single-GPU serving has zero communication overhead — no TP allreduce latency. This is a 2× cost reduction AND lower latency.
You've optimized the model — now optimize the bill. Serving LLMs is a business: your revenue per token must exceed your cost per token. The math is surprisingly simple once you know the hardware costs and throughput numbers.
An H100 costs $3.50/hr (on-demand). You're serving a 70B INT8 model on 2 GPUs (TP=2), achieving 3,000 output tokens/second. What is the cost per 1M output tokens?
$0.65 per million output tokens. For comparison, OpenAI charges $15/1M output tokens for GPT-4. The raw compute cost is ~23× less than the API price — the markup covers engineering, infrastructure, redundancy, and margin.
H100 spot price: $2.00/hr. On-demand: $3.50/hr. If your spot instances get interrupted 15% of the time (85% availability), is spot cheaper for serving?
Effective spot cost = spot_price / availability. Compare to on-demand.
Spot is cheaper even with 15% interruptions. However, for production serving with SLAs, interruptions cause dropped requests and latency spikes. Spot is better for batch inference (reprocessable work) than real-time serving.
A 1-year reserved H100 costs $2.20/hr (committed). On-demand is $3.50/hr. What utilization rate makes reserved cheaper than on-demand?
Reserved cost is fixed (you pay even when idle). On-demand only when used. Break-even: reserved_price = on_demand_price × utilization.
If you use the GPU more than 62.9% of the time, reserved is cheaper. Most production serving systems targeting 70%+ utilization should use reserved instances. For dev/staging at <30% utilization, on-demand is clearly better.
You serve both a 7B model (1 GPU, INT4, 15K tok/s) and a 70B model (2 GPUs, INT8, 3K tok/s). Traffic is 80% to 7B and 20% to 70B. Total request rate: 1,000 requests/s with average 200 output tokens each. How many GPUs do you need for each model?
Tokens/s demand = requests/s × output_length × traffic_fraction.
Even though 80% of traffic goes to 7B, the 70B model needs 28/39 = 72% of the GPUs. The large model is 5× slower per token AND uses 2× more GPUs per replica. Routing more traffic to the smaller model has a massive cost impact.
39 GPUs at $3.50/hr (on-demand), running 24/7 for 30 days. What's the monthly bill?
~$100K/month to serve 1,000 requests/second. At 200 tokens/request, that's 200K tok/s = 17.28B tokens/month. Cost per million tokens = $98,280 / 17,280 = $5.69/M tokens (blended across both models).
Actually ~$50K/month savings — even more than option C. INT4 reduces the 70B serving from 28 GPUs to 8 GPUs (3.5× reduction). The total infrastructure cost is nearly halved. This is why aggressive quantization is standard for production serving.
Time to put it all together. You're the ML infrastructure lead. The task: serve a 70B model to 10,000 concurrent users. Context window: 4K tokens. Target: 30 tokens/second per user. p99 latency: first token within 2 seconds. Your budget needs to be defensible to the VP of Engineering. Every number must have a derivation.
10,000 concurrent users, each receiving 30 tok/s. But not all users are actively generating at once — assume 30% are in generation at any moment (the rest are reading, typing, or idle). What is the required aggregate output throughput?
90K tokens per second. This is the number that sizes everything else — GPUs, replicas, cost.
INT4 (AWQ) is the right choice. 35 GB weights leave 45 GB for KV cache on an 80 GB H100. At 320 KB/token, that's 144K tokens of KV cache — enough for ~36 concurrent 4K-context requests per GPU. INT8 would leave less KV cache room, requiring more GPUs. FP16 doesn't fit on 1 GPU at all.
With INT4 on a single H100 (3.35 TB/s bandwidth), the bandwidth-bound decode throughput at batch=1 is ~96 tok/s. With continuous batching and batch size 32, throughput scales nearly linearly (bandwidth-bound). What is the per-GPU throughput at batch=32?
~3,000 tok/s per GPU at batch=32. This is still bandwidth-bound (batch 32 << crossover batch ~300 for INT4). Each decode step loads 35 GB of weights once and produces 32 tokens — amortizing the bandwidth cost across the batch.
Required: 90,000 tok/s. Per GPU: ~3,000 tok/s. How many GPUs? Add 20% headroom for load spikes and maintenance.
36 H100 GPUs. That's about 4.5 nodes of 8 GPUs each — round to 5 nodes (40 GPUs) for clean hardware allocation. The extra 4 GPUs provide additional failover capacity.
Each GPU has 80 GB. Weights = 35 GB. CUDA overhead ≈ 5 GB. Remaining for KV cache? With 320 KB/token and 4K context, how many concurrent requests per GPU?
32 concurrent requests per GPU with PagedAttention. At batch=32, we estimated ~3K tok/s — this matches our throughput calculation. The system is KV-cache-limited, not compute-limited. To serve more requests per GPU, you'd need either shorter contexts, smaller KV heads (more aggressive GQA), or KV cache quantization (INT8 KV halves the per-request cost).
580 ms for prefill, leaving ~1.4 seconds for queuing, scheduling, and network overhead. Achievable at p99 with proper queue management. Note: this assumes the GPU is dedicated to this request during prefill — in practice, prefill is often separated from decode ("disaggregated serving") to avoid head-of-line blocking.
36 H100 GPUs, reserved pricing at $2.20/hr. What's the monthly cost?
~$57K/month to serve 10,000 users. That's $5.70 per user per month. If each user pays $20/month, your gross margin on compute is ($20 - $5.70) / $20 = 71.5%. This is why LLM API businesses are viable — the per-user compute cost is surprisingly low at scale.
Total cost: $57,024/month. Total throughput: 90,000 tok/s. What is the blended cost per 1M output tokens?
$0.24 per million output tokens at full utilization. This is the raw infrastructure cost for a well-optimized 70B INT4 serving system. Any API pricing above this (plus engineering costs and margin) is profitable.
Put the serving system components in the correct order, from user request to generated response.
1. Load balancer routes to GPU (least-loaded scheduling).
2. Allocate KV cache pages (PagedAttention reserves blocks).
3. Prefill: process input prompt (compute-bound, fills KV cache).
4. Decode: autoregressive generation (bandwidth-bound, token by token).
5. Stream tokens to user (SSE/WebSocket, as they're generated).
6. Free KV pages, return slot (continuous batching fills the slot immediately).
| Topic | Lesson |
|---|---|
| Transformer fundamentals | Transformer — From Absolute Zero |
| Parameter counting | Transformer Math Workbook |
| Distributed training | Distributed Training — From Absolute Zero |
| ML inference | ML Inference Engineer — Day In The Life |
| Scaling laws | Scaling Book Workbook |