The gap between "it works" and "it ships at 30 FPS in a car" is performance engineering. Every chapter covers concept, system design, code, debugging, and the frontier.
You're an ML performance engineer at an autonomous driving company. Not a researcher who publishes papers. Not a data scientist who trains models in notebooks. You own the full pipeline from training infrastructure to real-time inference in a vehicle traveling at 70 mph.
Here is what your day looks like. Morning: a training run that costs $47,000 per run on 64 A100 GPUs is taking 72 hours. Your job is to cut that to 24 hours without touching the model architecture or losing accuracy. Afternoon: the perception model runs at 42ms on the in-vehicle compute — but the deadline is 33ms for 30 FPS. You need to shave 9ms. Evening: the serving team reports P99 latency spiking to 200ms during traffic bursts. You dig into profiler traces.
The full system you own:
Real numbers for a perception model training run on 64 A100 GPUs:
| Metric | Before Optimization | After Optimization |
|---|---|---|
| Training time | 72 hours | 22 hours (3.3x) |
| GPU cost | $47,000 | $14,300 |
| GPU utilization | 34% | 82% |
| Inference latency | 42ms | 18ms |
| Serving throughput | 200 req/s | 1,400 req/s |
That 3.3x training speedup came from stacking many techniques: mixed precision (1.7x), torch.compile (1.15x), fused optimizer (1.12x), fixing the data pipeline (1.4x), and overlapping communication (1.1x). No single trick gave 3x. Performance engineering is the discipline of stacking 1.1x improvements until they compound.
The first thing you learn: GPUs don't scale linearly. Going from 1 GPU to 8 doesn't give you an 8x speedup. Communication overhead (synchronizing gradients across GPUs after every step) eats into your gains. On a well-optimized training run, expect ~85% scaling efficiency at 8 GPUs, ~70% at 64, and ~50% at 256+.
Here are the numbers for ResNet-50 on ImageNet. Single V100: ~14 hours. 8 V100s (DDP): ~2 hours (7x, not 8x). 256 V100s (highly optimized): ~25 minutes (33x, not 256x). The gap between linear scaling and reality is your optimization target.
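To make the gap concrete, scaling efficiency is just measured speedup divided by GPU count. The sketch below plugs in the ResNet-50 figures quoted above; the function name is illustrative, not part of any library.

```python
def scaling_efficiency(single_gpu_minutes, multi_gpu_minutes, n_gpus):
    """Return (speedup, efficiency) for a run scaled from 1 GPU to n_gpus."""
    speedup = single_gpu_minutes / multi_gpu_minutes
    return speedup, speedup / n_gpus

# ResNet-50 figures from the text: 14 h on 1 V100, 2 h on 8, 25 min on 256
for n_gpus, minutes in [(8, 120), (256, 25)]:
    speedup, eff = scaling_efficiency(14 * 60, minutes, n_gpus)
    print(f"{n_gpus:>3} GPUs: {speedup:4.1f}x speedup, {eff:.1%} efficiency")
```

Tracking this one number per run is usually enough to tell whether adding GPUs is still paying for itself.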
Drag the slider to add GPUs. The teal line shows actual training time with communication overhead. The dashed line is the impossible ideal of perfect linear scaling.
A modern GPU is not one fast processor — it's thousands of tiny processors with a deep memory hierarchy. The bottleneck is almost never raw compute. It's getting data to the compute units fast enough.
CUDA cores are general-purpose floating-point units. Each does one multiply-add per cycle. The A100 has 6,912. Fine for general math, but not special.
Tensor Cores are the game-changer. A first-generation (Volta) Tensor Core performs a 4×4×4 matrix multiply-accumulate in ONE cycle: 64 fused multiply-adds, or 128 floating-point operations, simultaneously. Later generations do more per cycle. Since transformers are fundamentally stacks of matrix multiplications, Tensor Cores are why modern GPUs are so fast for ML. But they reach full efficiency only when matrix dimensions are multiples of 8 (FP16/BF16) or 16 (INT8); older cuBLAS/cuDNN releases did not use them at all for misaligned shapes.
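One common way to satisfy the alignment rule is to pad awkward dimensions (vocabulary size, hidden size) up to the next multiple. A minimal sketch, with an illustrative helper name and an illustrative vocabulary size:

```python
def pad_to_multiple(n: int, multiple: int = 8) -> int:
    """Round n up to the next multiple (8 for FP16/BF16, 16 for INT8)."""
    return ((n + multiple - 1) // multiple) * multiple

vocab_size = 50257  # illustrative: an awkwardly sized vocabulary
print(pad_to_multiple(vocab_size))      # 50264
print(pad_to_multiple(vocab_size, 16))  # 50272
```

The extra rows are never used as targets; they only keep the matmul shapes aligned.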
The memory hierarchy is where performance lives or dies:
| Level | Size (A100) | Bandwidth | Latency |
|---|---|---|---|
| Registers | ~256KB / SM | ~20 TB/s | ~1 cycle |
| Shared Mem / L1 | 192KB / SM | ~19 TB/s | ~20 cycles |
| L2 Cache | 40 MB | ~5 TB/s | ~200 cycles |
| HBM (Global) | 80 GB | ~2 TB/s | ~400 cycles |
That's a 10x bandwidth drop from L1 to HBM. Most ML operations read from HBM, compute, and write back to HBM. If the operation doesn't do enough math per byte loaded, the compute units sit idle waiting for data — you're memory-bound.
The roofline model is the staff engineer's first tool. It answers: "Is this operation limited by compute or by memory bandwidth?" Plot achievable throughput (TFLOPS) against arithmetic intensity (FLOPs per byte of memory traffic):
Below 156 ops/byte on the A100, you're memory-bound. Above, compute-bound. Most ML operations — element-wise activations, layer normalization, softmax — are deeply memory-bound. Only large matrix multiplications cross the line into compute-bound territory.
```python
def roofline_analysis(op_name, flops, bytes_transferred, gpu):
    """Determine if an operation is compute or memory bound."""
    intensity = flops / bytes_transferred  # ops/byte
    crossover = gpu["tflops"] * 1e12 / (gpu["bw_gbs"] * 1e9)
    if intensity < crossover:
        achievable_tflops = intensity * gpu["bw_gbs"] / 1e3
        utilization = achievable_tflops / gpu["tflops"]
        return (f"{op_name}: MEMORY-BOUND. Intensity={intensity:.1f} ops/B. "
                f"Achievable={achievable_tflops:.1f} TFLOPS ({utilization*100:.1f}% of peak)")
    return (f"{op_name}: COMPUTE-BOUND. Intensity={intensity:.1f} ops/B. "
            f"Achievable={gpu['tflops']} TFLOPS (peak)")

A100 = {"tflops": 312, "bw_gbs": 2000}

# LayerNorm: 4N FLOPs, reads+writes 12N bytes (N = hidden dim)
N = 4096
print(roofline_analysis("LayerNorm", 4*N, 12*N, A100))
# LayerNorm: MEMORY-BOUND. Intensity=0.3 ops/B. Achievable=0.7 TFLOPS (0.2% of peak)

# Large matmul: 2*M*N*K FLOPs, reads M*K + K*N, writes M*N
M, K = 4096, 4096
flops = 2 * M * N * K
bytes_rw = (M*K + K*N + M*N) * 2  # BF16 = 2 bytes
print(roofline_analysis("Matmul 4096x4096x4096", flops, bytes_rw, A100))
# Matmul 4096x4096x4096: COMPUTE-BOUND. Intensity=1365.3 ops/B. Achievable=312 TFLOPS (peak)
```
| GPU | Year | HBM | FP16/BF16 TFLOPS (dense) | BW (TB/s) | Crossover |
|---|---|---|---|---|---|
| V100 | 2017 | 16/32 GB | 125 | 0.9 | 139 ops/B |
| A100 | 2020 | 80 GB | 312 | 2.0 | 156 ops/B |
| H100 | 2023 | 80 GB | 990 | 3.4 | 291 ops/B |
| H200 | 2024 | 141 GB | 990 | 4.8 | 206 ops/B |
| B200 | 2025 | 192 GB | 2,250 | 8.0 | 281 ops/B |
The H100 Transformer Engine dynamically switches between FP8 and FP16 per layer, per training step. FP8 training halves memory traffic and doubles Tensor Core throughput — but requires careful per-tensor scaling. Custom silicon (Google TPUs, AWS Trainium) targets specific matrix shapes and collective operations, trading generality for efficiency.
Notice the trend: from V100 to H100, compute grew faster than bandwidth. The A100 crossover was 156 ops/byte; the H100's is 291. The H200 and B200 add bandwidth and pull the ratio back to 206 and 281, but it stays well above the A100's. This means more operations become memory-bound over time. The hardware is getting faster at math, but not proportionally faster at moving data. This is why techniques like FlashAttention (which reduce HBM traffic) become more valuable, not less, on newer hardware.
```python
# Roofline analysis: every operation in a single transformer layer
# Model: hidden=4096, seq=2048, batch=4, A100
A100 = {"tflops": 312, "bw_gbs": 2000}
crossover = A100["tflops"] * 1e12 / (A100["bw_gbs"] * 1e9)  # 156 ops/byte
B, S, D = 4, 2048, 4096  # batch, seq, hidden

# (name, FLOPs, bytes moved); BF16 = 2 bytes per element.
# Simplified: attention is treated as a single head, so the score
# matrix is counted once rather than once per head.
ops = [
    ("QKV Projection",    2*B*S*D*3*D, (B*S*D + 3*D*D + B*S*3*D)*2),
    ("Attention (QK^T)",  2*B*S*S*D,   (B*S*D*2 + B*S*S)*2),
    ("Softmax",           5*B*S*S,     B*S*S*2*2),
    ("Attn x V",          2*B*S*S*D,   (B*S*S + B*S*D + B*S*D)*2),
    ("Output Projection", 2*B*S*D*D,   (B*S*D + D*D + B*S*D)*2),
    ("LayerNorm",         4*B*S*D,     12*B*S*D),
    ("FFN Up (4x)",       2*B*S*D*4*D, (B*S*D + D*4*D + B*S*4*D)*2),
    ("GELU",              B*S*4*D,     B*S*4*D*2*2),
    ("FFN Down",          2*B*S*4*D*D, (B*S*4*D + 4*D*D + B*S*D)*2),
]

for name, flops, bytes_rw in ops:
    intensity = flops / bytes_rw
    bound = "MEM" if intensity < crossover else "COMP"
    print(f"{name:20s} | I={intensity:8.1f} | {bound}")

# Result: LayerNorm, Softmax, GELU are memory-bound
# All matmuls are compute-bound under this simplified model
# Fusing element-wise ops into neighbouring kernels eliminates HBM round-trips
# This is the kind of fusion torch.compile performs
```
X-axis: arithmetic intensity (ops/byte, log scale). Y-axis: achievable TFLOPS. Toggle GPU generations. Dots show common ML operations — red = memory-bound, green = compute-bound.
Eager mode: each operation executes immediately as Python encounters it. y = x @ W + b launches a matmul kernel, waits, then launches an add kernel. Each kernel launch has ~5-10μs of CPU overhead. For a 32-layer transformer with ~50 ops per layer, that's 8-16ms of wasted time per forward pass.
torch.compile (PyTorch 2.0+) traces the computation graph and fuses operations. A Linear + LayerNorm + GELU that takes 3 kernel launches and 3 HBM round-trips becomes ONE fused kernel: 1 read, compute all three in registers, 1 write. Typical speedup: 1.1-1.4x on transformer models.
The autograd DAG records every operation during forward for the backward pass. For a matmul y = x @ W, PyTorch saves both x and W because backward needs them: ∂L/∂W = xᵀ · ∂L/∂y needs x, and ∂L/∂x = ∂L/∂y · Wᵀ needs W. These saved tensors are where memory goes.
PyTorch's caching allocator pools freed GPU memory for reuse instead of returning it to CUDA. Calling torch.cuda.empty_cache() hurts because it forces reallocation. The key diagnostic: torch.cuda.max_memory_allocated() — the peak watermark.
Where does memory go when training a 7B parameter model? Every byte is accounted for:
| Component | Formula | 7B in BF16 + Adam |
|---|---|---|
| Model weights | params × 2 bytes (BF16) | 14 GB |
| Gradients | params × 2 bytes (BF16) | 14 GB |
| Adam first moment (m) | params × 4 bytes (FP32) | 28 GB |
| Adam second moment (v) | params × 4 bytes (FP32) | 28 GB |
| FP32 master weights | params × 4 bytes (FP32) | 28 GB |
| Subtotal (static) | sum of rows above | 112 GB |
| Activations | batch × seq × hidden × layers × ~12 × 2B | Variable |
```python
def training_memory_gb(params_b, batch, seq, hidden, layers, dtype_bytes=2):
    """Calculate total GPU memory for training."""
    P = params_b * 1e9

    # Static memory
    weights = P * dtype_bytes    # BF16 weights
    gradients = P * dtype_bytes  # BF16 gradients
    adam_m = P * 4               # FP32 first moment
    adam_v = P * 4               # FP32 second moment
    master_weights = P * 4       # FP32 master copy

    # Activation memory (per-layer, simplified)
    # Each transformer layer saves: input, QKV projections, attention scores,
    # attention output, FFN intermediate (4*hidden), FFN output
    act_per_layer = batch * seq * hidden * dtype_bytes * 12  # ~12x factor
    activations = act_per_layer * layers

    total = weights + gradients + adam_m + adam_v + master_weights + activations
    return {
        "weights_gb": weights / 1e9,
        "optimizer_gb": (adam_m + adam_v + master_weights) / 1e9,
        "gradients_gb": gradients / 1e9,
        "activations_gb": activations / 1e9,
        "total_gb": total / 1e9,
    }

mem = training_memory_gb(7, batch=1, seq=2048, hidden=4096, layers=32)
# weights: 14.0 GB, optimizer: 84.0 GB, gradients: 14.0 GB
# activations: 6.4 GB, total: 118.4 GB
```
Out of memory: check torch.cuda.max_memory_allocated(). The culprit is usually activations, not weights. Fixes in order: (1) gradient checkpointing, (2) smaller batch + gradient accumulation, (3) FSDP to shard optimizer states. Do NOT call torch.cuda.empty_cache(); it defeats the caching allocator.

Recompilation: set torch._dynamo.config.cache_size_limit = 64 if you see recompilation. Dynamic shapes trigger recompilation, so pad inputs to fixed lengths or use dynamic=True.

NaN loss: use torch.autograd.set_detect_anomaly(True) to find the exact operation. If NaN appears only after many steps, suspect gradient explosion and add gradient clipping.

Regional compilation lets you compile only performance-critical subgraphs instead of the whole model: faster compilation, fewer graph breaks. torch.export captures the full graph for deployment without Python. The Inductor backend generates Triton kernels that approach hand-written performance for many patterns.
torch.compile modes: "default" balances compile time vs speedup. "reduce-overhead" uses CUDA graphs to eliminate kernel launch overhead (best for small models). "max-autotune" benchmarks multiple kernel implementations (slow compile, fastest runtime). For training, start with "default". For inference, use "max-autotune".
```python
# Diagnosing torch.compile issues
import time
import torch

# MyModel and sample_input are placeholders for your own model and a
# representative input batch.
model = MyModel().cuda()

# 1. See where graph breaks occur
torch._dynamo.config.verbose = True
explanation = torch._dynamo.explain(model)(sample_input)
print(explanation)  # Shows: graph breaks, reasons, affected operations

# 2. Common graph break causes and fixes
# - data-dependent control flow: if tensor.item() > 0 -> remove .item()
# - dynamic shapes: use torch.compile(dynamic=True)
# - unsupported ops: torch._dynamo.config.suppress_errors = True falls back to eager

# 3. Measuring compile speedup
def bench_ms(fn, iters=100):
    """Average milliseconds per call, with GPU work fully synchronized."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(sample_input)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1000

model_compiled = torch.compile(model, mode="reduce-overhead")

# Warmup (compilation happens here)
for _ in range(3):
    model_compiled(sample_input)

print(f"Eager:    {bench_ms(model):.1f}ms/iter")
print(f"Compiled: {bench_ms(model_compiled):.1f}ms/iter")
```
Watch GPU memory during a forward+backward pass. Activations accumulate during forward, then get freed during backward. Toggle gradient checkpointing to see reduced peak.
PyTorch Profiler wraps your training loop and records every CUDA kernel, memory allocation, and CPU operation. Good for finding which operations take the most time. Start here.
Nsight Systems gives a system-level timeline: CPU and GPU activity side by side. You see exactly when the GPU is idle, when data transfers happen, and where synchronization stalls. This is how you find the real bottleneck — the visual timeline tells stories numbers can't.
Nsight Compute is kernel-level: SM occupancy, memory bandwidth utilization, warp stalls, instruction mix. Use this when you know WHICH kernel is slow and need to understand WHY.
```python
import torch
from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

# Profile steps 5-10 (skip warmup)
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=2, warmup=2, active=6, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, (x, y) in enumerate(loader):
        loss = model(x).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()

# Reading the trace: look for these red flags
# - CPU bar thick, GPU bar thin = CPU bottleneck
# - Large gaps between GPU kernels = sync points (.item(), print())
# - GPU idle between steps = data loading too slow
# - Lots of tiny GPU kernels = eager mode overhead (use torch.compile)
```
```bash
# Nsight Systems command-line profiling
nsys profile --trace=cuda,nvtx --output=training_trace \
    python train.py --epochs 1 --profile

# View in Nsight Systems GUI:
# - GPU utilization timeline (should be >80% filled)
# - CUDA API trace (kernel launches, memcpy)
# - NVTX annotations (custom markers in your code)

# Nsight Compute for kernel-level analysis
ncu --target-processes all --set full \
    python inference.py --single-step
# Shows: SM occupancy, memory throughput, roofline position
```
| Symptom | Bottleneck | Fix |
|---|---|---|
| GPU util ~0% between batches | Data loading | num_workers=8, pin_memory, persistent_workers |
| Frequent short GPU idle gaps | CPU-GPU sync | Remove .item()/.cpu() calls, batch operations |
| Low SM occupancy, high BW | Memory-bound kernels | Operator fusion, torch.compile |
| GPU idle during all-reduce | Communication | Overlap comm with compute, gradient bucketing |
| Many small kernels, gaps | Eager overhead | torch.compile, CUDA graphs |
Common hidden sync points: loss.item() (forces GPU-CPU sync), print(tensor) (same), if tensor > threshold (same). Each sync point blocks the CPU until the GPU finishes all queued work. Move logging to every 100 steps, and use .detach() for values you don't need gradients on.

The CPU row should be thin, just launching kernels. If the CPU row is thick (lots of Python time), you have eager mode overhead. Fix: torch.compile.
The GPU row should be solid blocks, no gaps. Each gap is wasted time. Small gaps between kernels = too many small ops (torch.compile fuses them). Large gaps = synchronization (.item() call, CPU-GPU copy, or all-reduce barrier).
The memory row should be sawtooth: memory grows during forward (accumulating activations), peaks at the boundary, then drops during backward (freeing activations). If it plateaus or grows monotonically, you have a memory leak.
```python
import torch
import nvtx  # pip install nvtx

class ProfiledTrainer:
    # self.model and self.optimizer are assumed to be set up elsewhere
    def train_step(self, batch):
        with nvtx.annotate("data_transfer", color="red"):
            x, y = batch[0].cuda(), batch[1].cuda()

        with nvtx.annotate("forward", color="blue"):
            # torch.autocast replaces the deprecated torch.cuda.amp.autocast
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = self.model(x, y)

        with nvtx.annotate("backward", color="orange"):
            loss.backward()

        with nvtx.annotate("optimizer", color="purple"):
            self.optimizer.step()
            self.optimizer.zero_grad()

# These annotations appear as colored bars in Nsight Systems
# Makes it trivial to identify which phase is the bottleneck
```
NVIDIA's DLProf automatically identifies optimization opportunities from profiler traces. PyTorch Kineto provides production-grade profiling with minimal overhead (<2% when sampling). Research on AI-assisted profiling uses LLMs to read traces and suggest optimizations — still early but promising. In production, continuous profiling samples every Nth step to detect performance regressions before they compound into wasted GPU-hours.
Holistic Trace Analysis (HTA) from Meta is an open-source tool that ingests PyTorch profiler traces and automatically computes: GPU idle time breakdown, kernel duration distribution, communication-computation overlap ratio, and memory bandwidth utilization. It generates actionable recommendations ranked by expected impact.
GPU timeline: forward (blue), backward (orange), data loading (red), communication (teal), optimizer (purple). Adjust DataLoader workers and toggle optimizations.
A floating-point number has three parts: sign (1 bit), exponent (controls range), and mantissa (controls precision):
| Format | Bits | Exponent | Mantissa | Range | Precision |
|---|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | ±3.4×10³⁸ | ~7 decimal digits |
| FP16 | 16 | 5 | 10 | ±65,504 | ~3 decimal digits |
| BF16 | 16 | 8 | 7 | ±3.4×10³⁸ | ~2 decimal digits |
FP16 has more precision, but its max value is only 65,504. Gradients during training can exceed this and overflow to infinity. BF16 keeps the same 8-bit exponent as FP32, so it has the same range and in practice does not overflow where FP32 would not. It has less precision, but gradients that will be averaged with millions of others don't need 3 decimal digits.
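The range-versus-precision trade-off can be checked with the standard library alone. This is a sketch under one stated assumption: BF16 is emulated by truncating an FP32 encoding to its top 16 bits, whereas real hardware typically rounds to nearest. The helper names are illustrative.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision; inf if it does not fit."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return float("inf")

def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

for value in (0.1, 70000.0):
    print(f"{value}: fp16={to_fp16(value)}, bf16={to_bf16(value)}")
```

A value of 70,000 overflows in FP16 but survives in BF16 with a coarser mantissa, while 0.1 is represented more accurately in FP16.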
AMP autocast automatically selects precision per operation:
| Precision | Operations | Why |
|---|---|---|
| BF16/FP16 | Matmul, convolution, linear | Tensor Cores give 2x throughput |
| FP32 | Softmax, layer norm, loss | Precision-sensitive reductions |
| FP32 (master) | Optimizer weight updates | Updates ~1e-7 would be rounded away when added to BF16 weights |
```python
import torch

# BF16 training (modern GPUs: A100+)
for x, y in loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x, y)  # forward in BF16
    loss.backward()         # backward in BF16
    optimizer.step()        # updates FP32 weights

# FP16 training (older GPUs: V100) needs loss scaling
scaler = torch.cuda.amp.GradScaler()  # torch.amp.GradScaler("cuda") in newer releases
for x, y in loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x, y)
    scaler.scale(loss).backward()  # multiply loss by the current (dynamic) scale
    scaler.step(optimizer)         # unscale grads, then step
    scaler.update()                # adjust scale if overflow detected

# Loss scaling math (illustrated with a scale of 1024;
# PyTorch's default initial scale is 65536):
# Forward:  loss = 0.5 (normal)
# Scale:    scaled_loss = 0.5 * 1024 = 512
# Backward: all gradients are 1024x larger (in FP16 range)
# Unscale:  gradients /= 1024 (back to true values)
# If any gradient is inf/nan: skip this step, halve scale
```
Set torch.backends.cuda.matmul.allow_tf32 = True as a baseline.

| Component | FP32 Training | BF16 AMP Training |
|---|---|---|
| Weights | 7B × 4 = 28 GB | 7B × 2 = 14 GB (BF16) |
| Gradients | 7B × 4 = 28 GB | 7B × 2 = 14 GB (BF16) |
| Adam m (1st moment) | 7B × 4 = 28 GB | 7B × 4 = 28 GB (always FP32) |
| Adam v (2nd moment) | 7B × 4 = 28 GB | 7B × 4 = 28 GB (always FP32) |
| Master weights | — (already FP32) | 7B × 4 = 28 GB (FP32 copy) |
| Activations (batch=4) | ~32 GB | ~16 GB (BF16) |
| Total | ~144 GB | ~128 GB |
| Saving | — | ~11% memory + 1.7x speed |
The memory savings from AMP are modest (~11%) because the optimizer states dominate and stay in FP32. The real win is speed: BF16 matmuls run at 2x Tensor Core throughput, giving 1.5-2x overall training speedup.
FP8 training on H100: two formats, E4M3 (more mantissa = precision, used for forward activations) and E5M2 (more exponent = range, used for backward gradients). Halves memory traffic again compared to BF16. Requires per-tensor dynamic scaling — the H100 Transformer Engine handles this automatically.
Microscaling (MX) formats use shared exponents across groups of values: in the 4-bit variant, 32 weights share one 8-bit scale exponent, and each weight stores only a 4-bit element. That works out to (32 × 4 + 8) / 32 = 4.25 effective bits per weight. 4-bit training is an active research area. Current results show 4-bit weights work for fine-tuning (QLoRA keeps the frozen base model in 4 bits and trains higher-precision adapters), but pre-training remains challenging due to gradient precision requirements.
```python
# FP8 training on H100 (conceptual sketch)
# The Transformer Engine handles scaling automatically
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Replace nn.Linear with te.Linear; FP8 is used inside fp8_autocast
layer = te.Linear(4096, 4096, bias=True)

# HYBRID selects E4M3 for forward and E5M2 for backward:
# Forward: E4M3 (4-bit exponent, 3-bit mantissa)
#   More mantissa = better precision for activations
# Backward: E5M2 (5-bit exponent, 2-bit mantissa)
#   More exponent = better range for gradients
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

x = torch.randn(16, 4096, device="cuda")  # illustrative input
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

# Per-tensor dynamic scaling: each tensor gets its own scale factor,
# updated from recently observed value ranges

# Memory impact: BF16 -> FP8 halves activation memory
# 7B model, batch=4, seq=2048: ~12 GB -> ~6 GB activations
# Speed impact: ~1.5x on H100 Tensor Cores
```
Each dot is a representable value. FP32 has dense coverage. FP16 has a limited range. BF16 covers the full range but with wider gaps. Toggle views to explore different value ranges.
Data Parallelism (DDP): every GPU holds a complete model copy. Split the batch. Each GPU computes gradients, then ring all-reduce averages them. Simplest, most common.
Fully Sharded (FSDP): shard parameters, gradients, AND optimizer states across GPUs. Before each layer's forward, all-gather the full parameters, compute, discard. Memory per GPU drops from full to 1/N.
Tensor Parallelism (TP): split individual layers across GPUs. A linear W of shape (d, 4d) column-split across 4 GPUs. Requires NVLink because communication happens inside every layer.
Pipeline Parallelism (PP): split layers sequentially across GPUs. GPU 0 gets layers 1-10, GPU 1 gets 11-20. Micro-batching fills the pipeline. Has "bubble" overhead.
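The pipeline "bubble" can be estimated with the standard formula for a GPipe-style schedule: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The sketch below assumes equal-cost stages and ignores communication; the function name is illustrative.

```python
def pipeline_bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)

# 4 pipeline stages: more micro-batches shrink the bubble
for m in (1, 4, 16, 64):
    print(f"micro_batches={m:>2}: bubble = {pipeline_bubble_fraction(4, m):.1%}")
```

With a single micro-batch three of the four GPUs are idle at any moment; the bubble only becomes small once the micro-batch count is several times the stage count.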
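Communication cost for ring all-reduce: each GPU sends and receives 2(N-1)/N × data_size, where N is GPU count. Let's work through a concrete example: a 7B-parameter model has 7B × 2 bytes = 14 GB of BF16 gradients. On 8 GPUs that is 2 × 7/8 × 14 GB = 24.5 GB of traffic per GPU per step. At an assumed 50 GB/s of effective NVLink bandwidth, the all-reduce takes 24.5 / 50 = 0.49 seconds.
If your forward + backward takes 1.5 seconds, un-overlapped communication adds 0.49s, about 25% of the resulting 1.99s step. This is why DDP overlaps communication with backward: while computing gradients for layer N, send gradients for layer N+1 that are already done. With good overlap, the comm is hidden behind compute.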
For inter-node communication (InfiniBand at 100 Gb/s = 12.5 GB/s), the same all-reduce takes 24.5/12.5 = 1.96 seconds. This is why tensor parallelism uses NVLink (intra-node) while data/FSDP parallelism uses InfiniBand (inter-node, less frequent communication).
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# DDP in 3 lines (launched with torchrun, which sets LOCAL_RANK)
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group("nccl")
model = model.to(local_rank)
model = DDP(model, device_ids=[local_rank])

# FSDP in 5 lines (an alternative to DDP: wrap the unwrapped model)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
mp_policy = MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16)
model = FSDP(model, mixed_precision=mp_policy, use_orig_params=True)

# Communication volume calculation
def allreduce_comm_gb(params_b, n_gpus, dtype_bytes=2):
    data_gb = params_b * dtype_bytes  # total gradient size in GB
    # Ring all-reduce: 2*(N-1)/N * data_size
    return 2 * (n_gpus - 1) / n_gpus * data_gb

def comm_time_sec(params_b, n_gpus, bw_gbs, dtype_bytes=2):
    vol = allreduce_comm_gb(params_b, n_gpus, dtype_bytes)
    return vol / bw_gbs

# 7B model, 8 GPUs, NVLink (50 GB/s) vs InfiniBand (12.5 GB/s)
print(f"NVLink: {comm_time_sec(7, 8, 50):.2f}s")          # 0.49s
print(f"InfiniBand: {comm_time_sec(7, 8, 12.5):.2f}s")    # 1.96s
```
Use auto_wrap_policy to shard within layers, not just between them.

DeepSpeed ZeRO-3 adds CPU/NVMe offloading for extreme memory savings. 3D parallelism (Megatron-LM) combines DP + TP + PP for 100B+ models. Sequence parallelism shards the sequence dimension for long-context models. Expert parallelism distributes MoE experts across GPUs. Context parallelism (ring attention) enables 1M+ token sequences by distributing the attention computation across GPUs in a ring topology.
PyTorch FSDP2 (2024) simplifies the API and improves performance: per-parameter sharding (not per-module), better composition with torch.compile, and native mixed precision support. Fully Sharded Data Parallel + Tensor Parallel composition is the standard for Llama-3 scale training: FSDP across nodes, TP within nodes.
```python
# FSDP2 (PyTorch 2.4+): cleaner API
# (newer releases also expose fully_shard from torch.distributed.fsdp)
from torch.distributed._composable.fsdp import fully_shard

# Shard each transformer block individually
for block in model.blocks:
    fully_shard(block)
fully_shard(model)  # root shard

# Composes with torch.compile
model = torch.compile(model)

# Memory per GPU for 70B, 64 GPUs:
# Weights:   140 GB / 64 = 2.2 GB
# Optimizer: 840 GB / 64 = 13.1 GB
# Gradients: 140 GB / 64 = 2.2 GB
# Total static: 17.5 GB (fits easily in 80 GB A100)
```
Toggle between DDP, FSDP, and Pipeline to see how model state distributes across 4 GPUs.
No single trick gives 10x. But stack enough 1.2x improvements: 1.7 × 1.3 × 1.15 × 1.12 ≈ 2.8x. Each technique below is well-understood, widely deployed, and testable.
Gradient checkpointing: Don't save activations during forward. During backward, recompute them. ~33% more compute, but 5-10x less activation memory. Memory goes from O(L) to O(√L) in number of layers.
Efficient data loading: Default DataLoader (num_workers=0) means the GPU idles while CPU decodes images one at a time. Fix: num_workers=8, pin_memory=True, persistent_workers=True. For large-scale: WebDataset (tar shards) or FFCV (memory-mapped binary).
Fused optimizers: Standard AdamW does 4 element-wise operations per parameter, each round-tripping to HBM. A fused kernel does all 4 in one pass: 1 read, 1 write. 15-20% faster optimizer step.
torch.compile: One line of code. Fuses operations, eliminates kernel launch overhead. 1.1-1.4x speedup on transformers.
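Gradient accumulation: Want batch_size=1024 but only have memory for 64? Accumulate gradients over K=16 mini-batches before stepping. The accumulated gradient is mathematically equivalent to the large-batch gradient (layers that depend on batch statistics, such as BatchNorm, still see the small batch); it just takes K forward/backward passes per optimizer step.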
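The O(L) to O(√L) claim for gradient checkpointing above can be sanity-checked with a deliberately simplified cost model (an assumption for illustration, not a measurement): keeping one checkpoint per segment costs L/k units, and recomputing one segment of k layers during backward costs another k units.

```python
import math

def peak_activation_units(layers: int, segment: int) -> float:
    """Stored segment checkpoints (L/k) plus one segment being recomputed (k)."""
    return layers / segment + segment

layers = 64
best = min(range(1, layers + 1), key=lambda k: peak_activation_units(layers, k))
print(f"no checkpointing: {layers} units")
print(f"best segment size: {best} (sqrt(L) = {math.sqrt(layers):.0f})")
print(f"peak with checkpointing: {peak_activation_units(layers, best):.0f} units")
```

Under this model the optimum sits at k = √L, giving a peak of 2√L units instead of L.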
```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.utils.data import DataLoader

# 1. Gradient checkpointing
class Model(nn.Module):
    # self.layers is assumed to be an nn.ModuleList built in __init__ (omitted)
    def forward(self, x):
        for layer in self.layers:
            x = checkpoint(layer, x, use_reentrant=False)
        return x

# 2. Efficient data loading
loader = DataLoader(dataset, batch_size=64, num_workers=8,
                    pin_memory=True, persistent_workers=True,
                    prefetch_factor=2)

# 3. Fused optimizer (15-20% faster step)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

# 4. torch.compile (one line)
model = torch.compile(model, mode="reduce-overhead")

# 5. Gradient accumulation
accum = 16
for i, (x, y) in enumerate(loader):
    loss = model(x, y) / accum  # scale so the summed gradient is an average
    loss.backward()
    if (i + 1) % accum == 0:
        optimizer.step()
        optimizer.zero_grad()
```
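The equivalence claimed for gradient accumulation can be verified by hand on a toy problem. The sketch below uses a one-parameter linear model with mean-squared-error loss and made-up data; dividing each micro-batch gradient by K plays the role of `loss / accum` in the loop above.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Illustrative data: 8 samples, split into K=4 micro-batches of 2
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w, k = 0.5, 4
micro = len(xs) // k

full_batch_grad = grad_mse(w, xs, ys)

accumulated_grad = 0.0
for i in range(k):
    part = slice(i * micro, (i + 1) * micro)
    accumulated_grad += grad_mse(w, xs[part], ys[part]) / k

print(full_batch_grad, accumulated_grad)
```

The two gradients agree up to floating-point rounding, provided the micro-batches are the same size.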
Use checkpoint_sequential(layers, segments=len(layers)//3) to checkpoint in segments instead of every layer. The sweet spot is where recomputation time equals the memory savings you need.

If torch.compile keeps recompiling on varying input shapes, pass dynamic=True. Or use torch._dynamo.config.suppress_errors = True to fall back to eager for broken subgraphs. Check torch._dynamo.explain(model)(inputs) to see where graph breaks occur.

FFCV (from MIT's Madry Lab) achieves near-theoretical I/O throughput with memory-mapped random-access binary files. Mosaic StreamingDataset streams from S3 with intelligent prefetching. Selective compilation in PyTorch 2.x compiles only hot subgraphs, reducing compilation time from minutes to seconds. 8-bit optimizers (bitsandbytes) store optimizer states in 8 bits instead of 32, cutting optimizer-state memory by roughly 4x with negligible accuracy impact.
Let's trace the exact math for cutting a $50K training run to $16K:
| Step | Technique | Multiplier | Cumulative | Step Time | Cost |
|---|---|---|---|---|---|
| 0 | Baseline | 1.0x | 1.0x | 850ms | $50,000 |
| 1 | Fix data loading (0→8 workers) | 1.4x | 1.4x | 607ms | $35,700 |
| 2 | BF16 AMP | 1.7x | 2.38x | 357ms | $21,000 |
| 3 | torch.compile | 1.15x | 2.74x | 310ms | $18,250 |
| 4 | Fused AdamW | 1.12x | 3.07x | 277ms | $16,300 |
Key observation: the optimizations were applied roughly in order of impact AND ease of implementation. The data loading fix comes first because it requires only config changes and stops the GPU from starving. AMP gives the largest single multiplier (1.7x) and is one context manager. torch.compile is one line. Fused optimizer is one flag. Total engineering effort: ~2 hours. Total savings: $33,700 per run.
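The table can be reproduced in a few lines, which is a handy way to re-run the estimate with your own measured multipliers. The baseline figures and multipliers below are the ones from the table; nothing else is assumed.

```python
baseline_ms, baseline_cost = 850, 50_000
steps = [
    ("Fix data loading", 1.4),
    ("BF16 AMP", 1.7),
    ("torch.compile", 1.15),
    ("Fused AdamW", 1.12),
]

cumulative = 1.0
for name, multiplier in steps:
    cumulative *= multiplier  # speedups compound multiplicatively
    step_ms = baseline_ms / cumulative
    cost = baseline_cost / cumulative
    print(f"{name:18s} {cumulative:5.2f}x  {step_ms:4.0f} ms  ${cost:,.0f}")
```

The printed costs differ from the table by a few dollars only because the table rounds each row.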
Toggle optimizations to see cumulative impact on speed and memory.
Quantization replaces high-precision weights with low-precision integers. The math is straightforward:
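For reference, one common form is symmetric (zero-point-free) quantization to $b$ bits, written here for a weight $w$ within a tensor, channel, or group $W$. The symbols $s$, $q$, and $\hat{w}$ are notation introduced for this sketch:

$$
s = \frac{\max_{w' \in W} |w'|}{2^{\,b-1} - 1}, \qquad
q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{w}{s}\right),\; -(2^{\,b-1} - 1),\; 2^{\,b-1} - 1\right), \qquad
\hat{w} = s \cdot q
$$

For INT8, $2^{\,b-1} - 1 = 127$. Rounding introduces an error $|w - \hat{w}|$ of at most $s/2$ per weight, so a smaller scale (fewer outliers sharing it) means less error.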
Post-Training Quantization (PTQ): quantize after training. Run calibration data to find value ranges. Minutes to apply. Quantization-Aware Training (QAT): simulate quantization during training so the model adapts. Hours of training but recovers accuracy.
Per-tensor vs per-channel vs per-group: Per-tensor uses one scale for the whole tensor — coarse, more error. Per-channel (one scale per output channel) is much better for convolutions. Per-group (groups of 128 weights) gives the best accuracy for LLMs at INT4.
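The effect of scale granularity is easy to see on synthetic weights. The sketch below is a toy under stated assumptions: Gaussian weights with one injected outlier, symmetric INT4, and an illustrative helper that takes the group size as a parameter (group size equal to the tensor length means per-tensor).

```python
import random

def quant_error(weights, bits, group_size):
    """Mean absolute error of symmetric quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1
    total = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1e-8  # avoid div by zero
        for w in group:
            q = max(-qmax, min(qmax, round(w / scale)))
            total += abs(w - q * scale)
    return total / len(weights)

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]
weights[100] = 1.0  # one outlier inflates any scale that covers it

per_tensor = quant_error(weights, bits=4, group_size=len(weights))
per_group = quant_error(weights, bits=4, group_size=128)
print(f"per-tensor error: {per_tensor:.5f}")
print(f"per-group  error: {per_group:.5f}")
```

With one scale for the whole tensor, the outlier forces nearly every small weight to round to zero; with groups of 128, only the outlier's own group pays that price.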
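Pruning removes weights that contribute little. Unstructured: zero individual weights. 80-90% sparsity is possible, but dense hardware gets no speedup from scattered zeros; the A100's sparse Tensor Cores only accelerate the fixed 2:4 pattern (two zeros in every four weights, i.e. 50% sparsity). Structured: remove entire neurons/channels/heads. This is hardware-friendly, because the result is a smaller dense model.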
Knowledge distillation trains a small "student" to mimic a large "teacher." The teacher's soft probabilities ("70% cat, 20% dog, 10% fox") contain richer information than hard labels ("cat"). The 20% dog tells the student that cats and dogs look similar.
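A minimal sketch of the soft-target loss, following the widely used formulation of temperature-softened KL divergence scaled by T². The logits below are constructed so the teacher's probabilities match the cat/dog/fox example; the student logits and the temperature are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

teacher = [math.log(0.7), math.log(0.2), math.log(0.1)]  # cat, dog, fox
student = [5.0, 0.0, 0.0]  # overconfident "cat", no dog/fox distinction

print(softmax(teacher))
print(distillation_loss(teacher, student))
```

In practice this term is usually mixed with the ordinary cross-entropy on hard labels; the mixing weight is a hyperparameter.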
| Technique | Time to Apply | Accuracy Hit | Size Reduction | Speed Gain |
|---|---|---|---|---|
| PTQ INT8 | Minutes | 0.1-0.5% | 2x | 2-3x |
| PTQ INT4 (GPTQ) | Hours | 0.5-2% | 4x | 2-4x |
| QAT INT4 | Days | 0.1-0.5% | 4x | 2-4x |
| Structured Pruning 50% | Days | 1-2% | 2x | 1.5-2x |
| Distillation | Days-Weeks | 1-5% | 3-10x | 3-10x |
```python
import torch

def quantize_per_channel(weight, bits=8):
    """Per-channel symmetric quantization."""
    # weight shape: (out_channels, in_channels)
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8

    # Per-channel: one scale per output channel
    abs_max = weight.abs().amax(dim=1, keepdim=True)  # (out_ch, 1)
    scale = abs_max / qmax                            # (out_ch, 1)
    scale = scale.clamp(min=1e-8)                     # avoid div by zero

    # Quantize
    w_int = torch.round(weight / scale).clamp(-qmax, qmax).to(torch.int8)

    # Dequantize
    w_deq = w_int.float() * scale

    # Quantization error
    error = (weight - w_deq).abs().mean()
    return w_int, scale, error

# Example: quantize a 4096x4096 weight matrix
W = torch.randn(4096, 4096)
w_q, scales, err = quantize_per_channel(W, bits=8)
print(f"Original: {W.numel() * 4 / 1e6:.1f} MB (FP32)")
print(f"Quantized: {w_q.numel() / 1e6:.1f} MB (INT8) + {scales.numel()*4/1e3:.1f} KB scales")
print(f"Mean absolute error: {err:.6f}")
# Original: 67.1 MB -> Quantized: 16.8 MB (4x reduction)
# Mean absolute error is roughly scale/4 (on the order of 0.007 for this random matrix)
```
GPTQ (2023) quantizes one layer at a time, using second-order (Hessian) information to decide which weights to round up vs down. The key insight: quantization error in one weight can be compensated by adjusting other weights in the same row. This "error compensation" dramatically reduces total quantization error. Process: for each column of the weight matrix, quantize it, measure the error, and spread that error across remaining unquantized columns.
```python
# GPTQ conceptual flow (simplified)
# quantize/dequantize, scale, zero_point and the inverse Hessian H_inv
# are assumed to be defined elsewhere.
for col in range(weight.shape[1]):
    # Quantize this column
    w_q = quantize(weight[:, col], scale, zero_point)
    w_deq = dequantize(w_q, scale, zero_point)

    # Compute quantization error
    error = weight[:, col] - w_deq

    # Spread error to remaining columns using inverse Hessian
    # This is the key innovation: error compensation
    # (the update is subtracted so the remaining weights absorb the error)
    weight[:, col+1:] -= error.unsqueeze(1) * H_inv[col, col+1:] / H_inv[col, col]
    weight[:, col] = w_deq
```
AWQ (2024) observes that not all weights are equally important. Weights connected to high-activation channels matter more. AWQ scales up these salient weight channels before quantization (which shrinks their relative rounding error) and divides the matching activations by the same factor at inference, so the layer's output is unchanged apart from rounding. No retraining needed.
AQLM and QuIP# use vector quantization: instead of quantizing each weight independently, quantize groups of weights to the nearest codeword in a learned codebook. This enables 2-bit quantization with surprisingly low degradation. 2:4 structured sparsity (A100+): hardware-native support for 50% sparsity with zero runtime overhead.
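To make the 2:4 pattern concrete, here is a minimal sketch of the pruning rule itself: in every group of four consecutive weights, keep the two with the largest magnitude and zero the rest. The function name `prune_2_of_4` is hypothetical, and this only shows the sparsity pattern. A real deployment also needs the compressed storage format and sparse kernels from the vendor's libraries, and usually some fine-tuning to recover accuracy.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse = prune_2_of_4(weights)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```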
TensorRT is NVIDIA's inference compiler. It takes an ONNX model and applies: (1) layer fusion — Conv+BN+ReLU becomes one kernel, (2) INT8 calibration — automatically quantizes safe layers, (3) kernel auto-tuning — benchmarks dozens of kernel implementations per operation and picks the fastest for your specific GPU and input shapes.
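Layer fusion is easier to trust once you see the algebra. The sketch below folds BatchNorm into the preceding layer's weights and bias for a single output channel, then checks that the fused path matches the three-step path. This is the standard inference-time BN folding identity, not TensorRT's internal implementation; the helper names `fold_batchnorm`, `unfused` and `fused` are made up for illustration.

```python
import math

def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding layer's weights and bias."""
    k = gamma / math.sqrt(var + eps)
    folded_weights = [w * k for w in weights]
    folded_bias = (bias - mean) * k + beta
    return folded_weights, folded_bias

def unfused(x, weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Three separate steps: linear (one conv tap), BatchNorm, ReLU."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    y = gamma * (z - mean) / math.sqrt(var + eps) + beta
    return max(y, 0.0)

def fused(x, folded_weights, folded_bias):
    """One step: linear with folded parameters, then ReLU."""
    return max(sum(w * xi for w, xi in zip(folded_weights, x)) + folded_bias, 0.0)

x = [1.0, 2.0, 3.0]
w, b = [0.5, -0.25, 0.75], 0.1
bn = dict(gamma=1.5, beta=-0.2, mean=0.3, var=0.04)

w_f, b_f = fold_batchnorm(w, b, **bn)
# The two values agree up to float rounding
print(unfused(x, w, b, **bn), fused(x, w_f, b_f))
```

The fused version does one multiply-accumulate pass and one memory round trip instead of three, which is where the latency win comes from.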
FlashAttention tiles the attention computation into SRAM-sized blocks. Standard attention materializes the full N×N attention matrix in HBM — for seq=8192, that's 128MB per head. FlashAttention never writes it: O(N) memory instead of O(N²), and faster because fewer HBM accesses. The insight: you can compute softmax without ever materializing the full matrix by maintaining running statistics (online softmax).
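The online-softmax trick can be sketched in plain Python for a single query row. The code below processes scores block by block, keeping only a running max, a running denominator, and a running weighted sum, then checks the result against the usual two-pass softmax. It is a teaching sketch under simplifying assumptions (scalar values instead of head-dim vectors, hypothetical function names), not the FlashAttention kernel.

```python
import math

def attention_row_reference(scores, values):
    """Standard softmax-weighted sum: needs all scores at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)

def attention_row_online(scores, values, block=4):
    """Same result, one block at a time, with O(1) running state."""
    m = float("-inf")  # running max of scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        # Rescale old statistics to the new max (exp(-inf) == 0.0 on the first block)
        correction = math.exp(m - m_new)
        l = l * correction + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * correction + sum(math.exp(s - m_new) * v for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l

scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.2, -0.3, 4.1, 2.2, 0.9]
values = [1.0, -2.0, 0.5, 3.0, 1.5, -1.0, 2.0, 0.25, -0.5, 4.0]
ref = attention_row_reference(scores, values)
online = attention_row_online(scores, values, block=4)
print(abs(ref - online) < 1e-12)
```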
KV Cache stores precomputed key/value tensors for autoregressive generation. Without it, generating token 1000 recomputes all 999 previous K/V projections. Memory cost:
```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2, batch=1):
    """Calculate KV cache memory in GB."""
    # 2 for K and V, times each layer, KV head, dimension
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch
    return bytes_total / 1e9

# LLaMA-7B: 32 layers, 32 heads, 128 head_dim
for seq in [2048, 4096, 8192, 32768]:
    print(f"seq={seq}: {kv_cache_gb(32, 32, 128, seq):.1f} GB")
# seq=2048: 1.1 GB
# seq=4096: 2.1 GB
# seq=8192: 4.3 GB
# seq=32768: 17.2 GB  <- this is why context length is expensive!

# With GQA (Grouped Query Attention), n_kv_heads < n_heads
# LLaMA-3 70B: 80 layers, 8 KV heads (GQA), 128 head_dim
print(f"LLaMA-3 70B, 8K: {kv_cache_gb(80, 8, 128, 8192):.1f} GB")
# LLaMA-3 70B, 8K: 2.7 GB (GQA saves 8x vs MHA with 64 heads)
```
Speculative decoding exploits the fact that a small "draft" model (e.g., 125M params) can predict many tokens correctly. The algorithm: (1) Draft model generates K candidate tokens quickly. (2) Large model verifies all K candidates in a single parallel forward pass. (3) Accept the longest prefix of correct predictions. If the draft model gets the first 4 of 5 right, you keep those 4 tokens, plus the large model's own token at the first mismatch, for the cost of ~1 large-model forward pass.
The speedup depends on acceptance rate — how often the draft model matches the large model. For well-matched pairs (e.g., 125M draft for 7B target), acceptance rates of 70-85% give 2-3x speedup.
```python
# Speculative decoding speedup estimation
def spec_decode_speedup(accept_rate, k_candidates, draft_time_ms, target_time_ms):
    """Estimate speculative decoding speedup."""
    # Expected tokens per step: sum of geometric series
    expected_tokens = (1 - accept_rate ** (k_candidates + 1)) / (1 - accept_rate)
    # Time per step: draft generates k tokens, then target verifies all k in one pass
    time_per_step = k_candidates * draft_time_ms + target_time_ms
    # Baseline: 1 token per target forward
    baseline_time = target_time_ms
    speedup = expected_tokens * baseline_time / time_per_step
    return speedup, expected_tokens

# Example: 125M draft (2ms), 7B target (30ms), k=5, 80% accept
s, t = spec_decode_speedup(0.8, 5, 2, 30)
print(f"Speedup: {s:.1f}x, Expected tokens/step: {t:.1f}")
# Speedup: 2.8x, Expected tokens/step: 3.7
```
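The accept-the-longest-prefix rule from step (3) can be sketched for the greedy case. Here `target_tokens[i]` is assumed to be the large model's argmax at position `i`, computed in the single verification pass, so it has one more entry than the draft. The function name `verify_draft` is hypothetical, and sampling-based speculative decoding replaces the equality check with a probabilistic accept/reject rule.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification: keep draft tokens until the first disagreement."""
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score every draft position plus one extra")
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            # First mismatch: emit the target's token instead and stop
            accepted.append(target_tokens[i])
            return accepted
        accepted.append(tok)
    # Every draft token matched: the target's extra position is a free bonus token
    accepted.append(target_tokens[-1])
    return accepted

# Draft gets the first 4 of 5 right; the target corrects position 5
print(verify_draft([5, 9, 2, 7, 3], [5, 9, 2, 7, 8, 1]))  # [5, 9, 2, 7, 8]
```

Because the output always equals what the large model would have produced on its own, greedy speculative decoding changes latency, not quality.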
FlashAttention-3 (2024): FP8 support on H100, asynchronous pipelining, 1.5-2x faster than FA2 on Hopper. PagedAttention (vLLM): virtual memory for KV cache. Disaggregated serving: separate prefill (compute-heavy) from decode (memory-heavy) onto different hardware. Medusa: adds multiple prediction heads to generate several candidate tokens simultaneously without a separate draft model.
Static batching: wait for B requests, process together. Problem: if B=32, the first request waits for 31 others. Terrible latency at low traffic.
Continuous batching: process requests as they arrive. For LLMs, sequences have different lengths — some finish in 20 tokens, others in 2000. vLLM inserts new requests as old ones complete, keeping the GPU busy. 2-3x throughput improvement over static batching.
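A toy simulation shows where the gain comes from. The sketch below counts decode steps for a fixed queue of requests under both policies, assuming every request is already waiting at step 0, each active sequence emits one token per step, and a step costs the same regardless of how many slots are full. Those assumptions and the function names are mine; the ratio you get depends heavily on how output lengths are distributed.

```python
import heapq

def static_batch_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))

def continuous_batch_steps(lengths, batch_size):
    """A finished sequence frees its slot immediately for the next request."""
    slots = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slots)
    finish = 0
    for n in lengths:
        start = heapq.heappop(slots)  # earliest free slot
        end = start + n
        heapq.heappush(slots, end)
        finish = max(finish, end)
    return finish

lengths = [20, 2000, 30, 40, 25, 1500, 35, 50]  # output tokens per request
print(static_batch_steps(lengths, 4), continuous_batch_steps(lengths, 4))  # 3500 2000
```

In the static case, three short requests sit idle in their slots for almost 2000 steps while one long request finishes. Continuous batching fills those slots with the second wave instead.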
PagedAttention: borrows virtual memory paging from OS design. KV cache is stored in fixed-size blocks (pages), tracked by a page table. Sequences allocate pages on-demand — no pre-allocated max-length buffers. Why this matters: without paging, you pre-allocate max_seq_len × KV per sequence. Average utilization is ~20%. With paging, you allocate only what each sequence actually uses. 2-4x more concurrent sequences in the same GPU memory.
```python
# Conceptual: why PagedAttention saves memory (LLaMA-7B-like model, BF16)
# WITHOUT paging: pre-allocate max_seq_len per sequence
max_seq = 8192
kv_per_token = 2 * 32 * 32 * 128 * 2  # K+V * layers * heads * head_dim * bytes = 512 KB/token
allocated = max_seq * kv_per_token     # ~4.3 GB per sequence
# Avg sequence uses 500 tokens: ~262 MB actually used
# Waste: 94% of allocated KV memory is unused!

# WITH paging: allocate blocks as needed
block_size = 16                          # tokens per block
block_bytes = block_size * kv_per_token  # ~8.4 MB per block
# 500-token sequence needs ceil(500/16) = 32 blocks = ~268 MB
# On demand: only allocate as sequence grows
# Non-contiguous: blocks can be anywhere in GPU memory
# Shared: common prefixes (system prompt) share physical blocks
# Result: in this example ~16x more sequences fit (8192 / 512 tokens);
# real workloads gain less, depending on how sequence lengths are distributed
```
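The on-demand allocation described above can be sketched as a tiny block allocator: a free list of physical blocks plus a per-sequence block table. The class and method names (`PagedKVAllocator`, `append_token`, `free`) are hypothetical and this is not vLLM's API; it only tracks block IDs and leaves out prefix sharing, copy-on-write, and the actual tensors.

```python
class PagedKVAllocator:
    """Tracks which physical KV blocks belong to which sequence."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=64)
for _ in range(500):
    alloc.append_token("request-a")
print(len(alloc.block_tables["request-a"]), len(alloc.free_blocks))  # 32 32
alloc.free("request-a")
print(len(alloc.free_blocks))  # 64
```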
| Framework | Best For | Key Feature |
|---|---|---|
| vLLM | LLM serving | PagedAttention, continuous batching, prefix caching |
| Triton Inference Server | Multi-model pipelines, non-LLM | Dynamic batching, model ensembles, TensorRT |
| TGI (HuggingFace) | Quick LLM deployment | Optimized transformers, FlashAttention, quantization |
| SGLang | Structured generation | RadixAttention, constrained decoding, fast JSON |
Production LLM serving is measured on four dimensions:
| Metric | Definition | Target | How to Optimize |
|---|---|---|---|
| TTFT | Time to first token (prefill latency) | <500ms | FlashAttention, prefix caching, shorter prompts |
| ITL | Inter-token latency (decode step) | <50ms | INT4 quantization, speculative decoding |
| Throughput | Total tokens/sec across all requests | Maximize | Continuous batching, PagedAttention |
| P99 latency | 99th percentile end-to-end | <2s | Autoscaling, request shedding, pre-warming |
The TTFT vs throughput tradeoff: larger batches improve throughput (more tokens processed per GPU-second) but increase TTFT (new requests wait in the batch). Continuous batching helps because it inserts new requests without waiting for the current batch to finish. The sweet spot is a batch size where GPU compute utilization is >80% and TTFT stays under your SLA.
FP8 KV cache is an emerging technique in vLLM and SGLang: quantize KV cache entries from BF16 to FP8, cutting KV memory in half. This doubles the number of concurrent sequences for the same GPU memory, with <0.5% perplexity degradation on most models. Combined with PagedAttention, this can increase serving efficiency by 3-4x over naive implementations.
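The "doubles the number of concurrent sequences" claim is simple arithmetic. A rough sketch, assuming a 7B-class model with full multi-head attention (32 layers, 32 heads, head_dim 128), 40 GB of GPU memory left over for KV cache, and 1,000-token average sequences; the function name and all three workload numbers are illustrative assumptions:

```python
def max_concurrent_sequences(kv_budget_gb, kv_bytes_per_token_bf16, avg_tokens, kv_dtype_bytes):
    """How many average-length sequences fit in a fixed KV-cache budget."""
    bytes_per_token = kv_bytes_per_token_bf16 * kv_dtype_bytes / 2  # BF16 is 2 bytes
    bytes_per_seq = bytes_per_token * avg_tokens
    return int(kv_budget_gb * 1e9 // bytes_per_seq)

# K+V * layers * heads * head_dim * bytes = 512 KB per token at BF16
KV_PER_TOKEN_BF16 = 2 * 32 * 32 * 128 * 2

bf16 = max_concurrent_sequences(40, KV_PER_TOKEN_BF16, 1000, kv_dtype_bytes=2)
fp8 = max_concurrent_sequences(40, KV_PER_TOKEN_BF16, 1000, kv_dtype_bytes=1)
print(bf16, fp8)  # 76 152
```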
Chunked prefill is another key optimization: instead of processing a 4K-token prompt in one large compute-heavy chunk (which blocks decode for other sequences), split it into smaller chunks interleaved with decode steps. This keeps TTFT for new requests low even when the system is processing long prompts.
Both vLLM and SGLang implement chunked prefill, and it's becoming the default for production deployments where latency SLAs are strict.
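One way to picture chunked prefill is as a per-step token budget: every running decode sequence takes one token of the budget, and whatever is left goes to the next slice of the waiting prompt. The sketch below is a simplified scheduler under that assumption. The function name `schedule_chunked_prefill` and the budget values are hypothetical, and production schedulers juggle several waiting prompts at once.

```python
def schedule_chunked_prefill(prompt_len, num_decoding, token_budget):
    """Return a list of (decode_tokens, prefill_tokens) per step for one prompt."""
    if token_budget <= num_decoding:
        raise ValueError("token_budget must exceed the number of decoding sequences")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        # Decodes are served first; the prompt gets the leftover budget
        chunk = min(remaining, token_budget - num_decoding)
        steps.append((num_decoding, chunk))
        remaining -= chunk
    return steps

# 4K-token prompt arrives while 32 sequences are decoding, 512 tokens per step
steps = schedule_chunked_prefill(prompt_len=4096, num_decoding=32, token_budget=512)
print(len(steps), steps[0], steps[-1])  # 9 (32, 480) (32, 256)
```

The 32 decoding sequences keep producing a token on every one of those 9 steps instead of stalling behind a single 4096-token prefill.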
Disaggregated prefill/decode: prefill is compute-heavy (processing the whole prompt at once), decode is memory-heavy (generating one token at a time). Run prefill on high-compute GPUs, decode on high-bandwidth GPUs. Splitwise and DistServe implement this.
Prefix caching: system prompts shared across users are cached once, avoiding redundant computation. Multi-LoRA serving: one base model, hundreds of fine-tuned adapters hot-swapped at request time. SGLang's RadixAttention: tree-based KV cache sharing for branching generation (e.g., beam search, multiple responses).
```python
import math

def capacity_plan(model_params_b, concurrent_users, avg_seq_len, gpu_mem_gb, quant_bits=4):
    """Estimate GPU count for an LLM serving deployment."""
    # Model memory (simplification: weights are counted once, not per replica)
    model_gb = model_params_b * quant_bits / 8
    # KV cache per user (approximate)
    # Simplified: ~0.5 MB per token per 7B params at BF16
    kv_per_token_gb = model_params_b * 0.5e-3 / 7  # scale from 7B baseline
    kv_per_user_gb = kv_per_token_gb * avg_seq_len
    # Total KV with PagedAttention overhead (~30%)
    total_kv_gb = concurrent_users * kv_per_user_gb * 1.3
    # GPUs needed (leave 20% headroom)
    mem_per_gpu = gpu_mem_gb * 0.8
    total_mem = model_gb + total_kv_gb
    n_gpus = max(1, math.ceil(total_mem / mem_per_gpu))
    return {"model_gb": model_gb, "kv_total_gb": total_kv_gb,
            "total_gb": total_mem, "gpus_needed": int(n_gpus)}

# 7B model, 1000 users, avg 1K tokens, A100 80GB, INT4
plan = capacity_plan(7, 1000, 1000, 80, 4)
print(plan)
# model: 3.5 GB, kv: ~650 GB, total: ~653.5 GB, gpus: 11
```
A modern AD vehicle runs cameras (6-8 at 2MP, 30Hz = ~1.4 GB/s), LiDAR (128 beams, 300K points/frame), and radar. All of this feeds into: BEV features → detection + segmentation → tracking → prediction → planning. Total budget: 33ms at 30 FPS. Miss the deadline and the car drives blind for that frame.
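The ~1.4 GB/s camera figure is worth being able to reproduce. A quick sketch, assuming uncompressed frames at 3 bytes per pixel (the source does not state the pixel format, so treat that as an assumption):

```python
def camera_bandwidth_gb_s(n_cameras, megapixels, fps, bytes_per_pixel=3):
    """Raw camera data rate in GB/s, before any compression."""
    return n_cameras * megapixels * 1e6 * bytes_per_pixel * fps / 1e9

print(f"{camera_bandwidth_gb_s(8, 2, 30):.2f} GB/s")  # 1.44 GB/s
print(f"{camera_bandwidth_gb_s(6, 2, 30):.2f} GB/s")  # 1.08 GB/s
```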
| Stage | Model | Baseline | Optimized |
|---|---|---|---|
| Camera preprocess | — | 5ms | 2ms |
| Backbone (encoder) | EfficientNet-B4 | 12ms | 6ms |
| Detection head | BEV deformable attn | 8ms | 4ms |
| Segmentation head | BEV decoder | 5ms | 3ms |
| Tracking | Kalman + association | 3ms | 3ms |
| Prediction | Trajectory forecast | 10ms | 5ms |
| Planning | Motion planner | 5ms | 4ms |
| Total | — | 48ms (21 FPS) | 27ms (37 FPS) |
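A budget check like this is worth keeping in CI. The sketch below sums the per-stage numbers from the table and reports FPS and slack against the 33ms deadline, assuming the stages run strictly one after another with no overlap (an assumption; a pipelined stack would behave differently).

```python
# stage: (baseline_ms, optimized_ms), copied from the table above
STAGES_MS = {
    "Camera preprocess": (5, 2),
    "Backbone (encoder)": (12, 6),
    "Detection head": (8, 4),
    "Segmentation head": (5, 3),
    "Tracking": (3, 3),
    "Prediction": (10, 5),
    "Planning": (5, 4),
}
DEADLINE_MS = 33  # 30 FPS

def summarize(column):
    """Return (total_ms, fps, slack_ms) for column 0 (baseline) or 1 (optimized)."""
    total = sum(times[column] for times in STAGES_MS.values())
    return total, 1000 / total, DEADLINE_MS - total

for name, column in (("baseline", 0), ("optimized", 1)):
    total, fps, slack = summarize(column)
    print(f"{name}: {total} ms, {fps:.1f} FPS, slack {slack:+d} ms")
# baseline: 48 ms, 20.8 FPS, slack -15 ms
# optimized: 27 ms, 37.0 FPS, slack +6 ms
```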
Four techniques that cut the 48ms baseline nearly in half:
1. Backbone sharing. One encoder feeds detection + segmentation + depth heads. Amortize the 12ms encoder cost across 3 tasks instead of running 3 encoders.
2. Temporal fusion. Reuse BEV features from previous frames (aligned by ego-motion). Only update regions where things changed. Saves 30-40% on BEV computation.
3. TensorRT. Layer fusion + kernel auto-tuning. Typical 2-3x speedup on neural network components. Deterministic timing for safety-critical code.
4. Sparse computation. Process only regions with objects, skip empty space. BEVPoolv2 computes features only where objects are likely. 30-50% savings.
```python
# Without sharing: 3 separate encoders
encoder_gflops = 8.5  # EfficientNet-B4
n_cameras = 6
separate_cost = encoder_gflops * n_cameras * 3  # 3 tasks
print(f"Separate encoders: {separate_cost:.0f} GFLOPs")  # 153 GFLOPs

# With sharing: 1 shared encoder + 3 lightweight heads
shared_cost = encoder_gflops * n_cameras  # 51 GFLOPs (encoder)
head_cost = 1.5 * 3                       # 4.5 GFLOPs (3 heads)
total_shared = shared_cost + head_cost    # 55.5 GFLOPs
print(f"Shared encoder: {total_shared:.0f} GFLOPs")
print(f"Savings: {(1 - total_shared/separate_cost)*100:.0f}%")  # Savings: 64%
```
| Platform | GPU Cores | Memory | INT8 TOPS | Power |
|---|---|---|---|---|
| NVIDIA Orin | 2048 CUDA + 64 Tensor | 32 GB shared | 275 | 50W |
| NVIDIA Thor | Next-gen | TBD | 2000 | TBD |
| Qualcomm SA8650P | Custom DSP | 16 GB | ~100 | 30W |
UniAD (CVPR 2023): single model from raw sensors to planned trajectory. Joint optimization, shared features, no error propagation between modules. VAD: vectorized scene representation for end-to-end planning. Occupancy networks: predict 3D occupancy instead of bounding boxes — handles arbitrary shapes. World models (GAIA-1, DriveDreamer): learn a simulator of the environment for data augmentation and planning.
Performance engineering for end-to-end AD models is different from modular stacks. You can't independently optimize each stage because they share features. Instead, the optimization targets become: (1) shared backbone efficiency (single encoder serves all downstream tasks), (2) attention mechanism optimization (BEV attention is the bottleneck in models like BEVFormer), (3) temporal feature caching (avoid recomputing BEV features from scratch each frame), and (4) output head pruning (remove prediction heads for tasks not needed in a given driving mode).
NVIDIA's DriveOS provides a deterministic execution framework: fixed memory allocation, pre-compiled TensorRT engines, and hardware-level scheduling guarantees. This is required for ASIL-D (automotive safety integrity level D) certification. Consumer ML inference can tolerate 2x P99/P50 ratios; AD requires P99/P50 < 1.2x.
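The P99/P50 criterion is easy to check from logged frame latencies. A minimal sketch using the nearest-rank percentile definition; the sample data and the function names are made up for illustration:

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]

def tail_ratio(samples):
    """P99 divided by P50: 1.0 means perfectly deterministic latency."""
    return percentile(samples, 99) / percentile(samples, 50)

# 100 frames: 98 on time at 25 ms, 2 stragglers at 31 ms
latencies_ms = [25.0] * 98 + [31.0] * 2
ratio = tail_ratio(latencies_ms)
print(f"P99/P50 = {ratio:.2f}, within 1.2x: {ratio < 1.2}")
# P99/P50 = 1.24, within 1.2x: False
```

Two slow frames out of a hundred are enough to fail the 1.2x bar, even though both still beat the 33ms deadline.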
This chapter is your reference sheet. Every table, every drill, every scenario is something that has appeared in real staff-level ML performance interviews. Print this. Memorize the numbers. Practice the calculations.
| Technique | What It Does | Speed | Memory | Complexity | When |
|---|---|---|---|---|---|
| Mixed Precision | BF16 forward/backward, FP32 optimizer | 1.5-2x | ~50% less act. | Low | Always |
| torch.compile | Fuses ops, eliminates kernel overhead | 1.1-1.4x | Same | Low | Always |
| Fused Optimizer | Single-kernel AdamW | 1.15x | Same | Low | Always |
| FlashAttention | Tiled attention in SRAM | 2-4x | O(N) vs O(N²) | Low | Any attention model |
| Grad Checkpoint | Recompute activations in backward | 0.7x (slower) | 5-10x less act. | Low | When OOM |
| DDP | Replicate model, split data | ~Linear | Same/GPU | Medium | Model fits 1 GPU |
| FSDP | Shard everything across GPUs | ~Linear | 1/N per GPU | Medium | Model doesn't fit |
| TensorRT | Layer fusion + auto-tune + calibration | 2-5x | Similar | Medium | NVIDIA inference |
| INT8 Quantization | 8-bit weights + activations | 2-4x | 4x smaller | Medium | Inference |
| INT4 (GPTQ/AWQ) | 4-bit weights, FP16 activations | 3-6x | 8x smaller | Medium | LLM on consumer GPU |
| Structured Pruning | Remove channels/heads | 1.5-3x | 2-4x smaller | High | Dense model too big |
| Distillation | Train small student from teacher | Model-dep. | 3-10x smaller | High | Need smaller model |
| vLLM | PagedAttention + continuous batching | 10-24x tput | 2-4x efficient | Low | LLM serving |
```python
# Given: 13B params, BF16, Adam, batch=4, seq=4096, hidden=5120, 40 layers
params = 13e9
weights_gb = params * 2 / 1e9          # 26 GB (BF16)
grads_gb = params * 2 / 1e9            # 26 GB (BF16)
adam_states_gb = params * 4 * 3 / 1e9  # 156 GB (m + v + master, FP32)
activations_gb = 4 * 4096 * 5120 * 40 * 12 * 2 / 1e9  # ~80 GB
total_gb = weights_gb + grads_gb + adam_states_gb + activations_gb
# ~288 GB -> need FSDP across 4+ A100s (80 GB each)
```
```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedTransformer(nn.Module):
    def __init__(self, layers, checkpoint_every=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.checkpoint_every = checkpoint_every

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            # Checkpointed layers drop their activations and recompute them in backward
            if i % self.checkpoint_every == 0:
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x
```
```python
# Ring all-reduce: each GPU transfers 2*(N-1)/N * data_size
# N GPUs, 7B params in BF16 = 14 GB gradient data
N = 8
data_gb = 14
comm_gb = 2 * (N - 1) / N * data_gb  # 24.5 GB
# At 600 GB/s NVLink (A100, peak): 24.5 / 600 = ~0.04 seconds
# At 200 Gb/s InfiniBand HDR (25 GB/s): 24.5 / 25 = ~0.98 seconds
# This is why NVLink matters!
```
| Scenario | Root Cause | Diagnosis | Fix |
|---|---|---|---|
| GPU util 30% | Data pipeline starving GPU | Nsight: large gaps between kernels. CPU at 100%. | num_workers=8, WebDataset, pin_memory |
| INT8 model -8% acc | Per-tensor quant too coarse; outlier channels | Per-channel error analysis, weight histogram | Per-channel quant, SmoothQuant, or GPTQ |
| 8→64 GPU no scale | Communication dominates | Profile all-reduce time vs compute time | Overlap comm, gradient compression, local SGD |
| P99 5x P50 inference | GC pauses, CUDA malloc, scheduling | Nsight timeline: irregular large gaps | Pre-alloc buffers, CUDA graphs, TensorRT |
| OOM at batch=4 | Activation memory > weight memory | torch.cuda.max_memory_allocated() shows 70GB peak | Gradient checkpointing, reduce seq len, FSDP |
| Loss spikes every 1K steps | Data loader restarting (epoch boundary) | Correlate loss spikes with epoch count | persistent_workers=True, proper shuffling |
| NaN in training | FP16 overflow, bad LR, data corruption | torch.autograd.set_detect_anomaly(True) | BF16, lower LR, validate data pipeline |
| Serving memory leak | KV cache not freed after request | Monitor torch.cuda.memory_allocated() over time | Ensure request cleanup, use vLLM scheduler |
Training Priority:
1. Fix data loading (1.5-3x)
2. Mixed precision (1.5-2x)
3. torch.compile (1.1-1.4x)
4. Fused optimizer (1.15x)
5. Grad checkpoint if OOM
6. Distributed (DDP/FSDP)
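If the first four items on the training list were fully independent, their speedups would multiply. The sketch below computes that optimistic range from the numbers above. Treat the upper end as a ceiling rather than a forecast: in practice the bottlenecks overlap, so measured gains land lower.

```python
import math

# (low, high) speedup ranges for training items 1-4 in the list above
TRAINING_STEPS = {
    "data loading": (1.5, 3.0),
    "mixed precision": (1.5, 2.0),
    "torch.compile": (1.1, 1.4),
    "fused optimizer": (1.15, 1.15),
}

low = math.prod(lo for lo, _ in TRAINING_STEPS.values())
high = math.prod(hi for _, hi in TRAINING_STEPS.values())
print(f"{low:.1f}x to {high:.1f}x")  # 2.8x to 9.7x
```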
Inference Priority:
1. FlashAttention (2-4x)
2. KV cache (essential)
3. TensorRT (2-5x)
4. INT8/INT4 quantization (2-6x)
5. Continuous batching (2-3x tput)
6. Speculative decoding (2-3x)
| Type | Resource | Why |
|---|---|---|
| THE Book | Hennessy & Patterson, "Computer Architecture: A Quantitative Approach" | Roofline, memory hierarchy, everything foundational |
| Paper | Dao et al., "FlashAttention" (NeurIPS 2022) | The gold standard for hardware-aware algorithm design |
| Paper | Micikevicius et al., "Mixed Precision Training" (ICLR 2018) | Introduced loss scaling, still the reference |
| Paper | Kwon et al., "vLLM / PagedAttention" (SOSP 2023) | Changed how everyone serves LLMs |
| Paper | Frantar et al., "GPTQ" (ICLR 2023) | State-of-the-art post-training quantization |
| Paper | Shoeybi et al., "Megatron-LM" (2020) | Tensor + pipeline parallelism at scale |
| Repo | github.com/vllm-project/vllm | Production LLM serving |
| Repo | github.com/Dao-AILab/flash-attention | FlashAttention implementation |
| Repo | github.com/microsoft/DeepSpeed | Distributed training at scale |
| Repo | github.com/NVIDIA/TensorRT | Inference optimization |
| Repo | github.com/pytorch/pytorch (torch.compile) | The compiler that's replacing hand-tuned kernels |
| Metric | Value | Why It Matters |
|---|---|---|
| A100 BF16 TFLOPS | 312 | Baseline for all performance estimates |
| A100 HBM bandwidth | 2 TB/s | Memory-bound operations hit this ceiling |
| A100 roofline crossover | 156 ops/byte | Below = memory-bound, above = compute-bound |
| H100 BF16 TFLOPS | 990 (3.2x A100) | New baseline, wider memory-bound regime |
| NVLink bandwidth | 600 GB/s (A100), 900 GB/s (H100) | Intra-node communication speed |
| InfiniBand HDR | 200 Gb/s = 25 GB/s | Inter-node communication speed |
| BF16 per param | 2 bytes | 7B model = 14 GB weights |
| Adam optimizer state | 12 bytes/param (m + v + master) | 7B model = 84 GB optimizer |
| Full training memory | ~16 bytes/param + activations | 7B model ≥ 112 GB before activations |
| CUDA kernel launch | 5-10 μs | 50 ops × 10μs = 0.5ms/layer overhead |
| FlashAttention memory | O(N) vs O(N²) | seq=8192: 128MB/head → ~1MB/head |
| KV cache per token | ~0.5 MB (7B model, MHA, BF16) | 8192 tokens ≈ 4.3 GB per sequence |
| AD latency budget | 33ms (30 FPS) | Miss deadline = car drives blind |
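The roofline crossover in the table is peak FLOPs divided by memory bandwidth, and you can classify an operation by comparing its arithmetic intensity to it. A back-of-envelope sketch for matrix multiplies, assuming each matrix is read or written exactly once (ideal caching) and using the A100 numbers from the table; the function names are hypothetical:

```python
def roofline_crossover(peak_tflops, bandwidth_tb_s):
    """FLOPs per byte at which an op stops being memory-bound."""
    return peak_tflops / bandwidth_tb_s

def matmul_intensity(m, n, k, dtype_bytes=2):
    """Arithmetic intensity of an (m x k) @ (k x n) matmul in FLOPs per byte."""
    flops = 2 * m * n * k
    bytes_moved = dtype_bytes * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved

crossover = roofline_crossover(312, 2)  # A100: 156 FLOPs/byte
for shape in ((1, 4096, 4096), (4096, 4096, 4096)):
    ai = matmul_intensity(*shape)
    kind = "compute-bound" if ai > crossover else "memory-bound"
    print(f"{shape}: {ai:.1f} FLOPs/byte -> {kind}")
# (1, 4096, 4096): 1.0 FLOPs/byte -> memory-bound
# (4096, 4096, 4096): 1365.3 FLOPs/byte -> compute-bound
```

The first shape is a single-token decode step, which is why batch-1 LLM decoding is limited by memory bandwidth rather than TFLOPS.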
Transformers — The architecture all these techniques optimize.
Distributed Training — Deep dive into DDP, FSDP, ring all-reduce.