ML Performance Optimization Engineer

Chapter 0: The Role — What You'd Actually Build

You're an ML performance engineer at an autonomous driving company. Not a researcher who publishes papers. Not a data scientist who trains models in notebooks. You own the full pipeline from training infrastructure to real-time inference in a vehicle traveling at 70 mph.

Here is what your day looks like. Morning: a training run on 64 A100 GPUs costs $47,000 and takes 72 hours. Your job is to cut that to 24 hours without touching the model architecture or losing accuracy. Afternoon: the perception model runs at 42ms on the in-vehicle compute, but 30 FPS leaves a budget of 33ms per frame, so you need to shave 9ms. Evening: the serving team reports P99 latency spiking to 200ms during traffic bursts. You dig into profiler traces.

The full system you own:

Training Infra

Multi-GPU clusters, data pipelines, distributed strategies

↓

Profiling

Find the bottleneck — never guess

↓

Training Optimization

Mixed precision, torch.compile, gradient checkpointing

↓

Model Compression

Quantization, pruning, distillation

↓

Inference Engine

TensorRT, FlashAttention, KV cache

↓

Serving & Monitoring

vLLM, Triton, continuous batching, autoscaling

Interview tip: Every answer you give should follow this structure: "I found bottleneck X using tool Y, applied technique Z, and achieved W speedup." Vague answers like "I made it faster" fail at staff level. Specifics win: "Profiling with Nsight showed 40% of step time was data loading. I switched from random-access NFS reads to WebDataset tar shards and added 8 DataLoader workers with pin_memory. Step time dropped from 850ms to 520ms — a 1.6x improvement."

Real numbers for a perception model training run on 64 A100 GPUs:

Metric	Before Optimization	After Optimization
Training time	72 hours	22 hours (3.3x)
GPU cost	$47,000	$14,300
GPU utilization	34%	82%
Inference latency	42ms	18ms
Serving throughput	200 req/s	1,400 req/s

That 3.3x training speedup came from stacking many techniques: mixed precision (1.7x), torch.compile (1.15x), fused optimizer (1.12x), fixing the data pipeline (1.4x), and overlapping communication (1.1x). No single trick gave 3x. Performance engineering is the discipline of stacking 1.1x improvements until they compound.
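
The compounding claim can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the individual speedups are independent and multiply cleanly, which real pipelines only approximate; the dictionary keys are labels chosen for illustration.

```python
import math

# Per-technique speedups quoted in the paragraph above.
speedups = {
    "mixed precision": 1.7,
    "torch.compile": 1.15,
    "fused optimizer": 1.12,
    "data pipeline fix": 1.4,
    "comm overlap": 1.1,
}

# Independent speedups multiply; they do not add.
total = math.prod(speedups.values())
print(f"Combined speedup: {total:.2f}x")
print(f"72 h baseline -> {72 / total:.1f} h")
```

The product comes out slightly above the 3.3x in the table, which is plausible because stacked optimizations rarely compose perfectly.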

The Scaling Reality

The first thing you learn: GPUs don't scale linearly. Adding 8 GPUs doesn't give you 8x speedup. Communication overhead — synchronizing gradients across GPUs after every step — eats into your gains. On a well-optimized training run, expect ~85% scaling efficiency at 8 GPUs, ~70% at 64, and ~50% at 256+.

Here are the numbers for ResNet-50 on ImageNet. Single V100: ~14 hours. 8 V100s (DDP): ~2 hours (7x, not 8x). 256 V100s (highly optimized): ~25 minutes (33x, not 256x). The gap between linear scaling and reality is your optimization target.
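
Scaling efficiency is just speedup divided by GPU count. The sketch below applies that definition to the ResNet-50 figures quoted above; `scaling_report` is a hypothetical helper name. Note that this particular 256-GPU example lands well below the ~50% rule of thumb from the previous paragraph.

```python
def scaling_report(single_gpu_minutes, runs):
    """Return (gpus, speedup, efficiency) for each (gpus, minutes) pair."""
    report = []
    for gpus, minutes in runs:
        speedup = single_gpu_minutes / minutes
        report.append((gpus, speedup, speedup / gpus))
    return report

# ResNet-50 figures quoted above: 14 h on 1 GPU, 2 h on 8, 25 min on 256.
resnet50 = scaling_report(14 * 60, [(8, 2 * 60), (256, 25)])
for gpus, speedup, eff in resnet50:
    print(f"{gpus:4d} GPUs: {speedup:5.1f}x speedup, {eff:6.1%} efficiency")
```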

GPU Scaling: The Gap Between Ideal and Reality

Interactive figure: training time versus GPU count. The teal line shows actual training time with communication overhead; the dashed line is the unattainable ideal of perfect linear scaling.

System Design: ML Platform Architecture

Interactive figure: the ML platform pipeline, showing the optimization opportunities at each stage and how the system scales with GPU count.

The five dimensions of every chapter: Each chapter in this lesson covers (1) CONCEPT — whiteboard-ready from first principles, (2) DESIGN — where it fits in production, data flow, tradeoffs, (3) CODE — from-scratch implementation, (4) DEBUG — failure modes and knobs, (5) FRONTIER — latest papers and repos. This mirrors the five angles a staff interview probes.

Interview question: You inherit an ML training pipeline that costs $50K per run. The CEO wants it under $15K. What is your first step?

A) Immediately add mixed precision and gradient checkpointing
B) Profile the pipeline to find where time is actually spent before making any changes
C) Switch to cheaper GPUs

Chapter 1: GPU Architecture — The Hardware You're Optimizing For

CONCEPT: Cores, Memory, and the Bandwidth Wall

A modern GPU is not one fast processor — it's thousands of tiny processors with a deep memory hierarchy. The bottleneck is almost never raw compute. It's getting data to the compute units fast enough.

CUDA cores are general-purpose floating-point units. Each does one multiply-add per cycle. The A100 has 6,912. Fine for general math, but not special.

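Tensor Cores are the game-changer. In the original Volta design, a single Tensor Core performs a 4×4×4 matrix multiply-accumulate in ONE cycle: 64 fused multiply-adds, or 128 floating-point operations, at once. Later generations process larger tiles per cycle. Since transformers are fundamentally stacks of matrix multiplications, Tensor Cores are why modern GPUs are so fast for ML. They reach full efficiency only when matrix dimensions are multiples of 8 (FP16/BF16) or 16 (INT8); recent cuBLAS versions can still use them for other shapes, but at reduced throughput.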

The memory hierarchy is where performance lives or dies:

Level	Size (A100)	Bandwidth	Latency
Registers	~256KB / SM	~20 TB/s	~1 cycle
Shared Mem / L1	192KB / SM	~19 TB/s	~20 cycles
L2 Cache	40 MB	~5 TB/s	~200 cycles
HBM (Global)	80 GB	~2 TB/s	~400 cycles

That's a 10x bandwidth drop from L1 to HBM. Most ML operations read from HBM, compute, and write back to HBM. If the operation doesn't do enough math per byte loaded, the compute units sit idle waiting for data — you're memory-bound.

DESIGN: The Roofline Model

The roofline model is the staff engineer's first tool. It answers: "Is this operation limited by compute or by memory bandwidth?" Plot achievable throughput (TFLOPS) against arithmetic intensity (FLOPs per byte of memory traffic):

Arithmetic Intensity = FLOPs / Bytes transferred
Crossover = Peak TFLOPS / Peak Bandwidth
A100: 312 TFLOPS / 2 TB/s = 156 ops/byte

Below 156 ops/byte on the A100, you're memory-bound. Above, compute-bound. Most ML operations — element-wise activations, layer normalization, softmax — are deeply memory-bound. Only large matrix multiplications cross the line into compute-bound territory.

CODE: Roofline Calculation from Scratch

python
def roofline_analysis(op_name, flops, bytes_transferred, gpu):
    """Determine if an operation is compute or memory bound."""
    intensity = flops / bytes_transferred  # ops/byte
    crossover = gpu["tflops"] * 1e12 / (gpu["bw_gbs"] * 1e9)

    if intensity < crossover:
        achievable_tflops = intensity * gpu["bw_gbs"] / 1e3
        utilization = achievable_tflops / gpu["tflops"]
        return f"{op_name}: MEMORY-BOUND. Intensity={intensity:.1f} ops/B. " \
               f"Achievable={achievable_tflops:.1f} TFLOPS ({utilization*100:.1f}% of peak)"
    else:
        return f"{op_name}: COMPUTE-BOUND. Intensity={intensity:.1f} ops/B. " \
               f"Achievable={gpu['tflops']} TFLOPS (peak)"

A100 = {"tflops": 312, "bw_gbs": 2000}

# LayerNorm: 4N FLOPs, reads+writes 12N bytes (N = hidden dim)
N = 4096
print(roofline_analysis("LayerNorm", 4*N, 12*N, A100))
# LayerNorm: MEMORY-BOUND. Intensity=0.3 ops/B. Achievable=0.7 TFLOPS (0.2% of peak)

# Large matmul: 2*M*N*K FLOPs, reads M*K + K*N, writes M*N
M, K = 4096, 4096
flops = 2 * M * N * K
bytes_rw = (M*K + K*N + M*N) * 2  # BF16 = 2 bytes
print(roofline_analysis("Matmul 4096x4096x4096", flops, bytes_rw, A100))
# Matmul 4096x4096x4096: COMPUTE-BOUND. Intensity=1365.3 ops/B. Achievable=312 TFLOPS (peak)

DEBUG: Common GPU Performance Traps

Debugging tip: If you see "GPU utilization is 100% but training is slow", do not assume you are compute-bound. The utilization figure reported by nvidia-smi only means that some kernel was running, so 100% utilization does NOT mean 100% efficiency: the GPU could be busy running memory-bound operations at 5% of peak compute throughput. Profile at the kernel level with Nsight Compute to check SM occupancy vs memory bandwidth utilization.

Debugging tip: If you see "GPU utilization at 30%" — either memory-bound (check kernel-level SM occupancy vs memory bandwidth with Nsight Compute) or CPU bottleneck (check if CPU is at 100% or if there are long gaps between kernel launches in the Nsight Systems timeline). The fix depends entirely on which one.

Debugging tip: If you see "Tensor Cores not activating" or matmuls running far below peak, check matrix dimensions. Tensor Cores run at full efficiency only when dimensions are multiples of 8 for BF16/FP16 and multiples of 16 for INT8 (older cuBLAS/cuDNN releases skip them entirely for other shapes). Padding a 4097-dim layer to 4104 can double throughput at almost zero memory cost.
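
The padding rule is easy to apply mechanically. Here is a minimal sketch; `pad_to_multiple` is a hypothetical helper, and the sample dimensions (including the 50257-entry vocabulary size) are illustrative assumptions rather than values from this lesson.

```python
def pad_to_multiple(dim: int, multiple: int = 8) -> int:
    """Round dim up to the next multiple (8 for FP16/BF16, 16 for INT8)."""
    return -(-dim // multiple) * multiple  # ceiling division, then scale back up

for dim in (4096, 4097, 50257):
    padded = pad_to_multiple(dim)
    print(f"{dim} -> {padded} (+{(padded - dim) / dim:.3%} elements)")
```

The extra elements are a fraction of a percent, which is why padding is usually a cheap fix compared with leaving a misaligned dimension in a hot matmul.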

FRONTIER: Next-Generation Hardware

GPU	Year	HBM	FP16/BF16 TFLOPS	BW (TB/s)	Crossover
V100	2017	16/32 GB	125	0.9	139 ops/B
A100	2020	80 GB	312	2.0	156 ops/B
H100	2023	80 GB	990	3.4	291 ops/B
H200	2024	141 GB	990	4.8	206 ops/B
B200	2025	192 GB	2,250	8.0	281 ops/B

The H100 Transformer Engine dynamically switches between FP8 and FP16 per layer, per training step. FP8 training halves memory traffic and doubles Tensor Core throughput — but requires careful per-tensor scaling. Custom silicon (Google TPUs, AWS Trainium) targets specific matrix shapes and collective operations, trading generality for efficiency.

Notice the trend from V100 through H100: compute grew faster than bandwidth. The V100 crossover was 139 ops/byte, the A100 156, and the H100 291. The H200 pulls the ratio back to 206 by raising bandwidth alone, but the B200 returns to 281. A higher crossover means more operations become memory-bound. The hardware is getting faster at math, but not proportionally faster at moving data. This is why techniques like FlashAttention (which reduce HBM traffic) become more valuable, not less, on newer hardware.
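
The crossover column can be reproduced directly from the table. The sketch below simply divides peak TFLOPS by bandwidth in TB/s (the units cancel to FLOPs per byte); the figures are copied from the table above, not independently measured.

```python
# (FP16/BF16 TFLOPS, bandwidth in TB/s) copied from the table above.
gpus = {
    "V100": (125, 0.9),
    "A100": (312, 2.0),
    "H100": (990, 3.4),
    "H200": (990, 4.8),
    "B200": (2250, 8.0),
}

# TFLOPS / (TB/s) = FLOPs per byte: the roofline crossover point.
crossovers = {name: tflops / tbs for name, (tflops, tbs) in gpus.items()}
for name, ops_per_byte in crossovers.items():
    print(f"{name}: {ops_per_byte:.0f} ops/byte")
```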

Worked Example: End-to-End Roofline Analysis of a Transformer Layer

python
# Roofline analysis: every operation in a single transformer layer
# Model: hidden=4096, seq=2048, batch=4, A100

A100 = {"tflops": 312, "bw_gbs": 2000}
crossover = 312e12 / (2000e9)  # 156 ops/byte

B, S, D = 4, 2048, 4096  # batch, seq, hidden

ops = [
    ("QKV Projection",  2*B*S*D*3*D,         (B*S*D + 3*D*D + B*S*3*D)*2),
    ("Attention (QK^T)", 2*B*S*S*D,             (B*S*D*2 + B*S*S)*2),
    ("Softmax",          5*B*S*S,              B*S*S*2*2),
    ("Attn x V",         2*B*S*S*D,             (B*S*S + B*S*D + B*S*D)*2),
    ("Output Projection", 2*B*S*D*D,            (B*S*D + D*D + B*S*D)*2),
    ("LayerNorm",        4*B*S*D,              12*B*S*D),
    ("FFN Up (4x)",      2*B*S*D*4*D,           (B*S*D + D*4*D + B*S*4*D)*2),
    ("GELU",             B*S*4*D,              B*S*4*D*2*2),
    ("FFN Down",         2*B*S*4*D*D,           (B*S*4*D + 4*D*D + B*S*D)*2),
]

for name, flops, bytes_rw in ops:
    intensity = flops / bytes_rw
    bound = "MEM" if intensity < crossover else "COMP"
    print(f"{name:20s} | I={intensity:8.1f} | {bound}")

# Result: LayerNorm, Softmax, GELU are memory-bound
# All matmuls are compute-bound
# Fusing the memory-bound ops into neighbouring kernels removes HBM round-trips
# torch.compile automates much of this fusion

Interview tip: When asked "How would you speed up this model?", start with the roofline. Calculate the arithmetic intensity of the dominant operation. If memory-bound (most likely), focus on reducing memory traffic: operator fusion, quantization, FlashAttention. If compute-bound, focus on reducing FLOPs: smaller model, pruning, efficient architecture. This framework instantly separates you from candidates who jump to "use TensorRT" without diagnosis.

Interactive Roofline Model

X-axis: arithmetic intensity (ops/byte, log scale). Y-axis: achievable TFLOPS. Toggle GPU generations. Dots show common ML operations — red = memory-bound, green = compute-bound.

Interview question: A layer norm operation does 4N FLOPs and reads+writes 12N bytes (N = hidden dim = 4096). On an A100 (312 TFLOPS, 2 TB/s), is this compute-bound or memory-bound? What does this tell you about optimization strategy?

A) Compute-bound (intensity = 0.33, above crossover). Optimize by reducing FLOPs.
B) Memory-bound (intensity = 0.33, far below crossover of 156). Fuse it with adjacent ops to reduce HBM round-trips.
C) Neither: layer norm is too small to matter.

Chapter 2: PyTorch Internals — Understanding What You're Optimizing

CONCEPT: Eager Mode, Autograd, and the Caching Allocator

Eager mode: each operation executes immediately as Python encounters it. y = x @ W + b launches a matmul kernel, then a separate add kernel. Each kernel launch has ~5-10μs of CPU overhead. For a 32-layer transformer with ~50 ops per layer (1,600 launches), that's 8-16ms of launch overhead per forward pass, which becomes wasted time whenever the GPU finishes kernels faster than the CPU can launch them.

torch.compile (PyTorch 2.0+) traces the computation graph and fuses operations. A Linear + LayerNorm + GELU that takes 3 kernel launches and 3 HBM round-trips becomes ONE fused kernel: 1 read, compute all three in registers, 1 write. Typical speedup: 1.1-1.4x on transformer models.

The autograd DAG records every operation during forward for the backward pass. For a matmul y = x @ W, PyTorch saves both x and W because backward needs them: ∂L/∂W = x^T · ∂L/∂y requires x, and ∂L/∂x = ∂L/∂y · W^T requires W. These saved tensors are where memory goes.
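
To see why both operands must be kept, the matmul backward pass can be written out by hand. This is a pure-Python sketch on tiny matrices, not how PyTorch implements autograd; the toy loss (sum of all outputs) and the helper names are assumptions made for illustration. The weight gradient needs the saved x, the input gradient needs the saved W, and a finite-difference check confirms one entry.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def loss(x, W):
    """Toy scalar loss: sum of all entries of y = x @ W."""
    return sum(sum(row) for row in matmul(x, W))

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # shape (2, 3)
W = [[0.5, -1.0], [2.0, 0.0], [1.5, 3.0]]     # shape (3, 2)
grad_y = [[1.0, 1.0], [1.0, 1.0]]             # dL/dy for L = sum(y)

grad_W = matmul(transpose(x), grad_y)         # x^T @ dL/dy: needs the saved x
grad_x = matmul(grad_y, transpose(W))         # dL/dy @ W^T: needs the saved W

# Finite-difference check on one entry of W.
eps = 1e-6
W_bumped = [row[:] for row in W]
W_bumped[1][0] += eps
numeric = (loss(x, W_bumped) - loss(x, W)) / eps
print(grad_W[1][0], round(numeric, 4))
```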

PyTorch's caching allocator pools freed GPU memory for reuse instead of returning it to CUDA. Calling torch.cuda.empty_cache() hurts because it forces reallocation. The key diagnostic: torch.cuda.max_memory_allocated() — the peak watermark.

DESIGN: Memory Breakdown for a 7B Model

Where does memory go when training a 7B parameter model? Every byte is accounted for:

Component	Formula	7B in BF16 + Adam
Model weights	params × 2 bytes (BF16)	14 GB
Gradients	params × 2 bytes (BF16)	14 GB
Adam first moment (m)	params × 4 bytes (FP32)	28 GB
Adam second moment (v)	params × 4 bytes (FP32)	28 GB
FP32 master weights	params × 4 bytes (FP32)	28 GB
Subtotal (static)		112 GB
Activations	batch × seq × hidden × layers × ~12 × 2B	Variable

Interview tip: When asked about memory optimization, walk through the full breakdown: "The 7B model needs 14GB for weights, 14GB for gradients, and 84GB for Adam optimizer states (first moment, second moment, master weights in FP32). That's 112GB static before activations. Activations depend on batch size and sequence length. For batch=4, seq=2048, hidden=4096, 32 layers: approximately 26GB. Total: ~138GB, requiring FSDP across at least 2 A100s." Precise numbers beat hand-waving.

CODE: Activation Memory Calculator

python
def training_memory_gb(params_b, batch, seq, hidden, layers, dtype_bytes=2):
    """Calculate total GPU memory for training."""
    P = params_b * 1e9

    # Static memory
    weights = P * dtype_bytes                    # BF16 weights
    gradients = P * dtype_bytes                  # BF16 gradients
    adam_m = P * 4                               # FP32 first moment
    adam_v = P * 4                               # FP32 second moment
    master_weights = P * 4                       # FP32 master copy

    # Activation memory (per-layer, simplified)
    # Each transformer layer saves: input, QKV projections, attention scores,
    # attention output, FFN intermediate (4*hidden), FFN output
    act_per_layer = batch * seq * hidden * dtype_bytes * 12  # ~12x factor
    activations = act_per_layer * layers

    total = weights + gradients + adam_m + adam_v + master_weights + activations
    return {
        "weights_gb": weights / 1e9,
        "optimizer_gb": (adam_m + adam_v + master_weights) / 1e9,
        "gradients_gb": gradients / 1e9,
        "activations_gb": activations / 1e9,
        "total_gb": total / 1e9,
    }

mem = training_memory_gb(7, batch=1, seq=2048, hidden=4096, layers=32)
# weights: 14.0 GB, optimizer: 84.0 GB, gradients: 14.0 GB
# activations: 6.4 GB, total: 118.4 GB

DEBUG: Common PyTorch Memory Failures

Debugging tip: If you see "OOM during training" — check torch.cuda.max_memory_allocated(). Usually activations, not weights. Fixes in order: (1) gradient checkpointing, (2) smaller batch + gradient accumulation, (3) FSDP to shard optimizer states. Do NOT call torch.cuda.empty_cache() — it hurts the caching allocator.

Debugging tip: If you see "training slow after adding torch.compile" — first iteration compiles the graph (can take minutes). Subsequent iterations are fast. Use torch._dynamo.config.cache_size_limit = 64 if you see recompilation. Dynamic shapes trigger recompilation — pad inputs to fixed lengths or use dynamic=True.

Debugging tip: If you see "NaN loss" — check for overflow: use BF16 not FP16 (or add GradScaler for FP16). Check for division by zero in loss. Add torch.autograd.set_detect_anomaly(True) to find the exact operation. If NaN appears only after many steps, suspect gradient explosion — add gradient clipping.

FRONTIER: PyTorch 2.x and Beyond

Regional compilation lets you compile only performance-critical subgraphs instead of the whole model — faster compilation, fewer graph breaks. torch.export captures the full graph for deployment without Python. The Inductor backend generates Triton kernels that approach hand-written performance for many patterns.

torch.compile modes: "default" balances compile time vs speedup. "reduce-overhead" uses CUDA graphs to eliminate kernel launch overhead (best for small models). "max-autotune" benchmarks multiple kernel implementations (slow compile, fastest runtime). For training, start with "default". For inference, use "max-autotune".

python
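# Diagnosing torch.compile issues
# (model, MyModel, and sample_input are placeholders for your own module and CUDA input)
import time
import torch

# 1. See where graph breaks occur
torch._dynamo.config.verbose = True
explanation = torch._dynamo.explain(model)(sample_input)
print(explanation)
# Shows: graph breaks, reasons, affected operations

# 2. Common graph break causes and fixes
# - data-dependent control flow: if tensor.item() > 0 → remove .item()
# - dynamic shapes: use torch.compile(dynamic=True)
# - unsupported ops: torch._dynamo.config.suppress_errors = True falls back
#   to eager for the failing region (a workaround, not a fix)

# 3. Measuring compile speedup: time eager and compiled the same way
def bench_ms(fn, x, iters=100, warmup=3):
    """Average latency in ms. Warmup absorbs compilation and CUDA graph capture."""
    with torch.no_grad():
        for _ in range(warmup):
            fn(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(x)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - t0
    return elapsed / iters * 1000

model_eager = MyModel().cuda().eval()
model_compiled = torch.compile(model_eager, mode="reduce-overhead")

eager_ms = bench_ms(model_eager, sample_input)
compiled_ms = bench_ms(model_compiled, sample_input)
print(f"Eager: {eager_ms:.1f}ms/iter | Compiled: {compiled_ms:.1f}ms/iter "
      f"({eager_ms / compiled_ms:.2f}x)")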

Memory Timeline: Forward + Backward

Watch GPU memory during a forward+backward pass. Activations accumulate during forward, then get freed during backward. Toggle gradient checkpointing to see reduced peak.

Interview question: You have a 7B model in BF16 with Adam optimizer. Calculate the minimum GPU memory for training at batch_size=1 (no gradient checkpointing). Can it fit on one A100 (80GB)?

A) 28 GB (weights + gradients). Yes, fits easily.
B) 56 GB (weights + optimizer). Yes, fits on 80GB A100.
C) 112+ GB (14 weights + 14 grads + 84 optimizer + activations). No, exceeds 80GB.

Chapter 3: Profiling — The Staff Engineer's Superpower

CONCEPT: Three Levels of Profiling

PyTorch Profiler wraps your training loop and records every CUDA kernel, memory allocation, and CPU operation. Good for finding which operations take the most time. Start here.

Nsight Systems gives a system-level timeline: CPU and GPU activity side by side. You see exactly when the GPU is idle, when data transfers happen, and where synchronization stalls. This is how you find the real bottleneck — the visual timeline tells stories numbers can't.

Nsight Compute is kernel-level: SM occupancy, memory bandwidth utilization, warp stalls, instruction mix. Use this when you know WHICH kernel is slow and need to understand WHY.

DESIGN: The Profiling Workflow

1. Instrument

Wrap 5-10 training steps with profiler

↓

2. Identify

Find where time is spent (data? forward? comm?)

↓

3. Categorize

Compute-bound? Memory-bound? IO-bound? Comm-bound?

↓

4. Fix

Apply the targeted technique for that category

↓

5. Re-profile

Verify. The bottleneck SHIFTS — never assume.

↻ repeat until target met
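
Steps 1 and 2 of the workflow can be prototyped before reaching for a full profiler. The following is a rough sketch using only the standard library; `PhaseTimer` is a hypothetical helper and the `time.sleep` calls stand in for real work. On a GPU, kernels launch asynchronously, so each phase would need a `torch.cuda.synchronize()` before the clock is read for the numbers to mean anything.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PhaseTimer:
    """Accumulates wall-clock time per named phase of a training step."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def report(self):
        """Phases sorted by share of total time, largest first."""
        total = sum(self.totals.values())
        return sorted(((n, t / total) for n, t in self.totals.items()),
                      key=lambda item: item[1], reverse=True)

timer = PhaseTimer()
for _ in range(3):                      # stand-in for 3 training steps
    with timer.phase("data"):
        time.sleep(0.04)                # simulated slow data loading
    with timer.phase("forward"):
        time.sleep(0.01)
    with timer.phase("backward"):
        time.sleep(0.02)

for name, share in timer.report():
    print(f"{name:10s} {share:5.1%}")
```

The top entry of the report is the phase to categorize and fix first; re-running the timer after the fix shows where the bottleneck moved.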

Interview tip: When asked about performance optimization methodology, never say "I would try X." Say "I would profile to identify the bottleneck, then apply the targeted fix." The cardinal rule: never optimize without profiling first. A staff engineer who says "The trace shows 40% of step time is idle between batches, indicating a data loading bottleneck" is diagnosing. A staff engineer who says "I think the model is slow because..." without profiler data is guessing.

CODE: Complete Profiling Setup

python
import torch
from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

# Profile steps 5-10 (skip warmup)
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=2, warmup=2, active=6, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, (x, y) in enumerate(loader):
        if step >= 10:               # wait(2) + warmup(2) + active(6) steps are done
            break
        loss = model(x).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()

# Reading the trace: look for these red flags
# - CPU bar thick, GPU bar thin = CPU bottleneck
# - Large gaps between GPU kernels = sync points (.item(), print())
# - GPU idle between steps = data loading too slow
# - Lots of tiny GPU kernels = eager mode overhead (use torch.compile)

bash
# Nsight Systems command-line profiling
nsys profile --trace=cuda,nvtx --output=training_trace \
    python train.py --epochs 1 --profile

# View in Nsight Systems GUI:
# - GPU utilization timeline (should be >80% filled)
# - CUDA API trace (kernel launches, memcpy)
# - NVTX annotations (custom markers in your code)

# Nsight Compute for kernel-level analysis
ncu --target-processes all --set full \
    python inference.py --single-step
# Shows: SM occupancy, memory throughput, roofline position

DEBUG: Bottleneck Signatures

Symptom	Bottleneck	Fix
GPU util ~0% between batches	Data loading	num_workers=8, pin_memory, persistent_workers
Frequent short GPU idle gaps	CPU-GPU sync	Remove .item()/.cpu() calls, batch operations
Low SM occupancy, high BW	Memory-bound kernels	Operator fusion, torch.compile
GPU idle during all-reduce	Communication	Overlap comm with compute, gradient bucketing
Many small kernels, gaps	Eager overhead	torch.compile, CUDA graphs

Debugging tip: If you see low GPU utilization but can't tell why, check for hidden synchronization points. Common culprits: loss.item() (forces GPU-CPU sync), print(tensor) (same), if tensor > threshold (same). Each sync point blocks the CPU until the GPU finishes all queued work. Move logging to every 100 steps, and keep running metrics on the GPU between logs. Note that .detach() only drops the autograd graph; it does not remove the sync, so use it for values you don't need gradients on rather than as a fix for stalls.

Worked example: You profile a training step and find: data loading 40%, forward 20%, backward 25%, optimizer 10%, idle 5%. The model is doing useful work only 45% of the time. Priority: fix data loading first (biggest chunk). Expected impact: reducing data from 40% to 5% cuts step time by 35%. Next: fuse optimizer (10% to 5%). Then overlap comm if distributed. Result: GPU utilization jumps from 45% to ~80%.
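
The arithmetic behind this worked example is short enough to script. The sketch below treats the percentages as time units and assumes the stated targets (data 40 to 5, optimizer 10 to 5) are actually reached; `summarize` is a hypothetical helper.

```python
before = {"data": 40, "forward": 20, "backward": 25, "optimizer": 10, "idle": 5}

# Assumed targets from the worked example: data 40 -> 5, optimizer 10 -> 5.
after = dict(before, data=5, optimizer=5)

def summarize(breakdown):
    """Return total step time and the share spent in forward + backward."""
    total = sum(breakdown.values())
    model_share = (breakdown["forward"] + breakdown["backward"]) / total
    return total, model_share

total_before, share_before = summarize(before)
total_after, share_after = summarize(after)
print(f"Step time: {total_before} -> {total_after} units "
      f"({total_before / total_after:.2f}x faster)")
print(f"Forward+backward share: {share_before:.0%} -> {share_after:.0%}")
```

Counting only forward and backward as useful work gives about 75% after the fixes, and about 83% if the optimizer step is also counted, which brackets the ~80% quoted above.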

Reading a Profiler Trace: Step by Step

The CPU row should be thin — just launching kernels. If the CPU row is thick (lots of Python time), you have eager mode overhead. Fix: torch.compile.

The GPU row should be solid blocks, no gaps. Each gap is wasted time. Small gaps between kernels = too many small ops (torch.compile fuses them). Large gaps = synchronization (.item() call, CPU-GPU copy, or all-reduce barrier).

The memory row should be sawtooth: memory grows during forward (accumulating activations), peaks at the boundary, then drops during backward (freeing activations). If it plateaus or grows monotonically, you have a memory leak.

Advanced: NVTX Annotations for Custom Profiling

python
import torch
import nvtx  # pip install nvtx

class ProfiledTrainer:
    def train_step(self, batch):
        with nvtx.annotate("data_transfer", color="red"):
            x, y = batch[0].cuda(), batch[1].cuda()

        with nvtx.annotate("forward", color="blue"):
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = self.model(x, y)

        with nvtx.annotate("backward", color="orange"):
            loss.backward()

        with nvtx.annotate("optimizer", color="purple"):
            self.optimizer.step()
            self.optimizer.zero_grad()
# These annotations appear as colored bars in Nsight Systems
# Makes it trivial to identify which phase is the bottleneck

FRONTIER: Automated and Continuous Profiling

NVIDIA's DLProf automatically identifies optimization opportunities from profiler traces. PyTorch Kineto provides production-grade profiling with minimal overhead (<2% when sampling). Research on AI-assisted profiling uses LLMs to read traces and suggest optimizations — still early but promising. In production, continuous profiling samples every Nth step to detect performance regressions before they compound into wasted GPU-hours.

Holistic Trace Analysis (HTA) from Meta is an open-source tool that ingests PyTorch profiler traces and automatically computes: GPU idle time breakdown, kernel duration distribution, communication-computation overlap ratio, and memory bandwidth utilization. It generates actionable recommendations ranked by expected impact.

Interactive Profiler Trace

GPU timeline: forward (blue), backward (orange), data loading (red), communication (teal), optimizer (purple). Adjust DataLoader workers and toggle optimizations.

Interview question: You profile a training step and see: data loading 40%, forward 20%, backward 25%, optimizer 10%, idle 5%. Walk me through your optimization plan and expected impact.

A) Fix data loading first (biggest bottleneck). Increasing workers from 0 to 8 can reduce data time from 40% to ~5%, cutting total step time by ~35%. Then fuse the optimizer.
B) Optimize the forward pass since it's the core computation.
C) Use torch.compile to speed up everything equally.

Chapter 4: Mixed Precision Training

CONCEPT: FP32, FP16, BF16, and Why BF16 Won

A floating-point number has three parts: sign (1 bit), exponent (controls range), and mantissa (controls precision):

Format	Bits	Exponent	Mantissa	Range	Precision
FP32	32	8	23	±3.4×10³⁸	~7 decimal digits
FP16	16	5	10	±65,504	~3 decimal digits
BF16	16	8	7	±3.4×10³⁸	~2 decimal digits

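FP16 has more precision, but its largest value is only 65,504 and its smallest positive value is about 6×10⁻⁸. Large activations or losses can overflow to infinity, and small gradients can underflow to zero, which is why FP16 training needs loss scaling. BF16 keeps the same 8-bit exponent as FP32, so it has the same range and overflows only where FP32 would. It has less precision, but gradients that will be averaged with millions of others don't need 3 decimal digits.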
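
The range-versus-precision trade-off can be observed without a GPU. This sketch uses the standard library's half-precision `struct` format for FP16 and emulates BF16 by rounding an FP32 bit pattern to its top 16 bits; `to_fp16` and `to_bf16` are hypothetical helpers, and the BF16 emulation assumes finite inputs within FP32 range.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # struct refuses values beyond the FP16 range
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Emulate BF16: keep the top 16 bits of the FP32 encoding,
    rounding to nearest even. Assumes x is finite and fits in FP32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)      # round-to-nearest-even bias
    bits &= 0xFFFF0000                       # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for value in (70000.0, 1e-8, 1.001):
    print(f"{value!r:>8}: fp16={to_fp16(value)!r}  bf16={to_bf16(value)!r}")
```

A large value overflows FP16 but survives in BF16 with a coarse rounding, a tiny value vanishes in FP16 but not in BF16, and a value close to 1 keeps more digits in FP16 than in BF16.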

DESIGN: What Runs in Which Precision

AMP autocast automatically selects precision per operation:

Precision	Operations	Why
BF16/FP16	Matmul, convolution, linear	Tensor Cores give 2x throughput
FP32	Softmax, layer norm, loss	Precision-sensitive reductions
FP32 (master)	Optimizer weight updates	A ~1e-7 update added to a typical BF16 weight would be rounded away (7 mantissa bits)

CODE: AMP Training Loop and Loss Scaling

python
# BF16 training (modern GPUs: A100+)
for x, y in loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x, y)       # matmuls run in BF16, reductions in FP32
    loss.backward()              # backward reuses each op's forward precision
    optimizer.step()             # updates the FP32 weights

# FP16 training (older GPUs: V100) needs loss scaling
scaler = torch.cuda.amp.GradScaler()   # newer releases also offer torch.amp.GradScaler("cuda")
for x, y in loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x, y)
    scaler.scale(loss).backward()  # multiply loss by the current scale (starts at 65536, dynamic)
    scaler.step(optimizer)         # unscale grads, then step (skipped on inf/nan)
    scaler.update()                # adjust scale if overflow detected

# Loss scaling math (using a scale of 1024 for illustration):
# Forward: loss = 0.5 (normal)
# Scale: scaled_loss = 0.5 * 1024 = 512
# Backward: all gradients are 1024x larger (in FP16 range)
# Unscale: gradients /= 1024 (back to true values)
# If any gradient is inf/nan: skip this step, halve scale
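
The scale, unscale, and skip logic can be simulated on the CPU. The class below is a toy stand-in written for this explanation, not PyTorch's GradScaler; it assumes a backoff factor of 0.5, a growth factor of 2, and a short growth interval so the behavior is visible in a few steps. FP16 rounding is emulated with the standard library's half-precision `struct` format.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

class ToyLossScaler:
    """Hypothetical dynamic loss scaler (illustration only)."""

    def __init__(self, scale=1024.0, growth_interval=3):
        self.scale = scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, true_grads):
        """Return unscaled gradients, or None if the step is skipped."""
        used = self.scale
        scaled = [to_fp16(g * used) for g in true_grads]   # what FP16 backward would hold
        if any(math.isinf(g) or math.isnan(g) for g in scaled):
            self.scale = used / 2                          # overflow: back off, skip step
            self.good_steps = 0
            return None
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale = used * 2                          # stable for a while: grow
            self.good_steps = 0
        return [g / used for g in scaled]                  # unscale in higher precision

scaler = ToyLossScaler()
print("unscaled 1e-8 in FP16:", to_fp16(1e-8))             # underflows without scaling
recovered = scaler.step([1e-8])
print("recovered with scaling:", recovered)
skipped = scaler.step([100.0])                             # 100 * 1024 exceeds 65,504
print("overflow step:", skipped, "new scale:", scaler.scale)
```

The first call shows a gradient that FP16 alone would flush to zero being recovered; the second shows an overflow causing a skipped step and a halved scale.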

DEBUG: Mixed Precision Failures

Debugging tip: If you see "NaN loss with FP16" — gradient overflow. GradScaler should handle it (you'll see "inf detected, skipping step" in logs). If it happens every step, your loss or gradients are too large — reduce learning rate or check for bugs in the loss function. Quick fix: switch to BF16 which has the same range as FP32.

Debugging tip: If you see "no speedup with AMP" — three causes: (1) Model too small to saturate Tensor Cores, (2) Matrix dimensions not multiples of 8 (Tensor Cores don't activate), (3) Most time spent on non-matmul operations that don't benefit. Check with torch.backends.cuda.matmul.allow_tf32 = True as a baseline.

Debugging tip: If you see "periodic 'inf detected' every ~50 steps" — this is NORMAL for FP16 with GradScaler. The scaler dynamically adjusts. Only worry if it happens every step (learning rate too high) or accuracy degrades (numerical instability in a specific layer).

Worked Example: Memory Savings from AMP

Component	FP32 Training	BF16 AMP Training
Weights	7B × 4 = 28 GB	7B × 2 = 14 GB (BF16)
Gradients	7B × 4 = 28 GB	7B × 2 = 14 GB (BF16)
Adam m (1st moment)	7B × 4 = 28 GB	7B × 4 = 28 GB (always FP32)
Adam v (2nd moment)	7B × 4 = 28 GB	7B × 4 = 28 GB (always FP32)
Master weights	— (already FP32)	7B × 4 = 28 GB (FP32 copy)
Activations (batch=4)	~32 GB	~16 GB (BF16)
Total	~144 GB	~128 GB
Saving	—	~11% memory + 1.7x speed

The memory savings from AMP are modest (~11%) because the optimizer states dominate and stay in FP32. The real win is speed: BF16 matmuls run at 2x Tensor Core throughput, giving 1.5-2x overall training speedup.
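
The totals in the table follow from a simple sum. The sketch below reproduces them under the table's own assumptions (7B parameters, ~32 GB of FP32 activations, ~16 GB of BF16 activations); `training_memory` is a hypothetical helper and works in units of 10⁹ bytes.

```python
def training_memory(params_b, activations_gb, amp):
    """Static + activation memory in GB for Adam training (1 GB = 1e9 bytes)."""
    weights = params_b * (2 if amp else 4)
    grads = params_b * (2 if amp else 4)
    adam = params_b * 4 * 2                      # m and v stay in FP32 either way
    master = params_b * 4 if amp else 0          # extra FP32 copy only under AMP
    return weights + grads + adam + master + activations_gb

fp32_total = training_memory(7, activations_gb=32, amp=False)
amp_total = training_memory(7, activations_gb=16, amp=True)
saving = 1 - amp_total / fp32_total
print(f"FP32: {fp32_total} GB, BF16 AMP: {amp_total} GB, saving {saving:.1%}")
```

The FP32 master copy cancels out most of what the 16-bit weights and gradients save, which is why the reduction is modest.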

FRONTIER: FP8 and Beyond

FP8 training on H100: two formats, E4M3 (more mantissa = precision, used for forward activations) and E5M2 (more exponent = range, used for backward gradients). Halves memory traffic again compared to BF16. Requires per-tensor dynamic scaling — the H100 Transformer Engine handles this automatically.

Microscaling (MX) formats use shared exponents across groups of values: in MXFP4, 32 values share one 8-bit exponent (the scale), and each value stores only a 4-bit element (1 sign, 2 exponent, 1 mantissa bit). This gives an effective 4 + 8/32 = 4.25 bits per weight. 4-bit training is an active research area: 4-bit storage already works for fine-tuning (QLoRA keeps the frozen base weights in 4-bit and trains adapters in higher precision), but full 4-bit pre-training remains challenging due to gradient precision requirements.
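
The effective bits-per-value figure is the element width plus the shared scale amortized over the block. A minimal sketch, with `mx_bits_per_value` as a hypothetical helper and the block size and scale width taken from the description above:

```python
def mx_bits_per_value(element_bits: int, block_size: int = 32, scale_bits: int = 8) -> float:
    """Average storage per value when block_size values share one scale."""
    return element_bits + scale_bits / block_size

print(mx_bits_per_value(4))   # 4-bit elements
print(mx_bits_per_value(8))   # 8-bit elements
```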

python
# FP8 training on H100 (conceptual sketch; requires the transformer_engine package)
# API names follow the Transformer Engine quickstart; check your installed version
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Replace nn.Linear with te.Linear, then run it inside fp8_autocast
layer = te.Linear(4096, 4096, bias=True)
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

x = torch.randn(16, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)

# HYBRID format:
# Forward: E4M3 (4-bit exponent, 3-bit mantissa)
#   More mantissa = better precision for activations
# Backward: E5M2 (5-bit exponent, 2-bit mantissa)
#   More exponent = better range for gradients
# Per-tensor dynamic scaling: each tensor gets its own scale factor,
# derived from a history of observed max-abs values

# Memory impact: BF16 -> FP8 halves activation memory
# 7B model, batch=4, seq=2048: ~12 GB -> ~6 GB activations
# Speed impact: ~1.5x on H100 Tensor Cores

Interview tip: When asked "BF16 vs FP16 for training", say: "BF16 for A100+. Same range as FP32, no loss scaling needed, just works. FP16 only if stuck on V100s, and then you need GradScaler. The 3-bit mantissa difference between BF16 and FP16 doesn't matter because optimizer updates happen in FP32 master weights anyway."

Floating-Point Number Lines

Each dot is a representable value. FP32 has dense coverage. FP16 has a limited range. BF16 covers the full range but with wider gaps. Toggle views to explore different value ranges.

Interview question: Your team is training a vision transformer with FP16 + loss scaling. You see periodic "inf detected, skipping step" messages (about 1 in 50 steps). Is this a problem? What would you recommend?

A) This is normal for FP16 GradScaler: it dynamically adjusts. Only investigate if accuracy degrades or it happens every step. Consider switching to BF16 to eliminate it entirely.
B) This means the model is broken and you should switch to FP32.
C) You should increase the loss scale to prevent this.

Chapter 5: Distributed Training — DDP, FSDP, and Beyond

CONCEPT: Four Parallelism Strategies

Data Parallelism (DDP): every GPU holds a complete model copy. Split the batch. Each GPU computes gradients, then ring all-reduce averages them. Simplest, most common.

Fully Sharded (FSDP): shard parameters, gradients, AND optimizer states across GPUs. Before each layer's forward, all-gather the full parameters, compute, discard. Memory per GPU drops from full to 1/N.

Tensor Parallelism (TP): split individual layers across GPUs. A linear layer W of shape (d, 4d) is column-split across 4 GPUs, so each GPU holds a (d, d) slice. Requires NVLink because communication happens inside every layer.

Pipeline Parallelism (PP): split layers sequentially across GPUs. GPU 0 gets layers 1-10, GPU 1 gets 11-20. Micro-batching fills the pipeline. Has "bubble" overhead.

DESIGN: Decision Framework and Communication Cost

When to use what: Model fits on 1 GPU → DDP. Doesn't fit → FSDP. Doesn't fit even sharded → TP + PP. At scale, combine: Llama 3 405B used FSDP + TP + PP (plus context parallelism) on up to 16K H100 GPUs.
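The decision rule above can be sketched as a small function. This is an illustration under stated assumptions, not a sizing tool: `choose_strategy` is a hypothetical helper, 16 bytes per parameter assumes BF16 weights and gradients plus FP32 Adam state, and reserving half of GPU memory for activations is an arbitrary placeholder.

```python
def choose_strategy(params_b, gpu_mem_gb, n_gpus,
                    bytes_per_param=16, state_budget_frac=0.5):
    """Pick a parallelism strategy from model-state memory alone.

    bytes_per_param=16 assumes BF16 weights (2) + BF16 grads (2) + FP32 Adam
    state (12). state_budget_frac reserves the rest of the GPU for activations.
    Both are illustrative assumptions, not measured values.
    """
    state_gb = params_b * bytes_per_param       # billions of params * bytes = GB
    budget_gb = gpu_mem_gb * state_budget_frac
    if state_gb <= budget_gb:
        return "DDP"                            # a full replica fits on one GPU
    if state_gb / n_gpus <= budget_gb:
        return "FSDP"                           # fits once sharded across all GPUs
    return "TP + PP"                            # does not fit even when sharded


for params_b, n_gpus in [(1, 8), (7, 8), (70, 8), (70, 64)]:
    print(f"{params_b}B on {n_gpus} x 80GB: {choose_strategy(params_b, 80, n_gpus)}")
```

A real decision also depends on activation memory, interconnect topology, and batch size, which this sketch ignores.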

Communication cost for ring all-reduce: 2(N-1)/N × data_size, where N is GPU count. Let's work through a concrete example:

Model: 7B params in BF16 = 7 × 10⁹ × 2 bytes = 14 GB of gradients
GPUs: N = 8
Ring all-reduce volume: 2 × (8-1)/8 × 14 GB = 24.5 GB
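Bandwidth (assumed): 400 Gb/s = 50 GB/s, about one NVLink 3 link (an A100's 12 links aggregate to 600 GB/s, so this is a pessimistic figure)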
Communication time: 24.5 / 50 = 0.49 seconds

If your forward + backward takes 1.5 seconds, unoverlapped communication adds 0.49s, about 25% of the resulting 1.99s step. This is why DDP overlaps communication with backward: backward runs from the last layer to the first, so while it computes gradients for layer N, the already-finished gradients for layer N+1 are being all-reduced. With good overlap, most of the comm is hidden behind compute.

For inter-node communication (InfiniBand at 100 Gb/s = 12.5 GB/s), the same all-reduce takes 24.5/12.5 = 1.96 seconds. This is why tensor parallelism uses NVLink (intra-node) while data/FSDP parallelism uses InfiniBand (inter-node, less frequent communication).
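To see where the 2(N-1)/N factor comes from, here is a minimal pure-Python simulation of ring all-reduce that counts how many elements each rank sends. It is a sketch for intuition only: `ring_allreduce` is a hypothetical helper operating on Python lists, not how NCCL is implemented.

```python
def ring_allreduce(buffers):
    """Simulate ring all-reduce (sum) over equal-length lists, one list per rank.

    Returns the reduced buffers and the number of elements each rank sent.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must divide evenly into n chunks"
    chunk = size // n
    data = [list(b) for b in buffers]
    sent = [0] * n

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the full sum of chunk (r+1) % n
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, data[r][span((r - step) % n)]) for r in range(n)]
        for r, c, payload in msgs:
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]
            sent[r] += chunk

    # Phase 2, all-gather: pass the finished chunks around the ring
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, data[r][span((r + 1 - step) % n)]) for r in range(n)]
        for r, c, payload in msgs:
            data[(r + 1) % n][span(c)] = payload
            sent[r] += chunk

    return data, sent


n_gpus, size = 4, 8
reduced, sent = ring_allreduce([[float(r + 1)] * size for r in range(n_gpus)])
print(reduced[0])   # every element is 1 + 2 + 3 + 4
print(sent)         # elements sent per rank: 2 * (N - 1) / N * size
```

Each rank sends one chunk of size/N in each of the 2(N-1) steps, which is exactly the 2(N-1)/N × data_size volume used in the worked example.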

CODE: DDP, FSDP, and Communication Volume

python
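import os
import torch

# DDP in a few lines (launch with torchrun, which sets LOCAL_RANK; `model` is your nn.Module)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
torch.distributed.init_process_group("nccl")
model = model.to(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])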

# FSDP in 5 lines
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision

mp_policy = MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16)
model = FSDP(model, mixed_precision=mp_policy, use_orig_params=True)

# Communication volume calculation
def allreduce_comm_gb(params_b, n_gpus, dtype_bytes=2):
    data_gb = params_b * dtype_bytes  # total gradient size in GB
    # Ring all-reduce: 2*(N-1)/N * data_size
    return 2 * (n_gpus - 1) / n_gpus * data_gb

def comm_time_sec(params_b, n_gpus, bw_gbs, dtype_bytes=2):
    vol = allreduce_comm_gb(params_b, n_gpus, dtype_bytes)
    return vol / bw_gbs

# 7B model, 8 GPUs, NVLink (50 GB/s) vs InfiniBand (12.5 GB/s)
print(f"NVLink: {comm_time_sec(7, 8, 50):.2f}s")    # 0.49s
print(f"InfiniBand: {comm_time_sec(7, 8, 12.5):.2f}s") # 1.96s

DEBUG: Distributed Training Failures

Debugging tip: If you see "training doesn't scale linearly" — communication overhead. Profile the all-reduce time. If comm > 20% of step time, solutions: (1) overlap comm with backward computation (DDP default with bucketed gradients), (2) gradient compression (PowerSGD), (3) reduce comm frequency (local SGD with periodic sync).

Debugging tip: If you see "effective batch too large causing divergence" — 8 GPUs × 64 per GPU = 512 effective batch. You need learning rate scaling (linear scaling rule: LR × N) and warmup. Without warmup, the large initial LR causes training to diverge. Standard: 5-10 epochs of linear warmup.

Debugging tip: If you see "FSDP OOMs during forward": the all-gather temporarily materializes the full parameters of one FSDP unit. If a single unit is huge (e.g., a large embedding table), it may exceed memory. Solution: use a finer auto_wrap_policy so smaller submodules become their own FSDP units, which shrinks the largest all-gather.
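The linear scaling rule and warmup from the second tip above can be sketched as a schedule function. This is a minimal illustration: `scaled_lr` is a hypothetical helper, and it assumes warmup ramps linearly from the single-GPU learning rate to the scaled one, counted in steps rather than epochs.

```python
def scaled_lr(step, base_lr, n_gpus, warmup_steps):
    """Linear scaling rule with linear warmup from base_lr to base_lr * n_gpus."""
    peak_lr = base_lr * n_gpus
    if step >= warmup_steps:
        return peak_lr
    frac = step / warmup_steps
    return base_lr + frac * (peak_lr - base_lr)


# 8 GPUs, single-GPU LR of 0.1, 100 warmup steps
for step in (0, 50, 100, 1000):
    print(step, round(scaled_lr(step, 0.1, 8, 100), 3))
```

In PyTorch the same shape is usually expressed through a learning-rate scheduler; the function form just makes the arithmetic visible.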

FRONTIER: Beyond FSDP

DeepSpeed ZeRO-3 adds CPU/NVMe offloading for extreme memory savings. 3D parallelism (Megatron-LM) combines DP + TP + PP for 100B+ models. Sequence parallelism shards the sequence dimension for long-context models. Expert parallelism distributes MoE experts across GPUs. Context parallelism (ring attention) enables 1M+ token sequences by distributing the attention computation across GPUs in a ring topology.

PyTorch FSDP2 (2024) simplifies the API and improves performance: per-parameter sharding (not per-module), better composition with torch.compile, and native mixed precision support. Fully Sharded Data Parallel + Tensor Parallel composition is the standard for Llama-3 scale training: FSDP across nodes, TP within nodes.

python
# FSDP2 (PyTorch 2.4+) — cleaner API
from torch.distributed._composable.fsdp import fully_shard

# Shard each transformer block individually
for block in model.blocks:
    fully_shard(block)
fully_shard(model)  # root shard

# Composes with torch.compile
model = torch.compile(model)

# Memory per GPU for 70B, 64 GPUs:
# Weights: 140 GB / 64 = 2.2 GB
# Optimizer: 840 GB / 64 = 13.1 GB
# Gradients: 140 GB / 64 = 2.2 GB
# Total static: 17.5 GB (fits easily in 80 GB A100)

Interview tip: When asked to design training for a 70B model, show you understand the constraints: "70B in BF16 = 140GB weights alone. With Adam: 140 (weights) + 140 (gradients) + 280 + 280 + 280 (FP32 master weights, momentum, variance) = 1.12TB total state. On 64 A100s (80GB each), that's 5.12TB total memory. FSDP shards across all GPUs: ~17.5GB/GPU for model state, leaving ~62GB for activations. TP=8 within each node reduces per-layer all-gather volume. PP adds pipeline stages if activation memory is still tight."

Distributed Strategies: Memory per GPU

Toggle between DDP, FSDP, and Pipeline to see how model state distributes across 4 GPUs.

Interview question: You need to train a 70B model on 64 A100 GPUs (80GB each). The model in BF16 is 140GB. Design the parallelism strategy and explain your reasoning.

A. DDP: replicate the full model on each GPU.
B. FSDP alone: shard everything across 64 GPUs.
C. 3D: TP=8 within each node (NVLink), FSDP across 8 nodes, achieving per-GPU memory of ~15-20GB for model state plus activations.

Answer: C.

Chapter 6: Training Tricks — The Bag of Engineering Wins

CONCEPT: Five Techniques, Each Worth 10-40%

No single trick gives 10x. But stack enough modest improvements: 1.7 × 1.3 × 1.15 × 1.12 ≈ 2.85x. Each technique below is well-understood, widely deployed, and testable.

Gradient checkpointing: Don't save activations during forward. During backward, recompute them. ~33% more compute, but 5-10x less activation memory. Memory goes from O(L) to O(√L) in number of layers.

Efficient data loading: Default DataLoader (num_workers=0) means the GPU idles while CPU decodes images one at a time. Fix: num_workers=8, pin_memory=True, persistent_workers=True. For large-scale: WebDataset (tar shards) or FFCV (memory-mapped binary).

Fused optimizers: Standard AdamW does 4 element-wise operations per parameter, each round-tripping to HBM. A fused kernel does all 4 in one pass: 1 read, 1 write. 15-20% faster optimizer step.

torch.compile: One line of code. Fuses operations, eliminates kernel launch overhead. 1.1-1.4x speedup on transformers.

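Gradient accumulation: Want batch_size=1024 but only have memory for 64? Accumulate gradients over K=16 mini-batches before stepping. The gradient is mathematically equivalent to the large batch (batch-dependent layers such as BatchNorm aside); the cost is K forward/backward passes per optimizer step.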
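The equivalence claimed for gradient accumulation is easy to check by hand. The sketch below uses a one-parameter least-squares model in pure Python (the data and `mse_grad` are made up for illustration) and compares the full-batch gradient with the sum of micro-batch gradients, each divided by the accumulation count.

```python
def mse_grad(w, xs, ys):
    """d/dw of mean((w * x - y)^2) over a batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# One pass over the full batch of 8
full_batch_grad = mse_grad(w, xs, ys)

# Four micro-batches of 2, each gradient scaled by 1 / accum before summing
accum, micro = 4, 2
accumulated_grad = 0.0
for k in range(accum):
    mx = xs[k * micro:(k + 1) * micro]
    my = ys[k * micro:(k + 1) * micro]
    accumulated_grad += mse_grad(w, mx, my) / accum   # mirrors (loss / accum).backward()

print(full_batch_grad, accumulated_grad)
```

The two values agree up to floating-point rounding because the micro-batches are equal in size; with unequal sizes the per-batch means would need weighting.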

DESIGN: The Optimization Priority List

Apply in this order: (1) Fix data loading bottleneck (often 1.5-3x alone). (2) Enable mixed precision (1.5-2x). (3) torch.compile (1.1-1.4x). (4) Fused optimizer (1.15x). (5) Gradient checkpointing ONLY if OOM. Profile after each step — each fix shifts the bottleneck.

CODE: Implementation of Each Technique

python
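import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# 1. Gradient checkpointing
from torch.utils.checkpoint import checkpoint
class Model(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            # Recompute this layer's activations during backward instead of storing them
            x = checkpoint(layer, x, use_reentrant=False)
        return x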

# 2. Efficient data loading
loader = DataLoader(dataset, batch_size=64,
    num_workers=8, pin_memory=True,
    persistent_workers=True, prefetch_factor=2)

# 3. Fused optimizer (15-20% faster step)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

# 4. torch.compile (one line)
model = torch.compile(model, mode="reduce-overhead")

# 5. Gradient accumulation
accum = 16
for i, (x, y) in enumerate(loader):
    loss = model(x, y) / accum
    loss.backward()
    if (i + 1) % accum == 0:
        optimizer.step()
        optimizer.zero_grad()

DEBUG: When Tricks Backfire

Debugging tip: If you see "gradient checkpointing makes training 50% slower": too many checkpoint segments. Checkpointing every layer recomputes every layer's forward. Try one segment per 2-3 layers: checkpoint_sequential(layers, len(layers)//3, x, use_reentrant=False). The sweet spot is the fewest segments whose activation memory still fits.

Debugging tip: If you see "DataLoader slow despite 8 workers" — I/O bound, not CPU bound. NFS or S3 random reads are ~100x slower than local SSD sequential reads. Switch to WebDataset (tar shards) or FFCV (memory-mapped). Or: pre-download to local NVMe before training starts.

Debugging tip: If you see "torch.compile fails with error" — dynamic shapes or unsupported operations cause graph breaks. Try dynamic=True. Or use torch._dynamo.config.suppress_errors = True to fall back to eager for broken subgraphs. Check torch._dynamo.explain(model)(inputs) to see where graph breaks occur.
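The segment-count tradeoff from the checkpointing tip above can be made concrete with a deliberately simplified memory model. The assumptions are loud ones: every layer's activations cost one unit, peak memory is one saved input per segment plus one segment being recomputed, and `peak_activation_units` is a hypothetical helper.

```python
import math


def peak_activation_units(n_layers, segments):
    """Simplified peak: one saved input per segment + one segment held during recompute."""
    return segments + math.ceil(n_layers / segments)


n_layers = 64
best = min(range(1, n_layers + 1), key=lambda s: peak_activation_units(n_layers, s))

print("no checkpointing:", n_layers, "units")
print("best segment count:", best, "->", peak_activation_units(n_layers, best), "units")
```

Under this model the minimum lands at the square root of the layer count, which is where the O(√L) figure comes from. Real layers differ in activation size, so measure rather than trust the unit model.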

FRONTIER: Next-Generation Training Efficiency

FFCV (from MIT's Madry Lab) achieves near-theoretical I/O throughput with memory-mapped random-access binary files. Mosaic StreamingDataset streams from S3 with intelligent prefetching. Regional compilation in PyTorch 2.x compiles a repeated block (such as one transformer layer) once and reuses it, cutting compile time substantially. 8-bit optimizers (bitsandbytes) store Adam state in 8 bits instead of 32, shrinking optimizer memory by about 4x with negligible accuracy impact.

Worked Example: Stacking 3x Speedup

Let's trace the exact math for cutting a $50K training run to $16K:

Step	Technique	Multiplier	Cumulative	Step Time	Cost
0	Baseline	1.0x	1.0x	850ms	$50,000
1	Fix data loading (0→8 workers)	1.4x	1.4x	607ms	$35,700
2	BF16 AMP	1.7x	2.38x	357ms	$21,000
3	torch.compile	1.15x	2.74x	310ms	$18,250
4	Fused AdamW	1.12x	3.07x	277ms	$16,300

Key observation: the steps are ordered by ease of implementation as well as impact. The data loading fix comes first because it needs only config changes and a starved GPU hides every later gain. AMP is the largest single multiplier (1.7x) and is one context manager. torch.compile is one line. Fused optimizer is one flag. Total engineering effort: ~2 hours. Total savings: $33,700 per run.

Interview tip: When asked "How would you reduce training cost by 3x?", present the stack with math: "Profile first to find the bottleneck. Then apply in order of impact: mixed precision (1.7x), fix data pipeline (1.3x), torch.compile (1.2x), fused optimizer (1.15x). The product is 1.7 × 1.3 × 1.2 × 1.15 = 3.05x. Cost drops from $50K to $16K." Show the multiplication. Staff engineers quantify, they don't guess.
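The stacking arithmetic in the worked example can be reproduced with a few lines. This is a sketch: `stack_speedups` is a hypothetical helper, and it assumes the multipliers compose independently, which real optimizations only approximately do.

```python
def stack_speedups(base_ms, base_cost, steps):
    """Return (name, cumulative multiplier, step time in ms, cost) after each step."""
    rows, cumulative = [], 1.0
    for name, multiplier in steps:
        cumulative *= multiplier
        rows.append((name, round(cumulative, 2),
                     round(base_ms / cumulative), round(base_cost / cumulative)))
    return rows


steps = [("Fix data loading", 1.4), ("BF16 AMP", 1.7),
         ("torch.compile", 1.15), ("Fused AdamW", 1.12)]

for name, cum, ms, cost in stack_speedups(850, 50_000, steps):
    print(f"{name:18s} {cum:>5}x  {ms:>4} ms  ${cost:,}")
```

The results match the table to within rounding, ending at roughly 3.07x, 277 ms per step, and about $16,300.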

Training Optimization Dashboard

Toggle optimizations to see cumulative impact on speed and memory.

Interview question: Rank these optimizations by typical impact for training a 1B model on 8 GPUs, and explain why gradient checkpointing is last.

A. torch.compile > mixed precision > data loading > fused optimizer > grad checkpoint
B. mixed precision (1.5-2x) > data loading (1.3-3x) > torch.compile (1.1-1.4x) > fused optimizer (1.15x) > grad checkpoint (slower, but enables larger batch)
C. grad checkpoint > fused optimizer > data loading > torch.compile > mixed precision

Answer: B.

Chapter 7: Model Compression — SHOWCASE

CONCEPT: Three Compression Families

Quantization replaces high-precision weights with low-precision integers. The math is straightforward:

scale = (max − min) / (2^bits − 1)
zero_point = round(−min / scale)
x_int = clamp(round(x / scale + zero_point), 0, 2^bits−1)
x_dequant = (x_int − zero_point) × scale
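The four formulas above translate directly into code. The sketch below is a pure-Python illustration on a short list of values; `quantize_asymmetric` is a hypothetical helper, and the sample values are made up.

```python
def quantize_asymmetric(values, bits=8):
    """Asymmetric (affine) quantization following the scale / zero_point formulas."""
    lo, hi = min(values), max(values)
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0   # guard against a constant tensor
    zero_point = round(-lo / scale)
    q = [min(max(round(x / scale + zero_point), 0), qmax) for x in values]
    deq = [(xi - zero_point) * scale for xi in q]
    return q, deq, scale, zero_point


values = [-1.0, -0.4, 0.0, 0.25, 2.0]
q, deq, scale, zero_point = quantize_asymmetric(values)
max_err = max(abs(x - d) for x, d in zip(values, deq))

print("ints:", q, "zero_point:", zero_point)
print(f"scale={scale:.5f}  max round-trip error={max_err:.5f}")
```

For in-range values the round-trip error stays within about half a quantization step, which is why the scale (and therefore the value range) determines accuracy.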

Post-Training Quantization (PTQ): quantize after training. Run calibration data to find value ranges. Minutes to apply. Quantization-Aware Training (QAT): simulate quantization during training so the model adapts. Hours of training but recovers accuracy.

Per-tensor vs per-channel vs per-group: Per-tensor uses one scale for the whole tensor — coarse, more error. Per-channel (one scale per output channel) is much better for convolutions. Per-group (groups of 128 weights) gives the best accuracy for LLMs at INT4.

Pruning removes weights that contribute little. Unstructured: zero individual weights. 80-90% sparsity is possible, but dense GPU kernels gain no speed from scattered zeros; the closest hardware support is the A100's 2:4 semi-structured sparsity, which is fixed at 50%. Structured: remove entire neurons/channels/heads, which is hardware-friendly because the result is a smaller dense model.

Knowledge distillation trains a small "student" to mimic a large "teacher." The teacher's soft probabilities ("70% cat, 20% dog, 10% fox") contain richer information than hard labels ("cat"). The 20% dog tells the student that cats and dogs look similar.

L = α · L_hard(student, labels) + (1−α) · T² · KL(softmax(t/T) ‖ softmax(s/T))
where s and t are the student and teacher logits and T is the temperature.
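A minimal pure-Python sketch of this loss for a single example, taking the KL term from teacher to student (the conventional direction). `softmax` and `kd_loss` are hypothetical helpers and the logits are made up; a real implementation would use batched tensor ops.

```python
import math


def softmax(logits, T=1.0):
    """Temperature softmax, shifted by the max logit for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, T=4.0):
    """alpha * hard-label cross-entropy + (1 - alpha) * T^2 * KL(teacher || student)."""
    hard = -math.log(softmax(student_logits)[label])
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * hard + (1 - alpha) * T * T * kl


teacher = [4.0, 2.5, 1.0]            # "mostly cat, some dog, a little fox"
print(kd_loss([3.5, 2.0, 0.5], teacher, label=0))   # student close to teacher
print(kd_loss([0.5, 2.0, 3.5], teacher, label=0))   # student far from teacher
```

The T² factor keeps the soft-target gradient on the same scale as the hard-label term as the temperature changes.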

DESIGN: When to Use Each Technique

Technique	Time to Apply	Accuracy Hit	Size Reduction	Speed Gain
PTQ INT8	Minutes	0.1-0.5%	2x	2-3x
PTQ INT4 (GPTQ)	Hours	0.5-2%	4x	2-4x
QAT INT4	Days	0.1-0.5%	4x	2-4x
Structured Pruning 50%	Days	1-2%	2x	1.5-2x
Distillation	Days-Weeks	1-5%	3-10x	3-10x

CODE: Per-Channel Quantization from Scratch

python
import torch

def quantize_per_channel(weight, bits=8):
    """Per-channel symmetric quantization."""
    # weight shape: (out_channels, in_channels)
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8

    # Per-channel: one scale per output channel
    abs_max = weight.abs().amax(dim=1, keepdim=True)  # (out_ch, 1)
    scale = abs_max / qmax                               # (out_ch, 1)
    scale = scale.clamp(min=1e-8)                        # avoid div by zero

    # Quantize
    w_int = torch.round(weight / scale).clamp(-qmax, qmax).to(torch.int8)

    # Dequantize
    w_deq = w_int.float() * scale

    # Quantization error
    error = (weight - w_deq).abs().mean()
    return w_int, scale, error

# Example: quantize a 4096x4096 weight matrix
W = torch.randn(4096, 4096)
w_q, scales, err = quantize_per_channel(W, bits=8)
print(f"Original: {W.numel() * 4 / 1e6:.1f} MB (FP32)")
print(f"Quantized: {w_q.numel() / 1e6:.1f} MB (INT8) + {scales.numel()*4/1e3:.1f} KB scales")
print(f"Mean absolute error: {err:.6f}")
# Original: 67.1 MB → Quantized: 16.8 MB (4x reduction)
# Mean absolute error: ~0.007 (varies with the random draw; roughly scale / 4)

DEBUG: Compression Failures

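Debugging tip: If you see "INT8 quantized model has 5% accuracy drop": per-tensor quantization is too coarse. Switch to per-channel. If still bad: check for outlier channels (some transformer layers have a few channels with extreme activation or weight ranges). SmoothQuant migrates the quantization difficulty from activations to weights: it divides each activation channel by a smoothing factor and multiplies the matching weight row by the same factor, so the layer output is unchanged but the activations become easier to quantize.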

Debugging tip: If you see "pruned model is same speed" — unstructured pruning creates sparse matrices, but without sparse hardware support (or NVIDIA 2:4 structured sparsity), the GPU still processes the full dense matrix. Use structured pruning (remove whole channels/heads) for actual speedup on current hardware.

Debugging tip: If you see "distilled student doesn't converge" — temperature too high (try T=2-4, not T=20). Alpha balance wrong (start with α=0.5). Student too small (it can't fit the teacher's knowledge if the capacity gap is too large — rule of thumb: student should be at least 1/10th teacher size).
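The SmoothQuant rescaling mentioned in the first tip above can be checked on a toy layer. This sketch uses plain Python lists and made-up numbers; `matmul` and `smooth` are hypothetical helpers, and the per-channel factor follows the commonly cited form max|X_j|^α / max|W_j|^(1−α) with α = 0.5 as an assumption.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def smooth(X, W, alpha=0.5):
    """Return (X_hat, W_hat, s) such that X_hat @ W_hat equals X @ W.

    X: (tokens, in_channels) activations, W: (in_channels, out_channels) weights.
    """
    n_in = len(W)
    s = []
    for j in range(n_in):
        x_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        s.append(x_max ** alpha / w_max ** (1 - alpha))
    X_hat = [[row[j] / s[j] for j in range(n_in)] for row in X]       # shrink activations
    W_hat = [[W[j][k] * s[j] for k in range(len(W[0]))] for j in range(n_in)]  # grow weights
    return X_hat, W_hat, s


X = [[1.0, 50.0], [2.0, -40.0]]   # channel 1 is an activation outlier
W = [[0.5, -0.2], [0.01, 0.03]]
X_hat, W_hat, s = smooth(X, W)

print("before:", matmul(X, W))
print("after: ", matmul(X_hat, W_hat))
print("max |activation|:", max(abs(v) for r in X for v in r),
      "->", max(abs(v) for r in X_hat for v in r))
```

The output is unchanged while the activation range collapses, so a single per-tensor activation scale wastes far fewer quantization levels.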

FRONTIER: How GPTQ and AWQ Actually Work

GPTQ (2023) quantizes one layer at a time, using second-order (Hessian) information to decide which weights to round up vs down. The key insight: quantization error in one weight can be compensated by adjusting other weights in the same row. This "error compensation" dramatically reduces total quantization error. Process: for each column of the weight matrix, quantize it, measure the error, and spread that error across remaining unquantized columns.

python
# GPTQ conceptual flow (simplified pseudocode)
# quantize/dequantize, scale, zero_point, and H_inv (inverse Hessian of the
# layer inputs) are placeholders, not a real library API
for col in range(weight.shape[1]):
    # Quantize this column
    w_q = quantize(weight[:, col], scale, zero_point)
    # Compute quantization error, normalized by the inverse-Hessian diagonal
    error = (weight[:, col] - dequantize(w_q, scale, zero_point)) / H_inv[col, col]
    # Spread error to remaining columns using inverse Hessian
    # This is the key innovation: error compensation (note the minus sign)
    weight[:, col+1:] -= error.unsqueeze(1) * H_inv[col, col+1:].unsqueeze(0)
    weight[:, col] = dequantize(w_q, scale, zero_point)

AWQ (2024) observes that not all weights are equally important. Weights connected to high-activation channels matter more. AWQ scales up salient weights before quantization (effectively giving them more quantization levels), then scales down during inference. No retraining needed.

AQLM and QuIP# use vector quantization: instead of quantizing each weight independently, quantize groups of weights to the nearest codeword in a learned codebook. This enables 2-bit quantization with surprisingly low degradation. 2:4 structured sparsity (A100+): hardware-native support for 50% sparsity with zero runtime overhead.
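The 2:4 pattern itself is simple to state in code: in every group of four consecutive weights, at most two are non-zero. The sketch below picks the two largest by magnitude, which is an assumption for illustration (`prune_2_of_4` is a hypothetical helper; production flows typically fine-tune after pruning).

```python
def prune_2_of_4(row):
    """Zero the 2 smallest-magnitude weights in every group of 4."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # indices of the two largest-magnitude weights in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out


row = [0.1, -0.9, 0.5, 0.05, 1.0, 0.2, -0.3, 0.0]
print(prune_2_of_4(row))
```

Because the zero positions are constrained per group, the hardware can store only the two kept values plus a small index, which is what makes the pattern fast where arbitrary sparsity is not.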

Interview tip: When asked "INT4 vs INT8 for deployment?", frame it as a risk-accuracy tradeoff: "INT8 is safe for most models with <0.5% accuracy loss — use it as the default. INT4 halves model size again but needs GPTQ or AWQ for acceptable quality. For safety-critical systems (AD), INT8 is the floor — the accuracy risk of INT4 is harder to bound analytically. For consumer chatbots, INT4 (GPTQ/AWQ) is standard practice."

Compression Playground (SHOWCASE)

Three compression views of the same network. Adjust quantization bits, pruning sparsity, and distillation ratio. Watch accuracy and size change. Try to minimize size while keeping accuracy above 70%.

Interview question: You need to deploy a 7B LLM on a consumer GPU (24GB VRAM). Walk through your compression strategy including specific tools and expected memory.

A. Use FP32 and hope it fits.
B. GPTQ/AWQ INT4 quantization reduces model to 3.5GB. KV cache in FP16 for 4K context is ~2GB (2 × 32 × 32 × 128 × 4096 × 2 bytes). Total ~6GB, leaving ~18GB for batching. Use vLLM with PagedAttention for efficient KV management.
C. Prune 80% of weights and use FP32.

Answer: B.

Chapter 8: Inference Optimization — TensorRT, FlashAttention, KV Cache

CONCEPT: Three Pillars of Fast Inference

TensorRT is NVIDIA's inference compiler. It takes an ONNX model and applies: (1) layer fusion — Conv+BN+ReLU becomes one kernel, (2) INT8 calibration — automatically quantizes safe layers, (3) kernel auto-tuning — benchmarks dozens of kernel implementations per operation and picks the fastest for your specific GPU and input shapes.

FlashAttention tiles the attention computation into SRAM-sized blocks. Standard attention materializes the full N×N attention matrix in HBM — for seq=8192, that's 128MB per head. FlashAttention never writes it: O(N) memory instead of O(N²), and faster because fewer HBM accesses. The insight: you can compute softmax without ever materializing the full matrix by maintaining running statistics (online softmax).

KV Cache stores precomputed key/value tensors for autoregressive generation. Without it, generating token 1000 recomputes all 999 previous K/V projections. Memory cost:

KV cache = 2 × n_layers × n_heads × head_dim × seq_len × dtype_bytes
LLaMA-7B at seq=8192: 2 × 32 × 32 × 128 × 8192 × 2 = 4.3 GB
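The online softmax idea behind FlashAttention can be shown for a single query in pure Python. This is a sketch of the running-statistics trick only, not of the tiled GPU kernel: `attention_row_online` is a hypothetical helper, values are scalars for brevity, and the scores are made up.

```python
import math


def attention_row_online(scores, values, block=2):
    """softmax(scores) . values, computed block by block with running statistics."""
    m = float("-inf")   # running max of the scores seen so far
    l = 0.0             # running sum of exp(score - m)
    acc = 0.0           # running sum of exp(score - m) * value
    for i in range(0, len(scores), block):
        s_blk, v_blk = scores[i:i + block], values[i:i + block]
        m_new = max(m, max(s_blk))
        correction = math.exp(m - m_new)          # rescale old statistics to the new max
        l = l * correction + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * correction + sum(math.exp(s - m_new) * v for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l


def attention_row_reference(scores, values):
    """Standard softmax that materializes every weight at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


scores = [0.3, 2.0, -1.2, 4.1, 0.0, 3.3, 1.7]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.5]
print(attention_row_online(scores, values), attention_row_reference(scores, values))
```

Only three running numbers are kept per query regardless of sequence length, which is why the full N×N matrix never needs to be written to HBM.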

DESIGN: Inference Optimization Priority

Apply in order: (1) FlashAttention — free 2-4x for attention-heavy models. (2) KV cache — essential for autoregressive. (3) TensorRT/torch.compile — 2-5x from fusion and kernel tuning. (4) INT8 quantization — 2-4x from reduced memory traffic. Total possible: 10-50x over naive PyTorch.

CODE: KV Cache Memory Calculation

python
def kv_cache_gb(n_layers, n_heads, head_dim, seq_len, dtype_bytes=2, batch=1):
    """Calculate KV cache memory in GB."""
    # 2 for K and V, times each layer, head, dimension
    bytes_total = 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes * batch
    return bytes_total / 1e9

# LLaMA-7B: 32 layers, 32 heads, 128 head_dim
for seq in [2048, 4096, 8192, 32768]:
    print(f"seq={seq}: {kv_cache_gb(32, 32, 128, seq):.1f} GB")
# seq=2048:  1.1 GB
# seq=4096:  2.1 GB
# seq=8192:  4.3 GB
# seq=32768: 17.2 GB  ← this is why context length is expensive!

# With GQA (Grouped Query Attention), n_kv_heads < n_heads
# LLaMA-3 70B: 80 layers, 8 KV heads (GQA), 128 head_dim
print(f"LLaMA-3 70B, 8K: {kv_cache_gb(80, 8, 128, 8192):.1f} GB")
# LLaMA-3 70B, 8K: 2.7 GB (GQA saves 8x vs MHA)

DEBUG: Inference Performance Traps

Debugging tip: If you see "latency fine for batch=1 but terrible for batch=32": decode is memory-bandwidth-bound at both sizes, but what it reads changes. At batch=1, each step mostly reads the weights once. At batch=32, that weight read is shared across sequences, but the step also reads 32x more KV cache, so step time grows with batch size × context length. Fix: quantizing weights or the KV cache (INT8/FP8) reduces memory traffic per token, and GQA shrinks the KV cache itself.

Debugging tip: If you see "FlashAttention not helping" — sequence length too short. FlashAttention's advantage comes from avoiding the N² attention matrix. At seq=128, that matrix is tiny (32KB). The benefit starts around seq=512 and grows quadratically with length. Also check: are you using the right FlashAttention version for your GPU (FA2 for Ampere, FA3 for Hopper)?

Debugging tip: If you see "TensorRT build fails" — unsupported operations (custom layers, dynamic control flow). Try onnx-simplifier to canonicalize the graph. For custom ops, write TensorRT plugins or use torch-TensorRT's partial compilation (compile what works, fall back to PyTorch for the rest).

Speculative Decoding: 2-3x Free Speedup

Speculative decoding exploits the fact that a small "draft" model (e.g., 125M params) can predict many tokens correctly. The algorithm: (1) Draft model generates K candidate tokens quickly. (2) Large model verifies all K candidates in a single parallel forward pass. (3) Accept the longest prefix of correct predictions, then take the large model's own token at the first mismatch. If the draft model gets 4 out of 5 right, you generated 5 tokens (4 accepted plus 1 correction) for the cost of one large-model forward pass plus 5 cheap draft passes.

The speedup depends on acceptance rate — how often the draft model matches the large model. For well-matched pairs (e.g., 125M draft for 7B target), acceptance rates of 70-85% give 2-3x speedup.

python
# Speculative decoding speedup estimation
def spec_decode_speedup(accept_rate, k_candidates, draft_time_ms, target_time_ms):
    """Estimate speculative decoding speedup."""
    # Expected tokens per step: sum of geometric series
    expected_tokens = (1 - accept_rate ** (k_candidates + 1)) / (1 - accept_rate)
    # Time per step: draft generates k + target verifies k in parallel
    time_per_step = k_candidates * draft_time_ms + target_time_ms
    # Baseline: 1 token per target forward
    baseline_time = target_time_ms
    speedup = expected_tokens * baseline_time / time_per_step
    return speedup, expected_tokens

# Example: 125M draft (2ms), 7B target (30ms), k=5, 80% accept
s, t = spec_decode_speedup(0.8, 5, 2, 30)
print(f"Speedup: {s:.1f}x, Expected tokens/step: {t:.1f}")
# Speedup: 2.8x, Expected tokens/step: 3.7
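The verification step (3) described above can be sketched for the greedy case, where a draft token is accepted only if it equals the target model's own choice. `verify_draft` is a hypothetical helper and the token ids are made up; sampling-based variants use a probabilistic accept/reject rule instead of exact matching.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of K draft tokens.

    target_tokens[i] is the target model's choice given the prompt plus
    draft_tokens[:i], so it has len(draft_tokens) + 1 entries (all produced
    by one parallel forward pass).
    """
    out = []
    for i, d in enumerate(draft_tokens):
        if d != target_tokens[i]:
            out.append(target_tokens[i])   # first mismatch: keep the target's token, stop
            return out
        out.append(d)                      # match: the draft token is accepted
    out.append(target_tokens[len(draft_tokens)])   # all K accepted: one bonus token
    return out


# Draft gets 4 of 5 right: 4 accepted tokens plus the target's correction
print(verify_draft([5, 7, 9, 2, 4], [5, 7, 9, 2, 8, 1]))
```

Every step yields at least one token, so the output is identical to decoding with the target model alone; only the speed changes.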

FRONTIER: Next-Gen Inference

FlashAttention-3 (2024): FP8 support on H100, asynchronous pipelining, 1.5-2x faster than FA2 on Hopper. PagedAttention (vLLM): virtual memory for KV cache. Disaggregated serving: separate prefill (compute-heavy) from decode (memory-heavy) onto different hardware. Medusa: adds multiple prediction heads to generate several candidate tokens simultaneously without a separate draft model.

Interview tip: When asked about inference optimization, frame it as two regimes: "Prefill is compute-bound (processing the whole prompt at once — optimize with FlashAttention and TensorRT). Decode is memory-bound (generating one token at a time, reading the entire KV cache — optimize with quantization and GQA). The bottleneck depends on which phase dominates your workload."
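A bandwidth-only estimate makes the decode regime concrete. The sketch below is a rough model under loud assumptions: 14 GB of BF16 weights (7B), 0.5 MB of KV cache per token (LLaMA-7B with full multi-head attention), about 2000 GB/s of HBM bandwidth, and compute time ignored entirely. `decode_step_ms` is a hypothetical helper.

```python
def decode_step_ms(batch, context_len, weight_gb=14.0,
                   kv_mb_per_token=0.5, hbm_gb_s=2000.0):
    """One decode step: read the weights once plus every sequence's KV cache."""
    kv_gb = batch * context_len * kv_mb_per_token / 1000
    return (weight_gb + kv_gb) / hbm_gb_s * 1000


for batch in (1, 4, 16, 64):
    step_ms = decode_step_ms(batch, context_len=2048)
    per_seq = 1000 / step_ms          # tokens/s seen by each sequence
    total = batch * per_seq           # tokens/s across the whole batch
    print(f"batch={batch:3d}  step={step_ms:6.2f} ms  "
          f"per-seq={per_seq:6.1f} tok/s  total={total:7.1f} tok/s")
```

Per-sequence speed falls as the batch grows because each step reads more KV cache, while total throughput keeps rising because the weight read is shared.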

Inference Pipeline: Before and After Optimization

Toggle TensorRT fusion and INT8 quantization. Adjust batch size for throughput/latency tradeoff.

Interview question: An LLM generates tokens at 50 tokens/sec for batch=1, but only 10 tokens/sec per sequence for batch=16. Is this expected? Explain the bottleneck shift and total throughput.

A. Yes. At batch=1, each decode step is dominated by reading the weights once. At batch=16, that weight read is shared, but the step must also read a 16x larger KV cache, so each step takes longer and per-sequence throughput drops. Total throughput (16 × 10 = 160 tok/s) is still 3.2x higher than single-sequence (50 tok/s).
B. No, this indicates a bug. Batching should give linear speedup.
C. This is caused by Python GIL contention.

Answer: A.

Chapter 9: Serving at Scale — vLLM, Triton, and Production

CONCEPT: Batching, PagedAttention, and Continuous Batching

Static batching: wait for B requests, process together. Problem: if B=32, the first request waits for 31 others. Terrible latency at low traffic.

Continuous batching: process requests as they arrive. For LLMs, sequences have different lengths — some finish in 20 tokens, others in 2000. vLLM inserts new requests as old ones complete, keeping the GPU busy. 2-3x throughput improvement over static batching.

PagedAttention: borrows virtual memory paging from OS design. KV cache is stored in fixed-size blocks (pages), tracked by a page table. Sequences allocate pages on-demand — no pre-allocated max-length buffers. Why this matters: without paging, you pre-allocate max_seq_len × KV per sequence. Average utilization is ~20%. With paging, you allocate only what each sequence actually uses. 2-4x more concurrent sequences in the same GPU memory.

DESIGN: Production Serving Architecture

Load Balancer

Route requests to least-loaded GPU worker

↓

Request Queue

Priority queue with SLA-aware scheduling

↓

GPU Workers (vLLM)

Continuous batching + PagedAttention + prefix caching

↓

Response Streaming

Server-sent events, token by token

↻ Autoscaler: add/remove workers based on queue depth

CODE: PagedAttention Conceptual Implementation

python
# Conceptual: why PagedAttention saves memory

# WITHOUT paging: pre-allocate max_seq_len per sequence
max_seq = 8192
kv_per_token = 2 * 32 * 32 * 128 * 2  # 2 (K,V) * layers * heads * head_dim * bytes = 512 KB/token
allocated = max_seq * kv_per_token    # 4.3 GB per sequence
# Avg sequence uses 500 tokens: 262 MB actually used
# Waste: 94% of allocated KV memory is unused!

# WITH paging: allocate blocks as needed
block_size = 16  # tokens per block
block_bytes = block_size * kv_per_token  # 8 MB per block

# 500-token sequence needs ceil(500/16) = 32 blocks = 268 MB
# On demand: only allocate as sequence grows
# Non-contiguous: blocks can be anywhere in GPU memory
# Shared: common prefixes (system prompt) share physical blocks

# Result: in this example, up to 16x more concurrent sequences than max-length pre-allocation
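The on-demand allocation described in the comments above can be sketched as a tiny block allocator. This is an illustration of the bookkeeping only: `BlockAllocator` is a hypothetical class, not vLLM's API, and it omits prefix sharing and eviction.

```python
class BlockAllocator:
    """Hands out fixed-size KV blocks on demand and tracks a per-sequence block table."""

    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))   # physical block ids not in use
        self.tables = {}                    # seq_id -> list of physical block ids
        self.lengths = {}                   # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:        # first token, or the current block is full
            if not self.free:
                raise MemoryError("out of KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(n_blocks=64)
for _ in range(500):
    alloc.append_token("request-a")
print(len(alloc.tables["request-a"]), "blocks used,", len(alloc.free), "free")
alloc.release("request-a")
print(len(alloc.free), "free after release")
```

Forgetting the `release` call for disconnected clients is one way to get a slow memory leak in a serving system.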

Model Serving Frameworks

Framework	Best For	Key Feature
vLLM	LLM serving	PagedAttention, continuous batching, prefix caching
Triton Inference Server	Multi-model pipelines, non-LLM	Dynamic batching, model ensembles, TensorRT
TGI (HuggingFace)	Quick LLM deployment	Optimized transformers, FlashAttention, quantization
SGLang	Structured generation	RadixAttention, constrained decoding, fast JSON

Key Serving Metrics

Production LLM serving is measured on four dimensions:

Metric	Definition	Target	How to Optimize
TTFT	Time to first token (prefill latency)	<500ms	FlashAttention, prefix caching, shorter prompts
ITL	Inter-token latency (decode step)	<50ms	INT4 quantization, speculative decoding
Throughput	Total tokens/sec across all requests	Maximize	Continuous batching, PagedAttention
P99 latency	99th percentile end-to-end	<2s	Autoscaling, request shedding, pre-warming

The TTFT vs throughput tradeoff: larger batches improve throughput (more tokens processed per GPU-second) but increase TTFT (new requests wait in the batch). Continuous batching helps because it inserts new requests without waiting for the current batch to finish. The sweet spot is a batch size where GPU compute utilization is >80% and TTFT stays under your SLA.

FP8 KV cache is an emerging technique in vLLM and SGLang: quantize KV cache entries from BF16 to FP8, cutting KV memory in half. This doubles the number of concurrent sequences for the same GPU memory, with <0.5% perplexity degradation on most models. Combined with PagedAttention, this can increase serving efficiency by 3-4x over naive implementations.

Chunked prefill is another key optimization: instead of processing a 4K-token prompt in one large compute-heavy chunk (which blocks decode for other sequences), split it into smaller chunks interleaved with decode steps. This keeps TTFT for new requests low even when the system is processing long prompts.

Both vLLM and SGLang implement chunked prefill, and it's becoming the default for production deployments where latency SLAs are strict.

DEBUG: Serving Failures at Scale

Debugging tip: If you see "P99 latency spike during traffic bursts" — autoscaling too slow (cold start takes minutes). Solutions: (1) pre-warm spare instances, (2) scale on queue depth (leading indicator) not GPU utilization (lagging indicator), (3) request shedding with graceful degradation (return shorter responses under load).

Debugging tip: If you see "memory grows until OOM" — KV cache leak. Sequences not being freed after completion. Check for: disconnected clients whose sessions aren't cleaned up, or max_num_seqs set too high for available memory. Monitor torch.cuda.memory_allocated() over time to catch the leak pattern.

Debugging tip: If you see "throughput drops with longer sequences" — KV cache memory pressure forces eviction of batched sequences. Reduce max_num_seqs or max_model_len. Alternatively: quantize KV cache to FP8 (vLLM supports this) to double effective memory.

FRONTIER: Next-Gen Serving

Disaggregated prefill/decode: prefill is compute-heavy (processing the whole prompt at once), decode is memory-heavy (generating one token at a time). Run prefill on high-compute GPUs, decode on high-bandwidth GPUs. Splitwise and DistServe implement this.

Prefix caching: system prompts shared across users are cached once, avoiding redundant computation. Multi-LoRA serving: one base model, hundreds of fine-tuned adapters hot-swapped at request time. SGLang's RadixAttention: tree-based KV cache sharing for branching generation (e.g., beam search, multiple responses).

Worked Example: Capacity Planning for a Chatbot

python
def capacity_plan(model_params_b, concurrent_users, avg_seq_len,
                   gpu_mem_gb, quant_bits=4):
    """Estimate GPU count for an LLM serving deployment."""
    # Model memory
    model_gb = model_params_b * quant_bits / 8

    # KV cache per user (approximate)
    # Simplified: ~0.5 MB per token per 7B params at BF16
    kv_per_token_gb = model_params_b * 0.5e-3 / 7  # scale from 7B baseline
    kv_per_user_gb = kv_per_token_gb * avg_seq_len

    # Total KV with PagedAttention overhead (~30%)
    total_kv_gb = concurrent_users * kv_per_user_gb * 1.3

    # GPUs needed (leave 20% headroom)
    mem_per_gpu = gpu_mem_gb * 0.8
    total_mem = model_gb + total_kv_gb
    n_gpus = max(1, -(-total_mem // mem_per_gpu))  # ceil division

    return {"model_gb": model_gb, "kv_total_gb": total_kv_gb,
            "total_gb": total_mem, "gpus_needed": int(n_gpus)}

# 7B model, 1000 users, avg 1K tokens, A100 80GB, INT4
plan = capacity_plan(7, 1000, 1000, 80, 4)
print(plan)
# model: 3.5 GB, kv: ~650 GB, total: ~653.5 GB, gpus: 11
# (0.5 MB/token is full multi-head attention; a GQA model with 8 KV heads needs ~4x less KV)

Interview tip: When asked to design a serving system, start with the math: "7B model in INT4 = 3.5GB. KV cache per user at avg 1K tokens = 0.13GB (assuming GQA with 8 KV heads; full multi-head attention would be ~0.5GB). For 1000 concurrent users: 130GB KV total. With PagedAttention at 70% efficiency: ~185GB memory needed. That's 3 A100-80GB GPUs for memory alone, plus throughput headroom. I'd use vLLM with continuous batching, autoscale on queue depth, and add prefix caching if there's a shared system prompt."

Static vs Continuous Batching

Requests arrive as colored dots. Watch how static batching queues them vs continuous batching processes them immediately. Adjust arrival rate.

Interview question: Design a serving architecture for a chatbot with 1000 concurrent users and a 7B model. Each conversation can reach 8192 tokens. How many A100 GPUs (80GB) do you need? Show your calculation.

A. 1 GPU: the model is only 14GB in BF16.

B. Model (INT4) = 3.5GB. KV cache per user (avg 1K tokens) = 0.13GB. 1000 users = 130GB KV. With PagedAttention (70% efficiency): ~185GB total. Need ~3 A100s for memory, more for throughput SLA.

C. 100 GPUs: one per 10 users.

Chapter 10: Autonomous Driving — Real-Time Multi-Model Stacks

CONCEPT: The AD Perception Stack and the 33ms Budget

A modern AD vehicle runs cameras (6-8 at 2MP, 30Hz = ~1.4 GB/s), LiDAR (128 beams, 300K points/frame), and radar. All of this feeds into: BEV features → detection + segmentation → tracking → prediction → planning. Total budget: 33ms at 30 FPS. Miss the deadline and the car drives blind for that frame.

Stage	Model	Baseline	Optimized
Camera preprocess	—	5ms	2ms
Backbone (encoder)	EfficientNet-B4	12ms	6ms
Detection head	BEV deformable attn	8ms	4ms
Segmentation head	BEV decoder	5ms	3ms
Tracking	Kalman + association	3ms	3ms
Prediction	Trajectory forecast	10ms	5ms
Planning	Motion planner	5ms	4ms
Total	—	48ms (21 FPS)	27ms (37 FPS)
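The totals and frame rates in the table can be checked directly. A minimal sketch, with the stage timings copied from the table and the dictionary and function names chosen for illustration:

```python
# (baseline_ms, optimized_ms) per stage, copied from the table above
STAGES_MS = {
    "camera_preprocess": (5, 2),
    "backbone": (12, 6),
    "detection_head": (8, 4),
    "segmentation_head": (5, 3),
    "tracking": (3, 3),
    "prediction": (10, 5),
    "planning": (5, 4),
}
DEADLINE_MS = 1000 / 30   # 30 FPS gives ~33.3 ms per frame

def summarize(column):
    """Sum one column of the table and compare it to the frame deadline."""
    total = sum(stage[column] for stage in STAGES_MS.values())
    return {"total_ms": total, "fps": 1000 / total, "fits": total <= DEADLINE_MS}

baseline, optimized = summarize(0), summarize(1)
print("baseline:", baseline)
print("optimized:", optimized)
```

This treats the stages as strictly serial, which is how the table sums them.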

DESIGN: Where to Spend Optimization Effort

Four techniques that cut the 48ms baseline nearly in half:

1. Backbone sharing. One encoder feeds detection + segmentation + depth heads. Amortize the 12ms encoder cost across 3 tasks instead of running 3 encoders.

2. Temporal fusion. Reuse BEV features from previous frames (aligned by ego-motion). Only update regions where things changed. Saves 30-40% on BEV computation.

3. TensorRT. Layer fusion + kernel auto-tuning. Typical 2-3x speedup on neural network components. Deterministic timing for safety-critical code.

4. Sparse computation. Process only regions with objects, skip empty space. BEVPoolv2 computes features only where objects are likely. 30-50% savings.

CODE: FLOPs Savings from Backbone Sharing

python
# Without sharing: 3 separate encoders
encoder_gflops = 8.5  # EfficientNet-B4
n_cameras = 6
separate_cost = encoder_gflops * n_cameras * 3  # 3 tasks
print(f"Separate encoders: {separate_cost:.0f} GFLOPs")  # 153 GFLOPs

# With sharing: 1 shared encoder + 3 lightweight heads
shared_cost = encoder_gflops * n_cameras        # 51 GFLOPs (encoder)
head_cost = 1.5 * 3                              # 4.5 GFLOPs (3 heads)
total_shared = shared_cost + head_cost           # 55.5 GFLOPs
print(f"Shared encoder: {total_shared:.0f} GFLOPs")
print(f"Savings: {(1 - total_shared/separate_cost)*100:.0f}%")
# Savings: 64%

DEBUG: AD-Specific Failures

Debugging tip: If you see "P99 latency 3x higher than average": at 70 mph (about 31 m/s), the car travels roughly 1 extra meter for every 32ms of added latency. Causes: GC pauses (use pre-allocated buffers), CUDA kernel scheduling variability (use TensorRT for deterministic timing), thermal throttling (monitor GPU temp in the vehicle). AD cares about worst-case execution time (WCET), not average.
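To turn a latency number into distance on the road, the conversion is a single multiplication. A small sketch (the helper name is hypothetical; the mph-to-m/s factor is exact):

```python
MPH_TO_MPS = 0.44704   # exact: 1 mile = 1609.344 m, 1 hour = 3600 s

def meters_traveled(speed_mph, latency_ms):
    """Distance covered while the pipeline is still working on a frame."""
    return speed_mph * MPH_TO_MPS * latency_ms / 1000

for latency_ms in (33, 66, 99):
    print(f"{latency_ms} ms at 70 mph -> {meters_traveled(70, latency_ms):.2f} m")
```

A frame that takes 3x a 33ms budget costs about two extra meters of travel before the car can react.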

Debugging tip: If you see "model works in simulation but not on embedded GPU": check for a TensorRT or CUDA version mismatch between dev and target. The embedded GPU (Orin) is Ampere-based like the A100 but has a different compute capability (SM 8.7 vs SM 8.0), and a TensorRT engine is tuned for the GPU it was built on. Always build TensorRT engines ON the target hardware. Never cross-compile TRT plans.

Debugging tip: If you see "sim-to-real performance gap" — the embedded GPU (Orin at 275 TOPS) has ~1/4 the throughput of an A100. Your model must be 4x more efficient. Additionally, the Orin shares memory between CPU and GPU (unified memory), so CPU workloads compete for bandwidth. Profile on the actual hardware, not just in simulation.

The Hardware Reality: Embedded GPU Constraints

Platform	GPU Cores	Memory	INT8 TOPS	Power
NVIDIA Orin	2048 CUDA + 64 Tensor	32 GB shared	275	50W
NVIDIA Thor	Next-gen	TBD	2000	TBD
Qualcomm SA8650P	Custom DSP	16 GB	~100	30W

FRONTIER: End-to-End AD and World Models

UniAD (CVPR 2023): single model from raw sensors to planned trajectory. Joint optimization, shared features, no error propagation between modules. VAD: vectorized scene representation for end-to-end planning. Occupancy networks: predict 3D occupancy instead of bounding boxes — handles arbitrary shapes. World models (GAIA-1, DriveDreamer): learn a simulator of the environment for data augmentation and planning.

Performance engineering for end-to-end AD models is different from modular stacks. You can't independently optimize each stage because they share features. Instead, the optimization targets become: (1) shared backbone efficiency (single encoder serves all downstream tasks), (2) attention mechanism optimization (BEV attention is the bottleneck in models like BEVFormer), (3) temporal feature caching (avoid recomputing BEV features from scratch each frame), and (4) output head pruning (remove prediction heads for tasks not needed in a given driving mode).

NVIDIA's DriveOS provides a deterministic execution framework: fixed memory allocation, pre-compiled TensorRT engines, and hardware-level scheduling guarantees. This is required for ASIL-D (automotive safety integrity level D) certification. Consumer ML inference can tolerate 2x P99/P50 ratios; AD requires P99/P50 < 1.2x.
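A jitter check like the P99/P50 < 1.2x rule is easy to automate over logged frame times. The sketch below uses a nearest-rank percentile and hypothetical function names; the 33ms deadline and 1.2x ratio come from the text, and the sample data is made up.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def jitter_report(latencies_ms, deadline_ms=33.0, max_ratio=1.2):
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    ratio = p99 / p50
    return {"p50": p50, "p99": p99, "ratio": ratio,
            "ok": p99 <= deadline_ms and ratio < max_ratio}

# Made-up trace: 98 frames at 25 ms, 2 frames at 45 ms.
spiky = jitter_report([25] * 98 + [45] * 2)
steady = jitter_report([28] * 100)
print(spiky)
print(steady)
```

The spiky trace has the better average but fails both checks, which is the point of tracking tail latency instead of the mean.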

Interview tip: When asked about AD performance, always anchor to the 33ms budget: "At 30 FPS, every frame gets 33ms total. I'd allocate 6ms backbone, 4ms detection, 3ms segmentation (parallel with det using shared backbone), 3ms tracking, 5ms prediction, 4ms planning. Run back to back that's 25ms, leaving 8ms margin for jitter (22ms if segmentation fully overlaps detection). The margin matters because P99 must also be under 33ms. Average is not enough for safety-critical systems."
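The budget in that answer can be written down as data and checked. A minimal sketch, assuming the simplest possible overlap model (segmentation runs entirely in the shadow of detection); the variable names are illustrative.

```python
FRAME_MS = 33
budget_ms = {"backbone": 6, "detection": 4, "segmentation": 3,
             "tracking": 3, "prediction": 5, "planning": 4}

# Worst case: every stage runs back to back.
serial_ms = sum(budget_ms.values())

# Best case: the shorter of the two heads is hidden behind the longer one.
overlapped_ms = serial_ms - min(budget_ms["detection"], budget_ms["segmentation"])

print(f"serial: {serial_ms} ms, margin {FRAME_MS - serial_ms} ms")
print(f"heads overlapped: {overlapped_ms} ms, margin {FRAME_MS - overlapped_ms} ms")
```

Planning against the serial number is the conservative choice, since overlap on a shared GPU is rarely perfect.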

AD Perception Pipeline (Gantt Chart)

[Interactive chart: pipeline stages plotted against the 33ms deadline, with toggles for Shared Backbone, Temporal Reuse, TensorRT, and INT8.]

Interview question: Your AD perception stack runs in 28ms average but has P99 of 45ms. Why is this unacceptable and what are your three highest-impact fixes?

A. P99 of 45ms means at least 1 in 100 frames misses the 33ms deadline, and the car drives blind for those frames. Fixes: (1) TensorRT for deterministic kernel timing, (2) pre-allocate all memory to eliminate GC/malloc spikes, (3) pin to dedicated GPU cores to avoid scheduling jitter.

B. 28ms average is fine since it's under 33ms. P99 doesn't matter.

C. Just reduce the frame rate to 20 FPS to give more time.

Chapter 11: Interview Arsenal

This chapter is your reference sheet. Every table, every drill, every scenario is something that has appeared in real staff-level ML performance interviews. Print this. Memorize the numbers. Practice the calculations.

1. Master Cheat Sheet

Technique	What It Does	Speed	Memory	Complexity	When
Mixed Precision	BF16 forward/backward, FP32 optimizer	1.5-2x	~50% less act.	Low	Always
torch.compile	Fuses ops, eliminates kernel overhead	1.1-1.4x	Same	Low	Always
Fused Optimizer	Single-kernel AdamW	1.15x	Same	Low	Always
FlashAttention	Tiled attention in SRAM	2-4x	O(N) vs O(N²)	Low	Any attention model
Grad Checkpoint	Recompute activations in backward	0.7x (slower)	5-10x less act.	Low	When OOM
DDP	Replicate model, split data	~Linear	Same/GPU	Medium	Model fits 1 GPU
FSDP	Shard everything across GPUs	~Linear	1/N per GPU	Medium	Model doesn't fit
TensorRT	Layer fusion + auto-tune + calibration	2-5x	Similar	Medium	NVIDIA inference
INT8 Quantization	8-bit weights + activations	2-4x	4x smaller	Medium	Inference
INT4 (GPTQ/AWQ)	4-bit weights, FP16 activations	3-6x	8x smaller	Medium	LLM on consumer GPU
Structured Pruning	Remove channels/heads	1.5-3x	2-4x smaller	High	Dense model too big
Distillation	Train small student from teacher	Model-dep.	3-10x smaller	High	Need smaller model
vLLM	PagedAttention + continuous batching	10-24x tput	2-4x efficient	Low	LLM serving

2. System Design Questions

Q: "Design a training platform for a 100B model."

Answer framework: (1) Model size: 100B params × 2 bytes = 200GB in BF16. With Adam: 200 (weights) + 200 (gradients) + 400 + 400 + 400 (FP32 m, v, and master weights) = 1.6TB total state. (2) Strategy: 3D parallelism, with TP=8 within node (NVLink), PP=4 across node groups, FSDP across remaining GPUs. (3) Hardware: 256 H100s (32 nodes of 8 GPUs). (4) Data: WebDataset shards on fast NFS, 8 workers/GPU. (5) Training: BF16 AMP, FlashAttention, gradient checkpointing (every 2 layers), fused AdamW. (6) Monitoring: continuous profiling, loss curves, GPU utilization dashboard. (7) Fault tolerance: checkpoint every 30 min, auto-restart from last checkpoint.
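The state-size arithmetic and the parallelism layout can be sanity-checked in a few lines. This is a rough sketch: it ignores activations and assumes the persistent state ends up evenly partitioned across all GPUs, which real 3D-parallel setups only approximate. The function name is hypothetical.

```python
def training_state_gb(params_billion):
    """Persistent training state in GB (1e9 params x bytes per param = GB)."""
    weights = params_billion * 2      # BF16 weights
    grads = params_billion * 2        # BF16 gradients
    adam = params_billion * 4 * 3     # FP32 m, v, and master weights
    return weights + grads + adam

total_gb = training_state_gb(100)
tp, pp, n_gpus = 8, 4, 256
dp = n_gpus // (tp * pp)              # data-parallel (FSDP) degree left over
per_gpu_gb = total_gb / n_gpus        # assumes perfectly even sharding

print(f"state: {total_gb} GB, data-parallel degree: {dp}, "
      f"per-GPU state: {per_gpu_gb} GB")
```

The per-GPU figure is small next to 80GB, which is why activations and communication buffers, not optimizer state, usually decide the micro-batch size at this scale.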

Q: "Design a real-time inference stack for autonomous driving."

Answer framework: (1) Hardware: NVIDIA Orin (275 TOPS INT8). (2) Models: shared backbone (EfficientNet-B4, TensorRT INT8), multi-task heads (det + seg + depth). (3) Pipeline: camera preprocess (2ms) → backbone (6ms) → heads (4ms parallel) → BEV fusion (3ms) → tracking (2ms) → prediction (5ms) → planning (4ms) = 26ms. (4) Safety: TensorRT for deterministic timing, pre-allocated buffers, watchdog timer, graceful degradation (skip prediction if behind). (5) Monitoring: frame timing histogram, WCET alerts, thermal monitoring.

Q: "Your training costs $50K/run. Reduce to $15K without accuracy loss."

Answer framework: (1) Profile first: find where time goes. (2) Quick wins, each listed as (speedup, effort): AMP (1.7x, hours), torch.compile (1.2x, minutes), fused optimizer (1.15x, minutes), fix data pipeline (1.3x, days). Combined: 1.7 × 1.2 × 1.15 × 1.3 ≈ 3.05x. Cost drops from $50K to about $16.4K, still just above target. (3) To close the remaining gap: reduce sequence length for early training, use curriculum learning. (4) Validate: run full eval suite, compare metrics to baseline within confidence interval.
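Speedups compound multiplicatively, so it helps to compute the product rather than eyeball it. A minimal sketch using the speedup figures from the answer; it assumes the techniques are independent, which is optimistic because fixing one bottleneck changes how much the others help.

```python
import math

speedups = {"amp": 1.7, "torch_compile": 1.2,
            "fused_optimizer": 1.15, "data_pipeline": 1.3}

combined = math.prod(speedups.values())
cost_after = 50_000 / combined
needed = 50_000 / 15_000              # speedup required to hit the $15K target

print(f"combined: {combined:.2f}x, cost: ${cost_after:,.0f}, "
      f"needed: {needed:.2f}x")
```

The stack gets close but leaves a gap to the target, so one more source of savings is still needed.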

Q: "An LLM chatbot has P99 latency of 5 seconds. Target is 2 seconds."

Answer framework: (1) Profile: is the bottleneck prefill (long prompt) or decode (long output)? (2) If prefill: prefix caching for shared system prompts, FlashAttention, reduce prompt length. (3) If decode: speculative decoding (2-3x), INT4 quantization (roughly 4x less weight traffic than BF16, which matters because decode is memory-bandwidth-bound), continuous batching to avoid queuing delay. (4) If queuing: add GPU workers, scale on queue depth. (5) Measure: time-to-first-token (TTFT) and inter-token latency (ITL) separately.
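Separating TTFT from ITL makes it clear which fix moves the end-to-end number. Here is a simple latency model as a sketch; the function name and all timing values are hypothetical, picked so the baseline lands near the 5-second figure in the question.

```python
def request_latency_s(ttft_s, itl_s, output_tokens, queue_s=0.0):
    """Queue wait + time to first token + remaining tokens at ITL each."""
    return queue_s + ttft_s + itl_s * max(0, output_tokens - 1)

baseline = request_latency_s(ttft_s=0.8, itl_s=0.02, output_tokens=200)
faster_decode = request_latency_s(ttft_s=0.8, itl_s=0.01, output_tokens=200)
both = request_latency_s(ttft_s=0.2, itl_s=0.01, output_tokens=200)

print(f"baseline: {baseline:.2f}s, 2x faster decode: {faster_decode:.2f}s, "
      f"plus cached prefix: {both:.2f}s")
```

With 200 output tokens, decode dominates, so halving ITL helps far more than any prefill optimization would on its own.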

3. Coding Drills

Drill 1: "Calculate memory requirements for training a transformer."

python
# Given: 13B params, BF16, Adam, batch=4, seq=4096, hidden=5120, 40 layers
GB = 1e9
params = 13e9
weights = params * 2 / GB            # 26 GB (BF16)
grads = params * 2 / GB              # 26 GB (BF16)
adam_states = params * 4 * 3 / GB    # 156 GB (FP32 m + v + master)
# Activations: batch * seq * hidden * layers * 12 (rough per-layer multiplier) * 2 bytes
activations = 4 * 4096 * 5120 * 40 * 12 * 2 / GB    # ~80 GB
total = weights + grads + adam_states + activations  # ~288 GB → need FSDP across 4+ A100s

Drill 2: "Implement per-channel quantization." (See Chapter 7 CODE section — be able to write this from memory on a whiteboard. Key: compute per-channel abs_max, derive scale, round, clamp.)
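For whiteboard practice, the four steps named above (per-channel abs_max, scale, round, clamp) fit in a dozen lines. This is a dependency-free sketch of symmetric per-output-channel quantization on nested lists, not the Chapter 7 implementation; the function names are hypothetical, and a real version would use tensors.

```python
def quantize_per_channel(weight, n_bits=8):
    """Symmetric quantization with one scale per row (output channel)."""
    qmax = 2 ** (n_bits - 1) - 1                     # 127 for INT8
    q_rows, scales = [], []
    for row in weight:
        abs_max = max(abs(w) for w in row)           # per-channel abs_max
        scale = abs_max / qmax if abs_max > 0 else 1.0
        q = [max(-qmax, min(qmax, round(w / scale))) for w in row]  # round, clamp
        q_rows.append(q)
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

# Two channels with very different ranges: a single shared scale would
# crush the small channel, per-channel scales keep both accurate.
W = [[0.5, -1.0, 0.25], [10.0, -20.0, 5.0]]
q, scales = quantize_per_channel(W)
W_hat = dequantize(q, scales)
max_err = max(abs(a - b) for ra, rb in zip(W, W_hat) for a, b in zip(ra, rb))
print(q, scales, max_err)
```

The reconstruction error per element is bounded by half of that channel's scale, which is why outlier channels hurt so much under per-tensor quantization.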

Drill 3: "Write a gradient checkpointing wrapper."

python
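import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedTransformer(nn.Module):
    """Checkpoints every `checkpoint_every`-th layer; other layers keep their activations."""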
    def __init__(self, layers, checkpoint_every=2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.checkpoint_every = checkpoint_every
    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i % self.checkpoint_every == 0:
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x

Drill 4: "Calculate communication volume for ring all-reduce."

python
# Ring all-reduce: 2*(N-1)/N * data_size
# N GPUs, 7B params in BF16 = 14 GB gradient data
N = 8
data_gb = 14
comm_gb = 2 * (N-1) / N * data_gb   # 24.5 GB
# At 600 GB/s NVLink (A100 peak): 24.5/600 ≈ 0.04 seconds
# At 100 Gb/s InfiniBand: 24.5*8/100 = 1.96 seconds
# This is why NVLink matters!

4. Debugging Scenarios

Scenario	Root Cause	Diagnosis	Fix
GPU util 30%	Data pipeline starving GPU	Nsight: large gaps between kernels. CPU at 100%.	num_workers=8, WebDataset, pin_memory
INT8 model -8% acc	Per-tensor quant too coarse; outlier channels	Per-channel error analysis, weight histogram	Per-channel quant, SmoothQuant, or GPTQ
8→64 GPU no scale	Communication dominates	Profile all-reduce time vs compute time	Overlap comm, gradient compression, local SGD
P99 5x P50 inference	GC pauses, CUDA malloc, scheduling	Nsight timeline: irregular large gaps	Pre-alloc buffers, CUDA graphs, TensorRT
OOM at batch=4	Activation memory > weight memory	torch.cuda.max_memory_allocated() shows 70GB peak	Gradient checkpointing, reduce seq len, FSDP
Loss spikes every 1K steps	Data loader restarting (epoch boundary)	Correlate loss spikes with epoch count	persistent_workers=True, proper shuffling
NaN in training	FP16 overflow, bad LR, data corruption	torch.autograd.set_detect_anomaly(True)	BF16, lower LR, validate data pipeline
Serving memory leak	KV cache not freed after request	Monitor torch.cuda.memory_allocated() over time	Ensure request cleanup, use vLLM scheduler

Debugging workflow for any of the above: (1) Reproduce the issue reliably. (2) Profile to get numbers, not guesses. (3) Form a hypothesis. (4) Make ONE change. (5) Re-measure. (6) If fixed, document. If not, revert and try next hypothesis. Never change multiple things at once.

5. Optimization Priority Tables

Training Priority:

1. Fix data loading (1.5-3x)

2. Mixed precision (1.5-2x)

3. torch.compile (1.1-1.4x)

4. Fused optimizer (1.15x)

5. Grad checkpoint if OOM

6. Distributed (DDP/FSDP)

Inference Priority:

1. FlashAttention (2-4x)

2. KV cache (essential)

3. TensorRT (2-5x)

4. INT8/INT4 quantization (2-6x)

5. Continuous batching (2-3x tput)

6. Speculative decoding (2-3x)

6. Recommended Reading

Type	Resource	Why
THE Book	Hennessy & Patterson, "Computer Architecture: A Quantitative Approach"	Roofline, memory hierarchy, everything foundational
Paper	Dao et al., "FlashAttention" (NeurIPS 2022)	The gold standard for hardware-aware algorithm design
Paper	Micikevicius et al., "Mixed Precision Training" (ICLR 2018)	Introduced loss scaling, still the reference
Paper	Kwon et al., "vLLM / PagedAttention" (SOSP 2023)	Changed how everyone serves LLMs
Paper	Frantar et al., "GPTQ" (ICLR 2023)	State-of-the-art post-training quantization
Paper	Shoeybi et al., "Megatron-LM" (arXiv 2019)	Tensor (intra-layer) model parallelism at scale; pipeline parallelism came in follow-up work
Repo	github.com/vllm-project/vllm	Production LLM serving
Repo	github.com/Dao-AILab/flash-attention	FlashAttention implementation
Repo	github.com/microsoft/DeepSpeed	Distributed training at scale
Repo	github.com/NVIDIA/TensorRT	Inference optimization
Repo	github.com/pytorch/pytorch (torch.compile)	The compiler that's replacing hand-tuned kernels

7. Numbers You Must Know Cold

Metric	Value	Why It Matters
A100 BF16 TFLOPS	312	Baseline for all performance estimates
A100 HBM bandwidth	2 TB/s	Memory-bound operations hit this ceiling
A100 roofline crossover	156 ops/byte	Below = memory-bound, above = compute-bound
H100 BF16 TFLOPS	990 (3.2x A100)	New baseline, wider memory-bound regime
NVLink bandwidth	600 GB/s (A100), 900 GB/s (H100)	Intra-node communication speed
InfiniBand HDR	200 Gb/s = 25 GB/s	Inter-node communication speed
BF16 per param	2 bytes	7B model = 14 GB weights
Adam optimizer state	12 bytes/param (m + v + master)	7B model = 84 GB optimizer
Full training memory	~16 bytes/param + activations	7B model ≥ 112 GB before activations
CUDA kernel launch	5-10 μs	50 ops × 10μs = 0.5ms/layer overhead
FlashAttention memory	O(N) vs O(N²)	seq=8192: 128MB/head → ~1MB/head
KV cache per token	~16KB per layer, ~0.5MB across 32 layers (7B model, BF16, full multi-head attention)	8192 tokens ≈ 4.3 GB per sequence (≈ 134 MB per layer)
AD latency budget	33ms (30 FPS)	Miss deadline = car drives blind
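The roofline crossover in the table is just peak compute divided by memory bandwidth, and the same two numbers classify any kernel. A minimal sketch using the A100 figures from the table; the function name and the two example kernels (their FLOP and byte counts are idealized, ignoring caches) are illustrative assumptions.

```python
A100_PEAK_FLOPS = 312e12        # BF16, from the table
A100_HBM_BYTES_PER_S = 2e12     # 2 TB/s, from the table

def roofline(flops, bytes_moved, peak=A100_PEAK_FLOPS, bw=A100_HBM_BYTES_PER_S):
    """Classify a kernel by arithmetic intensity (FLOPs per byte moved)."""
    intensity = flops / bytes_moved
    crossover = peak / bw
    attainable = min(peak, intensity * bw)
    return {"intensity": intensity, "crossover": crossover,
            "bound": "compute" if intensity >= crossover else "memory",
            "attainable_tflops": attainable / 1e12}

n = 4096
# BF16 elementwise add: 1 FLOP per element, 2 reads + 1 write of 2 bytes each.
add = roofline(flops=n * n, bytes_moved=3 * n * n * 2)
# BF16 square matmul: 2*n^3 FLOPs, three n x n matrices touched once each.
matmul = roofline(flops=2 * n ** 3, bytes_moved=3 * n * n * 2)
print(add)
print(matmul)
```

The elementwise add can never exceed a fraction of a TFLOP on this hardware no matter how well it is written, which is the argument for fusing such ops into neighboring kernels.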

8. Quick Estimation Templates

Template 1: "How long to train X?"
FLOPs per step = 6 × params × tokens_per_step (forward ≈ 2 FLOPs per parameter per token; backward ≈ 2x forward).
Steps = total_tokens / tokens_per_step.
Time = (FLOPs per step × steps) / (GPU_TFLOPS × 1e12 × utilization).
Example: 7B model, 1T tokens, batch=4M tokens/step, 64 A100s at 40% MFU.
FLOPs/step = 6 × 7e9 × 4e6 = 1.68e17. Steps = 1e12/4e6 = 250K.
Total FLOPs = 4.2e22. Time = 4.2e22 / (64 × 312e12 × 0.4) = 5.3e6 sec = 61 days.
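Template 1 translates directly into a function. A sketch with a hypothetical name; it uses the same 6 × params × tokens approximation, so it inherits that rule's blind spots (attention FLOPs at long sequence lengths, restarts, evaluation time).

```python
def training_days(params, total_tokens, n_gpus, peak_flops, mfu):
    """Wall-clock days from the 6 * params * tokens FLOPs approximation."""
    total_flops = 6 * params * total_tokens
    seconds = total_flops / (n_gpus * peak_flops * mfu)
    return seconds / 86_400

# The example above: 7B model, 1T tokens, 64 A100s at 40% MFU.
days = training_days(7e9, 1e12, n_gpus=64, peak_flops=312e12, mfu=0.40)
print(f"{days:.1f} days")
```

Note that batch size drops out of the estimate: it changes the number of steps, not the total FLOPs.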

Template 2: "How much memory for model X?"
Training: params × (2 + 2 + 12) = 16 bytes/param + activations.
Inference: params × dtype_bytes + KV_cache.
Example: 70B BF16 inference, seq=4096.
Weights = 70e9 × 2 = 140 GB. KV = 2 (K and V) × 80 layers × 64 heads × 128 head_dim × 4096 tokens × 2 bytes = 10.7 GB per sequence.
Total = 151 GB → need 2 A100 80GB GPUs minimum.
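Template 2's inference case, written as a sketch with hypothetical helper names. The 80-layer, 64-head, 128-dim shape is the one used in the example; the 8-KV-head variant is an added assumption to show how much grouped-query attention shrinks the cache.

```python
import math

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """KV cache size in GB; the leading 2 is one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes / 1e9

def inference_gpus(params, dtype_bytes, kv_gb, gpu_mem_gb=80):
    """Memory-only lower bound on GPU count (no activations, no headroom)."""
    weights_gb = params * dtype_bytes / 1e9
    return math.ceil((weights_gb + kv_gb) / gpu_mem_gb)

kv_mha = kv_cache_gb(80, 64, 128, 4096)   # full multi-head attention
kv_gqa = kv_cache_gb(80, 8, 128, 4096)    # assumed 8 KV heads (grouped-query)
print(f"KV (64 heads): {kv_mha:.1f} GB, KV (8 KV heads): {kv_gqa:.1f} GB, "
      f"GPUs: {inference_gpus(70e9, 2, kv_mha)}")
```

The result is per sequence, so multiply the KV term by the number of concurrent sequences before sizing a real deployment.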

Related Lessons

Transformers — The architecture all these techniques optimize.

Distributed Training — Deep dive into DDP, FSDP, ring all-reduce.

9. The Three Questions Every Interviewer Asks

Question 1: "Walk me through how you'd optimize a training pipeline."
Answer: "Profile first. I never guess. I instrument with PyTorch Profiler or Nsight Systems to get a time breakdown: data loading, forward, backward, optimizer, communication. Then I attack the biggest bucket. Typical order: fix data pipeline (num_workers, pin_memory), enable AMP, add torch.compile, use fused optimizer. I re-profile after each change because the bottleneck shifts. I keep a spreadsheet: technique, measured speedup, cumulative speedup, cost."

Question 2: "This model is too slow for production. What do you do?"
Answer: "First, is this training or inference? For inference: profile to find the dominant operation. Apply FlashAttention if attention-heavy, TensorRT for layer fusion, INT8 quantization if memory-bound. Measure latency at the target batch size, not batch=1. For autoregressive models, distinguish prefill latency from per-token decode latency — they have different bottlenecks."

Question 3: "How would you scale training from 8 GPUs to 64?"
Answer: "DDP if the model fits on one GPU (likely for 1-7B). FSDP if it doesn't. The key concern is communication overhead: ring all-reduce volume per GPU is 2*(N-1)/N * gradient_size. For a 7B model in BF16 (14 GB of gradients) at 64 GPUs, that's ~28 GB per step, or ~2 seconds on 100 Gb/s InfiniBand if nothing overlaps. Solutions: overlap communication with backward compute, use gradient compression (PowerSGD), or shard with FSDP, which swaps the all-reduce for per-layer all-gather and reduce-scatter calls that overlap well with compute (at roughly 1.5x the total volume). The effective batch size also changes: 64 GPUs * batch_per_GPU = very large batch. Apply learning rate scaling and warmup."
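Two pieces of that answer are pure arithmetic: the all-reduce cost at the new scale and the learning-rate adjustment. The sketch below assumes a 7B model with BF16 gradients on 100 Gb/s links, and uses the linear scaling rule with linear warmup, which is a common heuristic and not a guarantee of matching accuracy. Function names, the base learning rate, and the warmup length are hypothetical.

```python
def ring_allreduce_gb(n_gpus, grad_gb):
    """Data each GPU sends (and receives) in one ring all-reduce."""
    return 2 * (n_gpus - 1) / n_gpus * grad_gb

def scaled_lr(base_lr, base_gpus, new_gpus, step, warmup_steps):
    """Linear scaling rule with linear warmup to the scaled target."""
    target = base_lr * new_gpus / base_gpus
    return target * min(1.0, (step + 1) / warmup_steps)

comm_gb = ring_allreduce_gb(64, grad_gb=14)        # 7B params in BF16
comm_s = comm_gb / (100 / 8)                       # 100 Gb/s = 12.5 GB/s
lr_start = scaled_lr(1e-4, 8, 64, step=0, warmup_steps=1000)
lr_end = scaled_lr(1e-4, 8, 64, step=999, warmup_steps=1000)

print(f"all-reduce: {comm_gb:.1f} GB, {comm_s:.1f} s unoverlapped")
print(f"lr ramps from {lr_start:.1e} to {lr_end:.1e}")
```

If the unoverlapped communication time is comparable to the backward pass, scaling efficiency will be poor until the overlap is fixed.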

"Hardware is easy to understand. Making software match what the hardware can deliver — that's the hard part." — Bill Dally, NVIDIA Chief Scientist

Interview Readiness Tracker

Time yourself on the four coding drills. Staff interviews expect these in under 2 minutes.

Interview question: You have a model that runs at 50ms inference on an A100. Target is 15ms. What is your approach?

A. Rewrite in C++ for maximum control.

B. Profile to find the bottleneck, then apply TensorRT (fusion + INT8 calibration). A likely 2-5x speedup gets you to 10-25ms. Verify with re-profiling.

C. Add more GPUs to distribute the work.
