Staff-level interview prep: quantization, CUDA kernels, TensorRT, distributed training, edge deployment, AV perception, and frontier research.
It is 7:30 AM. You badge into the perception bullpen at an autonomous vehicle company. On one monitor, a TensorRT compilation log from the overnight CI pipeline glows red: a custom fused multi-head attention kernel is producing NaN outputs after INT8 quantization on the latest backbone checkpoint. On your second monitor, a Slack thread from the planning team is heating up. Your BEV (Bird's Eye View) model update shipped yesterday and it added 12 ms to the on-vehicle inference loop, pushing total perception from 93 ms to 105 ms. The safety team's rule is absolute: sensor-to-actuation must stay under 200 ms, and perception's share is 100 ms. You just blew the budget.
On your third monitor, your pull request from yesterday has four review comments. The PR implements PagedAttention for the vehicle's onboard VLM (Vision-Language Model). A colleague wants to know how you guarantee memory safety when the planner and the perception module both issue concurrent queries. Another reviewer is asking whether your page table walk adds measurable latency when sequence length exceeds 2048 tokens.
Before lunch, you will debug the NaN (probably an outlier activation channel that the INT8 calibrator did not clip), shave those 12 ms (by fusing two elementwise ops into the backbone's attention kernel and switching the BEV grid scatter from FP32 to BF16), and rewrite a chunk of C++ inference code to use CUDA Graphs instead of individual kernel launches so that the CPU submission overhead stops showing up in the Nsight timeline.
This is the daily reality of an ML Inference and Performance Optimization Engineer in autonomous driving. You sit at the intersection of three disciplines that rarely overlap in a single person's head:
| Discipline | What you need | How it shows up daily |
|---|---|---|
| ML Research | Understand architectures, loss functions, training dynamics | You read the BEVFormer paper to know which layers are safe to quantize |
| Systems Engineering | GPU memory hierarchy, CUDA, compiler internals | You write a fused kernel and profile it in Nsight Compute |
| Safety-Critical Deployment | Determinism, thermal limits, failure analysis | You prove that INT8 parity holds across 50K edge-case frames |
The diagram below traces a model from training cluster to road. Every box is a system you own or co-own. Think of it as the "pipeline map" you would draw on a whiteboard in a system-design interview.
A senior engineer can quantize a model. They know the APIs, can run a PTQ calibration, and can diagnose a NaN. Give them a model and a target latency, and they will hit it.
A staff engineer designs the system that quantizes every model the team ships. They choose which layers get INT8 vs FP16 based on an automated sensitivity sweep that runs in CI. They build the calibration pipeline so that every model update automatically generates a new calibration cache from a representative frame set drawn from the hardest 5% of the validation set. They write the parity test framework that catches accuracy regressions before the engine ever touches the vehicle. And when the next-generation SOC arrives with FP8 tensor cores, they redesign the entire pipeline rather than patching it piecemeal.
The distinction is scope. Senior owns a component. Staff owns the system and its evolution over time.
Every strong interview loop for this role tests five orthogonal skills. Each chapter in this lesson hits all five, but the table below shows the kind of question each dimension produces.
| Dimension | What they test | Example question | What a staff answer adds |
|---|---|---|---|
| Concept | First-principles math | "Derive the quantization error bound for symmetric INT8" | Connects the bound to practical calibration strategy |
| Design | System architecture | "Design an inference pipeline for a 3B-param VLM on a 30W SOC" | Discusses fallback behavior, thermal throttling, graceful degradation |
| Code | Implementation skill | "Write a CUDA kernel for fused LayerNorm + bias add" | Adds launch config reasoning, occupancy analysis, bank-conflict avoidance |
| Debug | Failure diagnosis | "Our INT8 model diverges after 500 frames. What do you check?" | Walks through a systematic bisection, layer-by-layer parity, calibration audit |
| Frontier | Research awareness | "What changed in model compression since 2024?" | Discusses FP4, Microscaling, speculative decoding for VLMs on edge |
| Time | Task | Skill used |
|---|---|---|
| 7:30 | Triage overnight CI failures (NaN in INT8 engine) | Debug |
| 8:00 | Layer-by-layer parity check to isolate diverging layer | Code + Debug |
| 9:00 | Fix: add per-channel quantization to the outlier layer, re-calibrate | Concept + Code |
| 10:00 | Standup: present latency regression analysis to perception lead | Design |
| 10:30 | Profile the 12 ms regression with Nsight Systems, find two unfused ops | Code + Debug |
| 11:30 | Write a fused CUDA kernel for the two ops, benchmark, submit PR | Code |
| 13:00 | Review a colleague's Triton kernel for deformable attention | Code + Frontier |
| 14:00 | Design doc: migrate calibration pipeline from manual to CI-triggered | Design |
| 15:30 | Experiment: benchmark FP8 E4M3 on the new backbone, compare to INT8 | Concept + Frontier |
| 17:00 | Update parity test suite with the new edge-case frames from field testing | Design + Debug |
Your 3-billion parameter perception backbone runs beautifully in FP32 on a beefy A100 during training. Each weight is a 32-bit floating-point number, so the model alone occupies 3B × 4 bytes = 12 GB. Add activations, optimizer states, and a KV cache, and you are well past 40 GB. Now ship this to a vehicle SOC with 32 GB of shared memory (CPU and GPU share the same pool), a 60-watt power budget, and a 100 ms latency ceiling. FP32 will not fit. FP16 is tight. You need INT8 — or even INT4.
Welcome to quantization: the art of representing a number that needs 32 bits of precision using 8, 4, or even fewer bits, while keeping the model accurate enough to not kill anyone.
Before we can shrink numbers, we need to understand what a number is inside a computer. Every floating-point format stores three fields packed into a fixed number of bits:
The sign bit (1 bit) says positive or negative. The exponent bits control the range — how large or small the number can be. The mantissa (also called significand or fraction) bits control the precision — how many distinct values exist between any two powers of 2. There is an implicit leading 1 bit (the "hidden bit") that gives you one free bit of precision.
| Format | Total bits | Sign | Exponent | Mantissa | Bias | Range | Precision |
|---|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | 127 | ±3.4×1038 | ~7.2 decimal digits |
| BF16 | 16 | 1 | 8 | 7 | 127 | ±3.4×1038 | ~2.4 decimal digits |
| FP16 | 16 | 1 | 5 | 10 | 15 | ±65504 | ~3.3 decimal digits |
| FP8 E4M3 | 8 | 1 | 4 | 3 | 7 | ±448 | ~1.7 decimal digits |
| FP8 E5M2 | 8 | 1 | 5 | 2 | 15 | ±57344 | ~1.2 decimal digits |
| INT8 | 8 | Uniform: -128 to 127 | -128..127 | 256 evenly spaced values | |||
| INT4 | 4 | Uniform: -8 to 7 | -8..7 | 16 values total | |||
Floating-point formats have non-uniform spacing (more precision near zero, less far away). Integer quantization is different: the values are uniformly spaced. To map a continuous float tensor to a discrete integer grid, you need a scale factor s and optionally a zero-point z.
There are two families. Let us derive each one.
Symmetric quantization maps the float range [-α, +α] to the integer range [-127, +127] (for signed INT8). The scale is:
The key property: float zero maps exactly to integer zero (q=0). This matters because zero-padding in convolutions must remain zero after quantization.
If the float distribution is heavily skewed (e.g., ReLU activations that are always non-negative, ranging from 0 to 6), symmetric quantization wastes half the integer range on values that never appear. Asymmetric quantization shifts the mapping so the integer range covers only the actual data range:
The zero-point z is an integer offset that ensures floating-point zero is exactly representable. For ReLU activations, z=0 happens naturally. For distributions centered around a negative value, z will be positive.
Per-tensor quantization uses one scale s (and one zero-point z) for the entire weight matrix. Per-channel quantization computes a separate sc for each output channel of a convolution (or each row of a linear layer). Per-channel is almost always better, and here is a concrete example of why.
In this small example, both methods work well because the channels have similar ranges. The disaster happens when one channel is 50x larger than another — which is common in transformer attention projections.
For weights, computing α = max(|w|) is straightforward — the weights are fixed. For activations, the distribution changes with every input. You need to run a calibration set (typically 500-1000 representative inputs) through the model and collect statistics. But which statistic you use to set the scale dramatically affects accuracy.
| Method | How it sets α | Pros | Cons |
|---|---|---|---|
| Min/Max | α = max(|x|) across all calibration samples | Simple, fast, no outlier clipping | A single outlier dominates, wastes range |
| Percentile | α = 99.99th percentile of |x| | Robust to rare outliers | Clips extreme values, introduces clipping error |
| Entropy (KL) | Minimizes KL divergence between original and quantized histograms | Theoretically optimal for distribution shape | Slow (searches over candidate thresholds) |
| MSE | Minimizes mean squared error between original and dequantized values | Good for reconstruction quality | Sensitive to outliers in the squared sense |
The entropy (KL-divergence) method, used by TensorRT's default calibrator, works as follows:
The intuition: KL divergence measures how much information you lose by approximating distribution P with distribution Q. Minimizing it finds the clipping threshold that preserves the most information about the activation distribution in 8 bits.
When PTQ fails (accuracy drop > 0.5%), you turn to Quantization-Aware Training (QAT). The idea: insert fake quantization nodes into the training graph that simulate quantization rounding during the forward pass, so the model learns to be robust to the noise.
But there is a mathematical problem. The rounding function round(x) has zero gradient almost everywhere (it is a staircase function). You cannot backpropagate through it. The Straight-Through Estimator (STE) solves this by pretending the gradient of the rounding operation is 1:
The STE is a biased estimator, but it works remarkably well in practice. The intuition: gradients still point in the right direction even if their magnitude is approximate, and SGD is robust to noisy gradients.
Classic PTQ and QAT work for CNNs and small transformers. For large language models (1B+ parameters), three techniques from 2023-2024 dominate:
python import torch import torch.nn as nn import numpy as np # ── Step 1: Load the FP32 trained model ── model = load_perception_backbone("checkpoint_ep50.pt") model.eval().cuda() # ── Step 2: Sensitivity analysis ── # Quantize each layer to INT8 independently, measure mAP drop. # WHY: Some layers (LayerNorm, final classification head) are # extremely sensitive. Others (early convolutions) are robust. sensitive_layers = [] for name, module in model.named_modules(): if isinstance(module, (nn.Linear, nn.Conv2d)): # Temporarily quantize just this layer orig_weight = module.weight.data.clone() s = module.weight.data.abs().max() / 127 module.weight.data = (module.weight.data / s).round().clamp(-127, 127) * s mAP_drop = evaluate_mAP(model, val_loader) - baseline_mAP module.weight.data = orig_weight # restore if abs(mAP_drop) > 0.5: # threshold: 0.5% mAP sensitive_layers.append(name) print(f"SENSITIVE: {name} drops mAP by {mAP_drop:.2f}%") # ── Step 3: Calibration with histogram collection ── # WHY histogram, not just min/max: min/max is dominated by outliers. # Histogram + KL divergence finds the optimal clipping threshold. from torch.ao.quantization import get_default_qconfig_mapping from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx qconfig = get_default_qconfig_mapping("x86") # "qnnpack" for ARM/Orin # Override sensitive layers to stay in FP16 for layer_name in sensitive_layers: qconfig = qconfig.set_module_name(layer_name, None) # None = no quantization prepared = prepare_fx(model, qconfig, example_inputs=(torch.randn(1,3,640,640).cuda(),)) # ── Step 4: Run calibration (the actual data collection) ── # WHY 500 frames? Empirically, activation statistics converge # around 200-500 frames for perception models. More helps but # has diminishing returns. Must include edge cases. with torch.no_grad(): for i, batch in enumerate(calibration_loader): if i >= 500: break prepared(batch.cuda()) # observers collect activation histograms # ── Step 5: Convert to quantized model ── quantized_model = convert_fx(prepared) # ── Step 6: Parity check ── # WHY cosine similarity? Element-wise absolute error can be misleading # when values are very small. Cosine similarity captures directional # agreement independent of magnitude. max_abs_errors, cos_sims = [], [] with torch.no_grad(): for batch in test_loader: ref = model(batch.cuda()) # FP32 reference opt = quantized_model(batch.cuda()) # INT8 output max_abs_errors.append((ref - opt).abs().max().item()) cos_sims.append(torch.nn.functional.cosine_similarity( ref.flatten(), opt.flatten(), dim=0).item()) print(f"Max abs error: {max(max_abs_errors):.4f}") print(f"Min cosine sim: {min(cos_sims):.6f}") assert max(max_abs_errors) < 0.1, "INT8 parity FAILED" assert min(cos_sims) > 0.999, "INT8 parity FAILED"
Failure 1: Outlier activation channels.
Symptom: One layer's output has max absolute error 10x worse than all others. Cosine similarity for that layer drops below 0.95.
Root cause: A single channel has activations of magnitude 500 while all others are in [-5, 5]. The INT8 scale accommodates the outlier, collapsing 99.6% of the dynamic range into a handful of integer levels.
Diagnostic: Plot per-channel activation histograms. Look for channels where max(|x|) is >10x the median channel max.
Fix: (a) SmoothQuant to migrate the difficulty to weights. (b) Per-channel quantization for activations (more expensive but eliminates the problem). (c) Percentile clipping at 99.99% — accept small clipping error for outliers.
Failure 2: LayerNorm amplification.
Symptom: Accuracy is fine on most inputs but catastrophically wrong on inputs with near-constant feature vectors (low variance).
Root cause: LayerNorm divides by standard deviation. When σ is small (say 0.001 in FP32), the quantized σ might round to 0 or to a different small value (say 0.003), changing the output by 3x.
Diagnostic: Compute the ratio max(|FP32_output|) / max(|INT8_output|) per layer. LayerNorm layers with ratio > 2 are suspect.
Fix: Always keep LayerNorm in FP16 (both the normalization and the affine transform). Quantize only the matmuls before and after.
Failure 3: Calibration distribution mismatch.
Symptom: Average mAP on the full test set drops 0.3% (acceptable), but mAP on rainy night scenes drops 4.2% (unacceptable).
Root cause: Calibration data was 80% daytime city driving. Night/rain activation distributions differ — different brightness channels fire, different feature magnitudes appear. The scale was tuned for daytime and clips or under-resolves night features.
Diagnostic: Stratify parity checks by scene type. If one stratum is consistently worse, the calibration set is biased.
Fix: Curate a calibration set balanced across all deployment conditions. Alternatively, use running-mean calibration over the last N production frames (dynamic calibration) — but this introduces non-determinism.
Failure 4: Accumulator overflow in INT8 matmul.
Symptom: Random NaN or wildly wrong outputs, inconsistent across runs.
Root cause: INT8 × INT8 multiplication produces INT16 results. Accumulating many of them (e.g., a matmul with K=4096) can overflow INT32 accumulators on some hardware.
Diagnostic: Check if the error is reproducible (overflow is deterministic for given input). Reduce K or check the accumulator bit-width of the target hardware.
Fix: Use FP16 accumulation for large matmuls (the Hopper INT8 tensor core accumulates in FP32 by default, but older hardware may not).
FP8 (E4M3 / E5M2): Hopper and Blackwell GPUs natively support FP8 tensor cores. E4M3 for forward pass (range to 448, enough for inference activations), E5M2 for gradients (range to 57344, enough for large gradient spikes). Nearly as accurate as FP16, 2x the throughput. Replacing INT8 as the default inference format for datacenter GPUs. The scaling is per-tensor with a simple "max-abs" calibration, avoiding the complexity of histogram/KL methods.
Microscaling (MXFP4, MXFP6): A 2024 consortium proposal (backed by major chip vendors) for block-level scaling with very low bit-widths. A block of 32 values shares a single 8-bit scale factor, and each value is FP4 or FP6. This amortizes the scale overhead (1 byte per 32 elements) while enabling 4-bit inference without the complexity of GPTQ. Early results show competitive with INT4 at lower implementation cost.
QuIP# and QuaRot (2024): Random orthogonal rotations applied before quantization to decorrelate weight columns, making them easier to quantize independently. QuIP# achieves near-FP16 accuracy at 2 bits per weight on large LLMs. Theoretically grounded in random matrix theory.
The TensorRT compiler fuses most standard operations automatically: Conv+BN+ReLU becomes one kernel, and matmul+bias+GELU becomes another. But your perception model has a novel temporal cross-attention mechanism — it attends over the BEV features from the last 8 frames with learned 3D position offsets. No pre-built kernel exists. The unfused version launches 14 separate kernels, reads and writes the feature map from HBM 14 times, and takes 18 ms. A hand-written fused kernel should take 4 ms. Time to write CUDA.
A GPU is not "a bunch of CPU cores." It is a fundamentally different machine. A CPU optimizes for latency (make one thread fast). A GPU optimizes for throughput (make millions of threads collectively fast by hiding individual latency behind massive parallelism).
The hardware is organized hierarchically, and understanding this hierarchy is the single most important thing for writing fast kernels:
| Level | Hardware unit | Programming abstraction | Count (A100) | What it does |
|---|---|---|---|---|
| Top | GPU (GPC) | Grid | 1 per kernel launch | The entire kernel launch — all blocks |
| Mid | Streaming Multiprocessor (SM) | Block (CTA) | 108 SMs, multiple blocks per SM | A group of threads that share fast memory and can synchronize |
| Low | Warp Scheduler | Warp | 32 threads, fixed | 32 threads that execute the same instruction at the same time (SIMT) |
| Unit | CUDA Core | Thread | 6912 total | One execution unit with its own registers |
The key insight: when a warp (32 threads) issues a memory read from HBM, it takes ~400 clock cycles to come back. But the warp scheduler does not wait. It switches to another warp and executes its instructions. When that warp also stalls, it switches again. This is latency hiding through occupancy — the more warps you have ready to run, the better you hide memory latency. This is why GPU code looks nothing like CPU code: you launch thousands of threads not because you have thousands of independent computations, but because you need enough threads in flight to hide memory stalls.
Every performance decision in CUDA comes down to which memory level your data lives in. Here are the actual numbers for an A100 SXM4:
| Memory level | Bandwidth | Capacity per SM | Total capacity | Latency (cycles) | Scope |
|---|---|---|---|---|---|
| Registers | ~78 TB/s | 256 KB | 27 MB total | 0 (same cycle) | Per thread |
| Shared Memory (SRAM) | ~19 TB/s | Up to 164 KB | ~17 MB total | ~20-30 cycles | Per block |
| L2 Cache | ~6.3 TB/s | — | 40 MB | ~200 cycles | All SMs |
| HBM (Global Memory) | 2.0 TB/s | — | 80 GB | ~400 cycles | All SMs |
The ratio tells the story: registers are ~39x faster than shared memory, shared memory is ~3x faster than L2, and L2 is ~3x faster than HBM. Moving data from HBM to registers is ~400 cycles. Moving data from shared memory to registers is ~25 cycles. This 16x difference is why the #1 optimization in CUDA is: load from HBM once into shared memory, then reuse from shared memory as many times as possible.
Before writing any kernel, you need to know whether performance is limited by compute (not enough FLOPs/s) or by memory bandwidth (not enough bytes/s). The roofline model answers this with a single number: arithmetic intensity.
This is why kernel fusion matters enormously for LayerNorm (memory-bound: reducing memory traffic directly speeds it up) and barely matters for large matmuls (compute-bound: the bottleneck is ALU throughput, not memory bandwidth).
When 32 threads in a warp read from global memory, the hardware can combine their requests into a single transaction if the addresses are contiguous. This is coalesced access — one transaction serves 32 threads. If the addresses are scattered (strided or random), each thread needs a separate transaction: 32 transactions instead of 1, a 32x slowdown.
Shared memory is divided into 32 banks, each 4 bytes wide. Bank assignment is: address % 32. When two threads in the same warp access different addresses in the same bank, the accesses serialize (a "bank conflict"). If all 32 threads hit the same bank, you get a 32-way conflict: 32x slower.
Threads within a warp can communicate without shared memory using warp shuffle instructions. These are register-to-register transfers between threads in the same warp — no shared memory needed, no synchronization needed.
This is a classic interview kernel. Standard PyTorch runs LayerNorm(x + bias) as three separate kernels: (1) elementwise add, (2) compute mean and variance, (3) normalize+scale+shift. Each kernel reads and writes the entire tensor from HBM. The fused version: one kernel, one HBM read, one HBM write.
cuda // ── Warp-level reduction helper ── // WHY warp shuffle instead of shared memory? Shuffle is register-to-register // (zero latency), avoids bank conflicts, and needs no __syncthreads(). __device__ float warpReduceSum(float val) { for (int offset = 16; offset > 0; offset >>= 1) val += __shfl_down_sync(0xFFFFFFFF, val, offset); return val; // only lane 0 of each warp has the correct sum } // ── Block-level reduction (handles blocks with multiple warps) ── // WHY two-phase? First reduce within each warp (fast, no sync needed), // then reduce across warps via shared memory (one sync). __device__ float blockReduceSum(float val) { __shared__ float warp_sums[32]; // max 32 warps per block (1024 threads) int warp_id = threadIdx.x / 32; int lane_id = threadIdx.x % 32; val = warpReduceSum(val); // phase 1: intra-warp if (lane_id == 0) warp_sums[warp_id] = val; // lane 0 writes warp result __syncthreads(); // wait for all warps // Phase 2: first warp reduces across all warp sums int num_warps = blockDim.x / 32; val = (lane_id < num_warps) ? warp_sums[lane_id] : 0.0f; val = warpReduceSum(val); // final reduction return val; // only thread 0 has the total } // ── Main kernel: fused bias add + LayerNorm ── // Each block processes one row (one token in a sequence, one spatial position). // Input x: [N, D], bias: [D], gamma: [D], beta: [D], output: [N, D] // Launch config: grid = N blocks, block = min(D, 1024) threads __global__ void fused_layernorm_bias( const float* __restrict__ x, // [N, D] input activations const float* __restrict__ bias, // [D] bias vector const float* __restrict__ gamma, // [D] LayerNorm scale const float* __restrict__ beta, // [D] LayerNorm shift float* __restrict__ out, // [N, D] output int D, // hidden dimension float eps // LayerNorm epsilon (1e-5) ) { int row = blockIdx.x; // which row (token) this block handles int tid = threadIdx.x; // thread index within block // WHY extern __shared__? We need D floats, but D varies at runtime. // Extern shared memory is sized at kernel launch time. extern __shared__ float sdata[]; // ── Pass 1: Load (x + bias) into shared mem, accumulate partial sum ── // Each thread handles D/blockDim.x elements (stride loop pattern) float local_sum = 0.0f; for (int i = tid; i < D; i += blockDim.x) { float val = x[row * D + i] + bias[i]; // fused bias add sdata[i] = val; // store for reuse (3 reads later) local_sum += val; // partial mean contribution } // ── Parallel reduction for mean ── __shared__ float smean, svar; float total = blockReduceSum(local_sum); if (tid == 0) smean = total / D; // thread 0 broadcasts mean __syncthreads(); // all threads wait for smean // ── Pass 2: Compute variance from shared memory (no HBM re-read!) ── float local_var = 0.0f; for (int i = tid; i < D; i += blockDim.x) { float diff = sdata[i] - smean; local_var += diff * diff; } total = blockReduceSum(local_var); if (tid == 0) svar = total / D; // broadcast variance __syncthreads(); // ── Pass 3: Normalize + scale + shift (read shared, write HBM once) ── float inv_std = rsqrtf(svar + eps); // 1/sqrt(var+eps), one instruction for (int i = tid; i < D; i += blockDim.x) { out[row * D + i] = gamma[i] * (sdata[i] - smean) * inv_std + beta[i]; } // Total HBM traffic: read x once (Pass 1) + write out once (Pass 3) // = 2 × N × D × 4 bytes. Unfused version: 6 × N × D × 4 bytes. } // ── Launch configuration ── // WHY min(D,1024)? Block size is capped at 1024 threads by hardware. // WHY D*sizeof(float) for shared? We need one float per hidden dimension. int block_size = min(D, 1024); int shared_bytes = D * sizeof(float); fused_layernorm_bias<<>>( x_ptr, bias_ptr, gamma_ptr, beta_ptr, out_ptr, D, 1e-5f);
The blockReduceSum pattern above is used everywhere in ML kernels (softmax, LayerNorm, loss functions, attention). Here is how the full reduction works, step by step, for a warp of 8 threads (simplified from 32):
python import triton import triton.language as tl # WHY Triton? It compiles Python to PTX (GPU assembly) via MLIR. # You think in blocks, not threads. No manual shared memory management. # Typically achieves 80-90% of hand-written CUDA performance. @triton.jit def fused_layernorm_bias_kernel( x_ptr, bias_ptr, gamma_ptr, beta_ptr, out_ptr, D: tl.constexpr, eps: tl.constexpr ): row = tl.program_id(0) # block-level: one block per row cols = tl.arange(0, D) # Triton auto-tiles this if D > block x = tl.load(x_ptr + row * D + cols) b = tl.load(bias_ptr + cols) val = x + b # fused bias add mean = tl.sum(val, axis=0) / D var = tl.sum((val - mean) ** 2, axis=0) / D g = tl.load(gamma_ptr + cols) bt = tl.load(beta_ptr + cols) out = g * (val - mean) / tl.sqrt(var + eps) + bt tl.store(out_ptr + row * D + cols, out)
Triton handles shared memory, coalescing, bank conflicts, and warp scheduling automatically. You trade fine-grained control for 5x less code and 80-90% of peak performance.
Failure 1: Shared memory bank conflicts.
Symptom: Kernel runs 4-8x slower than expected. Nsight Compute shows "shared memory bank conflicts" metric at 50%+.
Root cause: Column-wise access to a 2D shared memory array where row stride is a multiple of 32. All threads in a warp hit the same bank.
Diagnostic: In Nsight Compute, check "L1/TEX Hit Rate" and "Shared Bank Conflicts" sections. Bank conflicts show as "replayed" shared memory instructions.
Fix: Pad the shared memory array: __shared__ float s[M][N + 1]. The +1 offsets each row's bank assignment, eliminating the conflict pattern.
Failure 2: Low occupancy from register pressure.
Symptom: Nsight Compute shows 15% achieved occupancy. Memory throughput is also low (20%). The kernel is neither compute-bound nor memory-bound — it is stall-bound.
Root cause: Each thread uses too many registers (e.g., 128 registers per thread). An SM has 65536 registers. At 128 per thread, only 512 threads fit per SM = 16 warps. The SM can schedule up to 64 warps, so you are at 25% occupancy. Not enough warps to hide memory latency.
Diagnostic: Compile with --ptxas-options=-v to see register count per thread. Or check Nsight Compute's "Occupancy" tab.
Fix: (a) Reduce per-thread register usage by recomputing values instead of storing them. (b) Use __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) to hint the compiler. (c) Reduce block size to give the compiler more freedom.
Failure 3: Non-coalesced global memory access.
Symptom: Memory throughput is 10% of peak, but the kernel does many loads/stores.
Root cause: Adjacent threads access non-adjacent memory locations. For example, transposed matrix access where thread i reads row i from a column-major array.
Diagnostic: Nsight Compute "Memory" tab shows "Global Load/Store Efficiency" below 25%.
Fix: Reorder the data layout (e.g., transpose the matrix in a preprocessing step), or load into shared memory in a coalesced pattern and then access the transposed layout from shared memory.
Failure 4: Warp divergence in conditional code.
Symptom: Kernel takes 2x longer than expected despite simple logic.
Root cause: An if/else inside the kernel where threads in the same warp take different branches. In SIMT, both branches execute for all threads — threads not on the active branch are masked off but still burn cycles.
Diagnostic: Nsight Compute shows "Warp Execution Efficiency" below 50%.
Fix: Restructure to ensure all threads in a warp take the same branch (common: boundary checks affect only the last warp). Use arithmetic instead of branches where possible: val = cond ? a : b compiles to a predicated move, no divergence.
Triton 3.x / Proton: Triton is becoming the standard for ML kernel development. Version 3.x adds TMA (Tensor Memory Accelerator) support for Hopper's hardware copy engine, persistent kernel patterns, and better autotuning. Proton is a built-in profiler that gives roofline-style analysis without leaving the Python ecosystem.
CUTLASS 3.x / CuTe: NVIDIA's C++ template library for high-performance GEMM. CuTe (Cute Tensor) provides a composable layout algebra for tiled tensor operations. Used inside TensorRT and cuBLAS. For custom attention patterns that need Tensor Core utilization above 90%, CUTLASS is the tool.
ThunderKittens (2024): A DSL for writing GPU kernels at the warp level, designed for ML workloads. Abstracts away shared memory management and bank conflicts while preserving performance. Gaining traction for attention kernel development.
You have a PyTorch model that runs at 85 ms in FP16 on your vehicle's SOC. Your hand-tuned CUDA kernel shaved 14 ms off the attention layer. Quantization brought it down to 65 ms. But there are still 50+ individual kernel launches — each one incurs CPU-side dispatch overhead (~5-15 microseconds), and between kernels the intermediate tensors bounce through HBM. TensorRT can fuse dozens of those kernels, eliminate intermediate memory writes, and auto-tune each fused kernel to the target GPU's specific memory hierarchy. The same model runs at 24 ms after TensorRT compilation.
TensorRT is a graph compiler and inference runtime. It takes a neural network graph (from ONNX, TensorFlow, or directly via the TensorRT API) and produces an optimized engine — a serialized binary blob containing fused CUDA kernels, memory allocation plans, and precision-per-layer decisions, all tuned for a specific GPU architecture.
The optimizations happen at multiple levels:
| Optimization | What it does | Typical speedup |
|---|---|---|
| Layer fusion | Merges adjacent operations into one kernel (Conv+BN+ReLU, MatMul+Bias+GELU) | 2-5x for fused patterns |
| Precision selection | Runs each layer in the fastest precision (FP32, FP16, INT8) that maintains accuracy | 2-4x from FP32 to INT8 |
| Kernel auto-tuning | Benchmarks multiple kernel implementations (different tile sizes, data layouts) and picks the fastest | 10-30% over default |
| Memory planning | Reuses memory buffers across layers (layer A's output buffer becomes layer C's input buffer if they don't overlap in time) | 30-60% less memory |
| Tensor format | Reorders from NCHW to NHWC or NC/32HW32 if the kernel is faster in that layout | 10-20% for conv layers |
To see what TensorRT does concretely, consider a typical transformer block with 11 operations. Before TensorRT, each is a separate CUDA kernel launch with an intermediate HBM write/read:
That is 11 HBM round-trips reduced to 4. For a 24-layer transformer, this means 168 kernel launches instead of 264, and ~60% less HBM bandwidth consumption.
The deployment pipeline has four stages. Each can fail in non-obvious ways. Let us walk through every step.
python import torch model = load_perception_model("checkpoint.pt") model.eval().cuda() # Create a representative input. WHY representative? ONNX tracing # executes the model once and records the operations. If your model # has input-dependent control flow (e.g., different paths for # different image resolutions), the trace only captures ONE path. dummy_input = torch.randn(1, 3, 640, 640).cuda() # Export with dynamic axes for batch size flexibility # WHY dynamic_axes? Without it, the ONNX graph bakes in batch=1. # TRT can then only run batch=1 forever. With dynamic_axes, # the graph accepts any batch size at runtime. torch.onnx.export( model, dummy_input, "model.onnx", input_names=["image"], output_names=["boxes", "scores", "classes"], dynamic_axes={ "image": {0: "batch"}, # batch dim is dynamic "boxes": {0: "batch", 1: "num_det"}, # detections vary }, opset_version=17, # latest stable opset do_constant_folding=True, # fold constant ops at export time ) # Validate the ONNX graph import onnx onnx_model = onnx.load("model.onnx") onnx.checker.check_model(onnx_model) # catches shape mismatches, unsupported ops
if x.shape[0] > 1 becomes a static branch in ONNX; use torch.where or torch.cond instead. (2) Custom operators — if your model calls a C++ extension, you need to register a symbolic function for ONNX export. (3) In-place operations — x.add_(1) can confuse the ONNX tracer; use x = x + 1. Always run onnx.checker.check_model() and then onnxruntime.InferenceSession("model.onnx") to verify the graph is valid and runnable.python import tensorrt as trt # ── Create the builder and logger ── logger = trt.Logger(trt.Logger.WARNING) # VERBOSE for debugging builder = trt.Builder(logger) # ── Parse ONNX into a TRT network ── # WHY EXPLICIT_BATCH? Legacy TRT used implicit batch dim. All modern # models need explicit batch for dynamic shapes and attention. network = builder.create_network( 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH) ) parser = trt.OnnxParser(network, logger) with open("model.onnx", "rb") as f: success = parser.parse(f.read()) if not success: for i in range(parser.num_errors): print(parser.get_error(i)) # detailed ONNX parse errors raise RuntimeError("ONNX parse failed") # ── Configure the builder ── config = builder.create_builder_config() config.set_memory_pool_limit( trt.MemoryPoolType.WORKSPACE, 1 << 30 # 1 GB workspace # WHY workspace? TRT uses scratch memory for kernel tuning. # Larger workspace = more kernel variants tried = better perf. # But on a 32 GB SOC, you can't afford 8 GB of workspace. ) # Enable precision modes config.set_flag(trt.BuilderFlag.FP16) # allow FP16 kernels config.set_flag(trt.BuilderFlag.INT8) # allow INT8 kernels # WHY both FP16 and INT8? TRT will choose per-layer. Layers where # INT8 would lose too much accuracy automatically stay in FP16. # This is controlled by the calibrator's per-layer ranges. # Dynamic shapes: specify min/opt/max for each input dimension # WHY all three? TRT auto-tunes for the "opt" shape but must # support anything from "min" to "max" at runtime. profile = builder.create_optimization_profile() profile.set_shape("image", min=(1, 3, 640, 640), # minimum batch opt=(4, 3, 640, 640), # typical batch (auto-tune target) max=(8, 3, 640, 640), # maximum batch ) config.add_optimization_profile(profile)
python # TRT INT8 calibration requires a custom calibrator class. # It provides calibration data batches and stores the resulting # per-layer quantization ranges in a cache file. class PerceptionCalibrator(trt.IInt8EntropyCalibrator2): # WHY EntropyCalibrator2? It uses KL-divergence to find optimal # clipping thresholds. Alternatives: MinMaxCalibrator (simpler, # worse for skewed distributions), PercentileCalibrator. def __init__(self, data_loader, cache_file="calibration.cache"): super().__init__() self.data_loader = data_loader self.iterator = iter(data_loader) self.cache_file = cache_file # Pre-allocate GPU memory for calibration batch # WHY pre-allocate? Allocating per-batch is slow and fragments memory. self.device_input = cuda.mem_alloc( 4 * 3 * 640 * 640 * 4 # batch=4, 3ch, 640x640, float32 ) def get_batch_size(self): return 4 def get_batch(self, names): # Called by TRT builder to get the next calibration batch. # Returns a list of GPU pointers (one per input tensor). try: batch = next(self.iterator) cuda.memcpy_htod(self.device_input, batch.numpy()) return [int(self.device_input)] except StopIteration: return None # signals end of calibration data def read_calibration_cache(self): # If a cache file exists, TRT skips calibration and reuses it. # WHY cache? Calibration can take 10-30 minutes for large models. # The cache stores per-layer scale factors, not raw data. if os.path.exists(self.cache_file): with open(self.cache_file, "rb") as f: return f.read() return None def write_calibration_cache(self, cache): with open(self.cache_file, "wb") as f: f.write(cache) # Attach calibrator to the builder config calib_loader = create_calibration_loader( dataset_path="calib_frames/", num_samples=500, batch_size=4, # WHY 500 samples? Activation statistics converge around 200-500 # for perception models. Using more is diminishing returns. ) config.int8_calibrator = PerceptionCalibrator(calib_loader) # ── Build the engine ── # WHY build_serialized_network? It returns bytes that can be saved # to disk. The old build_engine() returned a runtime object. serialized_engine = builder.build_serialized_network(network, config) with open("model.engine", "wb") as f: f.write(serialized_engine)
When TensorRT's ONNX parser encounters an operation it does not recognize — your custom deformable attention, a novel NMS variant, or a BEV grid scatter — it falls back to FP32 or fails entirely. You need a TensorRT plugin: a C++ class that implements the operation and registers it with TRT's plugin registry.
cpp // ── Complete TensorRT Plugin for a Fused Bias + GELU operation ── // WHY this example? It's simple enough to show the full structure but // representative of real plugins (custom activation + elementwise fusion). #include "NvInferPlugin.h" #include <vector> #include <cstring> // Forward declaration of the CUDA kernel (defined in .cu file) void launchBiasGelu(const float* input, const float* bias, float* output, int N, int D, cudaStream_t stream); class BiasGeluPlugin : public nvinfer1::IPluginV2DynamicExt { public: // ── Constructor: store parameters needed for the plugin ── BiasGeluPlugin(int hidden_dim) : mHiddenDim(hidden_dim) {} // ── Deserialization constructor (for loading saved engines) ── BiasGeluPlugin(const void* data, size_t length) { const char* p = static_cast<const char*>(data); mHiddenDim = *reinterpret_cast<const int*>(p); } // ── Tell TRT the output shape given input shapes ── nvinfer1::DimsExprs getOutputDimensions( int outputIndex, const nvinfer1::DimsExprs* inputs, int nbInputs, nvinfer1::IExprBuilder& builder ) noexcept override { // Output has same shape as first input (the activation tensor) return inputs[0]; } // ── Tell TRT what precisions we support ── bool supportsFormatCombination( int pos, const nvinfer1::PluginTensorDesc* inOut, int nbInputs, int nbOutputs ) noexcept override { // Support FP32 and FP16, linear format only return inOut[pos].format == nvinfer1::TensorFormat::kLINEAR && (inOut[pos].type == nvinfer1::DataType::kFLOAT || inOut[pos].type == nvinfer1::DataType::kHALF); } // ── The actual kernel launch (called during inference) ── int enqueue( const nvinfer1::PluginTensorDesc* inputDesc, const nvinfer1::PluginTensorDesc* outputDesc, const void* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream ) noexcept override { int N = inputDesc[0].dims.d[0]; // batch × seq_len int D = inputDesc[0].dims.d[1]; // hidden dim launchBiasGelu( static_cast<const float*>(inputs[0]), // activation static_cast<const float*>(inputs[1]), // bias static_cast<float*>(outputs[0]), // output N, D, stream ); return 0; // 0 = success } // ── Serialization (for saving the engine to disk) ── size_t getSerializationSize() const noexcept override { return sizeof(int); // just mHiddenDim } void serialize(void* buffer) const noexcept override { *static_cast<int*>(buffer) = mHiddenDim; } // ── Plugin metadata ── const char* getPluginType() const noexcept override { return "BiasGelu"; } const char* getPluginVersion() const noexcept override { return "1"; } int getNbOutputs() const noexcept override { return 1; } private: int mHiddenDim; }; // ── Plugin Creator (factory that TRT uses to instantiate the plugin) ── class BiasGeluPluginCreator : public nvinfer1::IPluginCreator { public: const char* getPluginName() const noexcept override { return "BiasGelu"; } const char* getPluginVersion() const noexcept override { return "1"; } nvinfer1::IPluginV2* createPlugin( const char* name, const nvinfer1::PluginFieldCollection* fc ) noexcept override { int hidden_dim = 1024; // default; parse from fc in production return new BiasGeluPlugin(hidden_dim); } nvinfer1::IPluginV2* deserializePlugin( const char* name, const void* data, size_t length ) noexcept override { return new BiasGeluPlugin(data, length); } // Required boilerplate (field names, namespace, etc.) omitted for brevity }; // Register the plugin so TRT can find it by name during ONNX parsing REGISTER_TENSORRT_PLUGIN(BiasGeluPluginCreator);
No TensorRT engine ships to the vehicle without passing parity checks. This is the automated gate that prevents quantization or fusion errors from reaching the road. The framework compares the TensorRT engine output against the original PyTorch FP32 model on a comprehensive test set.
python import numpy as np import torch import tensorrt as trt def run_parity_check(pytorch_model, trt_engine_path, test_loader, abs_tol=0.01, cos_tol=0.999, max_failures=0): """ Compare PyTorch FP32 outputs vs TensorRT engine outputs. WHY three metrics? Each catches different failure modes: - abs_tol: catches large pointwise errors (e.g., NaN, overflow) - cos_tol: catches directional drift (rotation of feature vectors) - per-layer: isolates WHICH layer diverged for debugging """ # Load TRT engine runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING)) with open(trt_engine_path, "rb") as f: engine = runtime.deserialize_cuda_engine(f.read()) context = engine.create_execution_context() failures = [] for batch_idx, batch in enumerate(test_loader): # PyTorch reference with torch.no_grad(): ref = pytorch_model(batch.cuda()).cpu().numpy() # TRT inference (simplified; real code manages CUDA buffers) trt_out = run_trt_inference(context, batch.numpy()) # Metric 1: Maximum absolute error max_abs = np.max(np.abs(ref - trt_out)) # Metric 2: Cosine similarity (treats outputs as vectors) cos_sim = np.dot(ref.flatten(), trt_out.flatten()) / ( np.linalg.norm(ref.flatten()) * np.linalg.norm(trt_out.flatten()) + 1e-8 ) # Metric 3: Per-element relative error (skip near-zero values) mask = np.abs(ref) > 0.01 rel_err = np.max(np.abs(ref[mask] - trt_out[mask]) / np.abs(ref[mask])) if max_abs > abs_tol or cos_sim < cos_tol: failures.append({ "batch": batch_idx, "max_abs_error": float(max_abs), "cosine_sim": float(cos_sim), "max_rel_error": float(rel_err), }) print(f"Parity: {len(test_loader)-len(failures)}/{len(test_loader)} passed") if len(failures) > max_failures: print("PARITY FAILED. Failing batches:") for f in failures[:5]: print(f" Batch {f['batch']}: abs={f['max_abs_error']:.4f}, " f"cos={f['cosine_sim']:.6f}, rel={f['max_rel_error']:.4f}") raise AssertionError("TRT parity check failed") return True
Failure 1: ONNX export breaks on dynamic control flow.
Symptom: torch.onnx.export raises TracerWarning: Converting a tensor to a Python boolean.
Root cause: Python if tensor.shape[0] > 1 is evaluated at trace time, baking in the condition as a constant. The ONNX graph only captures one branch.
Diagnostic: Check for TracerWarning in the export logs. Visualize the ONNX graph with Netron to confirm missing branches.
Fix: Replace Python control flow with tensor ops: torch.where(cond, a, b). For complex control flow, use torch.export (PyTorch 2.x) instead of torch.onnx.export — it captures the full computation graph including control flow.
Failure 2: TensorRT engine produces NaN outputs.
Symptom: Some or all output values are NaN. Often intermittent — depends on input.
Root cause: Three common causes: (a) FP16 overflow — a layer with activations exceeding 65504 overflows to Inf, then Inf × 0 = NaN in subsequent layers. (b) INT8 scale of zero — if a layer had zero variance during calibration, its scale is 0, causing division by zero at inference. (c) Plugin bug — the custom kernel writes out-of-bounds or uses uninitialized memory.
Diagnostic: Build with trt.BuilderFlag.DEBUG and dump per-layer outputs using the TRT profiler callback. Find the first layer that produces NaN. Check if it is FP16-assigned and has large-magnitude inputs.
Fix: For (a): force that layer to FP32 using network.get_layer(i).precision = trt.float32. For (b): ensure calibration data produces non-zero activations in every layer. For (c): run the plugin in isolation with cuda-memcheck.
Failure 3: Engine is slower than expected.
Symptom: TRT engine runs at 45 ms but you expected 25 ms based on the roofline model.
Root cause: TensorRT's kernel auto-tuner selects from a library of pre-compiled kernels. If the workspace is too small, some faster kernels cannot be tried. If the tensor shapes are unusual (non-power-of-2 dimensions), the best kernels may not apply.
Diagnostic: Use trtexec --onnx=model.onnx --verbose --dumpLayerInfo to see which kernel was selected for each layer and its timing. Look for layers where the chosen kernel is unexpectedly slow.
Fix: (a) Increase workspace to 4-8 GB during build (does not affect runtime memory). (b) Pad tensor dimensions to multiples of 8 or 16 to enable Tensor Core kernels. (c) Use a timing cache file (--timingCacheFile) to share tuning results across builds.
Failure 4: Engine rebuilds break on new GPU architecture.
Symptom: Engine built on Orin fails to deserialize on next-gen SOC.
Root cause: TRT engines are not portable across GPU architectures. The engine contains GPU-specific kernels and memory layouts. An engine built for SM 8.7 (Orin) will not run on SM 9.0 (next-gen).
Diagnostic: trt.Runtime.deserialize_cuda_engine() returns null with a "CUDA engine built for incompatible architecture" error.
Fix: Always rebuild engines per-target-GPU as part of the deployment pipeline. Store the ONNX model (portable) alongside the engine (non-portable). In CI, rebuild engines for every supported SOC variant.
torch.compile JIT-compiles Python code via TorchInductor, generating Triton kernels automatically. Advantages: zero export step, supports dynamic shapes natively, Pythonic debugging. Disadvantages: less mature INT8 support, less control over kernel selection, slightly lower peak throughput than TRT for standard architectures. For datacenter LLM serving, torch.compile is gaining ground. For production edge deployment (vehicles, robots), TensorRT remains dominant because of its mature INT8 calibration, engine serialization (no Python at runtime), and C++ runtime.Standard self-attention computes Q×KT — an N×N matrix where N is sequence length. For a BEV model processing 8 camera views at 200 spatial tokens each, N=1600. That attention matrix is 1600×1600 = 2.56M entries. For a VLM processing 4096 tokens, it's 16.8M entries. The matrix alone doesn't fit in on-chip SRAM. This chapter derives exactly why, builds the solution tile by tile, and shows you every intermediate number.
Let's be precise about what "standard attention" costs. We have three input matrices: Q, K, V, each of shape [N, d] where N is the sequence length and d is the head dimension. The attention computation produces two intermediate matrices and one output:
Now let's count bytes. Assume FP16 (2 bytes per element), N=2048, d=128:
But memory size isn't the real killer — it's memory traffic. An A100's SRAM (shared memory per SM) is ~192 KB. The S matrix alone is 8 MB — it cannot live on-chip. So the standard algorithm writes S to HBM (slow global memory), then reads it back for softmax, writes P to HBM, then reads it back for the final matmul. Every element of S and P makes a round trip through HBM.
Before we can tile attention, we need to solve a fundamental problem: softmax requires knowing the entire row before you can compute any single output. Here's why, and how to fix it.
Standard softmax for a vector x of length N:
Now the tiling problem becomes clear. If we process the score row in blocks of B elements, after processing the first block we have a partial max m1 and partial sum l1. When the second block arrives, the new elements might contain a larger value — our entire partial sum is now wrong because we subtracted the wrong max. We need to fix up the old partial results.
The online softmax algorithm (Milakov & Gimelshein, 2018) does exactly this fix-up with a single extra multiply per block:
But FlashAttention doesn't just need the softmax — it needs the weighted output O = softmax(S) × V. So we must also maintain a running output accumulator and rescale it whenever the max changes:
Let's trace FlashAttention completely for N=4 tokens, d=2, tile size Br=Bc=2. We tile over the K/V dimension (columns of S), processing 2 key-value pairs at a time.
FlashAttention-1 (Dao et al., 2022) introduced tiling + online softmax. It reduced HBM accesses from Θ(N²d) to Θ(N²d²/M) where M is SRAM size. But it left performance on the table — the inner loop was doing too much non-matmul arithmetic (the online softmax rescaling) which couldn't use Tensor Cores.
FlashAttention-2 (Dao, 2023) made three key changes:
FA-2 Optimization 2: Reduce non-matmul FLOPs. The rescaling factor exp(mold - mnew) is computed per element. FA-2 defers the final 1/lK division to after the loop, doing it once instead of every iteration. This reduces non-matmul FLOPs by ~30%.
FA-2 Optimization 3: Better warp partitioning. FA-1 splits across warps along the K dimension, requiring synchronization after each tile. FA-2 splits along the Q dimension — each warp handles its own Q rows independently. No inter-warp sync needed for the online softmax state.
FlashAttention-3 (Dao et al., 2024, targeting Hopper/H100) adds:
Producer-consumer warp pipelining: one warp group loads the next K/V tile from HBM into shared memory while another computes on the current tile. This uses Hopper's Tensor Memory Accelerator (TMA) for asynchronous, hardware-accelerated data movement — the load warp issues a TMA descriptor and immediately continues to other work.
FP8 support with incoherent processing: accumulate in FP32, quantize per-tile to FP8 using block-level scaling factors, and apply a random orthogonal transform to reduce quantization bias. This gives ~2x throughput over FP16 with minimal accuracy loss.
FlashAttention keeps exact attention at O(N²) FLOPs — it's a memory optimization, not a computational one. Linear attention changes the math itself to get O(N) complexity.
The key idea starts with a subtle rewrite. Standard attention for a single query qi:
Common feature maps: φ(x) = elu(x) + 1 (Katharopoulos et al., 2020, "Transformers are RNNs"), or random Fourier features that approximate the softmax kernel. The trade-off is real: linear attention loses the sharp, peaked attention patterns that standard softmax produces. For long-range temporal attention (e.g., attending over 100+ past frames), the O(N) scaling wins. For spatial attention within a single image, the quality loss usually isn't worth it.
Multi-Head Attention (MHA) uses separate Q, K, V projections per head. With 32 heads and d=128, that's 32 separate K and V matrices to store and load from HBM. Multi-Query Attention (MQA) (Shazeer, 2019) uses a single shared K and V across all heads:
GQA matters enormously for inference. The KV-cache is typically the memory bottleneck during decoding (see Chapter 5). GQA-8 gives 4x more concurrent sequences for the same memory budget compared to full MHA.
python import numpy as np def standard_softmax(x): """Two-pass safe softmax: max pass + exp-sum pass.""" m = np.max(x) # Pass 1: find max e = np.exp(x - m) # Pass 2: exp with stability return e / np.sum(e) def online_softmax(x, block_size=2): """Single-pass online softmax — processes blocks sequentially.""" N = len(x) m = float('-inf') # Running max l = 0.0 # Running sum of exp(x - m) d = np.zeros(N) # Will hold exp(x_i - m_final) for start in range(0, N, block_size): block = x[start : start + block_size] m_new = max(m, np.max(block)) # Update running max # Rescale old sum to new max, add new block l = l * np.exp(m - m_new) + np.sum(np.exp(block - m_new)) # Store the shifted exponentials for this block d[start : start + block_size] = np.exp(block - m_new) # Fix up PREVIOUS blocks: they used the old max if start > 0: d[:start] *= np.exp(m - m_new) m = m_new return d / l # Normalize by final sum # Verify exactness x = np.array([0.707, 0.0, 0.707, 0.707]) print("Standard:", standard_softmax(x)) # [0.286, 0.141, 0.286, 0.286] print("Online: ", online_softmax(x, 2)) # [0.286, 0.141, 0.286, 0.286] — EXACT match
python # Simplified FlashAttention forward — conceptual Triton kernel import triton import triton.language as tl @triton.jit def flash_attn_fwd(Q, K, V, O, # pointers to [N, D] tensors in HBM stride_qn, stride_kn, # stride between rows N, # sequence length D: tl.constexpr, # head dimension (compile-time) BLOCK: tl.constexpr): # tile size (compile-time) pid = tl.program_id(0) # which Q-block this thread block handles # Load this thread block's Q tile: [BLOCK, D], stays in registers q_ptrs = Q + pid * BLOCK * stride_qn + tl.arange(0, BLOCK)[:, None] * stride_qn + tl.arange(0, D)[None, :] q_block = tl.load(q_ptrs) # [BLOCK, D] — lives in registers for entire loop # Initialize online softmax accumulators (in registers, not shared mem) m_i = tl.full([BLOCK], float('-inf'), dtype=tl.float32) # running max per row l_i = tl.zeros([BLOCK], dtype=tl.float32) # running sum per row o_i = tl.zeros([BLOCK, D], dtype=tl.float32) # running output [BLOCK, D] # Iterate over all K/V tiles (inner loop of FA-2) for j in range(0, N, BLOCK): # Load K,V tile from HBM into shared memory: [BLOCK, D] k_ptrs = K + j * stride_kn + tl.arange(0, BLOCK)[:, None] * stride_kn + tl.arange(0, D)[None, :] v_ptrs = V + j * stride_kn + tl.arange(0, BLOCK)[:, None] * stride_kn + tl.arange(0, D)[None, :] k_block = tl.load(k_ptrs) # [BLOCK, D] v_block = tl.load(v_ptrs) # [BLOCK, D] # Compute QK^T for this tile — uses Tensor Cores s = tl.dot(q_block, tl.trans(k_block)) # [BLOCK, BLOCK] s = s * (D ** -0.5) # scale by 1/sqrt(d) # Online softmax: update running max m_new = tl.maximum(m_i, tl.max(s, axis=1)) # new max per row # Correction factor: rescale old accumulator to new max alpha = tl.exp(m_i - m_new) # exp(m_old - m_new) ≤ 1.0 # Compute exp(s - m_new) for new tile p = tl.exp(s - m_new[:, None]) # [BLOCK, BLOCK] — unnormalized weights # Update running sum l_new = alpha * l_i + tl.sum(p, axis=1) # rescaled old sum + new sum # Update running output: rescale old + add new contribution # Note: we defer the 1/l_new division to after the loop (FA-2 trick) o_i = o_i * alpha[:, None] + tl.dot(p.to(v_block.dtype), v_block) m_i = m_new l_i = l_new # Final normalization (done ONCE after all tiles) o_i = o_i / l_i[:, None] # Write output tile to HBM o_ptrs = O + pid * BLOCK * stride_qn + tl.arange(0, BLOCK)[:, None] * stride_qn + tl.arange(0, D)[None, :] tl.store(o_ptrs, o_i.to(tl.float16))
Failure 1: FP16 overflow in softmax scores.
Symptom: NaN or Inf in attention output. Cause: With d=128 and random Q/K in [-1, 1], q·k can reach 128. exp(128) = 3.8e55 — well beyond FP16 max (65504). The /√d scaling brings this to exp(128/11.3) = exp(11.3) = 80000 — still overflows FP16. Metric: Monitor max(abs(S)) per layer. Fix: FlashAttention accumulates the online softmax (m, l, O) in FP32 internally. The final output is cast to FP16 only at the very end. If you write your own kernel, you must do this too — FP16 accumulators will overflow.
Failure 2: Head dimension not a multiple of tile size.
Symptom: Kernel launch failure or incorrect results. Cause: Triton/CUDA tiling requires D to be divisible by the warp size or tile dimension (typically 64 or 128). If D=96, loading a [BLOCK, 128] tile reads garbage. Metric: Assert D % 64 == 0 at model init. Fix: Pad the head dimension to the next multiple of 64. For D=96, pad to 128. The 33% extra memory is negligible compared to the N² cost. Alternatively, some FA implementations support non-power-of-2 D with masking.
Failure 3: Causal mask breaks tiling.
Symptom: Autoregressive model produces incoherent output. Cause: A causal mask requires Sij = -∞ for j > i. Naive masking per-tile is tricky: for a Q tile covering rows [64..127] and K tile covering columns [128..191], the entire tile should be masked (it's fully in the future). For rows [64..127] and columns [64..127], partial masking is needed. Fix: FlashAttention uses three-way tile classification: fully unmasked (process normally), fully masked (skip entirely — free speedup), partially masked (apply element-wise mask within the tile). The skip optimization gives causal attention ~50% speedup over full attention.
Failure 4: FlashAttention backward is slower than expected.
Symptom: Backward pass 2-3x slower than forward. Cause: FA's backward recomputes the S matrix from Q, K instead of storing it. This trades memory for compute — the whole point. But if your model is compute-bound (not memory-bound), this recomputation hurts. Metric: Compare SM occupancy between forward and backward. Fix: Expected behavior for compute-bound regimes. For very short sequences (N < 512), standard attention may actually be faster because the N² matrix fits in SRAM anyway.
Your autonomous vehicle runs a VLM for scene understanding. It processes 8 camera images + a text prompt and generates structured output describing detected objects and driving decisions. At each decoding step, the model attends to every previous token. Recomputing K and V from scratch every step would mean quadratic cost in sequence length. The KV-cache eliminates this. But the cache itself creates new problems: memory fragmentation, unbounded growth, and the fundamental sequentiality of autoregressive decoding. This chapter derives the solutions.
In autoregressive generation, the model produces one token at a time. At step t, it must attend to all t tokens generated so far (plus any prompt tokens). Let's trace what happens without a cache:
Let's derive the memory formula from first principles, then compute concrete numbers for real models.
Now let's plug in real numbers for a 7B-class model (32 layers, 32 heads, dhead=128):
| Seq Length | FP16 (per seq) | INT8 (per seq) | Batch=8, FP16 | Batch=8, INT8 |
|---|---|---|---|---|
| 512 | 256 MB | 128 MB | 2.0 GB | 1.0 GB |
| 2048 | 1.0 GB | 512 MB | 8.0 GB | 4.0 GB |
| 4096 | 2.0 GB | 1.0 GB | 16.0 GB | 8.0 GB |
| 8192 | 4.0 GB | 2.0 GB | 32.0 GB | 16.0 GB |
| 32768 | 16.0 GB | 8.0 GB | 128.0 GB | 64.0 GB |
Recall from Chapter 4: Grouped-Query Attention (GQA) shares K/V across groups of heads. This directly shrinks the KV-cache:
This is why every modern production model uses GQA. The cache reduction is the primary motivation — it directly translates to more concurrent users or longer sequences on the same hardware.
Even with GQA, the cache management problem remains: sequences have different lengths. The naive approach pre-allocates max_sequence_length for every request. Let's see why that's wasteful.
This is exactly the problem operating systems solved decades ago with virtual memory. PagedAttention (Kwon et al., 2023) applies the same solution to KV-cache management.
In OS virtual memory, each process sees a contiguous address space (logical pages), but the OS maps these to scattered physical pages in RAM. A page table tracks the mapping. Pages are allocated on demand — a process that requests 4 GB but only touches 1 GB only uses 1 GB of physical RAM.
PagedAttention does the same for KV-cache:
Traditional batching waits for all sequences in a batch to finish before starting new ones. If sequence A needs 10 tokens and sequence B needs 1000, sequence A's GPU slot sits idle for 990 steps. Continuous batching (also called "iteration-level scheduling") fills empty slots immediately.
PagedAttention makes continuous batching efficient because inserting a new sequence requires no memory reshuffling — just allocate a new page table and start appending blocks.
Even with KV-cache, autoregressive decoding has a fundamental problem: it's sequential. Token t+1 depends on token t. The GPU runs a massive model to produce a single token, leaving most of its compute capacity idle. Speculative decoding (Leviathan et al., 2022; Chen et al., 2023) turns this sequential bottleneck into a parallel verification problem.
Let's prove this is lossless with a concrete 5-token example:
python import torch class KVCache: """Simple KV-cache for one attention layer.""" def __init__(self, max_seq, n_heads, d_head, dtype=torch.float16): # Pre-allocate maximum size — no runtime allocation self.k = torch.zeros(max_seq, n_heads, d_head, dtype=dtype, device="cuda") self.v = torch.zeros(max_seq, n_heads, d_head, dtype=dtype, device="cuda") self.length = 0 # current number of cached tokens def append(self, k_new, v_new): """Append new K,V vectors. k_new shape: [1, n_heads, d_head]""" self.k[self.length] = k_new[0] # write into pre-allocated slot self.v[self.length] = v_new[0] self.length += 1 def get(self): """Return cached K,V up to current length.""" return self.k[:self.length], self.v[:self.length] def memory_bytes(self): return self.k.nelement() * self.k.element_size() * 2 # K + V
python class BlockAllocator: """PagedAttention-style block allocator with free list.""" def __init__(self, total_blocks, block_size, n_heads, d_head, dtype=torch.float16): self.block_size = block_size # tokens per block (e.g., 16) # Physical storage: one big tensor, sliced into blocks self.k_pool = torch.zeros(total_blocks, block_size, n_heads, d_head, dtype=dtype, device="cuda") self.v_pool = torch.zeros_like(self.k_pool) # Free list: all blocks start as free self.free_blocks = list(range(total_blocks)) # Page tables: sequence_id -> [physical_block_indices] self.page_tables = {} def allocate(self, seq_id): """Allocate a new block for a sequence. O(1) — just pop from free list.""" if not self.free_blocks: raise RuntimeError("OOM: no free blocks") block_idx = self.free_blocks.pop() if seq_id not in self.page_tables: self.page_tables[seq_id] = [] self.page_tables[seq_id].append(block_idx) return block_idx def free(self, seq_id): """Free all blocks for a finished sequence.""" for block_idx in self.page_tables.pop(seq_id, []): self.free_blocks.append(block_idx) # return to free list def fragmentation(self): """Fragmentation = 0 by design. All blocks are same size.""" return 0.0 # no external fragmentation ever
python def speculative_accept(draft_probs, target_probs, draft_tokens, K=5): """Rejection sampling for speculative decoding. Returns accepted tokens (guaranteed to match target distribution).""" accepted = [] for i in range(K): t = draft_tokens[i] q = draft_probs[i][t] # draft's probability of this token p = target_probs[i][t] # target's probability of this token # Accept with probability min(1, p/q) if torch.rand(1).item() < min(1.0, p / (q + 1e-10)): accepted.append(t) else: # Reject: sample from adjusted distribution adjusted = torch.clamp(target_probs[i] - draft_probs[i], min=0) adjusted = adjusted / adjusted.sum() # normalize resampled = torch.multinomial(adjusted, 1).item() accepted.append(resampled) break # stop at first rejection return accepted # Expected accepted length: ∑_{i=1}^{K} ∏_{j=1}^{i} alpha_j # where alpha_j = ∑_t min(p_j(t), q_j(t)) (token-level acceptance rate) # Typical values: alpha ~ 0.7-0.9 with a good draft model # Expected accepted with K=5, alpha=0.8: 0.8+0.64+0.51+0.41+0.33 = 2.69 tokens # Plus 1 for the resampled token = ~3.7 tokens per target forward pass
Failure 1: KV-cache memory leak — latency grows over time.
Symptom: After 200 frames of continuous operation, inference latency degrades from 40ms to 120ms. Cause: The context window keeps growing because old tokens are never evicted. Attention cost is linear in cache length, so 3x more cached tokens = 3x slower attention. Metric: Track cache length per request over time; alert if length exceeds expected maximum. Fix: Implement a sliding window policy — evict tokens older than max_context. Or use StreamingLLM-style "attention sinks": keep the first K tokens (which accumulate disproportionate attention mass) plus the most recent W tokens, dropping everything in between.
Failure 2: Speculative decoding gives only 1.2x speedup instead of expected 3x.
Symptom: Draft model acceptance rate is <40%. Cause: The draft and target models have divergent distributions — the draft model makes different vocabulary choices. This happens when the draft model is trained on different data or has a fundamentally different architecture. Metric: Log average acceptance rate per speculation round. Fix: Use a draft model distilled from the target (not independently trained). Alternatively, use "self-speculative" decoding: use the target model's early-exit layers or a smaller subset of layers as the draft.
Failure 3: PagedAttention latency spikes during block allocation.
Symptom: P99 latency is 5x higher than P50 during high load. Cause: When memory pressure is high, the allocator must either: (a) evict a sequence to free blocks, triggering recomputation later, or (b) block until a sequence finishes. Either path adds unpredictable latency. Metric: Track free block count; alert when below 10% threshold. Fix: Pre-size the block pool for expected max concurrency. Implement admission control: reject new requests rather than causing latency spikes for in-flight requests. Use a preemption policy (shortest-sequence-first eviction) to minimize recomputation cost.
Failure 4: KV-cache quantization degrades long-context quality.
Symptom: Model produces incoherent output for prompts > 4K tokens with INT4 KV-cache, but works fine with FP16. Cause: Quantization error accumulates over many tokens. Attention scores become noisy, and the model "forgets" early context. Metric: Compare perplexity at various context lengths between FP16 and quantized cache. Fix: Use mixed-precision caching — keep the first 256 and last 256 tokens in FP16 (these get the most attention), quantize the middle. Or use per-channel quantization instead of per-tensor to reduce outlier impact.
Python prototypes. C++ ships. On a safety-critical vehicle, every inference call must complete within a deterministic time budget, never leak memory, handle concurrent sensor streams, and fail gracefully under all conditions. A garbage collection pause of 50ms means the vehicle drives blind for one full cycle at 20 Hz. This chapter covers the exact C++ patterns, CUDA primitives, and memory strategies that make real-time inference possible.
Python's runtime has three fundamental problems for real-time systems. Let's quantify each one.
Problem 1: The Global Interpreter Lock (GIL). Python's GIL allows only one thread to execute Python bytecode at a time. Even with 8 threads processing 8 camera streams, only one runs at any instant. You get concurrency (interleaving) but not parallelism (simultaneous execution). For CPU-bound preprocessing (image decoding, normalization), this means 8 cores sit idle while one does work.
Problem 2: Garbage collection pauses. Python's GC uses reference counting + cycle detection. The cycle detector runs periodically and freezes ALL threads while it traces the object graph. For a process with 2 GB of Python objects, a Gen2 collection takes 10-80ms. This is non-deterministic — you cannot predict when it will happen.
Problem 3: Dynamic dispatch overhead. Every Python function call looks up the function object in a dictionary, checks types at runtime, boxes/unboxes arguments. A C++ virtual function call is a single indirect jump through a vtable. A non-virtual C++ call is a direct jump — zero overhead.
You don't need all of C++ for inference. You need these five patterns, each solving a specific inference problem.
1. RAII (Resource Acquisition Is Initialization). Why for inference: GPU resources (buffers, streams, contexts) MUST be freed. In Python, a forgotten del means the GC eventually cleans up. In C++, RAII guarantees cleanup at scope exit — even if an exception is thrown.
cpp // RAII wrapper for CUDA memory — impossible to leak template<typename T> class CudaBuffer { public: CudaBuffer(size_t count) : size_(count * sizeof(T)) { cudaMalloc(&ptr_, size_); // acquire on construction } ~CudaBuffer() { cudaFree(ptr_); } // release on destruction — GUARANTEED // Delete copy (no accidental double-free) CudaBuffer(const CudaBuffer&) = delete; CudaBuffer& operator=(const CudaBuffer&) = delete; // Allow move (transfer ownership without copying) CudaBuffer(CudaBuffer&& other) noexcept : ptr_(other.ptr_), size_(other.size_) { other.ptr_ = nullptr; // source gives up ownership other.size_ = 0; } T* data() { return static_cast<T*>(ptr_); } size_t bytes() const { return size_; } private: void* ptr_ = nullptr; size_t size_ = 0; }; // When CudaBuffer goes out of scope (function exit, exception, etc.), // destructor runs automatically. Memory CANNOT leak.
2. Smart pointers. std::unique_ptr for exclusive ownership (most inference objects), std::shared_ptr for shared ownership (e.g., a TensorRT engine shared across multiple inference contexts). Why for inference: TensorRT objects have specific destruction ordering requirements — engine must outlive execution contexts. Smart pointers enforce this automatically.
3. Move semantics. Large tensors should be moved, not copied. A move transfers ownership of the underlying pointer in O(1) — no memcpy. Critical when passing inference results between pipeline stages.
4. std::span (C++20). A non-owning view over contiguous memory. Perfect for passing pre-allocated buffers to functions without transferring ownership or copying. Think of it as "a pointer + a size" with bounds checking.
cpp // Production TensorRT inference runner — every line annotated #include <NvInfer.h> #include <cuda_runtime.h> #include <fstream> #include <vector> #include <unordered_map> #include <memory> #include <string> // Custom deleter for TensorRT objects (RAII compliance) struct TrtDeleter { template<typename T> void operator()(T* p) const { if (p) p->destroy(); } }; template<typename T> using TrtPtr = std::unique_ptr<T, TrtDeleter>; class InferenceRunner { public: // ──── CONSTRUCTOR: all allocation happens here ──── InferenceRunner(const std::string& engine_path) { // Step 1: Read serialized engine from disk into CPU memory std::ifstream file(engine_path, std::ios::binary | std::ios::ate); auto size = file.tellg(); // file size in bytes file.seekg(0, std::ios::beg); std::vector<char> engine_data(size); file.read(engine_data.data(), size); // Step 2: Create runtime and deserialize engine // TensorRT runtime: manages engine lifecycle runtime_.reset(nvinfer1::createInferRuntime(logger_)); // Engine: contains the optimized network graph + weights engine_.reset(runtime_->deserializeCudaEngine( engine_data.data(), engine_data.size())); // Execution context: holds per-inference state (bindings, workspace) // Multiple contexts can share one engine for concurrent inference context_.reset(engine_->createExecutionContext()); // Step 3: Pre-allocate ALL device buffers // This is the critical design decision: ZERO allocation at runtime for (int i = 0; i < engine_->getNbIOTensors(); ++i) { auto name = engine_->getIOTensorName(i); auto dims = engine_->getTensorShape(name); size_t bytes = volume(dims) * sizeof(float); void* ptr = nullptr; cudaMalloc(&ptr, bytes); // GPU allocation — happens ONCE buffers_[name] = {ptr, bytes}; context_->setTensorAddress(name, ptr); } // Step 4: Create CUDA stream for async operations cudaStreamCreate(&stream_); // Step 5: Pre-warm — run one dummy inference to trigger JIT compilation // cuDNN/cuBLAS may compile kernels on first call, causing a latency spike std::vector<float> dummy(input_elements(), 0.0f); std::vector<float> dummy_out(output_elements()); infer(dummy.data(), dummy_out.data()); // absorb JIT cost here } // ──── INFER: the hot path — zero allocation, fully deterministic ──── void infer(const float* input, float* output) { auto& in_buf = buffers_["input"]; auto& out_buf = buffers_["output"]; // Async copy: CPU → GPU (uses DMA, CPU is free during transfer) cudaMemcpyAsync(in_buf.ptr, input, in_buf.bytes, cudaMemcpyHostToDevice, stream_); // Enqueue inference on stream (returns immediately) context_->enqueueV3(stream_); // Async copy: GPU → CPU cudaMemcpyAsync(output, out_buf.ptr, out_buf.bytes, cudaMemcpyDeviceToHost, stream_); // Block until all operations on this stream complete cudaStreamSynchronize(stream_); } // ──── DESTRUCTOR: cleanup in reverse order of creation ──── ~InferenceRunner() { cudaStreamSynchronize(stream_); // wait for in-flight work for (auto& [name, buf] : buffers_) cudaFree(buf.ptr); // free GPU memory cudaStreamDestroy(stream_); // context_, engine_, runtime_ freed by unique_ptr destructors // in reverse order (context first, runtime last) — correct! } private: struct Buffer { void* ptr; size_t bytes; }; nvinfer1::ILogger logger_; TrtPtr<nvinfer1::IRuntime> runtime_; TrtPtr<nvinfer1::ICudaEngine> engine_; TrtPtr<nvinfer1::IExecutionContext> context_; std::unordered_map<std::string, Buffer> buffers_; cudaStream_t stream_; static size_t volume(const nvinfer1::Dims& d) { size_t v = 1; for (int i = 0; i < d.nbDims; ++i) v *= d.d[i]; return v; } };
A CUDA stream is a sequence of operations that execute in order on the GPU. Operations in different streams can execute concurrently. This is how we overlap data transfer with compute.
cpp // Double-buffered inference pipeline class DoubleBufPipeline { static constexpr int N_BUF = 2; cudaStream_t streams_[N_BUF]; void* d_input_[N_BUF]; // two input buffers on GPU void* d_output_[N_BUF]; // two output buffers on GPU float* h_input_[N_BUF]; // pinned host memory (for async copy) float* h_output_[N_BUF]; public: DoubleBufPipeline(size_t in_bytes, size_t out_bytes) { for (int i = 0; i < N_BUF; ++i) { cudaStreamCreate(&streams_[i]); cudaMalloc(&d_input_[i], in_bytes); cudaMalloc(&d_output_[i], out_bytes); // PINNED host memory — required for async transfers // Regular malloc'd memory forces synchronous copies cudaMallocHost(&h_input_[i], in_bytes); cudaMallocHost(&h_output_[i], out_bytes); } } void process_frame(int frame_idx, const float* frame_data, nvinfer1::IExecutionContext* ctx) { int buf = frame_idx % N_BUF; // alternate between buffers auto s = streams_[buf]; // Wait for PREVIOUS use of this buffer to complete cudaStreamSynchronize(s); // Copy frame data to pinned host buffer memcpy(h_input_[buf], frame_data, input_bytes_); // Async H2D copy (uses DMA engine, doesn't block SMs) cudaMemcpyAsync(d_input_[buf], h_input_[buf], input_bytes_, cudaMemcpyHostToDevice, s); // Enqueue inference (uses SMs, concurrent with next frame's H2D) ctx->setTensorAddress("input", d_input_[buf]); ctx->setTensorAddress("output", d_output_[buf]); ctx->enqueueV3(s); // Async D2H copy cudaMemcpyAsync(h_output_[buf], d_output_[buf], output_bytes_, cudaMemcpyDeviceToHost, s); } ~DoubleBufPipeline() { for (int i = 0; i < N_BUF; ++i) { cudaStreamSynchronize(streams_[i]); // wait for in-flight cudaFree(d_input_[i]); cudaFree(d_output_[i]); cudaFreeHost(h_input_[i]); cudaFreeHost(h_output_[i]); cudaStreamDestroy(streams_[i]); } } };
Triple buffering adds a third buffer set, allowing three stages to overlap: while frame N computes, frame N+1 copies H2D, and frame N-1 copies D2H. Useful when your transfer and compute times are close (so double buffering still has bubbles). Diminishing returns beyond triple — the pipeline is either transfer-bound or compute-bound, and more buffers can't fix that.
Every CUDA API call (cudaMemcpyAsync, kernel launch) incurs ~5-10μs of CPU-side overhead for parameter validation, driver calls, and stream enqueue. For a TensorRT model with 200+ kernels, that's 1-2ms of pure CPU overhead per inference. When your total inference budget is 12ms, that's 15% wasted.
CUDA graphs solve this by recording a sequence of GPU operations once, then replaying them with a single API call:
cpp // CUDA graph: record the inference + copy pattern, replay forever cudaGraph_t graph; cudaGraphExec_t graph_exec; // Step 1: Record — execute once with graph capture enabled cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal); cudaMemcpyAsync(d_input, h_input, in_bytes, cudaMemcpyHostToDevice, stream); context->enqueueV3(stream); // all TRT kernels captured cudaMemcpyAsync(h_output, d_output, out_bytes, cudaMemcpyDeviceToHost, stream); cudaStreamEndCapture(stream, &graph); // Step 2: Instantiate — optimize the graph (fuse operations, etc.) cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0); // Step 3: Replay — run the entire captured sequence with ONE call for (int frame = 0; frame < num_frames; ++frame) { memcpy(h_input, frame_data[frame], in_bytes); // fill pinned buffer cudaGraphLaunch(graph_exec, stream); // ONE launch, all ops cudaStreamSynchronize(stream); } // Constraints: // - Input/output SHAPES must be fixed (no dynamic shapes) // - No cudaMalloc inside the graph (no dynamic allocation) // - No conditional branching (graph is a fixed DAG) // - Perfect for perception models with fixed input resolution
cudaMalloc is a system call that takes 50-500μs. During inference, this is catastrophic. The solution: pre-allocate everything during initialization, then never allocate again.
cpp // cudaMemPool: CUDA 11.2+ async allocation pool cudaMemPool_t pool; cudaDeviceGetDefaultMemPool(&pool, 0); // Set pool to release memory back to OS only above threshold uint64_t threshold = 2ULL * 1024 * 1024 * 1024; // 2 GB cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold); // Now cudaMallocAsync sub-allocates from the pool — no system call void* ptr; cudaMallocAsync(&ptr, 1024 * 1024, stream); // ~2μs, not 50μs // ... use ptr ... cudaFreeAsync(ptr, stream); // returns to pool, not to OS
Failure 1: Use-after-free from async operations.
Symptom: Intermittent garbage output, occasional CUDA illegal memory access. Cause: Host code frees a buffer while a kernel on a different stream is still reading it. Async operations return immediately — cudaFree executes before the kernel finishes. Metric: Run with CUDA_LAUNCH_BLOCKING=1 to serialize all operations. If the bug disappears, it's an async ordering issue. Fix: Always cudaStreamSynchronize(stream) before freeing any buffer used by that stream. Better: use RAII wrappers that synchronize in their destructor. Best: never free during the hot path — pre-allocate everything.
Failure 2: Hidden cudaMalloc in library code.
Symptom: P99 latency is 3x higher than P50, with spikes at seemingly random intervals. Cause: cuDNN allocates "workspace" memory on the first call to certain algorithms. TensorRT's first inference may trigger cuBLAS handle creation, which calls cudaMalloc internally. Metric: Profile with nsys profile --trace=cuda and search for cudaMalloc/cudaFree calls during steady-state inference. Fix: Pre-warm all code paths during initialization. Run one dummy inference per model. Set CUDA_MODULE_LOADING=LAZY to defer module loading (reduces startup time). Use cudaMallocAsync with a pool so even surprise allocations are fast.
Failure 3: Priority inversion with CUDA streams.
Symptom: High-priority inference stream (perception) gets delayed by low-priority stream (logging/telemetry). Cause: CUDA stream priorities only affect kernel scheduling, not memory transfers. If a low-priority stream's large H2D copy saturates the PCIe bus, the high-priority stream's small copy waits behind it. Metric: Use nsys to visualize stream timelines — look for high-priority kernels waiting on copy engine. Fix: Use separate copy engines (cudaMemcpyPeerAsync) for different priority levels. Or batch low-priority copies into a single large transfer during non-critical windows.
Failure 4: Memory corruption from shared execution contexts.
Symptom: Inference output is non-deterministically wrong — sometimes correct, sometimes garbage. Cause: Two threads share one IExecutionContext and call enqueueV3 concurrently. TensorRT contexts are NOT thread-safe — they use internal buffers that get corrupted by concurrent access. Metric: Add a mutex around enqueueV3 and check if the problem disappears. Fix: Create one IExecutionContext per thread. Multiple contexts can share one engine (the engine is thread-safe), but each context must be used by only one thread at a time. Pre-create contexts during initialization.
Your perception foundation model has 3 billion parameters and trains on 2 million driving scenes. A single A100 can process 2 scenes/second. At that rate: 2M / 2 / 3600 = 278 hours — 11.6 days for one epoch. You need 8 epochs, so that's 93 days on one GPU. Unacceptable. This chapter derives the math behind every parallelism strategy, computes exact memory breakdowns, and shows you the code to make 64 GPUs work together.
Let's start with what "ideal scaling" means and why you never achieve it.
Distributed Data Parallel (DDP) is the workhorse of distributed training. Every GPU holds a complete copy of the model. Each GPU processes a different mini-batch. After the backward pass, gradients are averaged across all GPUs using AllReduce.
Let's trace AllReduce with actual data. Say we have 4 GPUs and a gradient tensor of 4 elements:
Ring-AllReduce solves this by arranging GPUs in a ring. Each GPU sends/receives 1/P of the data per step. After 2(P-1) steps, all GPUs have the full average. Every GPU's bandwidth is utilized equally.
Bucket gradient fusion is DDP's other key optimization. Instead of AllReducing each parameter tensor separately (hundreds of small messages with high per-message overhead), DDP groups gradients into buckets (default 25 MB each) and AllReduces entire buckets. Fewer messages, better bandwidth utilization.
Communication/computation overlap: DDP starts AllReducing the last layer's gradients while the backward pass is still computing earlier layers' gradients. Since the backward pass proceeds layer by layer from output to input, the last layer's gradients are ready first. By the time the backward pass finishes, most of the AllReduce is already done.
Before understanding ZeRO, you need to know what consumes memory during training. For a model with Ψ parameters:
Now let's compute exact per-GPU memory for a 3B parameter model on 8 GPUs:
| Component | Formula | ZeRO-0 (DDP) | ZeRO-1 | ZeRO-2 | ZeRO-3 |
|---|---|---|---|---|---|
| FP16 Params | Ψ × 2 | 6.0 GB | 6.0 GB | 6.0 GB | 0.75 GB |
| FP16 Gradients | Ψ × 2 | 6.0 GB | 6.0 GB | 0.75 GB | 0.75 GB |
| FP32 Master | Ψ × 4 | 12.0 GB | 1.5 GB | 1.5 GB | 1.5 GB |
| FP32 Adam m | Ψ × 4 | 12.0 GB | 1.5 GB | 1.5 GB | 1.5 GB |
| FP32 Adam v | Ψ × 4 | 12.0 GB | 1.5 GB | 1.5 GB | 1.5 GB |
| Total per GPU | 48.0 GB | 16.5 GB | 11.25 GB | 6.0 GB |
Fully Sharded Data Parallel (FSDP) is PyTorch's native implementation of ZeRO-3. Each GPU stores only 1/P of the model parameters. Before computing a layer, GPUs gather the full parameters. After the layer, they discard the non-owned portion.
ZeRO/FSDP shards the model across GPUs but each GPU runs the full computation after gathering. Tensor parallelism splits individual matrix operations across GPUs — each GPU computes part of the result, then they combine.
Megatron-LM splits the MLP block of a transformer across GPUs:
Pipeline parallelism assigns different layers to different GPUs. GPU 0 runs layers 0-7, GPU 1 runs layers 8-15, etc. The problem: when GPU 0 is processing a batch's forward pass, GPU 1 is idle (waiting for GPU 0's output). This creates pipeline bubbles.
PipeDream (1F1B schedule) interleaves forward and backward micro-batches to reduce the bubble further. After the pipeline fills (the first micro-batch reaches the last stage), each GPU alternates: one forward, one backward, one forward, one backward. This keeps all GPUs busy and limits activation memory to at most P micro-batches worth (instead of M with GPipe).
Modern training uses mixed precision — different precisions for different parts of the computation. Here's why each piece exists:
Loss scaling (needed for FP16, less critical for BF16): multiply the loss by a large constant (e.g., 1024) before backward. This scales all gradients up, preventing underflow in FP16. After backward, divide gradients by the same constant before the optimizer step. If gradients overflow (Inf/NaN), reduce the scale factor and skip the step.
During forward, each layer's activations must be saved for the backward pass. For a model with L layers and activations of size A per layer, that's L × A memory. For a 32-layer model with large batch size, this can be 10-20 GB.
python # PyTorch DDP — the gold standard for data parallelism import os import torch import torch.distributed as dist from torch.nn.parallel import DistributedDataParallel as DDP def setup_ddp(): # NCCL backend: optimized for GPU-to-GPU communication (NVLink, InfiniBand) dist.init_process_group(backend="nccl") local_rank = int(os.environ["LOCAL_RANK"]) # set by torchrun torch.cuda.set_device(local_rank) return local_rank def train_ddp(model, train_loader, optimizer, local_rank): model = model.cuda(local_rank) model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25) # gradient bucket size for AllReduce # larger = better bandwidth utilization # smaller = earlier overlap start for epoch in range(num_epochs): train_loader.sampler.set_epoch(epoch) # shuffle differently per epoch for batch in train_loader: optimizer.zero_grad() loss = model(batch).loss loss.backward() # DDP hooks trigger AllReduce here optimizer.step() # all GPUs have same avg gradients
python # Gradient accumulation — simulate larger batch without more memory # Effective batch = micro_batch_size × accumulation_steps × num_gpus accumulation_steps = 4 for i, batch in enumerate(train_loader): loss = model(batch).loss / accumulation_steps # scale loss! loss.backward() # gradients accumulate in .grad if (i + 1) % accumulation_steps == 0: optimizer.step() # only AllReduce + step every N micro-batches optimizer.zero_grad() # clear accumulated gradients # WHY divide loss by accumulation_steps: # .backward() ADDS to .grad (doesn't replace). After 4 backward() calls, # .grad contains the SUM of 4 mini-batch gradients. # We want the MEAN. So scale each loss by 1/4 before backward.
json // DeepSpeed config for ZeRO-2 (ds_config.json) { "zero_optimization": { "stage": 2, // shard optimizer states + gradients "allgather_partitions": true, // AllGather after ReduceScatter "reduce_scatter": true, // use ReduceScatter instead of AllReduce "overlap_comm": true, // overlap gradient comm with backward "contiguous_gradients": true // pack gradients contiguously for faster NCCL }, "bf16": { "enabled": true }, "gradient_accumulation_steps": 4, "train_micro_batch_size_per_gpu": 8 }
python # PyTorch FSDP (ZeRO-3 equivalent) from torch.distributed.fsdp import FullyShardedDataParallel as FSDP from torch.distributed.fsdp import MixedPrecision, ShardingStrategy # Mixed precision policy: BF16 compute, FP32 reduce, FP32 output mp_policy = MixedPrecision( param_dtype=torch.bfloat16, # parameters cast to BF16 for forward/backward reduce_dtype=torch.float32, # gradients reduced in FP32 (avoid precision loss) buffer_dtype=torch.bfloat16, ) model = FSDP( model, sharding_strategy=ShardingStrategy.FULL_SHARD, # ZeRO-3: shard everything mixed_precision=mp_policy, auto_wrap_policy=size_based_auto_wrap_policy, # wrap layers > 100M params device_id=local_rank, ) # Training loop is identical to standard PyTorch # FSDP handles AllGather/ReduceScatter transparently for batch in train_loader: loss = model(batch).loss loss.backward() optimizer.step() optimizer.zero_grad()
Failure 1: Gradient NaN after scaling to 64 GPUs.
Symptom: Training loss goes to NaN within the first 100 steps. Works fine on 8 GPUs. Cause: Effective batch size is now 8x larger (64 vs 8 GPUs). The learning rate that worked for batch=256 is too large for batch=2048 — gradient steps overshoot. Metric: Monitor gradient norm per step. If it spikes above 10x normal before NaN, it's an LR issue. Fix: Apply the linear scaling rule: lr_new = lr_base × (batch_new / batch_base). Use warmup for the first 1-5% of steps. For very large batches (>8K), use sqrt scaling instead of linear, or LARS/LAMB optimizers that adapt per-parameter.
Failure 2: Communication bottleneck — 32 GPUs is slower than 16.
Symptom: Wall-clock time per step increases when adding GPUs. Cause: The 16 → 32 GPU transition often crosses a node boundary. Intra-node bandwidth (NVLink: 600 GB/s per A100) vs inter-node bandwidth (InfiniBand: 25-50 GB/s) — a 12-24x gap. AllReduce of 6 GB of gradients now hits the inter-node bottleneck. Metric: Profile with torch.profiler. If AllReduce takes >30% of step time, communication is the bottleneck. Fix: (1) Increase per-GPU batch size (more compute per comm round). (2) Gradient accumulation (fewer AllReduce calls). (3) Switch from DDP to ZeRO-1 (shards only optimizer states, reducing AllReduce volume by 3x). (4) Use hierarchical AllReduce: AllReduce within nodes (fast NVLink), then AllReduce across nodes (slower InfiniBand).
Failure 3: OOM despite ZeRO-3/FSDP sharding.
Symptom: CUDA OOM on the first batch, even though ZeRO-3 should reduce per-GPU memory. Cause: Activations are NOT sharded by ZeRO — they stay on the GPU that computed them. With large batch size or long sequences, activations dominate memory. For a 3B model with batch=8, seq=2048: activations can be 15-20 GB, dwarfing the 6 GB sharded model state. Metric: Use torch.cuda.max_memory_allocated() to track peak usage. Compare with expected model state size. Fix: Enable activation checkpointing: torch.utils.checkpoint.checkpoint(layer, input). This recomputes activations during backward instead of storing them, trading ~33% more compute for 3-5x less activation memory.
Failure 4: Loss divergence when mixing parallelism strategies.
Symptom: Training with tensor parallelism + data parallelism produces different (worse) results than pure DDP. Cause: Tensor parallelism introduces AllReduce operations inside the forward pass. If these AllReduces use a different communication group than DDP's gradient AllReduce, the effective gradient averaging is wrong — some gradients get double-counted. Metric: Compare gradient norms between pure DDP and hybrid runs. Fix: Carefully set up process groups: tensor parallel (TP) group for intra-layer communication, data parallel (DP) group for gradient sync across TP-identical replicas. The product of TP_size × DP_size must equal total GPUs. Never put the same GPU in both groups for the same collective.
Your perception model runs at 85ms on an Orin. The budget is 50ms. Your manager asks: "What's the plan?" A junior engineer would reply "I'll try INT8" or "I'll try fusing some ops." A staff engineer says: "I'll profile it first and tell you tomorrow exactly which ops to optimize and the expected savings." GPU profiling is the difference between spending a week optimizing the wrong layer and spending a day fixing the actual bottleneck.
Profiling is not glamorous. It's not even hard, technically. But it's the single most leveraged skill in performance engineering. Every hour spent profiling correctly saves ten hours of misdirected optimization. This chapter teaches you the full profiling stack, the roofline model for reasoning about performance limits, and a systematic decision tree for turning profile data into optimization actions.
Think of GPU profiling as a microscope with four zoom levels. Each level reveals different information, and you use them in order: coarse first, fine only when you've identified a specific target.
| Tool | Level | What it shows | When to use | Overhead |
|---|---|---|---|---|
| torch.profiler | Python/Op | Per-op CPU and CUDA time, memory allocation, Python stack traces, tensor shapes | First pass — which ops dominate wall-clock time | Low (~5%) |
| Nsight Systems (nsys) | System timeline | CPU/GPU timeline, kernel launches, memory copies (H2D/D2H), NCCL collectives, CUDA streams, API calls | Find idle gaps, CPU/GPU overlap, serialization, data transfer bottlenecks | Low (~2%) |
| Nsight Compute (ncu) | Single kernel | Per-kernel: achieved occupancy, memory throughput, compute throughput, warp stall reasons, instruction mix, "speed of light" analysis | Deep-dive into one specific slow kernel | High (10-100x slower) |
| CUDA Events | Custom regions | Precise GPU-side timing of arbitrary code sections | Production latency monitoring, A/B testing optimizations | Negligible |
The PyTorch profiler wraps your model execution and records every operator call — both on CPU and GPU. It captures: which ATen operator was called, how long it took on CPU vs CUDA, how much memory it allocated, and the Python call stack that triggered it. The output is a table you can sort by different columns.
Critical detail: GPU operations are asynchronous. When PyTorch calls torch.mm(), the CPU just enqueues the kernel and returns immediately. The actual work happens later on the GPU. The profiler uses CUDA events behind the scenes to measure real GPU time, but you must understand that cpu_time and cuda_time in the output mean different things. CPU time is the time the CPU spent setting up the launch. CUDA time is the time the GPU actually spent computing.
python import torch from torch.profiler import profile, ProfilerActivity, schedule # CRITICAL: warmup iterations. The first 1-3 runs are slow because: # - CUDA context initialization (first kernel launch) # - cuBLAS/cuDNN autotuning (selects best algorithm) # - Memory allocator caching (first allocs go to cudaMalloc) # - JIT compilation (torch.compile, TensorRT) # If you include warmup in your profile, you'll get misleading results. model = load_perception_model() model.eval().cuda() input_batch = create_sample_input().cuda() # Warmup: 3 iterations to stabilize with torch.no_grad(): for _ in range(3): model(input_batch) torch.cuda.synchronize() # ensure warmup kernels finish # Profile: schedule controls warmup/active/repeat with profile( activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], schedule=schedule(wait=1, warmup=1, active=5, repeat=1), record_shapes=True, # log tensor shapes per op profile_memory=True, # track allocations with_stack=True, # Python call stacks with_flops=True, # estimate FLOPs per op on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"), ) as prof: with torch.no_grad(): for step in range(7): # 1 wait + 1 warmup + 5 active model(input_batch) prof.step() # Print top 20 ops sorted by total CUDA time print(prof.key_averages().table( sort_by="cuda_time_total", row_limit=20, header="Sorted by cuda_time_total" )) # Also useful: sort by self_cuda_time to find leaf ops print(prof.key_averages().table( sort_by="self_cuda_time_total", row_limit=20 )) # Export Chrome trace for visual inspection prof.export_chrome_trace("perception_trace.json") # Open in chrome://tracing or ui.perfetto.dev
Let's walk through a real profile output from a BEV perception model. Here is what sort_by="cuda_time_total" produces:
Now the question: what do we optimize? Matmul at 33% seems like the biggest target, but matmuls are already highly optimized (cuBLAS, tensor cores). The real wins are in the memory-bound operations: softmax, LayerNorm, GeLU, and cat — collectively 48% of time, and all amenable to kernel fusion.
The roofline model is the single most important mental model for GPU performance. It tells you the theoretical maximum performance of any kernel, given its arithmetic intensity — the ratio of compute operations to memory traffic.
Every GPU has two ceilings:
Now let's compute the arithmetic intensity for the ops in our profile:
Nsight Systems (nsys) gives you a timeline view of your entire application: what the CPU was doing, what the GPU was doing, and — critically — when the GPU was idle. It captures CUDA API calls, kernel launches, memory copies, NCCL collectives, and CPU thread activity.
bash # Capture a trace (10 seconds of execution) nsys profile \ --trace=cuda,nvtx,osrt \ --output=perception_profile \ --force-overwrite=true \ python run_inference.py # For training profiling (capture NCCL too): nsys profile \ --trace=cuda,nvtx,osrt,cudnn,cublas \ --cuda-graph-trace=node \ --output=training_profile \ torchrun --nproc_per_node=8 train.py # Open in Nsight Systems GUI: nsys-ui perception_profile.nsys-rep
What to look for in the timeline:
| Pattern | What you see | Root cause | Fix |
|---|---|---|---|
| GPU idle gaps | Empty bands between kernel rows | CPU can't enqueue kernels fast enough (Python overhead, data loading, synchronization calls) | CUDA graphs, async data loading, reduce Python overhead (torch.compile) |
| Long H2D copies | Wide blue bars on memory copy row | Large tensors being transferred CPU→GPU each iteration | Pin memory, pre-allocate GPU buffers, pipeline data transfers with compute |
| Tiny kernels | Hundreds of thin green bars with gaps | Many small ops with per-launch overhead (~5μs each) | Kernel fusion, CUDA graphs, torch.compile |
| NCCL blocking | Long red bars blocking GPU compute | AllReduce/AllGather waiting for slow node | Overlap compute with communication (pipeline parallelism), check network |
| cudaMalloc spikes | Tall bars on CUDA API row | Dynamic memory allocation during inference | Pre-allocate all memory, use memory pools, CUDA caching allocator |
Once Nsight Systems tells you which kernel to optimize, Nsight Compute (ncu) tells you how. It runs the target kernel hundreds of times with hardware counters enabled, collecting metrics like achieved occupancy, memory throughput as a percentage of peak, compute throughput as a percentage of peak, and the exact reasons warps are stalling.
bash # Profile a specific kernel (by name regex) ncu --kernel-name "softmax" \ --set full \ --launch-count 10 \ python run_inference.py # Key sections in the report: # "Speed of Light" — % of peak compute and memory bandwidth achieved # "Warp Stall Reasons" — why threads are waiting # "Occupancy" — % of max possible active warps # "Memory Workload" — L1/L2/HBM hit rates, throughput
The Speed of Light (SOL) chart is the most important section. It shows two bars: compute throughput (% of peak FLOPS) and memory throughput (% of peak bandwidth). The interpretation:
CUDA events are lightweight GPU-side timestamps. Unlike Python's time.time(), they measure actual GPU execution time, accounting for the asynchronous nature of CUDA. Use them for production latency monitoring where the profiler overhead is unacceptable.
python # CUDA event timing — the right way to measure GPU latency def time_inference(model, input_batch, n_runs=100): # Create events start = torch.cuda.Event(enable_timing=True) end = torch.cuda.Event(enable_timing=True) # Warmup for _ in range(5): model(input_batch) torch.cuda.synchronize() # Timed runs times = [] for _ in range(n_runs): start.record() # GPU-side timestamp model(input_batch) end.record() # GPU-side timestamp torch.cuda.synchronize() # wait for GPU to finish times.append(start.elapsed_time(end)) # milliseconds import numpy as np print(f"Mean: {np.mean(times):.2f} ms") print(f"Std: {np.std(times):.2f} ms") print(f"P50: {np.percentile(times, 50):.2f} ms") print(f"P95: {np.percentile(times, 95):.2f} ms") print(f"P99: {np.percentile(times, 99):.2f} ms") return times # WARNING: Common mistake — timing without synchronize: # start = time.time() # model(input_batch) # returns IMMEDIATELY (async) # elapsed = time.time() - start # measures CPU time, NOT GPU time! # This gives you ~0.1ms regardless of model size. Meaningless.
Every CUDA kernel launch has CPU-side overhead: ~5-10 microseconds for argument packing, driver calls, and scheduler interaction. For a single large matmul that runs for 2ms, this overhead is negligible (0.5%). But for a model with 500 small kernels each running 10μs, the launch overhead dominates: 500 × 7μs = 3.5ms of pure overhead vs 5ms of useful compute.
CUDA Graphs solve this by recording a sequence of kernel launches once, then replaying the entire sequence with a single CPU call. The GPU sees the same kernels in the same order with the same arguments — but the CPU overhead is amortized to nearly zero.
python # CUDA Graph capture pattern # Step 1: Warmup (required — ensures cuBLAS selects algorithms) for _ in range(3): output = model(static_input) # must use SAME tensors for capture torch.cuda.synchronize() # Step 2: Capture graph = torch.cuda.CUDAGraph() with torch.cuda.graph(graph): # All operations recorded — NOT executed yet static_output = model(static_input) # Step 3: Replay (in production loop) for real_input in data_stream: # Copy new data into the SAME input buffer static_input.copy_(real_input) graph.replay() # single CPU call replays ALL kernels result = static_output.clone() # read from the SAME output buffer # KEY CONSTRAINT: CUDA graphs require STATIC shapes. # No dynamic branching, no shape-dependent ops, no Python control flow. # This is why they work great for fixed-shape inference # but poorly for dynamic-length text generation.
Here's the complete decision tree a staff engineer follows when optimizing a model:
Let's apply the decision tree to our 85ms perception model:
Failure 1: Misleading profiles from cold cache. You profile 5 iterations without warmup. The first 2 iterations include CUDA context init (200-500ms), cuBLAS autotuning (50-100ms per matmul shape), and caching allocator warmup. Your average shows 150ms instead of the true 85ms, and the breakdown is dominated by initialization artifacts. Symptom: "cudaMalloc" or "cudaFuncGetAttributes" appears in the top 10 ops. Fix: Always run 3+ warmup iterations before profiling. Use the schedule(wait=1, warmup=1, active=5) pattern shown above.
Failure 2: Profiling overhead distorts results. Nsight Compute runs kernels 100-1000x slower to collect hardware counters. If your kernel has timing-dependent behavior (e.g., polling loops, spin-locks), the profiled version behaves differently from production. Symptom: The profiled kernel takes 100x longer and shows different bottlenecks than expected. Fix: Use ncu for targeted kernel analysis only. Use CUDA events or nsys for overall timing.
Failure 3: Attributing time to the wrong op due to async execution. You see torch.cuda.synchronize() taking 50ms in the CPU trace. It's not synchronize that's slow — it's the preceding GPU work that hasn't finished. Synchronize just makes the CPU wait. Symptom: CPU profile shows most time in synchronize or cudaStreamSynchronize. Fix: Look at CUDA time, not CPU time. Use the profiler's CUDA time columns.
Failure 4: Missing the CPU bottleneck because you only profiled the GPU. Your GPU profile shows 40ms of kernel time. But wall-clock latency is 80ms. The other 40ms is CPU-side: data preprocessing (resize, normalize in NumPy), Python overhead (GIL contention, object creation), and DataLoader stalls. Symptom: Nsight Systems shows large gaps between GPU kernels. Fix: Profile CPU and GPU together. Use nsys to see the full timeline. Move preprocessing to GPU (DALI, TorchVision transforms on GPU, or kornia).
Drag the arithmetic intensity slider to see where different ops fall. Operations below the roofline are limited by either compute or memory bandwidth.
Chapter 1 taught you to make weights smaller (quantization). This chapter teaches you to make models fewer. A 3B-parameter perception model at INT8 is 3GB. But what if you could remove half the parameters and still hit your accuracy target? Now it's 1.5GB at INT8, fits in a smaller memory budget, uses less power, and runs faster because there's simply less work to do. That's compression.
There are three pillars: pruning (removing parameters), distillation (training a smaller model to mimic a larger one), and parameter-efficient fine-tuning (adapting a foundation model without copying all its weights). Each addresses a different deployment constraint. This chapter derives each from first principles.
The simplest idea in pruning: some weights are near zero. Remove them. Magnitude pruning sorts all weights by absolute value, zeros out the smallest ones, and (optionally) fine-tunes the remaining network.
Concretely, let's prune a weight matrix to 50% sparsity:
This illustrates the fundamental problem with unstructured pruning: the resulting matrix has zeros scattered randomly. A GPU can't skip individual zeros in a dense matrix multiply — it still loads the full row, multiplies everything (zeros produce zero results but still consume cycles), and writes the full output. Unstructured sparsity doesn't speed up inference on standard hardware.
Structured pruning removes entire structural units — channels, attention heads, or entire layers — rather than individual weights. The result is a smaller but dense model that runs faster on any hardware without special sparse support.
| Granularity | What's removed | Effect on architecture | Typical accuracy cost |
|---|---|---|---|
| Channel pruning | Entire conv filter channels | Reduces conv width, shrinks next layer's input | 0.5-2% for 30% channels removed |
| Head pruning | Attention heads in transformer | Reduces MHA width, shrinks QKV projections | 0.2-1% for 25% heads removed |
| Layer pruning | Entire transformer layers | Reduces model depth | 1-3% for removing 2 of 24 layers |
| Block pruning | Contiguous weight blocks (e.g. 32×32) | Smaller dense sub-matrices | 0.3-1% at 50% sparsity |
NVIDIA's Ampere and later GPUs support a specific sparsity pattern in hardware: for every group of 4 consecutive elements, exactly 2 must be zero. This is called 2:4 fine-grained structured sparsity. The hardware stores only the 2 non-zero values plus a 2-bit index indicating their positions, achieving 2x compression and 2x throughput on sparse tensor cores.
Magnitude pruning is simple but naive — a weight might be small because it operates on large activations, making its contribution significant despite its magnitude. Three importance criteria:
Instead of compressing a large model, train a small model from scratch — but teach it to mimic the large model's soft outputs, not just the hard labels. The large model (the teacher) produces probability distributions that contain more information than one-hot labels: "this is 85% car, 10% truck, 5% van" teaches the student that cars and trucks look similar, something a one-hot label "car" doesn't convey.
The key mechanism is temperature scaling. The standard softmax produces sharp distributions (one probability near 1, rest near 0). By dividing the logits by a temperature T > 1 before softmax, you soften the distribution, revealing the teacher's "dark knowledge" about inter-class relationships.
The distillation loss combines the soft target loss (KL divergence from teacher) with the hard target loss (cross-entropy with ground truth):
python import torch import torch.nn as nn import torch.nn.functional as F def distillation_loss( student_logits, # [B, num_classes] teacher_logits, # [B, num_classes] labels, # [B] ground truth T=4.0, # temperature alpha=0.7, # weight for soft loss ): # Soft targets from teacher (no grad — teacher is frozen) with torch.no_grad(): soft_teacher = F.softmax(teacher_logits / T, dim=-1) # Soft predictions from student soft_student = F.log_softmax(student_logits / T, dim=-1) # KL divergence (soft loss) # KL(P||Q) = sum(P * log(P/Q)) = sum(P * log(P)) - sum(P * log(Q)) # F.kl_div expects log(Q) as input and P as target soft_loss = F.kl_div( soft_student, soft_teacher, reduction="batchmean" ) * (T * T) # T^2 correction for gradient magnitude # Hard loss (standard cross-entropy with ground truth) hard_loss = F.cross_entropy(student_logits, labels) # Combined return alpha * soft_loss + (1 - alpha) * hard_loss
You have a pre-trained 3B perception foundation model. You want to adapt it for a specific driving domain (e.g., snow conditions in Nordic countries). Full fine-tuning requires storing 3B gradients plus optimizer states (2x for Adam) — that's 3B + 6B = 9B parameters in FP32 = 36GB. That doesn't fit on one GPU.
LoRA (Low-Rank Adaptation) freezes the entire base model and adds small trainable "adapter" matrices. The key insight: the weight updates during fine-tuning have low intrinsic rank. When you fine-tune a 4096×4096 weight matrix, the actual change ΔW = Wfinetuned - Wpretrained can be well approximated by a rank-16 matrix. That's because fine-tuning adjusts the model along a low-dimensional manifold, not across all 16 million dimensions.
Why is A initialized randomly and B initialized to zero? This is a critical design choice. At the start of fine-tuning, ΔW = BA = 0 (because B is all zeros), so the model starts from exactly the pre-trained weights. If both A and B were random, the model would start from a randomly-perturbed version of the pre-trained model, losing the benefit of pre-training. The random initialization of A provides the "search directions" for adaptation, while B's zero initialization ensures a stable starting point.
The α/r scaling factor controls the learning rate of the LoRA update relative to the base model. When you increase rank r, you have more parameters and the update magnitude grows. Dividing by r keeps the effective update magnitude constant, so you don't need to re-tune the learning rate when changing rank.
python import torch import torch.nn as nn import math class LoRALinear(nn.Module): """Low-Rank Adaptation for a linear layer.""" def __init__(self, base_linear, r=16, alpha=32, dropout=0.05): super().__init__() self.base = base_linear self.base.weight.requires_grad = False # freeze base if self.base.bias is not None: self.base.bias.requires_grad = False d_out, d_in = base_linear.weight.shape # A: random init (Kaiming) — provides search directions self.A = nn.Parameter(torch.empty(r, d_in)) nn.init.kaiming_uniform_(self.A, a=math.sqrt(5)) # B: zero init — ensures delta_W = 0 at start self.B = nn.Parameter(torch.zeros(d_out, r)) # Scaling: alpha/r keeps update magnitude constant across ranks self.scale = alpha / r # Optional dropout on the LoRA path for regularization self.dropout = nn.Dropout(p=dropout) if dropout > 0 else nn.Identity() def forward(self, x): # Base path: frozen pre-trained weights base_out = self.base(x) # [B, seq, d_out] # LoRA path: low-rank update lora_out = self.dropout(x) @ self.A.T # [B, seq, r] lora_out = lora_out @ self.B.T # [B, seq, d_out] return base_out + self.scale * lora_out def merge_weights(self): """Merge LoRA into base weight for zero-overhead inference.""" # W' = W + scale * B @ A # CRITICAL: do this in float32 to avoid precision loss delta = self.scale * (self.B.float() @ self.A.float()) self.base.weight.data += delta.to(self.base.weight.dtype) def save_adapter(self, path): """Save only the LoRA weights (tiny file).""" torch.save({ "A": self.A.data, "B": self.B.data, "scale": self.scale, }, path) # ~500KB for r=16, d=4096 def apply_lora_to_model(model, r=16, alpha=32, target_modules=["q_proj", "v_proj"]): """Replace target linear layers with LoRA versions.""" for name, module in model.named_modules(): if isinstance(module, nn.Linear): if any(t in name for t in target_modules): parent = model parts = name.split(".") for p in parts[:-1]: parent = getattr(parent, p) setattr(parent, parts[-1], LoRALinear(module, r, alpha)) # Count trainable vs total params trainable = sum(p.numel() for p in model.parameters() if p.requires_grad) total = sum(p.numel() for p in model.parameters()) print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
QLoRA (Dettmers et al., 2023) combines three ideas to let you fine-tune massive models on a single GPU: (1) quantize the base model to 4-bit NormalFloat (NF4), (2) use LoRA adapters at FP16/BF16, (3) use paged optimizers that spill to CPU RAM when GPU memory is full.
| Method | Innovation | Benefit |
|---|---|---|
| DoRA (2024) | Decomposes W into magnitude m and direction V. Applies LoRA only to the direction component, keeps magnitude separate. | Closes the gap between LoRA and full fine-tuning. Particularly effective for large domain shifts. |
| LoRA+ (2024) | Uses different learning rates for A and B matrices. B gets a higher LR because it starts from zero. | Faster convergence, 1-2% better accuracy with no extra cost. |
| AdaLoRA (2023) | Adaptively allocates rank budget across layers based on importance scoring. Important layers get higher rank. | Same total parameter budget, better accuracy. Particularly useful when total adapter budget is constrained. |
| VeRA (2024) | Shares A and B matrices across all layers, only trains per-layer scaling vectors d and b. | 10x fewer params than LoRA with similar accuracy. Extreme compression. |
python import torch import torch.nn as nn def prune_attention_heads(model, head_importance, prune_ratio=0.25): """Prune least-important attention heads.""" # head_importance: [num_layers, num_heads] tensor # Computed by: average attention entropy, Taylor importance, or # gradient-based importance over a calibration set num_layers, num_heads = head_importance.shape n_prune = int(num_heads * prune_ratio) for layer_idx in range(num_layers): # Find least important heads in this layer _, prune_idx = torch.topk( head_importance[layer_idx], n_prune, largest=False ) layer = model.layers[layer_idx].self_attn d_head = layer.head_dim # Zero out Q, K, V projections for pruned heads for idx in prune_idx: start = idx * d_head end = (idx + 1) * d_head layer.q_proj.weight.data[start:end, :] = 0 layer.k_proj.weight.data[start:end, :] = 0 layer.v_proj.weight.data[start:end, :] = 0 layer.o_proj.weight.data[:, start:end] = 0 # For real deployment: remove the zeroed dimensions entirely # to create a smaller dense model (not shown — requires # reshaping all affected weight matrices) return model def compute_head_importance(model, calib_loader, criterion): """Taylor-expansion head importance scoring.""" num_layers = len(model.layers) num_heads = model.config.num_attention_heads importance = torch.zeros(num_layers, num_heads) model.eval() for batch in calib_loader: output = model(batch["input"]) loss = criterion(output, batch["target"]) loss.backward() for i, layer in enumerate(model.layers): w = layer.self_attn.q_proj.weight # [num_heads*d_head, d_model] g = w.grad # Taylor importance: |w * grad| summed per head head_scores = (w * g).abs().view(num_heads, -1).sum(dim=1) importance[i] += head_scores.detach().cpu() model.zero_grad() return importance / len(calib_loader)
Failure 1: Pruning the wrong layers. You uniformly prune 30% of channels from every layer. Accuracy drops 5% instead of the expected 0.5%. The problem: early layers in vision models extract fundamental features (edges, textures) and are very sensitive to pruning. Later layers are more redundant. Symptom: Sudden accuracy collapse when removing "just a few more" channels. Fix: Run per-layer sensitivity analysis. Prune each layer independently and measure accuracy impact. Use the importance scoring methods above. Typical result: layers 1-3 tolerate only 10% pruning, layers 12-24 tolerate 50%+.
Failure 2: LoRA merged model gives worse results than adapter version. Your LoRA adapter at FP16 works great. You merge it into the base model: W' = W + scale * B @ A. Now accuracy drops. The problem is numerical: if W is in INT4 (QLoRA) and B, A are in FP16, the multiplication B @ A produces an FP16 result, and adding it to INT4 weights requires dequantizing, adding, and re-quantizing — each step losing precision. Fix: Always merge in FP32: W' = W.float() + scale * (B.float() @ A.float()), then re-quantize the merged weight.
Failure 3: Knowledge distillation diverges. The student loss drops for 5 epochs then explodes. Common causes: (1) temperature too high (T > 20) makes the teacher distribution nearly uniform, providing no signal. (2) Learning rate too high for the KL term. (3) Teacher and student architectures are too different for logit-level distillation. Fix: Start with T=4, α=0.5. If diverges, try feature distillation (match intermediate layer representations instead of final logits). Add gradient clipping.
Failure 4: Over-compression cascade. You quantize to INT8, then prune 40%, then distill to a model half the depth. Each step individually loses 0.5% accuracy. But combined, the loss is 6%, not 1.5%. Compression errors compound non-linearly. Fix: Compress in one step if possible (e.g., pruning-aware QAT). Or apply compressions from most aggressive to least aggressive, fine-tuning between each step.
Adjust the compression ratio to see the accuracy-size tradeoff. Each technique traces a different curve.
A camera sees a dark blob on the road ahead. Is it a shadow, a pothole, or a pedestrian in dark clothing? The camera alone can't tell — it has no depth information. A LiDAR scan shows a cluster of points 1.2 meters tall at 35 meters distance. Is it a pedestrian, a mailbox, or a traffic cone? Without color or texture, LiDAR can't tell. But together — a 1.2m-tall point cluster at 35m with camera features showing dark clothing and human limbs — the answer is unambiguous. This is why autonomous vehicles use multiple sensors, and why fusing them correctly is one of the hardest perception problems.
As the inference engineer, you don't design the fusion architecture (that's the perception researcher's job). But you must understand it deeply enough to: profile it, optimize it, deploy it to the vehicle SOC, debug it when it fails in the field, and explain to the planner team why perception confidence dropped in a specific scenario. This chapter gives you that understanding.
| Sensor | Output format | Rate | Strengths | Failure modes |
|---|---|---|---|---|
| Camera | [H,W,3] uint8 image per cam. Typical: 1920×1080 or 1600×900. 6-8 cameras for 360° coverage. | 30 Hz | Rich semantics (color, texture, signs, lane markings), cheap, high resolution | Glare (direct sun), darkness (no illumination), fog (scatter), occlusion, no depth |
| LiDAR | Point cloud [N, 4]: (x, y, z, intensity). N ≈ 100K-300K points per sweep. Multiple returns per beam. | 10-20 Hz | Precise 3D geometry (cm accuracy), works in dark, range up to 200m | Rain/snow scatter (false returns), dust, sparse at long range, no color, expensive |
| Radar | Detections [M, 5]: (range, azimuth, elevation, Doppler velocity, RCS). M ≈ 64-256 detections. Or: range-Doppler-azimuth tensor for 4D imaging radar. | 13-20 Hz | Direct radial velocity, all-weather (rain, fog, snow), long range (>200m), cheap | Low angular resolution, multipath reflections (under bridges), no height (2D radar) |
| IMU | [6]: (ax, ay, az, ωx, ωy, ωz). Linear acceleration + angular velocity. | 100-1000 Hz | Very fast, no external dependency, measures ego-motion directly | Drift over time (requires fusion with GPS/vision), vibration noise |
Each sensor lives in its own coordinate frame. A LiDAR point at (3, 2, 1) in the LiDAR frame is not the same world position as a camera pixel at row 200, column 400. To fuse sensors, you must transform all measurements into a common frame. This requires extrinsic calibration — the rigid-body transform (rotation + translation) between each sensor and the vehicle body.
The pinhole camera model maps 3D world points to 2D image pixels. Understanding this projection is essential for fusing LiDAR points with camera features — you need to know which image pixel corresponds to which 3D point.
Given a LiDAR point and calibration matrices, let's compute exactly which pixel it maps to:
Raw LiDAR data is an unordered set of 3D points — not a grid, not a sequence. Three approaches for neural network processing:
python import torch import numpy as np def voxelize_pointcloud( points, # [N, 4] (x, y, z, intensity) voxel_size, # [3] (dx, dy, dz) in meters, e.g., [0.1, 0.1, 0.2] point_range, # [6] (x_min, y_min, z_min, x_max, y_max, z_max) max_points=32, # max points per voxel max_voxels=40000, ): """Convert raw point cloud to voxel grid.""" # Step 1: Filter points within range mask = ( (points[:, 0] >= point_range[0]) & (points[:, 0] < point_range[3]) & (points[:, 1] >= point_range[1]) & (points[:, 1] < point_range[4]) & (points[:, 2] >= point_range[2]) & (points[:, 2] < point_range[5]) ) points = points[mask] # [M, 4] where M ≤ N # Step 2: Compute voxel indices coords = ((points[:, :3] - point_range[:3]) / voxel_size).astype(np.int32) # coords: [M, 3] — integer voxel (ix, iy, iz) for each point # Step 3: Group points by voxel # Hash voxel coordinates for unique identification grid_size = ((point_range[3:6] - point_range[:3]) / voxel_size).astype(np.int32) voxel_hash = coords[:, 0] * grid_size[1] * grid_size[2] + \ coords[:, 1] * grid_size[2] + coords[:, 2] # Unique voxels and point assignments unique_voxels, inverse, counts = np.unique( voxel_hash, return_inverse=True, return_counts=True ) # Step 4: Collect up to max_points per voxel n_voxels = min(len(unique_voxels), max_voxels) voxels = np.zeros((n_voxels, max_points, 4), dtype=np.float32) num_points = np.zeros(n_voxels, dtype=np.int32) voxel_coords = np.zeros((n_voxels, 3), dtype=np.int32) # ... fill voxels (production code uses C++ for speed) ... return voxels, voxel_coords, num_points # voxels: [n_voxels, max_points, 4] — point features per voxel # voxel_coords: [n_voxels, 3] — (ix, iy, iz) grid coordinates # num_points: [n_voxels] — actual point count per voxel
There are three major fusion strategies, each with different engineering tradeoffs:
Early fusion (BEV space). Project all sensors into a shared Bird's-Eye-View grid, then process jointly. For cameras, this means the Lift-Splat-Shoot algorithm: predict per-pixel depth distributions, scatter camera features into 3D space, then collapse to BEV. For LiDAR, pillarize directly to BEV. Concatenate the BEV feature maps channel-wise and process with a 2D backbone.
Late fusion. Run independent detection pipelines on each modality, then merge the resulting 3D bounding boxes. Merging requires association: matching a camera detection with a LiDAR detection. This is typically done with the Hungarian algorithm on center distance. NMS (Non-Maximum Suppression) across modalities removes duplicates.
Mid fusion (transformer-based). The BEVFusion approach: extract features from each modality independently, project them to a shared BEV space, and fuse with a transformer that performs cross-attention between modalities. This lets the model learn which modality to trust for each spatial location.
Cameras run at 30Hz. LiDAR runs at 10Hz. Radar at 20Hz. At time T=100ms, you have a camera frame from T=100ms, a LiDAR sweep from T=90ms, and a radar return from T=95ms. A vehicle moving at 30 m/s (108 km/h) travels 0.3m in 10ms. If you naively fuse the T=90ms LiDAR with the T=100ms camera, all LiDAR points are 0.3m behind where the camera sees the objects. For an object at 5m range, 0.3m error could be the difference between "safe to proceed" and "emergency brake."
python import numpy as np def project_lidar_to_camera( points_lidar, # [N, 3] xyz in LiDAR frame T_cam_lidar, # [4, 4] extrinsic: LiDAR → camera K, # [3, 3] camera intrinsics img_shape, # (H, W) image dimensions ): """Project LiDAR points onto camera image plane.""" N = points_lidar.shape[0] # Homogeneous coordinates: [N, 4] pts_h = np.hstack([points_lidar, np.ones((N, 1))]) # Transform to camera frame: [4, 4] @ [4, N] = [4, N] pts_cam = (T_cam_lidar @ pts_h.T).T # [N, 4] # Filter: keep only points in front of camera (Z > 0) depth = pts_cam[:, 2] valid = depth > 0.1 # minimum depth threshold pts_cam = pts_cam[valid] depth = depth[valid] # Perspective projection + intrinsics pts_2d = K @ pts_cam[:, :3].T # [3, 3] @ [3, N] = [3, N] pts_2d = pts_2d.T # [N, 3] pts_2d[:, 0] /= pts_2d[:, 2] # u = fx*X/Z + cx pts_2d[:, 1] /= pts_2d[:, 2] # v = fy*Y/Z + cy u = pts_2d[:, 0].astype(np.int32) v = pts_2d[:, 1].astype(np.int32) # Filter: keep only points within image bounds H, W = img_shape in_img = (u >= 0) & (u < W) & (v >= 0) & (v < H) return u[in_img], v[in_img], depth[in_img]
The BEV grid is the lingua franca of AV fusion. All sensor data eventually gets projected here. Let's implement the full pipeline: create the grid, project camera features via Lift-Splat, project LiDAR via pillarization, and fuse.
python import torch import torch.nn as nn class BEVGrid: """Create and manage a Bird's Eye View feature grid.""" def __init__(self, x_range=(-40, 40), y_range=(-40, 40), resolution=0.5, feature_dim=64): self.x_range = x_range # meters, in vehicle frame self.y_range = y_range self.res = resolution # meters per cell self.C = feature_dim # Grid dimensions self.nx = int((x_range[1] - x_range[0]) / resolution) # 160 self.ny = int((y_range[1] - y_range[0]) / resolution) # 160 # Pre-compute cell center coordinates (for projection) xs = torch.linspace(x_range[0]+resolution/2, x_range[1]-resolution/2, self.nx) ys = torch.linspace(y_range[0]+resolution/2, y_range[1]-resolution/2, self.ny) self.grid_xy = torch.stack( torch.meshgrid(xs, ys, indexing='ij'), dim=-1 ) # [nx, ny, 2] def world_to_grid(self, x, y): """Convert world (x,y) in meters to grid indices.""" ix = ((x - self.x_range[0]) / self.res).long() iy = ((y - self.y_range[0]) / self.res).long() valid = (ix >= 0) & (ix < self.nx) & (iy >= 0) & (iy < self.ny) return ix, iy, valid def scatter_lidar_to_bev(self, points, features): """Scatter LiDAR point features into BEV grid. points: [N, 3] (x, y, z) in vehicle frame features: [N, C] per-point features from PointPillars encoder Returns: [1, C, nx, ny] BEV feature map """ ix, iy, valid = self.world_to_grid(points[:, 0], points[:, 1]) ix, iy = ix[valid], iy[valid] feat = features[valid] # [M, C] # Flatten grid index for scatter flat_idx = ix * self.ny + iy # [M] # Scatter: sum features in each cell bev = torch.zeros(self.nx * self.ny, self.C) bev.scatter_add_(0, flat_idx.unsqueeze(1).expand_as(feat), feat) # Reshape to spatial return bev.reshape(self.nx, self.ny, self.C).permute(2,0,1).unsqueeze(0) # [1, C, nx, ny]
Real vehicles encounter sensor failures: camera lenses get dirty, LiDAR returns false points in heavy rain, radar produces ghost targets under bridges. A model trained with all sensors always present will fail catastrophically when any sensor degrades. Sensor dropout training randomly masks entire modalities during training, forcing the model to maintain reasonable performance with any subset of sensors.
python class SensorDropout(nn.Module): """Randomly drop entire sensor modalities during training.""" def __init__(self, drop_prob=0.15): super().__init__() self.drop_prob = drop_prob # per-modality dropout probability def forward(self, cam_feats, lidar_feats, radar_feats): # cam_feats: [B, N_cam, C, H, W] # lidar_feats: [B, C_L, X, Y] # radar_feats: [B, C_R, X, Y] if self.training: B = cam_feats.shape[0] for b in range(B): # Per-sample, per-modality dropout if torch.rand(1) < self.drop_prob: cam_feats[b] = 0 if torch.rand(1) < self.drop_prob: lidar_feats[b] = 0 if torch.rand(1) < self.drop_prob: radar_feats[b] = 0 # CRITICAL: never drop ALL modalities if cam_feats[b].sum()==0 and lidar_feats[b].sum()==0: lidar_feats[b] = lidar_feats[b].clone() # restore at least one modality # Also randomly drop individual cameras (more common failure) for b in range(B): for n in range(cam_feats.shape[1]): if torch.rand(1) < self.drop_prob * 0.5: cam_feats[b, n] = 0 # one dirty camera return cam_feats, lidar_feats, radar_feats
A question that comes up in every AV perception interview: how much does calibration error cost you? Let's derive it.
Failure 1: Calibration error propagation. A 1-degree rotation error in the LiDAR-to-camera extrinsic causes a distance-dependent pixel error: at 10m, it's ~0.17m (a few pixels); at 50m, it's ~0.87m (potentially 20+ pixels). This means objects at long range appear misaligned between modalities, and the fusion model learns to distrust one sensor. Symptom: Long-range detection accuracy degrades after a vehicle service that physically moved the sensors. Diagnostic: Project LiDAR points onto the camera image and visually check alignment — misalignment grows linearly with distance. Fix: Automated re-calibration pipeline that runs on every boot using lane markings or building edges as alignment targets. Continuous online refinement of extrinsics using predicted depth vs LiDAR depth.
Failure 2: Temporal misalignment. You fuse a camera frame from 100ms with a LiDAR sweep from 80ms. At highway speeds (30 m/s), objects have moved 0.6m in 20ms. The fusion model sees the same car at two different positions, creating a "ghost" trail. Symptom: Velocity estimates are biased; objects appear to "slide." Fix: Timestamp all sensor data at capture time (not arrival time). Use IMU-based ego-motion compensation to warp all sensors to a common reference time. Budget 1-2ms for the compensation computation.
Failure 3: Sensor dropout at inference. A camera lens gets covered by mud. The fusion model, trained on all 6 cameras always present, produces garbage output. Symptom: Perception confidence drops to near-zero or produces wildly incorrect detections when any sensor is degraded. Fix: Train with random sensor dropout: during training, randomly mask entire cameras (zero out input) or LiDAR (empty point cloud) with probability 10-20%. The model learns to work with subsets. Also implement sensor health monitoring that detects degraded inputs and alerts the planner.
Failure 4: Modality imbalance in fusion. Camera features dominate because they have higher spatial resolution, and the model learns to ignore LiDAR. When a camera fails (darkness, glare), the model collapses. Symptom: Ablation shows removing cameras hurts much more than removing LiDAR, even though LiDAR alone should give strong 3D geometry. Fix: Use gated fusion (learn per-location weights for each modality), balance training loss contributions from each modality, or use auxiliary supervision that forces LiDAR features to be independently informative.
Top-down view showing camera frustums (blue), LiDAR points (green), and radar returns (purple). Toggle sensors to see each modality's contribution. Click the grid to add obstacles.
The planner needs three things from perception: what objects are around the vehicle (detection), where they're going (velocity/trajectory), and what space is free to drive through (occupancy). Everything else — beautiful feature maps, clever attention mechanisms, impressive backbone architectures — is only valuable insofar as it produces these three outputs accurately and within the latency budget. This chapter covers the algorithms that produce them.
| Output | Format | Why the planner needs it | Typical spec |
|---|---|---|---|
| 3D bounding boxes | Per object: (x, y, z, l, w, h, θ, vx, vy, class, score) | Track vehicles, predict trajectories, compute time-to-collision | ±0.3m position, ±5° heading, ±1m/s velocity |
| Semantic BEV map | [X, Y, C_classes] grid: driveable surface, lane markings, crosswalks, curbs | Know where the vehicle can drive, where lanes are, lane changes | 0.25-0.5m resolution, 50-100m range |
| 3D occupancy grid | [X, Y, Z, C_classes] voxels: free, vehicle, pedestrian, building, vegetation... | General obstacle avoidance for arbitrary shapes (not just boxes) | 0.4m voxels, 80m×80m×6.4m, 16+ classes |
Lift-Splat-Shoot (LSS) is the foundational algorithm for projecting camera features into BEV space. It answers the question: how do you go from 2D image features to a 3D volumetric representation, when cameras provide no direct depth measurement?
The key insight: for each pixel, predict a categorical depth distribution — a probability over D discrete depth bins. Then "lift" the pixel's image features to every depth, weighted by those probabilities. The result is a 3D frustum of features that, when projected to BEV and summed across all cameras, produces a dense BEV feature map.
LSS explicitly constructs a 3D frustum and projects it to BEV. BEVFormer takes a different approach: it starts with a set of learnable BEV queries (a grid of feature vectors in BEV space) and uses deformable cross-attention to sample features from the camera images at the relevant locations.
Single-frame detection gives you positions but not velocities. The naive approach: detect in each frame, then associate detections across frames using a tracker. But tracking is a separate error-prone step. Modern methods (StreamPETR, Sparse4Dv2) maintain persistent object queries that carry information across frames natively inside the detector.
A 3D bounding box has 10 components: center position (x, y, z), dimensions (length, width, height), heading angle θ, velocity (vx, vy), and class. Each requires a different loss function:
Bounding boxes can't represent irregular shapes: a construction barrier, a pile of debris, an overhanging tree branch. Occupancy networks discretize the world into a 3D voxel grid and predict: for each voxel, is it free space or occupied? If occupied, what class?
The TPVFormer approach reduces the cost of full 3D prediction by factoring the 3D volume into three perpendicular planes (tri-plane): XY (top-down), XZ (front), and YZ (side). Each plane gets its own set of queries, and features from all three planes are combined to predict occupancy at any 3D point. This is much cheaper than dense 3D voxel prediction.
NMS in 3D requires computing the Intersection over Union (IoU) of two 3D bounding boxes that may be rotated. The 2D BEV IoU (rotated rectangles) is the hard part — the height dimension is typically handled separately.
| Metric | Used for | How it works |
|---|---|---|
| mAP (3D) | 3D detection | Match predictions to ground truth by center distance (not IoU). Thresholds: 0.5, 1.0, 2.0, 4.0 meters. Compute AP at each threshold, average across classes and thresholds. Center-distance matching is used because 3D IoU is expensive and sensitive to size errors. |
| NDS | 3D detection (nuScenes) | NDS = 0.5 × mAP + 0.1 × (mATE + mASE + mAOE + mAVE + mAAE). Combines detection accuracy with localization, size, orientation, velocity, and attribute errors. The "one number" for perception quality. |
| mIoU | 3D occupancy | Per-class IoU between predicted and ground-truth voxels, averaged across classes. IoU = TP / (TP + FP + FN) per class. Ignoring free-space class in the average to avoid inflating the metric. |
python import torch import numpy as np def rotated_box_corners(cx, cy, l, w, theta): """Compute 4 corners of a rotated 2D box in BEV.""" cos_t, sin_t = np.cos(theta), np.sin(theta) # Half extents dx, dy = l / 2, w / 2 # Corner offsets (before rotation) offsets = np.array([[-dx,-dy],[ dx,-dy],[ dx, dy],[-dx, dy]]) # Rotation matrix R = np.array([[cos_t, -sin_t], [sin_t, cos_t]]) corners = offsets @ R.T + np.array([cx, cy]) return corners # [4, 2] def polygon_area(vertices): """Shoelace formula for polygon area.""" n = len(vertices) area = 0 for i in range(n): j = (i + 1) % n area += vertices[i][0] * vertices[j][1] area -= vertices[j][0] * vertices[i][1] return abs(area) / 2 def clip_polygon_by_edge(polygon, p1, p2): """Sutherland-Hodgman: clip polygon by half-plane defined by edge p1->p2.""" if len(polygon) == 0: return [] result = [] for i in range(len(polygon)): curr = polygon[i] prev = polygon[i - 1] # Cross product to determine side d_curr = (p2[0]-p1[0])*(curr[1]-p1[1]) - (p2[1]-p1[1])*(curr[0]-p1[0]) d_prev = (p2[0]-p1[0])*(prev[1]-p1[1]) - (p2[1]-p1[1])*(prev[0]-p1[0]) if d_curr >= 0: # inside if d_prev < 0: # was outside, add intersection t = d_prev / (d_prev - d_curr) ix = prev[0] + t * (curr[0] - prev[0]) iy = prev[1] + t * (curr[1] - prev[1]) result.append([ix, iy]) result.append(curr) elif d_prev >= 0: # going from inside to outside t = d_prev / (d_prev - d_curr) ix = prev[0] + t * (curr[0] - prev[0]) iy = prev[1] + t * (curr[1] - prev[1]) result.append([ix, iy]) return result def rotated_iou_2d(box_a, box_b): """Compute IoU of two rotated 2D boxes in BEV. Each box: (cx, cy, length, width, heading_rad) """ corners_a = rotated_box_corners(*box_a) corners_b = rotated_box_corners(*box_b) # Sutherland-Hodgman polygon clipping polygon = corners_a.tolist() for i in range(4): p1 = corners_b[i].tolist() p2 = corners_b[(i + 1) % 4].tolist() polygon = clip_polygon_by_edge(polygon, p1, p2) if len(polygon) == 0: return 0.0 inter = polygon_area(polygon) area_a = box_a[2] * box_a[3] # l * w area_b = box_b[2] * box_b[3] return inter / (area_a + area_b - inter + 1e-8) def iou_3d(box_a, box_b): """3D IoU of two oriented boxes. Each box: (cx, cy, cz, l, w, h, heading_rad) """ # BEV IoU bev_a = (box_a[0], box_a[1], box_a[3], box_a[4], box_a[6]) bev_b = (box_b[0], box_b[1], box_b[3], box_b[4], box_b[6]) bev_inter_area = rotated_iou_2d(bev_a, bev_b) * ( bev_a[2]*bev_a[3] + bev_b[2]*bev_b[3] - rotated_iou_2d(bev_a, bev_b) * (bev_a[2]*bev_a[3] + bev_b[2]*bev_b[3]) ) # simplified — real code computes intersection area directly # Height overlap za_min, za_max = box_a[2] - box_a[5]/2, box_a[2] + box_a[5]/2 zb_min, zb_max = box_b[2] - box_b[5]/2, box_b[2] + box_b[5]/2 h_overlap = max(0, min(za_max, zb_max) - max(za_min, zb_min)) vol_a = box_a[3] * box_a[4] * box_a[5] vol_b = box_b[3] * box_b[4] * box_b[5] vol_inter = bev_inter_area * h_overlap return vol_inter / (vol_a + vol_b - vol_inter + 1e-8)
Failure 1: Depth estimation fails at long range. Monocular depth uncertainty grows quadratically with distance. At 50m, a 1-pixel error in the image corresponds to ~2m depth error. The BEV features become a blurry smear beyond 40-50m. Symptom: Detection recall drops from 92% at 30m to 54% at 50m. Diagnostic: Plot per-pixel depth error vs ground-truth range (use LiDAR as reference). You'll see error grow as O(d2). Fix: (1) LiDAR depth supervision: add an auxiliary loss that trains the depth head against LiDAR ground truth. (2) Multi-scale BEV: use higher resolution (0.25m) near the ego and lower resolution (1.0m) at range. (3) Temporal stereo: use ego-motion between frames for triangulation.
Failure 2: BEV feature smearing. When the predicted depth distribution is too spread out (high entropy), features get scattered across a wide range of BEV cells instead of concentrating at the correct position. The BEV map becomes noisy and detection heads produce false positives. Symptom: High recall but low precision (many false positives in BEV). Fix: Sharpen the depth distribution with a temperature parameter in the softmax, or use top-K depth bins. Add depth distribution entropy as a regularization term.
Failure 3: Temporal false positive persistence. A ghost detection in one frame gets carried forward by persistent object queries through subsequent frames, appearing to "confirm" itself. Symptom: False positive rate increases with temporal window length. Objects that don't exist persist for 10+ frames. Fix: Add a confidence decay: queries not re-detected (low attention weight to current-frame features) have their confidence reduced by a factor each frame. After 3-5 frames without re-detection, suppress.
Failure 4: Occupancy class imbalance. Your occupancy model predicts "free space" for 98% of voxels and gets 95% mIoU. But per-class: vehicle mIoU = 45%, pedestrian mIoU = 12%. Fix: (1) Class-weighted cross-entropy with inverse-frequency weights. (2) Focal loss. (3) Over-sample scenes with rare classes. (4) Lovász-softmax loss which directly optimizes the mIoU metric.
| Development | What's new | Impact |
|---|---|---|
| UniAD / VAD | Unified perception-prediction-planning in one model. Shared BEV features, end-to-end training. | Eliminates hand-crafted interfaces between modules. 20-30% latency reduction vs modular stack. |
| Occ3D / SurroundOcc | Dense 3D occupancy from cameras only (no LiDAR). Camera-to-3D via learned depth + volume rendering. | Enables camera-only vehicles to reason about free space. Approaching LiDAR-level quality for occupancy. |
| Sparse perception | SparseBEV, SparseOcc: predict only at query locations, not dense grid. 5-10x faster than dense BEV. | Makes real-time occupancy feasible on edge hardware. Sub-20ms for occupancy prediction on Orin. |
| World models for perception | GAIA-1, DriveDreamer: predict future sensor observations. Used for self-supervised pre-training. | Reduces labeled data requirements by 10x. Pre-train on unlabeled driving videos, fine-tune with 10% labels. |
Top-down view of the ego vehicle's perception output. Adjust the detection range to see how recall degrades at distance. Objects shown as rotated boxes with velocity arrows and confidence scores.
A data center GPU sits in a climate-controlled room with 700 watts of cooling, 80 GB of dedicated HBM3, and effectively infinite power. Your vehicle compute module sits in an enclosed box behind the passenger seat, passively cooled by ambient air that can reach 45 degrees Celsius in Phoenix summer, sharing 32 GB of memory between the CPU, GPU, and every other process on the vehicle. It draws 60 watts total — less than a laptop charger. And it needs to run perception, prediction, planning, localization, and mapping simultaneously, with hard real-time deadlines. Welcome to edge deployment.
This chapter is where everything you learned about quantization, TensorRT, CUDA, and C++ inference converges on a single, unforgiving hardware target. Every optimization trick matters here — not for throughput charts, but because a missed frame at 65 mph means 2.9 meters of blind driving.
The dominant AV compute platform today is the NVIDIA Orin system-on-chip. Understanding its architecture is essential because every optimization decision depends on what the hardware can and cannot do. Let's dissect it.
| Component | Specification | What It Does |
|---|---|---|
| Ampere GPU | 2048 CUDA cores, 64 Tensor Cores | General compute and matrix operations. Tensor Cores do INT8/FP16 matrix multiply-accumulate at 275 TOPS (INT8) |
| DLA (x2) | 2 Deep Learning Accelerators | Fixed-function inference engines. Support Conv, BN, pooling, activation, deconv. Cannot do attention, custom ops, dynamic shapes |
| Arm Cortex-A78AE CPU | 12 cores, up to 2.2 GHz | Preprocessing, postprocessing, system orchestration. Automotive-grade (lockstep mode for ASIL-D) |
| LPDDR5 Memory | 32 GB unified, 204.8 GB/s bandwidth | Shared between CPU, GPU, DLA. No separate VRAM. Bandwidth is the critical bottleneck |
| PVA (x2) | 2 Programmable Vision Accelerators | Image signal processing, stereo disparity, optical flow. Frees GPU for neural network inference |
| Video Encoders/Decoders | NVENC/NVDEC | Hardware-accelerated video encode/decode. Used for camera input and logging |
| Power Envelope | 15-60W configurable | MAXN mode (60W) = full performance. 30W/15W modes trade performance for power savings. Software-selectable |
Edge deployment lives inside a triangle of constraints. Each vertex constrains the other two, and violating any one of them can cascade into system failure.
Power is the root constraint. The vehicle's 12V electrical system allocates a fixed power budget to each compute domain. Perception might get 40W. That 40W must cover the GPU, DLA, and the CPU cycles dedicated to perception. More compute = more power = more heat.
Thermal is the enforcer. The Orin module has a junction temperature limit (typically 105 degrees Celsius). When the chip approaches this limit, it thermal throttles — reducing clock frequencies to reduce heat output. This means your 40ms model suddenly takes 65ms. The insidious part: thermal throttling is non-linear. A 10% reduction in clock speed can cause a 30% latency increase because memory access patterns are disrupted and pipeline stalls cascade.
Memory is the hard wall. 32 GB is all you get. Period. There's no swap file (too slow for real-time), no second DIMM slot, no cloud fallback. If your stack exceeds 32 GB, something doesn't run.
On Orin, CPU and GPU access the same physical LPDDR5 through a shared memory controller. The total bandwidth is 204.8 GB/s, but this is shared across all consumers. Let's trace what happens during one inference cycle:
The critical insight: bandwidth contention doesn't cause graceful degradation. When total demand exceeds supply, all consumers slow down simultaneously. Your GPU isn't just slower — it's unpredictably slower, because the delay depends on what the CPU and DLA are doing at the same instant.
How to measure and minimize contention:
bash # tegrastats: real-time SOC monitoring tool # Shows: CPU/GPU freq, memory bandwidth, power, temperature tegrastats --interval 100 # Output (every 100ms): # RAM 18432/32768MB (lfb 64x4MB) SWAP 0/16384MB # CPU [45%@2201,38%@2201,52%@2201,...] GR3D_FREQ 98%@1275 # EMC_FREQ 100%@3199 ← THIS IS THE KEY NUMBER # EMC = External Memory Controller. 100% = bandwidth saturated! # VDD_CPU_GPU_CV 34520mW ← total power draw # SOC_THERM cpu@71.5C gpu@73.2C ← junction temperatures
If EMC_FREQ is consistently above 85%, you have a bandwidth problem. Solutions:
| Technique | Effect | Implementation |
|---|---|---|
| Reduce model size | Less weight data to read | INT8/INT4 quantization, pruning |
| Activation checkpointing | Less activation memory traffic | Recompute instead of store intermediate activations |
| Schedule CPU/GPU work | Avoid simultaneous bursts | CPU preprocessing in GPU idle periods, double-buffering |
| Use DLA for secondary models | Offload from GPU memory bus | DLA has its own memory path for supported ops |
| Zero-copy buffers | Eliminate CPU→GPU copies | On unified memory, use cudaHostAlloc with mapped flag — both CPU and GPU access same physical pages |
Orin has two DLA engines — fixed-function neural network accelerators that run inference at very low power. A DLA engine uses roughly 5W to run a model that would take the GPU 15W. The catch: DLAs only support a subset of operations.
| Supported (DLA-native) | NOT supported (falls back to GPU) |
|---|---|
| Conv2d, ConvTranspose2d | Self-attention, cross-attention |
| BatchNorm, InstanceNorm | LayerNorm, RMSNorm |
| ReLU, Sigmoid, Tanh | GELU, SiLU, Mish |
| MaxPool, AvgPool | Deformable convolution |
| Elementwise add/mul | Custom CUDA kernels |
| Concat, slice | Dynamic shape operations |
| Fully connected | Einsum, complex indexing |
When TensorRT compiles a model, it automatically partitions ops between GPU and DLA based on a compatibility check. But the automatic partition isn't always optimal. Each GPU↔DLA transition incurs a data copy overhead (typically 0.5-1ms). If TensorRT creates many small DLA segments with GPU transitions between them, the copy overhead can exceed the DLA savings.
Let's compute the exact memory budget for a realistic AV stack on Orin (32 GB). This is a worked example you should be able to reproduce in an interview.
Your perception model averages 45ms. Great — well within the 100ms budget. Then you deploy it, and once every 200 frames (roughly every 6.6 seconds at 30 fps) the latency spikes to 180ms. That's a 5.4-meter blind spot at highway speed. Every six seconds. This is unacceptable.
P99 latency (the 99th percentile) is the metric that matters for safety-critical systems. It means "99% of inferences complete within this time." For autonomous driving, even P99 might not be strict enough — some teams target P99.9 or even P99.99.
Sources of latency jitter on a vehicle SOC:
| Source | Typical spike | Fix |
|---|---|---|
| OS scheduling other processes on same CPU core | 5-20ms | CPU pinning with taskset or pthread_setaffinity_np |
| CPU frequency scaling (dynamic clocking) | 10-50ms | Lock CPU frequency: cpufreq-set -g performance |
| GPU context switching between processes | 5-15ms | Use CUDA MPS or exclusive GPU mode |
| Memory allocation (malloc/cudaMalloc) during inference | 1-100ms | Pre-allocate ALL buffers at startup. Zero allocations in hot path |
| Thermal throttling | 30-100ms | Budget for worst-case thermal state. Fallback model at high temp |
| IRQ handling (network, USB, sensors) | 1-5ms | IRQ affinity: route interrupts to non-inference CPU cores |
cpp // Deterministic inference launcher with CPU pinning and CUDA priority // This is the code that actually runs on the vehicle #include <sched.h> #include <pthread.h> #include <cuda_runtime.h> struct DeterministicInference { cudaStream_t hi_pri_stream; // High-priority CUDA stream void* pre_alloc_buffers[16]; // All memory pre-allocated int target_cpu_core; // Dedicated CPU core void init(int cpu_core) { target_cpu_core = cpu_core; // 1. Pin this thread to a specific CPU core cpu_set_t cpuset; CPU_ZERO(&cpuset); CPU_SET(cpu_core, &cpuset); pthread_setaffinity_np( pthread_self(), sizeof(cpu_set_t), &cpuset); // 2. Set real-time scheduling (SCHED_FIFO = highest priority) struct sched_param param; param.sched_priority = 90; // 1-99, higher = more priority pthread_setschedparam( pthread_self(), SCHED_FIFO, ¶m); // 3. Create high-priority CUDA stream // Priority: lower number = higher priority int lo, hi; cudaDeviceGetStreamPriorityRange(&lo, &hi); cudaStreamCreateWithPriority( &hi_pri_stream, cudaStreamNonBlocking, hi); // 4. Pre-allocate ALL inference buffers // ZERO allocations during inference loop cudaMalloc(&pre_alloc_buffers[0], INPUT_SIZE); cudaMalloc(&pre_alloc_buffers[1], OUTPUT_SIZE); // ... all intermediate buffers ... } float run_inference(const void* input) { // Measure wall-clock time, not GPU time auto start = std::chrono::high_resolution_clock::now(); // Copy input to pre-allocated device buffer cudaMemcpyAsync(pre_alloc_buffers[0], input, INPUT_SIZE, cudaMemcpyHostToDevice, hi_pri_stream); // Run TensorRT engine on high-priority stream context->enqueueV3(hi_pri_stream); // Synchronize (blocking — wait for GPU to finish) cudaStreamSynchronize(hi_pri_stream); auto end = std::chrono::high_resolution_clock::now(); return std::chrono::duration<float, std::milli>( end - start).count(); } };
python # tegrastats parser — extracts key metrics for power/thermal analysis import subprocess, re, time, json class TegraStatsParser: def __init__(self, interval_ms=100): self.interval = interval_ms self.history = [] # rolling window of samples def parse_line(self, line): """Parse one tegrastats output line into structured data.""" d = {} # RAM usage: "RAM 18432/32768MB" m = re.search(r'RAM (\d+)/(\d+)MB', line) if m: d['ram_used_mb'] = int(m.group(1)) # GPU frequency and utilization: "GR3D_FREQ 98%@1275" m = re.search(r'GR3D_FREQ (\d+)%@(\d+)', line) if m: d['gpu_util'] = int(m.group(1)) d['gpu_freq_mhz'] = int(m.group(2)) # Memory controller: "EMC_FREQ 87%@3199" m = re.search(r'EMC_FREQ (\d+)%@(\d+)', line) if m: d['emc_util'] = int(m.group(1)) # Power rails: "VDD_CPU_GPU_CV 34520mW" for rail in ['VDD_CPU_GPU_CV', 'VDD_SOC', 'VDD_IN']: m = re.search(rail + r' (\d+)mW', line) if m: d[rail.lower()] = int(m.group(1)) / 1000.0 # Temperatures: "cpu@71.5C gpu@73.2C" for sensor in ['cpu', 'gpu', 'aux']: m = re.search(sensor + r'@([\d.]+)C', line) if m: d[f'temp_{sensor}'] = float(m.group(1)) self.history.append(d) return d def is_throttling(self): """Detect thermal throttling: GPU freq dropping while utilization stays high.""" if len(self.history) < 10: return False recent = self.history[-10:] avg_util = sum(d.get('gpu_util', 0) for d in recent) / 10 avg_freq = sum(d.get('gpu_freq_mhz', 0) for d in recent) / 10 max_freq = 1275 # Orin max GPU freq # Throttling = high utilization but reduced frequency return avg_util > 90 and avg_freq < max_freq * 0.85
The vehicle operates in environments from -40 degrees C (Minnesota winter) to +50 degrees C (Arizona summer parking lot). The compute module must work across this entire range. Here's the thermal design process:
Failure 1: Thermal throttling P99 spikes. Average latency is 44ms, but every 5 seconds the GPU clock drops by 20% due to thermal throttling, causing a 68ms spike. Over time, sustained load pushes the chip hotter and spikes become more frequent. Diagnosis: Check tegrastats for GPU frequency drops correlated with temperature increase. The tell-tale sign: GPU utilization stays at 98% but frequency drops from 1275 MHz to 1020 MHz. Fix: Reduce power target to 35W (prevents throttling entirely), optimize model to run within the lower power budget, improve heatsink thermal resistance, or implement a thermal-aware model switcher that swaps to a lighter backbone when junction temperature exceeds 95 degrees C.
Failure 2: Shared memory bandwidth contention. Perception runs at 44ms in isolation but 58ms when prediction and planning run simultaneously. Diagnosis: Monitor EMC utilization — if it spikes above 90% when all modules run concurrently, bandwidth contention is the culprit. Fix: Schedule modules to avoid simultaneous peak bandwidth. Use double-buffering so perception writes results to buffer A while prediction reads from buffer B. Reduce model memory footprint through more aggressive quantization. Consider activation checkpointing to trade compute for memory bandwidth.
Failure 3: DLA fallback errors. A new model version adds a GELU activation that DLA doesn't support. TensorRT silently falls back to GPU for that layer, adding a DLA→GPU→DLA transition that costs 1.4ms per occurrence. With 12 GELU layers, you've added 16.8ms. Diagnosis: Compare TensorRT engine layer timing between old and new versions. Look for layers marked "executed on GPU" that were previously on DLA. Fix: Replace GELU with ReLU in the backbone (minor accuracy impact, full DLA compatibility), or restructure the network so all DLA-incompatible ops are contiguous (minimizing transitions).
Failure 4: Power budget exceeded during sensor burst. Six cameras capture simultaneously, triggering a DMA burst that temporarily pushes total power to 65W. The power management unit (PMU) reacts by throttling GPU clocks for the next 100ms. Fix: Stagger camera capture (pairs of cameras at 10ms offsets). Reserve 5W of power headroom for sensor I/O bursts. Never design to the absolute power limit.
NVIDIA Thor (2025-2026): The successor to Orin. 2000 INT8 TOPS (7x Orin), up to 128 GB memory, transformer engine with native FP8. This is enough to run a 10B parameter VLA at FP8 with room to spare. The constraint shifts from "will it fit?" to "how to use the surplus for redundancy and safety."
Qualcomm Snapdragon Ride: An alternative to NVIDIA, using the Hexagon DSP for neural inference. Advantages: lower power (often 20-30% less for equivalent TOPS), hardware support for more activation functions. Disadvantages: less mature toolchain, smaller developer ecosystem, no CUDA (uses OpenCL/Qualcomm AI Engine Direct).
Chiplets and disaggregated compute (2026+): Instead of one monolithic SOC, future vehicles may use multiple smaller chips connected via high-speed links. This improves thermal distribution (spread heat across multiple packages) and enables modular upgrade paths (replace the perception chip without replacing planning).
Toggle model configurations to see how power, thermal, and memory change. Watch the gauges update in real time.
The vehicle is one deployment target. The other is the cloud — and it's often the larger engineering challenge. You're running the same perception models on millions of logged driving scenes for training data curation, auto-labeling, and simulation validation. In the cloud, you don't care about 50ms latency — you care about processing a petabyte of driving data before the next training cycle starts. This is inference serving at scale, and it's an entirely different optimization problem.
| Dimension | On-Vehicle (Latency) | Cloud (Throughput) |
|---|---|---|
| Primary metric | P99 latency (ms) | Samples/second/dollar |
| Batch size | 1 (single frame, real-time) | 32-128 (fill the GPU) |
| Model format | TensorRT engine (platform-specific) | TensorRT, PyTorch, ONNX (flexible) |
| Scaling | Fixed hardware (one SOC per vehicle) | Elastic (autoscale GPU replicas) |
| Failure mode | Safety-critical (must never fail) | Retry-friendly (can re-queue failed jobs) |
| Memory | 32 GB shared, rigid budget | 80 GB HBM3, relatively generous |
| Cost concern | Power (watts per vehicle × fleet size) | GPU-hours (cloud bill) |
Triton is the industry-standard inference server for multi-model serving. Understanding its architecture is essential because it's what you'll configure, debug, and extend in production. Here's what happens when a request arrives:
max_queue_delay for other requests to form a batch.The model repository is a directory structure that Triton watches:
bash model_repository/ ├── perception_backbone/ │ ├── config.pbtxt # Model configuration │ ├── 1/ # Version 1 │ │ └── model.plan # TensorRT engine │ └── 2/ # Version 2 (canary) │ └── model.plan ├── detection_head/ │ ├── config.pbtxt │ └── 1/ │ └── model.onnx # ONNX model (different backend) └── full_pipeline/ ├── config.pbtxt # Ensemble model └── 1/ # No model file — orchestration only
Dynamic batching is the single most impactful optimization for cloud serving throughput. The idea: instead of processing one request at a time, accumulate requests in a queue and process them together. GPU matrix operations scale sub-linearly with batch size — processing 32 samples takes far less than 32x the time of one sample.
But there's a fundamental tension: larger batches improve throughput but increase latency (each request waits in the queue). Let's derive the optimal batch size.
Standard dynamic batching works for fixed-size models (images in, detections out). But LLMs are autoregressive — each request generates tokens one at a time, and different requests have different output lengths. Continuous batching (also called inflight batching) solves this.
The problem with naive LLM batching: if you batch 8 requests and one generates 500 tokens while the others generate 50, the 7 short requests finish early and sit idle while the long one continues. You're wasting 7/8 of GPU capacity during the tail generation.
Here's what four requests look like under naive batching vs continuous batching:
protobuf # config.pbtxt for BEV perception backbone name: "perception_backbone_v2" platform: "tensorrt_plan" max_batch_size: 32 input [ { name: "images" data_type: TYPE_FP16 dims: [ 6, 3, 480, 800 ] # 6 cameras, 3 channels, H, W } ] output [ { name: "bev_features" data_type: TYPE_FP16 dims: [ 256, 200, 200 ] # C, H_bev, W_bev } ] # Dynamic batching configuration dynamic_batching { preferred_batch_size: [ 8, 16, 32 ] # fire at these sizes immediately max_queue_delay_microseconds: 15000 # 15ms max wait default_queue_policy { timeout_action: DELAY # if timeout, send partial batch default_timeout_microseconds: 25000 allow_timeout_override: true } priority_levels: 3 # 0=low (auto-label), 2=high (real-time) default_priority_level: 1 } # Model versioning version_policy { specific { versions: [1, 2] } # keep both versions loaded } # Instance groups — how many model copies instance_group [ { count: 2 kind: KIND_GPU gpus: [0] } # 2 instances on GPU 0 ]
Every model update must pass a parity check before deployment. This is the gate that prevents silent accuracy regressions. Here's the complete framework:
python import torch, numpy as np from dataclasses import dataclass from typing import Dict, List @dataclass class ParityResult: layer_name: str max_abs_diff: float # worst-case element-wise error mean_abs_diff: float # average element-wise error cosine_sim: float # directional alignment (should be > 0.999) kl_divergence: float # distribution shift (should be < 0.01) pass_status: bool class ParityChecker: """Production parity checking framework. Compares reference model (FP32 PyTorch) against optimized engine at three levels: per-element, per-layer, and distribution-level.""" def __init__(self, tolerances=None): self.tolerances = tolerances or { 'max_abs': 0.05, # no element off by more than 5% 'mean_abs': 0.005, # average error under 0.5% 'cosine': 0.999, # cosine similarity > 0.999 'kl_div': 0.01, # KL divergence < 0.01 'output_mAP': 0.005, # end-to-end mAP drop < 0.5% } def compare_layers(self, ref_acts, opt_acts) -> List[ParityResult]: """Compare intermediate activations layer by layer.""" results = [] for name in ref_acts: r = ref_acts[name].float() o = opt_acts[name].float() max_abs = (r - o).abs().max().item() mean_abs = (r - o).abs().mean().item() cos = torch.nn.functional.cosine_similarity( r.flatten(), o.flatten(), dim=0).item() # KL divergence: treat activations as distributions r_soft = torch.softmax(r.flatten(), dim=0) o_soft = torch.softmax(o.flatten(), dim=0) kl = torch.nn.functional.kl_div( o_soft.log(), r_soft, reduction='sum').item() passed = (max_abs < self.tolerances['max_abs'] and cos > self.tolerances['cosine'] and kl < self.tolerances['kl_div']) results.append(ParityResult( name, max_abs, mean_abs, cos, kl, passed)) return results def regression_check(self, current_metrics, baseline_metrics): """Compare metrics across model versions for regression detection.""" regressions = [] for metric, value in current_metrics.items(): baseline = baseline_metrics.get(metric, value) # Allow 0.5% regression max for any single metric if value < baseline * 0.995: regressions.append({ 'metric': metric, 'baseline': baseline, 'current': value, 'drop': (baseline - value) / baseline * 100 }) return regressions # empty list = passed
Rolling out a new model version in production is not a "deploy and pray" operation. The safe deployment pipeline has multiple stages:
python # A/B test metrics collector — runs alongside Triton class ABTestCollector: def __init__(self, model_a_version, model_b_version): self.versions = {'A': model_a_version, 'B': model_b_version} self.metrics = {'A': [], 'B': []} def record(self, version, latency_ms, output): self.metrics[version].append({ 'latency': latency_ms, 'num_detections': len(output['boxes']), 'max_confidence': output['scores'].max().item(), 'timestamp': time.time() }) def should_rollback(self, min_samples=1000): if len(self.metrics['B']) < min_samples: return False # not enough data yet a_p99 = np.percentile([m['latency'] for m in self.metrics['A']], 99) b_p99 = np.percentile([m['latency'] for m in self.metrics['B']], 99) # Rollback if new version P99 is >20% worse return b_p99 > a_p99 * 1.2
Cloud inference clusters must scale with demand. Too few replicas = requests queue and latency spikes. Too many replicas = wasted GPU money. The challenge: model loading takes 15-60 seconds (deserializing a TensorRT engine, allocating GPU memory, warming up CUDA contexts). This is the cold start problem.
Failure 1: Dynamic batching latency spikes. Batch size oscillates between 1 and 32, causing P99 latency to vary wildly. Root cause: Bursty traffic pattern — requests arrive in clumps with quiet periods between. During quiet periods, single requests fire immediately (low latency). During bursts, queue fills to 32 (high latency). Fix: Set a smaller max_batch_size (8-16) with a tighter queue delay (5ms). Accept slightly lower peak throughput for much more consistent latency. Or: use multiple priority levels so latency-sensitive requests bypass the queue.
Failure 2: Model version mismatch. Preprocessing pipeline was updated for model v2 (new normalization values, different input resolution) but the Triton config still points to model v1. The model receives incorrectly preprocessed inputs and produces garbage outputs — but doesn't crash or return errors. Diagnosis: Parity check catches it — cosine similarity between reference and production drops below threshold. Fix: Bundle preprocessing with the model (model ensemble in Triton). Version the preprocessing alongside the model. Include a preprocessing hash in the model config that's validated at load time.
Failure 3: Autoscaler thrashing. The autoscaler adds replicas when P99 > 30ms, removes them when GPU utilization < 40%. These two signals fight each other: adding replicas reduces utilization (triggers scale-down), removing replicas increases latency (triggers scale-up). Fix: Add cooldown periods (minimum 5 minutes between scale actions). Use hysteresis (scale up at P99 > 30ms, scale down only when P99 < 20ms for 10 minutes). Never scale to zero — always keep minimum warm replicas.
Failure 4: Cold start cascade. A cluster restart (planned maintenance) requires all 50 replicas to reload their models simultaneously. Each loads a 6 GB TensorRT engine into GPU memory. The shared filesystem serving the model repository is overwhelmed — 50 concurrent 6 GB reads = 300 GB I/O burst. Load times increase from 30s to 300s. Meanwhile, queued requests timeout and retry, creating a thundering herd. Fix: Stagger restarts (rolling restart with 5 replicas at a time). Cache TensorRT engines locally on each node's NVMe. Use a content-delivery approach (pre-distribute engines before restart).
Watch requests arrive, batch, and process. Adjust batch size and arrival rate to see throughput and latency change.
The classical AV stack is modular: perception detects objects, prediction forecasts their trajectories, planning optimizes a route, and control executes it. Each module is designed, trained, and optimized independently. Every handoff between modules loses information — perception outputs bounding boxes, discarding the rich feature maps that might help the planner. The emerging paradigm replaces this entire pipeline with a single foundation model that maps raw sensor inputs directly to driving actions. This is end-to-end driving, and it fundamentally changes what an inference engineer optimizes.
Modular Stack
End-to-End VLA
| Dimension | Modular Stack | E2E VLA |
|---|---|---|
| Total parameters | ~375M (sum of all modules) | ~3B (single model) |
| Information flow | Lossy handoffs (boxes, not features) | End-to-end gradients, no information loss |
| Latency | Sum of sequential stages: ~68ms | Single forward pass: ~55ms (but depends on decoding) |
| Memory | ~1.8 GB weights (mixed precision) | ~3 GB (INT8) or ~6 GB (FP16) + KV-cache |
| Debuggability | High — inspect each module independently | Low — single black box |
| Failure modes | One module fails, others compensate | Single point of failure, fails opaquely |
| Safety certification | Easier — test each component | Harder — must treat entire model as one unit |
| Optimization | Per-module quantization, scheduling | Whole-model sensitivity analysis |
| Update cycle | Update one module without touching others | Retrain entire model for any change |
A Vision-Language Model (VLM) takes images and text as input and produces text or structured output. For driving, the process works as follows:
Step 1: Image tokenization. Each camera image is divided into patches (typically 14×14 or 16×16 pixels) and passed through a Vision Transformer (ViT). A 1920×1280 image at patch size 14 produces (1920/14) × (1280/14) = 137 × 91 = 12,467 patches per camera. That's too many — so a pooling layer reduces this to ~576 patches per camera. With 6 cameras: 3,456 visual tokens.
Step 2: Text tokenization. The driving context ("You are driving on a highway at 65 mph. Weather is clear. The car ahead is slowing.") is tokenized into ~50-100 text tokens using a standard BPE tokenizer.
Step 3: Cross-attention fusion. The transformer backbone processes both visual and text tokens together. Self-attention allows every token to attend to every other token — visual tokens learn from text context, and text tokens learn from visual features. This is where the model "understands" the scene.
Step 4: Structured output decoding. The model decodes a structured output: detected objects with bounding boxes, predicted trajectories, or driving instructions. This can be autoregressive (generate one token at a time) or parallel (predict all outputs in one shot).
A VLA extends the VLM by adding an action head that outputs continuous driving commands. The key challenge: how do you go from discrete text tokens to continuous steering angles?
Approach 1: Action tokenization (discrete). Discretize the continuous action space into bins. Steering angle [-30°, +30°] becomes 256 bins (0.23° resolution). Throttle [0, 1] becomes 64 bins. The model generates action tokens autoregressively, just like generating text. Advantage: uses the same decoding infrastructure as LLMs (KV-cache, speculative decoding). Disadvantage: quantization of actions introduces discretization error, and the number of bins trades resolution against vocabulary size.
Approach 2: Regression head (continuous). Add an MLP head that takes the last hidden state and regresses continuous waypoints: [(x1, y1), (x2, y2), ..., (x10, y10)] — ten future positions of the ego vehicle. Advantage: no discretization error, parallel output (one forward pass, not autoregressive). Disadvantage: different training objective (MSE loss on waypoints) that may fight the language modeling loss.
Approach 3: Diffusion action head (2024-present). Use a small diffusion model as the action head. The transformer backbone produces a "plan embedding," and a denoising network iteratively refines a noisy trajectory into a clean one. Advantage: captures multimodal action distributions (multiple valid trajectories for an intersection). Disadvantage: denoising requires multiple forward passes (typically 4-8), adding latency.
VLAs are trained by imitation learning: given millions of hours of human driving demonstrations, the model learns to map observations to actions by minimizing the difference between its predicted actions and the human driver's actions. The loss function:
DAgger (Dataset Aggregation) attempts to fix distributional shift: run the learned policy, record the states it visits, query the human expert for correct actions in those states, add these to the training data, and repeat. This iteratively expands the training distribution to cover states the model actually visits. But DAgger requires online interaction — you need a human driver correcting the model in real time, which is expensive and dangerous.
If the VLA uses autoregressive action decoding (Approach 1), it inherits all the latency challenges of LLMs:
| LLM Serving Trick | Applies to VLA? | Consideration |
|---|---|---|
| KV-cache | Yes — essential | Cache visual tokens across decoding steps. 3,456 visual tokens × 32 layers × 2 × 2048 × 2 bytes = 900 MB at FP16 |
| PagedAttention | Partially | Useful if serving multiple queries (e.g., multi-scenario planning). Less useful for single-vehicle deployment |
| Speculative decoding | Yes — high impact | Small "draft" VLA generates candidate trajectories, large VLA verifies. Can reduce decoding steps 2-3x |
| FlashAttention | Essential | 3,506 total tokens → O(n^2) attention. FlashAttention reduces memory from 46 GB to ~200 MB |
| Continuous batching | Cloud only | On-vehicle: batch size always 1. In cloud simulation: continuous batching across scenarios |
| FP8 / INT4 quantization | Mixed precision | Vision encoder tolerates INT8 well. Action head is sensitive — keep in FP16. LLM backbone: FP8 or INT4 with GPTQ |
A VLA isn't uniformly sensitive to quantization. Different components have wildly different tolerance. Here's the systematic approach:
python # VLA quantization sensitivity analysis # Quantize each component independently, measure end-to-end impact def sensitivity_analysis(model, eval_data, baseline_metrics): """Quantize each component to INT8, measure accuracy drop.""" components = { 'vision_encoder': model.vision_encoder, # ViT backbone 'projection': model.projection, # visual → LLM mapping 'llm_layers_0_15': model.llm.layers[:16], # first half of LLM 'llm_layers_16_31': model.llm.layers[16:],# second half of LLM 'action_head': model.action_head, # trajectory decoder } results = {} for name, component in components.items(): # Quantize just this component to INT8 quantized_model = copy_and_quantize(model, {name: 'int8'}) # Evaluate on trajectory prediction metrics metrics = evaluate(quantized_model, eval_data) ade_drop = metrics['ADE'] - baseline_metrics['ADE'] # Average Displacement Error fde_drop = metrics['FDE'] - baseline_metrics['FDE'] # Final Displacement Error collision_rate = metrics['collision_rate'] results[name] = { 'ADE_increase': ade_drop, 'FDE_increase': fde_drop, 'collision_rate': collision_rate, 'recommendation': 'INT8' if ade_drop < 0.05 else 'FP16' } return results # Typical results: # vision_encoder: ADE +0.02m → INT8 safe (CNN-like, robust) # projection: ADE +0.01m → INT8 safe (simple linear) # llm_layers_0_15: ADE +0.03m → INT8 safe (early layers less sensitive) # llm_layers_16_31: ADE +0.08m → FP16 needed (later layers more sensitive) # action_head: ADE +0.15m → FP16 critical (directly outputs trajectory!)
No matter how good the VLA is, you must have a classical safety system watching over it. The VLA is a learned model — it can fail in ways you cannot predict. The safety monitor is rule-based, deterministic, and fast.
python # Safety monitor: validates VLA output before sending to vehicle control class TrajectoryValidator: """Rule-based safety check for VLA trajectory output. Must run in < 1ms (CPU). Must NEVER be bypassed.""" MAX_ACCEL = 4.0 # m/s^2 — comfortable driving limit MAX_JERK = 2.5 # m/s^3 — passenger comfort MAX_LATERAL = 0.3 # g — lateral acceleration limit MAX_STEER_RATE = 0.5 # rad/s — steering wheel rate limit MIN_TTC = 2.0 # seconds — time to collision minimum def validate(self, trajectory, ego_state, obstacles): """Returns (is_safe, reason) tuple.""" # 1. Physical feasibility: can the vehicle actually follow this path? for i in range(1, len(trajectory)): dt = trajectory[i].t - trajectory[i-1].t dv = trajectory[i].v - trajectory[i-1].v accel = dv / dt if abs(accel) > self.MAX_ACCEL: return False, f"Accel {accel:.1f} exceeds {self.MAX_ACCEL}" # 2. Collision check: does trajectory intersect with any obstacle? for wp in trajectory: for obs in obstacles: ttc = time_to_collision(wp, obs) if ttc < self.MIN_TTC: return False, f"TTC {ttc:.1f}s < {self.MIN_TTC}s" # 3. Road boundary: does trajectory stay on drivable surface? for wp in trajectory: if not is_on_road(wp.x, wp.y): return False, "Trajectory leaves drivable area" # 4. Continuity: is trajectory smooth from current state? initial_accel = compute_accel(ego_state, trajectory[0]) if abs(initial_accel) > self.MAX_JERK * 0.1: return False, "Trajectory discontinuous from ego state" return True, "OK" def emergency_override(self, ego_state, obstacles): """Called when VLA trajectory is rejected. Returns safe fallback.""" # Simple: maintain lane, decelerate smoothly return generate_deceleration_trajectory( ego_state, decel=-2.0) # gentle braking, 2 m/s^2
Failure 1: Distributional shift in deployment. The VLA was trained on 10M miles of human driving data from five US cities. You deploy it in a new city with different lane markings, traffic patterns, and signage. The model encounters states it has never seen, and errors compound. Diagnosis: Monitor the model's prediction entropy — high entropy (low confidence) on consistently encountered scenarios indicates distributional shift. Fix: Online fine-tuning with LoRA on data from the new city (few-shot adaptation). Or: DAgger-style data collection with safety driver corrections. Prevention: include diverse geographies in training data.
Failure 2: Quantization corrupting planning output. INT8 quantization of the LLM backbone causes subtle errors in later layers that propagate to the action head. The VLA drives perfectly straight but consistently misjudges lane change timing by 0.3 seconds — enough to cause near-misses. Diagnosis: The sensitivity analysis from above catches this. Compare INT8 vs FP16 action head outputs on 10K scenarios — if FDE (Final Displacement Error) increases > 0.1m, the quantization is corrupting planning. Fix: Mixed precision — keep later LLM layers and action head in FP16.
Failure 3: Autoregressive latency blowing budget. The VLA generates 20 trajectory waypoints autoregressively. At 1.5ms per token on Orin, that's 30ms just for decoding — before counting prefill (25ms). Total: 55ms for the VLA alone, leaving only 45ms for everything else. Fix: (1) Speculative decoding with a small draft model reduces effective tokens by 2-3x. (2) Parallel decoding: predict all 20 waypoints simultaneously with a regression head (non-autoregressive). (3) Reduce planning horizon from 20 to 10 waypoints.
Failure 4: Catastrophic forgetting during LoRA adaptation. You fine-tune the VLA with LoRA on city-specific data. It improves in the new city but degrades on highway scenarios it previously handled well. Diagnosis: Evaluate on the FULL benchmark after LoRA fine-tuning, not just the new city. Fix: Mix new city data with replay data from the original training set (experience replay). Use separate LoRA adapters per scenario type that can be hot-swapped at inference time based on road classification.
Toggle between architectures to see data flow, latency breakdown, and failure propagation paths.
Everything you've learned in the past 14 chapters converges here. This interactive simulation lets you build and optimize the complete AV inference pipeline — from raw sensor input to vehicle control command — with real latency, memory, and accuracy trade-offs. You'll make the same decisions a staff engineer makes on day one: which precision for each stage, which hardware accelerator, which optimizations to enable, and whether the whole thing fits within the vehicle's unforgiving budget constraints.
Here is the complete inference pipeline for a modern AV perception-to-action system. Each stage has its own precision, latency, memory footprint, and accuracy impact. The interactive canvas below lets you configure each one.
Use the controls below the canvas to configure each pipeline stage. Watch the three budget bars (latency, memory, accuracy) update in real time. Green = within budget. Red = over budget.
Here's the strategy a staff engineer follows when optimizing this pipeline, step by step. The order matters — some optimizations interact.
Step 1: Free wins first. Enable FlashAttention and kernel fusion. These have zero accuracy cost. FlashAttention reduces BEV attention memory by 5x and latency by 20%. Kernel fusion (LayerNorm + GELU, bias + residual) reduces latency 10-15%.
Step 2: TensorRT compilation. Export to ONNX, compile with TensorRT. This fuses operations the hand-written kernels missed, auto-tunes kernel configurations for your specific GPU, and enables hardware-specific optimizations. Typically 20-30% latency reduction, zero accuracy cost.
Step 3: Precision optimization. Run sensitivity analysis per component. The backbone (CNNs) almost always tolerates INT8 with < 0.3% mAP drop. BEV attention is moderately sensitive — FP16 is safe, INT8 requires careful calibration. Detection and occupancy heads are usually fine in FP16. The trajectory prediction head is often sensitive — test carefully. Apply the most aggressive precision each component can tolerate.
Step 4: Architecture decisions. If you're still over budget, consider: (a) smaller backbone (S vs M vs L — large accuracy impact but large latency savings), (b) disable temporal fusion (saves 4-12ms but hurts tracking and velocity estimation), (c) DLA offload for the CNN backbone (saves 10-15W of GPU power, useful if thermally constrained).
Step 5: Micro-optimizations. If you're within 5ms of the budget, profile at the kernel level. Look for: unfused elementwise ops, unnecessary data format conversions (NCHW→NHWC), synchronization points that could be async, preprocessing steps that could overlap with GPU inference using CUDA streams and double-buffering.
When asked "Design an inference pipeline for a 3B perception model on a vehicle SOC," use this structure:
| Step | Action | What to Say | Time in Answer |
|---|---|---|---|
| 1. Clarify | Ask questions | "What's the latency target? Memory budget? Power envelope? Is this Orin or Thor? What sensors?" | 30 seconds |
| 2. Budget | Compute the numbers | "3B at INT8 = 3 GB weights. Activations ~1 GB. KV-cache if autoregressive: ~500 MB. System overhead: ~3.5 GB. Total: ~8 GB on a 32 GB SOC." | 1 minute |
| 3. Architecture | Draw the pipeline | "Sensor → preprocess → backbone → BEV → heads → prediction → planning → safety → control. Let me walk through each stage." | 2 minutes |
| 4. Optimization | Apply techniques | "Free wins first: FlashAttention, kernel fusion, TensorRT. Then precision: sensitivity analysis → mixed INT8/FP16. Then DLA offload for the CNN backbone." | 2 minutes |
| 5. Failure modes | Show experience | "Three things that will go wrong: thermal throttling in summer, bandwidth contention from concurrent models, and DLA compatibility breaks on model updates." | 1 minute |
| 6. Validation | Production readiness | "Parity check on 10K frames, P99 under worst-case thermal, shadow deployment for 2 weeks before production." | 1 minute |
This chapter is your reference sheet. Bookmark it. Print it. Read it on the way to the interview. It compresses the entire lesson into actionable tables, drill problems, and debugging frameworks. Every section is self-contained — you can study any one in isolation.
| Concept | 30-Second Explanation | Key Equation / Tool | Primary Tool | Classic Paper | 2024+ Paper |
|---|---|---|---|---|---|
| Symmetric INT8 PTQ | Map floats to [-127, 127] using a single scale factor. Zero maps to zero. Fast calibration, no retraining. | q = round(x/s), s = max|x|/127 | TensorRT PTQ | Krishnamoorthi 2018 | SmoothQuant (Xiao 2023) |
| Asymmetric Quant | Uses scale + zero-point for skewed distributions. Handles activations with non-zero mean (after ReLU: all positive). | q = round(x/s) + z | PyTorch QAT | Jacob et al. 2018 | QServe (Lin 2024) |
| QAT | Insert fake-quantize nodes during training. Model learns to be robust to quantization noise via STE gradients. | STE: ∂L/∂x = ∂L/∂q | PyTorch ao | Bengio STE 2013 | FP8-QAT (Hopper) |
| FlashAttention | Tiled attention that never materializes the N×N attention matrix. Exact (not approximate). IO-aware: minimizes HBM reads. | Online softmax + tiling | flash-attn library | Dao et al. 2022 | FlashAttention-3 (Hopper) |
| PagedAttention | OS-style virtual memory for KV-cache. Allocates fixed-size blocks on demand. Eliminates fragmentation waste. | Page table: logical → physical block | vLLM | Kwon et al. 2023 | SGLang (Zheng 2024) |
| KV-Cache | Cache key/value tensors from previous tokens to avoid recomputation during autoregressive decoding. Memory grows linearly with sequence length. | mem = 2 × nlayers × nheads × d × seq × bytes | TensorRT-LLM | GPT-2 (Radford 2019) | MLA (DeepSeek 2024) |
| LoRA | Freeze base model, train tiny low-rank adapter matrices. Merge for inference: W' = W + BA. Parameter-efficient fine-tuning. | W' = W + BA, r << d | PEFT library | Hu et al. 2021 | DoRA (Liu 2024) |
| TensorRT | Graph compiler: fuses ops, selects precision per layer, auto-tunes kernels for target GPU. Produces optimized "engine" file. | ONNX → TRT builder → engine | trtexec CLI | NVIDIA TensorRT | TensorRT-LLM 2024 |
| Dynamic Batching | Accumulate requests in a queue, process together. GPU ops scale sub-linearly with batch → huge throughput gain. | Throughput = B / tB | Triton Server | Triton Inference Server | Sarathi-Serve 2024 |
| Continuous Batching | For autoregressive models: evict completed sequences, insert new ones at every decode step. No wasted GPU on padding. | Inflight batch management | vLLM, TRT-LLM | Orca (Yu 2022) | Distserve (Zhong 2024) |
| BEV Perception | Project multi-camera 2D features into a unified 3D bird's-eye-view grid. Enables 3D detection from cameras only. | Lift-Splat or cross-attention | mmdet3d | LSS (Philion 2020) | StreamPETR (Wang 2024) |
| Occupancy Networks | Predict per-voxel occupancy and semantics in 3D space. Provides free-space reasoning beyond bounding boxes. | 3D grid: [X, Y, Z, C_semantic] | mmdet3d | OccNet (Tong 2023) | SparseOcc (Liu 2024) |
| Speculative Decoding | Small "draft" model generates candidate tokens, large model verifies in parallel. Reduces effective decode steps 2-3x. | Draft k tokens → verify batch | vLLM, TRT-LLM | Leviathan et al. 2023 | Medusa (Cai 2024) |
| Pruning | Remove weights (unstructured) or entire channels/heads (structured). Structured preferred for real speedup on GPUs. | Magnitude or gradient-based scoring | torch.nn.utils.prune | LTH (Frankle 2019) | Wanda (Sun 2024) |
| Knowledge Distillation | Train a small "student" model to mimic a large "teacher." The teacher's soft outputs contain more information than hard labels. | L = αCE(y, s) + (1-α)KL(t, s) | Custom training | Hinton et al. 2015 | TinyLLM (2024) |
| Thermal Throttling | SOC reduces clock speed when junction temperature approaches limit. Causes non-linear latency spikes on edge devices. | Tj = Tamb + P × ΘJA | tegrastats | N/A (hardware) | N/A |
| CUDA Streams | Independent work queues on GPU. Enable overlapping compute with data transfer. Essential for pipelining sensor frames. | cudaStreamCreate / enqueue | Nsight Systems | CUDA Programming Guide | CUDA Graphs (12.x) |
| DLA (Deep Learning Accelerator) | Fixed-function inference engine on Orin. Supports Conv/BN/Pool/ReLU at 3-5x power efficiency vs GPU. No attention, no LayerNorm. | TensorRT DLA partitioning | trtexec --useDLACore | NVIDIA Orin docs | Thor DLA (2025+) |
Question 1: "Design an inference pipeline for a 3B VLM on a 30W vehicle SOC."
Key components:
Scaling strategy: This is on-vehicle, so no horizontal scaling. Vertical scaling = wait for next-gen SOC (Thor: 2000 TOPS, 128 GB memory). Until then, the optimization is all in precision, compilation, and DLA offload.
Failure modes: (1) Thermal throttling in summer → latency P99 spikes. Mitigation: thermal-aware model switching. (2) KV-cache growing unbounded if sequence length isn't capped → OOM. Mitigation: fixed-length sliding window, pre-allocated cache. (3) ViT tokens are 3456 per frame — at 30fps, re-encoding every frame is wasteful. Mitigation: cache visual tokens, only re-encode when scene changes significantly (motion-triggered).
Build vs buy: Use TensorRT-LLM for the LLM backbone compilation (buy). Write custom TensorRT plugins for the vision-to-language projection (build — this is model-specific). Use the Orin DLA for the ViT backbone if it's CNN-based (configure). Write the safety monitor from scratch (build — too safety-critical to depend on external code).
Question 2: "Design an auto-labeling pipeline that processes 10M driving scenes per day."
Key components:
Failure modes: (1) Data pipeline bottleneck — GPUs idle waiting for images. Fix: profile data loading vs compute time, add prefetch workers. (2) Label drift — model performance degrades on distribution shift in new data. Fix: continuous monitoring with held-out human-labeled test set. (3) Cost explosion — spot instances get reclaimed during training deadlines. Fix: mix on-demand (for deadline-critical) and spot (for best-effort) with automatic checkpointing.
Question 3: "Design a training infrastructure for 100 ML researchers working on AV models."
Key components:
Failure modes: (1) GPU waste — researcher launches 8-GPU job, code bug crashes at hour 2, GPUs sit idle for 6 hours. Fix: automatic health checking and job termination on stall. (2) Reproducibility crisis — "it worked on my machine" across different GPU types. Fix: containerized training environments, pinned dependencies, deterministic seeding. (3) Storage bottleneck — 100 researchers all read the same dataset simultaneously. Fix: distributed caching, dataset sharding, prefetch workers.
Question 4: "Design a parity testing and deployment gate for model updates on the vehicle fleet."
Key components:
Drill 1: "Implement symmetric INT8 quantization from scratch."
python def quantize_symmetric(x, bits=8): """Symmetric quantization: zero maps to zero, scale only.""" qmax = 2 ** (bits - 1) - 1 # 127 for INT8, 7 for INT4 scale = x.abs().max() / qmax # one scale for entire tensor q = torch.round(x / scale) # round to nearest integer q = q.clamp(-qmax, qmax).to(torch.int8) return q, scale # Dequantize: x_hat = q.float() * scale # Talk about: per-tensor vs per-channel, outlier clipping, # why clamp is needed (round can exceed qmax)
While writing, discuss: per-tensor vs per-channel granularity; what happens with outlier activations (single large value wastes dynamic range); the straight-through estimator for gradients in QAT.
Follow-ups: "How would you add per-channel support?" "What about asymmetric?" "How does SmoothQuant handle outlier activations?"
Drill 2: "Write a CUDA kernel for vector addition."
cuda __global__ void vecadd(float* a, float* b, float* c, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) c[i] = a[i] + b[i]; // bounds check prevents OOB } // Launch: vecadd<<<(n+255)/256, 256>>>(a, b, c, n); // Talk about: why (n+255)/256 (ceiling division for grid size), // memory coalescing (adjacent threads access adjacent memory), // occupancy (256 threads per block is a safe default)
Follow-ups: "How would you add shared memory?" "What if a and b are in different memory spaces?" "How do you handle float4 vectorized loads for better bandwidth?"
Drill 3: "Implement LoRA forward pass."
python def lora_forward(x, W, A, B, scale): """x: [batch, in_dim], W: [out_dim, in_dim], A: [r, in_dim], B: [out_dim, r]""" base = x @ W.T # [batch, out_dim] — frozen base model lora = scale * (x @ A.T) @ B.T # [batch, r] → [batch, out_dim] return base + lora # Key: x @ A.T is [batch, r] — tiny! r=8 or 16 typically # Merge for inference: W_merged = W + scale * B @ A # Then single matmul at full speed, no LoRA overhead
While writing, discuss: why low-rank works (weight updates during fine-tuning are empirically low-rank); how to choose r (start with 8, increase if underfitting); memory savings (fine-tuning 3B model needs only 50MB of LoRA weights vs 6GB for full fine-tuning).
Follow-ups: "How does QLoRA differ?" "What's the math behind merging for inference?" "Can you apply LoRA to attention only, or all linear layers?"
Drill 4: "Implement a simple memory budget calculator."
python def memory_budget( num_params, # total parameters (e.g., 3e9) precision='int8', # 'fp32', 'fp16', 'int8', 'int4' seq_len=2048, # sequence length for KV-cache num_layers=32, # transformer layers num_heads=32, # attention heads head_dim=64, # dimension per head batch_size=1, ): bytes_per = {'fp32': 4, 'fp16': 2, 'int8': 1, 'int4': 0.5} bpp = bytes_per[precision] weights = num_params * bpp # model weights kv_cache = 2 * num_layers * num_heads * head_dim * seq_len * batch_size * 2 # 2 for K and V, × 2 bytes (FP16 cache regardless of weight precision) activations = num_params * 0.3 * bpp # rough: 30% of weights system_overhead = 3.5e9 # OS + CUDA + TRT: ~3.5 GB total = weights + kv_cache + activations + system_overhead return { 'weights_gb': weights / 1e9, 'kv_cache_gb': kv_cache / 1e9, 'activations_gb': activations / 1e9, 'system_gb': system_overhead / 1e9, 'total_gb': total / 1e9, 'fits_orin_32gb': total < 32e9 * 0.8, # 80% safety margin }
Follow-ups: "What did you assume for activations? How would you measure it more precisely?" "How does MQA (multi-query attention) change the KV-cache calculation?" "What if we need two models loaded simultaneously?"
Drill 5: "Write a Triton (OpenAI Triton, not NVIDIA Triton) kernel for fused softmax."
python import triton, triton.language as tl @triton.jit def softmax_kernel(input_ptr, output_ptr, n_cols, BLOCK: tl.constexpr): row = tl.program_id(0) # one block per row offsets = tl.arange(0, BLOCK) # column indices mask = offsets < n_cols # bounds mask # Load row row_ptr = input_ptr + row * n_cols x = tl.load(row_ptr + offsets, mask=mask, other=-1e9) # Numerically stable softmax: subtract max first x_max = tl.max(x, axis=0) x = x - x_max exp_x = tl.exp(x) sum_exp = tl.sum(exp_x, axis=0) result = exp_x / sum_exp # Store out_ptr = output_ptr + row * n_cols tl.store(out_ptr + offsets, result, mask=mask)
While writing, discuss: Why subtract max (prevents overflow in exp); why one block per row (rows are independent); how this is faster than PyTorch (single fused kernel vs 4 separate memory-bound ops: max, subtract, exp, divide).
Drill 6: "Implement parity check between FP32 reference and INT8 engine."
python def parity_check(ref_model, opt_engine, test_loader, rtol=0.01, atol=0.05): """Returns True if optimized engine matches reference within tolerance.""" all_cos_sims, all_max_diffs = [], [] for batch in test_loader: with torch.no_grad(): ref_out = ref_model(batch).float() opt_out = opt_engine.infer(batch).float() # Element-wise metrics max_diff = (ref_out - opt_out).abs().max().item() cos_sim = torch.nn.functional.cosine_similarity( ref_out.flatten(), opt_out.flatten(), dim=0).item() all_max_diffs.append(max_diff) all_cos_sims.append(cos_sim) passed = (max(all_max_diffs) < atol and min(all_cos_sims) > 1.0 - rtol) return passed, { 'worst_max_diff': max(all_max_diffs), 'worst_cos_sim': min(all_cos_sims), }
Follow-ups: "How would you extend this to per-layer comparison?" "What tolerance would you set for detection outputs vs classification logits?" "How do you handle non-deterministic outputs (dropout, sampling)?"
Scenario 1: "INT8 model produces NaN outputs on 0.1% of inputs."
Scenario 2: "Latency spikes every ~60 seconds during on-vehicle inference."
gc.get_stats() timing. Fix: Move to C++ runtime for inference. Or disable GC in the hot path: gc.disable() during inference, run manually in idle periods.cudaDeviceSynchronize() which blocks all CUDA work until the GPU is idle. Diagnostic: Nsight Systems timeline — look for unexpected sync points. Fix: Use async CUDA operations with streams. Never call device-level sync in production.cudaMalloc/cudaFree during inference, the allocator may occasionally compact memory. Diagnostic: Monitor CUDA memory allocation events. Fix: Pre-allocate all buffers at startup. Zero dynamic allocation in the inference loop.Scenario 3: "Model accuracy fine on eval but bad in production."
model.eval() is called. Check all BN layers are using running stats.Scenario 4: "Training loss plateaus at 32 GPUs but not at 8."
Scenario 5: "TensorRT engine is 20% slower than expected from benchmarks."
Scenario 6: "GPU utilization is 95% but throughput is lower than expected."
| Task | Classical Approach | Modern (Learned) Approach | When to Use Classical | Key Trade-off |
|---|---|---|---|---|
| 3D Object Detection | PointPillars, SECOND (voxel-based) | StreamPETR, BEVFormer (transformer-based BEV) | LiDAR-primary, low compute, real-time on weak hardware | Classical faster but lower accuracy on camera-only setups |
| Depth Estimation | Stereo matching (SGM, ELAS) | Depth Anything v2, MoGe, Metric3D | When stereo cameras available and accuracy > precision needed | Classical needs stereo pair; modern works with single camera |
| Object Tracking | Kalman filter + Hungarian matching | MOTR, TrackFormer (transformer end-to-end) | Real-time, low compute, predictable behavior | Classical: 0.1ms, predictable. Modern: 5ms, handles occlusion better |
| Model Compression | Magnitude pruning, knowledge distillation | SparseGPT, Wanda, AWQ, GPTQ | Structured pruning for actual HW speedup | Classical gives real speedup (remove channels). Modern gives better accuracy retention for weight-only |
| Kernel Optimization | Hand-written CUDA kernels | Triton (OpenAI), torch.compile, TVM | Critical path kernels, maximum performance | Hand-written: 10x dev time, 20% faster. Triton: quick iteration, good enough |
| Trajectory Prediction | Physics-based (constant velocity, bicycle model) | MotionDiffuser, MTR++, QCNet | Simple scenarios (highway, no interaction) | Classical: deterministic, explainable. Modern: handles multi-agent interaction |
| Path Planning | A*, RRT, lattice planner | Neural planner (UniAD, VAD) | When safety certification required, deterministic guarantees needed | Classical: provably complete, verifiable. Modern: more human-like, smoother |
| Sensor Calibration | Checkerboard patterns, APRiL tags | CalibAnything, self-supervised calibration | When accuracy > convenience, offline calibration fine | Classical: sub-pixel accurate but manual. Modern: automatic but less precise |
| Localization | Particle filter, EKF, factor graphs | Neural relocalization (MapLite, NeuralRecon) | When HD maps available, need centimeter accuracy | Classical: cm-accurate with good map. Modern: works without maps |
| Anomaly Detection | Statistical process control (SPC), threshold-based | Autoencoders, one-class classification | When failure modes are well-understood and enumerable | Classical: zero false negatives for known failures. Modern: catches unknown unknowns |
| NMS (Post-processing) | Greedy NMS, Soft-NMS | End-to-end set prediction (DETR-style) | When using anchor-based detectors, speed critical | Classical: fast, simple. Modern: eliminates NMS entirely but needs transformer detector |
| Data Augmentation | Random crop, flip, color jitter | Generative augmentation (diffusion-based) | When real data is plentiful enough | Classical: fast, deterministic. Modern: generates realistic rare scenarios |
8 papers to read (and WHY):
| # | Paper | Why Read It |
|---|---|---|
| 1 | FlashAttention (Dao et al., 2022) | The IO-awareness paradigm shift. Teaches you that FLOPs don't determine runtime — memory movement does. Understanding the online softmax trick and tiling strategy is essential for anyone writing GPU kernels. Read Section 3 (Algorithm) in detail. |
| 2 | Efficient Inference on a Single GPU (vLLM / PagedAttention, Kwon et al., 2023) | Shows how OS concepts (virtual memory, paging) transfer to ML systems. The key insight: KV-cache memory is fragmented just like OS memory, and the same solution (paging) works. Read for systems-thinking in ML. Study the throughput experiments in Section 5. |
| 3 | SmoothQuant (Xiao et al., 2023) | The elegant solution to the outlier activation problem. Migrating quantization difficulty from activations to weights using a mathematically simple per-channel scaling. Shows how one insight can make previously-impossible quantization work. Read Section 3 (Method) — it's only 2 pages. |
| 4 | LoRA (Hu et al., 2021) | The foundation of parameter-efficient fine-tuning that's used everywhere in production. Understanding WHY low-rank updates work (intrinsic dimensionality of the update matrix) makes you a better model optimization engineer. Read Sections 2-4. |
| 5 | BEVFormer (Li et al., 2022) | The reference architecture for camera-based 3D perception. Combines spatial cross-attention (image→BEV) with temporal self-attention (fuse past frames). Understanding this architecture is table-stakes for AV inference work. Read the architecture diagram in Section 3 carefully. |
| 6 | GPTQ (Frantar et al., 2023) | The breakthrough in weight-only INT4 quantization. Uses second-order information (Hessian inverse) to minimize quantization error layer by layer. Enables 3-4x compression with minimal quality loss. Read Section 3 — the OBQ algorithm is a beautiful application of matrix math. |
| 7 | Scaling Laws for Neural Language Models (Kaplan et al., 2020) | Understanding scaling laws is critical for making architecture and infrastructure decisions. This paper tells you how performance scales with compute, data, and parameters — essential for planning hardware procurement and model sizing. Sections 3-4. |
| 8 | UniAD (Hu et al., 2023) | The first unified end-to-end autonomous driving framework. Shows how detection, tracking, prediction, and planning can be combined in a single model. Understanding this architecture is essential for anyone working on VLA inference. Read the full pipeline in Section 3. |
8 repos to study (and WHAT to look at):
| # | Repository | What to Study |
|---|---|---|
| 1 | vLLM | Study vllm/core/scheduler.py for continuous batching logic. Study vllm/attention/ for PagedAttention implementation. This is production-grade ML systems code — note the extensive error handling and edge case management. |
| 2 | FlashAttention | Study csrc/flash_attn/ for the CUDA kernel. Focus on the tiling strategy in flash_fwd_kernel.h. Note how shared memory is used as a scratchpad for Q, K, V tiles. This is world-class CUDA code. |
| 3 | TensorRT-LLM | Study tensorrt_llm/models/ for how models are defined as TensorRT graphs. Study tensorrt_llm/runtime/ for the inference runtime including inflight batching and KV-cache management. |
| 4 | OpenAI Triton | Study the python/tutorials/ directory — especially 02-fused-softmax.py and 06-fused-attention.py. These show how to write GPU kernels in Python that match hand-tuned CUDA performance. |
| 5 | mmdetection3d | Study projects/BEVFormer/ for the BEV perception pipeline. Focus on data flow: how multi-camera images are transformed into BEV features. Study the config files for understanding model architecture specification. |
| 6 | NVIDIA Triton Inference Server | Study the docs/examples/model_repository/ for model configuration patterns. Study src/core/dynamic_batch_scheduler.cc for how dynamic batching actually works at the code level. Note the queue management and batch assembly logic. |
| 7 | DeepSpeed | Study deepspeed/runtime/zero/ for ZeRO optimization stages. Focus on Stage 3 (stage3.py) to understand how optimizer states, gradients, and parameters are sharded across GPUs. This is essential for large model training. |
| 8 | torch.compile (PyTorch) | Study torch/_inductor/ for how PyTorch generates optimized kernels. Focus on triton_ops/ to see how high-level PyTorch ops are lowered to Triton kernels. Understanding this compilation path helps debug torch.compile issues in production. |