Staff-level interview prep: cloud infrastructure, GPU clusters, LLM serving, distributed systems, cost optimization, reliability engineering, and the full scaling platform.
It is 2:47 AM. Your phone buzzes. PagerDuty: "LLM inference cluster us-east-1: p99 latency 4200ms (SLO: 800ms). GPU utilization dropped from 78% to 12%." You pull up Grafana on your phone. The throughput graph fell off a cliff 11 minutes ago. You SSH into the bastion host and check the NVIDIA driver logs — three H100 nodes reported an ECC memory error and the SLURM scheduler evicted all jobs. But the autoscaler hasn't replaced them because the spot capacity in us-east-1 is exhausted at this hour.
You execute the runbook: failover inference traffic to us-west-2 via the global load balancer, then spin up on-demand instances in us-east-1 as a bridge until spot capacity returns. Total time to recovery: 7 minutes. Cost of the on-demand bridge: ~$340/hour. Cost of the outage if you hadn't responded: ~$12,000/minute in lost API revenue.
This is the daily reality of an Infrastructure & LLM Scaling Engineer at Parallel. You are the person who makes sure the AI actually runs — at scale, cost-efficiently, and reliably. Every prompt that flows through Parallel's platform touches infrastructure you built or maintain: the Kubernetes clusters, the GPU scheduling, the model serving layer, the database shards, the monitoring pipelines, the CI/CD system, the cost optimization automation.
The role sits at the intersection of five engineering disciplines:
| Discipline | What you own | Your daily intersection |
|---|---|---|
| Cloud & IaC | Terraform modules, K8s clusters, service mesh, multi-region | Every infrastructure change goes through your IaC pipeline |
| GPU & ML Infra | GPU scheduling, driver management, model serving, batch tuning | You keep the most expensive hardware in the fleet utilized |
| Distributed Systems | Data replication, consensus, sharding, consistency guarantees | You design the systems that survive partial failures |
| Reliability | SLOs, error budgets, chaos engineering, incident response | You define what "reliable" means and enforce it |
| Cost Engineering | Spot strategy, right-sizing, utilization monitoring, spend forecasting | You make the $2B company profitable by controlling infrastructure spend |
Before we dive in, internalize these numbers. They come up in every infrastructure interview and every oncall incident:
| Fact | Value | Why it matters |
|---|---|---|
| H100 80GB cost (cloud) | ~$10/hr | Your biggest cost line item. Every idle minute = $0.17 wasted. |
| H100 memory bandwidth | 3.35 TB/s | The bottleneck for LLM decode. Determines tokens/sec/GPU. |
| 70B model in FP16 | 140 GB | Doesn't fit on one H100. Need TP=2 minimum. |
| 70B model in INT4 | ~35 GB | Fits on one H100 with room for KV cache. 50% cost reduction. |
| KV cache per token (70B) | ~1.6 MB | 128 concurrent requests × 4096 tokens = 800 GB of KV cache. |
| NVLink bandwidth (H100) | 900 GB/s | 15x faster than PCIe. Required for tensor parallelism. |
| 99.99% uptime | 52 min/yr downtime | Every minute of outage burns a week of error budget. |
| Spot instance savings | 60-90% | The biggest cost lever after GPU utilization. |
| Continuous batching improvement | 2-10x | The single most impactful serving optimization. |
9:00 AM — Check the overnight dashboard. GPU utilization averaged 72% (good). Error budget at 85% (comfortable). One P2 ticket: a customer's rate limiter is misconfigured, they are getting 429s.
10:00 AM — Design review for the new B200 migration plan. You present: phase 1 is a canary cluster (4 nodes, 32 B200s), run shadow traffic from production, compare latency/throughput/quality against H100 baseline. Phase 2 is gradual migration over 6 weeks. The tricky part: B200 uses a different CUDA driver version, so every container image needs rebuilding.
11:30 AM — Pair with a junior engineer on their K8s operator for GPU autoscaling. They wrote the controller logic but forgot to handle the case where spot instances are reclaimed mid-scaling. You add a reconciliation loop that checks for unexpected node removals.
2:00 PM — Cost review meeting. You present the monthly infrastructure report: GPU spend is $3.2M, down 8% from last month thanks to the FP8 quantization rollout. Networking egress is unexpectedly $180K — investigation reveals a misconfigured cross-region data pipeline that is copying model telemetry to all three regions instead of aggregating locally.
4:00 PM — Chaos engineering session. You run the "kill a random GPU node" experiment in production (with traffic-shifted to other nodes first). The autoscaler takes 4 minutes to replace the node. Your SLO requires recovery within 5 minutes. Passing, but barely. You file a ticket to reduce cold-start time by pre-pulling container images to all nodes.
5:30 PM — Postmortem for last week's incident (a bad Terraform apply deleted a security group rule, briefly exposing an internal service to the internet). You draft the timeline, root cause (lack of terraform plan review for auto-applied changes), and action items (require human approval for security group modifications).
The diagram below traces a single LLM inference request from arrival to response. Every box is a system you own or co-own. This is your whiteboard answer in a system-design interview.
Watch an inference request flow through the full stack. Latency counters show where time is spent. Click Kill Node to see failover in action.
Staff-level interviews at companies like Parallel, Anthropic, OpenAI, and Databricks test you across five dimensions. Each chapter maps to one or more:
| Dimension | What they ask | Chapters |
|---|---|---|
| CONCEPT | "Explain how PagedAttention works and why it matters for serving" | 0, 3, 4, 8 |
| DESIGN | "Design a multi-region GPU inference platform for 100K req/s" | 1, 2, 5, 8 |
| CODE | "Write a K8s operator that autoscales GPU pods based on queue depth" | 1, 2, 6, 9 |
| DEBUG | "p99 latency doubled after a deployment. Walk me through investigation." | 3, 7, 10, 11 |
| FRONTIER | "How does SGLang compare to vLLM for MoE models in 2025?" | 2, 3, 6, 12 |
You cannot SSH into 400 servers and configure them by hand. You cannot click through the AWS console to set up a VPC, three subnets, a NAT gateway, security groups, an EKS cluster, and a service mesh. You cannot do it because it is not reproducible, not auditable, not reviewable, and not recoverable. If someone accidentally deletes a security group, there is no "undo." If a new hire needs to set up a staging environment, there is no "clone production."
Infrastructure as Code (IaC) means your entire infrastructure — every VPC, every firewall rule, every Kubernetes cluster, every IAM policy — exists as version-controlled code. Change infrastructure by making a pull request. Review it like you review application code. Roll it back with git revert. This is not optional for a $2B company. It is the foundation of everything.
Terraform uses HCL (HashiCorp Configuration Language), a declarative DSL. You describe the desired state ("I want 3 EC2 instances of type p4d.24xlarge in us-east-1") and Terraform figures out how to get there. It maintains a state file that records what currently exists, computes a diff against your desired state, and generates an execution plan.
Pulumi lets you write infrastructure in real programming languages (Python, TypeScript, Go). Instead of learning HCL, you use familiar constructs: loops, conditionals, functions, classes. The tradeoff: Pulumi code can have bugs that HCL cannot (infinite loops, runtime exceptions). But it is vastly more expressive for complex infrastructure.
| Dimension | Terraform | Pulumi |
|---|---|---|
| Language | HCL (declarative DSL) | Python, TS, Go (imperative) |
| State | JSON state file (S3, Terraform Cloud) | Pulumi Service or self-hosted backend |
| Testing | terraform plan + Sentinel policies | Unit tests with pytest/jest |
| Learning curve | New DSL but simple | Your existing language |
| Industry adoption | Dominant (2024-2025) | Growing, esp. in startups |
| Drift detection | terraform plan detects drift | pulumi preview detects drift |
Parallel serves customers globally. A single-region architecture means that if us-east-1 has an outage (and it does, roughly once per year), every customer is down. A multi-region architecture runs independent clusters in at least three regions, with a global load balancer routing traffic to the nearest healthy region.
The architecture has three layers:
Layer 1: Global routing. A DNS-based or anycast load balancer (AWS Global Accelerator, Cloudflare) routes requests to the nearest region. Health checks remove unhealthy regions in under 30 seconds.
Layer 2: Regional K8s clusters. Each region runs an independent EKS/GKE cluster with its own control plane, worker nodes, and GPU node pools. Clusters share configuration (via GitOps) but are operationally independent.
Layer 3: Data replication. The hard part. Model weights are read-only and can be pre-cached in each region. But stateful data (user sessions, conversation history, rate limit counters) must be replicated. Options: CockroachDB for strong consistency (higher latency), Redis Cluster with eventual consistency (lower latency), or a hybrid where critical data uses strong consistency and session data is eventually consistent.
hcl # modules/gpu-cluster/main.tf # Reusable module: one GPU-enabled EKS cluster per region variable "region" { type = string } variable "gpu_type" { type = string default = "p5.48xlarge" } # H100 nodes variable "min_gpus" { type = number default = 4 } variable "max_gpus" { type = number default = 32 } variable "k8s_version" { type = string default = "1.30" } module "vpc" { source = "terraform-aws-modules/vpc/aws" cidr = "10.0.0.0/16" azs = ["${var.region}a", "${var.region}b", "${var.region}c"] private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"] public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"] enable_nat_gateway = true single_nat_gateway = false # One NAT per AZ for HA enable_dns_hostnames = true } module "eks" { source = "terraform-aws-modules/eks/aws" cluster_name = "parallel-${var.region}" cluster_version = var.k8s_version vpc_id = module.vpc.vpc_id subnet_ids = module.vpc.private_subnets eks_managed_node_groups = { gpu_pool = { instance_types = [var.gpu_type] min_size = var.min_gpus max_size = var.max_gpus desired_size = var.min_gpus ami_type = "AL2_x86_64_GPU" labels = { "nvidia.com/gpu.present" = "true" } taints = [{ key = "nvidia.com/gpu", value = "true", effect = "NO_SCHEDULE" }] } cpu_pool = { instance_types = ["m6i.4xlarge"] min_size = 3 max_size = 20 } } }
nvidia.com/gpu=true:NoSchedule so that only pods with a matching toleration (your inference workloads) get scheduled there. Without this, a random logging daemon could land on an H100 node and waste $90/hour of GPU time. This is the single most common IaC mistake in GPU clusters.The most dangerous IaC bug is state drift — when the real infrastructure diverges from what Terraform thinks exists. Someone clicked in the console and added a security group rule. Terraform doesn't know about it. Next terraform apply might delete that rule, breaking production.
bash # Detect drift: compare state to reality terraform plan -refresh-only # If drift detected, two options: # 1. Import the manual change into state (adopt it) terraform import aws_security_group_rule.manual_rule sg-12345 # 2. Override the manual change (enforce IaC as source of truth) terraform apply # Will revert manual changes # Prevention: lock down console access with SCPs # Run drift detection on a cron (every 4 hours)
GitOps takes IaC further: Git is not just where you store infrastructure code, it is the sole source of truth for cluster state. A GitOps controller (ArgoCD, Flux) continuously watches your Git repo and reconciles the cluster to match. If someone manually changes a Kubernetes resource, the controller reverts it within seconds.
The pattern for Parallel: one Git repo contains Helm charts and Kustomize overlays for every service. ArgoCD runs in each regional cluster. A merge to main triggers automatic rollout to staging, then a manual approval gate for production. Rollback is git revert — ArgoCD sees the revert commit and reconciles the cluster back to the previous state.
python # Simplified K8s operator: scale GPU pods based on inference queue depth import kopf, kubernetes from kubernetes import client @kopf.timer('apps', 'v1', 'deployments', labels={'parallel.ai/gpu-autoscale': 'true'}, interval=30) def autoscale_gpu_pods(name, namespace, spec, status, **kwargs): """Scale GPU inference pods based on queue depth metric.""" # Read queue depth from Prometheus queue_depth = query_prometheus( f'llm_request_queue_depth{{deployment="{name}"}}' ) current_replicas = status.get('readyReplicas', 0) annotations = spec.get('template', {}).get('metadata', {}).get('annotations', {}) max_replicas = int(annotations.get('parallel.ai/max-replicas', '16')) min_replicas = int(annotations.get('parallel.ai/min-replicas', '2')) target_queue_per_pod = int(annotations.get('parallel.ai/target-queue', '10')) # Calculate desired replicas if queue_depth > target_queue_per_pod * current_replicas * 1.5: desired = min(max_replicas, current_replicas + 2) # Scale up aggressively elif queue_depth < target_queue_per_pod * current_replicas * 0.3: desired = max(min_replicas, current_replicas - 1) # Scale down conservatively else: desired = current_replicas # No change if desired != current_replicas: apps_v1 = client.AppsV1Api() patch = {'spec': {'replicas': desired}} apps_v1.patch_namespaced_deployment_scale(name, namespace, patch) kopf.info(spec, reason='Autoscale', message=f'Scaled {name}: {current_replicas} → {desired} (queue={queue_depth})')
GPU inference containers are 3-8GB (CUDA runtime, Python, vLLM, model loading code). Pulling this image from a registry when a new node starts takes 2-5 minutes. During an autoscaling event, this 2-5 minutes is dead time where you have a GPU node running ($10/hr) but not serving traffic.
Solutions, ordered by effectiveness:
| Approach | Pull time | Complexity |
|---|---|---|
| Pre-pull on node boot | 0s (already cached) | DaemonSet that pulls images on startup |
| Dragonfly P2P | 30-60s (distributed pull) | P2P image distribution, no registry bottleneck |
| ECR + VPC endpoint | 60-90s (fast private pull) | Avoid internet egress, use S3 transfer speeds |
| Standard registry pull | 2-5 min | Default, slowest |
Three regions serving traffic. Click Region Failure to see automatic failover. The health indicators show cluster status.
terraform plan and see it wants to delete a security group rule you didn't add. What happened and what's the safest response?A single NVIDIA H100 GPU costs $25,000-$40,000. A single 8-GPU node costs $250,000+. Parallel's inference fleet might have 200+ H100s across regions. That is $5-8M in GPU hardware. Your job is to keep every one of those GPUs doing useful work — not sitting idle, not thermal throttling, not stuck in a driver crash loop. GPU utilization is the single most important metric in your entire infrastructure.
Understanding GPU hardware is not optional. When a debug session leads you to "the model is slow," you need to know why at the hardware level.
| GPU | HBM (GB) | Memory BW (TB/s) | FP16 TFLOPS | NVLink BW (GB/s) | Cost/hr (cloud) |
|---|---|---|---|---|---|
| A100 80GB | 80 | 2.0 | 312 | 600 | ~$3.50 |
| H100 80GB | 80 | 3.35 | 990 | 900 | ~$8-12 |
| H200 141GB | 141 | 4.8 | 990 | 900 | ~$12-15 |
| B200 192GB | 192 | 8.0 | 2250 | 1800 | ~$15-20 |
LLM inference is memory-bandwidth bound, not compute bound. During autoregressive decoding, each generated token requires reading the entire model weights from HBM once. A 70B parameter model in FP16 is 140GB — it must be read from memory for every single token. The H100 has 3.35 TB/s of memory bandwidth, so it can read 140GB in ~42ms, giving you roughly 24 tokens/second on a single GPU. This is why memory bandwidth matters more than TFLOPS for inference.
You have 200 GPUs. You have training jobs, inference serving, fine-tuning jobs, and evaluation runs competing for them. How do you schedule?
Option 1: SLURM. The traditional HPC scheduler. Jobs submit resource requests (e.g., "I need 8 GPUs for 4 hours"), SLURM queues them and allocates when resources are free. Good for batch workloads (training, eval). Bad for real-time inference because SLURM doesn't do preemption well — a 4-hour training job blocks GPUs even if inference latency is spiking.
Option 2: Kubernetes with GPU operator. NVIDIA's GPU operator for K8s handles driver installation, device plugin, and GPU resource advertisement. Pods request nvidia.com/gpu: 1 and K8s schedules them. Good for inference (pods auto-scale), but K8s doesn't understand GPU topology (which GPUs share NVLink) without the GPU Operator's topology-aware scheduling.
Option 3: Hybrid. Parallel's architecture: K8s for inference serving (needs instant scaling), SLURM for training and fine-tuning (needs large allocations). A resource broker sits between them, dynamically shifting GPUs from the training pool to inference during peak hours and back during off-peak.
python import subprocess, json, time, requests def check_gpu_health() -> list[dict]: """Query nvidia-smi for all GPUs, return health status.""" result = subprocess.run( ["nvidia-smi", "--query-gpu=index,uuid,temperature.gpu," "utilization.gpu,memory.used,memory.total,ecc.errors.corrected.volatile.total," "ecc.errors.uncorrected.volatile.total,power.draw", "--format=csv,noheader,nounits"], capture_output=True, text=True ) gpus = [] for line in result.stdout.strip().split("\n"): parts = [p.strip() for p in line.split(",")] gpu = { "index": int(parts[0]), "uuid": parts[1], "temp_c": int(parts[2]), "util_pct": int(parts[3]), "mem_used_mb": int(parts[4]), "mem_total_mb": int(parts[5]), "ecc_corrected": int(parts[6]), "ecc_uncorrected": int(parts[7]), "power_w": float(parts[8]), } gpu["healthy"] = ( gpu["temp_c"] < 83 # Throttle at 83C and gpu["ecc_uncorrected"] == 0 # Uncorrectable = bad memory and gpu["power_w"] > 50 # GPU drawing power = alive ) gpus.append(gpu) return gpus def alert_unhealthy(gpus: list[dict], webhook: str): """Send PagerDuty alert for unhealthy GPUs.""" bad = [g for g in gpus if not g["healthy"]] if bad: requests.post(webhook, json={ "routing_key": "GPU_HEALTH", "event_action": "trigger", "payload": { "summary": f"{len(bad)} unhealthy GPU(s): {[g['index'] for g in bad]}", "severity": "critical", "source": "gpu-health-monitor", } })
ECC memory errors. GPUs have Error-Correcting Code memory. Correctable errors are normal (1-2 per day). Uncorrectable errors mean bad memory cells — the GPU must be drained and replaced. Check: nvidia-smi -q -d ECC.
Xid errors. NVIDIA kernel driver errors appear in dmesg as "Xid" messages. Xid 79 = "GPU fallen off the bus" (PCIe link failed). Xid 48 = double-bit ECC error. Xid 31 = GPU memory page retirement. Each has a different severity and recovery path. A staff engineer maintains a runbook mapping Xid codes to actions.
NVLink failures. Multi-GPU nodes use NVLink for fast GPU-to-GPU communication. If an NVLink fails, tensor parallelism falls back to PCIe (10x slower). Detection: nvidia-smi nvlink -s shows link status. Symptom: one node's throughput is 10x lower than peers.
Multi-Instance GPU (MIG) partitions a single H100 into up to 7 isolated instances, each with its own memory, compute, and L2 cache. This lets you run 7 small models (7B parameters) on one GPU, instead of wasting an 80GB GPU on a model that uses 14GB. MIG is critical for cost optimization when serving a mix of model sizes.
The frontier in 2025 is time-sliced GPU sharing with NVIDIA's MPS (Multi-Process Service) and the emerging fractional GPU schedulers (Run:AI, Nebius). These let multiple pods share a GPU with guaranteed isolation, going beyond MIG's fixed partitions to dynamic fractional allocation.
Managing 200+ GPUs is not just scheduling. It is a lifecycle management problem:
The most frustrating GPU debug scenario: your inference workload crashes with CUDA error: no kernel image is available for execution on the device. The model was compiled with CUDA 12.4 but the node has driver 535 which only supports CUDA 12.2. The fix seems simple (update the driver) but the driver update requires a node reboot, which means draining all workloads first.
Prevention: pin CUDA version in your container images and in your Terraform node provisioning. Use a compatibility matrix:
| NVIDIA Driver | Max CUDA | Supported GPUs |
|---|---|---|
| 535.x | 12.2 | A100, H100 |
| 545.x | 12.3 | A100, H100 |
| 550.x | 12.4 | A100, H100, H200 |
| 560.x | 12.6 | A100, H100, H200, B200 |
8 GPUs across 2 nodes. Jobs arrive and get scheduled. Use the slider to adjust training vs. inference priority. Watch for scheduling conflicts and fragmentation.
When a model is too large for one GPU, you split it across multiple GPUs. There are two fundamentally different strategies:
Tensor Parallelism (TP): Split each layer's weight matrices across GPUs. For a matrix multiply Y = X · W, you split W column-wise across N GPUs. Each GPU computes a partial result, then they all-reduce to combine. Requires fast interconnect (NVLink) because GPUs communicate at every layer. Best for GPUs within a single node (8 GPUs with NVLink). Typical: TP=2 for 70B, TP=8 for 405B.
Pipeline Parallelism (PP): Assign different layers to different GPUs. GPU-0 runs layers 0-19, GPU-1 runs layers 20-39, etc. GPUs only communicate between stages (one activation tensor per stage). Tolerates slower interconnect (PCIe, even network). Best for spanning multiple nodes. Downside: pipeline bubbles — when GPU-0 is computing, GPU-1 is idle waiting for GPU-0's output.
| Strategy | Communication | When to use | Scaling limit |
|---|---|---|---|
| Tensor Parallel | All-reduce every layer | Within a node (NVLink) | 8 GPUs (one node) |
| Pipeline Parallel | Once per stage boundary | Across nodes (network) | 10-20 stages before bubbles dominate |
| TP + PP (hybrid) | Both | Very large models (405B+) | TP=8 within node, PP across nodes |
bash #!/bin/bash # gpu-watch.sh — Continuous GPU monitoring with alerts # Run as: watch -n 5 ./gpu-watch.sh nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,\ memory.used,memory.total,power.draw,power.limit,\ ecc.errors.corrected.volatile.total,\ ecc.errors.uncorrected.volatile.total \ --format=csv,noheader | while IFS="," read -r idx name temp util \ mem_used mem_total power power_limit ecc_corr ecc_uncorr; do # Strip whitespace temp=$(echo $temp | tr -d ' ') util=$(echo $util | tr -d ' %') ecc_uncorr=$(echo $ecc_uncorr | tr -d ' ') # Alert conditions if [ "$ecc_uncorr" != "0" ]; then echo "CRITICAL: GPU $idx has uncorrectable ECC errors!" fi if [ "$temp" -gt 82 ]; then echo "WARNING: GPU $idx at ${temp}C (throttle at 83C)" fi if [ "$util" -lt 5 ]; then echo "WARNING: GPU $idx at ${util}% utilization (idle?)" fi done
You have GPUs. You have a model. Now you need to serve it to 10,000 concurrent users with p99 latency under 800ms for the first token, and 50ms per subsequent token. This is the serving layer — the most technically demanding piece of the entire infrastructure stack because it requires understanding both systems engineering and machine learning internals simultaneously.
Every LLM request has two phases with fundamentally different performance characteristics:
Prefill (prompt processing). The model processes all input tokens in parallel. This is compute-bound — the GPU's TFLOPS matter. A 1000-token prompt on an H100 with a 70B model takes ~200ms. The GPU is fully utilized because it processes all tokens simultaneously via matrix multiplication.
Decode (token generation). The model generates tokens one at a time, autoregressively. Each token requires reading the entire model weights from HBM. This is memory-bandwidth bound — the GPU's TB/s matters. The GPU is massively underutilized during decode because the arithmetic intensity (FLOPS per byte loaded) is tiny.
Static batching (naive) waits until B requests arrive, processes them together, and returns all results. Problem: if one request generates 500 tokens and another generates 10, the short request waits for the long one. GPU is idle while waiting for the batch to fill.
Continuous batching (vLLM, TensorRT-LLM) removes completed requests and inserts new ones at every decode step. No request waits for others to finish. No idle time waiting for a batch to fill. Throughput increases 2-10x over static batching.
PagedAttention (vLLM's key innovation) manages the KV cache — the cached key/value tensors from previous tokens. In naive serving, each request pre-allocates maximum sequence length worth of contiguous GPU memory. A 2048-token request on a 70B model needs ~2GB of KV cache. With 100 concurrent requests, that is 200GB just for KV cache — more than the GPU's total memory.
PagedAttention borrows the idea of virtual memory pages from OS design. KV cache is stored in fixed-size blocks (like 4KB pages). Blocks are allocated on demand, not pre-allocated. Blocks can be non-contiguous in physical GPU memory but addressed contiguously via a page table. Memory waste drops from 60-80% to under 4%.
python # serve.py — Production vLLM configuration for 70B model on 2x H100 from vllm import LLM, SamplingParams from vllm.engine.arg_utils import AsyncEngineArgs engine_args = AsyncEngineArgs( model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=2, # Split across 2 H100s via NVLink dtype="bfloat16", # Native H100 dtype, no precision loss max_model_len=8192, # Max context window per request max_num_seqs=256, # Max concurrent sequences in batch gpu_memory_utilization=0.92, # Reserve 8% for KV cache spikes enable_chunked_prefill=True, # Interleave prefill and decode enable_prefix_caching=True, # Cache shared system prompts block_size=16, # PagedAttention block size swap_space=4, # GB of CPU swap for KV cache overflow enforce_eager=False, # Use CUDA graphs for decode ) # Key tuning knobs: # - max_num_seqs: higher = better throughput, worse latency # - gpu_memory_utilization: higher = more KV cache capacity, risk of OOM # - enable_chunked_prefill: splits long prefills to avoid blocking decode # - enable_prefix_caching: huge win if many requests share system prompt
Speculative decoding is the most exciting serving optimization of 2024-2025. The idea: use a small, fast "draft" model (e.g., 1B parameters) to speculatively generate K tokens, then use the large "target" model to verify all K tokens in a single forward pass. If the draft model guesses correctly (which it does 60-80% of the time for predictable text), you generate K tokens for the cost of 1 target-model forward pass.
The math: if the draft model's acceptance rate is α and you speculate K tokens, expected speedup is Kα / (1 + K(1-α) · draft_cost/target_cost). For K=5, α=0.7, and draft cost = 5% of target, speedup is ~2.8x.
Symptom: throughput drops 80% despite GPU utilization looking normal. Memory utilization is at 100%. The culprit: KV cache thrashing. When the KV cache is full, PagedAttention must either reject new requests (queueing them) or swap blocks to CPU memory. CPU swap is 100x slower than GPU memory. The system appears to be working but is spending most of its time swapping.
Diagnosis: check vllm:num_preemptions_total metric. If preemptions are increasing, the KV cache is full. Fix: reduce max_num_seqs (fewer concurrent requests = less KV cache needed), or increase gpu_memory_utilization, or add more GPUs.
SGLang introduces RadixAttention — a radix tree structure for KV cache that enables automatic prefix sharing across requests. If 1000 requests share the same 2000-token system prompt, the KV cache for that prefix is stored once, not 1000 times. This is a 10-50x memory savings for production deployments with system prompts.
Disaggregated serving separates prefill and decode onto different hardware. Prefill is compute-bound → run on compute-dense GPUs. Decode is memory-bandwidth-bound → run on memory-optimized hardware. Companies like Anyscale and Fireworks are shipping this in 2025. The benefit: each phase runs on optimal hardware instead of compromising.
python def kv_cache_budget( model_params_b: float, # Billions of parameters num_layers: int, # Number of transformer layers hidden_dim: int, # Hidden dimension (d_model) num_kv_heads: int, # Number of KV attention heads (GQA) head_dim: int, # Dimension per head dtype_bytes: int = 2, # 2 for FP16/BF16, 1 for FP8 max_seq_len: int = 4096, max_batch: int = 256, ) -> dict: """Calculate KV cache memory requirements.""" # Per-token KV cache: 2 (K+V) * layers * kv_heads * head_dim * dtype per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes per_token_mb = per_token_bytes / (1024 ** 2) # Per-request: per_token * max_seq_len per_request_mb = per_token_mb * max_seq_len # Max batch KV cache total_kv_gb = per_request_mb * max_batch / 1024 # Model weights model_gb = model_params_b * 1e9 * dtype_bytes / (1024 ** 3) return { "per_token_bytes": per_token_bytes, "per_request_mb": round(per_request_mb, 2), "max_batch_kv_gb": round(total_kv_gb, 2), "model_weights_gb": round(model_gb, 2), "total_gpu_gb": round(model_gb + total_kv_gb, 2), } # Llama 3.1 70B with GQA (8 KV heads instead of 64) budget = kv_cache_budget( model_params_b=70, num_layers=80, hidden_dim=8192, num_kv_heads=8, head_dim=128, dtype_bytes=2, max_seq_len=4096, max_batch=128 ) print(budget) # {'per_token_bytes': 327680, 'per_request_mb': 1280.0, # 'max_batch_kv_gb': 160.0, 'model_weights_gb': 130.39, # 'total_gpu_gb': 290.39} # → Need at least 4x H100 (80GB each = 320GB) for 128 concurrent requests!
In production, 90% of requests to Parallel share the same system prompt (2000-4000 tokens). Without prefix caching, each request recomputes the KV cache for the system prompt from scratch. That is 2000 tokens of prefill computation × every request = massive waste.
Prefix caching (vLLM's enable_prefix_caching, SGLang's RadixAttention) stores the KV cache for common prefixes in GPU memory. When a new request arrives with the same system prompt, the server looks up the cached KV blocks and skips prefill entirely for those tokens. Only the unique user query portion needs prefill.
The impact is dramatic:
| Scenario | Without prefix cache | With prefix cache | Improvement |
|---|---|---|---|
| System prompt: 2000 tok, User query: 200 tok | Prefill 2200 tokens (~150ms) | Prefill 200 tokens (~15ms) | 10x TTFT reduction |
| 1000 concurrent requests, same system prompt | KV cache: 1000 × 2000 tok = 2M token-slots | Shared prefix: 2000 tok + 1000 × 200 unique = 202K | 10x memory savings |
You enabled prefix caching but TTFT did not improve. The cache hit rate is 3%. Investigation: prefix caching requires exact token-level match of the prefix. If your system prompt includes a timestamp ("Current time: 2025-05-22T10:30:00Z"), every request has a unique system prompt and no prefix is shared. Fix: move dynamic content (timestamps, user IDs) to the end of the prompt, after the static system instructions. Static portion gets cached; dynamic portion gets prefilled.
Another subtle miss: your system prompt is 2048 tokens but the cache block size is 16 tokens. The cache stores 128 blocks for the prefix. If your system prompt changes by even one token between API versions (you added a space somewhere), all 128 blocks are invalidated because the hash does not match. Fix: version your system prompts explicitly and do not make micro-edits to production prompts without understanding the cache impact.
Watch requests enter the batch, get processed, and complete. The batch dynamically adds/removes requests every decode step. Compare throughput with static batching.
Every system at Parallel's scale is distributed. Your databases are sharded across machines. Your inference cluster spans multiple nodes. Your configuration changes must propagate consistently. You cannot build reliable infrastructure without understanding the fundamental constraints and tradeoffs of distributed systems.
The CAP theorem (Brewer, 2000) states that a distributed data store can provide at most two of three guarantees simultaneously:
Consistency (C): Every read receives the most recent write. If you write "rate limit = 100 req/s" and then read it, you always get 100, never the old value.
Availability (A): Every request receives a response (not an error), even if some nodes are down.
Partition tolerance (P): The system continues to operate even if network communication between nodes is lost.
In practice, network partitions will happen (a switch fails, a cable is pulled, a cloud AZ goes dark). So P is not optional — you must tolerate partitions. The real choice is between C and A during a partition:
CP systems (e.g., ZooKeeper, etcd, CockroachDB) refuse to serve reads/writes during a partition rather than return stale data. Your Kubernetes control plane uses etcd (CP) because incorrect cluster state is worse than temporary unavailability.
AP systems (e.g., Cassandra, DynamoDB, Redis Cluster) continue serving during a partition but may return stale data. Your session store might be AP because showing slightly stale session data is better than showing an error page.
You have 10 cache servers. A request for key "user:12345" must always go to the same server (otherwise the cache is useless). Naive approach: server = hash(key) % 10. Problem: when you add an 11th server, hash(key) % 11 maps almost every key to a different server. Your entire cache is invalidated. With a 10M-key cache, you just caused a cache stampede that hammers your database.
Consistent hashing fixes this. Imagine a circular ring of hash values (0 to 232). Each server is placed at a point on the ring. A key is hashed and walks clockwise until it hits a server. When you add a server, only the keys between the new server and its predecessor move. Adding an 11th server moves only ~1/11 of keys instead of ~10/11.
Virtual nodes improve balance: each physical server gets 100-200 virtual positions on the ring. This prevents hot spots when servers are unevenly distributed on the ring.
python import hashlib, bisect class ConsistentHashRing: def __init__(self, nodes: list[str], vnodes: int = 150): self.ring: list[tuple[int, str]] = [] self.vnodes = vnodes for node in nodes: self.add_node(node) self.ring.sort() def _hash(self, key: str) -> int: return int(hashlib.md5(key.encode()).hexdigest(), 16) def add_node(self, node: str): for i in range(self.vnodes): h = self._hash(f"{node}:{i}") bisect.insort(self.ring, (h, node)) def remove_node(self, node: str): self.ring = [(h, n) for h, n in self.ring if n != node] def get_node(self, key: str) -> str: if not self.ring: raise ValueError("Empty ring") h = self._hash(key) idx = bisect.bisect_right(self.ring, (h,)) if idx == len(self.ring): idx = 0 # Wrap around the ring return self.ring[idx][1] # Usage: route KV cache lookups to cache servers ring = ConsistentHashRing(["cache-01", "cache-02", "cache-03"]) server = ring.get_node("user:12345:session") # Deterministic routing ring.add_node("cache-04") # Only ~25% of keys re-map
A split brain occurs when a network partition divides your cluster into two halves, and both halves think they are the leader. With a CP database, this causes writes to fail (correct behavior). With a poorly configured AP system, both halves accept writes, creating conflicting data that must be merged when the partition heals.
The most common real-world split brain: your Redis Sentinel cluster has 3 nodes, the network splits 2-1, the majority side elects a new leader, but the minority side's old leader keeps accepting writes. Fix: configure min-replicas-to-write 1 so the old leader refuses writes when it cannot reach any replica.
python import redis, time, uuid class DistributedLock: """Simple distributed lock using Redis SET NX EX. For production, use the Redlock algorithm across 5 Redis instances.""" def __init__(self, r: redis.Redis, name: str, ttl_s: int = 10): self.r = r self.name = f"lock:{name}" self.ttl = ttl_s self.token = str(uuid.uuid4()) # Unique token to prevent releasing someone else's lock def acquire(self, timeout: float = 5.0) -> bool: deadline = time.monotonic() + timeout while time.monotonic() < deadline: # SET key value NX EX ttl — atomic acquire if self.r.set(self.name, self.token, nx=True, ex=self.ttl): return True time.sleep(0.05) # Retry after 50ms return False def release(self): # Lua script: only delete if we still own the lock script = """ if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) else return 0 end """ self.r.eval(script, 1, self.name, self.token) # Usage: ensure only one process runs database migration lock = DistributedLock(redis.Redis(), "db-migration", ttl_s=300) if lock.acquire(): try: run_migration() finally: lock.release()
Distributed systems bugs are notoriously hard to reproduce. Deterministic simulation testing (pioneered by FoundationDB, adopted by TigerBeetle) runs your entire distributed system inside a single-threaded simulation where all I/O is fake and all randomness is seeded. You can replay any bug with the same seed. Companies like Antithesis offer this as a service — they found consensus bugs in etcd and CockroachDB that years of traditional testing missed.
Many infrastructure systems need a single leader: the SLURM scheduler, the K8s controller manager, the database primary. Consensus algorithms (Raft, Paxos) solve the problem of electing a leader among distributed nodes such that all nodes agree on who the leader is, even when some nodes fail.
Raft (used by etcd, CockroachDB, TiKV) works in three phases:
1. Election: When a follower does not hear from the leader within a timeout (randomized, 150-300ms), it becomes a candidate and requests votes. If it gets a majority (3/5 or 2/3), it becomes the new leader. The randomized timeout prevents all followers from starting elections simultaneously.
2. Log replication: The leader receives writes, appends them to its log, and replicates to followers. A write is committed once a majority of nodes have written it. This guarantees durability: even if 2 of 5 nodes die, committed data is not lost.
3. Safety: A candidate cannot win an election if its log is behind any voter's log. This prevents a stale node from becoming leader and overwriting committed data.
kubectl apply, the API server writes to etcd. Etcd uses Raft to replicate that write to 2 other etcd nodes. If the etcd leader dies, a new leader is elected in ~300ms. Your cluster state (what pods should be running, what services exist) survives because it was replicated to a majority before being acknowledged. Understanding Raft helps you debug etcd issues, set appropriate cluster sizes (always odd: 3, 5, or 7), and understand why etcd performance matters for K8s responsiveness.Infrastructure operations must be idempotent: running the same operation twice produces the same result as running it once. This is critical because network failures can cause "did it succeed?" ambiguity. Terraform is idempotent by design (desired state, not imperative commands). But your custom scripts might not be.
python # BAD: Not idempotent — running twice creates two load balancers def create_lb(name): client.create_load_balancer(name=name) # GOOD: Idempotent — checks if LB exists first def ensure_lb(name, config): existing = client.get_load_balancer(name=name) if existing: if existing.config != config: client.update_load_balancer(name=name, config=config) return existing return client.create_load_balancer(name=name, config=config)
Nodes sit on a hash ring. Keys map to the nearest clockwise node. Add/remove nodes to see minimal key redistribution vs. modulo hashing.
Parallel stores three kinds of data: user/account data (relational, ~100GB), conversation logs and traces (time-series, ~50TB and growing), and vector embeddings for RAG (vector store, ~2TB). Each has fundamentally different access patterns and scaling strategies. A staff engineer knows which database technology fits each pattern and, critically, when to stop scaling a single database and start sharding.
Vertical sharding moves different tables to different databases. Users table on DB-1, conversations on DB-2, billing on DB-3. Simple, but each shard can still hit single-node limits.
Horizontal sharding splits rows of the same table across multiple databases. Users A-M on Shard-1, N-Z on Shard-2. Scales horizontally, but cross-shard queries (joins) become expensive network calls.
The shard key choice determines everything. Bad shard key = hot shards. Good shard key = even distribution + query locality.
| Data type | Shard key | Why |
|---|---|---|
| User accounts | user_id (hash) | Even distribution, all user queries hit one shard |
| Conversations | org_id + date | All conversations for one org on same shard (query locality), date prevents one shard from growing forever |
| Request logs | timestamp (range) | Time-series: recent data is hot, old data is cold storage |
A PostgreSQL connection costs ~10MB of RAM and 1 OS process. 1000 connections = 10GB RAM + 1000 processes. But your K8s cluster might have 200 pods, each wanting 10 connections. That is 2000 connections — your database is out of memory.
PgBouncer sits between your app and PostgreSQL. 200 pods connect to PgBouncer (2000 connections). PgBouncer maintains only 50 connections to PostgreSQL. It multiplexes queries from 2000 app connections across 50 database connections using transaction-level pooling (a connection is returned to the pool after each transaction, not after the session ends).
python import random from contextlib import contextmanager class DBRouter: """Route reads to replicas, writes to primary.""" def __init__(self, primary: str, replicas: list[str]): self.primary = primary self.replicas = replicas self._pools = {} # connection pools per host def get_read_conn(self, allow_stale: bool = True): """Pick a random replica for reads. If stale not OK, use primary.""" if not allow_stale or not self.replicas: return self._get_pool(self.primary) replica = random.choice(self.replicas) return self._get_pool(replica) def get_write_conn(self): """All writes go to primary.""" return self._get_pool(self.primary) @contextmanager def read_after_write(self): """After a write, read from primary to avoid stale reads.""" yield self._get_pool(self.primary) # Usage pattern: router = DBRouter( primary="db-primary.internal:5432", replicas=["db-replica-1.internal:5432", "db-replica-2.internal:5432"] ) # Dashboard query (stale OK): hits a replica conn = router.get_read_conn(allow_stale=True) # Billing check (must be fresh): hits primary conn = router.get_read_conn(allow_stale=False)
Your monitoring dashboard shows a spike. But the data is 30 seconds stale because it reads from a replica. The replica is lagging because a bulk data migration is running on the primary, generating so many WAL (Write-Ahead Log) entries that the replica cannot keep up.
Diagnosis: check pg_stat_replication.replay_lag on the primary. If lag > your tolerance (e.g., 5 seconds), route reads to primary until lag recovers. Better: set up lag-aware routing that automatically switches when lag exceeds threshold.
python import asyncpg, time class LagAwareRouter: """Route reads to replicas only when lag is acceptable.""" def __init__(self, primary_dsn: str, replica_dsns: list[str], max_lag_seconds: float = 5.0): self.primary_dsn = primary_dsn self.replica_dsns = replica_dsns self.max_lag = max_lag_seconds self.replica_lags: dict[str, float] = {} self.last_check = 0 self.check_interval = 2.0 # seconds async def _check_lags(self): """Query replication lag from primary's pg_stat_replication.""" if time.time() - self.last_check < self.check_interval: return conn = await asyncpg.connect(self.primary_dsn) rows = await conn.fetch( "SELECT client_addr, " "EXTRACT(EPOCH FROM replay_lag) as lag_s " "FROM pg_stat_replication" ) for row in rows: self.replica_lags[str(row['client_addr'])] = row['lag_s'] or 0 await conn.close() self.last_check = time.time() async def get_read_dsn(self, require_fresh: bool = False) -> str: if require_fresh: return self.primary_dsn await self._check_lags() # Pick a replica with acceptable lag for dsn in self.replica_dsns: host = dsn.split('@')[-1].split(':')[0] lag = self.replica_lags.get(host, 999) if lag <= self.max_lag: return dsn # All replicas lagging — fall back to primary return self.primary_dsn
Parallel generates ~500K metrics data points per second from GPU telemetry, inference latency, and request logs. Standard PostgreSQL cannot handle this write volume. The architecture:
Hot tier (0-7 days): TimescaleDB (PostgreSQL extension) with hypertables. Data is partitioned by time automatically. Continuous aggregates pre-compute 1-minute and 1-hour rollups. Query performance: sub-second for "show me GPU utilization for the last 6 hours."
Warm tier (7-90 days): Compressed hypertable chunks. TimescaleDB's native compression achieves 10-20x compression ratios for metrics data. Queries are slower (1-5 seconds) but data is still online.
Cold tier (90+ days): Parquet files in S3 via a nightly export job. Queryable via Athena/Trino for historical analysis. Virtually free storage ($0.023/GB/month).
Parallel's RAG system stores 500M embedding vectors. Each vector is 1536 dimensions × 4 bytes = ~6KB. Total: ~3TB of vector data. The architecture must support <50ms nearest-neighbor queries at 10K queries/sec:
| Component | Choice | Why |
|---|---|---|
| Vector DB | Qdrant (Rust, open-source) | Best latency/throughput for filtered search. Supports hybrid search (dense + sparse). |
| Index type | HNSW (Hierarchical Navigable Small World) | Sub-linear query time O(log n), 95%+ recall at ef=128 |
| Sharding | Hash-based on collection_id | Each customer's embeddings on one shard for query locality |
| Replication | 2 replicas per shard | Read throughput + fault tolerance. One replica can serve while other rebuilds index. |
| Hardware | Memory-optimized instances (r6i.8xlarge) | HNSW index must fit in RAM. 3TB data + index overhead ≈ 5TB RAM needed across shards. |
python import asyncio from qdrant_client import QdrantClient, models async def index_documents( docs: list[dict], qdrant: QdrantClient, embed_fn, collection: str, batch_size: int = 100, ): """Batch-index documents into Qdrant with parallel embedding.""" for i in range(0, len(docs), batch_size): batch = docs[i:i + batch_size] texts = [d["text"] for d in batch] # Embed in parallel (GPU-accelerated embedding model) vectors = await embed_fn(texts) # Upsert to Qdrant (idempotent via point IDs) points = [ models.PointStruct( id=d["id"], vector=v.tolist(), payload={"text": d["text"], "source": d["source"]}, ) for d, v in zip(batch, vectors) ] qdrant.upsert(collection_name=collection, points=points) # Trigger HNSW index optimization after bulk upsert qdrant.update_collection( collection_name=collection, optimizer_config=models.OptimizersConfigDiff( indexing_threshold=20000, ), )
Parallel stores embeddings for RAG. The naive approach — pgvector in PostgreSQL — works for 1M vectors. At 100M vectors, you need a purpose-built vector store: Pinecone (managed), Qdrant (open-source, Rust), or Milvus (open-source, GPU-accelerated search). The 2025 frontier is hybrid search: combining dense vector search with sparse keyword search (BM25) in a single query, scored with reciprocal rank fusion. Qdrant and Weaviate ship this natively.
See how different shard keys distribute data across 4 shards. Toggle between hash-based and range-based sharding to see hot spots form.
Parallel's GPU fleet costs $3-5M per month. A 10% utilization improvement saves $300K-$500K per month. A 5% cost reduction in cloud spend saves more than most engineers' annual salaries. Cost engineering is not a side project — it is a core competency. The difference between a profitable AI company and one that burns cash faster than it earns revenue often comes down to infrastructure cost efficiency.
Understanding where money goes is the first step to saving it. For an LLM serving company, the cost stack typically looks like this:
| Category | % of total | What drives it |
|---|---|---|
| GPU compute (inference) | 55-70% | Number of GPUs × hours × $/hr |
| GPU compute (training/finetune) | 10-20% | Training runs, eval batches |
| Storage (model weights, logs) | 5-10% | S3, EBS, vector store storage |
| Networking (inter-region, egress) | 5-8% | Cross-region replication, API responses |
| CPU compute (API, workers) | 3-5% | K8s cluster, background jobs |
| Managed services | 2-5% | RDS, ElastiCache, CloudWatch |
Spot instances cost 60-90% less than on-demand but can be reclaimed with 2 minutes notice. The strategy for inference serving:
Base layer (on-demand or reserved): 40% of your peak capacity. This handles your minimum traffic and never gets reclaimed. Use 1-year reserved instances for another 30% discount.
Elastic layer (spot): 0-60% of peak capacity. Scales up for traffic spikes, scales down during quiet hours. If spot capacity is reclaimed, requests queue and latency increases but the service stays up.
Burst layer (on-demand): Emergency overflow. If spot is reclaimed AND traffic is above base capacity, spin up on-demand instances. This is the most expensive option but keeps SLOs intact during spot reclamation events.
When AWS reclaims a spot instance, you get a 2-minute warning via the EC2 metadata endpoint. Your infrastructure must gracefully handle this:
python import requests, time, subprocess def spot_interruption_handler(): """Poll EC2 metadata for spot interruption notice. Run as a sidecar in every GPU pod on spot instances.""" while True: try: resp = requests.get( "http://169.254.169.254/latest/meta-data/spot/instance-action", timeout=1 ) if resp.status_code == 200: # INTERRUPTION INCOMING — we have ~2 minutes action = resp.json() print(f"Spot interruption: {action['action']} at {action['time']}") # Step 1: Stop accepting new requests subprocess.run(["kubectl", "cordon", node_name]) # Step 2: Drain in-flight requests (wait up to 90 seconds) wait_for_drain(timeout=90) # Step 3: Signal the autoscaler to compensate requests.post(webhook_url, json={ "event": "spot_interruption", "node": node_name, "gpu_count": 8, }) return except requests.exceptions.RequestException: pass # No interruption notice — normal operation time.sleep(5)
When comparing cloud providers or GPU types, the hourly rate is not the full picture. Total Cost of Ownership includes:
| Cost Component | Often Overlooked? | Example |
|---|---|---|
| Compute hourly rate | No | H100: $10/hr |
| Data transfer (egress) | Yes | $0.09/GB from AWS to internet. At 1TB/day = $2,700/month |
| Storage (EBS, S3) | Sometimes | Model weights on EBS: $0.10/GB/month. 140GB × 100 nodes = $1,400/month |
| Engineering time | Yes | 2 engineers × $300K salary / 12 = $50K/month in labor |
| Monitoring/logging | Yes | Datadog at scale: $20K-$100K/month |
| Support contracts | Sometimes | AWS Enterprise Support: $15K+/month |
A common mistake: choosing a cloud provider with 10% cheaper GPU rates but 3x more expensive egress. If your workload is egress-heavy (streaming tokens to users), the "cheaper" provider is actually more expensive. Always model the full TCO before making procurement decisions.
python from dataclasses import dataclass @dataclass class InferenceCostModel: gpu_hourly_cost: float # $/hour per GPU tokens_per_second: float # Throughput per GPU avg_input_tokens: int # Average request input length avg_output_tokens: int # Average request output length batch_efficiency: float # 0-1, how much batching helps gpu_utilization: float # 0-1, actual time GPU is doing work def cost_per_request(self) -> float: total_tokens = self.avg_input_tokens + self.avg_output_tokens # Time to process one request (seconds) effective_tps = self.tokens_per_second * self.batch_efficiency time_s = total_tokens / effective_tps # Cost = time * GPU cost, adjusted for utilization cost_per_second = self.gpu_hourly_cost / 3600 return (time_s * cost_per_second) / self.gpu_utilization def cost_per_million_tokens(self) -> float: cost_per_req = self.cost_per_request() tokens_per_req = self.avg_input_tokens + self.avg_output_tokens return (cost_per_req / tokens_per_req) * 1_000_000 # Example: Llama 70B on 2x H100 with vLLM model = InferenceCostModel( gpu_hourly_cost=20.0, # 2x H100 at $10/hr each tokens_per_second=80, # ~80 tok/s per 2-GPU setup avg_input_tokens=500, avg_output_tokens=200, batch_efficiency=0.7, # Continuous batching helps a lot gpu_utilization=0.75, # 75% utilization is good ) print(f"Cost per request: ${model.cost_per_request():.4f}") print(f"Cost per 1M tokens: ${model.cost_per_million_tokens():.2f}")
The most common cost leak: zombie resources — infrastructure that is running but no longer needed. Failed experiments that left GPU instances running. Load balancers for decommissioned services. EBS volumes detached from any instance. Elastic IPs not attached to anything. Each one quietly burns money.
Prevention: tag every resource with owner, team, expires. Run a nightly job that deletes resources past their expiry date. For GPU instances, automatically terminate any instance with <5% utilization for >2 hours (with a whitelist for legitimate idle use cases like pre-warming).
python import boto3 from datetime import datetime, timedelta def find_zombie_resources(region: str) -> list[dict]: """Find resources that are running but probably shouldn't be.""" zombies = [] ec2 = boto3.client('ec2', region_name=region) cw = boto3.client('cloudwatch', region_name=region) # Check running GPU instances with low utilization instances = ec2.describe_instances( Filters=[ {'Name': 'instance-state-name', 'Values': ['running']}, {'Name': 'instance-type', 'Values': ['p4d.*', 'p5.*']} ] ) for res in instances['Reservations']: for inst in res['Instances']: inst_id = inst['InstanceId'] # Check GPU utilization via CloudWatch custom metric stats = cw.get_metric_statistics( Namespace='GPU', MetricName='Utilization', Dimensions=[{'Name': 'InstanceId', 'Value': inst_id}], StartTime=datetime.utcnow() - timedelta(hours=2), EndTime=datetime.utcnow(), Period=3600, Statistics=['Average'] ) avg_util = sum(p['Average'] for p in stats['Datapoints']) / max(1, len(stats['Datapoints'])) if avg_util < 5: tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])} zombies.append({ 'resource': inst_id, 'type': inst['InstanceType'], 'avg_util': round(avg_util, 1), 'owner': tags.get('owner', 'UNTAGGED'), 'hourly_cost': estimate_hourly_cost(inst['InstanceType']), }) return zombies
Most cloud resources are over-provisioned. Engineers request 8 vCPUs "just in case" but use 1.5 on average. Right-sizing means matching resource allocation to actual usage. The automation pipeline:
GPTQ/AWQ quantization (INT4) reduces model memory by 4x with ~1-3% quality loss. A 70B model goes from 140GB (FP16) to 35GB (INT4), fitting on a single H100 instead of two. Cost per request drops 50%+ because you need half the GPUs. The 2025 frontier: FP8 quantization on H100/B200 uses the native FP8 tensor cores — 2x throughput with essentially zero quality loss. Every major serving framework (vLLM, TensorRT-LLM, SGLang) supports FP8 natively.
Not every request needs the biggest model. A simple "what time is it?" does not need 70B parameters. Model routing classifies incoming requests and routes them to the cheapest model that can handle them:
| Request complexity | Model | Cost/request | Example |
|---|---|---|---|
| Simple (FAQ, greetings) | 8B model (1 GPU) | $0.0005 | "What are your business hours?" |
| Medium (analysis, summarization) | 70B model (2 GPUs) | $0.004 | "Summarize this document and list key points." |
| Complex (reasoning, multi-step) | 405B model (8 GPUs) | $0.02 | "Compare these 3 investment strategies considering tax implications." |
The router itself is a small classifier (fine-tuned BERT or even a rule-based system on prompt length + keywords). If 60% of requests are simple, model routing reduces average cost per request by 50-70% compared to sending everything to the 70B model.
Adjust GPU count, utilization, spot mix, and quantization to see the impact on monthly cost and cost per request.
Parallel's API serves customers whose businesses depend on it. When the API is down, their products are broken, their users are frustrated, and they are looking at competitors. "Five nines" (99.999%) uptime means at most 5 minutes of downtime per year. "Four nines" (99.99%) means 52 minutes per year. Every minute counts. Reliability is not a feature you bolt on — it is a design constraint that shapes every architectural decision.
A Service Level Indicator (SLI) is a metric that measures reliability. Examples: request success rate, p99 latency, time to first token.
A Service Level Objective (SLO) is a target for an SLI. "99.9% of requests complete in under 800ms." This is an internal commitment. It is more strict than what you promise customers (the SLA).
An error budget is the amount of unreliability your SLO allows. If your SLO is 99.9% availability, your error budget is 0.1% — roughly 43 minutes per month. When you have error budget remaining, you can take risks (deploy risky features, run experiments). When the budget is exhausted, you freeze deployments and focus on reliability improvements.
You cannot know how your system fails until it actually fails. Chaos engineering deliberately introduces failures in production (in a controlled way) to discover weaknesses before customers do.
The hierarchy of chaos experiments, from least to most scary:
python import time from enum import Enum class State(Enum): CLOSED = "closed" # Normal operation OPEN = "open" # Failing, reject requests HALF_OPEN = "half_open" # Testing if service recovered class CircuitBreaker: def __init__(self, failure_threshold=5, reset_timeout=30): self.state = State.CLOSED self.failures = 0 self.failure_threshold = failure_threshold self.reset_timeout = reset_timeout self.last_failure_time = 0 def call(self, func, *args, **kwargs): if self.state == State.OPEN: if time.time() - self.last_failure_time > self.reset_timeout: self.state = State.HALF_OPEN # Try one request else: raise CircuitOpenError("Circuit is open, refusing request") try: result = func(*args, **kwargs) if self.state == State.HALF_OPEN: self.state = State.CLOSED # Service recovered! self.failures = 0 return result except Exception as e: self.failures += 1 self.last_failure_time = time.time() if self.failures >= self.failure_threshold: self.state = State.OPEN raise # Usage: protect calls to the inference backend breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30) try: result = breaker.call(inference_backend.predict, prompt=user_input) except CircuitOpenError: result = fallback_response() # Return cached/degraded response
The most dangerous production failure pattern: Service A depends on Service B. Service B slows down (doesn't fail, just gets slow). Service A's threads are all waiting for Service B, so Service A's queue fills up. Service C depends on Service A and also starts queueing. Within minutes, the entire system is down even though Service B only slowed by 200ms.
Prevention: timeouts on every external call (never wait forever), circuit breakers (stop calling a slow service), load shedding (reject new requests when overloaded rather than queueing them), and bulkheading (isolate thread pools per dependency so one slow dependency cannot consume all threads).
python import time from collections import deque class LoadShedder: """Reject excess requests to protect the system from overload. Uses a sliding window to measure current request rate.""" def __init__(self, max_rps: int, window_seconds: float = 1.0): self.max_rps = max_rps self.window = window_seconds self.timestamps: deque = deque() def should_accept(self) -> bool: now = time.monotonic() # Remove timestamps outside the window while self.timestamps and self.timestamps[0] < now - self.window: self.timestamps.popleft() if len(self.timestamps) >= self.max_rps: return False # Shed this request (HTTP 503) self.timestamps.append(now) return True # Usage in FastAPI middleware: shedder = LoadShedder(max_rps=5000) @app.middleware("http") async def load_shed_middleware(request, call_next): if not shedder.should_accept(): return JSONResponse( status_code=503, content={"error": "Service overloaded, please retry"}, headers={"Retry-After": "2"} ) return await call_next(request)
Every incident gets a postmortem within 48 hours. The postmortem is blameless — it focuses on systems, not people. The template:
The frontier is AI-assisted incident response. Tools like PagerDuty AIOps and Rootly automatically correlate alerts with recent deployments, suggest runbook steps, and draft postmortem timelines. In 2025, companies are experimenting with LLM agents that can execute runbook steps autonomously (e.g., "if p99 latency > 2s and recent deployment within 30 min, auto-rollback the deployment").
Watch the error budget burn as incidents occur. Green = budget remaining, red = budget exhausted (deployment freeze). Click to simulate incidents.
Every token your LLM generates must travel from a GPU in a data center to a user's browser. That journey involves DNS resolution, TLS handshake, load balancer routing, service-to-service calls, and response streaming. Each hop adds latency. Each hop can fail. A staff-level infrastructure engineer understands every layer of this stack because networking problems are the most common cause of production incidents.
Layer 4 (transport) load balancers route TCP/UDP connections. They see IP addresses and ports but cannot inspect HTTP headers, URL paths, or request bodies. They are fast (kernel-level, millions of connections) and simple. Use L4 for: gRPC services, database connections, any high-throughput TCP protocol.
Layer 7 (application) load balancers inspect HTTP requests. They can route based on URL path (/v1/chat goes to inference, /v1/embeddings goes to embedding service), HTTP headers (route by API key to customer-specific deployments), cookies (session affinity), and request body (route by model name). They are slower (must parse HTTP) but much more flexible. Use L7 for: API gateways, canary routing, A/B testing.
| Feature | L4 (NLB/IPVS) | L7 (ALB/Envoy/NGINX) |
|---|---|---|
| Routing granularity | IP + port | URL, headers, body, cookies |
| TLS termination | Passthrough or terminate | Always terminates |
| Protocol support | Any TCP/UDP | HTTP/1.1, HTTP/2, WebSocket, gRPC |
| Latency overhead | ~0.1ms | ~1-5ms |
| Cost (AWS) | $0.006/GB | $0.008/GB + $0.008/LCU-hr |
Parallel's infrastructure has 40+ microservices. Each needs: mTLS encryption, retry logic, circuit breaking, rate limiting, and observability. Implementing this in every service is a nightmare. A service mesh (Istio, Linkerd) injects a sidecar proxy (Envoy) alongside every pod. The sidecar handles all cross-cutting concerns transparently.
The tradeoff: each sidecar adds ~10MB memory and ~1ms latency per hop. For CPU services, this is fine. For GPU inference pods, you might skip the sidecar and handle networking directly to avoid the extra millisecond on every token stream.
LLM responses are streamed token-by-token via Server-Sent Events (SSE) or WebSockets. Each connection stays open for 5-60 seconds. This is fundamentally different from traditional HTTP request-response and creates unique infrastructure challenges:
Connection limits: Each open stream holds a file descriptor and a slot in the load balancer's connection pool. 10,000 concurrent streams = 10,000 open connections. Your Envoy default of 1024 connections per upstream cluster will be exhausted quickly.
Idle timeout traps: Some load balancers (AWS ALB) have a 60-second idle timeout. If token generation stalls for 60 seconds (e.g., waiting for tool call result), the connection is killed. Fix: send keepalive comments : keepalive\n\n every 15 seconds during idle periods.
Backpressure: If the client cannot consume tokens as fast as the model generates them (slow network), the TCP send buffer fills up. Without backpressure handling, the server buffers unboundedly and runs out of memory. Fix: monitor TCP send buffer per connection, pause generation when buffer exceeds threshold.
python # SSE streaming with keepalive and backpressure async def stream_response(request, engine): async def generate(): last_token_time = time.time() async for token in engine.generate(request.prompt): yield f"data: {json.dumps({'token': token})}\n\n" last_token_time = time.time() # Keepalive during tool calls while engine.waiting_for_tool: if time.time() - last_token_time > 15: yield ": keepalive\n\n" last_token_time = time.time() await asyncio.sleep(1) yield "data: [DONE]\n\n" return StreamingResponse( generate(), media_type="text/event-stream", headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"} )
python # gRPC uses HTTP/2 with persistent connections. A naive L4 LB # assigns one connection per client → all requests from that client # hit one backend → load imbalance. Solution: L7 or client-side LB. import grpc from grpc import aio async def create_channel(endpoints: list[str]) -> aio.Channel: """Create a gRPC channel with client-side round-robin LB.""" # Format: "dns:///my-service.internal:50051" # grpc resolves DNS, gets multiple IPs, round-robins across them target = f"dns:///{endpoints[0]}" channel = aio.insecure_channel( target, options=[ ("grpc.lb_policy_name", "round_robin"), ("grpc.dns_min_time_between_resolutions_ms", 5000), ("grpc.keepalive_time_ms", 10000), ("grpc.keepalive_timeout_ms", 5000), ("grpc.max_send_message_length", 50 * 1024 * 1024), # 50MB ] ) return channel # For inference: use "pick_first" for sticky sessions (KV cache affinity) # For stateless services: use "round_robin" for even distribution
Symptom: intermittent 503 errors, but backend services are healthy. The load balancer reports "no healthy upstream." Investigation: the L7 load balancer (Envoy) has a connection pool limit of 1024 per upstream host. During a traffic spike, each long-lived SSE stream for token streaming holds a connection open for 5-30 seconds. 1024 connections × 10 second average = 102 req/s per upstream. With 5 upstream pods, your max is 512 req/s. Beyond that, connections queue and time out.
Fix: increase max_connections in Envoy cluster config, add more upstream pods, or implement HTTP/2 multiplexing (multiple requests per connection).
Cilium replaces kube-proxy and iptables with eBPF programs that run directly in the Linux kernel. Benefits: 10x faster packet processing, kernel-level observability (every packet traced without instrumentation), and network policies that apply at the kernel level without sidecar proxies. In 2025, Cilium is the default CNI for GKE and is rapidly replacing Istio's sidecar model for service mesh functionality.
Every microservice needs to find every other microservice. In Kubernetes, you have two options:
Kubernetes DNS (CoreDNS): Every Service gets a DNS name: inference-server.production.svc.cluster.local. Pods resolve this via the cluster DNS. Simple, but DNS has TTL caching issues — when a pod dies and is replaced, clients may still have the old IP cached for 5-30 seconds. For inference serving where each request takes 500ms, 30 seconds of stale DNS = ~60 failed requests.
Service mesh (Envoy/Istio): The sidecar proxy maintains a real-time list of healthy endpoints via the control plane. No DNS caching issues. When a pod dies, the proxy knows within 1-2 seconds and stops routing to it. Better health checking (L7 health probes, not just TCP), but adds complexity and latency.
The hybrid approach for Parallel: Kubernetes DNS for stateless services (low sensitivity to endpoint staleness), Envoy service mesh for inference routing (needs sub-second failover), and external DNS (Route53/Cloud DNS) for cross-region service discovery.
python import asyncio from contextlib import asynccontextmanager class ConnectionPool: """Generic async connection pool with health checking. Prevents connection exhaustion, the #1 cause of 503 errors.""" def __init__(self, factory, max_size: int = 50, max_idle_time: float = 300): self.factory = factory # async callable that creates a connection self.max_size = max_size self.max_idle = max_idle_time self._pool: asyncio.Queue = asyncio.Queue(maxsize=max_size) self._size = 0 self._lock = asyncio.Lock() @asynccontextmanager async def acquire(self): conn = None try: conn = self._pool.get_nowait() except asyncio.QueueEmpty: async with self._lock: if self._size < self.max_size: conn = await self.factory() self._size += 1 if conn is None: conn = await asyncio.wait_for(self._pool.get(), timeout=5.0) try: yield conn finally: try: self._pool.put_nowait(conn) except asyncio.QueueFull: await conn.close() async with self._lock: self._size -= 1
Requests arrive and get distributed across backends. Compare round-robin, least-connections, and consistent-hash strategies. Watch for hot spots.
Parallel has 80 engineers deploying 20+ times per day. Each deployment touches infrastructure that serves millions of requests. A broken deployment means revenue loss. A slow deployment pipeline means engineering velocity loss. The CI/CD system must be fast (under 10 minutes from merge to production), safe (catch errors before they reach production), and recoverable (rollback in under 60 seconds).
There are four deployment strategies. Each has different risk, speed, and complexity tradeoffs:
Rolling update: Replace pods one at a time. Old and new versions run simultaneously during rollout. Simple, but if the new version has a bug, it serves some traffic before you detect and rollback. K8s default.
Blue-green: Run two identical environments (blue = current, green = new). Deploy to green, test, then switch the load balancer from blue to green. Instant rollback by switching back. Downside: 2x resource cost during transition.
Canary: Route 1% of traffic to the new version. Monitor error rate and latency for 15-30 minutes. If healthy, gradually increase to 5%, 25%, 50%, 100%. If unhealthy at any stage, rollback instantly. This is the gold standard for high-risk deployments.
Feature flags: Deploy the new code to 100% of servers but hide it behind a flag. Enable the flag for 1% of users. This decouples deployment (code on servers) from release (feature visible to users). You can turn off a broken feature without redeploying.
The most common production incident: "We deployed, now things are broken." The investigation framework:
Step 1: Confirm causation. Check if the metric degradation started within 5 minutes of the deployment. If yes, strong correlation. If it started 2 hours later, probably coincidence. Check the deploy log: kubectl rollout history deployment/inference-server.
Step 2: Identify the change. What exactly changed? New container image tag, new environment variable, new ConfigMap, new resource limits? Use kubectl diff against the previous revision or check the Git commit that triggered the deploy.
Step 3: Quantify impact. Before rolling back (which might be disruptive), assess: Is the SLO breached? Is the error rate above the error budget burn rate? If p99 went from 400ms to 500ms but the SLO is 800ms, you have time to investigate. If p99 went from 400ms to 2s, roll back immediately.
Step 4: Rollback or fix forward. For infrastructure changes, always rollback first (it is safer than debugging under pressure). For application bugs, sometimes a forward fix is faster if the bug is obvious (e.g., typo in a config value).
bash # Instant rollback to previous revision kubectl rollout undo deployment/inference-server # Rollback to a specific revision kubectl rollout undo deployment/inference-server --to-revision=42 # Check rollback status kubectl rollout status deployment/inference-server
Step 5: Postmortem. Even if the fix was trivial, write a postmortem. "Config typo caused 5 minutes of elevated latency" becomes a systemic improvement: "Add config validation to the CI pipeline so typos are caught before deploy."
Traditional infrastructure: deploy code updates to running servers. The server accumulates state over time (package versions, config drift, temp files). After 6 months, no two servers are identical even though they started from the same image. This is called configuration drift and it causes "works on server A but not server B" bugs.
Immutable infrastructure: never update a running server. Instead, build a new container image with the new code, deploy new pods from the new image, and destroy the old pods. Every pod starts from the exact same image, always. There is no drift because there is no mutation.
This is why containers (Docker) and orchestrators (K8s) dominate modern infrastructure. The container image is your artifact of truth. If it works in CI, it works in production, because the environment is identical. If a server is acting weird, you do not debug it — you kill it and let K8s spawn a fresh one from the known-good image.
Deploying a new model version to a GPU cluster is not like deploying a web app. Model weights are 140GB+. Loading them into GPU memory takes 2-5 minutes. You cannot tolerate 5 minutes of downtime during deployment.
python import time, requests class CanaryController: def __init__(self, baseline_metrics: dict, threshold: float = 1.1): self.baseline = baseline_metrics # {"p99_ms": 400, "error_rate": 0.001} self.threshold = threshold # 10% worse = rollback def check_canary(self, canary_metrics: dict) -> bool: """Return True if canary is healthy.""" for key, baseline_val in self.baseline.items(): canary_val = canary_metrics.get(key, 0) if canary_val > baseline_val * self.threshold: print(f"CANARY FAIL: {key} = {canary_val} > {baseline_val * self.threshold}") return False return True def progressive_rollout(self, stages=[1, 5, 25, 50, 100]): for pct in stages: self._set_traffic_split(pct) print(f"Canary at {pct}% — monitoring for 15 min...") time.sleep(900) # 15 min observation window metrics = self._fetch_canary_metrics() if not self.check_canary(metrics): self._set_traffic_split(0) # Instant rollback print("ROLLBACK: canary failed, traffic at 0%") return False print("SUCCESS: canary promoted to 100%") return True
Your CI pipeline takes 25 minutes. Engineers stack PRs and context-switch. Feedback is slow. Common culprits: Docker image builds that re-download 5GB of dependencies on every run (fix: layer caching, multi-stage builds, pre-built base images), integration tests that spin up real databases (fix: use testcontainers with cached images or mock layers), and sequential test suites (fix: parallelize with pytest-xdist or Bazel test sharding).
dockerfile # Multi-stage build for vLLM inference container # Stage 1: Build wheels (cached unless requirements change) FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS builder COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt \ && pip wheel vllm==0.6.0 -w /wheels # Stage 2: Runtime (smaller image, no build tools) FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04 COPY --from=builder /wheels /wheels RUN pip install --no-cache-dir /wheels/*.whl && rm -rf /wheels # Model weights: DON'T bake into image (140GB → huge image) # Instead, mount from shared volume or download at startup ENV MODEL_PATH=/models/llama-70b ENV HF_HUB_CACHE=/models/.cache # Health check endpoint HEALTHCHECK --interval=10s --timeout=5s --retries=3 \ CMD curl -f http://localhost:8000/health || exit 1 ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server"] CMD ["--model", "/models/llama-70b", "--tensor-parallel-size", "2"]
Parallel's engineers need to test inference changes locally, but they do not have H100 GPUs on their laptops. The strategy has three tiers:
| Tier | What | GPU | Use case |
|---|---|---|---|
| Local | Docker Compose + CPU-only vLLM | None | API contract testing, UI development |
| Dev cluster | Shared K8s namespace with 2x A10G | A10G (24GB) | Small model testing (7B), integration tests |
| Staging | Full replica of prod with 4x H100 | H100 | Performance benchmarks, canary validation |
The key insight: local development does not need real GPU inference. Mock the inference endpoint with a service that returns canned responses at realistic latency. Engineers only need real GPUs when testing model-specific behavior, which happens in the dev cluster or staging.
python import launchdarkly as ld class InferenceRouter: """Route requests based on feature flags. Decouples deployment from release.""" def __init__(self, ld_client: ld.LDClient): self.ld = ld_client async def route(self, request, user_context: dict) -> str: # Check if user should get the new model version use_new_model = self.ld.variation( "llama-3.2-rollout", user_context, default=False ) # Check if speculative decoding is enabled for this customer tier use_spec_decode = self.ld.variation( "speculative-decoding", user_context, default=False ) # Route to appropriate model endpoint if use_new_model: endpoint = "inference-v2.production.svc" else: endpoint = "inference-v1.production.svc" # Attach decoding strategy to request metadata request.metadata["spec_decode"] = use_spec_decode return endpoint # Feature flags enable gradual rollouts: # Day 1: Enable for internal users (1%) # Day 3: Enable for beta customers (5%) # Day 7: Enable for all free-tier users (50%) # Day 14: Enable for all users (100%) # At any point: kill switch if quality regresses
Every infrastructure change at Parallel follows this workflow. No exceptions, no "just this once" SSH changes:
| Step | What happens | Tool |
|---|---|---|
| 1. Branch | Engineer creates a feature branch with infrastructure changes | Git |
| 2. Plan | CI runs terraform plan / pulumi preview and posts the diff as a PR comment | GitHub Actions + Terraform |
| 3. Review | Another engineer reviews the plan. For security group changes, requires SRE approval. | GitHub PR review |
| 4. Apply to staging | Merge to main triggers automatic apply to staging environment | ArgoCD / Terraform Cloud |
| 5. Validate staging | Run smoke tests against staging. Check that services are healthy. | Automated test suite |
| 6. Apply to production | Manual approval gate. Engineer clicks "Approve" after staging validation. | ArgoCD sync + approval |
| 7. Monitor | Watch dashboards for 30 minutes. If metrics deviate, rollback = git revert. | Grafana + PagerDuty |
git revert on the commit. ArgoCD sees the revert, reconciles the cluster to the previous state. No need to remember what changed or how to undo it manually. The entire rollback is auditable, reviewed, and reproducible. This is the superpower of GitOps.GitHub Merge Queue and Graphite's Stacked PRs represent the 2025 frontier of developer velocity. Merge Queue runs CI on the merge result (not the branch head), preventing broken main. Stacked PRs let engineers submit 5 dependent PRs and review/merge them independently. Combined with Bazel's remote cache (build only what changed), top teams achieve <5 minute CI + deploy cycles.
Watch a canary deployment progress through stages. Inject a regression to see automatic rollback.
You cannot fix what you cannot see. An LLM serving platform has hundreds of metrics, thousands of log streams, and millions of traces per day. The challenge is not collecting data — it is making that data useful. When a PagerDuty alert fires at 3 AM, you need to go from "something is wrong" to "this specific component failed for this specific reason" in under 5 minutes. That requires observability — not just monitoring (dashboards and alerts), but the ability to ask arbitrary questions about your system's behavior.
Metrics (Prometheus, Datadog): Numeric time-series data. Request rate, error rate, latency percentiles, GPU utilization, memory usage. Cheap to store, fast to query, good for dashboards and alerts. Bad for debugging specific requests.
Logs (Loki, Elasticsearch): Unstructured or semi-structured text. Application errors, kernel messages, NVIDIA driver warnings. Good for debugging specific events. Bad for aggregation (expensive to count "how many errors per minute" from logs).
Traces (Jaeger, Tempo): Request-level timelines showing how a request flows through services. A single inference request generates spans for: API gateway (2ms), auth (1ms), model router (3ms), queue wait (50ms), prefill (180ms), decode (400ms), response serialization (2ms). Traces show you exactly where latency lives.
Standard web metrics (request rate, error rate, latency) are necessary but not sufficient for LLM serving. You need LLM-specific metrics:
| Metric | What it measures | Alert threshold |
|---|---|---|
| Time To First Token (TTFT) | Latency from request to first generated token | p99 > 2s |
| Inter-Token Latency (ITL) | Time between consecutive generated tokens | p99 > 100ms |
| Tokens/sec/GPU | Throughput per GPU (efficiency) | < 40 tok/s/GPU |
| KV Cache Utilization | % of PagedAttention blocks in use | > 95% (preemptions imminent) |
| Batch Size (running) | Number of requests in the continuous batch | < 4 (underutilized GPU) |
| Queue Depth | Requests waiting for batch admission | > 100 (need more GPUs) |
| Preemptions/min | KV cache evictions per minute | > 10/min (memory pressure) |
| GPU SM Utilization | Streaming multiprocessor activity (true compute usage) | < 30% (wasted GPU) |
python from prometheus_client import Histogram, Gauge, Counter, start_http_server # LLM-specific metrics TTFT = Histogram( "llm_time_to_first_token_seconds", "Time from request to first generated token", buckets=[0.1, 0.25, 0.5, 0.8, 1.0, 2.0, 5.0], labelnames=["model", "region"], ) ITL = Histogram( "llm_inter_token_latency_seconds", "Latency between consecutive generated tokens", buckets=[0.01, 0.02, 0.05, 0.1, 0.2], labelnames=["model"], ) KV_CACHE_UTIL = Gauge( "llm_kv_cache_utilization_ratio", "Fraction of KV cache blocks in use", labelnames=["gpu_id"], ) QUEUE_DEPTH = Gauge( "llm_request_queue_depth", "Number of requests waiting for batch admission", ) PREEMPTIONS = Counter( "llm_kv_cache_preemptions_total", "Total KV cache preemptions (evictions)", ) # Usage in the serving loop: with TTFT.labels(model="llama-70b", region="us-east-1").time(): first_token = await engine.generate_first_token(request) # Expose metrics on :8080/metrics for Prometheus scraping start_http_server(8080)
The most insidious observability failure is not missing alerts — it is too many alerts. When oncall gets 50 alerts per day, they start ignoring them. The one critical alert that matters drowns in noise.
Fix: classify every alert as P1 (pages, wakes you up), P2 (ticket, fix today), or P3 (review weekly). Delete P3 alerts and convert them to dashboard panels. For P1 alerts, enforce the rule: every P1 alert must have a runbook that can be executed by any engineer, not just the alert's author. If an alert doesn't have a runbook, it is not ready to be P1.
python from opentelemetry import trace from opentelemetry.sdk.trace import TracerProvider from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter # Setup: export traces to Jaeger/Tempo via OTLP provider = TracerProvider() provider.add_span_processor( BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317")) ) trace.set_tracer_provider(provider) tracer = trace.get_tracer("parallel.inference") async def handle_inference(request): with tracer.start_as_current_span("inference.request") as span: span.set_attribute("model", request.model) span.set_attribute("input_tokens", request.token_count) # Span 1: Auth check with tracer.start_as_current_span("auth.validate"): user = await validate_api_key(request.api_key) # Span 2: Route to model instance with tracer.start_as_current_span("router.select") as rspan: instance = await model_router.select(request) rspan.set_attribute("selected_instance", instance.id) rspan.set_attribute("queue_depth", instance.queue_depth) # Span 3: Inference (the expensive part) with tracer.start_as_current_span("inference.generate") as ispan: result = await instance.generate(request.prompt) ispan.set_attribute("output_tokens", result.token_count) ispan.set_attribute("ttft_ms", result.ttft_ms) ispan.set_attribute("gpu_id", instance.gpu_id) span.set_attribute("total_latency_ms", result.total_ms) return result
X-Request-ID). When debugging, you search by request_id and instantly get the complete trace: which model instance handled it, how long prefill took, whether KV cache was preempted, what the GPU utilization was during generation. Without request_id correlation, you are guessing.The oncall engineer's primary dashboard has four rows, designed for 3 AM triage when your brain is running at 50%:
| Row | What | Why |
|---|---|---|
| Row 1: Golden Signals | Request rate, error rate, TTFT p99, throughput (tok/s) | Is the system working? Anomaly = something is wrong. |
| Row 2: Resource Health | GPU util per node, KV cache %, memory %, CPU %, disk I/O | Where is the bottleneck? High KV cache = memory pressure. Low GPU util = wasted money. |
| Row 3: Infrastructure | Node count, pod restarts, OOM kills, network errors, spot reclamation events | What changed? Node loss or pod crash explains sudden degradation. |
| Row 4: Business | Cost/request, requests by customer tier, error budget remaining | Should we care? P1 customer affected = all hands. Budget burned = freeze deploys. |
The 2025 frontier: LLMs analyzing your observability data. Datadog's "Bits AI", Grafana's LLM features, and startups like Coralogix use LLMs to: correlate anomalies across metrics/logs/traces, suggest root causes from historical incident data, and auto-generate PromQL/LogQL queries from natural language. The killer feature: "What changed?" — the LLM diffs all recent deployments, config changes, and traffic patterns against the anomaly timeline.
python import numpy as np from collections import deque class ZScoreAnomalyDetector: """Detect anomalies in GPU metrics using rolling z-score. Simple but effective for metrics with stable baselines.""" def __init__(self, window: int = 120, threshold: float = 3.0): self.window = window # Rolling window (data points) self.threshold = threshold # Z-score threshold for anomaly self.history: deque = deque(maxlen=window) def ingest(self, value: float) -> dict: self.history.append(value) if len(self.history) < 30: return {"anomaly": False, "reason": "warming up"} arr = np.array(self.history) mean = arr.mean() std = arr.std() if std < 1e-6: return {"anomaly": False, "z_score": 0} z = (value - mean) / std is_anomaly = abs(z) > self.threshold return { "anomaly": is_anomaly, "z_score": round(z, 2), "value": value, "mean": round(mean, 2), "direction": "high" if z > 0 else "low", } # Usage: detect GPU utilization anomalies detector = ZScoreAnomalyDetector(window=120, threshold=3.0) # Feed every 10-second GPU utilization sample result = detector.ingest(gpu_util_pct) if result["anomaly"]: alert(f"GPU util anomaly: {result['value']}% (z={result['z_score']})")
Parallel generates ~100GB of logs per day from 40+ services. The pipeline must handle this volume without losing messages or creating backpressure on the services producing logs:
Live metrics for a simulated LLM cluster. Watch TTFT, throughput, GPU utilization, and KV cache fill. Click Traffic Spike to see metrics respond.
Everything you have learned comes together in this interactive simulation. You are running a GPU inference cluster serving LLM requests at scale. You control the infrastructure parameters. The simulation shows the consequences of your decisions in real time — throughput, latency, GPU utilization, cost per request. Break the system by creating traffic spikes. Fix it by adjusting resources. Find the optimal configuration that maximizes throughput while staying under your cost budget.
Control a GPU cluster serving LLM requests. Adjust parameters and watch throughput, latency, GPU utilization, and cost respond in real time. Try to maximize throughput while keeping cost per request under $0.01.
Throughput (tok/s): Total tokens generated per second across all GPUs. Higher is better. Limited by GPU memory bandwidth and batch efficiency.
TTFT p99 (ms): 99th percentile time to first token. Lower is better. Spikes when queue depth is high (requests waiting for batch admission). Your SLO is 800ms.
GPU Utilization (%): Average streaming multiprocessor activity. Target: 60-85%. Below 40% means wasted money. Above 90% means you are at capacity with no headroom for spikes.
Cost/request ($): Total GPU cost divided by requests served. Your budget is $0.01 per request. This includes idle GPU time — so underutilized GPUs increase this even if individual requests are efficient.
KV Cache (%): PagedAttention cache utilization. Above 90% triggers preemptions and latency spikes. Affected by batch size and sequence length.
The simulation models the core physics of LLM inference infrastructure. Here is how each parameter affects the metrics:
GPU Nodes: More nodes = more model instances = higher throughput and lower latency. But also higher cost. The relationship is roughly linear for throughput (2x nodes ≈ 2x throughput) but sublinear for latency (adding nodes only helps if the current ones are overloaded).
GPU Type: A100 → H100 → B200 increases memory bandwidth (2.0 → 3.35 → 8.0 TB/s), which directly increases decode speed. The B200 generates tokens ~2.4x faster than H100 because it reads model weights 2.4x faster. But it also costs ~1.8x more per hour. The cost/performance sweet spot depends on your traffic volume.
Max Batch Size: Larger batches amortize the weight-read cost across more requests, improving GPU utilization and throughput. But larger batches increase per-request latency because each decode step takes longer (more tokens to generate). The optimal batch size balances throughput against your TTFT SLO.
INT4 Quantization: Reduces model memory from 140GB to 35GB, allowing more model instances per node and/or freeing memory for KV cache. Throughput increases because less data is read from memory per token. Quality decreases by ~1-3% on benchmarks (usually acceptable for inference).
| Challenge | Goal | Hint |
|---|---|---|
| Cost Efficient | Handle 200 req/s at <$0.008/req | Fewer GPUs with higher utilization + INT4 |
| Low Latency | TTFT p99 < 300ms at 100 req/s | More GPUs with smaller batch size |
| Survive the Spike | Handle 3x traffic spike without p99 > 2s | Headroom in GPU count, moderate batch size |
| Maximum Throughput | Achieve highest tok/s possible | Max GPUs + max batch + B200 + INT4 |
You have now covered the entire infrastructure stack for an LLM scaling engineer. This chapter distills everything into an interview-ready arsenal: system design templates, coding patterns, debugging frameworks, and the specific questions you will face.
Every system design interview follows the same structure. Here is your framework:
| Category | Question | Key Points |
|---|---|---|
| System Design | "Design a multi-region LLM serving platform" | Global LB, regional K8s, model router, vLLM, KV cache, failover strategy, data replication |
| System Design | "Design a GPU cluster scheduler" | SLURM vs K8s, preemption, topology-aware scheduling, MIG, resource broker between training and inference |
| Coding | "Implement a consistent hash ring" | Virtual nodes, bisect for O(log n) lookup, minimal redistribution on node add/remove |
| Coding | "Write a circuit breaker" | Three states (closed/open/half-open), failure counting, timeout-based recovery |
| Coding | "Write a rate limiter (token bucket)" | Atomic token update, burst allowance, distributed rate limiting with Redis |
| Debug | "p99 latency doubled after a deployment" | Check recent deploys, compare metrics before/after, look at traces for slow span, check resource utilization |
| Debug | "GPU utilization dropped to 10%" | nvidia-smi for hardware errors, scheduler for job failures, NVLink status, driver/CUDA version mismatch |
| Frontier | "How would you serve a 405B model?" | 8-GPU tensor parallelism + pipeline parallelism across nodes, FP8 quantization, speculative decoding, disaggregated prefill/decode |
These are the 5 coding problems most likely to appear in infrastructure interviews. Practice each until you can write it from memory in under 10 minutes:
1. Rate Limiter (Token Bucket):
python import time class TokenBucket: def __init__(self, rate: float, capacity: int): self.rate = rate # Tokens per second self.capacity = capacity # Max burst size self.tokens = capacity # Current tokens self.last_refill = time.monotonic() def allow(self, tokens: int = 1) -> bool: now = time.monotonic() elapsed = now - self.last_refill self.tokens = min(self.capacity, self.tokens + elapsed * self.rate) self.last_refill = now if self.tokens >= tokens: self.tokens -= tokens return True return False limiter = TokenBucket(rate=100, capacity=200) # 100 req/s, burst of 200 if limiter.allow(): process_request() else: return HttpResponse(status=429) # Too Many Requests
2. LRU Cache (for KV cache eviction):
python from collections import OrderedDict class LRUCache: def __init__(self, capacity: int): self.cache = OrderedDict() self.capacity = capacity def get(self, key: str): if key not in self.cache: return None self.cache.move_to_end(key) # Mark as recently used return self.cache[key] def put(self, key: str, value): if key in self.cache: self.cache.move_to_end(key) self.cache[key] = value if len(self.cache) > self.capacity: self.cache.popitem(last=False) # Evict LRU
| Step | Action | Tools |
|---|---|---|
| 1. Detect | What metric violated its SLO? | PagerDuty alert, Grafana dashboard |
| 2. Correlate | What changed? Recent deploys, config changes, traffic patterns | Deploy log, git log, traffic graphs |
| 3. Isolate | Which component is responsible? Follow the trace. | Jaeger traces, service-level metrics |
| 4. Root cause | Why is that component broken? | Logs, nvidia-smi, kubectl describe, dmesg |
| 5. Fix + Prevent | Fix the immediate issue, then add guardrails | Rollback, hotfix, new alert, postmortem |
| Fact | Value |
|---|---|
| H100 memory bandwidth | 3.35 TB/s |
| B200 memory bandwidth | 8.0 TB/s |
| 70B FP16 model size | 140 GB |
| 70B INT4 model size | ~35 GB |
| KV cache per token (70B, FP16) | ~1.6 MB |
| vLLM continuous batching improvement | 2-10x vs static |
| Speculative decoding speedup | 1.5-3x |
| Spot instance discount | 60-90% off on-demand |
| 99.99% uptime = downtime per year | 52.6 minutes |
| 99.9% uptime = downtime per year | 8.76 hours |
| PCIe Gen5 bandwidth | 64 GB/s (per direction) |
| NVLink 4.0 bandwidth (H100) | 900 GB/s total |
Every system design interview starts with back-of-envelope math. Here is the template for LLM infrastructure:
text Given: - Target: 100K req/s - Average request: 500 input tokens + 200 output tokens = 700 tokens - Model: 70B FP16, needs 2x H100 (tensor parallel) - Single 2-GPU instance throughput: ~80 tok/s (decode) - With continuous batching (batch=64): ~1500 tok/s effective Calculation: Total tokens needed: 100K req/s × 700 tok/req = 70M tok/s Model instances needed: 70M / 1500 = ~47,000 instances GPUs needed: 47,000 × 2 = 94,000 H100s Wait — that's obviously wrong. 100K req/s is PEAK, not sustained. Average is probably 20K req/s. And requests take 3-5 seconds, so concurrency is 20K × 4s = 80K concurrent, not 20K/s × 700 tok. Corrected: Concurrent requests: 80K Requests per GPU instance: 64 (batch size) GPU instances needed: 80K / 64 = 1,250 GPUs needed: 1,250 × 2 = 2,500 H100s Cost: 2,500 × $10/hr = $25,000/hr = $18.25M/month With INT4 quantization (1 GPU per instance): GPU instances needed: 1,250 GPUs needed: 1,250 Cost: $12,500/hr = $9.1M/month (50% savings)
3. Distributed Rate Limiter (Redis):
python import redis, time class DistributedRateLimiter: """Sliding window rate limiter using Redis sorted sets. Works across multiple server instances.""" def __init__(self, redis_client: redis.Redis, limit: int, window_s: int): self.r = redis_client self.limit = limit self.window = window_s def allow(self, key: str) -> bool: now = time.time() pipe = self.r.pipeline() # Remove entries outside the window pipe.zremrangebyscore(key, 0, now - self.window) # Count entries in the window pipe.zcard(key) # Add current request pipe.zadd(key, {str(now): now}) # Set TTL so keys auto-expire pipe.expire(key, self.window) results = pipe.execute() count = results[1] return count < self.limit # Usage: per-customer rate limit limiter = DistributedRateLimiter(redis.Redis(), limit=1000, window_s=60) if not limiter.allow(f"rate:{customer_id}"): return HttpResponse(status=429)
4. Health Check Aggregator:
python import asyncio, aiohttp class HealthAggregator: """Check health of all dependencies, return aggregate status.""" def __init__(self, checks: dict[str, str]): self.checks = checks # {"db": "http://db:5432/health", ...} async def check_all(self, timeout: float = 3.0) -> dict: results = {} async with aiohttp.ClientSession() as session: tasks = { name: session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) for name, url in self.checks.items() } for name, task in tasks.items(): try: resp = await task results[name] = {"status": "healthy" if resp.status == 200 else "degraded"} except Exception as e: results[name] = {"status": "unhealthy", "error": str(e)} overall = "healthy" if all( r["status"] == "healthy" for r in results.values() ) else "degraded" return {"overall": overall, "checks": results}
5. Exponential Backoff with Jitter:
python import random, time def retry_with_backoff(func, max_retries=5, base_delay=0.5, max_delay=30): """Retry with exponential backoff + full jitter. Jitter prevents thundering herd when many clients retry simultaneously.""" for attempt in range(max_retries): try: return func() except Exception as e: if attempt == max_retries - 1: raise # Exponential backoff with full jitter delay = min(max_delay, base_delay * (2 ** attempt)) jittered = random.uniform(0, delay) time.sleep(jittered)
A reference architecture for "Design a multi-region LLM serving platform." This is your whiteboard answer. Study the components and their connections.
Infrastructure engineering at the frontier of AI is one of the most challenging and rewarding roles in software. You operate at the intersection of systems engineering, machine learning, economics, and reliability. Every decision you make — which GPU to buy, how to batch requests, when to shard the database, how to handle spot reclamation — has measurable impact on the business. A 10% utilization improvement saves hundreds of thousands of dollars per month. A 100ms latency reduction makes every customer's product feel faster. A well-designed failover system means no one ever notices when hardware fails.
The field is moving fast. New GPU architectures every 18 months. New serving frameworks every 6 months. New cost optimization techniques every quarter. The engineer who thrives in this environment is not the one who memorizes today's best practices, but the one who understands first principles deeply enough to adapt when everything changes. When the B300 arrives and makes the B200 look slow, you will know how to evaluate it. When a new serving framework claims 3x better throughput, you will know how to benchmark it honestly. When leadership asks "can we serve 10x more traffic at half the cost?", you will know which levers to pull and which tradeoffs to accept.
Go build. Break things in staging. Fix them before anyone notices. And when the pager goes off at 3 AM, take a deep breath, pull up Grafana, and follow the metrics. They will tell you the story.