Infrastructure & LLM Scaling Engineer at Parallel

Chapter 0: The Infrastructure Engineer's World

It is 2:47 AM. Your phone buzzes. PagerDuty: "LLM inference cluster us-east-1: p99 latency 4200ms (SLO: 800ms). GPU utilization dropped from 78% to 12%." You pull up Grafana on your phone. The throughput graph fell off a cliff 11 minutes ago. You SSH into the bastion host and check the NVIDIA driver logs — three H100 nodes reported an ECC memory error and the SLURM scheduler evicted all jobs. But the autoscaler hasn't replaced them because the spot capacity in us-east-1 is exhausted at this hour.

You execute the runbook: failover inference traffic to us-west-2 via the global load balancer, then spin up on-demand instances in us-east-1 as a bridge until spot capacity returns. Total time to recovery: 7 minutes. Cost of the on-demand bridge: ~$340/hour. Cost of the outage if you hadn't responded: ~$12,000/minute in lost API revenue.

This is the daily reality of an Infrastructure & LLM Scaling Engineer at Parallel. You are the person who makes sure the AI actually runs — at scale, cost-efficiently, and reliably. Every prompt that flows through Parallel's platform touches infrastructure you built or maintain: the Kubernetes clusters, the GPU scheduling, the model serving layer, the database shards, the monitoring pipelines, the CI/CD system, the cost optimization automation.

The role sits at the intersection of five engineering disciplines:

Discipline	What you own	Your daily intersection
Cloud & IaC	Terraform modules, K8s clusters, service mesh, multi-region	Every infrastructure change goes through your IaC pipeline
GPU & ML Infra	GPU scheduling, driver management, model serving, batch tuning	You keep the most expensive hardware in the fleet utilized
Distributed Systems	Data replication, consensus, sharding, consistency guarantees	You design the systems that survive partial failures
Reliability	SLOs, error budgets, chaos engineering, incident response	You define what "reliable" means and enforce it
Cost Engineering	Spot strategy, right-sizing, utilization monitoring, spend forecasting	You make the $2B company profitable by controlling infrastructure spend

Five dimensions, one platform. This lesson covers all five because a staff-level infrastructure engineer must reason across them. GPU scheduling is useless without cost awareness. Cost optimization is meaningless without reliability constraints. Reliability is unachievable without observability. And none of it matters if the distributed systems fundamentals are wrong. Every chapter prepares you to design, build, debug, and defend a complete infrastructure platform.

The Numbers That Define Your World

Before we dive in, internalize these numbers. They come up in every infrastructure interview and every oncall incident:

Fact	Value	Why it matters
H100 80GB cost (cloud)	~$10/hr	Your biggest cost line item. Every idle minute = $0.17 wasted.
H100 memory bandwidth	3.35 TB/s	The bottleneck for LLM decode. Determines tokens/sec/GPU.
70B model in FP16	140 GB	Doesn't fit on one H100. Need TP=2 minimum.
70B model in INT4	~35 GB	Fits on one H100 with room for KV cache. 50% cost reduction.
KV cache per token (70B)	~1.6 MB	128 concurrent requests × 4096 tokens ≈ 840 GB of KV cache.
NVLink bandwidth (H100)	900 GB/s	~7x the bandwidth of PCIe Gen5 x16 (128 GB/s). Required for tensor parallelism.
99.99% uptime	52 min/yr downtime	Every minute of outage burns a week of error budget.
Spot instance savings	60-90%	The biggest cost lever after GPU utilization.
Continuous batching improvement	2-10x	The single most impactful serving optimization.
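
Several rows in this table are plain arithmetic that is worth being able to redo on a whiteboard. The sketch below recomputes a few of them from the table's own inputs; the helper names are illustrative and not part of any library.

```python
# Back-of-envelope checks for the numbers table.
# Inputs are the table's own figures; helper names are hypothetical.

def weights_gb(params_billions: float, bits_per_param: int) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def min_gpus_for_weights(model_gb: float, hbm_gb: float = 80.0) -> int:
    """Smallest GPU count whose combined HBM holds the weights (ignores KV cache)."""
    return int(-(-model_gb // hbm_gb))  # ceiling division


def idle_cost_per_minute(dollars_per_hour: float) -> float:
    return dollars_per_hour / 60


def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * 365 * 24 * 60


fp16_gb = weights_gb(70, 16)
int4_gb = weights_gb(70, 4)
print(f"70B FP16: {fp16_gb:.0f} GB -> {min_gpus_for_weights(fp16_gb)} x H100")
print(f"70B INT4: {int4_gb:.0f} GB -> {min_gpus_for_weights(int4_gb)} x H100")
print(f"Idle H100 at $10/hr: ${idle_cost_per_minute(10):.2f} per minute")
print(f"99.99% uptime: {downtime_minutes_per_year(0.9999):.1f} min/yr of downtime")
```

The GPU counts here cover weights only; real deployments also need room for KV cache and activations.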

A Day In Your Life

9:00 AM — Check the overnight dashboard. GPU utilization averaged 72% (good). 85% of the error budget remains (comfortable). One P2 ticket: a customer's rate limiter is misconfigured, so they are getting 429s.

10:00 AM — Design review for the new B200 migration plan. You present: phase 1 is a canary cluster (4 nodes, 32 B200s), run shadow traffic from production, compare latency/throughput/quality against H100 baseline. Phase 2 is gradual migration over 6 weeks. The tricky part: B200 uses a different CUDA driver version, so every container image needs rebuilding.

11:30 AM — Pair with a junior engineer on their K8s operator for GPU autoscaling. They wrote the controller logic but forgot to handle the case where spot instances are reclaimed mid-scaling. You add a reconciliation loop that checks for unexpected node removals.

2:00 PM — Cost review meeting. You present the monthly infrastructure report: GPU spend is $3.2M, down 8% from last month thanks to the FP8 quantization rollout. Networking egress is unexpectedly $180K — investigation reveals a misconfigured cross-region data pipeline that is copying model telemetry to all three regions instead of aggregating locally.

4:00 PM — Chaos engineering session. You run the "kill a random GPU node" experiment in production (with traffic shifted to other nodes first). The autoscaler takes 4 minutes to replace the node. Your SLO requires recovery within 5 minutes. Passing, but barely. You file a ticket to reduce cold-start time by pre-pulling container images to all nodes.

5:30 PM — Postmortem for last week's incident (a bad Terraform apply deleted a security group rule, briefly exposing an internal service to the internet). You draft the timeline, root cause (lack of terraform plan review for auto-applied changes), and action items (require human approval for security group modifications).

The Platform You Operate

The diagram below traces a single LLM inference request from arrival to response. Every box is a system you own or co-own. This is your whiteboard answer in a system-design interview.

1. Edge & Load Balancing

Request arrives via CDN edge. L7 load balancer does TLS termination, rate limiting, geographic routing. Routes to nearest healthy region based on latency and capacity.

↓

2. API Gateway & Auth

API key validation, quota enforcement, request schema validation. Enqueue into the inference request queue with priority based on customer tier.

↓

3. Model Router

Select model variant (7B vs 70B vs MoE), choose serving instance based on current load, KV cache availability, and batch fullness. Implements speculative routing for latency-sensitive requests.

↓

4. GPU Inference Cluster

vLLM/TensorRT-LLM serving engine. Continuous batching, PagedAttention for KV cache, tensor parallelism across multi-GPU nodes. The most expensive and most optimized layer.

↓

5. Response & Telemetry

Stream tokens back via SSE/WebSocket. Log full request trace (latency breakdown, tokens generated, GPU utilization, cost) to the observability pipeline. Feed cost-tracking and capacity planning systems.

Interview Dimensions

Staff-level interviews at companies like Parallel, Anthropic, OpenAI, and Databricks test you across five dimensions. Each chapter maps to one or more:

Dimension	What they ask	Chapters
CONCEPT	"Explain how PagedAttention works and why it matters for serving"	0, 3, 4, 8
DESIGN	"Design a multi-region GPU inference platform for 100K req/s"	1, 2, 5, 8
CODE	"Write a K8s operator that autoscales GPU pods based on queue depth"	1, 2, 6, 9
DEBUG	"p99 latency doubled after a deployment. Walk me through investigation."	3, 7, 10, 11
FRONTIER	"How does SGLang compare to vLLM for MoE models in 2025?"	2, 3, 6, 12

An interviewer asks: "Your LLM serving cluster's GPU utilization dropped from 78% to 12% overnight but no deployments happened. What's your first investigation step?"

A. Restart all GPU nodes to clear any memory leaks
B. Check if the model weights got corrupted
C. Check NVIDIA driver/hardware logs for ECC errors or node evictions, then check the scheduler for job failures
D. Scale up more GPU nodes to compensate

Chapter 1: Cloud Infrastructure & IaC

You cannot SSH into 400 servers and configure them by hand. You cannot click through the AWS console to set up a VPC, three subnets, a NAT gateway, security groups, an EKS cluster, and a service mesh. You cannot do it because it is not reproducible, not auditable, not reviewable, and not recoverable. If someone accidentally deletes a security group, there is no "undo." If a new hire needs to set up a staging environment, there is no "clone production."

Infrastructure as Code (IaC) means your entire infrastructure — every VPC, every firewall rule, every Kubernetes cluster, every IAM policy — exists as version-controlled code. Change infrastructure by making a pull request. Review it like you review application code. Roll it back with git revert. This is not optional for a $2B company. It is the foundation of everything.

CONCEPT: Terraform vs. Pulumi

Terraform uses HCL (HashiCorp Configuration Language), a declarative DSL. You describe the desired state ("I want 3 EC2 instances of type p4d.24xlarge in us-east-1") and Terraform figures out how to get there. It maintains a state file that records what currently exists, computes a diff against your desired state, and generates an execution plan.

Pulumi lets you write infrastructure in real programming languages (Python, TypeScript, Go). Instead of learning HCL, you use familiar constructs: loops, conditionals, functions, classes. The tradeoff: Pulumi code can have bugs that HCL cannot (infinite loops, runtime exceptions). But it is vastly more expressive for complex infrastructure.

Dimension	Terraform	Pulumi
Language	HCL (declarative DSL)	Python, TS, Go (general-purpose languages)
State	JSON state file (S3, Terraform Cloud)	Pulumi Service or self-hosted backend
Testing	terraform plan + Sentinel policies	Unit tests with pytest/jest
Learning curve	New DSL but simple	Your existing language
Industry adoption	Dominant (2024-2025)	Growing, esp. in startups
Drift detection	terraform plan detects drift	pulumi refresh (or pulumi preview --refresh) detects drift

DESIGN: Multi-Region K8s Architecture

Parallel serves customers globally. A single-region architecture means that if us-east-1 has an outage (and it does, roughly once per year), every customer is down. A multi-region architecture runs independent clusters in at least three regions, with a global load balancer routing traffic to the nearest healthy region.

The architecture has three layers:

Layer 1: Global routing. A DNS-based or anycast load balancer (AWS Global Accelerator, Cloudflare) routes requests to the nearest region. Health checks remove unhealthy regions in under 30 seconds.

Layer 2: Regional K8s clusters. Each region runs an independent EKS/GKE cluster with its own control plane, worker nodes, and GPU node pools. Clusters share configuration (via GitOps) but are operationally independent.

Layer 3: Data replication. The hard part. Model weights are read-only and can be pre-cached in each region. But stateful data (user sessions, conversation history, rate limit counters) must be replicated. Options: CockroachDB for strong consistency (higher latency), Redis Cluster with eventual consistency (lower latency), or a hybrid where critical data uses strong consistency and session data is eventually consistent.

CODE: Terraform Module for GPU K8s Cluster

hcl
# modules/gpu-cluster/main.tf
# Reusable module: one GPU-enabled EKS cluster per region

variable "region" {
  type = string
}

variable "gpu_type" {
  type    = string
  default = "p5.48xlarge" # 8x H100 per node
}

# Node-group sizes are counted in nodes, not GPUs (8 GPUs per p5.48xlarge).
variable "min_gpu_nodes" {
  type    = number
  default = 4
}

variable "max_gpu_nodes" {
  type    = number
  default = 32
}

variable "k8s_version" {
  type    = string
  default = "1.30"
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  name    = "parallel-${var.region}"
  cidr    = "10.0.0.0/16"
  azs     = ["${var.region}a", "${var.region}b", "${var.region}c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  enable_nat_gateway     = true
  single_nat_gateway     = false  # One NAT per AZ for HA
  enable_dns_hostnames   = true
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "parallel-${var.region}"
  cluster_version = var.k8s_version
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets

  eks_managed_node_groups = {
    gpu_pool = {
      instance_types = [var.gpu_type]
      min_size       = var.min_gpu_nodes
      max_size       = var.max_gpu_nodes
      desired_size   = var.min_gpu_nodes
      ami_type       = "AL2_x86_64_GPU"
      labels         = { "nvidia.com/gpu.present" = "true" }
      # Keep non-GPU workloads off these nodes (see taints and tolerations below)
      taints         = [{ key = "nvidia.com/gpu", value = "true", effect = "NO_SCHEDULE" }]
    }
    cpu_pool = {
      instance_types = ["m6i.4xlarge"]
      min_size       = 3
      max_size       = 20
    }
  }
}

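Key insight: taints and tolerations. GPU nodes are expensive ($30-$90/hour). You taint them with nvidia.com/gpu=true:NoSchedule so that only pods with a matching toleration (your inference workloads) get scheduled there. Without the taint, CPU-only pods such as batch jobs or internal services can land on an H100 node, consume the CPU and memory your inference pods need, and keep a $90/hour node from scaling down. Node-level agents that must run everywhere (log shippers, monitoring DaemonSets) need an explicit toleration. Forgetting the taint is one of the most common IaC mistakes in GPU clusters.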

DEBUG: State Drift

The most dangerous IaC bug is state drift — when the real infrastructure diverges from what Terraform thinks exists. Someone clicked in the console and added a security group rule. Terraform doesn't know about it. Next terraform apply might delete that rule, breaking production.

bash
# Detect drift: compare state to reality
terraform plan -refresh-only

# If drift detected, two options:
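# 1. Import the manual change into state (adopt it).
#    A matching resource block must exist in config first. The import ID for
#    aws_security_group_rule encodes the whole rule, not just the group ID
#    (the port and CIDR below are example values):
terraform import aws_security_group_rule.manual_rule sg-12345_ingress_tcp_443_443_10.0.0.0/16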

# 2. Override the manual change (enforce IaC as source of truth)
terraform apply  # Will revert manual changes

# Prevention: lock down console access with SCPs
# Run drift detection on a cron (every 4 hours)

FRONTIER: GitOps with ArgoCD (2024-2025)

GitOps takes IaC further: Git is not just where you store infrastructure code, it is the sole source of truth for cluster state. A GitOps controller (ArgoCD, Flux) continuously watches your Git repo and reconciles the cluster to match. If someone manually changes a Kubernetes resource, the controller reverts it within seconds, provided automated sync with self-healing is enabled.

The pattern for Parallel: one Git repo contains Helm charts and Kustomize overlays for every service. ArgoCD runs in each regional cluster. A merge to main triggers automatic rollout to staging, then a manual approval gate for production. Rollback is git revert — ArgoCD sees the revert commit and reconciles the cluster back to the previous state.

CODE: K8s GPU Autoscaler Operator

python
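# Simplified K8s operator: scale GPU pods based on inference queue depth
import os

import kopf
import requests
from kubernetes import client, config

# Assumption: Prometheus is reachable at this address from inside the cluster.
PROMETHEUS_URL = os.environ.get('PROMETHEUS_URL', 'http://prometheus:9090')


@kopf.on.startup()
def load_kube_config(**kwargs):
    """The kubernetes client library needs its own credentials."""
    config.load_incluster_config()


def query_prometheus(promql: str) -> float:
    """Run an instant query against the Prometheus HTTP API; 0.0 if no samples."""
    resp = requests.get(f'{PROMETHEUS_URL}/api/v1/query',
                        params={'query': promql}, timeout=5)
    resp.raise_for_status()
    result = resp.json()['data']['result']
    return float(result[0]['value'][1]) if result else 0.0


@kopf.timer('apps', 'v1', 'deployments',
            labels={'parallel.ai/gpu-autoscale': 'true'},
            interval=30)
def autoscale_gpu_pods(name, namespace, spec, body, **kwargs):
    """Scale GPU inference pods based on queue depth metric."""
    # Read queue depth from Prometheus
    queue_depth = query_prometheus(
        f'llm_request_queue_depth{{deployment="{name}"}}'
    )

    # Use the declared replica count, not status.readyReplicas: pods that are
    # still starting would otherwise look missing and trigger repeated scale-ups.
    current_replicas = spec.get('replicas', 1)
    annotations = spec.get('template', {}).get('metadata', {}).get('annotations', {})
    max_replicas = int(annotations.get('parallel.ai/max-replicas', '16'))
    min_replicas = int(annotations.get('parallel.ai/min-replicas', '2'))
    target_queue_per_pod = int(annotations.get('parallel.ai/target-queue', '10'))

    # Calculate desired replicas
    if queue_depth > target_queue_per_pod * current_replicas * 1.5:
        desired = min(max_replicas, current_replicas + 2)  # Scale up aggressively
    elif queue_depth < target_queue_per_pod * current_replicas * 0.3:
        desired = max(min_replicas, current_replicas - 1)  # Scale down conservatively
    else:
        desired = current_replicas  # No change

    if desired != current_replicas:
        apps_v1 = client.AppsV1Api()
        patch = {'spec': {'replicas': desired}}
        apps_v1.patch_namespaced_deployment_scale(name, namespace, patch)
        # kopf.info attaches a Kubernetes event to the object, so pass the full body
        kopf.info(body, reason='Autoscale',
                  message=f'Scaled {name}: {current_replicas} → {desired} (queue={queue_depth})')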

Scale up fast, scale down slow. The operator above scales up by 2 pods but scales down by only 1 pod. This asymmetry is deliberate. Traffic spikes require immediate response (scale up +2). Traffic dips might be temporary (scale down -1, wait, re-evaluate). Aggressive scale-down causes oscillation: you remove a pod, queue fills, you add it back, repeat. A cooldown between scale-down decisions (for example, 5 minutes) prevents this; the operator above leaves it out for brevity.
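
The cooldown described above can be sketched as a pure decision function that is easy to unit test separately from the Kubernetes plumbing. This is a minimal sketch: `ScaleState`, `decide_replicas`, and the default thresholds are hypothetical names and values mirroring the operator, not an existing API.

```python
import time
from dataclasses import dataclass


@dataclass
class ScaleState:
    # Monotonic timestamp of the last scale-down; -inf means "never".
    last_scale_down: float = float("-inf")


def decide_replicas(queue_depth, current, state, *, target_per_pod=10,
                    min_replicas=2, max_replicas=16, cooldown_s=300, now=None):
    """Return the desired replica count: +2 on overload, -1 on idle with cooldown."""
    now = time.monotonic() if now is None else now
    capacity = target_per_pod * current

    if queue_depth > capacity * 1.5:
        return min(max_replicas, current + 2)  # scale up immediately

    if queue_depth < capacity * 0.3:
        desired = max(min_replicas, current - 1)
        if desired == current:
            return current  # already at the floor
        if now - state.last_scale_down < cooldown_s:
            return current  # still cooling down from the previous scale-down
        state.last_scale_down = now
        return desired

    return current


state = ScaleState()
print(decide_replicas(200, 4, state, now=0.0))     # overload
print(decide_replicas(5, 4, state, now=1000.0))    # idle, no recent scale-down
print(decide_replicas(5, 3, state, now=1100.0))    # idle, inside the cooldown
```

Passing `now` explicitly keeps the function deterministic in tests; in the operator it would default to the monotonic clock.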

DESIGN: Container Image Delivery for GPU Nodes

GPU inference containers are 3-8GB (CUDA runtime, Python, vLLM, model loading code). Pulling an image this size from a registry when a new node starts takes 2-5 minutes. During an autoscaling event, those minutes are dead time: the GPU node is running ($10/hr) but not serving traffic.

Solutions, ordered by effectiveness:

Approach	Pull time	Complexity
Pre-pull on node boot	0s (already cached)	DaemonSet that pulls images on startup
Dragonfly P2P	30-60s (distributed pull)	P2P image distribution, no registry bottleneck
ECR + VPC endpoint	60-90s (fast private pull)	Avoid internet egress, use S3 transfer speeds
Standard registry pull	2-5 min	Default, slowest

You run terraform plan and see it wants to delete a security group rule you didn't add. What happened and what's the safest response?

A. Terraform has a bug; restart the plan
B. Someone manually added the rule in the console (state drift); import it into Terraform state before applying
C. Apply immediately to enforce the desired state
D. Delete the state file and re-initialize

Chapter 2: GPU Cluster Management

A single NVIDIA H100 GPU costs $25,000-$40,000. A single 8-GPU node costs $250,000+. Parallel's inference fleet might have 200+ H100s across regions. That is $5-8M in GPU hardware. Your job is to keep every one of those GPUs doing useful work — not sitting idle, not thermal throttling, not stuck in a driver crash loop. GPU utilization is the single most important metric in your entire infrastructure.

CONCEPT: GPU Architecture for Inference

Understanding GPU hardware is not optional. When a debug session leads you to "the model is slow," you need to know why at the hardware level.

GPU	HBM (GB)	Memory BW (TB/s)	FP16 TFLOPS	NVLink BW (GB/s)	Cost/hr (cloud)
A100 80GB	80	2.0	312	600	~$3.50
H100 80GB	80	3.35	990	900	~$8-12
H200 141GB	141	4.8	990	900	~$12-15
B200 192GB	192	8.0	2250	1800	~$15-20

LLM inference is memory-bandwidth bound, not compute bound. During autoregressive decoding, each generated token requires reading the entire model weights from HBM once. A 70B parameter model in FP16 is 140GB, and all of it must be read for every single token. At the H100's 3.35 TB/s, reading 140GB takes ~42ms, which caps a single sequence at roughly 24 tokens/second. (The model needs TP=2 to fit at all; each GPU then reads its 70GB half in ~21ms, so the batch-size-1 ceiling is about 48 tokens/second before communication overhead.) This is why memory bandwidth matters more than TFLOPS for inference.

The memory wall. The fundamental bottleneck for LLM inference is not how fast the GPU can compute (TFLOPS), but how fast it can read model weights from memory (TB/s). This is why the B200 with 8 TB/s is a game-changer — it roughly doubles inference throughput over H100 not because it computes faster, but because it reads faster. Every infrastructure decision you make (quantization, tensor parallelism, batch size) is an attempt to work around the memory wall.
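
The bandwidth ceiling is easy to compute for every GPU in the table above. The sketch below assumes batch size 1 and ignores KV-cache reads and inter-GPU communication, so real throughput will be lower; `decode_tokens_per_sec` is an illustrative helper, not a library function.

```python
# Bandwidth-bound decode ceiling: every generated token re-reads all weights.
# Bandwidth figures come from the GPU table above.
GPU_BANDWIDTH_TBPS = {
    "A100 80GB": 2.0,
    "H100 80GB": 3.35,
    "H200 141GB": 4.8,
    "B200 192GB": 8.0,
}


def decode_tokens_per_sec(model_gb: float, bandwidth_tbps: float, tp_degree: int = 1) -> float:
    """Upper bound on tokens/sec for one sequence; weights are sharded over tp_degree GPUs."""
    bytes_per_gpu = model_gb * 1e9 / tp_degree
    seconds_per_token = bytes_per_gpu / (bandwidth_tbps * 1e12)
    return 1.0 / seconds_per_token


for name, bw in GPU_BANDWIDTH_TBPS.items():
    ceiling = decode_tokens_per_sec(140, bw, tp_degree=2)
    print(f"{name}: {ceiling:.0f} tok/s ceiling (70B FP16, TP=2, batch 1)")
```

Because the weights are read once per step regardless of batch size, batching multiplies total tokens per second without raising this per-sequence ceiling.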

DESIGN: GPU Scheduling Architecture

You have 200 GPUs. You have training jobs, inference serving, fine-tuning jobs, and evaluation runs competing for them. How do you schedule?

Option 1: SLURM. The traditional HPC scheduler. Jobs submit resource requests (e.g., "I need 8 GPUs for 4 hours"), SLURM queues them and allocates when resources are free. Good for batch workloads (training, eval). Bad for real-time inference because SLURM's preemption is coarse (jobs are suspended, requeued, or cancelled) and is off by default, so a 4-hour training job holds its GPUs even if inference latency is spiking.

Option 2: Kubernetes with GPU operator. NVIDIA's GPU operator for K8s handles driver installation, device plugin, and GPU resource advertisement. Pods request nvidia.com/gpu: 1 and K8s schedules them. Good for inference (pods auto-scale), but K8s doesn't understand GPU topology (which GPUs share NVLink) without the GPU Operator's topology-aware scheduling.

Option 3: Hybrid. Parallel's architecture: K8s for inference serving (needs instant scaling), SLURM for training and fine-tuning (needs large allocations). A resource broker sits between them, dynamically shifting GPUs from the training pool to inference during peak hours and back during off-peak.
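
The resource broker's core decision can be sketched in a few lines. Everything here is an assumption made for illustration: the function name, the 20% headroom, and the floor of 16 training GPUs are hypothetical, and a real broker would also have to drain and checkpoint training jobs before reclaiming their GPUs.

```python
def split_gpu_pool(total_gpus: int, inference_demand_gpus: int, *,
                   min_training_gpus: int = 16, headroom_pct: int = 20):
    """Return (inference_gpus, training_gpus) for the current demand.

    Inference gets its demand plus headroom, but training always keeps a floor
    so long-running jobs are never starved completely.
    """
    # Ceiling of demand * (1 + headroom), done in integers to avoid float rounding.
    wanted = -(-inference_demand_gpus * (100 + headroom_pct) // 100)
    inference = max(0, min(wanted, total_gpus - min_training_gpus))
    return inference, total_gpus - inference


print(split_gpu_pool(200, 100))  # daytime peak
print(split_gpu_pool(200, 40))   # overnight trough
print(split_gpu_pool(200, 180))  # demand exceeds what training can give up
```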

CODE: GPU Health Monitor

python
import subprocess

import requests

QUERY_FIELDS = (
    "index,uuid,temperature.gpu,utilization.gpu,memory.used,memory.total,"
    "ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,"
    "power.draw"
)


def _parse(raw: str, cast=int, default=0):
    """nvidia-smi prints "[N/A]" for fields a GPU does not support (e.g. ECC)."""
    try:
        return cast(raw)
    except ValueError:
        return default


def check_gpu_health() -> list[dict]:
    """Query nvidia-smi for all GPUs, return health status."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
        check=True,  # fail loudly if the driver is down instead of parsing empty output
    )
    gpus = []
    for line in result.stdout.strip().split("\n"):
        parts = [p.strip() for p in line.split(",")]
        gpu = {
            "index": int(parts[0]),
            "uuid": parts[1],
            "temp_c": _parse(parts[2]),
            "util_pct": _parse(parts[3]),
            "mem_used_mb": _parse(parts[4]),
            "mem_total_mb": _parse(parts[5]),
            "ecc_corrected": _parse(parts[6]),
            "ecc_uncorrected": _parse(parts[7]),
            "power_w": _parse(parts[8], float, default=None),
        }
        gpu["healthy"] = (
            gpu["temp_c"] < 83              # Throttle at 83C
            and gpu["ecc_uncorrected"] == 0  # Uncorrectable = bad memory
            # GPU drawing power = alive; skip the check if power is not reported
            and (gpu["power_w"] is None or gpu["power_w"] > 50)
        )
        gpus.append(gpu)
    return gpus


def alert_unhealthy(gpus: list[dict], webhook: str):
    """Send PagerDuty alert for unhealthy GPUs."""
    bad = [g for g in gpus if not g["healthy"]]
    if bad:
        resp = requests.post(webhook, json={
            "routing_key": "GPU_HEALTH",  # placeholder: use your integration key
            "event_action": "trigger",
            "payload": {
                "summary": f"{len(bad)} unhealthy GPU(s): {[g['index'] for g in bad]}",
                "severity": "critical",
                "source": "gpu-health-monitor",
            }
        }, timeout=10)
        resp.raise_for_status()  # a silently dropped page is worse than a crash

DEBUG: Common GPU Failures

ECC memory errors. GPUs have Error-Correcting Code memory. Occasional correctable errors are expected and are fixed transparently; a rising rate is an early warning. Uncorrectable errors mean bad memory cells: the GPU must be drained and, if the errors persist after a reset, replaced. Check: nvidia-smi -q -d ECC.

Xid errors. NVIDIA kernel driver errors appear in dmesg as "Xid" messages. Xid 79 = "GPU has fallen off the bus" (PCIe link failed). Xid 48 = double-bit ECC error. Xid 31 = GPU memory page fault (usually an application bug such as an illegal address). Xid 63 and 64 = ECC page retirement or row remapping events. Each has a different severity and recovery path. A staff engineer maintains a runbook mapping Xid codes to actions.

NVLink failures. Multi-GPU nodes use NVLink for fast GPU-to-GPU communication. If an NVLink fails, tensor parallelism falls back to PCIe, which is several times slower (PCIe Gen5 x16 tops out at about 128 GB/s versus NVLink's 900 GB/s). Detection: nvidia-smi nvlink -s shows link status. Symptom: one node's throughput is far below that of its peers.
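
A runbook like this can start as a small lookup table keyed by Xid code. The sketch below assumes the usual kernel log shape `NVRM: Xid (PCI:<address>): <code>, ...`; the runbook actions are hypothetical examples rather than NVIDIA guidance, and only two codes are filled in.

```python
import re

# Hypothetical runbook entries: (severity, action). Extend per your fleet's policy.
XID_RUNBOOK = {
    48: ("critical", "double-bit ECC error: drain node, reset GPU, RMA if it recurs"),
    79: ("critical", "GPU fell off the bus: drain node, reboot, inspect PCIe and power"),
}

XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)")


def triage(dmesg_text: str) -> list[dict]:
    """Extract Xid events from kernel log text and attach the runbook action."""
    findings = []
    for line in dmesg_text.splitlines():
        match = XID_PATTERN.search(line)
        if not match:
            continue
        xid = int(match.group("xid"))
        severity, action = XID_RUNBOOK.get(
            xid, ("unknown", "not in runbook: look up the code in NVIDIA's Xid documentation")
        )
        findings.append({"pci": match.group("pci"), "xid": xid,
                         "severity": severity, "action": action})
    return findings


sample = (
    "[ 1234.567] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.\n"
    "[ 1300.001] usb 1-1: new high-speed USB device\n"
)
for finding in triage(sample):
    print(finding)
```

In practice the input would come from `dmesg` or the journal on each node; unknown codes are surfaced rather than dropped so the runbook grows with experience.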

FRONTIER: MIG and GPU Sharing (2024-2025)

Multi-Instance GPU (MIG) partitions a single H100 into up to 7 isolated instances, each with its own memory, compute, and L2 cache. This lets you run up to 7 small models on one GPU instead of dedicating an entire 80GB GPU to a model that uses 14GB. Mind the slice size: the 7-way profile gives each instance about 10GB, so a 7B model in FP16 (14GB) needs 8-bit or 4-bit quantization to fit, or a larger profile with fewer instances. MIG is critical for cost optimization when serving a mix of model sizes.

The frontier in 2025 is finer-grained GPU sharing: time-slicing, NVIDIA's MPS (Multi-Process Service), and the emerging fractional GPU schedulers (Run:AI, Nebius). These let multiple pods share a GPU with dynamic fractional allocation, going beyond MIG's fixed partitions, but they isolate memory and faults less strictly than MIG's hardware partitioning.

DESIGN: GPU Fleet Management at Scale

Managing 200+ GPUs is not just scheduling. It is a lifecycle management problem:

1. Procurement

Lead times for H100/B200 are 3-12 months. You must forecast capacity 6+ months ahead based on growth projections and commit to reserved instances or hardware orders.

↓

2. Provisioning

New nodes need: OS install, NVIDIA driver (must match CUDA version), CUDA toolkit, container runtime, K8s node agent, monitoring agent, GPU health daemon. Automate with Ansible/Packer or pre-built AMIs.

↓

3. Validation

Before admitting to the cluster: run GPU burn-in test (24hr stress test), verify NVLink bandwidth (must be within 5% of spec), check ECC error count (must be zero), run inference benchmark (must match baseline within 10%).

↓

4. Operation

Continuous monitoring: temperature, utilization, ECC errors, power draw, NVLink status. Automatic drain on anomaly detection. Weekly firmware/driver update window (rolling, one node at a time).

↓

5. Retirement

After 3-5 years or repeated hardware failures: drain workloads, RMA defective GPUs, decommission node, update capacity planning.
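
The validation step (step 3) translates directly into an admission gate. This is a minimal sketch with hypothetical names; it treats "within 5%" and "within 10%" as lower bounds, on the assumption that a node beating the spec or baseline is not a failure.

```python
def validate_node(burn_in_passed: bool, nvlink_gbps: float, ecc_errors: int,
                  bench_tokens_per_s: float, *, nvlink_spec_gbps: float,
                  baseline_tokens_per_s: float) -> list[str]:
    """Return the list of failed checks; an empty list means the node may join."""
    failures = []
    if not burn_in_passed:
        failures.append("burn-in")
    if nvlink_gbps < 0.95 * nvlink_spec_gbps:            # within 5% of spec
        failures.append("nvlink")
    if ecc_errors != 0:                                  # must be zero
        failures.append("ecc")
    if bench_tokens_per_s < 0.90 * baseline_tokens_per_s:  # within 10% of baseline
        failures.append("benchmark")
    return failures


good = validate_node(True, 880, 0, 2300, nvlink_spec_gbps=900, baseline_tokens_per_s=2400)
bad = validate_node(True, 800, 1, 2000, nvlink_spec_gbps=900, baseline_tokens_per_s=2400)
print("admit" if not good else f"reject: {good}")
print("admit" if not bad else f"reject: {bad}")
```

Returning every failed check, rather than stopping at the first, gives the hardware team a complete picture in one pass.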

DEBUG: CUDA Version Mismatches

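The most frustrating GPU debug scenario: your inference workload crashes at startup with an error such as CUDA driver version is insufficient for CUDA runtime version, or the provided PTX was compiled with an unsupported toolchain. The model was compiled with CUDA 12.4 but the node has driver 535, which only supports CUDA 12.2. (The similar-looking no kernel image is available for execution on the device points to a different problem: the binary was not built for this GPU's compute capability.) The fix seems simple (update the driver) but the driver update requires a node reboot, which means draining all workloads first.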

Prevention: pin CUDA version in your container images and in your Terraform node provisioning. Use a compatibility matrix:

NVIDIA Driver	Max CUDA	Supported GPUs
535.x	12.2	A100, H100
545.x	12.3	A100, H100
550.x	12.4	A100, H100, H200
560.x	12.6	A100, H100, H200, B200

Driver updates are infrastructure events, not routine maintenance. A driver update on a GPU node touches the kernel module, may change CUDA behavior, and requires a reboot. Treat it like a production deployment: canary one node first, run the full benchmark suite, wait 24 hours, then roll to the rest. Never update all GPU nodes simultaneously.
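
The compatibility matrix can be enforced as a pre-deploy check in CI. The sketch below hard-codes the table above and applies a deliberately strict policy (the image's CUDA version must not exceed the driver's maximum); it ignores CUDA's minor-version compatibility guarantees, and the function name is hypothetical.

```python
# Driver branch -> highest CUDA version it supports, taken from the table above.
DRIVER_MAX_CUDA = {535: (12, 2), 545: (12, 3), 550: (12, 4), 560: (12, 6)}


def image_is_compatible(driver_version: str, image_cuda: str) -> bool:
    """True if a container built for image_cuda may be scheduled on this driver."""
    branch = int(driver_version.split(".")[0])
    if branch not in DRIVER_MAX_CUDA:
        raise KeyError(f"driver branch {branch} is not in the compatibility matrix")
    wanted = tuple(int(part) for part in image_cuda.split(".")[:2])
    return wanted <= DRIVER_MAX_CUDA[branch]


print(image_is_compatible("535.129.03", "12.4"))  # the failure scenario above
print(image_is_compatible("550.54.15", "12.4"))
```

Raising on unknown driver branches is intentional: an unlisted driver should block the rollout until someone extends the matrix.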

DESIGN: Tensor Parallelism vs. Pipeline Parallelism

When a model is too large for one GPU, you split it across multiple GPUs. There are two fundamentally different strategies:

Tensor Parallelism (TP): Split each layer's weight matrices across GPUs. For a matrix multiply Y = X · W, you split W column-wise across N GPUs, so each GPU computes a slice of Y's columns. In the Megatron-LM layout, the next matrix is split row-wise, each GPU produces a partial sum, and a single all-reduce combines them. Requires fast interconnect (NVLink) because GPUs communicate in every layer. Best for GPUs within a single node (8 GPUs with NVLink). Typical: TP=2 for 70B, TP=8 for 405B (in FP8).

Pipeline Parallelism (PP): Assign different layers to different GPUs. GPU-0 runs layers 0-19, GPU-1 runs layers 20-39, etc. GPUs only communicate between stages (one activation tensor per stage). Tolerates slower interconnect (PCIe, even network). Best for spanning multiple nodes. Downside: pipeline bubbles — when GPU-0 is computing, GPU-1 is idle waiting for GPU-0's output.

Strategy	Communication	When to use	Scaling limit
Tensor Parallel	All-reduce every layer	Within a node (NVLink)	8 GPUs (one node)
Pipeline Parallel	Once per stage boundary	Across nodes (network)	10-20 stages before bubbles dominate
TP + PP (hybrid)	Both	Very large models (405B+)	TP=8 within node, PP across nodes
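
To make the tensor-parallel communication pattern concrete, the sketch below simulates N "GPUs" with plain Python lists. A column split yields disjoint output slices that are concatenated (an all-gather), while a row split yields partial sums that are added (an all-reduce); Megatron-style layers pair the two so that only the all-reduce is needed per block. All function names are illustrative.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, n):
    """Split matrix m into n shards along its columns."""
    step = len(m[0]) // n
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(n)]


def split_rows(m, n):
    """Split matrix m into n shards along its rows."""
    step = len(m) // n
    return [m[i * step:(i + 1) * step] for i in range(n)]


def column_parallel(x, w, n):
    # Each GPU holds a column shard of W and computes a slice of Y's columns.
    partials = [matmul(x, shard) for shard in split_columns(w, n)]
    # All-gather: concatenate the column slices back into full rows.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def row_parallel(x, w, n):
    # Each GPU holds a row shard of W and the matching column shard of X.
    partials = [matmul(xs, ws) for xs, ws in zip(split_columns(x, n), split_rows(w, n))]
    # All-reduce: elementwise sum of the partial results.
    return [[sum(p[r][c] for p in partials) for c in range(len(w[0]))]
            for r in range(len(x))]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 1, 1, 0]]
print(column_parallel(x, w, 2) == matmul(x, w))
print(row_parallel(x, w, 2) == matmul(x, w))
```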

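The interview answer. "How do you serve a 405B model?" TP=8 within each node (fully uses NVLink bandwidth), PP=4 across 4 nodes (32 GPUs total). With FP8 quantization, the model is 405GB, so each node holds ~101GB across 8 GPUs (~13GB of weights per GPU), leaving most of the HBM for KV cache. Note that the FP8 weights alone would fit on a single 8-GPU node (640GB of HBM); the extra pipeline stages buy KV cache capacity and throughput, not feasibility. The NVLink within each node handles the all-reduce for tensor parallelism. The network between nodes only carries one activation tensor per pipeline stage, which is small enough for 100 Gbps Ethernet.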
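
The sizing in this answer is worth checking numerically. The sketch below computes weight memory per GPU for a given TP × PP layout, assuming 80GB H100s and an even split of weights; `parallel_plan` is a hypothetical helper and it ignores activations and fragmentation.

```python
def parallel_plan(params_billions: float, bytes_per_param: float,
                  tp: int, pp: int, hbm_gb: float = 80.0) -> dict:
    """Weight memory per GPU for a TP x PP layout, assuming an even split."""
    per_gpu = params_billions * bytes_per_param / (tp * pp)
    return {
        "gpus": tp * pp,
        "weights_per_gpu_gb": per_gpu,
        "kv_headroom_gb": hbm_gb - per_gpu,  # what is left for KV cache
        "fits": per_gpu < hbm_gb,
    }


layouts = {
    "FP8, TP=8 x PP=4": parallel_plan(405, 1, tp=8, pp=4),
    "FP8, TP=8 x PP=1": parallel_plan(405, 1, tp=8, pp=1),
    "FP16, TP=8 x PP=1": parallel_plan(405, 2, tp=8, pp=1),
}
for label, plan in layouts.items():
    print(f"{label}: {plan['gpus']} GPUs, "
          f"{plan['weights_per_gpu_gb']:.1f} GB weights/GPU, fits={plan['fits']}")
```

Comparing the layouts side by side shows the trade: more pipeline stages mean more GPUs but also more HBM left over for KV cache on each one.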

CODE: NVIDIA SMI Monitoring Script

bash
#!/bin/bash
# gpu-watch.sh: one GPU health pass with alerts
# Run as: watch -n 5 ./gpu-watch.sh

nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,\
memory.used,memory.total,power.draw,power.limit,\
ecc.errors.corrected.volatile.total,\
ecc.errors.uncorrected.volatile.total \
--format=csv,noheader,nounits | while IFS="," read -r idx name temp util \
  mem_used mem_total power power_limit ecc_corr ecc_uncorr; do
    # Strip the padding nvidia-smi puts after each comma
    idx=${idx// /}
    temp=${temp// /}
    util=${util// /}
    ecc_uncorr=${ecc_uncorr// /}

    # Alert conditions. Fields that are not plain integers (for example
    # "[N/A]" when ECC is unsupported) are skipped rather than compared.
    if [[ "$ecc_uncorr" =~ ^[0-9]+$ ]] && [ "$ecc_uncorr" -gt 0 ]; then
        echo "CRITICAL: GPU $idx has uncorrectable ECC errors!"
    fi
    if [[ "$temp" =~ ^[0-9]+$ ]] && [ "$temp" -gt 82 ]; then
        echo "WARNING: GPU $idx at ${temp}C (throttle at 83C)"
    fi
    if [[ "$util" =~ ^[0-9]+$ ]] && [ "$util" -lt 5 ]; then
        echo "WARNING: GPU $idx at ${util}% utilization (idle?)"
    fi
done

A 70B FP16 model needs 140GB of HBM. An H100 has 80GB. What is the minimum number of H100s needed, and what parallelism strategy do you use?

A. 1 GPU with model compression
B. 2 GPUs with tensor parallelism (each layer's weights are split across the GPUs, so each holds ~70GB)
C. 4 GPUs with data parallelism
D. 8 GPUs with pipeline parallelism

Chapter 3: LLM Serving Infrastructure

You have GPUs. You have a model. Now you need to serve it to 10,000 concurrent users with p99 latency under 800ms for the first token, and 50ms per subsequent token. This is the serving layer — the most technically demanding piece of the entire infrastructure stack because it requires understanding both systems engineering and machine learning internals simultaneously.

CONCEPT: The Two Phases of LLM Inference

Every LLM request has two phases with fundamentally different performance characteristics:

Prefill (prompt processing). The model processes all input tokens in parallel. This is compute-bound — the GPU's TFLOPS matter. A 1000-token prompt on an H100 with a 70B model takes ~200ms. The GPU is fully utilized because it processes all tokens simultaneously via matrix multiplication.

Decode (token generation). The model generates tokens one at a time, autoregressively. Each token requires reading the entire model weights from HBM. This is memory-bandwidth bound — the GPU's TB/s matters. The GPU is massively underutilized during decode because the arithmetic intensity (FLOPS per byte loaded) is tiny.

The fundamental inefficiency. During decode, each token generation uses a fraction of the GPU's compute capacity but demands a full read of all model weights. A single request uses maybe 1-5% of the GPU's compute during decode. This is why batching is so important — by processing multiple requests simultaneously, you amortize the cost of reading model weights across many requests, recovering that wasted 95% compute.

CONCEPT: Continuous Batching and PagedAttention

Static batching (naive) waits until B requests arrive, processes them together, and returns all results. Problem: if one request generates 500 tokens and another generates 10, the short request waits for the long one. GPU is idle while waiting for the batch to fill.

Continuous batching (vLLM, TensorRT-LLM) removes completed requests and inserts new ones at every decode step. No request waits for others to finish. No idle time waiting for a batch to fill. Throughput increases 2-10x over static batching.
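
A toy simulation shows where the gain comes from. It counts decode steps only and assumes every step costs the same regardless of how many slots are occupied (a fair approximation while decode is bandwidth-bound); the workload and function names are made up for illustration and this is not how vLLM's scheduler is implemented.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Finished requests free their slot at every step; waiting requests fill it."""
    queue = deque(output_lens)
    running = []  # remaining tokens per in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        running = [r - 1 for r in running if r > 1]
    return steps


# One long request followed by three short ones, repeated four times.
workload = [100, 10, 10, 10] * 4
static = static_batching_steps(workload, batch_size=4)
continuous = continuous_batching_steps(workload, batch_size=4)
print(f"static: {static} steps, continuous: {continuous} steps, "
      f"speedup: {static / continuous:.1f}x")
```

In the static case every batch is held hostage by its one long request; in the continuous case the short requests stream through the slots the long ones are not using.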

PagedAttention (vLLM's key innovation) manages the KV cache — the cached key/value tensors from previous tokens. In naive serving, each request pre-allocates maximum sequence length worth of contiguous GPU memory. A 2048-token request on a 70B model needs ~2GB of KV cache. With 100 concurrent requests, that is 200GB just for KV cache — more than the GPU's total memory.

PagedAttention borrows the idea of virtual memory pages from OS design. KV cache is stored in fixed-size blocks (like 4KB pages). Blocks are allocated on demand, not pre-allocated. Blocks can be non-contiguous in physical GPU memory but addressed contiguously via a page table. Memory waste drops from 60-80% to under 4%.
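The block-table idea can be sketched in a few lines. This is a minimal illustration of on-demand block allocation, not vLLM's implementation; the class and method names are hypothetical, and real systems add reference counting, copy-on-write, and eviction.

```python
class PagedKVAllocator:
    """Toy allocator: fixed-size KV blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # physical block ids
        self.block_tables: dict[str, list[int]] = {}   # per-sequence page table
        self.lengths: dict[str, int] = {}              # tokens stored per sequence

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:     # last block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache full: preempt or queue the request")
            table.append(self.free_blocks.pop())       # any free block will do
        self.lengths[seq_id] = length + 1

    def physical_slot(self, seq_id: str, position: int) -> tuple[int, int]:
        """Translate a logical token position to (physical block, offset)."""
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id: str) -> None:
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("req-1")
print(alloc.block_tables["req-1"], alloc.physical_slot("req-1", 17))
```

A 20-token sequence holds two blocks rather than a reservation for the full context length, and the only waste is the unfilled tail of the last block.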

CODE: vLLM Serving Configuration

python
# serve.py: production-style vLLM configuration for a 70B model on 2x H100
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,         # Split across 2 H100s via NVLink
    dtype="bfloat16",               # Matches the checkpoint dtype; natively supported on H100
    max_model_len=8192,             # Max context window per request
    max_num_seqs=256,               # Upper bound on concurrent sequences per step
    gpu_memory_utilization=0.92,    # Fraction of each GPU vLLM may use (weights + KV cache); 8% headroom
    enable_chunked_prefill=True,    # Interleave prefill and decode
    enable_prefix_caching=True,     # Cache shared system prompts
    block_size=16,                  # PagedAttention block size (tokens per block)
    swap_space=4,                   # GiB of CPU swap per GPU for preempted KV blocks
    enforce_eager=False,            # Use CUDA graphs for decode
)

# Key tuning knobs:
# - max_num_seqs: higher = better throughput, worse latency
# - gpu_memory_utilization: higher = more KV cache capacity, risk of OOM
# - enable_chunked_prefill: splits long prefills to avoid blocking decode
# - enable_prefix_caching: huge win if many requests share system prompt
# Note: BF16 weights for a 70B model are ~140GB, so 2x 80GB leaves only a few GB
# for KV cache. max_num_seqs is a ceiling, not a guarantee; use more GPUs or
# FP8 weights if you need large batches.

DESIGN: Speculative Decoding

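Speculative decoding is the most exciting serving optimization of 2024-2025. The idea: use a small, fast "draft" model (e.g., 1B parameters) to speculatively generate K tokens, then use the large "target" model to verify all K tokens in a single forward pass. The target keeps the longest accepted prefix and contributes one token of its own, so each target pass yields between 1 and K+1 tokens. When the draft model guesses well (acceptance rates of 60-80% are typical for predictable text), you get several tokens for roughly the cost of one target-model forward pass plus K cheap draft passes.

The math: if each drafted token is accepted with probability α and you speculate K tokens, the expected number of tokens per target pass is (1 − α^(K+1)) / (1 − α), and each step costs 1 + K·c target-pass equivalents, where c = draft_cost/target_cost. Expected speedup is therefore (1 − α^(K+1)) / ((1 − α)(1 + K·c)). For K=5, α=0.7, and c = 0.05, that is 2.94 / 1.25 ≈ 2.4x.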
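One common way to estimate the speedup is the expected-tokens formula from the speculative decoding literature, which assumes each drafted token is accepted independently with probability α. The sketch below implements that estimate; the function names are illustrative, and real acceptance rates vary by token position and workload.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass (accepted prefix + 1)."""
    if alpha >= 1.0:
        return float(k + 1)              # every draft token accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speculative_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Speedup over plain decoding; cost_ratio = draft_cost / target_cost."""
    step_cost = 1 + k * cost_ratio       # one target pass plus k draft passes
    return expected_tokens_per_step(alpha, k) / step_cost


for k in (2, 5, 8):
    print(k, round(speculative_speedup(alpha=0.7, k=k, cost_ratio=0.05), 2))
```

Sweeping K like this shows the diminishing return: past a certain depth, extra draft tokens are rarely accepted but still add draft cost.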

DEBUG: KV Cache Thrashing

Symptom: throughput drops 80% while GPU utilization looks normal and GPU memory utilization sits at 100%. The culprit: KV cache thrashing. When the KV cache is full, the scheduler must queue new requests and preempt running ones, either swapping their KV blocks to CPU memory or discarding them and recomputing later. Swapping crosses PCIe, which has tens of times less bandwidth than HBM, and recomputation repeats prefill work. The system appears to be working but spends most of its time moving or rebuilding KV cache.

Diagnosis: check vllm:num_preemptions_total metric. If preemptions are increasing, the KV cache is full. Fix: reduce max_num_seqs (fewer concurrent requests = less KV cache needed), or increase gpu_memory_utilization, or add more GPUs.

FRONTIER: SGLang, Disaggregated Serving (2025)

SGLang introduces RadixAttention — a radix tree structure for KV cache that enables automatic prefix sharing across requests. If 1000 requests share the same 2000-token system prompt, the KV cache for that prefix is stored once, not 1000 times. This is a 10-50x memory savings for production deployments with system prompts.

Disaggregated serving separates prefill and decode onto different hardware. Prefill is compute-bound → run on compute-dense GPUs. Decode is memory-bandwidth-bound → run on memory-optimized hardware. Companies like Anyscale and Fireworks are shipping this in 2025. The benefit: each phase runs on optimal hardware instead of compromising.

CODE: KV Cache Memory Budget Calculator

python
def kv_cache_budget(
    model_params_b: float,    # Billions of parameters
    num_layers: int,           # Number of transformer layers
    hidden_dim: int,           # Hidden dimension (d_model)
    num_kv_heads: int,         # Number of KV attention heads (GQA)
    head_dim: int,             # Dimension per head
    dtype_bytes: int = 2,      # 2 for FP16/BF16, 1 for FP8
    max_seq_len: int = 4096,
    max_batch: int = 256,
) -> dict:
    """Calculate KV cache memory requirements."""
    # Per-token KV cache: 2 (K+V) * layers * kv_heads * head_dim * dtype
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    per_token_mb = per_token_bytes / (1024 ** 2)

    # Per-request: per_token * max_seq_len
    per_request_mb = per_token_mb * max_seq_len

    # Max batch KV cache
    total_kv_gb = per_request_mb * max_batch / 1024

    # Model weights
    model_gb = model_params_b * 1e9 * dtype_bytes / (1024 ** 3)

    return {
        "per_token_bytes": per_token_bytes,
        "per_request_mb": round(per_request_mb, 2),
        "max_batch_kv_gb": round(total_kv_gb, 2),
        "model_weights_gb": round(model_gb, 2),
        "total_gpu_gb": round(model_gb + total_kv_gb, 2),
    }

# Llama 3.1 70B with GQA (8 KV heads instead of 64)
budget = kv_cache_budget(
    model_params_b=70, num_layers=80, hidden_dim=8192,
    num_kv_heads=8, head_dim=128, dtype_bytes=2,
    max_seq_len=4096, max_batch=128
)
print(budget)
# {'per_token_bytes': 327680, 'per_request_mb': 1280.0,
#  'max_batch_kv_gb': 160.0, 'model_weights_gb': 130.39,
#  'total_gpu_gb': 290.39}
# → Need at least 4x H100 (80GB each = 320GB) for 128 concurrent requests!

GQA saves memory. Grouped Query Attention (GQA) uses fewer KV heads than query heads. Llama 3.1 70B has 64 query heads but only 8 KV heads — that is 8x less KV cache than standard multi-head attention. Without GQA, the KV cache budget above would be 1,280 GB for 128 concurrent requests, requiring 16 H100s just for KV cache. GQA makes large-batch serving economically viable.

DESIGN: Prefix Caching Architecture

In production, 90% of requests to Parallel share the same system prompt (2000-4000 tokens). Without prefix caching, each request recomputes the KV cache for the system prompt from scratch. That is 2000 tokens of prefill computation × every request = massive waste.

Prefix caching (vLLM's enable_prefix_caching, SGLang's RadixAttention) stores the KV cache for common prefixes in GPU memory. When a new request arrives with the same system prompt, the server looks up the cached KV blocks and skips prefill entirely for those tokens. Only the unique user query portion needs prefill.

The impact is dramatic:

Scenario	Without prefix cache	With prefix cache	Improvement
System prompt: 2000 tok, User query: 200 tok	Prefill 2200 tokens (~150ms)	Prefill 200 tokens (~15ms)	10x TTFT reduction
1000 concurrent requests, same system prompt	KV cache: 1000 × 2200 tok = 2.2M token-slots	Shared prefix: 2000 tok + 1000 × 200 unique = 202K token-slots	~11x memory savings

Prefix caching is often the highest-ROI optimization for production LLM serving. If your system prompt is 2000 tokens and your average user query is 200 tokens, prefix caching eliminates 91% of prefill computation (2000 of every 2200 tokens) and, at 1000 concurrent requests, about 91% of KV cache memory, because the shared prefix is stored once. In an interview, if asked "how would you reduce TTFT by 10x?", prefix caching is the first answer.
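The savings follow directly from token counts. A small sketch of the arithmetic, assuming every request shares exactly the same prefix and that prefill cost is proportional to token count (the function name is illustrative):

```python
def prefix_cache_savings(prefix_tokens: int, unique_tokens: int,
                         concurrent_requests: int) -> dict:
    """Compare prefill work and KV cache slots with and without a shared prefix."""
    per_request = prefix_tokens + unique_tokens
    kv_without = concurrent_requests * per_request
    kv_with = prefix_tokens + concurrent_requests * unique_tokens  # prefix stored once
    return {
        "prefill_fraction_saved": prefix_tokens / per_request,
        "kv_slots_without": kv_without,
        "kv_slots_with": kv_with,
        "kv_reduction_factor": kv_without / kv_with,
    }


print(prefix_cache_savings(prefix_tokens=2000, unique_tokens=200,
                           concurrent_requests=1000))
```

The same function makes it easy to see how quickly the benefit shrinks when user queries grow relative to the system prompt.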

DEBUG: Prefix Cache Misses

You enabled prefix caching but TTFT did not improve. The cache hit rate is 3%. Investigation: prefix caching requires exact token-level match of the prefix. If your system prompt includes a timestamp ("Current time: 2025-05-22T10:30:00Z"), every request has a unique system prompt and no prefix is shared. Fix: move dynamic content (timestamps, user IDs) to the end of the prompt, after the static system instructions. Static portion gets cached; dynamic portion gets prefilled.

Another subtle miss: your system prompt is 2048 tokens and the cache block size is 16 tokens, so the cache stores 128 blocks for the prefix. Each block's hash covers its own tokens and everything before it. If your system prompt changes by even one token between API versions (you added a space somewhere), the block containing the edit and every block after it stop matching; an edit near the top of the prompt invalidates all 128. Fix: version your system prompts explicitly and do not make micro-edits to production prompts without understanding the cache impact.
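Why a small edit can wipe out so much of the cache is easier to see with chained block hashes. The sketch below is an illustration of the idea, assuming each block's hash is derived from its tokens plus the previous block's hash; it is not the hashing scheme of any particular server, and the function names are hypothetical.

```python
import hashlib


def block_hashes(token_ids: list[int], block_size: int = 16) -> list[str]:
    """Hash each full block together with the hash of everything before it."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % block_size   # ignore a partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


def reusable_blocks(cached: list[str], incoming: list[str]) -> int:
    """Count leading blocks whose hashes match, i.e. cache hits."""
    count = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        count += 1
    return count


prompt = list(range(2048))                 # stand-in token ids for a system prompt
cached = block_hashes(prompt)

late_edit = prompt.copy()
late_edit[1000] = -1                       # one token changed mid-prompt
early_edit = prompt.copy()
early_edit[0] = -1                         # e.g. a timestamp at the very top

print(len(cached),
      reusable_blocks(cached, block_hashes(late_edit)),
      reusable_blocks(cached, block_hashes(early_edit)))
```

The edit at position 1000 keeps only the blocks before it, and the edit at position 0 keeps none, which is why dynamic content belongs at the end of the prompt.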

[Interactive demo: Continuous Batching Simulator. Requests enter the batch, are processed, and complete; the batch adds and removes requests at every decode step, for comparison with static batching throughput.]

Why is LLM decode memory-bandwidth bound, not compute bound?

Because the GPU doesn't have enough cores for decoding
Because each token generation requires reading all model weights from HBM, but only does a tiny amount of computation per byte read
Because the KV cache is stored on CPU
Because attention is quadratic in sequence length

Chapter 4: Distributed Systems Fundamentals

Every system at Parallel's scale is distributed. Your databases are sharded across machines. Your inference cluster spans multiple nodes. Your configuration changes must propagate consistently. You cannot build reliable infrastructure without understanding the fundamental constraints and tradeoffs of distributed systems.

CONCEPT: The CAP Theorem

The CAP theorem (Brewer, 2000) states that a distributed data store can provide at most two of three guarantees simultaneously:

Consistency (C): Every read receives the most recent write. If you write "rate limit = 100 req/s" and then read it, you always get 100, never the old value.

Availability (A): Every request receives a response (not an error), even if some nodes are down.

Partition tolerance (P): The system continues to operate even if network communication between nodes is lost.

In practice, network partitions will happen (a switch fails, a cable is pulled, a cloud AZ goes dark). So P is not optional — you must tolerate partitions. The real choice is between C and A during a partition:

CP systems (e.g., ZooKeeper, etcd, CockroachDB) refuse to serve reads/writes on the side of a partition that lacks a quorum rather than return stale data. Your Kubernetes control plane uses etcd (CP) because incorrect cluster state is worse than temporary unavailability.

AP systems (e.g., Cassandra, or DynamoDB with eventually consistent reads) continue serving during a partition but may return stale data. Your session store might be AP because showing slightly stale session data is better than showing an error page.

The real-world translation. CAP is not about choosing a database. It is about choosing, for each piece of data in your system, what happens when the network breaks. Rate limit counters? AP is fine — briefly allowing 105 req/s instead of 100 is acceptable. Billing records? CP is mandatory — double-charging a customer is unacceptable. A staff engineer maps every data store to its CAP requirement.

CONCEPT: Consistent Hashing

You have 10 cache servers. A request for key "user:12345" must always go to the same server (otherwise the cache is useless). Naive approach: server = hash(key) % 10. Problem: when you add an 11th server, hash(key) % 11 maps almost every key to a different server. Your entire cache is invalidated. With a 10M-key cache, you just caused a cache stampede that hammers your database.
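The scale of the problem with modulo placement is easy to measure. A short sketch, using MD5 only as a stable hash for the demonstration and synthetic keys (both are illustrative choices):

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def fraction_remapped(keys: list[str], old_servers: int, new_servers: int) -> float:
    """Share of keys whose server changes under hash(key) % N placement."""
    moved = sum(
        1 for k in keys
        if stable_hash(k) % old_servers != stable_hash(k) % new_servers
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
print(f"{fraction_remapped(keys, 10, 11):.1%} of keys move when going from 10 to 11 servers")
```

A key stays put only when its hash gives the same remainder for both 10 and 11, which happens for about 1 key in 11.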

Consistent hashing fixes this. Imagine a circular ring of hash values (0 to 2³² − 1). Each server is placed at a point on the ring. A key is hashed to a point on the ring and assigned to the first server found walking clockwise from that point. When you add a server, only the keys between the new server and its predecessor move. Adding an 11th server moves only ~1/11 of keys instead of ~10/11.

Virtual nodes improve balance: each physical server gets 100-200 virtual positions on the ring. This prevents hot spots when servers are unevenly distributed on the ring.

CODE: Consistent Hash Ring

python
import hashlib, bisect

class ConsistentHashRing:
    def __init__(self, nodes: list[str], vnodes: int = 150):
        self.ring: list[tuple[int, str]] = []
        self.vnodes = vnodes
        for node in nodes:
            self.add_node(node)
        self.ring.sort()

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        for i in range(self.vnodes):
            h = self._hash(f"{node}:{i}")
            bisect.insort(self.ring, (h, node))

    def remove_node(self, node: str):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key: str) -> str:
        if not self.ring:
            raise ValueError("Empty ring")
        h = self._hash(key)
        idx = bisect.bisect_right(self.ring, (h,))
        if idx == len(self.ring):
            idx = 0  # Wrap around the ring
        return self.ring[idx][1]

# Usage: route KV cache lookups to cache servers
ring = ConsistentHashRing(["cache-01", "cache-02", "cache-03"])
server = ring.get_node("user:12345:session")  # Deterministic routing
ring.add_node("cache-04")  # Only ~25% of keys re-map
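The "~25% of keys re-map" claim can be checked empirically. The sketch below is self-contained and uses a compact functional version of the same ring idea (sorted virtual nodes plus binary search) rather than the class above; the helper names and synthetic keys are illustrative.

```python
import bisect
import hashlib


def _h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def build_ring(nodes: list[str], vnodes: int = 150) -> list[tuple[int, str]]:
    """Sorted list of (hash, node) pairs, one per virtual node."""
    return sorted((_h(f"{n}:{i}"), n) for n in nodes for i in range(vnodes))


def lookup(ring: list[tuple[int, str]], key: str) -> str:
    """First virtual node clockwise from the key's hash, wrapping at the end."""
    idx = bisect.bisect_right(ring, (_h(key),))
    return ring[idx % len(ring)][1]


keys = [f"user:{i}" for i in range(10_000)]
before = build_ring(["cache-01", "cache-02", "cache-03"])
after = build_ring(["cache-01", "cache-02", "cache-03", "cache-04"])

moved = [k for k in keys if lookup(before, k) != lookup(after, k)]
print(f"{len(moved) / len(keys):.1%} of keys moved")
print("all moved keys landed on the new node:",
      all(lookup(after, k) == "cache-04" for k in moved))
```

Two properties matter here: the moved fraction is close to 1/4, and every moved key goes to the new node, so the existing servers keep the rest of their cache warm.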

DEBUG: Split Brain

A split brain occurs when a network partition divides your cluster into two halves, and both halves think they are the leader. With a CP database, the side without a quorum rejects writes (correct behavior). With a poorly configured AP system, both halves accept writes, creating conflicting data that must be merged when the partition heals.

The most common real-world split brain: your Redis Sentinel deployment has 3 nodes, the network splits 2-1, the majority side promotes a new master, but the old master on the minority side keeps accepting writes that are discarded when it rejoins as a replica. Fix: set min-replicas-to-write 1 together with min-replicas-max-lag (e.g., 10 seconds) so the old master refuses writes once it has not heard from any replica within that window.

CODE: Distributed Lock (Single-Instance Redis)

python
import redis, time, uuid

class DistributedLock:
    """Simple distributed lock using Redis SET NX EX.
    For production, use the Redlock algorithm across 5 Redis instances."""

    def __init__(self, r: redis.Redis, name: str, ttl_s: int = 10):
        self.r = r
        self.name = f"lock:{name}"
        self.ttl = ttl_s
        self.token = str(uuid.uuid4())  # Unique token to prevent releasing someone else's lock

    def acquire(self, timeout: float = 5.0) -> bool:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            # SET key value NX EX ttl — atomic acquire
            if self.r.set(self.name, self.token, nx=True, ex=self.ttl):
                return True
            time.sleep(0.05)  # Retry after 50ms
        return False

    def release(self):
        # Lua script: only delete if we still own the lock
        script = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        self.r.eval(script, 1, self.name, self.token)

# Usage: ensure only one process runs database migration
lock = DistributedLock(redis.Redis(), "db-migration", ttl_s=300)
if lock.acquire():
    try:
        run_migration()
    finally:
        lock.release()

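Why the token matters. Without the unique token, you could accidentally release someone else's lock. Scenario: Process A acquires the lock, takes too long (TTL expires), Process B acquires the lock, Process A finishes and calls release, deleting Process B's lock. With the token check in the Lua script, Process A's release is a no-op because the lock value no longer matches its token. Note what the token does not fix: A and B still ran at the same time after the TTL expired, so size the TTL well above the worst-case run time, and for operations that cannot tolerate overlap have the protected resource reject stale holders (a fencing token).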

FRONTIER: Deterministic Simulation Testing (2025)

Distributed systems bugs are notoriously hard to reproduce. Deterministic simulation testing (pioneered by FoundationDB, adopted by TigerBeetle) runs your entire distributed system inside a single-threaded simulation where all I/O is fake and all randomness is seeded. You can replay any bug with the same seed. Companies like Antithesis offer this as a service — they found consensus bugs in etcd and CockroachDB that years of traditional testing missed.

CONCEPT: Consensus and Leader Election

Many infrastructure systems need a single leader: the SLURM scheduler, the K8s controller manager, the database primary. Consensus algorithms (Raft, Paxos) solve the problem of electing a leader among distributed nodes such that all nodes agree on who the leader is, even when some nodes fail.

Raft (used by etcd, CockroachDB, TiKV) breaks the problem into three parts:

1. Election: When a follower does not hear from the leader within a timeout (randomized, 150-300ms), it becomes a candidate and requests votes. If it gets a majority (3/5 or 2/3), it becomes the new leader. The randomized timeout prevents all followers from starting elections simultaneously.

2. Log replication: The leader receives writes, appends them to its log, and replicates to followers. A write is committed once a majority of nodes have written it. This guarantees durability: even if 2 of 5 nodes die, committed data is not lost.

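3. Safety: A voter rejects any candidate whose log is less up to date than its own (comparing the term and index of the last entry). Winning requires a majority of votes, and every committed entry is on a majority of nodes, so the winner's log always contains all committed entries. This prevents a stale node from becoming leader and overwriting committed data.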
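Two pieces of the rules above are small enough to write down directly: the majority arithmetic and the vote-granting log comparison. This is a sketch of those two checks only, not a Raft implementation; the function names are illustrative.

```python
def quorum(cluster_size: int) -> int:
    """Votes (or replicas) needed for a majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Nodes that can fail while a majority remains."""
    return (cluster_size - 1) // 2


def candidate_log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                                voter_last_term: int, voter_last_index: int) -> bool:
    """A voter grants its vote only if the candidate's log is at least as
    up to date as its own: later last term wins, ties go to the longer log."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index


for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum={quorum(n)}, survives {tolerated_failures(n)} failures")
```

The loop shows why cluster sizes are odd: going from 3 to 4 nodes raises the quorum without letting the cluster survive any additional failures.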

Why Raft matters for infrastructure. When you run kubectl apply, the API server writes to etcd. Etcd uses Raft to replicate that write to a majority of the cluster (in a 3-node cluster, the leader plus at least one of the 2 followers) before acknowledging it. If the etcd leader dies, a new leader is typically elected within a second or two (etcd's default election timeout is 1000ms). Your cluster state (what pods should be running, what services exist) survives because it was replicated to a majority before being acknowledged. Understanding Raft helps you debug etcd issues, set appropriate cluster sizes (always odd: 3, 5, or 7), and understand why etcd performance matters for K8s responsiveness.

DESIGN: Idempotency for Infrastructure Operations

Infrastructure operations must be idempotent: running the same operation twice produces the same result as running it once. This is critical because network failures can cause "did it succeed?" ambiguity. Terraform is idempotent by design (desired state, not imperative commands). But your custom scripts might not be.

python
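# `client` is a hypothetical cloud SDK handle, shown only to illustrate the pattern.

# BAD: Not idempotent. Running twice creates two load balancers.
def create_lb(name):
    client.create_load_balancer(name=name)

# GOOD: Idempotent. Converges on the desired state however many times it runs.
def ensure_lb(name, config):
    existing = client.get_load_balancer(name=name)  # assumed to return None if absent
    if existing is None:
        # Two concurrent callers can still race here; rely on a provider-side
        # unique name or idempotency key where one is available.
        return client.create_load_balancer(name=name, config=config)
    if existing.config != config:
        # Return the updated resource, not the stale object fetched above
        return client.update_load_balancer(name=name, config=config)
    return existing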

[Interactive demo: Consistent Hash Ring. Nodes sit on a hash ring and keys map to the nearest clockwise node; adding or removing nodes shows minimal key redistribution compared with modulo hashing.]

You add a 4th cache server to a 3-server consistent hash ring with 1000 keys. Approximately how many keys need to move?

~250 keys (1/4 of total: only keys between the new node and its predecessor move)
~750 keys (3/4 of total)
~1000 keys (all must re-hash)
~500 keys (half must re-hash)

Chapter 5: Scaling Databases

Parallel stores three kinds of data: user/account data (relational, ~100GB), conversation logs and traces (time-series, ~50TB and growing), and vector embeddings for RAG (vector store, ~3TB). Each has fundamentally different access patterns and scaling strategies. A staff engineer knows which database technology fits each pattern and, critically, when to stop scaling a single database and start sharding.

CONCEPT: Sharding Strategies

Vertical sharding moves different tables to different databases. Users table on DB-1, conversations on DB-2, billing on DB-3. Simple, but each shard can still hit single-node limits.

Horizontal sharding splits rows of the same table across multiple databases. Users A-M on Shard-1, N-Z on Shard-2. Scales horizontally, but cross-shard queries (joins) become expensive network calls.

The shard key choice determines everything. Bad shard key = hot shards. Good shard key = even distribution + query locality.

Data type	Shard key	Why
User accounts	user_id (hash)	Even distribution, all user queries hit one shard
Conversations	org_id + date	All conversations for one org on same shard (query locality), date prevents one shard from growing forever
Request logs	timestamp (range)	Time-series: recent data is hot, old data is cold storage
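The difference between hash and range shard keys shows up as soon as you count where writes land. A minimal sketch with synthetic data, assuming four shards; the function names, key formats, and day-number boundaries are all illustrative.

```python
import bisect
import hashlib
from collections import Counter


def hash_shard(key: str, num_shards: int) -> int:
    """Hash-based placement: spreads keys evenly, loses range locality."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_shards


def range_shard(value: int, upper_bounds: list[int]) -> int:
    """Range-based placement: shard i holds values below upper_bounds[i];
    the last shard takes everything at or above the final bound."""
    return bisect.bisect_right(upper_bounds, value)


# User accounts sharded by hash(user_id): roughly even.
user_counts = Counter(hash_shard(f"user-{i}", 4) for i in range(10_000))
print("hash on user_id:", dict(sorted(user_counts.items())))

# Request logs sharded by day number: today's writes all hit one shard.
day_bounds = [30, 60, 90]                       # 4 shards of ~30 days each
recent_writes = Counter(range_shard(day, day_bounds) for day in range(92, 100))
print("range on timestamp, last 8 days:", dict(recent_writes))
```

Range sharding on time keeps old data easy to archive, at the price of concentrating current writes on the newest shard.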

The shard key is a one-way door. Changing a shard key on a billion-row table requires migrating all data with zero downtime. This is one of the hardest operations in all of infrastructure engineering. Choose carefully during design. A common interview question: "Your shard key creates hot spots. How do you re-shard without downtime?" Answer: dual-write to old and new sharding scheme, backfill, switch reads, stop dual-write.

DESIGN: Connection Pooling Architecture

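A PostgreSQL connection costs ~10MB of RAM and 1 OS process. 1000 connections = 10GB RAM + 1000 processes. But your K8s cluster might have 200 pods, each wanting 10 connections. That is 2000 connections: roughly 20GB of RAM in backend processes alone, and far beyond PostgreSQL's default max_connections of 100. Your database runs out of memory or starts refusing connections.

PgBouncer sits between your app and PostgreSQL. 200 pods connect to PgBouncer (2000 connections). PgBouncer maintains only 50 connections to PostgreSQL. It multiplexes queries from 2000 app connections across 50 database connections using transaction-level pooling (a connection is returned to the pool after each transaction, not after the session ends). The trade-off: session state such as SET commands, advisory locks, and LISTEN does not carry across transactions, so the application must not depend on it.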
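The connection arithmetic is worth making explicit when sizing a pool. A small sketch using the figures from the text; the ~10MB per backend is a rough rule of thumb rather than a measured value, and the function name is illustrative.

```python
def connection_budget(pods: int, conns_per_pod: int,
                      mb_per_backend: float, pool_size: int) -> dict:
    """Compare direct connections with a pooled setup."""
    client_conns = pods * conns_per_pod
    return {
        "client_connections": client_conns,
        "direct_backend_ram_gb": client_conns * mb_per_backend / 1024,
        "pooled_backend_ram_gb": pool_size * mb_per_backend / 1024,
        "multiplex_ratio": client_conns / pool_size,
    }


print(connection_budget(pods=200, conns_per_pod=10, mb_per_backend=10, pool_size=50))
```

A 40:1 multiplex ratio only works if transactions are short; long-running transactions hold a server connection for their whole duration.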

CODE: Read Replica Routing

python
import random
from contextlib import contextmanager

class DBRouter:
    """Route reads to replicas, writes to primary."""
    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        self.replicas = replicas
        self._pools = {}  # connection pools per host

    def get_read_conn(self, allow_stale: bool = True):
        """Pick a random replica for reads. If stale not OK, use primary."""
        if not allow_stale or not self.replicas:
            return self._get_pool(self.primary)
        replica = random.choice(self.replicas)
        return self._get_pool(replica)

    def get_write_conn(self):
        """All writes go to primary."""
        return self._get_pool(self.primary)

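    @contextmanager
    def read_after_write(self):
        """After a write, read from primary to avoid stale reads."""
        yield self._get_pool(self.primary)

    def _get_pool(self, host: str):
        """Return the pool for a host, creating it on first use.
        Placeholder: returns the host string; swap in a real
        connection pool for your driver."""
        if host not in self._pools:
            self._pools[host] = host
        return self._pools[host]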

# Usage pattern:
router = DBRouter(
    primary="db-primary.internal:5432",
    replicas=["db-replica-1.internal:5432", "db-replica-2.internal:5432"]
)
# Dashboard query (stale OK): hits a replica
conn = router.get_read_conn(allow_stale=True)
# Billing check (must be fresh): hits primary
conn = router.get_read_conn(allow_stale=False)

DEBUG: Replication Lag

Your monitoring dashboard shows a spike. But the data is 30 seconds stale because it reads from a replica. The replica is lagging because a bulk data migration is running on the primary, generating so many WAL (Write-Ahead Log) entries that the replica cannot keep up.

Diagnosis: check pg_stat_replication.replay_lag on the primary. If lag > your tolerance (e.g., 5 seconds), route reads to primary until lag recovers. Better: set up lag-aware routing that automatically switches when lag exceeds threshold.

CODE: Lag-Aware Read Routing

python
import random, time
import asyncpg

class LagAwareRouter:
    """Route reads to replicas only when lag is acceptable."""
    def __init__(self, primary_dsn: str, replica_dsns: list[str],
                 max_lag_seconds: float = 5.0):
        self.primary_dsn = primary_dsn
        self.replica_dsns = replica_dsns
        self.max_lag = max_lag_seconds
        self.replica_lags: dict[str, float] = {}
        self.last_check = float("-inf")  # force a check on first use
        self.check_interval = 2.0  # seconds

    async def _check_lags(self):
        """Query replication lag from primary's pg_stat_replication."""
        if time.monotonic() - self.last_check < self.check_interval:
            return
        conn = await asyncpg.connect(self.primary_dsn)
        try:
            rows = await conn.fetch(
                "SELECT client_addr, "
                "EXTRACT(EPOCH FROM replay_lag) AS lag_s "
                "FROM pg_stat_replication"
            )
        finally:
            await conn.close()  # close even if the query fails
        # Rebuild the map so replicas that disconnected drop out
        self.replica_lags = {
            str(row["client_addr"]): float(row["lag_s"] or 0) for row in rows
        }
        self.last_check = time.monotonic()

    async def get_read_dsn(self, require_fresh: bool = False) -> str:
        if require_fresh:
            return self.primary_dsn
        await self._check_lags()
        # client_addr is an IP address, so replica DSNs must use the same
        # IPs (not hostnames) for this lookup to match.
        healthy = []
        for dsn in self.replica_dsns:
            host = dsn.split('@')[-1].split(':')[0]
            lag = self.replica_lags.get(host, float("inf"))  # unknown = unhealthy
            if lag <= self.max_lag:
                healthy.append(dsn)
        if healthy:
            return random.choice(healthy)  # spread load across healthy replicas
        # All replicas lagging or unknown: fall back to primary
        return self.primary_dsn

DESIGN: Time-Series Data Architecture

Parallel generates ~500K metrics data points per second from GPU telemetry, inference latency, and request logs. Standard PostgreSQL cannot handle this write volume. The architecture:

Hot tier (0-7 days): TimescaleDB (PostgreSQL extension) with hypertables. Data is partitioned by time automatically. Continuous aggregates pre-compute 1-minute and 1-hour rollups. Query performance: sub-second for "show me GPU utilization for the last 6 hours."

Warm tier (7-90 days): Compressed hypertable chunks. TimescaleDB's native compression achieves 10-20x compression ratios for metrics data. Queries are slower (1-5 seconds) but data is still online.

Cold tier (90+ days): Parquet files in S3 via a nightly export job. Queryable via Athena/Trino for historical analysis. Virtually free storage ($0.023/GB/month).

DESIGN: Vector Store Architecture at Scale

Parallel's RAG system stores 500M embedding vectors. Each vector is 1536 dimensions × 4 bytes = ~6KB. Total: ~3TB of vector data. The architecture must support <50ms nearest-neighbor queries at 10K queries/sec:

Component	Choice	Why
Vector DB	Qdrant (Rust, open-source)	Best latency/throughput for filtered search. Supports hybrid search (dense + sparse).
Index type	HNSW (Hierarchical Navigable Small World)	Sub-linear query time O(log n), 95%+ recall at ef=128
Sharding	Hash-based on collection_id	Each customer's embeddings on one shard for query locality
Replication	2 replicas per shard	Read throughput + fault tolerance. One replica can serve while other rebuilds index.
Hardware	Memory-optimized instances (r6i.8xlarge)	HNSW index must fit in RAM. 3TB data + index overhead ≈ 5TB RAM needed across shards.

Vector indexes must fit in RAM. Unlike traditional databases where disk I/O is acceptable, HNSW graph traversal requires random memory access at each hop. If the index is on disk, each hop becomes a random disk read (100μs SSD vs. 100ns RAM = 1000x slower). For 500M vectors, this means your vector store cluster needs enough aggregate RAM for the entire index. At roughly $0.008/GB/hr for memory-optimized instances (an r6i.8xlarge offers 256GiB for about $2/hr on-demand), a 5TB index needs about 20 instances, or roughly $30K/month on-demand for a single copy. Replicas multiply that figure and reserved pricing reduces it, so treat this as a significant cost center.
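A back-of-the-envelope sizing helper keeps these numbers honest. This sketch assumes an index overhead factor of about 1.63 (so that 3TB of raw vectors becomes roughly 5TB), 256GiB per instance, and an approximate on-demand price of $2.016/hr; all three are assumptions to replace with your own measurements and current pricing, and the function name is illustrative.

```python
import math


def vector_store_sizing(num_vectors: int, dims: int, bytes_per_dim: int = 4,
                        index_overhead: float = 1.63,     # assumed HNSW + payload overhead
                        instance_ram_gib: int = 256,      # assumed instance size
                        instance_hourly_usd: float = 2.016,  # assumed on-demand price
                        copies: int = 1, hours_per_month: int = 730) -> dict:
    """Estimate RAM, instance count, and monthly cost for an in-memory index."""
    raw_bytes = num_vectors * dims * bytes_per_dim
    total_bytes = raw_bytes * index_overhead * copies
    instances = math.ceil(total_bytes / (instance_ram_gib * 2**30))
    return {
        "bytes_per_vector": dims * bytes_per_dim,
        "raw_tb": raw_bytes / 1e12,
        "total_tb": total_bytes / 1e12,
        "instances": instances,
        "monthly_usd": instances * instance_hourly_usd * hours_per_month,
    }


print(vector_store_sizing(num_vectors=500_000_000, dims=1536))
print(vector_store_sizing(num_vectors=500_000_000, dims=1536, copies=3))
```

The instance count here packs RAM to 100%, which is optimistic; in practice you would leave headroom for the OS, query buffers, and index rebuilds.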

CODE: Batch Embedding Pipeline

python
from qdrant_client import AsyncQdrantClient, models

async def index_documents(
    docs: list[dict],
    qdrant: AsyncQdrantClient,
    embed_fn,
    collection: str,
    batch_size: int = 100,
):
    """Batch-index documents into Qdrant, one embedding call per batch."""
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        texts = [d["text"] for d in batch]

        # embed_fn is an async callable returning one vector per text
        # (assumed to be NumPy arrays, hence .tolist() below)
        vectors = await embed_fn(texts)

        # Upsert to Qdrant (idempotent via point IDs)
        points = [
            models.PointStruct(
                id=d["id"],
                vector=v.tolist(),
                payload={"text": d["text"], "source": d["source"]},
            )
            for d, v in zip(batch, vectors)
        ]
        # Async client, so the event loop is not blocked during the upload
        await qdrant.upsert(collection_name=collection, points=points)

    # Restore the indexing threshold after the bulk load. This only changes
    # behavior if indexing was disabled beforehand (indexing_threshold=0).
    await qdrant.update_collection(
        collection_name=collection,
        optimizers_config=models.OptimizersConfigDiff(
            indexing_threshold=20000,
        ),
    )

FRONTIER: Vector Databases at Scale (2024-2025)

Parallel stores embeddings for RAG. The naive approach — pgvector in PostgreSQL — works for 1M vectors. At 100M vectors, you need a purpose-built vector store: Pinecone (managed), Qdrant (open-source, Rust), or Milvus (open-source, GPU-accelerated search). The 2025 frontier is hybrid search: combining dense vector search with sparse keyword search (BM25) in a single query, scored with reciprocal rank fusion. Qdrant and Weaviate ship this natively.

[Interactive demo: Database Sharding Visualizer. Shows how different shard keys distribute data across 4 shards, with a toggle between hash-based and range-based sharding to show hot spots forming.]

You have 200 app pods each wanting 10 database connections, but PostgreSQL can only handle 200 connections. What is the standard solution?

Add more PostgreSQL instances
Deploy PgBouncer as a connection pooler between apps and PostgreSQL with transaction-level pooling
Reduce the number of app pods to 20
Switch to a NoSQL database

Chapter 6: Cost Optimization

Parallel's GPU fleet costs $3-5M per month. A 10% utilization improvement saves $300K-$500K per month. A 5% cost reduction in cloud spend saves more than most engineers' annual salaries. Cost engineering is not a side project — it is a core competency. The difference between a profitable AI company and one that burns cash faster than it earns revenue often comes down to infrastructure cost efficiency.

CONCEPT: The Cost Stack

Understanding where money goes is the first step to saving it. For an LLM serving company, the cost stack typically looks like this:

Category	% of total	What drives it
GPU compute (inference)	55-70%	Number of GPUs × hours × $/hr
GPU compute (training/finetune)	10-20%	Training runs, eval batches
Storage (model weights, logs)	5-10%	S3, EBS, vector store storage
Networking (inter-region, egress)	5-8%	Cross-region replication, API responses
CPU compute (API, workers)	3-5%	K8s cluster, background jobs
Managed services	2-5%	RDS, ElastiCache, CloudWatch

The 55-70% number is the whole game. Everything else is noise by comparison. If you optimize your GPU utilization from 40% to 80%, you have effectively halved your biggest cost category. This is why GPU scheduling, batching efficiency, and model quantization are not ML curiosities — they are the primary levers of business profitability.
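The "halved" claim is just division, but writing it out makes the lever obvious. A sketch assuming a fixed amount of useful GPU work and an illustrative hourly rate (the function name and the workload figure are hypothetical):

```python
def fleet_cost_for_work(useful_gpu_hours: float, hourly_rate: float,
                        utilization: float) -> float:
    """Cost of delivering a fixed amount of useful GPU work at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    paid_gpu_hours = useful_gpu_hours / utilization   # idle time is still billed
    return paid_gpu_hours * hourly_rate


work = 160_000        # useful GPU-hours per month (illustrative)
for util in (0.4, 0.6, 0.8):
    print(f"{util:.0%} utilization: ${fleet_cost_for_work(work, 10.0, util):,.0f}/month")
```

The same workload costs half as much at 80% utilization as at 40%, because the bill tracks paid hours rather than useful hours.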

DESIGN: Spot Instance Strategy

Spot instances cost 60-90% less than on-demand but can be reclaimed with 2 minutes notice. The strategy for inference serving:

Base layer (on-demand or reserved): 40% of your peak capacity. This handles your minimum traffic and never gets reclaimed. Commit this layer to 1-year reserved instances or a savings plan for a discount of roughly 30% versus on-demand.

Elastic layer (spot): 0-60% of peak capacity. Scales up for traffic spikes, scales down during quiet hours. If spot capacity is reclaimed, requests queue and latency increases but the service stays up.

Burst layer (on-demand): Emergency overflow. If spot is reclaimed AND traffic is above base capacity, spin up on-demand instances. This is the most expensive option but keeps SLOs intact during spot reclamation events.
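The three layers can be turned into a simple hourly cost model. This sketch assumes a 40% reserved base at a 30% discount and spot at a 70% discount, matching the ranges in the text; the exact discounts, the function name, and the example fleet size are illustrative assumptions.

```python
def hourly_cost(demand_gpus: float, peak_gpus: float, on_demand_rate: float,
                base_fraction: float = 0.4, reserved_discount: float = 0.3,
                spot_discount: float = 0.7, spot_available: bool = True) -> float:
    """Blended hourly cost for base (reserved) + elastic (spot) or burst (on-demand)."""
    base_gpus = peak_gpus * base_fraction
    # The reserved base is paid for whether or not it is fully used.
    base_cost = base_gpus * on_demand_rate * (1 - reserved_discount)
    overflow = max(0.0, demand_gpus - base_gpus)
    if spot_available:
        overflow_rate = on_demand_rate * (1 - spot_discount)   # elastic layer
    else:
        overflow_rate = on_demand_rate                          # burst layer
    return base_cost + overflow * overflow_rate


print("normal day:      ", hourly_cost(80, 100, 10.0))
print("spot reclaimed:  ", hourly_cost(80, 100, 10.0, spot_available=False))
print("all on-demand:   ", 80 * 10.0)
```

Even during a spot outage the layered fleet costs less than running everything on-demand, because the reserved base is still discounted.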

DESIGN: Spot Instance Interruption Handling

When AWS reclaims a spot instance, you get a 2-minute warning via the EC2 metadata endpoint. Your infrastructure must gracefully handle this:

python
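import subprocess, time
import requests

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    """Fetch an IMDSv2 session token (required when IMDSv1 is disabled)."""
    resp = requests.put(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
        timeout=1,
    )
    resp.raise_for_status()
    return resp.text

def spot_interruption_handler(node_name: str, webhook_url: str,
                              wait_for_drain, gpu_count: int = 8):
    """Poll EC2 metadata for spot interruption notice.
    Run as a sidecar in every GPU pod on spot instances.
    wait_for_drain is a caller-supplied function that blocks until
    in-flight requests finish or the timeout expires."""
    while True:
        try:
            resp = requests.get(
                f"{IMDS}/meta-data/spot/instance-action",
                headers={"X-aws-ec2-metadata-token": imds_token()},
                timeout=1,
            )
            if resp.status_code == 200:
                # INTERRUPTION INCOMING: we have ~2 minutes
                action = resp.json()
                print(f"Spot interruption: {action['action']} at {action['time']}")

                # Step 1: Stop new pods landing here. Cordon does not stop
                # traffic to running pods, so also fail readiness to take
                # them out of load-balancer rotation.
                subprocess.run(["kubectl", "cordon", node_name], check=False)

                # Step 2: Drain in-flight requests (wait up to 90 seconds)
                wait_for_drain(timeout=90)

                # Step 3: Signal the autoscaler to compensate
                requests.post(webhook_url, json={
                    "event": "spot_interruption",
                    "node": node_name,
                    "gpu_count": gpu_count,
                }, timeout=5)
                return
            # 404 means no interruption is scheduled: normal operation
        except requests.exceptions.RequestException:
            pass  # Metadata service or webhook unreachable; retry on the next poll
        time.sleep(5)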

The 2-minute window is your entire budget. In those 2 minutes, you must: stop accepting new requests (cordon the node so nothing new is scheduled on it, and take the pod out of load balancer rotation, for example by failing its readiness probe), let in-flight requests complete (graceful drain), signal the autoscaler to launch replacement capacity (ideally in a different AZ with spot availability), and transfer any non-replicated state. KV cache cannot be transferred, so those requests will need to re-prefill on the new instance. The handler above automates this. Without it, the instance simply disappears and all in-flight requests get 502 errors.

CONCEPT: Total Cost of Ownership (TCO)

When comparing cloud providers or GPU types, the hourly rate is not the full picture. Total Cost of Ownership includes:

Cost Component	Often Overlooked?	Example
Compute hourly rate	No	H100: $10/hr
Data transfer (egress)	Yes	$0.09/GB from AWS to internet. At 1TB/day = $2,700/month
Storage (EBS, S3)	Sometimes	Model weights on EBS: $0.10/GB/month. 140GB × 100 nodes = $1,400/month
Engineering time	Yes	2 engineers × $300K salary / 12 = $50K/month in labor
Monitoring/logging	Yes	Datadog at scale: $20K-$100K/month
Support contracts	Sometimes	AWS Enterprise Support: $15K+/month

A common mistake: choosing a cloud provider with 10% cheaper GPU rates but 3x more expensive egress. If your workload is egress-heavy (streaming tokens to users), the "cheaper" provider is actually more expensive. Always model the full TCO before making procurement decisions.
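
A small model makes the egress trap concrete. This is a sketch under stated assumptions: the second provider's prices ($9/hr GPUs, $0.27/GB egress) are hypothetical figures chosen to match "10% cheaper GPUs, 3x egress", and only compute and egress are modeled.

```python
def monthly_tco(gpu_hourly: float, gpu_count: int,
                egress_per_gb: float, egress_gb_per_month: float,
                hours_per_month: float = 730.0) -> float:
    """Compute plus egress cost per month, in dollars."""
    compute = gpu_hourly * gpu_count * hours_per_month
    egress = egress_per_gb * egress_gb_per_month
    return compute + egress


EGRESS_GB = 30_000  # 1 TB/day, as in the table above

# Provider A: $10/hr GPUs and $0.09/GB egress (figures from the table).
tco_a = monthly_tco(10.0, 4, 0.09, EGRESS_GB)
# Provider B (hypothetical): GPUs 10% cheaper, egress 3x more expensive.
tco_b = monthly_tco(9.0, 4, 0.27, EGRESS_GB)

# Egress volume at which B's compute savings are cancelled out.
compute_savings = (10.0 - 9.0) * 4 * 730.0
break_even_gb = compute_savings / (0.27 - 0.09)

print(f"A: ${tco_a:,.0f}/month, B: ${tco_b:,.0f}/month")
print(f"B is only cheaper below {break_even_gb:,.0f} GB/month of egress")
```

With 4 GPUs, the "cheaper" provider ends up costing more at this egress volume. The break-even point moves with fleet size, which is why the comparison has to be rerun for your own workload.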

CODE: Inference Cost Calculator

python
from dataclasses import dataclass

@dataclass
class InferenceCostModel:
    gpu_hourly_cost: float     # $/hour for one model replica (all of its GPUs)
    tokens_per_second: float   # Peak token throughput of one replica
    avg_input_tokens: int      # Average request input length
    avg_output_tokens: int     # Average request output length
    batch_efficiency: float    # 0-1, fraction of peak throughput actually achieved
    gpu_utilization: float     # 0-1, fraction of paid time the GPUs do useful work

    def cost_per_request(self) -> float:
        total_tokens = self.avg_input_tokens + self.avg_output_tokens
        # Time to process one request (seconds)
        effective_tps = self.tokens_per_second * self.batch_efficiency
        time_s = total_tokens / effective_tps
        # Cost = time * GPU cost, adjusted for utilization
        cost_per_second = self.gpu_hourly_cost / 3600
        return (time_s * cost_per_second) / self.gpu_utilization

    def cost_per_million_tokens(self) -> float:
        cost_per_req = self.cost_per_request()
        tokens_per_req = self.avg_input_tokens + self.avg_output_tokens
        return (cost_per_req / tokens_per_req) * 1_000_000

# Example: Llama 70B on 2x H100 with vLLM
# Note: 80 tok/s is in the range of a single decode stream. Substitute the
# aggregate throughput you measure under continuous batching, which is
# usually much higher, to get realistic fleet-level costs.
model = InferenceCostModel(
    gpu_hourly_cost=20.0,       # 2x H100 at $10/hr each
    tokens_per_second=80,       # ~80 tok/s per 2-GPU setup
    avg_input_tokens=500,
    avg_output_tokens=200,
    batch_efficiency=0.7,       # Achieving 70% of peak throughput
    gpu_utilization=0.75,       # 75% utilization is good
)
print(f"Cost per request: ${model.cost_per_request():.4f}")            # $0.0926
print(f"Cost per 1M tokens: ${model.cost_per_million_tokens():.2f}")   # $132.28

DEBUG: Zombie Resources

The most common cost leak: zombie resources — infrastructure that is running but no longer needed. Failed experiments that left GPU instances running. Load balancers for decommissioned services. EBS volumes detached from any instance. Elastic IPs not attached to anything. Each one quietly burns money.

Prevention: tag every resource with owner, team, expires. Run a nightly job that deletes resources past their expiry date. For GPU instances, automatically terminate any instance with <5% utilization for >2 hours (with a whitelist for legitimate idle use cases like pre-warming).

CODE: Zombie Resource Hunter

python
import boto3
from datetime import datetime, timedelta, timezone

# Hypothetical price table: fill in from your own pricing data
HOURLY_COST_USD: dict[str, float] = {}

def estimate_hourly_cost(instance_type: str):
    """Return the hourly price for an instance type, or None if unknown."""
    return HOURLY_COST_USD.get(instance_type)

def find_zombie_resources(region: str) -> list[dict]:
    """Find resources that are running but probably shouldn't be."""
    zombies = []
    ec2 = boto3.client('ec2', region_name=region)
    cw = boto3.client('cloudwatch', region_name=region)
    now = datetime.now(timezone.utc)

    # Check running GPU instances with low utilization.
    # The paginator avoids silently missing instances beyond the first page.
    pages = ec2.get_paginator('describe_instances').paginate(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['running']},
            {'Name': 'instance-type', 'Values': ['p4d.*', 'p5.*']}
        ]
    )
    for page in pages:
        for res in page['Reservations']:
            for inst in res['Instances']:
                inst_id = inst['InstanceId']
                # GPU utilization is not a built-in EC2 metric; this assumes an
                # agent publishes it under the custom 'GPU' namespace
                stats = cw.get_metric_statistics(
                    Namespace='GPU', MetricName='Utilization',
                    Dimensions=[{'Name': 'InstanceId', 'Value': inst_id}],
                    StartTime=now - timedelta(hours=2),
                    EndTime=now,
                    Period=3600, Statistics=['Average']
                )
                points = stats['Datapoints']
                if not points:
                    continue  # Missing metrics are not proof of idleness
                avg_util = sum(p['Average'] for p in points) / len(points)
                if avg_util < 5:
                    tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}
                    zombies.append({
                        'resource': inst_id,
                        'type': inst['InstanceType'],
                        'avg_util': round(avg_util, 1),
                        'owner': tags.get('owner', 'UNTAGGED'),
                        'hourly_cost': estimate_hourly_cost(inst['InstanceType']),
                    })
    return zombies

DESIGN: Right-Sizing Automation

Most cloud resources are over-provisioned. Engineers request 8 vCPUs "just in case" but use 1.5 on average. Right-sizing means matching resource allocation to actual usage. The automation pipeline:

1. Collect

Pull 14 days of CPU, memory, GPU, and network metrics from Prometheus for every pod and instance.

↓

2. Analyze

For each resource: compute p50, p95, and max utilization. If p95 < 50% of allocation, it is over-provisioned. If p95 > 90%, it is under-provisioned.

↓

3. Recommend

Generate a recommendation: "Pod inference-server: CPU request 8 → 2, memory request 16Gi → 6Gi. Estimated savings: $1,200/month."

↓

4. Apply (with approval)

Create a PR with the updated K8s resource specs. Human reviews and approves. ArgoCD deploys.
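
The Analyze step can be sketched in a few lines. This is an illustrative version that assumes usage samples have already been pulled from Prometheus into a list; `percentile` and `classify` are hypothetical helper names, and the 50% and 90% thresholds are the ones given above.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify(samples: list[float], allocation: float) -> str:
    """Compare p95 usage with the allocation, using the thresholds above."""
    p95_ratio = percentile(samples, 95) / allocation
    if p95_ratio < 0.5:
        return "over-provisioned"
    if p95_ratio > 0.9:
        return "under-provisioned"
    return "right-sized"


# A pod that requested 8 vCPUs but mostly uses 1.5, with rare bursts to 3.
cpu_usage = [1.5] * 95 + [3.0] * 5
print(classify(cpu_usage, allocation=8.0))      # over-provisioned
print(classify([7.5] * 100, allocation=8.0))    # under-provisioned
```

Using p95 rather than the mean keeps short bursts from being sized away, while ignoring the rare maximum that would otherwise justify the "just in case" request.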

FRONTIER: Quantization Economics (2024-2025)

GPTQ/AWQ quantization (INT4) reduces model weight memory by 4x relative to FP16, with ~1-3% quality loss. A 70B model goes from 140GB (FP16) to 35GB (INT4), fitting on a single H100 instead of two. Cost per request drops by roughly half because you need half the GPUs. The 2025 frontier: FP8 quantization on H100/B200 uses the native FP8 tensor cores, giving up to about 2x the throughput of FP16 with very little quality loss. Every major serving framework (vLLM, TensorRT-LLM, SGLang) supports FP8 natively.
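
The memory arithmetic is simple enough to check in code. This sketch counts weights only; KV cache and activations need additional headroom, so treat the GPU counts as lower bounds. The 80GB figure assumes an H100 80GB, and the function names are illustrative.

```python
import math

# Bytes stored per parameter at each precision
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1B params at 1 byte = 1 GB)."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(params_billion: float, precision: str,
                         gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(params_billion, precision) / gpu_memory_gb)


for precision in ("fp16", "fp8", "int4"):
    print(precision,
          weight_memory_gb(70, precision), "GB,",
          min_gpus_for_weights(70, precision), "GPU(s) minimum")
```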

DESIGN: Model Routing for Cost Optimization

Not every request needs the biggest model. A simple "what time is it?" does not need 70B parameters. Model routing classifies incoming requests and routes them to the cheapest model that can handle them:

Request complexity	Model	Cost/request	Example
Simple (FAQ, greetings)	8B model (1 GPU)	$0.0005	"What are your business hours?"
Medium (analysis, summarization)	70B model (2 GPUs)	$0.004	"Summarize this document and list key points."
Complex (reasoning, multi-step)	405B model (8 GPUs)	$0.02	"Compare these 3 investment strategies considering tax implications."

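The router itself is a small classifier (fine-tuned BERT or even a rule-based system on prompt length + keywords). If 60% of requests are simple and the rest stay on the 70B model, the average cost per request falls from $0.004 to 0.6 × $0.0005 + 0.4 × $0.004 = $0.0019, a reduction of about 50% compared to sending everything to the 70B model. The saving shrinks as more traffic is routed up to the 405B model, so measure your real request mix before quoting a number.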
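
The effect of the request mix can be computed directly from the table's per-request costs. The mixes below are hypothetical examples, and `blended_cost` is an illustrative helper rather than part of any routing library.

```python
# Per-request costs from the routing table above
COST_PER_REQUEST = {"8b": 0.0005, "70b": 0.004, "405b": 0.02}


def blended_cost(mix: dict[str, float]) -> float:
    """Average cost per request for a traffic mix (fractions summing to 1)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic fractions must sum to 1")
    return sum(share * COST_PER_REQUEST[model] for model, share in mix.items())


baseline = COST_PER_REQUEST["70b"]  # everything sent to the 70B model

simple_heavy = blended_cost({"8b": 0.6, "70b": 0.4, "405b": 0.0})
with_complex = blended_cost({"8b": 0.6, "70b": 0.3, "405b": 0.1})

print(f"60/40/0 mix:  ${simple_heavy:.4f} ({1 - simple_heavy / baseline:.1%} saved)")
print(f"60/30/10 mix: ${with_complex:.4f} ({1 - with_complex / baseline:.1%} saved)")
```

Sending even 10% of traffic to the 405B tier cuts the saving sharply, so the benefit of routing depends heavily on how many requests genuinely need the largest model.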

The cheapest GPU cycle is the one you don't use. Model routing, caching, and semantic deduplication are often more impactful than GPU-level optimizations. If 10% of your requests are exact duplicates (same model, same prompt), a simple prompt cache (Redis with a hash of the prompt as key) eliminates them entirely. At 100K req/s, that is 10K requests per second you serve from cache in 1ms instead of from a GPU in 500ms.
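
A minimal sketch of the exact-match prompt cache follows. A plain dict stands in for Redis so the example runs anywhere; `PromptCache` and `fake_model` are hypothetical names, and a real deployment would also need a TTL and a policy for sampled (non-deterministic) outputs.

```python
import hashlib
from typing import Callable


class PromptCache:
    """Exact-match response cache keyed by a hash of (model, prompt)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}  # stand-in for Redis
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str) -> str:
        # Include the model name so different models never share entries.
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt: str,
                       compute: Callable[[str], str]) -> str:
        k = self.key(model, prompt)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = compute(prompt)  # the expensive GPU call
        self._store[k] = result
        return result


gpu_calls = []


def fake_model(prompt: str) -> str:
    gpu_calls.append(prompt)
    return f"answer to: {prompt}"


cache = PromptCache()
first = cache.get_or_compute("llama-70b", "What are your business hours?", fake_model)
second = cache.get_or_compute("llama-70b", "What are your business hours?", fake_model)
print(cache.hits, cache.misses, len(gpu_calls))
```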

A 70B FP16 model costs $0.008 per request on 2x H100. You quantize to INT4 and fit it on 1x H100. What is the approximate new cost per request?

$0.008 (quantization doesn't affect cost)
~$0.004 (half the GPUs, roughly half the cost, with minor throughput difference)
$0.002 (quarter the cost)
$0.016 (quantization adds overhead)

Chapter 7: Reliability Engineering

Parallel's API serves customers whose businesses depend on it. When the API is down, their products are broken, their users are frustrated, and they are looking at competitors. "Five nines" (99.999%) uptime allows roughly 5 minutes of downtime per year. "Four nines" (99.99%) allows roughly 52 minutes per year. Every minute counts. Reliability is not a feature you bolt on; it is a design constraint that shapes every architectural decision.

CONCEPT: SLOs, SLIs, and Error Budgets

A Service Level Indicator (SLI) is a metric that measures reliability. Examples: request success rate, p99 latency, time to first token.

A Service Level Objective (SLO) is a target for an SLI. "99.9% of requests complete in under 800ms." This is an internal commitment. It is stricter than what you promise customers (the SLA), so you have room to react before a contractual breach.

An error budget is the amount of unreliability your SLO allows. If your SLO is 99.9% availability, your error budget is 0.1% — roughly 43 minutes per month. When you have error budget remaining, you can take risks (deploy risky features, run experiments). When the budget is exhausted, you freeze deployments and focus on reliability improvements.

Error budgets align incentives. Without error budgets, the reliability team says "no changes, changes cause outages" and the product team says "ship faster." With error budgets, both teams share a quantitative budget. If reliability is good (budget remaining), the product team can ship fast. If reliability is bad (budget depleted), everyone slows down. This is the single most important organizational tool in SRE.
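
The budget arithmetic is worth having as code, because it drives the deploy-or-freeze decision. This is a minimal sketch; `error_budget_minutes` and `burn_rate` are illustrative names, and the burn rate here is simply budget consumed divided by time elapsed in the window.

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(budget_used_fraction: float, window_elapsed_fraction: float) -> float:
    """Above 1.0 means the budget will run out before the window ends."""
    return budget_used_fraction / window_elapsed_fraction


monthly_999 = error_budget_minutes(0.999)                    # ~43.2 minutes
yearly_five_nines = error_budget_minutes(0.99999, 365)       # ~5.3 minutes
# Example: 80% of the budget used by day 20 of a 30-day window.
rate = burn_rate(budget_used_fraction=0.8, window_elapsed_fraction=20 / 30)

print(f"99.9% over 30 days: {monthly_999:.1f} min")
print(f"99.999% over a year: {yearly_five_nines:.1f} min")
print(f"Burn rate: {rate:.2f}")
```

A burn rate of 1.2 means the budget is being spent 20% faster than the window allows, which is a signal to reduce risk before the budget is gone rather than after.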

DESIGN: Chaos Engineering

You cannot know how your system fails until it actually fails. Chaos engineering deliberately introduces failures in production (in a controlled way) to discover weaknesses before customers do.

The hierarchy of chaos experiments, from least to most scary:

1. Kill a pod

Does K8s restart it? Does the load balancer route around it? Does any request fail?

↓

2. Kill a node

Do pods reschedule to other nodes? How long does it take? Do GPU jobs recover?

↓

3. Network partition

Inject latency or packet loss between services. Does the circuit breaker trip? Does the fallback work?

↓

4. Zone failure

Simulate an entire AZ going dark. Does traffic redistribute? Is data replicated?

↓

5. Region failure

The big one. Does the global load balancer route to another region? How long until full recovery?

CODE: Circuit Breaker

python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if service recovered

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused."""

class CircuitBreaker:
    """Not thread-safe; wrap state changes in a lock for threaded servers."""

    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.state = State.CLOSED
        self.failures = 0  # Consecutive failures since the last success
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure_time = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == State.OPEN:
            if time.monotonic() - self.last_failure_time > self.reset_timeout:
                self.state = State.HALF_OPEN  # Let one trial request through
            else:
                raise CircuitOpenError("Circuit is open, refusing request")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure_time = time.monotonic()
            # A failed trial request, or too many consecutive failures,
            # opens the circuit
            if self.state == State.HALF_OPEN or self.failures >= self.failure_threshold:
                self.state = State.OPEN
            raise
        # Any success closes the circuit and resets the failure count, so
        # isolated errors spread over hours never add up to a trip
        self.state = State.CLOSED
        self.failures = 0
        return result

# Usage: protect calls to the inference backend
# (inference_backend, user_input, and fallback_response are placeholders)
breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30)
try:
    result = breaker.call(inference_backend.predict, prompt=user_input)
except CircuitOpenError:
    result = fallback_response()  # Return cached/degraded response

DEBUG: Cascading Failures

The most dangerous production failure pattern: Service A depends on Service B. Service B slows down (doesn't fail, just gets slow). Service A's threads are all waiting for Service B, so Service A's queue fills up. Service C depends on Service A and also starts queueing. Within minutes, the entire system is down even though Service B only slowed by 200ms.

Prevention: timeouts on every external call (never wait forever), circuit breakers (stop calling a slow service), load shedding (reject new requests when overloaded rather than queueing them), and bulkheading (isolate thread pools per dependency so one slow dependency cannot consume all threads).
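
Timeouts and bulkheading combine naturally in async code. The sketch below assumes an asyncio service; `Bulkhead`, `BulkheadFullError`, and `slow_dependency` are hypothetical names. Each dependency gets its own concurrency cap and per-call timeout, and calls beyond the cap are rejected immediately instead of queueing.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency slots are all in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency and bounds each call's duration."""

    def __init__(self, max_concurrent: int, timeout_s: float) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self._timeout_s = timeout_s

    async def call(self, coro_fn, *args):
        if self._sem.locked():
            # Shed instead of queueing: a slow dependency cannot pile up callers.
            raise BulkheadFullError("dependency at capacity")
        async with self._sem:
            return await asyncio.wait_for(coro_fn(*args), timeout=self._timeout_s)


async def slow_dependency(delay_s: float) -> str:
    await asyncio.sleep(delay_s)
    return "ok"


async def demo():
    bulkhead = Bulkhead(max_concurrent=2, timeout_s=0.05)
    return await asyncio.gather(
        bulkhead.call(slow_dependency, 0.01),  # fast: succeeds
        bulkhead.call(slow_dependency, 0.5),   # slow: hits the timeout
        bulkhead.call(slow_dependency, 0.01),  # third caller: both slots taken
        return_exceptions=True,
    )


results = asyncio.run(demo())
print([type(r).__name__ if isinstance(r, Exception) else r for r in results])
```

With one `Bulkhead` per downstream service, a slowdown in Service B can exhaust only B's slots, leaving capacity for calls that do not depend on it.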

CODE: Load Shedder

python
import time
from collections import deque

from fastapi import FastAPI
from fastapi.responses import JSONResponse

class LoadShedder:
    """Reject excess requests to protect the system from overload.
    Uses a sliding window to measure current request rate."""

    def __init__(self, max_rps: int, window_seconds: float = 1.0):
        self.max_rps = max_rps
        self.window = window_seconds
        # Requests allowed per window, so the limit scales with window size
        self.max_in_window = int(max_rps * window_seconds)
        self.timestamps: deque = deque()

    def should_accept(self) -> bool:
        now = time.monotonic()
        # Remove timestamps outside the window
        while self.timestamps and self.timestamps[0] < now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_in_window:
            return False  # Shed this request (HTTP 503)
        self.timestamps.append(now)
        return True

# Usage in FastAPI middleware:
app = FastAPI()
shedder = LoadShedder(max_rps=5000)

@app.middleware("http")
async def load_shed_middleware(request, call_next):
    if not shedder.should_accept():
        return JSONResponse(
            status_code=503,
            content={"error": "Service overloaded, please retry"},
            headers={"Retry-After": "2"}
        )
    return await call_next(request)

Shed early, not late. The worst thing you can do under overload is accept every request and process it slowly. This ties up server resources (memory, threads, GPU time) and makes everything slow for everyone. It is better to immediately reject the excess (HTTP 503 with Retry-After header) so the requests you do accept get full resources and fast responses. The client retries; the system stays healthy.
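
Shedding only helps if clients retry politely. The sketch below shows one way a client might compute its wait time: honor the server's Retry-After value and add jittered exponential backoff on top, so rejected clients do not all return at the same instant. The function name and default values are illustrative assumptions.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after_s: float = 0.0,
                base_s: float = 0.5, cap_s: float = 30.0,
                rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Waits at least the server's Retry-After, plus a random amount that
    grows exponentially with the attempt number, up to `cap_s`.
    """
    rng = rng or random.Random()
    backoff = min(cap_s, base_s * (2 ** attempt))
    return retry_after_s + rng.uniform(0, backoff)


rng = random.Random(42)  # seeded only to make the demo repeatable
delays = [retry_delay(attempt, retry_after_s=2.0, rng=rng) for attempt in range(6)]
print([round(d, 2) for d in delays])
```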

DESIGN: Postmortem Template

Every incident gets a postmortem within 48 hours. The postmortem is blameless — it focuses on systems, not people. The template:

1. Summary

What happened in one sentence. "GPU cluster us-east-1 served degraded latency for 23 minutes due to NVLink failure on 3 nodes."

↓

2. Timeline

Minute-by-minute: when it started, when detected, when triaged, when fixed, when resolved. Include alert IDs and response times.

↓

3. Root Cause

The technical root cause (not "human error"). "A firmware update to NVLink bridge chips was applied without draining active workloads, causing link renegotiation during inference."

↓

4. Impact

Quantified: requests affected, error rate, latency impact, revenue impact, error budget consumed.

↓

5. Action Items

Concrete, assigned, deadlined. "Add NVLink health check to pre-deployment validation [owner: @alice, due: May 30]."

FRONTIER: Incident Automation (2024-2025)

The frontier is AI-assisted incident response. Tools like PagerDuty AIOps and Rootly automatically correlate alerts with recent deployments, suggest runbook steps, and draft postmortem timelines. In 2025, companies are experimenting with LLM agents that can execute runbook steps autonomously (e.g., "if p99 latency > 2s and recent deployment within 30 min, auto-rollback the deployment").

Your SLO is 99.9% availability. It's day 20 of the month and you've used 80% of your error budget. The product team wants to deploy a risky new feature. What do you recommend?

Deploy immediately: 20% budget remaining is plenty
Refuse all deployments until the month resets
Deploy with a canary (1% traffic), monitor for 2 hours, and only promote if no budget is consumed
Lower the SLO so you have more budget

Chapter 8: Networking at Scale

Every token your LLM generates must travel from a GPU in a data center to a user's browser. That journey involves DNS resolution, TLS handshake, load balancer routing, service-to-service calls, and response streaming. Each hop adds latency. Each hop can fail. A staff-level infrastructure engineer understands every layer of this stack because networking problems are among the most common causes of production incidents, and among the hardest to diagnose.

CONCEPT: L4 vs. L7 Load Balancing

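Layer 4 (transport) load balancers route TCP/UDP connections. They see IP addresses and ports but cannot inspect HTTP headers, URL paths, or request bodies. They are fast (kernel-level, millions of connections) and simple. Use L4 for: database connections, raw TCP or UDP protocols, and TLS passthrough. Be careful with gRPC: an L4 balancer pins each long-lived HTTP/2 connection to one backend, so gRPC behind L4 needs client-side load balancing to spread requests evenly.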

Layer 7 (application) load balancers inspect HTTP requests. They can route based on URL path (/v1/chat goes to inference, /v1/embeddings goes to embedding service), HTTP headers (route by API key to customer-specific deployments), cookies (session affinity), and request body (route by model name). They are slower (must parse HTTP) but much more flexible. Use L7 for: API gateways, canary routing, A/B testing.

Feature	L4 (NLB/IPVS)	L7 (ALB/Envoy/NGINX)
Routing granularity	IP + port	URL, headers, body, cookies
TLS termination	Passthrough or terminate	Always terminates
Protocol support	Any TCP/UDP	HTTP/1.1, HTTP/2, WebSocket, gRPC
Latency overhead	~0.1ms	~1-5ms
Cost (AWS)	Hourly charge + ~$0.006 per NLCU-hour	Hourly charge + ~$0.008 per LCU-hour
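
To make the L7 routing rules concrete, here is a toy routing function. It is only a sketch of the decision logic a proxy such as Envoy or NGINX applies from its configuration; the upstream names, the `x-api-key` header, and the dedicated-customer table are all hypothetical.

```python
from typing import Mapping

# Path prefix -> upstream service
PATH_ROUTES = [
    ("/v1/chat", "inference"),
    ("/v1/embeddings", "embedding"),
]

# API keys with customer-specific deployments (hypothetical)
DEDICATED_UPSTREAMS = {"key-acme": "inference-acme"}


def pick_upstream(path: str, headers: Mapping[str, str]) -> str:
    """Choose an upstream from the URL path and request headers."""
    api_key = headers.get("x-api-key", "")
    # Header-based rule first: dedicated customers get their own deployment.
    if path.startswith("/v1/chat") and api_key in DEDICATED_UPSTREAMS:
        return DEDICATED_UPSTREAMS[api_key]
    # Then path-based rules.
    for prefix, upstream in PATH_ROUTES:
        if path.startswith(prefix):
            return upstream
    return "default"


print(pick_upstream("/v1/chat/completions", {}))
print(pick_upstream("/v1/chat/completions", {"x-api-key": "key-acme"}))
print(pick_upstream("/v1/embeddings", {}))
```

None of these decisions are possible at L4, because the balancer never sees the path or the headers.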

DESIGN: Service Mesh for LLM Infrastructure

Parallel's infrastructure has 40+ microservices. Each needs: mTLS encryption, retry logic, circuit breaking, rate limiting, and observability. Implementing this in every service is a nightmare. A service mesh (Istio, Linkerd) injects a sidecar proxy (Envoy) alongside every pod. The sidecar handles all cross-cutting concerns transparently.

The tradeoff: each sidecar adds ~10MB memory and ~1ms latency per hop. For CPU services, this is fine. For GPU inference pods, you might skip the sidecar and handle networking directly to avoid the extra millisecond on every token stream.

DESIGN: SSE/WebSocket Streaming Architecture

LLM responses are streamed token-by-token via Server-Sent Events (SSE) or WebSockets. Each connection stays open for 5-60 seconds. This is fundamentally different from traditional HTTP request-response and creates unique infrastructure challenges:

Connection limits: Each open stream holds a file descriptor and a slot in the load balancer's connection pool. 10,000 concurrent streams = 10,000 open connections. Your Envoy default of 1024 connections per upstream cluster will be exhausted quickly.

Idle timeout traps: Some load balancers (AWS ALB) have a default 60-second idle timeout. If token generation stalls for 60 seconds (e.g., waiting for a tool call result), the connection is killed. Fix: send an SSE comment line as a keepalive (": keepalive" followed by a blank line) every 15 seconds during idle periods, or raise the idle timeout.

Backpressure: If the client cannot consume tokens as fast as the model generates them (slow network), the TCP send buffer fills up. Without backpressure handling, the server buffers unboundedly and runs out of memory. Fix: monitor TCP send buffer per connection, pause generation when buffer exceeds threshold.

python
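import asyncio
import json

from fastapi.responses import StreamingResponse

KEEPALIVE_INTERVAL_S = 15

# SSE streaming with keepalive comments whenever the next token is slow
# (for example during a tool call). Backpressure is delegated to the ASGI
# server: the generator is only advanced after the previous chunk was sent.
async def stream_response(request, engine):
    async def generate():
        tokens = engine.generate(request.prompt).__aiter__()
        pending = asyncio.ensure_future(tokens.__anext__())
        try:
            while True:
                # asyncio.wait does not cancel the task on timeout, so the
                # in-progress token fetch survives each keepalive
                done, _ = await asyncio.wait({pending}, timeout=KEEPALIVE_INTERVAL_S)
                if not done:
                    yield ": keepalive\n\n"
                    continue
                try:
                    token = pending.result()
                except StopAsyncIteration:
                    break  # Generation finished
                yield f"data: {json.dumps({'token': token})}\n\n"
                pending = asyncio.ensure_future(tokens.__anext__())
        finally:
            pending.cancel()  # No-op if the task already completed

        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"}
    )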

CODE: gRPC Load Balancing

python
# gRPC uses HTTP/2 with persistent connections. A naive L4 LB
# assigns one connection per client → all requests from that client
# hit one backend → load imbalance. Solution: L7 or client-side LB.

import grpc
from grpc import aio

async def create_channel(endpoints: list[str]) -> aio.Channel:
    """Create a gRPC channel with client-side round-robin LB."""
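    # Format: "dns:///my-service.internal:50051"
    # gRPC resolves DNS, gets multiple IPs, and round-robins across them.
    # In Kubernetes this needs a headless Service (clusterIP: None); a regular
    # Service resolves to one virtual IP, leaving nothing to round-robin over.
    # Only the first endpoint is used; it should be a name that returns all backends.
    target = f"dns:///{endpoints[0]}"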

    channel = aio.insecure_channel(
        target,
        options=[
            ("grpc.lb_policy_name", "round_robin"),
            ("grpc.dns_min_time_between_resolutions_ms", 5000),
            ("grpc.keepalive_time_ms", 10000),
            ("grpc.keepalive_timeout_ms", 5000),
            ("grpc.max_send_message_length", 50 * 1024 * 1024),  # 50MB
        ]
    )
    return channel

# For inference: use "pick_first" for sticky sessions (KV cache affinity)
# For stateless services: use "round_robin" for even distribution

DEBUG: Connection Pool Exhaustion

Symptom: intermittent 503 errors, but backend services are healthy. The load balancer reports "no healthy upstream." Investigation: the L7 load balancer (Envoy) has a connection pool limit of 1024 per upstream host. During a traffic spike, each long-lived SSE stream for token streaming holds a connection open for 5-30 seconds. By Little's law, 1024 connections ÷ 10-second average stream duration ≈ 102 new streams per second per upstream. With 5 upstream pods, your ceiling is about 512 req/s. Beyond that, connections queue and time out.

Fix: increase max_connections in Envoy cluster config, add more upstream pods, or implement HTTP/2 multiplexing (multiple requests per connection).
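
The capacity math follows from Little's law (concurrent connections = arrival rate × average duration) and is easy to turn into a sizing helper. The function names are illustrative, and the 1,000 req/s target is a hypothetical example.

```python
import math


def max_stream_rate(max_connections: int, avg_stream_seconds: float,
                    upstream_hosts: int = 1) -> float:
    """Highest sustainable rate of new streams per second."""
    return max_connections / avg_stream_seconds * upstream_hosts


def connections_needed(target_rate: float, avg_stream_seconds: float) -> int:
    """Concurrent connections required to sustain `target_rate` streams/s."""
    return math.ceil(target_rate * avg_stream_seconds)


per_host = max_stream_rate(1024, avg_stream_seconds=10)                    # 102.4
fleet = max_stream_rate(1024, avg_stream_seconds=10, upstream_hosts=5)     # 512.0
needed = connections_needed(target_rate=1000, avg_stream_seconds=10)       # 10000

print(f"{per_host:.1f} streams/s per host, {fleet:.0f} across 5 hosts")
print(f"{needed} concurrent connections needed for 1,000 streams/s")
```

Running the second function against your peak traffic forecast tells you whether raising `max_connections` alone is enough or whether more upstream pods are needed as well.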

FRONTIER: eBPF-based Networking (2024-2025)

Cilium replaces kube-proxy and iptables with eBPF programs that run directly in the Linux kernel. Benefits: much faster packet processing on clusters with many Services (up to 10x is the commonly quoted figure, because eBPF hash lookups replace sequential iptables rule evaluation), kernel-level observability (packets traced without application instrumentation), and network policies enforced in the kernel without sidecar proxies. As of 2025, Cilium underpins GKE Dataplane V2, and sidecarless designs are gaining ground on the per-pod sidecar model for service mesh functionality.

DESIGN: DNS and Service Discovery

Every microservice needs to find every other microservice. In Kubernetes, you have two options:

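Kubernetes DNS (CoreDNS): Every Service gets a DNS name: inference-server.production.svc.cluster.local. Pods resolve this via the cluster DNS. Simple, but DNS has TTL caching issues. A regular ClusterIP Service resolves to a stable virtual IP, so the problem shows up mainly with headless Services, which resolve straight to pod IPs: when a pod dies and is replaced, clients may still have the old IP cached for 5-30 seconds. For inference serving where each request takes 500ms, a client sending requests back to back through 30 seconds of stale DNS sees ~60 failed requests.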

Service mesh (Envoy/Istio): The sidecar proxy maintains a real-time list of healthy endpoints via the control plane. No DNS caching issues. When a pod dies, the proxy knows within 1-2 seconds and stops routing to it. Better health checking (L7 health probes, not just TCP), but adds complexity and latency.

The hybrid approach for Parallel: Kubernetes DNS for stateless services (low sensitivity to endpoint staleness), Envoy service mesh for inference routing (needs sub-second failover), and external DNS (Route53/Cloud DNS) for cross-region service discovery.

CODE: Connection Pool Manager

python
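import asyncio
import time
from contextlib import asynccontextmanager

class ConnectionPool:
    """Generic async connection pool with an idle-time limit.
    Prevents connection exhaustion, the #1 cause of 503 errors."""

    def __init__(self, factory, max_size: int = 50,
                 max_idle_time: float = 300):
        self.factory = factory  # async callable that creates a connection
        self.max_size = max_size
        self.max_idle = max_idle_time
        # Holds (connection, time_returned) pairs for idle connections
        self._pool: asyncio.Queue = asyncio.Queue(maxsize=max_size)
        self._size = 0
        self._lock = asyncio.Lock()

    async def _discard(self, conn):
        await conn.close()
        async with self._lock:
            self._size -= 1

    async def _get(self):
        while True:
            try:
                conn, returned_at = self._pool.get_nowait()
            except asyncio.QueueEmpty:
                async with self._lock:
                    if self._size < self.max_size:
                        conn = await self.factory()
                        self._size += 1
                        return conn
                # At capacity: wait for another caller to return a connection
                conn, returned_at = await asyncio.wait_for(self._pool.get(), timeout=5.0)
            if time.monotonic() - returned_at <= self.max_idle:
                return conn
            await self._discard(conn)  # Idle too long: close it and try again

    @asynccontextmanager
    async def acquire(self):
        conn = await self._get()
        try:
            yield conn
        except BaseException:
            # Never reuse a connection that failed or was cancelled mid-request
            await self._discard(conn)
            raise
        self._pool.put_nowait((conn, time.monotonic()))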

Connection pools are everywhere. Database connections, Redis connections, gRPC channels, HTTP persistent connections — every external dependency needs a pool. Without pooling, each request creates a new TCP connection (3-way handshake = 1-2ms), negotiates TLS (another 2-5ms), and tears it down after use. With pooling, the connection is reused, saving 3-7ms per request. At 10K req/s, that is 30-70 seconds of cumulative latency saved every second.

You use gRPC (HTTP/2) with an AWS NLB (L4). All traffic goes to one backend pod even though you have 5 pods. Why?

The NLB is misconfigured
HTTP/2 uses persistent connections; the L4 LB binds one connection to one backend, and gRPC multiplexes all requests over that single connection
gRPC doesn't support load balancing
The other 4 pods have failed health checks

Chapter 9: CI/CD & Developer Tooling

Parallel has 80 engineers deploying 20+ times per day. Each deployment touches infrastructure that serves millions of requests. A broken deployment means revenue loss. A slow deployment pipeline means engineering velocity loss. The CI/CD system must be fast (under 10 minutes from merge to production), safe (catch errors before they reach production), and recoverable (rollback in under 60 seconds).

CONCEPT: Deployment Strategies

Four deployment strategies cover most situations. Each has different risk, speed, and complexity tradeoffs:

Rolling update: Replace pods one at a time. Old and new versions run simultaneously during rollout. Simple, but if the new version has a bug, it serves some traffic before you detect it and roll back. K8s default.

Blue-green: Run two identical environments (blue = current, green = new). Deploy to green, test, then switch the load balancer from blue to green. Instant rollback by switching back. Downside: 2x resource cost during transition.

Canary: Route 1% of traffic to the new version. Monitor error rate and latency for 15-30 minutes. If healthy, gradually increase to 5%, 25%, 50%, 100%. If unhealthy at any stage, roll back instantly. This is the gold standard for high-risk deployments.

Feature flags: Deploy the new code to 100% of servers but hide it behind a flag. Enable the flag for 1% of users. This decouples deployment (code on servers) from release (feature visible to users). You can turn off a broken feature without redeploying.

Canary for infrastructure, feature flags for product. At staff level, you combine strategies. A new vLLM version? Canary deployment (1% traffic, monitor GPU metrics for 30 minutes). A new API parameter? Feature flag (deploy to all, enable for beta customers first). Infrastructure changes need canary because they cannot be flag-gated — a different binary is running. Product changes should use feature flags because they are faster to toggle.
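
Percentage-based feature flags are usually implemented with a stable hash, so the same user always gets the same answer and raising the percentage only ever adds users. The sketch below assumes string user IDs; `is_enabled` and the flag name are hypothetical, not a specific flag service's API.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically place a user in or out of a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets (0.01% granularity).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rollout_percent * 100


users = [f"user-{i}" for i in range(10_000)]
at_1 = {u for u in users if is_enabled("new-api-param", u, 1)}
at_5 = {u for u in users if is_enabled("new-api-param", u, 5)}

print(f"{len(at_1)} users at 1%, {len(at_5)} users at 5%")
print("1% cohort stays enabled at 5%:", at_1 <= at_5)
```

Including the flag name in the hash keeps cohorts independent, so the same 1% of users are not the test group for every new feature.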

DEBUG: Deployment-Caused Regressions

The most common production incident: "We deployed, now things are broken." The investigation framework:

Step 1: Confirm causation. Check if the metric degradation started within 5 minutes of the deployment. If yes, strong correlation. If it started 2 hours later, probably coincidence. Check the deploy log: kubectl rollout history deployment/inference-server.

Step 2: Identify the change. What exactly changed? New container image tag, new environment variable, new ConfigMap, new resource limits? Compare the two revisions with kubectl rollout history deployment/inference-server --revision=<n>, or check the Git commit that triggered the deploy. (kubectl diff compares a local manifest against the live cluster, so it helps only if you still have the previous manifest.)

Step 3: Quantify impact. Before rolling back (which might be disruptive), assess: Is the SLO breached? Is the error budget burning faster than the window allows? If p99 went from 400ms to 500ms but the SLO is 800ms, you have time to investigate. If p99 went from 400ms to 2s, roll back immediately.

Step 4: Roll back or fix forward. For infrastructure changes, always roll back first (it is safer than debugging under pressure). For application bugs, sometimes a forward fix is faster if the bug is obvious (e.g., typo in a config value).

bash
# Instant rollback to previous revision
kubectl rollout undo deployment/inference-server

# Rollback to a specific revision
kubectl rollout undo deployment/inference-server --to-revision=42

# Check rollback status
kubectl rollout status deployment/inference-server

Step 5: Postmortem. Even if the fix was trivial, write a postmortem. "Config typo caused 5 minutes of elevated latency" becomes a systemic improvement: "Add config validation to the CI pipeline so typos are caught before deploy."

CONCEPT: Immutable Infrastructure

Traditional infrastructure: deploy code updates to running servers. The server accumulates state over time (package versions, config drift, temp files). After 6 months, no two servers are identical even though they started from the same image. This is called configuration drift and it causes "works on server A but not server B" bugs.

Immutable infrastructure: never update a running server. Instead, build a new container image with the new code, deploy new pods from the new image, and destroy the old pods. Every pod starts from the exact same image, always. There is no drift because there is no mutation.

This is why containers (Docker) and orchestrators (K8s) dominate modern infrastructure. The container image is your artifact of truth. If it works in CI, it works in production, because the environment is identical. If a server is acting weird, you do not debug it — you kill it and let K8s spawn a fresh one from the known-good image.

DESIGN: GPU Model Deployment Pipeline

Deploying a new model version to a GPU cluster is not like deploying a web app. Model weights are 140GB+. Loading them into GPU memory takes 2-5 minutes. You cannot tolerate 5 minutes of downtime during deployment.

1. Model Registry

New model version uploaded to S3/GCS with SHA256 checksum. Registry records metadata: size, quantization, tensor parallel config, benchmark results.

↓

2. Pre-warm

Launch new GPU pods alongside existing ones. Download weights (S3 transfer, ~3 min for 140GB with 6 Gbps). Load into GPU memory. Run warmup inference to populate CUDA caches.

↓

3. Canary

Route 1% of traffic to new pods. Compare latency, throughput, and output quality (LLM-as-judge) against baseline for 15 minutes.

↓

4. Promote or Rollback

If metrics are within SLO bounds, shift all traffic. Drain old pods (wait for in-flight requests to complete, up to 60s). Terminate old pods.
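
Two small pieces of the pre-warm step are easy to sketch: estimating the download time and verifying the SHA256 checksum recorded in the registry before loading weights onto the GPU. The helper names are illustrative, and the demo uses a small temporary file in place of real weights.

```python
import hashlib
import os
import tempfile


def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Ideal transfer time: gigabytes * 8 bits per byte / gigabits per second."""
    return size_gb * 8 / link_gbps


def sha256_of_file(path: str, chunk_bytes: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: str, expected_sha256: str) -> bool:
    return sha256_of_file(path) == expected_sha256.lower()


print(f"140GB at 6 Gbps: {transfer_seconds(140, 6) / 60:.1f} min")

# Demo with a stand-in file instead of real model weights
payload = b"fake-model-weights" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
registry_checksum = hashlib.sha256(payload).hexdigest()
checksum_ok = verify_weights(tmp.name, registry_checksum)
corrupt_detected = not verify_weights(tmp.name, "0" * 64)
os.remove(tmp.name)
print(checksum_ok, corrupt_detected)
```

Failing the pod before it joins the canary is much cheaper than discovering a truncated download through garbled model output.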

CODE: Canary Deployment Controller

python
import time

class CanaryController:
    def __init__(self, baseline_metrics: dict, threshold: float = 1.1):
        self.baseline = baseline_metrics  # {"p99_ms": 400, "error_rate": 0.001}
        self.threshold = threshold         # 10% worse = rollback

    def check_canary(self, canary_metrics: dict) -> bool:
        """Return True if canary is healthy."""
        for key, baseline_val in self.baseline.items():
            if key not in canary_metrics:
                # A missing metric must fail the check, not silently pass
                print(f"CANARY FAIL: metric {key} is missing")
                return False
            canary_val = canary_metrics[key]
            limit = baseline_val * self.threshold
            if canary_val > limit:
                print(f"CANARY FAIL: {key} = {canary_val} > {limit}")
                return False
        return True

    def progressive_rollout(self, stages=(1, 5, 25, 50, 100), observe_s=900):
        for pct in stages:
            self._set_traffic_split(pct)
            print(f"Canary at {pct}%, monitoring for {observe_s // 60} min...")
            time.sleep(observe_s)  # Observation window (15 min by default)
            metrics = self._fetch_canary_metrics()
            if not self.check_canary(metrics):
                self._set_traffic_split(0)  # Instant rollback
                print("ROLLBACK: canary failed, traffic at 0%")
                return False
        print("SUCCESS: canary promoted to 100%")
        return True

    def _set_traffic_split(self, pct: int) -> None:
        """Placeholder: update mesh or ingress weights for the canary."""
        raise NotImplementedError

    def _fetch_canary_metrics(self) -> dict:
        """Placeholder: query your metrics backend for the canary pods."""
        raise NotImplementedError

DEBUG: Slow Builds

Your CI pipeline takes 25 minutes. Engineers stack PRs and context-switch. Feedback is slow. Common culprits: Docker image builds that re-download 5GB of dependencies on every run (fix: layer caching, multi-stage builds, pre-built base images), integration tests that spin up real databases (fix: use testcontainers with cached images or mock layers), and sequential test suites (fix: parallelize with pytest-xdist or Bazel test sharding).

CODE: Optimized Dockerfile for GPU Inference

dockerfile
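# Multi-stage build for vLLM inference container
# Stage 1: Build wheels (cached unless requirements change)
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS builder
# The CUDA base images do not ship Python, so install it first
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
# Build wheels for vLLM and the app requirements in one resolver pass
RUN pip3 wheel --no-cache-dir -w /wheels vllm==0.6.0 -r requirements.txt

# Stage 2: Runtime (smaller image, no build tools)
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
# python3 runs the server; curl is needed by the health check below
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /wheels /wheels
RUN pip3 install --no-cache-dir /wheels/*.whl && rm -rf /wheels

# Model weights: DON'T bake into image (140GB → huge image)
# Instead, mount from shared volume or download at startup
ENV MODEL_PATH=/models/llama-70b
ENV HF_HUB_CACHE=/models/.cache

# Health check endpoint; start-period covers the 2-5 minute weight load
HEALTHCHECK --interval=10s --timeout=5s --start-period=300s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
CMD ["--model", "/models/llama-70b", "--tensor-parallel-size", "2"]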

Never bake model weights into Docker images. A 70B model is 140GB. Your Docker image would be 150GB+. Pushing it to a registry takes 30+ minutes. Instead, store weights in S3/GCS, mount them via a PersistentVolume in K8s, or use a shared NFS volume. The container starts, finds the weights already on disk (pre-pulled by an init container), and loads them into GPU memory. Image size stays under 5GB, deploys stay under 5 minutes.

DESIGN: Developer Environment Strategy

Parallel's engineers need to test inference changes locally, but they do not have H100 GPUs on their laptops. The strategy has three tiers:

Tier	What	GPU	Use case
Local	Docker Compose + CPU-only vLLM	None	API contract testing, UI development
Dev cluster	Shared K8s namespace with 2x A10G	A10G (24GB)	Small model testing (7B), integration tests
Staging	Full replica of prod with 4x H100	H100	Performance benchmarks, canary validation

The key insight: local development does not need real GPU inference. Mock the inference endpoint with a service that returns canned responses at realistic latency. Engineers only need real GPUs when testing model-specific behavior, which happens in the dev cluster or staging.
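
A mock of this kind can be very small. The sketch below is one possible shape, assuming an async streaming interface; `mock_generate`, `collect`, the canned tokens, and the default latencies are all illustrative placeholders rather than anything from vLLM or Parallel's codebase.

```python
import asyncio
import time

CANNED_TOKENS = ["The", " quick", " brown", " fox", "."]

async def mock_generate(prompt: str, ttft_s: float = 0.3, itl_s: float = 0.03):
    """Yield canned tokens with a simulated prefill delay and per-token delay."""
    await asyncio.sleep(ttft_s)          # Stands in for queue wait + prefill
    for i, token in enumerate(CANNED_TOKENS):
        if i > 0:
            await asyncio.sleep(itl_s)   # Stands in for one decode step
        yield token

async def collect(prompt: str, **latency) -> tuple[str, float]:
    """Consume the stream; return the full text and elapsed wall time."""
    start = time.monotonic()
    text = "".join([tok async for tok in mock_generate(prompt, **latency)])
    return text, time.monotonic() - start
```

Calling `asyncio.run(collect("hello"))` returns the canned sentence after roughly the configured delays, which is enough to exercise client-side timeouts, streaming UI code, and API contracts without a GPU.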

CODE: Feature Flag Integration

python
from ldclient import Context
from ldclient.client import LDClient

class InferenceRouter:
    """Route requests based on feature flags.
    Decouples deployment from release."""

    def __init__(self, ld_client: LDClient):
        self.ld = ld_client

    async def route(self, request, context: Context) -> str:
        # context is built per request, e.g.
        # Context.builder(user_id).set("tier", "beta").build()

        # Check if user should get the new model version
        use_new_model = self.ld.variation(
            "llama-3.2-rollout",
            context,
            False,  # Default if the flag cannot be evaluated
        )

        # Check if speculative decoding is enabled for this customer tier
        use_spec_decode = self.ld.variation(
            "speculative-decoding",
            context,
            False,
        )

        # Route to appropriate model endpoint
        if use_new_model:
            endpoint = "inference-v2.production.svc"
        else:
            endpoint = "inference-v1.production.svc"

        # Attach decoding strategy to request metadata
        request.metadata["spec_decode"] = use_spec_decode
        return endpoint

# Feature flags enable gradual rollouts:
# Day 1: Enable for internal users (1%)
# Day 3: Enable for beta customers (5%)
# Day 7: Enable for all free-tier users (50%)
# Day 14: Enable for all users (100%)
# At any point: kill switch if quality regresses
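
Percentage rollouts like the schedule above depend on each user landing in the same bucket every time. The following is a generic sketch of that idea using a hash of the flag key and user ID; it is not LaunchDarkly's actual bucketing algorithm, and `rollout_bucket` and `is_enabled` are hypothetical names.

```python
import hashlib

def rollout_bucket(flag_key: str, user_id: str) -> float:
    """Map (flag, user) to a stable number in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100

def is_enabled(flag_key: str, user_id: str, rollout_pct: float) -> bool:
    """A user is in the rollout if their bucket falls below the percentage."""
    return rollout_bucket(flag_key, user_id) < rollout_pct
```

Because the bucket is fixed per user, raising the percentage from 5 to 50 only adds users; nobody who already had the new model is switched back mid-rollout.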

DESIGN: GitOps Workflow for Infrastructure Changes

Every infrastructure change at Parallel follows this workflow. No exceptions, no "just this once" SSH changes:

Step	What happens	Tool
1. Branch	Engineer creates a feature branch with infrastructure changes	Git
2. Plan	CI runs `terraform plan` / `pulumi preview` and posts the diff as a PR comment	GitHub Actions + Terraform
3. Review	Another engineer reviews the plan. For security group changes, requires SRE approval.	GitHub PR review
4. Apply to staging	Merge to main triggers automatic apply to staging environment	ArgoCD / Terraform Cloud
5. Validate staging	Run smoke tests against staging. Check that services are healthy.	Automated test suite
6. Apply to production	Manual approval gate. Engineer clicks "Approve" after staging validation.	ArgoCD sync + approval
7. Monitor	Watch dashboards for 30 minutes. If metrics deviate, rollback = git revert.	Grafana + PagerDuty

Git revert is the fastest rollback. If an infrastructure change causes a production issue, the fastest fix is git revert on the commit. ArgoCD sees the revert and reconciles the cluster to the previous state. No need to remember what changed or how to undo it manually. The entire rollback is auditable, reviewed, and reproducible. This is the superpower of GitOps. One caveat: a revert restores configuration, not data, so changes that delete stateful resources (databases, volumes) need a separate recovery plan.

FRONTIER: Trunk-Based Development + Merge Queues (2024-2025)

GitHub Merge Queue and Graphite's Stacked PRs represent the 2025 frontier of developer velocity. Merge Queue runs CI on the merge result (not the branch head), preventing broken main. Stacked PRs let engineers submit 5 dependent PRs and review/merge them independently. Combined with Bazel's remote cache (build only what changed), top teams achieve <5 minute CI + deploy cycles.

You need to deploy a new vLLM version. Model weights are 140GB and take 3 minutes to load. How do you avoid 3 minutes of downtime?

- Deploy during a maintenance window
- Pre-warm new pods (download weights + load to GPU) alongside old pods, then shift traffic and drain old pods
- Use a smaller model that loads faster
- Cache the weights in Redis

Chapter 10: Monitoring & Observability

You cannot fix what you cannot see. An LLM serving platform has hundreds of metrics, thousands of log streams, and millions of traces per day. The challenge is not collecting data — it is making that data useful. When a PagerDuty alert fires at 3 AM, you need to go from "something is wrong" to "this specific component failed for this specific reason" in under 5 minutes. That requires observability — not just monitoring (dashboards and alerts), but the ability to ask arbitrary questions about your system's behavior.

CONCEPT: The Three Pillars

Metrics (Prometheus, Datadog): Numeric time-series data. Request rate, error rate, latency percentiles, GPU utilization, memory usage. Cheap to store, fast to query, good for dashboards and alerts. Bad for debugging specific requests.

Logs (Loki, Elasticsearch): Unstructured or semi-structured text. Application errors, kernel messages, NVIDIA driver warnings. Good for debugging specific events. Bad for aggregation (expensive to count "how many errors per minute" from logs).

Traces (Jaeger, Tempo): Request-level timelines showing how a request flows through services. A single inference request generates spans for: API gateway (2ms), auth (1ms), model router (3ms), queue wait (50ms), prefill (180ms), decode (400ms), response serialization (2ms). Traces show you exactly where latency lives.

Metrics tell you WHAT is broken. Logs tell you WHY. Traces tell you WHERE. A complete investigation uses all three: the alert fires from a metric (p99 latency SLO breach), you look at traces to find which service is slow (the model router), then you check logs for that service to find the root cause (a DNS resolution timeout causing 2-second delays). You need all three, correlated by request ID.

DESIGN: LLM-Specific Metrics

Standard web metrics (request rate, error rate, latency) are necessary but not sufficient for LLM serving. You need LLM-specific metrics:

Metric	What it measures	Alert threshold
Time To First Token (TTFT)	Latency from request to first generated token	p99 > 2s
Inter-Token Latency (ITL)	Time between consecutive generated tokens	p99 > 100ms
Tokens/sec/GPU	Throughput per GPU (efficiency)	< 40 tok/s/GPU
KV Cache Utilization	% of PagedAttention blocks in use	> 95% (preemptions imminent)
Batch Size (running)	Number of requests in the continuous batch	< 4 (underutilized GPU)
Queue Depth	Requests waiting for batch admission	> 100 (need more GPUs)
Preemptions/min	KV cache evictions per minute	> 10/min (memory pressure)
GPU SM Utilization	Streaming multiprocessor activity (true compute usage)	< 30% (wasted GPU)
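
TTFT and ITL both fall out of per-token timestamps. The sketch below shows one way to derive them and to compute a p99 with the nearest-rank method; the function names are illustrative, and a production system would usually let Prometheus histograms do the percentile math instead.

```python
import math

def ttft_and_itl(request_start: float, token_times: list[float]) -> tuple[float, list[float]]:
    """TTFT is first token minus request start; ITL is the gap between neighbors."""
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_start
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For a request that starts at t=10.0 and emits tokens at 10.5, 10.75, and 11.0, TTFT is 0.5s and both inter-token gaps are 0.25s.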

CODE: Custom Prometheus Metrics

python
from prometheus_client import Histogram, Gauge, Counter, start_http_server

# LLM-specific metrics
TTFT = Histogram(
    "llm_time_to_first_token_seconds",
    "Time from request to first generated token",
    buckets=[0.1, 0.25, 0.5, 0.8, 1.0, 2.0, 5.0],
    labelnames=["model", "region"],
)
ITL = Histogram(
    "llm_inter_token_latency_seconds",
    "Latency between consecutive generated tokens",
    buckets=[0.01, 0.02, 0.05, 0.1, 0.2],
    labelnames=["model"],
)
KV_CACHE_UTIL = Gauge(
    "llm_kv_cache_utilization_ratio",
    "Fraction of KV cache blocks in use",
    labelnames=["gpu_id"],
)
QUEUE_DEPTH = Gauge(
    "llm_request_queue_depth",
    "Number of requests waiting for batch admission",
)
PREEMPTIONS = Counter(
    "llm_kv_cache_preemptions_total",
    "Total KV cache preemptions (evictions)",
)

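# Usage in the serving loop (await is only valid inside an async function;
# engine is your inference engine object):
async def serve(engine, request):
    with TTFT.labels(model="llama-70b", region="us-east-1").time():
        first_token = await engine.generate_first_token(request)
    return first_token

# Expose metrics on :8080/metrics for Prometheus scraping.
# Call once at process startup, before the serving loop begins.
start_http_server(8080)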

DEBUG: Alert Fatigue

The most insidious observability failure is not missing alerts — it is too many alerts. When oncall gets 50 alerts per day, they start ignoring them. The one critical alert that matters drowns in noise.

Fix: classify every alert as P1 (pages, wakes you up), P2 (ticket, fix today), or P3 (review weekly). Delete P3 alerts and convert them to dashboard panels. For P1 alerts, enforce the rule: every P1 alert must have a runbook that can be executed by any engineer, not just the alert's author. If an alert doesn't have a runbook, it is not ready to be P1.
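
The runbook rule is easy to enforce mechanically. Here is a small sketch of a lint step that could run in CI against alert definitions; the `AlertRule` shape and `lint_alerts` function are assumptions for illustration, not a Prometheus or PagerDuty API.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    priority: str          # "P1", "P2", or "P3"
    runbook_url: str = ""  # Empty means no runbook has been written

def lint_alerts(rules: list[AlertRule]) -> list[str]:
    """Return human-readable problems; an empty list means the rules pass."""
    problems = []
    for rule in rules:
        if rule.priority not in {"P1", "P2", "P3"}:
            problems.append(f"{rule.name}: unknown priority {rule.priority!r}")
        elif rule.priority == "P1" and not rule.runbook_url:
            problems.append(f"{rule.name}: P1 alert has no runbook")
        elif rule.priority == "P3":
            problems.append(f"{rule.name}: P3 alert should be a dashboard panel")
    return problems
```

Failing the build when `lint_alerts` returns anything keeps runbook-less pages from reaching the oncall rotation in the first place.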

CODE: Distributed Tracing Instrumentation

python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup: export traces to Jaeger/Tempo via OTLP
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("parallel.inference")

async def handle_inference(request):
    with tracer.start_as_current_span("inference.request") as span:
        span.set_attribute("model", request.model)
        span.set_attribute("input_tokens", request.token_count)

        # Span 1: Auth check
        with tracer.start_as_current_span("auth.validate"):
            user = await validate_api_key(request.api_key)

        # Span 2: Route to model instance
        with tracer.start_as_current_span("router.select") as rspan:
            instance = await model_router.select(request)
            rspan.set_attribute("selected_instance", instance.id)
            rspan.set_attribute("queue_depth", instance.queue_depth)

        # Span 3: Inference (the expensive part)
        with tracer.start_as_current_span("inference.generate") as ispan:
            result = await instance.generate(request.prompt)
            ispan.set_attribute("output_tokens", result.token_count)
            ispan.set_attribute("ttft_ms", result.ttft_ms)
            ispan.set_attribute("gpu_id", instance.gpu_id)

        span.set_attribute("total_latency_ms", result.total_ms)
        return result

The correlation key: request_id. Every request gets a unique ID at the edge. This ID propagates through every service via HTTP headers (X-Request-ID, alongside the W3C Trace Context traceparent header that carries the trace ID). When debugging, you search by request_id and get the complete trace: which model instance handled it, how long prefill took, whether its KV cache was preempted, and any attributes you attached to the spans, such as GPU ID or queue depth. Without request_id correlation, you are guessing.
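
One way to carry a request ID through a Python service is a context variable plus a logging filter, so every log line picks it up without passing it through each function. This is a sketch under that assumption; `start_request`, `outgoing_headers`, and the logger name are hypothetical, and header lookup is simplified to an exact-case dict key.

```python
import contextvars
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

def start_request(incoming_headers: dict) -> str:
    """Reuse the caller's X-Request-ID if present, otherwise mint one at the edge."""
    rid = incoming_headers.get("X-Request-ID") or uuid.uuid4().hex
    request_id_var.set(rid)
    return rid

def outgoing_headers() -> dict:
    """Headers to add to downstream calls so the ID keeps propagating."""
    return {"X-Request-ID": request_id_var.get()}

logger = logging.getLogger("parallel.inference.example")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
)
handler.addFilter(RequestIdFilter())
logger.addHandler(handler)
```

Context variables are isolated per asyncio task, so concurrent requests in the same process do not overwrite each other's IDs.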

DESIGN: Grafana Dashboard Layout

The oncall engineer's primary dashboard has four rows, designed for 3 AM triage when your brain is running at 50%:

Row	What	Why
Row 1: Golden Signals	Request rate, error rate, TTFT p99, throughput (tok/s)	Is the system working? Anomaly = something is wrong.
Row 2: Resource Health	GPU util per node, KV cache %, memory %, CPU %, disk I/O	Where is the bottleneck? High KV cache = memory pressure. Low GPU util = wasted money.
Row 3: Infrastructure	Node count, pod restarts, OOM kills, network errors, spot reclamation events	What changed? Node loss or pod crash explains sudden degradation.
Row 4: Business	Cost/request, requests by customer tier, error budget remaining	Should we care? P1 customer affected = all hands. Budget burned = freeze deploys.

FRONTIER: AI-Assisted Observability (2024-2025)

The 2025 frontier: LLMs analyzing your observability data. Datadog's "Bits AI", Grafana's LLM features, and startups like Coralogix use LLMs to: correlate anomalies across metrics/logs/traces, suggest root causes from historical incident data, and auto-generate PromQL/LogQL queries from natural language. The killer feature: "What changed?" — the LLM diffs all recent deployments, config changes, and traffic patterns against the anomaly timeline.

CODE: Anomaly Detection for GPU Metrics

python
import numpy as np
from collections import deque

class ZScoreAnomalyDetector:
    """Detect anomalies in GPU metrics using rolling z-score.
    Simple but effective for metrics with stable baselines."""

    def __init__(self, window: int = 120, threshold: float = 3.0):
        self.window = window      # Rolling window (data points)
        self.threshold = threshold # Z-score threshold for anomaly
        self.history: deque = deque(maxlen=window)

    def ingest(self, value: float) -> dict:
        self.history.append(value)
        if len(self.history) < 30:
            return {"anomaly": False, "reason": "warming up"}

        arr = np.array(self.history)
        mean = arr.mean()
        std = arr.std()
        if std < 1e-6:
            return {"anomaly": False, "z_score": 0}

        z = (value - mean) / std
        is_anomaly = abs(z) > self.threshold
        return {
            "anomaly": is_anomaly,
            "z_score": round(z, 2),
            "value": value,
            "mean": round(mean, 2),
            "direction": "high" if z > 0 else "low",
        }

# Usage: detect GPU utilization anomalies
detector = ZScoreAnomalyDetector(window=120, threshold=3.0)
# Feed every 10-second GPU utilization sample
result = detector.ingest(gpu_util_pct)
if result["anomaly"]:
    alert(f"GPU util anomaly: {result['value']}% (z={result['z_score']})")

DESIGN: Log Aggregation Pipeline

Parallel generates ~100GB of logs per day from 40+ services. The pipeline must handle this volume without losing messages or creating backpressure on the services producing logs:

1. Collection

Fluent Bit DaemonSet on every K8s node. Tails container stdout/stderr. Adds metadata (pod name, namespace, node, GPU ID). Memory-buffered with filesystem fallback.

↓

2. Enrichment

Fluent Bit routes to a Kafka topic. A stream processor (Flink or Vector) parses structured fields, extracts request IDs, and drops noisy debug logs from non-production namespaces.

↓

3. Storage

Two sinks: Loki for real-time queries (last 7 days, indexed by labels), S3/Parquet for long-term retention (90 days, queryable via Athena). Total storage cost: ~$500/month for 100GB/day.

↓

4. Query

Grafana connects to Loki. Engineers search by request_id, pod, error level. Dashboard shows log volume by service, error rate trends, and top-N error messages.
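
The enrichment stage (step 2) is mostly parse, tag, and filter. The sketch below shows that logic for a single log line, assuming JSON-formatted application logs and a namespace literally named "production"; in practice this would be expressed in the stream processor's own configuration language rather than Python.

```python
import json
from typing import Optional

NOISY_LEVELS = {"debug", "trace"}

def enrich(raw_line: str, namespace: str) -> Optional[dict]:
    """Parse one log line; return the enriched record, or None to drop it."""
    try:
        record = json.loads(raw_line)
        if not isinstance(record, dict):
            raise ValueError("not a JSON object")
    except ValueError:
        # Unstructured text: keep it, wrapped so downstream schemas stay uniform
        record = {"message": raw_line.rstrip("\n"), "level": "unknown"}
    level = str(record.get("level", "unknown")).lower()
    if level in NOISY_LEVELS and namespace != "production":
        return None  # Drop noisy logs from non-production namespaces
    record["level"] = level
    record["namespace"] = namespace
    record.setdefault("request_id", None)
    return record
```

Keeping unparseable lines instead of dropping them matters for kernel and driver messages, which are rarely JSON but are often exactly what you need during a GPU incident.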

Your TTFT p99 spiked from 500ms to 3s. GPU utilization is at 95%. KV cache utilization is at 98%. What is the most likely root cause?

- The model weights are corrupted
- The network is slow
- KV cache is nearly full, causing preemptions and queue buildup; new requests wait for cache space
- The Prometheus scrape interval is too slow

Chapter 11: Infrastructure Dashboard — Showcase

Everything you have learned comes together in this interactive simulation. You are running a GPU inference cluster serving LLM requests at scale. You control the infrastructure parameters. The simulation shows the consequences of your decisions in real time — throughput, latency, GPU utilization, cost per request. Break the system by creating traffic spikes. Fix it by adjusting resources. Find the optimal configuration that maximizes throughput while staying under your cost budget.

This is your system design interview, live. Every slider maps to a real-world decision. In an interview, you would explain the tradeoffs of each parameter. Here, you can see those tradeoffs play out. Pay attention to how metrics interact: increasing batch size improves throughput but worsens latency. Adding GPUs improves both but increases cost. Quantization reduces cost but may affect quality. There is no free lunch.

Live Infrastructure Dashboard

Control a GPU cluster serving LLM requests. Adjust parameters and watch throughput, latency, GPU utilization, and cost respond in real time. Try to maximize throughput while keeping cost per request under $0.01.

Default settings: GPU Nodes 4, GPU Type H100, Max Batch Size 128, Traffic 200 req/s.

How To Read The Dashboard

Throughput (tok/s): Total tokens generated per second across all GPUs. Higher is better. Limited by GPU memory bandwidth and batch efficiency.

TTFT p99 (ms): 99th percentile time to first token. Lower is better. Spikes when queue depth is high (requests waiting for batch admission). Your SLO is 800ms.

GPU Utilization (%): Average streaming multiprocessor activity. Target: 60-85%. Below 40% means wasted money. Above 90% means you are at capacity with no headroom for spikes.

Cost/request ($): Total GPU cost divided by requests served. Your budget is $0.01 per request. This includes idle GPU time — so underutilized GPUs increase this even if individual requests are efficient.

KV Cache (%): PagedAttention cache utilization. Above 90% triggers preemptions and latency spikes. Affected by batch size and sequence length.

Understanding The Model

The simulation models the core physics of LLM inference infrastructure. Here is how each parameter affects the metrics:

GPU Nodes: More nodes = more model instances = higher throughput and lower latency. But also higher cost. The relationship is roughly linear for throughput (2x nodes ≈ 2x throughput) but sublinear for latency (adding nodes only helps if the current ones are overloaded).

GPU Type: A100 → H100 → B200 increases memory bandwidth (2.0 → 3.35 → 8.0 TB/s), which directly increases decode speed. The B200 generates tokens ~2.4x faster than H100 because it reads model weights 2.4x faster. But it also costs ~1.8x more per hour. The cost/performance sweet spot depends on your traffic volume.

Max Batch Size: Larger batches amortize the weight-read cost across more requests, improving GPU utilization and throughput. But larger batches increase per-request latency because each decode step takes longer (more sequences to process and more KV cache to read per step). The optimal batch size balances throughput against your latency SLOs (TTFT and inter-token latency).

INT4 Quantization: Reduces model memory from 140GB to 35GB, allowing more model instances per node and/or freeing memory for KV cache. Throughput increases because less data is read from memory per token. Quality decreases by ~1-3% on benchmarks (usually acceptable for inference).
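
The GPU Type and quantization effects follow from one simplification: decode is memory-bandwidth bound, so each decode step costs roughly one full read of the weights. The sketch below encodes only that assumption, using the bandwidth figures quoted above; it ignores KV cache reads, compute limits, and inter-GPU communication, so treat the results as upper bounds.

```python
BANDWIDTH_TB_S = {"A100": 2.0, "H100": 3.35, "B200": 8.0}  # Figures from the text
BYTES_PER_PARAM = {"fp16": 2.0, "int4": 0.5}

def weight_gb(params_billion: float, dtype: str) -> float:
    """Model weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]

def decode_steps_per_s(gpu: str, params_billion: float, dtype: str, n_gpus: int = 1) -> float:
    """Upper bound on decode steps/s if every step reads all weights once."""
    bytes_per_s = BANDWIDTH_TB_S[gpu] * 1e12 * n_gpus
    return bytes_per_s / (weight_gb(params_billion, dtype) * 1e9)
```

Under this model the B200-to-H100 ratio is 8.0 / 3.35, about 2.4x, and moving a 70B model from FP16 (140GB) to INT4 (35GB) raises the bound by 4x. Each step produces one token per sequence in the batch, which is why batching multiplies throughput until KV cache reads or compute take over.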

Challenges

Challenge	Goal	Hint
Cost Efficient	Handle 200 req/s at <$0.008/req	Fewer GPUs with higher utilization + INT4
Low Latency	TTFT p99 < 300ms at 100 req/s	More GPUs with smaller batch size
Survive the Spike	Handle 3x traffic spike without p99 > 2s	Headroom in GPU count, moderate batch size
Maximum Throughput	Achieve highest tok/s possible	Max GPUs + max batch + B200 + INT4

You need to serve 500 req/s at p99 TTFT < 800ms with minimum cost. Rank these actions from most to least impactful for reducing cost per request:

- Add more GPUs > increase batch size > enable quantization
- Enable INT4 quantization (halve GPU count) > increase batch size (improve utilization) > use spot instances (reduce per-GPU cost)
- Use spot instances > add more GPUs > reduce batch size
- Reduce batch size > add more GPUs > switch to A100s

Chapter 12: Interview Arsenal

You have now covered the entire infrastructure stack for an LLM scaling engineer. This chapter distills everything into an interview-ready arsenal: system design templates, coding patterns, debugging frameworks, and the specific questions you will face.

System Design Cheat Sheet

Every system design interview follows the same structure. Here is your framework:

1. Requirements (5 min)

Clarify functional requirements (what does the system do?), non-functional requirements (latency, throughput, availability SLO), and constraints (budget, team size, timeline). Write numbers on the whiteboard: "100K req/s, p99 < 500ms, 99.99% uptime, $50K/month GPU budget."

↓

2. Capacity Estimation (5 min)

Back-of-envelope math. "100 req/s × 500 tokens/req avg = 50K tok/s. At 80 tok/s/GPU unbatched, that is 625 GPUs. With batching (8x efficiency), ~80 GPUs. At $10/hr/GPU = $800/hr ≈ $576K/month (720 hours)." This demonstrates infrastructure intuition.

↓

3. High-Level Design (10 min)

Draw the boxes: Edge → LB → API Gateway → Model Router → Inference Cluster → Response. Name specific technologies for each box. Explain data flow with arrows.

↓

4. Deep Dive (15 min)

Interviewer picks a box. Go deep: data structures, algorithms, failure modes, scaling strategy. For the inference cluster: explain vLLM, continuous batching, PagedAttention, tensor parallelism, KV cache management.

↓

5. Tradeoffs & Extensions (5 min)

Discuss what you would do differently with 10x more budget, 10x more traffic, or 10x stricter latency. Mention multi-region, read replicas, caching, CDN, speculative decoding.

Category	Question	Key Points
System Design	"Design a multi-region LLM serving platform"	Global LB, regional K8s, model router, vLLM, KV cache, failover strategy, data replication
System Design	"Design a GPU cluster scheduler"	SLURM vs K8s, preemption, topology-aware scheduling, MIG, resource broker between training and inference
Coding	"Implement a consistent hash ring"	Virtual nodes, bisect for O(log n) lookup, minimal redistribution on node add/remove
Coding	"Write a circuit breaker"	Three states (closed/open/half-open), failure counting, timeout-based recovery
Coding	"Write a rate limiter (token bucket)"	Atomic token update, burst allowance, distributed rate limiting with Redis
Debug	"p99 latency doubled after a deployment"	Check recent deploys, compare metrics before/after, look at traces for slow span, check resource utilization
Debug	"GPU utilization dropped to 10%"	nvidia-smi for hardware errors, scheduler for job failures, NVLink status, driver/CUDA version mismatch
Frontier	"How would you serve a 405B model?"	8-GPU tensor parallelism + pipeline parallelism across nodes, FP8 quantization, speculative decoding, disaggregated prefill/decode
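
The consistent hash ring question in the table above can be answered in about 30 lines. The following is one possible sketch, assuming SHA-256 as the hash and 100 virtual nodes per physical node; both choices are illustrative defaults rather than requirements.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes
        self._hashes: list[int] = []       # Sorted virtual-node positions
        self._owner: dict[int, str] = {}   # Position -> physical node

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            if h in self._owner:
                continue  # Vanishingly rare 64-bit collision: keep first owner
            bisect.insort(self._hashes, h)
            self._owner[h] = node

    def remove_node(self, node: str) -> None:
        self._hashes = [h for h in self._hashes if self._owner[h] != node]
        self._owner = {h: n for h, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._hashes:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key, wrapping at the end
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owner[self._hashes[idx]]
```

Lookup is O(log n) via bisect. When a node is added, the only keys that move are the ones the new node takes over, which is the "minimal redistribution" property interviewers ask about.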

Coding Drills

These are the 5 coding problems most likely to appear in infrastructure interviews. Practice each until you can write it from memory in under 10 minutes:

1. Rate Limiter (Token Bucket):

python
import threading
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate                 # Tokens per second
        self.capacity = capacity         # Max burst size
        self.tokens = float(capacity)    # Current tokens
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()    # Makes refill + spend atomic across threads

    def allow(self, tokens: int = 1) -> bool:
        with self._lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

limiter = TokenBucket(rate=100, capacity=200)  # 100 req/s, burst of 200

def handle(request):
    # process_request and HttpResponse come from your web framework
    if limiter.allow():
        return process_request(request)
    return HttpResponse(status=429)  # Too Many Requests

2. LRU Cache (for KV cache eviction):

python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, key: str):
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # Mark as recently used
        return self.cache[key]

    def put(self, key: str, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # Evict LRU

Debugging Framework: The 5-Step Investigation

Step	Action	Tools
1. Detect	What metric violated its SLO?	PagerDuty alert, Grafana dashboard
2. Correlate	What changed? Recent deploys, config changes, traffic patterns	Deploy log, git log, traffic graphs
3. Isolate	Which component is responsible? Follow the trace.	Jaeger traces, service-level metrics
4. Root cause	Why is that component broken?	Logs, nvidia-smi, kubectl describe, dmesg
5. Fix + Prevent	Fix the immediate issue, then add guardrails	Rollback, hotfix, new alert, postmortem

Numbers to Know

Fact	Value
H100 memory bandwidth	3.35 TB/s
B200 memory bandwidth	8.0 TB/s
70B FP16 model size	140 GB
70B INT4 model size	~35 GB
KV cache per token (70B, FP16)	~0.3 MB with GQA (80 layers, 8 KV heads); ~2.6 MB with full multi-head attention
vLLM continuous batching improvement	2-10x vs static
Speculative decoding speedup	1.5-3x
Spot instance discount	60-90% off on-demand
99.99% uptime = downtime per year	52.6 minutes
99.9% uptime = downtime per year	8.76 hours
PCIe Gen5 bandwidth	64 GB/s (per direction)
NVLink 4.0 bandwidth (H100)	900 GB/s total
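
Several of these numbers are quick to re-derive, which is safer than memorizing them. A minimal sketch, assuming a 365-day year and 1 GB = 1e9 bytes; the function names are illustrative.

```python
def downtime_per_year(availability_pct: float) -> tuple[float, float]:
    """Allowed downtime for an availability target, as (minutes, hours) per 365-day year."""
    hours = (1 - availability_pct / 100) * 365 * 24
    return hours * 60, hours

def model_size_gb(params_billion: float, bits_per_param: int) -> float:
    """Weight footprint: parameters times bytes per parameter."""
    return params_billion * bits_per_param / 8
```

`downtime_per_year(99.99)` gives about 52.6 minutes and `downtime_per_year(99.9)` about 8.76 hours, matching the table. `model_size_gb(70, 16)` and `model_size_gb(70, 4)` give 140 and 35.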

Capacity Estimation Template

Every system design interview starts with back-of-envelope math. Here is the template for LLM infrastructure:

text
Given:
  - Target: 100K req/s
  - Average request: 500 input tokens + 200 output tokens = 700 tokens
  - Model: 70B FP16, needs 2x H100 (tensor parallel)
  - Single 2-GPU instance throughput: ~80 tok/s (decode)
  - With continuous batching (batch=64): ~1500 tok/s effective

Calculation:
  Total tokens needed: 100K req/s × 700 tok/req = 70M tok/s
  Model instances needed: 70M / 1500 = ~47,000 instances
  GPUs needed: 47,000 × 2 = 94,000 H100s

Wait — that's obviously wrong. 100K req/s is PEAK, not sustained.
Average is probably 20K req/s. And requests take 3-5 seconds,
so concurrency is 20K × 4s = 80K concurrent, not 20K/s × 700 tok.

Corrected:
  Concurrent requests: 80K
  Requests per GPU instance: 64 (batch size)
  GPU instances needed: 80K / 64 = 1,250
  GPUs needed: 1,250 × 2 = 2,500 H100s
  Cost: 2,500 × $10/hr = $25,000/hr = $18.25M/month

With INT4 quantization (1 GPU per instance):
  GPU instances needed: 1,250
  GPUs needed: 1,250
  Cost: $12,500/hr = $9.1M/month (50% savings)

Show your work, fix your mistakes. In the calculation above, the first attempt was wrong. That is fine — and even good. Interviewers want to see that you catch errors in your own reasoning. The corrected calculation shows understanding of concurrency vs. throughput, which is a staff-level distinction. Always sanity-check: "Does 94,000 H100s seem reasonable? No. What did I get wrong?"
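
The corrected calculation is an application of Little's Law (concurrent requests = arrival rate × time in system). The sketch below reproduces it so the inputs can be varied; `size_cluster` is a hypothetical helper, and the 730 hours per month is an assumption chosen because it matches the $18.25M figure in the template.

```python
import math

def size_cluster(avg_req_per_s: float, avg_latency_s: float, batch_size: int,
                 gpus_per_instance: int, usd_per_gpu_hour: float,
                 hours_per_month: float = 730) -> dict:
    """Size a cluster from concurrency rather than raw token throughput."""
    concurrent = avg_req_per_s * avg_latency_s       # Little's Law
    instances = math.ceil(concurrent / batch_size)   # Each instance holds one batch
    gpus = instances * gpus_per_instance
    hourly = gpus * usd_per_gpu_hour
    return {
        "concurrent": concurrent,
        "instances": instances,
        "gpus": gpus,
        "usd_per_hour": hourly,
        "usd_per_month": hourly * hours_per_month,
    }
```

With 20K req/s, 4s per request, batch size 64, 2 GPUs per instance, and $10 per GPU-hour, this returns 1,250 instances, 2,500 GPUs, and $18.25M per month. Setting `gpus_per_instance=1` for the INT4 case halves the cost to about $9.1M.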

Additional Coding Drills

3. Distributed Rate Limiter (Redis):

python
import time
import uuid

import redis

class DistributedRateLimiter:
    """Sliding window rate limiter using Redis sorted sets.
    Works across multiple server instances."""

    def __init__(self, redis_client: redis.Redis, limit: int, window_s: int):
        self.r = redis_client
        self.limit = limit
        self.window = window_s

    def allow(self, key: str) -> bool:
        now = time.time()
        member = f"{now}:{uuid.uuid4().hex}"  # Unique even for same-timestamp requests
        pipe = self.r.pipeline()  # MULTI/EXEC: the four commands run atomically
        # Remove entries outside the window
        pipe.zremrangebyscore(key, 0, now - self.window)
        # Count entries in the window (before adding this request)
        pipe.zcard(key)
        # Add current request
        pipe.zadd(key, {member: now})
        # Set TTL so keys auto-expire
        pipe.expire(key, self.window)
        results = pipe.execute()
        count = results[1]
        if count >= self.limit:
            # Rejected requests must not consume quota
            self.r.zrem(key, member)
            return False
        return True

# Usage: per-customer rate limit
limiter = DistributedRateLimiter(redis.Redis(), limit=1000, window_s=60)

def handle(request, customer_id: str):
    # process_request and HttpResponse come from your web framework
    if not limiter.allow(f"rate:{customer_id}"):
        return HttpResponse(status=429)
    return process_request(request)

4. Health Check Aggregator:

python
import asyncio

import aiohttp

class HealthAggregator:
    """Check health of all dependencies, return aggregate status."""
    def __init__(self, checks: dict[str, str]):
        self.checks = checks  # {"router": "http://model-router:8080/health", ...}

    async def _check_one(self, session, url: str, timeout: float) -> dict:
        try:
            # async with releases the connection back to the pool
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) as resp:
                return {"status": "healthy" if resp.status == 200 else "degraded"}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    async def check_all(self, timeout: float = 3.0) -> dict:
        names = list(self.checks)
        async with aiohttp.ClientSession() as session:
            # gather runs all checks concurrently, so total time is the slowest check
            statuses = await asyncio.gather(
                *(self._check_one(session, self.checks[n], timeout) for n in names)
            )
        results = dict(zip(names, statuses))
        overall = "healthy" if all(
            r["status"] == "healthy" for r in results.values()
        ) else "degraded"
        return {"overall": overall, "checks": results}

5. Exponential Backoff with Jitter:

python
import random, time

def retry_with_backoff(func, max_retries=5, base_delay=0.5, max_delay=30):
    """Retry with exponential backoff + full jitter.
    Jitter prevents thundering herd when many clients retry simultaneously."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with full jitter
            delay = min(max_delay, base_delay * (2 ** attempt))
            jittered = random.uniform(0, delay)
            time.sleep(jittered)

The meta-skill. The single most important thing in an infrastructure interview is not knowing the answer — it is showing your reasoning process. When asked a question you do not know, say: "I don't know the exact answer, but here's how I would reason about it..." and then work through it using first principles. Interviewers hire people who can solve novel problems, not people who memorized a study guide.

System Design Canvas

A reference architecture for "Design a multi-region LLM serving platform." This is your whiteboard answer. Study the components and their connections.

An interviewer says: "Your LLM cluster costs $500K/month. Leadership wants to cut it to $300K without losing throughput. What's your plan?" What is the highest-impact action?

Switch to cheaper A100 GPUs
Quantize models to FP8/INT4 (halve GPU count while maintaining throughput), move elastic capacity to spot instances (60-90% cheaper), and improve GPU utilization via better batching
Reduce the number of supported models
Negotiate a better rate with the cloud provider
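
The arithmetic behind the quantization-plus-spot option can be sketched with a back-of-the-envelope model. The function name `projected_monthly_cost`, the 40% spot share, and the "conservative" 25% GPU reduction are assumptions chosen for illustration. Only the $500K baseline, the $300K target, the "halve GPU count" case, and the 60% low end of the spot discount come from the question above.

```python
def projected_monthly_cost(baseline, gpu_fraction_after_quant, spot_share, spot_discount):
    """Rough monthly cost after applying two levers in sequence.

    gpu_fraction_after_quant: fraction of the original GPU count still needed
                              after quantization (0.5 means the fleet is halved).
    spot_share:               fraction of the remaining fleet moved to spot.
    spot_discount:            spot price reduction versus on-demand (0.6 = 60% cheaper).
    """
    after_quant = baseline * gpu_fraction_after_quant
    # Blended price per GPU relative to on-demand
    blended_rate = (1 - spot_share) + spot_share * (1 - spot_discount)
    return after_quant * blended_rate

BASELINE = 500_000
TARGET = 300_000

scenarios = {
    # Quantization delivers the full halving, no spot at all
    "optimistic quantization only": (0.5, 0.0, 0.0),
    # Assumed: quantization only removes 25% of GPUs, nothing else changes
    "conservative quantization only": (0.75, 0.0, 0.0),
    # Assumed: same 25% reduction, plus 40% of the fleet on spot at 60% off
    "conservative quantization + spot": (0.75, 0.4, 0.6),
}

for name, (gpu_frac, share, discount) in scenarios.items():
    cost = projected_monthly_cost(BASELINE, gpu_frac, share, discount)
    verdict = "meets target" if cost <= TARGET else "misses target"
    print(f"{name}: ${cost:,.0f}/month ({verdict})")
```

Under these assumptions, the conservative quantization case misses the target on its own and clears it once spot is added. That is the reason to present the levers as a combined plan rather than a single bet. The model ignores batching gains, spot interruption overhead, and any accuracy cost of quantization, all of which would need to be measured.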

Final Thought

Infrastructure engineering at the frontier of AI is one of the most challenging and rewarding roles in software. You operate at the intersection of systems engineering, machine learning, economics, and reliability. Every decision you make — which GPU to buy, how to batch requests, when to shard the database, how to handle spot reclamation — has measurable impact on the business. A 10% utilization improvement saves hundreds of thousands of dollars per month. A 100ms latency reduction makes every customer's product feel faster. A well-designed failover system means no one ever notices when hardware fails.

The field is moving fast. New GPU architectures every 18 months. New serving frameworks every 6 months. New cost optimization techniques every quarter. The engineer who thrives in this environment is not the one who memorizes today's best practices, but the one who understands first principles deeply enough to adapt when everything changes. When the B300 arrives and makes the B200 look slow, you will know how to evaluate it. When a new serving framework claims 3x better throughput, you will know how to benchmark it honestly. When leadership asks "can we serve 10x more traffic at half the cost?", you will know which levers to pull and which tradeoffs to accept.

Go build. Break things in staging. Fix them before anyone notices. And when the pager goes off at 3 AM, take a deep breath, pull up Grafana, and follow the metrics. They will tell you the story.
