Backend & API Engineer at Scale

Chapter 0: The Backend Engineer's World

An AI agent sends a POST request to your API: "create a custom inference pipeline with 4 GPU nodes, streaming output, and a 30-second timeout." Your system must authenticate the caller, check their quota, validate the payload against a schema that changes monthly, route to the right cluster, orchestrate the pipeline creation across three microservices, and return a structured response with a job ID — all in under 200ms. If any step fails, the response must explain exactly what went wrong in a way a developer can fix without reading your source code.

This is not a CRUD app. This is a developer-facing distributed system that millions of AI agents, scripts, and dashboards hit every day. And you are the engineer who makes it fast, reliable, and a joy to integrate with.

It is 9:00 AM. You badge into Parallel's office. On your first monitor, the overnight alerting shows a 12% increase in p99 latency on the /v2/pipelines endpoint after yesterday's database migration added a new index that's causing lock contention. On your second monitor, a partner integration team is stuck: their webhook handler is receiving duplicate events because your retry logic doesn't respect idempotency keys correctly. On your third monitor is a design doc from the platform team proposing GraphQL subscriptions for real-time pipeline status, but you're worried about connection scaling at 100K concurrent subscribers.

Before lunch, you will profile the slow query (the new composite index needs column order reversed), write a hotfix for the webhook deduplication (add an idempotency cache with a 24-hour TTL), and leave detailed comments on the GraphQL doc (propose a hybrid: GraphQL for reads, Server-Sent Events for real-time status, because SSE scales better behind your existing load balancer).

This is the daily reality of a Backend & API Engineer at Parallel. You own the API surface that every customer's code touches:

Responsibility	What you own	Daily intersection
API Surface	REST endpoints, versioning, schema evolution, SDK generation	Every external developer interacts through your contracts
Data Layer	Schema design, migrations, query optimization, connection pooling	Every request reads or writes your tables
Reliability	Rate limiting, caching, circuit breakers, graceful degradation	You keep the system alive when traffic spikes 10x
Security	Auth, API keys, RBAC, webhook signatures, audit logs	You protect customer data and prevent abuse
Performance	Profiling, caching, async processing, payload optimization	You make every response feel instant

Five dimensions, one API. This lesson covers the full stack because a staff-level backend engineer must reason across all of them. A beautifully designed API is useless if the database can't handle the load. A fast database is useless if the caching layer has stampede bugs. And none of it matters if authentication is broken. Every chapter prepares you to design, build, debug, and defend a production API system in an interview.

What Makes This Role Different

Backend engineering at an infrastructure company like Parallel is different from backend at a consumer app. Your users are developers. They read your error messages more carefully than your documentation. They will find every inconsistency in your API naming. They will script against your rate limit headers. They will reverse-engineer your pagination cursors. And they will loudly complain on Twitter when you ship a breaking change.

The ideal candidate has deep intuition on distributed systems, databases, and maintainable code design. They reason about trade-offs between speed, scalability, and developer ergonomics. They can design an API that's both high-performance and a joy to integrate with. They can debug a latency regression at 3 AM using only dashboards and distributed traces. They can write a migration that restructures a 100M-row table without downtime.

Most importantly, they understand that an API is a product. Every response time is a user experience. Every error message is customer support. Every changelog entry is a relationship with a developer who built their business on your platform.

A Day at Parallel

Here's what a typical Tuesday looks like:

Time	Activity	Skills used
9:00 AM	Triage overnight alerts: p99 latency spike on /v2/search	Observability, debugging
9:30 AM	Profile the slow query, find missing index, deploy fix	Database, performance
10:00 AM	Code review: teammate's rate limiter migration from fixed to sliding window	Rate limiting, API design
11:00 AM	Design doc: adding webhook retry with exponential backoff	System design, async patterns
12:00 PM	Partner sync: debug why their integration gets intermittent 403s	Auth, debugging, DX
2:00 PM	Implement cursor pagination for the /v2/logs endpoint	API design, database
3:00 PM	Review the auto-generated Python SDK for v2.5 release	DX, SDK design
4:00 PM	Plan capacity for a new enterprise customer (10x current traffic)	Scaling, caching

The Request You Serve

The diagram below traces a single API request from the internet to the database and back. Every box is a system you own or co-own. This is your opening whiteboard answer in a system-design interview.

1. DNS & CDN Edge

Client resolves api.parallel.dev. CDN edge checks for cached responses (GET only). TLS termination at the edge. Geographic routing to nearest PoP.

↓

2. Load Balancer

L7 load balancer distributes across API server fleet. Health checks remove unhealthy instances. Connection draining during deploys. Rate limit at the edge for DDoS protection.

↓

3. API Server

Request parsing, auth validation, schema validation, rate limit check (per-customer), business logic, database queries, cache lookups. Structured logging on every request.

↓

4. Data Layer

Connection pool → primary DB for writes, read replicas for reads. Redis for hot data and rate limit counters. Async job queue for long-running operations.

↓

5. Response

Serialize response, set cache headers, log latency breakdown, return to client. Error responses include request_id, error code, and human-readable message.

Interview Dimensions

Staff-level interviews test you across five dimensions. Each chapter in this lesson maps to one or more:

Dimension	What they ask	Chapters
CONCEPT	"Explain how connection pooling works under the hood"	All
DESIGN	"Design an API gateway that handles 1M requests/minute"	0, 1, 2, 8, 11
CODE	"Implement a token bucket rate limiter"	1, 3, 4, 5, 6, 10
DEBUG	"Your p99 latency tripled. Walk me through your investigation."	2, 7, 10
FRONTIER	"How would you use HTTP/3 or edge computing to reduce latency?"	All

A customer reports that their API integration randomly gets 500 errors, but only during business hours. Your metrics show p50 latency is normal but p99 is 10x higher during those hours. What is the most likely root cause?

The CDN cache is stale
The API server has a memory leak
Database connection pool exhaustion under peak concurrent load: most requests get a connection fast (low p50) but a few wait until timeout (high p99, then 500)
DNS resolution is slow

Chapter 1: API Design Principles

Your API is a contract. Every endpoint, every field name, every error code is a promise you make to thousands of developers. Break the promise and their production breaks. Make the promise confusing and they'll spend hours reading docs instead of building. A well-designed API is invisible — developers use it correctly without thinking. A badly-designed API generates support tickets.

At Parallel, your API is the primary product surface. AI agents don't use a dashboard — they call your endpoints programmatically. The API is the product.

REST, GraphQL, gRPC: When to Use Which

Protocol	Best for	Worst for	Parallel's use
REST	Public APIs, CRUD, HTTP caching (GET responses are cacheable), broad ecosystem	Complex nested queries, real-time streams	Primary public API: /v2/pipelines, /v2/models, /v2/jobs
GraphQL	Flexible queries, mobile clients (minimize over-fetching), introspection	Caching (POST-based), rate limiting (query cost varies), file uploads	Internal dashboard API — flexible queries for analytics UI
gRPC	Service-to-service, streaming, strong typing (protobuf), low latency	Browser clients (needs proxy), debugging (binary), public APIs	Internal microservice mesh — pipeline orchestrator talks to GPU scheduler via gRPC

Design rule: one protocol per audience. Public API = REST (everyone understands it, tooling is universal). Internal services = gRPC (type safety, streaming, performance). Dashboard = GraphQL (frontend team can fetch exactly what they need). Mixing protocols for the same audience creates confusion.

REST Design: Resource Naming

REST APIs are organized around resources, not actions. A resource is a noun (pipeline, job, model), not a verb (createPipeline). The HTTP method provides the verb.

http
# GOOD: resources are nouns, methods provide verbs
GET    /v2/pipelines              # List pipelines
POST   /v2/pipelines              # Create a pipeline
GET    /v2/pipelines/{id}         # Get one pipeline
PATCH  /v2/pipelines/{id}         # Update a pipeline
DELETE /v2/pipelines/{id}         # Delete a pipeline
GET    /v2/pipelines/{id}/jobs    # List jobs for a pipeline (nested resource)

# BAD: verbs in URLs (this is RPC, not REST)
POST /v2/createPipeline
POST /v2/getPipelineById
POST /v2/deletePipeline

# TRICKY: actions that don't map to CRUD
# Option A: treat the action as a sub-resource
POST /v2/pipelines/{id}/restart   # Restart a pipeline
POST /v2/pipelines/{id}/scale     # Scale GPU count

# Option B: use a generic "actions" endpoint
POST /v2/pipelines/{id}/actions
{ "action": "restart", "params": {} }

Response Envelope Design

Every response should follow a consistent envelope. Developers build generic response parsers — if one endpoint returns {"data": [...]} and another returns a bare array [...], their parser breaks.

python
# Parallel's response envelope — consistent across ALL endpoints:

# Single resource:
{
  "data": {
    "id": "pipe_abc123",
    "name": "my-pipeline",
    "status": "active",
    "created_at": "2025-05-22T10:00:00Z"
  }
}

# Collection:
{
  "data": [
    {"id": "pipe_abc", ...},
    {"id": "pipe_def", ...}
  ],
  "pagination": {
    "next_cursor": "eyJpZCI6...",
    "has_more": true
  }
}

# Error:
{
  "error": {
    "type": "not_found",
    "message": "Pipeline 'pipe_xyz' does not exist.",
    "code": "PIPELINE_NOT_FOUND",
    "request_id": "req_789"
  }
}

# Rules:
# 1. Success always has "data" key
# 2. Error always has "error" key
# 3. Never both at once
# 4. Timestamps always ISO 8601 with timezone (Z or +00:00)
# 5. IDs always have a prefix: pipe_, job_, cust_, key_
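
The envelope rules above can be sketched as a handful of small builders. The function names here (`new_id`, `utc_timestamp`, `success`, `collection`, `failure`) are hypothetical and chosen for illustration; nothing in this lesson confirms how Parallel actually implements its envelope.

```python
import secrets
from datetime import datetime, timezone
from typing import Optional

def new_id(prefix: str) -> str:
    """Hypothetical helper: opaque ID with a type prefix such as pipe_ or job_ (rule 5)."""
    return f"{prefix}_{secrets.token_hex(6)}"

def utc_timestamp(dt: datetime) -> str:
    """ISO 8601 in UTC with an explicit Z suffix (rule 4)."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def success(resource: dict) -> dict:
    """Single resource: always under the "data" key (rule 1)."""
    return {"data": resource}

def collection(items: list, next_cursor: Optional[str]) -> dict:
    """Collection: "data" plus a pagination object."""
    return {
        "data": items,
        "pagination": {"next_cursor": next_cursor, "has_more": next_cursor is not None},
    }

def failure(error_type: str, code: str, message: str, request_id: str) -> dict:
    """Error: always under the "error" key, never alongside "data" (rules 2 and 3)."""
    return {
        "error": {
            "type": error_type,
            "message": message,
            "code": code,
            "request_id": request_id,
        }
    }
```

Routing every handler through builders like these is one way to make the "never both at once" rule hard to violate by accident.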

Versioning: The Hardest Problem

You will change your API. Fields get renamed, response shapes evolve, deprecated endpoints must die. The question is how to do it without breaking existing integrations.

python
# Strategy 1: URL versioning (Parallel's choice)
# Simple, explicit, cacheable. Downside: proliferating paths.
GET /v2/pipelines/abc123
GET /v3/pipelines/abc123   # New response shape

# Strategy 2: Header versioning
# Cleaner URLs. Downside: invisible in logs, hard to cache.
GET /pipelines/abc123
Accept-Version: 2024-01-15

# Strategy 3: Query parameter
# Easy for debugging. Downside: pollutes cache keys.
GET /pipelines/abc123?version=2

# Parallel's approach: URL versioning for major (breaking),
# date-based header for minor (additive).
# v2 is the major contract. Adding a new field doesn't bump v2 → v3.
# Removing a field or changing a type DOES bump the version.

Pagination: Cursor vs. Offset

python
# OFFSET: simple but broken at scale
GET /v2/jobs?limit=20&offset=1000
# Problem: if 5 new jobs are created between page fetches,
# page 51 will show 5 items from page 50. Items shift.
# Also: OFFSET 1000 forces the DB to scan and skip 1000 rows.

# CURSOR: stable and performant
GET /v2/jobs?limit=20&cursor=eyJpZCI6MTAwMH0=
# Cursor encodes the last-seen sort key (e.g., base64 of {"id": 1000}).
# DB query: WHERE id > 1000 ORDER BY id LIMIT 20
# No deep scan, no shifting. Cost is one index seek, O(log n + limit), regardless of page depth.

# Response shape:
{
  "data": [...],
  "pagination": {
    "next_cursor": "eyJpZCI6MTAyMH0=",
    "has_more": true
  }
}
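
To make the cursor mechanics concrete, here is a minimal sketch that encodes the last-seen id as base64 JSON and pages through an in-memory list. The list stands in for an indexed table, and the names `encode_cursor`, `decode_cursor`, and `list_jobs` are illustrative assumptions rather than Parallel's API.

```python
import base64
import json
from typing import Optional

def encode_cursor(last_id: int) -> str:
    """Pack the last-seen sort key into an opaque, URL-safe token."""
    raw = json.dumps({"id": last_id}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor))["id"]

def list_jobs(jobs: list, limit: int = 20, cursor: Optional[str] = None) -> dict:
    """Mirrors: WHERE id > :last_id ORDER BY id LIMIT :limit + 1."""
    ordered = sorted(jobs, key=lambda job: job["id"])
    if cursor is not None:
        last_id = decode_cursor(cursor)
        ordered = [job for job in ordered if job["id"] > last_id]
    rows = ordered[: limit + 1]          # fetch one extra row to detect a next page
    has_more = len(rows) > limit
    page = rows[:limit]
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return {"data": page, "pagination": {"next_cursor": next_cursor, "has_more": has_more}}
```

Fetching `limit + 1` rows is a common way to compute `has_more` without a separate COUNT query. Because the filter is on the sort key itself, rows inserted between fetches cannot shift items across pages.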

Error Handling: Your Most Important Feature

Developers spend more time debugging errors than reading success responses. A good error response is worth 100 lines of documentation.

python
# BAD: generic, useless
{"error": "Bad request"}

# BAD: leaks internals
{"error": "PostgreSQL error: relation 'pipelines' does not exist"}

# GOOD: structured, actionable, safe
{
  "error": {
    "type": "validation_error",
    "message": "Field 'gpu_count' must be between 1 and 8.",
    "code": "INVALID_GPU_COUNT",
    "param": "gpu_count",
    "request_id": "req_abc123",
    "doc_url": "https://docs.parallel.dev/errors/INVALID_GPU_COUNT"
  }
}

Idempotency: The Safety Net

Network failures happen. Clients retry. Without idempotency, a retry of "create pipeline" creates two pipelines. Idempotency means: calling the same operation twice produces the same result as calling it once.

python
# Client sends an Idempotency-Key header with mutating requests
POST /v2/pipelines
Idempotency-Key: idem_user123_1716400000
{"name": "my-pipeline", "gpu_count": 4}

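# Server implementation (sketch: redis, db, serialize, and conflict are app-level helpers;
# conflict is a hypothetical function that builds an HTTP 409 response):
import json

async def create_pipeline(req: Request):
    key = req.headers["Idempotency-Key"]
    # Atomically claim the key. SET NX succeeds only for the first request,
    # so two concurrent retries cannot both create a pipeline.
    claimed = await redis.set(f"idem:{key}", "pending", nx=True, ex=86400)
    if not claimed:
        cached = await redis.get(f"idem:{key}")
        if cached in ("pending", b"pending"):
            return conflict("A request with this key is still in progress.")
        return json.loads(cached)  # Return the same response as the first call

    try:
        # Execute the operation
        pipeline = await db.create_pipeline(req.body)
    except Exception:
        await redis.delete(f"idem:{key}")  # Release the claim so the client can retry
        raise
    response = serialize(pipeline)

    # Store the response for 24h so retries return the same result
    await redis.setex(f"idem:{key}", 86400, json.dumps(response))
    return response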

Request Validation: Schema-First

Every POST/PATCH endpoint must validate the request body against a schema before touching the database. At Parallel, we use JSON Schema (for REST) and protobuf (for gRPC) to define what valid input looks like.

python
# JSON Schema for POST /v2/pipelines
PIPELINE_CREATE_SCHEMA = {
    "type": "object",
    "required": ["name", "gpu_count"],
    "properties": {
        "name": {
            "type": "string",
            "minLength": 1,
            "maxLength": 255,
            "pattern": "^[a-z0-9][a-z0-9-]*$"  # DNS-safe names
        },
        "gpu_count": {
            "type": "integer",
            "minimum": 1,
            "maximum": 8
        },
        "timeout_seconds": {
            "type": "integer",
            "minimum": 5,
            "maximum": 3600,
            "default": 300
        }
    },
    "additionalProperties": False  # Reject unknown fields
}

# Why additionalProperties: false?
# A client sends {"name": "test", "gpu_count": 4, "timeoutSeconds": 60} (camelCase typo).
# Without this, the request succeeds with the default timeout_seconds (300),
# and the client wonders why their 60-second timeout was ignored.
# With this, they get: "Unknown field: timeoutSeconds. Did you mean timeout_seconds?"
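
A "Did you mean ...?" hint for a misspelled field can be produced with the standard library alone. The sketch below uses `difflib.get_close_matches`; the function name `unknown_field_errors` and the hard-coded field list are assumptions for illustration, and a real service would derive the allowed names from the schema itself.

```python
import difflib

# Field names taken from the pipeline schema's "properties" (assumed list).
ALLOWED_FIELDS = ["name", "gpu_count", "timeout_seconds"]

def unknown_field_errors(payload: dict, allowed: list = ALLOWED_FIELDS) -> list:
    """Return one message per unknown field, with a suggestion when a close match exists."""
    errors = []
    for field_name in payload:
        if field_name in allowed:
            continue
        message = f"Unknown field: {field_name}."
        # get_close_matches is case-sensitive and uses a default similarity cutoff of 0.6.
        close = difflib.get_close_matches(field_name, allowed, n=1)
        if close:
            message += f" Did you mean {close[0]}?"
        errors.append(message)
    return errors
```

A camelCase variant of a snake_case field scores well above the default cutoff, so the most common typo class gets a suggestion, while unrelated names get only the plain "Unknown field" message.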

Debugging API Design

The most common API design bugs that generate support tickets:

Bug 1: Inconsistent naming. /v2/pipelines returns created_at but /v2/jobs returns createdAt. Developers write generic parsers that break. Fix: enforce a naming convention (snake_case for REST, camelCase for GraphQL) with a linter in CI.

Bug 2: Silent truncation. Client sends gpu_count: 16, server silently clamps to 8. Client thinks they have 16 GPUs. Fix: reject invalid values with a clear error, never silently mutate input.

Bug 3: Leaking internal IDs. Response includes internal_cluster_id: "prod-us-east-7". A competitor maps your infrastructure. Fix: only expose opaque external IDs. Internal IDs stay internal.

Bug 4: Missing Content-Type validation. Endpoint expects JSON but doesn't check Content-Type. A client sends form-encoded data, the JSON parser fails with a cryptic error. Fix: return 415 Unsupported Media Type if Content-Type isn't application/json.

Frontier: API-First with OpenAPI + Code Generation (2024-2025)

The state of the art is design-first API development. You write the OpenAPI spec before any code. Then code generation produces server stubs, client SDKs (Python, TypeScript, Go, Rust), documentation, and test fixtures — all from one source of truth. Parallel generates SDKs in 6 languages from a single OpenAPI YAML.

The frontier push: AI-native APIs. Endpoints designed for LLM tool-use: deterministic schemas, rich descriptions in the spec (so the LLM understands what each field does), and streaming responses via Server-Sent Events so agents get partial results without polling.

A developer complains: "I'm paginating through /v2/jobs with offset pagination but some jobs appear twice and others are missing." What's the fix?

Increase the page size
Switch to cursor-based pagination: offset pagination shifts items when new records are inserted between page fetches
Add a unique constraint to the database
Cache the full result set server-side

Chapter 2: Request Lifecycle

Every API call is a journey through a dozen systems, each adding latency. Understanding this journey — and where milliseconds hide — is the difference between a 50ms response and a 500ms response. When an interviewer says "walk me through what happens when a client hits your API," this is what they want to hear.

The Full Path: DNS to Response

Step 1: DNS Resolution (1-50ms). The client resolves api.parallel.dev. If cached, instant. If not, the recursive resolver walks the DNS hierarchy. You control this with low TTLs for failover (60s) or high TTLs for speed (300s). Parallel uses Route 53 with latency-based routing — the DNS response points to the nearest edge PoP.

Step 2: TLS Handshake (10-50ms). TLS 1.3 requires one round trip (1-RTT). The client and server exchange keys, verify certificates, and establish an encrypted channel. Requests sent over an already-open keep-alive connection skip the handshake entirely, and TLS 1.3 session resumption with 0-RTT lets a returning client send data in its first flight. Parallel terminates TLS at the CDN edge, so the internal network uses plain HTTP (faster, simpler).

Step 3: Load Balancer (1-5ms). The L7 load balancer (e.g., ALB, Envoy) routes to a healthy API server. Routing strategies: round-robin (simple), least-connections (better under uneven load), consistent hashing (for sticky sessions or cache affinity). Parallel uses least-connections with health checks every 5s.

Step 4: Reverse Proxy / API Gateway (2-10ms). Before hitting your application code, the request passes through an API gateway that handles cross-cutting concerns: request ID injection, rate limiting, auth token validation, request logging, CORS headers. This layer exists so your application code stays clean.

Step 5: Application Handler (5-200ms). Your code runs. Parse the request body, validate the schema, execute business logic, query the database, check the cache, assemble the response. This is where 80% of your optimization time goes.

Step 6: Database Query (1-100ms). Connection pool checkout (0-5ms), query execution (1-50ms for indexed reads, 10-100ms for complex joins), result serialization. Slow queries here dominate total latency.

Step 7: Response Serialization (1-5ms). Marshal the response to JSON (or protobuf for gRPC). Set cache-control headers. Compress with gzip/brotli if the client accepts it. Add the request ID to the response headers for debugging.

Latency budget. At Parallel, the p50 target is 50ms and p99 is 200ms for read endpoints. Here's the budget: DNS (0ms cached) + TLS (0ms reused) + LB (2ms) + Gateway (3ms) + Handler (10ms) + DB (20ms) + Serialization (2ms) = ~37ms p50. The p99 spike comes from DB (cache miss forces disk I/O) and handler (complex validation on edge-case payloads).
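
The budget arithmetic above is easy to keep honest in code. This sketch sums the per-stage p50 figures quoted in the paragraph and reports the headroom against the 50ms target; the dictionary keys and the `budget_report` name are illustrative.

```python
# Per-stage p50 budget in milliseconds, as quoted in the latency budget above.
P50_BUDGET_MS = {
    "dns": 0,            # cached
    "tls": 0,            # connection reused
    "load_balancer": 2,
    "gateway": 3,
    "handler": 10,
    "database": 20,
    "serialization": 2,
}
P50_TARGET_MS = 50

def budget_report(budget: dict, target_ms: int) -> dict:
    """Total the stages, compute headroom, and name the most expensive stage."""
    total = sum(budget.values())
    return {
        "total_ms": total,
        "headroom_ms": target_ms - total,
        "largest_stage": max(budget, key=budget.get),
    }
```

With these numbers the total is 37ms, leaving 13ms of headroom, and the database is the largest single stage, which is why it is the first place to look when p50 drifts.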

Design: Latency Breakdown Architecture

python
import time
from dataclasses import dataclass, field

@dataclass
class LatencyTrace:
    request_id: str
    spans: list = field(default_factory=list)

    def span(self, name: str):
        return SpanContext(self, name)

    def total_ms(self) -> float:
        return sum(s["duration_ms"] for s in self.spans)

class SpanContext:
    """Context manager that records how long a named step took."""
    def __init__(self, trace, name):
        self.trace, self.name = trace, name
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    def __exit__(self, *_):
        dur = (time.perf_counter() - self.start) * 1000
        self.trace.spans.append({"name": self.name, "duration_ms": round(dur, 2)})

# Usage in a handler (validate_token, redis, db, and serialize are app-level helpers):
async def get_pipeline(pipeline_id: str, token: str, trace: LatencyTrace):
    with trace.span("auth"):
        user = await validate_token(token)
    with trace.span("cache_check"):
        cached = await redis.get(f"pipeline:{pipeline_id}")
    if cached:
        return cached  # Cache hit: skip DB entirely
    with trace.span("db_query"):
        pipeline = await db.fetch_pipeline(pipeline_id)
    with trace.span("serialize"):
        response = serialize(pipeline)
    # Log: {"request_id": "req_abc", "spans": [{"name": "auth", "duration_ms": 2.1}, ...]}
    return response

Debugging: The Slow Request Investigation

An interviewer says: "Your p99 latency jumped from 200ms to 800ms. Walk me through your investigation."

Step 1: Is it all endpoints or one? Check per-endpoint latency dashboards. If it's one endpoint, the problem is in that handler. If it's all endpoints, the problem is in a shared layer (DB, cache, network).

Step 2: Is it all customers or one? A single customer with a 10MB payload can slow their requests without affecting others. Check per-customer latency distribution.

Step 3: Check the spans. Pull a sample of slow requests and look at the latency trace. If "db_query" went from 20ms to 600ms, the problem is in the database. If "auth" went from 2ms to 200ms, the auth service is degraded.

Step 4: Correlate with recent changes. Did someone deploy? Did the database auto-scale? Did a cron job start running a heavy migration? Check the deployment timeline against the latency graph.

The investigation funnel. Always go: broad → narrow. All endpoints or one? All customers or one? Which span is slow? What changed? This shows the interviewer you have a systematic debugging methodology, not a "restart and hope" approach.
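
The first step of the funnel, "all endpoints or one?", amounts to computing percentiles per endpoint. A minimal sketch, assuming latency samples arrive as `(endpoint, latency_ms)` pairs and using the nearest-rank percentile definition (the helper names are hypothetical):

```python
import math
from collections import defaultdict

def percentile(values: list, q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def slow_endpoints(samples, p99_limit_ms: float) -> dict:
    """Group samples by endpoint and return {endpoint: (p50, p99)} where p99 exceeds the limit."""
    by_endpoint = defaultdict(list)
    for endpoint, latency_ms in samples:
        by_endpoint[endpoint].append(latency_ms)
    report = {}
    for endpoint, values in by_endpoint.items():
        p50, p99 = percentile(values, 50), percentile(values, 99)
        if p99 > p99_limit_ms:
            report[endpoint] = (p50, p99)
    return report
```

An endpoint whose p50 looks healthy while its p99 is far over the limit is the typical signature of a tail problem such as lock contention or pool exhaustion, not a uniformly slow handler.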

The Load Balancer: More Than Round Robin

The load balancer is the traffic cop for your API fleet. A naive round-robin sends requests evenly, but that's only optimal when all servers are identical and all requests cost the same. In practice, neither is true.

Algorithm	How it works	Best for	Pitfall
Round robin	Each server gets requests in sequence: 1, 2, 3, 1, 2, 3...	Homogeneous fleet, similar request cost	One slow server gets same load as fast ones
Least connections	Route to the server with fewest active requests	Variable request durations (short reads + long writes)	Newly booted servers get flooded (0 connections)
Weighted round robin	Bigger servers get proportionally more requests	Mixed instance sizes (during migration, canary deploys)	Weights are static — don't adapt to runtime conditions
Consistent hashing	Hash the request key → always route to same server	Server-local caching, session affinity	Hotspot if one key is disproportionately popular

python
# Least-connections with slow-start: protect new instances

# Problem: a new server boots with 0 active connections.
# Least-connections sends ALL new requests to it.
# Its cache is cold, so requests are slow, memory spikes.

# Fix: slow-start ramp. New server gets linearly increasing
# weight over 30 seconds: 10% → 20% → ... → 100%

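# In AWS ALB: slow_start.duration_seconds = 30 (target group attribute;
#   not available with the least-outstanding-requests algorithm, so it pairs with round robin)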
# In Envoy: slow_start_config { slow_start_window: 30s }
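
The ramp described above can be sketched as a weight that scales each server's apparent load. This is a simplified model for illustration, not the algorithm ALB or Envoy actually runs; the `Server` dataclass, the 10% floor, and the scoring formula are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    active_requests: int
    age_seconds: float  # time since the instance first passed its health check

def slow_start_weight(age_seconds: float, window_seconds: float = 30.0, floor: float = 0.1) -> float:
    """Linear ramp from `floor` up to 1.0 across the slow-start window."""
    if age_seconds >= window_seconds:
        return 1.0
    return max(floor, age_seconds / window_seconds)

def pick_server(servers: list) -> Server:
    """Least-connections scaled by slow-start weight: the lowest (active + 1) / weight wins."""
    return min(
        servers,
        key=lambda s: (s.active_requests + 1) / slow_start_weight(s.age_seconds),
    )
```

Dividing by the weight makes a freshly booted server look busier than it is, so it receives a trickle of traffic while its caches warm, then competes normally once the window has elapsed.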

Graceful Shutdown: Don't Drop In-Flight Requests

When deploying new code, the old server must finish processing in-flight requests before shutting down. This is connection draining.

python
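# Graceful shutdown pattern (Python/uvicorn; app, db_pool, and redis_pool are app-level objects):
import asyncio
import signal
import sys

async def graceful_shutdown():
    # 1. Stop accepting new requests (health check returns 503)
    app.state.shutting_down = True

    # 2. Wait for in-flight requests to complete (max 30s)
    for _ in range(300):  # 30s in 100ms increments
        if app.state.active_requests == 0:
            break
        await asyncio.sleep(0.1)

    # 3. Close database connections cleanly
    await db_pool.close()
    await redis_pool.close()

    # 4. Exit
    sys.exit(0)

# Register the handler at startup (SIGTERM from container orchestrator).
# loop.add_signal_handler (Unix only) runs the callback inside the event loop,
# where scheduling a task is safe; a plain signal.signal handler runs outside it.
def install_sigterm_handler():
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(
        signal.SIGTERM, lambda: loop.create_task(graceful_shutdown())
    )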

The deploy sequence: (1) Load balancer removes old instance from rotation. (2) Old instance finishes in-flight requests (30s drain). (3) Old instance shuts down. (4) New instance boots, warms up, health check passes. (5) Load balancer adds new instance. Zero-downtime deploy.

Frontier: eBPF-Based Request Tracing (2024-2025)

The cutting edge is kernel-level observability with eBPF. Instead of instrumenting your application code with spans, eBPF programs attach to kernel syscalls (connect, read, write) and automatically measure network latency, TCP retransmits, and connection pool behavior — with zero application code changes. Tools like Cilium and Pixie give you full request traces from the kernel level.

Combined with OpenTelemetry auto-instrumentation, you get traces spanning your API server, database client, Redis client, and HTTP clients — all without manually adding span context. The frontier is zero-code full-stack tracing.

Your API's p99 latency doubled but p50 is unchanged. Which investigation step should you take FIRST?

Check per-endpoint latency to see if the problem is isolated to one endpoint or affects all endpoints
Restart all API servers
Increase the database connection pool size
Roll back the last deployment

Chapter 3: Database Design

Your database is the source of truth. Every API response ultimately comes from data stored here. Get the schema wrong and you'll spend months working around it. Get the indexes wrong and your API will be fast on day one and unusable at 10 million rows. Get the connection pooling wrong and your database will die under load that your application code could easily handle.

Schema Design: Think in Access Patterns

Don't design schemas by drawing entity-relationship diagrams. Design them by listing every query your API will run, then building tables that make those queries efficient. This is access-pattern-driven design.

sql
-- Access patterns for Parallel's pipeline API:
-- 1. Get pipeline by ID (most common, must be a single primary-key lookup)
-- 2. List pipelines by customer, sorted by created_at (pagination)
-- 3. Count active pipelines per customer (quota check)
-- 4. Find pipelines by status (admin dashboard)

CREATE TABLE pipelines (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    name        VARCHAR(255) NOT NULL,
    status      VARCHAR(32) NOT NULL DEFAULT 'pending',
    gpu_count   INTEGER NOT NULL CHECK (gpu_count BETWEEN 1 AND 8),
    config      JSONB NOT NULL DEFAULT '{}',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Index for pattern 2: list by customer, sorted by time
-- Cursor pagination: WHERE customer_id = $1 AND created_at < $cursor
-- ORDER BY created_at DESC LIMIT 20
-- A plain index, not a UNIQUE constraint: two pipelines for the same
-- customer can share a created_at timestamp.
CREATE INDEX idx_customer_created
    ON pipelines (customer_id, created_at DESC);

-- Partial index for pattern 3: only count active pipelines
-- Much smaller than a full index, only includes rows where status='active'
CREATE INDEX idx_active_by_customer
    ON pipelines (customer_id)
    WHERE status = 'active';

-- Index for pattern 4: admin filtering by status
CREATE INDEX idx_status ON pipelines (status);

Indexing Strategies: The B-Tree Mental Model

A B-tree index is a sorted data structure that lets the database find rows without scanning the entire table. Think of it as a phone book: if you want to find "Smith," you don't read every entry — you jump to the "S" section, then narrow down. Without an index, every query is a sequential scan that reads every row.

The indexing golden rule: The column order in a composite index matters enormously. An index on (customer_id, created_at) can efficiently serve WHERE customer_id = X and WHERE customer_id = X AND created_at > Y, but NOT WHERE created_at > Y alone. The leftmost column must always be in the WHERE clause. This is the leftmost prefix rule.
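
The leftmost prefix rule follows directly from how a sorted structure works. The sketch below models a composite index as a sorted Python list of `(customer_id, created_at)` tuples and uses `bisect` for the lookups. It is a mental model of a B-tree, not how PostgreSQL stores one, and the sample data is made up.

```python
import bisect

# A composite index on (customer_id, created_at) behaves like a list of tuples
# kept in sorted order: first by customer_id, then by created_at.
index = sorted([
    ("cust_a", 100), ("cust_a", 250), ("cust_a", 400),
    ("cust_b", 120), ("cust_b", 300),
    ("cust_c", 90),
])

def by_customer_since(customer_id: str, since: int) -> list:
    """WHERE customer_id = X AND created_at > Y: two binary searches, then one contiguous slice."""
    lo = bisect.bisect_right(index, (customer_id, since))
    hi = bisect.bisect_left(index, (customer_id, float("inf")))
    return index[lo:hi]

def since_any_customer(since: int) -> list:
    """WHERE created_at > Y alone: matches are scattered, so every entry must be checked."""
    return [entry for entry in index if entry[1] > since]
```

The first query touches only the entries it returns because they sit next to each other in sort order. The second cannot binary-search at all: rows with a large `created_at` are spread across every customer's section, which is the list equivalent of a sequential scan.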

Query Optimization: EXPLAIN ANALYZE

sql
-- BEFORE optimization: full table scan (1.2s at 10M rows)
EXPLAIN ANALYZE
SELECT * FROM pipelines
WHERE customer_id = 'abc' AND status = 'active'
ORDER BY created_at DESC LIMIT 20;

-- Output shows: Seq Scan on pipelines (cost=0.00..185432.00)
-- This means: scanning ALL 10M rows, filtering in memory.

-- AFTER adding the right index: 0.3ms
-- Output shows: Index Scan using idx_customer_created (cost=0.56..24.12)
-- The database jumps directly to customer_id='abc', walks the sorted
-- created_at entries, and stops after 20 rows. O(log n + k).

Connection Pooling: The Bottleneck Nobody Sees

PostgreSQL creates a new process for every connection. At 500 connections, the OS spends more time context-switching between processes than executing queries. A connection pooler like PgBouncer sits between your application and the database, maintaining a pool of reusable connections.

python
# Without pooling: each request opens a new DB connection (20-50ms)
# With PgBouncer: request checks out a pre-opened connection (0.1ms)

# PgBouncer modes:
# - session: connection held for entire client session (safest, least efficient)
# - transaction: connection returned after each transaction (best for APIs)
# - statement: connection returned after each statement (most efficient,
#   but breaks multi-statement transactions)

# Parallel's config: transaction mode, 20 server connections,
# 1000 client connections. 50:1 multiplexing ratio.
# 1000 concurrent API requests share 20 actual DB connections.

# Application-level pooling (asyncpg):
pool = await asyncpg.create_pool(
    dsn="postgres://user:pass@pgbouncer:6432/parallel",
    min_size=5,      # Keep 5 connections warm
    max_size=20,     # Never exceed 20 from this process
    command_timeout=10,  # Kill queries after 10s
    statement_cache_size=0,  # asyncpg's prepared-statement cache breaks behind PgBouncer transaction mode
)
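
The multiplexing idea, many concurrent requests sharing a few connections, can be shown with a toy pool built on `asyncio.Queue`. This is a simulation only: `TinyPool` and `simulate` are hypothetical names, the "connections" are plain integers, and `asyncio.sleep` stands in for a short transaction.

```python
import asyncio

class TinyPool:
    """Toy pool: a queue of pre-opened 'connections' shared by many requests."""

    def __init__(self, size: int):
        self._free = asyncio.Queue()
        for conn_id in range(size):
            self._free.put_nowait(conn_id)
        self.in_use = 0
        self.peak_in_use = 0

    async def run(self, work_seconds: float) -> None:
        conn = await self._free.get()          # waits here when every connection is busy
        self.in_use += 1
        self.peak_in_use = max(self.peak_in_use, self.in_use)
        try:
            await asyncio.sleep(work_seconds)  # stand-in for one short transaction
        finally:
            self.in_use -= 1
            self._free.put_nowait(conn)        # hand the connection to the next request

async def simulate(requests: int = 1000, pool_size: int = 20) -> int:
    """Run `requests` concurrent tasks through one pool and return peak connections in use."""
    pool = TinyPool(pool_size)
    await asyncio.gather(*(pool.run(0.001) for _ in range(requests)))
    return pool.peak_in_use
```

Calling `asyncio.run(simulate())` drives 1000 concurrent tasks through the pool while the number of connections in use never exceeds the pool size. The cost shows up as queueing time, which is why pool exhaustion surfaces as a p99 problem before it becomes a p50 problem.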

Database Migrations: The Zero-Downtime Challenge

You need to add a column, rename a field, or change a type — but your API has 1000 requests/second hitting this table. A naive ALTER TABLE ADD COLUMN can lock the table for seconds (or minutes at 100M rows). Here's how to do it without downtime.

sql
-- SAFE migration pattern: expand → migrate → contract

-- Step 1: EXPAND — add the new column (nullable, no default)
-- A nullable column with no default is a metadata-only change: no table rewrite.
ALTER TABLE pipelines ADD COLUMN state VARCHAR(32);

-- Step 2: DUAL-WRITE — update application code to write to both columns
-- Deploy code that writes to both "status" and "state".
-- Reads still come from "status".

-- Step 3: BACKFILL — copy data from old column to new (batched!)
-- Never run UPDATE pipelines SET state = status; as one statement: it locks every row in a single long transaction.
-- Instead, batch it:
UPDATE pipelines SET state = status
WHERE id IN (SELECT id FROM pipelines WHERE state IS NULL LIMIT 1000);
-- Run this in a loop until all rows are migrated.

-- Step 4: SWITCH — update reads to use new column
-- Deploy code that reads from "state" instead of "status".

-- Step 5: CONTRACT — drop old column (weeks later, after verification)
ALTER TABLE pipelines DROP COLUMN status;
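
The backfill loop from Step 3 can be sketched as a small driver. SQLite (via the standard-library `sqlite3` module) is swapped in here purely so the example runs anywhere; the UPDATE statement has the same shape as the PostgreSQL one above, and the table contents are made up.

```python
import sqlite3

def backfill_state(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Copy status into state in small batches; returns the number of batches that did work."""
    batches = 0
    while True:
        cur = conn.execute(
            "UPDATE pipelines SET state = status "
            "WHERE id IN (SELECT id FROM pipelines WHERE state IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()              # one short transaction per batch keeps locks brief
        if cur.rowcount == 0:      # nothing left to migrate
            return batches
        batches += 1

# Demo table: 2,500 rows that still need the new column filled in.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipelines (id INTEGER PRIMARY KEY, status TEXT NOT NULL, state TEXT)"
)
conn.executemany("INSERT INTO pipelines (status) VALUES (?)", [("active",)] * 2500)
conn.commit()
```

In production the loop would usually also pause between batches and watch replication lag, so the backfill does not starve regular traffic.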

The migration trap: Adding a column with a DEFAULT value on PostgreSQL <11 rewrites the entire table. At 100M rows, this takes minutes and locks all writes. Always check your PostgreSQL version. On PostgreSQL 11+, ADD COLUMN ... DEFAULT 'pending' is instant because the default is stored in the catalog, not written to every row.

Read Replicas: Scaling Reads

Writes go to the primary. Reads can go to read replicas — copies of the primary that stay up to date via replication. At Parallel, 90% of API calls are reads. Sending reads to 3 replicas means the primary only handles writes.

python
# Read-your-writes pattern implementation
# After a write, return a consistency token (the primary's WAL position)
# in a response header. The client sends this token back with later reads.
# If the replica has not yet replayed up to that position, read from primary.

async def create_pipeline(req):
    pipeline = await primary_db.insert(req.body)
    # Return consistency token = current WAL position
    lsn = await primary_db.query("SELECT pg_current_wal_lsn()")
    return Response(
        data=pipeline,
        headers={"X-Consistency-Token": encode_lsn(lsn)}
    )

async def get_pipeline(req, pipeline_id):
    token = req.headers.get("X-Consistency-Token")
    # replica_has_reached compares the token with the replica's replay
    # position (pg_last_wal_replay_lsn()); it queries the replica, so await it.
    if token and not await replica_has_reached(token):
        # Replica hasn't caught up: read from primary
        return await primary_db.fetch(pipeline_id)
    # Safe to read from replica
    return await replica_db.fetch(pipeline_id)

Replication lag trap. A user creates a pipeline (write to primary), then immediately GETs it (read from replica). If replication lag is 100ms, the GET returns 404 — the replica hasn't received the write yet. Fix: route reads-after-writes to the primary for a short window (the "read-your-writes" pattern).
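The token check comes down to comparing two WAL positions. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash, so comparing the raw strings gives wrong answers. Here is a minimal sketch of the comparison; parse_lsn and this two-argument form of replica_has_reached are illustrative helpers, not functions defined elsewhere in this chapter.

```python
def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a comparable integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replica_has_reached(token_lsn: str, replica_replay_lsn: str) -> bool:
    """True when the replica has replayed at least up to the token's position."""
    return parse_lsn(replica_replay_lsn) >= parse_lsn(token_lsn)
```

In a real service, replica_replay_lsn would come from running SELECT pg_last_wal_replay_lsn() on the replica.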

JSONB: The Schema Flexibility Escape Hatch

Some fields don't fit neatly into a fixed schema. Pipeline configurations vary per customer, model parameters change over time, metadata is freeform. JSONB in PostgreSQL gives you a typed, indexed, queryable JSON column inside a relational table.

sql
-- Store flexible config in a JSONB column
CREATE TABLE pipelines (
    id UUID PRIMARY KEY,
    customer_id UUID NOT NULL,
    name VARCHAR(255) NOT NULL,
    config JSONB NOT NULL DEFAULT '{}',
    -- config might contain: {"gpu_type": "A100", "batch_size": 32,
    --   "model": "llama-3", "streaming": true}
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Query into JSONB: find all pipelines using A100 GPUs
SELECT * FROM pipelines
WHERE config ->> 'gpu_type' = 'A100';

-- GIN index on JSONB: makes @> (contains) queries fast
CREATE INDEX idx_config ON pipelines USING GIN (config);

-- Now this is indexed:
SELECT * FROM pipelines
WHERE config @> '{"streaming": true}';

-- Expression index: a B-tree on one extracted key
CREATE INDEX idx_config_gpu ON pipelines ((config ->> 'gpu_type'));
-- Smaller than the full GIN index, and it serves the ->> equality
-- query above, which the GIN index does not.

JSONB trap: Don't put everything in JSONB. Columns you filter on frequently (customer_id, status, created_at) should be proper typed columns with proper indexes. Use JSONB for genuinely variable data. If you find yourself writing WHERE config ->> 'status' = 'active' on every query, that should be a real column.

pg_stat_statements: Your Query Performance Bible

sql
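-- pg_stat_statements tracks execution stats for every normalized query.
-- Enable it in postgresql.conf (requires a restart):
-- shared_preload_libraries = 'pg_stat_statements'
-- Then, in each database you want to inspect:
-- CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find the most expensive queries by total time
-- (column names are for PostgreSQL 13+; older versions use total_time, mean_time):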
SELECT
    query,
    calls,
    round(total_exec_time::numeric / 1000, 2) AS total_seconds,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- This tells you:
-- "SELECT * FROM pipelines WHERE customer_id=$1" runs 500K times/day,
-- mean 2ms, stddev 50ms (high variance = some queries hit cold cache)

-- Action: if mean is low but stddev is high, the query is fast USUALLY
-- but slow SOMETIMES. That's your p99 problem.
-- Look at buffer reads (shared_blks_hit vs shared_blks_read) to see
-- if slow queries are hitting disk instead of buffer cache.

Debugging: The Deadlock Investigation

sql
-- Scenario: two transactions deadlock
-- T1: UPDATE pipelines SET status='active' WHERE id='A';
--     UPDATE pipelines SET status='active' WHERE id='B';
-- T2: UPDATE pipelines SET status='active' WHERE id='B';
--     UPDATE pipelines SET status='active' WHERE id='A';
-- T1 locks A, waits for B. T2 locks B, waits for A. DEADLOCK.

-- Fix: always acquire locks in a deterministic order (sorted by ID)
-- Both transactions lock A first, then B. No cycle possible.

-- Detection: PostgreSQL auto-detects deadlocks and aborts one transaction.
-- But the check only runs after a lock wait exceeds deadlock_timeout
-- (1s by default), and the aborted transaction must be retried. Prevention is better.
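The ordering rule is easy to demonstrate outside the database. In this sketch, in-process locks stand in for row locks, and two threads request the same rows in opposite order. Because each sorts its IDs before acquiring, neither can hold B while waiting for A. The names row_locks and update_rows are illustrative only.

```python
import threading

# Stand-ins for the row locks PostgreSQL would take on rows A and B
row_locks = {"A": threading.Lock(), "B": threading.Lock()}

def update_rows(ids, log):
    ordered = sorted(ids)               # same acquisition order in every "transaction"
    for row_id in ordered:
        row_locks[row_id].acquire()
    try:
        log.append(tuple(ordered))      # the UPDATE work would happen here
    finally:
        for row_id in reversed(ordered):
            row_locks[row_id].release()

log = []
t1 = threading.Thread(target=update_rows, args=(["A", "B"], log))
t2 = threading.Thread(target=update_rows, args=(["B", "A"], log))
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)
```

In SQL, the equivalent is to take the locks up front in a fixed order, for example SELECT ... WHERE id IN (...) ORDER BY id FOR UPDATE, before issuing the updates.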

Connection Pool Monitoring

Connection pool exhaustion is one of the top causes of API outages. Monitor these metrics:

python
# Metrics to emit from your connection pool:
pool_total_connections.set(pool.get_size())      # Total connections open
pool_available.set(pool.get_idle_size())          # Connections idle (available)
# asyncpg does not expose a waiter count; keep your own counter around
# pool.acquire() (waiting_requests is that application-level counter).
pool_waiting.set(waiting_requests)                # Requests waiting for a connection
pool_checkout_time.observe(checkout_duration_ms)  # Time to get a connection

# Alert when pool_waiting > 0 for > 30s
# → Requests are queuing for DB connections. You need more pool capacity
#   or PgBouncer is saturated.

# Alert when pool_checkout_time p99 > 100ms
# → Connections are held too long. Look for slow queries or a code path
#   that never releases its connection (e.g. a missing `finally`).

# Typical healthy values at Parallel:
# pool_total: 20, pool_available: 12-18, pool_waiting: 0
# pool_checkout_time p99: <1ms
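If your pool does not report a waiter count or checkout time itself, you can measure both by wrapping the acquire call. The sketch below uses an asyncio.Semaphore as a stand-in for a real connection pool; InstrumentedPool and its attributes are hypothetical names, and a production version would feed the values into your metrics client instead of plain attributes.

```python
import asyncio
import time
from contextlib import asynccontextmanager

class InstrumentedPool:
    """Semaphore-backed stand-in for a connection pool that tracks waiters."""

    def __init__(self, size: int):
        self._sem = asyncio.Semaphore(size)
        self.waiting = 0          # requests currently blocked on acquire
        self.checkout_ms = []     # observed time-to-acquire samples

    @asynccontextmanager
    async def acquire(self):
        self.waiting += 1
        start = time.monotonic()
        try:
            await self._sem.acquire()
        finally:
            self.waiting -= 1     # no longer waiting, whether we got a slot or not
        self.checkout_ms.append((time.monotonic() - start) * 1000)
        try:
            yield
        finally:
            self._sem.release()   # always give the slot back
```

The same wrapper shape works around a real pool: increment before awaiting the acquire, decrement once it returns, and record the elapsed time.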

Frontier: Distributed SQL (2024-2025)

CockroachDB (PostgreSQL-compatible) and TiDB (MySQL-compatible) offer familiar SQL with automatic sharding, distributed transactions, and multi-region replication. Instead of manually managing read replicas and sharding logic, the database handles it. The tradeoff: higher per-query latency (cross-node coordination) in exchange for horizontal scalability with few application changes.

Neon (Postgres) and PlanetScale (MySQL) offer serverless databases with fast branching (create a copy of production for testing in seconds). Neon can also scale compute to zero, so an idle database costs little. This is changing how teams think about database provisioning.

You have a composite index on (customer_id, created_at). Which query can NOT use this index efficiently?

WHERE customer_id = 'abc' ORDER BY created_at DESC
WHERE customer_id = 'abc' AND created_at > '2024-01-01'
WHERE created_at > '2024-01-01' (without customer_id): violates the leftmost prefix rule
WHERE customer_id IN ('abc', 'def') ORDER BY created_at DESC

Chapter 4: Caching Strategies

Caching is the art of remembering expensive answers so you don't compute them again. At Parallel, a cache hit means responding in 2ms instead of 50ms — a 25x improvement. But caching introduces a new problem: how do you know when the cached answer is wrong? Cache invalidation is one of the two hard problems in computer science (the other is naming things).

The Cache Hierarchy

Layer	What it caches	TTL	Hit rate	Latency
CDN Edge	Static assets, public GET responses	5-60 min	~60%	1-10ms (geographically close)
API Gateway	Auth token validation results	5 min	~90%	0.5ms (in-memory)
Application (Redis)	DB query results, computed aggregations	1-10 min	~70%	1-3ms (network hop to Redis)
Application (Local)	Config, feature flags, rate limit rules	30s	~99%	0.01ms (process memory)
Database	Query plan cache, buffer pool (recently-read pages)	N/A (LRU)	~85%	0.1ms (RAM) vs 5ms (disk)

Cache-Aside vs. Write-Through

python
# CACHE-ASIDE (a.k.a. lazy loading) — Parallel's primary pattern
# Application manages the cache explicitly.
async def get_pipeline(pipeline_id: str):
    # 1. Check cache first
    cached = await redis.get(f"pipeline:{pipeline_id}")
    if cached is not None:
        return json.loads(cached)  # Cache HIT: 2ms response

    # 2. Cache MISS: query database
    pipeline = await db.fetch_pipeline(pipeline_id)
    if pipeline is None:
        return None  # Not found: don't cache the miss under the normal TTL

    # 3. Populate cache for next time (TTL = 5 minutes)
    await redis.setex(f"pipeline:{pipeline_id}", 300, json.dumps(pipeline))

    return pipeline

# WRITE-THROUGH — update cache on every write
async def update_pipeline(pipeline_id: str, updates: dict):
    # 1. Write to database
    pipeline = await db.update_pipeline(pipeline_id, updates)

    # 2. Immediately update cache (cache is always fresh)
    await redis.setex(f"pipeline:{pipeline_id}", 300, json.dumps(pipeline))

    return pipeline

The Thundering Herd Problem

Imagine a popular endpoint whose cache entry expires. In the next millisecond, 500 requests arrive, all find the cache empty, and all query the database simultaneously. The database gets 500 identical queries instead of 1. This is the thundering herd (or cache stampede).

python
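# Fix 1: Probabilistic early expiration
# Each request has a small chance of refreshing the cache BEFORE it expires.
# Instead of 500 requests all missing at t=300s, one request refreshes at t=280s.
import math, random

def should_refresh(ttl_remaining: float, delta: float, beta: float = 1.0) -> bool:
    # XFetch algorithm: refresh early when -delta * beta * ln(rand) >= ttl_remaining.
    # delta = how long the last recompute took (seconds); beta > 1 refreshes earlier.
    # The probability of refreshing rises toward 1 as ttl_remaining approaches 0.
    if ttl_remaining <= 0:
        return True
    u = 1.0 - random.random()  # uniform in (0, 1], so log(u) is always defined
    return -delta * beta * math.log(u) >= ttl_remaining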

# Fix 2: Lock-based refresh (single-flight)
# Only one request refreshes; others wait or get stale data.
import asyncio

async def get_with_lock(key: str, fetch_fn, max_attempts: int = 50):
    for _ in range(max_attempts):
        cached = await redis.get(key)
        if cached is not None:
            return json.loads(cached)

        # Try to acquire refresh lock (NX = only if not exists, EX = 5s TTL)
        lock = await redis.set(f"lock:{key}", "1", nx=True, ex=5)
        if lock:
            try:
                # We won the lock: fetch and populate
                data = await fetch_fn()
                await redis.setex(key, 300, json.dumps(data))
                return data
            finally:
                # Release even if fetch_fn raises, so waiters aren't stuck for 5s
                await redis.delete(f"lock:{key}")
        # Someone else is refreshing: wait briefly, then re-check the cache
        await asyncio.sleep(0.1)
    # Bounded wait instead of unbounded recursion: fall back to a direct fetch
    return await fetch_fn()
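The Redis lock coordinates across servers. Inside a single process you can get the same effect without any network round trip by sharing one in-flight task per key. This is a sketch under that assumption; SingleFlight, slow_fetch, and demo are illustrative names, not part of any library used in this chapter.

```python
import asyncio

class SingleFlight:
    """Coalesce concurrent fetches for the same key into one call (per process)."""

    def __init__(self):
        self._inflight = {}

    async def do(self, key, fetch_fn):
        task = self._inflight.get(key)
        if task is None:
            # First caller starts the fetch; later callers share the same task
            task = asyncio.ensure_future(fetch_fn())
            self._inflight[key] = task
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        # shield: one cancelled caller must not cancel the fetch for everyone
        return await asyncio.shield(task)

calls = 0

async def slow_fetch():
    global calls
    calls += 1                    # count how often the "database" is hit
    await asyncio.sleep(0.01)
    return {"id": "pipe_1"}

async def demo():
    flight = SingleFlight()
    return await asyncio.gather(
        *(flight.do("pipeline:pipe_1", slow_fetch) for _ in range(50))
    )

results = asyncio.run(demo())
```

Fifty concurrent callers share a single fetch here. Combining both layers means each server sends at most one request to Redis, and Redis lets at most one server hit the database.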

Cache Invalidation Strategies

The invalidation spectrum: TTL-based (simple, eventually consistent) → Event-driven (a write triggers a cache delete; much fresher, but more complex) → Version-tagged (cache key includes a version counter, bump on write). Parallel uses event-driven for mutable resources (pipelines, jobs) and TTL-based for rarely changing reads (model metadata that changes monthly).

python
# Strategy 1: TTL-based (simplest, eventually consistent)
# Set a TTL when caching. After it expires, next request re-fetches.
await redis.setex(f"pipeline:{id}", 300, data)  # 5 min TTL
# Pro: zero invalidation complexity
# Con: stale for up to TTL seconds after a write

# Strategy 2: Event-driven (fresh right after a write)
# On every write, delete the cache entry.
async def update_pipeline(id, updates):
    pipeline = await db.update(id, updates)
    await redis.delete(f"pipeline:{id}")     # Invalidate
    await redis.delete(f"pipeline_list:{pipeline.customer_id}")  # Invalidate list too!
    return pipeline
# Pro: the next read repopulates the cache with the new value
# Con: must invalidate EVERY cache key that includes this data
# (single resource AND list endpoints; easy to miss one)
# Con: a reader that fetched the old row just before the write can
# re-cache it after the delete, so keep a TTL as a backstop

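# Strategy 3: Version-tagged keys
# Cache key includes a version counter. Bump on write.
# Old keys auto-expire via TTL. No explicit deletion needed.
# On write: bump the version
version = await redis.incr(f"pipeline_version:{id}")
# On read: look up the current version, then build the cache key from it
version = int(await redis.get(f"pipeline_version:{id}") or 0)
cache_key = f"pipeline:{id}:v{version}"
# Pro: no race conditions during invalidation
# Con: one extra Redis read per lookup, more keys, relies on TTL to clean up old versions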

# Strategy 4: Pub/Sub invalidation (for multi-server setups)
# Publish invalidation events. All servers subscribe and clear local caches.
await redis.publish("cache_invalidation", json.dumps({
    "type": "pipeline",
    "id": pipeline_id,
    "action": "updated"
}))

HTTP Cache Headers: Controlling the CDN

HTTP cache headers control how CDNs, browsers, and intermediate proxies cache your responses. Getting these wrong means either serving stale data or bypassing the cache entirely.

python
# For rarely changing resources (model metadata that changes monthly):
# Cache aggressively at CDN and browser.
headers = {
    "Cache-Control": "public, max-age=3600, s-maxage=86400",
    # max-age: browser caches for 1 hour
    # s-maxage: CDN caches for 24 hours
    "ETag": "\"v1-abc123\"",  # For conditional requests
}

# For mutable resources (pipelines, jobs):
# Don't cache at CDN or browser. Let application cache handle it.
headers = {
    "Cache-Control": "private, no-store",
    # private: shared caches (CDNs, proxies) must not store the response
    # no-store: no cache at all, the browser included, may store it
}

# For list endpoints with stale-while-revalidate:
headers = {
    "Cache-Control": "public, max-age=10, stale-while-revalidate=60",
    # Serve cached for 10s. For the following 60s, serve stale AND revalidate in background.
    # Inside that window the user never waits on a cache miss. Great for high-traffic list endpoints.
}
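The ETag header earns its keep on conditional requests: the client sends back the tag it already has in If-None-Match, and the server answers 304 with an empty body when nothing changed. Below is a minimal sketch of that decision. The helper names make_etag and conditional_response are assumptions, and the sketch ignores weak validators (W/"...") and the "*" wildcard for brevity.

```python
import hashlib

def make_etag(body: bytes) -> str:
    """Derive a strong ETag from the response body (quoted, as HTTP requires)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def conditional_response(body: bytes, if_none_match=None):
    """Return (status, headers, payload) for a GET with optional If-None-Match."""
    etag = make_etag(body)
    if if_none_match is not None:
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        if etag in client_tags:
            return 304, {"ETag": etag}, b""   # client copy is current: send no body
    return 200, {"ETag": etag}, body

status, headers, payload = conditional_response(b'{"model": "llama-3"}')
revalidated = conditional_response(b'{"model": "llama-3"}', headers["ETag"])
```

The first call returns the full body with its tag; the second, which presents that tag, returns 304 and saves the transfer.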

Debugging: The Stale Cache Mystery

A customer updates their pipeline name but the GET endpoint returns the old name for 5 minutes. Investigation:

Symptom: Write succeeds (200 OK with new name), subsequent read returns old name.

Root cause 1: Cache-aside without invalidation. The write updates the DB but doesn't delete/update the cache entry. The stale cache is served until TTL expires.

Root cause 2: Read replica lag. The write goes to primary, the read goes to a replica that hasn't caught up. Cache is correct but the source data is stale.

Root cause 3: CDN caching. The GET response has a Cache-Control: max-age=300 header. Even after the application cache is updated, the CDN serves its stale copy.

Frontier: Edge Computing + Cache (2024-2025)

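Cloudflare Workers KV and Vercel Edge Config push cached data to the edge, across 300+ PoPs worldwide, so hot reads can be served in a few milliseconds from a location near the user. The frontier: write-through edge caches that propagate updates worldwide in under 100ms (today's edge stores are typically eventually consistent, and propagation can take considerably longer). Combined with stale-while-revalidate HTTP semantics, users rarely wait on a cache miss.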

A popular cache entry expires and 500 requests simultaneously query the database for the same data. What is this problem called, and what is the most robust fix?

Cache poisoning: add more cache servers
Thundering herd: use a distributed lock so only one request refreshes the cache while others wait or get stale data
Cache bypass: increase the TTL to infinity
Read amplification: add read replicas

Chapter 5: Rate Limiting & Quotas

Without rate limiting, one customer's script gone haywire can take down your entire API. Rate limiting is not about being mean to developers — it's about fairness. Every customer gets their fair share of capacity, and no single customer can starve the others. Think of it as traffic lights on a highway on-ramp: they slow individual cars so the highway keeps flowing for everyone.

Token Bucket Algorithm

The token bucket is the most common rate-limiting algorithm. Imagine a bucket that holds tokens. Tokens are added at a fixed rate (e.g., 100/second). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing short bursts.

python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # Tokens added per second
        self.capacity = capacity    # Max tokens in bucket
        self.tokens = capacity      # Start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True   # Request allowed
        return False  # Rate limited (429)

# Usage: keep one long-lived bucket per customer and check it in the handler
bucket = TokenBucket(rate=100, capacity=200)  # 100 req/s, burst up to 200

def handle(request):
    if not bucket.allow():
        return Response(status=429, headers={
            "Retry-After": "1",
            "X-RateLimit-Limit": "100",
            "X-RateLimit-Remaining": "0",
            "X-RateLimit-Reset": str(int(time.time()) + 1),
        })
    ...  # otherwise process the request as usual

Sliding Window vs. Fixed Window

Fixed window: Count requests in each 1-minute window (e.g., 12:00-12:01, 12:01-12:02). Problem: a burst at 12:00:59 + a burst at 12:01:01 gets 2x the limit because they span two windows.

Sliding window log: Track the timestamp of every request. Count requests in the last 60 seconds. Accurate but memory-intensive (storing every timestamp).

Sliding window counter: Combine the current window's count with a weighted portion of the previous window's count. Approximate but memory-efficient — only two counters per customer.

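remaining ≈ limit - (current_count + previous_count × overlap_fraction)

Here overlap_fraction is the share of the previous window that still falls inside the sliding window: 1 - (seconds elapsed in the current window / window length).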
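A worked example makes the weighting concrete. With a 60-second window and a limit of 100, a request arriving 15 seconds into the current window still "sees" 75% of the previous window. The function below is a sketch of the sliding window counter estimate; its name and parameters are illustrative.

```python
def sliding_window_allows(limit, window_s, now, current_count, previous_count):
    """Return (allowed, estimated_count) using the two-counter approximation."""
    elapsed = now % window_s                      # seconds into the current window
    overlap_fraction = 1 - elapsed / window_s     # share of previous window still in view
    estimated = current_count + previous_count * overlap_fraction
    return estimated < limit, estimated

# 15s into the window: 30 requests so far, 80 in the previous window
allowed, estimated = sliding_window_allows(
    limit=100, window_s=60, now=75, current_count=30, previous_count=80
)
```

The estimate is 30 + 80 × 0.75 = 90, so the request is allowed. With 45 requests in the current window the estimate would be 105 and the request would be rejected.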

Distributed Rate Limiting with Redis

python
# The challenge: your API runs on 10 servers. A per-process token bucket
# allows 10x the intended limit (each server has its own bucket).
# Solution: centralized counter in Redis.

async def check_rate_limit(customer_id: str, limit: int, window: int) -> bool:
    # Fixed-window counter: one key per customer per window
    key = f"rl:{customer_id}:{int(time.time()) // window}"
    # INCR + EXPIRE sent together in one round trip
    pipe = redis.pipeline()
    pipe.incr(key)
    pipe.expire(key, window + 1)  # Outlive the window by 1s so the key can't vanish mid-window
    count, _ = await pipe.execute()
    return count <= limit

# Per-customer quotas: different tiers get different limits
TIER_LIMITS = {
    "free":       {"rpm": 60,   "rpd": 1000},
    "pro":        {"rpm": 600,  "rpd": 50000},
    "enterprise": {"rpm": 6000, "rpd": 1000000},
}

Graceful Degradation

Instead of a hard 429, sophisticated APIs degrade gracefully. At 80% of the limit, start returning slightly stale cached data (faster, cheaper). At 100%, return 429 with Retry-After header. At 200% (suspected abuse), temporarily block the API key and alert the security team.

python
# Graceful degradation with response headers
async def rate_limit_middleware(request, call_next):
    customer = request.customer
    usage = await get_usage(customer.id)
    limit = TIER_LIMITS[customer.tier]["rpm"]

    # Always include rate limit headers (even when not limited)
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, limit - usage)),
        "X-RateLimit-Reset": str(next_window_timestamp()),
    }

    if usage >= limit * 2:
        # 200%+: suspected abuse, hard block
        await alert_security(customer.id, usage)
        return Response(status=429, headers={**headers, "Retry-After": "60"})

    if usage >= limit:
        # 100%+: return 429 with retry guidance.
        # The limit is per minute, so point the client at the next window.
        retry_after = max(1, next_window_timestamp() - int(time.time()))
        return Response(status=429, headers={**headers, "Retry-After": str(retry_after)},
            body={"error": {"type": "rate_limit_exceeded",
                         "message": f"Rate limit of {limit} requests/minute exceeded. Retry after {retry_after} seconds.",
                         "doc_url": "https://docs.parallel.dev/rate-limits"}})

    if usage >= limit * 0.8:
        # 80%+: ask handlers to serve cached data, and add a warning header
        request.prefer_cached = True  # hypothetical flag; handlers must check it
        headers["X-RateLimit-Warning"] = "Approaching rate limit"

    response = await call_next(request)
    response.headers.update(headers)
    return response

Debugging: The "Why Am I Rate Limited?" Investigation

The most common support ticket: "I'm getting 429 errors but I'm only making 10 requests/minute." Causes:

1. Multiple processes sharing one API key. The developer has 6 workers, each making 10 req/min = 60 total. They only see their worker's count.

2. Retry storms. Their client retries 429s immediately (without backoff), consuming more tokens and getting more 429s. A 10 req/min client can generate 100 actual requests/min through retries.

3. Clock skew. Their system clock is 30 seconds ahead. They think they're in a new rate limit window but the server disagrees.
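The fix for retry storms lives in the client: back off exponentially, add jitter so workers do not retry in lockstep, and never retry sooner than the server's Retry-After. A minimal sketch follows; backoff_delay and its default values are assumptions for illustration, not Parallel SDK behavior.

```python
import random

def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    ceiling = min(cap, base * (2 ** attempt))   # exponential growth, capped
    delay = rng.uniform(0, ceiling)             # "full jitter": spread retries out
    if retry_after is not None:
        delay = max(delay, float(retry_after))  # never retry sooner than the server asked
    return delay

# Delays for the first five retries of one request (values vary per run)
delays = [backoff_delay(attempt) for attempt in range(5)]
```

A client that sleeps for backoff_delay(attempt, response_retry_after) between attempts turns a storm of immediate retries into a handful of spaced-out ones.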

Frontier: Cost-Based Rate Limiting (2024-2025)

Not all requests cost the same. A simple GET costs 1 "unit" but a complex search query costs 10 units (more CPU, more DB I/O). Cost-based rate limiting (used by GitHub GraphQL, Shopify, and now Parallel) assigns a cost to each request type and deducts from a budget. This prevents one customer from monopolizing expensive endpoints while staying within their "request count" limit.

python
# Cost-based rate limiting implementation
ENDPOINT_COSTS = {
    "GET /v2/pipelines/{id}":     1,   # Cheap: single row lookup
    "GET /v2/pipelines":           5,   # Medium: list query + pagination
    "POST /v2/pipelines":          10,  # Expensive: validation + DB write + queue
    "GET /v2/pipelines/{id}/logs": 20,  # Very expensive: scan log storage
    "POST /v2/search":             25,  # Most expensive: full-text search
}

# Customer budget: 10,000 units/minute (pro tier)
# A customer can make 10,000 cheap GETs, or 400 searches, or a mix.
# Response headers show remaining budget:
# X-RateLimit-Cost: 25
# X-RateLimit-Budget-Remaining: 4975
# X-RateLimit-Budget-Reset: 1716400060

async def check_cost_limit(customer, endpoint, method):
    # customer carries .id and .tier; endpoint is the route template,
    # e.g. "/v2/pipelines/{id}", so it matches the ENDPOINT_COSTS keys
    cost = ENDPOINT_COSTS.get(f"{method} {endpoint}", 1)
    key = f"budget:{customer.id}:{current_minute()}"
    current = await redis.incrby(key, cost)
    if current == cost:  # First request this minute
        await redis.expire(key, 61)
    budget = TIER_BUDGETS[customer.tier]  # e.g., 10000
    return current <= budget, max(0, budget - current), cost

Your API runs on 8 servers. Each server has a local token bucket allowing 100 requests/second. What is the actual per-customer rate limit?

100 requests/second (each server enforces independently)
Up to 800 requests/second: each server allows 100, and requests are load-balanced across all 8. You need a centralized counter (Redis) for accurate limits.
12.5 requests/second (100 / 8 servers)
It depends on the load balancer algorithm

Chapter 6: Authentication & Authorization

Authentication answers "who are you?" Authorization answers "what are you allowed to do?" Get either wrong and you're on the front page of Hacker News — not in a good way. At Parallel, every API request must be authenticated, every action must be authorized, and every access must be logged for audit.

API Key Authentication

The simplest auth: the client sends a secret key in every request. API keys are easy for developers (just add a header) but dangerous if leaked (no expiration, full access). Parallel uses API keys for server-to-server calls where the client is a backend service, not a browser.

python
# API key design: prefix + random bytes
# Prefix makes keys greppable in logs: "pk_live_" vs "pk_test_"
# Store the HASH in the database, not the key itself.

import secrets, hashlib

def generate_api_key(prefix: str = "pk_live_") -> tuple[str, str]:
    raw = secrets.token_urlsafe(32)   # 256 bits of entropy
    key = prefix + raw                # pk_live_a3Bc9d...
    key_hash = hashlib.sha256(key.encode()).hexdigest()
    return key, key_hash  # Give key to user, store hash in DB

async def validate_api_key(key: str) -> Customer | None:
    key_hash = hashlib.sha256(key.encode()).hexdigest()
    # Lookup by hash: the database compares hashes, never the raw key, so
    # response timing reveals nothing useful about the key itself.
    return await db.get_customer_by_key_hash(key_hash)

OAuth 2.0 + JWT

For user-facing applications (dashboards, CLI tools), Parallel uses OAuth 2.0 with JWT (JSON Web Tokens). The flow: the user authenticates with their identity provider, gets a JWT, and sends it with every request. The server validates the JWT's signature without hitting a database: the token contains the user's identity and permissions, signed with the identity provider's private key and verified with the matching public key.

python
# JWT structure: header.payload.signature
# Header: {"alg": "RS256", "typ": "JWT"}
# Payload: {"sub": "user_123", "org": "org_456",
#           "roles": ["admin"], "exp": 1716400000}
# Signature: RS256(base64url(header) + "." + base64url(payload), private_key)

import jwt

def validate_jwt(token: str, public_key: str) -> dict:
    try:
        payload = jwt.decode(token, public_key, algorithms=["RS256"])
        return payload  # {"sub": "user_123", "roles": [...], ...}
    except jwt.ExpiredSignatureError:
        raise AuthError("Token expired", code=401)
    except jwt.InvalidTokenError:
        raise AuthError("Invalid token", code=401)

Role-Based Access Control (RBAC)

RBAC maps users to roles, and roles to permissions. A user can have multiple roles, each granting specific permissions on specific resources.

Role	Permissions	Use case
viewer	read:pipelines, read:jobs	Dashboard-only users, monitoring
developer	viewer + create:pipelines, update:pipelines	Engineers building on the platform
admin	developer + delete:pipelines, manage:keys, manage:members	Team leads, account owners
billing	read:usage, manage:billing	Finance team (no API access)
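The role table translates directly into a lookup. This sketch builds each role's permission set from the rows above (developer inherits viewer, admin inherits developer) and checks a user holding several roles; ROLE_PERMISSIONS and is_allowed are illustrative names, not Parallel's implementation.

```python
# Permission sets taken from the role table; inheritance is expressed with set union
ROLE_PERMISSIONS = {
    "viewer": {"read:pipelines", "read:jobs"},
    "billing": {"read:usage", "manage:billing"},
}
ROLE_PERMISSIONS["developer"] = ROLE_PERMISSIONS["viewer"] | {
    "create:pipelines", "update:pipelines",
}
ROLE_PERMISSIONS["admin"] = ROLE_PERMISSIONS["developer"] | {
    "delete:pipelines", "manage:keys", "manage:members",
}

def is_allowed(roles, permission):
    """A user is allowed if ANY of their roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)

can_update = is_allowed(["billing", "developer"], "update:pipelines")
can_delete = is_allowed(["developer"], "delete:pipelines")
```

Unknown roles grant nothing, so a typo in a role name fails closed rather than open.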

Webhook Signatures

When your API sends webhooks (event notifications), the receiver needs to verify they came from you, not an attacker. The solution: HMAC signatures.

python
import hmac, hashlib

def sign_webhook(payload: bytes, secret: str) -> str:
    return hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()

# Sending: compute signature, include in header
sig = sign_webhook(body, customer_webhook_secret)
headers = {"X-Parallel-Signature": f"sha256={sig}"}

# Receiving (customer's code): verify signature
def verify_webhook(payload: bytes, header_sig: str, secret: str) -> bool:
    expected = sign_webhook(payload, secret)
    # Constant-time comparison prevents timing attacks
    return hmac.compare_digest(f"sha256={expected}", header_sig)

Key Rotation

API keys get leaked. Developers commit them to GitHub. Employees leave. You need a rotation mechanism that doesn't break existing integrations.

Dual-key rotation pattern: When a customer rotates their key, the old key stays valid for 24 hours. During this window, both keys work. This gives the customer time to deploy the new key across all their services. After 24 hours, the old key is permanently revoked. This prevents the "chicken-and-egg" problem where rotating a key breaks the service that needs the key to deploy the new key.

python
# Key rotation implementation
from datetime import datetime, timedelta, timezone

async def rotate_api_key(customer_id: str) -> dict:
    # Generate new key
    new_key, new_hash = generate_api_key("pk_live_")

    # Get current key info
    current = await db.get_active_key(customer_id)

    # Mark current key as "expiring" with 24h grace period.
    # Compute the deadline once so the database and the response agree.
    expires_at = datetime.now(timezone.utc) + timedelta(hours=24)
    await db.update_key(current.id, status="expiring", expires_at=expires_at)

    # Insert new key as active
    await db.insert_key(customer_id, key_hash=new_hash, status="active")

    # Both keys work during the transition window
    return {
        "new_key": new_key,  # Show once, never stored in plaintext
        "old_key_expires_at": expires_at.isoformat(),
        "message": "Deploy the new key within 24 hours. The old key will be revoked after that."
    }

# Validation now checks both active and expiring keys:
async def validate_api_key(key: str):
    key_hash = hashlib.sha256(key.encode()).hexdigest()
    record = await db.get_key_by_hash(key_hash)
    if not record:
        return None
    if record.status == "revoked":
        return None
    # Timezone-aware comparison (datetime.utcnow() is naive and deprecated)
    if record.status == "expiring" and record.expires_at < datetime.now(timezone.utc):
        return None  # Grace period over
    return record.customer

Audit Logging: Who Did What When

Every API action that modifies data must produce an audit log entry. This is essential for security investigations, compliance (SOC 2), and debugging customer issues ("when did this pipeline get deleted?").

python
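# Audit log schema: immutable, append-only
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)  # frozen: an entry cannot be modified after creation
class AuditEntry: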
    id: str                 # Unique event ID
    timestamp: datetime     # When
    actor_id: str           # Who (user_id or api_key_id)
    actor_type: str         # "user", "api_key", "system"
    action: str             # "pipeline.created", "key.rotated"
    resource_type: str      # "pipeline", "api_key"
    resource_id: str        # "pipe_abc123"
    ip_address: str         # Request source IP
    user_agent: str         # SDK version, etc.
    changes: dict           # {"status": {"from": "active", "to": "deleted"}}
    request_id: str         # Correlate with request logs

# Write to an append-only table (no UPDATE, no DELETE)
# Retention: 2 years minimum (a policy choice: SOC 2 does not fix a period,
# but auditors check that you follow your stated retention policy)
# Indexed on: actor_id, resource_id, action, timestamp

Debugging: The "403 Forbidden" Mystery

A developer says: "I'm getting 403 on POST /v2/pipelines but I'm an admin." Investigation:

Check 1: Are they using the right API key? They might have a test key (pk_test_) hitting the production endpoint.

Check 2: Rule out an expired JWT. An expired or invalid token returns 401, not 403. A 403 means the token is valid but its permissions are insufficient, so check the roles in the token payload.

Check 3: Organization context. They're an admin in org_A but their request targets a resource in org_B. RBAC is per-organization.
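For checks 2 and 3, the quickest evidence is the token's own payload, which is just base64url-encoded JSON. The sketch below decodes it without verifying the signature, which is fine for debugging and never acceptable for authorization. peek_jwt_payload and the sample token are illustrative; the sample's signature segment is a placeholder.

```python
import base64
import json

def peek_jwt_payload(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature. Debugging only."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)   # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))

def _b64(obj) -> str:
    """Encode a dict the way JWT segments are encoded (base64url, no padding)."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()

# Build a sample unsigned token shaped like the one described earlier
sample = ".".join([
    _b64({"alg": "RS256", "typ": "JWT"}),
    _b64({"sub": "user_123", "org": "org_456", "roles": ["viewer"]}),
    "signature-omitted",
])
claims = peek_jwt_payload(sample)
```

Here the payload shows roles of ["viewer"] in org_456, which would explain a 403 on a create call: the developer is an admin somewhere, but not in the token they are actually sending.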

Scoped API Keys: Principle of Least Privilege

A single all-powerful API key is dangerous. If leaked, the attacker has full access to everything. Scoped keys limit each key to specific permissions and resources.

http
# Scoped key creation — the customer requests specific permissions
POST /v2/api-keys
{
    "name": "CI/CD deploy key",
    "permissions": ["pipelines:write", "jobs:read"],
    "resource_ids": ["pipe_abc", "pipe_def"],  # Only these pipelines
    "expires_at": "2025-06-22T00:00:00Z",      # Auto-expire in 30 days
    "ip_whitelist": ["203.0.113.0/24"]         # Only from CI network
}

# Response: pk_live_scoped_... (this key can ONLY write to those two
# pipelines and read jobs, from that IP range, for 30 days)

# If this key is leaked, damage is contained:
# ✓ Can't read customer data (no "customers:read" permission)
# ✓ Can't delete pipelines (no "pipelines:delete" permission)
# ✓ Expires automatically
# ✓ Fails from non-whitelisted IPs
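On the server side, every request made with a scoped key has to pass all four constraints: expiry, permission, resource, and source IP. Here is a minimal sketch of that check using the request body above as the key record. The function scoped_key_allows is hypothetical, and it treats an empty list as "nothing allowed".

```python
import ipaddress
from datetime import datetime, timezone

def scoped_key_allows(key, permission, resource_id, client_ip, now=None):
    """Return True only if the request satisfies every scope on the key."""
    now = now or datetime.now(timezone.utc)
    expires_at = datetime.fromisoformat(key["expires_at"].replace("Z", "+00:00"))
    if now >= expires_at:
        return False                                  # key has expired
    if permission not in key["permissions"]:
        return False                                  # action not granted
    if resource_id not in key["resource_ids"]:
        return False                                  # resource out of scope
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in key["ip_whitelist"])

ci_key = {
    "permissions": ["pipelines:write", "jobs:read"],
    "resource_ids": ["pipe_abc", "pipe_def"],
    "expires_at": "2025-06-22T00:00:00Z",
    "ip_whitelist": ["203.0.113.0/24"],
}
when = datetime(2025, 6, 1, tzinfo=timezone.utc)
ok = scoped_key_allows(ci_key, "pipelines:write", "pipe_abc", "203.0.113.7", now=when)
```

Each early return maps to one of the ✓ lines above: a leaked key fails on the first constraint it violates.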

Frontier: Passkeys + FIDO2 (2024-2025)

The frontier of API authentication is moving beyond shared secrets. Passkeys (WebAuthn/FIDO2) use public-key cryptography — the private key never leaves the user's device. No passwords to leak, no API keys to rotate. For machine-to-machine auth, SPIFFE/SPIRE provides cryptographic identity without shared secrets, using short-lived X.509 certificates that rotate automatically.

Why should you store the SHA-256 hash of an API key in your database instead of the key itself?

Hashes are faster to compare
Hashes use less storage space
If the database is breached, attackers get hashes (useless) instead of keys (which grant API access)
It's required by OAuth 2.0 specification

Chapter 7: Observability & Reliability

You cannot improve what you cannot measure. Observability is the ability to understand a system's internal state from its external outputs — logs, metrics, and traces. Reliability is the discipline of making promises (SLOs) and keeping them. Together, they answer: "Is the API healthy, and how do I know?"

The Three Pillars

Logs: Structured event records. Every request produces a log entry with request_id, endpoint, status, latency, customer_id. Use JSON format so they're machine-parseable. Avoid print("something went wrong") — use structured logging with severity levels.

python
import structlog

log = structlog.get_logger()

# GOOD: structured, searchable, correlatable
log.info("request.completed",
    request_id="req_abc123",
    endpoint="GET /v2/pipelines/{id}",
    status=200,
    latency_ms=42.3,
    customer_id="cust_xyz",
    cache_hit=True,
    db_query_ms=0,
)
# Output: {"event": "request.completed", "request_id": "req_abc123", ...}
# Searchable in Datadog/Grafana: filter by customer_id, sort by latency_ms

# BAD: unstructured, impossible to search programmatically
print(f"Completed request for pipeline abc in 42ms")

Metrics: Numerical time-series data. Counter (requests_total), gauge (active_connections), histogram (request_latency_seconds). Metrics tell you "what is happening right now" but not "why."

python
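# The four types of metrics and when to use each.
# Metric objects are defined here with the Prometheus Python client.
from prometheus_client import Counter, Gauge, Histogram

http_requests_total = Counter(
    "http_requests_total", "HTTP requests served", ["method", "endpoint", "status"])
active_db_connections = Gauge("active_db_connections", "DB connections currently in use")
request_queue_depth = Gauge("request_queue_depth", "Requests waiting in the queue")
request_latency = Histogram(
    "request_latency_seconds", "Request latency in seconds", ["endpoint"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5])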

# COUNTER — monotonically increasing. Good for rates (req/s, errors/s).
http_requests_total.labels(method="GET", endpoint="/v2/pipelines", status=200).inc()

# GAUGE — goes up and down. Good for current state.
active_db_connections.set(pool.size - pool.available)
request_queue_depth.set(queue.qsize())

# HISTOGRAM — distribution of values. Good for latencies.
request_latency.labels(endpoint="/v2/pipelines").observe(duration_seconds)
# Lets you estimate p50, p90, p99 at query time (e.g. PromQL histogram_quantile()).
# Accuracy depends on the bucket boundaries: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5]

# SUMMARY — like histogram but pre-computes quantiles client-side.
# Avoid unless you specifically need client-side quantiles.
# Histograms are more flexible (can aggregate across instances).
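
To make the histogram trade-off concrete, here is a rough sketch of how a percentile can be estimated from cumulative bucket counts by linear interpolation, similar in spirit to what a query-time quantile function does. The helper names (`bucket_counts`, `estimate_quantile`) are hypothetical, and the bucket boundaries are the ones listed above.

```python
import bisect

BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5]

def bucket_counts(observations):
    """Cumulative counts per upper bound; the last entry is the +Inf bucket."""
    counts = [0] * (len(BUCKETS) + 1)
    for value in observations:
        # First bucket whose upper bound is >= value
        counts[bisect.bisect_left(BUCKETS, value)] += 1
    cumulative, running = [], 0
    for c in counts:
        running += c
        cumulative.append(running)
    return cumulative

def estimate_quantile(q, cumulative):
    """Interpolate linearly inside the bucket that contains the q-th observation."""
    rank = q * cumulative[-1]
    idx = next(i for i, c in enumerate(cumulative) if c >= rank)
    if idx == len(BUCKETS):
        return BUCKETS[-1]  # fell into +Inf: report the highest finite bound
    lower = BUCKETS[idx - 1] if idx > 0 else 0.0
    prev = cumulative[idx - 1] if idx > 0 else 0
    in_bucket = cumulative[idx] - prev
    return lower + (BUCKETS[idx] - lower) * (rank - prev) / in_bucket

# 90 fast requests (20ms) and 10 slow ones (200ms)
cumulative = bucket_counts([0.02] * 90 + [0.2] * 10)
print("estimated p99 (seconds):", estimate_quantile(0.99, cumulative))
```

The estimate lands somewhere inside the 0.1 to 0.25 bucket rather than exactly on 0.2, which is why bucket boundaries should be chosen around the latencies your SLO cares about.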

Traces: End-to-end request paths through the system. A trace contains spans, each representing a unit of work (auth check, DB query, cache lookup). Traces tell you "where is the time going" for a specific request.

python
# OpenTelemetry tracing — the industry standard (2024+)
from opentelemetry import trace

tracer = trace.get_tracer("parallel.api")

async def get_pipeline(request, pipeline_id):
    with tracer.start_as_current_span("get_pipeline") as span:
        span.set_attribute("pipeline.id", pipeline_id)
        span.set_attribute("customer.id", request.customer_id)

        with tracer.start_as_current_span("cache.lookup") as cache_span:
            cached = await redis.get(f"pipeline:{pipeline_id}")
            # Record hit or miss on the lookup span itself
            cache_span.set_attribute("cache.hit", cached is not None)
            if cached:
                return cached

        with tracer.start_as_current_span("db.query") as db_span:
            pipeline = await db.fetch(pipeline_id)
            db_span.set_attribute("db.statement", "SELECT * FROM pipelines WHERE id=$1")
            db_span.set_attribute("db.rows_returned", 1)

        return pipeline

# This produces a trace like:
# [get_pipeline: 32ms]
#   ├── [cache.lookup: 1ms] (hit=false)
#   └── [db.query: 28ms] (rows=1)
# Each span has timestamps, attributes, and parent-child relationships.

SLIs, SLOs, and Error Budgets

A Service Level Indicator (SLI) is a metric you care about: request latency, error rate, availability. A Service Level Objective (SLO) is a target for that metric: "99.9% of requests complete in under 200ms." An error budget is how much you're allowed to miss the SLO: 0.1% of requests can be slow.

SLI	SLO	Error Budget (monthly)	Action on budget burn
Availability	99.95%	21.9 minutes downtime	Freeze deployments, investigate
Latency (p99)	<200ms	1% of requests can exceed 200ms	Scale up, optimize slow paths
Error rate	<0.1%	~4,300 errors per 4.3M requests	Rollback last deploy, page on-call

Error budgets change the conversation. Instead of "we can never have downtime" (unrealistic and paralyzing), error budgets say "we have 21.9 minutes to spend this month." This lets you take calculated risks: deploy a risky migration on Monday knowing you have budget for 10 minutes of degradation. If the budget is exhausted, you freeze features and focus on reliability until it recovers.
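
The budget figures above follow directly from the SLO. A minimal sketch of the arithmetic, assuming an average month of 30.4375 days (the function names are illustrative, not part of any Parallel tooling):

```python
# Average month length: 365.25 days / 12 = 30.4375 days
MINUTES_PER_MONTH = 30.4375 * 24 * 60  # 43,830 minutes

def downtime_budget_minutes(slo: float) -> float:
    """Minutes of allowed downtime per month for an availability SLO such as 0.9995."""
    return (1.0 - slo) * MINUTES_PER_MONTH

def request_error_budget(slo: float, total_requests: int) -> int:
    """Number of requests that may fail while still meeting a success-rate SLO."""
    return round((1.0 - slo) * total_requests)

print(round(downtime_budget_minutes(0.9995), 1), "minutes at 99.95%")
print(round(downtime_budget_minutes(0.999), 1), "minutes at 99.9%")
print(request_error_budget(0.999, 4_300_000), "failed requests allowed out of 4.3M")
```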

Alerting: Signal vs. Noise

yaml
# BAD alert: fires on any 500 error
# Result: 50 alerts/day, on-call ignores them all, misses real outage
- alert: Any500Error
  expr: http_errors_total{status="500"} > 0

# GOOD alert: fires on burn rate (how fast are we burning the error budget?)
# At 14.4x the budget rate, a 30-day budget is gone in about 2 days (30 / 14.4).
# This triggers a page. Low burn rate gets a ticket.
- alert: HighErrorBurnRate
  expr: |
    (sum(rate(http_errors_total[1h])) / sum(rate(http_requests_total[1h])))
    > (14.4 * 0.001)  # 14.4x the 0.1% error budget rate
  for: 5m
  labels:
    severity: page

- alert: SlowErrorBurnRate
  expr: |
    (sum(rate(http_errors_total[6h])) / sum(rate(http_requests_total[6h])))
    > (3 * 0.001)  # 3x the budget rate
  for: 30m
  labels:
    severity: ticket
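
The multipliers in burn-rate rules come from simple arithmetic: a burn rate of N means the budget is being consumed N times faster than the SLO allows, so a 30-day budget lasts 30/N days. A small sketch, with hypothetical helper names and a 30-day SLO window assumed:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    return error_ratio / (1.0 - slo)

def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if this burn rate is sustained."""
    return window_days / rate

def budget_consumed(rate: float, alert_window_hours: float,
                    window_days: float = 30.0) -> float:
    """Fraction of the whole budget burned during one alert window at this rate."""
    return rate * alert_window_hours / (window_days * 24)

rate = burn_rate(error_ratio=0.0144, slo=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in {days_until_exhausted(rate):.1f} days")
print(f"budget consumed during the 1h window: {budget_consumed(rate, 1):.1%}")
```

Paging on the fast window and ticketing on the slow one means the page fires only when a meaningful slice of the budget disappears quickly.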

The Four Golden Signals

Google's SRE book defines four signals that every API must monitor. If you build one dashboard, build this one:

Signal	Metric	Alert threshold	What it tells you
Latency	request_duration_seconds (histogram)	p99 > 200ms for 5min	Something is slow: DB? Cache miss? Downstream service?
Traffic	http_requests_total (counter)	+50% vs. 1-week average	Organic growth, or viral partner, or attack?
Errors	http_errors_total (counter)	Error rate > 0.5% for 5min	Bug deployed, DB down, or upstream failure?
Saturation	CPU, memory, DB connections, queue depth	Any resource > 80% for 10min	Running out of capacity. Scale or shed load.

Debugging: "The System Is Slow But I Don't Know Why"

Step 1: Check the golden signals dashboard: latency, traffic, errors, saturation. Which signal is abnormal?

Step 2: If latency is high, pull a trace for a slow request. Find the slow span.

Step 3: If errors are high, check the error rate by endpoint and by error code. A spike in 503s usually points to backend saturation. A spike in 400s usually points to a client-side change (new SDK version with a bug?).

Step 4: Check saturation metrics: CPU, memory, DB connections, Redis connections. If DB connections are at max, the problem is connection pool exhaustion, not your application code.

python
# Real debugging session: p99 spiked from 200ms to 800ms

# Step 1: Which endpoint?
# → Only GET /v2/pipelines/{id} is slow. Others are fine.
# Conclusion: problem is in this handler, not a shared layer.

# Step 2: Which span is slow?
# Pull 10 slow traces. All show db.query span = 600ms.
# Conclusion: database query is the bottleneck.

# Step 3: Which query?
# Check pg_stat_statements for the slowest queries:
# SELECT * FROM pipelines WHERE id = $1
# mean_time went from 2ms to 600ms yesterday.

# Step 4: What changed yesterday?
# Migration log: added index on (customer_id, status) at 3 PM.
# Index creation on 50M rows took 8 minutes with CONCURRENTLY,
# but the planner started using a suboptimal query plan afterward.

# Fix: ANALYZE pipelines; (refresh query planner statistics)
# p99 drops back to 200ms within 2 minutes.

Frontier: AI-Powered Observability (2024-2025)

Anomaly detection uses ML models that learn normal patterns and alert on deviations, with no manual threshold tuning. Natural language querying: ask "why is the /v2/jobs endpoint slow today?" and the system correlates traces, metrics, and logs to generate an answer. Tools like Datadog's Bits AI and Honeycomb's Query Assistant are making this real.

SLO Error Budget Dashboard

Watch the error budget burn in real time. Inject errors to see how the burn rate alert fires.

Your SLO is 99.9% availability (43.8 minutes error budget per month). You've used 40 minutes. Your team wants to deploy a risky database migration. What should you do?

Deploy immediately: 3.8 minutes of budget remaining is enough
Cancel the migration entirely
Delay until next month when the budget resets, or do the migration during a maintenance window that doesn't count against the SLO
Lower the SLO target to 99.5% to get more budget

Chapter 8: Scaling Patterns

Your API starts on one server. Then customers arrive. Then a viral integration sends 50x your normal traffic in 10 minutes. Scaling is not about handling today's load — it's about designing systems that can handle 10x without re-architecture and 100x with a planned migration. The system that scales well is the one where adding capacity is boring.

Horizontal vs. Vertical Scaling

Vertical scaling: Bigger machine (more CPU, RAM, faster disk). Simple but limited — there's a biggest machine you can buy. At Parallel, we vertically scale the primary database (it's the one piece that's hard to horizontally scale).

Horizontal scaling: More machines. Add API servers behind a load balancer. Works for stateless services. The challenge: state must be externalized (to a database, Redis, or object store) so any server can handle any request.

Sharding: Splitting the Database

When one database can't handle the load, you split it across multiple databases. Each shard holds a subset of the data. The key decision: what do you shard by?

python
# Sharding by customer_id: all data for one customer lives on one shard.
# Pro: queries within one customer never cross shards.
# Con: one big customer can hotspot a shard.

import hashlib

def get_shard(customer_id: str, num_shards: int) -> int:
    # Hash-mod sharding: customer_id → hash → shard number.
    # Note: this is not consistent hashing; changing num_shards remaps most customers.
    return int(hashlib.md5(customer_id.encode()).hexdigest(), 16) % num_shards

# Sharding by time: data for 2024-Q1 on shard A, Q2 on shard B.
# Pro: old data can be archived/compressed. Queries on recent data are fast.
# Con: cross-time-range queries need scatter-gather across shards.

# Parallel's approach: shard by customer_id with a routing layer.
# The routing table lives in Redis (fast lookup).
# When adding a shard, we migrate customers one by one (dual-write pattern).

Async Processing: Queue-Based Architecture

Not everything needs to happen in the request path. Creating a GPU pipeline takes 30 seconds, but the API should respond in 200ms. The solution: accept the request, put it on a job queue, return a job ID, and process asynchronously.

python
# Synchronous (BAD for long operations):
# POST /v2/pipelines → blocks for 30s → returns pipeline
# Client timeout, load balancer timeout, terrible UX.

# Asynchronous (GOOD):
# POST /v2/pipelines → returns 202 Accepted + job_id (200ms)
# GET /v2/jobs/{job_id} → returns status: "running" / "completed"

from uuid import uuid4

async def create_pipeline(req: Request) -> Response:
    # Validate, then enqueue
    job_id = str(uuid4())
    await queue.publish("pipeline.create", {
        "job_id": job_id,
        "customer_id": req.customer_id,
        "config": req.body,
    })
    return Response(
        status=202,
        body={"job_id": job_id, "status": "queued"},
        headers={"Location": f"/v2/jobs/{job_id}"}
    )

# Worker process (separate from API server):
async def process_pipeline_job(msg):
    await db.update_job(msg["job_id"], status="running")
    pipeline = await gpu_scheduler.create(msg["config"])
    await db.update_job(msg["job_id"], status="completed", result=pipeline)

Connection Pooling at Scale

The connection math: 50 API servers × 20 connections each = 1,000 connections to the database. PostgreSQL's practical limit is ~500 before performance degrades. Solution: PgBouncer in front of the database, multiplexing 1,000 client connections into 100 server connections. Each API server connects to PgBouncer, not directly to PostgreSQL.

Circuit Breaker Pattern

When a downstream service (database, payment API, GPU scheduler) fails, your API shouldn't keep hammering it. That makes recovery slower and wastes resources. A circuit breaker detects failures and stops sending requests until the service recovers.

python
# Circuit breaker state machine:
# CLOSED → requests flow normally, failures counted
# OPEN   → requests fail-fast (503), no downstream call
# HALF   → one test request allowed, if it succeeds → CLOSED

# Transitions:
# CLOSED → OPEN:  when failure count exceeds threshold (e.g., 5 in 60s)
# OPEN → HALF:    after reset_timeout (e.g., 30s)
# HALF → CLOSED:  if test request succeeds
# HALF → OPEN:    if test request fails

# In a handler:
db_breaker = CircuitBreaker(threshold=5, reset_time=30)

async def get_pipeline(pipeline_id):
    # Try cache first (cache doesn't use circuit breaker)
    cached = await redis.get(f"pipeline:{pipeline_id}")
    if cached:
        return cached

    # DB call is protected by circuit breaker
    try:
        result = await db_breaker.call(db.fetch_pipeline, pipeline_id)
        return result
    except CircuitOpenError:
        # Circuit is open — return degraded response
        return Response(status=503, body={
            "error": {"type": "service_degraded",
                     "message": "Database temporarily unavailable. Cached data may be stale.",
                     "retry_after": 30}
        })
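
The handler above assumes a `CircuitBreaker` class exists. One possible minimal version is sketched below. It is simplified on purpose: it counts consecutive failures rather than failures in a 60-second window, and in the HALF state it does not limit concurrency to a single test request. The `clock` parameter is an addition for testability.

```python
import asyncio
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, threshold=5, reset_time=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_time = reset_time
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means CLOSED

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.reset_time:
            return "HALF"
        return "OPEN"

    async def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "HALF" or self._failures >= self.threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed probe
            raise
        # Success closes the circuit and clears the failure count
        self._failures = 0
        self._opened_at = None
        return result

async def demo():
    breaker = CircuitBreaker(threshold=2, reset_time=30)

    async def failing_query():
        raise ConnectionError("db down")

    for _ in range(2):
        try:
            await breaker.call(failing_query)
        except ConnectionError:
            pass
    return breaker.state

print(asyncio.run(demo()))
```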

Backpressure: Protecting Yourself

Backpressure is the mechanism by which an overloaded system signals upstream to slow down. Without it, requests pile up in memory until the server OOMs. With it, excess requests are rejected gracefully (429 or 503) before they consume resources.

python
# Backpressure via request queue with bounded capacity:
import asyncio

request_queue = asyncio.Queue(maxsize=1000)  # Bounded!

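async def handle_request(request):
    # The future lets the handler wait for the worker's result
    done = asyncio.get_running_loop().create_future()
    try:
        request_queue.put_nowait((request, done))  # Non-blocking
    except asyncio.QueueFull:
        # Queue is full: reject with backpressure signal
        return Response(status=503, headers={"Retry-After": "5"})
    return await done

# Workers process from the queue at a sustainable rate:
async def worker():
    while True:
        request, done = await request_queue.get()
        try:
            done.set_result(await process(request))
        except Exception as exc:
            done.set_exception(exc)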

Debugging: The Traffic Spike Post-Mortem

Scenario: A partner's integration goes viral. Traffic jumps 10x in 10 minutes. The API returns 503s for 8 minutes before auto-scaling kicks in.

Root cause: Auto-scaling was configured to trigger at 80% CPU, with a 5-minute cooldown and 3-minute instance boot time. Total reaction time: 8 minutes. During those 8 minutes, existing servers are saturated.

Fix: Predictive scaling (scale based on traffic trend, not just current CPU). Pre-warm spare capacity during business hours. Add request queuing at the load balancer (instead of rejecting requests, queue them for 2 seconds before 503).

Dead Letter Queues: When Processing Fails

A job fails after 3 retries. You don't want to lose it (otherwise the customer's pipeline creation request is gone forever). You also don't want to keep retrying forever (the same error will keep happening). Solution: a dead letter queue (DLQ).

python
# Job processing with DLQ
async def process_job(msg):
    try:
        await create_pipeline(msg)
        await queue.ack(msg)  # Success: remove from queue
    except RetryableError as e:
        if msg.retry_count < 3:
            await queue.nack(msg, delay=2 ** msg.retry_count)  # Retry with backoff
        else:
            # Move to DLQ for manual investigation
            await dlq.publish(msg, metadata={
                "error": str(e),
                "retries": msg.retry_count,
                "original_timestamp": msg.timestamp,
            })
            await queue.ack(msg)
            await alert_oncall(f"Job {msg.job_id} sent to DLQ after 3 retries")
    except FatalError as e:
        # Non-retryable: bad input, business logic violation
        await db.update_job(msg.job_id, status="failed", error=str(e))
        await queue.ack(msg)  # Don't retry, don't DLQ

# DLQ dashboard shows:
# - Failed job details (what was the request?)
# - Error message and stack trace
# - Retry count and timestamps
# - "Reprocess" button to retry manually after fixing the bug

Idempotent Consumers: At-Least-Once Delivery

Message queues guarantee at-least-once delivery, not exactly-once. If a worker crashes after processing but before acknowledging, the message is re-delivered. Your worker must be idempotent: processing the same message twice produces the same result as processing it once.

python
# Non-idempotent (BAD): creates duplicate pipelines
async def process_job(msg):
    await db.insert_pipeline(msg.config)  # Second delivery = duplicate!

# Idempotent (GOOD): uses job_id as dedup key
async def process_job(msg):
    existing = await db.get_pipeline_by_job_id(msg.job_id)
    if existing:
        return  # Already processed — idempotent skip
    await db.insert_pipeline(msg.config, job_id=msg.job_id)
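
A check-then-insert still leaves a small race: two workers that receive the same message at the same moment can both pass the check. A unique constraint on job_id lets the database enforce the dedup instead. A minimal sketch using SQLite from the standard library as a stand-in for the real database (the table and column names are assumptions; in PostgreSQL the equivalent statement is INSERT ... ON CONFLICT (job_id) DO NOTHING):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pipelines (
        id     INTEGER PRIMARY KEY,
        job_id TEXT NOT NULL UNIQUE,   -- dedup key enforced by the database
        config TEXT NOT NULL
    )
""")

def process_job(job_id: str, config: str) -> bool:
    """Insert the pipeline once; return True only if this delivery did the work."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO pipelines (job_id, config) VALUES (?, ?)",
            (job_id, config),
        )
    return cur.rowcount == 1

first = process_job("job_123", '{"gpu_count": 4}')
second = process_job("job_123", '{"gpu_count": 4}')  # simulated redelivery
total = conn.execute("SELECT COUNT(*) FROM pipelines").fetchone()[0]
print(first, second, total)
```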

Frontier: Serverless + Edge (2024-2025)

Serverless functions (AWS Lambda, Cloudflare Workers) scale down to zero and up to the provider's concurrency limits without you managing servers. The tradeoff: cold starts (50-500ms) and limited execution time. The frontier: V8 isolate-based runtimes (Cloudflare Workers, Deno Deploy) with near-zero cold starts (<5ms) running at the edge. Your API handler executes in the datacenter closest to the user.

Horizontal Scaling Simulator

Watch auto-scaling respond to traffic changes. Adjust load and see servers spin up/down.

Your API accepts a request that takes 30 seconds to process (GPU pipeline creation). What is the correct response pattern?

Return 200 after 30 seconds with the result
Return 202 Accepted immediately with a job ID, let the client poll for status (keeps the request path fast and prevents timeout issues)
Return 102 Processing and keep the connection open for 30 seconds
Use WebSockets for all API calls

Chapter 9: Developer Experience

The best API in the world is useless if developers can't figure out how to use it. Developer experience (DX) is the sum of every interaction a developer has with your API: reading the docs, getting an API key, making the first request, debugging an error, upgrading to a new version. At Parallel, the DX team's north star metric is time-to-first-successful-request. If a new developer can't make a working API call in under 5 minutes, something is broken.

SDK Design

A good SDK wraps your REST API in language-native idioms so developers never think about HTTP. The SDK handles auth, retries, pagination, error parsing, and type safety. Bad SDKs are thin wrappers around HTTP calls. Good SDKs feel like a native library.

python
# BAD SDK: developer must know HTTP, JSON, pagination, error codes
response = requests.get(
    "https://api.parallel.dev/v2/pipelines",
    headers={"Authorization": f"Bearer {key}"},
    params={"limit": 20, "cursor": cursor}
)
if response.status_code == 429:
    time.sleep(int(response.headers["Retry-After"]))
    # retry...
data = response.json()

# GOOD SDK: language-native, handles everything
client = Parallel(api_key="pk_live_...")

# Auto-paginates, auto-retries on 429, returns typed objects
for pipeline in client.pipelines.list():
    print(pipeline.name, pipeline.status)  # IDE autocomplete works

# Errors are typed exceptions, not HTTP status codes
try:
    client.pipelines.create(name="test", gpu_count=16)
except parallel.ValidationError as e:
    print(e.param)     # "gpu_count"
    print(e.message)   # "Must be between 1 and 8"

Error Messages Are Documentation

The 3-part error message: Every error message should contain (1) what went wrong, (2) why it went wrong, and (3) how to fix it. "Invalid API key" is useless. "Invalid API key: the key starts with 'pk_test_' but this is the production endpoint. Use a key starting with 'pk_live_' or switch to sandbox.parallel.dev" is helpful.
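
One way to keep the three parts from being skipped is to make the error constructor require them. The sketch below is a hypothetical helper, not Parallel's actual error code; the field names `type`, `message`, and `param` are assumptions.

```python
from typing import Optional

def build_error(error_type: str, what: str, why: str, how_to_fix: str,
                param: Optional[str] = None) -> dict:
    """Assemble an error body that says what broke, why, and what to do next."""
    for label, part in (("what", what), ("why", why), ("how_to_fix", how_to_fix)):
        if not part.strip():
            raise ValueError(f"error message is missing its '{label}' part")
    error = {"type": error_type, "message": f"{what}: {why} {how_to_fix}"}
    if param is not None:
        error["param"] = param
    return {"error": error}

body = build_error(
    "invalid_api_key",
    what="Invalid API key",
    why="the key starts with 'pk_test_' but this is the production endpoint.",
    how_to_fix="Use a key starting with 'pk_live_' or switch to sandbox.parallel.dev.",
)
print(body["error"]["message"])
```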

Sandbox Environments

Developers need a safe place to experiment without affecting production data or incurring costs. Parallel provides a full sandbox environment:

Feature	Production	Sandbox
Base URL	api.parallel.dev	sandbox.parallel.dev
API keys	pk_live_*	pk_test_*
Rate limits	Per plan	100 req/min (generous for testing)
GPU allocation	Real GPUs, real cost	Simulated (responds as if real, no actual GPU)
Data persistence	Permanent	Wiped weekly

SDK Architecture: The Internal Design

python
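from __future__ import annotations  # lets Layer 2 reference Pipeline before Layer 3 defines it

import asyncio
from dataclasses import dataclass
from datetime import datetime
from typing import AsyncIterator

import httpx

# A well-designed SDK has 4 layers: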

# Layer 1: Transport — handles HTTP, retries, auth
class Transport:
    def __init__(self, api_key, base_url, max_retries=3):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        self.max_retries = max_retries

    async def request(self, method, path, **kwargs):
        for attempt in range(self.max_retries):
            resp = await self.client.request(method, path, **kwargs)
            last_attempt = attempt == self.max_retries - 1
            if resp.status_code == 429 and not last_attempt:
                # Rate limited: wait as long as the server asks, then retry
                delay = int(resp.headers.get("Retry-After", 1))
                await asyncio.sleep(delay)
                continue
            if resp.status_code >= 500 and not last_attempt:
                # Server error: exponential backoff (1s, 2s, ...)
                await asyncio.sleep(2 ** attempt)
                continue
            # Success, a non-retryable error, or retries exhausted
            return resp

# Layer 2: Resource — typed API for each resource
class PipelinesResource:
    def __init__(self, transport):
        self._t = transport

    async def create(self, *, name: str, gpu_count: int) -> Pipeline:
        resp = await self._t.request("POST", "/v2/pipelines",
            json={"name": name, "gpu_count": gpu_count})
        return Pipeline(**resp.json()["data"])

    def list(self, **filters) -> AsyncIterator[Pipeline]:
        # Auto-pagination: yields all pages transparently
        return AutoPaginator(self._t, "/v2/pipelines", Pipeline, **filters)

# Layer 3: Models — typed dataclasses
@dataclass
class Pipeline:
    id: str
    name: str
    status: str
    gpu_count: int
    created_at: datetime

# Layer 4: Client — the public API
class Parallel:
    def __init__(self, api_key: str):
        self._transport = Transport(api_key, "https://api.parallel.dev")
        self.pipelines = PipelinesResource(self._transport)
        self.jobs = JobsResource(self._transport)
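
The `AutoPaginator` used in Layer 2 is not shown above. One possible shape is sketched here against a fake transport so it runs offline. The response format (`data` plus `next_cursor`) and the `FakeTransport` class are assumptions for illustration, not Parallel's documented wire format.

```python
import asyncio

class AutoPaginator:
    """Async iterator that follows cursors until the API reports no next page."""

    def __init__(self, transport, path, model, **filters):
        self._t = transport
        self._path = path
        self._model = model
        self._filters = filters

    async def __aiter__(self):
        cursor = None
        while True:
            params = dict(self._filters)
            if cursor:
                params["cursor"] = cursor
            resp = await self._t.request("GET", self._path, params=params)
            page = resp.json()
            for item in page["data"]:
                yield self._model(**item)
            cursor = page.get("next_cursor")
            if not cursor:
                return

class FakeResponse:
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload

class FakeTransport:
    """Stand-in for Transport: serves two canned pages keyed by cursor."""
    PAGES = {
        None: {"data": [{"id": "pipe_1"}, {"id": "pipe_2"}], "next_cursor": "c2"},
        "c2": {"data": [{"id": "pipe_3"}], "next_cursor": None},
    }

    def __init__(self):
        self.calls = 0

    async def request(self, method, path, params=None):
        self.calls += 1
        return FakeResponse(self.PAGES[(params or {}).get("cursor")])

async def main():
    transport = FakeTransport()
    ids = [p["id"] async for p in AutoPaginator(transport, "/v2/pipelines", dict)]
    return ids, transport.calls

print(asyncio.run(main()))
```

The caller writes a single `async for` loop and never sees cursors, which is the point of hiding pagination in the SDK.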

Changelogs and Migration Guides

markdown
# GOOD changelog entry: actionable, with code diff

## v2.4.0 (2025-05-15)

### Breaking: `pipeline.status` field renamed to `pipeline.state`

**Why:** Aligning with industry standard terminology.
**Impact:** Once `status` is removed in v2.5.0, integrations that still read `pipeline.status` will break (`undefined` in JavaScript, `AttributeError` in Python).
**Migration:**

```python
# Before
pipeline.status  # "active"

# After
pipeline.state   # "active"
```

**Timeline:** `status` is deprecated now, removed in v2.5.0 (August 2025).
Both fields returned during transition period.

Debugging: The "Your Docs Are Wrong" Ticket

The most insidious DX bug: documentation says one thing, the API does another. This happens when docs are manually maintained separately from the code. Fix: generate docs from the OpenAPI spec (which is generated from the code), so docs are always in sync. Test the examples in CI — if a code sample in the docs fails, the build fails.

python
# CI pipeline for documentation accuracy:

# Step 1: Generate OpenAPI spec from code annotations
# (FastAPI does this automatically)
# spec = app.openapi()

# Step 2: Validate spec against published docs
# openapi-diff old-spec.yaml new-spec.yaml --breaking
# If breaking changes detected, fail CI unless changelog entry exists.

# Step 3: Run documentation code samples as integration tests
# Extract code blocks from docs/quickstart.md
# Execute against sandbox API
# Assert expected status codes and response shapes

def test_quickstart_example():
    # This code block appears in our quickstart docs
    client = Parallel(api_key="pk_test_ci_key")
    pipeline = client.pipelines.create(name="test", gpu_count=1)
    assert pipeline.id.startswith("pipe_")
    assert pipeline.status == "queued"
    # If the API changes and this breaks, the docs are stale.
    # CI catches it BEFORE the developer does.

API Versioning in SDKs

When you release API v3, you don't want to force all SDK users to upgrade immediately. The SDK should support multiple API versions and default to the latest stable one.

python
# SDK with version pinning
client = Parallel(
    api_key="pk_live_...",
    api_version="2025-01-15",  # Pin to a specific version
)

# The SDK sends: Parallel-Version: 2025-01-15
# Server returns the response shape matching that version.
# Even when v3 ships, this client gets v2 responses.

# Version lifecycle in the SDK:
# - SDK v1.x: supports API 2024-01-01 through 2024-12-01
# - SDK v2.x: supports API 2024-06-01 through 2025-06-01
# - SDK v3.x: supports API 2025-01-01 and later
# Deprecation warnings printed when using old versions.

Frontier: AI-Powered Developer Assistants (2024-2025)

The cutting edge: AI documentation assistants trained on your API spec, docs, and support tickets. Developers ask "how do I create a pipeline with streaming output?" and get a working code example specific to their SDK version and authentication setup. Stripe, Vercel, and Cloudflare already ship these. The assistant reduces time-to-first-request by 60%.

Developer Onboarding Flow

Track a developer's journey from signup to first successful API call. Click each step to see common friction points.

A new developer gets a 403 error on their first API call and gives up. Their error response was: {"error": "Forbidden"}. What is the DX fix?

Return 401 instead of 403
Return a rich error: "API key 'pk_test_...' is valid but lacks 'create:pipelines' permission. Your key has 'viewer' role. Request 'developer' role from your org admin at dashboard.parallel.dev/settings/members."
Add more examples to the documentation
Remove RBAC for sandbox environments

Chapter 10: Performance Optimization

Performance is not about making everything faster. It's about making the right things faster. A 10ms improvement on an endpoint called once a day is irrelevant. A 10ms improvement on an endpoint called 100,000 times per second saves 1,000 CPU-seconds per second. The first step is always: profile, don't guess.

The N+1 Query Problem

The most common performance bug in API development. You fetch a list of 50 pipelines, then for each pipeline, you fetch its jobs. That's 1 query + 50 queries = 51 database round trips. Each round trip takes 2ms of network latency, so you've burned 100ms on network alone.

python
# N+1 PROBLEM: 51 queries for 50 pipelines
pipelines = await db.query("SELECT * FROM pipelines WHERE customer_id = $1 LIMIT 50", cid)
for p in pipelines:
    p.jobs = await db.query("SELECT * FROM jobs WHERE pipeline_id = $1", p.id)
    # ^ This runs 50 times. 50 round trips. 100ms wasted.

# FIX 1: Batch query (2 queries total)
pipelines = await db.query("SELECT * FROM pipelines WHERE customer_id = $1 LIMIT 50", cid)
pipeline_ids = [p.id for p in pipelines]
all_jobs = await db.query("SELECT * FROM jobs WHERE pipeline_id = ANY($1)", pipeline_ids)
# Group jobs by pipeline_id in Python. 2 queries. 4ms total.

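# FIX 2: JOIN (1 query)
# Apply LIMIT to pipelines in a subquery first; a LIMIT on the joined rows
# would cap pipeline-job pairs, not pipelines.
rows = await db.query("""
    SELECT p.*, j.id AS job_id, j.status AS job_status
    FROM (
        SELECT * FROM pipelines
        WHERE customer_id = $1
        ORDER BY created_at DESC LIMIT 50
    ) p
    LEFT JOIN jobs j ON j.pipeline_id = p.id
    ORDER BY p.created_at DESC
""", cid)
# 1 query. 3ms. But more complex response parsing.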
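
The round-trip counts are easy to verify locally. The sketch below uses an in-memory SQLite database as a stand-in and a hypothetical `CountingDB` wrapper that counts queries. SQLite has no `ANY($1)`, so the batch version expands an IN list instead.

```python
import sqlite3
from collections import defaultdict

class CountingDB:
    """In-memory database that counts how many queries it receives."""

    def __init__(self):
        self.count = 0
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript("""
            CREATE TABLE pipelines (id INTEGER PRIMARY KEY, customer_id TEXT);
            CREATE TABLE jobs (id INTEGER PRIMARY KEY, pipeline_id INTEGER, status TEXT);
        """)
        # 50 pipelines with 3 jobs each
        self.conn.executemany("INSERT INTO pipelines VALUES (?, 'cust_xyz')",
                              [(i,) for i in range(1, 51)])
        self.conn.executemany("INSERT INTO jobs (pipeline_id, status) VALUES (?, 'done')",
                              [(i,) for i in range(1, 51) for _ in range(3)])

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()

def jobs_n_plus_one(db, customer_id):
    rows = db.query("SELECT id FROM pipelines WHERE customer_id = ?", (customer_id,))
    result = {}
    for (pid,) in rows:  # one extra query per pipeline
        jobs = db.query("SELECT id FROM jobs WHERE pipeline_id = ? ORDER BY id", (pid,))
        result[pid] = [job_id for (job_id,) in jobs]
    return result

def jobs_batched(db, customer_id):
    rows = db.query("SELECT id FROM pipelines WHERE customer_id = ?", (customer_id,))
    ids = [pid for (pid,) in rows]
    if not ids:
        return {}
    placeholders = ",".join("?" * len(ids))
    grouped = defaultdict(list)
    for job_id, pid in db.query(
        f"SELECT id, pipeline_id FROM jobs WHERE pipeline_id IN ({placeholders}) ORDER BY id",
        ids,
    ):
        grouped[pid].append(job_id)
    return {pid: grouped[pid] for pid in ids}

slow_db, fast_db = CountingDB(), CountingDB()
slow = jobs_n_plus_one(slow_db, "cust_xyz")
fast = jobs_batched(fast_db, "cust_xyz")
print("N+1 queries:", slow_db.count, "| batched queries:", fast_db.count)
```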

Connection Reuse

Creating a new TCP connection (DNS + TLS handshake) takes 50-100ms. If your API calls downstream services, reuse connections with HTTP keep-alive and connection pooling.

python
# BAD: new connection per request (50ms overhead each time)
async def call_gpu_service(payload):
    async with httpx.AsyncClient() as client:  # New connection!
        return await client.post("https://gpu.internal/schedule", json=payload)

# GOOD: reuse connection pool (0ms connection overhead)
gpu_client = httpx.AsyncClient(
    base_url="https://gpu.internal",
    limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
    timeout=httpx.Timeout(10.0, connect=5.0),
)

async def call_gpu_service(payload):
    return await gpu_client.post("/schedule", json=payload)  # Reuses connection

Payload Optimization

A response with 50 pipelines, each with 20 fields, can easily be 100KB of JSON. If the client only needs id, name, and status, you're sending 95KB of wasted data. Solutions:

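Field selection: GET /v2/pipelines?fields=id,name,status tells the server to serialize only the requested fields. JSON:API calls this "sparse fieldsets"; Google's APIs call it "partial response." (Stripe's "expand" parameter does the opposite: it inlines related objects.)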

Compression: gzip/brotli reduces JSON payloads by 70-90%. A 100KB response becomes 10KB over the wire. Always enable if the client sends Accept-Encoding: gzip.

Streaming: For large responses, use NDJSON (newline-delimited JSON) so the client can process records as they arrive instead of waiting for the entire response.

python
# NDJSON streaming for large exports
# Instead of: {"data": [50,000 pipeline objects]} (10MB, 3s to build)
# Stream: one JSON object per line, client processes as they arrive

import orjson
from fastapi.responses import StreamingResponse

async def export_pipelines(customer_id: str):
    async def generate():
        # Stream rows from DB cursor (don't load all into memory)
        async for row in db.cursor(
            "SELECT * FROM pipelines WHERE customer_id = $1",
            customer_id
        ):
            yield orjson.dumps(row).decode() + "\n"

    # No manual Transfer-Encoding header: the server applies chunked framing
    # itself on HTTP/1.1, and the header is not valid on HTTP/2.
    return StreamingResponse(
        generate(),
        media_type="application/x-ndjson",
    )

# Client processes line by line:
# async for line in response.aiter_lines():
#     pipeline = json.loads(line)
#     process(pipeline)
# Memory usage: O(1) instead of O(n). First byte in 50ms, not 3s.
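
The field-selection and compression options described earlier can be measured with the standard library alone. This is a rough sketch: `select_fields` is a hypothetical helper, and the sample records are synthetic, so the exact byte counts will differ from real pipeline payloads.

```python
import gzip
import json

def select_fields(records, fields):
    """Keep only the comma-separated fields named in a ?fields= parameter."""
    wanted = [f.strip() for f in fields.split(",") if f.strip()]
    return [{k: r[k] for k in wanted if k in r} for r in records]

# Synthetic data: 50 pipelines, each carrying a bulky config blob
pipelines = [
    {
        "id": f"pipe_{i}",
        "name": f"pipeline-{i}",
        "status": "active",
        "config": {"gpu_count": 4, "env": {f"VAR_{j}": "x" * 20 for j in range(20)}},
    }
    for i in range(50)
]

full = json.dumps({"data": pipelines}).encode()
slim = json.dumps({"data": select_fields(pipelines, "id,name,status")}).encode()
full_gz = gzip.compress(full)

print("full JSON bytes:   ", len(full))
print("selected fields:   ", len(slim))
print("full JSON, gzipped:", len(full_gz))
```

Field selection also saves serialization work on the server, while compression only saves bytes on the wire, so the two are complementary.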

Database Query Optimization Patterns

python
# Pattern 1: SELECT only the columns you need
# BAD: SELECT * fetches all 20 columns (including 50KB JSONB config)
# GOOD: SELECT id, name, status FROM pipelines WHERE ...
# Reduces: network transfer, memory, serialization time.

# Pattern 2: Use EXPLAIN ANALYZE before deploying new queries
# The query planner sometimes makes bad choices. Verify it uses indexes.

# Pattern 3: Avoid COUNT(*) for large tables
# BAD: SELECT COUNT(*) FROM pipelines WHERE customer_id = $1
#   → even with an index, PostgreSQL must visit every matching entry. ~100ms at 10M rows.
# GOOD: Use an approximate count or pre-computed counter:
#   → Redis counter incremented on insert/delete. O(1).

# Pattern 4: Use EXISTS instead of COUNT for existence checks
# BAD:  SELECT COUNT(*) FROM pipelines WHERE id = $1 (counts ALL matches)
# GOOD: SELECT EXISTS(SELECT 1 FROM pipelines WHERE id = $1) (stops at first match)

# Pattern 5: Batch operations to reduce round trips
# BAD:  for id in ids: await db.get(id)  # N round trips
# GOOD: await db.query("SELECT * FROM pipelines WHERE id = ANY($1)", ids)  # 1 round trip

Profiling in Production

python
# Continuous profiling: sample 1% of requests
# Use py-spy (Python) or pprof (Go) to capture CPU flamegraphs

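# Targeted profiling for slow endpoints.
# Caveat: cProfile records everything the event loop thread runs while enabled,
# including other requests' coroutines, so profile one request at a time.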
import cProfile, pstats

async def profile_handler(request):
    profiler = cProfile.Profile()
    profiler.enable()
    response = await actual_handler(request)
    profiler.disable()
    # Save profile for analysis
    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative")
    stats.print_stats(20)  # Top 20 functions by cumulative time
    return response

# Common findings from profiling API handlers:
# 1. JSON serialization: 30% of CPU on hot paths → use orjson (3x faster)
# 2. ORM overhead: model instantiation for 1000 rows → use raw SQL for list endpoints
# 3. Regex compilation: re.compile() inside a loop → compile once, reuse

Debugging: "This Endpoint Is Slow But I Can't Reproduce It"

Intermittent slowness is the hardest to debug because by the time you look, it's gone.

Approach: Enable continuous profiling (1% sampling). When a slow request occurs, the profiler captures what was happening. Correlate slow requests with system metrics: was there a GC pause? A TCP retransmit? A lock contention in the database?

The "noisy neighbor" pattern: One customer's expensive query holds a database lock, making other customers' simple queries wait. This shows up as intermittent latency that's impossible to reproduce because it depends on two specific requests arriving at the same time. Fix: query timeouts, read replicas for heavy queries, or request-level isolation with row-level locking.

Streaming Responses: Server-Sent Events

For long-running operations (pipeline creation, large data exports), instead of making clients poll, stream updates to them. Server-Sent Events (SSE) is simpler than WebSockets and works with existing HTTP infrastructure (load balancers, CDNs).

python
# SSE endpoint for pipeline creation status
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

async def stream_pipeline_status(job_id: str):
    async def event_generator():
        while True:
            status = await get_job_status(job_id)
            # SSE format: "data: {json}\n\n"
            yield f"data: {json.dumps(status)}\n\n"
            if status["state"] in ("completed", "failed"):
                break
            await asyncio.sleep(1)

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"}
    )

# Client-side (JavaScript):
# const source = new EventSource('/v2/jobs/job_123/stream');
# source.onmessage = (e) => console.log(JSON.parse(e.data));
# Automatically reconnects on network failure!
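
Beyond the bare `data:` line, the SSE wire format also allows `id:`, `event:`, and `retry:` fields. The helper below is a small sketch of that framing; `format_sse` and its parameter names are made up for this example.

```python
import json


def format_sse(data, event=None, event_id=None, retry_ms=None) -> str:
    """Serialize one Server-Sent Event frame."""
    lines = []
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")  # reconnection delay hint, in ms
    if event_id is not None:
        lines.append(f"id: {event_id}")     # echoed back as Last-Event-ID
    if event is not None:
        lines.append(f"event: {event}")     # named event type
    # json.dumps (without indent) never emits raw newlines,
    # so a single data line is enough here.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"        # blank line terminates the frame


frame = format_sse({"state": "running"}, event="status", event_id=7)
```

When a browser reconnects, it sends the last `id` it saw in the `Last-Event-ID` request header, which lets the server resume from that point instead of replaying everything. Named events are received with `addEventListener('status', ...)` rather than `onmessage`.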

JSON Serialization: The Hidden CPU Hog

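On hot API paths, JSON serialization can consume 20-40% of CPU time. Python's built-in json module is comparatively slow. Switching to orjson typically gives a 3-10x speedup with minimal code changes. The main difference to account for is that orjson.dumps returns bytes rather than str.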

python
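# Benchmark: serializing 1000 pipeline objects
import json
import timeit

import orjson  # third-party: pip install orjson

data = [{"id": f"pipe_{i}", "name": f"pipeline-{i}",
         "status": "active", "gpu_count": 4,
         "created_at": "2025-05-22T10:00:00Z"} for i in range(1000)]

# Average over 100 runs to smooth out noise
runs = 100
json_ms = timeit.timeit(lambda: json.dumps(data), number=runs) / runs * 1000
orjson_ms = timeit.timeit(lambda: orjson.dumps(data), number=runs) / runs * 1000
print(f"json: {json_ms:.2f}ms  orjson: {orjson_ms:.2f}ms")

# Reference figures: stdlib json ~12ms, orjson ~1.5ms (8x faster).
# Actual numbers depend on hardware and library versions.

# orjson.dumps returns bytes, not str.
# orjson serializes datetime and UUID natively; numpy arrays need
# option=orjson.OPT_SERIALIZE_NUMPY.
# One-line swap in FastAPI:
# from fastapi.responses import ORJSONResponse
# app = FastAPI(default_response_class=ORJSONResponse)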

Frontier: HTTP/3 + QUIC (2024-2025)

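HTTP/3 replaces TCP with QUIC (UDP-based). Benefits: faster connection setup (one round trip for a new connection, 0-RTT when resuming a previous one), no transport-level head-of-line blocking (a lost packet stalls only the stream it belongs to, not all streams), and connection migration (WiFi to cellular without reconnecting). Cloudflare has reported latency improvements on the order of 12% after enabling HTTP/3; treat that as indicative rather than a guarantee for your API traffic.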

N+1 Query Visualizer

Compare N+1 queries vs. batch query vs. JOIN. Watch the database round trips.
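
The three strategies can be compared by counting round trips against an in-memory SQLite database. This is a minimal sketch: the `pipelines` and `nodes` tables and the helper names are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pipelines (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, pipeline_id INTEGER, gpu TEXT);
""")
conn.executemany("INSERT INTO pipelines VALUES (?, ?)",
                 [(i, f"pipeline-{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO nodes (pipeline_id, gpu) VALUES (?, ?)",
                 [(i, f"gpu-{n}") for i in range(1, 51) for n in range(4)])

queries = 0


def run(sql, params=()):
    """Execute one statement and count it as one database round trip."""
    global queries
    queries += 1
    return conn.execute(sql, params).fetchall()


def n_plus_one():
    # 1 query for the list, then 1 query per pipeline
    pipelines = run("SELECT id, name FROM pipelines")
    return {pid: run("SELECT gpu FROM nodes WHERE pipeline_id = ?", (pid,))
            for pid, _ in pipelines}


def batch():
    # 1 query for the list, 1 query for all children via IN (...)
    pipelines = run("SELECT id, name FROM pipelines")
    ids = [pid for pid, _ in pipelines]
    placeholders = ",".join("?" * len(ids))
    return run("SELECT pipeline_id, gpu FROM nodes "
               f"WHERE pipeline_id IN ({placeholders})", ids)


def join_query():
    # 1 query total; parent columns repeat on every child row
    return run("SELECT p.id, p.name, n.gpu FROM pipelines p "
               "LEFT JOIN nodes n ON n.pipeline_id = p.id")


def count(fn):
    """Return how many round trips fn needed."""
    global queries
    queries = 0
    fn()
    return queries


print(count(n_plus_one), count(batch), count(join_query))  # 51 2 1
```

With 50 pipelines the N+1 version makes 51 round trips, while the batch and JOIN versions make 2 and 1. At roughly 0.5 ms per same-datacenter round trip, that difference is most of the request's latency.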

Your API returns 50 pipelines with all 20 fields, but most clients only use 3 fields. What is the most effective optimization?

Add field selection: GET /v2/pipelines?fields=id,name,status (reduces payload by ~85% and speeds up serialization)
Enable gzip compression
Reduce the default page size from 50 to 10
Switch to GraphQL
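
Field selection via `?fields=id,name,status` can be sketched as a parse step plus a projection step. The allow-list and the function names `parse_fields` and `project` are assumptions for this example; a real handler would ideally push the projection down into the SQL SELECT list as well.

```python
from typing import Optional

# Hypothetical allow-list of fields clients may request
ALLOWED_FIELDS = {"id", "name", "status", "gpu_count", "created_at"}


def parse_fields(raw: Optional[str]) -> Optional[list]:
    """Parse a comma-separated ?fields= value; None means 'all fields'."""
    if not raw:
        return None
    requested = [f.strip() for f in raw.split(",") if f.strip()]
    unknown = [f for f in requested if f not in ALLOWED_FIELDS]
    if unknown:
        # The handler would map this to a 400 with a helpful message
        raise ValueError(f"unknown fields: {', '.join(unknown)}")
    return requested


def project(items: list, fields: Optional[list]) -> list:
    """Keep only the requested keys of each item."""
    if fields is None:
        return items
    return [{f: item[f] for f in fields} for item in items]


pipelines = [{"id": "pipe_1", "name": "pipeline-1", "status": "active",
              "gpu_count": 4, "created_at": "2025-05-22T10:00:00Z"}]
slim = project(pipelines, parse_fields("id,name,status"))
```

Rejecting unknown field names, rather than silently ignoring them, keeps typos from turning into confusing empty responses.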

Chapter 11: SHOWCASE — Interactive API Gateway

This is the payoff. Everything you've learned — routing, authentication, rate limiting, caching, database queries, error handling — comes together in a single interactive simulation. You are operating an API gateway that serves millions of requests. Adjust the controls, inject failures, and watch the system respond.

This is your system design interview on a screen. When an interviewer asks "design an API gateway," you should be able to draw this diagram and explain every box. Use this simulation to build intuition for how the pieces interact under load.

API Gateway Simulation

Incoming requests flow through auth → rate limiter → router → handler → cache/DB → response. Adjust load, inject failures, and observe metrics in real time.

Default controls: Load 50 req/s, cache hit rate 70%, error rate 2%.

Reading the Simulation

Request flow (top to bottom): Each dot is a request traveling through the gateway stages. Green dots are successful, red dots are errors, yellow dots are rate-limited (429).

Metrics panel (right side):

Throughput: Successful responses per second. Should track close to incoming load unless errors are high.
p50/p99 latency: Median and tail latency. Watch p99 spike when cache hit rate drops or DB fails.
Error rate: Percentage of requests returning 4xx/5xx. Should stay below your SLO (0.1%).
Cache hit rate: Higher = lower latency and less DB load. Drop it to 0% and watch DB saturation.
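
To make the p50/p99 distinction concrete, here is a small nearest-rank percentile sketch. The sample latencies are invented for the illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 97 fast requests and 3 slow ones (for example, cache misses on a cold DB)
latencies_ms = [12] * 97 + [900] * 3

p50 = percentile(latencies_ms, 50)  # 12
p99 = percentile(latencies_ms, 99)  # 900
```

Three slow requests out of a hundred leave the median untouched but move p99 from 12 ms to 900 ms, which is why tail latency is the metric to watch when the cache hit rate drops.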

Scenarios to Try

Scenario	What to do	What to observe
Normal operation	Load=50, Cache=70%, Error=2%	Smooth flow, low latency, green metrics
Cache failure	Drop cache to 0%	p99 latency spikes as all requests hit DB
Database failure	Click "DB Failure"	Cached requests still work, uncached requests error. Graceful degradation.
Auth service down	Click "Auth Down"	ALL requests fail at the first stage. Total outage.
DDoS attack	Click "DDoS (100x)"	Rate limiter activates, most requests return 429, legitimate traffic still served
Thundering herd	Set cache=0%, load=500	DB overwhelmed, errors spike, p99 goes to timeout

What this simulation teaches: Every system design has a bottleneck. At low load, the bottleneck is the DB (it's the slowest stage). At high load, the bottleneck shifts to the rate limiter (it protects everything downstream). When cache is healthy, the DB bottleneck is hidden. Remove the cache and the DB bottleneck becomes visible. This is why caching isn't optional at scale — it's structural.
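
The cache's structural role can be put into numbers with a back-of-the-envelope model. The latencies (1 ms cache lookup, 5 ms DB query) and the DB capacity of 100 queries per second are assumptions chosen for the sketch, not values taken from the simulation.

```python
def db_load_rps(load_rps: float, cache_hit_pct: float) -> float:
    """Requests per second that miss the cache and reach the database."""
    return load_rps * (100 - cache_hit_pct) / 100


def mean_latency_ms(cache_hit_pct: float, cache_ms: float = 1.0,
                    db_ms: float = 5.0) -> float:
    """Every request pays the cache lookup; misses also pay the DB query."""
    return (100 * cache_ms + (100 - cache_hit_pct) * db_ms) / 100


DB_CAPACITY_RPS = 100  # hypothetical saturation point

for name, load, hit in [("normal", 50, 70),
                        ("cache failure", 50, 0),
                        ("thundering herd", 500, 0)]:
    db = db_load_rps(load, hit)
    print(f"{name}: DB sees {db} req/s, mean latency "
          f"{mean_latency_ms(hit)} ms, saturated={db > DB_CAPACITY_RPS}")
```

Under these assumptions, normal operation sends 15 req/s to the database, a cache failure at the same load sends 50, and the thundering-herd scenario sends 500, five times the assumed capacity.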

The Architecture Behind the Simulation

Every box in the simulation maps to a real component. Here's the production architecture:

yaml
# Production API Gateway Architecture

ingress:
  - CloudFront CDN (TLS termination, static caching, DDoS absorption)
  - Route 53 (latency-based DNS, health checks for failover)

load_balancer:
  - ALB (Application Load Balancer)
  - Health checks: GET /healthz every 5s, 3 failures = remove
  - Connection draining: 30s on deploy (finish in-flight requests)

api_servers:
  - 20 instances (auto-scale 10-50 based on CPU + request count)
  - Each runs: FastAPI + uvicorn + 4 workers
  - Stateless: all state in Redis or Postgres

auth_layer:
  - API key validation: SHA-256 hash lookup in Redis (0.5ms)
  - JWT validation: RSA signature check (0.1ms, no network)
  - Rate limit: Redis INCR per customer (1ms)

data_layer:
  - PostgreSQL 16 (primary + 3 read replicas)
  - PgBouncer: 1000 client connections → 100 server connections
  - Redis cluster: 6 nodes, 3 masters + 3 replicas
  - Connection pool: asyncpg, min=5, max=20 per API server

async_processing:
  - SQS queues for long-running jobs
  - Worker fleet: 10 instances processing pipeline creation
  - Dead letter queue after 3 retries

observability:
  - Datadog: metrics, traces, logs
  - PagerDuty: alerting on SLO burn rate
  - Grafana dashboards for real-time monitoring
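
A quick sanity check of the connection and health-check figures in the listing above. The sketch assumes one asyncpg pool per API server, as the listing states; if each of the 4 workers opened its own pool, the totals would be four times higher.

```python
# Figures taken from the architecture listing
POOL_MAX_PER_SERVER = 20       # asyncpg max connections per API server
PGBOUNCER_CLIENT_LIMIT = 1000  # client connections PgBouncer accepts
PGBOUNCER_SERVER_CONNS = 100   # connections PgBouncer opens to Postgres
HEALTH_CHECK_INTERVAL_S = 5
HEALTH_CHECK_FAILURES = 3


def client_connections(instances: int) -> int:
    """Worst-case connections the API fleet opens to PgBouncer."""
    return instances * POOL_MAX_PER_SERVER


def fits_pgbouncer(instances: int) -> bool:
    return client_connections(instances) <= PGBOUNCER_CLIENT_LIMIT


for instances in (10, 20, 50):  # auto-scale minimum, steady state, maximum
    print(instances, client_connections(instances), fits_pgbouncer(instances))

# How many client connections share each Postgres connection
multiplex_ratio = PGBOUNCER_CLIENT_LIMIT // PGBOUNCER_SERVER_CONNS

# Worst-case time before an unhealthy server is removed from the LB
eject_after_s = HEALTH_CHECK_INTERVAL_S * HEALTH_CHECK_FAILURES
```

At the auto-scale maximum of 50 instances the fleet can open exactly 1000 connections, which is the PgBouncer client limit with no headroom, and an unhealthy server can keep receiving traffic for up to 15 seconds before it is removed.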

Interview Whiteboard Version

In an interview, you have 5 minutes to draw this. Here's the simplified version:

Client → CDN → LB

TLS at edge, cache static, route to healthy server

↓

API Gateway (Auth + Rate Limit)

Fail unauthorized early. Protect downstream from overload.

↓

Handler → Cache? → DB

Check Redis first. DB only on cache miss. Async for writes.

↓

Response + Log Trace

Return JSON, log latency breakdown, emit metrics.

Whiteboard tips: (1) Start with the request path, not the component list. (2) Label every arrow with the latency it adds. (3) Show where failures are handled (circuit breaker, fallback to cache). (4) End with "here's what I'd monitor" — throughput, p99, error rate, cache hit rate.

Chapter 12: Interview Arsenal

This chapter distills everything into a cheat sheet you can review in the 30 minutes before your interview. Every section maps to a common interview question type.

System Design Questions

Question	Key points to cover	Chapter
"Design an API rate limiter"	Token bucket algorithm, distributed counter in Redis, per-customer quotas, 429 with Retry-After, graceful degradation	5
"Design a URL shortener API"	Hashing, collision handling, cursor pagination for analytics, CDN caching for redirects, rate limiting writes	1, 4
"Design an API gateway"	Request lifecycle, auth, rate limiting, routing, caching, circuit breaker, observability. Draw the Ch 11 diagram.	2, 11
"Design a real-time notification system"	WebSocket vs. SSE, connection scaling, message queue, fanout, delivery guarantees, offline queue	8
"Your API needs to handle 1M requests/minute"	Horizontal scaling, caching, connection pooling, async processing, CDN for static responses, sharding	4, 8
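
For the rate limiter question, the token bucket itself fits in a few lines. This is an in-process, single-node sketch with an injectable clock; the class and parameter names are illustrative. A distributed version would keep the same state in Redis and update it atomically (for example with a Lua script), which is not shown here.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `refill_per_s` tokens/s."""

    def __init__(self, capacity: float, refill_per_s: float,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.clock = clock
        self.tokens = capacity  # start full
        self.updated = clock()

    def _refill(self):
        now = self.clock()
        earned = (now - self.updated) * self.refill_per_s
        self.tokens = min(self.capacity, self.tokens + earned)
        self.updated = now

    def allow(self, cost: float = 1.0):
        """Return (allowed, retry_after_seconds)."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0
        # Time until enough tokens accumulate: use it for Retry-After
        return False, (cost - self.tokens) / self.refill_per_s


# Usage with a fake clock so the behavior is deterministic
now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_s=1, clock=lambda: now[0])
first = bucket.allow()    # (True, 0.0)
second = bucket.allow()   # (True, 0.0)
third = bucket.allow()    # (False, 1.0) -> respond 429 with Retry-After: 1
now[0] = 1.0              # one second later, one token has refilled
fourth = bucket.allow()   # (True, 0.0)
```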

Coding Drills

python
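# DRILL 1: Implement cursor pagination
def paginate(items: list, cursor: str | None, limit: int = 20):
    if cursor:
        # Find the item the cursor points at and resume right after it
        idx = next((i for i, item in enumerate(items) if item["id"] == cursor), None)
        if idx is None:
            raise ValueError(f"Unknown cursor: {cursor}")  # Map to a 400 in the handler
        start = idx + 1
    else:
        start = 0
    page = items[start:start + limit]
    has_more = start + limit < len(items)  # Items remain beyond this page
    next_cursor = page[-1]["id"] if has_more else None
    return {"data": page, "next_cursor": next_cursor, "has_more": has_more}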

# DRILL 2: Implement a circuit breaker
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, threshold=5, reset_time=30):
        self.failures = 0
        self.threshold = threshold
        self.reset_time = reset_time
        self.state = "closed"   # closed=normal, open=failing, half-open=testing
        self.last_failure = 0

    async def call(self, fn, *args):
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_time:
                self.state = "half-open"  # Try one request
            else:
                raise CircuitOpenError("Service unavailable")

        try:
            result = await fn(*args)
            self.failures = 0
            self.state = "closed"
            return result
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.threshold:
                self.state = "open"
            raise

Debugging Scenarios

Symptom	Likely cause	Investigation
p99 latency spiked, p50 normal	Connection pool exhaustion, slow query for subset of requests	Check DB connection count, find slow queries in pg_stat_statements
Intermittent 500 errors, no pattern	Race condition, retry storm, or flaky downstream dependency	Correlate errors with specific request patterns, check distributed traces
Memory usage grows until OOM	Connection leak, unbounded cache, large response buffering	Heap dump analysis, check connection pool stats, monitor cache size
Latency increases linearly over weeks	Table growth without proper indexing, cache key space explosion	Check table sizes, EXPLAIN ANALYZE on hot queries, Redis memory stats

Quick Reference: HTTP Status Codes

Code	Meaning	When to use
200	OK	Successful GET, PUT, PATCH
201	Created	Successful POST that created a resource
202	Accepted	Request accepted for async processing (return job ID)
204	No Content	Successful DELETE
400	Bad Request	Invalid request body, missing required field
401	Unauthorized	Missing or invalid auth credentials
403	Forbidden	Valid auth but insufficient permissions
404	Not Found	Resource doesn't exist
409	Conflict	Duplicate resource, version conflict
422	Unprocessable Content	Valid JSON but semantic validation failed
429	Too Many Requests	Rate limit exceeded (include Retry-After header)
500	Internal Server Error	Bug in your code (never expose details)
502	Bad Gateway	Downstream service returned invalid response
503	Service Unavailable	Overloaded or maintenance (include Retry-After)
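
The table pairs naturally with a consistent error envelope. The helper below is a sketch only: the `{"error": {"code", "message"}}` shape and the name `error_response` are assumptions for this example, not a format defined earlier in the text.

```python
import json
import math


def error_response(status: int, code: str, message: str, retry_after_s=None):
    """Build (status, headers, body) for an error in one consistent shape."""
    headers = {"Content-Type": "application/json"}
    # Retry-After is only meaningful for 429 and 503; the value is whole seconds
    if status in (429, 503) and retry_after_s is not None:
        headers["Retry-After"] = str(math.ceil(retry_after_s))
    body = json.dumps({"error": {"code": code, "message": message}})
    return status, headers, body


status, headers, body = error_response(
    429, "rate_limited", "Quota exceeded; retry later", retry_after_s=1.2)
```

Rounding the delay up rather than down avoids telling a client to retry a moment before it would actually be allowed.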

System Design Interview Framework

When given a system design question ("Design an API for X"), follow this framework in order. This structure shows the interviewer you think systematically.

1. Clarify Requirements (2 min)

Ask: read vs. write ratio? expected QPS? latency target? consistency requirements? Who are the users (developers, agents, browsers)?

↓

2. API Design (3 min)

Define endpoints, request/response shapes, auth model. REST for public, gRPC for internal. Pagination strategy. Error format.

↓

3. Data Model (3 min)

Tables, indexes, access patterns. Primary key choice. What's the hottest query? Where does denormalization help?

↓

4. Architecture (5 min)

Draw the request path. CDN → LB → API → Cache → DB. Where does async processing help? What needs a queue?

↓

5. Deep Dive (5 min)

Interviewer picks a component. Go deep: caching strategy, rate limiting, sharding, failure modes, monitoring.

↓

6. Trade-offs & Scaling (2 min)

Discuss what breaks at 10x, 100x. What would you change? What are the trade-offs you chose?

Additional Coding Drills

python
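# DRILL 3: Implement a distributed lock with Redis
from uuid import uuid4

# `redis` is assumed to be an asyncio Redis client (e.g. redis.asyncio.Redis)
async def acquire_lock(redis, key: str, ttl: int = 10) -> str | None:
    lock_id = str(uuid4())  # Unique per caller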
    acquired = await redis.set(
        f"lock:{key}", lock_id, nx=True, ex=ttl
    )
    return lock_id if acquired else None

async def release_lock(redis, key: str, lock_id: str):
    # Lua script: only release if we still own the lock
    # (prevents releasing a lock that expired and was acquired by another)
    script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    end
    return 0
    """
    await redis.eval(script, 1, f"lock:{key}", lock_id)

# DRILL 4: Implement webhook retry with exponential backoff
import asyncio
import random

# http_client is assumed to be a shared async HTTP client defined elsewhere;
# adjust the except clause to the exception types your client raises.
async def send_webhook(url: str, payload: dict, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            resp = await http_client.post(url, json=payload, timeout=10)
            if 200 <= resp.status_code < 300:
                return True  # Success
            if 400 <= resp.status_code < 500 and resp.status_code != 429:
                return False  # Client error: retrying won't help (429 is retryable)
        except (TimeoutError, ConnectionError):
            pass  # Transient network failure: retry

        if attempt < max_retries - 1:
            # Exponential backoff between attempts: 1s, 2s, 4s, 8s
            delay = (2 ** attempt) + random.uniform(0, 1)  # Jitter!
            await asyncio.sleep(delay)

    return False  # All retries exhausted → send to dead letter queue

# DRILL 5: Implement request deduplication middleware
async def idempotency_middleware(request, call_next):
    key = request.headers.get("Idempotency-Key")
    if not key or request.method in ("GET", "DELETE"):
        return await call_next(request)

    # Check if we've seen this key
    cached = await redis.get(f"idem:{key}")
    if cached:
        return Response.from_cache(cached)  # Replay stored response

    # Lock to prevent concurrent execution of same key
    lock = await acquire_lock(redis, f"idem-lock:{key}", ttl=30)
    if not lock:
        return Response(status=409, body={"error": "Duplicate request in progress"})

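    try:
        response = await call_next(request)
        # Store response for 24h so retries get the same result
        # (serialize() / from_cache() are placeholders for your framework's API)
        await redis.setex(f"idem:{key}", 86400, response.serialize())
        return response
    finally:
        # Always release the lock, even if the handler raises
        await release_lock(redis, f"idem-lock:{key}", lock)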

Key Numbers to Memorize

Metric	Value	Why it matters
L1 cache access	~1ns	Baseline for "instant"
RAM access	~100ns	In-process cache speed
Redis GET (same datacenter)	~0.5-1ms	Distributed cache speed
SSD random read	~0.1ms	DB index lookup that misses the page cache
DB query (indexed, warm)	~1-5ms	Your p50 target for reads
DB query (full scan, cold)	~100-1000ms	Your "something is wrong" signal
Network round trip (same DC)	~0.5ms	Each microservice call adds this
Network round trip (cross-US)	~30-60ms	Why multi-region matters
TLS handshake	~10-50ms	Why connection reuse matters
JSON serialize (1000 objects, stdlib)	~12ms	Why orjson matters on hot paths
PostgreSQL max practical connections	~500	Why PgBouncer exists

Closing: The Backend Engineer's Oath

Your API is a promise. Every endpoint is a contract. Every error message is documentation. Every millisecond of latency is a developer waiting. Every 500 error is a production incident in someone else's system. The best backend engineers are not the ones who build the fastest systems — they are the ones who build the most predictable systems. Predictable performance, predictable error handling, predictable breaking changes. When a developer integrates with your API, they are trusting you with their production uptime. Earn that trust.

In a system design interview, the interviewer asks: "Your API's error rate just jumped from 0.1% to 5%. What's your first question before investigating?" What should you ask?

"What is the server CPU utilization?"
"Is it all endpoints or a specific endpoint, and all customers or a specific customer?" (narrow the scope before diving into details)
"When was the last deployment?"
"Are users complaining?"

Backend & API Engineer at Parallel