Staff-level interview prep: API design, databases, caching, rate limiting, auth, observability, scaling, and the full request lifecycle.
An AI agent sends a POST request to your API: "create a custom inference pipeline with 4 GPU nodes, streaming output, and a 30-second timeout." Your system must authenticate the caller, check their quota, validate the payload against a schema that changes monthly, route to the right cluster, orchestrate the pipeline creation across three microservices, and return a structured response with a job ID — all in under 200ms. If any step fails, the response must explain exactly what went wrong in a way a developer can fix without reading your source code.
This is not a CRUD app. This is a developer-facing distributed system that millions of AI agents, scripts, and dashboards hit every day. And you are the engineer who makes it fast, reliable, and a joy to integrate with.
It is 9:00 AM. You badge into Parallel's office. On your first monitor, the overnight alerting shows a 12% increase in p99 latency on the /v2/pipelines endpoint: yesterday's database migration added a new index that is causing lock contention. On your second monitor, a partner integration team is stuck: their webhook handler is receiving duplicate events because your retry logic doesn't respect idempotency keys correctly. On your third monitor sits a design doc from the platform team proposing GraphQL subscriptions for real-time pipeline status, but you're worried about connection scaling at 100K concurrent subscribers.
Before lunch, you will profile the slow query (the new composite index needs column order reversed), write a hotfix for the webhook deduplication (add an idempotency cache with a 24-hour TTL), and leave detailed comments on the GraphQL doc (propose a hybrid: GraphQL for reads, Server-Sent Events for real-time status, because SSE scales better behind your existing load balancer).
This is the daily reality of a Backend & API Engineer at Parallel. You own the API surface that every customer's code touches:
| Responsibility | What you own | Daily intersection |
|---|---|---|
| API Surface | REST endpoints, versioning, schema evolution, SDK generation | Every external developer interacts through your contracts |
| Data Layer | Schema design, migrations, query optimization, connection pooling | Every request reads or writes your tables |
| Reliability | Rate limiting, caching, circuit breakers, graceful degradation | You keep the system alive when traffic spikes 10x |
| Security | Auth, API keys, RBAC, webhook signatures, audit logs | You protect customer data and prevent abuse |
| Performance | Profiling, caching, async processing, payload optimization | You make every response feel instant |
Backend engineering at an infrastructure company like Parallel is different from backend at a consumer app. Your users are developers. They read your error messages more carefully than your documentation. They will find every inconsistency in your API naming. They will script against your rate limit headers. They will reverse-engineer your pagination cursors. And they will loudly complain on Twitter when you ship a breaking change.
The ideal candidate has deep intuition on distributed systems, databases, and maintainable code design. They reason about trade-offs between speed, scalability, and developer ergonomics. They can design an API that's both high-performance and a joy to integrate with. They can debug a latency regression at 3 AM using only dashboards and distributed traces. They can write a migration that restructures a 100M-row table without downtime.
Most importantly, they understand that an API is a product. Every response time is a user experience. Every error message is customer support. Every changelog entry is a relationship with a developer who built their business on your platform.
Here's what a typical Tuesday looks like:
| Time | Activity | Skills used |
|---|---|---|
| 9:00 AM | Triage overnight alerts: p99 latency spike on /v2/search | Observability, debugging |
| 9:30 AM | Profile the slow query, find missing index, deploy fix | Database, performance |
| 10:00 AM | Code review: teammate's rate limiter migration from fixed to sliding window | Rate limiting, API design |
| 11:00 AM | Design doc: adding webhook retry with exponential backoff | System design, async patterns |
| 12:00 PM | Partner sync: debug why their integration gets intermittent 403s | Auth, debugging, DX |
| 2:00 PM | Implement cursor pagination for the /v2/logs endpoint | API design, database |
| 3:00 PM | Review the auto-generated Python SDK for v2.5 release | DX, SDK design |
| 4:00 PM | Plan capacity for a new enterprise customer (10x current traffic) | Scaling, caching |
The diagram below traces a single API request from the internet to the database and back. Every box is a system you own or co-own. This is your opening whiteboard answer in a system-design interview.
[Interactive diagram: a request flowing through the full stack, with a per-hop latency breakdown and an option to inject a failure and see the error handling.]
Staff-level interviews test you across five dimensions. Each chapter in this lesson maps to one or more:
| Dimension | What they ask | Chapters |
|---|---|---|
| CONCEPT | "Explain how connection pooling works under the hood" | All |
| DESIGN | "Design an API gateway that handles 1M requests/minute" | 0, 1, 2, 8, 11 |
| CODE | "Implement a token bucket rate limiter" | 1, 3, 4, 5, 6, 10 |
| DEBUG | "Your p99 latency tripled. Walk me through your investigation." | 2, 7, 10 |
| FRONTIER | "How would you use HTTP/3 or edge computing to reduce latency?" | All |
Your API is a contract. Every endpoint, every field name, every error code is a promise you make to thousands of developers. Break the promise and their production breaks. Make the promise confusing and they'll spend hours reading docs instead of building. A well-designed API is invisible — developers use it correctly without thinking. A badly-designed API generates support tickets.
At Parallel, your API is the primary product surface. AI agents don't use a dashboard — they call your endpoints programmatically. The API is the product.
| Protocol | Best for | Worst for | Parallel's use |
|---|---|---|---|
| REST | Public APIs, CRUD, HTTP caching (GET responses are safe and cacheable), broad ecosystem | Complex nested queries, real-time streams | Primary public API: /v2/pipelines, /v2/models, /v2/jobs |
| GraphQL | Flexible queries, mobile clients (minimize over-fetching), introspection | Caching (POST-based), rate limiting (query cost varies), file uploads | Internal dashboard API — flexible queries for analytics UI |
| gRPC | Service-to-service, streaming, strong typing (protobuf), low latency | Browser clients (needs proxy), debugging (binary), public APIs | Internal microservice mesh — pipeline orchestrator talks to GPU scheduler via gRPC |
REST APIs are organized around resources, not actions. A resource is a noun (pipeline, job, model), not a verb (createPipeline). The HTTP method provides the verb.
```http
# GOOD: resources are nouns, methods provide verbs
GET    /v2/pipelines              # List pipelines
POST   /v2/pipelines              # Create a pipeline
GET    /v2/pipelines/{id}         # Get one pipeline
PATCH  /v2/pipelines/{id}         # Update a pipeline
DELETE /v2/pipelines/{id}         # Delete a pipeline
GET    /v2/pipelines/{id}/jobs    # List jobs for a pipeline (nested resource)

# BAD: verbs in URLs (this is RPC, not REST)
POST /v2/createPipeline
POST /v2/getPipelineById
POST /v2/deletePipeline

# TRICKY: actions that don't map to CRUD
# Option A: treat the action as a sub-resource
POST /v2/pipelines/{id}/restart   # Restart a pipeline
POST /v2/pipelines/{id}/scale     # Scale GPU count

# Option B: use a generic "actions" endpoint
POST /v2/pipelines/{id}/actions
{ "action": "restart", "params": {} }
```
Every response should follow a consistent envelope. Developers build generic response parsers — if one endpoint returns {"data": [...]} and another returns a bare array [...], their parser breaks.
```python
# Parallel's response envelope: consistent across ALL endpoints

# Single resource:
{
  "data": {
    "id": "pipe_abc123",
    "name": "my-pipeline",
    "status": "active",
    "created_at": "2025-05-22T10:00:00Z"
  }
}

# Collection:
{
  "data": [
    {"id": "pipe_abc", ...},
    {"id": "pipe_def", ...}
  ],
  "pagination": {
    "next_cursor": "eyJpZCI6...",
    "has_more": true
  }
}

# Error:
{
  "error": {
    "type": "not_found",
    "message": "Pipeline 'pipe_xyz' does not exist.",
    "code": "PIPELINE_NOT_FOUND",
    "request_id": "req_789"
  }
}

# Rules:
# 1. Success always has "data" key
# 2. Error always has "error" key
# 3. Never both at once
# 4. Timestamps always ISO 8601 with timezone (Z or +00:00)
# 5. IDs always have a prefix: pipe_, job_, cust_, key_
```
You will change your API. Fields get renamed, response shapes evolve, deprecated endpoints must die. The question is how to do it without breaking existing integrations.
```python
# Strategy 1: URL versioning (Parallel's choice)
# Simple, explicit, cacheable. Downside: proliferating paths.
GET /v2/pipelines/abc123
GET /v3/pipelines/abc123    # New response shape

# Strategy 2: Header versioning
# Cleaner URLs. Downside: invisible in logs, hard to cache.
GET /pipelines/abc123
Accept-Version: 2024-01-15

# Strategy 3: Query parameter
# Easy for debugging. Downside: pollutes cache keys.
GET /pipelines/abc123?version=2

# Parallel's approach: URL versioning for major (breaking),
# date-based header for minor (additive).
# v2 is the major contract. Adding a new field doesn't bump v2 → v3.
# Removing a field or changing a type DOES bump the version.
```
```python
# OFFSET: simple but broken at scale
GET /v2/jobs?limit=20&offset=1000
# Problem: if 5 new jobs are created between page fetches (newest first),
# page 51 repeats 5 items already shown on page 50. Items shift.
# Also: OFFSET 1000 forces the DB to scan and skip 1000 rows.

# CURSOR: stable and performant
GET /v2/jobs?limit=20&cursor=eyJpZCI6MTAwMH0=
# Cursor encodes the last-seen sort key (e.g., base64 of {"id":1000}).
# DB query: WHERE id > 1000 ORDER BY id LIMIT 20
# No skipped rows, no shifting. One index seek (O(log n)) plus 20 rows,
# so the cost does not grow with page depth.

# Response shape:
{
  "data": [...],
  "pagination": {
    "next_cursor": "eyJpZCI6MTAyMH0=",
    "has_more": true
  }
}
```
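The cursor scheme described above can be sketched in a few lines. This is a minimal illustration, not Parallel's implementation: `encode_cursor`, `decode_cursor`, and `paginate` are hypothetical names, and an in-memory list sorted by `id` stands in for the `WHERE id > ... ORDER BY id LIMIT ...` query.

```python
import base64
import json
from typing import Optional


def encode_cursor(last_seen: dict) -> str:
    """Serialize the last-seen sort key into an opaque, URL-safe token."""
    raw = json.dumps(last_seen, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> dict:
    """Reverse encode_cursor; raise ValueError on a malformed token."""
    try:
        decoded = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    except ValueError as exc:  # covers base64, JSON, and unicode errors
        raise ValueError("invalid cursor") from exc
    if not isinstance(decoded, dict) or "id" not in decoded:
        raise ValueError("invalid cursor")
    return decoded


def paginate(rows: list, cursor: Optional[str] = None, limit: int = 20) -> dict:
    """rows must be sorted by id. Fetch limit + 1 rows to learn has_more."""
    last_id = decode_cursor(cursor)["id"] if cursor else 0
    # Stand-in for: SELECT ... WHERE id > last_id ORDER BY id LIMIT limit + 1
    window = [r for r in rows if r["id"] > last_id][: limit + 1]
    page, has_more = window[:limit], len(window) > limit
    next_cursor = encode_cursor({"id": page[-1]["id"]}) if has_more else None
    return {"data": page,
            "pagination": {"next_cursor": next_cursor, "has_more": has_more}}
```

Fetching one extra row is a common way to compute `has_more` without a separate COUNT query. Treat the cursor as untrusted input: a client can decode and edit it, so validate it like any other parameter.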
Developers spend more time debugging errors than reading success responses. A good error response is worth 100 lines of documentation.
```python
# BAD: generic, useless
{"error": "Bad request"}

# BAD: leaks internals
{"error": "PostgreSQL error: relation 'pipelines' does not exist"}

# GOOD: structured, actionable, safe
{
  "error": {
    "type": "validation_error",
    "message": "Field 'gpu_count' must be between 1 and 8.",
    "code": "INVALID_GPU_COUNT",
    "param": "gpu_count",
    "request_id": "req_abc123",
    "doc_url": "https://docs.parallel.dev/errors/INVALID_GPU_COUNT"
  }
}
```
Network failures happen. Clients retry. Without idempotency, a retry of "create pipeline" creates two pipelines. Idempotency means: calling the same operation twice produces the same result as calling it once.
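```python
# Client sends an Idempotency-Key header with mutating requests:
#   POST /v2/pipelines
#   Idempotency-Key: idem_user123_1716400000
#   {"name": "my-pipeline", "gpu_count": 4}

# Server implementation:
import json

async def create_pipeline(req: Request):
    key = req.headers.get("Idempotency-Key")
    if key is None:
        return {"error": {"type": "validation_error",
                          "code": "MISSING_IDEMPOTENCY_KEY"}}

    # Atomically claim the key. A plain get-then-set would let two
    # concurrent retries both miss the cache and both create a pipeline.
    claimed = await redis.set(f"idem:{key}", "in_progress", nx=True, ex=86400)
    if not claimed:
        cached = await redis.get(f"idem:{key}")
        if cached in (b"in_progress", "in_progress"):
            return {"error": {"type": "conflict",
                              "code": "REQUEST_IN_PROGRESS"}}
        return json.loads(cached)  # Return same response

    try:
        # Execute the operation
        pipeline = await db.create_pipeline(req.body)
    except Exception:
        await redis.delete(f"idem:{key}")  # Let the client retry cleanly
        raise
    response = serialize(pipeline)

    # Cache for 24h so retries return the same result
    await redis.setex(f"idem:{key}", 86400, json.dumps(response))
    return response
```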
Every POST/PATCH endpoint must validate the request body against a schema before touching the database. At Parallel, we use JSON Schema (for REST) and protobuf (for gRPC) to define what valid input looks like.
```python
# JSON Schema for POST /v2/pipelines
PIPELINE_CREATE_SCHEMA = {
    "type": "object",
    "required": ["name", "gpu_count"],
    "properties": {
        "name": {
            "type": "string",
            "minLength": 1,
            "maxLength": 255,
            "pattern": "^[a-z0-9][a-z0-9-]*$",  # DNS-safe names
        },
        "gpu_count": {
            "type": "integer",
            "minimum": 1,
            "maximum": 8,
        },
        "timeout_seconds": {
            "type": "integer",
            "minimum": 5,
            "maximum": 3600,
            "default": 300,
        },
    },
    "additionalProperties": False,  # Reject unknown fields
}

# Why additionalProperties: false?
# A client sends {"name": "test", "gpuCount": 4} (camelCase typo).
# Without this, gpuCount is silently ignored. The client either gets a
# confusing "gpu_count is required" error for a field they believe they
# sent or, for an optional field such as timeout_seconds, silently gets
# the default and wonders why their setting had no effect.
# With this, they get: "Unknown field: gpuCount. Did you mean gpu_count?"
```
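A schema validator will reject the unknown field, but the "Did you mean ...?" hint is usually something you add yourself. One possible sketch uses the standard library's `difflib`; the function name `unknown_field_errors` and the error shape are illustrative assumptions, not part of any validator's API.

```python
import difflib

# Field names accepted by the (assumed) pipeline-create schema.
ALLOWED_FIELDS = ["name", "gpu_count", "timeout_seconds"]


def unknown_field_errors(payload: dict, allowed=ALLOWED_FIELDS) -> list:
    """Return one error entry per unknown top-level field, with a hint
    when an allowed field name is a close textual match."""
    errors = []
    for field_name in payload:
        if field_name in allowed:
            continue
        matches = difflib.get_close_matches(field_name, allowed, n=1, cutoff=0.6)
        hint = f" Did you mean '{matches[0]}'?" if matches else ""
        errors.append({
            "param": field_name,
            "message": f"Unknown field: '{field_name}'.{hint}",
        })
    return errors
```

Calling `unknown_field_errors({"name": "test", "gpuCount": 4})` yields a single entry pointing the developer at `gpu_count`. The 0.6 cutoff is `difflib`'s default; a stricter value trades fewer wrong suggestions for fewer hints overall.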
The most common API design bugs that generate support tickets:
- Inconsistent naming: one endpoint returns created_at but /v2/jobs returns createdAt. Developers write generic parsers that break. Fix: enforce a naming convention (snake_case for REST, camelCase for GraphQL) with a linter in CI.
- Silent input mutation: the client sends gpu_count: 16, and the server silently clamps to 8. The client thinks they have 16 GPUs. Fix: reject invalid values with a clear error, never silently mutate input.
- Leaked internals: a response includes internal_cluster_id: "prod-us-east-7". A competitor maps your infrastructure. Fix: only expose opaque external IDs. Internal IDs stay internal.

The state of the art is design-first API development. You write the OpenAPI spec before any code. Then code generation produces server stubs, client SDKs (Python, TypeScript, Go, Rust), documentation, and test fixtures, all from one source of truth. Parallel generates SDKs in 6 languages from a single OpenAPI YAML.
The frontier push: AI-native APIs. Endpoints designed for LLM tool-use: deterministic schemas, rich descriptions in the spec (so the LLM understands what each field does), and streaming responses via Server-Sent Events so agents get partial results without polling.
Every API call is a journey through a dozen systems, each adding latency. Understanding this journey — and where milliseconds hide — is the difference between a 50ms response and a 500ms response. When an interviewer says "walk me through what happens when a client hits your API," this is what they want to hear.
Step 1: DNS Resolution (1-50ms). The client resolves api.parallel.dev. If cached, instant. If not, the recursive resolver walks the DNS hierarchy. You control this with low TTLs for failover (60s) or high TTLs for speed (300s). Parallel uses Route 53 with latency-based routing — the DNS response points to the nearest edge PoP.
Step 2: TLS Handshake (10-50ms). A full TLS 1.3 handshake requires one round trip (1-RTT). The client and server exchange keys, verify certificates, and establish an encrypted channel. A returning client can use session resumption to skip certificate verification, or 0-RTT early data to send the request in its first flight (safe only for idempotent requests, because early data can be replayed). Parallel terminates TLS at the CDN edge, so the internal network uses plain HTTP (faster, simpler).
Step 3: Load Balancer (1-5ms). The L7 load balancer (e.g., ALB, Envoy) routes to a healthy API server. Routing strategies: round-robin (simple), least-connections (better under uneven load), consistent hashing (for sticky sessions or cache affinity). Parallel uses least-connections with health checks every 5s.
Step 4: Reverse Proxy / API Gateway (2-10ms). Before hitting your application code, the request passes through an API gateway that handles cross-cutting concerns: request ID injection, rate limiting, auth token validation, request logging, CORS headers. This layer exists so your application code stays clean.
Step 5: Application Handler (5-200ms). Your code runs. Parse the request body, validate the schema, execute business logic, query the database, check the cache, assemble the response. This is where 80% of your optimization time goes.
Step 6: Database Query (1-100ms). Connection pool checkout (0-5ms), query execution (1-50ms for indexed reads, 10-100ms for complex joins), result serialization. Slow queries here dominate total latency.
Step 7: Response Serialization (1-5ms). Marshal the response to JSON (or protobuf for gRPC). Set cache-control headers. Compress with gzip/brotli if the client accepts it. Add the request ID to the response headers for debugging.
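To see how the seven ranges above add up against a latency target, here is a small worked calculation. It assumes the steps are strictly sequential and additive, and it treats the database range as separate from the handler range (the text leaves that ambiguous), so read the totals as rough bounds rather than measurements.

```python
# (min_ms, max_ms) per step, copied from the ranges quoted in Steps 1-7.
STEPS = {
    "dns": (1, 50),
    "tls": (10, 50),
    "load_balancer": (1, 5),
    "gateway": (2, 10),
    "handler": (5, 200),
    "database": (1, 100),
    "serialize": (1, 5),
}


def budget(steps: dict, skip: tuple = ()) -> tuple:
    """Sum the best-case and worst-case latency, ignoring skipped steps."""
    best = sum(lo for name, (lo, _) in steps.items() if name not in skip)
    worst = sum(hi for name, (_, hi) in steps.items() if name not in skip)
    return best, worst


# Cold connection: every step is paid.
cold = budget(STEPS)                        # (21, 420)
# Warm keep-alive connection: DNS and TLS were already paid.
warm = budget(STEPS, skip=("dns", "tls"))   # (10, 320)
```

Under these assumptions, even a warm connection can exceed a 200ms target in the worst case, and the handler plus database account for 300 of the 320 worst-case milliseconds. That is why most optimization effort lands in Steps 5 and 6.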
```python
import time
from dataclasses import dataclass, field

@dataclass
class LatencyTrace:
    request_id: str
    spans: list = field(default_factory=list)

    def span(self, name: str):
        return SpanContext(self, name)

    def total_ms(self) -> float:
        return sum(s["duration_ms"] for s in self.spans)

class SpanContext:
    """Context manager that records how long its block took."""
    def __init__(self, trace, name):
        self.trace, self.name = trace, name

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *_):
        dur = (time.perf_counter() - self.start) * 1000
        self.trace.spans.append({"name": self.name, "duration_ms": round(dur, 2)})

# Usage in a handler (validate_token, redis, db, serialize are app-provided):
async def get_pipeline(pipeline_id: str, token: str, trace: LatencyTrace):
    with trace.span("auth"):
        user = await validate_token(token)
    with trace.span("cache_check"):
        cached = await redis.get(f"pipeline:{pipeline_id}")
    if cached:
        return cached  # Cache hit: skip DB entirely
    with trace.span("db_query"):
        pipeline = await db.fetch_pipeline(pipeline_id)
    with trace.span("serialize"):
        response = serialize(pipeline)
    return response

# Log: {"request_id": "req_abc", "spans": [{"name": "auth", "duration_ms": 2.1}, ...]}
```
An interviewer says: "Your p99 latency jumped from 200ms to 800ms. Walk me through your investigation."
Step 1: Is it all endpoints or one? Check per-endpoint latency dashboards. If it's one endpoint, the problem is in that handler. If it's all endpoints, the problem is in a shared layer (DB, cache, network).
Step 2: Is it all customers or one? A single customer with a 10MB payload can slow their requests without affecting others. Check per-customer latency distribution.
Step 3: Check the spans. Pull a sample of slow requests and look at the latency trace. If "db_query" went from 20ms to 600ms, the problem is in the database. If "auth" went from 2ms to 200ms, the auth service is degraded.
Step 4: Correlate with recent changes. Did someone deploy? Did the database auto-scale? Did a cron job start running a heavy migration? Check the deployment timeline against the latency graph.
The load balancer is the traffic cop for your API fleet. A naive round-robin sends requests evenly, but that's only optimal when all servers are identical and all requests cost the same. In practice, neither is true.
| Algorithm | How it works | Best for | Pitfall |
|---|---|---|---|
| Round robin | Each server gets requests in sequence: 1, 2, 3, 1, 2, 3... | Homogeneous fleet, similar request cost | One slow server gets same load as fast ones |
| Least connections | Route to the server with fewest active requests | Variable request durations (short reads + long writes) | Newly booted servers get flooded (0 connections) |
| Weighted round robin | Bigger servers get proportionally more requests | Mixed instance sizes (during migration, canary deploys) | Weights are static — don't adapt to runtime conditions |
| Consistent hashing | Hash the request key → always route to same server | Server-local caching, session affinity | Hotspot if one key is disproportionately popular |
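The consistent hashing row in the table above is the least intuitive of the four, so here is a minimal sketch of a hash ring with virtual nodes. The class name `HashRing`, the choice of SHA-256, and the 100 virtual nodes per server are illustrative assumptions, not a description of any particular load balancer.

```python
import bisect
import hashlib


class HashRing:
    """Map keys to nodes so that removing a node only remaps that node's keys."""

    def __init__(self, nodes, vnodes: int = 100):
        if not nodes:
            raise ValueError("ring needs at least one node")
        # Each node is placed on the ring many times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next virtual node."""
        idx = bisect.bisect_right(self._hashes, self._hash(key))
        return self._ring[idx % len(self._ring)][1]
```

With plain `hash(key) % len(servers)`, removing one server remaps almost every key and empties every server-local cache at once. On the ring, keys that belonged to the surviving servers stay where they were. The hotspot pitfall from the table still applies: one very popular key always lands on the same server.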
```python
# Least-connections with slow-start: protect new instances

# Problem: a new server boots with 0 active connections.
# Least-connections sends ALL new requests to it.
# Its cache is cold, so requests are slow, memory spikes.

# Fix: slow-start ramp. New server gets linearly increasing
# weight over 30 seconds: 10% → 20% → ... → 100%

# In AWS ALB: slow_start.duration_seconds = 30
# In Envoy: slow_start_config { slow_start_window: 30s }
```
When deploying new code, the old server must finish processing in-flight requests before shutting down. This is connection draining.
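```python
# Graceful shutdown pattern (Python/uvicorn).
# app, db_pool, and redis_pool are assumed to be created elsewhere.
import asyncio
import signal
import sys

async def graceful_shutdown():
    # 1. Stop accepting new requests (health check returns 503)
    app.state.shutting_down = True

    # 2. Wait for in-flight requests to complete (max 30s)
    for _ in range(300):  # 30s in 100ms increments
        if app.state.active_requests == 0:
            break
        await asyncio.sleep(0.1)

    # 3. Close database connections cleanly
    await db_pool.close()
    await redis_pool.close()

    # 4. Exit
    sys.exit(0)

def install_sigterm_handler():
    # Register for SIGTERM (sent by the container orchestrator).
    # Call this from a startup hook, inside the running event loop:
    # signal.signal() handlers run outside the loop and cannot safely
    # create tasks. add_signal_handler is Unix-only, and uvicorn installs
    # its own handlers, so this shows the underlying pattern.
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(
        signal.SIGTERM, lambda: asyncio.create_task(graceful_shutdown())
    )
```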
The cutting edge is kernel-level observability with eBPF. Instead of instrumenting your application code with spans, eBPF programs attach to kernel syscalls (connect, read, write) and automatically measure network latency, TCP retransmits, and connection pool behavior — with zero application code changes. Tools like Cilium and Pixie give you full request traces from the kernel level.
Combined with OpenTelemetry auto-instrumentation, you get traces spanning your API server, database client, Redis client, and HTTP clients — all without manually adding span context. The frontier is zero-code full-stack tracing.
Your database is the source of truth. Every API response ultimately comes from data stored here. Get the schema wrong and you'll spend months working around it. Get the indexes wrong and your API will be fast on day one and unusable at 10 million rows. Get the connection pooling wrong and your database will die under load that your application code could easily handle.
Don't design schemas by drawing entity-relationship diagrams. Design them by listing every query your API will run, then building tables that make those queries efficient. This is access-pattern-driven design.
```sql
-- Access patterns for Parallel's pipeline API:
-- 1. Get pipeline by ID (most common; a primary-key B-tree lookup, O(log n))
-- 2. List pipelines by customer, sorted by created_at (pagination)
-- 3. Count active pipelines per customer (quota check)
-- 4. Find pipelines by status (admin dashboard)

CREATE TABLE pipelines (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    name        VARCHAR(255) NOT NULL,
    status      VARCHAR(32) NOT NULL DEFAULT 'pending',
    gpu_count   INTEGER NOT NULL CHECK (gpu_count BETWEEN 1 AND 8),
    config      JSONB NOT NULL DEFAULT '{}',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Index for pattern 2: list by customer, sorted by time
-- Cursor pagination: WHERE customer_id = $1 AND created_at < $cursor
--                    ORDER BY created_at DESC LIMIT 20
-- This is a plain index, not a UNIQUE constraint: two pipelines created by
-- the same customer at the same timestamp must not fail the insert.
CREATE INDEX idx_customer_created
    ON pipelines (customer_id, created_at DESC);

-- Partial index for pattern 3: only count active pipelines
-- Much smaller than a full index, only includes rows where status='active'
CREATE INDEX idx_active_by_customer
    ON pipelines (customer_id) WHERE status = 'active';

-- Index for pattern 4: admin filtering by status
CREATE INDEX idx_status ON pipelines (status);
```
A B-tree index is a sorted data structure that lets the database find rows without scanning the entire table. Think of it as a phone book: if you want to find "Smith," you don't read every entry — you jump to the "S" section, then narrow down. Without an index, every query is a sequential scan that reads every row.
A composite index on (customer_id, created_at) can efficiently serve WHERE customer_id = X and WHERE customer_id = X AND created_at > Y, but NOT WHERE created_at > Y alone. The leftmost column must always be in the WHERE clause. This is the leftmost prefix rule.

```sql
-- BEFORE optimization: full table scan (1.2s at 10M rows)
-- ('abc' stands in for a real customer UUID)
EXPLAIN ANALYZE
SELECT * FROM pipelines
WHERE customer_id = 'abc' AND status = 'active'
ORDER BY created_at DESC
LIMIT 20;

-- Output shows: Seq Scan on pipelines (cost=0.00..185432.00)
-- This means: scanning ALL 10M rows, filtering in memory.

-- AFTER adding the right index: 0.3ms
-- Output shows: Index Scan using idx_customer_created (cost=0.56..24.12)
-- The database jumps directly to customer_id='abc', walks the sorted
-- created_at entries, and stops after 20 matching rows. O(log n + k).
```
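The leftmost prefix rule becomes easier to picture if you treat the composite index as a sorted list of tuples. The sketch below uses Python's `bisect` on a toy list; it is an analogy for how a B-tree is searched, not a model of PostgreSQL internals, and the data is made up for illustration.

```python
import bisect

# A composite index on (customer_id, created_at) behaves like a list of
# tuples kept in sorted order: first by customer, then by time.
index = sorted([
    ("cust_a", 5), ("cust_a", 9), ("cust_b", 2),
    ("cust_b", 7), ("cust_b", 8), ("cust_c", 1),
])


def seek_customer(customer_id, created_after=None):
    """Binary-search to the first candidate, then read a contiguous run."""
    if created_after is None:
        start = bisect.bisect_left(index, (customer_id,))
    else:
        start = bisect.bisect_right(index, (customer_id, created_after))
    matches = []
    for entry in index[start:]:
        if entry[0] != customer_id:
            break  # left this customer's block: stop early
        matches.append(entry)
    return matches


def scan_created_after(created_after):
    """Without customer_id, matches are scattered through the list,
    so every entry has to be inspected (the sequential-scan case)."""
    return [entry for entry in index if entry[1] > created_after]
```

`seek_customer("cust_b", 7)` touches only the tail of one customer's block, while `scan_created_after(6)` has to visit all six entries because the rows with large `created_at` values are not adjacent in the sort order.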
PostgreSQL creates a new process for every connection. At 500 connections, the OS spends more time context-switching between processes than executing queries. A connection pooler like PgBouncer sits between your application and the database, maintaining a pool of reusable connections.
```python
# Without pooling: each request opens a new DB connection (20-50ms)
# With PgBouncer: request checks out a pre-opened connection (0.1ms)

# PgBouncer modes:
# - session: connection held for entire client session (safest, least efficient)
# - transaction: connection returned after each transaction (best for APIs)
# - statement: connection returned after each statement (most efficient,
#   but breaks multi-statement transactions)

# Parallel's config: transaction mode, 20 server connections,
# 1000 client connections. 50:1 multiplexing ratio.
# 1000 concurrent API requests share 20 actual DB connections.

# Application-level pooling (asyncpg):
import asyncpg

pool = await asyncpg.create_pool(
    dsn="postgres://user:pass@pgbouncer:6432/parallel",
    min_size=5,              # Keep 5 connections warm
    max_size=20,             # Never exceed 20 from this process
    command_timeout=10,      # Kill queries after 10s
    # asyncpg caches prepared statements per connection, which can break
    # under transaction pooling unless the pooler supports them.
    statement_cache_size=0,
)
```
You need to add a column, rename a field, or change a type, but your API has 1000 requests/second hitting this table. A naive ALTER TABLE can lock the table for seconds (or minutes at 100M rows) when it has to rewrite every row, as with a type change or, before PostgreSQL 11, ADD COLUMN with a default. Here's how to do it without downtime.
```sql
-- SAFE migration pattern: expand → migrate → contract

-- Step 1: EXPAND: add the new column (nullable, no default)
-- This is a metadata-only change: it does not rewrite the table.
-- It still takes a brief exclusive lock, so set a lock_timeout to
-- avoid queuing behind a long-running transaction.
ALTER TABLE pipelines ADD COLUMN state VARCHAR(32);

-- Step 2: DUAL-WRITE: update application code to write to both columns
-- Deploy code that writes to both "status" and "state".
-- Reads still come from "status".

-- Step 3: BACKFILL: copy data from old column to new (batched!)
-- Never run UPDATE pipelines SET state = status; as one statement.
-- It locks every row for the length of one long transaction.
-- Instead, batch it:
UPDATE pipelines SET state = status
WHERE id IN (SELECT id FROM pipelines WHERE state IS NULL LIMIT 1000);
-- Run this in a loop until all rows are migrated.

-- Step 4: SWITCH: update reads to use new column
-- Deploy code that reads from "state" instead of "status".

-- Step 5: CONTRACT: drop old column (weeks later, after verification)
ALTER TABLE pipelines DROP COLUMN status;
```
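The "run this in a loop" part of the backfill step can be sketched as follows. To keep the example self-contained it uses an in-memory SQLite database as a stand-in for PostgreSQL; the function name `backfill_in_batches` is hypothetical, and a production version would typically add a pause between batches and progress logging.

```python
import sqlite3


def backfill_in_batches(conn, batch_size: int = 1000) -> int:
    """Copy status -> state in small transactions; return the batch count.

    Assumes status is NOT NULL, otherwise rows whose status is NULL
    would keep matching 'state IS NULL' and the loop would never end.
    """
    batches = 0
    while True:
        cur = conn.execute(
            "UPDATE pipelines SET state = status WHERE id IN "
            "(SELECT id FROM pipelines WHERE state IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # short transaction: locks are released per batch
        if cur.rowcount == 0:
            return batches
        batches += 1


# Demo on a throwaway table with 2,500 rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipelines ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, state TEXT)"
)
conn.executemany(
    "INSERT INTO pipelines (status) VALUES (?)", [("active",)] * 2500
)
conn.commit()

batches = backfill_in_batches(conn)
remaining = conn.execute(
    "SELECT count(*) FROM pipelines WHERE state IS NULL"
).fetchone()[0]
```

Committing after every batch is the point of the pattern: each transaction touches at most 1,000 rows, so concurrent API writes wait briefly at worst. Because dual-writes are already live, rows inserted during the backfill arrive with `state` populated and are never picked up by the loop.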
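On PostgreSQL 11+, ADD COLUMN ... DEFAULT 'pending' is instant because the default is stored in the catalog, not written to every row.

Writes go to the primary. Reads can go to read replicas: copies of the primary that stay up to date via replication. At Parallel, 90% of API calls are reads, so sending them to 3 replicas leaves the primary to handle mostly writes. The catch is replication lag: a replica can trail the primary by milliseconds to seconds, so a client that writes and then immediately reads from a replica may not see its own write.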
```python
# Read-your-writes pattern implementation
# After a write, return a consistency token (the primary's WAL position)
# in a response header. The client sends this token back with subsequent
# reads. If the replica has not yet replayed up to that position, the
# read is routed to the primary.

async def create_pipeline(req):
    pipeline = await primary_db.insert(req.body)
    # Return consistency token = current WAL position
    lsn = await primary_db.query("SELECT pg_current_wal_lsn()")
    return Response(
        data=pipeline,
        headers={"X-Consistency-Token": encode_lsn(lsn)}
    )

async def get_pipeline(req, pipeline_id):
    token = req.headers.get("X-Consistency-Token")
    if token and not replica_has_reached(token):
        # Replica hasn't caught up: read from primary
        return await primary_db.fetch(pipeline_id)
    # Safe to read from replica
    return await replica_db.fetch(pipeline_id)
```
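The `replica_has_reached` check above boils down to comparing two WAL positions. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash, so comparing the strings directly gives wrong answers. The sketch below is one way to do the comparison; the two-argument signature is an assumption for illustration, with the replica's position expected to come from something like `SELECT pg_last_wal_replay_lsn()` run on the replica.

```python
def parse_lsn(lsn: str) -> int:
    """Convert PostgreSQL's 'HI/LO' hex text form into a single integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def replica_has_reached(token_lsn: str, replica_replay_lsn: str) -> bool:
    """True when the replica has replayed at least up to the token."""
    return parse_lsn(replica_replay_lsn) >= parse_lsn(token_lsn)
```

A plain string comparison would claim that `"9/0"` is ahead of `"10/0"`, because `"9"` sorts after `"1"`. Parsing to an integer first avoids that. In practice you would cache the replica's replay position for a short interval rather than querying it on every read.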
Some fields don't fit neatly into a fixed schema. Pipeline configurations vary per customer, model parameters change over time, metadata is freeform. JSONB in PostgreSQL gives you a typed, indexed, queryable JSON column inside a relational table.
```sql
-- Store flexible config in a JSONB column
CREATE TABLE pipelines (
    id          UUID PRIMARY KEY,
    customer_id UUID NOT NULL,
    name        VARCHAR(255) NOT NULL,
    config      JSONB NOT NULL DEFAULT '{}',
    -- config might contain: {"gpu_type": "A100", "batch_size": 32,
    --                        "model": "llama-3", "streaming": true}
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Query into JSONB: find all pipelines using A100 GPUs
-- (a ->> comparison is NOT served by the GIN index below; it needs
-- the expression index at the end of this block)
SELECT * FROM pipelines WHERE config ->> 'gpu_type' = 'A100';

-- GIN index on JSONB: makes @> (contains) queries fast
CREATE INDEX idx_config ON pipelines USING GIN (config);

-- Now this is indexed:
SELECT * FROM pipelines WHERE config @> '{"streaming": true}';

-- Expression index: only index one specific key
CREATE INDEX idx_config_gpu ON pipelines ((config ->> 'gpu_type'));
-- Smaller than full GIN, efficient for specific lookups
```
If you find yourself writing WHERE config ->> 'status' = 'active' on every query, that should be a real column.

```sql
-- pg_stat_statements tracks execution stats for every query.
-- Enable it in postgresql.conf:
--   shared_preload_libraries = 'pg_stat_statements'

-- Find the slowest queries by total time:
SELECT
    query,
    calls,
    round(total_exec_time::numeric / 1000, 2) AS total_seconds,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- This tells you:
-- "SELECT * FROM pipelines WHERE customer_id=$1" runs 500K times/day,
-- mean 2ms, stddev 50ms (high variance = some queries hit cold cache)

-- Action: if mean is low but stddev is high, the query is fast USUALLY
-- but slow SOMETIMES. That's your p99 problem.
-- Look at buffer reads (shared_blks_hit vs shared_blks_read) to see
-- if slow queries are hitting disk instead of buffer cache.
```
```sql
-- Scenario: two transactions deadlock
-- T1: UPDATE pipelines SET status='active' WHERE id='A';
--     UPDATE pipelines SET status='active' WHERE id='B';
-- T2: UPDATE pipelines SET status='active' WHERE id='B';
--     UPDATE pipelines SET status='active' WHERE id='A';
-- T1 locks A, waits for B. T2 locks B, waits for A. DEADLOCK.

-- Fix: always acquire locks in a deterministic order (sorted by ID)
-- Both transactions lock A first, then B. No cycle possible.

-- Detection: PostgreSQL auto-detects deadlocks and aborts one transaction.
-- But detection takes 1s (deadlock_timeout). Prevention is better.
```
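The lock-ordering fix is not specific to SQL. The sketch below reproduces the T1/T2 scenario with Python threads and in-process locks standing in for row locks; the names `locks`, `status`, and `activate` are illustrative only. In SQL the equivalent idea is to lock the rows in sorted order first, for example with a `SELECT ... ORDER BY id FOR UPDATE`, before updating them.

```python
import threading

# One lock per "row". In the database these would be row-level locks.
locks = {"A": threading.Lock(), "B": threading.Lock()}
status = {"A": "pending", "B": "pending"}


def activate(ids):
    """Acquire locks in sorted order, update, then release in reverse."""
    ordered = sorted(ids)  # deterministic order removes the wait cycle
    for row_id in ordered:
        locks[row_id].acquire()
    try:
        for row_id in ordered:
            status[row_id] = "active"
    finally:
        for row_id in reversed(ordered):
            locks[row_id].release()


# T1 asks for A then B; T2 asks for B then A. Both end up locking A first.
t1 = threading.Thread(target=activate, args=(["A", "B"],))
t2 = threading.Thread(target=activate, args=(["B", "A"],))
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)
```

Because both workers take the lock for A before the lock for B, whichever one gets A first also gets B, and the other simply waits. No worker can ever hold B while waiting for A, so the cycle needed for a deadlock cannot form.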
Connection pool exhaustion is one of the top causes of API outages. Monitor these metrics:
```python
# Metrics to emit from your connection pool:
pool_total_connections.set(pool.get_size())        # Total connections open
pool_available.set(pool.get_idle_size())           # Connections idle (available)
pool_waiting.set(pool.get_waiters())               # Requests waiting for a connection
pool_checkout_time.observe(checkout_duration_ms)   # Time to get a connection
# Note: get_size() and get_idle_size() exist on asyncpg's Pool. get_waiters()
# is assumed here; with asyncpg you would maintain that counter yourself
# around pool.acquire().

# Alert when pool_waiting > 0 for > 30s
# → Requests are queuing for DB connections. You need more pool capacity
#   or PgBouncer is saturated.

# Alert when pool_checkout_time p99 > 100ms
# → Connections are held too long. Look for slow queries or a code path
#   that never releases the connection (no `finally` or `async with`).

# Typical healthy values at Parallel:
# pool_total: 20, pool_available: 12-18, pool_waiting: 0
# pool_checkout_time p99: <1ms
```
CockroachDB (PostgreSQL-compatible) and TiDB (MySQL-compatible) offer familiar SQL with automatic sharding, distributed transactions, and multi-region replication. Instead of manually managing read replicas and sharding logic, the database handles it. The tradeoff: higher per-query latency (cross-node coordination) but horizontal scalability without application changes.
Neon and PlanetScale offer serverless Postgres/MySQL with instant branching (create a copy of production for testing in seconds) and auto-scaling to zero (no cost when idle). This is changing how teams think about database provisioning.
Caching is the art of remembering expensive answers so you don't compute them again. At Parallel, a cache hit means responding in 2ms instead of 50ms — a 25x improvement. But caching introduces a new problem: how do you know when the cached answer is wrong? Cache invalidation is one of the two hard problems in computer science (the other is naming things).
| Layer | What it caches | TTL | Hit rate | Latency |
|---|---|---|---|---|
| CDN Edge | Static assets, public GET responses | 5-60 min | ~60% | 1-10ms (geographically close) |
| API Gateway | Auth token validation results | 5 min | ~90% | 0.5ms (in-memory) |
| Application (Redis) | DB query results, computed aggregations | 1-10 min | ~70% | 1-3ms (network hop to Redis) |
| Application (Local) | Config, feature flags, rate limit rules | 30s | ~99% | 0.01ms (process memory) |
| Database | Query plan cache, buffer pool (recently-read pages) | N/A (LRU) | ~85% | 0.1ms (RAM) vs 5ms (disk) |
```python
import json

# CACHE-ASIDE (a.k.a. lazy loading): Parallel's primary pattern
# Application manages the cache explicitly.
async def get_pipeline(pipeline_id: str):
    # 1. Check cache first
    cached = await redis.get(f"pipeline:{pipeline_id}")
    if cached:
        return json.loads(cached)  # Cache HIT: 2ms response

    # 2. Cache MISS: query database
    pipeline = await db.fetch_pipeline(pipeline_id)

    # 3. Populate cache for next time (TTL = 5 minutes)
    await redis.setex(f"pipeline:{pipeline_id}", 300, json.dumps(pipeline))
    return pipeline

# WRITE-THROUGH: update cache on every write
async def update_pipeline(pipeline_id: str, updates: dict):
    # 1. Write to database
    pipeline = await db.update_pipeline(pipeline_id, updates)
    # 2. Immediately update cache (cache is always fresh)
    await redis.setex(f"pipeline:{pipeline_id}", 300, json.dumps(pipeline))
    return pipeline
```
Imagine a popular endpoint whose cache entry expires. In the next millisecond, 500 requests arrive, all find the cache empty, and all query the database simultaneously. The database gets 500 identical queries instead of 1. This is the thundering herd (or cache stampede).
```python
import asyncio, json, math, random

# Fix 1: Probabilistic early expiration (XFetch)
# Each request has a small chance of refreshing the cache BEFORE it expires.
# Instead of 500 requests all missing at t=300s, one request refreshes at t=280s.
def should_refresh(ttl_remaining: float, delta: float, beta: float = 1.0) -> bool:
    # delta = how long a recompute takes (seconds); beta > 1 refreshes earlier.
    # The probability of refreshing rises as ttl_remaining approaches 0.
    if ttl_remaining <= 0:
        return True
    # 1 - random() lies in (0, 1], so log() never receives 0
    return -delta * beta * math.log(1.0 - random.random()) >= ttl_remaining

# Fix 2: Lock-based refresh (single-flight)
# Only one request refreshes; the others wait and retry.
async def get_with_lock(key: str, fetch_fn, max_wait: float = 5.0):
    waited = 0.0
    while True:
        cached = await redis.get(key)
        if cached:
            return json.loads(cached)
        # Try to acquire refresh lock (NX = only if not exists, EX = 5s TTL)
        if await redis.set(f"lock:{key}", "1", nx=True, ex=5):
            try:
                # We won the lock: fetch and populate
                data = await fetch_fn()
                await redis.setex(key, 300, json.dumps(data))
                return data
            finally:
                await redis.delete(f"lock:{key}")  # Release even if the fetch fails
        if waited >= max_wait:
            return await fetch_fn()  # Stop waiting and go to the source directly
        # Someone else is refreshing: wait and retry (a loop, not unbounded recursion)
        await asyncio.sleep(0.1)
        waited += 0.1
```
```python
# Strategy 1: TTL-based (simplest, eventually consistent)
# Set a TTL when caching. After it expires, next request re-fetches.
await redis.setex(f"pipeline:{pipeline_id}", 300, data)  # 5 min TTL
# Pro: zero invalidation complexity
# Con: stale for up to TTL seconds after a write

# Strategy 2: Event-driven (fresh right after each write)
# On every write, delete the cache entry.
async def update_pipeline(pipeline_id, updates):
    pipeline = await db.update(pipeline_id, updates)
    await redis.delete(f"pipeline:{pipeline_id}")                # Invalidate
    await redis.delete(f"pipeline_list:{pipeline.customer_id}")  # Invalidate list too!
    return pipeline
# Pro: readers see the new value on their next request
# Con: must invalidate EVERY cache key that includes this data
#      (single resource AND list endpoints; easy to miss one).
#      A read racing the write can still re-cache the old row, so keep a TTL as well.

# Strategy 3: Version-tagged keys
# Cache key includes a version counter. Bump on write.
# Old keys auto-expire via TTL. No explicit deletion needed.
version = await redis.incr(f"pipeline_version:{pipeline_id}")  # On write
cache_key = f"pipeline:{pipeline_id}:v{version}"               # Readers GET the version first
# Pro: no race conditions during invalidation
# Con: more Redis keys, relies on TTL to clean up old versions

# Strategy 4: Pub/Sub invalidation (for multi-server setups)
# Publish invalidation events. All servers subscribe and clear local caches.
await redis.publish("cache_invalidation", json.dumps({
    "type": "pipeline", "id": pipeline_id, "action": "updated"
}))
```
HTTP cache headers control how CDNs, browsers, and intermediate proxies cache your responses. Getting these wrong means either serving stale data or bypassing the cache entirely.
```python
# For rarely changing resources (model metadata that changes monthly):
# Cache aggressively at CDN and browser.
headers = {
    "Cache-Control": "public, max-age=3600, s-maxage=86400",
    # max-age: browser caches for 1 hour
    # s-maxage: CDN caches for 24 hours
    "ETag": "\"v1-abc123\"",  # For conditional requests
}

# For mutable resources (pipelines, jobs):
# Don't cache at CDN or browser. Let application cache handle it.
headers = {
    "Cache-Control": "private, no-store",
    # private: shared caches (CDNs, proxies) must not store it (user-specific data)
    # no-store: no cache, including the browser, may store it
}

# For list endpoints with stale-while-revalidate:
headers = {
    "Cache-Control": "public, max-age=10, stale-while-revalidate=60",
    # Serve cached for 10s. For the next 60s, serve stale AND revalidate in background.
    # Users rarely wait on a cache miss. Well suited to high-traffic list endpoints.
}
```
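The `ETag` header above only pays off when the server also answers conditional requests. The following is a minimal sketch of that handshake, not Parallel's implementation: `make_etag` and `conditional_get` are hypothetical helpers, the ETag is assumed to be a hash of the response body, and weak validators (`W/"..."`) are ignored for brevity.

```python
import hashlib
from typing import Optional

def make_etag(body: bytes) -> str:
    """Strong ETag derived from the response body (quoted, as HTTP requires)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def conditional_get(body: bytes, if_none_match: Optional[str]) -> tuple:
    """Return (status, headers, body); 304 with an empty body if the client copy is current."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=3600"}
    # If-None-Match may carry several comma-separated tags, or "*"
    candidates = [tag.strip() for tag in (if_none_match or "").split(",")]
    if "*" in candidates or etag in candidates:
        return 304, headers, b""
    return 200, headers, body
```

A client that stored the ETag from a first response sends it back in `If-None-Match`; as long as the body is unchanged, the server skips the payload and returns only the 304.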
A customer updates their pipeline name but the GET endpoint returns the old name for 5 minutes. Investigation:
Symptom: Write succeeds (200 OK with new name), subsequent read returns old name.
Root cause 1: Cache-aside without invalidation. The write updates the DB but doesn't delete/update the cache entry. The stale cache is served until TTL expires.
Root cause 2: Read replica lag. The write goes to primary, the read goes to a replica that hasn't caught up. Cache is correct but the source data is stale.
Root cause 3: CDN caching. The GET response has a Cache-Control: max-age=300 header. Even after the application cache is updated, the CDN serves its stale copy.
Cloudflare Workers KV and Vercel Edge Config push caching to the edge — 300+ PoPs worldwide. Read latency drops to <5ms globally. The frontier: write-through edge caches that propagate updates in <100ms worldwide. Combined with stale-while-revalidate HTTP semantics, users never see a cache miss.
Without rate limiting, one customer's script gone haywire can take down your entire API. Rate limiting is not about being mean to developers — it's about fairness. Every customer gets their fair share of capacity, and no single customer can starve the others. Think of it as traffic lights on a highway on-ramp: they slow individual cars so the highway keeps flowing for everyone.
The token bucket is the most common rate-limiting algorithm. Imagine a bucket that holds tokens. Tokens are added at a fixed rate (e.g., 100/second). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing short bursts.
```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # Tokens added per second
        self.capacity = capacity      # Max tokens in bucket
        self.tokens = capacity        # Start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # Request allowed
        return False      # Rate limited (429)

# Usage:
bucket = TokenBucket(rate=100, capacity=200)  # 100 req/s, burst up to 200

def handle(request):
    if not bucket.allow():
        return Response(status=429, headers={
            "Retry-After": "1",
            "X-RateLimit-Limit": "100",
            "X-RateLimit-Remaining": "0",
            "X-RateLimit-Reset": str(int(time.time()) + 1),
        })
    # Otherwise, handle the request normally
```
Fixed window: Count requests in each 1-minute window (e.g., 12:00-12:01, 12:01-12:02). Problem: a burst at 12:00:59 + a burst at 12:01:01 gets 2x the limit because they span two windows.
Sliding window log: Track the timestamp of every request. Count requests in the last 60 seconds. Accurate but memory-intensive (storing every timestamp).
Sliding window counter: Combine the current window's count with a weighted portion of the previous window's count. Approximate but memory-efficient — only two counters per customer.
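The sliding window counter is the least obvious of the three, so here is a minimal in-process sketch of it. The class name and the injectable `now` argument are illustrative assumptions; a production version would keep the two counters in a shared store rather than in process memory.

```python
import time

class SlidingWindowCounter:
    """Approximate sliding-window limiter that keeps only two counters."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.current_index = -1   # Index of the fixed window being counted
        self.current_count = 0
        self.previous_count = 0

    def allow(self, now=None) -> bool:
        now = time.time() if now is None else now
        index = int(now // self.window)
        if index != self.current_index:
            # Carry the old count over only if it belongs to the adjacent window
            adjacent = index == self.current_index + 1
            self.previous_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.current_index = index
        # Fraction of the previous window that still overlaps the sliding window
        overlap = 1.0 - (now % self.window) / self.window
        estimated = self.previous_count * overlap + self.current_count
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True
```

With a limit of 10 per minute, a client that spends its full allowance at the end of one window gets almost nothing at the start of the next, because the previous window's count is still weighted at nearly 100%. That is the boundary burst a fixed window lets through.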
```python
import time

# The challenge: your API runs on 10 servers. A per-process token bucket
# allows 10x the intended limit (each server has its own bucket).
# Solution: centralized counter in Redis.
async def check_rate_limit(customer_id: str, limit: int, window: int) -> bool:
    key = f"rl:{customer_id}:{int(time.time()) // window}"
    # Atomic increment + TTL in one round trip
    pipe = redis.pipeline()
    pipe.incr(key)
    pipe.expire(key, window + 1)  # +1 to avoid race
    count, _ = await pipe.execute()
    return count <= limit

# Per-customer quotas: different tiers get different limits
TIER_LIMITS = {
    "free":       {"rpm": 60,   "rpd": 1000},
    "pro":        {"rpm": 600,  "rpd": 50000},
    "enterprise": {"rpm": 6000, "rpd": 1000000},
}
```
Instead of a hard 429, sophisticated APIs degrade gracefully. At 80% of the limit, start returning slightly stale cached data (faster, cheaper). At 100%, return 429 with Retry-After header. At 200% (suspected abuse), temporarily block the API key and alert the security team.
```python
import time

# Graceful degradation with response headers
async def rate_limit_middleware(request, call_next):
    customer = request.customer
    usage = await get_usage(customer.id)
    limit = TIER_LIMITS[customer.tier]["rpm"]
    reset_at = next_window_timestamp()
    retry_after = max(1, int(reset_at - time.time()))  # Seconds until the window resets

    # Always include rate limit headers (even when not limited)
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, limit - usage)),
        "X-RateLimit-Reset": str(reset_at),
    }

    if usage >= limit * 2:
        # 200%+: suspected abuse, hard block
        await alert_security(customer.id, usage)
        return Response(status=429, headers={**headers, "Retry-After": "60"})

    if usage >= limit:
        # 100%+: return 429 with retry guidance that matches the per-minute window
        return Response(status=429,
            headers={**headers, "Retry-After": str(retry_after)},
            body={"error": {
                "type": "rate_limit_exceeded",
                "message": f"Rate limit of {limit} requests/minute exceeded. "
                           f"Retry after {retry_after} seconds.",
                "doc_url": "https://docs.parallel.dev/rate-limits"}})

    if usage >= limit * 0.8:
        # 80%+: add warning header (the point at which to prefer cached data)
        headers["X-RateLimit-Warning"] = "Approaching rate limit"

    response = await call_next(request)
    response.headers.update(headers)
    return response
```
The most common support ticket: "I'm getting 429 errors but I'm only making 10 requests/minute." Causes:
1. Multiple processes sharing one API key. The developer has 6 workers, each making 10 req/min = 60 total. They only see their worker's count.
2. Retry storms. Their client retries 429s immediately (without backoff), consuming more tokens and getting more 429s. A 10 req/min client can generate 100 actual requests/min through retries.
3. Clock skew. Their system clock is 30 seconds ahead. They think they're in a new rate limit window but the server disagrees.
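Cause 2 is the one a client library can fix by itself. The sketch below shows exponential backoff with full jitter that also honors `Retry-After`. The function names are hypothetical, `send` stands in for whatever performs the HTTP call and returns `(status, headers, body)`, and `Retry-After` is assumed to be in seconds rather than an HTTP date.

```python
import random
import time

def backoff_delay(attempt: int, retry_after: float = 0.0,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Full jitter: pick uniformly in [0, min(cap, base * 2^attempt)]
    jittered = random.uniform(0, min(cap, base * 2 ** attempt))
    # Never retry sooner than the server asked for
    return max(jittered, retry_after)

def call_with_retries(send, max_attempts: int = 5, sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts - 1:
            return status, body
        retry_after = float(headers.get("Retry-After", 0))
        sleep(backoff_delay(attempt, retry_after))
```

The jitter matters as much as the exponent: without it, six workers that were throttled together all retry at the same instant and are throttled together again.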
Not all requests cost the same. A simple GET costs 1 "unit" but a complex search query costs 10 units (more CPU, more DB I/O). Cost-based rate limiting (used by GitHub GraphQL, Shopify, and now Parallel) assigns a cost to each request type and deducts from a budget. This prevents one customer from monopolizing expensive endpoints while staying within their "request count" limit.
```python
# Cost-based rate limiting implementation
ENDPOINT_COSTS = {
    "GET /v2/pipelines/{id}": 1,        # Cheap: single row lookup
    "GET /v2/pipelines": 5,             # Medium: list query + pagination
    "POST /v2/pipelines": 10,           # Expensive: validation + DB write + queue
    "GET /v2/pipelines/{id}/logs": 20,  # Very expensive: scan log storage
    "POST /v2/search": 25,              # Most expensive: full-text search
}

# Customer budget: 10,000 units/minute (pro tier)
# A customer can make 10,000 cheap GETs, or 400 searches, or a mix.
# Response headers show remaining budget:
#   X-RateLimit-Cost: 25
#   X-RateLimit-Budget-Remaining: 4975
#   X-RateLimit-Budget-Reset: 1716400060

async def check_cost_limit(customer_id, tier, endpoint, method):
    cost = ENDPOINT_COSTS.get(f"{method} {endpoint}", 1)
    key = f"budget:{customer_id}:{current_minute()}"
    current = await redis.incrby(key, cost)
    if current == cost:  # First request this minute
        await redis.expire(key, 61)
    budget = TIER_BUDGETS[tier]  # e.g., 10000 (tier is passed in by the caller)
    return current <= budget, budget - current, cost
```
Authentication answers "who are you?" Authorization answers "what are you allowed to do?" Get either wrong and you're on the front page of Hacker News — not in a good way. At Parallel, every API request must be authenticated, every action must be authorized, and every access must be logged for audit.
The simplest auth: the client sends a secret key in every request. API keys are easy for developers (just add a header) but dangerous if leaked (no expiration, full access). Parallel uses API keys for server-to-server calls where the client is a backend service, not a browser.
```python
# API key design: prefix + random bytes
# Prefix makes keys greppable in logs: "pk_live_" vs "pk_test_"
# Store the HASH in the database, not the key itself.
import secrets, hashlib

def generate_api_key(prefix: str = "pk_live_") -> tuple[str, str]:
    raw = secrets.token_urlsafe(32)  # 256 bits of entropy
    key = prefix + raw               # pk_live_a3Bc9d...
    key_hash = hashlib.sha256(key.encode()).hexdigest()
    return key, key_hash  # Give key to user, store hash in DB

async def validate_api_key(key: str) -> Customer | None:
    key_hash = hashlib.sha256(key.encode()).hexdigest()
    # Look up by hash. The raw key is never compared byte by byte, so lookup
    # timing reveals nothing useful about the key itself.
    return await db.get_customer_by_key_hash(key_hash)
```
For user-facing applications (dashboards, CLI tools), Parallel uses OAuth 2.0 with JWT (JSON Web Tokens). The flow: user authenticates with their identity provider, gets a JWT, sends it with every request. The server validates the JWT's signature without hitting a database — the token contains the user's identity and permissions, signed by a private key.
```python
# JWT structure: header.payload.signature
# Header:    {"alg": "RS256", "typ": "JWT"}
# Payload:   {"sub": "user_123", "org": "org_456",
#             "roles": ["admin"], "exp": 1716400000}
# Signature: RS256(header + "." + payload, private_key)
import jwt

def validate_jwt(token: str, public_key: str) -> dict:
    try:
        payload = jwt.decode(token, public_key, algorithms=["RS256"])
        return payload  # {"sub": "user_123", "roles": [...], ...}
    except jwt.ExpiredSignatureError:
        raise AuthError("Token expired", code=401)
    except jwt.InvalidTokenError:
        raise AuthError("Invalid token", code=401)
```
RBAC maps users to roles, and roles to permissions. A user can have multiple roles, each granting specific permissions on specific resources.
| Role | Permissions | Use case |
|---|---|---|
| viewer | read:pipelines, read:jobs | Dashboard-only users, monitoring |
| developer | viewer + create:pipelines, update:pipelines | Engineers building on the platform |
| admin | developer + delete:pipelines, manage:keys, manage:members | Team leads, account owners |
| billing | read:usage, manage:billing | Finance team (no API access) |
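To make the inheritance in the table concrete, here is a small sketch of how a permission check could resolve it. The dictionaries mirror the table; the `ROLE_PARENTS` mapping and the function names are illustrative assumptions, not Parallel's actual authorization code.

```python
# Permissions each role adds on top of the role it extends (mirrors the table)
ROLE_PERMISSIONS = {
    "viewer": {"read:pipelines", "read:jobs"},
    "developer": {"create:pipelines", "update:pipelines"},
    "admin": {"delete:pipelines", "manage:keys", "manage:members"},
    "billing": {"read:usage", "manage:billing"},
}
# "developer = viewer + ...", "admin = developer + ..."
ROLE_PARENTS = {"developer": "viewer", "admin": "developer"}

def permissions_for(role: str) -> set:
    """Collect a role's own permissions plus everything it inherits."""
    perms = set()
    while role is not None:
        perms |= ROLE_PERMISSIONS.get(role, set())
        role = ROLE_PARENTS.get(role)
    return perms

def is_allowed(roles: list, permission: str) -> bool:
    """A user is allowed if any of their roles grants the permission."""
    return any(permission in permissions_for(role) for role in roles)
```

Unknown roles resolve to an empty permission set, so the check fails closed rather than raising.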
When your API sends webhooks (event notifications), the receiver needs to verify they came from you, not an attacker. The solution: HMAC signatures.
```python
import hmac, hashlib

def sign_webhook(payload: bytes, secret: str) -> str:
    return hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()

# Sending: compute signature, include in header
sig = sign_webhook(body, customer_webhook_secret)
headers = {"X-Parallel-Signature": f"sha256={sig}"}

# Receiving (customer's code): verify signature
def verify_webhook(payload: bytes, header_sig: str, secret: str) -> bool:
    expected = sign_webhook(payload, secret)
    # Constant-time comparison prevents timing attacks
    return hmac.compare_digest(f"sha256={expected}", header_sig)
```
API keys get leaked. Developers commit them to GitHub. Employees leave. You need a rotation mechanism that doesn't break existing integrations.
```python
import hashlib
from datetime import datetime, timedelta, timezone

# Key rotation implementation
async def rotate_api_key(customer_id: str) -> dict:
    # Generate new key
    new_key, new_hash = generate_api_key("pk_live_")
    # Get current key info
    current = await db.get_active_key(customer_id)
    # Mark current key as "expiring" with 24h grace period
    # (computed once so the stored and reported expiry match)
    old_key_expires_at = datetime.now(timezone.utc) + timedelta(hours=24)
    await db.update_key(current.id, status="expiring", expires_at=old_key_expires_at)
    # Insert new key as active
    await db.insert_key(customer_id, key_hash=new_hash, status="active")
    # Both keys work during the transition window
    return {
        "new_key": new_key,  # Show once, never stored in plaintext
        "old_key_expires_at": old_key_expires_at.isoformat(),
        "message": "Deploy the new key within 24 hours. The old key will be revoked after that."
    }

# Validation now checks both active and expiring keys:
async def validate_api_key(key: str):
    key_hash = hashlib.sha256(key.encode()).hexdigest()
    record = await db.get_key_by_hash(key_hash)
    if not record:
        return None
    if record.status == "revoked":
        return None
    if record.status == "expiring" and record.expires_at < datetime.now(timezone.utc):
        return None  # Grace period over
    return record.customer
```
Every API action that modifies data must produce an audit log entry. This is essential for security investigations, compliance (SOC 2), and debugging customer issues ("when did this pipeline get deleted?").
```python
# Audit log schema: immutable, append-only
class AuditEntry:
    id: str              # Unique event ID
    timestamp: datetime  # When
    actor_id: str        # Who (user_id or api_key_id)
    actor_type: str      # "user", "api_key", "system"
    action: str          # "pipeline.created", "key.rotated"
    resource_type: str   # "pipeline", "api_key"
    resource_id: str     # "pipe_abc123"
    ip_address: str      # Request source IP
    user_agent: str      # SDK version, etc.
    changes: dict        # {"status": {"from": "active", "to": "deleted"}}
    request_id: str      # Correlate with request logs

# Write to an append-only table (no UPDATE, no DELETE)
# Retention: 2 years minimum for SOC 2 compliance
# Indexed on: actor_id, resource_id, action, timestamp
```
A developer says: "I'm getting 403 on POST /v2/pipelines but I'm an admin." Investigation:
Check 1: Are they using the right API key? They might have a test key (pk_test_) hitting the production endpoint.
Check 2: Is it really a permissions problem? An expired or invalid JWT produces a 401, so a 403 means the token validated but its permissions are insufficient. Decode the token and check the roles in its payload.
Check 3: Organization context. They're an admin in org_A but their request targets a resource in org_B. RBAC is per-organization.
A single all-powerful API key is dangerous. If leaked, the attacker has full access to everything. Scoped keys limit each key to specific permissions and resources.
```python
# Scoped key creation: the customer requests specific permissions
# POST /v2/api-keys
{
    "name": "CI/CD deploy key",
    "permissions": ["pipelines:write", "jobs:read"],
    "resource_ids": ["pipe_abc", "pipe_def"],  # Only these pipelines
    "expires_at": "2025-06-22T00:00:00Z",      # Auto-expire in 30 days
    "ip_whitelist": ["203.0.113.0/24"]         # Only from CI network
}
# Response: pk_live_scoped_... (this key can ONLY write to those
# two pipelines, from that IP range, for 30 days)

# If this key is leaked, damage is contained:
# ✓ Can't read customer data (no "customers:read" permission)
# ✓ Can't delete pipelines (no "pipelines:delete" permission)
# ✓ Expires automatically
# ✓ Fails from non-whitelisted IPs
```
The frontier of API authentication is moving beyond shared secrets. Passkeys (WebAuthn/FIDO2) use public-key cryptography — the private key never leaves the user's device. No passwords to leak, no API keys to rotate. For machine-to-machine auth, SPIFFE/SPIRE provides cryptographic identity without shared secrets, using short-lived X.509 certificates that rotate automatically.
You cannot improve what you cannot measure. Observability is the ability to understand a system's internal state from its external outputs — logs, metrics, and traces. Reliability is the discipline of making promises (SLOs) and keeping them. Together, they answer: "Is the API healthy, and how do I know?"
Logs: Structured event records. Every request produces a log entry with request_id, endpoint, status, latency, customer_id. Use JSON format so they're machine-parseable. Avoid print("something went wrong") — use structured logging with severity levels.
```python
import structlog

log = structlog.get_logger()

# GOOD: structured, searchable, correlatable
log.info("request.completed",
    request_id="req_abc123",
    endpoint="GET /v2/pipelines/{id}",
    status=200,
    latency_ms=42.3,
    customer_id="cust_xyz",
    cache_hit=True,
    db_query_ms=0,
)
# Output: {"event": "request.completed", "request_id": "req_abc123", ...}
# Searchable in Datadog/Grafana: filter by customer_id, sort by latency_ms

# BAD: unstructured, impossible to search programmatically
print(f"Completed request for pipeline abc in 42ms")
```
Metrics: Numerical time-series data. Counter (requests_total), gauge (active_connections), histogram (request_latency_seconds). Metrics tell you "what is happening right now" but not "why."
```python
# The four types of metrics and when to use each:

# COUNTER: monotonically increasing. Good for rates (req/s, errors/s).
http_requests_total.labels(method="GET", endpoint="/v2/pipelines", status=200).inc()

# GAUGE: goes up and down. Good for current state.
active_db_connections.set(pool.size - pool.available)
request_queue_depth.set(queue.qsize())

# HISTOGRAM: distribution of values. Good for latencies.
request_latency.labels(endpoint="/v2/pipelines").observe(duration_seconds)
# Automatically gives you p50, p90, p99 via quantile calculations.
# Bucket boundaries: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5]

# SUMMARY: like histogram but pre-computes quantiles client-side.
# Avoid unless you specifically need client-side quantiles.
# Histograms are more flexible (can aggregate across instances).
```
Traces: End-to-end request paths through the system. A trace contains spans, each representing a unit of work (auth check, DB query, cache lookup). Traces tell you "where is the time going" for a specific request.
```python
# OpenTelemetry tracing: the industry standard (2024+)
from opentelemetry import trace

tracer = trace.get_tracer("parallel.api")

async def get_pipeline(request, pipeline_id):
    with tracer.start_as_current_span("get_pipeline") as span:
        span.set_attribute("pipeline.id", pipeline_id)
        span.set_attribute("customer.id", request.customer_id)

        with tracer.start_as_current_span("cache.lookup"):
            cached = await redis.get(f"pipeline:{pipeline_id}")
            if cached:
                span.set_attribute("cache.hit", True)
                return cached

        with tracer.start_as_current_span("db.query") as db_span:
            pipeline = await db.fetch(pipeline_id)
            db_span.set_attribute("db.statement", "SELECT * FROM pipelines WHERE id=$1")
            db_span.set_attribute("db.rows_returned", 1)
        return pipeline

# This produces a trace like:
# [get_pipeline: 32ms]
#   ├── [cache.lookup: 1ms] (hit=false)
#   └── [db.query: 28ms] (rows=1)
# Each span has timestamps, attributes, and parent-child relationships.
```
A Service Level Indicator (SLI) is a metric you care about: request latency, error rate, availability. A Service Level Objective (SLO) is a target for that metric: "99.9% of requests complete in under 200ms." An error budget is how much you're allowed to miss the SLO: 0.1% of requests can be slow.
| SLI | SLO | Error Budget (monthly) | Action on budget burn |
|---|---|---|---|
| Availability | 99.95% | 21.9 minutes downtime | Freeze deployments, investigate |
| Latency (p99) | <200ms | 1% of requests can exceed | Scale up, optimize slow paths |
| Error rate | <0.1% | ~4,300 errors per 4.3M requests | Rollback last deploy, page on-call |
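The budget figures in the table follow from simple arithmetic. This sketch reproduces them; the function names are illustrative, and the downtime figure assumes an average month of 365/12 days, which is what makes 99.95% come out at 21.9 minutes.

```python
def downtime_budget_minutes(slo: float, days: float = 365 / 12) -> float:
    """Minutes of allowed downtime per period for an availability SLO."""
    return (1.0 - slo) * days * 24 * 60

def allowed_failures(slo: float, total_requests: int) -> int:
    """Number of requests that may fail without breaching a success-rate SLO."""
    return round((1.0 - slo) * total_requests)

def hours_to_exhaustion(burn_rate: float, days: float = 30.0) -> float:
    """How long a period's budget lasts at a constant burn rate.

    A burn rate of 1 spends the budget exactly over the whole period.
    """
    return days * 24 / burn_rate

print(round(downtime_budget_minutes(0.9995), 1))  # availability row
print(allowed_failures(0.999, 4_300_000))         # error-rate row
print(round(hours_to_exhaustion(14.4), 1))        # fast-burn paging threshold
```

The last line is the useful one when tuning alerts: at a sustained 14.4x burn rate a 30-day budget lasts about 50 hours, which is why that rate justifies a page rather than a ticket.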
```yaml
# BAD alert: fires on any 500 error
# Result: 50 alerts/day, on-call ignores them all, misses real outage
- alert: Any500Error
  expr: http_errors_total{status="500"} > 0

# GOOD alert: fires on burn rate (how fast are we burning the error budget?)
# If we're burning 14.4x the budget rate, a 30-day budget is gone in about 2 days.
# This triggers a page. Low burn rate gets a ticket.
- alert: HighErrorBurnRate
  expr: |
    (sum(rate(http_errors_total[1h])) / sum(rate(http_requests_total[1h])))
    > (14.4 * 0.001)  # 14.4x the 0.1% error budget rate
  for: 5m
  labels:
    severity: page

- alert: SlowErrorBurnRate
  expr: |
    (sum(rate(http_errors_total[6h])) / sum(rate(http_requests_total[6h])))
    > (3 * 0.001)  # 3x the budget rate
  for: 30m
  labels:
    severity: ticket
```
Google's SRE book defines four signals that every API must monitor. If you build one dashboard, build this one:
| Signal | Metric | Alert threshold | What it tells you |
|---|---|---|---|
| Latency | request_duration_seconds (histogram) | p99 > 200ms for 5min | Something is slow: DB? Cache miss? Downstream service? |
| Traffic | http_requests_total (counter) | +50% vs. 1-week average | Organic growth, or viral partner, or attack? |
| Errors | http_errors_total (counter) | Error rate > 0.5% for 5min | Bug deployed, DB down, or upstream failure? |
| Saturation | CPU, memory, DB connections, queue depth | Any resource > 80% for 10min | Running out of capacity. Scale or shed load. |
Step 1: Check the golden signals dashboard: latency, traffic, errors, saturation. Which signal is abnormal?
Step 2: If latency is high, pull a trace for a slow request. Find the slow span.
Step 3: If errors are high, check the error rate by endpoint and by error code. A spike in 503s means backend saturation. A spike in 400s means a client-side change (new SDK version with a bug?).
Step 4: Check saturation metrics: CPU, memory, DB connections, Redis connections. If DB connections are at max, the problem is connection pool exhaustion, not your application code.
```python
# Real debugging session: p99 spiked from 200ms to 800ms

# Step 1: Which endpoint?
# → Only GET /v2/pipelines/{id} is slow. Others are fine.
#   Conclusion: problem is in this handler, not a shared layer.

# Step 2: Which span is slow?
#   Pull 10 slow traces. All show db.query span = 600ms.
#   Conclusion: database query is the bottleneck.

# Step 3: Which query?
#   Check pg_stat_statements for the slowest queries:
#   SELECT * FROM pipelines WHERE id = $1
#   mean_time went from 2ms to 600ms yesterday.

# Step 4: What changed yesterday?
#   Migration log: added index on (customer_id, status) at 3 PM.
#   Index creation on 50M rows took 8 minutes with CONCURRENTLY,
#   but the planner started using a suboptimal query plan afterward.

# Fix: ANALYZE pipelines; (refresh query planner statistics)
# p99 drops back to 200ms within 2 minutes.
```
Anomaly detection using ML models that learn normal patterns and alert on deviations — no manual threshold tuning. Natural language querying: ask "why is the /v2/jobs endpoint slow today?" and the system correlates traces, metrics, and logs to generate an answer. Tools like Datadog AI Assistants and Honeycomb's Query Assistant are making this real.
Your API starts on one server. Then customers arrive. Then a viral integration sends 50x your normal traffic in 10 minutes. Scaling is not about handling today's load — it's about designing systems that can handle 10x without re-architecture and 100x with a planned migration. The system that scales well is the one where adding capacity is boring.
Vertical scaling: Bigger machine (more CPU, RAM, faster disk). Simple but limited — there's a biggest machine you can buy. At Parallel, we vertically scale the primary database (it's the one piece that's hard to horizontally scale).
Horizontal scaling: More machines. Add API servers behind a load balancer. Works for stateless services. The challenge: state must be externalized (to a database, Redis, or object store) so any server can handle any request.
When one database can't handle the load, you split it across multiple databases. Each shard holds a subset of the data. The key decision: what do you shard by?
```python
import hashlib

# Sharding by customer_id: all data for one customer lives on one shard.
# Pro: queries within one customer never cross shards.
# Con: one big customer can hotspot a shard.
def get_shard(customer_id: str, num_shards: int) -> int:
    # Hash-mod placement: customer_id → hash → shard number.
    # Note: this is not consistent hashing. Changing num_shards remaps most
    # customers, which is why the routing table below exists.
    return int(hashlib.md5(customer_id.encode()).hexdigest(), 16) % num_shards

# Sharding by time: data for 2024-Q1 on shard A, Q2 on shard B.
# Pro: old data can be archived/compressed. Queries on recent data are fast.
# Con: cross-time-range queries need scatter-gather across shards.

# Parallel's approach: shard by customer_id with a routing layer.
# The routing table lives in Redis (fast lookup).
# When adding a shard, we migrate customers one by one (dual-write pattern).
```
Not everything needs to happen in the request path. Creating a GPU pipeline takes 30 seconds, but the API should respond in 200ms. The solution: accept the request, put it on a job queue, return a job ID, and process asynchronously.
```python
from uuid import uuid4

# Synchronous (BAD for long operations):
# POST /v2/pipelines → blocks for 30s → returns pipeline
# Client timeout, load balancer timeout, terrible UX.

# Asynchronous (GOOD):
# POST /v2/pipelines → returns 202 Accepted + job_id (200ms)
# GET /v2/jobs/{job_id} → returns status: "running" / "completed"
async def create_pipeline(req: Request) -> Response:
    # Validate, then enqueue
    job_id = str(uuid4())
    await queue.publish("pipeline.create", {
        "job_id": job_id,
        "customer_id": req.customer_id,
        "config": req.body,
    })
    return Response(
        status=202,
        body={"job_id": job_id, "status": "queued"},
        headers={"Location": f"/v2/jobs/{job_id}"}
    )

# Worker process (separate from API server):
async def process_pipeline_job(msg):
    await db.update_job(msg["job_id"], status="running")
    pipeline = await gpu_scheduler.create(msg["config"])
    await db.update_job(msg["job_id"], status="completed", result=pipeline)
```
When a downstream service (database, payment API, GPU scheduler) fails, your API shouldn't keep hammering it. That makes recovery slower and wastes resources. A circuit breaker detects failures and stops sending requests until the service recovers.
```python
# Circuit breaker state machine:
#   CLOSED → requests flow normally, failures counted
#   OPEN   → requests fail-fast (503), no downstream call
#   HALF   → one test request allowed, if it succeeds → CLOSED
#
# Transitions:
#   CLOSED → OPEN:   when failure count exceeds threshold (e.g., 5 in 60s)
#   OPEN   → HALF:   after reset_timeout (e.g., 30s)
#   HALF   → CLOSED: if test request succeeds
#   HALF   → OPEN:   if test request fails

# In a handler:
db_breaker = CircuitBreaker(threshold=5, reset_time=30)

async def get_pipeline(pipeline_id):
    # Try cache first (cache doesn't use circuit breaker)
    cached = await redis.get(f"pipeline:{pipeline_id}")
    if cached:
        return cached

    # DB call is protected by circuit breaker
    try:
        result = await db_breaker.call(db.fetch_pipeline, pipeline_id)
        return result
    except CircuitOpenError:
        # Circuit is open: return degraded response
        return Response(status=503, body={
            "error": {"type": "service_degraded",
                      "message": "Database temporarily unavailable. Cached data may be stale.",
                      "retry_after": 30}
        })
```
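The handler above uses a `CircuitBreaker` class without defining it. One possible minimal implementation is sketched here so the state machine is concrete. It is an assumption, not Parallel's code, and it simplifies in two ways: it counts consecutive failures instead of failures within a 60-second window, and it does not limit HALF to a single in-flight probe.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is OPEN and the call is skipped."""

class CircuitBreaker:
    def __init__(self, threshold: int = 5, reset_time: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold    # Consecutive failures before opening
        self.reset_time = reset_time  # Seconds to stay OPEN before probing
        self.clock = clock            # Injectable for tests
        self.failures = 0
        self.opened_at = None         # None means CLOSED

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_time:
            return "HALF"
        return "OPEN"

    async def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = await fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe in HALF, or too many failures in CLOSED, (re)opens
            if self.state == "HALF" or self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

Injecting the clock keeps the breaker testable without sleeping: a test can advance a fake clock past `reset_time` and assert that the state moves from OPEN to HALF.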
Backpressure is the mechanism by which an overloaded system signals upstream to slow down. Without it, requests pile up in memory until the server OOMs. With it, excess requests are rejected gracefully (429 or 503) before they consume resources.
```python
# Backpressure via request queue with bounded capacity:
import asyncio

request_queue = asyncio.Queue(maxsize=1000)  # Bounded!

async def handle_request(request):
    try:
        request_queue.put_nowait(request)  # Non-blocking
    except asyncio.QueueFull:
        # Queue is full: reject with backpressure signal
        return Response(status=503, headers={"Retry-After": "5"})

# Workers process from the queue at a sustainable rate:
async def worker():
    while True:
        request = await request_queue.get()
        await process(request)
```
Scenario: A partner's integration goes viral. Traffic jumps 10x in 10 minutes. The API returns 503s for 8 minutes before auto-scaling kicks in.
Root cause: Auto-scaling was configured to trigger at 80% CPU, with a 5-minute cooldown and 3-minute instance boot time. Total reaction time: 8 minutes. During those 8 minutes, existing servers are saturated.
Fix: Predictive scaling (scale based on traffic trend, not just current CPU). Pre-warm spare capacity during business hours. Add request queuing at the load balancer (instead of rejecting requests, queue them for 2 seconds before 503).
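The reaction-time arithmetic and the "scale on trend" idea above can be made concrete with a small sketch. It assumes one load sample per minute and a simple linear extrapolation; the function names and the capacity figure are hypothetical, and a production autoscaler would use a more robust forecast.

```python
def reaction_time_min(cooldown_min: float, boot_min: float) -> float:
    """Worst-case minutes between a scaling trigger and usable capacity."""
    return cooldown_min + boot_min


def projected_load(samples: list[float], minutes_ahead: float) -> float:
    """Extrapolate linearly from the last two one-minute load samples."""
    slope_per_min = samples[-1] - samples[-2]
    return samples[-1] + slope_per_min * minutes_ahead


def should_scale(samples: list[float], capacity: float, lead_time_min: float) -> bool:
    """Scale now if the trend says capacity is exceeded before new servers arrive."""
    return projected_load(samples, lead_time_min) > capacity


lead = reaction_time_min(cooldown_min=5, boot_min=3)  # the 8 minutes from the scenario
# Current load (150) is far below capacity (400), but the trend is not.
print(should_scale([100, 150], capacity=400, lead_time_min=lead))
```

A CPU-threshold trigger would still be idle at this point; the trend-based check fires because it looks one full reaction time ahead.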
A job fails after 3 retries. You don't want to drop it (otherwise the customer's pipeline creation request is gone forever). You also don't want to keep retrying forever (the same error will keep happening). Solution: a dead letter queue (DLQ).
```python
# Job processing with DLQ
async def process_job(msg):
    try:
        await create_pipeline(msg)
        await queue.ack(msg)  # Success: remove from queue
    except RetryableError as e:
        if msg.retry_count < 3:
            await queue.nack(msg, delay=2 ** msg.retry_count)  # Retry with backoff
        else:
            # Move to DLQ for manual investigation
            await dlq.publish(msg, metadata={
                "error": str(e),
                "retries": msg.retry_count,
                "original_timestamp": msg.timestamp,
            })
            await queue.ack(msg)
            await alert_oncall(f"Job {msg.job_id} sent to DLQ after 3 retries")
    except FatalError as e:
        # Non-retryable: bad input, business logic violation
        await db.update_job(msg.job_id, status="failed", error=str(e))
        await queue.ack(msg)  # Don't retry, don't DLQ

# DLQ dashboard shows:
# - Failed job details (what was the request?)
# - Error message and stack trace
# - Retry count and timestamps
# - "Reprocess" button to retry manually after fixing the bug
```
Most message queues guarantee at-least-once delivery, not exactly-once. If a worker crashes after processing but before acknowledging, the message is redelivered. Your worker must be idempotent: processing the same message twice produces the same result as processing it once.
```python
# Non-idempotent (BAD): creates duplicate pipelines
async def process_job(msg):
    await db.insert_pipeline(msg.config)  # Second delivery = duplicate!

# Idempotent (GOOD): uses job_id as dedup key
async def process_job(msg):
    existing = await db.get_pipeline_by_job_id(msg.job_id)
    if existing:
        return  # Already processed: idempotent skip
    await db.insert_pipeline(msg.config, job_id=msg.job_id)
```
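The check-then-insert version still leaves a small window: two workers that receive the same message at the same moment can both see no existing row. One common way to close it is to let a unique constraint on the job ID pick the winner. A minimal sketch, using SQLite from the standard library purely for illustration; the table layout and function names are hypothetical, not Parallel's schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table where job_id is unique by construction."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pipelines (job_id TEXT PRIMARY KEY, config TEXT NOT NULL)"
    )
    return conn


def process_job(conn: sqlite3.Connection, job_id: str, config: str) -> bool:
    """Insert the pipeline at most once; return True if this delivery did the work."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO pipelines (job_id, config) VALUES (?, ?)",
        (job_id, config),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 means a previous delivery already inserted it


conn = make_db()
first = process_job(conn, "job_1", "{}")   # first delivery does the insert
second = process_job(conn, "job_1", "{}")  # redelivery is a harmless no-op
```

In PostgreSQL the same idea is usually written as `INSERT ... ON CONFLICT (job_id) DO NOTHING`.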
Serverless functions (AWS Lambda, Cloudflare Workers) scale to zero and scale out automatically, up to the provider's concurrency limits, without you managing servers. The tradeoff: cold starts (50-500ms) and limited execution time. The frontier: V8 isolate-based runtimes (Cloudflare Workers, Deno Deploy) with near-zero cold starts (<5ms) running at the edge. Your API handler executes in the datacenter closest to the user.
The best API in the world is useless if developers can't figure out how to use it. Developer experience (DX) is the sum of every interaction a developer has with your API: reading the docs, getting an API key, making the first request, debugging an error, upgrading to a new version. At Parallel, the DX team's north star metric is time-to-first-successful-request. If a new developer can't make a working API call in under 5 minutes, something is broken.
A good SDK wraps your REST API in language-native idioms so developers never think about HTTP. The SDK handles auth, retries, pagination, error parsing, and type safety. Bad SDKs are thin wrappers around HTTP calls. Good SDKs feel like a native library.
```python
# BAD SDK: developer must know HTTP, JSON, pagination, error codes
import time
import requests

response = requests.get(
    "https://api.parallel.dev/v2/pipelines",
    headers={"Authorization": f"Bearer {key}"},
    params={"limit": 20, "cursor": cursor},
)
if response.status_code == 429:
    time.sleep(int(response.headers["Retry-After"]))
    # retry...
data = response.json()

# GOOD SDK: language-native, handles everything
client = Parallel(api_key="pk_live_...")

# Auto-paginates, auto-retries on 429, returns typed objects
for pipeline in client.pipelines.list():
    print(pipeline.name, pipeline.status)  # IDE autocomplete works

# Errors are typed exceptions, not HTTP status codes
try:
    client.pipelines.create(name="test", gpu_count=16)
except parallel.ValidationError as e:
    print(e.param)    # "gpu_count"
    print(e.message)  # "Must be between 1 and 8"
```
Developers need a safe place to experiment without affecting production data or incurring costs. Parallel provides a full sandbox environment:
| Feature | Production | Sandbox |
|---|---|---|
| Base URL | api.parallel.dev | sandbox.parallel.dev |
| API keys | pk_live_* | pk_test_* |
| Rate limits | Per plan | 100 req/min (generous for testing) |
| GPU allocation | Real GPUs, real cost | Simulated (responds as if real, no actual GPU) |
| Data persistence | Permanent | Wiped weekly |
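Because the key prefix already says which environment a key belongs to, an SDK could pick the base URL from it so developers never point a test key at production. The sketch below assumes that behavior (the source does not say the SDK works this way) and assumes both hosts are served over HTTPS; `base_url_for` is a hypothetical helper.

```python
# Prefix-to-host mapping taken from the table above; the https scheme is an assumption.
BASE_URLS = {
    "pk_live_": "https://api.parallel.dev",
    "pk_test_": "https://sandbox.parallel.dev",
}


def base_url_for(api_key: str) -> str:
    """Choose the environment from the API key prefix."""
    for prefix, url in BASE_URLS.items():
        if api_key.startswith(prefix):
            return url
    raise ValueError("Unrecognized API key prefix; expected pk_live_ or pk_test_")


print(base_url_for("pk_test_abc123"))
```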
```python
# A well-designed SDK has 4 layers:
from __future__ import annotations  # Layer 2 annotates with Pipeline before Layer 3 defines it

import asyncio
from dataclasses import dataclass
from datetime import datetime
from typing import AsyncIterator

import httpx

# Layer 1: Transport (handles HTTP, retries, auth)
class Transport:
    def __init__(self, api_key, base_url, max_retries=3):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        self.max_retries = max_retries

    async def request(self, method, path, **kwargs):
        for attempt in range(self.max_retries):
            resp = await self.client.request(method, path, **kwargs)
            last_attempt = attempt == self.max_retries - 1
            if resp.status_code == 429 and not last_attempt:
                # Auto-retry after the server's delay (assumes Retry-After is in seconds)
                delay = int(resp.headers.get("Retry-After", 1))
                await asyncio.sleep(delay)
                continue
            if resp.status_code >= 500 and not last_attempt:
                await asyncio.sleep(2 ** attempt)
                continue
            return resp  # Success, client error, or retries exhausted

# Layer 2: Resource (typed API for each resource)
class PipelinesResource:
    def __init__(self, transport):
        self._t = transport

    async def create(self, *, name: str, gpu_count: int) -> Pipeline:
        resp = await self._t.request("POST", "/v2/pipelines",
                                     json={"name": name, "gpu_count": gpu_count})
        return Pipeline(**resp.json()["data"])

    def list(self, **filters) -> AsyncIterator[Pipeline]:
        # Auto-pagination: yields all pages transparently
        # (AutoPaginator is defined elsewhere in the SDK, not shown here)
        return AutoPaginator(self._t, "/v2/pipelines", Pipeline, **filters)

# Layer 3: Models (typed dataclasses)
@dataclass
class Pipeline:
    id: str
    name: str
    status: str
    gpu_count: int
    created_at: datetime

# Layer 4: Client (the public API)
class Parallel:
    def __init__(self, api_key: str):
        self._transport = Transport(api_key, "https://api.parallel.dev")
        self.pipelines = PipelinesResource(self._transport)
        self.jobs = JobsResource(self._transport)  # JobsResource not shown
```
````markdown
# GOOD changelog entry: actionable, with code diff

## v2.4.0 (2025-05-15)

### Breaking: `pipeline.status` field renamed to `pipeline.state`

**Why:** Aligning with industry standard terminology.

**Impact:** Once `status` is removed, all integrations that read `pipeline.status` will get `undefined`.

**Migration:**
```python
# Before
pipeline.status  # "active"
# After
pipeline.state   # "active"
```

**Timeline:** `status` is deprecated now, removed in v2.5.0 (August 2025). Both fields returned during transition period.
````
The most insidious DX bug: documentation says one thing, the API does another. This happens when docs are manually maintained separately from the code. Fix: generate docs from the OpenAPI spec (which is generated from the code), so docs are always in sync. Test the examples in CI — if a code sample in the docs fails, the build fails.
```python
# CI pipeline for documentation accuracy:

# Step 1: Generate OpenAPI spec from code annotations
# (FastAPI does this automatically)
# spec = app.openapi()

# Step 2: Validate spec against published docs
# openapi-diff old-spec.yaml new-spec.yaml --breaking
# If breaking changes detected, fail CI unless changelog entry exists.

# Step 3: Run documentation code samples as integration tests
# Extract code blocks from docs/quickstart.md
# Execute against sandbox API
# Assert expected status codes and response shapes
def test_quickstart_example():
    # This code block appears in our quickstart docs
    client = Parallel(api_key="pk_test_ci_key")
    pipeline = client.pipelines.create(name="test", gpu_count=1)
    assert pipeline.id.startswith("pipe_")
    assert pipeline.status == "queued"

# If the API changes and this breaks, the docs are stale.
# CI catches it BEFORE the developer does.
```
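Step 3 depends on pulling code samples out of the docs. A minimal sketch of that extraction step, assuming the docs use standard triple-backtick fences tagged `python`; the helper names are hypothetical, and a real pipeline would run each sample in a sandboxed subprocess against the sandbox API rather than with `exec`.

```python
import re

TICKS = "`" * 3  # built at runtime so this sample can itself live inside a fenced block
BLOCK_RE = re.compile(
    rf"^{TICKS}python[ \t]*\n(.*?)^{TICKS}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_python_blocks(markdown: str) -> list[str]:
    """Return the body of every python-tagged fenced block, in document order."""
    return [match.group(1) for match in BLOCK_RE.finditer(markdown)]


def run_block(source: str) -> dict:
    """Execute one sample in a fresh namespace and return that namespace."""
    namespace: dict = {}
    exec(compile(source, "<doc-sample>", "exec"), namespace)
    return namespace


doc = f"Intro\n{TICKS}python\nx = 1 + 1\n{TICKS}\nMore text\n{TICKS}bash\nls\n{TICKS}\n"
blocks = extract_python_blocks(doc)  # only the python block is returned
```

A CI job can then loop over the extracted blocks and fail the build when any of them raises.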
When you release API v3, you don't want to force all SDK users to upgrade immediately. The SDK should support multiple API versions and default to the latest stable one.
```python
# SDK with version pinning
client = Parallel(
    api_key="pk_live_...",
    api_version="2025-01-15",  # Pin to a specific version
)

# The SDK sends: Parallel-Version: 2025-01-15
# Server returns the response shape matching that version.
# Even when v3 ships, this client gets v2 responses.

# Version lifecycle in the SDK:
# - SDK v1.x: supports API 2024-01-01 through 2024-12-01
# - SDK v2.x: supports API 2024-06-01 through 2025-06-01
# - SDK v3.x: supports API 2025-01-01 and later
# Deprecation warnings printed when using old versions.
```
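The lifecycle rules in the comments above could be enforced inside the SDK roughly as follows. This is a sketch under assumptions: the support window mirrors the "SDK v2.x" line, `DEFAULT_VERSION` stands in for "latest stable", and `version_headers` is a hypothetical helper rather than part of a real Parallel SDK.

```python
import warnings
from datetime import date
from typing import Optional

# Assumed support window for one SDK major version (the "SDK v2.x" range above).
OLDEST_SUPPORTED = date(2024, 6, 1)
NEWEST_SUPPORTED = date(2025, 6, 1)
DEFAULT_VERSION = "2025-01-15"  # assumed "latest stable" for this sketch


def version_headers(api_version: Optional[str] = None) -> dict[str, str]:
    """Build the version header, rejecting versions this SDK cannot speak."""
    chosen = api_version or DEFAULT_VERSION
    parsed = date.fromisoformat(chosen)  # raises ValueError on malformed input
    if not OLDEST_SUPPORTED <= parsed <= NEWEST_SUPPORTED:
        raise ValueError(f"API version {chosen} is outside this SDK's supported range")
    if parsed < date.fromisoformat(DEFAULT_VERSION):
        warnings.warn(
            f"API version {chosen} is older than {DEFAULT_VERSION}; plan an upgrade",
            DeprecationWarning,
        )
    return {"Parallel-Version": chosen}


headers = version_headers("2025-01-15")
```

Failing fast on an unsupported version keeps the error local to the SDK instead of surfacing later as a confusing response shape.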
The cutting edge: AI documentation assistants trained on your API spec, docs, and support tickets. Developers ask "how do I create a pipeline with streaming output?" and get a working code example specific to their SDK version and authentication setup. Stripe, Vercel, and Cloudflare already ship these. The assistant reduces time-to-first-request by 60%.
Performance is not about making everything faster. It's about making the right things faster. A 10ms improvement on an endpoint called once a day is irrelevant. A 10ms improvement on an endpoint called 100,000 times per second saves 1,000 CPU-seconds per second. The first step is always: profile, don't guess.
The most common performance bug in API development. You fetch a list of 50 pipelines, then for each pipeline, you fetch its jobs. That's 1 query + 50 queries = 51 database round trips. Each round trip takes 2ms of network latency, so you've burned 100ms on network alone.
```python
# N+1 PROBLEM: 51 queries for 50 pipelines
pipelines = await db.query("SELECT * FROM pipelines WHERE customer_id = $1 LIMIT 50", cid)
for p in pipelines:
    p.jobs = await db.query("SELECT * FROM jobs WHERE pipeline_id = $1", p.id)
    # ^ This runs 50 times. 50 round trips. 100ms wasted.

# FIX 1: Batch query (2 queries total)
pipelines = await db.query("SELECT * FROM pipelines WHERE customer_id = $1 LIMIT 50", cid)
pipeline_ids = [p.id for p in pipelines]
all_jobs = await db.query("SELECT * FROM jobs WHERE pipeline_id = ANY($1)", pipeline_ids)
# Group jobs by pipeline_id in Python. 2 queries. 4ms total.

# FIX 2: JOIN (1 query)
# The LIMIT lives in the subquery: applied after the JOIN it would cap
# joined rows (one per pipeline-job pair), not pipelines.
rows = await db.query("""
    SELECT p.*, j.id AS job_id, j.status AS job_status
    FROM (
        SELECT * FROM pipelines
        WHERE customer_id = $1
        ORDER BY created_at DESC
        LIMIT 50
    ) p
    LEFT JOIN jobs j ON j.pipeline_id = p.id
    ORDER BY p.created_at DESC
""", cid)
# 1 query. 3ms. But more complex response parsing.
```
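The "group jobs by pipeline_id in Python" step of the batch fix is short but easy to get wrong by looping over jobs once per pipeline. A minimal sketch, assuming rows arrive as plain dicts with `id` and `pipeline_id` keys (the row shape and the `attach_jobs` name are illustrative):

```python
from collections import defaultdict


def attach_jobs(pipelines: list[dict], jobs: list[dict]) -> list[dict]:
    """Attach each pipeline's jobs in one pass over each list."""
    jobs_by_pipeline: dict[str, list[dict]] = defaultdict(list)
    for job in jobs:
        jobs_by_pipeline[job["pipeline_id"]].append(job)
    # Pipelines without jobs get an empty list rather than a missing key.
    return [{**p, "jobs": jobs_by_pipeline.get(p["id"], [])} for p in pipelines]


pipelines = [{"id": "pipe_1"}, {"id": "pipe_2"}]
jobs = [
    {"id": "job_a", "pipeline_id": "pipe_1"},
    {"id": "job_b", "pipeline_id": "pipe_1"},
]
result = attach_jobs(pipelines, jobs)
```

Building the index first keeps the grouping linear in the number of rows instead of pipelines times jobs.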
Opening a new connection (DNS lookup, TCP handshake, and TLS handshake) takes 50-100ms. If your API calls downstream services, reuse connections with HTTP keep-alive and connection pooling.
```python
import httpx

# BAD: new connection per request (50ms overhead each time)
async def call_gpu_service(payload):
    async with httpx.AsyncClient() as client:  # New connection!
        return await client.post("https://gpu.internal/schedule", json=payload)

# GOOD: reuse connection pool (connection setup cost is paid once, not per request)
gpu_client = httpx.AsyncClient(
    base_url="https://gpu.internal",
    limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
    timeout=httpx.Timeout(10.0, connect=5.0),
)

async def call_gpu_service(payload):
    return await gpu_client.post("/schedule", json=payload)  # Reuses connection
```
A response with 50 pipelines, each with 20 fields, can easily be 100KB of JSON. If the client only needs id, name, and status, you're sending 95KB of wasted data. Solutions:
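Field selection: GET /v2/pipelines?fields=id,name,status — the server only serializes requested fields. JSON:API calls this "sparse fieldsets"; Google's APIs call it "partial response." (Stripe's `expand` parameter is the opposite mechanism: it inlines related objects rather than trimming fields.)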
Compression: gzip/brotli reduces JSON payloads by 70-90%. A 100KB response becomes 10KB over the wire. Always enable if the client sends Accept-Encoding: gzip.
Streaming: For large responses, use NDJSON (newline-delimited JSON) so the client can process records as they arrive instead of waiting for the entire response.
```python
# NDJSON streaming for large exports
# Instead of: {"data": [50,000 pipeline objects]} (10MB, 3s to build)
# Stream: one JSON object per line, client processes as they arrive
import orjson
from fastapi.responses import StreamingResponse

async def export_pipelines(customer_id: str):
    async def generate():
        # Stream rows from DB cursor (don't load all into memory)
        async for row in db.cursor(
            "SELECT * FROM pipelines WHERE customer_id = $1", customer_id
        ):
            yield orjson.dumps(row).decode() + "\n"

    # No manual Transfer-Encoding header: the ASGI server applies chunked
    # encoding itself when the response has no Content-Length.
    return StreamingResponse(generate(), media_type="application/x-ndjson")

# Client processes line by line:
# async for line in response.aiter_lines():
#     pipeline = json.loads(line)
#     process(pipeline)

# Memory usage: O(1) instead of O(n). First byte in 50ms, not 3s.
```
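The field selection and compression options described above can be sketched in a few lines. The allow-list, the helper names, and the choice to reject unknown fields with an error are assumptions for illustration; in practice compression is usually handled by middleware or the load balancer rather than by hand.

```python
import gzip
import json
from typing import Optional

# Hypothetical allow-list: only these fields may be requested via ?fields=
ALLOWED_FIELDS = {"id", "name", "status", "gpu_count", "created_at"}


def select_fields(record: dict, fields_param: Optional[str]) -> dict:
    """Keep only the fields named in a comma-separated ?fields= value."""
    if not fields_param:
        return record
    requested = [f.strip() for f in fields_param.split(",") if f.strip()]
    unknown = set(requested) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"Unknown fields: {sorted(unknown)}")
    return {name: record[name] for name in requested if name in record}


def encode_body(payload, accept_encoding: str = "") -> tuple[bytes, dict]:
    """Serialize to JSON and gzip it only if the client advertised support."""
    raw = json.dumps(payload).encode()
    if "gzip" in accept_encoding.lower():
        return gzip.compress(raw), {"Content-Encoding": "gzip"}
    return raw, {}


record = {"id": "pipe_1", "name": "demo", "status": "active", "gpu_count": 4}
slim = select_fields(record, "id,status")
body, headers = encode_body([slim] * 50, accept_encoding="gzip, br")
```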
```python
# Pattern 1: SELECT only the columns you need
# BAD:  SELECT * fetches all 20 columns (including 50KB JSONB config)
# GOOD: SELECT id, name, status FROM pipelines WHERE ...
# Reduces: network transfer, memory, serialization time.

# Pattern 2: Use EXPLAIN ANALYZE before deploying new queries
# The query planner sometimes makes bad choices. Verify it uses indexes.

# Pattern 3: Avoid COUNT(*) for large tables
# BAD:  SELECT COUNT(*) FROM pipelines WHERE customer_id = $1
#   → visits every matching index entry, even with an index. 100ms at 10M rows.
# GOOD: Use an approximate count or pre-computed counter:
#   → Redis counter incremented on insert/delete. O(1).

# Pattern 4: Use EXISTS instead of COUNT for existence checks
# BAD:  SELECT COUNT(*) FROM pipelines WHERE customer_id = $1            (counts ALL matches)
# GOOD: SELECT EXISTS(SELECT 1 FROM pipelines WHERE customer_id = $1)   (stops at first match)

# Pattern 5: Batch operations to reduce round trips
# BAD:  for id in ids: await db.get(id)                                        # N round trips
# GOOD: await db.query("SELECT * FROM pipelines WHERE id = ANY($1)", ids)      # 1 round trip
```
```python
# Continuous profiling: sample 1% of requests
# Use py-spy (Python) or pprof (Go) to capture CPU flamegraphs

# Targeted profiling for slow endpoints:
import cProfile
import pstats

async def profile_handler(request):
    profiler = cProfile.Profile()
    profiler.enable()
    response = await actual_handler(request)
    profiler.disable()
    # Note: under asyncio the profile also includes other tasks that ran
    # on the event loop while this handler was awaiting.

    # Save profile for analysis
    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative")
    stats.print_stats(20)  # Top 20 functions by cumulative time
    return response

# Common findings from profiling API handlers:
# 1. JSON serialization: 30% of CPU on hot paths → use orjson (3x faster)
# 2. ORM overhead: model instantiation for 1000 rows → use raw SQL for list endpoints
# 3. Regex compilation: re.compile() inside a loop → compile once, reuse
```
Intermittent slowness is the hardest to debug because by the time you look, it's gone.
Approach: Enable continuous profiling (1% sampling) so that when a slow request occurs, the profiler has already captured what was happening. Then correlate slow requests with system metrics: was there a GC pause? A TCP retransmit? Lock contention in the database?
For long-running operations (pipeline creation, large data exports), instead of making clients poll, stream updates to them. Server-Sent Events (SSE) is simpler than WebSockets and works with existing HTTP infrastructure (load balancers, CDNs).
```python
# SSE endpoint for pipeline creation status
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

async def stream_pipeline_status(job_id: str):
    async def event_generator():
        while True:
            status = await get_job_status(job_id)
            # SSE format: "data: {json}\n\n"
            yield f"data: {json.dumps(status)}\n\n"
            if status["state"] in ("completed", "failed"):
                break
            await asyncio.sleep(1)

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache"},
    )

# Client-side (JavaScript):
# const source = new EventSource('/v2/jobs/job_123/stream');
# source.onmessage = (e) => console.log(JSON.parse(e.data));
# Automatically reconnects on network failure!
# (It also reconnects when the server ends the stream, so call
#  source.close() once a terminal state arrives.)
```
On hot API paths, JSON serialization can consume 20-40% of CPU time. Python's built-in json module is slow. Switching to orjson gives a 3-10x speedup with minimal code changes; the main difference is that orjson.dumps returns bytes rather than str.
```python
# Benchmark: serializing 1000 pipeline objects
import json
import orjson

data = [{"id": f"pipe_{i}", "name": f"pipeline-{i}", "status": "active",
         "gpu_count": 4, "created_at": "2025-05-22T10:00:00Z"}
        for i in range(1000)]

# stdlib json: ~12ms
json.dumps(data)

# orjson: ~1.5ms (8x faster)
orjson.dumps(data)  # Returns bytes, not str

# orjson also handles datetime and UUID natively
# (numpy arrays need option=orjson.OPT_SERIALIZE_NUMPY).

# One-line swap in FastAPI:
# from fastapi.responses import ORJSONResponse
# app = FastAPI(default_response_class=ORJSONResponse)
```
HTTP/3 replaces TCP with QUIC (UDP-based). Benefits: faster connection setup (one round trip for a new connection, 0-RTT when resuming a previous one), no transport-level head-of-line blocking (one lost packet doesn't stall all streams), and connection migration (WiFi to cellular without reconnecting). Cloudflare reports 12% latency improvement for API traffic after enabling HTTP/3.
This is the payoff. Everything you've learned — routing, authentication, rate limiting, caching, database queries, error handling — comes together in a single interactive simulation. You are operating an API gateway that serves millions of requests. Adjust the controls, inject failures, and watch the system respond.
Incoming requests flow through auth → rate limiter → router → handler → cache/DB → response. Adjust load, inject failures, and observe metrics in real time.
Request flow (top to bottom): Each dot is a request traveling through the gateway stages. Green dots are successful, red dots are errors, yellow dots are rate-limited (429).
Metrics panel (right side):
| Scenario | What to do | What to observe |
|---|---|---|
| Normal operation | Load=50, Cache=70%, Error=2% | Smooth flow, low latency, green metrics |
| Cache failure | Drop cache to 0% | p99 latency spikes as all requests hit DB |
| Database failure | Click "DB Failure" | Cached requests still work, uncached requests error. Graceful degradation. |
| Auth service down | Click "Auth Down" | ALL requests fail at the first stage. Total outage. |
| DDoS attack | Click "DDoS (100x)" | Rate limiter activates, most requests return 429, legitimate traffic still served |
| Thundering herd | Set cache=0%, load=500 | DB overwhelmed, errors spike, p99 goes to timeout |
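The cache scenarios in the table follow from simple arithmetic: the hit ratio decides both average latency and how much traffic reaches the database. A back-of-the-envelope sketch; the 1ms cache and 5ms database latencies are assumed round numbers, and treating "load" as requests per second is also an assumption.

```python
def expected_latency_ms(hit_ratio: float, cache_ms: float = 1.0, db_ms: float = 5.0) -> float:
    """Average read latency for a given cache hit ratio (0.0 to 1.0)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * db_ms


def db_load(total_rps: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the database."""
    return total_rps * (1.0 - hit_ratio)


# "Normal operation" vs "Thundering herd" from the table above.
normal = (expected_latency_ms(0.7), db_load(50, 0.7))
herd = (expected_latency_ms(0.0), db_load(500, 0.0))
```

Going from the first scenario to the second multiplies database traffic by more than 30x, which is why errors and p99 both blow up. Averages hide that tail, so treat this as intuition rather than a capacity model.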
Every box in the simulation maps to a real component. Here's the production architecture:
```yaml
# Production API Gateway Architecture
ingress:
  - CloudFront CDN (TLS termination, static caching, DDoS absorption)
  - Route 53 (latency-based DNS, health checks for failover)

load_balancer:
  - ALB (Application Load Balancer)
  - Health checks: GET /healthz every 5s, 3 failures = remove
  - Connection draining: 30s on deploy (finish in-flight requests)

api_servers:
  - 20 instances (auto-scale 10-50 based on CPU + request count)
  - Each runs: FastAPI + uvicorn + 4 workers
  - Stateless: all state in Redis or Postgres

auth_layer:
  - API key validation: SHA-256 hash lookup in Redis (0.5ms)
  - JWT validation: RSA signature check (0.1ms, no network)
  - Rate limit: Redis INCR per customer (1ms)

data_layer:
  - PostgreSQL 16 (primary + 3 read replicas)
  - PgBouncer: 1000 client connections → 100 server connections
  - Redis cluster: 6 nodes, 3 masters + 3 replicas
  - Connection pool: asyncpg, min=5, max=20 per API server

async_processing:
  - SQS queues for long-running jobs
  - Worker fleet: 10 instances processing pipeline creation
  - Dead letter queue after 3 retries

observability:
  - Datadog: metrics, traces, logs
  - PagerDuty: alerting on SLO burn rate
  - Grafana dashboards for real-time monitoring
```
In an interview, you have 5 minutes to draw this. Here's the simplified version:
This chapter distills everything into a cheat sheet you can review in the 30 minutes before your interview. Every section maps to a common interview question type.
| Question | Key points to cover | Chapter |
|---|---|---|
| "Design an API rate limiter" | Token bucket algorithm, distributed counter in Redis, per-customer quotas, 429 with Retry-After, graceful degradation | 5 |
| "Design a URL shortener API" | Hashing, collision handling, cursor pagination for analytics, CDN caching for redirects, rate limiting writes | 1, 4 |
| "Design an API gateway" | Request lifecycle, auth, rate limiting, routing, caching, circuit breaker, observability. Draw the Ch 11 diagram. | 2, 11 |
| "Design a real-time notification system" | WebSocket vs. SSE, connection scaling, message queue, fanout, delivery guarantees, offline queue | 8 |
| "Your API needs to handle 1M requests/minute" | Horizontal scaling, caching, connection pooling, async processing, CDN for static responses, sharding | 4, 8 |
```python
import time

# DRILL 1: Implement cursor pagination
def paginate(items: list, cursor: str | None, limit: int = 20):
    if cursor:
        index = next((i for i, item in enumerate(items) if item["id"] == cursor), None)
        if index is None:
            # Fail loudly instead of silently skipping the first item
            raise ValueError(f"Unknown cursor: {cursor}")
        start = index + 1
    else:
        start = 0
    page = items[start:start + limit]
    has_more = start + limit < len(items)  # A full last page must not report more
    next_cursor = page[-1]["id"] if has_more else None
    return {"data": page, "next_cursor": next_cursor, "has_more": has_more}

# DRILL 2: Implement a circuit breaker
class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without running."""

class CircuitBreaker:
    def __init__(self, threshold=5, reset_time=30):
        self.failures = 0
        self.threshold = threshold
        self.reset_time = reset_time
        self.state = "closed"  # closed=normal, open=failing, half-open=testing
        self.last_failure = 0

    async def call(self, fn, *args):
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_time:
                self.state = "half-open"  # Let a trial request through
            else:
                raise CircuitOpenError("Service unavailable")
        try:
            result = await fn(*args)
            self.failures = 0
            self.state = "closed"
            return result
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            # In half-open, failures is still >= threshold, so one failure reopens
            if self.failures >= self.threshold:
                self.state = "open"
            raise
```
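The rate limiter question in the table above names the token bucket algorithm, which is worth being able to write from memory as well. The sketch below is a single-process version with an injectable clock so it can be tested deterministically; a distributed limiter would keep the same state in Redis, which is not shown. The class and parameter names are illustrative.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `refill_per_sec` tokens per second."""

    def __init__(self, capacity: int, refill_per_sec: float, now: Optional[float] = None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)  # Start full so new customers can burst
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for one request."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Time until one whole token is available: this feeds the Retry-After header.
        return False, (1.0 - self.tokens) / self.refill_per_sec


bucket = TokenBucket(capacity=2, refill_per_sec=1.0, now=0.0)
results = [bucket.allow(now=0.0), bucket.allow(now=0.0), bucket.allow(now=0.0)]
```

The third call is rejected with a one-second retry hint, which is the value a 429 response would carry in `Retry-After`.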
| Symptom | Likely cause | Investigation |
|---|---|---|
| p99 latency spiked, p50 normal | Connection pool exhaustion, slow query for subset of requests | Check DB connection count, find slow queries in pg_stat_statements |
| Intermittent 500 errors, no pattern | Race condition, retry storm, or flaky downstream dependency | Correlate errors with specific request patterns, check distributed traces |
| Memory usage grows until OOM | Connection leak, unbounded cache, large response buffering | Heap dump analysis, check connection pool stats, monitor cache size |
| Latency increases linearly over weeks | Table growth without proper indexing, cache key space explosion | Check table sizes, EXPLAIN ANALYZE on hot queries, Redis memory stats |
| Code | Meaning | When to use |
|---|---|---|
| 200 | OK | Successful GET, PUT, PATCH |
| 201 | Created | Successful POST that created a resource |
| 202 | Accepted | Request accepted for async processing (return job ID) |
| 204 | No Content | Successful DELETE |
| 400 | Bad Request | Invalid request body, missing required field |
| 401 | Unauthorized | Missing or invalid auth credentials |
| 403 | Forbidden | Valid auth but insufficient permissions |
| 404 | Not Found | Resource doesn't exist |
| 409 | Conflict | Duplicate resource, version conflict |
| 422 | Unprocessable Content | Valid JSON but semantic validation failed |
| 429 | Too Many Requests | Rate limit exceeded (include Retry-After header) |
| 500 | Internal Server Error | Bug in your code (never expose details) |
| 502 | Bad Gateway | Downstream service returned invalid response |
| 503 | Service Unavailable | Overloaded or maintenance (include Retry-After) |
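One way to keep these codes consistent across handlers is to map typed exceptions to responses in a single place. The following is a sketch, not a prescribed design: the exception classes, the `to_response` helper, and the error body shape (modeled on the error objects used earlier in this guide) are all assumptions.

```python
from typing import Optional


class APIError(Exception):
    """Base class for errors that are safe to show to the caller."""
    status = 500
    error_type = "internal_error"

    def __init__(self, message: str, retry_after: Optional[int] = None):
        super().__init__(message)
        self.retry_after = retry_after


class NotFound(APIError):
    status, error_type = 404, "not_found"


class ValidationFailed(APIError):
    status, error_type = 422, "validation_error"


class RateLimited(APIError):
    status, error_type = 429, "rate_limited"


def to_response(exc: Exception) -> tuple[int, dict, dict]:
    """Translate an exception into (status, headers, body)."""
    if not isinstance(exc, APIError):
        # Unexpected bug: log the details server-side, never expose them.
        return 500, {}, {"error": {"type": "internal_error",
                                   "message": "Internal server error"}}
    headers = {}
    if exc.retry_after is not None:
        headers["Retry-After"] = str(exc.retry_after)
    return exc.status, headers, {"error": {"type": exc.error_type, "message": str(exc)}}


status, headers, body = to_response(RateLimited("Slow down", retry_after=5))
```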
When given a system design question ("Design an API for X"), follow this framework in order. This structure shows the interviewer you think systematically.
```python
import asyncio
import random
from uuid import uuid4

# DRILL 3: Implement a distributed lock with Redis
async def acquire_lock(redis, key: str, ttl: int = 10) -> str | None:
    lock_id = str(uuid4())  # Unique per caller
    acquired = await redis.set(f"lock:{key}", lock_id, nx=True, ex=ttl)
    return lock_id if acquired else None

async def release_lock(redis, key: str, lock_id: str):
    # Lua script: only release if we still own the lock
    # (prevents releasing a lock that expired and was acquired by another caller)
    script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    end
    return 0
    """
    await redis.eval(script, 1, f"lock:{key}", lock_id)

# DRILL 4: Implement webhook retry with exponential backoff
async def send_webhook(url: str, payload: dict, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            resp = await http_client.post(url, json=payload, timeout=10)
            if resp.status_code < 300:
                return True  # Success
            if 400 <= resp.status_code < 500 and resp.status_code not in (408, 429):
                return False  # Client error: don't retry (408/429 are transient)
        except (TimeoutError, ConnectionError):
            pass  # Retry
        if attempt < max_retries - 1:
            # Exponential backoff between attempts: 1s, 2s, 4s, 8s
            delay = (2 ** attempt) + random.uniform(0, 1)  # Jitter!
            await asyncio.sleep(delay)
    return False  # All retries exhausted → send to dead letter queue

# DRILL 5: Implement request deduplication middleware
async def idempotency_middleware(request, call_next):
    key = request.headers.get("Idempotency-Key")
    if not key or request.method in ("GET", "DELETE"):
        return await call_next(request)

    # Check if we've seen this key
    cached = await redis.get(f"idem:{key}")
    if cached:
        return Response.from_cache(cached)  # Replay stored response

    # Lock to prevent concurrent execution of same key
    lock = await acquire_lock(redis, f"idem-lock:{key}", ttl=30)
    if not lock:
        return Response(status=409, body={"error": "Duplicate request in progress"})

    try:
        response = await call_next(request)
        # Store response for 24h so retries get the same result
        await redis.setex(f"idem:{key}", 86400, response.serialize())
        return response
    finally:
        # Release even if the handler raised, so a retry is not blocked for 30s
        await release_lock(redis, f"idem-lock:{key}", lock)
```
| Metric | Value | Why it matters |
|---|---|---|
| L1 cache access | ~1ns | Baseline for "instant" |
| RAM access | ~100ns | In-process cache speed |
| Redis GET (same datacenter) | ~0.5-1ms | Distributed cache speed |
| SSD random read | ~0.1ms | DB index lookup that misses the page cache |
| DB query (indexed, warm) | ~1-5ms | Your p50 target for reads |
| DB query (full scan, cold) | ~100-1000ms | Your "something is wrong" signal |
| Network round trip (same DC) | ~0.5ms | Each microservice call adds this |
| Network round trip (cross-US) | ~30-60ms | Why multi-region matters |
| TLS handshake | ~10-50ms | Why connection reuse matters |
| JSON serialize (1000 objects, stdlib) | ~12ms | Why orjson matters on hot paths |
| PostgreSQL max practical connections | ~500 | Why PgBouncer exists |
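These numbers are most useful when added up against a target. The sketch below budgets a single request against the 200ms goal from the opening scenario; which steps a real request includes is an assumption, and each figure is the upper or typical value from the table above.

```python
BUDGET_MS = 200.0  # End-to-end target from the opening scenario

# Assumed request path; values come from the latency table above.
STEPS_MS = {
    "TLS handshake (new connection)": 50.0,
    "API key lookup in Redis": 1.0,
    "rate limit check in Redis": 1.0,
    "indexed DB read": 5.0,
    "3 microservice calls (same DC round trips)": 3 * 0.5,
    "JSON serialization (stdlib, 1000 objects)": 12.0,
}


def total_ms(steps: dict[str, float]) -> float:
    """Sum the per-step latencies, assuming the steps run sequentially."""
    return sum(steps.values())


def remaining_ms(steps: dict[str, float], budget: float = BUDGET_MS) -> float:
    """Headroom left for business logic after infrastructure overhead."""
    return budget - total_ms(steps)


spent = total_ms(STEPS_MS)
headroom = remaining_ms(STEPS_MS)
```

The handshake is the largest single item, which is the arithmetic behind the table's "why connection reuse matters" note.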