Load Balancing — From Absolute Zero to Mastery

Chapter 0: The Problem

You run a web service. One server handles all traffic. Things are fine at 100 requests per second. Then your app trends on social media. Traffic spikes to 10,000 requests per second. Your single server maxes out at 1,000 requests per second. The remaining 9,000 requests get timeouts. Users see errors. Revenue drops.

The obvious fix: add more servers. You spin up 10 identical copies of your application. Each can handle 1,000 requests per second. Together, they can handle 10,000. But here is the problem that nobody thinks about at first: how does each request know which server to go to?

If all 10,000 users still send their requests to the same IP address, the same single server still gets crushed. The other 9 servers sit idle. Adding capacity without distributing traffic is like opening 10 checkout lanes at a grocery store but putting all the "open" signs on lane 1.

You need something that sits between the clients and your servers — something that accepts every incoming request and decides which backend server should handle it. This is a load balancer. It is one of the most critical pieces of infrastructure in any distributed system, and understanding how it works — at every layer of the network stack — is the difference between a system that scales and one that falls over.

Single Server vs. Load-Balanced

Watch requests arrive. With one server, it gets overwhelmed. With a load balancer distributing to 4 servers, traffic flows smoothly.


The overloaded server drops requests because its internal queue is full. Requests that do get processed take longer because the CPU is saturated — context-switching, thrashing memory, and spending more time on housekeeping than on actual work. Meanwhile, the load-balanced setup processes every request within its capacity budget.

Load balancing is not just "splitting traffic." A good load balancer also monitors server health (removing sick nodes), handles session persistence (keeping a user's requests on the same server), terminates TLS (so backends do not need certificates), and shapes traffic based on content. It is the Swiss Army knife of distributed systems.

The Layers Where Load Balancing Happens

Load balancing can happen at multiple layers of the network stack. Each layer has different visibility into the request and different performance trade-offs:

Layer	What it sees	Speed	Smarts
DNS	Domain name only	Very fast (no proxy)	Minimal (round-robin IPs)
L4 (Transport)	IP + port + TCP flags	Fast (kernel-level)	Moderate (connection-level)
L7 (Application)	Full HTTP: URL, headers, body	Slower (must parse HTTP)	High (content-based routing)

This lesson explores all three layers, the algorithms that power them, and the health-checking machinery that keeps the whole thing reliable. By the end, you will be able to design a load-balancing strategy for any system — from a small web app to a global-scale platform.

You add 5 more servers to handle a traffic spike, but all requests still go to the original server's IP address. What happens?

A. The new servers automatically split the traffic because they share the same application code.
B. The original server is still overwhelmed. The 5 new servers sit idle.
C. The operating system kernel distributes connections across all servers automatically via IP multicast.

Answer: B. Without a load balancer to distribute requests, adding servers does nothing. Capacity without distribution is wasted capacity.

Chapter 1: DNS Load Balancing

The simplest form of load balancing does not require any special hardware or proxy software. It uses a system that already exists everywhere: DNS — the Domain Name System that translates human-readable domain names into IP addresses.

When a browser wants to reach api.example.com, it asks a DNS server "what IP address is this?" Normally, the DNS server returns one IP. But nothing stops it from returning multiple IPs. And nothing stops it from rotating which IP it returns first each time someone asks.

This is DNS round-robin. You configure your DNS records to list multiple A records for the same domain name. Each time a client resolves the domain, the DNS server returns the IPs in a different order. Most clients use the first IP in the list, so traffic naturally spreads across your servers.

// DNS A records for api.example.com
api.example.com. 300 IN A 10.0.1.1
api.example.com. 300 IN A 10.0.1.2
api.example.com. 300 IN A 10.0.1.3

// First query: returns [10.0.1.1, 10.0.1.2, 10.0.1.3]
// Second query: returns [10.0.1.2, 10.0.1.3, 10.0.1.1]
// Third query: returns [10.0.1.3, 10.0.1.1, 10.0.1.2]
// Client uses first IP → traffic distributes evenly
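
The rotation above can be sketched in a few lines of Python. This is a toy model of an authoritative server, not a real DNS implementation; the class name `RoundRobinZone` and its `resolve` method are made up for illustration.

```python
from collections import deque


class RoundRobinZone:
    """Toy authoritative server that rotates its A records on every query."""

    def __init__(self, addresses):
        self._records = deque(addresses)

    def resolve(self):
        # Return the current ordering, then rotate left so the next
        # query sees a different address first.
        answer = list(self._records)
        self._records.rotate(-1)
        return answer


zone = RoundRobinZone(["10.0.1.1", "10.0.1.2", "10.0.1.3"])

# Clients that always take the first IP end up cycling through all three.
first_choices = [zone.resolve()[0] for _ in range(6)]
print(first_choices)
```

Six lookups visit each address twice, which is the even spread the records above promise, as long as every lookup actually reaches this server.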

How DNS Resolution Actually Works

To understand why DNS load balancing has certain limitations, you need to understand the resolution chain. When a client wants to reach your domain:

Client

Browser asks its local resolver: "what is api.example.com?"

↓

Recursive Resolver

ISP or corporate DNS server. Checks its cache first.

↓

Root Nameserver

"I don't know, but .com is handled by these servers."

↓

TLD Nameserver (.com)

"example.com is handled by ns1.example.com."

↓

Authoritative Nameserver

Returns the A records: [10.0.1.1, 10.0.1.2, 10.0.1.3]

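The key detail: the answer gets cached along the way, by the recursive resolver and usually by the operating system and browser as well. The 300 in the DNS record is the TTL (Time To Live) in seconds, telling caches "keep this answer for 5 minutes before asking again." During those 5 minutes, every client behind the same recursive resolver is served from that one cached answer instead of a fresh rotation from your authoritative server. Some resolvers shuffle the cached list, but you cannot count on it. From your side: no rotation, no balancing.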

The Caching Problem

This is DNS load balancing's fatal weakness. Suppose 10,000 users share the same corporate DNS resolver. The resolver caches the DNS response for 300 seconds. For those 5 minutes, all 10,000 users hit the same server. The "round-robin" happens only when the cache expires and a new lookup occurs.

You can lower the TTL to reduce caching effects, but there is a floor. Many resolvers ignore TTLs below 30 seconds. Some (infamously, older Java applications) cache DNS forever. And low TTLs increase DNS query volume, slowing down every first connection.
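
To see the pinning effect concretely, here is a small simulation. It is a sketch under simplifying assumptions: the resolver caches exactly one answer, returns it in a fixed order, and honors the TTL precisely. `CachingResolver` and `make_rotating_upstream` are hypothetical names, and time is passed in explicitly so the example runs instantly.

```python
class CachingResolver:
    """Toy recursive resolver that caches one answer until its TTL expires."""

    def __init__(self, upstream, ttl_seconds):
        self._upstream = upstream        # callable returning a list of IPs
        self._ttl = ttl_seconds
        self._cached = None
        self._expires_at = 0.0

    def resolve(self, now):
        # Only ask upstream when the cache is empty or stale.
        if self._cached is None or now >= self._expires_at:
            self._cached = self._upstream()
            self._expires_at = now + self._ttl
        return self._cached


def make_rotating_upstream(addresses):
    """Stand-in for an authoritative server doing round-robin."""
    state = {"offset": 0}

    def upstream():
        i = state["offset"]
        state["offset"] = (i + 1) % len(addresses)
        return addresses[i:] + addresses[:i]

    return upstream


resolver = CachingResolver(
    make_rotating_upstream(["10.0.1.1", "10.0.1.2", "10.0.1.3"]),
    ttl_seconds=300,
)

# 10,000 lookups spread across the first 299 seconds.
hits = {}
for n in range(10_000):
    ip = resolver.resolve(now=n * 299 / 10_000)[0]
    hits[ip] = hits.get(ip, 0) + 1
print(hits)

# The first lookup after the TTL expires finally sees the next rotation.
after_expiry = resolver.resolve(now=300.0)[0]
print(after_expiry)
```

All 10,000 lookups land on one address. The rotation only advances when the cache entry expires, which is why lowering the TTL helps and why it cannot fix the problem entirely.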

DNS Round-Robin with Caching

Watch how DNS caching causes uneven distribution. Each resolver caches for the TTL period, pinning all its clients to one server.


When DNS Load Balancing Makes Sense

Use case	Why it works
Global traffic distribution	Route users to the nearest data center. DNS can answer differently depending on where the query comes from, before any connection is opened.
First layer in a multi-tier LB	DNS picks the region; a real load balancer within that region handles fine-grained distribution.
Stateless CDN edges	Any edge server can serve cached content. Imbalance does not matter much.

DNS is a coarse-grained sledgehammer, not a scalpel. On its own it cannot detect server health (managed DNS services can add health checks, but cached answers still delay the effect). It cannot route based on content. It cannot maintain session affinity. And its cached responses mean changes take minutes to propagate. Use DNS for geographic distribution, then use a real load balancer (L4 or L7) within each region for everything else.

You set your DNS TTL to 30 seconds and configure round-robin across 3 servers. A corporate network with 5,000 users shares one recursive DNS resolver. How many of those users hit the same server simultaneously?

A. About 1,667 (evenly split across 3).
B. About 833: the resolver caches for 10 seconds, not 30.
C. All 5,000.

Answer: C. The shared resolver caches the DNS response for 30 seconds. During that window, every client behind it resolves to the same IP. DNS round-robin rotates between resolvers, not between individual clients.

Chapter 2: L4 — Transport Layer Load Balancing

DNS can point clients to different IPs, but it is blind to server health, cannot react in real time, and its caching makes distribution unpredictable. We need something that sits in the actual traffic path and makes per-connection decisions. Enter L4 load balancing — operating at the transport layer (TCP/UDP).

An L4 load balancer sees the TCP SYN packet that initiates a connection. It reads the source IP, destination IP, source port, and destination port — the 4-tuple. It does not read the HTTP payload. It does not know the URL or headers. It picks a backend server and forwards the entire TCP connection to that server.

Because it does not parse application-layer data, an L4 load balancer is extremely fast. Modern L4 balancers (like Linux's IPVS or Google's Maglev) can handle millions of connections per second on commodity hardware. IPVS runs inside the kernel, while Maglev runs in user space and bypasses the kernel's network stack. Both avoid the per-packet overhead of the regular socket path.

Three Forwarding Modes

Once the L4 balancer picks a backend, it needs to forward packets. There are three ways to do this, each with different trade-offs:

NAT (Network Address Translation). The load balancer rewrites the destination IP of every incoming packet to the backend's IP, and rewrites the source IP of every response to its own IP, so the client only ever sees the load balancer. In full-NAT setups the load balancer also replaces the client's source IP with its own, so the backend thinks it is talking to the load balancer. With destination-only NAT the backend sees the real client IP and must use the load balancer as its default gateway. Either way, both directions flow through the LB. This is the simplest mode, but the load balancer sits in the return path and sees every response byte.

DSR (Direct Server Return). The load balancer forwards every incoming packet to the backend, but the backend responds directly to the client, bypassing the load balancer on the return path. Since response data is typically 10-100x larger than request data (think: you send a small GET request, you get back a 2 MB image), DSR removes the biggest bottleneck. The trade-off: backends must be configured to accept traffic for the load balancer's IP (via a loopback alias), and the LB cannot modify responses.

IP Tunneling (IPIP / GRE). Like DSR, but the load balancer encapsulates the original packet inside a new IP packet headed for the backend. The backend decapsulates, processes, and responds directly to the client. This allows the backend to be on a different network subnet from the load balancer — useful for geographically distributed backends.

// NAT mode: LB rewrites both directions
Client → [dst: LB_IP] → LB rewrites → [dst: Backend_IP] → Backend
Backend → [dst: LB_IP] → LB rewrites → [dst: Client_IP] → Client

// DSR mode: LB forwards, backend replies directly
Client → [dst: LB_IP] → LB forwards → [dst IP still LB_IP, dst MAC rewritten to Backend] → Backend
Backend → [src: LB_IP, dst: Client_IP] → (bypasses LB) → Client

// Return path bandwidth comparison for a 2 MB response:
NAT: 2 MB flows through LB (bottleneck)
DSR: 2 MB goes directly to client (LB only handles the tiny request)
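
The bandwidth comparison is simple arithmetic, sketched below. The request rate and the 1 KB request size are assumptions picked for illustration; the 2 MB response size comes from the comparison above. The model counts payload bytes only and ignores ACKs and packet headers.

```python
def lb_bandwidth_bytes_per_sec(requests_per_sec, request_bytes, response_bytes, mode):
    """Payload bytes per second that cross the load balancer."""
    inbound = requests_per_sec * request_bytes
    if mode == "nat":
        # Requests and responses both pass through the load balancer.
        return inbound + requests_per_sec * response_bytes
    if mode == "dsr":
        # Responses go straight from backend to client.
        return inbound
    raise ValueError(f"unknown mode: {mode}")


rps = 1_000                  # assumed request rate
request_size = 1_000         # assumed ~1 KB request
response_size = 2_000_000    # 2 MB response

nat = lb_bandwidth_bytes_per_sec(rps, request_size, response_size, "nat")
dsr = lb_bandwidth_bytes_per_sec(rps, request_size, response_size, "dsr")
print(f"NAT: {nat / 1e6:.0f} MB/s, DSR: {dsr / 1e6:.0f} MB/s")
# prints "NAT: 2001 MB/s, DSR: 1 MB/s"
```

Under these assumptions the load balancer carries about 2,000 times less traffic in DSR mode.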

NAT vs. DSR Traffic Flow

Watch packets flow through NAT (all traffic through LB) vs. DSR (responses bypass LB). Notice the bandwidth difference on the load balancer.


Real-World L4 Load Balancers

System	Mode	Scale
Linux IPVS	NAT, DSR, Tunneling	Millions of connections; in-kernel, available as a kube-proxy mode in Kubernetes
Google Maglev	DSR via GRE tunneling	Handles all of Google's external traffic; consistent hashing for connection affinity
AWS NLB	Flow hash (pass-through)	Millions of requests/sec; preserves client source IP
Facebook Katran	XDP/eBPF + DSR	BPF programs run at the driver level (XDP), before the kernel's regular network stack; very low per-packet overhead

The Connection Table

An L4 load balancer must remember which backend it assigned to each connection. If a client sends 50 TCP segments within one connection, they all must go to the same backend. The LB maintains a connection table mapping each flow to the chosen backend. A flow is identified by (client IP, client port, protocol), plus the destination IP and port when the LB serves more than one virtual address. This table is consulted for every packet.

For NAT mode, the connection table also stores the rewritten addresses so responses can be mapped back. For DSR, the table is only needed for the forward direction. When a connection closes (FIN/ACK exchange) or times out, its entry is removed from the table.

// Connection table entry
Key: (client_ip=203.0.113.5, client_port=49152, proto=TCP)
Value: (backend_ip=10.0.1.3, backend_port=8080, state=ESTABLISHED, age=12s)

// Table size = number of concurrent connections
// A busy LB might have 1M+ entries
// Memory: ~100 bytes/entry × 1M = ~100 MB
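
A minimal sketch of such a table, assuming a plain round-robin choice for new flows and an idle timeout for cleanup. The class and method names are hypothetical, and a real L4 balancer keeps this structure in the kernel or in a packet-processing fast path rather than in a Python dict.

```python
class ConnectionTable:
    """Maps each flow to the backend chosen when the flow was first seen."""

    def __init__(self, backends, idle_timeout=300.0):
        self._backends = list(backends)
        self._idle_timeout = idle_timeout
        self._entries = {}    # flow key -> [backend, last_seen]
        self._next = 0

    def lookup(self, client_ip, client_port, proto="TCP", now=0.0):
        key = (client_ip, client_port, proto)
        entry = self._entries.get(key)
        if entry is None:
            # New flow: pick a backend (round-robin here) and remember it.
            backend = self._backends[self._next % len(self._backends)]
            self._next += 1
            entry = [backend, now]
            self._entries[key] = entry
        entry[1] = now        # refresh the idle timer on every packet
        return entry[0]

    def close(self, client_ip, client_port, proto="TCP"):
        # Called when the connection finishes (FIN/ACK exchange or RST).
        self._entries.pop((client_ip, client_port, proto), None)

    def expire_idle(self, now):
        stale = [k for k, (_, seen) in self._entries.items()
                 if now - seen > self._idle_timeout]
        for key in stale:
            del self._entries[key]
        return len(stale)

    def __len__(self):
        return len(self._entries)


table = ConnectionTable([("10.0.1.1", 8080), ("10.0.1.2", 8080), ("10.0.1.3", 8080)])
first = table.lookup("203.0.113.5", 49152, now=0.0)
again = table.lookup("203.0.113.5", 49152, now=1.0)   # same flow, same backend
other = table.lookup("203.0.113.5", 49153, now=1.0)   # new port, new flow
print(first, again, other, len(table))
```

Every packet of a flow hits the dictionary once, which is why real implementations care so much about entry size and lookup cost.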

L4 is fast but blind. An L4 load balancer can route millions of connections per second because it never reads the HTTP payload. But this means it cannot route based on URL path, HTTP method, cookies, or headers. It cannot do A/B testing by URL. It cannot route /api to one pool and /static to another. For content-aware routing, you need L7.

Your API returns 5 MB JSON responses. Your L4 load balancer in NAT mode is hitting bandwidth limits. Which forwarding mode would reduce load on the balancer most?

A. Switch to IP tunneling mode: it compresses responses.
B. Switch to DSR mode.
C. Add more NAT-mode load balancers in front of the current one.

Answer: B. The 5 MB responses go directly from backend to client, bypassing the load balancer entirely. The LB only handles the small inbound requests. Since responses are orders of magnitude larger than requests, this massively reduces LB bandwidth.

Chapter 3: L7 — Application Layer Load Balancing

L4 load balancing is fast because it ignores the application payload. But sometimes you need to see inside the payload. You want to route /api/v2 to a new fleet of servers while /api/v1 stays on the old ones. You want to send mobile clients to servers optimized for mobile. You want to terminate TLS at the load balancer so backends do not need certificates. None of this is possible at L4.

L7 load balancing operates at the application layer — HTTP, gRPC, WebSocket. The load balancer fully terminates the client's TCP connection (and TLS, if applicable), parses the HTTP request, reads the URL, headers, cookies, and even the body. Then it opens a separate TCP connection to the chosen backend and forwards the request.

This is fundamentally different from L4. The L4 balancer is a packet forwarder — it shuttles raw TCP segments. The L7 balancer is a full reverse proxy — it understands the protocol and can modify, rewrite, or reject requests before forwarding.

What L7 Can See (That L4 Cannot)

HTTP Field	Routing decision it enables
URL path	Route `/api` to API servers, `/static` to CDN origin
Host header	Route `shop.example.com` and `blog.example.com` to different backends
Cookie	Session affinity — same user always hits same backend
Authorization header	Route authenticated vs. unauthenticated traffic differently
HTTP method	Route reads (GET) to read replicas, writes (POST/PUT) to the leader
Content-Type	Route JSON APIs vs. file uploads to specialized handlers
Query parameters	A/B testing: `?variant=B` goes to experimental backend
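
A few of the rows above can be expressed as an ordered rule list. This is a minimal sketch, not how any particular proxy is configured: the `Request` type, the pool names, and the rule order are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    method: str
    host: str
    path: str
    headers: dict = field(default_factory=dict)


# (predicate, pool name) pairs, evaluated top to bottom; first match wins.
ROUTES = [
    (lambda r: r.host == "blog.example.com", "blog-pool"),
    (lambda r: r.path.startswith("/static"), "static-pool"),
    (lambda r: r.path.startswith("/api") and r.method == "GET", "api-read-pool"),
    (lambda r: r.path.startswith("/api"), "api-write-pool"),
]


def choose_pool(request, default="web-pool"):
    for matches, pool in ROUTES:
        if matches(request):
            return pool
    return default


print(choose_pool(Request("GET", "shop.example.com", "/api/v2/items")))
print(choose_pool(Request("POST", "shop.example.com", "/api/v2/items")))
print(choose_pool(Request("GET", "shop.example.com", "/checkout")))
```

None of these decisions is available to an L4 balancer, because every field the predicates read lives inside the HTTP request.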

TLS Termination

One of the most valuable L7 features has nothing to do with routing. TLS termination means the load balancer handles HTTPS encryption. Clients connect to the LB via HTTPS. The LB decrypts the traffic, reads the HTTP request, then forwards it to backends over plain HTTP (or re-encrypts for end-to-end encryption).

Why this matters: TLS handshakes are CPU-intensive (especially the initial RSA or ECDHE key exchange). By offloading TLS to the load balancer, you free backend servers to spend their CPU on application logic. You also centralize certificate management — update one certificate on the LB instead of on every backend.

// L7 load balancer with TLS termination
Client → HTTPS (TLS 1.3) → L7 LB decrypts → reads HTTP headers → picks backend
L7 LB → plain HTTP → Backend:8080

// Benefits:
• Backends: no TLS overhead, no certificate management
• LB: can read HTTP headers for smart routing
• Security: TLS terminates at a hardened edge; internal network is trusted

// Cost:
• LB must handle all TLS handshakes (CPU intensive for RSA-2048)
• Traffic between LB and backends is unencrypted (mitigate with mTLS if needed)

L7 Content-Based Routing

Requests arrive with different URL paths. The L7 load balancer reads each path and routes to the appropriate backend pool.


L4 vs. L7: The Trade-off

Property	L4	L7
Throughput	Millions of conns/sec	Tens of thousands of req/sec per core
Latency overhead	Microseconds	Milliseconds (TLS + HTTP parsing)
Routing granularity	Per connection (IP + port)	Per request (URL, headers, cookies)
Connection model	Forwards existing connection	Two connections: client→LB, LB→backend
Protocol awareness	None (raw TCP/UDP)	Full HTTP/gRPC/WebSocket
Real-world examples	AWS NLB, Maglev, IPVS	Nginx, HAProxy, Envoy, AWS ALB

Most production setups use both. L4 in front (fast, handles raw connections, distributes across L7 balancers) and L7 behind (smart, reads HTTP, routes to the right application pool). Google's architecture: Maglev (L4) → GFE (L7) → backends. AWS: NLB (L4) → ALB (L7) → EC2 instances. The two layers complement each other.

You need to route requests for /api/v2 to a canary fleet while /api/v1 stays on production. Which load balancer type can do this?

A. L4: it can read the TCP payload to extract the URL path.
B. DNS: configure separate domain names for v1 and v2.
C. L7.

Answer: C. Only an L7 load balancer parses the HTTP request and can read the URL path. L4 sees only the TCP 4-tuple (IPs + ports) and cannot distinguish /api/v1 from /api/v2 because they arrive on the same IP and port.

Chapter 4: Load Balancing Algorithms

You have decided to use a load balancer. Requests arrive. You have 5 backend servers. The fundamental question: which server gets the next request? This is the selection algorithm, and the choice matters enormously. A bad algorithm creates hotspots. A good one spreads traffic evenly even when servers have different capacities.

Round-Robin

The simplest possible algorithm. Assign requests to servers in order: 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ... Each server gets exactly 1/N of the traffic. No state needed beyond a counter.

When it breaks: if servers have different capacities (one has 8 cores, another has 2), round-robin sends them equal traffic. The small server gets overwhelmed while the large one is underutilized. Also fails when requests have vastly different costs — a lightweight health check and a heavy database report both count as "one request."

Weighted Round-Robin

Assign each server a weight proportional to its capacity. A server with weight 3 gets 3x as many requests as a server with weight 1. The total weight determines the cycle length.

// Weighted round-robin example
Server A: weight 5 (beefy 8-core machine)
Server B: weight 3 (medium 4-core machine)
Server C: weight 2 (small 2-core machine)
Total weight: 5 + 3 + 2 = 10

// In every 10 requests:
A gets 5, B gets 3, C gets 2
Sequence: A, A, B, A, C, A, B, A, B, C (smooth variant)

// Nginx uses a smooth weighted round-robin that avoids
// bursts like AAAAABBBCC
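
One way to get a smooth interleaving is the "current weight" scheme sketched below: on every pick, each server's running score grows by its weight, the highest score wins, and the winner's score drops by the total weight. This follows the general idea behind Nginx's smooth weighted round-robin but is a simplified sketch, not Nginx's code. The exact order depends on tie-breaking, so it may differ from the illustrative sequence above while keeping the same 5/3/2 split.

```python
from itertools import islice


def smooth_weighted_round_robin(weights):
    """Yield server names forever, interleaved in proportion to their weights."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    while True:
        for name, weight in weights.items():
            current[name] += weight
        # max() returns the first maximal key, so ties go to the
        # server listed first.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        yield chosen


picker = smooth_weighted_round_robin({"A": 5, "B": 3, "C": 2})
sequence = list(islice(picker, 10))
print(", ".join(sequence))
# prints "A, B, C, A, A, B, A, C, B, A"
```

After 10 picks every score is back at zero, so the pattern repeats with the same cycle length as the total weight.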

Least Connections

Track how many active connections each server has. Send the next request to the server with the fewest. This naturally handles servers of different speeds: a fast server completes requests quickly, drops its connection count, and gets more traffic. A slow server's count stays high, so it gets fewer new requests.

When it shines: long-lived connections with variable duration (WebSockets, database connections, file uploads). When it struggles: very short-lived connections where the count is always near zero (thousands of requests per second, each completing in 1ms). In that case, the counts are noisy and round-robin works just as well.

Weighted Least Connections

Combine least-connections with server weights. The metric becomes active_connections / weight — the server with the lowest ratio gets the next request. This handles both heterogeneous hardware and variable request durations.
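
The ratio rule fits in one line of Python. The sketch below assumes the balancer already tracks an `active` count per backend; the `Backend` class is hypothetical, and connections are never closed in this demo so the effect of the weights is easy to see.

```python
class Backend:
    def __init__(self, name, weight=1):
        self.name = name
        self.weight = weight
        self.active = 0    # connections currently open


def pick_weighted_least_conn(backends):
    # Lowest active/weight ratio wins. With every weight equal to 1
    # this reduces to plain least-connections.
    return min(backends, key=lambda b: b.active / b.weight)


pool = [Backend("A", weight=5), Backend("B", weight=3), Backend("C", weight=2)]
assigned = []
for _ in range(10):
    backend = pick_weighted_least_conn(pool)
    backend.active += 1            # connection opened and held
    assigned.append(backend.name)
print(assigned)
print({b.name: b.active for b in pool})
```

With ten long-lived connections the pool settles at 5, 3, and 2 open connections, matching the weights. When connections also close at different speeds, the same rule steers new work toward whichever server is draining fastest relative to its capacity.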

Consistent Hashing

Hash the request key (e.g., user ID) to a point on a virtual ring. Each server owns an arc of the ring. The request goes to whichever server's arc it lands on. The magic: when a server is added or removed, only the keys in the affected arc move — all other assignments stay the same.

This is critical for stateful services. If server A has user 123's session cached in memory, you want every request from user 123 to go to server A. Consistent hashing achieves this without a session table, as long as the server set is stable.

// Consistent hashing ring
Ring positions: 0 to 2³² - 1
Server A hashed to position: 1,000,000
Server B hashed to position: 2,500,000
Server C hashed to position: 3,800,000

// Request from user "alice" hashed to: 1,700,000
// Walk clockwise → next server = B (at 2,500,000)
// alice always goes to B (until the ring changes)

// Add server D at position 2,000,000
// Now alice (1,700,000) → D (2,000,000) instead of B
// But user "bob" at 3,000,000 still → C. Minimal disruption.
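
Here is a compact ring in Python, as a sketch under a few assumptions: SHA-256 truncated to 32 bits as the hash, 100 virtual nodes per server to even out the arcs, and hypothetical names throughout. The positions it produces are hash-derived, so they will not match the round numbers in the walkthrough above.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, vnodes=100):
        self._vnodes = vnodes
        self._points = []     # sorted ring positions
        self._owner = {}      # ring position -> server name

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")    # 0 .. 2**32 - 1

    def add(self, server):
        for i in range(self._vnodes):
            point = self._hash(f"{server}#{i}")
            if point in self._owner:
                continue      # skip the rare position collision
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server):
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # Walk clockwise: first server position at or after the key's hash,
        # wrapping around to the start of the ring.
        i = bisect.bisect_left(self._points, self._hash(key))
        return self._owner[self._points[i % len(self._points)]]


ring = HashRing()
for name in ("A", "B", "C"):
    ring.add(name)

keys = [f"user-{n}" for n in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("D")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to D:",
      all(after[k] == "D" for k in moved))
```

Only keys that now belong to D change owner; everything else stays where it was. A plain `hash(key) % N` scheme would reshuffle most keys when N changes.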

Algorithm Comparison

4 servers with different speeds. Watch how each algorithm distributes load. The fastest server (A) should get the most traffic under smart algorithms.


Summary Table

Algorithm	State needed	Best for	Weakness
Round-Robin	Counter	Homogeneous servers, uniform requests	Ignores server capacity and request cost
Weighted RR	Counter + weights	Heterogeneous hardware	Weights are static; does not adapt to load
Least Connections	Per-server conn count	Variable-duration requests	Noisy with very short-lived connections
Consistent Hashing	Hash ring	Stateful services, caching	Uneven with few servers (use virtual nodes)
Random	None	Large server pools	Variance is high with few servers
Power of Two Choices	Per-server conn count	Large pools; low overhead	Slightly less optimal than full least-conn

Power of Two Choices. A beautiful compromise: instead of checking ALL servers (like least-connections), pick TWO servers at random and send to whichever has fewer connections. This captures most of least-connections' benefit with O(1) overhead instead of O(N). Used by Envoy, Nginx (random two least_conn), HAProxy (random(2)), and many internal load balancers at scale.
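
A small simulation shows the effect. It is a toy model with assumed parameters (50 servers, 5,000 connections that never close, a fixed seed), and it compares two choices against purely random assignment by looking at the gap between the busiest and idlest server.

```python
import random


def pick_random(active_counts, rng=random):
    return rng.choice(list(active_counts))


def pick_two_choices(active_counts, rng=random):
    """Sample two distinct servers; return the one with fewer connections."""
    a, b = rng.sample(list(active_counts), 2)
    return a if active_counts[a] <= active_counts[b] else b


def simulate(pick, servers=50, requests=5_000, seed=1):
    rng = random.Random(seed)
    counts = {f"s{i}": 0 for i in range(servers)}
    for _ in range(requests):
        counts[pick(counts, rng)] += 1    # connection opened and held
    return max(counts.values()) - min(counts.values())


spread_random = simulate(pick_random)
spread_two = simulate(pick_two_choices)
print(f"busiest minus idlest: random={spread_random}, two choices={spread_two}")
```

Each decision inspects only two counters, no matter how large the pool is, yet the spread stays far tighter than with random assignment.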

You have 3 servers: A (8 cores), B (4 cores), C (2 cores). With plain round-robin, each gets 33% of traffic. Server C is overwhelmed while A is 75% idle. Which algorithm fixes this without tracking per-request state?

A. Weighted round-robin.
B. Least connections: it automatically sends to the least busy server.
C. Consistent hashing: hash each request to distribute evenly.

Answer: A. Assign weights proportional to capacity: A=4, B=2, C=1. Now A gets 4/7 ≈ 57% of traffic, B gets 29%, C gets 14%. Each server receives traffic proportional to its capacity, no per-request state needed, just a static weight configuration.

Chapter 5: Health Checks

You have 5 backend servers and a load balancer distributing traffic. Server 3 crashes. The load balancer does not know. It keeps sending 20% of requests to a dead server. Those requests time out after 30 seconds. Users see spinning wheels. Your system's effective throughput drops by 20% even though 4 healthy servers could handle the full load — if only the load balancer knew to stop sending traffic to the dead one.

This is why every production load balancer has health checks — periodic probes that test whether each backend is alive and capable of serving traffic. If a server fails health checks, the load balancer removes it from the pool. When it recovers, it is added back.

Types of Health Checks

Health checks come in layers, from simple to thorough:

TCP check (L4). Can the load balancer open a TCP connection to the backend's port? If the 3-way handshake completes, the server is "up." This catches crashes and network failures but misses application-level problems (e.g., the process is running but returning 500 errors).

HTTP check (L7). The load balancer sends an actual HTTP request (typically GET /health) and checks the response. A healthy server returns 200 OK. An unhealthy one returns 500 or times out. This catches application-level failures, database connection issues, and resource exhaustion.

Deep health check. The /health endpoint does not just return 200. It actually tests critical dependencies. Can it reach the database? Is the cache responsive? Is disk usage below 95%? This catches "gray failures" where the process is running but cannot serve real traffic because a dependency is down.

# Health check endpoint implementation (Python, Flask)
# Assumes `db` and `cache` are your app's existing database and Redis clients.
import shutil
from flask import Flask

app = Flask(__name__)

def disk_usage(path="/"):
  usage = shutil.disk_usage(path)
  return usage.used / usage.total  # Fraction of the disk in use

# Shallow check: "am I running?"
@app.get("/health")
def health():
  return {"status": "ok"}  # Always returns 200 if process is alive

# Deep check: "can I actually serve traffic?"
@app.get("/health/ready")
def readiness():
  try:
    db.execute("SELECT 1")  # Can I reach the database?
    cache.ping()  # Can I reach Redis?
    if disk_usage() > 0.95:  # Am I running out of disk?
      return {"status": "full"}, 503
    return {"status": "ready"}, 200
  except Exception:
    return {"status": "unhealthy"}, 503

Health Check Parameters

Parameter	Typical value	What it controls
Interval	5-30 seconds	How often to probe each backend
Timeout	2-5 seconds	Max time to wait for a probe response
Unhealthy threshold	2-3 failures	How many consecutive failures before marking unhealthy
Healthy threshold	2-3 successes	How many consecutive successes before marking healthy again

The thresholds prevent flapping. A single failed check does not remove a server (it might have been a network blip). The server must fail 2-3 checks in a row before it is removed. Similarly, a recovering server must pass 2-3 checks before it is added back — preventing a partially-recovered server from receiving traffic and immediately failing again.
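
The threshold logic is a tiny state machine. The sketch below tracks one backend from a stream of probe results; `HealthTracker` is a hypothetical name, and the thresholds of 3 failures and 2 successes are example values from the ranges in the table above.

```python
class HealthTracker:
    """Tracks one backend's status from a stream of probe results."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0    # consecutive probes that disagree with the status

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0          # probe agrees with status: reset
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = probe_ok   # enough consecutive evidence: flip
            self._streak = 0
        return self.healthy


tracker = HealthTracker(unhealthy_threshold=3, healthy_threshold=2)
probes = [True, False, True, False, False, False, True, True]
history = [tracker.record(ok) for ok in probes]
print(history)
# prints [True, True, True, True, True, False, False, True]
```

The single failure at the second probe is ignored as a blip. The server leaves the pool only after three failures in a row, and returns only after two successes in a row.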

Health Check Simulation

5 servers are running. Click "Kill Server" to crash one. Watch the health checker detect the failure and remove it from the pool. Click "Revive" to bring it back.


Liveness vs. Readiness

Kubernetes popularized a distinction that applies broadly:

Liveness probe: "Is the process alive?" If it fails, the orchestrator kills and restarts the container. Use this to detect deadlocks or unrecoverable states. Example: the process is stuck in an infinite loop and cannot serve any requests.

Readiness probe: "Can this instance serve traffic right now?" If it fails, the load balancer stops sending traffic but does NOT restart the container. Use this for temporary conditions: the instance is still warming its cache, loading a model, or waiting for a database migration. Once ready, traffic resumes.

A common mistake: putting dependency checks in the liveness probe. If your liveness probe checks the database, and the database goes down, Kubernetes kills ALL your pods. Now you have no application instances to serve traffic when the database recovers. Put dependency checks in the readiness probe — it removes unhealthy instances from the load balancer without killing them.

Your health check interval is 10 seconds and the unhealthy threshold is 3. A server crashes. What is the maximum time before the load balancer stops sending it traffic?

A. Up to 30 seconds.
B. Exactly 10 seconds: one failed check is enough.
C. Exactly 3 seconds: one second per threshold count.

Answer: A. In the worst case, the server crashes right after a successful health check. The load balancer must wait for 3 consecutive failures at 10-second intervals: 10s + 10s + 10s = 30 seconds. During this window, 20% of requests (if 5 servers) hit a dead server and time out.

Chapter 6: Session Affinity

Most load-balancing algorithms treat every request independently. Request 1 goes to server A, request 2 goes to server B, request 3 goes to server C. This is fine for stateless services where any server can handle any request. But many real applications are stateful — they store user sessions in server memory.

Suppose a user logs in and their session token is stored in server A's memory. The next request goes to server B — which has no session for this user. The user sees a login page again. Frustrated, they log in again. Their new session is on server B. The next request goes to server C. Login page. This is the session affinity problem.

Session affinity (also called sticky sessions) ensures that all requests from the same user are routed to the same backend server. The load balancer "remembers" which server a user was assigned to and keeps sending them there.

How Sticky Sessions Work

There are several mechanisms, each with different trade-offs:

Source IP affinity. Hash the client's IP address to pick a server. Same IP always goes to the same server. Problem: many users share the same IP (corporate NAT, carrier-grade NAT). Thousands of users behind one IP all get pinned to one server. Also breaks when users switch networks (Wi-Fi to cellular).

Cookie-based affinity. The load balancer sets a cookie (e.g., SERVERID=backend-3) on the first response. Subsequent requests include this cookie, and the LB reads it to route to the same backend. This is the most reliable method — it tracks individual users, not IPs. Requires L7 (must read HTTP cookies).

Consistent hashing on user ID. If the request includes a user identifier (session token, JWT), hash it to pick a server via consistent hashing. No cookie needed; no state on the LB. Works for both HTTP and non-HTTP protocols. Disruption when servers are added/removed is minimal.

// Cookie-based session affinity (Nginx config; the sticky directive ships with NGINX Plus)
upstream backends {
  server 10.0.1.1:8080;
  server 10.0.1.2:8080;
  server 10.0.1.3:8080;
  sticky cookie srv_id expires=1h domain=.example.com path=/;
}

// First request: no cookie → LB picks server 2 → response includes:
Set-Cookie: srv_id=<opaque ID for 10.0.1.2:8080>; Path=/; Domain=.example.com; Max-Age=3600

// Second request: cookie present → LB reads srv_id → routes to 10.0.1.2
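
The same flow, sketched as load-balancer logic in Python. This is an illustration, not how Nginx implements it: `StickyRouter` is a hypothetical class, new users are assigned round-robin, and the cookie stores the backend address in the clear, where a production balancer would typically use an opaque or signed value.

```python
import itertools


class StickyRouter:
    COOKIE = "srv_id"

    def __init__(self, backends):
        self._backends = list(backends)
        self._round_robin = itertools.cycle(self._backends)

    def route(self, cookies):
        """Return (backend, cookies_to_set) for one request."""
        pinned = cookies.get(self.COOKIE)
        if pinned in self._backends:
            return pinned, {}             # known user: honor the cookie
        # No cookie, or it names a backend that no longer exists:
        # pick a backend and ask the client to remember it.
        backend = next(self._round_robin)
        return backend, {self.COOKIE: backend}


router = StickyRouter(["10.0.1.1:8080", "10.0.1.2:8080", "10.0.1.3:8080"])

backend, set_cookies = router.route({})               # first visit
repeat, _ = router.route(set_cookies)                 # browser sends the cookie back
newcomer, _ = router.route({})                        # a different user
print(backend, repeat, newcomer)
```

The router itself keeps no per-user state. The assignment lives in the client's cookie, which is why this approach survives a load balancer restart.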

The Draining Problem

Session affinity creates a new problem: server draining. You need to take server A down for maintenance. But 200 users have sticky sessions on server A. If you remove it from the pool, those sessions break.

The solution is graceful draining: stop sending new sessions to server A, but continue routing existing sessions there until they expire or complete. Once active sessions drop to zero, safely remove the server.
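
Graceful draining can be sketched as a pool that distinguishes existing sessions from new ones. The class below is a hypothetical illustration that tracks sessions explicitly and places new sessions on the least-loaded server that is not draining.

```python
class DrainablePool:
    def __init__(self, servers):
        self._sessions = {s: set() for s in servers}   # server -> session ids
        self._draining = set()

    def start_draining(self, server):
        self._draining.add(server)

    def assign(self, session_id):
        # Existing sessions stay where they are, even on a draining server.
        for server, sessions in self._sessions.items():
            if session_id in sessions:
                return server
        # New sessions only go to servers that are not draining.
        candidates = [s for s in self._sessions if s not in self._draining]
        if not candidates:
            raise RuntimeError("no servers available for new sessions")
        server = min(candidates, key=lambda s: len(self._sessions[s]))
        self._sessions[server].add(session_id)
        return server

    def end_session(self, session_id):
        for sessions in self._sessions.values():
            sessions.discard(session_id)

    def safe_to_remove(self, server):
        return server in self._draining and not self._sessions[server]


pool = DrainablePool(["A", "B"])
pool.assign("u1")                      # lands on A
pool.assign("u2")                      # lands on B
pool.start_draining("A")
print(pool.assign("u1"))               # still A: existing session is honored
print(pool.assign("u3"))               # B: A takes no new sessions
print(pool.safe_to_remove("A"))        # False while u1 is active
pool.end_session("u1")
print(pool.safe_to_remove("A"))        # True once A is empty
```

In practice you would also put a deadline on the drain, so a single never-ending session cannot block maintenance forever.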

Sticky Sessions vs. Stateless

Watch colored users make multiple requests. Without stickiness, they bounce between servers. With stickiness, each user stays on their assigned server.


The best solution: eliminate server-side sessions entirely. Store sessions in an external store (Redis, database) so any server can serve any user. This makes load balancing trivial — no affinity needed, no draining complexity, no state loss on server failure. The overhead is one extra network hop per request, but the operational simplicity is worth it. This is what most modern stateless architectures do.

Your app stores user sessions in server memory. You use source-IP-based sticky sessions. 3,000 users are behind the same corporate NAT (same public IP). What happens?

A. Each user gets their own server assignment based on their private IP.
B. All 3,000 users are pinned to the same backend server.
C. The load balancer automatically detects the NAT and distributes users evenly.

Answer: B. Source-IP hashing sees one IP for all of them. That server gets 3,000 users while others get far fewer. This is why cookie-based affinity (which tracks individual users, not IPs) is preferred for HTTP traffic.

Chapter 7: Global Load Balancing

Everything so far has been about distributing traffic within a single data center. But what if you have data centers in Virginia, Frankfurt, and Tokyo? A user in Japan making requests to your Virginia servers suffers 200ms of round-trip latency on every single request. Multiply that by the dozens of requests a typical page load makes, and the experience is painful.

Global load balancing (GSLB — Global Server Load Balancing) routes users to the nearest or best-performing data center before any local load balancer takes over. There are two primary mechanisms: GeoDNS and anycast.

GeoDNS

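GeoDNS is DNS with geographic awareness. When a DNS query arrives, the authoritative nameserver looks up the query's source IP in a GeoIP database (like MaxMind) to estimate the client's location. That source IP normally belongs to the client's recursive resolver, unless the resolver passes along the client's subnet via EDNS Client Subnet. The nameserver then returns the IP address of the nearest data center.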

// GeoDNS resolution
Client in Tokyo resolves api.example.com
DNS server sees source IP → GeoIP lookup → Japan
Returns: 103.21.0.1 (Tokyo data center)

Client in Berlin resolves api.example.com
DNS server sees source IP → GeoIP lookup → Germany
Returns: 185.45.0.1 (Frankfurt data center)

// Same domain, different IPs based on location
// Providers: Route 53 (AWS), Cloud DNS (GCP), Cloudflare

GeoDNS has the same caching limitations as regular DNS — a corporate resolver in one city might serve clients in another city, and they all get the same cached answer. But at the global scale (routing to continents or countries), this is usually acceptable.
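The lookup itself can be sketched in a few lines. Everything in the table below is an assumption for illustration: the client prefixes are documentation ranges standing in for a real GeoIP database, the Virginia address is made up, and `resolve` is a hypothetical name. Only the Tokyo and Frankfurt addresses come from the example above.

```python
import ipaddress

# Hypothetical GeoIP table: source prefix -> country code.
GEOIP = {
    ipaddress.ip_network("192.0.2.0/25"): "JP",
    ipaddress.ip_network("192.0.2.128/25"): "DE",
}

COUNTRY_TO_DC = {"JP": "tokyo", "DE": "frankfurt"}
DEFAULT_DC = "virginia"  # fallback when the source IP is not in the table

DC_IP = {
    "tokyo": "103.21.0.1",
    "frankfurt": "185.45.0.1",
    "virginia": "203.0.113.1",  # assumed address, not from the example above
}


def resolve(source_ip: str) -> str:
    """Return the data center IP for the query's source address."""
    addr = ipaddress.ip_address(source_ip)
    for network, country in GEOIP.items():
        if addr in network:
            return DC_IP[COUNTRY_TO_DC[country]]
    return DC_IP[DEFAULT_DC]


print(resolve("192.0.2.10"))    # 103.21.0.1 (Tokyo)
print(resolve("192.0.2.200"))   # 185.45.0.1 (Frankfurt)
print(resolve("198.51.100.9"))  # 203.0.113.1 (default: Virginia)
```

Note that the answer depends only on the source address the nameserver sees, which is exactly why a shared resolver gives every client behind it the same result.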

Anycast

Anycast is a fundamentally different approach. Instead of giving each data center a different IP, you give them all the same IP address. Every data center announces the prefix containing this IP via BGP (the internet's inter-domain routing protocol). When a client sends a packet to that IP, routers along the way forward it toward the announcement they prefer, which is usually the topologically nearest data center: the one with the shortest BGP path. Topologically nearest is often, but not always, geographically nearest.

// Anycast: same IP, multiple locations
api.example.com → 198.51.100.1

Tokyo data center announces: "I can reach 198.51.100.1" via BGP
Frankfurt data center announces: "I can reach 198.51.100.1" via BGP
Virginia data center announces: "I can reach 198.51.100.1" via BGP

// Internet routing delivers packets to the nearest announcer
Client in Tokyo → packet arrives at Tokyo DC (shortest BGP path)
Client in London → packet arrives at Frankfurt DC (closest)

// Used by: Cloudflare, Google, all root DNS servers

Anycast works beautifully for UDP-based protocols (DNS) and short-lived TCP connections (HTTP over TLS 1.3 with 0-RTT). For long-lived TCP connections, anycast has a subtle problem: if BGP routes change mid-connection (a route flap), the packets might suddenly be delivered to a different data center, which does not have the TCP state. The connection breaks. This is one reason operators such as Cloudflare and Google pair anycast with QUIC, whose connection IDs identify a connection independently of the IP/port 4-tuple, so a load balancer can recognize packets that belong to an existing connection and forward them to the server that holds its state.
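The route-flap problem can be illustrated with a toy model. The sketch below assumes routers simply prefer the shortest AS path; real BGP best-path selection weighs local preference and other attributes first. The path lengths, the `best_route` helper, and the connection name are all hypothetical.

```python
def best_route(announcements: dict) -> str:
    """Pick the site with the shortest AS path (ties broken by name)."""
    return min(announcements, key=lambda site: (announcements[site], site))


# AS-path lengths as seen from one hypothetical client network in London.
routes = {"frankfurt": 2, "virginia": 4, "tokyo": 6}

# TCP connection state lives only at the site that completed the handshake.
tcp_state = {site: set() for site in routes}

first_site = best_route(routes)
tcp_state[first_site].add("conn-1")

# Route flap: the Frankfurt path suddenly looks longer than Virginia's.
routes["frankfurt"] = 5
second_site = best_route(routes)

connection_survives = "conn-1" in tcp_state[second_site]

print(first_site)           # frankfurt
print(second_site)          # virginia
print(connection_survives)  # False
```

The client did nothing wrong and the destination IP never changed, yet the packets now reach a site that has never heard of the connection.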

GeoDNS vs. Anycast Routing

Clients from 3 regions connect to your service. Watch how GeoDNS and Anycast each route them to the nearest data center.


Failover at the Global Level

Global load balancing also handles data center failures. If the Tokyo data center goes down:

GeoDNS failover: The health-checking system detects the outage and removes Tokyo's IP from DNS responses. Clients in Japan start receiving the address of the next-nearest healthy data center (Seoul or Singapore if you had sites there; Virginia or Frankfurt in the three-site example above). Failover time is the detection time plus the DNS TTL, so it can take minutes.

Anycast failover: The Tokyo data center stops announcing the IP via BGP. Internet routing automatically converges — within seconds to minutes — and delivers packets to the next-nearest announcing data center. No DNS change needed. This is one of anycast's biggest advantages.
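A rough way to compare the two is to add up the delays involved. The sketch below assumes failover time is roughly detection time (health-check interval times the number of failed probes required) plus propagation time (the DNS TTL for GeoDNS, BGP convergence for anycast). All the numbers are illustrative assumptions, not measurements.

```python
def detection_time(check_interval_s: float, failure_threshold: int) -> float:
    """Seconds until the health checker declares the site down."""
    return check_interval_s * failure_threshold


def geodns_failover(check_interval_s, failure_threshold, ttl_s):
    """Worst case: a resolver cached the old answer just before removal."""
    return detection_time(check_interval_s, failure_threshold) + ttl_s


def anycast_failover(check_interval_s, failure_threshold, convergence_s):
    """Detection plus the time BGP needs to converge on a new path."""
    return detection_time(check_interval_s, failure_threshold) + convergence_s


# Assumed values: probe every 10 s, 3 failures to mark down,
# a 300 s DNS TTL, and 30 s of BGP convergence.
geodns_s = geodns_failover(10, 3, ttl_s=300)
anycast_s = anycast_failover(10, 3, convergence_s=30)

print(geodns_s)   # 330
print(anycast_s)  # 60
```

Lowering the TTL narrows the gap, at the price of more DNS queries reaching your nameservers.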

Property	GeoDNS	Anycast
Mechanism	DNS returns different IPs per region	Same IP, BGP routing picks nearest
Granularity	Country/city level (GeoIP accuracy)	Network topology (BGP path length)
Failover speed	Depends on DNS TTL (minutes)	BGP convergence (seconds to minutes)
Long TCP connections	Stable (each DC has its own IP)	Risk of route flaps breaking connections
Setup complexity	Low (DNS config)	High (BGP peering, ISP coordination)

The full picture: three layers of load balancing. Global LB (GeoDNS/anycast) picks the data center. L4 LB (Maglev/NLB) distributes connections within the DC. L7 LB (Envoy/Nginx) routes requests to the right application pool. Each layer adds intelligence at the cost of latency. Modern systems like Google, Cloudflare, and AWS use all three.
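The three layers can be sketched as three small functions chained together. This is a toy model under stated assumptions: the region names, proxy names, pool names, and path prefixes are all hypothetical, the global layer is reduced to a lookup table, and the L4 layer is a plain hash of the 4-tuple rather than Maglev's actual algorithm.

```python
import hashlib

DATA_CENTERS = {"asia": "tokyo", "europe": "frankfurt", "americas": "virginia"}
L7_PROXIES = ["proxy-1", "proxy-2", "proxy-3"]
POOLS = {"/api/": "api-pool", "/static/": "static-pool"}
DEFAULT_POOL = "web-pool"


def global_lb(client_region: str) -> str:
    """Layer 1: pick a data center (stands in for GeoDNS or anycast)."""
    return DATA_CENTERS[client_region]


def l4_lb(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    """Layer 2: hash the 4-tuple so a connection always hits one proxy."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    index = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return L7_PROXIES[index % len(L7_PROXIES)]


def l7_lb(path: str) -> str:
    """Layer 3: inspect the HTTP path and pick an application pool."""
    for prefix, pool in POOLS.items():
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL


def route(client_region: str, four_tuple: tuple, path: str) -> tuple:
    return global_lb(client_region), l4_lb(*four_tuple), l7_lb(path)


four_tuple = ("192.0.2.10", 51000, "198.51.100.1", 443)
result = route("asia", four_tuple, "/api/users")
print(result)
```

Only the last function ever looks at the HTTP request. The first two decide using nothing but location and addresses, which is what keeps them fast.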

Your Tokyo data center goes down. You use anycast. What happens to clients in Japan?

A. (Correct) BGP routing automatically converges. Since Tokyo stopped announcing the IP, packets are delivered to the next-nearest data center (e.g., Seoul or Singapore) within seconds to minutes. No DNS change is needed; the IP address stays the same. Existing TCP connections may break, but new ones go to the healthy DC.

B. Clients get connection timeouts until you manually update DNS records.

C. All traffic is dropped because the anycast IP no longer exists on the internet.

Chapter 8: Interactive Load Balancer

Time to put everything together. Below is a fully interactive load balancer simulation. You control the algorithm, the server health, and the request rate. Watch how requests distribute, see servers fail and recover, and observe how different algorithms handle uneven loads.

This is your playground. Kill servers and watch failover. Crank up the request rate and watch queues fill. Switch algorithms mid-traffic and compare distributions. The goal: build intuition for how load balancers behave under stress.

Load Balancer Simulator

Requests arrive at the load balancer and get distributed to backend servers. Kill servers, change algorithms, and adjust request rate to see how the system responds.


What to Try

Experiment 1: Kill a server under round-robin. Watch how the dead server's share of traffic produces errors until the health checker removes it. Then switch to least-connections and kill a server — notice the difference in response time.

Experiment 2: Crank the request rate to 20/s. Watch server queues fill up. Which algorithm keeps latency lowest? (Hint: least-connections adapts; round-robin does not.)

Experiment 3: Kill 3 of 5 servers. The remaining 2 must handle all traffic. Watch queue depths spike. This is why capacity planning matters: decide how many simultaneous failures you need to survive, then provision enough servers that the survivors can still carry peak load.
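If you cannot run the simulator, the experiments above can be approximated with a small discrete-time model. This is a simplified sketch, not the simulator's actual logic: time advances in ticks, a fixed number of requests arrive per tick, and each server drains a fixed number of requests per tick (its "speed"). The speeds and rates are assumed values.

```python
from itertools import cycle


def simulate(algorithm, speeds, arrivals_per_tick, ticks):
    """Return each server's queue depth after the given number of ticks."""
    queues = [0] * len(speeds)
    round_robin = cycle(range(len(speeds)))
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if algorithm == "round_robin":
                target = next(round_robin)
            else:  # least_connections: pick the shortest queue
                target = min(range(len(queues)), key=queues.__getitem__)
            queues[target] += 1
        # Each server completes up to `speed` requests per tick.
        queues = [max(0, q - s) for q, s in zip(queues, speeds)]
    return queues


# Experiment 2: uneven servers, 6 requests per tick, total capacity 8.
uneven = [1, 3, 4]
rr_queues = simulate("round_robin", uneven, arrivals_per_tick=6, ticks=50)
lc_queues = simulate("least_connections", uneven, arrivals_per_tick=6, ticks=50)
print("round-robin:", rr_queues)
print("least-connections:", lc_queues)

# Experiment 3: five equal servers, then only two survivors.
healthy = simulate("least_connections", [2] * 5, arrivals_per_tick=6, ticks=50)
degraded = simulate("least_connections", [2] * 2, arrivals_per_tick=6, ticks=50)
print("5 servers:", healthy)
print("2 servers:", degraded)
```

Round-robin keeps feeding the slow server its equal share, so that queue grows without bound even though total capacity exceeds demand. Least-connections keeps every queue short. In the last run no algorithm can help: two servers drain 4 requests per tick while 6 arrive, so both queues grow steadily.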

Chapter 9: Connections

Load balancing is one piece of the distributed systems puzzle. Here is how it connects to everything else.

What We Covered

Concept	Key takeaway
DNS LB	Coarse-grained, cached, good for global distribution only
L4 LB	Fast (kernel-level), sees TCP 4-tuple, cannot inspect HTTP
L7 LB	Smart (content-based routing), slower (HTTP parsing), handles TLS
Algorithms	Round-robin for simple, least-conn for adaptive, consistent hashing for stateful
Health checks	TCP/HTTP probes with thresholds prevent routing to dead servers
Session affinity	Cookie-based preferred; externalized state is better than sticky sessions
Global LB	GeoDNS for simplicity, anycast for speed and automatic failover

Where to Go Next

Topic	Connection
Service Architecture	Service meshes (Envoy, Istio) build L7 load balancing into every service-to-service call
Data Storage	An external session store (Redis) eliminates the need for session affinity; replication handles read scaling
Messaging	Message queues decouple producers from consumers, providing a different kind of load distribution
Consensus	Leader election decides which replica is primary; the load balancer must know who the leader is

The load balancer is the front door of every distributed system. It is the first thing a request hits, and the last line of defense before backend failures become user-visible. Mastering load balancing — at every layer — is foundational to understanding how large-scale systems work.

You are designing a global system. Traffic enters via GeoDNS to reach the nearest region. Within each region, what is the typical load-balancing architecture?

A. Just one L7 load balancer that handles everything.

B. (Correct) An L4 load balancer (fast, kernel-level) distributes connections across multiple L7 load balancers (smart, content-aware), which then route to application backends. The two layers complement each other: L4 for raw throughput, L7 for routing intelligence. This is Google's (Maglev → GFE) and AWS's (NLB → ALB) architecture.

C. Direct anycast to every backend server individually.