DNS round-robin, L4 transport routing, L7 application routing, algorithms, health checks, and global traffic management.
You run a web service. One server handles all traffic. Things are fine at 100 requests per second. Then your app trends on social media. Traffic spikes to 10,000 requests per second. Your single server maxes out at 1,000 requests per second. The remaining 9,000 requests get timeouts. Users see errors. Revenue drops.
The obvious fix: add more servers. You spin up 10 identical copies of your application. Each can handle 1,000 requests per second. Together, they can handle 10,000. But here is the problem that nobody thinks about at first: how does each request know which server to go to?
If all 10,000 users still send their requests to the same IP address, the same single server still gets crushed. The other 9 servers sit idle. Adding capacity without distributing traffic is like opening 10 checkout lanes at a grocery store but putting all the "open" signs on lane 1.
You need something that sits between the clients and your servers — something that accepts every incoming request and decides which backend server should handle it. This is a load balancer. It is one of the most critical pieces of infrastructure in any distributed system, and understanding how it works — at every layer of the network stack — is the difference between a system that scales and one that falls over.
Watch requests arrive. With one server, it gets overwhelmed. With a load balancer distributing to 4 servers, traffic flows smoothly.
The overloaded server drops requests because its internal queue is full. Requests that do get processed take longer because the CPU is saturated — context-switching, thrashing memory, and spending more time on housekeeping than on actual work. Meanwhile, the load-balanced setup processes every request within its capacity budget.
Load balancing can happen at multiple layers of the network stack. Each layer has different visibility into the request and different performance trade-offs:
| Layer | What it sees | Speed | Smarts |
|---|---|---|---|
| DNS | Domain name only | Very fast (no proxy) | Minimal (round-robin IPs) |
| L4 (Transport) | IP + port + TCP flags | Fast (kernel-level) | Moderate (connection-level) |
| L7 (Application) | Full HTTP: URL, headers, body | Slower (must parse HTTP) | High (content-based routing) |
This lesson explores all three layers, the algorithms that power them, and the health-checking machinery that keeps the whole thing reliable. By the end, you will be able to design a load-balancing strategy for any system — from a small web app to a global-scale platform.
The simplest form of load balancing does not require any special hardware or proxy software. It uses a system that already exists everywhere: DNS — the Domain Name System that translates human-readable domain names into IP addresses.
When a browser wants to reach api.example.com, it asks a DNS server "what IP address is this?" Normally, the DNS server returns one IP. But nothing stops it from returning multiple IPs. And nothing stops it from rotating which IP it returns first each time someone asks.
This is DNS round-robin. You configure your DNS records to list multiple A records for the same domain name. Each time a client resolves the domain, the DNS server returns the IPs in a different order. Most clients use the first IP in the list, so traffic naturally spreads across your servers.
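The rotation described above can be sketched in a few lines. This is a toy model rather than a real DNS server; the class name `RoundRobinZone` and the IP addresses are made up for illustration.

```python
from collections import deque

class RoundRobinZone:
    """Toy authoritative server: rotates its A-record list on every query."""

    def __init__(self, records):
        self._records = deque(records)

    def resolve(self):
        answer = list(self._records)  # order returned for this query
        self._records.rotate(-1)      # the next query starts with the next IP
        return answer

zone = RoundRobinZone(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
# Most clients connect to the first IP in the answer.
first_choices = [zone.resolve()[0] for _ in range(6)]
print(first_choices)
```

Six lookups in a row land on each of the three servers twice, which is the even spread round-robin promises when nothing caches the answer.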
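To understand why DNS load balancing has certain limitations, you need to understand the resolution chain. When a client wants to reach your domain, the lookup typically passes through the browser and operating-system caches, then a recursive resolver (run by an ISP, a company, or a public DNS provider), which in turn queries the root, TLD, and finally your authoritative nameservers.
The key detail: every step in this chain can cache the result. Each DNS record carries a TTL (Time To Live) in seconds. A TTL of 300 tells caches "keep this answer for 5 minutes before asking again." During those 5 minutes, every client behind the same recursive resolver gets the same cached answer. No rotation. No balancing.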
This is DNS load balancing's fatal weakness. Suppose 10,000 users share the same corporate DNS resolver. The resolver caches the DNS response for 300 seconds. For those 5 minutes, all 10,000 users hit the same server. The "round-robin" happens only when the cache expires and a new lookup occurs.
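A small simulation makes the pinning effect concrete. It is a sketch under simplifying assumptions: one shared resolver, one lookup per second, and hypothetical class names (`Authoritative`, `CachingResolver`) and IPs.

```python
import itertools

class Authoritative:
    """Toy authoritative server that hands out IPs in rotation."""

    def __init__(self, ips):
        self._cycle = itertools.cycle(ips)
        self.queries = 0

    def query(self):
        self.queries += 1
        return next(self._cycle)

class CachingResolver:
    """Toy recursive resolver that reuses an answer until its TTL expires."""

    def __init__(self, upstream, ttl):
        self.upstream = upstream
        self.ttl = ttl
        self._cached = None
        self._expires = 0

    def resolve(self, now):
        if self._cached is None or now >= self._expires:
            self._cached = self.upstream.query()
            self._expires = now + self.ttl
        return self._cached

auth = Authoritative(["10.0.0.1", "10.0.0.2"])
resolver = CachingResolver(auth, ttl=300)
# One client lookup per second for 10 minutes, all via the same resolver.
answers = [resolver.resolve(now=t) for t in range(600)]
print(auth.queries, set(answers[:300]), set(answers[300:]))
```

The authoritative server is asked only twice in ten minutes, and every lookup inside a TTL window is sent to the same backend.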
You can lower the TTL to reduce caching effects, but there is a floor. Many resolvers ignore TTLs below 30 seconds. Some (infamously, older Java applications) cache DNS forever. And low TTLs increase DNS query volume, slowing down every first connection.
Watch how DNS caching causes uneven distribution. Each resolver caches for the TTL period, pinning all its clients to one server.
| Use case | Why it works |
|---|---|
| Global traffic distribution | Route users to the nearest data center. DNS answers before any connection is opened, so it can steer each client to a region based on where the query comes from. |
| First layer in a multi-tier LB | DNS picks the region; a real load balancer within that region handles fine-grained distribution. |
| Stateless CDN edges | Any edge server can serve cached content. Imbalance does not matter much. |
DNS can point clients to different IPs, but it is blind to server health, cannot react in real time, and its caching makes distribution unpredictable. We need something that sits in the actual traffic path and makes per-connection decisions. Enter L4 load balancing — operating at the transport layer (TCP/UDP).
An L4 load balancer sees the TCP SYN packet that initiates a connection. It reads the source IP, destination IP, source port, and destination port — the 4-tuple. It does not read the HTTP payload. It does not know the URL or headers. It picks a backend server and forwards the entire TCP connection to that server.
Because it does not parse application-layer data, an L4 load balancer is extremely fast. Modern L4 balancers (like Linux's IPVS or Google's Maglev) can handle millions of connections per second on commodity hardware. They avoid the overhead of a general-purpose user-space proxy: IPVS forwards packets inside the kernel, while Maglev bypasses the kernel network stack and processes packets directly.
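Once the L4 balancer picks a backend, it needs to forward packets. There are three common ways to do this, each with different trade-offs: NAT, where the balancer rewrites addresses and all traffic in both directions passes through it; Direct Server Return (DSR), where the backend replies straight to the client; and IP tunneling, where the balancer encapsulates each packet and sends it to the backend.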
Watch packets flow through NAT (all traffic through LB) vs. DSR (responses bypass LB). Notice the bandwidth difference on the load balancer.
| System | Mode | Scale |
|---|---|---|
| Linux IPVS | NAT, DSR, Tunneling | Millions of connections; in-kernel, available as a kube-proxy mode in Kubernetes |
| Google Maglev | DSR via GRE tunneling | Handles all of Google's external traffic; consistent hashing for connection affinity |
| AWS NLB | DSR-like (flow hash) | Millions of requests/sec; preserves client source IP |
| Facebook Katran | XDP/eBPF + DSR | eBPF programs at the XDP hook forward packets in the driver path, before the regular kernel network stack |
An L4 load balancer must remember which backend it assigned to each connection. If a client sends 50 TCP segments within one connection, they all must go to the same backend. The LB maintains a connection table that maps each connection's 4-tuple (source IP, source port, destination IP, destination port) to the chosen backend. This table is consulted for every packet.
For NAT mode, the connection table also stores the rewritten addresses so responses can be mapped back. For DSR, the table is only needed for the forward direction. When a connection closes (FIN/ACK exchange) or times out, its entry is removed from the table.
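As a rough sketch of the connection table, the class below pins each connection to a backend on its first packet and forgets it on close. It is keyed by the full 4-tuple, picks backends round-robin, and leaves out NAT rewriting and timeouts; the name `ConnectionTable` and the addresses are illustrative assumptions.

```python
class ConnectionTable:
    """Toy L4 connection tracking: one backend per connection 4-tuple."""

    def __init__(self, backends):
        self.backends = backends
        self.table = {}   # 4-tuple -> backend
        self._next = 0

    def route(self, four_tuple):
        backend = self.table.get(four_tuple)
        if backend is None:
            # First packet of a new connection (the SYN): choose a backend.
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            self.table[four_tuple] = backend
        return backend

    def close(self, four_tuple):
        # Called when the connection ends (FIN/ACK exchange or timeout).
        self.table.pop(four_tuple, None)

lb = ConnectionTable(["backend-a", "backend-b"])
conn1 = ("203.0.113.7", 51000, "198.51.100.1", 443)
conn2 = ("203.0.113.8", 40222, "198.51.100.1", 443)
seen = {lb.route(conn1) for _ in range(50)}  # 50 segments, one connection
other = lb.route(conn2)
lb.close(conn1)
print(seen, other, len(lb.table))
```

All 50 segments of the first connection reach the same backend, and closing the connection frees its table entry.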
An L4 load balancer cannot route /api to one pool and /static to another. For content-aware routing, you need L7.

L4 load balancing is fast because it ignores the application payload. But sometimes you need to see inside the payload. You want to route /api/v2 to a new fleet of servers while /api/v1 stays on the old ones. You want to send mobile clients to servers optimized for mobile. You want to terminate TLS at the load balancer so backends do not need certificates. None of this is possible at L4.
L7 load balancing operates at the application layer — HTTP, gRPC, WebSocket. The load balancer fully terminates the client's TCP connection (and TLS, if applicable), parses the HTTP request, reads the URL, headers, cookies, and even the body. Then it opens a separate TCP connection to the chosen backend and forwards the request.
This is fundamentally different from L4. The L4 balancer is a packet forwarder — it shuttles raw TCP segments. The L7 balancer is a full reverse proxy — it understands the protocol and can modify, rewrite, or reject requests before forwarding.
| HTTP Field | Routing decision it enables |
|---|---|
| URL path | Route /api to API servers, /static to CDN origin |
| Host header | Route shop.example.com and blog.example.com to different backends |
| Cookie | Session affinity — same user always hits same backend |
| Authorization header | Route authenticated vs. unauthenticated traffic differently |
| HTTP method | Route reads (GET) to read replicas, writes (POST/PUT) to the leader |
| Content-Type | Route JSON APIs vs. file uploads to specialized handlers |
| Query parameters | A/B testing: ?variant=B goes to experimental backend |
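A few of the rules in the table can be expressed as a small routing function. This is a sketch, not the configuration syntax of any real proxy; the `Request` type and the pool names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    method: str
    path: str
    headers: dict = field(default_factory=dict)

def choose_pool(req):
    """Pick a backend pool from fields only an L7 balancer can see."""
    if req.headers.get("Host") == "blog.example.com":
        return "blog-pool"                     # Host-header routing
    if req.path.startswith("/static"):
        return "static-pool"                   # URL-path routing
    if req.path.startswith("/api"):
        # HTTP-method routing: reads to replicas, writes to the leader.
        return "api-read-pool" if req.method == "GET" else "api-write-pool"
    return "default-pool"

print(choose_pool(Request("GET", "/static/logo.png")))
print(choose_pool(Request("POST", "/api/orders")))
```

Rule order matters in this sketch: the first matching rule wins, so more specific rules should come first.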
One of the most valuable L7 features has nothing to do with routing. TLS termination means the load balancer handles HTTPS encryption. Clients connect to the LB via HTTPS. The LB decrypts the traffic, reads the HTTP request, then forwards it to backends over plain HTTP (or re-encrypts for end-to-end encryption).
Why this matters: TLS handshakes are CPU-intensive (especially the initial RSA or ECDHE key exchange). By offloading TLS to the load balancer, you free backend servers to spend their CPU on application logic. You also centralize certificate management — update one certificate on the LB instead of on every backend.
Requests arrive with different URL paths. The L7 load balancer reads each path and routes to the appropriate backend pool.
| Property | L4 | L7 |
|---|---|---|
| Throughput | Millions of conns/sec | Tens of thousands of req/sec per core |
| Latency overhead | Microseconds | Milliseconds (TLS + HTTP parsing) |
| Routing granularity | Per connection (IP + port) | Per request (URL, headers, cookies) |
| Connection model | Forwards existing connection | Two connections: client→LB, LB→backend |
| Protocol awareness | None (raw TCP/UDP) | Full HTTP/gRPC/WebSocket |
| Real-world examples | AWS NLB, Maglev, IPVS | Nginx, HAProxy, Envoy, AWS ALB |
You want to route /api/v2 to a canary fleet while /api/v1 stays on production. Which load balancer type can do this? Only L7, because the decision depends on the URL path.

You have decided to use a load balancer. Requests arrive. You have 5 backend servers. The fundamental question: which server gets the next request? This is the selection algorithm, and the choice matters enormously. A bad algorithm creates hotspots. A good one spreads traffic evenly even when servers have different capacities.
The simplest possible algorithm. Assign requests to servers in order: 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ... Each server gets exactly 1/N of the traffic. No state needed beyond a counter.
When it breaks: if servers have different capacities (one has 8 cores, another has 2), round-robin sends them equal traffic. The small server gets overwhelmed while the large one is underutilized. Also fails when requests have vastly different costs — a lightweight health check and a heavy database report both count as "one request."
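Round-robin is small enough to show in full. The server names below are placeholders.

```python
import itertools
from collections import Counter

servers = ["s1", "s2", "s3", "s4", "s5"]
rotation = itertools.cycle(servers)  # the only state is the position in the cycle

assignments = [next(rotation) for _ in range(10)]
counts = Counter(assignments)
print(assignments)
print(counts)
```

Ten requests give every server exactly two, regardless of how large the server is or how expensive each request turns out to be.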
Assign each server a weight proportional to its capacity. A server with weight 3 gets 3x as many requests as a server with weight 1. The total weight determines the cycle length.
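One way to implement weights is the variant often called smooth weighted round-robin, which interleaves picks instead of sending a burst to the heaviest server. The sketch below assumes two hypothetical servers with weights 3 and 1; it is one possible implementation, not the only one.

```python
def smooth_weighted_rr(weights, n):
    """Return n picks; each server is chosen in proportion to its weight."""
    current = {server: 0 for server in weights}
    total = sum(weights.values())   # the cycle length
    picks = []
    for _ in range(n):
        for server, weight in weights.items():
            current[server] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total    # push the chosen server back
        picks.append(chosen)
    return picks

picks = smooth_weighted_rr({"big": 3, "small": 1}, 8)
print(picks.count("big"), picks.count("small"))
```

With a total weight of 4, the pattern repeats every four requests, and the weight-3 server receives three of them.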
Track how many active connections each server has. Send the next request to the server with the fewest. This naturally handles servers of different speeds: a fast server completes requests quickly, drops its connection count, and gets more traffic. A slow server's count stays high, so it gets fewer new requests.
When it shines: long-lived connections with variable duration (WebSockets, database connections, file uploads). When it struggles: very short-lived connections where the count is always near zero (thousands of requests per second, each completing in 1ms). In that case, the counts are noisy and round-robin works just as well.
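A minimal least-connections sketch needs only a counter per server. The class and server names are illustrative, and ties are broken by insertion order here, which a real balancer might randomize.

```python
class LeastConnections:
    """Send each new request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request or connection finishes.
        self.active[server] -= 1

lb = LeastConnections(["fast", "slow"])
first = lb.acquire()    # both idle, tie goes to "fast"
second = lb.acquire()   # "fast" is busy, so "slow"
lb.release(first)       # the fast server finishes early
third = lb.acquire()    # "fast" is idle again and gets the next request
print(first, second, third)
```

The fast server is picked again as soon as it frees up, which is the adaptive behavior described above.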
Combine least-connections with server weights. The metric becomes active_connections / weight — the server with the lowest ratio gets the next request. This handles both heterogeneous hardware and variable request durations.
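The ratio rule translates directly into code. The connection counts and weights below are made-up numbers for illustration.

```python
def pick_weighted_least_conn(active, weights):
    """Choose the server with the lowest active_connections / weight ratio."""
    return min(active, key=lambda server: active[server] / weights[server])

active = {"big": 6, "small": 3}
weights = {"big": 4, "small": 1}
# big: 6 / 4 = 1.5, small: 3 / 1 = 3.0
chosen = pick_weighted_least_conn(active, weights)
print(chosen)
```

Plain least-connections would choose the small server here because it has fewer connections; dividing by weight shows the big server is the less loaded one relative to its capacity.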
Hash the request key (e.g., user ID) to a point on a virtual ring. Each server owns an arc of the ring. The request goes to whichever server's arc it lands on. The magic: when a server is added or removed, only the keys in the affected arc move — all other assignments stay the same.
This is critical for stateful services. If server A has user 123's session cached in memory, you want every request from user 123 to go to server A. Consistent hashing achieves this without a session table, as long as the server set is stable.
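Below is a compact hash ring with virtual nodes, written as a sketch rather than a production implementation. The choice of MD5, 100 virtual nodes per server, and the `HashRing` name are assumptions made for the example.

```python
import bisect
import hashlib

def ring_hash(key):
    """Map a string to a position on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> server
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.vnodes):
            point = ring_hash(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def remove(self, server):
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key):
        # First ring position clockwise from the key, wrapping at the end.
        i = bisect.bisect(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[i]]

ring = HashRing(["A", "B", "C", "D"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("D")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

Only keys that were on the removed server change owners; every other key keeps its assignment, so the cached sessions on A, B, and C stay useful.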
4 servers with different speeds. Watch how each algorithm distributes load. The fastest server (A) should get the most traffic under smart algorithms.
| Algorithm | State needed | Best for | Weakness |
|---|---|---|---|
| Round-Robin | Counter | Homogeneous servers, uniform requests | Ignores server capacity and request cost |
| Weighted RR | Counter + weights | Heterogeneous hardware | Weights are static; does not adapt to load |
| Least Connections | Per-server conn count | Variable-duration requests | Noisy with very short-lived connections |
| Consistent Hashing | Hash ring | Stateful services, caching | Uneven with few servers (use virtual nodes) |
| Random | None | Large server pools | Variance is high with few servers |
| Power of Two Choices | Per-server conn count | Large pools; low overhead | Slightly less optimal than full least-conn |
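The table's last two rows are easy to compare in a toy experiment. The sketch below assumes 10 servers and 10,000 connections that never close, which exaggerates imbalance but keeps the code short; function names are illustrative.

```python
import random

def pick_random(active, rng):
    return rng.choice(list(active))

def pick_power_of_two(active, rng):
    """Sample two distinct servers and return the less loaded one."""
    a, b = rng.sample(list(active), 2)
    return a if active[a] <= active[b] else b

def simulate(picker, n_servers=10, n_requests=10_000, seed=1):
    rng = random.Random(seed)
    active = {f"s{i}": 0 for i in range(n_servers)}
    for _ in range(n_requests):
        active[picker(active, rng)] += 1   # connections stay open in this toy run
    return active

def spread(load):
    return max(load.values()) - min(load.values())

two_choice = simulate(pick_power_of_two)
purely_random = simulate(pick_random)
print("two choices spread:", spread(two_choice))
print("random spread:     ", spread(purely_random))
```

Comparing just two candidates per request is enough to keep the busiest and idlest servers close together, without scanning every server the way full least-connections does.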
You have 5 backend servers and a load balancer distributing traffic. Server 3 crashes. The load balancer does not know. It keeps sending 20% of requests to a dead server. Those requests time out after 30 seconds. Users see spinning wheels. Your system's effective throughput drops by 20% even though 4 healthy servers could handle the full load — if only the load balancer knew to stop sending traffic to the dead one.
This is why every production load balancer has health checks — periodic probes that test whether each backend is alive and capable of serving traffic. If a server fails health checks, the load balancer removes it from the pool. When it recovers, it is added back.
Health checks come in layers, from simple to thorough:
TCP check: the load balancer opens a TCP connection to the backend's port. If the handshake completes, the server is considered reachable.

HTTP check: the load balancer sends an HTTP request (for example, GET /health) and checks the response. A healthy server returns 200 OK. An unhealthy one returns 500 or times out. This catches application-level failures, database connection issues, and resource exhaustion.

Deep check: the /health endpoint does not just return 200. It actually tests critical dependencies. Can it reach the database? Is the cache responsive? Is disk space above 10%? This catches "gray failures" where the process is running but cannot serve real traffic because a dependency is down.

| Parameter | Typical value | What it controls |
|---|---|---|
| Interval | 5-30 seconds | How often to probe each backend |
| Timeout | 2-5 seconds | Max time to wait for a probe response |
| Unhealthy threshold | 2-3 failures | How many consecutive failures before marking unhealthy |
| Healthy threshold | 2-3 successes | How many consecutive successes before marking healthy again |
The thresholds prevent flapping. A single failed check does not remove a server (it might have been a network blip). The server must fail 2-3 checks in a row before it is removed. Similarly, a recovering server must pass 2-3 checks before it is added back — preventing a partially-recovered server from receiving traffic and immediately failing again.
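The threshold logic amounts to a small state machine per backend. The sketch below uses an unhealthy threshold of 3 and a healthy threshold of 2 as example values; the `HealthState` name is hypothetical.

```python
class HealthState:
    """Track one backend; flip state only after enough consecutive results."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0   # consecutive probes that contradict the current state

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0   # the probe agrees with the current state
        else:
            self._streak += 1
            limit = self.healthy_threshold if probe_ok else self.unhealthy_threshold
            if self._streak >= limit:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy

state = HealthState()
probes = [True, False, True, False, False, False, True, True]
history = [state.record(ok) for ok in probes]
print(history)
```

The single failure in position two is ignored as a blip, three failures in a row remove the server, and it returns only after two consecutive successes.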
5 servers are running. Click "Kill Server" to crash one. Watch the health checker detect the failure and remove it from the pool. Click "Revive" to bring it back.
Kubernetes popularized a distinction that applies broadly:
Liveness probe: "Is the process alive?" If it fails, the orchestrator kills and restarts the container. Use this to detect deadlocks or unrecoverable states. Example: the process is stuck in an infinite loop and cannot serve any requests.
Readiness probe: "Can this instance serve traffic right now?" If it fails, the load balancer stops sending traffic but does NOT restart the container. Use this for temporary conditions: the instance is still warming its cache, loading a model, or waiting for a database migration. Once ready, traffic resumes.
Most load-balancing algorithms treat every request independently. Request 1 goes to server A, request 2 goes to server B, request 3 goes to server C. This is fine for stateless services where any server can handle any request. But many real applications are stateful — they store user sessions in server memory.
Suppose a user logs in and their session token is stored in server A's memory. The next request goes to server B — which has no session for this user. The user sees a login page again. Frustrated, they log in again. Their new session is on server B. The next request goes to server C. Login page. This is the session affinity problem.
Session affinity (also called sticky sessions) ensures that all requests from the same user are routed to the same backend server. The load balancer "remembers" which server a user was assigned to and keeps sending them there.
There are several mechanisms, each with different trade-offs:
Cookie-based affinity: the load balancer sets a cookie (for example, SERVERID=backend-3) on the first response. Subsequent requests include this cookie, and the LB reads it to route to the same backend. This is the most reliable method because it tracks individual users, not IPs. It requires L7, since the balancer must read HTTP cookies.

Session affinity creates a new problem: server draining. You need to take server A down for maintenance. But 200 users have sticky sessions on server A. If you remove it from the pool, those sessions break.
The solution is graceful draining: stop sending new sessions to server A, but continue routing existing sessions there until they expire or complete. Once active sessions drop to zero, safely remove the server.
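Cookie-based stickiness and graceful draining fit together in one small sketch. It assumes cookies arrive as a plain dict and ignores session expiry and cookie tampering; the `StickyBalancer` class is hypothetical, and the cookie name follows the SERVERID example.

```python
import itertools

class StickyBalancer:
    COOKIE = "SERVERID"

    def __init__(self, backends):
        self.backends = list(backends)
        self.draining = set()
        self._rotation = itertools.cycle(self.backends)

    def drain(self, backend):
        """Stop assigning new sessions to this backend."""
        self.draining.add(backend)

    def route(self, cookies):
        """Return (backend, cookies_to_set) for one request."""
        pinned = cookies.get(self.COOKIE)
        if pinned in self.backends:
            return pinned, {}   # existing session: honor it, even while draining
        for _ in range(len(self.backends)):
            candidate = next(self._rotation)
            if candidate not in self.draining:
                return candidate, {self.COOKIE: candidate}
        raise RuntimeError("no backend is accepting new sessions")

lb = StickyBalancer(["backend-1", "backend-2"])
first_backend, set_cookie = lb.route({})     # new user gets pinned
lb.drain("backend-1")
returning = lb.route(set_cookie)[0]          # existing session still served
newcomers = [lb.route({})[0] for _ in range(3)]
print(first_backend, returning, newcomers)
```

Once no request carries a cookie for the draining backend any more, it can be removed from `backends` safely.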
Watch colored users make multiple requests. Without stickiness, they bounce between servers. With stickiness, each user stays on their assigned server.
Everything so far has been about distributing traffic within a single data center. But what if you have data centers in Virginia, Frankfurt, and Tokyo? A user in Japan making requests to your Virginia servers suffers 200ms of round-trip latency on every single request. Multiply that by the dozens of requests a typical page load makes, and the experience is painful.
Global load balancing (GSLB — Global Server Load Balancing) routes users to the nearest or best-performing data center before any local load balancer takes over. There are two primary mechanisms: GeoDNS and anycast.
GeoDNS is DNS with geographic awareness. When a DNS query arrives, the authoritative nameserver looks up the source IP in a GeoIP database (like MaxMind) to determine the client's approximate location. It then returns the IP address of the nearest data center.
GeoDNS has the same caching limitations as regular DNS — a corporate resolver in one city might serve clients in another city, and they all get the same cached answer. But at the global scale (routing to continents or countries), this is usually acceptable.
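The core GeoDNS decision is "nearest data center by distance". The sketch below skips the GeoIP lookup and takes coordinates directly; the data center coordinates are approximate and the IP addresses are placeholders from a documentation range.

```python
import math

# name -> (latitude, longitude, public IP); all values are illustrative.
DATA_CENTERS = {
    "virginia":  (38.9, -77.4, "192.0.2.10"),
    "frankfurt": (50.1,   8.7, "192.0.2.20"),
    "tokyo":     (35.7, 139.7, "192.0.2.30"),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def geodns_answer(lat, lon, healthy=None):
    """Return the IP of the nearest data center among the healthy ones."""
    candidates = healthy if healthy is not None else DATA_CENTERS.keys()
    nearest = min(
        candidates,
        key=lambda name: haversine_km(lat, lon, *DATA_CENTERS[name][:2]),
    )
    return DATA_CENTERS[nearest][2]

print(geodns_answer(34.7, 135.5))   # a client near Osaka
print(geodns_answer(48.9, 2.4))     # a client near Paris
```

In practice the coordinates come from the resolver's IP, not the end user's, which is the source of the inaccuracy noted above.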
Anycast is a fundamentally different approach. Instead of giving each data center a different IP, you give them all the same IP address. Every data center announces this IP via BGP (the internet routing protocol). When a client sends a packet to that IP, the internet's routing infrastructure naturally delivers it to the nearest data center — the one with the shortest BGP path.
Anycast works beautifully for UDP-based protocols (DNS) and short-lived TCP connections (HTTP over TLS 1.3 with 0-RTT). For long-lived TCP connections, anycast has a subtle problem: if BGP routes change mid-connection (path flap), the packets might suddenly be delivered to a different data center, which does not have the TCP state. The connection breaks. This is why Cloudflare and Google use anycast + connection ID (via QUIC) to handle this.
Clients from 3 regions connect to your service. Watch how GeoDNS and Anycast each route them to the nearest data center.
Global load balancing also handles data center failures. If the Tokyo data center goes down:
GeoDNS failover: The health-checking system detects the outage and removes Tokyo's IP from DNS responses. Japanese clients start getting the IP of the next-nearest healthy data center. Failover time depends on the DNS TTL and could be minutes.
Anycast failover: The Tokyo data center stops announcing the IP via BGP. Internet routing automatically converges — within seconds to minutes — and delivers packets to the next-nearest announcing data center. No DNS change needed. This is one of anycast's biggest advantages.
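Anycast failover can be modeled as "shortest path among the sites still announcing the prefix". Real BGP path selection weighs more attributes than path length; the hop counts below are invented for a client in Japan.

```python
def anycast_destination(path_lengths, announcing):
    """Pick the announcing data center with the shortest path from the client."""
    reachable = {dc: hops for dc, hops in path_lengths.items() if dc in announcing}
    return min(reachable, key=reachable.get)

# Hypothetical path lengths as seen from one client network in Japan.
path_lengths = {"tokyo": 2, "virginia": 4, "frankfurt": 5}

normal = anycast_destination(path_lengths, {"tokyo", "virginia", "frankfurt"})
# Tokyo withdraws its announcement; the client keeps using the same IP.
failed_over = anycast_destination(path_lengths, {"virginia", "frankfurt"})
print(normal, "->", failed_over)
```

Nothing on the client changes during failover: the destination IP is the same, and only the route the network chooses is different.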
| Property | GeoDNS | Anycast |
|---|---|---|
| Mechanism | DNS returns different IPs per region | Same IP, BGP routing picks nearest |
| Granularity | Country/city level (GeoIP accuracy) | Network topology (BGP path length) |
| Failover speed | Depends on DNS TTL (minutes) | BGP convergence (seconds to minutes) |
| Long TCP connections | Stable (each DC has its own IP) | Risk of route flaps breaking connections |
| Setup complexity | Low (DNS config) | High (BGP peering, ISP coordination) |
Time to put everything together. Below is a fully interactive load balancer simulation. You control the algorithm, the server health, and the request rate. Watch how requests distribute, see servers fail and recover, and observe how different algorithms handle uneven loads.
Requests arrive at the load balancer and get distributed to backend servers. Kill servers, change algorithms, and adjust request rate to see how the system responds.
Experiment 1: Kill a server under round-robin. Watch how the dead server's share of traffic produces errors until the health checker removes it. Then switch to least-connections and kill a server — notice the difference in response time.
Experiment 2: Crank the request rate to 20/s. Watch server queues fill up. Which algorithm keeps latency lowest? (Hint: least-connections adapts; round-robin does not.)
Experiment 3: Kill 3 of 5 servers. The remaining 2 must handle all traffic. Watch queue depths spike. This is why capacity planning matters: provision enough headroom that the surviving servers can still carry peak load after the number of simultaneous failures you plan for.
Load balancing is one piece of the distributed systems puzzle. Here is how it connects to everything else.
| Concept | Key takeaway |
|---|---|
| DNS LB | Coarse-grained, cached, good for global distribution only |
| L4 LB | Fast (kernel-level), sees TCP 4-tuple, cannot inspect HTTP |
| L7 LB | Smart (content-based routing), slower (HTTP parsing), handles TLS |
| Algorithms | Round-robin for simple, least-conn for adaptive, consistent hashing for stateful |
| Health checks | TCP/HTTP probes with thresholds prevent routing to dead servers |
| Session affinity | Cookie-based preferred; externalized state is better than sticky sessions |
| Global LB | GeoDNS for simplicity, anycast for speed and automatic failover |
| Topic | Connection |
|---|---|
| Service Architecture | Service meshes (Envoy, Istio) build L7 load balancing into every service-to-service call |
| Data Storage | An external session store (such as Redis) removes the need for session affinity; replication handles read scaling |
| Messaging | Message queues decouple producers from consumers, providing a different kind of load distribution |
| Consensus | Leader election decides which replica is primary; the load balancer must know who the leader is |