TCP, TLS, flow control, congestion control — how unreliable wires become reliable channels.
You type a URL into your browser. A fraction of a second later, a web page appears. It feels instantaneous and reliable, like flipping a light switch. But underneath, your request traveled through copper wires, fiber optic cables, and radio waves, passing through dozens of routers and switches, any one of which could have dropped it, delayed it, or delivered it out of order.
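The internet is built on IP (Internet Protocol), and IP makes exactly one promise: best effort. That means any packet can be lost, delayed, duplicated, corrupted, or delivered out of order, with no guarantee that the sender is ever told.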
That is the raw reality of every network on Earth. Every time you send data between two computers, you are throwing a message into a system that promises nothing about delivery, ordering, or integrity.
The simulation below shows what raw IP networking looks like. A sender tries to deliver 8 packets to a receiver. The network is hostile: it drops packets, reorders them, duplicates them, and delays them randomly. Watch the chaos.
Click "Send Packets" to send 8 numbered packets across an unreliable network. Watch what arrives (and what doesn't).
Some packets never arrived. Others arrived out of order. One might have been duplicated. If this were a file transfer, your file would be corrupted. If this were a bank transaction, money could vanish or double. The raw network is completely unusable for anything that matters.
A protocol is a set of rules that two computers agree to follow so they can communicate reliably despite the network's unreliability. TCP (Transmission Control Protocol) is the protocol that solves the chaos you just saw. It guarantees:
| Raw IP | TCP adds |
|---|---|
| Packets can be lost | Reliable delivery — lost packets are detected and retransmitted |
| Packets can arrive out of order | Ordered delivery — packets are reassembled in sequence |
| Packets can be duplicated | Deduplication — duplicates are silently discarded |
| Packets can be corrupted | Integrity checking — checksums detect corruption |
| No flow control | Flow control — sender won't overwhelm a slow receiver |
| No congestion control | Congestion control — sender won't overwhelm the network itself |
Before we dive into TCP, let's understand where it sits. When your application calls send(), the data passes through several layers:
| Layer | Protocol | What it adds | Analogy |
|---|---|---|---|
| Application | HTTP, gRPC, DNS | Request semantics (GET /page) | The letter you write |
| Transport | TCP, UDP, QUIC | Reliability, ordering, ports | The postal service (registered mail vs postcard) |
| Network | IP | Source/destination addresses, routing | The address on the envelope |
| Link | Ethernet, WiFi | Frame encoding, local delivery | The mail truck between sorting offices |
TCP operates at the transport layer. It takes your application's data, breaks it into segments, attaches sequence numbers and checksums, and hands it to IP for delivery. IP routes each segment independently — they might take different paths through the network. TCP at the other end reassembles them in order, detects any that were lost, and delivers a clean byte stream to the receiving application.
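The segment-and-reassemble idea can be sketched in a few lines. This is a toy model rather than real TCP: `MSS`, `segment`, `hostile_network`, and `reassemble` are hypothetical names, the segment size is deliberately tiny, and the "network" only reorders and duplicates.

```python
import random

MSS = 4  # hypothetical tiny segment size so the example stays readable

def segment(data: bytes, isn: int = 0) -> list[tuple[int, bytes]]:
    """Split data into (sequence_number, payload) segments of at most MSS bytes."""
    return [(isn + i, data[i:i + MSS]) for i in range(0, len(data), MSS)]

def hostile_network(segments, rng):
    """Reorder and duplicate segments, as raw IP is allowed to do."""
    delivered = list(segments)
    delivered.append(rng.choice(segments))  # duplicate one segment
    rng.shuffle(delivered)                  # arrive in arbitrary order
    return delivered

def reassemble(segments, isn: int = 0) -> bytes:
    """Rebuild the byte stream: drop duplicates, then order by sequence number."""
    by_seq = {}
    for seq, payload in segments:
        by_seq.setdefault(seq, payload)     # a repeated sequence number is ignored
    stream = bytearray()
    expected = isn
    for seq in sorted(by_seq):
        if seq != expected:                 # a gap: a segment is still missing
            break
        stream += by_seq[seq]
        expected += len(by_seq[seq])
    return bytes(stream)

message = b"GET /page HTTP/1.1"
arrived = hostile_network(segment(message), random.Random(7))
print(reassemble(arrived))
```

Whatever order the shuffle produces, the receiver hands the application the original bytes. If the first segment were missing, `reassemble` would deliver nothing, which is the gap that acknowledgments and retransmission exist to close.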
Over the next chapters, we will build TCP from scratch in your mind. Then we will layer TLS on top for encryption and authentication. By the end, you will understand every byte that flows between your browser and a server.
TCP transforms the unreliable mess of IP into a reliable, ordered byte stream. How? Three mechanisms: sequence numbers, acknowledgments, and retransmission. And before any data flows, the two sides must agree to talk — that is the three-way handshake.
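Before a client and server can exchange data, they must establish a connection. Why? Because TCP needs both sides to agree on initial sequence numbers and allocate resources (buffers, state). The handshake has three steps: (1) the client sends a SYN carrying its initial sequence number (ISN); (2) the server replies with a SYN-ACK that acknowledges the client's ISN and carries the server's own ISN; (3) the client sends an ACK for the server's ISN. After that, both sides are in the ESTABLISHED state and data can flow.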
Once the connection is established, every byte of data gets a sequence number. If the client's ISN was 1000 and it sends 500 bytes, those bytes are numbered 1001 through 1500. The server responds with an ACK number of 1501, meaning "I have received all bytes up to 1500; send me byte 1501 next."
This is how TCP detects loss. If the client sends bytes 1001-1500 and then bytes 1501-2000, but the first segment is lost, the server will keep ACKing 1001 — "I'm still waiting for byte 1001." After three duplicate ACKs for the same number, the client knows the segment was lost and retransmits it.
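In a longer transfer, the server buffers out-of-order segments (for example 1501-2000 and 2001-2500) instead of discarding them. When the missing segment finally arrives, the server delivers all the buffered data at once and jumps the ACK number forward. That behavior is ordinary cumulative acknowledgment with receiver-side buffering. Modern TCP adds selective acknowledgment (SACK, RFC 2018), an option that lets the receiver report exactly which out-of-order blocks it already holds, so the sender retransmits only the gaps.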
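A minimal sketch of the receiver side of this exchange, assuming 500-byte segments and the byte numbering used above. `Receiver` and `on_segment` are hypothetical names, and SACK blocks are not modeled: only the cumulative ACK number is returned.

```python
class Receiver:
    """Toy TCP receiver: cumulative ACK plus buffering of out-of-order data."""

    def __init__(self, next_expected: int):
        self.next_expected = next_expected  # next byte number we are waiting for
        self.buffered = {}                  # start sequence number -> length

    def on_segment(self, seq: int, length: int) -> int:
        """Accept bytes seq..seq+length-1 and return the ACK number to send."""
        if seq >= self.next_expected:
            self.buffered[seq] = length
        # Advance over every buffered segment that is now contiguous.
        while self.next_expected in self.buffered:
            self.next_expected += self.buffered.pop(self.next_expected)
        return self.next_expected

rx = Receiver(next_expected=1001)
acks = [
    rx.on_segment(1501, 500),  # arrives first: segment 1001-1500 was lost
    rx.on_segment(2001, 500),
    rx.on_segment(2501, 500),
    rx.on_segment(1001, 500),  # the retransmission fills the gap
]
print(acks)
```

The first three ACKs all repeat 1001, which is what the sender counts as duplicate ACKs. The last one jumps straight to 3001 because the buffered segments become deliverable the moment the gap is filled.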
Watch the three-way handshake, then data transfer with sequence numbers. Click "Drop Packet" to simulate loss and see retransmission.
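How long does TCP wait before retransmitting a lost packet? It uses a retransmission timeout (RTO) that adapts to the network. TCP continuously measures the round-trip time (RTT), the time between sending a segment and receiving its ACK, and sets the RTO to the smoothed RTT plus a safety margin of four times the measured RTT variation (RFC 6298).
This adaptive approach is crucial. On a transatlantic link with a 150ms RTT and moderate jitter, the RTO might settle somewhere between 300ms and 800ms. On a LAN with a 1ms RTT the computed value would be only a few milliseconds, but implementations clamp it to a floor: RFC 6298 recommends 1 second, and Linux uses 200ms by default. TCP adjusts automatically.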
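The estimator can be sketched as follows, using the smoothing constants from RFC 6298 (alpha = 1/8, beta = 1/4, K = 4). The class name `RtoEstimator` and its `min_rto` parameter are illustrative assumptions; clock granularity and timer backoff are left out.

```python
class RtoEstimator:
    """Sketch of the RFC 6298 retransmission timeout calculation."""

    ALPHA = 1 / 8  # weight of a new sample in the smoothed RTT
    BETA = 1 / 4   # weight of a new sample in the RTT variation
    K = 4          # how many variations of safety margin to add

    def __init__(self, min_rto: float = 1.0):
        self.srtt = None        # smoothed RTT, seconds
        self.rttvar = None      # RTT variation, seconds
        self.min_rto = min_rto  # floor: 1.0 per the RFC, 0.2 on Linux
        self.rto = 1.0          # initial RTO before any measurement

    def on_rtt_sample(self, r: float) -> float:
        if self.srtt is None:
            # First measurement seeds both estimators.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        self.rto = max(self.min_rto, self.srtt + self.K * self.rttvar)
        return self.rto

est = RtoEstimator(min_rto=0.2)
for sample in (0.150, 0.150, 0.180, 0.150):
    print(f"sample={sample * 1000:.0f}ms  rto={est.on_rtt_sample(sample) * 1000:.0f}ms")
```

A steady RTT lets the variation term decay, so the RTO drifts down toward the smoothed RTT. A single slow sample pushes it back up, which is the margin that prevents spurious retransmissions.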
Every TCP segment carries a header with critical fields. Here is what a real TCP header looks like:
The sequence number and acknowledgment number fields are the heart of TCP's reliability. The window field is flow control. The checksum catches corruption. Everything we discussed in this chapter is encoded in these 20 bytes.
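To make those 20 bytes concrete, here is a sketch that decodes the fixed part of a TCP header with `struct`. The field layout follows the standard TCP header; `parse_tcp_header` is a hypothetical helper, the sample values are invented for illustration, and TCP options (anything past byte 20) are ignored.

```python
import struct

def parse_tcp_header(header: bytes) -> dict:
    """Decode the fixed 20-byte portion of a TCP header (options are ignored)."""
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", header[:20])
    flags = offset_flags & 0x01FF  # low 9 bits hold the control flags
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (offset_flags >> 12) * 4,  # data offset counts 32-bit words
        "FIN": bool(flags & 0x001),
        "SYN": bool(flags & 0x002),
        "RST": bool(flags & 0x004),
        "ACK": bool(flags & 0x010),
        "window": window,
        "checksum": checksum,
    }

# A hand-built SYN-ACK: data offset 5 (20 bytes), SYN and ACK flags set.
sample = struct.pack("!HHIIHHHH", 443, 51000, 5000, 1001, (5 << 12) | 0x012, 65535, 0, 0)
fields = parse_tcp_header(sample)
print(fields)
```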
A TCP connection is not just "open" or "closed." It moves through a carefully defined set of states, each serving a specific purpose. Understanding these states is essential for debugging real production problems like port exhaustion, half-open connections, and the infamous TIME_WAIT buildup.
Every TCP connection on your machine is in one of these states at any given moment. You can see them with netstat -an or ss -tan.
| State | Who | What it means |
|---|---|---|
| LISTEN | Server | Waiting for incoming SYN packets. This is a server socket ready to accept connections. |
| SYN_SENT | Client | Client sent SYN, waiting for SYN-ACK. If the server is unreachable, you'll sit here until timeout. |
| SYN_RECEIVED | Server | Server got SYN, sent SYN-ACK, waiting for final ACK. SYN flood attacks exploit this state. |
| ESTABLISHED | Both | Connection is open. Data can flow in both directions. This is the "normal" state. |
| FIN_WAIT_1 | Closer | Sent FIN, waiting for ACK of the FIN. |
| FIN_WAIT_2 | Closer | Got ACK of our FIN. Waiting for the other side's FIN. |
| CLOSE_WAIT | Other | Got FIN from peer. The application hasn't called close() yet. This is a bug if it accumulates. |
| LAST_ACK | Other | Sent our FIN, waiting for ACK. |
| TIME_WAIT | Closer | Both FINs exchanged. Waiting 2×MSL before fully closing. The most misunderstood state. |
| CLOSED | Both | Connection is fully torn down. All resources freed. |
Opening a connection takes 3 messages. Closing takes 4, because each direction closes independently. Think of TCP as two one-way streets — each side must close its own lane.
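The teardown can be written as a small transition table. This is a sketch of the normal close only: the event labels are informal names chosen for this example, and simultaneous close (the CLOSING state) is left out.

```python
# (current state, event) -> next state, for an orderly connection teardown.
TRANSITIONS = {
    ("ESTABLISHED", "close()"):   "FIN_WAIT_1",  # active closer sends FIN
    ("FIN_WAIT_1", "recv ACK"):   "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN"):   "TIME_WAIT",   # sends the final ACK
    ("TIME_WAIT", "2*MSL timer"): "CLOSED",
    ("ESTABLISHED", "recv FIN"):  "CLOSE_WAIT",  # passive side ACKs the FIN
    ("CLOSE_WAIT", "close()"):    "LAST_ACK",    # passive side sends its own FIN
    ("LAST_ACK", "recv ACK"):     "CLOSED",
}

def walk(start: str, events: list[str]) -> list[str]:
    """Return every state visited while applying events in order."""
    state = start
    path = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        path.append(state)
    return path

print(" -> ".join(walk("ESTABLISHED", ["close()", "recv ACK", "recv FIN", "2*MSL timer"])))
print(" -> ".join(walk("ESTABLISHED", ["recv FIN", "close()", "recv ACK"])))
```

The second path shows why CLOSE_WAIT accumulation points at the application: the only way out of that state is the local `close()` call.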
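The Maximum Segment Lifetime (MSL) is the longest time a TCP segment can exist in the network before being discarded. The original TCP specification (RFC 793) sets it at 2 minutes; implementations typically use 30 seconds to 2 minutes. TIME_WAIT nominally lasts 2 × MSL (so 1-4 minutes). Linux does not derive it from an MSL setting: it hardcodes TIME_WAIT to 60 seconds.
Why wait at all? Two reasons. First, the final ACK might be lost; if it is, the peer retransmits its FIN, and the closer must still be around to ACK it again. Second, delayed segments from the old connection may still be in flight; waiting 2 × MSL lets them expire before a new connection can reuse the same pair of addresses and ports.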
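In production, TIME_WAIT buildup is a common problem. The side that closes first holds the TIME_WAIT socket, so a proxy or client that opens and closes thousands of outbound connections per second can accumulate tens of thousands of them and exhaust its ephemeral ports. Solutions include connection pooling (which we'll cover in Chapter 7), tcp_tw_reuse for outbound connections, and SO_REUSEADDR so that a restarted server can rebind its listening port.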
Here is what to look for when you see each state accumulating:
```bash
# Count connections by state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Example output for a healthy server:
#   2847 ESTAB        ← active connections, normal
#    142 TIME-WAIT    ← recently closed, normal if <1000
#     12 LISTEN       ← server sockets, normal
#      3 FIN-WAIT-2   ← closing, transient, normal

# Example output for a SICK server:
#   2847 ESTAB
#  28104 TIME-WAIT    ← PORT EXHAUSTION IMMINENT
#   1893 CLOSE-WAIT   ← APPLICATION BUG (not closing sockets)
#     12 LISTEN
```
Click transitions to move through the TCP state machine. Watch both client and server states change.
You run ss -tan on a busy web server and see 30,000 connections in CLOSE_WAIT. What does this indicate?
Imagine you are reading aloud to a friend who is writing down every word. You speak at 200 words per minute. Your friend writes at 50 words per minute. Within seconds, they fall behind. Words pile up faster than they can be recorded. Information is lost.
The same problem exists in TCP. A fast sender (a powerful server sending a large response) can overwhelm a slow receiver (a mobile phone with limited memory). The receiver's buffer fills up, and new packets have nowhere to go — they get dropped, retransmitted, dropped again. Without some mechanism to match the sender's speed to the receiver's capacity, the connection devolves into an inefficient cycle of overflow and retransmission.
Flow control is TCP's elegant solution: the receiver explicitly tells the sender how much buffer space is available, in every single ACK packet. The sender limits itself to sending only that much unacknowledged data. If the receiver is slow, the sender automatically slows down. If the receiver speeds up, the sender speeds up too. The mechanism is entirely receiver-driven — the receiver is always in control of the pace.
Every TCP ACK packet contains a field called the receive window (rwnd) — a 16-bit number (up to 65,535 bytes, or much larger with window scaling) that says "I have this many bytes of free buffer space." The sender must obey this limit.
As the receiver's application reads data from the buffer, more space frees up, and the receiver advertises a larger window. As the buffer fills, the window shrinks. If the buffer is completely full, the receiver advertises rwnd = 0, and the sender must stop sending entirely.
When the receiver advertises rwnd = 0, the sender enters a zero window state and stops sending data. But how does the sender know when the receiver has space again? The receiver will send a new ACK with a larger window when buffer space frees up — but what if that ACK is lost? The sender would wait forever.
TCP solves this with the persist timer. When the sender sees rwnd = 0, it periodically sends tiny window probe packets (1 byte) to ask "do you have space yet?" The receiver responds with the current window size. This prevents deadlocks.
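A tick-based sketch shows how the advertised window paces the sender. It is a simplification under stated assumptions: one ACK per tick, no packet loss (so the persist timer never has to fire), and hypothetical names and parameters throughout.

```python
def simulate_flow_control(total_bytes, buffer_size, send_per_tick, read_per_tick,
                          max_ticks=1000):
    """Return a list of (tick, bytes_sent, advertised_rwnd) tuples."""
    buffered = 0             # bytes waiting in the receiver's buffer
    remaining = total_bytes  # bytes the sender still has to send
    history = []
    for tick in range(max_ticks):
        if remaining == 0 and buffered == 0:
            break
        rwnd = buffer_size - buffered               # window from the latest ACK
        sent = min(send_per_tick, rwnd, remaining)  # the sender obeys rwnd
        buffered += sent
        remaining -= sent
        buffered -= min(read_per_tick, buffered)    # the application drains data
        history.append((tick, sent, buffer_size - buffered))
    return history

history = simulate_flow_control(total_bytes=10_000, buffer_size=1_000,
                                send_per_tick=500, read_per_tick=100)
for tick, sent, rwnd in history[:6]:
    print(f"tick {tick}: sent {sent:4d} bytes, receiver now advertises rwnd={rwnd}")
```

After a short burst that fills the buffer, the sender settles at exactly the application's read rate, even though it could send five times faster.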
Watch the sender's sliding window advance as ACKs arrive. Drag the "App Read Speed" slider to control how fast the receiver's application consumes data.
When you set the read speed low, watch the receive window shrink. The sender is forced to slow down — that is flow control in action. Set it to 1 and watch the window hit zero: the sender stops completely until the application drains the buffer.
The original TCP header allocated 16 bits for the receive window, giving a maximum of 65,535 bytes. On modern networks with high bandwidth and high latency (e.g., a 10 Gbps transatlantic link with 100ms RTT), 64 KB is absurdly small. You'd need to send 64 KB, wait 100ms for an ACK, send 64 KB, wait, send... Your effective throughput would be 64 KB / 100ms = 640 KB/s — on a 10 Gbps link.
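Window scaling (RFC 7323) fixes this. Each side advertises a shift count from 0 to 14, and the 16-bit window field is multiplied by 2^shift, which raises the maximum window to about 1 GB. The option is negotiated during the three-way handshake: both sides advertise their scale factor in their SYN segments. Modern operating systems enable this by default.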
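The arithmetic behind this is the bandwidth-delay product (BDP): the number of bytes that must be in flight to keep the link full. The helper names below are hypothetical, and the sketch assumes a single connection with no loss.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to fill a link of this bandwidth and RTT."""
    return bandwidth_bps / 8 * rtt_s

def max_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_s

def required_window_shift(window_bytes: float) -> int:
    """Smallest shift s (capped at 14) such that 65535 << s covers window_bytes."""
    shift = 0
    while (65535 << shift) < window_bytes and shift < 14:
        shift += 1
    return shift

link_bps, rtt = 10e9, 0.100  # the 10 Gbps, 100ms example from the text
bdp = bdp_bytes(link_bps, rtt)
print(f"BDP: {bdp / 1e6:.0f} MB")
print(f"Unscaled 64 KB window caps throughput at {max_throughput_bps(65535, rtt) / 1e6:.1f} Mbps")
print(f"Window scale shift needed: {required_window_shift(bdp)}")
```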
These two mechanisms are frequently confused. Let's make the distinction crystal clear:
| Property | Flow Control | Congestion Control |
|---|---|---|
| What it protects | The receiver (application buffer) | The network (router buffers) |
| Who sets the limit | The receiver (advertises rwnd) | The sender (estimates cwnd) |
| Signal | Explicit: receiver tells sender its window | Implicit: sender infers congestion from packet loss or delay |
| What happens when violated | Receiver drops data it can't buffer | Routers drop packets from their queues |
| Sending rate | min(cwnd, rwnd) — whichever is smaller wins | min(cwnd, rwnd) — same formula, different limiter |
In practice, for most web traffic within a datacenter (low RTT, fast receivers), congestion control is the bottleneck — cwnd limits sending during slow start. For cross-continental transfers with slow endpoints, flow control is the bottleneck — rwnd limits sending because the receiver can't process data fast enough.
Flow control prevents a fast sender from overwhelming a slow receiver. But what about the network between them? Routers have finite buffer space. If too many senders push data too fast, router buffers overflow, packets are dropped, and every sender retransmits, making the congestion worse. This is called congestion collapse, and it nearly killed the early internet in 1986. Literally: the internet's usable bandwidth dropped to a tiny fraction of its physical capacity because every sender was aggressively retransmitting lost packets, which caused more congestion, which caused more loss, which caused more retransmission. Van Jacobson's 1988 paper introduced the algorithms we are about to learn, and the internet recovered.
Congestion control is TCP's mechanism for sensing and avoiding network overload. It is separate from flow control: flow control is about the receiver's capacity; congestion control is about the network's capacity. Every TCP sender on Earth runs congestion control independently, and together they cooperatively share the internet's bandwidth without any central coordinator. It is one of the most successful distributed algorithms ever deployed.
In addition to the receiver's window (rwnd), each TCP sender maintains its own congestion window (cwnd), an estimate of how much data the network can handle. The actual sending limit is effective_window = min(cwnd, rwnd).
The sender can never have more than effective_window bytes in flight. Even if the receiver has a huge buffer, the sender will self-limit based on its perception of network congestion.
When a new connection starts, the sender has no idea how much bandwidth the network can handle. In the classic algorithm it starts with cwnd = 1 MSS (maximum segment size, typically 1460 bytes); modern stacks start at 10 MSS (RFC 6928). For every ACK received, cwnd increases by 1 MSS. Since every segment sent in one RTT is acknowledged in the next, the window roughly doubles each RTT and cwnd grows exponentially: 1, 2, 4, 8, 16, 32...
"Slow start" is a misnomer — the growth is exponential, so it ramps up quickly. It continues until cwnd reaches a threshold called ssthresh (slow start threshold).
Once cwnd ≥ ssthresh, TCP switches to congestion avoidance. Now cwnd grows linearly: increase by 1 MSS per RTT (not per ACK). This is the additive increase phase — carefully probing for more bandwidth without being aggressive.
When a packet loss is detected (via timeout or triple duplicate ACK), TCP assumes the network is congested and cuts cwnd dramatically — typically by half. This is the multiplicative decrease. Together, this is AIMD: Additive Increase, Multiplicative Decrease.
There are two ways to detect loss:
| Detection Method | Response | Why |
|---|---|---|
| Timeout | cwnd = 1 MSS, enter slow start | A timeout means severe congestion — no ACKs at all. Start over from scratch. |
| Triple duplicate ACK | cwnd = cwnd/2, enter congestion avoidance | Three duplicate ACKs mean one segment was lost, but later segments DID arrive. The network isn't completely dead — just reduce the rate by half. |
Fast retransmit means retransmitting the lost segment immediately upon receiving the third duplicate ACK, without waiting for the full timeout. Fast recovery means starting congestion avoidance (linear growth) from the halved window rather than going back to slow start. These two optimizations make TCP much more responsive to isolated packet losses.
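The three phases can be traced with a per-RTT simulation. This is a sketch in the style of classic Reno, not any kernel's implementation: `simulate_cwnd` is a hypothetical function, cwnd is counted in MSS units, growth is applied once per RTT, and slow start is capped at ssthresh for simplicity.

```python
def simulate_cwnd(rtts, loss_at, ssthresh=16.0, fast_recovery=True):
    """Return cwnd (in MSS) at the start of each RTT.

    loss_at: set of RTT indices at which a loss is detected.
    fast_recovery=True models a triple duplicate ACK, False models a timeout.
    """
    cwnd = 1.0
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_at:
            ssthresh = max(cwnd / 2, 2.0)             # multiplicative decrease
            cwnd = ssthresh if fast_recovery else 1.0
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)            # slow start: double per RTT
        else:
            cwnd += 1.0                               # congestion avoidance: +1 MSS per RTT
    return trace

print(simulate_cwnd(10, loss_at={7}))
print(simulate_cwnd(10, loss_at={7}, fast_recovery=False))
```

The first trace halves the window at the loss and resumes linear growth. The second drops to 1 MSS and slow-starts again, which is why a timeout is so much more expensive than a triple duplicate ACK.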
Watch cwnd grow exponentially (slow start), then linearly (congestion avoidance). Click "Drop Packet" to trigger loss. Green = slow start, blue = congestion avoidance.
The sawtooth is the signature of AIMD. Linear growth probes for available bandwidth. Each drop halves the window. Over time, TCP oscillates around the maximum capacity the network can handle. It is a beautifully simple and robust mechanism.
One of AIMD's most elegant properties is fairness. If two TCP connections share a bottleneck link, they converge to equal bandwidth — with no communication between them.
Imagine two senders, A and B, sharing a 100 Mbps link. A has cwnd=80, B has cwnd=40. Total = 120, which exceeds capacity. Packets are dropped. Both halve: A=40, B=20. Now both increase linearly: A=41, B=21, then A=42, B=22. At the next drop both halve again. Over time, the ratio converges to 1:1. This works because additive increase leaves the absolute gap between the two windows unchanged, while each multiplicative decrease cuts that gap in half (40, then 20, then 10, and so on). The gap shrinks toward zero while both windows keep oscillating around their fair share.
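The convergence is easy to check numerically. The sketch below assumes both flows see every loss event at the same time and have the same RTT, which is the idealized case; `aimd_share` is a hypothetical name.

```python
def aimd_share(a: float, b: float, capacity: float, rounds: int):
    """Run two AIMD flows against a shared capacity; return their final windows."""
    for _ in range(rounds):
        if a + b > capacity:
            a, b = a / 2, b / 2  # both detect loss: multiplicative decrease
        else:
            a, b = a + 1, b + 1  # no loss: additive increase
    return a, b

a, b = aimd_share(80, 40, capacity=100, rounds=200)
print(f"A={a:.2f}  B={b:.2f}  gap={a - b:.3f}")
```

Flows with different RTTs do not converge this cleanly in practice, because the flow with the shorter RTT gets to increase more often.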
| Property | CUBIC | BBR |
|---|---|---|
| Default in | Linux (since 2006) | Google's servers, YouTube, Google Cloud |
| Signal used | Packet loss | Bandwidth and RTT estimation |
| Growth function | Cubic function of time since last loss | Pacing rate based on estimated bottleneck bandwidth |
| Strength | Good fairness with other CUBIC/Reno flows | Excellent on high-BDP links, tolerant of random loss |
| Weakness | Underutilizes high-BDP links, suffers from random loss | Can be unfair to CUBIC flows, complex to tune |
| Key insight | After a loss, quickly return to the cwnd where loss occurred (the "plateau") | Loss does not necessarily mean congestion — measure the actual bottleneck rate |
TCP gives us reliable, ordered delivery. But it provides zero security. Every byte you send over TCP is transmitted in plaintext. Anyone on the network path — your ISP, a coffee shop WiFi operator, a compromised router — can read your passwords, credit card numbers, and private messages. They can also modify them in transit without you knowing.
TLS (Transport Layer Security) solves three problems at once:
| Problem | TLS Solution | Mechanism |
|---|---|---|
| Eavesdropping | Encryption | Symmetric encryption (AES-GCM, ChaCha20) makes data unreadable to anyone except the two endpoints |
| Impersonation | Authentication | Digital certificates + CA chain prove the server is who it claims to be |
| Tampering | Integrity | HMAC / AEAD authentication tags detect any modification to the ciphertext |
TLS uses both types of encryption, each for a different purpose:
Asymmetric cryptography (RSA, ECDHE) uses a key pair: a public key anyone can have, and a private key only the owner holds. It is slow but solves the key exchange problem: two strangers can agree on a shared secret over a public channel.
Symmetric encryption (AES-256-GCM, ChaCha20-Poly1305) uses a single shared key for both encryption and decryption. It is fast — 100-1000x faster than asymmetric. Once both sides have the shared secret, all data is encrypted symmetrically.
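TLS 1.3 (the current standard) completes the handshake in a single round trip, down from two in TLS 1.2. The steps: (1) ClientHello: the client sends its supported cipher suites and an ephemeral key share. (2) ServerHello: the server picks a cipher suite and returns its own key share; from this point both sides can derive handshake keys, so the rest of the handshake is encrypted. (3) The server sends its Certificate, a CertificateVerify signature over the handshake, and Finished. (4) The client verifies the certificate and signature, sends its own Finished, and can start sending application data.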
How does the client know the server's certificate is legitimate? Through a chain of trust:
The client walks the chain: verify the server cert was signed by the intermediate CA, verify the intermediate was signed by a root CA that the client trusts. If any link fails, the connection is rejected.
TLS 1.3 mandates forward secrecy (also called perfect forward secrecy, PFS). This means that even if the server's private key is compromised in the future, past recorded conversations cannot be decrypted.
How? Because each connection generates a new ephemeral Diffie-Hellman key pair. The shared secret is derived from these ephemeral keys, not directly from the server's long-term key. The server's private key is only used to authenticate (sign the handshake), not to encrypt. Once the connection closes, the ephemeral keys are discarded.
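The ephemeral exchange can be illustrated with classic finite-field Diffie-Hellman. This is a toy for intuition only: real TLS 1.3 uses groups such as X25519, the prime below is far too small and too structured for real use, and the function names are hypothetical.

```python
import secrets

# Toy parameters for illustration only; NOT suitable for real cryptography.
P = 2**127 - 1  # a Mersenne prime
G = 3

def ephemeral_keypair():
    """Generate a fresh (private, public) pair for one connection."""
    private = secrets.randbelow(P - 2) + 1
    public = pow(G, private, P)
    return private, public

def shared_secret(my_private: int, their_public: int) -> int:
    """Both sides compute the same value without ever sending a private key."""
    return pow(their_public, my_private, P)

# One connection: each side generates a key pair and sends only the public half.
client_priv, client_pub = ephemeral_keypair()
server_priv, server_pub = ephemeral_keypair()
assert shared_secret(client_priv, server_pub) == shared_secret(server_priv, client_pub)

# The next connection uses brand-new keys, so its secret is unrelated to the first.
client_priv2, client_pub2 = ephemeral_keypair()
print(client_pub != client_pub2)
```

Because the private halves exist only in memory for the life of the connection, a later compromise of the server's long-term key gives an attacker nothing with which to recompute these secrets.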
Watch the TLS handshake in real time. Each message shows what it contains and what both sides know.
```python
import socket
import ssl

# Create a TLS-wrapped TCP connection
context = ssl.create_default_context()  # loads system CA certs

with socket.create_connection(("example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="example.com") as tls:
        # Handshake happens automatically
        print(tls.version())   # 'TLSv1.3'
        print(tls.cipher())    # ('TLS_AES_256_GCM_SHA384', 'TLSv1.3', 256)
        cert = tls.getpeercert()
        print(cert['subject'])  # server identity

        # Send an HTTP request over TLS
        tls.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
        response = tls.recv(4096)
        print(response.decode())
```
```python
# Inspecting TLS details programmatically
import ssl

context = ssl.create_default_context()

# Verify hostname and certificate (already the defaults for this context)
context.check_hostname = True
context.verify_mode = ssl.CERT_REQUIRED

# Pin to specific TLS version (TLS 1.3 only)
context.minimum_version = ssl.TLSVersion.TLSv1_3
context.maximum_version = ssl.TLSVersion.TLSv1_3

# Restricting cipher suites: set_ciphers() only controls TLS 1.2-and-earlier
# suites. It does not select TLS 1.3 suites, and passing only TLS 1.3 names such
# as "TLS_AES_256_GCM_SHA384" typically raises ssl.SSLError, so with the version
# pin above the call is omitted.
```
TCP is brilliant for most applications: web browsing, file transfer, database queries, API calls. But its guarantees come at a cost. For some use cases, TCP's reliability mechanisms actually hurt performance.
TCP delivers data as an ordered byte stream. If segment 5 out of 10 is lost, segments 6-10 must wait in the receiver's buffer until segment 5 is retransmitted and arrives. The application sees nothing until the gap is filled. This is head-of-line blocking.
For a file transfer, this is fine — you need the bytes in order. But for real-time applications, it is devastating:
| Application | Why TCP hurts | What you actually want |
|---|---|---|
| Video streaming | One lost frame blocks all subsequent frames. Player stutters. | Skip the lost frame and play the next one. |
| Online gaming | One lost position update blocks all future updates. Player teleports. | Use the latest position, ignore stale ones. |
| Voice calls | 200ms of audio delayed while waiting for retransmission. | Play silence for the gap, keep audio flowing. |
| HTTP/2 multiplexing | One lost packet on stream A blocks streams B, C, D (they share one TCP connection). | Independent streams that don't interfere. |
UDP (User Datagram Protocol) is the anti-TCP. It provides almost nothing: no reliability, no ordering, no flow control, no congestion control. You send a datagram, and it either arrives or it doesn't. Each datagram is independent.
UDP is not "unreliable TCP." It is a blank canvas. Applications that use UDP implement exactly the guarantees they need and nothing more. A game might implement reliable delivery for chat messages but fire-and-forget for position updates. A video player might implement FEC (forward error correction) to recover from loss without retransmission.
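```python
import socket
import struct
import time

def get_player_position():
    """Placeholder for the game's own logic: return the local player's (x, y)."""
    return 0.0, 0.0

def update_player(addr, x, y):
    """Placeholder for the game's own logic: apply a position update."""
    print(addr, x, y)

# UDP sender: fire-and-forget position updates
def send_positions(server=("game-server.example.com", 9000)):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = 0
    while True:
        x, y = get_player_position()
        # Pack: sequence number + position
        data = struct.pack("!Iff", seq, x, y)
        sock.sendto(data, server)
        seq += 1
        time.sleep(1 / 60)  # 60 Hz update rate

# UDP receiver: always use LATEST position, drop stale
def receive_positions(sock):
    latest_seq = {}  # tracked per sender so one player's counter cannot mask another's
    while True:
        data, addr = sock.recvfrom(1024)
        seq, x, y = struct.unpack("!Iff", data)
        if seq > latest_seq.get(addr, -1):  # ignore out-of-order (stale)
            latest_seq[addr] = seq
            update_player(addr, x, y)
        # else: discard, we already have newer data
```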
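QUIC (originally "Quick UDP Internet Connections") is the protocol underneath HTTP/3. It runs on UDP but provides reliable, ordered delivery per stream, mandatory built-in TLS 1.3 encryption, a handshake that completes in 1 RTT (0 RTT for repeat connections), connection migration via connection IDs, and congestion control implemented in userspace.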
Choosing between TCP, UDP, and QUIC is not about which is "better" — it is about matching your application's requirements to the protocol's guarantees. The decision framework is simple: ask yourself two questions. First: "Can my application tolerate data loss?" If no, you need reliability (TCP or QUIC). If yes, UDP. Second: "Am I multiplexing independent streams?" If yes, QUIC avoids head-of-line blocking. If no, TCP is simpler and universally supported.
Three protocols, each sending 4 independent data streams. A packet loss occurs in stream 2. Watch how each protocol handles it. TCP blocks everything. UDP loses the data. QUIC blocks only stream 2.
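QUIC is not just "TCP over UDP." It rethinks the transport layer from the ground up. One critical implementation detail: application data travels in STREAM frames, and each frame carries a stream ID and a byte offset within that stream, rather than a position in one connection-wide byte sequence.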
The fact that each stream has its own offset is what eliminates head-of-line blocking. If stream 3's offset is missing, QUIC only blocks stream 3's data. Streams 1, 2, and 4 have their own offsets and can be delivered to the application immediately.
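The difference can be sketched by asking what each transport may hand to the application after one packet is lost. This is a toy model, not a QUIC implementation: packets are `(stream_id, payload)` tuples in send order, `lost` is a set of packet indices, and both function names are hypothetical.

```python
def deliverable_tcp(packets, lost):
    """One shared byte stream: nothing after the first gap can be delivered."""
    delivered = []
    for index, (stream_id, payload) in enumerate(packets):
        if index in lost:
            break  # everything behind the gap waits, whichever stream it belongs to
        delivered.append((stream_id, payload))
    return delivered

def deliverable_quic(packets, lost):
    """Per-stream offsets: a gap blocks only the stream it belongs to."""
    delivered = []
    blocked_streams = set()
    for index, (stream_id, payload) in enumerate(packets):
        if index in lost:
            blocked_streams.add(stream_id)
        elif stream_id not in blocked_streams:
            delivered.append((stream_id, payload))
    return delivered

packets = [(1, "a1"), (2, "b1"), (3, "c1"), (4, "d1"),
           (1, "a2"), (2, "b2"), (3, "c2"), (4, "d2")]
lost = {1}  # the first packet of stream 2 is dropped

print("TCP :", deliverable_tcp(packets, lost))
print("QUIC:", deliverable_quic(packets, lost))
```

With TCP only the data ahead of the gap gets through. With per-stream offsets, streams 1, 3, and 4 are delivered in full and only stream 2 waits for the retransmission.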
Another major difference: how many round trips before data can flow.
| Feature | TCP | UDP | QUIC |
|---|---|---|---|
| Reliability | Yes (whole stream) | No | Yes (per stream) |
| Ordering | Yes (whole stream) | No | Yes (per stream) |
| Encryption | Optional (TLS) | No | Mandatory (built-in TLS 1.3) |
| Head-of-line blocking | Yes | No | No (across streams) |
| Setup latency | 1-2 RTT | 0 RTT | 1 RTT (0 for repeat) |
| Connection migration | No (tied to IP:port) | N/A | Yes (connection ID) |
| Congestion control | Yes (kernel) | No | Yes (userspace) |
| Used by | HTTP/1.1, HTTP/2, SSH, databases | DNS, gaming, video, VoIP | HTTP/3, Google services |
You understand the protocols. Now let's talk about operating them in production. Every day, engineers debug TCP problems that waste hours — and the symptoms are always the same: "it's slow" or "connections are failing." The fix depends on understanding which TCP mechanism is misbehaving.
Nagle's algorithm (RFC 896) solves the "small packet problem." If your application writes 1 byte at a time (e.g., keystrokes in a telnet session), TCP would send a 41-byte packet (20 bytes IP header + 20 bytes TCP header + 1 byte data) for each byte of payload. This is wildly inefficient.
Nagle's algorithm says: if there is unacknowledged data in flight, buffer small writes and send them as one segment when the ACK arrives. This batches small writes into larger, efficient segments.
For interactive applications (SSH, gaming, RPC services), Nagle's algorithm adds intolerable latency. The fix is to set the TCP_NODELAY socket option: setsockopt(TCP_NODELAY, 1). This disables Nagle's algorithm, sending every write immediately regardless of size. Most modern web servers, database drivers, and RPC frameworks set TCP_NODELAY.
```python
import socket

# Disable Nagle's algorithm for low-latency RPC
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Verify it's set
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(f"TCP_NODELAY: {'enabled' if nodelay else 'disabled'}")

# Note: Python's http.client and requests library set this by default
# But raw socket connections do NOT; you must set it yourself
```
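We mentioned Nagle's algorithm, but the real devil is its interaction with delayed ACKs. The TCP receiver doesn't ACK every segment immediately: it waits up to 200ms (the delayed ACK timer, typically 40-200ms depending on the OS) hoping to piggyback the ACK on a response data segment. This is efficient when the application sends a response quickly. But when combined with Nagle's algorithm, disaster strikes. Suppose the client writes a request in two small pieces, such as headers and then a body. The first piece goes out immediately. Nagle holds the second piece until the first is ACKed. The server cannot respond because it has only half the request, so it has no data to piggyback an ACK on and its delayed ACK timer runs to expiry. Both sides sit idle until that timer fires.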
This is not a theoretical problem. It causes real 200ms latency spikes in production systems every day. The solution is TCP_NODELAY on interactive/RPC connections. There is almost never a reason to leave Nagle's algorithm enabled on a service that makes request-response calls.
TCP keep-alive detects dead connections. By default, TCP has no heartbeat — if one side crashes without sending FIN (power failure, OOM kill), the other side has no idea. It sits in ESTABLISHED state forever, holding resources.
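Keep-alive is off by default and is enabled per socket. A minimal sketch follows: `enable_keepalive` is a hypothetical helper, the timing values are illustrative, and the three tuning options are platform-specific (they exist on Linux), so the code skips any that the local `socket` module does not expose.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=6):
    """Turn on TCP keep-alive and, where supported, tune its timing.

    idle: seconds of silence before the first probe
    interval: seconds between probes
    probes: unanswered probes before the connection is declared dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)  # missing on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print("SO_KEEPALIVE:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
sock.close()
```

With these values a dead peer is detected after roughly idle + interval × probes = 120 seconds, instead of the connection sitting in ESTABLISHED indefinitely.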
Each TCP connection costs 1 RTT (handshake) + memory (buffers, state) + a TIME_WAIT socket when closed. If your application opens a new connection for every request (like naive HTTP/1.0), you pay this cost thousands of times per second.
Connection pooling keeps a set of established connections open and reuses them for multiple requests. This eliminates handshake latency and TIME_WAIT buildup.
```python
import urllib3

# urllib3 connection pool: reuses TCP connections
pool = urllib3.HTTPSConnectionPool(
    "api.example.com",
    port=443,
    maxsize=10,  # keep up to 10 connections alive
    block=True,  # wait for a free connection if all 10 are busy
    retries=3,   # retry on connection failure
)

# First request: opens connection (1 RTT handshake + TLS)
resp1 = pool.request("GET", "/users/1")

# Second request: reuses existing connection (0 RTT overhead)
resp2 = pool.request("GET", "/users/2")

# Connection stays alive between requests (HTTP keep-alive)
```
These are the three most common TCP problems in production. Learn to recognize the symptoms and you will save hours of debugging.
connect() timeout after 10s. Your application can't establish TCP connections to a remote service. The SYN packets are being sent but no SYN-ACK comes back. Causes: (1) Firewall dropping SYN packets silently (most common). (2) Server's listen backlog is full: too many pending connections. (3) Server is down. Debug: tcpdump -i any 'tcp[tcpflags] & tcp-syn != 0' and host X.X.X.X. If you see SYN going out but no SYN-ACK coming back, it's a network/firewall issue.
Throughput far below link capacity. Causes: (1) Small receive window: check ss -tin for rcv_space. (2) High RTT with small window: a BDP problem. (3) Packet loss causing cwnd to stay small. Debug: ss -tin dst X.X.X.X shows cwnd, rwnd, RTT, retransmissions.
Cannot assign requested address or EADDRNOTAVAIL. You've run out of ephemeral ports because thousands of connections are in TIME_WAIT. Cause: high-volume short-lived connections (e.g., a load balancer creating a new connection for every request). Fix: (1) Connection pooling (best). (2) net.ipv4.tcp_tw_reuse = 1 (safe with TCP timestamps). (3) SO_REUSEADDR on the socket.
```bash
# View current TCP settings
sysctl net.ipv4.tcp_wmem                # send buffer (min, default, max)
sysctl net.ipv4.tcp_rmem                # receive buffer (min, default, max)
sysctl net.ipv4.tcp_congestion_control  # cubic, bbr, etc.

# Enable BBR congestion control (better for high-latency links)
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Increase buffer sizes for high-BDP links
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"  # 16 MB max
sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"

# Allow TIME_WAIT socket reuse
sysctl -w net.ipv4.tcp_tw_reuse=1

# Aggressive keep-alive for production services
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=6

# Monitor TCP state distribution
ss -tan state time-wait | wc -l   # count TIME_WAIT sockets
ss -tan state close-wait | wc -l  # count CLOSE_WAIT (application bug!)
ss -tin dst X.X.X.X               # detailed TCP info for connections to X
```
The simulation below models a production scenario. A client sends requests to a server. You can introduce different problems — Nagle delay, small receive window, TIME_WAIT buildup — and watch the metrics change in real time. This is what ss -tin output looks like, made visual.
Click a scenario to introduce a TCP problem. Watch the metrics dashboard update. Then apply the fix.
These are not hypothetical — these are actual production incidents that TCP misconfigurations cause.
Networking is a favorite interview topic for backend, infrastructure, and distributed systems roles. Here is everything you need to answer confidently.
| Question | TCP | UDP | QUIC |
|---|---|---|---|
| Connection-oriented? | Yes (3-way handshake) | No (connectionless) | Yes (1 RTT handshake) |
| Reliable? | Yes (ACK + retransmit) | No | Yes (per stream) |
| Ordered? | Yes (single stream) | No | Yes (per stream) |
| Flow control? | Yes (receive window) | No | Yes (per stream + connection) |
| Congestion control? | Yes (AIMD in kernel) | No | Yes (pluggable, userspace) |
| Encryption? | Optional (TLS layer) | Optional (DTLS) | Mandatory (built-in) |
| Header size | 20-60 bytes | 8 bytes | Variable (~20 bytes) |
| HoL blocking? | Yes (whole connection) | No | No (across streams) |
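The "connection-oriented" and "ordered" rows can be made concrete with a loopback experiment. This is a minimal sketch, assuming loopback networking is available; `udp_roundtrip` and `tcp_roundtrip` are hypothetical helper names. UDP needs no handshake and hands back whole datagrams, while TCP performs a handshake and delivers one continuous byte stream with no message boundaries.

```python
import socket

def udp_roundtrip(messages):
    """Send datagrams over loopback UDP; each recvfrom returns one whole datagram."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    rx.settimeout(2.0)          # UDP promises nothing, so never wait forever
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for m in messages:
            tx.sendto(m, rx.getsockname())  # no handshake, just send
        return [rx.recvfrom(2048)[0] for _ in messages]
    finally:
        tx.close()
        rx.close()

def tcp_roundtrip(messages):
    """Send the same messages over loopback TCP; the receiver sees a byte stream."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    tx = socket.create_connection(srv.getsockname())  # 3-way handshake happens here
    rx, _ = srv.accept()
    rx.settimeout(2.0)
    try:
        for m in messages:
            tx.sendall(m)
        tx.close()  # sends FIN, so the reader sees end-of-stream
        chunks = []
        while True:
            data = rx.recv(2048)  # may return any slice of the stream
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        tx.close()
        rx.close()
        srv.close()

if __name__ == "__main__":
    print(udp_roundtrip([b"hello", b"world"]))  # separate datagrams
    print(tcp_roundtrip([b"hello", b"world"]))  # one stream, boundaries gone
```

On loopback the datagrams almost always arrive intact and in order, but nothing in UDP guarantees it, which is why the receiver sets a timeout. An application protocol on TCP has to add its own framing (length prefixes or delimiters) to recover message boundaries.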
If an interviewer says "explain TCP congestion control on a whiteboard," here is the structure:
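One way to rehearse the shape of that answer is a tiny round-by-round simulation. The sketch below is a simplified Reno-style model, not a faithful TCP implementation: `simulate_cwnd` is a hypothetical helper, the window is counted in MSS, a loss halves the window (fast-recovery style) rather than resetting it, and the initial window defaults to 1 MSS so the doubling is easy to see.

```python
def simulate_cwnd(rounds, ssthresh=16, loss_rounds=frozenset(), initial_cwnd=1):
    """Return cwnd (in MSS) at the start of each RTT round."""
    cwnd = initial_cwnd
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds:
            # Multiplicative decrease: remember half the window and restart from it.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double every RTT until ssthresh is reached.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: additive increase of 1 MSS per RTT.
            cwnd += 1
    return history

if __name__ == "__main__":
    # Loss detected in round 7 produces the first tooth of the sawtooth.
    print(simulate_cwnd(10, ssthresh=16, loss_rounds={7}))
    # [1, 2, 4, 8, 16, 17, 18, 19, 9, 10]
```

The three phases to name on the whiteboard map directly onto the three branches: exponential growth, linear growth, and the halving on loss.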
| What | Value | Why it matters |
|---|---|---|
| Speed of light in fiber | ~200,000 km/s (2/3 of c) | NYC to London: 5,500 km = 27.5ms one-way minimum |
| TCP handshake | 1 RTT | Same datacenter: ~0.5ms. Cross-continent: ~150ms |
| TLS 1.3 handshake | 1 RTT | Adds 1 RTT on top of TCP. QUIC combines both in 1 RTT total. |
| TCP slow start IW | 10 MSS (~14 KB) | First RTT can only send 14 KB. Critical for small web page loads. |
| Ephemeral port range | ~28,000 ports | Linux default: 32768-60999. Each TIME_WAIT uses one for 1-4 minutes. |
| MSL (Maximum Segment Lifetime) | 2 minutes (RFC 793); 30 seconds effective on Linux | TIME_WAIT = 2 * MSL. Linux hard-codes TIME_WAIT at 60 seconds (TCP_TIMEWAIT_LEN) |
| Delayed ACK timer | 40-200ms | Combined with Nagle = hidden 200ms latency |
| Default TCP buffer | 64 KB (many systems) | On 100ms RTT: max throughput = 64 KB / 100ms = 640 KB/s |
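Several rows in this table are simple arithmetic, and it helps to be able to re-derive them. The following sketch uses hypothetical helper names and takes its inputs from the table; the TIME_WAIT duration is left as a parameter because it differs between systems.

```python
FIBER_KM_PER_S = 200_000  # roughly 2/3 of the speed of light in vacuum

def one_way_delay_ms(distance_km):
    """Minimum propagation delay through fiber, in milliseconds."""
    return distance_km / FIBER_KM_PER_S * 1000

def window_limited_throughput(window_bytes, rtt_s):
    """At most one window of data can be in flight per round trip (bytes/s)."""
    return window_bytes / rtt_s

def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    """Bandwidth-delay product: the buffer needed to keep the link full."""
    return bandwidth_bits_per_s / 8 * rtt_s

def max_new_connections_per_s(ports, time_wait_s):
    """Sustainable rate of short-lived connections to one destination
    (same destination IP and port) before ephemeral ports run out."""
    return ports / time_wait_s

if __name__ == "__main__":
    print(f"NYC to London one-way: {one_way_delay_ms(5500):.1f} ms")
    kib_s = window_limited_throughput(64 * 1024, 0.100) / 1024
    print(f"64 KB window at 100 ms RTT: {kib_s:.0f} KB/s")
    print(f"BDP of 1 Gbit/s at 100 ms: {bdp_bytes(1e9, 0.100) / 1e6:.1f} MB")
    ports = 60999 - 32768 + 1  # Linux default ephemeral range
    for tw in (60, 120):
        rate = max_new_connections_per_s(ports, tw)
        print(f"TIME_WAIT {tw} s: about {rate:.0f} new connections/s")
```

The last figure is the one that surprises people: a few hundred new connections per second to a single backend is enough to exhaust the default port range, which is why connection pooling matters.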
```python
# Drill 1: TCP echo server with connection pooling awareness
import selectors
import socket

sel = selectors.DefaultSelector()

def accept(sock):
    conn, addr = sock.accept()
    conn.setblocking(False)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle
    sel.register(conn, selectors.EVENT_READ, data=echo)

def echo(conn):
    try:
        data = conn.recv(4096)
    except ConnectionError:
        data = b""  # peer reset the connection: treat it like a close
    if data:
        # Fine for a drill. On a non-blocking socket sendall can raise
        # BlockingIOError when the send buffer is full; a production server
        # would buffer the data and wait for EVENT_WRITE instead.
        conn.sendall(data)
    else:
        sel.unregister(conn)
        conn.close()  # important! prevents CLOSE_WAIT leak

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("", 8000))
srv.listen(128)  # backlog size
srv.setblocking(False)
sel.register(srv, selectors.EVENT_READ, data=accept)

while True:
    for key, mask in sel.select():
        key.data(key.fileobj)  # dispatch to accept() or echo()
```
```python
# Drill 2: Measure TCP handshake + TLS overhead
import socket, ssl, time

def measure_connect(host, port=443):
    # TCP handshake
    t0 = time.perf_counter()
    sock = socket.create_connection((host, port))
    tcp_ms = (time.perf_counter() - t0) * 1000

    # TLS handshake
    t1 = time.perf_counter()
    ctx = ssl.create_default_context()
    tls = ctx.wrap_socket(sock, server_hostname=host)
    tls_ms = (time.perf_counter() - t1) * 1000

    print(f"TCP handshake: {tcp_ms:.1f}ms")
    print(f"TLS handshake: {tls_ms:.1f}ms")
    print(f"Total: {tcp_ms+tls_ms:.1f}ms")
    print(f"TLS version: {tls.version()}")
    tls.close()

measure_connect("google.com")
# Example output (values vary by network):
# TCP handshake: 12.3ms
# TLS handshake: 28.7ms
# Total: 41.0ms
# TLS version: TLSv1.3
```
When under interview pressure, these mental models let you reason about networking questions without memorizing every detail:
You now have a solid understanding of how computers communicate reliably, securely, and efficiently over networks. This knowledge is the foundation for everything in distributed systems.
| Chapter | Core Concept | One-line Summary |
|---|---|---|
| 0 | The Problem | IP is unreliable — packets get lost, reordered, duplicated, corrupted |
| 1 | TCP Reliability | Sequence numbers + ACKs + retransmission make IP reliable |
| 2 | Connection Lifecycle | 11 states, 4-way close, TIME_WAIT prevents stale segments |
| 3 | Flow Control | Receive window prevents sender from overwhelming receiver |
| 4 | Congestion Control | AIMD (sawtooth) prevents sender from overwhelming the network |
| 5 | TLS/SSL | Asymmetric key exchange + symmetric encryption + certificate auth |
| 6 | Custom Protocols | UDP for real-time, QUIC for multiplexed streams without HoL blocking |
| 7 | Network in Practice | TCP_NODELAY, keep-alive, connection pooling, debugging |
| 8 | Interview Arsenal | Cheat sheets, whiteboard patterns, design questions |
Everything in this lesson is prerequisite knowledge for distributed systems topics:
| Topic | Why networking matters | Related Lesson |
|---|---|---|
| Distributed Trouble | Network partitions, unreliable message delivery, timeout-based failure detection — all built on TCP's limitations | The Trouble with Distributed Systems |
| Replication | Leader-follower, multi-leader, leaderless — all depend on network reliability for consistency guarantees | Replication (coming soon) |
| Consensus | Paxos, Raft, ZAB — all designed around the assumption that the network is unreliable and asynchronous | Consensus (coming soon) |
| RPC Frameworks | gRPC uses HTTP/2 (TCP), connection pooling, keep-alive, TLS. Understanding the transport layer is essential for debugging RPC issues | RPC and Service Mesh (coming soon) |
ss -tin, tcpdump, check for retransmissions, check rwnd/cwnd, check connection states. Most "application bugs" are TCP misconfigurations. Your debugging toolkit: ss -tan for connection states, ss -tin dst X.X.X.X for per-connection metrics (cwnd, rwnd, RTT, retransmits), tcpdump -w capture.pcap to capture packets for Wireshark analysis, and traceroute to identify where packets are being dropped or delayed.

Understanding where these protocols came from helps you see where they are going.
| Year | Milestone | Why it mattered |
|---|---|---|
| 1974 | TCP/IP proposed (Cerf & Kahn) | First design for end-to-end reliable communication over unreliable networks |
| 1981 | TCP and IP published as separate protocols (RFC 793, RFC 791) | The layer split let UDP (RFC 768, 1980) serve applications that don't need reliability |
| 1986 | Internet congestion collapse | Motivated Van Jacobson's congestion control algorithms (1988) |
| 1995 | SSL 2.0 (Netscape) | First widely deployed encryption for web traffic |
| 1999 | TLS 1.0 (RFC 2246) | Standardized SSL, became the basis for HTTPS |
| 2006 | CUBIC replaces BIC in Linux | Better high-bandwidth performance, still loss-based |
| 2012 | QUIC development begins at Google | UDP-based transport to solve TCP's head-of-line blocking |
| 2016 | BBR published by Google | Model-based congestion control: measure, don't infer from loss |
| 2018 | TLS 1.3 (RFC 8446) | 1-RTT handshake, removed insecure ciphers, mandatory forward secrecy |
| 2021 | QUIC standardized (RFC 9000) | Became the standard transport for HTTP/3 (RFC 9114, 2022), adopted by major browsers and CDNs |
The trend is clear: move complexity from the kernel to userspace (QUIC), reduce round trips (TLS 1.3, 0-RTT), and decouple streams (QUIC multiplexing). The next frontier is likely kernel bypass (DPDK) and lower-overhead kernel I/O (io_uring) for the lowest-latency applications, and post-quantum cryptography for TLS.
If you want to go deeper into any of these topics, these are the definitive sources:
| Topic | Source | Why read it |
|---|---|---|
| TCP internals | RFC 9293 (TCP specification, 2022) | The definitive TCP standard, replacing the original RFC 793 from 1981 |
| Congestion control | Van Jacobson, "Congestion Avoidance and Control" (1988) | The paper that saved the internet. Introduces slow start and congestion avoidance. |
| TLS 1.3 | RFC 8446 | Complete TLS 1.3 specification. Surprisingly readable for an RFC. |
| QUIC | RFC 9000 + RFC 9001 (QUIC-TLS) | The QUIC transport protocol and its integration with TLS 1.3 |
| BBR | Cardwell et al., "BBR: Congestion-Based Congestion Control" (2016) | Google's model-based congestion control — fundamentally different approach from loss-based |
| Practical TCP | Ilya Grigorik, High Performance Browser Networking (free online) | Excellent coverage of TCP, TLS, HTTP/2, and performance optimization |
If you are building a production networked service, here is the minimum you must get right:
"The nice thing about standards is that you have so many to choose from." — Andrew S. Tanenbaum