Distributed Systems

Network Foundations

TCP, TLS, flow control, congestion control — how unreliable wires become reliable channels.

Prerequisites: Basic networking concepts (IP addresses, ports). That's it.
10 Chapters, 9 Simulations, 0 Assumed Knowledge

Chapter 0: The Problem

You type a URL into your browser. A fraction of a second later, a web page appears. It feels instantaneous and reliable, like flipping a light switch. But underneath, your request traveled through copper wires, fiber optic cables, and radio waves, passing through dozens of routers and switches, any one of which could have dropped it, delayed it, or delivered it out of order.

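The internet is built on IP (Internet Protocol), and IP makes exactly one promise: best effort. That means any packet may be lost, delayed, duplicated, corrupted, or delivered out of order, and IP will not tell you when it happens.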

That is the raw reality of every network on Earth. Every time you send data between two computers, you are throwing a message into a system that promises nothing about delivery, ordering, or integrity.

The central question of this lesson. How do we build reliable, ordered, secure communication on top of a network that loses, reorders, duplicates, and corrupts data? The answer is TCP and TLS — two protocols that transform chaos into a dependable byte stream. Understanding how they work is the foundation of every distributed system.

The Unreliable Network

The simulation below shows what raw IP networking looks like. A sender tries to deliver 8 packets to a receiver. The network is hostile: it drops packets, reorders them, duplicates them, and delays them randomly. Watch the chaos.

Unreliable Network Simulator

Click "Send Packets" to send 8 numbered packets across an unreliable network. Watch what arrives (and what doesn't).


Some packets never arrived. Others arrived out of order. One might have been duplicated. If this were a file transfer, your file would be corrupted. If this were a bank transaction, money could vanish or double. The raw network is completely unusable for anything that matters.
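
For readers without the interactive simulator, the same behavior can be approximated in a few lines of Python. This is a minimal sketch, not a model of any real network: the function name `unreliable_send` and the drop and duplicate probabilities are assumptions chosen for illustration.

```python
import random

def unreliable_send(packets, rng, drop=0.2, duplicate=0.1):
    """Return what a receiver might see after a best-effort network handles `packets`."""
    delivered = []
    for packet in packets:
        if rng.random() < drop:
            continue                      # packet lost
        delivered.append(packet)
        if rng.random() < duplicate:
            delivered.append(packet)      # packet duplicated
    rng.shuffle(delivered)                # arbitrary reordering
    return delivered

rng = random.Random(42)                   # fixed seed so runs are repeatable
sent = list(range(1, 9))
received = unreliable_send(sent, rng)
print("sent:    ", sent)
print("received:", received)
print("missing: ", sorted(set(sent) - set(received)))
```

Changing the seed or the probabilities gives a different mix of loss, duplication, and reordering, which is exactly the uncertainty an application on raw IP would have to handle itself.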

Why We Need a Protocol

A protocol is a set of rules that two computers agree to follow so they can communicate reliably despite the network's unreliability. TCP (Transmission Control Protocol) is the protocol that solves the chaos you just saw. It guarantees:

| Raw IP | TCP adds |
| --- | --- |
| Packets can be lost | Reliable delivery: lost packets are detected and retransmitted |
| Packets can arrive out of order | Ordered delivery: packets are reassembled in sequence |
| Packets can be duplicated | Deduplication: duplicates are silently discarded |
| Packets can be corrupted | Integrity checking: checksums detect corruption |
| No flow control | Flow control: sender won't overwhelm a slow receiver |
| No congestion control | Congestion control: sender won't overwhelm the network itself |

Where TCP Lives: The Network Stack

Before we dive into TCP, let's understand where it sits. When your application calls send(), the data passes through several layers:

| Layer | Protocol | What it adds | Analogy |
| --- | --- | --- | --- |
| Application | HTTP, gRPC, DNS | Request semantics (GET /page) | The letter you write |
| Transport | TCP, UDP, QUIC | Reliability, ordering, ports | The postal service (registered mail vs postcard) |
| Network | IP | Source/destination addresses, routing | The address on the envelope |
| Link | Ethernet, WiFi | Frame encoding, local delivery | The mail truck between sorting offices |

TCP operates at the transport layer. It takes your application's data, breaks it into segments, attaches sequence numbers and checksums, and hands it to IP for delivery. IP routes each segment independently — they might take different paths through the network. TCP at the other end reassembles them in order, detects any that were lost, and delivers a clean byte stream to the receiving application.

Key insight: TCP is end-to-end. Routers in the middle only see IP packets. They do not understand TCP. They cannot retransmit lost segments or reorder data. All of TCP's reliability mechanisms run on the sender and receiver endpoints. This is the end-to-end principle — intelligence at the edges, simplicity in the core. It is why the internet scales.

Over the next chapters, we will build TCP from scratch in your mind. Then we will layer TLS on top for encryption and authentication. By the end, you will understand every byte that flows between your browser and a server.

Quick check: You send packets 1, 2, 3, 4 over raw IP. The receiver gets packets 3, 1, 1 (packets 2 and 4 are lost). Which of the following problems does this demonstrate?

Chapter 1: TCP Reliability

TCP transforms the unreliable mess of IP into a reliable, ordered byte stream. How? Three mechanisms: sequence numbers, acknowledgments, and retransmission. And before any data flows, the two sides must agree to talk — that is the three-way handshake.

The Three-Way Handshake

Before a client and server can exchange data, they must establish a connection. Why? Because TCP needs both sides to agree on initial sequence numbers and allocate resources (buffers, state). The handshake has three steps:

1. SYN (Client → Server)
Client picks a random initial sequence number (ISN), say 1000. Sends a SYN packet: "I want to connect. My starting sequence number is 1000."
2. SYN-ACK (Server → Client)
Server picks its own ISN, say 5000. Sends SYN-ACK: "I accept. My starting sequence number is 5000. I acknowledge your sequence number 1000 (next byte I expect from you: 1001)."
3. ACK (Client → Server)
Client acknowledges: "I acknowledge your sequence number 5000 (next byte I expect from you: 5001)." Connection is now ESTABLISHED on both sides.

Why three steps and not two? Imagine a two-way handshake: Client sends SYN, server sends SYN-ACK, done. Now imagine the client's SYN was delayed in the network for 30 seconds. The client times out and sends a new SYN. The old SYN finally arrives, and the server accepts it and thinks a connection is established. But the client has already given up on that old SYN, so it knows nothing about this "ghost" connection. The server wastes resources waiting for data that will never come. The third step (ACK) proves the client is still alive and actually wants this connection.
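
The sequence and acknowledgment numbers in the three steps follow a simple rule: a SYN consumes one sequence number, so each side acknowledges the peer's ISN plus one. The sketch below only computes those numbers; the function name `three_way_handshake` and the tuple layout are illustrative assumptions, and no packets are actually sent.

```python
def three_way_handshake(client_isn, server_isn):
    """Return (sender, flags, seq, ack) tuples for SYN, SYN-ACK, and ACK."""
    wrap = 2 ** 32                                   # sequence numbers are 32-bit
    syn = ("client", "SYN", client_isn, None)
    # A SYN consumes one sequence number, so the server expects client_isn + 1 next.
    syn_ack = ("server", "SYN-ACK", server_isn, (client_isn + 1) % wrap)
    ack = ("client", "ACK", (client_isn + 1) % wrap, (server_isn + 1) % wrap)
    return [syn, syn_ack, ack]

for sender, flags, seq, ack in three_way_handshake(1000, 5000):
    print(f"{sender:6} {flags:8} seq={seq} ack={ack}")
```

With ISNs of 1000 and 5000 this reproduces the numbers in the steps above: the server acknowledges 1001 and the client acknowledges 5001.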

Sequence Numbers and Acknowledgments

Once the connection is established, every byte of data gets a sequence number. If the client's ISN was 1000 and it sends 500 bytes, those bytes are numbered 1001 through 1500. The server responds with an ACK number of 1501, meaning "I have received all bytes up to 1500; send me byte 1501 next."

This is how TCP detects loss. If the client sends bytes 1001-1500 and then bytes 1501-2000, but the first segment is lost, the server will keep ACKing 1001 — "I'm still waiting for byte 1001." After three duplicate ACKs for the same number, the client knows the segment was lost and retransmits it.

// TCP sequence number and ACK example

Client ISN: 1000
Server ISN: 5000

Client → Server: SEQ=1001, DATA=[500 bytes] // bytes 1001-1500
Server → Client: ACK=1501 // "got it, send 1501 next"
Client → Server: SEQ=1501, DATA=[500 bytes] // bytes 1501-2000
Server → Client: ACK=2001 // "got it, send 2001 next"

// Now imagine the first segment is lost:
Client → Server: SEQ=1001, DATA=[500 bytes] // LOST in network!
Client → Server: SEQ=1501, DATA=[500 bytes] // arrives out of order
Server → Client: ACK=1001 // duplicate ACK #1: "I still need 1001!"
Client → Server: SEQ=2001, DATA=[500 bytes] // arrives
Server → Client: ACK=1001 // duplicate ACK #2
Client → Server: SEQ=2501, DATA=[500 bytes] // arrives
Server → Client: ACK=1001 // duplicate ACK #3 → fast retransmit
Client → Server: SEQ=1001, DATA=[500 bytes] // retransmission
Server → Client: ACK=3001 // buffered 1501-3000, now got 1001-1500 too

Notice that the server buffered the out-of-order segments (1501-2000, 2001-2500, and 2501-3000). Each one triggered another ACK for 1001, because TCP's ACK is cumulative: it always names the next in-order byte the receiver needs. When the missing segment finally arrived, the server delivered all the buffered data at once and jumped the ACK number forward. Modern TCP adds selective acknowledgment (SACK), an option that lets the receiver also report which out-of-order ranges it already holds, so the sender retransmits only the gaps.
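
The receiver's side of this behavior can be sketched as a small class that buffers out-of-order segments and always answers with a cumulative ACK. This is a simplified illustration under stated assumptions: the class name `Receiver` is hypothetical, segments never partially overlap, and sequence-number wraparound is ignored.

```python
class Receiver:
    """Toy TCP receiver: buffers out-of-order segments, returns cumulative ACKs."""

    def __init__(self, next_expected):
        self.next_expected = next_expected   # next in-order byte we need
        self.buffered = {}                   # seq -> payload, held out of order
        self.delivered = bytearray()         # bytes handed to the application

    def on_segment(self, seq, payload):
        if seq >= self.next_expected:        # ignore data we already delivered
            self.buffered.setdefault(seq, payload)   # duplicates are dropped
        # Drain every buffered segment that is now contiguous.
        while self.next_expected in self.buffered:
            data = self.buffered.pop(self.next_expected)
            self.delivered += data
            self.next_expected += len(data)
        return self.next_expected            # cumulative ACK number

rx = Receiver(next_expected=1001)
acks = [
    rx.on_segment(1501, b"B" * 500),   # arrives early: first segment was lost
    rx.on_segment(2001, b"C" * 500),
    rx.on_segment(1001, b"A" * 500),   # retransmission fills the gap
]
print(acks)   # [1001, 1001, 2501]
```

The repeated 1001 values are the duplicate ACKs the sender counts, and the jump to 2501 happens in a single step once the gap is filled.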

Animated TCP Handshake & Data Transfer

TCP Handshake & Data Transfer

Watch the three-way handshake, then data transfer with sequence numbers. Click "Drop Packet" to simulate loss and see retransmission.


Retransmission Timeout (RTO)

How long does TCP wait before retransmitting a lost packet? It uses a retransmission timeout (RTO) that adapts to the network. TCP continuously measures the round-trip time (RTT) — how long between sending a segment and receiving its ACK — and sets the RTO to be slightly larger than the smoothed RTT.

// Jacobson's algorithm for computing RTO (standardized in RFC 6298)

// For each new RTT sample, update RTTVAR first so it uses the previous SRTT
RTTVAR = (1 - β) × RTTVAR + β × |SRTT - RTT_sample| // mean deviation (β = 1/4)
SRTT = (1 - α) × SRTT + α × RTT_sample // smoothed RTT (α = 1/8)
RTO = SRTT + 4 × RTTVAR // timeout = smoothed mean + 4 × mean deviation

// Example: if SRTT = 50ms, RTTVAR = 10ms
RTO = 50 + 4 × 10 = 90ms // before any minimum-RTO clamp is applied

// On each consecutive timeout, TCP doubles the RTO (exponential backoff)
1st timeout: RTO = 90ms
2nd timeout: RTO = 180ms
3rd timeout: RTO = 360ms
// ... up to a max (typically 60-120 seconds)

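This adaptive approach is crucial. On a LAN with 1ms RTT, the formula yields an RTO of a few milliseconds; on a transatlantic link with 150ms RTT, it might yield several hundred milliseconds. Real stacks then clamp the result to a floor: RFC 6298 recommends a minimum of 1 second, and Linux uses 200ms by default. TCP adjusts automatically.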
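
To make the update rule concrete, here is a small estimator in the style of RFC 6298. It is a sketch, not a kernel implementation: the class name `RtoEstimator` and its parameters are assumptions, times are in milliseconds, and `min_rto` defaults to 0 so the raw formula stays visible (real stacks apply a floor).

```python
class RtoEstimator:
    """RTO estimation in the style of RFC 6298 (times in milliseconds)."""

    ALPHA = 1 / 8
    BETA = 1 / 4

    def __init__(self, min_rto=0.0, max_rto=60_000.0):
        self.srtt = None
        self.rttvar = None
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.rto = 1000.0                    # initial RTO before any sample

    def on_sample(self, rtt):
        if self.srtt is None:                # first measurement
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = self._clamp(self.srtt + 4 * self.rttvar)
        return self.rto

    def on_timeout(self):
        self.rto = self._clamp(self.rto * 2)  # exponential backoff
        return self.rto

    def _clamp(self, value):
        return max(self.min_rto, min(self.max_rto, value))

est = RtoEstimator()
for sample in (50, 50, 50, 60, 40):
    print(f"sample={sample}ms  rto={est.on_sample(sample):.1f}ms")
print("after timeout:", est.on_timeout())
```

Steady samples shrink RTTVAR and pull the RTO toward the smoothed RTT, while a jittery sample widens it again. Passing `min_rto=200` would mimic the Linux floor mentioned above.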

Putting It All Together: A Complete TCP Segment

Every TCP segment carries a header with critical fields. Here is what a real TCP header looks like:

// TCP Header (20 bytes minimum, up to 60 with options)

| Source Port (16 bits)           | Destination Port (16 bits) |
| Sequence Number (32 bits)                                    |
| Acknowledgment Number (32 bits)                              |
| Data Offset | Reserved | Flags  | Window (16 bits)           |
| Checksum (16 bits)              | Urgent Pointer (16 bits)   |
| Options (variable, up to 40 bytes)                           |
| Data...                                                      |

// Flags (the six classic control bits; ECN later added CWR and ECE):
SYN = "I want to synchronize sequence numbers" (connection open)
ACK = "My acknowledgment number field is valid"
FIN = "I'm finished sending" (connection close)
RST = "Something is wrong, abort the connection"
PSH = "Push this data to the application immediately"
URG = "There is urgent data (rarely used)"

The sequence number and acknowledgment number fields are the heart of TCP's reliability. The window field is flow control. The checksum catches corruption. Everything we discussed in this chapter is encoded in these 20 bytes.
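
The fixed 20-byte layout maps directly onto Python's `struct` module. The sketch below packs and parses a header with no options; the helper names `build_header` and `parse_header` are assumptions for illustration, and the checksum is left at 0 rather than computed.

```python
import struct

TCP_HEADER = struct.Struct("!HHIIHHHH")   # 20 bytes, network byte order
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def build_header(src_port, dst_port, seq, ack, flags, window):
    """Pack a 20-byte TCP header with no options (checksum left as 0)."""
    data_offset = 5                                   # header length in 32-bit words
    flag_value = sum(FLAG_BITS[name] for name in flags)
    offset_and_flags = (data_offset << 12) | flag_value
    return TCP_HEADER.pack(src_port, dst_port, seq, ack, offset_and_flags, window, 0, 0)

def parse_header(raw):
    """Unpack the first 20 bytes of a TCP segment into a dict of fields."""
    src, dst, seq, ack, offset_and_flags, window, checksum, urgent = TCP_HEADER.unpack(raw[:20])
    return {
        "src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
        "header_len": (offset_and_flags >> 12) * 4,
        "flags": {name for name, bit in FLAG_BITS.items() if offset_and_flags & bit},
        "window": window, "checksum": checksum, "urgent": urgent,
    }

syn_ack = build_header(80, 51000, seq=5000, ack=1001, flags={"SYN", "ACK"}, window=65535)
print(len(syn_ack), parse_header(syn_ack))
```

The data offset occupies the top 4 bits of the 16-bit word that also carries the flags, which is why the parser shifts by 12 and multiplies by 4 to get the header length in bytes.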

Quick check: A TCP receiver has received bytes 1-1000 and bytes 2001-3000 (but NOT bytes 1001-2000, which were lost). What ACK number does the receiver send?

Chapter 2: Connection Lifecycle

A TCP connection is not just "open" or "closed." It moves through a carefully defined set of states, each serving a specific purpose. Understanding these states is essential for debugging real production problems like port exhaustion, half-open connections, and the infamous TIME_WAIT buildup.

The TCP State Machine

Every TCP connection on your machine is in one of these states at any given moment. You can see them with netstat -an or ss -tan.

| State | Who | What it means |
| --- | --- | --- |
| LISTEN | Server | Waiting for incoming SYN packets. This is a server socket ready to accept connections. |
| SYN_SENT | Client | Client sent SYN, waiting for SYN-ACK. If the server is unreachable, you'll sit here until timeout. |
| SYN_RECEIVED | Server | Server got SYN, sent SYN-ACK, waiting for final ACK. SYN flood attacks exploit this state. |
| ESTABLISHED | Both | Connection is open. Data can flow in both directions. This is the "normal" state. |
| FIN_WAIT_1 | Closer | Sent FIN, waiting for ACK of the FIN. |
| FIN_WAIT_2 | Closer | Got ACK of our FIN. Waiting for the other side's FIN. |
| CLOSE_WAIT | Other | Got FIN from peer. The application hasn't called close() yet. This is a bug if it accumulates. |
| LAST_ACK | Other | Sent our FIN, waiting for ACK. |
| TIME_WAIT | Closer | Both FINs exchanged. Waiting 2×MSL before fully closing. The most misunderstood state. |
| CLOSED | Both | Connection is fully torn down. All resources freed. |

The Four-Way Close

Opening a connection takes 3 messages. Closing takes 4, because each direction closes independently. Think of TCP as two one-way streets — each side must close its own lane.

1. FIN (A → B)
A says "I'm done sending data." A enters FIN_WAIT_1. A can still receive data from B.
2. ACK (B → A)
B acknowledges A's FIN. A moves to FIN_WAIT_2. B enters CLOSE_WAIT. B can still send data to A.
3. FIN (B → A)
B finishes sending and sends its own FIN. B enters LAST_ACK.
4. ACK (A → B)
A acknowledges B's FIN. A enters TIME_WAIT. B enters CLOSED as soon as this ACK arrives. A waits 2×MSL, then enters CLOSED.
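
The four steps can be encoded as a transition table and replayed for each side. This is a sketch covering only the teardown path described above; the event names and the `walk` helper are assumptions made for illustration, not identifiers from any TCP implementation.

```python
# (current state, event) -> next state, for the teardown path only
CLOSE_TRANSITIONS = {
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN"): "TIME_WAIT",     # sends the final ACK
    ("TIME_WAIT", "2MSL timeout"): "CLOSED",
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",   # sends ACK of the peer's FIN
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}

def walk(events, state="ESTABLISHED"):
    """Apply events in order and return every state visited."""
    path = [state]
    for event in events:
        state = CLOSE_TRANSITIONS[(state, event)]
        path.append(state)
    return path

active = walk(["send FIN", "recv ACK", "recv FIN", "2MSL timeout"])
passive = walk(["recv FIN", "send FIN", "recv ACK"])
print("A:", " -> ".join(active))
print("B:", " -> ".join(passive))
```

Side A (the closer) passes through TIME_WAIT, while side B never does. An event that is not legal in the current state raises `KeyError`, which is a reasonable way for a toy model to flag a protocol violation.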

TIME_WAIT: Why Wait 2×MSL?

The Maximum Segment Lifetime (MSL) is the longest time a TCP segment can exist in the network before being discarded. RFC 793 specifies 2 minutes; implementations commonly assume 30 seconds to 2 minutes. TIME_WAIT lasts 2 × MSL, so 1 to 4 minutes in principle. Linux hardcodes the TIME_WAIT duration itself at 60 seconds.

Why wait at all? Two reasons:

Reason 1: Ensure the final ACK arrives. If A's final ACK (step 4) is lost, B will retransmit its FIN. A needs to still be around to re-send the ACK. If A had immediately closed, B's retransmitted FIN would get a RST (reset), confusing B.
Reason 2: Prevent stale segments from a previous connection. Imagine connection (A:port 5000, B:port 80) closes. If A immediately reuses port 5000 for a new connection to B:80, a delayed packet from the OLD connection could arrive and be mistakenly accepted as part of the NEW connection. TIME_WAIT ensures all old packets expire before the port can be reused.

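In production, TIME_WAIT buildup is a common problem. A busy host that closes thousands of connections per second can accumulate tens of thousands of TIME_WAIT sockets. When that host is the one opening the connections (a proxy, or a service calling its backends), each TIME_WAIT socket holds an ephemeral port, and the pool can run out. Mitigations include connection pooling (which we'll cover in Chapter 7), the Linux tcp_tw_reuse setting for outbound connections, and SO_REUSEADDR, which lets a restarted server bind its listening port while old connections are still in TIME_WAIT.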

Diagnosing Connection States in Production

Here is what to look for when you see each state accumulating:

bash
# Count connections by state
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn   # NR>1 skips the header row

# Example output for a healthy server:
#  2847 ESTAB        ← active connections, normal
#   142 TIME-WAIT    ← recently closed, normal if <1000
#    12 LISTEN       ← server sockets, normal
#     3 FIN-WAIT-2   ← closing, transient, normal

# Example output for a SICK server:
#  2847 ESTAB
# 28104 TIME-WAIT    ← PORT EXHAUSTION IMMINENT
#  1893 CLOSE-WAIT   ← APPLICATION BUG (not closing sockets)
#    12 LISTEN

The CLOSE_WAIT alarm. CLOSE_WAIT means "the remote side closed, but our application hasn't closed the socket." If you see hundreds or thousands of CLOSE_WAIT connections, you have a resource leak in your application code. The fix is always in your code: you're forgetting to close sockets, probably in an error path. This is one of the most common production bugs in long-running services.
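
The same triage can be scripted. The sketch below parses `ss -tan` style text and flags the two patterns discussed above. The sample text, the function names, and the thresholds are all hypothetical and chosen only for illustration; suitable limits depend on the workload.

```python
from collections import Counter

def count_states(ss_output):
    """Count TCP states in `ss -tan` style text, skipping the header row."""
    lines = ss_output.strip().splitlines()[1:]
    return Counter(line.split()[0] for line in lines if line.strip())

def find_warnings(counts, close_wait_limit=100, time_wait_limit=20000):
    # The limits are illustrative assumptions, not recommended values.
    found = []
    if counts["CLOSE-WAIT"] > close_wait_limit:
        found.append("CLOSE-WAIT is piling up: look for sockets the application never closes")
    if counts["TIME-WAIT"] > time_wait_limit:
        found.append("TIME-WAIT is high: outbound ports may run short")
    return found

# Hypothetical sample in the shape of `ss -tan` output.
sample = """State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.0.5:443        10.0.0.9:51000
CLOSE-WAIT 1      0      10.0.0.5:443        10.0.0.7:40112
TIME-WAIT  0      0      10.0.0.5:443        10.0.0.8:40200
CLOSE-WAIT 1      0      10.0.0.5:443        10.0.0.6:40113
"""
counts = count_states(sample)
print(counts.most_common())
print(find_warnings(counts, close_wait_limit=1))
```

In practice the text would come from running `ss -tan` (for example through `subprocess.run`), and the check would run periodically so that a slow leak shows up as a rising count.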

TCP State Machine Visualization

TCP Connection State Machine

Click transitions to move through the TCP state machine. Watch both client and server states change.

Quick check: You run ss -tan on a busy web server and see 30,000 connections in CLOSE_WAIT. What does this indicate?

Chapter 3: Flow Control

Imagine you are reading aloud to a friend who is writing down every word. You speak at 200 words per minute. Your friend writes at 50 words per minute. Within seconds, they fall behind. Words pile up faster than they can be recorded. Information is lost.

The same problem exists in TCP. A fast sender (a powerful server sending a large response) can overwhelm a slow receiver (a mobile phone with limited memory). The receiver's buffer fills up, and new packets have nowhere to go — they get dropped, retransmitted, dropped again. Without some mechanism to match the sender's speed to the receiver's capacity, the connection devolves into an inefficient cycle of overflow and retransmission.

Flow control is TCP's elegant solution: the receiver explicitly tells the sender how much buffer space is available, in every single ACK packet. The sender limits itself to sending only that much unacknowledged data. If the receiver is slow, the sender automatically slows down. If the receiver speeds up, the sender speeds up too. The mechanism is entirely receiver-driven — the receiver is always in control of the pace.

The Receive Window

Every TCP segment header, including every ACK, contains a field called the receive window (rwnd): a 16-bit number (up to 65,535 bytes, or much larger with window scaling) that says "I have this many bytes of free buffer space." The sender must obey this limit.

// Flow control constraint

bytes_in_flight = last_byte_sent - last_byte_acked

Rule: bytes_in_flight ≤ rwnd

// Example:
Receiver buffer size: 64 KB
Data received but not read by application: 40 KB
Available buffer: 64 - 40 = 24 KB
→ Receiver advertises rwnd = 24 KB
→ Sender can have at most 24 KB of unacknowledged data in flight

As the receiver's application reads data from the buffer, more space frees up, and the receiver advertises a larger window. As the buffer fills, the window shrinks. If the buffer is completely full, the receiver advertises rwnd = 0, and the sender must stop sending entirely.
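
The constraint reduces to a one-line calculation on the sender. A minimal sketch, using the variable names from the rule above (the function name `sendable_bytes` is an assumption):

```python
def sendable_bytes(rwnd, last_byte_sent, last_byte_acked):
    """Bytes the sender may still transmit without exceeding the advertised window."""
    bytes_in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - bytes_in_flight)

KB = 1024
print(sendable_bytes(rwnd=24 * KB, last_byte_sent=10 * KB, last_byte_acked=0))   # 14336
print(sendable_bytes(rwnd=0, last_byte_sent=10 * KB, last_byte_acked=10 * KB))   # 0
```

The `max(0, ...)` guard covers the case where the receiver shrinks its window below what is already in flight: the sender cannot recall those bytes, it simply sends nothing more.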

The sliding window analogy. Think of the sender as looking through a window at the data stream. The window's width is the most recently advertised rwnd, so it can grow or shrink over time. Data to the left of the window has been sent and acknowledged and is done. Data inside the window can be sent. Data to the right of the window is waiting its turn. As ACKs arrive, the window slides right, revealing new data that can be sent. This is why it is called the sliding window protocol.

Zero Window and the Persist Timer

When the receiver advertises rwnd = 0, the sender enters a zero window state and stops sending data. But how does the sender know when the receiver has space again? The receiver will send a new ACK with a larger window when buffer space frees up — but what if that ACK is lost? The sender would wait forever.

TCP solves this with the persist timer. When the sender sees rwnd = 0, it periodically sends tiny window probe packets (1 byte) to ask "do you have space yet?" The receiver responds with the current window size. This prevents deadlocks.

Sliding Window Animation

TCP Sliding Window

Watch the sender's sliding window advance as ACKs arrive. Drag the "App Read Speed" slider to control how fast the receiver's application consumes data.


When you set the read speed low, watch the receive window shrink. The sender is forced to slow down — that is flow control in action. Set it to 1 and watch the window hit zero: the sender stops completely until the application drains the buffer.

Window Scaling

The original TCP header allocated 16 bits for the receive window, giving a maximum of 65,535 bytes. On modern networks with high bandwidth and high latency (e.g., a 10 Gbps transatlantic link with 100ms RTT), 64 KB is absurdly small. You'd need to send 64 KB, wait 100ms for an ACK, send 64 KB, wait, send... Your effective throughput would be 64 KB / 100ms = 640 KB/s — on a 10 Gbps link.

// The bandwidth-delay product problem

Bandwidth = 10 Gbps = 1.25 GB/s
RTT = 100ms
BDP = Bandwidth × RTT = 1.25 GB/s × 0.1s = 125 MB

// To fully utilize the link, you need 125 MB of data in flight
// But rwnd max is 64 KB without scaling
// Utilization = 64 KB / 125 MB = 0.05%

// Window scaling (RFC 1323, since replaced by RFC 7323): a scale factor S negotiated in handshake
Actual window = rwnd × 2^S
// S can be 0-14, so max window = 65535 × 2^14 = ~1 GB

Window scaling is negotiated during the three-way handshake using a TCP option. Both sides advertise their scale factor. Modern operating systems enable this by default.
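
The arithmetic above is easy to check in code. This sketch computes the bandwidth-delay product and the smallest scale factor that covers it; the function names `bdp_bytes` and `min_window_scale` are assumptions for illustration.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bits_per_s / 8 * rtt_s

def min_window_scale(target_window_bytes, max_shift=14):
    """Smallest shift S such that 65535 * 2**S covers the target, or None."""
    for shift in range(max_shift + 1):
        if 65535 * 2 ** shift >= target_window_bytes:
            return shift
    return None

bdp = bdp_bytes(10e9, 0.100)                 # 10 Gbps link, 100 ms RTT
print(f"BDP: {bdp / 1e6:.0f} MB")
print(f"throughput with an unscaled window: {65535 / 0.100 / 1e3:.0f} KB/s")
print("scale factor needed:", min_window_scale(bdp))
```

For the 10 Gbps, 100ms example, a shift of 11 is the first value whose maximum window (about 134 MB) exceeds the 125 MB the link needs in flight.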

Flow Control vs. Congestion Control

These two mechanisms are frequently confused. Let's make the distinction crystal clear:

| Property | Flow Control | Congestion Control |
| --- | --- | --- |
| What it protects | The receiver (application buffer) | The network (router buffers) |
| Who sets the limit | The receiver (advertises rwnd) | The sender (estimates cwnd) |
| Signal | Explicit: receiver tells sender its window | Implicit: sender infers congestion from packet loss or delay |
| What happens when violated | Receiver drops data it can't buffer | Routers drop packets from their queues |
| Sending rate | min(cwnd, rwnd): whichever is smaller wins | min(cwnd, rwnd): same formula, different limiter |

In practice, for most web traffic within a datacenter (low RTT, fast receivers), congestion control is the bottleneck — cwnd limits sending during slow start. For cross-continental transfers with slow endpoints, flow control is the bottleneck — rwnd limits sending because the receiver can't process data fast enough.

A common misconception. Many developers think "TCP is slow because of flow control." Usually it is congestion control during slow start that limits throughput for short-lived connections. A new TCP connection starts with cwnd = 10 MSS (~14 KB). On a 100ms RTT link, it takes several RTTs just to grow cwnd enough to fill the pipe. This is why connection pooling and HTTP keep-alive are so important — they avoid repeated slow starts.
Quick check: A TCP sender has 100 KB of data to send. The receiver advertises rwnd = 30 KB. The sender has already sent 20 KB that hasn't been acknowledged yet. How much more data can the sender transmit right now?

Chapter 4: Congestion Control

Flow control prevents a fast sender from overwhelming a slow receiver. But what about the network between them? Routers have finite buffer space. If too many senders push data too fast, router buffers overflow, packets are dropped, and every sender retransmits, making the congestion worse. This is called congestion collapse, and it nearly killed the early internet in 1986. Literally: the internet's usable bandwidth dropped to a tiny fraction of its physical capacity because every sender was aggressively retransmitting lost packets, which caused more congestion, which caused more loss, which caused more retransmission. Van Jacobson's 1988 paper introduced the algorithms we are about to learn, and the internet recovered.

Congestion control is TCP's mechanism for sensing and avoiding network overload. It is separate from flow control: flow control is about the receiver's capacity; congestion control is about the network's capacity. Every TCP sender on Earth runs congestion control independently, and together they cooperatively share the internet's bandwidth without any central coordinator. It is one of the most successful distributed algorithms ever deployed.

The Congestion Window (cwnd)

In addition to the receiver's window (rwnd), each TCP sender maintains its own congestion window (cwnd) — an estimate of how much data the network can handle. The actual sending limit is:

effective_window = min(cwnd, rwnd)

The sender can never have more than effective_window bytes in flight. Even if the receiver has a huge buffer, the sender will self-limit based on its perception of network congestion.

Phase 1: Slow Start

When a new connection starts, the sender has no idea how much bandwidth the network can handle. In the classic algorithm it starts with cwnd = 1 MSS (maximum segment size, typically 1460 bytes); modern stacks start at 10 MSS (RFC 6928), but the growth rule is the same. For every ACK received, cwnd increases by 1 MSS. Since a full window of ACKs arrives each RTT, cwnd doubles every RTT and grows exponentially: 1, 2, 4, 8, 16, 32...

// Slow start: exponential growth

RTT 1: cwnd = 1 MSS, send 1 segment, get 1 ACK → cwnd = 2
RTT 2: cwnd = 2 MSS, send 2 segments, get 2 ACKs → cwnd = 4
RTT 3: cwnd = 4 MSS, send 4 segments, get 4 ACKs → cwnd = 8
RTT 4: cwnd = 8 MSS → cwnd = 16
RTT 5: cwnd = 16 MSS → cwnd = 32

// In 5 RTTs, we went from 1 segment to 32 segments
// That's ~47 KB on a typical network (MSS = 1460 bytes)

"Slow start" is a misnomer — the growth is exponential, so it ramps up quickly. It continues until cwnd reaches a threshold called ssthresh (slow start threshold).

Phase 2: Congestion Avoidance (AIMD)

Once cwnd ≥ ssthresh, TCP switches to congestion avoidance. Now cwnd grows linearly: increase by 1 MSS per RTT (not per ACK). This is the additive increase phase — carefully probing for more bandwidth without being aggressive.

When a packet loss is detected (via timeout or triple duplicate ACK), TCP assumes the network is congested and cuts cwnd dramatically — typically by half. This is the multiplicative decrease. Together, this is AIMD: Additive Increase, Multiplicative Decrease.

// AIMD: the core of TCP congestion control

// Additive Increase (congestion avoidance):
On each ACK: cwnd += MSS × (MSS / cwnd) // ~1 MSS per RTT

// Multiplicative Decrease (on packet loss):
ssthresh = cwnd / 2
cwnd = ssthresh // (fast recovery) or cwnd = 1 MSS (timeout)

// This creates the classic "sawtooth" pattern:
// linear growth → loss → halve → linear growth → loss → halve

Fast Retransmit & Fast Recovery

There are two ways to detect loss:

| Detection Method | Response | Why |
| --- | --- | --- |
| Timeout | cwnd = 1 MSS, enter slow start | A timeout means severe congestion: no ACKs are arriving at all. Start over from scratch. |
| Triple duplicate ACK | cwnd = cwnd/2, enter congestion avoidance | Three duplicate ACKs mean one segment was lost, but later segments DID arrive. The network isn't completely dead, so just reduce the rate by half. |

Fast retransmit means retransmitting the lost segment immediately upon receiving the third duplicate ACK, without waiting for the full timeout. Fast recovery means starting congestion avoidance (linear growth) from the halved window rather than going back to slow start. These two optimizations make TCP much more responsive to isolated packet losses.
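
Putting slow start, congestion avoidance, and the two loss responses together gives the sawtooth. The simulation below is a deliberately coarse sketch in the Reno style: it advances one RTT at a time, measures cwnd in MSS units, and takes loss events from a dictionary. The function name, the event labels, and the floor of 2 MSS on ssthresh are assumptions of this sketch.

```python
def simulate_cwnd(rtts, ssthresh, losses):
    """Track cwnd (in MSS) per RTT. `losses` maps RTT index -> 'dupack' or 'timeout'."""
    cwnd = 1.0
    history = []
    for rtt in range(rtts):
        history.append(cwnd)
        event = losses.get(rtt)
        if event == "timeout":
            ssthresh = max(cwnd / 2, 2)
            cwnd = 1.0                        # start over in slow start
        elif event == "dupack":
            ssthresh = max(cwnd / 2, 2)
            cwnd = ssthresh                   # fast recovery: resume from half
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)    # slow start: double each RTT
        else:
            cwnd += 1                         # congestion avoidance: +1 MSS per RTT
    return history

trace = simulate_cwnd(rtts=16, ssthresh=16, losses={8: "dupack", 12: "timeout"})
print(trace)
```

The trace doubles up to 16, climbs linearly to 20, halves to 10 on the duplicate-ACK loss, and collapses to 1 on the timeout. Capping slow start exactly at ssthresh is a simplification to keep the numbers tidy.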

Congestion Window Visualization

TCP Congestion Window Over Time

Watch cwnd grow exponentially (slow start), then linearly (congestion avoidance). Click "Drop Packet" to trigger loss. Green = slow start, blue = congestion avoidance.


The sawtooth is the signature of AIMD. Linear growth probes for available bandwidth. Each drop halves the window. Over time, TCP oscillates around the maximum capacity the network can handle. It is a beautifully simple and robust mechanism.

Fairness: How AIMD Shares Bandwidth

One of AIMD's most elegant properties is fairness. If two TCP connections share a bottleneck link, they converge to equal bandwidth — with no communication between them.

Imagine two senders, A and B, sharing a bottleneck that can carry 100 segments per RTT. A has cwnd=80, B has cwnd=40. Total = 120, which exceeds capacity. Packets are dropped. Both halve: A=40, B=20. Now both increase linearly: A=41, B=21, then A=42, B=22. At the next drop, both halve again. Multiplicative decrease preserves the ratio but halves the absolute gap between them (40, then 20, then 10...), while additive increase leaves the gap unchanged and gives both flows the same amount. The gap therefore shrinks toward zero and the ratio converges to 1:1.

// Why AIMD converges to fairness

// Two senders sharing a link of capacity C:
After loss: A' = A/2, B' = B/2
A'/B' = A/B // ratio preserved by multiplicative decrease

After N additive steps: A'' = A/2 + N, B'' = B/2 + N
A''/B'' = (A/2 + N) / (B/2 + N) // approaches 1 as N grows

// Each cycle of "increase then halve" moves the ratio closer to 1:1
// This is proven to converge for any starting ratio
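
The convergence can also be observed numerically. The sketch below assumes an idealized model in which both flows see every loss at the same moment (synchronized loss) and capacity is counted in segments per RTT; the function name `aimd_share` is hypothetical.

```python
def aimd_share(a, b, capacity, rounds):
    """Two AIMD flows on one bottleneck; both halve whenever a + b exceeds capacity."""
    for _ in range(rounds):
        if a + b > capacity:
            a, b = a / 2, b / 2       # multiplicative decrease (synchronized loss)
        else:
            a, b = a + 1, b + 1       # additive increase
    return a, b

start = (80.0, 40.0)
a, b = aimd_share(*start, capacity=100, rounds=500)
print(f"start gap: {start[0] - start[1]:.1f}, final gap: {a - b:.6f}")
```

Every halving cuts the gap between the two windows in half and the additive steps never widen it, so after a few hundred rounds the two flows are effectively equal.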

Modern variants. The classic algorithm described here is TCP Reno/NewReno. Modern TCP uses more sophisticated algorithms: CUBIC (the Linux default, which uses a cubic function instead of linear increase for better high-bandwidth utilization), BBR (from Google, which estimates bandwidth and RTT directly instead of using loss as a signal), and DCTCP (datacenter TCP, which uses ECN marks for fine-grained congestion signals). The principles are the same; the curves are different.

CUBIC vs BBR: The Modern Landscape

| Property | CUBIC | BBR |
| --- | --- | --- |
| Default in | Linux (since 2006) | Google's servers, YouTube, Google Cloud |
| Signal used | Packet loss | Bandwidth and RTT estimation |
| Growth function | Cubic function of time since last loss | Pacing rate based on estimated bottleneck bandwidth |
| Strength | Good fairness with other CUBIC/Reno flows | Excellent on high-BDP links, tolerant of random loss |
| Weakness | Underutilizes high-BDP links, suffers from random loss | Can be unfair to CUBIC flows, complex to tune |
| Key insight | After a loss, quickly return to the cwnd where loss occurred (the "plateau") | Loss does not necessarily mean congestion, so measure the actual bottleneck rate |

Quick check: TCP is in congestion avoidance with cwnd = 40 MSS. A triple duplicate ACK is received. What happens to cwnd and ssthresh?

Chapter 5: TLS/SSL

TCP gives us reliable, ordered delivery. But it provides zero security. Every byte you send over TCP is transmitted in plaintext. Anyone on the network path — your ISP, a coffee shop WiFi operator, a compromised router — can read your passwords, credit card numbers, and private messages. They can also modify them in transit without you knowing.

TLS (Transport Layer Security) solves three problems at once:

| Problem | TLS Solution | Mechanism |
| --- | --- | --- |
| Eavesdropping | Encryption | Symmetric encryption (AES-GCM, ChaCha20) makes data unreadable to anyone except the two endpoints |
| Impersonation | Authentication | Digital certificates + CA chain prove the server is who it claims to be |
| Tampering | Integrity | HMAC / AEAD authentication tags detect any modification to the ciphertext |

Symmetric vs. Asymmetric Encryption

TLS uses both types of encryption, each for a different purpose:

Asymmetric cryptography (RSA, ECDHE) uses a key pair: a public key anyone can have, and a private key only the owner holds. It is slow but solves the key exchange problem: two strangers can agree on a shared secret over a public channel.

Symmetric encryption (AES-256-GCM, ChaCha20-Poly1305) uses a single shared key for both encryption and decryption. It is fast — 100-1000x faster than asymmetric. Once both sides have the shared secret, all data is encrypted symmetrically.

The combination. TLS uses asymmetric cryptography to securely exchange a symmetric key, then uses that symmetric key to encrypt all the actual data. This gives you the best of both worlds: the key exchange security of asymmetric plus the speed of symmetric.

The TLS 1.3 Handshake

TLS 1.3 (the current standard) completes the handshake in a single round trip, down from two in TLS 1.2. Here is every step:

1. ClientHello
Client sends: supported TLS versions, supported cipher suites (e.g., TLS_AES_256_GCM_SHA384), supported key exchange methods (e.g., X25519), and a key share — the client's half of the key exchange, computed speculatively.
2. ServerHello + Certificate + Finished
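Server picks a cipher suite and sends its key share (completing the key exchange, so both sides can now compute the shared secret). Everything after ServerHello is encrypted: its certificate (proving identity), a CertificateVerify signature over the handshake transcript (proving the server holds the certificate's private key), and a Finished message (a MAC over the entire handshake, proving nothing was altered in transit).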
3. Client Finished
Client verifies the certificate chain up to a trusted CA. Verifies the Finished MAC. Sends its own Finished message. Handshake complete — encrypted data can flow.

// TLS 1.3 key exchange using Diffie-Hellman (simplified)

// Both sides agree on a curve (e.g., X25519)
Client: private_a = random(), public_a = g^a mod p // sent in ClientHello
Server: private_b = random(), public_b = g^b mod p // sent in ServerHello

// Shared secret (both compute the same value):
Client computes: shared = (public_b)^a mod p = g^ab mod p
Server computes: shared = (public_a)^b mod p = g^ab mod p

// An eavesdropper sees public_a and public_b but cannot compute g^ab
// (this is the Discrete Logarithm Problem — computationally infeasible)

// The shared secret is fed into HKDF to derive encryption keys
client_write_key = HKDF(shared, "client write key")
server_write_key = HKDF(shared, "server write key")
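
The exchange can be tried out with Python's built-in modular exponentiation. This is a toy sketch only: it uses a small finite-field group rather than X25519, the parameters `P` and `G` are assumptions picked for readability, and `derive_key` is a single HMAC call standing in for HKDF. None of it is suitable for real use.

```python
import hashlib
import hmac
import secrets

# Toy parameters for illustration only: real TLS 1.3 uses X25519 or other standardized groups.
P = 2**127 - 1      # a Mersenne prime, far too small for real use
G = 3

def keypair():
    private = secrets.randbelow(P - 3) + 2        # random exponent in [2, P-2]
    return private, pow(G, private, P)            # (private, public = G^private mod P)

def derive_key(shared_secret, label):
    """Stand-in for HKDF: HMAC-SHA256 keyed by the shared secret, over the label."""
    secret_bytes = shared_secret.to_bytes(16, "big")
    return hmac.new(secret_bytes, label, hashlib.sha256).digest()

client_priv, client_pub = keypair()               # client_pub travels in ClientHello
server_priv, server_pub = keypair()               # server_pub travels in ServerHello

client_shared = pow(server_pub, client_priv, P)   # (G^b)^a mod P
server_shared = pow(client_pub, server_priv, P)   # (G^a)^b mod P

print(client_shared == server_shared)
print(derive_key(client_shared, b"client write key").hex())
```

Both sides arrive at the same shared value without it ever crossing the wire, and distinct labels yield distinct keys for each direction, mirroring the client and server write keys above.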

Certificate Chain Verification

How does the client know the server's certificate is legitimate? Through a chain of trust:

Server Certificate
Signed by an intermediate CA. Contains the server's public key and domain name.
↑ signed by
Intermediate CA Certificate
Signed by a root CA. The intermediate CA verified the server's identity before issuing the certificate.
↑ signed by
Root CA Certificate
Pre-installed in your browser/OS. Self-signed. There are ~150 trusted root CAs worldwide (DigiCert, Let's Encrypt, etc.).

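The client walks the chain: verify the server cert was signed by the intermediate CA, verify the intermediate was signed by a root CA that the client trusts. It also checks that the certificate's name matches the hostname it asked for and that every certificate is within its validity period. If any check fails, the connection is rejected.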

Forward Secrecy

TLS 1.3 mandates forward secrecy (also called perfect forward secrecy, PFS). This means that even if the server's private key is compromised in the future, past recorded conversations cannot be decrypted.

How? Because each connection generates a new ephemeral Diffie-Hellman key pair. The shared secret is derived from these ephemeral keys, not directly from the server's long-term key. The server's private key is only used to authenticate (sign the handshake), not to encrypt. Once the connection closes, the ephemeral keys are discarded.

// Without forward secrecy (TLS 1.2 RSA key exchange):
Client encrypts premaster secret with server's RSA public key
If attacker later obtains server's private key → can decrypt ALL past sessions

// With forward secrecy (TLS 1.3 ECDHE):
Each connection uses fresh DH key pair → unique shared secret
Server private key only signs, never decrypts
Compromise of server key → can impersonate server going forward
  but CANNOT decrypt any past recorded traffic
Ephemeral keys are deleted after handshake → no way to recover them
This is why TLS 1.3 removed RSA key exchange entirely. RSA key exchange (where the client encrypts a secret with the server's public key) does not provide forward secrecy. An attacker who records traffic and later steals the server key can decrypt everything. In TLS 1.3, every certificate-authenticated handshake uses an ephemeral Diffie-Hellman exchange (ECDHE or finite-field DHE).

TLS Handshake Animation

TLS 1.3 Handshake

Watch the TLS handshake in real time. Each message shows what it contains and what both sides know.


Python TLS Socket

python
import ssl
import socket

# Create a TLS-wrapped TCP connection
context = ssl.create_default_context()  # loads system CA certs

with socket.create_connection(("example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="example.com") as tls:
        # Handshake happens automatically
        print(tls.version())       # 'TLSv1.3'
        print(tls.cipher())        # ('TLS_AES_256_GCM_SHA384', 'TLSv1.3', 256)
        cert = tls.getpeercert()
        print(cert['subject'])     # server identity

        # Send an HTTP request over TLS
        tls.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
        response = tls.recv(4096)
        print(response.decode())
python
# Inspecting TLS details programmatically
import ssl

context = ssl.create_default_context()

# Verify hostname and certificate
context.check_hostname = True
context.verify_mode = ssl.CERT_REQUIRED

# Pin to specific TLS version (TLS 1.3 only)
context.minimum_version = ssl.TLSVersion.TLSv1_3
context.maximum_version = ssl.TLSVersion.TLSv1_3

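# Restrict cipher suites
# Note: set_ciphers() only controls suites for TLS 1.2 and below. Python's
# ssl module cannot disable TLS 1.3 suites this way, and a string holding
# only TLS 1.3 names may raise ssl.SSLError depending on the OpenSSL build.
# On a context that still allows TLS 1.2 you could use:
# context.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")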
Quick check: In TLS 1.3, the client sends a key share in the ClientHello (before seeing the server's certificate). Why is this safe? What prevents a man-in-the-middle from intercepting the key exchange?

Chapter 6: Custom Protocols

TCP is brilliant for most applications: web browsing, file transfer, database queries, API calls. But its guarantees come at a cost. For some use cases, TCP's reliability mechanisms actually hurt performance.

When TCP Hurts: Head-of-Line Blocking

TCP delivers data as an ordered byte stream. If segment 5 out of 10 is lost, segments 6-10 must wait in the receiver's buffer until segment 5 is retransmitted and arrives. The application sees nothing until the gap is filled. This is head-of-line blocking.

For a file transfer, this is fine — you need the bytes in order. But for real-time applications, it is devastating:

Application | Why TCP hurts | What you actually want
Video streaming | One lost frame blocks all subsequent frames. Player stutters. | Skip the lost frame and play the next one.
Online gaming | One lost position update blocks all future updates. Player teleports. | Use the latest position, ignore stale ones.
Voice calls | 200ms of audio delayed while waiting for retransmission. | Play silence for the gap, keep audio flowing.
HTTP/2 multiplexing | One lost packet on stream A blocks streams B, C, D (they share one TCP connection). | Independent streams that don't interfere.
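
Head-of-line blocking is easy to reproduce with a small model of the receiver's reorder buffer. The sketch below is a simplification with hypothetical names (`InOrderReceiver`, `on_segment`) that numbers whole segments instead of bytes; it only shows how a single gap holds back everything behind it.

```python
class InOrderReceiver:
    """Delivers segments to the application strictly in sequence order."""

    def __init__(self, first_seq=1):
        self.next_seq = first_seq   # next segment the application expects
        self.buffer = {}            # out-of-order segments held back

    def on_segment(self, seq, payload):
        """Accept one segment; return the payloads released to the app."""
        if seq < self.next_seq or seq in self.buffer:
            return []  # duplicate, ignore
        self.buffer[seq] = payload
        released = []
        # Drain the buffer for as long as there is no gap.
        while self.next_seq in self.buffer:
            released.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return released

rx = InOrderReceiver()
# Segment 5 is lost in transit; 6-10 arrive but cannot be delivered.
for seq in [1, 2, 3, 4, 6, 7, 8, 9, 10]:
    print(seq, "->", rx.on_segment(seq, f"seg{seq}"))
# The retransmitted segment 5 fills the gap and releases 5-10 at once.
print(5, "->", rx.on_segment(5, "seg5"))
```

Segments 6 through 10 print an empty list: they were received, but the application sees none of them until segment 5 arrives.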

UDP: The Escape Hatch

UDP (User Datagram Protocol) is the anti-TCP. It provides almost nothing: no reliability, no ordering, no flow control, no congestion control. You send a datagram, and it either arrives or it doesn't. Each datagram is independent.

// UDP header: only 8 bytes (vs TCP's 20+ bytes)

| Source Port (16 bits) | Dest Port (16 bits) |
| Length (16 bits) | Checksum (16 bits) |
| Data... |

// That's it. No sequence numbers, no ACKs, no windows.
// The application is responsible for everything.
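
As a rough illustration of how little there is to the header, the following sketch packs and parses the four 16-bit fields with Python's `struct` module. The helper names and the port numbers are made up for the example, and the checksum is left at 0 ("not computed", which UDP over IPv4 permits) instead of being calculated.

```python
import struct

# Four unsigned 16-bit fields in network byte order:
# source port, destination port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")

def build_udp_datagram(src_port, dst_port, payload, checksum=0):
    """Prefix payload with an 8-byte UDP header.

    The length field counts the header plus the payload.
    """
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload

def parse_udp_datagram(datagram):
    """Split a datagram into its header fields and payload."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    payload = datagram[UDP_HEADER.size:length]
    return src_port, dst_port, length, checksum, payload

dgram = build_udp_datagram(50000, 53, b"hello")
print(UDP_HEADER.size)            # 8
print(parse_udp_datagram(dgram))  # (50000, 53, 13, 0, b'hello')
```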

UDP is not "unreliable TCP." It is a blank canvas. Applications that use UDP implement exactly the guarantees they need and nothing more. A game might implement reliable delivery for chat messages but fire-and-forget for position updates. A video player might implement FEC (forward error correction) to recover from loss without retransmission.

python
import socket, struct, time

def get_player_position():
    """Placeholder hook: return the local player's (x, y)."""
    return 0.0, 0.0

def update_player(addr, x, y):
    """Placeholder hook: apply a remote player's position."""
    print(addr, x, y)

# UDP sender: fire-and-forget position updates
def send_positions(server=("game-server.example.com", 9000)):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = 0
    while True:
        x, y = get_player_position()
        # Pack: sequence number + position
        data = struct.pack("!Iff", seq, x, y)
        sock.sendto(data, server)
        seq += 1
        time.sleep(1/60)  # 60 Hz update rate

# UDP receiver: always use LATEST position, drop stale
def receive_positions(sock):
    latest_seq = {}  # per sender: addr -> highest sequence number seen
    while True:
        data, addr = sock.recvfrom(1024)
        seq, x, y = struct.unpack("!Iff", data)
        if seq > latest_seq.get(addr, -1):  # ignore out-of-order (stale)
            latest_seq[addr] = seq
            update_player(addr, x, y)
        # else: discard; we already have newer data from this sender

QUIC: The Best of Both Worlds

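QUIC (originally "Quick UDP Internet Connections") is the protocol underneath HTTP/3. It runs on UDP but provides reliable, ordered delivery on each stream, mandatory built-in TLS 1.3 encryption, congestion control, connection migration, and a 1-RTT handshake (0-RTT for repeat connections).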

The key innovation. TCP multiplexes everything into one ordered byte stream. QUIC multiplexes into independent streams. When you load a web page with 50 resources over HTTP/2 (TCP), one lost packet blocks ALL 50 resources. Over HTTP/3 (QUIC), one lost packet only blocks the ONE resource whose stream it belongs to. The other 49 continue unimpeded.

When to Use What

Choosing between TCP, UDP, and QUIC is not about which is "better" — it is about matching your application's requirements to the protocol's guarantees. The decision framework is simple: ask yourself two questions. First: "Can my application tolerate data loss?" If no, you need reliability (TCP or QUIC). If yes, UDP. Second: "Am I multiplexing independent streams?" If yes, QUIC avoids head-of-line blocking. If no, TCP is simpler and universally supported.

Use TCP when: You need every byte delivered, in order, exactly once. Database connections, file transfers, API calls, SSH. TCP is the default choice, and you should only move away from it when you have a specific reason.
Use UDP when: You can tolerate loss, and low latency matters more than completeness. Real-time video/audio (VoIP, video conferencing), online games (position updates), DNS queries (single request-response, faster to retry than to maintain a connection), and any application where "the latest data supersedes old data."
Use QUIC when: You need TCP's reliability but with independent stream multiplexing, mandatory encryption, and faster connection setup. Web applications (HTTP/3), mobile apps (connection migration across network changes), and any multiplexed protocol that suffers from TCP's head-of-line blocking.

The Big Comparison: TCP vs UDP vs QUIC

TCP vs UDP vs QUIC: Head-of-Line Blocking

Three protocols, each sending 4 independent data streams. A packet loss occurs in stream 2. Watch how each protocol handles it. TCP blocks everything. UDP loses the data. QUIC blocks only stream 2.


QUIC Under the Hood

QUIC is not just "TCP over UDP." It rethinks the transport layer from the ground up. Here are the critical implementation details:

// QUIC packet structure (simplified)

| Header Form (1 bit) | Fixed Bit | Long/Short Header ... |
| Connection ID (variable, 0-20 bytes) |
| Packet Number (1-4 bytes, encrypted!) |
| Encrypted Payload: |
| Frame Type | Stream ID | Offset | Length | Data |
| Frame Type | Stream ID | Offset | Length | Data |
| ...multiple frames per packet |

// Key differences from TCP:
// 1. Packet numbers are NEVER reused (monotonically increasing)
// 2. Stream ID identifies which logical stream this data belongs to
// 3. Offset within stream replaces TCP's sequence number
// 4. Multiple streams can be multiplexed in ONE packet

The fact that each stream has its own offset is what eliminates head-of-line blocking. If stream 3's offset is missing, QUIC only blocks stream 3's data. Streams 1, 2, and 4 have their own offsets and can be delivered to the application immediately.
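
The per-stream offsets can be modeled in a few lines. This is a deliberately simplified sketch, not QUIC's actual reassembly logic: the class and method names are hypothetical, and it assumes frames never overlap and that a retransmission reuses the original frame boundaries.

```python
from collections import defaultdict

class StreamReassembler:
    """Reorders stream frames independently for each stream ID."""

    def __init__(self):
        self.next_offset = defaultdict(int)  # stream_id -> next byte expected
        self.pending = defaultdict(dict)     # stream_id -> {offset: data}

    def on_frame(self, stream_id, offset, data):
        """Accept one frame; return bytes now deliverable on that stream."""
        if offset < self.next_offset[stream_id]:
            return b""  # already delivered, ignore the duplicate
        frames = self.pending[stream_id]
        frames[offset] = data
        delivered = b""
        # Only this stream's gap matters; other streams are not consulted.
        while self.next_offset[stream_id] in frames:
            chunk = frames.pop(self.next_offset[stream_id])
            delivered += chunk
            self.next_offset[stream_id] += len(chunk)
        return delivered

r = StreamReassembler()
print(r.on_frame(1, 0, b"aaaa"))  # b'aaaa'
print(r.on_frame(3, 4, b"CCCC"))  # b'' (stream 3 is missing offset 0)
print(r.on_frame(2, 0, b"bbbb"))  # b'bbbb' (unaffected by stream 3's gap)
print(r.on_frame(4, 0, b"dddd"))  # b'dddd'
print(r.on_frame(3, 0, b"cccc"))  # b'ccccCCCC' (gap filled)
```

Compare this with a single shared sequence space: there, the missing bytes of stream 3 would have held back streams 2 and 4 as well.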

Why UDP? QUIC runs over UDP because it would be nearly impossible to deploy a new transport protocol on the internet. Middleboxes (firewalls, NATs, load balancers) are hard-coded to understand TCP and UDP. A new IP protocol number would be blocked by most networks. UDP provides the minimal "I'm a valid transport packet" wrapper that middleboxes accept, and QUIC implements everything else in userspace.

Connection Setup Latency

Another major difference: how many round trips before data can flow.

// Connection setup latency comparison

TCP + TLS 1.3:
1 RTT: TCP handshake (SYN, SYN-ACK, ACK)
1 RTT: TLS handshake (ClientHello, ServerHello+Cert+Finished, Finished)
Total: 2 RTTs before first data byte

QUIC (first connection):
1 RTT: Combined transport + TLS handshake
Total: 1 RTT before first data byte

QUIC (repeat connection with 0-RTT):
0 RTTs: Client sends data immediately using cached session key
Total: 0 RTTs (!) — data in the very first packet

// On a 100ms RTT link:
TCP+TLS: 200ms before data flows
QUIC: 100ms (or 0ms for 0-RTT!)

Feature | TCP | UDP | QUIC
Reliability | Yes (whole stream) | No | Yes (per stream)
Ordering | Yes (whole stream) | No | Yes (per stream)
Encryption | Optional (TLS) | No | Mandatory (built-in TLS 1.3)
Head-of-line blocking | Yes | No | No (across streams)
Setup latency | 1-2 RTT | 0 RTT | 1 RTT (0 for repeat)
Connection migration | No (tied to IP:port) | N/A | Yes (connection ID)
Congestion control | Yes (kernel) | No | Yes (userspace)
Used by | HTTP/1.1, HTTP/2, SSH, databases | DNS, gaming, video, VoIP | HTTP/3, Google services
Quick check: You're multiplexing 4 HTTP requests over a single HTTP/2 connection (TCP). A packet belonging to request 3 is lost. What happens to requests 1, 2, and 4?

Chapter 7: Network in Practice

You understand the protocols. Now let's talk about operating them in production. Every day, engineers debug TCP problems that waste hours — and the symptoms are always the same: "it's slow" or "connections are failing." The fix depends on understanding which TCP mechanism is misbehaving.

Nagle's Algorithm and TCP_NODELAY

Nagle's algorithm (RFC 896) solves the "small packet problem." If your application writes 1 byte at a time (e.g., keystrokes in a telnet session), TCP would send a 41-byte packet (20 bytes IP header + 20 bytes TCP header + 1 byte data) for each byte of payload. This is wildly inefficient.

Nagle's algorithm says: if there is unacknowledged data in flight, buffer small writes and send them as one segment when the ACK arrives. This batches small writes into larger, efficient segments.

// Nagle's algorithm (pseudocode)

if data_to_send ≥ MSS:
  send immediately // full segment, no waiting
elif no_unacked_data:
  send immediately // nothing in flight, safe to send small
else:
  buffer until ACK arrives or enough data accumulates

// Problem: with delayed ACKs (receiver waits up to 200ms to ACK),
// Nagle + delayed ACK = 200ms latency on every small write.
// This is why interactive/RPC applications set TCP_NODELAY.
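
The same decision rule, written as a runnable Python function. This is a sketch of the logic only: the function name, its parameters, and the MSS value of 1460 are assumptions for illustration, and a real TCP stack tracks considerably more state.

```python
def nagle_should_send(pending_bytes, mss, unacked_bytes, nodelay=False):
    """Return True if the sender may transmit the pending bytes now."""
    if nodelay:
        return True   # TCP_NODELAY: never hold data back
    if pending_bytes >= mss:
        return True   # a full segment is ready, no reason to wait
    if unacked_bytes == 0:
        return True   # nothing in flight, a small segment is acceptable
    return False      # small write with data in flight: wait for the ACK

MSS = 1460
print(nagle_should_send(100, MSS, unacked_bytes=0))                 # True
print(nagle_should_send(50, MSS, unacked_bytes=100))                # False
print(nagle_should_send(50, MSS, unacked_bytes=100, nodelay=True))  # True
print(nagle_should_send(2000, MSS, unacked_bytes=100))              # True
```

The second call is the case that hurts: a 50-byte write is held until the earlier 100 bytes are acknowledged, and the receiver's delayed ACK decides how long that takes.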

For interactive applications (SSH, gaming, RPC services), Nagle's algorithm adds intolerable latency. The fix: setsockopt(TCP_NODELAY, 1). This disables Nagle's algorithm, so every write is sent immediately regardless of size. Most modern web servers, database drivers, and RPC frameworks set TCP_NODELAY by default.

python
import socket

# Disable Nagle's algorithm for low-latency RPC
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Verify it's set
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(f"TCP_NODELAY: {'enabled' if nodelay else 'disabled'}")

# Note: Python's http.client and requests library set this by default
# But raw socket connections do NOT — you must set it yourself

Delayed ACK Interaction

We mentioned Nagle's algorithm, but the real devil is its interaction with delayed ACKs. The TCP receiver doesn't ACK every segment immediately — it waits up to 200ms (the delayed ACK timer) hoping to piggyback the ACK on a response data segment. This is efficient when the application sends a response quickly. But when combined with Nagle's algorithm, disaster strikes:

// The Nagle + Delayed ACK deadlock

// Client sends 100 bytes (less than MSS)
Client: send(100 bytes) // Nagle: send immediately (nothing unacked)
Server: receives 100 bytes, starts delayed ACK timer (200ms)

// Client sends 50 more bytes
Client: send(50 bytes) // Nagle: BUFFER! (100 bytes still unacked)

// Deadlock: Client waits for ACK to send buffered data
// Server waits 200ms to send delayed ACK
// Nobody makes progress for 200ms!

Server: delayed ACK timer fires, sends ACK // 200ms later!
Client: receives ACK, sends buffered 50 bytes

// Total delay: 200ms for a 50 byte write on a 1ms RTT network

This is not a theoretical problem. It causes real 200ms latency spikes in production systems every day. The solution is TCP_NODELAY on interactive/RPC connections. There is almost never a reason to leave Nagle's algorithm enabled on a service that makes request-response calls.

TCP Keep-Alive

TCP keep-alive detects dead connections. By default, TCP has no heartbeat — if one side crashes without sending FIN (power failure, OOM kill), the other side has no idea. It sits in ESTABLISHED state forever, holding resources.

// Linux TCP keep-alive parameters

tcp_keepalive_time = 7200 // seconds before first probe (default: 2 hours!)
tcp_keepalive_intvl = 75 // seconds between probes
tcp_keepalive_probes = 9 // number of unanswered probes before declaring dead

// Default: 2 hours + 9*75s = 2h 11m 15s to detect a dead peer!
// Production settings (much more aggressive):
tcp_keepalive_time = 60
tcp_keepalive_intvl = 10
tcp_keepalive_probes = 6
// 60 + 6*10 = 120 seconds to detect dead peer
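
The sysctl values above are system-wide defaults. Keep-alive can also be configured per socket, which is often the safer change. The sketch below assumes Linux, where Python exposes `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT`; other platforms may lack some of these constants, so each is applied only if present. The helper name `enable_keepalive` is hypothetical.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=6):
    """Turn on TCP keep-alive for one socket.

    Returns the worst-case number of seconds before a dead peer is
    detected: idle time plus one interval per unanswered probe.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Apply each tuning knob only where the platform provides it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return idle + interval * probes

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(enable_keepalive(sock))  # 120
sock.close()
```

Note that keep-alive stays off unless `SO_KEEPALIVE` is set on the socket, so changing the sysctls alone does nothing for sockets that never enable it.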

Connection Pooling

Each TCP connection costs 1 RTT (handshake) + memory (buffers, state) + a TIME_WAIT socket when closed. If your application opens a new connection for every request (like naive HTTP/1.0), you pay this cost thousands of times per second.

Connection pooling keeps a set of established connections open and reuses them for multiple requests. This eliminates handshake latency and TIME_WAIT buildup.

python
import urllib3

# urllib3 connection pool: reuses TCP connections
pool = urllib3.HTTPSConnectionPool(
    "api.example.com",
    port=443,
    maxsize=10,         # keep up to 10 connections alive
    block=True,         # wait for a free connection if all 10 are busy
    retries=3,          # retry on connection failure
)

# First request: opens connection (1 RTT handshake + TLS)
resp1 = pool.request("GET", "/users/1")

# Second request: reuses existing connection (0 RTT overhead)
resp2 = pool.request("GET", "/users/2")

# Connection stays alive between requests (HTTP keep-alive)

Debug Scenarios

These are the three most common TCP problems in production. Learn to recognize the symptoms and you will save hours of debugging.

Scenario 1: Connections timing out. Symptom: connect() timeout after 10s. Your application can't establish TCP connections to a remote service. The SYN packets are being sent but no SYN-ACK comes back. Causes: (1) Firewall dropping SYN packets silently (most common). (2) Server's listen backlog is full — too many pending connections. (3) Server is down. Debug: tcpdump -i any 'tcp[tcpflags] & tcp-syn != 0' and host X.X.X.X — if you see SYN going out but no SYN-ACK coming back, it's a network/firewall issue.
Scenario 2: Slow throughput despite high bandwidth. Symptom: transferring a file at 2 MB/s on a 1 Gbps link. Causes: (1) Small receive window — check ss -tin for rcv_space. (2) High RTT with small window — BDP problem. (3) Packet loss causing cwnd to stay small. Debug: ss -tin dst X.X.X.X shows cwnd, rwnd, RTT, retransmissions.
Scenario 3: Port exhaustion from TIME_WAIT. Symptom: Cannot assign requested address or EADDRNOTAVAIL. You've run out of ephemeral ports because thousands of connections are in TIME_WAIT. Cause: high-volume short-lived connections (e.g., a load balancer creating a new connection for every request). Fix: (1) Connection pooling (best). (2) net.ipv4.tcp_tw_reuse = 1 (safe with TCP timestamps; applies to outgoing connections). (3) Widen net.ipv4.ip_local_port_range. Note that SO_REUSEADDR lets a restarted server rebind its listening port, but it does not free ephemeral ports for outgoing connections.

TCP Tuning Cheat Sheet

bash
# View current TCP settings
sysctl net.ipv4.tcp_wmem          # send buffer (min, default, max)
sysctl net.ipv4.tcp_rmem          # receive buffer (min, default, max)
sysctl net.ipv4.tcp_congestion_control  # cubic, bbr, etc.

# Enable BBR congestion control (better for high-latency links)
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Increase buffer sizes for high-BDP links
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"   # 16 MB max
sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"

# Allow TIME_WAIT socket reuse
sysctl -w net.ipv4.tcp_tw_reuse=1

# Aggressive keep-alive for production services
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=6

# Monitor TCP state distribution
ss -tan state time-wait | wc -l   # count TIME_WAIT sockets
ss -tan state close-wait | wc -l  # count CLOSE_WAIT (application bug!)
ss -tin dst X.X.X.X               # detailed TCP info for connections to X
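
For monitoring, it can help to turn the `ss -tan` output into per-state counts in one pass. The following Python sketch parses the text format shown in the sample; the sample rows and addresses are invented for illustration, and the function assumes the state name is the first whitespace-separated column.

```python
from collections import Counter

def count_tcp_states(ss_output):
    """Count connections per state from the text printed by `ss -tan`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "State":
            continue  # skip blank lines and the header row
        counts[fields[0]] += 1
    return counts

# Hypothetical sample output, not captured from a real host.
sample = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:8000        0.0.0.0:*
ESTAB      0      0      10.0.0.5:44321      10.0.0.9:443
TIME-WAIT  0      0      10.0.0.5:44322      10.0.0.9:443
TIME-WAIT  0      0      10.0.0.5:44323      10.0.0.9:443
CLOSE-WAIT 1      0      10.0.0.5:8000       10.0.0.7:51514
"""
print(count_tcp_states(sample))
```

On a live Linux host the input could come from `subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout`. A steadily growing CLOSE-WAIT count points at sockets the application never closes.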

Network Debug Dashboard

The simulation below models a production scenario. A client sends requests to a server. You can introduce different problems — Nagle delay, small receive window, TIME_WAIT buildup — and watch the metrics change in real time. This is what ss -tin output looks like, made visual.

Production Network Debugger

Click a scenario to introduce a TCP problem. Watch the metrics dashboard update. Then apply the fix.


Real-World Case Studies

These are not hypothetical — these are actual production incidents that TCP misconfigurations cause.

Case: The 40ms mystery. A team migrated their service from bare metal to Kubernetes. Latency jumped from 2ms to 40ms. Root cause: the new container's TCP stack had delayed ACKs enabled (default), and the client had Nagle's algorithm enabled (default). Each small RPC response triggered the Nagle+delayed-ACK interaction. Fix: set TCP_NODELAY on all service sockets. Lesson: defaults that work on one platform can fail on another.
Case: The Black Friday meltdown. An e-commerce site's API gateway ran out of ephemeral ports at 2x normal traffic. All new connections failed with EADDRNOTAVAIL. Root cause: the gateway created a new TCP connection for every backend request and closed it immediately. At 2x traffic, TIME_WAIT sockets consumed all 28,000 ephemeral ports. Fix: connection pooling with a pool of 100 persistent connections per backend. TIME_WAIT count dropped from 28,000 to near zero.
Case: Transcontinental transfer at 1%. A data team moved 500 GB nightly from US-East to EU-West over a 10 Gbps link. Transfer took 8 hours instead of the expected 7 minutes. Root cause: default TCP receive buffer was 64 KB. With 100ms RTT, each connection is capped at window / RTT = 64 KB / 100ms = 640 KB/s, while the link's BDP is 10 Gbps × 100ms = 125 MB. Fix: increase tcp_rmem max to 16 MB and enable window scaling. Transfer time dropped to 9 minutes.
Quick check: Your microservice makes HTTP requests to another service. You notice 200ms of latency on every small request, even though the RTT is 1ms. You are NOT using TCP_NODELAY. What is likely happening?

Chapter 8: Interview Arsenal

Networking is a favorite interview topic for backend, infrastructure, and distributed systems roles. Here is everything you need to answer confidently.

Quick Reference: TCP vs UDP vs QUIC

Question | TCP | UDP | QUIC
Connection-oriented? | Yes (3-way handshake) | No (connectionless) | Yes (1 RTT handshake)
Reliable? | Yes (ACK + retransmit) | No | Yes (per stream)
Ordered? | Yes (single stream) | No | Yes (per stream)
Flow control? | Yes (receive window) | No | Yes (per stream + connection)
Congestion control? | Yes (AIMD in kernel) | No | Yes (pluggable, userspace)
Encryption? | Optional (TLS layer) | Optional (DTLS) | Mandatory (built-in)
Header size | 20-60 bytes | 8 bytes | Variable (~20 bytes)
HoL blocking? | Yes (whole connection) | No | No (across streams)

Whiteboard: Explain TCP Congestion Control

If an interviewer says "explain TCP congestion control on a whiteboard," here is the structure:

1. Draw the sawtooth
X-axis = time, Y-axis = cwnd. Draw exponential growth (slow start), transition at ssthresh to linear growth (congestion avoidance), then a drop at packet loss. This visual IS the explanation.
2. Explain the two phases
Slow start: cwnd doubles every RTT. "We don't know the network capacity, so probe aggressively." Congestion avoidance: cwnd += 1 MSS per RTT. "We're near capacity, probe carefully."
3. Explain loss detection
Triple dup ACK: "mild loss, halve cwnd and continue." Timeout: "severe loss, reset cwnd to 1 and start over." Write AIMD: additive increase, multiplicative decrease.
4. Why it works
"All senders independently converge to fair share. If N senders share a link, each gets ~1/N of the bandwidth. No central coordinator needed."

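If you want to check the shape of the sawtooth before drawing it, a few lines of Python reproduce it. This is a crude model under stated assumptions: cwnd is counted in whole MSS units, a loss happens whenever cwnd exceeds a fixed `capacity`, and every loss is treated as a triple duplicate ACK (halve and continue), never a timeout. The function name and default values are illustrative.

```python
def simulate_cwnd(rtts, capacity, ssthresh=16, cwnd=1):
    """Return cwnd (in MSS) at the start of each RTT."""
    history = []
    for _ in range(rtts):
        history.append(cwnd)
        if cwnd > capacity:
            # Loss detected: multiplicative decrease.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double every RTT, up to ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: additive increase of 1 MSS per RTT.
            cwnd += 1
    return history

print(simulate_cwnd(rtts=20, capacity=20))
# [1, 2, 4, 8, 16, 17, 18, 19, 20, 21, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```

The output shows all three features of the whiteboard drawing: exponential growth to 16, a linear climb to 21, then a drop to 10 followed by another linear climb.
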
Common Interview Questions

Q: What happens when you type google.com in your browser? (The classic.) DNS resolution → TCP 3-way handshake → TLS 1.3 handshake → HTTP/2 or HTTP/3 request → Server processes → Response (HTML) → Browser parses HTML, discovers CSS/JS/images → Parallel requests for resources (multiplexed over same connection) → DOM construction → Render. Connection setup takes 2 RTTs with TCP + TLS 1.3, or 1 RTT with QUIC. On a 50ms RTT link, that's 100ms vs 50ms before the first data byte.
Q: Why does TCP use a 3-way handshake? Why not 2? A 2-way handshake cannot prevent stale SYN packets from creating ghost connections. The third step (ACK) confirms the client is alive and intends this connection. Without it, delayed SYN packets from old connections would cause the server to allocate resources for connections no one will use.
Q: Explain TIME_WAIT. Why does it exist? When is it a problem? TIME_WAIT lasts 2×MSL (1-4 minutes). Purpose: (1) ensure final ACK is delivered (if lost, peer retransmits FIN), (2) prevent stale packets from old connection being accepted by a new connection on the same port. Problem: hosts that open and close many short-lived connections accumulate TIME_WAIT sockets, exhausting ephemeral ports. Fix: connection pooling, tcp_tw_reuse, or a wider ephemeral port range (net.ipv4.ip_local_port_range). SO_REUSEADDR helps a restarted server rebind its listening port but does not free ephemeral ports for outgoing connections.
Q: How does TLS prevent man-in-the-middle attacks? The server presents a certificate signed by a trusted CA, binding its public key to its domain name. It then signs the handshake transcript with the matching private key (CertificateVerify in TLS 1.3), proving it holds that key. The Finished message adds a MAC over the entire handshake, so any tampering with earlier messages is detected. An attacker who intercepts the key exchange cannot produce a valid signature without the server's private key.

Design Questions

Design: Real-time multiplayer game networking. Use UDP, not TCP. Implement your own reliability layer for critical messages (chat, game state changes) with sequence numbers and ACKs. For position updates, send unreliably — the latest position supersedes all previous ones. Use client-side prediction + server reconciliation. Send position at 20-60 Hz. Compress with delta encoding (only send what changed since last acknowledged state).
Design: CDN for video streaming. Use QUIC/HTTP/3 for adaptive bitrate streaming. Multiple independent streams per connection = no HoL blocking. 0-RTT for repeat viewers. Connection migration for mobile users switching networks. Segment-based delivery (2-4 second chunks) so the player can adapt bitrate per segment based on available bandwidth. Place edge servers close to users to minimize RTT.
Design: gRPC service mesh. gRPC uses HTTP/2 over TCP. Key decisions: (1) Connection pooling — one persistent connection per backend, multiplexed streams. (2) TCP_NODELAY on all connections — gRPC sends small frames, Nagle would add 200ms latency. (3) Keep-alive — detect dead backends before the load balancer's health check does. (4) Client-side load balancing across the pool — round-robin or least-connections. (5) Retry with idempotency tokens for non-idempotent RPCs. (6) Circuit breaker — if a backend fails N times, stop sending traffic for a cooldown period.

Numbers Every Engineer Should Know

What | Value | Why it matters
Speed of light in fiber | ~200,000 km/s (2/3 of c) | NYC to London: 5,500 km = 27.5ms one-way minimum
TCP handshake | 1 RTT | Same datacenter: ~0.5ms. Cross-continent: ~150ms
TLS 1.3 handshake | 1 RTT | Adds 1 RTT on top of TCP. QUIC combines both in 1 RTT total.
TCP slow start IW | 10 MSS (~14 KB) | First RTT can only send 14 KB. Critical for small web page loads.
Ephemeral port range | ~28,000 ports | Linux default: 32768-60999. Each TIME_WAIT uses one for 1-4 minutes.
MSL (Maximum Segment Lifetime) | 2 minutes (RFC 793); Linux fixes TIME_WAIT at 60 seconds | TIME_WAIT = 2 × MSL in the spec; on Linux each TIME_WAIT socket lingers 60 seconds
Delayed ACK timer | 40-200ms | Combined with Nagle = hidden 200ms latency
Default TCP buffer | 64 KB (many systems) | On 100ms RTT: max throughput = 64 KB / 100ms = 640 KB/s

Coding Drills

python
# Drill 1: TCP echo server with connection pooling awareness
import socket, selectors

sel = selectors.DefaultSelector()

def accept(sock):
    conn, addr = sock.accept()
    conn.setblocking(False)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle
    sel.register(conn, selectors.EVENT_READ, data=echo)

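def echo(conn):
    try:
        data = conn.recv(4096)
    except ConnectionResetError:
        data = b""  # peer sent RST; treat it like a close
    if data:
        # Simplification: sendall() on a non-blocking socket can raise
        # BlockingIOError if the send buffer is full. A production server
        # would queue the data and wait for EVENT_WRITE instead.
        conn.sendall(data)
    else:
        sel.unregister(conn)
        conn.close()  # important! prevents CLOSE_WAIT leak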

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("", 8000))
srv.listen(128)  # backlog size
srv.setblocking(False)
sel.register(srv, selectors.EVENT_READ, data=accept)

while True:
    for key, mask in sel.select():
        key.data(key.fileobj)
python
# Drill 2: Measure TCP handshake + TLS overhead
import socket, ssl, time

def measure_connect(host, port=443):
    # Resolve DNS up front so lookup time is not counted as handshake time
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]

    # TCP handshake
    sock = socket.socket(family, socktype, proto)
    t0 = time.perf_counter()
    sock.connect(addr)
    tcp_ms = (time.perf_counter() - t0) * 1000

    # TLS handshake (build the context first; loading CA certs is not handshake time)
    ctx = ssl.create_default_context()
    t1 = time.perf_counter()
    tls = ctx.wrap_socket(sock, server_hostname=host)
    tls_ms = (time.perf_counter() - t1) * 1000

    print(f"TCP handshake: {tcp_ms:.1f}ms")
    print(f"TLS handshake: {tls_ms:.1f}ms")
    print(f"Total:         {tcp_ms+tls_ms:.1f}ms")
    print(f"TLS version:   {tls.version()}")
    tls.close()

measure_connect("google.com")
# TCP handshake: 12.3ms
# TLS handshake: 28.7ms
# Total:         41.0ms
# TLS version:   TLSv1.3

Mental Models for Quick Recall

When under interview pressure, these mental models let you reason about networking questions without memorizing every detail:

Model 1: The Bandwidth-Delay Product. The BDP tells you how many bytes can be "in flight" on a link. BDP = bandwidth × RTT. If your window (min of cwnd, rwnd) is smaller than BDP, you are wasting bandwidth. If it is larger, you are overflowing buffers. This single formula explains why transfers are slow on high-latency links, why window scaling exists, and why buffer tuning matters.
Model 2: The AIMD Triangle. TCP's congestion window draws a sawtooth. Each "tooth" is a triangle: linear rise, vertical drop. The area under each triangle is the data transferred. Width of the triangle base = time between losses. Height = cwnd at loss. If the network drops packets more often (base narrows), throughput drops. If latency is high (each RTT adds only 1 MSS), the rise is slower, and throughput drops. This is why TCP struggles on lossy, high-latency links.
Model 3: State Accumulation. Every open connection consumes state: memory (buffers), file descriptors, and port numbers. TCP connections that aren't properly closed accumulate as CLOSE_WAIT (application bug) or TIME_WAIT (normal, but volumetric). When you hear "connections failing" or "cannot bind," think state accumulation. The fix is always: close sockets properly, pool connections, or tune kernel limits.
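
Model 1 reduces to two one-line formulas, sketched here in Python. The function names and the choice of units (bits per second, milliseconds, bytes) are assumptions made for the example; the numbers reuse the 10 Gbps, 100ms, and 64 KB figures from earlier in this lesson.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product: bytes that fit in flight on the path."""
    return bandwidth_bps * rtt_ms / 8 / 1000

def max_throughput_bytes_per_s(window_bytes, rtt_ms):
    """A sender can have at most one window outstanding per round trip."""
    return window_bytes * 1000 / rtt_ms

# 10 Gbps link with a 100 ms RTT
print(bdp_bytes(10_000_000_000, 100))                 # 125000000.0 (125 MB)
# 64 KB window on the same path
print(max_throughput_bytes_per_s(64 * 1024, 100))     # 655360.0 (640 KB/s)
```

Dividing the second result by the link rate shows how much of the pipe a small window leaves empty, which is the whole argument for window scaling and larger buffers.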

5-Minute Whiteboard: TLS Handshake

1. Draw Client and Server
Two vertical timelines. Label them. Draw a dotted line for the network in between.
2. ClientHello arrow (right)
Label: "cipher suites, key share." Explain: "Client guesses what crypto the server supports and speculatively sends its half of Diffie-Hellman."
3. ServerHello + Cert + Finished (left)
Label: "chosen cipher, key share, certificate, MAC." Explain: "Server picks a cipher, sends its DH half (both sides can now compute shared secret), proves identity with certificate, and proves integrity with Finished MAC."
4. Client Finished (right)
Label: "handshake MAC." Explain: "Client verifies certificate chain, verifies Finished MAC, sends its own. Done — 1 RTT total."
5. Conclude with key points
"TLS 1.3 = 1 RTT. 0-RTT for repeat connections (session resumption, also used by QUIC). All data encrypted with symmetric key derived from DH. Certificate chain = trust anchor."
Quick check: You're designing a real-time stock ticker that pushes price updates to clients at 100 updates/second. Each update is 50 bytes. Should you use TCP, UDP, or QUIC?

Chapter 9: Connections

You now have a solid understanding of how computers communicate reliably, securely, and efficiently over networks. This knowledge is the foundation for everything in distributed systems.

What We Covered

Chapter | Core Concept | One-line Summary
0 | The Problem | IP is unreliable: packets get lost, reordered, duplicated, corrupted
1 | TCP Reliability | Sequence numbers + ACKs + retransmission make IP reliable
2 | Connection Lifecycle | 11 states, 4-way close, TIME_WAIT prevents stale segments
3 | Flow Control | Receive window prevents sender from overwhelming receiver
4 | Congestion Control | AIMD (sawtooth) prevents sender from overwhelming the network
5 | TLS/SSL | Asymmetric key exchange + symmetric encryption + certificate auth
6 | Custom Protocols | UDP for real-time, QUIC for multiplexed streams without HoL blocking
7 | Network in Practice | TCP_NODELAY, keep-alive, connection pooling, debugging
8 | Interview Arsenal | Cheat sheets, whiteboard patterns, design questions

How This Connects to Distributed Systems

Everything in this lesson is prerequisite knowledge for distributed systems topics:

Topic | Why networking matters | Related Lesson
Distributed Trouble | Network partitions, unreliable message delivery, timeout-based failure detection: all built on TCP's limitations | The Trouble with Distributed Systems
Replication | Leader-follower, multi-leader, leaderless: all depend on network reliability for consistency guarantees | Replication (coming soon)
Consensus | Paxos, Raft, ZAB: all designed around the assumption that the network is unreliable and asynchronous | Consensus (coming soon)
RPC Frameworks | gRPC uses HTTP/2 (TCP), connection pooling, keep-alive, TLS. Understanding the transport layer is essential for debugging RPC issues | RPC and Service Mesh (coming soon)

Key Takeaways for Systems Design

The network is always the bottleneck. In distributed systems, the speed of light sets an absolute lower bound on latency (about 5 μs per kilometer in fiber, 3.3 μs per kilometer in a vacuum). No protocol optimization can beat physics. Design your system to minimize the number of network round trips, not the size of individual messages.
Choose your trade-offs consciously. TCP trades latency for reliability. UDP trades reliability for latency. QUIC tries to give you both by operating at a higher level of abstraction. There is no universally best protocol — only the best protocol for your specific requirements.
Debug from the transport layer up. When something is "slow" or "broken" in a distributed system, start with the network: check for retransmissions, check rwnd/cwnd, check connection states. Many apparent "application bugs" turn out to be TCP misconfigurations. Your debugging toolkit: ss -tan for connection states, ss -tin dst X.X.X.X for per-connection metrics (cwnd, rwnd, RTT, retransmits), tcpdump -w capture.pcap to capture packets for Wireshark analysis, and traceroute to identify where packets are being dropped or delayed.
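
To make the first takeaway concrete, here is a rough back-of-the-envelope sketch. It assumes about 5 μs per kilometer of propagation delay (light in fiber travels at roughly two-thirds of its vacuum speed), ignores queuing, processing, and serialization time, and uses a hypothetical 4,000 km path. The function names are illustrative only.

```python
# Assumption: ~5 microseconds per km in fiber (about 200,000 km/s).
FIBER_DELAY_US_PER_KM = 5.0


def rtt_ms(distance_km, delay_us_per_km=FIBER_DELAY_US_PER_KM):
    """Propagation-only round-trip time in milliseconds."""
    one_way_us = distance_km * delay_us_per_km
    return 2 * one_way_us / 1000.0


def request_latency_ms(distance_km, round_trips):
    """Lower bound on latency for an exchange needing `round_trips` RTTs."""
    return round_trips * rtt_ms(distance_km)


if __name__ == "__main__":
    distance_km = 4000  # hypothetical client-to-server fiber path

    # New connection: TCP handshake (1 RTT) + TLS 1.3 handshake (1 RTT)
    # + the request/response itself (1 RTT).
    cold = request_latency_ms(distance_km, round_trips=3)

    # Pooled connection: handshakes already done, only the request/response.
    warm = request_latency_ms(distance_km, round_trips=1)

    print(f"RTT floor:          {rtt_ms(distance_km):.0f} ms")
    print(f"New connection:     {cold:.0f} ms")
    print(f"Pooled connection:  {warm:.0f} ms")
```

Under these assumptions the physics floor is 40 ms per round trip, so a fresh TCP + TLS 1.3 connection costs at least 120 ms before the first response arrives, while a reused connection costs 40 ms. Shrinking the payload changes neither number. Removing round trips does.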

The Evolution of Network Protocols

Understanding where these protocols came from helps you see where they are going.

Year | Milestone | Why it mattered
1974 | TCP/IP proposed (Cerf & Kahn) | First design for end-to-end reliable communication over unreliable networks
1981 | TCP and IP specified as separate protocols (RFC 793, RFC 791) | The split leaves room for UDP, for applications that don't need reliability
1986 | Internet congestion collapse | Motivated Van Jacobson's congestion control algorithms (1988)
1995 | SSL 2.0 (Netscape) | First widely deployed encryption for web traffic
1999 | TLS 1.0 (RFC 2246) | Standardized SSL, became the basis for HTTPS
2006 | CUBIC replaces BIC in Linux | Better high-bandwidth performance, still loss-based
2012 | QUIC development begins at Google | UDP-based transport to solve TCP's head-of-line blocking
2016 | BBR published by Google | Model-based congestion control: measure, don't infer from loss
2018 | TLS 1.3 (RFC 8446) | 1-RTT handshake, removed insecure ciphers, mandatory forward secrecy
2021 | QUIC standardized (RFC 9000) | Became the transport underneath HTTP/3 (RFC 9114, 2022), adopted by major browsers and CDNs

The trend is clear: move complexity from the kernel to userspace (QUIC), reduce round trips (TLS 1.3, 0-RTT), and decouple streams (QUIC multiplexing). The next frontier is likely kernel bypass (DPDK) and lower-overhead kernel I/O interfaces (io_uring) for the lowest-latency applications, and post-quantum cryptography for TLS.

Further Reading

If you want to go deeper into any of these topics, these are the definitive sources:

Topic | Source | Why read it
TCP internals | RFC 9293 (TCP specification, 2022) | The definitive TCP standard, replacing the original RFC 793 from 1981
Congestion control | Van Jacobson, "Congestion Avoidance and Control" (1988) | The paper that saved the internet. Introduces slow start and congestion avoidance.
TLS 1.3 | RFC 8446 | Complete TLS 1.3 specification. Surprisingly readable for an RFC.
QUIC | RFC 9000 + RFC 9001 (QUIC-TLS) | The QUIC transport protocol and its integration with TLS 1.3
BBR | Cardwell et al., "BBR: Congestion-Based Congestion Control" (2016) | Google's model-based congestion control, a fundamentally different approach from loss-based
Practical TCP | Ilya Grigorik, High Performance Browser Networking (free online) | Excellent coverage of TCP, TLS, HTTP/2, and performance optimization

Implementation Checklist

If you are building a production networked service, here is the minimum you must get right:

1. Set TCP_NODELAY
On every socket used for request-response communication. No exceptions.
2. Use connection pooling
Never create a new TCP connection per request. Pool and reuse.
3. Set reasonable timeouts
Connect timeout: 3-5s. Read timeout: 10-30s (depends on operation). Keep-alive: 60s.
4. Close sockets in error paths
Use try/finally or context managers. CLOSE_WAIT leaks (the peer has closed, but your code never did) are among the most common socket bugs.
5. Enable TLS with certificate verification
Never disable certificate verification in production. Use system CA bundle.
6. Monitor connection states
Alert on TIME_WAIT > 10,000 and CLOSE_WAIT > 100. Dashboard with ss/netstat.
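
Several of these items fit in a few lines of client code. The sketch below uses only Python's standard socket and ssl modules and covers items 1, 3, 4, and 5. It leaves out connection pooling and monitoring. The helper names, the constants, and the fetch_once function are hypothetical, and the timeout values are simply picked from the ranges above. Treat it as a starting point under those assumptions, not a production client.

```python
import socket
import ssl

# Values taken from the ranges in the checklist; tune for your workload.
CONNECT_TIMEOUT_S = 5.0
READ_TIMEOUT_S = 30.0
KEEPALIVE_IDLE_S = 60


def configure_tcp_socket(sock):
    """Items 1 and 3: disable Nagle's algorithm and enable keep-alive."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE is not exposed on every platform, so guard it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPALIVE_IDLE_S)


def make_tls_context():
    """Item 5: the default context verifies certificates and hostnames
    against the system CA bundle."""
    return ssl.create_default_context()


def fetch_once(host, port, request_bytes):
    """Send one request over TLS and return up to 4096 bytes of the reply.

    Item 4: both `with` blocks close the socket on every path,
    including exceptions, so no descriptor is left in CLOSE_WAIT.
    """
    context = make_tls_context()
    with socket.create_connection((host, port), timeout=CONNECT_TIMEOUT_S) as raw:
        configure_tcp_socket(raw)
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.settimeout(READ_TIMEOUT_S)
            tls.sendall(request_bytes)
            return tls.recv(4096)
```

In a real service you would keep the TLS socket open and hand it back to a pool instead of closing it after one request. The one-shot form here just keeps the cleanup path easy to see.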

"The nice thing about standards is that you have so many to choose from." — Andrew S. Tanenbaum

Final check: A distributed system engineer says "TCP guarantees reliable delivery, so we don't need to worry about message loss in our system." What is wrong with this statement?
From network foundations to distributed systems. Everything you learned here — TCP's reliability mechanisms, TLS's authentication model, congestion control's cooperative bandwidth sharing, QUIC's stream independence — forms the bedrock upon which distributed systems are built. When a distributed systems paper says "we assume an asynchronous network model with unreliable message delivery," they mean exactly what Chapter 0 showed you: packets can be lost, delayed, reordered, and duplicated. When a system design interview asks "how would you handle a network partition," the answer starts with what you learned about TCP timeouts, keep-alive probes, and the fundamental impossibility of distinguishing a slow node from a dead one. The network is not a magic pipe. It is an unreliable, adversarial medium that actively works against your distributed system. Now you understand exactly how, and exactly what TCP, TLS, and QUIC do to tame it.