Distributed Systems

Messaging & Events

Message queues, pub/sub, delivery guarantees, backpressure, dead letter queues, and event streaming.

Prerequisites: HTTP request/response + Basic networking. That's it.
10
Chapters
9
Simulations
0
Assumed Knowledge

Chapter 0: The Problem

A user places an order on your e-commerce site. Your order service needs to: (1) charge the payment, (2) update inventory, (3) send a confirmation email, (4) notify the warehouse, (5) update the analytics dashboard. In a synchronous architecture, the order service calls each of these services in sequence. The user waits for all five to complete before seeing "Order Confirmed."

This is fragile. If the email service is slow (3 seconds), the user waits 3 extra seconds. If the analytics service is down, the entire order fails — even though analytics is not critical to order processing. If the warehouse notification times out, the payment has already been charged but the warehouse does not know about the order. You have a partial failure with no easy recovery.

// Synchronous chain: every step must succeed
order_service.create_order() # 50ms
payment_service.charge() # 200ms
inventory_service.decrement() # 30ms
email_service.send_confirmation() # 3000ms (slow SMTP server)
warehouse_service.notify() # TIMEOUT! (service down)
analytics_service.record() # Never reached

// Total time: 3280ms + timeout = user sees error after 8+ seconds
// Payment charged but warehouse not notified = inconsistency

The fundamental issue: synchronous communication couples services in time. The sender must wait for the receiver. If the receiver is slow or down, the sender is affected. You need a way to decouple services so they do not need to be available at the same time.

Synchronous vs. Asynchronous

Watch an order flow through 5 services. Synchronous: the user waits for each. Asynchronous: the order service publishes a message and returns immediately.

Click a mode to begin

With a message queue, the order service publishes an "order created" event and immediately returns success to the user. The payment, inventory, email, warehouse, and analytics services each consume this event at their own pace. If the email service is slow, it processes messages at its own speed — the user does not wait. If analytics is down, messages queue up and are processed when it recovers. The services are decoupled in time.

This lesson covers asynchronous messaging — the glue that connects services without coupling them. We will cover message queues, publish/subscribe, delivery guarantees, ordering, backpressure, dead letter queues, and the architecture of systems like Kafka. By the end, you will understand how to design reliable event-driven systems.
Your order service synchronously calls 5 downstream services. The email service is down. What happens to new orders?

Chapter 1: Point-to-Point Messaging

The simplest messaging pattern is point-to-point (also called a work queue or task queue). One or more producers put messages into a queue. One or more consumers take messages out. Each message is delivered to exactly one consumer. Once consumed, the message is removed from the queue.

Think of a queue at a deli counter. Customers (producers) take a number. When a worker (consumer) is free, they call the next number. Each number is served by exactly one worker. If there are more workers, numbers get served faster. If workers are busy, numbers queue up.

Producer
Puts message into queue
Queue (Broker)
Stores messages in order. Tracks which are consumed.
Consumer
Pulls next message. Processes it. Acknowledges.

Multiple Consumers (Competing Consumers)

When you have more messages than one consumer can handle, add more consumers. They compete for messages — each message goes to one consumer. This is horizontal scaling for message processing.

// 3 consumers processing from the same queue
Queue: [msg1, msg2, msg3, msg4, msg5, msg6, ...]

Consumer A takes msg1 → processes → ACK
Consumer B takes msg2 → processes → ACK
Consumer C takes msg3 → processes → ACK
Consumer A takes msg4 → processes → ACK
...

// Throughput: 3x a single consumer
// Each message processed exactly once (if ACKs work correctly)

The Acknowledgment Dance

A consumer pulls a message and starts processing. What if it crashes midway? The message is lost. To prevent this, queues use acknowledgments (ACKs). The flow:

1. Consumer pulls message. Queue marks it as "in flight" (invisible to other consumers). 2. Consumer processes the message. 3. Consumer sends ACK to the queue. Queue permanently removes the message. 4. If no ACK within the visibility timeout, the queue assumes the consumer died and makes the message visible again for another consumer.

Point-to-Point Queue

A producer sends messages. 3 consumers compete to process them. Watch the queue grow and shrink as consumers process at different speeds.

Click Send to add messages to the queue

Real-World Queues

SystemModelNotable feature
RabbitMQPoint-to-point + pub/subFlexible routing via exchanges, AMQP protocol
Amazon SQSPoint-to-pointManaged, infinite scale, at-least-once delivery
Redis StreamsLog-based queueIn-memory, sub-ms latency, consumer groups
CeleryTask queue (Python)Wraps RabbitMQ/Redis, adds task scheduling
Point-to-point is for work distribution. Use it when each message represents a unit of work that must be processed by exactly one worker. Image resizing, email sending, payment processing — these are all point-to-point patterns. If multiple consumers need to see the same message, you need pub/sub (next chapter).
You have a queue with 3 consumers. Consumer B pulls a message and crashes before sending ACK. What happens to that message?

Chapter 2: Publish/Subscribe

Point-to-point delivers each message to one consumer. But what if multiple systems need to react to the same event? When an order is created, the payment service, inventory service, email service, and analytics service all need to know. Sending separate messages to 4 queues is wasteful and creates tight coupling — the producer must know about every consumer.

Publish/Subscribe (pub/sub) solves this. The producer publishes a message to a topic (or exchange). Consumers subscribe to the topic. Every subscriber gets a copy of every message. The producer does not know or care who the subscribers are.

// Point-to-point: producer must know all consumers
queue_payment.send(order_event) # Producer knows about payment
queue_inventory.send(order_event) # Producer knows about inventory
queue_email.send(order_event) # Producer knows about email
// Adding analytics: modify the producer code!

// Pub/sub: producer publishes to a topic, consumers subscribe
topic_orders.publish(order_event) # Producer publishes once
// Payment subscribes → gets a copy
// Inventory subscribes → gets a copy
// Email subscribes → gets a copy
// Adding analytics: just subscribe! No producer changes.

Fan-Out

This is called fan-out: one message fans out to N subscribers. Each subscriber processes the message independently. If the email subscriber is slow, it does not affect the payment subscriber. If analytics is down, the other subscribers still receive and process messages.

Consumer Groups

Pub/sub gives every subscriber a copy. But what if the email service has 3 instances for horizontal scaling? You do not want all 3 instances to send the same confirmation email. You want ONE of them to process each message.

The solution: consumer groups. Within a group, messages are distributed (point-to-point style). Across groups, messages are broadcast (pub/sub style). Each consumer group gets one copy of each message. Within the group, the message goes to one member.

// Topic: "order-events"
// Consumer Group "payment": 2 instances
// → msg1 goes to instance 1, msg2 goes to instance 2
// Consumer Group "email": 3 instances
// → msg1 goes to instance 1, msg2 goes to instance 2, msg3 goes to instance 3
// Consumer Group "analytics": 1 instance
// → all messages go to instance 1

// Each GROUP gets every message (pub/sub)
// Within each group, messages are distributed (point-to-point)
Pub/Sub with Consumer Groups

A producer publishes to a topic. 3 consumer groups subscribe. Each group gets every message; within each group, messages are distributed to members.

Click Publish to send events to the topic
Pub/sub enables event-driven architecture. Services do not call each other directly. Instead, they publish events ("order created") and subscribe to events they care about. This is the backbone of systems like Uber's event platform, LinkedIn's Kafka ecosystem, and Netflix's event bus. Adding a new consumer is zero-touch — just subscribe and start processing.
You publish "order_created" to a topic. The payment, inventory, and email groups are subscribed. How many total copies of the message exist?

Chapter 3: Delivery Guarantees

When a producer sends a message and a consumer processes it, three things can go wrong: the message might not arrive (lost), it might arrive more than once (duplicated), or it might arrive but out of order. Delivery guarantees define what the messaging system promises about these failure modes.

At-Most-Once

The message is sent once. If it is lost (network failure, broker crash), it is not retried. Each message is delivered zero or one times. This is the simplest and fastest: fire and forget.

Use case: metrics, logs, non-critical notifications. If you lose a single CPU metric sample, the dashboard still works. The system is simpler and faster because there are no retries or deduplication.

At-Least-Once

The message is retried until the broker confirms receipt. If the confirmation is lost (network blip), the producer retries — and the broker might receive the message twice. Each message is delivered one or more times.

Use case: most production systems. Order processing, payment initiation, inventory updates. Duplicates are handled by making the consumer idempotent — processing the same message twice has the same effect as processing it once. For example, "set inventory to 10" is idempotent; "decrement inventory by 1" is not.

// At-least-once: the duplicate problem
Producer sends msg → Broker receives, writes to disk → ACK
If ACK is lost: Producer retries → Broker receives AGAIN → duplicate!

// Making consumers idempotent:
// BAD (not idempotent): UPDATE balance SET amount = amount - 10
// Processing twice: deducts 20 instead of 10!

// GOOD (idempotent): UPDATE balance SET amount = 90 WHERE txn_id = 'abc'
// Processing twice: sets to 90 both times. Same result.

// Also GOOD: track processed message IDs
// if msg_id in processed_set: skip
// else: process, add msg_id to processed_set

Exactly-Once

Exactly-once delivery means each message is processed exactly one time. No losses, no duplicates. This is what everyone wants, but it is extremely hard to achieve in a distributed system — some argue it is impossible in the general case.

What systems like Kafka actually provide is exactly-once semantics (EOS): a combination of idempotent producers (deduplicate at the broker), transactional writes (atomic batch commits), and consumer offset management. The message might physically be delivered twice, but the side effects are applied exactly once.

Delivery Guarantees

Watch messages flow from producer to consumer under each guarantee. Click "Inject Network Error" to see how each handles failures.

Pick a guarantee level
GuaranteeDuplicates?Loss?Consumer requirementPerformance
At-most-onceNoPossibleNoneFastest
At-least-oncePossibleNoIdempotent processingFast
Exactly-onceNoNoTransactional processingSlowest
In practice, at-least-once with idempotent consumers is the sweet spot. It is simpler than exactly-once, fast, and does not lose messages. The burden shifts to the consumer: make your operations idempotent. Use unique message IDs, upserts instead of inserts, absolute values instead of increments. This is why almost every production messaging system defaults to at-least-once.
Your payment consumer processes "charge $10" messages. Due to at-least-once delivery, a message is delivered twice. How do you prevent charging $20?

Chapter 4: Message Ordering

Messages arrive in order: "create order", "charge payment", "ship order". If they are processed out of order — "ship order" before "charge payment" — you ship before being paid. Order matters.

A single queue with a single consumer guarantees FIFO (First In, First Out) ordering. Messages are processed in exactly the order they were sent. But a single consumer is a throughput bottleneck. Multiple consumers process messages in parallel, and parallel processing breaks ordering.

// Single consumer: ordering preserved
Queue: [msg1, msg2, msg3]
Consumer A takes msg1 → msg2 → msg3 (in order, guaranteed)

// Multiple consumers: ordering broken
Queue: [msg1, msg2, msg3]
Consumer A takes msg1 (slow, 500ms processing)
Consumer B takes msg2 (fast, 50ms processing)
Consumer B finishes msg2 BEFORE Consumer A finishes msg1
// msg2 was processed before msg1 — ordering violated!

Partitioned Ordering

The solution: partition the topic by key. All messages with the same key go to the same partition. Each partition has one consumer. Messages within a partition are ordered. Messages across partitions are not.

For our order system: use order_id as the partition key. All events for order #42 (created, paid, shipped) go to the same partition and are processed by the same consumer in order. Events for order #43 go to a different partition and are processed independently.

// Kafka partitioning by order_id
Topic "order-events", 4 partitions

order_id=42 → hash(42) % 4 = partition 2
Events: [created, paid, shipped] → processed in order by consumer 2

order_id=43 → hash(43) % 4 = partition 3
Events: [created, paid] → processed in order by consumer 3

// Order 42 events are ordered relative to each other
// Order 42 and 43 events are NOT ordered relative to each other
// But that's fine — they're independent orders
Ordering with Partitions

Messages for 3 orders flow into partitions by key. Each partition maintains order for its key. Watch how events for the same order are always in sequence.

Click Send to watch partitioned ordering
Total ordering is expensive. Partitioned ordering is practical. If you need ALL messages globally ordered, you need a single partition — which limits throughput to one consumer. If you need messages ordered per entity (per user, per order, per session), partition by that entity's ID. You get ordering where it matters and parallelism where it does not.
Your topic has 4 partitions and 4 consumers. You send events for order #42: "created", "paid", "shipped". Can "shipped" be processed before "paid"?

Chapter 5: Backpressure

A producer sends 10,000 messages per second. The consumer can process 1,000. The queue grows by 9,000 messages per second. In 10 minutes, there are 5.4 million messages in the queue. In an hour, 32.4 million. Eventually, the broker runs out of disk. New messages are rejected. The producer fails. The entire system is down.

This is the backlog problem, and its solution is backpressure — a mechanism that slows down the producer when the consumer cannot keep up. Instead of letting the queue grow unboundedly, the system signals "slow down" upstream.

Strategies for Backpressure

Bounded queue. Set a maximum queue size. When the queue is full, the producer's publish call blocks or returns an error. The producer must slow down or buffer locally. This is explicit backpressure — the producer knows it is going too fast.
Rate limiting at the producer. The producer voluntarily limits its send rate based on observed consumer lag. If the consumer falls behind (lag > threshold), the producer reduces its rate. This is cooperative backpressure.
Scaling consumers. Instead of slowing down the producer, add more consumers. If you have a partitioned topic, increase the number of partitions and consumers. This increases throughput to match the producer's rate. This is the Kafka approach — scale the consumption side.
// Consumer lag = last produced offset - last consumed offset
Producer has written message #500,000
Consumer has processed message #450,000
Lag = 500,000 - 450,000 = 50,000 messages behind

// If lag is growing: consumer is slower than producer
// Options:
// 1. Add more consumers (if partitions allow)
// 2. Slow down producer (backpressure)
// 3. Drop low-priority messages (load shedding)
// 4. Do nothing and pray (not recommended)
Backpressure Simulation

A producer sends faster than the consumer can process. Watch the queue grow. Enable backpressure to see the producer slow down when the queue fills.

Producer Rate 10/s
Pick a mode
Backpressure vs. load shedding. Backpressure slows down the producer — nothing is lost. Load shedding deliberately drops messages (usually low-priority ones) to protect the system. A real system uses both: backpressure to slow normal bursts, load shedding to survive extreme spikes. AWS SQS, for example, has no built-in backpressure — the queue grows until your account's message limit. Kafka has configurable retention that acts as a soft limit.
Your Kafka topic has 4 partitions and 4 consumers. Consumer lag is growing. You cannot slow down the producer. What is the correct fix?

Chapter 6: Dead Letter Queues

A consumer processes a message and fails. The message is retried. It fails again. And again. And again. This is a poison pill — a message that can never be successfully processed. Maybe the payload is malformed. Maybe it references a deleted record. Maybe there is a bug in the consumer that always crashes for this specific input.

Without safeguards, the poison pill blocks the queue forever. The consumer keeps retrying, the message keeps failing, and all messages behind it wait. This is the head-of-line blocking problem. One bad message takes down the entire pipeline.

A Dead Letter Queue (DLQ) is the solution. After N failed processing attempts, the message is moved to a separate queue — the DLQ. The original queue continues processing. Engineers can inspect the DLQ, fix the bug, and replay the messages.

Consumer pulls message
Attempts to process
↓ failure
Retry (attempt 2)
Same message, same consumer
↓ failure
Retry (attempt 3)
Last attempt
↓ failure
Move to DLQ
Message goes to dead letter queue. Main queue continues.
// AWS SQS Dead Letter Queue configuration
Main Queue: "order-processing"
DLQ: "order-processing-dlq"
Max receives: 3 # After 3 failures, move to DLQ

// Message lifecycle:
Attempt 1: consumer pulls, processes, FAILS (exception)
→ message becomes visible after visibility timeout
Attempt 2: consumer pulls again, FAILS
Attempt 3: consumer pulls again, FAILS
→ receive count = 3 = max receives → moved to DLQ

// DLQ messages sit until an engineer:
// 1. Inspects them (what went wrong?)
// 2. Fixes the bug in the consumer
// 3. Replays them back to the main queue
Poison Pill and Dead Letter Queue

Messages flow through the queue. One is a poison pill (red). Watch it fail repeatedly, then get moved to the DLQ while normal messages continue processing.

Click Start then Inject

DLQ Operations

OperationWhen to use
InspectLook at DLQ messages to diagnose failures. Check the error, payload, metadata.
ReplayAfter fixing the bug, send DLQ messages back to the main queue for reprocessing.
PurgeIf the messages are truly unrecoverable (corrupt data, obsolete events), delete them.
AlertSet up monitoring: if the DLQ has more than N messages, alert the on-call engineer.
Every production queue needs a DLQ. Without one, a single malformed message can block your entire pipeline. With one, bad messages are quarantined and the pipeline keeps flowing. Monitor DLQ depth — a growing DLQ means something is systematically wrong. A DLQ that stays empty means your consumers are healthy.
Your queue processes 1,000 messages/sec. A poison pill message enters. Without a DLQ (max retries = infinity), what happens?

Chapter 7: Kafka Architecture

Apache Kafka is the most widely deployed distributed messaging system. It is not a traditional message queue — it is a distributed commit log. Understanding its architecture explains why it is so fast and why it works the way it does.

The Log

Kafka stores messages in an append-only, ordered, immutable log. Each message is assigned a sequential offset (like a line number). Messages are never deleted from the log — they expire after a configurable retention period (e.g., 7 days). Consumers read the log by tracking their current offset.

// A Kafka partition is a log
Offset: [0] [1] [2] [3] [4] [5] [6] [7] ...
Data: msg msg msg msg msg msg msg msg

Producer appends to the end → offset 8
Consumer A is at offset 3 → reads [3, 4, 5, ...]
Consumer B is at offset 6 → reads [6, 7, ...]

// Key insight: the log is immutable. Multiple consumers
// can read the same data at different offsets without interfering.
// No "message deletion" on consume — just advance the offset.

Topics, Partitions, and Brokers

A topic is a category of messages (e.g., "order-events"). A topic is split into partitions for parallelism. Each partition is a separate log, stored on a separate broker (server). Partitions are replicated across brokers for fault tolerance.

// Topic "orders" with 3 partitions, replication factor 3
Partition 0: Leader on Broker 1, Replicas on Brokers 2, 3
Partition 1: Leader on Broker 2, Replicas on Brokers 1, 3
Partition 2: Leader on Broker 3, Replicas on Brokers 1, 2

// Writes go to the partition leader
// Leader replicates to followers (ISR = In-Sync Replicas)
// If a leader dies, a follower is promoted

// Throughput: 3 partitions × 100 MB/s per partition = 300 MB/s
// Storage: 3 brokers × 2 TB each = 6 TB total, 2 TB effective (RF=3)

Why Kafka Is Fast

OptimizationHow it helps
Sequential I/OAppend-only writes to disk. Sequential disk writes are faster than random RAM access.
Zero-copyData goes directly from disk to network socket via sendfile(). No user-space copying.
BatchingProducer batches messages. Fewer network requests, better compression.
Page cacheKafka relies on the OS page cache. Recently written data is read from RAM, not disk.
Partition parallelismEach partition is an independent I/O stream. More partitions = more parallelism.
Kafka Partition Layout

A topic with 3 partitions across 3 brokers. Watch messages arrive and get distributed by partition key. Consumers track their offsets independently.

Click Produce to send messages
Kafka is a log, not a queue. Traditional queues delete messages on consumption. Kafka retains messages for a configurable period. This means: (1) multiple consumer groups can read the same data independently, (2) a new consumer can start from the beginning and replay all history, (3) a consumer can "rewind" its offset to reprocess messages. This log-based model is why Kafka is used for event sourcing, stream processing, and data pipelines — not just message passing.
A Kafka topic has 6 partitions and you have a consumer group with 4 consumers. How are partitions assigned?

Chapter 8: Message Queue Simulator

Time to put everything together. Below is a full message queue simulator with a producer, broker, and consumers. Control the production rate, inject failures, watch delivery guarantees in action, and observe backpressure and dead letter queues.

This is your message queue playground. Send messages, kill consumers, inject poison pills, change production rate. Build intuition for how queues behave under stress and failure.
Message Queue Simulator

Producer sends messages to a broker. 3 consumers process them. Inject failures and poison pills to see retries and DLQ behavior.

Production Rate 5/s
Press Play to begin

What to Try

Experiment 1: Normal operation. Set rate to 5/s and press Play. Watch messages flow from producer → broker → consumers. All healthy.

Experiment 2: Kill a consumer. One consumer dies. Its messages are redistributed to the remaining two. Throughput drops but the queue stays alive.

Experiment 3: Inject a poison pill. Watch the red message fail repeatedly, then get moved to the DLQ (shown separately). Normal messages continue processing.

Experiment 4: Overload. Set rate to 15/s with only 1 consumer alive. Watch the queue depth grow. This is what backpressure should prevent.

Chapter 9: Connections

Messaging is the nervous system of distributed architectures. Here is how it connects to everything else.

What We Covered

ConceptKey takeaway
Point-to-pointOne message → one consumer. Work distribution. Competing consumers for throughput.
Pub/subOne message → all subscribers. Event broadcasting. Consumer groups for per-group distribution.
Delivery guaranteesAt-most-once (lossy, fast), at-least-once (safe, needs idempotency), exactly-once (hard).
OrderingTotal ordering limits throughput. Partition by key for per-entity ordering with parallelism.
BackpressureSlow the producer or scale consumers. Unbounded queues eventually crash.
Dead letter queuesQuarantine poison pills. Protect the pipeline. Alert, inspect, replay.
KafkaDistributed commit log. Append-only, offset-based, high throughput via sequential I/O.

Where to Go Next

TopicConnection
Load BalancingConsumer groups are a form of load balancing for message processing
Data StorageEvent-driven cache invalidation: publish change events, subscribers invalidate caches
Service ArchitectureMessaging replaces synchronous HTTP calls between services for async workflows
ConsensusKafka uses ZooKeeper/KRaft for leader election and partition assignment
"The nice thing about event-driven architecture is that it forces you to think about what happens when things go wrong." — Martin Kleppmann. Synchronous calls hide failure modes. Messaging makes them explicit: retries, dead letters, ordering guarantees, backpressure. Every design decision in a messaging system is a bet about which failure modes matter most for your use case.
You are designing a system where: (1) multiple services need each event, (2) events for the same entity must be in order, (3) processing must be at-least-once. Which architecture fits?