Stream Processing — DDIA Chapter 12

Chapter 0: The Problem

You run a bank. Every time a customer swipes their card, your system needs to decide: is this transaction legitimate, or is it fraud? Your current system works in batches. Every hour, a MapReduce job pulls the last hour of transactions, runs a fraud model on them, and flags the suspicious ones. The analytics team loves it — 99.2% accuracy!

But here is the problem. A stolen credit card gets used at 2:03 PM. Your batch job ran at 2:00 PM and will not run again until 3:00 PM. For the next 57 minutes, the thief racks up $14,000 in purchases. Your model would have caught it on the first transaction — if only it had seen it in time.

This is the fundamental limitation of batch processing. It processes data that has already been collected, in bounded chunks, on a schedule. The data sits idle between runs. The lag between "event happens" and "system reacts" is bounded by the batch interval. For fraud detection, this is unacceptable. For recommendation engines, it means serving stale suggestions. For monitoring dashboards, it means flying blind between updates.

Stream processing flips the model: instead of collecting data and then processing it, you process each event as it arrives. The fraud model sees the 2:03 PM transaction at 2:03 PM, not at 3:00 PM. The thief gets one transaction through, not thirty.

Batch vs. stream is not either/or. Most real systems use both. Batch for heavy historical reprocessing (train the fraud model on last year's data). Stream for real-time reaction (run the trained model on each transaction as it arrives). The Lambda Architecture formalized this dual approach, though modern stream processors like Flink increasingly blur the line by handling both. This chapter focuses on the stream side.

Watch the Lag Kill You

The simulation below shows the same fraud scenario. On the left, a batch system processes every 60 seconds. On the right, a stream system processes each event immediately. Watch the fraudulent transactions (red) slip through the batch window while the stream system catches them in real time.

Batch vs. Stream: Fraud Detection Lag

Fraudulent transactions appear in red. Watch how many slip through the batch window before detection.

Batch interval: 60s

The longer the batch interval, the more fraud slips through. At 120 seconds, a single stolen card can accumulate thousands in charges. The stream system catches fraud within milliseconds of the event arriving.

But stream processing is not just "faster batch." It introduces fundamentally new challenges: What does "time" mean when events arrive out of order? How do you aggregate over an unbounded dataset? How do you recover from failures without losing or duplicating events? These are the questions that make stream processing its own discipline, and the questions we will answer in this lesson.

Interview warm-up: Your team runs a batch analytics pipeline that processes web clickstream data every 15 minutes. The product team wants to add a "trending right now" feature that updates in under 5 seconds. A junior engineer proposes just running the batch job every 5 seconds instead. What is wrong with this approach?

Running a batch job every 5 seconds is fine — it is just a very small batch The problem is that batch jobs are always slower than stream processors Batch jobs have fixed overhead (job startup, input scanning, scheduling) that dominates at small intervals. A 15-min batch job might take 2 min to run — launching it every 5 seconds means each run overlaps the previous. You also lose the efficiency of processing large chunks (sequential I/O, bulk writes). Stream processing is architecturally different: it maintains long-running operators with in-memory state, processes events individually as they arrive, and amortizes startup cost over the lifetime of the job

Chapter 1: Event Streams

Before we can process a stream, we need to define what a stream is. Forget queues and brokers for a moment. The concept is simpler than the infrastructure.

An event is an immutable record of something that happened at a point in time. "User 42 clicked the buy button at 14:03:27.391 UTC." "Sensor 7 reported temperature 23.4C at 14:03:27.500 UTC." "Order 9001 was shipped at 14:03:28.012 UTC." Each event has a timestamp, a type, and a payload. Once it happened, it cannot un-happen. You can record a compensating event (a refund reverses a charge), but the original event remains in the record forever.

A stream (also called an event stream, event log, or event feed) is an unbounded, append-only, ordered sequence of events. "Unbounded" is the key word: unlike a file that has a beginning and an end, a stream has a beginning but no end. New events keep arriving indefinitely.

Streams vs. databases: two sides of the same coin. A database table represents current state — the latest balance for each account. A stream represents history — every deposit and withdrawal that ever happened. You can always derive state from history (replay the events), and you can always derive history from state changes (capture each mutation). This duality is called the stream-table duality, and it is one of the most powerful ideas in data engineering. Pat Helland calls it "the log is the database."

Producers, Streams, and Consumers

The basic architecture has three roles:

Producer (Publisher)

Generates events and writes them to a stream. A web server logging clicks. A sensor reporting readings. A database emitting change events. Producers do not know or care who reads the events.

↓

Stream (Log / Topic)

Stores events durably and in order. Each event gets a sequential position (offset, sequence number, or timestamp). The stream is the decoupling layer — producers and consumers never talk directly.

↓

Consumer (Subscriber)

Reads events from the stream. A fraud detector analyzing transactions. A search indexer updating Elasticsearch. A dashboard computing real-time metrics. Multiple consumers can read the same stream independently, each at their own pace.

This decoupling is powerful. The producer does not slow down if a consumer is behind. A new consumer can join at any time and read from the beginning. A slow consumer does not block a fast one. The stream is the buffer, the ordering guarantee, and the persistent record all in one.

Multiple Consumers, One Stream

The simulation below shows a stream with three consumers reading at different speeds. Consumer A is fast (real-time analytics). Consumer B is medium (search indexing). Consumer C is slow (batch export to a data warehouse). Each tracks its own position in the stream.

Multi-Consumer Stream

Events flow from the producer into the log. Each consumer reads at its own pace. Watch the offsets diverge.

Event Sourcing Preview

Notice something profound: if we store every event, we can reconstruct any state at any point in time. The current balance of account 42 is just the sum of all deposits minus all withdrawals up to now. The balance at 3 PM yesterday is the same sum, but stopping at that timestamp. This is event sourcing, and we will explore it in depth in Chapter 5.

python
# An event: immutable, timestamped, typed
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)  # frozen = immutable
class Event:
    timestamp: datetime
    event_type: str      # "deposit", "withdrawal", "transfer"
    payload: dict        # {"account": 42, "amount": 100.00}

# A stream: append-only, ordered
class EventStream:
    def __init__(self):
        self.log = []         # append-only list of events
        self.consumers = {}   # consumer_id -> offset

    def append(self, event: Event) -> int:
        offset = len(self.log)
        self.log.append(event)
        return offset         # sequential position

    def read(self, consumer_id: str, batch_size: int = 1):
        offset = self.consumers.get(consumer_id, 0)
        events = self.log[offset:offset + batch_size]
        self.consumers[consumer_id] = offset + len(events)
        return events

    def seek(self, consumer_id: str, offset: int):
        """Reset a consumer's position — enables replay."""
        self.consumers[consumer_id] = offset

The power of seek. The seek method is what makes log-based streams fundamentally different from traditional message queues. A consumer can rewind to any point and replay events. Found a bug in your search indexer? Fix the code, seek to offset 0, replay everything, and your index is now correct. Try doing that with a message queue that deletes messages after consumption.

Concept check: A stream has 1,000,000 events. Consumer A is at offset 999,998 (nearly caught up). Consumer B just joined and is at offset 0. Consumer C is at offset 500,000. Does Consumer B slow down Consumer A?

Yes — Consumer B reading old events creates I/O contention that slows Consumer A No — each consumer tracks its own offset independently. Consumer B reading from offset 0 has no effect on Consumer A reading from offset 999,998. The log is immutable and append-only, so readers do not contend with each other or with the writer. This is a key advantage of log-based streams over shared-state systems. It depends on whether the consumers are in the same consumer group

Chapter 2: Message Brokers

Events need to get from producers to consumers. In theory, you could have producers write directly to consumers (a webhook, a TCP socket, a UDP multicast). In practice, this creates tight coupling: the producer must know every consumer's address, handle retries if a consumer is down, and buffer events if a consumer is slow. This does not scale.

A message broker (also called a message queue or messaging middleware) sits between producers and consumers. Producers send events to the broker. Consumers pull events from the broker. The broker handles durability, buffering, delivery guarantees, and fan-out.

There are two fundamentally different models for message brokers, and understanding the difference is critical for system design interviews.

Model 1: Traditional Message Queues

Systems like RabbitMQ, ActiveMQ, and Amazon SQS follow this model. The broker maintains a queue of messages. When a consumer acknowledges a message, the broker deletes it. Key properties:

Property	Behavior
Delivery	Each message delivered to exactly one consumer (load balancing). If you have 3 consumers, each message goes to one of them.
Acknowledgment	Consumer sends ACK after processing. Broker deletes message on ACK. If consumer crashes before ACK, message is redelivered to another consumer.
Ordering	Best-effort. With multiple consumers, messages may be processed out of order (consumer A gets message 1, consumer B gets message 2, B finishes first).
Replay	Impossible. Messages are deleted after acknowledgment. If you need to reprocess, you must re-produce the messages.
Fan-out	Requires explicit configuration (exchanges/bindings in RabbitMQ). Each additional consumer group adds complexity.

Model 2: Log-Based Brokers

Systems like Apache Kafka, Amazon Kinesis, and Apache Pulsar follow this model. The broker is an append-only log. Messages are never deleted (until a configurable retention period expires). Key properties:

Property	Behavior
Delivery	Each partition delivered to one consumer per group, but multiple groups can read the same topic independently.
Acknowledgment	Consumer commits its offset. Messages stay in the log regardless. If consumer crashes, it restarts from last committed offset.
Ordering	Guaranteed within a partition. If all events for a given key hash to the same partition, they are processed in order.
Replay	Trivial. Reset the consumer offset to any position. Replay all events from the beginning if needed.
Fan-out	Free. Any number of consumer groups can independently consume the same topic with zero additional broker configuration.

The key insight. Traditional queues optimize for message delivery — get each message to a consumer and forget about it. Log-based brokers optimize for event history — keep a durable, replayable record of everything that happened. The right choice depends on whether you need history. If you are sending "process this task" commands to workers, a traditional queue is fine. If you are building a system where multiple downstream services need to derive state from a shared event log, you want Kafka.

Side-by-Side Comparison

The simulation below shows both models handling the same events. On the left, a traditional message queue delivers each message to one of two consumers and deletes it. On the right, a log-based broker appends each message and lets consumers track their own offset. Watch what happens when a consumer crashes and recovers.

Traditional Queue vs. Log-Based Broker

Watch messages get delivered and deleted (left) vs. persisted and consumed by offset (right). Crash a consumer to see the recovery difference.

Notice what happens after recovery. In the traditional queue, the crashed consumer lost the unacknowledged message — it may be redelivered (at-least-once) or lost (at-most-once), depending on configuration. In the log-based broker, the consumer simply resumes from its last committed offset. No message is ever lost because the log persists regardless of consumer state.

When to Use Which

text
Use a traditional message queue (RabbitMQ, SQS) when:
- Task distribution: "process this image", "send this email"
- You need per-message routing (complex topic/exchange patterns)
- Message order does not matter
- You do not need replay
- Consumer count >> partition count (thousands of workers)

Use a log-based broker (Kafka, Kinesis) when:
- Event sourcing: "this thing happened"
- Multiple independent consumers need the same data
- You need replay (reprocess after a bug fix)
- Ordering within a key matters (all events for user X in order)
- You are building CDC, analytics, or materialized views

Design question: You are building an e-commerce platform. The order service needs to notify three downstream systems: (1) the payment service (charge the card), (2) the inventory service (decrement stock), and (3) the analytics service (record the sale). Which broker model do you choose, and why?

Traditional message queue with three separate queues, one for each downstream service Log-based broker (Kafka). Publish the order event to one topic. Three consumer groups — payment, inventory, analytics — each independently consume the same stream. If analytics falls behind, payment is unaffected. If inventory has a bug, fix it, reset its offset, and replay. No duplicate publishing to multiple queues. Adding a fourth downstream service (e.g., recommendation engine) requires zero changes to the order service — just add a new consumer group. Direct HTTP webhooks from the order service to all three downstream services

Chapter 3: Apache Kafka

Kafka is the log-based broker that won. Originally built at LinkedIn in 2011 to handle their activity stream (page views, clicks, searches — 1.4 trillion messages per day at peak), it is now the backbone of event-driven architectures at Netflix, Uber, Airbnb, and thousands of other companies. Let us understand exactly how it works.

Topics and Partitions

A topic is a named stream of events. "orders", "user-clicks", "sensor-readings" — each is a topic. But a single topic on a single machine would be a bottleneck. So Kafka splits each topic into partitions.

A partition is an ordered, append-only log stored on a single broker. Each event in a partition gets a monotonically increasing offset (0, 1, 2, 3...). Ordering is guaranteed within a partition but NOT across partitions.

When a producer sends an event, it must go to exactly one partition. Two strategies:

Strategy	How it works	When to use
Key-based	hash(key) % num_partitions. All events with the same key go to the same partition.	When ordering per entity matters. E.g., all events for user_42 must be in order.
Round-robin	Distribute events evenly across partitions, ignoring keys.	When ordering does not matter and you want maximum throughput.

Partitions are the unit of parallelism. If a topic has 6 partitions, you can have at most 6 consumers in a consumer group reading in parallel (one partition per consumer). More partitions = more parallelism. But more partitions also mean more open file handles on brokers, more memory for offset tracking, and longer leader election times. LinkedIn's rule of thumb: start with max(expected_throughput_MB / 10, expected_consumer_count) partitions.

Consumer Groups

A consumer group is a set of consumers that cooperate to consume a topic. Kafka assigns each partition to exactly one consumer in the group. If you have 6 partitions and 3 consumers in a group, each consumer gets 2 partitions. If a consumer crashes, its partitions are reassigned to the remaining consumers — this is called a rebalance.

Multiple consumer groups can independently consume the same topic. Consumer group "fraud-detection" reads the "transactions" topic. Consumer group "analytics" reads the same topic. They do not interfere with each other. Each group tracks its own offsets.

Interactive Kafka Cluster

The simulation below shows a Kafka cluster with 3 brokers and 1 topic split into 6 partitions. Send messages with keys and watch them hash to partitions. Add consumers to a group and watch partition assignment. Remove a consumer and watch rebalancing.

Kafka: Partitions, Producers, and Consumer Groups

Send keyed messages, add/remove consumers, and watch partition assignment and rebalancing in real time.

Offsets and Commit

Each consumer tracks its offset — the position of the last event it has processed in each partition. Periodically (or after each event), the consumer commits its offset back to Kafka (stored in an internal topic called __consumer_offsets).

If a consumer crashes and restarts, it resumes from the last committed offset. This creates a delivery guarantee trade-off:

// At-most-once: commit BEFORE processing
offset = poll() # get events
commit(offset) # commit immediately
process(events) # if crash here, events are skipped

// At-least-once: commit AFTER processing (DEFAULT)
offset = poll() # get events
process(events) # process first
commit(offset) # if crash here, events are reprocessed

// Exactly-once: use Kafka transactions (Ch 8)
begin_transaction() # atomic: process + commit + produce
process(events)
produce(output_topic, results)
commit(offset)
commit_transaction() # all or nothing

Kafka Producer and Consumer in Python

python
from kafka import KafkaProducer, KafkaConsumer
import json

# --- Producer ---
producer = KafkaProducer(
    bootstrap_servers=['broker1:9092', 'broker2:9092'],
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    acks='all',           # wait for all replicas to acknowledge
    retries=3,             # retry on transient failures
    linger_ms=5,           # batch messages for 5ms for throughput
)

# Key determines partition: all user_42 events go to the same partition
producer.send(
    topic='transactions',
    key='user_42',
    value={'amount': 99.99, 'merchant': 'coffee_shop'}
)
producer.flush()  # block until all messages are sent

# --- Consumer ---
consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers=['broker1:9092'],
    group_id='fraud-detection',   # consumer group name
    auto_offset_reset='earliest',  # start from beginning if no offset
    enable_auto_commit=False,     # manual commit for at-least-once
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)

for msg in consumer:
    # msg.topic, msg.partition, msg.offset, msg.key, msg.value
    result = run_fraud_model(msg.value)
    if result.is_fraud:
        alert(msg.key, msg.value)
    consumer.commit()  # commit after successful processing

acks=all vs acks=1. With acks=1, the producer considers a message "sent" once the partition leader acknowledges it. If the leader crashes before replicating, the message is lost. With acks=all, the producer waits until all in-sync replicas (ISRs) acknowledge. This is slower but durable. For financial transactions, always use acks=all. For click tracking, acks=1 is often acceptable.

Debug scenario: You have a Kafka topic with 6 partitions and a consumer group with 8 consumers. Two consumers are sitting idle and never receive any messages. What is wrong, and how do you fix it?

Kafka assigns each partition to exactly one consumer per group. With 6 partitions and 8 consumers, 2 consumers have no partitions assigned and sit idle. The maximum useful consumer count equals the partition count. Fix: either reduce consumers to 6 or increase partitions (requires topic recreation or careful partition addition — note that adding partitions breaks key-based ordering guarantees for existing keys). The idle consumers are probably in a different consumer group The producers are not sending enough messages to keep all consumers busy

Chapter 4: Change Data Capture

Your application writes to PostgreSQL. Your search team wants every product update in Elasticsearch for full-text search. Your analytics team wants every order in the data warehouse. Your caching layer wants every user profile change in Redis. How do you keep all these systems in sync?

The naive approach: after writing to PostgreSQL, also write to Elasticsearch, the warehouse, and Redis. This is called dual writes, and it is a trap. If the PostgreSQL write succeeds but the Elasticsearch write fails, your systems are inconsistent. Even if you retry, the order of writes across systems is not guaranteed — one system might see update A then B, while another sees B then A. Dual writes are not atomic and do not preserve ordering.

Change Data Capture (CDC) is the solution. Instead of having the application write to multiple systems, you make the database the single source of truth and capture every change from its internal log. PostgreSQL has the Write-Ahead Log (WAL). MySQL has the binlog. These logs already record every INSERT, UPDATE, and DELETE — the database uses them for crash recovery and replication. CDC reads these logs and publishes the changes as events to a stream (usually Kafka).

The single-writer principle. CDC enforces a critical architectural constraint: there is exactly ONE system of record (the database), and all other systems are derived from it via the event stream. The application writes to PostgreSQL. Elasticsearch, Redis, and the warehouse are all consumers that derive their state from the CDC stream. If they diverge, you can always rebuild them by replaying from the beginning. This is vastly simpler than coordinating writes across multiple systems.

How CDC Works

1. Application writes to DB

INSERT INTO orders (id, user_id, amount) VALUES (9001, 42, 99.99). PostgreSQL writes this to its WAL before applying it to the table.

↓

2. CDC connector reads WAL

Debezium (the most popular CDC tool) connects to PostgreSQL's logical replication slot and reads WAL events in real time. Each event includes: operation type (INSERT/UPDATE/DELETE), the table name, the before and after values, and a transaction ID.

↓

3. Publish to Kafka

Each table becomes a Kafka topic (e.g., "db.public.orders"). The row's primary key becomes the Kafka message key. This ensures all changes for a given row go to the same partition and are processed in order.

↓

4. Consumers update derived systems

An Elasticsearch consumer reads the "orders" topic and updates the search index. A Redis consumer updates the cache. An analytics consumer loads data into the warehouse. Each is independent and can fall behind without affecting the others.

Log Compaction

Kafka topics have a retention period (default 7 days). After 7 days, old events are deleted. But what if a new consumer joins and needs the current state, not just the last 7 days of changes?

Log compaction solves this. Instead of deleting events by age, Kafka keeps only the latest event for each key. If user_42 has been updated 500 times, the compacted log retains only the most recent update. A new consumer can read the compacted log to build a complete snapshot of current state, then switch to the uncompacted tail for real-time updates.

To delete a record, the producer sends a tombstone — a message with the key but a null value. During compaction, the tombstone removes the key from the log.

Animated CDC Pipeline

CDC Pipeline: Database to Derived Systems

Application writes hit PostgreSQL. Debezium captures WAL changes. Kafka distributes to Elasticsearch, Redis, and the warehouse. Watch propagation lag.

Debezium Configuration

json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-primary",
    "database.port": "5432",
    "database.user": "debezium",
    "database.dbname": "ecommerce",
    "database.server.name": "prod-db",
    "table.include.list": "public.orders,public.products",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "dbz_publication",
    "topic.prefix": "prod-db",
    "snapshot.mode": "initial"
  }
}

snapshot.mode=initial. When Debezium first connects, it takes a consistent snapshot of the existing table data and publishes it as INSERT events. After the snapshot completes, it switches to streaming WAL changes. This means a new consumer sees the full history from time zero, not just changes after connector setup.

Design question: Your CDC pipeline publishes PostgreSQL changes to Kafka. An Elasticsearch consumer indexes them. A developer accidentally DELETEs 10,000 rows from PostgreSQL. The CDC pipeline faithfully propagates the DELETEs to Elasticsearch, wiping the search index. How do you recover?

Restore the PostgreSQL backup and wait for CDC to propagate the restored rows Use Elasticsearch's undo feature to reverse the deletes (1) Restore PostgreSQL from a point-in-time backup or use pg_dump. (2) The CDC connector will capture the restored rows as new INSERTs. But this only works if the Kafka topic has not yet been consumed past the delete events. A better long-term approach: enable Kafka log compaction on the CDC topic and keep the Elasticsearch consumer's offset stored separately. Reset the Elasticsearch consumer to offset 0 and replay the compacted log to rebuild the index from scratch. Prevention: use soft deletes (a "deleted" boolean column) instead of hard DELETEs, so CDC captures an UPDATE, not a permanent removal.

Chapter 5: Event Sourcing

CDC captures changes that the application made to a database. The application writes "current state" (UPDATE accounts SET balance = 150 WHERE id = 42), and CDC derives the event stream from those mutations. Event sourcing flips this: the application writes events directly ("user 42 deposited $50"), and current state is derived from the event log.

Think of it like accounting. A traditional database is like a balance sheet: it tells you "Account 42 has $150 right now." Event sourcing is like a general ledger: it records every transaction. "Account 42 opened with $0. Deposited $100. Withdrew $20. Deposited $70." The balance sheet is derived from the ledger, not the other way around.

Why Store Events Instead of State?

Benefit	Explanation
Complete audit trail	Every change is recorded forever. Who changed what, when, and what was the previous value. Required in finance, healthcare, and legal systems.
Temporal queries	"What was the account balance at 3 PM yesterday?" Just replay events up to that timestamp. With a state-based database, you would need point-in-time backups.
Debugging	A bug corrupted some user accounts. With event sourcing, replay the events through the fixed code to reconstruct correct state. With a state database, the corrupted state is all you have.
Multiple read models	Derive different views from the same events. One view optimized for querying balances. Another for fraud detection. Another for monthly statements. Each is a different projection of the same event log.
Decoupling	New features can subscribe to existing events without modifying the write path. "We want to send SMS notifications on large withdrawals" — just add a consumer, no code changes to the withdrawal service.

CQRS: Command Query Responsibility Segregation

Event sourcing naturally leads to CQRS: separate the write side (accepting commands, producing events) from the read side (querying derived views).

Command Side (Write)

Receives commands: "Deposit $50 to account 42." Validates the command (is the account active? is the amount positive?). If valid, appends an event to the log: "AccountDeposited { account: 42, amount: 50, timestamp: ... }". Does NOT update any queryable state.

↓ events

Event Log

The immutable, append-only source of truth. All events, in order, forever. This is the one thing you cannot lose.

↓ consumed by

Query Side (Read)

One or more "projections" that consume the event log and build materialized views optimized for queries. A balance projection maintains a table of current balances. A statement projection maintains monthly transaction summaries. Each is independently rebuildable by replaying events.

The cost of event sourcing. It is not free. (1) Schema evolution: events are immutable, so you cannot retroactively change their schema. You must handle versioning (event_v1, event_v2) and upcasting (converting v1 to v2 at read time). (2) Replay performance: replaying millions of events to rebuild state is slow. Solution: periodic snapshots — save the current state every N events, and replay only from the latest snapshot. (3) Eventual consistency: the read models lag behind the write model by the time it takes to consume and project events. This is usually milliseconds, but it is NOT zero.

Interactive Event Log

The simulation below is an event-sourced bank account. Append events (deposits, withdrawals) to the log on the left, and watch the derived balance update on the right. Hit "Replay from Scratch" to rebuild state from the event log.

Event Sourcing: Events → Derived State

Append banking events and watch the derived balance recompute. Hit Replay to rebuild state from scratch.

python
from dataclasses import dataclass, field
from typing import List
import time

@dataclass(frozen=True)
class AccountEvent:
    event_type: str    # "created", "deposited", "withdrawn"
    amount: float
    timestamp: float = field(default_factory=time.time)

class EventSourcedAccount:
    def __init__(self, account_id: str):
        self.account_id = account_id
        self.events: List[AccountEvent] = []
        self._balance = 0.0  # derived state (cache)

    def deposit(self, amount: float):
        assert amount > 0
        event = AccountEvent("deposited", amount)
        self.events.append(event)
        self._balance += amount  # update cache

    def withdraw(self, amount: float):
        assert 0 < amount <= self._balance
        event = AccountEvent("withdrawn", amount)
        self.events.append(event)
        self._balance -= amount

    def replay(self) -> float:
        """Rebuild balance from events. The source of truth."""
        balance = 0.0
        for e in self.events:
            if e.event_type == "deposited":
                balance += e.amount
            elif e.event_type == "withdrawn":
                balance -= e.amount
        self._balance = balance
        return balance

    def balance_at(self, timestamp: float) -> float:
        """Temporal query: what was the balance at a point in time?"""
        balance = 0.0
        for e in self.events:
            if e.timestamp > timestamp:
                break
            if e.event_type == "deposited":
                balance += e.amount
            elif e.event_type == "withdrawn":
                balance -= e.amount
        return balance

Concept check: An event-sourced system has 50 million events in its log. Replaying them to rebuild the current account balance takes 8 minutes. A new read model needs to be deployed. What technique reduces the rebuild time from 8 minutes to seconds?

Index the event log for faster lookups Periodic snapshots. Every N events (e.g., 100,000), save the current derived state (the balance) alongside the event offset. To rebuild, load the latest snapshot and replay only the events after it. If the snapshot is at event 49,900,000, you only replay 100,000 events instead of 50 million. Snapshots are derived data — they can always be regenerated from the event log if corrupted. Delete old events that are no longer needed

Chapter 6: Stream Processing Operations

In batch processing, aggregation is straightforward: "count all the clicks in yesterday's log." The dataset is bounded — you know when it starts and when it ends. In stream processing, the dataset is unbounded. Events keep arriving forever. You cannot "count all the clicks" because there is no "all." You need a way to slice the infinite stream into finite chunks for aggregation. This is called windowing.

Four Types of Windows

A window is a time interval over which you compute an aggregate (count, sum, average, max). There are four types, and they differ in how they slice time.

Window Type	Description	Example
Tumbling	Fixed-size, non-overlapping. Each event belongs to exactly one window.	Count clicks per 5-minute window: [0:00-5:00), [5:00-10:00), [10:00-15:00)...
Hopping	Fixed-size, overlapping. Window size > hop size. Each event may belong to multiple windows.	5-minute window, 1-minute hop: [0:00-5:00), [1:00-6:00), [2:00-7:00)... A click at 3:30 is in 4 windows.
Sliding	Triggered by events. Contains all events within a time duration of each other.	"Alert if more than 3 failed logins within 10 minutes of each other." No fixed grid — the window slides with each event.
Session	Grouped by activity gaps. A session ends when no events arrive for a configurable gap duration.	A user's website session: starts with first click, ends after 30 minutes of inactivity. Sessions can be arbitrarily long.

Event time vs. processing time. When did the event happen (event time: the timestamp in the event payload) vs. when did the system receive it (processing time: wall clock when the event arrives at the processor)? These can differ by seconds, minutes, or even hours (mobile device offline, network partition, backlog replay). Windowing should almost always use event time. A click that happened at 2:03 PM but arrived at 2:10 PM should be counted in the 2:00-2:05 window, not the 2:10-2:15 window. This creates the out-of-order event problem, which we solve with watermarks.

Watermarks

A watermark is the system's estimate of "all events with timestamps up to time T have probably arrived." When the watermark advances past the end of a window, the system fires the window's computation and emits results.

Watermarks are a heuristic. They can be:

Type	Behavior	Trade-off
Perfect	Wait until you are certain all events have arrived (e.g., all sources have advanced past T).	Correct but slow. Delays results until the slowest source catches up.
Heuristic	Estimate based on observed event times, allowing some tolerance for late events.	Fast but some late events may arrive after the watermark. These require special handling.

Late events arrive after the watermark has passed their window. Strategies: (1) Drop them. (2) Refire the window computation with updated results. (3) Route them to a side output for manual handling. Most production systems use a combination: allow late events up to an allowed lateness threshold, and drop anything beyond that.

Interactive Windowing Demo

This is the showcase simulation. Events arrive on a timeline. Choose a window type and size. Watch events get grouped into windows and aggregations computed per window. Late events (after the watermark) are highlighted.

Windowing: Tumbling, Hopping, Sliding, Session

Events arrive over time. Select a window type, adjust the size, and watch aggregation in real time. Late events appear in red.

Type: Window size: 5s

Windowing in Apache Flink

python
# Apache Flink Python API (PyFlink) — tumbling window aggregation
from pyflink.table import EnvironmentSettings, TableEnvironment
from pyflink.table.window import Tumble
from pyflink.table.expressions import col, lit

env_settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(env_settings)

# Read from Kafka topic
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url STRING,
        click_time TIMESTAMP(3),
        WATERMARK FOR click_time AS click_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'user-clicks',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json'
    )
""")

# Tumbling window: count clicks per user per 5-minute window
result = t_env.from_path('clicks') \
    .window(Tumble.over(lit(5).minutes).on(col('click_time')).alias('w')) \
    .group_by(col('user_id'), col('w')) \
    .select(
        col('user_id'),
        col('w').start.alias('window_start'),
        col('w').end.alias('window_end'),
        col('url').count.alias('click_count')
    )

WATERMARK FOR click_time AS click_time - INTERVAL '5' SECOND. This tells Flink: "Event time is in the click_time column, and events may arrive up to 5 seconds late." The watermark is always 5 seconds behind the latest observed event time. When the watermark passes the end of a 5-minute window, Flink fires the window. Events arriving more than 5 seconds late are dropped (unless you configure allowed lateness).

Design question: You are building a "trending topics" feature that shows the top 10 hashtags on a social platform, updated every minute. Users in different time zones post at different rates. Should you use tumbling or hopping windows, and why?

Tumbling windows of 1 minute — one result per minute, simple and efficient Hopping windows of 5 minutes with a 1-minute hop. Tumbling 1-minute windows are too spiky — a hashtag that trends for 4 minutes might appear and disappear each minute. A 5-minute hopping window smooths the signal while still updating every minute. Each update reflects the last 5 minutes of activity, giving a more stable "trending" ranking. The trade-off: each event is counted in 5 windows instead of 1, so you use 5x the computation and state. Session windows grouped by user activity

Chapter 7: Stream Joins

In batch processing, joining two datasets is straightforward: both are bounded, both are fully available, and you can sort-merge or hash-join them. In stream processing, data arrives continuously and you cannot "wait for all the data" because it never ends. Stream joins are fundamentally harder, and there are three types — each with different semantics and implementation costs.

Type 1: Stream-Stream Join

You have two event streams and you want to match events from one with events from the other, within a time window. The canonical example: an ad platform wants to match ad impressions (user saw an ad) with ad clicks (user clicked the ad), where the click must happen within 1 hour of the impression.

Implementation: maintain a buffer for each stream. When an impression arrives, store it in the impression buffer. When a click arrives, search the impression buffer for a matching impression within the time window. Emit a joined result. Expire entries from the buffer when they fall outside the window.

// Stream-stream join pseudocode
// Impression arrives:
impression_buffer[impression.ad_id].append(impression)
for click in click_buffer[impression.ad_id]:
  if |click.time - impression.time| ≤ 1 hour:
    emit(impression, click) # matched!

// Click arrives:
click_buffer[click.ad_id].append(click)
for impression in impression_buffer[click.ad_id]:
  if |click.time - impression.time| ≤ 1 hour:
    emit(impression, click) # matched!

// Periodically evict expired entries from both buffers

State cost. The buffer for each stream must hold all events within the join window. With 10 million impressions per hour and a 1-hour window, that is 10 million entries in memory. This is why stream-stream joins are expensive and why the window size directly controls memory usage. Shrink the window if you can.

Type 2: Stream-Table Join (Enrichment)

You have an event stream and you want to enrich each event with data from a table. Example: a "user activity" stream where each event has a user_id, and you want to add the user's name, country, and subscription tier from the user profile table.

Two approaches:

Approach	How it works	Pros	Cons
Remote lookup	For each event, query the database for the user profile.	Always fresh data.	Network latency per event. Database becomes a bottleneck at high throughput.
Local table copy	Use CDC to maintain a local copy of the user table in the stream processor. Join against the local copy.	No network latency. No database load.	Data may be slightly stale (CDC lag). Uses local memory/disk.

In practice, the local table copy (via CDC) is almost always the right choice for high-throughput systems. The table is typically much smaller than the stream and fits in memory.

Type 3: Table-Table Join (Materialized View)

Both inputs are CDC streams (representing tables), and the output is a continuously updated join. Example: a "orders" table and a "products" table. You want a materialized view that joins them: for each order, include the product name and price.

When an order is inserted, look up the product. When a product's price changes, update all orders that reference it. This is a live, continuously maintained join — the same thing a database does internally for a materialized view, but implemented at the application level using streams.

Three Join Types Side-by-Side

Stream Joins: Stream-Stream, Stream-Table, Table-Table

Watch events flow through three different join patterns. Stream-stream buffers events within a time window. Stream-table enriches events from a local table. Table-table maintains a live materialized view.

Join type:

python
# Stream-stream join: match impressions with clicks within 1 hour
from collections import defaultdict
import time

class StreamStreamJoin:
    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.left_buffer = defaultdict(list)   # key -> [(timestamp, event)]
        self.right_buffer = defaultdict(list)

    def add_left(self, key, ts, event):
        self._evict(self.left_buffer, key, ts)
        self.left_buffer[key].append((ts, event))
        # Check right buffer for matches
        matches = []
        for rts, revt in self.right_buffer.get(key, []):
            if abs(ts - rts) <= self.window:
                matches.append((event, revt))
        return matches

    def add_right(self, key, ts, event):
        self._evict(self.right_buffer, key, ts)
        self.right_buffer[key].append((ts, event))
        matches = []
        for lts, levt in self.left_buffer.get(key, []):
            if abs(ts - lts) <= self.window:
                matches.append((levt, event))
        return matches

    def _evict(self, buffer, key, current_ts):
        """Remove entries outside the join window."""
        if key in buffer:
            buffer[key] = [(t, e) for t, e in buffer[key]
                           if current_ts - t <= self.window]

# Usage: match ad impressions with clicks within 1 hour
join = StreamStreamJoin(window_seconds=3600)
# Impression arrives
matches = join.add_left("ad_123", time.time(), {"type": "impression"})
# Later, click arrives
matches = join.add_right("ad_123", time.time(), {"type": "click"})
# matches = [(impression_event, click_event)]

Design question: You have a user activity stream (100,000 events/sec) and a user profile table (50 million rows, updated ~100 times/sec). You need to enrich each activity event with the user's country. Which join type do you use, and how do you implement it?

Stream-stream join with a large window to capture profile updates Remote database lookup for each activity event Stream-table join with a local table copy. Use CDC (Debezium) to capture the user profile table changes and maintain a local copy in the stream processor (e.g., Flink's RocksDB state backend or a Kafka KTable). Each activity event looks up the user's country from the local copy — zero network latency. The 50M row table is ~4 GB in memory (80 bytes/row), which is manageable. Updates propagate via CDC with sub-second lag, so the local copy is nearly always fresh. At 100K events/sec, a remote DB lookup would require a connection pool of thousands and add 1-5 ms latency per event.

Chapter 8: Fault Tolerance

A batch job that fails halfway through is easy to recover: delete the partial output and re-run the whole job. The input is immutable, the output is deterministic, and the job is bounded. Stream processing is harder. The job runs forever. It accumulates state (window aggregates, join buffers, counters). If it crashes, you cannot just "re-run from the beginning" — that would take hours or days. And you must guarantee that each event is processed correctly despite failures. Not once. Not zero times. Exactly once.

The Three Delivery Guarantees

Guarantee	Meaning	How achieved	Risk
At-most-once	Events may be lost, but never duplicated.	Commit offset before processing. If crash, skipped events are gone.	Data loss. Acceptable for low-value metrics (page view counts).
At-least-once	Events are never lost, but may be duplicated.	Commit offset after processing. If crash, replay from last offset. Some events are reprocessed.	Duplicates. Fine if processing is idempotent (SET key=value). Dangerous if not (INCREMENT counter).
Exactly-once	Each event is processed exactly once, even across failures.	Either idempotent operations, or transactional writes, or Chandy-Lamport checkpointing.	Performance cost. Harder to implement. But required for financial systems.

"Exactly-once" is a lie (sort of). Truly processing each event exactly once is impossible in a distributed system (network partitions can always cause ambiguity about whether a message was received). What we actually achieve is effectively exactly-once: the system may internally re-process events, but the observable output is the same as if each event were processed once. This is achieved through a combination of idempotent writes, transactional commits, and checkpointing.

Strategy 1: Idempotent Writes

Design your output operations so that re-processing an event produces the same result as processing it once.

// IDEMPOTENT — safe to repeat:
SET user_42_balance = 150 # writing the same value twice is fine
SET user_42_last_login = "2024-01-15"

// NOT IDEMPOTENT — dangerous to repeat:
INCREMENT user_42_balance BY 50 # +50 twice = +100, not +50
INSERT INTO events (...) # duplicate row

To make non-idempotent operations safe, include a deduplication key (event ID or offset) in the output. Before writing, check if that key already exists. If it does, skip the write.

Strategy 2: Kafka Transactions

Kafka provides transactional producers that atomically write to multiple partitions and commit consumer offsets in a single transaction. Either all writes succeed and the offset is committed, or none of them do.

python
# Kafka transactional producer — exactly-once semantics
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['broker:9092'],
    transactional_id='fraud-processor-1',  # unique per instance
    enable_idempotence=True,               # required for transactions
)
producer.init_transactions()

try:
    producer.begin_transaction()
    # Process input events and produce output
    for msg in batch:
        result = process(msg.value)
        producer.send('fraud-alerts', key=msg.key, value=result)
    # Commit input offsets as part of the transaction
    producer.send_offsets_to_transaction(
        offsets={tp: offset for tp, offset in current_offsets.items()},
        group_id='fraud-detection'
    )
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()  # rollback everything

Strategy 3: Chandy-Lamport Checkpointing (Flink)

Apache Flink uses a distributed snapshotting algorithm based on the Chandy-Lamport protocol. The idea:

1. Checkpoint Barrier

The coordinator injects a special "barrier" event into the input stream. This barrier flows through the pipeline like a regular event.

↓

2. Operator Snapshot

When an operator receives the barrier, it snapshots its current state (window contents, counters, join buffers) to durable storage (HDFS, S3). Then it forwards the barrier downstream.

↓

3. Barrier Alignment

If an operator has multiple input channels, it waits until it has received the barrier on ALL channels before snapshotting. This ensures the snapshot represents a consistent cut across the entire pipeline.

↓

4. Recovery

On failure, all operators restore from the latest complete checkpoint. Input sources (Kafka) seek back to the offsets recorded in the checkpoint. Processing resumes. Output may produce duplicates unless sinks are idempotent or transactional.

Flink Checkpointing in Action

The simulation below shows a Flink pipeline with 3 operators processing events. A checkpoint barrier flows through the pipeline, each operator saves its state. Then a failure occurs — watch the system restore from the checkpoint and replay.

Flink Checkpointing and Recovery

Watch checkpoint barriers flow through operators. Each operator snapshots state. On failure, the system rolls back and replays from the checkpoint.

Checkpoint interval trade-off. Frequent checkpoints (every 100 ms) mean fast recovery (replay only 100 ms of events) but high overhead (constant state snapshotting). Infrequent checkpoints (every 10 minutes) mean low overhead but slow recovery (replay 10 minutes of events). Flink's default is 10 seconds. For exactly-once sinks, the checkpoint interval also determines the minimum output latency (results are committed atomically with checkpoints).

Design Challenge: Fraud Detection with Exactly-Once

Design a real-time fraud detection system that processes credit card transactions from Kafka, applies a fraud model, and writes alerts to a "fraud-alerts" topic with exactly-once semantics. Consider: what happens if the fraud model has side effects (e.g., incrementing a "suspicious activity count" per user)?

text
Design sketch:

1. Input: Kafka topic "transactions" (partitioned by card_id)
2. Flink job with:
   - Source: KafkaSource with exactly-once checkpoint mode
   - Operator 1: Enrich with user profile (stream-table join via CDC)
   - Operator 2: Windowed aggregation (count transactions per card in 5-min window)
   - Operator 3: Fraud model (flag if count > threshold or model score > 0.9)
   - Sink: KafkaTransactionalSink to "fraud-alerts" topic

3. Checkpointing: every 5 seconds to S3
4. The "suspicious activity count" per user MUST be part of Flink's
   managed state (not an external database), so it is included in
   checkpoints and restored correctly on failure.
5. If count is in an external DB, use idempotent writes:
   SET suspicious_count[user] = computed_value (not INCREMENT)

Debug scenario: Your Flink job processes Kafka events and writes results to PostgreSQL. After a failure and recovery, you notice duplicate rows in PostgreSQL — some events produced two output rows. Your Flink checkpointing is configured for exactly-once. What went wrong?

The Flink checkpointing is broken and needs to be reconfigured The Kafka consumer is not committing offsets correctly Flink's exactly-once guarantees apply to its INTERNAL state and Kafka sinks (which use Kafka transactions). But the PostgreSQL sink is EXTERNAL — Flink wrote rows to PostgreSQL before the checkpoint, then crashed. On recovery, it replayed from the checkpoint and wrote the same rows again. Fix: make the PostgreSQL sink idempotent (use UPSERT / ON CONFLICT DO UPDATE with a deduplication key, such as the Kafka offset or a deterministic event ID), or use a two-phase commit sink that only commits PostgreSQL transactions atomically with Flink checkpoints.

Chapter 9: Interview Arsenal

This chapter compresses every stream processing concept into interview-ready form. Organized by the five interview dimensions: concept, design, code, debug, and frontier.

Concept Cheat Sheet

Concept	One-liner	Key detail
Event stream	Unbounded, append-only, ordered sequence of immutable events	Stream-table duality: stream = changelog of a table
Traditional broker	Message delivered to one consumer, deleted after ACK	RabbitMQ, SQS. Good for task distribution. No replay.
Log-based broker	Append-only log, consumers track offsets, messages persist	Kafka, Kinesis. Replay by resetting offset. Fan-out is free.
Partition	Unit of ordering and parallelism in Kafka	Max consumers per group = num partitions. Key hash determines partition.
Consumer group	Set of consumers cooperating on a topic. Each partition -> one consumer.	Multiple groups independently consume the same topic.
CDC	Capture database mutations from the WAL/binlog as events	Debezium reads WAL, publishes to Kafka. Single source of truth.
Log compaction	Keep only latest value per key in the log	Enables new consumers to build full snapshot from compacted log.
Event sourcing	Store events, derive state. Opposite of CDC.	Enables temporal queries, audit trails, multiple read models.
CQRS	Separate write model (events) from read model (materialized views)	Read models are rebuildable by replaying the event log.
Tumbling window	Fixed-size, non-overlapping time buckets	Each event in exactly one window. Simplest and cheapest.
Hopping window	Fixed-size, overlapping (window > hop)	Smooths spiky data. Events counted in multiple windows.
Session window	Grouped by activity gap (no events for N seconds = end)	Variable length. Good for user sessions, browsing activity.
Watermark	System's estimate: "all events up to time T have arrived"	Triggers window computation. Late events arrive after watermark.
Event time vs. processing time	When it happened vs. when the system saw it	Always window by event time. Processing time gives wrong results.
Stream-stream join	Join two streams within a time window	Requires buffering both sides. Memory = f(window size, throughput).
Stream-table join	Enrich events with table lookups	Use CDC to keep local table copy. Avoids DB bottleneck.
Exactly-once	Observable output same as if each event processed once	Achieved via idempotence, Kafka transactions, or Flink checkpointing.
Chandy-Lamport	Distributed snapshot via barrier injection	Flink's approach. Barriers flow through pipeline. Each operator snapshots state.

System Design Questions

Design: Real-time analytics dashboard. Requirements: show page views per URL per minute, updated in real time. Approach: Kafka topic "pageviews" partitioned by URL. Flink job with tumbling 1-minute windows, aggregating counts per URL. Output to a Kafka topic "pageview-counts". A WebSocket server consumes this topic and pushes updates to the dashboard. For "top 10 URLs," use a global window with a top-N aggregator (requires shuffling all data to one operator — expensive). Alternative: approximate with a Count-Min Sketch for lower memory usage.

Design: Notification deduplication service. Problem: multiple microservices can trigger the same notification (email, push, SMS). Ensure each user receives at most one notification per event, even if triggered by 3 different services. Approach: all notification triggers publish to a Kafka topic keyed by (user_id, event_id). A Flink job deduplicates: for each (user_id, event_id), use a keyed state map. If the key is seen for the first time, emit a send-notification event. If already seen, suppress. Use a TTL (e.g., 24 hours) on the state to bound memory. Exactly-once via Flink checkpointing + idempotent notification delivery (include event_id in the send request, notification service deduplicates).

Design: Fraud detection pipeline. Requirements: flag suspicious credit card transactions within 500 ms. Approach: Kafka topic "transactions" partitioned by card_id. Flink job with: (1) Stream-table join to enrich with user profile (CDC from PostgreSQL). (2) Keyed state per card_id tracking sliding window of transaction count, total amount, and geographic velocity (distance between last two transactions / time). (3) Rule engine + ML model score. Flag if: more than 5 transactions in 1 minute, or total amount exceeds $5000 in 10 minutes, or geographic velocity > 500 km/h (card used in two distant cities within minutes). Output to "fraud-alerts" topic consumed by the blocking service. Exactly-once via Flink checkpointing. Latency target: checkpoint interval (5s) + processing time (~10 ms) = ~5s end-to-end. For sub-second, use unaligned checkpoints.

Debug Scenarios

Symptom	Likely cause	Fix
Consumer lag growing unboundedly	Consumer is slower than producer. Could be: slow processing logic, GC pauses, insufficient parallelism (too few partitions), or partition rebalance storms (consumers joining/leaving too frequently).	Profile the consumer. Increase partitions and consumers. Tune max.poll.records and max.poll.interval.ms. Use static group membership to avoid rebalance storms.
Duplicate events in output	At-least-once delivery without idempotent sinks. Consumer committed offset after crash, replayed events, and wrote them again.	Make sinks idempotent (UPSERT with dedup key) or use Kafka transactions / Flink exactly-once.
Out-of-order results	Events from different partitions interleaved. Or windowing uses processing time instead of event time.	Ensure events that need ordering share the same key (same partition). Switch to event-time windowing with watermarks.
High checkpoint duration	Large operator state (big join buffers, wide windows). State backend (RocksDB) creating large incremental snapshots.	Reduce window sizes. Enable incremental checkpointing. Use SSD for RocksDB. Increase checkpoint timeout.
Missing events after schema change	Producer changed the event schema (added/removed fields). Consumer's deserializer fails and silently drops events.	Use a schema registry (Confluent, Apicurio). Enforce backward/forward compatibility. Use Avro or Protobuf with schema evolution rules.

Coding Drill: Tumbling Window Aggregator

python
from collections import defaultdict
import time

class TumblingWindowAggregator:
    """Count events per key per tumbling window."""

    def __init__(self, window_size_seconds: int):
        self.window_size = window_size_seconds
        # (window_start, key) -> count
        self.windows = defaultdict(int)
        self.emitted = set()  # track which windows have been emitted

    def _window_start(self, event_time: float) -> float:
        """Assign event to its tumbling window."""
        return event_time - (event_time % self.window_size)

    def process(self, key: str, event_time: float):
        ws = self._window_start(event_time)
        self.windows[(ws, key)] += 1

    def fire_windows(self, watermark: float):
        """Emit results for all windows whose end <= watermark."""
        results = []
        for (ws, key), count in list(self.windows.items()):
            window_end = ws + self.window_size
            if window_end <= watermark and (ws, key) not in self.emitted:
                results.append({
                    'window_start': ws,
                    'window_end': window_end,
                    'key': key,
                    'count': count,
                })
                self.emitted.add((ws, key))
        return results

# Example usage
agg = TumblingWindowAggregator(window_size_seconds=60)
agg.process("url_a", 1000.0)  # window [960, 1020)
agg.process("url_a", 1010.0)  # same window
agg.process("url_b", 1005.0)  # same window, different key
agg.process("url_a", 1025.0)  # next window [1020, 1080)

# Watermark advances past 1020 -> fire the first window
results = agg.fire_windows(watermark=1025.0)
# [{'window_start': 960, 'window_end': 1020, 'key': 'url_a', 'count': 2},
#  {'window_start': 960, 'window_end': 1020, 'key': 'url_b', 'count': 1}]

Coding Drill: Idempotent Event Processor

python
import hashlib, json

class IdempotentProcessor:
    """Process events exactly once using a deduplication log."""

    def __init__(self):
        self.seen = set()       # set of processed event IDs
        self.state = {}          # application state (e.g., user balances)

    def _event_id(self, event: dict) -> str:
        """Deterministic ID from event content."""
        raw = json.dumps(event, sort_keys=True).encode()
        return hashlib.sha256(raw).hexdigest()

    def process(self, event: dict) -> bool:
        eid = self._event_id(event)
        if eid in self.seen:
            return False  # already processed, skip

        # Process the event
        user = event['user_id']
        amount = event['amount']
        self.state[user] = self.state.get(user, 0) + amount

        self.seen.add(eid)
        return True  # successfully processed

# Same event processed twice -> only applied once
proc = IdempotentProcessor()
event = {'user_id': '42', 'amount': 100, 'ts': 1234567890}
proc.process(event)  # True  — applied, balance = 100
proc.process(event)  # False — skipped, balance still 100

Staff-level design: You are designing a real-time leaderboard showing the top 100 players by score in a multiplayer game with 10 million concurrent users. Score updates arrive as a Kafka stream at 500,000 events/sec. The leaderboard must update in under 2 seconds. How do you architect this?

Single Flink operator maintaining a sorted set of all 10M users Stream scores into a Redis sorted set and query top 100 Two-tier approach: (1) Kafka topic partitioned by player_id, Flink job computes per-player running totals in keyed state — cheap, embarrassingly parallel. (2) Downstream, a second Flink operator receives score summaries and maintains a top-100 heap (only 100 entries, tiny state). This operator receives at most one update per player per window, not 500K raw events/sec. Emit the top-100 list every 1-2 seconds to a Kafka topic consumed by the leaderboard API. Redis sorted set is a valid alternative for the top-N, but at 500K writes/sec to a single Redis key you risk hot-key issues — the two-tier Flink approach reduces writes to Redis to the actual change set (players entering/leaving top 100).

Chapter 10: Connections

Stream processing does not exist in isolation. It connects to nearly every topic in distributed systems. Here is how this chapter relates to the rest of DDIA and beyond.

Where Stream Processing Fits

Related Topic	Connection
Batch Processing (DDIA Ch 11)	Stream processing is "unbounded batch." Lambda Architecture runs both. Modern systems (Flink) unify them. Batch for backfills and retraining, stream for real-time.
Storage & Retrieval (DDIA Ch 4)	Kafka is an LSM-tree-like append-only log with compaction. Flink's RocksDB state backend IS an LSM-tree. Stream processing state management is a storage engine problem.
Replication (DDIA Ch 6)	CDC is database replication exposed as a public API. Kafka itself uses leader-follower replication for fault tolerance. Consumer groups are a form of partitioned processing.
Transactions (DDIA Ch 8)	Exactly-once semantics in stream processing is a transaction problem. Kafka transactions are lightweight distributed transactions. Flink checkpointing is a form of distributed snapshot.
Consistency (DDIA Ch 10)	Stream processing systems provide eventual consistency by default. Exactly-once makes the output consistent. Watermarks are a form of causal ordering.
Partitioning (DDIA Ch 7)	Kafka partitions are hash-partitioned. Key-based routing ensures related events are co-located. Repartitioning (changing key) in Flink is a shuffle operation.

Comparison: Stream Processing Frameworks

Framework	Processing model	State backend	Exactly-once	Best for
Apache Kafka Streams	Library (no cluster)	RocksDB, in-memory	Via Kafka transactions	Kafka-native apps, microservices
Apache Flink	Distributed runtime	RocksDB, heap	Via Chandy-Lamport checkpointing	Complex event processing, large state
Apache Spark Structured Streaming	Micro-batch	HDFS/S3	Via write-ahead log + idempotent sinks	Teams already using Spark for batch
Amazon Kinesis Data Analytics	Managed Flink	Managed	Flink-level	AWS-native, operational simplicity
Google Dataflow (Beam)	Unified batch/stream	Managed	Via shuffle + dedup	GCP-native, portable pipelines

Key Papers and Reading

Paper/Resource	Why it matters
Kafka: a Distributed Messaging System for Log Processing (Kreps et al., 2011)	The original Kafka paper. Explains the log abstraction and why it works.
MillWheel: Fault-Tolerant Stream Processing at Internet Scale (Akidau et al., 2013)	Google's internal stream processor. Introduced per-key exactly-once and low watermarks.
The Dataflow Model (Akidau et al., 2015)	Unified model for batch and stream. Event time, watermarks, triggers, accumulation. Apache Beam's theoretical foundation.
Apache Flink: Stream and Batch Processing in a Single Engine (Carbone et al., 2015)	Flink's architecture. Chandy-Lamport checkpointing for exactly-once.
Questioning the Lambda Architecture (Kreps, 2014)	Jay Kreps argues that a single stream processing system can replace the dual batch+stream Lambda Architecture. The "Kappa Architecture."
Turning the database inside-out (Kleppmann, 2015)	Martin Kleppmann's talk connecting databases, caches, indexes, and streams as different views of the same data. The intellectual foundation for CDC and event sourcing.

The big picture. Stream processing is not just a technology — it is a way of thinking about data. Instead of "store data, then query it," you think "data flows through, computations happen as it passes." This shift affects how you design services (event-driven microservices), how you maintain derived data (CDC instead of dual writes), and how you build real-time features (windowed aggregations instead of periodic batch). The log is the database. The stream is the table. Everything is connected.

Final synthesis: A colleague says, "We should replace all our batch pipelines with stream processing — it's strictly better because it's faster." Is this correct?

Yes — stream processing can do everything batch can, but faster No. Stream processing has trade-offs: (1) Harder to reason about correctness (out-of-order events, late data, watermark heuristics). (2) More complex operationally (long-running stateful jobs, checkpoint management, schema evolution). (3) Some computations are inherently bounded (train a model on last year's data, generate a monthly report). (4) Batch can leverage sort-merge joins and full-dataset optimizations that streams cannot. (5) Backfill and reprocessing are simpler with batch. The right approach is usually both: stream for real-time, batch for heavy historical computation. The Kappa Architecture argues you can use only stream, but in practice most organizations use a mix. It depends on the volume of data