Designing Data-Intensive Applications — Chapters 13 & 14

Philosophy & Ethics

Streaming philosophy, bias, privacy, responsibility — the human side of data systems.

Prerequisites: All prior DDIA chapters. This is the capstone.
8 chapters · 6+ simulations · 3 core themes

Chapter 0: The Problem

You have spent twelve chapters learning how to build data systems. You know how to replicate, partition, process in batch, process in streams, achieve consensus, and handle failure. You can design a system that scales to millions of users and recovers from any crash. Congratulations. You are dangerous.

Here is why that word is deliberate. A credit scoring algorithm you build will decide who gets a home loan and who does not. A hiring system you design will filter resumes — and encode whatever biases exist in the training data into automated decisions that affect people's livelihoods. A social media feed you optimize for engagement will, if left unchecked, amplify outrage because outrage drives clicks.

The technical skills from the previous chapters are necessary but not sufficient. A data system is not just infrastructure — it is a lens that shapes what is possible, what is surveilled, who is included, and who is excluded. Every architectural decision has a human consequence, whether you intended it or not.

This chapter is different. There is no single "correct" architecture here. Instead, we tackle two intertwined themes from Kleppmann's closing chapters: (1) the philosophical framing of how data should flow through systems, and (2) the ethical responsibilities you carry as the person who builds them. These are not soft topics — they show up in system design interviews, in production incidents, and in courtrooms.

The Web of Consequences

The simulation below shows a simplified ecosystem of data systems: social media, credit scoring, hiring platforms, criminal justice risk tools, and ad targeting. Each system feeds data into the others. Watch what happens when a single decision in one system — say, a biased training dataset in the hiring platform — propagates through the entire web.

Data System Consequence Web

Click a system node to inject a bias. Watch it propagate through data flows to other systems. Click "Reset" to clear.


A biased hiring algorithm feeds employment data into credit scoring systems. Lower credit scores feed into higher insurance premiums and reduced access to housing. Reduced housing access feeds into neighborhood-level data that criminal justice risk models use. Each system is "just using the data it has" — and yet the compound effect is discriminatory at a scale no single system intended.

This is the core tension of these final chapters. Technical excellence without ethical awareness is not neutral — it is harmful by default. The rest of this lesson equips you with the philosophical framing and the ethical vocabulary to navigate these questions, both in interviews and in production.

Why is a technically correct system not automatically an ethical one?

Chapter 1: The Philosophy of Streaming

We covered the mechanics of stream processing in Chapter 12. Now step back from the APIs and windowing functions. Batch versus stream is not merely a technical choice — it is a philosophical one about how you model time and reality.

Two Worldviews

Batch processing sees the world as a series of periodic snapshots. You freeze reality at regular intervals, process what accumulated, and move on. It is like a photographer taking one picture per hour. Between photos, the world changes and you do not notice.

Stream processing sees the world as a continuous flow of events. Every change is captured as it happens. It is like a video camera — nothing is lost between frames. The data is a living record of reality unfolding.

This difference runs deeper than performance. It changes what questions you can answer. A batch system can tell you "as of midnight, you had 14,232 active users." A stream system can tell you "at 3:47:12 PM, user #8891 went from active to inactive, which was the 17th deactivation in the last hour — a rate 3x higher than normal." The batch view is a summary. The stream view is the truth.

Event sourcing as a worldview. The log of events IS the truth. Derived views — databases, caches, indexes — are ephemeral projections. You can always rebuild them from the log. You cannot rebuild the log from a snapshot. This is not just a software pattern. It is how accounting has worked for 700 years: the journal (event log) is the source of truth. The balance sheet (derived view) is computed from it. You never modify the journal — you add correction entries.

Immutability: You Cannot Rewrite History

In an event-sourced system, once something happened, it happened. You do not go back and edit the event. If a customer was charged $50 by mistake, you do not delete the $50 charge. You add a new event: "refund $50." The ledger is append-only.

This is the same principle behind double-entry bookkeeping, invented in 13th-century Italy. Every transaction creates two entries (debit and credit) that must balance. Errors are corrected with new entries, never by erasing old ones. The complete history is always available for audit.
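The append-only rule can be sketched in a few lines. This is a minimal illustration, not a production ledger: the `LedgerEvent` type, the event kinds, and the whole-dollar amounts are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event cannot be modified after creation
class LedgerEvent:
    kind: str      # "charge" or "refund" (hypothetical event kinds)
    amount: int    # whole dollars, to keep the sketch simple
    note: str = ""


def net_charged(events):
    """Fold the full history into the net amount charged to the customer."""
    total = 0
    for event in events:
        if event.kind == "charge":
            total += event.amount
        elif event.kind == "refund":
            total -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return total


log = []  # append-only: entries are never edited or deleted
log.append(LedgerEvent("charge", 50, "charged by mistake"))
log.append(LedgerEvent("refund", 50, "correction for the mistaken charge"))

print(net_charged(log))  # net effect after the correction
print(len(log))          # both entries remain available for audit
```

The mistaken charge is still in the log after the correction, which is exactly what an auditor needs: the current balance and the story of how it got there.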

Why does this matter for software? Three reasons:

| Property | Why it matters | Example |
| --- | --- | --- |
| Auditability | You can trace exactly how you got to the current state | A regulator asks "why was this loan denied?" You can replay the exact sequence of events that led to the decision. |
| Debuggability | You can replay events to reproduce bugs | A race condition that corrupts state at 3 AM can be replayed deterministically from the event log. |
| Recovery | Derived views can be rebuilt from scratch | Your search index is corrupted? Rebuild it by replaying the event log. No data loss. |

Batch vs. Stream: What Gets Lost

The simulation below shows the same data viewed through both lenses. On the left: a batch system takes periodic snapshots. On the right: a stream system captures every event. Toggle between them to see what information the batch view loses.

Batch Snapshots vs. Continuous Stream

Watch events flow in real time. The batch side only updates at snapshot intervals. Notice the events that slip between snapshots.


With a long snapshot interval, the batch side misses short-lived spikes entirely — a brief surge of errors, a flash sale that sold out in seconds, a momentary network partition. The stream side captures every fluctuation. Increase the snapshot interval to see how much information is lost.

Why This Matters for System Design

The choice between batch and stream is not just about latency. It determines what questions your system can answer. Consider three scenarios:

| Scenario | Batch view | Stream view |
| --- | --- | --- |
| User churn detection | "Last month we lost 2,400 users." (One number, no context.) | "At 3 PM on Tuesday, engagement dropped 40% for users in the Northeast, correlating with a deployment that broke mobile push notifications." (Root cause identified within minutes.) |
| Inventory management | "At midnight, we had 340 units in stock." (May be stale by morning.) | "Item #8891 just sold out at 10:47 AM. 14 users currently have it in their cart. Begin backorder process." (Real-time, actionable.) |
| Security monitoring | "Yesterday's logs show 47 failed login attempts from IP x.x.x.x." (Discovered 24 hours too late.) | "IP x.x.x.x has failed 5 logins in 30 seconds: brute-force attack in progress. Blocking." (Stopped in real time.) |

Event sourcing gives you both views from the same data. The stream view is always available. The batch view is just a materialized query over the event log at a point in time. You never have to choose one or the other — you get both, as long as you build on an event log.
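To make "the batch view is a materialized query over the log" concrete, here is a small sketch. The `StockEvent` shape, the integer timestamps, and the function names are all assumptions for illustration; the point is only that both views are computed from the same list of events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StockEvent:
    ts: int      # seconds since the start of the day (hypothetical clock)
    item: str
    delta: int   # +n for a restock, -n for a sale


EVENTS = [
    StockEvent(0, "item-8891", +3),
    StockEvent(100, "item-8891", -1),
    StockEvent(200, "item-8891", -2),  # stock reaches zero here
    StockEvent(300, "item-8891", +5),
]


def snapshot_at(events, cutoff_ts):
    """Batch view: stock levels as of cutoff_ts, computed from the log."""
    stock = {}
    for e in events:
        if e.ts <= cutoff_ts:
            stock[e.item] = stock.get(e.item, 0) + e.delta
    return stock


def sold_out_moments(events):
    """Stream view: process events in order and note when an item hits zero."""
    stock, alerts = {}, []
    for e in sorted(events, key=lambda ev: ev.ts):
        stock[e.item] = stock.get(e.item, 0) + e.delta
        if stock[e.item] == 0:
            alerts.append((e.ts, e.item))
    return alerts


print(snapshot_at(EVENTS, 400))   # end-of-day snapshot: looks healthy
print(sold_out_moments(EVENTS))   # the sell-out the snapshot never shows
```

The end-of-day snapshot reports five units in stock and gives no hint that the item was unavailable for a stretch in between. The stream-style pass over the same events surfaces that moment.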

A financial auditor asks you to explain exactly why a user's account balance changed from $500 to $340 between Tuesday and Wednesday. In a CRUD database that only stores current state, can you answer this?

Chapter 2: Dataflow as the Unifying Abstraction

Here is a radical idea: every data system you have studied in this book is the same thing. A database, a cache, a search index, an analytics warehouse — they are all derived views of an underlying event log. They differ only in how they project that log into a queryable form.

The Unified Model

Think of your event log as a river. A database is a reservoir that stores the latest state — it is a snapshot of the river's cumulative effect. A search index is a filter that extracts specific patterns from the water. A cache is a bucket that holds the most recently fetched water for quick access. An analytics warehouse is a dam that collects water for periodic measurement.

Each of these is just a materialized view — a different lens on the same underlying data. And because they are derived, they can be rebuilt from the event log at any time. The event log is the single source of truth. Everything else is a read-optimized projection.

| System | What it really is | Projection type |
| --- | --- | --- |
| Relational DB | Cache of latest state per key | Latest-value lookup |
| Search index | Inverted index of event content | Full-text search |
| Cache (Redis) | Hot subset of derived state | Frequently-accessed lookup |
| Analytics warehouse | Aggregated summary over event history | OLAP cubes, star schemas |
| Materialized view | Pre-computed query result | Denormalized join |

This changes how you think about architecture. Instead of "which database should I use?" the question becomes "what derived views do I need, and how do I keep them in sync with the event log?" Adding a new feature means adding a new consumer of the log, not migrating the database schema. Removing a feature means deleting the consumer. The log remains unchanged.
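As a rough sketch of "different projections of the same log," the snippet below derives a latest-value store and a tiny inverted index from one list of events. The `(key, text)` event shape and both function names are hypothetical, and real systems would consume the log incrementally rather than rescanning it.

```python
from collections import defaultdict

# Hypothetical event shape: (key, text). A later event for a key replaces it.
EVENT_LOG = [
    ("doc1", "stream processing basics"),
    ("doc2", "batch processing basics"),
    ("doc1", "stream processing in depth"),
]


def build_latest_state(log):
    """'Database' projection: the latest value per key."""
    state = {}
    for key, text in log:
        state[key] = text
    return state


def build_search_index(log):
    """'Search index' projection: word -> set of keys, over the latest state."""
    index = defaultdict(set)
    for key, text in build_latest_state(log).items():
        for word in text.split():
            index[word].add(key)
    return dict(index)


db = build_latest_state(EVENT_LOG)
index = build_search_index(EVENT_LOG)

print(db["doc1"])
print(sorted(index["processing"]))
```

Adding a third projection (say, a word-count report) would mean writing one more function over `EVENT_LOG`; neither of the existing two would change.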

Lambda vs. Kappa Architecture

Two architectures formalize this idea:

Lambda Architecture (Nathan Marz, 2011): run two parallel pipelines. A batch layer periodically reprocesses the entire event history to produce accurate but delayed views. A speed layer (stream processor) handles recent events for low-latency but approximate views. Query results merge both layers. The downside: you maintain two codepaths that must produce the same results. Every business logic change must be implemented twice.

Kappa Architecture (Jay Kreps, 2014): just the speed layer. Stream processing replaces batch entirely. If you need to reprocess historical data, replay the event log through a new version of the stream processor. One codebase, one pipeline. The downside: your stream processor must handle the full throughput of a historical replay, which may be orders of magnitude higher than real-time rate.

Lambda: Event Log → Batch Layer (MapReduce, periodic) + Speed Layer (stream, real-time) → Merge results at query time. Two codepaths, two maintenance burdens.
↓ simplified to
Kappa: Event Log → Stream Processor (handles both real-time and replay) → Derived Views. One codebase. Replay = reprocess from log offset 0.
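The Kappa idea of "reprocess by replaying the log through a new version of the processor" might look like this in miniature. The page-view events, the `bot` flag, and the two processor versions are invented for the example; an in-memory list stands in for a real log such as a Kafka topic.

```python
def count_views_v1(state, event):
    """v1 logic: count every page view."""
    state[event["page"]] = state.get(event["page"], 0) + 1


def count_views_v2(state, event):
    """v2 logic: same counts, but skip traffic flagged as bots."""
    if not event.get("bot", False):
        state[event["page"]] = state.get(event["page"], 0) + 1


def run(processor, log, from_offset=0):
    """Feed log[from_offset:] through a processor and return the derived view."""
    state = {}
    for event in log[from_offset:]:
        processor(state, event)
    return state


log = [
    {"page": "/home"},
    {"page": "/home", "bot": True},
    {"page": "/pricing"},
]

view_v1 = run(count_views_v1, log)  # the view currently being served
view_v2 = run(count_views_v2, log)  # new view, rebuilt by replaying from offset 0

print(view_v1)
print(view_v2)
```

The same `run` function serves both live processing and historical replay, which is the single-codepath property that distinguishes Kappa from Lambda. Once `view_v2` has caught up, traffic can be switched over and the old view dropped.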

The Unbundled Database

The simulation below shows the unified dataflow model. An event log sits at the center. Derived systems — a database, a cache, a search index, an analytics view — are all fed from the same log. You can add or remove derived systems. Each one is independently scalable and replaceable.

Unified Dataflow: The Unbundled Database

Events flow from the log to all derived systems. Toggle systems on/off. All stay in sync via the log.

Notice: when you toggle a system off and then back on, it catches up by replaying from the log. No data is lost. This is the power of the event log as source of truth — derived systems are disposable. You can swap out Postgres for CockroachDB, or Elasticsearch for Meilisearch, without touching the log or any other derived system.

This is what Kleppmann calls the unbundled database. A traditional DBMS bundles storage, indexing, caching, and query processing into one product. The unbundled approach separates them into independently deployable services, all fed from the event log. The benefit: each component can be the best tool for its job (RocksDB for write-heavy storage, Elasticsearch for full-text search, Redis for low-latency caching). The cost: you own the glue between them. The event log IS that glue.

This is also the key insight behind modern data platforms like Confluent (Kafka ecosystem), Materialize (streaming SQL), and Decodable (stream processing). They are all building tools for the "event log as source of truth" world.

Your search index is corrupted after a deployment bug. In a traditional architecture (search index is a separate system with its own ingestion pipeline), what do you do? In a unified dataflow architecture (search index is a derived view of the event log), what do you do?

Chapter 3: Predictive Analytics and Bias

You build a hiring algorithm. You train it on ten years of your company's hiring data: who applied, who was interviewed, who was hired, who succeeded. The model learns patterns and starts predicting which new applicants will be successful. It is 89% accurate on your test set. Ship it.

Six months later, a reporter discovers that your algorithm rejects female applicants at 2.5x the rate of male applicants. What happened?

Your company's historical hiring data reflects a decade of human decisions made in an industry where 78% of engineers hired were male. The model did not "decide" that women are worse engineers. It learned that historically, women were hired less often, and it replicated that pattern. The model is a mirror of your past, not an oracle of the future.

This is not a hypothetical. Amazon built a hiring tool that penalized resumes containing the word "women's" (as in "women's chess club captain"). The system was trained on 10 years of resumes, reflecting a male-dominated hiring history. It also downgraded graduates of two all-women's colleges. Reuters reported in 2018 that Amazon had already scrapped the project, but the lesson stands: historical data encodes historical biases, and ML models trained on it reproduce and can amplify them.

The Feedback Loop

It gets worse. A biased prediction does not just reflect historical bias — it creates new bias. Here is the mechanism:

1. Biased Training Data
Historical data reflects past discrimination. Fewer women hired, fewer minorities approved for loans, more policing in certain neighborhoods.
2. Model Learns Bias
The model finds statistical patterns that correlate protected attributes (race, gender, zip code) with outcomes. It learns "zip code 90011 = high risk" because that is what the data says.
3. Biased Predictions
The model rejects more applicants from certain demographics. Fewer loans, fewer job offers, higher bail amounts.
4. Biased Outcomes
Rejected applicants have worse financial outcomes, live in under-resourced areas, have fewer opportunities. The real world changes to match the prediction.
↻ feeds back into training data

This is a self-fulfilling prophecy. The model predicts failure for a group, denies them opportunities, the group fails more often (because of denied opportunities, not lack of ability), and the model gets retrained on data that "confirms" its original prediction. Each cycle amplifies the bias.

COMPAS: A Case Study

In 2016, ProPublica analyzed COMPAS, a criminal justice risk assessment tool used in courts across the United States to predict recidivism (whether a defendant will reoffend). Their findings:

| Metric | Black defendants | White defendants |
| --- | --- | --- |
| Labeled high-risk but did NOT reoffend (false positive) | 44.9% | 23.5% |
| Labeled low-risk but DID reoffend (false negative) | 28.0% | 47.7% |
| Overall accuracy | ~65% | ~65% |

The overall accuracy was similar for both groups — about 65%. But the types of errors were distributed very differently. Black defendants were far more likely to be falsely flagged as high-risk (and thus given higher bail, longer sentences). White defendants were far more likely to be falsely labeled low-risk. Same accuracy, dramatically different consequences.

Fairness is not a single number. There are at least 21 mathematically distinct definitions of "fairness" in machine learning. Some are mutually exclusive — you literally cannot satisfy all of them simultaneously (the impossibility theorem of Chouldechova, 2017). Choosing which fairness metric to optimize is a value judgment, not a technical decision. The engineer who says "my model is fair because it has equal accuracy across groups" is hiding a choice — equal accuracy is not the same as equal false-positive rates.
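The gap between "equal accuracy" and "equal error rates" is easy to check with a confusion matrix per group. The counts below are made up for illustration (they are not the COMPAS data); they are chosen so that both groups have the same accuracy.

```python
def rates(tp, fp, tn, fn):
    """Return accuracy, false-positive rate, and false-negative rate."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),  # share of actual negatives wrongly flagged
        "fnr": fn / (fn + tp),  # share of actual positives wrongly cleared
    }


# Hypothetical counts for two groups of 100 people each.
group_a = rates(tp=30, fp=30, tn=35, fn=5)
group_b = rates(tp=15, fp=10, tn=50, fn=25)

for name, r in (("A", group_a), ("B", group_b)):
    print(name, {k: round(v, 3) for k, v in r.items()})
```

Both groups come out at 65% accuracy, yet group A is wrongly flagged far more often and group B is wrongly cleared far more often. Which of those gaps matters more is the value judgment the callout above describes.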

The Bias Amplification Simulator

The simulation below shows a feedback loop in action. A training dataset has an initial bias (adjustable). A model learns from it, makes predictions, those predictions affect outcomes, and the outcomes feed back into the training data. Watch the bias amplify over cycles. Then toggle "Bias Correction" to see what happens when you intervene at each cycle.

Bias Feedback Loop Simulator

Each cycle: train → predict → outcomes → retrain. Watch the bias grow. Toggle correction to flatten it.


Without correction, even a small initial bias (0.10) doubles within 5 cycles and can reach catastrophic levels by cycle 10. With correction — re-sampling, adversarial debiasing, or human review of flagged decisions — the bias stabilizes or shrinks. The key insight: bias correction is not a one-time step. It must be applied at every retraining cycle, forever.
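A toy model of this loop, under loudly stated assumptions: bias is a single number, each retraining cycle multiplies it by `1 + gain`, and a correction step removes a fixed fraction every cycle. The `gain=0.15` and `correction=0.2` values are arbitrary choices for the sketch, not parameters taken from the simulation on this page.

```python
def simulate(initial_bias, cycles, gain=0.15, correction=0.0):
    """Return the bias level after each train -> predict -> retrain cycle."""
    bias = initial_bias
    history = [bias]
    for _ in range(cycles):
        bias *= 1 + gain         # retraining on biased outcomes amplifies the bias
        bias *= 1 - correction   # intervention applied at every cycle
        bias = min(bias, 1.0)    # cap at 1.0 so the toy model stays bounded
        history.append(bias)
    return history


uncorrected = simulate(0.10, cycles=10)
corrected = simulate(0.10, cycles=10, correction=0.2)

print([round(b, 3) for b in uncorrected])
print([round(b, 3) for b in corrected])
```

With these assumed parameters the uncorrected run roughly doubles in five cycles, while the corrected run shrinks because the per-cycle correction outweighs the per-cycle amplification. Setting `correction` to a value only on the first cycle would not have that effect, which is the point about correction being continuous.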

A model has 90% accuracy for both Group A and Group B. A critic says the model is biased. Is the critic necessarily wrong?

Chapter 4: Privacy and Surveillance

You collect data because your system needs it. User locations for ride-sharing. Browsing history for recommendations. Purchase history for fraud detection. Each data point is collected for a specific, reasonable purpose. But data collected for one purpose invariably gets used for another.

The browsing history collected for recommendations gets subpoenaed in a divorce case. The location data collected for ride-sharing gets sold to a data broker who sells it to a bounty hunter. The purchase history collected for fraud detection gets used to build a "creditworthiness" profile that determines whether you qualify for an apartment.

The "Nothing to Hide" Fallacy

The most common defense of mass data collection is: "If you have nothing to hide, you have nothing to fear." This argument collapses on inspection.

You have curtains on your windows. You close the bathroom door. You do not CC your boss on every text to your spouse. Privacy is not about hiding wrongdoing. It is about controlling who knows what about you and when. You share different things with your doctor, your employer, your friends, and your government. The ability to maintain these contextual boundaries is what privacy is.

When a data system collapses all these contexts into a single profile — your medical searches, your political donations, your late-night browsing, your location at 2 AM on a Saturday — it strips away your ability to present different facets of yourself to different audiences. This is not a theoretical harm. It is the mechanism by which people get fired for political opinions, denied insurance for genetic conditions, and stalked by abusive ex-partners.

Consent theater. The average American encounters 1,462 privacy policies per year. Each takes roughly 10 minutes to read. That is 244 hours — six full work weeks — of reading per year just to understand what you are "consenting" to. Nobody reads them. The companies know nobody reads them. "Consent" in this context is a legal fiction, not an informed choice. The GDPR tried to fix this with "plain language" requirements, but a 2019 study found that most GDPR-era policies are still written at a college reading level.

Re-identification: Fewer Data Points Than You Think

In 2006, Netflix released a dataset of 100 million movie ratings from 480,000 users, stripped of names, for a recommendation algorithm competition. Researchers at UT Austin showed that a handful of movie ratings with approximate dates was enough to single out a user's record: with eight ratings, two of which could even be wrong, and dates accurate to within two weeks, 99% of records were unique. By cross-referencing with public IMDb reviews, they re-identified individual subscribers. In 2009 a closeted lesbian mother of two sued Netflix, arguing that the released data could out her.

In 2015, researchers showed that 4 credit card transactions (store + date) uniquely identify 90% of people in a dataset of 1.1 million users. Knowing the approximate amount of each purchase made re-identification easier still, and women were easier to re-identify than men.

The lesson: anonymization does not work on high-dimensional data. Every additional data point exponentially shrinks the set of people who match, until the set contains exactly one person.

Privacy Erosion Visualizer

The simulation below starts with an anonymous user among a population. As you add data points — location, browsing history, purchases, social connections — watch the anonymity set shrink. How many data points does it take to uniquely identify someone?

Privacy Erosion: De-anonymization in Action

Add data points one by one. Watch the anonymity set (people who match the profile) shrink toward 1.

Anonymity set: 100,000 people

With just 3-4 data points, the anonymity set typically drops below 10. By 5-6 data points, you are usually uniquely identified. This is why "we stripped the names" is not privacy protection. If your dataset has location + timestamps + any two behavioral signals, it is personally identifiable.
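The shrinking anonymity set can be reproduced with a synthetic population. Everything here is an assumption for illustration: the attribute names, the number of values per attribute, and the fact that attributes are drawn independently and uniformly (real attributes are correlated and skewed, so real numbers will differ).

```python
import random


def make_population(n, seed=0):
    """Generate n synthetic people with five hypothetical attributes."""
    rng = random.Random(seed)
    return [
        {
            "city": rng.choice(["A", "B", "C", "D", "E"]),
            "age_bracket": rng.choice(["18-29", "30-44", "45-64", "65+"]),
            "gym_member": rng.choice([True, False]),
            "commute_hour": rng.randrange(24),
            "favorite_store": rng.randrange(50),
        }
        for _ in range(n)
    ]


def anonymity_set_sizes(population, target, attributes):
    """Size of the matching set after each additional attribute is known."""
    candidates = population
    sizes = [len(candidates)]
    for attr in attributes:
        candidates = [p for p in candidates if p[attr] == target[attr]]
        sizes.append(len(candidates))
    return sizes


people = make_population(100_000)
target = people[0]
sizes = anonymity_set_sizes(
    people,
    target,
    ["city", "age_bracket", "gym_member", "commute_hour", "favorite_store"],
)
print(sizes)
```

Each attribute on its own is harmless (one of five cities, one of four age brackets), but the matching set is divided by roughly the number of possible values at every step, so five coarse attributes are enough to bring 100,000 candidates down to a handful.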

GDPR vs. Immutable Logs: A Real Tension

The EU's General Data Protection Regulation (GDPR) establishes a right to erasure — a person can demand that you delete their data. But we just spent a chapter arguing that event logs should be immutable and append-only. These two principles collide head-on.

If your event log contains "User #4421 purchased product X at time T," and User #4421 exercises their right to erasure, what do you do? You cannot delete the event without breaking the log's integrity. Downstream derived views that depend on this event would become inconsistent.

Practical solutions exist but require design forethought:

| Approach | How it works | Trade-off |
| --- | --- | --- |
| Crypto-shredding | Encrypt personal data with a per-user key. To "delete," destroy the key. The ciphertext remains in the log but is unreadable. | Requires encryption infrastructure from day one. Cannot retrofit easily. |
| Tombstone events | Append a "delete user #4421" event to the log. Derived views process the tombstone and purge their projections. | The original events still exist in the log (with personal data). May not satisfy strict GDPR interpretation. |
| Log compaction | Periodically rewrite the log, omitting events for deleted users. | Breaks immutability. May invalidate downstream offsets. Complex operationally. |

Design for deletion from day one. The cheapest time to implement crypto-shredding is before you write the first event. The most expensive time is after a regulator sends you a notice. If your system stores personal data, encrypt it with per-user keys from the start. Supporting erasure is not optional: deletion rights are a legal requirement in jurisdictions covering 4+ billion people.
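Here is a minimal sketch of the crypto-shredding row in the table above. To stay within the Python standard library it uses a toy SHA-256 keystream as the cipher, which is NOT secure and stands in for a real authenticated cipher such as AES-GCM from a vetted library. The `ShreddableLog` class and its method names are hypothetical.

```python
import hashlib
import secrets


def _keystream(key, nonce, length):
    """Toy keystream (SHA-256 in counter mode). Illustration only, NOT secure."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def _xor(data, stream):
    return bytes(a ^ b for a, b in zip(data, stream))


class ShreddableLog:
    def __init__(self):
        self.events = []       # append-only: (user_id, nonce, ciphertext)
        self.keys = {}         # per-user keys, kept OUTSIDE the log
        self.shredded = set()  # users whose keys have been destroyed

    def append(self, user_id, payload):
        if user_id in self.shredded:
            raise ValueError("user has been erased")
        key = self.keys.setdefault(user_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        data = payload.encode("utf-8")
        self.events.append((user_id, nonce, _xor(data, _keystream(key, nonce, len(data)))))

    def read(self, offset):
        user_id, nonce, cipher = self.events[offset]
        key = self.keys.get(user_id)
        if key is None:
            return None        # key destroyed: the payload is unrecoverable
        return _xor(cipher, _keystream(key, nonce, len(cipher))).decode("utf-8")

    def forget_user(self, user_id):
        """'Erase' a user by destroying their key; the log itself is untouched."""
        self.keys.pop(user_id, None)
        self.shredded.add(user_id)


log = ShreddableLog()
log.append("user-4421", "purchased product X")
log.append("user-7", "purchased product Y")
log.forget_user("user-4421")
print(log.read(0), log.read(1), len(log.events))
```

After `forget_user`, the log still has two entries and its offsets are unchanged, but the first payload can no longer be read. Note the design consequence: the key store becomes the mutable, deletable part of the system, so it needs its own durability and backup policy (including making sure backups of destroyed keys are also destroyed).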

Differential Privacy: A Mathematical Defense

Differential privacy is a mathematical framework for publishing aggregate statistics about a dataset without revealing information about any individual in it. The core idea: add carefully calibrated random noise to query results so that the output is approximately the same whether or not any single person's data is in the dataset.

Formally: a mechanism M is ε-differentially private if for any two datasets D and D' that differ by one person, and for any possible output S:

$$P(M(D) \in S) \le e^{\varepsilon} \cdot P(M(D') \in S)$$

The parameter ε (epsilon) is the privacy budget. Smaller ε means more noise and stronger privacy, but less useful results. Larger ε means less noise and better accuracy, but weaker privacy guarantees. Apple uses ε = 2-8 for emoji usage statistics. The US Census Bureau used ε = 19.61 for the 2020 Census.

In practice, you implement differential privacy by adding noise drawn from a Laplace or Gaussian distribution to each query result. The noise magnitude is calibrated to the sensitivity of the query (how much one person's data can change the result). A count query has sensitivity 1 (adding or removing one person changes the count by at most 1). An average salary query has higher sensitivity (one person with a $10M salary shifts the average significantly).
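A sketch of the Laplace mechanism for a count query, using only the standard library. The function names are hypothetical; the sampling trick relies on the fact that the difference of two independent Exponential(1) draws follows a Laplace(0, 1) distribution. A production system would also need to track the cumulative privacy budget across queries, which this sketch ignores.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Laplace mechanism: noise scale = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
strong_privacy = [private_count(1000, epsilon=0.1, rng=rng) for _ in range(5)]
weak_privacy = [private_count(1000, epsilon=5.0, rng=rng) for _ in range(5)]

print([round(x, 1) for x in strong_privacy])  # small epsilon: noisy answers
print([round(x, 1) for x in weak_privacy])    # large epsilon: close to 1000
```

With sensitivity 1, the typical error is about `1 / epsilon`, so ε = 0.1 gives answers that are off by around ten, while ε = 5 gives answers off by a fraction of one. Against a true count of 1,000 both are usable; against a true count of 3, only the second is, which is the small-population effect described in this chapter.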

Differential privacy in practice. Google's RAPPOR uses differential privacy for Chrome usage statistics. Apple uses it for keyboard and emoji analytics. The US Census Bureau uses it for population counts. The trade-off is always the same: more privacy = more noise = less precise answers. For large datasets, the noise is small relative to the true answer. For small datasets (e.g., "how many people in this small town have HIV?"), the noise may overwhelm the signal. This is a feature, not a bug: small populations deserve stronger privacy.

A dataset of 500,000 "anonymized" users contains: city, age bracket, and three product purchases per user. A researcher claims individual users can be re-identified. Is this plausible?

Chapter 5: Responsibility and Accountability

A self-driving car kills a pedestrian. Who is responsible? The engineer who wrote the perception model? The PM who decided the model was ready for deployment? The company that sold the vehicle? The regulator who approved it for road use? The training data that did not contain enough examples of pedestrians at night?

This is not a philosophical thought experiment. It happened in Tempe, Arizona in 2018 (Uber ATG). The NTSB investigation found failures at every level: the perception system never settled on what the pedestrian was, cycling through classifications such as "unknown object," "vehicle," and "bicycle"; the safety driver was watching a video on her phone; and Uber had disabled the Volvo's factory emergency braking system to avoid "jerky rides." Nobody went to prison. The safety driver was charged with negligent homicide and later pleaded guilty to a lesser endangerment charge, receiving probation; Uber paid a settlement and was not criminally charged. The engineers were not held individually liable.

Moral Crumple Zones

Anthropologist Madeleine Clare Elish coined the term moral crumple zone: when an automated system fails, blame collapses onto the nearest human — usually the lowest-ranking person in the chain. The safety driver. The content moderator. The bank teller who followed the algorithm's recommendation to deny a loan.

The people who designed the system, chose the training data, set the decision thresholds, and decided to deploy it — they are insulated from accountability by layers of organizational structure. The person who merely operated the system becomes the moral crumple zone: absorbing the impact of a failure they did not design and could not prevent.

The trolley problem is a distraction. Public debate about AI ethics obsesses over edge cases: should the self-driving car swerve left (killing one person) or right (killing two)? This is philosophically interesting and practically irrelevant. The real ethical decisions happen months earlier, in design meetings: What training data do we use? What error rate do we accept? Who bears the cost of false positives? When do we ship vs. when do we wait? These decisions are made by engineers and product managers, not by algorithms in the moment of crisis.

Who Is Responsible? A Framework

| Role | What they control | What they are responsible for |
| --- | --- | --- |
| Data engineer | Data collection, cleaning, labeling | Data quality, representation, consent for collection |
| ML engineer | Model architecture, training, evaluation | Bias detection, fairness metrics, failure mode analysis |
| Product manager | Feature decisions, deployment criteria | Use case ethics, threshold choices, user impact assessment |
| Engineering manager | Timelines, staffing, priorities | Ensuring time/resources exist for ethics review |
| Executive | Business model, strategy | Incentive structures, compliance, corporate accountability |

Notice that responsibility is distributed but not diluted. Every role has specific things they control and specific things they are accountable for. "I just built what the PM asked for" is not absolution — you, as the engineer, are the person who understands the failure modes. If you know the model has a 25% false-positive rate for a protected group and you ship it without raising the issue, that is your responsibility.

The Design Challenge

This is not a simulation — it is a scenario for you to think through. There is no "right" answer, but there are answers that demonstrate ethical reasoning versus answers that dodge responsibility.

Scenario: You are a senior engineer at a fintech company. Your team's credit scoring algorithm has been in production for 6 months. An internal audit reveals that the algorithm rejects 40% more applicants from minority zip codes than from non-minority zip codes, even after controlling for income and credit history. Your VP of Engineering says: "The model is using zip code as a feature, and zip codes correlate with default rates. The model is technically correct — it is optimizing for our business metric (minimizing defaults). Removing zip code will increase our default rate by 2%." What do you do?

Think about this before reading on. What are the stakeholders? What are the competing values? What would you actually say in a meeting?

A strong answer addresses multiple dimensions:

| Dimension | Consideration |
| --- | --- |
| Legal | Using zip code as a proxy for race may violate fair lending laws (ECOA, Fair Housing Act). "The model is technically correct" is not a legal defense if the effect is discriminatory; this is called disparate impact. |
| Business | A 2% increase in defaults is a known, quantifiable cost. A discrimination lawsuit or regulatory action is an unknown, potentially existential cost. The risk calculus favors removing the feature. |
| Technical | Zip code is a proxy variable: it correlates with race because of historical redlining. Removing it may not fully solve the problem (other features may also proxy for race). You need a fairness audit across all features. |
| Ethical | The people denied loans are real. They cannot buy homes, start businesses, or build wealth. The 2% default cost is absorbed by a corporation; the denial cost is absorbed by individuals with the least power to recover. |

The Spectrum of Response

When you discover a harmful system, your response options range from quiet to loud. Each has different costs and different effectiveness:

| Action | Risk to you | Likely impact |
| --- | --- | --- |
| Document and raise internally | Low | Depends on company culture. May be ignored, may trigger a fix. |
| Escalate to leadership | Medium | Higher visibility. Works if leadership is receptive; backfires if they are defensive. |
| Refuse to ship | High | Forces the conversation. May cost you your job. Depends on your leverage and the severity of harm. |
| External whistleblowing | Very high | Last resort. Legal protections vary by jurisdiction. Some (EU, US federal) protect whistleblowers; many do not. |

There is no universal right answer. But there is a wrong answer: doing nothing because "it is not my department." If you see a system causing harm and you have the technical knowledge to understand the harm, you have a professional obligation to act — starting with the lowest-risk option and escalating as needed.

A practical principle: Before shipping any system that makes decisions about people (hiring, lending, scoring, sentencing, content moderation), ask three questions: (1) Who is harmed if this system makes a mistake? (2) Does the person affected know a decision was made about them, and can they appeal? (3) Have we measured the error rates broken down by demographic group? If you cannot answer all three, the system is not ready to ship.

An engineer builds a model that discriminates against a protected group. The engineer did not intend discrimination; the model simply learned patterns in the training data. Is the engineer responsible?

Chapter 6: The Future of Data Systems

Where is the field heading? Kleppmann identifies four frontier challenges that will define the next decade of data system design. Each one is both a technical problem and, as we have seen, an ethical one.

1. Unbundling the Database

A traditional relational database is a monolith: storage engine, query optimizer, transaction manager, indexing, caching, replication — all packaged together. You pick Postgres or MySQL and you get all of it, tightly integrated.

The trend is toward unbundling: decompose the monolith into specialized components connected by event streams. Use a purpose-built storage engine (RocksDB, TiKV). Use a separate indexing service (Elasticsearch, Meilisearch). Use a separate caching layer (Redis, Memcached). Use a separate analytics engine (ClickHouse, DuckDB). Coordinate them through an event log (Kafka, Pulsar).

Each component is independently scalable, independently replaceable, and optimized for its specific access pattern. The cost: you now manage the consistency between them yourself, instead of getting it for free from the monolith's transaction manager.

2. End-to-End Correctness

Transactions give you correctness within a single database. But modern systems span multiple databases, caches, queues, and services. A user clicks "buy" and that event must flow correctly through: the order service, the inventory service, the payment service, the email service, and the analytics pipeline. A transaction in any single service is not enough — you need correctness across the entire chain.

The tools: idempotency keys (so retries do not cause duplicates), exactly-once delivery (which really means "effectively once" — you may deliver twice but the second is a no-op), and end-to-end checksums (verify that the final output matches what was intended, not just that each hop succeeded).

Exactly-once is a lie (sort of). You cannot prevent a message from being delivered twice over a network. What you CAN do is make the receiver idempotent: if it sees the same message twice, the second processing has no additional effect. This is "effectively once" processing. The trick is assigning a unique ID to each event and checking whether you have already processed it before taking action. Simple in concept, tricky at scale: the dedup table grows without bound unless old IDs are eventually expired.
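
The dedup check can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the class name IdempotentConsumer and the event shape are assumptions. A real consumer would have to commit the state change and the processed ID atomically, or a crash between the two would break the guarantee.

```python
class IdempotentConsumer:
    """Applies each event at most once, keyed by the event's unique ID."""

    def __init__(self):
        self.processed_ids = set()  # the dedup table (in memory, illustration only)
        self.balance = 0

    def handle(self, event):
        """Return True if the event had an effect, False if it was a duplicate."""
        if event["id"] in self.processed_ids:
            return False  # redelivery: second processing is a no-op
        self.balance += event["amount"]
        self.processed_ids.add(event["id"])
        return True


consumer = IdempotentConsumer()
deposit = {"id": "evt-001", "amount": 50}  # hypothetical event

first = consumer.handle(deposit)   # applied
second = consumer.handle(deposit)  # same ID delivered again, ignored
```

The message was delivered twice, but the balance moved once. That is all "effectively once" promises.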

3. Enforcing Constraints Across Partitions

You have a uniqueness constraint: no two users can register the same email address. Easy in a single database — add a unique index. Now partition your user table across 16 shards. User "alice@example.com" is on shard 7. A new registration request for "alice@example.com" arrives at shard 12 (because it was routed by user ID, not email). How does shard 12 know that this email is already taken on shard 7?

You need cross-partition coordination — which means consensus, which means latency. This is the fundamental trade-off: you can have fast writes (no coordination) or you can enforce constraints (coordination required), but not both at the same time. The solution space includes: (a) route by the constrained field (partition by email, not user ID), (b) use a separate global uniqueness service, or (c) accept eventual uniqueness (detect and resolve duplicates asynchronously).
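
Option (a) is the easiest to picture in code. The sketch below assumes 16 shards, as in the example above, and routes by a hash of the email so that every registration for a given address lands on the same shard. The names shard_for_email, Shard, and register are hypothetical, and lowercasing the address is a simplifying assumption about how you normalize emails.

```python
import hashlib

NUM_SHARDS = 16


def shard_for_email(email: str) -> int:
    """Map an email to a shard using a stable hash (not Python's salted hash())."""
    normalized = email.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class Shard:
    def __init__(self):
        self.emails = set()  # stands in for a local unique index


shards = [Shard() for _ in range(NUM_SHARDS)]


def register(email: str) -> bool:
    """Return True if the email was free, False if it was already taken."""
    normalized = email.strip().lower()
    shard = shards[shard_for_email(normalized)]
    if normalized in shard.emails:
        return False  # the uniqueness check is local to one shard
    shard.emails.add(normalized)
    return True


first = register("alice@example.com")
duplicate = register("Alice@Example.com ")  # same address, different spelling
```

Because both requests hash to the same shard, the check needs no cross-partition coordination. The cost is that lookups by user ID now need a different path, such as a secondary index.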

4. Timeliness vs. Integrity

Not all data needs to be fresh. But all data needs to be correct. This distinction — timeliness (how fresh) vs. integrity (how correct) — is fundamental to system design.

Property | Definition | Consequence of violation | Spectrum
Integrity | Data is correct and consistent | Wrong decisions, financial loss, legal liability | Binary: either correct or not
Timeliness | Data reflects recent reality | Stale recommendations, outdated dashboards | Continuous: from milliseconds-fresh to hours-stale

Integrity violations are catastrophic and hard to detect. If your bank balance is wrong, you may not notice until you overdraft. Timeliness violations are visible but usually tolerable. If your recommendation engine is 5 minutes stale, nobody dies.

Design systems that never sacrifice integrity but relax timeliness where acceptable. Use synchronous coordination (consensus, transactions) for integrity-critical paths (money movement, user registration). Use asynchronous replication (eventual consistency, stream processing) for timeliness-optional paths (analytics, recommendations, search indexing).
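
Here is a rough sketch of that split, using SQLite as a stand-in for the transactional store and an in-process queue as a stand-in for the event log. The table layout, the transfer function, and the queue name are assumptions for illustration. In a real system, publishing after the commit raises its own dual-write problem, which this sketch does not address.

```python
import queue
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
db.commit()

recommendation_events = queue.Queue()  # stand-in for an asynchronous event log


def transfer(src: str, dst: str, amount: int) -> bool:
    """Integrity-critical path: both updates commit together or not at all."""
    try:
        with db:  # commits on success, rolls back if an exception is raised
            db.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            db.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft
    # Timeliness-optional path: consumers may process this seconds or minutes later.
    recommendation_events.put({"type": "transfer", "src": src, "dst": dst})
    return True


ok = transfer("alice", "bob", 60)
rejected = transfer("alice", "bob", 60)  # would overdraw alice, so it rolls back
balances = dict(db.execute("SELECT id, balance FROM accounts"))
```

The balances are never wrong, even for a moment. The recommendation side only ever sees events for transfers that actually committed, and it is free to lag.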

The ACID-to-BASE spectrum, revisited. In Chapter 8, we learned ACID (Atomicity, Consistency, Isolation, Durability) versus BASE (Basically Available, Soft state, Eventual consistency). The timeliness-vs-integrity framing gives this spectrum a sharper edge. ACID delivers integrity and timeliness together, and pays for them with synchronous coordination, which adds latency and reduces availability. BASE relaxes timeliness (eventual consistency means readers may temporarily see stale data) to gain latency and availability; integrity then has to be protected separately, for example with idempotent processing and constraint checks on the critical path. The right choice depends on which path you are on. A single system often has both ACID paths (payment processing) and BASE paths (search indexing), and the event log is what connects them.

The Unbundled Database Simulator

The simulation below shows a traditional monolithic database on the left and an unbundled architecture on the right. Events flow in. On the monolith side, a single system handles everything. On the unbundled side, specialized components handle their piece. Watch how the unbundled side can scale each component independently, and how a failure in one component does not take down the others.

Monolith vs. Unbundled Database Architecture

Send events to both architectures. Kill a component on the unbundled side — the others keep working. Kill the monolith — everything stops.

When you kill the search component in the unbundled architecture, the database and cache keep processing events. When you bring search back, it catches up from the event log. When you kill the monolith, everything stops — storage, indexing, caching, queries — all go down together. This is the resilience advantage of unbundling.
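
The catch-up behavior described above comes down to each consumer remembering its own offset into the log. The sketch below is a toy in-memory model, not how Kafka or any real broker is implemented. EventLog, Consumer, and the event tuples are assumptions for illustration.

```python
class EventLog:
    """Append-only list of events; consumers read it at their own pace."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)


class Consumer:
    """A derived component (database, cache, search) with its own offset."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.offset = 0   # index of the next event to read
        self.alive = True
        self.view = {}    # derived state built from the log

    def poll(self):
        """Apply all unread events; return how many were applied."""
        if not self.alive:
            return 0
        pending = self.log.events[self.offset:]
        for key, value in pending:
            self.view[key] = value
        self.offset += len(pending)
        return len(pending)


log = EventLog()
database = Consumer("database", log)
search = Consumer("search", log)

log.append(("user:1", "alice"))
database.poll()
search.poll()

search.alive = False  # kill the search component
log.append(("user:2", "bob"))
log.append(("user:1", "alice-renamed"))
database.poll()       # the database keeps working
search.poll()         # no-op while search is down
lag_while_down = len(log.events) - search.offset

search.alive = True   # bring search back
caught_up = search.poll()
```

While search is down, it falls two events behind and nothing else notices. When it returns, it replays from its saved offset and ends up with the same view as the database.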

Your system processes payments (integrity-critical) and updates a recommendation engine (timeliness-optional). Where should you use synchronous coordination (consensus/transactions) and where should you use asynchronous replication?

Chapter 7: Interview Arsenal & Connections

This is the capstone of the entire DDIA series. Below you will find: ethics talking points that distinguish you in system design interviews, technical philosophy questions that test deep understanding, a cheat sheet of every concept from these final chapters, and connections to the rest of the book.

Ethics Interview Questions

These questions are increasingly common in system design and behavioral interviews at companies that handle sensitive data (fintech, healthtech, adtech, social media, hiring platforms). Having a structured answer — not just "ethics is important" — is what separates a senior from a staff engineer.

Question | What they want to hear | Staff-level addition
"How do you think about data privacy in your designs?" | Concrete techniques: encrypt PII at rest, per-user encryption keys, crypto-shredding for deletion, data minimization (collect only what you need), retention policies. | Discuss the tension between event log immutability and GDPR right-to-erasure. Propose crypto-shredding as the design-time solution. Mention that anonymization is insufficient for high-dimensional data.
"What would you do if you discovered bias in a production model?" | Immediate triage (quantify the bias, who is affected, how severely), stakeholder notification (legal, product, leadership), mitigation (feature removal, re-training, human-in-the-loop review), monitoring (add fairness metrics to the dashboard). | Discuss the feedback loop: biased predictions create biased outcomes that create biased training data. A one-time fix is insufficient: you need ongoing fairness audits at every retraining cycle. Cite COMPAS or Amazon hiring as real examples.
"How do you balance feature development with data privacy?" | Privacy by design: build privacy controls into the architecture from the start (access control, audit logs, encryption) rather than bolting them on later. Data minimization: every feature request should justify what data it needs and why. | Discuss differential privacy for analytics (add calibrated noise so individual records cannot be extracted). Propose a "privacy budget" that limits the total information any external query can extract.

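If an interviewer pushes on differential privacy, it helps to have the mechanics in your head. The sketch below shows the Laplace mechanism for a count query with a simple privacy budget. The class PrivateCounter, its parameters, and the sample data are assumptions for illustration. Naive floating-point noise like this is not production-grade, so treat it as a teaching sketch and reach for a vetted library in practice.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


class PrivateCounter:
    """Answers count queries with noise and tracks a total privacy budget."""

    def __init__(self, records, total_epsilon, seed=None):
        self.records = records
        self.remaining_epsilon = total_epsilon
        self.rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining_epsilon:
            raise PermissionError("privacy budget exhausted")
        self.remaining_epsilon -= epsilon
        true_count = sum(1 for record in self.records if predicate(record))
        # Adding or removing one person changes a count by at most 1,
        # so the noise scale is 1 / epsilon.
        return true_count + laplace_noise(1 / epsilon, self.rng)


ages = [23, 35, 41, 52, 67, 29, 44]  # made-up records
counter = PrivateCounter(ages, total_epsilon=1.0, seed=7)

over_40 = counter.noisy_count(lambda age: age > 40, epsilon=0.5)
under_30 = counter.noisy_count(lambda age: age < 30, epsilon=0.5)
# The budget is now spent; a third query would raise PermissionError.
```

Smaller epsilon means more noise and stronger privacy. The budget is what stops an analyst from averaging away the noise by asking the same question many times.
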
Technical Philosophy Questions

Question | Strong answer
"Event sourcing vs. CRUD — when and why?" | CRUD: simple, direct state updates. Good for low-complexity domains with few audit requirements. Event sourcing: append-only log of immutable events, derived views are projections. Better for: audit trails (finance, healthcare), temporal queries ("what was the state at time T?"), and multi-consumer architectures (event log feeds DB + cache + search + analytics). Cost: higher complexity, need to design for log compaction and snapshots.
"Lambda vs. Kappa architecture?" | Lambda: batch + speed layers, two codepaths, accurate but complex. Kappa: stream-only, one codebase, replay for reprocessing. Lambda is rarely the first choice for greenfield designs. Kappa works if your stream processor can handle replay throughput (Flink, Kafka Streams). Lambda still makes sense when batch and real-time have genuinely different semantics (e.g., ML training is batch, serving is stream).
"What does 'exactly-once' actually mean?" | It means "effectively once": messages may be delivered more than once, but processing is idempotent so the effect happens once. Achieved via: unique event IDs + dedup table on the consumer, or transactional outbox pattern, or Kafka's idempotent producer + transactions. True exactly-once delivery over a network is impossible (Two Generals' Problem). What we guarantee is exactly-once semantics from the application's perspective.
"How do you enforce uniqueness across partitions?" | Three strategies: (a) partition by the constrained field (ensures all checks hit one partition, but may skew data distribution), (b) global uniqueness service (single coordination point, a potential bottleneck), (c) asynchronous conflict detection (allow duplicates briefly, detect and merge/reject later). Choice depends on how critical the constraint is. Email uniqueness: use (a) or (b). Username suggestion: (c) is fine.

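The event sourcing answer lands better with a concrete picture of a temporal query. The sketch below derives an account balance by replaying an append-only list of events up to a chosen time. The Event type, its integer timestamps, and balance_at are hypothetical names for illustration, and a real system would add snapshots so it does not replay the whole log on every read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events are immutable once written
class Event:
    timestamp: int
    account: str
    delta: int


def balance_at(events, account, as_of):
    """Derive state by replaying events up to and including time `as_of`."""
    return sum(
        e.delta for e in events if e.account == account and e.timestamp <= as_of
    )


events = [
    Event(1, "acct-1", 100),
    Event(2, "acct-1", -30),
    Event(3, "acct-2", 50),
    Event(4, "acct-1", -30),  # mistaken duplicate charge
    Event(5, "acct-1", 30),   # correction appended as a new event, not an edit
]

before_mistake = balance_at(events, "acct-1", as_of=2)
during_mistake = balance_at(events, "acct-1", as_of=4)
after_fix = balance_at(events, "acct-1", as_of=5)
```

The mistake and its correction both stay in the log, so you can answer "what did we believe at time 4?" as easily as "what is true now?". A CRUD table that overwrote the balance could answer only the second question.
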
Concept Cheat Sheet

Concept | One-line summary | Where it matters
Event sourcing | Append-only log of events is the source of truth; state is derived | Audit trails, temporal queries, multi-consumer architectures
Immutability | Never modify data; add corrections as new events | Accounting, debugging, recovery
Derived data | Databases, caches, indexes are all projections of the event log | Unbundled database architecture
Lambda architecture | Batch + speed layer, merge at query time | Legacy systems, mixed batch/stream semantics
Kappa architecture | Stream-only, replay from log for reprocessing | Greenfield designs, Flink/Kafka ecosystems
Feedback loop | Biased predictions cause biased outcomes that create more biased data | Any ML system that affects the reality it predicts
Disparate impact | A facially neutral policy that disproportionately affects a protected group | Hiring, lending, criminal justice, housing
Crypto-shredding | Encrypt PII with per-user keys; "delete" by destroying the key | GDPR compliance in event-sourced systems
Differential privacy | Add calibrated noise to query results so individual records cannot be identified | Analytics, census data, ML training
Moral crumple zone | The nearest human absorbs blame for an automated system's failure | Safety-critical systems, content moderation
Timeliness vs. integrity | Integrity is non-negotiable (data must be correct); timeliness is a spectrum | Payment systems (integrity) vs. recommendations (timeliness)
Exactly-once semantics | "Effectively once": idempotent processing so duplicates have no effect | Payment processing, inventory management

The Complete DDIA Journey

You have now covered the entire book. Here is the map of all lessons and how they connect:

Foundations
Ch 3: Data Models; Ch 4: Storage & Retrieval. How data is structured and stored.
Distribution
Ch 6: Replication; Ch 7: Sharding. Spreading data across machines.
Correctness
Ch 8: Transactions; Ch 9: Distributed Trouble; Ch 10: Consensus. Making distributed systems reliable.
Processing
Ch 11: Batch Processing; Ch 12: Stream Processing. Deriving value from data.
Philosophy & Ethics
Ch 13-14: This lesson. How to think about the systems you build, and the responsibilities you carry.

Recommended Reading

Book/Resource | Author | Why read it
Weapons of Math Destruction | Cathy O'Neil (2016) | How algorithmic decision-making amplifies inequality. Case studies in education, policing, lending, and hiring.
The Tyranny of Metrics | Jerry Z. Muller (2018) | How optimizing for measurable metrics corrupts the underlying goal. Directly relevant to ML loss function design.
Automating Inequality | Virginia Eubanks (2018) | How data systems disproportionately surveil and punish poor communities.
GDPR Full Text | EU (2016) | If you build systems that process personal data, you should read the actual regulation. It is more readable than you expect.
Machine Bias (ProPublica) | Angwin et al. (2016) | The original COMPAS investigation. Essential primary source for any discussion of algorithmic fairness.
Kleppmann's talks | Martin Kleppmann | His Strange Loop and QCon talks cover much of this material in lecture form. Excellent for reinforcement.

Closing Thought

"With great power comes great responsibility." This is a cliche. Here is the harder version: with great technical power comes the responsibility to understand the limits of that power. The most dangerous engineer is not the one who builds a bad system; it is the one who builds a powerful system and does not think about who it hurts. You now have the technical skills to build systems that serve millions of people. Use them carefully.
Adapted from Martin Kleppmann, Designing Data-Intensive Applications, Chapter 14
In a system design interview, you are asked to design a credit scoring system. The interviewer asks: "How would you ensure fairness?" What is the staff-level answer?