Streaming philosophy, bias, privacy, responsibility — the human side of data systems.
You have spent twelve chapters learning how to build data systems. You know how to replicate, partition, process in batch, process in streams, achieve consensus, and handle failure. You can design a system that scales to millions of users and recovers from any crash. Congratulations. You are dangerous.
Here is why that word is deliberate. A credit scoring algorithm you build will decide who gets a home loan and who does not. A hiring system you design will filter resumes — and encode whatever biases exist in the training data into automated decisions that affect people's livelihoods. A social media feed you optimize for engagement will, if left unchecked, amplify outrage because outrage drives clicks.
The technical skills from the previous chapters are necessary but not sufficient. A data system is not just infrastructure — it is a lens that shapes what is possible, what is surveilled, who is included, and who is excluded. Every architectural decision has a human consequence, whether you intended it or not.
The simulation below shows a simplified ecosystem of data systems: social media, credit scoring, hiring platforms, criminal justice risk tools, and ad targeting. Each system feeds data into the others. Watch what happens when a single decision in one system — say, a biased training dataset in the hiring platform — propagates through the entire web.
Click a system node to inject a bias. Watch it propagate through data flows to other systems. Click "Reset" to clear.
A biased hiring algorithm feeds employment data into credit scoring systems. Lower credit scores feed into higher insurance premiums and reduced access to housing. Reduced housing access feeds into neighborhood-level data that criminal justice risk models use. Each system is "just using the data it has" — and yet the compound effect is discriminatory at a scale no single system intended.
This is the core tension of these final chapters. Technical excellence without ethical awareness is not neutral — it is harmful by default. The rest of this lesson equips you with the philosophical framing and the ethical vocabulary to navigate these questions, both in interviews and in production.
We covered the mechanics of stream processing in Chapter 12. Now step back from the APIs and windowing functions. Batch versus stream is not merely a technical choice — it is a philosophical one about how you model time and reality.
Batch processing sees the world as a series of periodic snapshots. You freeze reality at regular intervals, process what accumulated, and move on. It is like a photographer taking one picture per hour. Between photos, the world changes and you do not notice.
Stream processing sees the world as a continuous flow of events. Every change is captured as it happens. It is like a video camera — nothing is lost between frames. The data is a living record of reality unfolding.
This difference runs deeper than performance. It changes what questions you can answer. A batch system can tell you "as of midnight, you had 14,232 active users." A stream system can tell you "at 3:47:12 PM, user #8891 went from active to inactive, which was the 17th deactivation in the last hour — a rate 3x higher than normal." The batch view is a summary. The stream view is the truth.
In an event-sourced system, once something happened, it happened. You do not go back and edit the event. If a customer was charged $50 by mistake, you do not delete the $50 charge. You add a new event: "refund $50." The ledger is append-only.
This is the same principle behind double-entry bookkeeping, invented in 13th-century Italy. Every transaction creates two entries (debit and credit) that must balance. Errors are corrected with new entries, never by erasing old ones. The complete history is always available for audit.
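The append-only idea can be sketched in a few lines. This is a minimal illustration, not a production ledger: the `LedgerEvent` and `Ledger` names, the cents-based amounts, and the order numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEvent:
    kind: str          # "charge" or "refund"
    amount_cents: int  # always positive; the kind gives the direction
    note: str = ""


class Ledger:
    """Append-only ledger: events are added, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[LedgerEvent] = []

    def append(self, event: LedgerEvent) -> None:
        self._events.append(event)

    def events(self) -> tuple[LedgerEvent, ...]:
        # Hand out an immutable copy so callers cannot rewrite history.
        return tuple(self._events)

    def balance_cents(self) -> int:
        # Current state is derived by folding over the full history.
        sign = {"charge": 1, "refund": -1}
        return sum(sign[e.kind] * e.amount_cents for e in self._events)


ledger = Ledger()
ledger.append(LedgerEvent("charge", 5000, "order 1001"))
ledger.append(LedgerEvent("charge", 5000, "order 1001, charged twice by mistake"))
ledger.append(LedgerEvent("refund", 5000, "corrects the duplicate charge"))

print(ledger.balance_cents())  # 5000
print(len(ledger.events()))    # 3: the mistake stays visible for audit
```

The balance is correct after the refund, yet the erroneous charge is still in the history, which is exactly what an auditor wants to see.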
Why does this matter for software? Three reasons:
| Property | Why it matters | Example |
|---|---|---|
| Auditability | You can trace exactly how you got to the current state | A regulator asks "why was this loan denied?" You can replay the exact sequence of events that led to the decision |
| Debuggability | You can replay events to reproduce bugs | A race condition that corrupts state at 3 AM can be replayed deterministically from the event log |
| Recovery | Derived views can be rebuilt from scratch | Your search index is corrupted? Rebuild it by replaying the event log. No data loss. |
The simulation below shows the same data viewed through both lenses. On the left: a batch system takes periodic snapshots. On the right: a stream system captures every event. Toggle between them to see what information the batch view loses.
Watch events flow in real time. The batch side only updates at snapshot intervals. Notice the events that slip between snapshots.
With a long snapshot interval, the batch side misses short-lived spikes entirely — a brief surge of errors, a flash sale that sold out in seconds, a momentary network partition. The stream side captures every fluctuation. Increase the snapshot interval to see how much information is lost.
The choice between batch and stream is not just about latency. It determines what questions your system can answer. Consider three scenarios:
| Scenario | Batch view | Stream view |
|---|---|---|
| User churn detection | "Last month we lost 2,400 users." (One number, no context.) | "At 3 PM on Tuesday, engagement dropped 40% for users in the Northeast — correlating with a deployment that broke mobile push notifications." (Root cause identified within minutes.) |
| Inventory management | "At midnight, we had 340 units in stock." (May be stale by morning.) | "Item #8891 just sold out at 10:47 AM. 14 users currently have it in their cart. Begin backorder process." (Real-time, actionable.) |
| Security monitoring | "Yesterday's logs show 47 failed login attempts from IP x.x.x.x." (Discovered 24 hours too late.) | "IP x.x.x.x has failed 5 logins in 30 seconds — brute force attack in progress. Blocking." (Stopped in real time.) |
Event sourcing gives you both views from the same data. The stream view is always available. The batch view is just a materialized query over the event log at a point in time. You never have to choose one or the other — you get both, as long as you build on an event log.
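To make "the batch view is a materialized query over the log" concrete, here is a small sketch. The `StockEvent` type, the timestamps, and the snapshot times are made up for illustration.

```python
from typing import NamedTuple


class StockEvent(NamedTuple):
    ts: int      # seconds since start of day (hypothetical clock)
    item: str
    delta: int   # +N for restock, -N for sale


LOG = [
    StockEvent(0, "widget", +10),
    StockEvent(100, "widget", -4),
    StockEvent(250, "widget", -6),   # sells out
    StockEvent(400, "widget", +10),  # restocked before the next snapshot
]


def stock_as_of(log, item, ts):
    """Batch view: fold every event up to and including ts."""
    return sum(e.delta for e in log if e.item == item and e.ts <= ts)


def sold_out_times(log, item):
    """Stream view: react to each event as it arrives."""
    level, times = 0, []
    for e in sorted(log, key=lambda e: e.ts):
        if e.item != item:
            continue
        level += e.delta
        if level == 0:
            times.append(e.ts)
    return times


snapshots = [stock_as_of(LOG, "widget", t) for t in (0, 200, 500)]
print(snapshots)                      # [10, 6, 10]
print(sold_out_times(LOG, "widget"))  # [250]
```

Both functions read the same log. The snapshots never show the stock at zero, while the event-by-event pass catches the sell-out at `ts=250`.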
Here is a radical idea: every data system you have studied in this book is the same thing. A database, a cache, a search index, an analytics warehouse — they are all derived views of an underlying event log. They differ only in how they project that log into a queryable form.
Think of your event log as a river. A database is a reservoir that stores the latest state — it is a snapshot of the river's cumulative effect. A search index is a filter that extracts specific patterns from the water. A cache is a bucket that holds the most recently fetched water for quick access. An analytics warehouse is a dam that collects water for periodic measurement.
Each of these is just a materialized view — a different lens on the same underlying data. And because they are derived, they can be rebuilt from the event log at any time. The event log is the single source of truth. Everything else is a read-optimized projection.
| System | What it really is | Projection type |
|---|---|---|
| Relational DB | Cache of latest state per key | Latest-value lookup |
| Search index | Inverted index of event content | Full-text search |
| Cache (Redis) | Hot subset of derived state | Frequently-accessed lookup |
| Analytics warehouse | Aggregated summary over event history | OLAP cubes, star schemas |
| Materialized view | Pre-computed query result | Denormalized join |
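As a rough sketch of the table above, two of these projections can be built from one log with ordinary functions. The log contents and function names are hypothetical, and a real search index would do far more than split on whitespace.

```python
from collections import defaultdict

# Hypothetical event log: (key, text) pairs in arrival order.
EVENT_LOG = [
    ("doc1", "stream processing basics"),
    ("doc2", "batch processing basics"),
    ("doc1", "stream processing in depth"),  # later event supersedes doc1
]


def build_latest_state(log):
    """'Database' projection: latest value per key."""
    state = {}
    for key, text in log:
        state[key] = text
    return state


def build_search_index(log):
    """'Search index' projection: word -> keys whose latest text contains it."""
    index = defaultdict(set)
    for key, text in build_latest_state(log).items():
        for word in text.split():
            index[word].add(key)
    return dict(index)


db = build_latest_state(EVENT_LOG)
index = build_search_index(EVENT_LOG)
print(db["doc1"])                   # stream processing in depth
print(sorted(index["basics"]))      # ['doc2']
print(sorted(index["processing"]))  # ['doc1', 'doc2']

# Losing a derived view is recoverable: rebuild it from the log.
index = None
index = build_search_index(EVENT_LOG)
```

Neither projection is authoritative. Each one can be thrown away and recomputed, because the log holds everything needed to rebuild it.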
Two architectures formalize this idea:
Lambda Architecture (Nathan Marz, 2011): run two parallel pipelines. A batch layer periodically reprocesses the entire event history to produce accurate but delayed views. A speed layer (stream processor) handles recent events for low-latency but approximate views. Query results merge both layers. The downside: you maintain two codepaths that must produce the same results. Every business logic change must be implemented twice.
Kappa Architecture (Jay Kreps, 2014): just the speed layer. Stream processing replaces batch entirely. If you need to reprocess historical data, replay the event log through a new version of the stream processor. One codebase, one pipeline. The downside: your stream processor must handle the full throughput of a historical replay, which may be orders of magnitude higher than real-time rate.
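The Kappa idea of "reprocess by replaying" might look like this in miniature. The event shapes and the `processor_v1` / `processor_v2` names are invented for the sketch. A real deployment would replay from a retained log in a system such as Kafka rather than a Python list.

```python
from collections import Counter

# Hypothetical retained log of (user, action) events.
LOG = [("ana", "click"), ("bo", "view"), ("ana", "view"), ("ana", "click")]


def processor_v1(view, event):
    """Original logic: count every event per user."""
    user, _action = event
    view[user] += 1


def processor_v2(view, event):
    """Changed business logic: count clicks only."""
    user, action = event
    if action == "click":
        view[user] += 1


def replay(log, processor, from_offset=0):
    """Build a fresh derived view by running one processor over the log."""
    view = Counter()
    for event in log[from_offset:]:
        processor(view, event)
    return view


view_v1 = replay(LOG, processor_v1)
view_v2 = replay(LOG, processor_v2)  # same code path as live processing
print(dict(view_v1))  # {'ana': 3, 'bo': 1}
print(dict(view_v2))  # {'ana': 2}
```

There is no separate batch job here: the new logic is written once and the historical view comes from running it over old events. The replay loop is also where the throughput cost mentioned above shows up.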
The simulation below shows the unified dataflow model. An event log sits at the center. Derived systems — a database, a cache, a search index, an analytics view — are all fed from the same log. You can add or remove derived systems. Each one is independently scalable and replaceable.
Events flow from the log to all derived systems. Toggle systems on/off. All stay in sync via the log.
Notice: when you toggle a system off and then back on, it catches up by replaying from the log. No data is lost. This is the power of the event log as source of truth — derived systems are disposable. You can swap out Postgres for CockroachDB, or Elasticsearch for Meilisearch, without touching the log or any other derived system.
This is what Kleppmann calls the unbundled database. A traditional DBMS bundles storage, indexing, caching, and query processing into one product. The unbundled approach separates them into independently deployable services, all fed from the event log. The benefit: each component can be the best tool for its job (RocksDB for write-heavy storage, Elasticsearch for full-text search, Redis for low-latency caching). The cost: you own the glue between them. The event log IS that glue.
This is also the key insight behind modern data platforms like Confluent (Kafka ecosystem), Materialize (streaming SQL), and Decodable (stream processing). They are all building tools for the "event log as source of truth" world.
You build a hiring algorithm. You train it on ten years of your company's hiring data: who applied, who was interviewed, who was hired, who succeeded. The model learns patterns and starts predicting which new applicants will be successful. It is 89% accurate on your test set. Ship it.
Six months later, a reporter discovers that your algorithm rejects female applicants at 2.5x the rate of male applicants. What happened?
Your company's historical hiring data reflects a decade of human decisions made in an industry where 78% of engineers hired were male. The model did not "decide" that women are worse engineers. It learned that historically, women were hired less often, and it replicated that pattern. The model is a mirror of your past, not an oracle of the future.
It gets worse. A biased prediction does not just reflect historical bias; it creates new bias through a self-fulfilling prophecy.
The mechanism works like this. The model predicts failure for a group and denies its members opportunities. The group then fails more often (because of the denied opportunities, not a lack of ability), and the model is retrained on data that "confirms" its original prediction. Each cycle amplifies the bias.
In 2016, ProPublica analyzed COMPAS, a criminal justice risk assessment tool used in courts across the United States to predict recidivism (whether a defendant will reoffend). Their findings:
| Metric | Black defendants | White defendants |
|---|---|---|
| Labeled high-risk but did NOT reoffend (false positive) | 44.9% | 23.5% |
| Labeled low-risk but DID reoffend (false negative) | 28.0% | 47.7% |
| Overall accuracy | ~65% for both groups | |
The overall accuracy was similar for both groups, at about 65%. But the types of errors were distributed very differently. Black defendants were far more likely to be falsely flagged as high-risk, a label that can feed into bail and sentencing decisions. White defendants were far more likely to be falsely labeled low-risk. Same accuracy, dramatically different consequences.
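To see how equal accuracy can hide unequal error rates, here is a small calculation. The counts are invented to make the arithmetic easy and are not the COMPAS data; `Confusion` and the group names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # predicted high-risk, did reoffend
    fp: int  # predicted high-risk, did not reoffend
    tn: int  # predicted low-risk, did not reoffend
    fn: int  # predicted low-risk, did reoffend

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)

    def false_negative_rate(self) -> float:
        return self.fn / (self.fn + self.tp)


# Made-up counts, 100 people per group.
groups = {
    "group_a": Confusion(tp=40, fp=25, tn=25, fn=10),
    "group_b": Confusion(tp=25, fp=10, tn=40, fn=25),
}

for name, c in groups.items():
    print(name, c.accuracy(), c.false_positive_rate(), c.false_negative_rate())
# group_a 0.65 0.5 0.2
# group_b 0.65 0.2 0.5
```

A dashboard that tracks only accuracy would show these two groups as identical. Reporting false positive and false negative rates per group is what exposes the difference.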
The simulation below shows a feedback loop in action. A training dataset has an initial bias (adjustable). A model learns from it, makes predictions, those predictions affect outcomes, and the outcomes feed back into the training data. Watch the bias amplify over cycles. Then toggle "Bias Correction" to see what happens when you intervene at each cycle.
Each cycle: train → predict → outcomes → retrain. Watch the bias grow. Toggle correction to flatten it.
Without correction, even a small initial bias (0.10) doubles within 5 cycles and can reach catastrophic levels by cycle 10. With correction — re-sampling, adversarial debiasing, or human review of flagged decisions — the bias stabilizes or shrinks. The key insight: bias correction is not a one-time step. It must be applied at every retraining cycle, forever.
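A toy model of the loop helps show why the correction has to recur. The growth rate and correction strength below are assumptions chosen for illustration, not measurements, and the multiplicative update is a deliberately crude stand-in for a real retraining pipeline.

```python
def simulate(initial_bias, cycles, amplification=0.15, correction=0.0):
    """Return the bias after each train -> predict -> outcomes -> retrain cycle.

    amplification: hypothetical fractional growth in bias per retraining cycle.
    correction: hypothetical fraction of bias removed by an intervention
                (re-sampling, review of flagged decisions) in each cycle.
    """
    bias, history = initial_bias, [initial_bias]
    for _ in range(cycles):
        bias = bias * (1 + amplification)  # model retrains on its own outcomes
        bias = bias * (1 - correction)     # intervention applied every cycle
        bias = min(bias, 1.0)
        history.append(bias)
    return history


unchecked = simulate(0.10, cycles=10)
corrected = simulate(0.10, cycles=10, correction=0.20)
print([round(b, 3) for b in unchecked])
print([round(b, 3) for b in corrected])
```

Under these assumed parameters the unchecked series grows every cycle while the corrected series shrinks. Setting `correction` back to zero for later cycles would let the growth resume, which is the sense in which correction is never finished.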
You collect data because your system needs it. User locations for ride-sharing. Browsing history for recommendations. Purchase history for fraud detection. Each data point is collected for a specific, reasonable purpose. But data collected for one purpose invariably gets used for another.
The browsing history collected for recommendations gets subpoenaed in a divorce case. The location data collected for ride-sharing gets sold to a data broker who sells it to a bounty hunter. The purchase history collected for fraud detection gets used to build a "creditworthiness" profile that determines whether you qualify for an apartment.
The most common defense of mass data collection is: "If you have nothing to hide, you have nothing to fear." This argument collapses on inspection.
You have curtains on your windows. You close the bathroom door. You do not CC your boss on every text to your spouse. Privacy is not about hiding wrongdoing. It is about controlling who knows what about you and when. You share different things with your doctor, your employer, your friends, and your government. The ability to maintain these contextual boundaries is what privacy is.
When a data system collapses all these contexts into a single profile — your medical searches, your political donations, your late-night browsing, your location at 2 AM on a Saturday — it strips away your ability to present different facets of yourself to different audiences. This is not a theoretical harm. It is the mechanism by which people get fired for political opinions, denied insurance for genetic conditions, and stalked by abusive ex-partners.
In 2006, Netflix released a dataset of 100 million movie ratings from 480,000 users, stripped of names, for a recommendation algorithm competition. Researchers at UT Austin showed that a handful of movie ratings with approximate dates were enough to re-identify a user by cross-referencing public IMDb reviews. In 2009, a closeted lesbian mother sued Netflix as "Jane Doe," arguing that the released data could be used to out her.
In 2015, researchers showed that 4 credit card transactions (store and date) were enough to uniquely identify 90% of people in a dataset of 1.1 million users. Knowing the approximate amounts made re-identification easier still, and women were easier to re-identify than men.
The lesson: anonymization does not work on high-dimensional data. Every additional data point exponentially shrinks the set of people who match, until the set contains exactly one person.
The simulation below starts with an anonymous user among a population. As you add data points — location, browsing history, purchases, social connections — watch the anonymity set shrink. How many data points does it take to uniquely identify someone?
Add data points one by one. Watch the anonymity set (people who match the profile) shrink toward 1.
With just 3-4 data points, the anonymity set typically drops below 10. By 5-6 data points, you are usually uniquely identified. This is why "we stripped the names" is not privacy protection. If your dataset has location, timestamps, and any two behavioral signals, treat it as personally identifiable.
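The shrinking anonymity set can be reproduced with a tiny synthetic population. Everything here is invented: the attribute names, the population size, and the arithmetic used to assign values (chosen only so the result is deterministic).

```python
# Synthetic population: attributes are derived arithmetically from the index
# so the example is deterministic. Real data would be messier.
population = [
    {"zip_area": i % 10, "birth_year": 1960 + i % 37, "store": i % 7}
    for i in range(1000)
]
target = population[123]


def anonymity_set_size(people, known):
    """Count people consistent with everything the observer knows."""
    return sum(all(p[k] == v for k, v in known.items()) for p in people)


known = {}
sizes = [anonymity_set_size(population, known)]
for attribute in ("zip_area", "birth_year", "store"):
    known[attribute] = target[attribute]
    sizes.append(anonymity_set_size(population, known))

print(sizes)  # [1000, 100, 3, 1]
```

No names appear anywhere in this dataset, yet three coarse attributes are enough to single out one record in a thousand.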
The EU's General Data Protection Regulation (GDPR) establishes a right to erasure — a person can demand that you delete their data. But we just spent a chapter arguing that event logs should be immutable and append-only. These two principles collide head-on.
If your event log contains "User #4421 purchased product X at time T," and User #4421 exercises their right to erasure, what do you do? You cannot delete the event without breaking the log's integrity. Downstream derived views that depend on this event would become inconsistent.
Practical solutions exist but require design forethought:
| Approach | How it works | Trade-off |
|---|---|---|
| Crypto-shredding | Encrypt personal data with a per-user key. To "delete," destroy the key. The ciphertext remains in the log but is unreadable. | Requires encryption infrastructure from day one. Cannot retrofit easily. |
| Tombstone events | Append a "delete user #4421" event to the log. Derived views process the tombstone and purge their projections. | The original events still exist in the log (with personal data). May not satisfy strict GDPR interpretation. |
| Log compaction | Periodically rewrite the log, omitting events for deleted users. | Breaks immutability. May invalidate downstream offsets. Complex operationally. |
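Crypto-shredding, the first row of the table, can be sketched as follows. This is an illustration under loud assumptions: `KeyStore`, `append_event`, and `read_event` are hypothetical names, and the cipher is a toy built from SHA-256 purely so the example runs with the standard library. A real system would use an audited authenticated cipher (for example AES-GCM) and a managed key service.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: int, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). Illustration only:
    NOT secure enough for real personal data."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(
            key + nonce.to_bytes(8, "big") + offset.to_bytes(8, "big")
        ).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class KeyStore:
    """Hypothetical per-user key store, kept outside the event log."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def get(self, user_id):
        return self._keys.get(user_id)

    def shred(self, user_id):
        self._keys.pop(user_id, None)


keys = KeyStore()
log = []  # append-only; entries are never rewritten


def append_event(user_id, personal_data: str):
    nonce = len(log)  # unique per entry, so a keystream is never reused
    ciphertext = _keystream_xor(keys.key_for(user_id), nonce, personal_data.encode())
    log.append((user_id, nonce, ciphertext))


def read_event(entry):
    user_id, nonce, ciphertext = entry
    key = keys.get(user_id)
    if key is None:
        return None  # key destroyed: payload is unrecoverable
    return _keystream_xor(key, nonce, ciphertext).decode()


append_event(4421, "purchased product X")
append_event(7, "purchased product Y")
keys.shred(4421)                     # right-to-erasure request
print([read_event(e) for e in log])  # [None, 'purchased product Y']
print(len(log))                      # 2: the log itself is untouched
```

Note that the user ID itself is still stored in the clear in this sketch. Whether that identifier also has to be pseudonymized is a design question the table's trade-off column does not settle.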
Differential privacy is a mathematical framework for publishing aggregate statistics about a dataset without revealing information about any individual in it. The core idea: add carefully calibrated random noise to query results so that the output is approximately the same whether or not any single person's data is in the dataset.
Formally: a mechanism M is ε-differentially private if, for any two datasets D and D' that differ in one person's data, and for any set of possible outputs S:

$$\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]$$
The parameter ε (epsilon) is the privacy budget. Smaller ε means more noise and stronger privacy, but less useful results. Larger ε means less noise and better accuracy, but weaker privacy guarantees. Apple has reported per-day values between ε = 2 and ε = 8 depending on the feature (ε = 4 for emoji usage statistics). The US Census Bureau used ε = 19.61 for the 2020 Census redistricting data.
In practice, you implement differential privacy by adding noise drawn from a Laplace or Gaussian distribution to each query result. The noise magnitude is calibrated to the sensitivity of the query (how much one person's data can change the result). A count query has sensitivity 1 (adding or removing one person changes the count by at most 1). An average salary query has higher sensitivity (one person with a $10M salary shifts the average significantly).
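For a count query, the Laplace mechanism adds noise with scale equal to sensitivity divided by ε. The sketch below uses only the standard library. The function names are hypothetical, and the fixed seed is there only to make the run repeatable (a real deployment would need a cryptographically secure noise source).

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism for a count query (sensitivity 1)."""
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # seeded only to make the sketch repeatable
true_count = 1000
for epsilon in (0.1, 1.0, 10.0):
    answers = [private_count(true_count, epsilon, rng) for _ in range(5)]
    print(epsilon, [round(a, 1) for a in answers])
```

Running it shows the trade-off directly: at ε = 0.1 the answers scatter widely around 1000, while at ε = 10 they are nearly exact and offer correspondingly weaker protection.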
A self-driving car kills a pedestrian. Who is responsible? The engineer who wrote the perception model? The PM who decided the model was ready for deployment? The company that sold the vehicle? The regulator who approved it for road use? The training data that did not contain enough examples of pedestrians at night?
This is not a philosophical thought experiment. It happened in Tempe, Arizona in 2018 (Uber ATG). The NTSB investigation found failures at every level: the perception system classified the pedestrian first as an "unknown object," then as a vehicle, then as a bicycle; the safety driver was watching a video on her phone; and Uber had disabled the Volvo's factory emergency braking system to avoid "jerky rides." Nobody went to prison. The safety driver was charged with negligent homicide and later pleaded guilty to a lesser endangerment charge, receiving probation; Uber paid a settlement. The engineers were not held individually liable.
Anthropologist Madeleine Clare Elish coined the term moral crumple zone: when an automated system fails, blame collapses onto the nearest human — usually the lowest-ranking person in the chain. The safety driver. The content moderator. The bank teller who followed the algorithm's recommendation to deny a loan.
The people who designed the system, chose the training data, set the decision thresholds, and decided to deploy it — they are insulated from accountability by layers of organizational structure. The person who merely operated the system becomes the moral crumple zone: absorbing the impact of a failure they did not design and could not prevent.
| Role | What they control | What they are responsible for |
|---|---|---|
| Data engineer | Data collection, cleaning, labeling | Data quality, representation, consent for collection |
| ML engineer | Model architecture, training, evaluation | Bias detection, fairness metrics, failure mode analysis |
| Product manager | Feature decisions, deployment criteria | Use case ethics, threshold choices, user impact assessment |
| Engineering manager | Timelines, staffing, priorities | Ensuring time/resources exist for ethics review |
| Executive | Business model, strategy | Incentive structures, compliance, corporate accountability |
Notice that responsibility is distributed but not diluted. Every role has specific things they control and specific things they are accountable for. "I just built what the PM asked for" is not absolution — you, as the engineer, are the person who understands the failure modes. If you know the model has a 25% false-positive rate for a protected group and you ship it without raising the issue, that is your responsibility.
This is not a simulation — it is a scenario for you to think through. There is no "right" answer, but there are answers that demonstrate ethical reasoning versus answers that dodge responsibility.
Think about this before reading on. What are the stakeholders? What are the competing values? What would you actually say in a meeting?
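A strong answer to the zip-code lending scenario (remove a feature that proxies for race and accept roughly 2% more defaults, or keep it) addresses multiple dimensions: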
| Dimension | Consideration |
|---|---|
| Legal | Using zip code as a proxy for race may violate fair lending laws (ECOA, Fair Housing Act). "The model is technically correct" is not a legal defense if the effect is discriminatory — this is called disparate impact. |
| Business | A 2% increase in defaults is a known, quantifiable cost. A discrimination lawsuit or regulatory action is an unknown, potentially existential cost. The risk calculus favors removing the feature. |
| Technical | Zip code is a proxy variable — it correlates with race because of historical redlining. Removing it may not fully solve the problem (other features may also proxy for race). You need a fairness audit across all features. |
| Ethical | The people denied loans are real. They cannot buy homes, start businesses, or build wealth. The 2% default cost is absorbed by a corporation; the denial cost is absorbed by individuals with the least power to recover. |
When you discover a harmful system, your response options range from quiet to loud. Each has different costs and different effectiveness:
| Action | Risk to you | Likely impact |
|---|---|---|
| Document and raise internally | Low | Depends on company culture. May be ignored, may trigger a fix. |
| Escalate to leadership | Medium | Higher visibility. Works if leadership is receptive; backfires if they are defensive. |
| Refuse to ship | High | Forces the conversation. May cost you your job. Depends on your leverage and the severity of harm. |
| External whistleblowing | Very high | Last resort. Legal protections vary by jurisdiction. Some (EU, US federal) protect whistleblowers; many do not. |
There is no universal right answer. But there is a wrong answer: doing nothing because "it is not my department." If you see a system causing harm and you have the technical knowledge to understand the harm, you have a professional obligation to act — starting with the lowest-risk option and escalating as needed.
Where is the field heading? Kleppmann identifies four frontier challenges that will define the next decade of data system design. Each one is both a technical problem and, as we have seen, an ethical one.
A traditional relational database is a monolith: storage engine, query optimizer, transaction manager, indexing, caching, replication — all packaged together. You pick Postgres or MySQL and you get all of it, tightly integrated.
The trend is toward unbundling: decompose the monolith into specialized components connected by event streams. Use a purpose-built storage engine (RocksDB, TiKV). Use a separate indexing service (Elasticsearch, Meilisearch). Use a separate caching layer (Redis, Memcached). Use a separate analytics engine (ClickHouse, DuckDB). Coordinate them through an event log (Kafka, Pulsar).
Each component is independently scalable, independently replaceable, and optimized for its specific access pattern. The cost: you now manage the consistency between them yourself, instead of getting it for free from the monolith's transaction manager.
Transactions give you correctness within a single database. But modern systems span multiple databases, caches, queues, and services. A user clicks "buy" and that event must flow correctly through: the order service, the inventory service, the payment service, the email service, and the analytics pipeline. A transaction in any single service is not enough — you need correctness across the entire chain.
The tools: idempotency keys (so retries do not cause duplicates), exactly-once delivery (which really means "effectively once" — you may deliver twice but the second is a no-op), and end-to-end checksums (verify that the final output matches what was intended, not just that each hop succeeded).
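An idempotency key turns "delivered twice" into "applied once." The sketch below is a minimal in-memory version with hypothetical names. In a real service the stored result and the state change would have to be committed atomically in the same transaction.

```python
class PaymentService:
    """Hypothetical consumer that makes redelivered requests harmless."""

    def __init__(self):
        self.balance_cents = 0
        self._results = {}  # idempotency key -> result of the first attempt

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # Duplicate delivery: return the stored result, apply nothing.
            return self._results[idempotency_key]
        self.balance_cents += amount_cents
        result = {"status": "charged", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-1001", 5000)
retry = service.charge("order-1001", 5000)  # client retried after a timeout
print(service.balance_cents)  # 5000
print(first == retry)         # True
```

The caller cannot tell whether its first request succeeded before the timeout, and with this design it does not need to: retrying is always safe.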
You have a uniqueness constraint: no two users can register the same email address. Easy in a single database — add a unique index. Now partition your user table across 16 shards. User "alice@example.com" is on shard 7. A new registration request for "alice@example.com" arrives at shard 12 (because it was routed by user ID, not email). How does shard 12 know that this email is already taken on shard 7?
You need cross-partition coordination — which means consensus, which means latency. This is the fundamental trade-off: you can have fast writes (no coordination) or you can enforce constraints (coordination required), but not both at the same time. The solution space includes: (a) route by the constrained field (partition by email, not user ID), (b) use a separate global uniqueness service, or (c) accept eventual uniqueness (detect and resolve duplicates asynchronously).
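Option (a), routing by the constrained field, can be sketched as below. The shard count follows the example in the text, while `shard_for_email`, `EmailRegistry`, and the lowercase-and-strip normalization rule are assumptions made for the illustration.

```python
import hashlib

NUM_SHARDS = 16


def shard_for_email(email: str) -> int:
    """Stable hash of the normalized email picks the owning shard."""
    digest = hashlib.sha256(email.strip().lower().encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class EmailRegistry:
    def __init__(self):
        # One set per shard stands in for a per-shard unique index.
        self.shards = [set() for _ in range(NUM_SHARDS)]

    def register(self, email: str) -> bool:
        normalized = email.strip().lower()
        shard = self.shards[shard_for_email(normalized)]
        if normalized in shard:
            return False  # duplicate caught by a single-shard check
        shard.add(normalized)
        return True


registry = EmailRegistry()
print(registry.register("alice@example.com"))   # True
print(registry.register("Alice@Example.com "))  # False: same shard, same key
```

Because every request for a given email lands on the same shard, the uniqueness check never crosses a partition boundary. The price is that lookups by user ID now need a secondary index or a scatter-gather query.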
Not all data needs to be fresh. But all data needs to be correct. This distinction — timeliness (how fresh) vs. integrity (how correct) — is fundamental to system design.
| Property | Definition | Consequence of violation | Spectrum |
|---|---|---|---|
| Integrity | Data is correct and consistent | Wrong decisions, financial loss, legal liability | Binary: either correct or not |
| Timeliness | Data reflects recent reality | Stale recommendations, outdated dashboards | Continuous: from milliseconds-fresh to hours-stale |
Integrity violations are catastrophic and hard to detect. If your bank balance is wrong, you may not notice until you overdraft. Timeliness violations are visible but usually tolerable. If your recommendation engine is 5 minutes stale, nobody dies.
Design systems that never sacrifice integrity but relax timeliness where acceptable. Use synchronous coordination (consensus, transactions) for integrity-critical paths (money movement, user registration). Use asynchronous replication (eventual consistency, stream processing) for timeliness-optional paths (analytics, recommendations, search indexing).
The simulation below shows a traditional monolithic database on the left and an unbundled architecture on the right. Events flow in. On the monolith side, a single system handles everything. On the unbundled side, specialized components handle their piece. Watch how the unbundled side can scale each component independently, and how a failure in one component does not take down the others.
Send events to both architectures. Kill a component on the unbundled side — the others keep working. Kill the monolith — everything stops.
When you kill the search component in the unbundled architecture, the database and cache keep processing events. When you bring search back, it catches up from the event log. When you kill the monolith, everything stops — storage, indexing, caching, queries — all go down together. This is the resilience advantage of unbundling.
This is the capstone of the entire DDIA series. Below you will find: ethics talking points that distinguish you in system design interviews, technical philosophy questions that test deep understanding, a cheat sheet of every concept from these final chapters, and connections to the rest of the book.
These questions are increasingly common in system design and behavioral interviews at companies that handle sensitive data (fintech, healthtech, adtech, social media, hiring platforms). Having a structured answer — not just "ethics is important" — is what separates a senior from a staff engineer.
| Question | What they want to hear | Staff-level addition |
|---|---|---|
| "How do you think about data privacy in your designs?" | Concrete techniques: encrypt PII at rest, per-user encryption keys, crypto-shredding for deletion, data minimization (collect only what you need), retention policies. | Discuss the tension between event log immutability and GDPR right-to-erasure. Propose crypto-shredding as the design-time solution. Mention that anonymization is insufficient for high-dimensional data. |
| "What would you do if you discovered bias in a production model?" | Immediate triage (quantify the bias, who is affected, how severely), stakeholder notification (legal, product, leadership), mitigation (feature removal, re-training, human-in-the-loop review), monitoring (add fairness metrics to the dashboard). | Discuss the feedback loop: biased predictions create biased outcomes that create biased training data. A one-time fix is insufficient — you need ongoing fairness audits at every retraining cycle. Cite COMPAS or Amazon hiring as real examples. |
| "How do you balance feature development with data privacy?" | Privacy by design: build privacy controls into the architecture from the start (access control, audit logs, encryption) rather than bolting them on later. Data minimization: every feature request should justify what data it needs and why. | Discuss differential privacy for analytics (add calibrated noise so individual records cannot be extracted). Propose a "privacy budget" that limits the total information any external query can extract. |
| Question | Strong answer |
|---|---|
| "Event sourcing vs. CRUD — when and why?" | CRUD: simple, direct state updates. Good for low-complexity domains with few audit requirements. Event sourcing: append-only log of immutable events, derived views are projections. Better for: audit trails (finance, healthcare), temporal queries ("what was the state at time T?"), and multi-consumer architectures (event log feeds DB + cache + search + analytics). Cost: higher complexity, need to design for log compaction and snapshots. |
| "Lambda vs. Kappa architecture?" | Lambda: batch + speed layers, two codepaths, accurate but complex. Kappa: stream-only, one codebase, replay for reprocessing. Lambda is legacy in most greenfield designs. Kappa works if your stream processor can handle replay throughput (Flink, Kafka Streams). Lambda still makes sense when batch and real-time have genuinely different semantics (e.g., ML training is batch, serving is stream). |
| "What does 'exactly-once' actually mean?" | It means "effectively once" — messages may be delivered more than once, but processing is idempotent so the effect happens once. Achieved via: unique event IDs + dedup table on the consumer, or transactional outbox pattern, or Kafka's idempotent producer + transactions. True exactly-once over a network is impossible (Two Generals' Problem). What we guarantee is exactly-once semantics from the application's perspective. |
| "How do you enforce uniqueness across partitions?" | Three strategies: (a) partition by the constrained field (ensures all checks hit one partition, but may skew data distribution), (b) global uniqueness service (single coordination point — potential bottleneck), (c) asynchronous conflict detection (allow duplicates briefly, detect and merge/reject later). Choice depends on how critical the constraint is. Email uniqueness: use (a) or (b). Username suggestion: (c) is fine. |
| Concept | One-line summary | Where it matters |
|---|---|---|
| Event sourcing | Append-only log of events is the source of truth; state is derived | Audit trails, temporal queries, multi-consumer architectures |
| Immutability | Never modify data — add corrections as new events | Accounting, debugging, recovery |
| Derived data | Databases, caches, indexes are all projections of the event log | Unbundled database architecture |
| Lambda architecture | Batch + speed layer, merge at query time | Legacy systems, mixed batch/stream semantics |
| Kappa architecture | Stream-only, replay from log for reprocessing | Greenfield designs, Flink/Kafka ecosystems |
| Feedback loop | Biased predictions cause biased outcomes that create more biased data | Any ML system that affects the reality it predicts |
| Disparate impact | A facially neutral policy that disproportionately affects a protected group | Hiring, lending, criminal justice, housing |
| Crypto-shredding | Encrypt PII with per-user keys; "delete" by destroying the key | GDPR compliance in event-sourced systems |
| Differential privacy | Add calibrated noise to query results so individual records cannot be identified | Analytics, census data, ML training |
| Moral crumple zone | The nearest human absorbs blame for an automated system's failure | Safety-critical systems, content moderation |
| Timeliness vs. integrity | Integrity is non-negotiable (data must be correct); timeliness is a spectrum | Payment systems (integrity) vs. recommendations (timeliness) |
| Exactly-once semantics | "Effectively once" — idempotent processing so duplicates have no effect | Payment processing, inventory management |
You have now covered the entire book. The resources below go deeper into the themes of these final chapters:
| Book/Resource | Author | Why read it |
|---|---|---|
| Weapons of Math Destruction | Cathy O'Neil (2016) | How algorithmic decision-making amplifies inequality. Case studies in education, policing, lending, and hiring. |
| The Tyranny of Metrics | Jerry Z. Muller (2018) | How optimizing for measurable metrics corrupts the underlying goal. Directly relevant to ML loss function design. |
| Automating Inequality | Virginia Eubanks (2018) | How data systems disproportionately surveil and punish poor communities. |
| GDPR Full Text | EU (2016) | If you build systems that process personal data, you should read the actual regulation. It is more readable than you expect. |
| Machine Bias (ProPublica) | Angwin et al. (2016) | The original COMPAS investigation. Essential primary source for any discussion of algorithmic fairness. |
| Kleppmann's talks | Martin Kleppmann | His Strange Loop and QCon talks cover much of this material in lecture form. Excellent for reinforcement. |