Relational, document, graph — choosing how to structure and query your data.
You are building LinkedIn. A single user has a name, a profile photo URL, a current job title, a current employer, a list of past jobs (each with a company, title, start date, end date), a list of degrees (each with a school, field, year), a list of skills (each endorsed by other users), and connections to thousands of other users who have the same complex structure.
Now store this. You have three choices, and each one will shape everything about your application — what queries are fast, what queries are slow, how you scale, and what happens when the product evolves.
Choice 1: Relational tables. Split the data into normalized tables. A users table, a jobs table with a foreign key to users, an education table with a foreign key to users, a skills table, and a connections table that is a many-to-many join table linking two user IDs. To display a single profile, you JOIN across 4-5 tables.
Choice 2: A JSON document. Store each user as one self-contained JSON blob. The profile, jobs, education, skills — all nested inside one document. One read to fetch a profile. No joins needed. But if two users both worked at "Google," you have the string "Google" duplicated in both documents. And if Google changes its name? You update every document that mentions it.
Choice 3: A graph. Every entity — user, company, school, skill — is a node. Relationships are edges: WORKS_AT, STUDIED_AT, KNOWS, HAS_SKILL. To find "all people who worked at Google and know Alice," you traverse edges. Natural for highly interconnected data. But simple lookups ("give me Alice's profile") require gathering data from many nodes.
Each choice is a bet on what matters most. If you bet on data integrity and complex queries, go relational. If you bet on read performance and schema flexibility, go document. If you bet on relationship-heavy queries and interconnected data, go graph. Most real systems use a combination — but you must understand each model deeply to know when to reach for it.
The simulation below shows a LinkedIn profile stored in all three formats side by side. Notice how the same information looks fundamentally different depending on the model. Click each tab to see the structure, and pay attention to where duplication lives and where references live.
Click a model to see how the same LinkedIn profile is stored. Watch where data is duplicated vs referenced.
No model wins on every dimension. Here is the fundamental triangle of trade-offs:
| Dimension | Relational | Document | Graph |
|---|---|---|---|
| Data locality (one read for a profile) | Poor (multi-table JOIN) | Excellent (one document) | Poor (traverse nodes) |
| Many-to-many relationships | Excellent (JOIN tables) | Poor (duplication or manual refs) | Excellent (edges) |
| Schema flexibility | Rigid (ALTER TABLE) | Flexible (schema-on-read) | Very flexible (add node/edge types) |
| Complex queries (analytics, aggregation) | Excellent (SQL) | Limited (no JOINs) | Good for traversal, poor for aggregation |
| Data integrity (no dangling refs) | Excellent (foreign keys) | None (application enforced) | Moderate (depends on DB) |
Every systems design interview starts with "How would you model the data?" If you choose relational when the data is hierarchical, you will spend the rest of the interview explaining why your JOINs are slow. If you choose document when the data has many-to-many relationships, you will spend the interview explaining how you handle data consistency. The data model choice constrains every subsequent decision: storage engine, indexing strategy, sharding key, caching layer, consistency model.
More practically: about 60% of production bugs in data-intensive applications trace back to a data model mismatch. Storing a graph in flat tables. Embedding data that should be referenced. Using a document store for transactional data. These are not bugs you can fix with better code — they require redesigning the schema, which means migrating data, which means downtime or a painful multi-week migration project.
The rest of this lesson walks through each model in depth, teaches the query languages that go with them, and builds your intuition for choosing the right model in a system design interview.
jobs table with a foreign key to the users table. If two users both worked at Google, how many times does the string "Google" appear in a well-normalized relational schema?In 1970, Edgar Codd published "A Relational Model of Data for Large Shared Data Banks" while working at IBM. The paper proposed something radical: stop thinking about how data is physically stored on disk (navigational pointers, linked lists, network paths) and instead think of data as relations — flat tables of rows and columns. Let the database figure out the physical layout. You just say what data you want.
This separation of logical model from physical storage is arguably the most important idea in the history of data management. It meant that application code did not break when the DBA reorganized the disk layout. It meant that a query optimizer could choose the best access path automatically. It meant that two different applications could share the same data without coordinating on how to traverse it.
Before Codd, databases used hierarchical models (IBM's IMS, 1966) or network models (CODASYL, 1969). In both, the programmer had to know the physical access path to data — "start at this record, follow this pointer, walk this linked list." If the physical layout changed, every program that touched the database broke. If you wanted a new query that traversed the data differently, you might need to restructure the entire database.
Codd's insight was revolutionary: separate the logical representation (tables, columns, rows) from the physical representation (indexes, page layout, buffer pools). Let the database system figure out how to execute a query efficiently. The programmer specifies what data they want (SQL), not how to navigate to it.
This idea was so powerful that relational databases dominated for 40 years. Oracle, PostgreSQL, MySQL, SQL Server — all relational. It was not until the late 2000s, when web-scale companies hit the limits of single-machine relational databases, that alternatives gained serious traction.
A relation is a table. A tuple is a row. An attribute is a column. This vocabulary comes from mathematical set theory — a relation is a set of tuples — but in practice everyone says table, row, column.
The killer feature of the relational model is the JOIN. Data is normalized: instead of duplicating information, you store it once in a dedicated table and reference it by ID from other tables. This reference is a foreign key. When you need the full picture, you JOIN the tables together at query time.
Schema-on-write means you define the schema (columns, types, constraints) before you insert any data. The database rejects data that does not conform. This is like static typing in programming: errors are caught early, at write time, not at read time.
companies table. Every job that references company_id=42 automatically shows the new name. In a denormalized system, you would need to find and update every document or row that contains the string "Google." Normalization trades write simplicity (one update) for read complexity (JOINs). This is the central bargain of the relational model.Let us design the relational schema for our LinkedIn profile step by step.
sql CREATE TABLE users ( user_id SERIAL PRIMARY KEY, name VARCHAR(100) NOT NULL, photo_url TEXT, summary TEXT ); CREATE TABLE companies ( company_id SERIAL PRIMARY KEY, name VARCHAR(200) NOT NULL, industry VARCHAR(100) ); CREATE TABLE jobs ( job_id SERIAL PRIMARY KEY, user_id INT REFERENCES users(user_id), company_id INT REFERENCES companies(company_id), title VARCHAR(200), start_date DATE, end_date DATE -- NULL means current job ); CREATE TABLE education ( edu_id SERIAL PRIMARY KEY, user_id INT REFERENCES users(user_id), school VARCHAR(200), field VARCHAR(200), grad_year INT ); CREATE TABLE connections ( user_a INT REFERENCES users(user_id), user_b INT REFERENCES users(user_id), since DATE, PRIMARY KEY(user_a, user_b) );
Five tables. No string is duplicated. Every relationship is a foreign key. This is Third Normal Form (3NF): every non-key attribute depends on the whole key and nothing but the key.
Notice the connections table. It has a composite primary key: (user_a, user_b). This is a junction table (also called a join table, bridge table, or associative entity). It represents a many-to-many relationship: a user can connect to many other users, and each user can have many connections. This pattern — a dedicated table that holds only foreign keys — is the standard way to model many-to-many relationships in the relational world.
Also notice the REFERENCES keyword on foreign keys. This is not just documentation — it is a constraint. If you try to insert a job with user_id = 999 and there is no user with that ID, the database rejects the insert. If you try to delete a user who has jobs, the database stops you (or cascades the delete, depending on configuration). This referential integrity guarantee is one of the most powerful features of the relational model: it is impossible to have dangling references or orphaned records.
To assemble a full profile, we join across tables:
sql SELECT u.name, u.summary, j.title, c.name AS company, j.start_date, j.end_date, e.school, e.field, e.grad_year FROM users u LEFT JOIN jobs j ON u.user_id = j.user_id LEFT JOIN companies c ON j.company_id = c.company_id LEFT JOIN education e ON u.user_id = e.user_id WHERE u.user_id = 42;
This is one SQL query, but the database must do real work: scan the users table (or use an index) to find user 42, then scan jobs to find all jobs for that user, look up each company by ID, and scan education for the user's degrees. The query planner decides the order and method (index scan, hash join, nested loop) automatically.
The simulation below shows three relational tables populated with sample data. Click "Run JOIN" to watch the database assemble a full profile by joining across tables. The query plan shows the execution steps.
Click Run JOIN to see how the database scans tables, matches foreign keys, and produces the result. Watch the query plan steps.
Interview question: "Find all users who worked at Google and have a CS degree."
sql SELECT DISTINCT u.name FROM users u JOIN jobs j ON u.user_id = j.user_id JOIN companies c ON j.company_id = c.company_id JOIN education e ON u.user_id = e.user_id WHERE c.name = 'Google' AND e.field = 'Computer Science';
Step by step: (1) the database finds company_id for "Google" from the companies table — likely an index lookup, O(log n). (2) It finds all job rows with that company_id — another index lookup. (3) It collects the user_ids from those jobs. (4) It checks which of those users have an education row where field = 'Computer Science'. (5) It looks up the names. The optimizer may reorder these steps for efficiency.
Normalization is the process of organizing a relational schema to minimize redundancy. There are several normal forms, each building on the previous:
First Normal Form (1NF): Every column contains atomic (indivisible) values. No repeating groups, no arrays. A column "skills" containing "Go, Kubernetes, Python" as a comma-separated string violates 1NF. Instead, you create a separate user_skills table with one row per skill per user.
Second Normal Form (2NF): 1NF plus every non-key column depends on the entire primary key, not just part of it. If your primary key is (student_id, course_id) and you have a column "student_name," that depends only on student_id, not the full key. Move it to a separate students table.
Third Normal Form (3NF): 2NF plus no non-key column depends on another non-key column. If you have "zip_code" and "city" in the same table, city depends on zip_code (not on the primary key directly). Move the zip→city mapping to a separate table.
In practice, most production schemas are in 3NF or close to it. Purists go further (BCNF, 4NF, 5NF), but the returns diminish rapidly. The pragmatic rule: normalize until it hurts (too many JOINs), then selectively denormalize the hot paths.
companies.name, it uses it. If the tables are small enough, it does a sequential scan. If the join is large, it uses a hash join instead of nested loops. This means the same SQL query can automatically get faster when you add indexes — without changing any application code.users table with 10 million rows and a jobs table with 50 million rows. To display a user's profile, you JOIN these two tables on user_id. Without any indexes, what is the time complexity of this JOIN?In the mid-2000s, web applications started hitting a wall. Social networks, content management systems, and real-time games had data that was naturally hierarchical — a user profile is a tree of nested objects, not a flat table. Splitting this tree across five relational tables and reassembling it with JOINs on every page load felt like unnecessary overhead. What if you just stored the tree as-is?
The document model does exactly that. Each record is a self-contained document, typically in JSON (or BSON, a binary encoding of JSON). A document can contain nested objects, arrays, arrays of objects — any tree structure. Databases like MongoDB, CouchDB, DynamoDB, and Firestore are document stores.
json { "user_id": 42, "name": "Alice Chen", "photo_url": "/photos/alice.jpg", "summary": "Infrastructure engineer", "jobs": [ { "title": "Staff Engineer", "company": "Google", "start": "2020-03", "end": null }, { "title": "Senior Engineer", "company": "Stripe", "start": "2017-06", "end": "2020-02" } ], "education": [ { "school": "MIT", "field": "Computer Science", "year": 2017 } ], "skills": ["Go", "Kubernetes", "Distributed Systems"] }
One document. One read. No JOINs. The entire profile is a single JSON blob stored contiguously on disk. This is called data locality — all the data you need for one operation lives in one place.
In the relational model, the schema is enforced at write time. If you try to insert a row without a required column, the database rejects it. This is schema-on-write: the structure is explicit and enforced.
Document databases typically do not enforce a schema. You can insert any JSON document into a collection, regardless of its structure. The application interprets the structure when it reads the data. This is schema-on-read: the structure is implicit and assumed by the application code.
Think of it like programming languages. Schema-on-write is like static typing (Java, Go): the compiler catches type errors before the code runs. Schema-on-read is like dynamic typing (Python, JavaScript): errors surface at runtime when you access a field that does not exist.
user.jobs[0].company, it is assuming the document has a jobs array where each element has a company field. If an old document lacks this structure, your code crashes. The schema exists; it is just not enforced by the database. This is the source of many production bugs in document-store applications.Data locality. Fetching a user profile is one read, not a multi-table JOIN. For read-heavy workloads that fetch entire entities, this can be 10-100x faster. The document is stored contiguously on disk, so a single disk seek retrieves everything. In a relational database, the user row, job rows, education rows, and company rows may be on different disk pages, requiring multiple seeks. At scale, this difference in I/O patterns is the primary driver of document store adoption.
Flexible schema. Different documents in the same collection can have different fields. Adding a new field to your user profile does not require migrating existing documents — new documents just include the field, old documents do not. Your application code handles both cases.
Natural fit for hierarchical data. A user profile, a blog post with comments, a product catalog — these are naturally tree-shaped. Mapping them to flat relational tables requires artificial decomposition. Documents preserve the natural structure.
No joins. If Alice and Bob both worked at Google, the string "Google" appears in both documents. If Google changes its name, you must find and update every document — and there is no single atomic operation to do this. During the update, some documents say "Google" and some say "Alphabet." Worse, there is no foreign key constraint stopping you from misspelling it "Gogle" in one document. The database happily stores it because it does not know that "company" is supposed to reference a canonical entity.
Denormalization breeds inconsistency. Without normalization, the same fact is stored in multiple places. This is fine until the fact changes. Then you have an update anomaly: some documents have the old value, some have the new one.
Poor for many-to-many relationships. If users endorse each other's skills, that is a many-to-many relationship between users and skills. Documents cannot naturally express this without duplicating data or using manual references (storing IDs and resolving them in application code — effectively reinventing JOINs badly).
Insert documents, then query by nested field. Compare the single-read fetch to the relational JOIN from Chapter 1.
Most application code is written in object-oriented languages — Python classes, Java objects, Go structs. But relational databases store data in flat tables. Translating between the object model in your application and the relational model in your database is called the impedance mismatch, and it is the single biggest source of complexity in relational applications.
Consider our LinkedIn profile. In Python, it is a natural nested object:
python class UserProfile: name: str summary: str jobs: List[Job] # nested list of objects education: List[Edu] # another nested list skills: List[str] # simple list
In the relational model, this single object becomes 4-5 tables. Loading the object requires joining them. Saving the object requires inserting/updating across multiple tables in a transaction. This translation layer — the ORM (Object-Relational Mapping) — is a constant source of bugs, performance problems, and complexity. Libraries like SQLAlchemy (Python), Hibernate (Java), and ActiveRecord (Ruby) exist entirely to manage this mismatch.
The document model eliminates this mismatch entirely. The JSON document IS the object. Load it, deserialize it, done. No ORM needed. This is why document databases gained traction alongside JavaScript (where JSON is native) and dynamic languages.
When designing a document schema, the key decision is: embed related data inside the document, or reference it by ID and resolve it in a separate query?
| Strategy | When to use | Example |
|---|---|---|
| Embed | One-to-few relationship, data is always fetched together, child data belongs to parent | A user's addresses, a post's comments (if few), a product's variants |
| Reference | One-to-many with many, many-to-many, data is shared across entities | A user's order history (could be thousands), a tag shared across posts |
The rule of thumb from the MongoDB documentation: embed unless you have a good reason not to. Good reasons: (1) the embedded array can grow without bound (risk of exceeding the 16MB document size limit), (2) the embedded data is updated independently of the parent, (3) the embedded data is referenced by multiple parents (many-to-many).
verified field (boolean) to new user sign-ups. A colleague reports that db.users.find({verified: true}) returns only users who signed up after the change, even though you ran a migration script to set verified: true for existing power users. The query returns 500 users but you expected 12,000. What is the most likely cause?This is the question that launches a thousand Hacker News threads. The truth is nuanced, and the industry has been converging: PostgreSQL now has excellent JSON support (jsonb columns with GIN indexes), and MongoDB added $lookup (a JOIN operator) in version 3.2. The lines are blurring. But the fundamental trade-offs remain.
Your data is mostly tree-structured (one-to-many). A blog post has comments. A product has variants. A resume has sections. If each entity is a self-contained tree with little cross-referencing between entities, documents are natural.
You rarely need joins between entities. If your application loads one document at a time (show a product page, display a user profile) and almost never needs to combine data from multiple entities in a single query, the document model's data locality is a pure win.
Your schema evolves rapidly. Early-stage startups often do not know what their data will look like in six months. Schema-on-read lets you iterate without migration headaches. Add fields to new documents; write application code that handles both old and new formats.
Content management systems (CMS) are the ideal use case for document stores. A blog post is a self-contained document: title, author, body text, tags, comments, metadata. Each post is fetched individually ("show this blog post"). Posts are rarely joined with other posts. The schema varies — a video post has a duration field, a photo post has dimensions, a text post has a word count. A document store handles this naturally. WordPress.com, Contentful, and Strapi all use document-oriented storage internally.
Conversely, the document model's worst case is anything that looks like an accounting ledger: debits must equal credits, account balances must be consistent, transactions must be atomic across multiple accounts. These are inherently relational problems that require JOIN, foreign keys, and ACID transactions.
Your data has many-to-many relationships. Users follow users. Orders reference products, which reference categories. Employees belong to departments that belong to divisions. These relationships form a web, not a tree. The relational model's JOINs and foreign keys handle webs naturally.
You need complex joins and aggregations. "What is the average order value by product category for the last quarter?" This is a multi-table join with GROUP BY and date filtering. SQL was designed for exactly this. Document databases can do it, but awkwardly.
Data integrity matters more than flexibility. A banking application cannot tolerate an account with a negative balance because of a missing field. Foreign keys, CHECK constraints, NOT NULL, and transactions enforce correctness at the database level rather than relying on application code.
Modern databases are hybrid:
| Database | Primary Model | Borrowed Feature |
|---|---|---|
| PostgreSQL | Relational | jsonb columns with indexing, path queries |
| MySQL 5.7+ | Relational | JSON data type with generated columns |
| MongoDB 3.2+ | Document | $lookup (left outer join), schema validation |
| DynamoDB | Document/KV | Global secondary indexes, transactions (2018) |
| CockroachDB | Relational | JSON columns, distributed SQL |
The simulation below walks you through a data model decision. Answer questions about your application's data, and see which model the decision tree recommends. Try different paths.
Answer the questions to get a recommendation. Click a choice at each branch.
You are building an e-commerce platform with four entity types: Products (name, description, price, attributes that vary by category — shirts have size/color, laptops have RAM/storage), Orders (user, items, quantities, shipping address, payment status), Reviews (user, product, text, rating, date), and User Profiles (name, email, addresses, payment methods, order history).
Think about each entity. Which model fits best?
| Entity | Best Model | Why |
|---|---|---|
| Products | Document (or relational + jsonb) | Attributes vary by category — a shirt has size/color, a laptop has RAM/storage. A rigid relational schema would need a separate table per category or an ugly EAV (entity-attribute-value) pattern. A JSON document naturally holds variable attributes. |
| Orders | Relational | Orders reference users and products (many-to-many). Payment status needs transactional updates. Integrity matters — you cannot lose an order. Foreign keys and ACID transactions are essential. |
| Reviews | Either | A review references a user and a product (relational JOINs useful), but each review is self-contained (document locality useful). PostgreSQL with foreign keys and a jsonb column for metadata is a good hybrid. |
| User Profiles | Document (or relational + jsonb) | A profile is tree-shaped (addresses array, payment methods array). Rarely joined with other entities beyond order lookups. Good document locality. |
PostgreSQL's jsonb column type deserves special attention because it effectively gives you a document database inside a relational database. You get foreign keys, transactions, and SQL for the structured parts of your data, plus JSON flexibility for the unstructured parts.
sql -- Relational structure + JSON flexibility CREATE TABLE products ( product_id SERIAL PRIMARY KEY, name VARCHAR(200) NOT NULL, price DECIMAL(10,2) NOT NULL, category VARCHAR(50) NOT NULL, attributes JSONB -- flexible per-category attributes ); -- A shirt: INSERT INTO products (name, price, category, attributes) VALUES ('Classic Tee', 29.99, 'apparel', '{"size": "M", "color": "navy", "material": "cotton"}'); -- A laptop: INSERT INTO products (name, price, category, attributes) VALUES ('ThinkPad X1', 1499.00, 'electronics', '{"ram_gb": 16, "storage_gb": 512, "screen": "14 inch"}'); -- Query JSON fields with GIN index support: CREATE INDEX idx_attrs ON products USING GIN(attributes); SELECT name, price FROM products WHERE attributes @> '{"color": "navy"}'; -- uses GIN index!
The @> operator means "contains" — it checks if the JSON value contains the specified key-value pairs. With a GIN index, this is fast even on millions of rows. You get the flexibility of a document store for the variable attributes, plus the integrity and query power of SQL for the structured fields.
You can even combine relational JOINs with JSON queries in a single statement:
sql -- Find orders for navy products over $50, with buyer info SELECT u.name, p.name, p.price, o.quantity FROM orders o JOIN users u ON o.user_id = u.user_id JOIN products p ON o.product_id = p.product_id WHERE p.attributes @> '{"color": "navy"}' AND p.price > 50;
This query uses a relational JOIN (orders → users, orders → products) combined with a JSON containment check on the product attributes. No other database model gives you this combination of flexibility and power in a single query.
A data model is only half the story. The other half is how you ask questions of your data. The query language determines what is easy to express, what is hard, and — critically — what the database can automatically optimize.
An imperative query tells the computer how to get the answer, step by step. You write a loop, you check conditions, you accumulate results. An declarative query tells the computer what you want, and the computer figures out how to get it.
Here is the same query — "find all users in the Engineering department" — written both ways:
javascript // IMPERATIVE: you specify the algorithm function getEngineers(users) { const result = []; for (let i = 0; i < users.length; i++) { if (users[i].dept === "Engineering") { result.push(users[i]); } } return result; }
sql -- DECLARATIVE: you specify the result SELECT * FROM users WHERE dept = 'Engineering';
The imperative version hardcodes the strategy: iterate from index 0 to length, check each element. If the data were sorted by department, a binary search would be faster — but the imperative code cannot take advantage of that without being rewritten.
The declarative version says nothing about how to find the rows. The query optimizer chooses: if there is an index on dept, use it. If the table is tiny, do a sequential scan. If the data is partitioned by department, go directly to the Engineering partition. The same SQL automatically adapts to different physical layouts.
Here is another example you already know: CSS. Styling a web page with imperative JavaScript means manually computing element positions, setting pixel values, handling resize events. CSS is declarative: you say display: flex; justify-content: center; and the browser figures out the layout. If the window resizes, the browser recomputes automatically. If a new browser version has a faster layout engine, your CSS automatically benefits. Same principle as SQL.
SQL (Structured Query Language) was invented in the 1970s at IBM, closely following Codd's relational model. It is declarative: you describe the shape of the result, not the algorithm to produce it.
The core operations:
| Operation | SQL | What it does |
|---|---|---|
| Select | SELECT cols FROM table | Choose which columns to return (projection) |
| Filter | WHERE condition | Keep only rows matching the condition (selection) |
| Join | JOIN table ON condition | Combine rows from two tables based on a relationship |
| Group | GROUP BY cols | Aggregate rows that share a common value |
| Order | ORDER BY cols | Sort the result set |
| Limit | LIMIT n | Return only the first n rows |
MongoDB's query language is also declarative, but uses a pipeline of stages instead of SQL clauses. Each stage transforms the data and passes it to the next stage.
javascript // MongoDB: "find all users in Engineering, group by team, count" db.users.aggregate([ { $match: { dept: "Engineering" } }, // WHERE { $group: { _id: "$team", count: { $sum: 1 } } }, // GROUP BY { $sort: { count: -1 } } // ORDER BY ]);
The pipeline is a sequence of stages: $match (filter), $group (aggregate), $sort (order), $project (select columns), $lookup (join), $unwind (flatten arrays). Each stage is declarative — you say what transformation you want, not how to implement it.
MapReduce is a programming model where you write two functions: a map function that extracts key-value pairs from each document, and a reduce function that combines all values for the same key. The framework handles distributing the work across machines.
javascript // MapReduce: count users per department db.users.mapReduce( function map() { // runs on each document emit(this.dept, 1); // emit (key, value) pair }, function reduce(key, values) { // combines values for same key return Array.sum(values); }, { out: "dept_counts" } );
MapReduce is a hybrid: the map and reduce functions are imperative (you write code), but the distribution and scheduling are declarative (the framework handles parallelism). In practice, MongoDB's aggregation pipeline has replaced MapReduce for most use cases because it is fully declarative and easier to optimize.
MapReduce had a fundamental problem: you had to write two functions for every query. To count users per department, you write a map function and a reduce function. To find the top 5 departments, you write another map and reduce. To filter by region first, you chain MapReduce jobs. Each job writes its output to disk and the next job reads it — massive I/O overhead.
MongoDB's aggregation pipeline replaced this with declarative stages that the optimizer can fuse, reorder, and push down. The same query that took 15 lines of MapReduce code becomes 5 lines of pipeline stages. More importantly, the pipeline engine can optimize: it can merge $match stages, push $limit before $sort (top-N optimization), and use indexes for the initial $match. MapReduce cannot do any of this because the user-written functions are opaque to the optimizer.
This same evolution happened in the big data world: Hadoop MapReduce (imperative, write Java map/reduce classes) was replaced by Spark SQL and Hive (declarative, write SQL). The lesson: declarative always wins because it enables automatic optimization.
The same query — "count users per department" — expressed in imperative JS, SQL, and MongoDB aggregation. Click each to see how it executes and how declarative forms enable parallelism.
SELECT AVG(total) FROM orders; in SQL. The database has just added a new clustered index on the total column. Which version automatically benefits from the new index?What happens when everything is connected to everything else? Social networks (people know people), road networks (intersections connect to intersections), the web (pages link to pages), fraud detection (accounts, devices, IP addresses, and transactions form a tangled web of relationships). When the relationships between entities are as important as the entities themselves, neither relational tables nor documents work well. You need a graph.
A graph has two ingredients: vertices (also called nodes) and edges (also called relationships). Each vertex represents an entity. Each edge represents a connection between two entities. Both vertices and edges can have properties — key-value pairs that store data about them.
Unlike the relational model, where the schema fixes the types of relationships at design time (foreign keys between specific tables), a graph lets you add any edge between any two vertices at any time. A person vertex can have an edge to a company ("WORKS_AT"), an edge to a school ("STUDIED_AT"), an edge to another person ("KNOWS"), and an edge to a city ("LIVES_IN"). No schema migration required.
Property graph (Neo4j, Amazon Neptune, TigerGraph). Each vertex has a unique ID, a set of properties (key-value pairs), and a set of outgoing and incoming edges. Each edge has a unique ID, a label (type), properties, and connects a tail vertex to a head vertex.
Think of it as two relational tables:
sql -- Property graph as relational tables (conceptual) CREATE TABLE vertices ( vertex_id INT PRIMARY KEY, properties JSONB -- {name: "Alice", age: 30} ); CREATE TABLE edges ( edge_id INT PRIMARY KEY, tail_id INT REFERENCES vertices, head_id INT REFERENCES vertices, label VARCHAR(50), -- "WORKS_AT", "KNOWS", etc. properties JSONB );
Triple store (RDF, SPARQL, Datomic). All information is stored as (subject, predicate, object) triples. For example: (Alice, works_at, Google), (Alice, age, 30), (Alice, knows, Bob). The subject is a vertex. The predicate is either an edge label (if the object is another vertex) or a property name (if the object is a value).
text # Triple store format (Turtle/N-Triples syntax) _:alice rdf:type :Person . _:alice :name "Alice Chen" . _:alice :works_at _:google . _:alice :knows _:bob . _:alice :age 30 . _:google rdf:type :Company . _:google :name "Google" .
The killer operation in a graph database is traversal — following edges from vertex to vertex. "Find all people who know Alice" is one hop. "Find all people who know someone who knows Alice" is two hops. "Find the shortest path between Alice and Dave" is a breadth-first search across the graph.
In a relational database, this requires recursive JOINs (a WITH RECURSIVE query in SQL) that become increasingly expensive at each level of depth. In a graph database, traversal is a native operation with consistent performance — each hop is an index lookup on the edge list, typically O(1) per edge.
Click to add people and companies. Create edges between them. Then run a graph traversal to find connections.
| Use Case | Why Graphs Win | Example Query |
|---|---|---|
| Social networks | Friends-of-friends, mutual connections, influence propagation | "Find all people within 3 degrees of Alice" |
| Fraud detection | Suspicious patterns: shared devices, addresses, payment methods across accounts | "Find all accounts that share a device with a known fraudulent account" |
| Knowledge graphs | Semantic relationships: "Paris is_capital_of France, France is_in Europe" | "What is the capital of the country that contains Lyon?" |
| Recommendation engines | Collaborative filtering via graph similarity | "Find products bought by people similar to this user" |
| Network infrastructure | Routers, switches, cables — physical connectivity | "If this switch fails, which servers lose connectivity?" |
Graph databases excel at local queries — "starting from this node, find nearby nodes matching a pattern." A 2-hop traversal from a specific user to find friends-of-friends is fast regardless of total graph size, because you only touch the nodes reachable in 2 hops.
Graph databases struggle with global queries — "count all users per country" or "find the average number of connections." These require scanning every node, and graph databases have no columnar storage or vectorized execution. A SQL GROUP BY on a columnar table will be orders of magnitude faster.
They also struggle with write-heavy workloads. Inserting a node is fast. Inserting an edge requires updating the adjacency list of both endpoints and maintaining indexes. High-throughput event ingestion (sensor data, logs, clickstreams) is better served by column-family stores like Cassandra or time-series databases.
Under the hood, graph databases use index-free adjacency: each node physically stores a pointer to its neighbors. When you follow an edge from Alice to Bob, the database does not perform an index lookup — it follows a direct pointer in memory. This means traversal time is proportional to the number of edges traversed, not the total size of the graph.
Compare this to a relational database storing the same graph in an edges table. To find Alice's neighbors, you query SELECT * FROM edges WHERE tail_id = 'alice'. Even with an index, this is O(log n) where n is the total number of edges in the table. In a graph database, it is O(k) where k is Alice's degree (number of edges from Alice) — independent of total graph size.
This O(k) vs O(log n) difference is negligible for small graphs. But for a social network with 2 billion edges, the difference between O(log 2B) ≈ 31 index lookups per hop and O(1) pointer dereference per hop is enormous — especially when traversing multiple hops.
To solidify your intuition, here is the same e-commerce data in each model. Notice what is easy and what is awkward in each.
sql -- RELATIONAL: clean, normalized, JOINs for queries SELECT p.name, AVG(r.rating), COUNT(r.review_id) FROM products p JOIN reviews r ON p.product_id = r.product_id GROUP BY p.product_id HAVING COUNT(r.review_id) > 10;
javascript // DOCUMENT: must use aggregation pipeline (verbose but functional) db.products.aggregate([ { $lookup: { from: "reviews", localField: "_id", foreignField: "product_id", as: "reviews" }}, { $addFields: { reviewCount: { $size: "$reviews" }, avgRating: { $avg: "$reviews.rating" } }}, { $match: { reviewCount: { $gt: 10 } }} ]);
cypher // GRAPH: traversal-based, elegant for relationship queries MATCH (p:Product)<-[r:REVIEWED_BY]-(u:User) WITH p, AVG(r.rating) AS avgRating, COUNT(r) AS cnt WHERE cnt > 10 RETURN p.name, avgRating, cnt;
All three can express this query. But the relational version is most natural for this aggregation-style query. The graph version works but is not playing to its strengths (aggregation is not what graphs are designed for). The document version is the most verbose because $lookup is a bolt-on feature, not a first-class citizen.
SQL was designed for rectangular tables. Querying a graph with SQL is possible (recursive CTEs, self-joins on an edges table) but painful. Graph databases have their own query languages designed for pattern matching on nodes and edges.
Cypher is a declarative graph query language created for Neo4j. Its core syntax uses ASCII art to describe graph patterns: parentheses for nodes, arrows for edges, square brackets for edge types.
cypher // Find all people who work at Google MATCH (p:Person)-[:WORKS_AT]->(c:Company {name: "Google"}) RETURN p.name; // Friends-of-friends: people 2 hops from Alice MATCH (a:Person {name: "Alice"})-[:KNOWS*2]->(fof:Person) WHERE NOT (a)-[:KNOWS]->(fof) // exclude direct friends RETURN DISTINCT fof.name; // Shortest path between Alice and Dave MATCH path = shortestPath( (a:Person {name: "Alice"})-[:KNOWS*]-(d:Person {name: "Dave"}) ) RETURN path; // Triangle detection: mutual friends MATCH (a:Person)-[:KNOWS]-(b:Person)-[:KNOWS]-(c:Person)-[:KNOWS]-(a) WHERE id(a) < id(b) AND id(b) < id(c) RETURN a.name, b.name, c.name;
Read the patterns like a sentence: (a:Person)-[:KNOWS]->(b:Person) means "a Person node a that has a KNOWS edge pointing to Person node b." The arrow direction matters. The *2 syntax means "exactly 2 hops." The * syntax means "any number of hops."
SPARQL (SPARQL Protocol and RDF Query Language, a recursive acronym) queries triple stores. It uses a pattern matching syntax similar to Cypher but based on triple patterns.
sparql # Find all people who work at Google SELECT ?name WHERE { ?person rdf:type :Person . ?person :works_at ?company . ?company :name "Google" . ?person :name ?name . } # Find people who know someone who works at Google SELECT ?name WHERE { ?person :knows ?friend . ?friend :works_at ?company . ?company :name "Google" . ?person :name ?name . }
Each line is a triple pattern: ?person :works_at ?company matches all triples where the predicate is :works_at. Variables start with ?. The query engine finds all variable bindings that satisfy all patterns simultaneously.
Datalog is the oldest of the three — a subset of Prolog used for database queries since the 1980s. Its distinctive feature is rules: you define derived facts from base facts, and the engine chains rules together.
datalog /* Base facts */ works_at(alice, google). works_at(bob, stripe). knows(alice, bob). knows(bob, carol). /* Rule: X can_reach Y if X knows Y directly */ can_reach(X, Y) :- knows(X, Y). /* Rule: X can_reach Y if X knows someone who can reach Y */ can_reach(X, Y) :- knows(X, Z), can_reach(Z, Y). /* Query: who can Alice reach? */ ?- can_reach(alice, Who). /* Result: Who = bob ; Who = carol */
The power of Datalog is recursive rules. The can_reach rule calls itself. This naturally expresses transitive closure — "find everyone reachable through any chain of KNOWS edges" — in just two lines. In SQL, this requires a WITH RECURSIVE CTE, which is more verbose and harder to read.
In practice, Cypher dominates the graph database world because Neo4j dominates the graph database market. If you learn one graph query language for interviews, learn Cypher. Here is a progression of increasingly complex Cypher patterns you should be able to write on a whiteboard:
| Pattern | Cypher | Use Case |
|---|---|---|
| Direct lookup | MATCH (p:Person {name: "Alice"}) RETURN p | Find a specific node |
| One hop | MATCH (a)-[:KNOWS]->(b) WHERE a.name = "Alice" RETURN b | Direct connections |
| Multi hop | MATCH (a)-[:KNOWS*2]-(b) RETURN b | Friends-of-friends |
| Variable hop | MATCH p = (a)-[:KNOWS*1..4]-(b) RETURN p | Reachability within distance |
| Shortest path | MATCH p = shortestPath((a)-[*]-(b)) RETURN p | Minimum hops between two nodes |
| Pattern match | MATCH (a)-[:KNOWS]-(b)-[:WORKS_AT]->(c) RETURN a, c | Compound relationship queries |
| Aggregation | MATCH (p)-[:KNOWS]-(f) RETURN p, COUNT(f) ORDER BY COUNT(f) DESC | Most connected people |
You CAN query graph-structured data in SQL using recursive CTEs (Common Table Expressions). But it is verbose and the optimizer is not tuned for it. Here is "find all people reachable from Alice" in SQL vs Cypher:
sql -- SQL: recursive CTE for transitive closure (verbose!) WITH RECURSIVE reachable AS ( -- Base case: direct connections from Alice SELECT e.head_id AS person_id, 1 AS depth FROM edges e JOIN vertices v ON e.tail_id = v.vertex_id WHERE v.properties->>'name' = 'Alice' AND e.label = 'KNOWS' UNION -- Recursive case: follow KNOWS edges further SELECT e.head_id, r.depth + 1 FROM reachable r JOIN edges e ON r.person_id = e.tail_id WHERE e.label = 'KNOWS' AND r.depth < 5 -- safety limit ) SELECT DISTINCT v.properties->>'name' FROM reachable r JOIN vertices v ON r.person_id = v.vertex_id;
cypher // Cypher: same query in one line MATCH (a:Person {name: "Alice"})-[:KNOWS*1..5]-(reachable:Person) RETURN DISTINCT reachable.name;
The Cypher version is not just shorter — it also performs better on graph-structured data. A graph database stores adjacency lists natively, so following an edge from a node is a pointer dereference (O(1)). In SQL, following an edge requires an index lookup on the edges table for each hop (O(log n) per hop).
The same query — "find people who worked at Google and know Bob" — in Cypher, SPARQL, and Datalog. Watch the pattern matching highlight nodes and edges in the graph.
(a)-[:KNOWS*1..3]->(b) means "a path from a to b following KNOWS edges with a minimum of 1 hop and a maximum of 3 hops." What does (a)-[:KNOWS*]->(b) (without the range) mean?Time to put everything together. You are the architect for a social media application. You have six entity types: Users, Posts, Comments, Likes, Follows, and Messages. The simulation below lets you choose a data model for each entity type and see the consequences play out in real time.
This is the big interactive payoff. You will see the schema, try sample queries, and observe how different models handle different access patterns. There is no single right answer — the goal is to understand the trade-offs deeply enough to defend your choices in an interview.
Choose a model for each entity. Watch the schema update, try sample queries, and see performance characteristics. Click an entity, then choose a model.
For each entity, ask yourself three questions:
1. What are the access patterns? How will this entity be queried most often? A user profile is loaded individually (single-document fetch). A news feed requires aggregating posts from many followed users (fan-out query). Likes are counted in aggregate (COUNT query). Each pattern favors a different model.
2. What are the relationships? Does this entity stand alone (document), reference a few other entities with well-defined foreign keys (relational), or connect to many other entities in unpredictable ways (graph)? Likes are a many-to-many relationship (relational junction table). Follows are a many-to-many relationship that you need to traverse (graph candidate). Messages are a one-to-many within a conversation (document candidate).
3. What consistency guarantees do you need? Can you tolerate a few seconds of stale data (eventual consistency, fine for like counts)? Or does every write need immediate visibility (strong consistency, required for payments)? Relational databases with ACID transactions give you strong consistency. Document stores and graph databases often offer eventual consistency for better performance.
| Entity | Best Model | Justification |
|---|---|---|
| Users | Relational + jsonb | Core entity referenced by everything. Foreign keys needed. Profile data (bio, settings) is tree-shaped, fits jsonb. Need indexes on username, email. |
| Posts | Relational (or Document) | Posts reference users (author). Media attachments as a JSON array. Need indexes on user_id and created_at for timeline queries. Document works if posts are mostly fetched individually. |
| Comments | Relational | Comments reference both a post and a user. Threaded comments (parent_comment_id) create a tree — but it is a tree within the relational model (self-referencing foreign key). Need to fetch all comments on a post efficiently. |
| Likes | Relational | A like is a many-to-many relationship between a user and a post. It is just a junction table: (user_id, post_id, created_at). No complex data. Pure relational. |
| Follows | Graph (or Relational) | Follows are a many-to-many relationship between users. Simple follow lookups (who does Alice follow?) are fine in a relational junction table. But "friends of friends," "suggested connections," and "mutual followers" queries are expensive in SQL and natural in a graph. If these features matter, add a graph layer for the social graph. |
| Messages | Document (or Relational) | Messages are tree-shaped conversations. Each conversation is a self-contained thread. Data locality matters — fetch an entire conversation in one read. But messages also reference users (sender, recipient), so you need some referencing. A hybrid: relational table for message metadata + document for message content. |
Here is how the same queries perform under different model choices:
| Query | Relational | Document | Graph |
|---|---|---|---|
| Get user profile | 1-3 table JOIN | 1 read | Gather from node + neighbors |
| Get post + comments | 2 table JOIN | 1 read (if embedded) | Traverse post→comments |
| News feed (posts from followed users) | JOIN follows + posts, sort by date | Fan-out: read each followed user's posts | Traverse follow edges, collect posts |
| Mutual friends | Self-join on follows (expensive) | Not practical | 2-hop pattern match |
| Analytics: posts per day | GROUP BY + COUNT | Aggregation pipeline | Poor (not designed for aggregation) |
Data models are not static. Features change, requirements evolve, and you need to add new fields, remove old ones, and restructure data. How this works depends enormously on your data model.
In the relational model, schema changes are explicit. You run an ALTER TABLE statement to add a column, change a type, or add a constraint.
sql -- Add a 'verified' field to all users ALTER TABLE users ADD COLUMN verified BOOLEAN DEFAULT FALSE; -- This runs instantly on PostgreSQL (just updates the catalog) -- But on MySQL < 8.0, this rewrites the entire table (locks it!) -- Add a NOT NULL column with a default ALTER TABLE users ADD COLUMN created_at TIMESTAMP DEFAULT NOW(); -- Rename a column ALTER TABLE users RENAME COLUMN photo_url TO avatar_url;
The key insight: in PostgreSQL, adding a column with a default value is instant (since version 11). The default is stored in the catalog, not written to every row. Old rows return the default when read. This is a form of schema-on-read hiding inside a schema-on-write system.
But some operations are expensive: adding a NOT NULL constraint without a default requires scanning every row to verify no nulls exist. Changing a column type may require rewriting the table. On a table with 100 million rows, this can take hours and lock the table for writes.
In a document store, adding a field is trivial: just include it in new documents. Old documents do not have the field, and that is fine — the application code handles both cases.
javascript // Old document (before the change) { "user_id": 42, "name": "Alice", "email": "alice@example.com" } // New document (after adding 'verified') { "user_id": 99, "name": "Bob", "email": "bob@example.com", "verified": true } // Application code must handle both formats const isVerified = user.verified || false; // default for old docs
The advantage: zero downtime, no migration script, no table locks. The disadvantage: every piece of application code that reads the data must handle the absence of the field. This defensive coding adds up across a large codebase.
Graph databases are the most flexible for schema evolution. You can add new vertex types, new edge types, and new properties to existing vertices and edges without any schema migration at all.
cypher // Add a 'verified' property to Alice's node MATCH (a:Person {name: "Alice"}) SET a.verified = true; // Add an entirely new relationship type MATCH (a:Person {name: "Alice"}), (s:Skill {name: "Go"}) CREATE (a)-[:ENDORSES]->(s); // Add a new node type that didn't exist before CREATE (:Certification {name: "AWS Solutions Architect", level: "Professional"});
This is the most important migration technique to know for interviews. It works for all three data models but is most commonly discussed in the relational context.
This four-phase approach guarantees zero downtime, backward compatibility at every step, and the ability to roll back at any phase. It is how companies like GitHub, Stripe, and Shopify handle schema migrations on tables with billions of rows.
Watch how adding a new field works in each model. See the migration process, the time it takes, and the failure modes.
Your team decides to rename the photo_url field to avatar_url across all user profiles. Here is how it works in each model:
sql -- RELATIONAL: one command, affects all rows ALTER TABLE users RENAME COLUMN photo_url TO avatar_url; -- PostgreSQL: instant (catalog-only). All queries must update column name. -- All application code referencing 'photo_url' must be updated simultaneously.
javascript // DOCUMENT: can't rename in place. Must read + write each document. db.users.updateMany( { photo_url: { $exists: true } }, [{ $set: { avatar_url: "$photo_url" } }, { $unset: "photo_url" }] ); // For 100M documents, this takes time. During migration, some docs // have photo_url and some have avatar_url. Application code must // handle BOTH: const url = user.avatar_url || user.photo_url;
cypher // GRAPH: rename property on all nodes MATCH (u:User) WHERE u.photo_url IS NOT NULL SET u.avatar_url = u.photo_url REMOVE u.photo_url; // Similar to document: happens per-node, takes time for large graphs.
The relational version is the cleanest: one atomic DDL command. The document and graph versions require batch updates where the old and new field names coexist during migration, requiring application code to handle both.
| Scenario | Model | What goes wrong | How to fix |
|---|---|---|---|
| ALTER TABLE on 500M-row table | Relational | Table locked for hours, production writes blocked | Use gh-ost or pt-online-schema-change for zero-downtime migration |
| New field missing in old documents | Document | Application code throws NPE / undefined access | Always provide defaults in application code; validate on read |
| Renamed field in some documents | Document | Queries on old field name miss updated documents | Write a one-time migration to rename the field in all existing documents |
| New edge type lacks index | Graph | Traversal queries on new edge type are slow | Create an index on the new edge type before writing data |
ALTER TABLE users ADD COLUMN verified BOOLEAN NOT NULL DEFAULT true; on a PostgreSQL 15 table with 200 million rows. The command returns in 5 milliseconds. A colleague on MySQL 5.7 runs the equivalent statement and it takes 45 minutes. Why the difference?This chapter distills everything into the formats you will encounter in systems design interviews. Memorize the cheat sheet. Practice the design questions. Understand the debug scenarios deeply enough to walk through them on a whiteboard.
| Model | Best for | Worst for | Query language | Schema | Examples |
|---|---|---|---|---|---|
| Relational | Many-to-many, transactions, analytics | Hierarchical data, rapid schema changes | SQL (declarative) | Schema-on-write (strict) | PostgreSQL, MySQL, CockroachDB |
| Document | Hierarchical data, data locality, flexible schema | Many-to-many, cross-document joins | MongoDB QL, DynamoDB PartiQL | Schema-on-read (flexible) | MongoDB, DynamoDB, CouchDB |
| Graph | Traversal, many-to-many, relationship-heavy | Aggregation, simple CRUD, analytics | Cypher, SPARQL, Gremlin | Schema-optional | Neo4j, Neptune, TigerGraph |
| Key-Value | Simple lookups, caching, session storage | Complex queries, relationships | GET/SET API | None | Redis, Memcached, DynamoDB |
| Column-family | Wide rows, time-series, high write throughput | Complex queries, ad-hoc analytics | CQL (Cassandra) | Flexible columns per row | Cassandra, HBase, ScyllaDB |
Entities: Users, Tweets, Follows, Likes, Retweets, Direct Messages, Hashtags, Timelines.
Core decision: PostgreSQL for the transactional core (users, tweets, likes, retweets — all referential). A graph layer (or Redis-based adjacency list) for the follow graph (friend-of-friend suggestions, mutual follows). A timeline cache (Redis sorted set per user, keyed by tweet_id, scored by timestamp) for the home timeline because the fan-out-on-write vs fan-out-on-read trade-off is the critical design decision.
Fan-out-on-write: When a user tweets, push the tweet to every follower's timeline cache. Pros: fast reads (timeline is pre-computed). Cons: expensive writes for users with millions of followers (a celebrity tweet triggers millions of cache inserts).
Fan-out-on-read: When a user opens their timeline, fetch the latest tweets from all followed users and merge them. Pros: cheap writes. Cons: slow reads (must query N followed users and sort).
Twitter's actual solution: Hybrid. Fan-out-on-write for normal users (most users have < 1000 followers). Fan-out-on-read for celebrities (the top ~1% with millions of followers). A user's timeline is the merge of their pre-computed cache (from normal followed users) and a real-time fetch (from followed celebrities).
Entities: Riders, Drivers, Trips, Locations (real-time GPS), Payments, Ratings, Surge Pricing Zones.
Core decision: PostgreSQL for the transactional core (riders, drivers, trips, payments — strong consistency needed for money). Redis + geospatial indexes for real-time driver locations (need sub-100ms queries for "find nearest available drivers"). A time-series store or columnar database for trip telemetry (GPS traces at 1 Hz, millions of trips/day — too much for PostgreSQL). Document store or jsonb for rider/driver profiles (flexible attributes).
The location problem is the hard part. Every active driver sends a GPS update every second. With 100,000 active drivers, that is 100K writes/second to the location store. Each ride request needs to query "all drivers within 2 km of this point" with sub-second latency. PostgreSQL with PostGIS can handle this at moderate scale. At Uber's scale, they built a custom geospatial index (Ringpop, later H3-based) backed by in-memory stores.
A scenario appears. Click the best data model. See your score and the reasoning.
Drill 1: Design a normalized relational schema for a library management system (books, authors, borrowers, loans, genres). Include foreign keys and at least one many-to-many relationship.
sql CREATE TABLE authors ( author_id SERIAL PRIMARY KEY, name VARCHAR(200) NOT NULL ); CREATE TABLE genres ( genre_id SERIAL PRIMARY KEY, name VARCHAR(100) NOT NULL UNIQUE ); CREATE TABLE books ( book_id SERIAL PRIMARY KEY, title VARCHAR(300) NOT NULL, isbn VARCHAR(13) UNIQUE, pub_year INT ); -- Many-to-many: a book can have multiple authors CREATE TABLE book_authors ( book_id INT REFERENCES books(book_id), author_id INT REFERENCES authors(author_id), PRIMARY KEY(book_id, author_id) ); -- Many-to-many: a book can belong to multiple genres CREATE TABLE book_genres ( book_id INT REFERENCES books(book_id), genre_id INT REFERENCES genres(genre_id), PRIMARY KEY(book_id, genre_id) ); CREATE TABLE borrowers ( borrower_id SERIAL PRIMARY KEY, name VARCHAR(200) NOT NULL, email VARCHAR(200) UNIQUE ); CREATE TABLE loans ( loan_id SERIAL PRIMARY KEY, book_id INT REFERENCES books(book_id), borrower_id INT REFERENCES borrowers(borrower_id), borrowed_at TIMESTAMP DEFAULT NOW(), returned_at TIMESTAMP -- NULL = still borrowed );
Drill 2: Write a MongoDB aggregation pipeline for "find the top 5 most-reviewed products with average rating above 4.0."
javascript db.reviews.aggregate([ // Group reviews by product_id, compute count and average { $group: { _id: "$product_id", avg_rating: { $avg: "$rating" }, review_count: { $sum: 1 } }}, // Filter: average rating above 4.0 { $match: { avg_rating: { $gt: 4.0 } } }, // Sort by review count descending { $sort: { review_count: -1 } }, // Take top 5 { $limit: 5 }, // Optional: look up product details { $lookup: { from: "products", localField: "_id", foreignField: "_id", as: "product" }}, { $unwind: "$product" } ]);
Drill 3: Write a Cypher query for "find all people within 2 degrees of Alice who have the skill 'Kubernetes' and work at a company founded after 2010."
cypher MATCH (alice:Person {name: "Alice"})-[:KNOWS*1..2]-(person:Person) WHERE person <> alice MATCH (person)-[:HAS_SKILL]->(s:Skill {name: "Kubernetes"}) MATCH (person)-[:WORKS_AT]->(c:Company) WHERE c.founded > 2010 RETURN DISTINCT person.name, c.name AS company, c.founded;
Scenario 1: Your PostgreSQL JOIN query for the news feed (posts from followed users) is taking 8 seconds. The follows table has 500M rows and the posts table has 2B rows. What do you check?
(1) Check indexes: do you have indexes on follows.follower_id and posts.author_id? Without them, the JOIN is a sequential scan. (2) Check the query plan with EXPLAIN ANALYZE: is it doing a nested loop or a hash join? A hash join on 500M rows needs massive memory. (3) Consider materialized views: pre-compute each user's feed as a materialized view, refresh it periodically. (4) Consider a different architecture: this is the fan-out problem. Pre-compute timelines in a cache (Redis sorted set) on write, not on read.
Entities: Users, Conversations (1:1 and group), Messages, Media, Read Receipts, Typing Indicators.
Core challenge: Messages are append-only, extremely high throughput (100B+ messages/day for WhatsApp), and the primary access pattern is "get the last N messages in a conversation." This is a time-series access pattern.
Design: Conversations metadata in PostgreSQL (who is in the conversation, created_at, last_message_at). Messages in a column-family store like Cassandra or a log-structured store, partitioned by conversation_id, sorted by timestamp. This gives you O(1) access to "last N messages in conversation X" — just read the tail of the partition. Media (images, videos) as blobs in object storage (S3), referenced by URL in the message. Read receipts as a small relational table or Redis hash: (conversation_id, user_id) → last_read_message_id.
Why not a document store? A single conversation document would grow without bound as messages accumulate. A popular group chat could have millions of messages, exceeding MongoDB's 16MB document limit. You need to paginate messages, which means they must be stored as individual records, not embedded arrays.
Why not a relational database for messages? At WhatsApp's scale (100B+ messages/day), a single PostgreSQL instance cannot handle the write throughput. Cassandra's leaderless replication and tunable consistency make it ideal: write to any node, read from one node (CL=ONE), partition by conversation_id for data locality.
Key schema decision: partition key. In Cassandra, the partition key determines which node stores the data. For messages, the partition key is conversation_id. This means all messages in a conversation live on the same node, so "get the last 50 messages in this conversation" is a single-node read. But "get all messages sent by user X across all conversations" requires a scatter-gather query across all nodes — this is expensive and should be avoided as a primary access pattern. The partition key choice reflects the dominant access pattern: conversations, not users.
This is a general principle of data modeling in distributed systems: the partition key should match your most common query pattern. If most queries are "by user," partition by user_id. If most queries are "by conversation," partition by conversation_id. If you need both, you might need two copies of the data with different partition keys (denormalization at the storage level).
| Dimension | Example Question | What They Test |
|---|---|---|
| Concept | "Explain the impedance mismatch between OO code and relational databases" | Understanding of why document stores gained traction |
| Design | "Design the data model for a social network with 1B users" | Trade-off reasoning, polyglot persistence, scaling awareness |
| Code | "Write a SQL query to find mutual friends of two users" | SQL fluency, understanding of self-joins |
| Debug | "Our news feed query takes 8 seconds, what do you investigate?" | Systematic diagnosis: indexes, query plan, architecture |
| Frontier | "What is NewSQL and how does it relate to the relational/NoSQL debate?" | Awareness of CockroachDB, Spanner, TiDB — relational model with horizontal scaling |
Scenario 2: Your MongoDB $lookup (join) is timing out. The orders collection has 100M documents and you are joining with products (1M documents). What do you check?
(1) Check indexes: $lookup requires an index on the foreignField in the joined collection. Without it, every lookup is a collection scan. (2) Check if you can denormalize: embed product info directly in the order document to eliminate the join. (3) Consider pipeline optimization: add a $match stage before the $lookup to reduce the number of documents being joined. (4) Consider whether MongoDB is the right tool: frequent joins across collections suggest the data is relational, not document-shaped.
Scenario 3: Your Neo4j graph database is fast for 2-hop traversals (< 5ms) but a "find all shortest paths between any two users" query takes 30 seconds. Why?
(1) "All shortest paths" is a global graph algorithm (Dijkstra/BFS from every node), not a local traversal. Graph databases are optimized for local queries starting from a specific node. (2) Check if you actually need ALL shortest paths or just the shortest path between two specific users (much cheaper). (3) For global graph analytics (PageRank, centrality, community detection), use a graph processing framework like Apache Spark GraphX or Neo4j Graph Data Science library — these batch-process the entire graph in memory, not through the query engine. (4) Consider pre-computing: materialize shortest paths as edges for frequently queried pairs.
This is a classic interview coding question. "Find all mutual friends of user A (id=1) and user B (id=2)."
sql -- Mutual friends: people who are friends with BOTH user 1 AND user 2 SELECT f1.friend_id FROM friendships f1 JOIN friendships f2 ON f1.friend_id = f2.friend_id WHERE f1.user_id = 1 AND f2.user_id = 2;
This is a self-join: the friendships table is joined with itself. Row f1 represents "user 1's friends" and row f2 represents "user 2's friends." The JOIN condition f1.friend_id = f2.friend_id finds IDs that appear in both sets. Without an index on (user_id, friend_id), this is a nested loop over potentially millions of rows. With a composite index, each side is an efficient range scan.
In Cypher, the same query is one line: MATCH (a)-[:FRIENDS]-(mutual)-[:FRIENDS]-(b) WHERE a.id=1 AND b.id=2 RETURN mutual. The visual pattern-matching syntax makes graph queries far more readable for relationship-heavy operations.
Practice answering these in 30 seconds each. The goal is crisp, structured answers that demonstrate understanding of trade-offs.
| Question | Key Points for Your Answer |
|---|---|
| "When would you choose MongoDB over PostgreSQL?" | Hierarchical data (CMS, product catalogs), rapidly evolving schema, data locality for read-heavy workloads, when JOINs are rare. Not for: many-to-many relationships, complex analytics, strong integrity requirements. |
| "What is the N+1 query problem?" | ORM loads parent (1 query), then lazy-loads each child individually (N queries). Fix: eager load with JOIN, or use a batch loader (DataLoader pattern). Symptom: page load time scales linearly with number of items. |
| "Explain eventual consistency in document stores" | After a write, not all replicas are updated immediately. A subsequent read might hit a stale replica. Trade-off: lower latency and higher availability in exchange for potentially reading stale data. Tunable in most systems (read concern levels). |
| "How does sharding work differently for documents vs relational?" | Documents: shard by document ID or a field within the document. Self-contained, so single-document queries stay within one shard. Relational: shard by primary key, but JOINs across shards are expensive (cross-shard queries). This is why document stores scale horizontally more easily for simple queries. |
| "What is polyglot persistence?" | Using different databases for different parts of your system based on access patterns. Example: PostgreSQL for orders, Redis for sessions, Elasticsearch for full-text search, Neo4j for social graph. Trade-off: operational complexity of running multiple databases vs performance of each being optimized for its use case. |
Data models are the foundation. Every other chapter in DDIA builds on the choices you make here.
The 2010s saw the rise of NoSQL ("Not Only SQL") — document stores, key-value stores, graph databases, column-family stores. The pitch: horizontal scaling, flexible schemas, no JOINs needed. The reality: developers missed SQL, transactions, and referential integrity. Many NoSQL projects ended up reimplementing these features poorly in application code.
The response was NewSQL: databases that provide the familiar SQL interface, ACID transactions, and relational model, but with the horizontal scaling of NoSQL. The key examples:
| Database | Company | Key Innovation |
|---|---|---|
| Google Spanner | Globally distributed, externally consistent transactions using TrueTime (atomic clocks + GPS) | |
| CockroachDB | Cockroach Labs | Open-source Spanner-inspired, Raft consensus, PostgreSQL-compatible wire protocol |
| TiDB | PingCAP | MySQL-compatible, horizontal scaling, Raft-based replication |
| YugabyteDB | Yugabyte | PostgreSQL-compatible, Raft-based, designed for cloud-native deployment |
| PlanetScale | PlanetScale | MySQL-compatible, Vitess-based sharding, branchable schemas |
The NewSQL movement validates Codd's original insight: the relational model is not the bottleneck. The limitation was always in the single-machine implementation. With distributed consensus and sharding built into the storage engine, you can have SQL, transactions, and horizontal scaling.
| Next Topic | Connection | Lesson |
|---|---|---|
| Storage & Retrieval | Once you have chosen a data model, how does the database actually store it on disk? B-trees for relational indexes, LSM-trees for write-heavy workloads, column stores for analytics. | DDIA Ch4: Storage & Retrieval |
| Replication | How do you copy your data across multiple machines for fault tolerance? The data model affects replication: document stores replicate whole documents, relational databases replicate rows/pages. | DDIA Ch6: Replication |
| Partitioning (Sharding) | How do you split your data across machines? The data model determines the partition key: document ID for document stores, primary key for relational, vertex ID for graphs. | DDIA Ch7: Partitioning |
| Transactions | How do you make multi-step operations atomic? ACID transactions are native to relational databases. Document stores offer single-document atomicity. Graph databases vary widely. | DDIA Ch8: Transactions |
| Consistency & Consensus | How do replicas agree on the current state of data? The consistency model interacts with the data model: linearizability is easier for key-value stores, harder for complex graphs. | DDIA Ch10: Consistency & Consensus |
| Resource | Why |
|---|---|
| Codd, "A Relational Model of Data for Large Shared Data Banks" (1970) | The paper that started it all. Surprisingly readable. |
| MongoDB Data Modeling Guide | Official guide on when to embed vs reference in document stores. |
| Neo4j Graph Modeling Guidelines | Practical patterns for designing graph schemas. |
| Kleppmann, Designing Data-Intensive Applications, Ch 2 | The source material for this lesson. Read the full chapter. |
If you remember nothing else from this lesson, remember these five principles:
| # | Principle | One-Line Explanation |
|---|---|---|
| 1 | Start relational | PostgreSQL handles 90% of use cases. Reach for specialized stores only when you hit a specific bottleneck. |
| 2 | Normalize by default | Denormalize only when you have measured a JOIN bottleneck, not as a premature optimization. |
| 3 | Declarative over imperative | SQL and aggregation pipelines let the database optimize. Imperative code hardcodes the strategy. |
| 4 | Graphs for traversal, not storage | Add a graph layer only when traversal queries (friends-of-friends, shortest path) are a primary access pattern. |
| 5 | Schema evolution is inevitable | Design for it from day one: nullable columns, expand-contract migrations, backward-compatible reads. |
For inspiration and pattern-matching in interviews, here is what major companies actually use:
| Company | Primary Store | Specialized Stores |
|---|---|---|
| PostgreSQL (user data, media metadata, comments, likes) | Cassandra (direct messages at scale), Redis (caching), custom graph service (social graph) | |
| Uber | MySQL → migrated to Docstore (custom document store built on top of MySQL) | Redis (caching, geospatial), Cassandra (trip data), Apache Pinot (real-time analytics) |
| Airbnb | MySQL (listings, reservations, user accounts) | Elasticsearch (search), Redis (sessions), Hive (analytics warehouse) |
| Shopify | MySQL (orders, products, merchants) | Redis (caching), Kafka (event streaming), custom key-value store for cart data |
| Espresso (custom document store for member profiles) | Custom graph service (social connections), Pinot (analytics), Venice (derived data serving) |
The pattern is consistent: relational (or relational-inspired) for the transactional core, plus specialized stores for specific access patterns. No company uses "just MongoDB for everything" or "just Neo4j for everything" at scale.
"The limits of my language mean the limits of my world." — Ludwig Wittgenstein
The data model is your language for thinking about data. Choose it wisely.