Designing Data-Intensive Applications — Chapter 3

Data Models & Query Languages

Relational, document, graph — choosing how to structure and query your data.

Prerequisites: Basic SQL + JSON. That's it.
11
Chapters
9+
Simulations
5
Interview Dimensions

Chapter 0: The Problem

You are building LinkedIn. A single user has a name, a profile photo URL, a current job title, a current employer, a list of past jobs (each with a company, title, start date, end date), a list of degrees (each with a school, field, year), a list of skills (each endorsed by other users), and connections to thousands of other users who have the same complex structure.

Now store this. You have three choices, and each one will shape everything about your application — what queries are fast, what queries are slow, how you scale, and what happens when the product evolves.

Choice 1: Relational tables. Split the data into normalized tables. A users table, a jobs table with a foreign key to users, an education table with a foreign key to users, a skills table, and a connections table that is a many-to-many join table linking two user IDs. To display a single profile, you JOIN across 4-5 tables.

Choice 2: A JSON document. Store each user as one self-contained JSON blob. The profile, jobs, education, skills — all nested inside one document. One read to fetch a profile. No joins needed. But if two users both worked at "Google," you have the string "Google" duplicated in both documents. And if Google changes its name? You update every document that mentions it.

Choice 3: A graph. Every entity — user, company, school, skill — is a node. Relationships are edges: WORKS_AT, STUDIED_AT, KNOWS, HAS_SKILL. To find "all people who worked at Google and know Alice," you traverse edges. Natural for highly interconnected data. But simple lookups ("give me Alice's profile") require gathering data from many nodes.

Each choice is a bet on what matters most. If you bet on data integrity and complex queries, go relational. If you bet on read performance and schema flexibility, go document. If you bet on relationship-heavy queries and interconnected data, go graph. Most real systems use a combination — but you must understand each model deeply to know when to reach for it.

This is the most consequential decision in system design. The data model is not just a storage detail. It determines which queries are O(1) and which are O(n). It determines how your schema evolves. It determines whether your team spends Friday nights debugging join performance or data duplication bugs. Every systems design interview starts here: "What data model would you use, and why?"

See the Same Data Three Ways

The simulation below shows a LinkedIn profile stored in all three formats side by side. Notice how the same information looks fundamentally different depending on the model. Click each tab to see the structure, and pay attention to where duplication lives and where references live.

One Profile, Three Models

Click a model to see how the same LinkedIn profile is stored. Watch where data is duplicated vs referenced.

The Trade-off Triangle

No model wins on every dimension. Here is the fundamental triangle of trade-offs:

DimensionRelationalDocumentGraph
Data locality (one read for a profile)Poor (multi-table JOIN)Excellent (one document)Poor (traverse nodes)
Many-to-many relationshipsExcellent (JOIN tables)Poor (duplication or manual refs)Excellent (edges)
Schema flexibilityRigid (ALTER TABLE)Flexible (schema-on-read)Very flexible (add node/edge types)
Complex queries (analytics, aggregation)Excellent (SQL)Limited (no JOINs)Good for traversal, poor for aggregation
Data integrity (no dangling refs)Excellent (foreign keys)None (application enforced)Moderate (depends on DB)

Why This Matters for Your Career

Every systems design interview starts with "How would you model the data?" If you choose relational when the data is hierarchical, you will spend the rest of the interview explaining why your JOINs are slow. If you choose document when the data has many-to-many relationships, you will spend the interview explaining how you handle data consistency. The data model choice constrains every subsequent decision: storage engine, indexing strategy, sharding key, caching layer, consistency model.

More practically: about 60% of production bugs in data-intensive applications trace back to a data model mismatch. Storing a graph in flat tables. Embedding data that should be referenced. Using a document store for transactional data. These are not bugs you can fix with better code — they require redesigning the schema, which means migrating data, which means downtime or a painful multi-week migration project.

The rest of this lesson walks through each model in depth, teaches the query languages that go with them, and builds your intuition for choosing the right model in a system design interview.

Concept check: A user's LinkedIn profile has a list of past jobs. In the relational model, these are stored in a separate jobs table with a foreign key to the users table. If two users both worked at Google, how many times does the string "Google" appear in a well-normalized relational schema?

Chapter 1: The Relational Model

In 1970, Edgar Codd published "A Relational Model of Data for Large Shared Data Banks" while working at IBM. The paper proposed something radical: stop thinking about how data is physically stored on disk (navigational pointers, linked lists, network paths) and instead think of data as relations — flat tables of rows and columns. Let the database figure out the physical layout. You just say what data you want.

This separation of logical model from physical storage is arguably the most important idea in the history of data management. It meant that application code did not break when the DBA reorganized the disk layout. It meant that a query optimizer could choose the best access path automatically. It meant that two different applications could share the same data without coordinating on how to traverse it.

A Brief History

Before Codd, databases used hierarchical models (IBM's IMS, 1966) or network models (CODASYL, 1969). In both, the programmer had to know the physical access path to data — "start at this record, follow this pointer, walk this linked list." If the physical layout changed, every program that touched the database broke. If you wanted a new query that traversed the data differently, you might need to restructure the entire database.

Codd's insight was revolutionary: separate the logical representation (tables, columns, rows) from the physical representation (indexes, page layout, buffer pools). Let the database system figure out how to execute a query efficiently. The programmer specifies what data they want (SQL), not how to navigate to it.

This idea was so powerful that relational databases dominated for 40 years. Oracle, PostgreSQL, MySQL, SQL Server — all relational. It was not until the late 2000s, when web-scale companies hit the limits of single-machine relational databases, that alternatives gained serious traction.

The Core Concepts

A relation is a table. A tuple is a row. An attribute is a column. This vocabulary comes from mathematical set theory — a relation is a set of tuples — but in practice everyone says table, row, column.

The killer feature of the relational model is the JOIN. Data is normalized: instead of duplicating information, you store it once in a dedicated table and reference it by ID from other tables. This reference is a foreign key. When you need the full picture, you JOIN the tables together at query time.

Schema-on-write means you define the schema (columns, types, constraints) before you insert any data. The database rejects data that does not conform. This is like static typing in programming: errors are caught early, at write time, not at read time.

Why normalize? Normalization eliminates duplication. If "Google" changes its name to "Alphabet," you update one row in the companies table. Every job that references company_id=42 automatically shows the new name. In a denormalized system, you would need to find and update every document or row that contains the string "Google." Normalization trades write simplicity (one update) for read complexity (JOINs). This is the central bargain of the relational model.

Building a Schema: The LinkedIn Example

Let us design the relational schema for our LinkedIn profile step by step.

sql
CREATE TABLE users (
    user_id    SERIAL PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    photo_url  TEXT,
    summary    TEXT
);

CREATE TABLE companies (
    company_id SERIAL PRIMARY KEY,
    name       VARCHAR(200) NOT NULL,
    industry   VARCHAR(100)
);

CREATE TABLE jobs (
    job_id     SERIAL PRIMARY KEY,
    user_id    INT REFERENCES users(user_id),
    company_id INT REFERENCES companies(company_id),
    title      VARCHAR(200),
    start_date DATE,
    end_date   DATE   -- NULL means current job
);

CREATE TABLE education (
    edu_id     SERIAL PRIMARY KEY,
    user_id    INT REFERENCES users(user_id),
    school     VARCHAR(200),
    field      VARCHAR(200),
    grad_year  INT
);

CREATE TABLE connections (
    user_a     INT REFERENCES users(user_id),
    user_b     INT REFERENCES users(user_id),
    since      DATE,
    PRIMARY KEY(user_a, user_b)
);

Five tables. No string is duplicated. Every relationship is a foreign key. This is Third Normal Form (3NF): every non-key attribute depends on the whole key and nothing but the key.

Notice the connections table. It has a composite primary key: (user_a, user_b). This is a junction table (also called a join table, bridge table, or associative entity). It represents a many-to-many relationship: a user can connect to many other users, and each user can have many connections. This pattern — a dedicated table that holds only foreign keys — is the standard way to model many-to-many relationships in the relational world.

Also notice the REFERENCES keyword on foreign keys. This is not just documentation — it is a constraint. If you try to insert a job with user_id = 999 and there is no user with that ID, the database rejects the insert. If you try to delete a user who has jobs, the database stops you (or cascades the delete, depending on configuration). This referential integrity guarantee is one of the most powerful features of the relational model: it is impossible to have dangling references or orphaned records.

The JOIN in Action

To assemble a full profile, we join across tables:

sql
SELECT u.name, u.summary,
       j.title, c.name AS company, j.start_date, j.end_date,
       e.school, e.field, e.grad_year
FROM users u
LEFT JOIN jobs j ON u.user_id = j.user_id
LEFT JOIN companies c ON j.company_id = c.company_id
LEFT JOIN education e ON u.user_id = e.user_id
WHERE u.user_id = 42;

This is one SQL query, but the database must do real work: scan the users table (or use an index) to find user 42, then scan jobs to find all jobs for that user, look up each company by ID, and scan education for the user's degrees. The query planner decides the order and method (index scan, hash join, nested loop) automatically.

Interactive: Run a Relational Query

The simulation below shows three relational tables populated with sample data. Click "Run JOIN" to watch the database assemble a full profile by joining across tables. The query plan shows the execution steps.

Relational JOIN: Assembling a Profile

Click Run JOIN to see how the database scans tables, matches foreign keys, and produces the result. Watch the query plan steps.

Worked Example: A Complex Query

Interview question: "Find all users who worked at Google and have a CS degree."

sql
SELECT DISTINCT u.name
FROM users u
JOIN jobs j ON u.user_id = j.user_id
JOIN companies c ON j.company_id = c.company_id
JOIN education e ON u.user_id = e.user_id
WHERE c.name = 'Google'
  AND e.field = 'Computer Science';

Step by step: (1) the database finds company_id for "Google" from the companies table — likely an index lookup, O(log n). (2) It finds all job rows with that company_id — another index lookup. (3) It collects the user_ids from those jobs. (4) It checks which of those users have an education row where field = 'Computer Science'. (5) It looks up the names. The optimizer may reorder these steps for efficiency.

Normalization: From 1NF to 3NF

Normalization is the process of organizing a relational schema to minimize redundancy. There are several normal forms, each building on the previous:

First Normal Form (1NF): Every column contains atomic (indivisible) values. No repeating groups, no arrays. A column "skills" containing "Go, Kubernetes, Python" as a comma-separated string violates 1NF. Instead, you create a separate user_skills table with one row per skill per user.

Second Normal Form (2NF): 1NF plus every non-key column depends on the entire primary key, not just part of it. If your primary key is (student_id, course_id) and you have a column "student_name," that depends only on student_id, not the full key. Move it to a separate students table.

Third Normal Form (3NF): 2NF plus no non-key column depends on another non-key column. If you have "zip_code" and "city" in the same table, city depends on zip_code (not on the primary key directly). Move the zip→city mapping to a separate table.

In practice, most production schemas are in 3NF or close to it. Purists go further (BCNF, 4NF, 5NF), but the returns diminish rapidly. The pragmatic rule: normalize until it hurts (too many JOINs), then selectively denormalize the hot paths.

The query optimizer is the secret weapon. You write what you want (declarative SQL). The database decides how to get it. If there is an index on companies.name, it uses it. If the tables are small enough, it does a sequential scan. If the join is large, it uses a hash join instead of nested loops. This means the same SQL query can automatically get faster when you add indexes — without changing any application code.
Design question: You have a users table with 10 million rows and a jobs table with 50 million rows. To display a user's profile, you JOIN these two tables on user_id. Without any indexes, what is the time complexity of this JOIN?

Chapter 2: The Document Model

In the mid-2000s, web applications started hitting a wall. Social networks, content management systems, and real-time games had data that was naturally hierarchical — a user profile is a tree of nested objects, not a flat table. Splitting this tree across five relational tables and reassembling it with JOINs on every page load felt like unnecessary overhead. What if you just stored the tree as-is?

The document model does exactly that. Each record is a self-contained document, typically in JSON (or BSON, a binary encoding of JSON). A document can contain nested objects, arrays, arrays of objects — any tree structure. Databases like MongoDB, CouchDB, DynamoDB, and Firestore are document stores.

The LinkedIn Profile as a Document

json
{
  "user_id": 42,
  "name": "Alice Chen",
  "photo_url": "/photos/alice.jpg",
  "summary": "Infrastructure engineer",
  "jobs": [
    {
      "title": "Staff Engineer",
      "company": "Google",
      "start": "2020-03",
      "end": null
    },
    {
      "title": "Senior Engineer",
      "company": "Stripe",
      "start": "2017-06",
      "end": "2020-02"
    }
  ],
  "education": [
    {
      "school": "MIT",
      "field": "Computer Science",
      "year": 2017
    }
  ],
  "skills": ["Go", "Kubernetes", "Distributed Systems"]
}

One document. One read. No JOINs. The entire profile is a single JSON blob stored contiguously on disk. This is called data locality — all the data you need for one operation lives in one place.

Schema-on-Read vs Schema-on-Write

In the relational model, the schema is enforced at write time. If you try to insert a row without a required column, the database rejects it. This is schema-on-write: the structure is explicit and enforced.

Document databases typically do not enforce a schema. You can insert any JSON document into a collection, regardless of its structure. The application interprets the structure when it reads the data. This is schema-on-read: the structure is implicit and assumed by the application code.

Think of it like programming languages. Schema-on-write is like static typing (Java, Go): the compiler catches type errors before the code runs. Schema-on-read is like dynamic typing (Python, JavaScript): errors surface at runtime when you access a field that does not exist.

Schema-on-read does not mean "no schema." There is always a schema — it is just implicit in your application code. When your code does user.jobs[0].company, it is assuming the document has a jobs array where each element has a company field. If an old document lacks this structure, your code crashes. The schema exists; it is just not enforced by the database. This is the source of many production bugs in document-store applications.

Advantages of Documents

Data locality. Fetching a user profile is one read, not a multi-table JOIN. For read-heavy workloads that fetch entire entities, this can be 10-100x faster. The document is stored contiguously on disk, so a single disk seek retrieves everything. In a relational database, the user row, job rows, education rows, and company rows may be on different disk pages, requiring multiple seeks. At scale, this difference in I/O patterns is the primary driver of document store adoption.

Flexible schema. Different documents in the same collection can have different fields. Adding a new field to your user profile does not require migrating existing documents — new documents just include the field, old documents do not. Your application code handles both cases.

Natural fit for hierarchical data. A user profile, a blog post with comments, a product catalog — these are naturally tree-shaped. Mapping them to flat relational tables requires artificial decomposition. Documents preserve the natural structure.

Disadvantages of Documents

No joins. If Alice and Bob both worked at Google, the string "Google" appears in both documents. If Google changes its name, you must find and update every document — and there is no single atomic operation to do this. During the update, some documents say "Google" and some say "Alphabet." Worse, there is no foreign key constraint stopping you from misspelling it "Gogle" in one document. The database happily stores it because it does not know that "company" is supposed to reference a canonical entity.

Denormalization breeds inconsistency. Without normalization, the same fact is stored in multiple places. This is fine until the fact changes. Then you have an update anomaly: some documents have the old value, some have the new one.

Poor for many-to-many relationships. If users endorse each other's skills, that is a many-to-many relationship between users and skills. Documents cannot naturally express this without duplicating data or using manual references (storing IDs and resolving them in application code — effectively reinventing JOINs badly).

Interactive: Document Store Operations

Document Store: Insert and Query

Insert documents, then query by nested field. Compare the single-read fetch to the relational JOIN from Chapter 1.

The Object-Relational Mismatch

Most application code is written in object-oriented languages — Python classes, Java objects, Go structs. But relational databases store data in flat tables. Translating between the object model in your application and the relational model in your database is called the impedance mismatch, and it is the single biggest source of complexity in relational applications.

Consider our LinkedIn profile. In Python, it is a natural nested object:

python
class UserProfile:
    name: str
    summary: str
    jobs: List[Job]        # nested list of objects
    education: List[Edu]  # another nested list
    skills: List[str]      # simple list

In the relational model, this single object becomes 4-5 tables. Loading the object requires joining them. Saving the object requires inserting/updating across multiple tables in a transaction. This translation layer — the ORM (Object-Relational Mapping) — is a constant source of bugs, performance problems, and complexity. Libraries like SQLAlchemy (Python), Hibernate (Java), and ActiveRecord (Ruby) exist entirely to manage this mismatch.

The document model eliminates this mismatch entirely. The JSON document IS the object. Load it, deserialize it, done. No ORM needed. This is why document databases gained traction alongside JavaScript (where JSON is native) and dynamic languages.

The ORM trap. ORMs make the impedance mismatch invisible, not absent. Behind the scenes, your ORM is generating multi-table JOINs, managing foreign key relationships, and tracking which fields have changed. The N+1 query problem — where loading N users triggers N separate queries for their jobs — is the most common ORM performance bug. Understanding the underlying data model helps you avoid these traps regardless of which ORM you use.

Embedding vs Referencing in Document Stores

When designing a document schema, the key decision is: embed related data inside the document, or reference it by ID and resolve it in a separate query?

StrategyWhen to useExample
EmbedOne-to-few relationship, data is always fetched together, child data belongs to parentA user's addresses, a post's comments (if few), a product's variants
ReferenceOne-to-many with many, many-to-many, data is shared across entitiesA user's order history (could be thousands), a tag shared across posts

The rule of thumb from the MongoDB documentation: embed unless you have a good reason not to. Good reasons: (1) the embedded array can grow without bound (risk of exceeding the 16MB document size limit), (2) the embedded data is updated independently of the parent, (3) the embedded data is referenced by multiple parents (many-to-many).

Debug scenario: Your MongoDB application stores user profiles as documents. Last month you added a verified field (boolean) to new user sign-ups. A colleague reports that db.users.find({verified: true}) returns only users who signed up after the change, even though you ran a migration script to set verified: true for existing power users. The query returns 500 users but you expected 12,000. What is the most likely cause?

Chapter 3: The Relational vs Document Debate

This is the question that launches a thousand Hacker News threads. The truth is nuanced, and the industry has been converging: PostgreSQL now has excellent JSON support (jsonb columns with GIN indexes), and MongoDB added $lookup (a JOIN operator) in version 3.2. The lines are blurring. But the fundamental trade-offs remain.

When to Use the Document Model

Your data is mostly tree-structured (one-to-many). A blog post has comments. A product has variants. A resume has sections. If each entity is a self-contained tree with little cross-referencing between entities, documents are natural.

You rarely need joins between entities. If your application loads one document at a time (show a product page, display a user profile) and almost never needs to combine data from multiple entities in a single query, the document model's data locality is a pure win.

Your schema evolves rapidly. Early-stage startups often do not know what their data will look like in six months. Schema-on-read lets you iterate without migration headaches. Add fields to new documents; write application code that handles both old and new formats.

The Document Model's Sweet Spot: Content Management

Content management systems (CMS) are the ideal use case for document stores. A blog post is a self-contained document: title, author, body text, tags, comments, metadata. Each post is fetched individually ("show this blog post"). Posts are rarely joined with other posts. The schema varies — a video post has a duration field, a photo post has dimensions, a text post has a word count. A document store handles this naturally. WordPress.com, Contentful, and Strapi all use document-oriented storage internally.

Conversely, the document model's worst case is anything that looks like an accounting ledger: debits must equal credits, account balances must be consistent, transactions must be atomic across multiple accounts. These are inherently relational problems that require JOIN, foreign keys, and ACID transactions.

When to Use the Relational Model

Your data has many-to-many relationships. Users follow users. Orders reference products, which reference categories. Employees belong to departments that belong to divisions. These relationships form a web, not a tree. The relational model's JOINs and foreign keys handle webs naturally.

You need complex joins and aggregations. "What is the average order value by product category for the last quarter?" This is a multi-table join with GROUP BY and date filtering. SQL was designed for exactly this. Document databases can do it, but awkwardly.

Data integrity matters more than flexibility. A banking application cannot tolerate an account with a negative balance because of a missing field. Foreign keys, CHECK constraints, NOT NULL, and transactions enforce correctness at the database level rather than relying on application code.

The Convergence

Modern databases are hybrid:

DatabasePrimary ModelBorrowed Feature
PostgreSQLRelationaljsonb columns with indexing, path queries
MySQL 5.7+RelationalJSON data type with generated columns
MongoDB 3.2+Document$lookup (left outer join), schema validation
DynamoDBDocument/KVGlobal secondary indexes, transactions (2018)
CockroachDBRelationalJSON columns, distributed SQL
The pragmatic answer in an interview. Do not be dogmatic. Say: "I would start with PostgreSQL because it gives me relational JOINs for the interconnected parts of my data (users, orders, payments) and jsonb columns for the tree-shaped parts (product attributes, user preferences). This avoids the operational overhead of running two databases. If we later find that 90% of our queries are single-document lookups and JOINs are rare, we can consider moving to a document store for the hot path." This shows you understand both models and can make pragmatic trade-offs.

Interactive: Decision Tree

The simulation below walks you through a data model decision. Answer questions about your application's data, and see which model the decision tree recommends. Try different paths.

Data Model Decision Tree

Answer the questions to get a recommendation. Click a choice at each branch.

Design Challenge: E-Commerce Platform

You are building an e-commerce platform with four entity types: Products (name, description, price, attributes that vary by category — shirts have size/color, laptops have RAM/storage), Orders (user, items, quantities, shipping address, payment status), Reviews (user, product, text, rating, date), and User Profiles (name, email, addresses, payment methods, order history).

Think about each entity. Which model fits best?

EntityBest ModelWhy
ProductsDocument (or relational + jsonb)Attributes vary by category — a shirt has size/color, a laptop has RAM/storage. A rigid relational schema would need a separate table per category or an ugly EAV (entity-attribute-value) pattern. A JSON document naturally holds variable attributes.
OrdersRelationalOrders reference users and products (many-to-many). Payment status needs transactional updates. Integrity matters — you cannot lose an order. Foreign keys and ACID transactions are essential.
ReviewsEitherA review references a user and a product (relational JOINs useful), but each review is self-contained (document locality useful). PostgreSQL with foreign keys and a jsonb column for metadata is a good hybrid.
User ProfilesDocument (or relational + jsonb)A profile is tree-shaped (addresses array, payment methods array). Rarely joined with other entities beyond order lookups. Good document locality.
The real answer: use one database for multiple patterns. PostgreSQL can serve as both your relational engine (orders, transactions) and your document store (product attributes in jsonb). Running a single database is dramatically simpler to operate than running PostgreSQL + MongoDB + Redis. In system design interviews, operational simplicity is a valid and strong argument.

PostgreSQL jsonb: The Best of Both Worlds

PostgreSQL's jsonb column type deserves special attention because it effectively gives you a document database inside a relational database. You get foreign keys, transactions, and SQL for the structured parts of your data, plus JSON flexibility for the unstructured parts.

sql
-- Relational structure + JSON flexibility
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name       VARCHAR(200) NOT NULL,
    price      DECIMAL(10,2) NOT NULL,
    category   VARCHAR(50) NOT NULL,
    attributes JSONB  -- flexible per-category attributes
);

-- A shirt:
INSERT INTO products (name, price, category, attributes) VALUES
('Classic Tee', 29.99, 'apparel', '{"size": "M", "color": "navy", "material": "cotton"}');

-- A laptop:
INSERT INTO products (name, price, category, attributes) VALUES
('ThinkPad X1', 1499.00, 'electronics', '{"ram_gb": 16, "storage_gb": 512, "screen": "14 inch"}');

-- Query JSON fields with GIN index support:
CREATE INDEX idx_attrs ON products USING GIN(attributes);

SELECT name, price
FROM products
WHERE attributes @> '{"color": "navy"}';  -- uses GIN index!

The @> operator means "contains" — it checks if the JSON value contains the specified key-value pairs. With a GIN index, this is fast even on millions of rows. You get the flexibility of a document store for the variable attributes, plus the integrity and query power of SQL for the structured fields.

You can even combine relational JOINs with JSON queries in a single statement:

sql
-- Find orders for navy products over $50, with buyer info
SELECT u.name, p.name, p.price, o.quantity
FROM orders o
JOIN users u ON o.user_id = u.user_id
JOIN products p ON o.product_id = p.product_id
WHERE p.attributes @> '{"color": "navy"}'
  AND p.price > 50;

This query uses a relational JOIN (orders → users, orders → products) combined with a JSON containment check on the product attributes. No other database model gives you this combination of flexibility and power in a single query.

Design question: You are building a content management system where articles can have tags, and users can bookmark articles. An article can have many tags, a tag can be on many articles, and a user can bookmark many articles. Which data model best handles this core schema and why?

Chapter 4: Query Languages

A data model is only half the story. The other half is how you ask questions of your data. The query language determines what is easy to express, what is hard, and — critically — what the database can automatically optimize.

Imperative vs Declarative

An imperative query tells the computer how to get the answer, step by step. You write a loop, you check conditions, you accumulate results. An declarative query tells the computer what you want, and the computer figures out how to get it.

Here is the same query — "find all users in the Engineering department" — written both ways:

javascript
// IMPERATIVE: you specify the algorithm
function getEngineers(users) {
  const result = [];
  for (let i = 0; i < users.length; i++) {
    if (users[i].dept === "Engineering") {
      result.push(users[i]);
    }
  }
  return result;
}
sql
-- DECLARATIVE: you specify the result
SELECT * FROM users WHERE dept = 'Engineering';

The imperative version hardcodes the strategy: iterate from index 0 to length, check each element. If the data were sorted by department, a binary search would be faster — but the imperative code cannot take advantage of that without being rewritten.

The declarative version says nothing about how to find the rows. The query optimizer chooses: if there is an index on dept, use it. If the table is tiny, do a sequential scan. If the data is partitioned by department, go directly to the Engineering partition. The same SQL automatically adapts to different physical layouts.

Here is another example you already know: CSS. Styling a web page with imperative JavaScript means manually computing element positions, setting pixel values, handling resize events. CSS is declarative: you say display: flex; justify-content: center; and the browser figures out the layout. If the window resizes, the browser recomputes automatically. If a new browser version has a faster layout engine, your CSS automatically benefits. Same principle as SQL.

Why declarative wins. Declarative queries have three enormous advantages: (1) Automatic optimization — the database can improve execution plans as data grows, indexes are added, or the optimizer is upgraded, all without changing your query. (2) Parallelization — the database can split a declarative query across multiple CPU cores or machines because it knows what the query needs but not in what order. Imperative code with mutable variables is much harder to parallelize safely. (3) Abstraction — the physical storage can change (add an index, move to a different disk layout, shard the table) without breaking any queries.

SQL: The Original Declarative Language

SQL (Structured Query Language) was invented in the 1970s at IBM, closely following Codd's relational model. It is declarative: you describe the shape of the result, not the algorithm to produce it.

The core operations:

OperationSQLWhat it does
SelectSELECT cols FROM tableChoose which columns to return (projection)
FilterWHERE conditionKeep only rows matching the condition (selection)
JoinJOIN table ON conditionCombine rows from two tables based on a relationship
GroupGROUP BY colsAggregate rows that share a common value
OrderORDER BY colsSort the result set
LimitLIMIT nReturn only the first n rows

MongoDB Aggregation Pipeline

MongoDB's query language is also declarative, but uses a pipeline of stages instead of SQL clauses. Each stage transforms the data and passes it to the next stage.

javascript
// MongoDB: "find all users in Engineering, group by team, count"
db.users.aggregate([
  { $match: { dept: "Engineering" } },       // WHERE
  { $group: { _id: "$team", count: { $sum: 1 } } }, // GROUP BY
  { $sort: { count: -1 } }                    // ORDER BY
]);

The pipeline is a sequence of stages: $match (filter), $group (aggregate), $sort (order), $project (select columns), $lookup (join), $unwind (flatten arrays). Each stage is declarative — you say what transformation you want, not how to implement it.

MapReduce: The Hybrid

MapReduce is a programming model where you write two functions: a map function that extracts key-value pairs from each document, and a reduce function that combines all values for the same key. The framework handles distributing the work across machines.

javascript
// MapReduce: count users per department
db.users.mapReduce(
  function map() {            // runs on each document
    emit(this.dept, 1);        // emit (key, value) pair
  },
  function reduce(key, values) { // combines values for same key
    return Array.sum(values);
  },
  { out: "dept_counts" }
);

MapReduce is a hybrid: the map and reduce functions are imperative (you write code), but the distribution and scheduling are declarative (the framework handles parallelism). In practice, MongoDB's aggregation pipeline has replaced MapReduce for most use cases because it is fully declarative and easier to optimize.

Why MapReduce Lost to Declarative Pipelines

MapReduce had a fundamental problem: you had to write two functions for every query. To count users per department, you write a map function and a reduce function. To find the top 5 departments, you write another map and reduce. To filter by region first, you chain MapReduce jobs. Each job writes its output to disk and the next job reads it — massive I/O overhead.

MongoDB's aggregation pipeline replaced this with declarative stages that the optimizer can fuse, reorder, and push down. The same query that took 15 lines of MapReduce code becomes 5 lines of pipeline stages. More importantly, the pipeline engine can optimize: it can merge $match stages, push $limit before $sort (top-N optimization), and use indexes for the initial $match. MapReduce cannot do any of this because the user-written functions are opaque to the optimizer.

This same evolution happened in the big data world: Hadoop MapReduce (imperative, write Java map/reduce classes) was replaced by Spark SQL and Hive (declarative, write SQL). The lesson: declarative always wins because it enables automatic optimization.

Interactive: Same Query, Three Languages

One Query, Three Forms

The same query — "count users per department" — expressed in imperative JS, SQL, and MongoDB aggregation. Click each to see how it executes and how declarative forms enable parallelism.

Concept check: A developer writes an imperative Python script to compute the average order value: it loops through all orders, sums the totals, and divides. The same logic is also expressed as SELECT AVG(total) FROM orders; in SQL. The database has just added a new clustered index on the total column. Which version automatically benefits from the new index?

Chapter 5: Graph Data Models

What happens when everything is connected to everything else? Social networks (people know people), road networks (intersections connect to intersections), the web (pages link to pages), fraud detection (accounts, devices, IP addresses, and transactions form a tangled web of relationships). When the relationships between entities are as important as the entities themselves, neither relational tables nor documents work well. You need a graph.

What Is a Graph?

A graph has two ingredients: vertices (also called nodes) and edges (also called relationships). Each vertex represents an entity. Each edge represents a connection between two entities. Both vertices and edges can have properties — key-value pairs that store data about them.

Unlike the relational model, where the schema fixes the types of relationships at design time (foreign keys between specific tables), a graph lets you add any edge between any two vertices at any time. A person vertex can have an edge to a company ("WORKS_AT"), an edge to a school ("STUDIED_AT"), an edge to another person ("KNOWS"), and an edge to a city ("LIVES_IN"). No schema migration required.

Two Graph Models

Property graph (Neo4j, Amazon Neptune, TigerGraph). Each vertex has a unique ID, a set of properties (key-value pairs), and a set of outgoing and incoming edges. Each edge has a unique ID, a label (type), properties, and connects a tail vertex to a head vertex.

Think of it as two relational tables:

sql
-- Property graph as relational tables (conceptual)
CREATE TABLE vertices (
    vertex_id  INT PRIMARY KEY,
    properties JSONB         -- {name: "Alice", age: 30}
);

CREATE TABLE edges (
    edge_id    INT PRIMARY KEY,
    tail_id    INT REFERENCES vertices,
    head_id    INT REFERENCES vertices,
    label      VARCHAR(50),  -- "WORKS_AT", "KNOWS", etc.
    properties JSONB
);

Triple store (RDF, SPARQL, Datomic). All information is stored as (subject, predicate, object) triples. For example: (Alice, works_at, Google), (Alice, age, 30), (Alice, knows, Bob). The subject is a vertex. The predicate is either an edge label (if the object is another vertex) or a property name (if the object is a value).

text
# Triple store format (Turtle/N-Triples syntax)
_:alice  rdf:type     :Person .
_:alice  :name        "Alice Chen" .
_:alice  :works_at    _:google .
_:alice  :knows       _:bob .
_:alice  :age         30 .
_:google rdf:type     :Company .
_:google :name        "Google" .
Property graph vs triple store. They are essentially equivalent in expressive power. Property graphs are more intuitive for most developers (vertices and edges with properties). Triple stores are more natural for semantic web and knowledge graph applications. Neo4j uses the property graph model. Wikidata uses a triple store. In practice, choose based on the query language you want: Cypher (property graph) or SPARQL (triple store).

Why Graphs Shine: Traversal

The killer operation in a graph database is traversal — following edges from vertex to vertex. "Find all people who know Alice" is one hop. "Find all people who know someone who knows Alice" is two hops. "Find the shortest path between Alice and Dave" is a breadth-first search across the graph.

In a relational database, this requires recursive JOINs (a WITH RECURSIVE query in SQL) that become increasingly expensive at each level of depth. In a graph database, traversal is a native operation with consistent performance — each hop is an index lookup on the edge list, typically O(1) per edge.

Interactive: Build and Traverse a Graph

Property Graph: Build and Query

Click to add people and companies. Create edges between them. Then run a graph traversal to find connections.

When to Use a Graph Database

Use CaseWhy Graphs WinExample Query
Social networksFriends-of-friends, mutual connections, influence propagation"Find all people within 3 degrees of Alice"
Fraud detectionSuspicious patterns: shared devices, addresses, payment methods across accounts"Find all accounts that share a device with a known fraudulent account"
Knowledge graphsSemantic relationships: "Paris is_capital_of France, France is_in Europe""What is the capital of the country that contains Lyon?"
Recommendation enginesCollaborative filtering via graph similarity"Find products bought by people similar to this user"
Network infrastructureRouters, switches, cables — physical connectivity"If this switch fails, which servers lose connectivity?"

Graph Databases in Practice: What They Are Good and Bad At

Graph databases excel at local queries — "starting from this node, find nearby nodes matching a pattern." A 2-hop traversal from a specific user to find friends-of-friends is fast regardless of total graph size, because you only touch the nodes reachable in 2 hops.

Graph databases struggle with global queries — "count all users per country" or "find the average number of connections." These require scanning every node, and graph databases have no columnar storage or vectorized execution. A SQL GROUP BY on a columnar table will be orders of magnitude faster.

They also struggle with write-heavy workloads. Inserting a node is fast. Inserting an edge requires updating the adjacency list of both endpoints and maintaining indexes. High-throughput event ingestion (sensor data, logs, clickstreams) is better served by column-family stores like Cassandra or time-series databases.

The graph database adoption pattern. Companies rarely start with a graph database. They start with PostgreSQL. As the social graph grows and traversal queries become the bottleneck (mutual friends, recommendations, fraud detection), they extract the relationship data into a specialized graph layer. Facebook's TAO, Twitter's FlockDB, LinkedIn's graph service — all started as relational and migrated the social graph to a specialized store while keeping transactional data in relational. This is the pattern you should describe in interviews: "I would start relational, and if we hit traversal performance limits at scale, extract the social graph into a graph database."

How Graph Databases Store Data

Under the hood, graph databases use index-free adjacency: each node physically stores a pointer to its neighbors. When you follow an edge from Alice to Bob, the database does not perform an index lookup — it follows a direct pointer in memory. This means traversal time is proportional to the number of edges traversed, not the total size of the graph.

Compare this to a relational database storing the same graph in an edges table. To find Alice's neighbors, you query SELECT * FROM edges WHERE tail_id = 'alice'. Even with an index, this is O(log n) where n is the total number of edges in the table. In a graph database, it is O(k) where k is Alice's degree (number of edges from Alice) — independent of total graph size.

This O(k) vs O(log n) difference is negligible for small graphs. But for a social network with 2 billion edges, the difference between O(log 2B) ≈ 31 index lookups per hop and O(1) pointer dereference per hop is enormous — especially when traversing multiple hops.

Modeling the Same Data in All Three

To solidify your intuition, here is the same e-commerce data in each model. Notice what is easy and what is awkward in each.

sql
-- RELATIONAL: clean, normalized, JOINs for queries
SELECT p.name, AVG(r.rating), COUNT(r.review_id)
FROM products p
JOIN reviews r ON p.product_id = r.product_id
GROUP BY p.product_id
HAVING COUNT(r.review_id) > 10;
javascript
// DOCUMENT: must use aggregation pipeline (verbose but functional)
db.products.aggregate([
  { $lookup: { from: "reviews", localField: "_id",
              foreignField: "product_id", as: "reviews" }},
  { $addFields: { reviewCount: { $size: "$reviews" },
                   avgRating: { $avg: "$reviews.rating" } }},
  { $match: { reviewCount: { $gt: 10 } }}
]);
cypher
// GRAPH: traversal-based, elegant for relationship queries
MATCH (p:Product)<-[r:REVIEWED_BY]-(u:User)
WITH p, AVG(r.rating) AS avgRating, COUNT(r) AS cnt
WHERE cnt > 10
RETURN p.name, avgRating, cnt;

All three can express this query. But the relational version is most natural for this aggregation-style query. The graph version works but is not playing to its strengths (aggregation is not what graphs are designed for). The document version is the most verbose because $lookup is a bolt-on feature, not a first-class citizen.

Design question: You are building a ride-sharing app. You have users, drivers, trips, locations, and payment methods. Users take trips from location A to location B with a driver, paid via a payment method. Which data model is the best primary choice?

Chapter 6: Graph Query Languages

SQL was designed for rectangular tables. Querying a graph with SQL is possible (recursive CTEs, self-joins on an edges table) but painful. Graph databases have their own query languages designed for pattern matching on nodes and edges.

Cypher (Neo4j)

Cypher is a declarative graph query language created for Neo4j. Its core syntax uses ASCII art to describe graph patterns: parentheses for nodes, arrows for edges, square brackets for edge types.

cypher
// Find all people who work at Google
MATCH (p:Person)-[:WORKS_AT]->(c:Company {name: "Google"})
RETURN p.name;

// Friends-of-friends: people 2 hops from Alice
MATCH (a:Person {name: "Alice"})-[:KNOWS*2]->(fof:Person)
WHERE NOT (a)-[:KNOWS]->(fof)  // exclude direct friends
RETURN DISTINCT fof.name;

// Shortest path between Alice and Dave
MATCH path = shortestPath(
  (a:Person {name: "Alice"})-[:KNOWS*]-(d:Person {name: "Dave"})
)
RETURN path;

// Triangle detection: mutual friends
MATCH (a:Person)-[:KNOWS]-(b:Person)-[:KNOWS]-(c:Person)-[:KNOWS]-(a)
WHERE id(a) < id(b) AND id(b) < id(c)
RETURN a.name, b.name, c.name;

Read the patterns like a sentence: (a:Person)-[:KNOWS]->(b:Person) means "a Person node a that has a KNOWS edge pointing to Person node b." The arrow direction matters. The *2 syntax means "exactly 2 hops." The * syntax means "any number of hops."

SPARQL (Triple Stores / RDF)

SPARQL (SPARQL Protocol and RDF Query Language, a recursive acronym) queries triple stores. It uses a pattern matching syntax similar to Cypher but based on triple patterns.

sparql
# Find all people who work at Google
SELECT ?name WHERE {
  ?person rdf:type :Person .
  ?person :works_at ?company .
  ?company :name "Google" .
  ?person :name ?name .
}

# Find people who know someone who works at Google
SELECT ?name WHERE {
  ?person :knows ?friend .
  ?friend :works_at ?company .
  ?company :name "Google" .
  ?person :name ?name .
}

Each line is a triple pattern: ?person :works_at ?company matches all triples where the predicate is :works_at. Variables start with ?. The query engine finds all variable bindings that satisfy all patterns simultaneously.

Datalog

Datalog is the oldest of the three — a subset of Prolog used for database queries since the 1980s. Its distinctive feature is rules: you define derived facts from base facts, and the engine chains rules together.

datalog
/* Base facts */
works_at(alice, google).
works_at(bob, stripe).
knows(alice, bob).
knows(bob, carol).

/* Rule: X can_reach Y if X knows Y directly */
can_reach(X, Y) :- knows(X, Y).

/* Rule: X can_reach Y if X knows someone who can reach Y */
can_reach(X, Y) :- knows(X, Z), can_reach(Z, Y).

/* Query: who can Alice reach? */
?- can_reach(alice, Who).
/* Result: Who = bob ; Who = carol */

The power of Datalog is recursive rules. The can_reach rule calls itself. This naturally expresses transitive closure — "find everyone reachable through any chain of KNOWS edges" — in just two lines. In SQL, this requires a WITH RECURSIVE CTE, which is more verbose and harder to read.

Choosing a Graph Query Language

In practice, Cypher dominates the graph database world because Neo4j dominates the graph database market. If you learn one graph query language for interviews, learn Cypher. Here is a progression of increasingly complex Cypher patterns you should be able to write on a whiteboard:

PatternCypherUse Case
Direct lookupMATCH (p:Person {name: "Alice"}) RETURN pFind a specific node
One hopMATCH (a)-[:KNOWS]->(b) WHERE a.name = "Alice" RETURN bDirect connections
Multi hopMATCH (a)-[:KNOWS*2]-(b) RETURN bFriends-of-friends
Variable hopMATCH p = (a)-[:KNOWS*1..4]-(b) RETURN pReachability within distance
Shortest pathMATCH p = shortestPath((a)-[*]-(b)) RETURN pMinimum hops between two nodes
Pattern matchMATCH (a)-[:KNOWS]-(b)-[:WORKS_AT]->(c) RETURN a, cCompound relationship queries
AggregationMATCH (p)-[:KNOWS]-(f) RETURN p, COUNT(f) ORDER BY COUNT(f) DESCMost connected people

SQL for Graphs: The Recursive CTE Approach

You CAN query graph-structured data in SQL using recursive CTEs (Common Table Expressions). But it is verbose and the optimizer is not tuned for it. Here is "find all people reachable from Alice" in SQL vs Cypher:

sql
-- SQL: recursive CTE for transitive closure (verbose!)
WITH RECURSIVE reachable AS (
    -- Base case: direct connections from Alice
    SELECT e.head_id AS person_id, 1 AS depth
    FROM edges e
    JOIN vertices v ON e.tail_id = v.vertex_id
    WHERE v.properties->>'name' = 'Alice'
      AND e.label = 'KNOWS'

    UNION

    -- Recursive case: follow KNOWS edges further
    SELECT e.head_id, r.depth + 1
    FROM reachable r
    JOIN edges e ON r.person_id = e.tail_id
    WHERE e.label = 'KNOWS'
      AND r.depth < 5  -- safety limit
)
SELECT DISTINCT v.properties->>'name'
FROM reachable r
JOIN vertices v ON r.person_id = v.vertex_id;
cypher
// Cypher: same query in one line
MATCH (a:Person {name: "Alice"})-[:KNOWS*1..5]-(reachable:Person)
RETURN DISTINCT reachable.name;

The Cypher version is not just shorter — it also performs better on graph-structured data. A graph database stores adjacency lists natively, so following an edge from a node is a pointer dereference (O(1)). In SQL, following an edge requires an index lookup on the edges table for each hop (O(log n) per hop).

Pattern matching is the key insight. All three graph query languages share the same core idea: describe a pattern of nodes and edges, and the engine finds all subgraphs in the database that match that pattern. This is fundamentally different from SQL, which operates on rows and columns. SQL can simulate graph queries (recursive CTEs, self-joins), but the code is verbose and the optimizer is not tuned for graph traversal patterns.

Interactive: Same Query, Three Languages

Graph Query: Visual Pattern Matching

The same query — "find people who worked at Google and know Bob" — in Cypher, SPARQL, and Datalog. Watch the pattern matching highlight nodes and edges in the graph.

Code question: In Cypher, the pattern (a)-[:KNOWS*1..3]->(b) means "a path from a to b following KNOWS edges with a minimum of 1 hop and a maximum of 3 hops." What does (a)-[:KNOWS*]->(b) (without the range) mean?

Chapter 7: Data Model Showcase

Time to put everything together. You are the architect for a social media application. You have six entity types: Users, Posts, Comments, Likes, Follows, and Messages. The simulation below lets you choose a data model for each entity type and see the consequences play out in real time.

This is the big interactive payoff. You will see the schema, try sample queries, and observe how different models handle different access patterns. There is no single right answer — the goal is to understand the trade-offs deeply enough to defend your choices in an interview.

Social Media App: Design Your Data Model

Choose a model for each entity. Watch the schema update, try sample queries, and see performance characteristics. Click an entity, then choose a model.

How to Think About This

For each entity, ask yourself three questions:

1. What are the access patterns? How will this entity be queried most often? A user profile is loaded individually (single-document fetch). A news feed requires aggregating posts from many followed users (fan-out query). Likes are counted in aggregate (COUNT query). Each pattern favors a different model.

2. What are the relationships? Does this entity stand alone (document), reference a few other entities with well-defined foreign keys (relational), or connect to many other entities in unpredictable ways (graph)? Likes are a many-to-many relationship (relational junction table). Follows are a many-to-many relationship that you need to traverse (graph candidate). Messages are a one-to-many within a conversation (document candidate).

3. What consistency guarantees do you need? Can you tolerate a few seconds of stale data (eventual consistency, fine for like counts)? Or does every write need immediate visibility (strong consistency, required for payments)? Relational databases with ACID transactions give you strong consistency. Document stores and graph databases often offer eventual consistency for better performance.

Recommended Design (with Justification)

EntityBest ModelJustification
UsersRelational + jsonbCore entity referenced by everything. Foreign keys needed. Profile data (bio, settings) is tree-shaped, fits jsonb. Need indexes on username, email.
PostsRelational (or Document)Posts reference users (author). Media attachments as a JSON array. Need indexes on user_id and created_at for timeline queries. Document works if posts are mostly fetched individually.
CommentsRelationalComments reference both a post and a user. Threaded comments (parent_comment_id) create a tree — but it is a tree within the relational model (self-referencing foreign key). Need to fetch all comments on a post efficiently.
LikesRelationalA like is a many-to-many relationship between a user and a post. It is just a junction table: (user_id, post_id, created_at). No complex data. Pure relational.
FollowsGraph (or Relational)Follows are a many-to-many relationship between users. Simple follow lookups (who does Alice follow?) are fine in a relational junction table. But "friends of friends," "suggested connections," and "mutual followers" queries are expensive in SQL and natural in a graph. If these features matter, add a graph layer for the social graph.
MessagesDocument (or Relational)Messages are tree-shaped conversations. Each conversation is a self-contained thread. Data locality matters — fetch an entire conversation in one read. But messages also reference users (sender, recipient), so you need some referencing. A hybrid: relational table for message metadata + document for message content.
The real-world answer. Instagram uses PostgreSQL (relational) for most data and a custom graph service for the social graph. Twitter (X) uses a mix of MySQL, Manhattan (a key-value store), and FlockDB (a graph store for the social graph). LinkedIn uses a proprietary graph database for connections and Espresso (a document store) for profiles. The pattern is consistent: relational for transactional data, graph for social connections, document for content-heavy entities. No one uses just one model for everything.

Performance Comparison

Here is how the same queries perform under different model choices:

QueryRelationalDocumentGraph
Get user profile1-3 table JOIN1 readGather from node + neighbors
Get post + comments2 table JOIN1 read (if embedded)Traverse post→comments
News feed (posts from followed users)JOIN follows + posts, sort by dateFan-out: read each followed user's postsTraverse follow edges, collect posts
Mutual friendsSelf-join on follows (expensive)Not practical2-hop pattern match
Analytics: posts per dayGROUP BY + COUNTAggregation pipelinePoor (not designed for aggregation)

Chapter 8: Schema Evolution

Data models are not static. Features change, requirements evolve, and you need to add new fields, remove old ones, and restructure data. How this works depends enormously on your data model.

Relational: ALTER TABLE

In the relational model, schema changes are explicit. You run an ALTER TABLE statement to add a column, change a type, or add a constraint.

sql
-- Add a 'verified' field to all users
ALTER TABLE users ADD COLUMN verified BOOLEAN DEFAULT FALSE;

-- This runs instantly on PostgreSQL (just updates the catalog)
-- But on MySQL < 8.0, this rewrites the entire table (locks it!)

-- Add a NOT NULL column with a default
ALTER TABLE users ADD COLUMN created_at TIMESTAMP DEFAULT NOW();

-- Rename a column
ALTER TABLE users RENAME COLUMN photo_url TO avatar_url;

The key insight: in PostgreSQL, adding a column with a default value is instant (since version 11). The default is stored in the catalog, not written to every row. Old rows return the default when read. This is a form of schema-on-read hiding inside a schema-on-write system.

But some operations are expensive: adding a NOT NULL constraint without a default requires scanning every row to verify no nulls exist. Changing a column type may require rewriting the table. On a table with 100 million rows, this can take hours and lock the table for writes.

Migration strategies for large tables. (1) Expand-contract pattern: add the new column (nullable), backfill data in batches, add the NOT NULL constraint, drop the old column. (2) Dual-write: write to both old and new columns during transition, then switch reads to the new column, then drop the old one. (3) Online schema change tools: pt-online-schema-change (Percona) or gh-ost (GitHub) create a shadow copy of the table, apply the change, then atomically swap. These are real production techniques you should know for interviews.

Document: Just Write New Fields

In a document store, adding a field is trivial: just include it in new documents. Old documents do not have the field, and that is fine — the application code handles both cases.

javascript
// Old document (before the change)
{ "user_id": 42, "name": "Alice", "email": "alice@example.com" }

// New document (after adding 'verified')
{ "user_id": 99, "name": "Bob", "email": "bob@example.com", "verified": true }

// Application code must handle both formats
const isVerified = user.verified || false;  // default for old docs

The advantage: zero downtime, no migration script, no table locks. The disadvantage: every piece of application code that reads the data must handle the absence of the field. This defensive coding adds up across a large codebase.

Graph: Add New Types Freely

Graph databases are the most flexible for schema evolution. You can add new vertex types, new edge types, and new properties to existing vertices and edges without any schema migration at all.

cypher
// Add a 'verified' property to Alice's node
MATCH (a:Person {name: "Alice"})
SET a.verified = true;

// Add an entirely new relationship type
MATCH (a:Person {name: "Alice"}), (s:Skill {name: "Go"})
CREATE (a)-[:ENDORSES]->(s);

// Add a new node type that didn't exist before
CREATE (:Certification {name: "AWS Solutions Architect", level: "Professional"});

The Expand-Contract Pattern in Detail

This is the most important migration technique to know for interviews. It works for all three data models but is most commonly discussed in the relational context.

Phase 1: Expand
Add the new column/field as NULLABLE (no default required). Deploy new application code that writes to BOTH old and new fields. Old code continues to read from old field. No downtime.
Phase 2: Migrate
Run a background job to backfill the new field for all existing rows. Do this in small batches (1000 rows at a time) with sleep intervals to avoid overloading the database. Idempotent: safe to re-run if interrupted.
Phase 3: Transition
Deploy updated application code that reads from the new field. Verify correctness with dual-read (compare old vs new field, log discrepancies). Remove reads from the old field once confident.
Phase 4: Contract
Remove writes to the old field from application code. Drop the old column/field. Add NOT NULL constraint to the new field if needed. Done.

This four-phase approach guarantees zero downtime, backward compatibility at every step, and the ability to roll back at any phase. It is how companies like GitHub, Stripe, and Shopify handle schema migrations on tables with billions of rows.

Why not just use the database's built-in ALTER TABLE? For small tables (under 10 million rows), ALTER TABLE is fine. For tables with hundreds of millions of rows, an ALTER TABLE that rewrites the table can take hours and lock the table for writes the entire time. The expand-contract pattern avoids this by never requiring a blocking DDL operation. The only DDL needed is adding a nullable column (instant in PostgreSQL) and eventually dropping the old column (also instant). All the heavy work is done in application code, which can be deployed without downtime using rolling updates.

Interactive: Schema Migration Simulator

Schema Evolution: Add a 'verified' Field

Watch how adding a new field works in each model. See the migration process, the time it takes, and the failure modes.

Worked Example: Renaming a Field Across All Three Models

Your team decides to rename the photo_url field to avatar_url across all user profiles. Here is how it works in each model:

sql
-- RELATIONAL: one command, affects all rows
ALTER TABLE users RENAME COLUMN photo_url TO avatar_url;
-- PostgreSQL: instant (catalog-only). All queries must update column name.
-- All application code referencing 'photo_url' must be updated simultaneously.
javascript
// DOCUMENT: can't rename in place. Must read + write each document.
db.users.updateMany(
  { photo_url: { $exists: true } },
  [{ $set: { avatar_url: "$photo_url" } },
   { $unset: "photo_url" }]
);
// For 100M documents, this takes time. During migration, some docs
// have photo_url and some have avatar_url. Application code must
// handle BOTH: const url = user.avatar_url || user.photo_url;
cypher
// GRAPH: rename property on all nodes
MATCH (u:User)
WHERE u.photo_url IS NOT NULL
SET u.avatar_url = u.photo_url
REMOVE u.photo_url;
// Similar to document: happens per-node, takes time for large graphs.

The relational version is the cleanest: one atomic DDL command. The document and graph versions require batch updates where the old and new field names coexist during migration, requiring application code to handle both.

Common Migration Failures

ScenarioModelWhat goes wrongHow to fix
ALTER TABLE on 500M-row tableRelationalTable locked for hours, production writes blockedUse gh-ost or pt-online-schema-change for zero-downtime migration
New field missing in old documentsDocumentApplication code throws NPE / undefined accessAlways provide defaults in application code; validate on read
Renamed field in some documentsDocumentQueries on old field name miss updated documentsWrite a one-time migration to rename the field in all existing documents
New edge type lacks indexGraphTraversal queries on new edge type are slowCreate an index on the new edge type before writing data
Debug scenario: You run ALTER TABLE users ADD COLUMN verified BOOLEAN NOT NULL DEFAULT true; on a PostgreSQL 15 table with 200 million rows. The command returns in 5 milliseconds. A colleague on MySQL 5.7 runs the equivalent statement and it takes 45 minutes. Why the difference?

Chapter 9: Interview Arsenal

This chapter distills everything into the formats you will encounter in systems design interviews. Memorize the cheat sheet. Practice the design questions. Understand the debug scenarios deeply enough to walk through them on a whiteboard.

The Data Model Cheat Sheet

ModelBest forWorst forQuery languageSchemaExamples
RelationalMany-to-many, transactions, analyticsHierarchical data, rapid schema changesSQL (declarative)Schema-on-write (strict)PostgreSQL, MySQL, CockroachDB
DocumentHierarchical data, data locality, flexible schemaMany-to-many, cross-document joinsMongoDB QL, DynamoDB PartiQLSchema-on-read (flexible)MongoDB, DynamoDB, CouchDB
GraphTraversal, many-to-many, relationship-heavyAggregation, simple CRUD, analyticsCypher, SPARQL, GremlinSchema-optionalNeo4j, Neptune, TigerGraph
Key-ValueSimple lookups, caching, session storageComplex queries, relationshipsGET/SET APINoneRedis, Memcached, DynamoDB
Column-familyWide rows, time-series, high write throughputComplex queries, ad-hoc analyticsCQL (Cassandra)Flexible columns per rowCassandra, HBase, ScyllaDB

System Design: Design the Data Model for Twitter

Entities: Users, Tweets, Follows, Likes, Retweets, Direct Messages, Hashtags, Timelines.

Core decision: PostgreSQL for the transactional core (users, tweets, likes, retweets — all referential). A graph layer (or Redis-based adjacency list) for the follow graph (friend-of-friend suggestions, mutual follows). A timeline cache (Redis sorted set per user, keyed by tweet_id, scored by timestamp) for the home timeline because the fan-out-on-write vs fan-out-on-read trade-off is the critical design decision.

Fan-out-on-write: When a user tweets, push the tweet to every follower's timeline cache. Pros: fast reads (timeline is pre-computed). Cons: expensive writes for users with millions of followers (a celebrity tweet triggers millions of cache inserts).

Fan-out-on-read: When a user opens their timeline, fetch the latest tweets from all followed users and merge them. Pros: cheap writes. Cons: slow reads (must query N followed users and sort).

Twitter's actual solution: Hybrid. Fan-out-on-write for normal users (most users have < 1000 followers). Fan-out-on-read for celebrities (the top ~1% with millions of followers). A user's timeline is the merge of their pre-computed cache (from normal followed users) and a real-time fetch (from followed celebrities).

This is the most-asked system design question. If you are asked "Design Twitter" in an interview, the data model + fan-out trade-off is the core of the answer. Everything else — timeline service, tweet storage, media handling — follows from this decision. Practice this until you can whiteboard it in 10 minutes. The key numbers: ~400M active users, ~500M tweets/day, 300 average followers per user, median timeline read latency target of < 200ms. These numbers drive the architecture.

System Design: Design the Data Model for Uber

Entities: Riders, Drivers, Trips, Locations (real-time GPS), Payments, Ratings, Surge Pricing Zones.

Core decision: PostgreSQL for the transactional core (riders, drivers, trips, payments — strong consistency needed for money). Redis + geospatial indexes for real-time driver locations (need sub-100ms queries for "find nearest available drivers"). A time-series store or columnar database for trip telemetry (GPS traces at 1 Hz, millions of trips/day — too much for PostgreSQL). Document store or jsonb for rider/driver profiles (flexible attributes).

The location problem is the hard part. Every active driver sends a GPS update every second. With 100,000 active drivers, that is 100K writes/second to the location store. Each ride request needs to query "all drivers within 2 km of this point" with sub-second latency. PostgreSQL with PostGIS can handle this at moderate scale. At Uber's scale, they built a custom geospatial index (Ringpop, later H3-based) backed by in-memory stores.

Interactive: Data Model Quick-Pick Challenge

Quick-Pick: Choose the Right Model

A scenario appears. Click the best data model. See your score and the reasoning.

Coding Drills

Drill 1: Design a normalized relational schema for a library management system (books, authors, borrowers, loans, genres). Include foreign keys and at least one many-to-many relationship.

sql
CREATE TABLE authors (
    author_id SERIAL PRIMARY KEY,
    name      VARCHAR(200) NOT NULL
);

CREATE TABLE genres (
    genre_id SERIAL PRIMARY KEY,
    name     VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE books (
    book_id  SERIAL PRIMARY KEY,
    title    VARCHAR(300) NOT NULL,
    isbn     VARCHAR(13) UNIQUE,
    pub_year INT
);

-- Many-to-many: a book can have multiple authors
CREATE TABLE book_authors (
    book_id   INT REFERENCES books(book_id),
    author_id INT REFERENCES authors(author_id),
    PRIMARY KEY(book_id, author_id)
);

-- Many-to-many: a book can belong to multiple genres
CREATE TABLE book_genres (
    book_id  INT REFERENCES books(book_id),
    genre_id INT REFERENCES genres(genre_id),
    PRIMARY KEY(book_id, genre_id)
);

CREATE TABLE borrowers (
    borrower_id SERIAL PRIMARY KEY,
    name        VARCHAR(200) NOT NULL,
    email       VARCHAR(200) UNIQUE
);

CREATE TABLE loans (
    loan_id     SERIAL PRIMARY KEY,
    book_id     INT REFERENCES books(book_id),
    borrower_id INT REFERENCES borrowers(borrower_id),
    borrowed_at TIMESTAMP DEFAULT NOW(),
    returned_at TIMESTAMP  -- NULL = still borrowed
);

Drill 2: Write a MongoDB aggregation pipeline for "find the top 5 most-reviewed products with average rating above 4.0."

javascript
db.reviews.aggregate([
  // Group reviews by product_id, compute count and average
  { $group: {
      _id: "$product_id",
      avg_rating: { $avg: "$rating" },
      review_count: { $sum: 1 }
  }},
  // Filter: average rating above 4.0
  { $match: { avg_rating: { $gt: 4.0 } } },
  // Sort by review count descending
  { $sort: { review_count: -1 } },
  // Take top 5
  { $limit: 5 },
  // Optional: look up product details
  { $lookup: {
      from: "products",
      localField: "_id",
      foreignField: "_id",
      as: "product"
  }},
  { $unwind: "$product" }
]);

Drill 3: Write a Cypher query for "find all people within 2 degrees of Alice who have the skill 'Kubernetes' and work at a company founded after 2010."

cypher
MATCH (alice:Person {name: "Alice"})-[:KNOWS*1..2]-(person:Person)
WHERE person <> alice
MATCH (person)-[:HAS_SKILL]->(s:Skill {name: "Kubernetes"})
MATCH (person)-[:WORKS_AT]->(c:Company)
WHERE c.founded > 2010
RETURN DISTINCT person.name, c.name AS company, c.founded;

Debug Scenarios

Scenario 1: Your PostgreSQL JOIN query for the news feed (posts from followed users) is taking 8 seconds. The follows table has 500M rows and the posts table has 2B rows. What do you check?

(1) Check indexes: do you have indexes on follows.follower_id and posts.author_id? Without them, the JOIN is a sequential scan. (2) Check the query plan with EXPLAIN ANALYZE: is it doing a nested loop or a hash join? A hash join on 500M rows needs massive memory. (3) Consider materialized views: pre-compute each user's feed as a materialized view, refresh it periodically. (4) Consider a different architecture: this is the fan-out problem. Pre-compute timelines in a cache (Redis sorted set) on write, not on read.

System Design: Design the Data Model for a Messaging App (WhatsApp)

Entities: Users, Conversations (1:1 and group), Messages, Media, Read Receipts, Typing Indicators.

Core challenge: Messages are append-only, extremely high throughput (100B+ messages/day for WhatsApp), and the primary access pattern is "get the last N messages in a conversation." This is a time-series access pattern.

Design: Conversations metadata in PostgreSQL (who is in the conversation, created_at, last_message_at). Messages in a column-family store like Cassandra or a log-structured store, partitioned by conversation_id, sorted by timestamp. This gives you O(1) access to "last N messages in conversation X" — just read the tail of the partition. Media (images, videos) as blobs in object storage (S3), referenced by URL in the message. Read receipts as a small relational table or Redis hash: (conversation_id, user_id) → last_read_message_id.

Why not a document store? A single conversation document would grow without bound as messages accumulate. A popular group chat could have millions of messages, exceeding MongoDB's 16MB document limit. You need to paginate messages, which means they must be stored as individual records, not embedded arrays.

Why not a relational database for messages? At WhatsApp's scale (100B+ messages/day), a single PostgreSQL instance cannot handle the write throughput. Cassandra's leaderless replication and tunable consistency make it ideal: write to any node, read from one node (CL=ONE), partition by conversation_id for data locality.

Key schema decision: partition key. In Cassandra, the partition key determines which node stores the data. For messages, the partition key is conversation_id. This means all messages in a conversation live on the same node, so "get the last 50 messages in this conversation" is a single-node read. But "get all messages sent by user X across all conversations" requires a scatter-gather query across all nodes — this is expensive and should be avoided as a primary access pattern. The partition key choice reflects the dominant access pattern: conversations, not users.

This is a general principle of data modeling in distributed systems: the partition key should match your most common query pattern. If most queries are "by user," partition by user_id. If most queries are "by conversation," partition by conversation_id. If you need both, you might need two copies of the data with different partition keys (denormalization at the storage level).

The Five Interview Dimensions Applied

DimensionExample QuestionWhat They Test
Concept"Explain the impedance mismatch between OO code and relational databases"Understanding of why document stores gained traction
Design"Design the data model for a social network with 1B users"Trade-off reasoning, polyglot persistence, scaling awareness
Code"Write a SQL query to find mutual friends of two users"SQL fluency, understanding of self-joins
Debug"Our news feed query takes 8 seconds, what do you investigate?"Systematic diagnosis: indexes, query plan, architecture
Frontier"What is NewSQL and how does it relate to the relational/NoSQL debate?"Awareness of CockroachDB, Spanner, TiDB — relational model with horizontal scaling

Scenario 2: Your MongoDB $lookup (join) is timing out. The orders collection has 100M documents and you are joining with products (1M documents). What do you check?

(1) Check indexes: $lookup requires an index on the foreignField in the joined collection. Without it, every lookup is a collection scan. (2) Check if you can denormalize: embed product info directly in the order document to eliminate the join. (3) Consider pipeline optimization: add a $match stage before the $lookup to reduce the number of documents being joined. (4) Consider whether MongoDB is the right tool: frequent joins across collections suggest the data is relational, not document-shaped.

Scenario 3: Your Neo4j graph database is fast for 2-hop traversals (< 5ms) but a "find all shortest paths between any two users" query takes 30 seconds. Why?

(1) "All shortest paths" is a global graph algorithm (Dijkstra/BFS from every node), not a local traversal. Graph databases are optimized for local queries starting from a specific node. (2) Check if you actually need ALL shortest paths or just the shortest path between two specific users (much cheaper). (3) For global graph analytics (PageRank, centrality, community detection), use a graph processing framework like Apache Spark GraphX or Neo4j Graph Data Science library — these batch-process the entire graph in memory, not through the query engine. (4) Consider pre-computing: materialize shortest paths as edges for frequently queried pairs.

SQL Fluency: Self-JOIN for Mutual Friends

This is a classic interview coding question. "Find all mutual friends of user A (id=1) and user B (id=2)."

sql
-- Mutual friends: people who are friends with BOTH user 1 AND user 2
SELECT f1.friend_id
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.friend_id
WHERE f1.user_id = 1
  AND f2.user_id = 2;

This is a self-join: the friendships table is joined with itself. Row f1 represents "user 1's friends" and row f2 represents "user 2's friends." The JOIN condition f1.friend_id = f2.friend_id finds IDs that appear in both sets. Without an index on (user_id, friend_id), this is a nested loop over potentially millions of rows. With a composite index, each side is an efficient range scan.

In Cypher, the same query is one line: MATCH (a)-[:FRIENDS]-(mutual)-[:FRIENDS]-(b) WHERE a.id=1 AND b.id=2 RETURN mutual. The visual pattern-matching syntax makes graph queries far more readable for relationship-heavy operations.

Rapid-Fire Interview Questions

Practice answering these in 30 seconds each. The goal is crisp, structured answers that demonstrate understanding of trade-offs.

QuestionKey Points for Your Answer
"When would you choose MongoDB over PostgreSQL?"Hierarchical data (CMS, product catalogs), rapidly evolving schema, data locality for read-heavy workloads, when JOINs are rare. Not for: many-to-many relationships, complex analytics, strong integrity requirements.
"What is the N+1 query problem?"ORM loads parent (1 query), then lazy-loads each child individually (N queries). Fix: eager load with JOIN, or use a batch loader (DataLoader pattern). Symptom: page load time scales linearly with number of items.
"Explain eventual consistency in document stores"After a write, not all replicas are updated immediately. A subsequent read might hit a stale replica. Trade-off: lower latency and higher availability in exchange for potentially reading stale data. Tunable in most systems (read concern levels).
"How does sharding work differently for documents vs relational?"Documents: shard by document ID or a field within the document. Self-contained, so single-document queries stay within one shard. Relational: shard by primary key, but JOINs across shards are expensive (cross-shard queries). This is why document stores scale horizontally more easily for simple queries.
"What is polyglot persistence?"Using different databases for different parts of your system based on access patterns. Example: PostgreSQL for orders, Redis for sessions, Elasticsearch for full-text search, Neo4j for social graph. Trade-off: operational complexity of running multiple databases vs performance of each being optimized for its use case.
Design question: You are designing the data model for a food delivery app (Uber Eats / DoorDash). The core entities are: restaurants, menus (items, prices, availability), customers, orders, drivers, and real-time driver locations. Which combination of data models would you use and why?

Chapter 10: Connections

Data models are the foundation. Every other chapter in DDIA builds on the choices you make here.

The NewSQL Movement

The 2010s saw the rise of NoSQL ("Not Only SQL") — document stores, key-value stores, graph databases, column-family stores. The pitch: horizontal scaling, flexible schemas, no JOINs needed. The reality: developers missed SQL, transactions, and referential integrity. Many NoSQL projects ended up reimplementing these features poorly in application code.

The response was NewSQL: databases that provide the familiar SQL interface, ACID transactions, and relational model, but with the horizontal scaling of NoSQL. The key examples:

DatabaseCompanyKey Innovation
Google SpannerGoogleGlobally distributed, externally consistent transactions using TrueTime (atomic clocks + GPS)
CockroachDBCockroach LabsOpen-source Spanner-inspired, Raft consensus, PostgreSQL-compatible wire protocol
TiDBPingCAPMySQL-compatible, horizontal scaling, Raft-based replication
YugabyteDBYugabytePostgreSQL-compatible, Raft-based, designed for cloud-native deployment
PlanetScalePlanetScaleMySQL-compatible, Vitess-based sharding, branchable schemas

The NewSQL movement validates Codd's original insight: the relational model is not the bottleneck. The limitation was always in the single-machine implementation. With distributed consensus and sharding built into the storage engine, you can have SQL, transactions, and horizontal scaling.

The interview punchline. When asked "SQL or NoSQL?", the staff-level answer is: "The relational model is almost always the right starting point. SQL databases have absorbed the best ideas from NoSQL (JSON columns, flexible schemas, horizontal scaling via NewSQL). I would only reach for a specialized store when the access pattern is genuinely different from what SQL handles well — high-throughput time-series, graph traversals, or full-text search."

Where This Leads

Next TopicConnectionLesson
Storage & RetrievalOnce you have chosen a data model, how does the database actually store it on disk? B-trees for relational indexes, LSM-trees for write-heavy workloads, column stores for analytics.DDIA Ch4: Storage & Retrieval
ReplicationHow do you copy your data across multiple machines for fault tolerance? The data model affects replication: document stores replicate whole documents, relational databases replicate rows/pages.DDIA Ch6: Replication
Partitioning (Sharding)How do you split your data across machines? The data model determines the partition key: document ID for document stores, primary key for relational, vertex ID for graphs.DDIA Ch7: Partitioning
TransactionsHow do you make multi-step operations atomic? ACID transactions are native to relational databases. Document stores offer single-document atomicity. Graph databases vary widely.DDIA Ch8: Transactions
Consistency & ConsensusHow do replicas agree on the current state of data? The consistency model interacts with the data model: linearizability is easier for key-value stores, harder for complex graphs.DDIA Ch10: Consistency & Consensus

The Hierarchy of Data Decisions

1. Data Model
Relational, Document, Graph — determines how you think about your data (THIS LESSON)
2. Storage Engine
B-tree, LSM-tree, column store — determines how data lives on disk (Ch4)
3. Replication
Leader-follower, multi-leader, leaderless — determines fault tolerance (Ch6)
4. Partitioning
Key range, hash, composite — determines horizontal scaling (Ch7)
5. Transactions
ACID, BASE, MVCC — determines consistency guarantees (Ch8)
6. Consensus
Raft, Paxos, ZAB — determines how distributed nodes agree (Ch10)
The data model is the foundation. Every decision that follows — storage engine, replication strategy, partitioning scheme, transaction model — is constrained by the data model you chose at the beginning. Get this right, and the rest falls into place. Get it wrong, and you are fighting the system at every layer.

Recommended Reading

ResourceWhy
Codd, "A Relational Model of Data for Large Shared Data Banks" (1970)The paper that started it all. Surprisingly readable.
MongoDB Data Modeling GuideOfficial guide on when to embed vs reference in document stores.
Neo4j Graph Modeling GuidelinesPractical patterns for designing graph schemas.
Kleppmann, Designing Data-Intensive Applications, Ch 2The source material for this lesson. Read the full chapter.

Key Takeaways for Your Next Interview

If you remember nothing else from this lesson, remember these five principles:

#PrincipleOne-Line Explanation
1Start relationalPostgreSQL handles 90% of use cases. Reach for specialized stores only when you hit a specific bottleneck.
2Normalize by defaultDenormalize only when you have measured a JOIN bottleneck, not as a premature optimization.
3Declarative over imperativeSQL and aggregation pipelines let the database optimize. Imperative code hardcodes the strategy.
4Graphs for traversal, not storageAdd a graph layer only when traversal queries (friends-of-friends, shortest path) are a primary access pattern.
5Schema evolution is inevitableDesign for it from day one: nullable columns, expand-contract migrations, backward-compatible reads.

Real-World Data Model Choices

For inspiration and pattern-matching in interviews, here is what major companies actually use:

CompanyPrimary StoreSpecialized Stores
InstagramPostgreSQL (user data, media metadata, comments, likes)Cassandra (direct messages at scale), Redis (caching), custom graph service (social graph)
UberMySQL → migrated to Docstore (custom document store built on top of MySQL)Redis (caching, geospatial), Cassandra (trip data), Apache Pinot (real-time analytics)
AirbnbMySQL (listings, reservations, user accounts)Elasticsearch (search), Redis (sessions), Hive (analytics warehouse)
ShopifyMySQL (orders, products, merchants)Redis (caching), Kafka (event streaming), custom key-value store for cart data
LinkedInEspresso (custom document store for member profiles)Custom graph service (social connections), Pinot (analytics), Venice (derived data serving)

The pattern is consistent: relational (or relational-inspired) for the transactional core, plus specialized stores for specific access patterns. No company uses "just MongoDB for everything" or "just Neo4j for everything" at scale.

"The limits of my language mean the limits of my world." — Ludwig Wittgenstein

The data model is your language for thinking about data. Choose it wisely.