CS224W Lecture 12

Relational Deep Learning

Your company's data lives in a relational database — 50 tables, thousands of foreign keys. Machine learning needs a feature vector. But what if the database is the graph, and GNNs can learn from it directly?

Prerequisites: What a database table is + basic GNN intuition. That's it.
10 chapters · 5+ simulations · 0 assumed knowledge

Chapter 0: Relational Data as Graphs

An e-commerce company has a database. Table: Customers. Table: Orders. Table: Products. Table: Reviews. These tables are connected by foreign keys — an Order row has a CustomerID that points to a Customer row, and a ProductID that points to a Product row. This is how all relational databases work.

Here's the insight that changes everything: a relational database with foreign keys is a heterogeneous graph. Every row is a node. Every foreign key reference from one row to another is an edge. The table name is the node type, and the foreign key column linking two tables is the edge type. You've been storing a graph in your database this whole time; you just never called it that.

Tables → node types. Rows → nodes. Foreign keys → edges. A row in the Orders table with CustomerID=42 and ProductID=7 creates two edges: Order→Customer and Order→Product. The entire database schema defines the graph's edge types. The data fills in the nodes and their attributes.
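
The mapping above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the toy `tables` and `foreign_keys` dictionaries and the `build_hetero_graph` helper are hypothetical names introduced here, and every table is assumed to have an `id` primary key.

```python
from collections import defaultdict

# Hypothetical toy tables: each row is a dict, "id" is the primary key.
tables = {
    "customer": [{"id": 42, "country": "DE"}],
    "product": [{"id": 7, "price": 19.99}],
    "order": [{"id": 1, "customer_id": 42, "product_id": 7, "amount": 19.99}],
}

# Hypothetical schema description: (table, fk_column) -> referenced table.
foreign_keys = {
    ("order", "customer_id"): "customer",
    ("order", "product_id"): "product",
}


def build_hetero_graph(tables, foreign_keys):
    """Return (nodes, edges): nodes keyed by (table, pk), edges grouped by edge type."""
    nodes = {}
    for table, rows in tables.items():
        for row in rows:
            nodes[(table, row["id"])] = row  # row attributes become node features

    edges = defaultdict(list)
    for (src_table, fk_col), dst_table in foreign_keys.items():
        edge_type = (src_table, fk_col, dst_table)
        for row in tables[src_table]:
            if row.get(fk_col) is not None:  # a NULL foreign key creates no edge
                src = (src_table, row["id"])
                dst = (dst_table, row[fk_col])
                edges[edge_type].append((src, dst))
    return nodes, dict(edges)


nodes, edges = build_hetero_graph(tables, foreign_keys)
```

The single Orders row produces exactly two edges, one per foreign key column, matching the Order→Customer and Order→Product example in the text.
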
Relational Database → Heterogeneous Graph

A small e-commerce schema. Left: the database tables with foreign key arrows. Right: the same data as a heterogeneous graph. Click a table to highlight its rows as graph nodes.


What Makes It "Heterogeneous"

A homogeneous graph has one type of node and one type of edge. A heterogeneous graph has multiple node types (Customer, Product, Order, Review) and multiple edge types (customer→order, order→product, customer→review). Each node type has its own feature space — a Customer has name, age, country; a Product has title, price, category. GNNs need to handle this heterogeneity explicitly.

This is not a niche scenario. Virtually every company's core data lives in a relational database. Healthcare: Patients, Diagnoses, Prescriptions, Doctors. Finance: Accounts, Transactions, Merchants, Fraud flags. Social media: Users, Posts, Comments, Follows. They're all relational databases. They're all heterogeneous graphs.

In the database-as-graph view, what does a foreign key relationship become?

Chapter 1: Why This Matters — The Feature Engineering Tax

You want to predict: which customers will churn in the next 30 days? You have a relational database. A standard ML workflow — the one that's been used for decades — starts with a painful step: feature engineering.

Feature engineering means manually converting your relational data into a flat feature vector per entity. For each customer: count their total orders, compute average order value, find their most recent purchase date, count their reviews, compute average review rating, find their most frequently purchased product category... and so on. This process takes weeks, requires domain expertise, and produces features that may or may not capture the right signals.

Feature engineering is the bottleneck in tabular ML. Practitioner surveys have long reported that data scientists spend the majority of their time preparing data and engineering features before any model is trained. The work is manual, slow, domain-specific, and doesn't transfer between tasks.

The RDL Promise

Relational Deep Learning (RDL) proposes to eliminate manual feature engineering by letting the GNN learn directly from the graph structure. Instead of manually aggregating "how many orders did this customer place in the last 90 days?", the GNN propagates information from Order nodes to Customer nodes automatically, learning the right aggregation function end-to-end from the task labels.

Traditional pipeline:
  1. Domain expert inspects schema
  2. Manually define features (weeks)
  3. SQL queries to compute features
  4. Flat feature table → XGBoost
RDL pipeline:
  1. Parse database schema automatically
  2. Construct heterogeneous graph
  3. Run GNN — features learned end-to-end
  4. Predict task labels directly

The GNN effectively learns to do feature engineering automatically. It discovers that aggregating 90-day orders matters for churn, and 180-day purchase history matters for lifetime value prediction — without a human specifying either. This is not magic: it's the inductive bias of graph structure doing the work that engineers used to do by hand.

What is the key problem that Relational Deep Learning aims to solve?

Chapter 2: Entity-Level Tasks — Predicting Row Properties

What can you actually predict with RDL? Almost anything you'd want to know about rows in a database table. These are called entity-level tasks — tasks where the prediction target is a property of a specific row (entity) in the database.

Task Taxonomy

| Task | Input | Prediction | Example |
| --- | --- | --- | --- |
| Node classification | Customer node | Class label | Will this customer churn? (yes/no) |
| Node regression | Customer node | Real value | What is this customer's 12-month LTV? |
| Link prediction | Customer + Item | Probability | Will this customer buy this product? |
| Time-to-event | Customer node | Duration | When will this customer next make a purchase? |

Notice that recommendation (link prediction between customers and products) is a special case of entity-level tasks in the RDL framework. Churn prediction, fraud detection, LTV estimation — these are all node-level regression or classification on rows in a relational database. RDL provides one unified framework for all of them.

Multi-Hop Information

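The power of treating the database as a graph shows up when you consider what information is relevant to a prediction. To predict customer churn, you need not just the customer's direct features (age, country) but also information that sits one or more foreign keys away: the orders they placed, the products in those orders, and the reviews they wrote.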

Each hop corresponds to a JOIN in SQL. A 3-layer GNN gathers roughly what a chain of three JOINs followed by an aggregation would, but the aggregation is learned, not specified. Nobody tells the GNN which multi-hop aggregates to compute; it discovers useful ones through gradient descent on the task.
python
# Traditional approach: manual feature engineering.
# `db` is assumed to be an open database connection; the SQL is PostgreSQL-flavored.
customer_features = db.execute("""
    SELECT c.id, c.age, c.country,
           COUNT(DISTINCT o.id) AS num_orders,      -- DISTINCT: the two joins fan out
           AVG(o.amount)        AS avg_order_value,
           MAX(o.timestamp)     AS last_order_date,
           COUNT(DISTINCT r.id) AS num_reviews
    FROM customers c
    LEFT JOIN orders o
           ON o.customer_id = c.id
          AND o.timestamp > NOW() - INTERVAL '90 days'  -- filter in ON keeps customers with no recent orders
    LEFT JOIN reviews r ON r.customer_id = c.id
    GROUP BY c.id, c.age, c.country
""")
# ^ This took a DBA 2 weeks to write and validate

# RDL approach: let the GNN learn the aggregations.
# build_graph_from_db, gnn, and classifier are placeholders, not a real library API.
graph = build_graph_from_db(db)          # automatic
embeddings = gnn(graph)                  # learns what to aggregate
predictions = classifier(embeddings['customer'])  # end-to-end
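
To make the "one hop is one JOIN" correspondence concrete, here is a small sketch using Python's built-in `sqlite3`. The two-table schema and its values are invented for illustration. The SQL query computes a hand-picked aggregate (`AVG`); the graph view computes the same number by sending each order's feature along its order→customer edge and mean-pooling at the customer. A real GNN layer would additionally apply learned weights, which this sketch deliberately leaves out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 40.0), (12, 2, 5.0);
""")

# SQL view: one JOIN plus a hand-picked aggregation (AVG).
sql_avg = dict(conn.execute("""
    SELECT c.id, AVG(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
"""))

# Graph view: each order row sends its feature along the order -> customer edge,
# and the customer node mean-pools the incoming messages.
order_rows = conn.execute("SELECT customer_id, amount FROM orders").fetchall()
messages = {}
for customer_id, amount in order_rows:
    messages.setdefault(customer_id, []).append(amount)
gnn_mean = {cid: sum(vals) / len(vals) for cid, vals in messages.items()}
```

Both dictionaries hold the same per-customer averages. The difference in RDL is that the pooling function and the transformation applied to each message are trained, not chosen by hand.
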
Why is a GNN with L layers on a relational graph analogous to L-table JOINs in SQL?

Chapter 3: Temporal Aspects — Preventing Future Leakage

Relational data is almost always timestamped. An order has a timestamp. A review has a timestamp. A login event has a timestamp. This temporal structure creates both an opportunity and a danger.

The opportunity: you can model time-evolving relationships. A customer's behavior in January might predict their churn in March. A product's review history might predict its future sales.

The danger: temporal leakage. If you use data from after the prediction time to make that prediction, you've given the model information it couldn't possibly have in production. This inflates metrics dramatically and produces models that fail in deployment.

The golden rule: when predicting a label at time T, you may only use data with timestamp ≤ T. Any data after T is future information and must be excluded from the graph at prediction time. Violating this rule causes data leakage, one of the most common mistakes in time-series ML.

Train / Validation / Test Splits

In relational ML, splits must be temporal. You cannot randomly shuffle rows and split 80/10/10 — that would let the model see future interactions during training.

Training Edges
All interactions before time T1
↓ temporal gap (no data)
Validation Predictions
Labels between T1 and T2
↓ temporal gap
Test Predictions
Labels after T2

The temporal gap between splits is important. Without it, information from the end of training bleeds into the validation period, making validation scores overoptimistic. Real deployments have this gap; your evaluation should too.
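
A temporal split with a gap might look like the sketch below. The `temporal_split` helper and the row layout `(entity, label_time, label)` are assumptions made for illustration, as is the choice to place the gap immediately after each boundary; the lecture's diagram does not pin down where exactly the gap sits.

```python
from datetime import date, timedelta


def temporal_split(label_rows, t1, t2, gap=timedelta(days=0)):
    """Assign each (entity, label_time, label) row to train/val/test by time.

    Rows whose label_time falls inside the gap right after a boundary are dropped.
    """
    splits = {"train": [], "val": [], "test": []}
    for row in label_rows:
        ts = row[1]
        if ts < t1:
            splits["train"].append(row)
        elif t1 + gap <= ts < t2:
            splits["val"].append(row)
        elif ts >= t2 + gap:
            splits["test"].append(row)
        # anything else sits in a gap window and is discarded
    return splits


rows = [
    ("a", date(2024, 1, 10), 0),
    ("b", date(2024, 2, 1), 1),   # falls in the gap after T1
    ("c", date(2024, 2, 20), 0),
    ("d", date(2024, 3, 2), 1),   # falls in the gap after T2
    ("e", date(2024, 3, 15), 0),
]
splits = temporal_split(rows, t1=date(2024, 2, 1), t2=date(2024, 3, 1),
                        gap=timedelta(days=7))
```

No row is ever assigned by random shuffling: membership in a split depends only on where the label's timestamp falls relative to T1 and T2.
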

Materialization at Prediction Time

For each prediction (e.g., "will customer X churn between March 1 and March 31?"), you materialize a subgraph of the relational graph that includes only data with timestamps before March 1. This subgraph is what the GNN sees. March 1 is the cutoff time.
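
As a rough sketch of materialization for the March example: the input graph is built only from rows before the cutoff, while the label is computed from the window after it. The toy `orders` rows, the helper names, and the definition of churn as "no order inside the window" are all assumptions for illustration.

```python
from datetime import date

# Hypothetical toy table; every event row carries a "ts" timestamp column.
orders = [
    {"id": 1, "customer_id": 42, "ts": date(2024, 2, 10)},
    {"id": 2, "customer_id": 42, "ts": date(2024, 3, 5)},   # after the cutoff
    {"id": 3, "customer_id": 43, "ts": date(2024, 1, 20)},
]


def materialize(rows, cutoff):
    """Keep only rows strictly before the cutoff; later rows never enter the graph."""
    return [r for r in rows if r["ts"] < cutoff]


def churn_label(rows, customer_id, cutoff, window_end):
    """Label is 1 if the customer places no order in [cutoff, window_end]."""
    active = any(
        r["customer_id"] == customer_id and cutoff <= r["ts"] <= window_end
        for r in rows
    )
    return int(not active)


cutoff = date(2024, 3, 1)
visible_orders = materialize(orders, cutoff)          # what the GNN may see
label_42 = churn_label(orders, 42, cutoff, date(2024, 3, 31))
label_43 = churn_label(orders, 43, cutoff, date(2024, 3, 31))
```

The March 5 order determines customer 42's label but is absent from `visible_orders`, which is the separation between inputs and targets that the cutoff enforces.
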

Temporal Cutoff Visualization

Drag the cutoff time slider to see how the available interaction graph changes. Nodes and edges after the cutoff are hidden from the GNN — preventing leakage.

You're predicting whether a customer will churn in April. Which data is it safe to include in the GNN's input graph?

Chapter 4: The RDL Pipeline

Let's trace the full journey from a relational database schema to a GNN prediction. This is the complete Relational Deep Learning pipeline, step by step. Click each stage to expand it and see what data flows through.

Full RDL Pipeline — Interactive Walkthrough

Click "Step Forward" to advance through the pipeline stages. Watch how a customer churn prediction task materializes from raw database tables into a GNN prediction.

Step-by-Step Breakdown

1. Schema Parsing
Read table definitions + foreign keys. Each table becomes a node type. Each FK becomes an edge type. Column data types → initial feature encoders.
2. Temporal Materialization
For each prediction (entity, cutoff_time), select only rows with timestamp ≤ cutoff. Build the "past" subgraph. This prevents future leakage.
3. Feature Encoding
Encode each row's columns: numeric → normalize, categorical → embedding, text → pre-trained LM, timestamp → sinusoidal. Stack into node feature vectors.
4. Heterogeneous GNN
Run message passing respecting node types. Different weight matrices per (src_type, edge_type, dst_type). L layers = L JOIN depth of information aggregation.
5. Task Head + Loss
Take the target entity's embedding. Pass through MLP. Output prediction. Compute BCE (classification) or MSE (regression) loss. Backpropagate.
python
# RelBench-style RDL pipeline (simplified).
# NOTE: Dataset, Task, and make_graph sketch the flow; they may not match
# RelBench's actual API, so consult its documentation for real loader names.
import torch
import torch.nn.functional as F
from relbench import Dataset, Task
from torch_geometric.nn import HeteroConv, SAGEConv

# Step 1: Load database + define task
dataset = Dataset("ecommerce")         # relational database
task = Task("customer-churn", dataset)  # churn in 30 days?

# Step 2: Materialize temporal graph
graph = dataset.make_graph(cutoff_time=task.cutoff)
# graph: HeteroData with node types and edge types from schema
# graph['customer'].x : [N_cust, 128]  (encoded row features)
# graph['order'].x    : [N_ord,  64]
# graph['product'].x  : [N_prod, 96]
# graph['customer','places','order'].edge_index : [2, E]

# Step 3-4: Heterogeneous GNN (one layer).
# SAGEConv((src_dim, dst_dim), out_dim): one module per edge type.
conv = HeteroConv({
    ('customer', 'places', 'order'): SAGEConv((128, 64), 64),
    ('order', 'rev_places', 'customer'): SAGEConv((64, 128), 128),
    ('order', 'contains', 'product'): SAGEConv((64, 96), 96),
})
x_dict = conv(graph.x_dict, graph.edge_index_dict)

# Step 5: Task-specific head
mlp = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
churn_logits = mlp(x_dict['customer']).squeeze(-1)   # [N_cust]
# labels: float tensor of shape [N_cust] holding 0/1 churn targets from the task
loss = F.binary_cross_entropy_with_logits(churn_logits, labels)
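
Step 3 of the breakdown (feature encoding) is easy to gloss over, so here is a minimal pure-Python sketch of three of the encoders. The helper names, the column names, and the choice of a 4-dimensional sinusoidal encoding are illustrative assumptions. In a real model the categorical index would feed a learned embedding table, and text columns would go through a pre-trained language model, both omitted here.

```python
import math


def encode_numeric(value, mean, std):
    """Standardize a numeric column value."""
    return [(value - mean) / std]


def encode_categorical(value, vocab):
    """Map a category to an integer index; index 0 is reserved for unseen values."""
    return [vocab.get(value, 0)]


def encode_timestamp(seconds, dim=4, base=10000.0):
    """Sinusoidal encoding: sin/cos pairs at geometrically spaced frequencies."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        out.extend([math.sin(seconds * freq), math.cos(seconds * freq)])
    return out


def encode_row(row, stats, vocab):
    """Concatenate the per-column encodings into one node feature vector."""
    return (
        encode_numeric(row["amount"], *stats["amount"])
        + encode_categorical(row["country"], vocab)
        + encode_timestamp(row["ts"])
    )


features = encode_row(
    {"amount": 30.0, "country": "DE", "ts": 0},
    stats={"amount": (20.0, 5.0)},   # (mean, std) estimated on training rows
    vocab={"DE": 1, "US": 2},
)
```

The normalization statistics and the vocabulary should themselves be computed only from rows before the training cutoff, for the same leakage reasons discussed in Chapter 3.
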
In the RDL pipeline, what is "temporal materialization" and why is it necessary?

Chapter 5: Message Passing on Relational Graphs

Relational graphs are heterogeneous: different node types have different feature dimensions and different semantics. A Customer node has age and country; a Product node has price and category; an Order node has amount and timestamp. You can't apply the same weight matrix across all of them.

The solution is type-specific message passing. For each edge type (src_type, relation, dst_type), you learn a separate message function. This is exactly what PyTorch Geometric's HeteroConv does — it routes different edge types through different convolution modules.

The Heterogeneous Message Passing Equation

For a target node v of type τ(v), the update aggregates from all its neighbors across all incoming edge types:

$$h_v^{(l+1)} = \sigma\Big( W_{\tau(v)}\, h_v^{(l)} + \sum_{r \in R(v)} \sum_{u \in N_r(v)} \frac{1}{|N_r(v)|}\, W_r\, h_u^{(l)} \Big)$$

Here R(v) is the set of edge types pointing to v, N_r(v) is the set of v's neighbors via edge type r, W_τ(v) is the self-transformation for v's node type, and W_r is a learned weight matrix specific to edge type r. Each foreign key relationship gets its own learned transformation.
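
The update rule can be traced by hand with a small pure-Python sketch. The weights here are fixed toy values, not learned ones, σ is assumed to be ReLU, and the helper names are introduced only for this example.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, xi) for xi in x]


def hetero_update(h_v, W_self, neighbors_by_type, W_by_type):
    """One update for node v: self term plus mean-aggregated, type-specific messages."""
    out = matvec(W_self, h_v)
    for r, neighbor_feats in neighbors_by_type.items():
        if not neighbor_feats:
            continue  # no neighbors via edge type r: it contributes nothing
        for h_u in neighbor_feats:
            msg = matvec(W_by_type[r], h_u)  # W_r h_u
            out = [o + m / len(neighbor_feats) for o, m in zip(out, msg)]
    return relu(out)


h_customer = [1.0, 0.0]
W_self = [[1.0, 0.0], [0.0, 1.0]]
neighbors = {"order": [[2.0], [4.0]], "review": [[3.0]]}
weights = {"order": [[1.0], [0.0]], "review": [[0.0], [-1.0]]}

h_next = hetero_update(h_customer, W_self, neighbors, weights)
```

Note that the order and review features have a different dimension from the customer state; each W_r maps its own source type into the target node's space, which is why a single shared matrix would not work.
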

Why type-specific weights? The transformation from Order features to Customer features should be different from the transformation from Product features to Customer features. An order is "placed by" a customer (temporal); a product is "bought by" a customer (preference). These relationships carry fundamentally different information.

Handling Asymmetry

Foreign keys are directional — an Order points to a Customer, not the other way. But for GNN message passing, you usually want bidirectional information flow: customers learn from their orders, but orders also learn from the customers who placed them (this gives orders "customer context" that may help predict other properties). The standard approach is to add reverse edges for every foreign key, giving each direction its own edge type (places vs. rev_places).
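
Adding reverse edges is mechanical. The sketch below assumes edges are stored as a dictionary from `(src_type, relation, dst_type)` to a list of `(src_id, dst_id)` pairs; `add_reverse_edges` is a hypothetical helper. PyTorch Geometric's `ToUndirected` transform appears to follow a similar `rev_` naming scheme for heterogeneous graphs, though its documentation is the authority on the details.

```python
def add_reverse_edges(edges):
    """Return a copy of `edges` with a reversed edge type added for each original one."""
    out = dict(edges)
    for (src, rel, dst), pairs in edges.items():
        rev_type = (dst, f"rev_{rel}", src)
        out[rev_type] = [(d, s) for s, d in pairs]  # flip each (src_id, dst_id) pair
    return out


edges = {("customer", "places", "order"): [(42, 1), (42, 2)]}
bidirectional = add_reverse_edges(edges)
```

Because the reversed relation is a separate edge type, it gets its own weight matrix during message passing, so "customer informs order" and "order informs customer" are learned independently.
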

Heterogeneous Message Passing Demo

Watch how information flows from different node types into a Customer node. Each edge type uses a different learned transformation (shown by color). Click "Propagate" to animate one message passing step.

Why does heterogeneous GNN message passing use different weight matrices for different edge types?

Chapter 6: RelBench — A Standard Benchmark

For GNNs on social graphs, there's OGB (Open Graph Benchmark). For knowledge graphs, there's FB15k-237. For recommendation, there's MovieLens and Amazon. But for relational ML, there was no standard benchmark. Everyone used different datasets, different splits, different metrics — making comparison impossible.

RelBench (Fey et al., 2023) fills this gap. It provides a collection of real-world relational databases with standardized train/validation/test splits, evaluation metrics, and leaderboards. The databases come from real application domains: e-commerce, Stack Exchange, Wikipedia edits, and more.

Datasets in RelBench

| Dataset | Domain | Tables | Rows | Tasks |
| --- | --- | --- | --- | --- |
| rel-amazon | E-commerce | 5 | ~10M | Review rating, churn |
| rel-stackex | Q&A forum | 8 | ~15M | Engagement, badge prediction |
| rel-wiki | Wikipedia | 4 | ~50M | Edit activity, article quality |
| rel-trial | Clinical trials | 7 | ~5M | Trial completion, adverse events |
| rel-hm | Fashion retail | 4 | ~30M | Purchase prediction |

RelBench uses temporal splits across all datasets. Train ends at T1, validation between T1 and T2, test after T2. This mimics production deployment, where you always predict into the future. It also means you cannot accidentally use future data — the split enforces it.

What RelBench Enables

Before RelBench, a paper claiming "RDL beats XGBoost on customer churn" was hard to evaluate — which churn dataset? Which features did XGBoost get? What was the evaluation protocol? With RelBench, comparisons are apples-to-apples. Everyone uses the same databases, same splits, same metrics. This is how science is supposed to work.

Why does RelBench use temporal (chronological) splits rather than random row shuffles?

Chapter 7: Results vs Traditional ML

Does RDL actually work? Or is it a beautiful idea that loses to gradient boosting in practice? The honest answer from RelBench results: it depends on the task and the data — and the reasons are instructive.

When RDL Wins

RDL consistently outperforms XGBoost on tasks where multi-hop relational information is the key signal. If predicting customer churn requires knowing what their friends bought, what reviews they wrote, what products those reviews cover — information that spans 3-4 table hops — then GNNs can capture this while XGBoost's flat feature vector misses it.

RDL vs XGBoost: Performance by Task Complexity

Relative performance gain of GNN-RDL over XGBoost baseline, across tasks requiring different amounts of multi-hop relational reasoning. Bar height = improvement in AUROC or RMSE.

Longer bars = more benefit from graph structure.

When XGBoost Holds Its Own

XGBoost with carefully engineered features still competes with RDL on tasks where the predictive signal is mostly local — contained within a single table or one JOIN away. The reason: XGBoost's boosting and regularization are highly tuned for tabular data, and its features, though manual, can be crafted by an expert who knows the domain.

The honest comparison: XGBoost + domain expert features vs. RDL + no feature engineering. RDL wins when: (1) the relevant signal is multi-hop, (2) the domain expert doesn't know what features matter, or (3) fast iteration without feature engineering is valuable. XGBoost wins when: (1) someone who knows the domain has spent weeks on features, and (2) the signal is local.

The Trend Is Clear

On newer, harder RelBench tasks introduced in 2024, RDL models consistently outperform XGBoost baselines — even XGBoost with extensive feature engineering. As tasks become more complex and datasets more relational, the advantage of learning from graph structure grows.

| Task type | Relational hops | RDL vs XGBoost |
| --- | --- | --- |
| Single-table prediction | 0 | ≈ Tie (XGBoost slight edge) |
| 1-hop aggregation | 1 | RDL slight edge |
| Multi-hop relational | 3+ | RDL wins clearly |
| Cold-start entities | Any | RDL wins (content features help) |

On which type of task does RDL show the largest improvement over XGBoost?

Chapter 8: Challenges — What RDL Doesn't Solve

RDL is promising, but it's not a drop-in replacement for everything. Three challenges stand out as the hardest open problems.

Challenge 1: Scalability

A production relational database might have 100 million customer rows, 500 million order rows, and 50 million product rows. Building the full graph and running message passing on it is infeasible on a single GPU. Once a neighborhood passes through a high-degree node such as a popular product, a single customer's 2-hop neighborhood can include millions of nodes.

Solutions borrowed from GNN scalability: mini-batching with neighbor sampling (same as GraphSAGE), graph partitioning, and pre-computing static features. But temporal materialization adds complexity — the subgraph changes for every prediction cutoff, so you can't easily reuse cached neighborhoods.
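
One way to combine neighbor sampling with the cutoff rule is to filter by timestamp before sampling. The following is a sketch under assumed data structures (an adjacency dictionary of `(neighbor, timestamp)` pairs); `sample_temporal_neighbors` is a hypothetical helper, not a function from any particular library.

```python
import random


def sample_temporal_neighbors(adj, node, cutoff, k, rng):
    """Sample up to k neighbors of `node` whose edge timestamp is <= cutoff.

    adj maps node -> list of (neighbor, timestamp).
    """
    eligible = [nbr for nbr, ts in adj.get(node, []) if ts <= cutoff]
    if len(eligible) <= k:
        return eligible
    return rng.sample(eligible, k)


adj = {
    "cust_42": [("order_1", 3), ("order_2", 5), ("order_3", 8), ("order_4", 9)],
}
rng = random.Random(0)
sampled = sample_temporal_neighbors(adj, "cust_42", cutoff=6, k=1, rng=rng)
all_past = sample_temporal_neighbors(adj, "cust_42", cutoff=6, k=10, rng=rng)
```

Because the eligible set depends on the cutoff, the same seed node yields different sampled neighborhoods for different prediction times, which is exactly why cached neighborhoods are hard to reuse.
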

Challenge 2: Temporal Leakage is Easy to Get Wrong

Temporal leakage is subtle. It's not just "don't use future labels." Consider: a product's average review score. Computed over all time, it includes future reviews. Computed over time ≤ cutoff, it's safe. Every aggregated feature needs a temporal filter. This is correct by construction in RDL (because the graph itself is materialized at cutoff), but manually engineered features often miss this.
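
The average-review-score example can be checked with a few invented numbers. The `reviews` rows and the `avg_rating` helper are hypothetical; the point is only how far the leaky and the safe versions of the same feature can drift apart.

```python
# Hypothetical (product_id, rating, day) rows; the prediction cutoff is day 5.
reviews = [
    (7, 5, 1),
    (7, 4, 3),
    (7, 1, 9),
    (7, 1, 10),
]


def avg_rating(rows, product_id, cutoff=None):
    """Mean rating for a product; if cutoff is given, ignore reviews after it."""
    ratings = [
        rating
        for pid, rating, day in rows
        if pid == product_id and (cutoff is None or day <= cutoff)
    ]
    return sum(ratings) / len(ratings) if ratings else None


leaky = avg_rating(reviews, 7)            # also uses the reviews from days 9 and 10
safe = avg_rating(reviews, 7, cutoff=5)   # uses only the reviews from days 1 and 3
```

The all-time average quietly encodes that the product's ratings later collapsed, information a model deployed on day 5 could never have had.
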

The temporal graph materialization is the killer feature of RDL. By building the graph at the cutoff and running GNN on it, every aggregation is automatically temporally correct. The GNN cannot see future data because the graph doesn't contain it. This is much harder to guarantee in manual feature engineering.

Challenge 3: Schema Diversity

Every database has a different schema. A model trained on one schema (e-commerce) doesn't transfer to another (healthcare) without retraining. This is the "universal encoder" problem — can you build a GNN that works on any schema without schema-specific design? This is the focus of Lecture 13 (advanced RDL architectures).

| Challenge | Current Solution | Open Problem |
| --- | --- | --- |
| Scale (millions of rows) | Neighbor sampling, mini-batching | Temporal-aware sampling |
| Temporal correctness | Graph materialization at cutoff | Efficient re-materialization |
| Schema diversity | Schema-specific models | Universal relational encoders |
| Feature encoding | Pre-trained LMs for text/categoricals | Unified multi-modal encoding |

Why does building the GNN input graph from data ≤ cutoff_time automatically prevent temporal leakage in RDL?

Chapter 9: Connections — Where to Go Next

Relational Deep Learning sits at the intersection of several research communities that have been working in parallel: graph learning, tabular ML, and database systems. It unifies them under a single framework.

The Key Insights to Carry Forward

Relational databases are graphs. This isn't a metaphor — it's exact. Foreign keys are edges, rows are nodes, tables are node types. Once you see this, the entire GNN toolkit becomes available for relational ML problems.
Temporal correctness is structural. By building the graph at the prediction cutoff, RDL rules out temporal leakage through the graph by construction, provided every row carries a reliable timestamp. This is safer than trusting engineers to filter each feature manually.
Feature engineering is learnable. Multi-hop SQL JOINs with complex aggregations can be replaced by GNN layers that learn the right aggregation. The GNN doesn't know SQL — but gradient descent on the task labels discovers what information matters.
Schema diversity is an open problem. Training a model per schema is feasible; training one model for all schemas is the frontier (see Lecture 13).

Related Lessons

| Topic | Lesson | Connection |
| --- | --- | --- |
| GNN fundamentals | Lecture 3: GNNs | Message passing, node embeddings |
| Heterogeneous graphs | Lecture 9: Hetero GNNs | Type-specific transformations |
| RecSys (bipartite) | Lecture 11: RecSys | User-item graph, link prediction |
| Advanced RDL | Lecture 13: Advanced RDL | Universal encoders, Griffin, scalability |

Key Papers

"The relational database is the most successful data abstraction in the history of computing. Making it a first-class citizen for machine learning is long overdue."
— paraphrasing the RDL research community