CS224W Lecture 12 — Relational Deep Learning

Chapter 0: Relational Data as Graphs

An e-commerce company has a database with four tables: Customers, Orders, Products, and Reviews. These tables are connected by foreign keys: an Order row has a CustomerID that points to a Customer row, and a ProductID that points to a Product row. This is how relational databases link tables to one another.

Here's the insight that changes everything: a relational database with foreign keys is a heterogeneous graph. Every row is a node. Every foreign key reference from one row to another is an edge. The table name is the node type. The foreign key column is the edge type. You've been storing a graph in your database this whole time; you just never called it that.

Tables → node types. Rows → nodes. Foreign keys → edges. A row in the Orders table with CustomerID=42 and ProductID=7 creates two edges: Order→Customer and Order→Product. The entire database schema defines the graph's edge types. The data fills in the nodes and their attributes.
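The mapping can be sketched in a few lines of plain Python. The table contents, the `FOREIGN_KEYS` dictionary, and the `build_edges` helper are hypothetical names chosen for illustration; they do not come from any library.

```python
# Toy database: each table maps a primary key to a row (a dict of columns).
tables = {
    "customer": {42: {"country": "DE"}, 43: {"country": "US"}},
    "product": {7: {"price": 19.99}},
    "order": {
        1: {"customer_id": 42, "product_id": 7, "amount": 19.99},
        2: {"customer_id": 43, "product_id": 7, "amount": 39.98},
    },
}

# Foreign keys: (referencing table, column) -> referenced table.
FOREIGN_KEYS = {
    ("order", "customer_id"): "customer",
    ("order", "product_id"): "product",
}

def build_edges(tables, foreign_keys):
    """Return {(src_type, fk_column, dst_type): [(src_id, dst_id), ...]}."""
    edges = {}
    for (src_type, column), dst_type in foreign_keys.items():
        edge_type = (src_type, column, dst_type)
        edges[edge_type] = [
            (src_id, row[column])
            for src_id, row in tables[src_type].items()
            if row[column] in tables[dst_type]  # skip dangling references
        ]
    return edges

edges = build_edges(tables, FOREIGN_KEYS)
num_nodes = {node_type: len(rows) for node_type, rows in tables.items()}
```

Each table contributes one node type, and each foreign key column contributes one edge type whose edges run from the referencing row to the referenced row.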

Relational Database → Heterogeneous Graph

A small e-commerce schema. Left: the database tables with foreign key arrows. Right: the same data as a heterogeneous graph.

What Makes It "Heterogeneous"

A homogeneous graph has one type of node and one type of edge. A heterogeneous graph has multiple node types (Customer, Product, Order, Review) and multiple edge types (customer→order, order→product, customer→review). Each node type has its own feature space — a Customer has name, age, country; a Product has title, price, category. GNNs need to handle this heterogeneity explicitly.

This is not a niche scenario. Virtually every company's core data lives in a relational database. Healthcare: Patients, Diagnoses, Prescriptions, Doctors. Finance: Accounts, Transactions, Merchants, Fraud flags. Social media: Users, Posts, Comments, Follows. They're all relational databases. They're all heterogeneous graphs.

In the database-as-graph view, what does a foreign key relationship become?

A node attribute (feature of the row)
A table (new type of entity)
An edge between two nodes (the referenced row and the referencing row)
A graph property (stored globally, not per-node)

Chapter 1: Why This Matters — The Feature Engineering Tax

You want to predict: which customers will churn in the next 30 days? You have a relational database. A standard ML workflow — the one that's been used for decades — starts with a painful step: feature engineering.

Feature engineering means manually converting your relational data into a flat feature vector per entity. For each customer: count their total orders, compute average order value, find their most recent purchase date, count their reviews, compute average review rating, find their most frequently purchased product category... and so on. This process takes weeks, requires domain expertise, and produces features that may or may not capture the right signals.

Feature engineering is the bottleneck in tabular ML. Practitioners commonly report that data preparation and feature engineering consume the majority of project time, before any model is trained. It's manual, slow, domain-specific, and doesn't transfer between tasks.

The RDL Promise

Relational Deep Learning (RDL) proposes to eliminate manual feature engineering by letting the GNN learn directly from the graph structure. Instead of manually aggregating "how many orders did this customer place in the last 90 days?", the GNN propagates information from Order nodes to Customer nodes automatically, learning the right aggregation function end-to-end from the task labels.

Traditional pipeline:

Domain expert inspects schema
Manually define features (weeks)
SQL queries to compute features
Flat feature table → XGBoost

RDL pipeline:

Parse database schema automatically
Construct heterogeneous graph
Run GNN — features learned end-to-end
Predict task labels directly

The GNN effectively learns to do feature engineering automatically. It can discover, for example, that aggregating 90-day orders matters for churn and that 180-day purchase history matters for lifetime value prediction, without a human specifying either. This is not magic: it's the inductive bias of graph structure doing the work that engineers used to do by hand.

What is the key problem that Relational Deep Learning aims to solve?

Relational databases are too slow for machine learning applications
GNNs cannot handle more than one type of node or edge
Manual feature engineering for relational data is slow, expensive, and doesn't transfer; RDL learns features automatically from the database graph
Foreign key joins are too expensive to compute at ML scale

Chapter 2: Entity-Level Tasks — Predicting Row Properties

What can you actually predict with RDL? Almost anything you'd want to know about rows in a database table. These are called entity-level tasks — tasks where the prediction target is a property of a specific row (entity) in the database.

Task Taxonomy

Task	Input	Prediction	Example
Node classification	Customer node	Class label	Will this customer churn? (yes/no)
Node regression	Customer node	Real value	What is this customer's 12-month LTV?
Link prediction	Customer + Product	Probability	Will this customer buy this product?
Time-to-event	Customer node	Duration	When will this customer next make a purchase?

Notice that recommendation (link prediction between customers and products) is a special case of entity-level tasks in the RDL framework. Churn prediction, fraud detection, LTV estimation — these are all node-level regression or classification on rows in a relational database. RDL provides one unified framework for all of them.
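To make the node classification row concrete, here is a minimal sketch of how a churn label could be derived from an order log. It assumes churn means "no order within 30 days after a cutoff date"; the `churn_label` helper, the toy order log, and the 30-day horizon are illustrative assumptions, not a definition taken from a specific benchmark.

```python
from datetime import date, timedelta

# Toy order log: (customer_id, order_date). Values are made up for illustration.
orders = [
    (1, date(2024, 2, 20)),
    (1, date(2024, 3, 10)),
    (2, date(2024, 1, 5)),
]

def churn_label(customer_id, cutoff, orders, horizon_days=30):
    """1 if the customer places no order in (cutoff, cutoff + horizon], else 0."""
    window_end = cutoff + timedelta(days=horizon_days)
    active = any(
        cid == customer_id and cutoff < day <= window_end
        for cid, day in orders
    )
    return 0 if active else 1

cutoff = date(2024, 3, 1)
labels = {cid: churn_label(cid, cutoff, orders) for cid in (1, 2)}
```

The label is a property of a Customer row, which is what makes this an entity-level task. A regression target such as lifetime value would be built the same way, with a sum over the window instead of a yes/no check.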

Multi-Hop Information

The power of treating the database as a graph shows up when you consider what information is relevant to a prediction. To predict customer churn, you need not just the customer's direct features (age, country) but:

1-hop: their orders (recent? large? frequent?)
2-hop: the products they ordered (niche? popular? seasonal?)
3-hop: other customers who bought those products (churn rate of similar customers)

Each hop corresponds to a JOIN in SQL. A 3-layer GNN is analogous to a 3-table JOIN followed by an aggregation, except that the aggregation is learned, not specified. The number of hops is fixed by the number of layers; what the GNN discovers through gradient descent on the task is which of the reachable signals matter and how to combine them.

python
# Traditional approach: manual feature engineering.
# `db` is assumed to be an open database connection (PostgreSQL syntax shown).
customer_features = db.execute("""
    SELECT c.id, c.age, c.country,
           COUNT(DISTINCT o.id) AS num_orders,   -- DISTINCT: the two joins fan out
           AVG(o.amount)        AS avg_order_value,
           MAX(o.timestamp)     AS last_order_date,
           COUNT(DISTINCT r.id) AS num_reviews
    FROM customers c
    LEFT JOIN orders o
           ON o.customer_id = c.id
          AND o.timestamp > NOW() - INTERVAL '90 days'  -- in ON, so customers with no orders are kept
    LEFT JOIN reviews r ON r.customer_id = c.id
    GROUP BY c.id, c.age, c.country
""")
# ^ This took a DBA 2 weeks to write and validate

# RDL approach: let the GNN learn the aggregations.
# build_graph_from_db, gnn, and classifier are placeholders, not library calls.
graph = build_graph_from_db(db)          # automatic
embeddings = gnn(graph)                  # learns what to aggregate
predictions = classifier(embeddings['customer'])  # end-to-end
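The correspondence between one JOIN and one message passing hop can be checked on a toy example. The sketch below uses Python's built-in `sqlite3` module with made-up table contents; the mean aggregation stands in for what a GNN layer would compute before applying any learned weights.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 40.0), (12, 2, 5.0);
""")

# SQL view: one JOIN followed by a fixed aggregation (AVG).
sql_avg = dict(conn.execute("""
    SELECT c.id, AVG(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
"""))

# Graph view: each order row sends its feature along the order -> customer edge,
# and each customer averages the incoming messages (one message passing hop).
order_rows = conn.execute("SELECT id, customer_id, amount FROM orders").fetchall()
incoming = {}
for _order_id, customer_id, amount in order_rows:
    incoming.setdefault(customer_id, []).append(amount)
gnn_mean = {cid: sum(msgs) / len(msgs) for cid, msgs in incoming.items()}
```

Both views produce the same per-customer averages. The difference in a trained GNN is that the messages pass through learned transformations, so the model is not limited to the aggregates a human thought to write down.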

Why is a GNN with L layers on a relational graph analogous to L-table JOINs in SQL?

Both require the same amount of compute proportional to the number of rows Each GNN message passing layer aggregates information from one hop away — following one foreign key relationship — just as one SQL JOIN follows one foreign key link. L layers = L JOINs of information. SQL JOINs and GNN layers both use the same normalization strategy GNNs automatically generate SQL queries during message passing

Chapter 3: Temporal Aspects — Preventing Future Leakage

Relational data is almost always timestamped. An order has a timestamp. A review has a timestamp. A login event has a timestamp. This temporal structure creates both an opportunity and a danger.

The opportunity: you can model time-evolving relationships. A customer's behavior in January might predict their churn in March. A product's review history might predict its future sales.

The danger: temporal leakage. If you use data from after the prediction time to make that prediction, you've given the model information it couldn't possibly have in production. This inflates metrics dramatically and produces models that fail in deployment.

The golden rule: when predicting a label at time T, you may only use data with timestamp ≤ T. Any data after T is future information and must be excluded from the graph at prediction time. Violating this rule causes data leakage, one of the most common mistakes in time-series ML.

Train / Validation / Test Splits

In relational ML, splits must be temporal. You cannot randomly shuffle rows and split 80/10/10 — that would let the model see future interactions during training.

Training: all interactions before time T₁
↓ temporal gap (no data)
Validation: labels between T₁ and T₂
↓ temporal gap
Test: labels after T₂

The temporal gap between splits is important. Without it, information from the end of training bleeds into the validation period, making validation scores overoptimistic. Real deployments have this gap; your evaluation should too.
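A chronological split with a gap can be sketched as follows. The `temporal_split` helper and the exact boundary conventions (which side of T₁ and T₂ is inclusive, and where the gap sits) are assumptions made for this example; a real benchmark fixes these conventions itself.

```python
def temporal_split(rows, t1, t2, gap=0):
    """Split (timestamp, payload) rows chronologically.

    train: timestamp < t1
    val:   t1 + gap <= timestamp < t2
    test:  timestamp >= t2 + gap
    Rows that fall inside a gap are dropped.
    """
    train = [r for r in rows if r[0] < t1]
    val = [r for r in rows if t1 + gap <= r[0] < t2]
    test = [r for r in rows if r[0] >= t2 + gap]
    return train, val, test

rows = [(t, f"event-{t}") for t in range(10)]  # toy integer timestamps 0..9
train, val, test = temporal_split(rows, t1=4, t2=7, gap=1)
```

With `gap=1`, the events at timestamps 4 and 7 belong to no split, so nothing recorded right at a boundary can inform the next period's predictions.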

Materialization at Prediction Time

For each prediction (e.g., "will customer X churn between March 1 and March 31?"), you materialize a subgraph of the relational graph that includes only data with timestamps before March 1. This subgraph is what the GNN sees. March 1 is the cutoff time.
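A minimal sketch of materialization, assuming each row either carries a timestamp or is treated as always available (for example, a customer record with no event time). The `materialize` helper and the integer timestamps are hypothetical and only illustrate the filtering logic.

```python
def materialize(nodes, edges, cutoff):
    """Keep nodes with timestamp <= cutoff and edges whose endpoints both survive.

    nodes: {node_id: timestamp, or None for rows without a timestamp}
    edges: list of (src_id, dst_id)
    """
    kept = {n for n, ts in nodes.items() if ts is None or ts <= cutoff}
    kept_edges = [(s, d) for s, d in edges if s in kept and d in kept]
    return kept, kept_edges

# Toy data: customers have no timestamp, orders do (integer days for simplicity).
nodes = {"cust_1": None, "order_a": 3, "order_b": 8}
edges = [("order_a", "cust_1"), ("order_b", "cust_1")]
kept, kept_edges = materialize(nodes, edges, cutoff=6)
```

The order placed on day 8 and its edge are absent from the result, so no later computation on this subgraph can reach them.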

Temporal Cutoff Visualization

Figure: the interaction graph available at a given cutoff time. Nodes and edges after the cutoff are hidden from the GNN, which prevents leakage.

You're predicting whether a customer will churn in April. Which data is it safe to include in the GNN's input graph?

All data in the database, including April transactions
Only the customer's data, not their orders or reviews
All data with timestamps before April 1, the prediction cutoff date
Only data from the same calendar year

Chapter 4: The RDL Pipeline

Let's trace the full journey from a relational database schema to a GNN prediction. This is the complete Relational Deep Learning pipeline, step by step.

Full RDL Pipeline

Figure: a customer churn prediction task as it moves from raw database tables to a GNN prediction.

Step-by-Step Breakdown

1. Schema Parsing

Read table definitions + foreign keys. Each table becomes a node type. Each FK becomes an edge type. Column data types → initial feature encoders.

↓

2. Temporal Materialization

For each prediction (entity, cutoff_time), select only rows with timestamp ≤ cutoff. Build the "past" subgraph. This prevents future leakage.

↓

3. Feature Encoding

Encode each row's columns: numeric → normalize, categorical → embedding, text → pre-trained LM, timestamp → sinusoidal. Stack into node feature vectors.

↓

4. Heterogeneous GNN

Run message passing respecting node types. Different weight matrices per (src_type, edge_type, dst_type). L layers = L JOIN depth of information aggregation.

↓

5. Task Head + Loss

Take the target entity's embedding. Pass through MLP. Output prediction. Compute BCE (classification) or MSE (regression) loss. Backpropagate.
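The feature encoding step (step 3) can be illustrated with plain Python. The three helpers below are hypothetical and cover only numeric, categorical, and timestamp columns; text columns would need a pre-trained language model, which is omitted here. The sinusoidal form follows the usual positional-encoding recipe and is an assumption about what "sinusoidal" means in this pipeline.

```python
import math

def encode_numeric(value, mean, std):
    """Standardize a numeric column value."""
    return [(value - mean) / std]

def encode_categorical(value, vocabulary):
    """Map a category to an integer index; an embedding table would consume it."""
    return vocabulary.index(value)

def encode_timestamp(t, dim=4, base=10000.0):
    """Sinusoidal encoding: sin/cos pairs at geometrically spaced frequencies."""
    features = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        features += [math.sin(t * freq), math.cos(t * freq)]
    return features

amount_vec = encode_numeric(30.0, mean=20.0, std=5.0)
country_idx = encode_categorical("US", ["DE", "US", "FR"])
time_vec = encode_timestamp(0.0)
```

In a full pipeline the per-column encodings of a row would be concatenated (or summed after projection) into that node's initial feature vector.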

python
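# RelBench-style RDL pipeline (simplified sketch on a toy graph)
import torch
import torch.nn.functional as F
from torch_geometric.data import HeteroData
from torch_geometric.nn import HeteroConv, SAGEConv

# Steps 1-2: in practice the database is loaded and the graph is materialized
# at the task's cutoff time (e.g. with RelBench). A random toy graph stands in
# here so that the rest of the pipeline runs as written.
n_cust, n_ord, n_prod = 4, 6, 3
graph = HeteroData()
graph['customer'].x = torch.randn(n_cust, 128)  # [N_cust, d_cust] encoded row features
graph['order'].x = torch.randn(n_ord, 64)       # [N_ord,  d_ord]
graph['product'].x = torch.randn(n_prod, 96)    # [N_prod, d_prod]
cust_of_order = torch.tensor([0, 0, 1, 2, 3, 3])  # orders.customer_id
prod_of_order = torch.tensor([0, 1, 1, 2, 0, 2])  # orders.product_id
order_ids = torch.arange(n_ord)
graph['customer', 'places', 'order'].edge_index = torch.stack([cust_of_order, order_ids])
graph['order', 'rev_places', 'customer'].edge_index = torch.stack([order_ids, cust_of_order])
graph['order', 'contains', 'product'].edge_index = torch.stack([order_ids, prod_of_order])

# Steps 3-4: heterogeneous GNN, one SAGEConv per (src, relation, dst) edge type
conv = HeteroConv({
    ('customer', 'places', 'order'): SAGEConv((128, 64), 64),
    ('order', 'rev_places', 'customer'): SAGEConv((64, 128), 128),
    ('order', 'contains', 'product'): SAGEConv((64, 96), 96),
})
x_dict = conv(graph.x_dict, graph.edge_index_dict)

# Step 5: task-specific head
mlp = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
labels = torch.tensor([[1.0], [0.0], [0.0], [1.0]])  # toy churn labels, [N_cust, 1]
churn_logits = mlp(x_dict['customer'])   # [N_cust, 1]
loss = F.binary_cross_entropy_with_logits(churn_logits, labels)
loss.backward()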

In the RDL pipeline, what is "temporal materialization" and why is it necessary?

Converting timestamps to unix epoch integers so GNNs can process them as node features
Storing all graph snapshots in GPU memory for fast access during training
Building a subgraph that contains only data before the prediction cutoff time, which is necessary to prevent the model from using future information it wouldn't have in deployment
Converting relational data to temporal sequences (like RNNs) instead of graphs

Chapter 5: Message Passing on Relational Graphs

Relational graphs are heterogeneous: different node types have different feature dimensions and different semantics. A Customer node has age and country; a Product node has price and category; an Order node has amount and timestamp. You can't apply the same weight matrix across all of them.

The solution is type-specific message passing. For each edge type (src_type, relation, dst_type), you learn a separate message function. This is exactly what PyTorch Geometric's HeteroConv does — it routes different edge types through different convolution modules.

The Heterogeneous Message Passing Equation

For a target node v of type τ(v), the update aggregates from all its neighbors across all incoming edge types:

h_v^(l+1) = σ( W_τ(v) h_v^(l) + ∑_{r ∈ R(v)} ∑_{u ∈ N_r(v)} 1⁄|N_r(v)| W_r h_u^(l) )

Where R(v) is the set of edge types pointing to v, N_r(v) is v's neighbors via edge type r, and W_r is a learned weight matrix specific to that edge type. Each foreign key relationship gets its own learned transformation.
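One update from the equation above can be written out in plain Python to show where each term comes from. This is a sketch under stated assumptions: σ is taken to be ReLU, the relation names `rev_places` and `rev_writes` are hypothetical, and the weights are hand-picked numbers, not learned parameters.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, xi) for xi in x]

def hetero_update(h_v, W_self, neighbors_by_relation, W_by_relation):
    """One update of node v following the equation above, with sigma = ReLU.

    neighbors_by_relation: {relation: [h_u, ...]} for incoming edge types R(v)
    W_by_relation:         {relation: W_r}
    """
    out = matvec(W_self, h_v)
    for relation, neighbor_states in neighbors_by_relation.items():
        if not neighbor_states:
            continue  # no neighbors via this relation, nothing to add
        W_r = W_by_relation[relation]
        for h_u in neighbor_states:
            message = matvec(W_r, h_u)
            out = [o + m / len(neighbor_states) for o, m in zip(out, message)]
    return relu(out)

# Customer state is 2-d; order states are 3-d; review states are 1-d.
h_customer = [1.0, -1.0]
W_self = [[1.0, 0.0], [0.0, 1.0]]
neighbors = {
    "rev_places": [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]],  # two orders
    "rev_writes": [[3.0]],                              # one review
}
weights = {
    "rev_places": [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # maps 3-d -> 2-d
    "rev_writes": [[0.0], [1.0]],                      # maps 1-d -> 2-d
}
h_next = hetero_update(h_customer, W_self, neighbors, weights)
```

Note that each relation's weight matrix also absorbs the difference in feature dimension: order states are 3-dimensional and review states 1-dimensional, yet both are mapped into the customer's 2-dimensional space before being summed.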

Why type-specific weights? The transformation from Order features to Customer features should be different from the transformation from Product features to Customer features. An order is "placed by" a customer (temporal); a product is "bought by" a customer (preference). These relationships carry fundamentally different information.

Handling Asymmetry

Foreign keys are directional — an Order points to a Customer, not the other way. But for GNN message passing, you usually want bidirectional information flow: customers learn from their orders, but orders also learn from the customers who placed them (this gives orders "customer context" that may help predict other properties). The standard approach is to add reverse edges for every foreign key, giving each direction its own edge type (places vs. rev_places).
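Adding reverse edges is mechanical. The sketch below uses a hypothetical `add_reverse_edges` helper over a plain dictionary of edge lists, with the `rev_` prefix matching the naming used in the text. PyTorch Geometric's `ToUndirected` transform offers comparable behavior for `HeteroData` objects.

```python
def add_reverse_edges(edges):
    """For every (src, rel, dst) edge type, add (dst, 'rev_' + rel, src)."""
    with_reverse = dict(edges)
    for (src, rel, dst), pairs in edges.items():
        with_reverse[(dst, "rev_" + rel, src)] = [(d, s) for s, d in pairs]
    return with_reverse

# Here the forward direction is customer -> order ("places"), as in the text.
edges = {("customer", "places", "order"): [(42, 1), (42, 2)]}
all_edges = add_reverse_edges(edges)
```

Because the reverse direction is registered as its own edge type, it receives its own weight matrix during message passing.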

Heterogeneous Message Passing Demo

Figure: information flowing from different node types into a Customer node during one layer of message passing. Each edge type uses a different learned transformation (shown by color).

Why does heterogeneous GNN message passing use different weight matrices for different edge types?

Different edge types have different numbers of neighbors, requiring different matrix sizes
Shared weights would cause gradient conflicts during backpropagation
Different foreign key relationships carry semantically different information: "placed by" and "bought by" are different relationships, requiring different learned transformations to aggregate correctly
GPU memory limits prevent sharing weights across edge types

Chapter 6: RelBench — A Standard Benchmark

For general-purpose graph learning, there's OGB (Open Graph Benchmark). For knowledge graphs, there's FB15k-237. For recommendation, there's MovieLens and Amazon. But for relational ML, there was no standard benchmark. Everyone used different datasets, different splits, and different metrics, which made comparison difficult.

RelBench (Robinson et al., 2024) fills this gap. It provides a collection of real-world relational databases with standardized train/validation/test splits, evaluation metrics, and leaderboards. The databases come from actual production domains: e-commerce, Stack Exchange, Wikipedia edits, and more.

Datasets in RelBench

Dataset	Domain	Tables	Rows	Tasks
rel-amazon	E-commerce	5	~10M	Review rating, churn
rel-stackex	Q&A forum	8	~15M	Engagement, badge prediction
rel-wiki	Wikipedia	4	~50M	Edit activity, article quality
rel-trial	Clinical trials	7	~5M	Trial completion, adverse events
rel-hm	Fashion retail	4	~30M	Purchase prediction

RelBench uses temporal splits across all datasets. Train ends at T₁, validation between T₁ and T₂, test after T₂. This mimics production deployment, where you always predict into the future. It also means you cannot accidentally use future data — the split enforces it.

What RelBench Enables

Before RelBench, a paper claiming "RDL beats XGBoost on customer churn" was hard to evaluate — which churn dataset? Which features did XGBoost get? What was the evaluation protocol? With RelBench, comparisons are apples-to-apples. Everyone uses the same databases, same splits, same metrics. This is how science is supposed to work.

Why does RelBench use temporal (chronological) splits rather than random row shuffles?

Temporal splits are faster to compute for large databases
Random splits cause class imbalance in temporal datasets
Temporal splits mirror real deployment (always predicting the future from past data), preventing data leakage and giving honest performance estimates
Temporal splits are required by GDPR for privacy compliance

Chapter 7: Results vs Traditional ML

Does RDL actually work? Or is it a beautiful idea that loses to gradient boosting in practice? The honest answer from RelBench results: it depends on the task and the data — and the reasons are instructive.

When RDL Wins

RDL consistently outperforms XGBoost on tasks where multi-hop relational information is the key signal. If predicting customer churn requires knowing what their friends bought, what reviews they wrote, what products those reviews cover — information that spans 3-4 table hops — then GNNs can capture this while XGBoost's flat feature vector misses it.

RDL vs XGBoost: Performance by Task Complexity

Relative performance gain of GNN-RDL over XGBoost baseline, across tasks requiring different amounts of multi-hop relational reasoning. Bar height = improvement in AUROC or RMSE.

Longer bars = more benefit from graph structure.

When XGBoost Holds Its Own

XGBoost with carefully engineered features still competes with RDL on tasks where the predictive signal is mostly local — contained within a single table or one JOIN away. The reason: XGBoost's boosting and regularization are highly tuned for tabular data, and its features, though manual, can be crafted by an expert who knows the domain.

The honest comparison: XGBoost + domain expert features vs. RDL + no feature engineering. RDL wins when: (1) the relevant signal is multi-hop, (2) the domain expert doesn't know what features matter, or (3) fast iteration without feature engineering is valuable. XGBoost wins when: (1) someone who knows the domain has spent weeks on features, and (2) the signal is local.

The Trend Is Clear

On newer, harder RelBench tasks introduced in 2024, RDL models consistently outperform XGBoost baselines — even XGBoost with extensive feature engineering. As tasks become more complex and datasets more relational, the advantage of learning from graph structure grows.

Task type	Relational hops	RDL vs XGBoost
Single-table prediction	0	≈ Tie (XGBoost slight edge)
1-hop aggregation	1	RDL slight edge
Multi-hop relational	3+	RDL wins clearly
Cold-start entities	Any	RDL wins (content features help)

On which type of task does RDL show the largest improvement over XGBoost?

Simple single-table classification (only features from one table)
Tasks with perfectly balanced classes and large training sets
Multi-hop relational tasks where the predictive signal spans 3+ table JOINs: information GNNs propagate automatically but XGBoost features must capture manually
Tasks with high-cardinality categorical variables

Chapter 8: Challenges — What RDL Doesn't Solve

RDL is promising, but it's not a drop-in replacement for everything. Three challenges stand out as the hardest open problems.

Challenge 1: Scalability

A production relational database might have 100 million customer rows, 500 million order rows, and 50 million product rows. Building the full graph and running message passing on it is infeasible on a single GPU. Even a single customer's 2-hop neighborhood might include millions of nodes.

Solutions borrowed from GNN scalability: mini-batching with neighbor sampling (same as GraphSAGE), graph partitioning, and pre-computing static features. But temporal materialization adds complexity — the subgraph changes for every prediction cutoff, so you can't easily reuse cached neighborhoods.
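Neighbor sampling under a temporal constraint might look like the sketch below. The `sample_temporal_neighbors` helper is hypothetical, and uniform sampling over eligible neighbors is only one possible policy (sampling the most recent neighbors is another).

```python
import random

def sample_temporal_neighbors(neighbors, seed_time, k, rng):
    """Sample up to k neighbors whose timestamp is <= seed_time.

    neighbors: list of (neighbor_id, timestamp)
    """
    eligible = [n for n, ts in neighbors if ts <= seed_time]
    if len(eligible) <= k:
        return eligible
    return rng.sample(eligible, k)

rng = random.Random(0)
order_neighbors = [("o1", 1), ("o2", 2), ("o3", 5), ("o4", 9)]
sampled = sample_temporal_neighbors(order_neighbors, seed_time=5, k=2, rng=rng)
```

The sampler bounds the neighborhood size at `k` per hop while never returning a neighbor from after the seed's cutoff. Since the eligible set depends on `seed_time`, a neighborhood cached for one cutoff cannot simply be reused for another.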

Challenge 2: Temporal Leakage is Easy to Get Wrong

Temporal leakage is subtle. It's not just "don't use future labels." Consider: a product's average review score. Computed over all time, it includes future reviews. Computed over time ≤ cutoff, it's safe. Every aggregated feature needs a temporal filter. This is correct by construction in RDL (because the graph itself is materialized at cutoff), but manually engineered features often miss this.
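The average-review-score example can be made concrete with toy numbers. The `average_rating` helper and the review tuples are made up for illustration.

```python
def average_rating(reviews, product_id, cutoff=None):
    """Mean rating of a product, optionally restricted to timestamp <= cutoff."""
    ratings = [
        rating
        for pid, ts, rating in reviews
        if pid == product_id and (cutoff is None or ts <= cutoff)
    ]
    return sum(ratings) / len(ratings) if ratings else None

# Toy reviews: (product_id, timestamp, rating).
reviews = [(7, 1, 5.0), (7, 2, 4.0), (7, 9, 1.0)]
leaky = average_rating(reviews, product_id=7)            # uses the t=9 review
safe = average_rating(reviews, product_id=7, cutoff=5)   # past data only
```

With a cutoff of 5, the leaky feature already reflects the one-star review from time 9, which is information a deployed model would not have when making its prediction.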

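The temporal graph materialization is the killer feature of RDL. By building the graph at the cutoff and running the GNN on it, every aggregation the GNN computes is automatically restricted to past data. The GNN cannot see future rows because the graph doesn't contain them. The one caveat is stored columns that already summarize future rows (for example, a precomputed lifetime average rating); those still have to be excluded or recomputed at the cutoff. This guarantee is much harder to obtain in manual feature engineering.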

Challenge 3: Schema Diversity

Every database has a different schema. A model trained on one schema (e-commerce) doesn't transfer to another (healthcare) without retraining. This is the "universal encoder" problem — can you build a GNN that works on any schema without schema-specific design? This is the focus of Lecture 13 (advanced RDL architectures).

Challenge	Current Solution	Open Problem
Scale (millions of rows)	Neighbor sampling, mini-batching	Temporal-aware sampling
Temporal correctness	Graph materialization at cutoff	Efficient re-materialization
Schema diversity	Schema-specific models	Universal relational encoders
Feature encoding	Pre-trained LMs for text/categoricals	Unified multi-modal encoding

Why does building the GNN input graph from data ≤ cutoff_time automatically prevent temporal leakage in RDL?

GNNs cannot process data with future timestamps because of PyG's API constraints
The GNN's message passing can only visit nodes and edges that exist in its input graph; if future data is excluded from the graph at construction time, there is literally no path for the GNN to reach it
The loss function penalizes models that use future timestamp features
Temporal data is stored in a separate database that the GNN doesn't have access to

Chapter 9: Connections — Where to Go Next

Relational Deep Learning sits at the intersection of several research communities that have been working in parallel: graph learning, tabular ML, and database systems. It unifies them under a single framework.

The Key Insights to Carry Forward

Relational databases are graphs. This isn't a metaphor — it's exact. Foreign keys are edges, rows are nodes, tables are node types. Once you see this, the entire GNN toolkit becomes available for relational ML problems.

Temporal correctness is structural. By building the graph at the prediction cutoff, RDL makes temporal leakage architecturally impossible. This is safer than trusting engineers to filter features manually.

Feature engineering is learnable. Multi-hop SQL JOINs with complex aggregations can be replaced by GNN layers that learn the right aggregation. The GNN doesn't know SQL — but gradient descent on the task labels discovers what information matters.

Schema diversity is an open problem. Training a model per schema is feasible; training one model for all schemas is the frontier (see Lecture 13).

Related Lessons

Topic	Lesson	Connection
GNN fundamentals	Lecture 3: GNNs	Message passing, node embeddings
Heterogeneous graphs	Lecture 9: Hetero GNNs	Type-specific transformations
RecSys (bipartite)	Lecture 11: RecSys	User-item graph, link prediction
Advanced RDL	Lecture 13: Advanced RDL	Universal encoders, Griffin, scalability

Key Papers

RelBench: Robinson et al. (2024). "RelBench: A Benchmark for Deep Learning on Relational Databases." NeurIPS.
RDL framework: Fey et al. (2024). "Relational Deep Learning: Graph Representation Learning on Relational Databases." ICML.
HAN: Wang et al. (2019). "Heterogeneous Graph Attention Network." WWW.
HGT: Hu et al. (2020). "Heterogeneous Graph Transformer." WWW.

"The relational database is the most successful data abstraction in the history of computing. Making it a first-class citizen for machine learning is long overdue."
— paraphrasing the RDL research community