Your company's data lives in a relational database — 50 tables, thousands of foreign keys. Machine learning needs a feature vector. But what if the database is the graph, and GNNs can learn from it directly?
An e-commerce company has a database. Table: Customers. Table: Orders. Table: Products. Table: Reviews. These tables are connected by foreign keys — an Order row has a CustomerID that points to a Customer row, and a ProductID that points to a Product row. This is how all relational databases work.
Here's the insight that changes everything: a relational database with foreign keys is a heterogeneous graph. Every row is a node. Every foreign key reference from one row to another is an edge. The table name is the node type. The foreign key column is the edge type. You've been storing a graph in your database this whole time; you just never called it that.
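The row-to-node and foreign-key-to-edge mapping can be sketched in a few lines of plain Python. This is a minimal illustration, not a production loader: the toy `tables`, the `foreign_keys` mapping, and the `build_hetero_graph` helper are all hypothetical names introduced here, and every table is assumed to have an `id` primary key.

```python
from collections import defaultdict

# Toy tables: each row is a dict; "id" is the primary key (hypothetical schema).
tables = {
    "customer": [{"id": 1, "country": "DE"}, {"id": 2, "country": "US"}],
    "product": [{"id": 10, "price": 9.99}],
    "order": [
        {"id": 100, "customer_id": 1, "product_id": 10},
        {"id": 101, "customer_id": 2, "product_id": 10},
    ],
}

# Foreign keys: (source table, column) -> referenced table.
foreign_keys = {
    ("order", "customer_id"): "customer",
    ("order", "product_id"): "product",
}


def build_hetero_graph(tables, foreign_keys):
    """Return (nodes, edges): one node per row, one edge per foreign-key reference."""
    # A node is identified by (table name, primary key); the table name is its type.
    nodes = {
        (name, row["id"]): row for name, rows in tables.items() for row in rows
    }
    # An edge type is (source table, foreign-key column, target table).
    edges = defaultdict(list)
    for (src_table, column), dst_table in foreign_keys.items():
        for row in tables[src_table]:
            if row[column] is not None:  # skip NULL references
                edges[(src_table, column, dst_table)].append((row["id"], row[column]))
    return nodes, dict(edges)


nodes, edges = build_hetero_graph(tables, foreign_keys)
```

Here the five rows become five typed nodes, and each of the two foreign-key columns becomes its own edge type with one edge per order row.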
Figure: A small e-commerce schema. Left: the database tables with foreign key arrows. Right: the same data as a heterogeneous graph.
A homogeneous graph has one type of node and one type of edge. A heterogeneous graph has multiple node types (Customer, Product, Order, Review) and multiple edge types (customer→order, order→product, customer→review). Each node type has its own feature space — a Customer has name, age, country; a Product has title, price, category. GNNs need to handle this heterogeneity explicitly.
This is not a niche scenario. Virtually every company's core data lives in a relational database. Healthcare: Patients, Diagnoses, Prescriptions, Doctors. Finance: Accounts, Transactions, Merchants, Fraud flags. Social media: Users, Posts, Comments, Follows. They're all relational databases. They're all heterogeneous graphs.
You want to predict: which customers will churn in the next 30 days? You have a relational database. A standard ML workflow — the one that's been used for decades — starts with a painful step: feature engineering.
Feature engineering means manually converting your relational data into a flat feature vector per entity. For each customer: count their total orders, compute average order value, find their most recent purchase date, count their reviews, compute average review rating, find their most frequently purchased product category... and so on. This process takes weeks, requires domain expertise, and produces features that may or may not capture the right signals.
Relational Deep Learning (RDL) proposes to eliminate manual feature engineering by letting the GNN learn directly from the graph structure. Instead of manually aggregating "how many orders did this customer place in the last 90 days?", the GNN propagates information from Order nodes to Customer nodes automatically, learning the right aggregation function end-to-end from the task labels.
The GNN effectively learns to do feature engineering automatically. It discovers that aggregating 90-day orders matters for churn, and 180-day purchase history matters for lifetime value prediction — without a human specifying either. This is not magic: it's the inductive bias of graph structure doing the work that engineers used to do by hand.
What can you actually predict with RDL? Almost anything you'd want to know about rows in a database table. These are called entity-level tasks — tasks where the prediction target is a property of a specific row (entity) in the database.
| Task | Input | Prediction | Example |
|---|---|---|---|
| Node classification | Customer node | Class label | Will this customer churn? (yes/no) |
| Node regression | Customer node | Real value | What is this customer's 12-month LTV? |
| Link prediction | Customer + Item | Probability | Will this customer buy this product? |
| Time-to-event | Customer node | Duration | When will this customer next make a purchase? |
Notice that recommendation (link prediction between customers and products) is a special case of entity-level tasks in the RDL framework. Churn prediction, fraud detection, LTV estimation — these are all node-level regression or classification on rows in a relational database. RDL provides one unified framework for all of them.
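To make an entity-level task concrete, here is a hedged sketch of how churn labels could be derived from an orders table at a given cutoff. The churn definition used (a recently active customer who places no order in the 30 days after the cutoff) is an assumption for illustration, as are the `churn_labels` helper and the toy rows.

```python
from datetime import date, timedelta

# (customer_id, order_date) pairs; hypothetical rows.
orders = [
    (1, date(2024, 2, 10)),
    (1, date(2024, 3, 5)),
    (2, date(2024, 1, 20)),
    (3, date(2024, 2, 25)),
]


def churn_labels(orders, cutoff, horizon_days=30, lookback_days=90):
    """Label recently active customers: 1 = no order in [cutoff, cutoff + horizon)."""
    horizon_end = cutoff + timedelta(days=horizon_days)
    lookback_start = cutoff - timedelta(days=lookback_days)
    # Only customers with an order in the lookback window are candidates.
    active = {c for c, d in orders if lookback_start <= d < cutoff}
    # Customers who ordered again inside the prediction window did not churn.
    returned = {c for c, d in orders if cutoff <= d < horizon_end}
    return {c: int(c not in returned) for c in sorted(active)}


labels = churn_labels(orders, cutoff=date(2024, 3, 1))
```

The labels look at the window after the cutoff, while the model's inputs must come only from before it; keeping those two sides separate is what makes the task well defined.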
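The power of treating the database as a graph shows up when you consider what information is relevant to a prediction. To predict customer churn, you need not just the customer's direct features (age, country) but also signals from related tables: their orders, their reviews, and the products those orders and reviews point to.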
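```python
# Traditional approach: manual feature engineering.
# `db` is assumed to be an open database handle; the SQL is PostgreSQL-flavored.
# Orders and reviews are aggregated in separate subqueries so that joining both
# does not multiply rows, and customers with no recent orders are still returned.
customer_features = db.execute("""
    SELECT c.id, c.age, c.country,
           COALESCE(o.num_orders, 0)  AS num_orders,
           o.avg_order_value,
           o.last_order_date,
           COALESCE(r.num_reviews, 0) AS num_reviews
    FROM customers c
    LEFT JOIN (
        SELECT customer_id,
               COUNT(*)       AS num_orders,
               AVG(amount)    AS avg_order_value,
               MAX(timestamp) AS last_order_date
        FROM orders
        WHERE timestamp > NOW() - INTERVAL '90 days'
        GROUP BY customer_id
    ) o ON o.customer_id = c.id
    LEFT JOIN (
        SELECT customer_id, COUNT(*) AS num_reviews
        FROM reviews
        GROUP BY customer_id
    ) r ON r.customer_id = c.id
""")
# ^ This took a DBA 2 weeks to write and validate

# RDL approach: let the GNN learn the aggregations.
# Pseudocode: build_graph_from_db, gnn, and classifier are placeholders, not a real API.
graph = build_graph_from_db(db)                   # automatic
embeddings = gnn(graph)                           # learns what to aggregate
predictions = classifier(embeddings['customer'])  # end-to-end
```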
Relational data is almost always timestamped. An order has a timestamp. A review has a timestamp. A login event has a timestamp. This temporal structure creates both an opportunity and a danger.
The opportunity: you can model time-evolving relationships. A customer's behavior in January might predict their churn in March. A product's review history might predict its future sales.
The danger: temporal leakage. If you use data from after the prediction time to make that prediction, you've given the model information it couldn't possibly have in production. This inflates metrics dramatically and produces models that fail in deployment.
In relational ML, splits must be temporal. You cannot randomly shuffle rows and split 80/10/10 — that would let the model see future interactions during training.
Leaving a temporal gap between consecutive splits is important. Without it, information from the end of training bleeds into the validation period, making validation scores overoptimistic. Real deployments have this gap; your evaluation should too.
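A temporal split with a gap can be sketched as follows. This is a minimal illustration under stated assumptions: rows carry a `timestamp` field, the `temporal_split` helper is hypothetical, and rows that fall inside a gap are simply dropped.

```python
from datetime import date, timedelta


def temporal_split(rows, val_start, test_start, gap_days=0):
    """Split rows by time, leaving `gap_days` unused before each later split."""
    gap = timedelta(days=gap_days)
    # Training data ends `gap_days` before validation begins.
    train = [r for r in rows if r["timestamp"] < val_start - gap]
    # Validation data ends `gap_days` before testing begins.
    val = [r for r in rows if val_start <= r["timestamp"] < test_start - gap]
    test = [r for r in rows if r["timestamp"] >= test_start]
    return train, val, test


rows = [
    {"id": i, "timestamp": d}
    for i, d in enumerate([
        date(2024, 1, 5),
        date(2024, 1, 28),   # falls in the gap before validation, dropped
        date(2024, 2, 10),
        date(2024, 2, 27),   # falls in the gap before test, dropped
        date(2024, 3, 15),
    ])
]
train, val, test = temporal_split(
    rows, val_start=date(2024, 2, 1), test_start=date(2024, 3, 1), gap_days=7
)
```

Unlike a random 80/10/10 shuffle, every training row here precedes every validation row, and every validation row precedes every test row.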
For each prediction (e.g., "will customer X churn between March 1 and March 31?"), you materialize a subgraph of the relational graph that includes only data with timestamps before March 1. This subgraph is what the GNN sees. March 1 is the cutoff time.
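Materializing at a cutoff amounts to filtering every timestamped table before the graph is built. The sketch below assumes rows are dicts with an optional `timestamp` field and that rows without a timestamp (for example, customers) are always visible; `materialize_at_cutoff` is a hypothetical helper, not a library function.

```python
from datetime import date


def materialize_at_cutoff(tables, cutoff, time_column="timestamp"):
    """Keep only rows timestamped strictly before `cutoff`.

    Rows without a timestamp are kept (an assumption made for this sketch).
    """
    snapshot = {}
    for name, rows in tables.items():
        snapshot[name] = [
            row for row in rows
            if row.get(time_column) is None or row[time_column] < cutoff
        ]
    return snapshot


tables = {
    "customer": [{"id": 1}],
    "order": [
        {"id": 100, "customer_id": 1, "timestamp": date(2024, 2, 10)},
        {"id": 101, "customer_id": 1, "timestamp": date(2024, 3, 5)},
    ],
}
snapshot = materialize_at_cutoff(tables, cutoff=date(2024, 3, 1))
```

Because edges are derived from rows, dropping the March 5 order also removes its edges, so the GNN never sees that interaction.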
Figure: The interaction graph available at a given cutoff time. Nodes and edges after the cutoff are hidden from the GNN, preventing leakage.
Let's trace the full journey from a relational database schema to a GNN prediction. This is the complete Relational Deep Learning pipeline, step by step.
Figure: The pipeline stages for a customer churn prediction task, from raw database tables to a GNN prediction.
```python
# RelBench-style RDL pipeline (simplified).
# NOTE: `Dataset`, `Task`, `make_graph`, `task.cutoff`, and `task.labels` are
# schematic stand-ins for illustration, not the exact relbench API.
import torch.nn.functional as F
from torch import nn
from relbench import Dataset, Task
from torch_geometric.nn import HeteroConv, SAGEConv

# Step 1: Load database + define task
dataset = Dataset("ecommerce")            # relational database
task = Task("customer-churn", dataset)    # churn in 30 days?

# Step 2: Materialize temporal graph
graph = dataset.make_graph(cutoff_time=task.cutoff)
# graph: HeteroData with node types and edge types from schema
#   graph['customer'].x : [N_cust, 128]  (encoded row features)
#   graph['order'].x    : [N_ord, 64]
#   graph['product'].x  : [N_prod, 96]
#   graph['customer', 'places', 'order'].edge_index : [2, E]

# Step 3-4: Heterogeneous GNN, one convolution per edge type.
# SAGEConv((d_src, d_dst), d_out): messages flow from src nodes to dst nodes.
conv = HeteroConv({
    ('customer', 'places', 'order'): SAGEConv((128, 64), 64),
    ('order', 'rev_places', 'customer'): SAGEConv((64, 128), 128),
    ('order', 'contains', 'product'): SAGEConv((64, 96), 96),
})
x_dict = conv(graph.x_dict, graph.edge_index_dict)

# Step 5: Task-specific head
mlp = nn.Linear(128, 1)                               # customer embeddings are 128-dim
churn_logits = mlp(x_dict['customer']).squeeze(-1)    # [N_cust]
loss = F.binary_cross_entropy_with_logits(churn_logits, task.labels.float())
```
Relational graphs are heterogeneous: different node types have different feature dimensions and different semantics. A Customer node has age and country; a Product node has price and category; an Order node has amount and timestamp. You can't apply the same weight matrix across all of them.
The solution is type-specific message passing. For each edge type (src_type, relation, dst_type), you learn a separate message function. This is exactly what PyTorch Geometric's HeteroConv does — it routes different edge types through different convolution modules.
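For a target node v of type τ(v), the update aggregates from all its neighbors across all incoming edge types. One standard form of this update, assuming mean aggregation within each edge type and a nonlinearity σ, is:

$$h_v^{(\ell+1)} = \sigma\left( \sum_{r \in R(v)} \frac{1}{|N_r(v)|} \sum_{u \in N_r(v)} W_r \, h_u^{(\ell)} \right)$$

Here $R(v)$ is the set of edge types pointing to v, $N_r(v)$ is the set of v's neighbors via edge type r, and $W_r$ is a learned weight matrix specific to that edge type. Each foreign key relationship gets its own learned transformation.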
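The type-specific update described above can be sketched in plain Python for a single target node. This is a minimal illustration, not PyTorch Geometric's implementation: it assumes mean aggregation within each edge type, a sum across edge types, and a ReLU nonlinearity, and the `matvec` and `hetero_update` helpers are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def hetero_update(neighbors_by_type, features, weights):
    """One message-passing step for a single target node.

    neighbors_by_type: edge type -> list of neighbor ids pointing at the target
    features:          node id -> feature vector (list of floats)
    weights:           edge type -> weight matrix W_r (out_dim x in_dim)
    """
    out_dim = len(next(iter(weights.values())))
    total = [0.0] * out_dim
    for edge_type, neighbor_ids in neighbors_by_type.items():
        if not neighbor_ids:
            continue
        W_r = weights[edge_type]
        # Transform each neighbor with the matrix for this edge type, then average.
        messages = [matvec(W_r, features[u]) for u in neighbor_ids]
        mean_msg = [sum(col) / len(messages) for col in zip(*messages)]
        # Sum the per-edge-type means.
        total = [t + m for t, m in zip(total, mean_msg)]
    return [max(0.0, t) for t in total]  # ReLU


places = ("order", "rev_places", "customer")
writes = ("review", "rev_writes", "customer")
features = {"o1": [1.0, 2.0], "o2": [3.0, 4.0], "r1": [1.0]}
neighbors = {places: ["o1", "o2"], writes: ["r1"]}
weights = {
    places: [[1.0, 0.0], [0.0, 1.0]],  # 2-dim order features -> 2-dim output
    writes: [[2.0], [-10.0]],          # 1-dim review features -> 2-dim output
}
h_customer = hetero_update(neighbors, features, weights)
```

Note that the order and review features have different dimensions; the per-edge-type matrices are what map them into a shared output space.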
Foreign keys are directional — an Order points to a Customer, not the other way. But for GNN message passing, you usually want bidirectional information flow: customers learn from their orders, but orders also learn from the customers who placed them (this gives orders "customer context" that may help predict other properties). The standard approach is to add reverse edges for every foreign key, giving each direction its own edge type (places vs. rev_places).
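Adding reverse edges is mechanical once edges are grouped by type. The sketch below assumes edges are stored as a dict from `(src_type, relation, dst_type)` to a list of `(src_id, dst_id)` pairs; the `add_reverse_edges` helper and the `rev_` prefix convention are illustrative.

```python
def add_reverse_edges(edges):
    """For each edge type (src, rel, dst), add (dst, 'rev_' + rel, src) with flipped pairs."""
    result = dict(edges)
    for (src, rel, dst), pairs in edges.items():
        result[(dst, "rev_" + rel, src)] = [(v, u) for u, v in pairs]
    return result


edges = {("customer", "places", "order"): [(1, 100), (1, 101), (2, 102)]}
bidirectional = add_reverse_edges(edges)
```

Keeping the reverse direction as a separate edge type lets the model learn a different transformation for "customer to order" than for "order to customer".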
Figure: Information flowing from different node types into a Customer node. Each edge type uses a different learned transformation (shown by color).
For general graph learning, there's OGB (Open Graph Benchmark). For knowledge graphs, there's FB15k-237. For recommendation, there's MovieLens and Amazon. But for relational ML, there was no standard benchmark. Everyone used different datasets, different splits, and different metrics, which made comparison impossible.
RelBench (Fey et al., 2023) fills this gap. It provides a collection of real-world relational databases with standardized train/validation/test splits, evaluation metrics, and leaderboards. The databases come from actual production domains: e-commerce, Stack Exchange, Wikipedia edits, and more.
| Dataset | Domain | Tables | Rows | Tasks |
|---|---|---|---|---|
| rel-amazon | E-commerce | 5 | ~10M | Review rating, churn |
| rel-stackex | Q&A forum | 8 | ~15M | Engagement, badge prediction |
| rel-wiki | Wikipedia | 4 | ~50M | Edit activity, article quality |
| rel-trial | Clinical trials | 7 | ~5M | Trial completion, adverse events |
| rel-hm | Fashion retail | 4 | ~30M | Purchase prediction |
Before RelBench, a paper claiming "RDL beats XGBoost on customer churn" was hard to evaluate — which churn dataset? Which features did XGBoost get? What was the evaluation protocol? With RelBench, comparisons are apples-to-apples. Everyone uses the same databases, same splits, same metrics. This is how science is supposed to work.
Does RDL actually work? Or is it a beautiful idea that loses to gradient boosting in practice? The honest answer from RelBench results: it depends on the task and the data — and the reasons are instructive.
RDL consistently outperforms XGBoost on tasks where multi-hop relational information is the key signal. If predicting customer churn requires knowing what their friends bought, what reviews they wrote, what products those reviews cover — information that spans 3-4 table hops — then GNNs can capture this while XGBoost's flat feature vector misses it.
Figure: Relative performance gain of GNN-RDL over the XGBoost baseline across tasks requiring different amounts of multi-hop relational reasoning. Bar height = improvement in AUROC (classification) or RMSE (regression).
XGBoost with carefully engineered features still competes with RDL on tasks where the predictive signal is mostly local — contained within a single table or one JOIN away. The reason: XGBoost's boosting and regularization are highly tuned for tabular data, and its features, though manual, can be crafted by an expert who knows the domain.
On newer, harder RelBench tasks introduced in 2024, RDL models consistently outperform XGBoost baselines — even XGBoost with extensive feature engineering. As tasks become more complex and datasets more relational, the advantage of learning from graph structure grows.
| Task type | Relational hops | RDL vs XGBoost |
|---|---|---|
| Single-table prediction | 0 | ≈ Tie (XGBoost slight edge) |
| 1-hop aggregation | 1 | RDL slight edge |
| Multi-hop relational | 3+ | RDL wins clearly |
| Cold-start entities | Any | RDL wins (content features help) |
RDL is promising, but it's not a drop-in replacement for everything. Three challenges stand out as the hardest open problems.
A production relational database might have 100 million customer rows, 500 million order rows, and 50 million product rows. Building the full graph and running message passing over it is infeasible on a single GPU. Each customer's 2-hop neighborhood might include millions of nodes.
Solutions borrowed from GNN scalability: mini-batching with neighbor sampling (same as GraphSAGE), graph partitioning, and pre-computing static features. But temporal materialization adds complexity — the subgraph changes for every prediction cutoff, so you can't easily reuse cached neighborhoods.
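Neighbor sampling with a temporal constraint can be sketched as follows. This is an illustrative simplification of GraphSAGE-style sampling, not a library routine: the `sample_temporal_neighbors` helper, the adjacency format (node to list of `(neighbor, edge_timestamp)` pairs), and the integer timestamps are all assumptions.

```python
import random


def sample_temporal_neighbors(adjacency, seed, seed_time, fanouts, rng=None):
    """Sample a multi-hop neighborhood using only edges older than `seed_time`.

    adjacency: node -> list of (neighbor, edge_timestamp)
    fanouts:   max neighbors to sample per node at each hop, e.g. [10, 5]
    """
    rng = rng or random.Random(0)
    visited = {seed}
    frontier = [seed]
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            # Keep unseen neighbors reached via edges strictly before the seed time.
            candidates = list(dict.fromkeys(
                nbr for nbr, ts in adjacency.get(node, [])
                if ts < seed_time and nbr not in visited
            ))
            picked = rng.sample(candidates, min(fanout, len(candidates)))
            visited.update(picked)
            next_frontier.extend(picked)
        frontier = next_frontier
    return visited


adjacency = {
    "c1": [("o1", 5), ("o2", 20)],
    "o1": [("p1", 5), ("c1", 5)],
    "o2": [("p2", 20)],
}
sampled = sample_temporal_neighbors(adjacency, seed="c1", seed_time=10, fanouts=[2, 2])
```

Because the filter depends on `seed_time`, the same seed node yields a different neighborhood at a different cutoff, which is why cached neighborhoods are hard to reuse.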
Temporal leakage is subtle. It's not just "don't use future labels." Consider a product's average review score. Computed over all time, it includes future reviews. Computed only over reviews timestamped before the cutoff, it's safe. Every aggregated feature needs a temporal filter. This is correct by construction in RDL (because the graph itself is materialized at cutoff), but manually engineered features often miss this.
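The difference between a leaky and a safe aggregate is easy to see on a toy example. The review rows and the `avg_rating_before` helper below are hypothetical.

```python
from datetime import date

reviews = [
    {"product_id": 10, "rating": 5, "timestamp": date(2024, 1, 10)},
    {"product_id": 10, "rating": 4, "timestamp": date(2024, 2, 15)},
    {"product_id": 10, "rating": 1, "timestamp": date(2024, 4, 2)},  # after the cutoff
]


def avg_rating_before(reviews, product_id, cutoff):
    """Average rating using only reviews timestamped strictly before `cutoff`."""
    ratings = [
        r["rating"] for r in reviews
        if r["product_id"] == product_id and r["timestamp"] < cutoff
    ]
    return sum(ratings) / len(ratings) if ratings else None


cutoff = date(2024, 3, 1)
safe_avg = avg_rating_before(reviews, 10, cutoff)
# Leaky version: averages over all time, including the future 1-star review.
leaky_avg = sum(r["rating"] for r in reviews) / len(reviews)
```

The leaky average already reflects the April review, so a model trained on it would see a signal that is unavailable at prediction time.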
Every database has a different schema. A model trained on one schema (e-commerce) doesn't transfer to another (healthcare) without retraining. This is the "universal encoder" problem — can you build a GNN that works on any schema without schema-specific design? This is the focus of Lecture 13 (advanced RDL architectures).
| Challenge | Current Solution | Open Problem |
|---|---|---|
| Scale (millions of rows) | Neighbor sampling, mini-batching | Temporal-aware sampling |
| Temporal correctness | Graph materialization at cutoff | Efficient re-materialization |
| Schema diversity | Schema-specific models | Universal relational encoders |
| Feature encoding | Pre-trained LMs for text/categoricals | Unified multi-modal encoding |
Relational Deep Learning sits at the intersection of several research communities that have been working in parallel: graph learning, tabular ML, and database systems. It unifies them under a single framework.
| Topic | Lesson | Connection |
|---|---|---|
| GNN fundamentals | Lecture 3: GNNs | Message passing, node embeddings |
| Heterogeneous graphs | Lecture 9: Hetero GNNs | Type-specific transformations |
| RecSys (bipartite) | Lecture 11: RecSys | User-item graph, link prediction |
| Advanced RDL | Lecture 13: Advanced RDL | Universal encoders, Griffin, scalability |
"The relational database is the most successful data abstraction in the history of computing. Making it a first-class citizen for machine learning is long overdue."
— paraphrasing the RDL research community