Fey, Hu, Huang, Lenssen, Ranjan, Robinson, Ying, You, Leskovec — 2023

Relational Deep Learning: Databases as Graphs

Every relational database is already a heterogeneous graph. Tables are node types. Foreign keys are edges. GNNs can learn directly from these graphs — no feature engineering, no hand-crafted joins. RelBench makes this concrete with 11 benchmark tasks.

Prerequisites: What a SQL table is + Basic GNN intuition. No relational algebra required.
8 chapters · 4+ simulations · arXiv:2312.04615

Chapter 0: The Problem

You're building a churn prediction model. Your data is in a relational database: a users table, an orders table, a products table, a reviews table. Each order references a user and a product. Each review references an order and a user.

The standard ML pipeline: call a data engineer. They write SQL joins to flatten everything into a single feature table — one row per user with 200 engineered features like "total orders in last 30 days," "average review sentiment," "product category diversity." Then you train XGBoost. The SQL took 3 weeks. The features are arbitrary choices. Relationships between entities are compressed into aggregations that lose structural information.

The hidden graph: Your database is a graph. Every foreign key reference is an edge. User 47 connects to Order 203, which connects to Product 91, which connects to a category and other orders. The relationships between entities carry information that flat features cannot capture — which users bought from the same suppliers, which products are purchased together, which users have similar review patterns. Feature engineering flattens all of this into scalars. Relational Deep Learning (RDL) leaves the graph intact and trains a GNN directly on it.

The practical cost of feature engineering is real:

[Interactive figure: Feature Engineering vs Graph Learning. The same relational data shown two ways: the flat feature approach (all information compressed to per-entity features) and the graph approach (entities as nodes, foreign keys as edges), with a highlighted multi-hop connection that flat features cannot capture.]

What structural information do foreign key relationships carry that standard feature engineering loses?

Chapter 1: Database as Graph

The translation from relational database to graph is mechanical — no choices, no ambiguity. Every table becomes a node type. Every row in a table becomes a node instance. Every foreign key relationship between tables becomes an edge type. Every row-to-row FK reference becomes an edge instance.

$$G = \left(V_{T_1} \cup V_{T_2} \cup \dots \cup V_{T_k},\; E_{\mathrm{FK}_1} \cup E_{\mathrm{FK}_2} \cup \dots \cup E_{\mathrm{FK}_m}\right)$$

A heterogeneous graph where node types correspond to tables and edge types correspond to foreign key relations. This is precisely the input format that heterogeneous GNNs (like HGT, or the RelBench GNN) operate on. The table columns become node features.

Concrete example — e-commerce database:
Tables: users (id, age, city), orders (id, user_id, date, total), products (id, name, category, price), order_items (order_id, product_id, quantity)
→ Node types: User, Order, Product, OrderItem
→ Edge types: User→Order (placed), Order→OrderItem (contains), OrderItem→Product (references)
→ Each User node's features: [age, city_embedding]. Each Order node: [date, total]. Each Product: [name_embedding, category, price].
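
The translation above can be sketched in plain Python. This is a minimal illustration, not the paper's converter: the row values are made up, and the `foreign_keys` format and the `schema_to_graph` helper are assumptions introduced here. The last field of each foreign-key entry records whether the edge runs parent → child, so the edge directions match the example (User→Order, Order→OrderItem, OrderItem→Product).

```python
from collections import defaultdict

# Toy tables as lists of row dicts (values are invented for illustration).
tables = {
    "users": [{"id": 1, "age": 34, "city": "Oslo"},
              {"id": 2, "age": 27, "city": "Lima"}],
    "orders": [{"id": 10, "user_id": 1, "date": "2023-05-01", "total": 59.0},
               {"id": 11, "user_id": 2, "date": "2023-05-03", "total": 12.5}],
    "products": [{"id": 100, "name": "kettle", "category": "kitchen", "price": 29.5}],
    "order_items": [{"order_id": 10, "product_id": 100, "quantity": 2},
                    {"order_id": 11, "product_id": 100, "quantity": 1}],
}

# (child table, FK column, parent table, edge name, edge runs parent -> child?)
foreign_keys = [
    ("orders", "user_id", "users", "placed", True),
    ("order_items", "order_id", "orders", "contains", True),
    ("order_items", "product_id", "products", "references", False),
]

def schema_to_graph(tables, foreign_keys):
    """Return (nodes, edges): one node per row, one edge per FK reference."""
    fk_columns = {(child, column) for child, column, *_ in foreign_keys}
    # Node index = row position; key columns are not used as features.
    nodes = {
        table: [{c: v for c, v in row.items()
                 if c != "id" and (table, c) not in fk_columns}
                for row in rows]
        for table, rows in tables.items()
    }
    # Primary key -> node index, for tables that have an "id" column.
    pk_index = {t: {row["id"]: i for i, row in enumerate(rows)}
                for t, rows in tables.items() if rows and "id" in rows[0]}
    edges = defaultdict(list)
    for child, column, parent, name, parent_is_source in foreign_keys:
        for child_idx, row in enumerate(tables[child]):
            parent_idx = pk_index[parent][row[column]]
            if parent_is_source:
                edges[(parent, name, child)].append((parent_idx, child_idx))
            else:
                edges[(child, name, parent)].append((child_idx, parent_idx))
    return nodes, dict(edges)

nodes, edges = schema_to_graph(tables, foreign_keys)
```

Each key of `edges` is an edge type written as (source type, relation, target type), which is the same triple format heterogeneous GNN libraries typically expect.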

Feature Encoding for Tables

Table columns come in several types, each needing a different encoding strategy:

These features are concatenated into a per-node feature vector. Different node types may have different feature dimensions — the GNN handles this by projecting all types into a common latent dimension at the first layer.
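
As a rough illustration of per-column encoding followed by projection, the sketch below assumes three common column kinds (numerical, categorical, timestamp); that choice is an assumption made here, not the paper's list. All helper names are hypothetical, and the projection weights are arbitrary stand-ins for learned parameters.

```python
import math
from datetime import date

def encode_numeric(values):
    """Standardize to zero mean and unit variance (population std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [[(v - mean) / std] for v in values]

def encode_categorical(values):
    """One-hot encode using the sorted set of observed categories."""
    vocab = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in vocab] for v in values]

def encode_timestamp(values, reference):
    """Encode a date as the number of days before a reference date."""
    return [[float((reference - v).days)] for v in values]

def concat(*columns):
    """Concatenate per-column encodings row by row."""
    return [sum(parts, []) for parts in zip(*columns)]

def project(features, weight):
    """Linear projection: each length-n row times an n x d weight matrix."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
            for row in features]

ages = [34, 27, 41]
cities = ["Oslo", "Lima", "Oslo"]
# Each user row has 1 numeric entry + 2 one-hot entries = 3 features.
user_features = concat(encode_numeric(ages), encode_categorical(cities))

# A per-type weight matrix maps the 3 raw features to a shared d = 2 space.
# These weights are arbitrary; in a GNN they are learned.
w_user = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]
user_latent = project(user_features, w_user)

# Timestamps become "days before the reference date".
order_age = encode_timestamp([date(2023, 5, 1)], reference=date(2023, 6, 1))
```

Another node type (say, Product) would have its own raw feature width and its own weight matrix, but the same output width `d`, which is what lets message passing mix node types.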

Schema to Graph Translator

An e-commerce database schema. Toggle between the schema view (tables + FK arrows) and the graph view (nodes + edges). The information is identical — only the representation changes.

What do foreign key relationships become in the relational graph?

Chapter 2: GNN on Relations

Once the database is a graph, the GNN processes it with standard heterogeneous message passing. For each prediction task (e.g., "will this user churn in the next 30 days?"), the GNN computes an embedding for the target node (user) by aggregating information from its neighborhood — orders, products, reviews — up to K hops away.

1. Input Graph: a heterogeneous graph with Users, Orders, Products, and Reviews as node types, FK relations as edges, and column values as node features.
2. Linear Projection: for each node type, $W_{\text{type}} \cdot x_v \to h_v^{(0)} \in \mathbb{R}^d$. All types are projected to a common $d$-dimensional space.
3. Heterogeneous Message Passing (K layers): for each edge type (u-type, relation, v-type), $m_e = \mathrm{MLP}_{\text{relation}}(h_u^{(k)})$. Aggregate: $h_v^{(k+1)} = \mathrm{AGG}(\{m_e : e \text{ points to } v, \text{ by type}\})$.
4. Readout + MLP head: $h_v^{(K)}$ for target nodes → MLP → prediction (binary, regression, ranking).
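
To make the message-passing step concrete, here is a toy single layer in plain Python. It is a simplified sketch under stated assumptions: the per-relation MLP is replaced by a single linear map, AGG is a plain sum, there is no self-connection or nonlinearity, and the embeddings, edge names, and weights are invented for illustration.

```python
def matvec(weight, x):
    """Multiply a d_out x d_in matrix by a length-d_in vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def hetero_layer(h, edges, weights):
    """One layer: per-relation linear message, sum-aggregated at the target.

    h:       {node_type: [embedding, ...]}
    edges:   {(src_type, relation, dst_type): [(src_idx, dst_idx), ...]}
    weights: {(src_type, relation, dst_type): matrix}, one per edge type
    """
    d = len(next(iter(weights.values())))
    # Nodes with no incoming edges keep an all-zero output in this sketch.
    out = {t: [[0.0] * d for _ in embs] for t, embs in h.items()}
    for edge_type, pairs in edges.items():
        src_type, _, dst_type = edge_type
        for src, dst in pairs:
            message = matvec(weights[edge_type], h[src_type][src])
            out[dst_type][dst] = [a + m for a, m in zip(out[dst_type][dst], message)]
    return out

h = {"user": [[1.0, 0.0]],
     "order": [[0.0, 2.0], [1.0, 1.0]],
     "review": [[3.0, 0.0]]}
edges = {("order", "rev_placed", "user"): [(0, 0), (1, 0)],
         ("review", "rev_wrote", "user"): [(0, 0)]}
weights = {("order", "rev_placed", "user"): [[1.0, 0.0], [0.0, 1.0]],   # identity
           ("review", "rev_wrote", "user"): [[0.0, 1.0], [1.0, 0.0]]}  # swap
out = hetero_layer(h, edges, weights)
```

Because each edge type has its own weight matrix, an order and a review with identical embeddings would still send different messages to the same user, which is the point of keeping relation-specific parameters.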

[Interactive figure: Relational GNN, Live Message Passing. A small e-commerce graph in which any node can be selected as the prediction target and the message-passing depth K adjusted to show how information flows from related nodes. Colors show information density reaching the target.]

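python (PyG HeteroConv)
import torch
from torch_geometric.nn import HeteroConv, SAGEConv

# One SAGEConv per edge type; (-1, -1) lets PyG infer input sizes lazily.
# Reverse edge types are listed explicitly: without them no messages
# would ever reach 'user' nodes and x_dict['user'] would be missing.
conv = HeteroConv({
    ('user', 'placed', 'order'): SAGEConv((-1, -1), 64),
    ('order', 'rev_placed', 'user'): SAGEConv((-1, -1), 64),
    ('order', 'contains', 'product'): SAGEConv((-1, -1), 64),
    ('product', 'rev_contains', 'order'): SAGEConv((-1, -1), 64),
    ('user', 'wrote', 'review'): SAGEConv((-1, -1), 64),
    ('review', 'rev_wrote', 'user'): SAGEConv((-1, -1), 64),
}, aggr='sum')
mlp_head = torch.nn.Linear(64, 1)  # prediction head on the user embedding

# Forward pass: x_dict maps node type -> feature tensor,
# edge_index_dict maps edge type -> [2, num_edges] index tensor.
x_dict = conv(x_dict, edge_index_dict)
# After stacking K such layers, x_dict['user'] holds the user embeddings
user_logits = mlp_head(x_dict['user'])  # [N_users, 1]

Why does the relational GNN need separate MLPs for each edge type?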

Chapter 3: Temporal Data

Real relational databases are not static snapshots — they grow over time. Orders accumulate. User behavior evolves. Product reviews arrive daily. A churn prediction model trained naively on historical data will suffer from temporal leakage: using future information to predict the past.

Temporal leakage is subtle in graphs. Suppose you want to predict whether User A churned in June, and User A's most recent order was in July (after the target date). If the graph includes that July order, the model can infer "User A was still active in July, so probably not churned." But at prediction time (June), the July order does not exist yet. The model has been trained on a time-traveling graph: validation accuracy is inflated because the validation graph also contains future edges, and performance drops at deployment, where no future data is available.

RDL enforces a temporal cutoff for each prediction task. For a target entity at time t*, only nodes and edges with timestamps ≤ t* are included in the graph. This requires:

Temporal node features: Beyond just filtering, timestamps are informative features. "This order was placed 3 days ago" and "this order was placed 2 years ago" carry different signals. RDL encodes relative time deltas — the difference between each entity's timestamp and the prediction cutoff t* — as additional node features. This lets the model learn time-decay patterns: recent activity is weighted more than old activity.
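python
def build_temporal_graph(db, target_entity, cutoff_time):
    # db.orders and db.reviews are assumed to be pandas DataFrames.
    # Only include events at or before the prediction cutoff (timestamp <= t*).
    # .copy() makes the column assignments below safe on the filtered frames.
    filtered_orders = db.orders[db.orders.timestamp <= cutoff_time].copy()
    filtered_reviews = db.reviews[db.reviews.timestamp <= cutoff_time].copy()

    # Encode relative time delta (days before the cutoff) as a node feature
    filtered_orders['days_ago'] = (cutoff_time - filtered_orders.timestamp).dt.days
    filtered_reviews['days_ago'] = (cutoff_time - filtered_reviews.timestamp).dt.days

    # Build the heterogeneous graph from the filtered tables.
    # schema_to_graph stands for the schema-to-graph converter; users and
    # products are treated here as static tables without timestamps.
    # target_entity is unused in this simplified version; a sampler would
    # use it to restrict the graph to the target's neighborhood.
    graph = schema_to_graph(db.users, filtered_orders, filtered_reviews, db.products)
    return graph

# Training: each example gets its own temporally consistent graph.
# gnn, bce_loss, and training_examples are assumed to be defined elsewhere.
for user, label, cutoff in training_examples:
    g = build_temporal_graph(db, user, cutoff)
    pred = gnn(g)[user]
    loss = bce_loss(pred, label)

What is temporal leakage in relational graph learning, and how does RDL prevent it?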

Chapter 4: RelBench

RelBench is the benchmark suite introduced alongside the RDL paper. 11 tasks across 5 real-world relational databases, each with standardized train/validation/test splits that respect temporal ordering. No data leakage. No cherry-picked tasks.

The Databases

The 11 Tasks

Tasks span classification, regression, and ranking. Classification tasks are scored with AUROC, regression tasks with MAE/RMSE, and ranking tasks with top-k precision, all under temporal validation to prevent leakage. A sample of the tasks:

| Database | Task | Type | Target |
| --- | --- | --- | --- |
| rel-amazon | user-churn | Binary | Will user post another review in 30d? |
| rel-amazon | item-churn | Binary | Will item get reviewed in 30d? |
| rel-stack | user-engage | Binary | Will user answer a question in 30d? |
| rel-stack | post-votes | Regression | How many upvotes will post receive? |
| rel-hm | user-item-purchase | Ranking | Which articles will customer buy next? |
| rel-trial | study-outcome | Binary | Will trial report positive outcome? |

Why RelBench matters: Before RelBench, relational ML papers evaluated on hand-crafted benchmarks with no standardized leakage prevention. RelBench's enforced temporal split means that a method that achieves good numbers has actually learned from historical relationships — not from peeking at the future. It also enables fair comparison between feature-engineering baselines (XGBoost on manual features) and GNN methods.
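
A temporal split of this kind can be sketched as follows. This is an illustrative helper, not RelBench's API: the function name, the tuple layout `(entity, cutoff, label)`, and the dates are all assumptions made for the example.

```python
from datetime import date

def temporal_split(examples, val_time, test_time):
    """Assign each (entity, cutoff, label) example to a split by its cutoff.

    Cutoffs before val_time go to train, cutoffs in [val_time, test_time)
    go to validation, and cutoffs at or after test_time go to test.
    """
    splits = {"train": [], "val": [], "test": []}
    for example in examples:
        cutoff = example[1]
        if cutoff < val_time:
            splits["train"].append(example)
        elif cutoff < test_time:
            splits["val"].append(example)
        else:
            splits["test"].append(example)
    return splits

examples = [
    ("user_1", date(2022, 3, 1), 0),
    ("user_2", date(2022, 9, 1), 1),
    ("user_3", date(2023, 1, 15), 0),
    ("user_4", date(2023, 6, 1), 1),
]
splits = temporal_split(examples,
                        val_time=date(2023, 1, 1),
                        test_time=date(2023, 4, 1))
```

Every training example then predates every validation example, which in turn predates every test example. A stricter variant would also require each label window to end before the next split begins.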
What does RelBench's standardized temporal split prevent?

Chapter 5: vs XGBoost

XGBoost on manually engineered features is the standard baseline for relational prediction tasks. It's fast, interpretable, and often competitive. The RDL paper's key claim: the GNN on the raw graph beats XGBoost on a significant fraction of tasks — and the gap grows with the complexity of the relational structure.

When does GNN win? The GNN wins on tasks where multi-hop relational structure carries signal that aggregations miss. "Users who share product preferences with highly active users tend to stay engaged" is a 4-hop pattern (User → Order → Product ← Order ← User) that XGBoost cannot use unless someone explicitly engineers a feature for it. The GNN can learn such patterns directly from the graph, provided it has enough message-passing layers to cover the path.
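
To see what the User → Order → Product ← Order ← User path involves, the sketch below walks it by hand on toy data. The tables and the `co_purchasers` helper are hypothetical; the point is that a flat-feature pipeline must compute something like this explicitly for every such pattern.

```python
from collections import defaultdict

# (user, order) and (order, product) foreign-key references; toy data.
placed = [("u1", "o1"), ("u2", "o2"), ("u3", "o3")]
contains = [("o1", "p1"), ("o2", "p1"), ("o3", "p2")]

def co_purchasers(placed, contains):
    """Map each user to the other users who bought at least one same product."""
    order_user = {order: user for user, order in placed}
    # Product -> set of users who ordered it (User -> Order -> Product).
    buyers = defaultdict(set)
    for order, product in contains:
        buyers[product].add(order_user[order])
    # Walk back from each product to its other buyers (Product <- Order <- User).
    linked = defaultdict(set)
    for users in buyers.values():
        for user in users:
            linked[user] |= users - {user}
    return dict(linked)

neighbors = co_purchasers(placed, contains)
```

Here `u1` and `u2` are linked through product `p1`, while `u3` has no co-purchasers. A GNN with sufficient depth can propagate information along the same path without this join being written by hand.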

| Task | XGBoost (manual features) | RDL-GNN | Winner (relative gain) |
| --- | --- | --- | --- |
| user-churn (Amazon) | 0.72 AUROC | 0.78 AUROC | GNN (+8%) |
| post-votes (StackEx) | 0.64 RMSE norm. | 0.59 RMSE norm. | GNN (+8%) |
| study-outcome (Trial) | 0.68 AUROC | 0.71 AUROC | GNN (+4%) |
| user-item (H&M) | 0.021 precision@10 | 0.028 precision@10 | GNN (+33%) |
| item-churn (Amazon) | 0.69 AUROC | 0.67 AUROC | XGBoost (+3%) |

GNN wins in most tasks. XGBoost wins occasionally — specifically on tasks where the relational structure is shallow (entities don't chain deeply) and where the target is mainly predictable from the entity's own historical features.
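
The percentages in the comparison table appear to be relative gains over the losing method's score. The sketch below reproduces them under that reading; it is an interpretation of the table, not something the source states. For RMSE, where lower is better, the gain is taken as the relative reduction.

```python
def relative_gain(baseline, candidate, higher_is_better=True):
    """Relative improvement of candidate over baseline, as a percentage."""
    delta = candidate - baseline if higher_is_better else baseline - candidate
    return 100.0 * delta / baseline

gains = {
    "user-churn (AUROC)": relative_gain(0.72, 0.78),
    "post-votes (RMSE)": relative_gain(0.64, 0.59, higher_is_better=False),
    "study-outcome (AUROC)": relative_gain(0.68, 0.71),
    "user-item (precision@10)": relative_gain(0.021, 0.028),
    # XGBoost is the winner here, so the GNN score is the baseline.
    "item-churn (AUROC)": relative_gain(0.67, 0.69),
}

for task, gain in gains.items():
    print(f"{task}: {gain:.1f}%")
```

Note that a relative gain in AUROC is not a percentage-point gain: 0.72 → 0.78 is six points of AUROC but roughly an 8% relative improvement.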

The real cost comparison: XGBoost results require weeks of feature engineering by domain experts. The GNN results required writing a schema-to-graph converter (a few hundred lines of code, reusable across tasks) and running training. The per-task engineering cost for GNN is near zero once the converter exists. XGBoost's total engineering cost is the bottleneck, not model training time. For organizations running many prediction tasks on the same database, RDL's one-time setup cost is amortized across all tasks.

[Interactive figure: Performance vs Relational Depth. GNN advantage grows with the depth of relational structure in the task. Tasks where the answer requires multi-hop reasoning show larger GNN vs XGBoost gaps.]

On which types of tasks does the GNN most consistently outperform XGBoost with manual features?

Chapter 6: Scale

Real production databases have millions of rows per table. A naively constructed graph would have millions of nodes and hundreds of millions of edges — far too large for a full-graph GNN forward pass. How does RDL handle this?

Mini-batch Subgraph Sampling

The same strategy used by GraphSAGE and neighbor sampling GNNs: for each training example (target node), extract a K-hop subgraph by sampling a fixed number of neighbors at each hop. This subgraph is small regardless of global graph size.

For a K=2 GNN with fanout [25, 25]: at hop 1, sample at most 25 neighbors of the target node. At hop 2, sample at most 25 neighbors of each hop-1 node. Maximum subgraph size: 1 + 25 + 625 = 651 nodes — tiny, regardless of whether the graph has 1 million or 100 million total nodes.
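
The bound generalizes to any fanout list: the worst case is one target node plus the running product of fanouts at each hop. A small helper (the function name is made up for this example) makes the arithmetic explicit.

```python
def max_subgraph_nodes(fanout):
    """Upper bound on sampled nodes: the target plus each hop's worst-case frontier."""
    total, frontier = 1, 1
    for k in fanout:
        frontier *= k      # each frontier node contributes at most k neighbors
        total += frontier
    return total

print(max_subgraph_nodes([25, 25]))  # 1 + 25 + 625 = 651
```

The bound depends only on the fanout list, never on the size of the full graph. Adding a third hop with the same fanout multiplies the last term by 25 again, which is why deep sampling is usually paired with smaller fanouts.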

Temporal subgraph sampling: Each training example has a different cutoff time. The subgraph must include only nodes/edges before that cutoff. Efficient temporal sampling requires pre-sorting neighbors by timestamp and binary-searching for the cutoff. RelBench's implementation does this — building time-indexed adjacency lists for each edge type.
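
The pre-sort plus binary-search idea can be sketched with the standard library. This is an illustrative structure, not RelBench's implementation: the class name, the edge tuple layout, and the integer timestamps are assumptions.

```python
import bisect
import random

class TemporalAdjacency:
    """Per-node neighbor lists sorted by timestamp, for cutoff-aware sampling."""

    def __init__(self, edges):
        # edges: iterable of (src, dst, timestamp); neighbors are stored per src.
        grouped = {}
        for src, dst, ts in edges:
            grouped.setdefault(src, []).append((ts, dst))
        self.times, self.neighbors = {}, {}
        for src, pairs in grouped.items():
            pairs.sort(key=lambda pair: pair[0])  # sort once, up front
            self.times[src] = [ts for ts, _ in pairs]
            self.neighbors[src] = [dst for _, dst in pairs]

    def sample(self, node, cutoff, fanout, rng=random):
        """Sample up to `fanout` neighbors whose timestamp is <= cutoff."""
        times = self.times.get(node, [])
        # bisect_right returns how many timestamps are <= cutoff.
        end = bisect.bisect_right(times, cutoff)
        valid = self.neighbors.get(node, [])[:end]
        if len(valid) <= fanout:
            return valid
        return rng.sample(valid, fanout)

adj = TemporalAdjacency([("u1", "o2", 12), ("u1", "o1", 5), ("u1", "o3", 20)])
before_12 = adj.sample("u1", cutoff=12, fanout=25)
before_4 = adj.sample("u1", cutoff=4, fanout=25)
capped = adj.sample("u1", cutoff=30, fanout=2)
```

Sorting happens once at construction. Each lookup of the valid prefix is then a binary search, so examples with different cutoffs can share the same index.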

Node-Level Partitioning for Very Large Graphs

For graphs too large for single-machine sampling (billion-node graphs), graph partitioning distributes nodes across machines. Each machine owns a partition of nodes and their local edges. During training, cross-partition edges require communication. PyG's torch_geometric.distributed package (for example, its DistNeighborSampler) implements distributed neighbor sampling of this kind.
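
The cost of partitioning comes from edges whose endpoints live on different machines. The toy sketch below uses a deliberately naive modulo assignment (an assumption for illustration; real partitioners such as METIS try to minimize the number of cross-partition edges) and counts which edges would need communication.

```python
def partition_nodes(num_nodes, num_parts):
    """Assign node i to partition i % num_parts (a naive stand-in for a real partitioner)."""
    return [i % num_parts for i in range(num_nodes)]

def edge_cut(edges, assignment):
    """Split edges into local ones and cross-partition ones that need communication."""
    local = [(u, v) for u, v in edges if assignment[u] == assignment[v]]
    cross = [(u, v) for u, v in edges if assignment[u] != assignment[v]]
    return local, cross

edges = [(0, 2), (1, 3), (0, 1), (2, 3)]
assignment = partition_nodes(4, 2)
local, cross = edge_cut(edges, assignment)
```

With this assignment, nodes 0 and 2 share a machine, as do nodes 1 and 3, so two of the four edges are local and two cross the boundary. Sampling a neighborhood that follows a cross edge means fetching features from the other machine.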

| Scale | Technique | Graph in Memory | Batch Contents |
| --- | --- | --- | --- |
| Small (<100K nodes) | Full-graph training | Entire graph in memory | All nodes |
| Medium (100K–10M) | Neighbor sampling | K-hop subgraph per example | Sampled subgraph |
| Large (10M–1B) | Distributed sampling | Per-machine partition | Cross-machine subgraph |
| Very large (>1B) | Cluster-GCN + partition | Cluster per batch | Cluster subgraph |

How does neighbor sampling make GNN training on million-row databases feasible?

Chapter 7: Connections

Relational Deep Learning connects decades of relational database research to the modern GNN literature. Understanding these connections situates RDL not as a niche contribution but as a bridge between two large research communities.

| Method | Key Idea | Relation to RDL |
| --- | --- | --- |
| Inductive Logic Programming (Muggleton 1991) | Learn logical rules over relations | Same motivation, different representation (rules vs neural) |
| Probabilistic Relational Models (Getoor 2007) | Probabilistic graphical models for relational data | Precursor: probabilistic instead of neural |
| HGT (Hu et al. 2020) | Heterogeneous GNN with attention | One valid backbone for RDL's GNN step |
| RDL (this paper) | DB-as-graph + temporal split + RelBench | |
| TabNet / AutoML | End-to-end learning on single flat tables | No relational structure; RDL is the multi-table extension |
| RELBENCH (future) | Community benchmark extensions | Direct descendant: benchmark expected to grow |

The key technical novelty of RDL is not a new GNN architecture — it's the problem formulation. Prior work either (a) built specialized methods for specific relational tasks (knowledge graph completion, citation networks), or (b) required massive feature engineering to reach a single flat table. RDL's contribution is showing that the schema-to-graph translation is mechanical, the temporal leakage issue is solvable with a principled cutoff, and standard heterogeneous GNNs beat the feature-engineering baseline on the resulting tasks. The benchmark makes this reproducible and comparable.

Limitations: (1) Text and image columns: RDL handles these with pre-trained encoders, so prediction quality depends on the quality of those encodings. (2) Very complex temporal patterns: RDL uses a simple cutoff time; richer time-series patterns (seasonality, trends) require additional modeling. (3) Schema changes: adding a new table requires rerunning the schema-to-graph conversion. (4) Interpretability: the GNN's predictions are as opaque as those of any neural model, whereas XGBoost's feature importances are easier to explain to stakeholders. (5) Cold-start: nodes with few connections (new users, rare products) are poorly represented, as with any GNN.

Go Deeper

  • HGT (Hu et al. 2020) — heterogeneous GNN backbone for RDL
  • GraphSAGE (Hamilton et al. 2017) — neighbor sampling for large graphs
  • RGCN (Schlichtkrull 2018) — relational GCN for KB completion

Key Paper

Fey, Hu, Huang, Lenssen, Ranjan, Robinson, Ying, You, Leskovec. "Relational Deep Learning: Graph Representation Learning on Relational Databases." 2023. arXiv:2312.04615

"Every enterprise database is a graph. We just stopped pretending otherwise."