Every relational database is already a heterogeneous graph. Tables are node types. Foreign keys are edges. GNNs can learn directly from these graphs — no feature engineering, no hand-crafted joins. RelBench makes this concrete with 11 benchmark tasks.
You're building a churn prediction model. Your data is in a relational database: a users table, an orders table, a products table, a reviews table. Each order references a user and a product. Each review references an order and a user.
The standard ML pipeline: call a data engineer. They write SQL joins to flatten everything into a single feature table — one row per user with 200 engineered features like "total orders in last 30 days," "average review sentiment," "product category diversity." Then you train XGBoost. The SQL took 3 weeks. The features are arbitrary choices. Relationships between entities are compressed into aggregations that lose structural information.
The practical cost of feature engineering is real:
The same relational data shown two ways. Left: the flat feature approach (all information compressed to per-entity features). Right: the graph approach (entities as nodes, foreign keys as edges). The graph view preserves multi-hop connections that flat features cannot capture.
The translation from relational database to graph is mechanical — no choices, no ambiguity. Every table becomes a node type. Every row in a table becomes a node instance. Every foreign key relationship between tables becomes an edge type. Every row-to-row FK reference becomes an edge instance.
The result is a heterogeneous graph in which node types correspond to tables and edge types correspond to foreign key relations. This is precisely the input format that heterogeneous GNNs (such as HGT or the RelBench GNN) operate on. The table columns become node features.
Example schema: users (id, age, city), orders (id, user_id, date, total), products (id, name, category, price), order_items (order_id, product_id, quantity)
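The mechanical translation can be sketched in a few lines of plain Python. This is a minimal illustration, not the RelBench implementation: the rows are made-up toy data, `database_to_graph` is a hypothetical helper, and the sketch assumes every referenced table has an `id` primary key.

```python
# Toy rows for the example schema; all values are invented for illustration.
tables = {
    "users": [{"id": 1, "age": 34, "city": "Oslo"}, {"id": 2, "age": 27, "city": "Lima"}],
    "orders": [
        {"id": 10, "user_id": 1, "date": "2024-01-05", "total": 59.0},
        {"id": 11, "user_id": 1, "date": "2024-02-01", "total": 80.0},
        {"id": 12, "user_id": 2, "date": "2024-01-07", "total": 12.5},
    ],
    "products": [
        {"id": 100, "name": "mug", "category": "kitchen", "price": 12.5},
        {"id": 101, "name": "rake", "category": "garden", "price": 80.0},
    ],
    "order_items": [
        {"order_id": 10, "product_id": 100, "quantity": 2},
        {"order_id": 11, "product_id": 101, "quantity": 1},
        {"order_id": 12, "product_id": 100, "quantity": 1},
    ],
}

# Foreign keys as (child table, FK column, parent table).
foreign_keys = [
    ("orders", "user_id", "users"),
    ("order_items", "order_id", "orders"),
    ("order_items", "product_id", "products"),
]


def database_to_graph(tables, foreign_keys, primary_key="id"):
    """Map every row to a node and every FK reference to an edge.

    Returns (node_index, edges): node_index[table][pk] is the row's position
    within its node type, and edges[(child, fk, parent)] is a list of
    (child_position, parent_position) pairs.
    """
    node_index = {
        name: {row[primary_key]: pos for pos, row in enumerate(rows) if primary_key in row}
        for name, rows in tables.items()
    }
    edges = {}
    for child, fk_col, parent in foreign_keys:
        edges[(child, fk_col, parent)] = [
            (pos, node_index[parent][row[fk_col]])
            for pos, row in enumerate(tables[child])
        ]
    return node_index, edges


node_index, edges = database_to_graph(tables, foreign_keys)
print(edges[("orders", "user_id", "users")])
```

Each table yields one node type (one node per row), and each foreign key yields one edge type, with no modeling decisions made along the way.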
Table columns come in several types, each needing a different encoding strategy:
These features are concatenated into a per-node feature vector. Different node types may have different feature dimensions — the GNN handles this by projecting all types into a common latent dimension at the first layer.
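The encode, concatenate, and project steps can be sketched as follows. The specific encoders (z-scoring numeric columns, one-hot encoding categorical ones) are assumptions chosen for illustration, and the random matrices merely stand in for the learned first-layer projection.

```python
import random


def zscore(values):
    """Standardize a numeric column; returns one 1-dim vector per row."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # fall back to 1.0 for constant columns
    return [[(v - mean) / std] for v in values]


def one_hot(values):
    """One-hot encode a categorical column over its observed vocabulary."""
    vocab = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in vocab] for v in values]


def concat(*column_blocks):
    """Concatenate per-column encodings row by row."""
    return [sum(parts, []) for parts in zip(*column_blocks)]


def make_projection(in_dim, out_dim, rng):
    """Stand-in for a learned linear layer: one weight matrix per node type."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features, weight):
    """Apply the per-type linear map to every node's feature vector."""
    return [[sum(w * x for w, x in zip(row_w, feat)) for row_w in weight] for feat in features]


rng = random.Random(0)
hidden_dim = 8

# users: age (numeric) + city (2 categories) -> 3 raw dims
user_x = concat(zscore([34, 27, 45]), one_hot(["Oslo", "Lima", "Oslo"]))
# products: price (numeric) + category (3 categories) -> 4 raw dims
product_x = concat(zscore([12.5, 80.0, 5.0]), one_hot(["kitchen", "garden", "office"]))

# Different raw dimensions, same latent dimension after projection.
user_h = project(user_x, make_projection(len(user_x[0]), hidden_dim, rng))
product_h = project(product_x, make_projection(len(product_x[0]), hidden_dim, rng))
print(len(user_x[0]), len(product_x[0]), len(user_h[0]), len(product_h[0]))
```

The point of the per-type projection is that later layers can mix messages from users and products without caring that their raw feature vectors had different widths.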
An e-commerce database schema, shown as a schema view (tables + FK arrows) and as a graph view (nodes + edges). The information is identical; only the representation changes.
Once the database is a graph, the GNN processes it with standard heterogeneous message passing. For each prediction task (e.g., "will this user churn in the next 30 days?"), the GNN computes an embedding for the target node (user) by aggregating information from its neighborhood — orders, products, reviews — up to K hops away.
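To make the K-hop aggregation concrete, here is a deliberately simplified sketch of heterogeneous message passing. It assumes all node types already share one embedding dimension, and it replaces the learned per-edge-type transformations with a plain unweighted mean, so it shows only how information propagates, not what a trained GNN computes.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def message_passing_layer(h, edges):
    """One round of mean aggregation over typed edges.

    h:     {node_type: [vector, ...]}
    edges: {(src_type, relation, dst_type): [(src_pos, dst_pos), ...]}
    """
    inbox = {t: [[] for _ in vecs] for t, vecs in h.items()}
    for (src_type, _relation, dst_type), pairs in edges.items():
        for src, dst in pairs:
            inbox[dst_type][dst].append(h[src_type][src])
    # New state = average of the node's own state and its incoming messages.
    return {
        t: [mean([own] + inbox[t][i]) for i, own in enumerate(vecs)]
        for t, vecs in h.items()
    }


# product -> order -> user chain; only the product carries a signal initially.
h0 = {"user": [[0.0]], "order": [[0.0]], "product": [[1.0]]}
edges = {
    ("product", "in", "order"): [(0, 0)],
    ("order", "placed_by", "user"): [(0, 0)],
}

h1 = message_passing_layer(h0, edges)
h2 = message_passing_layer(h1, edges)
print(h1["user"][0], h2["user"][0])  # [0.0] [0.25]
```

After one layer the user has seen only its order; the product's signal arrives at the user only with the second layer, which is why depth K determines how far across tables the model can look.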
A small e-commerce graph. Choosing a node as the prediction target and varying the message-passing depth shows how information flows in from related nodes; colors indicate the information density reaching the target.
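```python
# PyG HeteroConv
import torch
from torch_geometric.nn import HeteroConv, SAGEConv

# Define one conv per edge type. HeteroConv only returns embeddings for
# node types that appear as a destination, so 'user' needs incoming
# (reverse) edges to show up in the output.
conv = HeteroConv({
    ('user', 'placed', 'order'): SAGEConv((-1, -1), 64),
    ('order', 'contains', 'product'): SAGEConv((-1, -1), 64),
    ('user', 'wrote', 'review'): SAGEConv((-1, -1), 64),
    # ... reverse edges for bidirectional flow
}, aggr='sum')

# Forward pass: x_dict = {node_type: feature_tensor},
# edge_index_dict = {edge_type: edge_index}, e.g. from a HeteroData object
x_dict = conv(x_dict, edge_index_dict)

# After K layers, x_dict['user'] holds the embeddings used for churn prediction
mlp_head = torch.nn.Linear(64, 1)       # stand-in for the prediction head
user_logits = mlp_head(x_dict['user'])  # [N_users, 1]
```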
Real relational databases are not static snapshots — they grow over time. Orders accumulate. User behavior evolves. Product reviews arrive daily. A churn prediction model trained naively on historical data will suffer from temporal leakage: using future information to predict the past.
RDL enforces a temporal cutoff for each prediction task. For a target entity at time t*, only nodes and edges with timestamps ≤ t* are included in the graph. This requires filtering every timestamped table at the cutoff and encoding event times relative to it:
```python
def build_temporal_graph(db, target_entity, cutoff_time):
    # Only include events at or before the prediction cutoff.
    # .copy() lets us add columns without touching the original tables.
    filtered_orders = db.orders[db.orders.timestamp <= cutoff_time].copy()
    filtered_reviews = db.reviews[db.reviews.timestamp <= cutoff_time].copy()

    # Encode relative time delta as a node feature
    filtered_orders['days_ago'] = (cutoff_time - filtered_orders.timestamp).dt.days
    filtered_reviews['days_ago'] = (cutoff_time - filtered_reviews.timestamp).dt.days

    # Build heterogeneous graph from filtered tables
    # (schema_to_graph is a placeholder for the table-to-graph translation)
    graph = schema_to_graph(db.users, filtered_orders, filtered_reviews, db.products)
    return graph


# Training: each example gets its own temporally-consistent graph
# (gnn and bce_loss are placeholders for the model and the loss function)
for user, label, cutoff in training_examples:
    g = build_temporal_graph(db, user, cutoff)
    pred = gnn(g)[user]
    loss = bce_loss(pred, label)
```
RelBench is the benchmark suite introduced alongside the RDL paper. 11 tasks across 5 real-world relational databases, each with standardized train/validation/test splits that respect temporal ordering. No data leakage. No cherry-picked tasks.
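A time-respecting split can be sketched as below. This is an illustration of the idea only: the example tuples and both cutoff dates are hypothetical, and RelBench ships its own fixed splits rather than asking users to compute them.

```python
from datetime import date


def temporal_split(examples, val_start, test_start):
    """Split (entity, timestamp, label) examples by time rather than at random."""
    train = [e for e in examples if e[1] < val_start]
    val = [e for e in examples if val_start <= e[1] < test_start]
    test = [e for e in examples if e[1] >= test_start]
    return train, val, test


# Hypothetical labeled examples: (entity id, prediction time, label).
examples = [
    ("u1", date(2023, 1, 10), 0),
    ("u2", date(2023, 6, 1), 1),
    ("u3", date(2023, 9, 15), 0),
    ("u4", date(2023, 11, 20), 1),
]

train, val, test = temporal_split(
    examples, val_start=date(2023, 7, 1), test_start=date(2023, 10, 1)
)
print([e[0] for e in train], [e[0] for e in val], [e[0] for e in test])
```

Because every test example is strictly later than every training example, a model cannot score well on the test set by memorizing information from the future.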
Tasks span classification, regression, and ranking. Classification is scored with AUROC, regression with MAE/RMSE, and ranking with top-k precision, all on temporal splits to prevent leakage:
| Database | Task | Type | Target |
|---|---|---|---|
| rel-amazon | user-churn | Binary | Will user post another review in 30d? |
| rel-amazon | item-churn | Binary | Will item get reviewed in 30d? |
| rel-stack | user-engage | Binary | Will user answer a question in 30d? |
| rel-stack | post-votes | Regression | How many upvotes will post receive? |
| rel-hm | user-item-purchase | Ranking | Which articles will customer buy next? |
| rel-trial | study-outcome | Binary | Will trial report positive outcome? |
XGBoost on manually engineered features is the standard baseline for relational prediction tasks. It's fast, interpretable, and often competitive. The RDL paper's key claim: the GNN on the raw graph beats XGBoost on a significant fraction of tasks — and the gap grows with the complexity of the relational structure.
| Task | XGBoost (manual features) | RDL-GNN | Winner |
|---|---|---|---|
| user-churn (Amazon) | 0.72 AUROC | 0.78 AUROC | GNN (+8%) |
| post-votes (StackEx) | 0.64 RMSE norm. | 0.59 RMSE norm. | GNN (+8%) |
| study-outcome (Trial) | 0.68 AUROC | 0.71 AUROC | GNN (+4%) |
| user-item (H&M) | 0.021 precision@10 | 0.028 precision@10 | GNN (+33%) |
| item-churn (Amazon) | 0.69 AUROC | 0.67 AUROC | XGBoost (+3%) |
The GNN wins four of the five tasks above. XGBoost wins occasionally, specifically on tasks where the relational structure is shallow (entities don't chain deeply) and where the target is mainly predictable from the entity's own historical features.
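The percentages in the Winner column can be reproduced with a short calculation, under the assumption that they are relative improvements over the losing method, with the direction flipped for RMSE, where lower is better. The helper name `relative_gain` is introduced here purely for illustration.

```python
def relative_gain(baseline, model, higher_is_better=True):
    """Relative improvement of `model` over `baseline`, in percent."""
    if higher_is_better:
        return 100.0 * (model - baseline) / baseline
    return 100.0 * (baseline - model) / baseline


# (task, XGBoost score, GNN score, higher_is_better), copied from the table above.
rows = [
    ("user-churn (Amazon)", 0.72, 0.78, True),
    ("post-votes (StackEx)", 0.64, 0.59, False),  # RMSE: lower is better
    ("study-outcome (Trial)", 0.68, 0.71, True),
    ("user-item (H&M)", 0.021, 0.028, True),
]

for task, xgb, gnn, higher in rows:
    print(f"{task}: GNN {relative_gain(xgb, gnn, higher):+.0f}%")

# item-churn is the one task XGBoost wins, so the gain is measured the other way.
print(f"item-churn (Amazon): XGBoost {relative_gain(0.67, 0.69):+.0f}%")
```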
GNN advantage grows with the depth of relational structure in the task. Tasks where the answer requires multi-hop reasoning show larger GNN vs XGBoost gaps.
Real production databases have millions of rows per table. A naively constructed graph would have millions of nodes and hundreds of millions of edges — far too large for a full-graph GNN forward pass. How does RDL handle this?
The same strategy used by GraphSAGE and neighbor sampling GNNs: for each training example (target node), extract a K-hop subgraph by sampling a fixed number of neighbors at each hop. This subgraph is small regardless of global graph size.
For a K=2 GNN with fanout [25, 25]: at hop 1, sample at most 25 neighbors of the target node. At hop 2, sample at most 25 neighbors of each hop-1 node. Maximum subgraph size: 1 + 25 + 625 = 651 nodes — tiny, regardless of whether the graph has 1 million or 100 million total nodes.
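The bound above, and the sampling procedure behind it, can be sketched in plain Python. This is a simplified, homogeneous illustration with a synthetic adjacency list; production samplers such as those in PyG also handle node and edge types, which this sketch ignores.

```python
import random


def max_subgraph_size(fanouts):
    """Upper bound on sampled nodes: 1 + f1 + f1*f2 + ..."""
    total, frontier = 1, 1
    for fanout in fanouts:
        frontier *= fanout
        total += frontier
    return total


def sample_subgraph(adj, target, fanouts, rng):
    """Sample at most `fanout` neighbors of every frontier node at each hop."""
    nodes, frontier = {target}, [target]
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            neighbors = adj.get(node, [])
            next_frontier.extend(rng.sample(neighbors, min(fanout, len(neighbors))))
        nodes.update(next_frontier)
        frontier = next_frontier
    return nodes


# Synthetic graph: 500 nodes, each with 40 distinct neighbors.
adj = {n: [(n * 7 + j) % 500 for j in range(1, 41)] for n in range(500)}

bound = max_subgraph_size([25, 25])
sub = sample_subgraph(adj, target=0, fanouts=[25, 25], rng=random.Random(0))
print(bound)  # 651
print(len(sub) <= bound)  # True
```

The sampled set is usually smaller than the bound, because different hop-1 nodes often share neighbors and the bound counts those duplicates separately.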
For graphs too large even for sampling (billion-node graphs), graph partitioning distributes nodes across machines. Each machine owns a partition of nodes and their local edges. During training, cross-partition edges require communication. Libraries such as PyG, through its torch_geometric.distributed module and its DistNeighborSampler, implement this for relational settings.
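Why the choice of partition matters can be seen in a toy example that counts cross-partition edges, which are the ones that trigger communication. Both assignment schemes below are hypothetical and chosen for illustration; real systems typically rely on a dedicated partitioner such as METIS.

```python
def count_cross_partition_edges(edges, part):
    """Number of edges whose endpoints live on different machines."""
    return sum(1 for u, v in edges if part[u] != part[v])


num_nodes, num_parts = 8, 2

# A ring graph: node i is connected to node i + 1 (wrapping around).
edges = [(i, (i + 1) % num_nodes) for i in range(num_nodes)]

# Scheme A: assign by node id modulo the number of machines.
modulo_part = [n % num_parts for n in range(num_nodes)]
# Scheme B: assign contiguous blocks of node ids to each machine.
block_part = [n // (num_nodes // num_parts) for n in range(num_nodes)]

modulo_cut = count_cross_partition_edges(edges, modulo_part)
block_cut = count_cross_partition_edges(edges, block_part)
print(modulo_cut, block_cut)  # 8 2
```

Both schemes balance the nodes evenly, yet the modulo assignment cuts every edge while the block assignment cuts only two, so the second would need far less cross-machine traffic during sampling.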
| Scale | Technique | Graph in Memory | Batch Contents |
|---|---|---|---|
| Small (<100K nodes) | Full-graph training | Entire graph in memory | All nodes |
| Medium (100K–10M) | Neighbor sampling | K-hop subgraph per example | Sampled subgraph |
| Large (10M–1B) | Distributed sampling | Per-machine partition | Cross-machine subgraph |
| Very large (>1B) | Cluster-GCN + partition | Cluster per batch | Cluster subgraph |
Relational Deep Learning connects decades of relational database research to the modern GNN literature. Understanding these connections situates RDL not as a niche contribution but as a bridge between two large research communities.
| Method | Key Idea | Relation to RDL |
|---|---|---|
| Inductive Logic Programming (Muggleton 1991) | Learn logical rules over relations | Same motivation, different representation (rules vs neural) |
| Probabilistic Relational Models (Getoor 2007) | Probabilistic graphical models for relational data | Precursor — probabilistic instead of neural |
| HGT (Hu et al. 2020) | Heterogeneous GNN with attention | One valid backbone for RDL's GNN step |
| RDL (this paper) | DB-as-graph + temporal split + RelBench | — |
| TabNet / AutoML | End-to-end learning on single flat tables | No relational structure; RDL is the multi-table extension |
| RelBench (future) | Community benchmark extensions | Direct descendant: the benchmark is expected to grow |
Fey, Hu, Huang, Lenssen, Ranjan, Robinson, Ying, You, Leskovec. "Relational Deep Learning: Graph Representation Learning on Relational Databases." 2023. arXiv:2312.04615
"Every enterprise database is a graph. We just stopped pretending otherwise."