Relational Deep Learning

Chapter 0: The Problem

You're building a churn prediction model. Your data is in a relational database: a users table, an orders table, a products table, a reviews table. Each order references a user and a product. Each review references an order and a user.

The standard ML pipeline: call a data engineer. They write SQL joins to flatten everything into a single feature table — one row per user with 200 engineered features like "total orders in last 30 days," "average review sentiment," "product category diversity." Then you train XGBoost. The SQL took 3 weeks. The features are arbitrary choices. Relationships between entities are compressed into aggregations that lose structural information.

The hidden graph: Your database is a graph. Every foreign key reference is an edge. User 47 connects to Order 203, which connects to Product 91, which connects to a category and other orders. The relationships between entities carry information that flat features cannot capture — which users bought from the same suppliers, which products are purchased together, which users have similar review patterns. Feature engineering flattens all of this into scalars. Relational Deep Learning (RDL) leaves the graph intact and trains a GNN directly on it.

The practical cost of feature engineering is real:

Time: By commonly cited estimates, feature engineering takes 60-80% of a data scientist's time on relational data, often 3-4 weeks per project before model training begins.
Information loss: Aggregating a user's 47 orders into "total orders" and "average order value" loses the sequence, the co-occurrence, the product diversity structure. These are real signals.
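Maintenance: When the database schema changes (new table, new foreign key), the SQL pipelines that depend on it break and must be rewritten. With a graph-based pipeline, a schema extension only requires re-running the schema-to-graph conversion and retraining; no hand-written features have to change.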
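
To make the information-loss point concrete, here is a small sketch on made-up data (the user names, products, and order values are illustrative assumptions, not taken from any dataset). Two users end up with identical flat aggregates even though the products they connect to differ.

```python
from statistics import mean

# Hypothetical toy data: one (user, product, order_value) tuple per order.
orders = [
    ("alice", "tent", 60.0), ("alice", "stove", 40.0),
    ("bob", "tent", 50.0), ("bob", "tent", 50.0),
]

def flat_features(user):
    """Typical aggregated features: order count and average order value."""
    values = [v for u, _, v in orders if u == user]
    return {"total_orders": len(values), "avg_order_value": mean(values)}

def products_bought(user):
    """Structural view: the set of product nodes the user is connected to."""
    return {p for u, p, _ in orders if u == user}

# Identical aggregates for both users...
print(flat_features("alice") == flat_features("bob"))
# ...but different graph neighborhoods.
print(products_bought("alice"), products_bought("bob"))
```

A model that only sees the aggregated row cannot tell these two users apart; a model that sees the graph can.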

Feature Engineering vs Graph Learning

The same relational data shown two ways. Left: the flat feature approach (all information compressed to per-entity features). Right: the graph approach (entities as nodes, foreign keys as edges). The graph view exposes multi-hop connections that flat features cannot capture.

What structural information do foreign key relationships carry that standard feature engineering loses?

Multi-hop connections between entities (which users share suppliers, which products appear together, which review patterns are correlated) that aggregations like "average order value" compress into scalars and lose
The column data types of each table
The primary key index structure

Chapter 1: Database as Graph

The translation from relational database to graph is mechanical — no choices, no ambiguity. Every table becomes a node type. Every row in a table becomes a node instance. Every foreign key relationship between tables becomes an edge type. Every row-to-row FK reference becomes an edge instance.

G = (V_T1 ∪ V_T2 ∪ ... ∪ V_Tk, E_FK1 ∪ E_FK2 ∪ ... ∪ E_FKm)

A heterogeneous graph where node types correspond to tables and edge types correspond to foreign key relations. This is precisely the input format that heterogeneous GNNs (like HGT, or the RelBench GNN) operate on. The table columns become node features.

Concrete example — e-commerce database:
Tables: users (id, age, city), orders (id, user_id, date, total), products (id, name, category, price), order_items (order_id, product_id, quantity)
→ Node types: User, Order, Product, OrderItem
→ Edge types: User→Order (placed), Order→OrderItem (contains), OrderItem→Product (references)
→ Each User node's features: [age, city_embedding]. Each Order node: [date, total]. Each Product: [name_embedding, category, price].
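
The translation can be sketched in a few lines of plain Python. This is a minimal illustration, assuming every referenced table has a primary key column named id; the function name tables_to_graph and the toy rows are hypothetical, and edges are stored in the FK direction (child row to referenced row).

```python
# Hypothetical toy rows mirroring the e-commerce schema above.
tables = {
    "users": [{"id": 1, "age": 34, "city": "Oslo"}],
    "orders": [{"id": 10, "user_id": 1, "date": "2024-01-05", "total": 59.0}],
    "products": [{"id": 7, "name": "Tent", "category": "outdoor", "price": 59.0}],
    "order_items": [{"order_id": 10, "product_id": 7, "quantity": 1}],
}

# Foreign keys as (child table, FK column, parent table).
foreign_keys = [
    ("orders", "user_id", "users"),
    ("order_items", "order_id", "orders"),
    ("order_items", "product_id", "products"),
]

def tables_to_graph(tables, foreign_keys):
    """Return (nodes, edges): one node per row, one edge per FK reference."""
    nodes = {}  # (table, row_index) -> row dict, i.e. the node's raw features
    for table, rows in tables.items():
        for i, row in enumerate(rows):
            nodes[(table, i)] = row
    edges = {}  # (child, fk_column, parent) -> list of (child_node, parent_node)
    for child, column, parent in foreign_keys:
        # Map each parent primary key to its row index.
        parent_index = {row["id"]: i for i, row in enumerate(tables[parent])}
        edges[(child, column, parent)] = [
            ((child, i), (parent, parent_index[row[column]]))
            for i, row in enumerate(tables[child])
        ]
    return nodes, edges

nodes, edges = tables_to_graph(tables, foreign_keys)
print(len(nodes), "nodes,", sum(len(e) for e in edges.values()), "edges")
```

Reverse edges (for example user to order) can be derived by swapping each pair, which is what allows messages to flow in both directions.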

Feature Encoding for Tables

Table columns come in several types, each needing a different encoding strategy:

Numeric: Use the value directly (possibly normalized). Order total → scalar feature.
Categorical: Embed with a learned embedding table. Product category → 16-dim vector.
Text: Use a pre-trained text encoder (sentence transformers). Product name → 384-dim sentence embedding, then linearly projected.
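Datetime: Decompose into [year, month, day, hour, day-of-week] and encode the periodic components (month, day, hour, day-of-week) cyclically, for example as sine/cosine pairs. Order timestamp → 5 components before any cyclic expansion.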

These features are concatenated into a per-node feature vector. Different node types may have different feature dimensions — the GNN handles this by projecting all types into a common latent dimension at the first layer.
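
A minimal standard-library sketch of the numeric, categorical, and datetime encodings follows. The helper names are hypothetical. The sine/cosine treatment of periodic datetime components is one common way to realize a cyclic encoding and is an assumption here; it produces more dimensions than the five raw components. Text encoding is omitted because it needs a pre-trained model.

```python
import math
from datetime import datetime

def encode_numeric(value, mean, std):
    """Standardize a numeric value (mean/std would come from training data)."""
    return [(value - mean) / std]

def encode_categorical(value, vocabulary):
    """Map a category to an integer index for a learned embedding table.
    Index 0 is reserved for categories not seen during training."""
    return vocabulary.get(value, 0)

def encode_cyclic(value, period):
    """Encode a periodic quantity as a (sin, cos) pair so the end of the
    period lands next to its start (hour 23 is close to hour 0)."""
    angle = 2 * math.pi * value / period
    return [math.sin(angle), math.cos(angle)]

def encode_datetime(ts):
    """Year stays a plain number; periodic components are encoded cyclically."""
    return (
        [float(ts.year)]
        + encode_cyclic(ts.month - 1, 12)
        + encode_cyclic(ts.day - 1, 31)
        + encode_cyclic(ts.hour, 24)
        + encode_cyclic(ts.weekday(), 7)
    )

# One order node: standardized total followed by the datetime features.
features = encode_numeric(59.0, mean=40.0, std=20.0) + encode_datetime(
    datetime(2024, 1, 5, 14)
)
category_index = encode_categorical("outdoor", {"outdoor": 1, "kitchen": 2})
print(len(features), category_index)
```

The categorical index is kept separate from the float features because it is looked up in an embedding table inside the model rather than concatenated as a raw number.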

Schema to Graph Translator

An e-commerce database schema, shown in two views: the schema view (tables + FK arrows) and the graph view (nodes + edges). The information is identical; only the representation changes.

What do foreign key relationships become in the relational graph?

Edge types: each FK relationship becomes an edge type, and each row-to-row FK reference becomes a specific edge instance in the heterogeneous graph
Node features: the FK value is added to the node's feature vector
Graph attributes: stored as global graph-level metadata

Chapter 2: GNN on Relations — Interactive

Once the database is a graph, the GNN processes it with standard heterogeneous message passing. For each prediction task (e.g., "will this user churn in the next 30 days?"), the GNN computes an embedding for the target node (user) by aggregating information from its neighborhood — orders, products, reviews — up to K hops away.

Input Graph

Heterogeneous graph: Users, Orders, Products, Reviews as node types. FK relations as edges. Column values as node features.

↓

Linear Projection

Each node type: W_type · x_v → h_v⁽⁰⁾ ∈ R^d. All types projected to common d-dim space.

↓

Heterogeneous Message Passing (K layers)

For each edge type (source type, relation, target type) and each edge e = (u, v) of that type: m_e = MLP_relation(h_u^(k)). Aggregate over all incoming edges of v, across edge types: h_v^(k+1) = AGG({m_e : e = (u, v)}).

↓

Readout + MLP head

h_v^(K) for target nodes → MLP → prediction (binary, regression, ranking).
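
The message-passing step can be illustrated without any GNN library. The sketch below is a simplification under stated assumptions: a single weight matrix per relation stands in for MLP_relation, sum plays the role of AGG, there is no self-connection or nonlinearity, and the toy graph and weights are made up.

```python
# Hypothetical toy graph: embeddings h[node_type][node_id], dimension d = 2.
h = {
    "user": {0: [1.0, 0.0]},
    "order": {0: [0.0, 1.0], 1: [1.0, 1.0]},
}
# Edges grouped by type (source type, relation, target type) -> [(u, v), ...].
edges = {
    ("order", "placed_by", "user"): [(0, 0), (1, 0)],
    ("user", "placed", "order"): [(0, 0), (0, 1)],
}
# One weight matrix per relation (a linear map standing in for MLP_relation).
weights = {
    ("order", "placed_by", "user"): [[1.0, 0.0], [0.0, 1.0]],
    ("user", "placed", "order"): [[0.5, 0.0], [0.0, 0.5]],
}

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def message_passing_layer(h, edges, weights):
    """New embedding of each node = sum of relation-specific messages
    from its in-neighbors, across all edge types."""
    d = len(next(iter(weights.values())))
    new_h = {t: {v: [0.0] * d for v in nodes} for t, nodes in h.items()}
    for edge_type, pairs in edges.items():
        src_type, _, dst_type = edge_type
        for u, v in pairs:
            message = matvec(weights[edge_type], h[src_type][u])
            new_h[dst_type][v] = [a + m for a, m in zip(new_h[dst_type][v], message)]
    return new_h

h1 = message_passing_layer(h, edges, weights)
print(h1["user"][0], h1["order"][0])
```

Calling the layer K times lets information from nodes up to K hops away reach the target node.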

Relational GNN — Live Message Passing

A small e-commerce graph. In the interactive version, any node can be selected as the prediction target, and the message-passing depth K controls how far information flows from related nodes. Colors show information density reaching the target.

python (PyG HeteroConv)
import torch
from torch_geometric.nn import HeteroConv, SAGEConv

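# One SAGEConv per edge type; (-1, -1) lets PyG infer input sizes lazily.
# HeteroConv only returns embeddings for node types that appear as an edge
# destination, so reverse edge types are needed for 'user' to receive messages.
conv = HeteroConv({
    ('user', 'placed', 'order'): SAGEConv((-1, -1), 64),
    ('order', 'contains', 'product'): SAGEConv((-1, -1), 64),
    ('user', 'wrote', 'review'): SAGEConv((-1, -1), 64),
    ('order', 'rev_placed', 'user'): SAGEConv((-1, -1), 64),
    ('product', 'rev_contains', 'order'): SAGEConv((-1, -1), 64),
    ('review', 'rev_wrote', 'user'): SAGEConv((-1, -1), 64),
}, aggr='sum')
# Prediction head on top of the 64-dim embeddings (layer sizes are illustrative)
mlp_head = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

# Forward pass on a HeteroData object `data` (assumed to exist):
# x_dict = {node_type: feature_tensor}, edge_index_dict = {edge_type: edge_index}
x_dict = conv(data.x_dict, data.edge_index_dict)
# With K stacked layers, x_dict['user'] holds the final user embeddings
user_logits = mlp_head(x_dict['user'])  # [N_users, 1]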

Why does the relational GNN need separate MLPs for each edge type?

Each edge type represents a different semantic relationship (e.g. "placed" vs "wrote") with different source/target node types; a single shared MLP cannot learn relation-specific patterns across different table pair interactions
To reduce memory usage by sharing parameters across fewer node pairs
Because different tables have different numbers of rows

Chapter 3: Temporal Data

Real relational databases are not static snapshots — they grow over time. Orders accumulate. User behavior evolves. Product reviews arrive daily. A churn prediction model trained naively on historical data will suffer from temporal leakage: using future information to predict the past.

Temporal leakage is subtle in graphs. Suppose you want to predict whether User A churned in June, and User A's most recent order was in July (after the target date). If the graph includes that July order, the model can infer that User A was still active in July and therefore had not churned. But at prediction time (June), the July order does not exist yet. The model has been trained on a time-traveling graph: validation accuracy is inflated, and the model underperforms once it is deployed on data that contains no future information.

RDL enforces a temporal cutoff for each prediction task. For a target entity at time t*, only nodes and edges with timestamps ≤ t* are included in the graph. This requires:

Every row in every table must have a timestamp column (or be associated with one through FK chains).
The graph construction must filter edges based on the cutoff time for each training example.
Different training examples may have different cutoff times — the graph is resampled for each.

Temporal node features: Beyond just filtering, timestamps are informative features. "This order was placed 3 days ago" and "this order was placed 2 years ago" carry different signals. RDL encodes relative time deltas — the difference between each entity's timestamp and the prediction cutoff t* — as additional node features. This lets the model learn time-decay patterns: recent activity is weighted more than old activity.

python
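import torch

def build_temporal_graph(db, target_entity, cutoff_time):
    # target_entity is unused in this simplified version; a production
    # pipeline would sample only the neighborhood around it.
    # Keep only events at or before the prediction cutoff. .copy() avoids
    # pandas' SettingWithCopyWarning when adding columns below.
    filtered_orders = db.orders[db.orders.timestamp <= cutoff_time].copy()
    filtered_reviews = db.reviews[db.reviews.timestamp <= cutoff_time].copy()

    # Encode relative time delta as a node feature
    filtered_orders['days_ago'] = (cutoff_time - filtered_orders.timestamp).dt.days
    filtered_reviews['days_ago'] = (cutoff_time - filtered_reviews.timestamp).dt.days

    # Build heterogeneous graph from filtered tables
    # (schema_to_graph stands for the schema-to-graph converter, not shown here)
    graph = schema_to_graph(db.users, filtered_orders, filtered_reviews, db.products)
    return graph

# Training: each example gets its own temporally-consistent graph.
# gnn, optimizer, and training_examples are assumed to be defined elsewhere;
# label is a float tensor with the same shape as pred.
bce_loss = torch.nn.BCEWithLogitsLoss()
for user, label, cutoff in training_examples:
    g = build_temporal_graph(db, user, cutoff)
    pred = gnn(g)[user]  # raw logit for the target user
    loss = bce_loss(pred, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()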

What is temporal leakage in relational graph learning, and how does RDL prevent it?

Including future data (events after the prediction cutoff time) in the graph when predicting a past/present outcome; RDL prevents it by filtering edges/nodes to only include those with timestamps ≤ t* (the prediction cutoff)
Using training data from one time period to test on another; prevented by train/test split
Sharing edge embeddings across time steps; prevented by time-specific edge features

Chapter 4: RelBench

RelBench is the benchmark suite introduced alongside the RDL paper. 11 tasks across 5 real-world relational databases, each with standardized train/validation/test splits that respect temporal ordering. No data leakage. No cherry-picked tasks.

The Databases

rel-amazon — Amazon product reviews. Predict review rating. Nodes: Users, Reviews, Products (10M+ rows).
rel-stack — Stack Overflow. Predict upvotes, user badges. Nodes: Users, Posts, Tags, Comments (millions of rows).
rel-hm — H&M fashion retail. Predict article purchases. Nodes: Customers, Articles, Transactions.
rel-trial — Clinical trials. Predict trial outcome. Nodes: Trials, Sponsors, Conditions, Interventions.
rel-avito — Russian classifieds. Predict ad click probability. Nodes: Ads, Users, Locations, Categories.

The 11 Tasks

Tasks span classification, regression, and ranking. Binary classification tasks are evaluated with AUROC, regression tasks with MAE/RMSE, and ranking tasks with top-k precision, always with temporal validation to prevent leakage. The table below lists a selection:

Database	Task	Type	Target
rel-amazon	user-churn	Binary	Will user post another review in 30d?
rel-amazon	item-churn	Binary	Will item get reviewed in 30d?
rel-stack	user-engage	Binary	Will user answer a question in 30d?
rel-stack	post-votes	Regression	How many upvotes will post receive?
rel-hm	user-item-purchase	Ranking	Which articles will customer buy next?
rel-trial	study-outcome	Binary	Will trial report positive outcome?

Why RelBench matters: Before RelBench, relational ML papers evaluated on hand-crafted benchmarks with no standardized leakage prevention. RelBench's enforced temporal split means that a method that achieves good numbers has actually learned from historical relationships — not from peeking at the future. It also enables fair comparison between feature-engineering baselines (XGBoost on manual features) and GNN methods.
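
The idea of a temporal split can be shown in a few lines. This is a sketch on made-up examples, not RelBench's actual API; the function name temporal_split and the cutoff dates are assumptions for illustration.

```python
from datetime import date

# Hypothetical labeled examples: (entity_id, cutoff_date, label).
examples = [
    (1, date(2023, 1, 1), 0), (2, date(2023, 4, 1), 1),
    (3, date(2023, 7, 1), 0), (4, date(2023, 10, 1), 1),
]

def temporal_split(examples, val_start, test_start):
    """Assign each example to a split purely by its cutoff date, so that
    validation and test examples always lie later in time than training ones."""
    train = [e for e in examples if e[1] < val_start]
    val = [e for e in examples if val_start <= e[1] < test_start]
    test = [e for e in examples if e[1] >= test_start]
    return train, val, test

train, val, test = temporal_split(examples, date(2023, 7, 1), date(2023, 10, 1))
print(len(train), len(val), len(test))
```

Unlike a random split, no training example here can carry information from the validation or test period.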

What does RelBench's standardized temporal split prevent?

Temporal leakage: future data (events after the prediction cutoff) is used in training or validation, inflating apparent accuracy and causing models to fail at deployment time
Overfitting: the temporal split acts as a regularizer on model complexity
Data imbalance: the temporal split ensures equal class frequencies in each split

Chapter 5: vs XGBoost

XGBoost on manually engineered features is the standard baseline for relational prediction tasks. It's fast, interpretable, and often competitive. The RDL paper's key claim: the GNN on the raw graph beats XGBoost on a significant fraction of tasks — and the gap grows with the complexity of the relational structure.

When does GNN win? The GNN wins on tasks where multi-hop relational structure carries signal that aggregations miss. "Users who share product preferences with highly active users tend to stay engaged" is a 4-hop pattern (User → Order → Product ← Order ← User) that XGBoost cannot compute without explicitly engineered features. The GNN can learn such patterns directly from the graph, provided it has enough message-passing layers to span the path.
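
To see what the User → Order → Product ← Order ← User pattern means in data terms, the sketch below walks it by hand on made-up purchases. The order node is collapsed into a direct user-product link for brevity, and all identifiers are hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase links (user, product), with the order node collapsed.
purchases = [("u1", "p1"), ("u2", "p1"), ("u2", "p2"), ("u3", "p2"), ("u4", "p3")]

buyers_of = defaultdict(set)    # product -> users who bought it
products_of = defaultdict(set)  # user -> products bought
for user, product in purchases:
    buyers_of[product].add(user)
    products_of[user].add(product)

def co_purchasers(user):
    """Users reachable via User → Product ← User, excluding the user itself."""
    reached = set()
    for product in products_of[user]:
        reached |= buyers_of[product]
    return reached - {user}

print(sorted(co_purchasers("u2")))
```

With flat features, this lookup has to be written as an explicit self-join for every such pattern someone thinks of. A GNN with enough layers aggregates over these paths without being told which ones matter.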

Task	XGBoost (manual features)	RDL-GNN	Winner
user-churn (Amazon)	0.72 AUROC	0.78 AUROC	GNN (+8%)
post-votes (StackEx)	0.64 RMSE norm.	0.59 RMSE norm.	GNN (+8%)
study-outcome (Trial)	0.68 AUROC	0.71 AUROC	GNN (+4%)
user-item (H&M)	0.021 precision@10	0.028 precision@10	GNN (+33%)
item-churn (Amazon)	0.69 AUROC	0.67 AUROC	XGBoost (+3%)

The GNN wins on four of the five tasks shown. XGBoost wins on tasks where the relational structure is shallow (entities don't chain deeply) and where the target is mainly predictable from the entity's own historical features.
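
The percentages in the Winner column can be reproduced from the scores. The sketch below assumes each gain is measured relative to the losing model's score, with lower being better for RMSE; that assumption is not stated in the table, but it matches the listed figures.

```python
# (task, xgboost_score, gnn_score, higher_is_better), values from the table above.
results = [
    ("user-churn (Amazon)", 0.72, 0.78, True),
    ("post-votes (StackEx)", 0.64, 0.59, False),
    ("study-outcome (Trial)", 0.68, 0.71, True),
    ("user-item (H&M)", 0.021, 0.028, True),
    ("item-churn (Amazon)", 0.69, 0.67, True),
]

def relative_gain(xgb, gnn, higher_is_better):
    """Return (winner, gain in percent relative to the loser's score).
    For error metrics (lower is better) the gain is the relative reduction."""
    gnn_wins = gnn > xgb if higher_is_better else gnn < xgb
    winner_score, loser_score = (gnn, xgb) if gnn_wins else (xgb, gnn)
    gain = abs(winner_score - loser_score) / loser_score * 100
    return ("GNN" if gnn_wins else "XGBoost"), gain

for task, xgb, gnn, higher_is_better in results:
    winner, gain = relative_gain(xgb, gnn, higher_is_better)
    print(f"{task}: {winner} (+{gain:.0f}%)")
```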

The real cost comparison: XGBoost results require weeks of feature engineering by domain experts. The GNN results required writing a schema-to-graph converter (a few hundred lines of code, reusable across tasks) and running training. The per-task engineering cost for GNN is near zero once the converter exists. XGBoost's total engineering cost is the bottleneck, not model training time. For organizations running many prediction tasks on the same database, RDL's one-time setup cost is amortized across all tasks.

Performance vs Relational Depth

GNN advantage grows with the depth of relational structure in the task. Tasks where the answer requires multi-hop reasoning show larger GNN vs XGBoost gaps.


On which types of tasks does the GNN most consistently outperform XGBoost with manual features?

Tasks where multi-hop relational structure carries signal: patterns spanning 3+ table hops that XGBoost cannot compute without explicitly engineered join features
Tasks with very large datasets (millions of rows) where XGBoost runs out of memory
Tasks with high-dimensional text features that XGBoost cannot process

Chapter 6: Scale

Real production databases have millions of rows per table. A naively constructed graph would have millions of nodes and hundreds of millions of edges — far too large for a full-graph GNN forward pass. How does RDL handle this?

Mini-batch Subgraph Sampling

The same strategy used by GraphSAGE and neighbor sampling GNNs: for each training example (target node), extract a K-hop subgraph by sampling a fixed number of neighbors at each hop. This subgraph is small regardless of global graph size.

For a K=2 GNN with fanout [25, 25]: at hop 1, sample at most 25 neighbors of the target node. At hop 2, sample at most 25 neighbors of each hop-1 node. Maximum subgraph size: 1 + 25 + 625 = 651 nodes — tiny, regardless of whether the graph has 1 million or 100 million total nodes.
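
The size bound and the sampling procedure can be sketched as follows. This is a simplified, single-edge-type illustration with hypothetical function names, not the sampler used by any particular library.

```python
import random

def max_subgraph_size(fanouts):
    """Upper bound on sampled nodes: the target plus, for each hop,
    the product of all fanouts up to that hop."""
    total, frontier = 1, 1
    for fanout in fanouts:
        frontier *= fanout
        total += frontier
    return total

def sample_subgraph(adjacency, target, fanouts, rng):
    """Sample at most `fanout` neighbors of every frontier node at each hop."""
    visited, frontier = {target}, [target]
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            neighbors = adjacency.get(node, [])
            next_frontier.extend(rng.sample(neighbors, min(fanout, len(neighbors))))
        visited.update(next_frontier)
        frontier = next_frontier
    return visited

print(max_subgraph_size([25, 25]))  # 1 + 25 + 625 = 651

# Toy graph: node 0 has 1000 neighbors, yet only 25 of them are sampled.
adjacency = {0: list(range(1, 1001))}
subgraph = sample_subgraph(adjacency, 0, [25, 25], random.Random(0))
print(len(subgraph))
```

The bound depends only on the fanouts, which is why the cost per training example stays constant as the database grows.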

Temporal subgraph sampling: Each training example has a different cutoff time. The subgraph must include only nodes/edges before that cutoff. Efficient temporal sampling requires pre-sorting neighbors by timestamp and binary-searching for the cutoff. RelBench's implementation does this — building time-indexed adjacency lists for each edge type.
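
The pre-sort and binary-search idea can be sketched with the standard library's bisect module. The data layout below (parallel lists of sorted timestamps and neighbor ids) is an assumption for illustration, not RelBench's actual data structure.

```python
from bisect import bisect_right

# Hypothetical time-indexed adjacency list for one edge type:
# node -> (timestamps sorted ascending, neighbor ids in the same order).
adjacency = {
    "user_47": ([3, 8, 15, 22], ["order_1", "order_2", "order_3", "order_4"]),
}

def neighbors_before(adjacency, node, cutoff):
    """Neighbors whose edge timestamp is <= cutoff, found by binary search."""
    timestamps, neighbors = adjacency[node]
    end = bisect_right(timestamps, cutoff)  # first index with timestamp > cutoff
    return neighbors[:end]

print(neighbors_before(adjacency, "user_47", cutoff=15))
```

Each lookup costs O(log n) in the number of neighbors, so changing the cutoff per training example does not require rebuilding the graph.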

Node-Level Partitioning for Very Large Graphs

For graphs too large to fit on a single machine (billion-node graphs), graph partitioning distributes nodes across machines. Each machine owns a partition of nodes and their local edges. During training, sampling across cross-partition edges requires communication between machines. Libraries such as PyG's torch_geometric.distributed module (which provides DistNeighborSampler) implement this and can be applied to relational graphs.
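
A toy sketch of why cross-partition edges matter: nodes are assigned to machines by a simple modulo rule (a stand-in for a real graph partitioner, chosen here only for brevity), and the edges whose endpoints land on different machines are counted.

```python
def partition_nodes(num_nodes, num_parts):
    """Assign node i to machine i % num_parts (illustrative stand-in only)."""
    return {node: node % num_parts for node in range(num_nodes)}

def cross_partition_edges(edges, assignment):
    """Edges whose endpoints live on different machines; following these
    during sampling requires network communication."""
    return [(u, v) for u, v in edges if assignment[u] != assignment[v]]

# Hypothetical toy graph with 6 nodes split across 2 machines.
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (4, 5)]
assignment = partition_nodes(num_nodes=6, num_parts=2)
remote = cross_partition_edges(edges, assignment)
print(len(remote), "of", len(edges), "edges cross partitions")
```

Real partitioners try to minimize the number of such edges, since each one adds communication cost during sampling.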

Scale	Technique	Graph Size	Batch Size
Small (<100K nodes)	Full-graph training	Entire graph in memory	All nodes
Medium (100K–10M)	Neighbor sampling	K-hop subgraph per example	Sampled subgraph
Large (10M–1B)	Distributed sampling	Per-machine partition	Cross-machine subgraph
Very large (>1B)	Cluster-GCN + partition	Cluster per batch	Cluster subgraph

How does neighbor sampling make GNN training on million-row databases feasible?

For each training example, it extracts a small K-hop subgraph by sampling a fixed number of neighbors at each hop; the subgraph size is bounded (e.g., 651 nodes for K=2, fanout=25) regardless of the full graph's size
It compresses each table to its most informative rows before building the graph
It reduces the database to only the top-10% most connected nodes

Chapter 7: Connections

Relational Deep Learning connects decades of relational database research to the modern GNN literature. Understanding these connections situates RDL not as a niche contribution but as a bridge between two large research communities.

Method	Key Idea	Relation to RDL
Inductive Logic Programming (Muggleton 1991)	Learn logical rules over relations	Same motivation, different representation (rules vs neural)
Probabilistic Relational Models (Getoor 2007)	Probabilistic graphical models for relational data	Precursor — probabilistic instead of neural
HGT (Hu et al. 2020)	Heterogeneous GNN with attention	One valid backbone for RDL's GNN step
RDL (this paper)	DB-as-graph + temporal split + RelBench	—
TabNet / AutoML	End-to-end learning on single flat tables	No relational structure; RDL is the multi-table extension
RELBENCH (future)	Community benchmark extensions	Direct descendant — benchmark expected to grow

The key technical novelty of RDL is not a new GNN architecture — it's the problem formulation. Prior work either (a) built specialized methods for specific relational tasks (knowledge graph completion, citation networks), or (b) required massive feature engineering to reach a single flat table. RDL's contribution is showing that the schema-to-graph translation is mechanical, the temporal leakage issue is solvable with a principled cutoff, and standard heterogeneous GNNs beat the feature-engineering baseline on the resulting tasks. The benchmark makes this reproducible and comparable.

Limitations: (1) Text and image columns: RDL handles these with pre-trained encoders, but the quality of encoding matters. (2) Very complex temporal patterns: RDL uses simple cutoff time — more complex time-series patterns (seasonality, trends) require additional modeling. (3) Schema changes: adding a new table requires schema-to-graph reprocessing. (4) Interpretability: the GNN's predictions are as opaque as any neural model — XGBoost's feature importances are easier to explain to stakeholders. (5) Cold-start: nodes with few connections (new users, rare products) are poorly represented — the same problem as any GNN.

Go Deeper

HGT (Hu et al. 2020) — heterogeneous GNN backbone for RDL
GraphSAGE (Hamilton et al. 2017) — neighbor sampling for large graphs
RGCN (Schlichtkrull et al. 2018) — relational GCN for KB completion

Key Paper

Fey, Hu, Huang, Lenssen, Ranjan, Robinson, Ying, You, Leskovec. "Relational Deep Learning: Graph Representation Learning on Relational Databases." 2023. arXiv:2312.04615

"Every enterprise database is a graph. We just stopped pretending otherwise."
