Basic RDL works. But real databases have time-ordering, dozens of tables, and schemas that change between companies. The frontier: one model that works on any relational database without retraining.
You've built a GNN on a relational database graph. You've handled foreign keys as edges, rows as nodes, temporal cutoffs to prevent leakage. Your churn prediction model works. Now your company acquires another company — different database, completely different schema. You need to start over.
This is the central frustration of basic RDL: the model is schema-specific. The weight matrices were shaped for your particular tables and their particular column counts. A new schema means a new model, new training data, new hyperparameter search. It doesn't transfer.
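Lecture 13 addresses three challenges that basic RDL leaves unsolved: (1) interactions are ordered in time, (2) the useful signal may sit several foreign-key hops away, and (3) a trained model is tied to a single schema.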
Each of these problems has an emerging solution. Temporal message passing addresses problem 1. Adaptive multi-hop aggregation addresses problem 2. Universal encoders (the Griffin architecture) attack problem 3. We'll cover all three.
Three user interaction histories with identical graph structure but different temporal patterns. Basic RDL gives them identical embeddings. Temporal-aware RDL distinguishes them by their event sequences.
When you build a relational graph at cutoff time T, you prevent future labels from leaking. But basic RDL still has a subtle temporal problem: a node's neighbors may have interacted with it at different times, and basic message passing treats all of them equally — as if they happened simultaneously.
Consider: order O1 happened 6 months ago, order O2 happened yesterday. For predicting churn, O2 is more relevant — recent behavior is a stronger signal. But a standard GNN aggregates both orders with equal weight. Temporal message passing fixes this by ordering and weighting messages by their timestamp.
There's a deeper version of temporal leakage: an entity's neighbor nodes may have edges that were created after the prediction cutoff. Example: order O3 references product P7. P7 received a review R10 at time T+2 (after the cutoff). A standard GNN would propagate P7's features, which include information from R10, through O3 to the customer — even though R10 was in the future at prediction time.
The solution: annotate every edge with its timestamp. During message passing, at every hop, aggregate from neighbor v only if the edge (u, v) has timestamp τ(u,v) ≤ T, the prediction cutoff. More sophisticated temporal GNNs (TGNNs) weight messages by recency:
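A recency-weighted aggregation consistent with the surrounding definitions can be written as follows. The symbols $h_u$ (aggregated message at node $u$), $m_v$ (message from neighbor $v$), and $\mathcal{N}(u)$ (neighbors of $u$) are notation assumed here, and the exact form used in the lecture may differ:

$$
h_u \;=\; \sum_{\substack{v \in \mathcal{N}(u) \\ \tau(u,v) \le T}} \alpha\big(\Delta t_{uv}\big)\, m_v,
\qquad \Delta t_{uv} = T - \tau(u,v)
$$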
Here α(Δt) is a time decay function of the elapsed time Δt = T − τ(u,v) between an edge's timestamp and the cutoff: recent edges get higher weight. Common choices: α(Δt) = exp(-λ·Δt) (exponential decay) or α(Δt) = 1/(1 + Δt) (harmonic decay). The decay rate λ is either learned as a model parameter or hand-tuned as a hyperparameter.
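The cutoff filter and the two decay functions can be sketched in plain Python. This is a minimal illustration, not a TGNN implementation: the function names are made up for this example, messages are plain lists of floats, and normalizing by the total weight (a weighted mean) is one design choice among several.

```python
import math

def exponential_decay(dt, lam):
    """alpha(dt) = exp(-lam * dt); lam = 0 reduces to uniform weighting."""
    return math.exp(-lam * dt)

def harmonic_decay(dt):
    """alpha(dt) = 1 / (1 + dt)."""
    return 1.0 / (1.0 + dt)

def temporal_aggregate(messages, cutoff, decay):
    """Weighted mean of neighbor messages whose edge timestamp is <= cutoff.

    messages: list of (edge_timestamp, vector) pairs (hypothetical format).
    """
    visible = [(ts, vec) for ts, vec in messages if ts <= cutoff]
    if not visible:
        return []
    total = [0.0] * len(visible[0][1])
    weight_sum = 0.0
    for ts, vec in visible:
        w = decay(cutoff - ts)  # elapsed time since the interaction
        weight_sum += w
        for i, x in enumerate(vec):
            total[i] += w * x
    return [x / weight_sum for x in total]

# Two past orders (days 0 and 180) and one order after the cutoff (day 200).
orders = [(0.0, [1.0]), (180.0, [3.0]), (200.0, [100.0])]
uniform = temporal_aggregate(orders, 181.0, lambda dt: exponential_decay(dt, 0.0))
recent = temporal_aggregate(orders, 181.0, lambda dt: exponential_decay(dt, 0.1))
harmonic = temporal_aggregate(orders, 181.0, harmonic_decay)
```

With λ = 0 both visible orders count equally; with λ = 0.1 the aggregate is dominated by the order placed one day before the cutoff. The day-200 order never contributes because it lies after the cutoff.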
A customer's 6 neighbor orders, each at a different time before the cutoff. The decay rate λ controls how much weight each order's message gets, trading recency bias against uniform weighting.
A 1-layer GNN on a relational graph aggregates over one foreign-key hop, which is roughly what a single JOIN followed by an aggregation computes. A 2-layer GNN corresponds to joining two tables, and a 3-layer GNN to joining three. Each layer follows one more foreign-key hop. The key question is: how many hops does your task actually need?
Predicting customer churn from their order history: 1 hop (customer → orders). Predicting churn from the products they bought: 2 hops (customer → orders → products). Predicting churn based on what similar customers bought: 3 hops to reach the other orders placed for the same products (customer → orders → products → other customers' orders), plus one more hop if the model needs those customers' own rows.
Set L too small, and you miss the relevant signal. Set L too large, and you include irrelevant signal from distant nodes (which adds noise) and you over-smooth the embeddings (nearby nodes become indistinguishable). This is the classic over-smoothing problem in GNNs, amplified by the fact that relational graphs can be densely connected.
| GNN Layers | Equivalent SQL | What It Captures |
|---|---|---|
| 0 (none) | SELECT * FROM customers | Customer's own features only |
| 1 | JOIN orders ON cust_id | Order history (count, amount, recency) |
| 2 | JOIN orders JOIN products | What products did this customer buy? |
| 3 | 3-table JOIN + GROUP BY | What do similar customers buy? (collaborative signal) |
| 4+ | Complex multi-join | Often over-smoothed — diminishing returns |
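To make the layer-to-JOIN analogy in the table concrete, here is a small sketch using Python's built-in `sqlite3`. The toy schema, column names, and rows are invented for illustration; the queries show the hand-written aggregates that a 1-hop and a 2-hop neighborhood would give a model access to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (prod_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                        cust_id INTEGER, prod_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'ana'), (2, 'bo');
INSERT INTO products  VALUES (10, 'books'), (11, 'games');
INSERT INTO orders    VALUES (100, 1, 10, 20.0), (101, 1, 11, 40.0),
                             (102, 2, 10, 5.0);
""")

# 1 hop (customer -> orders): order count and average amount per customer.
one_hop = conn.execute("""
SELECT c.cust_id, COUNT(o.order_id), AVG(o.amount)
FROM customers c
LEFT JOIN orders o ON o.cust_id = c.cust_id
GROUP BY c.cust_id ORDER BY c.cust_id
""").fetchall()

# 2 hops (customer -> orders -> products): category diversity per customer.
two_hop = conn.execute("""
SELECT c.cust_id, COUNT(DISTINCT p.category)
FROM customers c
LEFT JOIN orders o   ON o.cust_id = c.cust_id
LEFT JOIN products p ON p.prod_id = o.prod_id
GROUP BY c.cust_id ORDER BY c.cust_id
""").fetchall()
```

A GNN does not execute these queries. The point is only that each extra layer widens the set of rows whose features can reach the customer node, in the same way each extra JOIN does.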
Five customer nodes start with different embeddings (shown as colors). As GNN layers are added, the embeddings converge, which is over-smoothing. The sweet spot is typically L=2 or L=3.
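Over-smoothing can be reproduced with a few lines of plain Python. The sketch below is a deliberately simplified assumption: it uses repeated mean aggregation with self-loops on a 5-node path graph, with no learned weights or nonlinearities, and tracks how far apart the embeddings remain after each layer.

```python
def mean_aggregate(embeddings, neighbors):
    """One round of mean aggregation over each node and its neighbors."""
    updated = []
    for node, nbrs in enumerate(neighbors):
        group = [node] + nbrs  # include a self-loop
        dim = len(embeddings[node])
        updated.append(
            [sum(embeddings[g][d] for g in group) / len(group) for d in range(dim)]
        )
    return updated

def spread(embeddings):
    """Largest per-dimension gap between any two node embeddings."""
    dim = len(embeddings[0])
    return max(
        max(e[d] for e in embeddings) - min(e[d] for e in embeddings)
        for d in range(dim)
    )

neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]        # path graph 0-1-2-3-4
embeddings = [[0.0], [1.0], [2.0], [3.0], [4.0]]      # initially distinct
spreads = [spread(embeddings)]
for _ in range(8):
    embeddings = mean_aggregate(embeddings, neighbors)
    spreads.append(spread(embeddings))
```

`spreads` starts at 4.0 and shrinks with every round, since each round averages values that are already closer together. Real GNN layers have learned transformations that slow this down, but the same tendency is there.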
The baseline approach in RDL is to train a separate GNN for each database schema. You inspect the e-commerce schema, design a HeteroConv with the right edge types, train it. You get a new healthcare schema, design another HeteroConv, train again. Each model is custom-built for its schema.
This is schema-specific modeling. It works. RelBench results show it beats XGBoost on complex tasks. But it has fundamental limitations that emerge in production.
A schema-specific model can use exactly the right weight matrices for each edge type. The "customer places order" transformation is tuned specifically to the relationship between your company's customer features and your order features. The model has full knowledge of the schema and can leverage it.
Schema-specific models break when:
```python
from torch_geometric.nn import HeteroConv, SAGEConv

# Schema-specific: must know edge types in advance
conv = HeteroConv({
    ('customer', 'places', 'order'): SAGEConv(...),
    ('order', 'contains', 'product'): SAGEConv(...),
    # ... every edge type hardcoded at model creation time
})
# ^^^ Breaks if schema changes. Cannot transfer to new DB.

# Universal: schema described as data, not architecture
# (UniversalRDLEncoder and build_from_schema are illustrative names, not a real API)
model = UniversalRDLEncoder()
graph = build_from_schema(any_database)  # any schema!
embeddings = model(graph)                # zero-shot generalization
```
The question: can you build a GNN that generalizes across any relational database schema without retraining? This would be the "foundation model for relational data" — trained once, deployed everywhere.
Griffin, a graph-centric foundation model for relational databases, is one attempt at this. The insight: the schema itself is data. Instead of hardcoding the schema into the GNN's architecture, encode the schema as part of the input. The model learns to read the schema description and adapt its behavior accordingly.
1. Schema-agnostic node features. Different tables have different columns with different types. Griffin uses a universal feature encoder that can handle any column: numeric values are normalized and embedded, categorical values use a shared vocabulary embedding, text is encoded with a frozen language model, timestamps use sinusoidal encoding. The output is always a fixed-size vector, regardless of what columns the table has.
2. Schema description as context. The table name, column names, and foreign key relationships are serialized as text and passed through a language model. This "schema embedding" tells the GNN the semantics of each node type — even for schemas it has never seen before. A new table called "prescriptions" will produce an embedding that reflects medical prescription semantics, without any prescription-specific training.
3. Shared message passing weights. Instead of separate weight matrices per edge type, Griffin uses shared weights conditioned on the schema embedding. The message from an Order node to a Customer node uses weights that are modulated by the schema description of that edge type. Same base weights, different conditioning — efficient transfer.
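The first ingredient above, a feature encoder that maps any mix of columns to a fixed-size vector, can be sketched as follows. This is a toy stand-in under stated assumptions, not Griffin's encoder: hashed pseudo-random vectors replace learned embeddings, text columns are left out because they would need a language model, and every function name here is hypothetical.

```python
import hashlib
import math
import random

DIM = 8  # fixed output size (even, so the sinusoidal encoding fills it exactly)

def _seeded_vector(key, dim):
    """Deterministic pseudo-random vector; stands in for a learned embedding."""
    seed = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def encode_numeric(value, mean, std, dim=DIM):
    """Z-score the value, then scale a shared direction by it."""
    z = (value - mean) / std if std > 0 else 0.0
    return [z * d for d in _seeded_vector("numeric-direction", dim)]

def encode_categorical(value, dim=DIM):
    """Look up a vector keyed only by the category string (shared vocabulary)."""
    return _seeded_vector("category:" + str(value), dim)

def encode_timestamp(t, dim=DIM):
    """Sinusoidal encoding at geometrically spaced frequencies."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000.0 ** (2 * i / dim))
        out.extend([math.sin(t * freq), math.cos(t * freq)])
    return out

def encode_row(cells, dim=DIM):
    """Mean-pool per-cell vectors so any number of columns yields `dim` floats.

    cells: list of (kind, value, stats); stats is (mean, std) for numeric cells.
    """
    vectors = []
    for kind, value, stats in cells:
        if kind == "numeric":
            vectors.append(encode_numeric(value, stats[0], stats[1], dim))
        elif kind == "categorical":
            vectors.append(encode_categorical(value, dim))
        elif kind == "timestamp":
            vectors.append(encode_timestamp(value, dim))
        else:
            raise ValueError("unsupported column kind: " + kind)
    return [sum(col) / len(vectors) for col in zip(*vectors)]

customer = encode_row([("numeric", 42.0, (35.0, 10.0)),
                       ("categorical", "gold", None),
                       ("timestamp", 1700000000.0, None)])
prescription = encode_row([("categorical", "amoxicillin", None),
                           ("numeric", 500.0, (250.0, 125.0))])
```

Both rows come out as 8-dimensional vectors even though one table has three columns and the other has two, which is the property the downstream GNN relies on.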
Griffin processes two completely different database schemas, e-commerce and healthcare, using the same model weights, adapting its node features and message passing to each schema's structure.
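One way to picture "same base weights, different conditioning" is a scale-and-shift modulation of a shared linear map, sketched below. This is an assumption chosen for illustration (a FiLM-style modulation); the actual conditioning mechanism in Griffin may differ, and the class and variable names are hypothetical.

```python
import random

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class ConditionedMessage:
    """One shared weight matrix, modulated per edge type by a schema embedding."""

    def __init__(self, feat_dim, schema_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.base = random_matrix(out_dim, feat_dim, rng)        # shared by all edge types
        self.to_scale = random_matrix(out_dim, schema_dim, rng)  # schema -> per-channel scale
        self.to_shift = random_matrix(out_dim, schema_dim, rng)  # schema -> per-channel shift

    def __call__(self, node_features, edge_schema_embedding):
        h = matvec(self.base, node_features)
        scale = matvec(self.to_scale, edge_schema_embedding)
        shift = matvec(self.to_shift, edge_schema_embedding)
        return [(1.0 + s) * x + b for x, s, b in zip(h, scale, shift)]

layer = ConditionedMessage(feat_dim=4, schema_dim=3, out_dim=5)
order_features = [1.0, 0.5, -0.2, 0.0]
places_edge = [1.0, 0.0, 0.0]    # toy schema embedding for "customer places order"
contains_edge = [0.0, 1.0, 0.0]  # toy schema embedding for "order contains product"
msg_places = layer(order_features, places_edge)
msg_contains = layer(order_features, contains_edge)
```

The same `layer` object produces different messages for the two edge types, and no new parameters are needed when a previously unseen edge type appears: only its schema embedding changes.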
| Aspect | Schema-Specific GNN | Griffin (Universal) |
|---|---|---|
| Separate model per schema | Yes | No |
| Handles schema changes | Retrain needed | Zero-shot |
| Transfer between databases | None | Yes |
| Performance on known schema | Best | Slightly lower |
| Performance on new schema | Needs retraining | Competitive zero-shot |
| Requires schema metadata | No | Yes (table/column names) |
One of RDL's core promises is eliminating manual feature engineering. But "eliminating" is too strong — it's more accurate to say RDL automates feature engineering through gradient descent. Understanding what it automates (and what it doesn't) is essential for using it wisely.
RDL's GNN layers learn the following kinds of aggregations automatically, without you specifying them:
Some features still require domain expertise to define correctly:
```python
# What RDL replaces (auto-learned)
"""
COUNT(orders)            as num_orders
AVG(order.amount)        as avg_order
MAX(order.ts)            as last_order
COUNT(DISTINCT products) as diversity
COUNT(reviews)           as review_count
AVG(review.rating)       as avg_rating
"""
# ↑ All captured by 2-layer GNN
```
```python
# What domain experts still provide
"""
orders_last_90d / orders_total
industry_health_score (external)
login_velocity_anomaly
seasonal_adjustment_factor
custom_business_kpi_1
"""
# ↑ Needs human knowledge
```
A realistic production database: 50 million customer rows, 500 million order rows, 20 million product rows. The full relational graph has ~570 million nodes and potentially billions of edges. A 3-layer GNN would need to load the 3-hop neighborhood of every customer, which, for popular products, could be millions of nodes per customer. Loading that for every training example is not feasible.
The solution is the same one GraphSAGE introduced for homogeneous graphs: neighbor sampling. For each training example (customer node), sample at most K neighbors at each layer instead of using all of them. For K=20 and L=3 layers, each training example touches at most 20 + 20² + 20³ = 8,420 sampled neighbors (8,000 of them at the third hop), which is manageable.
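A minimal sketch of layer-wise neighbor sampling, assuming the graph is a plain dict from node to neighbor list (the function names and the toy graph are made up for this example). The first helper computes the worst-case number of sampled neighbors for a given fanout and depth.

```python
import random

def max_sampled_nodes(fanout, num_layers):
    """Upper bound on sampled neighbors: fanout + fanout^2 + ... + fanout^L."""
    return sum(fanout ** hop for hop in range(1, num_layers + 1))

def sample_neighborhood(adjacency, seed_node, fanout, num_layers, rng):
    """Sample at most `fanout` neighbors per node, one hop per layer.

    Returns one list of sampled nodes per hop.
    """
    frontier = [seed_node]
    layers = []
    for _ in range(num_layers):
        next_frontier = []
        for node in frontier:
            nbrs = adjacency.get(node, [])
            chosen = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            next_frontier.extend(chosen)
        layers.append(next_frontier)
        frontier = next_frontier
    return layers

bound = max_sampled_nodes(fanout=20, num_layers=3)

# Toy graph: one customer with 100 orders, each order pointing at 2 products.
adjacency = {"c": ["o%d" % i for i in range(100)]}
for i in range(100):
    adjacency["o%d" % i] = ["p1", "p2"]
layers = sample_neighborhood(adjacency, "c", fanout=5, num_layers=2,
                             rng=random.Random(0))
```

Here `bound` is 8,420 for K=20 and L=3, and the toy customer's 100 orders are cut down to 5 sampled orders at the first hop. Production samplers also deduplicate nodes and build the induced subgraph, which this sketch omits.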
Temporal RDL makes sampling harder. For each prediction at cutoff T, you need to sample from the temporally-filtered neighborhood — the neighbors that existed at time T. This means you can't precompute static neighborhood samples; each cutoff creates a different filtered graph.
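One common way to avoid rebuilding a filtered graph per cutoff is to keep each node's neighbors sorted by edge timestamp and binary-search for the cutoff at sampling time. The sketch below assumes that approach; the `TemporalAdjacency` class is hypothetical and uses only Python's standard `bisect` module.

```python
import bisect
import random

class TemporalAdjacency:
    """Per-node neighbor lists kept sorted by edge timestamp."""

    def __init__(self):
        self._times = {}  # node -> sorted edge timestamps
        self._nbrs = {}   # node -> neighbors, aligned with _times

    def add_edge(self, node, neighbor, timestamp):
        times = self._times.setdefault(node, [])
        nbrs = self._nbrs.setdefault(node, [])
        pos = bisect.bisect_right(times, timestamp)
        times.insert(pos, timestamp)
        nbrs.insert(pos, neighbor)

    def neighbors_before(self, node, cutoff):
        """All neighbors whose edge timestamp is <= cutoff."""
        times = self._times.get(node, [])
        end = bisect.bisect_right(times, cutoff)
        return self._nbrs.get(node, [])[:end]

    def sample(self, node, cutoff, fanout, rng):
        """Sample at most `fanout` neighbors visible at `cutoff`."""
        visible = self.neighbors_before(node, cutoff)
        if len(visible) <= fanout:
            return visible
        return rng.sample(visible, fanout)

graph = TemporalAdjacency()
graph.add_edge("c1", "o1", 5)
graph.add_edge("c1", "o3", 30)
graph.add_edge("c1", "o2", 12)  # inserted out of order on purpose

at_12 = graph.neighbors_before("c1", 12)
at_4 = graph.neighbors_before("c1", 4)
picked = graph.sample("c1", cutoff=100, fanout=2, rng=random.Random(0))
```

Each cutoff costs one binary search per visited node instead of a full graph rebuild. The list insertions in `add_edge` are linear time, so a real system would bulk-sort once or use a different index structure.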
Solutions in active development:
At inference time, embeddings for stable entities (products, categories) change slowly. You can cache their embeddings and recompute only when they receive new interactions. Customer embeddings change faster (new orders come in daily), but most customers have sparse activity, so recomputing them in a nightly batch is feasible for most use cases.
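The cache-and-invalidate idea can be sketched with a small class. This is an illustrative assumption about how such a cache might look, not a description of any particular system; `EmbeddingCache` and the stand-in `compute_fn` are hypothetical.

```python
class EmbeddingCache:
    """Cache embeddings and recompute only entities with new interactions."""

    def __init__(self, compute_fn):
        self._compute = compute_fn   # e.g. a GNN forward pass for one entity
        self._cache = {}
        self._dirty = set()
        self.recomputations = 0      # counter, useful for monitoring

    def record_interaction(self, entity_id):
        """Mark an entity as stale when a new row references it."""
        self._dirty.add(entity_id)

    def get(self, entity_id):
        if entity_id not in self._cache or entity_id in self._dirty:
            self._cache[entity_id] = self._compute(entity_id)
            self._dirty.discard(entity_id)
            self.recomputations += 1
        return self._cache[entity_id]

# Stand-in for an expensive embedding computation.
cache = EmbeddingCache(lambda entity_id: [float(len(entity_id))])
first = cache.get("product_7")
second = cache.get("product_7")        # served from the cache
cache.record_interaction("product_7")  # e.g. a new review arrives
third = cache.get("product_7")         # recomputed
```

A nightly batch job would correspond to calling `get` on every entity in the dirty set at a scheduled time, rather than lazily on first access as done here.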
| Scaling Challenge | Solution | Tradeoff |
|---|---|---|
| Graph too large for GPU | Neighbor sampling (K neighbors/layer) | Variance in gradient estimates |
| Temporal filtering per prediction | Temporal index + cache | Staleness vs. compute |
| Many predictions per day | Batch inference + embedding cache | Slightly stale embeddings |
| New rows arrive continuously | Streaming graph updates | Consistency guarantees |
Where is advanced RDL being applied today? The domains where relational data is richest and the prediction stakes are highest.
Fraud in financial transactions is inherently relational. A fraudster may use the same device across multiple accounts, or route money through a chain of accounts to obscure the trail. The fraud pattern is in the graph structure — not in any individual transaction's features.
The relational graph for fraud: Accounts (table), Transactions (table, FK→Account), Devices (table, FK→Account), Merchants (table, FK→Transaction). A 3-layer GNN on this graph captures: account features, their transaction history, what devices they used, what merchants they transacted with, and what other accounts use the same devices — exactly the multi-hop pattern that characterizes fraud rings.
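The "other accounts that use the same devices" signal is a 2-hop pattern (account → device → account). As a hand-written reference point, the sketch below computes it directly, assuming a hypothetical list of (account, device) usage pairs rather than the exact tables described above.

```python
from collections import defaultdict

def accounts_sharing_devices(usage):
    """Map each account to the other accounts seen on any of its devices.

    usage: iterable of (account_id, device_id) pairs (hypothetical format).
    """
    accounts_by_device = defaultdict(set)
    for account, device in usage:
        accounts_by_device[device].add(account)

    linked = defaultdict(set)
    for accounts in accounts_by_device.values():
        for account in accounts:
            linked[account] |= accounts - {account}
    return dict(linked)

usage = [("a1", "d1"), ("a2", "d1"), ("a3", "d2"), ("a2", "d3"), ("a4", "d3")]
links = accounts_sharing_devices(usage)
```

In this toy data, `a2` is linked to both `a1` and `a4` through two different devices, while `a3` is linked to nobody. A GNN with enough layers can pick up this kind of structure from the graph without the feature being written by hand.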
Customer churn has a strong collaborative component: if a customer's "cohort" (peers who bought similar products at the same time) is churning, they're more likely to churn too. This 3-hop signal — customer → products → other customers → churn rate — is exactly what a 3-layer RDL GNN captures, and exactly what XGBoost with manual features would miss (unless an analyst knew to add "cohort churn rate" as a feature).
Pharmaceutical databases store drugs, targets (proteins), diseases, clinical trials, adverse events, and patients — all linked by foreign keys. Predicting drug-target interactions, adverse event probability, or trial success rates are all relational prediction tasks. The biological relationships span multiple hops: drug → targets → diseases → patients → outcomes.
| Domain | Prediction Task | Key Relational Pattern | Critical Hops |
|---|---|---|---|
| Finance | Fraud detection | Shared devices, IP addresses | 2-3 |
| E-commerce | Customer churn | Cohort behavior | 3 |
| Healthcare | Drug-target interaction | Target pathways, disease comorbidity | 3-4 |
| Social | Content moderation | Network of accounts, shared content | 2-3 |
| Hiring | Job match quality | Employer→employee→skills→jobs | 2-3 |
Relational deep learning is young: the first major paper appeared in 2023. The open problems are not niche edge cases; they are fundamental challenges that determine whether RDL becomes a standard tool or stays a research curiosity.
Griffin is a first step, but its zero-shot performance still falls below schema-specific models on most benchmarks. The gap is shrinking, but it remains. The underlying challenge: a language model can encode the semantics of "prescriptions" (medical, time-sensitive) but not the statistical structure of that specific company's prescription table. Transfer of learned statistical patterns across schemas is still unsolved.
RelBench databases have millions of rows. Amazon, Meta, and Google have hundreds of billions. Temporal neighbor sampling at that scale, with per-prediction cutoffs, remains an engineering challenge with no clean solution. Research on approximate temporal sampling and hierarchical graph methods is active but not yet production-ready.
A compliance officer at a bank asks: "Why did the model flag this transaction as fraud?" An XGBoost model with SHAP values gives an interpretable answer: "feature X contributed +0.3, feature Y contributed -0.1." A GNN gives a high-dimensional embedding and a graph — explaining which neighbors contributed which signal through multi-hop paths is an active research area.
Real databases have NULL values in foreign keys, inconsistent encodings, outdated schemas, and denormalized tables (data duplicated across tables for performance reasons). RDL assumes a clean, consistent schema. Handling messy real-world databases robustly — without a DBA cleaning up first — is an unsolved engineering problem.
| Open Problem | Current State | Research Direction |
|---|---|---|
| Universal encoders | Griffin: promising but gap vs. schema-specific | Better schema embeddings, few-shot fine-tuning |
| Scale | Works to ~100M rows | Temporal sampling, distributed GNN |
| Interpretability | Post-hoc subgraph explanation (GNNExplainer) | Causal path explanations |
| Messy real databases | Requires clean schemas | Robust imputation + schema repair |
Advanced RDL sits at the crossroads of graph learning, relational databases, temporal reasoning, and foundation model research. Each direction opens a rich research area.
| Lecture | Topic | Key Idea |
|---|---|---|
| L3 | GNNs | Message passing, node embeddings |
| L6 | GNN Theory | WL test, expressive power |
| L8 | Link Prediction | Heuristics, embeddings, GNNs for edges |
| L9 | Hetero GNNs | Type-specific transformations |
| L10 | Knowledge Graphs | TransE, RotatE, relation patterns |
| L11 | RecSys | LightGCN, BPR, bipartite graphs |
| L12 | Basic RDL | Databases as graphs, RelBench |
| L13 (this) | Advanced RDL | Temporal MP, universal encoders, Griffin |
"The ability to work with relational data — the dominant format for the world's most important data — is the next frontier for deep learning. Graph neural networks are the key."
— Jure Leskovec, CS224W, 2024