CS224W Lecture 13 — Advanced RDL Architectures

Chapter 0: Beyond Basic RDL

You've built a GNN on a relational database graph. You've handled foreign keys as edges, rows as nodes, temporal cutoffs to prevent leakage. Your churn prediction model works. Now your company acquires another company — different database, completely different schema. You need to start over.

This is the central frustration of basic RDL: the model is schema-specific. The weight matrices were shaped for your particular tables and their particular column counts. A new schema means a new model, new training data, new hyperparameter search. It doesn't transfer.

The Three Hard Problems

Lecture 13 addresses three challenges that basic RDL leaves unsolved:

Problem 1: Temporal ordering. Basic RDL uses a graph cutoff, but doesn't model the sequence of events. A customer who bought items in the order A→B→C→D has a different pattern from one who bought D→C→B→A. Both get the same graph. The temporal sequence is discarded.

Problem 2: Multi-table aggregation depth. How many hops is the right number? For a task needing 4-hop information, a 2-layer GNN misses it. But a 10-layer GNN on a large graph is computationally ruinous. The right depth is task-dependent and hard to set manually.

Problem 3: Schema diversity. Every company's database has different tables, different column names, different foreign key patterns. A model trained on one schema doesn't transfer to another. Training a separate model per schema doesn't scale, because the space of possible schemas is effectively unbounded.

Each of these problems has an emerging solution. Temporal message passing addresses problem 1. Adaptive multi-hop aggregation addresses problem 2. Universal encoders (the Griffin architecture) attack problem 3. We'll cover all three.
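Problem 1 can be made concrete with a few lines of Python. The sketch below assumes a plain mean aggregator and made-up 2-d item embeddings (both are illustrative choices, not part of any specific RDL model). It shows that a permutation-invariant aggregator returns the same vector for the purchase order A→B→C→D and its reverse.

```python
# Two customers bought the same items in opposite temporal order.
# The 2-d item embeddings are made-up values for illustration only.
item_emb = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0], "D": [2.0, 0.5]}

def mean_aggregate(items):
    """Permutation-invariant mean over neighbor embeddings."""
    dim = len(next(iter(item_emb.values())))
    total = [0.0] * dim
    for name in items:
        for i, x in enumerate(item_emb[name]):
            total[i] += x
    return [t / len(items) for t in total]

forward = ["A", "B", "C", "D"]
backward = list(reversed(forward))

emb_forward = mean_aggregate(forward)
emb_backward = mean_aggregate(backward)
print(emb_forward == emb_backward)  # the ordering leaves no trace in the output
```

Any sum, mean, or max aggregator behaves the same way, which is why the ordering has to be injected explicitly through timestamps.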

What Basic RDL Misses

[Interactive figure: three user interaction histories with identical graph structure but different temporal patterns. Basic RDL gives them identical embeddings; temporal-aware RDL distinguishes them.]

Two customers both interacted with items {P1, P2, P3} before the cutoff. They have exactly the same graph neighborhood. Basic LightGCN gives them the same final embedding. Why might this be a problem?

The embedding dimension might not be large enough to distinguish them
The GNN might have too many layers to propagate information correctly
The temporal ordering (P1 then P2 then P3 vs. P3 then P2 then P1) carries predictive signal that the graph structure doesn't encode: customers with identical neighborhoods but different temporal patterns may behave differently
Graph embeddings are always non-unique for nodes with the same degree

Chapter 1: Temporal Message Passing — No Future Leakage

When you build a relational graph at cutoff time T, you prevent future labels from leaking. But basic RDL still has a subtle temporal problem: a node's neighbors may have interacted with it at different times, and basic message passing treats all of them equally — as if they happened simultaneously.

Consider: order O1 happened 6 months ago, order O2 happened yesterday. For predicting churn, O2 is more relevant — recent behavior is a stronger signal. But a standard GNN aggregates both orders with equal weight. Temporal message passing fixes this by ordering and weighting messages by their timestamp.

The Temporal Leakage Risk

There's a deeper version of temporal leakage: an entity's neighbor nodes may have edges that were created after the prediction cutoff. Example: order O3 references product P7. P7 received a review R10 at time T+2 (after the cutoff). A standard GNN would propagate P7's features, which include information from R10, through O3 to the customer — even though R10 was in the future at prediction time.

Temporal message passing enforces strict causality: when computing the embedding of node v at time T, messages can only come from neighbors whose interaction with v happened at time ≤ T. This eliminates temporal leakage from multi-hop paths, not just direct edges.
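The multi-hop leak described above can be reproduced on a toy graph. This is a minimal sketch under assumed data: the node names follow the O3/P7/R10 example, while the timestamps, the extra review R2, and the helper `reachable` are hypothetical. It compares the 3-hop neighborhood of a customer with and without the per-edge cutoff check.

```python
import math
from collections import defaultdict

# Timestamped undirected edges: (node_a, node_b, time). Values are illustrative.
edges = [
    ("C1", "O3", 5),    # customer C1 placed order O3 at t=5
    ("O3", "P7", 5),    # order O3 references product P7
    ("P7", "R10", 12),  # review R10 on P7 arrives at t=12, after the cutoff
    ("P7", "R2", 3),    # an earlier review of P7
]

adjacency = defaultdict(list)
for a, b, t in edges:
    adjacency[a].append((b, t))
    adjacency[b].append((a, t))

def reachable(seed, cutoff, hops):
    """Nodes within `hops` of `seed`, using only edges with time <= cutoff."""
    frontier, seen = {seed}, {seed}
    for _ in range(hops):
        nxt = set()
        for node in frontier:
            for nbr, t in adjacency[node]:
                if t <= cutoff and nbr not in seen:
                    nxt.add(nbr)
        seen |= nxt
        frontier = nxt
    return seen - {seed}

T = 10
filtered = reachable("C1", cutoff=T, hops=3)
unfiltered = reachable("C1", cutoff=math.inf, hops=3)
print(sorted(filtered))    # R10 is excluded
print(sorted(unfiltered))  # R10 leaks in through P7
```

The check runs on every edge at every hop, which is what keeps R10 out even though the customer's own edges all predate the cutoff.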

The Temporal Graph Neural Network (TGNN)

The solution: annotate every edge with its timestamp. During message passing into node u, only aggregate from neighbor v if the edge (v, u) has timestamp τ(v,u) ≤ T, where T is the prediction cutoff. More sophisticated TGNNs also weight messages by recency:

h_u^(l+1) = σ( W_self h_u^(l) + ∑_{v: τ(v,u) ≤ T} α(T - τ(v,u)) · W h_v^(l) )

Here σ is a nonlinearity, W_self and W are learned weight matrices, and α(Δt) is a time decay function, so recent edges get higher weight. Common choices: α(Δt) = exp(-λ·Δt) (exponential decay) or α(Δt) = 1/(1 + Δt) (harmonic decay). The decay rate λ is either learned along with the other parameters or hand-tuned as a hyperparameter.
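The update rule can be traced by hand on a tiny example. The sketch below is an illustration under stated assumptions: σ is taken to be ReLU, both weight matrices are set to the identity, and the neighbor vectors and timestamps are invented. Function names such as `temporal_layer` are hypothetical.

```python
import math

def exp_decay(dt, lam):
    """alpha(dt) = exp(-lam * dt)."""
    return math.exp(-lam * dt)

def inverse_decay(dt):
    """alpha(dt) = 1 / (1 + dt)."""
    return 1.0 / (1.0 + dt)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def temporal_layer(h_self, neighbors, T, W_self, W, alpha):
    """One layer of the update above, with sigma = ReLU (an assumption).

    neighbors: list of (h_v, tau) pairs; edges with tau > T are skipped.
    """
    out = matvec(W_self, h_self)
    for h_v, tau in neighbors:
        if tau > T:
            continue  # causality: ignore post-cutoff edges
        weight = alpha(T - tau)
        msg = matvec(W, h_v)
        out = [o + weight * m for o, m in zip(out, msg)]
    return [max(0.0, o) for o in out]

I2 = [[1.0, 0.0], [0.0, 1.0]]
neighbors = [
    ([1.0, 0.0], 0.0),   # old order, 10 time units before the cutoff
    ([0.0, 1.0], 9.0),   # recent order, 1 time unit before the cutoff
    ([5.0, 5.0], 11.0),  # future order, must be ignored
]
h_new = temporal_layer([0.0, 0.0], neighbors, T=10.0, W_self=I2, W=I2,
                       alpha=lambda dt: exp_decay(dt, lam=0.3))
print(h_new)
```

With λ = 0.3 the recent order's message keeps roughly 74% of its magnitude while the old order's keeps about 5%, and the future order contributes nothing.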

Temporal Message Weighting

[Interactive figure: a customer's 6 neighbor orders, each at a different time before the cutoff, with the weight each order's message receives as the decay rate λ varies (shown at λ = 0.30). Contrasts recency bias with uniform weighting.]

In temporal message passing, why must edges be filtered by timestamp during aggregation, not just at graph construction time?

Filtering at construction time is too slow for large graphs
Timestamp filtering must happen at every layer for numerical stability
Multi-hop message passing can propagate future information even when the target's own edges pass the cutoff check: node v may have a post-cutoff edge to node w, and if v is included in the graph, that post-cutoff information flows to the target through v
Graph construction only filters node timestamps, not edge timestamps

Chapter 2: Multi-Hop Aggregation — Information Across Multiple Tables

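Each GNN layer follows one more foreign key hop, so depth plays the role that chained JOINs play in SQL. A 1-layer GNN sees what a single JOIN from the target table would bring in, a 2-layer GNN what two chained JOINs would, and a 3-layer GNN what three would. The analogy is loose (a GNN learns its aggregation rather than executing a fixed query), but it makes the key question concrete: how many hops does your task actually need?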

The Right Depth Depends on the Task

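Predicting customer churn from their order history: 1 hop (customer → orders). Predicting churn from the products they bought: 2 hops (customer → orders → products). Predicting churn based on what similar customers bought: 3 hops (customer → orders → products → other customers' orders). Because orders is itself a table, the third hop lands on other customers' order rows; reaching those customers' own rows takes one more hop.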
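Hop counts are easy to check on a toy database. The sketch below assumes a hypothetical three-row `orders` table whose rows each carry a customer key and a product key. It turns rows into nodes and foreign keys into edges, then lists which nodes first become visible at each hop from customer c1.

```python
from collections import defaultdict

# Toy table (hypothetical rows). Each order row holds two foreign keys.
orders = [
    {"id": "o1", "cust_id": "c1", "product_id": "p1"},
    {"id": "o2", "cust_id": "c1", "product_id": "p2"},
    {"id": "o3", "cust_id": "c2", "product_id": "p1"},
]

# Rows become nodes; each foreign key becomes an undirected edge.
graph = defaultdict(set)
for row in orders:
    for fk in ("cust_id", "product_id"):
        graph[row["id"]].add(row[fk])
        graph[row[fk]].add(row["id"])

def hop_frontiers(seed, max_hops):
    """Return, per hop, the nodes first reached at exactly that distance."""
    seen, frontier, result = {seed}, {seed}, []
    for _ in range(max_hops):
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
        result.append(sorted(frontier))
    return result

for hop, nodes in enumerate(hop_frontiers("c1", 4), start=1):
    print(hop, nodes)
```

In this toy schema, where orders is a node type of its own, the other customer's order shows up at hop 3 and the other customer's row at hop 4. The exact count depends on whether link tables are modeled as nodes or collapsed into edges.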

Set L too small, and you miss the relevant signal. Set L too large, and you include irrelevant signal from distant nodes (which adds noise) and you over-smooth the embeddings (nearby nodes become indistinguishable). This is the classic over-smoothing problem in GNNs, amplified by the fact that relational graphs can be densely connected.

Over-smoothing in relational graphs. After many GNN layers, every customer node has aggregated from nearly every other customer node (through shared products). All customer embeddings converge to a single average, losing individual signal. For relational graphs, 2-3 layers is typically optimal — beyond that, over-smoothing dominates.
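Over-smoothing can be observed without any training. The sketch below assumes a hand-built 5-node graph, scalar "embeddings", and an unweighted mean over each node and its neighbors (a simplification of a real GNN layer, with no learned weights or nonlinearity). It reports the spread between the largest and smallest value after a given number of layers.

```python
# Toy graph: 5 customer nodes linked through shared products, collapsed here to
# direct customer-customer edges for brevity. Scalars stand in for embeddings.
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
h0 = [0.0, 1.0, 2.0, 3.0, 4.0]

def mean_layer(h):
    """Replace each value by the mean over the node and its neighbors."""
    return [
        (h[v] + sum(h[u] for u in nbrs)) / (1 + len(nbrs))
        for v, nbrs in sorted(neighbors.items())
    ]

def spread_after(layers):
    """Gap between the largest and smallest embedding after `layers` layers."""
    h = list(h0)
    for _ in range(layers):
        h = mean_layer(h)
    return max(h) - min(h)

spreads = [spread_after(L) for L in (0, 1, 2, 3, 5, 10)]
print([round(s, 4) for s in spreads])
```

The spread shrinks with every layer, so after enough layers the five nodes are nearly indistinguishable. Learned weights and nonlinearities change the rate but not the general tendency.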

What Each Layer Captures

GNN Layers	Equivalent SQL	What It Captures
0 (none)	SELECT * FROM customers	Customer's own features only
1	JOIN orders ON cust_id	Order history (count, amount, recency)
2	JOIN orders JOIN products	What products did this customer buy?
3	3-table JOIN + GROUP BY	What do similar customers buy? (collaborative signal)
4+	Complex multi-join	Often over-smoothed — diminishing returns

Over-Smoothing Visualization

[Interactive figure: 5 customer nodes with initially different embeddings (shown as colors). As the number of GNN layers L increases, the embeddings converge (over-smoothing). The sweet spot is typically L=2 or L=3.]

For a customer churn task, a 3-layer GNN on the relational graph collects information from "customers who bought the same products as this customer." This is useful because:

More layers always improve GNN performance
Customer data always needs 3-hop aggregation to avoid under-fitting
If similar customers (same products) have high churn rates, that's a signal this customer might churn too: collaborative filtering applied to churn prediction
SQL can only do 2-table JOINs, so GNNs must handle 3+ hop information

Chapter 3: Schema-Specific GNNs — One Model Per Database

The baseline approach in RDL is to train a separate GNN for each database schema. You inspect the e-commerce schema, design a HeteroConv with the right edge types, train it. You get a new healthcare schema, design another HeteroConv, train again. Each model is custom-built for its schema.

This is schema-specific modeling. It works. RelBench results show it beats XGBoost on complex tasks. But it has fundamental limitations that emerge in production.

Why Schema-Specific Models Work

A schema-specific model can use exactly the right weight matrices for each edge type. The "customer places order" transformation is tuned specifically to the relationship between your company's customer features and your order features. The model has full knowledge of the schema and can leverage it.

Schema-specific models are the practical standard today. When you have one database and the schema is stable, training a custom GNN is the right approach. RelBench's baseline is a schema-specific heterogeneous GNN (HeteroSAGE), and it beats XGBoost on most tasks.

The Failure Modes

Schema-specific models break when:

Schema changes. A new table is added to the database → the model architecture is no longer compatible. Retrain from scratch.
New database. Acquisition, partnership, new product line → completely different schema. No transfer of learned representations.
Many databases. A SaaS company serving 1000 customers, each with a slightly different database schema. Training and maintaining 1000 separate models is operationally impractical.
Few labels. A new customer's database has 6 months of data and only 50 labeled churn events. Not enough to train a GNN from scratch.

The ambition: can we build one model — trained once, on one set of databases — that zero-shot generalizes to any new relational database schema? This is the universal encoder problem, and it's the research frontier of RDL.

python
# Schema-specific: must know every edge type in advance
from torch_geometric.nn import HeteroConv, SAGEConv

conv = HeteroConv({
    ('customer', 'places', 'order'): SAGEConv(...),
    ('order', 'contains', 'product'): SAGEConv(...),
    # ... every edge type hardcoded at model creation time
})
# ^^^ Breaks if the schema changes. Cannot transfer to a new DB.

# Universal (conceptual pseudocode): schema described as data, not architecture.
# UniversalRDLEncoder and build_from_schema are placeholder names, not a real API.
model = UniversalRDLEncoder()
graph = build_from_schema(any_database)   # any schema
embeddings = model(graph)                  # zero-shot generalization

A SaaS company serves 500 enterprise clients, each running the same application but with slightly different database schemas (different tables added for different use cases). Why can't they use one schema-specific GNN for all clients?

GNNs can only run on schemas with fewer than 10 tables
Schema-specific GNNs have hardcoded architecture for the exact node types and edge types of one schema, so different schemas require different architectures, meaning 500 separate models to design, train, and maintain
All clients would need to merge their databases into one for the GNN to work
Enterprise databases are too large for any single GNN model

Chapter 4: Universal Encoders — The Griffin Architecture

The question: can you build a GNN that generalizes across any relational database schema without retraining? This would be the "foundation model for relational data" — trained once, deployed everywhere.

Griffin (Graph Foundation Model for Relational Data) is one attempt at this. The insight: the schema itself is data. Instead of hardcoding the schema into the GNN's architecture, encode the schema as part of the input. The model learns to read the schema description and adapt its behavior accordingly.

The Key Design Decisions in Griffin

1. Schema-agnostic node features. Different tables have different columns with different types. Griffin uses a universal feature encoder that can handle any column: numeric values are normalized and embedded, categorical values use a shared vocabulary embedding, text is encoded with a frozen language model, timestamps use sinusoidal encoding. The output is always a fixed-size vector, regardless of what columns the table has.

2. Schema description as context. The table name, column names, and foreign key relationships are serialized as text and passed through a language model. This "schema embedding" tells the GNN the semantics of each node type — even for schemas it has never seen before. A new table called "prescriptions" will produce an embedding that reflects medical prescription semantics, without any prescription-specific training.

3. Shared message passing weights. Instead of separate weight matrices per edge type, Griffin uses shared weights conditioned on the schema embedding. The message from an Order node to a Customer node uses weights that are modulated by the schema description of that edge type. Same base weights, different conditioning — efficient transfer.
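The first design decision, a fixed-size vector for any set of columns, can be sketched in plain Python. This is an illustration only and not Griffin's actual encoder. The width `DIM`, the hash-seeded stand-in for a learned categorical vocabulary, the mean pooling across columns, and all function names are assumptions. Text columns are left out because they would need a language model.

```python
import math
import random
import zlib

DIM = 8  # illustrative embedding width

def sinusoidal(x, dim=DIM):
    """Sin/cos features at geometrically spaced frequencies."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out.extend([math.sin(x * freq), math.cos(x * freq)])
    return out

def encode_categorical(value, dim=DIM):
    """Deterministic pseudo-embedding keyed by the category string.

    Stands in for a lookup into a shared, learned vocabulary table.
    """
    rng = random.Random(zlib.crc32(str(value).encode("utf-8")))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def encode_row(row, column_types, stats):
    """Encode each column to DIM numbers, then mean-pool across columns."""
    vectors = []
    for column, kind in column_types.items():
        value = row[column]
        if kind == "numeric":
            mean, std = stats[column]
            vectors.append(sinusoidal((value - mean) / std))
        elif kind == "categorical":
            vectors.append(encode_categorical(value))
        elif kind == "timestamp":
            vectors.append(sinusoidal(value))
        else:
            raise ValueError(f"unsupported column type: {kind}")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

# Two tables with different columns map to vectors of the same size.
order_vec = encode_row(
    {"amount": 40.0, "status": "shipped", "day": 120.0},
    {"amount": "numeric", "status": "categorical", "day": "timestamp"},
    {"amount": (30.0, 10.0)},
)
rx_vec = encode_row(
    {"drug_name": "aspirin", "dosage": 0.5},
    {"drug_name": "categorical", "dosage": "numeric"},
    {"dosage": (1.0, 0.5)},
)
print(len(order_vec), len(rx_vec))
```

Because every row ends up with the same width, the downstream message passing layers never need to know how many columns a table had.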

Griffin: Universal Relational Encoder

[Interactive figure: Griffin processing two completely different database schemas, e-commerce and healthcare, with the same model weights, showing how its node features and message passing adapt to each schema's structure.]

How Griffin Handles New Schemas

New Database Schema: table names, column names, foreign keys, dtypes

↓ text serialization

Schema Description: "Table 'prescriptions' with columns: patient_id (FK→patients), drug_name (text), dosage (float), date (timestamp)"

↓ frozen language model

Schema Embedding: dense vector capturing semantic meaning of each table and relationship

↓ shared GNN weights conditioned on schema embedding

Node Embeddings: same GNN backbone, schema-conditioned, for zero-shot generalization
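The text serialization step of this pipeline is simple enough to sketch. The function below is a hypothetical helper (its name and input format are assumptions, not Griffin's code) that reproduces the schema description string shown in the pipeline. The language model and GNN stages are not covered.

```python
def serialize_table(name, columns):
    """Render one table as text.

    columns: list of (column_name, dtype, fk_target_or_None) tuples.
    Foreign key columns are tagged with their target table instead of a dtype.
    """
    parts = []
    for col, dtype, fk in columns:
        tag = f"FK→{fk}" if fk else dtype
        parts.append(f"{col} ({tag})")
    return f"Table '{name}' with columns: " + ", ".join(parts)

prescriptions = [
    ("patient_id", "int", "patients"),
    ("drug_name", "text", None),
    ("dosage", "float", None),
    ("date", "timestamp", None),
]
description = serialize_table("prescriptions", prescriptions)
print(description)
```

The resulting string is what a frozen text encoder would receive, so any schema that can be described this way can be fed to the same model.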

Griffin vs Schema-Specific Comparison

Aspect	Schema-Specific GNN	Griffin (Universal)
Separate model per schema	Yes	No
Handles schema changes	Retrain needed	Zero-shot
Transfer between databases	None	Yes
Performance on known schema	Best	Slightly lower
Performance on new schema	Needs retraining	Competitive zero-shot
Requires schema metadata	No	Yes (table/column names)

How does Griffin handle a database schema it has never seen before, without retraining its weights?

It retrieves the most similar schema from a stored library and copies its weights
It performs fast meta-learning (MAML) to adapt in a few gradient steps
It encodes the schema description as text through a language model, producing a schema embedding that conditions the shared GNN weights: the model learns to read schemas and adapt, rather than baking schema-specific weights in at training time
It converts the new schema to match the training schema through automated schema mapping

Chapter 5: Feature Engineering Comparison

One of RDL's core promises is eliminating manual feature engineering. But "eliminating" is too strong — it's more accurate to say RDL automates feature engineering through gradient descent. Understanding what it automates (and what it doesn't) is essential for using it wisely.

What RDL Learns Automatically

RDL's GNN layers learn the following kinds of aggregations automatically, without you specifying them:

Count aggregations: "how many orders has this customer placed?" (1-hop, sum)
Value aggregations: "what is the average order amount?" (1-hop, mean)
Recency: "how long since the last interaction?" (temporal decay weighting)
Collaborative signal: "what do similar users buy?" (2-3 hop)
Cross-table correlations: "do reviews of their products predict their churn?" (2-hop)
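The first two items in this list have a direct SQL counterpart, which the sketch below checks on a toy `orders` table (rows and column names are illustrative). It uses fixed sum and mean aggregation rather than a trained GNN, so it only shows what a single 1-hop layer is able to express.

```python
import sqlite3

orders = [("o1", "c1", 20.0), ("o2", "c1", 40.0), ("o3", "c2", 15.0)]

# SQL view: hand-written aggregate features per customer.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id TEXT, cust_id TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
sql_features = {
    cust: (count, avg)
    for cust, count, avg in con.execute(
        "SELECT cust_id, COUNT(*), AVG(amount) FROM orders GROUP BY cust_id"
    )
}
con.close()

# Graph view: each order node sends the message [1, amount] to its customer.
# Summing the first entry gives the count; summed amount / count gives the mean.
messages = {}
for _, cust, amount in orders:
    count, total = messages.get(cust, (0, 0.0))
    messages[cust] = (count + 1, total + amount)
graph_features = {c: (n, total / n) for c, (n, total) in messages.items()}

print(sql_features == graph_features)
```

A trained layer learns weighted versions of these aggregates, and deeper layers compose them across tables.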

What RDL Does NOT Learn Automatically

Some features still require domain expertise to define correctly:

Business-specific windows: "orders in the last 90 days" specifically (the model can learn recency bias but not that "90 days" is a meaningful threshold for your business)
Ratio features: "return rate = returns / orders" (requires computing this from two columns)
External data: "is this customer's industry in a recession?" (not in the database)
Domain-specific anomalies: "this pattern of 3 rapid logins is a fraud indicator" (needs domain knowledge to identify)

The correct mental model: RDL automates the graph traversal and aggregation part of feature engineering. It does not replace domain knowledge about which features are meaningful or what external data to include. Hybrid approaches — RDL for the graph features, domain experts for the specialized ones — often perform best.

python
# What RDL replaces (auto-learned)
"""
COUNT(orders) as num_orders
AVG(order.amount) as avg_order
MAX(order.ts) as last_order
COUNT(DISTINCT products) as diversity
COUNT(reviews) as review_count
AVG(review.rating) as avg_rating
"""
# ↑ All captured by 2-layer GNN

python
# What domain experts still provide
"""
orders_last_90d / orders_total
industry_health_score (external)
login_velocity_anomaly
seasonal_adjustment_factor
custom_business_kpi_1
"""
# ↑ Needs human knowledge
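The hybrid approach amounts to appending expert features to the node features before the GNN sees them. The sketch below computes the `orders_last_90d / orders_total` ratio from the list above. The customer rows, the day numbering, and the helper name are hypothetical.

```python
def orders_last_90d_ratio(order_days, cutoff_day, window=90):
    """Share of a customer's pre-cutoff orders that fall inside the window."""
    past = [d for d in order_days if d <= cutoff_day]
    if not past:
        return 0.0
    recent = [d for d in past if cutoff_day - d < window]
    return len(recent) / len(past)

# Raw columns from the customer table (illustrative: age, is_premium).
raw_features = {"c1": [34.0, 1.0], "c2": [51.0, 0.0]}
# Order dates as day numbers; only orders up to the cutoff are counted.
order_days = {"c1": [10, 200, 350, 360], "c2": [5, 20]}

cutoff = 365
node_features = {
    cust: feats + [orders_last_90d_ratio(order_days[cust], cutoff)]
    for cust, feats in raw_features.items()
}
print(node_features)
```

The 90-day window is the piece of business knowledge here. The GNN then treats the ratio like any other input column.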

A domain expert knows that customers who place exactly 3 orders and then pause for 60 days are 80% likely to churn. Can a standard 2-layer GNN on the relational graph automatically capture this pattern?

Yes, GNNs can learn any function of the graph structure including exact counts and time gaps
Yes, but only if the GNN uses attention mechanisms to count the 3 orders
Partially: the GNN can learn recency bias and order count aggregations, but the specific threshold of "exactly 3 orders and exactly 60 days pause" requires domain knowledge to encode as a feature; the GNN might approximate it but may not perfectly capture these sharp thresholds
No, GNNs cannot process temporal information at all

Chapter 6: Scalability for Large Databases

A realistic production database: 50 million customer rows, 500 million order rows, 20 million product rows. The full relational graph has ~570 million nodes and potentially billions of edges. A 3-layer GNN would need to load the 3-hop neighborhood of every customer, which, for popular products, could be millions of nodes per customer. Materializing full neighborhoods at that scale is infeasible.

Mini-Batch Neighbor Sampling

The solution is the same one GraphSAGE introduced for homogeneous graphs: neighbor sampling. For each training example (customer node), sample at most K neighbors per node at each layer instead of using all of them. For K=20 and L=3 layers, each training example touches at most 1 + 20 + 20² + 20³ = 8,421 nodes (the outermost hop alone contributes 20³ = 8,000), which is manageable.

The tradeoff: sampling K neighbors introduces variance (different runs see different neighborhoods). Larger K means lower variance but more memory. Typical values: K=25-50 for the first hop (closest to the target node), K=10-25 for later hops (fewer samples further from the target node, where individual samples are less critical).
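The sampling budget can be checked with a short sketch. The code below is a simplified, stdlib-only version of fanout sampling (the function names and the hub-shaped toy graph are assumptions). It counts the seed and every intermediate hop, which is why the bound for three hops of 20 comes to 8,421 rather than only the 8,000 nodes of the outermost hop.

```python
import random

def sample_neighborhood(adjacency, seed, fanouts, rng):
    """GraphSAGE-style sampling: at hop i keep at most fanouts[i] neighbors per node."""
    layers = [[seed]]
    for k in fanouts:
        frontier = []
        for node in layers[-1]:
            nbrs = adjacency.get(node, [])
            frontier.extend(nbrs if len(nbrs) <= k else rng.sample(nbrs, k))
        layers.append(frontier)
    return layers

def max_sampled_nodes(fanouts):
    """Upper bound on nodes touched, counting the seed: 1 + K1 + K1*K2 + ..."""
    total, width = 1, 1
    for k in fanouts:
        width *= k
        total += width
    return total

# Toy graph: a hub with 1,000 neighbors, each linked back to the hub.
adjacency = {"hub": [f"n{i}" for i in range(1000)]}
for i in range(1000):
    adjacency[f"n{i}"] = ["hub"]

layers = sample_neighborhood(adjacency, "hub", fanouts=[20, 20, 20],
                             rng=random.Random(0))
layer_sizes = [len(layer) for layer in layers]
print(layer_sizes)
print(max_sampled_nodes([20, 20, 20]))
```

The hub has 1,000 neighbors, yet no hop expands any node into more than 20 of them, so the cost per training example stays bounded regardless of node degree.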

The Temporal Complication

Temporal RDL makes sampling harder. For each prediction at cutoff T, you need to sample from the temporally-filtered neighborhood — the neighbors that existed at time T. This means you can't precompute static neighborhood samples; each cutoff creates a different filtered graph.

Solutions in active development:

Temporal index structures: index edges by timestamp for fast range queries
Approximate temporal subgraphs: pre-compute neighborhoods at a coarse time resolution
Snapshot GNNs: maintain graph snapshots at regular intervals, interpolate between them
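The first of these ideas, a temporal index, can be sketched with sorted per-node edge lists and binary search. The class below is a hypothetical illustration (the name `TemporalAdjacency` and the toy edges are assumptions), not a description of any particular library.

```python
import bisect
import random

class TemporalAdjacency:
    """Per-node neighbor lists sorted by edge time, for fast 'before T' lookups."""

    def __init__(self, edges):
        buckets = {}
        for src, dst, t in edges:
            buckets.setdefault(src, []).append((t, dst))
        self.times, self.nbrs = {}, {}
        for node, pairs in buckets.items():
            pairs.sort()
            self.times[node] = [t for t, _ in pairs]
            self.nbrs[node] = [n for _, n in pairs]

    def neighbors_before(self, node, cutoff):
        """All neighbors whose edge time is <= cutoff, found by binary search."""
        idx = bisect.bisect_right(self.times.get(node, []), cutoff)
        return self.nbrs.get(node, [])[:idx]

    def sample(self, node, cutoff, k, rng):
        """Sample at most k neighbors from the temporally valid ones."""
        valid = self.neighbors_before(node, cutoff)
        return valid if len(valid) <= k else rng.sample(valid, k)

edges = [("c1", "o1", 3), ("c1", "o2", 8), ("c1", "o3", 15), ("c1", "o4", 21)]
index = TemporalAdjacency(edges)
print(index.neighbors_before("c1", cutoff=10))
print(index.neighbors_before("c1", cutoff=20))
picked = index.sample("c1", cutoff=20, k=2, rng=random.Random(0))
print(picked)
```

The index is built once, and each prediction then supplies its own cutoff at query time, so different cutoffs can share the same structure.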

Inference Caching

At inference time, embeddings for stable entities (products, categories) change slowly. You can cache their embeddings and only recompute when they receive new interactions. Customer embeddings change faster (new orders come in daily) but most customers have sparse activity — batch recompute nightly is feasible for most use cases.
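A minimal sketch of this caching policy follows. The class and the `fake_gnn_embedding` stand-in are hypothetical; a real system would call the trained model instead and would also need an eviction and staleness policy.

```python
class EmbeddingCache:
    """Recompute an entity's embedding only after it receives new interactions."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn
        self.cache = {}
        self.dirty = set()
        self.recomputes = 0

    def mark_updated(self, entity_id):
        """Call when a new row (order, review, ...) references this entity."""
        self.dirty.add(entity_id)

    def get(self, entity_id):
        if entity_id not in self.cache or entity_id in self.dirty:
            self.cache[entity_id] = self.compute_fn(entity_id)
            self.dirty.discard(entity_id)
            self.recomputes += 1
        return self.cache[entity_id]

def fake_gnn_embedding(entity_id):
    """Stand-in for a real GNN forward pass."""
    return [float(len(entity_id)), 0.0]

cache = EmbeddingCache(fake_gnn_embedding)
cache.get("product_7")           # first access: computed
cache.get("product_7")           # second access: served from the cache
cache.mark_updated("product_7")  # e.g. a new review arrived
cache.get("product_7")           # recomputed once
print(cache.recomputes)
```

A nightly batch job for customers fits the same interface: mark every customer with new orders as updated, then call `get` for each of them.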

Scaling Challenge	Solution	Tradeoff
Graph too large for GPU	Neighbor sampling (K neighbors/layer)	Variance in gradient estimates
Temporal filtering per prediction	Temporal index + cache	Staleness vs. compute
Many predictions per day	Batch inference + embedding cache	Slightly stale embeddings
New rows arrive continuously	Streaming graph updates	Consistency guarantees

Why does temporal message passing make neighbor sampling harder than in standard (non-temporal) GNNs?

Temporal GNNs require larger K (more neighbors) to compensate for missing temporal information
Temporal graphs have higher average degree, requiring more memory per sample
Each prediction has a different cutoff time, so the set of valid neighbors changes per prediction: you cannot precompute static neighborhood samples that work for all cutoff times
Temporal GNNs cannot use mini-batching because order matters

Chapter 7: Real-World Applications

Where is advanced RDL being applied today? The domains where relational data is richest and the prediction stakes are highest.

Fraud Detection

Fraud in financial transactions is inherently relational. A fraudster may use the same device across multiple accounts, or route money through a chain of accounts to obscure the trail. The fraud pattern is in the graph structure — not in any individual transaction's features.

Why GNNs beat rule-based systems for fraud: fraud rings create complex multi-hop patterns. "Account A → Account B → Account C all used the same device within 1 hour" is a 3-hop temporal pattern. Rule-based systems require a human to specify this rule. GNNs discover such patterns automatically from labeled examples.

The relational graph for fraud: Accounts (table), Transactions (table, FK→Account), Devices (table, FK→Account), Merchants (table, FK→Transaction). A 3-layer GNN on this graph captures: account features, their transaction history, what devices they used, what merchants they transacted with, and what other accounts use the same devices — exactly the multi-hop pattern that characterizes fraud rings.
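The shared-device part of this pattern is a 2-hop lookup, which the sketch below makes explicit. The account and device identifiers are invented, and the code only retrieves the structure a GNN layer would aggregate over. It does not score fraud.

```python
from collections import defaultdict

# Hypothetical account-device usage records: (account_id, device_id).
usage = [
    ("acctA", "dev1"), ("acctB", "dev1"), ("acctC", "dev1"),
    ("acctD", "dev2"), ("acctE", "dev3"), ("acctD", "dev3"),
]

accounts_by_device = defaultdict(set)
for account, device in usage:
    accounts_by_device[device].add(account)

def accounts_sharing_device(account):
    """2-hop neighbors: account -> device -> other accounts."""
    linked = set()
    for accounts in accounts_by_device.values():
        if account in accounts:
            linked |= accounts
    return linked - {account}

print(sorted(accounts_sharing_device("acctA")))
print(sorted(accounts_sharing_device("acctD")))
```

None of these links is visible in a single transaction row, which is the point of treating the data as a graph.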

Churn Prediction

Customer churn has a strong collaborative component: if a customer's "cohort" (peers who bought similar products at the same time) is churning, they're more likely to churn too. This 3-hop signal — customer → products → other customers → churn rate — is exactly what a 3-layer RDL GNN captures, and exactly what XGBoost with manual features would miss (unless an analyst knew to add "cohort churn rate" as a feature).

Drug Discovery

Pharmaceutical databases store drugs, targets (proteins), diseases, clinical trials, adverse events, and patients — all linked by foreign keys. Predicting drug-target interactions, adverse event probability, or trial success rates are all relational prediction tasks. The biological relationships span multiple hops: drug → targets → diseases → patients → outcomes.

Domain	Prediction Task	Key Relational Pattern	Critical Hops
Finance	Fraud detection	Shared devices, IP addresses	2-3
E-commerce	Customer churn	Cohort behavior	3
Healthcare	Drug-target interaction	Target pathways, disease comorbidity	3-4
Social	Content moderation	Network of accounts, shared content	2-3
Hiring	Job match quality	Employer→employee→skills→jobs	2-3

In fraud detection, why is a GNN approach preferred over checking individual transaction features?

GNNs process transactions faster in real time
Individual transaction features (amount, merchant) are encrypted and GNNs can process encrypted data
Fraud is organized as rings with multi-hop patterns (shared devices, layered accounts) that are invisible in individual transaction features but visible as graph structure: GNNs detect the structure, not just individual anomalies
GNNs don't need labeled examples, which is important because fraud labels are hard to get

Chapter 8: Open Problems

Relational deep learning is young: the first major paper appeared in 2023. The open problems are not niche edge cases; they're fundamental challenges that determine whether RDL becomes a standard tool or stays a research curiosity.

Open Problem 1: Truly Universal Encoders

Griffin is a first step, but its zero-shot performance still falls below schema-specific models on most benchmarks. The gap is shrinking, but it remains. The underlying challenge: a language model can encode the semantics of "prescriptions" (medical, time-sensitive) but not the statistical structure of that specific company's prescription table. Transfer of learned statistical patterns across schemas is still unsolved.

Open Problem 2: Scalability at True Production Scale

RelBench databases have millions of rows. Amazon, Meta, and Google have hundreds of billions. Temporal neighbor sampling at that scale, with per-prediction cutoffs, remains an engineering challenge with no clean solution. Research on approximate temporal sampling and hierarchical graph methods is active but not yet production-ready.

Open Problem 3: Interpretability

A compliance officer at a bank asks: "Why did the model flag this transaction as fraud?" An XGBoost model with SHAP values gives an interpretable answer: "feature X contributed +0.3, feature Y contributed -0.1." A GNN gives a high-dimensional embedding and a graph — explaining which neighbors contributed which signal through multi-hop paths is an active research area.

GNN explainability for relational data is harder than for tabular data. The explanation must reference graph paths: "This customer was flagged because they share a device with Account B, which has pending fraud cases, which shares a merchant with Account C, which is blacklisted." Automatically generating such explanations is an open problem.

Open Problem 4: Handling Null Values and Schema Inconsistencies

Real databases have NULL values in foreign keys, inconsistent encodings, outdated schemas, and denormalized tables (data duplicated across tables for performance reasons). RDL assumes a clean, consistent schema. Handling messy real-world databases robustly — without a DBA cleaning up first — is an unsolved engineering problem.
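One simple and lossy convention for NULL or dangling foreign keys is to create no edge for them and keep a record of what was skipped. The sketch below illustrates that convention only. The helper name and rows are hypothetical, and it does not address inconsistent encodings or denormalized tables.

```python
def build_fk_edges(rows, fk_column, valid_targets):
    """Return (edges, skipped_ids); rows with a NULL or dangling key add no edge."""
    edges, skipped = [], []
    for row in rows:
        target = row.get(fk_column)
        if target is None or target not in valid_targets:
            skipped.append(row["id"])
        else:
            edges.append((row["id"], target))
    return edges, skipped

orders = [
    {"id": "o1", "cust_id": "c1"},
    {"id": "o2", "cust_id": None},   # NULL foreign key
    {"id": "o3", "cust_id": "c99"},  # points at a customer row that does not exist
]
edges, skipped = build_fk_edges(orders, "cust_id", valid_targets={"c1", "c2"})
print(edges)
print(skipped)
```

Skipped rows still exist as isolated nodes, so their own columns remain usable even though they pass no messages along that key.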

Open Problem	Current State	Research Direction
Universal encoders	Griffin: promising but gap vs. schema-specific	Better schema embeddings, few-shot fine-tuning
Scale	Works to ~100M rows	Temporal sampling, distributed GNN
Interpretability	Post-hoc subgraph explanation (GNNExplainer)	Causal path explanations
Messy real databases	Requires clean schemas	Robust imputation + schema repair

Why is explaining a GNN fraud detection decision harder than explaining an XGBoost decision?

GNNs are always black boxes with no interpretability tools
GNNs have too many parameters for SHAP to compute efficiently
XGBoost decisions are explained by feature contributions (which input features mattered). GNN decisions depend on multi-hop graph paths, so the explanation is a subgraph, not a feature list, and communicating "Account A is suspicious because of a 3-hop connection to a blacklisted merchant" in a compliance-readable format is an open problem
XGBoost is a linear model so its decisions are trivially interpretable

Chapter 9: Connections — Where to Go Next

Advanced RDL sits at the crossroads of graph learning, relational databases, temporal reasoning, and foundation model research. Each direction opens a rich research area.

The Through-Line of This Course

Everything is a graph. Social networks, knowledge graphs, molecular structures, citation networks, recommendation systems, relational databases — they all reduce to nodes, edges, and message passing. The GNN toolkit is remarkably general. RDL is the latest and perhaps most impactful application of this insight.

Structure is supervision. GNNs don't need labels to learn useful representations — the graph structure itself provides self-supervisory signal. Temporal graph structure (who interacted with whom, and when) is even richer. The frontier is learning better priors from this structure.

Scale requires rethinking everything. GraphSAGE showed this for social graphs. PinSage showed it for recommendation. RDL is still working through the implications for temporal relational graphs. The memory/compute/accuracy tradeoffs are different at each scale order of magnitude.

Foundation models for structured data. LLMs are foundation models for text. Griffin aspires to be a foundation model for relational data. The community is still figuring out what "pre-training" and "transfer" mean for graphs and tables. This is 2025 frontier research.

Full CS224W Lecture Path

Lecture	Topic	Key Idea
L3	GNNs	Message passing, node embeddings
L6	GNN Theory	WL test, expressive power
L8	Link Prediction	Heuristics, embeddings, GNNs for edges
L9	Hetero GNNs	Type-specific transformations
L10	Knowledge Graphs	TransE, RotatE, relation patterns
L11	RecSys	LightGCN, BPR, bipartite graphs
L12	Basic RDL	Databases as graphs, RelBench
L13 (this)	Advanced RDL	Temporal MP, universal encoders, Griffin

Key Papers

Griffin — Wang et al. (2025). "Griffin: Towards a Graph-Centric Relational Database Foundation Model." ICML.

RelGNN

RelBench — Robinson et al. (2024). "RelBench: A Benchmark for Deep Learning on Relational Databases." NeurIPS Datasets and Benchmarks Track.
TGNN — Rossi et al. (2020). "Temporal Graph Networks for Deep Learning on Dynamic Graphs." arXiv.

"The ability to work with relational data — the dominant format for the world's most important data — is the next frontier for deep learning. Graph neural networks are the key."
— Jure Leskovec, CS224W, 2024