CS224W Lecture 13

Advanced RDL Architectures

Basic RDL works. But real databases have time-ordering, dozens of tables, and schemas that change between companies. The frontier: one model that works on any relational database without retraining.

Prerequisites: Lecture 12 (basic RDL) + foreign keys and timestamps. That's it.

Chapter 0: Beyond Basic RDL

You've built a GNN on a relational database graph. You've handled foreign keys as edges, rows as nodes, temporal cutoffs to prevent leakage. Your churn prediction model works. Now your company acquires another company — different database, completely different schema. You need to start over.

This is the central frustration of basic RDL: the model is schema-specific. The weight matrices were shaped for your particular tables and their particular column counts. A new schema means a new model, new training data, new hyperparameter search. It doesn't transfer.

The Three Hard Problems

Lecture 13 addresses three challenges that basic RDL leaves unsolved:

Problem 1: Temporal ordering. Basic RDL uses a graph cutoff, but doesn't model the sequence of events. A customer who bought items in the order A→B→C→D has a different pattern from one who bought D→C→B→A. Both get the same graph. The temporal sequence is discarded.
Problem 2: Multi-table aggregation depth. How many hops is the right number? For a task needing 4-hop information, a 2-layer GNN misses it. But a 10-layer GNN on a large graph is computationally ruinous. The right depth is task-dependent and hard to set manually.
Problem 3: Schema diversity. Every company's database has different tables, different column names, different foreign key patterns. A model trained on one schema doesn't transfer to another. Training a separate model per schema doesn't scale — there are infinitely many schemas in the world.
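To make Problem 1 concrete, here is a minimal sketch (the item names and timestamps are invented) of why a cutoff-only graph cannot separate the two purchase orders: once timestamps are dropped, both customers reduce to the same set of edges.

```python
# Two customers buy the same items in opposite order (toy data).
purchases_u1 = [("A", 1), ("B", 2), ("C", 3), ("D", 4)]   # (item, timestamp)
purchases_u2 = [("D", 1), ("C", 2), ("B", 3), ("A", 4)]

def graph_view(purchases):
    """What a cutoff-only graph keeps: the set of purchased items, no order."""
    return frozenset(item for item, _ in purchases)

def sequence_view(purchases):
    """What a temporal model can keep: items sorted by timestamp."""
    return tuple(item for item, _ in sorted(purchases, key=lambda p: p[1]))

same_graph = graph_view(purchases_u1) == graph_view(purchases_u2)
same_sequence = sequence_view(purchases_u1) == sequence_view(purchases_u2)
print(same_graph, same_sequence)
```

The graph views match while the sequence views differ, which is exactly the information a temporal model has to recover.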

Each of these problems has an emerging solution. Temporal message passing addresses problem 1. Adaptive multi-hop aggregation addresses problem 2. Universal encoders (the Griffin architecture) attack problem 3. We'll cover all three.

What Basic RDL Misses

Simulation: three user interaction histories with identical graph structure but different temporal patterns. Basic RDL gives them identical embeddings; temporal-aware RDL distinguishes them.

Two customers both interacted with items {P1, P2, P3} before the cutoff. They have exactly the same graph neighborhood. Basic LightGCN gives them the same final embedding. Why might this be a problem?

Chapter 1: Temporal Message Passing — No Future Leakage

When you build a relational graph at cutoff time T, you prevent future labels from leaking. But basic RDL still has a subtle temporal problem: a node's neighbors may have interacted with it at different times, and basic message passing treats all of them equally — as if they happened simultaneously.

Consider: order O1 happened 6 months ago, order O2 happened yesterday. For predicting churn, O2 is more relevant — recent behavior is a stronger signal. But a standard GNN aggregates both orders with equal weight. Temporal message passing fixes this by ordering and weighting messages by their timestamp.

The Temporal Leakage Risk

There's a deeper version of temporal leakage: an entity's neighbor nodes may have edges that were created after the prediction cutoff. Example: order O3 references product P7. P7 received a review R10 at time T+2 (after the cutoff). A standard GNN would propagate P7's features, which include information from R10, through O3 to the customer — even though R10 was in the future at prediction time.

Temporal message passing enforces strict causality: when computing the embedding of node v at time T, messages can only come from neighbors whose interaction with v happened at time ≤ T. This eliminates temporal leakage from multi-hop paths, not just direct edges.
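The multi-hop case can be sketched in a few lines. This is an illustration under simplifying assumptions (a single global cutoff, a toy edge list, made-up timestamps), not a production sampler; the node names follow the O3 / P7 / R10 example above.

```python
# Toy temporal graph: each edge is (source, target, timestamp).
T = 10
edges = [
    ("O3", "C1", 8),    # customer C1 placed order O3 at t=8
    ("P7", "O3", 8),    # order O3 references product P7
    ("R9", "P7", 5),    # review R9 of P7, before the cutoff
    ("R10", "P7", 12),  # review R10 of P7 at T+2, after the cutoff
]

def causal_neighborhood(target, edges, cutoff, hops):
    """Collect nodes that may send information to `target` within `hops`
    steps, following only edges with timestamp <= cutoff."""
    incoming = {}
    for src, dst, ts in edges:
        if ts <= cutoff:                      # the filter applies at every hop
            incoming.setdefault(dst, []).append(src)
    frontier, seen = {target}, set()
    for _ in range(hops):
        frontier = {s for n in frontier for s in incoming.get(n, [])} - seen
        seen |= frontier
    return seen

visible = causal_neighborhood("C1", edges, cutoff=T, hops=3)
print(sorted(visible))
```

R10 never reaches the customer, even though it is three hops away through an order that was itself placed before the cutoff.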

The Temporal Graph Neural Network (TGNN)

The solution: annotate every edge with its timestamp. During message passing, node u aggregates from neighbor v only if the edge (v, u) has timestamp τ(v,u) ≤ T, where T is the prediction cutoff. More sophisticated TGNNs also weight messages by recency:

$$h_u^{(l+1)} = \sigma\Big( W_{\text{self}}\, h_u^{(l)} + \sum_{v:\ \tau(v,u) \le T} \alpha\big(T - \tau(v,u)\big) \cdot W\, h_v^{(l)} \Big)$$

Where α(Δt) is a time decay function: recent edges get higher weight. Common choices: α(Δt) = exp(-λ·Δt) (exponential decay) or α(Δt) = 1/(1 + Δt) (harmonic decay). The decay rate λ is either learned as a parameter or tuned by hand as a hyperparameter.
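As a rough illustration of the update rule, the sketch below implements a scalar version of it: features are single numbers, the weights W_self and W are plain floats, and σ is taken to be ReLU. All of these are simplifying assumptions made for the example.

```python
import math

def exp_decay(dt, lam):
    """alpha(dt) = exp(-lam * dt)."""
    return math.exp(-lam * dt)

def harmonic_decay(dt):
    """alpha(dt) = 1 / (1 + dt)."""
    return 1.0 / (1.0 + dt)

def temporal_aggregate(h_self, neighbors, T, lam, w_self=1.0, w=1.0):
    """Scalar form of the update: relu(w_self*h_u + sum alpha(T - tau)*w*h_v)
    over neighbors with tau <= T. ReLU as sigma is an assumption."""
    total = w_self * h_self
    for h_v, tau in neighbors:
        if tau <= T:                          # causal filter
            total += exp_decay(T - tau, lam) * w * h_v
    return max(0.0, total)

# (feature, timestamp) pairs; the last one lies after the cutoff and is ignored.
neighbors = [(1.0, 0.0), (1.0, 9.0), (1.0, 12.0)]
out = temporal_aggregate(h_self=0.0, neighbors=neighbors, T=10.0, lam=0.3)
print(round(out, 4))
```

With λ = 0.3, the neighbor from one time unit ago contributes far more than the one from ten units ago, and the post-cutoff neighbor contributes nothing.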

Temporal Message Weighting

Simulation: a customer's 6 neighbor orders, each at a different time before the cutoff. Varying the decay rate λ shows how much weight each order's message gets, from strong recency bias to uniform weighting.

In temporal message passing, why must edges be filtered by timestamp during aggregation, not just at graph construction time?

Chapter 2: Multi-Hop Aggregation — Information Across Multiple Tables

A 1-layer GNN on a relational graph is roughly analogous to a single JOIN followed by an aggregation. A 2-layer GNN corresponds to a 2-table JOIN, and a 3-layer GNN to a 3-table JOIN: each layer follows one more foreign key hop. The key question is how many hops your task actually needs.
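The JOIN analogy can be made concrete with hand-written aggregates. The sketch below uses invented toy tables and a fixed mean as the aggregation; a GNN layer would instead learn the aggregation, so treat this only as a picture of which rows each hop can reach.

```python
from collections import defaultdict

# Toy tables (hypothetical rows): orders carry a customer FK and a product FK.
orders = [
    {"order_id": 1, "cust_id": "c1", "product_id": "p1", "amount": 20.0},
    {"order_id": 2, "cust_id": "c1", "product_id": "p2", "amount": 40.0},
    {"order_id": 3, "cust_id": "c2", "product_id": "p1", "amount": 10.0},
]
products = {"p1": {"price": 10.0}, "p2": {"price": 40.0}}

# Hop 1: customer <- orders (like JOIN orders ... GROUP BY cust_id).
orders_by_customer = defaultdict(list)
for row in orders:
    orders_by_customer[row["cust_id"]].append(row)
avg_amount = {c: sum(o["amount"] for o in rows) / len(rows)
              for c, rows in orders_by_customer.items()}

# Hop 2: customer <- orders <- products (a second JOIN, then aggregate again).
avg_price = {c: sum(products[o["product_id"]]["price"] for o in rows) / len(rows)
             for c, rows in orders_by_customer.items()}

print(avg_amount["c1"], avg_price["c1"])
```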

The Right Depth Depends on the Task

Predicting customer churn from their order history: 1 hop (customer → orders). Predicting churn from the products they bought: 2 hops (customer → orders → products). Predicting churn based on what similar customers bought: 3 hops (customer → orders → products → other customers' orders), plus one more hop if the model needs those customers' own rows.

Set L too small, and you miss the relevant signal. Set L too large, and you include irrelevant signal from distant nodes (which adds noise) and you over-smooth the embeddings (nearby nodes become indistinguishable). This is the classic over-smoothing problem in GNNs, amplified by the fact that relational graphs can be densely connected.

Over-smoothing in relational graphs. After many GNN layers, every customer node has aggregated from nearly every other customer node (through shared products). All customer embeddings converge to a single average, losing individual signal. For relational graphs, 2-3 layers is typically optimal — beyond that, over-smoothing dominates.
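A small numeric sketch shows the effect. It uses plain mean aggregation with no learned weights on an invented five-node customer/product graph, so it exaggerates what a trained GNN does, but the trend is the one described above.

```python
# Bipartite toy graph: customers c0..c2 connected through shared products.
neighbors = {
    "c0": ["p0"], "c1": ["p0", "p1"], "c2": ["p1"],
    "p0": ["c0", "c1"], "p1": ["c1", "c2"],
}
h = {"c0": 0.0, "c1": 5.0, "c2": 10.0, "p0": 2.0, "p1": 8.0}

def mean_layer(h, neighbors):
    """One round of mean aggregation over self + neighbors (no learned weights)."""
    out = {}
    for node, nbrs in neighbors.items():
        values = [h[node]] + [h[n] for n in nbrs]
        out[node] = sum(values) / len(values)
    return out

def customer_spread(h):
    """Gap between the largest and smallest customer embedding."""
    vals = [h[c] for c in ("c0", "c1", "c2")]
    return max(vals) - min(vals)

spreads = [customer_spread(h)]
for _ in range(10):
    h = mean_layer(h, neighbors)
    spreads.append(customer_spread(h))
print([round(s, 3) for s in spreads])
```

The spread between customer embeddings shrinks with every layer, which is the convergence that makes deep stacks lose individual signal.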

What Each Layer Captures

| GNN Layers | Equivalent SQL | What It Captures |
|---|---|---|
| 0 (none) | SELECT * FROM customers | Customer's own features only |
| 1 | JOIN orders ON cust_id | Order history (count, amount, recency) |
| 2 | JOIN orders JOIN products | What products did this customer buy? |
| 3 | 3-table JOIN + GROUP BY | What do similar customers buy? (collaborative signal) |
| 4+ | Complex multi-join | Often over-smoothed, diminishing returns |

Over-Smoothing Visualization

Simulation: 5 customer nodes with initially different embeddings (shown as colors). As GNN layers are added, the embeddings converge (over-smoothing). The sweet spot is typically L=2 or L=3.

For a customer churn task, a 3-layer GNN on the relational graph collects information from "customers who bought the same products as this customer." This is useful because:

Chapter 3: Schema-Specific GNNs — One Model Per Database

The baseline approach in RDL is to train a separate GNN for each database schema. You inspect the e-commerce schema, design a HeteroConv with the right edge types, train it. You get a new healthcare schema, design another HeteroConv, train again. Each model is custom-built for its schema.

This is schema-specific modeling. It works. RelBench results show it beats XGBoost on complex tasks. But it has fundamental limitations that emerge in production.

Why Schema-Specific Models Work

A schema-specific model can use exactly the right weight matrices for each edge type. The "customer places order" transformation is tuned specifically to the relationship between your company's customer features and your order features. The model has full knowledge of the schema and can leverage it.

Schema-specific models are the practical standard today. When you have one database and the schema is stable, training a custom GNN is the right approach. RelBench's baseline is a schema-specific heterogeneous GNN (HeteroSAGE), and it beats XGBoost on most tasks.

The Failure Modes

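Schema-specific models break when the schema changes, when a model must transfer to a new database, or when one system must serve many databases whose schemas differ.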

The ambition: can we build one model — trained once, on one set of databases — that zero-shot generalizes to any new relational database schema? This is the universal encoder problem, and it's the research frontier of RDL.
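python
# Schema-specific: every edge type must be known when the model is built
from torch_geometric.nn import HeteroConv, SAGEConv

conv = HeteroConv({
    # (-1, -1) defers input sizes to the first forward pass; 64 is an arbitrary hidden size
    ('customer', 'places', 'order'): SAGEConv((-1, -1), 64),
    ('order', 'contains', 'product'): SAGEConv((-1, -1), 64),
    # ... every edge type hardcoded at model creation time
})
# ^^^ Breaks if the schema changes. Cannot transfer to a new DB.

# Universal (pseudocode: UniversalRDLEncoder and build_from_schema are
# illustrative names, not a real API): schema described as data, not architecture
# model = UniversalRDLEncoder()
# graph = build_from_schema(any_database)   # any schema
# embeddings = model(graph)                 # zero-shot generalization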
A SaaS company serves 500 enterprise clients, each running the same application but with slightly different database schemas (different tables added for different use cases). Why can't they use one schema-specific GNN for all clients?

Chapter 4: Universal Encoders — The Griffin Architecture

The question: can you build a GNN that generalizes across any relational database schema without retraining? This would be the "foundation model for relational data" — trained once, deployed everywhere.

Griffin (Graph Foundation Model for Relational Data) is one attempt at this. The insight: the schema itself is data. Instead of hardcoding the schema into the GNN's architecture, encode the schema as part of the input. The model learns to read the schema description and adapt its behavior accordingly.

The Key Design Decisions in Griffin

1. Schema-agnostic node features. Different tables have different columns with different types. Griffin uses a universal feature encoder that can handle any column: numeric values are normalized and embedded, categorical values use a shared vocabulary embedding, text is encoded with a frozen language model, timestamps use sinusoidal encoding. The output is always a fixed-size vector, regardless of what columns the table has.

2. Schema description as context. The table name, column names, and foreign key relationships are serialized as text and passed through a language model. This "schema embedding" tells the GNN the semantics of each node type — even for schemas it has never seen before. A new table called "prescriptions" will produce an embedding that reflects medical prescription semantics, without any prescription-specific training.

3. Shared message passing weights. Instead of separate weight matrices per edge type, Griffin uses shared weights conditioned on the schema embedding. The message from an Order node to a Customer node uses weights that are modulated by the schema description of that edge type. Same base weights, different conditioning — efficient transfer.
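The first design decision, mapping any set of columns to a fixed-size vector, can be sketched as follows. This is not Griffin's implementation: the hashing trick stands in for a shared vocabulary embedding, mean pooling over columns is an assumption, text columns (which would go through a frozen language model) are left out, and `DIM`, the function names, and the toy rows are all hypothetical.

```python
import hashlib
import math

DIM = 8  # fixed output size (arbitrary choice for this sketch)

def encode_numeric(x, mean, std):
    """Normalize, then place the value in the first slot of a DIM-vector."""
    vec = [0.0] * DIM
    vec[0] = (x - mean) / std if std else 0.0
    return vec

def encode_categorical(value):
    """Hash the category into one of DIM shared buckets (one-hot)."""
    bucket = int(hashlib.sha256(str(value).encode()).hexdigest(), 16) % DIM
    return [1.0 if i == bucket else 0.0 for i in range(DIM)]

def encode_timestamp(t):
    """Sinusoidal encoding: sin/cos pairs at geometrically spaced frequencies."""
    vec = []
    for i in range(DIM // 2):
        freq = 1.0 / (10000 ** (2 * i / DIM))
        vec += [math.sin(t * freq), math.cos(t * freq)]
    return vec

def encode_row(cells):
    """Average the per-column vectors so any number of columns maps to DIM."""
    vecs = []
    for kind, value in cells:
        if kind == "numeric":
            vecs.append(encode_numeric(value["x"], value["mean"], value["std"]))
        elif kind == "categorical":
            vecs.append(encode_categorical(value))
        elif kind == "timestamp":
            vecs.append(encode_timestamp(value))
    return [sum(col) / len(vecs) for col in zip(*vecs)]

order_row = [("numeric", {"x": 30.0, "mean": 20.0, "std": 5.0}),
             ("timestamp", 1700000000.0)]
patient_row = [("categorical", "aspirin"),
               ("numeric", {"x": 2.0, "mean": 1.0, "std": 1.0}),
               ("timestamp", 1600000000.0)]
print(len(encode_row(order_row)), len(encode_row(patient_row)))
```

Rows from tables with two and three columns both come out as vectors of length `DIM`, which is what lets one set of downstream weights serve every table.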

Griffin: Universal Relational Encoder — SHOWCASE

Simulation: Griffin processes two completely different database schemas, e-commerce and healthcare, using the same model weights, adapting its node features and message passing to each schema's structure.

How Griffin Handles New Schemas

New Database Schema
Table names, column names, foreign keys, dtypes
↓ text serialization
Schema Description
"Table 'prescriptions' with columns: patient_id (FK→patients), drug_name (text), dosage (float), date (timestamp)"
↓ frozen language model
Schema Embedding
Dense vector capturing semantic meaning of each table and relationship
↓ shared GNN weights conditioned on schema embedding
Node Embeddings
Same GNN backbone, schema-conditioned — zero-shot generalization
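The text-serialization step at the top of this pipeline is simple enough to sketch. The function name and the `(column, dtype, fk_target)` tuple layout are assumptions made for the example; the output string follows the description format shown in the pipeline.

```python
def describe_table(name, columns):
    """Serialize a table schema as text.
    `columns` is a list of (column_name, dtype, fk_target_or_None)."""
    parts = []
    for col, dtype, fk in columns:
        if fk is not None:
            parts.append(f"{col} (FK→{fk})")     # foreign keys name their target table
        else:
            parts.append(f"{col} ({dtype})")
    return f"Table '{name}' with columns: " + ", ".join(parts)

description = describe_table("prescriptions", [
    ("patient_id", "int", "patients"),
    ("drug_name", "text", None),
    ("dosage", "float", None),
    ("date", "timestamp", None),
])
print(description)
```

The resulting string is what would be handed to the frozen language model to produce the schema embedding.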

Griffin vs Schema-Specific Comparison

| Aspect | Schema-Specific GNN | Griffin (Universal) |
|---|---|---|
| Separate model per schema | Yes | No |
| Handles schema changes | Retrain needed | Zero-shot |
| Transfer between databases | None | Yes |
| Performance on known schema | Best | Slightly lower |
| Performance on new schema | Needs retraining | Competitive zero-shot |
| Requires schema metadata | No | Yes (table/column names) |

How does Griffin handle a database schema it has never seen before, without retraining its weights?

Chapter 5: Feature Engineering Comparison

One of RDL's core promises is eliminating manual feature engineering. But "eliminating" is too strong — it's more accurate to say RDL automates feature engineering through gradient descent. Understanding what it automates (and what it doesn't) is essential for using it wisely.

What RDL Learns Automatically

RDL's GNN layers learn aggregations over related rows automatically, without you specifying them: counts, averages, most-recent timestamps, and distinct counts.

What RDL Does NOT Learn Automatically

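Some features still require domain expertise to define correctly: windowed ratios, external signals, anomaly scores, seasonal adjustments, and business-specific KPIs.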

The correct mental model: RDL automates the graph traversal and aggregation part of feature engineering. It does not replace domain knowledge about which features are meaningful or what external data to include. Hybrid approaches — RDL for the graph features, domain experts for the specialized ones — often perform best.
python
# What RDL replaces (auto-learned)
"""
COUNT(orders) as num_orders
AVG(order.amount) as avg_order
MAX(order.ts) as last_order
COUNT(DISTINCT products) as diversity
COUNT(reviews) as review_count
AVG(review.rating) as avg_rating
"""
# ↑ All captured by 2-layer GNN
python
# What domain experts still provide
"""
orders_last_90d / orders_total
industry_health_score (external)
login_velocity_anomaly
seasonal_adjustment_factor
custom_business_kpi_1
"""
# ↑ Needs human knowledge
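To see the difference between the two lists, here is a small sketch on invented order rows. The first three values are plain aggregates over a customer's orders, the kind a GNN layer can learn; the last one depends on a 90-day window that a person chose.

```python
# Toy order rows for one customer: (day, amount). Values are invented.
orders = [(10, 20.0), (200, 35.0), (350, 50.0), (360, 15.0)]
today = 365

# Aggregates of the kind a GNN layer can learn from the customer's order neighbors.
num_orders = len(orders)
avg_order = sum(amount for _, amount in orders) / num_orders
last_order = max(day for day, _ in orders)

# A hand-defined feature: the 90-day window is a human choice, not learned.
orders_last_90d = sum(1 for day, _ in orders if today - day <= 90)
recent_ratio = orders_last_90d / num_orders

print(num_orders, avg_order, last_order, recent_ratio)
```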
A domain expert knows that customers who place exactly 3 orders and then pause for 60 days are 80% likely to churn. Can a standard 2-layer GNN on the relational graph automatically capture this pattern?

Chapter 6: Scalability for Large Databases

A realistic production database: 50 million customer rows, 500 million order rows, 20 million product rows. The full relational graph has ~570 million nodes and potentially billions of edges. A 3-layer GNN would need to load the 3-hop neighborhood of every customer, which for popular products could be millions of nodes per customer. Doing this exhaustively for every customer is infeasible.

Mini-Batch Neighbor Sampling

The solution is the same one GraphSAGE introduced for homogeneous graphs: neighbor sampling. For each training example (customer node), sample at most K neighbors at each layer instead of using all of them. For K=20 and L=3 layers, each training example touches at most 1 + 20 + 20² + 20³ = 8,421 nodes (8,000 of them at the outermost hop), which is manageable.

The tradeoff: sampling K neighbors introduces variance (different runs see different neighborhoods). A larger K lowers variance but costs more memory. Typical values: K=25-50 for the innermost layer (closest to the target node), K=10-25 for outer layers (fewer samples further from the target node, where individual samples are less critical).
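A minimal sketch of fan-out sampling and its worst-case size follows. The adjacency dictionary and function names are illustrative, and real samplers also deduplicate nodes and work on batches; counting the target and every intermediate hop, K=20 over 3 layers gives 1 + 20 + 400 + 8,000 = 8,421 nodes at most.

```python
import random

def sample_neighborhood(adj, target, fanouts, rng):
    """GraphSAGE-style sampling: at hop i keep at most fanouts[i] neighbors
    of every node on the current frontier. Returns one node list per hop."""
    layers = [[target]]
    for k in fanouts:
        frontier = []
        for node in layers[-1]:
            nbrs = adj.get(node, [])
            frontier.extend(rng.sample(nbrs, min(k, len(nbrs))))
        layers.append(frontier)
    return layers

def worst_case_nodes(fanouts):
    """Upper bound on sampled nodes, counting the target and every hop."""
    total, width = 1, 1
    for k in fanouts:
        width *= k
        total += width
    return total

adj = {"c0": [f"o{i}" for i in range(100)]}   # one customer with 100 orders
layers = sample_neighborhood(adj, "c0", fanouts=[20], rng=random.Random(0))
print(len(layers[1]), worst_case_nodes([20, 20, 20]))
```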

The Temporal Complication

Temporal RDL makes sampling harder. For each prediction at cutoff T, you need to sample from the temporally-filtered neighborhood — the neighbors that existed at time T. This means you can't precompute static neighborhood samples; each cutoff creates a different filtered graph.
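One way to avoid rebuilding a filtered graph per cutoff is to keep each node's neighbors sorted by time, so the valid neighbors at any cutoff form a prefix. The class below is a sketch of that idea with hypothetical names, not a description of any particular library.

```python
import bisect
import random

class TemporalNeighborIndex:
    """Per-node neighbor lists sorted by timestamp, so the neighbors valid at
    any cutoff are a prefix found by binary search."""

    def __init__(self, edges):
        by_node = {}
        for node, neighbor, ts in edges:
            by_node.setdefault(node, []).append((ts, neighbor))
        self.times, self.nbrs = {}, {}
        for node, pairs in by_node.items():
            pairs.sort()
            self.times[node] = [ts for ts, _ in pairs]
            self.nbrs[node] = [n for _, n in pairs]

    def sample(self, node, cutoff, k, rng):
        """Sample up to k neighbors whose edge timestamp is <= cutoff."""
        end = bisect.bisect_right(self.times.get(node, []), cutoff)
        valid = self.nbrs.get(node, [])[:end]
        return rng.sample(valid, min(k, len(valid)))

edges = [("c0", "o1", 1), ("c0", "o2", 5), ("c0", "o3", 9), ("c0", "o4", 12)]
index = TemporalNeighborIndex(edges)
rng = random.Random(0)
early = index.sample("c0", cutoff=5, k=10, rng=rng)
late = index.sample("c0", cutoff=10, k=10, rng=rng)
print(sorted(early), sorted(late))
```

The index is built once; each prediction then pays only a binary search per node, whatever its cutoff.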

Solutions in active development include temporal indexes with cached neighborhoods and approximate temporal sampling.

Inference Caching

At inference time, embeddings for stable entities (products, categories) change slowly. You can cache their embeddings and only recompute when they receive new interactions. Customer embeddings change faster (new orders come in daily) but most customers have sparse activity — batch recompute nightly is feasible for most use cases.
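The caching policy can be sketched as a dictionary plus a set of entities that have received new interactions. `EmbeddingCache` and `compute_fn` are hypothetical names; `compute_fn` stands in for whatever GNN forward pass produces an embedding.

```python
class EmbeddingCache:
    """Cache embeddings per entity and recompute only after a new interaction.
    `compute_fn` stands in for a GNN forward pass (hypothetical)."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn
        self.cache = {}
        self.dirty = set()
        self.recomputes = 0

    def mark_interaction(self, entity_id):
        """Call when a new row references this entity."""
        self.dirty.add(entity_id)

    def get(self, entity_id):
        if entity_id not in self.cache or entity_id in self.dirty:
            self.cache[entity_id] = self.compute_fn(entity_id)
            self.dirty.discard(entity_id)
            self.recomputes += 1
        return self.cache[entity_id]

cache = EmbeddingCache(compute_fn=lambda eid: [float(len(eid))])
cache.get("product_7")               # computed
cache.get("product_7")               # served from cache
cache.mark_interaction("product_7")
cache.get("product_7")               # recomputed after the new interaction
print(cache.recomputes)
```

A nightly batch job fits the same shape: mark every entity touched during the day, then call `get` on each marked entity once.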

| Scaling Challenge | Solution | Tradeoff |
|---|---|---|
| Graph too large for GPU | Neighbor sampling (K neighbors/layer) | Variance in gradient estimates |
| Temporal filtering per prediction | Temporal index + cache | Staleness vs. compute |
| Many predictions per day | Batch inference + embedding cache | Slightly stale embeddings |
| New rows arrive continuously | Streaming graph updates | Consistency guarantees |

Why does temporal message passing make neighbor sampling harder than in standard (non-temporal) GNNs?

Chapter 7: Real-World Applications

Where is advanced RDL being applied today? The domains where relational data is richest and the prediction stakes are highest.

Fraud Detection

Fraud in financial transactions is inherently relational. A fraudster may use the same device across multiple accounts, or route money through a chain of accounts to obscure the trail. The fraud pattern is in the graph structure — not in any individual transaction's features.

Why GNNs beat rule-based systems for fraud: fraud rings create complex multi-hop patterns. "Account A → Account B → Account C all used the same device within 1 hour" is a 3-hop temporal pattern. Rule-based systems require a human to specify this rule. GNNs discover such patterns automatically from labeled examples.

The relational graph for fraud: Accounts (table), Transactions (table, FK→Account), Devices (table, FK→Account), Merchants (table, FK→Transaction). A 3-layer GNN on this graph captures: account features, their transaction history, what devices they used, what merchants they transacted with, and what other accounts use the same devices — exactly the multi-hop pattern that characterizes fraud rings.
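For intuition, the shared-device pattern can be written by hand as a two-hop lookup. The sketch assumes a usage table of (account, device) pairs, which is a simplification of the schema above, and the data is invented; a GNN would learn such patterns from labels rather than have them coded as a rule.

```python
from collections import defaultdict

# (account, device) usage pairs; invented data for illustration.
usage = [("A", "d1"), ("B", "d1"), ("C", "d1"),
         ("D", "d2"), ("E", "d3"), ("E", "d2")]

accounts_by_device = defaultdict(set)
for account, device in usage:
    accounts_by_device[device].add(account)

def accounts_sharing_device(account):
    """Accounts two hops away: account -> device -> other account."""
    linked = set()
    for device, accounts in accounts_by_device.items():
        if account in accounts:
            linked |= accounts
    linked.discard(account)
    return linked

print(sorted(accounts_sharing_device("A")), sorted(accounts_sharing_device("D")))
```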

Churn Prediction

Customer churn has a strong collaborative component: if a customer's "cohort" (peers who bought similar products at the same time) is churning, they're more likely to churn too. This 3-hop signal — customer → products → other customers → churn rate — is exactly what a 3-layer RDL GNN captures, and exactly what XGBoost with manual features would miss (unless an analyst knew to add "cohort churn rate" as a feature).

Drug Discovery

Pharmaceutical databases store drugs, targets (proteins), diseases, clinical trials, adverse events, and patients — all linked by foreign keys. Predicting drug-target interactions, adverse event probability, or trial success rates are all relational prediction tasks. The biological relationships span multiple hops: drug → targets → diseases → patients → outcomes.

| Domain | Prediction Task | Key Relational Pattern | Critical Hops |
|---|---|---|---|
| Finance | Fraud detection | Shared devices, IP addresses | 2-3 |
| E-commerce | Customer churn | Cohort behavior | 3 |
| Healthcare | Drug-target interaction | Target pathways, disease comorbidity | 3-4 |
| Social | Content moderation | Network of accounts, shared content | 2-3 |
| Hiring | Job match quality | Employer→employee→skills→jobs | 2-3 |

In fraud detection, why is a GNN approach preferred over checking individual transaction features?

Chapter 8: Open Problems

Relational deep learning is young: the first major paper appeared in 2023. The open problems are not niche edge cases; they're fundamental challenges that determine whether RDL becomes a standard tool or stays a research curiosity.

Open Problem 1: Truly Universal Encoders

Griffin is a first step, but its zero-shot performance still falls below schema-specific models on most benchmarks. The gap is shrinking, but it remains. The underlying challenge: a language model can encode the semantics of "prescriptions" (medical, time-sensitive) but not the statistical structure of that specific company's prescription table. Transfer of learned statistical patterns across schemas is still unsolved.

Open Problem 2: Scalability at True Production Scale

RelBench databases have millions of rows. Amazon, Meta, and Google have hundreds of billions. Temporal neighbor sampling at that scale, with per-prediction cutoffs, remains an engineering challenge with no clean solution. Research on approximate temporal sampling and hierarchical graph methods is active but not yet production-ready.

Open Problem 3: Interpretability

A compliance officer at a bank asks: "Why did the model flag this transaction as fraud?" An XGBoost model with SHAP values gives an interpretable answer: "feature X contributed +0.3, feature Y contributed -0.1." A GNN gives a high-dimensional embedding and a graph — explaining which neighbors contributed which signal through multi-hop paths is an active research area.

GNN explainability for relational data is harder than for tabular data. The explanation must reference graph paths: "This customer was flagged because they share a device with Account B, which has pending fraud cases, which shares a merchant with Account C, which is blacklisted." Automatically generating such explanations is an open problem.

Open Problem 4: Handling Null Values and Schema Inconsistencies

Real databases have NULL values in foreign keys, inconsistent encodings, outdated schemas, and denormalized tables (data duplicated across tables for performance reasons). RDL assumes a clean, consistent schema. Handling messy real-world databases robustly — without a DBA cleaning up first — is an unsolved engineering problem.
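One small piece of this problem, NULL and dangling foreign keys at graph-construction time, can be handled defensively. The sketch below is one possible policy (skip the row and count it), with hypothetical names and toy rows; it does not address inconsistent encodings or denormalized tables.

```python
def build_edges(rows, fk_column, valid_targets):
    """Turn foreign keys into edges, skipping NULL and dangling references.
    Returns (edges, report) so dropped rows are counted rather than hidden."""
    edges, report = [], {"null_fk": 0, "dangling_fk": 0}
    for row in rows:
        target = row.get(fk_column)
        if target is None:
            report["null_fk"] += 1
        elif target not in valid_targets:
            report["dangling_fk"] += 1
        else:
            edges.append((row["id"], target))
    return edges, report

orders = [
    {"id": "o1", "cust_id": "c1"},
    {"id": "o2", "cust_id": None},      # NULL foreign key
    {"id": "o3", "cust_id": "c999"},    # points at a customer that does not exist
]
edges, report = build_edges(orders, "cust_id", valid_targets={"c1", "c2"})
print(edges, report)
```

Returning the report alongside the edges makes it visible how much of the database was silently excluded from the graph.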

| Open Problem | Current State | Research Direction |
|---|---|---|
| Universal encoders | Griffin: promising but gap vs. schema-specific | Better schema embeddings, few-shot fine-tuning |
| Scale | Works to ~100M rows | Temporal sampling, distributed GNN |
| Interpretability | Post-hoc subgraph explanation (GNNExplainer) | Causal path explanations |
| Messy real databases | Requires clean schemas | Robust imputation + schema repair |

Why is explaining a GNN fraud detection decision harder than explaining an XGBoost decision?

Chapter 9: Connections — Where to Go Next

Advanced RDL sits at the crossroads of graph learning, relational databases, temporal reasoning, and foundation model research. Each direction opens a rich research area.

The Through-Line of This Course

Everything is a graph. Social networks, knowledge graphs, molecular structures, citation networks, recommendation systems, relational databases — they all reduce to nodes, edges, and message passing. The GNN toolkit is remarkably general. RDL is the latest and perhaps most impactful application of this insight.
Structure is supervision. GNNs don't need labels to learn useful representations — the graph structure itself provides self-supervisory signal. Temporal graph structure (who interacted with whom, and when) is even richer. The frontier is learning better priors from this structure.
Scale requires rethinking everything. GraphSAGE showed this for social graphs. PinSage showed it for recommendation. RDL is still working through the implications for temporal relational graphs. The memory/compute/accuracy tradeoffs are different at each scale order of magnitude.
Foundation models for structured data. LLMs are foundation models for text. Griffin aspires to be a foundation model for relational data. The community is still figuring out what "pre-training" and "transfer" mean for graphs and tables. This is 2025 frontier research.

Full CS224W Lecture Path

| Lecture | Topic | Key Idea |
|---|---|---|
| L3 | GNNs | Message passing, node embeddings |
| L6 | GNN Theory | WL test, expressive power |
| L8 | Link Prediction | Heuristics, embeddings, GNNs for edges |
| L9 | Hetero GNNs | Type-specific transformations |
| L10 | Knowledge Graphs | TransE, RotatE, relation patterns |
| L11 | RecSys | LightGCN, BPR, bipartite graphs |
| L12 | Basic RDL | Databases as graphs, RelBench |
| L13 (this) | Advanced RDL | Temporal MP, universal encoders, Griffin |

"The ability to work with relational data — the dominant format for the world's most important data — is the next frontier for deep learning. Graph neural networks are the key."
— Jure Leskovec, CS224W, 2024