CS224W Lecture 15 — Foundation Models for Knowledge Graphs

Chapter 0: The Problem

Freebase has 3 billion facts about the world: (Barack Obama, bornIn, Honolulu), (Honolulu, locatedIn, Hawaii), (Hawaii, partOf, USA). Wikidata has different entities, different relation names, but similar structure. A biomedical KG like Hetionet has (Aspirin, treats, Migraine), (Migraine, symptomOf, Inflammation), connecting drugs, diseases, and genes.

These knowledge graphs are incomplete. Facts are missing — not because they're false, but because the graph was built from text and text doesn't mention everything. Knowledge graph completion is the task: given a partial KG, predict missing facts. Standard approach: train an embedding model (TransE, RotatE, ComplEx) on one KG. Deploy. Done.

The problem: you've now trained a model that's tied to that specific KG. Its entity embeddings are lookup tables — random initializations optimized for Freebase's entities specifically. If a new entity appears (a newly elected politician, a recently discovered drug), it has no embedding. The model has no answer.

Three Distinct Failure Modes

New entities (inductive entities). A model trained on (A, rel, B) knows A and B. A new entity C appears. TransE gives C an embedding of [0, 0, ..., 0] — it has learned nothing about C. The model can't make any prediction about C. Every real-world deployment faces this: new users, new products, new drugs.

New relations (inductive relations). A new relation type appears — say, "hasSideEffect" was never in training data. The model has no embedding for this relation. Even if the entities are known, the model can't reason about new relation types. As KGs evolve, new facts types are constantly added.

New KGs entirely (zero-shot transfer). You've trained on Freebase. A biologist hands you Hetionet. Different entities (genes, diseases, drugs), different relations (treats, causes, interacts), completely different schema. Your TransE weights are useless — trained to predict (Person, bornIn, Place) triples, not (Drug, treats, Disease) triples. You must retrain from scratch.

Standard embedding models fail at all three. They're transductive: predictions require every entity and relation to have been seen during training. A foundation model for KGs would handle all three: reason about new entities, new relations, and entirely new KGs without retraining.

Transductive vs. Inductive Failure

Click "Add New Entity" to introduce a novel node to the trained KG. Watch how a transductive model fails (no embedding) but a structural model can still reason using the new node's graph neighborhood.

TransE trains entity embeddings as lookup table entries. Why does this make it "transductive" and unable to handle new entities at test time?

TransE is too simple to learn complex patterns TransE requires GPU memory to hold all embeddings simultaneously A new entity has no row in the lookup table — there's no learned vector for it, and no mechanism to compute one from context. The model can only score triples for entities seen during training

Chapter 1: Inductive KG Reasoning

How do humans reason about new entities? If you've never heard of "Montserrat" and I tell you it's a Caribbean island that's part of the British Overseas Territories, you immediately infer: it's in the Atlantic, it has a governor appointed by the Crown, it's probably small, it uses the East Caribbean dollar. You reasoned from relationships, not from stored facts about Montserrat itself.

This is the key insight behind inductive KG reasoning: don't embed entities directly. Instead, derive entity representations from the graph structure around them. Two entities that occupy the same structural position in a graph — connected to the same types of relations — should get similar representations, even if neither was seen during training.

GraIL: Inductive Link Prediction via Subgraphs

GraIL (Grail, Teru et al. 2020) operationalizes this insight. Instead of predicting whether (h, r, t) is a true fact by scoring embeddings, GraIL extracts the local subgraph around h and t, and classifies the subgraph as "supports this relation" or "doesn't support this relation." The GNN operates on the subgraph, not on entity embeddings.

The critical property of GraIL: the GNN input is the subgraph structure, not entity IDs. Two entirely new entities, with no embeddings, can still be scored by extracting their subgraph and running it through the GNN. Inductiveness comes for free — entity identity is irrelevant.

To handle relation types inductively, GraIL adds entity "type" features derived from the relation types of incident edges. Entity A is described by the set of relation types it participates in: {bornIn, worksAt, knows}. Entity B by {locatedIn, hasPopulation, partOf}. These relation-type profiles don't require knowing A or B — just their neighborhood structure.

python
# GraIL: subgraph-based inductive link prediction
def predict_link(h, t, graph, model):
    # Extract the k-hop subgraph around h and t
    subgraph = extract_subgraph(graph, [h, t], k=3)

    # Node features: relation-type profile (NOT entity ID lookup)
    # Each node feature = which relation types it participates in
    node_feats = compute_relation_profile(subgraph)   # [N_sub, n_rels]

    # Classify the subgraph: does it support relation r between h and t?
    score = model(subgraph, node_feats, h_idx=0, t_idx=1)  # [1]
    return torch.sigmoid(score)  # probability the link exists

# At test time: h and t can be BRAND NEW entities never seen in training.
# The model operates purely on subgraph structure — no entity ID needed.

GraIL uses "relation-type profiles" as node features instead of entity ID embeddings. What does this profile contain?

The entity's pre-trained TransE embedding vector A one-hot encoding of the entity's position in the KG A vector indicating which relation types (bornIn, locatedIn, etc.) that entity participates in — a structural fingerprint that doesn't depend on the entity's identity or whether it was seen during training

Chapter 2: Relation-Level Transfer

GraIL handles new entities. But what about new relation types? Suppose your training KG has {bornIn, livesIn, worksAt, knows}. At test time, a new relation "marriedTo" appears. GraIL's relation-type profile features don't include "marriedTo" — new dimension, undefined. The model can't score triples with the new relation.

This is the relation-level inductive problem. Solving it requires representing relation types themselves without relying on pre-learned relation embeddings — just as GraIL represents entities without entity embeddings.

ULTRA: Universal Reasoning on Any KG

ULTRA (Galkin et al., 2023) is a major step toward a universal KG foundation model. Its key insight: relations are characterized by their interaction patterns with other relations. The relation "bornIn" is defined by patterns like: if (A, bornIn, B) and (B, partOf, C), then (A, bornIn, C) is plausible (inheritance). These relation-to-relation patterns define relation semantics independently of the specific KG.

ULTRA's architecture: build a relation graph — a graph where nodes are relation types and edges represent how relation types interact (composition, inverse, etc.). Run a GNN on this relation graph to compute relation representations. Then use these relation representations to score triples in the entity graph. A new relation type becomes a new node in the relation graph — its GNN embedding is computed from its interaction patterns, not from a lookup table.

Relation Composition Patterns

The patterns that define relation semantics are classical: transitivity (bornIn is transitive for location hierarchies), inversion (if A marriedTo B, then B marriedTo A), and composition (bornIn + partOf = bornIn, approximately). ULTRA learns to detect these patterns from the structure of the relation graph and uses them to reason about any relation — including ones it never saw during pre-training.

Pattern	Formal rule	Example
Transitivity	r(A,B) ∧ r(B,C) → r(A,C)	locatedIn(NYC, NY) ∧ locatedIn(NY, USA) → locatedIn(NYC, USA)
Inversion	r(A,B) → r⁻¹(B,A)	marriedTo(A,B) → marriedTo(B,A)
Composition	r₁(A,B) ∧ r₂(B,C) → r₃(A,C)	bornIn(A,B) ∧ partOf(B,C) → bornIn(A,C)
Symmetry	r(A,B) → r(B,A)	knows(A,B) → knows(B,A)

Why this enables transfer: these logical patterns are universal. They exist in every KG regardless of domain. A model that learns to detect "this relation is transitive" from graph structure can apply that reasoning to any transitive relation, including new ones. The patterns transfer; the specific relation IDs do not.

ULTRA represents relation types via a "relation graph." What are the nodes and edges of this relation graph?

Nodes are entities; edges are relation types (same as the original KG) Nodes are relation types; edges connect two relation types if they compose, invert, or otherwise interact in the original KG — so the graph captures the structural semantics of each relation type Nodes are relation types; edges represent co-occurrence (when two relation types appear in the same triple)

Chapter 3: Subgraph Reasoning

You have a KG with 10 million entities. You want to answer: "Which drugs treat diseases caused by gene X?" This requires finding all paths of the form (Drug, treats, Disease) where (Disease, causedBy, gene X). The answer is not a single edge — it's a pattern over the graph, a path or subgraph that connects gene X to candidate drugs through a specific chain of relations.

This is complex query answering: answering queries that involve conjunctions (AND), existential quantifiers (∃), and multi-hop paths, not just single-edge link prediction. Standard embedding methods collapse at this — they can score single triples but can't compose predictions across multiple hops.

Queries as Subgraph Patterns

Every complex KG query can be expressed as a directed acyclic graph (DAG). Leaf nodes are known anchor entities. The query node is the unknown we're looking for. Intermediate nodes are existentially quantified variables. Edges are relation types.

Example query DAG: "Find proteins associated with diseases that share a pathway with cancer." Anchor: cancer. Relation chain: cancer ←(associatedWith)— Disease —(sharePathway)→ Disease —(associatedWith)→ Protein. The query DAG encodes this structure. We need to find all proteins that complete this pattern in the KG.

Complex Query Reasoning

Click a query type to see its DAG structure and which KG paths it matches. The selected anchor entity (warm color) propagates through relation chains to find valid answers.

Query2Box

Query2Box (Ren et al., 2020) answers complex queries by representing each query node as a hyperrectangle (box) in embedding space. The box represents the set of entities that could satisfy the query at that node. Relation edges transform boxes: applying relation r to a box moves and reshapes it. Intersection (AND) computes the intersection of two boxes. The query answer is the set of entities inside the final box.

Boxes are learned jointly with entity embeddings. The model trains on a mixture of query types so it learns box operations that compose correctly. At inference time, you run the box computation through the query DAG and score each candidate entity by its distance to the final box.

In Query2Box, what does a box in embedding space represent for a query node?

The position of a single entity in the KG The set of entities that could satisfy the query at that node — boxes represent regions of entity space, and operations (relation traversal, intersection) transform regions to propagate uncertainty across multi-hop paths The confidence score that a given entity is the correct answer

Chapter 4: Foundation KG Models (SHOWCASE)

The dream crystallizes here: a single model that accepts any knowledge graph as input and performs link prediction, even for entirely new entities and relation types it has never seen. No fine-tuning. No entity-specific embeddings. Just structural reasoning from graph patterns.

This showcase demonstrates the full pipeline: load two completely different KGs (one biomedical, one social), run the same foundation model on both, and watch it score triple candidates using only subgraph structure — no entity-specific information required.

The ULTRA Pipeline

Input: Any KG (h, r, t) query

New KG with unseen entities and relations. No entity embeddings exist.

↓

Build Relation Graph

Nodes = relation types. Edges = composition/inversion patterns. Run GNN → relation embeddings from structure alone.

↓

Extract Subgraph

k-hop neighborhood around (h, t). Node features = relation-type profiles (which relation types each entity participates in).

↓

GNN Scoring

Run NBFNet (neural Bellman-Ford) on the subgraph. It propagates relation embeddings along paths from h to t, computing path scores.

↓

Score & Rank

All candidate tail entities scored. Return ranked list. Works on any KG, any entity, any relation.

Foundation KG Model: Cross-KG Transfer

Select a KG. The foundation model scores triple candidates using structural reasoning only. Toggle between KGs to see the same model weights producing correct predictions on completely different schemas — no retraining.

Query hop depth 2

ULTRA performance (Galkin et al., 2023): pre-trained on 3 KGs, zero-shot on 50 new KGs. Achieves ~93% of the performance of a model fully fine-tuned on each KG — without any fine-tuning. On some smaller KGs, zero-shot ULTRA outperforms fine-tuned task-specific models, because the structural patterns learned from large KGs generalize better than patterns learned from sparse data.

ULTRA achieves "zero-shot" transfer to new KGs. What exactly is "zero-shot" here?

The model uses zero GNN layers (it's a purely MLP-based approach) The model is tested on KGs with zero entities The model is applied to a new KG it has never seen during training, with zero examples from that KG used for weight updates — it uses only the structural patterns learned from other KGs

Chapter 5: Training Strategies

A foundation model for KGs must be trained on many different KGs simultaneously. This raises concrete engineering questions: how do you sample training data across graphs of wildly different sizes? How do you balance a tiny drug-interaction KG (50,000 triples) against a massive social KG (5 billion triples)?

Negative Sampling in KGs

KG training uses contrastive learning at the triple level: given a true triple (h, r, t), generate corrupted triples (h, r, t') where t' is a random entity that is not a valid tail for this query. Train the model to score the true triple higher than the corrupted triples. This is called negative sampling.

The closed-world assumption trap. When we sample random t' and call it "negative," we assume the true fact (h, r, t') is false. But KGs are incomplete — (h, r, t') might be a true but missing fact. Scoring it as negative gives the model incorrect supervision. This is the false negative problem. In practice: use large negative sample counts to dilute false negatives, or use more sophisticated self-adversarial negative sampling.

Multi-Graph Training Curriculum

ULTRA's training curriculum mixes examples from different source KGs in each batch. But size imbalance is severe. Solution: sample graphs with probability proportional to the square root of their size (not proportional to size itself). This gives smaller KGs more representation without completely ignoring the large ones.

python
# Multi-KG training loop (ULTRA-style)
kg_sizes = [50_000, 1_200_000, 4_800_000]  # triples per KG
weights = [s**0.5 for s in kg_sizes]             # sqrt balancing
weights = [w / sum(weights) for w in weights]       # normalize

for step in range(max_steps):
    # Sample which KG to train on this step
    kg_idx = random.choices([0, 1, 2], weights=weights)[0]
    kg = knowledge_graphs[kg_idx]

    # Sample a batch of true triples from this KG
    triples = kg.sample_triples(batch_size=512)

    # Generate corrupted negatives (random tail entities)
    negatives = kg.corrupt_tail(triples, n_neg=64)

    # Score both true and corrupted; contrastive loss
    loss = margin_ranking_loss(model(kg, triples), model(kg, negatives))
    loss.backward()
    optimizer.step()

When training a KG model with negative sampling, why are randomly corrupted triples (h, r, t') not always truly "negative" examples?

Because t' might have the wrong entity type for relation r Because KGs are incomplete — the "corrupted" triple (h, r, t') might actually be a true fact that simply wasn't recorded in the training KG. Training on it as negative gives incorrect supervision (false negative problem) Because the model hasn't seen t' before, so it can't evaluate whether the triple is correct

Chapter 6: Evaluation

How do you measure whether a KG foundation model actually generalizes? Standard KG benchmarks (FB15k-237, WN18RR) evaluate models on held-out triples from the same KG used for training. That measures interpolation, not zero-shot transfer. Foundation models need a different evaluation protocol.

The Standard Protocol: Filtered MRR

For each test triple (h, r, t), replace the tail with every entity in the KG and rank them by the model's score. Report the rank of the true tail. Repeat for all test triples. The Mean Reciprocal Rank (MRR) is the average of 1/rank. A model that consistently puts the true answer first gets MRR = 1.0.

MRR = ¹⁄_|T| ∑_i=1^|T| ¹⁄_{rank_i}

The "filtered" part: before computing rank, remove all other true triples from the candidate list. Without filtering, you might rank (h, r, t') above (h, r, t) because (h, r, t') is also a true fact — which is technically correct but unfairly penalizes the model.

Zero-Shot Evaluation Protocol

For foundation model evaluation, the protocol is stricter: train on source KGs, evaluate on a completely different target KG with no fine-tuning. All entities and relations in the target KG are unseen. MRR is computed only on the target KG's test triples.

ULTRA's benchmark result: tested on 50 KG benchmarks, including KGs from biology, chemistry, social networks, and general knowledge. Zero-shot ULTRA (no target KG training) achieves comparable or better MRR than the strongest task-specific models on 42 out of 50 benchmarks. The foundation model paradigm works for KGs.

Hits@K metric: instead of MRR, sometimes report Hits@1, Hits@3, Hits@10 — the fraction of test triples where the correct answer appears in the top-1, top-3, or top-10 ranked candidates. More interpretable: "the model gets the right answer in its top 10 suggestions 87% of the time."

Why does standard KG benchmark evaluation (on held-out triples from the training KG) not adequately test a foundation model?

Because MRR is the wrong metric for foundation models Because the model has seen all entities and relations during training — the held-out test triples involve familiar entities. This measures interpolation (filling gaps in a known KG), not zero-shot generalization to a new KG with completely unseen entities and schemas Because KG benchmarks are too small to give meaningful statistics

Chapter 7: Challenges

KG foundation models are promising but far from solved. This chapter is honest about the open problems — not to discourage, but because understanding the limitations tells you exactly where the next breakthrough will come from.

Schema Diversity: The Alignment Problem

Different KGs don't just have different entities — they have different ontological assumptions. In Freebase, a "person" is an entity with relations like {bornIn, citizenship, occupation}. In DBpedia, a "person" has {birthPlace, nationality, profession}. These are semantically equivalent but syntactically different. A model trained on Freebase's schema may correctly learn that "bornIn" relates persons to locations — but on DBpedia, the relation is "birthPlace." The model must generalize across these surface-level schema differences.

The surface-form problem. ULTRA uses relation interaction patterns to define relation semantics — not relation names. So "bornIn" and "birthPlace" are treated the same way as long as they have similar patterns in their respective KGs (both are functional, non-symmetric, and compose with location hierarchy relations). This is the right approach, but it assumes the structural patterns are preserved across schemas. Sometimes they're not.

Scale: The Size Problem

Wikidata has over 100 million entities. Subgraph extraction for GraIL/ULTRA at this scale is prohibitive — extracting 3-hop subgraphs around every candidate entity during inference doesn't scale. Current foundation models work well on KGs up to ~5 million entities. Beyond that, approximate methods (pre-computed paths, sparse subgraph sampling) are needed but introduce errors.

Temporal and Numerical KGs

Many KG facts are time-qualified: (Obama, presidentOf, USA, [2009, 2017]). Standard embedding models (and ULTRA) don't handle temporal qualifiers. Similarly, numerical attributes (population, temperature, price) don't fit naturally into the entity-relation-entity triple framework. Extending foundation models to temporal and numerical facts remains largely unsolved.

The calibration gap. Even when a KG foundation model ranks the correct answer first, its confidence scores may be poorly calibrated. "Score = 0.95" doesn't necessarily mean 95% probability that the fact is true. For high-stakes applications (drug interaction prediction, medical diagnosis support), calibrated probabilities are essential. Current models are not calibrated without additional post-processing.

ULTRA represents relations by their structural interaction patterns, not their names. Why does this help with schema diversity but not completely solve it?

Because interaction patterns require knowing entity types, which vary across schemas Because not all KGs have the same number of relation types It helps because semantically equivalent relations (bornIn vs. birthPlace) tend to have similar patterns in their respective KGs. It doesn't completely solve it because structural patterns can differ even for semantically identical relations if the surrounding KG structure is organized differently

Chapter 8: Applications

KG foundation models aren't academic curiosities. The domains where KG completion matters most — drug discovery, genomics, materials science — are also the domains where you constantly encounter new entities (newly sequenced genes, newly synthesized compounds) and where retraining from scratch every time a new entity appears is completely impractical.

Drug Discovery

A biomedical KG links drugs, genes, diseases, proteins, and biological pathways. The key task: predict which drug might treat which disease (drug repurposing). Standard approach: train TransE on the current KG. As new drugs enter clinical trials, their graph connections are gradually added, but their initial embedding is random. A foundation model trained on structural patterns can immediately reason about new drugs by their biological connections — no retraining.

Real case: COVID-19 drug repurposing. When COVID-19 emerged in early 2020, researchers needed drug candidates fast. Biomedical KGs had no COVID-specific embeddings — the disease was new. But its protein connections were rapidly being added. KG foundation models that could reason from structural patterns (rather than learned embeddings) were immediately applicable, while transductive models needed to wait for enough data to retrain.

Knowledge Base Completion

Search engines maintain internal knowledge bases to answer factual queries directly ("What is the capital of France?"). These bases are perpetually incomplete — new entities are created daily (new companies, new movies, new people). A foundation model keeps the knowledge base useful for new entities without continuous retraining cycles.

Scientific Literature Graphs

Citation graphs + co-authorship graphs + concept graphs together form a scientific KG. Predicting which papers will be cited together, which authors are likely to collaborate, and which concepts are related helps research recommendation systems. New papers appear daily — a foundation model handles new paper entities from their first connections (authors, keywords, cited papers).

Domain	KG entities	Key task	Why foundation model needed
Drug discovery	Drugs, genes, diseases	Drug repurposing	New drugs appear constantly; no embedding at synthesis time
Knowledge bases	Entities, facts	Fact completion	New entities added daily at internet scale
Scientific literature	Papers, authors, concepts	Citation/collaboration prediction	~10,000 new papers per day
E-commerce	Products, attributes, categories	Product linking	New products added; schema differs across retailers

Why is KG foundation model research particularly valuable for drug discovery compared to, say, social network analysis?

Drug discovery KGs are smaller and easier to train on New entities (drugs, diseases, genes) appear constantly and require immediate reasoning without waiting for retraining — delays in drug repurposing predictions have direct human health costs, making the foundation model's inductive capability critically important rather than just convenient Social network KGs are already well solved by existing methods

Chapter 9: Connections

KG foundation models sit at the intersection of knowledge representation, graph learning, and the broader foundation model paradigm. Knowing where these threads lead helps you navigate the literature.

How KG Foundation Models Relate to Other Fields

KG Concept	Analogy in another field	Key difference
Inductive entity reasoning	Zero-shot image classification (new visual classes)	KG structure is discrete; image features are continuous
Relation-level transfer (ULTRA)	Meta-learning for new tasks	Task = relation type; support set = KG structure
Query2Box (box intersection)	Set operations in fuzzy logic	Boxes are learned end-to-end; fuzzy sets are manually designed
KG completion (link prediction)	Collaborative filtering (matrix completion)	KG has typed, directed edges; CF has untyped implicit ratings
Multi-KG pre-training	Multilingual NLP pre-training	KG "language" = schema; much higher diversity than human languages

Related Lessons and Next Steps

The natural next step combines KG reasoning with LLMs: LLMs have world knowledge encoded in weights; KGs have structured, verified, updatable facts. Combining them lets LLMs ground their reasoning in verifiable structure. → LLM + GNN (Lecture 16)

→ KG Basics (Lecture 10) — TransE, RotatE, ComplEx: the transductive methods that KG foundation models aim to replace.

→ Advanced GNN Topics (Lecture 14) — Equivariant GNNs, GNN explainability, and scalability methods that underlie the GNN backbone in KG foundation models.

→ Heterogeneous Graphs (Lecture 9) — KGs are heterogeneous graphs. The HGT and R-GCN methods developed for heterogeneous GNNs form the backbone of many KG GNN approaches.

"Knowledge is not a commodity to be stockpiled, but a capacity to connect things."
— adapted for KG foundation models