Every knowledge graph has a different schema. Can one model reason over all of them — predicting missing facts even about entities and relations it has never seen before?
Freebase has 3 billion facts about the world: (Barack Obama, bornIn, Honolulu), (Honolulu, locatedIn, Hawaii), (Hawaii, partOf, USA). Wikidata has different entities, different relation names, but similar structure. A biomedical KG like Hetionet has (Aspirin, treats, Migraine), (Migraine, symptomOf, Inflammation), connecting drugs, diseases, and genes.
These knowledge graphs are incomplete. Facts are missing — not because they're false, but because the graph was built from text and text doesn't mention everything. Knowledge graph completion is the task: given a partial KG, predict missing facts. Standard approach: train an embedding model (TransE, RotatE, ComplEx) on one KG. Deploy. Done.
The problem: you've now trained a model that's tied to that specific KG. Its entity embeddings are lookup tables — random initializations optimized for Freebase's entities specifically. If a new entity appears (a newly elected politician, a recently discovered drug), it has no embedding. The model has no answer.
Standard embedding models fail at all three. They're transductive: predictions require every entity and relation to have been seen during training. A foundation model for KGs would handle all three: reason about new entities, new relations, and entirely new KGs without retraining.
Click "Add New Entity" to introduce a novel node to the trained KG. Watch how a transductive model fails (no embedding) but a structural model can still reason using the new node's graph neighborhood.
How do humans reason about new entities? If you've never heard of "Montserrat" and I tell you it's a Caribbean island that's part of the British Overseas Territories, you immediately infer: it's in the Atlantic, it has a governor appointed by the Crown, it's probably small, it uses the East Caribbean dollar. You reasoned from relationships, not from stored facts about Montserrat itself.
This is the key insight behind inductive KG reasoning: don't embed entities directly. Instead, derive entity representations from the graph structure around them. Two entities that occupy the same structural position in a graph — connected to the same types of relations — should get similar representations, even if neither was seen during training.
GraIL (Grail, Teru et al. 2020) operationalizes this insight. Instead of predicting whether (h, r, t) is a true fact by scoring embeddings, GraIL extracts the local subgraph around h and t, and classifies the subgraph as "supports this relation" or "doesn't support this relation." The GNN operates on the subgraph, not on entity embeddings.
To handle relation types inductively, GraIL adds entity "type" features derived from the relation types of incident edges. Entity A is described by the set of relation types it participates in: {bornIn, worksAt, knows}. Entity B by {locatedIn, hasPopulation, partOf}. These relation-type profiles don't require knowing A or B — just their neighborhood structure.
python # GraIL: subgraph-based inductive link prediction def predict_link(h, t, graph, model): # Extract the k-hop subgraph around h and t subgraph = extract_subgraph(graph, [h, t], k=3) # Node features: relation-type profile (NOT entity ID lookup) # Each node feature = which relation types it participates in node_feats = compute_relation_profile(subgraph) # [N_sub, n_rels] # Classify the subgraph: does it support relation r between h and t? score = model(subgraph, node_feats, h_idx=0, t_idx=1) # [1] return torch.sigmoid(score) # probability the link exists # At test time: h and t can be BRAND NEW entities never seen in training. # The model operates purely on subgraph structure — no entity ID needed.
GraIL handles new entities. But what about new relation types? Suppose your training KG has {bornIn, livesIn, worksAt, knows}. At test time, a new relation "marriedTo" appears. GraIL's relation-type profile features don't include "marriedTo" — new dimension, undefined. The model can't score triples with the new relation.
This is the relation-level inductive problem. Solving it requires representing relation types themselves without relying on pre-learned relation embeddings — just as GraIL represents entities without entity embeddings.
ULTRA (Galkin et al., 2023) is a major step toward a universal KG foundation model. Its key insight: relations are characterized by their interaction patterns with other relations. The relation "bornIn" is defined by patterns like: if (A, bornIn, B) and (B, partOf, C), then (A, bornIn, C) is plausible (inheritance). These relation-to-relation patterns define relation semantics independently of the specific KG.
The patterns that define relation semantics are classical: transitivity (bornIn is transitive for location hierarchies), inversion (if A marriedTo B, then B marriedTo A), and composition (bornIn + partOf = bornIn, approximately). ULTRA learns to detect these patterns from the structure of the relation graph and uses them to reason about any relation — including ones it never saw during pre-training.
| Pattern | Formal rule | Example |
|---|---|---|
| Transitivity | r(A,B) ∧ r(B,C) → r(A,C) | locatedIn(NYC, NY) ∧ locatedIn(NY, USA) → locatedIn(NYC, USA) |
| Inversion | r(A,B) → r⁻¹(B,A) | marriedTo(A,B) → marriedTo(B,A) |
| Composition | r₁(A,B) ∧ r₂(B,C) → r₃(A,C) | bornIn(A,B) ∧ partOf(B,C) → bornIn(A,C) |
| Symmetry | r(A,B) → r(B,A) | knows(A,B) → knows(B,A) |
You have a KG with 10 million entities. You want to answer: "Which drugs treat diseases caused by gene X?" This requires finding all paths of the form (Drug, treats, Disease) where (Disease, causedBy, gene X). The answer is not a single edge — it's a pattern over the graph, a path or subgraph that connects gene X to candidate drugs through a specific chain of relations.
This is complex query answering: answering queries that involve conjunctions (AND), existential quantifiers (∃), and multi-hop paths, not just single-edge link prediction. Standard embedding methods collapse at this — they can score single triples but can't compose predictions across multiple hops.
Every complex KG query can be expressed as a directed acyclic graph (DAG). Leaf nodes are known anchor entities. The query node is the unknown we're looking for. Intermediate nodes are existentially quantified variables. Edges are relation types.
Click a query type to see its DAG structure and which KG paths it matches. The selected anchor entity (warm color) propagates through relation chains to find valid answers.
Query2Box (Ren et al., 2020) answers complex queries by representing each query node as a hyperrectangle (box) in embedding space. The box represents the set of entities that could satisfy the query at that node. Relation edges transform boxes: applying relation r to a box moves and reshapes it. Intersection (AND) computes the intersection of two boxes. The query answer is the set of entities inside the final box.
Boxes are learned jointly with entity embeddings. The model trains on a mixture of query types so it learns box operations that compose correctly. At inference time, you run the box computation through the query DAG and score each candidate entity by its distance to the final box.
The dream crystallizes here: a single model that accepts any knowledge graph as input and performs link prediction, even for entirely new entities and relation types it has never seen. No fine-tuning. No entity-specific embeddings. Just structural reasoning from graph patterns.
This showcase demonstrates the full pipeline: load two completely different KGs (one biomedical, one social), run the same foundation model on both, and watch it score triple candidates using only subgraph structure — no entity-specific information required.
Select a KG. The foundation model scores triple candidates using structural reasoning only. Toggle between KGs to see the same model weights producing correct predictions on completely different schemas — no retraining.
A foundation model for KGs must be trained on many different KGs simultaneously. This raises concrete engineering questions: how do you sample training data across graphs of wildly different sizes? How do you balance a tiny drug-interaction KG (50,000 triples) against a massive social KG (5 billion triples)?
KG training uses contrastive learning at the triple level: given a true triple (h, r, t), generate corrupted triples (h, r, t') where t' is a random entity that is not a valid tail for this query. Train the model to score the true triple higher than the corrupted triples. This is called negative sampling.
ULTRA's training curriculum mixes examples from different source KGs in each batch. But size imbalance is severe. Solution: sample graphs with probability proportional to the square root of their size (not proportional to size itself). This gives smaller KGs more representation without completely ignoring the large ones.
python # Multi-KG training loop (ULTRA-style) kg_sizes = [50_000, 1_200_000, 4_800_000] # triples per KG weights = [s**0.5 for s in kg_sizes] # sqrt balancing weights = [w / sum(weights) for w in weights] # normalize for step in range(max_steps): # Sample which KG to train on this step kg_idx = random.choices([0, 1, 2], weights=weights)[0] kg = knowledge_graphs[kg_idx] # Sample a batch of true triples from this KG triples = kg.sample_triples(batch_size=512) # Generate corrupted negatives (random tail entities) negatives = kg.corrupt_tail(triples, n_neg=64) # Score both true and corrupted; contrastive loss loss = margin_ranking_loss(model(kg, triples), model(kg, negatives)) loss.backward() optimizer.step()
How do you measure whether a KG foundation model actually generalizes? Standard KG benchmarks (FB15k-237, WN18RR) evaluate models on held-out triples from the same KG used for training. That measures interpolation, not zero-shot transfer. Foundation models need a different evaluation protocol.
For each test triple (h, r, t), replace the tail with every entity in the KG and rank them by the model's score. Report the rank of the true tail. Repeat for all test triples. The Mean Reciprocal Rank (MRR) is the average of 1/rank. A model that consistently puts the true answer first gets MRR = 1.0.
The "filtered" part: before computing rank, remove all other true triples from the candidate list. Without filtering, you might rank (h, r, t') above (h, r, t) because (h, r, t') is also a true fact — which is technically correct but unfairly penalizes the model.
For foundation model evaluation, the protocol is stricter: train on source KGs, evaluate on a completely different target KG with no fine-tuning. All entities and relations in the target KG are unseen. MRR is computed only on the target KG's test triples.
KG foundation models are promising but far from solved. This chapter is honest about the open problems — not to discourage, but because understanding the limitations tells you exactly where the next breakthrough will come from.
Different KGs don't just have different entities — they have different ontological assumptions. In Freebase, a "person" is an entity with relations like {bornIn, citizenship, occupation}. In DBpedia, a "person" has {birthPlace, nationality, profession}. These are semantically equivalent but syntactically different. A model trained on Freebase's schema may correctly learn that "bornIn" relates persons to locations — but on DBpedia, the relation is "birthPlace." The model must generalize across these surface-level schema differences.
Wikidata has over 100 million entities. Subgraph extraction for GraIL/ULTRA at this scale is prohibitive — extracting 3-hop subgraphs around every candidate entity during inference doesn't scale. Current foundation models work well on KGs up to ~5 million entities. Beyond that, approximate methods (pre-computed paths, sparse subgraph sampling) are needed but introduce errors.
Many KG facts are time-qualified: (Obama, presidentOf, USA, [2009, 2017]). Standard embedding models (and ULTRA) don't handle temporal qualifiers. Similarly, numerical attributes (population, temperature, price) don't fit naturally into the entity-relation-entity triple framework. Extending foundation models to temporal and numerical facts remains largely unsolved.
KG foundation models aren't academic curiosities. The domains where KG completion matters most — drug discovery, genomics, materials science — are also the domains where you constantly encounter new entities (newly sequenced genes, newly synthesized compounds) and where retraining from scratch every time a new entity appears is completely impractical.
A biomedical KG links drugs, genes, diseases, proteins, and biological pathways. The key task: predict which drug might treat which disease (drug repurposing). Standard approach: train TransE on the current KG. As new drugs enter clinical trials, their graph connections are gradually added, but their initial embedding is random. A foundation model trained on structural patterns can immediately reason about new drugs by their biological connections — no retraining.
Search engines maintain internal knowledge bases to answer factual queries directly ("What is the capital of France?"). These bases are perpetually incomplete — new entities are created daily (new companies, new movies, new people). A foundation model keeps the knowledge base useful for new entities without continuous retraining cycles.
Citation graphs + co-authorship graphs + concept graphs together form a scientific KG. Predicting which papers will be cited together, which authors are likely to collaborate, and which concepts are related helps research recommendation systems. New papers appear daily — a foundation model handles new paper entities from their first connections (authors, keywords, cited papers).
| Domain | KG entities | Key task | Why foundation model needed |
|---|---|---|---|
| Drug discovery | Drugs, genes, diseases | Drug repurposing | New drugs appear constantly; no embedding at synthesis time |
| Knowledge bases | Entities, facts | Fact completion | New entities added daily at internet scale |
| Scientific literature | Papers, authors, concepts | Citation/collaboration prediction | ~10,000 new papers per day |
| E-commerce | Products, attributes, categories | Product linking | New products added; schema differs across retailers |
KG foundation models sit at the intersection of knowledge representation, graph learning, and the broader foundation model paradigm. Knowing where these threads lead helps you navigate the literature.
| KG Concept | Analogy in another field | Key difference |
|---|---|---|
| Inductive entity reasoning | Zero-shot image classification (new visual classes) | KG structure is discrete; image features are continuous |
| Relation-level transfer (ULTRA) | Meta-learning for new tasks | Task = relation type; support set = KG structure |
| Query2Box (box intersection) | Set operations in fuzzy logic | Boxes are learned end-to-end; fuzzy sets are manually designed |
| KG completion (link prediction) | Collaborative filtering (matrix completion) | KG has typed, directed edges; CF has untyped implicit ratings |
| Multi-KG pre-training | Multilingual NLP pre-training | KG "language" = schema; much higher diversity than human languages |
"Knowledge is not a commodity to be stockpiled, but a capacity to connect things."
— adapted for KG foundation models