CS224W Lecture 10

Knowledge Graphs

The world's facts encoded as (head, relation, tail) triples. Billions of them. Most are missing. Learn to predict what's absent from what's present.

Prerequisites: Basic graph theory + vectors in Rk. That's it.
10
Chapters
5+
Simulations
0
Assumed Knowledge

Chapter 0: What Are Knowledge Graphs?

J.K. Rowling wrote Harry Potter. Harry Potter belongs to the Fantasy genre. Fantasy is a subtype of Fiction. These are facts. And facts have a shape: a thing, connected to another thing, by a relationship.

A knowledge graph (KG) is a database that stores facts exactly this way — as triples: (head entity, relation, tail entity), or (h, r, t) for short. Every fact in the world can be squeezed into this form.

(J.K. Rowling,   genre,   Fantasy)
(Harry Potter,   author,   J.K. Rowling)
(Penicillin,   treats,   Pneumonia)

The entities — people, places, diseases, proteins, concepts — are the nodes. The relations — genre, author, treats, birthplace, subclass_of — are the edge types. Together they form a directed, multi-relational graph.

Real Knowledge Graphs Are Enormous

This isn't a toy idea. Real KGs are used at Google, Microsoft, and Amazon to power search, recommendations, and question-answering:

KGEntitiesRelationsTriplesUsed For
Freebase80M6,000+3BGoogle Search (deprecated)
Wikidata100M+10,000+1.3BWikipedia, general knowledge
Google KG500M+3.5BSearch, Assistant
YAGO10M100+120MResearch, NLP
Hetionet47K242.25MDrug discovery, biology

The Big Problem: Massive Incompleteness

Here's the uncomfortable truth about every real KG: they are riddled with holes. Freebase, one of the largest KGs ever built, was studied carefully and the findings were stark:

93.8% of persons in Freebase have no recorded birthplace. 78.5% have no recorded nationality. 75% have no known parents. The KG is not a complete encyclopedia — it's a sparse sample of facts that happened to get entered.

This incompleteness isn't a bug in the data entry process. It's structural. The world has more facts than any team of humans can annotate. New entities are created every day. Relations are asymmetric — if A treats B is known, B is treatable by A might not be entered separately.

The Task: KG Completion

Knowledge graph completion (also called link prediction) is the problem of predicting missing triples from the ones you have. Given a partial KG, can a model score unseen (h, r, t) triples as "likely true" or "likely false"?

Knowledge Graph Explorer

A small KG about authors and books. Dashed edges are missing — facts that aren't in the graph yet. Can you guess what they should be? Click a node to highlight its known facts.

Click any node to see its known relations.
Freebase has 80 million entities but 93.8% of persons lack a recorded birthplace. What does this tell us about knowledge graphs?

Chapter 1: The KG Embedding Framework

How do you teach a computer to predict whether (Einstein, birthplace, Ulm) is a true fact? You can't do fuzzy string matching. You need to capture the meaning of entities and relations in a form a machine can reason about.

The answer: embeddings. Assign every entity and every relation a vector in some k-dimensional space. Then define a scoring function fr(h, t) that takes the head and tail entity vectors plus the relation vector, and outputs a scalar score — high if the triple is likely true, low otherwise.

fr(h, t) ∈ ℝ   →   higher = more likely true

Different embedding models differ in exactly one thing: the form of fr. Everything else — the training procedure, the evaluation metrics, the intuition — is shared.

The key insight: We're not learning rules like "X birthplace Y if Y is a city and X was born there." We're learning geometry. Entities and relations are points and transformations in vector space, and truth is proximity or alignment in that space.

What Goes In and What Comes Out

Entities: h, t ∈ ℝk
One vector per entity, shared across all relations. Learned during training.
Relation: r ∈ ℝk (or ℂk)
One vector (or matrix, or rotation) per relation type. Learned during training.
Scoring Function fr(h, t)
Combines h, r, t into a scalar. Different models = different fr.
Prediction
High score = likely true triple. Low score = likely false triple.

The Embedding Dimension k Is a Hyperparameter

Typical values: k = 50 to k = 1000. Larger k → more expressive, more parameters, more compute. For Freebase with 80M entities, even k = 200 gives 80M × 200 = 16 billion parameters for entity embeddings alone. This is why KG completion models are "shallow" — they don't stack layers.

Training Signal: Contrastive Learning

True triples from the KG should score high. But there are infinitely many false triples — we can't enumerate them all. Instead, we corrupt a true triple by replacing the head or tail with a random entity to create a negative sample:

(h, r, t)   → true      (h', r, t) or (h, r, t')   → corrupted (false)

The loss pushes true scores above corrupted scores. The margin, which model, and how many corruptions per true triple are all design choices that affect final performance significantly.

All KG embedding models share the same training procedure but differ in one thing. What is it?

Chapter 2: TransE — Translation in Embedding Space

TransE (Bordes et al., NeurIPS 2013) is the simplest and most elegant KG embedding model. It has one geometric intuition so clean you can draw it on a napkin: a relation is a translation vector.

If (h, r, t) is a true triple, then the tail entity should be located at the head entity plus the relation vector. In embedding space:

h + r ≈ t

Think of it like directions on a map. "Paris" + "is_capital_of" ≈ "France". "Einstein" + "birthplace" ≈ "Ulm". "Penicillin" + "treats" ≈ "Pneumonia". The relation arrow, when added to the head, points at the tail.

Scoring function: fr(h, t) = −||h + r − t||2. High (less negative) = the triple is close to true. Low (very negative) = far apart.

Why the Minus Sign?

Distance is always non-negative, so ||h + r − t|| = 0 means perfect alignment. We negate it so that the scoring function is higher for truer triples (||distance|| → 0 → score → 0), consistent with every other model where higher = more likely true.

What TransE Handles Well

Antisymmetric relations: If h + r ≈ t, then t + r ≠ h (unless r = 0). So "is_parent_of" learned as a vector naturally captures that parenthood is one-directional. Good.

Composition: If h + r₁ ≈ m and m + r₂ ≈ t, then h + (r₁ + r₂) ≈ t. Multi-hop facts compose naturally by adding relation vectors. Good.

What TransE Cannot Handle

Symmetric relations: "is_married_to" should satisfy fr(h, t) = fr(t, h). But if h + r ≈ t, then t + r ≈ h would require r = 0. So TransE would learn near-zero vectors for symmetric relations, which makes them uninformative. Bad.

1-to-N relations: "has_nationality" maps one country to many people. TransE would try to put all those people at the same point (France + has_nationality ≈ everyone born in France), forcing them to cluster together even if they're very different entities. Bad.

TransE: Watch the Translation

Entities are points in 2D space. Relations are arrows (translation vectors). True triple: the orange arrow from h points at t. Drag the head entity or adjust the relation vector to see how the score changes.

rx 80
ry -60
Score: fr(h,t) = -||h + r - t||
For a symmetric relation like "is_married_to", TransE requires fr(h,t) = fr(t,h). Why can't TransE satisfy this?

Chapter 3: TransR — Relation-Specific Projection

TransE's limitation with 1-to-N relations comes from a single cause: it forces every entity to live in the same space with the same metric. "France" must be simultaneously near "Paris" (via is_capital_of), near "Camus" (via nationality_of), and near "Brie" (via produced_in). It can't serve all three roles in one point.

TransR (Lin et al., AAAI 2015) fixes this with one idea: project entities into a relation-specific subspace before measuring distance. Each relation r gets not just a translation vector, but also a projection matrix Mr.

Mr h + r ≈ Mr t     where Mr ∈ ℝk×d

Here d is the entity embedding dimension and k is the relation space dimension. The matrix Mr projects the entity from d-dimensional entity space into k-dimensional relation space. Then the translation h + r ≈ t happens in that relation-specific k-dimensional space.

Analogy: Different relations see entities from different "perspectives." The relation "nationality_of" cares about where an entity was created or born — it projects France's embedding onto a geography-relevant subspace. The relation "authored_by" cares about the creative lineage — it projects onto a different subspace. France and Paris might be nearby in the nationality subspace but far apart in the authorship subspace.

How This Fixes 1-to-N

For a 1-to-N relation like "capital_of" (many capitals, one country each), entities that all map to France don't need to cluster in entity space. They just need to land near France after projection through Mr. Mr can map many different entity vectors to the same region in relation space.

fr(h, t) = −||Mrh + r − Mrt||2

The Cost: More Parameters

Each relation now has a d × k projection matrix plus a k-dimensional translation vector. For a KG with 6,000 relation types (like Freebase), that's 6,000 × (d × k + k) parameters just for relations. If d = k = 200, that's 6,000 × 40,200 = ~241M parameters for relations alone. Compute-intensive.

TransE

  • Params per relation: k (translation vector)
  • Symmetric: ✗
  • 1-to-N: ✗
  • Fast, simple

TransR

  • Params per relation: k×d + k (matrix + vector)
  • Symmetric: ✗ (still translation-based)
  • 1-to-N: ✓
  • Expensive for many relation types
Still can't do symmetric relations. TransR inherits TransE's core mechanism — translation in relation space — so it faces the same r = 0 problem for symmetric relations. The projection helps with 1-to-N but not with symmetry.
TransR uses a projection matrix Mr per relation. What specific limitation of TransE does this fix?

Chapter 4: DistMult — Bilinear Scoring

The TransE family uses geometric translation. Let's try something completely different: a multiplicative scoring function. Instead of asking "is h + r close to t?", ask "how well do h, r, and t align when you multiply their components?

DistMult (Yang et al., ICLR 2015) defines the scoring function as an element-wise trilinear product:

fr(h, t) = ⟨h, r, t⟩ = ∑i hi · ri · ti

Each dimension i contributes hi × ri × ti to the score. The relation vector ri acts as a per-dimension weight that says how much dimension i matters for this relation. If ri is large and positive, the model wants hi and ti to have the same sign. If ri is large and negative, it wants them to have opposite signs.

Analogy: a weighting mixer. Imagine h and t are mixing board settings — one vector per channel. The relation r tells you which channels matter. The score is the sum of what's playing in each active channel, weighted by both entities' settings in that channel.

The Symmetry Problem

Notice: ∑ hi · ri · ti = ∑ ti · ri · hi. Multiplication is commutative. So fr(h, t) = fr(t, h) always. DistMult is symmetric by construction.

This is a hard constraint, not a soft bias. No matter how you train DistMult, it will score (h, r, t) the same as (t, r, h) for every triple. For a relation like "is_married_to" — where the score should be symmetric — this is exactly right. For "is_parent_of" — which is antisymmetric — DistMult literally cannot represent it correctly.

What DistMult Can and Cannot Do

DistMult strengths

  • Symmetric relations: ✓ (natural)
  • 1-to-N relations: ✓ (bilinear is expressive enough)
  • Fast: no matrix multiplication, just elementwise
  • Scales to large KGs easily

DistMult limitations

  • Antisymmetric relations: ✗ (structurally impossible)
  • Inverse relations: ✗ (same issue as antisymmetry)
  • Composition: ✗ (bilinear doesn't compose)
  • Only captures diagonal of full bilinear form

The "only diagonal" point is worth unpacking. A full bilinear model would use a matrix Wr per relation and score fr(h, t) = hT Wr t. DistMult restricts Wr to be diagonal — this is the "Dist" (diagonalized) in "DistMult." It saves parameters (k instead of k²) but loses the ability to model interactions between different dimensions of h and t.

DistMult scores fr(h,t) = Σ hi ri ti. Why can it NOT model the relation "is_parent_of"?

Chapter 5: ComplEx — Complex-Valued Bilinear

DistMult is symmetric because real multiplication commutes. But what if we embedded in complex space? Complex multiplication doesn't commute — and that's exactly what we need to handle both symmetric and antisymmetric relations.

ComplEx (Trouillon et al., ICML 2016) embeds entities and relations in k — vectors of complex numbers. Each component hi = ai + bii, where a is the real part and b is the imaginary part.

The scoring function uses the Hermitian inner product:

fr(h, t) = Re(⟨h, r, t̄⟩) = Re(∑i hi · ri · t̄i)

That overbar on t̄ denotes the complex conjugate: if ti = a + bi, then t̄i = a − bi. We take the real part Re(·) of the sum because scores should be real numbers.

Why Conjugation Breaks Symmetry

Let's compare fr(h, t) and fr(t, h) in ComplEx:

fr(h, t) = Re(∑i hi rii)
fr(t, h) = Re(∑i ti rii)

These are NOT the same in general. The conjugate is applied to t in the first case, and to h in the second. Only when all imaginary parts of h and t are zero do they coincide — that's real DistMult as a special case.

The geometric picture: Complex multiplication rotates and scales. When you conjugate t, you flip its angle in the complex plane. So measuring alignment between h, r, and t̄ is different from measuring alignment between t, r, and h̄. The asymmetry is baked into the conjugation.

Handling Both Symmetric and Antisymmetric Relations

For symmetric relations (e.g., "is_married_to"): The model learns relation vectors r with zero imaginary part. Then conjugation has no effect, and fr(h, t) = fr(t, h). ComplEx reduces to DistMult for these.

For antisymmetric relations (e.g., "is_parent_of"): The model learns relation vectors r with nonzero imaginary parts. The conjugation now makes fr(h, t) ≠ fr(t, h). The model can score (parent, r, child) high and (child, r, parent) low simultaneously.

ComplEx: Complex Space Showcase

Entities h (orange) and t (teal) as complex numbers. The relation r rotates/scales. Conjugation flips t across the real axis. Watch how fr(h,t) ≠ fr(t,h) for asymmetric r. Toggle the relation type to see the difference.

h angle (θh) 30°
t angle (θt) 120°
Scores: f(h,t) = —    f(t,h) = —
ComplEx scores fr(h,t) = Re(Σ hi rii). What makes this NOT always equal to fr(t,h)?

Chapter 6: RotatE — Rotation as Relation

ComplEx handles symmetry and antisymmetry. But can one model handle everything? RotatE (Sun et al., ICLR 2019) comes close by making relations even more geometric: each relation is a rotation in complex space.

Embed every entity in ℂk. Represent each relation r as a vector of unit complex numbers — complex numbers on the unit circle, each parameterized by an angle θi:

ri = ei = cos(θi) + i sin(θi),    |ri| = 1

A true triple (h, r, t) should satisfy:

h ∘ r = t

Where ∘ is element-wise complex multiplication. Since |ri| = 1, multiplying by ri rotates hi by angle θi without changing its magnitude. Each dimension gets independently rotated by its own angle.

fr(h, t) = −||h ∘ r − t||

Why Rotation Covers All Relation Patterns

Each pattern corresponds to a specific constraint on the rotation angles θ:

RotatE is the most expressive of the four models. It handles symmetric, antisymmetric, inverse, and composition — four distinct relation patterns. The unit-circle constraint on r is not a limitation but the source of its power: rotations compose perfectly.
RotatE: Rotation in Complex Space

Each dot is a complex number (angle = position on circle, radius = magnitude). The relation rotates h by angle θ. True triple: h ∘ r should land on t. Adjust θ to see how rotation predicts the tail.

θ (rotation) 90°
h angle 30°
Score: fr(h,t) = -||h∘r - t||
In RotatE, a symmetric relation like "is_married_to" should satisfy h∘r = t AND t∘r = h. What rotation angle θ achieves this?

Chapter 7: Relation Patterns — The Full Comparison

We've seen four models. Each has a different geometric intuition and a different set of relation patterns it can express. A relation pattern is a structural property that holds across the triples of a relation — regardless of which specific entities are involved.

The Five Patterns

Symmetric: r(h,t) ⟹ r(t,h)
Examples: is_married_to, is_similar_to, co-occurs_with

Antisymmetric: r(h,t) ⟹ ¬r(t,h)
Examples: is_parent_of, is_larger_than, hypernym_of

Inverse: r₁(h,t) ⟺ r₂(t,h)
Examples: parent_of ↔ child_of, treats ↔ treated_by

Composition: r₁(h,m) ∧ r₂(m,t) ⟹ r₃(h,t)
Examples: parent_of ∘ parent_of = grandparent_of

1-to-N: r(h,t₁) ∧ r(h,t₂) with t₁ ≠ t₂
Examples: has_nationality (one person, one country), has_genre (movie, many genres)

Which Model Handles Which Pattern?

Pattern TransE TransR DistMult ComplEx RotatE
Symmetric
Antisymmetric
Inverse
Composition
1-to-N

RotatE dominates: it's the only model that handles all five patterns. But this isn't pure win — it comes with the overhead of complex arithmetic and careful initialization of angles.

Why DistMult and ComplEx fail on composition: Adding bilinear scores doesn't compose in the way rotation angles do. If fr₁(h,m) = Σ hi r₁ii is high and fr₂(m,t) is high, there's no algebraic way to derive that fr₃(h,t) should be high for r₃ = r₁ ∘ r₂. The bilinear form doesn't chain.

Practical Guidance

In practice, performance depends on your KG's dominant relation types. If most relations are symmetric (social networks, similarity graphs), DistMult or ComplEx win on speed. If composition is critical (hierarchical taxonomies, multi-hop reasoning), RotatE or TransE are better choices.

The relation "grandparent_of" can be expressed as "parent_of ∘ parent_of" (composition). Which models can represent this pattern?

Chapter 8: Training KG Embeddings

We know what to optimize (the scoring function), and we know the signal (true triples should score higher). But KGs only contain true facts — there are no labeled false triples. We have to create them.

Negative Sampling

For every true triple (h, r, t) in the training set, we generate negative triples by corrupting either the head or the tail:

(h', r, t)   where   h' ∼ Uniform(E)     or     (h, r, t')   where   t' ∼ Uniform(E)

This assumes the KG is a closed world for training — any triple not in the KG is treated as false. This is an approximation: some corrupted triples might actually be true but just not recorded. In practice, this rarely causes major problems because the KG is so sparse.

Loss Functions

Margin loss (used by TransE):

L = ∑(h,r,t)∈KG(h',r,t') max(0, γ + fr(h',t') − fr(h,t))

Here γ is the margin — we want true triples to score at least γ higher than corrupted ones. If they already do, the loss is zero. If not, we push in the right direction.

Cross-entropy loss (used by DistMult, ComplEx, RotatE): Apply softmax over all possible tails for a given (h, r), then take cross-entropy with the true tail. This is more expensive (sum over all entities) but often trains faster to convergence.

Self-adversarial negative sampling (RotatE's trick): Sample negative triples with probability proportional to the current model's score. High-scoring negatives are the "hard negatives" — the ones the model almost thinks are true. Training on hard negatives gives stronger gradient signal.
p(h'j, r, t'j) = softmax(α · fr(h'j, t'j))

The temperature α controls how aggressive the hard-negative mining is. High α → very focused on the hardest negatives. Low α → nearly uniform sampling like the basic approach.

Evaluation: Ranking Metrics

Given a test triple (h, r, t), replace t with every entity in E and score all of them. Rank t among all these candidates. Metrics:

MetricFormulaMeaning
MRRMean(1/rank)Mean reciprocal rank — 1.0 is perfect, higher is better
Hits@1% rank=1Fraction of test triples ranked #1
Hits@10% rank≤10Fraction of test triples in top 10

Typical results on FB15k-237 (a Freebase subset): RotatE achieves MRR ~0.338, Hits@10 ~0.533. TransE achieves MRR ~0.279, Hits@10 ~0.465. The gap is real but not enormous — dataset and hyperparameter choice matter a lot.

Training Dynamics Showcase

Watch a mini KG get trained with margin loss. Each step: pick a true triple, generate a negative, compute the loss, update embeddings. The score gap between true and false triples should grow over time.

Learning Rate 0.05
Press Start to begin training.
Why do KG embedding models use "negative sampling" during training?

Chapter 9: Connections and What's Next

We've built the full picture of shallow KG embeddings. Before moving on, let's situate this in the wider landscape — what these models can't do, and what comes next.

The Shallow Ceiling

Every model in this lecture — TransE, TransR, DistMult, ComplEx, RotatE — is a shallow embedding model. Entities have one embedding vector, and that vector doesn't change based on context. The embedding of "Paris" is the same whether you're asking about its geography, its cultural institutions, or its historical events.

The fundamental limitation: Shallow embeddings cannot perform multi-hop reasoning that requires integrating evidence across the graph. "Who are the siblings of Einstein's students?" requires traversing: Einstein → students → each student's parents → each parent's other children. A lookup of a fixed vector can't chain these steps.

Combining GNNs with KG Embeddings

The next step — covered in later CS224W lectures — is to replace the static entity embeddings with GNN-computed representations. Instead of a fixed vector for "Paris," a GNN aggregates information from Paris's neighbors (Eiffel Tower, Seine River, France, ...) and computes a context-aware embedding. Then you apply the same scoring functions (TransE, RotatE) on top.

This combination gives you the expressiveness of relational geometry AND the contextual flexibility of graph neural networks.

Applications Where KGs Matter

Search and QA

Google's Knowledge Panel ("Albert Einstein — German-born physicist, 1879-1955") is powered by a KG. KG embeddings enable fuzzy lookups: "Who are physicists born in Germany?" becomes a nearest-neighbor query in embedding space.

Drug Discovery

Hetionet KG connects Genes, Diseases, Compounds, Pathways. KG completion predicts: "Compound X treats Disease Y" — novel drug-disease links. Several COVID-19 drug repurposing studies used exactly this approach.

Recommendation

E-commerce KGs: (User, purchases, Product), (Product, belongs_to, Category), (User, searches, Query). KG completion predicts (User, will_buy, Product). Richer than pure collaborative filtering because it explains WHY (via intermediate nodes).

Entity Alignment

Two KGs about the same world — e.g., Wikidata and Freebase — both have "Paris" but with different IDs. KG embeddings in a shared space can align entities across KGs by embedding proximity.

How This Lecture Connects to Others

TopicConnection
Lec 9: Heterogeneous GraphsKGs ARE heterogeneous graphs — entities are node types, relations are edge types. RGCN can be applied to KGs as a deep alternative to shallow embeddings.
Lec 11: Reasoning over KGsMulti-hop reasoning — answering "Who are the siblings of Einstein's collaborators?" — extends KG embedding into path-based logic.
Lec 3-4: GNNsGNN-based KG completion replaces static entity embeddings with neighborhood-aggregated representations for context-aware scoring.
Key takeaway from this lecture: The choice of scoring function encodes a geometric hypothesis about how relations work. TransE says "relations are translations." RotatE says "relations are rotations." Neither is universally correct — the right model depends on which relation patterns dominate your specific KG.
"The language of mathematics is not discovered — it is invented. But how well it fits the world is a miracle nonetheless."
— adapted from Wigner, on the unreasonable effectiveness of mathematics
← Lec 9: Heterogeneous Graphs