The world's facts encoded as (head, relation, tail) triples. Billions of them. Most are missing. Learn to predict what's absent from what's present.
J.K. Rowling wrote Harry Potter. Harry Potter belongs to the Fantasy genre. Fantasy is a subtype of Fiction. These are facts. And facts have a shape: a thing, connected to another thing, by a relationship.
A knowledge graph (KG) is a database that stores facts exactly this way — as triples: (head entity, relation, tail entity), or (h, r, t) for short. Every fact in the world can be squeezed into this form.
The entities — people, places, diseases, proteins, concepts — are the nodes. The relations — genre, author, treats, birthplace, subclass_of — are the edge types. Together they form a directed, multi-relational graph.
This isn't a toy idea. Real KGs are used at Google, Microsoft, and Amazon to power search, recommendations, and question-answering:
| KG | Entities | Relations | Triples | Used For |
|---|---|---|---|---|
| Freebase | 80M | 6,000+ | 3B | Google Search (deprecated) |
| Wikidata | 100M+ | 10,000+ | 1.3B | Wikipedia, general knowledge |
| Google KG | 500M+ | — | 3.5B | Search, Assistant |
| YAGO | 10M | 100+ | 120M | Research, NLP |
| Hetionet | 47K | 24 | 2.25M | Drug discovery, biology |
Here's the uncomfortable truth about every real KG: they are riddled with holes. Freebase, one of the largest KGs ever built, was studied carefully and the findings were stark:
This incompleteness isn't a bug in the data entry process. It's structural. The world has more facts than any team of humans can annotate. New entities are created every day. Relations are asymmetric — if A treats B is known, B is treatable by A might not be entered separately.
Knowledge graph completion (also called link prediction) is the problem of predicting missing triples from the ones you have. Given a partial KG, can a model score unseen (h, r, t) triples as "likely true" or "likely false"?
A small KG about authors and books. Dashed edges are missing — facts that aren't in the graph yet. Can you guess what they should be? Click a node to highlight its known facts.
How do you teach a computer to predict whether (Einstein, birthplace, Ulm) is a true fact? You can't do fuzzy string matching. You need to capture the meaning of entities and relations in a form a machine can reason about.
The answer: embeddings. Assign every entity and every relation a vector in some k-dimensional space. Then define a scoring function fr(h, t) that takes the head and tail entity vectors plus the relation vector, and outputs a scalar score — high if the triple is likely true, low otherwise.
Different embedding models differ in exactly one thing: the form of fr. Everything else — the training procedure, the evaluation metrics, the intuition — is shared.
Typical values: k = 50 to k = 1000. Larger k → more expressive, more parameters, more compute. For Freebase with 80M entities, even k = 200 gives 80M × 200 = 16 billion parameters for entity embeddings alone. This is why KG completion models are "shallow" — they don't stack layers.
True triples from the KG should score high. But there are infinitely many false triples — we can't enumerate them all. Instead, we corrupt a true triple by replacing the head or tail with a random entity to create a negative sample:
The loss pushes true scores above corrupted scores. The margin, which model, and how many corruptions per true triple are all design choices that affect final performance significantly.
TransE (Bordes et al., NeurIPS 2013) is the simplest and most elegant KG embedding model. It has one geometric intuition so clean you can draw it on a napkin: a relation is a translation vector.
If (h, r, t) is a true triple, then the tail entity should be located at the head entity plus the relation vector. In embedding space:
Think of it like directions on a map. "Paris" + "is_capital_of" ≈ "France". "Einstein" + "birthplace" ≈ "Ulm". "Penicillin" + "treats" ≈ "Pneumonia". The relation arrow, when added to the head, points at the tail.
Distance is always non-negative, so ||h + r − t|| = 0 means perfect alignment. We negate it so that the scoring function is higher for truer triples (||distance|| → 0 → score → 0), consistent with every other model where higher = more likely true.
Antisymmetric relations: If h + r ≈ t, then t + r ≠ h (unless r = 0). So "is_parent_of" learned as a vector naturally captures that parenthood is one-directional. Good.
Composition: If h + r₁ ≈ m and m + r₂ ≈ t, then h + (r₁ + r₂) ≈ t. Multi-hop facts compose naturally by adding relation vectors. Good.
Symmetric relations: "is_married_to" should satisfy fr(h, t) = fr(t, h). But if h + r ≈ t, then t + r ≈ h would require r = 0. So TransE would learn near-zero vectors for symmetric relations, which makes them uninformative. Bad.
1-to-N relations: "has_nationality" maps one country to many people. TransE would try to put all those people at the same point (France + has_nationality ≈ everyone born in France), forcing them to cluster together even if they're very different entities. Bad.
Entities are points in 2D space. Relations are arrows (translation vectors). True triple: the orange arrow from h points at t. Drag the head entity or adjust the relation vector to see how the score changes.
TransE's limitation with 1-to-N relations comes from a single cause: it forces every entity to live in the same space with the same metric. "France" must be simultaneously near "Paris" (via is_capital_of), near "Camus" (via nationality_of), and near "Brie" (via produced_in). It can't serve all three roles in one point.
TransR (Lin et al., AAAI 2015) fixes this with one idea: project entities into a relation-specific subspace before measuring distance. Each relation r gets not just a translation vector, but also a projection matrix Mr.
Here d is the entity embedding dimension and k is the relation space dimension. The matrix Mr projects the entity from d-dimensional entity space into k-dimensional relation space. Then the translation h + r ≈ t happens in that relation-specific k-dimensional space.
For a 1-to-N relation like "capital_of" (many capitals, one country each), entities that all map to France don't need to cluster in entity space. They just need to land near France after projection through Mr. Mr can map many different entity vectors to the same region in relation space.
Each relation now has a d × k projection matrix plus a k-dimensional translation vector. For a KG with 6,000 relation types (like Freebase), that's 6,000 × (d × k + k) parameters just for relations. If d = k = 200, that's 6,000 × 40,200 = ~241M parameters for relations alone. Compute-intensive.
The TransE family uses geometric translation. Let's try something completely different: a multiplicative scoring function. Instead of asking "is h + r close to t?", ask "how well do h, r, and t align when you multiply their components?
DistMult (Yang et al., ICLR 2015) defines the scoring function as an element-wise trilinear product:
Each dimension i contributes hi × ri × ti to the score. The relation vector ri acts as a per-dimension weight that says how much dimension i matters for this relation. If ri is large and positive, the model wants hi and ti to have the same sign. If ri is large and negative, it wants them to have opposite signs.
Notice: ∑ hi · ri · ti = ∑ ti · ri · hi. Multiplication is commutative. So fr(h, t) = fr(t, h) always. DistMult is symmetric by construction.
The "only diagonal" point is worth unpacking. A full bilinear model would use a matrix Wr per relation and score fr(h, t) = hT Wr t. DistMult restricts Wr to be diagonal — this is the "Dist" (diagonalized) in "DistMult." It saves parameters (k instead of k²) but loses the ability to model interactions between different dimensions of h and t.
DistMult is symmetric because real multiplication commutes. But what if we embedded in complex space? Complex multiplication doesn't commute — and that's exactly what we need to handle both symmetric and antisymmetric relations.
ComplEx (Trouillon et al., ICML 2016) embeds entities and relations in ℂk — vectors of complex numbers. Each component hi = ai + bii, where a is the real part and b is the imaginary part.
The scoring function uses the Hermitian inner product:
That overbar on t̄ denotes the complex conjugate: if ti = a + bi, then t̄i = a − bi. We take the real part Re(·) of the sum because scores should be real numbers.
Let's compare fr(h, t) and fr(t, h) in ComplEx:
These are NOT the same in general. The conjugate is applied to t in the first case, and to h in the second. Only when all imaginary parts of h and t are zero do they coincide — that's real DistMult as a special case.
For symmetric relations (e.g., "is_married_to"): The model learns relation vectors r with zero imaginary part. Then conjugation has no effect, and fr(h, t) = fr(t, h). ComplEx reduces to DistMult for these.
For antisymmetric relations (e.g., "is_parent_of"): The model learns relation vectors r with nonzero imaginary parts. The conjugation now makes fr(h, t) ≠ fr(t, h). The model can score (parent, r, child) high and (child, r, parent) low simultaneously.
Entities h (orange) and t (teal) as complex numbers. The relation r rotates/scales. Conjugation flips t across the real axis. Watch how fr(h,t) ≠ fr(t,h) for asymmetric r. Toggle the relation type to see the difference.
ComplEx handles symmetry and antisymmetry. But can one model handle everything? RotatE (Sun et al., ICLR 2019) comes close by making relations even more geometric: each relation is a rotation in complex space.
Embed every entity in ℂk. Represent each relation r as a vector of unit complex numbers — complex numbers on the unit circle, each parameterized by an angle θi:
A true triple (h, r, t) should satisfy:
Where ∘ is element-wise complex multiplication. Since |ri| = 1, multiplying by ri rotates hi by angle θi without changing its magnitude. Each dimension gets independently rotated by its own angle.
Each pattern corresponds to a specific constraint on the rotation angles θ:
Each dot is a complex number (angle = position on circle, radius = magnitude). The relation rotates h by angle θ. True triple: h ∘ r should land on t. Adjust θ to see how rotation predicts the tail.
We've seen four models. Each has a different geometric intuition and a different set of relation patterns it can express. A relation pattern is a structural property that holds across the triples of a relation — regardless of which specific entities are involved.
Symmetric: r(h,t) ⟹ r(t,h)
Examples: is_married_to, is_similar_to, co-occurs_with
Antisymmetric: r(h,t) ⟹ ¬r(t,h)
Examples: is_parent_of, is_larger_than, hypernym_of
Inverse: r₁(h,t) ⟺ r₂(t,h)
Examples: parent_of ↔ child_of, treats ↔ treated_by
Composition: r₁(h,m) ∧ r₂(m,t) ⟹ r₃(h,t)
Examples: parent_of ∘ parent_of = grandparent_of
1-to-N: r(h,t₁) ∧ r(h,t₂) with t₁ ≠ t₂
Examples: has_nationality (one person, one country), has_genre (movie, many genres)
| Pattern | TransE | TransR | DistMult | ComplEx | RotatE |
|---|---|---|---|---|---|
| Symmetric | ✗ | ✗ | ✓ | ✓ | ✓ |
| Antisymmetric | ✓ | ✓ | ✗ | ✓ | ✓ |
| Inverse | ✓ | ✓ | ✗ | ✓ | ✓ |
| Composition | ✓ | ✗ | ✗ | ✗ | ✓ |
| 1-to-N | ✗ | ✓ | ✓ | ✓ | ✓ |
RotatE dominates: it's the only model that handles all five patterns. But this isn't pure win — it comes with the overhead of complex arithmetic and careful initialization of angles.
In practice, performance depends on your KG's dominant relation types. If most relations are symmetric (social networks, similarity graphs), DistMult or ComplEx win on speed. If composition is critical (hierarchical taxonomies, multi-hop reasoning), RotatE or TransE are better choices.
We know what to optimize (the scoring function), and we know the signal (true triples should score higher). But KGs only contain true facts — there are no labeled false triples. We have to create them.
For every true triple (h, r, t) in the training set, we generate negative triples by corrupting either the head or the tail:
This assumes the KG is a closed world for training — any triple not in the KG is treated as false. This is an approximation: some corrupted triples might actually be true but just not recorded. In practice, this rarely causes major problems because the KG is so sparse.
Margin loss (used by TransE):
Here γ is the margin — we want true triples to score at least γ higher than corrupted ones. If they already do, the loss is zero. If not, we push in the right direction.
Cross-entropy loss (used by DistMult, ComplEx, RotatE): Apply softmax over all possible tails for a given (h, r), then take cross-entropy with the true tail. This is more expensive (sum over all entities) but often trains faster to convergence.
The temperature α controls how aggressive the hard-negative mining is. High α → very focused on the hardest negatives. Low α → nearly uniform sampling like the basic approach.
Given a test triple (h, r, t), replace t with every entity in E and score all of them. Rank t among all these candidates. Metrics:
| Metric | Formula | Meaning |
|---|---|---|
| MRR | Mean(1/rank) | Mean reciprocal rank — 1.0 is perfect, higher is better |
| Hits@1 | % rank=1 | Fraction of test triples ranked #1 |
| Hits@10 | % rank≤10 | Fraction of test triples in top 10 |
Typical results on FB15k-237 (a Freebase subset): RotatE achieves MRR ~0.338, Hits@10 ~0.533. TransE achieves MRR ~0.279, Hits@10 ~0.465. The gap is real but not enormous — dataset and hyperparameter choice matter a lot.
Watch a mini KG get trained with margin loss. Each step: pick a true triple, generate a negative, compute the loss, update embeddings. The score gap between true and false triples should grow over time.
We've built the full picture of shallow KG embeddings. Before moving on, let's situate this in the wider landscape — what these models can't do, and what comes next.
Every model in this lecture — TransE, TransR, DistMult, ComplEx, RotatE — is a shallow embedding model. Entities have one embedding vector, and that vector doesn't change based on context. The embedding of "Paris" is the same whether you're asking about its geography, its cultural institutions, or its historical events.
The next step — covered in later CS224W lectures — is to replace the static entity embeddings with GNN-computed representations. Instead of a fixed vector for "Paris," a GNN aggregates information from Paris's neighbors (Eiffel Tower, Seine River, France, ...) and computes a context-aware embedding. Then you apply the same scoring functions (TransE, RotatE) on top.
This combination gives you the expressiveness of relational geometry AND the contextual flexibility of graph neural networks.
Google's Knowledge Panel ("Albert Einstein — German-born physicist, 1879-1955") is powered by a KG. KG embeddings enable fuzzy lookups: "Who are physicists born in Germany?" becomes a nearest-neighbor query in embedding space.
Hetionet KG connects Genes, Diseases, Compounds, Pathways. KG completion predicts: "Compound X treats Disease Y" — novel drug-disease links. Several COVID-19 drug repurposing studies used exactly this approach.
E-commerce KGs: (User, purchases, Product), (Product, belongs_to, Category), (User, searches, Query). KG completion predicts (User, will_buy, Product). Richer than pure collaborative filtering because it explains WHY (via intermediate nodes).
Two KGs about the same world — e.g., Wikidata and Freebase — both have "Paris" but with different IDs. KG embeddings in a shared space can align entities across KGs by embedding proximity.
| Topic | Connection |
|---|---|
| Lec 9: Heterogeneous Graphs | KGs ARE heterogeneous graphs — entities are node types, relations are edge types. RGCN can be applied to KGs as a deep alternative to shallow embeddings. |
| Lec 11: Reasoning over KGs | Multi-hop reasoning — answering "Who are the siblings of Einstein's collaborators?" — extends KG embedding into path-based logic. |
| Lec 3-4: GNNs | GNN-based KG completion replaces static entity embeddings with neighborhood-aggregated representations for context-aware scoring. |
"The language of mathematics is not discovered — it is invented. But how well it fits the world is a miracle nonetheless."
— adapted from Wigner, on the unreasonable effectiveness of mathematics