A knowledge graph contains billions of facts — (Paris, capital-of, France), (Einstein, bornIn, Germany) — but most facts are missing. TransE learns to complete those gaps by treating each relation as a translation vector in embedding space: if you start at the head entity and add the relation vector, you should arrive at the tail entity.
Freebase has 1.2 billion facts. Wikidata has 100 million. Google's Knowledge Graph has trillions. These databases — called knowledge graphs (KGs) — store facts as triples: (head entity, relation, tail entity). (Barack Obama, bornIn, Hawaii). (Marie Curie, wonPrize, Nobel Prize in Physics).
But they're massively incomplete. An analysis of Freebase found that 71% of people have no recorded place of birth. 75% have no nationality. The knowledge is there — scattered in Wikipedia articles, news archives, books — but not yet structured into triples. The task of knowledge graph completion (also called link prediction) is: given a KG with missing facts, predict which triples are likely true.
The naive approach is a massive boolean matrix: rows = all entities, columns = (relation, entity) pairs, cell = 1 if the fact exists. For 1M entities and 1000 relation types, this matrix has 10¹² entries — mostly zero, completely uncompressible. We need embeddings: dense vectors that capture the patterns.
TransE's idea is almost absurdly simple. Represent every entity and every relation as a vector in the same d-dimensional space. Then enforce this constraint: for a true fact (h, r, t), the embedding of h plus the embedding of r should approximately equal the embedding of t.
Think of it geometrically: entities are points. Relations are arrows. The arrow for "capital-of" should point from every country toward its capital city. Start at Germany (a vector), add "capital-of" (another vector), and you should land near Berlin (a third vector).
This mirrors the famous Word2Vec analogy: King − Man + Woman ≈ Queen. The same principle — relationships live in vector arithmetic — works surprisingly well for structured facts in knowledge graphs.
Drag the relation vector (orange arrow). Watch where h + r lands relative to t. The loss is the distance between h+r and t.
To complete the knowledge graph, you need to score candidate triples: "how likely is (h, r, t) to be a true fact?" TransE uses the dissimilarity of h+r from t as the score — lower is better:
If h + r lands exactly at t, the score is 0 — perfect. If they're far apart, the score is large — unlikely fact. The paper evaluates both L1 (Manhattan distance, sum of absolute differences) and L2 (Euclidean distance) norms, with L1 performing slightly better in practice.
At inference time, to answer "what is the capital of Germany?", you compute the score for every entity as a candidate tail: (Germany, capital-of, Berlin), (Germany, capital-of, Paris), (Germany, capital-of, Tokyo), ... and rank them by score. The entity with the lowest score is the predicted answer.
python import torch def transe_score(h, r, t, norm=1): # h, r, t: embedding vectors [batch, dim] # Returns dissimilarity score (lower = more likely true fact) diff = h + r - t # [batch, dim] return torch.norm(diff, p=norm, dim=-1) # [batch] def predict_tail(h, r, all_entities): # h: [dim], r: [dim], all_entities: [N, dim] candidate = h + r # [dim] dists = torch.norm(all_entities - candidate, p=1, dim=-1) # [N] return torch.argsort(dists) # ranked entity indices
TransE learns embeddings via a pairwise margin loss: for each true triple (h, r, t), sample a "corrupted" triple (h', r, t') where either h' or t' is a random entity. Then penalize the model when the true triple doesn't score better than the corrupted one by at least a margin γ.
The [·]₊ notation means "max(0, ·)" — only penalize when the bracketed expression is positive. So: loss is zero when f_r(h,t) ≤ f_r(h',t') − γ, i.e., when the true triple scores at least γ better than the corrupted one. γ is the margin hyperparameter (paper uses γ = 1 or 2).
The corrupted triple (h', r, t') replaces either the head or the tail (never the relation) with a random entity from the entity set. This is the KG analogy of negative sampling: the false triple serves as a negative example that the model learns to score poorly.
TransE is elegant, but the translation model has hard structural limits. Three relation patterns are fundamentally impossible to represent:
Symmetric relations: "is married to" is symmetric — if A is married to B, then B is married to A. In TransE, this requires h + r = t AND t + r = h simultaneously, which implies r = 0. But r = 0 means the relation has no information — every entity would be its own "spouse." TransE maps symmetric relations to zero or near-zero vectors, destroying their meaning.
1-to-N (one-to-many) relations: "speaks language" maps one person to many languages. If (Alice, speaks, English) and (Alice, speaks, French), then Alice + speaks ≈ English and Alice + speaks ≈ French simultaneously. The model wants Alice + speaks to be two different points at once — impossible. TransE learns speaks = midpoint between English and French, making the score for both okay but neither great.
Composition: "bornIn" ∘ "locatedIn" should imply "nationality" (born in Paris, Paris located in France → French nationality). TransE can sometimes capture this via vector addition, but it's not guaranteed and breaks down for longer chains.
TransE is evaluated on two standard benchmarks: FB15k (a 15K-entity subset of Freebase, 1345 relation types, 483K triples) and WN18 (18K entities from WordNet, 18 relation types like "hypernym," "meronym").
The evaluation metric is mean rank (lower is better) and Hits@10 (percentage of correct answers in the top-10 ranked candidates, higher is better) for tail prediction.
| Method | FB15k Mean Rank ↓ | FB15k Hits@10 ↑ | WN18 Mean Rank ↓ | WN18 Hits@10 ↑ |
|---|---|---|---|---|
| Unstructured | 1074 | 4.5% | 315 | 35.3% |
| SE (Structured Embeddings) | 273 | 28.8% | 1011 | 68.5% |
| SME (linear) | 274 | 30.7% | 533 | 74.1% |
| TransE | 125 | 47.1% | 251 | 89.2% |
TransE dramatically outperforms all prior methods. On WN18, Hits@10 improves from 74.1% to 89.2% — nearly 15 percentage points. And TransE is orders of magnitude simpler: Structured Embeddings uses two relation matrices per relation (2×1345 matrices), while TransE uses just one vector per relation. Simpler model, dramatically better performance.
TransE opened a decade-long research program in KG embedding. Every subsequent method is essentially an answer to the question: "how do we keep TransE's simplicity while fixing its structural limitations?"
TransR (Lin et al., 2015): Instead of sharing the same embedding space for entities and relations, TransR projects entities into a relation-specific subspace before computing the translation. Each relation has a projection matrix M_r: h_⊥ = h · M_r. Then the loss is ||h_⊥ + r − t_⊥||. This lets different relations highlight different entity features. It handles 1-to-N relations better but loses the computational simplicity of TransE.
DistMult (Yang et al., 2015): Instead of translation, uses element-wise multiplication: score(h, r, t) = Σ h_i · r_i · t_i. This is bilinear and handles symmetric relations (since h·r·t = t·r·h) but cannot distinguish (h, r, t) from (t, r, h) for asymmetric relations. Simpler than TransE, but more limited.
ComplEx (Trouillon et al., 2016): Extends DistMult to complex numbers. Entities and relations are complex vectors; score = Re(Σ h_i · r_i · t̄_i). Complex numbers allow asymmetric scoring (multiplication in complex space is not commutative), fixing DistMult's symmetry assumption. Handles symmetric + antisymmetric + inverse relations.
RotatE (Sun et al., 2019): Relations as rotations in complex space (|r_i|=1). Handles ALL patterns: symmetric (rotation by π), antisymmetric, inverse (opposite rotation), AND composition (rotations compose). See the RotatE lesson for details.
| Model | Year | Relation type | Symmetric? | 1-to-N? | Composition? |
|---|---|---|---|---|---|
| TransE | 2013 | Translation | No | Partial | Partial |
| TransR | 2015 | Projected translation | No | Yes | Partial |
| DistMult | 2015 | Bilinear | Yes | Yes | No |
| ComplEx | 2016 | Complex bilinear | Yes | Yes | No |
| RotatE | 2019 | Complex rotation | Yes | Yes | Yes |
TransE is the "simplest thing that works" baseline for knowledge graph embedding. Understanding it precisely — including its failures — is the prerequisite for understanding every subsequent KG embedding method.
| Key concept | Where to go deeper |
|---|---|
| Relation as rotation (fixes TransE's symmetric failure) | RotatE |
| Graph neural network embeddings for KGs | GraphSAGE |
| Network embedding (broader) | LINE |
Open problems
Key papers
"Entities should be embedded close to other entities that play similar roles in similar relations."
— Bordes et al. (2013)