Bordes, Usunier, Garcia-Duran, Weston, Yakhnenko (Facebook AI) — NeurIPS 2013

TransE: Relations as
Translations

A knowledge graph contains billions of facts — (Paris, capital-of, France), (Einstein, bornIn, Germany) — but most facts are missing. TransE learns to complete those gaps by treating each relation as a translation vector in embedding space: if you start at the head entity and add the relation vector, you should arrive at the tail entity.

Prerequisites: vectors + basic ML loss functions
8
Chapters
4+
Simulations

Chapter 0: Knowledge Graph Completion

Freebase has 1.2 billion facts. Wikidata has 100 million. Google's Knowledge Graph has trillions. These databases — called knowledge graphs (KGs) — store facts as triples: (head entity, relation, tail entity). (Barack Obama, bornIn, Hawaii). (Marie Curie, wonPrize, Nobel Prize in Physics).

But they're massively incomplete. An analysis of Freebase found that 71% of people have no recorded place of birth. 75% have no nationality. The knowledge is there — scattered in Wikipedia articles, news archives, books — but not yet structured into triples. The task of knowledge graph completion (also called link prediction) is: given a KG with missing facts, predict which triples are likely true.

The naive approach is a massive boolean matrix: rows = all entities, columns = (relation, entity) pairs, cell = 1 if the fact exists. For 1M entities and 1000 relation types, this matrix has 10¹² entries — mostly zero, completely uncompressible. We need embeddings: dense vectors that capture the patterns.

Why this matters: KG completion feeds downstream AI systems. Question answering ("What country borders Germany to the east?") requires a complete KG. Recommendation engines use knowledge about items. Drug interaction prediction uses biomedical KGs. An incomplete KG propagates uncertainty through every system that consumes it.
What is the knowledge graph completion (link prediction) task?

Chapter 1: Translation Intuition — Showcase

TransE's idea is almost absurdly simple. Represent every entity and every relation as a vector in the same d-dimensional space. Then enforce this constraint: for a true fact (h, r, t), the embedding of h plus the embedding of r should approximately equal the embedding of t.

h + r ≈ t

Think of it geometrically: entities are points. Relations are arrows. The arrow for "capital-of" should point from every country toward its capital city. Start at Germany (a vector), add "capital-of" (another vector), and you should land near Berlin (a third vector).

This mirrors the famous Word2Vec analogy: King − Man + Woman ≈ Queen. The same principle — relationships live in vector arithmetic — works surprisingly well for structured facts in knowledge graphs.

Word2Vec analogy revisited: In word embeddings, "Paris" − "France" ≈ "Rome" − "Italy" because the "is capital of" relation is encoded as a consistent vector offset. TransE makes this the explicit training objective: every relation is defined to be a translation, and embeddings are learned to satisfy as many translations simultaneously as possible.
Interactive h + r = t Showcase

Drag the relation vector (orange arrow). Watch where h + r lands relative to t. The loss is the distance between h+r and t.

Relation r_x 0.80
Relation r_y -0.20
In TransE, what geometric role does the relation embedding r play?

Chapter 2: Scoring Function

To complete the knowledge graph, you need to score candidate triples: "how likely is (h, r, t) to be a true fact?" TransE uses the dissimilarity of h+r from t as the score — lower is better:

fr(h, t) = || h + r − t ||L1/L2

If h + r lands exactly at t, the score is 0 — perfect. If they're far apart, the score is large — unlikely fact. The paper evaluates both L1 (Manhattan distance, sum of absolute differences) and L2 (Euclidean distance) norms, with L1 performing slightly better in practice.

At inference time, to answer "what is the capital of Germany?", you compute the score for every entity as a candidate tail: (Germany, capital-of, Berlin), (Germany, capital-of, Paris), (Germany, capital-of, Tokyo), ... and rank them by score. The entity with the lowest score is the predicted answer.

Normalization constraint: TransE enforces ||h||₂ = ||t||₂ = 1 (entity embeddings lie on the unit sphere). Without this, a trivially optimal solution is to make all embeddings zero — distance is 0 everywhere. The unit-norm constraint forces the model to actually encode information. Relation embeddings r are NOT normalized — they can have any magnitude.
python
import torch

def transe_score(h, r, t, norm=1):
    # h, r, t: embedding vectors [batch, dim]
    # Returns dissimilarity score (lower = more likely true fact)
    diff = h + r - t  # [batch, dim]
    return torch.norm(diff, p=norm, dim=-1)  # [batch]

def predict_tail(h, r, all_entities):
    # h: [dim], r: [dim], all_entities: [N, dim]
    candidate = h + r  # [dim]
    dists = torch.norm(all_entities - candidate, p=1, dim=-1)  # [N]
    return torch.argsort(dists)  # ranked entity indices
How does TransE score a candidate triple (h, r, t) during link prediction?

Chapter 3: Training with Margin Loss

TransE learns embeddings via a pairwise margin loss: for each true triple (h, r, t), sample a "corrupted" triple (h', r, t') where either h' or t' is a random entity. Then penalize the model when the true triple doesn't score better than the corrupted one by at least a margin γ.

L = ∑(h,r,t)(h',r,t') [ γ + fr(h,t) − fr(h',t') ]+

The [·]₊ notation means "max(0, ·)" — only penalize when the bracketed expression is positive. So: loss is zero when f_r(h,t) ≤ f_r(h',t') − γ, i.e., when the true triple scores at least γ better than the corrupted one. γ is the margin hyperparameter (paper uses γ = 1 or 2).

The corrupted triple (h', r, t') replaces either the head or the tail (never the relation) with a random entity from the entity set. This is the KG analogy of negative sampling: the false triple serves as a negative example that the model learns to score poorly.

Sample true triple (h, r, t)
e.g., (Paris, capital-of, France)
↓ corrupt head or tail
Generate corrupted triple (h', r, t')
e.g., (Tokyo, capital-of, France) or (Paris, capital-of, Germany)
↓ compute margin loss
Penalize if f(h,t) + γ > f(h',t')
Push true triple score down, corrupted score up
↻ repeat for all triples
Training detail: After each gradient update, re-normalize entity embeddings to the unit sphere: h ← h / ||h||. This maintains the normalization constraint. Relation embeddings are not normalized. The paper trains with SGD, batch size 120, and evaluates every 100 epochs to select early stopping point.
In TransE's margin loss, what does the margin γ enforce?

Chapter 4: What TransE Can't Do

TransE is elegant, but the translation model has hard structural limits. Three relation patterns are fundamentally impossible to represent:

Symmetric relations: "is married to" is symmetric — if A is married to B, then B is married to A. In TransE, this requires h + r = t AND t + r = h simultaneously, which implies r = 0. But r = 0 means the relation has no information — every entity would be its own "spouse." TransE maps symmetric relations to zero or near-zero vectors, destroying their meaning.

1-to-N (one-to-many) relations: "speaks language" maps one person to many languages. If (Alice, speaks, English) and (Alice, speaks, French), then Alice + speaks ≈ English and Alice + speaks ≈ French simultaneously. The model wants Alice + speaks to be two different points at once — impossible. TransE learns speaks = midpoint between English and French, making the score for both okay but neither great.

Composition: "bornIn" ∘ "locatedIn" should imply "nationality" (born in Paris, Paris located in France → French nationality). TransE can sometimes capture this via vector addition, but it's not guaranteed and breaks down for longer chains.

The translation bottleneck: All three failures stem from the same cause — TransE assigns ONE vector per relation. One vector means one direction of translation. But symmetric relations need to translate in BOTH directions simultaneously, and one-to-many relations need to translate to MULTIPLE destinations. Single vectors can't do this.
Why TransE Fails on Symmetric Relations
Why can't TransE handle symmetric relations like "is married to"?

Chapter 5: Results — FB15k and WN18

TransE is evaluated on two standard benchmarks: FB15k (a 15K-entity subset of Freebase, 1345 relation types, 483K triples) and WN18 (18K entities from WordNet, 18 relation types like "hypernym," "meronym").

The evaluation metric is mean rank (lower is better) and Hits@10 (percentage of correct answers in the top-10 ranked candidates, higher is better) for tail prediction.

MethodFB15k Mean Rank ↓FB15k Hits@10 ↑WN18 Mean Rank ↓WN18 Hits@10 ↑
Unstructured10744.5%31535.3%
SE (Structured Embeddings)27328.8%101168.5%
SME (linear)27430.7%53374.1%
TransE12547.1%25189.2%

TransE dramatically outperforms all prior methods. On WN18, Hits@10 improves from 74.1% to 89.2% — nearly 15 percentage points. And TransE is orders of magnitude simpler: Structured Embeddings uses two relation matrices per relation (2×1345 matrices), while TransE uses just one vector per relation. Simpler model, dramatically better performance.

Why TransE wins so decisively: WN18 has relations like "hypernym" (is-a) and "meronym" (part-of) — these are natural hierarchy relations that map perfectly to translations. Moving "up" the semantic hierarchy corresponds to consistently moving in one direction in embedding space. TransE's inductive bias matches the data's structure.
Why does TransE perform especially well on WN18 (WordNet)?

Chapter 6: Legacy — TransR, DistMult, RotatE

TransE opened a decade-long research program in KG embedding. Every subsequent method is essentially an answer to the question: "how do we keep TransE's simplicity while fixing its structural limitations?"

TransR (Lin et al., 2015): Instead of sharing the same embedding space for entities and relations, TransR projects entities into a relation-specific subspace before computing the translation. Each relation has a projection matrix M_r: h_⊥ = h · M_r. Then the loss is ||h_⊥ + r − t_⊥||. This lets different relations highlight different entity features. It handles 1-to-N relations better but loses the computational simplicity of TransE.

DistMult (Yang et al., 2015): Instead of translation, uses element-wise multiplication: score(h, r, t) = Σ h_i · r_i · t_i. This is bilinear and handles symmetric relations (since h·r·t = t·r·h) but cannot distinguish (h, r, t) from (t, r, h) for asymmetric relations. Simpler than TransE, but more limited.

ComplEx (Trouillon et al., 2016): Extends DistMult to complex numbers. Entities and relations are complex vectors; score = Re(Σ h_i · r_i · t̄_i). Complex numbers allow asymmetric scoring (multiplication in complex space is not commutative), fixing DistMult's symmetry assumption. Handles symmetric + antisymmetric + inverse relations.

RotatE (Sun et al., 2019): Relations as rotations in complex space (|r_i|=1). Handles ALL patterns: symmetric (rotation by π), antisymmetric, inverse (opposite rotation), AND composition (rotations compose). See the RotatE lesson for details.

ModelYearRelation typeSymmetric?1-to-N?Composition?
TransE2013TranslationNoPartialPartial
TransR2015Projected translationNoYesPartial
DistMult2015BilinearYesYesNo
ComplEx2016Complex bilinearYesYesNo
RotatE2019Complex rotationYesYesYes
DistMult handles symmetric relations that TransE cannot. What does DistMult sacrifice in return?

Chapter 7: Connections

TransE is the "simplest thing that works" baseline for knowledge graph embedding. Understanding it precisely — including its failures — is the prerequisite for understanding every subsequent KG embedding method.

Key conceptWhere to go deeper
Relation as rotation (fixes TransE's symmetric failure)RotatE
Graph neural network embeddings for KGsGraphSAGE
Network embedding (broader)LINE
TransE's conceptual legacy: More than the specific model, TransE established the vocabulary of "relation patterns" — symmetric, antisymmetric, inverse, composition — that every subsequent paper uses to evaluate expressiveness. This taxonomy is TransE's most lasting contribution: it gave the field a way to formally reason about what a KG embedding model can and cannot represent.

Open problems

  • N-ary relations (beyond triples)
  • Temporal KGs (facts that expire)
  • Text-enhanced embeddings
  • Uncertainty-aware embeddings

Key papers

  • Bordes et al., NeurIPS 2013 (TransE)
  • Lin et al., AAAI 2015 (TransR)
  • Yang et al., ICLR 2015 (DistMult)
  • Sun et al., ICLR 2019 (RotatE)
"Entities should be embedded close to other entities that play similar roles in similar relations."
— Bordes et al. (2013)