Bordes, Usunier, Garcia-Duran, Weston, Yakhnenko (Facebook AI) — NeurIPS 2013

TransE: Relations as
Translations

A knowledge graph contains billions of facts — (Paris, capital-of, France), (Einstein, bornIn, Germany) — but most facts are missing. TransE learns to complete those gaps by treating each relation as a translation vector in embedding space: if you start at the head entity and add the relation vector, you should arrive at the tail entity.

Prerequisites: vectors + basic ML loss functions

Chapters

Simulations

Chapter 0: Knowledge Graph Completion

Freebase has 1.2 billion facts. Wikidata has 100 million. Google's Knowledge Graph has trillions. These databases — called knowledge graphs (KGs) — store facts as triples: (head entity, relation, tail entity). (Barack Obama, bornIn, Hawaii). (Marie Curie, wonPrize, Nobel Prize in Physics).

But they're massively incomplete. An analysis of Freebase found that 71% of people have no recorded place of birth. 75% have no nationality. The knowledge is there — scattered in Wikipedia articles, news archives, books — but not yet structured into triples. The task of knowledge graph completion (also called link prediction) is: given a KG with missing facts, predict which triples are likely true.

The naive approach is a massive boolean matrix: rows = all entities, columns = (relation, entity) pairs, cell = 1 if the fact exists. For 1M entities and 1000 relation types, this matrix has 10¹² entries — mostly zero, completely uncompressible. We need embeddings: dense vectors that capture the patterns.

Why this matters: KG completion feeds downstream AI systems. Question answering ("What country borders Germany to the east?") requires a complete KG. Recommendation engines use knowledge about items. Drug interaction prediction uses biomedical KGs. An incomplete KG propagates uncertainty through every system that consumes it.

What is the knowledge graph completion (link prediction) task?

Given a knowledge graph with missing triples, predict which (head, relation, tail) facts are likely true based on patterns in the existing facts Summarize a knowledge graph into a smaller one with fewer entities but the same information Classify knowledge graph entities into predefined ontology categories

Chapter 1: Translation Intuition — Showcase

TransE's idea is almost absurdly simple. Represent every entity and every relation as a vector in the same d-dimensional space. Then enforce this constraint: for a true fact (h, r, t), the embedding of h plus the embedding of r should approximately equal the embedding of t.

h + r ≈ t

Think of it geometrically: entities are points. Relations are arrows. The arrow for "capital-of" should point from every country toward its capital city. Start at Germany (a vector), add "capital-of" (another vector), and you should land near Berlin (a third vector).

This mirrors the famous Word2Vec analogy: King − Man + Woman ≈ Queen. The same principle — relationships live in vector arithmetic — works surprisingly well for structured facts in knowledge graphs.

Word2Vec analogy revisited: In word embeddings, "Paris" − "France" ≈ "Rome" − "Italy" because the "is capital of" relation is encoded as a consistent vector offset. TransE makes this the explicit training objective: every relation is defined to be a translation, and embeddings are learned to satisfy as many translations simultaneously as possible.

Interactive h + r = t Showcase

Drag the relation vector (orange arrow). Watch where h + r lands relative to t. The loss is the distance between h+r and t.

Relation r_x 0.80

Relation r_y -0.20

In TransE, what geometric role does the relation embedding r play?

A translation vector — adding r to the head entity h should move you to the tail entity t in embedding space A scaling factor — r scales the head entity embedding to match the tail entity magnitude A rotation matrix — r rotates the head entity to align with the tail entity direction

Chapter 2: Scoring Function

To complete the knowledge graph, you need to score candidate triples: "how likely is (h, r, t) to be a true fact?" TransE uses the dissimilarity of h+r from t as the score — lower is better:

f_r(h, t) = || h + r − t ||_L1/L2

If h + r lands exactly at t, the score is 0 — perfect. If they're far apart, the score is large — unlikely fact. The paper evaluates both L1 (Manhattan distance, sum of absolute differences) and L2 (Euclidean distance) norms, with L1 performing slightly better in practice.

At inference time, to answer "what is the capital of Germany?", you compute the score for every entity as a candidate tail: (Germany, capital-of, Berlin), (Germany, capital-of, Paris), (Germany, capital-of, Tokyo), ... and rank them by score. The entity with the lowest score is the predicted answer.

Normalization constraint: TransE enforces ||h||₂ = ||t||₂ = 1 (entity embeddings lie on the unit sphere). Without this, a trivially optimal solution is to make all embeddings zero — distance is 0 everywhere. The unit-norm constraint forces the model to actually encode information. Relation embeddings r are NOT normalized — they can have any magnitude.

python
import torch

def transe_score(h, r, t, norm=1):
    # h, r, t: embedding vectors [batch, dim]
    # Returns dissimilarity score (lower = more likely true fact)
    diff = h + r - t  # [batch, dim]
    return torch.norm(diff, p=norm, dim=-1)  # [batch]

def predict_tail(h, r, all_entities):
    # h: [dim], r: [dim], all_entities: [N, dim]
    candidate = h + r  # [dim]
    dists = torch.norm(all_entities - candidate, p=1, dim=-1)  # [N]
    return torch.argsort(dists)  # ranked entity indices

How does TransE score a candidate triple (h, r, t) during link prediction?

It computes the norm ||h + r - t|| — lower score means the triple is more likely true It computes the dot product h · r · t — higher score means the triple is more likely true It runs the triple through a learned MLP classifier that outputs a probability

Chapter 3: Training with Margin Loss

TransE learns embeddings via a pairwise margin loss: for each true triple (h, r, t), sample a "corrupted" triple (h', r, t') where either h' or t' is a random entity. Then penalize the model when the true triple doesn't score better than the corrupted one by at least a margin γ.

L = ∑_(h,r,t) ∑_(h',r,t') [ γ + f_r(h,t) − f_r(h',t') ]₊

The [·]₊ notation means "max(0, ·)" — only penalize when the bracketed expression is positive. So: loss is zero when f_r(h,t) ≤ f_r(h',t') − γ, i.e., when the true triple scores at least γ better than the corrupted one. γ is the margin hyperparameter (paper uses γ = 1 or 2).

The corrupted triple (h', r, t') replaces either the head or the tail (never the relation) with a random entity from the entity set. This is the KG analogy of negative sampling: the false triple serves as a negative example that the model learns to score poorly.

Sample true triple (h, r, t)

e.g., (Paris, capital-of, France)

↓ corrupt head or tail

Generate corrupted triple (h', r, t')

e.g., (Tokyo, capital-of, France) or (Paris, capital-of, Germany)

↓ compute margin loss

Penalize if f(h,t) + γ > f(h',t')

Push true triple score down, corrupted score up

↻ repeat for all triples

Training detail: After each gradient update, re-normalize entity embeddings to the unit sphere: h ← h / ||h||. This maintains the normalization constraint. Relation embeddings are not normalized. The paper trains with SGD, batch size 120, and evaluates every 100 epochs to select early stopping point.

In TransE's margin loss, what does the margin γ enforce?

That true triples must score at least γ better (lower distance) than corrupted triples — it's a safety buffer preventing near-ties That entity embeddings must have norm exactly equal to γ That each relation must be applied at least γ times before the gradient is computed

Chapter 4: What TransE Can't Do

TransE is elegant, but the translation model has hard structural limits. Three relation patterns are fundamentally impossible to represent:

Symmetric relations: "is married to" is symmetric — if A is married to B, then B is married to A. In TransE, this requires h + r = t AND t + r = h simultaneously, which implies r = 0. But r = 0 means the relation has no information — every entity would be its own "spouse." TransE maps symmetric relations to zero or near-zero vectors, destroying their meaning.

1-to-N (one-to-many) relations: "speaks language" maps one person to many languages. If (Alice, speaks, English) and (Alice, speaks, French), then Alice + speaks ≈ English and Alice + speaks ≈ French simultaneously. The model wants Alice + speaks to be two different points at once — impossible. TransE learns speaks = midpoint between English and French, making the score for both okay but neither great.

Composition: "bornIn" ∘ "locatedIn" should imply "nationality" (born in Paris, Paris located in France → French nationality). TransE can sometimes capture this via vector addition, but it's not guaranteed and breaks down for longer chains.

The translation bottleneck: All three failures stem from the same cause — TransE assigns ONE vector per relation. One vector means one direction of translation. But symmetric relations need to translate in BOTH directions simultaneously, and one-to-many relations need to translate to MULTIPLE destinations. Single vectors can't do this.

Why TransE Fails on Symmetric Relations

Why can't TransE handle symmetric relations like "is married to"?

Symmetric relations require h+r=t AND t+r=h simultaneously, which forces r=0 — a zero relation vector carries no information Symmetric relations create cycles in the graph that cause infinite gradients during backpropagation TransE cannot handle undirected edges because it uses L1 distance which is not symmetric

Chapter 5: Results — FB15k and WN18

TransE is evaluated on two standard benchmarks: FB15k (a 15K-entity subset of Freebase, 1345 relation types, 483K triples) and WN18 (18K entities from WordNet, 18 relation types like "hypernym," "meronym").

The evaluation metric is mean rank (lower is better) and Hits@10 (percentage of correct answers in the top-10 ranked candidates, higher is better) for tail prediction.

Method	FB15k Mean Rank ↓	FB15k Hits@10 ↑	WN18 Mean Rank ↓	WN18 Hits@10 ↑
Unstructured	1074	4.5%	315	35.3%
SE (Structured Embeddings)	273	28.8%	1011	68.5%
SME (linear)	274	30.7%	533	74.1%
TransE	125	47.1%	251	89.2%

TransE dramatically outperforms all prior methods. On WN18, Hits@10 improves from 74.1% to 89.2% — nearly 15 percentage points. And TransE is orders of magnitude simpler: Structured Embeddings uses two relation matrices per relation (2×1345 matrices), while TransE uses just one vector per relation. Simpler model, dramatically better performance.

Why TransE wins so decisively: WN18 has relations like "hypernym" (is-a) and "meronym" (part-of) — these are natural hierarchy relations that map perfectly to translations. Moving "up" the semantic hierarchy corresponds to consistently moving in one direction in embedding space. TransE's inductive bias matches the data's structure.

Why does TransE perform especially well on WN18 (WordNet)?

WordNet's hierarchical relations (hypernym, meronym) are naturally antisymmetric hierarchy patterns that translate well as consistent vector offsets in embedding space WordNet has fewer entities than Freebase, so the model can memorize all triples WordNet relations are all binary (yes/no) which matches TransE's scoring function

Chapter 6: Legacy — TransR, DistMult, RotatE

TransE opened a decade-long research program in KG embedding. Every subsequent method is essentially an answer to the question: "how do we keep TransE's simplicity while fixing its structural limitations?"

TransR (Lin et al., 2015): Instead of sharing the same embedding space for entities and relations, TransR projects entities into a relation-specific subspace before computing the translation. Each relation has a projection matrix M_r: h_⊥ = h · M_r. Then the loss is ||h_⊥ + r − t_⊥||. This lets different relations highlight different entity features. It handles 1-to-N relations better but loses the computational simplicity of TransE.

DistMult (Yang et al., 2015): Instead of translation, uses element-wise multiplication: score(h, r, t) = Σ h_i · r_i · t_i. This is bilinear and handles symmetric relations (since h·r·t = t·r·h) but cannot distinguish (h, r, t) from (t, r, h) for asymmetric relations. Simpler than TransE, but more limited.

ComplEx (Trouillon et al., 2016): Extends DistMult to complex numbers. Entities and relations are complex vectors; score = Re(Σ h_i · r_i · t̄_i). Complex numbers allow asymmetric scoring (multiplication in complex space is not commutative), fixing DistMult's symmetry assumption. Handles symmetric + antisymmetric + inverse relations.

RotatE (Sun et al., 2019): Relations as rotations in complex space (|r_i|=1). Handles ALL patterns: symmetric (rotation by π), antisymmetric, inverse (opposite rotation), AND composition (rotations compose). See the RotatE lesson for details.

Model	Year	Relation type	Symmetric?	1-to-N?	Composition?
TransE	2013	Translation	No	Partial	Partial
TransR	2015	Projected translation	No	Yes	Partial
DistMult	2015	Bilinear	Yes	Yes	No
ComplEx	2016	Complex bilinear	Yes	Yes	No
RotatE	2019	Complex rotation	Yes	Yes	Yes

DistMult handles symmetric relations that TransE cannot. What does DistMult sacrifice in return?

It cannot distinguish (h, r, t) from (t, r, h) for asymmetric relations — its scoring function is inherently symmetric in h and t It requires twice as many parameters because it maintains both a head and tail embedding for every entity It cannot handle composition relations because element-wise products don't compose

Chapter 7: Connections

TransE is the "simplest thing that works" baseline for knowledge graph embedding. Understanding it precisely — including its failures — is the prerequisite for understanding every subsequent KG embedding method.

Key concept	Where to go deeper
Relation as rotation (fixes TransE's symmetric failure)	RotatE
Graph neural network embeddings for KGs	GraphSAGE
Network embedding (broader)	LINE

TransE's conceptual legacy: More than the specific model, TransE established the vocabulary of "relation patterns" — symmetric, antisymmetric, inverse, composition — that every subsequent paper uses to evaluate expressiveness. This taxonomy is TransE's most lasting contribution: it gave the field a way to formally reason about what a KG embedding model can and cannot represent.

Open problems

N-ary relations (beyond triples)
Temporal KGs (facts that expire)
Text-enhanced embeddings
Uncertainty-aware embeddings

Key papers

Bordes et al., NeurIPS 2013 (TransE)
Lin et al., AAAI 2015 (TransR)
Yang et al., ICLR 2015 (DistMult)
Sun et al., ICLR 2019 (RotatE)

"Entities should be embedded close to other entities that play similar roles in similar relations."
— Bordes et al. (2013)

TransE: Relations asTranslations