Yang, Yih, He, Gao, Deng — ICLR 2015 · arXiv 1412.6575

DistMult: Bilinear Knowledge Graph Embeddings

A beautifully simple idea: score a knowledge graph triple by the element-wise product of head, relation, and tail vectors. Symmetric by construction, surprisingly powerful in practice.

Prerequisites: dot products + basic embeddings. That's it.
8
Chapters
3+
Simulations
0
Assumed Knowledge

Chapter 0: The Problem

Freebase contains 1.2 billion facts. Yet it's missing 93% of the birthplaces and 78% of the nationalities it should have. Knowledge graphs are enormous — and full of holes.

The task is knowledge graph completion: given a set of known triples (head, relation, tail) — like (Marie Curie, born-in, Warsaw) — predict missing ones. Can we infer that Marie Curie won-Nobel-Prize or was-nationality Polish, even though it's not explicitly in the graph?

A knowledge graph is a collection of facts expressed as triples: (h, r, t) where h is the head entity, r is the relation, and t is the tail entity. The graph might have millions of entities and hundreds of relation types. Completing it by hand is impossible.

Knowledge Graph Visualization

A tiny fragment of a real knowledge graph. Missing edges are marked with "?" — these are what we need to predict.

The scale problem: Freebase has 40 million entities and 35,000 relation types. Manually completing even 1% of missing facts would require thousands of human-years. We need a model that learns from existing facts to predict new ones automatically.

Early approaches used symbolic reasoning — if A is-parent-of B and B is-parent-of C, then A is-grandparent-of C. But these hand-crafted rules don't generalize and can't handle uncertainty. We need something that learns the patterns from data.

What does knowledge graph completion mean?

Chapter 1: The Bilinear Model

The core idea of embedding-based KG completion is simple: map every entity and relation to a low-dimensional vector, then define a scoring function that assigns a high score to true triples and a low score to false ones.

The most general bilinear scoring function for a triple (h, r, t) is:

fr(h, t) = hT Wr t

Here h and t are embedding vectors for the head and tail entities, and Wr is a matrix specific to relation r. This is the full bilinear model — "bilinear" because the score is linear in both h and t separately.

Why bilinear? A bilinear product hTWrt can capture any pairwise interaction between every dimension of h and every dimension of t, mediated by the relation. It's expressive enough to model "does this head-tail pair fit the relation?" without being so complex it can't be trained.

The problem with the full bilinear model is scale. If entity embeddings have dimension d, then Wr has d² parameters — one matrix per relation. With thousands of relations, this explodes. DistMult's insight is to restrict Wr to be a diagonal matrix.

When Wr = diag(r) for a vector r, the bilinear product simplifies dramatically:

hT diag(r) t = ∑i hi ri ti

This is the DistMult scoring function. Instead of d² parameters per relation, we now have just d — a 10,000x reduction for typical embedding sizes.

Why does DistMult restrict W_r to be a diagonal matrix?

Chapter 2: The Scoring Function

The DistMult scoring function is deceptively simple:

fr(h, t) = ⟨h, r, t⟩ = ∑i=1d hi ri ti

Think of it as a three-way dot product: for each dimension i, multiply the head, relation, and tail values together, then sum. This is also called a Hadamard product followed by summation.

Intuitively: if hi and ti are both large (say, both entities are strongly "European"), and ri is also large for dimension i (dimension i is relevant to this relation), the contribution to the score is high. The relation vector r acts as a selector — emphasizing the dimensions that matter for this relation type.

Interactive Dot Product Demo

Drag the sliders to set 4D embeddings for head (h), relation (r), and tail (t). Watch how each dimension contributes to the final score ⟨h, r, t⟩.

h: [0.8, 0.3, -0.5, 0.6]
r: [0.9, 0.1, 0.7, -0.2]
t: [0.7, -0.4, -0.6, 0.5]
How to read the score: A high positive score means the model believes (h, r, t) is a valid fact. A score near zero or negative means it's probably false. After training, the model learns embeddings so that known true triples score high. We then rank all candidate tails for unseen queries and report which position the true answer appears at.

From score to prediction

For a query like (Marie Curie, won-prize, ?), we compute fr(h, t) for every possible entity t, sort by score, and report the top-k. If the correct answer (Nobel Prize) appears in the top 10, that's a "hit@10."

In the DistMult score ⟨h, r, t⟩, what role does the relation vector r play?

Chapter 3: The Symmetry Property

Here's a key property hidden inside the DistMult formula. What happens if we swap h and t?

h, r, t⟩ = ∑i hi ri ti = ∑i ti ri hi = ⟨t, r, h

Swapping head and tail leaves the score unchanged. DistMult is symmetric by construction: for any relation r, fr(h, t) = fr(t, h). Always.

What this means in practice: DistMult cannot distinguish between (A, friend-of, B) and (B, friend-of, A) — which is fine for "friend-of" (truly symmetric). But it gives the same score to (France, capital-of, Paris) and (Paris, capital-of, France) — which is very wrong. Capital-of is antisymmetric: one is true, the other is false. DistMult assigns them identical scores and cannot distinguish them.

Relations in knowledge graphs come in three flavors:

Symmetric relations
DistMult handles these well.
Examples: similar-to, married-to (in some datasets), synonym-of
Antisymmetric relations
DistMult fails completely.
Examples: born-in, capital-of, parent-of, employed-by

Most real-world relations are not symmetric. This is the fundamental limitation of DistMult — and it's precisely what ComplEx (the next paper) is designed to fix.

Why can't DistMult handle antisymmetric relations like "parent-of"?

Chapter 4: Training

DistMult is trained with negative sampling. For each true triple (h, r, t), we generate corrupted triples by replacing either h or t with a random entity. We then maximize the score of true triples and minimize the score of corrupted ones.

The loss function is a pairwise ranking loss (or equivalently, a softmax cross-entropy over all possible entities):

L = ∑(h,r,t) ∈ Τ log σ( fr(h, t) ) + ∑j=1N log σ( −fr(hj', t) )

Where (hj', t) are negative samples with random head entities. σ is the sigmoid function. The model learns to assign high scores to true triples and near-zero (post-sigmoid) scores to false ones.

Negative sampling as implicit regularization: You can't use all possible corrupted triples (millions of entities × millions of relations × millions of entities). Instead, sample a few negatives per positive. This is essentially noise-contrastive estimation — the model learns not just "what is true" but "true compared to what?"

Regularization

Embeddings are L2-regularized to prevent them from growing unboundedly (large embeddings could always produce high scores). The loss becomes:

Ltotal = Lranking + λ( ||h||22 + ||r||22 + ||t||22 )
python
import torch
import torch.nn as nn

class DistMult(nn.Module):
    def __init__(self, n_entities, n_relations, dim):
        super().__init__()
        self.entity_emb  = nn.Embedding(n_entities, dim)
        self.relation_emb = nn.Embedding(n_relations, dim)
        nn.init.xavier_uniform_(self.entity_emb.weight)
        nn.init.xavier_uniform_(self.relation_emb.weight)

    def score(self, h_idx, r_idx, t_idx):
        h = self.entity_emb(h_idx)   # (batch, dim)
        r = self.relation_emb(r_idx)  # (batch, dim)
        t = self.entity_emb(t_idx)   # (batch, dim)
        return (h * r * t).sum(dim=-1)  # (batch,)

    def forward(self, pos_h, pos_r, pos_t, neg_h, neg_r, neg_t):
        pos_score = self.score(pos_h, pos_r, pos_t)
        neg_score = self.score(neg_h, neg_r, neg_t)
        loss = -torch.log(torch.sigmoid(pos_score)).mean()
        loss += -torch.log(torch.sigmoid(-neg_score)).mean()
        return loss
What does negative sampling do during DistMult training?

Chapter 5: Results

Yang et al. evaluated DistMult on two standard benchmarks: WordNet (WN18) — a lexical database with 18 relation types like hypernym, hyponym, and member-of — and Freebase (FB15k) — a curated subset with 1,345 relation types across 14,951 entities.

The primary metric is Hits@10: for each test triple, rank all possible tail entities by score. What fraction of the time does the true tail appear in the top 10? Higher is better.

ModelWN18 Hits@10FB15k Hits@10Parameters
Unstructured38.2%6.3%O(n·d)
TransE89.2%47.1%O((n+m)·d)
RESCAL52.8%44.1%O(n·d + m·d²)
DistMult94.2%57.7%O((n+m)·d)
The key takeaway: DistMult outperforms all prior methods despite being dramatically simpler than RESCAL (which uses full d×d matrices). On WN18, it achieves 94.2% Hits@10 — meaning for 94% of test questions, the correct answer appears in the model's top 10 guesses out of 40,943 possible entities. This is not a coincidence: WN18 relations are largely symmetric (hypernym chains, synonyms), which is exactly what DistMult handles well.

On FB15k, DistMult's 57.7% also beats TransE's 47.1%. But there's a catch: FB15k contains many inverse relations (e.g., both "capital-of" and "is-capital") that artificially inflate scores. Later work (Toutanova and Chen, 2015) showed DistMult's advantage partly comes from exploiting these inverse pairs.

Why does DistMult perform especially well on WordNet (WN18)?

Chapter 6: DistMult vs TransE

TransE and DistMult represent two fundamentally different philosophies for modeling knowledge graph relations, and understanding the difference is crucial for knowing when to use each.

TransE (Bordes et al., 2013) models relations as translations in embedding space. A true triple (h, r, t) should satisfy h + rt. The scoring function is the negative distance: fr(h, t) = −||h + rt||.

PropertyTransEDistMult
Scoring−||h + r − t||⟨h, r, t⟩
Relation paramsd per relationd per relation
Symmetric relationsStrugglesNatural fit
Antisymmetric relationsHandles wellCannot model
Geometric intuitionTranslationFeature selection
WN18 Hits@1089.2%94.2%
FB15k Hits@1047.1%57.7%
The translation intuition: In TransE, if (Paris, capital-of, France) is true, then vec(Paris) + vec(capital-of) ≈ vec(France). This forces Paris and France to be "near" each other in embedding space (shifted by the relation vector). TransE naturally handles antisymmetric relations: h + r ≈ t does NOT imply t + r ≈ h (it implies t + r ≈ h + 2r). But TransE fails on symmetric: if h + r ≈ t AND t + r ≈ h, then adding gives 2r ≈ 0, forcing r to be the zero vector.

This complementarity is not accidental — it points toward a deeper truth. You need a model expressive enough to handle both symmetric and antisymmetric relations. That's exactly what ComplEx provides, by extending DistMult to complex vector spaces.

Why does TransE fail on symmetric relations like "similar-to"?

Chapter 7: Connections

DistMult is a foundational paper in knowledge graph embeddings — simple enough to analyze mathematically, strong enough to anchor years of subsequent research. Here's where it fits in the broader landscape.

ModelKey IdeaDistMult Relation
RESCALFull bilinear W_r matrixDistMult = diagonal RESCAL
TransETranslation h + r ≈ tDifferent family entirely
ComplExComplex-valued DistMultFixes symmetry limitation
RotatERotation in complex spaceInspired by ComplEx/DistMult
TuckERTucker decompositionGeneralization of DistMult
R-GCNGraph convolution + DistMult decoderUses DistMult as scoring head
DistMult as tensor decomposition: A knowledge graph can be seen as a 3D binary tensor X where X[h, r, t] = 1 if the triple is true. DistMult is exactly a rank-d CP decomposition of this tensor. The entity and relation embeddings are the factor matrices. This connection to tensor decomposition theory gives DistMult theoretical grounding and explains why it works as well as it does.

Limitations and when NOT to use DistMult

Closing insight: DistMult's power comes from its simplicity. By restricting W_r to a diagonal, we get a model that's: fast to train (O(d) per triple instead of O(d²)), easy to regularize, and interpretable (dimension i of the relation vector r literally weights "how much does dimension i of the embeddings matter for this relation?"). Simplicity is not weakness — it's a design choice that reveals what the minimum necessary structure is.

Further reading: