DistMult — Veanors

Chapter 0: The Problem

Freebase contains 1.2 billion facts. Yet it's missing 93% of the birthplaces and 78% of the nationalities it should have. Knowledge graphs are enormous — and full of holes.

The task is knowledge graph completion: given a set of known triples (head, relation, tail) — like (Marie Curie, born-in, Warsaw) — predict missing ones. Can we infer that Marie Curie won-Nobel-Prize or was-nationality Polish, even though it's not explicitly in the graph?

A knowledge graph is a collection of facts expressed as triples: (h, r, t) where h is the head entity, r is the relation, and t is the tail entity. The graph might have millions of entities and hundreds of relation types. Completing it by hand is impossible.

Knowledge Graph Visualization

A tiny fragment of a real knowledge graph. Missing edges are marked with "?" — these are what we need to predict.

The scale problem: Freebase has 40 million entities and 35,000 relation types. Manually completing even 1% of missing facts would require thousands of human-years. We need a model that learns from existing facts to predict new ones automatically.

Early approaches used symbolic reasoning — if A is-parent-of B and B is-parent-of C, then A is-grandparent-of C. But these hand-crafted rules don't generalize and can't handle uncertainty. We need something that learns the patterns from data.

What does knowledge graph completion mean?

Predicting missing (head, relation, tail) triples from known facts in a knowledge graph Building a larger knowledge graph by scraping the web Removing incorrect facts from a knowledge graph

Chapter 1: The Bilinear Model

The core idea of embedding-based KG completion is simple: map every entity and relation to a low-dimensional vector, then define a scoring function that assigns a high score to true triples and a low score to false ones.

The most general bilinear scoring function for a triple (h, r, t) is:

f_r(h, t) = h^T W_r t

Here h and t are embedding vectors for the head and tail entities, and W_r is a matrix specific to relation r. This is the full bilinear model — "bilinear" because the score is linear in both h and t separately.

Why bilinear? A bilinear product h^TW_rt can capture any pairwise interaction between every dimension of h and every dimension of t, mediated by the relation. It's expressive enough to model "does this head-tail pair fit the relation?" without being so complex it can't be trained.

The problem with the full bilinear model is scale. If entity embeddings have dimension d, then W_r has d² parameters — one matrix per relation. With thousands of relations, this explodes. DistMult's insight is to restrict W_r to be a diagonal matrix.

When W_r = diag(r) for a vector r, the bilinear product simplifies dramatically:

h^T diag(r) t = ∑_i h_i r_i t_i

This is the DistMult scoring function. Instead of d² parameters per relation, we now have just d — a 10,000x reduction for typical embedding sizes.

Why does DistMult restrict W_r to be a diagonal matrix?

To reduce the number of parameters per relation from d² to d, making the model scalable to thousands of relation types Because diagonal matrices are easier to invert To enforce symmetry in the embeddings

Chapter 2: The Scoring Function

The DistMult scoring function is deceptively simple:

f_r(h, t) = ⟨h, r, t⟩ = ∑_i=1^d h_i r_i t_i

Think of it as a three-way dot product: for each dimension i, multiply the head, relation, and tail values together, then sum. This is also called a Hadamard product followed by summation.

Intuitively: if h_i and t_i are both large (say, both entities are strongly "European"), and r_i is also large for dimension i (dimension i is relevant to this relation), the contribution to the score is high. The relation vector r acts as a selector — emphasizing the dimensions that matter for this relation type.

Interactive Dot Product Demo

Drag the sliders to set 4D embeddings for head (h), relation (r), and tail (t). Watch how each dimension contributes to the final score ⟨h, r, t⟩.

h: [0.8, 0.3, -0.5, 0.6]

r: [0.9, 0.1, 0.7, -0.2]

t: [0.7, -0.4, -0.6, 0.5]

How to read the score: A high positive score means the model believes (h, r, t) is a valid fact. A score near zero or negative means it's probably false. After training, the model learns embeddings so that known true triples score high. We then rank all candidate tails for unseen queries and report which position the true answer appears at.

From score to prediction

For a query like (Marie Curie, won-prize, ?), we compute f_r(h, t) for every possible entity t, sort by score, and report the top-k. If the correct answer (Nobel Prize) appears in the top 10, that's a "hit@10."

In the DistMult score ⟨h, r, t⟩, what role does the relation vector r play?

It acts as a per-dimension selector, amplifying dimensions that are important for this relation type and suppressing irrelevant ones It rotates the head embedding to align with the tail embedding It sets a threshold above which triples are considered true

Chapter 3: The Symmetry Property

Here's a key property hidden inside the DistMult formula. What happens if we swap h and t?

⟨h, r, t⟩ = ∑_i h_i r_i t_i = ∑_i t_i r_i h_i = ⟨t, r, h⟩

Swapping head and tail leaves the score unchanged. DistMult is symmetric by construction: for any relation r, f_r(h, t) = f_r(t, h). Always.

What this means in practice: DistMult cannot distinguish between (A, friend-of, B) and (B, friend-of, A) — which is fine for "friend-of" (truly symmetric). But it gives the same score to (France, capital-of, Paris) and (Paris, capital-of, France) — which is very wrong. Capital-of is antisymmetric: one is true, the other is false. DistMult assigns them identical scores and cannot distinguish them.

Relations in knowledge graphs come in three flavors:

Symmetric relations
DistMult handles these well.
Examples: similar-to, married-to (in some datasets), synonym-of

Antisymmetric relations
DistMult fails completely.
Examples: born-in, capital-of, parent-of, employed-by

Most real-world relations are not symmetric. This is the fundamental limitation of DistMult — and it's precisely what ComplEx (the next paper) is designed to fix.

Why can't DistMult handle antisymmetric relations like "parent-of"?

Because ⟨h, r, t⟩ = ⟨t, r, h⟩ always — swapping head and tail never changes the score, so DistMult gives (A, parent-of, B) and (B, parent-of, A) identical scores Because antisymmetric relations require larger embeddings Because DistMult does not learn relation embeddings

Chapter 4: Training

DistMult is trained with negative sampling. For each true triple (h, r, t), we generate corrupted triples by replacing either h or t with a random entity. We then maximize the score of true triples and minimize the score of corrupted ones.

The loss function is a pairwise ranking loss (or equivalently, a softmax cross-entropy over all possible entities):

L = ∑_{(h,r,t) ∈ Τ} log σ( f_r(h, t) ) + ∑_j=1^N log σ( −f_r(h_j', t) )

Where (h_j', t) are negative samples with random head entities. σ is the sigmoid function. The model learns to assign high scores to true triples and near-zero (post-sigmoid) scores to false ones.

Negative sampling as implicit regularization: You can't use all possible corrupted triples (millions of entities × millions of relations × millions of entities). Instead, sample a few negatives per positive. This is essentially noise-contrastive estimation — the model learns not just "what is true" but "true compared to what?"

Regularization

Embeddings are L2-regularized to prevent them from growing unboundedly (large embeddings could always produce high scores). The loss becomes:

L_total = L_ranking + λ( ||h||₂² + ||r||₂² + ||t||₂² )

python
import torch
import torch.nn as nn

class DistMult(nn.Module):
    def __init__(self, n_entities, n_relations, dim):
        super().__init__()
        self.entity_emb  = nn.Embedding(n_entities, dim)
        self.relation_emb = nn.Embedding(n_relations, dim)
        nn.init.xavier_uniform_(self.entity_emb.weight)
        nn.init.xavier_uniform_(self.relation_emb.weight)

    def score(self, h_idx, r_idx, t_idx):
        h = self.entity_emb(h_idx)   # (batch, dim)
        r = self.relation_emb(r_idx)  # (batch, dim)
        t = self.entity_emb(t_idx)   # (batch, dim)
        return (h * r * t).sum(dim=-1)  # (batch,)

    def forward(self, pos_h, pos_r, pos_t, neg_h, neg_r, neg_t):
        pos_score = self.score(pos_h, pos_r, pos_t)
        neg_score = self.score(neg_h, neg_r, neg_t)
        loss = -torch.log(torch.sigmoid(pos_score)).mean()
        loss += -torch.log(torch.sigmoid(-neg_score)).mean()
        return loss

What does negative sampling do during DistMult training?

Creates corrupted triples (with random entities) to train the model to score true triples higher than false ones Removes rare triples from the training set to reduce noise Adds negation relations to the knowledge graph

Chapter 5: Results

Yang et al. evaluated DistMult on two standard benchmarks: WordNet (WN18) — a lexical database with 18 relation types like hypernym, hyponym, and member-of — and Freebase (FB15k) — a curated subset with 1,345 relation types across 14,951 entities.

The primary metric is Hits@10: for each test triple, rank all possible tail entities by score. What fraction of the time does the true tail appear in the top 10? Higher is better.

Model	WN18 Hits@10	FB15k Hits@10	Parameters
Unstructured	38.2%	6.3%	O(n·d)
TransE	89.2%	47.1%	O((n+m)·d)
RESCAL	52.8%	44.1%	O(n·d + m·d²)
DistMult	94.2%	57.7%	O((n+m)·d)

The key takeaway: DistMult outperforms all prior methods despite being dramatically simpler than RESCAL (which uses full d×d matrices). On WN18, it achieves 94.2% Hits@10 — meaning for 94% of test questions, the correct answer appears in the model's top 10 guesses out of 40,943 possible entities. This is not a coincidence: WN18 relations are largely symmetric (hypernym chains, synonyms), which is exactly what DistMult handles well.

On FB15k, DistMult's 57.7% also beats TransE's 47.1%. But there's a catch: FB15k contains many inverse relations (e.g., both "capital-of" and "is-capital") that artificially inflate scores. Later work (Toutanova and Chen, 2015) showed DistMult's advantage partly comes from exploiting these inverse pairs.

Why does DistMult perform especially well on WordNet (WN18)?

WN18 relations (hypernyms, synonyms) are largely symmetric, which matches DistMult's symmetric scoring function WN18 has fewer entities, making the problem easier DistMult uses a larger embedding dimension on WN18

Chapter 6: DistMult vs TransE

TransE and DistMult represent two fundamentally different philosophies for modeling knowledge graph relations, and understanding the difference is crucial for knowing when to use each.

TransE (Bordes et al., 2013) models relations as translations in embedding space. A true triple (h, r, t) should satisfy h + r ≈ t. The scoring function is the negative distance: f_r(h, t) = −||h + r − t||.

Property	TransE	DistMult
Scoring	−\|\|h + r − t\|\|	⟨h, r, t⟩
Relation params	d per relation	d per relation
Symmetric relations	Struggles	Natural fit
Antisymmetric relations	Handles well	Cannot model
Geometric intuition	Translation	Feature selection
WN18 Hits@10	89.2%	94.2%
FB15k Hits@10	47.1%	57.7%

The translation intuition: In TransE, if (Paris, capital-of, France) is true, then vec(Paris) + vec(capital-of) ≈ vec(France). This forces Paris and France to be "near" each other in embedding space (shifted by the relation vector). TransE naturally handles antisymmetric relations: h + r ≈ t does NOT imply t + r ≈ h (it implies t + r ≈ h + 2r). But TransE fails on symmetric: if h + r ≈ t AND t + r ≈ h, then adding gives 2r ≈ 0, forcing r to be the zero vector.

This complementarity is not accidental — it points toward a deeper truth. You need a model expressive enough to handle both symmetric and antisymmetric relations. That's exactly what ComplEx provides, by extending DistMult to complex vector spaces.

Why does TransE fail on symmetric relations like "similar-to"?

If h + r ≈ t AND t + r ≈ h both hold, adding them gives 2r ≈ 0, forcing the relation vector to be zero — it collapses and carries no information TransE uses too few dimensions for symmetric relations Symmetric relations require multiplication, not addition

Chapter 7: Connections

DistMult is a foundational paper in knowledge graph embeddings — simple enough to analyze mathematically, strong enough to anchor years of subsequent research. Here's where it fits in the broader landscape.

Model	Key Idea	DistMult Relation
RESCAL	Full bilinear W_r matrix	DistMult = diagonal RESCAL
TransE	Translation h + r ≈ t	Different family entirely
ComplEx	Complex-valued DistMult	Fixes symmetry limitation
RotatE	Rotation in complex space	Inspired by ComplEx/DistMult
TuckER	Tucker decomposition	Generalization of DistMult
R-GCN	Graph convolution + DistMult decoder	Uses DistMult as scoring head

DistMult as tensor decomposition: A knowledge graph can be seen as a 3D binary tensor X where X[h, r, t] = 1 if the triple is true. DistMult is exactly a rank-d CP decomposition of this tensor. The entity and relation embeddings are the factor matrices. This connection to tensor decomposition theory gives DistMult theoretical grounding and explains why it works as well as it does.

Limitations and when NOT to use DistMult

Antisymmetric relations: Use ComplEx or RotatE instead.
Compositional patterns: (A parent-of B, B parent-of C → A grandparent-of C). DistMult cannot model reasoning chains. Use path-based models or R-GCN.
Inductive setting: New entities at test time (not seen during training). DistMult has no mechanism — use GNN-based encoders.

Closing insight: DistMult's power comes from its simplicity. By restricting W_r to a diagonal, we get a model that's: fast to train (O(d) per triple instead of O(d²)), easy to regularize, and interpretable (dimension i of the relation vector r literally weights "how much does dimension i of the embeddings matter for this relation?"). Simplicity is not weakness — it's a design choice that reveals what the minimum necessary structure is.

DistMult: Bilinear Knowledge Graph Embeddings