Schlichtkrull et al. — Vrije Universiteit Amsterdam, ESWC 2018

R-GCN: Relational Graph Convolutional Networks

One weight matrix per relation type — the minimal extension that makes GCNs useful on heterogeneous knowledge graphs with typed edges.

Prerequisites: GCN (Kipf & Welling) + Knowledge graphs (entities, relations) + Linear algebra basics
8
Chapters
4+
Simulations

Chapter 0: The Problem

A knowledge graph (KG) is a massive network of facts. Wikidata has ~100 million nodes (entities) and ~800 million edges (relations). Each edge has a type: "Barack Obama bornIn Hawaii", "Hawaii partOf USA", "USA hasCapital Washington D.C." These relations are heterogeneous — they mean completely different things.

Standard GCN treats all edges the same. It aggregates over all neighbors with one weight matrix W. But "bornIn" and "hasCapital" are fundamentally different relationships. Aggregating them with the same W conflates information that should be treated differently.

The problem with homogeneous GCN on KGs: If entity v has neighbors connected by "isA" (a taxonomic relation), "bornIn" (a geographic relation), and "marriedTo" (a personal relation), a standard GCN aggregates all three identically. The resulting embedding loses the meaning of which relation connected v to each neighbor — it can't distinguish "Obama is a type of politician" from "Obama was born in Hawaii."

Knowledge Graph Tasks

Two canonical tasks on KGs:

Entity classification: Predict properties of nodes. "Is this entity a person? a place? an organization?" Given partial labels, can we classify all entities using the graph structure?

Link prediction: Given a pair (subject, relation, ?), predict the missing object. E.g., (Barack Obama, bornIn, ?) → Hawaii. KGs are famously incomplete — Freebase has ~71% of people's birthplaces missing. Link prediction fills these gaps.

Heterogeneous vs Homogeneous Aggregation

A small knowledge graph. Edge colors = relation types. Standard GCN (left): blends all relations. R-GCN (right): uses separate weights per relation — different colors = different transformations.

Why does standard GCN fail on heterogeneous knowledge graphs?

Chapter 1: Relation-Specific Weights

R-GCN's solution is conceptually minimal: give each relation type its own weight matrix. Instead of one W, have one Wr per relation r.

From GCN to R-GCN

Recall the standard GCN layer:

h(l+1)i = σ( W(l)j∈N(i) cij h(l)j + W(l)self h(l)i )

where N(i) is all neighbors and cij is a normalization constant (1/degree).

R-GCN modifies this: instead of summing over all neighbors, sum over each relation type separately, with its own weight matrix:

h(l+1)i = σ( W(l)0 h(l)i + ∑r∈Rj∈Nr(i) (1/ci,r) W(l)r h(l)j )

where Nr(i) = neighbors of i connected by relation r, and ci,r = |Nr(i)| (per-relation degree normalization).

What changes, what stays the same: Everything in the aggregation is the same as GCN — sum over neighbors, normalize, apply nonlinearity. The only change is that the weight matrix W(l)r depends on which relation r the edge belongs to. This is a minimal, principled modification that costs exactly one extra weight matrix per relation.

Inverse Relations

An important practical detail: for each relation r, R-GCN also adds the inverse relation r-1. If (Obama, bornIn, Hawaii) is an edge, we also add (Hawaii, bornIn-1, Obama). This is a separate relation type with its own weight matrix Wr-1.

Why? Because the message flowing from Hawaii to Obama via "bornIn" is semantically different from the message flowing from Obama to Hawaii via the same edge. Obama "was born in" Hawaii; Hawaii "is the birthplace of" Obama. These carry different information.

What is the key architectural difference between a standard GCN layer and an R-GCN layer?

Chapter 2: R-GCN Layer (Showcase)

Let's trace exactly how information flows through an R-GCN layer for a single entity. This will make the computation concrete.

R-GCN Message Passing — Interactive

A small KG with 3 relation types. Select an entity (node) to compute its R-GCN update. See how each relation's neighbors contribute a separate weighted sum, combined to produce the new embedding.

Select an entity to see its R-GCN update computation step by step.

Data Shapes

For a KG with N entities, d input features, and d' output features:

Total parameters per layer: (|R| + 1) × d × d'. With |R| = 100 relation types, d = d' = 100: 1,010,000 parameters per layer. For Wikidata with |R| ≈ 800: ~8 million per layer. This is the over-parameterization problem that Chapters 3 and 4 address.

If a knowledge graph has 200 relation types and each R-GCN layer has input dimension d=128 and output d'=128, how many parameters does one R-GCN layer have (excluding bias)?

Chapter 3: Basis Decomposition

The naive R-GCN has |R| weight matrices — one per relation. For KGs with hundreds of relations, this is too many parameters, and relations with few training triples will be under-trained. The paper proposes two regularization strategies. Chapter 3 covers the first: basis decomposition.

The Idea: Share Across Relations

Instead of a fully independent Wr per relation, decompose each Wr as a linear combination of a small set of basis matrices V1, ..., VB:

W(l)r = ∑b=1B a(l)rb V(l)b

where Vb ∈ ℝd×d' are B shared basis matrices (same across all relations), and arb ∈ ℝ are relation-specific scalar coefficients. Only the coefficients arb are relation-specific — the bases Vb are learned once and shared.

The compression: Parameters go from |R| × d × d' to B × d × d' (bases) + |R| × B (coefficients). When B ≪ |R|, this is massive compression. With B = 2 (2 basis matrices), every relation's Wr is a linear combination of just 2 templates — much easier to train with few examples per relation.

Why This Works

The basis decomposition forces parameter sharing: relations that encode similar semantics (e.g., "bornIn" and "diedIn" are both geographic relations) will naturally learn similar coefficients arb, leading to similar Wr. Relations that are truly different will learn different coefficients. The basis matrices Vb become the "semantic primitives" of the KG.

Basis Decomposition — Visualization

B basis matrices (heatmaps). Each relation's Wr = weighted combination. Adjust the coefficients for one relation and see its weight matrix update in real time.

a1 (Basis 1 weight) 0.70
a2 (Basis 2 weight) 0.30
In basis decomposition with B=3 bases, how many parameters does each relation r contribute (per layer, excluding shared bases)?

Chapter 4: Block Diagonal Decomposition

The second regularization approach in the paper: block diagonal decomposition. Instead of sharing basis matrices across relations, restrict each Wr to be block-diagonal.

Block Diagonal Structure

A block diagonal matrix has non-zero entries only along the diagonal blocks and zero elsewhere:

Wr = diag( Q(1)r, Q(2)r, ..., Q(C)r )

where each Q(c)r ∈ ℝ(d/C)×(d'/C) is a small dense matrix. The full Wr is d × d' with C² blocks, but only C blocks are non-zero.

Interpretation: Block diagonality means the embedding is implicitly split into C independent "channels" or "aspects." Each relation transforms each aspect separately, with no cross-channel mixing. This is similar to grouped convolutions in CNNs — a structured form of parameter efficiency that also acts as a regularizer.

Parameters Comparison

VariantParameters per layerCompression vs full
Full Wr|R| × d × d'
Basis (B bases)B × d × d' + |R| × B≈ B/|R| × (ignoring small term)
Block diag (C blocks)|R| × C × (d/C) × (d'/C) = |R| × dd'/C1/C×

Basis vs Block Diagonal: When to Use Which

Basis decomposition works better when relations are semantically related — sharing basis matrices leverages inter-relation similarity. Best for KGs where relations cluster into semantic families.

Block diagonal works better when embedding dimensions have natural groupings — each "aspect" of the entity is processed independently. Simpler to implement; no basis learning required. The paper finds basis decomposition slightly better in practice for both tasks.

Block Diagonal vs Full Matrix

Visualization of a weight matrix Wr. Full (left): dense, all parameters free. Block diagonal (right): only diagonal blocks are non-zero — same input/output dimensions, fewer parameters.

Number of blocks C 4
With block diagonal decomposition using C=4 blocks on a d=d'=64 weight matrix, how many parameters does Wr have (instead of 64×64=4096)?

Chapter 5: Link Prediction with DistMult

For link prediction, R-GCN is used as an encoder to produce entity embeddings, then paired with a decoder that scores triples. The paper uses DistMult as the decoder.

The Encoder-Decoder Framework

Input: entity features
Initial one-hot vectors or random embeddings for each entity
↓ R-GCN (L layers, basis/block regularization)
Entity embeddings ei
Rich d-dimensional representation encoding the entity's graph neighborhood and relation context
↓ DistMult decoder
Triple score f(s, r, o)
How plausible is the fact (subject, relation, object)?

DistMult Decoder

DistMult (Yang et al., 2015) is a bilinear model for triple scoring:

f(s, r, o) = eTs · diag(Wr) · eo = ∑k esk · Wrk · eok

where es, eo are entity embeddings from R-GCN, and Wr is a diagonal matrix for relation r (one learnable scalar per dimension). This is a simple element-wise interaction — the score is high when corresponding dimensions of subject and object embeddings align, weighted by the relation.

Why DistMult? It's simple (one vector per relation), fast to compute, and empirically strong despite its simplicity. Its main weakness: it's symmetric in subject and object (f(s,r,o) = f(o,r,s)), so it can't handle asymmetric relations like "isParentOf." For symmetric relations (like "married to" in some KGs) this is fine; for asymmetric ones, TransE or ComplEx are better decoders.

Training with Negative Sampling

For each observed triple (s, r, o), generate negative triples by corrupting the subject or object: (s', r, o) or (s, r, o') where s', o' are randomly sampled entities. Train with cross-entropy loss (positive triple should score higher than negatives).

L = −∑(s,r,o)∈T+ log σ(f(s,r,o)) − ∑(s',r,o')∈T log σ(−f(s',r,o'))
In the R-GCN + DistMult framework, what is the role of the R-GCN?

Chapter 6: Results

Entity Classification on AIFB, MUTAG, BGS, AM

Four RDF knowledge graphs with node classification tasks. AIFB: 8,285 nodes, 29,043 edges, 45 relations, 4 classes. MUTAG: 23,644 nodes, 74,227 edges, 23 relations, 2 classes.

MethodAIFBMUTAGBGSAM
Feat (no graph)55.677.672.466.7
WL kernel80.680.686.287.4
DeepWalk25.076.358.665.2
R-GCN95.873.283.189.3

Link Prediction on FB15k-237

FB15k-237 is a standard KG benchmark (14,541 entities, 237 relation types, 272,115 training triples). Metric: Mean Reciprocal Rank (MRR) — higher is better. Hits@10 = fraction of test triples where the correct answer is in the top 10 predictions.

MethodMRRHits@10Hits@1
TransE0.29446.5
DistMult (standalone)0.24141.915.5
ComplEx0.24742.815.8
R-GCN + DistMult0.24941.715.1
The takeaway: For entity classification, R-GCN is dramatically better — the graph structure provides crucial context for node labeling. For link prediction, R-GCN is competitive with but not dramatically better than standalone DistMult. This is because link prediction quality depends heavily on the decoder's capacity — and DistMult is somewhat limited. Later works (CompGCN, 2019) show that better decoders unlock much stronger R-GCN performance on link prediction.
On which task does R-GCN show the most dramatic improvement over baselines?

Chapter 7: Connections & Beyond

Limitations of R-GCN

DistMult decoder is symmetric: Can't model asymmetric relations like "isParentOf." Replaced by ComplEx, RotatE, or TransE decoders in later work.

All relations treated independently (in full variant): No built-in mechanism to learn that "bornIn" and "diedIn" are related. Basis decomposition partially addresses this but doesn't explicitly model relation similarity.

Scalability: KGs with millions of entities require mini-batch training with neighbor sampling. The paper doesn't address this — later work (GraphSAGE, ClusterGCN) provides the tools.

No edge features: Edge attributes beyond type are not modeled. Some KGs have temporal edges (valid from date X to date Y), certainty scores, or other metadata.

Follow-Up Work

PaperKey Advance
CompGCN (Vashishth, 2020)Composition of entity + relation embeddings; much better link prediction
HGT (Hu, 2020)Attention-based type-specific projections for heterogeneous graphs (see HGT lesson)
RGAT (Busbridge, 2019)Adds attention weights across neighbors within each relation type
KGCN (Wang, 2019)Applies to recommendation via user-KG interaction graphs
R-GCN's lasting contribution: It established the principle that heterogeneous graphs need type-specific weight matrices and demonstrated this cleanly on knowledge graph tasks. Every subsequent heterogeneous GNN builds on this insight — they differ in how they parameterize the type-specific transformations (attention in GAT-style, type-specific in HGT, composition in CompGCN).

Related Lessons

"Not all edges are equal. Give each relation its own voice."
— Paraphrase of R-GCN's central design principle