One weight matrix per relation type — the minimal extension that makes GCNs useful on heterogeneous knowledge graphs with typed edges.
A knowledge graph (KG) is a massive network of facts. Wikidata has ~100 million nodes (entities) and ~800 million edges (relations). Each edge has a type: "Barack Obama bornIn Hawaii", "Hawaii partOf USA", "USA hasCapital Washington D.C." These relations are heterogeneous — they mean completely different things.
Standard GCN treats all edges the same. It aggregates over all neighbors with one weight matrix W. But "bornIn" and "hasCapital" are fundamentally different relationships. Aggregating them with the same W conflates information that should be treated differently.
Two canonical tasks on KGs:
Entity classification: Predict properties of nodes. "Is this entity a person? a place? an organization?" Given partial labels, can we classify all entities using the graph structure?
Link prediction: Given a pair (subject, relation, ?), predict the missing object. E.g., (Barack Obama, bornIn, ?) → Hawaii. KGs are famously incomplete — Freebase has ~71% of people's birthplaces missing. Link prediction fills these gaps.
A small knowledge graph. Edge colors = relation types. Standard GCN (left): blends all relations. R-GCN (right): uses separate weights per relation — different colors = different transformations.
R-GCN's solution is conceptually minimal: give each relation type its own weight matrix. Instead of one W, have one Wr per relation r.
Recall the standard GCN layer:
where N(i) is all neighbors and cij is a normalization constant (1/degree).
R-GCN modifies this: instead of summing over all neighbors, sum over each relation type separately, with its own weight matrix:
where Nr(i) = neighbors of i connected by relation r, and ci,r = |Nr(i)| (per-relation degree normalization).
An important practical detail: for each relation r, R-GCN also adds the inverse relation r-1. If (Obama, bornIn, Hawaii) is an edge, we also add (Hawaii, bornIn-1, Obama). This is a separate relation type with its own weight matrix Wr-1.
Why? Because the message flowing from Hawaii to Obama via "bornIn" is semantically different from the message flowing from Obama to Hawaii via the same edge. Obama "was born in" Hawaii; Hawaii "is the birthplace of" Obama. These carry different information.
Let's trace exactly how information flows through an R-GCN layer for a single entity. This will make the computation concrete.
A small KG with 3 relation types. Select an entity (node) to compute its R-GCN update. See how each relation's neighbors contribute a separate weighted sum, combined to produce the new embedding.
For a KG with N entities, d input features, and d' output features:
Total parameters per layer: (|R| + 1) × d × d'. With |R| = 100 relation types, d = d' = 100: 1,010,000 parameters per layer. For Wikidata with |R| ≈ 800: ~8 million per layer. This is the over-parameterization problem that Chapters 3 and 4 address.
The naive R-GCN has |R| weight matrices — one per relation. For KGs with hundreds of relations, this is too many parameters, and relations with few training triples will be under-trained. The paper proposes two regularization strategies. Chapter 3 covers the first: basis decomposition.
Instead of a fully independent Wr per relation, decompose each Wr as a linear combination of a small set of basis matrices V1, ..., VB:
where Vb ∈ ℝd×d' are B shared basis matrices (same across all relations), and arb ∈ ℝ are relation-specific scalar coefficients. Only the coefficients arb are relation-specific — the bases Vb are learned once and shared.
The basis decomposition forces parameter sharing: relations that encode similar semantics (e.g., "bornIn" and "diedIn" are both geographic relations) will naturally learn similar coefficients arb, leading to similar Wr. Relations that are truly different will learn different coefficients. The basis matrices Vb become the "semantic primitives" of the KG.
B basis matrices (heatmaps). Each relation's Wr = weighted combination. Adjust the coefficients for one relation and see its weight matrix update in real time.
The second regularization approach in the paper: block diagonal decomposition. Instead of sharing basis matrices across relations, restrict each Wr to be block-diagonal.
A block diagonal matrix has non-zero entries only along the diagonal blocks and zero elsewhere:
where each Q(c)r ∈ ℝ(d/C)×(d'/C) is a small dense matrix. The full Wr is d × d' with C² blocks, but only C blocks are non-zero.
| Variant | Parameters per layer | Compression vs full |
|---|---|---|
| Full Wr | |R| × d × d' | 1× |
| Basis (B bases) | B × d × d' + |R| × B | ≈ B/|R| × (ignoring small term) |
| Block diag (C blocks) | |R| × C × (d/C) × (d'/C) = |R| × dd'/C | 1/C× |
Basis decomposition works better when relations are semantically related — sharing basis matrices leverages inter-relation similarity. Best for KGs where relations cluster into semantic families.
Block diagonal works better when embedding dimensions have natural groupings — each "aspect" of the entity is processed independently. Simpler to implement; no basis learning required. The paper finds basis decomposition slightly better in practice for both tasks.
Visualization of a weight matrix Wr. Full (left): dense, all parameters free. Block diagonal (right): only diagonal blocks are non-zero — same input/output dimensions, fewer parameters.
For link prediction, R-GCN is used as an encoder to produce entity embeddings, then paired with a decoder that scores triples. The paper uses DistMult as the decoder.
DistMult (Yang et al., 2015) is a bilinear model for triple scoring:
where es, eo are entity embeddings from R-GCN, and Wr is a diagonal matrix for relation r (one learnable scalar per dimension). This is a simple element-wise interaction — the score is high when corresponding dimensions of subject and object embeddings align, weighted by the relation.
For each observed triple (s, r, o), generate negative triples by corrupting the subject or object: (s', r, o) or (s, r, o') where s', o' are randomly sampled entities. Train with cross-entropy loss (positive triple should score higher than negatives).
Four RDF knowledge graphs with node classification tasks. AIFB: 8,285 nodes, 29,043 edges, 45 relations, 4 classes. MUTAG: 23,644 nodes, 74,227 edges, 23 relations, 2 classes.
| Method | AIFB | MUTAG | BGS | AM |
|---|---|---|---|---|
| Feat (no graph) | 55.6 | 77.6 | 72.4 | 66.7 |
| WL kernel | 80.6 | 80.6 | 86.2 | 87.4 |
| DeepWalk | 25.0 | 76.3 | 58.6 | 65.2 |
| R-GCN | 95.8 | 73.2 | 83.1 | 89.3 |
FB15k-237 is a standard KG benchmark (14,541 entities, 237 relation types, 272,115 training triples). Metric: Mean Reciprocal Rank (MRR) — higher is better. Hits@10 = fraction of test triples where the correct answer is in the top 10 predictions.
| Method | MRR | Hits@10 | Hits@1 |
|---|---|---|---|
| TransE | 0.294 | 46.5 | — |
| DistMult (standalone) | 0.241 | 41.9 | 15.5 |
| ComplEx | 0.247 | 42.8 | 15.8 |
| R-GCN + DistMult | 0.249 | 41.7 | 15.1 |
DistMult decoder is symmetric: Can't model asymmetric relations like "isParentOf." Replaced by ComplEx, RotatE, or TransE decoders in later work.
All relations treated independently (in full variant): No built-in mechanism to learn that "bornIn" and "diedIn" are related. Basis decomposition partially addresses this but doesn't explicitly model relation similarity.
Scalability: KGs with millions of entities require mini-batch training with neighbor sampling. The paper doesn't address this — later work (GraphSAGE, ClusterGCN) provides the tools.
No edge features: Edge attributes beyond type are not modeled. Some KGs have temporal edges (valid from date X to date Y), certainty scores, or other metadata.
| Paper | Key Advance |
|---|---|
| CompGCN (Vashishth, 2020) | Composition of entity + relation embeddings; much better link prediction |
| HGT (Hu, 2020) | Attention-based type-specific projections for heterogeneous graphs (see HGT lesson) |
| RGAT (Busbridge, 2019) | Adds attention weights across neighbors within each relation type |
| KGCN (Wang, 2019) | Applies to recommendation via user-KG interaction graphs |
"Not all edges are equal. Give each relation its own voice."
— Paraphrase of R-GCN's central design principle