RGCN — Veanors

Chapter 0: The Problem

A knowledge graph (KG) is a massive network of facts. Wikidata has ~100 million nodes (entities) and ~800 million edges (relations). Each edge has a type: "Barack Obama bornIn Hawaii", "Hawaii partOf USA", "USA hasCapital Washington D.C." These relations are heterogeneous — they mean completely different things.

Standard GCN treats all edges the same. It aggregates over all neighbors with one weight matrix W. But "bornIn" and "hasCapital" are fundamentally different relationships. Aggregating them with the same W conflates information that should be treated differently.

The problem with homogeneous GCN on KGs: If entity v has neighbors connected by "isA" (a taxonomic relation), "bornIn" (a geographic relation), and "marriedTo" (a personal relation), a standard GCN aggregates all three identically. The resulting embedding loses the meaning of which relation connected v to each neighbor — it can't distinguish "Obama is a type of politician" from "Obama was born in Hawaii."
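The conflation can be seen in a few lines of NumPy. This is a minimal sketch, assuming mean aggregation, a ReLU nonlinearity, and random placeholder embeddings; the entity names and the `gcn_update` helper are hypothetical. Two neighborhoods that differ only in their relation labels produce the same output, because the labels never enter the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Hypothetical embeddings for three neighbors of an entity v.
neighbors = {
    "Politician": rng.normal(size=d),
    "Hawaii": rng.normal(size=d),
    "Spouse": rng.normal(size=d),
}
W = rng.normal(size=(d, d))  # the single shared GCN weight matrix


def gcn_update(neighbor_vecs, W):
    """Mean-aggregate neighbor embeddings, then apply one shared W and ReLU."""
    mean = np.mean(neighbor_vecs, axis=0)
    return np.maximum(mean @ W, 0.0)


# Same neighbors, but the relation labels on the edges are shuffled.
edges_a = [("isA", "Politician"), ("bornIn", "Hawaii"), ("marriedTo", "Spouse")]
edges_b = [("bornIn", "Politician"), ("marriedTo", "Hawaii"), ("isA", "Spouse")]

h_a = gcn_update([neighbors[n] for _, n in edges_a], W)
h_b = gcn_update([neighbors[n] for _, n in edges_b], W)
same = np.allclose(h_a, h_b)  # the relation labels had no effect
```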

Knowledge Graph Tasks

Two canonical tasks on KGs:

Entity classification: Predict properties of nodes. "Is this entity a person? a place? an organization?" Given partial labels, can we classify all entities using the graph structure?

Link prediction: Given an incomplete triple (subject, relation, ?), predict the missing object. E.g., (Barack Obama, bornIn, ?) → Hawaii. KGs are famously incomplete: roughly 71% of people in Freebase have no recorded place of birth. Link prediction fills these gaps.

Heterogeneous vs Homogeneous Aggregation

A small knowledge graph. Edge colors = relation types. Standard GCN (left): blends all relations. R-GCN (right): uses separate weights per relation — different colors = different transformations.

Why does standard GCN fail on heterogeneous knowledge graphs?

GCN can't handle graphs with more than one connected component
Knowledge graphs are too large for GCN to scale to
Standard GCN uses one weight matrix for all edges, so it can't distinguish different relation types and conflates semantically distinct relationships

Chapter 1: Relation-Specific Weights

R-GCN's solution is conceptually minimal: give each relation type its own weight matrix. Instead of one W, have one W_r per relation r.

From GCN to R-GCN

Recall the standard GCN layer:

h^(l+1)_i = σ( W^(l) ∑_j∈N(i) c_ij h^(l)_j + W^(l)_self h^(l)_i )

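where N(i) is the set of all neighbors of i and c_ij is a normalization constant (for example 1/|N(i)|, the inverse degree; the original GCN of Kipf and Welling uses the symmetric form 1/√(d_i d_j)).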

R-GCN modifies this: instead of summing over all neighbors, sum over each relation type separately, with its own weight matrix:

h^(l+1)_i = σ( W^(l)₀ h^(l)_i + ∑_r∈R ∑_j∈N^r(i) (1/c_i,r) W^(l)_r h^(l)_j )

where N^r(i) = neighbors of i connected by relation r, and c_i,r = |N^r(i)| (per-relation degree normalization).

What changes, what stays the same: Everything in the aggregation is the same as GCN — sum over neighbors, normalize, apply nonlinearity. The only change is that the weight matrix W^(l)_r depends on which relation r the edge belongs to. This is a minimal, principled modification that costs exactly one extra weight matrix per relation.
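The update rule can be sketched directly in NumPy. This is an illustrative, unoptimized version under a few assumptions: embeddings are stored as rows (so the code computes h @ W rather than W h), σ is ReLU, and each edge is a tuple (i, r, j) meaning "j is a neighbor of i under relation r". The function name `rgcn_layer` and the toy weights are hypothetical.

```python
import numpy as np


def rgcn_layer(H, edges, W_rel, W_self):
    """One R-GCN layer.

    H:      (N, d) entity embeddings, one row per entity
    edges:  list of (i, r, j), meaning j is in N^r(i)
    W_rel:  (|R|, d, d') one weight matrix per relation
    W_self: (d, d') self-connection weight W_0
    """
    out = H @ W_self  # self-connection term
    # c_{i,r} = |N^r(i)|: number of neighbors of i under relation r.
    counts = {}
    for i, r, _ in edges:
        counts[(i, r)] = counts.get((i, r), 0) + 1
    # Add each neighbor's message, transformed by its relation's matrix.
    for i, r, j in edges:
        out[i] += (H[j] @ W_rel[r]) / counts[(i, r)]
    return np.maximum(out, 0.0)  # ReLU


# Toy example: 3 entities, 2 relations, d = d' = 2.
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W_self = np.eye(2)
W_rel = np.stack([np.eye(2), 2.0 * np.eye(2)])
edges = [(0, 0, 1), (0, 0, 2), (0, 1, 2)]

H_next = rgcn_layer(H, edges, W_rel, W_self)
# Entity 0: [1, 0] + mean([0, 1], [1, 1]) + 2 * [1, 1] = [3.5, 3.0]
```

Entities 1 and 2 have no incoming edges in this toy graph, so only the self-connection term contributes to their new embeddings.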

Inverse Relations

An important practical detail: for each relation r, R-GCN also adds the inverse relation r^-1. If (Obama, bornIn, Hawaii) is an edge, we also add (Hawaii, bornIn^-1, Obama). This is a separate relation type with its own weight matrix W_{r^-1} (a matrix indexed by the inverse relation, not the matrix inverse of W_r).

Why? Because the message flowing from Hawaii to Obama via "bornIn" is semantically different from the message flowing from Obama to Hawaii via the same edge. Obama "was born in" Hawaii; Hawaii "is the birthplace of" Obama. These carry different information.
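As a preprocessing step, adding inverse relations might look like the sketch below. The id convention (the inverse of relation r gets id r + |R|) and the helper name `add_inverse_relations` are assumptions made for illustration. Note that the layer then sees twice as many relation types.

```python
def add_inverse_relations(triples, num_relations):
    """Return the triples plus one inverse triple per original.

    Assumed convention: the inverse of relation id r is r + num_relations.
    """
    augmented = list(triples)
    for s, r, o in triples:
        augmented.append((o, r + num_relations, s))
    return augmented


# Hypothetical ids: entities 0=Obama, 1=Hawaii, 2=USA; relations 0=bornIn, 1=partOf.
triples = [(0, 0, 1), (1, 1, 2)]
augmented = add_inverse_relations(triples, num_relations=2)
# augmented now also holds (1, 2, 0) for bornIn^-1 and (2, 3, 1) for partOf^-1
```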

What is the key architectural difference between a standard GCN layer and an R-GCN layer?

R-GCN uses attention weights instead of fixed normalization
R-GCN uses multiple aggregation steps instead of one
R-GCN uses a separate weight matrix W_r for each relation type r, rather than one shared W for all edges

Chapter 2: R-GCN Layer (Showcase)

Let's trace exactly how information flows through an R-GCN layer for a single entity. This will make the computation concrete.

R-GCN Message Passing — Interactive

A small KG with 3 relation types. Select an entity (node) to compute its R-GCN update. See how each relation's neighbors contribute a separate weighted sum, combined to produce the new embedding.

Data Shapes

For a KG with N entities, d input features, and d' output features:

Input H^(l): N × d matrix (one d-dimensional embedding per entity)
Weight matrices W_r: one d × d' matrix per relation r (|R| matrices total)
Self-weight W₀: one d × d' matrix for the self-connection
Output H^(l+1): N × d' matrix (new embeddings)

Total parameters per layer: (|R| + 1) × d × d'. With |R| = 100 relation types, d = d' = 100: 1,010,000 parameters per layer. For Wikidata with |R| ≈ 800: ~8 million per layer. This is the over-parameterization problem that Chapters 3 and 4 address.
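The arithmetic above is easy to reproduce. The helper name `full_rgcn_params` is hypothetical; it only encodes the formula (|R| + 1) × d × d'.

```python
def full_rgcn_params(num_relations, d_in, d_out):
    """(|R| + 1) * d * d': one matrix per relation plus the self-connection."""
    return (num_relations + 1) * d_in * d_out


small = full_rgcn_params(100, 100, 100)  # 1,010,000
large = full_rgcn_params(800, 100, 100)  # 8,010,000
```

If inverse relations are added as described in Chapter 1, the value passed as `num_relations` should be the doubled count.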

If a knowledge graph has 200 relation types and each R-GCN layer has input dimension d=128 and output d'=128, how many parameters does one R-GCN layer have (excluding bias)?

16,384: just one shared d×d' matrix
3,293,184: (200 + 1) × 128 × 128, one matrix per relation plus self-connection
200 × 128 = 25,600: one vector per relation

Chapter 3: Basis Decomposition

The naive R-GCN has |R| weight matrices — one per relation. For KGs with hundreds of relations, this is too many parameters, and relations with few training triples will be under-trained. The paper proposes two regularization strategies. Chapter 3 covers the first: basis decomposition.

The Idea: Share Across Relations

Instead of a fully independent W_r per relation, decompose each W_r as a linear combination of a small set of basis matrices V₁, ..., V_B:

W^(l)_r = ∑_{b=1}^{B} a^(l)_rb V^(l)_b

where V_b ∈ ℝ^{d×d'} are B shared basis matrices (same across all relations), and a_rb ∈ ℝ are relation-specific scalar coefficients. Only the coefficients a_rb are relation-specific; the bases V_b are learned once and shared.

The compression: Parameters go from |R| × d × d' to B × d × d' (bases) + |R| × B (coefficients). When B ≪ |R|, this is massive compression. With B = 2 (2 basis matrices), every relation's W_r is a linear combination of just 2 templates — much easier to train with few examples per relation.

Why This Works

The basis decomposition forces parameter sharing: relations that encode similar semantics (e.g., "bornIn" and "diedIn" are both geographic relations) will naturally learn similar coefficients a_rb, leading to similar W_r. Relations that are truly different will learn different coefficients. The basis matrices V_b become the "semantic primitives" of the KG.
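A small NumPy sketch of the decomposition, using random placeholder values instead of learned parameters (the sizes below are arbitrary assumptions). It builds every W_r from the shared bases with one `einsum` call and compares the parameter counts.

```python
import numpy as np

rng = np.random.default_rng(0)
num_relations, B, d_in, d_out = 5, 2, 4, 3

V = rng.normal(size=(B, d_in, d_out))    # shared basis matrices V_b
a = rng.normal(size=(num_relations, B))  # per-relation coefficients a_rb

# W_r = sum_b a_rb * V_b, computed for all relations at once.
W = np.einsum("rb,bio->rio", a, V)       # shape (num_relations, d_in, d_out)

# All W_r lie in the span of the B bases, so the stacked, flattened
# weight matrices have rank at most B.
rank = np.linalg.matrix_rank(W.reshape(num_relations, -1))

full_params = num_relations * d_in * d_out            # 60
basis_params = B * d_in * d_out + num_relations * B   # 24 + 10 = 34
```

The rank check makes the sharing concrete: however many relations there are, their weight matrices live in a B-dimensional subspace.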

Basis Decomposition — Visualization

B basis matrices (heatmaps). Each relation's W_r = weighted combination. Adjust the coefficients for one relation and see its weight matrix update in real time.

In basis decomposition with B=3 bases, how many parameters does each relation r contribute (per layer, excluding shared bases)?

d × d' parameters: a full weight matrix
B = 3 scalar coefficients a_r1, a_r2, a_r3
d + d' parameters: the input and output dimensions

Chapter 4: Block Diagonal Decomposition

The second regularization approach in the paper: block diagonal decomposition. Instead of sharing basis matrices across relations, restrict each W_r to be block-diagonal.

Block Diagonal Structure

A block diagonal matrix has non-zero entries only along the diagonal blocks and zero elsewhere:

W_r = diag( Q^(1)_r, Q^(2)_r, ..., Q^(C)_r )

where each Q^(c)_r ∈ ℝ^{(d/C)×(d'/C)} is a small dense matrix. The full W_r is d × d' with C² blocks, but only C blocks are non-zero.

Interpretation: Block diagonality means the embedding is implicitly split into C independent "channels" or "aspects." Each relation transforms each aspect separately, with no cross-channel mixing. This is similar to grouped convolutions in CNNs — a structured form of parameter efficiency that also acts as a regularizer.
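The "independent channels" reading can be checked numerically. The sketch below, with random placeholder blocks and a hypothetical helper `block_diag_weight`, assembles the full block-diagonal W_r and confirms that multiplying by it gives the same result as transforming each of the C chunks of the embedding separately. Row-vector convention (h @ W) is assumed.

```python
import numpy as np


def block_diag_weight(blocks):
    """Assemble W_r = diag(Q_1, ..., Q_C) from blocks of shape (C, d/C, d'/C)."""
    C, p, q = blocks.shape
    W = np.zeros((C * p, C * q))
    for c in range(C):
        W[c * p:(c + 1) * p, c * q:(c + 1) * q] = blocks[c]
    return W


rng = np.random.default_rng(0)
C, d_in, d_out = 4, 8, 8
Q = rng.normal(size=(C, d_in // C, d_out // C))  # the C dense blocks of one relation
h = rng.normal(size=d_in)

dense_result = h @ block_diag_weight(Q)
# Equivalent channel-wise form: split h into C chunks, transform each on its own.
channel_result = np.einsum("cp,cpq->cq", h.reshape(C, -1), Q).reshape(-1)
```

In practice the channel-wise form is the one worth computing, since it never materializes the zero blocks.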

Parameters Comparison

Variant	Parameters per layer	Compression vs full
Full W_r	|R| × d × d'	1×
Basis (B bases)	B × d × d' + |R| × B	≈ (B/|R|)× (ignoring the |R| × B coefficient term)
Block diag (C blocks)	|R| × C × (d/C) × (d'/C) = |R| × dd'/C	(1/C)×
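To get a feel for the table, the three formulas can be evaluated for a concrete setting. The helper `layer_params` and the chosen sizes (200 relations, 128 dimensions, 10 bases, 8 blocks) are hypothetical; like the table, the counts leave out the self-connection and biases, and the block variant assumes d and d' are divisible by C.

```python
def layer_params(num_relations, d_in, d_out, num_bases=None, num_blocks=None):
    """Relation-weight parameters for one layer, following the table's formulas."""
    if num_bases is not None:
        # B shared bases plus B coefficients per relation.
        return num_bases * d_in * d_out + num_relations * num_bases
    if num_blocks is not None:
        # C blocks of size (d/C) x (d'/C) per relation.
        return num_relations * d_in * d_out // num_blocks
    return num_relations * d_in * d_out


full = layer_params(200, 128, 128)                  # 3,276,800
basis = layer_params(200, 128, 128, num_bases=10)   # 165,840
block = layer_params(200, 128, 128, num_blocks=8)   # 409,600
```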

Basis vs Block Diagonal: When to Use Which

Basis decomposition works better when relations are semantically related — sharing basis matrices leverages inter-relation similarity. Best for KGs where relations cluster into semantic families.

Block diagonal works better when embedding dimensions have natural groupings, so that each "aspect" of the entity is processed independently. Simpler to implement; no basis learning required. In the paper's experiments, entity classification uses basis decomposition, while block decomposition performed best for link prediction on FB15k-237.

Block Diagonal vs Full Matrix

Visualization of a weight matrix W_r. Full (left): dense, all parameters free. Block diagonal (right): only diagonal blocks are non-zero — same input/output dimensions, fewer parameters.

With block diagonal decomposition using C=4 blocks on a d=d'=64 weight matrix, how many parameters does W_r have (instead of 64×64=4096)?

4 parameters: just the 4 diagonal scalars
1024: a 32×32 matrix
1024: 4 blocks of (64/4)×(64/4) = 4 × 16 × 16 = 1024

Chapter 5: Link Prediction with DistMult

For link prediction, R-GCN is used as an encoder to produce entity embeddings, then paired with a decoder that scores triples. The paper uses DistMult as the decoder.

The Encoder-Decoder Framework

Input: entity features

Initial one-hot vectors or random embeddings for each entity

↓ R-GCN (L layers, basis/block regularization)

Entity embeddings e_i

Rich d-dimensional representation encoding the entity's graph neighborhood and relation context

↓ DistMult decoder

Triple score f(s, r, o)

How plausible is the fact (subject, relation, object)?

DistMult Decoder

DistMult (Yang et al., 2015) is a bilinear model for triple scoring:

f(s, r, o) = e^T_s · diag(W_r) · e_o = ∑_k e_sk · W_rk · e_ok

where e_s, e_o are entity embeddings from R-GCN, and W_r here is a relation-specific vector of d learnable scalars (separate from the R-GCN layer weights W^(l)_r) that diag(·) places on the diagonal of a d × d matrix. This is a simple element-wise interaction: the score is high when corresponding dimensions of subject and object embeddings align, weighted by the relation.

Why DistMult? It's simple (one vector per relation), fast to compute, and empirically strong despite its simplicity. Its main weakness: it's symmetric in subject and object (f(s,r,o) = f(o,r,s)), so it can't handle asymmetric relations like "isParentOf." For symmetric relations (like "married to" in some KGs) this is fine; for asymmetric ones, TransE or ComplEx are better decoders.
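The symmetry is easy to verify numerically. A minimal sketch with random placeholder vectors (in a real model these would come from the R-GCN encoder and a learned relation embedding); the function name `distmult_score` is hypothetical.

```python
import numpy as np


def distmult_score(e_s, w_r, e_o):
    """f(s, r, o) = sum_k e_s[k] * w_r[k] * e_o[k]."""
    return float(np.sum(e_s * w_r * e_o))


rng = np.random.default_rng(0)
e_s, e_o, w_r = rng.normal(size=(3, 8))  # three 8-dimensional vectors

forward = distmult_score(e_s, w_r, e_o)     # f(s, r, o)
backward = distmult_score(e_o, w_r, e_s)    # f(o, r, s)
bilinear = float(e_s @ np.diag(w_r) @ e_o)  # same score in matrix form
```

Because the element-wise product commutes, `forward` and `backward` are equal for any choice of vectors, which is exactly why the decoder cannot tell (s, r, o) from (o, r, s).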

Training with Negative Sampling

For each observed triple (s, r, o), generate negative triples by corrupting the subject or object: (s', r, o) or (s, r, o') where s', o' are randomly sampled entities. Train with a binary cross-entropy loss that pushes observed triples toward high scores and corrupted triples toward low scores:

L = −∑_{(s,r,o)∈T⁺} log σ(f(s,r,o)) − ∑_{(s',r,o')∈T⁻} log σ(−f(s',r,o'))
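Corruption and the loss can be sketched as follows. This is an illustration under assumptions: one side of the triple is corrupted with probability 0.5 each, the sampled entity is not checked against known true triples (so an occasional false negative is possible), and the loss is the plain sum from the formula above with no extra normalization. The helper names are hypothetical.

```python
import numpy as np


def corrupt(triple, num_entities, rng):
    """Replace the subject or the object (chosen at random) with a random entity."""
    s, r, o = triple
    new_entity = int(rng.integers(num_entities))
    return (new_entity, r, o) if rng.random() < 0.5 else (s, r, new_entity)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)


def bce_loss(pos_scores, neg_scores):
    """L = -sum log sigma(f_pos) - sum log sigma(-f_neg)."""
    return float(-np.sum(log_sigmoid(pos_scores)) - np.sum(log_sigmoid(-neg_scores)))


rng = np.random.default_rng(0)
negative = corrupt((0, 1, 2), num_entities=10, rng=rng)

# Well-separated scores give a small loss; inverted scores give a large one.
good = bce_loss(np.array([5.0]), np.array([-5.0]))
bad = bce_loss(np.array([-5.0]), np.array([5.0]))
```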

In the R-GCN + DistMult framework, what is the role of the R-GCN?

Directly predict the missing entity in a triple (end-to-end prediction)
Encode entity embeddings that capture graph neighborhood context; these embeddings are then scored by DistMult for triple plausibility
Score triples directly, with DistMult handling the graph convolution

Chapter 6: Results

Entity Classification on AIFB, MUTAG, BGS, AM

Four RDF knowledge graphs with node classification tasks. AIFB: 8,285 nodes, 29,043 edges, 45 relations, 4 classes. MUTAG: 23,644 nodes, 74,227 edges, 23 relations, 2 classes.

Method (accuracy, %)	AIFB	MUTAG	BGS	AM
Feat (no graph)	55.6	77.6	72.4	66.7
WL kernel	80.6	80.6	86.2	87.4
DeepWalk	25.0	76.3	58.6	65.2
R-GCN	95.8	73.2	83.1	89.3

Link Prediction on FB15k-237

FB15k-237 is a standard KG benchmark (14,541 entities, 237 relation types, 272,115 training triples). Metric: Mean Reciprocal Rank (MRR) — higher is better. Hits@10 = fraction of test triples where the correct answer is in the top 10 predictions.
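Both metrics are computed from the rank of the correct entity among all candidates for each test triple. A small sketch with made-up ranks (the list below is hypothetical, not benchmark data):

```python
def mrr(ranks):
    """Mean reciprocal rank: the average of 1 / rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Hypothetical 1-based ranks of the correct entity for four test triples.
ranks = [1, 2, 4, 20]

mrr_value = mrr(ranks)         # (1 + 0.5 + 0.25 + 0.05) / 4 = 0.45
hits10 = hits_at_k(ranks, 10)  # 3 of 4 ranks are <= 10, so 0.75
hits1 = hits_at_k(ranks, 1)    # 1 of 4, so 0.25
```

Hits@k is a fraction between 0 and 1 here; results tables often report it multiplied by 100.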

Method	MRR	Hits@10 (%)	Hits@1 (%)
TransE	0.294	46.5	—
DistMult (standalone)	0.241	41.9	15.5
ComplEx	0.247	42.8	15.8
R-GCN + DistMult	0.249	41.7	15.1

The takeaway: For entity classification, R-GCN gives the best accuracy on AIFB and AM, with a large margin on AIFB (95.8 vs 80.6 for the WL kernel), but it trails the WL kernel on MUTAG and BGS. For link prediction, R-GCN + DistMult is competitive with, but not dramatically better than, standalone DistMult (MRR 0.249 vs 0.241), and in this table it trails TransE. This is because link prediction quality depends heavily on the decoder's capacity, and DistMult is somewhat limited. Later work (CompGCN, 2020) shows that better decoders unlock much stronger performance on link prediction.

On which task does R-GCN show the most dramatic improvement over baselines?

Entity classification: R-GCN achieves 95.8% on AIFB vs 55.6% for the feature-only baseline
Link prediction: R-GCN is dramatically better than all KG embedding methods
Both tasks show equally large improvements

Chapter 7: Connections & Beyond

Limitations of R-GCN

DistMult decoder is symmetric: Can't model asymmetric relations like "isParentOf." Replaced by ComplEx, RotatE, or TransE decoders in later work.

All relations treated independently (in full variant): No built-in mechanism to learn that "bornIn" and "diedIn" are related. Basis decomposition partially addresses this but doesn't explicitly model relation similarity.

Scalability: KGs with millions of entities require mini-batch training with neighbor sampling. The paper doesn't address this — later work (GraphSAGE, ClusterGCN) provides the tools.

No edge features: Edge attributes beyond type are not modeled. Some KGs have temporal edges (valid from date X to date Y), certainty scores, or other metadata.

Follow-Up Work

Paper	Key Advance
CompGCN (Vashishth, 2020)	Composition of entity + relation embeddings; much better link prediction
HGT (Hu, 2020)	Attention-based type-specific projections for heterogeneous graphs (see HGT lesson)
RGAT (Busbridge, 2019)	Adds attention weights across neighbors within each relation type
KGCN (Wang, 2019)	Applies to recommendation via user-KG interaction graphs

R-GCN's lasting contribution: It established the principle that heterogeneous graphs need type-specific weight matrices and demonstrated this cleanly on knowledge graph tasks. Much subsequent work on heterogeneous GNNs builds on this insight and differs mainly in how it parameterizes the type-specific transformations (attention within each relation in RGAT, type-specific attention in HGT, composition in CompGCN).

Related Lessons

GCN — the homogeneous baseline R-GCN extends
GAT — adds attention to edge weighting (compatible with R-GCN)
HGT — type-specific attention for heterogeneous graphs (the Transformer-style successor)
TransE: the foundational KG embedding model, used here as a link prediction baseline

"Not all edges are equal. Give each relation its own voice."
— Paraphrase of R-GCN's central design principle

R-GCN: Relational Graph Convolutional Networks