Most real-world graphs have multiple types of nodes and edges. Treating them all the same throws away the most valuable information. Here's how to build GNNs that respect the difference.
Imagine you're building a recommendation system for an academic search engine. Your data looks like this: researchers write papers. Papers cite other papers. Papers get published at venues — conferences and journals. Each of these is a fundamentally different kind of entity.
Now say a standard GNN starts at a paper node and aggregates messages from neighbors. Its neighbors include: other papers it cites, the venue that published it, and the authors who wrote it. The GNN treats all three the same — just "neighbors" — and uses the same weight matrix to transform every message.
This isn't just an academic-graph problem. E-commerce graphs have Users, Items, and Queries. Biomedical knowledge graphs have Genes, Proteins, Diseases, and Drugs. Financial fraud graphs have Accounts, Transactions, and Merchants. In each case, the different node and edge types carry fundamentally different semantics.
When a standard GNN ignores type information, three things go wrong:
The same graph seen two ways. The left side shows what a standard GNN sees — just nodes and edges, all treated the same. The right shows the true heterogeneous structure with distinct node and edge types. Click to toggle.
A heterogeneous graph is a graph G = (V, E, τ, φ) where two type-mapping functions make it heterogeneous:
τ(v) maps each node to its node type from a set A of node type names. φ(e) maps each edge to its edge type from a set R of edge type names. When |A| = 1 and |R| = 1, we recover the standard homogeneous graph.
But the most useful concept for heterogeneous GNNs is the relation type. A relation type combines the source node type, the edge type, and the destination node type into a triplet:
For example, in the academic graph:
| Source Type | Edge Type | Destination Type | Relation r |
|---|---|---|---|
| Author | writes | Paper | (Author, writes, Paper) |
| Paper | cites | Paper | (Paper, cites, Paper) |
| Paper | published_in | Venue | (Paper, published_in, Venue) |
| Author | affiliated_with | Institution | (Author, affiliated_with, Institution) |
Each relation type r defines a distinct neighborhood. For a node v, its type-r neighbors are:
The key design choice in heterogeneous GNNs: each relation type r gets its own message-passing mechanism. What this mechanism looks like — a fixed matrix, learned attention, something else — is what separates RGCN from HGT from metapath approaches.
Click any node to highlight its neighbors, broken down by relation type. Each color is a different relation type with its own processing pathway.
Before building something complicated, let's ask whether we even need a new architecture. Can we just encode type as a feature and use a standard GNN?
The simplest approach: assign each node type a one-hot vector. If there are 3 node types, Paper gets [1,0,0], Author gets [0,1,0], Venue gets [0,0,1]. Concatenate this to the node's feature vector. Run a standard GCN. Done?
Different node types naturally have different feature dimensions. Authors might be represented by 128-dim name embeddings. Papers might have 768-dim BERT embeddings of their abstract. Venues might have 64-dim location features. You can't feed all of these through one weight matrix W ∈ ℝd×d — they're different shapes.
You could pad them all to the maximum dimension, but you'd be feeding zeros as features for most entries, which wastes capacity and misleads the model.
Suppose you want to predict a paper's research topic. The citation structure tells you a lot — papers citing similar papers tend to be on similar topics. But the venue structure also helps — venues have their own topic focus. A single weight matrix W applies the same linear transformation to messages from both relations simultaneously. It can't learn "treat citation messages this way, venue messages that way."
When you add type features and collapse everything into one message-passing scheme, you're asking a single channel to carry both relational semantics AND content. The gradient signal for "learn what citations mean" competes with "learn what authorship means." Keeping them separate gives each relation a dedicated learning channel.
In 2018, Schlichtkrull et al. proposed the Relational Graph Convolutional Network (RGCN). The core idea is elegantly simple: give each relation type its own weight matrix.
Recall standard GCN message passing for a node v:
One weight matrix W for all neighbors, regardless of edge type. RGCN changes this by making W depend on the relation type r:
Let's decode every piece:
| Symbol | What it is | Why it's there |
|---|---|---|
| Wr(l) | Weight matrix for relation r, layer l | Different relations → different linear transformations |
| W0(l) | Self-loop weight matrix | Node keeps its own identity across layers (separate from neighbors) |
| Nr(v) | Neighbors of v via relation r | We sum separately over each relation type |
| cv,r | Normalization constant for relation r at v | Prevents nodes with many neighbors in one relation from dominating; typically |Nr(v)| |
| σ | Nonlinearity (e.g., ReLU) | Enables the model to learn non-linear functions |
For each relation type r, RGCN does a mini-GCN: aggregate messages from r-type neighbors, transform with Wr. Then it sums up all these per-relation aggregations, adds the self-loop term, and applies a nonlinearity. Each Wr is free to learn a completely different linear map.
Say node v is a Paper. Its relations are: (Author, writes, Paper), (Paper, cites, Paper), (Venue, published_in, Paper). At layer l, v's update is:
# h[v] shape: (d,) — current Paper embedding # W_writes shape: (d_out, d_author) — transforms Author embeddings # W_cites shape: (d_out, d_paper) — transforms Paper embeddings # W_pub shape: (d_out, d_venue) — transforms Venue embeddings # W_self shape: (d_out, d_paper) — transforms Paper self msg_writes = mean([W_writes @ h[u] for u in N_writes(v)]) msg_cites = mean([W_cites @ h[u] for u in N_cites(v)]) msg_pub = mean([W_pub @ h[u] for u in N_pub(v)]) self_msg = W_self @ h[v] h_new[v] = relu(msg_writes + msg_cites + msg_pub + self_msg)
Each relation gets its own weight matrix. Each weight matrix can be a different shape (to handle different input feature dimensions). The outputs are all projected to the same d_out dimension so they can be summed.
RGCN is elegant. But it has a parameter count problem. Suppose you have R = 50 relation types (common in real knowledge graphs), d = 256 hidden dimensions, and L = 3 layers. Each layer needs R + 1 = 51 weight matrices of size d × d = 256 × 256. That's 51 × 65,536 = 3.3 million parameters per layer, or ~10 million parameters total.
For a large knowledge graph like Freebase, R can exceed 1,000. You'd need 1,000 × 256 × 256 = 65 million parameters per layer. This is way more than the number of training examples in many graph tasks, guaranteeing overfitting.
Instead of letting each Wr be a fully free matrix, express it as a linear combination of shared basis matrices:
Here V1, V2, ..., VB are B shared basis matrices (same shape as Wr). Each relation r gets its own coefficients ar1, ar2, ..., arB — just B scalars. The bases Vb are shared across ALL relations.
Parameter count comparison:
| Method | Parameters per layer | Example (R=50, d=256, B=8) |
|---|---|---|
| Full RGCN | R × d² | 50 × 65,536 = 3.28M |
| Basis decomposition | B × d² + R × B | 8 × 65,536 + 50 × 8 = 524,688 + 400 ≈ 525K |
| Reduction | ×(R/B) fewer | ~6× fewer parameters |
With B = 8 and R = 50, we cut parameters by 6×. With R = 1000, the savings are enormous: from 65M to ~524K + 8K ≈ 532K parameters.
The intuition is that different relation types aren't completely unrelated — they all operate in the same feature space, and many share structural patterns. Authorship and citation are different, but both are "send information from one academic entity to another." The basis matrices capture shared structure; the coefficients arb capture what's unique to each relation.
Wr = ∑ arb Vb
Shares structure across relations. Better when relations are semantically similar. B controls sharing tightness.
Wr = diag(Wr,1, ..., Wr,B)
Independent per relation, just factorized. Better when relations are semantically different. Each block learns independently.
RGCN assigns a weight matrix per relation, and every neighbor of the same relation type contributes equally (after normalization). But what if some author-to-paper relationships matter more than others? What if the model should learn to pay more attention to highly-cited co-authors?
The Heterogeneous Graph Transformer (HGT) (Hu et al., WWW 2020) answers this by combining the relation-type-specific processing of RGCN with the learned attention of GAT. For each edge (u→v) with relation r = (τ(u), φ(u,v), τ(v)), HGT computes:
Unlike standard GAT (which uses a single shared WQ), HGT uses different projection matrices for each node type. An Author node and a Paper node get projected differently before attention is computed. This handles the feature dimension mismatch problem from Chapter 2 — each node type can have a different input dimension.
The matrix Wφ(u,v)ATT depends on the edge type. So "how much paper v should attend to its writing authors" is computed differently from "how much paper v should attend to papers it cites." The edge type modulates the attention computation itself, not just the messages.
Even the message transformation depends on edge type. The aggregated message is then passed through a layer norm and a linear projection back to the target node type's dimension.
HGT's parameters scale with node types, not relation types. WQτ, WKτ, WVτ, WMSGτ are per node type. WATTφ is per edge type. With 3 node types and 4 edge types, you need 3 × 4 + 4 = 16 matrices — far fewer than the R relation types in RGCN.
class HGTLayer: def __init__(self, node_types, edge_types, d_in, d_out): # One Q, K, V per node type — handles feature dim mismatch self.W_Q = {t: Linear(d_in[t], d_out) for t in node_types} self.W_K = {t: Linear(d_in[t], d_out) for t in node_types} self.W_V = {t: Linear(d_in[t], d_out) for t in node_types} # One attention matrix per edge type self.W_ATT = {e: Linear(d_out, d_out, bias=False) for e in edge_types} # One message matrix per edge type self.W_MSG = {e: Linear(d_out, d_out) for e in edge_types} def forward(self, G, h): # For each node v, for each neighbor u via edge type e: for v in G.nodes: messages = [] for u, e in G.in_edges(v, data='etype'): Q = self.W_Q[G.ntype[v]] @ h[v] # query: target type K = self.W_K[G.ntype[u]] @ h[u] # key: source type V = self.W_V[G.ntype[u]] @ h[u] # value: source type score = Q.T @ self.W_ATT[e] @ K # edge-type attention msg = self.W_MSG[e] @ V # edge-type message messages.append((score, msg)) h_new[v] = softmax_aggregate(messages)
RGCN and HGT aggregate over direct neighbors, one hop at a time. But sometimes the most informative structure comes from longer patterns. Two researchers who both published at the same venue in the same year might be working on the same problem — but they're not directly connected in the graph.
A metapath is a sequence of relation types that defines a higher-order path pattern through the heterogeneous graph. Rather than "Author → Paper → Venue → Paper → Author," you name this whole pattern and traverse it explicitly.
Each metapath defines a different kind of "semantic similarity" between nodes. Walking the APVPA path from Author A finds all authors who published at any venue that A also published at — a way to discover researchers working in overlapping fields without direct collaboration.
In the MAGNN (Metapath Aggregated Graph Neural Network) approach, for each chosen metapath P:
βvP is the learned attention weight that node v assigns to metapath P. Different nodes can decide that different metapaths are more relevant to their classification.
A 2-layer GNN also aggregates 2-hop neighbors. The critical difference: metapaths are selective. A 2-layer GNN from an Author node aggregates over ALL 2-hop neighbors — other authors, papers of papers it cites, venues, institutions. A metapath-based approach only aggregates along semantically meaningful paths. The APVPA path ignores citation structure; the APCPA path ignores venue structure. You choose which semantic signals to model.
Choose a metapath and click "Walk" to animate traversal through the academic graph. Observe which nodes end up connected and why they're semantically related.
Here's RGCN running live on a small heterogeneous academic graph. Click any node to make it the target node — you'll see how messages from different relation types are aggregated with different weight matrices, and how turning relations on/off changes the resulting embedding.
This is the full RGCN computation in miniature: for each active relation type, aggregate neighbor messages → transform with Wr → sum → ReLU → new embedding. Watch how the embedding visualization changes when you activate/deactivate relation types.
Click a node to select it as the target. Toggle relation types with the buttons. The bar chart shows the contribution of each relation to the final embedding. Sliders set the (simulated) weight matrix "strength" for each relation.
You've now seen three approaches: RGCN (per-relation weight matrices), HGT (type-aware attention), and metapath methods. How do you decide which to use — or whether to bother with heterogeneous modeling at all?
Ask this question: Would a domain expert treat different node types differently, or are they interchangeable? If a biologist treats Genes, Proteins, and Diseases as fundamentally different entities with different feature spaces and interaction semantics, use heterogeneous GNNs. If the "types" are just arbitrary labels that don't change how nodes interact, a homogeneous GNN will do fine.
| Domain | Node Types | Why Heterogeneous | Task |
|---|---|---|---|
| Academic graphs | Author, Paper, Venue, Institution | Different features, different relation semantics (citation ≠ authorship) | Paper topic classification, citation prediction |
| Biomedical KGs | Gene, Protein, Disease, Drug, Pathway | Completely different biological entities with different interaction types | Drug-disease link prediction, side effect detection |
| E-commerce | User, Item, Query, Category | Users and items have different features; purchase ≠ search ≠ category membership | Click prediction, recommendation |
| Financial graphs | Account, Transaction, Merchant, Device | Different fraud signals per entity type; transaction and account interact differently | Fraud detection |
| Event graphs | Flight, Airport, Airline, Route | Delays propagate differently along routes vs. airline scheduling constraints | Delay prediction |
Microsoft Academic Graph: 100M+ papers, authors, venues, institutions. HGT was specifically designed and benchmarked on this graph. Tasks: predicting paper topics (node classification), predicting missing author-paper relationships (link prediction).
Drug repurposing: Knowledge graphs like Hetionet contain 47,000 nodes across 11 types and 2.2M edges across 24 relation types. RGCN with basis decomposition is the standard baseline for predicting which drugs treat which diseases.
E-commerce recommendation: Alibaba's product recommendation system models Users, Items, Queries, Categories, Brands as different node types. Type-specific message passing significantly outperforms homogeneous GNNs for click-through rate prediction.
Heterogeneous GNNs don't exist in isolation. They sit at the intersection of several powerful ideas — some of which we've already covered, and some that come next.
Lecture 10: Knowledge Graphs. Knowledge graphs (KGs) are a special case of heterogeneous graphs where edges are facts: (head_entity, relation, tail_entity). RGCN is one of the standard baselines for KG embedding and completion. The next lecture focuses on KG-specific methods like TransE and RotatE that exploit the specific structure of triple-based reasoning.
Graph Transformers (Lec 8, previous lesson). HGT is a Graph Transformer applied to heterogeneous graphs. The type-specific Q, K, V projections are exactly the heterogeneous extension of the Graphormer idea: inject structural information (here: node type and edge type) into the attention computation. GPS could in principle be extended heterogeneously.
Scalability (Lec 11). Heterogeneous GNNs face the same mini-batching challenges as homogeneous GNNs, plus additional complexity: when you sample neighbors, you need to stratify sampling by relation type to avoid dropping rare relations entirely.
| Paper | Venue | Contribution |
|---|---|---|
| Schlichtkrull et al. — R-GCN | ESWC 2018 | Per-relation weight matrices + basis decomposition |
| Hu et al. — HGT | WWW 2020 | Type-aware attention for heterogeneous graphs |
| Wang et al. — MAGNN | WWW 2019 | Metapath-based aggregation with intra-metapath attention |
| Yun et al. — GTN | NeurIPS 2019 | Graph Transformer Networks: learn metapaths end-to-end |
| Lv et al. — HeCo | KDD 2021 | Self-supervised learning on heterogeneous graphs |
"A graph is not just a structure — it is a vocabulary. Different edge types are different words. Using the same weight for every word is like reading a sentence where all words mean the same thing."
— paraphrased from the HGT paper motivation