CS224W Lecture 9

Heterogeneous Graphs

Most real-world graphs have multiple types of nodes and edges. Treating them all the same throws away the most valuable information. Here's how to build GNNs that respect the difference.

Prerequisites: Basic GNNs (Lec 3) + Attention in GNNs (Lec 4). That's it.
10
Chapters
4+
Simulations
0
Assumed Knowledge

Chapter 0: Why Heterogeneous?

Imagine you're building a recommendation system for an academic search engine. Your data looks like this: researchers write papers. Papers cite other papers. Papers get published at venues — conferences and journals. Each of these is a fundamentally different kind of entity.

Now say a standard GNN starts at a paper node and aggregates messages from neighbors. Its neighbors include: other papers it cites, the venue that published it, and the authors who wrote it. The GNN treats all three the same — just "neighbors" — and uses the same weight matrix to transform every message.

The core problem: A citation is completely different from an authorship relationship. Aggregating them with the same weight matrix is like adding apples and engine oil. You technically can, but the result means nothing.

This isn't just an academic-graph problem. E-commerce graphs have Users, Items, and Queries. Biomedical knowledge graphs have Genes, Proteins, Diseases, and Drugs. Financial fraud graphs have Accounts, Transactions, and Merchants. In each case, the different node and edge types carry fundamentally different semantics.

The Cost of Ignoring Type

When a standard GNN ignores type information, three things go wrong:

  1. Feature space mismatch. Authors might have 128-dim name embeddings. Papers have 768-dim abstract embeddings. You can't feed both through the same weight matrix without mangling one of them.
  2. Spurious correlations. The model conflates "this paper is cited a lot" with "this author published many papers" — two completely different signals mixed by the same weight.
  3. Missing relational semantics. "A cites B" and "A is published at B" tell you completely different things. Using the same weight for both loses both signals.
Academic Graph: Homogeneous vs. Heterogeneous View

The same graph seen two ways. The left side shows what a standard GNN sees — just nodes and edges, all treated the same. The right shows the true heterogeneous structure with distinct node and edge types. Click to toggle.

Homogeneous: all nodes gray, all edges the same. A standard GNN sees this.
In a citation network with Authors, Papers, and Venues, why is it a problem to use one shared weight matrix for all edge types?

Chapter 1: Formal Definition

A heterogeneous graph is a graph G = (V, E, τ, φ) where two type-mapping functions make it heterogeneous:

τ : V → A     φ : E → R

τ(v) maps each node to its node type from a set A of node type names. φ(e) maps each edge to its edge type from a set R of edge type names. When |A| = 1 and |R| = 1, we recover the standard homogeneous graph.

But the most useful concept for heterogeneous GNNs is the relation type. A relation type combines the source node type, the edge type, and the destination node type into a triplet:

r(u, v) = (τ(u), φ(u,v), τ(v))

For example, in the academic graph:

Source TypeEdge TypeDestination TypeRelation r
AuthorwritesPaper(Author, writes, Paper)
PapercitesPaper(Paper, cites, Paper)
Paperpublished_inVenue(Paper, published_in, Venue)
Authoraffiliated_withInstitution(Author, affiliated_with, Institution)

Each relation type r defines a distinct neighborhood. For a node v, its type-r neighbors are:

Nr(v) = { u : (u, v) ∈ E and φ(u,v) = r }

The key design choice in heterogeneous GNNs: each relation type r gets its own message-passing mechanism. What this mechanism looks like — a fixed matrix, learned attention, something else — is what separates RGCN from HGT from metapath approaches.

Why relation types, not just edge types? Because the same edge label can mean different things in different directions. "Follows" from a celebrity to a fan is different from "follows" from a fan to a celebrity. By encoding both source and destination type, relation types capture directionality automatically.
Relation Type Explorer

Click any node to highlight its neighbors, broken down by relation type. Each color is a different relation type with its own processing pathway.

Click a node to explore its typed neighborhoods.
An academic graph has 3 node types and 4 edge types. How many distinct relation types r = (τ(u), φ(u,v), τ(v)) can exist at most?

Chapter 2: Can We Just Add Features?

Before building something complicated, let's ask whether we even need a new architecture. Can we just encode type as a feature and use a standard GNN?

The simplest approach: assign each node type a one-hot vector. If there are 3 node types, Paper gets [1,0,0], Author gets [0,1,0], Venue gets [0,0,1]. Concatenate this to the node's feature vector. Run a standard GCN. Done?

This is called type-as-feature. It works in some settings, and it's worth knowing why it often doesn't.

Problem 1: Feature Dimension Mismatch

Different node types naturally have different feature dimensions. Authors might be represented by 128-dim name embeddings. Papers might have 768-dim BERT embeddings of their abstract. Venues might have 64-dim location features. You can't feed all of these through one weight matrix W ∈ ℝd×d — they're different shapes.

You could pad them all to the maximum dimension, but you'd be feeding zeros as features for most entries, which wastes capacity and misleads the model.

Problem 2: Different Relations Deserve Different Weights

Suppose you want to predict a paper's research topic. The citation structure tells you a lot — papers citing similar papers tend to be on similar topics. But the venue structure also helps — venues have their own topic focus. A single weight matrix W applies the same linear transformation to messages from both relations simultaneously. It can't learn "treat citation messages this way, venue messages that way."

The key intuition: A one-hot type feature tells the GNN WHAT kind of neighbor is sending a message. But it can't change HOW that message is processed — the weight matrix is still shared. Type-specific weight matrices change the processing, not just the content.

Problem 3: Over-Squashing Gets Worse

When you add type features and collapse everything into one message-passing scheme, you're asking a single channel to carry both relational semantics AND content. The gradient signal for "learn what citations mean" competes with "learn what authorship means." Keeping them separate gives each relation a dedicated learning channel.

Type-as-Feature (Fails When...)

  • Node features have different dimensions per type
  • Relations carry semantically distinct information
  • You need fine-grained relational reasoning
  • Downstream task is relation-specific (e.g., link prediction on ONE edge type)

Type-as-Feature (Works When...)

  • All node types have same feature dimension
  • Type distinction is minor (e.g., "user type A vs B" in same domain)
  • Graph is small and parameters are scarce
  • You need a fast baseline before investing in RGCN/HGT
Authors have 128-dim embeddings, Papers have 768-dim embeddings, Venues have 64-dim embeddings. What happens if you pad all to 768-dim and apply a standard GCN?

Chapter 3: RGCN: Relation Weights

In 2018, Schlichtkrull et al. proposed the Relational Graph Convolutional Network (RGCN). The core idea is elegantly simple: give each relation type its own weight matrix.

Recall standard GCN message passing for a node v:

hv(l+1) = σ( W(l)u ∈ N(v) (1/|N(v)|) hu(l) )

One weight matrix W for all neighbors, regardless of edge type. RGCN changes this by making W depend on the relation type r:

hv(l+1) = σ( W0(l) hv(l) + ∑r ∈ Ru ∈ Nr(v) (1/cv,r) Wr(l) hu(l) )

Let's decode every piece:

SymbolWhat it isWhy it's there
Wr(l)Weight matrix for relation r, layer lDifferent relations → different linear transformations
W0(l)Self-loop weight matrixNode keeps its own identity across layers (separate from neighbors)
Nr(v)Neighbors of v via relation rWe sum separately over each relation type
cv,rNormalization constant for relation r at vPrevents nodes with many neighbors in one relation from dominating; typically |Nr(v)|
σNonlinearity (e.g., ReLU)Enables the model to learn non-linear functions

What's Actually Happening

For each relation type r, RGCN does a mini-GCN: aggregate messages from r-type neighbors, transform with Wr. Then it sums up all these per-relation aggregations, adds the self-loop term, and applies a nonlinearity. Each Wr is free to learn a completely different linear map.

The self-loop weight W0 is important. Without it, a node's own representation only enters the update as "one more neighbor." With it, the model explicitly learns how much weight to give a node's own current embedding, independent of its neighbors. This is crucial for preserving identity across layers.

Data Flow: Concrete Example

Say node v is a Paper. Its relations are: (Author, writes, Paper), (Paper, cites, Paper), (Venue, published_in, Paper). At layer l, v's update is:

# h[v] shape: (d,) — current Paper embedding
# W_writes shape: (d_out, d_author) — transforms Author embeddings
# W_cites shape: (d_out, d_paper) — transforms Paper embeddings
# W_pub shape: (d_out, d_venue) — transforms Venue embeddings
# W_self shape: (d_out, d_paper) — transforms Paper self

msg_writes = mean([W_writes @ h[u] for u in N_writes(v)])
msg_cites  = mean([W_cites  @ h[u] for u in N_cites(v)])
msg_pub    = mean([W_pub    @ h[u] for u in N_pub(v)])
self_msg   = W_self @ h[v]

h_new[v] = relu(msg_writes + msg_cites + msg_pub + self_msg)

Each relation gets its own weight matrix. Each weight matrix can be a different shape (to handle different input feature dimensions). The outputs are all projected to the same d_out dimension so they can be summed.

In RGCN with R relation types and L layers, how many weight matrices are there per layer?

Chapter 4: Basis Decomposition

RGCN is elegant. But it has a parameter count problem. Suppose you have R = 50 relation types (common in real knowledge graphs), d = 256 hidden dimensions, and L = 3 layers. Each layer needs R + 1 = 51 weight matrices of size d × d = 256 × 256. That's 51 × 65,536 = 3.3 million parameters per layer, or ~10 million parameters total.

For a large knowledge graph like Freebase, R can exceed 1,000. You'd need 1,000 × 256 × 256 = 65 million parameters per layer. This is way more than the number of training examples in many graph tasks, guaranteeing overfitting.

The overfitting signal: If you have more parameters than training nodes, your model is memorizing, not learning. With rare relations (few training examples), each W_r has barely any gradient signal to learn from.

Solution: Basis Decomposition

Instead of letting each Wr be a fully free matrix, express it as a linear combination of shared basis matrices:

Wr = ∑b=1B arb · Vb

Here V1, V2, ..., VB are B shared basis matrices (same shape as Wr). Each relation r gets its own coefficients ar1, ar2, ..., arB — just B scalars. The bases Vb are shared across ALL relations.

Parameter count comparison:

MethodParameters per layerExample (R=50, d=256, B=8)
Full RGCNR × d²50 × 65,536 = 3.28M
Basis decompositionB × d² + R × B8 × 65,536 + 50 × 8 = 524,688 + 400 ≈ 525K
Reduction×(R/B) fewer~6× fewer parameters

With B = 8 and R = 50, we cut parameters by 6×. With R = 1000, the savings are enormous: from 65M to ~524K + 8K ≈ 532K parameters.

Why Does This Work?

The intuition is that different relation types aren't completely unrelated — they all operate in the same feature space, and many share structural patterns. Authorship and citation are different, but both are "send information from one academic entity to another." The basis matrices capture shared structure; the coefficients arb capture what's unique to each relation.

Block-diagonal decomposition is an alternative: constrain W_r to be block-diagonal, which means W_r is zero outside B diagonal blocks of size (d/B) × (d/B). This reduces parameters from d² to d²/B without introducing shared bases — different trade-off, similar savings.

Block Decomposition vs. Basis Decomposition

Basis Decomposition

Wr = ∑ arb Vb

Shares structure across relations. Better when relations are semantically similar. B controls sharing tightness.

Block Decomposition

Wr = diag(Wr,1, ..., Wr,B)

Independent per relation, just factorized. Better when relations are semantically different. Each block learns independently.

In basis decomposition W_r = Σ a_{rb} V_b, what do the basis matrices V_b represent?

Chapter 5: HGT: Type-Aware Attention

RGCN assigns a weight matrix per relation, and every neighbor of the same relation type contributes equally (after normalization). But what if some author-to-paper relationships matter more than others? What if the model should learn to pay more attention to highly-cited co-authors?

The Heterogeneous Graph Transformer (HGT) (Hu et al., WWW 2020) answers this by combining the relation-type-specific processing of RGCN with the learned attention of GAT. For each edge (u→v) with relation r = (τ(u), φ(u,v), τ(v)), HGT computes:

Step 1: Type-Specific Q, K, V Projections

Qv = WQτ(v) hv    Ku = WKτ(u) hu    Vu = WVτ(u) hu

Unlike standard GAT (which uses a single shared WQ), HGT uses different projection matrices for each node type. An Author node and a Paper node get projected differently before attention is computed. This handles the feature dimension mismatch problem from Chapter 2 — each node type can have a different input dimension.

Step 2: Relation-Dependent Attention Score

attn(u, v) = softmaxu ∈ N(v)( QvT · WATTφ(u,v) · Ku / √d )

The matrix Wφ(u,v)ATT depends on the edge type. So "how much paper v should attend to its writing authors" is computed differently from "how much paper v should attend to papers it cites." The edge type modulates the attention computation itself, not just the messages.

Step 3: Type-Aware Message Aggregation

hvnew = ∑u ∈ N(v) attn(u, v) · (WMSGφ(u,v) · Vu)

Even the message transformation depends on edge type. The aggregated message is then passed through a layer norm and a linear projection back to the target node type's dimension.

HGT vs. RGCN — the key difference: RGCN uses fixed per-relation weight matrices with uniform averaging. HGT uses learned attention weights that vary per pair of nodes, with type-specific projections for the query, key, value, and message. HGT is strictly more expressive — it can learn "ignore distant authors, focus on recent collaborators" in a way RGCN cannot.

Parameter Count

HGT's parameters scale with node types, not relation types. WQτ, WKτ, WVτ, WMSGτ are per node type. WATTφ is per edge type. With 3 node types and 4 edge types, you need 3 × 4 + 4 = 16 matrices — far fewer than the R relation types in RGCN.

class HGTLayer:
    def __init__(self, node_types, edge_types, d_in, d_out):
        # One Q, K, V per node type — handles feature dim mismatch
        self.W_Q = {t: Linear(d_in[t], d_out) for t in node_types}
        self.W_K = {t: Linear(d_in[t], d_out) for t in node_types}
        self.W_V = {t: Linear(d_in[t], d_out) for t in node_types}
        # One attention matrix per edge type
        self.W_ATT = {e: Linear(d_out, d_out, bias=False) for e in edge_types}
        # One message matrix per edge type
        self.W_MSG = {e: Linear(d_out, d_out) for e in edge_types}

    def forward(self, G, h):
        # For each node v, for each neighbor u via edge type e:
        for v in G.nodes:
            messages = []
            for u, e in G.in_edges(v, data='etype'):
                Q = self.W_Q[G.ntype[v]] @ h[v]   # query: target type
                K = self.W_K[G.ntype[u]] @ h[u]   # key: source type
                V = self.W_V[G.ntype[u]] @ h[u]   # value: source type
                score = Q.T @ self.W_ATT[e] @ K   # edge-type attention
                msg = self.W_MSG[e] @ V            # edge-type message
                messages.append((score, msg))
            h_new[v] = softmax_aggregate(messages)
In HGT, the query Q_v uses W_Q^{τ(v)} and the key K_u uses W_K^{τ(u)}. Why use different projection matrices for different node types rather than a shared W_Q?

Chapter 6: Metapaths

RGCN and HGT aggregate over direct neighbors, one hop at a time. But sometimes the most informative structure comes from longer patterns. Two researchers who both published at the same venue in the same year might be working on the same problem — but they're not directly connected in the graph.

A metapath is a sequence of relation types that defines a higher-order path pattern through the heterogeneous graph. Rather than "Author → Paper → Venue → Paper → Author," you name this whole pattern and traverse it explicitly.

Example metapaths in the academic graph:
  • APA: Author → Paper → Author (co-authors)
  • APVPA: Author → Paper → Venue → Paper → Author (same-venue co-authors)
  • APCPA: Author → Paper → Cites → Paper → Author (authors whose papers cite each other)
  • PVP: Paper → Venue → Paper (papers at the same venue)

Each metapath defines a different kind of "semantic similarity" between nodes. Walking the APVPA path from Author A finds all authors who published at any venue that A also published at — a way to discover researchers working in overlapping fields without direct collaboration.

Metapath-Based Aggregation

In the MAGNN (Metapath Aggregated Graph Neural Network) approach, for each chosen metapath P:

  1. Find all metapath instances. These are actual paths in the graph following the metapath pattern. For APVPA starting from author u: enumerate all u → p1 → v → p2 → author_end paths.
  2. Encode each instance. Aggregate the node features along the path (e.g., using attention over intermediate nodes).
  3. Aggregate across instances. All instances of the same metapath connecting u to its metapath-neighbors are aggregated.
  4. Aggregate across metapaths. Use attention to weight which metapaths matter most for the current task.
hv = ∑P ∈ metapaths βvP · aggP(v)

βvP is the learned attention weight that node v assigns to metapath P. Different nodes can decide that different metapaths are more relevant to their classification.

Metapaths vs. Standard Multi-Hop GNN

A 2-layer GNN also aggregates 2-hop neighbors. The critical difference: metapaths are selective. A 2-layer GNN from an Author node aggregates over ALL 2-hop neighbors — other authors, papers of papers it cites, venues, institutions. A metapath-based approach only aggregates along semantically meaningful paths. The APVPA path ignores citation structure; the APCPA path ignores venue structure. You choose which semantic signals to model.

Downside: Metapaths require domain expertise to design. You need to decide in advance which paths are semantically meaningful. RGCN and HGT are metapath-free — they let the model discover which relations matter through learning.
Metapath Traversal Simulator

Choose a metapath and click "Walk" to animate traversal through the academic graph. Observe which nodes end up connected and why they're semantically related.

Select a metapath to begin traversal.
Two authors both published papers at NeurIPS 2023, but they never co-authored a paper directly. Which metapath would connect them?

Chapter 7: RGCN Showcase

Here's RGCN running live on a small heterogeneous academic graph. Click any node to make it the target node — you'll see how messages from different relation types are aggregated with different weight matrices, and how turning relations on/off changes the resulting embedding.

This is the full RGCN computation in miniature: for each active relation type, aggregate neighbor messages → transform with Wr → sum → ReLU → new embedding. Watch how the embedding visualization changes when you activate/deactivate relation types.

RGCN Live — Heterogeneous Message Passing

Click a node to select it as the target. Toggle relation types with the buttons. The bar chart shows the contribution of each relation to the final embedding. Sliders set the (simulated) weight matrix "strength" for each relation.

W_writes strength 1.0
W_cites strength 1.0
W_pub_in strength 1.0
Select a node above to see its RGCN embedding breakdown.

Chapter 8: When Should You Use Heterogeneous GNNs?

You've now seen three approaches: RGCN (per-relation weight matrices), HGT (type-aware attention), and metapath methods. How do you decide which to use — or whether to bother with heterogeneous modeling at all?

The Core Decision: Do Types Carry Semantic Meaning?

Ask this question: Would a domain expert treat different node types differently, or are they interchangeable? If a biologist treats Genes, Proteins, and Diseases as fundamentally different entities with different feature spaces and interaction semantics, use heterogeneous GNNs. If the "types" are just arbitrary labels that don't change how nodes interact, a homogeneous GNN will do fine.

DomainNode TypesWhy HeterogeneousTask
Academic graphsAuthor, Paper, Venue, InstitutionDifferent features, different relation semantics (citation ≠ authorship)Paper topic classification, citation prediction
Biomedical KGsGene, Protein, Disease, Drug, PathwayCompletely different biological entities with different interaction typesDrug-disease link prediction, side effect detection
E-commerceUser, Item, Query, CategoryUsers and items have different features; purchase ≠ search ≠ category membershipClick prediction, recommendation
Financial graphsAccount, Transaction, Merchant, DeviceDifferent fraud signals per entity type; transaction and account interact differentlyFraud detection
Event graphsFlight, Airport, Airline, RouteDelays propagate differently along routes vs. airline scheduling constraintsDelay prediction

Choosing Among RGCN, HGT, and Metapaths

Use RGCN when:
  • Moderate number of relation types (<50)
  • All neighbors of a relation type are equally important
  • You need a strong, interpretable baseline
  • Knowledge graph completion (standard benchmark)
  • Use basis decomposition if R > 10
Use HGT when:
  • Neighbor importance varies (some authors more relevant than others)
  • Node types have different feature dimensions
  • Large heterogeneous graphs (HGT scales better with types)
  • You want attention weights for interpretability
  • Microsoft Academic Graph–scale problems
Use metapaths when: you have strong domain expertise about which paths are semantically meaningful, the graph has long-range dependencies not captured by 2-3 hop GNNs, or you want explicit control over which information flows where. Cost: metapath design requires human effort upfront; wrong metapaths give wrong answers.

Real Applications

Microsoft Academic Graph: 100M+ papers, authors, venues, institutions. HGT was specifically designed and benchmarked on this graph. Tasks: predicting paper topics (node classification), predicting missing author-paper relationships (link prediction).

Drug repurposing: Knowledge graphs like Hetionet contain 47,000 nodes across 11 types and 2.2M edges across 24 relation types. RGCN with basis decomposition is the standard baseline for predicting which drugs treat which diseases.

E-commerce recommendation: Alibaba's product recommendation system models Users, Items, Queries, Categories, Brands as different node types. Type-specific message passing significantly outperforms homogeneous GNNs for click-through rate prediction.

A knowledge graph has 8 node types and 200 relation types. Full RGCN with d=256 requires 200 × 256² ≈ 13M parameters per layer. Which mitigation is most appropriate?

Chapter 9: Connections

Heterogeneous GNNs don't exist in isolation. They sit at the intersection of several powerful ideas — some of which we've already covered, and some that come next.

Where This Came From

GCN (Lec 3)
Uniform neighbor averaging with one shared weight matrix. Homogeneous only.
↓ add attention
GAT (Lec 4)
Learned attention weights per neighbor. Still homogeneous — one Q, K, V for all edge types.
↓ add relation types
RGCN (2018)
Per-relation weight matrix W_r. Heterogeneous, but uniform within a relation type.
↓ combine both
HGT (2020)
Type-specific attention with per-type Q, K, V. Heterogeneous AND attention-weighted.

Where This Goes Next

Lecture 10: Knowledge Graphs. Knowledge graphs (KGs) are a special case of heterogeneous graphs where edges are facts: (head_entity, relation, tail_entity). RGCN is one of the standard baselines for KG embedding and completion. The next lecture focuses on KG-specific methods like TransE and RotatE that exploit the specific structure of triple-based reasoning.

Graph Transformers (Lec 8, previous lesson). HGT is a Graph Transformer applied to heterogeneous graphs. The type-specific Q, K, V projections are exactly the heterogeneous extension of the Graphormer idea: inject structural information (here: node type and edge type) into the attention computation. GPS could in principle be extended heterogeneously.

Scalability (Lec 11). Heterogeneous GNNs face the same mini-batching challenges as homogeneous GNNs, plus additional complexity: when you sample neighbors, you need to stratify sampling by relation type to avoid dropping rare relations entirely.

Key Papers

PaperVenueContribution
Schlichtkrull et al. — R-GCNESWC 2018Per-relation weight matrices + basis decomposition
Hu et al. — HGTWWW 2020Type-aware attention for heterogeneous graphs
Wang et al. — MAGNNWWW 2019Metapath-based aggregation with intra-metapath attention
Yun et al. — GTNNeurIPS 2019Graph Transformer Networks: learn metapaths end-to-end
Lv et al. — HeCoKDD 2021Self-supervised learning on heterogeneous graphs

Related Lessons

Closing insight: The story of graph learning is: start with the assumption that all nodes and edges are the same, then relax that assumption one constraint at a time. GCN relaxes "all nodes identical" by using features. GAT relaxes "all neighbors equal" by learning attention. RGCN relaxes "all edge types identical" by learning per-relation weights. HGT relaxes "all projections shared" by making Q, K, V type-specific. Each relaxation unlocks more expressive power at the cost of more parameters — the art is knowing which relaxation your task actually needs.
"A graph is not just a structure — it is a vocabulary. Different edge types are different words. Using the same weight for every word is like reading a sentence where all words mean the same thing."
— paraphrased from the HGT paper motivation