Hu et al., UCLA & Microsoft Research, WWW 2020

HGT: Heterogeneous Graph Transformer

Bring Transformer-style attention to heterogeneous graphs: type-specific projections, mutual attention between source and target types, and relative temporal encoding for dynamic academic networks.

Prerequisites: Transformer attention (Q/K/V) + Graph neural networks (GCN, GAT) + Heterogeneous graphs (typed nodes + edges)

Chapter 0: The Problem

Consider the Open Academic Graph (OAG): about 178 million nodes of several types, including Papers, Authors, and Venues, and edges of multiple types: paper cites paper, author writes paper, paper published in venue. This is a heterogeneous graph: nodes and edges have different semantic types.

The tasks: predict a paper's research field (node classification) and predict author-venue connections (link prediction). Both require aggregating information across heterogeneous neighbors — papers, authors, venues — in a way that respects the different semantics of each type.

Why neither standard GCN nor R-GCN is enough: Standard GCN ignores type differences entirely. R-GCN uses separate weight matrices per relation type but combines neighbors with fixed normalization weights, so it cannot learn which sources are more relevant for each target node given the context. HGT adds dynamic, content-dependent attention, the Transformer way.

The Key Ingredients HGT Adds

Three ideas on top of R-GCN:

Type-specific Q/K/V projections: Different node types get different linear projections for queries, keys, and values. An "Author" node's feature vector is projected differently than a "Paper" node's.
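Mutual attention: The attention weight between source node s and target node t is computed from the target's query and the source's key, each projected according to its own node type, and their interaction is parameterized by the edge type that connects them. GAT, by contrast, scores every pair with one shared projection and attention vector, regardless of type.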
Relative temporal encoding: For dynamic graphs (papers published over time), the positional encoding from Transformers is adapted to encode the time gap between source and target events.

Heterogeneous Graph: Academic Network

A small heterogeneous academic graph. Orange = Papers, Teal = Authors, Purple = Venues. Edge labels show relation types. Each type needs different treatment during aggregation.

What does HGT add to R-GCN's type-specific weight matrices?

Deeper networks (more layers)
Larger hidden dimensions
Dynamic, content-dependent attention: which neighbors to attend to depends on the actual node features, not just the relation type

Chapter 1: Type-Specific Projections

In a standard Transformer, queries, keys, and values are computed by multiplying the input by matrices W_Q, W_K, W_V. In HGT, each node type gets its own set of projection matrices.

The Type Projection

Let φ(v) denote the type of node v (e.g., Paper, Author, Venue). The type-specific projection maps any node's embedding to a shared attention space:

K^(l)_s = K-Linear_φ(s)( H^(l-1)_s )
Q^(l)_t = Q-Linear_φ(t)( H^(l-1)_t )
V^(l)_s = V-Linear_φ(s)( H^(l-1)_s )

where K-Linear_φ(s) is a learnable d × d_head matrix specific to type φ(s). Authors, Papers, and Venues each have different K, Q, V projection matrices.

Why different projections matter: An Author's 128-dim embedding and a Paper's 128-dim embedding live in very different semantic spaces — the dimensions mean different things. Projecting them through the same W_K before computing attention would be like asking "is this Author's dimension 37 similar to this Paper's dimension 37?" — a meaningless comparison. Type-specific projections first map each type into a shared semantic space where comparisons are meaningful.
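A minimal NumPy sketch of type-specific projections, for a single head. The sizes, the random weights standing in for learned parameters, and the helper name `project` are illustrative assumptions; bias terms are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_head = 8, 4                       # illustrative sizes, not the paper's
node_types = ["Paper", "Author", "Venue"]

# One (d x d_head) matrix per node type for each of K, Q and V.
# Random values stand in for learned weights.
proj = {
    kind: {t: rng.normal(scale=d ** -0.5, size=(d, d_head)) for t in node_types}
    for kind in ("K", "Q", "V")
}

def project(kind, node_type, h):
    """Apply the K-, Q- or V-Linear that belongs to `node_type` (hypothetical helper)."""
    return h @ proj[kind][node_type]

h = rng.normal(size=d)                     # one raw embedding ...
k_as_paper = project("K", "Paper", h)      # ... projected as a Paper
k_as_author = project("K", "Author", h)    # ... and as an Author
```

The same input vector lands at different points of the attention space depending on which type's matrix is applied, which is the effect the paragraph above describes.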

Parameterization

With |T_N| node types and h attention heads of size d_head:

K projections: |T_N| × h matrices, each d × d_head
Q projections: |T_N| × h matrices, each d × d_head
V projections: |T_N| × h matrices, each d × d_head

With 3 node types, 8 heads, d = 256, d_head = 32: 3 × 8 × 256 × 32 × 3 = 589,824, or about 0.59M parameters per layer for the projections. This is a modest cost.
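The projection count can be checked directly. The function name below is a hypothetical helper; it simply multiplies the factors listed above (node types × heads × d × d_head × the three projection kinds K, Q, V).

```python
def projection_params(num_node_types, num_heads, d, d_head, kinds=3):
    """Parameters in the K/Q/V projections of one layer (biases ignored)."""
    return num_node_types * num_heads * d * d_head * kinds

total = projection_params(num_node_types=3, num_heads=8, d=256, d_head=32)
print(total)  # 589824
```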

Type-Specific Projection Space

Different node types (Paper, Author, Venue) are projected into a shared 2D attention space. Notice how the same raw feature value maps to very different locations depending on the node type.

In HGT, if there are 4 node types and 8 attention heads, how many separate Key projection matrices are needed per layer?

1: a single K matrix shared across all types
4: one per node type (heads are a detail within each type's computation)
32: 4 node types × 8 heads (each head within each type has its own projection)

Chapter 2: Heterogeneous Attention (Showcase)

Given the type-specific K and Q projections, how is the attention weight between source s and target t computed? HGT's attention mechanism has one key difference from standard Transformer attention: it places a relation-type-specific interaction matrix between the query and the key in the attention score.

The Mutual Attention Formula

Attn(s, e, t) = softmax_{s ∈ N(t)}( Q_t W^ATT_φ(e) K_s^T / √d_head )

where W^ATT_φ(e) ∈ ℝ^{d_head×d_head} is an edge-type-specific interaction matrix — a small matrix that captures how query (target) and key (source) dimensions should interact for this particular relation type.
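The formula above can be sketched for one head as follows. The edge-type names, sizes, and random weights are assumptions made for illustration; `hetero_attention` is a hypothetical helper, not an API from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_head = 4
edge_types = ["cites", "written_by", "published_in"]   # hypothetical names

# One d_head x d_head interaction matrix per edge type (random stand-ins).
W_att = {e: rng.normal(size=(d_head, d_head)) for e in edge_types}

def hetero_attention(q_t, neighbors):
    """Softmax over the neighbors of one target.

    q_t:       query vector of the target, shape (d_head,)
    neighbors: list of (k_s, edge_type) pairs, k_s of shape (d_head,)
    """
    scores = np.array([q_t @ W_att[e] @ k_s for k_s, e in neighbors])
    scores = scores / np.sqrt(d_head)
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

q_t = rng.normal(size=d_head)
neighbors = [
    (rng.normal(size=d_head), e)
    for e in ["cites", "cites", "written_by", "published_in"]
]
attn = hetero_attention(q_t, neighbors)
```

Two neighbors with identical keys can still receive different weights if they arrive through different edge types, because each type uses its own `W_att` matrix.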

Compare to standard Transformer: Attn = softmax(Q K^T / √d). HGT adds W^ATT_φ(e) between Q and K. This matrix is learned per relation type and parameterizes how the target "queries" for relevant sources through this type of edge.

"Mutual" attention: Standard GAT also looks at both endpoints of an edge, but it scores every pair with one shared weight matrix and attention vector, blind to node and edge types. In HGT, the question "how important is Author A to Paper P?" depends on the features of both A and P, each projected by its own node type, and is mediated by the "author writes" relation type. This is richer than GAT's type-agnostic score.

Heterogeneous Attention — Interactive

A target Paper node aggregates from its heterogeneous neighbors. The attention weight for each source depends on the source type, target features, and edge type.

What does the edge-type matrix W^ATT_φ(e) in HGT's attention formula capture?

The number of edges of this type in the graph (used for normalization)
The feature importance of the source node type
How the target's query dimensions should interact with the source's key dimensions for this specific edge type: a learned relation-type-specific interaction pattern

Chapter 3: Message Passing

Once attention weights are computed, messages are created from source nodes and aggregated at the target. HGT's message computation also has a type-specific component.

Message Computation

For each source-target pair (s, t) via edge e, the message is:

Msg(s, e, t) = V^l_s · W^MSG_φ(e)

where W^MSG_φ(e) ∈ ℝ^{d_head×d_head} is another edge-type-specific matrix — this one transforming the value vector before aggregation. Like W^ATT, it's shared across all edges of the same type.

Aggregation and Update

Messages from all source nodes are aggregated at target t using the computed attention weights:

H̃^(l)_t = ⊕_{h ∈ [H]} Aggregate( { Attn_h(s, e, t) · Msg_h(s, e, t) : (s, e) ∈ N(t) } )

where ⊕ is concatenation across heads and N(t) is all edges pointing to t. The aggregation function is simply a weighted sum — just like standard Transformer's attention output.

Finally, the aggregated embedding is updated to produce the new node embedding:

H^(l)_t = A-Linear_φ(t)( σ( H̃^(l)_t ) ) + H^(l-1)_t

Note the residual connection — just like Transformers. A-Linear_φ(t) is a type-specific output projection that maps back to the original d-dimensional space.

The full parameter set per HGT layer: K/Q/V projections (type-specific), W^ATT interaction matrices (edge-type-specific), W^MSG message matrices (edge-type-specific), A-Linear output projections (type-specific). Every parameter is indexed by either a node type or an edge type; none is shared across all types.

For each edge (s, e, t)

Project s → K_s, t → Q_t, s → V_s (type-specific)

↓

Attention

Attn = softmax(Q_t W^ATT_φ(e) K_s^T / √d_head)

↓

Message

Msg = V_s · W^MSG_φ(e)

↓ weighted sum over N(t)

Aggregate + Update

H^(l)_t = A-Linear_φ(t)(σ(Σ Attn·Msg)) + H^(l-1)_t
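The pipeline above can be sketched end to end for one target node and a single head. Everything concrete here is an assumption made for illustration: the sizes, the relation names, the random weights, and the use of ReLU for σ (this text does not specify the activation).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                                   # single head, so d_head == d in this sketch
node_types = ["Paper", "Author"]
edge_types = ["cites", "written_by"]    # hypothetical relation names

def make(keys):
    """One small random d x d matrix per key (stand-in for learned weights)."""
    return {k: rng.normal(scale=0.3, size=(d, d)) for k in keys}

K_lin, Q_lin, V_lin, A_lin = (make(node_types) for _ in range(4))
W_att, W_msg = make(edge_types), make(edge_types)

def hgt_update(h_t, type_t, sources):
    """One-head update of target t; sources = [(h_s, type_s, edge_type), ...]."""
    q_t = h_t @ Q_lin[type_t]
    scores, msgs = [], []
    for h_s, type_s, e in sources:
        k_s = h_s @ K_lin[type_s]               # type-specific key
        v_s = h_s @ V_lin[type_s]               # type-specific value
        scores.append(q_t @ W_att[e] @ k_s / np.sqrt(d))
        msgs.append(v_s @ W_msg[e])             # edge-type-specific message
    scores = np.array(scores)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                          # softmax over N(t)
    agg = attn @ np.stack(msgs)                 # attention-weighted sum
    # sigma = ReLU (assumed), then type-specific output projection + residual.
    return np.maximum(agg, 0.0) @ A_lin[type_t] + h_t

h_paper = rng.normal(size=d)
sources = [
    (rng.normal(size=d), "Paper", "cites"),
    (rng.normal(size=d), "Author", "written_by"),
]
h_new = hgt_update(h_paper, "Paper", sources)
```

A multi-head version would run this computation once per head on d_head-sized slices and concatenate the aggregated results before the output projection.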

What is the role of the residual connection in HGT's update rule H^l_t = ... + H^l-1_t?

It averages the new embedding with the old one for stability
It allows the model to process edges in reverse direction
It preserves the node's own information from the previous layer even if neighborhood aggregation overwrites it, which enables stable training of deep (many-layer) HGT stacks

Chapter 4: Relative Temporal Encoding

Academic graphs are inherently temporal: a paper published in 2020 can cite papers from 2010, 2015, or 2019. The relationship "cited by a paper from 5 years in the future" is different from "cited by a paper from last year." HGT introduces relative temporal encoding (RTE) to capture this.

The Challenge

Transformers use absolute positional encodings (position 1, 2, 3, ...). For graphs, this doesn't apply — nodes don't have positions, only timestamps. And the relevant quantity is the difference in timestamps between source and target, not their absolute times.

RTE Design

For each edge (s, e, t) where source s has timestamp τ_s and target t has timestamp τ_t, the relative time is Δτ = τ_t − τ_s. This scalar is encoded as a vector using sinusoidal basis functions (borrowed from transformer positional encodings):

RTE(Δτ)[2i] = sin( Δτ / 10000^(2i/d) )
RTE(Δτ)[2i+1] = cos( Δτ / 10000^(2i/d) )
for i = 0, 1, ..., d/2 − 1, so even dimensions use sine and odd dimensions use cosine.

This d-dimensional vector is added to the source node's initial embedding before the Q/K/V projections: H_s ← H_s + W_RTE · RTE(Δτ). The model can thus learn to weight recent vs distant sources differently.

Why relative, not absolute? A paper from 1990 that's cited by a 2020 paper looks very different as a source than the same 1990 paper cited by a 1992 paper. The absolute timestamp of the source doesn't matter as much as how far in the past it is relative to the target. Relative encoding captures this temporal distance, not just the year.
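A small sketch of the encoding described above, assuming an even dimension d and the same frequency for each sine/cosine pair. The matrix `W_rte` is a random stand-in for the learned W_RTE, and both function names are hypothetical.

```python
import numpy as np

def sinusoid_basis(delta_t, d):
    """d-dimensional sinusoidal encoding of a scalar time gap (d must be even)."""
    i = np.arange(d // 2)
    freq = 1.0 / (10000 ** (2 * i / d))
    enc = np.empty(d)
    enc[0::2] = np.sin(delta_t * freq)   # even dimensions
    enc[1::2] = np.cos(delta_t * freq)   # odd dimensions
    return enc

rng = np.random.default_rng(0)
d = 8
W_rte = rng.normal(scale=0.1, size=(d, d))   # learned in a real model

def add_rte(h_s, tau_s, tau_t):
    """Shift the source embedding by the encoded gap between target and source."""
    return h_s + W_rte @ sinusoid_basis(tau_t - tau_s, d)

h_1990 = rng.normal(size=d)
seen_from_1992 = add_rte(h_1990, tau_s=1990, tau_t=1992)
seen_from_2020 = add_rte(h_1990, tau_s=1990, tau_t=2020)
```

The same 1990 source embedding becomes two different inputs to attention, one per target year, which is what lets the model treat recent and distant sources differently.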

Relative Temporal Encoding

Sinusoidal basis functions for different frequencies encode the time difference Δτ. Short time differences are encoded differently than long ones. This encoding is added to the source embedding before attention.

In HGT's Relative Temporal Encoding, what quantity is encoded as the temporal signal?

The absolute publication year of the source paper
The absolute publication year of the target paper
The difference Δτ = τ_target − τ_source: how far in the past the source is relative to the target

Chapter 5: HGSampling

OAG has 178 million nodes. You can't load the full graph into memory, let alone train a GNN on all of it at once. HGT uses HGSampling: a heterogeneity-aware mini-batch sampling strategy.

The Problem with Naive Sampling

Standard neighbor sampling (like GraphSAGE's) randomly samples k neighbors per node. On a heterogeneous graph, this can completely miss rare node types. If a Paper has 1,000 Paper neighbors through citations but only a handful of Author and Venue neighbors, uniformly sampling 10 of its neighbors will mostly return Papers, leaving Authors and Venues chronically under-represented.

HGSampling Design

HGSampling samples per type: for each target node, it separately samples k/|types| neighbors of each type. This ensures balanced representation across all node types in every mini-batch, regardless of the degree distribution skew.

Implementation detail: Maintain a separate neighbor list per (node, type) pair. For each batch, sample independently from each list. If a type has fewer than k/|types| neighbors (rare type), use all available neighbors. The resulting subgraph is "type-balanced" by construction.
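The per-type sampling described above can be sketched with the standard library. This follows the simplified description in this section (an equal quota per type, taking everything when a type is scarce); the published algorithm may differ in details such as how candidates are weighted. The function name and the toy neighbor lists are hypothetical.

```python
import random

def sample_per_type(neighbors_by_type, k, rng=random):
    """Pick about k neighbors, split evenly across the node types present."""
    quota = max(1, k // len(neighbors_by_type))
    sampled = {}
    for node_type, candidates in neighbors_by_type.items():
        if len(candidates) <= quota:
            sampled[node_type] = list(candidates)      # rare type: keep all
        else:
            sampled[node_type] = rng.sample(candidates, quota)
    return sampled

neighbors = {
    "Paper": [f"p{i}" for i in range(1000)],
    "Author": [f"a{i}" for i in range(50)],
    "Venue": [f"v{i}" for i in range(5)],
}
batch = sample_per_type(neighbors, k=15, rng=random.Random(0))
print({t: len(v) for t, v in batch.items()})
# {'Paper': 5, 'Author': 5, 'Venue': 5}
```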

Why Balanced Sampling Matters

If Venue nodes are undersampled during training, the W^ATT matrix for "paper published in venue" edges gets very few gradient updates. The model never learns to attend well to venue information. HGSampling ensures every relation type gets adequate gradient signal throughout training.

Naive vs HGSampling — Type Balance

Consider a target Paper node with 1000 Paper neighbors, 50 Author neighbors, and 5 Venue neighbors. Standard sampling (left) almost never includes Venues. HGSampling (right) guarantees balanced representation.
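How rarely uniform sampling reaches the Venues in this example can be worked out exactly. The sketch assumes k = 15 neighbors drawn uniformly without replacement, so the counts follow a hypergeometric distribution.

```python
from math import comb

papers, authors, venues = 1000, 50, 5
total = papers + authors + venues
k = 15                                    # neighbors drawn uniformly, no replacement

# Probability that none of the k sampled neighbors is a Venue.
p_no_venue = comb(total - venues, k) / comb(total, k)

# Expected number of Venue neighbors in one sample.
expected_venues = k * venues / total

print(f"P(no Venue sampled) = {p_no_venue:.3f}")
print(f"Expected Venues per sample = {expected_venues:.3f}")
```

Under these assumptions a uniform sample contains no Venue at all in roughly 93% of draws, whereas a per-type quota of k/3 = 5 would include all five Venues every time.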

What problem does HGSampling solve compared to standard random neighbor sampling?

Standard sampling includes too many neighbors (slow training)
Standard sampling doesn't respect temporal ordering of edges
Standard sampling is biased toward high-degree node types, so rare types (like Venues) are almost never sampled. HGSampling samples independently per type to guarantee balance.

Chapter 6: Results on OAG

HGT is evaluated on the Open Academic Graph (OAG) — one of the largest heterogeneous graphs in research. Two tasks: Paper Field Prediction (classify each paper into a research field) and Author Rank Prediction (predict each author's h-index quintile).

OAG Statistics

Property	Value
Papers	179 million
Authors	57 million
Venues	18,738
Citations	2.2 billion
Author-Paper edges	1.1 billion
Total edges	~2.5 billion

Paper Field Classification (Macro-F1)

Method	All fields F1	CS only F1
GCN (no types)	0.318	0.389
R-GCN	0.381	0.413
HAN (meta-path)	0.392	0.428
HGT	0.452	0.497

Author Rank Prediction (Macro-F1)

Method	F1 Score
GCN	0.241
R-GCN	0.299
HAN	0.312
HGT	0.386

HGT's gain over R-GCN is substantial: on paper field classification, HGT achieves 0.452 vs R-GCN's 0.381, a relative improvement of about 19%. Dynamic attention (learning which neighbors matter for each specific node pair) is the main ingredient HGT adds over R-GCN's fixed per-type weights. The temporal encoding contributes further on author rank, which depends heavily on citation time patterns.
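The relative improvements can be recomputed from the "All fields F1" column of the table above. The scores are copied from that table; the helper name is hypothetical.

```python
scores = {"GCN": 0.318, "R-GCN": 0.381, "HAN": 0.392, "HGT": 0.452}

def relative_gain(new, old):
    """Improvement of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old

gains = {
    name: relative_gain(scores["HGT"], scores[name])
    for name in ("GCN", "R-GCN", "HAN")
}
for name, gain in gains.items():
    print(f"HGT vs {name}: {gain:.1%}")
```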

Ablation Study Key Finding

The paper ablates each component:

Remove type-specific projections: F1 drops 8% (biggest loss — type semantics matter most)
Remove mutual attention W^ATT → uniform attention: F1 drops 5%
Remove temporal encoding: F1 drops 3% on author rank, 1% on paper field
Replace HGSampling → random sampling: F1 drops 4% (rare type under-training hurts)

Which ablation causes the largest drop in HGT performance according to the paper?

Removing type-specific projections (8% drop): the biggest single contribution to HGT's gains
Removing temporal encoding (3% drop on author rank)
Replacing HGSampling with random sampling (4% drop)

Chapter 7: Connections & Beyond

Limitations

Attention cost on dense neighborhoods: HGT computes one attention score per edge, so the cost of a layer grows with the number of edges in the sampled subgraph rather than with N² as in a full Transformer. For hub nodes with very large neighborhoods this is still expensive. The sampling mitigates it, but at the cost of information loss.

Type count explosion: Parameters grow with both the number of node types |T_N| (K/Q/V and output projections) and the number of edge types |T_E| (W^ATT and W^MSG). KGs with hundreds of edge types (Wikidata: 800+ properties) would require very large parameter counts without decomposition (R-GCN's basis trick applies here).

Static type assumption: HGT assumes node/edge types are fixed and discrete. Real-world graphs often have ambiguous or multi-type entities — a node that is both an Author and a Reviewer in different contexts.

HGT in Context

Model	Type handling	Attention	Temporal
GCN	None	None (fixed norm)	None
R-GCN	Type-specific W	None (uniform)	None
GAT	None	Pairwise, type-agnostic	None
HAN	Meta-path semantics	Node+semantic attn	None
HGT	Type-specific Q/K/V	Mutual (source+target)	RTE
SeHGNN (2023)	Type-specific	Multi-hop semantic	None

The design pattern: HGT's type-specific projection approach has been widely adopted. The core insight — that in a heterogeneous graph, different node types live in different semantic spaces and need separate projection matrices before comparison — has become standard practice in heterogeneous graph learning. Any future heterogeneous GNN should either adopt this or justify why it doesn't.

Related Lessons

R-GCN — the precursor: type-specific weights without attention
GAT — attention on homogeneous graphs that HGT extends to heterogeneous
GCN — the foundational graph convolution baseline
GPS — combines local GNN with global Transformer attention (related spirit)

"In a heterogeneous world, attention must be type-aware. Knowing who is attending to whom via which relation is not a luxury — it's the minimum semantic honesty."
— HGT design philosophy