Hu et al. — UC San Diego & Microsoft Research, WWW 2020

HGT: Heterogeneous Graph Transformer

Bring Transformer-style attention to heterogeneous graphs: type-specific projections, mutual attention between source and target types, and relative temporal encoding for dynamic academic networks.

Prerequisites: Transformer attention (Q/K/V) + Graph neural networks (GCN, GAT) + Heterogeneous graphs (typed nodes + edges)
8
Chapters
4+
Simulations

Chapter 0: The Problem

Consider the Open Academic Graph (OAG): 178 million nodes of three types — Papers, Authors, and Venues — and edges of multiple types: paper cites paper, author writes paper, paper published in venue. This is a heterogeneous graph: nodes and edges have different semantic types.

The tasks: predict a paper's research field (node classification) and predict author-venue connections (link prediction). Both require aggregating information across heterogeneous neighbors — papers, authors, venues — in a way that respects the different semantics of each type.

Why neither standard GCN nor R-GCN is enough: Standard GCN ignores type differences entirely. R-GCN uses separate weight matrices per relation type but uses fixed, uniform attention across neighbors — it doesn't learn which sources are more relevant for each target node given the context. HGT adds dynamic, content-dependent attention — the Transformer way.

The Key Ingredients HGT Adds

Three ideas on top of R-GCN:

  1. Type-specific Q/K/V projections: Different node types get different linear projections for queries, keys, and values. An "Author" node's feature vector is projected differently than a "Paper" node's.
  2. Mutual attention: The attention weight between source node s and target node t depends not just on s (as in GAT) but on the interaction between s-type and t-type. The attention is parameterized by the relation type (s_type → t_type).
  3. Relative temporal encoding: For dynamic graphs (papers published over time), the positional encoding from Transformers is adapted to encode the time gap between source and target events.
Heterogeneous Graph: Academic Network

A small heterogeneous academic graph. Orange = Papers, Teal = Authors, Purple = Venues. Edge labels show relation types. Each type needs different treatment during aggregation.

What does HGT add to R-GCN's type-specific weight matrices?

Chapter 1: Type-Specific Projections

In a standard Transformer, queries, keys, and values are computed by multiplying the input by matrices WQ, WK, WV. In HGT, each node type gets its own set of projection matrices.

The Type Projection

Let φ(v) denote the type of node v (e.g., Paper, Author, Venue). The type-specific projection maps any node's embedding to a shared attention space:

Kls = K-Linearφ(s)( Hl-1s ) Qlt = Q-Linearφ(t)( Hl-1t ) Vls = V-Linearφ(s)( Hl-1s )

where K-Linearφ(s) is a learnable d × dhead matrix specific to type φ(s). Authors, Papers, and Venues each have different K, Q, V projection matrices.

Why different projections matter: An Author's 128-dim embedding and a Paper's 128-dim embedding live in very different semantic spaces — the dimensions mean different things. Projecting them through the same WK before computing attention would be like asking "is this Author's dimension 37 similar to this Paper's dimension 37?" — a meaningless comparison. Type-specific projections first map each type into a shared semantic space where comparisons are meaningful.

Parameterization

With |TN| node types and h attention heads of size dhead:

For OAG with 3 node types, 8 heads, d = 256, dhead = 32: 3 × 8 × 256 × 32 × 3 = ~18.9M parameters just for projections. This is manageable compared to the model's total parameters.

Type-Specific Projection Space

Different node types (Paper, Author, Venue) are projected into a shared 2D attention space. Notice how the same raw feature value maps to very different locations depending on the node type.

In HGT, if there are 4 node types and 8 attention heads, how many separate Key projection matrices are needed per layer?

Chapter 2: Heterogeneous Attention (Showcase)

Given the type-specific K and Q projections, how is the attention weight between source s and target t computed? HGT's attention mechanism has one key difference from standard Transformer attention: it includes a relation-type-dependent prior in the attention score.

The Mutual Attention Formula

Attn(s, e, t) = softmax( (Qt WATTφ(e) KTs) / √dhead )

where WATTφ(e) ∈ ℝdhead×dhead is an edge-type-specific interaction matrix — a small matrix that captures how query (target) and key (source) dimensions should interact for this particular relation type.

Compare to standard Transformer: Attn = softmax(QWQT K / √d). HGT adds WATTφ(e) between Q and K. This matrix is learned per relation type and parameterizes how the target "queries" for relevant sources through this type of edge.

"Mutual" attention: Standard GAT's attention depends only on the source node's features (how important is this neighbor?). HGT's attention depends on both source and target — the question "how important is Author A to Paper P?" depends on the features of both A and P, mediated by the "author writes" relation type. This is much richer than GAT.
Heterogeneous Attention — Interactive

A target Paper node aggregates from its heterogeneous neighbors. The attention weight for each source depends on the source type, target features, and edge type. Adjust the query vector and watch attention weights change.

Query dim 1 0.60
Query dim 2 0.30
Attention weights update as you change the target paper's query.
What does the edge-type matrix WATTφ(e) in HGT's attention formula capture?

Chapter 3: Message Passing

Once attention weights are computed, messages are created from source nodes and aggregated at the target. HGT's message computation also has a type-specific component.

Message Computation

For each source-target pair (s, t) via edge e, the message is:

Msg(s, e, t) = Vls · WMSGφ(e)

where WMSGφ(e) ∈ ℝdhead×dhead is another edge-type-specific matrix — this one transforming the value vector before aggregation. Like WATT, it's shared across all edges of the same type.

Aggregation and Update

Messages from all source nodes are aggregated at target t using the computed attention weights:

˜Hlt = ⊕h∈[H] Aggregate( { Attnh(s,e,t) · Msgh(s,e,t) : (s,e)∈N(t) } )

where ⊕ is concatenation across heads and N(t) is all edges pointing to t. The aggregation function is simply a weighted sum — just like standard Transformer's attention output.

Finally, the aggregated embedding is updated to produce the new node embedding:

Hlt = A-Linearφ(t)( σ( ˜Hlt ) ) + Hl-1t

Note the residual connection — just like Transformers. A-Linearφ(t) is a type-specific output projection that maps back to the original d-dimensional space.

The full parameter set per HGT layer: K/Q/V projections (type-specific), WATT interaction matrices (edge-type-specific), WMSG message matrices (edge-type-specific), A-Linear output projections (type-specific). Every parameter is aware of either node type or edge type — pure type semantics, no type confusion.
For each edge (s, e, t)
Project s → Ks, t → Qt, s → Vs (type-specific)
Attention
Attn = softmax(Qt WATTφ(e) KsT / √d)
Message
Msg = Vs · WMSGφ(e)
↓ weighted sum over N(t)
Aggregate + Update
Hlt = A-Linearφ(t)(σ(Σ Attn·Msg)) + Hl-1t
What is the role of the residual connection in HGT's update rule Hlt = ... + Hl-1t?

Chapter 4: Relative Temporal Encoding

Academic graphs are inherently temporal: a paper published in 2020 can cite papers from 2010, 2015, or 2019. The relationship "cited by a paper from 5 years in the future" is different from "cited by a paper from last year." HGT introduces relative temporal encoding (RTE) to capture this.

The Challenge

Transformers use absolute positional encodings (position 1, 2, 3, ...). For graphs, this doesn't apply — nodes don't have positions, only timestamps. And the relevant quantity is the difference in timestamps between source and target, not their absolute times.

RTE Design

For each edge (s, e, t) where source s has timestamp τs and target t has timestamp τt, the relative time is Δτ = τt − τs. This scalar is encoded as a vector using sinusoidal basis functions (borrowed from transformer positional encodings):

RTE(Δτ) = sin(Δτ / 100002k/d) for even dimensions k RTE(Δτ) = cos(Δτ / 100002k/d) for odd dimensions k

This d-dimensional vector is added to the source node's initial embedding before the Q/K/V projections: Hs ← Hs + WRTE · RTE(Δτ). The model can thus learn to weight recent vs distant sources differently.

Why relative, not absolute? A paper from 1990 that's cited by a 2020 paper looks very different as a source than the same 1990 paper cited by a 1992 paper. The absolute timestamp of the source doesn't matter as much as how far in the past it is relative to the target. Relative encoding captures this temporal distance, not just the year.
Relative Temporal Encoding

Sinusoidal basis functions for different frequencies encode the time difference Δτ. Short time differences are encoded differently than long ones. This encoding is added to the source embedding before attention.

Time gap Δτ (years) 5
In HGT's Relative Temporal Encoding, what quantity is encoded as the temporal signal?

Chapter 5: HGSampling

OAG has 178 million nodes. You can't load the full graph into memory, let alone train a GNN on all of it at once. HGT uses HGSampling: a heterogeneity-aware mini-batch sampling strategy.

The Problem with Naive Sampling

Standard neighbor sampling (like GraphSAGE's) randomly samples k neighbors per node. On a heterogeneous graph, this can completely miss rare node types. If "Venue" nodes have degree 1,000 and "Author" nodes have degree 3, randomly sampling 10 neighbors of a Paper will mostly give Papers (which dominate the neighborhood), leaving Venues and Authors chronically under-represented.

HGSampling Design

HGSampling samples per type: for each target node, it separately samples k/|types| neighbors of each type. This ensures balanced representation across all node types in every mini-batch, regardless of the degree distribution skew.

Implementation detail: Maintain a separate neighbor list per (node, type) pair. For each batch, sample independently from each list. If a type has fewer than k/|types| neighbors (rare type), use all available neighbors. The resulting subgraph is "type-balanced" by construction.

Why Balanced Sampling Matters

If Venue nodes are undersampled during training, the WATT matrix for "paper published in venue" edges gets very few gradient updates. The model never learns to attend well to venue information. HGSampling ensures every relation type gets adequate gradient signal throughout training.

Naive vs HGSampling — Type Balance

For a target Paper node with 1000 Paper neighbors, 50 Author neighbors, and 5 Venue neighbors. Standard sampling (left) almost never includes Venues. HGSampling (right) guarantees balanced representation.

Sample size k 15
What problem does HGSampling solve compared to standard random neighbor sampling?

Chapter 6: Results on OAG

HGT is evaluated on the Open Academic Graph (OAG) — one of the largest heterogeneous graphs in research. Two tasks: Paper Field Prediction (classify each paper into a research field) and Author Rank Prediction (predict each author's h-index quintile).

OAG Statistics

PropertyValue
Papers179 million
Authors57 million
Venues18,738
Citations2.2 billion
Author-Paper edges1.1 billion
Total edges~2.5 billion

Paper Field Classification (Macro-F1)

MethodAll fields F1CS only F1
GCN (no types)0.3180.389
R-GCN0.3810.413
HAN (meta-path)0.3920.428
HGT0.4520.497

Author Rank Prediction (Macro-F1)

MethodF1 Score
GCN0.241
R-GCN0.299
HAN0.312
HGT0.386
HGT's gain over R-GCN is significant: On paper field classification, HGT achieves 0.452 vs R-GCN's 0.381 — a 19% relative improvement. The dynamic attention (knowing which neighbors matter for each specific node-pair) is responsible for most of this gain over R-GCN's fixed per-type weights. The temporal encoding contributes additionally for author rank (which depends heavily on citation time patterns).

Ablation Study Key Finding

The paper ablates each component:

Which ablation causes the largest drop in HGT performance according to the paper?

Chapter 7: Connections & Beyond

Limitations

Quadratic attention: HGT's attention is computed per (source, target) pair — the same O(N²) bottleneck as standard Transformers. For dense neighborhoods, this is expensive. The sampling mitigates it, but at the cost of information loss.

Type count explosion: Parameters scale with |TN| × |TE|. KGs with hundreds of edge types (Wikidata: 800+ properties) would require enormous parameter counts without decomposition (R-GCN's basis trick applies here).

Static type assumption: HGT assumes node/edge types are fixed and discrete. Real-world graphs often have ambiguous or multi-type entities — a node that is both an Author and a Reviewer in different contexts.

HGT in Context

ModelType handlingAttentionTemporal
GCNNoneNone (fixed norm)None
R-GCNType-specific WNone (uniform)None
GATNoneSource-only attentionNone
HANMeta-path semanticsNode+semantic attnNone
HGTType-specific Q/K/VMutual (source+target)RTE
SeHGNN (2023)Type-specificMulti-hop semanticNone
The design pattern: HGT's type-specific projection approach has been widely adopted. The core insight — that in a heterogeneous graph, different node types live in different semantic spaces and need separate projection matrices before comparison — has become standard practice in heterogeneous graph learning. Any future heterogeneous GNN should either adopt this or justify why it doesn't.

Related Lessons

"In a heterogeneous world, attention must be type-aware. Knowing who is attending to whom via which relation is not a luxury — it's the minimum semantic honesty."
— HGT design philosophy