Bring Transformer-style attention to heterogeneous graphs: type-specific projections, mutual attention between source and target types, and relative temporal encoding for dynamic academic networks.
Consider the Open Academic Graph (OAG): 178 million nodes of three types — Papers, Authors, and Venues — and edges of multiple types: paper cites paper, author writes paper, paper published in venue. This is a heterogeneous graph: nodes and edges have different semantic types.
The tasks: predict a paper's research field (node classification) and predict author-venue connections (link prediction). Both require aggregating information across heterogeneous neighbors — papers, authors, venues — in a way that respects the different semantics of each type.
Three ideas on top of R-GCN:
A small heterogeneous academic graph. Orange = Papers, Teal = Authors, Purple = Venues. Edge labels show relation types. Each type needs different treatment during aggregation.
In a standard Transformer, queries, keys, and values are computed by multiplying the input by matrices WQ, WK, WV. In HGT, each node type gets its own set of projection matrices.
Let φ(v) denote the type of node v (e.g., Paper, Author, Venue). The type-specific projection maps any node's embedding to a shared attention space:
where K-Linearφ(s) is a learnable d × dhead matrix specific to type φ(s). Authors, Papers, and Venues each have different K, Q, V projection matrices.
With |TN| node types and h attention heads of size dhead:
For OAG with 3 node types, 8 heads, d = 256, dhead = 32: 3 × 8 × 256 × 32 × 3 = ~18.9M parameters just for projections. This is manageable compared to the model's total parameters.
Different node types (Paper, Author, Venue) are projected into a shared 2D attention space. Notice how the same raw feature value maps to very different locations depending on the node type.
Given the type-specific K and Q projections, how is the attention weight between source s and target t computed? HGT's attention mechanism has one key difference from standard Transformer attention: it includes a relation-type-dependent prior in the attention score.
where WATTφ(e) ∈ ℝdhead×dhead is an edge-type-specific interaction matrix — a small matrix that captures how query (target) and key (source) dimensions should interact for this particular relation type.
Compare to standard Transformer: Attn = softmax(QWQT K / √d). HGT adds WATTφ(e) between Q and K. This matrix is learned per relation type and parameterizes how the target "queries" for relevant sources through this type of edge.
A target Paper node aggregates from its heterogeneous neighbors. The attention weight for each source depends on the source type, target features, and edge type. Adjust the query vector and watch attention weights change.
Once attention weights are computed, messages are created from source nodes and aggregated at the target. HGT's message computation also has a type-specific component.
For each source-target pair (s, t) via edge e, the message is:
where WMSGφ(e) ∈ ℝdhead×dhead is another edge-type-specific matrix — this one transforming the value vector before aggregation. Like WATT, it's shared across all edges of the same type.
Messages from all source nodes are aggregated at target t using the computed attention weights:
where ⊕ is concatenation across heads and N(t) is all edges pointing to t. The aggregation function is simply a weighted sum — just like standard Transformer's attention output.
Finally, the aggregated embedding is updated to produce the new node embedding:
Note the residual connection — just like Transformers. A-Linearφ(t) is a type-specific output projection that maps back to the original d-dimensional space.
Academic graphs are inherently temporal: a paper published in 2020 can cite papers from 2010, 2015, or 2019. The relationship "cited by a paper from 5 years in the future" is different from "cited by a paper from last year." HGT introduces relative temporal encoding (RTE) to capture this.
Transformers use absolute positional encodings (position 1, 2, 3, ...). For graphs, this doesn't apply — nodes don't have positions, only timestamps. And the relevant quantity is the difference in timestamps between source and target, not their absolute times.
For each edge (s, e, t) where source s has timestamp τs and target t has timestamp τt, the relative time is Δτ = τt − τs. This scalar is encoded as a vector using sinusoidal basis functions (borrowed from transformer positional encodings):
This d-dimensional vector is added to the source node's initial embedding before the Q/K/V projections: Hs ← Hs + WRTE · RTE(Δτ). The model can thus learn to weight recent vs distant sources differently.
Sinusoidal basis functions for different frequencies encode the time difference Δτ. Short time differences are encoded differently than long ones. This encoding is added to the source embedding before attention.
OAG has 178 million nodes. You can't load the full graph into memory, let alone train a GNN on all of it at once. HGT uses HGSampling: a heterogeneity-aware mini-batch sampling strategy.
Standard neighbor sampling (like GraphSAGE's) randomly samples k neighbors per node. On a heterogeneous graph, this can completely miss rare node types. If "Venue" nodes have degree 1,000 and "Author" nodes have degree 3, randomly sampling 10 neighbors of a Paper will mostly give Papers (which dominate the neighborhood), leaving Venues and Authors chronically under-represented.
HGSampling samples per type: for each target node, it separately samples k/|types| neighbors of each type. This ensures balanced representation across all node types in every mini-batch, regardless of the degree distribution skew.
If Venue nodes are undersampled during training, the WATT matrix for "paper published in venue" edges gets very few gradient updates. The model never learns to attend well to venue information. HGSampling ensures every relation type gets adequate gradient signal throughout training.
For a target Paper node with 1000 Paper neighbors, 50 Author neighbors, and 5 Venue neighbors. Standard sampling (left) almost never includes Venues. HGSampling (right) guarantees balanced representation.
HGT is evaluated on the Open Academic Graph (OAG) — one of the largest heterogeneous graphs in research. Two tasks: Paper Field Prediction (classify each paper into a research field) and Author Rank Prediction (predict each author's h-index quintile).
| Property | Value |
|---|---|
| Papers | 179 million |
| Authors | 57 million |
| Venues | 18,738 |
| Citations | 2.2 billion |
| Author-Paper edges | 1.1 billion |
| Total edges | ~2.5 billion |
| Method | All fields F1 | CS only F1 |
|---|---|---|
| GCN (no types) | 0.318 | 0.389 |
| R-GCN | 0.381 | 0.413 |
| HAN (meta-path) | 0.392 | 0.428 |
| HGT | 0.452 | 0.497 |
| Method | F1 Score |
|---|---|
| GCN | 0.241 |
| R-GCN | 0.299 |
| HAN | 0.312 |
| HGT | 0.386 |
The paper ablates each component:
Quadratic attention: HGT's attention is computed per (source, target) pair — the same O(N²) bottleneck as standard Transformers. For dense neighborhoods, this is expensive. The sampling mitigates it, but at the cost of information loss.
Type count explosion: Parameters scale with |TN| × |TE|. KGs with hundreds of edge types (Wikidata: 800+ properties) would require enormous parameter counts without decomposition (R-GCN's basis trick applies here).
Static type assumption: HGT assumes node/edge types are fixed and discrete. Real-world graphs often have ambiguous or multi-type entities — a node that is both an Author and a Reviewer in different contexts.
| Model | Type handling | Attention | Temporal |
|---|---|---|---|
| GCN | None | None (fixed norm) | None |
| R-GCN | Type-specific W | None (uniform) | None |
| GAT | None | Source-only attention | None |
| HAN | Meta-path semantics | Node+semantic attn | None |
| HGT | Type-specific Q/K/V | Mutual (source+target) | RTE |
| SeHGNN (2023) | Type-specific | Multi-hop semantic | None |
"In a heterogeneous world, attention must be type-aware. Knowing who is attending to whom via which relation is not a luxury — it's the minimum semantic honesty."
— HGT design philosophy