Ying, Cai, Luo, Zheng, Ke, He, Fang, Feng, Liu — Microsoft Research, NeurIPS 2021

Do Transformers Really Perform Bad for Graph Representation?

Graphormer shows that a standard Transformer, with three simple structural encodings, can match or beat specialized GNNs on graph-level prediction tasks — and won the OGB-LSC molecular competition.

Prerequisites: Self-attention / Transformers + GNN basics + Graph Laplacian (optional)
10
Chapters
4+
Simulations

Chapter 0: The Problem

By 2021, the standard wisdom was clear: Transformers dominate NLP and vision, but graphs belong to GNNs. Transformers "need" positional encoding to encode sequence order, and graphs have no order. Full self-attention ignores edge structure. So the answer, everyone assumed, was to design carefully structured GNN architectures with graph-specific inductive biases.

Graphormer asks the uncomfortable question: what if that's wrong? What if a vanilla Transformer, with a few simple modifications to inject structural information, can outperform the best GNNs on graph prediction tasks?

The paper's answer: yes, and it's not close. Graphormer won 1st place in the OGB Large-Scale Challenge (OGB-LSC) at KDD Cup 2021, beating the best GNN baselines on PCQM4M — the largest molecular property prediction benchmark in existence at the time.

The core insight: Standard Transformers fail on graphs not because of fundamental architectural incompatibility, but because they lack structural information. Feed them the right structural signals (node centrality, pairwise distance, edge content) and they outperform GNNs. The architecture is fine. The missing ingredient is structural encoding.

Three Structural Encodings

Graphormer makes exactly three modifications to a standard Transformer:

Centrality Encoding
Add a degree-based bias to node features before the first attention layer. Important nodes start different from peripheral nodes.
Spatial Encoding
Add a shortest-path-distance bias to every attention score. Nearby nodes attract each other more; distant nodes less.
Edge Encoding
Aggregate edge features along shortest paths and add as an additional attention bias. Captures bond types, distances, etc.

That's it. Standard Transformer otherwise. Layer norm, feedforward network, residual connections — all unchanged. The three encodings do all the graph-awareness work.

Paper details: Chengxuan Ying, Tianle Cai, Sijie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Fang, Jie Feng, Tie-Yan Liu. "Do Transformers Really Perform Bad for Graph Representation?" NeurIPS 2021. arXiv 2106.05234
Graphormer Overview: Three Structural Encodings

A molecular graph with 8 atoms. Each structural encoding adds different structural information. Click to highlight what each encoding contributes.

Select an encoding to see what structural information it injects.
What is the primary reason standard Transformers fail on graphs, according to Graphormer?

Chapter 1: Centrality Encoding

The first problem: standard Transformers treat all tokens equally at initialization. Every token starts with the same type of representation, regardless of its role in the graph. But in a molecule, a carbon atom bonded to 4 other atoms (a quaternary carbon) plays a fundamentally different structural role than a hydrogen atom bonded to 1 atom. They need different initial representations — before any attention is computed.

Graphormer's solution: add a centrality encoding to each node's initial feature vector. For undirected graphs, this is based on the node's degree. For directed graphs, it uses both in-degree and out-degree separately:

hi0 = xi WE + zdeg(i) + zdeg+(i)

Where xi is the raw node feature (atom type, etc.), WE is a learned projection, and zdeg is a learned embedding vector for each degree value. The degree embeddings are parameters of the model — learned during training, one vector per possible degree value.

Why degree encodes importance: In graph theory, high-degree nodes are hubs — they lie on many paths, bridge communities, and influence many other nodes. A hub atom in a molecule participates in more bonds and thus has more influence on global molecular properties. Encoding degree in the initial features lets the Transformer's attention mechanism learn to weight high-degree nodes appropriately from the very first layer.

Implementation Details

How many distinct degree values need embeddings? For molecular graphs (OGB-Mol), maximum degree is about 9 (heavy atoms). For larger graphs, it can be much higher. In practice, degrees above some threshold (e.g., 512) are binned together. Each degree gets one learned d-dimensional embedding vector — same dimensionality as the node features themselves, so addition is straightforward.

This is analogous to the learned position embeddings in BERT. BERT has one learned vector per position index. Graphormer has one learned vector per degree value. Both are added to the input features. The key difference: BERT positions are sequential integers (1, 2, 3, ...), while Graphormer degrees capture structural role, not sequential order.

Degree-Based Centrality: Different Nodes, Different Biases

A graph with nodes of different degrees. The bar height shows how much centrality encoding shifts each node's initial representation. High-degree nodes (hubs) receive the largest structural bias.

In Graphormer, centrality encoding adds z_deg(i) to node i's feature vector. What is z_deg(i)?

Chapter 2: Spatial Encoding

Centrality encoding gives each node a structural identity. But attention also needs to know about pairwise relationships — how close are nodes i and j in the graph? A standard Transformer treats every pair identically in the attention computation (the difference is learned from features, not injected from structure). Graphormer adds an explicit structural prior to every attention score.

For each pair of nodes (i, j), compute the shortest path distance SPD(i, j) — the minimum number of edges needed to travel from i to j. Add a bias term bSPD(i,j) to the attention score:

Aij = (qi · kj) / √d + bφ(vi, vj)

Where φ(vi, vj) = SPD(i, j) and b is a learned scalar bias for each distance value. So the model learns: "when two nodes are 1 hop apart, add b1 to their attention score; when they're 2 hops apart, add b2; ..." These biases are learned, not hand-coded.

What this achieves: The spatial encoding acts like a structural attention mask. If b1 > b2 > b3 (which the model typically learns), then connected nodes attend to each other more than distant nodes. This injects the graph topology directly into the attention matrix — without forcing hard constraints like GNN's message passing only along edges.

Disconnected Nodes

For disconnected node pairs (different connected components), SPD is undefined (infinity). Graphormer assigns a special distance value (e.g., -1 or a dedicated token) and learns a separate bias for this case. This allows the model to explicitly distinguish "no path exists" from "path of length k exists."

Why This Beats GNN Positional Encoding

Many GNNs inject structural information by adding Laplacian eigenvectors to node features. This modifies the tokens. Graphormer's spatial encoding modifies the attention scores directly — it tells the communication mechanism itself how to weight each node pair, rather than hoping the model infers this from modified features. More direct, more expressive.

Spatial Encoding: SPD Attention Bias Matrix (Interactive SHOWCASE)

A 7-node graph. The heatmap shows the attention bias b_SPD(i,j) for every node pair. Click a node to highlight its row/column — see how its attention to near neighbors (SPD=1) differs from far nodes (SPD=3,4). Drag the bias sliders to see how learned biases shape the attention pattern.

b₁ (SPD=1) 2.0
b₂ (SPD=2) 0.5
b₃ (SPD=3+) -0.5
Heatmap: attention bias between every pair. Warm = high bias (strong pull), cool = low bias.
Spatial encoding in Graphormer adds a bias b_SPD(i,j) to the attention score between nodes i and j. What determines this bias?

Chapter 3: Edge Encoding

Spatial encoding tells the model how far apart two nodes are. But it loses information about what's along the path. In a molecule, knowing that two atoms are 3 bonds apart is useful, but knowing those bonds are single-single-double tells you much more about the electronic environment.

Graphormer's edge encoding addresses this. For each pair (i, j), find the shortest path between them: i → e₁ → v₁ → e₂ → v₂ → ... → j. Each edge eₙ on this path has a feature vector xeₙ (bond type, bond length, etc.). The edge encoding computes an additional attention bias by averaging learned projections of these edge features:

cij = &frac{1}{Nij} ∑n=1Nij xenT wn

Where Nij is the length of the shortest path between i and j, xeₙ is the raw feature of the n-th edge on the path, and wn is a learned weight vector for position n in the path. The scalar cij is added to the attention score:

Aij = (qi · kj) / √d + bSPD(i,j) + cij
Why average over path edges? Two atoms might be 3 bonds apart via a purely aromatic path (high electron density) or via a single-bond path (low electron density). The spatial encoding (SPD=3) is identical for both. The edge encoding captures this chemical distinction. It "reads" the content of the path, not just its length.

Implementation and Computation

Pre-computing edge encodings requires finding shortest paths between all node pairs — O(n² · path_length). For molecular graphs (n ≈ 20–50 atoms), this is cheap. For larger graphs, it can be expensive and is one of the key computational costs of Graphormer.

The learned weights wn are position-specific: w1 for the first edge on the path, w2 for the second, etc. This is like a position-aware pooling — the model can learn to weight the edge closest to i differently from the edge closest to j. In practice, a maximum path length is set (e.g., 20), and paths longer than this use only the first 20 edges.

What edge features contain for molecules: Bond type (single, double, triple, aromatic), bond stereo (E/Z, cis/trans), bond ring membership, bond conjugation status. In the PCQM4M dataset, each edge has a 3-dimensional feature vector encoding these properties. Graphormer projects these to a scalar via the learned wn weights.
Find shortest path
i → v₁ → v₂ → j. Path length N_ij = 3 (3 edges).
Project each edge feature
x_{e1}ᵀ w₁, x_{e2}ᵀ w₂, x_{e3}ᵀ w₃. Each is a scalar.
Average
c_ij = (x_{e1}ᵀ w₁ + x_{e2}ᵀ w₂ + x_{e3}ᵀ w₃) / 3.
Add to attention score
A_ij += c_ij. Nodes with richer/different path chemistry attend differently.
Graphormer's edge encoding uses position-specific weights w_n for the n-th edge on the shortest path. Why?

Chapter 4: The Full Graphormer Layer

Now we can write the complete Graphormer computation. It's a standard Transformer with three injections at specific points:

Step 1: Input Preparation (once, before all layers)

hi0 = xi WE + zdeg-(i) + zdeg+(i)

Node features projected to d dimensions, plus in-degree and out-degree centrality embeddings. This is done once before layer 1.

Step 2: Modified Self-Attention (in every layer)

Q = Hl-1 WQ,   K = Hl-1 WK,   V = Hl-1 WV
Aij = (qi · kj) / √d + bSPD(i,j) + cij
Attni = ∑j softmaxj(Aij) · vj

Spatial bias bSPD and edge bias cij are pre-computed from graph structure and added to every attention head in every layer (with separate learned biases per head).

Step 3: Feed-Forward + Residual (standard Transformer)

Hl = FFN( LN( Attn + Hl-1 ) ) + LN( Attn + Hl-1 )

Standard pre-norm Transformer residual block. No changes from BERT here.

What's trained vs. pre-computed: Pre-computed (once per graph, before any training): SPD(i,j) for all pairs, shortest path edge sequences. Trained end-to-end: W_E (feature projection), z_deg (centrality embeddings), b_SPD (spatial bias per distance), w_n (edge encoding weights), W_Q/K/V/O (attention projections), FFN weights, LayerNorm parameters.

Tensor Shapes at Each Step

For a graph with n nodes, d model dimensions, H attention heads:

TensorShapeDescription
H⁰ (input)n × dNode features after centrality encoding
Q, K, Vn × d (each)Projected queries, keys, values
A (attention logits)n × nDot products + SPD bias + edge bias
Attn (output)n × dWeighted sum of values
HL (final)n × dNode representations after L layers
Graph repr.1 × dOutput of virtual node (see Ch 5)
Complete Graphormer Data Flow

Step through the full computation: input, centrality, attention with biases, FFN, and pooling. Click Next Step to advance.

Where exactly in the Graphormer computation are the spatial and edge encodings injected?

Chapter 5: Virtual Node (The [CLS] Token)

Graph-level prediction requires pooling all node representations into a single graph-level vector. Standard options — mean pooling, sum pooling, max pooling — all have problems: they're permutation-invariant by construction but don't know which nodes are important, and they're fixed operations that can't be learned.

Graphormer borrows BERT's elegant solution: a virtual node [VNode] connected to every node in the graph. Like BERT's [CLS] token, it participates in attention and its final representation serves as the graph-level embedding.

The [CLS] → [VNode] analogy: BERT adds [CLS] at position 0 of every sentence. [CLS] attends to every word, and its output is used for sentence classification. Graphormer adds [VNode] as a special "supernode" connected to all nodes. [VNode] attends to every graph node, and its output is the graph representation. In both cases, the token learns to summarize the whole input.

Implementation

The virtual node is implemented by assigning it a special distance value in the spatial encoding. Real nodes are at distances SPD = 1, 2, 3, ... from each other. The virtual node is assigned SPD = 0 from itself and SPD = -1 (a special token) from all real nodes. This gives it a unique learned attention bias and lets the model learn how strongly to weight the virtual node's participation.

The virtual node's centrality encoding also uses a special degree value (it's connected to all n nodes, so its degree is n — usually its own learned embedding vector rather than the n-th degree embedding, to avoid conflating it with high-degree real nodes).

What the Virtual Node Learns

After L Transformer layers, the virtual node has gathered information from all real nodes via attention — but not uniformly. High-degree real nodes contributed more (via their centrality encoding), nearby nodes contributed more (via spatial encoding), and nodes along chemically rich paths contributed differently (via edge encoding). The final [VNode] representation is a learned, structure-aware summary of the entire graph.

Virtual Node: Global Graph Aggregation

A 7-node graph with a virtual node (VNode) connected to all real nodes. The bar chart shows how much each real node contributes to the final VNode representation — taller = more influence. Click "Run Attention" to animate the aggregation.

VNode attends to all nodes simultaneously. Its output is the graph-level representation for prediction.
How does Graphormer produce a graph-level representation from node representations?

Chapter 6: Results

Graphormer was evaluated on two main benchmarks: the OGB-Mol suite (molecular property prediction) and the OGB-LSC competition (PCQM4M, the large-scale molecular dataset). On both, it substantially outperformed the best GNNs at the time.

OGB-Mol Benchmarks

DatasetTaskMetricGIN-VN (best GNN)Graphormer
ogbg-molhivBinary clsAUC ↑77.07%80.51%
ogbg-molpcbaMulti-labelAP ↑27.03%31.39%
ogbg-moltox21ToxicityAUC ↑75.58%76.36%
ZINCRegressionMAE ↓0.2520.122

On ZINC — a popular benchmark with 250K molecules — Graphormer achieves MAE of 0.122, roughly half the error of the best GNN baseline (0.252). This is a substantial gap for a benchmark that has been intensely optimized for years.

OGB-LSC: PCQM4M

PCQM4M contains 3.8 million molecules with DFT-computed quantum chemistry properties (HOMO-LUMO gap). This is the largest molecular ML benchmark available at the time. Graphormer won 1st place at KDD Cup 2021:

ModelValid MAE ↓Notes
GIN-VN (SOTA GNN)0.1395Previous best
GRPE0.0898Another graph Transformer
Graphormer0.08641st place, KDD Cup 2021
Ablation: each encoding matters. Removing centrality encoding: +4.2% MAE on PCQM4M. Removing spatial encoding: +8.1% MAE. Removing edge encoding: +3.7% MAE. Removing all three: +18.3% MAE — back to GNN level. Spatial encoding contributes the most; it's the most critical of the three.

Model Size and Training

The winning Graphormer model for PCQM4M used 12 layers, 32 attention heads, d = 768 dimensions — similar in size to BERT-Base. Training on PCQM4M took about 3 days on 8 NVIDIA V100 GPUs. The model has approximately 47M parameters. By comparison, a GIN-VN of equivalent performance might be 10× smaller but 5× less accurate.

Performance Comparison: Graphormer vs GNN Baselines

MAE on ZINC (lower is better). Use the selector to compare methods. Graphormer nearly halves the error of the best GNN.

According to the Graphormer ablation study, which of the three structural encodings contributes the most to performance?

Chapter 7: Why It Works: Expressiveness Analysis

Graphormer's empirical success is impressive. But can we understand why it works theoretically? The paper provides a formal analysis showing that Graphormer is at least as expressive as any 1-WL GNN — and strictly more expressive in many cases.

GNNs Are Bounded by 1-WL

From Lec 6 (Theory of GNNs): any message-passing GNN is upper-bounded by the Weisfeiler-Leman (1-WL) graph isomorphism test. Two graphs that the 1-WL test cannot distinguish will receive identical representations from any GNN — regardless of architecture, depth, or width.

1-WL fails on many simple graphs. For example: two regular graphs (all nodes have the same degree) with the same degree sequence but different global structure. Or any graph where all nodes have identical k-hop neighborhoods for some k.

Graphormer Breaks the 1-WL Barrier

Graphormer's spatial encoding computes SPD(i, j) for all pairs — this is equivalent to knowing the entire distance matrix of the graph. Two non-isomorphic graphs can have identical local neighborhoods (fooling 1-WL) but different distance matrices. The spatial encoding distinguishes them.

The key theorem (informal): For any two non-isomorphic graphs G and G' that are indistinguishable by 1-WL, there exists a choice of spatial encoding bias weights such that Graphormer assigns them different representations. Graphormer is strictly more expressive than the entire class of 1-WL GNNs.

When Does It Matter in Practice?

For molecules, 1-WL distinguishes most pairs — but not all. Ring systems that differ in how distant atoms relate to each other can fool 1-WL. For example, cyclohexane (6-ring, all carbons) and two fused 3-rings have different distance matrices but potentially similar local neighborhoods. Graphormer's spatial encoding catches this.

More practically: even when 1-WL would technically distinguish two graphs, Graphormer learns to do so more efficiently — the spatial and edge encodings provide a direct signal that would require many GNN layers to approximate through message passing.

An important nuance: "More expressive" doesn't always mean "better at prediction." Higher expressiveness can lead to overfitting on small datasets. Graphormer works best on large molecular datasets (PCQM4M with 3.8M molecules) where the extra expressiveness can be learned without overfitting. On tiny datasets (e.g., MUTAG with 188 graphs), simpler GNNs may generalize better.
Why is Graphormer strictly more expressive than 1-WL GNNs?

Chapter 8: Limitations

Graphormer is powerful but has real constraints that limit its applicability. Understanding these is essential for choosing when to use it.

Limitation 1: O(n²) Compute and Memory

Full self-attention over all n nodes costs O(n²·d) per layer. The attention matrix A is n×n — 400 MB at float32 for n=10,000. The spatial encoding matrix (precomputed SPD for all pairs) is also n×n. For molecular graphs (n ≈ 5–50), this is trivial. For citation networks (n = 169,343 for ogbn-arxiv), it's completely infeasible.

This is not a solvable problem within Graphormer's framework — it's a fundamental architectural constraint. O(n²) attention requires O(n²) memory. Workarounds (sparse attention, linear attention) require architectural changes that modify Graphormer into a different model.

Limitation 2: Graph-Level Tasks Only

Graphormer was designed and evaluated for graph-level prediction (a single property per graph, like molecular solubility). For node-level tasks (predicting a property per node, like citation category), Graphormer is less natural — the virtual node approach produces a single graph embedding, not per-node embeddings.

In principle, the node representations HL from the final layer could be used for node prediction. But the lack of positional encoding (no Laplacian PE — Graphormer uses SPD-based attention bias instead) means the node representations may be less distinctive for node-level discrimination than RWPE-augmented GPS.

Limitation 3: Pre-computation Cost

Computing SPD(i, j) for all pairs and finding shortest path edge sequences requires running all-pairs shortest paths (APSP). For unweighted graphs, this is O(n · |E|) using BFS from every node. For n = 1,000 and |E| = 10,000, that's 10M operations — manageable offline. But for dynamic graphs (where edges change), re-running APSP at every timestep is prohibitive.

LimitationImpactWorkaround
O(n²) attentionInfeasible for n > ~5KSwitch to GPS with linear attention
Graph-level onlyAwkward for node tasksUse node representations HL directly
APSP pre-computationExpensive for large/dynamic graphsApproximate SPD or use RWPE instead
Large model size47M params, needs large training setPretrain on large corpus (e.g., PCQM4M)
Bottom line: Graphormer is the right tool for graph-level prediction on small-to-medium molecular graphs where you have a large training set. For large graphs, node-level tasks, or dynamic settings, consider GPS (local MPNN + linear attention + RWPE) which avoids all three of these limitations.
Why is Graphormer ill-suited for node classification on large graphs like citation networks?

Chapter 9: Connections

Graphormer sits at an important crossroads in the evolution of graph learning. Understanding its place in the broader landscape helps you pick the right tool and see where the field is going.

Relationship to CS224W Lecture 8

Graphormer is the central paper of CS224W Lecture 8 (Graph Transformers). Its three encodings — centrality, spatial, edge — exemplify the general principle that structural encoding is what makes Transformers graph-aware. The lecture generalizes this to other approaches (Laplacian PE, RWPE) and to GPS, which extends Graphormer's ideas to larger graphs by combining MPNN + attention.

Key Related Papers

  • GPS (NeurIPS 2022) — Rampásek et al. Fixes Graphormer's scalability limit by combining local MPNN + linear attention + RWPE. The natural successor for large-graph settings.
  • SAN (NeurIPS 2021) — Kreuzer et al. Uses the full Laplacian spectrum as node PE. Similar timing to Graphormer, different PE approach.
  • GRPE (ICML 2022) — Han et al. Relative positional encoding for graphs. Competitive with Graphormer on PCQM4M, similar philosophy.
  • Exphormer (ICML 2023) — Shirzad et al. Sparse attention for Graph Transformers. Scales to large graphs while maintaining high expressiveness. Directly addresses Graphormer's main weakness.
  • Uni-Mol (ICLR 2023) — Zhou et al. Graphormer-style Transformer pre-trained at scale on 200M+ molecules. Shows that Graphormer's approach benefits from self-supervised pre-training.

Micro-Lesson Links

  • CS224W Lec 8: Graph Transformers — The lecture that places Graphormer in context alongside GPS, RWPE, and Laplacian PE.
  • CS224W Lec 6: Theory of GNNs — The expressiveness analysis (1-WL, GIN) that Graphormer builds on. Explains why Graphormer breaks the 1-WL barrier.
  • CS224W Lec 4: GAT — Graph Attention Networks: attention on sparse graphs. Graphormer extends this to full attention with structural biases.
  • Transformer — The original architecture Graphormer modifies. Reading this before Graphormer clarifies exactly what changes and what doesn't.
The lasting lesson: Graphormer showed that the right inductive bias — not a specialized architecture — is what graph learning needs. A standard Transformer, fed three simple structural signals, beats years of carefully engineered GNN architectures. The principle generalizes: for any new data modality, the question is not "do I need a new architecture?" but "what structural information does my architecture need to see, and how do I inject it?"
"It seems that if one is working from the point of view of getting beauty in one's equations, and if one has really a sound insight, one is on a sure line of progress." — Paul Dirac

Paper: Ying et al. "Do Transformers Really Perform Bad for Graph Representation?" NeurIPS 2021. arXiv 2106.05234