Graphormer shows that a standard Transformer, with three simple structural encodings, can match or beat specialized GNNs on graph-level prediction tasks — and won the OGB-LSC molecular competition.
By 2021, the standard wisdom was clear: Transformers dominate NLP and vision, but graphs belong to GNNs. Transformers "need" positional encoding to encode sequence order, and graphs have no order. Full self-attention ignores edge structure. So the answer, everyone assumed, was to design carefully structured GNN architectures with graph-specific inductive biases.
Graphormer asks the uncomfortable question: what if that's wrong? What if a vanilla Transformer, with a few simple modifications to inject structural information, can outperform the best GNNs on graph prediction tasks?
The paper's answer: yes, and it's not close. Graphormer won 1st place in the OGB Large-Scale Challenge (OGB-LSC) at KDD Cup 2021, beating the best GNN baselines on PCQM4M — the largest molecular property prediction benchmark in existence at the time.
Graphormer makes exactly three modifications to a standard Transformer:
That's it. Standard Transformer otherwise. Layer norm, feedforward network, residual connections — all unchanged. The three encodings do all the graph-awareness work.
A molecular graph with 8 atoms. Each structural encoding adds different structural information. Click to highlight what each encoding contributes.
The first problem: standard Transformers treat all tokens equally at initialization. Every token starts with the same type of representation, regardless of its role in the graph. But in a molecule, a carbon atom bonded to 4 other atoms (a quaternary carbon) plays a fundamentally different structural role than a hydrogen atom bonded to 1 atom. They need different initial representations — before any attention is computed.
Graphormer's solution: add a centrality encoding to each node's initial feature vector. For undirected graphs, this is based on the node's degree. For directed graphs, it uses both in-degree and out-degree separately:
Where xi is the raw node feature (atom type, etc.), WE is a learned projection, and zdeg is a learned embedding vector for each degree value. The degree embeddings are parameters of the model — learned during training, one vector per possible degree value.
How many distinct degree values need embeddings? For molecular graphs (OGB-Mol), maximum degree is about 9 (heavy atoms). For larger graphs, it can be much higher. In practice, degrees above some threshold (e.g., 512) are binned together. Each degree gets one learned d-dimensional embedding vector — same dimensionality as the node features themselves, so addition is straightforward.
This is analogous to the learned position embeddings in BERT. BERT has one learned vector per position index. Graphormer has one learned vector per degree value. Both are added to the input features. The key difference: BERT positions are sequential integers (1, 2, 3, ...), while Graphormer degrees capture structural role, not sequential order.
A graph with nodes of different degrees. The bar height shows how much centrality encoding shifts each node's initial representation. High-degree nodes (hubs) receive the largest structural bias.
Centrality encoding gives each node a structural identity. But attention also needs to know about pairwise relationships — how close are nodes i and j in the graph? A standard Transformer treats every pair identically in the attention computation (the difference is learned from features, not injected from structure). Graphormer adds an explicit structural prior to every attention score.
For each pair of nodes (i, j), compute the shortest path distance SPD(i, j) — the minimum number of edges needed to travel from i to j. Add a bias term bSPD(i,j) to the attention score:
Where φ(vi, vj) = SPD(i, j) and b is a learned scalar bias for each distance value. So the model learns: "when two nodes are 1 hop apart, add b1 to their attention score; when they're 2 hops apart, add b2; ..." These biases are learned, not hand-coded.
For disconnected node pairs (different connected components), SPD is undefined (infinity). Graphormer assigns a special distance value (e.g., -1 or a dedicated token) and learns a separate bias for this case. This allows the model to explicitly distinguish "no path exists" from "path of length k exists."
Many GNNs inject structural information by adding Laplacian eigenvectors to node features. This modifies the tokens. Graphormer's spatial encoding modifies the attention scores directly — it tells the communication mechanism itself how to weight each node pair, rather than hoping the model infers this from modified features. More direct, more expressive.
A 7-node graph. The heatmap shows the attention bias b_SPD(i,j) for every node pair. Click a node to highlight its row/column — see how its attention to near neighbors (SPD=1) differs from far nodes (SPD=3,4). Drag the bias sliders to see how learned biases shape the attention pattern.
Spatial encoding tells the model how far apart two nodes are. But it loses information about what's along the path. In a molecule, knowing that two atoms are 3 bonds apart is useful, but knowing those bonds are single-single-double tells you much more about the electronic environment.
Graphormer's edge encoding addresses this. For each pair (i, j), find the shortest path between them: i → e₁ → v₁ → e₂ → v₂ → ... → j. Each edge eₙ on this path has a feature vector xeₙ (bond type, bond length, etc.). The edge encoding computes an additional attention bias by averaging learned projections of these edge features:
Where Nij is the length of the shortest path between i and j, xeₙ is the raw feature of the n-th edge on the path, and wn is a learned weight vector for position n in the path. The scalar cij is added to the attention score:
Pre-computing edge encodings requires finding shortest paths between all node pairs — O(n² · path_length). For molecular graphs (n ≈ 20–50 atoms), this is cheap. For larger graphs, it can be expensive and is one of the key computational costs of Graphormer.
The learned weights wn are position-specific: w1 for the first edge on the path, w2 for the second, etc. This is like a position-aware pooling — the model can learn to weight the edge closest to i differently from the edge closest to j. In practice, a maximum path length is set (e.g., 20), and paths longer than this use only the first 20 edges.
Now we can write the complete Graphormer computation. It's a standard Transformer with three injections at specific points:
Node features projected to d dimensions, plus in-degree and out-degree centrality embeddings. This is done once before layer 1.
Spatial bias bSPD and edge bias cij are pre-computed from graph structure and added to every attention head in every layer (with separate learned biases per head).
Standard pre-norm Transformer residual block. No changes from BERT here.
For a graph with n nodes, d model dimensions, H attention heads:
| Tensor | Shape | Description |
|---|---|---|
| H⁰ (input) | n × d | Node features after centrality encoding |
| Q, K, V | n × d (each) | Projected queries, keys, values |
| A (attention logits) | n × n | Dot products + SPD bias + edge bias |
| Attn (output) | n × d | Weighted sum of values |
| HL (final) | n × d | Node representations after L layers |
| Graph repr. | 1 × d | Output of virtual node (see Ch 5) |
Step through the full computation: input, centrality, attention with biases, FFN, and pooling. Click Next Step to advance.
Graph-level prediction requires pooling all node representations into a single graph-level vector. Standard options — mean pooling, sum pooling, max pooling — all have problems: they're permutation-invariant by construction but don't know which nodes are important, and they're fixed operations that can't be learned.
Graphormer borrows BERT's elegant solution: a virtual node [VNode] connected to every node in the graph. Like BERT's [CLS] token, it participates in attention and its final representation serves as the graph-level embedding.
The virtual node is implemented by assigning it a special distance value in the spatial encoding. Real nodes are at distances SPD = 1, 2, 3, ... from each other. The virtual node is assigned SPD = 0 from itself and SPD = -1 (a special token) from all real nodes. This gives it a unique learned attention bias and lets the model learn how strongly to weight the virtual node's participation.
The virtual node's centrality encoding also uses a special degree value (it's connected to all n nodes, so its degree is n — usually its own learned embedding vector rather than the n-th degree embedding, to avoid conflating it with high-degree real nodes).
After L Transformer layers, the virtual node has gathered information from all real nodes via attention — but not uniformly. High-degree real nodes contributed more (via their centrality encoding), nearby nodes contributed more (via spatial encoding), and nodes along chemically rich paths contributed differently (via edge encoding). The final [VNode] representation is a learned, structure-aware summary of the entire graph.
A 7-node graph with a virtual node (VNode) connected to all real nodes. The bar chart shows how much each real node contributes to the final VNode representation — taller = more influence. Click "Run Attention" to animate the aggregation.
Graphormer was evaluated on two main benchmarks: the OGB-Mol suite (molecular property prediction) and the OGB-LSC competition (PCQM4M, the large-scale molecular dataset). On both, it substantially outperformed the best GNNs at the time.
| Dataset | Task | Metric | GIN-VN (best GNN) | Graphormer |
|---|---|---|---|---|
| ogbg-molhiv | Binary cls | AUC ↑ | 77.07% | 80.51% |
| ogbg-molpcba | Multi-label | AP ↑ | 27.03% | 31.39% |
| ogbg-moltox21 | Toxicity | AUC ↑ | 75.58% | 76.36% |
| ZINC | Regression | MAE ↓ | 0.252 | 0.122 |
On ZINC — a popular benchmark with 250K molecules — Graphormer achieves MAE of 0.122, roughly half the error of the best GNN baseline (0.252). This is a substantial gap for a benchmark that has been intensely optimized for years.
PCQM4M contains 3.8 million molecules with DFT-computed quantum chemistry properties (HOMO-LUMO gap). This is the largest molecular ML benchmark available at the time. Graphormer won 1st place at KDD Cup 2021:
| Model | Valid MAE ↓ | Notes |
|---|---|---|
| GIN-VN (SOTA GNN) | 0.1395 | Previous best |
| GRPE | 0.0898 | Another graph Transformer |
| Graphormer | 0.0864 | 1st place, KDD Cup 2021 |
The winning Graphormer model for PCQM4M used 12 layers, 32 attention heads, d = 768 dimensions — similar in size to BERT-Base. Training on PCQM4M took about 3 days on 8 NVIDIA V100 GPUs. The model has approximately 47M parameters. By comparison, a GIN-VN of equivalent performance might be 10× smaller but 5× less accurate.
MAE on ZINC (lower is better). Use the selector to compare methods. Graphormer nearly halves the error of the best GNN.
Graphormer's empirical success is impressive. But can we understand why it works theoretically? The paper provides a formal analysis showing that Graphormer is at least as expressive as any 1-WL GNN — and strictly more expressive in many cases.
From Lec 6 (Theory of GNNs): any message-passing GNN is upper-bounded by the Weisfeiler-Leman (1-WL) graph isomorphism test. Two graphs that the 1-WL test cannot distinguish will receive identical representations from any GNN — regardless of architecture, depth, or width.
1-WL fails on many simple graphs. For example: two regular graphs (all nodes have the same degree) with the same degree sequence but different global structure. Or any graph where all nodes have identical k-hop neighborhoods for some k.
Graphormer's spatial encoding computes SPD(i, j) for all pairs — this is equivalent to knowing the entire distance matrix of the graph. Two non-isomorphic graphs can have identical local neighborhoods (fooling 1-WL) but different distance matrices. The spatial encoding distinguishes them.
For molecules, 1-WL distinguishes most pairs — but not all. Ring systems that differ in how distant atoms relate to each other can fool 1-WL. For example, cyclohexane (6-ring, all carbons) and two fused 3-rings have different distance matrices but potentially similar local neighborhoods. Graphormer's spatial encoding catches this.
More practically: even when 1-WL would technically distinguish two graphs, Graphormer learns to do so more efficiently — the spatial and edge encodings provide a direct signal that would require many GNN layers to approximate through message passing.
Graphormer is powerful but has real constraints that limit its applicability. Understanding these is essential for choosing when to use it.
Full self-attention over all n nodes costs O(n²·d) per layer. The attention matrix A is n×n — 400 MB at float32 for n=10,000. The spatial encoding matrix (precomputed SPD for all pairs) is also n×n. For molecular graphs (n ≈ 5–50), this is trivial. For citation networks (n = 169,343 for ogbn-arxiv), it's completely infeasible.
This is not a solvable problem within Graphormer's framework — it's a fundamental architectural constraint. O(n²) attention requires O(n²) memory. Workarounds (sparse attention, linear attention) require architectural changes that modify Graphormer into a different model.
Graphormer was designed and evaluated for graph-level prediction (a single property per graph, like molecular solubility). For node-level tasks (predicting a property per node, like citation category), Graphormer is less natural — the virtual node approach produces a single graph embedding, not per-node embeddings.
In principle, the node representations HL from the final layer could be used for node prediction. But the lack of positional encoding (no Laplacian PE — Graphormer uses SPD-based attention bias instead) means the node representations may be less distinctive for node-level discrimination than RWPE-augmented GPS.
Computing SPD(i, j) for all pairs and finding shortest path edge sequences requires running all-pairs shortest paths (APSP). For unweighted graphs, this is O(n · |E|) using BFS from every node. For n = 1,000 and |E| = 10,000, that's 10M operations — manageable offline. But for dynamic graphs (where edges change), re-running APSP at every timestep is prohibitive.
| Limitation | Impact | Workaround |
|---|---|---|
| O(n²) attention | Infeasible for n > ~5K | Switch to GPS with linear attention |
| Graph-level only | Awkward for node tasks | Use node representations HL directly |
| APSP pre-computation | Expensive for large/dynamic graphs | Approximate SPD or use RWPE instead |
| Large model size | 47M params, needs large training set | Pretrain on large corpus (e.g., PCQM4M) |
Graphormer sits at an important crossroads in the evolution of graph learning. Understanding its place in the broader landscape helps you pick the right tool and see where the field is going.
Graphormer is the central paper of CS224W Lecture 8 (Graph Transformers). Its three encodings — centrality, spatial, edge — exemplify the general principle that structural encoding is what makes Transformers graph-aware. The lecture generalizes this to other approaches (Laplacian PE, RWPE) and to GPS, which extends Graphormer's ideas to larger graphs by combining MPNN + attention.
"It seems that if one is working from the point of view of getting beauty in one's equations, and if one has really a sound insight, one is on a sure line of progress." — Paul Dirac
Paper: Ying et al. "Do Transformers Really Perform Bad for Graph Representation?" NeurIPS 2021. arXiv 2106.05234