Rampasek et al. — Mila / McGill University, NeurIPS 2022

GPS: General, Powerful, Scalable Graph Transformer

A recipe: combine a local MPNN (for graph structure) with a global Transformer (for long-range dependencies), add positional encoding to make the Transformer graph-aware. Run both in every layer.

Prerequisites: Message Passing GNNs (GCN, GAT) + Transformer (attention, positional encoding) + Graph expressiveness (WL test basics)
8
Chapters
4+
Simulations

Chapter 0: The Problem

Two families of graph neural networks have emerged, each with a fundamental flaw:

Pure MPNNs (GCN, GAT, GraphSAGE): Efficient on graphs, capture local structure, scale well. But their receptive field is limited to L-hop neighborhoods. A 3-layer GCN sees only 3-hop neighbors. For tasks requiring reasoning about nodes that are 10 hops apart (long-range dependencies), you'd need 10+ layers — causing over-smoothing (all embeddings collapse to the same value) and vanishing gradients.

Pure Transformers on graphs: Every node attends to every other node — O(N²) computation, but no position blindness. The problem: standard Transformers are permutation-invariant. Give it the same graph with nodes numbered differently, it produces different outputs. They don't know about graph topology — adjacency, shortest paths, cycle membership — unless you tell them explicitly.

The fundamental tension: MPNNs know graph structure perfectly but can't capture long-range dependencies. Transformers capture long-range dependencies but are graph-blind. GPS says: use both, in every layer. Let MPNN handle local structure; let Transformer handle global dependencies; use positional encoding to inject graph structure into the Transformer.

The "Over-Squashing" Problem of Deep MPNNs

As you stack MPNN layers to increase the receptive field, the message from a distant node must flow through many intermediate nodes. Each intermediate node aggregates many neighbors' messages into a fixed-size vector. This over-squashing compresses exponentially many source messages into one vector — information is lost. Long-range dependencies are not just slow to reach with MPNNs, they're actively compressed away.

Receptive Field Growth and Over-Squashing

How many nodes does a GNN "see" after L layers on a graph with average degree k? Watch the receptive field explode — but also watch information compress into one fixed-size vector.

GNN layers L 3
Average degree k 5
Why do pure MPNNs struggle with long-range dependencies even when stacked deeply?

Chapter 1: The GPS Layer

The GPS layer is elegantly simple: run an MPNN and a Transformer in parallel on the same node embeddings, then add their outputs.

hl+1i = MLP( LocalMPNNl(hli, N(i)) + GlobalAttnl(hli, V) )

where N(i) is i's graph neighborhood (local) and V is all nodes in the graph (global). The final MLP applies a feedforward transformation (like the FFN in a Transformer) before the next layer.

What Each Component Contributes

Local MPNN
Processes the actual graph edges. Learns: which atoms are bonded, how features propagate through local structure, graph isomorphism up to the WL test limit. Fast: O(|E|) per layer.

Implementation: any MPNN — GCN, GAT, GIN, GINE. The choice is a hyperparameter.
Global Attention
Processes all node pairs. Learns: long-range correlations, global context, cross-cluster information. Not limited by hop count. Slow: O(N²) per layer.

Implementation: standard multi-head self-attention from Transformers. Nodes = tokens.
Why parallel, not sequential? You could run MPNN then Transformer (or vice versa) at each layer. But the paper finds parallel works better — neither operation is "upstream" of the other within a layer. They see the same input hli and contribute complementary views. Sequential would let one view "contaminate" the other before it forms its own view.

The Full GPS Layer

Input: hli + PEi
Node embeddings + positional encoding (injected here)
↓ parallel branches
MPNN branch
Aggregate from N(i) via edges. Output: ΔhMPNNi
Attention branch
Attend over all nodes V. Output: ΔhAttni
↓ add outputs
Combined
h = ΔhMPNNi + ΔhAttni
↓ MLP + LayerNorm + skip
Output: hl+1i
Ready for next GPS layer or prediction head
In the GPS layer, how are the MPNN and Attention branches combined?

Chapter 2: MPNN + Attention Running Together (Showcase)

Let's watch GPS process a graph and see concretely what each branch contributes. A key question: when are they different? When is the Transformer attention actually adding something beyond what the MPNN already computed?

GPS Processing — Interactive Layer Walk

A graph with 8 nodes. Orange glow = attention weight from the selected node (long-range). Teal glow = MPNN message (only immediate neighbors). Toggle between views or see both at once. Select a node to start.

Select a node to see how MPNN (neighbors only) and Attention (all nodes) give it different information.

The Key Scenario: When Attention Matters Most

Consider a molecule with two reactive sites at opposite ends (10 bonds apart). The MPNN at layer 1 only sees each site's immediate neighbors. Even at layer 5, information about site A is compressed and distorted by the time it reaches site B. The Transformer's attention lets site A directly query site B at every layer — no compression, no distortion.

In practice, this matters most for: bond angle prediction (non-adjacent atoms interact), reaction outcome prediction (distant functional groups affect each other), and graph classification tasks where the entire molecular motif matters.

For a node that has NO graph neighbors (an isolated node), what does the GPS MPNN branch contribute?

Chapter 3: Positional Encodings

A Transformer on graphs is permutation-invariant by default — it doesn't know about graph structure. If you give it two different graphs with the same node features (different topology), it produces the same output. This is clearly wrong for graph tasks.

Positional Encodings (PEs) solve this: append a graph-structural descriptor to each node's feature vector before it enters the GPS layers. Now the Transformer can distinguish nodes by their structural position in the graph.

Laplacian PE (LapPE)

Compute the k smallest non-trivial eigenvectors of the normalized graph Laplacian L = I − D−½AD−½. The i-th node's positional encoding is the concatenation of these k eigenvector values at position i.

PELapPE(i) = [u1[i], u2[i], ..., uk[i]] ∈ ℝk

where uj[i] is the i-th component of the j-th Laplacian eigenvector. These encode the node's "spectral position" — nodes with similar topological roles (same distance from hub nodes, similar cycle membership) will have similar LapPEs.

Sign ambiguity: Each eigenvector uj can be flipped (−uj is also a valid eigenvector). LapPE encodings have this sign ambiguity — the absolute sign is arbitrary. The paper handles this by randomly flipping signs during training so the model learns sign-invariant features. SignNet (Lim et al., 2022) provides a principled sign-equivariant MLP for this.

Random Walk PE (RWPE)

The alternative PE used in GPS: compute random walk landing probabilities. For node i, RWPEk(i) = (RWk)ii — the probability of returning to node i after exactly k steps of a random walk starting from i.

PERWPE(i) = [(RW)ii, (RW2)ii, ..., (RWk)ii] ∈ ℝk

RWPE is deterministic (no sign ambiguity) and has a natural interpretation: nodes in dense clusters have high return probabilities at short walk lengths; nodes on long paths have high probabilities only at longer walks. It captures local community structure naturally.

Laplacian PE vs RWPE — Visualization

Node color intensity = PE value at dimension k. For LapPE (top): 2nd and 3rd eigenvectors. For RWPE (bottom): k=1 and k=3 return probabilities. Different nodes get different PE vectors despite having the same node features.

PE dimension k 2
What problem does Laplacian PE solve in the GPS framework?

Chapter 4: Scalability

Full O(N²) Transformer attention is the bottleneck for large graphs. GPS addresses this with plug-in scalable attention mechanisms — the framework doesn't require full quadratic attention.

BigBird / Performer as Drop-In Replacements

GPS treats the attention module as a hyperparameter. Any efficient Transformer variant can replace the standard O(N²) attention:

Empirically: full attention (O(N²)) performs best but is limited to graphs with N < ~1,000 nodes in a GPU batch. BigBird extends this to N ~ 10,000+ nodes while retaining most of the performance gain.

The MPNN is always O(|E|): Even when global attention is approximated or absent, the MPNN branch still gives you graph-structure-aware processing at linear cost. So GPS degrades gracefully: if you turn off global attention entirely, you have a standard MPNN with positional encoding — which is already better than vanilla MPNN.
Scalability: Attention Cost vs Graph Size

Computational cost as graph size N grows. Full attention scales as N². MPNN scales as |E| ≈ N × avg_degree. BigBird/Performer scale as N. Adjust graph size to see crossover points.

Max graph size N 1000
Avg degree k 8
If GPS is configured with a BigBird attention instead of full attention, what is the computational cost per GPS layer (for graph size N)?

Chapter 5: The Design Space

One of GPS's core contributions is formalizing a design space for graph Transformers. Rather than proposing one fixed architecture, the paper provides a modular framework where each component is a choice.

The Three Axes of Design

1. Local MPNN type: GCN, GAT, GIN, GINE, PNA, or none. GIN (Graph Isomorphism Network) is provably maximally expressive for MPNNs. PNA (Principal Neighbourhood Aggregation) uses multiple aggregators (sum, max, std) and degree scalers. The choice depends on the task and graph type.

2. Global Attention type: Full, BigBird, Performer, Linformer, or none. Full attention for small graphs (<1000 nodes), efficient variants for larger.

3. Positional Encoding type: None (ablation), LapPE, RWPE, SignNet, EquivStableLapPE. The PE is the most impactful choice for Transformer performance on graphs.

Why a design space matters: Before GPS, every paper proposed a unique, inseparable architecture. You couldn't take the attention from paper A and the PE from paper B — they were baked together. GPS's modular framework makes ablations clean: "does LapPE help more than RWPE? Does GIN as MPNN beat GAT?" These questions have clear answers in GPS.

Ablation Summary

ConfigurationZINC MAE ↓Peptides-func AP ↑
No PE, No Attn (pure MPNN)0.0830.623
No PE, Full Attn0.1100.651
LapPE, No Attn0.0740.632
RWPE, No Attn0.0710.638
LapPE + Full Attn (GPS)0.0600.668
RWPE + Full Attn (GPS)0.0570.672

Key insight: PE alone helps (row 3 vs row 1). Attention alone without PE hurts vs pure MPNN (row 2 vs row 1) — because without PE, attention is graph-blind and adds noise. PE + Attention together gives the best results.

According to GPS's ablations, what happens when you add full Transformer attention but WITHOUT any positional encoding?

Chapter 6: Results

GPS is evaluated on the Long Range Graph Benchmark (LRGB) — a suite of 5 datasets specifically designed to require long-range reasoning. These are the hardest tests for MPNN-only models.

Long Range Graph Benchmark (LRGB)

DatasetTaskGCNGATGINGPS
Peptides-funcGraph classif.0.59300.58640.54980.6715
Peptides-structGraph regress.0.34960.35570.35470.2509
PCQM-ContactLink predict.0.3180.3100.3110.395
Pascal-VOC-SPNode classif.0.12680.25400.12650.3748
COCO-SPNode classif.0.08410.22350.09080.3412
The long-range benchmark is designed for exactly this: Peptides are small molecules where atoms at opposite ends of the chain affect each other's biological activity. PCQM-Contact requires predicting contacts between distant parts of protein structures. Standard MPNNs fail these tasks because they literally cannot propagate information far enough without over-squashing.

Standard Benchmarks (ZINC, MOLHIV)

On the standard OGB benchmarks (not specifically long-range), GPS is competitive but not always the absolute best — specialized models can still win on familiar datasets. GPS's advantage is breadth: it's consistently strong across long-range, standard, and heterogeneous settings without needing task-specific tricks.

DatasetGINPNAGraphormerGPS
ZINC (MAE↓)0.1630.1420.1220.057
MOLHIV (AUC↑)75.679.180.581.5
On the Long Range Graph Benchmark, GPS's improvement over GCN is largest on which type of task?

Chapter 7: Connections & Beyond

Limitations

O(N²) attention bottleneck: Even with BigBird alternatives, getting the full benefit of GPS requires full attention, which limits graphs to ~1,000 nodes per GPU batch. Truly large graphs (social networks with millions of nodes) require aggressive sampling that may lose the long-range advantage.

Positional encoding compute: LapPE requires computing eigenvectors of the graph Laplacian — O(N³) in general, though fast approximations exist. RWPE requires matrix powers — O(N² × k). Both are precomputed, but add preprocessing cost.

Homogeneous graphs only (base version): GPS as presented handles graphs with one node type and one edge type. Extension to heterogeneous graphs (like HGT's setting) requires type-specific modifications.

The Graph Transformer Landscape

ModelAttention ScopePELocal Message Passing
GraphormerGlobal (full)Structural (degree, path)None
Graph-BERTSubgraph samplingPositional roleNone
SATGlobalSubgraph encodingsGNN-based PE
GPSGlobal + Local MPNNLapPE / RWPEAny MPNN (parallel)
Exphormer (2023)Sparse virtual edgesRWPEMPNN (GPS-style)
The design space contribution is GPS's most lasting impact: By modularizing the three components (MPNN, attention, PE), GPS enabled systematic ablations across many concurrent graph Transformer papers. It showed that PE is the critical ingredient — not attention alone. This insight redirected the community toward developing better PEs (SignNet, EquivStableLapPE, RRWP) rather than just novel attention mechanisms.

Related Lessons

"Local structure and global context are not in competition — they are complementary. The right model uses both, always."
— GPS design philosophy