A recipe: combine a local MPNN (for graph structure) with a global Transformer (for long-range dependencies), add positional encoding to make the Transformer graph-aware. Run both in every layer.
Two families of graph neural networks have emerged, each with a fundamental flaw:
Pure MPNNs (GCN, GAT, GraphSAGE): Efficient on graphs, capture local structure, scale well. But their receptive field is limited to L-hop neighborhoods. A 3-layer GCN sees only 3-hop neighbors. For tasks requiring reasoning about nodes that are 10 hops apart (long-range dependencies), you'd need 10+ layers — causing over-smoothing (all embeddings collapse to the same value) and vanishing gradients.
Pure Transformers on graphs: Every node attends to every other node — O(N²) computation, but no position blindness. The problem: standard Transformers are permutation-invariant. Give it the same graph with nodes numbered differently, it produces different outputs. They don't know about graph topology — adjacency, shortest paths, cycle membership — unless you tell them explicitly.
As you stack MPNN layers to increase the receptive field, the message from a distant node must flow through many intermediate nodes. Each intermediate node aggregates many neighbors' messages into a fixed-size vector. This over-squashing compresses exponentially many source messages into one vector — information is lost. Long-range dependencies are not just slow to reach with MPNNs, they're actively compressed away.
How many nodes does a GNN "see" after L layers on a graph with average degree k? Watch the receptive field explode — but also watch information compress into one fixed-size vector.
The GPS layer is elegantly simple: run an MPNN and a Transformer in parallel on the same node embeddings, then add their outputs.
where N(i) is i's graph neighborhood (local) and V is all nodes in the graph (global). The final MLP applies a feedforward transformation (like the FFN in a Transformer) before the next layer.
Let's watch GPS process a graph and see concretely what each branch contributes. A key question: when are they different? When is the Transformer attention actually adding something beyond what the MPNN already computed?
A graph with 8 nodes. Orange glow = attention weight from the selected node (long-range). Teal glow = MPNN message (only immediate neighbors). Toggle between views or see both at once. Select a node to start.
Consider a molecule with two reactive sites at opposite ends (10 bonds apart). The MPNN at layer 1 only sees each site's immediate neighbors. Even at layer 5, information about site A is compressed and distorted by the time it reaches site B. The Transformer's attention lets site A directly query site B at every layer — no compression, no distortion.
In practice, this matters most for: bond angle prediction (non-adjacent atoms interact), reaction outcome prediction (distant functional groups affect each other), and graph classification tasks where the entire molecular motif matters.
A Transformer on graphs is permutation-invariant by default — it doesn't know about graph structure. If you give it two different graphs with the same node features (different topology), it produces the same output. This is clearly wrong for graph tasks.
Positional Encodings (PEs) solve this: append a graph-structural descriptor to each node's feature vector before it enters the GPS layers. Now the Transformer can distinguish nodes by their structural position in the graph.
Compute the k smallest non-trivial eigenvectors of the normalized graph Laplacian L = I − D−½AD−½. The i-th node's positional encoding is the concatenation of these k eigenvector values at position i.
where uj[i] is the i-th component of the j-th Laplacian eigenvector. These encode the node's "spectral position" — nodes with similar topological roles (same distance from hub nodes, similar cycle membership) will have similar LapPEs.
The alternative PE used in GPS: compute random walk landing probabilities. For node i, RWPEk(i) = (RWk)ii — the probability of returning to node i after exactly k steps of a random walk starting from i.
RWPE is deterministic (no sign ambiguity) and has a natural interpretation: nodes in dense clusters have high return probabilities at short walk lengths; nodes on long paths have high probabilities only at longer walks. It captures local community structure naturally.
Node color intensity = PE value at dimension k. For LapPE (top): 2nd and 3rd eigenvectors. For RWPE (bottom): k=1 and k=3 return probabilities. Different nodes get different PE vectors despite having the same node features.
Full O(N²) Transformer attention is the bottleneck for large graphs. GPS addresses this with plug-in scalable attention mechanisms — the framework doesn't require full quadratic attention.
GPS treats the attention module as a hyperparameter. Any efficient Transformer variant can replace the standard O(N²) attention:
Empirically: full attention (O(N²)) performs best but is limited to graphs with N < ~1,000 nodes in a GPU batch. BigBird extends this to N ~ 10,000+ nodes while retaining most of the performance gain.
Computational cost as graph size N grows. Full attention scales as N². MPNN scales as |E| ≈ N × avg_degree. BigBird/Performer scale as N. Adjust graph size to see crossover points.
One of GPS's core contributions is formalizing a design space for graph Transformers. Rather than proposing one fixed architecture, the paper provides a modular framework where each component is a choice.
1. Local MPNN type: GCN, GAT, GIN, GINE, PNA, or none. GIN (Graph Isomorphism Network) is provably maximally expressive for MPNNs. PNA (Principal Neighbourhood Aggregation) uses multiple aggregators (sum, max, std) and degree scalers. The choice depends on the task and graph type.
2. Global Attention type: Full, BigBird, Performer, Linformer, or none. Full attention for small graphs (<1000 nodes), efficient variants for larger.
3. Positional Encoding type: None (ablation), LapPE, RWPE, SignNet, EquivStableLapPE. The PE is the most impactful choice for Transformer performance on graphs.
| Configuration | ZINC MAE ↓ | Peptides-func AP ↑ |
|---|---|---|
| No PE, No Attn (pure MPNN) | 0.083 | 0.623 |
| No PE, Full Attn | 0.110 | 0.651 |
| LapPE, No Attn | 0.074 | 0.632 |
| RWPE, No Attn | 0.071 | 0.638 |
| LapPE + Full Attn (GPS) | 0.060 | 0.668 |
| RWPE + Full Attn (GPS) | 0.057 | 0.672 |
Key insight: PE alone helps (row 3 vs row 1). Attention alone without PE hurts vs pure MPNN (row 2 vs row 1) — because without PE, attention is graph-blind and adds noise. PE + Attention together gives the best results.
GPS is evaluated on the Long Range Graph Benchmark (LRGB) — a suite of 5 datasets specifically designed to require long-range reasoning. These are the hardest tests for MPNN-only models.
| Dataset | Task | GCN | GAT | GIN | GPS |
|---|---|---|---|---|---|
| Peptides-func | Graph classif. | 0.5930 | 0.5864 | 0.5498 | 0.6715 |
| Peptides-struct | Graph regress. | 0.3496 | 0.3557 | 0.3547 | 0.2509 |
| PCQM-Contact | Link predict. | 0.318 | 0.310 | 0.311 | 0.395 |
| Pascal-VOC-SP | Node classif. | 0.1268 | 0.2540 | 0.1265 | 0.3748 |
| COCO-SP | Node classif. | 0.0841 | 0.2235 | 0.0908 | 0.3412 |
On the standard OGB benchmarks (not specifically long-range), GPS is competitive but not always the absolute best — specialized models can still win on familiar datasets. GPS's advantage is breadth: it's consistently strong across long-range, standard, and heterogeneous settings without needing task-specific tricks.
| Dataset | GIN | PNA | Graphormer | GPS |
|---|---|---|---|---|
| ZINC (MAE↓) | 0.163 | 0.142 | 0.122 | 0.057 |
| MOLHIV (AUC↑) | 75.6 | 79.1 | 80.5 | 81.5 |
O(N²) attention bottleneck: Even with BigBird alternatives, getting the full benefit of GPS requires full attention, which limits graphs to ~1,000 nodes per GPU batch. Truly large graphs (social networks with millions of nodes) require aggressive sampling that may lose the long-range advantage.
Positional encoding compute: LapPE requires computing eigenvectors of the graph Laplacian — O(N³) in general, though fast approximations exist. RWPE requires matrix powers — O(N² × k). Both are precomputed, but add preprocessing cost.
Homogeneous graphs only (base version): GPS as presented handles graphs with one node type and one edge type. Extension to heterogeneous graphs (like HGT's setting) requires type-specific modifications.
| Model | Attention Scope | PE | Local Message Passing |
|---|---|---|---|
| Graphormer | Global (full) | Structural (degree, path) | None |
| Graph-BERT | Subgraph sampling | Positional role | None |
| SAT | Global | Subgraph encodings | GNN-based PE |
| GPS | Global + Local MPNN | LapPE / RWPE | Any MPNN (parallel) |
| Exphormer (2023) | Sparse virtual edges | RWPE | MPNN (GPS-style) |
"Local structure and global context are not in competition — they are complementary. The right model uses both, always."
— GPS design philosophy