GPS — Graph Transformer

Chapter 0: The Problem

Two families of graph neural networks have emerged, each with a fundamental flaw:

Pure MPNNs (GCN, GAT, GraphSAGE): Efficient on graphs, capture local structure, scale well. But their receptive field is limited to L-hop neighborhoods. A 3-layer GCN sees only 3-hop neighbors. For tasks requiring reasoning about nodes that are 10 hops apart (long-range dependencies), you'd need 10+ layers — causing over-smoothing (all embeddings collapse to the same value) and vanishing gradients.

Pure Transformers on graphs: Every node attends to every other node — O(N²) computation, but no position blindness. The problem: standard Transformers are permutation-invariant. Give it the same graph with nodes numbered differently, it produces different outputs. They don't know about graph topology — adjacency, shortest paths, cycle membership — unless you tell them explicitly.

The fundamental tension: MPNNs know graph structure perfectly but can't capture long-range dependencies. Transformers capture long-range dependencies but are graph-blind. GPS says: use both, in every layer. Let MPNN handle local structure; let Transformer handle global dependencies; use positional encoding to inject graph structure into the Transformer.

The "Over-Squashing" Problem of Deep MPNNs

As you stack MPNN layers to increase the receptive field, the message from a distant node must flow through many intermediate nodes. Each intermediate node aggregates many neighbors' messages into a fixed-size vector. This over-squashing compresses exponentially many source messages into one vector — information is lost. Long-range dependencies are not just slow to reach with MPNNs, they're actively compressed away.

Receptive Field Growth and Over-Squashing

How many nodes does a GNN "see" after L layers on a graph with average degree k? Watch the receptive field explode — but also watch information compress into one fixed-size vector.

GNN layers L 3

Average degree k 5

Why do pure MPNNs struggle with long-range dependencies even when stacked deeply?

Deep MPNNs are too slow to train — the computational cost is prohibitive Deep MPNNs can't run on graphs with varying node counts Over-squashing: exponentially many source messages are compressed into fixed-size vectors at each layer — information from distant nodes is lossy, and over-smoothing makes deep embeddings identical

Chapter 1: The GPS Layer

The GPS layer is elegantly simple: run an MPNN and a Transformer in parallel on the same node embeddings, then add their outputs.

h^l+1_i = MLP( LocalMPNN^l(h^l_i, N(i)) + GlobalAttn^l(h^l_i, V) )

where N(i) is i's graph neighborhood (local) and V is all nodes in the graph (global). The final MLP applies a feedforward transformation (like the FFN in a Transformer) before the next layer.

What Each Component Contributes

Local MPNN
Processes the actual graph edges. Learns: which atoms are bonded, how features propagate through local structure, graph isomorphism up to the WL test limit. Fast: O(|E|) per layer.

Implementation: any MPNN — GCN, GAT, GIN, GINE. The choice is a hyperparameter.

Global Attention
Processes all node pairs. Learns: long-range correlations, global context, cross-cluster information. Not limited by hop count. Slow: O(N²) per layer.

Implementation: standard multi-head self-attention from Transformers. Nodes = tokens.

Why parallel, not sequential? You could run MPNN then Transformer (or vice versa) at each layer. But the paper finds parallel works better — neither operation is "upstream" of the other within a layer. They see the same input h^l_i and contribute complementary views. Sequential would let one view "contaminate" the other before it forms its own view.

The Full GPS Layer

Input: h^l_i + PE_i

Node embeddings + positional encoding (injected here)

↓ parallel branches

MPNN branch

Aggregate from N(i) via edges. Output: Δh^MPNN_i

Attention branch

Attend over all nodes V. Output: Δh^Attn_i

↓ add outputs

Combined

h = Δh^MPNN_i + Δh^Attn_i

↓ MLP + LayerNorm + skip

Output: h^l+1_i

Ready for next GPS layer or prediction head

In the GPS layer, how are the MPNN and Attention branches combined?

Sequential: MPNN output feeds into Attention as input Gated: a learned gate decides which branch output to use Parallel and additive: both run on the same input, their outputs are summed, then passed through an MLP

Chapter 2: MPNN + Attention Running Together (Showcase)

Let's watch GPS process a graph and see concretely what each branch contributes. A key question: when are they different? When is the Transformer attention actually adding something beyond what the MPNN already computed?

GPS Processing — Interactive Layer Walk

A graph with 8 nodes. Orange glow = attention weight from the selected node (long-range). Teal glow = MPNN message (only immediate neighbors). Toggle between views or see both at once. Select a node to start.

Select a node to see how MPNN (neighbors only) and Attention (all nodes) give it different information.

The Key Scenario: When Attention Matters Most

Consider a molecule with two reactive sites at opposite ends (10 bonds apart). The MPNN at layer 1 only sees each site's immediate neighbors. Even at layer 5, information about site A is compressed and distorted by the time it reaches site B. The Transformer's attention lets site A directly query site B at every layer — no compression, no distortion.

In practice, this matters most for: bond angle prediction (non-adjacent atoms interact), reaction outcome prediction (distant functional groups affect each other), and graph classification tasks where the entire molecular motif matters.

For a node that has NO graph neighbors (an isolated node), what does the GPS MPNN branch contribute?

Nothing — GPS fails on disconnected graphs The same as a fully-connected node — MPNN treats isolation specially Only the self-loop term (h_i itself, no neighbor aggregation) — but the Attention branch still attends to all other nodes

Chapter 3: Positional Encodings

A Transformer on graphs is permutation-invariant by default — it doesn't know about graph structure. If you give it two different graphs with the same node features (different topology), it produces the same output. This is clearly wrong for graph tasks.

Positional Encodings (PEs) solve this: append a graph-structural descriptor to each node's feature vector before it enters the GPS layers. Now the Transformer can distinguish nodes by their structural position in the graph.

Laplacian PE (LapPE)

Compute the k smallest non-trivial eigenvectors of the normalized graph Laplacian L = I − D^−½AD^−½. The i-th node's positional encoding is the concatenation of these k eigenvector values at position i.

PE_LapPE(i) = [u₁[i], u₂[i], ..., u_k[i]] ∈ ℝ^k

where u_j[i] is the i-th component of the j-th Laplacian eigenvector. These encode the node's "spectral position" — nodes with similar topological roles (same distance from hub nodes, similar cycle membership) will have similar LapPEs.

Sign ambiguity: Each eigenvector u_j can be flipped (−u_j is also a valid eigenvector). LapPE encodings have this sign ambiguity — the absolute sign is arbitrary. The paper handles this by randomly flipping signs during training so the model learns sign-invariant features. SignNet (Lim et al., 2022) provides a principled sign-equivariant MLP for this.

Random Walk PE (RWPE)

The alternative PE used in GPS: compute random walk landing probabilities. For node i, RWPE_k(i) = (RW^k)_ii — the probability of returning to node i after exactly k steps of a random walk starting from i.

PE_RWPE(i) = [(RW)_ii, (RW²)_ii, ..., (RW^k)_ii] ∈ ℝ^k

RWPE is deterministic (no sign ambiguity) and has a natural interpretation: nodes in dense clusters have high return probabilities at short walk lengths; nodes on long paths have high probabilities only at longer walks. It captures local community structure naturally.

Laplacian PE vs RWPE — Visualization

Node color intensity = PE value at dimension k. For LapPE (top): 2nd and 3rd eigenvectors. For RWPE (bottom): k=1 and k=3 return probabilities. Different nodes get different PE vectors despite having the same node features.

PE dimension k 2

What problem does Laplacian PE solve in the GPS framework?

It speeds up the MPNN branch by precomputing neighbor features It makes the Transformer branch graph-aware — without PE, the Transformer is permutation-invariant and can't distinguish nodes with different topological roles even if they have different structural positions It replaces the adjacency matrix in the MPNN branch

Chapter 4: Scalability

Full O(N²) Transformer attention is the bottleneck for large graphs. GPS addresses this with plug-in scalable attention mechanisms — the framework doesn't require full quadratic attention.

BigBird / Performer as Drop-In Replacements

GPS treats the attention module as a hyperparameter. Any efficient Transformer variant can replace the standard O(N²) attention:

BigBird: Combines window attention (each node attends to nearby tokens) + random attention (random global attention) + global tokens (a few "summarizer" nodes attend to everything). O(N) in total attention cost.
Performer: Approximates softmax attention with random feature maps, reducing O(N²) to O(N). Theoretically principled.
Linformer: Low-rank approximation of the attention matrix. O(N) with bounded error.

Empirically: full attention (O(N²)) performs best but is limited to graphs with N < ~1,000 nodes in a GPU batch. BigBird extends this to N ~ 10,000+ nodes while retaining most of the performance gain.

The MPNN is always O(|E|): Even when global attention is approximated or absent, the MPNN branch still gives you graph-structure-aware processing at linear cost. So GPS degrades gracefully: if you turn off global attention entirely, you have a standard MPNN with positional encoding — which is already better than vanilla MPNN.

Scalability: Attention Cost vs Graph Size

Computational cost as graph size N grows. Full attention scales as N². MPNN scales as |E| ≈ N × avg_degree. BigBird/Performer scale as N. Adjust graph size to see crossover points.

Max graph size N 1000

Avg degree k 8

If GPS is configured with a BigBird attention instead of full attention, what is the computational cost per GPS layer (for graph size N)?

O(N²) — BigBird is just a faster implementation of full attention O(N × k) where k is a constant — O(N) overall — BigBird uses window + random + global attention patterns that grow linearly O(N log N) — BigBird uses hierarchical attention

Chapter 5: The Design Space

One of GPS's core contributions is formalizing a design space for graph Transformers. Rather than proposing one fixed architecture, the paper provides a modular framework where each component is a choice.

The Three Axes of Design

1. Local MPNN type: GCN, GAT, GIN, GINE, PNA, or none. GIN (Graph Isomorphism Network) is provably maximally expressive for MPNNs. PNA (Principal Neighbourhood Aggregation) uses multiple aggregators (sum, max, std) and degree scalers. The choice depends on the task and graph type.

2. Global Attention type: Full, BigBird, Performer, Linformer, or none. Full attention for small graphs (<1000 nodes), efficient variants for larger.

3. Positional Encoding type: None (ablation), LapPE, RWPE, SignNet, EquivStableLapPE. The PE is the most impactful choice for Transformer performance on graphs.

Why a design space matters: Before GPS, every paper proposed a unique, inseparable architecture. You couldn't take the attention from paper A and the PE from paper B — they were baked together. GPS's modular framework makes ablations clean: "does LapPE help more than RWPE? Does GIN as MPNN beat GAT?" These questions have clear answers in GPS.

Ablation Summary

Configuration	ZINC MAE ↓	Peptides-func AP ↑
No PE, No Attn (pure MPNN)	0.083	0.623
No PE, Full Attn	0.110	0.651
LapPE, No Attn	0.074	0.632
RWPE, No Attn	0.071	0.638
LapPE + Full Attn (GPS)	0.060	0.668
RWPE + Full Attn (GPS)	0.057	0.672

Key insight: PE alone helps (row 3 vs row 1). Attention alone without PE hurts vs pure MPNN (row 2 vs row 1) — because without PE, attention is graph-blind and adds noise. PE + Attention together gives the best results.

According to GPS's ablations, what happens when you add full Transformer attention but WITHOUT any positional encoding?

Performance improves — global attention always helps even without PE Performance actually gets worse than pure MPNN — graph-blind attention adds noise rather than signal Performance is unchanged — attention and PE are independent contributions

Chapter 6: Results

GPS is evaluated on the Long Range Graph Benchmark (LRGB) — a suite of 5 datasets specifically designed to require long-range reasoning. These are the hardest tests for MPNN-only models.

Long Range Graph Benchmark (LRGB)

Dataset	Task	GCN	GAT	GIN	GPS
Peptides-func	Graph classif.	0.5930	0.5864	0.5498	0.6715
Peptides-struct	Graph regress.	0.3496	0.3557	0.3547	0.2509
PCQM-Contact	Link predict.	0.318	0.310	0.311	0.395
Pascal-VOC-SP	Node classif.	0.1268	0.2540	0.1265	0.3748
COCO-SP	Node classif.	0.0841	0.2235	0.0908	0.3412

The long-range benchmark is designed for exactly this: Peptides are small molecules where atoms at opposite ends of the chain affect each other's biological activity. PCQM-Contact requires predicting contacts between distant parts of protein structures. Standard MPNNs fail these tasks because they literally cannot propagate information far enough without over-squashing.

Standard Benchmarks (ZINC, MOLHIV)

On the standard OGB benchmarks (not specifically long-range), GPS is competitive but not always the absolute best — specialized models can still win on familiar datasets. GPS's advantage is breadth: it's consistently strong across long-range, standard, and heterogeneous settings without needing task-specific tricks.

Dataset	GIN	PNA	Graphormer	GPS
ZINC (MAE↓)	0.163	0.142	0.122	0.057
MOLHIV (AUC↑)	75.6	79.1	80.5	81.5

On the Long Range Graph Benchmark, GPS's improvement over GCN is largest on which type of task?

Tasks specifically designed to require long-range dependencies (Peptides, PCQM-Contact, SP datasets) — where MPNNs structurally cannot succeed without long-range attention Standard molecular property prediction (ZINC, MOLHIV) — common benchmarks Social network tasks with high-degree nodes

Chapter 7: Connections & Beyond

Limitations

O(N²) attention bottleneck: Even with BigBird alternatives, getting the full benefit of GPS requires full attention, which limits graphs to ~1,000 nodes per GPU batch. Truly large graphs (social networks with millions of nodes) require aggressive sampling that may lose the long-range advantage.

Positional encoding compute: LapPE requires computing eigenvectors of the graph Laplacian — O(N³) in general, though fast approximations exist. RWPE requires matrix powers — O(N² × k). Both are precomputed, but add preprocessing cost.

Homogeneous graphs only (base version): GPS as presented handles graphs with one node type and one edge type. Extension to heterogeneous graphs (like HGT's setting) requires type-specific modifications.

The Graph Transformer Landscape

Model	Attention Scope	PE	Local Message Passing
Graphormer	Global (full)	Structural (degree, path)	None
Graph-BERT	Subgraph sampling	Positional role	None
SAT	Global	Subgraph encodings	GNN-based PE
GPS	Global + Local MPNN	LapPE / RWPE	Any MPNN (parallel)
Exphormer (2023)	Sparse virtual edges	RWPE	MPNN (GPS-style)

The design space contribution is GPS's most lasting impact: By modularizing the three components (MPNN, attention, PE), GPS enabled systematic ablations across many concurrent graph Transformer papers. It showed that PE is the critical ingredient — not attention alone. This insight redirected the community toward developing better PEs (SignNet, EquivStableLapPE, RRWP) rather than just novel attention mechanisms.

Related Lessons

GCN — foundational MPNN that can serve as GPS's local branch
GAT — attention-weighted MPNN (another GPS local branch option)
Graphormer — pure Transformer on graphs (no local MPNN branch)
HGT — heterogeneous graph Transformer (different problem domain)

"Local structure and global context are not in competition — they are complementary. The right model uses both, always."
— GPS design philosophy

GPS: General, Powerful, Scalable Graph Transformer