Transformers were built for sequences. Graphs have no natural order. Here's how we reconcile that — and why the solution unlocks long-range dependencies that GNNs cannot reach.
A Transformer takes a sequence of tokens — words, image patches, whatever — and lets every token attend to every other token simultaneously. No notion of "left neighbor" or "right neighbor." Just a set of elements that all interact with each other in one giant attention operation.
A graph is also a set of elements — nodes — that interact with each other. The edges define which interactions happen. So here's the key insight: a Transformer on a fully-connected graph is exactly a standard Transformer. Every node attends to every other node. A Transformer on a sparse graph — where attention is only allowed between connected nodes — is exactly a GNN with learned attention weights. That's just GAT.
So Graph Transformers aren't a radically new idea. They're the same self-attention mechanism, but applied carefully to graph-structured data. The critical challenge is a single issue: Transformers are permutation-equivariant. If you shuffle the input tokens, the outputs shuffle in the same way. For sequences, the positions tell the model where each token lives. For graphs, there are no positions. Two graphs with different structure look identical if you just hand the model a bag of node features.
Every Graph Transformer architecture answers the same three questions:
Click the buttons to see how attention patterns differ. Full attention (Transformer): every node attends to every other. Sparse attention (GNN/GAT): each node only attends to its neighbors. Hybrid (GPS): both patterns run in parallel.
Before we talk about graphs, let's nail down exactly what self-attention computes. You have n input vectors, each of dimension d. Call them x₁, x₂, ..., xₙ. Three learned weight matrices — WQ, WK, WV — project each xᵢ into a query, key, and value:
The attention score between node i and node j is how much i "wants" information from j — computed as a dot product between i's query and j's key:
The output for node i is a weighted sum of all values, weighted by attention scores:
The √d denominator prevents the dot products from growing large (which would push softmax into near-zero gradient territory). This is standard scaled dot-product attention.
Instead of one set of Q/K/V projections, we run H independent "heads" in parallel. Each head has its own WQ, WK, WV. The outputs of all heads are concatenated and projected back to dimension d:
Why multiple heads? Each head can specialize. One head might learn to attend to chemically similar neighbors. Another might learn to attend to structurally distant nodes. The concatenation lets the model use all these relationships simultaneously.
For a graph with n nodes, each with feature dimension d:
The n×n attention matrix is why full self-attention costs O(n²) memory and compute. For a graph with 100 nodes, this is fine. For a graph with 100,000 nodes, it's completely infeasible.
Imagine two nodes in a graph: node A has three neighbors, and node B also has three neighbors. Their features happen to be identical. Without any structural information injected, a Graph Transformer will assign them identical embeddings — regardless of how differently they're connected in the larger graph.
Now scale this up. Two nodes might look identical locally but be in completely different parts of the graph — one is a hub connecting two communities, the other is a peripheral leaf. Without PE, the Transformer cannot distinguish them.
Good positional encoding for graphs should tell the model:
For sequences (NLP), positional encoding is simple: node i gets position i. That completely determines structure. For graphs, there's no canonical ordering. Node 3 in one graph has nothing in common with node 3 in another graph. So we need PE that's derived from the graph's own topology — structure-aware PE.
The field has converged on two broad strategies:
Both highlighted nodes have identical raw features (degree 2, same color). Without PE, they get identical embeddings. The canvas shows their neighborhoods — structurally very different. Toggle PE to see how it breaks the symmetry.
The graph Laplacian is a matrix that encodes the full connectivity structure. For a graph with adjacency matrix A and degree matrix D (diagonal matrix where Dii = degree of node i):
L is symmetric and positive semidefinite, which means its eigenvalues are all ≥ 0 and its eigenvectors are real. The smallest eigenvalue is always 0, with eigenvector [1, 1, ..., 1] — the constant vector (all nodes the same). The second smallest eigenvalue (called the Fiedler value) tells you something deep: how well-connected the graph is. A small Fiedler value means the graph is nearly disconnected; two communities are barely linked.
The key idea: take the k smallest non-trivial eigenvectors (eigenvectors 2 through k+1, skipping the constant eigenvector). Use these as the positional encoding for each node. Node i's PE is the i-th entry of each of these k eigenvectors, forming a k-dimensional vector.
There's a catch. If v is an eigenvector of L, then so is -v (multiply the equation Lv = λv by -1). Both are mathematically valid, so there's no reason to prefer one over the other. This means the Laplacian PE has a sign ambiguity: two runs on the same graph might produce v or -v with equal probability.
This is a real problem. If during training the model sees node A with PE [0.5, 0.3] and at test time it sees the same node with PE [-0.5, -0.3], the model has no way to recognize these are the same node. Solutions include:
A graph with 12 nodes. The slider selects which Laplacian eigenvector to visualize. Node color encodes the eigenvector value — warm/orange = positive, teal = negative. Watch how low eigenvectors capture global community structure, high eigenvectors capture local variation.
Laplacian PE is elegant but has the sign ambiguity problem. There's a simpler alternative that sidesteps this entirely: Random Walk Positional Encoding (RWPE), introduced in the GraphGPS paper.
Imagine a random walker starting at node v. At each step, it moves to a uniformly random neighbor. What's the probability that after exactly k steps, it returns to v? This is called the k-step landing probability p(v → v in k steps). Compute this for k = 1, 2, ..., K. The result is a K-dimensional vector that serves as node v's positional encoding.
How do you compute this efficiently? The random walk probability matrix is P = D-1A, where D is the degree matrix. Then p(k)v→v = [Pk]vv — the (v, v) entry of the k-th power of P. You only need the diagonal of each matrix power.
The 1-step landing probability p(1)v→v is always 0 for non-self-loop graphs (you can't return to v in one step if there's no self-loop). The 2-step probability captures degree: a node with many neighbors has low p(2) (it walked away and is unlikely to come back in one step), while a node with degree 1 has p(2) = 1. Higher-k probabilities capture multi-hop neighborhood structure — triangle participation, clustering coefficient, and more.
Three nodes with different local structures: a leaf (degree 1), a middle node (degree 2), and a hub (degree 5). The bars show their random walk PE vectors for k=1 through k=6. Notice how the hub (many exits) has low early probabilities while the leaf (one exit) spikes at k=2.
Graphormer (Ying et al., NeurIPS 2021) takes a different approach: instead of pre-computing a PE vector and concatenating it to node features, it adds structural biases directly into the Transformer's attention mechanism. This is elegant because the attention scores are already the place where node-to-node relationships are computed.
Graphormer introduces three types of structural encoding. Each one solves a different piece of the structural blindness problem.
Important nodes should have different representations than peripheral nodes, even before any message passing. Graphormer adds a degree-based bias to node features:
Where zdeg is a learned embedding vector for each degree value (in-degree and out-degree for directed graphs, just degree for undirected). A hub with degree 50 gets a different initial representation than a leaf with degree 1, before any attention is computed. This is like the BERT [CLS] embedding — a global structural prior.
Two nodes that are close in the graph should attend to each other more strongly than two distant nodes. Graphormer adds a bias to each attention score based on the shortest-path distance (SPD) between node pairs:
Where bSPD(i,j) is a learned scalar bias for each possible distance value. Disconnected nodes get a special bias value. This tells the Transformer about graph topology without changing the node features at all — the structural information lives in the attention scores themselves.
For molecular graphs, edges carry rich chemistry information (bond type, bond length, bond angle). Standard spatial encoding only uses the shortest-path distance, not what's along that path. Graphormer adds an additional attention bias that aggregates edge features along the shortest path between each node pair:
Where N is the length of the shortest path, xeₙ is the feature of the n-th edge on that path, and wₙ is a learned weight for position n in the path. This allows the model to "see through" the graph structure rather than just counting hops.
Both full attention (Graphormer) and local attention (GAT) have real limitations. Full attention captures long-range dependencies but costs O(n²) and ignores graph structure directly. Local attention is efficient but misses global patterns. What if we ran both in parallel and combined them?
That's exactly what GPS (General, Powerful, Scalable Graph Transformer, Rampásek et al., NeurIPS 2022) does. Each GPS layer has three components running in sequence:
The MPNN component knows about edges. The attention component sees everything. Together, they're strictly more expressive than either alone — the MPNN brings local inductive bias (nearby nodes are related by edge semantics), while attention brings global reach (any two nodes can communicate in one step).
Without positional encoding, the global attention component would be structurally blind — just DeepSets. But with RWPE or Laplacian PE added to node features, the attention scores are implicitly informed about structural proximity. Node i, knowing its own PE, can tell from node j's PE whether they're likely to be near or far in the graph — even without explicit SPD computation.
Toggle components to see what each one contributes. Local MPNN: information flows only along edges. Global Attention: every node attends to every other. GPS combines both at every layer.
Full self-attention costs O(n²) in both time and memory. For a graph with 1,000 nodes, the attention matrix has 1,000,000 entries. For 10,000 nodes, it's 100,000,000 entries — that's 400 MB just for the attention matrix in float32. For molecular graphs (typically 5–50 nodes) and small social networks (hundreds of nodes), this is fine. For citation networks like ogbn-arxiv (169,343 nodes) or product graphs with millions of nodes, it's completely infeasible.
The field has developed several strategies to reduce the O(n²) bottleneck:
Only attend within a local window. For graphs, the natural local window is the graph neighborhood — restrict attention to nodes within k hops. This reduces cost to O(n · davgk), where davg is average degree. With k=1, it's exactly GAT. You can also add a small number of randomly selected long-range "global" attention connections.
Standard attention is softmax(QKT/√d)V. The softmax creates the O(n²) bottleneck. Linear attention approximations decompose this as φ(Q)(φ(K)TV), where φ is a kernel feature map. The key insight: compute KTV first (a d×d matrix), then multiply by φ(Q). This costs O(n · d²) — linear in n. Examples: Performer (random Fourier features), Nyströmformer.
Left: O(n·d²) — linear in nodes. Right: O(n²·d) — quadratic in nodes.
Run local MPNN (O(|E|)) for inductive bias and use linear attention (O(n·d²)) for long-range — no full n×n matrix needed. GPS with linear attention scales to graphs with tens of thousands of nodes while maintaining the expressiveness of both local and global reasoning.
Memory cost as a function of graph size. Drag the slider to see how each method scales. Full attention explodes quadratically; sparse and linear scale gracefully.
Graph Transformers aren't always the right tool. They excel in specific regimes and struggle in others. Understanding when to reach for a Graph Transformer vs. a GNN is as important as understanding how they work.
| Graph Transformer Strengths | |
|---|---|
| Long-range deps. | ✓ |
| Graph-level tasks | ✓ |
| Rich edge features | ✓ |
| Molecular graphs | ✓ |
| GNN Strengths | |
|---|---|
| Large graphs | ✓ |
| Local structure | ✓ |
| Fast inference | ✓ |
| Dynamic graphs | ✓ |
Graph Transformers sit at the intersection of two major lines of research. Understanding where they fit in the broader landscape helps you pick the right tool for each problem.
"What I cannot create, I do not understand." — Richard Feynman
The next step: implement a GPS layer from scratch. Take a graph (say, ZINC molecules), precompute RWPE, implement one GPS layer with GIN + full attention, train on graph regression. You'll feel the expressiveness difference directly.