CS224W Lecture 8

Graph Transformers

Transformers were built for sequences. Graphs have no natural order. Here's how we reconcile that — and why the solution unlocks long-range dependencies that GNNs cannot reach.

Prerequisites: Self-attention (basic Transformer idea) + GNNs / GAT (Lec 3–4). That's it.
10
Chapters
4+
Simulations
0
Assumed Knowledge

Chapter 0: Transformers Meet Graphs

A Transformer takes a sequence of tokens — words, image patches, whatever — and lets every token attend to every other token simultaneously. No notion of "left neighbor" or "right neighbor." Just a set of elements that all interact with each other in one giant attention operation.

A graph is also a set of elements — nodes — that interact with each other. The edges define which interactions happen. So here's the key insight: a Transformer on a fully-connected graph is exactly a standard Transformer. Every node attends to every other node. A Transformer on a sparse graph — where attention is only allowed between connected nodes — is exactly a GNN with learned attention weights. That's just GAT.

So Graph Transformers aren't a radically new idea. They're the same self-attention mechanism, but applied carefully to graph-structured data. The critical challenge is a single issue: Transformers are permutation-equivariant. If you shuffle the input tokens, the outputs shuffle in the same way. For sequences, the positions tell the model where each token lives. For graphs, there are no positions. Two graphs with different structure look identical if you just hand the model a bag of node features.

The problem in one sentence: Without positional encoding, a Graph Transformer treats the graph as an unordered bag of nodes — it has no idea which nodes are connected, which are far apart, or who is central and who is peripheral. Every structural insight the graph contains is invisible.

The Three-Part Recipe

Every Graph Transformer architecture answers the same three questions:

  1. Tokenization. What are the "tokens"? Answer: nodes. Each node becomes one token with a feature vector. (Edges can carry extra information, but nodes are the primary units.)
  2. Positional Encoding. How do we inject structural information? Answer: compute features that capture WHERE a node lives in the graph, and add them to the node features before attention.
  3. Attention Pattern. Should every node attend to every other (O(n²))? Or only to graph neighbors (O(|E|))? Or a hybrid?
Full vs. Sparse Attention on a Graph

Click the buttons to see how attention patterns differ. Full attention (Transformer): every node attends to every other. Sparse attention (GNN/GAT): each node only attends to its neighbors. Hybrid (GPS): both patterns run in parallel.

Full attention: every node attends to every other. Cost: O(n²).
A Transformer with full self-attention applied to nodes of a graph is equivalent to which GNN variant?

Chapter 1: Self-Attention Review

Before we talk about graphs, let's nail down exactly what self-attention computes. You have n input vectors, each of dimension d. Call them x₁, x₂, ..., xₙ. Three learned weight matrices — WQ, WK, WV — project each xᵢ into a query, key, and value:

qi = WQ xi,   ki = WK xi,   vi = WV xi

The attention score between node i and node j is how much i "wants" information from j — computed as a dot product between i's query and j's key:

αij = softmaxj( qi · kj / √d )

The output for node i is a weighted sum of all values, weighted by attention scores:

hi = ∑j αij · vj

The √d denominator prevents the dot products from growing large (which would push softmax into near-zero gradient territory). This is standard scaled dot-product attention.

Key insight: This is exactly a weighted message-passing step. Node i sends its value vi to everyone. Node j decides how much of each incoming message to use (via attention weights). The only difference from GAT: in standard self-attention, every node is connected to every other. In GAT, only graph neighbors participate.

Multi-Head Attention

Instead of one set of Q/K/V projections, we run H independent "heads" in parallel. Each head has its own WQ, WK, WV. The outputs of all heads are concatenated and projected back to dimension d:

hi = Concat(head1, ..., headH) WO

Why multiple heads? Each head can specialize. One head might learn to attend to chemically similar neighbors. Another might learn to attend to structurally distant nodes. The concatenation lets the model use all these relationships simultaneously.

Data Flow: Tensor Shapes

For a graph with n nodes, each with feature dimension d:

Input
X ∈ ℝn×d — one row per node
Project
Q = XWQ, K = XWK, V = XWV — each ∈ ℝn×d
Attention scores
A = softmax(QKT/√d) ∈ ℝn×n — one score per node pair
Output
H = AV ∈ ℝn×d — updated node representations

The n×n attention matrix is why full self-attention costs O(n²) memory and compute. For a graph with 100 nodes, this is fine. For a graph with 100,000 nodes, it's completely infeasible.

Cost comparison: GNN (MPNN): O(|E| · d) per layer — scales with edges. Full self-attention: O(n² · d) per layer — scales with nodes squared. For sparse graphs (|E| ≪ n²), GNNs are much cheaper. For dense graphs or small graphs where long-range matters, Transformers win on expressiveness.
Why does the attention computation use √d as a scaling factor in the denominator?

Chapter 2: Why Graphs Need Positional Encoding

Imagine two nodes in a graph: node A has three neighbors, and node B also has three neighbors. Their features happen to be identical. Without any structural information injected, a Graph Transformer will assign them identical embeddings — regardless of how differently they're connected in the larger graph.

Now scale this up. Two nodes might look identical locally but be in completely different parts of the graph — one is a hub connecting two communities, the other is a peripheral leaf. Without PE, the Transformer cannot distinguish them.

The core failure mode: A Transformer without positional encoding is a DeepSets model — it treats the input as an unordered set and produces a permutation-equivariant but structurally blind output. You could shuffle the adjacency matrix randomly, and the embeddings would shuffle the same way. The graph structure is completely invisible.

What PE Must Capture

Good positional encoding for graphs should tell the model:

For sequences (NLP), positional encoding is simple: node i gets position i. That completely determines structure. For graphs, there's no canonical ordering. Node 3 in one graph has nothing in common with node 3 in another graph. So we need PE that's derived from the graph's own topology — structure-aware PE.

Two Main Approaches

The field has converged on two broad strategies:

Spectral methods (Laplacian eigenvectors): Derive PE from the eigendecomposition of the graph Laplacian. Captures global structure and community membership. Sign ambiguity is a challenge.
Random walk methods (landing probabilities): Define PE as the probability of returning to a node after k steps of a random walk. Captures local structure. No sign ambiguity. Very efficient.
PE Blindness: Two Structurally Different Nodes, Same Features

Both highlighted nodes have identical raw features (degree 2, same color). Without PE, they get identical embeddings. The canvas shows their neighborhoods — structurally very different. Toggle PE to see how it breaks the symmetry.

PE OFF: both nodes get the same embedding (red = collision).
Without positional encoding, what does a Graph Transformer treat the graph as?

Chapter 3: Laplacian Positional Encoding

The graph Laplacian is a matrix that encodes the full connectivity structure. For a graph with adjacency matrix A and degree matrix D (diagonal matrix where Dii = degree of node i):

L = D − A

L is symmetric and positive semidefinite, which means its eigenvalues are all ≥ 0 and its eigenvectors are real. The smallest eigenvalue is always 0, with eigenvector [1, 1, ..., 1] — the constant vector (all nodes the same). The second smallest eigenvalue (called the Fiedler value) tells you something deep: how well-connected the graph is. A small Fiedler value means the graph is nearly disconnected; two communities are barely linked.

The key idea: take the k smallest non-trivial eigenvectors (eigenvectors 2 through k+1, skipping the constant eigenvector). Use these as the positional encoding for each node. Node i's PE is the i-th entry of each of these k eigenvectors, forming a k-dimensional vector.

Why eigenvectors encode position: The k-th eigenvector of the Laplacian is a smooth function over the graph that oscillates k times. Low-index eigenvectors vary slowly — they capture global structure, like which cluster a node belongs to. High-index eigenvectors vary rapidly — they capture local structure, like which specific node within a cluster. It's the graph equivalent of Fourier modes.

The Sign Ambiguity Problem

There's a catch. If v is an eigenvector of L, then so is -v (multiply the equation Lv = λv by -1). Both are mathematically valid, so there's no reason to prefer one over the other. This means the Laplacian PE has a sign ambiguity: two runs on the same graph might produce v or -v with equal probability.

This is a real problem. If during training the model sees node A with PE [0.5, 0.3] and at test time it sees the same node with PE [-0.5, -0.3], the model has no way to recognize these are the same node. Solutions include:

Laplacian Eigenvectors as Node Positional Encoding

A graph with 12 nodes. The slider selects which Laplacian eigenvector to visualize. Node color encodes the eigenvector value — warm/orange = positive, teal = negative. Watch how low eigenvectors capture global community structure, high eigenvectors capture local variation.

Eigenvector index 1
What is the sign ambiguity problem in Laplacian positional encoding?

Chapter 4: Random Walk Positional Encoding

Laplacian PE is elegant but has the sign ambiguity problem. There's a simpler alternative that sidesteps this entirely: Random Walk Positional Encoding (RWPE), introduced in the GraphGPS paper.

Imagine a random walker starting at node v. At each step, it moves to a uniformly random neighbor. What's the probability that after exactly k steps, it returns to v? This is called the k-step landing probability p(v → v in k steps). Compute this for k = 1, 2, ..., K. The result is a K-dimensional vector that serves as node v's positional encoding.

PEv = [pv→v(1), pv→v(2), ..., pv→v(K)]

How do you compute this efficiently? The random walk probability matrix is P = D-1A, where D is the degree matrix. Then p(k)v→v = [Pk]vv — the (v, v) entry of the k-th power of P. You only need the diagonal of each matrix power.

Advantages over Laplacian PE: No sign ambiguity — the landing probability is always a positive number. Captures local structure well: a high-degree hub has low return probability after 1 step (many choices), while a leaf node has return probability 1 after 2 steps (only one neighbor, which must come back). Computationally more efficient than full eigendecomposition for large graphs.

What the Features Encode

The 1-step landing probability p(1)v→v is always 0 for non-self-loop graphs (you can't return to v in one step if there's no self-loop). The 2-step probability captures degree: a node with many neighbors has low p(2) (it walked away and is unlikely to come back in one step), while a node with degree 1 has p(2) = 1. Higher-k probabilities capture multi-hop neighborhood structure — triangle participation, clustering coefficient, and more.

Random Walk Landing Probabilities

Three nodes with different local structures: a leaf (degree 1), a middle node (degree 2), and a hub (degree 5). The bars show their random walk PE vectors for k=1 through k=6. Notice how the hub (many exits) has low early probabilities while the leaf (one exit) spikes at k=2.

Leaf: high 2-step return probability — only one neighbor to go to and come back from.
What does the k-step landing probability p(v→v in k steps) measure?

Chapter 5: Graphormer: Centrality + Spatial Encoding

Graphormer (Ying et al., NeurIPS 2021) takes a different approach: instead of pre-computing a PE vector and concatenating it to node features, it adds structural biases directly into the Transformer's attention mechanism. This is elegant because the attention scores are already the place where node-to-node relationships are computed.

Graphormer introduces three types of structural encoding. Each one solves a different piece of the structural blindness problem.

Encoding 1: Centrality Encoding

Important nodes should have different representations than peripheral nodes, even before any message passing. Graphormer adds a degree-based bias to node features:

hi0 = xi + zdeg⁻(i) + zdeg⁺(i)

Where zdeg is a learned embedding vector for each degree value (in-degree and out-degree for directed graphs, just degree for undirected). A hub with degree 50 gets a different initial representation than a leaf with degree 1, before any attention is computed. This is like the BERT [CLS] embedding — a global structural prior.

Encoding 2: Spatial Encoding

Two nodes that are close in the graph should attend to each other more strongly than two distant nodes. Graphormer adds a bias to each attention score based on the shortest-path distance (SPD) between node pairs:

Aij = (qi · kj) / √d + bSPD(i,j)

Where bSPD(i,j) is a learned scalar bias for each possible distance value. Disconnected nodes get a special bias value. This tells the Transformer about graph topology without changing the node features at all — the structural information lives in the attention scores themselves.

Why this is beautiful: Instead of "what PE vector do I add to node i?", Graphormer asks "how should node i's attention toward node j change based on their structural relationship?" This is a more direct injection of graph structure — it directly modulates the communication pattern rather than the feature representation.

Encoding 3: Edge Encoding

For molecular graphs, edges carry rich chemistry information (bond type, bond length, bond angle). Standard spatial encoding only uses the shortest-path distance, not what's along that path. Graphormer adds an additional attention bias that aggregates edge features along the shortest path between each node pair:

bijedge = &frac{1}{N} ∑n=1N xenT wn

Where N is the length of the shortest path, xeₙ is the feature of the n-th edge on that path, and wₙ is a learned weight for position n in the path. This allows the model to "see through" the graph structure rather than just counting hops.

Graphormer won the OGB-LSC competition (KDD Cup 2021), the largest graph ML benchmark at the time, on the PCQM4M molecular property prediction task. It outperformed all GNN baselines by a substantial margin — showing that a well-designed Transformer can beat specialized GNNs on graph tasks.
Graphormer's spatial encoding adds a bias b_SPD(i,j) to the attention score between nodes i and j. What does SPD stand for?

Chapter 6: GPS: General Powerful Scalable

Both full attention (Graphormer) and local attention (GAT) have real limitations. Full attention captures long-range dependencies but costs O(n²) and ignores graph structure directly. Local attention is efficient but misses global patterns. What if we ran both in parallel and combined them?

That's exactly what GPS (General, Powerful, Scalable Graph Transformer, Rampásek et al., NeurIPS 2022) does. Each GPS layer has three components running in sequence:

Step 1: Positional Encoding
Compute RWPE or Laplacian PE. Add to node features. This happens once before the first layer.
Step 2: Local MPNN
Run one step of any message-passing GNN (GIN, GAT, GCN) over the graph's actual edges. Captures local structure efficiently in O(|E|).
+
Step 3: Global Attention
Run full self-attention (or a linear approximation) over all nodes. Captures long-range dependencies in O(n²) (or O(n) with linear attention).
Step 4: Combine
Add the MPNN output and the attention output. Apply feedforward network + LayerNorm.
The GPS layer equation: hiout = FFN( MPNN(hiin, N(i)) + Attn(hiin, all nodes) )

The MPNN component knows about edges. The attention component sees everything. Together, they're strictly more expressive than either alone — the MPNN brings local inductive bias (nearby nodes are related by edge semantics), while attention brings global reach (any two nodes can communicate in one step).

Why PE Makes This Work

Without positional encoding, the global attention component would be structurally blind — just DeepSets. But with RWPE or Laplacian PE added to node features, the attention scores are implicitly informed about structural proximity. Node i, knowing its own PE, can tell from node j's PE whether they're likely to be near or far in the graph — even without explicit SPD computation.

GPS Layer: Local + Global in Parallel

Toggle components to see what each one contributes. Local MPNN: information flows only along edges. Global Attention: every node attends to every other. GPS combines both at every layer.

Select a mode to see which connections each component uses.
In GPS, why is local MPNN needed if global attention already sees all nodes?

Chapter 7: Efficiency

Full self-attention costs O(n²) in both time and memory. For a graph with 1,000 nodes, the attention matrix has 1,000,000 entries. For 10,000 nodes, it's 100,000,000 entries — that's 400 MB just for the attention matrix in float32. For molecular graphs (typically 5–50 nodes) and small social networks (hundreds of nodes), this is fine. For citation networks like ogbn-arxiv (169,343 nodes) or product graphs with millions of nodes, it's completely infeasible.

The field has developed several strategies to reduce the O(n²) bottleneck:

Strategy 1: Sparse Attention

Only attend within a local window. For graphs, the natural local window is the graph neighborhood — restrict attention to nodes within k hops. This reduces cost to O(n · davgk), where davg is average degree. With k=1, it's exactly GAT. You can also add a small number of randomly selected long-range "global" attention connections.

Strategy 2: Linear Attention

Standard attention is softmax(QKT/√d)V. The softmax creates the O(n²) bottleneck. Linear attention approximations decompose this as φ(Q)(φ(K)TV), where φ is a kernel feature map. The key insight: compute KTV first (a d×d matrix), then multiply by φ(Q). This costs O(n · d²) — linear in n. Examples: Performer (random Fourier features), Nyströmformer.

φ(Q)(φ(K)TV)   vs   softmax(QKT/√d)V

Left: O(n·d²) — linear in nodes. Right: O(n²·d) — quadratic in nodes.

Strategy 3: Local-Global Hybrid (GPS approach)

Run local MPNN (O(|E|)) for inductive bias and use linear attention (O(n·d²)) for long-range — no full n×n matrix needed. GPS with linear attention scales to graphs with tens of thousands of nodes while maintaining the expressiveness of both local and global reasoning.

In practice: For molecular benchmarks (PCQM4M, ZINC, OGB-Mol), full attention works — molecules have ~20-50 atoms. For citation networks (OGB-Arxiv, 169K nodes) or social graphs (Reddit, 232K nodes), you need sparse or linear attention. The GPS framework supports both modes via a configurable attention type.
Scaling: Full vs. Sparse vs. Linear Attention

Memory cost as a function of graph size. Drag the slider to see how each method scales. Full attention explodes quadratically; sparse and linear scale gracefully.

Nodes (×100) 500
Why does linear attention reduce the O(n²) cost of standard self-attention?

Chapter 8: When to Use Graph Transformers

Graph Transformers aren't always the right tool. They excel in specific regimes and struggle in others. Understanding when to reach for a Graph Transformer vs. a GNN is as important as understanding how they work.

Use Graph Transformers When...

Stick with GNNs When...

Graph Transformer Strengths
Long-range deps.
Graph-level tasks
Rich edge features
Molecular graphs
GNN Strengths
Large graphs
Local structure
Fast inference
Dynamic graphs
In practice: GPS-style hybrids often win. Local MPNN handles edges and local structure efficiently. Global attention (possibly linear) captures long range. Positional encoding bridges both. This combination outperforms pure GNNs and pure Transformers on most benchmark tasks — with GPS winning multiple benchmarks including PCQM4M and Long Range Graph Benchmark.
For a citation network with 170,000 nodes, which approach is most practical?

Chapter 9: Connections

Graph Transformers sit at the intersection of two major lines of research. Understanding where they fit in the broader landscape helps you pick the right tool for each problem.

Within the CS224W Series

Key Papers

Related Micro-Lessons

The bigger picture: Graph Transformers represent a convergence between geometric deep learning (GNNs) and sequence modeling (Transformers). The key lesson: architecture is not destiny. With the right structural inductive bias (positional encoding), a generic Transformer can be made graph-aware. The "right inductive bias" — not "the right architecture from scratch" — is increasingly the design challenge.
"What I cannot create, I do not understand." — Richard Feynman

The next step: implement a GPS layer from scratch. Take a graph (say, ZINC molecules), precompute RWPE, implement one GPS layer with GIN + full attention, train on graph regression. You'll feel the expressiveness difference directly.