CS224W Lecture 8 — Graph Transformers

Chapter 0: Transformers Meet Graphs

A Transformer takes a sequence of tokens — words, image patches, whatever — and lets every token attend to every other token simultaneously. No notion of "left neighbor" or "right neighbor." Just a set of elements that all interact with each other in one giant attention operation.

A graph is also a set of elements — nodes — that interact with each other. The edges define which interactions happen. So here's the key insight: a Transformer on a fully-connected graph is exactly a standard Transformer. Every node attends to every other node. A Transformer on a sparse graph — where attention is only allowed between connected nodes — is exactly a GNN with learned attention weights. That's just GAT.

So Graph Transformers aren't a radically new idea. They're the same self-attention mechanism, but applied carefully to graph-structured data. The critical challenge is a single issue: Transformers are permutation-equivariant. If you shuffle the input tokens, the outputs shuffle in the same way. For sequences, the positions tell the model where each token lives. For graphs, there are no positions. Two graphs with different structure look identical if you just hand the model a bag of node features.

The problem in one sentence: Without positional encoding, a Graph Transformer treats the graph as an unordered bag of nodes — it has no idea which nodes are connected, which are far apart, or who is central and who is peripheral. Every structural insight the graph contains is invisible.

The Three-Part Recipe

Every Graph Transformer architecture answers the same three questions:

Tokenization. What are the "tokens"? Answer: nodes. Each node becomes one token with a feature vector. (Edges can carry extra information, but nodes are the primary units.)
Positional Encoding. How do we inject structural information? Answer: compute features that capture WHERE a node lives in the graph, and add them to the node features before attention.
Attention Pattern. Should every node attend to every other (O(n²))? Or only to graph neighbors (O(|E|))? Or a hybrid?

Full vs. Sparse Attention on a Graph

Click the buttons to see how attention patterns differ. Full attention (Transformer): every node attends to every other. Sparse attention (GNN/GAT): each node only attends to its neighbors. Hybrid (GPS): both patterns run in parallel.

Full attention: every node attends to every other. Cost: O(n²).

A Transformer with full self-attention applied to nodes of a graph is equivalent to which GNN variant?

GCN — it averages neighbor features with fixed weights GAT on a fully-connected graph — every node attends to every other with learned weights GraphSAGE — it samples a fixed number of neighbors

Chapter 1: Self-Attention Review

Before we talk about graphs, let's nail down exactly what self-attention computes. You have n input vectors, each of dimension d. Call them x₁, x₂, ..., xₙ. Three learned weight matrices — W_Q, W_K, W_V — project each xᵢ into a query, key, and value:

q_i = W_Q x_i, k_i = W_K x_i, v_i = W_V x_i

The attention score between node i and node j is how much i "wants" information from j — computed as a dot product between i's query and j's key:

α_ij = softmax_j( q_i · k_j / √d )

The output for node i is a weighted sum of all values, weighted by attention scores:

h_i = ∑_j α_ij · v_j

The √d denominator prevents the dot products from growing large (which would push softmax into near-zero gradient territory). This is standard scaled dot-product attention.

Key insight: This is exactly a weighted message-passing step. Node i sends its value v_i to everyone. Node j decides how much of each incoming message to use (via attention weights). The only difference from GAT: in standard self-attention, every node is connected to every other. In GAT, only graph neighbors participate.

Multi-Head Attention

Instead of one set of Q/K/V projections, we run H independent "heads" in parallel. Each head has its own W_Q, W_K, W_V. The outputs of all heads are concatenated and projected back to dimension d:

h_i = Concat(head₁, ..., head_H) W_O

Why multiple heads? Each head can specialize. One head might learn to attend to chemically similar neighbors. Another might learn to attend to structurally distant nodes. The concatenation lets the model use all these relationships simultaneously.

Data Flow: Tensor Shapes

For a graph with n nodes, each with feature dimension d:

Input

X ∈ ℝ^n×d — one row per node

↓

Project

Q = XW_Q, K = XW_K, V = XW_V — each ∈ ℝ^n×d

↓

Attention scores

A = softmax(QK^T/√d) ∈ ℝ^n×n — one score per node pair

↓

Output

H = AV ∈ ℝ^n×d — updated node representations

The n×n attention matrix is why full self-attention costs O(n²) memory and compute. For a graph with 100 nodes, this is fine. For a graph with 100,000 nodes, it's completely infeasible.

Cost comparison: GNN (MPNN): O(|E| · d) per layer — scales with edges. Full self-attention: O(n² · d) per layer — scales with nodes squared. For sparse graphs (|E| ≪ n²), GNNs are much cheaper. For dense graphs or small graphs where long-range matters, Transformers win on expressiveness.

Why does the attention computation use √d as a scaling factor in the denominator?

Without it, dot products grow with dimension d, pushing softmax into near-zero gradient regions and making training unstable It normalizes the output to have zero mean It makes the computation invariant to the number of heads H

Chapter 2: Why Graphs Need Positional Encoding

Imagine two nodes in a graph: node A has three neighbors, and node B also has three neighbors. Their features happen to be identical. Without any structural information injected, a Graph Transformer will assign them identical embeddings — regardless of how differently they're connected in the larger graph.

Now scale this up. Two nodes might look identical locally but be in completely different parts of the graph — one is a hub connecting two communities, the other is a peripheral leaf. Without PE, the Transformer cannot distinguish them.

The core failure mode: A Transformer without positional encoding is a DeepSets model — it treats the input as an unordered set and produces a permutation-equivariant but structurally blind output. You could shuffle the adjacency matrix randomly, and the embeddings would shuffle the same way. The graph structure is completely invisible.

What PE Must Capture

Good positional encoding for graphs should tell the model:

Proximity. Are nodes A and B connected? Are they 2 hops apart? 10 hops?
Role. Is this node a hub? A leaf? A bridge between communities?
Community. Do A and B belong to the same cluster, or different ones?

For sequences (NLP), positional encoding is simple: node i gets position i. That completely determines structure. For graphs, there's no canonical ordering. Node 3 in one graph has nothing in common with node 3 in another graph. So we need PE that's derived from the graph's own topology — structure-aware PE.

Two Main Approaches

The field has converged on two broad strategies:

Spectral methods (Laplacian eigenvectors): Derive PE from the eigendecomposition of the graph Laplacian. Captures global structure and community membership. Sign ambiguity is a challenge.

Random walk methods (landing probabilities): Define PE as the probability of returning to a node after k steps of a random walk. Captures local structure. No sign ambiguity. Very efficient.

PE Blindness: Two Structurally Different Nodes, Same Features

Both highlighted nodes have identical raw features (degree 2, same color). Without PE, they get identical embeddings. The canvas shows their neighborhoods — structurally very different. Toggle PE to see how it breaks the symmetry.

PE OFF: both nodes get the same embedding (red = collision).

Without positional encoding, what does a Graph Transformer treat the graph as?

A sequence of nodes in adjacency-list order A weighted graph where edge weights capture structure An unordered bag of node features — all structural information from edges is invisible

Chapter 3: Laplacian Positional Encoding

The graph Laplacian is a matrix that encodes the full connectivity structure. For a graph with adjacency matrix A and degree matrix D (diagonal matrix where D_ii = degree of node i):

L = D − A

L is symmetric and positive semidefinite, which means its eigenvalues are all ≥ 0 and its eigenvectors are real. The smallest eigenvalue is always 0, with eigenvector [1, 1, ..., 1] — the constant vector (all nodes the same). The second smallest eigenvalue (called the Fiedler value) tells you something deep: how well-connected the graph is. A small Fiedler value means the graph is nearly disconnected; two communities are barely linked.

The key idea: take the k smallest non-trivial eigenvectors (eigenvectors 2 through k+1, skipping the constant eigenvector). Use these as the positional encoding for each node. Node i's PE is the i-th entry of each of these k eigenvectors, forming a k-dimensional vector.

Why eigenvectors encode position: The k-th eigenvector of the Laplacian is a smooth function over the graph that oscillates k times. Low-index eigenvectors vary slowly — they capture global structure, like which cluster a node belongs to. High-index eigenvectors vary rapidly — they capture local structure, like which specific node within a cluster. It's the graph equivalent of Fourier modes.

The Sign Ambiguity Problem

There's a catch. If v is an eigenvector of L, then so is -v (multiply the equation Lv = λv by -1). Both are mathematically valid, so there's no reason to prefer one over the other. This means the Laplacian PE has a sign ambiguity: two runs on the same graph might produce v or -v with equal probability.

This is a real problem. If during training the model sees node A with PE [0.5, 0.3] and at test time it sees the same node with PE [-0.5, -0.3], the model has no way to recognize these are the same node. Solutions include:

Random sign flipping during training. Randomly flip the sign of each eigenvector during each training pass. This makes the model robust to both signs by design.
Sign-invariant networks. Design the PE module to output the same result regardless of sign (e.g., using absolute values, or processing both v and -v and taking a symmetric combination).

Laplacian Eigenvectors as Node Positional Encoding

A graph with 12 nodes. The slider selects which Laplacian eigenvector to visualize. Node color encodes the eigenvector value — warm/orange = positive, teal = negative. Watch how low eigenvectors capture global community structure, high eigenvectors capture local variation.

Eigenvector index 1

What is the sign ambiguity problem in Laplacian positional encoding?

The Laplacian eigenvalues can be negative, making the PE undefined If v is a valid eigenvector, so is -v — there's no canonical sign, so the same graph can produce opposite PE vectors across different runs Different graphs always share the same eigenvectors, making the PE non-unique

Chapter 4: Random Walk Positional Encoding

Laplacian PE is elegant but has the sign ambiguity problem. There's a simpler alternative that sidesteps this entirely: Random Walk Positional Encoding (RWPE), introduced in the GraphGPS paper.

Imagine a random walker starting at node v. At each step, it moves to a uniformly random neighbor. What's the probability that after exactly k steps, it returns to v? This is called the k-step landing probability p(v → v in k steps). Compute this for k = 1, 2, ..., K. The result is a K-dimensional vector that serves as node v's positional encoding.

PE_v = [p_v→v⁽¹⁾, p_v→v⁽²⁾, ..., p_v→v^(K)]

How do you compute this efficiently? The random walk probability matrix is P = D^-1A, where D is the degree matrix. Then p^(k)_v→v = [P^k]_vv — the (v, v) entry of the k-th power of P. You only need the diagonal of each matrix power.

Advantages over Laplacian PE: No sign ambiguity — the landing probability is always a positive number. Captures local structure well: a high-degree hub has low return probability after 1 step (many choices), while a leaf node has return probability 1 after 2 steps (only one neighbor, which must come back). Computationally more efficient than full eigendecomposition for large graphs.

What the Features Encode

The 1-step landing probability p⁽¹⁾_v→v is always 0 for non-self-loop graphs (you can't return to v in one step if there's no self-loop). The 2-step probability captures degree: a node with many neighbors has low p⁽²⁾ (it walked away and is unlikely to come back in one step), while a node with degree 1 has p⁽²⁾ = 1. Higher-k probabilities capture multi-hop neighborhood structure — triangle participation, clustering coefficient, and more.

Random Walk Landing Probabilities

Three nodes with different local structures: a leaf (degree 1), a middle node (degree 2), and a hub (degree 5). The bars show their random walk PE vectors for k=1 through k=6. Notice how the hub (many exits) has low early probabilities while the leaf (one exit) spikes at k=2.

Leaf: high 2-step return probability — only one neighbor to go to and come back from.

What does the k-step landing probability p(v→v in k steps) measure?

The probability that a random walker starting at v returns to v after exactly k steps — capturing local structural properties of v's neighborhood The probability that any two nodes in the graph are at distance k from each other The number of paths of length k starting from v

Chapter 5: Graphormer: Centrality + Spatial Encoding

Graphormer (Ying et al., NeurIPS 2021) takes a different approach: instead of pre-computing a PE vector and concatenating it to node features, it adds structural biases directly into the Transformer's attention mechanism. This is elegant because the attention scores are already the place where node-to-node relationships are computed.

Graphormer introduces three types of structural encoding. Each one solves a different piece of the structural blindness problem.

Encoding 1: Centrality Encoding

Important nodes should have different representations than peripheral nodes, even before any message passing. Graphormer adds a degree-based bias to node features:

h_i⁰ = x_i + z_deg⁻(i) + z_deg⁺(i)

Where z_deg is a learned embedding vector for each degree value (in-degree and out-degree for directed graphs, just degree for undirected). A hub with degree 50 gets a different initial representation than a leaf with degree 1, before any attention is computed. This is like the BERT [CLS] embedding — a global structural prior.

Encoding 2: Spatial Encoding

Two nodes that are close in the graph should attend to each other more strongly than two distant nodes. Graphormer adds a bias to each attention score based on the shortest-path distance (SPD) between node pairs:

A_ij = (q_i · k_j) / √d + b_SPD(i,j)

Where b_SPD(i,j) is a learned scalar bias for each possible distance value. Disconnected nodes get a special bias value. This tells the Transformer about graph topology without changing the node features at all — the structural information lives in the attention scores themselves.

Why this is beautiful: Instead of "what PE vector do I add to node i?", Graphormer asks "how should node i's attention toward node j change based on their structural relationship?" This is a more direct injection of graph structure — it directly modulates the communication pattern rather than the feature representation.

Encoding 3: Edge Encoding

For molecular graphs, edges carry rich chemistry information (bond type, bond length, bond angle). Standard spatial encoding only uses the shortest-path distance, not what's along that path. Graphormer adds an additional attention bias that aggregates edge features along the shortest path between each node pair:

b_ij^edge = &frac{1}{N} ∑_n=1^N x_{e_n}^T w_n

Where N is the length of the shortest path, x_eₙ is the feature of the n-th edge on that path, and wₙ is a learned weight for position n in the path. This allows the model to "see through" the graph structure rather than just counting hops.

Graphormer won the OGB-LSC competition (KDD Cup 2021), the largest graph ML benchmark at the time, on the PCQM4M molecular property prediction task. It outperformed all GNN baselines by a substantial margin — showing that a well-designed Transformer can beat specialized GNNs on graph tasks.

Graphormer's spatial encoding adds a bias b_SPD(i,j) to the attention score between nodes i and j. What does SPD stand for?

Spectral Positional Dimension — the index of the Laplacian eigenvector Spectral Property Distance — based on the difference in node feature spectra Shortest Path Distance — the number of edges in the shortest path between nodes i and j in the graph

Chapter 6: GPS: General Powerful Scalable

Both full attention (Graphormer) and local attention (GAT) have real limitations. Full attention captures long-range dependencies but costs O(n²) and ignores graph structure directly. Local attention is efficient but misses global patterns. What if we ran both in parallel and combined them?

That's exactly what GPS (General, Powerful, Scalable Graph Transformer, Rampásek et al., NeurIPS 2022) does. Each GPS layer has three components running in sequence:

Step 1: Positional Encoding

Compute RWPE or Laplacian PE. Add to node features. This happens once before the first layer.

↓

Step 2: Local MPNN

Run one step of any message-passing GNN (GIN, GAT, GCN) over the graph's actual edges. Captures local structure efficiently in O(|E|).

Step 3: Global Attention

Run full self-attention (or a linear approximation) over all nodes. Captures long-range dependencies in O(n²) (or O(n) with linear attention).

↓

Step 4: Combine

Add the MPNN output and the attention output. Apply feedforward network + LayerNorm.

The GPS layer equation: h_i^out = FFN( MPNN(h_iⁱⁿ, N(i)) + Attn(h_iⁱⁿ, all nodes) )

The MPNN component knows about edges. The attention component sees everything. Together, they're strictly more expressive than either alone — the MPNN brings local inductive bias (nearby nodes are related by edge semantics), while attention brings global reach (any two nodes can communicate in one step).

Why PE Makes This Work

Without positional encoding, the global attention component would be structurally blind — just DeepSets. But with RWPE or Laplacian PE added to node features, the attention scores are implicitly informed about structural proximity. Node i, knowing its own PE, can tell from node j's PE whether they're likely to be near or far in the graph — even without explicit SPD computation.

GPS Layer: Local + Global in Parallel

Toggle components to see what each one contributes. Local MPNN: information flows only along edges. Global Attention: every node attends to every other. GPS combines both at every layer.

Select a mode to see which connections each component uses.

In GPS, why is local MPNN needed if global attention already sees all nodes?

MPNN is faster to compute, so GPS uses it to save memory Global attention can't process more than 100 nodes, so MPNN handles larger graphs MPNN uses actual edge information and provides a local structural inductive bias that pure attention misses — edges encode real graph semantics

Chapter 7: Efficiency

Full self-attention costs O(n²) in both time and memory. For a graph with 1,000 nodes, the attention matrix has 1,000,000 entries. For 10,000 nodes, it's 100,000,000 entries — that's 400 MB just for the attention matrix in float32. For molecular graphs (typically 5–50 nodes) and small social networks (hundreds of nodes), this is fine. For citation networks like ogbn-arxiv (169,343 nodes) or product graphs with millions of nodes, it's completely infeasible.

The field has developed several strategies to reduce the O(n²) bottleneck:

Strategy 1: Sparse Attention

Only attend within a local window. For graphs, the natural local window is the graph neighborhood — restrict attention to nodes within k hops. This reduces cost to O(n · d_avg^k), where d_avg is average degree. With k=1, it's exactly GAT. You can also add a small number of randomly selected long-range "global" attention connections.

Strategy 2: Linear Attention

Standard attention is softmax(QK^T/√d)V. The softmax creates the O(n²) bottleneck. Linear attention approximations decompose this as φ(Q)(φ(K)^TV), where φ is a kernel feature map. The key insight: compute K^TV first (a d×d matrix), then multiply by φ(Q). This costs O(n · d²) — linear in n. Examples: Performer (random Fourier features), Nyströmformer.

φ(Q)(φ(K)^TV) vs softmax(QK^T/√d)V

Left: O(n·d²) — linear in nodes. Right: O(n²·d) — quadratic in nodes.

Strategy 3: Local-Global Hybrid (GPS approach)

Run local MPNN (O(|E|)) for inductive bias and use linear attention (O(n·d²)) for long-range — no full n×n matrix needed. GPS with linear attention scales to graphs with tens of thousands of nodes while maintaining the expressiveness of both local and global reasoning.

In practice: For molecular benchmarks (PCQM4M, ZINC, OGB-Mol), full attention works — molecules have ~20-50 atoms. For citation networks (OGB-Arxiv, 169K nodes) or social graphs (Reddit, 232K nodes), you need sparse or linear attention. The GPS framework supports both modes via a configurable attention type.

Scaling: Full vs. Sparse vs. Linear Attention

Memory cost as a function of graph size. Drag the slider to see how each method scales. Full attention explodes quadratically; sparse and linear scale gracefully.

Nodes (×100) 500

Why does linear attention reduce the O(n²) cost of standard self-attention?

It skips the softmax normalization entirely, which was the only expensive step It computes attention only within each connected component, reducing the effective n By rewriting attention as φ(Q)(φ(K)ᵀV), the d×d inner product is computed first, making the total cost O(n·d²) instead of O(n²·d)

Chapter 8: When to Use Graph Transformers

Graph Transformers aren't always the right tool. They excel in specific regimes and struggle in others. Understanding when to reach for a Graph Transformer vs. a GNN is as important as understanding how they work.

Use Graph Transformers When...

Long-range dependencies matter. Molecular property prediction often requires the model to understand how atoms on opposite ends of a molecule interact. Social recommendation needs to capture global popularity signals. These require communication across many hops — which GNNs with few layers can't do without over-squashing.
Graphs are small-to-medium. Molecules (5–100 atoms), knowledge graph entities, circuit components, small social subgraphs. Full attention is feasible here.
Graph-level tasks. Predicting a property of the entire graph (e.g., molecular toxicity, drug solubility). The [VNode] (virtual node, like BERT's [CLS]) in Graphormer aggregates global information naturally.
Good PE is available. RWPE or Laplacian PE can be pre-computed in reasonable time. If the graph changes dynamically (live social network), re-computing PE at every step is expensive.

Stick with GNNs When...

Graphs are large. Citation networks (169K nodes), social graphs (millions of nodes), protein interaction networks. Full attention is infeasible; even sparse attention may be costly.
Local structure is sufficient. Traffic prediction on road networks, image superpixel graphs. Local neighborhood is sufficient; long-range attention just adds noise.
Computational budget is tight. GNNs are 10–100× cheaper than Transformers at comparable performance on many tasks. Mobile or edge inference favors GNNs.
Graph evolves dynamically. Temporal graphs where edges change constantly. Re-computing PE at each timestep is prohibitively expensive.

Graph Transformer Strengths
Long-range deps.	✓
Graph-level tasks	✓
Rich edge features	✓
Molecular graphs	✓

GNN Strengths
Large graphs	✓
Local structure	✓
Fast inference	✓
Dynamic graphs	✓

In practice: GPS-style hybrids often win. Local MPNN handles edges and local structure efficiently. Global attention (possibly linear) captures long range. Positional encoding bridges both. This combination outperforms pure GNNs and pure Transformers on most benchmark tasks — with GPS winning multiple benchmarks including PCQM4M and Long Range Graph Benchmark.

For a citation network with 170,000 nodes, which approach is most practical?

Full self-attention Graphormer — it's the most expressive model No attention at all — use pure GCN for maximum scalability Sparse or linear attention (or GPS with linear attention) — full O(n²) attention is infeasible at 170K nodes but local MPNN + linear attention scales

Chapter 9: Connections

Graph Transformers sit at the intersection of two major lines of research. Understanding where they fit in the broader landscape helps you pick the right tool for each problem.

Within the CS224W Series

Lec 3–4 (GNN basics, GAT): GAT is Graph Transformer with sparse attention. GAT's attention score is (softmax over neighbors of q·k) — identical to full self-attention, but the sum is only over connected nodes. Graph Transformers generalize this to all nodes.
Lec 6 (Theory of GNNs): Expressiveness is critical here too. With the right PE, Graph Transformers are strictly more expressive than 1-WL GNNs — they can distinguish nodes that WL-equivalent GNNs cannot.
Lec 7 (Positional Encoding in graphs): The PE from Lec 7 (Laplacian eigenvectors, RWPE) is the critical ingredient that makes Graph Transformers graph-aware. Without PE, Graph Transformers are weaker than GNNs.
Lec 9 (Heterogeneous Graphs): Heterogeneous Graph Transformers extend this to graphs with multiple node types and edge types — adding type-specific attention heads and encodings.

Key Papers

Graphormer — Ying et al., NeurIPS 2021. The paper that proved Transformers can beat GNNs on molecular graphs. Centrality + spatial + edge encoding. Won OGB-LSC. arXiv 2106.05234
GPS — Rampásek et al., NeurIPS 2022. The unified framework: MPNN + Attention + PE. Most flexible and best-scaling approach to date.
SAN — Kreuzer et al., NeurIPS 2021. Spectral Attention Network — uses full Laplacian spectrum as PE.
Exphormer — Shirzad et al., ICML 2023. Sparse attention for Graph Transformers, enabling scaling to large graphs while maintaining full expressiveness.

Related Micro-Lessons

CS224W Lec 4 — GAT: attention on sparse graphs. The direct predecessor of Graph Transformers.
CS224W Lec 6 — Theory of GNNs: expressiveness, WL test, GIN. The expressiveness analysis applies here too.
Transformer — The original self-attention mechanism. Graph Transformers use the same Q/K/V machinery.

The bigger picture: Graph Transformers represent a convergence between geometric deep learning (GNNs) and sequence modeling (Transformers). The key lesson: architecture is not destiny. With the right structural inductive bias (positional encoding), a generic Transformer can be made graph-aware. The "right inductive bias" — not "the right architecture from scratch" — is increasingly the design challenge.

"What I cannot create, I do not understand." — Richard Feynman

The next step: implement a GPS layer from scratch. Take a graph (say, ZINC molecules), precompute RWPE, implement one GPS layer with GIN + full attention, train on graph regression. You'll feel the expressiveness difference directly.