GIN is the most expressive message-passing GNN — but message-passing itself has a ceiling. This lecture shows how to break through it with positional encodings, structural features, and anchor-based position awareness.
Last lecture, we proved a powerful result: GIN is the most expressive message-passing GNN. Its sum-aggregation is injective over multisets of neighbor features, which means GIN can distinguish any two nodes that the Weisfeiler-Leman (WL) graph isomorphism test can distinguish.
That sounds great — until you realize WL itself has blind spots. There are pairs of graphs that WL cannot tell apart, no matter how many rounds you run it. And because GIN is equivalent to WL — not stronger — GIN has those exact same blind spots.
So if WL is the ceiling for all message-passing GNNs, and that ceiling has holes, we need to think differently. We need to go beyond message-passing. Specifically, we need to augment node features with information that breaks the symmetries WL gets stuck on.
This lecture covers three complementary strategies, each targeting a different WL failure mode:
Think of it like this: a message-passing GNN is like a person who can only talk to their neighbors. WL is the limit of what they can learn that way. Our upgrades are like giving everyone a map — suddenly they know where they are, not just who's next to them.
The expressiveness hierarchy of GNNs. Standard GNNs live below the WL ceiling. The strategies in this lecture push past it. Click each tier to see what it can distinguish.
The WL test works by color-refinement: assign every node the same initial color, then repeatedly recolor each node based on its own color and the multiset of its neighbors' colors. Two graphs are declared non-isomorphic when they produce different color histograms.
This sounds thorough. But there are two specific structural properties WL fundamentally cannot detect, and they matter for real graph tasks.
A k-regular graph is one where every node has exactly k neighbors. In a regular graph, every node starts with the same initial color. After one WL round, every node sees k neighbors all with the same color — so every node gets the same new color. After two rounds, same thing. WL never produces different colors, no matter how many rounds you run.
This means WL gives the same output to all k-regular graphs with the same number of nodes, even if they have completely different structure. A 3-regular graph that forms two triangles connected by edges looks identical to WL as a 3-regular graph with a completely different topology. Yet a GNN task like "does this graph contain a triangle?" would need to distinguish them.
Now consider a specific node. WL represents it by the multiset of its neighbors' colors. But it never sees how those neighbors connect to each other. Two of your neighbors being connected to each other (forming a triangle with you) looks the same as those same neighbors having no edge between them — from WL's perspective, the neighbor multisets are identical.
This is a direct consequence of message-passing: in one round, you see features of nodes one hop away. You never see the edges between those nodes. Triangles require seeing edges two hops away — and message-passing can't deliver that.
Both failures trace back to the same root: WL starts with constant, featureless initial colors (or only degree). Featureless nodes are symmetric — any permutation of them looks identical. The color refinement respects this symmetry and can never break it. To distinguish more graphs, we must break symmetry in the initial features.
Two different 3-regular graphs. Run WL rounds and watch — both graphs converge to the same color histogram. WL declares them "possibly isomorphic" even though their topology differs.
The graph Laplacian L = D − A is a matrix that captures the structure of a graph. Here D is the diagonal degree matrix (Dii = degree of node i) and A is the adjacency matrix. L tells you how "rough" or "smooth" signals are when they live on graph nodes.
The eigenvectors of L are not just math — they encode the geometric structure of the graph. Think of the graph as a physical membrane. The eigenvectors are the natural vibration modes of that membrane. Low-frequency modes (small eigenvalues) describe the broad, coarse structure — which community a node belongs to. High-frequency modes (large eigenvalues) describe fine local structure — which side of a clique a node is on.
The eigenvector v1 with λ1 = 0 is trivial (all ones). The second eigenvector v2 (the Fiedler vector) divides the graph into two communities — nodes on one side of a cut get positive values, the other side gets negative. The k-th eigenvector adds more resolution, like a finer and finer grid overlaid on the graph.
Here's the key insight: if two nodes lie in structurally different positions in the graph, their values in the eigenvectors will differ. Nodes that WL sees as identical (because they have the same local neighborhood) can have very different Laplacian eigenvector values — because the eigenvectors reflect the global structure of the graph, not just the local neighborhood.
We stack the k smallest non-trivial eigenvectors into a matrix. Row i of this matrix is a k-dimensional positional encoding for node i. We concatenate this with the node's original features and feed the augmented features into any GNN.
The beauty of Laplacian eigenvectors is their multi-scale nature. Small eigenvalues → large-scale structure (are you in the left half of the graph or the right?). Large eigenvalues → small-scale structure (are you on the edge of a clique?). By concatenating k eigenvectors, a node gets a positional encoding that encodes structure at k different scales simultaneously.
python import numpy as np import scipy.sparse.linalg as spla def laplacian_pe(adj, k): # adj: (N, N) adjacency matrix # k: number of eigenvectors to use N = adj.shape[0] deg = adj.sum(axis=1) D = np.diag(deg) L = D - adj # Graph Laplacian # Compute k+1 smallest eigenvalues/vectors # (skip the trivial all-ones eigenvector at lambda=0) eigvals, eigvecs = np.linalg.eigh(L) pe = eigvecs[:, 1:k+1] # shape: (N, k) # pe[i] is the k-dim positional encoding for node i return pe # concatenate with node features before GNN
Theory is clear: Laplacian eigenvectors break the WL symmetries that constant initial features cannot. Let's see exactly what this means in practice — with a graph where WL assigns every node the same color, and eigenvectors assign each node a distinct signature.
We'll use a simple 6-node graph arranged in two triangles connected by a bridge. WL cannot distinguish the two triangles from each other because they're perfectly symmetric. The Fiedler vector, however, assigns negative values to one triangle and positive to the other. The third eigenvector further differentiates nodes within each triangle. Each node gets a unique 2D encoding.
A graph of two triangles connected by a bridge. Node colors show the value of the selected eigenvector — warm = positive, teal = negative, gray = near zero. WL would color all nodes identically; eigenvectors reveal the global structure.
Positional encodings use spectral information to tell nodes where they are. A different strategy is to tell nodes what their local structure looks like — not through message-passing (which is bounded by WL), but by directly computing structural statistics and appending them as features.
The most important structural statistics are substructure counts: how many triangles does this node participate in? How many 4-cycles? These numbers directly measure properties that WL provably cannot detect through message-passing. By computing them and appending them to node features, we hand the GNN exactly the information it was missing.
Recall why WL fails at triangles: it aggregates from direct neighbors, but never looks at edges between neighbors. A triangle through node v means two of v's neighbors are connected to each other. WL only sees the neighbors as a multiset of colors, not their mutual connections.
But we can count triangles ourselves using the adjacency matrix. The number of triangles passing through node v is:
Here A3 means "number of walks of length 3 from v back to v." Any closed walk of length 3 is a triangle. The factor of 2 accounts for traversing each triangle in both directions. This is a straightforward matrix computation — polynomial in the number of nodes, and directly gives us what WL cannot compute.
The same idea extends to longer cycles. The diagonal of Ak counts closed walks of length k. These walks are not exactly cycles (they might revisit nodes), but with careful inclusion-exclusion, you can extract exact cycle counts.
For practical purposes, you don't need exact counts. Including (A3)vv, (A4)vv, (A5)vv as node features already captures a rich portrait of the local ring structure — useful for molecular property prediction (where ring sizes determine chemical behavior) and many other graph tasks.
python import numpy as np def structural_features(A): # A: adjacency matrix (N, N) A2 = A @ A A3 = A2 @ A A4 = A3 @ A A5 = A4 @ A tri = np.diag(A3) / 2 # triangles cyc4 = np.diag(A4) # 4-cycle walks cyc5 = np.diag(A5) / 2 # 5-cycle walks # Stack into (N, 3) feature matrix return np.stack([tri, cyc4, cyc5], axis=1)
Adding more and more substructure features climbs the expressiveness ladder. Triangles help. 4-cycles add more. 5-cycles add more. But each adds computational cost. In practice, triangles (3-cycles) and 4-cycles cover the most important chemistry and the cost stays manageable.
Positional encodings and structural features both augment node features before feeding into a GNN. But there's a deeper question: what kinds of tasks fundamentally require knowing a node's position in the graph?
Consider predicting the shortest-path distance between two specific nodes u and v in a graph. This requires knowing where u and v are relative to each other — not just their local structures. Two pairs of nodes with identical local neighborhoods can have wildly different pairwise distances. A purely structure-based GNN cannot distinguish these cases.
The most practical approach to position awareness is anchor-based encoding. Pick a set of "landmark" nodes — anchors — spread across the graph. For each node v, compute the shortest-path distance from v to each anchor. The resulting vector of distances is the node's positional encoding.
Intuitively, the anchors are like GPS satellites. You don't need to know your absolute coordinates — you just need distances to a few known landmarks, and trilateration tells you where you are. In a graph, distances to a handful of strategically chosen anchors is often enough to uniquely pin down a node's location.
Anchor distances and Laplacian eigenvectors both serve as positional encodings, but they measure different things. Eigenvectors measure smooth spectral structure — useful for global community detection. Anchor distances measure actual shortest-path distances — useful when the task is intrinsically about graph distances (like predicting interactions between molecules at specific positions, or reasoning about network diameter).
python import networkx as nx, numpy as np def anchor_pe(G, K=20, seed=42): nodes = list(G.nodes()) rng = np.random.default_rng(seed) anchors = rng.choice(nodes, K, replace=False) pe = np.zeros((len(nodes), K)) for j, s in enumerate(anchors): dists = nx.single_source_shortest_path_length(G, s) for i, v in enumerate(nodes): pe[i, j] = dists.get(v, -1) # -1 if unreachable return pe # shape: (N, K)
A single anchor might not differentiate all nodes. Two nodes equidistant from every anchor would get the same encoding, even if they're structurally and positionally different. The solution is to use anchor sets rather than individual anchors, and to use distance to the closest node in each set.
Here's why sets help. Suppose you have two nodes u and v that are equidistant from anchor s₁. If you add anchor s₂, the chances that both u and v are also equidistant from s₂ are much lower — especially if s₂ is chosen randomly and independently. With K random single-node anchors, you get K distance values. With K random anchor sets of size q, you get K distances-to-nearest-set values, which partition the graph differently and provide more discriminating power.
Using sufficiently many random anchor sets provably makes the GNN more expressive than 1-WL for position-aware tasks. The intuition is information-theoretic: random anchor sets, with high probability over the randomness in set construction, will produce distinct encodings for any two non-isomorphic node positions. As the number of sets grows, the probability of a "collision" (two distinct positions getting identical encodings) goes to zero.
A graph where two nodes (red, blue) are equidistant from any single anchor. Add anchor sets (each circle = one set) and watch the nodes become distinguishable. Toggle between single anchors and sets.
python import networkx as nx, numpy as np def anchor_set_pe(G, K=20, set_size=4, seed=42): # K anchor sets, each of size q=set_size nodes = list(G.nodes()) rng = np.random.default_rng(seed) pe = np.zeros((len(nodes), K)) for k in range(K): S_k = rng.choice(nodes, set_size, replace=False) # Dist from v to nearest member of S_k for i, v in enumerate(nodes): d_min = float('inf') for s in S_k: try: d = nx.shortest_path_length(G, v, s) d_min = min(d_min, d) except nx.NetworkXNoPath: pass pe[i, k] = d_min if d_min != float('inf') else -1 return pe
The strategies so far — eigenvectors, substructure counts, anchor distances — all require preprocessing the graph before the GNN runs. There's a simpler-sounding alternative: just assign each node a random feature, sampled fresh before each forward pass. This trivially breaks all symmetries, because with probability 1 every node gets a unique random number.
This sounds almost too easy. The catch is generalization: if a node gets a different random ID every time the model sees the graph, the model cannot learn to associate "node with this ID" with "node that should be labeled X." A random ID that changes at each inference is not useful — the model has no way to learn from it in a stable way.
There's a fundamental tension between expressiveness and generalization:
The middle ground: use random features at training time, but evaluate the model in expectation over many random seeds. This approach, studied formally by Abboud et al. (2020) and Sato et al. (2021), has a key property: in expectation over random seeds, the model is as expressive as a 3-WL test — far beyond 1-WL — while still generalizing because the expected output depends only on graph structure, not on any specific random seed.
In practice, you approximate this expectation by averaging over several random seeds during inference (typically 10–50). This is computationally expensive but turns an otherwise useless trick into a genuinely powerful technique for structural graph tasks.
python import torch, torch.nn as nn class GNNWithRandFeatures(nn.Module): def __init__(self, gnn, rand_dim=16): super().__init__() self.gnn = gnn self.rand_dim = rand_dim def forward(self, x, edge_index, n_seeds=1): N = x.size(0) out = 0 for _ in range(n_seeds): z = torch.randn(N, self.rand_dim) # fresh random features x_z = torch.cat([x, z], dim=-1) # augment node features out += self.gnn(x_z, edge_index) return out / n_seeds # average over seeds
All the approaches so far augment the input to a standard (1-WL equivalent) GNN. But there's a fundamentally different path to more expressive models: change the unit of computation from individual nodes to tuples of nodes.
The standard WL test operates on single nodes — it colors each node by the multiset of its neighbors' colors. The k-WL test operates on k-tuples of nodes. A 2-WL test colors each pair (u, v) of nodes. A 3-WL test colors each triple (u, v, w). Each step up the hierarchy is strictly more expressive: 2-WL distinguishes every pair of graphs that 1-WL can, plus strictly more.
The k-WL test can detect and count all substructures with at most k nodes. So 3-WL can count triangles (3 nodes), 4-WL can count 4-cliques, and so on. This directly resolves WL's failure to count substructures — but at a steep cost.
Several architectures approximate the power of higher-order GNNs without the full O(nk) cost:
We've covered five distinct ways to push past the WL ceiling. Let's place them on a single map and connect them to the broader landscape of GNN research.
| Method | Expressiveness | Cost | Best For |
|---|---|---|---|
| Standard GCN / GraphSAGE | Below 1-WL | O(n + m) | Simple node classification |
| GIN | 1-WL equivalent | O(n + m) | Most graph tasks baseline |
| GIN + Laplacian PE | Breaks WL symmetry | O(n²) eigenvec | Position-sensitive tasks |
| GIN + Structural Features | Counts substructures | O(n³) for A³ | Molecular property prediction |
| GIN + Anchor Sets | Position-aware | O(K · n) BFS | Link prediction, distance tasks |
| PPGN (2-WL approx) | Strictly > 1-WL | O(n²) | Small molecular graphs |
| 3-WL / k-IGN | 3-WL | O(n³) | Tiny graphs only |
Lecture 6 (WL Test Basics): This lecture picks up exactly where Lec 6 ends. Lec 6 established that GIN = 1-WL and that standard message-passing is bounded by 1-WL. This lecture shows how to go beyond that bound.
Lecture 8 (Graph Transformers): Graph Transformers use Laplacian positional encodings (exactly as described in Ch 2–3) as input to the self-attention mechanism. The connection is direct: the transformer's global attention sees all nodes at once, making position ambiguity less severe, but positional encodings still help. The random sign-flip trick for eigenvectors (Ch 7) is a Graph Transformer training technique.
You now understand the expressiveness frontier of GNNs. Where to go next:
"Understanding the limits of a tool is not a defeat — it's the prerequisite for building the next one."
— a useful principle for ML theory