CS224W Lecture 7

Designing Powerful Graph Encoders

GIN is the most expressive message-passing GNN — but message-passing itself has a ceiling. This lecture shows how to break through it with positional encodings, structural features, and anchor-based position awareness.

Prerequisites: GNN basics (Lec 3–4) + WL test (Lec 6). That's it.
10
Chapters
4+
Simulations
0
Assumed Knowledge

Chapter 0: Beyond the WL Ceiling

Last lecture, we proved a powerful result: GIN is the most expressive message-passing GNN. Its sum-aggregation is injective over multisets of neighbor features, which means GIN can distinguish any two nodes that the Weisfeiler-Leman (WL) graph isomorphism test can distinguish.

That sounds great — until you realize WL itself has blind spots. There are pairs of graphs that WL cannot tell apart, no matter how many rounds you run it. And because GIN is equivalent to WL — not stronger — GIN has those exact same blind spots.

So if WL is the ceiling for all message-passing GNNs, and that ceiling has holes, we need to think differently. We need to go beyond message-passing. Specifically, we need to augment node features with information that breaks the symmetries WL gets stuck on.

The key idea: WL is limited because all nodes start with the same constant color (or only degree information). By giving nodes richer initial features — positional encodings, structural counts, random IDs — we can distinguish graphs that WL treats as identical. The GNN architecture itself need not change; what changes is what we feed into it.

Three Upgrades to the Input Features

This lecture covers three complementary strategies, each targeting a different WL failure mode:

  1. Laplacian positional encodings — spectral features that give each node a unique "address" in the graph.
  2. Structural features — counts of local substructures (triangles, cycles) that WL provably cannot count.
  3. Anchor-based position encoding — graph distances to a set of landmark nodes, enabling position-aware reasoning.

Think of it like this: a message-passing GNN is like a person who can only talk to their neighbors. WL is the limit of what they can learn that way. Our upgrades are like giving everyone a map — suddenly they know where they are, not just who's next to them.

The WL Ceiling and Ways to Break It

The expressiveness hierarchy of GNNs. Standard GNNs live below the WL ceiling. The strategies in this lecture push past it. Click each tier to see what it can distinguish.

Click a tier to see its expressive power and what it can distinguish.
GIN achieves maximum expressiveness among message-passing GNNs. What does this mean for WL's blind spots?

Chapter 1: What WL Gets Wrong

The WL test works by color-refinement: assign every node the same initial color, then repeatedly recolor each node based on its own color and the multiset of its neighbors' colors. Two graphs are declared non-isomorphic when they produce different color histograms.

This sounds thorough. But there are two specific structural properties WL fundamentally cannot detect, and they matter for real graph tasks.

Failure 1: Indistinguishable Regular Graphs

A k-regular graph is one where every node has exactly k neighbors. In a regular graph, every node starts with the same initial color. After one WL round, every node sees k neighbors all with the same color — so every node gets the same new color. After two rounds, same thing. WL never produces different colors, no matter how many rounds you run.

This means WL gives the same output to all k-regular graphs with the same number of nodes, even if they have completely different structure. A 3-regular graph that forms two triangles connected by edges looks identical to WL as a 3-regular graph with a completely different topology. Yet a GNN task like "does this graph contain a triangle?" would need to distinguish them.

Concrete example: The two 3-regular graphs on 6 nodes — the complete bipartite graph K3,3 and the prism graph (triangular prism) — both have 6 nodes, 9 edges, and every node with degree 3. WL assigns them the same color sequence forever. But K3,3 has no triangles while the prism has two. A chemistry GNN trying to detect ring structures would fail here.

Failure 2: Can't Count Triangles

Now consider a specific node. WL represents it by the multiset of its neighbors' colors. But it never sees how those neighbors connect to each other. Two of your neighbors being connected to each other (forming a triangle with you) looks the same as those same neighbors having no edge between them — from WL's perspective, the neighbor multisets are identical.

This is a direct consequence of message-passing: in one round, you see features of nodes one hop away. You never see the edges between those nodes. Triangles require seeing edges two hops away — and message-passing can't deliver that.

The Root Cause: Constant Initial Colors

Both failures trace back to the same root: WL starts with constant, featureless initial colors (or only degree). Featureless nodes are symmetric — any permutation of them looks identical. The color refinement respects this symmetry and can never break it. To distinguish more graphs, we must break symmetry in the initial features.

WL Failure: Two Indistinguishable Graphs

Two different 3-regular graphs. Run WL rounds and watch — both graphs converge to the same color histogram. WL declares them "possibly isomorphic" even though their topology differs.

Both graphs start with the same initial coloring. Run rounds to see they stay identical.
Why can WL not count the number of triangles a node participates in?

Chapter 2: Laplacian Eigenvectors as Features

The graph Laplacian L = D − A is a matrix that captures the structure of a graph. Here D is the diagonal degree matrix (Dii = degree of node i) and A is the adjacency matrix. L tells you how "rough" or "smooth" signals are when they live on graph nodes.

The eigenvectors of L are not just math — they encode the geometric structure of the graph. Think of the graph as a physical membrane. The eigenvectors are the natural vibration modes of that membrane. Low-frequency modes (small eigenvalues) describe the broad, coarse structure — which community a node belongs to. High-frequency modes (large eigenvalues) describe fine local structure — which side of a clique a node is on.

L · vk = λk · vk,   0 = λ1 ≤ λ2 ≤ … ≤ λn

The eigenvector v1 with λ1 = 0 is trivial (all ones). The second eigenvector v2 (the Fiedler vector) divides the graph into two communities — nodes on one side of a cut get positive values, the other side gets negative. The k-th eigenvector adds more resolution, like a finer and finer grid overlaid on the graph.

Using Eigenvectors as Node Features

Here's the key insight: if two nodes lie in structurally different positions in the graph, their values in the eigenvectors will differ. Nodes that WL sees as identical (because they have the same local neighborhood) can have very different Laplacian eigenvector values — because the eigenvectors reflect the global structure of the graph, not just the local neighborhood.

We stack the k smallest non-trivial eigenvectors into a matrix. Row i of this matrix is a k-dimensional positional encoding for node i. We concatenate this with the node's original features and feed the augmented features into any GNN.

hv(0) = concat( xv ,  [v2[v], v3[v], …, vk+1[v]] )
Sign ambiguity: Eigenvectors have a sign flip problem — both v and −v are valid eigenvectors (multiply both sides of Lv = λv by −1 and it still holds). So if you train on one graph where v2 points one way and test on another where it points the opposite way, the model sees contradictory inputs. Solutions: use sign-invariant architectures, random sign flipping during training, or absolute values of eigenvector components.

Global vs. Local Structure

The beauty of Laplacian eigenvectors is their multi-scale nature. Small eigenvalues → large-scale structure (are you in the left half of the graph or the right?). Large eigenvalues → small-scale structure (are you on the edge of a clique?). By concatenating k eigenvectors, a node gets a positional encoding that encodes structure at k different scales simultaneously.

Spectral interpretation: The Fiedler vector (2nd eigenvector) is used in spectral graph partitioning. Nodes with positive values go into cluster 1, negative into cluster 2. This is not coincidence — it's the optimal balanced cut of the graph, and it directly reflects node positions relative to the graph's community structure.
When do eigenvectors help? Most powerfully when two nodes have identical local neighborhoods but live in different parts of the global graph. Example: two hub nodes, each with 5 star-shaped neighbors — locally identical, but one hub connects the left half and the other connects the right half. Eigenvectors see this; WL doesn't.
python
import numpy as np
import scipy.sparse.linalg as spla

def laplacian_pe(adj, k):
    # adj: (N, N) adjacency matrix
    # k: number of eigenvectors to use
    N = adj.shape[0]
    deg = adj.sum(axis=1)
    D = np.diag(deg)
    L = D - adj                      # Graph Laplacian

    # Compute k+1 smallest eigenvalues/vectors
    # (skip the trivial all-ones eigenvector at lambda=0)
    eigvals, eigvecs = np.linalg.eigh(L)
    pe = eigvecs[:, 1:k+1]           # shape: (N, k)

    # pe[i] is the k-dim positional encoding for node i
    return pe  # concatenate with node features before GNN
Why does the second eigenvector of the Laplacian (Fiedler vector) serve as a useful positional encoding?

Chapter 3: Positional Encoding in Practice

Theory is clear: Laplacian eigenvectors break the WL symmetries that constant initial features cannot. Let's see exactly what this means in practice — with a graph where WL assigns every node the same color, and eigenvectors assign each node a distinct signature.

We'll use a simple 6-node graph arranged in two triangles connected by a bridge. WL cannot distinguish the two triangles from each other because they're perfectly symmetric. The Fiedler vector, however, assigns negative values to one triangle and positive to the other. The third eigenvector further differentiates nodes within each triangle. Each node gets a unique 2D encoding.

The showcase: Select which eigenvector (k=1, 2, or 3) to visualize as node colors. Watch how each eigenvector reveals different structural information. The first eigenvector is flat — everyone is the same. By the third, every node has a unique value. This is what breaks WL's blind spot.
Laplacian Eigenvectors as Node Colors

A graph of two triangles connected by a bridge. Node colors show the value of the selected eigenvector — warm = positive, teal = negative, gray = near zero. WL would color all nodes identically; eigenvectors reveal the global structure.

Eigenvector v₁
v₁ (trivial): all nodes get the same value — no information. Use eigenvectors v₂–v₄ to reveal structure.

The Full Pipeline

Input Graph
Nodes with features xv (or just degree). Adjacency matrix A, degree matrix D.
Compute Laplacian
L = D − A. Compute k smallest eigenvectors v2, ..., vk+1 via eigendecomposition.
Augment Node Features
hv = concat(xv, [v2[v], ..., vk+1[v]]). Each node now has structural position information.
Standard GNN
Feed augmented features into any GNN (GIN, GCN, GraphSAGE). The GNN can now distinguish nodes that WL could not.
More Expressive Embeddings
Nodes with different global positions in the graph get different embeddings, even if their local neighborhoods are identical.
In the two-triangles-plus-bridge graph, why does the Fiedler vector (2nd eigenvector) assign negative values to one triangle and positive to the other?

Chapter 4: Structural Features — Counting Substructures

Positional encodings use spectral information to tell nodes where they are. A different strategy is to tell nodes what their local structure looks like — not through message-passing (which is bounded by WL), but by directly computing structural statistics and appending them as features.

The most important structural statistics are substructure counts: how many triangles does this node participate in? How many 4-cycles? These numbers directly measure properties that WL provably cannot detect through message-passing. By computing them and appending them to node features, we hand the GNN exactly the information it was missing.

Why WL Can't Count Triangles

Recall why WL fails at triangles: it aggregates from direct neighbors, but never looks at edges between neighbors. A triangle through node v means two of v's neighbors are connected to each other. WL only sees the neighbors as a multiset of colors, not their mutual connections.

But we can count triangles ourselves using the adjacency matrix. The number of triangles passing through node v is:

triangles(v) = (A3)vv / 2

Here A3 means "number of walks of length 3 from v back to v." Any closed walk of length 3 is a triangle. The factor of 2 accounts for traversing each triangle in both directions. This is a straightforward matrix computation — polynomial in the number of nodes, and directly gives us what WL cannot compute.

Extending to Longer Cycles

The same idea extends to longer cycles. The diagonal of Ak counts closed walks of length k. These walks are not exactly cycles (they might revisit nodes), but with careful inclusion-exclusion, you can extract exact cycle counts.

For practical purposes, you don't need exact counts. Including (A3)vv, (A4)vv, (A5)vv as node features already captures a rich portrait of the local ring structure — useful for molecular property prediction (where ring sizes determine chemical behavior) and many other graph tasks.

python
import numpy as np

def structural_features(A):
    # A: adjacency matrix (N, N)
    A2 = A @ A
    A3 = A2 @ A
    A4 = A3 @ A
    A5 = A4 @ A

    tri   = np.diag(A3) / 2   # triangles
    cyc4  = np.diag(A4)        # 4-cycle walks
    cyc5  = np.diag(A5) / 2   # 5-cycle walks

    # Stack into (N, 3) feature matrix
    return np.stack([tri, cyc4, cyc5], axis=1)
Verified: GNNs augmented with substructure counts are strictly more expressive than 1-WL for graph classification tasks that require counting these substructures. In molecular benchmarks (e.g., ZINC, ogbg-molhiv), adding triangle counts as node features consistently improves performance — because many important molecular properties depend on whether atoms are in rings.

The Expressiveness Ladder

Adding more and more substructure features climbs the expressiveness ladder. Triangles help. 4-cycles add more. 5-cycles add more. But each adds computational cost. In practice, triangles (3-cycles) and 4-cycles cover the most important chemistry and the cost stays manageable.

What does the diagonal entry (A³)vv of the cubed adjacency matrix count, and what does dividing by 2 account for?

Chapter 5: Position-Aware GNNs

Positional encodings and structural features both augment node features before feeding into a GNN. But there's a deeper question: what kinds of tasks fundamentally require knowing a node's position in the graph?

Consider predicting the shortest-path distance between two specific nodes u and v in a graph. This requires knowing where u and v are relative to each other — not just their local structures. Two pairs of nodes with identical local neighborhoods can have wildly different pairwise distances. A purely structure-based GNN cannot distinguish these cases.

Structure vs. Position: Structure-aware means "what does the neighborhood around me look like?" Position-aware means "where am I in the global graph?" These are genuinely different questions. A node can have a unique local structure but still be ambiguous in position (multiple spots in the graph might have the same local structure), and vice versa. Many real tasks — link prediction, distance estimation, role assignment — require position.

Anchor-Based Positional Encoding

The most practical approach to position awareness is anchor-based encoding. Pick a set of "landmark" nodes — anchors — spread across the graph. For each node v, compute the shortest-path distance from v to each anchor. The resulting vector of distances is the node's positional encoding.

PE(v) = [ d(v, s1), d(v, s2), …, d(v, sK) ]

Intuitively, the anchors are like GPS satellites. You don't need to know your absolute coordinates — you just need distances to a few known landmarks, and trilateration tells you where you are. In a graph, distances to a handful of strategically chosen anchors is often enough to uniquely pin down a node's location.

Why Graph Distances, Not Spectral?

Anchor distances and Laplacian eigenvectors both serve as positional encodings, but they measure different things. Eigenvectors measure smooth spectral structure — useful for global community detection. Anchor distances measure actual shortest-path distances — useful when the task is intrinsically about graph distances (like predicting interactions between molecules at specific positions, or reasoning about network diameter).

How to choose anchors: Randomly select K nodes. Alternatively, pick high-degree nodes (hubs) or nodes that span the diameter of the graph. The exact selection matters less than having enough variety — K=20 random anchors on a 1000-node graph provides surprisingly good positional resolution.
python
import networkx as nx, numpy as np

def anchor_pe(G, K=20, seed=42):
    nodes = list(G.nodes())
    rng   = np.random.default_rng(seed)
    anchors = rng.choice(nodes, K, replace=False)

    pe = np.zeros((len(nodes), K))
    for j, s in enumerate(anchors):
        dists = nx.single_source_shortest_path_length(G, s)
        for i, v in enumerate(nodes):
            pe[i, j] = dists.get(v, -1)  # -1 if unreachable
    return pe  # shape: (N, K)
Why is anchor-based positional encoding necessary for tasks like predicting shortest-path distances between node pairs?

Chapter 6: Anchor Sets — Robustness through Redundancy

A single anchor might not differentiate all nodes. Two nodes equidistant from every anchor would get the same encoding, even if they're structurally and positionally different. The solution is to use anchor sets rather than individual anchors, and to use distance to the closest node in each set.

Here's why sets help. Suppose you have two nodes u and v that are equidistant from anchor s₁. If you add anchor s₂, the chances that both u and v are also equidistant from s₂ are much lower — especially if s₂ is chosen randomly and independently. With K random single-node anchors, you get K distance values. With K random anchor sets of size q, you get K distances-to-nearest-set values, which partition the graph differently and provide more discriminating power.

PEset(v) = [ mins ∈ S1 d(v, s), mins ∈ S2 d(v, s), … ]
Why "distance to set" not "distance to centroid"? In a graph, there's no notion of centroid — you can't average nodes. The distance to the nearest member of a set is a well-defined graph quantity: it's the length of the shortest path from v to any node in Sk. This generalizes "distance to anchor" while allowing the anchor region to be a cluster rather than a point.

Theoretical Guarantee

Using sufficiently many random anchor sets provably makes the GNN more expressive than 1-WL for position-aware tasks. The intuition is information-theoretic: random anchor sets, with high probability over the randomness in set construction, will produce distinct encodings for any two non-isomorphic node positions. As the number of sets grows, the probability of a "collision" (two distinct positions getting identical encodings) goes to zero.

Anchor Distances vs. Anchor Set Distances

A graph where two nodes (red, blue) are equidistant from any single anchor. Add anchor sets (each circle = one set) and watch the nodes become distinguishable. Toggle between single anchors and sets.

Mode: single anchors. Red and blue nodes have equal distances to all anchors.
python
import networkx as nx, numpy as np

def anchor_set_pe(G, K=20, set_size=4, seed=42):
    # K anchor sets, each of size q=set_size
    nodes  = list(G.nodes())
    rng    = np.random.default_rng(seed)
    pe     = np.zeros((len(nodes), K))

    for k in range(K):
        S_k = rng.choice(nodes, set_size, replace=False)
        # Dist from v to nearest member of S_k
        for i, v in enumerate(nodes):
            d_min = float('inf')
            for s in S_k:
                try:
                    d = nx.shortest_path_length(G, v, s)
                    d_min = min(d_min, d)
                except nx.NetworkXNoPath:
                    pass
            pe[i, k] = d_min if d_min != float('inf') else -1
    return pe
Why do anchor SETS outperform individual anchor nodes for positional encoding?

Chapter 7: Random Features — Breaking Symmetry Stochastically

The strategies so far — eigenvectors, substructure counts, anchor distances — all require preprocessing the graph before the GNN runs. There's a simpler-sounding alternative: just assign each node a random feature, sampled fresh before each forward pass. This trivially breaks all symmetries, because with probability 1 every node gets a unique random number.

This sounds almost too easy. The catch is generalization: if a node gets a different random ID every time the model sees the graph, the model cannot learn to associate "node with this ID" with "node that should be labeled X." A random ID that changes at each inference is not useful — the model has no way to learn from it in a stable way.

The Generalization Tension

There's a fundamental tension between expressiveness and generalization:

Maximally expressive: Give each node a unique, permanent integer ID. Every node is now distinct. The GNN becomes a lookup table. It can perfectly memorize the training graphs. But it generalizes to new graphs poorly — a node it's never seen gets a random ID that doesn't correspond to any learned pattern.
Maximally generalizable: Give all nodes the same constant feature (or only degree). The model learns only structural patterns that generalize across graphs. But it can't distinguish nodes with identical local neighborhoods — the WL blind spot.

Random Features with Expectation Averaging

The middle ground: use random features at training time, but evaluate the model in expectation over many random seeds. This approach, studied formally by Abboud et al. (2020) and Sato et al. (2021), has a key property: in expectation over random seeds, the model is as expressive as a 3-WL test — far beyond 1-WL — while still generalizing because the expected output depends only on graph structure, not on any specific random seed.

f(G) = Ez ~ P[ GNN(G, z) ]

In practice, you approximate this expectation by averaging over several random seeds during inference (typically 10–50). This is computationally expensive but turns an otherwise useless trick into a genuinely powerful technique for structural graph tasks.

Random features in transformers: This same idea appears in Graph Transformers (Lec 8). Randomized Laplacian positional encodings — where the sign of eigenvectors is randomly flipped during training — force the model to learn sign-invariant representations. The training randomness regularizes the model and improves generalization, while still providing positional signal in expectation.
python
import torch, torch.nn as nn

class GNNWithRandFeatures(nn.Module):
    def __init__(self, gnn, rand_dim=16):
        super().__init__()
        self.gnn      = gnn
        self.rand_dim = rand_dim

    def forward(self, x, edge_index, n_seeds=1):
        N = x.size(0)
        out = 0
        for _ in range(n_seeds):
            z    = torch.randn(N, self.rand_dim)   # fresh random features
            x_z  = torch.cat([x, z], dim=-1)       # augment node features
            out += self.gnn(x_z, edge_index)
        return out / n_seeds                       # average over seeds
Why do random node features, used naively (one seed, permanent ID), fail to generalize to unseen graphs?

Chapter 8: Higher-Order GNNs

All the approaches so far augment the input to a standard (1-WL equivalent) GNN. But there's a fundamentally different path to more expressive models: change the unit of computation from individual nodes to tuples of nodes.

The standard WL test operates on single nodes — it colors each node by the multiset of its neighbors' colors. The k-WL test operates on k-tuples of nodes. A 2-WL test colors each pair (u, v) of nodes. A 3-WL test colors each triple (u, v, w). Each step up the hierarchy is strictly more expressive: 2-WL distinguishes every pair of graphs that 1-WL can, plus strictly more.

1-WL < 2-WL < 3-WL < …

What k-WL Can Distinguish

The k-WL test can detect and count all substructures with at most k nodes. So 3-WL can count triangles (3 nodes), 4-WL can count 4-cliques, and so on. This directly resolves WL's failure to count substructures — but at a steep cost.

The exponential cost: A k-WL GNN operates on the set of all k-tuples of nodes. For a graph with n nodes, that's nk tuples. With n=1000 nodes and k=3, that's 109 triples — far too many. For k=2 (pairs), n=1000 gives 106 pairs: expensive but feasible with sparse approximations. This is why k-WL approaches are generally only practical for k=2 or k=3 on moderate-sized graphs.

Practical Approximations

Several architectures approximate the power of higher-order GNNs without the full O(nk) cost:

k-IGN (k-dimensional Invariant Graph Networks)
Operate on tensor representations of the graph. More algebraically principled, same expressiveness as k-WL but with some sparsity tricks.
PPGN (Provably Powerful Graph Networks)
2-WL equivalent with careful sparse message-passing. Practical for graphs with hundreds of nodes. Used in molecular property prediction benchmarks.
Local k-WL
Only form k-tuples within local neighborhoods, not globally. Reduces cost from O(nk) to O(n · dk-1) where d is the max degree. Still captures most useful substructures for sparse graphs.
Practical guidance: For molecular graphs (typically n < 100), k=3 is computationally feasible and provides strong expressiveness. For social networks or citation graphs (n = 10,000+), stick with 1-WL augmented with structural features or positional encodings — they provide most of the practical benefit at a fraction of the cost.
Why is the 3-WL test more expressive than 1-WL for counting triangles, while 1-WL fails?

Chapter 9: Connections and the Expressiveness Landscape

We've covered five distinct ways to push past the WL ceiling. Let's place them on a single map and connect them to the broader landscape of GNN research.

The Expressiveness Hierarchy

MethodExpressivenessCostBest For
Standard GCN / GraphSAGEBelow 1-WLO(n + m)Simple node classification
GIN1-WL equivalentO(n + m)Most graph tasks baseline
GIN + Laplacian PEBreaks WL symmetryO(n²) eigenvecPosition-sensitive tasks
GIN + Structural FeaturesCounts substructuresO(n³) for A³Molecular property prediction
GIN + Anchor SetsPosition-awareO(K · n) BFSLink prediction, distance tasks
PPGN (2-WL approx)Strictly > 1-WLO(n²)Small molecular graphs
3-WL / k-IGN3-WLO(n³)Tiny graphs only

Connections to Other Lectures

Lecture 6 (WL Test Basics): This lecture picks up exactly where Lec 6 ends. Lec 6 established that GIN = 1-WL and that standard message-passing is bounded by 1-WL. This lecture shows how to go beyond that bound.

Lecture 8 (Graph Transformers): Graph Transformers use Laplacian positional encodings (exactly as described in Ch 2–3) as input to the self-attention mechanism. The connection is direct: the transformer's global attention sees all nodes at once, making position ambiguity less severe, but positional encodings still help. The random sign-flip trick for eigenvectors (Ch 7) is a Graph Transformer training technique.

Which method should I use? For molecular tasks (n < 100): add triangle/4-cycle counts and try PPGN if you have the compute budget. For social/citation graphs (n > 1000): add Laplacian PE or anchor distances, stay with standard GNN architecture. For position-sensitive link prediction: anchor sets are the most targeted approach. The choice depends on what WL failure mode your task actually suffers from.

Key Papers

Related Lessons

You now understand the expressiveness frontier of GNNs. Where to go next:

"Understanding the limits of a tool is not a defeat — it's the prerequisite for building the next one."
— a useful principle for ML theory