CS224W Lecture 6 — Theory of GNNs

Chapter 0: Can GNNs Tell Nodes Apart?

You have a social network. Alice follows Bob, Charlie, and Dana. Bob follows Alice and Charlie. Their local neighborhoods are different: Alice has three connections, Bob has two, and the people they connect to differ. A good GNN should give Alice and Bob different embeddings, because their structural roles in the graph differ.

But will it? This depends entirely on the aggregation function. With mean aggregation, for example, if two nodes with the same own features have neighbor features that average to the same value, even though the neighbor multisets are different, the GNN cannot tell them apart. It will assign them identical embeddings, regardless of how many layers you stack or how wide you make the network.

This is not a bug you can tune away. It's a fundamental mathematical limit, tied to what it means for a function to be injective — to map distinct inputs to distinct outputs. If the aggregation function is not injective over neighbor multisets, the GNN loses information on every single layer, and there's no recovery.

The central question: Under what conditions can a GNN assign different embeddings to nodes with different local neighborhoods? And what is the most expressive possible message-passing GNN? This lecture answers both questions precisely.

What We Mean by "Expressive"

A GNN is more expressive than another if it can distinguish more pairs of nodes (or graphs) that are structurally different. Maximum expressiveness means: any two nodes with genuinely different K-hop neighborhoods get different embeddings after K layers. No information is lost.

Expressiveness has a cost: less expressive models are often more regularized and generalize better on small datasets. But understanding the theoretical ceiling is essential — it tells you what's impossible, and where to look when your model fails.

Two Nodes — Can the GNN Tell Them Apart?

Node A and Node B have different neighbor structures. An expressive GNN should give them different embeddings. Click "Aggregate" to see what mean vs. sum aggregation produces. Watch what information gets lost.

Node A: neighbors {1, 1, 2}. Node B: neighbors {1, 2}. Different neighborhoods — should get different embeddings.
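For readers without the interactive widget, here is a minimal sketch of the same comparison in plain Python. The helper name `aggregate` is made up for this illustration, and neighbor features are treated as bare scalars with no learned transform.

```python
from fractions import Fraction

def aggregate(neighbors):
    """Return mean, max, and sum of a list of scalar neighbor features."""
    total = sum(neighbors)
    return {
        "mean": Fraction(total, len(neighbors)),  # exact arithmetic, no float rounding
        "max": max(neighbors),
        "sum": total,
    }

node_a = [1, 1, 2]  # neighbor features of Node A
node_b = [1, 2]     # neighbor features of Node B

agg_a = aggregate(node_a)
agg_b = aggregate(node_b)

for name in ("mean", "max", "sum"):
    verdict = "distinguishes" if agg_a[name] != agg_b[name] else "cannot distinguish"
    print(f"{name}: A={agg_a[name]}, B={agg_b[name]} -> {verdict}")
```

On this particular pair, mean and sum separate the two nodes while max does not. Other pairs trip up mean instead.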

What does it mean for a GNN aggregation to be "injective"?

It means the aggregation is computed in parallel across all nodes
It means the aggregation uses a neural network instead of a fixed formula
It means different input multisets of neighbor features always produce different output values: no information is lost

Chapter 1: Computational Graphs = Rooted Subtrees

A GNN doesn't see the whole graph at once. With K layers, each node sees exactly its K-hop neighborhood — the subgraph reachable within K steps. This neighborhood, unfolded into a tree structure rooted at the node, is called its computational graph.

Here's how it works. Before the first layer (K=0), a node only knows its own features. After one layer (K=1), it has aggregated from its immediate neighbors. After two layers (K=2), it has aggregated from its neighbors' neighbors. The "view" expands with each layer, but it's always a tree — because we unfold the graph by tracing paths outward, and the same node may appear multiple times in the unfolding.
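The unfolding can be sketched in a few lines. The 4-node graph and the helper names `unfold` and `count_occurrences` below are hypothetical, chosen only to show that the root itself reappears deeper in its own tree.

```python
def unfold(adj, root, depth):
    """Unfold the depth-hop neighborhood of `root` into a rooted tree.

    Returns (node, [child subtrees]). Messages flow along every edge in both
    directions, so a node's parent shows up again among its children.
    """
    if depth == 0:
        return (root, [])
    return (root, [unfold(adj, nbr, depth - 1) for nbr in adj[root]])

def count_occurrences(tree, target):
    """Count how many times `target` appears anywhere in the unfolded tree."""
    node, children = tree
    return int(node == target) + sum(count_occurrences(c, target) for c in children)

# Hypothetical graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

tree = unfold(adj, root=0, depth=2)
print(tree)
print("node 0 appears", count_occurrences(tree, 0), "times")
```

Node 0 is the root and also a grandchild twice (once via node 1, once via node 2), which is exactly the repetition described above.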

Key insight: Two nodes with identical rooted K-hop subtrees MUST receive the same embedding after K layers. The GNN is a deterministic function of the computational graph: identical inputs, identical outputs. This sets an upper bound on what any GNN can distinguish.

The Two Conditions for Perfect Expressiveness

For a K-layer GNN to be maximally expressive:

Same subtrees → same embeddings. This is guaranteed by the GNN's computation structure — it always holds.
Different subtrees → different embeddings. This is NOT guaranteed. It depends on whether every aggregation function in every layer is injective over multisets of neighbor features.

The second condition is the hard one. It's asking: can the aggregation function distinguish any two distinct multisets it might see? If yes at every layer, the GNN is as expressive as the Weisfeiler-Leman test. If no at any layer, some structural information is permanently discarded.

Unfolding a Graph into Rooted Subtrees

A 4-node graph. Select a node to see its 2-hop rooted subtree. Notice how the same neighbor can appear multiple times in the unfolding — the computational graph is a tree, not the original graph.

Select a node to unfold its computational graph.

In a GNN with K layers, what does each node's computational graph look like?

A K-hop subgraph with cycles preserved from the original graph
A rooted tree unfolded from the node's K-hop neighborhood, where the same neighbor can appear multiple times
The entire graph, seen through K aggregation steps

Chapter 2: Multiset Functions

Here's a distinction that makes all the difference: set versus multiset. A set is a collection where each element appears at most once: {red, blue, green}. A multiset allows repeats: {red, red, blue}. {red, blue} and {red, red, blue} contain the same set of colors, but they are different multisets.

When a GNN aggregates neighbor features, it's computing a function over a multiset. Node A might have neighbors with features {1, 1, 2} — two neighbors with value 1, one with value 2. Node B might have neighbors with features {1, 2} — one of each. Same set of distinct values, different multisets. An expressive aggregation must treat them differently.
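In Python terms, the distinction is roughly `set` versus `collections.Counter`. This is only an analogy for illustration; a GNN never builds these objects explicitly.

```python
from collections import Counter

neighbors_a = [1, 1, 2]  # Node A's neighbor features
neighbors_b = [1, 2]     # Node B's neighbor features

# Viewed as sets, multiplicity is discarded and the two look the same.
same_as_sets = set(neighbors_a) == set(neighbors_b)

# Viewed as multisets, the count of each value is part of the identity.
same_as_multisets = Counter(neighbors_a) == Counter(neighbors_b)

print("equal as sets:     ", same_as_sets)
print("equal as multisets:", same_as_multisets)
print("multiset A counts: ", dict(Counter(neighbors_a)))
```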

Why this matters: If your aggregation ignores multiplicity — how many times each feature value appears — then a node with five blue neighbors looks the same as a node with one blue neighbor. You've thrown away information about neighbor counts, which is fundamental structural information about the graph.

Three Aggregations, Three Levels of Sensitivity

Consider these three aggregation functions on a multiset of scalar values:

Mean: average the values. Sensitive to proportions but not to total counts.
Max: take the largest value. Sensitive only to which values are present, not how many.
Sum: add all values. Sensitive to both which values are present and how many of each.

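Mean and max both discard information. Sum keeps the count information that they lose and, combined with a suitable per-element transform, can preserve the entire multiset. This turns out to be the key property for maximal GNN expressiveness.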
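One quick way to see the difference in sensitivity is to repeat the same multiset several times and watch which aggregate moves. The sketch below uses a made-up helper `summarize` and scalar features.

```python
from fractions import Fraction

def summarize(multiset):
    """Return (mean, max, sum) for a list of scalar features."""
    total = sum(multiset)
    return Fraction(total, len(multiset)), max(multiset), total

base = [1, 2]
results = {}
for copies in (1, 2, 3):
    multiset = base * copies  # e.g. [1, 2, 1, 2] when copies == 2
    results[copies] = summarize(multiset)
    mean_value, max_value, sum_value = results[copies]
    print(f"{copies} copies: mean={mean_value}, max={max_value}, sum={sum_value}")
```

Mean and max stay fixed as the neighborhood grows, so they cannot see the change in size. Only the sum changes.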

Multiset Distinguisher

Two multisets of neighbor features. All three aggregations are computed. Green = aggregations differ (can distinguish). Red = aggregations are equal (cannot distinguish). Drag the sliders to construct failure cases for mean and max.

Default slider values: Multiset A has two 1s and one 2; Multiset B has one 1 and two 2s.

Why is aggregating over a "multiset" the right framing for GNN neighbor aggregation?

Because neighbor features are always integers and multisets only contain integers
Because a node can have multiple neighbors with the same feature vector, and multiplicity (how many times each value appears) carries structural information
Because graph edges form multisets where each edge can have multiple labels

Chapter 3: Why GCN Fails

GCN uses mean aggregation: h_v^(k) = MLP(mean({h_u^(k-1) : u ∈ N(v)})). The mean divides the sum by the number of neighbors. This normalization is exactly the problem.

Mean aggregation cannot distinguish two different multisets that have the same average. Consider: multiset {1, 1, 1} has mean 1. Multiset {1} also has mean 1. To GCN, a node surrounded by three identical neighbors looks exactly the same as a node with one such neighbor. Three blue nodes or one blue node — the GCN can't tell.

Concrete failure: Suppose neighbors have feature values in {red=0, blue=1}. Two nodes: Node A has neighbors {red, red, blue, blue} — equal mix of red and blue. Node B has neighbors {red, blue} — also equal mix. Mean(A) = 0.5 red + 0.5 blue. Mean(B) = 0.5 red + 0.5 blue. Identical. GCN assigns A and B the same embedding even though they have structurally different neighborhoods (4 neighbors vs 2).

What GCN Can and Cannot Distinguish

GCN CAN distinguish nodes whose neighbor feature distributions differ. If A's neighbors are 80% red and B's neighbors are 60% red, GCN gets different means and can tell them apart.

GCN CANNOT distinguish nodes whose neighbor feature distributions are identical but counts differ. {red×2, blue×2} and {red×1, blue×1} have the same mean — 50% red, 50% blue — regardless of how many neighbors there are. This is a systematic blind spot, not fixable by increasing depth or width.
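Both cases can be checked numerically. The sketch below assumes a one-hot encoding of the two colors and a hypothetical helper `mean_aggregate`; it leaves out the learned transform and looks only at the aggregation step.

```python
from fractions import Fraction

# Assumed encoding for this illustration: one-hot vectors per color.
ONE_HOT = {"red": (1, 0), "blue": (0, 1)}

def mean_aggregate(colors):
    """Element-wise mean of the one-hot vectors of the neighbor colors."""
    vectors = [ONE_HOT[c] for c in colors]
    n = len(vectors)
    return tuple(Fraction(sum(v[i] for v in vectors), n) for i in range(2))

# Same proportions, different counts: the blind spot.
node_a = ["red", "red", "blue", "blue"]
node_b = ["red", "blue"]
print(mean_aggregate(node_a), mean_aggregate(node_b))

# Different proportions (80% red vs. 60% red): mean can tell these apart.
node_c = ["red"] * 4 + ["blue"]
node_d = ["red"] * 3 + ["blue"] * 2
print(mean_aggregate(node_c), mean_aggregate(node_d))
```

The first pair collapses to the same vector (1/2, 1/2). The second pair yields (4/5, 1/5) and (3/5, 2/5).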

GCN Mean Aggregation Failure

Two nodes with different neighborhoods. Both have the same proportion of red/blue neighbors, but different counts. GCN computes identical means and assigns identical embeddings. Adjust the node counts to find cases where GCN succeeds vs. fails.

Default slider values: Node A has 2 red and 2 blue neighbors; Node B has 1 red and 1 blue neighbor.

Why does GCN's mean aggregation fail to distinguish {red, red, blue, blue} from {red, blue}?

Because GCN doesn't use edge weights, so all neighbors look equal
Because GCN doesn't have enough layers to see all neighbors
Because dividing by the number of neighbors destroys count information: both multisets produce a mean of 50% red, 50% blue

Chapter 4: Why GraphSAGE Fails

The GraphSAGE variant analyzed here uses max-pool aggregation: apply an MLP to each neighbor's features to get a transformed feature vector, then take the element-wise maximum across all neighbors. This is more sophisticated than mean, because it can capture the "most extreme" feature value in the neighborhood.

But max-pool has its own blind spot: it completely ignores multiplicity. max({red, red, blue}) = max({red, blue}). Once you've seen that red is present, seeing more reds adds nothing to the maximum. The max is determined by which distinct values are present, not how many of each.

Think of it this way: max-pool answers the question "which colors are present in the neighborhood?" But it cannot answer "how many of each color are there?" A node with five red neighbors and a node with one red neighbor look identical to max-pool — the maximum is "red" either way.

The Exact Failure Mode

Consider two nodes: Node A has neighbors {red, red, red} — three identical red neighbors. Node B has neighbors {red} — just one red neighbor. Max-pool on A: max(red, red, red) = red. Max-pool on B: max(red) = red. Identical outputs. GraphSAGE cannot distinguish A from B.

Contrast with GCN's failure. GCN fails when proportions are identical. GraphSAGE fails when the set of distinct values is identical. These are different failure modes — and both can occur in real graphs. Neither model is a superset of the other in expressiveness.
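The two failure modes can be put side by side on scalar features. This is a simplified sketch: it applies mean and max directly to raw values and omits the per-neighbor MLP, and the three test pairs are picked by hand for illustration.

```python
from fractions import Fraction

def mean_agg(xs):
    return Fraction(sum(xs), len(xs))

def max_agg(xs):
    return max(xs)

# Hand-picked pairs of different multisets.
pairs = {
    "max fails, mean succeeds": ([1, 1, 2], [1, 2]),
    "mean fails, max succeeds": ([1, 3], [2, 2]),
    "both fail": ([1, 1, 1], [1]),
}

outcome = {}
for label, (a, b) in pairs.items():
    outcome[label] = {
        "mean": mean_agg(a) != mean_agg(b),  # True means "can distinguish"
        "max": max_agg(a) != max_agg(b),
    }
    print(label, outcome[label])
```

Each aggregator separates a pair that the other one merges, and some pairs defeat both.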

Max-Pool Failure Cases

Two nodes with different neighborhoods. Max-pool aggregation is shown. When both nodes have the same set of distinct neighbor features (regardless of counts), max-pool assigns identical embeddings. Toggle the neighbor configurations to find failure cases.

The Expressiveness Hierarchy (So Far)

We now have two concrete failure modes:

GCN (mean): fails when proportions match but counts differ
GraphSAGE (max): fails when the set of distinct features matches

Neither is strictly better than the other. Both are strictly worse than some ideal injective aggregation. What would that look like?

GraphSAGE max-pool aggregation cannot distinguish {red, red, red} from {red}. Why?

Because the maximum of any multiset containing only red values is red, regardless of how many red values are in the multiset
Because max-pool is a linear operation and cannot capture nonlinear relationships
Because GraphSAGE uses sampling and may miss some neighbors

Chapter 6: GIN: The Most Expressive GNN

We now have a target: build a GNN that is exactly as powerful as the WL test. Xu et al. (ICLR 2019) proved that this is achievable, and designed the Graph Isomorphism Network (GIN) to achieve it. The key theorem is elegant.

Theorem (Xu et al., 2019): A GNN is at most as powerful as the WL test. A GNN with sum aggregation + a sufficiently powerful (universal) function achieves this maximum. GIN is such a network.

Why Sum is the Key

Recall: mean fails because it normalizes away counts. Max fails because it ignores multiplicity. What about sum?

Sum does not normalize. Sum({1, 1, 2}) = 4. Sum({1, 2}) = 3. Different. Sum({1, 1, 1}) = 3. Sum({1}) = 1. Different. Raw sums can still collide, though: Sum({1, 1}) = Sum({2}) = 2. What makes sum special is that such collisions can be removed by transforming each element before summing. The formal statement is: for a countable feature space and multisets of bounded size, there exists a function f such that the SUM of f(x) over a multiset is injective.

This is not trivial — it requires f to map feature values to numbers in a way that makes the sums unique. For discrete features, such f always exists. For continuous features, an MLP can approximate it. This is the connection between sum aggregation and injectivity.
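To make the role of f concrete, here is a brute-force check on a tiny discrete alphabet. The alphabet, the size bound, and the specific choice f(x) = N^(-index of x) are assumptions for this sketch: it is one standard construction, not necessarily the exact form used in the paper.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

VALUES = [1, 2, 3]   # hypothetical discrete feature alphabet
MAX_SIZE = 3         # assumed bound on neighborhood size
N = MAX_SIZE + 1     # base strictly larger than any possible count

def f(x):
    """Map the i-th alphabet value to N**-(i+1), a distinct 'digit position'."""
    return Fraction(1, N ** (VALUES.index(x) + 1))

def all_multisets():
    """Every non-empty multiset over VALUES with at most MAX_SIZE elements."""
    for size in range(1, MAX_SIZE + 1):
        yield from combinations_with_replacement(VALUES, size)

def collisions(transform):
    """Pairs of distinct multisets whose transformed sums coincide."""
    seen, clashes = {}, []
    for multiset in all_multisets():
        key = sum(transform(x) for x in multiset)
        if key in seen:
            clashes.append((seen[key], multiset))
        else:
            seen[key] = multiset
    return clashes

raw = collisions(lambda x: x)  # sum of raw values
encoded = collisions(f)        # sum of f(x)

print("raw-sum collisions:", len(raw), "first:", raw[0])
print("f-sum collisions:  ", len(encoded))
```

Summing raw values merges (2,) with (1, 1), among others. Summing f(x) writes each count into its own base-N digit, so no two multisets in this range collide.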

The Universal Multiset Function Theorem

For multisets of bounded size over a countable feature space, any multiset function, including an injective one, can be written in the form:

Φ(∑_{x ∈ S} f(x))

Where f maps individual elements and Φ maps the resulting sum. Both Φ and f can be approximated to arbitrary precision by MLPs. This is why GIN uses MLPs for both the feature transformation and the final output — it's not just a design choice, it's a theoretical necessity for achieving maximum expressiveness.

Why Sum is Injective: A Visualization

Compare mean, max, and sum on four canonical multiset pairs. Green = can distinguish (injective for this pair). Red = cannot distinguish (failure). Sum always gets it right.

Why does sum aggregation (unlike mean or max) achieve maximum expressiveness for GNNs?

Because sum is computationally faster than mean or max
Because sum aggregation naturally incorporates self-loops
Because sum preserves both which elements are present AND how many of each: it can be made injective over multisets, unlike mean (loses count) or max (loses multiplicity)

Chapter 7: The GIN Update Rule

GIN's update rule has one subtle detail beyond "use sum + MLP." The node needs to aggregate information from both its own current representation AND its neighbors' representations. Naively, you might just sum them all together — but then the node's own features get mixed in with the neighbor features in a way that loses the distinction between "self" and "neighbor."

The full GIN update:

h_v^(k) = MLP^(k)((1 + ε^(k)) · h_v^(k-1) + ∑_{u ∈ N(v)} h_u^(k-1))

The (1 + ε) factor is the key. It ensures the node's own embedding is weighted separately from the neighbor sum. Without it, a node with feature vector h and no neighbors would look identical to a node with no self-features but neighbors that sum to h — they'd both produce the same input to the MLP. The (1+ε) breaks this ambiguity.
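The ambiguity is easy to reproduce with scalars. In the sketch below, `plain_sum` stands for adding the self feature to the neighbor sum with no separate weight, and ε = 0.5 is an arbitrary value chosen only for illustration.

```python
def plain_sum(h_self, neighbor_feats):
    """Self feature added to the neighbor sum with no separate weighting."""
    return h_self + sum(neighbor_feats)

def gin_combine(h_self, neighbor_feats, eps):
    """Pre-MLP input of a GIN layer: (1 + eps) * self + sum of neighbors."""
    return (1 + eps) * h_self + sum(neighbor_feats)

isolated = (3.0, [])       # self feature 3, no neighbors
hub = (0.0, [1.0, 2.0])    # zero self feature, neighbors that sum to 3

EPS = 0.5  # hypothetical value, not taken from the lecture

print("plain sum:", plain_sum(*isolated), plain_sum(*hub))
print("weighted: ", gin_combine(*isolated, EPS), gin_combine(*hub, EPS))
```

The plain sum gives 3.0 for both nodes. With the weighted self term, the isolated node moves to 4.5 while the hub stays at 3.0.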

ε in practice: ε can be a learnable parameter (learned from data) or a fixed constant (often just 0, meaning the node's own feature counts once, same as a self-loop). The injectivity argument relies on a suitable nonzero ε (Xu et al. show that infinitely many choices work, including every irrational number); with ε = 0 the self term is treated like one more neighbor. Even so, Xu et al. report that the fixed ε = 0 variant (GIN-0) performs comparably to the learned-ε variant in their experiments.

Data Flow Through One GIN Layer

Let's trace a concrete example. Node v has feature h_v = [0.5, 0.3] and two neighbors with features [1.0, 0.2] and [0.7, 0.8]. With ε=0:

Scale self: (1+0) · [0.5, 0.3] = [0.5, 0.3]
Sum neighbors: [1.0, 0.2] + [0.7, 0.8] = [1.7, 1.0]
Add: [0.5, 0.3] + [1.7, 1.0] = [2.2, 1.3]
MLP: h_v^(new) = MLP([2.2, 1.3])
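The four steps above can be sketched in plain Python. The MLP weights are placeholders invented for this example (a real GIN layer learns them), and batch normalization is left out.

```python
import math

def gin_pre_mlp(h_self, neighbors, eps=0.0):
    """Steps 1-3: (1 + eps) * h_self plus the element-wise sum of the neighbors."""
    return [
        (1 + eps) * h_self[i] + sum(nbr[i] for nbr in neighbors)
        for i in range(len(h_self))
    ]

def mlp(x, w1, b1, w2, b2):
    """Step 4: a two-layer perceptron with a ReLU in between."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]

h_v = [0.5, 0.3]
neighbors = [[1.0, 0.2], [0.7, 0.8]]

z = gin_pre_mlp(h_v, neighbors, eps=0.0)
print("MLP input:", [round(value, 6) for value in z])

# Hypothetical weights, only so the example runs end to end.
W1, B1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
W2, B2 = [[1.0, 1.0]], [0.0]
h_new = mlp(z, W1, B1, W2, B2)
print("new embedding:", [round(value, 6) for value in h_new])
```

The MLP input matches the hand computation, [2.2, 1.3]. The final embedding depends entirely on the placeholder weights.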

The MLP is a multi-layer perceptron with at least 2 layers; Xu et al. note that a 1-layer perceptron is not sufficient to model the multiset functions required here. GIN uses batch normalization between MLP layers for training stability.

GIN = Neural WL: If you replace the MLP with a perfect injective hash, GIN exactly simulates the WL test. The MLP is the "neural" version of the hash — it approximates injectivity by learning from data. This is why GIN is called the Graph Isomorphism Network: it's a differentiable, learnable implementation of WL.

Graph-Level Readout with GIN

For graph classification (not just node classification), GIN uses a concatenation readout across all layers:

h_G = CONCAT(READOUT({h_v^(k) : v ∈ G}) | k = 0, 1, ..., K)

This preserves embeddings from all depths. Shallow layers capture local structure (triangles, paths of length 2). Deep layers capture global structure. Concatenating all layers is more expressive than just using the final layer — it uses structural information at every scale.
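A minimal sketch of the concatenated readout follows. It assumes READOUT is a sum over nodes (the formula above leaves READOUT generic), and the per-layer embeddings are made-up numbers.

```python
def sum_readout(node_embeddings):
    """Element-wise sum over all node embeddings of one layer (assumed READOUT)."""
    dim = len(node_embeddings[0])
    return [sum(h[i] for h in node_embeddings) for i in range(dim)]

def graph_embedding(layers):
    """Concatenate the readout of every layer k = 0, 1, ..., K.

    layers[k] is the list of node embeddings after k message-passing layers.
    """
    h_g = []
    for node_embeddings in layers:
        h_g.extend(sum_readout(node_embeddings))
    return h_g

# Made-up embeddings for a 3-node graph.
layers = [
    [[1, 0], [0, 1], [1, 1]],  # k = 0: input features
    [[2, 1], [1, 2], [3, 3]],  # k = 1: after one layer
]

h_G = graph_embedding(layers)
print(h_G)  # [2, 2, 6, 6]
```

The result has one block per layer, so its length is (K + 1) times the embedding dimension.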

Why does GIN include the (1+ε)·h_v term instead of just summing h_v with the neighbors?

To make the gradient flow better during backpropagation
To distinguish the node's own contribution from the neighbor sum: without it, a node's self-features and neighbor features could produce the same aggregated input, losing information
To normalize the aggregation by the node's degree

Chapter 8: What GNNs Cannot Do

GIN achieves the maximum expressiveness of message-passing GNNs — but that maximum is not unlimited. The WL test has known failure cases, and GIN inherits every one of them. Understanding these limits is essential for knowing when a GNN will fail you in practice.

The Classic WL Failure: Regular Graphs

A k-regular graph is one where every node has exactly k neighbors. Consider two different k-regular graphs on the same number of nodes, with identical initial node features: say, the two non-isomorphic 3-regular (cubic) graphs with 6 nodes, the triangular prism and the complete bipartite graph K_{3,3}. After any number of WL iterations, every node in both graphs has the same label: "3-regular node surrounded by 3-regular nodes." The label multisets are identical. WL, and therefore any message-passing GNN, cannot distinguish them.

This is not a flaw that more layers, wider MLPs, or better training can fix. It's a structural impossibility. The 1-WL test has blind spots, and GIN has exactly the same blind spots.
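The blind spot can be reproduced with a few lines of color refinement. This is a bare-bones sketch of 1-WL that uses nested tuples as colors instead of a hash; the two cubic graphs are the triangular prism and K_{3,3}, and the 6-node path is included only as a sanity check that the procedure can tell some graphs apart.

```python
from collections import Counter

def wl_histogram(adj, rounds=3):
    """Run 1-WL color refinement and return the histogram of final colors.

    A color is a nested tuple: (own previous color, sorted neighbor colors).
    All nodes start with the same color, i.e. no node features.
    """
    colors = {v: () for v in adj}
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())

# K_{3,3}: parts {0, 1, 2} and {3, 4, 5}, every cross pair connected.
k33 = {0: [3, 4, 5], 1: [3, 4, 5], 2: [3, 4, 5],
       3: [0, 1, 2], 4: [0, 1, 2], 5: [0, 1, 2]}

# Triangular prism: triangles 0-1-2 and 3-4-5 joined by edges 0-3, 1-4, 2-5.
prism = {0: [1, 2, 3], 1: [0, 2, 4], 2: [0, 1, 5],
         3: [4, 5, 0], 4: [3, 5, 1], 5: [3, 4, 2]}

# Path on 6 nodes, not regular.
path6 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

print("K33 vs prism look identical:", wl_histogram(k33) == wl_histogram(prism))
print("K33 vs path look identical: ", wl_histogram(k33) == wl_histogram(path6))
```

The prism contains triangles and K_{3,3} contains none, yet their color histograms match after every round.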

Why it matters in practice: Many real-world graph tasks require distinguishing regular substructures. Counting triangles, detecting cycles of specific length, recognizing subgraph patterns — these all require expressiveness beyond 1-WL. GNN-based molecular property prediction, for example, can fail to distinguish certain non-isomorphic molecules that WL also cannot distinguish.

The Expressiveness Hierarchy

Researchers have developed a hierarchy of more expressive (and more expensive) graph algorithms:

Method	Expressiveness	Cost	Notes
GCN (mean)	< 1-WL	O(E)	Loses count info
GraphSAGE (max)	< 1-WL	O(E)	Loses multiplicity
GIN (sum)	= 1-WL	O(E)	Maximum for 1-hop message passing
k-WL / k-GNN	> 1-WL	O(n^k)	Exponential in k
Random features	~WL + randomness	O(E)	Probabilistic ID

Higher-order GNNs (k-WL) can distinguish structures that 1-WL cannot. k=2 GNNs operate on pairs of nodes. k=3 on triples. Each step up in k multiplies the computational cost by n, making them impractical for large graphs.

Practical Workarounds

When you need expressiveness beyond 1-WL but cannot afford k-WL, several practical approaches exist:

Random node features: Give each node a random ID, resampled on every forward pass during both training and inference. With high probability, the random IDs break symmetries that WL cannot. Outputs are no longer deterministic across runs, but the approach works empirically.
Structural features: Precompute features like cycle membership, eigenvectors, or subgraph counts. Feed these as node features. Offloads expressiveness to preprocessing.
Run the WL test and augment: Run WL first, then use its labels as additional input features. This cannot exceed 1-WL by itself, but it hands the GNN 1-WL-level structural information up front, which helps shallow or non-injective models.
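As an illustration of the structural-features route, per-node triangle counts are enough to separate two 3-regular graphs on 6 nodes that 1-WL cannot tell apart. The helper `triangle_counts` and the choice of this feature are assumptions for the sketch, one option among many.

```python
def triangle_counts(adj):
    """For each node, count the triangles it belongs to."""
    counts = {}
    for v, nbrs in adj.items():
        nbr_list = list(nbrs)
        # A triangle through v is a pair of v's neighbors that are adjacent.
        counts[v] = sum(
            1
            for i, a in enumerate(nbr_list)
            for b in nbr_list[i + 1:]
            if b in adj[a]
        )
    return counts

# K_{3,3}: bipartite, so it contains no triangles.
k33 = {0: [3, 4, 5], 1: [3, 4, 5], 2: [3, 4, 5],
       3: [0, 1, 2], 4: [0, 1, 2], 5: [0, 1, 2]}

# Triangular prism: two triangles joined by three edges.
prism = {0: [1, 2, 3], 1: [0, 2, 4], 2: [0, 1, 5],
         3: [4, 5, 0], 4: [3, 5, 1], 5: [3, 4, 2]}

features_k33 = triangle_counts(k33)
features_prism = triangle_counts(prism)
print("K33:  ", features_k33)
print("prism:", features_prism)
```

Fed in as node features, these counts give the nodes of the two graphs different starting points, so message passing no longer begins from identical inputs.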

Why can't adding more GIN layers or a wider MLP overcome the WL expressiveness limit?

Because GIN's MLP is not deep enough to capture complex patterns
Because more layers cause over-smoothing which erases the expressiveness gains
Because the limit is structural: certain non-isomorphic graphs produce identical computational graphs at every depth, so no amount of computation can distinguish them

Chapter 9: Connections & What's Next

This lecture gave a complete theoretical picture of message-passing GNN expressiveness. Let's place it in context.

The Expressiveness Hierarchy in Full

GNN	Aggregation	Expressiveness	Key Paper
GCN	Mean	Strictly < WL	Kipf & Welling 2017
GraphSAGE	Max-pool	Strictly < WL	Hamilton et al. 2017
GAT	Attention-weighted mean	Strictly < WL	Velickovic et al. 2018
GIN	Sum + MLP	= WL (maximum)	Xu et al. 2019

Note that GAT, despite its attention mechanism, still uses a weighted mean — so it's strictly less expressive than WL, just like GCN. Attention helps with training stability and task performance, but not with theoretical expressiveness.

Where to Go Next

Lec 4: GCN, GraphSAGE, GAT architectures (the models whose limits we analyzed in this lecture)
↓
Lec 6 (this): Theory (WL expressiveness, GIN achieving the maximum, limits of 1-WL)
↓
Lec 7: Designing more powerful graph encoders (beyond 1-WL, positional encodings, higher-order GNNs)
↓
Lec 8+: Knowledge graphs, scalable GNNs, applications to biology and chemistry

Related Micro-Lessons

Lec 3 — GNN Basics — The node classification task and the GNN framework from scratch.
Lec 4 — GCN, GraphSAGE, GAT — The three main architectures whose expressiveness we analyzed here.
Lec 5 — GNN Augmentation & Training — Feature engineering and training pipelines for GNNs.

The GIN Paper

Read the full analysis in How Powerful are Graph Neural Networks? (Xu et al., ICLR 2019). The paper contains the formal proofs, experimental results on bioinformatics benchmarks, and the exact conditions under which sum aggregation is injective.

The one-line summary: Mean and max aggregations are provably limited. Sum aggregation + MLP = Graph Isomorphism Network = as powerful as WL = the theoretical maximum for message-passing GNNs. Anything beyond this requires fundamentally different architectures (k-WL, structural encodings, or randomness).

A Closing Thought

The WL test was invented in 1968 to solve a combinatorics problem. It took 50 years for the deep learning community to realize it was secretly describing the limits of neural message passing. This is one of the most satisfying theoretical results in geometric deep learning: a simple classical algorithm, a new class of neural networks, and a precise mathematical equivalence between them.

"What I cannot create, I do not understand." — Richard Feynman

You now understand GNNs well enough to know exactly where they fail, and why. That understanding is the foundation for building better ones.
