How expressive are GNNs? This is not a soft question — it has a precise mathematical answer. GNNs are exactly as powerful as the Weisfeiler-Leman graph isomorphism test, and GIN achieves that maximum.
You have a social network. Alice follows Bob, Charlie, and Dana. Bob follows Alice and Charlie. They have the same number of friends — but their local neighborhoods are different. A good GNN should give Alice and Bob different embeddings, because their structural role in the graph differs.
But will it? This depends entirely on the aggregation function. If two nodes happen to have neighbor features that average to the same value — even though the neighbor multisets are different — the GNN cannot tell them apart. It will assign them identical embeddings, regardless of how many layers you stack or how wide you make the network.
This is not a bug you can tune away. It's a fundamental mathematical limit, tied to what it means for a function to be injective — to map distinct inputs to distinct outputs. If the aggregation function is not injective over neighbor multisets, the GNN loses information on every single layer, and there's no recovery.
A GNN is more expressive than another if it can distinguish more pairs of nodes (or graphs) that are structurally different. Maximum expressiveness means: any two nodes with genuinely different K-hop neighborhoods get different embeddings after K layers. No information is lost.
Expressiveness has a cost: less expressive models are often more regularized and generalize better on small datasets. But understanding the theoretical ceiling is essential — it tells you what's impossible, and where to look when your model fails.
Node A and Node B have different neighbor structures. An expressive GNN should give them different embeddings. Click "Aggregate" to see what mean vs. sum aggregation produces. Watch what information gets lost.
A GNN doesn't see the whole graph at once. With K layers, each node sees exactly its K-hop neighborhood — the subgraph reachable within K steps. This neighborhood, unfolded into a tree structure rooted at the node, is called its computational graph.
Here's how it works. Before the first layer (K=0), a node only knows its own features. After one layer (K=1), it has aggregated from its immediate neighbors. After two layers (K=2), it has aggregated from its neighbors' neighbors. The "view" expands with each layer, but it's always a tree — because we unfold the graph by tracing paths outward, and the same node may appear multiple times in the unfolding.
For a K-layer GNN to be maximally expressive:
The second condition is the hard one. It's asking: can the aggregation function distinguish any two distinct multisets it might see? If yes at every layer, the GNN is as expressive as the Weisfeiler-Leman test. If no at any layer, some structural information is permanently discarded.
A 4-node graph. Select a node to see its 2-hop rooted subtree. Notice how the same neighbor can appear multiple times in the unfolding — the computational graph is a tree, not the original graph.
Here's a concept that makes all the difference: the difference between a set and a multiset. A set is a collection where each element appears at most once: {red, blue, green}. A multiset allows repeats: {red, red, blue}. {red, blue} and {red, red, blue} are the same set of colors, but different multisets.
When a GNN aggregates neighbor features, it's computing a function over a multiset. Node A might have neighbors with features {1, 1, 2} — two neighbors with value 1, one with value 2. Node B might have neighbors with features {1, 2} — one of each. Same set of distinct values, different multisets. An expressive aggregation must treat them differently.
Consider these three aggregation functions on a multiset of scalar values:
Mean and max both discard information. Sum preserves everything — and this turns out to be the key property for maximal GNN expressiveness.
Two multisets of neighbor features. All three aggregations are computed. Green = aggregations differ (can distinguish). Red = aggregations are equal (cannot distinguish). Drag the sliders to construct failure cases for mean and max.
GCN uses mean aggregation: hv(k) = MLP(mean({hu(k-1) : u ∈ N(v)}). The mean is the average — it divides the sum by the number of neighbors. This normalization is exactly the problem.
Mean aggregation cannot distinguish two different multisets that have the same average. Consider: multiset {1, 1, 1} has mean 1. Multiset {1} also has mean 1. To GCN, a node surrounded by three identical neighbors looks exactly the same as a node with one such neighbor. Three blue nodes or one blue node — the GCN can't tell.
GCN CAN distinguish nodes whose neighbor feature distributions differ. If A's neighbors are 80% red and B's neighbors are 60% red, GCN gets different means and can tell them apart.
GCN CANNOT distinguish nodes whose neighbor feature distributions are identical but counts differ. {red×2, blue×2} and {red×1, blue×1} have the same mean — 50% red, 50% blue — regardless of how many neighbors there are. This is a systematic blind spot, not fixable by increasing depth or width.
Two nodes with different neighborhoods. Both have the same proportion of red/blue neighbors, but different counts. GCN computes identical means and assigns identical embeddings. Adjust the node counts to find cases where GCN succeeds vs. fails.
GraphSAGE uses max-pool aggregation: for each neighbor, apply an MLP to get a transformed feature vector, then take the element-wise maximum across all neighbors. This is more sophisticated than mean — it can capture the "most extreme" feature value in the neighborhood.
But max-pool has its own blind spot: it completely ignores multiplicity. max({red, red, blue}) = max({red, blue}). Once you've seen that red is present, seeing more reds adds nothing to the maximum. The max is determined by which distinct values are present, not how many of each.
Consider two nodes: Node A has neighbors {red, red, red} — three identical red neighbors. Node B has neighbors {red} — just one red neighbor. Max-pool on A: max(red, red, red) = red. Max-pool on B: max(red) = red. Identical outputs. GraphSAGE cannot distinguish A from B.
Contrast with GCN's failure. GCN fails when proportions are identical. GraphSAGE fails when the set of distinct values is identical. These are different failure modes — and both can occur in real graphs. Neither model is a superset of the other in expressiveness.
Two nodes with different neighborhoods. Max-pool aggregation is shown. When both nodes have the same set of distinct neighbor features (regardless of counts), max-pool assigns identical embeddings. Toggle the neighbor configurations to find failure cases.
We now have two concrete failure modes:
Neither is strictly better than the other. Both are strictly worse than some ideal injective aggregation. What would that look like?
The Weisfeiler-Leman (WL) graph isomorphism test was invented in 1968, long before GNNs existed. It's a classical algorithm for deciding whether two graphs might be isomorphic — that is, structurally identical up to node relabeling. The key word is "might": the WL test can definitively prove two graphs are NOT isomorphic, but it cannot guarantee they ARE.
Start by giving every node a label (initially, just its feature or a constant). Then repeat:
After K iterations, each node has a label that summarizes its K-hop rooted subtree. Compare the multisets of labels across the two graphs. If the multisets ever differ, the graphs are NOT isomorphic. If they never differ after convergence, the test is inconclusive — the graphs might be isomorphic or they might not be.
Two graphs (left and right). Each node starts with a color label. Step through the WL iterations. Watch how labels refine to capture local structure. The algorithm terminates when labels stop changing or when the label multisets diverge.
The WL test cannot distinguish certain pairs of non-isomorphic graphs. A classic example: two regular graphs of the same degree where every node looks locally identical. After any number of iterations, all nodes have the same label in both graphs — but the graphs are genuinely different (perhaps they have different global structures). This is a hard limit of the 1-WL test, and it's inherited by all message-passing GNNs.
We now have a target: build a GNN that is exactly as powerful as the WL test. Xu et al. (ICLR 2019) proved that this is achievable, and designed the Graph Isomorphism Network (GIN) to achieve it. The key theorem is elegant.
Recall: mean fails because it normalizes away counts. Max fails because it ignores multiplicity. What about sum?
Sum does not normalize. Sum({1, 1, 2}) = 4. Sum({1, 2}) = 3. Different. Sum({1, 1, 1}) = 3. Sum({1}) = 1. Different. In general, if two multisets have different counts of any element, their sums will differ — provided the values themselves differ. The formal statement is: there exists a function f such that SUM of f(x) over a multiset is injective.
This is not trivial — it requires f to map feature values to numbers in a way that makes the sums unique. For discrete features, such f always exists. For continuous features, an MLP can approximate it. This is the connection between sum aggregation and injectivity.
Any injective function over multisets can be written in the form:
Where f maps individual elements and Φ maps the resulting sum. Both Φ and f can be approximated to arbitrary precision by MLPs. This is why GIN uses MLPs for both the feature transformation and the final output — it's not just a design choice, it's a theoretical necessity for achieving maximum expressiveness.
Compare mean, max, and sum on four canonical multiset pairs. Green = can distinguish (injective for this pair). Red = cannot distinguish (failure). Sum always gets it right.
GIN's update rule has one subtle detail beyond "use sum + MLP." The node needs to aggregate information from both its own current representation AND its neighbors' representations. Naively, you might just sum them all together — but then the node's own features get mixed in with the neighbor features in a way that loses the distinction between "self" and "neighbor."
The full GIN update:
The (1 + ε) factor is the key. It ensures the node's own embedding is weighted separately from the neighbor sum. Without it, a node with feature vector h and no neighbors would look identical to a node with no self-features but neighbors that sum to h — they'd both produce the same input to the MLP. The (1+ε) breaks this ambiguity.
Let's trace a concrete example. Node v has feature hv = [0.5, 0.3] and two neighbors with features [1.0, 0.2] and [0.7, 0.8]. With ε=0:
The MLP is a multi-layer perceptron (at least 2 layers — shallow MLPs are not universal approximators over multisets). GIN uses batch normalization between MLP layers for training stability.
For graph classification (not just node classification), GIN uses a concatenation readout across all layers:
This preserves embeddings from all depths. Shallow layers capture local structure (triangles, paths of length 2). Deep layers capture global structure. Concatenating all layers is more expressive than just using the final layer — it uses structural information at every scale.
GIN achieves the maximum expressiveness of message-passing GNNs — but that maximum is not unlimited. The WL test has known failure cases, and GIN inherits every one of them. Understanding these limits is essential for knowing when a GNN will fail you in practice.
A k-regular graph is one where every node has exactly k neighbors. Consider two different k-regular graphs on the same number of nodes — say, two different 3-regular (cubic) graphs with 6 nodes. After any number of WL iterations, every node in both graphs has the same label: "3-regular node surrounded by 3-regular nodes." The label multisets are identical. WL — and therefore any message-passing GNN — cannot distinguish them.
This is not a flaw that more layers, wider MLPs, or better training can fix. It's a structural impossibility. The 1-WL test has blind spots, and GIN has exactly the same blind spots.
Researchers have developed a hierarchy of more expressive (and more expensive) graph algorithms:
| Method | Expressiveness | Cost | Example |
|---|---|---|---|
| GCN (mean) | < 1-WL | O(E) | Loses count info |
| GraphSAGE (max) | < 1-WL | O(E) | Loses multiplicity |
| GIN (sum) | = 1-WL | O(E) | Maximum for 1-hop MSG |
| k-WL / k-GNN | > 1-WL | O(nk) | Exponential in k |
| Random features | ~WL + randomness | O(E) | Probabilistic ID |
Higher-order GNNs (k-WL) can distinguish structures that 1-WL cannot. k=2 GNNs operate on pairs of nodes. k=3 on triples. Each step up in k multiplies the computational cost by n, making them impractical for large graphs.
When you need expressiveness beyond 1-WL but cannot afford k-WL, several practical approaches exist:
This lecture gave a complete theoretical picture of message-passing GNN expressiveness. Let's place it in context.
| GNN | Aggregation | Expressiveness | Key Paper |
|---|---|---|---|
| GCN | Mean | Strictly < WL | Kipf & Welling 2017 |
| GraphSAGE | Max-pool | Strictly < WL | Hamilton et al. 2017 |
| GAT | Attention-weighted mean | Strictly < WL | Velickovic et al. 2018 |
| GIN | Sum + MLP | = WL (maximum) | Xu et al. 2019 |
Note that GAT, despite its attention mechanism, still uses a weighted mean — so it's strictly less expressive than WL, just like GCN. Attention helps with training stability and task performance, but not with theoretical expressiveness.
Lec 3 — GNN Basics — The node classification task and the GNN framework from scratch.
Lec 4 — GCN, GraphSAGE, GAT — The three main architectures whose expressiveness we analyzed here.
Lec 5 — GNN Augmentation & Training — Feature engineering and training pipelines for GNNs.
Read the full analysis in Veanors: How Powerful are Graph Neural Networks? (Xu et al., ICLR 2019). The paper contains the formal proofs, experimental results on bioinformatics benchmarks, and the exact conditions under which sum aggregation is injective.
The WL test was invented in 1968 to solve a combinatorics problem. It took 50 years for the deep learning community to realize it was secretly describing the limits of neural message passing. This is one of the most satisfying theoretical results in geometric deep learning: a simple classical algorithm, a new class of neural networks, and a precise mathematical equivalence between them.
"What I cannot create, I do not understand." — Richard Feynman
You now understand GNNs well enough to know exactly where they fail, and why. That understanding is the foundation for building better ones.