The paper that gave GNNs a ceiling. It proved that message-passing GNNs are at most as powerful as the Weisfeiler-Leman isomorphism test, and designed GIN — a sum-aggregation + MLP network that achieves this maximum.
By 2018, several GNN architectures existed: GCN (Kipf & Welling 2017), GraphSAGE (Hamilton et al. 2017), GAT (Velickovic et al. 2018). Each had empirical results on benchmarks. But a foundational question had never been asked rigorously: can these architectures even, in principle, solve the task we're asking them to solve?
Consider graph classification: given two graphs, are they isomorphic (structurally identical up to node relabeling)? A maximally expressive GNN would give different embeddings to any two non-isomorphic graphs. A weak GNN would confuse some pairs — even with perfect training data and unlimited computation. The question Xu et al. asked: which existing GNNs are maximally expressive, and which are provably limited?
This is not purely theoretical. Molecular property prediction requires distinguishing molecules that have the same atom types but different bond structures. A molecule with C-C-C (propane-like chain) has different properties from one with a C-C ring — but if your GNN can't tell them apart structurally, no amount of training data will teach it the difference. The model literally cannot represent the distinction.
Social network analysis requires recognizing structural roles: hubs, bridges, triangle members. If node A is part of a triangle and node B is not, but both have three neighbors, they have different structural roles. A GNN that confuses them cannot learn tasks that depend on local topology.
The paper also explains empirical puzzles. Why does GCN catastrophically fail on the Reddit-Binary dataset (random chance accuracy) while GraphSAGE achieves 84%? Not because GCN is undertrained — because the task requires count-sensitive aggregation that GCN's mean operation cannot provide.
Before this paper, GNN expressiveness had been studied informally. Scarselli et al. (2009), the original GNN paper, noted that GNNs could approximate certain graph functions but gave no precise characterization. Gilmer et al.'s MPNN (2017) unified many architectures in one framework but didn't analyze expressiveness. The WL test's connection to GNNs was occasionally mentioned in passing but never made precise.
What was missing: a rigorous framework that (a) identified the exact property needed for maximum expressiveness (injectivity over multisets), (b) proved which aggregations have this property and which don't, (c) connected GNN expressiveness to a well-studied classical algorithm (WL test), and (d) constructed a GNN that provably achieves the maximum. This paper provides all four.
Lemma 5 (the sum characterization theorem) assumes countable feature spaces. This covers all practical cases: atom types in molecules (finite set), node labels (finite set), discretized continuous features (finite set). For fully continuous features, the theorem still holds in a limiting sense — any probability distribution over continuous features can be approximated by a discrete distribution, and the sum over the continuous f can be shown to be injective almost surely (with probability 1 over random inputs). The paper discusses this extension in the appendix.
All GNNs considered in this paper follow the same general framework. At each layer k:
hv(0) = xv (input features). After K layers, hv(K) is node v's final embedding, summarizing its K-hop neighborhood. The paper's central question is: for which choices of AGGREGATE and COMBINE is this framework maximally expressive?
The paper uses graph isomorphism as the hardest test of expressiveness. Two graphs G1 and G2 are isomorphic if there exists a bijection φ: V(G1) → V(G2) that preserves edge structure: (u,v) ∈ E(G1) if and only if (φ(u),φ(v)) ∈ E(G2). Testing isomorphism is a key unsolved problem in theoretical computer science (no known polynomial algorithm, not known to be NP-complete).
For GNNs, the question is simplified: a maximally expressive GNN should give different graph-level embeddings to any two non-isomorphic graphs — at least for the pairs that some algorithm can distinguish. The WL test is such an algorithm (though incomplete). This doesn't require solving isomorphism — it requires representing structure faithfully enough that distinct structures produce distinct codes. The WL test provides a practical (but incomplete) solution.
python import torch from torch_geometric.data import DataLoader from torch_geometric.datasets import TUDataset # Load MUTAG dataset (188 molecular graphs, 2 classes) dataset = TUDataset(root='/tmp/mutag', name='MUTAG') print(f"Dataset: {len(dataset)} graphs, {dataset.num_classes} classes") print(f"Node features: {dataset.num_node_features}") # 7 atom types # 10-fold cross-validation fold_size = len(dataset) // 10 for fold in range(10): test_dataset = dataset[fold * fold_size : (fold+1) * fold_size] train_dataset = dataset[:fold*fold_size] + dataset[(fold+1)*fold_size:] train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True) test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False) model = GINClassifier( in_dim=dataset.num_node_features, hidden=64, num_classes=dataset.num_classes, num_layers=5 ) optimizer = torch.optim.Adam(model.parameters(), lr=0.01) scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5) best_test_acc = 0.0 for epoch in range(350): # Train model.train() for batch in train_loader: optimizer.zero_grad() logits = model(batch.x, batch.edge_index, batch.batch) loss = torch.nn.functional.cross_entropy(logits, batch.y) loss.backward() optimizer.step() scheduler.step() # Evaluate at best validation acc (paper uses this, not final epoch) model.eval() with torch.no_grad(): correct = sum(model(b.x, b.edge_index, b.batch).argmax(dim=-1).eq(b.y).sum().item() for b in test_loader) acc = correct / len(test_dataset) best_test_acc = max(best_test_acc, acc) print(f"Fold {fold}: best test acc = {best_test_acc:.3f}")
The key insight is that neighbor aggregation in a GNN is a function over a multiset — a set where elements can appear more than once. When node v aggregates from its neighbors, it receives the multiset {hu(k-1) : u ∈ N(v)}, where the same vector may appear multiple times if two neighbors have identical features.
The distinction between a set and a multiset is crucial. A set only records which elements are present. A multiset records how many of each. {red, red, blue} ≠ {red, blue} as multisets, even though they have the same underlying set {red, blue}. A GNN aggregation must be sensitive to this difference — the multiplicity of neighbor features carries structural information about the graph.
GCN uses mean aggregation: AGGREGATE({hu}) = (1/|N(v)|) Σ hu. Mean is NOT injective over multisets. A simple counterexample:
These are genuinely different multisets — S1 has 4 elements, S2 has 2. But mean maps both to 1.5. A node surrounded by {1,1,2,2} neighbors and a node surrounded by {1,2} neighbors will receive identical aggregated signals, and produce identical updated embeddings — and this failure propagates through every subsequent layer.
GraphSAGE uses max-pool aggregation: AGGREGATE({hu}) = max({MLP(hu) : u ∈ N(v)}). Max is also NOT injective. Counterexample:
Max-pool captures which feature values are present (the maximum), but is completely insensitive to how many times each value appears. It can only answer "is this value present?" — not "how many of this value are there?"
Sum aggregation: AGGREGATE({hu}) = Σ hu. For scalar features, sum({1,1,2,2}) = 6 ≠ sum({1,2}) = 3. Different. Sum({1,1,1}) = 3 ≠ sum({1}) = 1. Different. In general, sum preserves both the set of values present and their counts — it encodes the full multiset information.
For vector features, this requires an additional condition: there must exist a function f such that Σ f(x) is injective. The paper shows this exists for countable feature spaces, and MLPs can approximate it. This is the foundation of GIN.
| Aggregation | Injective? | Loses | Can't distinguish |
|---|---|---|---|
| Mean | No | Count information | {1,1,2,2} vs {1,2} |
| Max | No | Multiplicity | {R,R,R} vs {R} |
| Sum + f | Yes | Nothing | Distinguishes all pairs |
For discrete feature spaces X = {1, 2, ..., N}, the proof is constructive. Let f(x) = ex, the standard basis vector in ℝN (the one-hot encoding of x). Then Σx∈S f(x) produces a count vector: the i-th entry is the number of times value i appears in S. Two distinct multisets differ in at least one count — so this vector is distinct for any two distinct multisets. Thus, sum ∘ one-hot is injective over all multisets.
For continuous feature spaces, the argument is more subtle. The paper uses the universal approximation theorem: any continuous injective function can be approximated by an MLP. Since an injective f exists (the one-hot argument works for discrete approximations), and MLPs can approximate it, GIN with a sufficiently large MLP can realize this injection approximately.
Practically, the MLP doesn't need to compute an exact one-hot encoding — it just needs to learn a mapping f where the sums are distinguishable for all multisets that appear in the training data. This is a much easier condition, which is why GIN works well even with small MLPs.
It might seem obvious that "sum captures more information than mean." But the formal statement is stronger: there EXISTS a function f such that the sum over f is injective over ALL multisets, not just the ones you happen to see. This universality is what the theorem guarantees, and it's what connects to WL expressiveness — WL is also a universal algorithm (perfect injective hash), not just one that works for most inputs.
Contrast with mean: you could try to find a function f such that mean over f is injective. For the pair {1,2} and {1,1,2,2}: mean(f({1,2})) = (f(1)+f(2))/2 and mean(f({1,1,2,2})) = (2f(1)+2f(2))/4 = (f(1)+f(2))/2. These are always equal regardless of f. No choice of f can make mean injective for this pair — the failure is fundamental to mean's normalization structure.
Sum doesn't have this pathology because it doesn't normalize. The sum grows with the number of elements. This is exactly the information that mean throws away (by dividing by count) and that sum preserves (by not dividing).
The 1-dimensional Weisfeiler-Leman (1-WL) test is a classical algorithm from 1968, originally designed to heuristically check graph isomorphism. The paper uses it as the reference standard for GNN expressiveness. Understanding it precisely is essential — the main theorem is a mathematical equivalence between WL and GIN, not just an analogy.
The HASH function is any injective mapping from the tuple (own label, sorted neighbor labels) to a new label. Because it's injective, if two nodes get the same new label, they must have had the same own label AND the same multiset of neighbor labels. This makes the label a perfect fingerprint for the rooted subtree.
Consider a triangle (3-cycle) and a path of length 3 (3 nodes in a line), both with all-same initial features (label = 0).
Iteration 0: All nodes in both graphs have label 0. Multisets identical: {0, 0, 0}.
Iteration 1: Triangle: each node has 2 neighbors with label 0. New label = HASH(0, {0,0}) = LA. Path: end nodes have 1 neighbor (label 0), middle node has 2 neighbors (both label 0). End nodes get HASH(0, {0}) = LB. Middle gets HASH(0, {0,0}) = LA.
Triangle multiset: {LA, LA, LA}. Path multiset: {LB, LA, LB}. Different! WL correctly identifies these graphs as non-isomorphic in 1 iteration.
Consider a star graph: one center node connected to 4 leaf nodes. All nodes start with label 0 (no features). After WL iteration 1:
Label multiset: {LC, LL, LL, LL, LL}. After WL iteration 2:
The labels have stabilized in form (though their values change). Compare this star graph to a path of 5 nodes (P5). After iteration 1, the path's center has 2 neighbors not 4 — it gets a different hash. WL immediately distinguishes star from path. A GNN with the right aggregation also distinguishes them from the first layer.
Look at the WL iteration: c(k+1)v = HASH(c(k)v, {{c(k)u : u ∈ N(v)}}). Compare to a GNN layer: h(k)v = UPDATE(h(k-1)v, AGGREGATE({h(k-1)u : u ∈ N(v)})). The structure is identical. The GNN generalizes WL by using learned functions (MLP) instead of perfect hash labels, and continuous features instead of discrete labels. This isn't an analogy — it's the same computation with different implementations.
The key difference: WL uses a perfect injective HASH — it never loses information. A GNN using mean or max aggregation uses an AGGREGATE that is NOT injective — it loses information. GIN using sum + MLP approximates injectivity — it retains as much information as WL does. This is the precise sense in which GIN is "as powerful as" WL.
A useful way to think about WL: after convergence, each graph has a "histogram" of node labels — how many nodes have each label. This histogram is what's compared to decide isomorphism. Two graphs with the same histogram might still be non-isomorphic (WL isn't complete), but graphs with different histograms are definitely non-isomorphic.
For the WL graph kernel (Shervashidze et al. 2011), the concatenation of these histograms across all depths d = 0, 1, ..., K is used as the graph feature vector for classification. The kernel between two graphs is the dot product of their feature vectors. GIN can be seen as learning a continuous relaxation of this feature vector — instead of counting discrete label occurrences, it sums continuous node embeddings that are trained to be injective.
python from collections import defaultdict import hashlib def wl_test(G1_adj, G1_labels, G2_adj, G2_labels, K=5): """ Run WL isomorphism test. G_adj: dict {node: [neighbors]} G_labels: dict {node: initial_label} Returns: True if graphs are distinguishable, False if inconclusive. """ labels1 = dict(G1_labels) labels2 = dict(G2_labels) def wl_hash(own_label, neighbor_labels): # Perfect injective hash: (own, sorted_neighbors) → new_label key = str((own_label, sorted(neighbor_labels))) return hashlib.md5(key.encode()).hexdigest()[:8] # truncate for readability def update_labels(adj, labels): new_labels = {} for v, neighbors in adj.items(): neighbor_labels = [labels[u] for u in neighbors] new_labels[v] = wl_hash(labels[v], neighbor_labels) return new_labels def label_multiset(labels): counts = defaultdict(int) for l in labels.values(): counts[l] += 1 return frozenset(counts.items()) for k in range(K): ms1 = label_multiset(labels1) ms2 = label_multiset(labels2) if ms1 != ms2: return True, k # Distinguishable at iteration k labels1 = update_labels(G1_adj, labels1) labels2 = update_labels(G2_adj, labels2) return False, K # Inconclusive # Example: triangle vs. path-of-3 G1_adj = {0:[1,2], 1:[0,2], 2:[0,1]} # triangle K3 G1_labels = {0:'a', 1:'a', 2:'a'} G2_adj = {0:[1], 1:[0,2], 2:[1]} # path P3 G2_labels = {0:'a', 1:'a', 2:'a'} distinguishable, at_iter = wl_test(G1_adj, G1_labels, G2_adj, G2_labels) print(f"Distinguishable: {distinguishable}, at iteration {at_iter}") # Output: Distinguishable: True, at iteration 1 # (Triangle: all nodes degree 2. Path: end nodes degree 1, middle degree 2. # Different degree distributions → WL distinguishes immediately.)
The paper's first major theorem establishes that no message-passing GNN can be more expressive than the WL test. This is a universal upper bound — it applies to any possible AGGREGATE and COMBINE functions, including ones not yet invented. If the WL test cannot distinguish two graphs, no GNN can.
Base case (k=0): All node features are identical across G1 and G2 (otherwise WL trivially distinguishes them at iteration 0). So the initial embeddings h(0)v are the same in both graphs — same distribution, same values.
Inductive step: Assume at layer k, for any two nodes v1 ∈ G1 and v2 ∈ G2 that WL would assign the same k-iteration label, the GNN assigns the same k-layer embedding. Now consider layer k+1.
If WL assigns the same (k+1)-iteration label to v1 and v2, it means: (a) they had the same k-iteration label, AND (b) they had the same multiset of k-iteration neighbor labels. By the inductive hypothesis, (a) means they have the same k-layer GNN embedding, and (b) means their multisets of k-layer GNN neighbor embeddings are identically distributed. Therefore, AGGREGATE produces the same output, COMBINE produces the same output, and the (k+1)-layer embeddings are identical. QED.
The key move: "same WL label multiset" implies "same GNN embedding multiset" at every layer. The GNN's computation is always a coarsening of WL's computation — it can never make finer distinctions than WL makes.
Theorem 2 says GNNs are bounded by WL. A separate result (Propositions 1 and 2) shows that GCN and GraphSAGE are strictly below WL — there exist pairs of graphs that WL distinguishes but GCN/GraphSAGE do not. The explicit constructions:
| Architecture | Failure type | Counterexample |
|---|---|---|
| GCN (mean) | Cannot distinguish multisets with same proportions | Node with neighbors {1,1,2,2} vs {1,2}: both get mean 1.5 |
| GraphSAGE (max) | Cannot distinguish multisets with same distinct elements | Node with neighbors {1,1,1} vs {1}: both get max 1 |
| GAT (attention mean) | Weighted mean is still a mean | Same failure as GCN, weighted but not injective |
These failures are constructive — the paper gives specific graphs where the GNNs provably fail, not just abstract existence arguments. The counterexamples are small enough to trace by hand.
Proposition 1 of the paper states: "There exist graphs G1 and G2 that the 1-WL test can distinguish, but GCNs with mean aggregation cannot." The construction is explicit. Let G1 consist of a node with 4 neighbors (2 with feature 1, 2 with feature 2) and G2 consist of a node with 2 neighbors (1 with feature 1, 1 with feature 2). One WL iteration: G1 node gets label HASH(0, {1,1,2,2}), G2 node gets HASH(0, {1,2}). Different hashes → WL distinguishes. GCN: mean({1,1,2,2}) = 1.5 = mean({1,2}) → identical embedding → GCN cannot distinguish.
Proposition 2 gives the max-pool failure: "There exist graphs that 1-WL can distinguish, but GraphSAGE with max-pool cannot." Construction: Let G1 have a node with neighbors {1,2,3} and G2 have a node with neighbors {1,2,3,3,3}. WL distinguishes (different neighbor multisets). Max-pool: max({1,2,3}) = 3 = max({1,2,3,3,3}) → GraphSAGE assigns identical embeddings.
Note: GraphSAGE uses an MLP per-neighbor before max-pooling, which can change the absolute values. But the max operation applied after the per-neighbor MLP still suffers the same multiplicity blindness — max({f(3),f(3),f(3)}) = max({f(3)}) for any function f.
Let's make this concrete with a simple GCN forward pass. Consider 1-dimensional node features and weight matrix W = 1 (identity). GCN update: hvnew = ReLU(W · mean({hu})) = ReLU(mean({hu})).
Identical. At every subsequent layer, A and B get identical inputs, produce identical outputs, and propagate the same information to their neighbors. The failure is irreversible — once information is lost, it's gone from all downstream computation.
Now the same for GraphSAGE max-pool. Assume element-wise MLP = identity (for clarity). Max-pool on A's neighbors {1,1,1}: max = [1]. Max-pool on B's neighbors {1}: max = [1]. Identical. The MLP applied before max-pool doesn't help — max is still max, regardless of what transformation precedes it.
Compare to GIN sum: sum({1,1,2,2}) = 6. sum({1,2}) = 3. Different. sum({1,1,1}) = 3. sum({1}) = 1. Different. GIN correctly distinguishes all four cases that GCN and GraphSAGE fail on.
Theorem 2 proves the ceiling. Theorem 3 proves that GIN reaches it. The key is Lemma 5 — a characterization theorem for all injective functions over multisets — which tells us exactly what form a maximum-expressiveness aggregation must take.
In plain English: the sum of per-element features is a universal injective multiset function, given the right f. And ANY injective multiset function can be expressed in this sum-of-f form. So if we can learn f and Φ — which MLPs can approximate — we can realize any injective multiset function.
This is why GIN uses MLP(Φ) ∘ SUM ∘ MLP(f) — it's not a heuristic design, it's the exact form that the theorem requires. The MLP for f maps each neighbor feature to a space where summation is injective. The outer MLP (Φ) maps the sum to the output embedding.
The (1+ε) term distinguishes the node's own contribution from the neighbor sum. The sum aggregates neighbor features. The outer MLP serves as Φ — the injective transformation on the aggregated representation. Crucially, the MLP must have at least 2 layers (otherwise it's just a linear transform, which cannot approximate all injective functions).
Build two neighbor multisets using the sliders. See instantly which aggregations can distinguish them (green) and which cannot (red). The sum column is always green when the multisets genuinely differ — this is Lemma 5 in action.
The proof strategy: Given any two graphs G1, G2 that WL distinguishes at iteration K, WL produces different label multisets. Since GIN's aggregation is injective (Lemma 5), nodes with different WL labels will have different GIN embeddings at every layer. Since the READOUT sums over node embeddings, different label multisets imply different summed embeddings. Therefore GIN distinguishes G1 from G2.
Together with Theorem 2 (GNN ≤ WL), Theorem 3 gives the exact characterization: GIN = WL. Not weaker, not stronger — equal. This is a tight characterization, which is what makes it a satisfying theoretical result.
GIN achieves WL expressiveness, but it does so through a continuous, differentiable approximation. There are two important nuances:
This gap between representational capacity and achievable performance is a persistent theme in deep learning theory. The paper focuses on the capacity question (the harder and more fundamental one) and leaves the optimization question open.
python import torch import torch.nn as nn from torch_geometric.nn import MessagePassing from torch_geometric.utils import add_self_loops class GINConv(MessagePassing): """Graph Isomorphism Convolution (Xu et al., ICLR 2019)""" def __init__(self, mlp, eps=0.0, train_eps=False): super().__init__(aggr='add') # 'add' = sum aggregation self.nn = mlp # epsilon: learnable or fixed self.initial_eps = eps if train_eps: self.eps = nn.Parameter(torch.Tensor([eps])) else: self.register_buffer('eps', torch.Tensor([eps])) def forward(self, x, edge_index): # Aggregate neighbor features via sum (no self-loops needed) agg = self.propagate(edge_index, x=x) # (1 + eps) * self + sum of neighbors out = (1 + self.eps) * x + agg # Apply MLP (at least 2 layers for universal approximation) return self.nn(out) def message(self, x_j): # x_j: features of neighbors — passed through directly # (the MLP is applied to the aggregated sum, not per-message) return x_j # The MLP must be at least 2 layers (Theorem 3 requires this): mlp = nn.Sequential( nn.Linear(64, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Linear(128, 64) # 2 layers = universal approximator ) conv = GINConv(mlp, train_eps=True)
GCN (Kipf & Welling 2017) is derived from spectral graph convolutions — it approximates a first-order Chebyshev polynomial filter applied to graph signals. This spectral derivation is elegant but disconnected from expressiveness. The spectral framing doesn't tell you anything about what structures the GNN can distinguish. This paper's spatial/multiset framing is what gives the expressiveness characterization.
Importantly, GCN's mean aggregation is not a consequence of the spectral derivation — it's a design choice (normalizing by degree for numerical stability). A spectral GNN with sum aggregation would be WL-expressive. The theoretical gap between GCN and GIN is a consequence of the normalization choice, not of the spectral vs. spatial framing.
A concrete experiment to build intuition: train a 2-layer MLP to classify multisets of integers. Give it examples like ({1,1,2} → class 0, {1,2} → class 1) by constructing the "sum of one-hot" representation, then training a classifier. The MLP quickly learns to exploit the count difference. Now try the same with mean: ({1,1,2,2} → class 0, {1,2} → class 1) — the mean representation is identical for both classes; the classifier cannot do better than random. This is Lemma 5 made empirical.
GIN's update rule: hv(k) = MLP((1+ε) · hv(k-1) + Σ hu(k-1)). The (1+ε) factor is the subtlest piece. Why not just write MLP(hv + Σ hu)? What does the explicit self-scaling add?
Without (1+ε), the input to the MLP is hv + Σu hu. Consider two scenarios:
Both produce identical MLP inputs [1, 0] — even though the structural situations are completely different (one node has a feature, the other has a neighbor). The MLP cannot distinguish them.
With (1+ε): Scenario A gives (1+ε)[1,0] + 0 = [(1+ε), 0]. Scenario B gives (1+ε)[0,0] + [1,0] = [1, 0]. For any ε ≠ 0, these are different MLP inputs. The (1+ε) factor separates the node's own contribution from the neighbor sum.
The paper evaluates two variants. GIN-0 fixes ε = 0: hv(k) = MLP(hv(k-1) + Σ hu(k-1)). GIN-ε learns ε from data alongside the MLP weights.
GIN-0 is equivalent to adding a self-loop to every node and then summing. This is a valid choice because: (a) in most graphs, v ≠ u for all u ∈ N(v), so self-features and neighbor features are naturally distinguishable by their graph position; (b) the MLP that follows has enough capacity to learn the effective ε implicitly.
Empirically (Table 3 of the paper), GIN-0 and GIN-ε perform nearly identically. The authors conclude that ε adds expressiveness in theory but the MLP compensates in practice. Both are provably WL-equivalent — the theoretical argument works for any fixed ε ≠ −1 (the singular case where the self-contribution cancels).
Let's trace one GIN layer concretely. Node v has feature hv = [0.5, 0.3] and two neighbors with h1 = [1.0, 0.2] and h2 = [0.7, 0.8]. Using ε = 0:
| Step | Operation | Value |
|---|---|---|
| 1. Self-scale | (1+0) · hv | [0.5, 0.3] |
| 2. Neighbor sum | h1 + h2 | [1.7, 1.0] |
| 3. Add | [0.5, 0.3] + [1.7, 1.0] | [2.2, 1.3] |
| 4. BN + Linear 1 | W1 · BN([2.2, 1.3]) + b1 | hidden vector |
| 5. ReLU | max(0, hidden) | hidden vector |
| 6. Linear 2 | W2 · hidden + b2 | hv(new) |
The formal proof of Theorem 3 (GIN achieves WL) requires showing that GIN's layer function is injective as a function of the pair (hv, {hu}). With (1+ε) for ε ≠ −1, the mapping from (hv, {hu}) to (1+ε)hv + Σhu is injective in the following sense: if (1+ε)hv + Σhu = (1+ε)hv' + Σhu', then by Lemma 5 (applied to the combined multiset), we can conclude hv = hv' and {hu} = {hu'} as multisets. The MLP (Φ) then maps this injective representation to the output embedding.
Gilmer et al.'s MPNN framework is the general message-passing paradigm. GIN is a specific instance. MPNN uses: hvt+1 = Ut(hvt, Σu∈N(v) Mt(hvt, hut, evu)). GIN simplifies this by (a) not using edge features evu, (b) setting Mt(hv, hu, e) = hu (pass neighbor features directly), (c) using sum aggregation, and (d) using MLP for Ut. GIN's contribution is proving that this simple instantiation is theoretically optimal — more complex message functions (with edge features, attention, etc.) don't improve expressiveness beyond 1-WL unless they fundamentally change the message-passing structure.
For graph classification, we need to aggregate all node embeddings into a single graph-level representation. This is the readout function, applied after all GNN layers. GIN's readout is a key part of its expressiveness — and it differs from what most prior work used.
The standard approach: after K GNN layers, compute hG = READOUT({hv(K) : v ∈ G}). This uses only the final layer's node embeddings. But GNN layers at different depths capture different aspects of structure:
If you only use layer K, you discard all the fine-grained local structure learned by earlier layers. Two graphs might have identical deep structure but different local structure — last-layer readout would call them the same.
where READOUT = sum (not mean). This preserves embeddings at every depth, giving the final classifier access to structure at all scales. The concatenation dimension grows linearly with K — for K=5 layers with hidden dim d, the final graph embedding has dimension 6d (including the initial k=0 features).
GIN's concatenation readout is related to Jumping Knowledge networks (JK-Net, Xu et al. 2018 — same group, different paper). JK-Net concatenates node embeddings across layers at the node level, giving each node access to representations at all depths. GIN applies this idea at the graph level, for the readout. Both recognize that different tasks may require structural information at different depths, and concatenation across depths is a simple way to provide all of it.
If the initial node feature dimension is d0, and each GIN layer outputs dimension dh (the hidden dimension), then after K layers the concatenated graph embedding has dimension:
For d0 = 1 (constant feature), dh = 64, K = 5: dim = 1 + 64×5 = 321. The paper uses 64-dimensional hidden layers with 5 layers, giving a 321-dimensional graph embedding fed to the final classifier. This is linear in depth — unlike architectures that grow exponentially with depth.
GIN includes batch normalization after each GIN layer. This is not theoretically motivated (BN doesn't affect expressiveness) but practically important. The sum aggregation can produce vectors with very different scales depending on node degree — a hub node summing 1000 neighbors gets a much larger activation than a leaf node summing 1. BN normalizes these across the batch, stabilizing training. The paper notes that BN is critical for GIN to converge on the RDT datasets (social networks with very high-degree nodes).
python import torch import torch.nn as nn from torch_geometric.nn import global_add_pool class GINClassifier(nn.Module): def __init__(self, in_dim, hidden, num_classes, num_layers=5): super().__init__() self.num_layers = num_layers # One GINConv + BN per layer self.convs = nn.ModuleList() self.bns = nn.ModuleList() for i in range(num_layers): mlp = nn.Sequential( nn.Linear(in_dim if i==0 else hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Linear(hidden, hidden) ) self.convs.append(GINConv(mlp, train_eps=True)) self.bns.append(nn.BatchNorm1d(hidden)) # Classifier on concatenated layer readouts # Each layer contributes `hidden` dims + initial features `in_dim` total_dim = in_dim + hidden * num_layers self.classifier = nn.Sequential( nn.Linear(total_dim, hidden), nn.ReLU(), nn.Dropout(0.5), nn.Linear(hidden, num_classes) ) def forward(self, x, edge_index, batch): # Collect graph-level readouts from ALL layers layer_readouts = [global_add_pool(x, batch)] # k=0: initial features for conv, bn in zip(self.convs, self.bns): x = bn(conv(x, edge_index)).relu() layer_readouts.append(global_add_pool(x, batch)) # Concatenate all layer graph embeddings graph_emb = torch.cat(layer_readouts, dim=-1) # Classify return self.classifier(graph_emb) # Data flow example: # Input: (batch_size=32 graphs) with total n=500 nodes # x: [500, in_dim] edge_index: [2, 1200] batch: [500] # After each layer: x → [500, hidden] # global_add_pool: sums per-graph → [32, hidden] # Final graph_emb: [32, in_dim + hidden*5] # Output: [32, num_classes]
The paper validates its theoretical claims on 9 graph classification benchmarks from two domains: social networks (IMDB-B, IMDB-M, COLLAB, RDT-B, RDT-M5K) and bioinformatics (MUTAG, PROTEINS, PTC, NCI1). These are standard benchmarks used by all prior GNN papers, enabling direct apples-to-apples comparison.
All GNN models use 5 layers, trained with Adam, evaluated with 10-fold cross-validation. For datasets without node features, all nodes get the constant feature 1. GIN uses batch normalization after each layer. The baseline models include not just neural GNNs but also the classical WL kernel (run as a feature extractor with an SVM) — the theoretical upper bound.
| Model | MUTAG | PTC | NCI1 | PROTEINS | COLLAB | RDT-B |
|---|---|---|---|---|---|---|
| WL Kernel | 90.4±5.7 | 59.9±4.3 | 86.0±1.8 | 75.0±3.1 | 78.9±1.9 | 81.0±3.1 |
| GCN | 85.6±14 | 64.6±7.0 | — | 76.0±3.2 | 81.7±1.6 | 50.0±0.0 |
| GraphSAGE | 85.1±7.6 | 63.9±7.7 | 77.7±1.5 | 75.9±3.2 | 79.7±1.7 | 84.3±1.9 |
| GIN-ε | 89.0±6.0 | 64.6±7.0 | 82.7±1.7 | 76.2±2.8 | 80.2±1.9 | 92.4±2.5 |
| GIN-0 | 89.4±5.6 | 64.6±7.0 | 82.7±1.6 | 76.2±2.8 | 80.6±1.9 | 92.4±2.5 |
The most striking result: GCN on RDT-B (Reddit Binary) scores 50.0 — exactly random chance. This is not a training failure; it's an expressiveness failure. Reddit-B graphs require distinguishing nodes by their count of high-degree neighbors, which mean aggregation cannot do. GIN achieves 92.4% on the same dataset — directly predicted by the theory.
Table 3 (ablation study) tests variations of GIN on bioinformatics datasets:
| Variation | NCI1 | Finding |
|---|---|---|
| GIN with sum | 82.7 | Best — theory-motivated |
| GIN with mean | 80.9 | Loses count info (Proposition 2) |
| GIN with max | 77.9 | Loses multiplicity (Proposition 2) |
| GIN, 1-layer MLP (linear) | 79.2 | Not universal approximator |
| GIN, 2-layer MLP | 82.7 | Universal — matches theory |
| GIN, last layer only | 80.1 | Discards multi-scale info |
| GIN, all layers concat | 82.7 | Best — multi-scale readout |
Every theoretical prediction is confirmed: sum beats mean/max, multi-layer MLP beats linear, all-layer readout beats last-layer. The agreement between theory and ablation is unusually clean, which is part of what makes this paper compelling.
On MUTAG, the classical WL kernel (90.4%) slightly outperforms GIN (89.4%). The paper attributes this to overfitting: WL kernel + SVM has strong regularization through the kernel method, while GIN is a neural network that can overfit on small datasets. MUTAG has only 188 graphs. On larger datasets (RDT-B: 2000 graphs), GIN (92.4%) clearly surpasses WL kernel (81.0%).
For researchers wanting to replicate the results:
| Hyperparameter | Value | Rationale |
|---|---|---|
| Number of GIN layers | 5 | Covers up to 5-hop neighborhoods |
| Hidden dimension | 64 | Sufficient for bioinformatics, social datasets |
| MLP layers per GIN layer | 2 | Required for universal approximation (Theorem 3) |
| Dropout | 0.5 | Applied before final classifier only |
| Optimizer | Adam, lr=0.01 | Decayed every 50 epochs |
| Epochs | 350 | With 5-epoch decay schedule |
| Evaluation | 10-fold CV | Average test accuracy across folds |
| Batch normalization | After each GIN layer | Critical for high-degree social graphs |
The paper evaluates at the epoch achieving peak validation accuracy (not final epoch) — a common source of implementation discrepancy in GNN papers. Using final epoch accuracy typically gives 1-2% lower results. This is worth noting when comparing to reproductions.
The results include standard deviations across 10 folds (e.g., GIN-0 on MUTAG: 89.4 ± 5.6). These large deviations reflect two things: (a) the datasets are small (MUTAG has 188 graphs, ~19 per fold), so individual folds have high variance; and (b) graph classification is a hard task where even top models don't achieve near-perfect accuracy. The proper comparison is between models' mean accuracies, with the standard deviation as a guide to significance.
GIN's improvement over GraphSAGE on RDT-B (92.4 vs 84.3, with ±2.5 and ±1.9 respectively) is clearly statistically significant — the confidence intervals don't overlap. GIN's improvement over WL kernel on RDT-B (92.4 vs 81.0) is even more dramatic. These are not marginal improvements — they reflect a genuine architectural advantage on count-sensitive tasks.
The Reddit-Binary dataset consists of threads from Reddit. Each graph represents a thread: nodes are users, edges connect users who replied to each other. The task is to classify whether the thread is from a "discussion" subreddit or an "online-community" subreddit. The key structural difference: discussion threads tend to be star-like (one post, many replies) while community threads are more densely interconnected.
Distinguishing these requires knowing HOW MANY neighbors of a specific type a node has — exactly the count information that mean aggregation destroys. A node in a star-shaped discussion thread might have 10 degree-1 neighbors (users who only replied once). A node in a community thread might have 3 degree-3 neighbors. The mean of their degree features is different, but GCN's normalization by degree can obscure this. GIN's sum aggregation preserves the count and gives the classifier the information it needs.
GIN achieves maximum expressiveness among 1-hop message-passing GNNs — but that maximum is not unlimited. The WL test itself has failure cases, and every GNN (including GIN) inherits them. This section characterizes exactly what is beyond the reach of any message-passing GNN.
A k-regular graph is a graph where every node has exactly k neighbors. Consider any two non-isomorphic 3-regular graphs on 6 nodes (e.g., K3,3 vs. two disjoint triangles K3 ∪ K3). At iteration 0, all nodes have label "degree 3". At iteration 1, every node sees 3 neighbors all labeled "degree 3" — so all nodes get the same new label in both graphs. This repeats indefinitely. WL never distinguishes them.
GIN faces the same wall. No matter how deep or wide, GIN assigns identical embeddings to all nodes in any k-regular graph with uniform initial features. This means GIN cannot count triangles, detect cycles of specific lengths, or identify any structural property that requires looking beyond the local neighborhood context.
Some graph pairs are locally identical at all depths but globally non-isomorphic. Two 6-node graphs where every node has exactly 2 neighbors of degree 2 and 1 neighbor of degree 3 will have identical WL labels at every depth, even if their global structures differ. WL only sees local tree-like structure — it cannot detect global properties like global cycles or global symmetries that don't manifest locally.
This is not a fixable bug — it's an architectural limit of message passing. Every message-passing GNN computes a function of rooted subtrees. Two nodes with identical rooted subtrees must have identical embeddings. If two graphs have isomorphic rooted subtrees at every node (which is exactly what WL cannot distinguish), no message-passing GNN can tell them apart.
The k-dimensional WL test (k-WL) operates on k-tuples of nodes rather than individual nodes. Each step up in k increases expressiveness but exponentiates computation cost.
| Method | Expressiveness | Node-level cost | Can count triangles? |
|---|---|---|---|
| 1-WL = GIN | Subtrees | O(n) | No |
| 2-WL | + some pairs | O(n²) | Yes |
| 3-WL | + triples | O(n³) | Yes |
| k-WL (k≥7) | Full isomorphism test | O(nk) | Yes |
Here's the most concrete way to see why 1-WL fails on certain regular graphs. Consider two 3-regular graphs on 6 nodes:
In G1 (K3,3), every node sees 3 neighbors, all of which have 3 neighbors. Every neighbor-of-neighbor also has 3 neighbors of degree 3. The WL labels never diverge — all nodes at all depths look identical. In the prism graph, every node also sees 3 neighbors all of degree 3. WL cannot tell these graphs apart. GIN (as a WL-equivalent) also cannot.
This is not a pathological example — these are real graphs with different topologies (K3,3 is bipartite, the prism is not) but identical WL fingerprints. Tasks distinguishing them require higher-order methods.
The paper spawned a research thread on going beyond 1-WL. Key approaches:
A deep theoretical connection (proven post-2019, building on this paper): WL distinguishability is equivalent to distinguishability by counting homomorphisms from trees. A homomorphism from graph H to graph G is a map φ: V(H) → V(G) that preserves edges. The number of such maps |Hom(H, G)| encodes structural information about G.
Two graphs G1, G2 are WL-indistinguishable if and only if |Hom(T, G1)| = |Hom(T, G2)| for every tree T. Since GIN = WL, GIN can distinguish G1 from G2 if and only if some tree T has different homomorphism counts in G1 vs G2. Things that GIN CANNOT compute: homomorphism counts from graphs with cycles (triangles, squares, etc.). This is why GIN cannot count triangles — triangle counting requires homomorphisms from K3, which has a cycle.
Higher expressiveness comes with a tradeoff in practice. More expressive models:
GIN represents a sweet spot: maximum 1-WL expressiveness at O(E) (linear in edges) computation — the same asymptotic cost as GCN and GraphSAGE. This is why GIN has become the standard strong baseline rather than, say, 2-WL GNNs that theoretically do more but are quadratically more expensive and harder to train.
Xu et al. 2019 has accumulated over 5,000 citations (as of 2024) — making it one of the most-cited papers in graph ML. It appears in the related work of essentially every paper that proposes a new GNN architecture or expressiveness result. The paper's central contribution — the WL-GNN equivalence — has become a standard reference point, cited when:
Few theoretical ML papers become as operationally useful as this one. Most expressiveness results are "existence" theorems — they prove something can or cannot be done but don't directly guide practice. The WL-GNN equivalence is different: it tells you exactly which architectures to use for which tasks, and why. That directness is what made it so widely adopted.
Before this paper, GNN design was largely empirical: try an architecture, evaluate on a benchmark, iterate. This paper introduced a rigorous framework for asking "why does this work?" and "can it work in principle?" It turned GNN design from an empirical art into something closer to a science.
Specifically, it:
The WL graph kernel (Shervashidze et al., 2011) uses WL labels as features for an SVM classifier. It's been a strong baseline for graph classification for a decade. GIN can be seen as the differentiable neural version of the WL kernel: instead of fixed hash labels, it uses learned MLP transformations. The expressiveness equivalence means they represent the same graph structures — but GIN learns task-specific features while the WL kernel uses fixed structural fingerprints.
A common misconception: GAT (Graph Attention Network, Velickovic et al. 2018) seems more sophisticated than GCN and should therefore be more expressive. In fact, GAT uses attention-weighted mean aggregation. A weighted mean is still a mean — the weights can vary per-neighbor, but the aggregation still normalizes by the total attention mass. GAT is provably below WL expressiveness, same as GCN. Attention helps with learning dynamics and generalization, but not with the theoretical expressiveness ceiling.
More formally: GAT computes hvnew = σ(Σu∈N(v)∪{v} αvu W hu) where αvu = softmax(evu) and evu is an attention score. The softmax normalization Σu αvu = 1 makes this exactly a weighted mean. The paper's Proposition showing mean fails applies to GAT as well: a node with neighbors {1,1,2,2} and softmax attention weights {0.25, 0.25, 0.25, 0.25} produces the same embedding as a node with neighbors {1,2} and attention weights {0.5, 0.5}. Mean = mean = failure.
A key takeaway from this paper is that sophistication is not expressiveness. GAT has a more complex mechanism than GCN (learned attention vs. fixed normalization) but is no more expressive. GraphSAGE has more flexible neighbor sampling than GCN but is not WL-equivalent. Complexity of the message-passing mechanism doesn't translate directly into representational power.
The right question to ask of any new GNN is: "Is your aggregation function injective over multisets of neighbor features?" If not, the architecture is bounded below WL, regardless of how sophisticated the rest of the design is. This gives a clean prior filter before running any experiments.
Recent architectures called "Graph Transformers" (e.g., Graphormer, GraphGPS) use attention between all node pairs — not just neighbors. These are technically NOT message-passing GNNs in the sense of this paper (they don't restrict to local neighborhoods). Theorem 2 does not apply to them. Graph transformers with global attention can potentially exceed 1-WL expressiveness because they operate on node pairs (closer to 2-WL). However, they also face scalability challenges — O(n²) attention over all pairs is expensive for large graphs.
GIN remains relevant as a component: many graph transformer architectures use a GIN-like local aggregation step combined with global attention, getting the best of both worlds — local structural sensitivity from GIN and global context from attention. This hybrid approach is a common design pattern in 2024-era graph ML.
The paper focuses on node features and undirected graphs. Extensions:
| Architecture | Aggregation | Expressiveness vs WL |
|---|---|---|
| GCN | Mean (uniform) | Strictly < WL |
| GAT | Mean (attention-weighted) | Strictly < WL |
| GraphSAGE | Max-pool | Strictly < WL |
| GIN | Sum + MLP | = WL (maximum) |
GIN remains one of the most widely used baselines in graph ML research. It appears as a baseline in almost every graph classification paper. Its simplicity (sum + MLP) makes it easy to implement and reason about. In practice, even on tasks theoretically beyond WL, GIN often performs competitively because: (a) real-world features usually break the symmetries WL can't see, and (b) simple strong baselines often outperform complex architectures when data is limited.
This paper is a model for how theory and practice should interact in ML research. The theoretical framework (message-passing as multiset functions) is clean enough to reason about precisely. The theoretical results (injectivity, WL equivalence) are sharp enough to make concrete predictions. Those predictions are then verified by direct ablation experiments that confirm every theoretically predicted distinction. The result is a paper where the theory actually explains the empirical observations, not just post-hoc.
Compare to many deep learning papers where the theory is decorative — it provides intuition but doesn't predict specific, testable outcomes. Xu et al. predict "mean will fail on count-sensitive tasks," test it on Reddit-Binary (a count-sensitive task), and observe GCN at 50% accuracy. That's what good theoretical machine learning looks like.
The hierarchy of GNN expressiveness. Each level can distinguish everything the levels below it can, plus more. GIN sits at the 1-WL ceiling for message-passing GNNs. Higher-order methods go beyond but at exponential cost.
The paper focuses on graph classification, but GIN's layer is also directly applicable to node classification. For node classification, the readout is simply the final-layer node embedding hv(K) — no graph-level pooling needed. The all-layer concatenation trick can also be applied at the node level (each node's final representation = concatenation of its embeddings at all K depths), similar to JK-Net.
python class GINNodeClassifier(torch.nn.Module): """GIN for node classification — no global pooling.""" def __init__(self, in_dim, hidden, num_classes, num_layers=5): super().__init__() self.convs = torch.nn.ModuleList() self.bns = torch.nn.ModuleList() for i in range(num_layers): mlp = torch.nn.Sequential( torch.nn.Linear(in_dim if i==0 else hidden, hidden), torch.nn.BatchNorm1d(hidden), torch.nn.ReLU(), torch.nn.Linear(hidden, hidden) ) self.convs.append(GINConv(mlp, train_eps=True)) self.bns.append(torch.nn.BatchNorm1d(hidden)) # Classify from last-layer node embeddings (or all-layer concat) self.classifier = torch.nn.Linear(hidden, num_classes) def forward(self, x, edge_index): for conv, bn in zip(self.convs, self.bns): x = bn(conv(x, edge_index)).relu() return self.classifier(x) # [num_nodes, num_classes]
Since the original paper (2018/2019), the Open Graph Benchmark (OGB) has established harder and more realistic graph classification benchmarks. GIN remains competitive:
On molecular datasets, GIN's 1-WL expressiveness is often sufficient because atom types and bond features provide enough label diversity to distinguish the relevant structures. The WL failure cases (regular graphs with uniform features) rarely appear in practice — molecules always have heterogeneous atom types.
| Question | If Yes | If No |
|---|---|---|
| Do your graphs have node features? | Use them as initial h_v^(0) | Use constant feature or degree one-hot |
| Do your graphs have edge features? | Extend GIN with edge MLP | Standard GIN applies |
| Are your graphs large (>10K nodes)? | Consider mini-batch sampling | Standard GIN applies |
| Do you need to count cycles? | Use subgraph GNN or add structural features | Standard GIN is sufficient |
| Do you have very few graphs (<500)? | Consider WL kernel + SVM for stronger regularization | GIN works well |
Since 2019, the GIN framework has been extended in several directions, all rooted in the same theoretical foundation:
| Paper | Extension | Key idea |
|---|---|---|
| Haggai et al. (2019) | k-GNN | Operate on node k-tuples for higher expressiveness |
| Murphy et al. (2019) | RP-GNN | Random node features to break symmetries |
| Li et al. (2020) | Distance-encoding GNN | Use shortest-path distances as node features |
| Zhao et al. (2022) | Equivariant Subgraph Aggregation | Subgraph GNNs for beyond-1WL expressiveness |
| Lim et al. (2023) | SIGN + structural features | Graph Laplacian eigenvectors as positional encodings |
All of these papers cite Xu et al. 2019 as the foundational result they're building on. The WL equivalence theorem defined the ceiling; subsequent work is about breaking through it.
This paper provides a template for evaluating any new GNN architecture. When you read a new GNN paper, ask these questions in order:
This five-question filter eliminates a lot of architecture-hunting. Many papers have proposed "more expressive" GNNs that are actually bounded below GIN because they use attention-weighted mean. The expressiveness framework from this paper makes these failures immediately obvious.
For practitioners who just want a reliable graph classifier: you don't need to implement GIN from scratch. PyTorch Geometric's GINConv with aggr='add', a 2-layer MLP per layer, batch normalization, 5 GIN layers, and sum readout over all layers reproduces the paper's results. Typical hyperparameters: hidden dim 64, dropout 0.5, Adam lr=0.01, 350 epochs with stepwise decay. Start here before trying any more complex architecture.
| Theorem | Statement (informal) | Implication |
|---|---|---|
| Lemma 5 | Sum-of-f is the universal injective multiset function | GIN's aggregation can represent any injective multiset function |
| Theorem 2 | No GNN can exceed WL expressiveness | WL is the ceiling for all message-passing GNNs |
| Prop. 1 | GCN (mean) is strictly below WL | Concrete failure case: same-proportion, different-count multisets |
| Prop. 2 | GraphSAGE (max) is strictly below WL | Concrete failure case: same-elements, different-counts multisets |
| Theorem 3 | GIN achieves WL expressiveness | GIN = WL — the theoretical maximum for message-passing GNNs |
This paper closed one chapter of GNN theory and opened several more. As of 2024, open questions directly motivated by this work include:
CS224W Lec 3 — GNN Basics — The node classification task and the GNN framework from scratch.
CS224W Lec 4 — GCN, GraphSAGE, GAT — The three main architectures this paper analyzes.
CS224W Lec 6 — Theory of GNNs — The interactive micro-lesson covering this paper's content with visualizations.
Neural Message Passing for Quantum Chemistry — Gilmer et al. 2017, the general message-passing framework this paper analyzes.
Message-passing GNNs aggregate over multisets of neighbor features. Mean and max aggregations are not injective over multisets — they cannot distinguish multisets with same proportions (mean) or same elements (max). Sum aggregation + MLP is injective (Lemma 5). A GNN with injective aggregation is as powerful as the 1-WL graph isomorphism test (Theorem 3). No message-passing GNN can do better (Theorem 2). GIN implements this with: hv(k) = MLP((1+ε)hv(k-1) + Σ hu(k-1)). This is the most expressive possible message-passing GNN.
A good theoretical result in machine learning satisfies three criteria: (1) it makes precise, falsifiable predictions about observable behavior; (2) those predictions are confirmed experimentally; (3) the theory guides future design decisions. This paper satisfies all three. It predicts GCN fails on count-sensitive tasks (confirmed: 50% on RDT-B). It predicts sum beats mean and max in ablations (confirmed: Table 3). It guides the design of GIN (confirmed: GIN = WL, the theoretical maximum). This is what good theoretical ML looks like — not post-hoc rationalization, but genuine predictive power.
"The art of science is to ask the right question." — Max Planck
The right question was not "what can we do with GNNs?" but "what can we prove GNNs can and cannot do?" That shift — from empiricism to theory — is what produced a result that will be cited for decades.