A design space of roughly 315,000 GNN configurations, studied systematically. The finding: there is no universally best GNN. The right architecture depends on the task. But the framework for thinking about GNN design is universal.
GCN, GraphSAGE, GAT, GIN, SGC, JK-Net, APPNP, PNA. For any new graph learning task, a researcher faces an overwhelming choice: which GNN architecture should I use? The answer from the literature is unsatisfying: "it depends."
Before this paper, "it depends" was a hand-wavy excuse. Different papers proposed different architectures, evaluated on different benchmarks, with different hyperparameter search budgets. Comparing them was almost meaningless — you couldn't tell if GAT beat GCN because of the attention mechanism or because of better tuning.
The key idea: treat GNN design as a design space — a structured collection of choices along defined dimensions. Run controlled experiments across all combinations. Measure which choices have high impact and which tasks they help. Replace "it depends" with "it depends on X, Y, Z, and here's why."
You et al. organize GNN design into three groups of choices, each with several specific dimensions:
Within a single GNN layer, the key operation is: aggregate messages from neighbors, transform, apply nonlinearity. Each part is a design choice.
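The three steps (aggregate, transform, apply a nonlinearity) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function name `gnn_layer`, the dictionary-based graph, and the fixed choices of sum aggregation and ReLU are all assumptions made for the example.

```python
def gnn_layer(features, neighbors, weight):
    """One message-passing layer (illustrative sketch).

    features:  dict node -> list[float], the input feature vector
    neighbors: dict node -> list of neighbor nodes
    weight:    list of rows, shape (out_dim, in_dim)
    """
    out = {}
    for v, nbrs in neighbors.items():
        dim = len(features[v])
        # 1. Aggregate: sum the neighbor feature vectors (assumed choice)
        agg = [sum(features[u][k] for u in nbrs) for k in range(dim)]
        # 2. Transform: multiply by the weight matrix
        transformed = [sum(w * a for w, a in zip(row, agg)) for row in weight]
        # 3. Nonlinearity: ReLU (assumed choice)
        out[v] = [max(0.0, t) for t in transformed]
    return out


# Path graph a - b - c with 2-dimensional features
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
weight = [[1.0, -1.0], [0.5, 0.5]]

hidden = gnn_layer(features, neighbors, weight)
print(hidden["b"])  # [1.0, 1.5]
```

Swapping the aggregation line or the activation line changes exactly one design dimension while leaving the rest of the layer untouched, which is the kind of controlled variation the design-space view relies on.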
For a directed edge (u→v), what does node u send to node v? Three options studied:
How should all received messages $\{m_{u \to v}\}$ be combined into one vector?
| Aggregation | Formula | Property |
|---|---|---|
| Mean | $\frac{1}{\lvert N(v) \rvert} \sum_{u \in N(v)} m_{u \to v}$ | Degree-invariant |
| Max | $\max_{u \in N(v)} m_{u \to v}$ | Structure-sensitive |
| Sum | $\sum_{u \in N(v)} m_{u \to v}$ | Degree-aware, most expressive |
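A small example makes the "degree-invariant" versus "degree-aware" distinction concrete. The helper `aggregate` and the two toy neighborhoods below are hypothetical, chosen only to show that mean and max cannot tell a neighborhood apart from a copy of itself with every message duplicated, while sum can.

```python
def aggregate(messages, how):
    """Combine a list of equal-length message vectors into one vector."""
    columns = list(zip(*messages))  # one tuple per feature dimension
    if how == "mean":
        return [sum(c) / len(c) for c in columns]
    if how == "max":
        return [max(c) for c in columns]
    if how == "sum":
        return [sum(c) for c in columns]
    raise ValueError(f"unknown aggregation: {how}")


small = [[1.0, 2.0], [3.0, 4.0]]  # node with 2 neighbors
large = small * 2                 # node with 4 neighbors, same values repeated

for how in ("mean", "max", "sum"):
    print(how, aggregate(small, how), aggregate(large, how))
# mean [2.0, 3.0] [2.0, 3.0]
# max [3.0, 4.0] [3.0, 4.0]
# sum [4.0, 6.0] [8.0, 12.0]
```

Only the sum output changes with the number of neighbors, which is why the table lists it as degree-aware.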
Interactive figure: mean, max, and sum aggregation produce different node representations as the neighbor features vary.
Between layers, the key design choices are how many layers to use and how they connect to each other. These choices interact heavily.
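The connectivity options can be sketched as three ways of combining a layer's input with its output. The labels `stack`, `skip_sum`, and `skip_cat`, the function `run_layers`, and the one-dimensional `toy_layer` are illustrative assumptions, not names taken from the paper.

```python
def run_layers(x, layers, connectivity="stack"):
    """Apply layers in sequence under a given connectivity pattern."""
    h = x
    for layer in layers:
        out = layer(h)
        if connectivity == "stack":
            h = out                                # plain stacking
        elif connectivity == "skip_sum":
            h = [a + b for a, b in zip(h, out)]    # residual-style sum
        elif connectivity == "skip_cat":
            h = h + out                            # concatenate input and output
        else:
            raise ValueError(f"unknown connectivity: {connectivity}")
    return h


def toy_layer(h):
    """Stand-in for a GNN layer: maps any input to a 1-dimensional output."""
    return [0.5 * sum(h)]


x = [4.0]
layers = [toy_layer, toy_layer]
print(run_layers(x, layers, "stack"))     # [1.0]
print(run_layers(x, layers, "skip_sum"))  # [9.0]
print(run_layers(x, layers, "skip_cat"))  # [4.0, 2.0, 3.0]
```

With stacking, the original input is only reachable through every layer in turn. Both skip variants keep a direct path from the input to the final representation, so depth and connectivity have to be chosen together.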
The paper systematically studies adding MLP layers before the GNN (to transform input features) and after (to transform the final representation before classification). The finding: post-processing MLPs consistently help. Pre-processing MLPs help when input features are high-dimensional or noisy. This is now standard practice.
Figure: gradient magnitude at each layer during backpropagation. Skip connections prevent vanishing gradients in deep GNNs.
Training hyperparameters — learning rate, batch size, optimizer, regularization — are often treated as an afterthought in GNN research. You et al. find they are not: for some tasks, training configuration matters as much as architecture.
| Choice | Options Tested | Finding |
|---|---|---|
| Optimizer | Adam, SGD | Adam consistently better for GNN tasks |
| Learning rate | 0.1, 0.01, 0.001, 0.0001 | Task-dependent; 0.01 often best for node classification |
| L2 regularization | 0.0 to 0.01 | Small regularization (1e-5) helps most tasks |
| Dropout rate | 0.0 to 0.5 | 0.0–0.3 typical; depends on graph density |
| Batch norm | On/Off | Helps for deep GNNs; hurts for shallow |
To assess which design choices matter, the authors use Kendall rank correlation: for each design dimension, sample all other choices at random and hold them fixed, vary only that dimension, and measure how consistently the ranking of architectures changes. A high rank correlation means the dimension matters a lot.
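For readers unfamiliar with the statistic, Kendall rank correlation compares two score lists by counting pairs of items that are ordered the same way (concordant) versus the opposite way (discordant). The sketch below implements the simple tau-a variant in plain Python; the function name and the accuracy values are made up for illustration, and the exact way the authors apply the statistic is not reproduced here.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies of the same four configurations in two settings
setting_a = [0.81, 0.77, 0.90, 0.65]
setting_b = [0.70, 0.72, 0.88, 0.60]

tau = kendall_tau(setting_a, setting_b)
print(round(tau, 3))  # 0.667
```

A value of 1 means the two settings rank the configurations identically, -1 means the rankings are reversed, and values near 0 mean the rankings are unrelated.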
The scale of this study is its most distinctive feature. You et al. define a design space with approximately 315,000 valid GNN configurations. They test across 32 tasks spanning node classification, link prediction, and graph classification.
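A design space of this kind is a Cartesian product of per-dimension options, so its size is the product of the option counts. The sketch below uses a small hypothetical space (the dimension names and most option lists are placeholders, not the paper's actual dimensions) to show how the size is computed, how configurations are enumerated, and how one can be sampled at random when full enumeration is too expensive.

```python
import itertools
import math
import random

# Hypothetical dimensions and options, for illustration only
design_space = {
    "aggregation": ["mean", "max", "sum"],
    "connectivity": ["stack", "skip_sum", "skip_cat"],
    "num_layers": [2, 4, 6, 8],
    "optimizer": ["adam", "sgd"],
    "learning_rate": [0.1, 0.01, 0.001, 0.0001],
}


def space_size(space):
    """Number of configurations = product of options per dimension."""
    return math.prod(len(options) for options in space.values())


def all_configs(space):
    """Yield every configuration as a dict (feasible only for small spaces)."""
    keys = list(space)
    for values in itertools.product(*space.values()):
        yield dict(zip(keys, values))


def sample_config(space, rng):
    """Draw one configuration uniformly at random."""
    return {key: rng.choice(options) for key, options in space.items()}


print(space_size(design_space))  # 288
config = sample_config(design_space, random.Random(0))
print(config)
```

Even this toy space has 288 configurations from five dimensions. Adding a few more dimensions multiplies the count quickly, which is why random sampling becomes the practical way to explore a space with hundreds of thousands of entries.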
| Design Choice | Impact (high = matters more) | Domain |
|---|---|---|
| Layer connectivity (skip/stack) | High | Inter-layer |
| Aggregation (mean/max/sum) | High for some tasks | Intra-layer |
| Number of layers | High | Inter-layer |
| Activation function | Medium | Intra-layer |
| Learning rate | High | Training |
| Message normalization | Medium | Intra-layer |
| Optimizer (Adam vs SGD) | Low-Medium | Training |
The main message: match your GNN design to your task type. Here are the key patterns.
This paper's contribution is methodological as much as empirical. It established a vocabulary and framework for discussing GNN design that the community now uses widely.
| Related Work | Relation |
|---|---|
| GCN, GAT, GraphSAGE, GIN | Specific instantiations within the design space |
| JK-Net | One inter-layer design (skip-concat) studied as a dimension |
| SGC | Extreme point: linear activation + precomputed propagation |
| NAS for GNNs (GNAS, AutoGNN) | Automated search over the same design space |
| OGB benchmarks | Contemporaneous standardized evaluation suite; complementary to this work |
| Benchmarking GNNs (Dwivedi et al.) | Complementary: benchmarks on diverse datasets |