You, Ying, Leskovec — NeurIPS 2020 · arXiv 2011.08843

Design Space for Graph Neural Networks

A design space of roughly 315,000 GNN configurations, studied systematically. The finding: there is no universally best GNN. The right architecture depends on the task. But the framework for thinking about GNN design? Now that's universal.

Prerequisites: GCN / GAT basics + what node classification is. That's it.

Chapter 0: Which GNN to Use?

GCN, GraphSAGE, GAT, GIN, SGC, JK-Net, APPNP, PNA. For any new graph learning task, a researcher faces an overwhelming choice: which GNN architecture should I use? The answer from the literature is unsatisfying: "it depends."

Before this paper, "it depends" was a hand-wavy excuse. Different papers proposed different architectures, evaluated on different benchmarks, with different hyperparameter search budgets. Comparing them was almost meaningless — you couldn't tell if GAT beat GCN because of the attention mechanism or because of better tuning.

The evaluation crisis: A benchmarking study by Errica et al. (2020) showed that, once hyperparameters are tuned fairly, several proposed graph neural networks fail to consistently outperform simple structure-agnostic baselines (for example, an MLP on node features that ignores the graph). The field was advancing architecture complexity without rigorous evaluation. You et al. decided to run the controlled experiment at scale: fix the evaluation protocol, enumerate the design choices systematically, and find what actually matters.

The key idea: treat GNN design as a design space — a structured collection of choices along defined dimensions. Run controlled experiments across all combinations. Measure which choices have high impact and which tasks they help. Replace "it depends" with "it depends on X, Y, Z, and here's why."

Why was comparing GNN architectures difficult before this paper?

Different papers used different datasets, evaluation protocols, and hyperparameter search budgets, making it impossible to isolate which design choices actually mattered
GNN training is too slow to run controlled experiments
There were too few GNN architectures to compare

Chapter 1: Design Dimensions

You et al. organize GNN design into three groups of choices, each with several specific dimensions:

Intra-Layer Design (within each GNN layer)

Message passing: What information to pass along edges? (identity, L2-normalized, or degree-normalized)
Aggregation: How to combine neighbor messages? (mean, max, sum)
Activation: Which nonlinearity? (ReLU, PReLU, linear)
Dropout: On features, on messages, or none?
Batch normalization: Use it or not?

Inter-Layer Design (how layers connect)

Number of layers (L): 2, 4, 8?
Layer connectivity: Stack (standard), skip-sum (residual), skip-cat (JK-Net-style)
Pre/post processing layers: MLP before GNN, MLP after GNN?

Training Configuration

Optimizer: Adam, SGD?
Learning rate: 0.01, 0.001?
L2 regularization: Weight decay?
Batch size: Full-batch or minibatch?

The combinatorial explosion: multiplying the number of options across every dimension (3 aggregation types × 3 activation types × 3 connectivity types × several depth options × several learning rates × ...) gives roughly 315,000 configurations. No human can search this manually. You et al. use random search with ranking statistics (Friedman's test) to identify which dimensions have significant impact.
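
To make the count concrete, the sketch below multiplies the number of options per dimension and draws a few random configurations. The option lists are an assumption: they follow the paper's design-space table as best recalled, and they differ in places from the simplified lists above. The names `DESIGN_SPACE`, `design_space_size`, and `sample_configs` are illustrative only.

```python
import math
import random

# Assumed option lists per design dimension (recalled from the paper's
# design-space table; treat them as illustrative, not authoritative).
DESIGN_SPACE = {
    # intra-layer
    "batch_norm": [True, False],
    "dropout": [0.0, 0.3, 0.6],
    "activation": ["relu", "prelu", "swish"],
    "aggregation": ["mean", "max", "sum"],
    # inter-layer
    "connectivity": ["stack", "skip_sum", "skip_cat"],
    "pre_process_layers": [1, 2, 3],
    "message_passing_layers": [2, 4, 6, 8],
    "post_process_layers": [1, 2, 3],
    # training
    "batch_size": [16, 32, 64],
    "learning_rate": [0.1, 0.01, 0.001],
    "optimizer": ["sgd", "adam"],
    "epochs": [100, 200, 400],
}


def design_space_size(space):
    """Number of distinct configurations: the product of the option counts."""
    return math.prod(len(options) for options in space.values())


def sample_configs(space, n, seed=0):
    """Draw n random configurations, picking one option per dimension."""
    rng = random.Random(seed)
    return [
        {dim: rng.choice(options) for dim, options in space.items()}
        for _ in range(n)
    ]


size = design_space_size(DESIGN_SPACE)
configs = sample_configs(DESIGN_SPACE, n=3)
print(size)        # 314928, i.e. roughly 315K
print(configs[0])  # one randomly drawn configuration
```

Under these assumed option lists the product is 54 × 108 × 54 = 314,928, which is where the "315K" figure would come from.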

What are the three groups of GNN design choices studied in this paper?

Intra-layer design (message, aggregation, activation), inter-layer design (depth, skip connections), and training configuration (optimizer, LR, regularization)
Model architecture, dataset preprocessing, and evaluation metrics
Node embedding, edge embedding, and graph-level pooling

Chapter 2: Intra-Layer Design (Showcase)

Within a single GNN layer, the key operation is: aggregate messages from neighbors, transform, apply nonlinearity. Each part is a design choice.

Message passing

For a directed edge (u→v), what does node u send to node v? Three options studied:

Identity: Send h_u as-is.
Simple, no normalization.
Risk: high-degree nodes dominate.

L2-norm: Send h_u / ‖h_u‖₂.
Every message has unit length.

Degree-norm: Send h_u / √(d_u · d_v).
GCN-style symmetric normalization.
Risk: peripheral nodes over-weighted.
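
The message options named in this chapter (identity, L2-normalized, degree-normalized) can be sketched as three small functions. This is a minimal illustration in plain Python, assuming feature vectors are lists of floats and degrees are at least 1; the function names are hypothetical and not taken from the paper or any library.

```python
import math


def identity_message(h_u, d_u, d_v):
    """Send the sender's features unchanged."""
    return list(h_u)


def l2_normalized_message(h_u, d_u, d_v):
    """Rescale the sender's features to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in h_u))
    return [x / norm for x in h_u] if norm > 0 else list(h_u)


def degree_normalized_message(h_u, d_u, d_v):
    """GCN-style symmetric normalization: divide by sqrt(d_u * d_v)."""
    scale = math.sqrt(d_u * d_v)
    return [x / scale for x in h_u]


h_u = [3.0, 4.0]  # sender features
d_u, d_v = 4, 1   # sender and receiver degrees (assumed >= 1)

msg_identity = identity_message(h_u, d_u, d_v)
msg_l2 = l2_normalized_message(h_u, d_u, d_v)
msg_degree = degree_normalized_message(h_u, d_u, d_v)
print(msg_identity)  # [3.0, 4.0]
print(msg_l2)        # [0.6, 0.8]
print(msg_degree)    # [1.5, 2.0]
```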

Neighbor aggregation

How to combine all received messages {m_u→v} into one vector?

Aggregation	Formula	Property
Mean	(1/|N(v)|) ∑ m_u→v	Degree-invariant
Max	max_u m_u→v	Structure-sensitive
Sum	∑ m_u→v	Degree-aware, most expressive

Interactive demo (Aggregation Methods, Live Comparison): vary the neighbor feature values to see how mean, max, and sum aggregation produce different node representations. Default neighbor values: 0.8, 0.3, 0.6.

Sum vs Mean — the key distinction: Mean is invariant to adding more neighbors with the same features. If all neighbors have feature value 0.5, mean gives 0.5 regardless of how many neighbors there are. Sum gives 0.5, 1.0, 1.5, ... — it counts the neighbors. For tasks where degree matters (how popular is this node?), sum is more expressive. For tasks where average neighborhood features matter, mean suffices. GIN (Xu et al. 2019) proves that sum is strictly more expressive than mean for distinguishing graph structures.
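
The distinction is easy to check numerically. The sketch below applies the three aggregators to the neighbor values 0.8, 0.3, 0.6 and then to a growing set of identical neighbors with value 0.5. The helper name `aggregate` is illustrative, and features are scalars for simplicity.

```python
def aggregate(messages, how):
    """Combine scalar neighbor messages with mean, max, or sum."""
    if how == "mean":
        return sum(messages) / len(messages)
    if how == "max":
        return max(messages)
    if how == "sum":
        return sum(messages)
    raise ValueError(f"unknown aggregation: {how}")


neighbors = [0.8, 0.3, 0.6]
for how in ("mean", "max", "sum"):
    print(how, round(aggregate(neighbors, how), 3))

# Identical neighbors: mean stays put, sum counts them.
for k in (1, 2, 3):
    same = [0.5] * k
    print(k, aggregate(same, "mean"), aggregate(same, "sum"))
```

With identical neighbors, the mean column stays at 0.5 while the sum column grows with the neighbor count, which is exactly the degree information that mean discards.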

A node has 5 neighbors all with the same feature value 0.4. How do mean and sum aggregation differ in their output?

Mean → 0.4 (same regardless of how many neighbors); Sum → 2.0 (= 5 × 0.4, encodes degree information)
Both give 0.4 since all neighbors are identical
Mean → 2.0; Sum → 0.4

Chapter 3: Inter-Layer Design

Between layers, the key design choices are how many layers to use and how they connect to each other. These choices interact heavily.

Layer connectivity patterns

Stack (GCN default)

h^(k+1) = GNN_k(h^(k)) — each layer only sees previous layer

↓ more expressive

Skip-Sum (ResGCN)

h^(k+1) = GNN_k(h^(k)) + h^(k): the layer's input is added back to its output via a residual connection

↓ most expressive

Skip-Concat (JK-Net)

h^final = concat(h^(1), ..., h^(K)) — all layers feed the final representation
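
The three patterns differ only in how layer outputs are combined, which a short sketch can make explicit. The "layer" below is a weight-free toy (each node averages itself and its neighbors), chosen only to keep the example self-contained; `mean_layer` and `forward` are hypothetical names, and a real GNN layer would also apply learned weights and a nonlinearity.

```python
def mean_layer(h, neighbors):
    """Toy GNN layer: each node averages itself and its neighbors (no weights)."""
    out = []
    for v, nbrs in enumerate(neighbors):
        group = [h[v]] + [h[u] for u in nbrs]
        dim = len(h[v])
        out.append([sum(x[i] for x in group) / len(group) for i in range(dim)])
    return out


def forward(h, neighbors, num_layers, connectivity):
    """Run num_layers toy layers with stack, skip_sum, or skip_cat connectivity."""
    outputs = []
    for _ in range(num_layers):
        new_h = mean_layer(h, neighbors)
        if connectivity == "skip_sum":
            # Residual: add the layer's input back to its output.
            new_h = [[a + b for a, b in zip(n, o)] for n, o in zip(new_h, h)]
        h = new_h
        outputs.append(h)
    if connectivity == "skip_cat":
        # Concatenate every layer's output for each node.
        return [sum((layer[v] for layer in outputs), []) for v in range(len(h))]
    return h


# Path graph 0 - 1 - 2 with one scalar feature per node.
neighbors = [[1], [0, 2], [1]]
h0 = [[1.0], [0.0], [0.0]]

stack_out = forward(h0, neighbors, num_layers=1, connectivity="stack")
skip_sum_out = forward(h0, neighbors, num_layers=1, connectivity="skip_sum")
skip_cat_out = forward(h0, neighbors, num_layers=3, connectivity="skip_cat")
print(stack_out[0])       # [0.5]
print(skip_sum_out[0])    # [1.5]
print(len(skip_cat_out[0]))  # 3: one entry per layer
```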

The depth vs. connectivity tradeoff: For shallow GNNs (L=2), all three connectivity types perform similarly. For deep GNNs (L=6+), skip connections become essential, because stacking layers without them causes oversmoothing: repeated neighborhood averaging drives node representations toward the same value. But the "best" connectivity type depends on the task: tasks requiring global information favor deeper models with skip connections; tasks requiring local structure favor shallow stacks without them.
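
Oversmoothing in a plain stack can be seen on a four-node path graph. The sketch below repeatedly applies weight-free mean propagation and reports the spread (max minus min) of the node features at several depths. It is a simplified illustration under stated assumptions: no learned weights, no nonlinearity, scalar features, and hypothetical helper names.

```python
def mean_propagate(h, neighbors):
    """One weight-free layer: each node averages itself and its neighbors."""
    return [
        (h[v] + sum(h[u] for u in nbrs)) / (1 + len(nbrs))
        for v, nbrs in enumerate(neighbors)
    ]


def spread_after(h, neighbors, num_layers):
    """Spread (max - min) of node features after num_layers stacked layers."""
    for _ in range(num_layers):
        h = mean_propagate(h, neighbors)
    return max(h) - min(h)


# Path graph 0 - 1 - 2 - 3; only node 0 starts with a nonzero feature.
neighbors = [[1], [0, 2], [1, 3], [2]]
h0 = [1.0, 0.0, 0.0, 0.0]

for depth in (0, 2, 4, 8):
    print(depth, round(spread_after(h0, neighbors, depth), 3))
```

The spread shrinks as depth grows, so after enough stacked layers the nodes become nearly indistinguishable. Skip connections counter this by keeping earlier, less-smoothed representations in play.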

Pre- and post-processing MLPs

The paper systematically studies adding MLP layers before the GNN (to transform input features) and after (to transform the final representation before classification). The finding: post-processing MLPs consistently help. Pre-processing MLPs help when input features are high-dimensional or noisy. This is now standard practice.

Interactive demo (Layer Connectivity: Gradient Flow): shows gradient magnitude at each layer during backpropagation for each connectivity mode, starting with Stack. Skip connections prevent vanishing gradients in deep GNNs.

Why do skip connections become important specifically for DEEP GNNs (L=6+), not shallow ones (L=2)?

Deep stacked GCNs suffer from oversmoothing and vanishing gradients; skip connections provide shortcut paths for gradient flow and preserve features from earlier (less-smoothed) layers
Skip connections increase the number of parameters, which helps with large graphs
Skip connections are only useful when using sum aggregation

Chapter 4: Training Configuration

Training hyperparameters — learning rate, batch size, optimizer, regularization — are often treated as an afterthought in GNN research. You et al. find they are not: for some tasks, training configuration matters as much as architecture.

Key findings on training

Choice	Options Tested	Finding
Optimizer	Adam, SGD	Adam consistently better for GNN tasks
Learning rate	0.1, 0.01, 0.001, 0.0001	Task-dependent; 0.01 often best for node cls.
L2 regularization	0.0 to 0.01	Small reg (1e-5) helps most tasks
Dropout rate	0.0 to 0.5	0.0–0.3 typical; depends on graph density
Batch norm	On/Off	Helps for deep GNNs; hurts for shallow

Batch normalization and GNNs: Standard batch norm normalizes across the batch dimension. In GNNs, this means normalizing across nodes in a mini-batch. This can disrupt the relative information between neighboring nodes — adjacent nodes that should be correlated get normalized independently. For this reason, batch norm helps less in GNNs than in standard deep learning, and should be used cautiously with shallow GNNs.
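
What "normalizing across nodes in a mini-batch" means can be shown with a single feature column. The sketch below is a minimal batch norm without the learned scale and shift, assuming scalar node features; `batch_norm` is an illustrative helper, not a library call. It shows that the same raw feature value maps to different normalized values depending on which other nodes share the batch.

```python
import statistics


def batch_norm(features, eps=1e-5):
    """Normalize one feature column across the nodes in a batch.

    No learned scale or shift; population variance, as at training time.
    """
    mean = statistics.fmean(features)
    var = statistics.pvariance(features)
    return [(x - mean) / (var + eps) ** 0.5 for x in features]


batch_a = [2.0, 4.0, 6.0, 8.0]     # node features in one mini-batch
batch_b = [4.0, 10.0, 12.0, 14.0]  # a different mini-batch containing 4.0 too

norm_a = batch_norm(batch_a)
norm_b = batch_norm(batch_b)

print([round(x, 3) for x in norm_a])
# The raw value 4.0 lands in different places depending on its batch:
print(round(norm_a[1], 3), round(norm_b[0], 3))
```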

The rank correlation metric

To assess which design choices matter, the authors use Kendall rank correlation: for each design dimension, fix all other choices randomly, vary this dimension, and measure how consistently the ranking of architectures changes. High rank correlation = this dimension matters a lot.

r_dim = E_{other choices}[ τ(acc, dim_choice) ]
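
The τ in this formula is Kendall's rank correlation, which compares how two score lists order the same items. A minimal pure-Python sketch of the tau-a variant follows; the accuracy numbers are made up for illustration, and `kendall_tau` is a hypothetical helper rather than the authors' code.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up accuracies of four configurations under two conditions.
acc_a = [0.71, 0.75, 0.80, 0.78]
acc_b = [0.60, 0.66, 0.70, 0.69]

tau_same = kendall_tau(acc_a, acc_b)            # same ordering
tau_reversed = kendall_tau(acc_a, acc_a[::-1])  # [0.78, 0.80, 0.75, 0.71]
tau_one_swap = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
print(tau_same)                # 1.0
print(round(tau_one_swap, 3))  # 0.667
```

A value near 1 means the two lists rank the configurations the same way; a value near 0 means the rankings are unrelated.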

Why might batch normalization help deep GNNs but hurt shallow ones?

Deep GNNs risk vanishing activations that batch norm stabilizes; shallow GNNs don't have this problem, and batch norm's cross-node normalization can distort local feature relationships
Shallow GNNs use larger batch sizes where batch norm becomes unstable
Batch normalization increases model depth, which helps deep models and hurts shallow ones

Chapter 5: 315K Experiments

The scale of this study is its most distinctive feature. You et al. define a design space with approximately 315,000 valid GNN configurations. They test across 32 tasks spanning node classification, link prediction, and graph classification.

What "315K" actually means: The full design space has ~315K configurations. They don't test all of them — that would require millions of GPU-hours. Instead, they use random sampling: for each task and each design dimension, randomly sample 96 configurations, measuring the rank correlation of each dimension. This statistical approach identifies which dimensions matter without exhaustive search.
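
The sampling procedure described above can be sketched as follows: fix all other dimensions at random values, vary only the dimension under study, rank its options within each sampled setup, and average the ranks. Everything here is an assumption for illustration. In particular `toy_score` is a hypothetical stand-in for training a model and reading off validation accuracy, and the option lists are abbreviated.

```python
import random

AGGREGATIONS = ["mean", "max", "sum"]  # the dimension under study
OTHER_DIMS = {                          # abbreviated, illustrative option lists
    "layers": [2, 4, 6, 8],
    "connectivity": ["stack", "skip_sum", "skip_cat"],
    "learning_rate": [0.1, 0.01, 0.001],
}


def toy_score(config):
    """Hypothetical stand-in for 'train the model, return validation accuracy'."""
    bonus = {"mean": 0.02, "max": 0.0, "sum": 0.05}[config["aggregation"]]
    base = 0.6 + 0.01 * config["layers"] - 0.5 * abs(config["learning_rate"] - 0.01)
    return base + bonus


def average_ranks(num_setups=96, seed=0):
    """Average rank of each aggregation option over randomly sampled setups."""
    rng = random.Random(seed)
    totals = {agg: 0 for agg in AGGREGATIONS}
    for _ in range(num_setups):
        # Fix every other dimension at a random value ...
        setup = {dim: rng.choice(opts) for dim, opts in OTHER_DIMS.items()}
        # ... then vary only the dimension under study.
        scores = {
            agg: toy_score({**setup, "aggregation": agg}) for agg in AGGREGATIONS
        }
        ranked = sorted(AGGREGATIONS, key=scores.get, reverse=True)
        for rank, agg in enumerate(ranked, start=1):
            totals[agg] += rank
    return {agg: total / num_setups for agg, total in totals.items()}


ranks = average_ranks()
print(ranks)  # {'mean': 2.0, 'max': 3.0, 'sum': 1.0}
```

Because the toy score always favors sum, the average ranks are perfectly separated. With real training runs the ranks would overlap, and the size of the gap between them is what signals how much a dimension matters.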

Most impactful design choices (rank correlation)

Design Choice	Impact (high = matters more)	Domain
Layer connectivity (skip/stack)	High	Inter-layer
Aggregation (mean/max/sum)	High for some tasks	Intra-layer
Number of layers	High	Inter-layer
Activation function	Medium	Intra-layer
Learning rate	High	Training
Message normalization	Medium	Intra-layer
Optimizer (Adam vs SGD)	Low-Medium	Training

The task-specific landscape: The paper shows that ranking of design choices is NOT consistent across tasks. For node classification on citation networks (homophilic), skip connections and deeper models are most important. For graph classification, aggregation function choice dominates. For link prediction, training configuration (LR, regularization) matters as much as architecture. This is the core finding: there is no universal best.

How do You et al. avoid having to test all 315,000+ GNN configurations exhaustively?

Random sampling with rank correlation statistics: sample a manageable subset, measure which design dimensions have high rank correlation with performance, identify what matters statistically
Bayesian optimization with a surrogate model
They test only the 96 most popular configurations from the literature

Chapter 6: Task-Specific Findings

The main message: match your GNN design to your task type. Here are the key patterns.

Node classification (homophilic graphs: Cora, Citeseer)

Skip connections (skip-cat) consistently help
Depth L=2–4 optimal (L=8 hurts)
Mean aggregation competitive with sum
Post-processing MLP with 2 layers helps
Batch norm helpful for L≥4

Node classification (heterophilic graphs)

Shallower models (L=1–2) often better — avoid oversmoothing across different-label nodes
Max aggregation can help (resists "averaging" with dissimilar neighbors)
Higher regularization needed

Graph classification

Sum aggregation strongly preferred (captures structural counts)
Deeper models (L=6+) with skip connections
Pre-processing MLP helps (to encode node features)
Global pooling (mean vs sum) matters as much as GNN design

The practical takeaway: Before building a GNN, ask: (1) Is my graph homophilic or heterophilic? (2) Am I doing node-level or graph-level prediction? (3) Does degree information matter (use sum) or average neighborhood (use mean)? Answering these three questions narrows the design space from 315K configurations to a few dozen promising candidates.
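
The three questions can be turned into a small lookup that narrows the search space. The sketch below encodes only the patterns listed in this chapter; `suggest_design` is a hypothetical helper, and the concrete depth values for "L=6+" (6 and 8) are an assumption.

```python
def suggest_design(level, homophilic=True, degree_matters=False):
    """Return candidate options per dimension, following this chapter's patterns.

    level: "node" or "graph". The output is a starting point for a search,
    not a guaranteed best configuration.
    """
    if level == "graph":
        design = {
            "aggregation": ["sum"],
            "layers": [6, 8],  # "L=6+" in the text; 6 and 8 are assumed values
            "connectivity": ["skip_sum", "skip_cat"],
            "pre_process_mlp": True,
        }
    elif level == "node" and homophilic:
        design = {
            "aggregation": ["mean", "sum"],
            "layers": [2, 4],
            "connectivity": ["skip_cat"],
            "post_process_mlp_layers": 2,
        }
    elif level == "node":
        design = {
            "aggregation": ["max"],
            "layers": [1, 2],
            "regularization": "higher",
        }
    else:
        raise ValueError(f"unknown prediction level: {level}")
    if degree_matters:
        # Sum is the only aggregator here that encodes neighbor counts.
        design["aggregation"] = ["sum"]
    return design


homophilic_node = suggest_design("node", homophilic=True)
heterophilic_node = suggest_design("node", homophilic=False)
graph_level = suggest_design("graph")
print(homophilic_node)
print(heterophilic_node)
print(graph_level)
```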

For graph classification tasks (predict a property of the entire graph), which aggregation function is usually preferred and why?

Sum aggregation, because graph-level tasks depend on counting structural patterns (how many triangles, what degree distribution), and sum captures these counts while mean erases them
Mean aggregation, because it is the most stable for variable-size graphs
Max aggregation, because it identifies the most prominent feature regardless of graph size

Chapter 7: Connections

This paper's contribution is methodological as much as empirical. It established a vocabulary and framework for discussing GNN design that the community now uses widely.

Related Work	Relation
GCN, GAT, GraphSAGE, GIN	Specific instantiations within the design space
JK-Net	One inter-layer design (skip-concat) studied as a dimension
SGC	Extreme point: linear activation + precomputed propagation
NAS for GNNs (GNAS, AutoGNN)	Automated search over the same design space
OGB benchmarks	Standard evaluation suite released around the same time, with a shared goal of consistent, comparable evaluation
Benchmarking GNNs (Dwivedi et al.)	Complementary: benchmarks on diverse datasets

GraphGym: Along with the paper, You et al. released GraphGym — an open-source platform implementing the design space as a modular codebase. Each design dimension is a configurable option. This makes it easy to reproduce results and run your own design space exploration on new tasks. It's become the standard starting point for systematic GNN experimentation.

Limitations and open questions

The design space is fixed: it doesn't include every GNN ever proposed (e.g., spectral methods are not covered, and graph transformers were only emerging in 2020).
Results on 32 tasks may not generalize to all tasks: heterogeneous graphs, temporal graphs, and molecular graphs may have different optimal designs.
The study is empirical, not theoretical — it tells you what works, not always why.

The lasting lesson: GNN research has a reproducibility and evaluation problem. This paper's greatest contribution is demonstrating that systematic, controlled experimentation can replace anecdotal "our model beats GCN on this one dataset." The discipline of defining a design space, fixing evaluation, and measuring rank correlations is now considered good practice in the field.