You, Ying, Leskovec — NeurIPS 2020 · arXiv 2011.08843

Design Space for Graph Neural Networks

315,000+ GNN configurations tested systematically. The finding: there is no universally best GNN. The right architecture depends on the task. But the framework for thinking about GNN design — now that's universal.

Prerequisites: GCN / GAT basics + what node classification is. That's it.
8 chapters · 3+ simulations · 315K+ configurations tested

Chapter 0: Which GNN to Use?

GCN, GraphSAGE, GAT, GIN, SGC, JK-Net, APPNP, PNA. For any new graph learning task, a researcher faces an overwhelming choice: which GNN architecture should I use? The answer from the literature is unsatisfying: "it depends."

Before this paper, "it depends" was a hand-wavy excuse. Different papers proposed different architectures, evaluated on different benchmarks, with different hyperparameter search budgets. Comparing them was almost meaningless — you couldn't tell if GAT beat GCN because of the attention mechanism or because of better tuning.

The evaluation crisis: A fair-comparison study by Errica et al. (2020) showed that, on graph classification benchmarks, several proposed graph neural networks fail to outperform simple structure-agnostic baselines (for example, an MLP on node features that ignores the graph) when hyperparameters are tuned fairly. The field was advancing architecture complexity without rigorous evaluation. You et al. decided to run the controlled experiment at scale: fix the evaluation protocol, enumerate all design choices systematically, and find what actually matters.

The key idea: treat GNN design as a design space — a structured collection of choices along defined dimensions. Run controlled experiments across all combinations. Measure which choices have high impact and which tasks they help. Replace "it depends" with "it depends on X, Y, Z, and here's why."

Why was comparing GNN architectures difficult before this paper?

Chapter 1: Design Dimensions

You et al. organize GNN design into three groups of choices, each with several specific dimensions:

Intra-Layer Design (within each GNN layer)

Inter-Layer Design (how layers connect)

Training Configuration

The combinatorial explosion: 5 aggregation types × 3 activation types × 3 skip connection types × 5 depth options × 3 post-processing options × 4 learning rate options × ... = 315,000+ configurations. No human can search this manually. You et al. use random search with ranking statistics (Friedman's test) to identify which dimensions have significant impact.
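To make the multiplication concrete, here is a minimal sketch that uses only the six option counts quoted above. The dimensions hidden behind the "..." are not filled in, so the result is a partial product, not the full 315K. The dictionary keys and the `hypothetical_switch` dimension are illustrative names, not identifiers from the paper.

```python
import math

# Option counts quoted in the text. The dimensions elided by "..." are
# deliberately left out, so this is only a partial product.
option_counts = {
    "aggregation": 5,
    "activation": 3,
    "skip_connection": 3,
    "depth": 5,
    "post_processing": 3,
    "learning_rate": 4,
}


def design_space_size(counts):
    """Number of distinct configurations: product of per-dimension option counts."""
    return math.prod(counts.values())


partial_size = design_space_size(option_counts)
print(partial_size)  # 2700

# Every extra dimension multiplies the total. A hypothetical on/off switch
# (not a dimension named in the text) doubles it.
with_switch = design_space_size({**option_counts, "hypothetical_switch": 2})
print(with_switch)  # 5400
```

Six dimensions already give thousands of configurations, and each additional dimension multiplies the count again, which is why exhaustive manual search is impractical.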
What are the three groups of GNN design choices studied in this paper?

Chapter 2: Intra-Layer Design (Showcase)

Within a single GNN layer, the key operation is: aggregate messages from neighbors, transform, apply nonlinearity. Each part is a design choice.

Message passing

For a directed edge (u→v), what does node u send to node v? Three options studied:

Identity: send h_u as-is. Simple, no normalization. Risk: high-degree nodes dominate.
Degree-norm: send h_u / d_u.
GCN-style: send h_u / √(d_u · d_v). Risk: peripheral nodes over-weighted.
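The three message options can be sketched as a single function. This is an illustrative version, assuming features are plain lists of floats and that d_u and d_v are node degrees; the function name `message` and the mode strings are made up for this example.

```python
import math


def message(h_u, d_u, d_v, mode="identity"):
    """Message sent along edge u -> v for a feature vector h_u (list of floats).

    d_u and d_v are the degrees of the sender and the receiver.
    """
    if mode == "identity":
        scale = 1.0                          # send h_u unchanged
    elif mode == "degree":
        scale = 1.0 / d_u                    # divide by the sender's degree
    elif mode == "gcn":
        scale = 1.0 / math.sqrt(d_u * d_v)   # symmetric GCN-style normalization
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [scale * x for x in h_u]


h_u = [2.0, 4.0]
print(message(h_u, d_u=4, d_v=1, mode="identity"))  # [2.0, 4.0]
print(message(h_u, d_u=4, d_v=1, mode="degree"))    # [0.5, 1.0]
print(message(h_u, d_u=4, d_v=1, mode="gcn"))       # [1.0, 2.0]
```

With a high-degree sender (d_u = 4), both normalized variants shrink the message, which is how they keep hub nodes from dominating.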

Neighbor aggregation

How to combine all received messages {m_{u→v}} into one vector?

Aggregation | Formula | Property
Mean | (1/|N(v)|) ∑_u m_{u→v} | Degree-invariant
Max | max_u m_{u→v} | Structure-sensitive
Sum | ∑_u m_{u→v} | Degree-aware, most expressive
[Interactive demo: Aggregation Methods, Live Comparison. Sliders set three neighbor values (defaults 0.8, 0.3, 0.6) to show how mean, max, and sum aggregation produce different node representations.]
Sum vs Mean, the key distinction: Mean is invariant to adding more neighbors with the same features. If all neighbors have feature value 0.5, mean gives 0.5 regardless of how many neighbors there are. Sum gives 0.5, 1.0, 1.5, ... for 1, 2, 3, ... such neighbors, so it counts them. For tasks where degree matters (how popular is this node?), sum is more expressive. For tasks where average neighborhood features matter, mean suffices. Xu et al. (2019), who introduced GIN, prove that sum aggregation can distinguish neighborhood multisets that mean (and max) aggregation cannot.
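A small sketch of the three aggregators on scalar messages, using the neighbor values 0.8, 0.3, 0.6 and then the constant-0.5 case described above. The helper name `aggregate` is an illustrative choice; real GNN layers apply the same reductions to vectors.

```python
def aggregate(messages, how):
    """Combine a list of scalar messages into one value."""
    if how == "mean":
        return sum(messages) / len(messages)
    if how == "max":
        return max(messages)
    if how == "sum":
        return sum(messages)
    raise ValueError(f"unknown aggregation: {how}")


neighbors = [0.8, 0.3, 0.6]
results = {how: aggregate(neighbors, how) for how in ("mean", "max", "sum")}
print(results)

# Constant features: mean stays at 0.5, sum grows with the neighbor count.
for n in (1, 2, 3):
    same = [0.5] * n
    print(n, aggregate(same, "mean"), aggregate(same, "sum"))
```

In the second loop, mean and max cannot tell one neighbor from three, while sum reports 0.5, 1.0, 1.5.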
A node has 5 neighbors all with the same feature value 0.4. How do mean and sum aggregation differ in their output?

Chapter 3: Inter-Layer Design

Between layers, the key design choices are how many layers to use and how they connect to each other. These choices interact heavily.

Layer connectivity patterns

Stack (GCN default): h^(k+1) = GNN_k(h^(k)). Each layer sees only the previous layer's output.
Skip-Sum (ResGCN): h^(k+1) = GNN_k(h^(k)) + h^(k). The layer's input is added back through a residual connection. More expressive than Stack.
Skip-Concat (JK-Net): h^final = concat(h^(1), ..., h^(K)). All layers feed the final representation. Most expressive of the three.
The depth vs. connectivity tradeoff: For shallow GNNs (L=2), all three connectivity types perform similarly. For deep GNNs (L=6+), skip connections become essential — stacking without skip connections causes oversmoothing. But the "best" connectivity type depends on the task: tasks requiring global information favor deeper + skip; tasks requiring local structure favor shallow + none.
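The sketch below illustrates the three connectivity patterns and the oversmoothing effect on a toy graph. It makes several assumptions: features are scalars, the "GNN layer" is plain mean aggregation over a node and its neighbors with no learned weights or nonlinearity, and skip-sum adds each layer's own input back to its output (the usual residual form). All function names are illustrative.

```python
def mean_layer(h, neighbors):
    """Toy GNN layer: every node averages itself and its neighbors."""
    out = []
    for v, nbrs in enumerate(neighbors):
        vals = [h[v]] + [h[u] for u in nbrs]
        out.append(sum(vals) / len(vals))
    return out


def run_stack(h, neighbors, num_layers):
    """Stack: each layer only sees the previous layer's output."""
    for _ in range(num_layers):
        h = mean_layer(h, neighbors)
    return h


def run_skip_sum(h, neighbors, num_layers):
    """Skip-sum: the layer input is added back to the layer output."""
    for _ in range(num_layers):
        out = mean_layer(h, neighbors)
        h = [o + x for o, x in zip(out, h)]
    return h


def run_skip_concat(h, neighbors, num_layers):
    """Skip-concat: the final representation keeps every layer's output."""
    per_layer = []
    for _ in range(num_layers):
        h = mean_layer(h, neighbors)
        per_layer.append(h)
    return [list(feats) for feats in zip(*per_layer)]  # one list per node


def spread(h):
    """Gap between the largest and smallest node feature."""
    return max(h) - min(h)


path = [[1], [0, 2], [1]]   # adjacency list of the path graph 0 - 1 - 2
h0 = [1.0, 0.0, 0.0]

shallow = run_stack(h0, path, 2)
deep = run_stack(h0, path, 6)
print(spread(shallow), spread(deep))
print(run_skip_concat(h0, path, 6)[0])  # node 0 keeps all six layer outputs
```

With plain stacking, the spread between node features shrinks as depth grows, so nodes become hard to tell apart. Skip-concat keeps the early, still-distinct layer outputs available to the final classifier.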

Pre- and post-processing MLPs

The paper systematically studies adding MLP layers before the GNN (to transform input features) and after (to transform the final representation before classification). The finding: post-processing MLPs consistently help. Pre-processing MLPs help when input features are high-dimensional or noisy. This is now standard practice.

[Interactive demo: Layer Connectivity, Gradient Flow. Shows gradient magnitude at each layer during backpropagation for the selected connectivity mode (default: Stack). Skip connections prevent vanishing gradients in deep GNNs.]
Why do skip connections become important specifically for DEEP GNNs (L=6+), not shallow ones (L=2)?

Chapter 4: Training Configuration

Training hyperparameters — learning rate, batch size, optimizer, regularization — are often treated as an afterthought in GNN research. You et al. find they are not: for some tasks, training configuration matters as much as architecture.

Key findings on training

Choice | Options Tested | Finding
Optimizer | Adam, SGD | Adam consistently better for GNN tasks
Learning rate | 0.1, 0.01, 0.001, 0.0001 | Task-dependent; 0.01 often best for node classification
L2 regularization | 0.0 to 0.01 | Small regularization (1e-5) helps most tasks
Dropout rate | 0.0 to 0.5 | 0.0–0.3 typical; depends on graph density
Batch norm | On/Off | Helps for deep GNNs; hurts for shallow
Batch normalization and GNNs: Standard batch norm normalizes across the batch dimension. In GNNs, this means normalizing across nodes in a mini-batch. This can disrupt the relative information between neighboring nodes — adjacent nodes that should be correlated get normalized independently. For this reason, batch norm helps less in GNNs than in standard deep learning, and should be used cautiously with shallow GNNs.
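To see what "normalizing across nodes in a mini-batch" does, here is a minimal sketch for scalar node features. It omits the learned scale and shift of a real batch norm layer, and the two mini-batches are made-up numbers chosen only for illustration.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize scalar node features across the nodes of one mini-batch.

    No learned scale/shift and no running statistics: just the
    subtract-mean, divide-by-std step.
    """
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]


# The first node has feature 1.0 in both hypothetical mini-batches.
batch_a = [1.0, 1.2, 0.8]
batch_b = [1.0, 5.0, 9.0]

out_a = batch_norm(batch_a)[0]
out_b = batch_norm(batch_b)[0]
print(out_a, out_b)
```

The same input feature maps to roughly 0 in one batch and to a clearly negative value in the other, because the output depends on which other nodes happen to share the batch, not on the node's graph neighbors.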

The rank correlation metric

To assess which design choices matter, the authors use Kendall rank correlation: for each design dimension, fix all other choices randomly, vary this dimension, and measure how consistently the ranking of architectures changes. High rank correlation = this dimension matters a lot.

r_dim = E_{other choices}[ τ(acc, dim_choice) ]
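As a rough illustration of this formula, the sketch below computes Kendall's τ (the simple tau-a variant) between a dimension's choice and the resulting accuracy, then averages over several random settings of the other dimensions. It assumes the choice can be encoded as a number (for example depth); the accuracies are invented for the example, and `dimension_score` is a hypothetical helper name.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    pairs = list(combinations(range(len(xs)), 2))
    for i, j in pairs:
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


def dimension_score(runs):
    """Average tau over runs; each run fixes the other choices differently.

    runs: list of (choice_values, accuracies) pairs.
    """
    taus = [kendall_tau(choices, accs) for choices, accs in runs]
    return sum(taus) / len(taus)


# Hypothetical accuracies for depth 2, 4, 6 under two settings of the
# remaining dimensions.
runs = [
    ([2, 4, 6], [0.70, 0.75, 0.80]),  # accuracy rises with depth: tau = 1
    ([2, 4, 6], [0.72, 0.78, 0.74]),  # one discordant pair: tau = 1/3
]
score = dimension_score(runs)
print(score)
```

A score near 1 or -1 means the dimension orders the results consistently whatever the other choices are; a score near 0 means its effect is washed out by them.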
Why might batch normalization help deep GNNs but hurt shallow ones?

Chapter 5: 315K Experiments

The scale of this study is its most distinctive feature. You et al. define a design space with approximately 315,000 valid GNN configurations. They test across 32 tasks spanning node classification, link prediction, and graph classification.

What "315K" actually means: The full design space has ~315K configurations. They don't test all of them — that would require millions of GPU-hours. Instead, they use random sampling: for each task and each design dimension, randomly sample 96 configurations, measuring the rank correlation of each dimension. This statistical approach identifies which dimensions matter without exhaustive search.
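The sampling idea can be sketched as follows: draw a random setting of every other dimension, try each option of the dimension under study in that setting, and rank the options. Averaging ranks over 96 draws summarizes how the options compare. The design space, the `toy_evaluate` scoring function, and all names below are hypothetical stand-ins; in practice `evaluate` would train and validate a model.

```python
import random
from statistics import mean


def average_ranks(space, dim, evaluate, n_setups=96, seed=0):
    """Average rank (1 = best) of each option of `dim` across random setups."""
    rng = random.Random(seed)
    others = [d for d in space if d != dim]
    ranks = {opt: [] for opt in space[dim]}
    for _ in range(n_setups):
        # Fix all other dimensions at random, then vary only `dim`.
        base = {d: rng.choice(space[d]) for d in others}
        scores = {opt: evaluate({**base, dim: opt}) for opt in space[dim]}
        # Higher score is better; ties keep the original option order.
        ordered = sorted(space[dim], key=lambda o: scores[o], reverse=True)
        for rank, opt in enumerate(ordered, start=1):
            ranks[opt].append(rank)
    return {opt: mean(rs) for opt, rs in ranks.items()}


# Hypothetical miniature design space.
space = {
    "aggregation": ["mean", "max", "sum"],
    "layers": [2, 4, 6],
    "lr": [0.1, 0.01],
}


def toy_evaluate(config):
    """Made-up score standing in for validation accuracy."""
    bonus = {"sum": 0.03, "mean": 0.02, "max": 0.01}[config["aggregation"]]
    return 0.7 + bonus + 0.01 * config["layers"] - config["lr"]


ranks = average_ranks(space, "aggregation", toy_evaluate)
print(ranks)
```

Because every option is compared under the same randomly drawn setting, differences in rank reflect the dimension itself instead of lucky tuning of the other choices.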

Most impactful design choices (rank correlation)

Design Choice | Impact (high = matters more) | Domain
Layer connectivity (skip/stack) | High | Inter-layer
Aggregation (mean/max/sum) | High for some tasks | Intra-layer
Number of layers | High | Inter-layer
Activation function | Medium | Intra-layer
Learning rate | High | Training
Message normalization | Medium | Intra-layer
Optimizer (Adam vs SGD) | Low-Medium | Training
The task-specific landscape: The paper shows that ranking of design choices is NOT consistent across tasks. For node classification on citation networks (homophilic), skip connections and deeper models are most important. For graph classification, aggregation function choice dominates. For link prediction, training configuration (LR, regularization) matters as much as architecture. This is the core finding: there is no universal best.
How do You et al. avoid having to test all 315,000+ GNN configurations exhaustively?

Chapter 6: Task-Specific Findings

The main message: match your GNN design to your task type. Here are the key patterns.

Node classification (homophilic graphs: Cora, Citeseer)

Node classification (heterophilic graphs)

Graph classification

The practical takeaway: Before building a GNN, ask: (1) Is my graph homophilic or heterophilic? (2) Am I doing node-level or graph-level prediction? (3) Does degree information matter (use sum), or do average neighborhood features matter (use mean)? Answering these three questions narrows the design space from 315K configurations to a few dozen promising candidates.
For graph classification tasks (predict a property of the entire graph), which aggregation function is usually preferred and why?

Chapter 7: Connections

This paper's contribution is methodological as much as empirical. It established a vocabulary and framework for discussing GNN design that the community now uses widely.

Related Work | Relation
GCN, GAT, GraphSAGE, GIN | Specific instantiations within the design space
JK-Net | One inter-layer design (skip-concat) studied as a dimension
SGC | Extreme point: linear activation + precomputed propagation
NAS for GNNs (GNAS, AutoGNN) | Automated search over the same design space
OGB benchmarks | Standardized evaluation suite with the same motivation: fair, comparable evaluation
Benchmarking GNNs (Dwivedi et al.) | Complementary: benchmarks on diverse datasets
GraphGym: Along with the paper, You et al. released GraphGym — an open-source platform implementing the design space as a modular codebase. Each design dimension is a configurable option. This makes it easy to reproduce results and run your own design space exploration on new tasks. It's become the standard starting point for systematic GNN experimentation.

Limitations and open questions

The lasting lesson: GNN research has a reproducibility and evaluation problem. This paper's greatest contribution is demonstrating that systematic, controlled experimentation can replace anecdotal "our model beats GCN on this one dataset." The discipline of defining a design space, fixing evaluation, and measuring rank correlations is now considered good practice in the field.