Wu, Zhang, Souza Jr., Fifty, Yu, Weinberger — ICML 2019 · arXiv 1902.07153

SGC: The Simplest Graph Neural Network

What happens if you remove every nonlinearity from a GCN? You get one matrix multiplication on propagated features — and it's nearly as good as the full model. Sometimes simplicity is the feature, not a bug.

Prerequisites: GCN basics + matrix multiplication. That's it.

Chapter 0: The Problem

Graph Convolutional Networks (GCNs) are powerful — but they have a dirty secret: they're slow to train, tricky to tune, and hard to analyze theoretically. The reason is their complexity: multiple layers with nonlinear activations stacked on top of each other.

A standard GCN (Kipf & Welling, 2017) looks like this: start with features X, then at each of K layers multiply by the normalized adjacency Â, multiply by a learnable weight matrix W, and apply ReLU (the final layer uses softmax instead). Written out for K = 3:

H⁽¹⁾ = ReLU(Â X W⁽⁰⁾)

H⁽²⁾ = ReLU(Â H⁽¹⁾ W⁽¹⁾)

Ŷ = softmax(Â H⁽²⁾ W⁽²⁾)

For node classification, we train all W^(k) end-to-end. The problem: every training step must propagate through the entire graph AND backpropagate through K layers. For large graphs, this is the bottleneck.
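To make the layer recipe concrete, here is a minimal NumPy sketch of the three-layer forward pass above. The toy path graph, the hidden width of 8, and the helper names (`normalized_adjacency`, `gcn_forward`) are illustrative assumptions, not code from the paper.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^-1/2 (A + I) D^-1/2 for a dense 0/1 adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)          # subtract row max for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def gcn_forward(A_hat, X, weights):
    """ReLU after every layer except the last; softmax on the final logits."""
    H = X
    for W in weights[:-1]:
        H = np.maximum(A_hat @ H @ W, 0.0)        # propagate, transform, ReLU
    return softmax(A_hat @ H @ weights[-1])

# Toy inputs (hypothetical): a 5-node path graph, 4 features, 3 classes.
rng = np.random.default_rng(0)
A = np.diag(np.ones(4), k=1)
A = A + A.T
A_hat = normalized_adjacency(A)
X = rng.normal(size=(5, 4))
weights = [rng.normal(size=(4, 8)),               # W(0)
           rng.normal(size=(8, 8)),               # W(1)
           rng.normal(size=(8, 3))]               # W(2)

Y_hat = gcn_forward(A_hat, X, weights)
print(Y_hat.shape)
```

Note that Â appears inside every layer, so each forward and backward pass touches the graph K times.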

The combinatorial explosion: In a GCN with K layers, computing gradients for node v requires fetching the K-hop neighborhood of v. In many real graphs, the 2-hop neighborhood contains thousands of nodes, and the 3-hop neighborhood can cover the entire graph. Full-batch training becomes infeasible for graphs with millions of nodes.
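A quick way to see the growth is to count K-hop neighborhoods directly. The sketch below uses a breadth-first search on a small complete binary tree; the tree and the function names are hypothetical, chosen only because the neighborhood roughly doubles with each extra hop.

```python
from collections import deque

def k_hop_neighborhood(adj, source, k):
    """Return the set of nodes within k hops of source (source included)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue                      # do not expand beyond k hops
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen

def binary_tree(n):
    """Adjacency lists for a complete binary tree: node i has children 2i+1, 2i+2."""
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                adj[i].append(child)
                adj[child].append(i)
    return adj

adj = binary_tree(15)
sizes = [len(k_hop_neighborhood(adj, 0, k)) for k in range(4)]
print(sizes)  # [1, 3, 7, 15]
```

After three hops from the root, the neighborhood already covers the whole 15-node graph. Real graphs with high-degree hubs saturate even faster.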

Wu et al. asked a radical question: what if the nonlinearities between layers aren't actually helping? What if we can collapse the entire network into a single linear operation — and still get competitive results?

What makes multi-layer GCN training expensive on large graphs?

Computing gradients for node v requires fetching its K-hop neighborhood, which grows exponentially with depth and can cover the entire graph
GCNs require GPU memory proportional to the square of the number of nodes
Softmax in the final layer is slow to compute for large graphs

Chapter 1: Collapsing Layers

Here's the key observation. In a K-layer GCN, what would happen if we removed all the ReLU activations between layers, keeping only the final softmax?

Start from a two-layer GCN and drop the inner ReLU:

Ŷ = softmax( Â · (ReLU(Â X W⁽⁰⁾)) · W⁽¹⁾ )

↓ remove inner ReLU

Ŷ = softmax( Â · (Â X W⁽⁰⁾) · W⁽¹⁾ )

= softmax( Â² X W⁽⁰⁾ W⁽¹⁾ )

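Because matrix multiplication is associative and a composition of linear maps is itself linear, the weight matrices collapse into a single matrix, here W = W⁽⁰⁾W⁽¹⁾. The same argument applies at any depth: a K-layer network with no intermediate nonlinearities has W = W⁽⁰⁾W⁽¹⁾⋯W⁽ᴷ⁻¹⁾ and is exactly equivalent to: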

Ŷ = softmax( Â^K X W )

This is the Simple Graph Convolution (SGC). One weight matrix W, applied to features that have been propagated through the graph K times.
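The collapse is easy to check numerically. The sketch below compares the pre-softmax logits of a two-layer linear GCN with the collapsed form Â²X(W⁽⁰⁾W⁽¹⁾), and confirms that putting the ReLU back breaks the equality. The 4-node graph and random weights are arbitrary test inputs, not anything from the paper.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^-1/2 (A + I) D^-1/2 for a dense 0/1 adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
A_hat = normalized_adjacency(A)
X = rng.normal(size=(4, 5))
W0 = rng.normal(size=(5, 6))
W1 = rng.normal(size=(6, 3))

# Two linear layers applied one after the other (logits before softmax).
layered = A_hat @ (A_hat @ X @ W0) @ W1
# Collapsed form: propagate twice, then apply the single matrix W = W0 W1.
collapsed = np.linalg.matrix_power(A_hat, 2) @ X @ (W0 @ W1)
# Same network with the inner ReLU restored.
with_relu = A_hat @ np.maximum(A_hat @ X @ W0, 0.0) @ W1

print(np.allclose(layered, collapsed))    # the linear network collapses
print(np.allclose(with_relu, collapsed))  # the ReLU network does not
```

Since softmax is applied identically to both sets of logits, equal logits imply equal predictions.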

The separation principle: SGC separates two distinct operations that GCN conflates: (1) feature propagation — spreading information through the graph by repeatedly multiplying by Â — and (2) feature transformation — learning a linear classifier from the propagated features. By separating them, we can precompute the propagation once, then train a simple linear classifier. This is much faster.

What algebraic property allows a K-layer linear GCN to collapse into a single matrix multiplication?

Matrix multiplication is associative and the composition of linear functions is linear, so W^(0)W^(1)...W^(K-1) is a single matrix W
The normalized adjacency matrix Â is symmetric
Softmax commutes with matrix multiplication

Chapter 2: The SGC Model (Showcase)

SGC in full: precompute Â^KX — the K-step propagated features — then train a single linear classifier (logistic regression) on top.

S = Â^K X (precompute once, offline)

Ŷ = softmax( S W )

Where Â = D̃^-1/2(A + I)D̃^-1/2 is the normalized adjacency with self-loops, X is the n×d feature matrix, and W is a d×c weight matrix (c = number of classes).
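As a sanity check on the definitions, here is a small sketch that builds Â from a dense adjacency matrix and evaluates the SGC prediction softmax(Â^K X W). Dense NumPy arrays are an assumption made for readability; the function names are hypothetical.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^-1/2 (A + I) D^-1/2, where D is the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])              # A + I: add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def sgc_predict(A, X, W, K):
    """Class probabilities softmax(A_hat^K X W) for every node."""
    S = np.linalg.matrix_power(normalized_adjacency(A), K) @ X
    return softmax(S @ W)

# Smallest non-trivial case: two nodes joined by one edge.
# A + I is all ones, both degrees are 2, so every entry of A_hat is 1/2.
A_pair = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
A_hat_pair = normalized_adjacency(A_pair)
print(A_hat_pair)

# Shapes: X is n x d, W is d x c, predictions are n x c.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
W = np.zeros((3, 4))
Y_hat = sgc_predict(A_pair, X, W, K=2)
print(Y_hat.shape)
```

With W set to zero every class gets probability 1/c, which is a convenient check that the shapes line up before any training happens.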

[Interactive demo: SGC Feature Propagation. Features spread across a small graph as the number of propagation steps K increases. Node color = feature value (warm = high, teal = low). The propagated features are what SGC trains on.]

The key precomputation: Computing Â^KX for a sparse graph with n nodes and m edges takes O(Km·d) time — linear in the number of edges. This happens once before training. Then training the linear classifier W on the fixed features S is just logistic regression: O(nd·c) per epoch. No graph structure needed during training at all.
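The two phases can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions: a 6-node toy graph, made-up features, plain full-batch gradient descent, and no bias term or regularization. The paper's actual training setup is not reproduced here.

```python
import numpy as np

def normalized_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_linear_classifier(S, y, num_classes, lr=1.0, epochs=200):
    """Softmax regression on fixed features S. The graph is never touched here."""
    n, d = S.shape
    Y = np.eye(num_classes)[y]                    # one-hot labels
    W = np.zeros((d, num_classes))
    losses = []
    for _ in range(epochs):
        P = softmax(S @ W)
        losses.append(-np.mean(np.log(P[np.arange(n), y])))
        W -= lr * S.T @ (P - Y) / n               # gradient of mean cross-entropy
    return W, losses

# Toy graph (hypothetical): two triangles, one per class, joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
y = np.array([0, 0, 0, 1, 1, 1])
X = np.array([[1.0, 0.2], [0.8, 0.1], [0.6, 0.5],
              [0.4, 0.7], [0.1, 0.9], [0.2, 1.0]])

# Phase 1: propagate once, before training (K = 2).
S = np.linalg.matrix_power(normalized_adjacency(A), 2) @ X
# Phase 2: ordinary logistic regression on S.
W, losses = train_linear_classifier(S, y, num_classes=2)
print(losses[0], losses[-1])
```

Everything inside `train_linear_classifier` works on S and y alone, which is why each epoch costs O(nd·c) regardless of how many edges the graph has.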

In SGC, when does the graph structure (adjacency matrix) need to be accessed during training?

Never: the graph is used only once to precompute Â^K X, after which training is pure logistic regression on fixed features
Every forward pass, to normalize the node features
Only during backpropagation to compute graph-aware gradients

Chapter 3: Feature Propagation

What does multiplying by Â once actually do to node features? Node i's new feature vector becomes a degree-normalized weighted sum of its own features and its neighbors' features, which behaves like a local average:

X̄_i = D̃_ii^-1/2 ∑_{j ∈ N(i) ∪ {i}} D̃_jj^-1/2 X_j

After K steps, each node aggregates information from its K-hop neighborhood. A node in a tight cluster will quickly converge to the cluster's average features. A node on the periphery will slowly pull information from afar.
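The per-node sum and the matrix product ÂX are the same computation written two ways. The sketch below implements both on a small star graph (a hypothetical example with one hub and three leaves) and checks that they agree.

```python
import numpy as np

def propagate_node(A, X, i):
    """One propagation step for node i, written as the explicit neighbor sum."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)                     # degrees including the self-loop
    neighbors = np.nonzero(A_tilde[i])[0]         # N(i) together with i itself
    total = sum(X[j] / np.sqrt(deg[j]) for j in neighbors)
    return total / np.sqrt(deg[i])

def propagate_all(A, X):
    """The same step for every node at once, as A_hat @ X."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return A_hat @ X

# Star graph: node 0 is the hub, nodes 1..3 are leaves.
A = np.zeros((4, 4))
A[0, 1:] = A[1:, 0] = 1.0
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0],
              [0.0, 1.0]])

per_node = np.stack([propagate_node(A, X, i) for i in range(4)])
print(np.allclose(per_node, propagate_all(A, X)))
print(per_node[0])   # hub: its own feature is diluted by the three leaves
```

The hub's second coordinate comes out slightly above 1, a reminder that the symmetric normalization is a weighted sum rather than a strict average.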

K as a hyperparameter: K controls the "locality" of the learned representation. K=1 means each node only sees its immediate neighbors. K=2 means 2-hop neighborhoods (neighbors of neighbors). K=3 and beyond: most nodes in a connected graph start seeing almost everyone else's features. In practice, K=2 is the sweet spot for most citation and social network benchmarks.

The smoothing interpretation

Repeated multiplication by Â is a low-pass filter on the graph signal. Each application smooths the features — making nearby nodes more similar. This is helpful when nearby nodes share labels (homophily), and harmful when they don't (heterophily).

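With large K, node features stop being distinguishable: on a connected graph, Â^K converges to a rank-one matrix, so every node's feature vector collapses onto the same direction, scaled only by the square root of its degree (counting the self-loop). On a connected regular graph, all nodes end up with exactly the same vector. This is oversmoothing, one reason deep GCNs cannot simply add more layers. SGC sidesteps the training problem of deep GCNs but cannot escape oversmoothing at large K.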
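Oversmoothing is easy to observe on a toy graph. The sketch below propagates random features over two triangles joined by one edge and tracks how far apart the nodes remain. Because this graph is not regular, rows are divided by the square root of each node's degree (self-loop included) before comparing; that rescaling is the only difference from the regular-graph case. The graph, seed, and K values are arbitrary choices.

```python
import numpy as np

def normalized_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
A_hat = normalized_adjacency(A)
deg = (A + np.eye(6)).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))

spreads = {}
for K in (1, 2, 8, 200):
    S = np.linalg.matrix_power(A_hat, K) @ X
    rescaled = S / np.sqrt(deg)[:, None]          # remove the degree scaling
    # Largest gap between any two nodes, over all feature columns.
    spreads[K] = (rescaled.max(axis=0) - rescaled.min(axis=0)).max()
    print(K, spreads[K])
```

By K = 200 the rescaled rows are numerically identical, so no classifier trained on S could tell the two triangles apart.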

Spectral View: Low-Pass Filtering

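The eigenvalues of Â lie in (-1, 1]. Repeated multiplication by Â raises each eigenvalue to the K-th power, so every component with |λ| < 1 shrinks, fastest for eigenvalues near 0, while the lowest-frequency component (eigenvalue 1) survives unchanged. The highest frequencies correspond to the most negative eigenvalues; these decay slowly if they sit close to -1, which is why the self-loops matter: Wu et al. show that adding them shrinks the spectrum and pulls those eigenvalues away from -1.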
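The filter response can be read off directly from the spectrum: a component with eigenvalue λ is scaled by λ^K after K propagation steps. The sketch below computes the eigenvalues of Â for a small toy graph (two triangles joined by an edge, an arbitrary example) and prints λ^K for a few values of K.

```python
import numpy as np

def normalized_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
A_hat = normalized_adjacency(A)

# A_hat is symmetric, so eigvalsh applies; eigenvalues come back in ascending order.
eigvals = np.linalg.eigvalsh(A_hat)
print("eigenvalues:", np.round(eigvals, 4))

for K in (1, 2, 4, 8):
    # Scaling applied to each spectral component after K propagation steps.
    print(K, np.round(eigvals ** K, 4))
```

Only the component with eigenvalue 1 keeps its full magnitude; every other column of the printout shrinks toward zero as K grows.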

What happens to node features if you increase K very large in SGC?

Oversmoothing: all nodes converge to similar feature vectors, losing the local structure that makes different nodes distinguishable
Features become more discriminative because each node sees its entire neighborhood
The graph becomes fully connected in feature space

Chapter 4: Computational Savings

The computational advantage of SGC is substantial. Let's compare directly.

Operation	GCN (K layers)	SGC
Graph propagation	Every forward pass: O(Kmd)	Once: O(Kmd)
Training complexity	O(Kmd·c) per epoch	O(nd·c) per epoch
Memory	All layer activations stored for backprop	Only S and W stored
Hyperparameters to tune	LR, dropout, hidden dims, L2 reg, K	LR, L2 reg, K
Training (Cora, 200 epochs)	~5 sec	~0.1 sec

50x speedup on Cora: Wu et al. report that SGC is 40–50x faster to train than a 2-layer GCN on the Cora dataset. On large graphs like Reddit (232K nodes, 11.6M edges), SGC trains in seconds where minibatch GCN takes minutes. The precomputation of Â^KX itself is fast because Â is sparse — it's just K sparse matrix-dense matrix products.

The precomputation trick

The crucial insight: Â^KX can be computed iteratively:

X⁽⁰⁾ = X

Original node features (n × d)

↓ one sparse mat-mat multiply

X⁽¹⁾ = Â X⁽⁰⁾

1-hop aggregated features

↓ one more sparse mat-mat multiply

X⁽²⁾ = Â X⁽¹⁾

2-hop aggregated features

↻ repeat K times, then stop

S = X⁽ᴷ⁾

Fixed features for logistic regression
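The loop in the diagram above can be sketched as follows. Dense NumPy arrays are used here only to keep the example short; in practice Â would be stored in a sparse format so that each step costs O(m·d). The function name `precompute_sgc_features` is hypothetical.

```python
import numpy as np

def normalized_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def precompute_sgc_features(A_hat, X, K):
    """Compute A_hat^K X with K matrix-matrix products, never forming A_hat^K."""
    S = X
    for _ in range(K):
        S = A_hat @ S          # X(k+1) = A_hat X(k)
    return S

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
A_hat = normalized_adjacency(A)
X = np.arange(12, dtype=float).reshape(6, 2)

S = precompute_sgc_features(A_hat, X, K=3)
# Reference: form the matrix power explicitly, then multiply.
reference = np.linalg.matrix_power(A_hat, 3) @ X
print(np.allclose(S, reference))
```

The iterative order matters on large graphs: Â^K is generally much denser than Â, whereas each intermediate X⁽ᵏ⁾ stays n × d.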

Why does SGC's training complexity drop from O(Kmd·c) to O(nd·c) per epoch compared to GCN?

Graph propagation (accessing adjacency Â) is done once in precomputation, so each training epoch only updates W with no graph access needed
SGC uses a smaller weight matrix than GCN
SGC skips the softmax computation

Chapter 5: Results

Wu et al. tested SGC on five benchmarks: three citation networks (Cora, Citeseer, Pubmed), one Reddit post-classification dataset, and a 20-newsgroup text classification task.

Dataset	GCN Acc.	SGC Acc.	Speedup
Cora	81.5%	81.0%	~45x
Citeseer	70.3%	71.9%	~60x
Pubmed	79.0%	78.9%	~40x
Reddit	93.3%	94.9%	Large
20news	—	88.5%	Baseline

Near-parity at a fraction of the cost: On Citeseer, SGC actually outperforms GCN (71.9% vs 70.3%). On Reddit, SGC beats GCN (94.9% vs 93.3%) while being orders of magnitude faster. The conclusion is striking: on these homophilic citation and social graphs, the nonlinearities in GCN add little measurable accuracy. Most of the work is done by the feature propagation step.

The Reddit result is particularly meaningful. Reddit is a large graph (232,965 nodes, 11.6M edges) with rich feature vectors. GCN with minibatch training takes ~177 seconds per epoch. SGC precomputes Â^KX in ~177 seconds total, then trains in seconds per epoch.

On which dataset does SGC actually outperform GCN in accuracy?

Cora (81.5% → 81.0%: GCN wins)
Citeseer (70.3% → 71.9%: SGC wins) and Reddit (93.3% → 94.9%: SGC wins)
Pubmed (79.0% → 78.9%: GCN wins)

Chapter 6: When Linearity Suffices

SGC's strong performance raises the question: when do nonlinearities actually matter in GCNs? The answer depends critically on the structure of the data.

Homophily is the key: In a homophilic graph, connected nodes tend to have the same label (friends are similar, academic papers cite similar papers). In this setting, spreading features through the graph already "clusters" nodes by label — and a linear classifier can easily separate the resulting clusters. Nonlinearities aren't needed when the graph structure does the work.
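One common way to put a number on homophily is the fraction of edges whose endpoints share a label. The text does not name a specific metric, so treat this edge-homophily ratio as an assumed choice; the two toy graphs are hypothetical.

```python
def edge_homophily(edges, labels):
    """Fraction of edges that connect two nodes with the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# Two triangles (one per class) joined by a single cross-class edge.
homophilic_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
homophilic_labels = [0, 0, 0, 1, 1, 1]

# A 4-cycle in which every edge crosses classes.
heterophilic_edges = [(0, 2), (0, 3), (1, 2), (1, 3)]
heterophilic_labels = [0, 0, 1, 1]

h_high = edge_homophily(homophilic_edges, homophilic_labels)
h_low = edge_homophily(heterophilic_edges, heterophilic_labels)
print(h_high, h_low)
```

The first graph scores 6/7 and the second scores 0. Values near 1 are the regime where propagation alone tends to cluster nodes by label.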

When SGC fails

SGC struggles in three scenarios:

Heterophily: Nodes connect to dissimilar nodes (e.g., protein-protein interaction with cross-class connections). Spreading features across such edges mixes rather than separates classes.
Deep hierarchies: Tasks requiring multi-level abstractions (community detection, hierarchical classification) that genuinely need nonlinear transformations at each hop.
Edge features: When edge weights or types matter — SGC propagates uniformly by default.
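The heterophily failure can be reproduced on a toy example. The sketch below gives each node a scalar feature of +1 or -1 according to its class, propagates it for K = 2 steps, and measures the gap between the two class means. Both graphs and the `class_gap` helper are illustrative assumptions.

```python
import numpy as np

def normalized_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def adjacency_from_edges(n, edges):
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    return A

def class_gap(A, x, y, K):
    """Difference between class means of a scalar feature after K propagation steps."""
    s = np.linalg.matrix_power(normalized_adjacency(A), K) @ x
    return s[y == 0].mean() - s[y == 1].mean()

# Homophilic: two triangles (one per class) joined by one edge.
A_homo = adjacency_from_edges(
    6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
y_homo = np.array([0, 0, 0, 1, 1, 1])

# Heterophilic: a 4-cycle in which every edge crosses classes.
A_hetero = adjacency_from_edges(4, [(0, 2), (0, 3), (1, 2), (1, 3)])
y_hetero = np.array([0, 0, 1, 1])

# Perfectly informative starting feature: +1 for class 0, -1 for class 1.
x_homo = np.where(y_homo == 0, 1.0, -1.0)
x_hetero = np.where(y_hetero == 0, 1.0, -1.0)

gap_homo = class_gap(A_homo, x_homo, y_homo, K=2)
gap_hetero = class_gap(A_hetero, x_hetero, y_hetero, K=2)
print(gap_homo, gap_hetero)
```

Both gaps start at 2. After two steps the heterophilic graph keeps only 2/9 of that separation, while the homophilic graph keeps well over half, so any feature noise hurts far more in the heterophilic case.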

Graph Type	SGC Performance	Better Alternative
Homophilic citation	Excellent (matches GCN)	—
Social networks (Reddit)	Excellent	—
Heterophilic (chameleon, squirrel)	Poor	H2GCN, GPRGNN
Large-scale inductive	Good (with precompute)	GraphSAGE, SIGN

In which type of graph does SGC perform poorly, and why?

Heterophilic graphs, where connected nodes have different labels, because SGC's feature smoothing mixes rather than separates classes
Dense graphs, because Â^K becomes expensive to compute
Directed graphs, because Â assumes undirected adjacency

Chapter 7: Connections

SGC is simultaneously a practical tool and a theoretical lens. Its simplicity lets us reason rigorously about what GCNs are actually doing.

Method	Key Idea	Relation to SGC
GCN	Nonlinear layered propagation	SGC = GCN without intermediate ReLUs
APPNP	Personalized PageRank propagation	Also decouples propagation from transformation, but propagates MLP predictions with personalized PageRank instead of Â^K
SIGN	Multiple propagation operators concatenated	SGC with multi-scale precomputation
GraphSAGE	Sampled neighborhood aggregation	SGC propagates over full neighborhoods, with no sampling
JK-Net	Aggregate features from all K layers	SGC uses only layer K; JK-Net uses all
Label Prop.	Diffuse labels instead of features	Same diffusion operator Â, different signal

SGC as ablation: One of SGC's lasting contributions is methodological. By showing that a linear model matches GCN, it established a principled baseline. Any new GCN variant should now answer: does it beat SGC? If it can't, its extra nonlinearity and depth aren't helping on that dataset. This shifted the community toward understanding when each component helps, not just adding more complexity.

The SIGN extension

SIGN (Scalable Inception Graph Neural Networks, Frasca et al. 2020) extends SGC by precomputing and concatenating features at multiple scales: [X, ÂX, Â²X, ..., Â^KX]. This gives a richer multi-scale feature that a linear (or shallow MLP) classifier can leverage. SIGN keeps the fast precomputation of SGC while recovering some of the expressiveness of deep GCNs.
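The multi-scale concatenation described above takes only a small change to the SGC precomputation loop: keep every intermediate result instead of just the last one. This sketch covers only that concatenation step with a single operator Â; it is not a reproduction of the full SIGN architecture, and `sign_features` is a hypothetical name.

```python
import numpy as np

def normalized_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def sign_features(A_hat, X, K):
    """Concatenate [X, A_hat X, ..., A_hat^K X] along the feature axis."""
    blocks = [X]
    S = X
    for _ in range(K):
        S = A_hat @ S
        blocks.append(S)       # keep every scale, not only the last
    return np.concatenate(blocks, axis=1)

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
A_hat = normalized_adjacency(A)
X = np.arange(12, dtype=float).reshape(6, 2)

F = sign_features(A_hat, X, K=2)
print(F.shape)   # (6, 6): three blocks of d = 2 columns each
```

The last d columns of F are exactly the SGC features Â^K X, so a classifier trained on F can always fall back to SGC by ignoring the other blocks.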

Closing thought: SGC teaches a lesson that goes beyond graph learning: before adding complexity, verify that the complexity is necessary. In many real graph datasets, the hard work is graph structure (which spreads useful features), not nonlinear transformation. The model should match the problem — and for homophilic graphs, the problem is linear after propagation.