Wu, Zhang, Souza Jr., Fifty, Yu, Weinberger — ICML 2019 · arXiv 1902.07153

SGC: The Simplest Graph Neural Network

What happens if you remove every nonlinearity from a GCN? You get one matrix multiplication on propagated features — and it's nearly as good as the full model. Sometimes simplicity is the feature, not a bug.

Prerequisites: GCN basics + matrix multiplication. That's it.
8
Chapters
3+
Simulations
0
Assumed Knowledge

Chapter 0: The Problem

Graph Convolutional Networks (GCNs) are powerful — but they have a dirty secret: they're slow to train, tricky to tune, and hard to analyze theoretically. The reason is their complexity: multiple layers with nonlinear activations stacked on top of each other.

A standard GCN (Kipf & Welling, 2017) looks like this: for a K-layer network, start with features X, apply normalized adjacency Â, apply learnable weight W, apply ReLU, repeat K times. Written out:

H(1) = ReLU(Â X W(0))
H(2) = ReLU(Â H(1) W(1))
Ŷ = softmax(Â H(2) W(2))

For node classification, we train all W(k) end-to-end. The problem: every training step must propagate through the entire graph AND backpropagate through K layers. For large graphs, this is the bottleneck.

The combinatorial explosion: In a GCN with K layers, computing gradients for node v requires fetching the K-hop neighborhood of v. In many real graphs, the 2-hop neighborhood contains thousands of nodes, and the 3-hop neighborhood can cover the entire graph. Full-batch training becomes infeasible for graphs with millions of nodes.

Wu et al. asked a radical question: what if the nonlinearities between layers aren't actually helping? What if we can collapse the entire network into a single linear operation — and still get competitive results?

What makes multi-layer GCN training expensive on large graphs?

Chapter 1: Collapsing Layers

Here's the key observation. In a K-layer GCN, what would happen if we removed all the ReLU activations between layers, keeping only the final softmax?

Without ReLU, a two-layer GCN becomes:

Ŷ = softmax( Â · (ReLU(Â X W(0))) · W(1) )
↓ remove inner ReLU
Ŷ = softmax( Â · (Â X W(0)) · W(1) )
= softmax( Â2 X W(0) W(1) )

Because matrix multiplication is associative and linear, all the weight matrices collapse into a single matrix: W = W(0)W(1). The K-layer network with no intermediate nonlinearities is exactly equivalent to:

Ŷ = softmax( ÂK X W )

This is the Simple Graph Convolution (SGC). One weight matrix W, applied to features that have been propagated through the graph K times.

The separation principle: SGC separates two distinct operations that GCN conflates: (1) feature propagation — spreading information through the graph by repeatedly multiplying by  — and (2) feature transformation — learning a linear classifier from the propagated features. By separating them, we can precompute the propagation once, then train a simple linear classifier. This is much faster.
What algebraic property allows a K-layer linear GCN to collapse into a single matrix multiplication?

Chapter 2: The SGC Model (Showcase)

SGC in full: precompute ÂKX — the K-step propagated features — then train a single linear classifier (logistic regression) on top.

S = ÂK X     (precompute once, offline)
Ŷ = softmax( S W )

Where  = D̃-1/2(A + I)D̃-1/2 is the normalized adjacency with self-loops, X is the n×d feature matrix, and W is a d×c weight matrix (c = number of classes).

SGC Feature Propagation — Live Demo

Watch features spread across a small graph as propagation steps K increase. Node color = feature value (warm = high, teal = low). The propagated features are what SGC trains on.

K (propagation steps) 0
The key precomputation: Computing ÂKX for a sparse graph with n nodes and m edges takes O(Km·d) time — linear in the number of edges. This happens once before training. Then training the linear classifier W on the fixed features S is just logistic regression: O(nd·c) per epoch. No graph structure needed during training at all.
In SGC, when does the graph structure (adjacency matrix) need to be accessed during training?

Chapter 3: Feature Propagation

What does multiplying by  once actually do to node features? Node i's new feature vector becomes the average of its own feature and all its neighbors' features (with degree normalization):

i = D̃ii-1/2j ∈ N(i) ∪ {i}jj-1/2 Xj

After K steps, each node aggregates information from its K-hop neighborhood. A node in a tight cluster will quickly converge to the cluster's average features. A node on the periphery will slowly pull information from afar.

K as a hyperparameter: K controls the "locality" of the learned representation. K=1 means each node only sees its immediate neighbors. K=2 means 2-hop neighborhoods (neighbors of neighbors). K=3 and beyond: most nodes in a connected graph start seeing almost everyone else's features. In practice, K=2 is the sweet spot for most citation and social network benchmarks.

The smoothing interpretation

Repeated multiplication by  is a low-pass filter on the graph signal. Each application smooths the features — making nearby nodes more similar. This is helpful when nearby nodes share labels (homophily), and harmful when they don't (heterophily).

With large K, all nodes converge to the same feature vector (the all-ones vector in the limit for a connected regular graph). This is oversmoothing — the reason deep GCNs don't simply add more layers. SGC sidesteps the training problem of deep GCNs but cannot escape oversmoothing at large K.

Spectral View: Low-Pass Filtering

The eigenvalues of  are in [-1, 1]. Repeated multiplication by  raises eigenvalues to the K-th power. High-frequency components (eigenvalues near -1) decay; low-frequency (eigenvalue near 1) survive.

K (propagation steps) 2
What happens to node features if you increase K very large in SGC?

Chapter 4: Computational Savings

The computational advantage of SGC is substantial. Let's compare directly.

OperationGCN (K layers)SGC
Graph propagationEvery forward pass: O(Kmd)Once: O(Kmd)
Training complexityO(Kmd·c) per epochO(nd·c) per epoch
MemoryAll layer activations stored for backpropOnly S and W stored
Hyperparameters to tuneLR, dropout, hidden dims, L2 reg, KLR, L2 reg, K
Training (Cora, 200 epochs)~5 sec~0.1 sec
50x speedup on Cora: Wu et al. report that SGC is 40–50x faster to train than a 2-layer GCN on the Cora dataset. On large graphs like Reddit (232K nodes, 11.6M edges), SGC trains in seconds where minibatch GCN takes minutes. The precomputation of ÂKX itself is fast because  is sparse — it's just K sparse matrix-dense matrix products.

The precomputation trick

The crucial insight: ÂKX can be computed iteratively:

X(0) = X
Original node features (n × d)
↓ one sparse mat-mat multiply
X(1) = Â X(0)
1-hop aggregated features
↓ one more sparse mat-mat multiply
X(2) = Â X(1)
2-hop aggregated features
↻ repeat K times, then stop
S = X(K)
Fixed features for logistic regression
Why does SGC's training complexity drop from O(Kmd·c) to O(nd·c) per epoch compared to GCN?

Chapter 5: Results

Wu et al. tested SGC on five benchmarks: three citation networks (Cora, Citeseer, Pubmed), one Reddit post-classification dataset, and a 20-newsgroup text classification task.

DatasetGCN Acc.SGC Acc.Speedup
Cora81.5%81.0%~45x
Citeseer70.3%71.9%~60x
Pubmed79.0%78.9%~40x
Reddit93.3%94.9%Large
20news88.5%Baseline
Near-parity at a fraction of the cost: On Citeseer, SGC actually outperforms GCN (71.9% vs 70.3%). On Reddit, SGC beats GCN (94.9% vs 93.3%) while being orders of magnitude faster. The conclusion is striking: for homophilic citation and social graphs, the nonlinearities in GCN provide essentially zero benefit. The work is done by the feature propagation step.

The Reddit result is particularly meaningful. Reddit is a large graph (232,965 nodes, 11.6M edges) with rich feature vectors. GCN with minibatch training takes ~177 seconds per epoch. SGC precomputes ÂKX in ~177 seconds total, then trains in seconds per epoch.

On which dataset does SGC actually outperform GCN in accuracy?

Chapter 6: When Linearity Suffices

SGC's strong performance raises the question: when do nonlinearities actually matter in GCNs? The answer depends critically on the structure of the data.

Homophily is the key: In a homophilic graph, connected nodes tend to have the same label (friends are similar, academic papers cite similar papers). In this setting, spreading features through the graph already "clusters" nodes by label — and a linear classifier can easily separate the resulting clusters. Nonlinearities aren't needed when the graph structure does the work.

When SGC fails

SGC struggles in three scenarios:

Graph TypeSGC PerformanceBetter Alternative
Homophilic citationExcellent (matches GCN)
Social networks (Reddit)Excellent
Heterophilic (chameleon, squirrel)PoorH2GCN, GPRGNN
Large-scale inductiveGood (with precompute)GraphSAGE, SIGN
In which type of graph does SGC perform poorly, and why?

Chapter 7: Connections

SGC is simultaneously a practical tool and a theoretical lens. Its simplicity lets us reason rigorously about what GCNs are actually doing.

MethodKey IdeaRelation to SGC
GCNNonlinear layered propagationSGC = GCN without intermediate ReLUs
APPNPPersonalized PageRank propagationSGC + adaptive propagation weights
SIGNMultiple propagation operators concatenatedSGC with multi-scale precomputation
GraphSAGESampled neighborhood aggregationSGC without sampling (full propagation)
JK-NetAggregate features from all K layersSGC uses only layer K; JK-Net uses all
Label Prop.Diffuse labels instead of featuresSame diffusion operator Â, different signal
SGC as ablation: One of SGC's lasting contributions is methodological. By showing that a linear model matches GCN, it established a principled baseline. Any new GCN variant should now answer: does it beat SGC? If it can't, the graph structure isn't helping. This shifted the community toward understanding when graph structure helps, not just adding more complexity.

The SIGN extension

SIGN (Scalable Inception Graph Networks, Frasca et al. 2020) extends SGC by precomputing and concatenating features at multiple scales: [X, ÂX, Â2X, ..., ÂKX]. This gives a richer multi-scale feature that a linear (or shallow MLP) classifier can leverage. SIGN keeps the fast precomputation of SGC while recovering some of the expressiveness of deep GCNs.

Closing thought: SGC teaches a lesson that goes beyond graph learning: before adding complexity, verify that the complexity is necessary. In many real graph datasets, the hard work is graph structure (which spreads useful features), not nonlinear transformation. The model should match the problem — and for homophilic graphs, the problem is linear after propagation.