What happens if you remove every nonlinearity from a GCN? You get one matrix multiplication on propagated features — and it's nearly as good as the full model. Sometimes simplicity is the feature, not a bug.
Graph Convolutional Networks (GCNs) are powerful — but they have a dirty secret: they're slow to train, tricky to tune, and hard to analyze theoretically. The reason is their complexity: multiple layers with nonlinear activations stacked on top of each other.
A standard GCN (Kipf & Welling, 2017) looks like this: for a K-layer network, start with features X, apply normalized adjacency Â, apply learnable weight W, apply ReLU, repeat K times. Written out:
For node classification, we train all W(k) end-to-end. The problem: every training step must propagate through the entire graph AND backpropagate through K layers. For large graphs, this is the bottleneck.
Wu et al. asked a radical question: what if the nonlinearities between layers aren't actually helping? What if we can collapse the entire network into a single linear operation — and still get competitive results?
Here's the key observation. In a K-layer GCN, what would happen if we removed all the ReLU activations between layers, keeping only the final softmax?
Without ReLU, a two-layer GCN becomes:
Because matrix multiplication is associative and linear, all the weight matrices collapse into a single matrix: W = W(0)W(1). The K-layer network with no intermediate nonlinearities is exactly equivalent to:
This is the Simple Graph Convolution (SGC). One weight matrix W, applied to features that have been propagated through the graph K times.
SGC in full: precompute ÂKX — the K-step propagated features — then train a single linear classifier (logistic regression) on top.
Where  = D̃-1/2(A + I)D̃-1/2 is the normalized adjacency with self-loops, X is the n×d feature matrix, and W is a d×c weight matrix (c = number of classes).
Watch features spread across a small graph as propagation steps K increase. Node color = feature value (warm = high, teal = low). The propagated features are what SGC trains on.
What does multiplying by  once actually do to node features? Node i's new feature vector becomes the average of its own feature and all its neighbors' features (with degree normalization):
After K steps, each node aggregates information from its K-hop neighborhood. A node in a tight cluster will quickly converge to the cluster's average features. A node on the periphery will slowly pull information from afar.
Repeated multiplication by  is a low-pass filter on the graph signal. Each application smooths the features — making nearby nodes more similar. This is helpful when nearby nodes share labels (homophily), and harmful when they don't (heterophily).
With large K, all nodes converge to the same feature vector (the all-ones vector in the limit for a connected regular graph). This is oversmoothing — the reason deep GCNs don't simply add more layers. SGC sidesteps the training problem of deep GCNs but cannot escape oversmoothing at large K.
The eigenvalues of  are in [-1, 1]. Repeated multiplication by  raises eigenvalues to the K-th power. High-frequency components (eigenvalues near -1) decay; low-frequency (eigenvalue near 1) survive.
The computational advantage of SGC is substantial. Let's compare directly.
| Operation | GCN (K layers) | SGC |
|---|---|---|
| Graph propagation | Every forward pass: O(Kmd) | Once: O(Kmd) |
| Training complexity | O(Kmd·c) per epoch | O(nd·c) per epoch |
| Memory | All layer activations stored for backprop | Only S and W stored |
| Hyperparameters to tune | LR, dropout, hidden dims, L2 reg, K | LR, L2 reg, K |
| Training (Cora, 200 epochs) | ~5 sec | ~0.1 sec |
The crucial insight: ÂKX can be computed iteratively:
Wu et al. tested SGC on five benchmarks: three citation networks (Cora, Citeseer, Pubmed), one Reddit post-classification dataset, and a 20-newsgroup text classification task.
| Dataset | GCN Acc. | SGC Acc. | Speedup |
|---|---|---|---|
| Cora | 81.5% | 81.0% | ~45x |
| Citeseer | 70.3% | 71.9% | ~60x |
| Pubmed | 79.0% | 78.9% | ~40x |
| 93.3% | 94.9% | Large | |
| 20news | — | 88.5% | Baseline |
The Reddit result is particularly meaningful. Reddit is a large graph (232,965 nodes, 11.6M edges) with rich feature vectors. GCN with minibatch training takes ~177 seconds per epoch. SGC precomputes ÂKX in ~177 seconds total, then trains in seconds per epoch.
SGC's strong performance raises the question: when do nonlinearities actually matter in GCNs? The answer depends critically on the structure of the data.
SGC struggles in three scenarios:
| Graph Type | SGC Performance | Better Alternative |
|---|---|---|
| Homophilic citation | Excellent (matches GCN) | — |
| Social networks (Reddit) | Excellent | — |
| Heterophilic (chameleon, squirrel) | Poor | H2GCN, GPRGNN |
| Large-scale inductive | Good (with precompute) | GraphSAGE, SIGN |
SGC is simultaneously a practical tool and a theoretical lens. Its simplicity lets us reason rigorously about what GCNs are actually doing.
| Method | Key Idea | Relation to SGC |
|---|---|---|
| GCN | Nonlinear layered propagation | SGC = GCN without intermediate ReLUs |
| APPNP | Personalized PageRank propagation | SGC + adaptive propagation weights |
| SIGN | Multiple propagation operators concatenated | SGC with multi-scale precomputation |
| GraphSAGE | Sampled neighborhood aggregation | SGC without sampling (full propagation) |
| JK-Net | Aggregate features from all K layers | SGC uses only layer K; JK-Net uses all |
| Label Prop. | Diffuse labels instead of features | Same diffusion operator Â, different signal |
SIGN (Scalable Inception Graph Networks, Frasca et al. 2020) extends SGC by precomputing and concatenating features at multiple scales: [X, ÂX, Â2X, ..., ÂKX]. This gives a richer multi-scale feature that a linear (or shallow MLP) classifier can leverage. SIGN keeps the fast precomputation of SGC while recovering some of the expressiveness of deep GCNs.