Xu, Li, Tian, Sonobe, Kawarabayashi, Jegelka — ICML 2018 · arXiv 1806.03536

JK-Net: Jumping Knowledge Networks

In a graph, different nodes need different receptive fields. A leaf node needs local info; a hub needs global context. JK-Net's fix: keep embeddings from every layer, then let the model choose which layer to attend to for each node.

Prerequisites: GCN basics + what graph layers do. That's it.

Chapter 0: One Depth Doesn't Fit All

Imagine two nodes in the same graph: a peripheral node with 2 neighbors, and a hub node connected to 100 others. For the peripheral node, 2 GCN layers is plenty — it sees its whole neighborhood. For the hub, 2 layers means aggregating 100 nodes' worth of information in one shot: the signal gets diluted immediately.

Now flip it. Give the hub 5 layers. The peripheral node now aggregates its 5-hop neighborhood, and on a small or well-connected graph that can be the entire graph. Its features become close to an average of every node's features. It has lost its identity in a sea of global information.
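To make the mismatch concrete, here is a minimal sketch that counts how many nodes fall inside the K-hop receptive field of a hub and of a peripheral node. The toy graph (a star with a short tail) and the helper name `k_hop_size` are assumptions made up for illustration, not anything from the paper.

```python
from collections import deque

def k_hop_size(adj, source, k):
    """Count nodes within k hops of `source` (the source itself included)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # do not expand past depth k
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return len(seen)

# Hypothetical toy graph with 12 nodes:
# hub 0 is connected to nodes 1..8, and a tail 8-9-10-11 hangs off node 8.
adj = {i: set() for i in range(12)}

def add_edge(u, v):
    adj[u].add(v)
    adj[v].add(u)

for leaf in range(1, 9):
    add_edge(0, leaf)
for i in range(8, 11):
    add_edge(i, i + 1)

for k in range(1, 6):
    print(f"K={k}: hub sees {k_hop_size(adj, 0, k)} nodes, "
          f"tail end sees {k_hop_size(adj, 11, k)} nodes")
```

On this graph the hub already covers 9 of 12 nodes at K=1, while the tail node grows by one node per hop until K=5, where it suddenly covers the whole graph. No single K gives both nodes a comparable view.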

Interactive figure: The Receptive Field Problem. A leaf node (few neighbors) and a hub node (many neighbors) are compared as GCN depth K increases, showing how differently their receptive fields expand. Green = informative neighborhood; red = oversmoothed.

The core tension: GCN depth K is a global hyperparameter, one number for the entire graph. But different nodes in the same graph need different amounts of aggregation. A node at the graph periphery needs shallow aggregation (local structure is informative). A node in the dense core is harder to serve: it may need deeper aggregation, or almost none, since its immediate neighbors already carry diverse signal. No single K is right for all nodes.

Standard GCN handles this by choosing a single K (usually 2) and hoping it's good enough for most nodes. JK-Net asks: why not keep the representations from every depth and let the model decide, per node, which depth is most useful?

Why is a single GCN depth K problematic for graphs with mixed node degrees?

Peripheral nodes need local context (small K), while hub nodes, whose signal is already diluted by many neighbors after one hop, may need a different scale; the right K differs per node
Deeper GCNs are always better; we should just use K=10
K controls learning rate, not receptive field

Chapter 1: Layer Aggregation

JK-Net's solution is elegant: run K GCN layers as usual, but instead of only using the last layer's output, keep every layer's output and combine them.

Let h_v^(k) be node v's embedding after layer k. A standard GCN uses only h_v^(K) for prediction. JK-Net uses all of them: h_v⁽¹⁾, h_v⁽²⁾, ..., h_v^(K).

h_v^final = AGG( h_v⁽¹⁾, h_v⁽²⁾, ..., h_v^(K) )

The aggregation function AGG is the design choice that makes JK-Net flexible. It can be concatenation (keep everything), max-pooling (take the maximum across layers per dimension), or LSTM-attention (learn which layers to attend to).

The "jumping" analogy: Information "jumps" from any layer directly to the final representation — like residual connections but across ALL layers simultaneously. In a standard GCN, information must pass through every subsequent layer. In JK-Net, the model can "jump" to access the representation at exactly layer k, for any k from 1 to K.

Why does this help? Because h_v^(k) encodes node v's features aggregated from its k-hop neighborhood. By having access to all k from 1 to K, the model can learn to emphasize the k that provides the most useful signal for each node's specific position in the graph.

What distinguishes JK-Net from standard GCN in terms of which layer representations are used?

Standard GCN uses only the final layer K; JK-Net aggregates representations from ALL layers 1 through K, giving each node access to multiple scales of neighborhood information
JK-Net uses only the first layer for speed
JK-Net skips alternating layers (even layers only)

Chapter 2: JK-Net Architecture (Showcase)

The full JK-Net architecture: K standard GCN layers, each producing a node embedding, followed by a layer-wise aggregation module that selects the right representation per node.

Interactive figure: JK-Net Live, Layer Contributions per Node. Selecting a node shows how much each GCN layer contributes to its final representation under the chosen aggregation method. In the simulation, peripheral nodes (few neighbors) rely on shallow layers and hub nodes rely on deeper layers.

Data flow through JK-Net (Â is the normalized adjacency matrix with self-loops, as in GCN):
Input: X ∈ ℝ^(n×d) (node features)
Layer 1: H⁽¹⁾ = ReLU(Â X W⁽⁰⁾) ∈ ℝ^(n×d')
Layer 2: H⁽²⁾ = ReLU(Â H⁽¹⁾ W⁽¹⁾) ∈ ℝ^(n×d')
... (K layers)
Jump: H^final = AGG(H⁽¹⁾, ..., H^(K)) ∈ ℝ^(n×K·d') for concat, or ℝ^(n×d') for max-pool and LSTM-attention
Predict: Ŷ = softmax(H^final W^out)
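The data flow above can be sketched in a few lines of NumPy. This is an untrained forward pass with random weights, assuming the usual GCN normalization Â = D^(-1/2)(A + I)D^(-1/2) and concatenation as AGG; the function names and the toy path graph are illustrative choices, not the authors' code.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric GCN normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def jk_forward(A, X, weights, W_out):
    """Run K GCN layers, keep every H^(k), concatenate them, then classify."""
    A_hat = normalized_adjacency(A)
    H, layer_outputs = X, []
    for W in weights:
        H = np.maximum(A_hat @ H @ W, 0.0)        # ReLU(A_hat H W)
        layer_outputs.append(H)                   # the "jump": keep this layer
    H_final = np.concatenate(layer_outputs, axis=1)   # n x (K * d')
    logits = H_final @ W_out
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True), H_final

# Toy setup: a 5-node path graph with random features and weights
rng = np.random.default_rng(0)
n, d, d_hidden, K, n_classes = 5, 3, 4, 3, 2
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
X = rng.normal(size=(n, d))
weights = [rng.normal(size=(d, d_hidden))]
weights += [rng.normal(size=(d_hidden, d_hidden)) for _ in range(K - 1)]
W_out = rng.normal(size=(K * d_hidden, n_classes))

Y_hat, H_final = jk_forward(A, X, weights, W_out)
print(H_final.shape)  # (5, 12)
print(Y_hat.shape)    # (5, 2)
```

The only structural difference from a plain GCN is the `layer_outputs` list: a standard GCN would discard everything except the last `H`.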

What is the output dimension of JK-Net's final representation using concatenation aggregation with K=4 layers and d'=64 hidden units per layer?

4 × 64 = 256 (each of K layers contributes a d'-dimensional vector, all concatenated)
64 (same as each individual layer)
64/4 = 16 (divided by number of layers)

Chapter 3: Aggregation Options

JK-Net proposes three ways to combine layer-wise representations. Each has different expressiveness and computational cost.

1. Concatenation

Simply concatenate all K layer outputs for each node:

h_v^final = [h_v⁽¹⁾ || h_v⁽²⁾ || ... || h_v^(K)]

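Output dimension: K × d'. The final classifier (a linear layer) then learns to weight the contributions of each layer. The aggregation itself is simple and adds no parameters, but the representation, and with it the classifier's input size, grows linearly with K. Because the classifier weights are shared across nodes, the learned layer weighting is the same for every node.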
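One way to see how the linear classifier "weights the contributions of each layer": a linear map applied to the concatenation is the same as a sum of K separate linear maps, one per layer. The sketch below checks this numerically with random stand-in matrices; all shapes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_hidden, K, n_classes = 6, 4, 3, 2

# Random stand-ins for the layer outputs H^(1), ..., H^(K)
layers = [rng.normal(size=(n, d_hidden)) for _ in range(K)]

H_final = np.concatenate(layers, axis=1)            # n x (K * d')
W_out = rng.normal(size=(K * d_hidden, n_classes))  # classifier on the concat

# Split W_out into K row blocks, one block per layer
blocks = np.split(W_out, K, axis=0)                 # each d' x n_classes
per_layer_sum = sum(H @ W for H, W in zip(layers, blocks))

print(H_final.shape)                                # (6, 12)
print(np.allclose(H_final @ W_out, per_layer_sum))  # True
```

Each block of `W_out` acts as that layer's own classifier, so training can shrink or grow a layer's influence by scaling its block.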

2. Max-Pooling

For each feature dimension j, take the maximum across all K layers:

h_v,j^final = max_k=1,...,K h_v,j^(k)

Output dimension: d' (same as each layer). This is permutation-invariant over layers — the model doesn't care about layer ordering. It selects the "most activated" value for each feature across all depths.
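A small sketch of the max-pooling rule on made-up numbers, showing both the pooled vector and which layer "won" each feature dimension. The helper name `jk_max_pool` and the values are assumptions for illustration.

```python
import numpy as np

def jk_max_pool(layer_outputs):
    """Element-wise max over a list of K arrays, each of shape n x d'."""
    return np.stack(layer_outputs, axis=0).max(axis=0)

# One node (n = 1), K = 3 layers, d' = 4 features; values are made up
layers = [np.array([[0.2, 0.9, 0.1, 0.0]]),   # layer 1
          np.array([[0.7, 0.3, 0.1, 0.4]]),   # layer 2
          np.array([[0.5, 0.4, 0.6, 0.2]])]   # layer 3

h_final = jk_max_pool(layers)[0]
chosen_layer = np.stack(layers, axis=0).argmax(axis=0)[0] + 1  # 1-based layer index

print(h_final)       # [0.7 0.9 0.6 0.4]
print(chosen_layer)  # [2 1 3 2]

# Reordering the layers leaves the pooled vector unchanged
assert np.array_equal(jk_max_pool(layers[::-1])[0], h_final)
```

The selection happens independently per node and per feature, which is what makes max-pooling adaptive without any extra parameters.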

3. LSTM-Attention

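Treat the K layer representations as a sequence and run a bidirectional LSTM over them. The forward and backward hidden states at each position k are mapped to a scalar score s_v^(k), the scores are softmax-normalized across layers into attention weights α_v^(k), and the final representation is the attention-weighted sum of the layer embeddings:

h_v^final = Σ_k α_v^(k) h_v^(k), with α_v^(k) = softmax over k of s_v^(k)

This is the most expressive option: the attention weights are computed separately for each node, so different nodes can emphasize different depths. But it requires training additional parameters (the LSTM and scoring weights) and adds complexity.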
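In the paper's formulation, the LSTM outputs are turned into one scalar score per layer, the scores are softmax-normalized across layers, and the layer embeddings are combined as a weighted sum. Below is a minimal sketch of that final attention step only. Note the loud assumption: the bidirectional LSTM is replaced by a hypothetical linear scorer `w_score`, so this is not the paper's scoring network, just the weighting mechanics.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_aggregate(layer_outputs, w_score):
    """Per-node softmax attention over layers, followed by a weighted sum.

    layer_outputs: list of K arrays, each n x d'
    w_score: vector of length d' (hypothetical stand-in for the bi-LSTM scorer)
    """
    H = np.stack(layer_outputs, axis=1)       # n x K x d'
    scores = H @ w_score                      # n x K, one scalar per (node, layer)
    alpha = softmax(scores, axis=1)           # attention weights, rows sum to 1
    h_final = (alpha[:, :, None] * H).sum(axis=1)   # n x d'
    return h_final, alpha

rng = np.random.default_rng(0)
layers = [rng.normal(size=(5, 4)) for _ in range(3)]   # K = 3, n = 5, d' = 4
h_final, alpha = attention_aggregate(layers, rng.normal(size=4))

print(h_final.shape)      # (5, 4)
print(alpha.sum(axis=1))  # each entry is 1 up to rounding
```

Because `alpha` has one row per node, two nodes in the same graph can put their weight on different layers, which is the per-node adaptivity the text describes.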

Method	Output Dim	Extra Params	Order-Sensitive	Best For
Concatenation	K·d'	None	Yes	Simple, all-around
Max-Pool	d'	None	No	Compact representations
LSTM-Attn	d'	LSTM params	Yes	Complex depth patterns

Which is best? Empirically, max-pooling is often competitive with or better than LSTM-attention despite being simpler. This suggests that for most graph tasks, the key is accessing the right depth, not learning complex interactions between depths. Concatenation often wins when the downstream task can afford the larger representation size.

Why might max-pooling outperform LSTM-attention in JK-Net despite being simpler?

The key benefit of JK-Net is accessing the right depth, not modeling complex inter-layer dependencies: max-pool provides depth selection without adding LSTM parameters that can overfit
LSTM-attention has too many parameters and always overfits
Max-pooling uses more memory than LSTM-attention

Chapter 4: Influence Distribution

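Xu et al. introduce a diagnostic tool: the influence distribution of node v is a probability distribution over the nodes of the graph that captures how much each node u's input features influence node v's representation after K layers. It is obtained by normalizing a Jacobian-based sensitivity score:

I(v, u) = sum of the absolute entries of ∂h_v^(K) / ∂x_u,   α_v(u) = I(v, u) / Σ_w I(v, w)

For a standard K-layer GCN with averaging-style aggregation, the paper shows that node v's influence distribution matches, in expectation and under simplifying assumptions about the ReLU activations, the K-step random walk distribution starting from v. It is essentially a fixed function of the graph structure and doesn't adapt to the actual task or the node's position.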
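The random walk side of this statement is easy to compute directly. The sketch below builds a row-stochastic transition matrix and reads off the K-step distribution from a start node. It assumes self-loops are added, as in GCN; the path graph and the function name are illustrative.

```python
import numpy as np

def random_walk_distribution(A, source, k, self_loops=True):
    """Distribution over nodes after a k-step random walk started at `source`."""
    A = A + np.eye(len(A)) if self_loops else A
    P = A / A.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
    return np.linalg.matrix_power(P, k)[source]

# Toy path graph 0-1-2-3-4
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

for k in (1, 2, 4):
    print(k, np.round(random_walk_distribution(A, 0, k), 3))
```

As k grows, the mass spreads away from the start node and the distribution depends less and less on where the walk began, which is the random walk view of oversmoothing.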

The oversmoothing diagnosis: For a node at the periphery of a sparse graph, the K-hop distribution is narrow (only a few nodes are reachable in K steps). Fine. For a node in a dense clique, the K-hop distribution is nearly uniform over the entire clique — every node in the clique has equal influence, regardless of relevance. This is exactly oversmoothing: all nodes in the neighborhood look the same.

JK-Net changes the influence distribution fundamentally. Because node v's final representation is an aggregation over all K layers, and layer k uses k-hop neighborhoods, the effective influence distribution is a mixture of 1-hop through K-hop random walk distributions. The mixture weights are learned — they can be concentrated at shallow depths for peripheral nodes and at deeper depths for hub nodes.
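A sketch of the mixture idea, assuming the same random walk picture: stack the 1-hop through K-hop distributions for one node and combine them with two different sets of mixture weights. The weights here are hand-picked for illustration, standing in for what a trained JK-Net would learn.

```python
import numpy as np

def hop_distributions(A, source, K):
    """Rows k = 1..K hold the k-step random walk distribution from `source`."""
    A_tilde = A + np.eye(len(A))                       # self-loops, as in GCN
    P = A_tilde / A_tilde.sum(axis=1, keepdims=True)
    row = np.eye(len(A))[source]
    rows = []
    for _ in range(K):
        row = row @ P
        rows.append(row)
    return np.array(rows)                              # K x n

# Toy path graph 0-1-2-3-4, start node 0
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

hops = hop_distributions(A, source=0, K=3)

# Hypothetical mixture weights over depths 1..3 (each sums to 1)
shallow = np.array([0.80, 0.15, 0.05])
deep = np.array([0.05, 0.15, 0.80])

print(np.round(shallow @ hops, 3))   # influence stays near node 0
print(np.round(deep @ hops, 3))      # influence spreads further along the path
```

Both mixtures are valid probability distributions over nodes; only the weights differ, which is the degree of freedom a fixed-depth GCN does not have.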

Interactive figure: Influence Distribution, GCN vs JK-Net. For a selected node type (peripheral or hub), the simulation visualizes which other nodes influence its representation. GCN uses a fixed K-hop distribution; JK-Net learns a mixture.

How does JK-Net change the influence distribution compared to standard GCN?

JK-Net creates a learnable mixture of 1-hop through K-hop distributions, allowing peripheral nodes to concentrate influence locally and hub nodes to select the appropriate scale
JK-Net uses uniform influence across all nodes regardless of graph structure
JK-Net restricts influence to 1-hop neighbors only

Chapter 5: Results

Xu et al. evaluated JK-Net on node classification (Cora, Citeseer, Pubmed) and several social network datasets. The key insight is not raw accuracy numbers but the improvement from increasing depth — JK-Net scales with more layers; GCN degrades.

Dataset	GCN (K=2)	GCN (K=6)	JK-Concat (K=6)	JK-MaxPool (K=6)
Cora	81.5%	79.8%	83.3%	83.6%
Citeseer	70.3%	68.1%	72.6%	73.0%
Pubmed	79.0%	78.2%	79.8%	80.2%

The crucial comparison: GCN accuracy drops when going from K=2 to K=6 (e.g., Cora: 81.5% → 79.8%). JK-Net with K=6 beats GCN with K=2 (Cora: 83.6% vs 81.5%). JK-Net scales with depth while GCN degrades. This is consistent with the hypothesis that the right information for many nodes sits at deeper layers, but GCN can't access it without oversmoothing. JK-Net can.
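The same comparison can be read off for every dataset with a few lines of arithmetic. The figures below are copied from the table above; the dictionary keys are just labels chosen for this sketch.

```python
# Accuracy figures (%) copied from the table above
results = {
    "Cora":     {"gcn_k2": 81.5, "gcn_k6": 79.8, "jk_concat_k6": 83.3, "jk_maxpool_k6": 83.6},
    "Citeseer": {"gcn_k2": 70.3, "gcn_k6": 68.1, "jk_concat_k6": 72.6, "jk_maxpool_k6": 73.0},
    "Pubmed":   {"gcn_k2": 79.0, "gcn_k6": 78.2, "jk_concat_k6": 79.8, "jk_maxpool_k6": 80.2},
}

for name, r in results.items():
    depth_penalty = r["gcn_k6"] - r["gcn_k2"]        # GCN going from K=2 to K=6
    jk_gain = r["jk_maxpool_k6"] - r["gcn_k2"]       # JK-MaxPool (K=6) vs GCN (K=2)
    print(f"{name}: GCN K=2 to K=6 {depth_penalty:+.1f} pts, "
          f"JK-MaxPool K=6 vs GCN K=2 {jk_gain:+.1f} pts")
```

In every row the first difference is negative and the second is positive, which is the pattern the paragraph above describes.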

On the larger Reddit (social network) and PPI (protein-protein interaction) datasets, which have stronger long-range dependencies, the improvements are even larger. Hub nodes in social networks benefit particularly from the ability to selectively attend to shallow layers, avoiding the dilution caused by aggregating thousands of neighbors.

What does the GCN K=2 vs K=6 comparison reveal about standard GCN?

More GCN layers hurt performance (oversmoothing), so depth doesn't scale; JK-Net avoids this by aggregating all layer outputs
More GCN layers always improve performance until K=6
GCN at K=6 uses 3x more parameters than JK-Net

Chapter 6: vs Deep GCN

Several approaches tackle the "deep GCN" problem — the fact that standard GCN degrades with depth. JK-Net is one. Let's compare the strategies.

Method	Strategy	Handles Oversmoothing	Per-Node Adaptation
GCN (baseline)	Fixed K layers, use last only	No	No
ResGCN	Skip connections: h^(k+1) += h^(k)	Partially	No
DenseGCN	All previous layers → current layer input	Better	No
JK-Net	All layers → final aggregation only	Yes	Yes (per node)
DropEdge	Random edge dropout during training	Partially	No
PairNorm	Normalize to prevent oversmoothing	Partially	No

JK-Net vs ResGCN: ResGCN adds a skip connection from layer k to layer k+1. This helps gradient flow but doesn't give the final layer direct access to all intermediate representations. JK-Net is architecturally different: its main purpose is not gradient flow but representation access at prediction time, during training and inference alike. The final prediction uses representations from all depths, not just a shortcut-assisted version of the last layer.

The key distinction: ResGCN improves the training of deep GCNs. JK-Net improves the representation available for prediction. These are complementary: you can combine JK-Net's final aggregation with residual connections between layers, and this can work well in practice.
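The wiring difference fits in a few lines. The sketch below uses simplified stand-ins for both ideas, not the published ResGCN or JK-Net implementations: square random weight matrices so the residual sum is well-defined, and plain row normalization for the adjacency matrix.

```python
import numpy as np

def gcn_layer(A_hat, H, W):
    return np.maximum(A_hat @ H @ W, 0.0)          # ReLU(A_hat H W)

def res_gcn(A_hat, X, weights):
    """Residual wiring: each layer adds its input; only the last H is returned."""
    H = X
    for W in weights:
        H = H + gcn_layer(A_hat, H, W)
    return H                                        # n x d

def jk_res_gcn(A_hat, X, weights):
    """Same residual layers, but every intermediate H also jumps to the output."""
    H, kept = X, []
    for W in weights:
        H = H + gcn_layer(A_hat, H, W)
        kept.append(H)
    return np.concatenate(kept, axis=1)             # n x (K * d)

rng = np.random.default_rng(0)
n, d, K = 4, 3, 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(n)
A_hat = A_tilde / A_tilde.sum(axis=1, keepdims=True)   # simple row normalization
X = rng.normal(size=(n, d))
weights = [rng.normal(size=(d, d)) for _ in range(K)]

last_only = res_gcn(A_hat, X, weights)
all_layers = jk_res_gcn(A_hat, X, weights)
print(last_only.shape, all_layers.shape)             # (4, 3) (4, 9)
print(np.allclose(all_layers[:, -d:], last_only))    # True
```

The residual network's output is exactly the last block of the JK output. The jump connections add the earlier blocks on top, so the classifier sees every depth rather than only the final one.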

What makes JK-Net architecturally different from ResGCN (skip connections)?

ResGCN passes previous layer outputs INTO the next layer (improving gradient flow); JK-Net passes all layer outputs INTO the final prediction (improving representation access)
JK-Net uses residual connections at every layer, ResGCN uses none
JK-Net and ResGCN are equivalent architecturally

Chapter 7: Connections

JK-Net introduced multi-scale representation learning for graphs, an idea that reappears in many subsequent architectures and remains a core design principle for graph neural networks on graphs with heterogeneous local structure.

Method	Key Idea	Relation to JK-Net
APPNP	Personalized PageRank as aggregation weights	Adaptive receptive field (different angle)
SIGN	Multi-scale precomputed features	JK-Net idea + SGC precomputation
MixHop	Mix h^(1), h^(2), ... as GCN layer input	Multi-hop within each layer (not just final)
DAGNN	Decouple propagation and transformation	Similar philosophy to JK + SGC
Design Space GNNs	Study of all GNN design choices	JK (skip connections) as one design dimension

JK-Net in the design space: The Design Space for GNNs paper (You et al., 2020) systematically studies which GNN design choices matter. They find that inter-layer connections (skip connections, of which JK-Net is the most aggressive form) are one of the most impactful design choices, particularly for deep GNNs. JK-Net's full-layer access sits at the far end of the skip connection spectrum: every layer is connected directly to the output.

When to use JK-Net

Heterogeneous degree distributions: Graphs where some nodes have degree 1 and others degree 1000+. JK-Net's per-node adaptation is most valuable here.
Deep architectures (K > 3): Any time you need deeper GCNs and oversmoothing is an issue.
Tasks requiring multi-scale features: Community detection, structural role classification — tasks where different scales of neighborhood structure matter simultaneously.

Closing thought: JK-Net embodies a principle that appears throughout deep learning: when you don't know which level of representation is most informative, keep them all and let the model learn. This is the same insight behind DenseNet in vision and multi-scale feature pyramids in object detection. JK-Net brings this idea to the graph domain, where the relevant "scale" is not spatial resolution but graph-topological distance.