In a graph, different nodes need different receptive fields: the neighborhood radius that suits a sparsely connected leaf is rarely the one that suits a densely connected hub. JK-Net's fix: keep the embeddings from every layer, then let the model choose, per node, which layers to draw on.
Imagine two nodes in the same graph: a peripheral node with 2 neighbors, and a hub node connected to 100 others. For the peripheral node, 2 GCN layers is plenty: it already sees its whole local neighborhood. For the hub, the very first layer averages over 100 neighbors and the second pulls in all of their neighbors as well, so the hub's own signal is diluted almost immediately.
Now flip it and deepen the network to 5 layers in the hope of giving the hub more context. A GCN applies the same depth to every node, so the peripheral node now aggregates its 5-hop neighborhood, and on a small graph that might be the entire graph. Its features drift toward an average of every node's features. It has lost its identity in a sea of global information.
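The contrast between the two nodes can be made concrete by counting how many nodes fall inside each one's k-hop receptive field. The sketch below does this with a breadth-first search. The toy graph (a short path ending in a hub with 20 spokes) and the helper name `receptive_field_sizes` are illustrative assumptions, not part of JK-Net.

```python
from collections import deque

def receptive_field_sizes(adj, start, max_hops):
    """Return the number of nodes within k hops of `start`, for k = 0..max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return [sum(1 for d in dist.values() if d <= k) for k in range(max_hops + 1)]

# Hypothetical toy graph: path 0-1-2-3, where node 3 is a hub with 20 extra spokes.
adj = {i: set() for i in range(24)}

def add_edge(u, v):
    adj[u].add(v)
    adj[v].add(u)

for u, v in [(0, 1), (1, 2), (2, 3)]:
    add_edge(u, v)
for spoke in range(4, 24):
    add_edge(3, spoke)

leaf_sizes = receptive_field_sizes(adj, start=0, max_hops=4)
hub_sizes = receptive_field_sizes(adj, start=3, max_hops=4)
print("leaf:", leaf_sizes)
print("hub: ", hub_sizes)
```

In this toy graph the hub's receptive field covers almost every node after a single hop, while the leaf's grows by one node per hop until it reaches the hub, at which point it jumps to the whole graph.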
Interactive figure: a leaf node (few neighbors) versus a hub node (many neighbors) as depth K increases, showing how their receptive fields expand differently. Green marks an informative neighborhood; red marks an oversmoothed one.
Standard GCN handles this by choosing a single K (usually 2) and hoping it's good enough for most nodes. JK-Net asks: why not keep the representations from every depth and let the model decide, per node, which depth is most useful?
JK-Net's solution is elegant: run K GCN layers as usual, but instead of only using the last layer's output, keep every layer's output and combine them.
Let $h_v^{(k)}$ be node v's embedding after layer k. A standard GCN uses only $h_v^{(K)}$ for prediction. JK-Net uses all of them:

$$h_v^{\text{final}} = \mathrm{AGG}\left(h_v^{(1)}, h_v^{(2)}, \ldots, h_v^{(K)}\right)$$
The aggregation function AGG is the design choice that makes JK-Net flexible. It can be concatenation (keep everything), max-pooling (take the maximum across layers per dimension), or LSTM-attention (learn which layers to attend to).
Why does this help? Because $h_v^{(k)}$ encodes node v's features aggregated from its k-hop neighborhood. By having access to every k from 1 to K, the model can learn to emphasize the k that provides the most useful signal for each node's specific position in the graph.
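The "keep every layer's output" step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the symmetric normalization with self-loops follows the common GCN formulation, the weights are random rather than trained, and the function names `normalized_adjacency` and `gcn_layer_outputs` are hypothetical.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer_outputs(A, X, weights):
    """Run len(weights) GCN layers and return every layer's output, not just the last."""
    A_norm = normalized_adjacency(A)
    h, outputs = X, []
    for W in weights:
        h = np.maximum(A_norm @ h @ W, 0.0)  # propagate, transform, ReLU
        outputs.append(h)                    # h^(k) is kept for the aggregation step
    return outputs

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))  # 4 nodes, 3 input features
K, d_hidden = 3, 5
# Untrained random weights, for illustration only.
weights = [rng.normal(size=(3 if k == 0 else d_hidden, d_hidden)) for k in range(K)]

layer_outputs = gcn_layer_outputs(A, X, weights)
print(len(layer_outputs), layer_outputs[0].shape)
```

A standard GCN would pass only `layer_outputs[-1]` to the classifier; a JK-style model hands the whole list to its aggregation function.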
The full JK-Net architecture: K standard GCN layers, each producing a node embedding, followed by a layer-wise aggregation module that selects the right representation per node.
Interactive figure: how much each GCN layer contributes to a selected node's final representation, under each aggregation method. In the demo, peripheral nodes (few neighbors) rely on shallow layers and hub nodes rely on deeper layers.
JK-Net proposes three ways to combine layer-wise representations. Each has different expressiveness and computational cost.
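Simply concatenate all K layer outputs for each node:

$$h_v^{\text{final}} = \left[\, h_v^{(1)} \,\|\, h_v^{(2)} \,\|\, \cdots \,\|\, h_v^{(K)} \,\right]$$

Output dimension: K × d'. The final classifier (a linear layer) then learns how to weight the contributions of each layer. Simple, and the aggregation itself adds no parameters, but the representation (and with it the classifier's input layer) grows linearly with K.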
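A small NumPy sketch of concatenation, showing why a linear classifier on top amounts to a per-layer weighting. The per-layer embeddings and classifier weights are random stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d_hidden, K = 4, 5, 3
# Stand-ins for the K per-layer embeddings h^(1), ..., h^(K).
layer_outputs = [rng.normal(size=(num_nodes, d_hidden)) for _ in range(K)]

# Concatenate along the feature axis: shape (num_nodes, K * d_hidden).
h_concat = np.concatenate(layer_outputs, axis=1)

# A linear classifier on top holds one block of d_hidden rows per layer...
num_classes = 2
W_clf = rng.normal(size=(K * d_hidden, num_classes))
logits = h_concat @ W_clf

# ...so its logits decompose into a sum of per-layer contributions.
per_layer = [layer_outputs[k] @ W_clf[k * d_hidden:(k + 1) * d_hidden] for k in range(K)]
print(h_concat.shape, np.allclose(logits, sum(per_layer)))
```

Note that the block of classifier weights for each layer is shared by all nodes, so with a purely linear classifier the layer weighting is learned per dataset rather than per node.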
For each feature dimension j, take the maximum across all K layers:

$$h_{v,j}^{\text{final}} = \max_{k \in \{1, \ldots, K\}} h_{v,j}^{(k)}$$
Output dimension: d' (same as each layer). This is permutation-invariant over layers — the model doesn't care about layer ordering. It selects the "most activated" value for each feature across all depths.
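A worked example of layer-wise max-pooling, with hand-picked values for two nodes so the selection is easy to follow. The numbers are illustrative, not taken from any trained model.

```python
import numpy as np

# Two nodes, d' = 3 features, K = 3 layers; values chosen by hand.
layer_outputs = [
    np.array([[0.9, 0.1, 0.2], [0.1, 0.3, 0.0]]),  # h^(1)
    np.array([[0.4, 0.5, 0.1], [0.2, 0.8, 0.1]]),  # h^(2)
    np.array([[0.3, 0.2, 0.6], [0.7, 0.4, 0.9]]),  # h^(3)
]

stacked = np.stack(layer_outputs, axis=0)   # shape (K, num_nodes, d')
h_max = stacked.max(axis=0)                 # element-wise max over layers
chosen_layer = stacked.argmax(axis=0) + 1   # which layer supplied each feature (1-based)

print(h_max)
print(chosen_layer)
```

The first node takes one feature from each of layers 1, 2, and 3, while the second node draws mostly on layer 3. The selection differs per node and per feature even though no parameters are involved.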
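Treat the K layer representations as a sequence and run a bidirectional LSTM over them. For each layer k, the forward and backward hidden states are mapped to a scalar score, a softmax over the K scores gives attention weights, and the output is the attention-weighted sum of the layer representations:

$$s_v^{(k)} = w^\top \left[ f_v^{(k)} \,\|\, b_v^{(k)} \right], \qquad \alpha_v^{(k)} = \frac{\exp\left(s_v^{(k)}\right)}{\sum_{l=1}^{K} \exp\left(s_v^{(l)}\right)}, \qquad h_v^{\text{final}} = \sum_{k=1}^{K} \alpha_v^{(k)} h_v^{(k)}$$

Here $f_v^{(k)}$ and $b_v^{(k)}$ are the forward and backward LSTM hidden states at layer k, and w is a learned scoring vector.
This is the most expressive option: the attention weights differ from node to node, and the LSTM can model interactions between layer representations. But it requires training additional parameters (the LSTM and scoring weights) and adds complexity.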
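The LSTM-based aggregation can be sketched in plain NumPy: a bidirectional LSTM scores each layer, a softmax turns the scores into per-node attention weights, and the result is a weighted sum of layer embeddings. This is a minimal sketch with untrained random weights; the function names, hidden size, and parameter layout are assumptions made for illustration, and a real model would train these parameters end to end.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_run(seq, W, U, b):
    """Run a one-directional LSTM over seq (T, N, d); return hidden states (T, N, H)."""
    T, N, _ = seq.shape
    H = U.shape[0]
    h, c = np.zeros((N, H)), np.zeros((N, H))
    states = []
    for t in range(T):
        gates = seq[t] @ W + h @ U + b            # (N, 4H): input, forget, cell, output
        i, f, g, o = np.split(gates, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states, axis=0)

def lstm_attention_aggregate(layer_outputs, params):
    """Score layers with a bi-LSTM, softmax over layers, return the weighted sum."""
    seq = np.stack(layer_outputs, axis=0)                        # (K, N, d)
    fwd = lstm_run(seq, *params["fwd"])                          # (K, N, H)
    bwd = lstm_run(seq[::-1], *params["bwd"])[::-1]              # re-aligned to layer order
    scores = np.concatenate([fwd, bwd], axis=2) @ params["w_score"]  # (K, N)
    scores = scores - scores.max(axis=0, keepdims=True)          # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    h_final = (alpha[:, :, None] * seq).sum(axis=0)              # (N, d)
    return h_final, alpha

rng = np.random.default_rng(0)
K, N, d, H = 3, 4, 5, 6

def init_lstm():
    # Untrained random weights: input-to-gates, hidden-to-gates, bias.
    return (rng.normal(scale=0.1, size=(d, 4 * H)),
            rng.normal(scale=0.1, size=(H, 4 * H)),
            np.zeros(4 * H))

params = {"fwd": init_lstm(), "bwd": init_lstm(), "w_score": rng.normal(size=2 * H)}
layer_outputs = [rng.normal(size=(N, d)) for _ in range(K)]
h_final, alpha = lstm_attention_aggregate(layer_outputs, params)
print(h_final.shape, alpha.sum(axis=0))
```

Each column of `alpha` is one node's distribution over the K layers, which is what makes this variant adaptive per node while keeping the output dimension at d'.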
| Method | Output Dim | Extra Params | Order-Sensitive | Best For |
|---|---|---|---|---|
| Concatenation | K·d' | None | Yes | Simple, all-around |
| Max-Pool | d' | None | No | Compact representations |
| LSTM-Attn | d' | LSTM params | Yes | Complex depth patterns |
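Xu et al. introduce a diagnostic tool, the influence distribution. The influence of node u on node v is measured by how sensitive v's representation after K layers is to u's input features (the sum of the absolute entries of the Jacobian of v's final embedding with respect to u's input). Normalizing these scores over all nodes u gives a probability distribution for node v.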
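For a standard K-layer GCN with mean aggregation, Xu et al. show that node v's influence distribution matches, in expectation and under a simplifying assumption about the ReLU activations, the K-step random walk distribution starting from v. That is a fixed function of the graph structure: it doesn't adapt to the actual task or to the node's position.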
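The K-step random walk distribution referred to above is easy to compute directly. The sketch below does so on a hypothetical 9-node graph (a path ending in a small hub), adding self-loops before normalizing so that a walker may also stay in place. The graph and the function name `random_walk_distribution` are illustrative assumptions.

```python
import numpy as np

def random_walk_distribution(A, start, k):
    """Distribution of a k-step random walk from `start` on the graph with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    P = A_tilde / A_tilde.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    return np.linalg.matrix_power(P, k)[start]

# Hypothetical toy graph: path 0-1-2-3, where node 3 is a hub with 5 extra spokes.
n = 9
A = np.zeros((n, n))
edges = [(0, 1), (1, 2), (2, 3)] + [(3, s) for s in range(4, 9)]
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

leaf_dist = random_walk_distribution(A, start=0, k=2)
hub_dist = random_walk_distribution(A, start=3, k=2)
print(np.round(leaf_dist, 3))
print(np.round(hub_dist, 3))
```

With K = 2 the leaf's distribution is spread over three nodes while the hub's already covers eight of the nine. The same K gives the two nodes very different effective neighborhoods, and nothing in the distribution depends on features or labels.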
JK-Net changes the influence distribution fundamentally. Because node v's final representation is an aggregation over all K layers, and layer k draws on the k-hop neighborhood, the effective influence distribution is a mixture of the 1-step through K-step random walk distributions. The mixture weights are not fixed in advance: they depend on the learned parameters and on the node's own layer representations, so the mass can concentrate at shallower depths for some nodes and at deeper depths for others.
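The mixture view can be illustrated by blending the 1-step through K-step random walk distributions with different weights. In the sketch below the mixture weights are hand-picked to show the two extremes; in an actual JK-Net they would come out of the aggregation module rather than being set by hand. The path graph and the function names are illustrative assumptions.

```python
import numpy as np

def transition_matrix(A):
    """Row-stochastic random walk matrix on the graph with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    return A_tilde / A_tilde.sum(axis=1, keepdims=True)

def mixture_influence(A, node, mix_weights):
    """Mix the 1..K-step walk distributions from `node`; mix_weights should sum to 1."""
    P = transition_matrix(A)
    walks = [np.linalg.matrix_power(P, k)[node] for k in range(1, len(mix_weights) + 1)]
    return np.asarray(mix_weights) @ np.stack(walks)

# Hypothetical toy graph: a path 0-1-2-3-4.
n = 5
A = np.zeros((n, n))
for u in range(n - 1):
    A[u, u + 1] = A[u + 1, u] = 1.0

shallow = mixture_influence(A, node=0, mix_weights=[1.0, 0.0, 0.0])  # all mass on 1 hop
deep = mixture_influence(A, node=0, mix_weights=[0.0, 0.0, 1.0])     # all mass on 3 hops
blend = mixture_influence(A, node=0, mix_weights=[0.2, 0.3, 0.5])
print(np.round(shallow, 3))
print(np.round(deep, 3))
print(np.round(blend, 3))
```

Shifting the weights toward shallow layers keeps the influence local, and shifting them toward deep layers spreads it out. A fixed-depth GCN is the special case where all the weight sits on the last layer.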
Interactive figure: which other nodes influence the representation of a selected node type (leaf or hub). GCN uses a fixed K-hop distribution; JK-Net uses an adaptive mixture.
Xu et al. evaluated JK-Net on node classification (Cora, Citeseer, Pubmed) and several social network datasets. The key insight is not the raw accuracy numbers but how accuracy changes with depth: in the table below, GCN loses accuracy going from K=2 to K=6, while the JK variants at K=6 outperform the shallow GCN.
| Dataset | GCN (K=2) | GCN (K=6) | JK-Concat (K=6) | JK-MaxPool (K=6) |
|---|---|---|---|---|
| Cora | 81.5% | 79.8% | 83.3% | 83.6% |
| Citeseer | 70.3% | 68.1% | 72.6% | 73.0% |
| Pubmed | 79.0% | 78.2% | 79.8% | 80.2% |
On the larger Reddit (social network) and PPI (protein-protein interaction) datasets, which have stronger long-range dependencies, the improvements are even larger. Hub nodes in social networks benefit particularly from the ability to selectively attend to shallow layers, avoiding the dilution caused by aggregating thousands of neighbors.
Several approaches tackle the "deep GCN" problem — the fact that standard GCN degrades with depth. JK-Net is one. Let's compare the strategies.
| Method | Strategy | Handles Oversmoothing | Per-Node Adaptation |
|---|---|---|---|
| GCN (baseline) | Fixed K layers, use last only | No | No |
| ResGCN | Skip connections: h^(k+1) += h^(k) | Partially | No |
| DenseGCN | All previous layers → current layer input | Better | No |
| JK-Net | All layers → final aggregation only | Yes | Yes (per node) |
| DropEdge | Random edge dropout during training | Partially | No |
| PairNorm | Normalize to prevent oversmoothing | Partially | No |
The key distinction: ResGCN improves the training of deep GCNs. JK-Net improves the representation available for prediction. These are complementary: you can combine JK-Net's final aggregation with residual connections between layers, and the combination often works well.
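One way to combine the two ideas is sketched below: residual connections between layers, followed by a JK-style max-pool over all layer outputs. It is a hedged illustration, not a reference implementation. The input projection `W_in`, the equal hidden widths (needed so the residual sum is shape-compatible), the random weights, and the function names are all assumptions.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def residual_jk_forward(A, X, W_in, weights):
    """Residual GCN layers, then max-pool over every layer's output."""
    A_norm = normalized_adjacency(A)
    h = X @ W_in                                  # project input to the hidden width
    outputs = []
    for W in weights:
        h = h + np.maximum(A_norm @ h @ W, 0.0)   # residual: h^(k+1) = h^(k) + GCN(h^(k))
        outputs.append(h)
    h_jk = np.stack(outputs, axis=0).max(axis=0)  # JK max-pool across layers
    return h_jk, outputs

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
d_hidden, K = 5, 4
W_in = rng.normal(size=(3, d_hidden))
weights = [rng.normal(scale=0.3, size=(d_hidden, d_hidden)) for _ in range(K)]

h_jk, outputs = residual_jk_forward(A, X, W_in, weights)
print(h_jk.shape)
```

The residual path affects how each layer's embedding is computed, while the max-pool only changes what the classifier gets to see, which is why the two mechanisms do not interfere with each other.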
JK-Net introduced multi-scale representation learning for graphs. The idea reappears in many subsequent architectures and remains a core design principle for graph neural networks on graphs whose local structure varies from node to node.
| Method | Key Idea | Relation to JK-Net |
|---|---|---|
| APPNP | Personalized PageRank as aggregation weights | Adaptive receptive field (different angle) |
| SIGN | Multi-scale precomputed features | JK-Net idea + SGC precomputation |
| MixHop | Mix features propagated over several adjacency powers (Â^0, Â^1, Â^2, ...) inside each layer | Multi-hop mixing within each layer (not just at the end) |
| DAGNN | Decouple propagation and transformation | Similar philosophy to JK + SGC |
| Design Space GNNs | Study of all GNN design choices | JK (skip connections) as one design dimension |