Li, Wang, Wang, Leskovec — NeurIPS 2020

Distance Encoding: More Powerful GNNs

Augment every node with its distance to a set of target nodes. This simple addition is provably more expressive than 1-WL, generalizes inductively, and closes the gap between graph neural networks and the Weisfeiler-Lehman hierarchy.

Prerequisites: What a graph is + Basic GNN intuition. No proofs required.
8
Chapters
4+
Simulations
2009.00142
arXiv

Chapter 0: The Problem

You're building a drug discovery pipeline. Your molecules are graphs — atoms are nodes, bonds are edges. You want to predict which atoms are part of a specific reactive site. The catch: many atoms are structurally identical within the molecule. Carbon atoms bonded to two hydrogens appear in dozens of positions. Standard GNNs assign them the same embedding. Your model has no way to tell them apart.

This isn't just a molecule problem. In social networks, you want to know if two specific users are likely to connect. In knowledge graphs, you need to know how a query node relates to the entities it might link to. In road networks, you need each node to know how far it is from key hubs. In all these cases, the question isn't "what does this node's neighborhood look like?" but "where is this node relative to something that matters?"

The root cause: Standard GNNs compute node embeddings by aggregating neighborhood information. Two nodes with identical k-hop neighborhoods get identical embeddings — always. This is the 1-Weisfeiler-Lehman (1-WL) ceiling. Position in the global graph structure is invisible to these models. They can describe the local texture of a neighborhood but cannot answer "where am I?"

The distance encoding paper asks a precise question: what is the minimal structural information you can add to a GNN to make it provably more expressive? The answer is elegant: distance to a set of target nodes. Give every node a feature vector encoding how far it is from each target node in the set S. This one addition breaks all 1-WL symmetries that prevent graph-level and pair-level structural distinction.

Distance is not arbitrary. Shortest-path distance between nodes is a canonical, interpretable, globally consistent structural signal. It answers "where am I relative to what matters?" — the question standard GNNs cannot ask.

The Structural Blind Spot

Two graphs that look different globally but have identical local structures at every node. A standard GNN assigns the same embeddings — it cannot tell the graphs apart. Click nodes to see their local neighborhoods.

What question can distance encoding answer that standard GNNs cannot?

Chapter 1: Distance as Feature

The core idea is simple enough to state in one sentence: for each node v in the graph, compute its shortest-path distance to every node in a designated set S of "target nodes," and append that distance vector to v's feature vector before running the GNN.

xvDE = [xv ‖ d(v, s1) ‖ d(v, s2) ‖ ... ‖ d(v, s|S|)]

Here, d(v, si) is the shortest-path distance from node v to target node si. The augmented feature xvDE is then fed into any standard GNN — GCN, GAT, GIN, whatever you prefer. No architecture change is needed. Distance encoding is a preprocessing step that makes any existing GNN more powerful.

Why shortest-path distance? Distance is the fundamental metric on graphs — the analogue of Euclidean distance in space. It satisfies the triangle inequality, is symmetric, and uniquely characterizes how "close" two nodes are in the graph topology. It also encodes cycle information: if d(u, v) = k, then u and v participate in a shared cycle of length ≥ 2k. Distance is not just a feature — it is structural position itself.

Distance must be encoded, not used raw. A distance of 5 is meaningfully different from a distance of 6 in a sparse graph; but in a dense graph, both might be "far." Two encodings are proposed:

Distance Vectors

Click a target node (orange star). Each node's color shows its distance to the target. Hover over nodes to see their distance feature vector.

What does the distance encoding vector for a node v contain?

Chapter 2: DE-GNN — The Showcase

DE-GNN is the complete pipeline: distance encoding preprocessing followed by a standard GNN. For link prediction between nodes u and v, the target set is S = {u, v}. For node classification, S might be a fixed set of "anchor" nodes. For graph classification, S is the entire node set (or a sampled subset).

Link Prediction: The Key Use Case

For predicting whether nodes u and v will link, DE-GNN extracts a subgraph around the pair and augments each node with its distances to u and v:

xwDE = [xw ‖ d(w, u) ‖ d(w, v)]   ∀ w in subgraph

Now every node "knows" how close it is to both endpoints. The GNN aggregates this information, and the final embeddings of u and v are combined to score the link. The key insight: even if u and v are structurally symmetric (same local neighborhoods), their distance vectors to themselves are always distinct — d(u, u) = 0 ≠ d(v, u) in general.

Connection to SEAL: DE-GNN with S = {u, v} and the MSD encoding is provably equivalent to SEAL's Double-Radius Node Labeling (DRNL) in terms of expressiveness. SEAL labels each node with its (d(w,u), d(w,v)) pair; DE-GNN uses these as continuous features. DE-GNN subsumes SEAL's labeling as a special case, and adds more via the MSD encoding.
DE-GNN for Link Prediction — Interactive

Select two nodes (source = orange, target = teal). The color intensity of each node shows its distance feature. Adjust GNN layers to see how information propagates. The link score bar updates in real time.

GNN layers 2
For link prediction between nodes u and v, what is the target set S in DE-GNN?

Chapter 3: Expressiveness Proof

The paper proves that DE-GNN is strictly more powerful than 1-WL GNNs. The proof proceeds by showing two things: (1) if 1-WL assigns identical colors to two graphs, DE-GNN can still distinguish them; (2) DE-GNN can detect certain graph properties — like cycle lengths — that 1-WL cannot.

The Key Theorem

Let G and G' be two graphs that are 1-WL indistinguishable (i.e., any standard GNN gives them the same representation). Then there exist graphs G'' such that DE-GNN distinguishes G from G'' even though 1-WL cannot.

Intuition for the proof: 1-WL fails because it cannot detect whether a node participates in a cycle. Consider a 6-cycle vs. two triangles sharing a node. To every node, the k-hop neighborhood looks the same. But if you pick any two nodes s, t as targets, their pairwise distances are different in the two graphs. The distance encoding breaks the symmetry that 1-WL cannot.

More formally, the paper shows that DE-GNN can compute any function that depends on the distance distribution between a target set S and the rest of the graph. This includes:

None of these are computable by standard GNNs. The proof is constructive: for each property, a DE-GNN is explicitly constructed that computes it correctly.

Expressiveness:   1-WL-GNN ⊊ DE-GNN

The "strictly less than" (⊊) is key — DE-GNN is not just different, it is provably more expressive in the formal sense.

What kind of structural property can DE-GNN compute that 1-WL GNNs cannot?

Chapter 4: Choosing the Target Set S

Distance encoding is parameterized by the choice of target set S. Different tasks call for different strategies:

TaskTarget Set SRationale
Link prediction (u,v){u, v}Encode structural position relative to both endpoints
Node classificationAll labeled nodesEach node knows its distance to the training signal sources
Graph classificationAll nodes (or sample)Global structural fingerprint
Community detectionCommunity seedsSoft community membership via distance to seeds
The anchor node strategy: For large graphs, using all nodes as targets is O(N²) in memory — infeasible. The solution: sample a fixed set of k "anchor" nodes and compute distances only to those k nodes. Each node gets a k-dimensional distance feature vector. If anchors are chosen to span the graph (e.g., high-betweenness centrality nodes or random samples), they provide good coverage of graph structure. This is the same idea as landmark-based shortest path algorithms in computational geometry.

How many targets are needed? Empirically, |S| = 10 to 50 anchor nodes suffices for graphs with thousands of nodes. The key is diversity of anchors — a single cluster of nearby anchors gives redundant information. Well-spread anchors maximize the structural information encoded.

Target Set Exploration

A graph with multiple possible target sets. Select a strategy and see how each node's distance feature vector (shown as colored bars per node) changes. More diverse targets → more discriminative embeddings.

Targets (k) 2
For large graphs, why can't you use all nodes as target set S?

Chapter 5: Results

DE-GNN was evaluated on link prediction and node classification benchmarks. The gains are consistent and sometimes dramatic — especially on tasks where global structure matters.

TaskDatasetMetricGINDE-GNNGain
Link predictionCoraAUC88.4%96.3%+7.9%
Link predictionCiteseerAUC89.7%97.1%+7.4%
Link predictionPower (structural)AUC63.1%91.2%+28.1%
Node classif.CoraAccuracy81.5%83.9%+2.4%
Subgraph countingPATTERNAccuracy85.2%92.4%+7.2%

The biggest gains appear on structural link prediction datasets (Power grid, USAir, etc.) where graph topology — not node features — determines link probability. These are exactly the cases where distance encoding provides new information unavailable to feature-only GNNs.

Comparison with SEAL: DE-GNN matches or exceeds SEAL (the prior state-of-the-art for link prediction) on most benchmarks, while being more interpretable and applicable to tasks beyond link prediction. SEAL requires extracting a subgraph for every candidate pair at test time — O(|E|) subgraphs. DE-GNN precomputes distances once and applies to all tasks. The precomputation pays off at scale.

On node classification with rich node features (Cora, Citeseer), gains are smaller because the node features already partially disambiguate nodes. This is the expected pattern: distance encoding helps most when structural position is the primary signal, less when node features are already highly discriminative.

Why does DE-GNN gain the most on structural link prediction datasets like Power grid?

Chapter 6: vs Positional Encoding

Distance encoding is often compared to positional encoding in Transformers — the technique of adding position information to token embeddings. The analogy is real but the implementations differ in important ways.

PropertyTransformer Positional Enc.Distance Encoding (DE)
Position defined byIndex in sequence (1D, fixed)Graph distance to target set (graph topology)
InductiveYes (within max length)Yes (works on any graph)
ExpressiveBreaks sequence symmetryBreaks WL symmetry (provably)
PrecomputationO(1) or O(n)O(N·|S|) BFS
Relative or absoluteBoth exist (absolute PE, relative PE)Relative to target set
Learned?Often yesFixed (distances) + learned (GNN)
The deeper analogy: In a Transformer, positional encoding tells each token "you are in position 7 of the sequence." In DE-GNN, distance encoding tells each node "you are 3 hops from target u and 5 hops from target v." Both convert a permutation-invariant architecture (attention / message passing) into one that is aware of structural position. The difference is that graph position is defined by topology, not by a linear ordering — and that's exactly what shortest-path distance captures.

More recent graph Transformers (like Graphormer, GPS-Transformer) have rediscovered this insight independently: they use shortest-path distances between all pairs of nodes as attention biases. This is essentially DE-GNN applied globally, with the full NxN distance matrix rather than distances to a small target set. The result is more powerful but O(N²) in memory — confirming that the target set abstraction is important for scalability.

Another comparison: Laplacian eigenvectors as positional encoding. Eigenvectors of the graph Laplacian provide smooth, global positional signals. They're coordinate-free (no need to choose target nodes) but not interpretable as distances, and they can be unstable (sign ambiguity). Distance encoding is discrete, interpretable, and task-specific — you choose targets based on the task at hand.

How is distance encoding in DE-GNN analogous to positional encoding in Transformers?

Chapter 7: Connections

Distance encoding sits at the crossroads of expressiveness theory, structural graph learning, and positional representations. Understanding its connections places it precisely in the research landscape.

MethodPosition SignalExpressive vs 1-WLTask
GCN/GINNone≤ 1-WLGeneral
SEALDRNL (pair distances)> 1-WLLink pred.
DE-GNNDistance to target set> 1-WL (proven)General
ID-GNNBinary ego flag> 1-WLGeneral
GraphormerFull pairwise distance> 1-WLGeneral
k-WL GNNk-tuples> 1-WLGeneral
Subsumption relationships: DE-GNN with S = all nodes subsumes SEAL's DRNL labeling. ID-GNN's binary flag is equivalent to DE-GNN with |S| = 1 and a threshold encoding (distance 0 or not). Graphormer's spatial encoding is DE-GNN with S = all nodes and direct distance biases in attention. The key contribution of the distance encoding paper is proving the expressiveness separation formally and providing a principled framework for choosing S based on the task.
Limitations: BFS distance computation is O(N+E) per target node — O(N·|S|) total. For dense graphs (E = O(N²)) or large target sets, this is expensive. The MSD encoding (random walk distances) is even more costly. For very large graphs (social networks with 10M+ nodes), approximate distances (landmark algorithms, Bourgain embedding) are necessary. Also, DE-GNN assumes distances are informative — in some synthetic tasks, the graph distance is not the relevant structural property, and other encodings might work better.

Related Papers

  • ID-GNN (AAAI 2021) — binary identity flag as a special case of distance encoding
  • SEAL (Zhang & Chen 2018) — link prediction via subgraph extraction + DRNL
  • Graphormer (Ying et al. 2021) — full distance matrix as attention bias

Key Paper

Li, Wang, Wang, Leskovec. "Distance Encoding: Design Provably More Powerful GNNs." NeurIPS 2020. arXiv:2009.00142

"Distance is the one graph metric that encodes global position locally." — the paper's core intuition