You, Gomes-Selman, Ying, Leskovec — Stanford, AAAI 2021

Identity-aware Graph Neural Networks

GNNs can't tell two nodes apart if they look the same. ID-GNN fixes this by giving every node a unique identity signal — no extra parameters, just a provably more expressive message-passing scheme.

Prerequisites: Basic graph theory + What a neural network is. No GNN experience required.
8
Chapters
4+
Simulations
2101.10320
arXiv

Chapter 0: The Problem

Imagine a graph shaped like a perfect 6-cycle: nodes A–B–C–D–E–F connected in a ring. Ask a standard GNN to compute an embedding for node A. It collects messages from B and F (its two neighbors), averages them, passes them up. Now do the same for node D. D collects messages from C and E, averages them, passes them up.

The result? A and D get identical embeddings. Of course they do — they have the same number of neighbors, the same local structure, the same everything. The GNN has no way to know it is operating on A versus D. As far as it is concerned, they are the same node.

This is not a bug. It's a fundamental limitation of how GNNs work. Standard GNNs aggregate neighborhood information. If two nodes have isomorphic neighborhoods, they get the same embedding — no matter what. This is called the 1-Weisfeiler-Lehman (1-WL) limitation: GNNs are at most as expressive as the 1-WL graph isomorphism test.

Why does this matter? Many graph tasks depend on node identity — who you are, not just what your neighborhood looks like. Consider:

The naive fix is to give every node a random unique ID as an extra feature. That works — but it breaks inductive generalization. If you train on graph G and test on graph G', the IDs from G mean nothing on G'. You need identity signals that are meaningful without being arbitrary.

The Symmetry Problem

A 6-cycle. Click any node to highlight it and see its 1-hop neighborhood. Notice that all nodes have identical neighborhoods — the GNN cannot distinguish them.

Why do standard GNNs assign identical embeddings to nodes A and D in a 6-cycle?

Chapter 1: Node Coloring

Graph theorists have studied the problem of distinguishing nodes for a century. The classic solution is graph coloring: assign a label (color) to each node such that different nodes get different labels if they play structurally different roles. The Weisfeiler-Lehman algorithm does exactly this — iteratively refining color assignments based on neighborhood colors until stable.

The 1-WL algorithm works like this: start with all nodes having the same color (say, "gray"). Each round, every node collects the colors of its neighbors, hashes them together with its own color, and gets a new color. Repeat until no colors change. Two nodes that end up with the same color are structurally equivalent from 1-WL's perspective.

Key insight: The 1-WL algorithm is exactly what a GNN computes, with learned hash functions instead of combinatorial ones. So "GNN expressiveness ≤ 1-WL" is not just an analogy — it's a formal theorem. Any two nodes that 1-WL assigns the same color will get the same GNN embedding, regardless of architecture or depth.

What breaks 1-WL? The 6-cycle is a perfect example. After any number of rounds, all 6 nodes remain the same color because every node has exactly two neighbors with the same color as each other. 1-WL cannot distinguish nodes that are in globally symmetric positions.

The fix ID-GNN proposes: heterogeneous coloring. Instead of starting with the same color for all nodes, give the "target node" (the node we're computing an embedding for) a special color — call it red. All other nodes stay gray. Now rerun the propagation. The red color propagates outward, creating a gradient of distinctiveness that tells each node exactly how far it is from the target.

colorv(0) = RED if v = target, else GRAY
colorv(k+1) = hash(colorv(k), {coloru(k) : u ∈ N(v)})
Heterogeneous Coloring

Click any node to make it the "target." Watch how the coloring propagates outward, making every node's position unique relative to the target.

Propagation depth 0
In heterogeneous coloring, what makes the target node's embedding different from all others?

Chapter 2: ID-GNN — The Algorithm

ID-GNN turns heterogeneous coloring into a concrete, differentiable GNN algorithm. The key idea: to compute node v's embedding, run a standard GNN on the entire graph — but mark v with an extra binary feature (identity flag = 1; all others = 0). This forces the GNN to produce a distinct output for v even if v is structurally symmetric with other nodes.

Formally, for target node v, augment each node u's feature vector:

huaug = [hu1[u = v]]

Then run the standard GNN on these augmented features. The resulting embedding hv(L) is the identity-aware embedding of v. The binary flag is the minimal identity signal — just one extra bit per node, but it breaks all symmetries in the graph.

The catch: This requires running the GNN N times — once per node — if you want all N node embeddings. That is N forward passes instead of one. ID-GNN-Fast addresses this: instead of full separate passes, just extract the diagonal of the stacked computation. In practice, a single forward pass with batched ego-networks achieves the same result much more efficiently.

Full Data Flow

Input
Graph G = (V, E) with node features X ∈ R|V|×d. Target: compute embedding for node v.
Augment
Append identity flag: xuaug = [xu ‖ 1[u=v]] for all u. One extra dimension.
Message Pass
Run L layers of GNN: hu(k+1) = σ(W1hu(k) + W2w∈N(u) hw(k)). The identity signal diffuses through the graph.
Read out
hv(L) is now unique to v — it encodes v's structural position AND v's identity.
ID-GNN Interactive

Select a target node. Adjust GNN depth to see how the identity signal propagates. The embedding bars show how each node's representation changes — the target's embedding is always unique.

GNN layers 1
What is the identity signal added to each node in ID-GNN?

Chapter 3: Ego-Network Extraction

Running the full graph GNN N times is expensive. But here's the key observation: when computing the embedding of node v using an L-layer GNN, only nodes within L hops of v matter. Everything further away is outside v's receptive field and contributes nothing to hv(L).

This motivates ego-network extraction: for target node v, extract the subgraph induced by all nodes within L hops of v. This is the L-hop ego-network of v, centered on v. Now run the augmented GNN on this much smaller subgraph.

What is an ego-network? An ego-network (ego-graph) is the subgraph containing a center node (the "ego") and all nodes within K hops, plus all edges between them. For K=1, it is the node and its immediate neighbors. For K=2, it adds the neighbors of neighbors. The ego-network grows exponentially with K but is often much smaller than the full graph.

The savings are dramatic in sparse graphs. If the average degree is d and you use L=3 layers, the ego-network has at most 1 + d + d² + d³ nodes. For a social network with d=10 and L=3, that is at most 1111 nodes — far smaller than a million-node graph.

For link prediction, you need embeddings for pairs (u, v). The ego-network approach means: extract the ego-network centered on u (to compute hu) and the ego-network centered on v (to compute hv), then score the pair. The computation is embarrassingly parallel — all ego-networks can be processed simultaneously in a batch.

Ego-Network Extraction

A larger graph. The highlighted region is the L-hop ego-network for the selected target node. Adjust L to see how many nodes are included.

Ego-network radius L 1
Why is ego-network extraction important for scaling ID-GNN?

Chapter 4: Inductive Coloring

The central challenge for any graph identity scheme: it must work inductively. Training on one graph, testing on another. If you assign each node a fixed ID (node 1, node 2, ...), that ID is meaningless on a new graph. Node 47 in the training graph has no relationship to node 47 in the test graph.

ID-GNN's approach is inherently inductive because the identity signal is relative, not absolute. The binary flag "am I the target?" is defined by the computation task, not by a global node ordering. When you run ID-GNN on a new graph, you simply mark whichever node you're computing for — the flag is always binary and always locally meaningful.

The magic of relative identity: Traditional node IDs say "I am node 47." The ID-GNN flag says "I am the node we care about right now." The first is a global, absolute label that doesn't transfer. The second is a local, computation-scoped label that works on any graph, any size, any structure. This is why ID-GNN generalizes inductively — the flag encodes role, not name.

This also means ID-GNN learns how to USE identity signals, not what the signals mean globally. The GNN learns: "if the identity flag is 1 nearby, here's how to weight that information." This learned strategy transfers perfectly to new graphs.

Compare to random node features (RNF) — another approach where each node gets a random d-dimensional vector as extra features. RNF also breaks symmetry, but the features are random at test time and random at training time — the model cannot learn stable patterns. ID-GNN's binary flag is deterministic and semantically clear: it encodes distance to the target. RNF is noise; ID-GNN is information.

python
# ID-GNN: inductive node embedding for target node v
def id_gnn_embed(graph, node_features, target_v, gnn_model):
    # Step 1: Extract ego-network (L-hop subgraph around target_v)
    ego_graph, ego_mask = extract_ego(graph, target_v, L=gnn_model.num_layers)

    # Step 2: Create identity flag — 1 for target, 0 for all others
    id_flag = (ego_graph.nodes == target_v).float().unsqueeze(-1)  # [N_ego, 1]

    # Step 3: Augment node features with identity flag
    aug_features = torch.cat([node_features[ego_mask], id_flag], dim=-1)  # [N_ego, d+1]

    # Step 4: Run standard GNN on augmented ego-graph
    embeddings = gnn_model(ego_graph, aug_features)  # [N_ego, h]

    # Step 5: Return embedding of the target node
    target_idx = (ego_graph.nodes == target_v).nonzero()[0]
    return embeddings[target_idx]  # [h] — unique to v, inductive
Why does ID-GNN generalize inductively (to new, unseen graphs)?

Chapter 5: Results

ID-GNN was evaluated on node classification, link prediction, and graph classification tasks. The key claim: adding identity information strictly improves GNN expressiveness, and this shows in practice.

TaskDatasetMetricGINID-GNNGain
Link predictionogbl-collabHits@5051.4%62.6%+11.2%
Link predictionogbl-ddiHits@2037.1%55.9%+18.8%
Graph classificationEXP (WL-hard)Accuracy50.0%100.0%+50.0%
Graph classificationCEXPAccuracy50.0%100.0%+50.0%
Node classificationogbn-arxivAccuracy71.7%72.4%+0.7%

The most dramatic improvements are on the EXP and CEXP datasets — synthetic graphs specifically designed to be hard for 1-WL tests. Standard GNNs score exactly 50% (random chance) because they cannot distinguish any nodes. ID-GNN scores 100%: the identity signal is exactly what's needed to break the 1-WL limitation on these graphs.

Real-world gains matter too. Link prediction improvements of 11-19% on OGB benchmarks are large — these are competitive leaderboards with significant engineering. The gains come purely from adding a single bit of identity information, with the same GNN architecture and the same number of trainable parameters. The identity flag is free information.

Where does ID-GNN NOT help much? Node classification tasks where nodes are already distinguishable by their features (text, chemistry). If node A has a unique feature vector, the standard GNN can already tell it apart from node D without any identity signal. The gains are largest where graph structure is the primary signal and features are homogeneous.

Why does ID-GNN score 100% on EXP/CEXP while GIN scores 50%?

Chapter 6: vs GIN

Graph Isomorphism Network (GIN) is the most expressive 1-WL GNN — it achieves the theoretical maximum expressiveness within the 1-WL class by using injective (sum) aggregation with MLP transformation. GIN is the strongest baseline ID-GNN needs to beat.

The GIN update rule:

hv(k) = MLP(k)((1 + ε(k)) · hv(k-1) + ∑u∈N(v) hu(k-1))

The sum aggregation (not mean, not max) is key — it preserves multiset information. Two neighborhoods with different numbers of copies of the same node feature produce different sums. GIN is as expressive as 1-WL can be.

But that's not enough. GIN being 1-WL-complete means it still fails on graphs that 1-WL cannot distinguish. ID-GNN goes strictly beyond 1-WL. In graph theory terms: 1-WL ⊂ ID-GNN in terms of the graph properties each can distinguish. Any property that 1-WL can detect, ID-GNN can detect. But ID-GNN can detect properties that 1-WL cannot — specifically, properties involving cycle structures and distance-dependent patterns.
PropertyGIN (1-WL)ID-GNN
Node degreeYesYes
Triangle membershipNoYes
Cycle lengthPartiallyYes
Shortest path between pairNoYes
Graph diameterNoYes
Inductive to new graphsYesYes
Extra parameters neededNoNo

ID-GNN matches GIN on everything GIN can do, and adds strictly more. The cost: O(N) ego-network computations instead of O(1). ID-GNN-Fast recovers much of the efficiency by processing all ego-networks in a single batched forward pass, achieving roughly 2-4x runtime overhead over standard GNN — acceptable for the expressiveness gain.

What is the relationship between GIN's expressiveness and ID-GNN's expressiveness?

Chapter 7: Connections

ID-GNN sits at the intersection of several active research threads. Understanding these connections helps place it in the broader landscape of expressive graph learning.

MethodIdentity SignalInductive?Extra Cost
Standard GNN (GCN, GIN)NoneYesNone
Random Node FeaturesRandom d-dim vectorNo+d params
Distance Encoding (DE)Distance to target setYesMedium
ID-GNNBinary ego flagYesO(N) passes
SEAL (subgraph)DRNL node labelingYesO(N) subgraphs
k-WL GNNsk-tuple node featuresYesO(Nk)
SEAL and Distance Encoding are closely related to ID-GNN. SEAL (Zhang & Chen 2018) extracts a subgraph around a target pair and uses a special node labeling (DRNL) that encodes each node's distance to both endpoints — essentially a more structured form of identity. Distance Encoding (this reading list's next paper) generalizes this to multiple target nodes. All three approaches break 1-WL symmetry through structural position information. ID-GNN is perhaps the simplest: one bit per node, same GNN architecture, minimal overhead.

On the theoretical side, ID-GNN is related to the folklore Weisfeiler-Lehman (FWL) and k-WL hierarchy. While higher-order WL tests achieve greater expressiveness, they require processing k-tuples of nodes — exponential in k. ID-GNN achieves a sweet spot: provably more expressive than 1-WL without exponential cost.

Limitations to know: ID-GNN still cannot distinguish all non-isomorphic graphs — there exist pairs that even FWL cannot distinguish. The O(N) computation overhead is manageable for medium graphs but becomes expensive for million-node graphs. For very large graphs, sampling-based ID-GNN (random subset of target nodes) or ID-GNN-Fast's diagonal trick are necessary. Also, the binary flag adds only 1 dimension — if you want richer identity signals, Distance Encoding is the natural extension.

Go Deeper

  • Distance Encoding (NeurIPS 2020) — richer positional signals using distance to target sets
  • SEAL (Zhang & Chen 2018) — subgraph-based link prediction with DRNL labeling
  • GIN (Xu et al. 2018) — understanding what 1-WL GNNs can and cannot do

Key Paper

You, Gomes-Selman, Ying, Leskovec. "Identity-aware Graph Neural Networks." AAAI 2021. arXiv:2101.10320

"What a neural network cannot distinguish, it cannot reason about." — paraphrase of the WL theorem's lesson