Identity-aware Graph Neural Networks

Chapter 0: The Problem

Imagine a graph shaped like a perfect 6-cycle: nodes A–B–C–D–E–F connected in a ring. Ask a standard GNN to compute an embedding for node A. It collects messages from B and F (its two neighbors), averages them, passes them up. Now do the same for node D. D collects messages from C and E, averages them, passes them up.

The result? A and D get identical embeddings. Of course they do — they have the same number of neighbors, the same local structure, the same everything. The GNN has no way to know it is operating on A versus D. As far as it is concerned, they are the same node.

This is not a bug. It's a fundamental limitation of how GNNs work. Standard message-passing GNNs aggregate neighborhood information, so two nodes whose neighborhoods look the same at every depth (identical computation trees) get the same embedding, no matter what the weights are. This is called the 1-Weisfeiler-Lehman (1-WL) limitation: such GNNs are at most as expressive as the 1-WL graph isomorphism test.
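The collapse is easy to reproduce. The sketch below is a toy mean-aggregation network in plain Python; the scalar weights `w_self` and `w_nbr` are hand-picked assumptions for illustration, not anything from the paper. Nodes 0 to 5 stand in for A to F, and every node starts with the same feature.

```python
# Toy mean-aggregation GNN on a 6-cycle. Weights are hand-picked for illustration.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}  # nodes 0..5 stand for A..F

def mean_gnn(adj, features, num_layers, w_self=0.5, w_nbr=0.5):
    """Each layer mixes a node's scalar state with the mean of its neighbors' states."""
    h = dict(features)
    for _ in range(num_layers):
        h = {
            v: max(0.0, w_self * h[v] + w_nbr * sum(h[u] for u in adj[v]) / len(adj[v]))
            for v in adj
        }
    return h

embeddings = mean_gnn(ring, {v: 1.0 for v in ring}, num_layers=3)
print(embeddings[0] == embeddings[3])  # True: A and D are indistinguishable
print(len(set(embeddings.values())))   # 1: a single embedding shared by all six nodes
```

Changing the weights or the depth changes the shared value but never separates the nodes, because every node runs exactly the same computation.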

Why does this matter? Many graph tasks depend on node identity — who you are, not just what your neighborhood looks like. Consider:

Link prediction: Will node A link to node D? If both get the same embedding, the model cannot distinguish "predict a link from A" from "predict a link from D." The two nodes are identical to the model even if they're in completely different parts of the graph.
Graph classification: Is this graph a single 6-cycle or two disjoint triangles? Every node in either graph has exactly two neighbors, so a standard GNN assigns all nodes the same embedding in both graphs and cannot detect the difference.
Counting structures: How many triangles does node A participate in? Without identity information, a GNN cannot tell whether a message arriving after three hops originally came from A itself, so it cannot count triangles or longer cycles through A.

The naive fix is to give every node a random unique ID as an extra feature. That works — but it breaks inductive generalization. If you train on graph G and test on graph G', the IDs from G mean nothing on G'. You need identity signals that are meaningful without being arbitrary.

The Symmetry Problem

Interactive figure: a 6-cycle in which selecting a node highlights its 1-hop neighborhood. All nodes have identical neighborhoods, so the GNN cannot distinguish them.

Why do standard GNNs assign identical embeddings to nodes A and D in a 6-cycle?

Both nodes have isomorphic 1-hop (and k-hop) neighborhoods, so the aggregation produces the same result for any number of GNN layers
The GNN weights are randomly initialized with the same seed for both nodes
6-cycles are too small for GNNs to process correctly

Chapter 1: Node Coloring

Graph theorists have studied the problem of distinguishing nodes for decades. The classic tool is color refinement: assign a label (color) to each node so that nodes playing structurally different roles end up with different labels. The Weisfeiler-Lehman algorithm does exactly this, iteratively refining color assignments based on neighborhood colors until they stabilize.

The 1-WL algorithm works like this: start with all nodes having the same color (say, "gray"). Each round, every node collects the colors of its neighbors, hashes them together with its own color, and gets a new color. Repeat until no colors change. Two nodes that end up with the same color are structurally equivalent from 1-WL's perspective.

Key insight: A message-passing GNN computes the same kind of update as 1-WL, with learned, continuous functions in place of the combinatorial hash. So "GNN expressiveness ≤ 1-WL" is not just an analogy; it is a formal theorem. Any two nodes that start with the same features and receive the same 1-WL color will get the same GNN embedding, regardless of weights or depth.

What breaks 1-WL? The 6-cycle is a perfect example. After any number of rounds, all 6 nodes remain the same color, because every node sees the same thing in every round: its own color plus two neighbors of that same color. 1-WL cannot distinguish nodes that sit in symmetric positions.
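A minimal sketch of this refinement, assuming a plain adjacency-dict graph and a fixed number of rounds instead of a stability check. Colors are kept as nested tuples so they can be compared across graphs; the function name `wl_colors` is illustrative. The same refinement is also run on two disjoint triangles, a graph that 1-WL confuses with the 6-cycle.

```python
from collections import Counter

def wl_colors(adj, rounds=3):
    """1-WL color refinement for a fixed number of rounds.

    `adj` maps each node to a list of neighbors. A color is a nested tuple
    (own previous color, sorted neighbor colors), which plays the role of the hash.
    """
    colors = {v: "gray" for v in adj}
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return colors

six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

hist_cycle = Counter(wl_colors(six_cycle).values())
hist_triangles = Counter(wl_colors(two_triangles).values())

print(len(hist_cycle))               # 1: all six nodes share one color
print(hist_cycle == hist_triangles)  # True: 1-WL cannot separate the two graphs
```

Nested tuples grow quickly with the number of rounds, so a practical implementation would compress each round's colors to integers; the tuple form is used here only because it keeps the comparison across graphs trivial.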

The fix ID-GNN proposes: heterogeneous coloring. Instead of starting with the same color for all nodes, give the "target node" (the node we're computing an embedding for) a special color, say red. All other nodes stay gray. Now rerun the propagation. The red color spreads outward one hop per round, so each node's color comes to reflect where it sits relative to the target, starting with how far away it is.

color_v^(0) = RED if v = target, else GRAY

color_v^(k+1) = hash(color_v^(k), {color_u^(k) : u ∈ N(v)})
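The two update rules above can be sketched directly. This is an illustrative plain-Python version (the names `refine` and `color_classes` are assumptions, and nested tuples stand in for the hash). It shows which nodes of a 6-cycle still share a color once node 0 is the target, and that the target's own color differs between a 6-cycle and a triangle after three rounds.

```python
def refine(adj, target, rounds):
    """Heterogeneous coloring: the target starts RED, everyone else GRAY."""
    colors = {v: ("RED" if v == target else "GRAY") for v in adj}
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return colors

def color_classes(colors):
    """Group nodes that ended up with the same color."""
    groups = {}
    for v, c in colors.items():
        groups.setdefault(c, []).append(v)
    return sorted(sorted(g) for g in groups.values())

six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

classes = color_classes(refine(six_cycle, target=0, rounds=3))
print(classes)  # [[0], [1, 5], [2, 4], [3]]

target_in_cycle = refine(six_cycle, target=0, rounds=3)[0]
target_in_triangle = refine(two_triangles, target=0, rounds=3)[0]
print(target_in_cycle != target_in_triangle)  # True
```

Nodes at the same distance on either side of the target (1 and 5, 2 and 4) still share a color, which is expected: they are mirror images of each other once the target is fixed. The target itself is always alone in its class.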

Heterogeneous Coloring

Interactive figure: selecting a node makes it the "target," and increasing the propagation depth shows the coloring spread outward so that each node's color reflects its position relative to the target.

In heterogeneous coloring, what makes the target node's embedding different from all others?

The target gets a larger hidden dimension
The target node is initialized with a special color that propagates outward, making each node's position relative to the target structurally unique
The target node is given a random embedding drawn from a different distribution

Chapter 2: ID-GNN — The Algorithm

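ID-GNN turns heterogeneous coloring into a concrete, differentiable GNN algorithm. The key idea: to compute node v's embedding, run a GNN over the graph with v marked by an extra binary feature (identity flag = 1 for v, 0 for every other node). The mark lets message passing tell v apart from the rest of its neighborhood, so the output for v can reflect structure, such as cycles through v, that a standard GNN cannot see. (The paper realizes the mark through heterogeneous message passing, applying a separate message function to the target node; the binary-flag view used here is a simplification of the same idea.)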

Formally, for target node v, augment each node u's input feature vector:

x_u^aug = [x_u ‖ 1[u = v]]

Then run the standard GNN with h_u^(0) = x_u^aug. The resulting embedding h_v^(L) is the identity-aware embedding of v. The binary flag is a minimal identity signal: just one extra bit per node, yet enough to break the symmetry between v and every other node in its neighborhood.

The catch: This requires running the GNN N times, once per node, if you want all N node embeddings. That is N forward passes instead of one. ID-GNN-Fast addresses this by skipping the separate passes: it precomputes cycle counts for every node, read off the diagonals of powers of the adjacency matrix, and feeds them to an ordinary GNN as extra node features. The full ID-GNN can also be made practical by batching the ego-networks of many targets into a single forward pass.
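The diagonal in question can be made concrete. Entry (v, v) of the k-th power of the adjacency matrix A counts the closed walks of length k that start and end at v, and ID-GNN-Fast uses these counts as extra node features. The sketch below computes them in plain Python; the helper names are illustrative, and a real implementation would use sparse matrix products rather than dense lists.

```python
def adjacency(num_nodes, edges):
    """Dense 0/1 adjacency matrix of an undirected graph."""
    a = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1
    return a

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def closed_walk_features(adj_matrix, max_len):
    """Row v holds diag(A^k)[v] for k = 1..max_len."""
    n = len(adj_matrix)
    power = [[int(i == j) for j in range(n)] for i in range(n)]  # A^0 = identity
    feats = [[] for _ in range(n)]
    for _ in range(max_len):
        power = matmul(power, adj_matrix)
        for v in range(n):
            feats[v].append(power[v][v])
    return feats

six_cycle = adjacency(6, [(i, (i + 1) % 6) for i in range(6)])
two_triangles = adjacency(6, [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)])

cycle_feats = closed_walk_features(six_cycle, 3)
triangle_feats = closed_walk_features(two_triangles, 3)
print(cycle_feats[0])     # [0, 2, 0]: no way back to the start in three steps
print(triangle_feats[0])  # [0, 2, 2]: two closed walks around the triangle
```

The length-3 count is what separates a node on a triangle from a node on a 6-cycle, even though both have degree 2.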

Full Data Flow

Input

Graph G = (V, E) with node features X ∈ R^(|V|×d). Target: compute embedding for node v.

↓

Augment

Append identity flag: x_u^aug = [x_u ‖ 1[u=v]] for all u. One extra dimension.

↓

Message Pass

Run L layers of GNN: h_u^(k+1) = σ(W₁h_u^(k) + W₂ ∑_{w∈N(u)} h_w^(k)), starting from h_u^(0) = x_u^aug. The identity signal diffuses through the graph.

↓

Read out

h_v^(L) is the identity-aware embedding of v: it encodes v's structural position, including patterns such as cycles through v that only become visible once v is marked.
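The four stages above can be traced end to end on a tiny example. This is a sketch under stated assumptions: plain Python lists instead of tensors, ReLU as σ, and hand-picked weight matrices `W1` and `W2` rather than learned ones. It embeds node 0 of a 6-cycle and node 0 of a triangle, with and without the identity flag.

```python
def matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def id_gnn_embed(adj, features, target, w_self, w_nbr, num_layers, use_identity=True):
    """Embed `target` with h_u <- ReLU(W1 h_u + W2 * sum of neighbor states)."""
    # Augment: append the identity flag (all zeros when use_identity is False).
    h = {
        u: list(features[u]) + [1.0 if (use_identity and u == target) else 0.0]
        for u in adj
    }
    # Message pass: sum aggregation over neighbors, then a linear map and ReLU.
    for _ in range(num_layers):
        new_h = {}
        for u in adj:
            agg = [sum(h[w][i] for w in adj[u]) for i in range(len(h[u]))]
            pre = [s + n for s, n in zip(matvec(w_self, h[u]), matvec(w_nbr, agg))]
            new_h[u] = [max(0.0, z) for z in pre]
        h = new_h
    # Read out the target's state.
    return h[target]

W1 = [[1.0, 0.0], [0.0, 1.0]]  # illustrative weights, not learned
W2 = [[0.5, 0.0], [0.0, 0.5]]
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
ones = {u: [1.0] for u in range(6)}  # identical input features everywhere

plain_cycle = id_gnn_embed(six_cycle, ones, 0, W1, W2, 3, use_identity=False)
plain_triangle = id_gnn_embed(two_triangles, ones, 0, W1, W2, 3, use_identity=False)
id_cycle = id_gnn_embed(six_cycle, ones, 0, W1, W2, 3)
id_triangle = id_gnn_embed(two_triangles, ones, 0, W1, W2, 3)

print(plain_cycle, plain_triangle)  # [8.0, 0.0] [8.0, 0.0]
print(id_cycle, id_triangle)        # [8.0, 2.5] [8.0, 2.75]
```

Without the flag the two nodes are indistinguishable. With it, the flag channel picks up the signal that returns to the target around the triangle after three hops, which is why three layers are needed to see the difference.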

ID-GNN Interactive

Interactive figure: selecting a target node and adjusting the number of GNN layers shows how the identity signal propagates. Embedding bars show how each node's representation changes; the target's embedding is always distinct.

What is the identity signal added to each node in ID-GNN?

A binary flag: 1 if the node is the current target, 0 otherwise, appended to the feature vector
A random d-dimensional vector sampled fresh for each forward pass
The node's index in the adjacency matrix, cast to a float

Chapter 3: Ego-Network Extraction

Running the full graph GNN N times is expensive. But here's the key observation: when computing the embedding of node v using an L-layer GNN, only nodes within L hops of v matter. Everything further away is outside v's receptive field and contributes nothing to h_v^(L).

This motivates ego-network extraction: for target node v, extract the subgraph induced by all nodes within L hops of v. This is the L-hop ego-network of v, centered on v. Now run the augmented GNN on this much smaller subgraph.

What is an ego-network? An ego-network (ego-graph) is the subgraph containing a center node (the "ego") and all nodes within K hops, plus all edges between them. For K=1, it is the node and its immediate neighbors. For K=2, it adds the neighbors of neighbors. The ego-network can grow exponentially with K, but in sparse graphs it is often much smaller than the full graph.
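A minimal sketch of the extraction, assuming an adjacency-dict graph and a breadth-first search that stops at the chosen radius. The names `ego_network` and `sum_gnn` are illustrative; `sum_gnn` is a toy two-parameter network used only to check that the target's output is unchanged when the rest of the graph is cut away.

```python
from collections import deque

def ego_network(adj, center, num_hops):
    """Return the induced subgraph on all nodes within `num_hops` of `center`."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == num_hops:
            continue  # do not expand past the radius
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return {u: [w for w in adj[u] if w in dist] for u in dist}

def sum_gnn(adj, num_layers):
    """Toy GNN: h_u <- h_u + 0.5 * sum of neighbor states, starting from 1.0."""
    h = {u: 1.0 for u in adj}
    for _ in range(num_layers):
        h = {u: h[u] + 0.5 * sum(h[w] for w in adj[u]) for u in adj}
    return h

# A path 0 - 1 - ... - 9 as the "full" graph.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
ego = ego_network(path, center=4, num_hops=2)

print(sorted(ego))                                # [2, 3, 4, 5, 6]
print(sum_gnn(path, 2)[4] == sum_gnn(ego, 2)[4])  # True
```

Only the center's output is guaranteed to match. Nodes on the boundary of the ego-network have lost some neighbors, so their own states differ from the full-graph run, but those states never reach the center within the given number of layers.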

The savings are dramatic in sparse graphs. If node degrees are around d and you use L=3 layers, the ego-network has on the order of 1 + d + d² + d³ nodes (a strict upper bound when d is the maximum degree). For a social network with d=10 and L=3, that is roughly 1,111 nodes, far smaller than a million-node graph.
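The arithmetic generalizes to any degree and depth. A short helper (the name `ego_size_bound` is illustrative) that evaluates the sum 1 + d + d² + ... + d^L, treating d as a maximum degree:

```python
def ego_size_bound(max_degree, num_hops):
    """Upper bound 1 + d + d^2 + ... + d^L on the nodes within L hops."""
    return sum(max_degree ** k for k in range(num_hops + 1))

bound = ego_size_bound(max_degree=10, num_hops=3)
print(bound)                   # 1111
print(ego_size_bound(10, 4))   # 11111: one more layer is ten times more nodes
```

The bound is loose, since it counts every neighbor as new at each hop, but it shows how quickly the cost grows with depth.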

For link prediction, you need embeddings for pairs (u, v). The ego-network approach means: extract the ego-network centered on u (to compute h_u) and the ego-network centered on v (to compute h_v), then score the pair. The computation is embarrassingly parallel — all ego-networks can be processed simultaneously in a batch.

Ego-Network Extraction

Interactive figure: a larger graph with the L-hop ego-network of the selected target node highlighted. Increasing the radius L shows how many more nodes are included.

Why is ego-network extraction important for scaling ID-GNN?

It reduces the number of message-passing layers needed
It allows gradient checkpointing to reduce memory
An L-layer GNN only uses nodes within L hops of the target, so computing on the full graph wastes computation; the ego-network is the minimal sufficient subgraph

Chapter 4: Inductive Coloring

The central challenge for any graph identity scheme: it must work inductively. Training on one graph, testing on another. If you assign each node a fixed ID (node 1, node 2, ...), that ID is meaningless on a new graph. Node 47 in the training graph has no relationship to node 47 in the test graph.

ID-GNN's approach is inherently inductive because the identity signal is relative, not absolute. The binary flag "am I the target?" is defined by the computation task, not by a global node ordering. When you run ID-GNN on a new graph, you simply mark whichever node you're computing for — the flag is always binary and always locally meaningful.

The magic of relative identity: Traditional node IDs say "I am node 47." The ID-GNN flag says "I am the node we care about right now." The first is a global, absolute label that doesn't transfer. The second is a local, computation-scoped label that works on any graph, any size, any structure. This is why ID-GNN generalizes inductively — the flag encodes role, not name.

This also means ID-GNN learns how to use identity signals, not what the signals mean globally. The GNN learns: "if the identity flag is 1 nearby, here's how to weight that information." Because this learned rule never refers to a specific node, it carries over to new graphs.

Compare to random node features (RNF), another approach where each node gets a random d-dimensional vector as extra features. RNF also breaks symmetry, but the features are redrawn at random in both training and testing, so the model has no fixed pattern to latch onto and its outputs vary from draw to draw. ID-GNN's binary flag is deterministic and semantically clear: it marks the target, and after message passing it tells every other node where it sits relative to that target. RNF is noise; ID-GNN is information.

python
# ID-GNN: inductive node embedding for target node v
# Assumes a PyTorch Geometric-style `graph` with an `edge_index` tensor and a
# `gnn_model(x, edge_index)` that exposes `num_layers`; both are illustrative.
import torch
from torch_geometric.utils import k_hop_subgraph

def id_gnn_embed(graph, node_features, target_v, gnn_model):
    # Step 1: Extract ego-network (L-hop subgraph around target_v).
    # `subset` lists the ego nodes' original ids; `mapping` is target_v's row in it.
    subset, ego_edge_index, mapping, _ = k_hop_subgraph(
        target_v, gnn_model.num_layers, graph.edge_index,
        relabel_nodes=True, num_nodes=node_features.size(0),
    )

    # Step 2: Create identity flag: 1 for target, 0 for all others
    id_flag = torch.zeros(subset.size(0), 1, dtype=node_features.dtype,
                          device=node_features.device)  # [N_ego, 1]
    id_flag[mapping] = 1.0

    # Step 3: Augment node features with identity flag
    aug_features = torch.cat([node_features[subset], id_flag], dim=-1)  # [N_ego, d+1]

    # Step 4: Run standard GNN on augmented ego-graph
    embeddings = gnn_model(aug_features, ego_edge_index)  # [N_ego, h]

    # Step 5: Return embedding of the target node
    return embeddings[mapping.item()]  # [h], identity-aware and inductive

Why does ID-GNN generalize inductively (to new, unseen graphs)?

The identity signal is a relative, task-scoped binary flag: it encodes "am I the target?" which is meaningful on any graph, not an absolute global node ID that only makes sense on the training graph
ID-GNN trains separate models for each graph in the test set
The ego-network size is fixed, so new graphs always have the same structure

Chapter 5: Results

ID-GNN was evaluated on node classification, link prediction, and graph classification tasks. The key claim: adding identity information strictly improves GNN expressiveness, and this shows in practice.

Task	Dataset	Metric	GIN	ID-GNN	Gain
Link prediction	ogbl-collab	Hits@50	51.4%	62.6%	+11.2%
Link prediction	ogbl-ddi	Hits@20	37.1%	55.9%	+18.8%
Graph classification	EXP (WL-hard)	Accuracy	50.0%	100.0%	+50.0%
Graph classification	CEXP	Accuracy	50.0%	100.0%	+50.0%
Node classification	ogbn-arxiv	Accuracy	71.7%	72.4%	+0.7%

The most dramatic improvements are on the EXP and CEXP datasets, synthetic graphs specifically designed to be hard for 1-WL tests. Standard GNNs score exactly 50% (random chance) because they cannot tell the differently labeled graphs apart. ID-GNN scores 100%: the identity signal is exactly what's needed to break the 1-WL limitation on these graphs.

Real-world gains matter too. Link prediction improvements of 11 to 19 percentage points on OGB benchmarks are large, since these are competitive leaderboards with significant engineering behind them. The gains come from adding a single bit of identity information, with the same GNN architecture and essentially the same number of trainable parameters (one extra input dimension). The identity flag is nearly free in parameters, though not in compute.

Where does ID-GNN NOT help much? Node classification tasks where nodes are already distinguishable by their features (text, chemistry). If node A has a unique feature vector, the standard GNN can already tell it apart from node D without any identity signal. The gains are largest where graph structure is the primary signal and features are homogeneous.

Why does ID-GNN score 100% on EXP/CEXP while GIN scores 50%?

EXP/CEXP are designed to have graphs that are indistinguishable by 1-WL, so GIN (≤ 1-WL) fails; ID-GNN breaks 1-WL symmetry with the identity flag, successfully distinguishing all graphs
ID-GNN uses larger hidden dimensions than GIN on these datasets
The EXP dataset contains node features that only ID-GNN can read

Chapter 6: vs GIN

Graph Isomorphism Network (GIN) is the most expressive GNN within the 1-WL class: it reaches the theoretical ceiling of 1-WL by using injective (sum) aggregation followed by an MLP. GIN is the strongest baseline ID-GNN needs to beat.

The GIN update rule:

h_v^(k) = MLP^(k)((1 + ε^(k)) · h_v^(k-1) + ∑_{u∈N(v)} h_u^(k-1))

The sum aggregation (not mean, not max) is key — it preserves multiset information. Two neighborhoods with different numbers of copies of the same node feature produce different sums. GIN is as expressive as 1-WL can be.
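The multiset point is quick to check. The sketch below compares the three aggregators on two neighborhoods that differ only in how many copies of the same feature they contain, and then applies one GIN-style update for scalar states. The names `aggregators` and `gin_update` are illustrative, and a ReLU stands in for the MLP.

```python
def aggregators(neighbor_states):
    """Sum, mean and max of a neighborhood's scalar states."""
    return {
        "sum": sum(neighbor_states),
        "mean": sum(neighbor_states) / len(neighbor_states),
        "max": max(neighbor_states),
    }

def gin_update(h_v, neighbor_states, mlp, eps=0.0):
    """One GIN step for scalar states: MLP((1 + eps) * h_v + sum of neighbors)."""
    return mlp((1.0 + eps) * h_v + sum(neighbor_states))

two_copies = aggregators([1.0, 1.0])
three_copies = aggregators([1.0, 1.0, 1.0])
print(two_copies)    # {'sum': 2.0, 'mean': 1.0, 'max': 1.0}
print(three_copies)  # {'sum': 3.0, 'mean': 1.0, 'max': 1.0}

relu = lambda z: max(0.0, z)  # stand-in for the MLP
out_two = gin_update(1.0, [1.0, 1.0], relu)
out_three = gin_update(1.0, [1.0, 1.0, 1.0], relu)
print(out_two, out_three)  # 3.0 4.0
```

Mean and max collapse the two neighborhoods to the same value; only the sum keeps them apart, and the GIN update inherits that separation.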

But that's not enough. Matching 1-WL means GIN still fails on graphs that 1-WL cannot distinguish. ID-GNN goes strictly beyond 1-WL: any property that 1-WL can detect, ID-GNN can detect, but ID-GNN can also detect properties that 1-WL cannot, specifically properties involving cycle structures and distance-dependent patterns. In set notation, 1-WL ⊂ ID-GNN in terms of the graph properties each can distinguish.

Property	GIN (1-WL)	ID-GNN
Node degree	Yes	Yes
Triangle membership	No	Yes
Cycle length	Partially	Yes
Shortest path between pair	No	Yes
Graph diameter	No	Yes
Inductive to new graphs	Yes	Yes
Extra parameters needed	No	No

ID-GNN matches GIN on everything GIN can do, and adds strictly more. The cost: N ego-network passes, one per node, instead of a single pass over the graph. Batching the ego-networks into one forward pass, or switching to ID-GNN-Fast, recovers much of the efficiency, for roughly 2-4x runtime overhead over a standard GNN, which is acceptable for the expressiveness gain.

What is the relationship between GIN's expressiveness and ID-GNN's expressiveness?

They are equal: both are 1-WL complete
GIN is strictly more expressive because it uses injective aggregation
ID-GNN is strictly more expressive: it can distinguish all graph pairs that GIN can, plus graph pairs that are indistinguishable by 1-WL (like certain cycle structures)

Chapter 7: Connections

ID-GNN sits at the intersection of several active research threads. Understanding these connections helps place it in the broader landscape of expressive graph learning.

Method	Identity Signal	Inductive?	Extra Cost
Standard GNN (GCN, GIN)	None	Yes	None
Random Node Features	Random d-dim vector	No	+d params
Distance Encoding (DE)	Distance to target set	Yes	Medium
ID-GNN	Binary ego flag	Yes	O(N) passes
SEAL (subgraph)	DRNL node labeling	Yes	O(N) subgraphs
k-WL GNNs	k-tuple node features	Yes	O(N^k)

SEAL and Distance Encoding are closely related to ID-GNN. SEAL (Zhang & Chen 2018) extracts a subgraph around a target pair and uses a special node labeling (DRNL) that encodes each node's distance to both endpoints — essentially a more structured form of identity. Distance Encoding (this reading list's next paper) generalizes this to multiple target nodes. All three approaches break 1-WL symmetry through structural position information. ID-GNN is perhaps the simplest: one bit per node, same GNN architecture, minimal overhead.

On the theoretical side, ID-GNN is related to the folklore Weisfeiler-Lehman (FWL) and k-WL hierarchy. While higher-order WL tests achieve greater expressiveness, they require processing k-tuples of nodes — exponential in k. ID-GNN achieves a sweet spot: provably more expressive than 1-WL without exponential cost.

Limitations to know: ID-GNN still cannot distinguish all non-isomorphic graphs — there exist pairs that even FWL cannot distinguish. The O(N) computation overhead is manageable for medium graphs but becomes expensive for million-node graphs. For very large graphs, sampling-based ID-GNN (random subset of target nodes) or ID-GNN-Fast's diagonal trick are necessary. Also, the binary flag adds only 1 dimension — if you want richer identity signals, Distance Encoding is the natural extension.

Go Deeper

Distance Encoding (NeurIPS 2020) — richer positional signals using distance to target sets
SEAL (Zhang & Chen 2018) — subgraph-based link prediction with DRNL labeling
GIN (Xu et al. 2018) — understanding what 1-WL GNNs can and cannot do

Key Paper

You, Gomes-Selman, Ying, Leskovec. "Identity-aware Graph Neural Networks." AAAI 2021. arXiv:2101.10320

"What a neural network cannot distinguish, it cannot reason about." — paraphrase of the WL theorem's lesson