Biased random walks with two parameters — p and q — that smoothly interpolate between homophily and structural equivalence. One paper. Two knobs. Every graph structure task.
You're building a recommender system for a social network. You want to learn embeddings for each user such that users who are likely to interact are embedded nearby. Your tool: DeepWalk — uniform random walks on the network. You train embeddings, evaluate, and the system works. Users with common friends cluster together. Done.
Then your colleague runs the same system on a protein-protein interaction network to classify protein function. Results are terrible. DeepWalk, same algorithm, completely different performance. Why?
The answer is that these two tasks require fundamentally different notions of "similar node." In the social network, two users are similar because they live in the same community — they share friends, interests, geography. This is called homophily: birds of a feather flock together. Embeddings should capture community membership.
In the protein network, two proteins are similar because they play the same structural role in different parts of the network — both are "hub connectors" or both are "leaf proteins." This is called structural equivalence: nodes are similar if they have similar local network topology, regardless of where they are in the global graph.
This graph has two communities (warm and teal) connected by a bridge. Nodes A and B are structurally equivalent (both are community hubs) but far apart. Click "Homophily" or "Structural" to see which nodes should embed nearby under each objective.
node2vec inherits the encoder-decoder framework from DeepWalk. Nodes are mapped to d-dimensional vectors. Similarity is measured by dot product. The objective: find embeddings such that nodes co-occurring in random walks have high dot products.
For a node u, define its network neighborhood NS(u) as the multiset of nodes visited during random walks from u using strategy S. The objective is:
where f: V → ℝd is the embedding function. Two simplifying assumptions (both from Word2Vec):
The conditional probability of visiting node v given source node u is modeled with a softmax:
The strategy S is what makes node2vec different from DeepWalk. It controls HOW the random walk explores the graph — biased toward local or global structure. Different strategies produce different NS(u), which changes what the optimization objective forces embeddings to encode.
You might ask: why not directly compute BFS neighborhoods or DFS neighborhoods and use those as NS(u)? Two reasons. First, exact BFS/DFS neighborhoods can be huge — they scale with graph size. Random walks give a flexible, sampled approximation that's controllable via walk length. Second, pure BFS and DFS are the extremes. The power of node2vec is the ability to smoothly interpolate between them, finding the right balance for your task.
Consider a source node u. Two classical graph exploration strategies give very different notions of "neighborhood":
Breadth-first search explores all direct neighbors of u before moving to any node two hops away. The BFS neighborhood NBFS(u) stays tightly local — it's essentially the immediate neighborhood of u, perhaps up to depth 2 or 3.
BFS neighborhoods capture homophily. Two nodes with overlapping BFS neighborhoods — sharing many neighbors — are likely in the same community. BFS-based embeddings encode community membership: users in the same social circle embed nearby.
Depth-first search follows one path as far as possible before backtracking. The DFS neighborhood NDFS(u) rapidly moves away from u, visiting distant nodes.
DFS neighborhoods capture structural equivalence. A long DFS walk from u visits nodes across the whole graph, exposing structural patterns. Two nodes where DFS walks have similar statistical distributions (visiting similar types of nodes in similar proportions) are likely structurally equivalent — even if they're in completely different parts of the graph.
The BFS neighborhood of u is exactly the same if u is the center of a star or the center of a clique — both have the same set of immediate neighbors. BFS cannot distinguish structural roles beyond degree.
DFS is high-variance. A single depth-first walk from u might take completely different paths on different runs. It's inefficient for capturing local community structure, where you need to densely sample the neighborhood.
Click a node, then toggle BFS/DFS to see which nodes get included in its neighborhood (highlighted). Notice how BFS stays local and DFS reaches far.
node2vec introduces a second-order random walk. Unlike a standard random walk that only remembers the current node, node2vec's walk remembers the previous node too. This one extra bit of memory is enough to implement a continuous BFS–DFS tradeoff.
Suppose the walk has just traversed edge (t → v) — it came from node t and is now at node v. It must choose where to go next. For each neighbor x of v, the unnormalized transition weight αpq(t, x) is:
where dtx is the shortest-path distance from t to x. These unnormalized weights are then divided by their sum to give proper transition probabilities.
p controls the likelihood of revisiting a node you just came from. High p: returning to t is very unlikely (1/p is small) — the walk prefers to explore new territory. Low p: returning to t is very likely — the walk tends to backtrack, staying local. Think of p as the "backtracking penalty."
q controls whether the walk explores inward (toward t) or outward (away from t). Low q: 1/q is large — moving farther from t is preferred — the walk explores outward, DFS-style. High q: 1/q is small — moving farther from t is discouraged — the walk prefers nodes near t, BFS-style. Think of q as "how adventurous the walk is."
Select a start node, adjust p and q, then run walks. The heatmap shows which nodes get visited most. Low q = DFS (explores far, blue halo). Low p = BFS (stays local, orange halo). The walk's "memory" of the previous node drives all the magic.
Once walks are generated, node2vec uses the skip-gram objective from Word2Vec. The walk is treated as a "sentence": a sequence of node IDs. The skip-gram objective asks the model to predict the context (surrounding nodes within a window) given a source node.
Given a walk w1, w2, ..., wL, for each node wi, predict the nodes wi-k, ..., wi-1, wi+1, ..., wi+k in a window of size k. The objective maximizes:
The conditional probability uses the softmax:
| Symbol | Meaning | Data Flow |
|---|---|---|
| f(u) | Embedding of node u | Column u of learnable matrix Z ∈ ℝd×|V| |
| f(u) · f(v) | Dot product similarity | ℝd · ℝd → scalar |
| Softmax denominator | Partition function over all nodes | |V| dot products per training step |
| Window k | Context size | Typically k = 10 for node2vec |
| Walk length L | Steps per random walk | Typically L = 80 |
| Walks per node r | Walk count per source node | Typically r = 10 |
Before optimization, node2vec pre-generates all random walks. This is more efficient than sampling on-the-fly during training. The pipeline:
python def node2vec_train(G, p, q, d=128, r=10, l=80, k=10): # Precompute transition probabilities (second-order Markov) probs = precompute_transition_probs(G, p, q) # probs[(t, v)] = dict {neighbor: weight} for each edge (t,v) # Generate walk corpus walks = [] for _ in range(r): # r walks per node for u in G.nodes(): walk = biased_walk(G, probs, u, l) # length l walks.append(walk) # Train skip-gram (Word2Vec) on walk corpus model = Word2Vec(walks, vector_size=d, window=k, sg=1, min_count=0, workers=4) return {u: model.wv[u] for u in G.nodes()} # Returns dict: node_id -> np.array of shape (d,)
The softmax denominator ∑v∈V exp(f(u)·f(v)) is the core computational bottleneck. For a graph with a million nodes, computing this sum requires a million dot products — for every (u, context) pair in every training step. This is completely intractable.
Instead of normalizing over all V nodes, reformulate as a binary classification problem: given node u and a candidate node v, is v a true context of u (from the walk), or is it noise (a random node)? This is noise-contrastive estimation.
Mikolov et al. showed that a simplified version — negative sampling — works well in practice. For each positive (u, v) pair from the walk, sample k negative nodes n1, ..., nk from a noise distribution Pn. The loss becomes:
This has two terms:
node2vec follows Word2Vec: sample negative nodes with probability proportional to degree3/4. Why 3/4 and not uniform? Uniform would under-sample high-degree nodes (hubs), which are the most informative negatives — they appear everywhere in the graph. Degree3/4 biases toward frequent nodes without making them completely dominate:
| k (negatives) | Training speed | Embedding quality | Typical use |
|---|---|---|---|
| 1–2 | Very fast | Lower | Huge graphs (>100M nodes) |
| 5–10 | Fast | Good | node2vec default |
| 15–20 | Moderate | Very good | Small graphs with sparse edges |
| >20 | Slow | Diminishing returns | Rarely used |
python import numpy as np def sigmoid(x): return 1.0 / (1.0 + np.exp(-x)) def neg_sample_loss(f_u, f_v, Z, degrees, k=10): # f_u, f_v: embeddings of source and context node, shape (d,) # Z: all embeddings, shape (|V|, d) # degrees: node degree array, shape (|V|,) # Positive term pos_score = f_u @ f_v # scalar pos_loss = -np.log(sigmoid(pos_score) + 1e-8) # Negative sampling distribution: degree^0.75 probs = degrees ** 0.75 probs /= probs.sum() neg_idx = np.random.choice(len(degrees), size=k, p=probs) # Negative terms neg_scores = Z[neg_idx] @ f_u # shape (k,) neg_loss = -np.sum(np.log(sigmoid(-neg_scores) + 1e-8)) return pos_loss + neg_loss / k
Node embeddings zu were designed for nodes. But many tasks are about pairs of nodes — specifically, link prediction: given nodes u and v, does an edge exist between them (or will one form in the future)?
The trick: define an edge embedding using a binary operator on the two node embeddings. node2vec evaluates four operators:
| Operator | Formula | Intuition |
|---|---|---|
| Hadamard | zu ⊙ zv | Element-wise product; captures feature co-occurrence |
| Average | (zu + zv) / 2 | Midpoint in embedding space |
| Weighted-L1 | |zu − zv| | Element-wise absolute difference |
| Weighted-L2 | |zu − zv|2 | Squared element-wise difference |
Each operator produces a d-dimensional vector for each edge. This vector is then fed into a logistic regression classifier trained on known edges (positive examples) and non-edges (negative examples).
The link prediction setup requires careful handling to avoid data leakage:
python from sklearn.linear_model import LogisticRegression from sklearn.metrics import roc_auc_score import numpy as np def link_prediction(embeddings, train_edges, train_nonedges, test_edges, test_nonedges, operator='hadamard'): def edge_emb(u, v): zu, zv = embeddings[u], embeddings[v] if operator == 'hadamard': return zu * zv if operator == 'avg': return (zu + zv) / 2 if operator == 'l1': return np.abs(zu - zv) if operator == 'l2': return (zu - zv) ** 2 # Build training features and labels X_train = [edge_emb(u, v) for u, v in train_edges + train_nonedges] y_train = [1] * len(train_edges) + [0] * len(train_nonedges) clf = LogisticRegression().fit(X_train, y_train) # Evaluate on test set X_test = [edge_emb(u, v) for u, v in test_edges + test_nonedges] y_test = [1] * len(test_edges) + [0] * len(test_nonedges) scores = clf.predict_proba(X_test)[:, 1] return roc_auc_score(y_test, scores)
The paper evaluates node2vec on two tasks: multi-label node classification and link prediction. The key result: node2vec outperforms all baselines on both tasks, on all datasets. And the p,q parameters are the reason.
| Dataset | Nodes | Edges | Labels | Task |
|---|---|---|---|---|
| BlogCatalog | 10,312 | 333,983 | 39 | Social interest groups |
| Protein-Protein (PPI) | 3,890 | 76,584 | 50 | Biological states of proteins |
| Wikipedia | 4,777 | 184,812 | 40 | Part-of-speech labels |
| Method | BlogCatalog | PPI | Wikipedia |
|---|---|---|---|
| Spectral Clustering | 0.0405 | 0.0681 | 0.1164 |
| DeepWalk | 0.2110 | 0.1768 | 0.1839 |
| LINE | 0.0784 | 0.1447 | 0.1244 |
| node2vec | 0.2581 | 0.2033 | 0.1989 |
node2vec improves over DeepWalk by 22% on BlogCatalog (a social network — homophily) and 15% on PPI (a protein network — structural equivalence). The improvement comes from tuning p and q.
| Dataset | Spectral | DeepWalk | LINE | node2vec |
|---|---|---|---|---|
| 0.9625 | 0.9680 | 0.9490 | 0.9980 | |
| PPI | 0.9429 | 0.9568 | 0.9623 | 0.9760 |
| arXiv (astro) | 0.9607 | 0.9786 | 0.9730 | 0.9996 |
| arXiv (hep) | 0.9515 | 0.9761 | 0.9668 | 0.9905 |
On Facebook (social network), node2vec achieves 0.9980 AUC — near-perfect link prediction. On PPI (protein interaction), it reaches 0.9760. Across all four datasets, node2vec achieves the highest AUC, with the Hadamard operator performing best.
Bar chart of F1 scores for node classification across methods. node2vec consistently wins — by a larger margin on social networks (homophily dominant) than protein networks (structural equivalence dominant).
By now you understand the theory. But in practice, how do you choose p and q for a new graph? Here's a principled guide based on task type, graph structure, and what the literature shows.
| Task | Signal Type | Recommended | Intuition |
|---|---|---|---|
| Community detection | Homophily | Low p, high q (BFS-like) | Dense local sampling captures community membership |
| Role discovery | Structural equivalence | High p, low q (DFS-like) | Outward exploration exposes structural patterns |
| Link prediction (social) | Homophily | Low p, moderate q | Friends-of-friends are most likely future friends |
| Link prediction (biology) | Mixed | Grid search p,q ∈ {0.25, 0.5, 1, 2, 4} | Protein interactions can be structural or community |
| Unknown | Unknown | p=1, q=1 (DeepWalk) | Reasonable default; then tune if needed |
You can diagnose what type of similarity is dominant before training by examining graph structure:
The paper uses a simple grid search: p ∈ {0.25, 0.5, 1, 2, 4}, q ∈ {0.25, 0.5, 1, 2, 4}. That's 25 combinations. Train node2vec once for each combination (walks only need to be regenerated, not the optimization from scratch for each). Evaluate on a 10% held-out node validation set. Pick the best p, q.
node2vec sits at the intersection of random walk theory, Word2Vec, and network science. Understanding its place in the literature explains both why it worked and where to go next.
node2vec has over 14,000 citations (as of 2024). It is one of the most cited graph ML papers ever, and serves as the standard baseline for any node classification or link prediction task. "Did you beat node2vec?" is the minimum bar in graph ML papers.
| Method | Year | Key Advance Over node2vec |
|---|---|---|
| GraphSAGE | 2017 | Inductive; aggregates neighbor features not just IDs |
| GAT | 2018 | Attention-weighted aggregation; learns importance of neighbors |
| Graph Transformer | 2020+ | Global attention; no local neighborhood assumption |
| SEAL | 2018 | Subgraph-based link prediction; beats node2vec on all link tasks |
| METIS + node2vec | Various | Cluster-then-embed for scalability on billion-node graphs |
"The key is to be able to control the kind of communities you're trying to learn — and that's exactly what p and q give you." — Jure Leskovec
Paper: Grover & Leskovec, "node2vec: Scalable Feature Learning for Networks," KDD 2016.
arXiv: 1607.00653 |
Code: github.com/aditya-grover/node2vec