Grover & Leskovec, KDD 2016 — arXiv:1607.00653

node2vec

Biased random walks with two parameters — p and q — that smoothly interpolate between homophily and structural equivalence. One paper. Two knobs. Every graph structure task.

Prerequisites: Random walks + skip-gram (Word2Vec) + dot products
10
Chapters
4+
Simulations

Chapter 0: The Problem

You're building a recommender system for a social network. You want to learn embeddings for each user such that users who are likely to interact are embedded nearby. Your tool: DeepWalk — uniform random walks on the network. You train embeddings, evaluate, and the system works. Users with common friends cluster together. Done.

Then your colleague runs the same system on a protein-protein interaction network to classify protein function. Results are terrible. DeepWalk, same algorithm, completely different performance. Why?

The answer is that these two tasks require fundamentally different notions of "similar node." In the social network, two users are similar because they live in the same community — they share friends, interests, geography. This is called homophily: birds of a feather flock together. Embeddings should capture community membership.

In the protein network, two proteins are similar because they play the same structural role in different parts of the network — both are "hub connectors" or both are "leaf proteins." This is called structural equivalence: nodes are similar if they have similar local network topology, regardless of where they are in the global graph.

The fundamental tension: Homophily wants embeddings that reflect community membership (close nodes embed together). Structural equivalence wants embeddings that reflect graph role (nodes with similar structure embed together, even if far apart). These require opposite exploration strategies. One embedding method cannot be optimal for both — unless you let the algorithm choose.
Homophily vs. Structural Equivalence

This graph has two communities (warm and teal) connected by a bridge. Nodes A and B are structurally equivalent (both are community hubs) but far apart. Click "Homophily" or "Structural" to see which nodes should embed nearby under each objective.

Select a similarity notion to highlight which nodes should embed nearby.
Two nodes are in completely different parts of a large graph, but both are "connector hubs" bridging two communities. Which similarity notion says they should embed nearby?

Chapter 1: Random Walk Framework

node2vec inherits the encoder-decoder framework from DeepWalk. Nodes are mapped to d-dimensional vectors. Similarity is measured by dot product. The objective: find embeddings such that nodes co-occurring in random walks have high dot products.

The Neighborhood NS(u)

For a node u, define its network neighborhood NS(u) as the multiset of nodes visited during random walks from u using strategy S. The objective is:

maxfu ∈ V log Pr(NS(u) | f(u))

where f: V → ℝd is the embedding function. Two simplifying assumptions (both from Word2Vec):

  1. Conditional independence: The likelihood of each neighbor node in NS(u) is independent of the others given f(u). So the product factorizes over individual terms.
  2. Symmetry in feature space: A node's embedding as source and as context (neighbor) are the same. So Pr(ni|f(u)) is modeled symmetrically.

The conditional probability of visiting node v given source node u is modeled with a softmax:

Pr(v | f(u)) = exp(f(u) · f(v)) / ∑n ∈ V exp(f(u) · f(n))

The strategy S is what makes node2vec different from DeepWalk. It controls HOW the random walk explores the graph — biased toward local or global structure. Different strategies produce different NS(u), which changes what the optimization objective forces embeddings to encode.

Why Random Walks, Not BFS/DFS Directly?

You might ask: why not directly compute BFS neighborhoods or DFS neighborhoods and use those as NS(u)? Two reasons. First, exact BFS/DFS neighborhoods can be huge — they scale with graph size. Random walks give a flexible, sampled approximation that's controllable via walk length. Second, pure BFS and DFS are the extremes. The power of node2vec is the ability to smoothly interpolate between them, finding the right balance for your task.

The key question: What walk strategy S produces a neighborhood NS(u) that, when embedded, best captures the similarity we care about? DeepWalk uses S = uniform. node2vec uses S = biased with parameters p and q.
Why do the two assumptions (conditional independence and feature space symmetry) simplify the optimization?

Chapter 2: BFS vs. DFS Neighborhoods

Consider a source node u. Two classical graph exploration strategies give very different notions of "neighborhood":

BFS: Breadth-First Sampling

Breadth-first search explores all direct neighbors of u before moving to any node two hops away. The BFS neighborhood NBFS(u) stays tightly local — it's essentially the immediate neighborhood of u, perhaps up to depth 2 or 3.

BFS neighborhoods capture homophily. Two nodes with overlapping BFS neighborhoods — sharing many neighbors — are likely in the same community. BFS-based embeddings encode community membership: users in the same social circle embed nearby.

DFS: Depth-First Sampling

Depth-first search follows one path as far as possible before backtracking. The DFS neighborhood NDFS(u) rapidly moves away from u, visiting distant nodes.

DFS neighborhoods capture structural equivalence. A long DFS walk from u visits nodes across the whole graph, exposing structural patterns. Two nodes where DFS walks have similar statistical distributions (visiting similar types of nodes in similar proportions) are likely structurally equivalent — even if they're in completely different parts of the graph.

The paper's framing: The choice of S determines which network property gets preserved in embedding space. BFS → homophily. DFS → structural equivalence. node2vec's insight: don't choose — parameterize the walk to interpolate continuously between the two.

Why Pure BFS or DFS Fails

BFS limitation

The BFS neighborhood of u is exactly the same if u is the center of a star or the center of a clique — both have the same set of immediate neighbors. BFS cannot distinguish structural roles beyond degree.

DFS limitation

DFS is high-variance. A single depth-first walk from u might take completely different paths on different runs. It's inefficient for capturing local community structure, where you need to densely sample the neighborhood.

BFS vs. DFS Neighborhood Visualization

Click a node, then toggle BFS/DFS to see which nodes get included in its neighborhood (highlighted). Notice how BFS stays local and DFS reaches far.

Click a node, then select BFS or DFS to see its neighborhood.
For a node classification task where community membership determines the label (e.g., political affiliation in a social network), which exploration strategy should dominate?

Chapter 3: The p,q Parameters — Walk Explorer

node2vec introduces a second-order random walk. Unlike a standard random walk that only remembers the current node, node2vec's walk remembers the previous node too. This one extra bit of memory is enough to implement a continuous BFS–DFS tradeoff.

The Transition Probabilities

Suppose the walk has just traversed edge (t → v) — it came from node t and is now at node v. It must choose where to go next. For each neighbor x of v, the unnormalized transition weight αpq(t, x) is:

αpq(t, x) = 1/p   if dtx = 0    (x = t, returning)
1        if dtx = 1    (x is neighbor of both t and v)
1/q   if dtx = 2    (x is farther from t than v was)

where dtx is the shortest-path distance from t to x. These unnormalized weights are then divided by their sum to give proper transition probabilities.

Intuition for p (Return Parameter)

p controls the likelihood of revisiting a node you just came from. High p: returning to t is very unlikely (1/p is small) — the walk prefers to explore new territory. Low p: returning to t is very likely — the walk tends to backtrack, staying local. Think of p as the "backtracking penalty."

Intuition for q (In-Out Parameter)

q controls whether the walk explores inward (toward t) or outward (away from t). Low q: 1/q is large — moving farther from t is preferred — the walk explores outward, DFS-style. High q: 1/q is small — moving farther from t is discouraged — the walk prefers nodes near t, BFS-style. Think of q as "how adventurous the walk is."

The sweet spot: p=1, q=1 → uniform walk (same as DeepWalk). p<1, q>1 → BFS-like, captures homophily. p>1, q<1 → DFS-like, captures structural equivalence. The paper searches p,q in {0.25, 0.5, 1, 2, 4} and finds the best pair for each task via cross-validation.
node2vec Walk Explorer — p & q Sliders ★ SHOWCASE

Select a start node, adjust p and q, then run walks. The heatmap shows which nodes get visited most. Low q = DFS (explores far, blue halo). Low p = BFS (stays local, orange halo). The walk's "memory" of the previous node drives all the magic.

p (return) 1.0
q (in-out) 1.0
Select a start node (click on graph), adjust p and q, then run walks.
After traversing edge (t→v), node2vec considers moving to neighbor x with d(t,x) = 2. What transition weight is assigned?

Chapter 4: Skip-gram Objective

Once walks are generated, node2vec uses the skip-gram objective from Word2Vec. The walk is treated as a "sentence": a sequence of node IDs. The skip-gram objective asks the model to predict the context (surrounding nodes within a window) given a source node.

Formally

Given a walk w1, w2, ..., wL, for each node wi, predict the nodes wi-k, ..., wi-1, wi+1, ..., wi+k in a window of size k. The objective maximizes:

i=1L-k ≤ j ≤ k, j≠0 log Pr(wi+j | f(wi))

The conditional probability uses the softmax:

Pr(wi+j | f(wi)) = exp(f(wi) · f(wi+j)) / ∑v ∈ V exp(f(wi) · f(v))

What Is Each Element?

SymbolMeaningData Flow
f(u)Embedding of node uColumn u of learnable matrix Z ∈ ℝd×|V|
f(u) · f(v)Dot product similarityd · ℝd → scalar
Softmax denominatorPartition function over all nodes|V| dot products per training step
Window kContext sizeTypically k = 10 for node2vec
Walk length LSteps per random walkTypically L = 80
Walks per node rWalk count per source nodeTypically r = 10

The Walk Generation Strategy

Before optimization, node2vec pre-generates all random walks. This is more efficient than sampling on-the-fly during training. The pipeline:

python
def node2vec_train(G, p, q, d=128, r=10, l=80, k=10):
    # Precompute transition probabilities (second-order Markov)
    probs = precompute_transition_probs(G, p, q)
    # probs[(t, v)] = dict {neighbor: weight} for each edge (t,v)

    # Generate walk corpus
    walks = []
    for _ in range(r):               # r walks per node
        for u in G.nodes():
            walk = biased_walk(G, probs, u, l)  # length l
            walks.append(walk)

    # Train skip-gram (Word2Vec) on walk corpus
    model = Word2Vec(walks, vector_size=d, window=k,
                      sg=1, min_count=0, workers=4)
    return {u: model.wv[u] for u in G.nodes()}
    # Returns dict: node_id -> np.array of shape (d,)
Precomputing transitions is the key optimization. The second-order Markov transitions depend only on the edge (t, v), not on the full walk history. So you can precompute, for every directed edge (t, v) in G, the transition probabilities from v given you came from t. This is O(|E|) space and time upfront — then sampling walks is fast.
With walk length L=80, walks per node r=10, and window k=10, how many training pairs does node2vec generate for a single node u?

Chapter 5: Negative Sampling

The softmax denominator ∑v∈V exp(f(u)·f(v)) is the core computational bottleneck. For a graph with a million nodes, computing this sum requires a million dot products — for every (u, context) pair in every training step. This is completely intractable.

The Noise-Contrastive Estimation (NCE) Idea

Instead of normalizing over all V nodes, reformulate as a binary classification problem: given node u and a candidate node v, is v a true context of u (from the walk), or is it noise (a random node)? This is noise-contrastive estimation.

Mikolov et al. showed that a simplified version — negative sampling — works well in practice. For each positive (u, v) pair from the walk, sample k negative nodes n1, ..., nk from a noise distribution Pn. The loss becomes:

L = −log σ(f(v) · f(u)) − ∑i=1k Eni ~ Pn[log σ(−f(ni) · f(u))]

This has two terms:

  1. −log σ(f(v)·f(u)): Maximize the dot product between true co-occurring pairs. Pull their embeddings together.
  2. −log σ(−f(n)·f(u)): Minimize the dot product with negative samples. Push those embeddings apart.

The Noise Distribution

node2vec follows Word2Vec: sample negative nodes with probability proportional to degree3/4. Why 3/4 and not uniform? Uniform would under-sample high-degree nodes (hubs), which are the most informative negatives — they appear everywhere in the graph. Degree3/4 biases toward frequent nodes without making them completely dominate:

Pn(v) = degree(v)3/4 / ∑u degree(u)3/4

Effect of k on Quality

k (negatives)Training speedEmbedding qualityTypical use
1–2Very fastLowerHuge graphs (>100M nodes)
5–10FastGoodnode2vec default
15–20ModerateVery goodSmall graphs with sparse edges
>20SlowDiminishing returnsRarely used
Why not exact softmax? Consider a graph with |V| = 106 nodes. With k=10 negatives, each training step costs 11 dot products. With exact softmax, it's 106 dot products. That's a 100,000x speedup. At 108 training pairs, this is the difference between an afternoon and a decade.
python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(f_u, f_v, Z, degrees, k=10):
    # f_u, f_v: embeddings of source and context node, shape (d,)
    # Z: all embeddings, shape (|V|, d)
    # degrees: node degree array, shape (|V|,)

    # Positive term
    pos_score = f_u @ f_v                          # scalar
    pos_loss = -np.log(sigmoid(pos_score) + 1e-8)

    # Negative sampling distribution: degree^0.75
    probs = degrees ** 0.75
    probs /= probs.sum()
    neg_idx = np.random.choice(len(degrees), size=k, p=probs)

    # Negative terms
    neg_scores = Z[neg_idx] @ f_u                  # shape (k,)
    neg_loss = -np.sum(np.log(sigmoid(-neg_scores) + 1e-8))

    return pos_loss + neg_loss / k
Why does node2vec sample negatives with probability proportional to degree3/4 rather than uniform?

Chapter 6: Link Prediction with Embeddings

Node embeddings zu were designed for nodes. But many tasks are about pairs of nodes — specifically, link prediction: given nodes u and v, does an edge exist between them (or will one form in the future)?

The trick: define an edge embedding using a binary operator on the two node embeddings. node2vec evaluates four operators:

OperatorFormulaIntuition
Hadamardzu ⊙ zvElement-wise product; captures feature co-occurrence
Average(zu + zv) / 2Midpoint in embedding space
Weighted-L1|zu − zv|Element-wise absolute difference
Weighted-L2|zu − zv|2Squared element-wise difference

Each operator produces a d-dimensional vector for each edge. This vector is then fed into a logistic regression classifier trained on known edges (positive examples) and non-edges (negative examples).

Training Setup

The link prediction setup requires careful handling to avoid data leakage:

  1. Split edges into train/test sets. Remove test edges from the graph before generating walks.
  2. Generate node embeddings using node2vec on the training graph only.
  3. For each edge in train set and an equal number of non-edges, compute edge embedding using one of the four operators.
  4. Train logistic regression on these edge embeddings.
  5. Evaluate on test edges + equal non-edges using AUC-ROC.
The Hadamard operator wins. Across most datasets in the paper, the Hadamard product (element-wise multiplication) outperforms the other operators for link prediction. The intuition: if both zu[i] and zv[i] are large in dimension i, then dimension i represents a feature both nodes share — exactly what you'd expect for nodes that should be connected. The product amplifies agreement in each dimension.
python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import numpy as np

def link_prediction(embeddings, train_edges, train_nonedges,
                    test_edges, test_nonedges, operator='hadamard'):
    def edge_emb(u, v):
        zu, zv = embeddings[u], embeddings[v]
        if operator == 'hadamard': return zu * zv
        if operator == 'avg':      return (zu + zv) / 2
        if operator == 'l1':       return np.abs(zu - zv)
        if operator == 'l2':       return (zu - zv) ** 2

    # Build training features and labels
    X_train = [edge_emb(u, v) for u, v in train_edges + train_nonedges]
    y_train = [1] * len(train_edges) + [0] * len(train_nonedges)

    clf = LogisticRegression().fit(X_train, y_train)

    # Evaluate on test set
    X_test = [edge_emb(u, v) for u, v in test_edges + test_nonedges]
    y_test  = [1] * len(test_edges) + [0] * len(test_nonedges)
    scores = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores)
When using node2vec embeddings for link prediction, why must you remove test edges from the graph BEFORE generating random walks?

Chapter 7: Results

The paper evaluates node2vec on two tasks: multi-label node classification and link prediction. The key result: node2vec outperforms all baselines on both tasks, on all datasets. And the p,q parameters are the reason.

Node Classification Datasets

DatasetNodesEdgesLabelsTask
BlogCatalog10,312333,98339Social interest groups
Protein-Protein (PPI)3,89076,58450Biological states of proteins
Wikipedia4,777184,81240Part-of-speech labels

Node Classification Results (F1 Score)

MethodBlogCatalogPPIWikipedia
Spectral Clustering0.04050.06810.1164
DeepWalk0.21100.17680.1839
LINE0.07840.14470.1244
node2vec0.25810.20330.1989

node2vec improves over DeepWalk by 22% on BlogCatalog (a social network — homophily) and 15% on PPI (a protein network — structural equivalence). The improvement comes from tuning p and q.

Link Prediction (AUC-ROC)

DatasetSpectralDeepWalkLINEnode2vec
Facebook0.96250.96800.94900.9980
PPI0.94290.95680.96230.9760
arXiv (astro)0.96070.97860.97300.9996
arXiv (hep)0.95150.97610.96680.9905

On Facebook (social network), node2vec achieves 0.9980 AUC — near-perfect link prediction. On PPI (protein interaction), it reaches 0.9760. Across all four datasets, node2vec achieves the highest AUC, with the Hadamard operator performing best.

The ablation confirms it: The paper shows that removing the p,q parameterization (setting p=q=1, reducing to DeepWalk) consistently hurts performance. The gap is especially large on BlogCatalog (+22%) — a social network where homophily matters and low-p BFS walks give the right neighborhood definition.
Results Comparison

Bar chart of F1 scores for node classification across methods. node2vec consistently wins — by a larger margin on social networks (homophily dominant) than protein networks (structural equivalence dominant).

node2vec improves over DeepWalk by ~22% on BlogCatalog but only ~15% on PPI. Why might the social network benefit more from p,q tuning?

Chapter 8: When to Use Which Settings

By now you understand the theory. But in practice, how do you choose p and q for a new graph? Here's a principled guide based on task type, graph structure, and what the literature shows.

Task-Based Guidelines

TaskSignal TypeRecommendedIntuition
Community detectionHomophilyLow p, high q (BFS-like)Dense local sampling captures community membership
Role discoveryStructural equivalenceHigh p, low q (DFS-like)Outward exploration exposes structural patterns
Link prediction (social)HomophilyLow p, moderate qFriends-of-friends are most likely future friends
Link prediction (biology)MixedGrid search p,q ∈ {0.25, 0.5, 1, 2, 4}Protein interactions can be structural or community
UnknownUnknownp=1, q=1 (DeepWalk)Reasonable default; then tune if needed

Graph Structure Signals

You can diagnose what type of similarity is dominant before training by examining graph structure:

Hyperparameter Search Protocol

The paper uses a simple grid search: p ∈ {0.25, 0.5, 1, 2, 4}, q ∈ {0.25, 0.5, 1, 2, 4}. That's 25 combinations. Train node2vec once for each combination (walks only need to be regenerated, not the optimization from scratch for each). Evaluate on a 10% held-out node validation set. Pick the best p, q.

Practical tip: Walk generation is the bottleneck for large graphs, not training. If you have |V| = 106 nodes, generating walks 25 times is expensive. Instead, generate a large walk corpus once and then train with each p,q by reweighting the samples — though this only works approximately. For medium graphs (<100k nodes), full grid search is fast (<30 minutes total).

When NOT to Use node2vec

You're studying protein-protein interaction networks and want to find proteins that play the same structural role (e.g., "hub connector") across different organisms. Which p,q setting should you use?

Chapter 9: Connections & Legacy

node2vec sits at the intersection of random walk theory, Word2Vec, and network science. Understanding its place in the literature explains both why it worked and where to go next.

Predecessors

Spectral Embeddings (2001–2003)
Belkin & Niyogi (Laplacian Eigenmaps), Roweis & Saul (LLE). Embed by preserving local geometry. Matrix factorization of graph Laplacian. Transductive, no features, expensive.
↓ replaced by stochastic sampling
DeepWalk (2014)
Perozzi et al. Random walks + skip-gram. First to show NLP tools (Word2Vec) work on graphs. Uniform walks — no control over exploration.
↓ extended by biased walks
node2vec (2016)
Grover & Leskovec. Add p, q parameters. Smoothly interpolate BFS–DFS. Enable both homophily and structural equivalence capture.
↓ superseded by learned aggregation
GNNs: GCN, GraphSAGE, GAT (2016–2018)
Kipf & Welling, Hamilton et al., Veličković et al. Replace lookup tables with learned aggregators. Inductive, feature-aware, generalizes to unseen nodes.

What node2vec Got Right (That's Still True)

What node2vec Got Wrong (Or Left Unresolved)

Citation Impact

node2vec has over 14,000 citations (as of 2024). It is one of the most cited graph ML papers ever, and serves as the standard baseline for any node classification or link prediction task. "Did you beat node2vec?" is the minimum bar in graph ML papers.

MethodYearKey Advance Over node2vec
GraphSAGE2017Inductive; aggregates neighbor features not just IDs
GAT2018Attention-weighted aggregation; learns importance of neighbors
Graph Transformer2020+Global attention; no local neighborhood assumption
SEAL2018Subgraph-based link prediction; beats node2vec on all link tasks
METIS + node2vecVariousCluster-then-embed for scalability on billion-node graphs
"The key is to be able to control the kind of communities you're trying to learn — and that's exactly what p and q give you." — Jure Leskovec

Paper: Grover & Leskovec, "node2vec: Scalable Feature Learning for Networks," KDD 2016.
arXiv: 1607.00653  |  Code: github.com/aditya-grover/node2vec